Create YOLO v2 Object Detection Network

This example shows how to modify a pretrained MobileNet v2 network to create a YOLO v2 object detection network. This approach offers additional flexibility compared to the yolov2Layers function, which returns a canonical YOLO v2 object detector.

The procedure to convert a pretrained network into a YOLO v2 network is similar to the transfer learning procedure for image classification:

  1. Load the pretrained network.

  2. Select a layer from the pretrained network to use for feature extraction.

  3. Remove all the layers after the feature extraction layer.

  4. Add new layers to support the object detection task.

You can also implement this procedure interactively using the Deep Network Designer app.

Load Pretrained Network

Load a pretrained MobileNet v2 network using mobilenetv2. Using this function requires the Deep Learning Toolbox™ Model for MobileNet v2 Network support package.

% Load a pretrained network.
net = mobilenetv2();

% Convert network into a layer graph object
% in order to manipulate the layers.
lgraph = layerGraph(net);

Update Network Image Size

Change the image size of the network based on the training data requirements. To illustrate this step, assume the required image size is [300 300 3] for RGB images.

% Input size for detector.
imageInputSize = [300 300 3];

% Create new image input layer. Set the new layer name
% to the original layer name.
imgLayer = imageInputLayer(imageInputSize,"Name","input_1")
imgLayer = 
  ImageInputLayer with properties:

                      Name: 'input_1'
                 InputSize: [300 300 3]

   Hyperparameters
          DataAugmentation: 'none'
             Normalization: 'zerocenter'
    NormalizationDimension: 'auto'
                      Mean: []

% Replace old image input layer.
lgraph = replaceLayer(lgraph,"input_1",imgLayer);

Select Feature Extraction Layer

A good feature extraction layer for YOLO v2 is one whose output feature map width and height are between 8 and 16 times smaller than the input image. This amount of downsampling is a trade-off between spatial resolution and the quality of the output features. Use the analyzeNetwork function or the Deep Network Designer app to determine the output sizes of the layers within a network. Note that selecting an optimal feature extraction layer requires empirical evaluation.

Set the feature extraction layer to "block_12_add" from MobileNet v2. Because the input size was previously set to [300 300 3], the output feature map size is [19 19]. This results in a downsampling factor of about 16.

featureExtractionLayer = "block_12_add";
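As a quick sanity check, you can compute the downsampling factor directly from the input and feature map sizes noted above (the [19 19] feature size comes from analyzing the network with a [300 300 3] input):

```matlab
% Compute the downsampling factor of the chosen feature extraction layer.
inputSize = [300 300];    % spatial size of the network input
featureSize = [19 19];    % output size of "block_12_add" for this input
downsamplingFactor = inputSize./featureSize
```

The result is approximately 15.8 in each dimension, which falls within the recommended 8-to-16 range.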

Remove Layers After Feature Extraction Layer

To easily remove layers from a deep network, such as MobileNet v2, use the Deep Network Designer app. Import the network into the app, manually remove the layers after "block_12_add", and export the modified network to your workspace. This example uses a presaved version of MobileNet v2 that was exported from the app.

% Load a network modified using Deep Network Designer.
modified = load("mobilenetv2Block12Add.mat");
lgraph = modified.mobilenetv2Block12Add;

Alternatively, if you have a list of layers to remove, you can use the removeLayers function to remove them manually.
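For example, one way to build that list programmatically is sketched below. This sketch assumes that the Layers array of the layer graph lists layers in topological order, so every layer listed after "block_12_add" lies downstream of it; this holds for MobileNet v2 but is worth verifying for other networks.

```matlab
% Sketch: remove all layers after the feature extraction layer.
lgraph = layerGraph(net);
layerNames = {lgraph.Layers.Name};
idx = find(strcmp(layerNames,"block_12_add"));
% Assumes layers after idx in the array are downstream of "block_12_add".
layersToRemove = layerNames(idx+1:end);
lgraph = removeLayers(lgraph,layersToRemove);
```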

Create YOLO v2 Detection Sub-Network

The detection subnetwork consists of groups of serially connected convolution, ReLU, and batch normalization layers. These layers are followed by a yolov2TransformLayer and a yolov2OutputLayer.

Create the convolution, ReLU, and batch normalization portion of the detection sub-network.

% Set the convolution layer filter size to [3 3].
% This size is common in CNN architectures. 
filterSize = [3 3];

% Set the number of filters in the convolution layers
% to match the number of channels in the
% feature extraction layer output.
numFilters = 96;

% Create the detection subnetwork.
% * The convolution layer uses "same" padding
%   to preserve the input size.
detectionLayers = [
    % group 1
    convolution2dLayer(filterSize,numFilters,"Name","yolov2Conv1",...
    "Padding", "same", "WeightsInitializer",@(sz)randn(sz)*0.01)
    batchNormalizationLayer("Name","yolov2Batch1");
    reluLayer("Name","yolov2Relu1");
    
    % group 2
    convolution2dLayer(filterSize,numFilters,"Name","yolov2Conv2",...
    "Padding", "same", "WeightsInitializer",@(sz)randn(sz)*0.01)
    batchNormalizationLayer("Name","yolov2Batch2");
    reluLayer("Name","yolov2Relu2");
    ]
detectionLayers = 
  6x1 Layer array with layers:

     1   'yolov2Conv1'    Convolution           96 3x3 convolutions with stride [1  1] and padding 'same'
     2   'yolov2Batch1'   Batch Normalization   Batch normalization
     3   'yolov2Relu1'    ReLU                  ReLU
     4   'yolov2Conv2'    Convolution           96 3x3 convolutions with stride [1  1] and padding 'same'
     5   'yolov2Batch2'   Batch Normalization   Batch normalization
     6   'yolov2Relu2'    ReLU                  ReLU

The remaining layers are configured based on application-specific details, such as the number of object classes and the anchor boxes.

% Define the number of classes to detect.
numClasses = 5;

% Define the anchor boxes.
anchorBoxes = [
    16 16
    32 16
    ];

% Number of anchor boxes.
numAnchors = size(anchorBoxes,1);

% There are five predictions per anchor box: 
%  * Predict the x, y, width, and height offset
%    for each anchor.
%  * Predict the intersection-over-union with ground
%    truth boxes.
numPredictionsPerAnchor = 5;

% Number of filters in last convolution layer.
outputSize = numAnchors*(numClasses+numPredictionsPerAnchor);
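The anchor boxes above are hand-picked for illustration. In practice, you can estimate anchor boxes from your training data instead. The sketch below assumes a boxLabelDatastore named blds containing the ground truth boxes for your data set, which is not defined in this example:

```matlab
% Sketch: estimate anchor boxes from labeled training data.
% "blds" is a hypothetical boxLabelDatastore of ground truth boxes.
numAnchors = 2;
[anchorBoxes,meanIoU] = estimateAnchorBoxes(blds,numAnchors);
```

The returned meanIoU indicates how well the estimated anchors overlap the ground truth boxes; increasing numAnchors typically improves it at the cost of a larger final convolution layer.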

Create the final convolution2dLayer, yolov2TransformLayer, and yolov2OutputLayer layers.

% Final layers in detection sub-network.
finalLayers = [
    convolution2dLayer(1,outputSize,"Name","yolov2ClassConv",...
    "WeightsInitializer", @(sz)randn(sz)*0.01)
    yolov2TransformLayer(numAnchors,"Name","yolov2Transform")
    yolov2OutputLayer(anchorBoxes,"Name","yolov2OutputLayer")
    ];

Add the last layers to the network.

% Add the last layers to network.
detectionLayers = [
    detectionLayers
    finalLayers
    ]
detectionLayers = 
  9x1 Layer array with layers:

     1   'yolov2Conv1'         Convolution               96 3x3 convolutions with stride [1  1] and padding 'same'
     2   'yolov2Batch1'        Batch Normalization       Batch normalization
     3   'yolov2Relu1'         ReLU                      ReLU
     4   'yolov2Conv2'         Convolution               96 3x3 convolutions with stride [1  1] and padding 'same'
     5   'yolov2Batch2'        Batch Normalization       Batch normalization
     6   'yolov2Relu2'         ReLU                      ReLU
     7   'yolov2ClassConv'     Convolution               20 1x1 convolutions with stride [1  1] and padding [0  0  0  0]
     8   'yolov2Transform'     YOLO v2 Transform Layer   YOLO v2 Transform Layer with 2 anchors
     9   'yolov2OutputLayer'   YOLO v2 Output            YOLO v2 Output with 2 anchors

Complete YOLO v2 Detection Network

Attach the detection subnetwork to the feature extraction network.

% Add the detection subnetwork to the feature extraction network.
lgraph = addLayers(lgraph,detectionLayers);

% Connect the detection subnetwork to the feature extraction layer.
lgraph = connectLayers(lgraph,featureExtractionLayer,"yolov2Conv1");

Use analyzeNetwork(lgraph) to check the network, and then train a YOLO v2 object detector using the trainYOLOv2ObjectDetector function.
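A minimal training sketch follows. It assumes "trainingData" is a datastore that returns images and ground truth boxes for the five object classes, which is not defined in this example; the training options shown are illustrative starting points, not tuned values.

```matlab
% Sketch: train the YOLO v2 detector on the assembled network.
% "trainingData" is a hypothetical datastore of images and box labels.
options = trainingOptions("sgdm", ...
    "InitialLearnRate",1e-3, ...
    "MiniBatchSize",16, ...
    "MaxEpochs",20);
[detector,info] = trainYOLOv2ObjectDetector(trainingData,lgraph,options);
```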