Create YOLO v2 Object Detection Network

This example shows how to modify a pretrained MobileNet v2 network to create a YOLO v2 object detection network. This approach offers additional flexibility compared to the yolov2Layers function, which returns a canonical YOLO v2 object detector.

The procedure to convert a pretrained network into a YOLO v2 network is similar to the transfer learning procedure for image classification:

  1. Load the pretrained network.

  2. Select a layer from the pretrained network to use for feature extraction.

  3. Remove all the layers after the feature extraction layer.

  4. Add new layers to support the object detection task.

You can also implement this procedure interactively using the Deep Network Designer app.

Load Pretrained Network

Load a pretrained MobileNet v2 network using mobilenetv2. Using this function requires the Deep Learning Toolbox™ Model for MobileNet v2 Network support package.

% Load a pretrained network.
net = mobilenetv2();

% Convert network into a layer graph object
% in order to manipulate the layers.
lgraph = layerGraph(net);

Update Network Image Size

Change the image size of the network based on the training data requirements. To illustrate this step, assume the required image size is [300 300 3] for RGB images.

% Input size for detector.
imageInputSize = [300 300 3];

% Create new image input layer. Set the new layer name
% to the original layer name.
imgLayer = imageInputLayer(imageInputSize,"Name","input_1")
imgLayer = 
  ImageInputLayer with properties:

                      Name: 'input_1'
                 InputSize: [300 300 3]

   Hyperparameters
          DataAugmentation: 'none'
             Normalization: 'zerocenter'
    NormalizationDimension: 'auto'
                      Mean: []

% Replace old image input layer.
lgraph = replaceLayer(lgraph,"input_1",imgLayer);

Select Feature Extraction Layer

A good feature extraction layer for YOLO v2 is one whose output feature map width and height are between 8 and 16 times smaller than the input image. This amount of downsampling is a trade-off between spatial resolution and the quality of the output features. Use the analyzeNetwork function or the Deep Network Designer app to determine the output sizes of the layers within a network. Note that selecting an optimal feature extraction layer requires empirical evaluation.

Set the feature extraction layer to "block_12_add" from MobileNet v2. Because the input size was previously set to [300 300 3], the output feature map size is [19 19]. This results in a downsampling factor of about 16.

featureExtractionLayer = "block_12_add";
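As a quick sanity check, you can compute the downsampling factor directly from the input and feature map sizes noted above (the [19 19] feature size comes from analyzing the network with a [300 300 3] input):

```matlab
% Compute the downsampling factor of the chosen feature extraction layer.
inputSize = [300 300];    % spatial size of the network input
featureSize = [19 19];    % output size of "block_12_add" for this input
downsamplingFactor = inputSize./featureSize
```

The result is approximately 15.8 in each dimension, which falls within the recommended 8-to-16 range.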

Remove Layers After Feature Extraction Layer

To easily remove layers from a deep network, such as MobileNet v2, use the Deep Network Designer app. Import the network into the app, manually remove the layers after "block_12_add", and export the modified network to your workspace. This example uses a presaved version of MobileNet v2 that was exported from the app.

% Load a network modified using Deep Network Designer.
modified = load("mobilenetv2Block12Add.mat");
lgraph = modified.mobilenetv2Block12Add;

Alternatively, if you have a list of layers to remove, you can use the removeLayers function to remove them manually.
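For example, one way to build that list programmatically is sketched below. This sketch assumes that the Layers array of the layer graph lists layers in topological order, so every layer listed after "block_12_add" lies downstream of it; this holds for MobileNet v2 but is worth verifying for other networks.

```matlab
% Sketch: remove all layers after the feature extraction layer.
lgraph = layerGraph(net);
layerNames = {lgraph.Layers.Name};
idx = find(strcmp(layerNames,"block_12_add"));
% Assumes layers after idx in the array are downstream of "block_12_add".
layersToRemove = layerNames(idx+1:end);
lgraph = removeLayers(lgraph,layersToRemove);
```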

Create YOLO v2 Detection Sub-Network

The detection subnetwork consists of groups of serially connected convolution, ReLU, and batch normalization layers. These layers are followed by a yolov2TransformLayer and a yolov2OutputLayer.

Create the convolution, ReLU, and batch normalization portion of the detection sub-network.

% Set the convolution layer filter size to [3 3].
% This size is common in CNN architectures. 
filterSize = [3 3];

% Set the number of filters in the convolution layers
% to match the number of channels in the
% feature extraction layer output.
numFilters = 96;

% Create the detection subnetwork.
% * The convolution layer uses "same" padding
%   to preserve the input size.
detectionLayers = [
    % group 1
    convolution2dLayer(filterSize,numFilters,"Name","yolov2Conv1",...
    "Padding", "same", "WeightsInitializer",@(sz)randn(sz)*0.01)
    batchNormalizationLayer("Name","yolov2Batch1");
    reluLayer("Name","yolov2Relu1");
    
    % group 2
    convolution2dLayer(filterSize,numFilters,"Name","yolov2Conv2",...
    "Padding", "same", "WeightsInitializer",@(sz)randn(sz)*0.01)
    batchNormalizationLayer("Name","yolov2Batch2");
    reluLayer("Name","yolov2Relu2");
    ]
detectionLayers = 
  6x1 Layer array with layers:

     1   'yolov2Conv1'    Convolution           96 3x3 convolutions with stride [1  1] and padding 'same'
     2   'yolov2Batch1'   Batch Normalization   Batch normalization
     3   'yolov2Relu1'    ReLU                  ReLU
     4   'yolov2Conv2'    Convolution           96 3x3 convolutions with stride [1  1] and padding 'same'
     5   'yolov2Batch2'   Batch Normalization   Batch normalization
     6   'yolov2Relu2'    ReLU                  ReLU

The remaining layers are configured based on application-specific details, such as the number of object classes and the anchor boxes.

% Define the number of classes to detect.
numClasses = 5;

% Define the anchor boxes.
anchorBoxes = [
    16 16
    32 16
    ];

% Number of anchor boxes.
numAnchors = size(anchorBoxes,1);

% There are five predictions per anchor box: 
%  * Predict the x, y, width, and height offset
%    for each anchor.
%  * Predict the intersection-over-union with ground
%    truth boxes.
numPredictionsPerAnchor = 5;

% Number of filters in last convolution layer.
outputSize = numAnchors*(numClasses+numPredictionsPerAnchor);
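The anchor boxes above are hand-picked for illustration. In practice, you can estimate anchor boxes from your training data instead. The sketch below assumes a boxLabelDatastore named blds containing the ground truth boxes for your data set, which is not defined in this example:

```matlab
% Sketch: estimate anchor boxes from labeled training data.
% "blds" is a hypothetical boxLabelDatastore of ground truth boxes.
numAnchors = 2;
[anchorBoxes,meanIoU] = estimateAnchorBoxes(blds,numAnchors);
```

The returned meanIoU indicates how well the estimated anchors overlap the ground truth boxes; increasing numAnchors typically improves it at the cost of a larger final convolution layer.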

Create the final convolution2dLayer, yolov2TransformLayer, and yolov2OutputLayer layers.

% Final layers in detection sub-network.
finalLayers = [
    convolution2dLayer(1,outputSize,"Name","yolov2ClassConv",...
    "WeightsInitializer", @(sz)randn(sz)*0.01)
    yolov2TransformLayer(numAnchors,"Name","yolov2Transform")
    yolov2OutputLayer(anchorBoxes,"Name","yolov2OutputLayer")
    ];

Add the last layers to the network.

% Add the last layers to network.
detectionLayers = [
    detectionLayers
    finalLayers
    ]
detectionLayers = 
  9x1 Layer array with layers:

     1   'yolov2Conv1'         Convolution               96 3x3 convolutions with stride [1  1] and padding 'same'
     2   'yolov2Batch1'        Batch Normalization       Batch normalization
     3   'yolov2Relu1'         ReLU                      ReLU
     4   'yolov2Conv2'         Convolution               96 3x3 convolutions with stride [1  1] and padding 'same'
     5   'yolov2Batch2'        Batch Normalization       Batch normalization
     6   'yolov2Relu2'         ReLU                      ReLU
     7   'yolov2ClassConv'     Convolution               20 1x1 convolutions with stride [1  1] and padding [0  0  0  0]
     8   'yolov2Transform'     YOLO v2 Transform Layer   YOLO v2 Transform Layer with 2 anchors
     9   'yolov2OutputLayer'   YOLO v2 Output            YOLO v2 Output with 2 anchors

Complete YOLO v2 Detection Network

Attach the detection subnetwork to the feature extraction network.

% Add the detection subnetwork to the feature extraction network.
lgraph = addLayers(lgraph,detectionLayers);

% Connect the detection subnetwork to the feature extraction layer.
lgraph = connectLayers(lgraph,featureExtractionLayer,"yolov2Conv1");

Use analyzeNetwork(lgraph) to check the network, and then train a YOLO v2 object detector using the trainYOLOv2ObjectDetector function.
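A minimal training sketch follows. It assumes "trainingData" is a datastore that returns images and ground truth boxes for the five object classes, which is not defined in this example; the training options shown are illustrative starting points, not tuned values.

```matlab
% Sketch: train the YOLO v2 detector on the assembled network.
% "trainingData" is a hypothetical datastore of images and box labels.
options = trainingOptions("sgdm", ...
    "InitialLearnRate",1e-3, ...
    "MiniBatchSize",16, ...
    "MaxEpochs",20);
[detector,info] = trainYOLOv2ObjectDetector(trainingData,lgraph,options);
```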