
Deploy Quantized Network Example

This example shows how to train, compile, and deploy a dlhdl.Workflow object that has quantized ResNet-18 as the network object by using the Deep Learning HDL Toolbox™ Support Package for Xilinx FPGA and SoC Devices. Quantization reduces the memory requirement of a deep neural network by quantizing the weights, biases, and activations of the network layers to 8-bit scaled integer data types. Use MATLAB® to retrieve the prediction results from the target device.
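Conceptually, 8-bit scaled integer quantization stores each value as an int8 integer together with a power-of-two scale chosen from the observed dynamic range. The following sketch is illustrative only and is not the dlquantizer implementation:

% Illustrative sketch only; not the dlquantizer implementation.
w = randn(3,3,3,16,'single');            % example single-precision weight tensor
maxAbs = max(abs(w(:)));                 % observed dynamic range
fracBits = floor(log2(127/maxAbs));      % fraction length that fits the int8 range
scale = 2^(-fracBits);
wInt8 = int8(round(w/scale));            % stored 8-bit values (int8 saturates)
wDequant = single(wInt8)*scale;          % values the quantized arithmetic represents
quantError = max(abs(w(:) - wDequant(:)))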

Required Products

For this example, you need:

  • Deep Learning Toolbox™

  • Deep Learning HDL Toolbox™

  • Deep Learning Toolbox Model Quantization Library

  • Deep Learning Toolbox Model for ResNet-18 Network

  • Deep Learning HDL Toolbox™ Support Package for Xilinx FPGA and SoC Devices

  • MATLAB Coder Interface for Deep Learning Libraries

Load Pretrained DAG Network

To load the pretrained DAG network ResNet-18, enter:

net = resnet18;

To view the layers of the pretrained DAG network, enter:

analyzeNetwork(net);

The first layer, the image input layer, requires input images of size 224-by-224-by-3, where 3 is the number of color channels.

inputSize = net.Layers(1).InputSize;

inputSize = 1×3

224 224 3

Define Training and Validation Data Sets

This example uses the logos_dataset data set. The data set consists of 320 images. Create an imageDatastore object, and then split the data into training and validation data sets.

curDir = pwd;
newDir = fullfile(matlabroot,'examples','deeplearning_shared','data','logos_dataset.zip');
copyfile(newDir,curDir,'f');

unzip('logos_dataset.zip');

imds = imageDatastore('logos_dataset', ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');

[imdsTrain,imdsValidation] = splitEachLabel(imds,0.7,'randomized');
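
As a quick sanity check (not part of the original example), you can confirm the 70/30 split per class:

% Optional check: count the images per label in each split.
countEachLabel(imdsTrain)
countEachLabel(imdsValidation)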

Replace Final Layers

The last three layers of the pretrained network net are configured for 1000 classes. These three layers must be replaced for the new classification problem.

Extract the layer graph from the trained network.

lgraph = layerGraph(net)
lgraph = 
  LayerGraph with properties:

         Layers: [71×1 nnet.cnn.layer.Layer]
    Connections: [78×2 table]
     InputNames: {'data'}
    OutputNames: {'ClassificationLayer_predictions'}

Remove the 'fc1000', 'prob', and 'ClassificationLayer_predictions' layers from lgraph.

layers = net.Layers;   % the Layers array of a DAGNetwork is topologically sorted
for i = 0:2
    lgraph = removeLayers(lgraph,layers(end-i).Name);
end

Transfer the layers to the new classification task by replacing the last three layers with a fully connected layer, a softmax layer, and a classification output layer. Set the fully connected layer to have the same size as the number of classes in the new data.

numClasses = numel(categories(imdsTrain.Labels))

numClasses = 32

Create three new layers and add them to lgraph. Ensure that the transferred and new layers are properly connected in the layer graph.

newLayers = [
    fullyConnectedLayer(numClasses,'WeightLearnRateFactor',20,'BiasLearnRateFactor',20,'Name','newFC')
    softmaxLayer('Name','newProb')
    classificationLayer('Name','newClassOutput','Classes','auto')];

lgraph = addLayers(lgraph,newLayers);
lgraph = connectLayers(lgraph,layers(end-3).Name,'newFC');
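
Optionally, confirm that the edited layer graph has no disconnected or misconfigured layers before training:

% Optional check: report errors and warnings in the edited layer graph.
analyzeNetwork(lgraph)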

Train Network

The network requires input images of size 224-by-224-by-3, but the images in the image datastores have different sizes. Use an augmented image datastore to automatically resize the training images. Specify additional augmentation operations to perform on the training images, such as randomly flipping the training images along the vertical axis and randomly translating them up to 30 pixels horizontally and vertically. Data augmentation helps prevent the network from overfitting and memorizing the exact details of the training images.

pixelRange = [-30 30];
imageAugmenter = imageDataAugmenter( ...
    'RandXReflection',true, ...
    'RandXTranslation',pixelRange, ...
    'RandYTranslation',pixelRange);
augimdsTrain = augmentedImageDatastore(inputSize(1:2),imdsTrain, ...
    'DataAugmentation',imageAugmenter);
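
Optionally, preview one mini-batch of augmented images to verify the resize and augmentation settings (this visualization is not part of the original example):

% Optional visualization: preview a mini-batch of augmented training images.
minibatch = preview(augimdsTrain);      % table with the images in the input column
imshow(imtile(minibatch.input))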

To automatically resize the validation images without performing further data augmentation, use an augmented image datastore without specifying any additional preprocessing operations.

augimdsValidation = augmentedImageDatastore(inputSize(1:2),imdsValidation);

Specify the training options. For transfer learning, keep the features from the early layers of the pretrained network (the transferred layer weights). To slow down learning in the transferred layers, set the initial learning rate to a small value. Specify the mini-batch size and validation data. The software validates the network every ValidationFrequency iterations during training.

options = trainingOptions('sgdm', ...
    'MiniBatchSize',10, ...
    'MaxEpochs',6, ...
    'InitialLearnRate',1e-4, ...
    'Shuffle','every-epoch', ...
    'ValidationData',augimdsValidation, ...
    'ValidationFrequency',3, ...
    'Verbose',false, ...
    'Plots','training-progress');

Train the network that consists of the transferred and new layers. By default, trainNetwork uses a GPU if one is available. This requires Parallel Computing Toolbox™ and a supported GPU device; for more information, see GPU Computing Requirements (Parallel Computing Toolbox). Otherwise, trainNetwork uses the CPU (requires the MATLAB Coder Interface for Deep Learning Libraries). You can also specify the execution environment by using the 'ExecutionEnvironment' name-value argument of trainingOptions.

netTransfer = trainNetwork(augimdsTrain,lgraph,options);
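
Optionally, record the floating-point accuracy of the trained network on the validation set. This baseline, which is not part of the original example, is useful for judging the quantized results later:

% Optional baseline: floating-point accuracy before quantization.
YPred = classify(netTransfer,augimdsValidation);
floatAccuracy = mean(YPred == imdsValidation.Labels)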

Create dlquantizer Object

Create a dlquantizer object and specify the network to quantize. Specify the execution environment as FPGA.

dlQuantObj = dlquantizer(netTransfer,'ExecutionEnvironment','FPGA');
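
The dlquantizer object also supports other execution environments. For example, to target GPU or CPU deployment instead (not used in this example):

% dlQuantObj = dlquantizer(netTransfer,'ExecutionEnvironment','GPU');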

Calibrate Quantized Network

The dlquantizer object uses calibration data to collect dynamic ranges for the learnable parameters of the convolution and fully connected layers of the network.

For the best quantization results, the calibration data must be representative of actual inputs to the network. Expedite the calibration process by reducing the calibration data set to 20 images.

imageData = imageDatastore(fullfile(curDir,'logos_dataset'), ...
    'IncludeSubfolders',true,'FileExtensions','.JPG','LabelSource','foldernames');
imageData_reduced = subset(imageData,1:20);
calibrate(dlQuantObj,imageData_reduced)
ans=95×5 table
       Optimized Layer Name       Network Layer Name    Learnables / Activations    MinValue    MaxValue
    __________________________    __________________    ________________________    ________    ________

    {'conv1_Weights'         }    {'conv1'         }           "Weights"            -0.52595    0.83365 
    {'conv1_Bias'            }    {'conv1'         }           "Bias"               -0.66142    0.67493 
    {'res2a_branch2a_Weights'}    {'res2a_branch2a'}           "Weights"            -0.36239    0.42815 
    {'res2a_branch2a_Bias'   }    {'res2a_branch2a'}           "Bias"               -0.83058     1.1734 
    {'res2a_branch2b_Weights'}    {'res2a_branch2b'}           "Weights"            -0.80143    0.54724 
    {'res2a_branch2b_Bias'   }    {'res2a_branch2b'}           "Bias"                -1.2691     1.7777 
    {'res2b_branch2a_Weights'}    {'res2b_branch2a'}           "Weights"            -0.26073    0.25689 
    {'res2b_branch2a_Bias'   }    {'res2b_branch2a'}           "Bias"                -1.0012     1.2976 
    {'res2b_branch2b_Weights'}    {'res2b_branch2b'}           "Weights"             -1.1361    0.77358 
    {'res2b_branch2b_Bias'   }    {'res2b_branch2b'}           "Bias"                -1.1981     1.1897 
    {'res3a_branch2a_Weights'}    {'res3a_branch2a'}           "Weights"            -0.13934    0.21123 
    {'res3a_branch2a_Bias'   }    {'res3a_branch2a'}           "Bias"               -0.54418    0.71134 
    {'res3a_branch2b_Weights'}    {'res3a_branch2b'}           "Weights"            -0.49925    0.69286 
    {'res3a_branch2b_Bias'   }    {'res3a_branch2b'}           "Bias"               -0.66837     1.4745 
    {'res3a_branch1_Weights' }    {'res3a_branch1' }           "Weights"            -0.63797    0.66549 
    {'res3a_branch1_Bias'    }    {'res3a_branch1' }           "Bias"                -1.0594    0.92627 
      ⋮
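
Optionally, examine which parameters have the widest dynamic ranges, because they drive the scaling decisions. This sketch assumes that you capture the statistics table returned by calibrate, for example in calResults:

% Optional inspection (sketch): sort the calibration statistics by dynamic range.
calResults = calibrate(dlQuantObj,imageData_reduced);
calResults.Range = calResults.MaxValue - calResults.MinValue;
sortrows(calResults,'Range','descend')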

Create Target Object

Create a target object with a custom name for your target device and an interface to connect your target device to the host computer. Interface options are JTAG and Ethernet. To use JTAG, install Xilinx® Vivado® Design Suite 2020.2. To set the Xilinx Vivado tool path, enter:

% hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2020.2\bin\vivado.bat');

To create the target object, enter:

hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');

Alternatively, you can use the JTAG interface.

% hTarget = dlhdl.Target('Xilinx', 'Interface', 'JTAG');

Create Workflow Object

Create an object of the dlhdl.Workflow class. When you create the object, specify the network, the bitstream name, and the target information. Specify dlQuantObj as the network. Make sure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Xilinx ZCU102 SoC board and the bitstream uses the int8 data type.

hW = dlhdl.Workflow('Network', dlQuantObj, 'Bitstream', 'zcu102_int8','Target',hTarget);

Compile the Quantized DAG Network

To compile the quantized ResNet-18 DAG network, run the compile function of the dlhdl.Workflow object.

dn = hW.compile
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_int8.
### The network includes the following layers:
     1   'data'                  Image Input                  224×224×3 images with 'zscore' normalization                          (SW Layer)
     2   'conv1'                 Convolution                  64 7×7×3 convolutions with stride [2  2] and padding [3  3  3  3]     (HW Layer)
     3   'bn_conv1'              Batch Normalization          Batch normalization with 64 channels                                  (HW Layer)
     4   'conv1_relu'            ReLU                         ReLU                                                                  (HW Layer)
     5   'pool1'                 Max Pooling                  3×3 max pooling with stride [2  2] and padding [1  1  1  1]           (HW Layer)
     6   'res2a_branch2a'        Convolution                  64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
     7   'bn2a_branch2a'         Batch Normalization          Batch normalization with 64 channels                                  (HW Layer)
     8   'res2a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
     9   'res2a_branch2b'        Convolution                  64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
    10   'bn2a_branch2b'         Batch Normalization          Batch normalization with 64 channels                                  (HW Layer)
    11   'res2a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    12   'res2a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    13   'res2b_branch2a'        Convolution                  64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
    14   'bn2b_branch2a'         Batch Normalization          Batch normalization with 64 channels                                  (HW Layer)
    15   'res2b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    16   'res2b_branch2b'        Convolution                  64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
    17   'bn2b_branch2b'         Batch Normalization          Batch normalization with 64 channels                                  (HW Layer)
    18   'res2b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    19   'res2b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    20   'res3a_branch2a'        Convolution                  128 3×3×64 convolutions with stride [2  2] and padding [1  1  1  1]   (HW Layer)
    21   'bn3a_branch2a'         Batch Normalization          Batch normalization with 128 channels                                 (HW Layer)
    22   'res3a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    23   'res3a_branch2b'        Convolution                  128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    24   'bn3a_branch2b'         Batch Normalization          Batch normalization with 128 channels                                 (HW Layer)
    25   'res3a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    26   'res3a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    27   'res3a_branch1'         Convolution                  128 1×1×64 convolutions with stride [2  2] and padding [0  0  0  0]   (HW Layer)
    28   'bn3a_branch1'          Batch Normalization          Batch normalization with 128 channels                                 (HW Layer)
    29   'res3b_branch2a'        Convolution                  128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    30   'bn3b_branch2a'         Batch Normalization          Batch normalization with 128 channels                                 (HW Layer)
    31   'res3b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    32   'res3b_branch2b'        Convolution                  128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    33   'bn3b_branch2b'         Batch Normalization          Batch normalization with 128 channels                                 (HW Layer)
    34   'res3b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    35   'res3b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    36   'res4a_branch2a'        Convolution                  256 3×3×128 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
    37   'bn4a_branch2a'         Batch Normalization          Batch normalization with 256 channels                                 (HW Layer)
    38   'res4a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    39   'res4a_branch2b'        Convolution                  256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    40   'bn4a_branch2b'         Batch Normalization          Batch normalization with 256 channels                                 (HW Layer)
    41   'res4a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    42   'res4a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    43   'res4a_branch1'         Convolution                  256 1×1×128 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    44   'bn4a_branch1'          Batch Normalization          Batch normalization with 256 channels                                 (HW Layer)
    45   'res4b_branch2a'        Convolution                  256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    46   'bn4b_branch2a'         Batch Normalization          Batch normalization with 256 channels                                 (HW Layer)
    47   'res4b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    48   'res4b_branch2b'        Convolution                  256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    49   'bn4b_branch2b'         Batch Normalization          Batch normalization with 256 channels                                 (HW Layer)
    50   'res4b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    51   'res4b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    52   'res5a_branch2a'        Convolution                  512 3×3×256 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
    53   'bn5a_branch2a'         Batch Normalization          Batch normalization with 512 channels                                 (HW Layer)
    54   'res5a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    55   'res5a_branch2b'        Convolution                  512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    56   'bn5a_branch2b'         Batch Normalization          Batch normalization with 512 channels                                 (HW Layer)
    57   'res5a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    58   'res5a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    59   'res5a_branch1'         Convolution                  512 1×1×256 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    60   'bn5a_branch1'          Batch Normalization          Batch normalization with 512 channels                                 (HW Layer)
    61   'res5b_branch2a'        Convolution                  512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    62   'bn5b_branch2a'         Batch Normalization          Batch normalization with 512 channels                                 (HW Layer)
    63   'res5b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    64   'res5b_branch2b'        Convolution                  512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    65   'bn5b_branch2b'         Batch Normalization          Batch normalization with 512 channels                                 (HW Layer)
    66   'res5b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    67   'res5b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    68   'pool5'                 2-D Global Average Pooling   2-D global average pooling                                            (HW Layer)
    69   'newFC'                 Fully Connected              32 fully connected layer                                              (HW Layer)
    70   'newProb'               Softmax                      softmax                                                               (HW Layer)
    71   'newClassOutput'        Classification Output        crossentropyex with 'adidas' and 31 other classes                     (SW Layer)
                                                                                                                                  
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'data' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'newProb' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'newClassOutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.
### Compiling layer group: conv1>>pool1 ...
### Compiling layer group: conv1>>pool1 ... complete.
### Compiling layer group: res2a_branch2a>>res2a_branch2b ...
### Compiling layer group: res2a_branch2a>>res2a_branch2b ... complete.
### Compiling layer group: res2b_branch2a>>res2b_branch2b ...
### Compiling layer group: res2b_branch2a>>res2b_branch2b ... complete.
### Compiling layer group: res3a_branch1 ...
### Compiling layer group: res3a_branch1 ... complete.
### Compiling layer group: res3a_branch2a>>res3a_branch2b ...
### Compiling layer group: res3a_branch2a>>res3a_branch2b ... complete.
### Compiling layer group: res3b_branch2a>>res3b_branch2b ...
### Compiling layer group: res3b_branch2a>>res3b_branch2b ... complete.
### Compiling layer group: res4a_branch1 ...
### Compiling layer group: res4a_branch1 ... complete.
### Compiling layer group: res4a_branch2a>>res4a_branch2b ...
### Compiling layer group: res4a_branch2a>>res4a_branch2b ... complete.
### Compiling layer group: res4b_branch2a>>res4b_branch2b ...
### Compiling layer group: res4b_branch2a>>res4b_branch2b ... complete.
### Compiling layer group: res5a_branch1 ...
### Compiling layer group: res5a_branch1 ... complete.
### Compiling layer group: res5a_branch2a>>res5a_branch2b ...
### Compiling layer group: res5a_branch2a>>res5a_branch2b ... complete.
### Compiling layer group: res5b_branch2a>>res5b_branch2b ...
### Compiling layer group: res5b_branch2a>>res5b_branch2b ... complete.
### Compiling layer group: pool5 ...
### Compiling layer group: pool5 ... complete.
### Compiling layer group: newFC ...
### Compiling layer group: newFC ... complete.

### Allocating external memory buffers:

          offset_name          offset_address    allocated_space 
    _______________________    ______________    ________________

    "InputDataOffset"           "0x00000000"     "12.0 MB"       
    "OutputResultOffset"        "0x00c00000"     "4.0 MB"        
    "SchedulerDataOffset"       "0x01000000"     "4.0 MB"        
    "SystemBufferOffset"        "0x01400000"     "28.0 MB"       
    "InstructionDataOffset"     "0x03000000"     "4.0 MB"        
    "ConvWeightDataOffset"      "0x03400000"     "16.0 MB"       
    "FCWeightDataOffset"        "0x04400000"     "4.0 MB"        
    "EndOffset"                 "0x04800000"     "Total: 72.0 MB"

### Network compilation complete.
dn = struct with fields:
             weights: [1×1 struct]
        instructions: [1×1 struct]
           registers: [1×1 struct]
    syncInstructions: [1×1 struct]
        constantData: {}

Program Bitstream onto FPGA and Download Network Weights

To deploy the network on the Xilinx ZCU102 SoC hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board with the programming file. It also downloads the network weights and biases. The deploy function programs the FPGA device and displays progress messages and the time it takes to deploy the network.

hW.deploy
### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target FPGA.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 09-Dec-2021 18:36:39
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 09-Dec-2021 18:36:39

Load Example Images and Run the Prediction

To load example images, run the prediction on the FPGA by using the predict function of the dlhdl.Workflow object, and then display the results, enter:

idx = randperm(numel(imdsValidation.Files),4);
figure
for i = 1:4
    subplot(2,2,i)
    I = readimage(imdsValidation,idx(i));
    I = imresize(I,[224 224]);                                    % match the network input size
    imshow(I)
    [prediction, speed] = hW.predict(single(I),'Profile','on');   % run inference on the FPGA with profiling
    [val, index] = max(prediction);
    netTransfer.Layers(end).ClassNames{index}                     % display the predicted class
    label = netTransfer.Layers(end).ClassNames{index}
    title(string(label));
end
### Finished writing input activations.
### Running single input activation.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                    7335952                  0.02934                       1            7338581             34.1
    conv1                  1115480                  0.00446 
    pool1                   238029                  0.00095 
    res2a_branch2a          269834                  0.00108 
    res2a_branch2b          269956                  0.00108 
    res2a                    88905                  0.00036 
    res2b_branch2a          269794                  0.00108 
    res2b_branch2b          269965                  0.00108 
    res2b                    88783                  0.00036 
    res3a_branch1           156139                  0.00062 
    res3a_branch2a          227324                  0.00091 
    res3a_branch2b          245055                  0.00098 
    res3a                    44462                  0.00018 
    res3b_branch2a          244852                  0.00098 
    res3b_branch2b          245048                  0.00098 
    res3b                    44426                  0.00018 
    res4a_branch1           135525                  0.00054 
    res4a_branch2a          136187                  0.00054 
    res4a_branch2b          237212                  0.00095 
    res4a                    22312                  0.00009 
    res4b_branch2a          236600                  0.00095 
    res4b_branch2b          237466                  0.00095 
    res4b                    22542                  0.00009 
    res5a_branch1           311891                  0.00125 
    res5a_branch2a          311873                  0.00125 
    res5a_branch2b          596194                  0.00238 
    res5a                    11201                  0.00004 
    res5b_branch2a          595857                  0.00238 
    res5b_branch2b          596713                  0.00239 
    res5b                    11431                  0.00005 
    pool5                    36976                  0.00015 
    newFC                    17733                  0.00007 
 * The clock frequency of the DL processor is: 250MHz
ans = 
'becks'
label = 
'becks'
### Finished writing input activations.
### Running single input activation.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                    7336472                  0.02935                       1            7339005             34.1
    conv1                  1115736                  0.00446 
    pool1                   237938                  0.00095 
    res2a_branch2a          269787                  0.00108 
    res2a_branch2b          270062                  0.00108 
    res2a                    88865                  0.00036 
    res2b_branch2a          269710                  0.00108 
    res2b_branch2b          269870                  0.00108 
    res2b                    88975                  0.00036 
    res3a_branch1           156178                  0.00062 
    res3a_branch2a          227565                  0.00091 
    res3a_branch2b          245130                  0.00098 
    res3a                    44486                  0.00018 
    res3b_branch2a          244733                  0.00098 
    res3b_branch2b          244875                  0.00098 
    res3b                    44486                  0.00018 
    res4a_branch1           135725                  0.00054 
    res4a_branch2a          136247                  0.00054 
    res4a_branch2b          236972                  0.00095 
    res4a                    22382                  0.00009 
    res4b_branch2a          236891                  0.00095 
    res4b_branch2b          237046                  0.00095 
    res4b                    22402                  0.00009 
    res5a_branch1           312061                  0.00125 
    res5a_branch2a          311738                  0.00125 
    res5a_branch2b          596238                  0.00238 
    res5a                    11261                  0.00005 
    res5b_branch2a          595867                  0.00238 
    res5b_branch2b          596768                  0.00239 
    res5b                    11351                  0.00005 
    pool5                    36999                  0.00015 
    newFC                    17941                  0.00007 
 * The clock frequency of the DL processor is: 250MHz
ans = 
'nvidia'
label = 
'nvidia'
### Finished writing input activations.
### Running single input activation.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                    7335649                  0.02934                       1            7338257             34.1
    conv1                  1115615                  0.00446 
    pool1                   237521                  0.00095 
    res2a_branch2a          269761                  0.00108 
    res2a_branch2b          270039                  0.00108 
    res2a                    88844                  0.00036 
    res2b_branch2a          269603                  0.00108 
    res2b_branch2b          269901                  0.00108 
    res2b                    88855                  0.00036 
    res3a_branch1           156360                  0.00063 
    res3a_branch2a          227646                  0.00091 
    res3a_branch2b          245061                  0.00098 
    res3a                    44526                  0.00018 
    res3b_branch2a          244836                  0.00098 
    res3b_branch2b          244944                  0.00098 
    res3b                    44566                  0.00018 
    res4a_branch1           135820                  0.00054 
    res4a_branch2a          136251                  0.00055 
    res4a_branch2b          236828                  0.00095 
    res4a                    22352                  0.00009 
    res4b_branch2a          237023                  0.00095 
    res4b_branch2b          236932                  0.00095 
    res4b                    22392                  0.00009 
    res5a_branch1           311901                  0.00125 
    res5a_branch2a          311751                  0.00125 
    res5a_branch2b          596252                  0.00239 
    res5a                    11281                  0.00005 
    res5b_branch2a          595829                  0.00238 
    res5b_branch2b          596743                  0.00239 
    res5b                    11249                  0.00004 
    pool5                    36913                  0.00015 
    newFC                    17867                  0.00007 
 * The clock frequency of the DL processor is: 250MHz
ans = 
'google'
label = 
'google'
### Finished writing input activations.
### Running single input activation.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                    7336269                  0.02935                       1            7338849             34.1
    conv1                  1115457                  0.00446 
    pool1                   238045                  0.00095 
    res2a_branch2a          269786                  0.00108 
    res2a_branch2b          269928                  0.00108 
    res2a                    88895                  0.00036 
    res2b_branch2a          269765                  0.00108 
    res2b_branch2b          269928                  0.00108 
    res2b                    88825                  0.00036 
    res3a_branch1           156232                  0.00062 
    res3a_branch2a          227320                  0.00091 
    res3a_branch2b          245058                  0.00098 
    res3a                    44493                  0.00018 
    res3b_branch2a          244799                  0.00098 
    res3b_branch2b          245040                  0.00098 
    res3b                    44456                  0.00018 
    res4a_branch1           135640                  0.00054 
    res4a_branch2a          136131                  0.00054 
    res4a_branch2b          237165                  0.00095 
    res4a                    22312                  0.00009 
    res4b_branch2a          236583                  0.00095 
    res4b_branch2b          237521                  0.00095 
    res4b                    22512                  0.00009 
    res5a_branch1           311839                  0.00125 
    res5a_branch2a          311902                  0.00125 
    res5a_branch2b          596212                  0.00238 
    res5a                    11211                  0.00004 
    res5b_branch2a          595898                  0.00238 
    res5b_branch2b          596808                  0.00239 
    res5b                    11381                  0.00005 
    pool5                    36989                  0.00015 
    newFC                    17951                  0.00007 
 * The clock frequency of the DL processor is: 250MHz
ans = 
'shell'
label = 
'shell'
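
Beyond these four sample images, you can estimate the accuracy of the deployed int8 network over the entire validation set and compare it with the floating-point baseline recorded after training. This loop is a sketch, not part of the original example, and assumes that hW, imdsValidation, and netTransfer exist from the previous steps:

% Sketch: estimate the deployed network accuracy on the full validation set.
numValidation = numel(imdsValidation.Files);
predictions = strings(numValidation,1);
for k = 1:numValidation
    I = imresize(readimage(imdsValidation,k),[224 224]);   % match the network input size
    scores = hW.predict(single(I));                        % FPGA inference without profiling
    [~,index] = max(scores);
    predictions(k) = string(netTransfer.Layers(end).ClassNames{index});
end
fpgaAccuracy = mean(predictions == string(imdsValidation.Labels))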
