When to Use Halide Code for Efficiency

Halide is an open-source, domain-specific language designed to optimize image processing and computer vision applications (such as convolutional neural networks). Halide targets nonrecursive computations on the large multidimensional arrays that are common in these applications. By integrating Halide with programming languages such as C and C++, you can generate highly efficient code for high-performance array computations. Because Halide separates the schedule (how and when values are computed) from the algorithm (what is computed), automatic tools can search for efficient schedules for large pipelines. For more information, see Speed Up Generated Code Execution with Halide Code.

Halide code is particularly beneficial under these circumstances:

Computations involve large contiguous multidimensional data.
Computations involve stenciled operations such as convolution, matrix multiplication, max pooling, or image processing window filters.
Pipelined operations are present.

In simpler scenarios, such as 1-D point-wise operations, SIMD or OpenMP parallelization is often sufficient. When you work with large multidimensional data and stenciled operations, balancing redundancy (recomputing values), locality (keeping data close to where it is used), and parallelization becomes crucial. Halide scheduling primitives, such as compute_at and store_at, are effective in pipelines with stenciled operations. For example, you can apply these optimizations across the boundary between a convolution layer and a max pooling layer, a combination that manually optimized libraries cannot anticipate because of the vast number of possible layer combinations.
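The locality tradeoff described above can be sketched in plain C++. This is not Halide code and not the output of the code generator. It is a minimal illustration, under stated assumptions, of two ways to evaluate a 3-by-3 convolution followed by 2-by-2 max pooling. The names Image, conv3x3, conv_then_pool_staged, and conv_then_pool_fused are hypothetical, and the box-sum kernel, "valid" borders, and stride-2 pooling are assumptions chosen for brevity. The staged version stores the entire convolution output before pooling, roughly what a compute_root schedule expresses. The fused version computes each convolution value right where the pooling stage consumes it, roughly what compute_at expresses.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Row-major image stored in one contiguous buffer (hypothetical helper type).
struct Image {
    std::size_t rows, cols;
    std::vector<float> data;
    Image(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
    float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// 3x3 box-sum stencil for output (r, c). Reads input rows r..r+2 and
// columns c..c+2 ("valid" convolution, so no border handling is needed).
inline float conv3x3(const Image& in, std::size_t r, std::size_t c) {
    float sum = 0.0f;
    for (std::size_t dr = 0; dr < 3; ++dr)
        for (std::size_t dc = 0; dc < 3; ++dc)
            sum += in.at(r + dr, c + dc);
    return sum;
}

// Staged evaluation: materialize the whole convolution output, then pool it.
// The intermediate buffer is written once and read back later.
Image conv_then_pool_staged(const Image& in) {
    if (in.rows < 3 || in.cols < 3) return Image(0, 0);
    Image conv(in.rows - 2, in.cols - 2);
    for (std::size_t r = 0; r < conv.rows; ++r)
        for (std::size_t c = 0; c < conv.cols; ++c)
            conv.at(r, c) = conv3x3(in, r, c);

    Image out(conv.rows / 2, conv.cols / 2);
    for (std::size_t r = 0; r < out.rows; ++r)
        for (std::size_t c = 0; c < out.cols; ++c)
            out.at(r, c) = std::max(
                std::max(conv.at(2 * r, 2 * c), conv.at(2 * r, 2 * c + 1)),
                std::max(conv.at(2 * r + 1, 2 * c), conv.at(2 * r + 1, 2 * c + 1)));
    return out;
}

// Fused evaluation: compute the four convolution values of each pooling
// window on the fly, so no full-size intermediate buffer is stored.
Image conv_then_pool_fused(const Image& in) {
    if (in.rows < 3 || in.cols < 3) return Image(0, 0);
    Image out((in.rows - 2) / 2, (in.cols - 2) / 2);
    for (std::size_t r = 0; r < out.rows; ++r)
        for (std::size_t c = 0; c < out.cols; ++c) {
            float best = conv3x3(in, 2 * r, 2 * c);
            best = std::max(best, conv3x3(in, 2 * r, 2 * c + 1));
            best = std::max(best, conv3x3(in, 2 * r + 1, 2 * c));
            best = std::max(best, conv3x3(in, 2 * r + 1, 2 * c + 1));
            out.at(r, c) = best;
        }
    return out;
}
```

Both functions return the same pooled image. Because the 2-by-2 pooling windows in this sketch do not overlap, each convolution value is consumed exactly once and fusion adds no recomputation. With overlapping windows, the fused version would recompute shared values, which is the redundancy side of the tradeoff that a schedule has to balance against locality.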

Generate Halide Code from Simulink

With Embedded Coder®, you can generate Halide code by using a specific set of Simulink® blocks and integrate the Halide code into your existing C/C++ projects. Although Halide can improve performance, improvement is not guaranteed and is sometimes minimal. You can use execution-time profiling to determine whether the generated Halide code runs faster than plain C/C++ code. For more information, see Execution-Time Profiling for Generated Code.

Application-Specific Use Cases

This section provides a few examples in which you can use Halide to improve code efficiency and evaluate the performance difference with and without Halide optimization.

Deep Learning Applications

If you have a trained deep learning network, you can use the function exportNetworkToSimulink (Deep Learning Toolbox) to generate a Simulink model from the network. After you export the network to Simulink, you can evaluate its performance with different courses of action. This flow chart helps you determine whether to use Halide for your application.

Flow chart to determine if you should use Halide with Simulink

Consider the convolutional neural network ResNet-50, which is 50 layers deep. A pretrained ResNet-50 model for MATLAB is available in the Deep Learning Toolbox Model for ResNet-50 Network support package. Use the Add-On Explorer to download and install the support package.

To load the pretrained ResNet-50 network, use this command:

[net, classNames] = imagePretrainedNetwork('resnet50');
disp(net)

  dlnetwork with properties:

         Layers: [176×1 nnet.cnn.layer.Layer]
    Connections: [191×2 table]
     Learnables: [214×3 table]
          State: [106×3 table]
     InputNames: {'input_1'}
    OutputNames: {'fc1000_softmax'}
    Initialized: 1

  View summary with summary.

Export the network to Simulink.

exportNetworkToSimulink(net)

ans = struct with fields:
           ModelName: 'my_model'
         NetworkName: 'net'
           ModelPath: 'C:\Users\Projects\MATLAB\deeplearning_shared'
       InputDataType: 'Inherit: auto'
     BlockParameters: dictionary (string ⟼ cell) with 177 entries
    BlockConnections: [193×2 table]
          NumInports: 1
         NumOutports: 1
          SampleTime: '-1'
            Stateful: 0
          FrameBased: 0

save('resnet50.mat', 'net', 'classNames');

To determine if the generated Halide code improves execution speed compared to plain C/C++ code for the model, you can run a software-in-the-loop (SIL) simulation. For more information, see Compare Code Execution Times.

For this example, Halide code executed approximately 134 times faster than plain C++ code when using SIL simulation. The simulation was conducted on an AMD EPYC™ 74F3 24-Core Processor @ 3.19 GHz test system.

If the function exportNetworkToSimulink does not meet your needs, consider using the Predict (Deep Learning Toolbox) block to improve the speed of your neural network operations.

Harris Corner Detection

Using Neighborhood Processing Subsystem blocks, you can create a Harris corner detection model to identify features and objects in an image. For more information, see Perform Corner Detection by Using Neighborhood Processing Subsystem Blocks.

You can generate Halide code to perform corner detection and assess whether the generated Halide code improves the execution speed compared to plain C/C++ code by running a SIL simulation. For more information, see Compare Code Execution Times.

For this model, Halide code ran approximately twice as fast as the equivalent C++ code in SIL simulation. The simulation was run on an AMD EPYC™ 74F3 24-Core Processor @ 3.19 GHz test system.

Using Neighborhood Processing Subsystem blocks can improve the performance of image processing tasks. The code generator supports a subset of Simulink blocks within a Neighborhood Processing Subsystem for Halide code generation. For more information, see Supported Blocks for Halide Code Generation in Neighborhood Processing Subsystem.

Calculate Optical Flow

You can use Halide to calculate optical flow to identify and track objects in video sequences. The OpticalFlowNeighborhoodExample model uses Neighborhood Processing Subsystem blocks. For detailed information, see Calculate Optical Flow by Using Neighborhood Processing Subsystem Blocks.

To assess the performance improvement of the generated Halide code over equivalent C++ code, you can run a software-in-the-loop (SIL) simulation. For more information, see Compare Code Execution Times.

For this model, Halide code executed approximately six times faster than the standard C++ code in SIL simulation. The simulation was run on an AMD EPYC™ 74F3 24-Core Processor @ 3.19 GHz test system.