Run CUDA or PTX Code on GPU

If running MATLAB^® functions on the GPU does not sufficiently speed up your code, or if you need to use advanced GPU CUDA^® features, you can write your own CUDA code and run it in MATLAB by an executable kernel using parallel.gpu.CUDAKernel.

`CUDAKernel` Workflow Overview

This page explains how to create an executable kernel from a CUDA C++ source file (CU) file and run that kernel on a GPU in MATLAB. The kernel is represented in MATLAB by a CUDAKernel object, which can operate on arrays stored in host memory or on GPU arrays.

The following steps describe the CUDAKernel general workflow:

Compile a parallel thread execution (PTX) file from a CU file using mexcuda. You do not need the CUDA Toolkit to compile a PTX file using mexcuda.
Before R2023a: Use the nvcc compiler in the NVIDIA^® CUDA Toolkit to compile a PTX file instead of the mexcuda function.
Use the parallel.gpu.CUDAKernel function to create a CUDAKernel object from the CU file and the PTX file. The CUDAKernel contains the GPU executable code.
Set the properties of the CUDAKernel to control its execution on the GPU.
Call feval on the CUDAKernel with the required inputs, to run the kernel on the GPU.

MATLAB code that follows these steps might look something like this:

% 1. Compile a PTX file.
mexcuda -ptx myfun.cu

% 2. Create CUDAKernel object.
k = parallel.gpu.CUDAKernel("myfun.ptx","myfun.cu");

% 3. Set object properties.
k.GridSize = [8 1];
k.ThreadBlockSize = [16 1];

% 4. Call feval with defined inputs.
g1 = gpuArray(in1); % Input gpuArray.
g2 = gpuArray(in2); % Input gpuArray.

result = feval(k,g1,g2);

The following sections provide details of these commands and workflow steps.

Create a `CUDAKernel` Object

If you have a CU file you want to execute on the GPU, you must first compile it to create a PTX file. To compile a PTX file, pass the CU file to the mexcuda with the -ptx flag.

mexcuda -ptx myfun.cu

This generates a PTX file myfun.ptx.

Create a CUDAKernel object using the CU and PTX files.

k = parallel.gpu.CUDAKernel("myfun.ptx","myfun.cu");

Note

You cannot save or load CUDAKernel objects.

`CUDAKernel` Object Properties

When you create a CUDAKernel object without a terminating semicolon, or when you type the object variable at the command line, MATLAB displays the kernel object properties.

k = parallel.gpu.CUDAKernel("conv.ptx","conv.cu")

k = 
  parallel.gpu.CUDAKernel handle
  Package: parallel.gpu

  Properties:
     ThreadBlockSize: [1 1 1]
  MaxThreadsPerBlock: 512
            GridSize: [1 1 1]
    SharedMemorySize: 0
          EntryPoint: '_Z8theEntryPf'
  MaxNumLHSArguments: 1
     NumRHSArguments: 2
       ArgumentTypes: {'in single vector'  'inout single vector'}

The properties of a CUDAKernel object control some of its execution behavior. Use dot notation to alter the properties that can be changed. For a descriptions of the object properties, see CUDAKernel. A typical reason for modifying the settable properties is to specify the number of threads, as described below.

Specify Entry Points

A single PTX file can contain multiple entry points to different kernels. Each of these entry points has a unique name. The name of each entry point is mangled (as in C++ mangling) but always contains the original function name from the CU file. For example, if the CU file defines the kernel function as

__global__ void simplestKernelEver( float * x, float val )

then the PTX code contains an entry that might be called _Z18simplestKernelEverPff.

When you have multiple entry points, specify the entry name for the particular kernel when calling parallel.gpu.CUDAKernel to generate your kernel.

k = parallel.gpu.CUDAKernel("myfun.ptx","myfun.cu","myKernel1");

Note

The parallel.gpu.CUDAKernel function searches for your entry name in the PTX file, and matches on any substring occurrences. Therefore, you should not name any of your entry points as substrings of any others.

You might not have control over the original entry names, in which case you must be aware of the unique mangled derived for each. For example, consider the following function template.

template <typename T>
__global__ void add4( T * v1, const T * v2 )
{
    int idx = threadIdx.x;
    v1[idx] += v2[idx];
}

When the template is expanded out for float and double, it results in two entry points, both of which contain the substring add4.

template __global__ void add4<float>(float *, const float *);
template __global__ void add4<double>(double *, const double *);

The PTX has corresponding entries:

_Z4add4IfEvPT_PKS0_
_Z4add4IdEvPT_PKS0_

Use entry point add4If for the float version, and add4Id for the double version.

k = parallel.gpu.CUDAKernel("test.ptx","double *, const double *","add4Id");

Specify Number of Threads

You specify the number of computational threads for your CUDAKernel by setting two of its object properties:

GridSize — A vector of three elements, the product of which determines the number of blocks.
ThreadBlockSize — A vector of three elements, the product of which determines the number of threads per block. The product cannot exceed the value of the MaxThreadsPerBlock property.

The default value for both of these properties is [1 1 1], but suppose you want to use 500 threads to run element-wise operations on vectors of 500 elements in parallel. Set the ThreadBlockSize such that the product of its elements is 500.

k = parallel.gpu.CUDAKernel("myfun.ptx","myfun.cu");
k.ThreadBlockSize = [500,1,1];

Generally, you set the grid and thread block sizes based on the sizes of your inputs. For information on thread hierarchy, and multiple-dimension grids and blocks, see the NVIDIA CUDA C Programming Guide.

Construct CUDAKernel Object with C Prototype Input. If you do not have the CU file corresponding to your PTX file, you can specify the C prototype for your C kernel instead of the CU file. For example:

k = parallel.gpu.CUDAKernel("myfun.ptx","float *, const float *, float");

Another use for the C prototype input is when your source code uses an unrecognized renaming of a supported data type. Suppose your kernel comprises the following code.

typedef float ArgType;
__global__ void add3( ArgType * v1, const ArgType * v2 )
{
    int idx = threadIdx.x;
    v1[idx] += v2[idx];
}

ArgType itself is not recognized as a supported data type, so the CU file that includes it cannot be directly used as input when creating the CUDAKernel object in MATLAB. However, the supported input types to the add3 kernel can be specified as C prototype input to the CUDAKernel constructor. For example:

k = parallel.gpu.CUDAKernel("test.ptx","float *, const float *","add3");

Supported Datatypes. The supported C/C++ standard data types are listed in the following table.

Float Types Integer Types Boolean and Character Types

Float Types	Integer Types	Boolean and Character Types
`double`, `double2` `float`, `float2`	`short`, `unsigned short`, `short2`, `ushort2` `int`, `unsigned int`, `int2`, `uint2` `long`, `unsigned long`, `long2`, `ulong2` `long long`, `unsigned long long`, `longlong2`, `ulonglong2` `ptrdiff_t`, `size_t`	`bool` `char`, `unsigned char`, `char2`, `uchar2`

double, double2

float, float2

short, unsigned short, short2, ushort2

int, unsigned int, int2, uint2

long, unsigned long, long2, ulong2

long long, unsigned long long, longlong2, ulonglong2

ptrdiff_t, size_t

bool

char, unsigned char, char2, uchar2

Also, the following integer types are supported when you include the tmwtypes.h header file in your program.

Integer Types

Integer Types
`int8_T`, `int16_T`, `int32_T`, `int64_T` `uint8_T`, `uint16_T`, `uint32_T`, `uint64_T`

int8_T, int16_T, int32_T, int64_T

uint8_T, uint16_T, uint32_T, uint64_T

The header file is shipped as matlabroot/extern/include/tmwtypes.h. You include the file in your program with the line:

#include "tmwtypes.h"

Argument Restrictions. All inputs can be scalars or pointers, and can be labeled as constant values using const.

The C declaration of a kernel is always of the form:

__global__ void aKernel(inputs ...)

The kernel must return nothing, and operate only on its input arguments (scalars or pointers).
The kernel is unable to allocate any form of memory, so all outputs must be pre-allocated before the kernel is executed. Therefore, the sizes of all outputs must be known before you run the kernel.
In principle, all pointers passed into the kernel that are not labelled const could contain output data, since the many threads of the kernel could modify that data.

When translating the definition of a kernel in C into MATLAB:

All scalar inputs in C (double, float, int, etc.) must be scalars in MATLAB, or scalar (i.e., single-element) gpuArray variables.
All constant pointer inputs in C (const double *, etc.) can be scalars or matrices in MATLAB. They are cast to the correct type, copied onto the device, and a pointer to the first element is passed to the kernel. No information about the original size is passed to the kernel. It is as though the kernel has directly received the result of mxGetData on an mxArray.
All non-constant pointer inputs in C are transferred to the kernel exactly as non-constant pointers. However, because a non-constant pointer could be changed by the kernel, this will be considered as an output from the kernel.
Inputs from MATLAB workspace scalars and arrays are cast into the requested type and then passed to the kernel. However, gpuArray inputs are not automatically cast, so their type and complexity must exactly match those expected.

These rules have some implications. The most notable is that every output from a kernel must necessarily also be an input to the kernel, since the input allows the user to define the size of the output (which follows from being unable to allocate memory on the GPU).

Run a `CUDAKernel`

Determine Input and Output Correspondence

Use the feval function to evaluate a CUDAKernel on the GPU.

Assume that you have already written some kernels and want to use them in MATLAB to execute on the GPU. You have a kernel that does a convolution on two vectors; load and run it with two random input vectors.

k = parallel.gpu.CUDAKernel("conv.ptx","conv.cu");

result = feval(k,rand(100,1),rand(100,1));

The output is a gpuArray even if the inputs are not. However, it might be more efficient to use gpuArray objects as input when running a kernel.

k = parallel.gpu.CUDAKernel("conv.ptx","conv.cu");

i1 = rand(100,1,"single","gpuArray");
i2 = rand(100,1,"single","gpuArray");

result1 = feval(k,i1,i2);

Because the output is a gpuArray, you can now perform other operations using this input or output data without further transfers between GPU memory and host memory.

Determine Input and Output Correspondence

When calling [out1, out2] = feval(kernel,in1,in2,in3), the inputs in1, in2, and in3 correspond to each of the input arguments to the function within your CU file. The outputs out1 and out2 store the values of the first and second non-constant pointer input arguments to the function after the kernel has been executed.

For example, if the kernel within a CU file has the following signature:

void reallySimple( float * pInOut, float c )

the corresponding kernel object (k) in MATLAB has the following properties:

MaxNumLHSArguments: 1
   NumRHSArguments: 2
     ArgumentTypes: {'inout single vector'  'in single scalar'}

Therefore, to use the kernel object from this code with feval, you need to provide feval two input arguments (in addition to the kernel object), and you can use one output argument.

y = feval(k,x1,x2)

The input values x1 and x2 correspond to pInOut and c in the function prototype. The output argument y corresponds to the value of pInOut in the function prototype after the kernel has executed.

The following is a slightly more complicated example that shows a combination of constant and non-constant pointers:

void moreComplicated( const float * pIn, float * pInOut1, float * pInOut2 )

The corresponding kernel object in MATLAB then has the properties:

MaxNumLHSArguments: 2
   NumRHSArguments: 3
     ArgumentTypes: {'in single vector'  'inout single vector'  'inout single vector'}

You can use feval on this code's kernel (k) using three input arguments and two output arguments.

[y1,y2] = feval(k,x1,x2,x3)

The three input arguments x1, x2, and x3, correspond to the three arguments that are passed into the function. The output arguments y1 and y2, correspond to the values of pInOut1 and pInOut2 after the kernel has executed.

Add Two Numbers

This example adds two doubles together in the GPU.

The CU code to do this is as follows.
```
__global__ void add1( double * a, double b ) 
{
    *a += b;
}
```
The directive __global__ indicates that this is an entry point to a kernel. The code uses a pointer to send out the result in a, which is both an input and an output. Save this code in a file called test.cu in the current directory.
Compile the CU code using mexcuda to generate a PTX file called test.ptx.
```
mexcuda -ptx test.cu
```
Create the kernel in MATLAB. Currently this PTX file only has one entry so you do not need to specify it. If the PTX file contained more than one kernel entry point, you would specify add1 as the entry point.
```
k = parallel.gpu.CUDAKernel("test.ptx","test.cu");
```
Run the kernel with two numeric inputs. By default, a kernel runs on one thread.
```
result = feval(k,2,3)
```
```
result = 
    5
```

Add Two Vectors

This example extends the previous one to add two vectors together. For simplicity, assume that there are exactly the same number of threads as elements in the vectors and that there is only one thread block.

The CU code is slightly different from the last example. Both inputs are pointers, and one is constant because you are not changing it. Each thread will simply add the elements at its thread index. The thread index must work out which element this thread should add. Getting these thread- and block-specific values is a very common pattern in CUDA programming.
```
__global__ void add2( double * v1, const double * v2 ) 
{
    int idx = threadIdx.x;
    v1[idx] += v2[idx];
}
```
Save this code in the file test.cu.
Compile a PTX file called test.ptx from the CU file using mexcuda.
```
mexcuda -ptx test.cu
```
If this code was put in the same CU file along with the code of the first example, specify the entry point name this time to distinguish it.
```
k = parallel.gpu.CUDAKernel("test.ptx","test.cu","add2");
```

Before you run the kernel, set the number of threads correctly for the vectors you want to add.

N = 128;
k.ThreadBlockSize = N;
in1 = ones(N,1,"gpuArray");
in2 = ones(N,1,"gpuArray");
result = feval(k,in1,in2);

Example with CU and PTX Files

For an example that shows how to work with CUDA, and provides CU and PTX files for you to experiment with, see Illustrating Three Approaches to GPU Computing: The Mandelbrot Set.

Run CUDA or PTX Code on GPU

`CUDAKernel` Workflow Overview

Create a `CUDAKernel` Object

`CUDAKernel` Object Properties

Specify Entry Points

Specify Number of Threads

Run a `CUDAKernel`

Determine Input and Output Correspondence

Complete Kernel Workflow

Add Two Numbers

Add Two Vectors

Example with CU and PTX Files

See Also

Topics

Run CUDA or PTX Code on GPU

CUDAKernel Workflow Overview

Create a CUDAKernel Object

CUDAKernel Object Properties

Specify Entry Points

Specify Number of Threads

Run a CUDAKernel

Determine Input and Output Correspondence

Complete Kernel Workflow

Add Two Numbers

Add Two Vectors

Example with CU and PTX Files

See Also

Topics

`CUDAKernel` Workflow Overview

Create a `CUDAKernel` Object

`CUDAKernel` Object Properties

Run a `CUDAKernel`