Kernel Analysis

For GPU code generation, the primary mechanism for creating CUDA® kernels is for-loops. The way you write loops in your MATLAB® code has a significant impact on the number of kernels created and on the performance of the generated code. When you generate GPU code, check the diagnostic report to see whether any loop segment has a "Loop not parallelized" notice. Calls to MATLAB functions in your code may also contain for-loops that carry these notices. For maximum performance, ensure that compute-intensive loop segments in your code are mapped to kernels and executed in parallel. The following recommendations help you achieve this goal and generate efficient CUDA kernels.

Mapping Nested Loops to Kernels


Consider a function that has nested for-loops.

function y = foo(x)
 for i1 = 1:N1
  for i2 = 1:N2
   for i3 = 1:N3
    for i4 = 1:N4
     % ... loop body ...
    end
   end
  end
 end

Assume that one of the intermediate loops, i3, is not parallelizable. When GPU Coder™ performs loop analysis to create kernels, it considers only the outermost parallel loops i1 and i2 and creates a kernel with the outer loop dimensions N1,N2. The loops i3 and i4 are within the kernel body and are executed sequentially. However, if the innermost loop i4 has a large iteration count, then better performance may be achieved by creating kernels for the innermost loop.


There are three ways in which you can parallelize the innermost loop:

  • Rewrite the code so that the innermost code segment is not within a nested loop.

  • If the iteration count of the outer loop is small, wrap the loop range in a call to coder.unroll. This function unrolls the for-loop by making a copy of the loop body for each loop iteration. For more information, see coder.unroll.

    function y = foo(x)
     for i1 = coder.unroll(1:N1)

  • Make the outer loop bound dynamic, for example by passing it in as a function input. Parallel loop analysis then fails on the outer loop but still succeeds on the inner loops.

    function y = foo(x,N1)
     for i1 = 1:N1
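
As a hedged illustration of the unrolling option, the sketch below uses a hypothetical function relaxUnrolled (the name, the 3-by-N input shape, and the arithmetic are assumptions, not part of the original example). The small outer loop is unrolled, the middle loop is sequential because each pass reads the previous result, and the large innermost loop has independent iterations. Whether GPU Coder actually creates a kernel for the innermost loop should be confirmed in the code generation report.

```matlab
function y = relaxUnrolled(x, nIter)
% Hypothetical example. Assumes x is 3-by-N with N large.
%   c: small outer loop, unrolled into three copies of its body
%   k: sequential loop, each pass reads the result of the previous pass
%   i: large innermost loop with independent iterations
    y = zeros(size(x), 'like', x);
    for c = coder.unroll(1:3)
        for k = 1:nIter
            for i = 1:size(x, 2)
                y(c, i) = 0.5 * y(c, i) + x(c, i);
            end
        end
    end
end
```

For the dynamic-bound option, the same sketch would take the outer bound as an extra input (for example a hypothetical nC) and use for c = 1:nC in place of the unrolled header. In plain MATLAB execution coder.unroll has no effect, so the function can be checked for correctness before generating code.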

For-Loops with Break


Loops with break are not supported.

while (i < N)
    % ... loop body ...
    if (cond2), break; end
end


Remove breaks by creating a guard variable and conditional.

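cond = true;
while (i < N)
    if (cond)   % ... loop body, executed only while the guard holds ...
        if (cond2), cond = false; end   % takes the place of break
    end
end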
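
The guard pattern can be sketched on a complete function. The example below is a hypothetical illustration (the function name sumUntilNegative and its behavior are assumptions chosen for the sketch): it sums the leading nonnegative elements of a vector and uses a guard variable where a break would otherwise end the loop early. Removing the break does not by itself guarantee a kernel, because the loop must still pass dependence analysis.

```matlab
function [s, n] = sumUntilNegative(x)
% Hypothetical example: sum the elements of x that precede the first
% negative value, and count how many were summed.
    s = 0;
    n = 0;
    keepGoing = true;                  % guard variable
    for i = 1:numel(x)
        if keepGoing
            if x(i) < 0
                keepGoing = false;     % takes the place of break
            else
                s = s + x(i);
                n = n + 1;
            end
        end
    end
end
```

The loop always runs all numel(x) iterations. Once the guard is cleared, the remaining iterations do no work.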

Dependence Analysis Parallel Loop Check Fails


Kernel extraction uses parallel loop dependence analysis. There are cases where loop dependence analysis cannot detect that a for-loop is parallel. The coder.gpu.kernel pragma allows GPU Coder to override dependence analysis and force kernel creation. The caveat is that you must be sure the loop is a "for-all" loop without inter-iteration dependencies.


Use the coder.gpu.kernel pragma explicitly on each of your for-loops.
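
A minimal sketch of the pragma, assuming GPU Coder is installed so that coder.gpu.kernel resolves: the hypothetical function scatterScaled (name and arguments are illustrative) writes through an index vector. The analysis cannot know that perm contains no repeated indices, so the loop is a plausible candidate for a failed parallel check. The pragma is only safe here under the stated assumption that perm is a permutation.

```matlab
function y = scatterScaled(x, perm, a)
% Hypothetical example. Assumes perm is a permutation of 1:numel(x), so no
% two iterations write the same element of y (no inter-iteration dependence).
    y = zeros(size(x), 'like', x);
    coder.gpu.kernel();                % placed immediately before the for-loop
    for i = 1:numel(x)
        y(perm(i)) = a * x(i);
    end
end
```

If perm could contain duplicates, forcing kernel creation this way would introduce a race on y, which is why the responsibility for the "for-all" property rests with the caller.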

Logical Indexing of Arrays


GPU Coder may not create kernels when logical indexing is used for accessing array elements.

i = (mag ~= 0);
vx(i) = vx(i)./mag(i);
vy(i) = vy(i)./mag(i); 


Rewrite the code by using a loop body and guarding with an appropriate conditional.

for i = 1:numel(mag)
    if (mag(i) ~= 0)
        vx(i) = vx(i)./mag(i);
        vy(i) = vy(i)./mag(i);
    end
end

Unsupported Functions


Use of unsupported functions, coder pragmas, toolbox functions, and so on inside a loop prevents that loop from becoming a kernel.


Try rewriting unsupported functions using pure MATLAB.

Loop Interchange


If the smaller loops in a loop nest are the outermost loops, then a kernel could be created with just a subset of the loops in the nest. If the algorithm allows it, always put the largest loops outermost.


Rewrite loop nesting with larger loops as outer loops.
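
As a hedged sketch of an interchanged loop nest, the hypothetical function addBiasInterchanged below (the name and the M-by-C input shape are assumptions) iterates over the large dimension in the outer loop and the small dimension in the inner loop. The interchange is valid here because each element of y depends only on the matching element of x and on b.

```matlab
function y = addBiasInterchanged(x, b)
% Hypothetical example. Assumes x is M-by-C with M much larger than C,
% and b has C elements.
    [M, C] = size(x);
    y = zeros(M, C, 'like', x);
    for i = 1:M              % large loop outermost
        for c = 1:C          % small loop innermost
            y(i, c) = x(i, c) + b(c);
        end
    end
end
```

The original ordering would have for c = 1:C on the outside. Both orderings compute the same result, so the rewrite can be checked against the original in MATLAB before generating code.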
