Speed of Looped GPU operations with varying input sizes

Question

D. Plotnick on 19 Oct 2017

Direct link to this question

https://es.mathworks.com/matlabcentral/answers/362237-speed-of-looped-gpu-operations-with-varying-input-sizes

Hello all, this is a reformulated post, related to this one, in which I am trying to understand certain speed effects I observe in looped GPU operations, how to write better code for the GPU, and how to interpret timing measurements.

In the example below, I loop 100 times over a numerical operation. I found that if the size of the input arrays stays constant, the operation is very fast. However, if the size of the inputs changes on each iteration, I see massive slowdowns. Hopefully the code will make this clearer.

We start by running our loop on the CPU with fixed sizes and measure the timing.

% Initial parameters
M = 1000;
N = 1000;
its = 100; 
% CPU operation, fixed-size inputs.
tic
B = zeros(its,1); % Initialize the output matrix
for ii = 1:its
    A1 = rand(M,N); % random input 1
    A2 = rand(M,N); % random input 2 
    T = A1.*exp(1i*A2); % numerical operation 1
    P = max(T(:)); % numerical operation 2
    B(ii) = P; % output value
end
toc

Elapsed time is 4.284955 seconds.

OK, now I will run the same operations on the GPU:

% GPU operation, fixed size inputs. 
wait(gpuDevice);
tic
B = gpuArray.zeros(its,1); % Initialize array
for ii = 1:its
  A1 = gpuArray.rand(M,N);% random input 1 on gpu
  A2 = gpuArray.rand(M,N);% random input 2 on gpu
  T = A1.*exp(1i*A2); % numerical operation 1
  P = max(T(:)); % numerical operation 2
  B(ii) = P; % output value
end
wait(gpuDevice);
toc

Elapsed time is 0.171926 seconds.

Great! However, the sizes of the input matrices A1 and A2 may change from one iteration to the next, and this is where I see some weirdness.

% Set up some changing sizes: random shifts in the size of A1 and A2
Mshifts = round(200*rand(its,1))-100; % Change 1
Nshifts = round(200*rand(its,1))-100; % Change 2
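As a sanity check on these shifts, the short sketch below prints the range of array sizes they produce and the average number of elements per iteration, to confirm that the variable-size loops do roughly the same amount of work as the fixed-size ones. The names rowSizes, colSizes, and numelPerIter are introduced here only for illustration.

```matlab
% Size check for the random shifts (CPU only, no toolbox needed).
M = 1000;
N = 1000;
its = 100;
Mshifts = round(200*rand(its,1)) - 100; % integer shifts between -100 and 100
Nshifts = round(200*rand(its,1)) - 100;

rowSizes = M + Mshifts;              % row count used in each iteration
colSizes = N + Nshifts;              % column count used in each iteration
numelPerIter = rowSizes .* colSizes; % elements processed in each iteration

fprintf('rows: %d to %d, cols: %d to %d\n', ...
    min(rowSizes), max(rowSizes), min(colSizes), max(colSizes));
fprintf('mean elements per iteration: %.0f (fixed-size case: %d)\n', ...
    mean(numelPerIter), M*N);
```

The sizes always stay between 900 and 1100 in each dimension, so no single iteration is more than about 21% larger than the fixed-size case.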

Now run it on the CPU with these changing sizes.

% CPU operation, variable-size inputs.
tic
B = zeros(its,1); % Initialize the output matrix
for ii = 1:its
    A1 = rand(M+Mshifts(ii),N+Nshifts(ii)); % random input 1
    A2 = rand(M+Mshifts(ii),N+Nshifts(ii)); % random input 2 
    T = A1.*exp(1i*A2); % numerical operation 1
    P = max(T(:)); % numerical operation 2
    B(ii) = P; % output value
end
toc

Elapsed time is 4.583695 seconds.

This is about the same as before, which makes sense given that the arrays are sometimes larger and sometimes smaller.

HOWEVER, when I run it on the GPU:

% GPU operation, variable-size inputs.
wait(gpuDevice);
tic
B = gpuArray.zeros(its,1); % Initialize array
for ii = 1:its
    A1 = gpuArray.rand(M+Mshifts(ii),N+Nshifts(ii));% random input 1 on gpu
    A2 = gpuArray.rand(M+Mshifts(ii),N+Nshifts(ii));% random input 2 on gpu
    T = A1.*exp(1i*A2); % numerical operation 1
    P = max(T(:)); % numerical operation 2
    B(ii) = P; % output value
end
wait(gpuDevice);
toc

Elapsed time is 1.143043 seconds.

That is roughly 6.6x slower than the fixed-size GPU run! I know there are subtleties in measuring speed on the GPU, but the important point is that the loop does not finish until 1.14 seconds have elapsed, so if this is part of a larger program, execution will not continue until then.
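On the measurement side, an alternative to tic/toc with wait(gpuDevice) is gputimeit from the Parallel Computing Toolbox, which takes a function handle and handles GPU synchronization itself. Below is a minimal sketch of how I might compare the two cases that way. The handles oneIter, fixedLoop, and variableLoop are names made up for this sketch, and the loop is rewritten with arrayfun over the iteration index purely so that it fits in a function handle. Because gputimeit calls the function several times, its result may not match a single tic/toc run.

```matlab
% Requires Parallel Computing Toolbox and a supported GPU.
M = 1000;
N = 1000;
its = 100;
Mshifts = round(200*rand(its,1)) - 100;
Nshifts = round(200*rand(its,1)) - 100;

% One loop iteration as a single expression: two random inputs on the GPU,
% the element-wise operation, then the maximum over all elements.
oneIter = @(m, n) max(reshape( ...
    rand(m, n, 'gpuArray') .* exp(1i*rand(m, n, 'gpuArray')), [], 1));

% The full loops, written with arrayfun over the iteration index.
% 'UniformOutput', false keeps each result as a gpuArray scalar in a cell.
fixedLoop = @() arrayfun(@(k) oneIter(M, N), ...
    1:its, 'UniformOutput', false);
variableLoop = @() arrayfun(@(k) oneIter(M + Mshifts(k), N + Nshifts(k)), ...
    1:its, 'UniformOutput', false);

tFixed = gputimeit(fixedLoop);       % seconds for the fixed-size loop
tVariable = gputimeit(variableLoop); % seconds for the variable-size loop
fprintf('fixed: %.3f s, variable: %.3f s, ratio: %.1fx\n', ...
    tFixed, tVariable, tVariable/tFixed);
```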

So, my question: why does performance drop when A1 and A2 are not of constant size, and can I avoid it? Is there a clever way of pre-allocating, or of using anonymous functions (I tried the latter with no real gain)?
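To make the pre-allocation idea concrete, one variant I can imagine is to always generate arrays of the largest possible size and mask out the unused part, so that every GPU array in the loop has the same dimensions. This is only a sketch, and I have not confirmed that it avoids the slowdown. Mmax, Nmax, rowIdx, colIdx, and mask are names introduced here for illustration. It assumes implicit expansion (R2016b or later) and relies on max comparing complex values by magnitude, so the zero padding should never be selected. Note that it does more arithmetic per iteration than the variable-size loop, since it always processes the full Mmax-by-Nmax array.

```matlab
% Requires Parallel Computing Toolbox and a supported GPU.
M = 1000;
N = 1000;
its = 100;
Mshifts = round(200*rand(its,1)) - 100;
Nshifts = round(200*rand(its,1)) - 100;

Mmax = M + 100;                 % largest row count the shifts can produce
Nmax = N + 100;                 % largest column count the shifts can produce
rowIdx = gpuArray((1:Mmax).');  % Mmax-by-1 row indices, created once
colIdx = gpuArray(1:Nmax);      % 1-by-Nmax column indices, created once

wait(gpuDevice);
tic
B = gpuArray.zeros(its,1); % Initialize array
for ii = 1:its
    m = M + Mshifts(ii);   % active rows this iteration
    n = N + Nshifts(ii);   % active columns this iteration
    % Mmax-by-Nmax logical mask, true only inside the active m-by-n block
    mask = (rowIdx <= m) & (colIdx <= n);
    A1 = gpuArray.rand(Mmax,Nmax); % random input 1, always full size
    A2 = gpuArray.rand(Mmax,Nmax); % random input 2, always full size
    T = mask .* A1 .* exp(1i*A2);  % padded entries become zero
    P = max(T(:));                 % zero-magnitude padding cannot be the max
    B(ii) = P;                     % output value
end
wait(gpuDevice);
toc
```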

Thanks y'all.