Is it possible to code a function like movsum to perform as fast as the diff function when using gpu arrays?

I'm attempting to find the fastest way to average several points on 3D data. In other parts of a program I'm writing, I've used the diff function, which is much faster than hand-coding the difference between adjacent points along a dimension.
Since movsum appears to be closely related to diff, I thought it would be just as fast, but it isn't, at least not when using a gpuArray. Perhaps I need to understand how diff is written so I can write my own MEX file to speed up the averaging process? Or would I need to understand both how CUDA is implemented and how MEX files work? Here is how I timed the performance of movsum and diff with gpuArrays using the "Run and Time" button.
%% Compare movsum to diff
clear all
[Nx,Ny,Nz] = deal(150);
A = rand(Nx,Ny,Nz);
A = gpuArray(A);
dev = gpuDevice();
for i = 1:100
    % Adjacent sums via movsum, cropped to match the sliced version
    A0 = movsum(A,2,3);
    A0 = A0(:,:,2:Nz);
    wait(dev);
    % Adjacent sums via explicit indexing
    A1 = A(:,:,1:Nz-1) + A(:,:,2:Nz);
    wait(dev);
end
for i = 1:100
    % Adjacent differences via diff
    A2 = diff(A,1,3);
    wait(dev);
    % Adjacent differences via explicit indexing
    A3 = A(:,:,2:Nz) - A(:,:,1:Nz-1);
    wait(dev);
end
As for what I'd ultimately like to do: here is the code I'm using to average points on 3D data, and how the same process can be done using movsum. Using the "Run and Time" button, there is a large difference in performance between the two. Since diff is more computationally efficient than hard-coding differences, surely there must be a way to gain that efficiency when calculating adjacent sums, yes?
[Nx,Ny,Nz] = deal(100);
A = rand(Nx,Ny,Nz);
A = gpuArray(A);
dev = gpuDevice();
for i = 1:100
    %% Code I'm using to average points
    A1 = (A(1:Nx-1,:,1:Nz-1) + ...
          A(1:Nx-1,:,2:Nz)   + ...
          A(2:Nx  ,:,1:Nz-1) + ...
          A(2:Nx  ,:,2:Nz))*0.25;
    wait(dev);
    %% Idea I would like to use to speed up the code
    A0 = movsum(movsum(A,2,3),2,1)*0.25;
    A0 = A0(2:Nx,:,2:Nz);
    wait(dev);
end
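A quick sanity check (using the last iteration's A0 and A1) that the movsum formulation really matches the hard-coded average; the two should agree to floating-point round-off, since only the order of the four additions differs:
% Expect something on the order of 1e-16 for double-precision data
maxErr = max(abs(A0(:) - A1(:)))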
If movsum performed similarly to diff when using gpuArray, then I believe I could speed up my code significantly. Is this a situation where I would need to learn CUDA and write my own MEX file?
2 Comments
Matt J on 22 Dec 2021
Edited: Matt J on 22 Dec 2021
I am uncertain about your method of timing. By my measurements (on a GTX 1080 Ti), your current method is only about a factor of 2 away from the speed offered by diff().
T = 60;
[Nx,Ny,Nz] = deal(150);
A = gpuArray.rand(Nx,Ny,Nz);
dev = gpuDevice();

tic;
for i = 1:T
    % Moving sums along dims 3 and 1, discarding the endpoints
    B0 = movsum(movsum(A,2,3,'Endpoints','discard'),2,1,'Endpoints','discard')*0.25;
end
wait(dev);
toc; % Elapsed time is 0.270494 seconds.

tic;
e = reshape(ones(2),2,1,2)/4;  % 2x1x2 averaging kernel
for i = 1:T
    B1 = convn(A,e,'valid');
end
wait(dev);
toc; % Elapsed time is 0.152638 seconds.

tic;
for i = 1:T
    B2 = (A(1:Nx-1,:,1:Nz-1) + ...
          A(1:Nx-1,:,2:Nz)   + ...
          A(2:Nx  ,:,1:Nz-1) + ...
          A(2:Nx  ,:,2:Nz))*0.25;
end
wait(dev);
toc; % Elapsed time is 0.042915 seconds.

tic;
for i = 1:T
    D = diff(diff(A,1,1),1,3);
end
wait(dev);
toc; % Elapsed time is 0.019765 seconds.
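As an aside on timing methodology: gputimeit is the tool the Parallel Computing Toolbox documents for this. It synchronizes the GPU and averages several runs internally, so the wait(dev) and loop bookkeeping go away. A minimal sketch:
A = gpuArray.rand(150,150,150);
% gputimeit handles warm-up, synchronization, and repetition itself
tMovsum = gputimeit(@() movsum(A,2,3,'Endpoints','discard'));
tDiff   = gputimeit(@() diff(A,1,3));
fprintf('movsum: %.3g s, diff: %.3g s\n', tMovsum, tDiff);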
Nathan Zechar on 23 Dec 2021
Interesting, I'll have to dive into what you did with reshape. I believe you've answered other questions using the reshape approach. You're certainly a master at it! I've tried implementing it in other portions of my code; sometimes it works faster, sometimes it's just a fun thought exercise.
And yes, as you can see, diff is just blazingly fast. It might offer better than 2x performance as 3D matrices become larger; I would have to test that, though.
As for the uncertainty in my method of timing... I was looking at other posts focused on timing GPU computations. I hope I'm doing it correctly by adding wait(dev) in the code, as I thought that's what was needed for the most accurate results. If there is a better way to time this, or a standard I should be following, I'm ready to learn.


Answers (1)

Matt J on 22 Dec 2021
Edited: Matt J on 22 Dec 2021
Perhaps then I need to understand how diff is written so I can write my own MEX file to speed up the averaging process?
I'm not a MathWorks developer, so this is all speculation, but it stands to reason that accelerating diff() on the GPU is easier than accelerating routines like movsum. This is because each thread always requires exactly 2 pieces of data: the CUDA kernel threads can read all elements of A(1:Nx-1,:,:) from GPU global memory in one synchronized step, and then all elements of A(2:Nx,:,:) in a second synchronized step. No memory sharing is required between the threads, all data access is embarrassingly parallel and naturally coalesced, and the memory consumption per thread is so small that the fastest local registers can be used.
Conversely, for movsum and conv, the code needs to handle moving windows of arbitrary sizes, so the parallelization strategy needs to be more complicated (and hence slower).
You could almost certainly write your own optimized movsum CUDA kernel using the same principles as diff(), as long as, like diff, the moving window size is always 2. Or at least, as long as the window size is small, fixed, and known a priori. A way to prototype that idea without leaving MATLAB is sketched below.
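One route that avoids writing CUDA or MEX code entirely is gpuArray arrayfun with up-level indexing, the pattern used in the "Stencil Operations on a GPU" example in the Parallel Computing Toolbox documentation. arrayfun compiles the nested function into a single fused GPU kernel, so each thread reads its fixed set of neighbors directly, in the same fixed-window spirit as diff. A minimal sketch (avg4 is my own name for it, and whether it actually beats the sliced-indexing version would have to be measured):
function B = avg4(A)
% A is a gpuArray; B(i,j,k) is the mean of the 2x1x2 window at (i,j,k).
[Nx,Ny,Nz] = size(A);
[I,J,K] = ndgrid(gpuArray(1:Nx-1), gpuArray(1:Ny), gpuArray(1:Nz-1));
B = arrayfun(@stencil, I, J, K);   % runs as one fused GPU kernel
    function b = stencil(i,j,k)
        % A is read from the parent workspace (up-level indexing)
        b = 0.25*(A(i,j,k) + A(i,j,k+1) + A(i+1,j,k) + A(i+1,j,k+1));
    end
end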
1 Comment
Nathan Zechar on 23 Dec 2021
Yeah, I envision it should be as simple as changing a plus sign to a minus sign to mimic exactly what diff() is doing. If I knew how diff() was coded, I could make much more of my code run much faster using whatever principles it relies on.

