GPU speed up for pcg() is disappointing

Question

Dan R on 7 Sep 2022


https://es.mathworks.com/matlabcentral/answers/1798620-gpu-speed-up-for-pcg-is-disappointing

Commented: Dan R on 13 Sep 2022

I am using pcg() to solve x = A\b, where b has about 2e6 elements. I am using R2021b on Ubuntu 20.04 with a Ryzen 9 5950X CPU and an NVIDIA A4000 GPU.

Running this code...

tol = 1e-4;
%solve on the CPU
tic
L = ichol(A, struct('michol','on'));
x = pcg(A, b, tol, 5000, L, L');
fprintf('solve time: %0.4g s\n', toc);
%solve on the GPU
tic
x = pcg(gpuArray(A), b, tol, 50000);
fprintf('solve time: %0.4g s\n', toc);

...gives

pcg converged at iteration 347 to a solution with relative residual 9.6e-05.

solve time: 16.24 s

pcg converged at iteration 7281 to a solution with relative residual 9.9e-05.

solve time: 13.91 s

So, I am seeing a small GPU speed up (nice, but not very exciting). The fact that the GPU takes 20x the iterations of the CPU makes me think there could be a >20x speed up possible, which would be much more exciting.

The L arguments make a big difference to the CPU speed, but I can't get the GPU version to take it. Doing

x = pcg(gpuArray(A), b, tol, 5000, L, L');

throws "Error using gpuArray/pcg (line 58) When the first input argument is a sparse matrix, the second preconditioner cannot be a matrix. Use functions for both preconditioners, or multiply the precondition matrices". Doing

x = pcg(gpuArray(A), b, tol, 50000, L*L');

seems to hang MATLAB (after waiting 5 minutes I gave up and had to terminate by restarting MATLAB; Ctrl-C did nothing).

Can anyone tell me what is going on here? Is it simply that the GPU version is using a different algorithm and so it makes no sense to compare number of iterations (and so the 15% speed up I see is all I should hope for)? Or is it that I need a different preconditioning approach?

I see there is a possible bug related to this: https://uk.mathworks.com/support/bugreports/details/2534618.


Answer 1

Joss Knight on 11 Sep 2022


https://es.mathworks.com/matlabcentral/answers/1798620-gpu-speed-up-for-pcg-is-disappointing#answer_1051385

I'm guessing L*L' is extremely dense, which would explain why the solver stalls. On the GPU the preconditioning is (currently) performed using ILU, which, like most sparse operations, should be passed a sufficiently sparse matrix. Try just passing A as the preconditioner and you may get a better result. Also, try the other solvers (CGS, GMRES, LSQR, BICG etc).

The solvers on the GPU work differently because a sparse direct solve does not parallelize well, which is why sparse backslash (\) is generally slow, mostly because of the amount of memory needed. That doesn't explain why the solvers do not accept two triangular sparse matrices as preconditioner input; that is something that should be rectified. But the ILU should have much the same effect as ICHOL does.

I thought I'd be telling you that the GPU is slow because your card isn't very fast for double precision (only 599 GFLOPS). But actually you're doing 20 times the iterations in less time, so it seems you're right: if you hit the right combination of solver and preconditioner, there's a good chance you'll get to the result much faster.
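A quick way to test the guess that L*L' is much denser than A is to compare nonzero counts. The sketch below uses a small stand-in matrix, delsq(numgrid('S', 34)), a 2-D finite-difference Laplacian chosen here purely as an assumption (it is not the poster's A), so the ratio it prints only illustrates the check.

```matlab
% Stand-in SPD test matrix (assumption: NOT the matrix from the question).
A = delsq(numgrid('S', 34));              % 1024-by-1024 sparse 2-D Laplacian

% Same incomplete Cholesky options as in the question.
L = ichol(A, struct('michol', 'on'));

% Form the product that was passed to the GPU solver and compare sparsity.
M = L * L';
ratio = nnz(M) / nnz(A);

fprintf('nnz(A) = %d, nnz(L*L'') = %d, ratio = %.2f\n', nnz(A), nnz(M), ratio);
```

A ratio close to 1 would suggest that density alone does not explain a stall; running the same three lines on the real A gives the number that matters.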


Dan R on 12 Sep 2022

Edited: Dan R on 12 Sep 2022

Thanks for the explanation and I can see the rationale... but it seems L*L' is not much denser than A (see pics below).

Interestingly, doing the following on the CPU:

x = pcg(A, b, tol, 5000, A);

I get

pcg converged at iteration 1 to a solution with relative residual 1.4e-12.

To me, this looks like pcg() is actually choosing to do x=A\b (it takes exactly the same amount of time as A\b). I guess this is some sort of heuristic internal to pcg()?

IIRC I did try this same line on the GPU before and it stalled (I'm not at work so can't try the GPU until tomorrow). If pcg() also tries to do A\b directly on the GPU, then this would explain the stall, given that backslash does not parallelise well.

I'll explore the other solvers tomorrow, but I was under the impression that pcg() was the one to go for with symmetric positive definite problems? (This is certainly what I took from the MATLAB docs anyway.)

This is a standard finite difference problem, so it must be a typical use case for pcg() and therefore a good target for a fix in a future MATLAB release :-). I'm very happy to share A and b if this would help!
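The one-iteration result reported above does not need a special heuristic to explain it. With M = A the preconditioned system inv(M)*A*x = inv(M)*b is the identity, so conjugate gradients reaches the solution in a single step, and that step applies the preconditioner, which amounts to a solve with A. The sketch below reproduces this on a small stand-in matrix (an assumption; it is not the poster's A).

```matlab
% Stand-in SPD system (assumption: small 2-D Laplacian, not the real A).
A = delsq(numgrid('S', 34));
b = ones(size(A, 1), 1);

% Use A itself as the preconditioner, as in the comment above.
[x, flag, relres, iter] = pcg(A, b, 1e-4, 100, A);

% Compare against a direct solve.
xDirect = A \ b;
fprintf('flag = %d, iterations = %d, relres = %.2g\n', flag, iter, relres);
fprintf('difference from A\\b: %.2g\n', norm(x - xDirect) / norm(xDirect));
```

This is why the timing matches A\b: all the work sits in the preconditioner solve, so using A as its own preconditioner gives no advantage over backslash on the CPU.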

Joss Knight on 12 Sep 2022

Hi Dan. I'll try to remember to update this thread when it's done.

I'm hoping the speed improvement will be available next year, support for two triangular preconditioners later than that. Single precision sparse...unknown. As you can imagine, because of our quality requirements, even if I make a change now you would not see it for many months.

Dan R on 13 Sep 2022

Thank you Joss. I'll keep checking the release notes.

Bruno, thanks for the pointer to SuiteSparse. I will take a look.


Answer 2

Yair Altman on 8 Sep 2022


https://es.mathworks.com/matlabcentral/answers/1798620-gpu-speed-up-for-pcg-is-disappointing#answer_1048675


The gpuArray version of pcg has not been updated since 2018, so it is somewhat lagging compared to the CPU version. Preconditioner input for sparse input is only supported for a single preconditioner matrix, not two as in the CPU case. Refer to the help of the GPU version (type help('gpuArray/pcg') or edit(...) for details).

You could try the GPU version with a dense (non-sparse) input. It is wasteful in memory but possibly faster than the sparse CPU version:

x = pcg(gpuArray(full(A)), b, tol, 5000, L, L');  % full() makes A dense (double() would keep it sparse); needs 8*n^2 bytes

Dan R on 9 Sep 2022

Thanks for the suggestion. For

pcg(gpuArray(A),b,[],[],@(x)L\x,@(x)L'\x)

I get

Error using iterapp (line 66)

user supplied function ==> @(x)gather(L\x) failed with the following error:

Sparse MLDIVIDE only supports sparse square matrices divided by full column vectors.

Error in parallel.internal.flowthrough.pcg (line 198)

y = iterapp('mldivide',m1fun,m1type,m1fcnstr,r,varargin{:});

Error in gpuArray/pcg (line 65)

[varargout{1:nargout}] = parallel.internal.flowthrough.pcg(varargin{:});

For

pcg(gpuArray(A),b,[],[],@(x)L\double(x),@(x)L'\double(x))

and

pcg(gpuArray(A),b,[],[],@(x)gather(L)\double(x),@(x)gather(L')\double(x))

I get

Error using iterapp (line 66)

user supplied function ==> @(x)L\double(x) failed with the following error:

Sparse gpuArrays are not supported for this function.

Error in parallel.internal.flowthrough.pcg (line 198)

y = iterapp('mldivide',m1fun,m1type,m1fcnstr,r,varargin{:});

Error in gpuArray/pcg (line 65)

[varargout{1:nargout}] = parallel.internal.flowthrough.pcg(varargin{:});

So to me it looks like MATLAB won't let us pass a gpuArray to the user function, so this approach can never work.

Bruno Luong on 9 Sep 2022

Edited: Bruno Luong on 9 Sep 2022

Don't count on a 20x acceleration. Preconditioning pays off only if it improves the condition number and solving with M1*M2 is cheap compared to computing A*x.

Incomplete Cholesky (ichol) and incomplete LU (ilu) are relatively expensive, since applying them is almost like solving a system with A.

That's why people look for (cheap) permutations that make the matrix diagonally dominant, and approximate the matrix by a narrow-band matrix M that is cheap to solve with.
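As a minimal sketch of the narrow-band idea, the snippet below keeps only the tridiagonal part of A as the preconditioner M and compares iteration counts with plain CG. The test matrix, right-hand side and iteration limit are assumptions made for illustration; they are not the data from the question, and the benefit on the real problem may differ.

```matlab
% Stand-in SPD system (assumption: small 2-D Laplacian, not the real A).
A = delsq(numgrid('S', 34));
b = ones(size(A, 1), 1);
tol = 1e-4;
maxit = 500;

% Narrow-band approximation: keep the main diagonal and its two neighbours.
M = tril(triu(A, -1), 1);

% Plain CG versus CG preconditioned with the tridiagonal M.
[~, flag0, ~, iter0] = pcg(A, b, tol, maxit);
[~, flag1, ~, iter1] = pcg(A, b, tol, maxit, M);

fprintf('no preconditioner:        flag %d, %d iterations\n', flag0, iter0);
fprintf('tridiagonal preconditioner: flag %d, %d iterations\n', flag1, iter1);
```

Each application of M is a banded solve, which is far cheaper than a general sparse solve; whether the saved iterations outweigh that cost has to be measured on the actual matrix.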

Dan R on 9 Sep 2022

Edited: Dan R on 12 Sep 2022

But ichol() seems fast for me:

tic; L = ichol(A, struct('michol','on')); toc; %this runs on the CPU and takes ~0.07s
tic; x = pcg(A, b, tol, 5000, L, L'); toc; %this runs on the CPU and takes ~16s 

In summary:

1. CPU version of pcg() with L: 347 iterations, 16 sec.
2. CPU version of pcg() without L: 7285 iterations, 106 sec.
3. GPU version of pcg() with L: does not work.
4. GPU version of pcg() without L: 7281 iterations, 14 sec.

2 and 4 look very similar in terms of behaviour. I am hoping that 1 and 3 would also have similar behaviour with a corresponding speed up of 106/16 = 6.6 times (so not 20x as I said above), i.e. I expect/hope 3 would take 14/6.6 = 2.1 s if it could be made to work.

I confess I am not an expert on the numerical methods being used here, so I may be missing something...
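The estimate above can be cross-checked per iteration using only the figures quoted in this thread. The sketch assumes that a preconditioned iteration would cost the same multiple of an unpreconditioned one on the GPU as it does on the CPU. That is only an assumption: sparse triangular solves may parallelise much worse than matrix-vector products.

```matlab
% Times (s) and iteration counts quoted in the thread.
tCpuPre = 16.24;  itCpuPre = 347;    % CPU with ichol (time includes ~0.07 s for ichol)
tCpuRaw = 106;    itCpuRaw = 7285;   % CPU without preconditioner
tGpuRaw = 13.91;  itGpuRaw = 7281;   % GPU without preconditioner

% Cost of a single iteration in each case (seconds).
cCpuPre = tCpuPre / itCpuPre;
cCpuRaw = tCpuRaw / itCpuRaw;
cGpuRaw = tGpuRaw / itGpuRaw;

% How much more a preconditioned iteration costs on the CPU.
precFactor = cCpuPre / cCpuRaw;

% ASSUMPTION: the same factor applies on the GPU.
tGpuPreEst = itCpuPre * cGpuRaw * precFactor;

fprintf('ms per iteration: CPU+L %.1f, CPU %.1f, GPU %.2f\n', ...
        1e3 * cCpuPre, 1e3 * cCpuRaw, 1e3 * cGpuRaw);
fprintf('preconditioned iteration costs %.1fx an unpreconditioned one\n', precFactor);
fprintf('estimated GPU time with L: %.1f s\n', tGpuPreEst);
```

The per-iteration view lands on the same figure of roughly 2 s and makes the hidden assumption explicit: the estimate holds only if the two triangular solves speed up on the GPU as well as A*x does.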


Answer 3

Christine Tobler on 12 Sep 2022


https://es.mathworks.com/matlabcentral/answers/1798620-gpu-speed-up-for-pcg-is-disappointing#answer_1051625


It looks like you can simply replace your current call to pcg with

x = pcg(A, b, tol, 5000, @(y) L\y, @(y)L'\y);

as the error message just says that it only supports function handle input here (though it's not clear to me why that restriction is there).

Perhaps it makes sense to also cast the matrix L to a gpuArray?
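On the CPU, the function-handle form can be checked against the matrix form before moving anything to the GPU. The sketch below does only that, on a small stand-in matrix (an assumption, not the poster's A); it does not show that the same handles work with gpuArray inputs, which the error reports elsewhere in this thread suggest they did not in R2021b.

```matlab
% Stand-in SPD system (assumption: small 2-D Laplacian, not the real A).
A = delsq(numgrid('S', 34));
b = ones(size(A, 1), 1);
tol = 1e-4;
maxit = 500;

% Incomplete Cholesky factor, with the transpose formed once up front.
L = ichol(A, struct('michol', 'on'));
Lt = L';

% Preconditioners given as matrices ...
[x1, flag1, ~, iter1] = pcg(A, b, tol, maxit, L, Lt);

% ... and as function handles that apply the same triangular solves.
[x2, flag2, ~, iter2] = pcg(A, b, tol, maxit, @(y) L \ y, @(y) Lt \ y);

fprintf('matrix form:   flag %d, %d iterations\n', flag1, iter1);
fprintf('function form: flag %d, %d iterations\n', flag2, iter2);
fprintf('relative difference: %.2g\n', norm(x1 - x2) / norm(x1));
```

If the two runs agree, any failure after switching A to gpuArray(A) points at the GPU code path rather than at the handles themselves.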

Dan R on 12 Sep 2022

I tried this after advice from Bruno Luong above - no luck I'm afraid.
