Slow performance using iterative solver with gpuArray

Paulo Ribeiro on 2 Apr 2019
Edited: Joss Knight on 15 Apr 2019
Hi all,
I am using iterative solvers combined with gpuArrays for the performance benefits. These simulations involve solving linear systems from the finite element method (sparse, symmetric matrices) of order up to 5.8e6. The following configurations are available:
  • Setup 1 (local): i7 8700, 32 GB RAM, NVIDIA Titan Xp, Windows 10;
  • Setup 2 (cloud): 20 GB RAM, NVIDIA Tesla V100, Windows 10.
Benchmark results using these setups are described below, where:
a) the pcg solver on the CPU uses an incomplete Cholesky preconditioner:
L = ichol(A);                        % incomplete Cholesky factor of A
sol = pcg(A, b, 1e-5, 1e5, L, L');   % tolerance 1e-5, at most 1e5 iterations
b) the pcg solver on the GPU runs without a preconditioner:
A_gpu = gpuArray(A);                 % transfer the system matrix to the GPU
sol = pcg(A_gpu, b, 1e-5, 1e5);
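For reference, a sanity check along these lines (a sketch; it assumes A and b already exist in the workspace) confirms that the inputs really live on the GPU before benchmarking, since a matrix that silently stayed on the CPU is easy to miss:
A_gpu = gpuArray(A);
b_gpu = gpuArray(b);
assert(isa(A_gpu, 'gpuArray'), 'A_gpu is not on the GPU')   % catches a missed transfer
disp(classUnderlying(A_gpu))                                % underlying type, e.g. 'double'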
Setup 1 Results
Mesh   Matrix Order   Direct Solver (s)   PCG CPU (s)   PCG GPU (s)
1      9e5            9                   145           16
2      2.6e6          32                  650           58
3      5.8e6          96                  3688          4210
Setup 2 Results
Mesh   Matrix Order   Direct Solver (s)   PCG CPU (s)   PCG GPU (s)
1      9e5            -                   -             11
2      2.6e6          -                   -             27
3      5.8e6          -                   -             ????
As expected, the PCG-GPU solver shows a very interesting speedup on Meshes 1 and 2. On the other hand, I can't explain what appears to be a bottleneck on Mesh 3: even on a Tesla V100 (Setup 2), computation is extremely slow. For this mesh PCG-CPU is faster than PCG-GPU, which takes 4210 s (really poor performance). The Windows 10 resource monitor shows no GPU activity, even though pcg is running with a gpuArray. It seems that beyond a certain matrix size there is some lag between the CPU and the GPU.
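A timing sketch like the following (variable names as in the snippets above) ensures the measurement covers all queued GPU work, since gpuArray operations execute asynchronously:
gpu = gpuDevice;                  % handle to the current GPU
tic
sol = pcg(A_gpu, b, 1e-5, 1e5);
wait(gpu)                         % block until all queued GPU work has finished
toc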
Any thoughts or advice? Thanks!
Paulo
### update 01 ###
Solution described in the comments below: pass both 'A' and 'b' to the solver as gpuArrays.
### update 02 ###
Following Joss Knight's comment, I noticed that A_gpu was not actually a gpuArray. This explains why b_gpu provided such an amazing speedup: it transferred the computation to the GPU.
  6 comments
Paulo Ribeiro on 5 Apr 2019
Thanks, Joss. The pcg calls described previously had a typo (now corrected at the beginning of this thread). Please consider the corrected version (the one actually used in the MATLAB script) for both options:
L = ichol(A);
sol = pcg(A, b, 1e-5, 1e5, L, L');   % CPU, incomplete Cholesky preconditioner
A_gpu = gpuArray(A);
sol = pcg(A_gpu, b, 1e-5, 1e5);      % GPU, no preconditioner
Your observation was really helpful: A_gpu was not a gpuArray. My bad!
This explains why using b_gpu provides such an impressive speedup. The speedup is exactly the same whether only A, only b, or both are gpuArrays. Therefore, with A, b, or both of them on the GPU:
Setup 1 Results
Mesh   Matrix Order   Direct Solver (s)   PCG CPU (s)   PCG GPU (s)
1      9e5            9                   145           16
2      2.6e6          32                  650           58
3      5.8e6          96                  3688          160
with accurate results and a 23x speedup over the PCG-CPU computation.
On the other hand, using this preconditioner leaves the GPU almost inactive:
sol = pcg(A_gpu, b, 1e-5, 1e5, A_gpu);
The Windows resource monitor shows some GPU activity when the solver starts, but it quickly drops to zero. Without preconditioning, GPU activity is steady at around 15%. Any ideas?
Thank you!
Paulo
Joss Knight on 15 Apr 2019
Edited: Joss Knight on 15 Apr 2019
I fear that question is too hard to answer. You need to get some additional outputs from the solver so you can find out how many iterations it takes to reach your tolerance. But the behaviour of sparse solvers is very problem-dependent. When you provide A as a preconditioner, an incomplete LU factorization is used to decompose it, and the resulting factors are used as the preconditioner. Perhaps in your case this factorization is particularly bad, so it damages the convergence properties instead of improving them. You can investigate the quality of the ILU on your matrix using something like
setup.type = 'nofill';     % zero-fill ILU, i.e. ILU(0)
[L, U] = ilu(A, setup);    % ilu expects an options struct, not a scalar
norm(A - L*U, 'fro')       % 'fro': the default 2-norm is unsupported for sparse matrices
If this is the case, I'd expect your iteration-count output to be very high when you precondition, or perhaps no convergence at all. Exactly why your GPU utilisation appears to drop, I don't know. The solvers are hybrid algorithms that use both the CPU and the GPU, but without some reproduction code I can't see where the work is going. Maybe this is perfectly normal; maybe something went wrong and you're running on the CPU again.
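As a sketch of that first suggestion (reusing A_gpu and b from the question), the extra outputs and the residual history make it easy to compare the preconditioned and unpreconditioned runs:
[sol, flag, relres, iter, resvec] = pcg(A_gpu, b, 1e-5, 1e5, A_gpu);
fprintf('flag %d: %d iterations, final relative residual %g\n', flag, iter, relres)
semilogy(gather(resvec) / norm(b))   % a flat curve means convergence has stalled
xlabel('iteration'), ylabel('relative residual')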
One thing you ought to try is using L*L' as the preconditioner, where L is the result of ichol(A). Hopefully the ILU will recover L during the solve, and the result will be as good as CPU preconditioning. The downside is that L*L' may turn out to be far less sparse than L.
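A sketch of that idea, with the factor computed on the CPU as in the question:
L = ichol(A);                        % incomplete Cholesky factor, computed on the CPU
M = gpuArray(L*L');                  % caution: L*L' can be much denser than L
sol = pcg(A_gpu, b, 1e-5, 1e5, M);   % single preconditioner argument, factorized internally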
You should try some of the other supported solvers to see whether they have better convergence properties or work better with preconditioning: gmres, bicg, bicgstab, cgs, lsqr. You could also look at different preconditioners; a typical one to try for diagonally dominant inputs is the diagonal diag(diag(A)), or some variation of that.
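As a sketch of the diagonal-preconditioner suggestion (only worthwhile if A really is diagonally dominant):
M = gpuArray(diag(diag(A)));         % diag of a sparse matrix stays sparse
sol = pcg(A_gpu, b, 1e-5, 1e5, M);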


Answers (1)

Matt J on 2 Apr 2019
I would guess that the differences you are seeing are in large part due to the fact that you are preconditioning in the CPU case but not in the GPU case. This undoubtedly affects the number of PCG iterations run before the solver terminates.
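A quick way to check this guess (a sketch reusing the variables from the question) is to compare the iteration counts directly:
L = ichol(A);
[~, ~, ~, it_cpu] = pcg(A, b, 1e-5, 1e5, L, L');   % preconditioned, CPU
[~, ~, ~, it_gpu] = pcg(A_gpu, b, 1e-5, 1e5);      % unpreconditioned, GPU
fprintf('CPU (preconditioned): %d iterations; GPU (no preconditioner): %d\n', it_cpu, it_gpu)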

Release: R2019a
