How to use lsqr with GPU?

Su Py on 28 Jul 2021
Commented: Joss Knight on 9 Aug 2021
I'm using lsqr to solve a least-squares problem of the form min_x ||A*x - b||. Since A is a huge matrix, I implemented a function f that computes the products with A (afun in the code below), and I pass it to lsqr instead of holding A in memory. To speed up the calculation, f uses parfor statements to calculate A*x and A'*x.
How can I use my GPU cores to speed up these computations?
The general structure of my code:
function Ax = afun(x,flag)
    if strcmp(flag,'notransp') % Compute A*x
        parfor i = 1:K
            Ax_mat(:,i) = ...
        end
        Ax = Ax_mat(:);
    elseif strcmp(flag,'transp') % Compute A'*x
        parfor i = 1:K
            Ax_mat(:,i) = ...
        end
        Ax = Ax_mat(:);
    end
end

solution = lsqr(@afun,b);

Answers (1)

Joss Knight on 4 Aug 2021
Edited: Joss Knight on 4 Aug 2021
solution = lsqr(@afun,gpuArray(b));
Alternatively, move the data to the GPU inside your afun operation. The problem is that using a parallel pool in conjunction with GPU execution is generally counter-productive. You may have many CPU cores to perform your matrix-vector multiplication one chunk at a time, but you only have one GPU. If you do the same on the GPU, the parallelism will be lost as each worker waits to access the GPU. You could attempt to load-balance between CPU and GPU by having only one worker use the GPU, but then you're going to run into your memory issue: to balance properly, the GPU worker may need to work on a chunk of data 10x larger than the CPU workers.
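As a minimal sketch of the second option (moving the data to the GPU inside afun), assuming Parallel Computing Toolbox and a supported GPU: the name afun_gpu and the matrix A below are hypothetical. A is held explicitly here only so the example runs; in the original problem A is never formed, and each branch would contain the asker's own computation instead.

```matlab
function y = afun_gpu(x, flag)
% Sketch only: A is a hypothetical explicit stand-in for the real operator.
persistent A
if isempty(A)
    rng(0);                          % reproducible stand-in data
    A = gpuArray(rand(200, 50));     % kept on the GPU across calls
end
xg = gpuArray(x);                    % move the input vector to the GPU
if strcmp(flag, 'notransp')
    yg = A * xg;                     % A*x, computed on the GPU
elseif strcmp(flag, 'transp')
    yg = A' * xg;                    % A'*x, computed on the GPU
end
y = gather(yg);                      % bring the result back to host memory
end
```

With this form, lsqr(@afun_gpu, b) would be called with an ordinary CPU vector b, and every call pays for one transfer in each direction. If b is passed as a gpuArray instead, the gpuArray and gather calls inside the function should not be needed.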
  2 Comments
Su Py on 4 Aug 2021
Thank you for your comment.
I actually have 8 GPUs to work with on my servers. Does that change your answer? Can I get equivalent parallelism with them?
Joss Knight on 9 Aug 2021
Yes, you can open a pool with 8 workers. However, you are still probably not going to get the most efficient utilisation with your parfor loop. The GPU works best when you vectorize, which means that ideally you will not process column by column but instead do multiple columns at a time, i.e. 1/8th of the columns on each worker.
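A rough sketch of that layout, one block of columns per worker, might look like the following. It assumes Parallel Computing Toolbox, at least one supported GPU, and that each pool worker selects a different GPU by default when several are available. The matrix B, the sizes, and the rule "column i of Ax_mat is B * X(:,i)" are hypothetical stand-ins for the elided per-column computation in the question.

```matlab
% Sketch only: B and the sizes are hypothetical stand-ins for the real operator.
n = 100; K = 64;
B = rand(n);                     % hypothetical per-column operator
x = rand(n*K, 1);
X = reshape(x, n, K);            % one column per original loop index

nGpus = gpuDeviceCount;          % 8 on the servers described in the thread
if isempty(gcp('nocreate'))
    parpool(nGpus);              % one worker per GPU
end

% Split the K columns into nGpus contiguous blocks.
edges = round(linspace(0, K, nGpus + 1));
chunks = cell(1, nGpus);
for w = 1:nGpus
    chunks{w} = X(:, edges(w)+1 : edges(w+1));
end

parts = cell(1, nGpus);
parfor w = 1:nGpus
    Xg = gpuArray(chunks{w});               % whole block goes to this worker's GPU
    parts{w} = gather(gpuArray(B) * Xg);    % one call covers every column in the block
end

Ax_mat = [parts{:}];             % reassemble the blocks in column order
Ax = Ax_mat(:);
```

The loop index now runs over workers rather than columns, so each GPU receives a single dense operation instead of many small ones.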
The point is that even in parfor you are running many operations serially, just spread between multiple workers. On the CPU this doesn't matter because the whole operation was serial anyway, but on GPU it's critical that you maintain the density of array elements being processed in each function call.
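To illustrate the point about density on a single GPU, here is a small sketch assuming Parallel Computing Toolbox and a supported GPU. The matrix B and the sizes are hypothetical, with column i of the result taken to be B * X(:,i). The two forms compute the same thing, and gputimeit can be used to compare their cost on a particular device.

```matlab
% Sketch only: B and the sizes are hypothetical.
n = 512; K = 256;
Bg = gpuArray(rand(n));
Xg = gpuArray(rand(n, K));

% Column by column: K separate GPU calls, each handling one column of X.
Y_loop = zeros(n, K, 'gpuArray');
for i = 1:K
    Y_loop(:, i) = Bg * Xg(:, i);
end

% Vectorized: a single GPU call that handles all K columns at once.
Y_vec = Bg * Xg;

% Both forms agree up to rounding.
relErr = gather(norm(Y_loop - Y_vec, 'fro') / norm(Y_vec, 'fro'));

% Time the vectorized call on this device.
tVec = gputimeit(@() Bg * Xg);
```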

Release: R2018b
