Extreme memory usage when using gather
Mostrar comentarios más antiguos
Hi! I have a code that solves a tridiagonal system and, because of the large sizes of the matrices, I'm trying to solve the A\V on GPU.
At each iteration I modify the diagonal elements and need to solve on the GPU like the below code:
After the GPU have solved the system, I need to get the present result and store in a matrix. For this I'm using abs(gather(head(ffield,savez-TERRAIN+1))); and when this commands are called, the memory usage goes beyond 130 GB.
What is happening here to the memory goes crazy like that? Also, there are any tips to improve my code? In summary, I have to solve a large tridiagonal system at each iteration of a large loop (like hundreds of thousands iterations). It'd be better if I could create the matrices on the GPU once and simply modify its diagonals values directly on the GPU, but I couldn't do this yet, so I'm modifying a normal matrix and sending it to the GPU again at each iteration.
A = spdiags([[subdiagA;0] zeros(N-1,1) [0;superdiagA]],-1:1,N-1,N-1);
B = spdiags([[subdiagB;0] zeros(N-1,1) [0;superdiagB]],-1:1,N-1,N-1);
ffield = gpuArray(complex(field));
indicesDiag = (1:N:(N-1)^2)';
for m = 2:M
a_mj = k0^2*delz^2*(m_ref_curr(2:end).^2 - 1);
% modify matrix A diagonal
diagA = c*(-2 + b + a_mj);
% modify matrix B diagonal
diagB = conj(c)*(-2 + conj(b) + a_mj);
% some boundary conditions
km = delz*(alpha_BC(m) + 1i*k0*0.5*slopes(m)^3);
if PEC && POL == 'V'
km = delz*1i*k0*0.5*slopes(m)^3;
end
% modify first and last elements of matrices A and B diagonals
diagA(1) = diagA(1) + c/(1 - km + 0.5*km^2);
diagB(1) = diagB(1) + conj(c)/(1 - km + 0.5*km^2);
if POL == 'H' % DIRICHLET
diagA(end) = 1;
diagB(end) = 1;
elseif POL == 'V' % NEUMANN
diagA(end) = diagA(end)/2;
diagB(end) = diagB(end)/2;
end
% update the diagonal terms of matrix A and send to GPU
A(indicesDiag) = diagA;
AA = gpuArray(complex(A)); % AA is the version of A on the GPU
% update the diagonal terms of matrix A and send to GPU
B(indicesDiag) = diagB;
BB = gpuArray(complex(B)); % BB is the version of matrix B on the GPU
% update another GPU array
ffield = ffield.*exp(-1i*k0*slopes(m)*z);
VV = BB*ffield(2:end); % operation on GPU
% solve the tridiagonal system on GPU
ffield(2:end) = AA\VV;
ffield(1) = ffield(2)/(1 - km + 0.5*km^2);
% some operations to prepare for the next operation
ffield = ffield.*exp(1i*k0*(slopes(m)*z + 0.5*slopes(m)^2*delx + delx));
ffield(zwin:end) = ffield(zwin:end).*win;
% here i need to get the result that is in the GPU array ffield and
% store part of it in a normal matrix u()
TERRAIN = floor(terreno(m)/delz)+1;
u(TERRAIN:end,m) = abs(gather(head(ffield,savez-TERRAIN+1))); %% when this line is executed, memory usage goes above 130 GB and the thing crashes
end
8 comentarios
Steven Lord
el 28 de Mzo. de 2024
What are the typical values for N and M you're using in runs of this code?
What are the sizes of the other variables (particularly those whose sizes are not obviously some function of N)?
On that last line of code, what part of the command causes the memory spike? I agree it's likely the gather call, but is m much larger than the width of u (thus requiring u to be expanded with a large number of additional columns)? Could you split that into lines that perform the indexing into head, that call gather, that call abs, and that assigns the result into u and confirm which line causes memory to increase?
Joss Knight
el 29 de Mzo. de 2024
It would be easier if you could provide executable code with some example values for inspection.
Whose memory usage goes to 130GB, the GPU or the CPU? I don't know of any GPUs that have that much memory.
gather causes synchronisation and so often appears to be a performance or memory bottleneck, but actually it is just because that is the line of code where all GPU operations complete. Try inserting a call to wait(gpuDevice) to test this theory.
Gather also has to do extra work for sparse arrays, because the GPU and CPU use different sparse storage representations. To translate involves a transpose, which can mean different memory usage if the matrix is very non-square.
Direct sparse solve is slow, doesn't parallelized well, and factorization tends to create dense triangular factors that use a lot of memory. Consider whether you are gaining anything using the GPU, and consider using an iterative solver instead. Tridiagonal is a special case though.
Lorenco Santos Vasconcelos
el 1 de Abr. de 2024
Lorenco Santos Vasconcelos
el 1 de Abr. de 2024
Joss Knight
el 1 de Abr. de 2024
Ah yes, that will be it. Presumably this is a script? Make it a function and the problem will go.
When you assign into u, MATLAB needs to make a full copy of u during the assignment because it is protecting your workspace against errors in your code. Make sure u is not in the workspace and this won't happen.
Lorenco Santos Vasconcelos
el 1 de Abr. de 2024
Joss Knight
el 1 de Abr. de 2024
Editada: Joss Knight
el 1 de Abr. de 2024
I would have to do a deeper dive to give really good advice, but I can surmise this much. If your system is tridiagonal then store it as an n-by-3 matrix on the GPU and update it as such; only create a sparse matrix out of it to solve the system. The slowness of spdiags on the GPU is I think only because diagA is not a gpuArray. Make it one and it should speed things up.
So I would probably make diagA and diagB gpuArray objects and store A and B (and AA and BB) as 3-column dense matrices, for which you are updating the middle column. When you do the system solve, convert to sparse only temporarily:
ffield(2:end) = spdiags(AA,-1:1)\VV;
I'm assuming here that you are using R2021a or later, which was when sparse tridiagonal solves were optimized.
Lorenco Santos Vasconcelos
el 1 de Abr. de 2024
Respuesta aceptada
Más respuestas (1)
Taylor
el 28 de Mzo. de 2024
0 votos
gather is just doing what it's supposed to do. Namely, "On a gpuArray: transfers the elements of A from the GPU to the local workspace and assigns them to X." Here's more info on running MATLAB functions on a GPU.
3 comentarios
Lorenco Santos Vasconcelos
el 28 de Mzo. de 2024
Taylor
el 28 de Mzo. de 2024
I'm wondering if it is related to calling gather multiple times with the for loop. Have you tried minimizing your number fo gather calls using the [X1,X2,...,Xn] = gather(A1,A2,...,An) syntax outside of the loop instead?
Lorenco Santos Vasconcelos
el 28 de Mzo. de 2024
Categorías
Más información sobre Matrix Indexing en Centro de ayuda y File Exchange.
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!