Extreme memory usage when using gather

Hi! I have a code that solves a tridiagonal system and, because of the large sizes of the matrices, I'm trying to solve the A\V on GPU.
At each iteration I modify the diagonal elements and need to solve on the GPU like the below code:
After the GPU have solved the system, I need to get the present result and store in a matrix. For this I'm using abs(gather(head(ffield,savez-TERRAIN+1))); and when this commands are called, the memory usage goes beyond 130 GB.
What is happening here to the memory goes crazy like that? Also, there are any tips to improve my code? In summary, I have to solve a large tridiagonal system at each iteration of a large loop (like hundreds of thousands iterations). It'd be better if I could create the matrices on the GPU once and simply modify its diagonals values directly on the GPU, but I couldn't do this yet, so I'm modifying a normal matrix and sending it to the GPU again at each iteration.
A = spdiags([[subdiagA;0] zeros(N-1,1) [0;superdiagA]],-1:1,N-1,N-1);
B = spdiags([[subdiagB;0] zeros(N-1,1) [0;superdiagB]],-1:1,N-1,N-1);
ffield = gpuArray(complex(field));
indicesDiag = (1:N:(N-1)^2)';
for m = 2:M
a_mj = k0^2*delz^2*(m_ref_curr(2:end).^2 - 1);
% modify matrix A diagonal
diagA = c*(-2 + b + a_mj);
% modify matrix B diagonal
diagB = conj(c)*(-2 + conj(b) + a_mj);
% some boundary conditions
km = delz*(alpha_BC(m) + 1i*k0*0.5*slopes(m)^3);
if PEC && POL == 'V'
km = delz*1i*k0*0.5*slopes(m)^3;
end
% modify first and last elements of matrices A and B diagonals
diagA(1) = diagA(1) + c/(1 - km + 0.5*km^2);
diagB(1) = diagB(1) + conj(c)/(1 - km + 0.5*km^2);
if POL == 'H' % DIRICHLET
diagA(end) = 1;
diagB(end) = 1;
elseif POL == 'V' % NEUMANN
diagA(end) = diagA(end)/2;
diagB(end) = diagB(end)/2;
end
% update the diagonal terms of matrix A and send to GPU
A(indicesDiag) = diagA;
AA = gpuArray(complex(A)); % AA is the version of A on the GPU
% update the diagonal terms of matrix A and send to GPU
B(indicesDiag) = diagB;
BB = gpuArray(complex(B)); % BB is the version of matrix B on the GPU
% update another GPU array
ffield = ffield.*exp(-1i*k0*slopes(m)*z);
VV = BB*ffield(2:end); % operation on GPU
% solve the tridiagonal system on GPU
ffield(2:end) = AA\VV;
ffield(1) = ffield(2)/(1 - km + 0.5*km^2);
% some operations to prepare for the next operation
ffield = ffield.*exp(1i*k0*(slopes(m)*z + 0.5*slopes(m)^2*delx + delx));
ffield(zwin:end) = ffield(zwin:end).*win;
% here i need to get the result that is in the GPU array ffield and
% store part of it in a normal matrix u()
TERRAIN = floor(terreno(m)/delz)+1;
u(TERRAIN:end,m) = abs(gather(head(ffield,savez-TERRAIN+1))); %% when this line is executed, memory usage goes above 130 GB and the thing crashes
end

8 comentarios

Steven Lord
Steven Lord el 28 de Mzo. de 2024
What are the typical values for N and M you're using in runs of this code?
What are the sizes of the other variables (particularly those whose sizes are not obviously some function of N)?
On that last line of code, what part of the command causes the memory spike? I agree it's likely the gather call, but is m much larger than the width of u (thus requiring u to be expanded with a large number of additional columns)? Could you split that into lines that perform the indexing into head, that call gather, that call abs, and that assigns the result into u and confirm which line causes memory to increase?
Joss Knight
Joss Knight el 29 de Mzo. de 2024
It would be easier if you could provide executable code with some example values for inspection.
Whose memory usage goes to 130GB, the GPU or the CPU? I don't know of any GPUs that have that much memory.
gather causes synchronisation and so often appears to be a performance or memory bottleneck, but actually it is just because that is the line of code where all GPU operations complete. Try inserting a call to wait(gpuDevice) to test this theory.
Gather also has to do extra work for sparse arrays, because the GPU and CPU use different sparse storage representations. To translate involves a transpose, which can mean different memory usage if the matrix is very non-square.
Direct sparse solve is slow, doesn't parallelized well, and factorization tends to create dense triangular factors that use a lot of memory. Consider whether you are gaining anything using the GPU, and consider using an iterative solver instead. Tridiagonal is a special case though.
Hi @Joss Knight, it is the CPU memory that goes to very high values. Also, I've tried the wait(gpuDevice) without success.
Regarding the use of gather with sparse arrays, I thought it would not be a problem because i'm calling gather on a normal array 'ffield'. The sparse arrays are the matrices A and B, and ffield is computed from this sparce matrices but is supposed to be a full array.
Maybe the issue is at the assignment of the values. With the sliced representation u(start:end,m) = gather(...);
As matrix u() is big, if the sliced assignment is creating a full copy of u(), there is a problem.
Is there a way to do this sliced assignment u(start:end,m) = gather(...); without underlying copies?
Update: I was initializing the big matrix u() outside the function and passing it to the function that modifies it like:
u = zeros(N,M);
u = propagTerreno(...,u,...);
now I'm initializing u inside the function and it seems to solved the problem.
Here are a version of working code and a .mat data file needed.
There are not typical values, because both N and M depends on a wavelength and terrain data that can be so variable. But the issue appears with great N and M like 500.000 x 8.000 double array.
I'm thinking the problems come from the sliced assignment u(TERRAIN:end,m) = abs(gather(head(ffield,savez-TERRAIN+1)));
Meybe matlab is creating a full copy of matrix u to make the assignment.
Is there a way of perform kind of inplace assignment without underlying copies?
I shared the code and data required on my reply to @Joss Knight
Joss Knight
Joss Knight el 1 de Abr. de 2024
Ah yes, that will be it. Presumably this is a script? Make it a function and the problem will go.
When you assign into u, MATLAB needs to make a full copy of u during the assignment because it is protecting your workspace against errors in your code. Make sure u is not in the workspace and this won't happen.
Yes, I think it solved the memory issue.
Regarding the performance of the code, I'm doing the following:
at each iteration, modify sparse matrices A and B diagonal elements and then update the gpu version of the matrices
indicesDiag = (1:N:(N-1)^2)'; % indexes of main diagonal
Aloc(indicesDiag) = diagA; % Aloc is a local sparse matrix. Here I modify the diagonal elements
A = gpuArray(complex(Aloc)); % update the GPU array A with the local Aloc
Matlab says that this kind of indexing [Aloc(indicesDiag)] is likely to be slow, but on my tests, it is faster than use spdiags(diagA,0,Aloc); to update the main diagonal
I've also tried to work directly on gpu with
A = spdiags(diagA,0,A); % here A is the gpuArray
but this is taking longer than the first option.
Is there somethin better than this?
What I really want to do is to run the for loop inside the gpu, like modifying the values of the gpuArray directly so I do not wast time passing values to the GPU at each iteration. Any tips on this?
Joss Knight
Joss Knight el 1 de Abr. de 2024
Editada: Joss Knight el 1 de Abr. de 2024
I would have to do a deeper dive to give really good advice, but I can surmise this much. If your system is tridiagonal then store it as an n-by-3 matrix on the GPU and update it as such; only create a sparse matrix out of it to solve the system. The slowness of spdiags on the GPU is I think only because diagA is not a gpuArray. Make it one and it should speed things up.
So I would probably make diagA and diagB gpuArray objects and store A and B (and AA and BB) as 3-column dense matrices, for which you are updating the middle column. When you do the system solve, convert to sparse only temporarily:
ffield(2:end) = spdiags(AA,-1:1)\VV;
I'm assuming here that you are using R2021a or later, which was when sparse tridiagonal solves were optimized.
Maybe replacing
ffield(2:end) = A\V; % A and V are g
by
[ffield(2:end),~] = pcg(A,V);
could do a better work but I think that the bottleneck is not the system solver (mldivide or iterative with preconditioner), but the modification of the matrices at each iteration and passing this arrays to the GPU.

Iniciar sesión para comentar.

 Respuesta aceptada

Joss Knight
Joss Knight el 1 de Abr. de 2024
Editada: Joss Knight el 14 de Abr. de 2024
As others have worked out, it looks like the issue is the indexed assignment into u:
u(TERRAIN:end,m) = abs(gather(head(ffield,savez-TERRAIN+1)));
Because u is in the base workspace, this needs to make a second copy of u during the assignment, to protect your workspace.
The normal way to fix this is to make your code into a function which returns u, so that it doesn't exist in the base workspace while the function is running. If this is impossible for you, there are ways to 'move' the variable u into a function for update, but they are a little bit advanced and I'm not even sure how safe it is to give this advice...
...
u = updateU(u, head, ffield, savez, TERRAIN, m);
...
function u = updateU(u, head, ffield, savez, TERRAIN, m)
% Delete u in parent workspace
assignin('base', 'u', []);
% Now the only copy of u is local and the assignment can be inplace
% BUT!! IF THIS ERRORS, u WILL BE DESTROYED PERMANENTLY!!
u(TERRAIN:end,m) = abs(gather(head(ffield,savez-TERRAIN+1)));
end
Of course, you could also try to rework your code so that you're not storing hundreds of gigs of data in a single variable. Even breaking it up into a cell array (e.g. each column is a different cell element) would help because only the element being overwritten would need to be copied.

Más respuestas (1)

Taylor
Taylor el 28 de Mzo. de 2024

0 votos

gather is just doing what it's supposed to do. Namely, "On a gpuArray: transfers the elements of A from the GPU to the local workspace and assigns them to X." Here's more info on running MATLAB functions on a GPU.

3 comentarios

Hi @Taylor! Thank you for your answer.
Yes, but why gather uses so much memory like this? I mean, the local workspace matrix is already pre allocated with zeros (u = zeros(savez,M);) and it does perfectly fit on the memory. It uses only a few kB.
Why transfering only a few values from the gpuArray ffield to one column of matrix u (u(TERRAIN:end,m) = abs(gather(head(ffield,savez-TERRAIN+1)));) consumes more than 130 GB of RAM? Something must be wrong here, don't you think?
Taylor
Taylor el 28 de Mzo. de 2024
I'm wondering if it is related to calling gather multiple times with the for loop. Have you tried minimizing your number fo gather calls using the [X1,X2,...,Xn] = gather(A1,A2,...,An) syntax outside of the loop instead?
It could be, but I think no because I put a breakpoint inside the loop jat the gather line and run only the first iteration. Once I press F10 to execute gather, the MATLAB proccess that is taking about 2GB of RAM jumps to 20, 50, almost 100 GB.
The idea of avoid the gather could work, but unfortunately this is not possible for me. I could create the entire matrix u as a gpuArray, fill it with the loop and than gather it at the end, but this would mean to store a big matrix in the gpu shared memory. For this reason I've preferred to store in the gpu only one column at a time and gather this column at each iteration to fill the complete local workspace matrix.
Do you think it'd be better to create the entire matrix as a gpuArray and avoid the gather at each iteration?
Just to make the context available, the problem is straightforward:
big matrix u(N,M) represents a physical quantity over a domain. I have initial values at the first column of u, i.e u(:,1); With the initial column, I compute the next colum u(:,m) by solving a large tridiagonal system on the GPU. So, at each iteration, the GPU computes an array of size (m,1) and I gather this and put on u(:,m).

Iniciar sesión para comentar.

Categorías

Productos

Versión

R2023a

Preguntada:

el 28 de Mzo. de 2024

Editada:

el 14 de Abr. de 2024

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by