Matrix and vector multiplication of size using a CPU is very slow. Using GPU is much quicker but I need a way around the size limitation.

Question

Jonathan Wharrier on 23 Aug 2023

https://es.mathworks.com/matlabcentral/answers/2011787-matrix-and-vector-multiplication-of-size-using-a-cpu-is-very-slow-using-gpu-is-much-quicker-but-i-n

Commented: Jonathan Wharrier on 30 Aug 2023

This is an extract from the core part of a project I am working on. It is essentially a finite-difference implementation of the Crank-Nicolson method. In the non-GPU version, app.U0 can easily be a vector of size 64k, and the matrices are of similar dimension. The non-GPU version can take a day to run on my overclocked AMD Ryzen 7 5700X, which has 128 GB of memory, so I thought I would try the GPU:

>> gpuDevice
ans = 
  CUDADevice with properties:
                      Name: 'NVIDIA GeForce RTX 3060'
                     Index: 1
         ComputeCapability: '8.6'
            SupportsDouble: 1
     GraphicsDriverVersion: '536.99'
               DriverModel: 'WDDM'
            ToolkitVersion: 11.8000
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152 (49.15 KB)
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 12884377600 (12.88 GB)

I have little idea of the significance of this other than the 12+ GB, which I thought meant I had lots of room. A test run both on and off the GPU showed a perfect match; the only difference was that, for the small app.U0, the CPU version took 50 seconds and the GPU version took 2! Total win... except that when the xMesh exceeds 53, the array contains only NaN apart from the initial input data.

The LHS diagram shows why I need a start vector larger than 100x2x50. The rays emanating from the start and ricocheting off the boundary are a real effect, and one that needs to be avoided, which means a larger distance between the boundaries. That is apparently fine on the CPU but not on the GPU. Because the noise is stochastic I might want 30+ runs, and that means 30+ days just to get the data.

                A = diag(-2*ones(1,M)) + diag(ones(1,M-1),1) + diag(ones(1,M-1),-1);
                sparse(A);
                ul = gpuArray(complex(ul));
                ur = gpuArray(complex((ur)));
                app.u0 = addNoise(app,startAgain,loop);%first line of array eventually
                U = gpuArray(complex(zeros(M+2,app.NJ+1))); %U = zeros(M+2,app.NJ); %output file
                U0 = gpuArray(complex(app.u0(2:end-1,:)));
                U1 = gpuArray(complex(size(U0)));
                UC = gpuArray(complex(size(U0)));
                D = app.dt/(2*app.h^2); % D 
                Bdy(1,:)= D*ul;
                Bdy(end,:)=D*ur;
                %Bdy = sparse(Bdy);
                Bdy = gpuArray(complex(Bdy));
                AB = gpuArray(complex(D*A));
                AA = sparse(1i*eye(size(A))+AB); AA = gpuArray(complex(AA));
                sdg = gpuArray(s*app.dt*app.gamma);%this is just a constant
                    for j=1:app.NJ
                        % This is the first shot at Crank Nicholson
                        U1=AA\(1i.*U0 - AB*U0 - sdg*(U0.*conj(U0)).*U0...
                            - Bdy(:,j)-Bdy(:,j+1));
                        %which becomes a predictor corrector by a second
                        %run through
                        UC = (U1+U0)/2;
                        U1=AA\(1i.*U0 - AB*U0 - sdg*(UC.*conj(UC)).*UC...
                            - Bdy(:,j)-Bdy(:,j+1));
                        U0=U1; U(:,j+1) = [ul(j+1);U1;ur(j+1)];
                    end
                disp(f); disp(showDateAndTime(app,2));           %quick indicator that a loop has finished
                %put start on data
                %U(:,1) = app.u0;%U = [app.u0 U];
                U = gather(U);
                U(:,1) = app.u0;

The 'for' loop is the heart of the matter. Is there any way around this issue without losing all the benefits of the GPU speed?
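Reading the loop above term by term, each time step appears to solve the pair of linear systems below. This is only a transcription of the posted code into notation: the symbols $b^{j}$ (column j of Bdy), $U^{*}$ (U1 after the first solve), $U^{c}$ (UC) and $\sigma$ (sdg) are labels introduced here, not names from the post.

```latex
\begin{aligned}
D &= \frac{\Delta t}{2h^{2}}, \qquad \sigma = s\,\Delta t\,\gamma,\\
(iI + DA)\,U^{*} &= (iI - DA)\,U^{j} - \sigma\,|U^{j}|^{2}U^{j} - b^{j} - b^{j+1},\\
U^{c} &= \tfrac{1}{2}\bigl(U^{*} + U^{j}\bigr),\\
(iI + DA)\,U^{j+1} &= (iI - DA)\,U^{j} - \sigma\,|U^{c}|^{2}U^{c} - b^{j} - b^{j+1}.
\end{aligned}
```

Here $A$ is the tridiagonal second-difference matrix, so $iI + DA$ (AA in the code) is tridiagonal and does not change between steps. Rearranged, the update looks like a Crank-Nicolson discretisation of an equation of the form $i u_t + u_{xx} + s\gamma |u|^2 u = 0$, although that identification is an inference from the code rather than something stated in the question.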

10 comments

Bruno Luong on 25 Aug 2023

Edited: Bruno Luong on 25 Aug 2023

@Jonathan Wharrier here is why nobody can effectively help you:

                A = diag(-2*ones(1,M)) + diag(ones(1,M-1),1) + diag(ones(1,M-1),-1);
Unrecognized function or variable 'M'.
                sparse(A);
                ul = gpuArray(complex(ul));
                ur = gpuArray(complex((ur)));
                app.u0 = addNoise(app,startAgain,loop);%first line of array eventually
                U = gpuArray(complex(zeros(M+2,app.NJ+1))); %U = zeros(M+2,app.NJ); %output file
                U0 = gpuArray(complex(app.u0(2:end-1,:)));
                U1 = gpuArray(complex(size(U0)));
                UC = gpuArray(complex(size(U0)));
                D = app.dt/(2*app.h^2); % D 
                Bdy(1,:)= D*ul;
                Bdy(end,:)=D*ur;
                %Bdy = sparse(Bdy);
                Bdy = gpuArray(complex(Bdy));
                AB = gpuArray(complex(D*A));
                AA = sparse(1i*eye(size(A))+AB); AA = gpuArray(complex(AA));
                sdg = gpuArray(s*app.dt*app.gamma);%this is just a constant
                    for j=1:app.NJ
                        % This is the first shot at Crank Nicholson
                        U1=AA\(1i.*U0 - AB*U0 - sdg*(U0.*conj(U0)).*U0...
                            - Bdy(:,j)-Bdy(:,j+1));
                        %which becomes a predictor corrector by a second
                        %run through
                        UC = (U1+U0)/2;
                        U1=AA\(1i.*U0 - AB*U0 - sdg*(UC.*conj(UC)).*UC...
                            - Bdy(:,j)-Bdy(:,j+1));
                        U0=U1; U(:,j+1) = [ul(j+1);U1;ur(j+1)];
                    end
                disp(f); disp(showDateAndTime(app,2));           %quick indicator that a loop has finished
                %put start on data
                %U(:,1) = app.u0;%U = [app.u0 U];
                U = gather(U);
                U(:,1) = app.u0;

Jonathan Wharrier on 25 Aug 2023

I have cut the code out of the app and added enough data that the loop runs. Changing xRange from 50 to 54 reproduces the problem: the output U becomes mostly NaN, and the only non-NaN entries are the boundary values (all zero) and the start values.

           % CRANK NICHOLSON VERSION
            xRange = 50; %This is the critical value which defines the size of the mesh grid
            % crashes at 54
            dx = 0.01; %mesh grid increment size
            x = -xRange:dx:xRange;%mesh grid
            T = 5; %time for run
            dt = 0.01; %Time increment size
            NJ=T/dt; %number of iterations
            t= dt*(0:NJ); %time vector
            M=length(x);    % length of the mesh
            M=M-2;          % active length minus end points
                                
            S  = @(ex) 1*sech(sqrt(1/2)*ex).*exp(1i*2*ex); %creates Soliton
            u0 = S(x)'; %Start data - note transform
            s=1; %constant
            ul = zeros(size(t)); %set boundary values
            ur = zeros(size(t)); %boundaries are zero
                reset(gpuDevice);
                Bdy=zeros(M,NJ+1);
                A = diag(-2*ones(1,M)) + diag(ones(1,M-1),1) + diag(ones(1,M-1),-1);
                sparse(A);
                U = gpuArray(complex(zeros(M+2,NJ+1))); %U = zeros(M+2,app.NJ); %output file
                U0 = gpuArray(complex(u0(2:end-1,:)));
                UC = gpuArray(complex(size(U0)));
                D = dt/(2*dx^2); % D 
                Bdy(1,:)= D*ul;
                Bdy(end,:)=D*ur;
                %Bdy = sparse(Bdy);
                url = gpuArray(complex(ul));
                urr = gpuArray(complex((ur)));
                Bdy = gpuArray(complex(Bdy));
                AB = gpuArray(complex(D*A));
                AA = sparse(1i*eye(size(A))+AB); AA = gpuArray(complex(AA));
                sdg = gpuArray(1*dt*1);%this is just a constant
                    for j=1:NJ
                        % This is the first shot at Crank Nicholson
                        U1=AA\(1i.*U0 - AB*U0 - sdg*(U0.*conj(U0)).*U0...
                            - Bdy(:,j)-Bdy(:,j+1));
                        %which becomes a predictor corrector by a second
                        %run through
                        UC = (U1+U0)/2;
                        U1=AA\(1i.*U0 - AB*U0 - sdg*(UC.*conj(UC)).*UC...
                            - Bdy(:,j)-Bdy(:,j+1));
                        U0=U1;
                        %U(:,j+1) = [ul(j+1);U1;ur(j+1)];
                        U(:,j+1) = [url(j+1);gather(U1);urr(j+1)];
                        
                    end
                U = gather(U);
                U(:,1) = u0;
            close all;
            surf(abs(U));

Jonathan Wharrier on 25 Aug 2023

The run which took 52 minutes on the CPU took less than 2 on the GPU even using the backslash solver.

Joss Knight on 27 Aug 2023

Yes. It looks like AA is tridiagonal which is a special case and so runs fast.
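To illustrate why a tridiagonal system is a cheap special case, here is a minimal sketch of the textbook Thomas algorithm applied to a matrix with the same structure as AA. It is not a claim about what MATLAB's backslash or the NVIDIA solver does internally. The size M, the grid xg and the right-hand side are illustrative assumptions; only dx, dt and the form AA = 1i*I + D*A come from the thread.

```matlab
% Sketch only: a textbook Thomas (tridiagonal) solve, written inline so it
% runs as a plain script. Sizes and the right-hand side are illustrative.
M  = 2001;                      % illustrative number of interior points
dx = 0.01;  dt = 0.01;          % same increments as the posted script
D  = dt/(2*dx^2);

a = D*ones(M,1);                % sub-diagonal   (a(1) is unused)
b = (1i - 2*D)*ones(M,1);       % main diagonal of AA = 1i*I + D*A
c = D*ones(M,1);                % super-diagonal (c(M) is unused)

xg  = linspace(-10, 10, M).';   % illustrative grid for a test right-hand side
rhs = sech(xg/sqrt(2)).*exp(2i*xg);

% Forward sweep: eliminate the sub-diagonal.
cp = zeros(M,1);  dp = zeros(M,1);
cp(1) = c(1)/b(1);
dp(1) = rhs(1)/b(1);
for k = 2:M
    denom = b(k) - a(k)*cp(k-1);
    cp(k) = c(k)/denom;
    dp(k) = (rhs(k) - a(k)*dp(k-1))/denom;
end

% Back substitution.
u = zeros(M,1);
u(M) = dp(M);
for k = M-1:-1:1
    u(k) = dp(k) - cp(k)*u(k+1);
end

% Cross-check against sparse backslash on the same matrix.
AA     = spdiags([a b c], [-1 0 1], M, M);
uRef   = AA\rhs;
relErr = norm(u - uRef)/norm(uRef);
fprintf('relative difference vs backslash: %.2e\n', relErr);
```

The solve touches each unknown a fixed number of times and needs only three vectors of length M, so both time and memory grow linearly with the mesh size.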


Answer 1

Bruno Luong on 25 Aug 2023


https://es.mathworks.com/matlabcentral/answers/2011787-matrix-and-vector-multiplication-of-size-using-a-cpu-is-very-slow-using-gpu-is-much-quicker-but-i-n#answer_1293817

Edited: Bruno Luong on 25 Aug 2023

I ran your (slightly modified) code with xRange = 100 (EDIT) and got a finite result.

My config: R2023a, Windows 11, laptop with 32 GB, Intel i9 12th generation and Nvidia RTX 3060.

I would look into updating your MATLAB and graphics card drivers, or possibly a hardware incompatibility issue.

% CRANK NICHOLSON VERSION
xRange = 100; %This is the critical value which defines the size of the mesh grid
% crashes at 54
dx = 0.01; %mesh grid increment size
x = -xRange:dx:xRange;%mesh grid
T = 5; %time for run
dt = 0.01; %Time increment size
NJ=T/dt; %number of iterations
t= dt*(0:NJ); %time vector
M=length(x);    % length of the mesh
M=M-2;          % active length minus end points
S  = @(ex) 1*sech(sqrt(1/2)*ex).*exp(1i*2*ex); %creates Soliton
u0 = S(x)'; %Start data - note transform
s=1; %constant
ul = zeros(size(t)); %set boundary values
ur = zeros(size(t)); %boundaries are zero
reset(gpuDevice);
Bdy=zeros(M,NJ+1);
A = spdiags(ones(3*M,1)*[1 -2 1],[-1 0 1],M,M); % modified by Bruno %diag(-2*ones(1,M)) + diag(ones(1,M-1),1) + diag(ones(1,M-1),-1);
U = gpuArray(complex(zeros(M+2,NJ+1))); %U = zeros(M+2,app.NJ); %output file
U0 = gpuArray(complex(u0(2:end-1,:)));
UC = gpuArray(complex(size(U0)));
D = dt/(2*dx^2); % D
Bdy(1,:)= D*ul;
Bdy(end,:)=D*ur;
%Bdy = sparse(Bdy);
url = gpuArray(complex(ul));
urr = gpuArray(complex((ur)));
Bdy = gpuArray(complex(Bdy));
AB = gpuArray(complex(D*A));
AA = 1i*speye(size(A))+AB; 
AA = gpuArray(AA); % modified by Bruno
sdg = gpuArray(1*dt*1);%this is just a constant
for j=1:NJ
    % This is the first shot at Crank Nicholson
    U1=AA\(1i.*U0 - AB*U0 - sdg*(U0.*conj(U0)).*U0...
        - Bdy(:,j)-Bdy(:,j+1));
    %which becomes a predictor corrector by a second
    %run through
    UC = (U1+U0)/2;
    U1=AA\(1i.*U0 - AB*U0 - sdg*(UC.*conj(UC)).*UC...
        - Bdy(:,j)-Bdy(:,j+1));
    U0=U1;
    %U(:,j+1) = [ul(j+1);U1;ur(j+1)];
    U(:,j+1) = [url(j+1);gather(U1);urr(j+1)];
end
U = gather(U);
U(:,1) = u0;
close all;
surf(abs(U),'EdgeColor','none');

37 comments

Jonathan Wharrier on 26 Aug 2023

Edited: Walter Roberson on 26 Aug 2023

because U is not on the GPU; however, it is unnecessary ...

I even tried keeping U1 on the CPU side but it made no difference.

So I am left with this: rhs is correctly calculated and AA is also correct, but once I pass a certain size (xRange > 53) the expression AA\rhs produces NaN on my system but not on yours. What I have done is the following...

First I did a new run, arbitrarily picking xRange = 60. It failed. Double-clicking on U1 shows a list of NaN. I cleared the workspace, then ran the following by hand in the command window...

load('debug.mat')                    % loads the debug file: AA, AB, rhs, U0 and U1
surf(abs(AA),'EdgeColor','none')     % graph of the tridiagonal matrix, as before
surf(abs(AB),'EdgeColor','none')     % graph of the matrix, as before
plot(abs(U0))                        % graph as before
plot(abs(rhs))                       % graph as before
issparse(AA)                         % logical 1 (yes); the workspace does not say this, it just shows gpuArray
issparse(AB)                         % logical 1 (yes); the workspace does not say this, it just shows gpuArray
sol = AA\rhs;                        % this is the solve that failed in the program for U1
plot(abs(sol))                       % no NaN; shows a graph of a slightly moved soliton, as required
isgpuarray(sol)                      % logical 1 (yes)

Apparently I can get the gpu to do this by hand one at a time! I am at a total loss. The following code runs if xRange <= 53 and provides a graph of a Soliton translated by 20 units and a surf diagram. Values of 54 and above fail for no discernible reason that I can fathom though the calc above (by hand as it were) seems to provide the correct answer.

           tic
            % CRANK NICHOLSON VERSION
            xRange = 50; %This is the critical value which defines the size of the mesh grid
            % crashes at 54
            dx = 0.01; %mesh grid increment size
            x = -xRange:dx:xRange;%mesh grid
            T = 5; %time for run
            dt = 0.01; %Time increment size
            NJ=T/dt; %number of iterations
            t= dt*(0:NJ); %time vector
            M=length(x);    % length of the mesh
            M=M-2;          % active length minus end points
                                
            S  = @(ex) 1*sech(sqrt(1/2)*ex).*exp(-1i*2*ex); %creates Soliton
            u0 = S(x)'; %Start data - note transform
            s=1; %constant
                reset(gpuDevice);
                A = diag(-2*ones(1,M)) + diag(ones(1,M-1),1) + diag(ones(1,M-1),-1);
                A = sparse(A);
                U = gpuArray(complex(zeros(M+2,NJ+1)));
                U0 = gpuArray(complex(u0(2:end-1,:)));
                UC = gpuArray(complex(size(U0)));
                D = dt/(2*dx^2); % D 
                DA = sparse(D*A);
                AB = sparse(gpuArray(complex(DA)));
                AA = sparse((1i*eye(size(A))+AB));
                AA = gpuArray(complex(AA));
                
                sdg = gpuArray(1*dt*1);%this is just a constant
                    for j=1:NJ
                        % This is the first shot at Crank Nicholson
                        rhs = (1i.*U0 - AB*U0 - sdg*(U0.*conj(U0)).*U0);
                        U1=AA\rhs;
                            if any(isnan(U1))
                                save debug.mat U1 AA AB U0 rhs;
                                return
                            end
                  %which becomes a predictor corrector by a second
                  %run through
                        UC = (U1+U0)/2;
                        rhs=(1i.*U0 - AB*U0 - sdg*(UC.*conj(UC)).*UC); %...
                        U1=(AA\rhs);
                        U0=U1;
                        U(:,j+1) = [0;U1;0];
                    end
               U(:,1) = u0;
            toc
            figure(1);
            surf(abs(U),'EdgeColor','none');
            figure(2)
            plot(x,abs(u0));
            grid on; hold on;

Joss Knight on 27 Aug 2023

Edited: Joss Knight on 27 Aug 2023

This could well be a bug in NVIDIA sparse direct solver that MATLAB uses, specific to your hardware (either the 3060 or the specific limited environment of a laptop with power and memory constraints). We can try to reproduce it on another 30xx card but I suspect this will also require the WDDM (Windows) driver.

What is it about xRange that is truly important for you? Almost certainly, if there is an environment-related issue it will be due to the size of your matrix, not the values, so can you just use a different dx grid increment for the larger values so your matrices stay the same size?

I have also confirmed that you get the same answers for iterative solvers if you use AA as a preconditioner. This is because the GPU iterative solvers use ILU for preconditioning and ILU gives an exact answer for tridiagonal matrices. So if you substitute, for instance, lsqr(AA,rhs,[],[],AA) for AA\rhs you should get the same answer. I also found that gpuArray(gather(AA)\gather(rhs)) was faster still. I agree that none are as fast as the original direct solve on the GPU, but I am just trying to find a way to unblock you. I used the CPU option on your script with xRange=54 and it went from 2.6s to 4s, so slower, but still pretty fast.
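The preconditioned iterative solve suggested above can be tried on the CPU first, to check that it reproduces the direct solve. This is a minimal sketch under stated assumptions: the size M, the grid xg and the soliton-like start vector are illustrative, and the boundary terms are omitted because they are zero in the posted script. Only the call lsqr(AA,rhs,[],[],AA) is taken from the comment; on the GPU the same call would be given gpuArray inputs.

```matlab
% Sketch only: compare a direct solve with lsqr preconditioned by AA itself.
M  = 2001;                                  % illustrative size
dx = 0.01;  dt = 0.01;                      % increments from the posted script
D  = dt/(2*dx^2);

e  = ones(M,1);
AB = D*spdiags([e -2*e e], [-1 0 1], M, M); % D times the second-difference matrix
AA = 1i*speye(M) + AB;                      % left-hand-side matrix

xg  = linspace(-10, 10, M).';               % illustrative grid
U0  = sech(xg/sqrt(2)).*exp(2i*xg);         % soliton-like start vector
rhs = 1i*U0 - AB*U0 - dt*(U0.*conj(U0)).*U0;

uDirect = AA\rhs;                           % direct sparse solve

% Default tolerance and iteration count; AA is passed as the preconditioner.
[uIter, flag] = lsqr(AA, rhs, [], [], AA);

relDiff = norm(uIter - uDirect)/norm(uDirect);
fprintf('lsqr flag = %d, relative difference = %.2e\n', flag, relDiff);
```

If the GPU direct solve returns NaN, the other fallback mentioned in the comment keeps the rest of the loop on the GPU and moves only the solve to the CPU: U1 = gpuArray(gather(AA)\gather(rhs)).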

Jonathan Wharrier on 28 Aug 2023

This morning has been very productive. Firstly, everything ran fine up to xRange = 200, which took about 3 minutes. I tried to be clever and went for 240, and it fell over, but I have spent a bit of time thinking about sparse. It appears to me that while the graphics card handles sparse matrices OK, it manages memory poorly. The "out of memory" error seemed to happen because the graphics card kept both versions. I have been carefully watching the Task Manager as the computer runs the code. I have therefore used main memory to create the large matrices and make them sparse, and only then loaded them onto the card. It turns out this works a treat: I can now get to xRange = 360. When I tried 400, because the card appeared to be less than "full", it fell over because my CPU RAM at 128 GB ran out of memory, not the card! Before I had the NVIDIA card, a single CPU-only run (xRange = 320) took 23+ hours, whereas this morning xRange = 360 took 5 minutes. I am still puzzled as to why

U1 = gather(AA)\gather(rhs);

works though !?

At ~30 s the CPU is creating matrices, and I presume that MATLAB has set up storage on the graphics card. The CPU is doing the work.

2 minutes in, the CPU has loaded the graphics card (GPU spike) and CPU memory usage drops from 128 GB.

The GPU is not overextended in terms of memory use and runs for about 2 minutes.

The last minutes are the CPU saving some very large files to the HD.

Jonathan Wharrier on 28 Aug 2023

My limited understanding of the implementation suggests that the problem is AS. This is where the program runs out of memory if xRange is very large (300+). Trying to create A, AD or AS on the graphics card runs out of memory much sooner (54+). This all sounded fine until I realised that U, which is the output, sits quite happily in the GPU card memory. It is the only complete matrix; all the others are mostly zeros and therefore sparse, or else they are simply vectors. So everything is now on the graphics card after a lot of one-off calculations at setup. However, that does not make sense when you get to the for loop. Your suggestion of rhs clearly works no matter how big xRange is, because I can get results for it even when the program falls over. That just leaves me the question of why U1 = gather(AA)\gather(rhs); works, and works quickly, whereas U1 = AA\rhs; does not and gives NaN. Just before I started writing this I thought that maybe I should look up what gather actually does.

gather - Transfer distributed array, Composite array or gpuArray to local workspace. This MATLAB function can operate on the following array data: On a gpuArray: transfers the elements of A from the GPU to the local workspace and assigns them to X.

I thought, because I skim read (it's an age thing!) that gather transferred stuff from GPU to ordinary memory. I'm not convinced that it actually does this. Couple of reasons. Firstly U1 is returned as a GPU array. If AA, a very large matrix, were being moved around I would expect to see something especially if gather is moving it to CPU RAM and then it would have to move the resultant back to put it into U1. I cannot see this in the diagrams previously shown. I just wondered if gather was a mechanism to provide a vector to the locations of the distributed data. This would be faster than looking for each piece individually. What is clear to me from the previous diagrams is that the CPU is NOT doing this calculation.

D = app.dt/(2*app.h^2); % D
A = diag(-2*ones(1,M)) + diag(ones(1,M-1),1) + diag(ones(1,M-1),-1);
A = sparse(A);
AD = sparse(D*A);
AS = sparse(1i*eye(size(A))+AD);%in this order I get the largest ...
% effective xRange and it is this line where finally OUT OF MEMORY ...
%causes a fail with size.
AB = gpuArray(complex(AD));
AA = gpuArray(complex(AS));
sdg = gpuArray(s*app.dt*app.gamma);%this is just a constant
clearvars A AD AS;
U = gpuArray(complex(zeros(M+2,app.NJ+1)));
U0 = gpuArray(complex(app.u0(2:end-1,:)));
for j=1:app.NJ
    % This is the first shot at Crank Nicholson
    rhs = (1i.*U0 - AB*U0 - sdg*(U0.*conj(U0)).*U0);
    U1 = gather(AA)\gather(rhs); %r=1;
    %which becomes a predictor corrector
    U1 = (U1+U0)/2;
    rhs = (1i.*U0 - AB*U0 - sdg*(U1.*conj(U1).*U1));
    U1 = gather(AA)\gather(rhs); %r=2;
    U0=U1;
    U(:,j+1) = [0;U1;0];
end
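A rough memory estimate may help explain why the line forming AS is where memory runs out: diag(...) and eye(size(A)) both create full M-by-M arrays before sparse is applied. The sketch below works through the arithmetic for xRange = 360 and shows a sparse-only assembly using spdiags and speye, as in the modified script in the answer. The byte counts per entry are assumptions about MATLAB's storage (8 bytes per real double, 16 per complex double, 8-byte sparse indices), and the variable names are reused from the thread purely for illustration.

```matlab
% Sketch only: estimate storage for the operators at xRange = 360 and
% assemble them without a dense intermediate. Byte counts are estimates.
xRange = 360;  dx = 0.01;  dt = 0.01;
M = numel(-xRange:dx:xRange) - 2;          % interior points, as in the thread

denseRealGB    = 8*M^2/1e9;                % dense real double, e.g. diag(...) or eye(...)
denseComplexGB = 16*M^2/1e9;               % dense complex double, e.g. 1i*eye(size(A))
nnzTri         = 3*M - 2;                  % nonzeros in a tridiagonal matrix
% Assumed CPU sparse layout: 16-byte complex value + 8-byte row index per
% nonzero, plus 8 bytes per column pointer.
sparseComplexMB = (24*nnzTri + 8*(M+1))/1e6;

fprintf('M = %d\n', M);
fprintf('dense real:     %.1f GB\n', denseRealGB);
fprintf('dense complex:  %.1f GB\n', denseComplexGB);
fprintf('sparse complex: %.1f MB\n', sparseComplexMB);

% Sparse assembly that never forms an M-by-M dense array.
D  = dt/(2*dx^2);
e  = ones(M,1);
A  = spdiags([e -2*e e], [-1 0 1], M, M);  % second-difference matrix
AD = D*A;                                  % plays the role of AD / AB
AS = 1i*speye(M) + AD;                     % plays the role of AS / AA

% Small-size check that spdiags reproduces the diag(...) construction.
Ms = 200;  es = ones(Ms,1);
A_sp    = spdiags([es -2*es es], [-1 0 1], Ms, Ms);
A_dense = diag(-2*ones(1,Ms)) + diag(ones(1,Ms-1),1) + diag(ones(1,Ms-1),-1);
sameMatrix = isequal(full(A_sp), A_dense);
```

Under these assumptions a single dense real M-by-M array at xRange = 360 is roughly 41 GB and its complex counterpart roughly 83 GB, while the tridiagonal matrix stored as sparse needs only a few megabytes.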

Joss Knight on 28 Aug 2023

Edited: Joss Knight on 28 Aug 2023

Do make sure you move the result of backslash back to the GPU so that subsequent operations take place on the GPU. So use gpuArray(gather(AA)\gather(rhs)) rather than just gather(AA)\gather(rhs), to ensure U1 is still a gpuArray on output.

As a developer of the gpuArray datatype, I'm usually - not always, but usually! - pretty confident in my assertions about its behaviour. Whatever you're seeing in the task manager is heavily confused by the fact that MATLAB's GPU functionality pools memory and operates lazily and asynchronously. That means the GPU isn't always being used when a line of code is run, it doesn't always finish when the next line of code is run, and memory no longer assigned to a variable is not necessarily released back to the system to show up as free in the Task Manager. Hopefully we will soon have our own monitor apps so you can more easily see this behaviour in MATLAB.

Worth reminding you perhaps that gather(AA) does not move AA to the CPU, it creates a temporary variable on the CPU and copies AA there. So the GPU memory is not released.

I'm glad you've found some compromises about where to do CPU work and where GPU. It takes quite a lot of memory to move a sparse CPU array to the GPU because a conversion between two different storage formats is required (CSC to CSR, if you care). It's usually best to create the data on the GPU from the outset, for instance by passing gpuArray data to spdiags.

Bruno Luong on 28 Aug 2023

Edited: Bruno Luong on 28 Aug 2023

@Joss Knight But the major part of the runtime is spent solving AA\rhs, not the rest. The advantage of solving the linear system on the GPU is lost when you force it onto the CPU.

The only reason for @Jonathan Wharrier to do the solve on the CPU is that the GPU is buggy beyond an xRange of 54.

Here are the three methods run together; clearly, mixing GPU and CPU (the third timing) is the slowest, at least on my PC.

>> CPUtest
Config = cpuarray + home_native_backslash
	Elapsed time is 2.025 seconds
Config = gpuarray + home_native_backslash
	Elapsed time is 0.906 seconds
Config = gpuarray + cpu_backslash
	Elapsed time is 3.191 seconds

So the question remains: what is the point? You are better off doing the whole calculation on the CPU. The hybrid slows things down by 1.1 seconds, more than 50%.

The code demo is here.

close all;
configs = struct('arraytype',  {@cpuarray,             @gpuarray,             @gpuarray},...
                 'solvertype', {@home_native_backslash, @home_native_backslash, @cpu_backslash});
for k = 1:length(configs)
    array  = configs(k).arraytype;
    solver = configs(k).solvertype;
    % CRANK NICHOLSON VERSION
    xRange = 100;
    dx = 0.01; %mesh grid increment size
    x = -xRange:dx:xRange;%mesh grid
    T = 5; %time for run
    dt = 0.01; %Time increment size
    NJ=T/dt; %number of iterations
    t= dt*(0:NJ); %time vector
    M=length(x);    % length of the mesh
    M=M-2;          % active length minus end points
    S  = @(ex) 1*sech(sqrt(1/2)*ex).*exp(1i*2*ex); %creates Soliton
    u0 = S(x)'; %Start data - note transform
    s=1; %constant
    ul = zeros(size(t)); %set boundary values
    ur = zeros(size(t)); %boundaries are zero
    reset(gpuDevice);
    Bdy=zeros(M,NJ+1);
    A = spdiags(ones(3*M,1)*[1 -2 1],[-1 0 1],M,M);
    U = array(complex(zeros(M+2,NJ+1)));
    U0 = array(complex(u0(2:end-1,:)));
    UC = array(complex(size(U0)));
    D = dt/(2*dx^2); % D
    Bdy(1,:)= D*ul;
    Bdy(end,:)=D*ur;
    %Bdy = sparse(Bdy);
    url = array(complex(ul));
    urr = array(complex((ur)));
    Bdy = array(complex(Bdy));
    AB = array(complex(D*A));
    AA = 1i*speye(size(A))+AB;
    AA = array(AA);
    sdg = array(1*dt*1);
    t0=tic;
    for j=1:NJ
        CrankNicolsonRhsFun = @(U) (1i.*U0 - AB*U0 - sdg*(U.*conj(U)).*U - Bdy(:,j)-Bdy(:,j+1));
        rhs = CrankNicolsonRhsFun(U0);
        U1 = solver(AA,rhs);
        UC = (U1+U0)/2;
        rhs = CrankNicolsonRhsFun(UC);
        U1 = solver(AA,rhs);
        U0=U1;
        U(:,j+1) = [url(j+1);U1;urr(j+1)];
    end
    t=toc(t0);
    fprintf('Config = %s + %s\n', func2str(array), func2str(solver));
    fprintf('\tElapsed time is %1.3f seconds\n', t);
    U = gather(U);
    U(:,1) = double(u0);
    figure('Name', sprintf('Config(%d)', k));
    surf(abs(U),'EdgeColor','none');
end
function C = home_native_backslash(A,B)
C=A\B;
end
function C = cpu_backslash(A,B)
C=gpuArray(gather(A)\gather(B));
end
function a = cpuarray(a) % a input is double array
end
function a = gpuarray(a)
a = gpuArray(a);
end

Jonathan Wharrier on 29 Aug 2023

Making a run with xRange of 360, as I have done this morning, means the process runs on a single thread. It is possible to use a multi-threaded parallel approach, which I did in the early versions of the app, but there is a trade-off: more threads share the memory, and therefore xRange must be smaller. The latter is unhelpful, so one thread it is. If I run this solely on the CPU and watch Task Manager, I get a significant hump where the processor loads up the memory and performs the "for" calculation. Specifically, at xRange 300+ I get about 4 per minute, which is two passes through the Crank-Nicolson/predictor-corrector calculation. I need a minimum of T = 25 with dt = 0.01, so 2500 goes through the CN/PC mill, which takes approximately 20 hours. That is one run. I have a newer computer, still an AMD R7 but a 128 GB 5000-series model rather than the 64 GB R7 2400 laptop. It takes about an hour less on the CPU.

This morning I have so far done 10 runs. Each complete run, with T = 36 and dt = 0.01, takes fractionally under 5 minutes. I do not claim to understand the why and the how, but the difference on my machine is remarkable. Because I am pushing the machine to its limits, it is a bit frustrating: the CPU does not much like 100% memory usage and has fallen over twice (though I have a workaround for the data), and the GPU falls over if "reset(gpuDevice)" does not completely clear it and free all the memory I have taken, though this seems to be infrequent. BUT overall the amount of data I am able to process has improved by an order of magnitude.

Joss Knight on 29 Aug 2023

Edited: Joss Knight on 30 Aug 2023

Agreed @Bruno Luong, this is what I see too, but it doesn't match Jonathan's figures which seemed to show an order of magnitude worse performance on CPU. If his figures had similar proportions to yours (and mine) I certainly wouldn't see much point in using the GPU, but also the original problem - that it's computationally intractable to run his code with high resolutions - is not there, since the CPU is only 2x slower. Most people are prepared to let things run twice as long if the alternative is something that occasionally crashes.

The GPU documentation is clear, it cannot always be faster, not all algorithms are straightforwardly parallelizable. If the matrix weren't tridiagonal, for instance, the GPU would be much slower than CPU. SVD is another algorithm that is highly problematic for GPU parallelization. Sparse in general is a peculiar problem, because GPU parallelization depends on uniformity but sparse data by its nature is non-uniform (except where it fits certain patterns like diagonality). That is why we don't provide a huge amount of sparse GPU functionality, because if it cannot be made faster than CPU we don't want people to use it only to be disappointed.

Bruno Luong on 29 Aug 2023

Edited: Bruno Luong on 29 Aug 2023

A "trick" to reduce CPU time is to use a decomposition object of the AA matrix. The timing is now comparable to the GPU.

Unfortunately there is no gpu support for decomposition.

>> CPUtest
Config = cpuarray + home_native_backslash
	Elapsed time is 1.980 seconds
Config = gpuarray + home_native_backslash
	Elapsed time is 0.896 seconds
Config = cpuarray + decomp
	Elapsed time is 1.156 seconds

with this demo script

close all;
configs = struct('arraytype',  {@cpuarray,              @gpuarray,              @cpuarray},...
                 'solvertype', {@home_native_backslash, @home_native_backslash, @decomp});
for k = 1:length(configs)
    array  = configs(k).arraytype;
    solver = configs(k).solvertype;
    % CRANK NICHOLSON VERSION
    xRange = 100;
    dx = 0.01; %mesh grid increment size
    x = -xRange:dx:xRange;%mesh grid
    T = 5; %time for run
    dt = 0.01; %Time increment size
    NJ=T/dt; %number of iterations
    t= dt*(0:NJ); %time vector
    M=length(x);    % length of the mesh
    M=M-2;          % active length minus end points
    S  = @(ex) 1*sech(sqrt(1/2)*ex).*exp(1i*2*ex); %creates Soliton
    u0 = S(x)'; %Start data - note transform
    s=1; %constant
    ul = zeros(size(t)); %set boundary values
    ur = zeros(size(t)); %boundaries are zero
    reset(gpuDevice);
    Bdy=zeros(M,NJ+1);
    A = spdiags(ones(3*M,1)*[1 -2 1],[-1 0 1],M,M);
    U = array(complex(zeros(M+2,NJ+1)));
    U0 = array(complex(u0(2:end-1,:)));
    UC = array(complex(size(U0)));
    D = dt/(2*dx^2); % D
    Bdy(1,:)= D*ul;
    Bdy(end,:)=D*ur;
    %Bdy = sparse(Bdy);
    url = array(complex(ul));
    urr = array(complex((ur)));
    Bdy = array(complex(Bdy));
    AB = array(complex(D*A));
    AA = 1i*speye(size(A))+AB;
    AA = array(AA);
    sdg = array(1*dt*1);
    t0=tic;
    if isequal(solver, @decomp)
        AA= decomposition(AA);
    end
    for j=1:NJ
        CrankNicolsonRhsFun = @(U) (1i.*U0 - AB*U0 - sdg*(U.*conj(U)).*U - Bdy(:,j)-Bdy(:,j+1));
        rhs = CrankNicolsonRhsFun(U0);
        U1 = solver(AA,rhs);
        UC = (U1+U0)/2;
        rhs = CrankNicolsonRhsFun(UC);
        U1 = solver(AA,rhs);
        U0=U1;
        U(:,j+1) = [url(j+1);U1;urr(j+1)];
    end
    t=toc(t0);
    fprintf('Config = %s + %s\n', func2str(array), func2str(solver));
    fprintf('\tElapsed time is %1.3f seconds\n', t);
    U = gather(U);
    U(:,1) = double(u0);
    figure('Name', sprintf('Config(%d)', k));
    surf(abs(U),'EdgeColor','none');
end
function C = home_native_backslash(A,B)
C=A\B;
end
function C = decomp(A,B)
C=A\B; % A is a decomposition object, not a matrix
end
function C = cpu_backslash(A,B)
C=gpuArray(gather(A)\gather(B)); % what is the point compare to purely cpu?
end
function a = cpuarray(a) % a input is double array
end
function a = gpuarray(a)
a = gpuArray(a);
end

Bruno Luong on 30 Aug 2023

@Jonathan Wharrier If you are satisfied with the answer and it sheds some light, even if it still doesn't explain a few obscure aspects, could you accept the answer?

Jonathan Wharrier on 30 Aug 2023

I am happy to do so. I have found the whole experience very helpful.

Tx Jo
