Multiple GPU usage in Parallel

There just isn't a ton of information out there about using multiple GPUs.
I apologize in advance for posting pseudocode rather than exact code. I've also felt my way through the available MATLAB parallel structures to get the form that I want.
What I don't get is the performance I want. Here is what I'm doing:
Number_of_Things = 4;
parpool(4);
spmd(4)  % Parallel region 1: allocate GPUs. Yes, my box has 4 GPUs.
    gd = gpuDevice;
end
% ... single-processor data initialization happens here ...
spmd(4)  % Parallel region 2: push the local data to the different GPUs.
    gpu_data = gpuArray(localdata(labindex));
end
spmd(4)  % Parallel region 3: do the work.
    process(gpu_data);
end
spmd(4)  % Parallel region 4: gather the data.
    output(labindex) = gather(results);
end
Now, please recognize that the code I've pseudocoded does what I want it to do.
I've put things in this form for timing purposes.
I've verified that I'm using 4 different GPUs.
As I vary Number_of_Things, the timing for regions 1, 2 & 4 shows an increase as the number of things increases. I expect that for regions 1 and 4, and I accept it for region 2, since a good bit of data is being transferred.
What I don't understand is the linear increase in the time of region 3 as the number of things increases. If I pull out the references to GPUs and just use standard processors, my time goes large but stays flat with respect to the number of things. I don't understand why my timing is not flat in the processing region and would appreciate thoughts. My only explanation is that transferring the commands in region 4 to the different GPUs is causing interference and slowing things down in a linear way.
A single thing takes 40 seconds to process; each additional thing adds 10 seconds.

8 Comments

Joss Knight on 18 Apr 2017
I don't understand why you would think that your processing time wouldn't go up as you increase the number of things? This only works with a GPU if it isn't fully utilized. If you do a small matrix multiplication on your GPU, small enough that not all the cores are busy, then maybe you can hope that a bigger matrix multiply wouldn't take any longer. But to go from, for instance, solving a 500x500 linear system to solving a 1000x1000 linear system is definitely going to take longer. There's more data to move around, and you'll need more iterations to complete the solution.
So really it depends entirely what you're doing inside region 3.
I also can't explain why the GPU performance is bound to the number of things but the CPU isn't without knowing more. Whatever it is you're doing is apparently memory bound, which means it's affected by the number of things. Whereas on the CPU it's probably compute bound.
Do you actually divide your computation up into 4 spmd blocks? Any particular reason?
David Short on 21 Apr 2017
Edited: Walter Roberson on 27 Apr 2017
Joss,
"I don't understand why you would think that your processing time wouldn't go up as you increase the number of things." I guess because I'm spoiled and used to seeing parfor loops give a pretty flat response as you add more workers.
I suspect I've done a poor job of explaining things. To use your example: I want to solve 4 independent 5000x5000 linear systems using 4 GPUs, and I was hoping that would take about as much time as solving a single 5000x5000 linear system on a single GPU. Does that help?
Hopefully the example below helps. It's a reasonable approximation of what I'm doing (in this case on a system with 8 GPUs).
Here is the output:
1 allocate: 4.1681 send: 10.435 compute: 2.7707 gather: 1.0183
2 allocate: 4.941 send: 11.4104 compute: 3.1639 gather: 1.2482
3 allocate: 6.0978 send: 12.3541 compute: 3.7005 gather: 1.5098
4 allocate: 7.0844 send: 13.6551 compute: 3.8716 gather: 1.736
5 allocate: 8.3564 send: 14.331 compute: 4.3651 gather: 2.0024
6 allocate: 9.4381 send: 15.3916 compute: 4.7052 gather: 2.2718
7 allocate: 11.038 send: 16.8184 compute: 5.0789 gather: 2.4995
8 allocate: 11.8256 send: 18.4739 compute: 5.2926 gather: 2.891
Notice that as we add more independent GPUs, the time of each segment increases. In my actual case, the send and gather times are trivial, but the compute time is much larger and the time expansion as I add more GPUs is even more dramatic. In my case each single system will take 40 seconds, and adding another system will add about 10 seconds to the execution time.
function test_two
matrix_size = 5000;
pep = randi(255,matrix_size,matrix_size,12);
for j = 1:8
    num_chan = j;
    poolobj = gcp('nocreate');
    delete(poolobj);
    parpool(num_chan);
    tic
    spmd
        gd = gpuDevice;
    end
    a = toc;
    spmd
        pep_gpu = gpuArray(pep(:,:,labindex));
    end
    b = toc;
    spmd
        R_gpu = work_for_test_two(pep_gpu);
    end
    c = toc;
    spmd
        R = gather(R_gpu);
    end
    d = toc;
    clear R R_gpu pep_gpu;
    disp([' ' num2str(j) ' allocate: ' num2str(a) ' send: ' num2str(b-a) ...
        ' compute: ' num2str(c-b) ' gather: ' num2str(d-c)]);
end

function R = work_for_test_two(I)
f = fftshift(fft2(I));
thresh = 0.8*max(f(:));
mask = f > thresh;
proc = f;
proc(mask) = 0;
R = ifft2(ifftshift(proc));
If I remove all the housekeeping associated with gpus, I end up with output that looks like....
1 compute: 56.7
2 compute: 57.7
3 compute: 59.3
4 compute: 61.4
5 compute: 63.2
...
Does that help?
Joss Knight on 25 Apr 2017
Hi, I apologize if my answer is curt - for a complete response to your specific code example you are best off contacting MathWorks support.
If you are using a parallel pool and you have multiple GPUs then you can indeed run them all in parallel. Let's look at your code:
  1. Stop opening and closing spmd blocks with every command. You can do more than one thing inside spmd, but every time you close and then reopen the block the workers are forced to synchronize, which is costly.
  2. There's no need to pass an argument to spmd if you are using all the workers in the pool.
  3. There is a cost to using a parallel pool, and particularly to using spmd, which is intended for communicating work. Synchronization between workers takes time, and more time for more workers.
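Point 1, as a hedged sketch (this is not code from the thread; variable names are carried over from the example above, and the per-worker slicing via labindex is an assumption), would collapse the four regions into one block so the workers only synchronize once:

```matlab
% Sketch: all four steps inside a single spmd block, so the pool
% synchronizes once at the closing 'end' instead of four times.
spmd
    gd      = gpuDevice;                       % one GPU per worker
    pep_gpu = gpuArray(pep(:,:,labindex));     % send this worker's slice
    R_gpu   = work_for_test_two(pep_gpu);      % compute on the GPU
    R       = gather(R_gpu);                   % pull the result back
end
```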
Since your workers don't need to communicate with each other, really you shouldn't be using SPMD at all, you should be using parfor or parfeval. I even suggest you disable SPMD in your pool because that stops it unnecessarily creating an MPI communicator, which you don't need. Everything should proceed much more easily then.
matrix_size = 5000;
pep = randi(255,matrix_size,matrix_size,12);
for j = 1:8
    num_chan = j;
    poolobj = gcp('nocreate');
    delete(poolobj);
    parpool(num_chan, 'SpmdEnabled', false);
    tic
    parfor i = 1:j
        gd = gpuDevice;
        pep_gpu = gpuArray(pep(:,:,i));
        R_gpu = work_for_test_two(pep_gpu);
        R = gather(R_gpu);
    end
    toc
    clear R R_gpu pep_gpu;
end
In many ways parfeval is more appropriate even than parfor here, in fact parfevalOnAll is what you want. parfor schedules work in an opaque way and has to do some analysis to decide what data to send to each worker. However, parfeval is a bit more complicated to use.
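Since parfevalOnAll is mentioned but never actually shown in the thread, here is a hedged sketch of how it might be combined with parfeval for this job (the pool handle p, the data pep, and the helper stuff are assumed from the surrounding code, and "one GPU per worker by default" is an assumption about the machine):

```matlab
p = parpool(4, 'SpmdEnabled', false);
% Run the one-off GPU selection on every worker up front, so device
% initialization cost is not mixed into the compute timing.
wait(parfevalOnAll(p, @gpuDevice, 0));
% Then dispatch the independent slices asynchronously.
for i = 1:4
    F(i) = parfeval(p, @stuff, 1, pep(:,:,i));
end
for i = 1:4
    [idx, R] = fetchNext(F);   % collect results as they finish
    out(:,:,idx) = R;
end
```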
David Short on 26 Apr 2017
Edited: David Short on 26 Apr 2017
Thank you, Joss. I used multiple spmd blocks in the example to show how much time was being used in the different steps. In my actual code, I'm a bit shocked at how much extra time each extra channel costs, and I was/am trying to figure out where the cost is. In my code the extra time is mostly in the compute section, where I expect performance to be flat.
I will do some reading on the parfevalOnAll statement.
David Short on 27 Apr 2017
Edited: Walter Roberson on 27 Apr 2017
In case anyone is interested, recasting the code with a parfeval and fetchNext structure does produce marginal improvements in the test case, but it still illustrates my primary concern: adding independent workers, each using another GPU, produces a consistent increase in processing time.
In the table below, the first column is the number of workers and GPUs, the second is the time spent allocating the GPUs, the third is the time spent doing the work using a parfor structure, and the fourth is the time spent doing the work using a parfeval structure. Note that the timing IS better using the parfeval structure, but in both the parfor and parfeval cases the time consistently increases as we add workers doing independent work. It's not just data communication time.
2 allocate: 5.3017 run parfor: 4.689 run feval: 2.9313
3 allocate: 6.1161 run parfor: 5.7515 run feval: 4.4275
4 allocate: 7.1411 run parfor: 6.2636 run feval: 4.9099
5 allocate: 8.1599 run parfor: 6.6393 run feval: 5.7411
6 allocate: 9.0779 run parfor: 8.5535 run feval: 7.9715
function test_parfeval
matrix_size = 5000;
pep = randi(255,matrix_size,matrix_size,12);
for j = 2:6
    num_chan = j;
    poolobj = gcp('nocreate');
    delete(poolobj);
    p = parpool(num_chan, 'SpmdEnabled', false);
    tic
    parfor i = 1:j
        gd = gpuDevice;
    end
    a = toc;
    parfor i = 1:j
        pep_gpu = gpuArray(pep(:,:,i));
        R_gpu = work_for_test_two(pep_gpu);
        R(:,:,i) = gather(R_gpu);
    end
    b = toc;
    for i = 1:j
        F(i) = parfeval(p, @stuff, 1, pep(:,:,i));
    end
    c = toc;
    for i = 1:j
        [idx, RR] = fetchNext(F);
        dfs(:,:,idx) = RR;
    end
    d = toc;
    clear R R_gpu pep_gpu dfs RR F;
    disp([' ' num2str(j) ' allocate: ' num2str(a) ' run parfor: ' num2str(b-a) ...
        ' run feval: ' num2str(d-b)]);
end

function R = stuff(pep)
pep_gpu = gpuArray(pep);
R_gpu = work_for_test_two(pep_gpu);
R = gather(R_gpu);

function R = work_for_test_two(I)
f = fftshift(fft2(I));
thresh = 0.8*max(f(:));
mask = f > thresh;
proc = f;
proc(mask) = 0;
R = ifft2(ifftshift(proc));
David Short on 27 Apr 2017
Edited: David Short on 27 Apr 2017
And one step further: this is the kind of performance my actual job shows.
workers parfor parfeval
1 58.7 37.5
2 68.0 48.5
3 77.8 59.7
4 93.1 85.6
So, in my case, going from one worker to 4 roughly doubles (or more) the time it takes to process a bit of signal.
Walter Roberson on 27 Apr 2017
David Short:
When you are posting code, please use your cursor to select it, and then click on the "{} Code" button. That formats the code so the Answers system knows to present it as code.
David Short on 27 Apr 2017
Thanks Walter.


Answers (0)

Asked: 17 Apr 2017
Commented: 27 Apr 2017
