MATLAB Answers

Why are multiple GPUs slower than one GPU?

Mantas Vaitonis on 4 Oct 2018
Commented: Mantas Vaitonis on 7 Oct 2018
Dear All,
My machine has 2 GPUs. Why is moving data to multiple GPUs about 5x slower in my case than working with just one GPU? Environment: Windows 10, MATLAB R2017b. Here is the code I use as an example:
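clear;

% Case 1: move both arrays to the default GPU and time the transfer
dd1 = rand(100000, 200, 10);
cc1 = rand(100000, 200, 10);
tic
dd = gpuArray(dd1);
cc = gpuArray(cc1);
wait(gpuDevice);   % block until the GPU has finished
toc

% Case 2: one worker per GPU, each worker moves half of the data
nGPUs = gpuDeviceCount();
parpool('local', nGPUs);
d1 = rand(100000, 200, 10);
d2(1) = {d1(1:50000, :, :)};
d2(2) = {d1(50001:100000, :, :)};
c1(1:nGPUs) = {zeros(50000, 200, 10)};
tic
parfor i = 1:nGPUs
    gpuDevice(i);          % select GPU i on this worker
    c = gpuArray(c1{i});
    d = gpuArray(d2{i});
end
toc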

  6 Comments

Mantas Vaitonis on 6 Oct 2018
What I managed to do is run some calculations on one GPU and then on both. With parfor and the data divided between the two GPUs, the calculation itself proved to be faster on two GPUs. However, moving the data to two GPUs still takes more time than moving it to just one. Is there some way to improve this? The code I use now is below:
clear;

% With the default single GPU
dd1 = rand(10000, 4000, 10);
cc1 = zeros(10000, 4000, 10);
tic   % time to move the data to one GPU and do the calculation on it
dd = gpuArray(dd1);
cc = dd * 2;
wait(gpuDevice);
toc   % end of the single-GPU case

% With a pool of 2 workers, one per GPU
nGPUs = gpuDeviceCount();
parpool('local', nGPUs);
d1 = rand(10000, 4000, 10);
d2(1) = {d1(1:5000, :, :)};
d2(2) = {d1(5001:10000, :, :)};
c1(1:nGPUs) = {zeros(5000, 4000, 10)};
tstart = tic;   % total time: moving the data to the GPUs plus the calculation
parfor i = 1:nGPUs
    gd = gpuDevice(i);
    c = gpuArray(c1{i});
    d = gpuArray(d2{i});
    tic   % time for the calculation on the GPU
    c = d * 2;
    wait(gd);
    time = toc;
    fprintf('Time on GPU: %f\n', time);
end
toc(tstart)
Joss Knight on 6 Oct 2018
You're not just moving data to two GPUs; you're moving it from the client to the pool, and then onto the GPUs. Communicating between processes takes time. Also, you don't call wait(gd) before you call tic, which means the copy-to-device hasn't finished when you start timing.
In a real multi-GPU example you would be doing significant computation and constructing data on the pool, rather than on the client. This example is all overhead and so isn't very representative. You would see a similar issue if you opened a pool of only one worker.
Also, you don't need to select the gpuDevice since selecting a different GPU on each worker is done automatically for communicating jobs.
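The points above can be put together in a rough sketch. It assumes a machine with at least one supported GPU and Parallel Computing Toolbox, reuses the array sizes from the question, and uses spmd rather than parfor only as an illustration. The variable names rowsPerGPU and elapsed are made up for this example. Each worker creates its data directly on its own GPU, so nothing large is sent from the client, and wait is called before the timer starts.

```matlab
% Sketch only: build each worker's data on the worker, not on the client.
nGPUs = gpuDeviceCount();
if isempty(gcp('nocreate'))
    parpool('local', nGPUs);     % one worker per GPU
end

rowsPerGPU = 5000;               % assumed split: 10000 rows over 2 GPUs

spmd
    gd = gpuDevice;              % GPU currently selected on this worker
    % Create the data directly in GPU memory (no client-to-pool transfer).
    d = rand(rowsPerGPU, 4000, 10, 'gpuArray');
    wait(gd);                    % creation must finish before timing starts
    t = tic;
    c = d * 2;
    wait(gd);                    % computation must finish before reading the timer
    elapsed = toc(t);
    fprintf('Worker %d, GPU %d: %f s\n', labindex, gd.Index, elapsed);
end
```

After the spmd block, elapsed is a Composite on the client, so elapsed{1} holds the time measured on the first worker.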
Mantas Vaitonis on 7 Oct 2018
Yes, you are right. I did not select gpuDevice and constructed the data on the pool; the speed then improved significantly, and it is now faster than one GPU. But that is only achieved when the data is constructed on the pool. If the data is already defined on the client, is there no way to overcome the overhead? Maybe you could help me a bit more? In my experiment I would load data of size 5000000x300x50 from a file. How should I move the data to the pool? And what would be the way to divide this data between both GPUs?
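As a starting point for the last question, one possible way to divide the data is by contiguous blocks along the first dimension, one block per GPU worker. The sketch below only computes the row ranges. The names nRows, nWorkers, edges, firstRow and lastRow are made up for this example, and an even split by rows is an assumption, not something the thread settles.

```matlab
% Sketch only: contiguous row ranges, one per GPU worker.
nRows    = 5000000;   % first dimension quoted in the question
nWorkers = 2;         % one worker per GPU, as in the question

% Block boundaries: 0, then the last row of each worker's block.
edges    = round(linspace(0, nRows, nWorkers + 1));
firstRow = edges(1:end-1) + 1;
lastRow  = edges(2:end);

for k = 1:nWorkers
    fprintf('Worker %d: rows %d to %d\n', k, firstRow(k), lastRow(k));
end
```

Each worker would then read or build only rows firstRow(k):lastRow(k) for its own index k, so that its block does not have to pass through the client. How those rows are read depends on the file format, which the question does not state.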


Answers (0)
