MATLAB Answers

Perfomance Loss of Matrix-Vector Multilplication on GPU with Array Indexing

2 views (last 30 days)
I have a large matrix A and a vector B. I want to do a partial multiplication on GPU using array indexing but the peformance is much lower than doing a full A*B. Below is a simple example of what I am trying to do:
A = rand(20000,'gpuArray');
B = rand(20000,1,'gpuArray');
C = A(8001:18000,1:end)*B;
GPU Device: Tesla V100
MATLAB 2020a
Any suggestion on how to improve the performance? Thank you.


Sign in to comment.

Accepted Answer

Edric Ellis
Edric Ellis on 30 Apr 2020
Unfortunately, the expression A(8001:18000,:) requires a strided memory copy. Matrices in MATLAB (even on the GPU) are stored in column-major format, so picking out only certain rows is much less efficient than picking out only certain columns.
There's a trick you can use though that takes advantage of the fact that gpuArray matrix multiplication is optimised for the transposed-times case. Try instead pre-transposing A (this is relatively expensive, but perhaps you can do it only once) and then doing:
A(:, 8001:18000).' * B;
This uses the much-faster indexing pattern, and is about ~2x faster on my GPU.


Show 2 older comments
Afshin Ahmadi
Afshin Ahmadi on 1 May 2020
Interestingly, your solution is taking twice more time compared with the normal multiplication on Tesla V100! I guess they are different in the architecture. Thank you tho.
Edric Ellis
Edric Ellis on 4 May 2020
Strange, I just tried on a WIN64 machine here with a V100, and got the following result:
t1 =
t2 =
(This was using R2020a).
Afshin Ahmadi
Afshin Ahmadi on 4 May 2020
I tried again and it seems your solution is quite fast when the block size is small, which is exactly what I need. Thank you so much for the help! I will just include some information here for the people who are interested in doing the same thing.
A = gpuArray.rand(20000);
B = gpuArray.rand(20000,1);
At = A.';
t1 = gputimeit(@() At(:,500:2000).'*B)
t2 = gputimeit(@() At(:,500:5000).'*B)
t3 = gputimeit(@() At(:,500:10000).'*B)
t4 = gputimeit(@() A(500:2000,:)*B)
t5 = gputimeit(@() A(500:5000,:)*B)
t6 = gputimeit(@() A(500:10000,:)*B)
t7 = gputimeit(@() A*B)
Execution time:
t1 = 4.4423e-04
t2 = 0.0010
t3 = 0.0020
t4 = 0.0035
t5 = 0.0051
t6 = 0.0076
t7 = 0.0044
(MATLAB R2020a, Tesla V100, Linux)

Sign in to comment.

More Answers (0)

Translated by