# Performance Loss of Matrix-Vector Multiplication on GPU with Array Indexing

Afshin Ahmadi on 29 Apr 2020
Commented: Afshin Ahmadi on 4 May 2020
Hi,
I have a large matrix A and a vector B. I want to do a partial multiplication on the GPU using array indexing, but the performance is much lower than doing a full A*B. Below is a simple example of what I am trying to do:
A = rand(20000,'gpuArray');
B = rand(20000,1,'gpuArray');
C = A(8001:18000,1:end)*B;
GPU Device: Tesla V100
MATLAB 2020a
Any suggestion on how to improve the performance? Thank you.

Edric Ellis on 30 Apr 2020
Unfortunately, the expression A(8001:18000,:) requires a strided memory copy. Matrices in MATLAB (even on the GPU) are stored in column-major format, so picking out only certain rows is much less efficient than picking out only certain columns.
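To make the column-major point concrete, here is a small sketch on an ordinary CPU matrix. The 3-by-4 matrix M is purely illustrative and is not part of the original question. Each element is set to its own linear index, so the values show where each element sits in memory.

```matlab
% Illustrative only: each element of M equals its linear (memory) index.
M = reshape(1:12, 3, 4);

colBlock = M(:, 2).';   % elements of column 2: 4 5 6 (adjacent in memory)
rowBlock = M(2, :);     % elements of row 2: 2 5 8 11 (spread across memory)

colStride = diff(colBlock);   % all ones: a column is one contiguous run
rowStride = diff(rowBlock);   % all threes: a row jumps by the number of rows
```

A block of columns is therefore one contiguous chunk, whereas a block of rows has to be gathered from a separate location in every column, which is the strided copy described above.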
There's a trick you can use though that takes advantage of the fact that gpuArray matrix multiplication is optimised for the transposed-times case. Try instead pre-transposing A (this is relatively expensive, but perhaps you can do it only once) and then doing:
A(:, 8001:18000).' * B;
This uses the much-faster indexing pattern, and is about 2x faster on my GPU.
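As a rough sketch of how the two forms can be checked against each other, the snippet below computes the product both ways and compares the results. The size n and the row block are assumptions, scaled down from the question so the check runs quickly. The GPU guard assumes Parallel Computing Toolbox, and the code falls back to CPU arrays when no GPU is found.

```matlab
% Sketch: row-indexed multiply versus pre-transposed, column-indexed multiply.
n    = 4000;          % assumed size, smaller than the 20000 in the question
rows = 1601:3600;     % assumed row block, scaled down from 8001:18000

A = rand(n);
B = rand(n, 1);

% Use the GPU only if one is available; otherwise stay on the CPU.
if exist('gpuDeviceCount') && gpuDeviceCount > 0
    A = gpuArray(A);
    B = gpuArray(B);
end

At = A.';                  % one-off transpose, reused for every multiply

C1 = A(rows, :) * B;       % row indexing on the original matrix
C2 = At(:, rows).' * B;    % column indexing on the transposed matrix

% Both forms should agree up to floating-point rounding.
relErr = max(abs(C1 - C2)) / max(abs(C1));
if isa(relErr, 'gpuArray')
    relErr = gather(relErr);   % bring the scalar back to the CPU
end
```

On a GPU, each expression can then be timed by wrapping it in gputimeit, for example gputimeit(@() At(:, rows).' * B).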

Afshin Ahmadi on 1 May 2020
Interestingly, your solution takes twice as long as the normal multiplication on my Tesla V100! I guess the architectures are different. Thank you though.
Edric Ellis on 4 May 2020
Strange, I just tried on a WIN64 machine here with a V100, and got the following result:
t1 =
1.6677e-04
t2 =
4.4944e-04
(This was using R2020a).
Afshin Ahmadi on 4 May 2020
I tried again and it seems your solution is quite fast when the block size is small, which is exactly what I need. Thank you so much for the help! I will just include some information here for the people who are interested in doing the same thing.
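A = gpuArray.rand(20000);
B = gpuArray.rand(20000,1);
At = A.';  % transpose once, outside the timed code
% Column indexing on the pre-transposed matrix
t1 = gputimeit(@() At(:,500:2000).'*B)
t2 = gputimeit(@() At(:,500:5000).'*B)
t3 = gputimeit(@() At(:,500:10000).'*B)
% Row indexing on the original matrix
t4 = gputimeit(@() A(500:2000,:)*B)
t5 = gputimeit(@() A(500:5000,:)*B)
t6 = gputimeit(@() A(500:10000,:)*B)
% Full multiplication, for reference
t7 = gputimeit(@() A*B)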
Execution time (in seconds, as returned by gputimeit):
t1 = 4.4423e-04
t2 = 0.0010
t3 = 0.0020
t4 = 0.0035
t5 = 0.0051
t6 = 0.0076
t7 = 0.0044
(MATLAB R2020a, Tesla V100, Linux)
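For a quick comparison, the speedup of the transposed form over row indexing can be worked out from the timings reported above. This is only arithmetic on the rounded figures from this post, so the ratios are approximate, and the variable names are illustrative.

```matlab
% Timings copied from the results above, in seconds.
tTransposed = [4.4423e-04, 0.0010, 0.0020];   % t1, t2, t3
tRowIndexed = [0.0035,     0.0051, 0.0076];   % t4, t5, t6
blockEnd    = [2000, 5000, 10000];            % last row of each block

speedup = tRowIndexed ./ tTransposed;         % roughly 7.9, 5.1 and 3.8

for k = 1:numel(blockEnd)
    fprintf('rows 500:%d  speedup %.1fx\n', blockEnd(k), speedup(k));
end
```

The advantage shrinks as the block grows, which matches the observation that the transposed approach pays off most for small blocks.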