I have two same size 4D gpuArrays f(NxMxLxK) and f1(NxMxLxK) and I need to left divide each column, for that this code is implemented, which become a bottleneck in my algorithm and uses about 95% of runtime:
beta2= arrayfun(@(n) f(:,n)\f1(:,n), 1:numel(f)/size(f,1));
Result beta2 is vector. Is there a way to speed up this code? I assume the latency is due to fact that inside arrayfun is for loop which moves data from cpu to gpu and so on.