Implicit expansion with arrayfun (CPU vs GPU)

I find it very convenient that MATLAB has allowed implicit expansion since R2016b (for an explanation, see this nice article: https://blogs.mathworks.com/loren/2016/10/24/matlab-arithmetic-expands-in-r2016b/?s_tid=blogs_rc_1).
I was then puzzled to discover that arrayfun on the CPU does not allow it, while arrayfun called with gpuArrays does. Below is an MWE demonstrating this behavior.
Let me explain it quickly: if I have two vectors, x of size [2,1] and y of size [1,2], I can compute the sum x+y and get a 2×2 matrix, as intended. This is better than the uglier and more memory-intensive
repmat(x,1,2)+repmat(y,2,1)
Unfortunately, this does not work with arrayfun on the CPU!
Since I write code with both normal arrays and gpuArrays, I find this difference in arrayfun's behavior quite misleading. It would be great if MATLAB allowed implicit expansion for arrayfun on the CPU as well. When I have large arrays, duplicating dimensions with repmat takes a lot of memory.
%% Demonstration of implicit expansion support in MATLAB and arrayfun
% This script shows that:
% 1) MATLAB supports implicit expansion for standard array operations.
% 2) arrayfun on the GPU supports implicit expansion.
% 3) arrayfun on the CPU does NOT support implicit expansion.
%
% Implicit expansion allows a 2×1 vector to be added to a 1×2 vector,
% producing a 2×2 matrix.
clear; clc; close all;
% Define test vectors
x = [1; 2]; % Column vector (2×1)
y = [1, 2]; % Row vector (1×2)
%% Implicit expansion using standard MATLAB operations
F1 = myadd(x, y);
%% Implicit expansion using arrayfun on the GPU
F2 = arrayfun(@myadd, gpuArray(x), gpuArray(y));
%% Attempt implicit expansion using arrayfun on the CPU (expected to fail)
try
F3 = arrayfun(@myadd, x, y);
catch ME
fprintf('CPU arrayfun error (expected):\n%s\n\n', ME.message);
end
%% Function myadd
function F = myadd(x, y)
% Element-wise addition
F = x + y;
end

10 comments

Catalytic on 4 Jan 2026
It would be great if Matlab could allow implicit expansion also on arrayfun cpu. When I have large arrays, duplicating dimensions with repmat takes a lot of memory.
It's hard to imagine why you would ever be using arrayfun on the CPU when you have large arrays. It would be super-slow.
I can understand that sometimes you want to run CPU and GPU versions of the same code side by side to demonstrate speed-up, but in the case of arrayfun on large arrays, it's so obvious in advance that the CPU version will be vastly slower. How informative could such a comparison ever be?
It is not difficult to imagine wanting to arrayfun() between (for example) a 10000 x 1 array, and a 1 x 15000 array, and hoping that it is not necessary to explicitly form 10000 x 15000 input arrays to make it work.
10000 * 8 + 15000 * 8
ans = 200000
ans / 10^9
ans = 2.0000e-04
10000 * 15000 * 8 * 2 %two input arrays
ans = 2.4000e+09
ans/10^9
ans = 2.4000
Big difference.
Catalytic on 4 Jan 2026 (edited)
It is not difficult to imagine caring about that on the GPU, but even attempting a computation of that size on the CPU, knowing that it will run with the speed of a for-loop over ~10^9 elements, is difficult to imagine.
Also, the extra memory consumption in your example isn't really all that dramatic - only twice what would be consumed in an implicit expansion implementation. You've basically saved a 1 GB temp array. What's 1 GB by today's standards of RAM?
Paul on 4 Jan 2026
Assume that the underlying function cannot be executed on the GPU.
In this case, are you suggesting that arrayfun should not be used if the arrays are large but all the same size?
What if the arrays are not large, but are compatible? As it stands now, arrayfun cannot be used even though it might be a very good approach if it supported implicit expansion.
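For example, this fails on the CPU today even though the operands are compatible in the implicit-expansion sense:

```matlab
x = [1; 2];        % 2x1
y = [10, 20, 30];  % 1x3, compatible with x
x + y              % implicit expansion gives a 2x3 result
try
    arrayfun(@plus, x, y)  % CPU arrayfun requires equal sizes, so this errors
catch ME
    disp(ME.message)
end
```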
Catalytic on 4 Jan 2026 (edited)
If no vectorization or parallelization approaches are available, then I think the benefits of arrayfun over a plain old for-loop are pretty narrow in the scenarios you describe -- basically just syntactic sugar. I also speculate that they are pretty rare use cases. With such rare and marginal benefits, I can see why TMW would question whether it warrants developer time.
Paul on 4 Jan 2026 (edited)
Frankly, I've never thought about using arrayfun with implicit expansion, so I don't really know how useful it would or wouldn't be. But I can definitely envision use cases where it might come in handy and save lines of code. Whether eliminating such lines of code is worth it is in the eye of the beholder.
Consider a situation where I have discrete time filter coefficients stored in two cell arrays: b for the numerator, and a for the denominator. Sometimes I want unique pairs of numerators and denominators (numel(b) == numel(a)), sometimes I want to analyze multiple sets of denominators with one numerator, and sometimes I want to analyze multiple numerators with one denominator, as shown below.
Define three sets of numerator coefficients and one set of denominator coefficients
rng(100);
w = linspace(0,pi,1000);
b = mat2cell(rand(3,2),[1,1,1]);
a = {[1 2]};
Define a wrapper around freqz to compute the frequency response of a filter
myfreqz = @(b,a) freqz(b{1},a{1},w);
Call @Matt J's function as defined in this Answer to compute all three responses with implicit expansion of the denominator. I changed two lines of code in Matt's function to return a cell array rather than a double as Matt's function does not (yet?) support the UniformOutput option.
h = arrayfunImplExp(myfreqz,b,a);
Verify
isequal(vertcat(h{:}),[freqz(b{1},a{1},w);freqz(b{2},a{1},w);freqz(b{3},a{1},w)])
ans = logical
1
To be sure, I could have written a loop with a couple of if statements to check isscalar on b or a to know how to call freqz inside the loop, but I rather like the one-liner.
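For reference, the loop that the one-liner replaces might look something like this (a sketch, reusing b, a, and w from above; the min(k, numel(...)) indexing stands in for the isscalar checks):

```matlab
% Sketch: explicit loop handling the scalar-expansion cases by hand
n = max(numel(b), numel(a));
h = cell(n, 1);
for k = 1:n
    bk = b{min(k, numel(b))};  % reuse the single numerator if b is scalar
    ak = a{min(k, numel(a))};  % reuse the single denominator if a is scalar
    h{k} = freqz(bk, ak, w);
end
```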
function out = arrayfunImplExp(fun, varargin)
%ARRAYFUNIMPLEXP arrayfun with implicit expansion (CPU)
%
% NOTE:
% Useful for CPU-only execution. On the GPU, use arrayfun instead,
% which implements its own implicit expansion.
% Number of inputs and maximum dimensionality
numArgs = numel(varargin);
numDims = max(cellfun(@ndims, varargin));
% Collect per-input sizes (row-wise)
sizes = cellfun(@(c) size(c, 1:numDims), varargin, 'UniformOutput', false);
sizes = vertcat(sizes{:}); % [numArgs x numDims]
% Output size is max along each dimension
outSize = max(sizes, [], 1);
% Precompute row-wise strides for linear indexing
strides = [ ...
ones(numArgs, 1), ...
cumprod(sizes(:, 1:end-1), 2) ...
];
% Convert sizes to zero-based limits
sizes = sizes - 1;
% Allocate output
%out = nan(outSize);
out = cell(outSize);
argsIdx=nan(numArgs,1);
args=cell(1,numArgs);
% Main loop over output elements
for linIdx = 1:numel(out)
% Convert linear index → zero-based subscripts
idx = linIdx - 1;
subs = zeros(1, numDims);
for d = 1:numDims
sd = outSize(d);
subs(d) = mod(idx, sd);
idx = floor(idx / sd);
end
% Apply implicit expansion masking
Subs = min(subs,sizes);
% Row-wise sub2ind
argsIdxNew = sum(Subs .* strides, 2) + 1;
map=(argsIdxNew ~= argsIdx);
% Gather arguments
% v=varargin(map);
% k=argsIdxNew(map);
% c=0;
% for j=map
% c=c+1;
% args{j} = v{j}(k(c));
% end
for j = 1:numArgs
if map(j)
args{j} = varargin{j}(argsIdxNew(j));
end
end
argsIdx=argsIdxNew;
% Evaluate function
out{linIdx} = fun(args{:});
end
end
Catalytic on 4 Jan 2026 (edited)
That's fair, but the use case you describe does not demand high speed or RAM efficiency. If syntactic sugar with implicit expansion is all you want, then it is trivial to make a one-liner function of your own (as @Matt J has done with his proposals).
Conversely, the OP has called out the built-in CPU arrayfun for not offering RAM-efficient scalar expansion. My real point was that, if your data is so big that RAM consumption is a concern, then arrayfun on the CPU never had a hope of being high-performing to begin with. It's the difference between a computation that takes 100 years and one that takes 300 years.
Matt J on 4 Jan 2026 (edited)
I changed two lines of code in Matt's function to return a cell array rather than a double as Matt's function does not (yet?) support the UniformOutput option.
I was never a big fan of the UniformOutput flag. If the user wants cell-valued output, then I think the iterated input function should assume responsibility for that. In your case, that would look like:
rng(100);
w = linspace(0,pi,1000);
b = num2cell( rand(3,2) , 2);
a = {[1 2]};
myfreqz = @(b,a) {freqz(b{1},a{1},w)};
h = arrayfunImplExp(myfreqz,b,a)
h = 3×1 cell array
{[0.4627 + 0.0000i 0.4627 + 0.0001i 0.4627 + 0.0002i 0.4627 + 0.0003i 0.4627 + 0.0003i 0.4627 + 0.0004i 0.4627 + 0.0005i 0.4627 + 0.0006i 0.4627 + 0.0007i … ] (1×1000 double)} {[0.0944 + 0.0000i 0.0944 + 0.0002i 0.0944 + 0.0004i 0.0944 + 0.0006i 0.0944 + 0.0008i 0.0944 + 0.0010i 0.0944 + 0.0012i 0.0944 + 0.0014i 0.0944 + 0.0015i … ] (1×1000 double)} {[0.1820 + 0.0000i 0.1820 + 0.0003i 0.1820 + 0.0005i 0.1820 + 0.0008i 0.1820 + 0.0010i 0.1820 + 0.0013i 0.1820 + 0.0015i 0.1820 + 0.0018i 0.1820 + 0.0020i … ] (1×1000 double)}
Joss Knight on 5 Jan 2026 (moved by Matt J)
arrayfun with expansion, particularly for expanding scalars, is certainly very convenient syntactic sugar for a for loop to make code more compact and readable, for instance, setting a bunch of settings on an object array, where the for loop is not going to be optimized so there is no advantage to it.
layers = arrayfun(@setHyperParams, layers, 0, [layers.L2Factor]); % Freeze learn rate
It's just easier to read, isn't it? Obviously there are other ways to do this particular operation on one line but I certainly see your point. However, the others have good points; arrayfun is almost always slower than a for loop or alternative approach, so taking action to encourage its use is something to do with caution.
Paul on 5 Jan 2026 (moved by Matt J)
Hi Joss,
Any idea why the gpu version of arrayfun supports expansion (whereas the cpu version does not)?

Sign in to comment.

Accepted Answer

Matt J on 4 Jan 2026 (edited)
Remember that arrayfun (on the CPU) is nothing more than an M-Code for-loop in disguise. Therefore, it is not hard to write your own arrayfun version -- one which supports implicit expansion -- using for-loops.
The implementation below is slower than explicit expansion (using repmat), but of course conserves a lot more memory. You would probably need very significant data sizes and/or RAM limits before implicit expansion becomes faster.
% Define test vectors
N=100;
x = rand(N,1); % Column vector
y = rand(1,N); % Row vector
z = rand(1,1,N);
myadd=@(a,b,c) a+b+c;
tic;
out1=arrayfunImplExp(myadd,x,y,z); %implicit expansion
toc
Elapsed time is 0.898685 seconds.
tic;
out2=arrayfunExplExp(myadd,x,y,z); %explicit expansion (with repmat)
toc
Elapsed time is 0.548641 seconds.
isequal(out1,x+y+z)
ans = logical
1
isequal(out2,x+y+z)
ans = logical
1
function out = arrayfunImplExp(fun, varargin)
%ARRAYFUNIMPLEXP arrayfun with implicit expansion (CPU)
%
% NOTE:
% Useful for CPU-only execution. On the GPU, use arrayfun instead,
% which implements its own implicit expansion.
% Number of inputs and maximum dimensionality
numArgs = numel(varargin);
numDims = max(cellfun(@ndims, varargin));
% Collect per-input sizes (row-wise)
sizes = cellfun(@(c) size(c, 1:numDims), varargin, 'UniformOutput', false);
sizes = vertcat(sizes{:}); % [numArgs x numDims]
% Output size is max along each dimension
outSize = max(sizes, [], 1);
% Precompute row-wise strides for linear indexing
strides = [ ...
ones(numArgs, 1), ...
cumprod(sizes(:, 1:end-1), 2) ...
];
% Convert sizes to zero-based limits
sizes = sizes - 1;
% Allocate output
out = nan(outSize);
argsIdx=nan(numArgs,1);
args=cell(1,numArgs);
% Main loop over output elements
for linIdx = 1:numel(out)
% Convert linear index → zero-based subscripts
idx = linIdx - 1;
subs = zeros(1, numDims);
for d = 1:numDims
sd = outSize(d);
subs(d) = mod(idx, sd);
idx = floor(idx / sd);
end
% Apply implicit expansion masking
Subs = min(subs,sizes);
% Row-wise sub2ind
argsIdxNew = sum(Subs .* strides, 2) + 1;
map=(argsIdxNew ~= argsIdx);
% Gather arguments
% v=varargin(map);
% k=argsIdxNew(map);
% c=0;
% for j=map
% c=c+1;
% args{j} = v{j}(k(c));
% end
for j = 1:numArgs
if map(j)
args{j} = varargin{j}(argsIdxNew(j));
end
end
argsIdx=argsIdxNew;
% Evaluate function
out(linIdx) = fun(args{:});
end
end
function out = arrayfunExplExp(fun, varargin)
%ARRAYFUNEXPLEXP arrayfun with explicit expansion using repmat
% Provided for comparison with arrayfunImplExp.
numArgs = numel(varargin);
numDims = max(cellfun(@ndims, varargin));
% Collect sizes
sizes = cellfun(@(c) size(c, 1:numDims), varargin, 'UniformOutput', false);
sizes = vertcat(sizes{:});
% Output size
outSize = max(sizes, [], 1);
numOut = prod(outSize);
% Preallocate expanded argument cell array
args = cell(numArgs, numOut);
% Explicit expansion
for k = 1:numArgs
v = varargin{k};
vSz = size(v, 1:numDims);
reps = outSize ./ vSz;
% Expand and linearize
vk = repmat(num2cell(v), reps);
args(k, :) = vk(:).';
end
% Allocate output
out = nan(outSize);
% Evaluate function
for j = 1:numOut
out(j) = fun(args{:, j});
end
end

17 comments

Alessandro on 4 Jan 2026 (edited)
@Matt J Your function arrayfunExplExp works quite well in my tests and does exactly the same as arrayfun on the GPU (much slower, but this is expected).
Question 1: Why is arrayfunExplExp faster than arrayfunImplExp? arrayfunExplExp uses repmat, which creates large temporary arrays, so it should be slower.
Question 2: Am I free to use arrayfunExplExp in my own codes (for academic research)?
Question 3: Nested loops are much faster than arrayfunExplExp. I am afraid this is expected, since arrayfun on the CPU is slow. Still, I was wondering whether there are ways to speed up arrayfunExplExp.
Here is a test verifying that arrayfun on the GPU and Matt J's arrayfun give exactly the same result:
clear,clc,close all
r = 0.04;
sigma = 1.5;
n_a = 1000;
n_z = 11;
a_grid = linspace(0,50,n_a)';
z_grid = linspace(0.5,1.5,n_z)';
aprime_vec = a_grid;
a_vec = reshape(a_grid,[1,n_a,1]);
z_vec = reshape(z_grid,[1,1,n_z]);
aprime_vec_gpu = gpuArray(aprime_vec);
a_vec_gpu = gpuArray(a_vec);
z_vec_gpu = gpuArray(z_vec);
% arrayfun GPU F1 is (a',a,z)
disp('arrayfun GPU')
tic
F1 = arrayfun(@RetFun,aprime_vec_gpu,a_vec_gpu,z_vec_gpu,r,sigma);
toc
disp('Nested loops')
tic
F2 = zeros(n_a,n_a,n_z);
for z_c=1:n_z
for a_c=1:n_a
for ap_c=1:n_a
F2(ap_c,a_c,z_c) = RetFun(a_grid(ap_c),a_grid(a_c),z_grid(z_c),r,sigma);
end
end
end
toc
% User-written arrayfun
disp('User-written arrayfun CPU')
tic
F3 = arrayfunExplExp(@RetFun, aprime_vec,a_vec,z_vec,r,sigma);
toc
err2 = max(abs(F2-F1),[],"all");
err3 = max(abs(F3-F1),[],"all");
fprintf('Error 2: %f \n',err2)
fprintf('Error 3: %f \n',err3)
function F = RetFun(aprime,a,z,r,sigma)
cons = (1+r)*a+z-aprime;
F = -Inf;
if cons>0
F = cons^(1-sigma)/(1-sigma);
end
end %end function
P.S. I can open a separate question if it is considered more appropriate.
Alessandro on 4 Jan 2026

Related to my Question 3, is it possible to accelerate your arrayfunExplExp with a MEX file (maybe using codegen)?

Matt J on 4 Jan 2026 (edited 5 Jan 2026)
I have modified the implementation of arrayfunImplExp() to improve its efficiency somewhat.
Additionally, I have modified your test code below to compare a few more CPU cases of interest. What we see is that, yes, explicit loops (nested or linear) are the fastest, thanks to the JIT. However, arrayfunImplExp and arrayfunExplExp compare quite well in speed to MATLAB's native arrayfun, even when we exclude from consideration the data-replication time associated with ndgrid() or repmat(). And, of course, arrayfunImplExp reduces temp RAM consumption by a few factors. The very best speed comes from the vectorized implementation, which shows that you should only resort to arrayfun if vectorization is truly out of reach.
r = 0.04;
sigma = 1.5;
N=120;
a_grid = linspace(0,50,N)';
z_grid = linspace(0.5,1.5,N)';
[aprime,a ,z]=ndgrid(a_grid, a_grid, z_grid);
[aprime_vec,a_vec ,z_vec]=ndgridVecs(a_grid, a_grid, z_grid);
%Get ndgridVecs from the File Exchange:
% https://www.mathworks.com/matlabcentral/fileexchange/74956-ndgridvecs
fun=@(A,B,C)RetFun(A,B,C,r,sigma);
disp ' '
disp('Vectorized Implementation')
Vectorized Implementation
tic
BaseLine0 = RetFunVec(aprime_vec,a_vec ,z_vec, r,sigma);
toc
Elapsed time is 0.043575 seconds.
disp ' '
disp('Nested loops')
Nested loops
tic
BaseLine1 = zeros(size(a));
for k=1:N
for j=1:N
for i=1:N
BaseLine1(i,j,k) = fun(a_grid(i),a_grid(j),z_grid(k));
end
end
end
toc
Elapsed time is 5.366150 seconds.
% User-written arrayfun
disp ' '
disp('Linear loop')
Linear loop
tic;
BaseLine2 = zeros(size(a));
for i=1:numel(BaseLine2)
BaseLine2(i) = fun(aprime(i),a(i),z(i));
end
toc
Elapsed time is 4.967472 seconds.
% User-written arrayfun
disp ' '
disp('Native arrayfun CPU')
Native arrayfun CPU
tic
BaseLine3 = arrayfun(fun, aprime,a ,z );
toc
Elapsed time is 7.463388 seconds.
% User-written arrayfun
disp ' '
disp('User-written arrayfun CPU')
User-written arrayfun CPU
tic
F1 = arrayfunImplExp(fun, aprime_vec,a_vec,z_vec);
toc
Elapsed time is 7.403977 seconds.
tic
F2 = arrayfunExplExp(fun, aprime_vec,a_vec,z_vec);
toc
Elapsed time is 6.563372 seconds.
isequal(BaseLine0, BaseLine1, BaseLine2, BaseLine3, F1,F2)
ans = logical
1
function F = RetFun(aprime,a,z,r,sigma)
cons = (1+r).*a+z-aprime;
F = -Inf;
if cons>0
F = cons.^(1-sigma)./(1-sigma);
end
end %end function
function F = RetFunVec(aprime,a,z,r,sigma)
cons = (1+r).*a+z-aprime;
F = cons.^(1-sigma)./(1-sigma);
F(cons<=0)=-Inf;
end %end function
Hi Matt,
Cool function.
One enhancement would be to support the UniformOutput option.
Also, I'm sure that a production version would include some input checking. As it stands, the function runs w/o error even if the inputs aren't compatible.
z = arrayfunImplExp(@(a,b) a + b,[1,2,3],[4])
z = 1×3
5 6 7
z = arrayfunImplExp(@(a,b) a + b,[1,2,3],[4,5])
z = 1×3
5 7 8
Matt J on 5 Jan 2026
Also, I'm sure that a production version would include some input checking. As it stands, the function runs w/o error even if the inputs aren't compatible.
Yes. Or, alternatively, perhaps that's behavior you would want.
@Matt J Thanks a lot! I have a small comment on the function RetFunVec.
function F = RetFun(aprime,a,z,r,sigma)
cons = (1+r).*a+z-aprime;
F = -Inf;
if cons>0
F = cons.^(1-sigma)./(1-sigma);
end
end %end function
function F = RetFunVec(aprime,a,z,r,sigma)
cons = (1+r).*a+z-aprime;
F = cons.^(1-sigma)./(1-sigma);
F(cons<=0)=-Inf;
end %end function
The function RetFunVec is not equivalent to RetFun (even though they return the same result), since in RetFunVec you compute F even when cons is negative, and this generates a temporary complex array (a negative number raised to a fractional power gives a complex result in MATLAB). It would be better to write
function F = RetFunVec(aprime, a, z, r, sigma)
cons = (1 + r) * a + z - aprime;
%F = cons.^(1 - sigma) / (1 - sigma);
%F(cons <= 0) = -Inf;
F = -Inf(size(cons));
pos = cons>0;
F(pos) = cons(pos).^(1 - sigma) / (1 - sigma);
end
Alessandro on 5 Jan 2026 (edited)
@Matt J I ran the tests on my computer and got these results:
Timing summary (seconds):
1) arrayfun GPU : 0.025733
2) Vectorized CPU : 0.040589
3) Nested loops CPU : 0.190420
4) arrayfunExplExp : 1.817837
5) arrayfunImplExp : 2.585310
6) arrayfun Matlab : 1.691153
So arrayfunImplExp is too slow, but arrayfunExplExp runs at a similar speed to the built-in arrayfun and supports implicit expansion (the input arrays don't need to have the same size, as long as they are compatible), so I like it!
Matt J on 5 Jan 2026 (edited)
So arrayfunImplExp is too slow
And you're using my updated version? Anyway... there could be some platform-dependent differences in speed performance.
The function RetFunVec is not equivalent to RetFun ... It would be better to write
I get still faster performance with,
function F = RetFunVec(aprime, a, z, r, sigma)
cons = (1 + r) * a + z - aprime;
F = max(cons,0).^(1 - sigma) / (1 - sigma);
F(cons<=0)=-Inf;
end
Paul on 5 Jan 2026
... arrayfunExplExp runs at similar speed as the built-in arrayfun and it supports implicit expansion (input arrays don't need to have same size, as long as they are compatible), so I like it!
If it meets your needs, that's great. But arrayfunExplExp uses repmat, which is exactly what I thought you wanted to avoid: When I have large arrays, duplicating dimensions with repmat takes a lot of memory.
Alessandro on 5 Jan 2026
@Paul If it meets your needs, that's great. But arrayfunExplExp uses repmat, which is exactly what I thought you wanted to avoid
Correct: it's good to have both arrayfunExplExp and arrayfunImplExp and choose between the two based on the size of the arrays.
@Matt J And you're using my updated version? Anyway...there could be some platform-dependent differences in the speed performances.
Yes, I am. Maybe the differences in speed arise from the fact that I changed the tests a bit. I report my complete test code here.
%% Performance test: arrayfun (GPU), vectorized (CPU), nested loops, user arrayfun (CPU)
clear;
clc;
%% Parameters and grids
r = 0.04;
sigma = 1.5;
n_a = 150;
n_z = 150;
a_grid = linspace(0, 50, n_a).';
z_grid = linspace(0.5, 1.5, n_z).';
% Shapes:
% aprime_vec : (a', 1, 1)
% a_vec : (1, a, 1)
% z_vec : (1, 1, z)
aprime_vec = a_grid;
a_vec = reshape(a_grid, [1, n_a, 1]);
z_vec = reshape(z_grid, [1, 1, n_z]);
[aprime_full,a_full,z_full] = ndgrid(a_grid,a_grid,z_grid);
r_full = repmat(r,size(aprime_full));
sigma_full = repmat(sigma,size(aprime_full));
%% Move to GPU
aprime_vec_gpu = gpuArray(aprime_vec);
a_vec_gpu = gpuArray(a_vec);
z_vec_gpu = gpuArray(z_vec);
%% 1. arrayfun on GPU: F1 is (a', a, z)
disp('1) arrayfun on GPU');
t_gpu = tic;
F1 = arrayfun(@RetFun, aprime_vec_gpu, a_vec_gpu, z_vec_gpu, r, sigma);
time_gpu = toc(t_gpu);
%% 2. Vectorized on CPU
disp('2) Vectorized on CPU');
t_vec = tic;
F2 = RetFunVec(aprime_vec, a_vec, z_vec, r, sigma);
time_vec = toc(t_vec);
%% 3. Nested loops on CPU
disp('3) Nested loops on CPU');
t_loop = tic;
F3 = zeros(n_a, n_a, n_z);
for z_c = 1:n_z
for a_c = 1:n_a
for ap_c = 1:n_a
F3(ap_c, a_c, z_c) = RetFun(a_grid(ap_c), a_grid(a_c), z_grid(z_c), r, sigma);
end
end
end
time_loop = toc(t_loop);
%% 4. User-written arrayfun on CPU
disp('4) User-written arrayfun on CPU');
t_user = tic;
F4 = arrayfunImplExp(@RetFun, aprime_vec, a_vec, z_vec, r, sigma);
time_user = toc(t_user);
%% 5. Matlab arrayfun on CPU
disp('5) Matlab arrayfun on CPU');
t_user = tic;
%F5 = arrayfun(@RetFun, aprime_full, a_full, z_full, r_full, sigma_full);
F5 = arrayfun(@RetFun, aprime_full, a_full, z_full,r_full,sigma_full);
time_matlab = toc(t_user);
%% Accuracy checks
F1_cpu = gather(F1); % bring GPU result back to CPU for comparison
err2 = max(abs(F2 - F1_cpu), [], "all");
err3 = max(abs(F3 - F1_cpu), [], "all");
err4 = max(abs(F4 - F1_cpu), [], "all");
err5 = max(abs(F5 - F1_cpu), [], "all");
fprintf('\nMaximum absolute error vs GPU arrayfun result:\n');
fprintf(' Vectorized CPU (F2): %g\n', err2);
fprintf(' Nested loops CPU (F3): %g\n', err3);
fprintf(' User arrayfun CPU (F4): %g\n', err4);
fprintf(' Matlab arrayfun CPU (F5): %g\n', err5);
%% Timing summary
fprintf('\nTiming summary (seconds):\n');
fprintf(' 1) arrayfun GPU : %.6f\n', time_gpu);
fprintf(' 2) Vectorized CPU : %.6f\n', time_vec);
fprintf(' 3) Nested loops CPU : %.6f\n', time_loop);
fprintf(' 4) arrayfunImplExp : %.6f\n', time_user);
fprintf(' 5) arrayfun Matlab : %.6f\n', time_matlab);
%% ------------------------------------------------------------------------
%% Local functions
function F = RetFun(aprime, a, z, r, sigma)
cons = (1 + r) * a + z - aprime;
if cons > 0
F = cons^(1 - sigma) / (1 - sigma);
else
F = -Inf;
end
end
function F = RetFunVec(aprime, a, z, r, sigma)
cons = (1 + r) * a + z - aprime;
%F = cons.^(1 - sigma) / (1 - sigma);
%F(cons <= 0) = -Inf;
F = -Inf(size(cons));
pos = cons>0;
F(pos) = cons(pos).^(1 - sigma) / (1 - sigma);
end
Matt J on 6 Jan 2026 (edited)
OK, so attached is my idea of how the test should have been coded, and here are my results:
>> timeComparisons
1) arrayfun on GPU
2) Vectorized on CPU
3) Nested loops on CPU
4i) User-written arrayfunImplExp on CPU
4e) User-written arrayfunExplExp on CPU
5) Matlab arrayfun on CPU
Maximum absolute error vs baseline (F2):
gpuArray.arrayfun (F1) : 7.10543e-15
Vectorized CPU (F2) : 0
Nested loops CPU (F3) : 0
arrayfunImplExp (F4i): 0
arrayfunExplExp (F4e): 0
Matlab arrayfun CPU (F5) : 0
Timing summary (seconds):
1) arrayfun GPU : 0.010027
2) Vectorized CPU : 0.078522
3) Nested loops CPU : 0.529490
4i) arrayfunImplExp : 6.427602
4e) arrayfunExplExp : 4.775996
5) arrayfun Matlab : 7.100541
The most important difference is that I have wrapped RetFun() in a 3-argument function:
fun=@(A,B,C) RetFun(A,B,C, r, sigma);
since r and sigma are fixed constants. If you don't do this, you are forcing arrayfunImplExp to do additional scalar expansions on the scalar inputs r and sigma, which I think is kind of silly. It's needless extra computation.
Now, the discussion gets complicated because the native CPU arrayfun doesn't seem to process anonymous functions as well as it does non-anonymous functions. If you run test (5) instead with,
disp('5) Matlab arrayfun on CPU');
t_user = tic;
[aprime_full,a_full,z_full] = ndgrid(a_grid,a_grid,z_grid); %Include time for explicit expansion
r_full = repmat(r,size(aprime_full));
sigma_full = repmat(sigma,size(aprime_full));
F5 = arrayfun(@RetFun, aprime_full, a_full, z_full, r_full, sigma_full);
time_matlab = toc(t_user);
then its time drops to about 4.9 sec. So, okay, you might argue that test (5) is unfairly handicapped. On the other hand, I think that quite often you would like to be able to apply arrayfun() to anonymous functions. So, to get the better performance from native arrayfun, you are forced to spend both the data-duplication effort and the effort of rewriting fun() non-anonymously.
Side Note: I have modified the way compute times are calculated in test (1) and (2). TIC and TOC don't work very accurately for GPU functions.
Matt J on 6 Jan 2026 (edited)
I have just uploaded a 3rd version, arrayfunScalarExp(), to the File Exchange:
It is a hybrid of arrayfunImplExp and arrayfunExplExp, which I think achieves an optimal tradeoff between speed (about 10% slower than native arrayfun) and RAM consumption. It also supports the 'UniformOutput' flag, per the recommendation of @Paul.
I've also attached a revised timeComparison script, this time using r_full and sigma_full in all CPU arrayfun implementations, so that they are tested against the best performance of the native CPU arrayfun(). The results I get are,
1) arrayfun on GPU
2) Vectorized on CPU
3) Nested loops on CPU
4a) User-written arrayfunScalarExp on CPU
4b) User-written arrayfunExplExp on CPU
5) Matlab arrayfun on CPU
Maximum absolute error vs baseline (F2):
gpuArray.arrayfun (F1) : 7.10543e-15
Vectorized CPU (F2) : 0
Nested loops CPU (F3) : 0
arrayfunScalarExp (F4a): 0
arrayfunExplExp (F4b): 0
Matlab arrayfun CPU (F5) : 0
Timing summary (seconds):
1) arrayfun GPU : 0.010373
2) Vectorized CPU : 0.078654
3) Nested loops CPU : 0.331331
4a) arrayfunScalarExp : 5.479543
4b) arrayfunExplExp : 11.268232
5) arrayfun Matlab : 5.054332
Note, though, that two other syntax choices are possible for (4a) and (4b), with different results.
Alessandro on 6 Jan 2026 (edited)
Fantastic, thanks for sharing this function. In terms of speeding it up, would it be a good idea to add a parfor here (see code below, taken from your function arrayfunExplExp)? Clearly the loop iterations over j are independent from one another.
% Evaluate function
for j = 1:numOut
out(j) = fun(args{:, j});
end
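A sketch of that change, assuming out, args, fun, and numOut as defined in arrayfunExplExp:

```matlab
% Sketch: the iterations are independent, so the loop can be a parfor
parfor j = 1:numOut
    out(j) = fun(args{:, j});  % args sliced along its second dimension
end
```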
Matt J on 6 Jan 2026 (edited)
In certain cases, probably yes. However, if fun() is anonymous, it gets tricky. An anonymous function can contain large externally-scoped data from its parent workspace. This data needs to be cloned to the parfor workers, adding overhead.
Also, keep in mind that the bottleneck in arrayfunExplExp is not always the loop you've shown. The line,
vk = repmat(num2cell(v), reps);
can be the bottleneck when v is large. This was demonstrated in my last comment where arrayfunExplExp gave the slowest performance of all the implementations. One of the reasons that I prefer arrayfunScalarExp is that its performance is not as sensitive to factors like these.
Alessandro on 6 Jan 2026

Thanks for the explanation. But using parfor with threads instead of processes should reduce the overhead considerably. I will try later if I have time.
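For reference, a thread-based pool (available since R2020a) can be selected explicitly before the parfor runs; thread workers share memory with the client, so anonymous-function workspace data need not be copied to process workers:

```matlab
% Sketch: use thread workers instead of the default process workers
if isempty(gcp("nocreate"))
    parpool("Threads");
end
```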

Matt J on 7 Jan 2026
I made some improvements to my File Exchange submission, which now includes two versions, one which is RAM-conserving (arrayfunLowMem) and one which is liberal with RAM, but somewhat faster (arrayfunHighMem). Both are significantly faster than native arrayfun (revised timeComparisons.m script attached).
1) arrayfun on GPU
2) Vectorized on CPU
3) Nested loops on CPU
4a) User-written arrayfunLowMem on CPU
4b) User-written arrayfunHighMem on CPU
5) Matlab arrayfun on CPU
Maximum absolute error vs baseline (F2):
gpuArray.arrayfun (F1) : 7.10543e-15
Vectorized CPU (F2) : 0
Nested loops CPU (F3) : 0
arrayfunLowMem (F4a): 0
arrayfunHighMem (F4b): 0
Matlab arrayfun CPU (F5) : 0
Timing summary (seconds):
1) arrayfun GPU : 0.010418
2) Vectorized CPU : 0.097721
3) Nested loops CPU : 0.329859
4a) arrayfunLowMem : 3.806847
4b) arrayfunHighMem : 3.187784
5) arrayfun Matlab : 5.101973

Sign in to comment.

More Answers (3)

Torsten on 3 Jan 2026 (edited)
F3 = cellfun(@myadd, {x}, {y}, 'UniformOutput', false)
will work.
Arrayfun tries to apply "myadd" to [first element of x, first element of y] and [second element of x, second element of y]. Thus x and y must be of the same size, and - even if it worked - the result would be [2, 4] or [2;4].
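With equal sizes it pairs elements positionally, for example:

```matlab
% Element-by-element application: (1,3) -> 4, (2,4) -> 6
arrayfun(@(a,b) a + b, [1; 2], [3; 4])  % returns [4; 6]
```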
I don't understand why it works for gpuArray and gives the result you want. Maybe input arrays of different size are automatically interpreted here as two separate single objects to which "myadd" is to be applied - as it is done if you use cell arrays together with "cellfun".

4 comments

Paul on 3 Jan 2026
The cpu version of arrayfun states: "The arrays A1,...,An all must have the same size."
The gpuArray version of gpuArray.arrayfun states: "The sizes of A1,...,An must match or be compatible." which I gather means the inputs A1 .. An are implicitly expanded.
So I guess the question is why the gpu version offers that expansion and the cpu version does not. I wonder if that could lead to problems for people who develop on the cpu version and then transition later to the gpu version, or vice versa.
Torsten on 3 Jan 2026 (edited)
Ok. "Compatible" for the GPU case means here: it is allowed that A1 is a column vector while A2 is a row vector. This makes implicit expansion of A1 + A2 possible.
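That is the standard compatible-size rule: each dimension must either match or equal 1 in one of the operands. For example, with ordinary CPU arithmetic:

```matlab
A = zeros(2, 1, 5);
B = zeros(1, 3);
size(A + B)  % [2 3 5]: each dimension matches or expands from 1
```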
Alessandro on 3 Jan 2026
@Paul: I agree with you. Ideally Matlab should implement implicit expansion also on arrayfun cpu, as it is implemented on the gpu version. This improvement would also not break any code.
Torsten on 3 Jan 2026 (edited)
If your contribution was a complaint rather than a question, make a feature request:

Sign in to comment.

Matt J on 3 Jan 2026 (edited 4 Jan 2026)
I agree it is confusing, but the gpuArray version of arrayfun was never intended as a direct analogue of the CPU version. Additionally, there are all kinds of other instances where the gpuArray support for a function differs in capabilities from its CPU version. Usually, though, the GPU version is less flexible, not more so as is the case with arrayfun.
A more appropriate comparison of implicit expansion between CPU vs GPU (for binary functions) would probably be to use bsxfun instead:
% Define test vectors
x = rand(10000,1); % Column vector
y = rand(1,10000); % Row vector
xg=gpuArray(x);
yg=gpuArray(y);
myadd=@(a,b) a+b;
timeit( @() bsxfun(myadd, x, y) ) %CPU
ans =
0.3355
gputimeit( @() bsxfun(myadd, xg, yg) ) %GPU
ans =
0.0078
Joss Knight on 5 Jan 2026 (moved by Matt J)


Well, there are a couple of answers to that. Firstly and probably most importantly, GPU arrayfun is just an extension of gpuArray's existing compiler for element-wise operations, all of which support dimension expansion; it's also a natural strided memory access pattern to support for a GPU kernel, where each input has its own stride, and there is native support for this kind of memory access in GPU hardware. Secondly, it makes design sense and only isn't implemented for CPU for the reasons given. The historical explanation is that the idea of broadcast operations didn't even exist back when arrayfun was first created for the CPU, but it did exist by the time GPU arrayfun came along.

2 comments

Matt J on 5 Jan 2026 (edited)
Secondly, it makes design sense and only isn't implemented for CPU for the reasons given.
Given where? In your initial post?
And are those reasons still an argument for not pursuing it now, or is it just because of historical legacy?
Joss Knight on 6 Jan 2026
The reasons given by various people. Should you really enhance something you want to discourage people from using?

Sign in to comment.

Categories: Matrices and Arrays
Version: R2024b
Asked: 3 Jan 2026
Last commented: 7 Jan 2026