Implicit expansion with arrayfun (CPU vs GPU)

I find it very convenient that MATLAB has allowed implicit expansion since R2016b (for an explanation, see this nice article: https://blogs.mathworks.com/loren/2016/10/24/matlab-arithmetic-expands-in-r2016b/?s_tid=blogs_rc_1).
I was then puzzled to discover that arrayfun on the CPU does not allow it, while arrayfun called with gpuArrays does. Below is an MWE demonstrating this behavior.
Let me explain it quickly: if I have two vectors, x of size [2,1] and y of size [1,2], I can compute the sum x+y and get a 2×2 matrix, as intended. This is better than the uglier and more memory-intensive
repmat(x,1,2)+repmat(y,2,1)
Unfortunately, this does not work with arrayfun on the CPU!
Since I write code with both normal arrays and gpuArrays, I find this difference in arrayfun's behavior quite misleading. It would be great if MATLAB allowed implicit expansion for arrayfun on the CPU as well. When I have large arrays, duplicating dimensions with repmat takes a lot of memory.
%% Demonstration of implicit expansion support in MATLAB and arrayfun
% This script shows that:
% 1) MATLAB supports implicit expansion for standard array operations.
% 2) arrayfun on the GPU supports implicit expansion.
% 3) arrayfun on the CPU does NOT support implicit expansion.
%
% Implicit expansion allows a 2×1 vector to be added to a 1×2 vector,
% producing a 2×2 matrix.
clear; clc; close all;
% Define test vectors
x = [1; 2]; % Column vector (2×1)
y = [1, 2]; % Row vector (1×2)
%% Implicit expansion using standard MATLAB operations
F1 = myadd(x, y);
%% Implicit expansion using arrayfun on the GPU
F2 = arrayfun(@myadd, gpuArray(x), gpuArray(y));
%% Attempt implicit expansion using arrayfun on the CPU (expected to fail)
try
F3 = arrayfun(@myadd, x, y);
catch ME
fprintf('CPU arrayfun error (expected):\n%s\n\n', ME.message);
end
%% Function myadd
function F = myadd(x, y)
% Element-wise addition
F = x + y;
end

10 comments

Catalytic on 4 Jan 2026
It would be great if Matlab could allow implicit expansion also on arrayfun cpu. When I have large arrays, duplicating dimensions with repmat takes a lot of memory.
It's hard to imagine why you would ever be using arrayfun on the CPU when you have large arrays. It would be super-slow.
I can understand that sometimes you want to run CPU and GPU versions of the same code side by side to demonstrate speed-up, but in the case of arrayfun on large arrays, it's so obvious in advance that the CPU version will be vastly slower. How informative could such a comparison ever be?
It is not difficult to imagine wanting to arrayfun() between (for example) a 10000 x 1 array, and a 1 x 15000 array, and hoping that it is not necessary to explicitly form 10000 x 15000 input arrays to make it work.
10000 * 8 + 15000 * 8
ans = 200000
ans / 10^9
ans = 2.0000e-04
10000 * 15000 * 8 * 2 %two input arrays
ans = 2.4000e+09
ans/10^9
ans = 2.4000
Big difference.
Catalytic on 4 Jan 2026 (edited)
It is not difficult to imagine caring about that on the GPU, but even attempting a computation of that size on the CPU, knowing that it will run with the speed of a for-loop over ~10^9 elements, is difficult to imagine.
Also, the extra memory consumption in your example isn't really all that dramatic - only twice what would be consumed in an implicit expansion implementation. You've basically saved a 1 GB temp array. What's 1 GB by today's standards of RAM?
Paul on 4 Jan 2026
Assume that the underlying function cannot be executed on the GPU.
In this case, are you suggesting that arrayfun should not be used if the arrays are large but all the same size?
What if the arrays are not large, but are compatible? As it stands now, arrayfun cannot be used even though it might be a very good approach if it supported implicit expansion.
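For example, this fails on the CPU today even though the operands are compatible in the implicit-expansion sense:

```matlab
x = [1; 2];        % 2x1
y = [10, 20, 30];  % 1x3, compatible with x
x + y              % implicit expansion gives a 2x3 result
try
    arrayfun(@plus, x, y)  % CPU arrayfun requires equal sizes, so this errors
catch ME
    disp(ME.message)
end
```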
Catalytic on 4 Jan 2026 (edited)
If no vectorization or parallelization approaches are available, then I think the benefits of arrayfun over a plain old for-loop are pretty narrow in the scenarios you describe -- basically just syntactic sugar. I also speculate that they are pretty rare use cases. With such rare and marginal benefits, I can see why TMW would question whether it warrants developer time.
Paul on 4 Jan 2026 (edited)
Frankly, I've never thought about using arrayfun with implicit expansion, so I don't really know how useful it would or wouldn't be. But I can definitely envision use cases where it might come in handy and save lines of code. Whether eliminating such lines of code is worth it is in the eye of the beholder.
Consider a situation where I have discrete time filter coefficients stored in two cell arrays: b for the numerator, and a for the denominator. Sometimes I want unique pairs of numerators and denominators (numel(b) == numel(a)), sometimes I want to analyze multiple sets of denominators with one numerator, and sometimes I want to analyze multiple numerators with one denominator, as shown below.
Define three sets of numerator coefficients and one set of denominator coefficients
rng(100);
w = linspace(0,pi,1000);
b = mat2cell(rand(3,2),[1,1,1]);
a = {[1 2]};
Define a wrapper around freqz to compute the frequency response of a filter
myfreqz = @(b,a) freqz(b{1},a{1},w);
Call @Matt J's function as defined in this Answer to compute all three responses with implicit expansion of the denominator. I changed two lines of code in Matt's function to return a cell array rather than a double as Matt's function does not (yet?) support the UniformOutput option.
h = arrayfunImplExp(myfreqz,b,a);
Verify
isequal(vertcat(h{:}),[freqz(b{1},a{1},w);freqz(b{2},a{1},w);freqz(b{3},a{1},w)])
ans = logical
1
To be sure, I could have written a loop with a couple of if statements to check isscalar on b or a to know how to call freqz inside the loop, but I rather like the one-liner.
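For reference, the loop that the one-liner replaces might look something like this (a sketch, reusing b, a, and w from above; the min(k, numel(...)) indexing stands in for the isscalar checks):

```matlab
% Sketch: explicit loop handling the scalar-expansion cases by hand
n = max(numel(b), numel(a));
h = cell(n, 1);
for k = 1:n
    bk = b{min(k, numel(b))};  % reuse the single numerator if b is scalar
    ak = a{min(k, numel(a))};  % reuse the single denominator if a is scalar
    h{k} = freqz(bk, ak, w);
end
```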
function out = arrayfunImplExp(fun, varargin)
%ARRAYFUNIMPLEXP arrayfun with implicit expansion (CPU)
%
% NOTE:
% Useful for CPU-only execution. On the GPU, use arrayfun instead,
% which implements its own implicit expansion.
% Number of inputs and maximum dimensionality
numArgs = numel(varargin);
numDims = max(cellfun(@ndims, varargin));
% Collect per-input sizes (row-wise)
sizes = cellfun(@(c) size(c, 1:numDims), varargin, 'UniformOutput', false);
sizes = vertcat(sizes{:}); % [numArgs x numDims]
% Output size is max along each dimension
outSize = max(sizes, [], 1);
% Precompute row-wise strides for linear indexing
strides = [ ...
ones(numArgs, 1), ...
cumprod(sizes(:, 1:end-1), 2) ...
];
% Convert sizes to zero-based limits
sizes = sizes - 1;
% Allocate output
%out = nan(outSize);
out = cell(outSize);
argsIdx=nan(numArgs,1);
args=cell(1,numArgs);
% Main loop over output elements
for linIdx = 1:numel(out)
% Convert linear index → zero-based subscripts
idx = linIdx - 1;
subs = zeros(1, numDims);
for d = 1:numDims
sd = outSize(d);
subs(d) = mod(idx, sd);
idx = floor(idx / sd);
end
% Apply implicit expansion masking
Subs = min(subs,sizes);
% Row-wise sub2ind
argsIdxNew = sum(Subs .* strides, 2) + 1;
map=(argsIdxNew ~= argsIdx);
% Gather arguments
% v=varargin(map);
% k=argsIdxNew(map);
% c=0;
% for j=map
% c=c+1;
% args{j} = v{j}(k(c));
% end
for j = 1:numArgs
if map(j)
args{j} = varargin{j}(argsIdxNew(j));
end
end
argsIdx=argsIdxNew;
% Evaluate function
out{linIdx} = fun(args{:});
end
end
Catalytic on 4 Jan 2026 (edited)
That's fair, but the use case you describe does not demand high speed or RAM efficiency. If syntactic sugar with implicit expansion is all you want, then it is trivial to make a one-liner function of your own (as @Matt J has done with his proposals).
Conversely, the OP has called out the built-in CPU arrayfun for not offering RAM-efficient scalar expansion. My real point was that, if your data is so big that RAM consumption is a concern, then arrayfun on the CPU never had a hope of being high-performing to begin with. It's the difference between a computation that takes 100 years and one that takes 300 years.
Matt J on 4 Jan 2026 (edited)
I changed two lines of code in Matt's function to return a cell array rather than a double as Matt's function does not (yet?) support the UniformOutput option.
I was never a big fan of the UniformOutput flag. If the user wants cell-valued output, then I think the iterated input function should assume responsibility for that. In your case, that would look like:
rng(100);
w = linspace(0,pi,1000);
b = num2cell( rand(3,2) , 2);
a = {[1 2]};
myfreqz = @(b,a) {freqz(b{1},a{1},w)};
h = arrayfunImplExp(myfreqz,b,a)
h = 3×1 cell array
{[0.4627 + 0.0000i 0.4627 + 0.0001i 0.4627 + 0.0002i 0.4627 + 0.0003i 0.4627 + 0.0003i 0.4627 + 0.0004i 0.4627 + 0.0005i 0.4627 + 0.0006i 0.4627 + 0.0007i … ] (1×1000 double)} {[0.0944 + 0.0000i 0.0944 + 0.0002i 0.0944 + 0.0004i 0.0944 + 0.0006i 0.0944 + 0.0008i 0.0944 + 0.0010i 0.0944 + 0.0012i 0.0944 + 0.0014i 0.0944 + 0.0015i … ] (1×1000 double)} {[0.1820 + 0.0000i 0.1820 + 0.0003i 0.1820 + 0.0005i 0.1820 + 0.0008i 0.1820 + 0.0010i 0.1820 + 0.0013i 0.1820 + 0.0015i 0.1820 + 0.0018i 0.1820 + 0.0020i … ] (1×1000 double)}
Joss Knight on 5 Jan 2026 (moved by Matt J)
arrayfun with expansion, particularly for expanding scalars, is certainly very convenient syntactic sugar for a for loop to make code more compact and readable, for instance, setting a bunch of settings on an object array, where the for loop is not going to be optimized so there is no advantage to it.
layers = arrayfun(@setHyperParams, layers, 0, [layers.L2Factor]); % Freeze learn rate
It's just easier to read, isn't it? Obviously there are other ways to do this particular operation on one line but I certainly see your point. However, the others have good points; arrayfun is almost always slower than a for loop or alternative approach, so taking action to encourage its use is something to do with caution.
Paul on 5 Jan 2026 (moved by Matt J)
Hi Joss,
Any idea why the gpu version of arrayfun supports expansion (whereas the cpu version does not)?

Sign in to comment.

Accepted Answer

Matt J on 4 Jan 2026 (edited)
Remember that arrayfun (on the CPU) is nothing more than an M-Code for-loop in disguise. Therefore, it is not hard to write your own arrayfun version -- one which supports implicit expansion -- using for-loops.
The implementation below is slower than explicit expansion (using repmat), but of course conserves a lot more memory. You would probably need very significant data sizes and/or RAM limits before implicit expansion becomes faster.
% Define test vectors
N=100;
x = rand(N,1); % Column vector
y = rand(1,N); % Row vector
z = rand(1,1,N);
myadd=@(a,b,c) a+b+c;
tic;
out1=arrayfunImplExp(myadd,x,y,z); %implicit expansion
toc
Elapsed time is 0.898685 seconds.
tic;
out2=arrayfunExplExp(myadd,x,y,z); %explicit expansion (with repmat)
toc
Elapsed time is 0.548641 seconds.
isequal(out1,x+y+z)
ans = logical
1
isequal(out2,x+y+z)
ans = logical
1
function out = arrayfunImplExp(fun, varargin)
%ARRAYFUNIMPLEXP arrayfun with implicit expansion (CPU)
%
% NOTE:
% Useful for CPU-only execution. On the GPU, use arrayfun instead,
% which implements its own implicit expansion.
% Number of inputs and maximum dimensionality
numArgs = numel(varargin);
numDims = max(cellfun(@ndims, varargin));
% Collect per-input sizes (row-wise)
sizes = cellfun(@(c) size(c, 1:numDims), varargin, 'UniformOutput', false);
sizes = vertcat(sizes{:}); % [numArgs x numDims]
% Output size is max along each dimension
outSize = max(sizes, [], 1);
% Precompute row-wise strides for linear indexing
strides = [ ...
ones(numArgs, 1), ...
cumprod(sizes(:, 1:end-1), 2) ...
];
% Convert sizes to zero-based limits
sizes = sizes - 1;
% Allocate output
out = nan(outSize);
argsIdx=nan(numArgs,1);
args=cell(1,numArgs);
% Main loop over output elements
for linIdx = 1:numel(out)
% Convert linear index → zero-based subscripts
idx = linIdx - 1;
subs = zeros(1, numDims);
for d = 1:numDims
sd = outSize(d);
subs(d) = mod(idx, sd);
idx = floor(idx / sd);
end
% Apply implicit expansion masking
Subs = min(subs,sizes);
% Row-wise sub2ind
argsIdxNew = sum(Subs .* strides, 2) + 1;
map=(argsIdxNew ~= argsIdx);
% Gather arguments
% v=varargin(map);
% k=argsIdxNew(map);
% c=0;
% for j=map
% c=c+1;
% args{j} = v{j}(k(c));
% end
for j = 1:numArgs
if map(j)
args{j} = varargin{j}(argsIdxNew(j));
end
end
argsIdx=argsIdxNew;
% Evaluate function
out(linIdx) = fun(args{:});
end
end
function out = arrayfunExplExp(fun, varargin)
%ARRAYFUNEXPLEXP arrayfun with explicit expansion using repmat
% Provided for comparison with arrayfunImplExp.
numArgs = numel(varargin);
numDims = max(cellfun(@ndims, varargin));
% Collect sizes
sizes = cellfun(@(c) size(c, 1:numDims), varargin, 'UniformOutput', false);
sizes = vertcat(sizes{:});
% Output size
outSize = max(sizes, [], 1);
numOut = prod(outSize);
% Preallocate expanded argument cell array
args = cell(numArgs, numOut);
% Explicit expansion
for k = 1:numArgs
v = varargin{k};
vSz = size(v, 1:numDims);
reps = outSize ./ vSz;
% Expand and linearize
vk = repmat(num2cell(v), reps);
args(k, :) = vk(:).';
end
% Allocate output
out = nan(outSize);
% Evaluate function
for j = 1:numOut
out(j) = fun(args{:, j});
end
end

17 comments

Alessandro on 4 Jan 2026 (edited)
@Matt J Your function arrayfunExplExp works quite well in my tests and does exactly the same as arrayfun on the GPU (much slower, but this is expected).
Question 1: Why is arrayfunExplExp faster than arrayfunImplExp? arrayfunExplExp uses repmat, which creates large temporary arrays, so it should be slower.
Question 2: Am I free to use arrayfunExplExp in my own codes (for academic research)?
Question 3: Nested loops are much faster than arrayfunExplExp. I am afraid this is expected, since arrayfun on the CPU is slow. Still, I was wondering whether there are ways to speed up arrayfunExplExp.
Here is a test verifying that arrayfun on the GPU and Matt J's arrayfun give exactly the same result:
clear,clc,close all
r = 0.04;
sigma = 1.5;
n_a = 1000;
n_z = 11;
a_grid = linspace(0,50,n_a)';
z_grid = linspace(0.5,1.5,n_z)';
aprime_vec = a_grid;
a_vec = reshape(a_grid,[1,n_a,1]);
z_vec = reshape(z_grid,[1,1,n_z]);
aprime_vec_gpu = gpuArray(aprime_vec);
a_vec_gpu = gpuArray(a_vec);
z_vec_gpu = gpuArray(z_vec);
% arrayfun GPU F1 is (a',a,z)
disp('arrayfun GPU')
tic
F1 = arrayfun(@RetFun,aprime_vec_gpu,a_vec_gpu,z_vec_gpu,r,sigma);
toc
disp('Nested loops')
tic
F2 = zeros(n_a,n_a,n_z);
for z_c=1:n_z
for a_c=1:n_a
for ap_c=1:n_a
F2(ap_c,a_c,z_c) = RetFun(a_grid(ap_c),a_grid(a_c),z_grid(z_c),r,sigma);
end
end
end
toc
% User-written arrayfun
disp('User-written arrayfun CPU')
tic
F3 = arrayfunExplExp(@RetFun, aprime_vec,a_vec,z_vec,r,sigma);
toc
err2 = max(abs(F2-F1),[],"all");
err3 = max(abs(F3-F1),[],"all");
fprintf('Error 2: %f \n',err2)
fprintf('Error 3: %f \n',err3)
function F = RetFun(aprime,a,z,r,sigma)
cons = (1+r)*a+z-aprime;
F = -Inf;
if cons>0
F = cons^(1-sigma)/(1-sigma);
end
end %end function
P.S. I can open a separate question if it is considered more appropriate.
Alessandro on 4 Jan 2026

Related to my Question 3, is it possible to accelerate your arrayfunExplExp with a MEX file (maybe using codegen)?

Matt J on 4 Jan 2026 (edited 5 Jan 2026)
I have modified the implementation of arrayfunImplExp() to improve its efficiency somewhat.
Additionally, I have modified your test code below to compare a few more CPU cases of interest. What we see is that, yes, explicit loops (nested or linear) are the fastest, thanks to the JIT. However, arrayfunImplExp and arrayfunExplExp compare quite well in speed to MATLAB's native arrayfun, even when we exclude from consideration the data-replication time associated with ndgrid() or repmat(). And, of course, arrayfunImplExp reduces temp RAM consumption by a few factors. The very best speed comes from the vectorized implementation, which shows that you should only resort to arrayfun if vectorization is truly out of reach.
r = 0.04;
sigma = 1.5;
N=120;
a_grid = linspace(0,50,N)';
z_grid = linspace(0.5,1.5,N)';
[aprime,a ,z]=ndgrid(a_grid, a_grid, z_grid);
[aprime_vec,a_vec ,z_vec]=ndgridVecs(a_grid, a_grid, z_grid);
%Get ndgridVecs from the File Exchange:
% https://www.mathworks.com/matlabcentral/fileexchange/74956-ndgridvecs
fun=@(A,B,C)RetFun(A,B,C,r,sigma);
disp ' '
disp('Vectorized Implementation')
Vectorized Implementation
tic
BaseLine0 = RetFunVec(aprime_vec,a_vec ,z_vec, r,sigma);
toc
Elapsed time is 0.043575 seconds.
disp ' '
disp('Nested loops')
Nested loops
tic
BaseLine1 = zeros(size(a));
for k=1:N
for j=1:N
for i=1:N
BaseLine1(i,j,k) = fun(a_grid(i),a_grid(j),z_grid(k));
end
end
end
toc
Elapsed time is 5.366150 seconds.
% User-written arrayfun
disp ' '
disp('Linear loop')
Linear loop
tic;
BaseLine2 = zeros(size(a));
for i=1:numel(BaseLine2)
BaseLine2(i) = fun(aprime(i),a(i),z(i));
end
toc
Elapsed time is 4.967472 seconds.
% User-written arrayfun
disp ' '
disp('Native arrayfun CPU')
Native arrayfun CPU
tic
BaseLine3 = arrayfun(fun, aprime,a ,z );
toc
Elapsed time is 7.463388 seconds.
% User-written arrayfun
disp ' '
disp('User-written arrayfun CPU')
User-written arrayfun CPU
tic
F1 = arrayfunImplExp(fun, aprime_vec,a_vec,z_vec);
toc
Elapsed time is 7.403977 seconds.
tic
F2 = arrayfunExplExp(fun, aprime_vec,a_vec,z_vec);
toc
Elapsed time is 6.563372 seconds.
isequal(BaseLine0, BaseLine1, BaseLine2, BaseLine3, F1,F2)
ans = logical
1
function F = RetFun(aprime,a,z,r,sigma)
cons = (1+r).*a+z-aprime;
F = -Inf;
if cons>0
F = cons.^(1-sigma)./(1-sigma);
end
end %end function
function F = RetFunVec(aprime,a,z,r,sigma)
cons = (1+r).*a+z-aprime;
F = cons.^(1-sigma)./(1-sigma);
F(cons<=0)=-Inf;
end %end function
Hi Matt,
Cool function.
One enhancement would be to support the UniformOutput option.
Also, I'm sure that a production version would include some input checking. As it stands, the function runs w/o error even if the inputs aren't compatible.
z = arrayfunImplExp(@(a,b) a + b,[1,2,3],[4])
z = 1×3
5 6 7
z = arrayfunImplExp(@(a,b) a + b,[1,2,3],[4,5])
z = 1×3
5 7 8
Matt J on 5 Jan 2026
Also, I'm sure that a production version would include some input checking. As it stands, the function runs w/o error even if the inputs aren't compatible.
Yes. Or, alternatively, perhaps that's behavior you would want.
@Matt J Thanks a lot! I have a small comment on the function RetFunVec.
function F = RetFun(aprime,a,z,r,sigma)
cons = (1+r).*a+z-aprime;
F = -Inf;
if cons>0
F = cons.^(1-sigma)./(1-sigma);
end
end %end function
function F = RetFunVec(aprime,a,z,r,sigma)
cons = (1+r).*a+z-aprime;
F = cons.^(1-sigma)./(1-sigma);
F(cons<=0)=-Inf;
end %end function
The function RetFunVec is not equivalent to RetFun (even though they return the same result), since in RetFunVec you compute F even when cons is negative, and this generates a temporary complex array (a negative number raised to a fractional power gives a complex result in MATLAB). It would be better to write
function F = RetFunVec(aprime, a, z, r, sigma)
cons = (1 + r) * a + z - aprime;
%F = cons.^(1 - sigma) / (1 - sigma);
%F(cons <= 0) = -Inf;
F = -Inf(size(cons));
pos = cons>0;
F(pos) = cons(pos).^(1 - sigma) / (1 - sigma);
end
Alessandro on 5 Jan 2026 (edited)
@Matt J I ran the tests on my computer and got these results:
Timing summary (seconds):
1) arrayfun GPU : 0.025733
2) Vectorized CPU : 0.040589
3) Nested loops CPU : 0.190420
4) arrayfunExplExp : 1.817837
5) arrayfunImplExp : 2.585310
6) arrayfun Matlab : 1.691153
So arrayfunImplExp is too slow, but arrayfunExplExp runs at a similar speed to the built-in arrayfun and supports implicit expansion (the input arrays don't need to have the same size, as long as they are compatible), so I like it!
Matt J on 5 Jan 2026 (edited)
So arrayfunImplExp is too slow
And you're using my updated version? Anyway... there could be some platform-dependent differences in speed performance.
The function RetFunVec is not equivalent to RetFun ... It would be better to write
I get still faster performance with,
function F = RetFunVec(aprime, a, z, r, sigma)
cons = (1 + r) * a + z - aprime;
F = max(cons,0).^(1 - sigma) / (1 - sigma);
F(cons<=0)=-Inf;
end
Paul on 5 Jan 2026
... arrayfunExplExp runs at similar speed as the built-in arrayfun and it supports implicit expansion (input arrays don't need to have same size, as long as they are compatible), so I like it!
If it meets your needs, that's great. But arrayfunExplExp uses repmat, which is exactly what I thought you wanted to avoid: When I have large arrays, duplicating dimensions with repmat takes a lot of memory.
Alessandro on 5 Jan 2026
@Paul If it meets your needs, that's great. But arrayfunExplExp uses repmat, which is exactly what I thought you wanted to avoid
Correct: it's good to have both arrayfunExplExp and arrayfunImplExp and choose between the two based on the size of the arrays.
@Matt J And you're using my updated version? Anyway...there could be some platform-dependent differences in the speed performances.
Yes, I am. Maybe the differences in speed arise from the fact that I changed the tests a bit. I report my complete test code here.
%% Performance test: arrayfun (GPU), vectorized (CPU), nested loops, user arrayfun (CPU)
clear;
clc;
%% Parameters and grids
r = 0.04;
sigma = 1.5;
n_a = 150;
n_z = 150;
a_grid = linspace(0, 50, n_a).';
z_grid = linspace(0.5, 1.5, n_z).';
% Shapes:
% aprime_vec : (a', 1, 1)
% a_vec : (1, a, 1)
% z_vec : (1, 1, z)
aprime_vec = a_grid;
a_vec = reshape(a_grid, [1, n_a, 1]);
z_vec = reshape(z_grid, [1, 1, n_z]);
[aprime_full,a_full,z_full] = ndgrid(a_grid,a_grid,z_grid);
r_full = repmat(r,size(aprime_full));
sigma_full = repmat(sigma,size(aprime_full));
%% Move to GPU
aprime_vec_gpu = gpuArray(aprime_vec);
a_vec_gpu = gpuArray(a_vec);
z_vec_gpu = gpuArray(z_vec);
%% 1. arrayfun on GPU: F1 is (a', a, z)
disp('1) arrayfun on GPU');
t_gpu = tic;
F1 = arrayfun(@RetFun, aprime_vec_gpu, a_vec_gpu, z_vec_gpu, r, sigma);
time_gpu = toc(t_gpu);
%% 2. Vectorized on CPU
disp('2) Vectorized on CPU');
t_vec = tic;
F2 = RetFunVec(aprime_vec, a_vec, z_vec, r, sigma);
time_vec = toc(t_vec);
%% 3. Nested loops on CPU
disp('3) Nested loops on CPU');
t_loop = tic;
F3 = zeros(n_a, n_a, n_z);
for z_c = 1:n_z
for a_c = 1:n_a
for ap_c = 1:n_a
F3(ap_c, a_c, z_c) = RetFun(a_grid(ap_c), a_grid(a_c), z_grid(z_c), r, sigma);
end
end
end
time_loop = toc(t_loop);
%% 4. User-written arrayfun on CPU
disp('4) User-written arrayfun on CPU');
t_user = tic;
F4 = arrayfunImplExp(@RetFun, aprime_vec, a_vec, z_vec, r, sigma);
time_user = toc(t_user);
%% 5. Matlab arrayfun on CPU
disp('5) Matlab arrayfun on CPU');
t_user = tic;
%F5 = arrayfun(@RetFun, aprime_full, a_full, z_full, r_full, sigma_full);
F5 = arrayfun(@RetFun, aprime_full, a_full, z_full,r_full,sigma_full);
time_matlab = toc(t_user);
%% Accuracy checks
F1_cpu = gather(F1); % bring GPU result back to CPU for comparison
err2 = max(abs(F2 - F1_cpu), [], "all");
err3 = max(abs(F3 - F1_cpu), [], "all");
err4 = max(abs(F4 - F1_cpu), [], "all");
err5 = max(abs(F5 - F1_cpu), [], "all");
fprintf('\nMaximum absolute error vs GPU arrayfun result:\n');
fprintf(' Vectorized CPU (F2): %g\n', err2);
fprintf(' Nested loops CPU (F3): %g\n', err3);
fprintf(' User arrayfun CPU (F4): %g\n', err4);
fprintf(' Matlab arrayfun CPU (F5): %g\n', err5);
%% Timing summary
fprintf('\nTiming summary (seconds):\n');
fprintf(' 1) arrayfun GPU : %.6f\n', time_gpu);
fprintf(' 2) Vectorized CPU : %.6f\n', time_vec);
fprintf(' 3) Nested loops CPU : %.6f\n', time_loop);
fprintf(' 4) arrayfunImplExp : %.6f\n', time_user);
fprintf(' 5) arrayfun Matlab : %.6f\n', time_matlab);
%% ------------------------------------------------------------------------
%% Local functions
function F = RetFun(aprime, a, z, r, sigma)
cons = (1 + r) * a + z - aprime;
if cons > 0
F = cons^(1 - sigma) / (1 - sigma);
else
F = -Inf;
end
end
function F = RetFunVec(aprime, a, z, r, sigma)
cons = (1 + r) * a + z - aprime;
%F = cons.^(1 - sigma) / (1 - sigma);
%F(cons <= 0) = -Inf;
F = -Inf(size(cons));
pos = cons>0;
F(pos) = cons(pos).^(1 - sigma) / (1 - sigma);
end
Matt J on 6 Jan 2026 (edited)
OK, so attached is my idea of how the test should have been coded, and here are my results:
>> timeComparisons
1) arrayfun on GPU
2) Vectorized on CPU
3) Nested loops on CPU
4i) User-written arrayfunImplExp on CPU
4e) User-written arrayfunExplExp on CPU
5) Matlab arrayfun on CPU
Maximum absolute error vs baseline (F2):
gpuArray.arrayfun (F1) : 7.10543e-15
Vectorized CPU (F2) : 0
Nested loops CPU (F3) : 0
arrayfunImplExp (F4i): 0
arrayfunExplExp (F4e): 0
Matlab arrayfun CPU (F5) : 0
Timing summary (seconds):
1) arrayfun GPU : 0.010027
2) Vectorized CPU : 0.078522
3) Nested loops CPU : 0.529490
4i) arrayfunImplExp : 6.427602
4e) arrayfunExplExp : 4.775996
5) arrayfun Matlab : 7.100541
The most important difference is that I have wrapped RetFun() in a 3-argument function:
fun=@(A,B,C) RetFun(A,B,C, r, sigma);
since r and sigma are fixed constants. If you don't do this, you are forcing arrayfunImplExp to do additional scalar expansions on the scalar inputs r and sigma, which I think is kind of silly. It's needless extra computation.
Now, the discussion gets complicated because the native CPU arrayfun doesn't seem to process anonymous functions as well as it does non-anonymous functions. If you run test (5) instead with,
disp('5) Matlab arrayfun on CPU');
t_user = tic;
[aprime_full,a_full,z_full] = ndgrid(a_grid,a_grid,z_grid); %Include time for explicit expansion
r_full = repmat(r,size(aprime_full));
sigma_full = repmat(sigma,size(aprime_full));
F5 = arrayfun(@RetFun, aprime_full, a_full, z_full, r_full, sigma_full);
time_matlab = toc(t_user);
then its time drops to about 4.9 sec. So, okay, you might argue that test (5) is unfairly handicapped. On the other hand, I think that quite often you would like to be able to apply arrayfun() to anonymous functions. So, to get the better performance from native arrayfun, you are forced to spend both the data-duplication effort and the effort of rewriting fun() non-anonymously.
Side Note: I have modified the way compute times are calculated in test (1) and (2). TIC and TOC don't work very accurately for GPU functions.
Matt J on 6 Jan 2026 (edited)
I have just uploaded a 3rd version, arrayfunScalarExp(), to the File Exchange:
It is a hybrid of arrayfunImplExp and arrayfunExplExp, which I think achieves an optimal tradeoff between speed (about 10% slower than native arrayfun) and RAM consumption. It also supports the 'UniformOutput' flag, per the recommendation of @Paul.
I've also attached a revised timeComparison script, this time using r_full and sigma_full in all CPU arrayfun implementations, so that they are tested against the best performance of the native CPU arrayfun(). The results I get are,
1) arrayfun on GPU
2) Vectorized on CPU
3) Nested loops on CPU
4a) User-written arrayfunScalarExp on CPU
4b) User-written arrayfunExplExp on CPU
5) Matlab arrayfun on CPU
Maximum absolute error vs baseline (F2):
gpuArray.arrayfun (F1) : 7.10543e-15
Vectorized CPU (F2) : 0
Nested loops CPU (F3) : 0
arrayfunScalarExp (F4a): 0
arrayfunExplExp (F4b): 0
Matlab arrayfun CPU (F5) : 0
Timing summary (seconds):
1) arrayfun GPU : 0.010373
2) Vectorized CPU : 0.078654
3) Nested loops CPU : 0.331331
4a) arrayfunScalarExp : 5.479543
4b) arrayfunExplExp : 11.268232
5) arrayfun Matlab : 5.054332
Note, though, that two other syntax choices are possible for (4a) and (4b), with different results.
Alessandro on 6 Jan 2026 (edited)
Fantastic, thanks for sharing this function. In terms of speeding it up, would it be a good idea to add a parfor here (see code below, taken from your function arrayfunExplExp)? Clearly the loop iterations over j are independent from one another.
% Evaluate function
for j = 1:numOut
out(j) = fun(args{:, j});
end
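A sketch of that change, assuming out, args, fun, and numOut as defined in arrayfunExplExp:

```matlab
% Sketch: the iterations are independent, so the loop can be a parfor
parfor j = 1:numOut
    out(j) = fun(args{:, j});  % args sliced along its second dimension
end
```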
Matt J on 6 Jan 2026 (edited)
In certain cases, probably yes. However, if fun() is anonymous, it gets tricky. An anonymous function can contain large externally-scoped data from its parent workspace. This data needs to be cloned to the parfor workers, adding overhead.
Also, keep in mind that the bottleneck in arrayfunExplExp is not always the loop you've shown. The line,
vk = repmat(num2cell(v), reps);
can be the bottleneck when v is large. This was demonstrated in my last comment where arrayfunExplExp gave the slowest performance of all the implementations. One of the reasons that I prefer arrayfunScalarExp is that its performance is not as sensitive to factors like these.
Alessandro on 6 Jan 2026

Thanks for the explanation. But using parfor with threads instead of processes should reduce the overhead considerably. I will try later if I have time.
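For reference, a thread-based pool (available since R2020a) can be selected explicitly before the parfor runs; thread workers share memory with the client, so anonymous-function workspace data need not be copied to process workers:

```matlab
% Sketch: use thread workers instead of the default process workers
if isempty(gcp("nocreate"))
    parpool("Threads");
end
```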

Matt J on 7 Jan 2026
I made some improvements to my File Exchange submission, which now includes two versions, one which is RAM-conserving (arrayfunLowMem) and one which is liberal with RAM, but somewhat faster (arrayfunHighMem). Both are significantly faster than native arrayfun (revised timeComparisons.m script attached).
1) arrayfun on GPU
2) Vectorized on CPU
3) Nested loops on CPU
4a) User-written arrayfunLowMem on CPU
4b) User-written arrayfunHighMem on CPU
5) Matlab arrayfun on CPU
Maximum absolute error vs baseline (F2):
gpuArray.arrayfun (F1) : 7.10543e-15
Vectorized CPU (F2) : 0
Nested loops CPU (F3) : 0
arrayfunLowMem (F4a): 0
arrayfunHighMem (F4b): 0
Matlab arrayfun CPU (F5) : 0
Timing summary (seconds):
1) arrayfun GPU : 0.010418
2) Vectorized CPU : 0.097721
3) Nested loops CPU : 0.329859
4a) arrayfunLowMem : 3.806847
4b) arrayfunHighMem : 3.187784
5) arrayfun Matlab : 5.101973

Sign in to comment.

More Answers (3)

Torsten on 3 Jan 2026 (edited)
F3 = cellfun(@myadd, {x}, {y}, 'UniformOutput', false)
will work.
Arrayfun tries to apply "myadd" to [first element of x, first element of y] and [second element of x, second element of y]. Thus x and y must be of the same size, and - even if it worked - the result would be [2, 4] or [2;4].
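With equal sizes it pairs elements positionally, for example:

```matlab
% Element-by-element application: (1,3) -> 4, (2,4) -> 6
arrayfun(@(a,b) a + b, [1; 2], [3; 4])  % returns [4; 6]
```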
I don't understand why it works for gpuArray and gives the result you want. Maybe input arrays of different size are automatically interpreted here as two separate single objects to which "myadd" is to be applied - as it is done if you use cell arrays together with "cellfun".

4 comments

Paul on 3 Jan 2026
The cpu version of arrayfun states: "The arrays A1,...,An all must have the same size."
The gpuArray version of gpuArray.arrayfun states: "The sizes of A1,...,An must match or be compatible." which I gather means the inputs A1 .. An are implicitly expanded.
So I guess the question is why the gpu version offers that expansion and the cpu version does not. I wonder if that could lead to problems for people who develop on the cpu version and then transition later to the gpu version, or vice versa.
Torsten on 3 Jan 2026 (edited)
Ok. "Compatible" for the GPU case means here: it is allowed that A1 is a column vector while A2 is a row vector. This makes implicit expansion of A1 + A2 possible.
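That is the standard compatible-size rule: each dimension must either match or equal 1 in one of the operands. For example, with ordinary CPU arithmetic:

```matlab
A = zeros(2, 1, 5);
B = zeros(1, 3);
size(A + B)  % [2 3 5]: each dimension matches or expands from 1
```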
Alessandro on 3 Jan 2026
@Paul: I agree with you. Ideally Matlab should implement implicit expansion also on arrayfun cpu, as it is implemented on the gpu version. This improvement would also not break any code.
Torsten on 3 Jan 2026 (edited)
If your contribution was a complaint rather than a question, make a feature request:

Sign in to comment.

Matt J on 3 Jan 2026 (edited 4 Jan 2026)
I agree it is confusing, but the gpuArray version of arrayfun was never intended as a direct analogue of the CPU version. Additionally, there are all kinds of other instances where the gpuArray support for a function differs in capabilities from its CPU version. Usually, though, the GPU version is less flexible, not more so as is the case with arrayfun.
A more appropriate comparison of implicit expansion between CPU vs GPU (for binary functions) would probably be to use bsxfun instead:
% Define test vectors
x = rand(10000,1); % Column vector
y = rand(1,10000); % Row vector
xg=gpuArray(x);
yg=gpuArray(y);
myadd=@(a,b) a+b;
timeit( @() bsxfun(myadd, x, y) ) %CPU
ans =
0.3355
gputimeit( @() bsxfun(myadd, xg, yg) ) %GPU
ans =
0.0078
Joss Knight on 5 Jan 2026 (moved by Matt J)


Well, there are a couple of answers to that. Firstly and probably most importantly, GPU arrayfun is just an extension of gpuArray's existing compiler for element-wise operations, all of which support dimension expansion; it's also a natural strided memory access pattern to support for a GPU kernel, where each input has its own stride, and there is native support for this kind of memory access in GPU hardware. Secondly, it makes design sense and only isn't implemented for CPU for the reasons given. The historical explanation is that the idea of broadcast operations didn't even exist back when arrayfun was first created for the CPU, but it did exist by the time GPU arrayfun came along.

2 comments

Matt J on 5 Jan 2026 (edited)
Secondly, it makes design sense and only isn't implemented for CPU for the reasons given.
Given where? In your initial post?
And are those reasons still an argument for not pursuing it now, or is it just because of historical legacy?
Joss Knight on 6 Jan 2026
The reasons given by various people. Should you really enhance something you want to discourage people from using?

Sign in to comment.

Categories: Matrices and Arrays
Version: R2024b
Asked: 3 Jan 2026
Last commented: 7 Jan 2026