
Applying large logical masks to large arrays quickly, row wise

28 views (last 30 days)
George Hackney on 15 Dec 2022
Commented: Jan on 17 Dec 2022
I am filtering many large NxM arrays (n > 500000, m > 10) by applying an NxM mask to each, generating many column vectors. The mask can only be true in a single column per row, so the resultant column vectors will be nx1.
Directly applying the mask to the array is not an option, as MATLAB applies the mask in a column-wise fashion. This is undesirable because the rows in the data represent time stamps.
Is there a quicker way to apply a mask in a row-wise fashion, or to convert a mask to linear indices?
Example:
%% SETUP
% Data to be filtered
data = ...
[ 17 24 1 8 15; ...
23 5 7 14 16; ...
4 6 13 20 22; ...
10 12 19 21 3; ...
11 18 25 2 9 ];
% Mask to be applied
mask = ...
[ 0 0 1 0 0 ; ...
0 0 1 0 0 ; ...
0 1 0 0 0 ; ...
0 1 0 0 0 ; ...
1 0 0 0 0 ...
];
mask = logical(mask);
%% DESIRED DATA
wanted = [ 1;7;6;12;11]
wanted = 5×1
     1
     7
     6
    12
    11
%% DIRECT APPLICATION OF MASK
tic
direct = data(mask); % [11;6;12;1;7]
toc
Elapsed time is 0.002715 seconds.
disp(direct)
    11
     6
    12
     1
     7
%% CURRENT SLOW SOLUTION
tic
[row,col] = find(mask); % This line is slow when called many times with large arrays
[row,sortIdx] = sort(row);
col = col(sortIdx);
idx = sub2ind(size(mask), row, col);
current = data(idx);
toc
Elapsed time is 0.008832 seconds.
disp(current)
     1
     7
     6
    12
    11
4 comments
Dyuman Joshi on 16 Dec 2022
I see, it's top to bottom rather than left to right.
I have a query regarding your comment below, if you don't mind answering @Jan
"The creation of large temporary arrays is expensive."
How do I know whether something is memory- or time-expensive (I don't know if they are different or not)?
Jan on 17 Dec 2022
You can simply count how much memory a piece of code costs. Split the execution into atomic parts, e.g.:
Z = exp(sin(X + Y).^2); % where X and Y are matrices
% In steps:
A = X + Y;
B = sin(A);
C = B.^2;
Z = exp(C);
Three temporary arrays are required. How expensive this is depends on the size of the arrays and the CPU. If the arrays fit into the CPU caches, there is no latency from waiting for the slow RAM.
How expensive the calculations are also depends on the CPU, and even on the Matlab version. SIN, EXP and the element-wise power can be multi-threaded; Matlab then distributes the processing across several cores. How many cores are used for a specific data size depends on the implementation.
There is no general way to predict how much time is needed for the processing and the memory access. You have to try it. Look at my answers to your question: I've tried 10 different algorithms, but posted only the most promising versions.
When I write a function, I implement different methods and store them in a unit-test function. On the one hand I use them to test the output; on the other hand I can check, for new Matlab versions, whether the formerly fastest implementation is still the best one.
Don't forget: The time for solving a problem is:
T_all = T_design + T_program + T_debug + T_documentation + T_run
Avoid premature optimization of the run time. Write stable code first, and then focus on the bottleneck determined by the profiler.
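For illustration, a minimal timeit-based sketch of such a comparison (only a sketch; transSol and matchSol are copied from the accepted answer below):
% Time two variants of the row-wise masking and check that they agree
n = 500000;
m = 10;
data = rand(n, m);
mask = false(n, m);
mask(sub2ind([n, m], 1:n, randi([1, m], 1, n))) = true;
t1 = timeit(@() transSol(data, mask));
t2 = timeit(@() matchSol(data, mask));
fprintf('transSol: %.4f s, matchSol: %.4f s\n', t1, t2);
assert(isequal(transSol(data, mask), matchSol(data, mask)), 'Results differ');
function data = transSol(data, mask) % Transpose both matrices
    D = data.';
    data = D(mask.');
end
function R = matchSol(data, mask) % find + sub2ind
    [I, J] = find(mask);
    R(I) = data(sub2ind(size(data), I, J));
    R = R(:);
end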


Accepted Answer

Jan on 15 Dec 2022
Edited: Jan on 15 Dec 2022
Here is a set of different approaches. Run the code to see timings locally on your computer.
n = 500000;
m = 10;
data = rand(n, m);
mask = false(n, m);
mask(sub2ind([n, m], 1:n, randi([1,m], 1, n))) = true;
tic;
for k = 1:100
y0 = origSol(data, mask);
end
toc
Elapsed time is 3.558161 seconds.
tic;
for k = 1:100
y1 = transSol(data, mask);
end
toc
Elapsed time is 2.124697 seconds.
assert(isequal(y0, y1), 'Results differ');
tic;
for k = 1:100
y2 = matchSol(data, mask);
end
toc
Elapsed time is 2.345099 seconds.
assert(isequal(y0, y2), 'Results differ');
tic;
for k = 1:100
y3 = loopSol(data, mask);
end
toc
Elapsed time is 1.696199 seconds.
assert(isequal(y0, y3), 'Results differ');
% ========================================================================
function data = origSol(data, mask) % Written by George Hackney
    [row, col] = find(mask);
    [row, sortIdx] = sort(row);
    col = col(sortIdx);
    idx = sub2ind(size(mask), row, col);
    data = data(idx);
end
function data = transSol(data, mask) % Transpose both matrices
    D = data.';
    data = D(mask.');
end
function R = matchSol(data, mask) % Leaner version of the original code
    [I, J] = find(mask);
    R(I) = data(sub2ind(size(data), I, J));
    R = R(:);
end
function R = loopSol(data, mask) % C-style loops
    [m, n] = size(data);
    R = zeros(m, 1);
    for in = 1:n
        for im = 1:m
            if mask(im, in)
                R(im) = data(im, in);
            end
        end
    end
end
Transposing large arrays is expensive, so the speed-up of transSol() is not impressive.
The ugly loop solution is inspired by the C-mex implementation, which is provided in my 2nd answer. Timings on my R2018b, i5-3320M:
% 2.60 sec Original
% 2.29 sec transpose data and mask
% 1.83 sec matchSol(): find(mask)
% 1.13 sec dull loops
% 0.45 sec C-mex
5 comments
Jan on 16 Dec 2022
Edited: Jan on 16 Dec 2022
@George Hackney: Yes, you are told to use fast vectorized code. But this is not necessarily the code with the shortest run-time.
There is an easy rule: The creation of large temporary arrays is expensive. In this case:
function R = transSol(data, mask) % Transpose both matrices
D = data.'; % Expensive
M = mask.'; % Expensive
R = D(M); % Fairly efficient
end
To my surprise, even D(M) can be accelerated by a simple C-mex function: FEX: CopyMask. But the most expensive parts are the explicit transposes of the inputs. CPUs have become very fast, even when just one core is used; in comparison, the RAM access can be more expensive than the actual processing. If the processed memory does not fit into the CPU caches, this slows the processing down.
I do love Matlab's elegant vectorized commands. The compact code is nice, easy to maintain and less prone to typos compared to nested loops. But "vectorized" means that vectors (or, in general, arrays) are processed. If this requires large temporary arrays, the speed suffers.
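As a rough illustration (only a sketch, using the array size from the question), the temporaries created by the explicit transposes can be counted directly:
% Sketch: memory of the temporary arrays in transSol
n = 500000;
m = 10;
bytesD = n * m * 8; % D = data.' -> temporary double array, 8 bytes per element (~38 MB)
bytesM = n * m * 1; % M = mask.' -> temporary logical array, 1 byte per element (~5 MB)
fprintf('temporaries: %.1f MB\n', (bytesD + bytesM) / 2^20);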
Jan on 16 Dec 2022
@George Hackney: Another try to improve the speed: Actually, processing a matrix column-wise is cheaper, because the CPU imports memory in blocks of 64 bytes into its caches (the "cache line size"). Therefore accessing neighboring bytes is faster than random access. Matlab stores arrays in column-wise order, so processing them column-wise is faster.
In spite of this, the row-wise processing in loopSolRow() is faster, because the inner loop can be stopped after the first match (there is only 1 TRUE per row). This saves half the loop on average, if the TRUEs are randomly distributed.
n = 500000;
m = 10;
data = rand(n, m);
mask = false(n, m);
mask(sub2ind([n, m], 1:n, randi([1,m], 1, n))) = true;
tic;
for k = 1:100
y3 = loopSol(data, mask);
end
toc
Elapsed time is 1.613006 seconds.
tic;
for k = 1:100
y3 = loopSolRow(data, mask);
end
toc
Elapsed time is 1.497016 seconds.
% ============================================================
function R = loopSol(data, mask) % C-style loops, column-wise
    [m, n] = size(data);
    R = zeros(m, 1);
    for in = 1:n
        for im = 1:m
            if mask(im, in)
                R(im) = data(im, in);
            end
        end
    end
end
function R = loopSolRow(data, mask) % C-style loops, row-wise
    [m, n] = size(data);
    R = zeros(m, 1);
    for im = 1:m
        for in = 1:n
            if mask(im, in)
                R(im) = data(im, in);
                break;
            end
        end
    end
end


More Answers (1)

Jan on 15 Dec 2022
Edited: Jan on 15 Dec 2022
A C-mex version:
// File: maskTCopy.c
// Compile: mex -O -R2018a maskTCopy.c
#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    double    *data, *out;
    mxLogical *mask;
    mwSize    m, n, im, in;

    data = mxGetDoubles(prhs[0]);
    m    = mxGetM(prhs[0]);
    n    = mxGetN(prhs[0]);
    mask = mxGetLogicals(prhs[1]);

    plhs[0] = mxCreateDoubleMatrix(m, 1, mxREAL);
    out     = mxGetDoubles(plhs[0]);

    for (in = 0; in < n; in++) {
        for (im = 0; im < m; im++) {
            out[im] += *mask++ * *data++;
        }
    }
}
Under my Matlab R2018b this needs 0.39 sec, compared with 1.83 sec for matchSol() from my other answer and 2.65 sec for the original version. [EDITED, new version of the C code, 0.45 -> 0.39 sec]
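A usage sketch (assuming the file above is saved as maskTCopy.c on the Matlab path and a supported C compiler is configured):
% Sketch: compile once, then call the mex file like a normal function.
% The accumulation trick in the C code relies on at most one TRUE per row.
mex -O -R2018a maskTCopy.c
n = 500000;
m = 10;
data = rand(n, m);
mask = false(n, m);
mask(sub2ind([n, m], 1:n, randi([1, m], 1, n))) = true;
y = maskTCopy(data, mask); % n-by-1 vector of the row-wise masked values
D = data.';
assert(isequal(y, D(mask.')), 'mex result differs from the transpose method');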
