speed up ismember with two-dim data
Mostrar comentarios más antiguos
I have two large cell-arrays (X and Y) where each field is a two dimensional array representing pixel positions. The cell arrays are not the same size (can be in the range of thousands) and each entry can be be a different length (ranging from length 1 to length >1000). Maybe relevant: Over X and Y individually, no point can be included twice, i.e. all entries in X have no points in common (same for Y).
Now I want to find the overlap between all entries in X with all entries in Y. (in reality I filter the entries in Y for a criterion and compute only roughly 10-20 percent of the comparisons)
Is there a way to speed up this process? I have read about ismembc which requires sorted data (and thus I thought it cannot be used in my case).
Minimal working example:
X{1} = [1 1; 1 2; 1 3];
X{2} = [3 1; 3 2; 3 3];
Y{1} = [1 1; 1 2];
Y{2} = [2 3; 3 1; 3 2; 3 3];
Y{3} = [5 5];
mat_overlap = NaN(length(X), length(Y));
for ii = 1:length(X)
for jj = 1:length(Y)
mat_overlap(ii,jj) = sum(ismember(X{ii}, Y{jj}, 'rows'));
end
end
6 comentarios
Are the values integers or floating point numbers? How large are the values?
The current set of data can be converted to 1D data by:
m = 10;
XX{i} = X{i}(:, 1) * m + X{i}(:, 2);
Afterwards ismembc can acclerate the search again. This works only if there is an m > max(Data(:, 2)) which does not cause values > 2^53, where rounding effects would invalidate the integrity.
An alternative is to calculate the MD5 hash of each row, but this takes some time and I guess, that this is slower in total.
Optimizing the runtime depends on the specific input: How large are the data in average? How many elements overlap? Would this be a realistic input:
X = cell(1, 1000);
for k = 1:1000
n = randi([1, 100]);
X{k} = rand(n, 2); % Or: randi([1,50], n, 2) ?
% Or: unique(randi([1,50], n, 2), 'rows')?
end
% And the same for Y
Are the rows of the elements of X and Y guaranteed to be unique? Or is this possible:
X{1} = [1 1; 1 2; 1 1];
Felix Müller
el 14 de Jul. de 2021
Editada: Felix Müller
el 14 de Jul. de 2021
Bjorn Gustavsson
el 14 de Jul. de 2021
Generate arrays groupidxX and groupidxY arrays for all points in X and Y then put all of X and Y into two n-by-2 arrays, pass those 2 arrays to sortrows, those row-sorted arrays should be "reasonably" efficient to find matching pairs in by looping through only small subsets of the second array, then keeping the proper group-indices and whatnot of the matches?
Felix Müller
el 14 de Jul. de 2021
Editada: Felix Müller
el 15 de Jul. de 2021
Jan
el 14 de Jul. de 2021
@Felix Müller: I assume, there is a typo in the code to produce the test data:
X{k} = random_vector(idx_first:idx_last, :);
% ^^^ added to get 2 columns
I've posted a comparison with ismembc and get a speedup of a factor 100 - much more then 60%.
Felix Müller
el 15 de Jul. de 2021
Respuesta aceptada
Más respuestas (0)
Categorías
Más información sobre Loops and Conditional Statements en Centro de ayuda y File Exchange.
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!