Vectorizing multiple string comparison

Paolo Binetti on 26 Jan 2017
Commented: Paolo Binetti on 28 Jan 2017
Is there a way to significantly speed up this loop, perhaps by vectorizing it? Inputs in attachment. I do not have a Matlab version with "string" functions.
d = a';
for i = 1:numel(a)
d{i} = c(strcmp(a{i}, b), :);
end
I tried working my way out from the inner part with cellfun, but either I am not getting it right or it is not the right approach:
aux = cellfun(@strcmp, a, b); % does not work
  2 comments
Walter Roberson on 27 Jan 2017
That file is an Octave file that would take a bunch of work to read in MATLAB.
This is the wrong resource to be asking about performance improvement for Octave.
Paolo Binetti on 27 Jan 2017
You are right. R2016 does not run on the PC I mostly use, an old beast which still works perfectly, but on XP. So until I buy a new computer, I am stuck with either a much older version of Matlab or Octave, which does run on XP. I could have generated the input with my older Matlab. And your answer below gives me one more motivation to buy a new computer soon!


Accepted Answer

Guillaume on 26 Jan 2017
One obvious minor speed-up is to get rid of the find that serves absolutely no purpose. You can directly use the logical vector returned by strcmp:
d{i} = c(strcmp(a{i}, b), :);
For some reason, I cannot load your mat file. I'm going to assume that a is a cell array of strings, and so is b (otherwise the loop would not be needed). Assuming that there are no repeated strings in b:
assert(numel(unique(b)) == numel(b), 'This code does not work when there are duplicate values in b');
d = cell(size(a))';
[isfound, loc] = ismember(a, b);
d(isfound) = num2cell(c(loc(isfound), :), 2); % wrap each matching row of c in its own cell
If it's guaranteed that all elements of a are found in b, then you can simplify even further to:
assert(numel(unique(b)) == numel(b), 'This code does not work when there are duplicate values in b');
[isfound, loc] = ismember(a, b);
assert(all(isfound), 'The next line only works if all elements of a are in b');
d = num2cell(c(loc, :), 2);
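To make the indexing concrete, here is a minimal sketch of the ismember approach on a small made-up input. The values of a, b and c below are hypothetical, chosen so that b has no duplicates and every element of a is found in b; c is assumed to be a char matrix with one row per element of b.

```matlab
% Hypothetical sample data: b has no repeated strings.
a = {'AAG' 'CTA' 'GAT'};
b = {'GAT' 'AAG' 'CTA'};
c = ['ATT'; 'AGA'; 'TAA'];   % row k of c belongs to b{k}

% loc(i) is the position of a{i} in b (0 if it is not found).
[isfound, loc] = ismember(a, b);

% Pick the rows of c in the order given by a, one row per cell.
d = num2cell(c(loc, :), 2);
```

With this input loc is [2 3 1], so d holds rows 2, 3 and 1 of c, in that order.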
  2 comments
Paolo Binetti on 27 Jan 2017
Edited: Paolo Binetti on 27 Jan 2017
Thank you.
  • Good on you for getting rid of "find". I have edited the question accordingly.
  • I am sorry that you could not download my inputs.mat file. I have uploaded it again, I have tested it and it seems to work for me.
  • Nevertheless, all of your assumptions were right, except that b does actually contain repeated strings, unfortunately (if it did not, the "intersect" function would make it possible to vectorize the loop).
Guillaume on 27 Jan 2017
Edited: Guillaume on 27 Jan 2017
According to Walter, your mat file is an Octave file that MATLAB can't open.
If there are duplicate values in b, then you don't have a choice but to use a loop, either explicitly as you have done or with cellfun:
d = cellfun(@(aa) c(strcmp(aa, b), :), a, 'UniformOutput', false);
It's very possible that the cellfun may be slower than the explicit loop (due to the anonymous function call).
edit: in MATLAB R2016b there is an extremely easy way to vectorise the string comparison, using the new string class:
string(a) == string(b)'
but you'd still need a loop or cellfun afterward to create the d cell array:
d = cellfun(@(r) c(r, :), num2cell(string(a) == string(b)', 1), 'UniformOutput', false)
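For versions without the string class (older MATLAB, or Octave), a similar comparison matrix can probably be built by first mapping every string to an integer label with unique and then comparing the labels with bsxfun. This is only a sketch and has not been run on the original inputs. The sample data is made up, c is assumed to be a char matrix with one row per element of b, and the names ids, ia, ib and match are introduced here for illustration.

```matlab
% Hypothetical sample data: b contains a repeated string ('TCT').
a = {'AAG' 'AGA' 'TCT'};
b = {'AAG' 'TCT' 'AGA' 'TCT'};
c = ['AGA'; 'CTC'; 'GAT'; 'CTA'];   % row k of c belongs to b{k}

% Label every distinct string with an integer (same string -> same label).
[~, ~, ids] = unique([a(:); b(:)]);
ia = ids(1:numel(a));               % labels of the elements of a
ib = ids(numel(a)+1:end);           % labels of the elements of b

% match(j, i) is true when b{j} equals a{i}; size is numel(b)-by-numel(a).
match = bsxfun(@eq, ib(:), ia(:)');

% One cell per element of a, holding the matching rows of c.
d = cellfun(@(r) c(r, :), num2cell(match, 1), 'UniformOutput', false);
```

The string comparison itself happens once inside unique; as with the string version, a cellfun (or loop) is still needed to build d.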


More Answers (1)

Walter Roberson on 27 Jan 2017
ismember can be used between cell arrays of strings. The two-output version can be used to find the indices, which you can then use to index into c.
  3 comments
Walter Roberson on 27 Jan 2017
Flip the order around: ismember(b, a).
Paolo Binetti on 28 Jan 2017
I had a feeling I was missing an obvious point. Thank you for pointing it out! The modified code, below, runs much faster. I tried to vectorize the remainder of the loop, to no avail, but the costly string comparison is at least out of the loop.
a = { 'AAG' 'AGA' 'ATT' 'CTA' 'CTC' 'GAT' 'TAA' 'TCT' 'TTC' };
b = { 'AAG' 'AGA' 'GAT' 'ATT' 'TTC' 'TCT' 'CTC' 'TCT' 'CTA' 'TAA' 'AAG' };
c = [ 'AGA';'GAT';'ATT';'TTC';'TCT';'CTC';'TCT';'CTA';'TAA';'AAG';'AGA' ];
[temp, idx] = ismember(b, a);
d = a';
for i = 1:numel(a)
d{i} = c(i == idx, :);
end
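One possible way to vectorize the remaining loop, offered as a sketch rather than a tested solution for the real inputs, is to sort the rows of c by their index into a and then split them into groups with mat2cell. It relies on sort keeping equal elements in their original order, so the rows inside each cell stay in the same order as in the loop version. The names tf, rows, sortedIdx, order and counts are introduced here for illustration.

```matlab
a = { 'AAG' 'AGA' 'ATT' 'CTA' 'CTC' 'GAT' 'TAA' 'TCT' 'TTC' };
b = { 'AAG' 'AGA' 'GAT' 'ATT' 'TTC' 'TCT' 'CTC' 'TCT' 'CTA' 'TAA' 'AAG' };
c = [ 'AGA';'GAT';'ATT';'TTC';'TCT';'CTC';'TCT';'CTA';'TAA';'AAG';'AGA' ];

% idx(j) is the position of b{j} in a, or 0 if b{j} is not in a.
[tf, idx] = ismember(b, a);
rows = find(tf);                       % rows of c that have a match in a

% Group those rows by the element of a they belong to.
[sortedIdx, order] = sort(idx(tf));

% counts(i) is the number of rows of c that go into d{i} (may be 0).
counts = accumarray(sortedIdx(:), 1, [numel(a) 1]);

% Split the reordered rows of c into numel(a) consecutive blocks.
d = mat2cell(c(rows(order), :), counts, size(c, 2));
```

On the sample data above this should give the same 9-by-1 cell array as the loop. Whether it is actually faster than the loop on large inputs would need to be timed.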
