Fastest way to replace multiple substrings with a single new string?

Omar Salah on 6 Jun 2020
Commented: Omar Salah on 18 Jun 2020
Hello Everyone,
I'm trying to replace 7k different substrings with the same tag in a 50-million-word dataset (a cell array of 1 million strings averaging 50 words each). As you can imagine, using replace or regexprep takes a long time. I tried using strrep the same way as replace, but it gives me this error:
Error using strrep
All nonscalar inputs must be the same size.
What is the fastest and least memory-intensive way to do this?
Here is the code:
%using replace
Tag='IMPORTANT'
substr={'very','much'} % a cell array of +7k words
reptag=cell(1,size(substr,2));
tagcell=cellfun(@(x) Tag,reptag,'Uniformoutput',false);
maintext=replace(maintext,substr,tagcell);
% using regexprep
ev='(';
for evi=1:size(substr,2)
ev=[ev substr{evi} '|']; % index into substr so one word is appended per iteration
end
ev=[ev(1:end-1) ')'];
maintext=regexprep(maintext,ev,Tag);
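For reference, the same alternation pattern the loop is meant to build can be assembled without growing a character vector one word at a time. This is only a minimal sketch: the two-word list and the two sentences are hypothetical stand-ins for the real data, and the escaping step is an added assumption that some of the 7k words may contain regex metacharacters.

```matlab
% Hypothetical sample data standing in for the real dataset
Tag = 'IMPORTANT';
substr = {'very','much'};
maintext = {'thank you very much','not so much'};

% Escape regex metacharacters in each word (assumption: some words may contain them)
escaped = cellfun(@(s) regexptranslate('escape', s), substr, 'UniformOutput', false);

% Join all words into a single alternation pattern, e.g. '(very|much)'
ev = ['(' strjoin(escaped, '|') ')'];

% One regexprep call over the whole cell array
maintext = regexprep(maintext, ev, Tag);
```

Like the loop version, this matches substrings anywhere, not only whole words.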
4 comments
Omar Salah on 10 Jun 2020
@james I can actually work with both: either a cell array of character vectors or a cell array of strings. I can convert between them easily. Is one type faster than the other?
Omar Salah on 10 Jun 2020
@stephen I've never worked with C++, but I'm wondering why it would be faster. Is it because the code is compiled, or because C++ functions are generally faster?


Answers (1)

Mohammad Sami on 11 Jun 2020
After some experimentation, I think that if you tokenize your sentences, you can use a hash map to look up the words to replace.
Example code follows. If you want case-insensitive matching, apply the function lower to both the words and the sentences.
substr = cellstr(substr);
w = containers.Map(substr,substr); %create a hashmap of substring you want to replace
m2 = cellstr(sentences);
m5 = cell(length(m2),1);
for i = 1:length(m2)
    m3 = split(m2{i},' '); % tokenize the sentence
    m4 = w.isKey(m3); % lookup which words to replace
    m3(m4) = {'IMPORTANT'}; % replace the words
    m5(i) = join(m3,' '); % store the updated sentence
end
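The case-insensitive variant mentioned above could look roughly like this. It is a sketch under assumptions rather than a tuned implementation: the sample words and sentences are hypothetical, the names wordList, tokens, hit and out are introduced here for illustration, and only the lookup is lower-cased so that words which are not replaced keep their original case.

```matlab
% Hypothetical sample data; replace with the real word list and sentences
substr = {'very','much'};
sentences = {'Thank you VERY much','Not so Much today'};
tag = 'IMPORTANT';

% Lower-case the keys once and drop duplicates before building the map
wordList = unique(lower(substr));
w = containers.Map(wordList, wordList);

out = cell(numel(sentences), 1);
for i = 1:numel(sentences)
    tokens = split(sentences{i}, ' ');   % tokenize on single spaces
    hit = isKey(w, lower(tokens));       % compare in lower case only
    tokens(hit) = {tag};                 % untouched words keep their case
    out(i) = join(tokens, ' ');          % store the updated sentence
end
```

As with the original loop, this only replaces whole space-delimited tokens, so a word followed directly by punctuation (for example 'much,') would not match unless the punctuation is stripped first.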
1 comment
Omar Salah on 18 Jun 2020
Wow! Thanks. That's definitely something to try. I will try it tonight and get back to you :)
