Most efficient way to search in text arrays

18 visualizaciones (últimos 30 días)
Uwe Ehret
Uwe Ehret el 18 de Abr. de 2017
Respondida: Uwe Ehret el 22 de Abr. de 2017
Dear All,
I have a large array of textual information (model element IDs). In this array, I frequently have to find the index of particular IDs. What is the computationally most efficient way to do this? Store the IDs in a character array, cell array of characters or string array? What is the most efficient way for indexing when working with text?
Thanks for your help! Uwe

Respuesta aceptada

Alain Kuchta
Alain Kuchta el 21 de Abr. de 2017
Editada: Alain Kuchta el 21 de Abr. de 2017
Here are two possible approaches to accomplish this:
1) Use strcmp and find with string array
This option is O(n) for each lookup; in the worst case, every string in ids will be checked in order to find query. You can also use a cell array of character vectors for ids, in my test a string array was slightly faster.
>> ids = ["M1","M2", "M3"];
>> query = "M2";
>> index = find(strcmp(ids, query) == 1)
index =
2
2) Use containers.Map with char arrays as keys and indices as values
This option is O(n) to setup, but O(1) for each lookup. Meaning that regardless of how many ids are in your map, looking up each one will take the same amount of time.
>> ids = {'M1', 'M2', 'M3'};
>> indices = 1:length(ids);
>> idMap = containers.Map(ids, indices);
>> query = 'M2';
>> index = idMap(query)
index =
2
Here is a performance comparison. At each size increment the average time to compute 500 random queries was measured for each approach. Each approach used the same set of queries at each size increment. In my case, for less than ~1000 elements, find with strcmp is faster. But as the number of elements grows, containers.Map is the clear winner.
  1 comentario
Walter Roberson
Walter Roberson el 21 de Abr. de 2017
The O(1) reference sounds as if containers.Map is using hashing -- which is a possibility but not one I see documented ?
True O(1) would tend to imply that it is using Perfect Hash, as regular hashes that can have collisions would have an O(n) or O(ln(n)) or O(n*ln(n)) term for the worst case as the table fills up.

Iniciar sesión para comentar.

Más respuestas (1)

Uwe Ehret
Uwe Ehret el 22 de Abr. de 2017
Dear Alain,
Thanks for your very helpful reply! The containers.Map option was new to me, but offers many very elegant and time-saving improvements to the program I am working on.
Uwe

Categorías

Más información sobre Dictionaries en Help Center y File Exchange.

Etiquetas

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by