regexp: extra cell layer in the output
Andrey Kazak
on 19 Sep 2012
Greetings!
I am trying to decompose a complex header line into tokens:
>> HeaderLine = 'TEKTRONIX TDS 1012B Project Number:0 Sample Name: Depth: 0.000000';
>> regexp(HeaderLine,'^(.*) Project Number:(.*) Sample Name:(.*) Depth: (.*)$', 'tokens')
ans =
{1x4 cell}
However, ans itself is a 1x1 cell, not the 1x4 cell displayed above.
Could you please suggest a way to force regexp to output a pure 1x4 cell array of tokens?
Thank you in advance!
Accepted Answer
Andrei Bobrov
on 19 Sep 2012
c = regexp(HeaderLine,'^(.*) Project Number:(.*) Sample Name:(.*) Depth: (.*)$', 'tokens'); % 1x1 cell holding the 1x4 cell of tokens
out = c{:}; % unwrap the single match to get the 1x4 cell
More Answers (2)
Stephen23
on 25 Mar 2021
Edited: Stephen23
on 18 Apr 2023
"Could you please suggest me a way to force regexp to output a pure 1x4 cell array of tokens?"
The simple answer to your question is to specify the 'once' option:
str = 'TEKTRONIX TDS 1012B Project Number:0 Sample Name: Depth: 0.000000';
rgx = '^(.*) Project Number:(.*) Sample Name:(.*) Depth: (.*)$';
tkn = regexp(str, rgx, 'tokens', 'once') % specify the ONCE option
"What I don't get is why regexp returns nested cell array, instead of a pure 1x4 cell array of tokens."
Tokens are always returned in a cell array (with size equal to the number of tokens requested, in your case 1x4). When you specify that you want 'all' matches (the default, what you used), then regexp will return the output nested inside another cell array (with size equal to the number of matches made, in your case 1 match gives a 1x1 cell array).
Thus what you are getting is a 1x1 cell array containing a 1x4 cell array which contains the tokens themselves.
Selecting the 'once' option, as my answer shows, simply avoids one layer of nested cell array.
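To make the two layers visible, here is a small sketch that compares the sizes returned with and without 'once'. The variable names tknAll, tknOnce and multi, and the second test text 'a1 b2', are illustrative assumptions and not part of the original question.

```matlab
str = 'TEKTRONIX TDS 1012B Project Number:0 Sample Name: Depth: 0.000000';
rgx = '^(.*) Project Number:(.*) Sample Name:(.*) Depth: (.*)$';

% Default ALL option: outer cell has one element per match,
% each element is a cell with one element per token.
tknAll = regexp(str, rgx, 'tokens');
size(tknAll)      % 1 1  (one match)
size(tknAll{1})   % 1 4  (four tokens in that match)

% ONCE option: only the first match is returned, so the outer layer is dropped.
tknOnce = regexp(str, rgx, 'tokens', 'once');
size(tknOnce)     % 1 4

% Hypothetical text with two matches, to show why the outer layer exists.
multi = regexp('a1 b2', '(\w)(\d)', 'tokens');
size(multi)       % 1 2  (two matches, each a 1x2 cell of tokens)
```

The outer layer only looks redundant in the question because the anchored pattern can match at most once.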
2 Comments
David Young
on 17 Apr 2023
Unfortunately, the "once" option doesn't affect the output when using the "names" option. So in that case the degree of nesting of the output always depends on whether the inputs are scalars or not (assuming they're strings). This seems to me a very unfortunate legacy from the cell array of character vectors days (and not a good design even then).
Stephen23
on 18 Apr 2023
Edited: Stephen23
on 19 Apr 2023
"Unfortunately, the "once" option doesn't affect the output when using the "names" option."
Actually it does. Let's try it right now:
T = 'hello world';
S1 = regexp(T,'(?<word>\w+)','names','once') % with ONCE option
S2 = regexp(T,'(?<word>\w+)','names') % with default ALL option
isequal(S1,S2) % not the same outputs
Are they the same? No, they are not: S1 contains one match, whereas S2 contains two matches (exactly what the "once" option is defined to do), so the structure arrays S1 and S2 have different sizes. Let's look at their content:
S1.word % with ONCE option
S2.word % with default ALL option
Nope, not the same. We also get the same thing with multiple input text, where the ONCE option may change the sizes of the structures inside the cell array. If you are expecting the outermost cell array to be removed, then you need to provide an explanation of how you would store a non-scalar structure in a scalar array element. So far you have not given any explanation how you imagine that to be possible.
In any case, let's try multiple input text:
T = {'hello world','not a bug'};
C1 = regexp(T,'(?<word>\w+)','names','once') % with ONCE option
C2 = regexp(T,'(?<word>\w+)','names') % with default ALL option
C1{:} % with ONCE option
C2{:} % with default ALL option
With the ONCE option, every cell contains a scalar structure:
C1{1}.word
C1{2}.word
With the default ALL option, every cell may contain non-scalar structures:
C2{1}.word
C2{2}.word
Note these are also not the same output, so your original statement is demonstrably incorrect.
These outputs are also consistent with the ONCE option on the TOKENS or MATCH outputs (where it removes the innermost nested cell array, but with multiple input text the outermost cell array must remain).
In a sense, when called with multiple input text, it is conceptually as if we call REGEXP for each text and store the output/s in cell array/s. With multiple input text every output is stored in a cell array: this is perfectly consistent, required, and the expected behavior (because of potentially differently-sized outputs, as my examples show). Accepting multiple text inputs is a convenience, but the outputs must then be in cell arrays. The ONCE option does not change that for any of the outputs.
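That "one call per text" picture can be sketched with CELLFUN. This is only an illustration of the concept, not a claim about how REGEXP is implemented internally; the names rgx, Cvec, Cloop and Conce are assumptions introduced for this example.

```matlab
T = {'hello world','not a bug'};
rgx = '(?<word>\w+)';

% Vectorized call: one cell per input text.
Cvec = regexp(T, rgx, 'names');

% Conceptual equivalent: call REGEXP once per text and collect the results
% in a cell array, because the per-text outputs differ in size.
Cloop = cellfun(@(t) regexp(t, rgx, 'names'), T, 'UniformOutput', false);

cellfun(@numel, Cvec)    % 2 3  (two words, then three words)
cellfun(@numel, Cloop)   % 2 3

% With ONCE each cell holds a scalar structure, but the outer cell remains.
Conce = regexp(T, rgx, 'names', 'once');
cellfun(@numel, Conce)   % 1 1
```

The 'UniformOutput', false setting is needed for exactly the reason given above: the per-text results have different sizes and so cannot be concatenated into a plain array.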
"This seems to me a very unfortunate legacy from the cell array of character vectors days (and not a good design even then)."
I don't see how that is related. Even for string arrays you have not provided any explanation of how it would be possible to return structures of different sizes for the different input text, in a container that is also exactly the same size as the input cell/string array. Basically you would need a simple container array that can contain those structures, whose elements correspond exactly to the same locations as the input text: how are you proposing to do that? (hint: MATLAB uses a cell array).
Let's look at string array input text with the default ALL option:
C3 = regexp(["hello world","not a bug"],"(?<word>\w+)","names") % default ALL option
C3{:}
C3{1}.word
C3{2}.word
There is exactly the same situation: each of the input texts returns a different number of NAMES, which thus need to be stored in some kind of container array of the same size as the input text...