regexp: extra cell layer in the output

Andrey Kazak on 19 Sep 2012
Edited: Stephen23 on 19 Apr 2023
Greetings!
I am trying to decompose a complex header line into tokens:
>> HeaderLine = 'TEKTRONIX TDS 1012B Project Number:0 Sample Name: Depth: 0.000000';
>> regexp(HeaderLine,'^(.*) Project Number:(.*) Sample Name:(.*) Depth: (.*)$', 'tokens')
ans =
{1x4 cell}
However, ans itself is a 1x1 cell, not the 1x4 cell shown above.
Could you please suggest me a way to force regexp to output a pure 1x4 cell array of tokens?
Thank you in advance!
  2 comments
Azzi Abdelmalek on 19 Sep 2012
Why HeaderLine1? It should be HeaderLine; otherwise your code works.
Andrey Kazak on 19 Sep 2012
Edited: Andrey Kazak on 19 Sep 2012
Sorry, it should read HeaderLine. I have fixed the typo.


Accepted Answer

Andrei Bobrov on 19 Sep 2012
c = regexp(HeaderLine,'^(.*) Project Number:(.*) Sample Name:(.*) Depth: (.*)$', 'tokens');
out = c{:};
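A minimal sketch of what this indexing does, assuming the header line from the question: c is a 1x1 cell (one match), and indexing into it with curly braces pulls out the inner 1x4 cell of tokens. Writing c{1} instead of c{:} is only a stylistic suggestion here; it makes explicit that the first match is the one wanted.

```matlab
% Header line and pattern copied from the question
HeaderLine = 'TEKTRONIX TDS 1012B Project Number:0 Sample Name: Depth: 0.000000';
rgx = '^(.*) Project Number:(.*) Sample Name:(.*) Depth: (.*)$';

c = regexp(HeaderLine, rgx, 'tokens');  % 1x1 cell: one element per match
out = c{1};                             % 1x4 cell: the tokens of the first match

size(c)    % 1 1
size(out)  % 1 4
```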
  1 comment
Andrey Kazak on 19 Sep 2012
Thank you. I understand this. What I don't get is why regexp returns nested cell array, instead of a pure 1x4 cell array of tokens.


More Answers (2)

Stephen23 on 25 Mar 2021
Edited: Stephen23 on 18 Apr 2023
"Could you please suggest me a way to force regexp to output a pure 1x4 cell array of tokens?"
The simple answer to your question is to specify the 'once' option:
str = 'TEKTRONIX TDS 1012B Project Number:0 Sample Name: Depth: 0.000000';
rgx = '^(.*) Project Number:(.*) Sample Name:(.*) Depth: (.*)$';
tkn = regexp(str, rgx, 'tokens', 'once') % specify the ONCE option
tkn = 1×4 cell array
{'TEKTRONIX TDS 1012B'} {'0'} {0×0 char} {'0.000000'}
"What I don't get is why regexp returns nested cell array, instead of a pure 1x4 cell array of tokens."
Tokens are always returned in a cell array (with size equal to the number of tokens requested, in your case 1x4). When you specify that you want 'all' matches (the default, what you used), then regexp will return the output nested inside another cell array (with size equal to the number of matches made, in your case 1 match gives a 1x1 cell array).
Thus what you are getting is a 1x1 cell array containing a 1x4 cell array which contains the tokens themselves.
Selecting the 'once' option, as my answer shows, simply avoids one layer of nested cell array.
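The two levels of nesting are easier to see when the pattern matches more than once. The sketch below uses a made-up input string and pattern (both are assumptions for illustration, not part of the original question): the outer cell has one element per match, and each inner cell has one element per token.

```matlab
str = 'a1 b2 c3';        % hypothetical text containing three matches
rgx = '([a-z])(\d)';     % two tokens per match

tkn_all  = regexp(str, rgx, 'tokens');          % default: all matches
tkn_once = regexp(str, rgx, 'tokens', 'once');  % first match only

size(tkn_all)      % 1 3  (one cell per match)
size(tkn_all{2})   % 1 2  (one cell per token), here {'b','2'}
size(tkn_once)     % 1 2  (tokens of the first match, no outer layer)
```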
  2 comments
David Young on 17 Apr 2023
Unfortunately, the "once" option doesn't affect the output when using the "names" option. So in that case the degree of nesting of the output always depends on whether the inputs are scalars or not (assuming they're strings). This seems to me a very unfortunate legacy from the cell array of character vectors days (and not a good design even then).
Stephen23 on 18 Apr 2023
Edited: Stephen23 on 19 Apr 2023
"Unfortunately, the "once" option doesn't affect the output when using the "names" option."
Actually it does. Let's try it right now:
T = 'hello world';
S1 = regexp(T,'(?<word>\w+)','names','once') % with ONCE option
S1 = struct with fields:
word: 'hello'
S2 = regexp(T,'(?<word>\w+)','names') % with default ALL option
S2 = 1×2 struct array with fields:
word
isequal(S1,S2) % not the same outputs
ans = logical
0
Are they the same? No, they are not: S1 contains one match, whereas S2 contains two matches (exactly what the "once" option is defined to do), so the structure arrays S1 and S2 have different sizes. Let's look at their content:
S1.word % with ONCE option
ans = 'hello'
S2.word % with default ALL option
ans = 'hello'
ans = 'world'
Nope, not the same. We also get the same thing with multiple input text, where the ONCE option may change the sizes of the structures inside the cell array. If you are expecting the outermost cell array to be removed, then you need to provide an explanation of how you would store a non-scalar structure in a scalar array element. So far you have not given any explanation how you imagine that to be possible.
In any case, let's try multiple input text:
T = {'hello world','not a bug'};
C1 = regexp(T,'(?<word>\w+)','names','once') % with ONCE option
C1 = 1×2 cell array
{1×1 struct} {1×1 struct}
C2 = regexp(T,'(?<word>\w+)','names') % with default ALL option
C2 = 1×2 cell array
{1×2 struct} {1×3 struct}
C1{:} % with ONCE option
ans = struct with fields:
word: 'hello'
ans = struct with fields:
word: 'not'
C2{:} % with default ALL option
ans = 1×2 struct array with fields:
word
ans = 1×3 struct array with fields:
word
With the ONCE option, every cell contains a scalar structure:
C1{1}.word
ans = 'hello'
C1{2}.word
ans = 'not'
With the default ALL option, every cell may contain non-scalar structures:
C2{1}.word
ans = 'hello'
ans = 'world'
C2{2}.word
ans = 'not'
ans = 'a'
ans = 'bug'
Note these are also not the same output, so your original statement is demonstrably incorrect.
These outputs are also consistent with the ONCE option on the TOKENS or MATCH outputs (where it removes the innermost nested cell array, but with multiple input text the outermost cell array must remain).
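As a rough illustration of that consistency for the TOKENS output, here is a small sketch with a hypothetical cell array of input text (the text and the pattern are assumptions chosen for the example). The outermost cell array always has the size of the input; ONCE only removes the per-match layer inside each cell.

```matlab
T = {'a1 b2', 'c3'};     % hypothetical multiple input text
rgx = '([a-z])(\d)';     % two tokens per match

tk_all  = regexp(T, rgx, 'tokens');          % default ALL option
tk_once = regexp(T, rgx, 'tokens', 'once');  % ONCE option

size(tk_all)     % 1 2  (same size as T)
tk_all{1}        % 1x2 cell of 1x2 cells: {{'a','1'}, {'b','2'}}
size(tk_once)    % 1 2  (still the same size as T)
tk_once{1}       % 1x2 cell of char: {'a','1'}
```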
In a sense, when called with multiple input text, it is conceptually as if we call REGEXP for each text and store the outputs in cell arrays. With multiple input text every output is stored in a cell array: this is perfectly consistent, required, and the expected behavior (because of potentially differently-sized outputs, as my examples show). Accepting multiple text inputs is a convenience, but the outputs must then be in cell arrays. The ONCE option does not change that for any of the outputs.
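That "one call per text" view can be sketched with CELLFUN. This is only a conceptual equivalent written for illustration, not a description of how REGEXP is implemented internally:

```matlab
T = {'hello world','not a bug'};
rgx = '(?<word>\w+)';

% One call with multiple input text
C_direct = regexp(T, rgx, 'names');

% Conceptual equivalent: one call per text, outputs collected in a cell array
C_loop = cellfun(@(t) regexp(t, rgx, 'names'), T, 'UniformOutput', false);

isequal(C_direct, C_loop)   % expected to be true
```

The 'UniformOutput', false flag is what forces the cell array container, for the same reason REGEXP needs one: the per-text outputs can have different sizes.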
"This seems to me a very unfortunate legacy from the cell array of character vectors days (and not a good design even then)."
I don't see how that is related. Even for string arrays you have not provided any explanation of how it would be possible to return structures of different sizes for the different input text, in a container that is also exactly the same size as the input cell/string array. Basically you would need a simple container array that can contain those structures, whose elements correspond exactly to the same locations as the input text: how are you proposing to do that? (hint: MATLAB uses a cell array).
Let's look at string array input text with the default ALL option:
C3 = regexp(["hello world","not a bug"],"(?<word>\w+)","names") % default ALL option
C3 = 1×2 cell array
{1×2 struct} {1×3 struct}
C3{:}
ans = 1×2 struct array with fields:
word
ans = 1×3 struct array with fields:
word
C3{1}.word
ans = "hello"
ans = "world"
C3{2}.word
ans = "not"
ans = "a"
ans = "bug"
This is exactly the same situation: each of the input texts returns a different number of NAMES, which thus need to be stored in some kind of container array of the same size as the input text...



Andrey Kazak on 19 Sep 2012
One of the examples of regexp usage:
poestr = ['While I nodded, nearly napping, ' ...
'suddenly there came a tapping,'];
[mat, tok, ext] = regexp(poestr, '(\S)\1', 'match', ...
'tokens', 'tokenExtents');
mat
mat =
'dd' 'pp' 'dd' 'pp'
returns a pure 1x4 cell array without an extra cell layer (note that mat holds the matches, not the tokens).
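One thing worth checking in this example: mat is the 'match' output, which is a flat cell array with one element per match. The 'tokens' output tok from the same call is still nested, with one inner cell per match. A short sketch (the variable name flat is introduced here only for illustration) shows the difference and one way to flatten tok when every match has the same number of tokens:

```matlab
poestr = ['While I nodded, nearly napping, ' ...
          'suddenly there came a tapping,'];
[mat, tok] = regexp(poestr, '(\S)\1', 'match', 'tokens');

mat          % 1x4 cell of char: 'dd' 'pp' 'dd' 'pp'
tok          % 1x4 cell, each element a 1x1 cell holding one token
flat = [tok{:}]   % 1x4 cell of char: 'd' 'p' 'd' 'p'
```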
