Why is my regular expression always greedy?

zhert
zhert on 16 Oct 2019
Commented: zhert on 16 Oct 2019
I have the following string, read into MATLAB:
*aaa
$bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
1111111111111111111111
222222222222
3333333333333333333333333
4444444444555556666666
777777788899999
*ddd
$11111111111111111111111111111111
222222222222222abcdf
99999999999
*abcde99999
$eeeeeeeeeeeeeeeeeeeeee
I would like to perform a search that only extracts the text between *aaa and *ddd, using the following regexp pattern:
pattern = '(?<=\*aaa\s)(.*|\n)*?(?=\*)';
I expected the middle (.*|\n)*? to match the minimum number of "either any character other than linebreak, or a linebreak" that sits between *aaa and the closest * symbol, at *ddd. Instead, MATLAB returns the following:
$bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
1111111111111111111111
222222222222
3333333333333333333333333
4444444444555556666666
777777788899999
$11111111111111111111111
*ddd
$11111111111111111111111111111111
222222222222222abcdf
99999999999
Instead of stopping at just before *ddd, regexp continued until just before *abcde99999, despite the presence of the "?" at the end of the middle section of the pattern.
Just to make sure this isn't a lookaround issue, I also tried running
pattern = '\*(.*|\n)*?\*';
And sure enough, I get the following, with the *ddd in the middle being skipped entirely:
*aaa
$bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
1111111111111111111111
222222222222
3333333333333333333333333
4444444444555556666666
777777788899999
$11111111111111111111111
*ddd
$11111111111111111111111111111111
222222222222222abcdf
99999999999
*
Interestingly, when I tried this pattern on two online regex testers, I got the expected result. Is there a reason why MATLAB's regex implementation stays greedy even with a "?" at the end? Any help would be appreciated!

Accepted Answer

Guillaume
Guillaume on 16 Oct 2019
MATLAB's regex engine has the peculiarity that . also matches \n by default, whereas most other engines don't. So the greedy .* inside your capturing group also consumes all the newlines: it runs to the end of the string and backtracks only as far as the last *, and the second half of the alternation never gets a chance to match anything. That behaviour can be turned off, and if you do, you get the result you expected:
regexp(yourstring, pattern, 'match', 'dotexceptnewline')
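To see the difference side by side, here is a minimal sketch using a shortened stand-in for the string in the question. The sample data and variable names are made up for illustration and are not from the original post.

```matlab
% Shortened stand-in for the string in the question (hypothetical sample data).
str = sprintf('*aaa\n$bbb\n111\n*ddd\n$222\n333\n*abcde99999\n$eee');
pattern = '(?<=\*aaa\s)(.*|\n)*?(?=\*)';

% Default behaviour: '.' also matches newline, so the inner greedy .* runs to
% the end of the string and backtracks only as far as the LAST '*'.
greedyLooking = regexp(str, pattern, 'match', 'once');

% With 'dotexceptnewline', .* stops at each line break, so the lazy outer
% quantifier can end the match at the FIRST '*' after *aaa.
lazyAsExpected = regexp(str, pattern, 'match', 'once', 'dotexceptnewline');

disp(greedyLooking)
disp(lazyAsExpected)
```

On this sample, the first call should return everything up to just before *abcde99999, and the second only the lines between *aaa and *ddd.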
However, I don't understand why you need the alternation in the first place. A simpler pattern that achieves the same result would be:
regexp(yourstring, '(?<=\*aaa\s)[^*]*(?=\*)', 'match') % 'dotall' vs 'dotexceptnewline' doesn't matter for this one, since [^*] also matches newline.
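As a rough illustration on a shortened sample string (made-up data and variable names, not from the thread), the negated class should give the same block with either dot setting, and the result can then be split into lines:

```matlab
% Shortened stand-in for the string in the question (hypothetical sample data).
str = sprintf('*aaa\n$bbb\n111\n*ddd\n$222\n333\n*abcde99999\n$eee');
simplePattern = '(?<=\*aaa\s)[^*]*(?=\*)';

% [^*] matches any character except '*', including newline, so the
% dot-related option has no effect on this pattern.
blockDefault = regexp(str, simplePattern, 'match', 'once');
blockNoDotNl = regexp(str, simplePattern, 'match', 'once', 'dotexceptnewline');
sameResult = isequal(blockDefault, blockNoDotNl);

% Drop the trailing line break, then split the block into a cell array of lines.
blockLines = strsplit(strtrim(blockDefault), newline);

disp(sameResult)
disp(blockLines)
```

Because [^*]* cannot step over a *, the match necessarily ends at the first * after *aaa, with no lazy quantifier involved.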
  1 Comment
zhert
zhert on 16 Oct 2019
Thanks for the explanation! If I were to use [^\*], I suppose I don't even really need the (?=\*), right?
Another related question, why is it that the pattern works with "\*aaa\s", but not with "\*aaa\n"?
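Neither follow-up question is answered in the thread, so the sketch below is only a way to investigate them, not a confirmed explanation. It uses made-up sample strings; in particular, the CRLF variant is an assumption (one possible reason \n could fail where \s succeeds is a \r sitting directly after *aaa, as in files with Windows line endings).

```matlab
% Hypothetical samples; the CRLF variant is an assumption for illustration only.
strLF   = sprintf('*aaa\n$bbb\n111\n*ddd\n$222');
strCRLF = sprintf('*aaa\r\n$bbb\r\n111\r\n*ddd\r\n$222');

% 1) With [^*]* the lookahead is redundant when a later '*' exists ...
withLook    = regexp(strLF, '(?<=\*aaa\s)[^*]*(?=\*)', 'match', 'once');
withoutLook = regexp(strLF, '(?<=\*aaa\s)[^*]*',       'match', 'once');
sameHere = isequal(withLook, withoutLook);

% ... but without it the pattern also matches when no '*' follows at all.
tail = sprintf('*aaa\n$bbb\n111');
noStarWith    = regexp(tail, '(?<=\*aaa\s)[^*]*(?=\*)', 'match', 'once');
noStarWithout = regexp(tail, '(?<=\*aaa\s)[^*]*',       'match', 'once');

% 2) Inspect the character codes right after '*aaa': 10 is \n, 13 is \r.
idx = strfind(strCRLF, '*aaa');
codes = double(strCRLF(idx(1)+4 : idx(1)+5));

% On the CRLF sample, a \n lookbehind finds nothing while \s does,
% because \s also matches the \r that directly follows *aaa.
hitN = ~isempty(regexp(strCRLF, '(?<=\*aaa\n)[^*]*', 'once'));
hitS = ~isempty(regexp(strCRLF, '(?<=\*aaa\s)[^*]*', 'once'));

disp(sameHere)
disp(codes)
disp([hitN hitS])
```

Running the same strfind/double check on the real string would show which characters actually follow *aaa there.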

Release

R2019b
