Using regexp to extract labels

Question

Dhani Dharmaprani on 27 Dec 2016


https://es.mathworks.com/matlabcentral/answers/318277-using-regexp-to-extract-labels

Edited: Stephen23 on 16 Jan 2017

Accepted Answer: Stephen23

test.txt


Hi,

I have a file (attached) which includes some header information I am interested in. Specifically, I would like to extract the various channel labels, but can't quite correct the expression in order to obtain all of the label names in the file and nothing else. I have tried

labels = regexp(filetext, '((?<=Label:)(\s*\w*){2}\D*\d*)+', 'match');

although this doesn't quite work due to the first two channel labels being in a different format to the rest. If anyone can offer advice so that I can fix my expression to work for the first two channel labels also, that would be greatly appreciated.

Thank you kindly in advance!


Answer 1

Stephen23 on 27 Dec 2016


https://es.mathworks.com/matlabcentral/answers/318277-using-regexp-to-extract-labels#answer_248604

Edited: Stephen23 on 27 Dec 2016


There is no need to be so specific about the format of the data field. It would be enough to identify the 'Label:' and anything that is not a newline, something like this:

>> str = fileread('test.txt');
>> C = regexp(str,'(?<=Label:\s*)[^\n]*','match');
>> C{:}
ans =  I
ans =  II
ans =  A 1-2
ans =  A 3-4
ans =  A5-6
ans =  A 7-8
ans =  B 1-2
ans =  B 3-4
ans =  B 5-6
ans =  B 7-8
ans =  C 1-2
ans =  C 3-4
ans =  C 5-6
ans =  C 7-8
ans =  D 1-2
ans =  D 3-4
ans =  D 5-6
ans =  D 7-8
ans =  CS 5-6
ans =  E 1-2
ans =  E 3-4
ans =  E 5-6
ans =  E 7-8
ans =  F 1-2
ans =  F 3-4
ans =  F 5-6
ans =  F 7-8
ans =  G 1-2
ans =  G 3-4
ans =  G 5-6
ans =  G 7-8
ans =  H 1-2
ans =  H 3-4
... etc
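
The results printed above appear to keep the blank that follows the colon, because the lookbehind is already satisfied directly after 'Label:'. A minimal sketch of one way to tidy this, assuming a short made-up string in place of the attached test.txt and assuming the file might use Windows line endings:

```matlab
% Hypothetical sample text standing in for test.txt (an assumption,
% not the real file contents).
str = sprintf('Channel #: 1\r\nLabel: I\r\nChannel #: 2\r\nLabel: A 1-2\r\n');

% [^\r\n]* stops at either line-ending character, so a carriage
% return is never included in the match.
C = regexp(str,'(?<=Label:\s*)[^\r\n]*','match');

% Remove any leading or trailing whitespace from every match.
C = strtrim(C);
```

strtrim accepts a cell array of character vectors, so no loop is needed.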

Or if you want to match that format, something like this:

>> C = regexp(str,'(?<=Label: *)[A-Z]+[\-\d ]*','match'); C{:}
ans = I
ans = II
ans = A 1-2
ans = A 3-4
ans = A5-6
ans = A 7-8
ans = B 1-2
ans = B 3-4
ans = B 5-6
ans = B 7-8
ans = C 1-2
ans = C 3-4
ans = C 5-6
...etc

If you want to experiment with regular expressions, then try my FEX submission that lets you quickly change and test regular expressions in an interactive figure:

https://www.mathworks.com/matlabcentral/fileexchange/48930-interactive-regular-expression-maker


Dhani Dharmaprani on 16 Jan 2017

Hi Stephen,

Thank you so much for your previous help with this! I was wondering if you perhaps know how to optimise this process so that it runs faster, as I am currently running this on a larger .txt file and the processing time is incredibly slow (I have left it running for 3 days straight and it still hasn't finished).

Thank you kindly in advance for any help you can offer!

Kind regards, Dhani

Stephen23 on 16 Jan 2017

Edited: Stephen23 on 16 Jan 2017


To optimize you could start by getting rid of the lookaround operation: these are always slow.

 C = regexp(str,'(Label: *)([A-Z]+[\-\d ]*)','tokens');
 vertcat(C{:})
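
Note that the expression above has two capture groups, so each match returns both the 'Label: ' prefix and the label text. A possible variant, sketched here on a made-up sample string (an assumption, not the real file), captures only the label:

```matlab
% Hypothetical sample text standing in for test.txt.
str = sprintf('Label: I\nGain: 1\nLabel: A 1-2\nLabel: A5-6\n');

% Only the label is wrapped in parentheses, so each match yields one token.
T = regexp(str,'Label: *([A-Z]+[\-\d ]*)','tokens');

% Each element of T is a 1-by-1 cell; stack them into an N-by-1 cell array.
labels = vertcat(T{:});
```

With a single token per match, the vertcat call gives one column of labels rather than two columns.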

But the most efficient solution is likely to avoid regexp entirely and use one of the file import tools (such as textscan) to read the file data into cell arrays, and then use a fast strcmp to locate the desired rows:

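opt = {'HeaderLines',1, 'Delimiter','\n'};   % skip the first line; one record per line
fid = fopen('test.txt','rt');
C = textscan(fid,'%[^:]: %[^\n]', opt{:});   % split each line into "key: value"
fclose(fid);
data = C{1}{end};                            % last entry of the key column, which has no paired value
C = horzcat(C{1}(1:end-1),C{2});             % N-by-2 cell array: {key, value}
idx = strcmpi('Label',C(:,1));               % rows whose key is 'Label' (case-insensitive)
out = C(idx,2)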

which creates this output:

 out = 
    'I'
    'II'
    'A 1-2'
    'A 3-4'
    'A5-6'
    'A 7-8'
    'B 1-2'
    'B 3-4'
    'B 5-6'
    'B 7-8'
    'C 1-2'
    'C 3-4'
    'C 5-6'
    'C 7-8'
    'D 1-2'
    'D 3-4'
    'D 5-6'
    'D 7-8'
    'CS 5-6'
    'E 1-2'
    'E 3-4'
    'E 5-6'
    'E 7-8'
    'F 1-2'
    'F 3-4'
    'F 5-6'
    'F 7-8'
    'G 1-2'
    'G 3-4'
    'G 5-6'
    'G 7-8'
    'H 1-2'
    'H 3-4'
    'H 5-6'
    'H 7-8'
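
The same idea of comparing strings instead of running a regular expression can also be sketched on text that is already in memory. This is only an illustration under assumptions: the sample string is made up, and the variable names lines, isLabel and out2 are hypothetical.

```matlab
% Hypothetical sample text standing in for test.txt.
str = sprintf('Channel #: 1\nLabel: I\nChannel #: 2\nLabel: A 1-2\n');

% Split the text into lines at the newline character.
lines = strsplit(str, char(10));

% Compare only the first six characters of each line with 'Label:'.
isLabel = strncmp(lines, 'Label:', 6);

% Keep what follows the prefix and trim the surrounding whitespace.
out2 = strtrim(cellfun(@(s) s(7:end), lines(isLabel), 'UniformOutput', false));
```

strncmp compares a fixed-length prefix only, so no pattern matching is involved.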


Answer 2

José-Luis on 27 Dec 2016


https://es.mathworks.com/matlabcentral/answers/318277-using-regexp-to-extract-labels#answer_248606


filetext = fileread('test.txt');
expr = '(?<=Label:\s+)([\w-\s]+)(?=\n)';
hits = regexp(filetext, expr, 'match')
