How to extract text from string at the same location, one line above

Question

zhert on 11 Oct 2019


Commented: dpb on 15 Oct 2019

I have a variable number of text files (between 3-8), each between 20,000 and 30,000 lines long (different lengths), and around 400 words to search for. The words have different lengths.

Let's say I have the following text:

xxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxx999xxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxx12345xxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

where xxxxx can be anything other than what I want to search for. I want to check whether the following is true:

That each text file includes '12345'
That for at least one occurrence of '12345' in each file, there is '999'. The end of '999' always coincides with the end of '12345'.

I can determine whether '12345' is in each of the text files using strfind, but strfind only outputs an "index" value for the first character of my search pattern (e.g. 613587). Is there a way to find the line number that "index" value corresponds to, and search one line above for '999'?

I think I saw people recommending that each line for each file be read as a separate string, then search each string independently, but that seems like a lot of work for MATLAB to go through, having to generate close to a hundred thousand strings. Is there a better/more efficient way of achieving this?

Any help would be appreciated!


zhert on 15 Oct 2019

Thanks a ton for everyone's help! For anyone who may have a similar question, I ended up solving it in a rather dumb way.

I did this:

Use regexp to extract a section of the code that matches the pattern, which spans two lines.
Then split the resultant two lines.
Finally, use strfind to find the locations of the two search terms in their own respective lines, and make sure they line up.

A little convoluted (and I'm sure very inefficient), but it got the job done!
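The three steps above could look roughly like the sketch below. The sample text, the exact regular expression, and the variable names are assumptions made for illustration, not the code that was actually used. Because regexp returns non-overlapping matches, a line that contains both terms and sits between two other candidate lines could be missed.

```matlab
% Sample text (assumed): the '999' line directly precedes the '12345' line.
txt = sprintf('%s\n', ...
    'xxxxxxxxxxxxxxxxxx', ...
    'xxxxxxxxxxxxxxxxxxxxxxxxxx999xxxxxxxxxxxxx', ...
    'xxxxxxxxxxxxxxxxxxxxxxxx12345xxx', ...
    'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx');

% Step 1: extract every two-line section where a line containing '999'
% is immediately followed by a line containing '12345'.
pairs = regexp(txt, '[^\n]*999[^\n]*\n[^\n]*12345[^\n]*', 'match');

isOK = false;
for k = 1:numel(pairs)
    % Step 2: split the section into its two lines.
    twoLines = strsplit(pairs{k}, newline);
    % Step 3: compare the end columns of the two search terms.
    end999   = strfind(twoLines{1}, '999')   + 2;   % end column of each '999'
    end12345 = strfind(twoLines{2}, '12345') + 4;   % end column of each '12345'
    if ~isempty(intersect(end999, end12345))
        isOK = true;
        break
    end
end
```

Comparing end columns with intersect, rather than only the first hit on each line, also covers lines in which a term occurs more than once.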

dpb on 15 Oct 2019

The solution provided above seems more straightforward: it first locates just the lines that are possible matches, so no splitting of lines is required, and it only returns the indices of allowable pairs.

One slight enhancement that regexp allows in the solution shown is to eliminate the adjustment of the location in the record to match positions:

isOK=(regexp(s(ix-1),"999",'end'))==(regexp(s(ix),"12345",'end'));


Answer 1

per isakson on 13 Oct 2019


Edited: per isakson on 13 Oct 2019

"Is there a better/more efficient way of achieving this?" No, I don't think so. However, speed depends on how "each line for each file be read as a separate string" is done. (Are strings in an array separate?)

"that seems like a lot of work for MATLAB" Don't guess and don't rely on hearsay. Make a simple test.

I assume that your example is oversimplified and that the script below won't work with the actual files. However, it might help you to estimate execution times.

I made a test file, cssm.txt, with 30,000 lines by copying and modifying lines from your question. It contains only one pair

xxxxxxxxxxxxxxxxxxxxxxxxxx999xxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxx12345xxx

which is at line 15001.
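A test file of this kind could be generated along the following lines. This is only a sketch: the filler line, the temporary-folder location, and the reading that the '12345' line is line 15001 (with the '999' line at 15000) are assumptions, since the original file is not shown.

```matlab
% Build a 30,000-line test file containing a single matching pair.
nLines = 30000;
filler = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";   % assumed filler line
lines  = repmat(filler, nLines, 1);
lines(15000) = "xxxxxxxxxxxxxxxxxxxxxxxxxx999xxxxxxxxxxxxx";    % assumed position
lines(15001) = "xxxxxxxxxxxxxxxxxxxxxxxx12345xxx";

fname = fullfile(tempdir, 'cssm.txt');    % file location is an assumption
fid = fopen(fname, 'wt');
fprintf(fid, '%s\n', lines);              % one array element per output line
fclose(fid);
```

To use it with the timing script, either point the script at fname or write the file into the current folder instead.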

The script below contains two independent solutions and strfind(chr,'12345') for comparison. The elapsed times for the three cases are

Elapsed time is 0.006313 seconds.
Elapsed time is 0.053587 seconds.
Elapsed time is 0.020818 seconds.

on a vanilla desktop with R2018b. The execution time of the second solution (the third timing) is less than four times that of fileread(); strfind();. Eight files and four hundred words should be possible to process in a bit more than one minute (8*400*0.02 s = 64 s). During my test the text file was somewhere in the cache system. The execution time will depend (a little) on whether you have an SSD or a spinning disk.

%%
%#ok<*NASGU>
% Case 1 (baseline): read the whole file and search it as one char vector.
tic 
chr = fileread('cssm.txt'); 
pos = strfind( chr, '12345' );
toc
%%
% Case 2: one string per line; regexp returns the end column of the first match.
tic 
fid = fopen('cssm.txt','rt');
cac = textscan( fid, '%s', 'Delimiter','\n' );
str = reshape( string( cac{1} ), 1,[] );
fclose( fid );
%%
e1  = regexp( str, "999", 'end', 'once' );
e2  = regexp( str, "12345", 'end', 'once' );
is1 = not( cellfun( 'isempty', e1 ) );
is2 = not( cellfun( 'isempty', e2 ) );
%%
% Candidate pairs: line p contains "999" and line p+1 contains "12345".
% All candidates are kept so that the loop can test each of them.
pos = find( is1 & [ is2(2:end), false ] ); 
%
found = false;              
for p = reshape( pos, 1,[] )
    if e1{p}==e2{p+1}       % same end column on both lines
        found = true;
        break
    end
end
toc
%%
% Case 3: contains() selects the candidate lines; regexp runs only on those.
tic 
fid = fopen('cssm.txt','rt');
cac = textscan( fid, '%s', 'Delimiter','\n' );
str = reshape( string( cac{1} ), 1,[] );
fclose( fid );
%%
is1 = contains( str, "999" );
is2 = contains( str, "12345" );
pos = find( is1 & [ is2(2:end), false ] ); 
%
found = false;
for p = reshape( pos, 1,[] )
    if regexp(str(p),"999",'end','once') == regexp(str(p+1),"12345",'end','once')
        found = true;
        break
    end
end
toc

(There are edge cases for which this script will throw errors.)
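One such edge case is a line in which a search term occurs more than once: with 'once', only the first occurrence is compared. A hedged sketch of a variant that compares every occurrence is shown below. The sample lines and variable names are assumptions for illustration, and strfind plus intersect is swapped in for regexp.

```matlab
% Sample data (assumed), one string per line. On line 2 only the second
% "999" ends in the same column as "12345" on line 3.
s = [ "xxxxxxxxxxxxxxxxxx"
      "x999xxxxxxxx999x"
      "xxxxxxxxxx12345x"
      "xxxx" ];
s = reshape(s, 1, []);

has999   = contains(s, "999");
has12345 = contains(s, "12345");

% Line numbers p such that line p has "999" and line p+1 has "12345".
cand = find(has999(1:end-1) & has12345(2:end));

found = false;
for p = cand
    end999   = strfind(s(p),   "999")   + 2;   % end column of every "999"
    end12345 = strfind(s(p+1), "12345") + 4;   % end column of every "12345"
    if ~isempty(intersect(end999, end12345))
        found = true;
        break
    end
end
```

Indexing with 1:end-1 and 2:end also keeps the sketch from failing when the array has fewer than two lines, because cand is then empty and the loop body never runs.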


Answer 2

dpb on 12 Oct 2019


Edited: dpb on 14 Oct 2019

Presuming you have read the file into a string array, s:

>> isOK=contains(s(find(contains(s,"12345"))-1),"999")
isOK =
  logical
   1
>> 

NB: The above will return an empty result if the first search fails; be sure to either wrap the search in a function that handles that case or test for it in the result.
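A minimal sketch of such a guard follows, assuming the sample array below. It also drops a match on the first line, since indexing s(0) would raise an error; that second check is an addition for illustration and not part of the one-liner.

```matlab
% Sample data (assumed): "12345" appears on line 1 (no line above) and line 3.
s = ["xxxx12345"; "xx999xx"; "xxx12345x"];

ix = find(contains(s, "12345"));
ix = ix(ix > 1);                 % a match on line 1 has no line above it

if isempty(ix)
    isOK = false;                % nothing to test, report a plain false
else
    isOK = any(contains(s(ix-1), "999"));
end
```

Using any() reduces the result to a single logical, which matches the "at least one occurrence" requirement in the question.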

ADDENDUM:

There's no reason you can't also search for the "999" and then see if the subsequent line contains the other magic string--

isOK=contains(s(find(contains(s,"999"))+1),"12345")

If you do want to revert to the reading-from-file routine, this is the way to do it there. There's then no need to try to retrieve a previous record: once you find the first string, just scan the subsequent line for the second one. If it is found you're done; if not, continue searching for the first.
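A rough sketch of that reading-from-file loop, using fgetl, might look like this. The sample file contents and the variable names are assumptions; like the one-liners above, it only tests containment and not the column alignment.

```matlab
% Write a small sample file (assumed data) so the sketch is self-contained.
fname = [tempname '.txt'];
fid = fopen(fname, 'wt');
fprintf(fid, '%s\n', 'xxxxxxxx', 'xxxx999xxxx', 'xx12345xx', 'xxxx');
fclose(fid);

found = false;
fid = fopen(fname, 'rt');
cur = fgetl(fid);                       % fgetl returns -1 at end of file
while ischar(cur) && ~found
    if contains(cur, '999')
        nxt = fgetl(fid);               % look only at the line that follows
        if ~ischar(nxt), break, end     % '999' was on the last line
        found = contains(nxt, '12345');
        cur = nxt;                      % nxt may itself contain '999'
    else
        cur = fgetl(fid);
    end
end
fclose(fid);
delete(fname);
```

Reusing the look-ahead line as the next current line means that two consecutive '999' lines are both considered without rereading the file.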

ADDENDUM 2:

Additional requirement of same ending column:

isOK=false;
ix=find(contains(s,"12345"));            % lines containing "12345"
if ~isempty(ix)
  isMaybe=contains(s(ix-1),"999");       % does the line above contain "999"?
  if isMaybe
    % "999" is two characters shorter, so it starts two columns later.
    % Assumes a single matching line; for several, strfind returns a cell array.
    isOK=(strfind(s(ix-1),"999")-2)==(strfind(s(ix),"12345"));
  end
end

Also NB: strfind returns empty when there is no match, so you can't simply AND the two tests. If isMaybe were false, strfind would return an empty result, and the equality test would then leave isOK empty rather than false.

The above also handles the case where no line satisfies the first condition and find returns an empty result. It's kind of implied that you know there is at least one such record, but it never hurts to code defensively.

ADDENDUM 3:

One could replace the adjustment of location that accounts for the length difference in the search strings by using regexp with a two-character wildcard match for the preceding characters. I'll leave that as an "exercise for the student", but will note that, while powerful, regexp does generally carry a performance hit. You would have to test whether the files are large enough to make that an issue if you choose that route.
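One possible reading of that exercise is sketched below for a single pair of lines, using the example lines from the question. The variable names are illustrative. Padding "999" with two wildcard characters gives a five-character pattern, so its start position can be compared directly with that of "12345".

```matlab
% The two example lines from the question.
prev = "xxxxxxxxxxxxxxxxxxxxxxxxxx999xxxxxxxxxxxxx";
cur  = "xxxxxxxxxxxxxxxxxxxxxxxx12345xxx";

% "..999" matches any two characters followed by "999".
start999   = regexp(prev, "..999", 'start', 'once');
start12345 = regexp(cur,  "12345", 'start', 'once');

% With 'once', regexp returns [] when there is no match, so test for that
% before comparing the positions.
isOK = ~isempty(start999) && ~isempty(start12345) && start999 == start12345;
```

As with the other 'once' variants, only the first occurrence on each line is compared.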

zhert on 12 Oct 2019

Thanks for the suggestion. Sorry I wasn't being clear enough: I want to know whether

xxxxxxxxxxxxxxxxxxxxxxxx12345xxx

is immediately preceded by

xxxxxxxxxxxxxxxxxxxxxxxxxx999xxxxxxxxxxxxx

where '12345' ends at the same character position (counted from the beginning of the line) as '999'.

Is this something that maybe regex would be able to solve?

Walter Roberson on 13 Oct 2019

"Is this something that maybe regex would be able to solve?"

Yes, in theory, using dynamic regular expressions. Possible, but not something I would especially recommend.

On the other hand, if you use regexp() to return positions, then it becomes easier to test corresponding positions in another line.
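The same position-based idea also works on the single char vector returned by fileread, which addresses the original question of turning a strfind index into a line number. The sketch below uses strfind rather than regexp, and the sample text stands in for the file contents; all variable names are assumptions.

```matlab
% Stand-in for chr = fileread(...): the four example lines from the question.
chr = sprintf('%s\n', 'xxxxxxxxxxxxxxxxxx', ...
    'xxxxxxxxxxxxxxxxxxxxxxxxxx999xxxxxxxxxxxxx', ...
    'xxxxxxxxxxxxxxxxxxxxxxxx12345xxx', ...
    'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx');

nl        = strfind(chr, newline);     % index of every newline character
lineStart = [1, nl + 1];               % index of the first character of each line

idx    = strfind(chr, '12345');        % absolute start index of each match
lineNo = sum(nl(:) < idx, 1) + 1;      % newlines before a match give its line
col    = idx - lineStart(lineNo) + 1;  % column of the match within its line

isOK = false;
for k = 1:numel(idx)
    if lineNo(k) == 1, continue, end   % no line above the first line
    prevLine = chr(lineStart(lineNo(k)-1) : nl(lineNo(k)-1) - 1);
    endCol   = col(k) + 4;             % end column of '12345'
    if numel(prevLine) >= endCol && strcmp(prevLine(endCol-2:endCol), '999')
        isOK = true;
        break
    end
end
```

For files with Windows line endings, prevLine keeps a trailing carriage return, which does not change the column positions being compared.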
