How to extract text from string at the same location, one line above

I have a variable number of text files (between 3-8), each between 20,000 and 30,000 lines long (different lengths), and around 400 words to search for. The words have different lengths.
Let's say I have the following text:
xxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxx999xxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxx12345xxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
where xxxxx can be anything other than what I want to search for. I want to check whether the following is true:
  1. That each text file includes '12345'
  2. That for at least one occurrence of '12345' in each file, there is '999'. The end of '999' always coincides with the end of '12345'.
I can determine whether '12345' is in each of the text files using strfind, but strfind only outputs an "index" value for the first character of my search pattern (e.g. 613587). Is there a way to find the line number that this "index" value corresponds to, and then search one line above for '999'?
I think I saw people recommending that each line for each file be read as a separate string, then search each string independently, but that seems like a lot of work for MATLAB to go through, having to generate close to a hundred thousand strings. Is there a better/more efficient way of achieving this?
Any help would be appreciated!
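A minimal sketch of the index-to-line-number mapping asked about above, assuming the whole file is held in one char vector with '\n' line endings (as fileread would return it). The names nl, lineNo, lineStart and prevLine are illustrative only and do not come from the thread; the four-line text stands in for a real file.

```matlab
% Stand-in for chr = fileread('somefile.txt'); four lines, each ending in \n
chr = sprintf('%s\n', 'aaaaaaa', 'xx999yy', '12345zz', 'bbbbbbb');

idx = strfind(chr, '12345');            % character index of the first hit
nl  = find(chr == newline);             % positions of all line breaks
lineNo    = sum(nl < idx(1)) + 1;       % line that contains the hit
lineStart = [1, nl(1:end-1) + 1];       % first character of every line

% Text of the line directly above (assumes the hit is not on line 1)
prevLine = chr(lineStart(lineNo-1) : nl(lineNo-1) - 1);

% Column (counted from the start of the line) where each pattern ends
endCol12345 = idx(1) - lineStart(lineNo) + numel('12345');
endCol999   = strfind(prevLine, '999') + numel('999') - 1;
aligned     = any(endCol999 == endCol12345);
```

Only one vector of newline positions is built per file, so no per-line strings are created. With Windows line endings the trailing '\r' stays at the end of prevLine, which does not shift the column positions.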
  6 comments
zhert on 15 Oct 2019
Thanks a ton for everyone's help! For anyone who may have a similar question, I ended up solving it in a rather dumb way.
I did this:
  1. Use regexp to extract a section of the code that matches the pattern, which spans two lines.
  2. Then split the resultant two lines.
  3. Finally, use strfind to find the locations of the two search terms in their own respective lines, and make sure they line up.
A little convoluted (and I'm sure very inefficient), but it got the job done!
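The three steps above might look roughly like the following. This is a sketch under assumptions, not zhert's actual code: the pattern, the sample text and the variable names are made up for illustration, and 'once' means only the first candidate pair in the text is inspected.

```matlab
% Stand-in for the file contents read with fileread
chr = sprintf('%s\n', 'aaaaaaa', 'xx999yy', '12345zz', 'bbbbbbb');

% 1. Extract two consecutive lines: one with 999, the next with 12345.
%    [^\n] is used instead of . because . also matches a newline in MATLAB.
pair = regexp(chr, '[^\n]*999[^\n]*\n[^\n]*12345[^\n]*', 'match', 'once');

found = false;
if ~isempty(pair)
    % 2. Split the match into its two lines
    parts = strsplit(pair, newline);
    % 3. Compare the end columns of the two search terms
    end999   = strfind(parts{1}, '999')   + numel('999')   - 1;
    end12345 = strfind(parts{2}, '12345') + numel('12345') - 1;
    found = ~isempty(intersect(end999, end12345));
end
```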
dpb on 15 Oct 2019
The solution provided above seems more straightforward in locating just the lines that are possible matches first, eliminating any splitting of lines being required--it only returns the indices of allowable pairs.
One slight enhancement that regexp allows in the solution shown eliminates the adjustment of the location in the record to match positions...
isOK=(regexp(s(ix-1),"999",'end'))==(regexp(s(ix),"12345",'end'));


Accepted Answer

per isakson on 13 Oct 2019
Edited: per isakson on 13 Oct 2019
"Is there a better/more efficient way of achieving this?" No, I don't think so. However, speed depends on how "each line for each file be read as a separate string" is done. (Are strings in an array separate?)
"that seems like a lot of work for MATLAB" Don't guess and don't rely on hearsay. Make a simple test.
I assume that your example is oversimplified and that the script below won't work with the actual files. However, it might help you to estimate execution times.
I made a test file, cssm.txt, with 30,000 lines by copying and modifying lines from your question. It contains only one pair
xxxxxxxxxxxxxxxxxxxxxxxxxx999xxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxx12345xxx
which is at line 15001.
The script below contains two independent solutions and strfind(chr,'12345') for comparison. The elapsed times for the three cases are
Elapsed time is 0.006313 seconds.
Elapsed time is 0.053587 seconds.
Elapsed time is 0.020818 seconds.
on a vanilla desktop and R2018b. The execution time of the second solution is less than four times that of fileread(); strfind();. Eight files and four hundred words should be possible to process in a bit more than one minute (8*400*0.02 = 64 seconds). During my test the text file was somewhere in the cache system. The execution time will depend (a little) on whether you have an SSD or a spinning disk.
%%
%#ok<*NASGU>
tic
chr = fileread('cssm.txt');
pos = strfind( chr, '12345' );
toc
%%
tic
fid = fopen('cssm.txt','rt');
cac = textscan( fid, '%s', 'Delimiter','\n' );
str = reshape( string( cac{1} ), 1,[] );
fclose( fid );
%%
e1 = regexp( str, "999", 'end', 'once' );
e2 = regexp( str, "12345", 'end', 'once' );
is1 = not( cellfun( 'isempty', e1 ) );
is2 = not( cellfun( 'isempty', e2 ) );
%%
pos = find( is1 & [ is2(2:end), false ] );   % all candidate line pairs, not only the first
%
found = false;
for p = reshape( pos, 1,[] )
    if e1{p}==e2{p+1}
        found = true;
        break
    end
end
toc
%%
tic
fid = fopen('cssm.txt','rt');
cac = textscan( fid, '%s', 'Delimiter','\n' );
str = reshape( string( cac{1} ), 1,[] );
fclose( fid );
%%
is1 = contains( str, "999" );
is2 = contains( str, "12345" );
pos = find( is1 & [ is2(2:end), false ] );   % all candidate line pairs, not only the first
%
found = false;
for p = reshape( pos, 1,[] )
    if regexp(str(p),"999",'end','once') == regexp(str(p+1),"12345",'end','once')
        found = true;
        break
    end
end
toc
(There are edge cases for which this script will throw errors.)

More Answers (1)

dpb on 12 Oct 2019
Edited: dpb on 14 Oct 2019
Presuming you have read the file into a string array, s,
>> isOK=contains(s(find(contains(s,"12345"))-1),"999")
isOK =
logical
1
>>
NB: The above will return an empty result if the first search fails; be sure to either wrap the search in a function that handles that case or test for it in the result.
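One way to handle that case, as a sketch rather than the only option: compare shifted logical masks instead of indexing with find, so that a failed search (or a hit on the very first line) simply yields false. The helper name hasPair and the sample arrays are hypothetical.

```matlab
% true if some line containing "999" is immediately followed by a line
% containing "12345"; returns false (not empty) when nothing matches
hasPair = @(s) any( contains(s(1:end-1), "999") & contains(s(2:end), "12345") );

sHit   = ["aaaaaaa"; "xx999yy"; "12345zz"; "bbbbbbb"];
sMiss  = ["aaaaaaa"; "bbbbbbb"];
sFirst = ["12345zz"; "xx999yy"];      % 12345 on line 1, nothing above it

okHit   = hasPair(sHit);
okMiss  = hasPair(sMiss);
okFirst = hasPair(sFirst);
```

This only checks adjacency, like the one-liner above; the column alignment still has to be tested separately.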
ADDENDUM:
There's no reason you can't also search for the "999" and then see if the subsequent line contains the other magic string--
isOK=contains(s(find(contains(s,"999"))+1),"12345")
If you do want to revert to the reading-from-file routine, this is the way to do it there. Then there's no need to try to retrieve a previous record: just scan for the second matching string in the subsequent line once you find the first. If it is found you're done; if not, continue searching for the first.
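A rough sketch of such a line-by-line scan, assuming fgetl is used for reading. The temporary file, its contents and the variable names are placeholders for illustration; the previous line is kept in prev so nothing ever has to be re-read.

```matlab
% Build a small throwaway test file (stand-in for one of the real files)
fname = [tempname '.txt'];
fid = fopen(fname, 'wt');
fprintf(fid, '%s\n', 'aaaaaaa', 'xx999yy', '12345zz', 'bbbbbbb');
fclose(fid);

fid   = fopen(fname, 'rt');
found = false;
prev  = '';                       % line read in the previous iteration
line  = fgetl(fid);               % fgetl returns -1 (not char) at end of file
while ischar(line) && ~found
    % a pair is a "999" line followed directly by a "12345" line
    found = contains(prev, '999') && contains(line, '12345');
    prev  = line;
    line  = fgetl(fid);
end
fclose(fid);
delete(fname);
```

The loop stops at the first adjacent pair; the same-column requirement would be an extra test inside the loop.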
ADDENDUM 2:
Additional requirement of same ending column:
isOK=false;
ix=find(contains(s,"12345"));
if ~isempty(ix)
  isMaybe=contains(s(ix-1),"999");
  if isMaybe
    isOK=(strfind(s(ix-1),"999")-2)==(strfind(s(ix),"12345"));
  end
end
Also NB: a match failure in strfind returns empty, so you can't just AND the two tests: if isMaybe were false, the strfind on the previous line would return empty, and the equality test would then leave an empty result in isOK.
The above also takes care of the case there are no lines satisfying the first condition of find returning an empty result. It's kinda' implied you know there is at least one such record, but never hurts to code defensively.
ADDENDUM 3:
One could replace the adjustment of location to account for the length difference in the search strings by using regexp with a two-character wildcard match for the preceding characters. I'll leave as "exercise for the student" but will note that while powerful, regexp does generally have a performance hit -- would have to test to see if the size of files is such as to make an issue or not if choose that route.
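For what it's worth, one possible reading of that exercise is sketched below. It is an assumption about what was meant, not dpb's own solution: the pattern "..999" starts two characters before "999", so its start column can be compared directly with the start of "12345". The two sample lines are made up, and 'once' only looks at the first occurrence in each line.

```matlab
prevLine = "xx999yy";             % line above
thisLine = "12345zz";             % line holding the main search term

% Start of "..999" is (start of "999") - 2, which is where "12345" must begin
p1 = regexp(prevLine, "..999", 'start', 'once');
p2 = regexp(thisLine, "12345", 'start', 'once');

% Guard against empty results before comparing
isOK = ~isempty(p1) && ~isempty(p2) && p1 == p2;
```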
  3 comments
zhert on 12 Oct 2019
Thanks for the suggestion. Sorry I wasn't being clear enough: I want to know whether
xxxxxxxxxxxxxxxxxxxxxxxx12345xxx
is immediately preceded by
xxxxxxxxxxxxxxxxxxxxxxxxxx999xxxxxxxxxxxxx
where '12345' ends at the same character position (counted from the beginning of the line) as '999'.
Is this something that maybe regex would be able to solve?
Walter Roberson on 13 Oct 2019
"Is this something that maybe regex would be able to solve?"
Yes, in theory, using dynamic regular expressions. Possible, but not something I would especially recommend.
On the other hand, if you use regexp() to return positions, then it becomes easier to test corresponding positions in another line.
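A sketch of that position-based idea, assuming the file is already in a string array s with more than one line (for a scalar string, regexp returns a numeric vector rather than a cell). All end positions are collected per line and adjacent lines are compared, so several occurrences on one line are handled. The sample data and names are illustrative.

```matlab
s = ["aaaaaaa"; "999x999"; "zz12345"; "bbbbbbb"];

e1 = regexp(s, "999",   'end');   % cell array: end columns of every 999 per line
e2 = regexp(s, "12345", 'end');   % cell array: end columns of every 12345 per line

% Line k matches if any 999 on line k ends where a 12345 on line k+1 ends
hit    = cellfun(@(a, b) ~isempty(intersect(a, b)), e1(1:end-1), e2(2:end));
lineNo = find(hit) + 1;           % line numbers of the matching 12345 lines
isOK   = any(hit);
```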


Release

R2019b
