MATLAB Answers

How to extract text from string at the same location, one line above

Asked by zhert
on 11 Oct 2019
Latest activity Commented on by dpb
on 15 Oct 2019
I have a variable number of text files (between 3 and 8), each between 20,000 and 30,000 lines long (different lengths), and around 400 words to search for. The words have different lengths.
Let's say I have the following text:
xxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxx999xxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxx12345xxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
where xxxxx can be anything other than what I want to search for. I want to check whether both of the following are true:
  1. That each text file includes '12345'
  2. That for at least one occurrence of '12345' in each file, there is a '999' on the line directly above. The end of '999' must line up with the end of '12345' (same column, counted from the start of the line).
I can determine whether '12345' is in each of the text files using strfind, but strfind only outputs an "index" value for the first character of my search pattern (e.g. 613587). Is there a way to find the line number that the "index" value corresponds to, and search one line above for '999'?
I think I saw people recommending that each line for each file be read as a separate string, then search each string independently, but that seems like a lot of work for MATLAB to go through, having to generate close to a hundred thousand strings. Is there a better/more efficient way of achieving this?
Any help would be appreciated!

  6 Comments

The word, "word", indicates the string being searched should be surrounded by word boundaries.
Thanks a ton for everyone's help! For anyone who may have a similar question, I ended up solving it in a rather dumb way.
I did this:
  1. Use regexp to extract a section of the code that matches the pattern, which spans two lines.
  2. Then split the resultant two lines.
  3. Finally, use strfind to find the locations of the two search terms in their own respective lines, and make sure they line up.
A little convoluted (and I'm sure very inefficient), but it got the job done!
The solution provided above seems more straightforward: it first locates just the lines that are possible matches, so no splitting of lines is required, and it returns only the indices of allowable pairs.
One slight enhancement that regexp allows in the solution shown is to eliminate the adjustment of the location in the record to match positions:
isOK=(regexp(s(ix-1),"999",'end'))==(regexp(s(ix),"12345",'end'));

Release

R2019b

2 Answers

Answer by per isakson
on 13 Oct 2019
Edited by per isakson
on 13 Oct 2019
 Accepted Answer

"Is there a better/more efficient way of achieving this?" No, I don't think so. However, speed depends on how "each line for each file be read as a separate string" is done. (Are strings in an array separate?)
"that seems like a lot of work for MATLAB" Don't guess and don't rely on hearsay. Make a simple test.
I assume that your example is oversimplified and that the script below won't work with the actual files. However, it might help you to estimate execution times.
I made a test file, cssm.txt, with 30,000 lines by copying and modifying lines from your question. It contains only one pair
xxxxxxxxxxxxxxxxxxxxxxxxxx999xxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxx12345xxx
which is at line 15001.
The script below contains two independent solutions and strfind(chr,'12345') for comparison. The elapsed times for the three cases, in script order (the strfind baseline first), are
Elapsed time is 0.006313 seconds.
Elapsed time is 0.053587 seconds.
Elapsed time is 0.020818 seconds.
on a vanilla desktop and R2018b. The execution time of the second solution is less than four times that of fileread(); strfind();. Eight files and four hundred words should be possible to process in a bit more than one minute (8*400*0.02 = 64 s). During my test the text file was somewhere in the cache system. The execution time will depend (a little) on whether you have an SSD or a spinning disk.
%%
%#ok<*NASGU>
tic
chr = fileread('cssm.txt');
pos = strfind( chr, '12345' );
toc
%%
tic
fid = fopen('cssm.txt','rt');
cac = textscan( fid, '%s', 'Delimiter','\n' );
str = reshape( string( cac{1} ), 1,[] );
fclose( fid );
%%
e1 = regexp( str, "999", 'end', 'once' );
e2 = regexp( str, "12345", 'end', 'once' );
is1 = not( cellfun( 'isempty', e1 ) );
is2 = not( cellfun( 'isempty', e2 ) );
%%
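% All lines with "999" whose next line has "12345" (not only the first such pair)
pos = find( is1 & [ is2(2:end), false ] );
%
found = false;
for p = reshape( pos, 1,[] )
    if e1{p}==e2{p+1}   % same end column on both lines
        found = true;
        break
    end
end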
toc
%%
tic
fid = fopen('cssm.txt','rt');
cac = textscan( fid, '%s', 'Delimiter','\n' );
str = reshape( string( cac{1} ), 1,[] );
fclose( fid );
%%
is1 = contains( str, "999" );
is2 = contains( str, "12345" );
% All lines with "999" whose next line has "12345" (not only the first such pair)
pos = find( is1 & [ is2(2:end), false ] );
%
found = false;
for p = reshape( pos, 1,[] )
    if regexp(str(p),"999",'end','once') == regexp(str(p+1),"12345",'end','once')
        found = true;
        break
    end
end
toc
(There are edge cases for which this script will throw errors.)
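As a rough sketch of how some of those edge cases might be handled, the variant below builds the test text in memory instead of reading cssm.txt, so it runs on its own. The names pad, txt, patAbove, patBelow and cand are made up for the illustration. It assumes the search words contain no regular-expression metacharacters (otherwise they would need escaping, e.g. with regexptranslate), and it compares every end column on each line rather than only the first.

```matlab
% Hypothetical in-memory stand-in for the contents of one text file.
pad = @(n) repmat('x', 1, n);
txt = strjoin({pad(18), [pad(26) '999' pad(13)], ...
               [pad(24) '12345' pad(3)], pad(47)}, newline);
str = reshape(splitlines(string(txt)), 1, []);   % one string per line, as a row

patAbove = "999";     % pattern expected on the upper line
patBelow = "12345";   % pattern expected on the line below it

found = false;
if numel(str) >= 2    % a single-line (or empty) file cannot hold a pair
    % Candidate pairs: line p has patAbove and line p+1 has patBelow.
    cand = find(contains(str(1:end-1), patAbove) & contains(str(2:end), patBelow));
    for p = cand
        eUp = regexp(str(p),   patAbove, 'end');   % every end column on line p
        eLo = regexp(str(p+1), patBelow, 'end');   % every end column on line p+1
        if ~isempty(intersect(eUp, eLo))           % at least one shared end column
            found = true;
            break
        end
    end
end
```

Using intersect on the full lists of end columns means a line with several occurrences of a pattern is still handled, which the 'once' option would not do.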



Answer by dpb
on 12 Oct 2019
Edited by dpb
on 14 Oct 2019

Presuming you have read the file into a string array, s,
>> isOK=contains(s(find(contains(s,"12345"))-1),"999")
isOK =
logical
1
>>
NB: The above will return an empty result if the first search fails; be sure to either wrap the search in a function that handles that case or test for it in the result.
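A minimal sketch of such a guard, assuming s is a column string array with one element per line (the sample lines here are made up). Besides the empty case, it also drops a match on line 1, where indexing s(0) would otherwise raise an error, and it collapses the result to a single logical when several lines match.

```matlab
s  = ["xx999xx"; "12345xx"; "yy999yy"; "12345yy"];   % hypothetical file lines
ix = find(contains(s, "12345"));   % lines holding "12345"
ix = ix(ix > 1);                   % a hit on line 1 has no line above it
% True if any remaining hit has "999" somewhere on the line above.
isOK = ~isempty(ix) && any(contains(s(ix - 1), "999"));
```

This checks only that "999" appears on the preceding line, as the one-liner does; it does not test column alignment.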
ADDENDUM:
There's no reason you can't also search for the "999" and then see if the subsequent line contains the other magic string--
isOK=contains(s(find(contains(s,"999"))+1),"12345")
If you do want to revert to the reading-from-file routine, this is the way to do it there; then there's no need to try to retrieve a previous record. Just scan for the second matching string in the subsequent line once you find the first; if found, you're done; if not, continue searching for the first.
ADDENDUM 2:
Additional requirement of same ending column:
isOK=false;
ix=find(contains(s,"12345"));      % lines containing "12345"
ix=ix(ix>1);                       % a match on line 1 has no line above it
for k=reshape(ix,1,[])             % test each candidate pair in turn
    isMaybe=contains(s(k-1),"999");
    if isMaybe
        % "999" is two characters shorter than "12345", hence the offset of 2
        isOK=any(ismember(strfind(s(k-1),"999")-2,strfind(s(k),"12345")));
        if isOK, break, end
    end
end
Also NB: a match failure in strfind returns empty, and comparing an empty result with == yields an empty result rather than false. That is why the positions are compared only when isMaybe is true, and why the comparison is wrapped in any(ismember(...)), which returns false for empty input.
The above also takes care of the case where find returns an empty result because no line satisfies the first condition. It's kinda' implied you know there is at least one such record, but it never hurts to code defensively.
ADDENDUM 3:
One could replace the adjustment of location to account for the length difference in the search strings by using regexp with a two-character wildcard match for the preceding characters. I'll leave that as an "exercise for the student", but will note that while powerful, regexp does generally have a performance hit; you would have to test whether the file sizes make that an issue if you choose that route.
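One possible way to work that exercise, shown for a single pair of lines. The names pad, prevLine, currLine, a and b are invented for the illustration, and the 'once' option assumes only the first occurrence on each line matters.

```matlab
pad      = @(n) repmat('x', 1, n);
prevLine = string([pad(26) '999'   pad(13)]);   % "999" occupies columns 27-29
currLine = string([pad(24) '12345' pad(3)]);    % "12345" occupies columns 25-29

% Two-character wildcard: "..999" starts where an aligned "12345" starts,
% so the start positions can be compared without subtracting an offset.
a = regexp(prevLine, "..999", 'start', 'once');
b = regexp(currLine, "12345", 'start', 'once');

% regexp returns [] when there is no match, so test for that before comparing.
sameColumn = ~isempty(a) && ~isempty(b) && a == b;
```

Comparing the 'end' positions of "999" and "12345" directly would give the same answer without padding the pattern.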

  3 Comments

Code to get s:
s = fileread(fullFileName);
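Note that fileread returns the whole file as one character vector, whereas the answer above presumes a string array with one element per line. A small sketch of the conversion, using a made-up character vector in place of the file contents:

```matlab
% Hypothetical stand-in for the output of fileread(fullFileName).
chr = sprintf('first line\nsecond 12345\nthird line');
s   = splitlines(string(chr));   % string array, one element per line
```

If the file ends with a line break, the last element of s is an empty string, which is harmless for the searches discussed here.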
Thanks for the suggestion. Sorry I wasn't being clear enough: I want to know whether
xxxxxxxxxxxxxxxxxxxxxxxx12345xxx
is immediately preceded by
xxxxxxxxxxxxxxxxxxxxxxxxxx999xxxxxxxxxxxxx
where '12345' ends on the same number of character (counted from the beginning of the line) as '999'.
Is this something that maybe regex would be able to solve?
"Is this something that maybe regex would be able to solve?"
Yes, in theory, using dynamic regular expressions. Possible, but not something I would especially recommend.
On the other hand, if you regexp() to return positions, then it becomes easier to test corresponding positions in another line.
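A sketch of that idea working directly on the whole-file character vector, which also shows how an absolute index can be turned into a line number and a column. The sample text and the names nl, lineStart, endPos, lineNo, endCol and above are assumptions made for the example.

```matlab
chr = sprintf('abc999\na12345\ntail');   % hypothetical file contents
nl  = find(chr == newline);              % absolute positions of the line breaks
lineStart = [0, nl];                     % offset just before each line's first character

endPos = regexp(chr, '12345', 'end');                 % absolute end position of every match
lineNo = arrayfun(@(k) nnz(nl < k), endPos) + 1;      % line holding each match
endCol = endPos - lineStart(lineNo);                  % end column within that line

% Check the line above the first match, provided there is one.
isAligned = false;
if ~isempty(lineNo) && lineNo(1) > 1
    above = chr(lineStart(lineNo(1)-1)+1 : nl(lineNo(1)-1)-1);   % text of the previous line
    isAligned = any(regexp(above, '999', 'end') == endCol(1));
end
```

Only the first match is tested here to keep the sketch short; looping over all elements of lineNo would cover every occurrence.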
