reading text from various positions
I have a .txt with floats and strings I want to import. The text was created by a software logging all events within an experiment (by time), organized in trials, so it looks a bit like:
+++ LogStart1 +++
procedure = org
List = 45
Condition = 43
DelayOnset = 1
[and a lot more of variables]
+++ LogEnd +++
+++ LogStart2 +++
[and the same for the second trial and so on...]
I would like to get a vector of all values for each variable (e.g., "DelayOnset"). I started importing data with fscanf. Let's say "DelayOnset" was the third string in the file:
filename = 'data.txt';
fid = fopen(filename);
% Skip one string, skip another string; the value I want then follows "DelayOnset="
formatSpecDELAY = '%*s %*s DelayOnset=%f';
DelayOnset = fscanf(fid, formatSpecDELAY, 1);
This worked, but I can't do it for every variable: the file contains 1000+ lines, so I would have to skip every object that precedes the value I want to read, that is, write %*s a thousand times. Initially I thought that if I didn't limit the number of objects (1 in the example above), I would get every delay value in the file (search for every "DelayOnset=" and return the float that follows), but that was not the case. In fact, I had to skip all the objects between DelayOnset in the first trial and DelayOnset in the second trial in order to get a vector of both values. I can't do this for the whole file.
Is it possible to create several points of reference within the text file, in order to start fscanf from these points?
Thank you very much in advance!
1 Comment
dpb
on 27 Sep 2016
Think we need to see at least some more of the file and a more specific description of what you're trying to read. Specifically, is there simply a variable and its associated value within each section or is there an array of values for a variable over the time of a trial or is the variable name duplicated for every event or...???
Answers (4)
KSSV
on 28 Sep 2016
Edited: KSSV
on 28 Sep 2016
You can use textscan and copy all the text file data into a cell, and then find your required string.
clc; clear all ;
fid = fopen('data.txt') ; % your text file in data.txt
S = textscan(fid,'%s','delimiter','\n') ; % scan the text file
fclose(fid) ;
S = S{1} ;
idx = strfind(S,'DelayOnset') ; % for each line, positions where the string occurs (empty if absent)
idx = find(not(cellfun('isempty', idx))); % indices of the lines that contain the string
S{idx} % display the matching lines
3 Comments
dpb
on 28 Sep 2016
Edited: dpb
on 28 Sep 2016
Ah...my old eyes had glossed over where the issue was so I just wrote a new solution from scratch--but now I see the problem.
Don't convert the cellstr array S to char array until after the string search so will return only the cells containing the desired string. Then, process that subset in character string form to extract the numeric data.
Or you can get there this way as well. At this point, however, S(response) is the subset of lines containing the text; you've still got to read the numeric values from that array. S{response}, on the other hand, is a comma-separated list of values, which is not so easy to deal with for this purpose; that's why I'd keep it as a cellstr array until ready to do the conversion.
But, you're missing the part of using textscan on the subset of the overall file to read the data values at the above point; you've isolated the proper lines but not yet parsed them.
Or, of course, regexp could be made to do this, too, but I'm such a klutz with its syntax I'll leave that to the whizards of that arena... :)
I think my solution while as noted is very close to this is somewhat simpler in its sequence of operations and does do the last step as well...
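To illustrate the point about keeping the matched lines as a cellstr until the conversion step, here is a minimal sketch. The sample lines are hypothetical stand-ins for the file content, and it assumes every matching line has the form `name = value` with a numeric value.

```matlab
% Hypothetical lines standing in for S{1} as returned by textscan.
S = {'+++ LogStart1 +++'; 'DelayOnset = 1'; '+++ LogEnd +++'; 'DelayOnset = 2.5'};

hits    = ~cellfun('isempty', strfind(S, 'DelayOnset'));  % logical mask of matching lines
matched = S(hits);                                        % still a cellstr, one cell per line

% Capture the text after '=' on each matched line, then convert in one call.
valueText  = regexp(matched, '=\s*(.*)$', 'tokens', 'once');  % cell array of 1x1 cells
DelayOnset = str2double([valueText{:}]).';                    % numeric column vector
```

Indexing with parentheses, `S(hits)`, keeps a cell array that `regexp` and `str2double` accept directly, whereas `S{hits}` expands into a comma-separated list.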
KSSV
on 29 Sep 2016
Hello. Once you have the indices of your required string, you can easily extract the number from the string, can't you?
Try:
clc; clear all ;
fid = fopen('data.txt') ; % your text file in data.txt
S = textscan(fid,'%s','delimiter','\n') ; % scan the text file
fclose(fid) ;
S = S{1} ;
response = strfind(S,'response') ; % for each line, positions where the string occurs
response = find(not(cellfun('isempty', response))); % indices of the lines that contain the string
S{response} % display the matching lines
iwant = zeros(length(response),1) ;
for i = 1:length(response)
    tmp = regexp(S{response(i)},'\d+','match'); % runs of digits on this line
    iwant(i) = str2double(tmp{1}) ; % first run converted to a number (integers only)
end
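The loop above picks up only unsigned integers. If the logged values can be signed, fractional, or in exponent notation, a variant along these lines may help. It is a sketch on hypothetical sample lines, and the number pattern is my own assumption about what the log can contain.

```matlab
% Hypothetical lines; only those starting with 'response' are of interest.
S = {'response = 3'; 'procedure = org'; 'response = -0.25'; 'response = 1e3'};

isResp = strncmp(S, 'response', numel('response'));   % lines that begin with the key

% Optional sign, digits with optional decimal point, optional exponent.
numPattern = '=\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)';
tok   = regexp(S(isResp), numPattern, 'tokens', 'once');
iwant = str2double([tok{:}]).';                       % column vector of values
```

Note that a line whose value does not match the pattern yields an empty token and is silently dropped by `[tok{:}]`, so the result can be shorter than the number of matched lines.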
dpb
on 28 Sep 2016
Edited: dpb
on 29 Sep 2016
Essentially other respondent's solution with a few shortcuts along the way of not building the intermediaries primarily...although used textread to first scan the file as it saves the fopen/fclose hoopla when don't need the extra facilities of textscan (such as to scan a string in memory as later on). Built it to read whatever variable of this form in the file you're interested in by simply changing the STR variable. The only other real trick is note the transpose .' on the output of the conversion of the cellstr array found in the cast to char which is needed as textscan isn't cellstring literate. This is necessary as memory is column-major in Matlab so to scan the string must orient it so that the lines are essentially columns to be read. Otherwise, one must loop through record-by-record.
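The transpose trick described above can be seen on a small in-memory example. This sketch uses sscanf, as the addendum code in this answer does, on hypothetical lines of equal length; the variable name STR follows the description, and the line format `name= value` is an assumption.

```matlab
% Hypothetical file lines, as textread/textscan would return them.
s   = {'response= 3'; 'Procedure= left'; 'response= 5'};
STR = 'response';                 % variable to extract
fmt = [STR '= %f'];               % literal name, then a float

matched = s(~cellfun(@isempty, strfind(s, STR)));
block   = char(matched);          % char matrix, one line per row

% sscanf reads a char matrix in column order, so transpose to make each
% original line contiguous in memory before scanning.
data = sscanf(block.', fmt);
```

Without the transpose, sscanf would walk down the first column of `block` (the first character of every line) and the literal text in the format would fail to match.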
ADDENDUM
OK, to deal with the multiple records case, what I'd envision would be something like this:
STRS={'Procedure','response'};
fmts={'%s','%f'};
s=textread('koh.txt','%s','delimiter','\n','whitespace','');
for i=1:length(STRS) % loop over the number to read...
  fmt=[STRS{i} '= ' fmts{i}]; % build the format string
  if strcmp(fmts(i),'%s') % ok do need know which type variable reading
    txt=cellfun(@(x) sscanf(x,fmt),s(~cellfun(@isempty,strfind(s,STRS{i}))),'uniformoutput',0);
  else
    data=sscanf(char(s(~cellfun(@isempty,strfind(s,STRS{i})))).',fmt);
  end
end
The above should need only a little extra bookkeeping to add multiple data sets and text info by creating arrays for the outputs.
At command line here after having read the data file--there was also a missing set of curlies to dereference the cellstr in the first strfind call and the closing end on the for loop, but that's the sort of thing one can expect from "air code"...I made those corrections above, as well...
>> for i=1:length(STRS) % loop over the number to read...
fmt=[STRS{i} '= ' fmts{i}]; % build the format string
if strcmp(fmts(i),'%s') % ok do need know which type variable reading
txt=cellfun(@(x) sscanf(x,fmt),s(~cellfun(@isempty,strfind(s,STRS{i}))),'uniformoutput',0);
else
data=sscanf(char(s(~cellfun(@isempty,strfind(s,STRS{i})))).',fmt);
end
end
>> txt
txt =
'left'
'left'
>> data
data =
3
5
>>
As for using the results, I've noted I don't have the table class, but something similar in the Statistics Toolbox is the dataset. I'm not advocating that you use it instead, but to illustrate the type of thing it does:
>> ds=dataset(txt,data,'VarNames',STRS)
ds =
    Procedure    response
    'left'       3
    'left'       5
>>
Now there's a composite data object with both variables you can address for analysis, etc., programmatically generically rather than with multiple variables and the like. The builtin table has all the dataset features and more...
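For readers who do have the builtin table class, the equivalent of the dataset call might look like the following. This is a sketch that reuses the txt, data and STRS values shown in the session above, typed in by hand rather than read from the file.

```matlab
% Values copied from the command-line session above.
STRS = {'Procedure', 'response'};
txt  = {'left'; 'left'};
data = [3; 5];

% Build a table with one column per variable.
T = table(txt, data, 'VariableNames', STRS);

% Columns are addressed by name, and rows can be selected by condition.
meanResponse = mean(T.response);
leftRows     = T(strcmp(T.Procedure, 'left'), :);
```

The column names must be valid MATLAB identifiers, which holds for the two names used here.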
4 Comments
dpb
on 2 Oct 2016
Edited: dpb
on 3 Oct 2016
All that should be needed in the outline above is to simply list the variables in the STRS array and their corresponding format in the fmts array--that's why I did it that way. Any valid numeric string can be scanned with '%f' on input; the string tokens are the "odd man out" that need the different format for sscanf. You therefore only need two formats, you just need to know a priori which one goes with which variable. Or, of course, one can go to more effort in coding with try..catch blocks or the like to dynamically ascertain the type but with a relatively few items it seemed simpler to just enumerate 'em and go on...
But, as nice as it is to solve a problem for someone, look at Guillaume's solution--it returns all the token pairs automagically and all you're left with is selecting the ones of interest by name. That's pretty nice presuming you have recent-enough release of Matlab for the collection class to exist.
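As for ascertaining the type dynamically instead of listing a format per variable, one possible approach (my own sketch, not something proposed in the thread) is to let str2double decide, since it returns NaN for text it cannot convert. The sample values are hypothetical.

```matlab
% Hypothetical value strings, already split off from their variable names.
values = {'left'; '3'; '2.5'; 'org'};

asNumber = str2double(values);   % NaN wherever the text is not a number
isNum    = ~isnan(asNumber);     % true for numeric entries

numbers = asNumber(isNum);       % numeric values
strings = values(~isNum);        % entries to keep as text
```

This is only a heuristic: a logged value that is literally 'NaN' would be treated as text, and str2double also converts strings such as 'inf' or 'i', so enumerating the formats as described above remains the more predictable option.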
ADDENDUM
Also note that the complexity does grow somewhat more with the explicit solution; you'll have to either build a cell array of the results during the loop or build the dataset or table object during the loop or the subsequent passes thru the loop will overwrite the txt, data variables on the next pass, leaving you with only the last of each type after the loop without doing something about that. That was my previous comment...
A simplistic solution is to write
data=[]; % before the loop
for...
...
data=[data sscanf(... ];
that will append the later set onto the first. This, however will require every set be the same length. You could create a column vector instead but then if they're not the same length you have a problem knowing which belongs to which variable. A cell array would work, but that means keeping an index to increment for each type. Doable certainly, but not, probably, the best solution given the expanded wishes and likely not the way I'd've started if you'd asked the more general question to begin with.
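One way around both the equal-length restriction and the index bookkeeping is to store each result under its own name in a struct with dynamic field names. This is a sketch of that alternative on hypothetical lines; it assumes the variable names are valid MATLAB identifiers and that the listed variables are numeric.

```matlab
% Hypothetical file lines covering two trials.
s = {'Procedure= left'; 'response= 3'; 'List= 45'; ...
     'Procedure= left'; 'response= 5'; 'List= 46'};
STRS = {'response', 'List'};     % numeric variables to collect

result = struct();
for i = 1:numel(STRS)
    key = [STRS{i} '='];
    hit = strncmp(s, key, numel(key));                  % lines starting with 'name='
    tok = regexp(s(hit), '=\s*(.*)$', 'tokens', 'once');
    result.(STRS{i}) = str2double([tok{:}]).';          % one field per variable
end
```

Each field holds its own column vector, so variables that occur a different number of times do not interfere with each other, and `result.response` makes it clear which values belong to which variable.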
Franz Kohlhus
on 29 Sep 2016
3 Comments
dpb
on 29 Sep 2016
Edited: dpb
on 29 Sep 2016
No! Just string the two pieces together to parse the two record types--but you only need to read the file once. The only reason for a loop might be to place a set of STR and fmt values in an array for the number of record types to be processed and iterate over that generically rather than using two variables (or reusing the same ones would also work, of course) as I did just for demo purposes.
There isn't really any need to make two response variables; just use the other as the grouping variable. You could, of course, create the two by separating them out by using the indicator variable to select, but it would seem likely that it's just as easy or even, perhaps, easier to simply have one response variable, not two. Particularly with, as noted, the facilities built into the table class.
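A small sketch of what using the other variable as a grouping variable can look like with plain arrays. The values and the 'left'/'right' labels are hypothetical; the real file may use different conditions.

```matlab
% Hypothetical per-trial data: one response vector, one grouping variable.
procedure = {'left'; 'right'; 'left'; 'right'};
response  = [3; 5; 4; 7];

% Map each label to a group number, then summarise the response per group.
[groups, ~, g] = unique(procedure);              % groups is sorted: 'left', 'right'
groupMean = accumarray(g, response, [], @mean);  % one mean per group

% The indicator can still select a single group when needed.
leftOnly = response(strcmp(procedure, 'left'));
```

Keeping a single response vector means any per-group statistic is one accumarray call, and nothing has to be duplicated if a third condition shows up later.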
Guillaume
on 29 Sep 2016
Edited: Guillaume
on 29 Sep 2016
Not having read the other answers, here is how I would deal with your problem:
filecontent = fileread('C:\somewhere\Yourfile'); %read whole content of file at once
keyvaluepairs = regexp(filecontent, '([^=\n\r]*)= ([^=\n\r]*)', 'tokens'); %identify all key/value pairs (any strings separated by '= ')
keyvaluepairs = vertcat(keyvaluepairs{:}); %transform cell array of cell arrays into a two-column cell array
[keys, ~, rows] = unique(keyvaluepairs(:, 1)); %get unique keys and corresponding rows
values = accumarray(rows, (1:numel(rows))', [], @(ridx) {keyvaluepairs(ridx, 2)}); %group together all values for each key
mymap = containers.Map(keys, values); %store it into a map for easy querying
Querying for any key is then straightforward, e.g.:
mymap('response')
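The map returns the values as a cell array of strings, so numeric variables still need a conversion. The sketch below runs the same steps on a hypothetical in-memory string instead of a file, with two additions of my own: strtrim, in case the file has blanks around the names or values, and sort inside the accumarray function, because accumarray does not promise to pass the row indices in their original order.

```matlab
% Hypothetical file content; a real run would use fileread instead.
filecontent = sprintf('response= 3\nProcedure= left\nresponse= 5\n');

pairs = regexp(filecontent, '([^=\n\r]*)= ([^=\n\r]*)', 'tokens');
pairs = strtrim(vertcat(pairs{:}));              % two-column cellstr, blanks removed

[keys, ~, rows] = unique(pairs(:, 1));
values = accumarray(rows, (1:numel(rows))', [], ...
                    @(ridx) {pairs(sort(ridx), 2)});   % keep file order per key
mymap = containers.Map(keys, values);

response  = str2double(mymap('response'));       % numeric column vector
procedure = mymap('Procedure');                  % stays a cellstr
```

Keys are case-sensitive, so `mymap('procedure')` would raise an error here; `isKey(mymap, name)` can be used to check before querying.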
13 Comments
Guillaume
on 4 Oct 2016
You can silence the warning with:
warning('off', 'MATLAB:iofun:UnsupportedEncoding');
However, I would leave it on as a reminder that you're using an undocumented and unsupported option that may break / disappear in future releases. Matlab does not officially support UTF-16.
dpb's suggestion would work if you read the file normally (e.g. with fileread) and only if all the characters in the file have code < 256 (not guaranteed if there are some non-US-English characters). You would do the filtering immediately after reading the file content.
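The filtering step itself is not shown in this part of the thread, so the following is only a guess at its shape: it assumes the file was read byte by byte (one char per byte), that it is little-endian UTF-16 with a byte order mark, and that every character has a code below 256 so that each high byte is zero. The raw variable simulates such a read in memory.

```matlab
% Simulate what a byte-wise read of a UTF-16LE file would return:
% BOM (255 254), then each character followed by a zero byte.
txt = 'response= 3';
raw = char([255 254, reshape([double(txt); zeros(1, numel(txt))], 1, [])]);

clean = raw;
if numel(clean) >= 2 && isequal(double(clean(1:2)), [255 254])
    clean(1:2) = [];            % drop the byte order mark
end
clean(clean == 0) = [];         % drop the zero high bytes
```

If any character in the file has a code of 256 or more, its high byte is not zero and this filter would corrupt the text, which is the limitation described above.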
dpb
on 4 Oct 2016
@Franz--Guillaume's comment has merit but warnings all the time are annoying so if the application that builds the raw data files does use UTF16 and you can't easily change that, personally I'd turn off the warning and make a comment in the m-file about what the issue is.
While it is unsupported in other areas at least so far, I don't see that TMW can possibly regress in removing at least minimal support and gradually increasing other support in Matlab--the encoding isn't going to go away and they'll just be left further and further behind if were to do so.
As he also notes, the "fixup" does work as long as character codes are within the lower 8bit UTF8 character set which it appears from the type of file is likely the case....but certainly not guaranteed. Of course, given the limited support elsewhere, if you find some in a file you'll possibly have other difficulties arise anyway that you'll have to work around.