Conditional textscan - How to select certain lines from a file

Hi there, I would like to read information from a file into an array for later use. Only certain rows of that file should be read in, namely rows in which the second column starts with 'S1' followed by any two digits. I'm having trouble with this conditional textscan. Here is the code for reading in the file (note that the file starts with 13 lines that are not in column format, hence the "headline" variables at the beginning). I basically want the variables Position, Length, Channel etc. to be read in only for lines that meet the regex condition.
dataFileName=strcat('EEG_Anne_',int2str(pNumber),'.vmrk');
fid = fopen(dataFileName);
headline1=fgets(fid);
headline2=fgets(fid);
headline3=fgets(fid);
headline4=fgets(fid);
headline5=fgets(fid);
headline6=fgets(fid);
headline7=fgets(fid);
headline8=fgets(fid);
headline9=fgets(fid);
headline10=fgets(fid);
headline11=fgets(fid);
headline12=fgets(fid);
headline13=fgets(fid);
C = textscan(fid, '%s%s%d%d%d','Delimiter',',');
Stimulus=C{2};
if regexp(Stimulus{i},'S1\d*'),
Type=C{1};
Position=C{3};
Length=C{4};
Channel=C{5};
end
fclose(fid);

Accepted Answer

Stephen23 on 13 Oct 2015
Edited: Stephen23 on 13 Oct 2015
Usually the fastest and easiest way to select from a dataset is to read the complete file into MATLAB and then make the selection inside of MATLAB:
N = 117;
fileName = sprintf('EEG_Anne_%d.vmrk',N);
fid = fopen(fileName);
hdrRows = 13;
hdrData = textscan(fid,'%s',hdrRows, 'Delimiter','\n');
matData = textscan(fid,'%s%s%s%d%d%d', 'Delimiter',{',','='}, 'CollectOutput',true);
fclose(fid);
X = ~cellfun('isempty',regexp(matData{1}(:,3),'^S1\d\d$','once'));
I also replaced the very awkward 13 calls to fgets with one simple call to textscan, which reads the header data into a cell array. To test this code I used the file that you gave in your other answer (attached here also). The test detects these rows:
>> matData{2}(X,:)
ans =
13127 1 0
17828 1 0
22387 1 0
27429 1 0
31951 1 0
36610 1 0
51258 1 0
56417 1 0
61951 1 0
.... etc
which corresponds exactly to the rows with 'S1xx' in the second column.
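The selection step above can be tried without the original file. The following is only a minimal sketch: the marker lines are invented for illustration (they merely mimic the column layout), and textscan is given a character vector instead of a file identifier so that the example is self-contained.

```matlab
% Hypothetical marker lines (made up for this sketch, not taken from the real file)
txt = sprintf('%s\n', ...
    'Mk1=Stimulus,S11,13127,1,0', ...
    'Mk2=Stimulus,S21,15000,1,0', ...
    'Mk3=Response,R1,16000,1,0', ...
    'Mk4=Stimulus,S15,17828,1,0');

% Same format and options as in the answer, applied to text instead of a file
C = textscan(txt, '%s%s%s%d%d%d', 'Delimiter', {',','='}, 'CollectOutput', true);
strData = C{1};   % N-by-3 cell array holding the three text columns
numData = C{2};   % N-by-3 int32 matrix holding the three numeric columns

% Logical mask: true where the third text column is exactly 'S1' plus two digits
X = ~cellfun('isempty', regexp(strData(:,3), '^S1\d\d$', 'once'));

% Apply the same mask to both arrays so that the rows stay aligned
selStr = strData(X,:);
selNum = numData(X,:);
```

Because X is a logical column with one element per row, it can index the text array and the numeric array alike, which keeps the selected rows in step with each other.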
Bonus: if you want to practice using regular expressions (i.e. regexp), then you can try my FEX submission:
This tool lets you interactively write and change a regular expression, and updates the outputs as you type, so you can see what effect those changes have on the string parsing. It is a great way to practice using regular expressions, or to adapt a regular expression to your particular requirements.
7 Comments
Anne Mickan on 16 Oct 2015
Ok, never mind, I finally understood it and my script works! Took a bit longer than it should have - sorry for being slow. Thanks again for your help!
Stephen23 on 16 Oct 2015
Edited: Stephen23 on 16 Oct 2015
My code certainly does not extract only the last three columns: it actually gives you all of the data in your file, even the headers!
Did you actually look at the variables that my code generates?
You should have a play with your workspace browser: there you can view a summary of every variable, and double-click any variable to open it in the variable viewer, where you can inspect its elements (i.e. values). Double-click on matData and you will find both the character arrays (the first few columns) and numeric arrays (the last columns) inside it.
All of your data is there, I promise you. Finding it just requires some exploration and an understanding of cell arrays that contain other arrays.
You write that you "want a matrix with 5 columns (the first two of which contain strings and the last three of which contain numbers)", but it is not possible to store both character and numeric data in one ordinary array. These arrays can, however, be stored together in a cell array, which is what textscan does. A cell array is just a container of other arrays, and it does not matter what kind they are, but it adds an extra level of complexity to your code.
The character data occurs in the first columns, so it is the first array in the output cell array:
strData = matData{1}
while the numeric data are all of the remaining columns, so occurs second in the output cell array:
numData = matData{2}
This is why we access the first cell for the regexp (regular expression) call: the first cell contains all of the character data. So "The second is the one that contains the "S1xx" information, no?" is incorrect, because the string data is in the first cell. You don't need me to tell you this: have a look at the variables in your variable browser.
Your statement "And then the code continues to access (;,3).... Does that mean it accesses all rows of column 3 of MatData?" is incorrect, as should be becoming clear: matData is a cell array that contains some other arrays. matData has size 1x2. It certainly does not have three columns. What you are interested in are the data arrays inside of matData: these are the character and numeric arrays that were specified in textscan, with as many columns as that format specification. So we can do this:
numData = matData{2}; % <- get the numeric array out of the cell array
numData(:,3) % <- get the third column of the numeric array.
Or we can do this in one go, this is equivalent:
matData{2}(:,3)
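The difference between the cell array and the arrays inside it can be checked on a small stand-in. This is a sketch only: the matData built here is a hand-made imitation of the textscan output, with invented values.

```matlab
% Hand-made stand-in for the textscan output (values are hypothetical)
matData = {{'Mk1','Stimulus','S11'; 'Mk2','Stimulus','S12'}, ...
           int32([13127 1 0; 17828 1 0])};

sz    = size(matData);    % [1 2]: the container has two cells, not three columns
inner = matData{2};       % curly braces return the contents: a 2-by-3 int32 matrix
sub   = matData(2);       % parentheses return a 1-by-1 cell array, still a container

col3a = inner(:,3);       % third column via the intermediate variable
col3b = matData{2}(:,3);  % the same column, indexed in one go
```

Curly braces take the array out of its cell, while parentheses only select cells, which is why the column index has to follow the curly-brace index.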
To answer your question: it is not strictly required to use the option CollectOutput, but either way the function textscan will output a cell array containing some character/numeric data, and by using this option you simply merge these character/numeric arrays together where possible. This often makes further processing much simpler, because accessing and processing data in lots of cells of a cell array is not as convenient as it sounds.
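The effect of CollectOutput can be seen by parsing the same text twice. This is a minimal sketch that assumes two invented marker lines and passes them to textscan as a character vector rather than reading a file.

```matlab
% Two hypothetical marker lines, made up for this comparison
txt = sprintf('%s\n', 'Mk1=Stimulus,S11,13127,1,0', 'Mk2=Stimulus,S12,17828,1,0');
fmt = '%s%s%s%d%d%d';

% Without CollectOutput: one cell per format specifier (six cells)
A = textscan(txt, fmt, 'Delimiter', {',','='});

% With CollectOutput: consecutive columns of the same class are merged (two cells)
B = textscan(txt, fmt, 'Delimiter', {',','='}, 'CollectOutput', true);

sizeA = size(A);        % [1 6]
sizeB = size(B);        % [1 2]
posA  = A{4};           % first numeric column, stored in its own cell
posB  = B{2}(:,1);      % the same column, now inside the merged numeric matrix
```

Both calls return the same values. The option only changes how many cells they are spread over.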


More Answers (1)

Samy Alkhayat on 12 Nov 2018
Edited: Samy Alkhayat on 12 Nov 2018
Hello, I have a similar problem: I want to concatenate 32 columns from 32 different, sequentially named files. The code works fine if the 32 files have arrays of the same size (2500 in some of the sets); however, other sets of 32 files include one file of size 2909 or more. I now need the concatenation to cover only the first 2500 rows of every file. Please help me edit the code below (the error message is below as well):
clear all
size=2500;
for i=1 : 32
filename=horzcat(pwd,'\Run4915_Inj',int2str(i),'.pre');
delimiter = {'\t',','};
startRow = 3;
%% Format string for each line of text:
formatSpec = '%f%f%f%*s%[^\n\r]';
%% Open the text file.
fileID = fopen(filename,'r');
%% Read columns of data according to format string.
dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter, 'HeaderLines' ,startRow-1, 'ReturnOnError', false);
%% Close the text file.
fclose(fileID);
%% Create output variable
time(:,i)=dataArray{1:size, 1};
P(:,i)=dataArray{1:size, 2};
needle(:, i)=dataArray{1:size, 3};
Pav=mean(P,2);
nav=mean(needle,2);
tav=mean(time,2);
%% Clear temporary variables
clearvars filename delimiter startRow formatSpec fileID dataArray ans;
end
I get this error when I run the code on the exceptional set: Unable to perform assignment because the size of the left side is 2500-by-1 and the size of the right side is 2909-by-1.
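One possible way to keep only the first 2500 rows is to take the column vector out of its cell first and then index its rows. The following is a sketch under assumptions, not a tested fix for the actual files: the file reading is replaced by random stand-in data, nKeep and nFiles are hypothetical names, and only three "files" are simulated instead of 32.

```matlab
nKeep  = 2500;                 % rows to keep from every file (figure from the question)
nFiles = 3;                    % hypothetical; the question uses 32 files
P = zeros(nKeep, nFiles);      % preallocate one column per file

for k = 1:nFiles
    % Stand-in for dataArray = textscan(...): the second "file" has 2909 rows
    nRows = nKeep + 409*(k == 2);
    dataArray = {rand(nRows,1), rand(nRows,1), rand(nRows,1)};

    col = dataArray{2};        % curly braces: get the column vector out of the cell
    P(:,k) = col(1:nKeep);     % parentheses: keep only its first nKeep rows
end

Pav = mean(P, 2);              % row-wise mean across files, computed once after the loop
```

The sketch also avoids naming a variable size, since that would shadow the built-in size function for the rest of the script.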
