MATLAB Answers

count number of rows in csv outside of matlab

55 views (last 30 days)
I have 10000+ csv files I would like to import into matlab. I only need the data from the first and last rows for inlet and exit conditions. Each csv file has a different number of data points, so I do not know the length of the file imported a priori. I am trying to automate the import process. I can automate importing all the data or specific lines, but I do not know how to import the last row. The only way I can think of is to determine the number of rows in the file without importing the data (importing all the data takes a few hours) and import that row specifically. Does anyone know how I can do this? I have tried messing with textscan, but I have not had any luck.
  5 Comments
Alexandra McClernon Ownbey
I am using windows. I am not sure what the limit/line is. There are 8 columns in all files. The first row are the titles, the remaining rows are numbers with >8 figures in each cell

Sign in to comment.

Accepted Answer

Jeremy Hughes
Jeremy Hughes on 26 Feb 2021
Turns out this is not as easy of a question as you might think, especially if your CSV contains data that might be double-quoted and that data contains a new line character.
e.g. "This data\nhas a new line","but this doesn't"
The number of lines might not really be that important, depending on what you're trying to do. But I don't know what that is.
If you're trying to avoid having all the data in memory at one time, I suggest reading up on tabularTextDatastore, as that helps automate working with large sets of data. There's a rich set of features you can use with datastores to make working with larger datasets eaiser, tall, transform, combine. None of these assume to know the size of each table. But again, without knowing what you plan do to with those files, that's hard to say if you can use it.
---- but to answer the original question ----
If you don't have any double-quoted data, it gets a lot easier to count lines of a CSV file. This code will scan the lines without importing any of the data. (It reads the file internally, but doesn't generate any output--thats what the %*... formats are about)
fid = fopen(filename);
numLines = 0;
while ~feof(fid)
[~,c]=textscan(fid,'%*[^\r\n]%*[\r\n]',1,'Delimiter','','Whitespace','','EndOfLine','');
if c > 0 % if c==0, then there wasn't a line there. this may happen at the end of the file.
numLines = numLines + 1;
end
end
fclose(fid);
If you want the data for the lines, this newer function should help:
---- if you have double-quoted strings with new lines (or don't know if you do) ----
If you want to count the actual lines and not count the ones in double quoted fields, then you really need to parse each CSV line and find the fields with double-quotes, and at that point you might as well be importing the data, but if you really just want to count the parsed lines, you can do something with importOptions.
This code will import only the first variable in the table as a string, but to do that, it will still parse the file and consider the quoted data that appears later in the line. It will be slower than the method above, but robust if you have quoted data that contains newlines. It will be faster than bringing in the whole table.
opts = delimitedTextImportOptions('Delimiter',',','ExtraColumnsRule','ignore','VariableTypes',"string");
T = readtable(filename,opts);
numLines = height(T);

More Answers (2)

KSSV
KSSV on 26 Feb 2021
csvFiles = dir('*.csv') ;
N = length(csvFiles) ;
f = cell(N,1) ; % first row
l = cell(N,1) ; % last row
for i = 1:N
data = csvread(csvFiles(i).name) ;
f{i} = data(1,:) ;
l{i} = data(end,:) ;
end
  1 Comment
Alexandra McClernon Ownbey
I am trying to read in the data without reading in the entire table. I already have a script that can read in all the csv files in separate folders. csvread takes too long. I know I can import the data into cells or a 3D matrix and then select the points I want, but this is a very round-about way of doing it.

Sign in to comment.


Walter Roberson
Walter Roberson on 26 Feb 2021
csvdir = 'appropriate_directory_name'; %use '.' for current directory
csvFiles = dir(fullfile(csvdir, '*.csv'));
filenames = fullfile({csvdir.folder}, {csvdir.name});
N = length(csvFiles) ;
f = cell(N,1) ; % first row
l = cell(N,1) ; % last row
for K = 1:N
thisfile = filenames{K};
[fid, msg] = fopen(thisfile, 'r');
if fid < 0
fprintf('failed to open file "%s" because "%s", ignoring it\n', thisfile, msg);
next
end
fgetl(fid); %skip header
f{i} = cell2mat(textscan(fgetl(fid), '')); %first line
%data is 8 columns. We can be sure that columns are < 25 characters each
fseek(fid, 256, 'eof'); %move to near end of file
fgetl(fid); %we positioned to middle of line, discard to end of line
%look for the last non-empty line
old_line = '';
while ~feof(fid)
new_line = fgetl(fid);
if ~ischar(new_line); break; end %EOF
if ~isempty(strtrim(new_line))
old_line = new_line;
end
end
fclose(fid)
l{i} = cell2mat(textscan(old_line, ''))
end
What this code is doing is opening each file, skipping a header line, reading the next line and converting it to numeric. Then it seeks to before the end of file and reads lines, discarding empty lines, including empty lines that occur at end of file, keeping the last non-empty line it finds, and converting the last non-empty line to numeric.
The code seeks to 256 characters before the end of file, skipping the rest of the file -- literally not reading it as much as is possible with the operating system. Why 256? Because it is a "nice round number" to computer scientists ;-) If the data was output as double precision, then it could take as many as 25 characters per entry such as '-6.32359246225409463e+110' plus the comma delimiter, maybe a space as well, so possibly 27*8+2 characters = 218 characters for the line. Using 256 gives a bit of slack in case we miscounted or there is something odd in the file.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by