Reading and processing data from text file to matlab variable quickly

I use the following code to read data from a text file and process it into two cell arrays. It works, but can it be done faster? Although I currently need the cell array format for the downstream code that uses the data, I am open to considering other data types if they help read the text file more quickly.
adjlist = regexp(fileread('sample_input.txt'), '\r\n', 'split');
adjlist(cellfun('isempty', adjlist)) = [];
nodes = regexp(adjlist, '\w*(?= )', 'match');
nodes = cell2mat(nodes);
edges = regexp(adjlist, '(?<=( |,))\w*', 'match');

2 comments

The time overhead is likely not in the file-reading portion but in the regexp processing afterwards; regexp is pretty notorious for not being a speed demon. You're just reading the file and splitting it into a cellstr array, so I suspect the read itself is not the issue.
Try breaking out the fileread from the surrounding regexp and profile the result; I'll be quite surprised if the above supposition doesn't turn out to be true.
You are right, the bottleneck is the three regexp instructions. I have reworded my question slightly; I hope it is clearer. Or do you suggest recasting the problem just in terms of regexp?
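The suggestion to time the file read separately from the regexp steps could look roughly like the sketch below. The generated file and its "node -> a,b,c" line layout are assumptions made for illustration, since the thread does not show sample_input.txt itself.

```matlab
% Write a small hypothetical input file; the line layout is an assumption.
fname = [tempname, '.txt'];
fid = fopen(fname, 'w');
for k = 1:3000
    fprintf(fid, '%d -> %d,%d,%d\r\n', k, k+1, k+2, k+3);
end
fclose(fid);

% Time the file read on its own.
t0 = tic;
txt = fileread(fname);
tRead = toc(t0);

% Time the regexp steps on the text that is already in memory.
t0 = tic;
adjlist = regexp(txt, '\r\n', 'split');
adjlist(cellfun('isempty', adjlist)) = [];
nodes = regexp(adjlist, '\w*(?= )', 'match');
edges = regexp(adjlist, '(?<=( |,))\w*', 'match');
tParse = toc(t0);

delete(fname);
fprintf('fileread: %.4f s, regexp steps: %.4f s\n', tRead, tParse);
```

The MATLAB profiler (profile on / profile viewer) gives a per-line breakdown if more detail is needed.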

Accepted Answer

per isakson on 26 Feb 2017
Edited: per isakson on 26 Feb 2017
"Reading and processing data from text file to matlab variable quickly" The short answer is that using textscan to read and do most of the parsing is faster and gives cleaner code.
It's a bit tricky to measure the speed of reading small files, since the file will be in the system cache after the first test. However, it's safe to claim that textscan is faster in this case.
Run this:
>> [nodes,edges,cac] = cssm();
Elapsed time is 0.054037 seconds.
Elapsed time is 0.009937 seconds.
>> cac(:)
ans =
{3001x1 cell}
{3001x1 cell}
where
function [nodes,edges,cac] = cssm()
tic
adjlist = regexp(fileread('sample_input.txt'), '\r\n', 'split');
adjlist(cellfun('isempty', adjlist)) = [];
nodes = regexp( adjlist, '\w*(?= )', 'match' );
% nodes = cell2mat(nodes);
% Error using cell2mat (line 52)
% CELL2MAT does not support cell arrays containing cell arrays or objects.
nodes = cat( 1, nodes{:} );
edges = regexp(adjlist, '(?<=( |,))\w*', 'match');
toc
tic
fid = fopen( 'sample_input.txt' );
cac = textscan( fid, '%s%*s%[^\r\n]', 'Delimiter',' ' );
[~] = fclose( fid );
toc
end
A fairer comparison:
>> [nodes,edges,n2,e2] = cssm();
Elapsed time is 0.047859 seconds.
Elapsed time is 0.014726 seconds.
>> edges{1}
ans =
'3' '5' '9'
>> e2{1}
ans =
'3' '5' '9'
where three lines are added to produce the data in the same format
function [nodes,edges,n2,e2] = cssm()
tic
adjlist = regexp(fileread('sample_input.txt'), '\r\n', 'split');
adjlist(cellfun('isempty', adjlist)) = [];
nodes = regexp( adjlist, '\w*(?= )', 'match' );
% nodes = cell2mat(nodes);
% Error using cell2mat (line 52)
% CELL2MAT does not support cell arrays containing cell arrays or objects.
nodes = cat( 1, nodes{:} );
edges = regexp(adjlist, '(?<=( |,))\w*', 'match');
toc
tic
fid = fopen( 'sample_input.txt' );
cac = textscan( fid, '%s%*s%[^\r\n]', 'Delimiter',' ' );
[~] = fclose( fid );
n2 = cac{1}; % new
e2 = regexp( cac{2}, ',', 'split' ); % new
e2 = reshape( e2, 1,[] ); % new
toc
end
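
To see what the textscan format string does without needing a file, textscan can also be given a character vector directly. The sketch below is only an illustration: the two input lines and their "node -> a,b,c" layout are assumed, not taken from the actual sample_input.txt.

```matlab
% Hypothetical two-line input; the "node -> a,b,c" layout is an assumption.
txt = sprintf('1 -> 3,5,9\r\n2 -> 4\r\n');

% First field  (%s):   first space-delimited token, the node
% Second field (%*s):  second token, read and discarded
% Third field:         everything up to the end of the line, the edge list
cac = textscan(txt, '%s%*s%[^\r\n]', 'Delimiter', ' ');

n2 = cac{1};                          % column cell array of node names
e2 = regexp(cac{2}, ',', 'split');    % one cell of edge names per line
disp(n2.');
disp(e2{1});
```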

7 comments

It worked for me too, thank you. The remaining regexp instruction is now the slowest instruction of the code.
per isakson on 27 Feb 2017
Edited: per isakson on 2 Mar 2017
"remaining regexp instruction" Do you mean e2=regexp(cac{2},',','split');? I don't think there is a faster alternative (with plain MATLAB).
The question is whether it's a good idea to store these numbers as strings in cell arrays. Cell arrays require a lot of memory and contribute to slow code.
Thank you for the feedback on storing these numbers as cell arrays. These numbers are just an example. The real use case I am interested in has strings of different length rather than numbers. Is there a more efficient way to store strings of different lengths than cell arrays? Perhaps regular char arrays with blanks?
Interesting, I just timed regexp 'split' compared to strsplit. I expected regexp to be slower, but strsplit was distinctly slower.
Cell arrays require:
  • 8 bytes per cell, whether used or not
  • plus 104 bytes per non-empty cell, which includes the size and type information for the cell
  • plus the storage for the obvious data of the cell. For character strings, that is 2 bytes per character.
For a fully occupied cell array, that is 112 bytes per cell plus the obvious data of the cell. (And you have to add to that, whatever storage is used to represent the size and type information of the variable that is the cell array header.)
If you were to use a blank-padded rectangular array, that would be 2 bytes per character, times the number of rows, times the number of columns, plus whatever storage is used to represent the size and type information of the variable (probably the same cost as a cell array header). You would be wasting some of those columns on the blank padding.
You have not indicated anything about the minimum, maximum, and typical row size. If the occupancy were uniform random (unlikely), then on average half of the columns would be unused; in that situation, if the fixed width were at least twice 112 bytes, which you would get with a width of 112 characters, then the average waste would be the same as the cell overhead. However, uniform random is not really typical. More typical is either that there is not much variation in sizes (e.g., the variation is just between 3 and 5 fields), or that most of the data is relatively short but a small fraction of it is really large (power law), in which case allocating as if everything could be the longest could waste a lot.
In terms of timing, access into a rectangular array is faster, but it is not all that different for a single level of cell nesting.
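One way to check the storage figures above on a particular MATLAB version is to ask whos for the byte counts. This is a minimal sketch with made-up names; the exact per-cell overhead may differ between releases.

```matlab
names = {'a'; 'bb'; 'a_considerably_longer_name'};   % hypothetical data
padded = char(names);    % rows blank-padded to the longest name

infoCell = whos('names');
infoChar = whos('padded');
fprintf('cell array: %d bytes\n', infoCell.bytes);
fprintf('char array: %d bytes (%d rows x %d columns)\n', ...
    infoChar.bytes, size(padded, 1), size(padded, 2));
```

With a few short names and one long one, the padded array wastes most of its columns, which is the trade-off described above.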
The final line of strsplit after all the preprocessing is
% Split.
[c, matches] = regexp(str, aDelim, 'split', 'match');
so I guess it stands to reason it's going to be slower... :)
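The regexp 'split' versus strsplit comparison can be reproduced with a rough timing loop like the one below. The test string and repetition count are arbitrary choices, and the timings will vary by machine and release.

```matlab
str = sprintf('%d,', 1:1000);   % hypothetical comma-separated test string
str(end) = [];                  % drop the trailing comma

% Both calls should produce the same 1-by-1000 cell array.
a = regexp(str, ',', 'split');
b = strsplit(str, ',');

nRep = 200;
t0 = tic;
for k = 1:nRep
    a = regexp(str, ',', 'split');
end
tRegexp = toc(t0) / nRep;

t0 = tic;
for k = 1:nRep
    b = strsplit(str, ',');
end
tStrsplit = toc(t0) / nRep;

fprintf('regexp split: %.6f s, strsplit: %.6f s\n', tRegexp, tStrsplit);
```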
"more efficient way to store strings of different lengths" I guess there is no one-size-fits-all answer.
  • "efficient" regarding memory use and computational speed may conflict.
  • The number of strings to store
  • The variation in length of the strings as Walter pointed out.
  • Which operations will be done on the set of strings.
  • Whether or not strictly "write-once-read-many"
  • Does the cost of making the program/code count?
  • And more ... .
Regarding character arrays: "'first','second','third'" should be stored as
fst
ieh
rci
sor
tnd
 d
since Matlab is column major. This is tricky to read when debugging.
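A minimal sketch of that column-wise layout, using the same three strings: char pads the strings to equal length, and the transpose puts one string in each column.

```matlab
c = char({'first', 'second', 'third'}).';   % 6-by-3, one string per column
disp(c);

% Read a string back: take its column, turn it into a row, drop the padding.
second = deblank(c(:, 2).');
third  = deblank(c(:, 3).');
```

Each string then occupies a contiguous block of memory, at the cost of the transposed display shown above.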
I recently had a problem:
  • a fraction of a million valid Matlab variable names. Most names are short, but some are long. (No, I don't use them in expressions with EVAL.)
  • a search typically returns a dozen names
Solution:
  • store all names in one row separated by char(31), huge_str. char(31) is displayed as space by editors.
  • store the positions of char(31) to avoid repeated use of strfind(huge_str)
  • use STRFIND and REGEXP in searches
My resulting code is fast and memory efficient, but it did require some debugging.
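The storage scheme just described might look something like the sketch below. It is not the code from the comment: the names, the variable sep_pos, and the lookup loop are assumptions for illustration; only huge_str, char(31), and strfind come from the description.

```matlab
names = {'alpha', 'beta_long_name', 'gamma', 'alphabet'};   % hypothetical
sep = char(31);

% One long row, with a separator before, between, and after the names.
huge_str = [sep, strjoin(names, sep), sep];

% Store the separator positions once instead of searching for them again.
sep_pos = find(huge_str == sep);

% Search for a substring, then map each hit back to the enclosing name.
hits = strfind(huge_str, 'alpha');
found = cell(size(hits));
for k = 1:numel(hits)
    i0 = sep_pos(find(sep_pos < hits(k), 1, 'last'));
    i1 = sep_pos(find(sep_pos > hits(k), 1, 'first'));
    found{k} = huge_str(i0+1 : i1-1);
end
disp(found);
```

A name that contains the search text more than once would appear more than once in found; applying unique to the result is one way to handle that.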
Is this an undocumented use of char(31) that might not survive the next MATLAB release? I don't think the use of char(31) is mentioned in the MATLAB documentation.

Asked: 25 Feb 2017
Edited: 3 Mar 2017
