Reading and processing data from text file to matlab variable quickly

Question

0 votes

sample_input.txt

I use the following code to read data from a text file and process it into two cell arrays. It works, but can it be done faster? Although I currently need the cell array format for the downstream code that uses the data, I am also open to considering other data types if they help read the text file more quickly.

adjlist = regexp(fileread('sample_input.txt'), '\r\n', 'split');
adjlist(cellfun('isempty', adjlist)) = [];
nodes =   regexp(adjlist, '\w*(?= )',      'match');
nodes =   cell2mat(nodes);
edges =   regexp(adjlist, '(?<=( |,))\w*', 'match');

2 comments

dpb on 25 Feb 2017

The time overhead is likely not in the file reading but in the regexp processing afterwards; regexp is notorious for being slow. You're just reading the file and splitting it into a cellstr array, so I suspect the read itself is not the issue.

Try breaking out the fileread from the surrounding regexp and profile the result; I'll be quite surprised if the above supposition doesn't turn out to be true.
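A minimal sketch of that kind of breakdown, using tic/toc around each step. The file contents written here (lines such as "1 -> 3,5,9") are only an assumption about the layout of sample_input.txt, chosen so the snippet runs on its own; the timer names tRead, tSplit, tNodes and tEdges are illustrative.

```matlab
% Write a tiny stand-in file (assumed layout: "node -> edge,edge,...").
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '1 -> 3,5,9\r\n2 -> 4\r\n3 -> 1,2\r\n');
fclose(fid);

% Time each step on its own instead of timing the nested one-liner.
t = tic;
txt = fileread(fname);                        % raw file contents as char
tRead = toc(t);

t = tic;
adjlist = regexp(txt, '\r\n', 'split');       % one cell per line
adjlist(cellfun('isempty', adjlist)) = [];    % drop the empty tail
tSplit = toc(t);

t = tic;
nodes = regexp(adjlist, '\w*(?= )', 'match');
nodes = cat(1, nodes{:});
tNodes = toc(t);

t = tic;
edges = regexp(adjlist, '(?<=( |,))\w*', 'match');
tEdges = toc(t);

fprintf('fileread %.6f s, split %.6f s, nodes %.6f s, edges %.6f s\n', ...
        tRead, tSplit, tNodes, tEdges);
delete(fname);
```

On a file this small the numbers are dominated by noise, so the same breakdown is better run on the real input, or under the MATLAB profiler.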

Paolo Binetti on 26 Feb 2017

You are right, the bottleneck is the three regexp instructions. I have reworded my question slightly; I hope it is clearer. Or do you suggest recasting the problem purely in terms of regexp?


Answer 1

per isakson on 26 Feb 2017

Edited: per isakson on 26 Feb 2017

1 vote

"Reading and processing data from text file to matlab variable quickly": the short answer is that using textscan to read the file and do most of the parsing is faster, and it gives cleaner code.

It's a bit tricky to measure the speed of reading small files, since the file will be available in the system cache after the first test. However, it's safe to claim that in this case textscan is faster.

Run this

>> [nodes,edges,cac] = cssm();
Elapsed time is 0.054037 seconds.
Elapsed time is 0.009937 seconds.
>> cac(:)
ans = 
    {3001x1 cell}
    {3001x1 cell}

where

function    [nodes,edges,cac] = cssm()
% Approach 1: fileread + regexp (the code from the question)
tic
adjlist = regexp(fileread('sample_input.txt'), '\r\n', 'split');
adjlist(cellfun('isempty', adjlist)) = [];
nodes =   regexp( adjlist, '\w*(?= )', 'match' );
% nodes =   cell2mat(nodes);
% Error using cell2mat (line 52)
% CELL2MAT does not support cell arrays containing cell arrays or objects.
nodes =   cat( 1, nodes{:} );
edges =   regexp(adjlist, '(?<=( |,))\w*', 'match');
toc
% Approach 2: textscan reads the node, skips the second field and
% keeps the rest of the line as one string
tic
fid = fopen( 'sample_input.txt' );
cac = textscan( fid, '%s%*s%[^\r\n]', 'Delimiter',' ' );
[~] = fclose( fid );
toc
end

A fairer comparison:

>> [nodes,edges,n2,e2] = cssm();
Elapsed time is 0.047859 seconds.
Elapsed time is 0.014726 seconds.
>> edges{1}
ans = 
    '3'    '5'    '9'
>> e2{1}
ans = 
    '3'    '5'    '9'

where three lines are added to produce the data in the same format

function    [nodes,edges,n2,e2] = cssm()
% Approach 1: fileread + regexp (the code from the question)
tic
adjlist = regexp(fileread('sample_input.txt'), '\r\n', 'split');
adjlist(cellfun('isempty', adjlist)) = [];
nodes =   regexp( adjlist, '\w*(?= )', 'match' );
% nodes =   cell2mat(nodes);
% Error using cell2mat (line 52)
% CELL2MAT does not support cell arrays containing cell arrays or objects.
nodes =   cat( 1, nodes{:} );
edges =   regexp(adjlist, '(?<=( |,))\w*', 'match');
toc
% Approach 2: textscan, then split the edge strings at the commas
tic
fid = fopen( 'sample_input.txt' );
cac = textscan( fid, '%s%*s%[^\r\n]', 'Delimiter',' ' );
[~] = fclose( fid );
n2  = cac{1};                          % new
e2  = regexp( cac{2}, ',', 'split' );  % new 
e2  = reshape( e2, 1,[] );             % new
toc
end

7 comments (the 5 oldest are not shown)

Walter Roberson on 1 Mar 2017

Cell arrays require:

8 bytes per cell, whether used or not
plus 104 bytes per non-empty cell, which includes the size and type information for the cell
plus the storage for the obvious data of the cell. For character strings, that is 2 bytes per character.

For a fully occupied cell array, that is 112 bytes per cell plus the obvious data of the cell. (And you have to add to that, whatever storage is used to represent the size and type information of the variable that is the cell array header.)

If you were to use a blank-padded rectangular region, then that would be 2 bytes per character, times the number of rows, times the number of columns; to which you would add whatever storage is used to represent the size and type information of the variable (probably the same cost as a cell array header). You would be wasting some of those columns with the blank padding.

You have not indicated anything about the minimum, maximum and typical row size. If the occupancy were uniform random (unlikely), then on average half of the columns would be unused; in that situation, if the fixed width were at least twice 112 bytes, which you would get with rows 112 characters wide, then the average waste would be the same as the cell overhead. However, uniform random is not really typical. More typical is that either there is not much variation in sizes (e.g., if the variation were just between 3 and 5 fields), or else that most of the data is relatively short but a small fraction of it is really large (power law), in which case if you allocate as if everything could be the longest then you could waste a lot.

In terms of timing, access into a rectangular array is faster, but it is not all that different for a single level of cell nesting.
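A small sketch of how the two layouts can be compared with whos. The strings are made up, and the per-cell overhead reported for the cell array may differ between MATLAB releases, so no particular byte count for it should be read into this; only the 2 bytes per character of the char matrix is relied on.

```matlab
% Same strings stored two ways (made-up data for illustration).
words  = {'a', 'bb', 'a_much_longer_string_than_the_others'};
padded = char(words{:});    % char matrix, rows blank-padded to the longest

infoCell = whos('words');   % struct with a .bytes field
infoChar = whos('padded');

fprintf('cell array : %d bytes for %d characters\n', ...
        infoCell.bytes, sum(cellfun(@numel, words)));
fprintf('char matrix: %d bytes for %d characters (incl. padding)\n', ...
        infoChar.bytes, numel(padded));
```

With one long string among short ones, most of the char matrix is padding, which is the power-law case described above.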

dpb on 1 Mar 2017

The final line of strsplit after all the preprocessing is

% Split.
[c, matches] = regexp(str, aDelim, 'split', 'match');

so guess it stands to reason it's going to be slower... :)
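As a rough illustration, the two calls can be compared directly on a short string. The input '3,5,9' and the loop count are arbitrary choices, and the timings will vary by machine and release, so this is only a sketch of how to check the claim.

```matlab
str = '3,5,9';                          % made-up edge list
viaRegexp   = regexp(str, ',', 'split');
viaStrsplit = strsplit(str, ',');
same = isequal(viaRegexp, viaStrsplit); % both give {'3','5','9'}

n = 1e4;                                % arbitrary repetition count
t = tic;
for k = 1:n
    c1 = regexp(str, ',', 'split');
end
tRegexp = toc(t);

t = tic;
for k = 1:n
    c2 = strsplit(str, ',');
end
tStrsplit = toc(t);

fprintf('regexp split: %.4f s, strsplit: %.4f s\n', tRegexp, tStrsplit);
```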

per isakson on 2 Mar 2017

Edited: per isakson on 3 Mar 2017

"more efficient way to store strings of different lengths": I guess that there is no one-size-fits-all.

"efficient" regarding memory use and computational speed may conflict.
The number of strings to store
The variation in length of the strings as Walter pointed out.
Which operations will be done on the set of strings.
Whether or not strictly "write-once-read-many"
Does the cost of making the program/code count?
And more ... .

Regarding character arrays: "'first','second','third'" should be stored as

fst
ieh
rci
sor
tnd
 d

since Matlab is column major. This is tricky to read when debugging.
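A short sketch of that layout, assuming the strings start out in a cell array. The variable names byRow and byCol are illustrative only.

```matlab
names = {'first', 'second', 'third'};
byRow = char(names{:});    % 3-by-6, one blank-padded string per row
byCol = byRow.';           % 6-by-3, one string per column

% Each string now occupies consecutive elements in memory,
% because MATLAB stores arrays column by column.
flat   = byCol(:).';                 % 'first secondthird '
second = strtrim(byCol(:, 2).');     % 'second'

disp(byCol)                          % shows the hard-to-read column layout
```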

I recently had a problem:

a fraction of a million valid Matlab variable names. Most names are short, but some are long. (No, I don't use them in expressions with EVAL.)
searches typically return a dozen names

Solution:

store all names in one row separated by char(31), huge_str. char(31) is displayed as space by editors.
store the positions of char(31) to avoid repeated use of strfind(huge_str)
use STRFIND and REGEXP in searches

My resulting code is fast and memory efficient, but it did require some debugging.
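A minimal sketch of this idea on a handful of made-up names. Putting a separator before the first name and after the last one is an assumption made here to simplify the index arithmetic; it is not necessarily how the original code does it.

```matlab
names = {'alpha', 'beta_long_name', 'alphabet', 'gamma'};  % made-up data
sep   = char(31);

% One long row: sep name sep name ... sep
huge_str = [sep, sprintf(['%s', sep], names{:})];
pos      = strfind(huge_str, sep);    % computed once and stored

% Substring search: every name that contains 'alpha'.
hit = strfind(huge_str, 'alpha');
% Map each hit to its name: index of the last separator before the hit.
idx   = unique(arrayfun(@(h) find(pos < h, 1, 'last'), hit));
found = arrayfun(@(k) huge_str(pos(k)+1 : pos(k+1)-1), idx, ...
                 'UniformOutput', false);

% Exact-name lookup: include the separators in the search pattern.
isPresent = ~isempty(strfind(huge_str, [sep, 'gamma', sep]));
```

Because pos is stored, extracting the k-th name is a plain index operation and strfind over the separators is not repeated for every search.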

Is this an undocumented use of char(31) that might not survive the next Matlab release? I don't think the use of char(31) is mentioned in the Matlab documentation.
