Read big file with mixed data types with datastore

Sy Dat Ho on 25 Nov 2024 at 13:30
Commented: Walter Roberson on 29 Nov 2024 at 20:44
I have a 300 GB file; a small piece of it is attached. I've read that the best way to handle files like this is to read them through a datastore.
As you can see, the first two lines are text, while the following lines are a mix of floats and integers. Is it possible to read them with predefined data types? I know that fscanf lets you specify the data type, but when I use datastore it interprets every line as a string.

Answers (1)

Stephen23 on 25 Nov 2024 at 14:05
ds = datastore('./*.txt', 'Type','tabulartext', 'NumHeaderLines',2, 'TextscanFormats',repmat("%f",1,5));
T = preview(ds)
T = 8x5 table
    Var1     Var2       Var3       Var4       Var5
    ____    _______    _______    _______    ______
     192          0          0          0       NaN
     108    0.21721          0          0       NaN
     108          0    0.21721          0       NaN
     108          0          0    0.21721       NaN
       8          0      17.09     2.3461    1.2766
       8          0     21.968     21.103    17.839
       8          0     14.849     17.511    11.303
       8          0     22.723     23.318    13.066
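preview only returns the first few rows. To work through the whole file without holding it in memory, one option is to read the datastore chunk by chunk. The following is a minimal sketch, not part of the original answer: the temporary sample file, the 'Delimiter' and 'ReadVariableNames' settings, the ReadSize value, and the running row count and column sums are all assumptions chosen for illustration.

```matlab
% Hypothetical small sample standing in for the real 300 GB file:
% two text header lines, then numeric rows (values copied from the preview).
fname = [tempname, '.txt'];
fid = fopen(fname, 'wt');
fprintf(fid, 'header line 1\nheader line 2\n');
fprintf(fid, '8 0 17.09 2.3461 1.2766\n');
fprintf(fid, '8 0 21.968 21.103 17.839\n');
fprintf(fid, '8 0 14.849 17.511 11.303\n');
fclose(fid);

% Same idea as the call above; the delimiter and ReadVariableNames settings
% are assumptions that match this sample file only.
ds = datastore(fname, 'Type','tabulartext', 'NumHeaderLines',2, ...
    'Delimiter',' ', 'ReadVariableNames',false, ...
    'TextscanFormats',repmat("%f",1,5));
ds.ReadSize = 2;   % rows per chunk; assumed value, tune it to the available RAM

nRows  = 0;            % running totals stand in for the real processing
colSum = zeros(1,5);
while hasdata(ds)
    T = read(ds);                                  % next chunk as a table
    nRows  = nRows + height(T);
    colSum = colSum + sum(T{:,:}, 1, 'omitnan');   % ignore NaN padding
end

delete(fname);   % remove the temporary sample file
```

Only one chunk of ReadSize rows is in memory at a time, so peak memory use depends on the chunk size rather than on the file size.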
5 Comments
Stephen23 on 29 Nov 2024 at 17:14
Edited: Stephen23 on 29 Nov 2024 at 17:29
FOPEN does not read a file into RAM.
Of course the details are likely more nuanced than that: possibly a small part of the file is loaded and other parts are held in virtual memory. But in any case, I doubt that there is any implementation of FOPEN in any language that loads the entire file when FOPEN is called. That would be a terrible way to implement FOPEN.
Walter Roberson on 29 Nov 2024 at 20:44
"I can't use fopen because my RAM is smaller than the file."
Replace
fid = fopen('test.txt','rt');
with
fid = fopen('test.txt','rt','n','US-ASCII');
The fact that you supplied the text encoding will keep the first fgetl() from scanning through the file trying to guess the file encoding. It will just leave the file positioned at the beginning, ready to read piece by piece. It will not need to buffer the file in memory.
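Reading piece by piece could look roughly like the sketch below. It is an illustration rather than the commenter's own code: the temporary sample file, the variable names, and the per-line bookkeeping are assumptions, and sscanf is used here because it accepts rows with differing numbers of fields.

```matlab
% Hypothetical small sample standing in for the real file:
% two text header lines, then numeric rows with 4 or 5 fields.
fname = [tempname, '.txt'];
fid = fopen(fname, 'wt');
fprintf(fid, 'header line 1\nheader line 2\n');
fprintf(fid, '108 0.21721 0 0\n');
fprintf(fid, '8 0 17.09 2.3461 1.2766\n');
fclose(fid);

% Open with an explicit encoding, as suggested above.
fid = fopen(fname, 'rt', 'n', 'US-ASCII');
assert(fid ~= -1, 'Could not open %s', fname);

header1 = fgetl(fid);   % the two text lines are kept as character vectors
header2 = fgetl(fid);

nRows   = 0;            % simple bookkeeping stands in for real processing
maxCols = 0;
tline = fgetl(fid);
while ischar(tline)                   % fgetl returns -1 at end of file
    vals = sscanf(tline, '%f').';     % numeric row vector for this line
    nRows   = nRows + 1;
    maxCols = max(maxCols, numel(vals));
    tline = fgetl(fid);
end

fclose(fid);
delete(fname);   % remove the temporary sample file
```

Only the current line is held in memory here. For a file of this size, reading line by line is likely to be slow, so reading larger blocks per call may be worth considering.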
