Read big file with mixed data types with datastore

Question

Sy Dat el 25 de Nov. de 2024

0
Enlazar

Enlace directo a esta pregunta

https://es.mathworks.com/matlabcentral/answers/2168548-read-big-file-with-mixed-data-types-with-datastore

Comentada: Walter Roberson el 19 de Dic. de 2024

test.txt

I've got a file which is 300 GB big. A piece of it can be found in the attached file. I've read that the best way to handle this kind of files is to read them into a datastore.

As you can see, the first two lines are characters, while the following lines are a combination of floats and integers. Is it possible to read them predefined? I know from fscanf that you can specify the data type, but when I do datastore it interprets every line as a string.

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Iniciar sesión para comentar.

Iniciar sesión para responder a esta pregunta.

Answer 1

Stephen23 el 25 de Nov. de 2024

0
Enlazar

Enlace directo a esta respuesta

https://es.mathworks.com/matlabcentral/answers/2168548-read-big-file-with-mixed-data-types-with-datastore#answer_1549508

Abrir en MATLAB Online

test.txt

ds = datastore('./*.txt', 'Type','tabulartext', 'NumHeaderLines',2, 'TextscanFormats',repmat("%f",1,5));
T = preview(ds)
T = 8x5 table
    Var1     Var2       Var3       Var4       Var5 
    ____    _______    _______    _______    ______

    192           0          0          0       NaN
    108     0.21721          0          0       NaN
    108           0    0.21721          0       NaN
    108           0          0    0.21721       NaN
      8           0      17.09     2.3461    1.2766
      8           0     21.968     21.103    17.839
      8           0     14.849     17.511    11.303
      8           0     22.723     23.318    13.066

7 comentarios
Mostrar 5 comentarios más antiguosOcultar 5 comentarios más antiguos

Sy Dat el 18 de Dic. de 2024

Ah okay, so fopen is able to read a 300 GB file? So did I understood the documentation wrong? Is there a case where you would actually use datastore over fopen?

Walter Roberson el 19 de Dic. de 2024

I do not know what documentation you are referring to?

The documentation for fopen() says "If you do not specify an encoding scheme when opening a file for reading, fopen uses auto character-set detection to determine the encoding." . Details about the auto detection are left unspecified, so hypothetically it might have to scan through the entire file (just in case somewhere in the file there are some utf8 sequences.) But no auto-detection is done if you specify a text encoding.

datastore is good for processing lots of line-oriented data, as datastore can automatically break line-oriented files up into pieces for processing chunks. But the processing would have to be such that it made sense to do the task in chunks -- for example if the processing required calculating the standard deviation of the first column of data, then all of the data would have to be read in first.

Iniciar sesión para comentar.

Read big file with mixed data types with datastore

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Respuestas (1)

7 comentarios
Mostrar 5 comentarios más antiguosOcultar 5 comentarios más antiguos

Ver también

Categorías

Etiquetas

Productos

Community Treasure Hunt

Read big file with mixed data types with datastore

0 comentarios Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Respuestas (1)

7 comentarios Mostrar 5 comentarios más antiguosOcultar 5 comentarios más antiguos

Ver también

Categorías

Etiquetas

Productos

Community Treasure Hunt

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

7 comentarios
Mostrar 5 comentarios más antiguosOcultar 5 comentarios más antiguos