Importing a large database using dataset

Question

Sebastiano delre el 14 de Mayo de 2016

0
Enlazar

Enlace directo a esta pregunta

https://es.mathworks.com/matlabcentral/answers/284180-importing-a-large-database-using-dataset

Comentada: Sebastiano delre el 18 de Mayo de 2016

I have a very large database (3GB) in a CSV format. I have a RAM of 8GB. Do you think I can import it using dataset? How long should it take more or less? Thank you for your help.

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Iniciar sesión para comentar.

Iniciar sesión para responder a esta pregunta.

Answer 1

Walter Roberson el 14 de Mayo de 2016

0
Enlazar

Enlace directo a esta respuesta

https://es.mathworks.com/matlabcentral/answers/284180-importing-a-large-database-using-dataset#answer_222110

Use textscan() with a repeat count to read a block of it and save the block in binary format, perhaps using matFile; repeat until end of file. This limits the amount of memory you need, and you can update a display to show progress. Afterwards close the csv file, and read in the binary version.

3 comentarios
Mostrar 1 comentario más antiguoOcultar 1 comentario más antiguo

Walter Roberson el 15 de Mayo de 2016

Abrir en MATLAB Online

mat = matfile('YourMat.mat', 'Writable', true);
fid = fopen('YourFile.csv', 'r');
fmt = '%s%f%f';
field_is_numeric = [false, true, true];
field_names = {'Ryeleno', 'Tubamisno', 'atomic_weight'};
numeric_field_nums = find(field_is_numeric);
non_numeric_field_nums = find(~field_is_numeric);
blocksize = 2000;
basepos = 0;
while true
  datacell = textscan(fid, fmt, blocksize, 'Delimiter', ',');
  numread = size(datacell{1},1);
  if numread == 0   %we hit EOF or something incompatible, nothing read
    break
  end
  for fieldno = numeric_field_nums
     thisfield = field_names{fieldno};
     mat.(thisfield)(basepos+1:basepos+numread,1) = datacell{fieldno);
  end
  if ~isempty(non_numeric_field_nums)
    this_data = struct();
    for fieldno = non_numeric_field_nums
      thisfield = field_names{fieldno};
      this_data.(thisfield) = datacell{fieldno};
    end
    thisfield = sprintf('from_%07d_to_%07d', basepos+1, basepos+numread);
    mat.(thisfield) = this_data;
  end
  basepos = basepos + numread;
end
fclose(fid);

The result of this will be a .mat file that will have one variable named after every numeric field, and that variable will contain all of the data for that field. The .mat file will also have a series of variables named "from_" followed by a numeric row number followed by "_to_" followed by a numeric row number. Each of those variables will be a structure, with fields named after each of the non-numeric columns.

This use of variables numbered by row is needed because matFile() cannot index into cell arrays (or non-numeric arrays) to write data at the end of the array. You indicated a different posting that some of the columns contain strings. There is no fixed width for a string, which makes it more difficult to write strings to binary files (but there are ways of doing it.)

This kind of arrangement can read files which are too large to fit into memory; it can also handle cases in which the data for the string columns are themselves too long to fit into memory. But it is not necessarily the most convenient for working with the non-numeric data afterwards. You could be more efficient if you could be sure that the non-numeric data would fit into memory.

Of course, you could simply try reading your entire .csv file in one block using textscan(), and see how well it does.

Sebastiano delre el 18 de Mayo de 2016

Thank you very much, this is very useful. Best, Sebastiano.

Iniciar sesión para comentar.

Importing a large database using dataset

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Respuesta aceptada

3 comentarios
Mostrar 1 comentario más antiguoOcultar 1 comentario más antiguo

Más respuestas (0)

Ver también

Categorías

Etiquetas

Community Treasure Hunt

Importing a large database using dataset

0 comentarios Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Respuesta aceptada

3 comentarios Mostrar 1 comentario más antiguoOcultar 1 comentario más antiguo

Más respuestas (0)

Ver también

Categorías

Etiquetas

Community Treasure Hunt

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

3 comentarios
Mostrar 1 comentario más antiguoOcultar 1 comentario más antiguo