Pre-processing tall array / datastore data

Question

Please pardon me if this has been asked before.

What is the most efficient way to pre-process a large wide table (about 10,000,000 rows by 500 columns) that is currently divided into many small tables in separate mat-files? The data may be quite sparse in places and is mostly numeric, with a datetime column, some categorical fields, and some text fields.

For example, should I:

stack() the small tables but leave them in separate files
combine them into a giant wide table
combine them into a stacked very tall table
delete many NaNs significantly reducing the height of the stacked table
use sparse() on a wide table

Those are just some thoughts. Please let me know the best way.

Thank You,

Michael

2 Comments

Guillaume on 25 Aug 2019

Edited: Guillaume on 25 Aug 2019

I'm not really clear on your question. Pre-process in order to achieve what?

Note that stack makes a table narrower (fewer variables) but a lot taller. I'm not sure that's what you mean by stacking. Perhaps you mean vertically concatenate, in which case the datastore takes care of that for you.

Also note that sparse is not a function (or a concept) that applies to tables.

I would think that if you use a datastore with tall tables, there's nothing to do. Just use the tables as is (as one big tall table backed by the datastore).
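
A minimal sketch of what "one big tall table backed by the datastore" can look like, assuming the data is available as CSV files with identical variables. The file names part1.csv and part2.csv and the variables ID and Label are made up for illustration; the mat-files in the question would need a different datastore type.

```matlab
% Hypothetical example: two small CSV files with the same variables.
folder = tempname;
mkdir(folder);
T1 = table([1; 2], {'a'; 'b'}, 'VariableNames', {'ID', 'Label'});
T2 = table([3; 4], {'c'; 'd'}, 'VariableNames', {'ID', 'Label'});
writetable(T1, fullfile(folder, 'part1.csv'));
writetable(T2, fullfile(folder, 'part2.csv'));

% One datastore over all matching files; no manual concatenation is needed.
ds = datastore(fullfile(folder, 'part*.csv'));
tt = tall(ds);                 % one tall table spanning both files

% Operations on tt are deferred until gather is called.
nRows = gather(height(tt));    % total number of rows across the files
maxID = gather(max(tt.ID));    % largest ID across the files
```

The point of the sketch is that the files are never concatenated by hand: the datastore presents them as a single table.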

Guillaume on 26 Aug 2019

Michael's comment mistakenly posted as an answer moved here:

Dear Guillaume,

Thank you for responding.

1) I haven't used sparse in a long time and didn't realize that sparse doesn't apply to tables. I guess that's a very good reason to use a tall table. I was worried that stacked tables would be slow relative to wide tables, but perhaps the sparse nature of the stacked table would offset that. Would you know about the relative speed of a stacked vs. wide table for SQL-type lookups?

2) Regarding datastore/tall/mapreduce, I am just starting to read about them, and like most MATLAB docs, they're a little light on examples. Do you know how I can write to a datastore to build a file from scratch rather than just pointing the datastore to existing files?

3) Also, I am unclear on the value of having one datastore table vs. many. I currently have tens of thousands of little mat-files with timetables in them. Is there a benefit to combining them and is there a preferred format that is faster than others, e.g. mat vs. CSV?

Thanks again for your help,

Michael

PS, Yes, what I mean by stacked is from wide or unstacked:

Date   Var1    Var2    Var3
Jan    xxxx    xxxx    xxxx
Feb    xxxx    xxxx    xxxx

vs. stacked:

Date   Field    Value
Jan    Var1     xxxx
Jan    Var2     xxxx
Jan    Var3     xxxx
Feb    Var1     xxxx
Feb    Var2     xxxx
Feb    Var3     xxxx
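
The two layouts above can be converted into each other with stack and unstack. The following is only a sketch with made-up values (NaN stands for missing data); the names Field and Value are chosen to match the illustration, and the dates are arbitrary.

```matlab
% Small wide table with made-up values; NaN marks missing data.
Date = datetime(2019, [1; 2], 1);
wide = table(Date, [1; NaN], [2; 5], [NaN; 6], ...
    'VariableNames', {'Date', 'Var1', 'Var2', 'Var3'});

% Wide -> stacked: one row per (Date, Field) pair.
stacked = stack(wide, {'Var1', 'Var2', 'Var3'}, ...
    'IndexVariableName', 'Field', 'NewDataVariableName', 'Value');

% Drop rows whose value is missing; this is what shortens a sparse stacked table.
compact = stacked(~isnan(stacked.Value), :);

% Stacked -> wide again; combinations that are absent are filled with NaN.
wideAgain = unstack(compact, 'Value', 'Field');
```

Whether the shorter stacked table is actually faster to query than the wide one depends on the operation and is not something this sketch measures.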

Answer 1

Guillaume on 26 Aug 2019

"I was worried that stacked tables would be slow relative to wide tables"

Slow for what type of operation? I would think that some things are better suited to wide tables, others to stacked ones. If you are going to be using myfun(mytable.Var1, mytable.Var2), then stacking Var1 and Var2 may not be a good idea. In addition, in the context of tall arrays, fewer rows may be better.

"Do you know how I can write to a datastore to build a file from scratch rather than just pointing the datastore to existing files?"

Datastores are only for reading. If you have a tall table, you can write it directly to a single text file with writetable. If you want to split it into several text files, simply write chunks of rows in a loop with writetable.

"Also, I am unclear on the value of having one datastore table vs. many"

You have just one datastore that is backed by as many files as you want (all files must have the same format and variables, of course). The datastore manages accessing the data from the files as required, and you access the data using a single tall table (or array). There isn't an option to get several tables out of one datastore.
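
As an illustration of writing chunks of rows in a loop and then reading them back through a single datastore, here is a minimal sketch for an ordinary in-memory table. The chunk size, the file name pattern chunk%03d.csv, and the variables ID and Value are assumptions made for the example.

```matlab
% Hypothetical in-memory table to be split across several CSV files.
T = table((1:10)', rand(10, 1), 'VariableNames', {'ID', 'Value'});

outDir = tempname;
mkdir(outDir);
chunkRows = 4;                              % rows per output file (arbitrary)
nChunks = ceil(height(T) / chunkRows);

for k = 1:nChunks
    firstRow = (k - 1) * chunkRows + 1;
    lastRow = min(k * chunkRows, height(T));
    % Zero-padded numbers keep the files in order when listed by name.
    fileName = fullfile(outDir, sprintf('chunk%03d.csv', k));
    writetable(T(firstRow:lastRow, :), fileName);
end

% A single datastore covers every chunk file.
ds = datastore(fullfile(outDir, 'chunk*.csv'));
back = readall(ds);                         % fine here because the data is tiny
```

For data that does not fit in memory, tall(ds) would take the place of readall(ds).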

Answer 2

Michael on 26 Aug 2019

Dear Guillaume,

Certainly wide tables are simpler for my purpose (generating tables for machine learning inputs), but due to the sparseness and size of the data that I'm selecting from, tall arrays may be more efficient. I'm not sure.

Thank you for pointing out that datastore is read-only. It looked that way and I was quite frustrated because I could not confirm it. Likewise, writetable doesn't have an append feature which is disappointing. I've been using writetable to make CSVs and then combining them by piping the output of DOS copy or type commands.

I will perform some experiments and write back regarding:

The relative speed of a large CSV datastore vs. many small ones
Access speed of tall vs. wide storage
The speed of the 32-bit KDB+ (Q) solution via the Datafeed Toolbox.

Thanks,

Michael

2 Comments

Walter Roberson on 26 Aug 2019

dlmwrite() has append mode -- but it is only for numeric values.

Michael on 27 Aug 2019

Dear Mr. Roberson,

Thank you. That may be helpful in the future. In this case, unfortunately, I have mostly text and categorical data.

It would be great if MathWorks added some basic I/O features, like appending with writetable and writing to a datastore.

It's a bit of a mission to write a whole flexible routine to append a table with many data types using fprintf.

Thanks Again,

Michael
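
For readers on newer releases: to my knowledge, writetable gained a 'WriteMode' option after this thread was written (R2020a), which allows appending to a text file. The sketch below assumes such a release; the variable names TICKER and VALUE are invented for the example, and the rows are taken from the sample data shown later in the thread.

```matlab
% Two hypothetical batches with mixed types (text, categorical, numeric).
A = table({'BHP AT EQUITY'; 'BHP AT EQUITY'}, ...
    categorical({'MOV_AVG_50D'; 'MOV_AVG_50D'}), [0.8979; 0.9029], ...
    'VariableNames', {'TICKER', 'FIELD', 'VALUE'});
B = table({'BHP AT EQUITY'; 'BHP AT EQUITY'}, ...
    categorical({'MOV_AVG_50D'; 'MOV_AVG_50D'}), [0.9106; 0.9154], ...
    'VariableNames', {'TICKER', 'FIELD', 'VALUE'});

outFile = [tempname '.csv'];

% First batch creates the file, including the header line.
writetable(A, outFile);

% Later batches are appended without repeating the header.
% Assumption: a release where writetable supports 'WriteMode'.
writetable(B, outFile, 'WriteMode', 'append', 'WriteVariableNames', false);

combined = readtable(outFile, 'ReadVariableNames', true);
```

On older releases without this option, the fprintf route or concatenating files outside MATLAB, as described above, remain the alternatives.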

Answer 3

Michael on 26 Aug 2019

Hello,

I'm trying to evaluate a datastore of about 20,000 CSV files with about 36 GB of data that I saved from MATLAB using writetable. One column contains datetimes, and an example of the files' contents is this:

29-Jul-1983 00:00:00,BHP AT EQUITY,MOV_AVG_50D,0.8979
31-Aug-1983 00:00:00,BHP AT EQUITY,MOV_AVG_50D,0.9029
30-Sep-1983 00:00:00,BHP AT EQUITY,MOV_AVG_50D,0.9106
31-Oct-1983 00:00:00,BHP AT EQUITY,MOV_AVG_50D,0.9154
30-Nov-1983 00:00:00,BHP AT EQUITY,MOV_AVG_50D,0.9227
30-Dec-1983 00:00:00,BHP AT EQUITY,MOV_AVG_50D,0.9311

I tried the following code and received the subsequent error. Can someone enlighten me on how to get this to work?

Thank You,

Michael

PS, Code:

ds = datastore('tall*.csv');
tds = tall(ds);
u = unique(tds.FIELD);
U = gather(u);

PPS, Error:

Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: 0% complete
Evaluation 0% complete
Error using matlab.io.datastore.TabularTextDatastore/readData (line 77)
Unable to read the DATETIME data using the locale setting for your system: 'en_US'
If the data contains month or day names in a language foreign to this locale, use the 'DateLocale' parameter to specify the correct locale.
Learn more about errors encountered during GATHER.
Error in matlab.io.datastore.TabularDatastore/read (line 120)
            [t, info] = readData(ds);
Error in tall/gather (line 50)
[varargout{:}, readFailureSummary] = iGather(varargin{:});
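
One possible workaround, sketched here without knowing the actual cause of the error, is to read the datetime column as text (the 'DatetimeType','text' option that also appears later in this thread) and convert it explicitly with a stated format and locale. The header names DATE, TICKER, and VALUE and the file name tall001.csv are assumptions; only FIELD is known from the code above.

```matlab
% Hypothetical file that mimics the sample rows above, with an assumed header.
folder = tempname;
mkdir(folder);
fid = fopen(fullfile(folder, 'tall001.csv'), 'w');
fprintf(fid, 'DATE,TICKER,FIELD,VALUE\n');
fprintf(fid, '29-Jul-1983 00:00:00,BHP AT EQUITY,MOV_AVG_50D,0.8979\n');
fprintf(fid, '31-Aug-1983 00:00:00,BHP AT EQUITY,MOV_AVG_50D,0.9029\n');
fclose(fid);

% Read datetimes as plain text, so no date parsing happens during the read.
ds = datastore(fullfile(folder, 'tall*.csv'), 'DatetimeType', 'text');
t = read(ds);

% Convert explicitly, stating both the format and the locale of the month names.
t.DATE = datetime(t.DATE, ...
    'InputFormat', 'dd-MMM-yyyy HH:mm:ss', 'Locale', 'en_US');
```

Columns that are not needed as dates, such as FIELD in the unique call above, are unaffected by this change.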

Answer 4

Michael on 27 Aug 2019

Hello,

I moved the datetime problem to a separate thread.

As for the speed, I ran a little experiment comparing one 36GB file and the same data in 20,000 smaller files.

Here are my unscientific results, which show that many files are about 20% slower than one big file in this example (12 min 36 sec vs. 10 min 32 sec):

One Huge File

One large CSV
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 10 min 31 sec
Evaluation completed in 10 min 32 sec
dt = 30

20,000 Small Files

20,000 small CSVs
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 12 min 35 sec
Evaluation completed in 12 min 36 sec
dt = 28.9531

The Code

clear all
fprintf('One large CSV\n')
% cputime reports CPU time used by the MATLAB client, not wall-clock time
tcpu(1) = cputime;
ds = datastore('bigtall.csv','DatetimeType','text'); % read datetimes as text
tds = tall(ds);
u = unique(tds.FIELD);   % deferred until gather is called
U = gather(u);
tcpu(2) = cputime;
dt = tcpu(2)-tcpu(1)
clear all
fprintf('\n20,000 small CSVs\n')
tcpu(1) = cputime;
ds = datastore('tall*.csv','DatetimeType','text');
tds = tall(ds);          % create the tall table before it is used
u = unique(tds.FIELD);
U = gather(u);
tcpu(2) = cputime;
dt = tcpu(2)-tcpu(1)

Thanks,

Michael
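
For comparing elapsed time rather than CPU time, tic/toc may be the closer match to the "Completed in" figures printed during evaluation. The following is only a sketch on a tiny generated file; the file contents and the second field name PX_LAST are invented so that the example is self-contained.

```matlab
% Hypothetical small file so that the sketch runs anywhere.
folder = tempname;
mkdir(folder);
T = table(repmat({'MOV_AVG_50D'; 'PX_LAST'}, 50, 1), rand(100, 1), ...
    'VariableNames', {'FIELD', 'VALUE'});
writetable(T, fullfile(folder, 'bigtall.csv'));

tStart = tic;                               % start a wall-clock timer
ds = datastore(fullfile(folder, 'bigtall.csv'));
tds = tall(ds);                             % tall table backed by the datastore
U = gather(unique(tds.FIELD));              % gather triggers the evaluation
elapsed = toc(tStart);                      % elapsed seconds since tic

fprintf('Elapsed: %.2f s, %d unique fields\n', elapsed, numel(U));
```

Running the same block once against the single large file and once against the 'tall*.csv' pattern would give two directly comparable wall-clock numbers.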
