Extracting a wide table from a very tall datastore array, also trying KDB+

1 visualización (últimos 30 días)
Hello,
I am currently storing about 40GB of data in a 4 column tall array with about 1B rows. I'm using a Matlab Tall Array Datastore, but I also plan on experimenting with KDB+.
Could someone comment on the best way to extract a wide table, with about 350,000 rows and about 20 columns by selecting some of the items in the first three columns and the corresponding values,e.g DATE = {29-Jul-1983:31-Aug-1983} & STOCK = AAPL US EQUITY & FIELD= {MOV_AVG_50D & PRICE}?
Thank You,
Michael
Input
DATE,STOCK,FIELD,VALUE
29-Jul-1983 00:00:00,BHP AT EQUITY,MOV_AVG_50D,0.8979
31-Aug-1983 00:00:00,BHP AT EQUITY,PRICE,0.9029
29-Jul-1983 00:00:00,IBM US EQUITY,MOV_AVG_50D,0.9106
31-Aug-1983 00:00:00,IBM US EQUITY,PRICE_50D,0.9154
29-Jul-1983 00:00:00,AAPL US EQUITY,MOV_AVG_50D,0.9227
31-Aug-1983 00:00:00,AAPL US EQUITY,PRICE,0.9311
Output:
DATE,MOV_AVG_50D,PRICE
29-Jul-1983 00:00:00,0.9227,NaN
31-Aug-1983 00:00:00,NaN,0.9311
  1 comentario
Michael
Michael el 29 de Ag. de 2019
This is what I tried so far.
ds = datastore('bigtall.csv','DatetimeType','text');
ds.SelectedFormats={'%q','%C','%C','%f'};
tds = tall(ds);tm.cputime(1)=cputime;
tic;
t = tds(tds.TICKER=='UNIT US EQUITY' & tds.FIELD=='EBITDA_MARGIN',:);
T = gather(t);
tm.cputime(2)=cputime;
tm.tictoc(2)= toc;
fprintf('%d minutes CPU, %d minutes\n',[diff(tm.cputime) tm.tictoc(2)]./60);
fprintf('Table size [%dx%d]\n',size(T));
T.DATE = datetime(T.DATE);
tb = unstack(T(:,{'DATE','VALUE','FIELD'}), 'VALUE', 'FIELD');
fprintf('Table size [%dx%d]\n',size(tb));
The output took 23 minutes.
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 4).Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 20 min 44 sec
Evaluation completed in 21 min 26 sec
5 minutes CPU, 23 minutes
Table size [762x4]
Table size [762x2]

Iniciar sesión para comentar.

Respuesta aceptada

Michael
Michael el 12 de Sept. de 2019
Hello,
Here's an update. Unfortunatly, or fortuntely depending on how you look at it, KDB+ is about 45x times faster (31 seconds vs. 23 minutes) and can be accessed through Matlab's datafeed toolbox.
Best,
Michael
  1 comentario
Michael
Michael el 12 de Sept. de 2019
This took 1 second instead of 23 minutes!
select from f where TICKER=`$("AAPL US EQUITY"), FIELD in (`MOV_AVG_50D,`HIGH), DATE within 1983.07.29 1983.08.31

Iniciar sesión para comentar.

Más respuestas (0)

Categorías

Más información sobre Big Data Processing en Help Center y File Exchange.

Productos


Versión

R2019a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by