# Choice of data structure - cellarray or dataset

5 views (last 30 days)
Ms. Mat on 7 Mar 2011
I have data of the structure 3 columns by 1.5 million rows. Each column is a variable namely cocode, date, price. I would like to extract /group data from the above data structure based on code and date and perform operations on it or ideally create a time series object for each cocode for a specific time period. Which data type would be the best to use for such data ? "cellArray" or "dataset" or any other ?(I am familiar with reading from excel and dumping the data into either of these) Also what is the ideal method to query the data set based on cocode and date without having to use nested loops ? Is it ok to use a for loop to gather all data or use "find" and get the indices of cocode equal to distinct value of cocode in my data(run in a loop) and use the same indices to access other data corresponding to it? Also I would want to query again based on dates from the above resultset and then perform my analysis on it. If I can group values using "accumarray" how will I pass the indices and can I return input "cellarray" to it ? So bottomline I want to group by cocode and then by dates and have the data , may be in a time series object.
Oleg Komarov on 7 Mar 2011
Make a bullet point list of the questions.

Oleg Komarov on 7 Mar 2011
• Store it as double matrix. You need to convert the dates, if in string format, to serial dates with datenum.
• Use logical indexing (no need of FOR loops)
• If you want to group data by date and id you may want to give a look at Pivot/UnPivot which I specifically use to work with panels of time series
EDIT
How data can be organized:
% 1. Double matrix
Amat = [repmat((now-2:now).',3,1),...
reshape(repmat(1:3,3,1),[],1),...
rand(9,1)];
% 2. Cell
Acel = [cellstr(datestr(Amat(:,1),'dd/mm/yyyy')),...
num2cell(Amat(:,2)),...
num2cell(Amat(:,3))];
% 3. Dataset
Memory used
whos
Name Size Bytes Class
Acel 9x3 1944 cell --> 1.5e6x3 ~ 308 mb
Adat 9x3 2114 dataset --> 1.5e6x3 ~ 336 mb
Amat 9x3 216 double --> 1.5e6x3 ~ 34 mb
Convert datestrings in serial dates for dataset and cell to save memory
Acel(:,1) = num2cell(datenum(Acel(:,1),'dd/mm/yyyy'));
whos
Name Size Bytes Class
Acel 9x3 1836 cell
Pros/cons:
• doule matrix: save on memory --> faster; cannot mix datatypes, so each date should be converted into serial numbers and doesn't have grpstats functionalities as the dataset class but should use accumarray or my Pivot/unPivot.
• dataset: easier to use and more friendly behavior (grpstats, stack/unstack etc...), can mix datatypes; consumes a LOT of memory
• cell array: can mix datatypes and consumes less memory than dataset; 'harder' to work with and still uses a LOT of memory
How to group data
Pivot(Amat(:,[2,1,3]))
NaN 1 2 3
7.3457e+005 0.54701 0.18896 0.36848
7.3457e+005 0.29632 0.68678 0.62562
7.3457e+005 0.74469 0.18351 0.78023
Pivot(Acel(:,[2,1,3]))
[ NaN] [ 1] [ 2] [ 3]
'06/03/2011' [0.54701] [0.18896] [0.36848]
'07/03/2011' [0.29632] [0.68678] [0.62562]
'08/03/2011' [0.74469] [0.18351] [0.78023]
Date x1 x2 x3
'06/03/2011' 0.54701 0.18896 0.36848
'07/03/2011' 0.29632 0.68678 0.62562
'08/03/2011' 0.74469 0.18351 0.78023
Oleg
Matt Tearle on 10 Mar 2011
what about structures and dataset arrays?

Matt Tearle on 7 Mar 2011
In addition to what Oleg suggested, does "code" store a small number of possible values, or are they all distinct (for 1.5 million rows)? In the former case, you might want to use a nominal array.
If you have only three columns, why not store them as separate variables? But if you really want to keep them together, either a 1-by-3 cell array or a dataset array would be the most natural, and least memory-intensive (other than numeric, of course).
Matt Tearle on 8 Mar 2011
OK, I'd recommend a nominal array for the company code, then combine everything together in a cell or dataset array. If you go with cells, store the three arrays in a 1-by-3 cell array. I'd also recommend what Oleg suggested about converting the dates to date numbers. You'll most likely need that anyway, plus you can store a double array instead of a cell array of strings.
To do the analysis, see if you can write a function that will return what you need for a single company/code. Then you can use grpstats to do all companies, by specifying code as your grouping variable and your function as the stat to calculate.
Inside your function, use logical indexing to extract the dates you want. You may have to do the analysis in a loop. Look at the various time/date functions in MATLAB to help here. In particular, using date vectors (then datenum to convert to serial date numbers) is a good way to deal with months. For weeks, you can always just add multiples of 7 to a date number.
Logical indexing works something like this: suppose you have dates stored as date numbers in an array "date" and you know the start and end dates, t1and t2. Then
x = price(date>=t1 & date<=t2);
extracts the prices for the given range of dates.