logical indexing vs cell indexing - performance

I am importing [X,Y] data arrays from a number of *.csv files and combining them into a single *.mat file for plotting / analysis later.
The [X,Y] arrays have different lengths in different files, though X and Y are equal in length within a single file, so I can't use a multidimensional numeric array.
Is it more efficient to store the data in a single numerical array with a few identification columns which I can index logically later?
Or is it better to store the [X,Y] data from each file in a separate numeric array in its own cell and access it by normal indexing?
Perhaps my question is... is it better to sort/filter the data as I import it, or when I plot it?
I've included some (rough) pseudo-code below to better explain the concepts. I hope this makes sense.
Thanks in advance, Mark.
% Logical indexing example (like a database table).
% Concatenate data from all files into a single numeric array.
data = [];
for file = 1:999
    data = [data; file, id1, id2, X, Y]; %#ok<AGROW>
end

% Plot a single X,Y pair of data with the specified IDs.
function plotProfile(data, fileid, id1, id2)
    logicalMask = data(:,1)==fileid & data(:,2)==id1 & data(:,3)==id2;
    plot(data(logicalMask,4), data(logicalMask,5));
end
% Cell array example:
% Store each [X,Y] array in its own cell,
% with the cell index defined by file number and the other IDs.
for file = 1:999
    Xc{file}{id1}{id2} = X;
    Yc{file}{id1}{id2} = Y;
end

% Plot a single X,Y pair of data with the specified IDs.
function plotProfile(Xc, Yc, fileid, id1, id2)
    plot(Xc{fileid}{id1}{id2}, Yc{fileid}{id1}{id2})
end

16 comments

Stephen23
Stephen23 on 6 Jul 2021
Edited: Stephen23 on 6 Jul 2021
I would suggest another approach, somewhat similar to the numeric array + indexing: use one table (with columns/variables for the IDs, filenames, etc.). Benefits:
  • keep different data types in one table (e.g. filenames, numeric data)
  • robust data access using indexing and/or variable names and/or row names.
  • compact data processing, e.g.:
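For instance, a minimal sketch of that idea (`csvFiles` and `importOneFile` are hypothetical placeholders for your own file list and import code):

```matlab
% Sketch: build one long table from all files.
% importOneFile is an assumed helper returning the data of one CSV file.
rows = cell(numel(csvFiles), 1);
for k = 1:numel(csvFiles)
    [X, Y, id1, id2] = importOneFile(csvFiles{k});   % hypothetical helper
    n = numel(X);
    rows{k} = table(repmat(string(csvFiles{k}),n,1), repmat(id1,n,1), ...
                    repmat(id2,n,1), X(:), Y(:), ...
                    'VariableNames', {'File','ID1','ID2','X','Y'});
end
T = vertcat(rows{:});
% Compact data processing, e.g. plot one profile by its IDs:
m = T.File == "run01.csv" & T.ID1 == 1 & T.ID2 == 2;
plot(T.X(m), T.Y(m))
```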
MC
MC on 6 Jul 2021
Thanks for the suggestion.
I had considered this but thought the duplication of filenames etc. would be prohibitive.
Would you still advise a table if the length of X (and Y) is ~500 lines / file?
I must admit I don't use MATLAB tables that often. Perhaps this is a good project to familiarise myself with them.
dpb
dpb on 6 Jul 2021
A table would need size(LargestArray,1) rows to hold the separate files -- this isn't a real problem, but like arrays of any type, tables must be regular, not jagged. (Cell arrays themselves are regular; it's just that individual cells can hold disparately sized things.)
For only roughly 500 elements/file, I seriously doubt the access time difference between however you choose to do it will be significant; frankly, if the lengths are different, I'd probably just use the cell array approach or, if I were to use the table I'd consider an array of tables.
The latter has the advantage you can have each and every table column header be X and Y and not have a conflict.
But, the former option of handling disparate arrays in a table was illustrated in an Answer to another similar storage issue just earlier this morning -- https://www.mathworks.com/matlabcentral/answers/872518-add-headers-to-matrix-using-table#answer_740593 See, specifically Akira's sample code in his Comment in response to my initial Answer.
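A rough sketch of the "array of tables" idea mentioned above (`csvFiles` and `importOneFile` are hypothetical placeholders), where every inner table can keep the plain headers X and Y:

```matlab
% Sketch: one small X/Y table per file, stored as a variable of an outer table.
inner = cell(numel(csvFiles), 1);
for k = 1:numel(csvFiles)
    [X, Y] = importOneFile(csvFiles{k});            % hypothetical helper
    inner{k} = table(X(:), Y(:), 'VariableNames', {'X','Y'});
end
outer = table(string(csvFiles(:)), inner, 'VariableNames', {'File','Data'});
plot(outer.Data{1}.X, outer.Data{1}.Y)              % plot the first file's profile
```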
Seth Furman
Seth Furman on 6 Jul 2021
It's worth trying Stephen's suggestion of putting your data into a table for some subset of your data. If table works for you, but the entire dataset is too large to fit into memory, you might consider using a tall table, which avoids reading all the data into memory at once:
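A rough sketch of that approach (the datastore path and variable names are assumptions based on the data described above):

```matlab
% Sketch: a lazily evaluated tall table backed by a datastore.
ds = tabularTextDatastore('data/*.csv');   % reads files on demand, not all at once
tt = tall(ds);
m  = tt.ID1 == 1 & tt.ID2 == 2;            % deferred logical mask
xy = gather(tt(m, {'X','Y'}));             % triggers the actual read/compute
```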
Stephen23
Stephen23 on 6 Jul 2021
Edited: Stephen23 on 6 Jul 2021
"The latter has the advantage you can have each and every table column header be X and Y and not have a conflict."
??? Where does the conflict arise?
Only five columns/variables are required, e.g. file, id1, id2, X, Y.
"Would you still advise a table if the length of X (and Y) is ~500 lines / file?"
Yes: a table is a container type, where each column/variable is its own separate array stored in memory, not so different to a cell array. This means the memory consumption is probably not so different from a cell array of the same data, unlike your proposal to use one huge numeric array (which would require a large amount of contiguous memory).
dpb
dpb on 6 Jul 2021
Edited: dpb on 6 Jul 2021
If you duplicate the ID info as well, yes, but I'm averse to adding superfluous storage of the same thing over and over and over... I'd just load the X,Y data as columns.
Stephen23
Stephen23 on 7 Jul 2021
Edited: Stephen23 on 7 Jul 2021
"I'd just load the X,Y data as columns."
Which eliminates any benefit of using tables (especially split-apply-combine).
Forcing meta-data (e.g. IDs) into the column/variable names just makes data access more complex, removes any ability to use the built-in tools that are specifically designed for processing entire tables, and likely requires more looping.
It goes against the fundamental paradigm of the table (an exact counterpart of the dataframe in R, or the DataFrame in pandas): each row represents one data sample, and each column/variable one specific metric of that data.
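As a concrete sketch of the split-apply-combine style this enables (assuming the five-column long table T with File, ID1, ID2, X, Y described above):

```matlab
% Sketch: group-wise statistics on a long table T.
G = findgroups(T.File, T.ID1, T.ID2);   % one group per unique ID combination
maxY  = splitapply(@max,  T.Y, G);      % per-profile peak Y
meanX = splitapply(@mean, T.X, G);      % per-profile mean X
```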
dpb
dpb on 7 Jul 2021
Edited: dpb on 7 Jul 2021
Yeah, grouping variables are powerful and have their place and rowfun or splitapply also when operating on groupwise data.
To me, for the OP's purpose here of plotting variables, that's not the way I'd go, however; instead of row-oriented grouping variables, I'd use varfun with dynamic 'InputVariables' addressing. One can still create grouping variables within the table for other factors (here, the timestamp or the values of the variables themselves are about all else there is) if that is also needed/wanted.
I've not tested recently; I do know that some releases ago when tables were relatively young that I saw performance degrade markedly when they got to be very long; that may well have been improved since, but the observation tends to have stuck with me.
ADDENDUM
As for "Forcing meta-data (e.g. IDs) into the column/variable names", it's not an issue when generated programmatically; but actually, if you use the idea of a table of tables, then the variables can all be X, Y for every one, with only the one ID of the test number at the higher level.
I don't see it as a real issue in the given application. If IDs were random or otherwise difficult, perhaps, but then they're still a pain to deal with even in selecting them out of a grouping variable instead as a variable name. "There is no free lunch!" :)
Hi Guys,
Thanks again for your advice and discussions - very informative and much appreciated.
I spent a bit of time today playing with Tables. My numbers were a bit off. I have ~1500 lines per unique combination of File,ID1,ID2 and ~16 combinations per file. I imported 12 files which generated ~275k rows, 5 cols.
During the import loop I generated a small temporary table for each unique ID combination (1.5k rows) and concatenated them one by one into a single master tall-table (275k rows). I don't know if this is the best method but it worked fine.
FileID is text, so I made that variable categorical. ID1 and ID2 are small numbers so I made them uint8 to keep the size down. X, Y are both doubles.
I generated a logical filter mask where:
mask = (T.FileID == tgtFileID & T.ID1 == tgtID1 & T.ID2 == tgtID2);
Is there a better way to mask a category? something similar to strcmp()? I did look in the docs, but it suggests '==' or ismember().
Finally, I gathered the masked tall table into a local table for plotting and analysis. This process was quite slow (~0.5 s).
T_local = gather(T_Tall(mask,:));
However, if I gather the entire tall table into a local table, it is rapid, and I can just index with my search criteria without creating an intermediate table for analysis.
T_local = gather(T_Tall);
I will import more files and see how well this scales.
Again, I hope this makes sense. I just wanted to provide some positive feedback for your help.
Best regards, Mark.
Stephen23
Stephen23 on 7 Jul 2021
"Is there a better way to mask a category?"
Your proposals seem reasonable to me. I would definitely try STRCMP. You could also investigate:
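One option along those lines (a sketch, reusing the table and target IDs from your code above) is that ismember also accepts a list, which is handy when a user selects several files at once:

```matlab
% Sketch: ismember matches a categorical column against one or more values.
mask = ismember(T.FileID, {'run01.csv','run02.csv'}) ...  % hypothetical filenames
       & T.ID1 == tgtID1 & T.ID2 == tgtID2;
```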
MC
MC on 7 Jul 2021
Edited: MC on 7 Jul 2021
Interestingly, I could use strcmp on the local table, but it threw an error on the tall table.
Error using tall/strcmp (line 12)
Argument 1 to STRCMP must be one of the following data types:
string cellstr.
The '==' processing times on 135,000 rows were ~0.010 s for the local table and ~0.130 s for the tall table.
On the local table, the processing times for '==' and strcmp were ~0.010 s and ~0.008 s respectively.
Michael
Michael on 8 Jul 2021
@MC I think the answer to this depends on the number of files and the number of elements in X and Y. The first approach you provided is going to use a lot of extra memory, since it looks like you are repeating the file, id1, and id2 entries for all X and Y values. Can you clarify what you are using id1 and id2 for, and why? Each .csv file has two columns (x and y), correct? Also, you call X and Y arrays, but are they each nx1 or nxm arrays? Why do you need id1 and id2? Is the filename sufficient to look up the x and y data you need?
MC
MC on 8 Jul 2021
Hi Michael,
Each *.csv file contains some header data, then a variable number of X,Y column pairs; it could be 1 or 16 pairs. Each pair can have a different number of rows (typically ~1500) (example format below).
So my plan was/is to formalise the structure of the data better so I can do some custom analysis on it and present it in a dashboard style GUI (think PowerBI) where the user can select certain unique File,ID1,ID2 combinations to compare visually.
The number of files will increase over time as more data arrives, but I don't have to have all the data available at once necessarily. Certain people/projects will only be interested in their own data (typically up to 50 files with 8 unique ID combinations) so I can split up the master tall table for each user prior to dynamic visualisation.
A long table with columns File,ID1,ID2,X,Y is common in databases, I just didn't know if matlab was really geared up to cope with data in this way.
Best regards,
Mark.
Header Data ........................................
ID1:  1  1  1  1  2  2  2  2  ...  n  n
ID2:  1  1  2  2  1  1  2  2  ...  n  n
      X  Y  X  Y  X  Y  X  Y  ...  X  Y
      #  #  #  #  #  #  #  #       #  #
      #  #  #  #  ...
      #  #
(each X,Y column pair can have a different number of rows)
Stephen23
Stephen23 on 8 Jul 2021
Edited: Stephen23 on 8 Jul 2021
"A long table with columns File,ID1,ID2,X,Y is common in databases, I just didn't know if matlab was really geared up to cope with data in this way."
That is what tables and tall tables are for.
But it also depends on how you need to process your data. A cell array or structure might also be suitable.
MC
MC on 8 Jul 2021
Edited: MC on 8 Jul 2021
"That is what tables and tall tables are for."
Yes, and the route I am progressing with. Thank you. :-)
"But it also depends on how you need to process your data. A cell array or structure might also be suitable."
Hence my original question. There are many ways to approach this and all would work. But what would be most efficient given the semi-structured data that I'm dealing with...?
dpb
dpb on 8 Jul 2021
Edited: dpb on 8 Jul 2021
"what tables and tall tables are for ... also depends on how you need to process your data"
But my observations have been similar to the report above: when tables get to be quite long, performance lags, so while the structure is there, the implementation as refined to date becomes impractical with very large datasets. TMW will undoubtedly continue to improve the implementation over time.
Also, I did not recognize in my initial response the multiple grouping variables ID1 and ID2; I thought the application was just one set of X,Y data for a number of tests, and the intent was simply to plot those by test. Adding more criteria makes the rearrangement more appropriate, agreed -- but still with the caveat that performance may be a kick in the teeth in the most straightforward use of plain tables. And the performance hit the tall table object exacts was clearly demonstrated; if it's the only way, it's probably better than not being able to analyze the data at all, but it definitely comes at a price.

Sign in to comment.

Answers (0)

Release
R2020a

Asked: MC on 6 Jul 2021
Edited: dpb on 8 Jul 2021
