logical indexing vs cell indexing - performance
I am importing [X,Y] data arrays from a number of *.csv files and combining them into a single *.mat file for plotting / analysis later.
The length of [X,Y] arrays is different in each file. The length of [X,Y] is equal in a single file. So I can't use a multidimensional numerical array.
Is it more efficient to store the data in a single numerical array with a few identification columns which I can index logically later?
Or, is it better to store the [X,Y] data from each file in a separate numerical array in separate cells and access it by normal indexing.
Perhaps my question is... is it better to sort/filter the data as I import it, or when I plot it?
I've included some (rough) pseudo-code below to better explain the concepts. I hope this makes sense.
Thanks in advance, Mark.
% Logical indexing example (like a database table).
% Concatenate data from all files into a single numeric array.
data = [];
for file = 1:999
    % X, Y, id1, id2 come from the current *.csv file (import not shown).
    n = numel(X);
    data = [data; repmat([file, id1, id2], n, 1), X(:), Y(:)]; %#ok<AGROW>
end

% Plot the single [X,Y] pair of data with the specified IDs.
function plotProfile(data, fileid, id1, id2)
    mask = data(:,1)==fileid & data(:,2)==id1 & data(:,3)==id2;
    plot(data(mask,4), data(mask,5));
end
% Cell array example:
% Store the [X,Y] array from each file in its own cell.
% Cell index defined by file number and the other IDs.
for file = 1:999
    % X, Y, id1, id2 come from the current *.csv file (import not shown).
    Xc{file}{id1}{id2} = X;
    Yc{file}{id1}{id2} = Y;
end

% Plot the single [X,Y] pair of data with the specified IDs.
function plotProfile(Xc, Yc, fileid, id1, id2)
    plot(Xc{fileid}{id1}{id2}, Yc{fileid}{id1}{id2});
end
16 comments
I would suggest another approach, somewhat similar to the numeric array + indexing: use one table (with columns/variables for the IDs, filenames, etc.). Benefits:
- keep different data types in one table (e.g. filenames, numeric data)
- robust data access using indexing and/or variable names and/or row names.
- compact data processing, e.g.:
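The kind of compact processing meant here might look like the following sketch (the column names file, id1, id2, X, Y follow the question's layout; the selection values are made up):

```matlab
% One long table, one row per (x,y) sample.
T = table(file, id1, id2, X, Y);
% Select and plot one profile in a single indexing step:
sel = T(T.file==3 & T.id1==1 & T.id2==2, :);
plot(sel.X, sel.Y);
% Or summarise every profile at once with a grouped operation:
G = groupsummary(T, {'file','id1','id2'}, 'mean', {'X','Y'});
```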
MC
on 6 Jul 2021
dpb
on 6 Jul 2021
A table will have to have max(size(LargestArray,1)) rows to hold the separate files -- this isn't a real problem, but like arrays of any type, tables must be regular, not jagged. (Cell arrays themselves are regular; it is just that the cells can hold disparately sized contents.)
For only roughly 500 elements/file, I seriously doubt the access time difference between however you choose to do it will be significant; frankly, if the lengths are different, I'd probably just use the cell array approach or, if I were to use the table I'd consider an array of tables.
The latter has the advantage you can have each and every table column header be X and Y and not have a conflict.
But, the former option of handling disparate arrays in a table was illustrated in an Answer to another similar storage issue just earlier this morning -- https://www.mathworks.com/matlabcentral/answers/872518-add-headers-to-matrix-using-table#answer_740593 See, specifically Akira's sample code in his Comment in response to my initial Answer.
Seth Furman
on 6 Jul 2021
It's worth trying Stephen's suggestion of putting your data into a table for some subset of your data. If table works for you, but the entire dataset is too large to fit into memory, you might consider using a tall table, which avoids reading all the data into memory at once:
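For example (a sketch; the file pattern is hypothetical), a tall table can be built from a datastore so that files are only read as computations demand them:

```matlab
% Lazily read all CSV files through a datastore (file pattern assumed).
ds = tabularTextDatastore('profiles_*.csv');
tt = tall(ds);                 % tall table: rows are not yet in memory
mx = gather(max(tt.X));        % gather triggers the deferred evaluation
```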
"The latter has the advantage you can have each and every table column header be X and Y and not have a conflict."
??? Where does the conflict arise?
Only five columns/variables are required, e.g. file, id1, id2, X, Y.
"Would you still advise a table if the length of X (and Y) is ~500 lines / file?"
Yes: a table is a container type, where each column/variable is its own separate array stored in memory, not so different to a cell array. This means the memory consumption is probably not so different from a cell array of the same data, unlike your proposal to use one huge numeric array (which would require a large amount of contiguous memory).
"I'd just load the X,Y data as columns."
Which eliminates any benefit of using tables (especially split-apply-combine).
Forcing meta-data (e.g. IDs) into the column/variable names just makes data access more complex, removes any ability to use the built-in tools that are specifically designed for processing entire tables, and likely requires more looping.
It goes against the fundamental paradigm of the table (which is a direct analogue of a data.frame in R, or a DataFrame in pandas): each row represents one data sample, and each column/variable one specific metric of that data.
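In MATLAB terms, that long-format paradigm enables one-line grouped operations; a minimal sketch, assuming the five-column table T with variables file, id1, id2, X, Y proposed earlier:

```matlab
% Split-apply-combine on the long-format table:
G = findgroups(T.file, T.id1, T.id2);   % one group index per profile
ymax = splitapply(@max, T.Y, G);        % max Y within each group
```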
Yeah, grouping variables are powerful and have their place and rowfun or splitapply also when operating on groupwise data.
To me, for the OP's purpose here of plotting variables, that's not the way I would go, however; instead of row-oriented grouping variables, I'd use varfun with dynamic 'InputVariables' addressing. One can still create grouping variables within the table for other factors (here the timestamp or the values of the variables themselves are about all else there is) if that is also needed/wanted.
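A sketch of that varfun approach, assuming a table T with ID variables file, id1, id2 and data variables X, Y (names illustrative):

```matlab
% Apply a function only to dynamically chosen columns, per group:
vars = {'X','Y'};                       % selected at run time
S = varfun(@max, T, 'InputVariables', vars, ...
           'GroupingVariables', {'file','id1','id2'});
```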
I've not tested recently; I do know that some releases ago when tables were relatively young that I saw performance degrade markedly when they got to be very long; that may well have been improved since, but the observation tends to have stuck with me.
ADDENDUM
As for "Forcing meta-data (e.g. IDs) into the column/variable names", it's not an issue when generated programmatically; but actually, if you use the idea of a table of tables, then the variables can all be X and Y for every one, with only the one ID of the test number kept at the higher level.
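A sketch of the table-of-tables idea (importFile is a hypothetical per-file CSV reader):

```matlab
nFiles = 999;
perFile = cell(nFiles, 1);
for k = 1:nFiles
    [X, Y] = importFile(k);      % hypothetical: reads one *.csv
    perFile{k} = table(X, Y);    % inner table: headers are always X, Y
end
outer = table((1:nFiles)', perFile, 'VariableNames', {'file','data'});
plot(outer.data{3}.X, outer.data{3}.Y);   % profile from file 3
```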
I don't see it as a real issue in the given application. If IDs were random or otherwise difficult, perhaps, but then they're still a pain to deal with even in selecting them out of a grouping variable instead as a variable name. "There is no free lunch!" :)
MC
on 7 Jul 2021
Stephen23
on 7 Jul 2021
"Is there a better way to mask a category?"
Your proposals seem reasonable to me. I would definitely try STRCMP. You could also investigate:
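For instance, a row mask can come from strcmp on a text column, or from converting that column to categorical first (the column name is illustrative):

```matlab
mask = strcmp(T.name, 'profileA');   % logical mask from exact string match
% A categorical column compares faster and validates values on conversion:
T.name = categorical(T.name);
mask = T.name == 'profileA';
plot(T.X(mask), T.Y(mask));
```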
Michael
on 8 Jul 2021
@MC I think the answer to this depends on the number of files and the number of elements in X and Y. Using the first approach you provided is going to use a lot of extra memory, since it looks like you are repeating the file, id1, and id2 entries for all X and Y values. Can you clarify a bit of what/why you are using id1 and id2? Each .csv file has two columns (x and y), correct? Also, you call X and Y arrays, but are they each nx1 or nxm arrays? Why do you need id1 and id2? Is the filename sufficient to look up the x and y data you need?
MC
on 8 Jul 2021
"A long table with columns File,ID1,ID2,X,Y is common in databases, I just didn't know if matlab was really geared up to cope with data in this way."
But it also depends on how you need to process your data. A cell array or structure might also be suitable.
But my observations have been similar to the report above: when tables get to be quite long, performance lags. So while the structure is there, as implemented and refined to date it becomes impractical with large datasets. TMW will undoubtedly continue to improve the implementation over time.
Also, I did not recognize in the initial response the multiple grouping variables ID1 and ID2; I thought the application was just one set of X,Y data for a number of tests, and the intent was simply to plot those by test. Adding more criteria makes the rearrangement more appropriate, agreed, though again the performance problem may be a kick in the teeth in the most straightforward way of just using tables. And it was clearly demonstrated the performance hit the tall table object exacts -- if it's the only way, it's probably better than not being able to analyze the data at all, but that definitely comes at a price.
Answers (0)