I am working on a project whose ultimate goal is to analyse data and produce insights from it. However, I'm having trouble with the data storage itself, before even starting any analysis. I'm hoping someone with more experience, who has a feel for this, can give some insight.
We are recording ~70 signals from several hydrogen fuel cell vehicles that are driving around. Most of the signals are at 1 Hz, about 10 at 5 Hz, and a handful at slower frequencies (0.1 Hz, 0.002 Hz). The cars are driven anywhere from 5 minutes to 2 hours (roughly), any number of times per day (sometimes zero for multiple days), but while they are in use the data is recorded continuously. All signals are doubles.
I currently save everything to one large nested cell structure, which is convenient for sharing with coworkers. This structure is inherited from the reading of the raw signal data. Signals are things like GPS, speed, steering wheel angle, fuel cell voltages, etc. The current data structure works well in some ways, but calculating, for example, the mean vehicle speed requires a 2-level nested for-loop (over cars, then over trips, where a trip is vehicle-on to vehicle-off) instead of a simple mean(ts) as can be done with timeseries objects. Part of the problem is that the range of analyses to be performed is very broad, so it's not known in advance what the specific query functions will be.
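To make the contrast concrete, the nested-loop calculation looks roughly like the following sketch (the field and variable names `data{car}{trip}.speed` and `TT.Speed` are made up for illustration; the real structure differs):

```matlab
% Mean speed over all cars and trips with the nested-cell structure
% (illustrative field names, not the actual layout)
allSpeeds = [];
for car = 1:numel(data)                 % loop over vehicles
    for trip = 1:numel(data{car})       % loop over trips (on -> off)
        allSpeeds = [allSpeeds; data{car}{trip}.speed(:)]; %#ok<AGROW>
    end
end
meanSpeed = mean(allSpeeds, 'omitnan');

% versus, with everything in one timetable TT holding a Speed variable:
% meanSpeed = mean(TT.Speed, 'omitnan');
```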
The dataset currently covers one year of data from a four-year project, to give a sense of the scale. It's also possible more vehicles will be added, increasing the amount of data for analysis.
Ideally, the storage approach would offer:
- Relative ease of use for less MATLAB-savvy coworkers
- Filtering of trips by conditions such as length, signal value, and date
- Easy to perform statistical analysis
- Ability to perform analysis on entire dataset easily (across different sample frequencies)
- Units as a property of a signal
- (Very optional) Saved on a server (I have both Windows and Linux servers available) so that users can dynamically query the data
I have tried several different approaches to this so far:
- A single timeseries collection per car and per sample rate, using append to merge them into single timeseries while maintaining proper timestamps (I like how this works when plotting all the data). This seems to work alright, unless the PC has a limited amount of RAM.
- A single timetable per car, with gaps introduced by resampling filled with NaNs. This is my current method, but at 3,629,763 / 2,853,868 / 1,707,074 rows (with 70+ signals), the tables use quite a lot of memory.
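As a sketch of that timetable approach, signals at mixed sample rates can be merged onto the union of timestamps with `synchronize`, which pads missing samples with NaN by default (the signal names here are assumptions for illustration):

```matlab
% Combine signals recorded at different rates into one timetable;
% unmatched timestamps are filled with NaN (illustrative names/values)
tt1Hz = timetable(seconds(0:4)', (20:24)', 'VariableNames', {'Speed'});
tt5Hz = timetable(seconds(0:0.2:1)', (300:305)', 'VariableNames', {'CellVoltage'});
TT = synchronize(tt1Hz, tt5Hz);           % union of row times, missing -> NaN
meanSpeed = mean(TT.Speed, 'omitnan');    % whole-table statistics in one call
```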
I've been looking through the support docs and don't want to put a lot of work into pursuing the wrong solution. The options I'm considering:
- Use a timeseries collection for each car and resample the data using zero-order hold (probably causes errors due to the massive in-memory size of the collection)
- Use a timetable, but this doesn't seem as well suited when all the data are plain numbers.
- Use one of the above, but keep each trip in a nested array instead of one big table (though the ease of use of one big table is nice).
- Use one of the above, stored in an HDF5 dataset (not sure if that's possible) so data is only loaded as needed.
- Use a datastore with the above?
- Double down on "big data" and start learning map-reduce stuff
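The last two bullets could also be combined: a datastore backing a tall timetable lets familiar operations (mean, filtering) run on data that is never fully loaded, with the map-reduce machinery handled behind the scenes. A minimal sketch, assuming (hypothetically) that each trip were exported as a CSV file with a Speed column:

```matlab
% Point a datastore at per-trip files and treat them as one tall timetable
% (the 'trips/*.csv' layout and Speed column are assumptions for the example)
ds = tabularTextDatastore('trips/*.csv');        % reads files chunk by chunk
tt = tall(ds);                                   % out-of-memory table
meanSpeed = gather(mean(tt.Speed, 'omitnan'));   % deferred; evaluated on gather
```

The appeal of this route is that the analysis code stays close to the in-memory timetable version; only the `tall`/`gather` wrapping changes.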
I am nowhere near running into limits on storage or memory; that is a problem I will have in the future, but not the important thing to solve now. As mentioned in the comments, there's much I can do to reduce size that has nothing to do with the structure.
My MAIN QUESTION is whether I should use timeseries, timetables, or non-time-based arrays, and how I should organise and store them. I have many sets of data from each car, and I have contained each set (which can be recorded minutes or hours apart) in a single table, since this makes things like the mean of all samples very quick to compute, but massive tables aren't kind to memory. So how do I store and organise the data such that I can easily perform analysis? Memory restrictions can be kept in mind, but they are not a driving force here; my computer handles what I've described just fine.