How to optimize my code and system configs?
Hi
I am working with big datasets (> 3 GB each) and mostly use built-in functions from the Statistics and Machine Learning Toolbox (e.g. supervised learning methods). To handle the datasets more easily, I often store them (my raw data are in csv or txt format) as chunks in MAT files (mainly using datastore). In summary, I found that the most time-consuming parts of my scripts are the following:
- Loading the data (either new raw datasets or chunks of data).
- Passing the data structures to different in-house functions.
- Converting between data types (e.g. converting a ~500,000 x 1 character array to numeric using str2double).
- Performing the required statistical analyses (e.g. Regression analysis).
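A minimal sketch for timing the type-conversion step listed above. It compares str2double with double(string(...)) on made-up data; the variable names, the data, and the array size are assumptions for illustration, and double(string(...)) needs a MATLAB release with string arrays (R2016b or later). Which one is faster depends on the release and the data, so it is worth measuring on your own files.

```matlab
% Hypothetical test data: numbers stored as text, one value per cell.
n = 2e4;                                   % assumed size, kept small for a quick run
vals = rand(n, 1);
c = arrayfun(@(v) sprintf('%.10f', v), vals, 'UniformOutput', false);

% Option 1: str2double on a cell array of character vectors.
tic;
a = str2double(c);
tA = toc;

% Option 2: convert to a string array first, then to double.
tic;
b = double(string(c));
tB = toc;

fprintf('str2double:     %.3f s\n', tA);
fprintf('double(string): %.3f s\n', tB);
fprintf('max abs difference between results: %g\n', max(abs(a - b)));
```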
I am using an Intel Core i9 7980XE processor with 128 GB RAM (DDR4). I also have a 2 TB SSD (560 MB/s).
I have several questions that I hope someone can help me with:
- How can I improve the loading time of my data? For instance, is it better to save chunks of data in a file format other than MAT files? Currently, each variable in the raw data is stored as a vector.
- Since I need to convert between data types (I originally save all data as character data), is it better to save doubles as double, characters as character, etc.? Saving character arrays turned out to be faster, which is why I saved everything as character arrays.
- While my code runs, neither CPU nor RAM usage seems to be the limit, so I wonder what the main rate-limiting step is in handling big data. Should I upgrade my SSD?
Many thanks in advance for your help.
Answers (1)
Le Vu Bao
on 15 Jul 2019
Edited: Le Vu Bao
on 15 Jul 2019
I am interested in this too.
In my experience, I used to store my data in *.mat (-v7) files. I don't know whether this is a general rule, but in my case storing and loading any big struct or cell array in a *.mat (-v7) file took a very long time (even though the format supports reading part of the data). So I split my data into smaller files stored as *.mat (-v6) and avoided storing structs or cells where I could (although I still had to use some in my case). I also wrote a function that loads only the part I need.
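The approach described above could look roughly like the following sketch. It is not the original code: the folder, file names, variable names, chunk size, and the loadChunk helper are all hypothetical, and the data are random numbers standing in for real variables.

```matlab
% Hypothetical example data: plain numeric vectors, no struct or cell arrays.
n = 1e5;
x = rand(n, 1);
y = rand(n, 1);

outDir = tempname;            % hypothetical output folder
mkdir(outDir);
chunkSize = 2.5e4;            % rows per file (arbitrary choice)
nChunks = ceil(n / chunkSize);

% Write each chunk to its own small MAT file.
for k = 1:nChunks
    rows = (k-1)*chunkSize + 1 : min(k*chunkSize, n);
    chunk_x = x(rows);
    chunk_y = y(rows);
    % -v6 writes the variables without compression.
    save(fullfile(outDir, sprintf('chunk_%03d.mat', k)), ...
         'chunk_x', 'chunk_y', '-v6');
end

% Hypothetical helper: load one named variable from one chunk file.
loadChunk = @(k, varName) load(fullfile(outDir, sprintf('chunk_%03d.mat', k)), varName);

s = loadChunk(2, 'chunk_x');  % returns a struct with the single field chunk_x
disp(size(s.chunk_x));
```

Naming the variable in the load call means only that variable is read from the file, so keeping each variable as a separate plain array makes it cheap to fetch just the part that is needed.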