Way of conserving memory when extracting data from CSV
Hi everybody, I have a few questions. I have some HUGE CSV files which I need in MATLAB for analysis. Each CSV file has 5 columns. The relevant columns are:
Column 1 is the date, running from early 2007 to mid 2011, in the form mm/dd/yyyy.
Column 3 is the respective price.
Column 5 is the number of trades.
The questions I have are these:
1) How can I extract these 3 columns into a matrix in MATLAB without taking too much memory (bear in mind that some of these CSV files have around 60 million rows)? Is there a way to decrease the memory MATLAB allocates for each element of the matrix? Please help with code.
2) How can I extract all the information into a non-string matrix (for analysis) for a specific year, i.e. only for 2009? I would need to store all the information for 2009 in a matrix (bearing in mind the memory limitations in 1).
Thanks so much.
13 comments
Matt Kindig
on 12 Apr 2013
If memory is an issue, you might want to parse the file line-by-line (using fgetl() or similar). That way, you don't need to read the entire file into memory at once. Also, by retrieving one line at a time, you can easily implement your year=2009 inclusion criterion.
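A minimal sketch of the line-by-line idea, assuming each row starts with an mm/dd/yyyy date as described in the question, so the year occupies characters 7 to 10. The sample rows, the file name, and the variable names (sampleFile, targetYear, nKept) are illustrative assumptions, not taken from the real data.

```matlab
% Write a tiny sample file so the sketch is self-contained.
% The rows are made up for illustration; only the mm/dd/yyyy prefix matters here.
sampleFile = [tempname '.csv'];
fid = fopen(sampleFile, 'w');
fprintf(fid, '%s\n', ...
    '12/31/2008,38:52.0,71.35,CTN08,2', ...
    '01/02/2009,38:53.0,71.40,CTN09,3', ...
    '01/04/2010,38:56.0,71.45,CTN10,1');
fclose(fid);

targetYear = '2009';   % year to keep, compared as text
nKept = 0;             % number of rows that passed the filter

fid = fopen(sampleFile, 'r');
tline = fgetl(fid);    % returns -1 (not char) at end of file
while ischar(tline)
    % Only one line is held in memory at a time.
    if numel(tline) >= 10 && strcmp(tline(7:10), targetYear)
        nKept = nKept + 1;
        % Parse and store the fields of this row here.
    end
    tline = fgetl(fid);
end
fclose(fid);
delete(sampleFile);    % remove the temporary sample file
```

Comparing the year as text avoids converting the date on rows that will be thrown away anyway.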
per isakson
on 12 Apr 2013
Please provide a sample. Type a few lines in the Matlab editor.
per isakson
on 12 Apr 2013
I cannot see any reason why I should guess and create a sample CSV file!
per isakson
on 12 Apr 2013
Edited: per isakson
on 12 Apr 2013
Does the file look like this?
04/29/2008,38:52.0,71.35,CTN08,2
04/29/2008,38:53.0,71.35,CTN08,2
04/29/2008,38:56.0,71.35,CTN08,3
04/29/2008,38:56.0,71.35,CTN08,1
04/29/2008,38:56.0,71.35,CTN08,1
04/29/2008,38:57.0,71.35,CTN08,1
Mate 2u
on 12 Apr 2013
per isakson
on 12 Apr 2013
What are the maximum and minimum values of column 3 and 5, respectively?
Mate 2u
on 13 Apr 2013
per isakson
on 13 Apr 2013
Edited: per isakson
on 13 Apr 2013
See the answer by Image Analyst. It is wasteful to store the numbers as double-precision floats (8 bytes each).
Maybe the volume can be stored in uint8.
>> intmax('uint8')
ans =
255
Image Analyst
on 13 Apr 2013
The maximum determines the smallest data class you can use. See code in my answer.
per isakson
on 13 Apr 2013
And price will never exceed
>> intmax('uint32')
ans =
4294967295
cents ????
Accepted answer
More answers (1)
Image Analyst
on 12 Apr 2013
1 vote
What are the classes of each column? Are they all 8-byte (64-bit) doubles? For example, the number of trades might fit in a 4-byte integer, and most of the floating-point numbers could probably be single instead of double. By retrieving the data a line at a time and using sscanf() you can place each value into the smallest type of variable that is appropriate for that number. For example, assuming no stock price is over $655.35, you could read in the number and multiply by 100 so that all stock prices are in cents rather than dollars. That way you can use a 16-bit unsigned integer instead of a 32-bit single.
I don't have the toolboxes, but perhaps the Financial Toolbox or the Fixed Point Designer may have efficient ways of handling numbers like prices of stocks.
Like Matt said, perhaps you don't need all 60 million rows in memory at once - hopefully you can process it in chunks.
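As a rough sketch of the "smallest type that fits" idea, the snippet below parses one row in the format shown earlier in this thread and stores each value in a compact class. The use of strsplit and the variable names (parts, priceCents, tradeCount, yearNum) are assumptions made for illustration, and the chosen classes only hold if prices stay at or below $655.35 and trade counts at or below 255.

```matlab
tline = '04/29/2008,38:52.0,71.35,CTN08,2';   % sample row from this thread
parts = strsplit(tline, ',');                 % 5 fields, one per column

mdy    = sscanf(parts{1}, '%d/%d/%d');   % [month; day; year] as doubles
price  = sscanf(parts{3}, '%f');         % price in dollars
trades = sscanf(parts{5}, '%d');         % number of trades

% Convert to whole cents, then to the smallest class that fits.
% Integer conversion saturates silently: a value above 65535 would be
% stored as 65535 rather than raising an error.
priceCents = uint16(round(price * 100)); % assumes price <= 655.35
tradeCount = uint8(trades);              % assumes at most 255 trades per row
yearNum    = uint16(mdy(3));

whos priceCents tradeCount yearNum       % shows class and bytes of each variable
```

Because out-of-range values saturate instead of erroring, it is worth checking the column maxima against intmax() before committing to a class.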
4 comments
Image Analyst
on 13 Apr 2013
realmax('double')
realmax('single')
intmax('int32')
intmax('int16')
intmax('uint16')
intmax('uint8')
ans =
1.79769313486232e+308
ans =
3.402823e+38
ans =
2147483647
ans =
32767
ans =
65535
ans =
255
Image Analyst
on 13 Apr 2013
Regarding your comment above, they could both be uint16 then. That's 2 bytes per value instead of 8, so that saves you a lot: a factor of 4 in memory for those two columns.
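The factor of 4 can be checked directly with whos. This is only a sketch; the row count n is an arbitrary illustrative value, and the variable names are made up for the example.

```matlab
n = 1e6;                              % illustrative row count, not from the thread

asDouble = zeros(n, 1);               % default class: double, 8 bytes per element
asUint16 = zeros(n, 1, 'uint16');     % 2 bytes per element

infoD = whos('asDouble');             % struct with a .bytes field
infoU = whos('asUint16');
ratio = infoD.bytes / infoU.bytes;    % memory saving factor

fprintf('double: %d bytes, uint16: %d bytes, ratio: %g\n', ...
    infoD.bytes, infoU.bytes, ratio);
```

Scaled to 60 million rows, one column takes 480,000,000 bytes as double and 120,000,000 bytes as uint16.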
Mate 2u
on 13 Apr 2013
Image Analyst
on 13 Apr 2013
For example, maybe someone asks about 2010 prices, so you scan the file line by line, throwing away data if it belongs to any year other than 2010. Only if the year is 2010 do you put it into your array. Values from other years just go into scalar variables because you used sscanf, but you re-use (overwrite) those variables. So on a line-by-line basis you will have variables thisPrice, thisDay, thisVolume, thisYear, and only when thisYear == 2010 do you add thisPrice, thisDay, thisVolume to priceArray, dayArray, volumeArray.
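The scheme described above might look roughly like the following. It is a sketch under several assumptions: the rows follow the layout shown earlier in the thread, prices fit in uint16 cents, day and volume fit in uint8, and the arrays grow in blocks of a hypothetical size blockSize. The sample rows and the extra monthArray are illustrative additions.

```matlab
% Build a small sample file so the sketch runs on its own (rows are made up).
sampleFile = [tempname '.csv'];
fid = fopen(sampleFile, 'w');
fprintf(fid, '%s\n', ...
    '12/30/2009,38:52.0,71.35,CTN09,2', ...
    '01/04/2010,38:53.0,71.45,CTN10,1', ...
    '01/05/2010,38:56.0,71.50,CTN10,4', ...
    '01/03/2011,38:57.0,71.55,CTN11,1');
fclose(fid);

targetYear = 2010;
blockSize  = 1000;                          % hypothetical growth step
priceArray  = zeros(blockSize, 1, 'uint16'); % cents, assumes price <= 655.35
monthArray  = zeros(blockSize, 1, 'uint8');
dayArray    = zeros(blockSize, 1, 'uint8');
volumeArray = zeros(blockSize, 1, 'uint8');  % assumes at most 255 trades per row
n = 0;                                       % rows stored so far

fid = fopen(sampleFile, 'r');
tline = fgetl(fid);
while ischar(tline)
    mdy = sscanf(tline, '%d/%d/%d', 3);      % [month; day; year]
    if numel(mdy) == 3 && mdy(3) == targetYear
        c = find(tline == ',');              % comma positions
        thisPrice  = sscanf(tline(c(2)+1:c(3)-1), '%f');  % column 3
        thisVolume = sscanf(tline(c(4)+1:end), '%d');     % column 5
        n = n + 1;
        if n > numel(priceArray)             % out of room: grow by one block
            priceArray(end+blockSize)  = 0;
            monthArray(end+blockSize)  = 0;
            dayArray(end+blockSize)    = 0;
            volumeArray(end+blockSize) = 0;
        end
        priceArray(n)  = round(thisPrice * 100);  % converted to uint16 on assignment
        monthArray(n)  = mdy(1);
        dayArray(n)    = mdy(2);
        volumeArray(n) = thisVolume;
    end
    tline = fgetl(fid);
end
fclose(fid);
delete(sampleFile);

% Trim the unused tail of the preallocated arrays.
priceArray  = priceArray(1:n);
monthArray  = monthArray(1:n);
dayArray    = dayArray(1:n);
volumeArray = volumeArray(1:n);
```

Growing in blocks avoids reallocating the arrays on every accepted row, which matters when millions of rows pass the filter.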