Classification with a huge dataset

I'm trying to run classification on a huge dataset containing data from 6 people for training, and with the dataset from just 1 person I already get this error: "Requested 248376x39305 (9.1GB) array exceeds maximum array size preference." First I'm trying Bagged Tree and Neural Network classifiers, and I want to ask how I can do this. Is it possible to train these classifiers on portions of the dataset (i.e., continue training a saved classification model)?

9 comments

Please explain how 248376 x 39305 constitutes a 1-person data set.
[ I N ] = size(input)
[ O N ] = size(target)
Thanks,
Greg
Mindaugas Vaiciunas on 7 Nov 2016
Edited: Walter Roberson on 7 Nov 2016
Input matrix size: 248376 x 765
Target matrix size: 248376 x 1
Then, when I try to build the TreeBagger model, it allocates a 248376 x 39305 matrix. P.S. As you can see, one frame has 765 features.
Walter Roberson on 7 Nov 2016
Please show your Tree Bagging code. https://www.mathworks.com/help/stats/treebagger.html does not return matrices.
Right, it doesn't return matrices, because it can't even start due to the following RAM error. The code is simple:
Mdl = TreeBagger(50,Features,FeaturesTarget);
So I'm thinking about splitting all the training data into smaller files, but I don't know how to train the classifier again and again with those portions of data. I need something that lets me update a classifier with new data, without retraining the entire thing from scratch.
Walter Roberson on 7 Nov 2016
Have you considered reducing the number of trees?
Mindaugas Vaiciunas on 8 Nov 2016
Reducing the number of trees doesn't help. I tried reducing the training data for two different models, making them compact, and combining them; at first glance it helps, but I can't reach a high recognition rate. I think I need an "online" algorithm that can continue training a saved model on new data.
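For reference, the compact-and-combine approach mentioned here can be sketched roughly as follows. Variable names (Features, FeaturesTarget, chunkSize) and the tree count per chunk are placeholders, not the poster's actual settings. Note the caveat this sketch makes visible: each chunk's trees only ever see that chunk's rows, which may explain the lower recognition rate compared with training on all the data at once.

```matlab
% Sketch: train a TreeBagger per chunk of rows, compact each model to
% discard its stored training data, and append the trees into one
% CompactTreeBagger ensemble.
chunkSize = 50000;                    % rows per chunk (illustrative)
nRows = size(Features, 1);
combined = [];
for start = 1:chunkSize:nRows
    idx = start : min(start + chunkSize - 1, nRows);
    mdl = TreeBagger(10, Features(idx, :), FeaturesTarget(idx));
    c = compact(mdl);                 % drop the per-chunk training data
    if isempty(combined)
        combined = c;
    else
        combined = combine(combined, c);  % append trees to the ensemble
    end
end
% The combined ensemble predicts like a normal (compact) TreeBagger:
labels = predict(combined, Features(1:5, :));
```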
Greg Heath on 9 Nov 2016
Edited: Greg Heath on 9 Nov 2016
I still don't get it
39305/765
ans =
51.3791
Regardless, I think you should use dimensionality reduction via feature extraction.
Hope this helps,
Greg
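Dimensionality reduction via feature extraction, as suggested above, could for example be done with PCA using the `pca` function from the Statistics and Machine Learning Toolbox. The variable names and the 95% variance threshold below are illustrative, not from the thread:

```matlab
% Project the 248376 x 765 feature matrix onto its principal components
% and keep only enough components to explain ~95% of the variance.
[coeff, score, ~, ~, explained] = pca(Features);
k = find(cumsum(explained) >= 95, 1);   % smallest k reaching 95%
FeaturesReduced = score(:, 1:k);        % 248376 x k, with k << 765
```

The reduced matrix can then be fed to TreeBagger in place of the original features, shrinking the memory footprint roughly in proportion to k/765.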
Mindaugas Vaiciunas on 9 Nov 2016
One solution is to average some of the features for dimensionality reduction, but it may affect the recognition rate.
Greg Heath on 10 Nov 2016
Of course it will affect it. However, the way to choose is to set a limit on the loss of accuracy.
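If averaging features is the route taken, a minimal sketch is below. The group size is a free choice but must divide the feature count (765 = 5 × 153 here); since this is lossy, recognition accuracy should be compared before and after, per the advice above.

```matlab
% Average adjacent groups of 5 features, reducing 765 columns to 153.
group = 5;
[n, p] = size(Features);                       % e.g. 248376 x 765
tmp = reshape(Features', group, p/group, n);   % group x 153 x n
FeaturesAvg = squeeze(mean(tmp, 1))';          % n x 153
```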


Answers (1)

Walter Roberson on 7 Nov 2016

0 votes

Add more memory (RAM) to your computer. Then check or adjust Preferences -> MATLAB -> Workspace -> MATLAB array size limit.
Or, you could set the division ratios so that a much smaller fraction is used for training and validation, with most of the data left for testing. This effectively uses only a small subset of the data, but a different small subset each time you train.
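For the neural-network side, the division-ratio suggestion could be sketched like this with a `patternnet`; the network size and ratios are illustrative, and the targets are assumed to already be in the one-of-N layout that `patternnet` expects (not the 248376 x 1 label column as posted):

```matlab
% Shrink the training/validation fractions so each run optimizes over
% only a small random subset of the samples.
net = patternnet(10);
net.divideFcn = 'dividerand';
net.divideParam.trainRatio = 0.05;   % 5% used for training
net.divideParam.valRatio   = 0.05;   % 5% used for validation
net.divideParam.testRatio  = 0.90;   % the rest held out as test
[net, tr] = train(net, input', target');   % rows = features, cols = samples
```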

6 comments

Mindaugas Vaiciunas on 7 Nov 2016
More memory is not a solution for this; it would need around 36 GB of RAM for all the training data. With division ratios, would I be able to train the same saved model with small portions of data again and again?
Walter Roberson on 7 Nov 2016
Amazon Web Services, among other providers, make available machines with more than 36 Gb of RAM. If you had that much RAM your program would run; therefore adding RAM is a solution for the problem.
Mindaugas Vaiciunas on 8 Nov 2016
This project is not commercial; it's for a university master's degree. Adding RAM is not a solution for me, but thanks for the answer.
Walter Roberson on 8 Nov 2016
https://www.mathworks.com/products/parallel-computing/matlab-parallel-cloud/ 16 workers, 60 Gigabytes, $US 4.32 per hour educational pricing, including compute services.
Or if you provide your own EC2 instance, https://www.mathworks.com/products/parallel-computing/parallel-computing-on-the-cloud/distriben-ec2.html $0.07 per worker per hour for the software licensing from MathWorks. For example you could use https://aws.amazon.com/ec2/pricing/on-demand/ m4.4xlarge, 16 cores, 64 gigabytes, $US 0.958 per hour for the EC2 service. Between that and the $0.07 per worker from MathWorks it would come in at less than $US 2.50 per hour. About the price of a Starbucks "Grande" coffee.
Remember, your time is not really "free". At the very least you need to take into account "opportunity costs" -- like an hour spent fighting a memory issue is an hour you could have been working on a minimum wage job.
Mindaugas Vaiciunas on 9 Nov 2016
Thanks for the advice; I'll keep it in mind if there is no other solution.
Walter Roberson on 9 Nov 2016
Let me put it this way:
  • You do not wish to reduce the number of trees or the data, because doing so might decrease the recognition rate.
  • We do not have a magic low-memory implementation of the TreeBagger available.
  • You do not have enough memory on your system to run the classification using the existing software
Your choices would seem to be:
  • write the classifier yourself, somehow not using as much memory; or
  • obtain more memory for your own system; or
  • obtain use of a system with more memory



Asked: 6 Nov 2016
Last comment: 10 Nov 2016
