fitcecoc SVM with categorical predictors not predicting the correct label for multiclass problem.

Building a simple SVM model in MATLAB does not seem to predict the correct label when using categorical predictors in a multiclass problem.
The sample code is as follows:
% first model, train and test data are categorical
% the test data is closest to label 20
trainData = [1 1 1; 2 2 2; 2 3 3];
trainLabel = [10; 20; 30];
testData = [1 2 2];
model = fitcecoc(trainData,trainLabel,'CategoricalPredictors','all');
predictLabel = predict(model,testData);
disp(['predictLabel: ',num2str(predictLabel)]);
% second model, train and test data are same as above but represented as:
% 1 = 1 0 0, 2 = 0 1 0, 3 = 0 0 1
trainData2 = [1 0 0 1 0 0 1 0 0; 0 1 0 0 1 0 0 1 0; 0 1 0 0 0 1 0 0 1];
testData2 = [1 0 0 0 1 0 0 1 0];
model2 = fitcecoc(trainData2,trainLabel);
predictLabel2 = predict(model2,testData2);
disp(['predictLabel2: ',num2str(predictLabel2)]);
The first model should predict label 20, but it chooses label 30 instead. Based on my understanding of how SVM works, it should have chosen label 20. When I transform the first model, per this link, and reduce it to its binary representation as in model2, it predicts the correct label 20. As far as I'm aware, and per the previous link, the two models are logically identical. So either I am using incorrect syntax for the first model, or my understanding of how SVM works under the covers is incorrect (but then the two models above should still give the same result), or there is a bug in multiclass ECOC models with categorical predictors.
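The "closest to label 20" claim in the code comment can be sanity-checked by counting, for each training row, how many of the three categorical columns differ from the test row. This is only a rough check under the assumption that category mismatches are a fair notion of distance; it is not what fitcecoc computes internally, and the names testRep, mismatches and nearestLabel are introduced here purely for illustration.

```matlab
% Count per-row category mismatches between the test row and each training row.
trainData  = [1 1 1; 2 2 2; 2 3 3];
trainLabel = [10; 20; 30];
testData   = [1 2 2];

% repmat is used so the comparison does not rely on implicit expansion
testRep    = repmat(testData, size(trainData, 1), 1);
mismatches = sum(trainData ~= testRep, 2);   % one count per training row

% Label of the training row with the fewest mismatches
[~, idx]     = min(mismatches);
nearestLabel = trainLabel(idx);

disp(mismatches');
disp(['nearestLabel: ', num2str(nearestLabel)]);
```

The second training row differs from the test row in one column, versus two and three columns for the other rows, which is why label 20 is the intuitive answer.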
Any help is greatly appreciated - thanks!

Answers (1)

the cyclist on 13 Feb 2020
I'm pretty sure you've got your dummy encoding wrong.
You are treating 1, 2 and 3 as if they were the same categories in all three columns. But the columns are different explanatory variables, so the coding could be, for example:
  • 1st col: 1 = Blue, 2 = Red (notice there is no observation of 3 in the 1st column)
  • 2nd col: 1 = Democrat, 2 = Republican, 3 = Libertarian
  • 3rd col: 1 = Ford, 2 = BMW, 3 = Honda
Therefore, the correct dummy encoding is
trainData2 = dummyvar({categorical([1;2;2]),categorical([1;2;3]),categorical([1;2;3])});
trainData2 =
1 0 1 0 0 1 0 0
0 1 0 1 0 0 1 0
0 1 0 0 1 0 0 1
where the first two columns indicate Blue/Red, the next three columns indicate Dem/Rep/Lib, and the last three columns indicate Ford/BMW/Honda.
The correct test data for the dummy-encoded version is then
testData2 = [1 0 0 1 0 0 1 0]; % Because the test is Blue / Rep / BMW
Those inputs give me the same prediction for the dummy-encoded version as the categorical version.
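The per-column encoding described above can also be sketched in base MATLAB, without dummyvar, so that the test row is encoded with exactly the levels seen in each training column. This is a minimal illustration, not a replacement for dummyvar; the names trainEnc, testEnc and levels are made up for this sketch, and it assumes every test value already appears in the corresponding training column.

```matlab
% One-hot encode each column separately, using only the levels observed
% in that column of the training data.
trainData = [1 1 1; 2 2 2; 2 3 3];
testData  = [1 2 2];

trainEnc = [];
testEnc  = [];
for j = 1:size(trainData, 2)
    levels = unique(trainData(:, j))';   % row vector of levels in column j
    % bsxfun compares the column of values against the row of levels
    trainEnc = [trainEnc, double(bsxfun(@eq, trainData(:, j), levels))];
    testEnc  = [testEnc,  double(testData(j) == levels)];
end

disp(trainEnc);
disp(testEnc);
```

The first column contributes only two indicator columns because the value 3 never occurs in it, which gives eight columns in total rather than nine.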
  3 Comments
the cyclist on 16 Feb 2020
So, let's call my dummy encoding the third model. Then,
% first model, train and test data are categorical
% the test data is closest to label 20
trainData = [1 1 1;
2 2 2;
2 3 3];
trainLabel = [10;
20;
30];
testData = [1 2 2];
model = fitcecoc(trainData,trainLabel,'CategoricalPredictors','all');
predictLabel = predict(model,testData);
disp(['predictLabel: ',num2str(predictLabel)]);
% second model, train and test data are same as above but represented as:
% 1 = 1 0 0, 2 = 0 1 0, 3 = 0 0 1
trainData2 = [1 0 0 1 0 0 1 0 0;
0 1 0 0 1 0 0 1 0;
0 1 0 0 0 1 0 0 1];
testData2 = [1 0 0 0 1 0 0 1 0];
model2 = fitcecoc(trainData2,trainLabel);
predictLabel2 = predict(model2,testData2);
disp(['predictLabel2: ',num2str(predictLabel2)]);
% third model
trainData3 = dummyvar({categorical([1;2;2]),categorical([1;2;3]),categorical([1;2;3])});
testData3 = [1 0 0 1 0 0 1 0]; % Because the test is Blue / Rep / BMW
model3 = fitcecoc(trainData3,trainLabel);
predictLabel3 = predict(model3,testData3);
disp(['predictLabel3: ',num2str(predictLabel3)]);
The weird thing is that I could have sworn that models #1 and #3 were the ones that gave the same result. I think the reason may have been that I was also playing around with the name-value pair ['CategoricalPredictors','all'] for the dummy-encoded models as well. When I do that, everything gives the same answer.
I'm frankly not sure at the moment if it makes sense to use that for the dummy-encoded models. I'm not able to spend time right now thinking about it, but thought I would toss that idea out there.
John Pfeifer on 16 Feb 2020
Thanks for putting some time towards this.
As far as I can tell, for the first model Matlab should be internally converting the data into a categorical representation, similar to your model 3, but it does not seem to be happening correctly.
The workaround is to just explicitly convert the data to a categorical representation, so I'll just go with that.
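One way to look at what the first model does internally is to inspect the trained ECOC object. The sketch below assumes the Statistics and Machine Learning Toolbox is installed and that your release exposes the ExpandedPredictorNames, ClassNames and CodingMatrix properties on the model returned by fitcecoc; it only prints them and makes no claim about what they will show for this data. The variable name expandedNames is introduced here for illustration.

```matlab
% Requires Statistics and Machine Learning Toolbox.
trainData  = [1 1 1; 2 2 2; 2 3 3];
trainLabel = [10; 20; 30];
model = fitcecoc(trainData, trainLabel, 'CategoricalPredictors', 'all');

% Predictor names after the categorical columns have been expanded
expandedNames = model.ExpandedPredictorNames;
disp(expandedNames(:));
disp(['number of expanded predictors: ', num2str(numel(expandedNames))]);

% Class order and coding design used by the ECOC model
disp(model.ClassNames');
disp(model.CodingMatrix);
```

Comparing the number of expanded predictors with the eight columns of the explicit dummy encoding is one way to check whether the two representations line up.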

Release

R2018b
