ClassificationPartitionedECOC

Cross-validated multiclass ECOC model for support vector machines (SVMs) and other classifiers

Description

ClassificationPartitionedECOC is a set of error-correcting output codes (ECOC) models trained on cross-validated folds. Estimate the quality of the cross-validated classification by using one or more “kfold” functions: kfoldPredict, kfoldLoss, kfoldMargin, kfoldEdge, and kfoldfun.

Every “kfold” method uses models trained on training-fold (in-fold) observations to predict the response for validation-fold (out-of-fold) observations. For example, suppose you cross-validate using five folds. In this case, the software randomly assigns each observation into five groups of equal size (roughly). The training fold contains four of the groups (roughly 4/5 of the data), and the validation fold contains the other group (roughly 1/5 of the data). In this case, cross-validation proceeds as follows:

The software trains the first model (stored in CVMdl.Trained{1}) by using the observations in the last four groups and reserves the observations in the first group for validation.
The software trains the second model (stored in CVMdl.Trained{2}) by using the observations in the first group and the last three groups. The software reserves the observations in the second group for validation.
The software proceeds in a similar fashion for the third, fourth, and fifth models.

If you validate by using kfoldPredict, the software computes predictions for the observations in group i by using the ith model. In short, the software estimates a response for every observation by using the model trained without that observation.

Creation

You can create a ClassificationPartitionedECOC model in two ways:

Create a cross-validated ECOC model from an ECOC model by using the crossval object function.
Create a cross-validated ECOC model by using the fitcecoc function and specifying one of the name-value pair arguments 'CrossVal', 'CVPartition', 'Holdout', 'KFold', or 'Leaveout'.

Properties

expand all

Cross-Validation Properties

`CrossValidatedModel` — Cross-validated model name
character vector

Cross-validated model name, specified as a character vector.

For example, 'ECOC' specifies a cross-validated ECOC model.

Data Types: char

`KFold` — Number of cross-validated folds
positive integer

Number of cross-validated folds, specified as a positive integer.

Data Types: double

`ModelParameters` — Cross-validation parameter values
object

Cross-validation parameter values, specified as an object. The parameter values correspond to the name-value pair argument values used to cross-validate the ECOC classifier. ModelParameters does not contain estimated parameters.

You can access the properties of ModelParameters using dot notation.

`NumObservations` — Number of observations
positive numeric scalar

Number of observations in the training data, specified as a positive numeric scalar.

Data Types: double

`Partition` — Data partition
`cvpartition` model

Data partition indicating how the software splits the data into cross-validation folds, specified as a cvpartition model.

`Trained` — Compact classifiers trained on cross-validation folds
cell array of `CompactClassificationECOC` models

Compact classifiers trained on cross-validation folds, specified as a cell array of CompactClassificationECOC models. Trained has k cells, where k is the number of folds.

Data Types: cell

`W` — Observation weights
numeric vector

Observation weights used to cross-validate the model, specified as a numeric vector. W has NumObservations elements.

The software normalizes the weights used for training so that sum(W,'omitnan') is 1.

Data Types: single | double

`X` — Unstandardized predictor data
numeric matrix | table

Unstandardized predictor data used to cross-validate the classifier, specified as a numeric matrix or table.

Each row of X corresponds to one observation, and each column corresponds to one variable.

Data Types: single | double | table

`Y` — Observed class labels
categorical array | character array | logical vector | numeric vector | cell array of character vectors

Observed class labels used to cross-validate the model, specified as a categorical or character array, logical or numeric vector, or cell array of character vectors. Y has NumObservations elements and has the same data type as the input argument Y that you pass to fitcecoc to cross-validate the model. (The software treats string arrays as cell arrays of character vectors.)

Each row of Y represents the observed classification of the corresponding row of X.

ECOC Properties

`BinaryLoss` — Binary learner loss function
`'binodeviance'` | `'exponential'` | `'hamming'` | `'hinge'` | `'linear'` | `'logit'` | `'quadratic'`

Binary learner loss function, specified as a character vector representing the loss function name.

This table identifies the default BinaryLoss value, which depends on the score ranges returned by the binary learners.

Assumption	Default Value
All binary learners are any of the following: Classification decision trees Discriminant analysis models k-nearest neighbor models Naive Bayes models	`'quadratic'`
All binary learners are SVMs.	`'hinge'`
All binary learners are ensembles trained by `AdaboostM1` or `GentleBoost`.	`'exponential'`
All binary learners are ensembles trained by `LogitBoost`.	`'binodeviance'`
You specify to predict class posterior probabilities by setting `'FitPosterior',true` in `fitcecoc`.	`'quadratic'`
Binary learners are heterogeneous and use different loss functions.	`'hamming'`

To check the default value, use dot notation to display the BinaryLoss property of the trained model at the command line.

To potentially increase accuracy, specify a binary loss function other than the default during a prediction or loss computation by using the BinaryLoss name-value argument of kfoldPredict or kfoldLoss. For more information, see Binary Loss.

Data Types: char

`BinaryY` — Binary learner class labels
numeric matrix | `[]`

Binary learner class labels, specified as a numeric matrix or [].

If the coding matrix is the same across all folds, then BinaryY is a NumObservations-by-L matrix, where L is the number of binary learners (size(CodingMatrix,2)).

The elements of BinaryY are –1, 0, and 1, and the values correspond to dichotomous class assignments. This table describes how learner j assigns observation k to a dichotomous class corresponding to the value of BinaryY(k,j).

Value	Dichotomous Class Assignment
`–1`	Learner `j` assigns observation `k` to a negative class.
`0`	Before training, learner `j` removes observation `k` from the data set.
`1`	Learner `j` assigns observation `k` to a positive class.

If the coding matrix varies across folds, then BinaryY is empty ([]).

Data Types: double

`CodingMatrix` — Codes specifying class assignments
numeric matrix | `[]`

Codes specifying class assignments for the binary learners, specified as a numeric matrix or [].

If the coding matrix is the same across all folds, then CodingMatrix is a K-by-L matrix, where K is the number of classes and L is the number of binary learners.

The elements of CodingMatrix are –1, 0, and 1, and the values correspond to dichotomous class assignments. This table describes how learner j assigns observations in class i to a dichotomous class corresponding to the value of CodingMatrix(i,j).

Value	Dichotomous Class Assignment
`–1`	Learner `j` assigns observations in class `i` to a negative class.
`0`	Before training, learner `j` removes observations in class `i` from the data set.
`1`	Learner `j` assigns observations in class `i` to a positive class.

If the coding matrix varies across folds, then CodingMatrix is empty ([]). You can obtain the coding matrix for each fold by using the Trained property. For example, CVMdl.Trained{1}.CodingMatrix is the coding matrix in the first fold of the cross-validated ECOC model CVMdl.

Data Types: double | single | int8 | int16 | int32 | int64

Other Classification Properties

`CategoricalPredictors` — Categorical predictor indices
vector of positive integers | `[]`

Categorical predictor indices, specified as a vector of positive integers. CategoricalPredictors contains index values indicating that the corresponding predictors are categorical. The index values are between 1 and p, where p is the number of predictors used to train the model. If none of the predictors are categorical, then this property is empty ([]).

Data Types: single | double

`ClassNames` — Unique class labels
categorical array | character array | logical vector | numeric vector | cell array of character vectors

This property is read-only.

Unique class labels used in training, specified as a categorical or character array, logical or numeric vector, or cell array of character vectors. ClassNames has the same data type as the class labels Y. (The software treats string arrays as cell arrays of character vectors.) ClassNames also determines the class order.

`Cost` — Misclassification costs
square numeric matrix

This property is read-only.

Misclassification costs, specified as a square numeric matrix. Cost has K rows and columns, where K is the number of classes.

Cost(i,j) is the cost of classifying a point into class j if its true class is i. The order of the rows and columns of Cost corresponds to the order of the classes in ClassNames.

Data Types: double

`PredictorNames` — Predictor names
cell array of character vectors

This property is read-only.

Predictor names in order of their appearance in the predictor data X, specified as a cell array of character vectors. The length of PredictorNames is equal to the number of columns in X.

Data Types: cell

`Prior` — Prior class probabilities
numeric vector

This property is read-only.

Prior class probabilities, specified as a numeric vector. Prior has as many elements as the number of classes in ClassNames, and the order of the elements corresponds to the order of the classes in ClassNames.

fitcecoc incorporates misclassification costs differently among different types of binary learners.

Data Types: double

`ResponseName` — Response variable name
character vector

Response variable name, specified as a character vector.

Data Types: char

`ScoreTransform` — Score transformation function to apply to predicted scores
`'none'`

This property is read-only.

Score transformation function to apply to the predicted scores, specified as 'none'. An ECOC model does not support score transformation.

Object Functions

`gather`	Gather properties of Statistics and Machine Learning Toolbox object from GPU
`kfoldEdge`	Classification edge for cross-validated ECOC model
`kfoldLoss`	Classification loss for cross-validated ECOC model
`kfoldMargin`	Classification margins for cross-validated ECOC model
`kfoldPredict`	Classify observations in cross-validated ECOC model
`kfoldfun`	Cross-validate function using cross-validated ECOC model

Examples

collapse all

Cross-Validate ECOC Classifier

Open Live Script

Cross-validate an ECOC classifier with SVM binary learners, and estimate the generalized classification error.

Load Fisher's iris data set. Specify the predictor data X and the response data Y.

load fisheriris
X = meas;
Y = species;
rng(1); % For reproducibility

Create an SVM template, and standardize the predictors.

t = templateSVM('Standardize',true)

t = 
Fit template for SVM.
    Standardize: 1

t is an SVM template. Most of the template object properties are empty. When training the ECOC classifier, the software sets the applicable properties to their default values.

Train the ECOC classifier, and specify the class order.

Mdl = fitcecoc(X,Y,'Learners',t,...
    'ClassNames',{'setosa','versicolor','virginica'});

Mdl is a ClassificationECOC classifier. You can access its properties using dot notation.

Cross-validate Mdl using 10-fold cross-validation.

CVMdl = crossval(Mdl);

CVMdl is a ClassificationPartitionedECOC cross-validated ECOC classifier.

Estimate the generalized classification error.

genError = kfoldLoss(CVMdl)

genError = 
0.0400

The generalized classification error is 4%, which indicates that the ECOC classifier generalizes fairly well.

Speed Up Training ECOC Classifiers Using Binning and Parallel Computing

This example uses:

Open Live Script

Train a one-versus-all ECOC classifier using a GentleBoost ensemble of decision trees with surrogate splits. To speed up training, bin numeric predictors and use parallel computing. Binning is valid only when fitcecoc uses a tree learner. After training, estimate the classification error using 10-fold cross-validation. Note that parallel computing requires Parallel Computing Toolbox™.

Load Sample Data

Load and inspect the arrhythmia data set.

load arrhythmia
[n,p] = size(X)

n = 452

p = 279

isLabels = unique(Y);
nLabels = numel(isLabels)

nLabels = 13

tabulate(categorical(Y))

  Value    Count   Percent
      1      245     54.20%
      2       44      9.73%
      3       15      3.32%
      4       15      3.32%
      5       13      2.88%
      6       25      5.53%
      7        3      0.66%
      8        2      0.44%
      9        9      1.99%
     10       50     11.06%
     14        4      0.88%
     15        5      1.11%
     16       22      4.87%

The data set contains 279 predictors, and the sample size of 452 is relatively small. Of the 16 distinct labels, only 13 are represented in the response (Y). Each label describes various degrees of arrhythmia, and 54.20% of the observations are in class 1.

Train One-Versus-All ECOC Classifier

Create an ensemble template. You must specify at least three arguments: a method, a number of learners, and the type of learner. For this example, specify 'GentleBoost' for the method, 100 for the number of learners, and a decision tree template that uses surrogate splits because there are missing observations.

tTree = templateTree('surrogate','on');
tEnsemble = templateEnsemble('GentleBoost',100,tTree);

tEnsemble is a template object. Most of its properties are empty, but the software fills them with their default values during training.

Train a one-versus-all ECOC classifier using the ensembles of decision trees as binary learners. To speed up training, use binning and parallel computing.

Binning ('NumBins',50) — When you have a large training data set, you can speed up training (a potential decrease in accuracy) by using the 'NumBins' name-value pair argument. This argument is valid only when fitcecoc uses a tree learner. If you specify the 'NumBins' value, then the software bins every numeric predictor into a specified number of equiprobable bins, and then grows trees on the bin indices instead of the original data. You can try 'NumBins',50 first, and then change the 'NumBins' value depending on the accuracy and training speed.
Parallel computing ('Options',statset('UseParallel',true)) — With a Parallel Computing Toolbox license, you can speed up the computation by using parallel computing, which sends each binary learner to a worker in the pool. The number of workers depends on your system configuration. When you use decision trees for binary learners, fitcecoc parallelizes training using Intel® Threading Building Blocks (TBB) for dual-core systems and above. Therefore, specifying the 'UseParallel' option is not helpful on a single computer. Use this option on a cluster.

Additionally, specify that the prior probabilities are 1/K, where K = 13 is the number of distinct classes.

options = statset('UseParallel',true);
Mdl = fitcecoc(X,Y,'Coding','onevsall','Learners',tEnsemble,...
                'Prior','uniform','NumBins',50,'Options',options);

Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 6).

Mdl is a ClassificationECOC model.

Cross-Validation

Cross-validate the ECOC classifier using 10-fold cross-validation.

CVMdl = crossval(Mdl,'Options',options);

Warning: One or more folds do not contain points from all the groups.

CVMdl is a ClassificationPartitionedECOC model. The warning indicates that some classes are not represented while the software trains at least one fold. Therefore, those folds cannot predict labels for the missing classes. You can inspect the results of a fold using cell indexing and dot notation. For example, access the results of the first fold by entering CVMdl.Trained{1}.

Use the cross-validated ECOC classifier to predict validation-fold labels. You can compute the confusion matrix by using confusionchart. Move and resize the chart by changing the inner position property to ensure that the percentages appear in the row summary.

oofLabel = kfoldPredict(CVMdl,'Options',options);
ConfMat = confusionchart(Y,oofLabel,'RowSummary','total-normalized');
ConfMat.InnerPosition = [0.10 0.12 0.85 0.85];

Reproduce Binned Data

Reproduce binned predictor data by using the BinEdges property of the trained model and the discretize function.

X = Mdl.X; % Predictor data
Xbinned = zeros(size(X));
edges = Mdl.BinEdges;
% Find indices of binned predictors.
idxNumeric = find(~cellfun(@isempty,edges));
if iscolumn(idxNumeric)
    idxNumeric = idxNumeric';
end
for j = idxNumeric 
    x = X(:,j);
    % Convert x to array if x is a table.
    if istable(x)
        x = table2array(x);
    end
    % Group x into bins by using the discretize function.
    xbinned = discretize(x,[-inf; edges{j}; inf]);
    Xbinned(:,j) = xbinned;
end

Xbinned contains the bin indices, ranging from 1 to the number of bins, for numeric predictors. Xbinned values are 0 for categorical predictors. If X contains NaNs, then the corresponding Xbinned values are NaNs.

Extended Capabilities

GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.

The following object functions fully support GPU arrays:
The kfoldfun object function offers limited support for GPU arrays.

For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).

Version History

Introduced in R2014b

expand all

R2024b: Specify linear and ensemble learners for `gpuArray` sample data

You can specify linear and ensemble learners when you create a ClassificationPartitionedECOC object by passing gpuArray sample data to fitcecoc.

ClassificationPartitionedECOC

Description

Creation

Properties

Cross-Validation Properties

CrossValidatedModel — Cross-validated model name character vector

KFold — Number of cross-validated folds positive integer

ModelParameters — Cross-validation parameter values object

NumObservations — Number of observations positive numeric scalar

Partition — Data partition cvpartition model

Trained — Compact classifiers trained on cross-validation folds cell array of CompactClassificationECOC models

W — Observation weights numeric vector

X — Unstandardized predictor data numeric matrix | table

Y — Observed class labels categorical array | character array | logical vector | numeric vector | cell array of character vectors

ECOC Properties

BinaryLoss — Binary learner loss function 'binodeviance' | 'exponential' | 'hamming' | 'hinge' | 'linear' | 'logit' | 'quadratic'

BinaryY — Binary learner class labels numeric matrix | []

CodingMatrix — Codes specifying class assignments numeric matrix | []

Other Classification Properties

CategoricalPredictors — Categorical predictor indices vector of positive integers | []

ClassNames — Unique class labels categorical array | character array | logical vector | numeric vector | cell array of character vectors

Cost — Misclassification costs square numeric matrix

PredictorNames — Predictor names cell array of character vectors

Prior — Prior class probabilities numeric vector

ResponseName — Response variable name character vector

ScoreTransform — Score transformation function to apply to predicted scores 'none'

Object Functions

Examples

Cross-Validate ECOC Classifier

Speed Up Training ECOC Classifiers Using Binning and Parallel Computing

Extended Capabilities

GPU Arrays Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.

Version History

R2024b: Specify linear and ensemble learners for gpuArray sample data

See Also

`CrossValidatedModel` — Cross-validated model name
character vector

`KFold` — Number of cross-validated folds
positive integer

`ModelParameters` — Cross-validation parameter values
object

`NumObservations` — Number of observations
positive numeric scalar

`Partition` — Data partition
`cvpartition` model

`Trained` — Compact classifiers trained on cross-validation folds
cell array of `CompactClassificationECOC` models

`W` — Observation weights
numeric vector

`X` — Unstandardized predictor data
numeric matrix | table

`Y` — Observed class labels
categorical array | character array | logical vector | numeric vector | cell array of character vectors

`BinaryLoss` — Binary learner loss function
`'binodeviance'` | `'exponential'` | `'hamming'` | `'hinge'` | `'linear'` | `'logit'` | `'quadratic'`

`BinaryY` — Binary learner class labels
numeric matrix | `[]`

`CodingMatrix` — Codes specifying class assignments
numeric matrix | `[]`

`CategoricalPredictors` — Categorical predictor indices
vector of positive integers | `[]`

`ClassNames` — Unique class labels
categorical array | character array | logical vector | numeric vector | cell array of character vectors

`Cost` — Misclassification costs
square numeric matrix

`PredictorNames` — Predictor names
cell array of character vectors

`Prior` — Prior class probabilities
numeric vector

`ResponseName` — Response variable name
character vector

`ScoreTransform` — Score transformation function to apply to predicted scores
`'none'`

GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.

R2024b: Specify linear and ensemble learners for `gpuArray` sample data