resubPredict

Predict resubstitution labels of k-nearest neighbor classifier

Description

label = resubPredict(mdl) returns the class labels that the k-nearest neighbor classifier mdl predicts for its training data mdl.X, that is, the data that fitcknn used to create mdl.

[label,score] = resubPredict(mdl) also returns the posterior class probabilities for the predictions.

[label,score,cost] = resubPredict(mdl) also returns the expected misclassification costs for each observation and class.

Examples

Examine the quality of a classifier by its resubstitution predictions.

Load the Fisher iris data set.

load fisheriris
X = meas;
Y = species;

Create a classifier for five nearest neighbors.

mdl = fitcknn(X,Y,'NumNeighbors',5);

Generate the resubstitution predictions.

label = resubPredict(mdl);

Count the observations for which the predicted label in label differs from the true label in Y.

mydiff = not(strcmp(Y,label)); % mydiff(i) = 1 means they differ
sum(mydiff) % Number of differences
ans = 5

A value of 1 in mydiff indicates that the observed label differs from the corresponding predicted label. This example has five misclassifications.
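The count above can be broken down by class. The following is a minimal sketch, assuming the same Fisher iris model as in the example; the variable names confMat, classOrder, errRate, and toolboxErr are illustrative and not part of the original example. It tabulates true against predicted labels with confusionmat and compares the hand-computed error fraction with resubLoss using the 'classiferror' loss.

```matlab
% Rebuild the model so that this block runs on its own.
load fisheriris
mdl = fitcknn(meas,species,'NumNeighbors',5);
label = resubPredict(mdl);

% Rows are true classes and columns are predicted classes,
% both in the order listed in classOrder.
[confMat,classOrder] = confusionmat(species,label);

% Fraction of training observations whose prediction differs from the truth.
errRate = mean(~strcmp(species,label));

% Resubstitution classification error as computed by the toolbox.
toolboxErr = resubLoss(mdl,'LossFun','classiferror');
```

With the default equal observation weights, errRate and toolboxErr should agree up to rounding. Off-diagonal entries of confMat show which pairs of classes are confused.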

Input Arguments

mdl: k-nearest neighbor classifier model, specified as a ClassificationKNN object.

Output Arguments

label: Predicted class labels for the observations (rows) in the training data X, returned as a categorical array, character array, logical vector, numeric vector, or cell array of character vectors. label has length equal to the number of rows in X. The label is the class with minimal expected cost. See Predicted Class Label.

score: Predicted class scores or posterior probabilities, returned as a numeric matrix of size n-by-K. n is the number of observations (rows) in the training data X, and K is the number of classes (in mdl.ClassNames). score(i,j) is the posterior probability that observation i in X is of class j in mdl.ClassNames. See Posterior Probability.

cost: Expected classification costs, returned as a numeric matrix of size n-by-K. n is the number of observations (rows) in the training data X, and K is the number of classes (in mdl.ClassNames). cost(i,j) is the expected cost of classifying row i of X as class j in mdl.ClassNames. See Expected Cost.
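As a quick check of the shapes described above, the following sketch requests all three outputs for a model trained on the Fisher iris data (an assumed data set, chosen to match the example on this page). The names n, K, and rowSums are illustrative.

```matlab
load fisheriris
mdl = fitcknn(meas,species,'NumNeighbors',5);
[label,score,cost] = resubPredict(mdl);

n = size(mdl.X,1);            % number of training observations
K = numel(mdl.ClassNames);    % number of classes

size(label)   % one label per training observation (n-by-1)
size(score)   % n-by-K, columns ordered as in mdl.ClassNames
size(cost)    % n-by-K, columns ordered as in mdl.ClassNames

% Each row of score holds posterior probabilities, so it should sum to 1.
rowSums = sum(score,2);
```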

Tips

  • If you standardize the predictor data, that is, mdl.Mu and mdl.Sigma are not empty ([]), then resubPredict standardizes the predictor data before predicting labels.
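A minimal sketch of the tip above, assuming the Fisher iris data and the 'Standardize' name-value argument of fitcknn. The names mdlRaw, mdlStd, and nDiff are illustrative. The sketch trains one model with standardization and one without, then counts how many resubstitution predictions differ.

```matlab
load fisheriris
mdlRaw = fitcknn(meas,species,'NumNeighbors',5);
mdlStd = fitcknn(meas,species,'NumNeighbors',5,'Standardize',true);

% Mu and Sigma are empty unless the model was trained with standardization.
rawHasMu = ~isempty(mdlRaw.Mu);
stdHasMu = ~isempty(mdlStd.Mu);

% resubPredict applies the stored Mu and Sigma automatically for mdlStd.
labelRaw = resubPredict(mdlRaw);
labelStd = resubPredict(mdlStd);

% Number of training observations whose prediction changes with standardization.
nDiff = sum(~strcmp(labelRaw,labelStd));
```

No manual scaling of mdl.X is needed before calling resubPredict on the standardized model.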

Algorithms

Predicted Class Label

resubPredict classifies by minimizing the expected classification cost:

$$\hat{y} = \underset{y=1,\ldots,K}{\arg\min} \sum_{j=1}^{K} \hat{P}(j|x)\, C(y|j),$$

where

  • $\hat{y}$ is the predicted classification.

  • K is the number of classes.

  • $\hat{P}(j|x)$ is the posterior probability of class j for observation x.

  • C(y|j) is the cost of classifying an observation as y when its true class is j.
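The minimization rule can be reproduced from the third output of resubPredict. This is a sketch under the assumption that the model uses default settings on the Fisher iris data; manualLabel and nMismatch are illustrative names. It picks, for each observation, the column of cost with the smallest value and maps that column index back to a class name.

```matlab
load fisheriris
mdl = fitcknn(meas,species,'NumNeighbors',5);
[label,~,cost] = resubPredict(mdl);

% Index of the class with minimal expected cost for each observation.
% min returns the first index when several columns are tied.
[~,idx] = min(cost,[],2);

% Map column indices to class names (columns follow mdl.ClassNames).
manualLabel = mdl.ClassNames(idx);

% Number of observations where the manual rule and resubPredict disagree.
nMismatch = sum(~strcmp(label,manualLabel));
```

If several classes tie for the minimal cost, the result of this manual rule might depend on the tie-breaking setting of the model, so treat any disagreement as a prompt to inspect ties rather than as an error.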

Posterior Probability

Consider a vector (single query point) xnew and a model mdl.

  • k is the number of nearest neighbors used in prediction, mdl.NumNeighbors.

  • nbd(mdl,xnew) specifies the k nearest neighbors to xnew in mdl.X.

  • Y(nbd) specifies the classifications of the points in nbd(mdl,xnew), namely mdl.Y(nbd).

  • W(nbd) specifies the weights of the points in nbd(mdl,xnew).

  • prior specifies the priors of the classes in mdl.Y.

If the model contains a vector of prior probabilities, then the observation weights W are normalized by class to sum to the priors. This process might involve a calculation for the point xnew, because weights can depend on the distance from xnew to the points in mdl.X.

The posterior probability p(j|xnew) is

$$p(j|x_{\mathrm{new}}) = \frac{\sum_{i \in \mathrm{nbd}} W(i)\, 1_{Y(X(i))=j}}{\sum_{i \in \mathrm{nbd}} W(i)}.$$

Here, $1_{Y(X(i))=j}$ is 1 when mdl.Y(i) = j, and 0 otherwise.
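For equal observation weights, the posterior probability reduces to the fraction of the k nearest neighbors that belong to each class. The following sketch illustrates this for one training point, assuming the Fisher iris data, the default Euclidean distance, and equal weights; the row index q and the names nbd and manualPost are hypothetical choices for illustration. It uses knnsearch to find the neighbors, which might resolve distance ties differently from the model.

```matlab
load fisheriris
mdl = fitcknn(meas,species,'NumNeighbors',5);
[~,score] = resubPredict(mdl);

q = 100;                 % hypothetical query row, chosen for illustration
xnew = mdl.X(q,:);

% Indices of the k nearest training points to xnew (xnew itself included).
nbd = knnsearch(mdl.X,xnew,'K',mdl.NumNeighbors);

% With equal weights, the posterior for class j is the fraction of
% neighbors whose label equals class j.
manualPost = cellfun(@(c) mean(strcmp(mdl.Y(nbd),c)), mdl.ClassNames)';

% Stack the manual result over the toolbox result for comparison.
comparison = [manualPost; score(q,:)];
```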

True Misclassification Cost

Two costs are associated with KNN classification: the true misclassification cost per class and the expected misclassification cost per observation.

You can set the true misclassification cost per class by using the 'Cost' name-value pair argument when you run fitcknn. The value Cost(i,j) is the cost of classifying an observation into class j if its true class is i. By default, Cost(i,j) = 1 if i ~= j, and Cost(i,j) = 0 if i = j. In other words, the cost is 0 for correct classification and 1 for incorrect classification.
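As an illustration of setting the true misclassification cost, the following sketch passes a cost matrix to fitcknn. The cost values are hypothetical and chosen only to show the indexing convention; classOrder and costMat are illustrative names. The 'ClassNames' argument fixes the class order so that the rows and columns of the matrix are unambiguous.

```matlab
load fisheriris
classOrder = {'setosa','versicolor','virginica'};

% Hypothetical costs: row i is the true class, column j is the predicted class.
% Here, predicting versicolor for a true virginica costs 5 instead of 1.
costMat = [0 1 1;
           1 0 1;
           1 5 0];

mdl = fitcknn(meas,species,'NumNeighbors',5, ...
    'ClassNames',classOrder,'Cost',costMat);

% The stored cost matrix, with rows and columns ordered as in mdl.ClassNames.
storedCost = mdl.Cost;
```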

Expected Cost

Two costs are associated with KNN classification: the true misclassification cost per class and the expected misclassification cost per observation. The third output of resubPredict is the expected misclassification cost per observation.

Suppose you have Nobs observations that you classified with a trained classifier mdl, and you have K classes. The command

[label,score,cost] = resubPredict(mdl)

returns a matrix cost of size Nobs-by-K, among other outputs. Each row of the cost matrix contains the expected (average) cost of classifying the observation into each of the K classes. cost(n,j) is

$$\sum_{i=1}^{K} \hat{P}(i|X(n))\, C(j|i),$$

where

  • K is the number of classes.

  • $\hat{P}(i|X(n))$ is the posterior probability of class i for observation X(n).

  • C(j|i) is the true misclassification cost of classifying an observation as j when its true class is i.
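Because Cost(i,j) is the cost of predicting class j when the true class is i, the sum above is a matrix product of the posterior probabilities and the cost matrix. The following sketch checks this on the Fisher iris data with the default cost matrix; manualCost and maxAbsDiff are illustrative names.

```matlab
load fisheriris
mdl = fitcknn(meas,species,'NumNeighbors',5);
[~,score,cost] = resubPredict(mdl);

% manualCost(n,j) = sum over i of score(n,i) * mdl.Cost(i,j)
manualCost = score * mdl.Cost;

% Largest absolute difference from the cost returned by resubPredict.
maxAbsDiff = max(abs(manualCost(:) - cost(:)));
```

With the default cost matrix (0 on the diagonal and 1 elsewhere), each entry of the expected cost is 1 minus the corresponding posterior probability.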

Introduced in R2012a