# predict

Predict labels using k-nearest neighbor classification model

## Syntax

`label = predict(mdl,X)`

`[label,score,cost] = predict(mdl,X)`

## Description

`label = predict(mdl,X)` returns a vector of predicted class labels for the predictor data in the table or matrix `X`, based on the trained k-nearest neighbor classification model `mdl`. See Predicted Class Label.

`[label,score,cost] = predict(mdl,X)` also returns:

• A matrix of classification scores (`score`) indicating the likelihood that a label comes from a particular class. For k-nearest neighbor, scores are posterior probabilities. See Posterior Probability.

• A matrix of expected classification costs (`cost`). For each observation in `X`, the predicted class label corresponds to the minimum expected classification cost among all classes. See Expected Cost.

## Examples

Create a k-nearest neighbor classifier for Fisher's iris data, where k = 5. Evaluate some model predictions on new data.

Load the Fisher iris data set.

```
load fisheriris
X = meas;
Y = species;
```

Create a classifier for five nearest neighbors. Standardize the noncategorical predictor data.

`mdl = fitcknn(X,Y,'NumNeighbors',5,'Standardize',1);`

Predict the classifications for flowers with minimum, mean, and maximum characteristics.

```
Xnew = [min(X);mean(X);max(X)];
[label,score,cost] = predict(mdl,Xnew)
```
```
label = 3x1 cell array
    {'versicolor'}
    {'versicolor'}
    {'virginica' }
```
```
score = 3×3

    0.4000    0.6000         0
         0    1.0000         0
         0         0    1.0000
```
```
cost = 3×3

    0.6000    0.4000    1.0000
    1.0000         0    1.0000
    1.0000    1.0000         0
```

The second and third rows of the score and cost matrices have binary values, which means all five nearest neighbors of the mean and maximum flower measurements have identical classifications.

## Input Arguments

`mdl` is the k-nearest neighbor classifier model, specified as a `ClassificationKNN` object.

`X` is the predictor data to be classified, specified as a numeric matrix or table.

Each row of `X` corresponds to one observation, and each column corresponds to one variable.

• For a numeric matrix:

• The variables that make up the columns of `X` must have the same order as the predictor variables used to train `mdl`.

• If you train `mdl` using a table (for example, `Tbl`), then `X` can be a numeric matrix if `Tbl` contains all numeric predictor variables. k-nearest neighbor classification requires homogeneous predictors. Therefore, to treat all numeric predictors in `Tbl` as categorical during training, set `'CategoricalPredictors','all'` when you train using `fitcknn`. If `Tbl` contains heterogeneous predictors (for example, numeric and categorical data types) and `X` is a numeric matrix, then `predict` throws an error.

• For a table:

• `predict` does not support multicolumn variables and cell arrays other than cell arrays of character vectors.

• If you train `mdl` using a table (for example, `Tbl`), then all predictor variables in `X` must have the same variable names and data types as those used to train `mdl` (stored in `mdl.PredictorNames`). However, the column order of `X` does not need to correspond to the column order of `Tbl`. Both `Tbl` and `X` can contain additional variables (response variables, observation weights, and so on), but `predict` ignores them.

• If you train `mdl` using a numeric matrix, then the predictor names in `mdl.PredictorNames` and corresponding predictor variable names in `X` must be the same. To specify predictor names during training, see the `PredictorNames` name-value pair argument of `fitcknn`. All predictor variables in `X` must be numeric vectors. `X` can contain additional variables (response variables, observation weights, and so on), but `predict` ignores them.
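
The table rules above can be illustrated with a minimal sketch. It assumes Statistics and Machine Learning Toolbox is available; the variable names `SL`, `SW`, `PL`, `PW`, and `Species` are hypothetical names chosen for this illustration and are not part of the original example.

```matlab
% Build a training table from the Fisher iris data (hypothetical variable names).
load fisheriris
Tbl = array2table(meas,'VariableNames',{'SL','SW','PL','PW'});
Tbl.Species = species;

% Train on the table, naming the response variable.
mdl = fitcknn(Tbl,'Species','NumNeighbors',5);

% Same observations, once in training column order and once shuffled.
% The shuffled table also carries the extra Species variable.
Tordered  = Tbl(1:5,{'SL','SW','PL','PW'});
Tshuffled = Tbl(1:5,{'Species','PW','PL','SW','SL'});

labelOrdered  = predict(mdl,Tordered);
labelShuffled = predict(mdl,Tshuffled);

% Predictors are matched by name, so both calls should agree.
sameLabels = isequal(labelOrdered,labelShuffled)
```

Because `predict` matches predictors by variable name, the column order of the table and the presence of the response variable should not affect the result.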

If you set `'Standardize',true` in `fitcknn` to train `mdl`, then the software standardizes the columns of `X` using the corresponding means in `mdl.Mu` and standard deviations in `mdl.Sigma`.
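
As a rough illustration of this standardization step, the sketch below centers and scales new data with column means and standard deviations computed from hypothetical training data. The variables `Xtrain`, `Mu`, and `Sigma` are stand-ins for the training data and for `mdl.Mu` and `mdl.Sigma`; the sketch assumes unweighted observations and a MATLAB release that supports implicit expansion (R2016b or later).

```matlab
% Hypothetical training data: 4 observations, 2 predictors.
Xtrain = [1 10; 2 20; 3 30; 4 40];

% Column statistics, playing the role of mdl.Mu and mdl.Sigma.
Mu    = mean(Xtrain);
Sigma = std(Xtrain);

% New observations to classify.
Xnew = [2.5 25; 4 10];

% Center and scale each column with the training statistics.
Xstd = (Xnew - Mu) ./ Sigma
```

The first new observation equals the training mean, so its standardized row is all zeros.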

Data Types: `double` | `single` | `table`

## Output Arguments

`label` contains the predicted class labels for the observations (rows) in `X`, returned as a categorical array, character array, logical vector, vector of numeric values, or cell array of character vectors. `label` has length equal to the number of rows in `X`. Each label is the class with the minimum expected cost. See Predicted Class Label.

`score` contains the predicted class scores or posterior probabilities, returned as a numeric matrix of size n-by-K. n is the number of observations (rows) in `X`, and K is the number of classes (in `mdl.ClassNames`). `score(i,j)` is the posterior probability that observation `i` in `X` is of class `j` in `mdl.ClassNames`. See Posterior Probability.

Data Types: `single` | `double`

`cost` contains the expected classification costs, returned as a numeric matrix of size n-by-K. n is the number of observations (rows) in `X`, and K is the number of classes (in `mdl.ClassNames`). `cost(i,j)` is the expected cost of classifying row `i` of `X` as class `j` in `mdl.ClassNames`. See Expected Cost.

Data Types: `single` | `double`

## Algorithms

### Predicted Class Label

`predict` classifies by minimizing the expected classification cost:

$$\hat{y} = \underset{y=1,\ldots,K}{\arg\min} \sum_{j=1}^{K} \hat{P}(j \mid x)\, C(y \mid j),$$

where

• $\hat{y}$ is the predicted classification.

• K is the number of classes.

• $\hat{P}(j \mid x)$ is the posterior probability of class j for observation x.

• $C(y \mid j)$ is the cost of classifying an observation as y when its true class is j.
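
The minimization can be sketched with plain matrix arithmetic. The posterior vector `P` and the custom cost matrix below are hypothetical values chosen for illustration; the matrices follow the `Cost(i,j)` layout used by `fitcknn`, where rows index the true class and columns index the predicted class.

```matlab
% Posterior probabilities of one observation over K = 3 classes (hypothetical).
P = [0.4 0.6 0];

% Default cost: 0 for a correct classification, 1 for any error.
CostDefault = ones(3) - eye(3);

% Hypothetical custom cost: predicting class 2 when the true class is 1 costs 5.
CostCustom = CostDefault;
CostCustom(1,2) = 5;

% P*Cost gives the expected cost of predicting each class;
% the predicted label is the index of the smallest entry.
[~,yDefault] = min(P*CostDefault,[],2)
[~,yCustom]  = min(P*CostCustom,[],2)
```

With the default cost, the label is the class with the largest posterior (class 2). Under the custom cost, the heavy penalty for wrongly predicting class 2 shifts the label to class 1.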

### Posterior Probability

Consider a vector (single query point) `xnew` and a model `mdl`.

• k is the number of nearest neighbors used in prediction, `mdl.NumNeighbors`.

• `nbd(mdl,xnew)` specifies the k nearest neighbors to `xnew` in `mdl.X`.

• `Y(nbd)` specifies the classifications of the points in `nbd(mdl,xnew)`, namely `mdl.Y(nbd)`.

• `W(nbd)` specifies the weights of the points in `nbd(mdl,xnew)`.

• `prior` specifies the priors of the classes in `mdl.Y`.

If the model contains a vector of prior probabilities, then the observation weights `W` are normalized by class to sum to the priors. This process might involve a calculation for the point `xnew`, because weights can depend on the distance from `xnew` to the points in `mdl.X`.

The posterior probability p(j|`xnew`) is

$$p(j \mid x_{\text{new}}) = \frac{\sum_{i \in \text{nbd}} W(i)\, 1_{Y(X(i))=j}}{\sum_{i \in \text{nbd}} W(i)}.$$

Here, $1_{Y(X(i))=j}$ is `1` when `mdl.Y(i) = j`, and `0` otherwise.
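
The posterior formula reduces to a weighted vote among the neighbors, which can be sketched as follows. The neighbor labels `Ynbd` and weights `Wnbd` are hypothetical, the classes are encoded as integers 1 to K, and the sketch assumes the weights have already been normalized (here, simply equal weights).

```matlab
K = 3;                      % number of classes
Ynbd = [1 2 2 1 2];         % class indices of the k = 5 nearest neighbors (hypothetical)
Wnbd = ones(1,5);           % equal neighbor weights (assumption)

% Sum the weights of the neighbors in each class, then divide by the total weight.
classWeight = accumarray(Ynbd(:),Wnbd(:),[K 1])';
posterior = classWeight / sum(Wnbd)
```

Two of the five neighbors belong to class 1 and three to class 2, so the posterior is 0.4, 0.6, and 0 for the three classes.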

### True Misclassification Cost

Two costs are associated with KNN classification: the true misclassification cost per class and the expected misclassification cost per observation.

You can set the true misclassification cost per class by using the `'Cost'` name-value pair argument when you run `fitcknn`. The value `Cost(i,j)` is the cost of classifying an observation into class `j` if its true class is `i`. By default, `Cost(i,j) = 1` if `i ~= j`, and `Cost(i,j) = 0` if `i = j`. In other words, the cost is `0` for correct classification and `1` for incorrect classification.

### Expected Cost

Two costs are associated with KNN classification: the true misclassification cost per class and the expected misclassification cost per observation. The third output of `predict` is the expected misclassification cost per observation.

Suppose you have `Nobs` observations that you want to classify with a trained classifier `mdl`, and you have `K` classes. You place the observations into a matrix `Xnew` with one observation per row. The command

`[label,score,cost] = predict(mdl,Xnew)`

returns a matrix `cost` of size `Nobs`-by-`K`, among other outputs. Each row of the `cost` matrix contains the expected (average) cost of classifying the observation into each of the `K` classes. `cost(n,j)` is

$$\sum_{i=1}^{K} \hat{P}(i \mid Xnew(n))\, C(j \mid i),$$

where

• K is the number of classes.

• $\hat{P}(i \mid Xnew(n))$ is the posterior probability of class i for observation Xnew(n).

• $C(j \mid i)$ is the true misclassification cost of classifying an observation as j when its true class is i.
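
As a quick check of this formula, the sketch below recomputes the expected cost from the `score` matrix shown in the Fisher iris example, assuming the default cost matrix (0 on the diagonal, 1 elsewhere). It uses only matrix arithmetic, so no trained model is needed.

```matlab
% Posterior probabilities from the iris example (rows: min, mean, max flower).
score = [0.4 0.6 0;
         0   1   0;
         0   0   1];

% Default true misclassification cost: Cost(i,j) = 1 if i ~= j, 0 otherwise.
K = size(score,2);
Cost = ones(K) - eye(K);

% cost(n,j) = sum over i of score(n,i)*Cost(i,j).
cost = score*Cost
```

The result matches the `cost` output in the example. With the default cost matrix, each expected cost is simply one minus the corresponding posterior probability.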