kfoldPredict

Predict responses for observations in cross-validated regression model

Syntax

yFit = kfoldPredict(CVMdl)

yFit = kfoldPredict(CVMdl,Name,Value)

[yFit,ySD,yInt] = kfoldPredict(___)

Description

yFit = kfoldPredict(CVMdl) returns responses predicted by the cross-validated regression model CVMdl. For every fold, kfoldPredict predicts the responses for validation-fold observations using a model trained on training-fold observations. CVMdl.X and CVMdl.Y contain both sets of observations.

yFit = kfoldPredict(CVMdl,Name,Value) specifies options using one or more name-value arguments. For example, 'IncludeInteractions',true includes interaction terms in the computations for a generalized additive model.

[yFit,ySD,yInt] = kfoldPredict(___) also returns the standard deviations and prediction intervals of the response variable, evaluated at each observation in the predictor data CVMdl.X, using any of the input argument combinations in the previous syntaxes. This syntax applies only to generalized additive models (GAM) for which the IsStandardDeviationFit property of CVMdl is true.

Examples

Compute Cross-Validation Loss Manually

When you create a cross-validated regression model, you can compute the mean squared error (MSE) by using the kfoldLoss object function. Alternatively, you can predict responses for validation-fold observations using kfoldPredict and compute the MSE manually.

Load the carsmall data set. Specify the predictor data X and the response data Y.

load carsmall
X = [Cylinders Displacement Horsepower Weight];
Y = MPG;

Train a cross-validated regression tree model. By default, the software implements 10-fold cross-validation.

rng('default') % For reproducibility
CVMdl = fitrtree(X,Y,'CrossVal','on');

Compute the 10-fold cross-validation MSE by using kfoldLoss.

L = kfoldLoss(CVMdl)

L = 
29.4963

Predict the responses yfit by using the cross-validated regression model. Compute the mean squared error between yfit and the true responses CVMdl.Y. The computed MSE matches the loss value returned by kfoldLoss.

yfit = kfoldPredict(CVMdl);
mse = mean((yfit - CVMdl.Y).^2)

mse = 
29.4963
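The fold-by-fold behavior of kfoldPredict can also be checked by hand. The following is a minimal sketch, assuming that CVMdl.Partition indexes the rows of CVMdl.X and that CVMdl.Trained{k} is the compact model trained without the observations of fold k. The names yManual and idx are illustrative only.

```matlab
load carsmall
X = [Cylinders Displacement Horsepower Weight];
Y = MPG;
rng('default') % For reproducibility
CVMdl = fitrtree(X,Y,'CrossVal','on');
yfit = kfoldPredict(CVMdl);

% Rebuild the predictions one fold at a time
yManual = nan(size(yfit));
for k = 1:CVMdl.KFold
    idx = test(CVMdl.Partition,k);                           % validation rows of fold k
    yManual(idx) = predict(CVMdl.Trained{k},CVMdl.X(idx,:)); % model trained without fold k
end

% Largest absolute difference between the two sets of predictions
maxDiff = max(abs(yfit - yManual))
```

If the assumptions hold, maxDiff should be zero or within floating-point rounding of zero.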

Compare Holdout and k-Fold Cross-Validation Losses and Predictions

Compute the loss and the predictions for a classification model, first partitioned using holdout validation and then partitioned using 3-fold cross-validation. Compare the two sets of losses and predictions.

Create a table from the fisheriris data set, which contains length and width measurements from the sepals and petals of three species of iris flowers. View the first eight observations.

fisheriris = readtable("fisheriris.csv");
head(fisheriris)

    SepalLength    SepalWidth    PetalLength    PetalWidth     Species  
    ___________    __________    ___________    __________    __________

        5.1           3.5            1.4           0.2        {'setosa'}
        4.9             3            1.4           0.2        {'setosa'}
        4.7           3.2            1.3           0.2        {'setosa'}
        4.6           3.1            1.5           0.2        {'setosa'}
          5           3.6            1.4           0.2        {'setosa'}
        5.4           3.9            1.7           0.4        {'setosa'}
        4.6           3.4            1.4           0.3        {'setosa'}
          5           3.4            1.5           0.2        {'setosa'}

Partition the data using cvpartition. First, create a partition for holdout validation, using approximately 70% of the observations for the training data and 30% for the validation data. Then, create a partition for 3-fold cross-validation.

rng(0,"twister") % For reproducibility
holdoutPartition = cvpartition(fisheriris.Species,Holdout=0.30);
kfoldPartition = cvpartition(fisheriris.Species,KFold=3);

holdoutPartition and kfoldPartition are both stratified random partitions. You can use the training and test functions to find the indices for the observations in the training and validation sets, respectively.
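As a brief sketch of the training and test functions mentioned above, the following code extracts logical index vectors from both partitions. The variable names (holdoutTrainIdx, holdoutTestIdx, fold2TestIdx, isDisjointCover) are illustrative and not part of the original example.

```matlab
fisheriris = readtable("fisheriris.csv");
rng(0,"twister") % For reproducibility
holdoutPartition = cvpartition(fisheriris.Species,Holdout=0.30);
kfoldPartition = cvpartition(fisheriris.Species,KFold=3);

% Logical index vectors for the holdout partition
holdoutTrainIdx = training(holdoutPartition);
holdoutTestIdx = test(holdoutPartition);

% Logical index vector for the validation observations of the second fold
fold2TestIdx = test(kfoldPartition,2);

% Each observation belongs to exactly one of the two holdout sets
isDisjointCover = all(xor(holdoutTrainIdx,holdoutTestIdx))
```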

Train a classification tree model using the fisheriris data. Specify Species as the response variable.

Mdl = fitctree(fisheriris,"Species");

Create the partitioned classification models using crossval.

holdoutMdl = crossval(Mdl,CVPartition=holdoutPartition)

holdoutMdl = 
  ClassificationPartitionedModel
    CrossValidatedModel: 'Tree'
         PredictorNames: {'SepalLength'  'SepalWidth'  'PetalLength'  'PetalWidth'}
           ResponseName: 'Species'
        NumObservations: 150
                  KFold: 1
              Partition: [1×1 cvpartition]
             ClassNames: {'setosa'  'versicolor'  'virginica'}
         ScoreTransform: 'none'


  Properties, Methods

kfoldMdl = crossval(Mdl,CVPartition=kfoldPartition)

kfoldMdl = 
  ClassificationPartitionedModel
    CrossValidatedModel: 'Tree'
         PredictorNames: {'SepalLength'  'SepalWidth'  'PetalLength'  'PetalWidth'}
           ResponseName: 'Species'
        NumObservations: 150
                  KFold: 3
              Partition: [1×1 cvpartition]
             ClassNames: {'setosa'  'versicolor'  'virginica'}
         ScoreTransform: 'none'


  Properties, Methods

holdoutMdl and kfoldMdl are ClassificationPartitionedModel objects.

Compute the minimal expected misclassification cost for holdoutMdl and kfoldMdl using kfoldLoss. Because both models use the default cost matrix, this cost is the same as the classification error.

holdoutL = kfoldLoss(holdoutMdl)

holdoutL = 
0.0889

kfoldL = kfoldLoss(kfoldMdl)

kfoldL = 
0.0600

holdoutL is the error computed using the predictions for one validation set, while kfoldL is an average error computed using the predictions for three folds of validation data. Because it averages over several validation sets, a k-fold cross-validation metric tends to be a better indicator of a model's performance on unseen data than a single holdout estimate.

Compute the validation data predictions for the two models using kfoldPredict.

[holdoutLabels,holdoutScores] = kfoldPredict(holdoutMdl);
[kfoldLabels,kfoldScores] = kfoldPredict(kfoldMdl);

holdoutClassNames = holdoutMdl.ClassNames;
holdoutScores = array2table(holdoutScores,VariableNames=holdoutClassNames);
kfoldClassNames = kfoldMdl.ClassNames;
kfoldScores = array2table(kfoldScores,VariableNames=kfoldClassNames);

predictions = table(holdoutLabels,kfoldLabels, ...
    holdoutScores,kfoldScores, ...
    VariableNames=["holdoutMdl Labels","kfoldMdl Labels", ...
    "holdoutMdl Scores","kfoldMdl Scores"])

predictions=150×4 table
    holdoutMdl Labels    kfoldMdl Labels            holdoutMdl Scores                     kfoldMdl Scores         
    _________________    _______________    _________________________________    _________________________________

                                            setosa    versicolor    virginica    setosa    versicolor    virginica
                                            ______    __________    _________    ______    __________    _________
                                                                                                                  
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
       {'setosa'}          {'setosa'}          1           0             0         1           0             0    
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
       {'setosa'}          {'setosa'}          1           0             0         1           0             0    
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
       {'setosa'}          {'setosa'}          1           0             0         1           0             0    
       {'setosa'}          {'setosa'}          1           0             0         1           0             0    
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
      ⋮

kfoldPredict returns NaN scores for the observations used to train holdoutMdl.Trained. For these observations, the function selects the class label with the highest frequency as the predicted label. In this case, because all classes have the same frequency, the function selects the first class (setosa) as the predicted label. The function uses the trained model to return predictions for the validation set observations. kfoldPredict returns each kfoldMdl prediction using the model in kfoldMdl.Trained that was trained without that observation.
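The relationship between the NaN scores and the training observations can be confirmed with a short check. This sketch rebuilds the holdout model from the steps above; hasNaNScore and isTraining are illustrative names.

```matlab
fisheriris = readtable("fisheriris.csv");
rng(0,"twister") % For reproducibility
holdoutPartition = cvpartition(fisheriris.Species,Holdout=0.30);
Mdl = fitctree(fisheriris,"Species");
holdoutMdl = crossval(Mdl,CVPartition=holdoutPartition);
[~,holdoutScores] = kfoldPredict(holdoutMdl);

% Rows with NaN scores should be exactly the training observations
hasNaNScore = any(isnan(holdoutScores),2);
isTraining = training(holdoutMdl.Partition);
scoresMatchPartition = isequal(hasNaNScore,isTraining)
```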

To predict responses for unseen data, use the model trained on the entire data set (Mdl) and its predict function rather than a partitioned model such as holdoutMdl or kfoldMdl.
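As a minimal sketch of this recommendation, the following code passes one new observation to the predict function of Mdl. The measurements in newFlower are hypothetical values chosen only for illustration.

```matlab
fisheriris = readtable("fisheriris.csv");
Mdl = fitctree(fisheriris,"Species");

% Hypothetical measurements for one new flower (illustrative values)
newFlower = table(5.9,3.0,4.2,1.3, ...
    VariableNames=["SepalLength","SepalWidth","PetalLength","PetalWidth"]);

% Predict with the model trained on the entire data set
newLabel = predict(Mdl,newFlower)
```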

Input Arguments

`CVMdl` — Cross-validated partitioned regression model
`RegressionPartitionedModel` object | `RegressionPartitionedEnsemble` object | `RegressionPartitionedGAM` object | `RegressionPartitionedGP` object | `RegressionPartitionedNeuralNetwork` object | `RegressionPartitionedSVM` object

Cross-validated partitioned regression model, specified as a RegressionPartitionedModel, RegressionPartitionedEnsemble, RegressionPartitionedGAM, RegressionPartitionedGP, RegressionPartitionedNeuralNetwork, or RegressionPartitionedSVM object. You can create the object in two ways:

- Pass a trained regression model listed in the following table to its crossval object function.
- Train a regression model using a function listed in the following table and specify one of the cross-validation name-value arguments for the function.

Regression Model	Function
`RegressionEnsemble`	`fitrensemble`
`RegressionGAM`	`fitrgam`
`RegressionGP`	`fitrgp`
`RegressionNeuralNetwork`	`fitrnet`
`RegressionSVM`	`fitrsvm`
`RegressionTree`	`fitrtree`
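The two ways of creating CVMdl can be sketched as follows for a regression tree. The variable names CVMdl1 and CVMdl2 are illustrative, and the choice of five folds in the second call is arbitrary.

```matlab
load carsmall
X = [Cylinders Displacement Horsepower Weight];
Y = MPG;
rng('default') % For reproducibility

% Way 1: train a full model, then pass it to crossval
Mdl = fitrtree(X,Y);
CVMdl1 = crossval(Mdl);            % 10-fold cross-validation by default

% Way 2: specify a cross-validation name-value argument when training
CVMdl2 = fitrtree(X,Y,'KFold',5);  % 5-fold cross-validation

% Either object can be passed to kfoldPredict
yFit1 = kfoldPredict(CVMdl1);
yFit2 = kfoldPredict(CVMdl2);
```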

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'Alpha',0.01,'IncludeInteractions',false specifies the confidence level as 99% and excludes interaction terms from computations for a generalized additive model.

`Alpha` — Significance level
0.05 (default) | numeric scalar in `[0,1]`

Significance level for the confidence level of the prediction intervals yInt, specified as a numeric scalar in the range [0,1]. The confidence level of yInt is equal to 100(1 – Alpha)%.

This argument is valid only for a generalized additive model object that includes the standard deviation fit. That is, you can specify this argument only when CVMdl is RegressionPartitionedGAM and the IsStandardDeviationFit property of CVMdl is true.

Example: 'Alpha',0.01

Data Types: single | double

`IncludeInteractions` — Flag to include interaction terms
`true` | `false`

Flag to include interaction terms of the model, specified as true or false. This argument is valid only for a generalized additive model (GAM). That is, you can specify this argument only when CVMdl is RegressionPartitionedGAM.

The default value is true if the models in CVMdl (CVMdl.Trained) contain interaction terms. The value must be false if the models do not contain interaction terms.

Data Types: logical
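As a hedged illustration, the following sketch trains a cross-validated GAM with interaction terms on the carsmall data and predicts with and without those terms. The choice of three interaction terms and the variable names yFitFull and yFitMainOnly are assumptions made for this example.

```matlab
load carsmall
tbl = table(Cylinders,Displacement,Horsepower,Weight,MPG);
rng('default') % For reproducibility

% Cross-validated GAM with up to three interaction terms (illustrative choice)
CVMdl = fitrgam(tbl,'MPG','Interactions',3,'CrossVal','on');

% Predictions using linear and interaction terms (the default in this case)
yFitFull = kfoldPredict(CVMdl);

% Predictions that exclude the interaction terms
yFitMainOnly = kfoldPredict(CVMdl,'IncludeInteractions',false);
```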

`PredictionForMissingValue` — Predicted response value to use for observations with missing predictor values
`"median"` | `"mean"` | numeric scalar

Since R2023b

Predicted response value to use for observations with missing predictor values, specified as "median", "mean", or a numeric scalar. This argument is valid only for a Gaussian process regression, neural network, or support vector machine model. That is, you can specify this argument only when CVMdl is a RegressionPartitionedGP, RegressionPartitionedNeuralNetwork, or RegressionPartitionedSVM object.

Value	Description
`"median"`	`kfoldPredict` uses the median of the observed response values in the training-fold data as the predicted response value for observations with missing predictor values. This value is the default when `CVMdl` is a `RegressionPartitionedGP`, `RegressionPartitionedNeuralNetwork`, or `RegressionPartitionedSVM` object.
`"mean"`	`kfoldPredict` uses the mean of the observed response values in the training-fold data as the predicted response value for observations with missing predictor values.
Numeric scalar	`kfoldPredict` uses this value as the predicted response value for observations with missing predictor values.

Example: "PredictionForMissingValue","mean"

Example: "PredictionForMissingValue",NaN

Data Types: single | double | char | string
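The following sketch shows how the argument might be used with a cross-validated SVM regression model (R2023b or later). It assumes that CVMdl.X retains any rows with missing predictor values; if it does not, missingRows is all false and the options have no visible effect. The variable names are illustrative.

```matlab
load carsmall
X = [Displacement Horsepower Weight];
Y = MPG;
rng('default') % For reproducibility
CVMdl = fitrsvm(X,Y,Standardize=true,CrossVal="on");

% Rows of the stored predictor data that contain missing values
missingRows = any(isnan(CVMdl.X),2);

% Use the training-fold mean instead of the default training-fold median
yFitMean = kfoldPredict(CVMdl,PredictionForMissingValue="mean");

% Return NaN for observations with missing predictor values
yFitNaN = kfoldPredict(CVMdl,PredictionForMissingValue=NaN);
```

Comparing yFitMean(missingRows) with yFitNaN(missingRows) shows which predictions the argument affects.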

Output Arguments

`yFit` — Predicted responses
numeric vector

Predicted responses, returned as an n-by-1 numeric vector, where n is the number of observations. (n is size(CVMdl.X,1) when observations are in rows.) Each entry of yFit corresponds to the predicted response for the corresponding row of CVMdl.X.

If you use a holdout validation technique to create CVMdl (that is, if CVMdl.KFold is 1), then yFit has NaN values for training-fold observations.
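The holdout case can be sketched as follows. The example assumes a 30% holdout fraction, chosen only for illustration, and uses the partition stored in CVMdl to separate validation observations from training observations.

```matlab
load carsmall
X = [Cylinders Displacement Horsepower Weight];
Y = MPG;
rng('default') % For reproducibility
CVMdl = fitrtree(X,Y,'Holdout',0.3);
yFit = kfoldPredict(CVMdl);

% Validation observations have predictions; training observations have NaN
isValidation = test(CVMdl.Partition);
allTrainingNaN = all(isnan(yFit(~isValidation)))

% Holdout MSE computed from the validation observations only
holdoutMSE = mean((yFit(isValidation) - CVMdl.Y(isValidation)).^2)
```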

`ySD` — Standard deviations of response variable
column vector

Standard deviations of the response variable, evaluated at each observation in the predictor data CVMdl.X, returned as a column vector of length n, where n is the number of observations in CVMdl.X. The ith element ySD(i) contains the standard deviation of the ith response for the ith observation CVMdl.X(i,:), estimated using the trained standard deviation model in CVMdl.

This argument is valid only for a generalized additive model object that includes the standard deviation fit. That is, kfoldPredict can return this argument only when CVMdl is RegressionPartitionedGAM and the IsStandardDeviationFit property of CVMdl is true.

`yInt` — Prediction intervals of response variable
two-column matrix

Prediction intervals of the response variable, evaluated at each observation in the predictor data CVMdl.X, returned as an n-by-2 matrix, where n is the number of observations in CVMdl.X. The ith row yInt(i,:) contains the estimated 100(1 – Alpha)% prediction interval of the ith response for the ith observation CVMdl.X(i,:) using ySD(i). The Alpha value is the probability that the prediction interval does not contain the true response value CVMdl.Y(i). The first column of yInt contains the lower limits of the prediction intervals, and the second column contains the upper limits.
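A minimal sketch of requesting ySD and yInt follows. It assumes that specifying 'FitStandardDeviation',true in fitrgam produces a cross-validated GAM whose IsStandardDeviationFit property is true. The coverage computation at the end is an illustrative check, not part of the function's output.

```matlab
load carsmall
tbl = table(Cylinders,Displacement,Horsepower,Weight,MPG);
rng('default') % For reproducibility

% Cross-validated GAM that also fits a model for the standard deviation
CVMdl = fitrgam(tbl,'MPG','FitStandardDeviation',true,'CrossVal','on');

% 99% prediction intervals (Alpha = 0.01)
[yFit,ySD,yInt] = kfoldPredict(CVMdl,'Alpha',0.01);

% Fraction of observed responses that fall inside their intervals
coverage = mean(CVMdl.Y >= yInt(:,1) & CVMdl.Y <= yInt(:,2))
```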

Extended Capabilities

GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.

Usage notes and limitations:

This function fully supports GPU arrays for the following models.
- RegressionPartitionedEnsemble
- RegressionPartitionedGP
- RegressionPartitionedNeuralNetwork
- RegressionPartitionedModel object fitted using fitrtree, or by passing a RegressionTree object to crossval
- RegressionPartitionedSVM

For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
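As a rough sketch, the following code trains a cross-validated regression tree on GPU arrays and predicts with kfoldPredict. It requires Parallel Computing Toolbox and a supported GPU, so the example is guarded with canUseGPU. The name yFitHost is illustrative.

```matlab
if canUseGPU
    load carsmall
    X = gpuArray([Cylinders Displacement Horsepower Weight]);
    Y = gpuArray(MPG);
    rng('default') % For reproducibility
    CVMdl = fitrtree(X,Y,'CrossVal','on');
    yFit = kfoldPredict(CVMdl);

    % Copy the predictions back to host memory for further processing
    yFitHost = gather(yFit);
else
    disp("No supported GPU is available; skipping the GPU example.")
end
```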

Version History

Introduced in R2011a

R2026a: Specify GPU arrays for Gaussian process regression models (requires Parallel Computing Toolbox)

kfoldPredict fully supports GPU arrays for RegressionPartitionedGP models.

R2024b: Specify GPU arrays for neural network models (requires Parallel Computing Toolbox)

kfoldPredict fully supports GPU arrays for RegressionPartitionedNeuralNetwork models.

R2023b: Specify predicted response value to use for observations with missing predictor values

Starting in R2023b, when you predict or compute the loss, some regression models allow you to specify the predicted response value for observations with missing predictor values. Specify the PredictionForMissingValue name-value argument to use a numeric scalar, the training set median, or the training set mean as the predicted value. When computing the loss, you can also choose to omit observations with missing predictor values.

This table lists the object functions that support the PredictionForMissingValue name-value argument. By default, the functions use the training set median as the predicted response value for observations with missing predictor values.

Model Type	Model Objects	Object Functions
Gaussian process regression (GPR) model	`RegressionGP`, `CompactRegressionGP`	`loss`, `predict`, `resubLoss`, `resubPredict`
Gaussian process regression (GPR) model	`RegressionPartitionedGP`	`kfoldLoss`, `kfoldPredict`
Gaussian kernel regression model	`RegressionKernel`	`loss`, `predict`
Gaussian kernel regression model	`RegressionPartitionedKernel`	`kfoldLoss`, `kfoldPredict`
Linear regression model	`RegressionLinear`	`loss`, `predict`
Linear regression model	`RegressionPartitionedLinear`	`kfoldLoss`, `kfoldPredict`
Neural network regression model	`RegressionNeuralNetwork`, `CompactRegressionNeuralNetwork`	`loss`, `predict`, `resubLoss`, `resubPredict`
Neural network regression model	`RegressionPartitionedNeuralNetwork`	`kfoldLoss`, `kfoldPredict`
Support vector machine (SVM) regression model	`RegressionSVM`, `CompactRegressionSVM`	`loss`, `predict`, `resubLoss`, `resubPredict`
Support vector machine (SVM) regression model	`RegressionPartitionedSVM`	`kfoldLoss`, `kfoldPredict`

In previous releases, the regression model loss and predict functions listed above used NaN predicted response values for observations with missing predictor values. The software omitted observations with missing predictor values from the resubstitution ("resub") and cross-validation ("kfold") computations for prediction and loss.

R2023a: GPU support for `RegressionPartitionedSVM` models

Starting in R2023a, kfoldPredict fully supports GPU arrays for RegressionPartitionedSVM models.
