crossval

Cross-validate direct forecasting model

Since R2023b

    Description

    CVMdl = crossval(Mdl,TSPartition) returns a cross-validated (partitioned) direct forecasting model (CVMdl) from a trained direct forecasting model (Mdl). The crossval function uses the cross-validation scheme specified by TSPartition.

    You can assess the predictive performance of Mdl on cross-validated data by using the object functions of CVMdl (cvloss and cvpredict).

    Examples

    Create a cross-validated direct forecasting model using expanding window cross-validation. To evaluate the performance of the model:

    • Compute the mean squared error (MSE) on each test set using the cvloss object function.

    • For each test set, compare the true response values to the predicted response values using the cvpredict object function.

    Load the sample file TemperatureData.csv, which contains average daily temperature from January 2015 through July 2016. Read the file into a table. Observe the first eight observations in the table.

    Tbl = readtable("TemperatureData.csv");
    head(Tbl)
        Year       Month       Day    TemperatureF
        ____    ___________    ___    ____________
    
        2015    {'January'}     1          23     
        2015    {'January'}     2          31     
        2015    {'January'}     3          25     
        2015    {'January'}     4          39     
        2015    {'January'}     5          29     
        2015    {'January'}     6          12     
        2015    {'January'}     7          10     
        2015    {'January'}     8           4     
    

    Create a datetime variable t that contains the year, month, and day information for each observation in Tbl.

    numericMonth = month(datetime(Tbl.Month, ...
        InputFormat="MMMM",Locale="en_US"));
    t = datetime(Tbl.Year,numericMonth,Tbl.Day);

    Plot the temperature values in Tbl over time.

    plot(t,Tbl.TemperatureF)
    xlabel("Date")
    ylabel("Temperature in Fahrenheit")

    [Figure: line plot of temperature in Fahrenheit versus date.]

    Create a direct forecasting model by using the data in Tbl. Train the model using a bagged ensemble of trees. All three of the predictors (Year, Month, and Day) are leading predictors because their future values are known. To create new predictors by shifting the leading predictor and response variables backward in time, specify the leading predictor lags and the response variable lags.

    Mdl = directforecaster(Tbl,"TemperatureF", ...
        Learner="bag", ...
        LeadingPredictors="all",LeadingPredictorLags={0:1,0:1,0:7}, ...
        ResponseLags=1:7)
    Mdl = 
      DirectForecaster
    
                      Horizon: 1
                 ResponseLags: [1 2 3 4 5 6 7]
            LeadingPredictors: [1 2 3]
         LeadingPredictorLags: {[0 1]  [0 1]  [0 1 2 3 4 5 6 7]}
                 ResponseName: 'TemperatureF'
               PredictorNames: {'Year'  'Month'  'Day'}
        CategoricalPredictors: 2
                     Learners: {[1x1 classreg.learning.regr.CompactRegressionEnsemble]}
                       MaxLag: 7
              NumObservations: 565
    
    
    

    Mdl is a DirectForecaster model object. By default, the horizon is one step ahead. That is, Mdl predicts a value that is one step into the future.
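As a rough illustration of what a lag specification means, the following sketch builds lagged copies of a short response series with plain indexing. The values are the first eight TemperatureF observations shown earlier. The variable names (y, lags, laggedY) are hypothetical, only three lags are used to keep the output small, and the internal data preparation in directforecaster may differ from this.

```matlab
% First eight TemperatureF values from the table preview
y = [23; 31; 25; 39; 29; 12; 10; 4];
lags = 1:3;    % analogous to ResponseLags=1:3 (the example uses 1:7)

% Build one lagged column per lag by shifting y backward in time.
% Rows without enough history stay NaN.
n = numel(y);
laggedY = NaN(n,numel(lags));
for j = 1:numel(lags)
    k = lags(j);
    laggedY(k+1:n,j) = y(1:n-k);
end

disp(table(y,laggedY(:,1),laggedY(:,2),laggedY(:,3), ...
    VariableNames=["y","y_lag1","y_lag2","y_lag3"]))
```

Row 4 of laggedY contains 25, 31, and 23, which are the three values that precede y(4) = 39. The first rows contain NaN values because no earlier observations exist for those lags.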

    Partition the time series data in Tbl using an expanding window cross-validation scheme. Create three training sets and three test sets, where each test set has 100 observations. Note that each observation in Tbl is in at most one test set.

    CVPartition = tspartition(size(Mdl.X,1),"ExpandingWindow",3, ...
        TestSize=100)
    CVPartition = 
      tspartition
    
                   Type: 'expanding-window'
        NumObservations: 565
            NumTestSets: 3
              TrainSize: [265 365 465]
               TestSize: [100 100 100]
               StepSize: 100
    
    
    

    The training sets increase in size from 265 observations in the first window to 465 observations in the third window.
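The sizes reported above can be reproduced with simple index arithmetic. This is a sketch under two assumptions that match the displayed partition but are not stated explicitly in it: the test sets are placed back to back at the end of the series, and each training set runs from the first observation up to the observation just before its test set. All variable names are hypothetical.

```matlab
n = 565;            % number of observations
numWindows = 3;     % number of train/test windows
testSize = 100;     % observations per test set

% Test sets occupy the last numWindows*testSize observations.
testStart = n - testSize*(numWindows:-1:1) + 1;
testEnd = testStart + testSize - 1;

% Expanding window: each training set starts at observation 1 and
% ends just before the corresponding test set.
trainSize = testStart - 1;

for i = 1:numWindows
    fprintf("Window %d: train 1-%d, test %d-%d\n", ...
        i,trainSize(i),testStart(i),testEnd(i));
end
```

Under these assumptions, trainSize equals the TrainSize property [265 365 465], and the test sets cover observations 266 through 565 without overlap.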

    Create a cross-validated direct forecasting model using the partition specified in CVPartition. Inspect the Learners property of the resulting CVMdl object.

    CVMdl = crossval(Mdl,CVPartition)
    CVMdl = 
      PartitionedDirectForecaster
    
                    Partition: [1x1 tspartition]
                      Horizon: 1
                 ResponseLags: [1 2 3 4 5 6 7]
            LeadingPredictors: [1 2 3]
         LeadingPredictorLags: {[0 1]  [0 1]  [0 1 2 3 4 5 6 7]}
                 ResponseName: 'TemperatureF'
               PredictorNames: {'Year'  'Month'  'Day'}
        CategoricalPredictors: 2
                     Learners: {3x1 cell}
                       MaxLag: 7
              NumObservations: 565
    
    
    
    CVMdl.Learners
    ans=3×1 cell array
        {1x1 timeseries.forecaster.CompactDirectForecaster}
        {1x1 timeseries.forecaster.CompactDirectForecaster}
        {1x1 timeseries.forecaster.CompactDirectForecaster}
    
    

    CVMdl is a PartitionedDirectForecaster model object. The crossval function trains CVMdl.Learners{1} using the observations in the first training set, CVMdl.Learners{2} using the observations in the second training set, and CVMdl.Learners{3} using the observations in the third training set.

    Compute the average test set MSE.

    averageMSE = cvloss(CVMdl)
    averageMSE = 
    53.3480
    

    To obtain more information, compute the MSE for each test set.

    individualMSE = cvloss(CVMdl,Mode="individual")
    individualMSE = 3×1
    
       44.1352
       84.0695
       31.8393
    
    

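    The test set MSE is lower for the first and third test sets (44.1 and 31.8) than for the second test set (84.1), so the models trained on the first and third training sets seem to perform better than the model trained on the second training set.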
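As a quick consistency check, the average MSE can be compared with the mean of the individual values. The sketch below copies the three displayed values by hand. Whether cvloss weights the test sets by size is not stated here, but because all three test sets contain 100 observations, a simple mean and a size-weighted mean coincide.

```matlab
% Individual test set MSE values copied from the output above
individualMSE = [44.1352; 84.0695; 31.8393];

% Simple mean across the three test sets
averageFromIndividual = mean(individualMSE)
```

The result agrees with the averageMSE value of 53.3480 to the displayed precision.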

    For each test set observation, predict the temperature value using the corresponding model in CVMdl.Learners.

    predictedY = cvpredict(CVMdl);
    predictedY(260:end,:)
    ans=306×1 table
        TemperatureF_Step1
        __________________
    
                 NaN      
                 NaN      
                 NaN      
                 NaN      
                 NaN      
                 NaN      
              50.963      
              57.363      
               57.04      
              60.705      
              59.606      
              58.302      
              58.023      
               61.39      
              67.229      
              61.083      
          ⋮
    
    

    Only the last 300 observations appear in any test set. For observations that do not appear in a test set, the predicted response value is NaN.
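When you compute a loss by hand from predictions like these, a logical mask keeps the NaN entries out of the calculation. The sketch below uses small made-up vectors (yTrue and yPred are hypothetical values, not output of cvpredict) to show the pattern.

```matlab
% Hypothetical true and predicted responses. NaN marks observations
% that do not belong to any test set.
yTrue = [50; 52; 55; 53; 58; 60];
yPred = [NaN; NaN; 54; 55; 57; 62];

% Keep only the observations that have a prediction.
inTest = ~isnan(yPred);

% Mean squared error over the test observations only
mseTest = mean((yTrue(inTest) - yPred(inTest)).^2)
```

In the temperature example, the logical vector returned by test(CVPartition,i) can play the role of inTest for the i-th test set.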

    For each test set, plot the true response values and the predicted response values.

    tiledlayout(3,1)
    
    nexttile
    idx1 = test(CVPartition,1);
    plot(t(idx1),Tbl.TemperatureF(idx1))
    hold on
    plot(t(idx1),predictedY.TemperatureF_Step1(idx1))
    legend("True Response","Predicted Response", ...
        Location="eastoutside")
    xlabel("Date")
    ylabel("Temperature")
    title("Test Set 1")
    hold off
    
    nexttile
    idx2 = test(CVPartition,2);
    plot(t(idx2),Tbl.TemperatureF(idx2))
    hold on
    plot(t(idx2),predictedY.TemperatureF_Step1(idx2))
    legend("True Response","Predicted Response", ...
        Location="eastoutside")
    xlabel("Date")
    ylabel("Temperature")
    title("Test Set 2")
    hold off
    
    nexttile
    idx3 = test(CVPartition,3);
    plot(t(idx3),Tbl.TemperatureF(idx3))
    hold on
    plot(t(idx3),predictedY.TemperatureF_Step1(idx3))
    legend("True Response","Predicted Response", ...
        Location="eastoutside")
    xlabel("Date")
    ylabel("Temperature")
    title("Test Set 3")
    hold off

    [Figure: three stacked plots titled Test Set 1, Test Set 2, and Test Set 3. Each plot shows the true response and the predicted response (temperature) versus date.]

    Overall, the cross-validated direct forecasting model is able to predict the trend in temperatures. If you are satisfied with the performance of the cross-validated model, you can use the full DirectForecaster model Mdl for forecasting at time steps beyond the available data.

    Input Arguments

    Mdl: Direct forecasting model, specified as a DirectForecaster model object.

    TSPartition: Cross-validation partition for time series data, specified as a tspartition object. TSPartition uses an expanding window cross-validation, sliding window cross-validation, or holdout validation scheme (as specified by the tspartition function).
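The three partition schemes might be created as in the following sketch, which requires Statistics and Machine Learning Toolbox. The number of observations matches the temperature example. The window count, TestSize=100 for the sliding window, and the holdout fraction 0.2 are arbitrary choices for illustration, and the variable names are hypothetical.

```matlab
n = 565;   % number of observations in the time series

% Expanding window: training sets grow from one window to the next.
cExpanding = tspartition(n,"ExpandingWindow",3,TestSize=100);

% Sliding window: the training window moves forward through the series.
cSliding = tspartition(n,"SlidingWindow",3,TestSize=100);

% Holdout: a single split that reserves the last 20% for testing.
cHoldout = tspartition(n,"Holdout",0.2);

disp(cExpanding.TrainSize)
disp(cSliding.TrainSize)
disp(cHoldout.TestSize)

% Any of these partitions can then be passed to crossval, for example:
% CVMdl = crossval(Mdl,cSliding);
```

The commented crossval call assumes that a trained DirectForecaster model Mdl with 565 observations already exists in the workspace.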

    Output Arguments

    CVMdl: Cross-validated direct forecasting model, returned as a PartitionedDirectForecaster model object.

    Version History

    Introduced in R2023b