# RegressionPartitionedLinear

Package: classreg.learning.partition
Superclasses: `RegressionPartitionedModel`

Cross-validated linear regression model for high-dimensional data

## Description

`RegressionPartitionedLinear` is a set of linear regression models trained on cross-validated folds. To obtain a cross-validated, linear regression model, use `fitrlinear` and specify one of the cross-validation options. You can estimate the predictive quality of the model, or how well the linear regression model generalizes, using one or more of these “kfold” methods: `kfoldPredict` and `kfoldLoss`.

Every “kfold” method uses models trained on in-fold observations to predict the response for out-of-fold observations. For example, suppose that you cross-validate using five folds. In this case, the software randomly assigns each observation to one of five groups of roughly equal size. The training fold contains four of the groups (that is, roughly 4/5 of the data) and the test fold contains the other group (that is, roughly 1/5 of the data). Cross-validation then proceeds as follows:

1. The software trains the first model (stored in `CVMdl.Trained{1}`) using the observations in the last four groups and reserves the observations in the first group for validation.

2. The software trains the second model (stored in `CVMdl.Trained{2}`) using the observations in the first group and last three groups. The software reserves the observations in the second group for validation.

3. The software proceeds in a similar fashion for the third through fifth models.

If you validate by calling `kfoldPredict`, it computes predictions for the observations in group 1 using the first model, for the observations in group 2 using the second model, and so on. In short, the software estimates a response for every observation using the model trained without that observation.
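The grouping described above can be sketched with `cvpartition`; the example below is illustrative only (the exact assignment depends on the random seed), and uses 100 observations so each fold holds exactly 1/5 of the data:

```
rng(1)                          % For reproducibility
c = cvpartition(100,'KFold',5); % Randomly assign 100 observations to 5 groups
nTrain = sum(training(c,1))     % Roughly 4/5 of the data trains the first model
nTest = sum(test(c,1))          % Roughly 1/5 is reserved for validation
```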

Note

Unlike other cross-validated, regression models, `RegressionPartitionedLinear` model objects do not store the predictor data set.

## Construction

`CVMdl = fitrlinear(X,Y,Name,Value)` creates a cross-validated, linear regression model when `Name` is one of `'CrossVal'`, `'CVPartition'`, `'Holdout'`, or `'KFold'`. For more details, see `fitrlinear`.
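For example, each of the following calls returns a `RegressionPartitionedLinear` object (this sketch assumes predictor data `X` and response vector `Y` are already in the workspace):

```
CVMdl1 = fitrlinear(X,Y,'CrossVal','on');  % 10-fold cross-validation by default
CVMdl2 = fitrlinear(X,Y,'KFold',5);        % 5-fold cross-validation
CVMdl3 = fitrlinear(X,Y,'Holdout',0.2);    % Hold out 20% of the data
c = cvpartition(numel(Y),'KFold',5);       % Custom partition object
CVMdl4 = fitrlinear(X,Y,'CVPartition',c);
```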

## Properties


### Cross-Validation Properties

`CrossValidatedModel` — Cross-validated model name, specified as a character vector.

For example, `'Linear'` specifies a cross-validated linear model for binary classification or regression.

Data Types: `char`

`KFold` — Number of cross-validated folds, specified as a positive integer.

Data Types: `double`

`ModelParameters` — Cross-validation parameter values, specified as an object. The parameter values correspond to the name-value pair argument values used to cross-validate the linear model. `ModelParameters` does not contain estimated parameters.

Access properties of `ModelParameters` using dot notation.
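For example, assuming `CVMdl` is a cross-validated model returned by `fitrlinear`, you can display the stored parameter object and then read individual fields with dot notation (display `CVMdl.ModelParameters` first to see which fields your object has):

```
CVMdl.ModelParameters      % Display the parameter object and its fields
p = CVMdl.ModelParameters; % Store the object for repeated access
```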

`NumObservations` — Number of observations in the training data, specified as a positive numeric scalar.

Data Types: `double`

`Partition` — Data partition indicating how the software splits the data into cross-validation folds, specified as a `cvpartition` model.

`Trained` — Linear regression models trained on cross-validation folds, specified as a cell array of `RegressionLinear` models. `Trained` has *k* cells, where *k* is the number of folds.

Data Types: `cell`
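For example, to work with the model trained with the first fold held out (assuming a cross-validated model `CVMdl`):

```
Mdl1 = CVMdl.Trained{1};         % RegressionLinear model for the first fold
numFolds = numel(CVMdl.Trained)  % Equals the number of folds, CVMdl.KFold
```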

`W` — Observation weights used to cross-validate the model, specified as a numeric vector. `W` has `NumObservations` elements.

The software normalizes the weights used for training so that `sum(W,'omitnan')` is `1`.

Data Types: `single` | `double`

`Y` — Observed responses used to cross-validate the model, specified as a numeric vector containing `NumObservations` elements.

Each row of `Y` represents the observed response of the corresponding observation in the predictor data.

Data Types: `single` | `double`

### Other Regression Properties

`CategoricalPredictors` — Categorical predictor indices, specified as a vector of positive integers. Assuming that the predictor data contains observations in rows, `CategoricalPredictors` contains index values corresponding to the columns of the predictor data that contain categorical predictors. If none of the predictors are categorical, then this property is empty (`[]`).

Data Types: `single` | `double`

`PredictorNames` — Predictor names in order of their appearance in the predictor data, specified as a cell array of character vectors. The length of `PredictorNames` is equal to the number of variables in the training data `X` or `Tbl` used as predictor variables.

Data Types: `cell`

`ResponseName` — Response variable name, specified as a character vector.

Data Types: `char`

`ResponseTransform` — Response transformation function, specified as `'none'` or a function handle. `ResponseTransform` describes how the software transforms raw response values.

For a MATLAB® function or a function that you define, enter its function handle. For example, you can enter `Mdl.ResponseTransform = @function`, where `function` accepts a numeric vector of the original responses and returns a numeric vector of the same size containing the transformed responses.

Data Types: `char` | `function_handle`
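For example, in a hypothetical scenario where the model was trained on log-transformed responses, you could map predictions back to the original scale:

```
CVMdl.ResponseTransform = @(y) exp(y); % Assumes the model was trained on log(Y)
```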

## Methods

| Method | Description |
| --- | --- |
| `kfoldLoss` | Regression loss for observations not used in training |
| `kfoldPredict` | Predict responses for observations not used in training |

## Copy Semantics

Value. To learn how value classes affect copy operations, see Copying Objects.

## Examples


Simulate 10,000 observations from this model:

$y = x_{100} + 2x_{200} + e$

- $X = \{x_1, \ldots, x_{1000}\}$ is a 10,000-by-1000 sparse matrix with 10% nonzero standard normal elements.

- $e$ is random normal error with mean 0 and standard deviation 0.3.

```
rng(1) % For reproducibility
n = 1e4;
d = 1e3;
nz = 0.1;
X = sprandn(n,d,nz);
Y = X(:,100) + 2*X(:,200) + 0.3*randn(n,1);
```

Cross-validate a linear regression model. To increase execution speed, transpose the predictor data and specify that the observations are in columns.

```
X = X';
CVMdl = fitrlinear(X,Y,'CrossVal','on','ObservationsIn','columns');
```

`CVMdl` is a `RegressionPartitionedLinear` cross-validated model. Because `fitrlinear` implements 10-fold cross-validation by default, `CVMdl.Trained` contains a cell vector of ten `RegressionLinear` models. Each cell contains a linear regression model trained on nine folds, and then tested on the remaining fold.

Predict responses for out-of-fold observations and estimate the generalization error by passing `CVMdl` to `kfoldPredict` and `kfoldLoss`, respectively.

```
oofYHat = kfoldPredict(CVMdl);
ge = kfoldLoss(CVMdl)
```
```ge = 0.1748 ```

The estimated generalization mean squared error is 0.1748.

To determine a good lasso-penalty strength for a linear regression model that uses least squares, implement 5-fold cross-validation.

Simulate 10,000 observations from this model:

$y = x_{100} + 2x_{200} + e$

- $X = \{x_1, \ldots, x_{1000}\}$ is a 10,000-by-1000 sparse matrix with 10% nonzero standard normal elements.

- $e$ is random normal error with mean 0 and standard deviation 0.3.

```
rng(1) % For reproducibility
n = 1e4;
d = 1e3;
nz = 0.1;
X = sprandn(n,d,nz);
Y = X(:,100) + 2*X(:,200) + 0.3*randn(n,1);
```

Create a set of 15 logarithmically spaced regularization strengths from $10^{-5}$ through $10^{-1}$.

`Lambda = logspace(-5,-1,15);`

Cross-validate the models. To increase execution speed, transpose the predictor data and specify that the observations are in columns. Optimize the objective function using SpaRSA.

```
X = X';
CVMdl = fitrlinear(X,Y,'ObservationsIn','columns','KFold',5,'Lambda',Lambda,...
    'Learner','leastsquares','Solver','sparsa','Regularization','lasso');
numCLModels = numel(CVMdl.Trained)
```
```numCLModels = 5 ```

`CVMdl` is a `RegressionPartitionedLinear` model. Because `fitrlinear` implements 5-fold cross-validation, `CVMdl` contains 5 `RegressionLinear` models that the software trains on each fold.

Display the first trained linear regression model.

`Mdl1 = CVMdl.Trained{1}`
```
Mdl1 = 
  RegressionLinear
         ResponseName: 'Y'
    ResponseTransform: 'none'
                 Beta: [1000x15 double]
                 Bias: [1x15 double]
               Lambda: [1x15 double]
              Learner: 'leastsquares'

  Properties, Methods
```

`Mdl1` is a `RegressionLinear` model object. `fitrlinear` constructed `Mdl1` by training on the observations outside the first fold. Because `Lambda` is a sequence of regularization strengths, you can think of `Mdl1` as 15 models, one for each regularization strength in `Lambda`.

Estimate the cross-validated MSE.

`mse = kfoldLoss(CVMdl);`

Higher values of `Lambda` lead to predictor variable sparsity, which is a desirable quality of a regression model. For each regularization strength, train a linear regression model using the entire data set and the same options as when you cross-validated the models. Determine the number of nonzero coefficients per model.

```
Mdl = fitrlinear(X,Y,'ObservationsIn','columns','Lambda',Lambda,...
    'Learner','leastsquares','Solver','sparsa','Regularization','lasso');
numNZCoeff = sum(Mdl.Beta~=0);
```

In the same figure, plot the cross-validated MSE and frequency of nonzero coefficients for each regularization strength. Plot all variables on the log scale.

```
figure
[h,hL1,hL2] = plotyy(log10(Lambda),log10(mse),...
    log10(Lambda),log10(numNZCoeff));
hL1.Marker = 'o';
hL2.Marker = 'o';
ylabel(h(1),'log_{10} MSE')
ylabel(h(2),'log_{10} nonzero-coefficient frequency')
xlabel('log_{10} Lambda')
hold off
```

Choose the index of the regularization strength that balances predictor variable sparsity and low MSE (for example, `Lambda(10)`).

`idxFinal = 10;`

Extract the model corresponding to the chosen regularization strength.

`MdlFinal = selectModels(Mdl,idxFinal)`
```
MdlFinal = 
  RegressionLinear
         ResponseName: 'Y'
    ResponseTransform: 'none'
                 Beta: [1000x1 double]
                 Bias: -0.0050
               Lambda: 0.0037
              Learner: 'leastsquares'

  Properties, Methods
```

`idxNZCoeff = find(MdlFinal.Beta~=0)`
```
idxNZCoeff = 2×1

   100
   200
```

`EstCoeff = Mdl.Beta(idxNZCoeff)`
```
EstCoeff = 2×1

    1.0051
    1.9965
```

`MdlFinal` is a `RegressionLinear` model with one regularization strength. The nonzero coefficients `EstCoeff` are close to the coefficients that simulated the data.