
Select Data and Validation for Regression Problem

Select Data from Workspace

Tip

In Regression Learner, tables are the easiest way to work with your data, because they can contain numeric and label data. Use the Import Tool to bring your data into the MATLAB® workspace as a table, or use the table functions to create a table from workspace variables. See Tables (MATLAB).

If your predictors are a matrix and the response is a vector, combine them into a table using the table function.
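
As a minimal sketch of that step, the following code builds a table from a predictor matrix and a response vector. The data is randomly generated, and the names X, y, x1 to x3, and tbl are hypothetical, chosen only for illustration.

```matlab
% Hypothetical example data: 20 observations, 3 numeric predictors
rng(0);                                  % for reproducibility
X = rand(20,3);                          % predictor matrix (rows are observations)
y = X*[1; 2; 3] + 0.1*randn(20,1);       % response vector

% Convert the matrix to a table, one variable per column
tbl = array2table(X,'VariableNames',{'x1','x2','x3'});

% Append the response as the last table variable
tbl.y = y;
```

The resulting table tbl can then be selected in the New Session dialog box, with y as the response.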

  1. Load your data into the MATLAB workspace.

    Predictor variables can be numeric, categorical, string, or logical vectors, cell arrays of character vectors, or character arrays. The response variable must be a floating-point vector (single or double precision).

    For example data sets, see Example Data for Regression.

  2. On the Apps tab, click Regression Learner to open the app.

  3. On the Regression Learner tab, in the File section, click New Session.

  4. In the New Session dialog box, select a table or matrix from the workspace variables.

    If you select a matrix, choose whether to use rows or columns for observations by clicking the option buttons.

  5. Observe the roles the app selects for the variables based on their data type. The app tries to select a suitable response variable, and all other variables are predictors. Change the selections if needed. Add or remove predictors using the check boxes. Add or remove all predictors by clicking Add All or Remove All. You can also add or remove multiple predictors by selecting them in the table, and then clicking Add N or Remove N, where N is the number of selected predictors. The Add All and Remove All buttons change to Add N and Remove N when you select multiple predictors.

  6. Click Start Session to accept the default validation scheme and continue. The default validation option is 5-fold cross-validation, which protects against overfitting.

    Tip

    If you have a large data set, you might want to switch to holdout validation. To learn more, see Choose Validation Scheme.

For next steps, see Train Regression Models in Regression Learner App.

Import Data from File

  1. On the Regression Learner tab, in the File section, select New Session > From File.

  2. Select a file type in the list, such as spreadsheets, text files, or comma-separated values (.csv) files, or select All Files to browse for other file types such as .dat.
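
As an alternative to importing from a file inside the app, you can read a file into a table at the command line and then select that table from the workspace. The sketch below assumes a small made-up data set written to a temporary CSV file; the names T, csvFile, and imported are hypothetical.

```matlab
% Hypothetical data written to a temporary CSV file for illustration
T = table([1;2;3],[2.5;3.1;4.8],'VariableNames',{'x','y'});
csvFile = [tempname '.csv'];
writetable(T,csvFile);

% Read the file back into a table, as you might before selecting
% the table in the New Session dialog box
imported = readtable(csvFile);

% Remove the temporary file
delete(csvFile);
```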

Example Data for Regression

To get started using Regression Learner, try these example data sets.

Name: Cars
Size: Number of predictors: 7. Number of observations: 406. Response: MPG (miles per gallon).
Description: Data on different car models, 1970–1982. Predict the fuel economy (in miles per gallon), or one of the other characteristics.

For a step-by-step example, see Train Regression Trees Using Regression Learner App.

Create a table from variables in the carbig.mat file:

    load carbig
    cartable = table(Acceleration, Cylinders, Displacement,...
        Horsepower, Model_Year, Weight, Origin, MPG);

Name: Abalone
Size: Number of predictors: 8. Number of observations: 4177. Response: Rings.
Description: Measurements of abalone (a group of sea snails). Predict the age of abalones, which is closely related to the number of rings in their shells.

Download the data from the UCI Machine Learning Repository and save it in your current folder. Read the data into a table and specify the variable names. The file has no header row, so tell readtable not to read variable names from the first line.

    url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data';
    websave('abalone.csv',url);
    varnames = {'Sex'; 'Length'; 'Diameter'; 'Height'; 'Whole_weight';...
        'Shucked_weight'; 'Viscera_weight'; 'Shell_weight'; 'Rings'};
    % The first line of the file is data, not variable names
    abalonetable = readtable('abalone.csv','ReadVariableNames',false);
    abalonetable.Properties.VariableNames = varnames;

Name: Hospital
Size: Number of predictors: 5. Number of observations: 100. Response: BloodPressure_2.
Description: Simulated hospital data. Predict the blood pressure of patients.

Create a table from the hospital variable in the hospital.mat file:

    load hospital.mat
    hospitaltable = dataset2table(hospital(:,2:end-1));

Choose Validation Scheme

Choose a validation method to examine the predictive accuracy of the fitted models. Validation estimates model performance on new data and helps you choose the best model. It also helps protect against overfitting: a model that is too flexible and overfits the training data has worse validation accuracy. Choose a validation scheme before training any models so that you can compare all the models in your session using the same validation scheme.

Tip

Try the default validation scheme and click Start Session to continue. The default option is 5-fold cross-validation, which protects against overfitting.

If you have a large data set and training the models takes too long using cross-validation, reimport your data and try the faster holdout validation instead.

  • Cross-Validation: Select the number of folds (or divisions) to partition the data set using the slider control.

    If you choose k folds, then the app:

    1. Partitions the data into k disjoint sets or folds

    2. For each fold:

      1. Trains a model using the out-of-fold observations

      2. Assesses model performance using in-fold data

    3. Calculates the average test error over all folds

    This method gives a good estimate of the predictive accuracy of the final model trained using the full data set. The method requires multiple fits, but makes efficient use of all the data, so it works well for small data sets.

  • Holdout Validation: Select a percentage of the data to use as a validation set using the slider control. The app trains a model on the training set and assesses its performance with the validation set. The model used for validation is based on only a portion of the data, so holdout validation is appropriate only for large data sets. The final model is trained using the full data set.

  • No Validation: Provides no protection against overfitting. The app uses all the data for training and computes the error on the same data. Without any test data, you get an unrealistic estimate of the model’s performance on new data. That is, the training sample accuracy is likely to be unrealistically high, and the predictive accuracy is likely to be lower.

    To help you avoid overfitting to the training data, choose a validation scheme instead.
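
The three schemes above can be sketched at the command line as follows. This is an illustration under stated assumptions, not the app's own implementation: the data is synthetic, the model type (a regression tree from fitrtree), the 25% holdout fraction, the mean squared error metric, and all variable names are choices made for this example. It requires Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical synthetic data: 200 observations, 2 predictors
rng(1);                                   % for reproducibility
n = 200;
X = rand(n,2);
y = 3*X(:,1) - 2*X(:,2) + 0.2*randn(n,1);

% Cross-validation with k = 5 folds
k = 5;
cvp = cvpartition(n,'KFold',k);           % partition into k disjoint folds
foldMSE = zeros(k,1);
for i = 1:k
    trainIdx = training(cvp,i);           % out-of-fold observations
    testIdx  = test(cvp,i);               % in-fold observations
    mdl = fitrtree(X(trainIdx,:),y(trainIdx));
    foldMSE(i) = mean((y(testIdx) - predict(mdl,X(testIdx,:))).^2);
end
cvMSE = mean(foldMSE);                    % average test error over all folds

% Holdout validation with 25% of the data held out
hp = cvpartition(n,'HoldOut',0.25);
mdlHold = fitrtree(X(training(hp),:),y(training(hp)));
holdMSE = mean((y(test(hp)) - predict(mdlHold,X(test(hp),:))).^2);

% No validation: error computed on the same data used for training
mdlFull = fitrtree(X,y);                  % model trained on the full data set
resubMSE = mean((y - predict(mdlFull,X)).^2);
```

In a run like this, resubMSE is typically noticeably smaller than cvMSE and holdMSE, which illustrates why the error computed without validation is an optimistic estimate of performance on new data.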

Note

The validation scheme only affects the way that Regression Learner computes validation metrics. The final model is always trained using the full data set.

All the models you train after selecting data use the same validation scheme that you select in this dialog box. You can compare all the models in your session using the same validation scheme.

To change the validation selection and train new models, you can select data again, but you lose any trained models. The app warns you that importing data starts a new session. Save any trained models you want to keep to the workspace, and then import the data.

For next steps, see Train Regression Models in Regression Learner App.
