Dummy Indicator Variables

What Are Dummy Variables?

When performing regression analysis, it is common to include both continuous and categorical (quantitative and qualitative) predictor variables. When including a categorical independent variable, it is important not to input the variable as a numeric array. Numeric arrays have both order and magnitude. A categorical variable might have order (for example, an ordinal variable), but it does not have magnitude. Using a numeric array implies a known “distance” between the categories.

The appropriate way to include categorical predictors is as dummy indicator variables. An indicator variable has values 0 and 1. A categorical variable with c categories can be represented by c – 1 indicator variables.

For example, suppose you have a categorical variable with levels {Small,Medium,Large}. You can represent this variable using two dummy variables, as shown in this table.

    Level     X1    X2
    Small      0     0
    Medium     1     0
    Large      0     1

In this example, X1 is a dummy variable that has value 1 for the Medium group, and 0 otherwise. X2 is a dummy variable that has value 1 for the Large group, and 0 otherwise. Together, these two variables represent the three categories. Observations in the Small group have 0s for both dummy variables.
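As a rough sketch of the coding just described, the two indicator columns can be built with logical comparisons. The sample data and the variable names sizes and D are made up for illustration; only X1 and X2 come from the text above.

```matlab
% Hypothetical sample data: five observations of a three-level variable.
% The second argument fixes the level order as Small, Medium, Large.
sizes = categorical({'Small';'Large';'Medium';'Small';'Large'}, ...
                    {'Small','Medium','Large'});

% One indicator per non-reference level; Small is the reference group.
X1 = double(sizes == 'Medium');   % 1 for Medium, 0 otherwise
X2 = double(sizes == 'Large');    % 1 for Large, 0 otherwise

% Rows for Small observations contain 0 in both columns.
D = [X1 X2]
```

Each row of D has at most one nonzero entry, and the Small observations are identified by the absence of any 1.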

The category represented by all 0s is the reference group. When you include the dummy variables in a regression model, the coefficients of the dummy variables are interpreted with respect to the reference group.
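One way to make this interpretation concrete, assuming a linear model that contains only an intercept and the two dummy variables X1 (Medium) and X2 (Large) from the example, with Small as the reference group:

```latex
y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon

\begin{aligned}
E[y \mid \text{Small}]  &= \beta_0 \\
E[y \mid \text{Medium}] &= \beta_0 + \beta_1 \\
E[y \mid \text{Large}]  &= \beta_0 + \beta_2
\end{aligned}
```

Under these assumptions, the intercept is the expected response for the reference group, and each dummy coefficient is the difference in expected response between its group and the reference group.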

Creating Dummy Variables

Automatic Creation of Dummy Variables

Most classification and regression fitting functions accept categorical predictors.

  • If the predictor data is in a table, the function assumes that a variable is categorical if it contains logical values, categorical values, a string array, or a cell array of character vectors.

  • If the predictor data is a matrix, the function assumes all predictors are continuous. To identify any categorical predictors, use the 'CategoricalPredictors' or 'CategoricalVars' name-value pair argument.
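The two cases above might look like the following sketch, which uses fitlm on simulated data. The data and the variable names (x1, group, y, tbl, mdl, mdl2) are hypothetical and chosen only for illustration.

```matlab
% Hypothetical data: x1 is continuous, group holds the codes 1, 2, 3.
rng(0);                                   % for reproducibility
x1    = randn(30,1);
group = repmat([1;2;3], 10, 1);
y     = 2*x1 + (group == 2) + 3*(group == 3) + 0.1*randn(30,1);

% Matrix input: every column is treated as continuous unless flagged,
% so column 2 is identified as categorical explicitly.
X   = [x1 group];
mdl = fitlm(X, y, 'CategoricalVars', 2);

% Table input: a variable stored as categorical is detected automatically.
tbl  = table(x1, categorical(group), y, ...
             'VariableNames', {'x1','group','y'});
mdl2 = fitlm(tbl, 'y ~ x1 + group');
```

Without the 'CategoricalVars' argument in the matrix case, the group codes would enter the model as a single continuous predictor, which imposes equal spacing between the categories.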

For parametric regression fitting functions such as fitlm and fitglm, if there are c unique levels in the categorical array, then the fitting function estimates c – 1 coefficients for the categorical predictor.
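A small check of the c – 1 count, sketched with fitlm on simulated data. The table, the variable names (Size, y, tbl, mdl, c, nDummy), and the simulated coefficients are assumptions made for this example.

```matlab
% Hypothetical data: one categorical predictor with c = 3 levels.
rng(1);                                   % for reproducibility
Size = categorical(repmat({'Small';'Medium';'Large'}, 8, 1), ...
                   {'Small','Medium','Large'});
y    = 5 + 2*(Size == 'Medium') + 4*(Size == 'Large') + 0.2*randn(24,1);
tbl  = table(Size, y);

mdl = fitlm(tbl, 'y ~ Size');

c      = numel(categories(Size));         % number of levels
nDummy = mdl.NumCoefficients - 1;         % coefficients other than the intercept

% Lists the intercept plus one coefficient per non-reference level.
disp(mdl.CoefficientNames)
```

Here nDummy equals c – 1, and the first level in the category order (Small) has no coefficient of its own because it serves as the reference group.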

Manual Creation of Dummy Variables

If you prefer to create your own dummy variable design matrix, use dummyvar. This function accepts a numeric or categorical column vector, and returns a matrix of indicator variables. The dummy variable design matrix has a column for every group, and a row for every observation.

For example,

gender = nominal({'Male';'Female';'Female';'Male';'Female'});
dv = dummyvar(gender)
dv =

     0     1
     1     0
     1     0
     0     1
     1     0

There are five rows corresponding to the number of rows in gender, and two columns for the unique groups, Female and Male. Column order corresponds to the order of the levels in gender. For nominal arrays, the default order is ascending alphabetical.
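To make the column order visible, the indicator columns can be labeled with the level names. This sketch assumes a categorical array rather than the nominal array used above; the names labels and T are hypothetical.

```matlab
gender = categorical({'Male';'Female';'Female';'Male';'Female'});
dv     = dummyvar(gender);

% Category names, in the same order as the columns of dv.
labels = categories(gender);

% Attach the names to the columns for easier reading.
T = array2table(dv, 'VariableNames', labels')
```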

To use these dummy variables in a regression model, you must either delete a column (to create a reference group), or fit a regression model with no intercept term. For the gender example, only one dummy variable is needed to represent two genders. Notice what happens if you add an intercept term to the complete design matrix, dv.

X = [ones(5,1) dv]
X =

     1     0     1
     1     1     0
     1     1     0
     1     0     1
     1     1     0
rank(X)
ans =

     2

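The design matrix with an intercept term is not of full rank: the intercept column equals the sum of the two dummy columns. As a result, X'*X is not invertible, and the least-squares coefficients are not uniquely determined. Because of this linear dependence, use only c – 1 indicator variables to represent a categorical variable with c categories in a regression model with an intercept term.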
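The two remedies mentioned earlier (dropping a column, or omitting the intercept) can be checked with rank. This is a minimal sketch that assumes a categorical version of the gender array; the names Xref and Xnoint are hypothetical.

```matlab
gender = categorical({'Male';'Female';'Female';'Male';'Female'});
dv     = dummyvar(gender);            % columns: Female, Male

% Option 1: drop the first column, so Female becomes the reference group.
Xref = [ones(5,1) dv(:,2:end)];
rank(Xref)                            % equals the number of columns (2)

% Option 2: keep every indicator column but omit the intercept.
Xnoint = dv;
rank(Xnoint)                          % equals the number of columns (2)
```

Both design matrices have full column rank, so either one yields uniquely determined coefficients. They differ only in interpretation: with a reference group, the dummy coefficient is a difference from that group, whereas without an intercept each coefficient is that group's own level.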
