fscchi2

Univariate feature ranking for classification using chi-square tests

Description

idx = fscchi2(Tbl,ResponseVarName) ranks features (predictors) using chi-square tests. The table Tbl contains predictor variables and a response variable, and ResponseVarName is the name of the response variable in Tbl. The function returns idx, which contains the indices of predictors ordered by predictor importance, meaning idx(1) is the index of the most important predictor. You can use idx to select important predictors for classification problems.

idx = fscchi2(Tbl,formula) specifies a response variable and predictor variables to consider among the variables in Tbl by using formula.

idx = fscchi2(Tbl,Y) ranks predictors in Tbl using the response variable Y.

idx = fscchi2(X,Y) ranks predictors in X using the response variable Y.

idx = fscchi2(___,Name,Value) specifies additional options using one or more name-value pair arguments in addition to any of the input argument combinations in the previous syntaxes. For example, you can specify prior probabilities and observation weights.

[idx,scores] = fscchi2(___) also returns the predictor scores scores. A large score value indicates that the corresponding predictor is important.

Examples

Rank predictors in a numeric matrix and create a bar plot of predictor importance scores.

Load the sample data.

load ionosphere

ionosphere contains predictor variables (X) and a response variable (Y).

Rank the predictors using chi-square tests.

[idx,scores] = fscchi2(X,Y);

The values in scores are the negative logs of the p-values. If a p-value is smaller than eps(0), then the corresponding score value is Inf. Before creating a bar plot, determine whether scores includes Inf values.

find(isinf(scores))
ans =

  1x0 empty double row vector

scores does not include Inf values. If scores includes Inf values, you can replace each Inf with a large finite value before creating a bar plot for visualization purposes. For details, see Rank Predictors in Table.
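As a minimal sketch of that replacement, the following code substitutes the largest finite score for each Inf value in a copy of the score vector. The score values here are hypothetical and are not output from fscchi2.

```matlab
% Hypothetical scores for illustration only; two entries are Inf.
scores = [3.2 Inf 1.5 Inf 0.7];

% Largest finite score, used as the stand-in height for the Inf bars.
maxFinite = max(scores(isfinite(scores)));

% Work on a copy so that the original scores stay unchanged.
scoresForPlot = scores;
scoresForPlot(isinf(scores)) = maxFinite;
```

Keeping the original scores vector intact lets you still identify which predictors had Inf scores after plotting.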

Create a bar plot of the predictor importance scores.

bar(scores(idx))
xlabel('Predictor rank')
ylabel('Predictor importance score')

Select the top five most important predictors. Find the columns of these predictors in X.

idx(1:5)
ans = 1×5

     5     7     3     8     6

The fifth column of X is the most important predictor of Y.
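One way to act on this ranking is to keep only the top-ranked columns for later model training. The sketch below assumes a placeholder matrix of the same size as the ionosphere predictor data (351-by-34) and reuses the index values displayed above; the variable names topIdx and Xtop are illustrative.

```matlab
% Placeholder data standing in for the ionosphere predictors.
X = rand(351,34);

% The five top-ranked predictor indices displayed above, idx(1:5).
topIdx = [5 7 3 8 6];

% Keep only the top-ranked columns, ordered by importance.
Xtop = X(:,topIdx);
```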

Rank predictors in a table and create a bar plot of predictor importance scores.

If your data is in a table and fscchi2 ranks a subset of the variables in the table, then the function indexes the variables using only the subset. Therefore, a good practice is to move the predictors that you do not want to rank to the end of the table. Move the response variable and observation weight vector as well. Then, the indexes of the output arguments are consistent with the indexes of the table.

Load the census1994 data set.

load census1994

The table adultdata in census1994 contains demographic data from the US Census Bureau to predict whether an individual makes over $50,000 per year. Display the first three rows of the table.

head(adultdata,3)
ans=3×15 table
    age       workClass          fnlwgt      education    education_num      marital_status         occupation        relationship     race     sex     capital_gain    capital_loss    hours_per_week    native_country    salary
    ___    ________________    __________    _________    _____________    __________________    _________________    _____________    _____    ____    ____________    ____________    ______________    ______________    ______

    39     State-gov                77516    Bachelors         13          Never-married         Adm-clerical         Not-in-family    White    Male        2174             0                40          United-States     <=50K 
    50     Self-emp-not-inc         83311    Bachelors         13          Married-civ-spouse    Exec-managerial      Husband          White    Male           0             0                13          United-States     <=50K 
    38     Private             2.1565e+05    HS-grad            9          Divorced              Handlers-cleaners    Not-in-family    White    Male           0             0                40          United-States     <=50K 

In the table adultdata, the third column fnlwgt is the weight of the samples, and the last column salary is the response variable. Move fnlwgt to the left of salary by using the movevars function.

adultdata = movevars(adultdata,'fnlwgt','before','salary');
head(adultdata,3)
ans=3×15 table
    age       workClass        education    education_num      marital_status         occupation        relationship     race     sex     capital_gain    capital_loss    hours_per_week    native_country      fnlwgt      salary
    ___    ________________    _________    _____________    __________________    _________________    _____________    _____    ____    ____________    ____________    ______________    ______________    __________    ______

    39     State-gov           Bachelors         13          Never-married         Adm-clerical         Not-in-family    White    Male        2174             0                40          United-States          77516    <=50K 
    50     Self-emp-not-inc    Bachelors         13          Married-civ-spouse    Exec-managerial      Husband          White    Male           0             0                13          United-States          83311    <=50K 
    38     Private             HS-grad            9          Divorced              Handlers-cleaners    Not-in-family    White    Male           0             0                40          United-States     2.1565e+05    <=50K 

Rank the predictors in adultdata. Specify the column salary as a response variable, and specify the column fnlwgt as observation weights.

[idx,scores] = fscchi2(adultdata,'salary','Weights','fnlwgt');

The values in scores are the negative logs of the p-values. If a p-value is smaller than eps(0), then the corresponding score value is Inf. Before creating a bar plot, determine whether scores includes Inf values.

idxInf = find(isinf(scores))
idxInf = 1×8

     1     3     4     5     6     7    10    12

scores includes eight Inf values.

Create a bar plot of predictor importance scores. Use the predictor names for the x-axis tick labels.

figure
bar(scores(idx))
xlabel('Predictor rank')
ylabel('Predictor importance score')
xticklabels(strrep(adultdata.Properties.VariableNames(idx),'_','\_'))
xtickangle(45)

The bar function does not plot any bars for the Inf values. For the Inf values, plot bars that have the same length as the largest finite score.

hold on
bar(scores(idx(length(idxInf)+1))*ones(length(idxInf),1))
legend('Finite Scores','Inf Scores')
hold off

The bar graph displays finite scores and Inf scores using different colors.

Input Arguments

Sample data, specified as a table. Multicolumn variables and cell arrays other than cell arrays of character vectors are not allowed.

Each row of Tbl corresponds to one observation, and each column corresponds to one predictor variable. Optionally, Tbl can contain additional columns for a response variable and observation weights.

A response variable can be a categorical, character, or string array, logical or numeric vector, or cell array of character vectors. If the response variable is a character array, then each element of the response variable must correspond to one row of the array.

  • If Tbl contains the response variable, and you want to use all remaining variables in Tbl as predictors, then specify the response variable by using ResponseVarName. If Tbl also contains the observation weights, then you can specify the weights by using Weights.

  • If Tbl contains the response variable, and you want to use only a subset of the remaining variables in Tbl as predictors, then specify the subset of variables by using formula.

  • If Tbl does not contain the response variable, then specify a response variable by using Y. The response variable and Tbl must have the same number of rows.

If fscchi2 uses a subset of variables in Tbl as predictors, then the function indexes the predictors using only the subset. The values in the 'CategoricalPredictors' name-value pair argument and the output argument idx do not count the predictors that the function does not rank.

fscchi2 considers NaN, '' (empty character vector), "" (empty string), <missing>, and <undefined> values in Tbl for a response variable to be missing values. fscchi2 does not use observations with missing values for a response variable.

Data Types: table

Response variable name, specified as a character vector or string scalar containing the name of a variable in Tbl.

For example, if a response variable is the column Y of Tbl (Tbl.Y), then specify ResponseVarName as 'Y'.

Data Types: char | string

Explanatory model of the response variable and a subset of the predictor variables, specified as a character vector or string scalar in the form 'Y ~ X1 + X2 + X3'. In this form, Y represents the response variable, and X1, X2, and X3 represent the predictor variables.

To specify a subset of variables in Tbl as predictors, use a formula. If you specify a formula, then fscchi2 does not rank any variables in Tbl that do not appear in formula.

The variable names in the formula must be both variable names in Tbl (Tbl.Properties.VariableNames) and valid MATLAB® identifiers. For details, see Tips.

Data Types: char | string
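The following sketch contrasts ResponseVarName with formula on a small synthetic table. The variable names X1, X2, X3, and Y and the data itself are assumptions made up for illustration.

```matlab
rng(1)                                       % reproducible synthetic data
n = 100;
X1 = randn(n,1);
X2 = randn(n,1);
X3 = randn(n,1);
Y  = categorical(X1 + 0.5*randn(n,1) > 0);   % response driven mainly by X1
Tbl = table(X1,X2,X3,Y);

% Rank all remaining variables (X1, X2, X3) against the response Y.
idxAll = fscchi2(Tbl,'Y');

% Rank only X1 and X3. X2 is not ranked because it is not in the formula.
idxSub = fscchi2(Tbl,'Y ~ X1 + X3');
```

In this sketch, the values in idxSub index the subset only: 1 refers to X1 and 2 refers to X3.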

Response variable, specified as a numeric, categorical, or logical vector, a character or string array, or a cell array of character vectors. Each row of Y represents the labels of the corresponding row of X.

fscchi2 considers NaN, '' (empty character vector), "" (empty string), <missing>, and <undefined> values in Y to be missing values. fscchi2 does not use observations with missing values for Y.

Data Types: single | double | categorical | logical | char | string | cell

Predictor data, specified as a numeric matrix. Each row of X corresponds to one observation, and each column corresponds to one predictor variable.

Data Types: single | double

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'NumBins',20,'UseMissing',true sets the number of bins as 20 and specifies to use missing values in predictors for ranking.
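As a rough illustration of this syntax, the sketch below passes both name-value pairs to fscchi2 for a synthetic numeric matrix. The data, including the injected NaN values, is made up for this example.

```matlab
rng(2)                                 % reproducible synthetic data
n = 150;
X = randn(n,4);
X(1:10,2) = NaN;                       % inject missing values into predictor 2
Y = double(X(:,1) + randn(n,1) > 0);   % numeric class labels 0 and 1

% Bin each continuous predictor into 20 bins and keep NaN values
% in a separate bin instead of discarding them.
[idx,scores] = fscchi2(X,Y,'NumBins',20,'UseMissing',true);
```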

List of categorical predictors, specified as the comma-separated pair consisting of 'CategoricalPredictors' and one of the values in this table.

Value                                              Description
Vector of positive integers                        Each entry in the vector is an index value corresponding to the column of the predictor data (X or Tbl) that contains a categorical variable.
Logical vector                                     A true entry means that the corresponding column of predictor data (X or Tbl) is a categorical variable.
Character matrix                                   Each row of the matrix is the name of a predictor variable. The names must match the names in Tbl. Pad the names with extra blanks so each row of the character matrix has the same length.
String array or cell array of character vectors    Each element in the array is the name of a predictor variable. The names must match the names in Tbl.
'all'                                              All predictors are categorical.

By default, if the predictor data is in a table (Tbl), fscchi2 assumes that a variable is categorical if it is a logical vector, unordered categorical vector, character array, string array, or cell array of character vectors. If the predictor data is a matrix (X), fscchi2 assumes that all predictors are continuous. To identify any other predictors as categorical predictors, specify them by using the 'CategoricalPredictors' name-value pair argument.

If fscchi2 uses a subset of variables in Tbl as predictors, then the function indexes the predictors using only the subset. The 'CategoricalPredictors' values do not count the predictors that the function does not rank.

Example: 'CategoricalPredictors','all'

Data Types: single | double | logical | char | string | cell
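Because fscchi2 treats every column of a numeric matrix as continuous by default, a column of integer category codes needs to be flagged explicitly. The sketch below assumes synthetic data in which the second column holds category codes 1 through 3.

```matlab
rng(3)                                 % reproducible synthetic data
n = 120;
x     = randn(n,1);                    % continuous predictor
group = randi(3,n,1);                  % integer codes 1..3 for a categorical predictor
X = [x group];
Y = double(group == 3 | x > 1);        % numeric class labels 0 and 1

% Column 2 holds category codes, so identify it as categorical.
[idx,scores] = fscchi2(X,Y,'CategoricalPredictors',2);
```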

Names of the classes to use for ranking, specified as the comma-separated pair consisting of 'ClassNames' and a categorical, character, or string array, a logical or numeric vector, or a cell array of character vectors. ClassNames must have the same data type as Y or the response variable in Tbl.

If ClassNames is a character array, then each element must correspond to one row of the array.

Use 'ClassNames' to:

  • Specify the order of the Prior dimensions that corresponds to the class order.

  • Select a subset of classes for ranking. For example, suppose that the set of all distinct class names in Y is {'a','b','c'}. To rank predictors using observations from classes 'a' and 'c' only, specify 'ClassNames',{'a','c'}.

The default value for 'ClassNames' is the set of all distinct class names in Y or the response variable in Tbl. The default 'ClassNames' value has mathematical ordering if the response variable is ordinal. Otherwise, the default value has alphabetical ordering.

Example: 'ClassNames',{'b','g'}

Data Types: categorical | char | string | logical | single | double | cell
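The class-subset use of 'ClassNames' described above can be sketched as follows. The predictor data and the labels 'a', 'b', and 'c' are synthetic and chosen only to mirror the example in the text.

```matlab
rng(4)                                          % reproducible synthetic data
n = 90;
X = randn(n,3);
labels = {'a','b','c'};
Y = reshape(labels(randi(3,n,1)),[],1);         % n-by-1 cell array of character vectors

% Rank predictors using only the observations from classes 'a' and 'c'.
[idx,scores] = fscchi2(X,Y,'ClassNames',{'a','c'});
```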

Number of bins for binning continuous predictors, specified as the comma-separated pair consisting of 'NumBins' and a positive integer scalar.

Example: 'NumBins',50

Data Types: single | double

Prior probabilities for each class, specified as the comma-separated pair consisting of 'Prior' and one of the following:

  • Character vector or string scalar.

    • 'empirical' determines class probabilities from class frequencies in the response variable in Y or Tbl. If you pass observation weights, fscchi2 uses the weights to compute the class probabilities.

    • 'uniform' sets all class probabilities to be equal.

  • Vector (one scalar value for each class). To specify the class order for the corresponding elements of 'Prior', also specify the ClassNames name-value pair argument.

  • Structure S with two fields.

    • S.ClassNames contains the class names as a variable of the same type as the response variable in Y or Tbl.

    • S.ClassProbs contains a vector of corresponding probabilities.

If you set values for both 'Weights' and 'Prior', fscchi2 normalizes the weights in each class to add up to the value of the prior probability of the respective class.

Example: 'Prior','uniform'

Data Types: char | string | single | double | struct

Indicator for whether to use or discard missing values in predictors, specified as the comma-separated pair consisting of 'UseMissing' and either true to use or false to discard missing values in predictors for ranking.

fscchi2 considers NaN, '' (empty character vector), "" (empty string), <missing>, and <undefined> values to be missing values.

If you specify 'UseMissing',true, then fscchi2 uses missing values for ranking. For a categorical variable, fscchi2 treats missing values as an extra category. For a continuous variable, fscchi2 places NaN values in a separate bin for binning.

If you specify 'UseMissing',false, then fscchi2 does not use missing values for ranking. Because fscchi2 computes importance scores individually for each predictor, the function does not discard an entire row when values in the row are partially missing. For each variable, fscchi2 uses all values that are not missing.

Example: 'UseMissing',true

Data Types: logical
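To make the per-predictor handling of missing values concrete, the sketch below counts how many observations remain for each predictor when missing values are dropped column by column, and compares that to discarding whole rows. It illustrates the idea only and is not the internal code of fscchi2; the matrix is made up.

```matlab
% Four observations, two predictors, one NaN in each column.
X = [1 NaN; 2 5; NaN 6; 4 7];

numPredictors = size(X,2);
nUsed = zeros(1,numPredictors);
for j = 1:numPredictors
    keep = ~isnan(X(:,j));      % rows with a value for predictor j
    nUsed(j) = sum(keep);       % observations available for predictor j
end

% For comparison: rows that would survive if entire rows were discarded.
nCompleteRows = sum(all(~isnan(X),2));
```

Here each predictor keeps three observations, whereas discarding partially missing rows would leave only two.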

Observation weights, specified as the comma-separated pair consisting of 'Weights' and a vector of scalar values or the name of a variable in Tbl. The function weights the observations in each row of X or Tbl with the corresponding value in Weights. The size of Weights must equal the number of rows in X or Tbl.

If you specify the input data as a table Tbl, then Weights can be the name of a variable in Tbl that contains a numeric vector. In this case, you must specify Weights as a character vector or string scalar. For example, if the weight vector is the column W of Tbl (Tbl.W), then specify 'Weights','W'.

fscchi2 normalizes the weights in each class to add up to the value of the prior probability of the respective class.

Data Types: single | double | char | string
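The normalization of weights to the class priors can be pictured with a short sketch. The class labels, weights, and prior values below are hypothetical, and the loop is an illustration of the stated rule rather than the function's actual implementation.

```matlab
Y     = [1;1;2;2;2];            % class label of each observation
w     = [1;3;2;2;4];            % raw observation weights
prior = [0.3 0.7];              % prior probability of class 1 and class 2

wNorm = zeros(size(w));
for k = 1:numel(prior)
    inClass = (Y == k);
    % Rescale the weights of class k so that they sum to prior(k).
    wNorm(inClass) = w(inClass) / sum(w(inClass)) * prior(k);
end
```

After this rescaling, the relative weights within each class are preserved and the total weight over all observations is 1.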

Output Arguments

Indices of predictors in X or Tbl ordered by predictor importance, returned as a 1-by-r numeric vector, where r is the number of ranked predictors.

If fscchi2 uses a subset of variables in Tbl as predictors, then the function indexes the predictors using only the subset. For example, suppose Tbl includes 10 columns and you specify the last five columns of Tbl as the predictor variables by using formula. If idx(3) is 5, then the third most important predictor is the 10th column in Tbl, which is the fifth predictor in the subset.

Predictor scores, returned as a 1-by-r numeric vector, where r is the number of ranked predictors.

A large score value indicates that the corresponding predictor is important.

  • If you use X to specify the predictors or use all the variables in Tbl as predictors, then the values in scores have the same order as the predictors in X or Tbl.

  • If you specify a subset of variables in Tbl as predictors, then the values in scores have the same order as the subset.

For example, suppose Tbl includes 10 columns and you specify the last five columns of Tbl as the predictor variables by using formula. Then, scores(3) contains the score value of the 8th column in Tbl, which is the third predictor in the subset.
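The subset indexing of idx and scores can be traced with a small sketch. The score values are hypothetical, and sort stands in for the ranking step only so that the example runs without data; the names subsetCols and tblCols are illustrative.

```matlab
% Tbl has 10 columns; the last five are the predictors named in formula.
subsetCols = 6:10;

% Hypothetical scores, stored in the order of the subset.
scores = [0.4 5.0 0.9 3.3 2.1];

% Stand-in for the ranking: subset indices ordered by descending score.
[~,idx] = sort(scores,'descend');

% Map subset indices back to column numbers in Tbl.
tblCols = subsetCols(idx);

thirdRankedCol = tblCols(3);        % column of the third most important predictor
colOfThirdScore = subsetCols(3);    % column that scores(3) describes
```

With these values, idx(3) is 5, so the third most important predictor is column 10 of Tbl, while scores(3) belongs to column 8.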

Tips

  • If you specify the response variable and predictor variables by using the input argument formula, then the variable names in the formula must be both variable names in Tbl (Tbl.Properties.VariableNames) and valid MATLAB identifiers.

    You can verify the variable names in Tbl by using the isvarname function. The following code returns logical 1 (true) for each variable that has a valid variable name.

    cellfun(@isvarname,Tbl.Properties.VariableNames)

    If the variable names in Tbl are not valid, then convert them by using the matlab.lang.makeValidName function.

    Tbl.Properties.VariableNames = matlab.lang.makeValidName(Tbl.Properties.VariableNames);
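As a quick sketch of this check outside of a table, the code below applies the same two functions to a cell array of names. The names are hypothetical examples, two of which are not valid MATLAB identifiers.

```matlab
% Hypothetical variable names; the last two are not valid identifiers.
names = {'age','hours per week','2ndJob'};

% Logical 1 (true) for each name that is a valid MATLAB identifier.
isValid = cellfun(@isvarname,names);

% Convert all names to valid identifiers.
fixedNames = matlab.lang.makeValidName(names);
```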

Algorithms

Univariate Feature Ranking Using Chi-Square Tests

  • fscchi2 examines whether each predictor variable is independent of a response variable by using individual chi-square tests. A small p-value of the test statistic indicates that the corresponding predictor variable is dependent on the response variable and is therefore an important feature.

  • The output scores is –log(p). Therefore, a large score value indicates that the corresponding predictor is important. If a p-value is smaller than eps(0), then the corresponding score value is Inf.

  • fscchi2 examines a continuous variable after binning, or discretizing, the variable. You can specify the number of bins using the 'NumBins' name-value pair argument.
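The steps above can be sketched for a single continuous predictor as follows. This is an illustration under stated assumptions, not the actual implementation of fscchi2: the equal-width binning scheme is an assumption, observation weights and priors are ignored, and the data is synthetic. The code uses chi2cdf from Statistics and Machine Learning Toolbox.

```matlab
rng(0)                                   % reproducible synthetic data
n = 200;
y = randi(2,n,1);                        % two-class response with labels 1 and 2
x = randn(n,1) + (y == 2);               % continuous predictor shifted by class

% Bin the continuous predictor (equal-width bins are an assumption here).
numBins = 10;                            % plays the role of 'NumBins'
edges = linspace(min(x),max(x),numBins+1);
bin = discretize(x,edges);               % bin index of each observation

% Contingency table of bin versus class, without empty bins.
observed = accumarray([bin y],1,[numBins 2]);
observed = observed(sum(observed,2) > 0,:);

% Expected counts if the predictor and the response were independent.
expected = sum(observed,2)*sum(observed,1)/n;

% Chi-square statistic, its upper-tail p-value, and the score -log(p).
chi2stat = sum((observed - expected).^2 ./ expected,'all');
dof = (size(observed,1) - 1)*(size(observed,2) - 1);
p = chi2cdf(chi2stat,dof,'upper');
score = -log(p);
```

If p underflows to 0, then -log(p) evaluates to Inf, which matches the Inf scores described above.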

Introduced in R2020a