Compare accuracies of two classification models using new data

`compareHoldout` statistically assesses the accuracies of two classification models. The function first compares their predicted labels against the true labels, and then it detects whether the difference between the misclassification rates is statistically significant.

You can determine whether the accuracies of the classification models differ or whether one model performs better than another. `compareHoldout` can conduct several McNemar test variations, including the asymptotic test, the exact-conditional test, and the mid-*p*-value test. For cost-sensitive assessment, available tests include a chi-square test (requires Optimization Toolbox™) and a likelihood ratio test.

`h = compareHoldout(C1,C2,T1,T2,ResponseVarName)`

`h = compareHoldout(C1,C2,T1,T2,Y)`

`h = compareHoldout(C1,C2,X1,X2,Y)`

`h = compareHoldout(___,Name,Value)`

`[h,p,e1,e2] = compareHoldout(___)`

`h = compareHoldout(C1,C2,T1,T2,ResponseVarName)` returns the test decision from testing the null hypothesis that the trained classification models `C1` and `C2` have equal accuracy for predicting the true class labels in the `ResponseVarName` variable. The alternative hypothesis is that the models have unequal accuracy.

The first classification model `C1` uses the predictor data in `T1`, and the second classification model `C2` uses the predictor data in `T2`. The tables `T1` and `T2` must contain the same response variable but can contain different sets of predictors. By default, the software conducts the mid-*p*-value McNemar test to compare the accuracies.

`h = 1` indicates rejecting the null hypothesis at the 5% significance level. `h = 0` indicates not rejecting the null hypothesis at the 5% level.

The following are examples of tests you can conduct:

- Compare the accuracies of a simple classification model and a model that is more complex by passing the same set of predictor data (that is, `T1` = `T2`).
- Compare the accuracies of two potentially different models using two potentially different sets of predictors.
- Perform various types of feature selection. For example, you can compare the accuracy of a model trained using a set of predictors to the accuracy of one trained on a subset or different set of those predictors. You can choose the set of predictors arbitrarily, or use a feature selection technique such as PCA or sequential feature selection (see `pca` and `sequentialfs`).
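The first case in the list (same predictor data for both models) might look like the following minimal sketch. It assumes the `fisheriris` sample data set that ships with the toolbox; the table variable names, the 30% holdout fraction, and the choice of a decision tree versus a discriminant model are illustrative assumptions, not part of the function's specification.

```matlab
% Sketch: compare two models on the same held-out table (T1 = T2).
load fisheriris                                   % meas (150x4), species (150x1 cell)
T = array2table(meas, 'VariableNames', {'SL','SW','PL','PW'});  % assumed names
T.Species = species;

rng(1);                                           % reproducible partition
cvp    = cvpartition(T.Species, 'Holdout', 0.3);  % assumed 30% holdout
TTrain = T(training(cvp), :);
TTest  = T(test(cvp), :);

C1 = fitctree(TTrain, 'Species');                 % first model (illustrative choice)
C2 = fitcdiscr(TTrain, 'Species');                % second model (illustrative choice)

% Both models read their predictors from the same test table.
% With no name-value arguments, the default mid-p McNemar test is used.
h = compareHoldout(C1, C2, TTest, TTest, 'Species');
```

Here `h = 0` would mean the test found no significant difference between the two models at the 5% level.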

`h = compareHoldout(C1,C2,T1,T2,Y)` returns the test decision from testing the null hypothesis that the trained classification models `C1` and `C2` have equal accuracy for predicting the true class labels `Y`. The alternative hypothesis is that the models have unequal accuracy.

The first classification model `C1` uses the predictor data `T1`, and the second classification model `C2` uses the predictor data `T2`. By default, the software conducts the mid-*p*-value McNemar test to compare the accuracies.

`h = compareHoldout(C1,C2,X1,X2,Y)` returns the test decision from testing the null hypothesis that the trained classification models `C1` and `C2` have equal accuracy for predicting the true class labels `Y`. The alternative hypothesis is that the models have unequal accuracy.

The first classification model `C1` uses the predictor data `X1`, and the second classification model `C2` uses the predictor data `X2`. By default, the software conducts the mid-*p*-value McNemar test to compare the accuracies.
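As a sketch of the matrix syntax, the following compares two models that were trained on different predictor columns. It assumes the `ionosphere` sample data set; the column subsets `cols1` and `cols2` and the model types are hypothetical choices made only for illustration.

```matlab
% Sketch: matrix inputs, with each model using its own predictor columns.
load ionosphere                        % X (351x34 numeric), Y (351x1 cell: 'b' or 'g')

rng(1);                                % reproducible partition
cvp      = cvpartition(Y, 'Holdout', 0.3);
idxTrain = training(cvp);
idxTest  = test(cvp);

cols1 = 1:34;                          % hypothetical: all predictors for C1
cols2 = 3:10;                          % hypothetical: a subset for C2

C1 = fitcsvm(X(idxTrain, cols1), Y(idxTrain));
C2 = fitctree(X(idxTrain, cols2), Y(idxTrain));

% X1 and X2 must hold the same test observations (rows), in the same order,
% but each carries the columns that its model was trained on.
h = compareHoldout(C1, C2, X(idxTest, cols1), X(idxTest, cols2), Y(idxTest));
```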

`h = compareHoldout(___,Name,Value)` specifies options using one or more name-value pair arguments in addition to the input argument combinations in previous syntaxes. For example, you can specify the type of alternative hypothesis, specify the type of test, and supply a cost matrix.
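A minimal sketch of the name-value syntax combined with the four-output form is shown here. It assumes the `ionosphere` sample data and two illustrative model types. The `'Alternative'`, `'Test'`, and `'Alpha'` values are example settings, and the interpretation of the extra outputs (`p` as the *p*-value, `e1` and `e2` as the classification losses of `C1` and `C2`) follows the function's reference page.

```matlab
% Sketch: one-sided asymptotic McNemar test at a 1% significance level.
load ionosphere                        % X (351x34 numeric), Y (351x1 cell)

rng(1);                                % reproducible partition
cvp   = cvpartition(Y, 'Holdout', 0.3);
XTest = X(test(cvp), :);
YTest = Y(test(cvp));

C1 = fitcsvm(X(training(cvp), :), Y(training(cvp)));    % illustrative model
C2 = fitctree(X(training(cvp), :), Y(training(cvp)));   % illustrative model

% 'greater' tests whether C1 is more accurate than C2.
[h, p, e1, e2] = compareHoldout(C1, C2, XTest, XTest, YTest, ...
    'Alternative', 'greater', ...
    'Test', 'asymptotic', ...
    'Alpha', 0.01);
```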

- `compareHoldout` does not compare ECOC models composed of linear or kernel classification models (that is, `ClassificationLinear` or `ClassificationKernel` model objects). To compare `ClassificationECOC` models composed of linear or kernel classification models, use `testcholdout` instead.
- Similarly, `compareHoldout` does not compare `ClassificationLinear` or `ClassificationKernel` model objects. To compare these models, use `testcholdout` instead.

One way to perform cost-insensitive feature selection is:

1. Train the first classification model (`C1`) using the full predictor set.
2. Train the second classification model (`C2`) using the reduced predictor set.
3. Specify `X1` as the full test-set predictor data and `X2` as the reduced test-set predictor data.
4. Enter `compareHoldout(C1,C2,X1,X2,Y,'Alternative','less')`. If `compareHoldout` returns `1`, then there is enough evidence to suggest that the classification model that uses fewer predictors performs better than the model that uses the full predictor set.
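The four steps above can be sketched as follows. The `ionosphere` sample data, the decision-tree model type, and the reduced predictor set `reducedCols` are assumptions chosen for illustration; in practice the reduced set would come from your own feature selection procedure.

```matlab
% Sketch of the cost-insensitive feature selection workflow.
load ionosphere                        % X (351x34 numeric), Y (351x1 cell)

rng(1);                                % reproducible partition
cvp      = cvpartition(Y, 'Holdout', 0.3);
idxTrain = training(cvp);
idxTest  = test(cvp);

reducedCols = 1:10;                    % hypothetical reduced predictor set

% Steps 1 and 2: train on the full and on the reduced predictor sets.
C1 = fitctree(X(idxTrain, :), Y(idxTrain));
C2 = fitctree(X(idxTrain, reducedCols), Y(idxTrain));

% Step 3: matching test-set predictor data for each model.
X1 = X(idxTest, :);
X2 = X(idxTest, reducedCols);

% Step 4: one-sided test. h = 1 suggests that C1 (full set) is less
% accurate than C2 (reduced set).
h = compareHoldout(C1, C2, X1, X2, Y(idxTest), 'Alternative', 'less');
```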

Alternatively, you can assess whether there is a significant difference between the accuracies of the two models. To perform this assessment, remove the `'Alternative','less'` specification in step 4. `compareHoldout` conducts a two-sided test, and `h = 0` indicates that there is not enough evidence to suggest a difference in the accuracy of the two models.

Cost-sensitive tests perform numerical optimization, which requires additional computational resources. The likelihood ratio test conducts numerical optimization indirectly by finding the root of a Lagrange multiplier in an interval. For some data sets, if the root lies close to the boundaries of the interval, then the method can fail. Therefore, if you have an Optimization Toolbox license, consider conducting the cost-sensitive chi-square test instead. For more details, see `CostTest` and Cost-Sensitive Testing.
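A cost-sensitive comparison might be set up as in the sketch below. The cost values are hypothetical, the `ionosphere` data and model types are illustrative, and the structure form of `'Cost'` (fields `ClassNames` and `ClassificationCosts`) is assumed from the function's reference page. The sketch sets `'Alternative','unequal'` and `'Test','asymptotic'` explicitly, on the assumption that cost-sensitive testing supports only the two-sided asymptotic setting.

```matlab
% Sketch: cost-sensitive comparison using the likelihood ratio test.
load ionosphere                        % X (351x34 numeric), Y (351x1 cell: 'b' or 'g')

rng(1);                                % reproducible partition
cvp   = cvpartition(Y, 'Holdout', 0.3);
XTest = X(test(cvp), :);
YTest = Y(test(cvp));

C1 = fitcsvm(X(training(cvp), :), Y(training(cvp)));    % illustrative model
C2 = fitctree(X(training(cvp), :), Y(training(cvp)));   % illustrative model

% Hypothetical costs: rows are true classes, columns are predicted classes.
% Labeling a true 'b' as 'g' costs twice as much as the reverse error.
S.ClassNames          = {'b'; 'g'};
S.ClassificationCosts = [0 2; 1 0];

[h, p, e1, e2] = compareHoldout(C1, C2, XTest, XTest, YTest, ...
    'Alternative', 'unequal', ...
    'Test', 'asymptotic', ...
    'Cost', S, ...
    'CostTest', 'likelihood');         % 'chisquare' needs Optimization Toolbox
```

If the likelihood ratio test fails on your data, switching `'CostTest'` to `'chisquare'` is the alternative described above.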

To directly compare the accuracy of two sets of class labels in predicting a set of true class labels, use `testcholdout`.
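As a brief sketch of that alternative, the label sets can be produced with `predict` and then passed to `testcholdout`. The `ionosphere` data and the two model types are illustrative assumptions.

```matlab
% Sketch: compare two sets of predicted labels directly with testcholdout.
load ionosphere                        % X (351x34 numeric), Y (351x1 cell)

rng(1);                                % reproducible partition
cvp   = cvpartition(Y, 'Holdout', 0.3);
XTest = X(test(cvp), :);
YTest = Y(test(cvp));

C1 = fitcsvm(X(training(cvp), :), Y(training(cvp)));    % illustrative model
C2 = fitctree(X(training(cvp), :), Y(training(cvp)));   % illustrative model

YHat1 = predict(C1, XTest);            % predicted labels from the first model
YHat2 = predict(C2, XTest);            % predicted labels from the second model

h = testcholdout(YHat1, YHat2, YTest); % test decision on the label sets
```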

[1] Agresti, A. *Categorical Data Analysis*,
2nd Ed. John Wiley & Sons, Inc.: Hoboken, NJ, 2002.

[2] Fagerland, M.W., S. Lydersen, and P. Laake. “The McNemar Test for Binary
Matched-Pairs Data: Mid-p and Asymptotic Are Better Than Exact Conditional.”
*BMC Medical Research Methodology*. Vol. 13, 2013, pp.
1–8.

[3] Lancaster, H.O. “Significance Tests in Discrete
Distributions.” *JASA*, Vol. 56, Number
294, 1961, pp. 223–234.

[4] McNemar, Q. “Note on the Sampling Error of the
Difference Between Correlated Proportions or Percentages.” *Psychometrika*,
Vol. 12, Number 2, 1947, pp. 153–157.

[5] Mosteller, F. “Some Statistical Problems in Measuring
the Subjective Response to Drugs.” *Biometrics*,
Vol. 8, Number 3, 1952, pp. 220–226.