## Shapley Values for Machine Learning Model

This topic defines Shapley values, describes two available algorithms in the Statistics and Machine Learning Toolbox™ feature that computes Shapley values, provides examples for each, and shows how to reduce the computational cost.

### What Is a Shapley Value?

In game theory, the Shapley value of a player is the average marginal contribution of the player in a cooperative game. That is, Shapley values are fair allocations, to individual players, of the total gain generated from a cooperative game. In the context of machine learning prediction, the Shapley value of a feature for a query point explains the contribution of the feature to a prediction (response for regression or score of each class for classification) at the specified query point. The Shapley value corresponds to the deviation of the prediction for the query point from the average prediction, due to the feature. For each query point, the sum of the Shapley values for all features corresponds to the total deviation of the prediction from the average.

The Shapley value of the ith feature for the query point x is defined by the value function v:

 ${\phi }_{i}\left({v}_{x}\right)=\frac{1}{M}\sum _{S\subseteq ℳ\\left\{i\right\}}\frac{{v}_{x}\left(S\cup \left\{i\right\}\right)-{v}_{x}\left(S\right)}{\frac{\left(M-1\right)!}{|S|!\left(M-|S|-1\right)!}}$ (1)
• M is the number of all features.

• $ℳ$ is the set of all features.

• |S| is the cardinality of the set S, or the number of elements in the set S.

• vx(S) is the value function of the features in a set S for the query point x. The value of the function indicates the expected contribution of the features in S to the prediction for the query point x.

### Shapley Value with MATLAB

You can compute Shapley values for a machine learning model by using a `shapley` object. Use the values to interpret the contributions of individual features in the model to the prediction for a query point. There are two ways to compute Shapley values:

• Create a `shapley` object for a machine learning model with a specified query point by using the `shapley` function. The function computes the Shapley values of all features in the model for the query point.

• Create a `shapley` object for a machine learning model by using the `shapley` function and, then compute the Shapley values for a specified query point by using the `fit` function.

### Algorithms

`shapley` offers two algorithms: kernelSHAP [1], which uses interventional distributions for the value function, and the extension to kernelSHAP [2], which uses conditional distributions for the value function. You can specify the algorithm to use by setting the `'Method'` name-value argument of the `shapley` function or the `fit` function.

The difference between the two algorithms is the definition of the value function. Both algorithms define the value function such that the sum of the Shapley values of a query point over all features corresponds to the total deviation of the prediction for the query point from the average.

`$\sum _{i=1}^{M}{\phi }_{i}\left({v}_{x}\right)=f\left(x\right)-E\left[f\left(x\right)\right].$`

Therefore, the value function vx(S) must correspond to the expected contribution of the features in S to the prediction (f) for the query point x. The two algorithms compute the expected contribution by using artificial samples created from the specified data (X). You must provide `X` through the machine learning model input or a separate data input argument when you create a `shapley` object. In the artificial samples, the values for the features in S come from the query point. For the rest of the features (features in Sc, the complement of S), the kernelSHAP algorithm generates samples using interventional distributions, whereas the extension to the kernelSHAP algorithm generates samples using conditional distributions.

#### KernelSHAP (`'Method','interventional-kernel'`)

`shapley` uses the kernelSHAP algorithm by default.

The kernelSHAP algorithm defines the value function of the features in S at the query point x as the expected prediction with respect to the interventional distribution D, which is the joint distribution of the features in Sc:

`${v}_{x}\left(S\right)={E}_{D}\left[f\left({x}_{S},{X}_{{S}^{c}}\right)\right],$`

where xS is the query point value for the features in S, and XSc are the features in Sc.

To evaluate the value function vx(S) at the query point x, with the assumption that the features are not highly correlated, `shapley` uses the values in the data X as samples of the interventional distribution D for the features in Sc:

`${v}_{x}\left(S\right)={E}_{D}\left[f\left({x}_{S},{X}_{{S}^{c}}\right)\right]\approx \frac{1}{N}\sum _{j=1}^{N}f\left({x}_{S},{\left({X}_{{S}^{c}}\right)}_{j}\right),$`

where N is the number of observations, and (XSc)j contains the values of the features in Sc for the jth observation.

For example, suppose you have three features in X and four observations: (x11,x12,x13), (x21,x22,x23), (x31,x32,x33), and (x41,x42,x43). Assume that S includes the first feature, and Sc includes the rest. In this case, the value function of the first feature evaluated at the query point (x41,x42,x43) is

`${v}_{x}\left(S\right)=\frac{1}{4}\left[f\left({x}_{41},{x}_{12},{x}_{13}\right)+f\left({x}_{41},{x}_{22},{x}_{23}\right)+f\left({x}_{41},{x}_{32},{x}_{33}\right)+f\left({x}_{41},{x}_{42},{x}_{43}\right)\right].$`

The kernelSHAP algorithm is computationally less expensive than the extension to the kernelSHAP algorithm, supports ordered categorical predictors, and can handle missing values in X. However, the algorithm requires the feature independence assumption and uses out-of-distribution samples [3]. The artificial samples created with a mix of the query point and the data X can contain unrealistic observations. For example, (x41,x12,x13) might be a sample that does not occur in the full joint distribution of the three features.

#### Extension to KernelSHAP (`'Method','conditional-kernel'`)

Specify `'Method','conditional-kernel'` to use the extension to the kernelSHAP algorithm.

the extension to the kernelSHAP algorithm defines the value function of the features in S at the query point x using the conditional distribution of XSc, given that XS has the query point values:

`${v}_{x}\left(S\right)={E}_{{X}_{{S}^{c}}|{X}_{S}={x}_{S}}\left[f\left({x}_{S},{X}_{{S}^{c}}\right)\right].$`

To evaluate the value function vx(S) at the query point x, `shapley` uses nearest neighbors of the query point, which correspond to 10% of the observations in the data X. This approach uses more realistic samples than the kernelSHAP algorithm and does not require the feature independence assumption. However, this algorithm is computationally more expensive, does not support ordered categorical predictors, and cannot handle `NaN`s in continuous features. Also, the algorithm might assign a nonzero Shapley value to a dummy feature that does not contribute to the prediction, if the dummy feature is correlated with an important feature [3].

### Specify Computation Algorithm

This example trains a linear classification model and computes Shapley values using both the kernelSHAP algorithm (`'Method','interventional-kernel'`) and the extension to the kernelSHAP algorithm (`'Method','conditional-kernel'`).

Train Linear Classification Model

Load the `ionosphere` data set. This data set has 34 predictors and 351 binary responses for radar returns, either bad (`'b'`) or good (`'g'`).

`load ionosphere`

Train a linear classification model. Specify the objective function minimization technique (`'Solver'` name-value argument) as the limited-memory Broyden-Fletcher-Goldfarb-Shanno quasi-Newton algorithm (`'lbfgs'`) for better accuracy of linear coefficients.

`Mdl = fitclinear(X,Y,'Solver','lbfgs')`
```Mdl = ClassificationLinear ResponseName: 'Y' ClassNames: {'b' 'g'} ScoreTransform: 'none' Beta: [34x1 double] Bias: -3.7100 Lambda: 0.0028 Learner: 'svm' Properties, Methods ```

KernelSHAP

Compute the Shapley values for the first observation using the kernelSHAP algorithm, which uses the interventional distribution for the value function evaluation. You do not have to specify the `'Method'` value because `'interventional-kernel'` is the default.

```queryPoint = X(1,:); explainer1 = shapley(Mdl,X,'QueryPoint',queryPoint); ```

For a classification model, `shapley` computes Shapley values using the predicted class score for each class. Plot the Shapley values for the predicted class by using the `plot` function.

`plot(explainer1)`

The horizontal bar graph shows the Shapley values for the 10 most important variables, sorted by their absolute values. Each value explains the deviation of the score for the query point from the average score of the predicted class, due to the corresponding variable.

For a linear model where you assume features are independent from one another, you can compute the interventional Shapley values for the positive class (or the second class in `Mdl.ClassNames`, `'g'`) from the estimated coefficients (`Mdl.Beta`) [1].

`linearSHAPValues = (Mdl.Beta'.*(queryPoint-mean(X)))';`

Create a table containing the Shapley values computed from the kernelSHAP algorithm and the values from the coefficients.

```t = table(explainer1.ShapleyValues.Predictor,explainer1.ShapleyValues.g,linearSHAPValues, ... 'VariableNames',{'Predictor','KernelSHAP Value','LinearSHAP Value'})```
```t=34×3 table Predictor KernelSHAP Value LinearSHAP Value _________ ________________ ________________ "x1" 0.28789 0.28789 "x2" 2.315e-15 0 "x3" 0.20822 0.20822 "x4" -0.01998 -0.01998 "x5" 0.20872 0.20872 "x6" -0.076991 -0.076991 "x7" 0.19188 0.19188 "x8" -0.64386 -0.64386 "x9" 0.42348 0.42348 "x10" -0.030049 -0.030049 "x11" -0.23132 -0.23132 "x12" 0.1422 0.1422 "x13" -0.045973 -0.045973 "x14" -0.29022 -0.29022 "x15" 0.21051 0.21051 "x16" 0.13382 0.13382 ⋮ ```

Extension to KernelSHAP

Compute the Shapley values for the first observation using the extension to the kernelSHAP algorithm, which uses the conditional distribution for the value function evaluation.

`explainer2 = shapley(Mdl,X,'QueryPoint',queryPoint,'Method','conditional-kernel');`

Plot the Shapley values.

`plot(explainer2)`

The two algorithms identify different sets for the 10 most important variables. Only the two variables `x8` and `x22` are common to both sets.

### Computational Cost

The computational cost for Shapley values increases if the number of observations or features is large.

#### Large Number of Observations

Computing the value function (v) can be computationally expensive if you have a large number of observations, for example, more than 1000. For faster computation, use a smaller sample of the observations when you create a `shapley` object, or run in parallel by specifying `'UseParallel'` as `true` when you compute the values using the `shapley` or `fit` function. Computing in parallel requires Parallel Computing Toolbox™.

#### Large Number of Features

Computing the summand in Equation 1 for all available subsets S can be computationally expensive when M (the number of features) is large. The total number of subsets to consider is 2M. Instead of computing the summand for all subsets, you can specify the maximum number of subsets by using the `'MaxNumSubsets'` name-value argument. `shapley` chooses subsets to use based on their weight values. The weight of a subset is proportional to 1/(denominator of the summand), which corresponds to 1 over the binomial coefficient: $1/\left(\begin{array}{c}M-1\\ |S|\end{array}\right)$. Therefore, a subset with a high or low value of cardinality has a large weight value. `shapley` includes the subsets with the highest weight first, and then includes the other subsets in descending order based on their weight values.

### Reduce Computational Cost

This example shows how to reduce the computational cost for Shapley values when you have a large number of both observations and features.

Load the sample data set `NYCHousing2015`.

`load NYCHousing2015`

The data set includes 55,246 observations of 10 variables with information on the sales of properties in New York City in 2015. This example uses these variables to analyze the sale prices (`SALEPRICE`).

Preprocess the data set. Convert the `datetime` array (`SALEDATE`) to the month numbers.

`NYCHousing2015.SALEDATE = month(NYCHousing2015.SALEDATE);`

Train a regression ensemble.

`Mdl = fitrensemble(NYCHousing2015,'SALEPRICE');`

Compute the Shapley values of all predictor variables for the first observation. Measure the time required for the computation by using `tic` and `toc`.

```tic explainer1 = shapley(Mdl,'QueryPoint',NYCHousing2015(1,:));```
```Warning: Computation can be slow because the predictor data has over 1000 observations. Use a smaller sample of the training set or specify 'UseParallel' as true for faster computation. ```
`toc`
```Elapsed time is 504.276830 seconds. ```

As the warning message indicates, the computation can be slow because the predictor data has over 1000 observations.

`shapley` provides several options to reduce the computational cost when you have a large number of observations or features.

• Large number of observations — Use a smaller sample of the training data and run in parallel by specifying `'UseParallel'` as `true`.

• Large number of features — Specify the `'MaxNumSubsets'` name-value argument to limit the number of subsets included in the computation.

Compute the Shapley values again using a smaller sample of the training data and the parallel computing option. Also, specify the maximum number of subsets as `2^5`.

```NumSamples = 5e2; Tbl = datasample(NYCHousing2015,NumSamples,'Replace',false); tic explainer2 = shapley(Mdl,Tbl,'QueryPoint',NYCHousing2015(1,:), ... 'UseParallel',true,'MaxNumSubsets',2^5);```
```Starting parallel pool (parpool) using the 'local' profile ... Connected to the parallel pool (number of workers: 6). ```
`toc`
```Elapsed time is 71.145339 seconds. ```

Specifying the additional options reduces the computation time.

## References

[1] Lundberg, Scott M., and S. Lee. "A Unified Approach to Interpreting Model Predictions." Advances in Neural Information Processing Systems 30 (2017): 4765–774.

[2] Aas, Kjersti, Martin. Jullum, and Anders Løland. "Explaining Individual Predictions When Features Are Dependent: More Accurate Approximations to Shapley Values." arXiv:1903.10464 (2019).

[3] Kumar, I. Elizabeth, Suresh Venkatasubramanian, Carlos Scheidegger, and Sorelle Friedler. "Problems with Shapley-Value-Based Explanations as Feature Importance Measures." arXiv:2002.11097 (2020).