evalclusters

Evaluate clustering solutions

Syntax

eva = evalclusters(x,clust,criterion)

eva = evalclusters(x,clust,criterion,Name,Value)

Description

eva = evalclusters(x,clust,criterion) creates a clustering evaluation object containing data used to evaluate the optimal number of data clusters.

eva = evalclusters(x,clust,criterion,Name,Value) creates a clustering evaluation object using additional options specified by one or more name-value pair arguments.

Examples

Evaluate Clustering Solution Using Calinski-Harabasz Criterion

Evaluate the optimal number of clusters using the Calinski-Harabasz clustering evaluation criterion.

Load the sample data.

load fisheriris

The data contains length and width measurements from the sepals and petals of three species of iris flowers.

Evaluate the optimal number of clusters using the Calinski-Harabasz criterion. Cluster the data using kmeans.

rng('default') % For reproducibility
eva = evalclusters(meas,'kmeans','CalinskiHarabasz','KList',1:6)

eva = 
  CalinskiHarabaszEvaluation with properties:

    NumObservations: 150
         InspectedK: [1 2 3 4 5 6]
    CriterionValues: [NaN 513.9245 561.6278 530.4871 456.1279 469.5068]
           OptimalK: 3


The OptimalK value indicates that, based on the Calinski-Harabasz criterion, the optimal number of clusters is three.

Evaluate a Matrix of Clustering Solutions

Use an input matrix of proposed clustering solutions to evaluate the optimal number of clusters.

Load the sample data.

load fisheriris;

The data contains length and width measurements from the sepals and petals of three species of iris flowers.

Use kmeans to create an input matrix of proposed clustering solutions for the flower measurements in meas, using 1, 2, 3, 4, 5, and 6 clusters.

clust = zeros(size(meas,1),6);
for i = 1:6
    % Column i holds the cluster indices of the solution with i clusters
    clust(:,i) = kmeans(meas,i,'EmptyAction','singleton', ...
        'Replicates',5);
end

Each row of clust corresponds to one observation (one flower). Each of the six columns corresponds to a clustering solution containing 1 to 6 clusters.

Evaluate the optimal number of clusters using the Calinski-Harabasz criterion.

eva = evalclusters(meas,clust,'CalinskiHarabasz')

eva = 
  CalinskiHarabaszEvaluation with properties:

    NumObservations: 150
         InspectedK: [1 2 3 4 5 6]
    CriterionValues: [NaN 513.9245 561.6278 530.4871 456.1279 469.5068]
           OptimalK: 3


The OptimalK value indicates that, based on the Calinski-Harabasz criterion, the optimal number of clusters is three.

Specify Clustering Algorithm with a Function Handle

Use a function handle to specify the clustering algorithm, then evaluate the optimal number of clusters.

Load the sample data.

load fisheriris;

The data contains length and width measurements from the sepals and petals of three species of iris flowers.

Use a function handle to specify the clustering algorithm.

myfunc = @(X,K)(kmeans(X,K,EmptyAction="singleton",Replicates=5));

Evaluate the optimal number of clusters for the measurement data using the Calinski-Harabasz criterion.

eva = evalclusters(meas,myfunc,'CalinskiHarabasz',KList=1:6)

eva = 
  CalinskiHarabaszEvaluation with properties:

    NumObservations: 150
         InspectedK: [1 2 3 4 5 6]
    CriterionValues: [NaN 513.9245 561.6278 530.4871 456.1279 469.5068]
           OptimalK: 3


The OptimalK value indicates that, based on the Calinski-Harabasz criterion, the optimal number of clusters is three.

Input Arguments

`x` — Input data
matrix

Input data, specified as an N-by-P matrix. N is the number of observations, and P is the number of variables.

Data Types: single | double

`clust` — Clustering algorithm or solutions
`'kmeans'` | `'linkage'` | `'gmdistribution'` | matrix of clustering solutions | function handle

Clustering algorithm, specified as one of the following.

`'kmeans'`	Cluster the data in `x` using the `kmeans` clustering algorithm, with `'EmptyAction'` set to `'singleton'` and `'Replicates'` set to `5`.
`'linkage'`	Cluster the data in `x` using the `clusterdata` agglomerative clustering algorithm, with `'Linkage'` set to `'ward'`.
`'gmdistribution'`	Cluster the data in `x` using the `gmdistribution` Gaussian mixture distribution algorithm, with `'SharedCov'` set to `true` and `'Replicates'` set to `5`.

If criterion is 'CalinskiHarabasz', 'DaviesBouldin', or 'silhouette', you can specify a clustering algorithm using a function handle. The function must be of the form C = clustfun(DATA,K), where DATA is the data to be clustered, and K is the number of clusters. The output of clustfun must be one of the following:

A vector of integers representing the cluster index for each observation in DATA. There must be K unique values in this vector.
A numeric n-by-K matrix of scores for n observations and K classes. In this case, the cluster index for each observation is determined by taking the largest score value in each row.
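
As a rough illustration of the second output form, the following sketch converts a score matrix into cluster indices by taking the column with the largest score in each row. The score values and variable names are made up for this example. evalclusters performs the conversion internally, and how it resolves ties is not covered here.

```matlab
% Hypothetical 4-by-3 score matrix: 4 observations, K = 3 clusters.
scores = [0.7 0.2 0.1;
          0.1 0.8 0.1;
          0.2 0.3 0.5;
          0.6 0.3 0.1];

% The cluster index of each observation is the column that holds
% the largest score in that row.
[~, idx] = max(scores, [], 2);
disp(idx')
```

A function handle that returns such a matrix (for example, posterior probabilities from a mixture model) can therefore be used in place of one that returns the index vector directly.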

If criterion is 'CalinskiHarabasz', 'DaviesBouldin', or 'silhouette', you can also specify clust as an n-by-K matrix containing the proposed clustering solutions. n is the number of observations in the sample data, and K is the number of proposed clustering solutions. Column j contains the cluster indices for each of the n points in the jth clustering solution.

Data Types: single | double | char | string | function_handle

`criterion` — Clustering evaluation criterion
`'silhouette'` | `'DaviesBouldin'` | `'CalinskiHarabasz'` | `'gap'`

Clustering evaluation criterion, specified as one of the following.

`'silhouette'`	Create a `SilhouetteEvaluation` cluster evaluation object containing silhouette values. For more information, see Silhouette Value and Criterion.
`'DaviesBouldin'`	Create a `DaviesBouldinEvaluation` cluster evaluation object containing Davies-Bouldin index values. For more information, see Davies-Bouldin Criterion.
`'CalinskiHarabasz'`	Create a `CalinskiHarabaszEvaluation` clustering evaluation object containing Calinski-Harabasz index values. For more information, see Calinski-Harabasz Criterion.
`'gap'`	Create a `GapEvaluation` cluster evaluation object containing gap criterion values. For more information, see Gap Value.

The best choice of cluster evaluation method depends on the characteristics of your data set. Each method uses a different algorithm to derive an evaluation metric.

The silhouette method (Rousseeuw, 1987) calculates a score ranging from –1 to +1 for each point in a cluster. The score measures how similar a point is to points in its own cluster, when compared to points in other clusters. The method can use any distance metric. You can use the mean silhouette scores as a numerical metric, or visualize the scores of points in each cluster by creating a silhouette plot. Because the method assumes that the clusters have convex shapes, the silhouette metric is not well suited for irregularly shaped clusters.
The Davies-Bouldin method (Davies and Bouldin, 1979) calculates a single index value that is based on a ratio of within-cluster and between-cluster Euclidean distances. Unlike the silhouette metric, this method makes no assumption about the shape of the clusters. A clustering solution generally has a lower (improved) Davies-Bouldin index value when there is a larger separation between the clusters and a smaller dispersion within the clusters.
The Calinski-Harabasz (CH) method (Calinski and Harabasz, 1974) is similar to the Davies-Bouldin method, but instead uses squared Euclidean distances and variance statistics. The CH index is based on the ratio of between-cluster distance variance to within-cluster distance variance, normalized by degrees of freedom. The method makes no assumption about the shape of the clusters. A clustering solution generally has a higher (improved) CH index value when there is a larger separation between the clusters and a smaller dispersion within the clusters.
The gap value method (Tibshirani, Walther, and Hastie, 2001) calculates a single metric value by comparing a clustering solution to a simulated reference distribution that has the characteristics of the input points but lacks any clusters. Given a set of clustering solutions for the same data set, each with a different number of clusters k, the optimal solution has the highest gap value. The method can use any distance metric. The gap value method is more computationally expensive than other clustering evaluation methods, because the clustering algorithm must be applied to the reference data for each proposed clustering solution.
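
To make the CH description concrete, the sketch below computes the index by hand for a tiny, made-up data set, using the standard definition CH = [SSB/(K - 1)] / [SSW/(N - K)], where SSB is the between-cluster sum of squares and SSW is the within-cluster sum of squares. The data and the names X, idx, SSB, and SSW are assumptions chosen for illustration. This is not the evalclusters implementation.

```matlab
% Four 2-D points forming two well-separated clusters (illustrative data).
X   = [0 0; 0 2; 10 0; 10 2];
idx = [1; 1; 2; 2];

N = size(X, 1);          % number of observations
K = max(idx);            % number of clusters
m = mean(X, 1);          % overall centroid

SSB = 0;                 % between-cluster sum of squares
SSW = 0;                 % within-cluster sum of squares
for k = 1:K
    Xk  = X(idx == k, :);
    ck  = mean(Xk, 1);                           % centroid of cluster k
    SSB = SSB + size(Xk, 1) * sum((ck - m).^2);
    SSW = SSW + sum(sum((Xk - ck).^2));          % implicit expansion (R2016b+)
end

CH = (SSB / (K - 1)) / (SSW / (N - K));
fprintf('CH index: %g\n', CH);
```

For these points, SSB is 100 and SSW is 4, so the index is 50. With a single cluster, SSB and K - 1 are both zero, which is consistent with the NaN reported for K = 1 in the examples above.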

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: evalclusters(x,"kmeans","gap",KList=1:5,Distance="cityblock") specifies to test 1, 2, 3, 4, and 5 clusters using the city block distance metric.
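
The two calling styles can be compared side by side. The following sketch assumes the fisheriris sample data and the silhouette criterion; both calls are intended to request the same evaluation, once with Name=Value syntax and once with the quoted, comma-separated form.

```matlab
load fisheriris

% Name=Value syntax (R2021a and later)
rng('default') % For reproducibility
eva1 = evalclusters(meas,'kmeans','silhouette',KList=2:5,Distance="cityblock");

% Comma-separated name-value pairs (works in earlier releases as well)
rng('default')
eva2 = evalclusters(meas,'kmeans','silhouette','KList',2:5,'Distance','cityblock');

disp(eva1.InspectedK)
disp(eva2.InspectedK)
```

Resetting the random number generator before each call keeps the kmeans initializations comparable between the two runs.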

For All Criteria

`KList` — List of number of clusters to evaluate
vector of positive integer values

List of number of clusters to evaluate, specified as a vector of positive integer values. You must specify KList when clust is a clustering algorithm name or a function handle. When criterion is 'gap', clust must be a character vector, a string scalar, or a function handle, and you must specify KList.

Example: KList=1:6

Data Types: single | double

For Silhouette and Gap

`Distance` — Distance metric
`'sqEuclidean'` (default) | `'Euclidean'` | `'cityblock'` | vector | function | ...

Distance metric used for computing the criterion values, specified as the comma-separated pair consisting of 'Distance' and one of the following.

`'sqEuclidean'`	Squared Euclidean distance
`'Euclidean'`	Euclidean distance. This option is not valid for the `kmeans` clustering algorithm.
`'cityblock'`	Sum of absolute differences
`'cosine'`	One minus the cosine of the included angle between points (treated as vectors)
`'correlation'`	One minus the sample correlation between points (treated as sequences of values)
`'Hamming'`	Percentage of coordinates that differ. This option is only valid for the `'silhouette'` criterion.
`'Jaccard'`	Percentage of nonzero coordinates that differ. This option is only valid for the `'silhouette'` criterion.

For detailed information about each distance metric, see pdist.

You can also specify a function for the distance metric using a function handle. The distance function must be of the form d2 = distfun(XI,XJ), where XI is a 1-by-n vector corresponding to a single row of the input matrix X, and XJ is an m₂-by-n matrix corresponding to multiple rows of X. distfun must return an m₂-by-1 vector of distances d2, whose kth element is the distance between XI and XJ(k,:).

Distance only accepts a function handle if the clustering algorithm clust accepts a function handle as the distance metric. For example, the kmeans clustering algorithm does not accept a function handle as the distance metric. Therefore, if you use the kmeans algorithm and then specify a function handle for Distance, the software returns an error.
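
A distance function of the required form might look like the following sketch, which computes the Chebyshev (maximum coordinate difference) distance. The name distfun and the test points are illustrative only, and the metric was picked simply because it is short to write.

```matlab
% XI is 1-by-n, XJ is m2-by-n; the result is an m2-by-1 column vector.
% Relies on implicit expansion (R2016b and later).
distfun = @(XI, XJ) max(abs(XJ - XI), [], 2);

% Quick check on made-up points
XI = [0 0];
XJ = [1 2; 3 -4];
d2 = distfun(XI, XJ);
disp(d2')
```

Because of the restriction described above, such a handle is only usable with a clustering algorithm that itself accepts a function handle as its distance metric.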

If criterion is 'silhouette', you can also specify Distance as the output vector created by the function pdist.
When clust is 'kmeans' or 'gmdistribution', evalclusters uses the distance metric specified for Distance to cluster the data.
If clust is 'linkage', and Distance is either 'sqEuclidean' or 'Euclidean', then the clustering algorithm uses the Euclidean distance and Ward linkage.
If clust is 'linkage' and Distance is any other metric, then the clustering algorithm uses the specified distance metric and average linkage.
In all other cases, the distance metric specified for Distance must match the distance metric used in the clustering algorithm to obtain meaningful results.

Example: 'Distance','Euclidean'

Data Types: single | double | char | string | function_handle

For Silhouette Only

`ClusterPriors` — Prior probabilities for each cluster
`'empirical'` (default) | `'equal'`

Prior probabilities for each cluster, specified as the comma-separated pair consisting of 'ClusterPriors' and one of the following.

`'empirical'`	Compute the overall silhouette value for the clustering solution by averaging the silhouette values for all points. Each cluster contributes to the overall silhouette value proportionally to its size.
`'equal'`	Compute the overall silhouette value for the clustering solution by averaging the silhouette values for all points within each cluster, and then averaging those values across all clusters. Each cluster contributes equally to the overall silhouette value, regardless of its size.

Example: 'ClusterPriors','empirical'
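
The difference between the two options is easiest to see with unbalanced clusters. The sketch below uses made-up per-point silhouette values s and cluster indices idx (both hypothetical) and computes the two averages as described above.

```matlab
% Hypothetical silhouette values: cluster 1 has three points, cluster 2 has one.
s   = [0.9; 0.8; 0.7; 0.1];
idx = [1; 1; 1; 2];

% 'empirical': average over all points, so larger clusters weigh more.
sEmpirical = mean(s);

% 'equal': average within each cluster first, then across clusters.
clusterMeans = accumarray(idx, s, [], @mean);
sEqual = mean(clusterMeans);

fprintf('empirical: %.3f, equal: %.3f\n', sEmpirical, sEqual);
```

Here the single poorly fitting point in cluster 2 pulls the 'equal' value down to 0.45, while the 'empirical' value stays at 0.625.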

For Gap Only

`B` — Number of reference data sets
`100` (default) | positive integer value

Number of reference data sets generated from the reference distribution ReferenceDistribution, specified as the comma-separated pair consisting of 'B' and a positive integer value.

Example: 'B',150

Data Types: single | double

`ReferenceDistribution` — Reference data generation method
`'PCA'` (default) | `'uniform'`

Reference data generation method, specified as the comma-separated pair consisting of 'ReferenceDistribution' and one of the following.

`'PCA'`	Generate reference data from a uniform distribution over a box aligned with the principal components of the data matrix `x`.
`'uniform'`	Generate reference data uniformly over the range of each feature in the data matrix `x`.

Example: 'ReferenceDistribution','uniform'
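
As a rough sketch of what the 'uniform' option describes, the following code draws one reference data set uniformly over the range of each feature. The data matrix and the names lo, hi, and Xref are assumptions for illustration. The PCA-aligned variant and the internals of evalclusters are not shown.

```matlab
rng(0) % For reproducibility

% Illustrative data: 4 observations, 2 features
X = [1 10; 2 20; 3 15; 5 12];

lo = min(X, [], 1);      % per-feature minimum
hi = max(X, [], 1);      % per-feature maximum

% One reference data set with the same size as X, uniform within [lo, hi]
Xref = lo + rand(size(X)) .* (hi - lo);
disp(Xref)
```

The gap criterion repeats this kind of draw B times and clusters each reference set, which is why it costs more than the other criteria.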

`SearchMethod` — Method for selecting optimal number of clusters
`'globalMaxSE'` (default) | `'firstMaxSE'`

Method for selecting the optimal number of clusters, specified as the comma-separated pair consisting of 'SearchMethod' and one of the following.

'globalMaxSE'

Evaluate each proposed number of clusters in KList and select the smallest number of clusters satisfying

$\mathrm{Gap}(K) \geq \mathrm{GAPMAX} - \mathrm{SE}(\mathrm{GAPMAX}),$

where K is the number of clusters, Gap(K) is the gap value for the clustering solution with K clusters, GAPMAX is the largest gap value, and SE(GAPMAX) is the standard error corresponding to the largest gap value.

'firstMaxSE'

Evaluate each proposed number of clusters in KList and select the smallest number of clusters satisfying

$\mathrm{Gap}(K) \geq \mathrm{Gap}(K + 1) - \mathrm{SE}(K + 1),$

where K is the number of clusters, Gap(K) is the gap value for the clustering solution with K clusters, and SE(K + 1) is the standard error of the clustering solution with K + 1 clusters.

Example: 'SearchMethod','globalMaxSE'
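
The two selection rules can give different answers for the same gap curve. The sketch below applies both inequalities to made-up gap values and standard errors (gap, se, and the result names are hypothetical). What evalclusters does when no K satisfies the 'firstMaxSE' condition is not stated above, so the fallback to the last K is an assumption.

```matlab
KList = 1:5;
gap   = [0.10 0.50 0.52 0.60 0.58];   % illustrative gap values
se    = [0.02 0.03 0.04 0.05 0.05];   % illustrative standard errors

% 'globalMaxSE': smallest K with Gap(K) >= GAPMAX - SE(GAPMAX)
[gapMax, iMax] = max(gap);
kGlobal = KList(find(gap >= gapMax - se(iMax), 1, 'first'));

% 'firstMaxSE': smallest K with Gap(K) >= Gap(K+1) - SE(K+1)
iFirst = find(gap(1:end-1) >= gap(2:end) - se(2:end), 1, 'first');
if isempty(iFirst)
    iFirst = numel(KList);   % assumed fallback, not documented above
end
kFirst = KList(iFirst);

fprintf('globalMaxSE: K = %d, firstMaxSE: K = %d\n', kGlobal, kFirst);
```

With these values, the global rule selects K = 4 (the first gap value within one standard error of the maximum), while the first-max rule stops early at K = 2.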

Output Arguments

`eva` — Clustering evaluation data
clustering evaluation object

Clustering evaluation data, returned as a clustering evaluation object.
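
A short sketch of how the returned object is typically inspected, assuming the fisheriris sample data and the same call as in the first example. The variable names k, vals, and idx are chosen for illustration.

```matlab
load fisheriris
rng('default') % For reproducibility
eva = evalclusters(meas,'kmeans','CalinskiHarabasz','KList',1:6);

k    = eva.OptimalK;         % suggested number of clusters
vals = eva.CriterionValues;  % one criterion value per entry of InspectedK
idx  = eva.OptimalY;         % cluster index of each observation for OptimalK

plot(eva)                    % criterion value versus number of clusters
```

Plotting the criterion values is a useful check that the optimum is a clear peak rather than a marginal difference between neighboring values of K.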

References

[1] Rousseeuw, P. J. “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis.” Journal of Computational and Applied Mathematics. Vol. 20, No. 1, 1987, pp. 53–65.

[2] Davies, D. L., and D. W. Bouldin. “A Cluster Separation Measure.” IEEE Transactions on Pattern Analysis and Machine Intelligence. Vol. PAMI-1, No. 2, 1979, pp. 224–227.

[3] Calinski, T., and J. Harabasz. “A dendrite method for cluster analysis.” Communications in Statistics. Vol. 3, No. 1, 1974, pp. 1–27.

[4] Tibshirani, R., G. Walther, and T. Hastie. “Estimating the number of clusters in a data set via the gap statistic.” Journal of the Royal Statistical Society: Series B. Vol. 63, Part 2, 2001, pp. 411–423.

Version History

Introduced in R2013b
