SilhouetteEvaluation
Silhouette criterion clustering evaluation object
Description
SilhouetteEvaluation is an object consisting of sample data (X), clustering data (OptimalY), and silhouette criterion values (CriterionValues) used to evaluate the optimal number of data clusters (OptimalK). The silhouette value for each point (observation in X) is a measure of how similar that point is to other points in the same cluster, compared to points in other clusters. If most points have a high silhouette value, then the clustering solution is appropriate. If many points have a low or negative silhouette value, then the clustering solution might have too many or too few clusters. For more information, see Silhouette Value and Criterion.
Creation
Create a silhouette criterion clustering evaluation object by using the evalclusters function and specifying the criterion as "silhouette".
You can then use compact to create a compact version of the silhouette criterion clustering evaluation object. The function removes the contents of the properties X, OptimalY, and Missing.
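The creation steps above can be sketched as follows. This is a minimal illustration, not a canonical workflow: the synthetic data and the variable names X, evaluation, and compactEvaluation are assumptions chosen for the example.

```matlab
rng("default")                       % For reproducibility
X = [randn(50,2); randn(50,2) + 5];  % Illustrative data: two well-separated groups

% Create the evaluation object with the silhouette criterion
evaluation = evalclusters(X,"kmeans","silhouette","KList",2:4);

% Create a compact version of the object
compactEvaluation = compact(evaluation);

% Per the description above, these properties are expected to be empty
isempty(compactEvaluation.X)
isempty(compactEvaluation.OptimalY)
isempty(compactEvaluation.Missing)
```

The criterion-related properties, such as InspectedK and CriterionValues, remain available on the compact object.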
Properties
Clustering Evaluation Properties
ClusteringFunction
— Clustering algorithm
'kmeans' | 'linkage' | 'gmdistribution' | function handle | []
This property is read-only.
Clustering algorithm used to cluster the sample data, returned as 'kmeans', 'linkage', 'gmdistribution', or a function handle. If you specify the clustering solutions as an input argument to evalclusters when you create the clustering evaluation object, then ClusteringFunction is empty.
Value | Description |
---|---|
'kmeans' | Cluster the data in X using the kmeans clustering algorithm, with EmptyAction set to "singleton" and Replicates set to 5. |
'linkage' | Cluster the data in X using the clusterdata agglomerative clustering algorithm, with Linkage set to "ward". |
'gmdistribution' | Cluster the data in X using the gmdistribution Gaussian mixture distribution algorithm, with SharedCov set to true and Replicates set to 5. |
Data Types: double | char | function_handle
ClusterPriors
— Prior probabilities for each cluster
'empirical' | 'equal'
This property is read-only.
Prior probabilities for each cluster, returned as 'empirical' or 'equal'.
Value | Description |
---|---|
'empirical' | Compute the silhouette criterion value for the clustering solution by averaging the silhouette values for all points. Each cluster contributes to the criterion value proportionally based on its size. |
'equal' | Compute the silhouette criterion value for the clustering solution by averaging the silhouette values for all points within each cluster, and then averaging those values across all clusters. Regardless of its size, each cluster contributes equally to the criterion value. |
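As a hedged sketch, the two settings can be compared by passing the ClusterPriors name-value argument to evalclusters. The data set here is illustrative and deliberately uses unequal group sizes, and the variable names are assumptions.

```matlab
rng("default")                          % For reproducibility
X = [randn(150,2); randn(30,2) + 4];    % Illustrative data with unequal group sizes

evalEmpirical = evalclusters(X,"kmeans","silhouette","KList",2:4, ...
    "ClusterPriors","empirical");
evalEqual = evalclusters(X,"kmeans","silhouette","KList",2:4, ...
    "ClusterPriors","equal");

% One row per setting, one column per inspected number of clusters
[evalEmpirical.CriterionValues; evalEqual.CriterionValues]
```

The two rows can differ when cluster sizes are unbalanced, because 'equal' gives a small cluster the same weight as a large one.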
ClusterSilhouettes
— Average silhouette values
cell array of numeric vectors
This property is read-only.
Average silhouette values corresponding to each proposed number of clusters in InspectedK, returned as a cell array of numeric vectors. For each proposed number of clusters k, the vector ClusterSilhouettes{k} contains the average silhouette value for each cluster.
For example, suppose evaluation is a silhouette criterion clustering evaluation object and evaluation.InspectedK is 1:5. Then, evaluation.ClusterSilhouettes{4}(3) is the average silhouette value for the points in the third cluster of the clustering solution with four total clusters.
Data Types: cell
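The indexing in that example can be tried with a short sketch. The data is illustrative, and the names clusterMeans and thirdClusterMean are assumptions.

```matlab
rng("default")                                        % For reproducibility
X = [randn(50,2); randn(50,2) + 5; randn(50,2) - 5];  % Illustrative data
evaluation = evalclusters(X,"kmeans","silhouette","KList",1:5);

% Average silhouette value of each cluster in the four-cluster solution
clusterMeans = evaluation.ClusterSilhouettes{4};

% Average silhouette value for the third cluster of that solution
thirdClusterMean = clusterMeans(3);
```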
CriterionName
— Name of criterion
'Silhouette'
This property is read-only.
Name of the criterion used for clustering evaluation, returned as 'Silhouette'.
CriterionValues
— Criterion values
numeric vector
This property is read-only.
Criterion values, returned as a numeric vector. Each value corresponds to a proposed number of clusters in InspectedK.
Data Types: double
Distance
— Distance metric
'sqEuclidean' | 'Euclidean' | 'cityblock' | function handle | numeric vector | ...
This property is read-only.
Distance metric used for clustering data and computing the criterion values, returned as one of the values in this table, a function handle, or a numeric vector returned by the function pdist.
Value | Description |
---|---|
'sqEuclidean' | Squared Euclidean distance |
'Euclidean' | Euclidean distance |
'cityblock' | Sum of absolute differences |
'cosine' | One minus the cosine of the included angle between points (treated as vectors) |
'correlation' | One minus the sample correlation between points (treated as sequences of values) |
'Hamming' | Percentage of coordinates that differ |
'Jaccard' | Percentage of nonzero coordinates that differ |
Data Types: single | double | char | function_handle
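As a minimal sketch, a metric other than the default can be requested through the Distance name-value argument of evalclusters. The data and variable names are illustrative assumptions.

```matlab
rng("default")                        % For reproducibility
X = [randn(50,2); randn(50,2) + 5];   % Illustrative data

% Cluster with kmeans and compute silhouette values using city block distance
evaluation = evalclusters(X,"kmeans","silhouette","KList",2:4, ...
    "Distance","cityblock");

evaluation.Distance                   % Metric recorded on the object
```

The chosen metric must also be supported by the clustering algorithm; kmeans accepts only a subset of the metrics in the table.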
InspectedK
— List of number of proposed clusters
positive integer vector
This property is read-only.
List of the number of proposed clusters for which to compute criterion values, returned as a positive integer vector.
Data Types: double
OptimalK
— Optimal number of clusters
positive integer scalar
This property is read-only.
Optimal number of clusters, returned as a positive integer scalar.
Data Types: double
OptimalY
— Optimal clustering solution
positive integer column vector | []
This property is read-only.
Optimal clustering solution corresponding to OptimalK, returned as a positive integer column vector. Each row of OptimalY represents the cluster index of the corresponding observation (or row) in X. If you specify the clustering solutions as an input argument to evalclusters when you create the clustering evaluation object, or if the clustering evaluation object is compact (see compact), then OptimalY is empty.
Data Types: double
Sample Data Properties
Missing
— Excluded data
logical column vector | []
This property is read-only.
Excluded data, returned as a logical column vector. If an element of Missing is true, then the corresponding observation (or row) in the data matrix X is not used in the clustering solutions. If the clustering evaluation object is compact (see compact), then Missing is empty.
Data Types: double | logical
NumObservations
— Number of observations
positive integer scalar
This property is read-only.
Number of observations in the data matrix X, ignoring observations with missing (NaN) values, returned as a positive integer scalar.
Data Types: double
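The relationship between Missing and NumObservations can be illustrated with a small sketch. The data is synthetic, and the NaN is inserted deliberately for the example.

```matlab
rng("default")                        % For reproducibility
X = [randn(20,2); randn(20,2) + 5];   % Illustrative data with 40 rows
X(3,1) = NaN;                         % Introduce one missing value in row 3

evaluation = evalclusters(X,"kmeans","silhouette","KList",2:3);

find(evaluation.Missing)              % Rows excluded from the clustering solutions
evaluation.NumObservations            % Count of observations that were used
```

Based on the property descriptions above, row 3 should be flagged in Missing and should not be counted in NumObservations.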
X
— Data used for clustering
numeric matrix | []
This property is read-only.
Data used for clustering, returned as a numeric matrix. Rows correspond to observations, and columns correspond to variables. If the clustering evaluation object is compact (see compact), then X is empty.
Data Types: single | double
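Object Functions
Function | Description |
---|---|
addK | Evaluate additional numbers of clusters |
compact | Compact clustering evaluation object |
plot | Plot clustering evaluation object criterion values |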
Examples
Evaluate Clustering Solution Using Silhouette Criterion
Evaluate the optimal number of clusters using the silhouette clustering evaluation criterion.
Generate sample data containing random numbers from three multivariate distributions with different parameter values.
rng("default") % For reproducibility
n = 200;
mu1 = [2 2];
sigma1 = [0.9 -0.0255; -0.0255 0.9];
mu2 = [5 5];
sigma2 = [0.5 0; 0 0.3];
mu3 = [-2 -2];
sigma3 = [1 0; 0 0.9];
X = [mvnrnd(mu1,sigma1,n); ...
     mvnrnd(mu2,sigma2,n); ...
     mvnrnd(mu3,sigma3,n)];
Evaluate the optimal number of clusters using the silhouette criterion. Cluster the data using kmeans.
evaluation = evalclusters(X,"kmeans","silhouette","KList",1:6)
evaluation =
  SilhouetteEvaluation with properties:

    NumObservations: 600
         InspectedK: [1 2 3 4 5 6]
    CriterionValues: [NaN 0.8055 0.8551 0.7155 0.6071 0.6232]
           OptimalK: 3

The OptimalK value indicates that, based on the silhouette criterion, the optimal number of clusters is three.
Plot the silhouette criterion values for each number of clusters tested.
plot(evaluation)
The plot shows that the highest silhouette value occurs at three clusters, suggesting that the optimal number of clusters is three.
Create a grouped scatter plot to visually examine the suggested clusters.
clusters = evaluation.OptimalY;
gscatter(X(:,1),X(:,2),clusters,[],"xod")
The plot shows three distinct clusters within the data: cluster 1 in the lower-left corner, cluster 2 in the upper-right corner, and cluster 3 near the center of the plot.
More About
Silhouette Value and Criterion
The silhouette value for each point is a measure of how similar that point is to other points in the same cluster, compared to points in other clusters.
The silhouette value $s_i$ for the $i$th point is defined as
$$ s_i = \frac{b_i - a_i}{\max(a_i, b_i)} $$
where $a_i$ is the average distance from the $i$th point to the other points in the same cluster as $i$, and $b_i$ is the minimum average distance from the $i$th point to points in a different cluster, minimized over the clusters. If the $i$th point is the only point in its cluster, then the silhouette value $s_i$ is set to 1.
The silhouette values range from –1 to 1. A high silhouette value indicates that the point is well matched to its own cluster, and poorly matched to other clusters. If most points have a high silhouette value, then the clustering solution is appropriate. If many points have a low or negative silhouette value, then the clustering solution might have too many or too few clusters. You can use silhouette values as a clustering evaluation criterion with any distance metric.
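To make the definition concrete, the following sketch computes the silhouette value of a single point by hand, as (b - a)/max(a,b), and compares it with the output of the silhouette function. The data, the fixed cluster assignment idx, and the variable names are illustrative assumptions. The sketch does not handle the single-point-cluster case described above.

```matlab
rng("default")                          % For reproducibility
X = [randn(10,2); randn(10,2) + 4];     % Illustrative data
idx = [ones(10,1); 2*ones(10,1)];       % Illustrative cluster assignment
D = squareform(pdist(X,"euclidean"));   % Pairwise Euclidean distances

i = 1;                                  % Point to examine

% a: average distance to the other points in the same cluster
sameCluster = (idx == idx(i));
sameCluster(i) = false;                 % Exclude the point itself
a = mean(D(i,sameCluster));

% b: minimum, over the other clusters, of the average distance to that cluster
otherClusters = setdiff(unique(idx),idx(i));
b = min(arrayfun(@(c) mean(D(i,idx == c)),otherClusters));

s_i = (b - a)/max(a,b);                 % Silhouette value of point i

% Compare with the toolbox function using the same metric
s = silhouette(X,idx,"Euclidean");
abs(s_i - s(i))                         % Difference should be near zero
```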
The ClusterPriors value determines the silhouette criterion computation. If the value is 'empirical', then the software computes the silhouette criterion value for a clustering solution by averaging the silhouette values for all points. Each cluster contributes to the criterion value proportionally based on its size. If the ClusterPriors value is 'equal', then the software computes the silhouette criterion value for a clustering solution by averaging the silhouette values for all points within each cluster, and then averaging those values across all clusters. Regardless of its size, each cluster contributes equally to the criterion value. The optimal number of clusters corresponds to the solution with the highest silhouette criterion value.
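The two averaging schemes can be sketched directly from per-point silhouette values. This is an illustration of the arithmetic described above, not the software's internal implementation; the data and the names criterionEmpirical, clusterMeans, and criterionEqual are assumptions.

```matlab
rng("default")                          % For reproducibility
X = [randn(150,2); randn(30,2) + 4];    % Illustrative data with unequal group sizes
idx = kmeans(X,2,"Replicates",5);       % One candidate clustering solution
s = silhouette(X,idx);                  % Per-point silhouette values

% 'empirical': average the silhouette values over all points
criterionEmpirical = mean(s);

% 'equal': average within each cluster, then average across clusters
clusterMeans = accumarray(idx,s,[],@mean);
criterionEqual = mean(clusterMeans);
```

The 'empirical' value is the size-weighted average of clusterMeans, whereas the 'equal' value is their unweighted average, so the two coincide only when the clusters have equal sizes or equal average silhouette values.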
Version History
Introduced in R2013b