clusterdata

Construct agglomerative clusters from data

Syntax

T = clusterdata(X,Cutoff=cutoff)

T = clusterdata(X,MaxClust=maxclust)

T = clusterdata(___,Name=Value)

Description

T = clusterdata(X,Cutoff=cutoff) returns cluster indices for each observation (row) of an input data matrix X, given a threshold cutoff for cutting an agglomerative hierarchical tree generated by the linkage function from X.

clusterdata supports agglomerative clustering and incorporates the pdist, linkage, and cluster functions, which you can use separately for more detailed analysis. See Algorithm Description for more details.

T = clusterdata(X,MaxClust=maxclust) returns cluster indices for a maximum of maxclust clusters, using "distance" as the default criterion for defining clusters.

example

T = clusterdata(___,Name=Value) specifies options using one or more name-value arguments in addition to any of the input argument combinations in the previous syntaxes. For example, specify clusterdata(X,MaxClust=5,Depth=3) to find a maximum of five clusters by evaluating distance values up to a depth of three below each node.

example

Examples

collapse all

Find Limited Number of Clusters from Sample Data

Open Live Script

Find and visualize a maximum of three clusters in a randomly generated data set.

Create a sample data set consisting of randomly generated data from three standard uniform distributions.

rng(0,"twister");  % For reproducibility
X = [gallery("uniformdata",[10 3],12); ...
    gallery("uniformdata",[10 3],13)+1.2; ...
    gallery("uniformdata",[10 3],14)+2.5];
y = [ones(10,1);2*(ones(10,1));3*(ones(10,1))]; % Actual classes

Create a scatter plot of the data.

scatter3(X(:,1),X(:,2),X(:,3),100,y,"filled")

Figure contains an axes object. The axes object contains an object of type scatter.

Find a maximum of three clusters.

T = clusterdata(X,MaxClust=3);

Plot the data with the resulting cluster assignments.

scatter3(X(:,1),X(:,2),X(:,3),100,T,"filled")
title("Result of Clustering");

Figure contains an axes object. The axes object with title Result of Clustering contains an object of type scatter.

clusterdata identifies the three distinct clusters in the data.

Create and Cluster Hierarchical Tree

Open Live Script

Create a hierarchical cluster tree and find clusters in one step. Visualize the clusters using a 3-D scatter plot.

Create a 20,000-by-3 matrix of sample data generated from the standard uniform distribution.

rng(0,"twister");  % For reproducibility
X = rand(20000,3);

Find a maximum of four clusters in a hierarchical cluster tree created using the ward linkage method. Specify SaveMemory as "on" to construct clusters without computing the distance matrix. Otherwise, you can receive an out-of-memory error if your machine does not have enough memory to hold the distance matrix.

T = clusterdata(X,MaxClust=4,Linkage="ward",SaveMemory="on");

Plot the data with each cluster shown in a different color.

scatter3(X(:,1),X(:,2),X(:,3),10,T)

Figure contains an axes object. The axes object contains an object of type scatter.

clusterdata identifies four clusters in the data.

Input Arguments

collapse all

`X` — Input data
numeric matrix

Input data, specified as a numeric matrix with two or more rows. The rows represent observations, and the columns represent categories or dimensions.

Data Types: single | double

`cutoff` — Cutoff for inconsistent or distance criterion
positive scalar

Cutoff for inconsistent or distance criterion, specified as a positive scalar.

If you specify Criterion="inconsistent" (or do not specify Criterion), the inconsistent values of a node and all its subnodes must be less than cutoff for the clusterdata function to group them into a cluster.

If you specify Criterion="distance", then the function groups all leaves at or below a node into a cluster, provided that the height of the node is less than cutoff.

When you specify cutoff, you cannot specify maxclust.

Example: clusterdata(X,Cutoff=0.2)

Data Types: single | double

`maxclust` — Maximum number of clusters
positive integer

Maximum number of clusters to form, specified as a positive integer.

If you specify Criterion="distance" (or do not specify Criterion), the height of each node in the tree represents the distance between the two subnodes merged at that node. The clusterdata function finds the smallest height at which a horizontal cut through the tree results in maxclust or fewer clusters. See Specify Arbitrary Clusters for more details.

If you specify Criterion="inconsistent", then the function starts with the node that has the highest inconsistency coefficient (or inconsistent value) and groups that node and all its subnodes into a cluster. The function then repeats the process to construct a maximum of maxclust clusters.

When you specify maxclust, you cannot specify cutoff.

Example: clusterdata(X,MaxClust=4)

Data Types: single | double

Name-Value Arguments

collapse all

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: clusterdata(X,MaxClust=3,Linkage="ward") creates a maximum of three clusters of X using the ward linkage method.

`Criterion` — Criterion for defining clusters
`"inconsistent"` | `"distance"`

Criterion for defining clusters in a hierarchical cluster tree, specified as "inconsistent" or "distance".

Example: Criterion="distance"

Data Types: char | string

`Depth` — Depth for computing inconsistent values
numeric scalar

Depth for computing inconsistent values, specified as a numeric scalar. clusterdata evaluates inconsistent values by looking to the specified depth below each node in the hierarchical cluster tree.

Example: Depth=3

Data Types: single | double

`Distance` — Distance metric
`"euclidean"` (default) | `"squaredeuclidean"` | `"seuclidean"` | `"mahalanobis"` | function handle | ...

Distance metric, specified as the comma-separated pair consisting of "Distance" and any distance metric accepted by the pdist function, as described in the following table.

Metric	Description
`"euclidean"`	Euclidean distance (default)
`"squaredeuclidean"`	Squared Euclidean distance. (This option is provided for efficiency only. It does not satisfy the triangle inequality.)
`"seuclidean"`	Standardized Euclidean distance. Each coordinate difference between observations is scaled by dividing by the corresponding element of the standard deviation, `S = std(X,"omitnan")`.
`"mahalanobis"`	Mahalanobis distance using the sample covariance of `X`, `C = cov(X,"omitrows")`
`"cityblock"`	City block distance
`"minkowski"`	Minkowski distance. The default exponent is 2. To use a different exponent `P`, specify `P` after `"minkowski"`, where `P` is a positive scalar value: `"minkowski",P`.
`"chebychev"`	Chebychev distance (maximum coordinate difference)
`"cosine"`	One minus the cosine of the included angle between points (treated as vectors)
`"correlation"`	One minus the sample correlation between points (treated as sequences of values)
`"hamming"`	Hamming distance, which is the percentage of coordinates that differ
`"jaccard"`	One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ
`"spearman"`	One minus the sample Spearman's rank correlation between observations (treated as sequences of values)
`@distfun`	Custom distance function handle. A distance function has the form function D2 = distfun(ZI,ZJ) % calculation of distance ... where `ZI` is a `1`-by-`n` vector containing a single observation. `ZJ` is an `m2`-by-`n` matrix containing multiple observations. `distfun` must accept a matrix `ZJ` with an arbitrary number of observations. `D2` is an `m2`-by-`1` vector of distances, and `D2(k)` is the distance between observations `ZI` and `ZJ(k,:)`. If your data is not sparse, using a built-in `distance` is generally faster than using a function handle.

For more information, see Distance Metrics.

Example: clusterdata(X,"Distance","minkowski",3,MaxClust=4)

Data Types: char | string | function_handle

`Linkage` — Algorithm for computing distance between clusters
`"average"` | `"centroid"` | `"complete"` | `"median"` | `"single"` | `"ward"` | `"weighted"`

Algorithm for computing the distance between clusters, specified as any algorithm accepted by the linkage function, as described in the following table.

Algorithm	Description
`"average"`	Unweighted average distance (UPGMA)
`"centroid"`	Centroid distance (UPGMC), appropriate for Euclidean distances only
`"complete"`	Farthest distance
`"median"`	Weighted center of mass distance (WPGMC), appropriate for Euclidean distances only
`"single"`	Shortest distance
`"ward"`	Inner squared distance (minimum variance algorithm), appropriate for Euclidean distances only
`"weighted"`	Weighted average distance (WPGMA)

For more information, see Linkages.

Example: Linkage="median"

Data Types: char | string

`SaveMemory` — Option for saving memory
`"on"` | `"off"`

Option for saving memory, specified as "on" or "off".

The "on" setting causes clusterdata to construct clusters without computing the distance matrix. The "on" setting applies when both of these conditions are satisfied:

Linkage is "centroid", "median", or "ward".
Distance is "euclidean" (default).

When these two conditions apply, the default value for SaveMemory is "on" if X has 20 columns or fewer, or if the computer does not have enough memory to store the distance matrix. Otherwise, the default value for SaveMemory is "off".

When SaveMemory is "on", the linkage run time is proportional to the number of dimensions (number of columns of X). When SaveMemory is "off", the linkage memory requirement is proportional to N², where N is the number of observations. Choosing the best (least-time) setting for SaveMemory depends on the problem dimensions, number of observations, and available memory. The default SaveMemory setting is a rough approximation of an optimal setting.

Example: SaveMemory="on"

Data Types: char | string

Output Arguments

collapse all

`T` — Cluster indices
numeric column vector

Cluster indices, returned as a numeric column vector. T has as many rows as X, and each row of T indicates the cluster assignment of the corresponding observation in X.

Tips

If Linkage is "centroid" or "median", then linkage can produce a cluster tree that is not monotonic. This result occurs when the distance from the union of two clusters, r and s, to a third cluster is less than the distance between r and s. In this case, in a dendrogram drawn with the default orientation, the path from a leaf to the root node takes some downward steps. To avoid this result, specify another value for Linkage. The following image shows a nonmonotonic cluster tree.

In this case, cluster 1 and cluster 3 are joined into a new cluster, while the distance between this new cluster and cluster 2 is less than the distance between cluster 1 and cluster 3.

Algorithms

When you do not specify any optional name-value arguments, the clusterdata function performs the following steps:

Create a vector of the Euclidean distance between pairs of observations in X by using pdist.
Y = pdist(X,"euclidean")
Create an agglomerative hierarchical cluster tree from Y by using linkage with the "single" method for computing the shortest distance between clusters.
Z = linkage(Y,"single")
When you specify cutoff, the clusterdata function uses cluster to define clusters from Z when inconsistent values are less than cutoff.
T = cluster(Z,Cutoff=cutoff)
When you specify maxclust, the clusterdata function uses cluster to find a maximum of maxclust clusters from Z, using "distance" as the criterion for defining clusters.
T = cluster(Z,MaxClust=maxclust)

Alternative Functionality

If you have a hierarchical cluster tree Z (the output of the linkage function for the input data matrix X), you can use cluster to perform agglomerative clustering on Z and return the cluster assignment for each observation (row) in X.

Version History

Introduced before R2006a

expand all

R2025a: Define the maximum number of clusters using inconsistency coefficients

When you specify maxclust, you can specify Criterion="inconsistent" to define clusters based on inconsistency coefficients. In previous releases, when you specify maxclust, the clusterdata function always uses "distance" as the criterion for defining clusters.

clusterdata

Syntax

Description

Examples

Find Limited Number of Clusters from Sample Data

Create and Cluster Hierarchical Tree

Input Arguments

`X` — Input data
numeric matrix

`cutoff` — Cutoff for inconsistent or distance criterion
positive scalar

`maxclust` — Maximum number of clusters
positive integer

Name-Value Arguments

`Criterion` — Criterion for defining clusters
`"inconsistent"` | `"distance"`

`Depth` — Depth for computing inconsistent values
numeric scalar

`Distance` — Distance metric
`"euclidean"` (default) | `"squaredeuclidean"` | `"seuclidean"` | `"mahalanobis"` | function handle | ...

`Linkage` — Algorithm for computing distance between clusters
`"average"` | `"centroid"` | `"complete"` | `"median"` | `"single"` | `"ward"` | `"weighted"`

`SaveMemory` — Option for saving memory
`"on"` | `"off"`

Output Arguments

`T` — Cluster indices
numeric column vector

Tips

Algorithms

Alternative Functionality

Version History

R2025a: Define the maximum number of clusters using inconsistency coefficients

See Also

Topics

clusterdata

Syntax

Description

Examples

Find Limited Number of Clusters from Sample Data

Create and Cluster Hierarchical Tree

Input Arguments

X — Input data numeric matrix

cutoff — Cutoff for inconsistent or distance criterion positive scalar

maxclust — Maximum number of clusters positive integer

Name-Value Arguments

Criterion — Criterion for defining clusters "inconsistent" | "distance"

Depth — Depth for computing inconsistent values numeric scalar

Distance — Distance metric "euclidean" (default) | "squaredeuclidean" | "seuclidean" | "mahalanobis" | function handle | ...

Linkage — Algorithm for computing distance between clusters "average" | "centroid" | "complete" | "median" | "single" | "ward" | "weighted"

SaveMemory — Option for saving memory "on" | "off"

Output Arguments

T — Cluster indices numeric column vector

Tips

Algorithms

Alternative Functionality

Version History

R2025a: Define the maximum number of clusters using inconsistency coefficients

See Also

Topics

`X` — Input data
numeric matrix

`cutoff` — Cutoff for inconsistent or distance criterion
positive scalar

`maxclust` — Maximum number of clusters
positive integer

`Criterion` — Criterion for defining clusters
`"inconsistent"` | `"distance"`

`Depth` — Depth for computing inconsistent values
numeric scalar

`Distance` — Distance metric
`"euclidean"` (default) | `"squaredeuclidean"` | `"seuclidean"` | `"mahalanobis"` | function handle | ...

`Linkage` — Algorithm for computing distance between clusters
`"average"` | `"centroid"` | `"complete"` | `"median"` | `"single"` | `"ward"` | `"weighted"`

`SaveMemory` — Option for saving memory
`"on"` | `"off"`

`T` — Cluster indices
numeric column vector