hnswSearcher

Hierarchical Navigable Small Worlds (HNSW) approximate nearest neighbor search

Since R2024a

Description

An hnswSearcher model object stores the training data, distance metric, and parameter values of the distance metric for a Hierarchical Navigable Small Worlds (HNSW) approximate nearest neighbor search. For an algorithm overview, see Approximate KNN Search Using Hierarchical Navigable Small Worlds (HNSW) Algorithm.

After you create an hnswSearcher model object, you can use it to find points in the training data that are close to the query data by performing a nearest neighbor search using knnsearch. The HNSW search algorithm is faster than other KNN search algorithms. However, HNSW search results might not be exact.

Creation

Syntax

Mdl = hnswSearcher(X)

Mdl = hnswSearcher(X,Name=Value)

Description

Mdl = hnswSearcher(X) creates an HNSW searcher object based on the data in the matrix X. Each row of X represents one observation, and each column represents one feature.

example

Mdl = hnswSearcher(X,Name=Value) sets Properties and additional options using name-value arguments. For example, to use the Minkowski distance, set Distance="minkowski".

example

Input Arguments

expand all

`X` — Input data
numeric matrix

Input data, specified as a numeric matrix. The rows of X correspond to observations, and the columns correspond to variables.

The value of X sets the value of the X property in the model object.

Data Types: single | double

Name-Value Arguments

expand all

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: Distance="minkowski",P=3 specifies to use the Minkowski distance metric with an exponent of 3.

`Distance` — Distance metric
`"euclidean"` (default) | character vector or string scalar

Distance metric used when you call knnsearch to find nearest neighbors for future query points, specified as a character vector or string scalar of the distance metric name.

The value of Distance sets the value of the Distance property in the model object.

Value	Description
`"chebychev"`	Chebychev distance (maximum coordinate difference)
`"cityblock"`	City block distance
`"correlation"`	One minus the sample linear correlation between observations (treated as sequences of values)
`"cosine"`	One minus the cosine of the included angle between observations (treated as row vectors)
`"euclidean"`	Euclidean distance
`"hamming"`	Hamming distance, which is the percentage of coordinates that differ
`"jaccard"`	One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ
`"mahalanobis"`	Mahalanobis distance
`"minkowski"`	Minkowski distance. The default exponent is 2.
`"seuclidean"`	Standardized Euclidean distance
`"spearman"`	One minus the sample Spearman's rank correlation between observations (treated as sequences of values)

Example: Distance="minkowski"

Data Types: char

`Cov` — Covariance matrix for Mahalanobis distance metric
`cov(X,"omitrows")` (default) | positive definite matrix

Covariance matrix for the Mahalanobis distance metric, specified as a K-by-K positive definite matrix, where K is the number of columns in X. This argument is valid only when Distance is "mahalanobis".

The value of Cov sets the value of the DistParameter property in the model object.

Example: Cov=eye(3)

Data Types: single | double

`P` — Exponent for Minkowski distance metric
`2` (default) | positive scalar

Exponent for the Minkowski distance metric, specified as a positive scalar. This argument is valid only when Distance is "minkowski".

The value of P sets the value of the DistParameter property in the model object.

Example: P=3

Data Types: single | double

`Scale` — Scale parameter value for standardized Euclidean distance metric
`std(X,"omitnan")` (default) | nonnegative numeric vector

Scale parameter value for the standardized Euclidean distance metric, specified as a nonnegative numeric vector of length K, where K is the number of columns in X. The software scales each difference between the training and query data using the corresponding element of Scale. This argument is valid only when Distance is "seuclidean".

The value of Scale sets the value of the DistParameter property in the model object.

Example: Scale=quantile(X,0.75) - quantile(X,0.25)

Data Types: single | double

Properties

expand all

You can set the Distance, MaxNumLinksPerNode, and TrainSetSize properties by using the name-value argument syntax when you call hnswSearcher. However, you cannot directly specify the DistParameter property. As noted earlier, DistParameter is set by the Cov, P, or Scale name-value argument.

`Distance` — Distance metric
Read-only: `'euclidean'` (default) | character vector

This property is read-only.

Distance metric used when you call knnsearch to find nearest neighbors for future query points, specified as a character vector of the distance metric name.

Value	Description
`'chebychev'`	Chebychev distance (maximum coordinate difference)
`'cityblock'`	City block distance
`'correlation'`	One minus the sample linear correlation between observations (treated as sequences of values)
`'cosine'`	One minus the cosine of the included angle between observations (treated as row vectors)
`'euclidean'`	Euclidean distance
`'hamming'`	Hamming distance, which is the percentage of coordinates that differ
`'jaccard'`	One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ
`'mahalanobis'`	Mahalanobis distance
`'minkowski'`	Minkowski distance. The default exponent is 2.
`'seuclidean'`	Standardized Euclidean distance
`'spearman'`	One minus the sample Spearman's rank correlation between observations (treated as sequences of values)

Data Types: char

`DistParameter` — Distance metric parameter values
Read-only: `[]` | positive scalar | numeric vector | numeric matrix

This property is read-only.

Distance metric parameter values, specified as a positive scalar, numeric vector, or numeric matrix. This property is nonempty only when you specify Distance as "seuclidean", "minkowski", or "mahalanobis".

If Distance is "seuclidean", DistParameter is a vector of scaling factors for each dimension, specified as a positive vector. The value is set by the Scale name-value argument. The default value is std(X,"omitnan").
If Distance is "minkowski", DistParameter is the exponent of the Minkowski distance, specified as a positive scalar. The value is set by the P name-value argument. The default value is 2.
If Distance is "mahalanobis", DistParameter is a covariance matrix, specified as a numeric matrix. The value is set by the Cov name-value argument. The default value is cov(X,"omitrows").

Data Types: single | double

`MaxNumLinksPerNode` — Number of connections created for each node
Read-only: `min(16,N)`, where `N` is the number of rows in `X` (default) | positive scalar

This property is read-only.

Number of connections created for each node, specified as a positive scalar. Typically, set a number from about 5 to 48. MaxNumLinksPerNode cannot exceed TrainSetSize.

Data Types: double

`TrainSetSize` — Size of list for potential nearest neighbors
Read-only: `200` (default) | positive integer

This property is read-only.

Size of the list for potential nearest neighbors, specified as a positive integer. Typically, set a number from about 100 to 2000. TrainSetSize must be at least MaxNumLinksPerNode and no more than N, the number of rows of X.

You can try to search for a good value of TrainSetSize as follows:

Specify a large enough value to achieve a very accurate search result (close to an exact search).
Run query tests with decreased values, until the accuracy reaches a reasonably good range (recall rate of 0.95–0.99).

Data Types: double

`X` — Point data
Read-only: real matrix

This property is read-only.

Point data, specified as a real matrix. Each row of X represents one point. The value of this property is set by the X input argument.

Data Types: single | double

Object Functions

knnsearch Find k-nearest neighbors using searcher object

Examples

collapse all

Create HNSW Searcher

Open Live Script

Create an HNSW searcher object using the data in a pseudorandom input matrix X.

rng default % For reproducibility
X = 50*randn(1000,20);
Mdl = hnswSearcher(X)

Mdl = 
  hnswSearcher with properties:

    MaxNumLinksPerNode: 16
          TrainSetSize: 200
              Distance: 'euclidean'
         DistParameter: []
                     X: [1000×20 double]

Mdl is an hnswSearcher model object with default properties.

Find the closest data point to the point ones(1,20) and display its distance.

[idx,D] = knnsearch(Mdl,ones(1,20))

idx = 
174

D = 
118.4583

Perform the same search using an exact KNN search, and compare the results.

[idx2,D2] = knnsearch(X,ones(1,20))

idx2 = 
174

D2 = 
118.4583

The approximate hnswSearcher algorithm gives the same results as the exact KNN search algorithm

Create HNSW Searcher with Nondefault Properties

Open Live Script

Create an HNSW searcher object using the "cityblock" distance and twice as many links per node as the default.

rng default % For reproducibility
X = 50*randn(1000,20);
Mdl = hnswSearcher(X,Distance="cityblock",MaxNumLinksPerNode=32)

Mdl = 
  hnswSearcher with properties:

    MaxNumLinksPerNode: 32
          TrainSetSize: 200
              Distance: 'cityblock'
         DistParameter: []
                     X: [1000×20 double]

Mdl has the specified nondefault properties.

Find the closest data point to the point ones(1,20) and display its distance.

[idx,D] = knnsearch(Mdl,ones(1,20))

idx = 
738

D = 
420.2990

Perform the same search using an exact KNN search, and compare the results.

[idx,D] = knnsearch(X,ones(1,20),Distance="cityblock")

idx = 
738

D = 
420.2990

hnswSearcher gives the same results as the exact searcher.

HNSW Speed and Accuracy for Large Data

Open Live Script

Create a large data set and compare the speed of an HNSW searcher to the speed of an exhaustive searcher for 1000 query points.

Create the data.

rng default % For reproducibility
A = diag(1:100);
B = randn(100,10);
K = kron(A,B);
ss = size(K)

ss = 1×2

       10000        1000

The data K has 1e4 rows and 1e3 columns.

Create an HNSW searcher object h from the data K.

tic;
h = hnswSearcher(K);
chnsw = toc

chnsw = 10.6544

Create 1000 random query points with 1000 features (columns). Specify to search for five nearest neighbors.

Y = randn(1000, 1000);
tic;
[idx, D] = knnsearch(h,Y,K=5);
thnsw = toc

thnsw = 0.0797

Create an exhaustive searcher object e from the data K.

tic
e = ExhaustiveSearcher(K);
ces = toc

ces = 0.0021

Creating an exhaustive searcher is much faster than creating an HNSW searcher.

Time the search using e and compare the result to the time using the HNSW searcher h.

tic;
[idx0, D0] = knnsearch(e,Y,K=5);
tes = toc

tes = 1.4841

In this case, the HNSW searcher computes about 20 times faster than the exhaustive searcher.

Determine how many results of the HNSW search differ in some way from the results of the exhaustive search.

v = abs(idx - idx0); % Count any difference in the five found entries
vv = sum(v,2); % vv = 0 means no difference
nz = nnz(vv) % Out of 1000 rows how many differ at all?

nz = 118

Here, 118 of 1000 HNSW search results differ from the exhaustive search results.

Try to improve the accuracy of the HNSW searcher by training with nondefault parameters. Specifically, use larger values for MaxNumLinksPerNode and TrainSetSize. These settings affect the speed of training and finding nearest neighbors.

tic;
h2 = hnswSearcher(K,MaxNumLinksPerNode=48,TrainSetSize=2000);
chnsw2 = toc

chnsw2 = 78.4836

With these parameters, the searcher takes about seven times as long to train.

tic;
[idx2, D2] = knnsearch(h2,Y,K=5);
thnsw2 = toc

thnsw2 = 0.1049

The speed of finding nearest neighbors using HNSW decreases, but is still more than ten times faster than the exhaustive search.

v = abs(idx2 - idx0);
vv = sum(v,2);
nz = nnz(vv)

nz = 57

For the slower but more accurate HNSW search, only 57 of 1000 results differ in any way from the exact results. Summarize the timing results in a table.

timet = table([chnsw;ces;chnsw2],[thnsw;tes;thnsw2],'RowNames',["HNSW";"Exhaustive";"HNSW2"],'VariableNames',["Creation" "Search"])

timet=3×2 table
                  Creation      Search 
                  _________    ________

    HNSW             10.654    0.079741
    Exhaustive    0.0021304      1.4841
    HNSW2            78.484     0.10492

Algorithms

The HNSW algorithm is described in Malkov and Yashunin [1]. A brief description of the algorithm appears in Approximate KNN Search Using Hierarchical Navigable Small Worlds (HNSW) Algorithm.

Alternative Functionality

You can also create an HNSW searcher object by using createns with the specification NSMethod="hnsw". The resulting object is the same as the object you create using the hnswSearcher function.

References

[1] Malkov, Yu. A., and D. A. Yashunin. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. Available at https://arxiv.org/abs/1603.09320.

Version History

Introduced in R2024a

hnswSearcher

Description

Creation

Syntax

Description

Input Arguments

`X` — Input data
numeric matrix

Name-Value Arguments

`Distance` — Distance metric
`"euclidean"` (default) | character vector or string scalar

`Cov` — Covariance matrix for Mahalanobis distance metric
`cov(X,"omitrows")` (default) | positive definite matrix

`P` — Exponent for Minkowski distance metric
`2` (default) | positive scalar

`Scale` — Scale parameter value for standardized Euclidean distance metric
`std(X,"omitnan")` (default) | nonnegative numeric vector

Properties

`Distance` — Distance metric
Read-only: `'euclidean'` (default) | character vector

`DistParameter` — Distance metric parameter values
Read-only: `[]` | positive scalar | numeric vector | numeric matrix

`MaxNumLinksPerNode` — Number of connections created for each node
Read-only: `min(16,N)`, where `N` is the number of rows in `X` (default) | positive scalar

`TrainSetSize` — Size of list for potential nearest neighbors
Read-only: `200` (default) | positive integer

`X` — Point data
Read-only: real matrix

Object Functions

Examples

Create HNSW Searcher

Create HNSW Searcher with Nondefault Properties

HNSW Speed and Accuracy for Large Data

Algorithms

Alternative Functionality

References

Version History

See Also

Topics

hnswSearcher

Description

Creation

Syntax

Description

Input Arguments

X — Input data numeric matrix

Name-Value Arguments

Distance — Distance metric "euclidean" (default) | character vector or string scalar

Cov — Covariance matrix for Mahalanobis distance metric cov(X,"omitrows") (default) | positive definite matrix

P — Exponent for Minkowski distance metric 2 (default) | positive scalar

Scale — Scale parameter value for standardized Euclidean distance metric std(X,"omitnan") (default) | nonnegative numeric vector

Properties

Distance — Distance metric Read-only: 'euclidean' (default) | character vector

DistParameter — Distance metric parameter values Read-only: [] | positive scalar | numeric vector | numeric matrix

MaxNumLinksPerNode — Number of connections created for each node Read-only: min(16,N), where N is the number of rows in X (default) | positive scalar

TrainSetSize — Size of list for potential nearest neighbors Read-only: 200 (default) | positive integer

X — Point data Read-only: real matrix

Object Functions

Examples

Create HNSW Searcher

Create HNSW Searcher with Nondefault Properties

HNSW Speed and Accuracy for Large Data

Algorithms

Alternative Functionality

References

Version History

See Also

Topics

`X` — Input data
numeric matrix

`Distance` — Distance metric
`"euclidean"` (default) | character vector or string scalar

`Cov` — Covariance matrix for Mahalanobis distance metric
`cov(X,"omitrows")` (default) | positive definite matrix

`P` — Exponent for Minkowski distance metric
`2` (default) | positive scalar

`Scale` — Scale parameter value for standardized Euclidean distance metric
`std(X,"omitnan")` (default) | nonnegative numeric vector

`Distance` — Distance metric
Read-only: `'euclidean'` (default) | character vector

`DistParameter` — Distance metric parameter values
Read-only: `[]` | positive scalar | numeric vector | numeric matrix

`MaxNumLinksPerNode` — Number of connections created for each node
Read-only: `min(16,N)`, where `N` is the number of rows in `X` (default) | positive scalar

`TrainSetSize` — Size of list for potential nearest neighbors
Read-only: `200` (default) | positive integer

`X` — Point data
Read-only: real matrix