editDistanceSearcher
Edit distance nearest neighbor searcher
Description
An edit distance searcher performs a nearest neighborhood search in a list of known strings, using edit distance.
Creation
Syntax
Description
creates an edit distance searcher and sets the eds
= editDistanceSearcher(vocabulary
,maxDist
)Vocabulary
and MaximumDistance
properties. The returned object searches the words in
vocabulary
and with maximum edit distance
maxDist
.
specifies additional options using one or more name-value pair arguments.eds
= editDistanceSearcher(vocabulary
,maxDist
,Name,Value
)
Properties
Vocabulary
— Words to compare against
string vector | character vector | cell array of character vectors
Words to compare against, specified as a string vector, character vector, or a cell array of character vectors.
Data Types: char
| string
| cell
MaximumDistance
— Maximum edit distance
positive scalar
Maximum edit distance, specified as a positive scalar.
Data Types: single
| double
| int8
| int16
| int32
| int64
| uint8
| uint16
| uint32
| uint64
InsertCost
— Cost to insert grapheme
1 (default) | nonnegative scalar | function handle
Cost to insert grapheme, specified as a nonnegative scalar or a function handle.
If InsertCost
is a function handle, then the function must accept
a single input and return the cost of inserting the input to the source. The cost
function must have the form cost = func(grapheme)
, where the function
returns the cost of inserting grapheme
into the source string.
If you specify a custom cost function, then the searcher perform exhaustive searching. For large vocabularies, the functions knnsearch
and rangesearch
can take a long time to find matches.
Data Types: single
| double
| int8
| int16
| int32
| int64
| uint8
| uint16
| uint32
| uint64
| function_handle
DeleteCost
— Cost to delete grapheme
1 (default) | nonnegative scalar | function handle
Cost to delete grapheme, specified as a nonnegative scalar or a function handle.
If DeleteCost
is a function handle, then the function must accept
a single input and return the cost of deleting the input from the source. The cost
function must have the form cost = func(grapheme)
, where the function
returns the cost of deleting grapheme
from the source string.
If you specify a custom cost function, then the searcher perform exhaustive searching. For large vocabularies, the functions knnsearch
and rangesearch
can take a long time to find matches.
Data Types: single
| double
| int8
| int16
| int32
| int64
| uint8
| uint16
| uint32
| uint64
| function_handle
SubstituteCost
— Cost to substitute grapheme
1 (default) | nonnegative scalar | function handle
Cost to substitute grapheme, specified as a nonnegative scalar or a function handle.
If SubstituteCost
is a function handle, then the function must
accept exactly two inputs and return the cost of substituting the first input to the
second in the source. The cost function must have the form cost =
func(grapheme1,grapheme2)
, where the function returns the cost of
substituting grapheme1
with grapheme2
in the
source.
If you specify a custom cost function, then the searcher perform exhaustive searching. For large vocabularies, the functions knnsearch
and rangesearch
can take a long time to find matches.
Data Types: single
| double
| int8
| int16
| int32
| int64
| uint8
| uint16
| uint32
| uint64
| function_handle
SwapCost
— Cost to swap adjacent graphemes
Inf
(default) | nonnegative scalar | function handle
Cost to swap adjacent graphemes, specified as a nonnegative scalar or a function handle.
If SwapCost
is a function handle, then the function must accept
exactly two inputs and return the cost of swapping the first input with the second in
the source. The cost function must have the form cost =
func(grapheme1,grapheme2)
, where the function returns the cost of swapping
the adjacent graphemes grapheme1
and grapheme2
in
the source.
If you specify a custom cost function, then the searcher perform exhaustive searching. For large vocabularies, the functions knnsearch
and rangesearch
can take a long time to find matches.
Data Types: single
| double
| int8
| int16
| int32
| int64
| uint8
| uint16
| uint32
| uint64
| function_handle
Object Functions
rangesearch | Find nearest neighbors by edit distance range |
knnsearch | Find nearest neighbors by edit distance |
Examples
Create Edit Distance Searcher
Create an edit distance searcher with a maximum edit distance 3 from the words "MathWorks"
, "MATLAB"
, and "Analytics"
.
vocabulary = ["MathWorks" "MATLAB" "Analytics"]; eds = editDistanceSearcher(vocabulary,3)
eds = editDistanceSearcher with properties: Vocabulary: ["MathWorks" "MATLAB" "Analytics"] MaximumDistance: 3 InsertCost: 1 DeleteCost: 1 SubstituteCost: 1 SwapCost: Inf
Create Damerau-Levenshtein Edit Distance Searcher
Create an edit distance searcher using the Damerau-Levenshtein edit distance. The Damerau-Levenshtein edit distance is the lowest number of insertions, deletions, substitutions, and swaps.
Create the edit distance searcher from the words "MathWorks"
, "MATLAB"
, and "Analytics"
and specify a maximum distance of 3. To specify the Damerau-Levenshtein edit distance, set 'SwapCost'
to 1.
vocabulary = ["MathWorks" "MATLAB" "Analytics"]; eds = editDistanceSearcher(vocabulary,3,'SwapCost',1)
eds = editDistanceSearcher with properties: Vocabulary: ["MathWorks" "MATLAB" "Analytics"] MaximumDistance: 3 InsertCost: 1 DeleteCost: 1 SubstituteCost: 1 SwapCost: 1
Find Nearest Words
Create an edit distance searcher.
vocabulary = ["Text" "Analytics" "Toolbox"]; eds = editDistanceSearcher(vocabulary,2);
Find the nearest words to "Test"
and "Analysis"
.
words = ["Test" "Analysis"]; idx = knnsearch(eds,words)
idx = 2×1
1
2
Get the words from the vocabulary using the returned indices.
nearestWords = eds.Vocabulary(idx)
nearestWords = 1x2 string
"Text" "Analytics"
Find Nearest Neighbors in Range
Create an edit distance searcher and specify a maximum edit distance of 3.
vocabulary = ["MathWorks" "MATLAB" "Simulink" "text" "analytics" "analysis"]; maxDist = 3; eds = editDistanceSearcher(vocabulary,maxDist);
Find the nearest words to "test"
, "analytic"
, and "analyze"
with edit distance less than or equal to 1.
words = ["test" "analytic" "analyze"]; maxDist = 1; idx = rangesearch(eds,words,maxDist)
idx=3×1 cell array
{[ 4]}
{[ 5]}
{1x0 double}
For "analyze"
, there are no words in the searcher within the specified range. For "test"
and "analytic"
, there is one result each. View the corresponding word for "test"
using the returned index.
nearestWords = eds.Vocabulary(idx{2})
nearestWords = "analytics"
Find the nearest words to "test"
, "analytic"
, and "analyze"
with edit distance less than or equal to 3 and their corresponding edit distances.
words = ["test" "analytic" "analyze"]; maxDist = 3; [idx,d] = rangesearch(eds,words,maxDist)
idx=3×1 cell array
{[ 4]}
{[5 6]}
{[ 6]}
d=3×1 cell array
{[ 1]}
{[1 2]}
{[ 3]}
For both "test"
and "analyze"
, there is one word in the searcher within the specified range. For "analytic"
, there are two results. View the corresponding words for "analytic"
(the second word) using the returned indices and their edit distances.
i = 2; nearestWords = eds.Vocabulary(idx{i})
nearestWords = 1x2 string
"analytics" "analysis"
d{i}
ans = 1×2
1 2
Algorithms
Edit Distance
The function, by default, uses the Levenshtein distance: the lowest number of insertions, deletions, and substitutions required to convert one string to another.
For other commonly used edit distances, use these options:
Distance | Description | Options |
---|---|---|
Levenshtein (default) | lowest number of insertions, deletions, and substitutions | Default |
Damerau-Levenshtein | lowest number of insertions, deletions, substitutions, and swaps | 'SwapCost',1 |
Hamming | lowest number of substitutions only | 'InsertCost',Inf,'DeleteCost',Inf |
Version History
Introduced in R2019a
See Also
correctSpelling
| editDistance
| knnsearch
| rangesearch
| splitGraphemes
| tokenizedDocument
MATLAB Command
You clicked a link that corresponds to this MATLAB command:
Run the command by entering it in the MATLAB Command Window. Web browsers do not support MATLAB commands.
Select a Web Site
Choose a web site to get translated content where available and see local events and offers. Based on your location, we recommend that you select: United States.
You can also select a web site from the following list
How to Get Best Site Performance
Select the China site (in Chinese or English) for best site performance. Other MathWorks country sites are not optimized for visits from your location.
Americas
- América Latina (Español)
- Canada (English)
- United States (English)
Europe
- Belgium (English)
- Denmark (English)
- Deutschland (Deutsch)
- España (Español)
- Finland (English)
- France (Français)
- Ireland (English)
- Italia (Italiano)
- Luxembourg (English)
- Netherlands (English)
- Norway (English)
- Österreich (Deutsch)
- Portugal (English)
- Sweden (English)
- Switzerland
- United Kingdom (English)
Asia Pacific
- Australia (English)
- India (English)
- New Zealand (English)
- 中国
- 日本Japanese (日本語)
- 한국Korean (한국어)