editDistance
Find edit distance between two strings or documents
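Syntax
d = editDistance(str1,str2)
d = editDistance(document1,document2)
d = editDistance(___,Name,Value)
Description
d = editDistance(str1,str2) returns the lowest number of grapheme insertions, deletions, and substitutions required to convert str1 to str2.
d = editDistance(document1,document2) returns the lowest number of token insertions, deletions, and substitutions required to convert document1 to document2.
d = editDistance(___,Name,Value) specifies additional options using one or more name-value pair arguments.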
Examples
Edit Distance Between Two Strings
Find the edit distance between the strings "Text analytics" and "Text analysis". The edit distance, by default, is the lowest number of grapheme insertions, deletions, and substitutions required to change one string to another.
str1 = "Text analytics";
str2 = "Text analysis";
Find the edit distance.
d = editDistance(str1,str2)
d = 2
This means changing the first string to the second requires two edits. For example:
- Substitution – Substitute the character "t" with "s": "Text analytics" to "Text analysics".
- Deletion – Delete the character "c": "Text analysics" to "Text analysis".
Edit Distance Between Two Documents
Find the edit distance between two tokenized documents. For tokenized document input, the edit distance, by default, is the lowest number of token insertions, deletions, and substitutions required to change one document to another.
str1 = "It's time for breakfast.";
document1 = tokenizedDocument(str1);
str2 = "It's now time to sleep.";
document2 = tokenizedDocument(str2);
Find the edit distance.
d = editDistance(document1,document2)
d = 3
This means changing the first document to the second requires three edits. For example:
- Insertion – Insert the word "now".
- Substitution – Substitute the word "for" with "to".
- Substitution – Substitute the word "breakfast" with "sleep".
Specify Cost Values
The editDistance function, by default, returns the lowest number of grapheme insertions, deletions, and substitutions required to change one string to another. To also include the swap action in the calculation, use the 'SwapCost' option.
First, find the edit distance between the strings "MATALB" and "MATLAB".
str1 = "MATALB";
str2 = "MATLAB";
d = editDistance(str1,str2)
d = 2
One possible sequence of edits is:
- Substitute the second "A" with "L" ("MATALB" to "MATLLB").
- Substitute the second "L" with "A" ("MATLLB" to "MATLAB").
The default value for the swap cost (the cost of swapping two adjacent graphemes) is Inf. This means that the function never uses swaps when computing the edit distance. To include swaps, set the 'SwapCost' option to 1.
d = editDistance(str1,str2,'SwapCost',1)
d = 1
This means changing the first string to the second requires one edit. For example, swap the adjacent characters "A" and "L".
Specify Custom Cost Function
To compute the edit distance between two words while ignoring differences in case, specify a custom substitution cost function.
First, compute the edit distance between the strings "MATLAB" and "MathWorks".
d = editDistance("MATLAB","MathWorks")
d = 8
This means changing the first string to the second requires eight edits. For example:
- Substitution – Substitute the character "A" with "a" ("MATLAB" to "MaTLAB").
- Substitution – Substitute the character "T" with "t" ("MaTLAB" to "MatLAB").
- Substitution – Substitute the character "L" with "h" ("MatLAB" to "MathAB").
- Substitution – Substitute the character "A" with "W" ("MathAB" to "MathWB").
- Substitution – Substitute the character "B" with "o" ("MathWB" to "MathWo").
- Insertion – Insert the character "r" ("MathWo" to "MathWor").
- Insertion – Insert the character "k" ("MathWor" to "MathWork").
- Insertion – Insert the character "s" ("MathWork" to "MathWorks").
Compute the edit distance and specify the custom substitution cost function caseInsensitiveSubstituteCost, listed at the end of the example. The custom function caseInsensitiveSubstituteCost returns 0 if the two inputs are the same or differ only by case and returns 1 otherwise.
d = editDistance("MATLAB","MathWorks",'SubstituteCost',@caseInsensitiveSubstituteCost)
d = 6
This means the total cost for changing the first string to the second is 6. For example:
- Substitution (cost 0) – Substitute the character "A" with "a" ("MATLAB" to "MaTLAB").
- Substitution (cost 0) – Substitute the character "T" with "t" ("MaTLAB" to "MatLAB").
- Substitution (cost 1) – Substitute the character "L" with "h" ("MatLAB" to "MathAB").
- Substitution (cost 1) – Substitute the character "A" with "W" ("MathAB" to "MathWB").
- Substitution (cost 1) – Substitute the character "B" with "o" ("MathWB" to "MathWo").
- Insertion (cost 1) – Insert the character "r" ("MathWo" to "MathWor").
- Insertion (cost 1) – Insert the character "k" ("MathWor" to "MathWork").
- Insertion (cost 1) – Insert the character "s" ("MathWork" to "MathWorks").
Custom Cost Function
The custom function caseInsensitiveSubstituteCost returns 0 if the two inputs are the same or differ only by case and returns 1 otherwise.
function cost = caseInsensitiveSubstituteCost(grapheme1,grapheme2)
if lower(grapheme1) == lower(grapheme2)
    cost = 0;
else
    cost = 1;
end
end
Input Arguments
str1 - Source string
string array | character vector | cell array of character vectors
Source string, specified as a string array, character vector, or a cell array of character vectors.
If str1 contains multiple strings, then str2 must be the same size as str1 or scalar.
Data Types: char | string | cell
str2 - Target string
string array | character vector | cell array of character vectors
Target string, specified as a string array, character vector, or a cell array of character vectors.
If str2 contains multiple strings, then str1 must be the same size as str2 or scalar.
Data Types: char | string | cell
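As a brief illustration of the size rule for str1 and str2, the following sketch compares a string array against a scalar target. It assumes Text Analytics Toolbox is installed and reuses the strings from the first example on this page.

```matlab
% Compare each element of a string array with a single (scalar) target string.
str1 = ["Text analytics" "Text analysis"];
str2 = "Text analysis";
d = editDistance(str1,str2);   % one distance per element of str1
```

The first distance should match the earlier example (2), and the second should be 0 because the two strings are identical.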
document1 - Source document
tokenizedDocument
Source document, specified as a tokenizedDocument array.
If document1 contains multiple documents, then document2 must be the same size as document1 or scalar.
document2 - Target document
tokenizedDocument
Target document, specified as a tokenizedDocument array.
If document2 contains multiple documents, then document1 must be the same size as document2 or scalar.
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: editDistance("MATALB","MATLAB",'SwapCost',1) returns the edit distance between the strings "MATALB" and "MATLAB" and sets the cost to swap two adjacent graphemes to 1.
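The two name-value syntaxes described above can be placed side by side. This is a minimal sketch that assumes R2021a or later for the first call; the variable names d1 and d2 are chosen here for illustration only.

```matlab
% Name=Value syntax (R2021a and later)
d1 = editDistance("MATALB","MATLAB",SwapCost=1);

% Equivalent comma-separated syntax with the name in quotes
d2 = editDistance("MATALB","MATLAB",'SwapCost',1);
```

Both calls request the same computation, so d1 and d2 should be equal.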
InsertCost - Cost to insert grapheme or token
1 (default) | nonnegative scalar | function handle
Cost to insert a grapheme or token, specified as the comma-separated pair consisting of 'InsertCost' and a nonnegative scalar or a function handle.
If 'InsertCost' is a function handle, then the function must accept a single input and return the cost of inserting the input into the source. For example:
- For string input to editDistance, the cost function must have the form cost = func(grapheme), where the function returns the cost of inserting grapheme into str1.
- For document input to editDistance, the cost function must have the form cost = func(token), where the function returns the cost of inserting token into document1.
Example: 'InsertCost',2
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | function_handle
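To make the function-handle form of 'InsertCost' concrete, here is a hedged sketch. The anonymous function insertCost is hypothetical and not part of the toolbox: it makes inserting a space free and charges 1 for any other grapheme.

```matlab
% Hypothetical cost function: inserting a space is free, any other grapheme costs 1.
insertCost = @(grapheme) double(~strcmp(grapheme," "));

% The only edit needed here is inserting the space between "MAT" and "LAB".
d = editDistance("MATLAB","MAT LAB",'InsertCost',insertCost);
```

With the default insertion cost the same pair of strings would be one edit apart; under this cost function the inserted space should contribute nothing to the total.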
DeleteCost - Cost to delete grapheme or token
1 (default) | nonnegative scalar | function handle
Cost to delete a grapheme or token, specified as the comma-separated pair consisting of 'DeleteCost' and a nonnegative scalar or a function handle.
If 'DeleteCost' is a function handle, then the function must accept a single input and return the cost of deleting the input from the source. For example:
- For string input to editDistance, the cost function must have the form cost = func(grapheme), where the function returns the cost of deleting grapheme from str1.
- For document input to editDistance, the cost function must have the form cost = func(token), where the function returns the cost of deleting token from document1.
Example: 'DeleteCost',2
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | function_handle
SubstituteCost - Cost to substitute grapheme or token
1 (default) | nonnegative scalar | function handle
Cost to substitute a grapheme or token, specified as the comma-separated pair consisting of 'SubstituteCost' and a nonnegative scalar or a function handle.
If 'SubstituteCost' is a function handle, then the function must accept exactly two inputs and return the cost of substituting the first input with the second in the source. For example:
- For string input to editDistance, the cost function must have the form cost = func(grapheme1,grapheme2), where the function returns the cost of substituting grapheme1 with grapheme2 in str1.
- For document input to editDistance, the cost function must have the form cost = func(token1,token2), where the function returns the cost of substituting token1 with token2 in document1.
Example: 'SubstituteCost',2
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | function_handle
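The document form of the substitution cost function can be sketched in the same spirit as the grapheme example earlier on this page. The anonymous function substituteCost and the sample sentences are assumptions made for illustration; the sketch treats tokens that differ only by case as a free substitution.

```matlab
% Hypothetical cost function: substituting tokens that differ only by case is free.
substituteCost = @(token1,token2) double(~strcmpi(token1,token2));

document1 = tokenizedDocument("Text Analytics Toolbox");
document2 = tokenizedDocument("text analytics toolbox");

dDefault = editDistance(document1,document2);   % every token differs by case
dCustom = editDistance(document1,document2,'SubstituteCost',substituteCost);
```

Because each token pair differs only in capitalization, dCustom should be 0, while dDefault counts one substitution per token.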
SwapCost - Cost to swap two adjacent graphemes or tokens
Inf (default) | nonnegative scalar | function handle
Cost to swap two adjacent graphemes or tokens, specified as the comma-separated pair consisting of 'SwapCost' and a nonnegative scalar or a function handle.
If 'SwapCost' is a function handle, then the function must accept exactly two inputs and return the cost of swapping the first input with the second in the source. For example:
- For string input to editDistance, the cost function must have the form cost = func(grapheme1,grapheme2), where the function returns the cost of swapping the adjacent graphemes grapheme1 and grapheme2 in str1.
- For document input to editDistance, the cost function must have the form cost = func(token1,token2), where the function returns the cost of swapping the adjacent tokens token1 and token2 in document1.
Example: 'SwapCost',2
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | function_handle
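Swaps apply to adjacent tokens in the same way they apply to adjacent graphemes. The following sketch uses two made-up sentences (an assumption for illustration) whose first two words are transposed.

```matlab
document1 = tokenizedDocument("time for breakfast");
document2 = tokenizedDocument("for time breakfast");

dDefault = editDistance(document1,document2);            % swaps disabled (SwapCost is Inf)
dSwap = editDistance(document1,document2,'SwapCost',1);  % adjacent-token swap allowed
```

Without swaps, the transposition has to be expressed as two substitutions; with 'SwapCost' set to 1, a single swap should suffice.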
Output Arguments
d - Edit distance
nonnegative scalar | vector of nonnegative values
Edit distance, returned as a nonnegative scalar or a vector of nonnegative values.
Algorithms
Edit Distance
The function, by default, uses the Levenshtein distance: the lowest number of insertions, deletions, and substitutions required to convert one string to another.
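For readers who want to see how a Levenshtein distance can be computed, the following is a minimal dynamic-programming sketch. It is not the toolbox implementation: the function name levenshteinSketch is hypothetical, all costs are fixed at 1, and it compares char code units rather than graphemes, so results can differ from editDistance for text containing multi-code-unit characters.

```matlab
function d = levenshteinSketch(a,b)
% Illustrative only: unit-cost Levenshtein distance over char code units.
a = char(a);
b = char(b);
m = numel(a);
n = numel(b);

% D(i+1,j+1) holds the distance between the first i characters of a
% and the first j characters of b.
D = zeros(m+1,n+1);
D(:,1) = (0:m)';   % delete all i characters of a
D(1,:) = 0:n;      % insert all j characters of b

for i = 1:m
    for j = 1:n
        subCost = double(a(i) ~= b(j));        % 0 when the characters match
        D(i+1,j+1) = min([D(i,j+1) + 1, ...    % delete a(i)
                          D(i+1,j) + 1, ...    % insert b(j)
                          D(i,j) + subCost]);  % substitute or keep
    end
end
d = D(m+1,n+1);
end
```

For the ASCII strings used on this page, for example levenshteinSketch("MATALB","MATLAB"), the sketch should agree with the default editDistance result.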
For other commonly used edit distances, use these options:
Distance | Description | Options |
---|---|---|
Levenshtein (default) | lowest number of insertions, deletions, and substitutions | Default |
Damerau-Levenshtein | lowest number of insertions, deletions, substitutions, and swaps | 'SwapCost',1 |
Hamming | lowest number of substitutions only | 'InsertCost',Inf,'DeleteCost',Inf |
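The options in the table can be compared on a single pair of strings. This sketch reuses "MATALB" and "MATLAB" from the examples; the variable names are chosen here for illustration.

```matlab
str1 = "MATALB";
str2 = "MATLAB";

dLevenshtein = editDistance(str1,str2);                               % default
dDamerau = editDistance(str1,str2,'SwapCost',1);                      % allow adjacent swaps
dHamming = editDistance(str1,str2,'InsertCost',Inf,'DeleteCost',Inf); % substitutions only
```

For this pair, the Levenshtein and Hamming settings should both need two substitutions, while the Damerau-Levenshtein setting needs a single swap.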
Version History
Introduced in R2019a
See Also
correctSpelling | editDistanceSearcher | knnsearch | rangesearch | splitGraphemes | tokenizedDocument