What is Levenshtein Distance?

by kalai 2010-01-12 15:29:59

Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t. For example,

* If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
* If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t.

The greater the Levenshtein distance, the more different the strings are.
The Levenshtein distance algorithm has been used in:

* Spell checking
* Speech recognition
* DNA analysis
* Plagiarism detection

Set n to be the length of s.
Set m to be the length of t.
If n = 0, return m and exit.
If m = 0, return n and exit.
Construct a matrix containing 0..m rows and 0..n columns.

Step 2
Initialize the first row to 0..n.
Initialize the first column to 0..m.

Step 3
Examine each character of s (i from 1 to n).

Step 4 Examine each character of t (j from 1 to m).

Step 5
If s[i] equals t[j], the cost is 0.
If s[i] doesn't equal t[j], the cost is 1.
Step 6
Set cell d[i,j] of the matrix equal to the minimum of:
a. The cell immediately above plus 1: d[i-1,j] + 1.
b. The cell immediately to the left plus 1: d[i,j-1] + 1.
c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.
Step 7
After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].

This section shows how the Levenshtein distance is computed when the source string is "GUMBO" and the target string is "GAMBOL".

Step 7

The distance is in the lower right hand corner of the matrix, i.e. 2. This corresponds to our intuitive realization that "GUMBO" can be transformed into "GAMBOL" by substituting "A" for "U" and adding "L" (one substitution and 1 insertion = 2 changes).

Tagged in:


You must LOGIN to add comments