Word similarity considering special characteristics

I'm looking for an algorithm that computes the similarity between two strings, much like the Levenshtein distance. However, I want it to take the following into account. The Levenshtein distance gives me the same result for all of these cases:

distance(apple, appli) #1
distance(apple, appel) #1
distance(apple, applr) #1

However, I want the second and third examples to have a smaller distance, for the following reasons:

  • second example: all the correct letters are used in the second word
  • third example: r is much more likely to be a typo for the letter e because of its keyboard placement.

Are you familiar with any algorithm that weights such characteristics?

Topic nlp similarity

Category Data Science


There are plenty of versions of edit distance and plenty of possible extensions.

The edit operations can have different weights. By default, every operation has a weight of 1, but you can reassign the substitution weights based on the keys' distance on the keyboard and assign a lower weight to substitutions of neighboring characters. In Python, you can use e.g. the weighted-levenshtein package.
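As a minimal sketch of the idea (not using any particular package): the standard dynamic-programming edit distance, with a substitution-cost function that discounts keys adjacent on a QWERTY keyboard. The keyboard layout table and the 0.5 discount are illustrative assumptions.

```python
# Illustrative keyboard layout; only lowercase letters are covered here.
KEYBOARD_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Map each character to its (row, column) position on the keyboard.
KEY_POS = {c: (r, col)
           for r, row in enumerate(KEYBOARD_ROWS)
           for col, c in enumerate(row)}

def sub_cost(a, b):
    """Cheaper substitution for keys that are adjacent on the keyboard."""
    if a == b:
        return 0.0
    pa, pb = KEY_POS.get(a), KEY_POS.get(b)
    if pa and pb and abs(pa[0] - pb[0]) <= 1 and abs(pa[1] - pb[1]) <= 1:
        return 0.5  # assumed discount for neighboring keys
    return 1.0

def weighted_levenshtein(s, t):
    """Standard DP edit distance with weighted substitutions."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,             # deletion
                          d[i][j - 1] + 1,             # insertion
                          d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
    return d[m][n]
```

With these weights, `weighted_levenshtein("apple", "applr")` comes out at 0.5 (e and r are neighbors) while `weighted_levenshtein("apple", "appli")` stays at 1.0, which is exactly the asymmetry asked for in the third example.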

There is also an extension of the standard edit distance called the Damerau-Levenshtein distance. It allows an additional operation: transposing two adjacent characters. If you set a lower cost for transpositions, swapped characters in the string are penalized less.
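A sketch of the (restricted) Damerau-Levenshtein distance with a reduced transposition cost; the 0.5 weight is an illustrative assumption:

```python
def damerau_levenshtein(s, t, transpose_cost=0.5):
    """Restricted Damerau-Levenshtein distance (optimal string alignment)
    with a configurable cost for transposing adjacent characters."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)
    for j in range(n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0.0 if s[i - 1] == t[j - 1] else 1.0
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Transposition of two adjacent characters.
            if (i > 1 and j > 1 and s[i - 1] == t[j - 2]
                    and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + transpose_cost)
    return d[m][n]
```

Here `damerau_levenshtein("apple", "appel")` is 0.5 (one cheap swap of l and e), while `damerau_levenshtein("apple", "appli")` remains 1.0, matching the second example's requirement.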
