Levenshtein distance vs simple for loop

I have recently begun studying different data science principles and have had a particular interest lately in fuzzy matching. By way of preface: I'd like to include smarter fuzzy searching in a proprietary language named 4D at my workplace, so access to libraries is pretty much nonexistent. It's also worth noting that the client side is currently single-threaded, so taking advantage of multi-threaded matrix manipulations is out of the question.

I began studying the Levenshtein algorithm and got it implemented, but it's well known for being slow. It builds a matrix by looping over both words, giving O(mn) time complexity.

That got me thinking: what's the advantage of the Levenshtein algorithm over a single for loop that compares the characters at each index and returns -1 on a mismatch (along with checking string lengths, etc.), giving O(n) time complexity? Or does Levenshtein do extra checks that I'm unaware of?

Tags: fuzzy-logic, distance, efficiency

Category: Data Science


Levenshtein and Jaccard distances are both useful for fuzzy string matching, and give far more meaningful results than Hamming distance. Both have a fair cost, and I'm not sure that Jaccard would be much cheaper for a single-string comparison.

However, in the context of many data items, your overall costs are very strongly influenced by how many data items you have to perform the comparison against.

In this context, Jaccard-style algorithms (based on trigrams or N-grams) offer an opportunity to efficiently select a subset to compare against. It goes something like this:

  • build a trigram index,
  • sample 3-5 trigrams from the input word,
  • use the index to collect all candidates containing any one of those trigrams,
  • calculate the (more costly) distance metric only on those candidates.
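The steps above can be sketched in Python roughly as follows. Names like `build_index` and the sampling cutoff `k` are illustrative, not any specific library's API:

```python
from collections import defaultdict

def trigrams(word):
    """All 3-character substrings of a word."""
    return {word[i:i + 3] for i in range(len(word) - 2)}

def build_index(words):
    """Map each trigram to the set of words containing it."""
    index = defaultdict(set)
    for w in words:
        for t in trigrams(w):
            index[t].add(w)
    return index

def candidates(query, index, k=5):
    """Collect all indexed words sharing any of up to k trigrams
    sampled from the query word."""
    found = set()
    for t in list(trigrams(query))[:k]:
        found |= index.get(t, set())
    return found
```

The costly distance metric is then computed only over `candidates(query, index)` instead of the whole data set.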

By restricting the comparison in this way, you can often reduce the candidate set from the order of 100,000 items to nearer 100.

I implemented this once for fuzzy search in a database application, and wrote up the approach. See:


If I understand you correctly, you would loop over a string character by character and check whether there is a match at the same position in the other string.

Drawing from this post, you find that:

import distance
distance.levenshtein("0123456789", "1234567890")  # 2
distance.hamming("0123456789", "1234567890")      # 10

The Levenshtein distance is 2, while the Hamming distance is 10 (and so, if I understand correctly, would be your loop approach).

So if Hamming distance is adequate for your task, you can use the loop approach.
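The loop approach could be sketched like this; the return-of-minus-one on a length mismatch follows the question's description, and is an assumption on my part:

```python
def loop_distance(a, b):
    """Positional comparison as described in the question:
    return -1 if the lengths differ, otherwise count the number
    of mismatched positions (the Hamming distance)."""
    if len(a) != len(b):
        return -1
    return sum(c1 != c2 for c1, c2 in zip(a, b))
```

This is O(n), but as the example above shows, every character shifted by an insertion or deletion counts as a mismatch.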


A simple example shows the difference: let's compare the strings hello! and hhello!:

  • The Levenshtein distance finds that the 2nd character doesn't match, but also that the following ones can be aligned (in other words it finds one insertion). It returns a score corresponding to 1 edit out of 6 characters.
  • The "single loop" described in the question finds that the first character matches, while the insertion shifts everything after it by one position, so nearly all the remaining comparisons fail. It returns a score corresponding to 5 edits out of 6 characters.

The Levenshtein distance is an alignment algorithm. Without the double loop there's no alignment, so only the characters at the same position get compared.
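For reference, here is a minimal sketch of the standard dynamic-programming formulation, which is where the double loop (and the alignment) comes from:

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein distance.
    The double loop fills an (m+1) x (n+1) matrix row by row;
    only the previous row needs to be kept in memory."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances from a[:0] to each prefix of b
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution / match
        prev = curr
    return prev[n]
```

The `min` over the three neighboring cells is what lets the algorithm line up characters at different positions, which a single positional loop cannot do.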
