Cosine similarity vs the Levenshtein distance

I want to know the difference between them and the situations in which each works best.

As per my understanding:

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any angle in the interval (0,π] radians.

The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.

My question is

  • When would one use cosine similarity over the Levenshtein distance?

Topic: metric, cosine-distance, similarity

Category: Data Science


Cosine similarity uses vectors and can calculate similarity for sets and multisets (= bags). When applied to sequences (of characters, words, sentences, lines, ...), the comparison is unordered: each kind of element becomes a feature, i.e. a dimension in the vector space. Thus the letters of the word 'banana' are transformed into the set {a, b, n} or the bag {a: 3, b: 1, n: 2}; the set can be treated as the bag {a: 1, b: 1, n: 1}, so the same calculation works in both cases.

Each distinct character is a dimension of the vectors. With Unicode support the vector space can in principle have 0x10FFFF ≈ 1.1 million dimensions, but for comparing two strings you only ever need a subset of size ≤ len1 + len2, which is implemented as a sparse vector.

To bring some sequential order into cosine similarity applied to sequences, we can use 2-grams or 3-grams instead of single characters. This can be very efficient for finding similar words in large dictionaries as candidates for spelling correction, e.g. by limiting the search to a minimal similarity of 0.7, or by taking the top 20 most similar words.
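A minimal sketch of the bag-of-characters idea above, using `collections.Counter` as the sparse vector (the function name is my own, not from any library):

```python
from collections import Counter
from math import sqrt

def char_bag_cosine(s1, s2):
    """Cosine similarity of two strings viewed as bags (multisets) of characters."""
    a, b = Counter(s1), Counter(s2)
    # sparse dot product: only characters occurring in both strings contribute
    dot = sum(a[ch] * b[ch] for ch in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(char_bag_cosine("banana", "bandana"))  # high similarity, ~0.97
print(char_bag_cosine("abc", "cab"))         # anagrams are identical as bags, ~1.0
```

Note that the comparison is order-blind: any permutation of the same characters scores 1.0, which is exactly why n-grams (below) or a follow-up edit-distance check are useful.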

From those candidates you can then pick the best matches with the slower but more precise Levenshtein distance or LCS (longest common subsequence).
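For the refinement step, the Levenshtein distance itself can be computed with the standard two-row dynamic program; a minimal sketch:

```python
def levenshtein(s1, s2):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(s2) + 1))  # distances from "" to each prefix of s2
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (c1 != c2)))  # substitution (free on match)
        prev = curr
    return prev[-1]

print(levenshtein("aaaaa", "aaaba"))    # 1
print(levenshtein("kitten", "sitting")) # 3
```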


As mentioned in other answers, traditionally cosine is used to measure similarity between vectors whereas Levenshtein is used as a string similarity measure, i.e. measuring the distance between sequences of characters.

Nevertheless, they can both be used in non-traditional settings and are indeed comparable:

  • the vectors compared with cosine can for instance contain frequencies of characters or character n-grams, hence making it a string similarity measure
  • one can replace the sequence of characters with a sequence of strings or a sequence of n-grams, thus making Levenshtein a more general distance measure.
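As a sketch of the first bullet, here is cosine over bags of character 2-grams (names are illustrative, not from a library). Anagrams, which are indistinguishable as character bags, now come apart because n-grams preserve local order:

```python
from collections import Counter
from math import sqrt

def ngrams(s, n=2):
    """Bag of overlapping character n-grams of a string."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def ngram_cosine(s1, s2, n=2):
    """Cosine similarity over character n-gram bags."""
    a, b = ngrams(s1, n), ngrams(s2, n)
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 'listen' and 'silent' share all letters but only the 2-gram 'en'
print(ngram_cosine("listen", "silent"))  # ~0.2
```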

The main conceptual difference between Cosine and Levenshtein is that the former assumes a "bag-of-words" vector representation, i.e. compares unordered sets, whereas the latter takes into account the order of the elements in the sequences.

In the context of comparing sequences of words many combinations are possible. In case that's what you're looking for you might be interested in this paper: https://www.aclweb.org/anthology/C08-1075/ (full disclosure: I'm one of the authors).


To answer your question directly, I would say that one could use cosine similarity when dealing with vectors (for instance the distance between (1,2,3) and (4,5,6)) and the Levenshtein distance when dealing with strings (the "distance" between "aaaaa" and "aaaba").

Concretely, they don't really apply to the same context and are not used for the same applications. If you want to test whether two pieces of text are similar, it is reasonable to use the Levenshtein distance. If you want to know whether two vectors are close to each other in a 3-dimensional space, it is a good idea to use cosine similarity.
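For completeness, the cosine similarity of the example vectors (1,2,3) and (4,5,6) works out as follows (a plain-Python sketch):

```python
from math import sqrt

def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# dot = 1*4 + 2*5 + 3*6 = 32; norms sqrt(14) and sqrt(77)
print(cosine((1, 2, 3), (4, 5, 6)))  # ~0.9746
```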


The first one is for computing the similarity between objects considering their representations as vectors. The second one is for computing the similarity between sequences of characters.
