When to use cosine similarity over Euclidean similarity

In NLP, people tend to use cosine similarity to measure document/text distances. I want to hear what people think of the following two scenarios: which should I pick, cosine similarity or Euclidean distance?

Overview of the task: compute context similarities of multi-word expressions (MWEs). For example, given the MWE put up, its context consists of the words to the left of put up as well as the words to the right of it in a text. Mathematically speaking, similarity in this task means calculating

sim(context_of_using_put_up, context_of_using_in_short)

Note that the context is a feature built on top of word embeddings; let's assume each word has an embedding of dimension 200.

Two scenarios for representing context_of_an_expression:

  1. Concatenate the left and right context words, producing an embedding vector of dimension 200*4=800 when picking two words on each side. In other words, a feature vector [lc1, lc2, rc1, rc2] is built for the context, where lc = left_context and rc = right_context.

  2. Average the left and right context word embeddings, producing a vector of 200 dimensions. In other words, a feature vector [mean(lc1+lc2+rc1+rc2)] is built for the context. (Both representations are sketched in code after this list.)

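To make the two scenarios concrete, here is a minimal sketch in Python; the example sentence and the random `embed` lookup are placeholders standing in for real 200-dimensional embeddings, not part of the question:

```python
import numpy as np

# Placeholder embeddings: a real system would load pretrained 200-dim vectors.
DIM = 200
rng = np.random.default_rng(0)
vocab = ["they", "quickly", "the", "tent"]
embed = {w: rng.normal(size=DIM) for w in vocab}

def context_concat(lc1, lc2, rc1, rc2):
    # Scenario 1: concatenate the four context embeddings -> 800-dim vector.
    return np.concatenate([embed[lc1], embed[lc2], embed[rc1], embed[rc2]])

def context_mean(lc1, lc2, rc1, rc2):
    # Scenario 2: average the four context embeddings -> 200-dim vector.
    return np.mean([embed[lc1], embed[lc2], embed[rc1], embed[rc2]], axis=0)

# e.g. "... they quickly put up the tent ..." -> lc = [they, quickly], rc = [the, tent]
v1 = context_concat("they", "quickly", "the", "tent")   # shape (800,)
v2 = context_mean("they", "quickly", "the", "tent")     # shape (200,)
```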
[Edited] For both scenarios, I think Euclidean distance is a better fit. Cosine similarity is known for handling scale/length effects because of its normalization, but I don't think there is much here that needs to be normalized.

Tags: nlp, similarity, clustering, machine-learning

Category: Data Science


OK, so, your intuition here is wrong. Not necessarily about the examples you gave, but about the idea that Euclidean distance could be useful in 200-dimensional space. 200-dimensional space is so, so empty. Everything is far from everything else. That's why we use cosine similarity: because everything is far from everything else, if two vectors are pointing in the same direction, that's already pretty good. [This is the one area of NLP where my traditional math background came in most useful.]

The example that made this clear for me is thinking about the ratio of the volume of the unit sphere (the ball inscribed in the unit cube) to the unit cube in n dimensions, as n goes to infinity. You can read the answers on Stack Exchange Math or just think through the first few cases. In one dimension, a line, the unit "sphere" (a segment) takes up 100% of the unit line. In 2D, that's π/4, roughly 78%. In 3D it's π/6, around 52%. By 10D it's π^5/122880, or about 0.2%.

In 200D, the unit sphere is about 3.5×10^-169 of the unit cube. It's essentially just a point. Euclidean distance just... becomes useless for most things.
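For what it's worth, these ratios are easy to reproduce; a small sketch using the volume of the ball inscribed in the unit cube, $\frac{\pi^{n/2}}{\Gamma(n/2+1)\,2^{n}}$, computed in log space to avoid overflow:

```python
import math

def sphere_to_cube_ratio(n):
    # Volume of the ball inscribed in the unit n-cube (radius 1/2) divided by
    # the cube's volume (1): pi^(n/2) / (Gamma(n/2 + 1) * 2^n).
    log_ratio = (n / 2) * math.log(math.pi) - math.lgamma(n / 2 + 1) - n * math.log(2)
    return math.exp(log_ratio)

for n in (1, 2, 3, 10, 200):
    print(n, sphere_to_cube_ratio(n))
# 1 -> 1.0, 2 -> ~0.785, 3 -> ~0.524, 10 -> ~0.0025, 200 -> ~3.5e-169
```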

Another closely related example is the curse of dimensionality in sampling, quoted below for good measure:

There is an exponential increase in volume associated with adding extra dimensions to a mathematical space. For example, 10^2 = 100 evenly spaced sample points suffice to sample a unit interval (a "1-dimensional cube") with no more than 10^-2 = 0.01 distance between points; an equivalent sampling of a 10-dimensional unit hypercube with a lattice that has a spacing of 10^-2 = 0.01 between adjacent points would require 10^20 [= (10^2)^10] sample points. In general, with a spacing distance of 10^-n the 10-dimensional hypercube appears to be a factor of 10^(n(10-1)) [= (10^n)^10 / (10^n)] "larger" than the 1-dimensional hypercube, which is the unit interval. In the above example n = 2: when using a sampling distance of 0.01 the 10-dimensional hypercube appears to be 10^18 "larger" than the unit interval. This effect is a combination of the combinatorics problems above and the distance function problems explained below.


Caveat: for normalized vectors (unit vectors), cosine similarity and Euclidean distance are essentially equivalent (minimizing one is equivalent to maximizing the other). This is because for unit vectors cosine similarity is simply the dot product, and $\lVert x - y\rVert^2 = 2 - 2\,x^T y$. Computationally, the dot product is faster because it can be used on sparse vectors and saves one vector subtraction.
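A quick numerical check of that identity, as a sketch with random unit vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = rng.normal(size=200)
x /= np.linalg.norm(x)   # normalize to unit length
y /= np.linalg.norm(y)

cos_sim = x @ y                      # dot product of unit vectors = cosine similarity
sq_euclid = np.sum((x - y) ** 2)     # squared Euclidean distance

# For unit vectors: ||x - y||^2 = 2 - 2 * cos(x, y)
assert np.isclose(sq_euclid, 2 - 2 * cos_sim)
```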


The intuition built by the top response is spot-on for tf-idf vectors, and carries over to any vector that naturally wants to be normalized. However, in such circumstances cosine similarity is in bijection with Euclidean distance, so there is no real theoretical advantage of one over the other; in practice, cosine similarity is then simply faster.

The second response is incorrect, and sparsity doesn't matter in this context, at least not in practice. In fact, I wrote a paper on the algebraic (and some geometric) properties of word embeddings trained end-to-end using RNNs on NLP tasks (well, on a classification task), and found that cosine similarity is a far, far weaker estimate of a word's action on the hidden state (heuristically, the "meaning" of the sentence) than Euclidean distance. You can review it in Fig. 16 here if you wish.

For this reason, I actually think the use of cosine similarity is a holdover from classical NLP that made use of tf-idf vectors, and it really needs to be abandoned.


When to use cosine similarity over Euclidean similarity

Cosine similarity looks at the angle between two vectors; Euclidean distance looks at the distance between two points.

Let's say you are in an e-commerce setting and you want to compare users for product recommendations:

  • User 1 bought 1x eggs, 1x flour and 1x sugar.
  • User 2 bought 100x eggs, 100x flour and 100x sugar.
  • User 3 bought 1x eggs, 1x Vodka and 1x Red Bull.

By cosine similarity, user 1 and user 2 are more similar to each other. By Euclidean distance, user 3 is more similar to user 1.
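Putting numbers on that, a small sketch encoding the three users over the five products mentioned:

```python
import numpy as np

# Purchase counts over [eggs, flour, sugar, vodka, red_bull]
user1 = np.array([1, 1, 1, 0, 0])
user2 = np.array([100, 100, 100, 0, 0])
user3 = np.array([1, 0, 0, 1, 1])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def euclid(a, b):
    return np.linalg.norm(a - b)

print(cosine(user1, user2), cosine(user1, user3))  # 1.0 vs ~0.33 -> user 2 is "closer"
print(euclid(user1, user2), euclid(user1, user3))  # ~171.5 vs 2.0 -> user 3 is closer
```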

Questions in the text

I don't understand the first part.

Cosine similarity is specialized in handling scale/length effects. For case 1, the context length is fixed at 4 words, so there are no scale effects. In case 2, term frequency matters: a word that appears once is different from a word that appears twice, so we cannot apply cosine.

This goes in the right direction, but is not completely true. For example:

$$ \cos \left (\begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}2\\1\end{pmatrix} \right) = \cos \left (\begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}4\\2\end{pmatrix} \right) \neq \cos \left (\begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}5\\2\end{pmatrix} \right) $$

With cosine similarity, the following is true:

$$ \cos \left( \begin{pmatrix}a\\b\end{pmatrix}, \begin{pmatrix}c\\d\end{pmatrix} \right) = \cos \left( \begin{pmatrix}a\\b\end{pmatrix}, n \cdot \begin{pmatrix}c\\d\end{pmatrix} \right) \text{ for any } n > 0 $$

So frequencies are only ignored if all features are multiplied by the same positive constant.
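Those two statements are easy to verify numerically (a quick sketch):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
print(cosine(a, np.array([2.0, 1.0])))   # ~0.894
print(cosine(a, np.array([4.0, 2.0])))   # ~0.894 -> same, since (4, 2) = 2 * (2, 1)
print(cosine(a, np.array([5.0, 2.0])))   # ~0.928 -> different, not a uniform rescaling
```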

Curse of Dimensionality

When you look at the table in my blog post, you can see:

  • The more dimensions I have, the closer the average distance and the maximum distance between randomly placed points become.
  • Similarly, the angles between uniformly randomly placed points concentrate around 90° (the simulation sketch after the key point below illustrates both effects).

So both measures suffer from high dimensionality. More about this: Curse of dimensionality - does cosine similarity work better and if so, why? A key point:

  • Cosine is essentially the same as Euclidean on normalized data.
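Both effects can be reproduced with a quick simulation. A sketch, assuming points drawn uniformly from the centred cube [-1, 1]^d, which may differ from the blog post's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 2000

for dim in (2, 10, 100, 1000):
    # Random pairs of points in the centred cube [-1, 1]^dim.
    a = rng.uniform(-1, 1, size=(n_pairs, dim))
    b = rng.uniform(-1, 1, size=(n_pairs, dim))
    dists = np.linalg.norm(a - b, axis=1)
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    # As dim grows, the mean distance approaches the max distance
    # and the angles pile up ever more tightly around 90 degrees.
    print(f"dim={dim:4d}  mean/max distance={dists.mean() / dists.max():.2f}"
          f"  angle={angles.mean():.0f} deg (std {angles.std():.0f})")
```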

Alternatives

You might be interested in metric learning. The principle is described/used in FaceNet: A Unified Embedding for Face Recognition and Clustering (my summary). Instead of taking one of the well-defined and simple metrics, you can learn a metric for the problem domain.
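For a flavour of what "learning a metric" means: FaceNet trains its embedding with a triplet loss. Here is a minimal sketch of that loss on toy vectors; the numbers are made up, and a real setup would backpropagate this through an embedding network:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # FaceNet-style triplet loss: push the anchor-positive squared distance to be
    # at least `margin` smaller than the anchor-negative squared distance.
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings: anchor and positive share a class, the negative does not.
a = np.array([0.1, 0.9])
p = np.array([0.2, 0.8])
n = np.array([0.9, 0.1])
print(triplet_loss(a, p, n))  # 0.0 -> this triplet is already well separated
```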
