When to use cosine similarity over Euclidean distance
In NLP, people tend to use cosine similarity to measure document/text distances. I would like to hear what people think of the following two scenarios: which should I pick, cosine similarity or Euclidean distance?
Overview of the task: the goal is to compute context similarities of multi-word expressions (MWEs). For example, given the MWE put up, its context consists of the words to the left of put up and the words to its right within a text. Mathematically speaking, similarity in this task amounts to computing

sim(context_of_using_put_up, context_of_using_in_short)

Note that the context is a feature built on top of word embeddings; let's assume each word has an embedding of dimension 200.
Two scenarios of representing context_of_an_expression:

1. Concatenate the left and right context words, producing an embedding vector of dimension 200*4=800 when picking two words on each side. In other words, a feature vector [lc1, lc2, rc1, rc2] is built for the context, where lc = left_context and rc = right_context.
2. Take the mean of the summed left and right context word embeddings, producing a vector of 200 dimensions. In other words, a feature vector [mean(lc1+lc2+rc1+rc2)] is built for the context (a rough sketch of both representations follows below).
[Edited] For both scenarios, I think Euclidean distance is a better fit. Cosine similarity is known for handling scale/length effects thanks to its normalization, but I don't think there is much that needs normalizing here.
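To make the normalization point concrete, here is a small sketch (again with random stand-in context vectors, not real data) showing that rescaling one vector changes the Euclidean distance but leaves the cosine similarity untouched:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity normalizes both vectors, so only their direction matters.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_dist(a, b):
    # Euclidean distance is sensitive to the overall magnitude of the vectors.
    return float(np.linalg.norm(a - b))

rng = np.random.default_rng(1)
ctx_put_up = rng.standard_normal(200)    # stand-in context vector for "put up"
ctx_in_short = rng.standard_normal(200)  # stand-in context vector for "in short"

print(cosine_sim(ctx_put_up, ctx_in_short), euclidean_dist(ctx_put_up, ctx_in_short))

# Scaling one vector: cosine similarity is unchanged, Euclidean distance is not.
print(cosine_sim(ctx_put_up, 3 * ctx_in_short), euclidean_dist(ctx_put_up, 3 * ctx_in_short))
```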
Topics: nlp, similarity, clustering, machine-learning
Category: Data Science