Dot product for similarity in word2vec computation in NLP

In NLP, while computing word2vec we try to maximize $\log P(o \mid c)$, where $P(o \mid c)$ is the probability that $o$ is an outside (context) word, given that $c$ is the center word:

$$ P(o \mid c) = \frac{\exp(u_o \cdot v_c)}{\sum_{w=1}^{T} \exp(u_w \cdot v_c)} $$

where:

$u_o$ is the word vector for the outside word $o$,
$v_c$ is the word vector for the center word $c$,
$T$ is the number of words in the vocabulary.

The equation above is a softmax, and the dot product $u_o \cdot v_c$ acts as a score: the higher, the better. If the words $o$ and $c$ are close, their dot product should be high, but the dot product does not behave that way, as the following example shows:
Consider vectors
A=[1, 1, 1], B=[2, 2, 2], C=[100, 100, 100]
A.B = 1 * 2 + 1 * 2 + 1 * 2 = 6
A.C = 1 * 100 + 1 * 100 + 1 * 100 = 300

Vectors A and B are closer to each other than A and C, yet A.C > A.B.
So the dot product seems to reward magnitude rather than act as a similarity measure. Then why is it used in the softmax?
Please help me improve my understanding.
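
Here is a minimal NumPy sketch of the two quantities involved (the vectors and the three-word "vocabulary" are toy values, not trained embeddings): the dot products and cosine similarities for A, B, C, and a softmax over dot-product scores as in the skip-gram objective.

```python
import numpy as np

A = np.array([1.0, 1.0, 1.0])
B = np.array([2.0, 2.0, 2.0])
C = np.array([100.0, 100.0, 100.0])

def cosine(u, v):
    # dot product normalized by the two vector lengths
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(np.dot(A, B), np.dot(A, C))  # 6.0 300.0 -> dot product rewards magnitude
print(cosine(A, B), cosine(A, C))  # 1.0 1.0   -> cosine sees the same direction

# Toy skip-gram scoring: softmax over dot products of a center vector v_c
# with every "outside" vector u_w in a tiny, made-up vocabulary (rows of U).
v_c = A
U = np.stack([A, B, C])
scores = U @ v_c                       # u_w . v_c for every candidate word w
probs = np.exp(scores - scores.max())  # numerically stable softmax
probs /= probs.sum()
print(probs)  # P(o | c) per candidate; the huge-magnitude vector dominates
```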



In practice the vectors $u_o$ and $v_c$ end up more or less similar in magnitude, and those magnitudes are small. The dot product normalized by the lengths of the two vectors is the cosine similarity.
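
As a small illustration of this point (the vectors below are made-up toy values with roughly comparable norms, not real word2vec output): when the magnitudes are similar, ranking candidates by raw dot product and ranking them by cosine similarity give the same order.

```python
import numpy as np

# Toy "embeddings" with roughly comparable norms (an assumption for illustration).
v_c = np.array([0.3, 0.1, -0.2])
candidates = {
    "u_1": np.array([0.28, 0.12, -0.19]),
    "u_2": np.array([-0.10, 0.30, 0.25]),
    "u_3": np.array([0.05, -0.25, 0.22]),
}

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

by_dot = sorted(candidates, key=lambda w: np.dot(candidates[w], v_c), reverse=True)
by_cos = sorted(candidates, key=lambda w: cosine(candidates[w], v_c), reverse=True)
print(by_dot)  # ranking by raw dot product
print(by_cos)  # ranking by cosine similarity; same order here because the norms are comparable
```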


The dot product is the length of the projection of one vector $v$ onto $w$, scaled by $\lVert w\rVert$, which on its own is not a very good similarity measure. However:

$$ v \cdot w = \lVert v\rVert\,\lVert w\rVert \cos(\theta) $$

where $\theta$ is the angle between the two vectors, and:

$$ v \cdot w = \sum_{i=1}^n v_i \cdot w_i $$

with $\lVert v\rVert = \sqrt{v \cdot v}$.
With very simple math we obtain:

$$ \frac{v}{\lVert v\rVert} \cdot \frac{w}{\lVert w\rVert} = \cos(\theta) $$

Since

$$ -1 \le \cos(\theta) \le 1, $$

once the vectors are divided by their norms (normalized), we can say "the higher the dot product, the more similar".
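
A quick numeric check of this identity, using two toy vectors at a known 45° angle:

```python
import numpy as np

v = np.array([1.0, 0.0])
w = np.array([1.0, 1.0])

dot = np.dot(v, w)
cos_theta = dot / (np.linalg.norm(v) * np.linalg.norm(w))

print(dot)        # 1.0 -> ||v|| * ||w|| * cos(45 deg) = 1 * sqrt(2) * (sqrt(2)/2)
print(cos_theta)  # 0.7071... = cos(45 deg), always in [-1, 1]
```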


The reference here is the word2vec library (source code).

It does not use normalised vectors during training (although it indeed uses the cosine similarity metric for semantic comparisons on already trained vectors).

The reasons for using only the dot product instead of cosine similarity during training may include the following (a small sketch contrasting the two follows the list):

  1. The dot product is a variation of cosine similarity (the two differ only by the normalization factor).
  2. Vector length captures some semantic information, in the sense that length can correlate with frequency of occurrence in a given context, so using only the dot product captures this information as well (although for strict similarity testing the cosine metric is still used).
  3. When the vectors are normalised, the two metrics coincide.
  4. Efficiency (fewer computations).
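
Below is a rough sketch of the contrast described above, using made-up toy vectors rather than the actual word2vec C code: the raw dot product feeds the softmax-style score during training, while cosine similarity (the dot product of unit-length vectors) is used for nearest-neighbour queries on the trained vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "apple"]   # hypothetical three-word vocabulary
U = rng.normal(size=(3, 5))          # toy "outside" vectors, one row per word
v_c = rng.normal(size=5)             # toy center-word vector

# Training-time score: raw dot product, no normalization (cheaper, keeps length information).
train_scores = U @ v_c
probs = np.exp(train_scores - train_scores.max())
probs /= probs.sum()                 # softmax over un-normalized dot products

# Query-time similarity on the trained vectors: cosine, i.e. dot product of unit vectors.
U_unit = U / np.linalg.norm(U, axis=1, keepdims=True)
q = U_unit[vocab.index("king")]
cos_sims = U_unit @ q
print(dict(zip(vocab, cos_sims.round(3))))
```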

References:

  1. Why does the word2vec objective use the inner product (inside the softmax), but the nearest neighbors phase of it uses cosine? It seems like a mismatch.
  2. Should I normalize word2vec's word vectors before using them?
  3. Measuring Word Significance using Distributed Representations of Words
  4. word2vec Parameter Learning Explained
