Dot product for similarity in word2vec computation in NLP

In NLP, while computing word2vec we try to maximize $\log P(o \mid c)$, where $P(o \mid c)$ is the probability that $o$ is an outside (context) word, given that $c$ is the center word:

$$ P(o \mid c) = \frac{\exp(u_o \cdot v_c)}{\sum_{w=1}^{T} \exp(u_w \cdot v_c)} $$

where:

$u_o$ is the word vector for the outside word $o$,
$v_c$ is the word vector for the center word $c$,
$T$ is the number of words in the vocabulary.

The equation above is a softmax, and the dot product $u_o \cdot v_c$ acts as a score: the higher, the better. If the words $o$ and $c$ are close, their dot product should be high, but the dot product does not behave that way, as the following example shows:
Consider vectors
A=[1, 1, 1], B=[2, 2, 2], C=[100, 100, 100]
A.B = 1 * 2 + 1 * 2 + 1 * 2 = 6
A.C = 1 * 100 + 1 * 100 + 1 * 100 = 300

Vectors A and B are closer to each other than A and C, yet A.C > A.B.
So the dot product seems to reward magnitude rather than act as a similarity measure. Then why is it used in the softmax?
Please help me improve my understanding.
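
Here is a minimal NumPy sketch of the two quantities involved (the vectors and the three-word "vocabulary" are toy values, not trained embeddings): the dot products and cosine similarities for A, B, C, and a softmax over dot-product scores as in the skip-gram objective.

```python
import numpy as np

A = np.array([1.0, 1.0, 1.0])
B = np.array([2.0, 2.0, 2.0])
C = np.array([100.0, 100.0, 100.0])

def cosine(u, v):
    # dot product normalized by the two vector lengths
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(np.dot(A, B), np.dot(A, C))  # 6.0 300.0 -> dot product rewards magnitude
print(cosine(A, B), cosine(A, C))  # 1.0 1.0   -> cosine sees the same direction

# Toy skip-gram scoring: softmax over dot products of a center vector v_c
# with every "outside" vector u_w in a tiny, made-up vocabulary (rows of U).
v_c = A
U = np.stack([A, B, C])
scores = U @ v_c                       # u_w . v_c for every candidate word w
probs = np.exp(scores - scores.max())  # numerically stable softmax
probs /= probs.sum()
print(probs)  # P(o | c) per candidate; the huge-magnitude vector dominates
```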



In practice the vectors $u_o$ and $v_c$ end up more or less similar in magnitude, and those magnitudes are small. The dot product normalized by the lengths of the two vectors is the cosine similarity.
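
As a small illustration of this point (the vectors below are made-up toy values with roughly comparable norms, not real word2vec output): when the magnitudes are similar, ranking candidates by raw dot product and ranking them by cosine similarity give the same order.

```python
import numpy as np

# Toy "embeddings" with roughly comparable norms (an assumption for illustration).
v_c = np.array([0.3, 0.1, -0.2])
candidates = {
    "u_1": np.array([0.28, 0.12, -0.19]),
    "u_2": np.array([-0.10, 0.30, 0.25]),
    "u_3": np.array([0.05, -0.25, 0.22]),
}

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

by_dot = sorted(candidates, key=lambda w: np.dot(candidates[w], v_c), reverse=True)
by_cos = sorted(candidates, key=lambda w: cosine(candidates[w], v_c), reverse=True)
print(by_dot)  # ranking by raw dot product
print(by_cos)  # ranking by cosine similarity; same order here because the norms are comparable
```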


The dot product is the length of the projection of one vector $v$ onto $w$, scaled by $\lVert w\rVert$, which on its own is not a very good similarity measure. However:

$$ v \cdot w = \lVert v\rVert\,\lVert w\rVert \cos(\theta) $$

where $\theta$ is the angle between the two vectors, and:

$$ v \cdot w = \sum_{i=1}^n v_i \cdot w_i $$

with $\lVert v\rVert = \sqrt{v \cdot v}$.
With very simple math we obtain:

$$ \frac{v}{\lVert v\rVert} \cdot \frac{w}{\lVert w\rVert} = \cos(\theta) $$

Since

$$ -1 \le \cos(\theta) \le 1, $$

once the vectors are divided by their norms (normalized), we can say "the higher the dot product, the more similar".
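
A quick numeric check of this identity, using two toy vectors at a known 45° angle:

```python
import numpy as np

v = np.array([1.0, 0.0])
w = np.array([1.0, 1.0])

dot = np.dot(v, w)
cos_theta = dot / (np.linalg.norm(v) * np.linalg.norm(w))

print(dot)        # 1.0 -> ||v|| * ||w|| * cos(45 deg) = 1 * sqrt(2) * (sqrt(2)/2)
print(cos_theta)  # 0.7071... = cos(45 deg), always in [-1, 1]
```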


The reference here is the word2vec library (source code).

It does not use normalised vectors during training (although it indeed uses the cosine similarity metric for semantic comparisons on already trained vectors).

The reasons for using only the dot product instead of cosine similarity during training may include the following (a small sketch contrasting the two follows the list):

  1. The dot product is a variation of cosine similarity (the two differ only by the normalization factor).
  2. Vector length captures some semantic information, in the sense that length can correlate with frequency of occurrence in a given context, so using only the dot product captures this information as well (although for strict similarity testing the cosine metric is still used).
  3. When the vectors are normalised, the two metrics coincide.
  4. Efficiency (fewer computations).
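
Below is a rough sketch of the contrast described above, using made-up toy vectors rather than the actual word2vec C code: the raw dot product feeds the softmax-style score during training, while cosine similarity (the dot product of unit-length vectors) is used for nearest-neighbour queries on the trained vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "apple"]   # hypothetical three-word vocabulary
U = rng.normal(size=(3, 5))          # toy "outside" vectors, one row per word
v_c = rng.normal(size=5)             # toy center-word vector

# Training-time score: raw dot product, no normalization (cheaper, keeps length information).
train_scores = U @ v_c
probs = np.exp(train_scores - train_scores.max())
probs /= probs.sum()                 # softmax over un-normalized dot products

# Query-time similarity on the trained vectors: cosine, i.e. dot product of unit vectors.
U_unit = U / np.linalg.norm(U, axis=1, keepdims=True)
q = U_unit[vocab.index("king")]
cos_sims = U_unit @ q
print(dict(zip(vocab, cos_sims.round(3))))
```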

References:

  1. Why does the word2vec objective use the inner product (inside the softmax), but the nearest neighbors phase of it uses cosine? It seems like a mismatch.
  2. Should I normalize word2vec's word vectors before using them?
  3. Measuring Word Significance using Distributed Representations of Words
  4. word2vec Parameter Learning Explained
