Sum vs mean of word-embeddings for sentence similarity

So, say I have the following sentences:

[The dog says woof, a king leads the country, an apple is red]

I can embed each word as an N-dimensional vector (e.g. with Word2Vec) and represent each sentence as either the sum or the mean of its word vectors.

When we represent words as vectors, we can do something like vector(king) - vector(man) + vector(woman) ≈ vector(queen), which combines the meanings of the individual vectors into a new one, whereas taking the mean would place us somewhere in the middle of all the words.
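For concreteness, this analogy can be reproduced with a pretrained model, for example through gensim's downloader (a sketch only; `glove-wiki-gigaword-50` is just one of the available pretrained vector sets):

```python
import gensim.downloader as api

# Load a small set of pretrained vectors (any KeyedVectors model works).
kv = api.load("glove-wiki-gigaword-50")

# vector(king) - vector(man) + vector(woman) should land near vector(queen).
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```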

Is there any difference between using the sum or the mean when we want to compare the similarity of sentences, or does it simply depend on the data, the task, etc., which one performs better?



TL;DR

You are better off averaging the vectors.

Average vs sum

Averaging the word vectors is a well-known approach to obtaining sentence-level vectors; some people even call it "Sentence2Vec". It gives every sentence a vector in the same fixed-dimensional space regardless of sentence length, and once you have such vectors for multiple sentences you can compare them with cosine similarity.
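As a minimal sketch of this approach (assuming gensim and a pretrained model; the helper names are my own):

```python
import numpy as np
import gensim.downloader as api

kv = api.load("glove-wiki-gigaword-50")  # any pretrained KeyedVectors works

def sentence_vector(sentence, pool=np.mean):
    """Pool the vectors of all in-vocabulary words into one sentence vector."""
    words = [w for w in sentence.lower().split() if w in kv]
    return pool([kv[w] for w in words], axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sentences = ["the dog says woof", "a king leads the country", "an apple is red"]
vectors = [sentence_vector(s) for s in sentences]
print(cosine_similarity(vectors[0], vectors[1]))  # dog vs. king sentence
print(cosine_similarity(vectors[1], vectors[2]))  # king vs. apple sentence
```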

If you sum the values instead, the sentence vectors are not guaranteed to have comparable magnitudes: sentences with many words will have very large values, whereas sentences with few words will have small ones. It is hard to think of a use case where this is desirable, since the magnitude of the embedding then depends heavily on the length of the sentence, yet a long sentence can mean much the same as a short one.
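A quick numeric sketch of that magnitude problem, with random vectors standing in for word embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50
short = rng.normal(size=(3, dim))   # stand-in for a 3-word sentence
long_ = rng.normal(size=(15, dim))  # stand-in for a 15-word sentence

# The norm of the summed vector grows with sentence length, so any
# magnitude-sensitive comparison (e.g. Euclidean distance) ends up
# separating sentences by length rather than by meaning.
print(np.linalg.norm(short.sum(axis=0)), np.linalg.norm(long_.sum(axis=0)))
print(np.linalg.norm(short.mean(axis=0)), np.linalg.norm(long_.mean(axis=0)))
```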

Example

Sentence 1 = "I love dogs."
Sentence 2 = "My favourite animal in the whole wide world is man's best friend, dogs!"

Since you would want these two sentences to fall close together in the vector space, you should average the word embeddings, as sketched below.
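Reusing the `sentence_vector` helper from the sketch above, the contrast looks like this (out-of-vocabulary tokens such as `man's` are simply skipped):

```python
s1 = "i love dogs"
s2 = "my favourite animal in the whole wide world is man's best friend dogs"

# Mean pooling: both sentences live at a comparable scale,
# so their closeness reflects meaning rather than length.
print(cosine_similarity(sentence_vector(s1), sentence_vector(s2)))

# Sum pooling: the cosine similarity itself is unchanged (scaling does
# not change a vector's direction), but the magnitudes diverge, which
# hurts any magnitude-sensitive use such as Euclidean distance.
print(np.linalg.norm(sentence_vector(s1, pool=np.sum)))
print(np.linalg.norm(sentence_vector(s2, pool=np.sum)))
```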

Doc2Vec

Another approach is Doc2Vec, which doesn't average word embeddings but instead treats a full sentence (or paragraph) as a single entity and learns a single embedding for it.
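A minimal gensim sketch of this (toy corpus and hyperparameters; real use needs far more data):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = ["the dog says woof", "a king leads the country", "an apple is red"]
docs = [TaggedDocument(words=s.split(), tags=[i]) for i, s in enumerate(corpus)]

# Each document gets its own learned vector; no word-vector averaging involved.
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

# Infer a vector for an unseen sentence and find the closest training document.
vec = model.infer_vector("my dog barks".split())
print(model.dv.most_similar([vec], topn=1))
```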
