we should obtain the vector of a document by averaging the vectors of all the words
This is not necessarily the case, but it is certainly a convenient approach. The main advantage is that it avoids issues caused by documents of different lengths: by reducing every document to a single fixed-size vector, you can compare documents of any length. Concatenating the word vectors, or combining them in other ways, would typically force you to define a maximum length and pad shorter documents / trim longer ones. A final note: it is usually good practice to remove stop words from the documents, i.e. very frequent words which don't carry much semantic meaning. A minimal sketch of the averaging approach is shown below.
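Here the tiny embedding dictionary and the stop-word list are made up purely for illustration; in practice you would look words up in a pre-trained model (word2vec, GloVe, fastText, ...):

```python
import numpy as np

# Hypothetical 3-dimensional vectors, just to show the mechanics.
embeddings = {
    "star":   np.array([0.9, 0.1, 0.3]),
    "planet": np.array([0.8, 0.2, 0.1]),
    "movie":  np.array([0.1, 0.9, 0.4]),
}
stop_words = {"the", "a", "an", "is", "of"}

def doc_vector(tokens, embeddings, stop_words):
    """Average the vectors of known, non-stop-word tokens.

    Every document, whatever its length, ends up as one fixed-size vector,
    so no padding or trimming is needed.
    """
    vecs = [embeddings[t] for t in tokens
            if t not in stop_words and t in embeddings]
    if not vecs:
        # No known words: fall back to a zero vector of the same size.
        return np.zeros(next(iter(embeddings.values())).shape)
    return np.mean(vecs, axis=0)

print(doc_vector("the star is a planet".split(), embeddings, stop_words))
```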
May I know how the average vector of all the words in a document retains the information of those words?
This really depends on how you obtain the word vectors. If you just perform one-hot encoding, then averaging is effectively meaningless, since you would be generating real numbers out of binary representations. So I assume you're planning to use embeddings generated with word2vec (CBOW or skip-gram), GloVe or other deep learning models. In that case, to understand why averaging provides useful information, you first need to understand how these models turn words into vectors. The full explanation is beyond the scope of the question, so to keep it short: dense representations let you do simple math with words. When words are mapped to dense representations, similar words end up as similar (close in space) vectors.

Of course there will be differences depending on the chosen model. For example, skip-gram is generally better at capturing semantics than CBOW, which in contrast still encodes quite a lot of grammatical similarities. So if you compare two documents, one about celebrities and one about planets, both will probably contain the word "star"; a skip-gram model would probably distinguish the documents better, because "star" would have skewed values in the dimensions encoding both domains and the other words in each document would provide the information to boost the right dimension, whereas a more grammar-oriented model would have a harder time, since grammatically "star" and similar words are used in the same fashion. A toy illustration of that intuition is sketched below.
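This is only a hand-made example: the 2-dimensional vectors are invented, with one dimension standing for the "astronomy" domain and the other for the "celebrity" domain. Real embeddings have hundreds of dimensions and no such clean interpretation, but the shared word pulling both averages together while the remaining words push them apart is the same effect:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 2-d embeddings: dimension 0 ~ "astronomy", dimension 1 ~ "celebrity".
vecs = {
    "star":      np.array([0.7, 0.7]),   # ambiguous: present in both domains
    "telescope": np.array([0.9, 0.1]),
    "orbit":     np.array([0.8, 0.2]),
    "red":       np.array([0.1, 0.8]),
    "carpet":    np.array([0.2, 0.9]),
}

doc_astro = np.mean([vecs[w] for w in ["star", "telescope", "orbit"]], axis=0)
doc_celeb = np.mean([vecs[w] for w in ["star", "red", "carpet"]], axis=0)
query     = np.mean([vecs[w] for w in ["orbit", "telescope"]], axis=0)

# The astronomy document scores noticeably higher than the celebrity one,
# even though both contain the ambiguous word "star".
print(cos(query, doc_astro))   # ~0.98
print(cos(query, doc_celeb))   # ~0.54
```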
Would it be better if I retrieved words similar to the query and checked whether these words were in each document?
You can surely try that, but it would hardly perform better than using any dense representation. The reason is that words by themselves provide no information about the contextual relationship between them. For example, "apple" could appear in a shopping list, in a review of an Apple product, or it could even be used as slang for drugs. A small sketch of that limitation follows.
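Here the expansion set is hard-coded as a stand-in for whatever nearest-neighbour lookup you would use (e.g. a model's most-similar words); the documents are toy examples. The scoring is plain membership counting, which is exactly what loses the context:

```python
# Hypothetical expansion of the query "apple"; it mixes both senses because
# the word list alone cannot say which sense is meant.
expanded_query = {"apple", "mac", "iphone", "fruit"}

docs = {
    "shopping_list":  "milk eggs apple bananas fruit bread".split(),
    "product_review": "the new apple iphone has a great camera".split(),
}

for name, tokens in docs.items():
    overlap = expanded_query.intersection(tokens)
    print(name, len(overlap), sorted(overlap))

# Both documents end up with the same score (2), because bare word membership
# carries no information about which sense of "apple" each document uses.
```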