Does word2vec fail when the window size equals the sentence size?

Will word2vec fail if sentences contain only similar words, or in other words, if the window size is equal to the sentence size? I suppose this question boils down to whether word2vec considers words from other sentences as negative samples, or only words from the same sentence that fall outside the window.

Topic word2vec word-embeddings nlp

Category Data Science


Mechanically, you are fine using the sentence size as your window size. All context/target word combinations will be treated as positive cases, and things will seem to work.
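As a quick illustration, here is a minimal sketch using gensim (assuming gensim 4.x; the toy corpus is made up) where the window is simply set to the length of the longest sentence:

    from gensim.models import Word2Vec

    # Toy corpus (hypothetical); each sentence is a list of tokens.
    corpus = [
        ["summer", "days", "are", "long", "and", "warm"],
        ["winter", "days", "are", "short", "and", "cold"],
    ]

    # With the window as large as the longest sentence, every in-sentence
    # pair becomes a positive (context, target) example.
    max_len = max(len(s) for s in corpus)
    model = Word2Vec(
        sentences=corpus,
        vector_size=50,
        window=max_len,   # window size == sentence size
        min_count=1,
        sg=1,             # skip-gram
        negative=5,       # negative sampling
        seed=1,
    )
    print(model.wv.most_similar("summer", topn=3))

Training runs without complaint; the window parameter is just an upper bound on how far from the target word contexts are taken.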

Logically speaking, you also want some fraction of negative context/target word pairs to yield better embeddings. You can use the popular negative sampling technique. Roughly, it probabilistically picks words from your vocabulary to pair with the target, producing pairs that (with high probability) do not occur in your positive set, like grammatically unlikely combinations such as ("summer", "actively").
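For concreteness, here is a small sketch (pure NumPy; the vocabulary and counts are made up) of how word2vec draws those negatives: words are sampled from the unigram distribution raised to the 3/4 power, so frequent words are picked often but not overwhelmingly so:

    import numpy as np

    # Hypothetical unigram counts for a toy vocabulary.
    counts = {"the": 500, "summer": 50, "winter": 40, "actively": 5}
    words = list(counts)
    freqs = np.array([counts[w] for w in words], dtype=float)

    # word2vec's noise distribution: unigram frequency ** 0.75, renormalized.
    noise = freqs ** 0.75
    noise /= noise.sum()

    rng = np.random.default_rng(0)

    def sample_negatives(target, k=5):
        """Draw k negative words for a given target, skipping the target itself."""
        negatives = []
        while len(negatives) < k:
            w = rng.choice(words, p=noise)
            if w != target:
                negatives.append(w)
        return negatives

    print(sample_negatives("summer"))

Note the sampler draws from the whole vocabulary, not from a particular sentence, which is the crux of the original question.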

Also, since sentence lengths in natural text vary, using sentence length as a dynamic window size might learn uneven word representations. I am not sure whether this is good or bad, but maybe you could report back what you find?


Negative sampling aims to maximize the similarity of words that occur in the same context and to minimize the similarity of words that do not. However, instead of performing that minimization over every word in the vocabulary except the context words, it randomly selects a handful of words (the number typically depends on the training-set size) and uses them to optimize the objective.
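For reference, this is the skip-gram negative-sampling objective from Mikolov et al. (2013): for an observed (input, output) pair $(w_I, w_O)$ and $k$ negative words drawn from a noise distribution $P_n(w)$, the model maximizes

$$\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\!\left[\log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right],$$

where $\sigma$ is the sigmoid function and $P_n(w)$ is the unigram distribution raised to the $3/4$ power. Since $P_n(w)$ is defined over the whole vocabulary, negatives are drawn corpus-wide rather than from within the current sentence.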

If you make the context very large, you need a lot of training data and training time to get good results. So I think negative sampling does consider words that are out of context.


After some testing, it seems that setting the window size equal to the sentence length does not cause any issues for the model. It must be drawing its negative examples from other sentences.
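A quick way to reproduce that check (a sketch, reusing the hypothetical gensim setup from the first answer) is to train one model with a small window and one with the window set to the sentence length, and confirm that both train and produce sensible neighbours:

    from gensim.models import Word2Vec

    corpus = [
        ["summer", "days", "are", "long", "and", "warm"],
        ["winter", "days", "are", "short", "and", "cold"],
    ]
    max_len = max(len(s) for s in corpus)

    # One model with a small window, one with window == sentence length.
    small = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, negative=5, seed=1)
    wide = Word2Vec(corpus, vector_size=50, window=max_len, min_count=1, sg=1, negative=5, seed=1)

    # Both train without errors; compare nearest neighbours.
    print(small.wv.most_similar("summer", topn=3))
    print(wide.wv.most_similar("summer", topn=3))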
