Getting context-word pairs for a continuous bag of words model and other confusions

Suppose I have a corpus with documents:

corpus = [
    "The sky looks lovely today",
    "The fat cat hit the poor dog",
    "God created all men equal",
    "He wrestled the creature to the ground",
    "The king did not treat his subjects fairly",
]
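
I've preprocessed it roughly along these lines (just a sketch of what I mean by preprocessing; the stopword list and tokenizer here are stand-ins for what the article actually uses):

import re

# Tiny illustrative stopword list; in practice a fuller one (e.g. nltk's) would be used.
STOPWORDS = {"the", "a", "an", "in", "and", "to", "his", "did", "not", "all", "he"}

def preprocess(doc):
    """Lowercase, keep only letters, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [t for t in tokens if t not in STOPWORDS]

# preprocess("In the beginning God created heaven and earth")
# -> ['beginning', 'god', 'created', 'heaven', 'earth']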

From this preprocessed corpus I want to generate context-word pairs, following this article. The writer notes:

The preceding output should give you some more perspective on how X forms our context words, and that we are trying to predict the target center word Y based on this context. For example, if the original text was ‘in the beginning god created heaven and earth’, which after pre-processing and removal of stopwords became ‘beginning god created heaven earth’, then what we are trying to achieve is this: given [beginning, god, heaven, earth] as the context, predict the target center word, which is ‘created’ in this case.

I have two questions regarding this:

  1. What if the number of words in the document (sentence) is even? There is no center in an even-length sequence (what's the center element of 1, 2, 3, 4?), so what would my target word be in that case?

  2. What's so significant about choosing the center word as the target word? Why are we giving it special importance in the document?

  3. If I have it right (and I may easily not; please let me know how I'm wrong if that's the case), once a CBOW model is trained, any input word I give it should be mapped into the vector space of the first weight matrix, and the most similar words are the ones whose embeddings are spatially closest to the input word's embedding in that space. How is this accomplished if the target words are chosen so arbitrarily? Wouldn't every single word in the vocabulary need to be a target word?

Topic: context-vector, bag-of-words, word2vec, word-embeddings

Category: Data Science


Overall, think of the algorithm as working like this: (after subsampling, etc.) it takes each word in the corpus in turn as the target word and, for each one, defines the context words as (up to) $l$ words on either side of that target word, where $l$ is a fixed window size. Answering your questions from this perspective (a short sketch after the numbered points below makes the mechanics concrete):

  1. the target word is not chosen from the context window; rather, it is the other way round: the context words are defined relative to the target word. An even-length sentence therefore poses no problem, because the window is built around each word in turn, not around the middle of the sentence.

  2. it is not that being in the centre is important; rather, for each word (considered in turn as the target word), the context window effectively defines which words are (in some sense) influenced by that target word. The choice of $l$ words on either side can be varied, e.g. made larger, smaller, or asymmetric. Various research looks at this, but there isn't (as far as I know) any conclusively better approach or a theoretical basis for what is optimal.

  3. the target words aren't chosen "arbitrarily": they can be thought of as the central object, with context words defined relative to them. So (subject to subsampling, dropping rare words, etc.) every single word in the corpus, and hence in the vocabulary, is at some point considered a target word. If a word were never a target, its "input" embedding would not be updated by the algorithm and would not learn anything (i.e. it would just stay at its random initialisation).
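
To make the mechanics concrete, here is a minimal sketch (the function name and window size are illustrative, not taken from any particular implementation) of how each token is taken in turn as the target, with up to $l$ tokens on either side forming its context:

def context_target_pairs(tokens, l=2):
    """Take each token in turn as the target; its context is the
    (up to) l tokens on each side of it."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - l):i] + tokens[i + 1:i + 1 + l]
        pairs.append((context, target))
    return pairs

# Using the sentence from the quoted article:
for context, target in context_target_pairs(
        ["beginning", "god", "created", "heaven", "earth"], l=2):
    print(context, "->", target)
# ['god', 'created'] -> beginning
# ['beginning', 'created', 'heaven'] -> god
# ['beginning', 'god', 'heaven', 'earth'] -> created
# ['god', 'created', 'earth'] -> heaven
# ['created', 'heaven'] -> earth

Note that every token appears as a target exactly once, which is what lets every word's embedding get updated during training.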

Overall, the word2vec (and GloVe) algorithms effectively factorise co-occurrence statistics (specifically, pointwise mutual information), so the embedding of a word can be thought of as a compressed (dimensionality-reduced) encoding of the distribution of words that occur around it, learned as the algorithm trawls over the corpus. This paper explains how that works. (There are follow-up works, if of interest.)
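
As a rough illustration of that view (this is not the word2vec training procedure itself, which learns the factorisation implicitly by gradient descent over (context, target) pairs like the ones above; the function below is just an assumed sketch), one can count co-occurrences within a window, form the positive PMI matrix, and reduce it with a truncated SVD, after which each row already behaves like a crude word embedding:

import numpy as np

def ppmi_embeddings(sentences, window=2, dim=2):
    """Count co-occurrences within a +/- window, form the positive PMI
    matrix, and reduce it with a truncated SVD (one row per word)."""
    vocab = sorted({w for sent in sentences for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for sent in sentences:
        for i, w in enumerate(sent):
            for c in sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]:
                counts[idx[w], idx[c]] += 1
    total = counts.sum()
    p_word = counts.sum(axis=1, keepdims=True) / total      # P(word)
    p_ctx = counts.sum(axis=0, keepdims=True) / total       # P(context)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts / total) / (p_word * p_ctx))
    ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)  # positive PMI
    u, s, _ = np.linalg.svd(ppmi)
    return vocab, u[:, :dim] * s[:dim]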
