Getting context-word pairs for a continuous bag of words model and other confusions
Suppose I have a corpus with documents:
corpus = [
    "The sky looks lovely today",
    "The fat cat hit the poor dog",
    "God created all men equal",
    "He wrestled the creature to the ground",
    "The king did not treat his subjects fairly",
]
I've preprocessed this corpus and want to generate context-word pairs from it, following this article. The writer notes:
The preceding output should give you some more perspective of how X forms our context words and how we are trying to predict the target center word Y based on this context. For example, if the original text was ‘in the beginning god created heaven and earth’, which after pre-processing and removal of stopwords became ‘beginning god created heaven earth’, then what we are trying to achieve is this: given [beginning, god, heaven, earth] as the context, predict the target center word, which is ‘created’ in this case.
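For reference, here is my rough attempt at generating these pairs with a sliding window, which is how I understand the article's approach. The window_size value and the function name are my own, not taken from the article:

def generate_context_target_pairs(tokens, window_size=2):
    """Return (context_words, target_word) for every position in a tokenized document."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context = up to window_size words on each side of the target
        start = max(0, i - window_size)
        context = tokens[start:i] + tokens[i + 1:i + 1 + window_size]
        pairs.append((context, target))
    return pairs

tokens = "beginning god created heaven earth".split()
for context, target in generate_context_target_pairs(tokens):
    print(context, "->", target)
# ['god', 'created'] -> beginning
# ['beginning', 'created', 'heaven'] -> god
# ['beginning', 'god', 'heaven', 'earth'] -> created
# ['god', 'created', 'earth'] -> heaven
# ['created', 'heaven'] -> earth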
I have two questions regarding this:
What if the number of words in the document (sentence) is even? There is no center in an even-length sequence (what's the center number in 1, 2, 3, 4?), so what would my target word be in that case?
What's so significant about choosing the center word as the target word? Why are we giving it special importance in the document?
If I have this right (and I may well not; please tell me how I'm wrong if that's the case), once a CBOW model is trained, any input word should be mapped into the vector space of the first weights matrix, and the most similar words are the ones whose embeddings lie closest to the input word's embedding in that space. How is this accomplished if the target words are chosen so arbitrarily? Wouldn't every single word in the vocabulary need to be a target word?
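To make my mental model concrete, this is roughly how I picture the lookup step once the weights are trained. The matrix W and the word/id mappings below are made-up placeholders, not the output of an actual trained model:

import numpy as np

def most_similar(word, W, word_to_id, id_to_word, topn=3):
    """Find the words whose embedding rows in W are closest (by cosine similarity) to the query word's row."""
    v = W[word_to_id[word]]
    # Cosine similarity between the query vector and every row of W
    sims = W @ v / (np.linalg.norm(W, axis=1) * np.linalg.norm(v) + 1e-9)
    best = np.argsort(-sims)
    return [(id_to_word[i], float(sims[i])) for i in best if i != word_to_id[word]][:topn]

# Tiny made-up example: 5 words with random 3-dimensional "embeddings"
rng = np.random.default_rng(0)
vocab = ["sky", "lovely", "cat", "dog", "king"]
word_to_id = {w: i for i, w in enumerate(vocab)}
id_to_word = {i: w for w, i in word_to_id.items()}
W = rng.normal(size=(len(vocab), 3))
print(most_similar("cat", W, word_to_id, id_to_word))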
Tags: context-vector, bag-of-words, word2vec, word-embeddings
Category: Data Science