Implemented word2vec CBOW using just one vector space W of shape (V, D), where V is the vocabulary size and D is the number of features in a word-vector.
In short, it did not work well.
Let (word, context) be a pair, and create a BoW (Bag of Words) from the context. For instance, the pair for the sentence *I love dogs that meow* is (dogs, (I, love, that, meow)) when the context length is 4.
Bold italic ***word*** indicates the word element of a (word, context) pair.
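As a rough sketch, the pair extraction could look like the following; the sentence, the symmetric window, and the full-context filter are my illustrative choices, not the original code:

```python
# Build (word, context) pairs from a sentence with a fixed context length.
sentence = ["I", "love", "dogs", "that", "meow"]
context_length = 4
half = context_length // 2

pairs = []
for i, word in enumerate(sentence):
    # Take up to half the context on each side of the word.
    context = sentence[max(0, i - half):i] + sentence[i + 1:i + 1 + half]
    if len(context) == context_length:  # keep only full-size contexts
        pairs.append((word, context))

print(pairs)  # [('dogs', ['I', 'love', 'that', 'meow'])]
```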
The training steps below feed N (word, context) pairs as a batch.
Positive score
Calculate a BoW from the word-vectors extracted from W for the context, and take its dot product with the word-vector of the word as the positive score.
E-1. Extract the word-vector We = W[index_of_word], where index_of_word is the index of the word in W.
E-2. Extract the context vectors Wc = W[indices_of_context] and create the BoW Bc = sum(Wc, axis=0).
E-3. Calculate the score of We and Bc as Ye = dot(Bc, We).
E-4. Calculate the loss Le = -log(sigmoid(Ye)).
E-5. Back-propagate dLe/dYe to We as dLe/dWe = dLe/dYe * dYe/dWe = dLe/dYe * Bc.
E-6. Back-propagate dLe/dYe to Wc as dLe/dWc = dLe/dYe * dYe/dWc = dLe/dYe * We.
In the actual automatic differentiation, the derivative of sum needs to be taken into account when applying the * operation.
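Here is a minimal NumPy sketch of E-1 through E-6 for a single pair; V, D, and the index values are illustrative assumptions, and the gradients are written out by hand rather than produced by an autodiff framework:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 10, 5
W = rng.normal(size=(V, D))

index_of_word = 2                    # index of the word in W (assumed)
indices_of_context = [0, 1, 3, 4]    # indices of the context words (assumed)

# E-1, E-2: extract vectors and build the BoW.
We = W[index_of_word]                # (D,)
Wc = W[indices_of_context]           # (C, D)
Bc = np.sum(Wc, axis=0)              # (D,)

# E-3, E-4: positive score and its loss.
Ye = np.dot(Bc, We)
Le = -np.log(1 / (1 + np.exp(-Ye)))  # -log(sigmoid(Ye))

# E-5, E-6: back-propagation, using dLe/dYe = sigmoid(Ye) - 1.
dYe = 1 / (1 + np.exp(-Ye)) - 1
dWe = dYe * Bc                            # gradient w.r.t. We
# The derivative of sum broadcasts dYe * We back to every context row.
dWc = np.broadcast_to(dYe * We, Wc.shape)  # gradient w.r.t. Wc
```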
Negative score and its loss value
Take SL negative-sample words and calculate a negative score for each sample by taking its dot product with the BoW. The result is SL negative scores.
S-1. Take SL negative-sample words, excluding the words in (word, context).
S-2. Extract the word-vectors for the negative samples Ws = W[indices_of_samples].
S-3. Calculate the negative scores from Ws and Bc as Ys = einsum("nd,nsd->ns", Bc, Ws).
* n is for the batch size N, d is for the word-vector size D, and s is for the negative sample size SL.
S-4. Calculate the loss Ls = -log(1 - sigmoid(Ys)).
S-5. Back-propagate dLs/dYs to Ws as dLs/dWs = dLs/dYs * dYs/dWs = dLs/dYs * Bc.
S-6. Back-propagate dLs/dYs to Wc as dLs/dWc = dLs/dYs * dYs/dWc = dLs/dYs * Ws.
In the actual automatic differentiation, the derivative of einsum needs to be taken into account when applying the * operation.
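A batched NumPy sketch of S-2 through S-6 might look as below; N, SL, V, D and the sampled indices are illustrative assumptions, and the two trailing einsum calls spell out the hand-derived gradients of the scoring einsum:

```python
import numpy as np

rng = np.random.default_rng(0)
N, SL, V, D = 3, 2, 10, 5
W = rng.normal(size=(V, D))
Bc = rng.normal(size=(N, D))                    # one BoW per batch row
indices_of_samples = rng.integers(0, V, size=(N, SL))

# S-2, S-3: extract sample vectors and score them against the BoW.
Ws = W[indices_of_samples]                      # (N, SL, D)
Ys = np.einsum("nd,nsd->ns", Bc, Ws)            # (N, SL)

# S-4: loss; -log(1 - sigmoid(Ys)) == -log(sigmoid(-Ys)).
Ls = -np.log(1 - 1 / (1 + np.exp(-Ys)))

# S-5, S-6: back-propagation, using dLs/dYs = sigmoid(Ys).
dYs = 1 / (1 + np.exp(-Ys))                     # (N, SL)
dWs = np.einsum("ns,nd->nsd", dYs, Bc)          # gradient w.r.t. Ws
dBc = np.einsum("ns,nsd->nd", dYs, Ws)          # gradient w.r.t. Bc
# dBc is then distributed to every context row via the sum derivative.
```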
Problem
I think the cause of the single W not working is that the same W gets updated by multiple back-propagations in a batch at the same time:
- positive-score back-propagations: dLe/dWe and dLe/dWc
- negative-score back-propagations: dLs/dWs and dLs/dWc

In a batch that has multiple (word, context) pairs, one pair may have X as the word for a positive score, while other pairs in the same batch use X as a negative sample. Hence, during gradient descent on a batch, word X would contribute to a positive score as well as to negative scores, and back-propagation would update the same word-vector W[X] both positively and negatively at the same time.
Suppose that in the 1st pair in a batch, the word dogs is the word of the (word, context) pair and is used for a positive score; W[index_for_dogs] gets updated by dLe/dWe.
Then for the 2nd pair in the batch, dogs is drawn as a negative sample, and W[index_for_dogs] also gets updated by dLs/dWs. It would be possible to exclude all the words of the batch's (word, context) pairs from the negative samples, but that would narrow or skew the set of words available for negative sampling.
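The collision is easy to demonstrate; in this sketch the indices are made up for illustration, and np.intersect1d finds the rows of W that would receive both kinds of gradient in one batch:

```python
import numpy as np

word_indices = np.array([7, 3, 5])                       # the word of each pair
indices_of_samples = np.array([[1, 3], [7, 2], [4, 7]])  # negatives per pair

# Rows of W that receive both a positive and a negative gradient.
collisions = np.intersect1d(word_indices, indices_of_samples)
print(collisions)  # [3 7]: e.g. W[7] is rewarded in row 0, penalized in rows 1 and 2
```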
Also, the same word-vector in W can be the word of one pair, part of the context of another pair, and a negative sample of yet another pair.
I believe this mixture could cause confusion:
- rewarding (positive-score back-prop) and penalizing (negative-score back-prop) the same W
- using the same word as word, context, and negative sample.
Hence, it would require separating W into different vector spaces so that each has a clear role, e.g. one vector space Wc dedicated to the context.
It may also be possible to have one vector space for the word and another for the negative samples, since using a single space for both would again cause positive and negative back-propagation at the same time. I think this could be a reason why the vector space used for the negative samples is not used as the resulting word2vec model.
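A minimal sketch of that separation, assuming two spaces W_ctx (contexts only) and W_out (words and negative samples); the names and indices are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 10, 5
W_ctx = rng.normal(size=(V, D))   # receives only context back-propagation
W_out = rng.normal(size=(V, D))   # receives only word/negative back-propagation

indices_of_context = [0, 1, 3, 4]
index_of_word = 2
indices_of_samples = [5, 8]

Bc = np.sum(W_ctx[indices_of_context], axis=0)
Ye = np.dot(Bc, W_out[index_of_word])   # positive score
Ys = W_out[indices_of_samples] @ Bc     # negative scores, shape (SL,)
# After training, one space (here W_ctx) would be kept as the final model.
```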