Why is Word2vec regarded as a neural embedding?

In the skip-gram model, the probability that a word $w_o$ belongs to the set of context words $\{w_o^{(i)}\}$, $i = 1,\dots,m$, where $m$ is the size of the context window around the central word $w_c$, is given by:

$$p(w_o \mid w_c) = \frac{\exp(\vec{u}_o \cdot \vec{v}_c)}{\sum_{i\in V}\exp(\vec{u}_i \cdot \vec{v}_c)}$$

where $V$ is the vocabulary, $\vec{u}_i$ is the embedding of word $i$ when it acts as a context word, and $\vec{v}_i$ is its embedding when it acts as the central word.
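For concreteness, here is a minimal NumPy sketch of that softmax; the matrices `U` and `V` and their sizes are made-up assumptions for illustration, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 5          # toy sizes, purely illustrative

U = rng.normal(size=(vocab_size, dim))   # u_i: embeddings of words used as context
V = rng.normal(size=(vocab_size, dim))   # v_i: embeddings of words used as center

def softmax(scores):
    scores = scores - scores.max()       # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def p_context_given_center(o, c):
    """p(w_o | w_c): softmax over the dot products u_i . v_c, evaluated at i = o."""
    return softmax(U @ V[c])[o]

print(p_context_given_center(o=3, c=7))
```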

But this model defines a linear transformation of the input followed by a softmax, just like multinomial logistic regression:

$$ p(y = c \mid \vec{x}; \vec\theta) = \frac{\exp(\vec{w}_c \cdot \vec{x})}{\sum_{i \in N}\exp(\vec{w}_i \cdot \vec{x})} $$
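For comparison, the same softmax applied as in multinomial logistic regression, reusing the `softmax` helper from the sketch above (the weight matrix `W`, the number of classes, and the feature vector `x` are again made up):

```python
num_classes = 4
W = rng.normal(size=(num_classes, dim))  # one weight vector w_c per class
x = rng.normal(size=dim)                 # input feature vector

p_y = softmax(W @ x)                     # p(y = c | x; theta) for every class c
print(p_y, p_y.sum())                    # class probabilities, summing to 1
```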

I understand that the real trick is in how you formulate the loss: in the skip-gram model, instead of taking the product of probabilities over every class (every word), you take it only over a subset of words (the context). However, the transformations are linear, not non-linear as I would expect if this were a neural network model.
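Continuing the NumPy sketch above, the per-center-word loss is then the negative log likelihood summed over just the words in the context window (the word indices here are arbitrary):

```python
def skipgram_loss(center, context_words):
    """Negative log likelihood of the observed context words given the center word."""
    return -sum(np.log(p_context_given_center(o, center)) for o in context_words)

# Center word 7 with m = 2 context words on each side (made-up word indices).
print(skipgram_loss(center=7, context_words=[2, 5, 3, 9]))
```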

I know that a DNN can contain linear transformations (indeed, linear layers composed with non-linear ones, composed with linear ones, and so on), but I thought the main point of using the term DNN, and of drawing the usual network diagram, was that you have non-linear transformations: activation functions which, if chosen carefully, map into ranges like $(-1, 1)$ or $(0, 1)$, and which are what give rise to the neural-network graphical representation in the first place.

However, I fail to see how this applies to word2vec and the skip-gram model. Could anyone shed some light on this?

Topic multilabel-classification word2vec word-embeddings logistic-regression neural-network

Category Data Science


I think you are confused - the reason Word2Vec is regarded as 'neural' is not its loss function, but the fact that it uses a neural network to estimate the word embeddings ($\vec{u}$ and $\vec{v}$); see section 2 of the original paper.

For example, I can have an ML problem with a loss function $L$ to minimize (on some data $X$ and target $y$). If I use a simple linear model to do the job, it is a linear model; if I use, say, a CNN, I would call it a 'neural' model. It does not matter whether the loss $L$ is linear or not.
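As a rough sketch of that point (toy sizes, PyTorch, and a full softmax; the actual word2vec implementation uses tricks such as negative sampling or a hierarchical softmax instead), skip-gram can be written as a two-layer network: a projection from the center word to its embedding, followed by a linear output layer whose softmax scores every vocabulary word. The embeddings are simply the weights this network learns by backpropagation:

```python
import torch
import torch.nn as nn

vocab_size, dim = 10, 5  # toy sizes

class SkipGram(nn.Module):
    def __init__(self):
        super().__init__()
        # Projection layer: maps a center word id to its embedding v_c (no activation).
        self.center = nn.Embedding(vocab_size, dim)
        # Output layer: its rows play the role of the context embeddings u_i.
        self.context = nn.Linear(dim, vocab_size, bias=False)

    def forward(self, center_ids):
        # Logits u_i . v_c for every word i in the vocabulary.
        return self.context(self.center(center_ids))

model = SkipGram()
loss_fn = nn.CrossEntropyLoss()          # softmax + negative log likelihood
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# One toy (center, context) pair; real training loops over all pairs in a corpus.
center = torch.tensor([7])
context = torch.tensor([3])

opt.zero_grad()
loss = loss_fn(model(center), context)
loss.backward()
opt.step()

embeddings = model.center.weight.detach()  # the learned word vectors v
```

Note that the projection layer here has no activation function, which is exactly the linearity the question points out; the model is still described as neural because it has the input-projection-output architecture of a (shallow) network and is trained by backpropagation, and the word vectors are just its learned weights.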
