Why is Word2vec regarded as a neural embedding?
In the skip-gram model, the probability that a word $w_o$ belongs to the set of context words $\{w_o^{(i)}\}$, $i = 1, \dots, m$, where $m$ is the size of the context window around the central word $w_c$, is given by:
$$p(w_o \mid w_c) = \frac{\exp(\vec{u}_o \cdot \vec{v}_c)}{\sum_{i\in V}\exp(\vec{u}_i \cdot \vec{v}_c)} $$
where $V$ is the vocabulary of the training set, $\vec{u}_i$ is the word embedding of word $i$ in its role as a context word, and $\vec{v}_i$ is its word embedding in its role as the central word.
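To make the setup concrete, here is a minimal NumPy sketch of that conditional probability. The matrices `U` and `Vmat` and the vocabulary size are toy, made-up placeholders, not taken from any actual word2vec implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4                    # toy vocabulary and embedding sizes

U = rng.normal(size=(vocab_size, dim))     # "output" embeddings u_i (context role)
Vmat = rng.normal(size=(vocab_size, dim))  # "input" embeddings v_i (central role)

def p_context_given_center(o, c):
    """p(w_o | w_c): softmax over the whole vocabulary of u_i . v_c, read at index o."""
    scores = U @ Vmat[c]                   # one dot product per vocabulary word
    scores -= scores.max()                 # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

print(p_context_given_center(o=3, c=7))    # probability that word 3 appears in the context of word 7
```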
But this type of model defines a linear transformation of the input, much like the one found in multinomial logistic regression:
$$ p(y = c \mid \vec{x};\vec\theta) = \frac{\exp(\vec{w}_c \cdot \vec{x})}{\sum_{i=1}^{N}\exp(\vec{w}_i \cdot \vec{x})}$$
where $N$ is the number of classes and $\vec{w}_i$ is the weight vector of class $i$.
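For comparison, here is the same kind of toy sketch for the multinomial logistic regression above; the weight matrix `W` and feature vector `x` are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_features = 10, 4

W = rng.normal(size=(n_classes, n_features))   # one weight vector w_c per class
x = rng.normal(size=n_features)                # an input feature vector

def softmax(scores):
    scores = scores - scores.max()             # for numerical stability
    e = np.exp(scores)
    return e / e.sum()

# Structurally the same softmax-of-dot-products as the skip-gram sketch above,
# with W playing the role of the output embeddings and x the role of v_c.
print(softmax(W @ x))                          # p(y = c | x) for every class c
```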
I understand that the real trick is in how the loss function is formulated: in the skip-gram model, instead of multiplying together the probabilities of every class (every word), you only multiply the probabilities of a subset of words (the context). However, the transformations are linear rather than non-linear, which is what I would expect if this were a neural network model.
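To pin down what I mean by the loss, here is a sketch of that objective for a single central word and its $m$ context words (again with toy, made-up matrices): the negative log-likelihood is summed only over the words in the context window, even though each term's softmax denominator still runs over the whole vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
U = rng.normal(size=(vocab_size, dim))     # "output" embeddings u_i
Vmat = rng.normal(size=(vocab_size, dim))  # "input" embeddings v_i

def log_softmax(scores):
    scores = scores - scores.max()
    return scores - np.log(np.exp(scores).sum())

def skipgram_loss(center, context):
    """-sum_i log p(w_o^{(i)} | w_c): the sum runs only over the m context words."""
    log_probs = log_softmax(U @ Vmat[center])
    return -sum(log_probs[o] for o in context)

print(skipgram_loss(center=7, context=[3, 5, 1, 8]))  # window of m = 4 context words
```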
I know that a DNN can contain some linear transformations (in fact linear layers composed with nonlinear ones, composed with linear ones, and so on), but I thought the main point of using the term DNN, and of drawing the usual network diagram, was that you have some non-linear transformations: carefully chosen functions with range $(-1, 1)$ or $(0, 1)$ that act as activation functions, which is what gives rise to the neural-network graphical representation in the first place.
However, I fail to grasp this for word2vec and the skip-gram model. Could anyone shed some light on this?