Treating Word Embeddings as Multivariate Gaussian Random Variables

I want to specify a probabilistic clustering model (such as a mixture model or LDA) over words, and instead of the traditional method of representing words as indicator (one-hot) vectors, I want to use the corresponding word embeddings extracted from word2vec, GloVe, etc. as input.

While feeding the word embeddings from my word2vec model into my GMM, I observed that each feature was approximately normally distributed, i.e., features 1..100 were normally distributed across my word dictionary. Can anyone explain why that is? In my understanding, word embeddings are model weights learned by a shallow neural network. Are they always supposed to be normally distributed?

Furthermore, when using doc2vec embeddings, my features were uniformly distributed. This contradicts the earlier observation that word2vec embeddings are normally distributed. Can anyone explain this discrepancy?
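A quick way to check such per-dimension distributions is a normality test on each embedding column. This is a minimal sketch: the array `emb` below is a random stand-in for a real (vocab_size, dim) embedding matrix extracted from word2vec or doc2vec.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 100))  # stand-in for a (vocab, dim) embedding matrix

# Shapiro-Wilk test per embedding dimension: small p-values reject normality.
pvals = np.array([stats.shapiro(emb[:, j]).pvalue for j in range(emb.shape[1])])
print((pvals > 0.05).mean())  # fraction of dimensions consistent with normality
```

For truly normal columns this fraction will be close to 1; for uniformly distributed doc2vec features it would be close to 0.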

Topic: doc2vec, gmm, word2vec, nlp, machine-learning

Category: Data Science


One way to approach this is to separate the two steps.

  1. Learn an embedding space for words and/or documents. Learning an embedding makes no assumptions about the distributional form of the data. The data (e.g., words or documents) could be uniform, normal, or follow another distribution. The result is an embedding space.

  2. Cluster entities in the embedding space with a Gaussian Mixture Model (GMM). A GMM assumes the data is best described by a mixture of multivariate normal distributions, since it only estimates the parameters of such components. The further the underlying features are from a mixture of multivariate normals, the worse a GMM will fit.
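The two steps above can be sketched as follows with scikit-learn. This is a minimal illustration only: the random matrix `X` stands in for a pre-trained embedding matrix whose rows are words and whose columns are embedding dimensions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in for step 1's output: a (vocab_size, dim) embedding matrix
# extracted from word2vec, GloVe, etc.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))

# Step 2: fit a GMM over the embedding space. Diagonal covariances are a
# common choice when the number of words is small relative to the dimension.
gmm = GaussianMixture(n_components=5, covariance_type="diag", random_state=0)
labels = gmm.fit_predict(X)
print(labels.shape)  # one cluster label per word
```

With a real embedding matrix, each word simply inherits the label of the mixture component most responsible for it.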

It is up to you as the modeler to decide whether the features are normally distributed enough for a GMM to be a useful model. If it is not, choose a non-parametric clustering algorithm (e.g., mean shift, which is based on kernel density estimation) that makes fewer assumptions than a GMM.
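As a sketch of the non-parametric alternative, mean shift seeks modes of a kernel density estimate, so it makes no Gaussian assumption and discovers the number of clusters itself. The 2-D toy data below is illustrative only; in practice `X` would be the embedding matrix.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Two well-separated blobs as toy "embeddings".
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(100, 2)),
               rng.normal(3, 0.3, size=(100, 2))])

# Mean shift climbs the kernel density estimate to its modes; the
# bandwidth controls the kernel width and hence the cluster granularity.
ms = MeanShift(bandwidth=1.0)
labels = ms.fit_predict(X)
print(len(set(labels)))  # number of discovered clusters
```

The trade-off is cost: mean shift scales poorly to large vocabularies and high dimensions compared to a GMM.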
