Treating Word Embeddings as Multivariate Gaussian Random Variables

I want to specify a probabilistic clustering model (such as a mixture model or LDA) over words, and instead of the traditional method of representing words as indicator (one-hot) vectors, I want to use the corresponding word embeddings extracted from word2vec, GloVe, etc. as input.

While feeding the word embeddings from my word2vec model into my GMM, I observed that each feature was approximately normally distributed, i.e., features 1..100 were normally distributed across my word dictionary. Can anyone explain why that is? In my understanding, word embeddings are model weights learned by a shallow neural network. Are they always supposed to be normally distributed?

Furthermore, when using doc2vec embeddings, my features were uniformly distributed. This contradicts the earlier observation that word2vec embeddings are normally distributed. Can anyone explain this discrepancy?
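A quick way to check such per-dimension distributions is a normality test on each embedding column. This is a minimal sketch: the array `emb` below is a random stand-in for a real (vocab_size, dim) embedding matrix extracted from word2vec or doc2vec.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 100))  # stand-in for a (vocab, dim) embedding matrix

# Shapiro-Wilk test per embedding dimension: small p-values reject normality.
pvals = np.array([stats.shapiro(emb[:, j]).pvalue for j in range(emb.shape[1])])
print((pvals > 0.05).mean())  # fraction of dimensions consistent with normality
```

For truly normal columns this fraction will be close to 1; for uniformly distributed doc2vec features it would be close to 0.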

Topic: doc2vec, gmm, word2vec, nlp, machine-learning

Category: Data Science


One way to approach this is to separate the two steps.

  1. Learn an embedding space for words and/or documents. Learning an embedding makes no assumptions about the distributional form of the data. The data (e.g., words or documents) could be uniform, normal, or follow another distribution. The result is an embedding space.

  2. Cluster entities in the embedding space with a Gaussian Mixture Model (GMM). A GMM assumes the data is best described by a mixture of multivariate normal distributions, since it only estimates the parameters of such components. The further the underlying features are from a mixture of multivariate normals, the worse a GMM will fit.
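The two steps above can be sketched as follows with scikit-learn. This is a minimal illustration only: the random matrix `X` stands in for a pre-trained embedding matrix whose rows are words and whose columns are embedding dimensions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in for step 1's output: a (vocab_size, dim) embedding matrix
# extracted from word2vec, GloVe, etc.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))

# Step 2: fit a GMM over the embedding space. Diagonal covariances are a
# common choice when the number of words is small relative to the dimension.
gmm = GaussianMixture(n_components=5, covariance_type="diag", random_state=0)
labels = gmm.fit_predict(X)
print(labels.shape)  # one cluster label per word
```

With a real embedding matrix, each word simply inherits the label of the mixture component most responsible for it.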

It is up to you as the modeler to decide whether the features are normally distributed enough for a GMM to be a useful model. If it is not, choose a non-parametric clustering algorithm (e.g., mean shift, which is based on kernel density estimation) that makes fewer assumptions than a GMM.
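As a sketch of the non-parametric alternative, mean shift seeks modes of a kernel density estimate, so it makes no Gaussian assumption and discovers the number of clusters itself. The 2-D toy data below is illustrative only; in practice `X` would be the embedding matrix.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Two well-separated blobs as toy "embeddings".
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(100, 2)),
               rng.normal(3, 0.3, size=(100, 2))])

# Mean shift climbs the kernel density estimate to its modes; the
# bandwidth controls the kernel width and hence the cluster granularity.
ms = MeanShift(bandwidth=1.0)
labels = ms.fit_predict(X)
print(len(set(labels)))  # number of discovered clusters
```

The trade-off is cost: mean shift scales poorly to large vocabularies and high dimensions compared to a GMM.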
