How to examine if a Doc2Vec model is sufficiently trained?

I started experimenting with gensim's Doc2Vec for sentiment analysis. For training the embedding itself, I have seen examples that use a reduced learning rate with a few tens or even a few hundred epochs. However, there does not seem to be a straightforward way to use early stopping to prevent overfitting, and it is not yet clear to me how to access the loss value for each epoch to detect overfitting. What is the proper way to check whether a word2vec or doc2vec model is itself sufficiently good? Thank you!

Topic doc2vec gensim word2vec word-embeddings

Category Data Science


One way to test an embedding space is to use word analogies as unit tests. A properly trained embedding space should successfully complete the analogy "Man is to king as woman is to _____" with "queen".

Google has released a collection of 19,000+ word analogies to evaluate word embedding models.
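To make the analogy test concrete, here is a minimal sketch of what such a check computes, using the vector-offset method on toy vectors. The vectors, the word list, and the helper names (`solve_analogy`, `analogy_accuracy`) are all invented for illustration; they do not come from a trained model. With a real gensim model you would more likely call `model.wv.evaluate_word_analogies(...)` on the analogy file (gensim ships a copy named `questions-words.txt` in its test data), but treat the exact setup as something to check against the gensim documentation.

```python
import math

# Toy 2-D vectors, invented purely for illustration (not from a trained model).
# Axis 0 loosely encodes "royalty", axis 1 loosely encodes "gender".
toy_vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' using the offset b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three question words are excluded from the candidates,
    # as is usual for this kind of test.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(solve_analogy(vectors, a, b, c) == expected
                  for a, b, c, expected in questions)
    return correct / len(questions)

questions = [
    ("man", "king", "woman", "queen"),
    ("king", "man", "queen", "woman"),
]
accuracy = analogy_accuracy(toy_vectors, questions)
print(f"analogy accuracy: {accuracy:.2f}")
```

Tracking this accuracy across training runs with different epoch counts gives a rough signal of whether more training still helps. Note that it evaluates the word vectors only, not the document vectors.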


If evaluating on a training/validation split alone does not satisfy you, here are a few additional ways to assess your model's performance:

  1. Metric tracking. This is often used when data is abundant (for example, MS MARCO uses MRR, mean reciprocal rank, to evaluate the quality of their embeddings). You can find that here: https://microsoft.github.io/msmarco/ Another good metric is mean RBO (rank-biased overlap).

  2. Eyeball checks using a few queries. This is most helpful when you are building something for the first time and want to sense-check it across your top X queries. If your model has overfit, you will see very poor performance on search queries outside of your data distribution.

  3. An embedding projector, which lets you inspect your embeddings and their nearest neighbors so that you can learn about bias in the data. The embedding projector requires a good dimensionality reduction algorithm and should have a good clustering algorithm to help you detect these biases.
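As a rough illustration of the metric tracking in point 1, the sketch below computes MRR over a handful of queries and a truncated form of RBO between two rankings. The query ids, document ids, and function names are hypothetical. The RBO here is the plain truncated sum without the extrapolation term, so two identical lists of length k score 1 - p^k rather than 1; treat it as a lower bound, not the full metric.

```python
def mean_reciprocal_rank(rankings, relevant):
    """MRR over queries.

    rankings: dict mapping query -> ranked list of document ids
    relevant: dict mapping query -> set of relevant document ids
    A query with no relevant document in its ranking contributes 0.
    """
    total = 0.0
    for query, ranked in rankings.items():
        for position, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant[query]:
                total += 1.0 / position
                break  # only the first relevant hit counts
    return total / len(rankings)

def rbo_truncated(list_a, list_b, p=0.9):
    """Truncated rank-biased overlap of two rankings (no extrapolation).

    p controls how strongly the top of the lists is weighted.
    Assumes each list contains no duplicate items.
    """
    depth = min(len(list_a), len(list_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(list_a[d - 1])
        seen_b.add(list_b[d - 1])
        overlap = len(seen_a & seen_b)  # shared items in the top d
        score += (p ** (d - 1)) * overlap / d
    return (1 - p) * score

# Hypothetical search results from an embedding-based retriever.
rankings = {
    "q1": ["d3", "d1", "d2"],  # first relevant hit at rank 2
    "q2": ["d5", "d4", "d6"],  # first relevant hit at rank 1
}
relevant = {"q1": {"d1"}, "q2": {"d5"}}

mrr = mean_reciprocal_rank(rankings, relevant)
print(f"MRR: {mrr:.3f}")

# Compare the rankings two models return for the same query.
model_a = ["d1", "d2", "d3"]
model_b = ["d1", "d3", "d2"]
print(f"truncated RBO: {rbo_truncated(model_a, model_b):.3f}")
```

Averaging the RBO over many queries gives the "mean RBO" mentioned above. It needs no relevance labels, which makes it useful for checking how much the rankings change between two training runs.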

I have co-authored a few Python packages that help with some of these issues:

For comparing search performances: https://github.com/RelevanceAI/search_comparator

(Releasing a few packages in the next few days that will help with this as well - will update this comment when I do!)
