How to examine if a Doc2Vec model is sufficiently trained?

Question

Shan Dou

December 1, 2021, 21:20

I started experimenting with gensim's Doc2Vec for sentiment analysis. For training the embedding itself, I have seen examples that use a reduced learning rate with a few tens or even a few hundred epochs. However, there does not seem to be a straightforward way to use early stopping to prevent overfitting, and it is not yet clear to me how I should access the loss value for each epoch to detect overfitting. What is the proper way to examine whether word2vec or doc2vec models themselves are sufficiently good? Thank you!

Topic doc2vec gensim word2vec word-embeddings

Category Data Science

Brian Spiering · Accepted Answer · December 1, 2021, 21:20

One way to test an embedding space is to use word analogies as unit tests. A properly trained embedding space should successfully complete the analogy "Man is to king as woman is to _____" with "queen".

Google has released a collection of 19,000+ word analogies to evaluate word embedding models.
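
The analogy test can be sketched in a few lines. This is a minimal illustration, not a full evaluation: the 2-D vectors in TOY_VECTORS are made up by hand so that the analogy works, and the helper names (cosine, solve_analogy, analogy_accuracy) are hypothetical. With a real gensim model, the built-in KeyedVectors.evaluate_word_analogies method should do the equivalent job on an analogy file.

```python
import math

# Hypothetical 2-D toy vectors, chosen by hand for illustration only.
# A real model would supply vectors with 100+ dimensions.
TOY_VECTORS = {
    "man": (1.0, 0.0),
    "woman": (1.0, 1.0),
    "king": (3.0, 0.0),
    "queen": (3.0, 1.0),
    "apple": (0.0, 3.0),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'.

    Uses the common vector-offset rule: d is the word closest to b - a + c,
    excluding the three input words.
    """
    target = tuple(
        vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    )
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, analogies):
    """Fraction of (a, b, c, expected_d) analogies answered correctly."""
    correct = sum(solve_analogy(vectors, a, b, c) == d for a, b, c, d in analogies)
    return correct / len(analogies)


if __name__ == "__main__":
    tests = [("man", "king", "woman", "queen"), ("woman", "queen", "man", "king")]
    print(solve_analogy(TOY_VECTORS, "man", "king", "woman"))
    print(analogy_accuracy(TOY_VECTORS, tests))
```

Tracking this accuracy on a fixed analogy set after each training run gives a single number to compare across epoch counts or learning rates. Note that it tests the word vectors, not the document vectors.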

Jacky Wong · Accepted Answer · November 21, 2021, 12:18

If you would rather not rely only on a training/validation split to evaluate your model, here are a few additional ways to assess its performance:

Metric tracking. This is often used when data is abundant (for example, MS MARCO uses MRR to evaluate the quality of embeddings; see https://microsoft.github.io/msmarco/). Another good metric is mean RBO (rank-biased overlap).
Eyeball checks using a few queries. This is most helpful when you are building something for the first time and want to sense-check it across your top X queries. If your model has overfit, you will see very poor performance on search queries outside your data distribution.
Embedding projector. Use it to inspect your embeddings and their nearest neighbors so that you can learn about bias in the data. The embedding projector requires a good dimensionality reduction algorithm, and ideally a good clustering algorithm, to help you detect these biases.
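
The two metrics named above can be sketched in plain Python. This is a rough illustration under stated assumptions: the function names and the input format (ranked lists of document IDs plus a set of relevant IDs per query) are hypothetical, and truncated_rbo computes only the truncated RBO sum down to the shared list depth, which is a lower bound on the full rank-biased overlap rather than the extrapolated variant.

```python
def mean_reciprocal_rank(rankings, relevant):
    """Mean reciprocal rank over a set of queries.

    rankings: one ranked list of document IDs per query (best first).
    relevant: one set of relevant document IDs per query.
    A query with no relevant document in its ranking contributes 0.
    """
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for position, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / position
                break
    return total / len(rankings)


def truncated_rbo(list_a, list_b, p=0.9):
    """Truncated rank-biased overlap between two ranked lists.

    Sums p**(d-1) * (overlap of the top-d prefixes) / d for d = 1..k,
    where k is the shorter list length, then scales by (1 - p).
    Higher p puts more weight on agreement deeper in the lists.
    """
    depth = min(len(list_a), len(list_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(list_a[d - 1])
        seen_b.add(list_b[d - 1])
        agreement = len(seen_a & seen_b) / d
        score += p ** (d - 1) * agreement
    return (1 - p) * score


if __name__ == "__main__":
    # Hypothetical results for two queries from an embedding-based search.
    rankings = [["a", "b", "c"], ["x", "y", "z"]]
    relevant = [{"b"}, {"z"}]
    print(mean_reciprocal_rank(rankings, relevant))

    # Compare the top results of two model checkpoints for the same query.
    print(truncated_rbo(["a", "b", "c"], ["a", "c", "b"]))
```

MRR needs labeled relevant documents, whereas RBO only compares two rankings, so RBO can be used to check how much the nearest neighbors change between training runs even when no labels exist.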

I have co-authored a few Python packages to help with some of these issues:

For comparing search performances: https://github.com/RelevanceAI/search_comparator

(Releasing a few packages in the next few days that will help with this as well - will update this comment when I do!)
