Number of epochs in Gensim Word2Vec implementation

There's an iter parameter in the gensim Word2Vec implementation

class gensim.models.word2vec.Word2Vec(sentences=None, size=100,
alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0,
seed=1, workers=1, min_alpha=0.0001, sg=1, hs=1, negative=0,
cbow_mean=0, hashfxn=<built-in function hash>, iter=1, null_word=0,
trim_rule=None, sorted_vocab=1)

that specifies the number of epochs, i.e.:

iter = number of iterations (epochs) over the corpus.

Does anyone know whether increasing the number of epochs helps improve the model trained over the corpus?

Is there any reason why iter is set to 1 by default? Is it that increasing the no. of epochs has little effect?

Is there any scientific/empirical evaluation of how to set the no. of epochs?

Unlike a classification/regression task, grid search wouldn't really work here, since the vectors are generated in an unsupervised manner and the objective is simply the hierarchical softmax or negative sampling loss.

Is there an early stopping mechanism to cut short the no. of epochs once the vectors converge? And can the hierarchical softmax or negative sampling objective actually converge?
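
By "empirical evaluation" I mean something like an intrinsic benchmark. For illustration, a minimal sketch of such a comparison (assuming gensim < 4.0, where the parameter is still called iter rather than epochs, and an iterable of tokenised sentences named sentences), scoring models trained with different epoch counts on the WordSim-353 word-similarity set bundled with gensim:

from gensim.models import Word2Vec
from gensim.test.utils import datapath

# Hypothetical comparison loop: `sentences` is a placeholder for the corpus
for n_epochs in (1, 5, 15):
    model = Word2Vec(sentences=sentences, size=100, sg=1, seed=1, iter=n_epochs)
    # Intrinsic evaluation against the WordSim-353 pairs shipped with gensim
    pearson, spearman, oov = model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))
    print(f'iter={n_epochs}  Spearman={spearman[0]:.3f}  OOV={oov:.1f}%')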


You can use a callback to output the loss at every epoch to help you decide how many to use:

import gensim
from gensim.models.callbacks import CallbackAny2Vec

# Your model params:
CONTEXT_WINDOW = 5
NEGATIVES = 5
MIN_COUNT = 5
EPOCHS = 20

class LossLogger(CallbackAny2Vec):
    '''Output loss at each epoch'''
    def __init__(self):
        self.epoch = 1
        self.losses = []

    def on_epoch_begin(self, model):
        print(f'Epoch: {self.epoch}', end='\t')

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        self.losses.append(loss)
        print(f'  Loss: {loss}')
        self.epoch += 1

loss_logger = LossLogger()
mod = gensim.models.word2vec.Word2Vec(sentences=sentences,
                                      sg=1,
                                      window=CONTEXT_WINDOW,
                                      negative=NEGATIVES,
                                      min_count=MIN_COUNT,
                                      callbacks=[loss_logger],
                                      compute_loss=True,
                                      iter=EPOCHS)

...and you can use loss_logger.losses to retrieve them later (if you want to plot them, for example...)
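
One caveat worth checking against your gensim version: get_latest_training_loss() appears to return a running total that accumulates across epochs within a single train() call, rather than a per-epoch figure. A small, hypothetical variant of the callback above that logs the per-epoch difference instead:

class DeltaLossLogger(CallbackAny2Vec):
    '''Output the per-epoch change in the cumulative training loss'''
    def __init__(self):
        self.epoch = 1
        self.previous = 0.0
        self.losses = []

    def on_epoch_end(self, model):
        cumulative = model.get_latest_training_loss()
        epoch_loss = cumulative - self.previous  # assumes the counter accumulates
        self.previous = cumulative
        self.losses.append(epoch_loss)
        print(f'Epoch: {self.epoch}  Loss: {epoch_loss}')
        self.epoch += 1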


I trained my w2v model on Google News 300 for [2, 10, 100] epochs, and the best one was at 10 epochs. After all that waiting, I was shocked that 100 epochs performed poorly.

epochs   wall time
------   --------------------
2        56 s
10       4 min 44 s (284 s)
100      47 min 27 s (2847 s)

I looked here, and found that the default value changed from 1 to 5. Apparently the authors believe that more epochs will improve the results.

I cannot tell from experience, yet.


Increasing the iter count (number of epochs) dramatically increases training time. Word2Vec gives quality results only if you feed it a massive number of documents, so looping over them even twice may not be reasonable, although doing so does make the resulting word embeddings more accurate.


Increasing the number of epochs usually benefits the quality of the word representations. In experiments I have performed where the goal was to use the word embeddings as features for text classification, setting the epochs to 15 instead of 5 increased the performance.
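
For concreteness, a minimal sketch of that kind of downstream check (docs, labels and the doc_vector helper are hypothetical placeholders, and it assumes gensim < 4.0 plus scikit-learn): average each document's word vectors and cross-validate a simple classifier for a couple of epoch settings.

import numpy as np
import gensim
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def doc_vector(model, tokens):
    # Mean of the vectors for the tokens the model actually knows
    known = [t for t in tokens if t in model.wv]
    if not known:
        return np.zeros(model.vector_size)
    return np.mean([model.wv[t] for t in known], axis=0)

for n_epochs in (5, 15):
    w2v = gensim.models.word2vec.Word2Vec(sentences=docs, sg=1, min_count=5,
                                          iter=n_epochs)
    X = np.vstack([doc_vector(w2v, d) for d in docs])
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5)
    print(f'iter={n_epochs}  accuracy={scores.mean():.3f}')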
