Number of epochs in Gensim Word2Vec implementation

There's an iter parameter in the gensim Word2Vec implementation

class gensim.models.word2vec.Word2Vec(sentences=None, size=100,
alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0,
seed=1, workers=1, min_alpha=0.0001, sg=1, hs=1, negative=0,
cbow_mean=0, hashfxn=<built-in function hash>, iter=1, null_word=0,
trim_rule=None, sorted_vocab=1)

that specifies the number of epochs, i.e.:

iter = number of iterations (epochs) over the corpus.

Does anyone know whether increasing the number of epochs helps improve the model trained over the corpus?

Is there any reason why iter is set to 1 by default? Is it that increasing the no. of epochs has little effect?

Is there any scientific/empirical evaluation of how to set the no. of epochs?

Unlike a classification/regression task, grid search wouldn't really work here, since the vectors are generated in an unsupervised manner and the objective is simply the hierarchical softmax or negative sampling loss.

Is there an early stopping mechanism to cut short the no. of epochs once the vectors converge? And can the hierarchical softmax or negative sampling objective actually converge?
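
By "empirical evaluation" I mean something like an intrinsic benchmark. For illustration, a minimal sketch of such a comparison (assuming gensim < 4.0, where the parameter is still called iter rather than epochs, and an iterable of tokenised sentences named sentences), scoring models trained with different epoch counts on the WordSim-353 word-similarity set bundled with gensim:

from gensim.models import Word2Vec
from gensim.test.utils import datapath

# Hypothetical comparison loop: `sentences` is a placeholder for the corpus
for n_epochs in (1, 5, 15):
    model = Word2Vec(sentences=sentences, size=100, sg=1, seed=1, iter=n_epochs)
    # Intrinsic evaluation against the WordSim-353 pairs shipped with gensim
    pearson, spearman, oov = model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))
    print(f'iter={n_epochs}  Spearman={spearman[0]:.3f}  OOV={oov:.1f}%')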


You can use a callback to output the loss at every epoch to help you decide how many to use:

import gensim
from gensim.models.callbacks import CallbackAny2Vec

# Your model params:
CONTEXT_WINDOW = 5
NEGATIVES = 5
MIN_COUNT = 5
EPOCHS = 20

class LossLogger(CallbackAny2Vec):
    '''Output loss at each epoch'''
    def __init__(self):
        self.epoch = 1
        self.losses = []

    def on_epoch_begin(self, model):
        print(f'Epoch: {self.epoch}', end='\t')

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        self.losses.append(loss)
        print(f'  Loss: {loss}')
        self.epoch += 1

loss_logger = LossLogger()
mod = gensim.models.word2vec.Word2Vec(sentences=sentences,
                                      sg=1,
                                      window=CONTEXT_WINDOW,
                                      negative=NEGATIVES,
                                      min_count=MIN_COUNT,
                                      callbacks=[loss_logger],
                                      compute_loss=True,
                                      iter=EPOCHS)

...and you can use loss_logger.losses to retrieve them later (if you want to plot them, for example...)
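
One caveat worth checking against your gensim version: get_latest_training_loss() appears to return a running total that accumulates across epochs within a single train() call, rather than a per-epoch figure. A small, hypothetical variant of the callback above that logs the per-epoch difference instead:

class DeltaLossLogger(CallbackAny2Vec):
    '''Output the per-epoch change in the cumulative training loss'''
    def __init__(self):
        self.epoch = 1
        self.previous = 0.0
        self.losses = []

    def on_epoch_end(self, model):
        cumulative = model.get_latest_training_loss()
        epoch_loss = cumulative - self.previous  # assumes the counter accumulates
        self.previous = cumulative
        self.losses.append(epoch_loss)
        print(f'Epoch: {self.epoch}  Loss: {epoch_loss}')
        self.epoch += 1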


I trained my w2v model on Google News 300 for [2, 10, 100] epochs, and the best one was at 10 epochs. After all that waiting, I was shocked that 100 epochs performed poorly.

epochs   wall time
------   --------------------
2        56 s
10       4 min 44 s (284 s)
100      47 min 27 s (2847 s)

I looked here, and found that the default value changed from 1 to 5. Apparently the authors believe that more epochs will improve the results.

I cannot tell from experience, yet.


Increasing the iter count (number of epochs) dramatically increases training time. Word2Vec gives quality results only if you feed it a massive number of documents, so looping over them even twice may not be reasonable, although doing so does make the resulting word embeddings more accurate.


Increasing the number of epochs usually benefits the quality of the word representations. In experiments I have performed where the goal was to use the word embeddings as features for text classification, setting the epochs to 15 instead of 5 increased the performance.
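
For concreteness, a minimal sketch of that kind of downstream check (docs, labels and the doc_vector helper are hypothetical placeholders, and it assumes gensim < 4.0 plus scikit-learn): average each document's word vectors and cross-validate a simple classifier for a couple of epoch settings.

import numpy as np
import gensim
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def doc_vector(model, tokens):
    # Mean of the vectors for the tokens the model actually knows
    known = [t for t in tokens if t in model.wv]
    if not known:
        return np.zeros(model.vector_size)
    return np.mean([model.wv[t] for t in known], axis=0)

for n_epochs in (5, 15):
    w2v = gensim.models.word2vec.Word2Vec(sentences=docs, sg=1, min_count=5,
                                          iter=n_epochs)
    X = np.vstack([doc_vector(w2v, d) for d in docs])
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5)
    print(f'iter={n_epochs}  accuracy={scores.mean():.3f}')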
