Fine-tuning pre-trained Word2Vec model with Gensim 4.0

Before Gensim 4.0, we could retrain a word2vec model using the following code:

    model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    model.train(my_corpus, total_examples=len(my_corpus), epochs=model.epochs)

However, as I understand it, Gensim 4.0 no longer supports Word2Vec.load_word2vec_format. Instead, I can only load the vectors as KeyedVectors.

How can I fine-tune a pre-trained word2vec model (such as the model trained on GoogleNews) on my domain-specific corpus using Gensim 4.0?

Topic pretraining transfer-learning gensim word2vec

Category Data Science


You can try the following steps to fine-tune on your domain-specific corpus using Gensim 4.0:

  1. Create a Word2Vec model with the same vector size as the pretrained model

    w2vModel = Word2Vec(vector_size=..., min_count=..., ...)

  2. Build the vocabulary for the new corpus

    w2vModel.build_vocab(my_corpus)

  3. Create a vector of ones, one entry per vocabulary word, that determines the mutability of each vector during training (1.0 lets a vector be updated, 0.0 freezes it). Note that intersect_word2vec_format still takes a lockf argument and writes it into this array for every imported word; its default of 0.0 freezes those words, so pass lockf=1.0 in the next step if the pretrained vectors should be updated during fine-tuning

    # float32 matches the dtype Gensim uses for its own vectors
    w2vModel.wv.vectors_lockf = np.ones(len(w2vModel.wv), dtype=np.float32)

  4. Perform a vocabulary intersection with the intersect_word2vec_format function. This initializes the new embeddings with the pretrained embeddings for every word that also appears in the pretrained vocabulary. Quoting the official Gensim documentation for intersect_word2vec_format:

Merge in an input-hidden weight matrix loaded from the original C word2vec-tool format, where it intersects with the current vocabulary.

No words are added to the existing vocabulary, but intersecting words adopt the file’s weights, and non-intersecting words are left alone.

    # lockf=1.0 keeps the imported vectors trainable; the default 0.0 freezes them
    w2vModel.wv.intersect_word2vec_format('pretrained.bin', lockf=1.0, binary=True)

  5. Now you can train the model on the new corpus

    w2vModel.train(my_corpus, total_examples=len(my_corpus), epochs=...)
