Fine-tuning pre-trained Word2Vec model with Gensim 4.0

Question

NST

April 7, 2022, 10:04

Before Gensim 4.0, we could retrain a word2vec model using the following code:

model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model.train(my_corpus, total_examples=len(my_corpus), epochs=model.epochs)

However, my understanding is that Gensim 4.0 no longer supports Word2Vec.load_word2vec_format. Instead, I can only load the vectors as KeyedVectors.

How can I fine-tune a pre-trained word2vec model (such as the model trained on GoogleNews) with my domain-specific corpus using Gensim 4.0?

Topic pretraining transfer-learning gensim word2vec

Category Data Science

Ishrak · Accepted Answer · March 6, 2022, 19:30

You can try the following steps to fine-tune on your domain-specific corpus using Gensim 4.0:

Create a Word2Vec model with the same vector size as the pretrained model:

w2vModel = Word2Vec(vector_size=..., min_count=..., ...)

Build the vocabulary for the new corpus:

w2vModel.build_vocab(my_corpus)

Create a vector of ones that sets the lock factor, and therefore the mutability, of each word vector: 1.0 lets a vector update fully during training, while 0.0 freezes it. In previous Gensim versions, this was controlled by a single lockf argument to the intersect_word2vec_format function. Sizing the vector to the vocabulary gives every word its own lock factor, so all words can be updated during fine-tuning:

# float32 is the dtype Gensim uses for its vector arrays
w2vModel.wv.vectors_lockf = np.ones(len(w2vModel.wv), dtype=np.float32)

Perform a vocabulary intersection with the intersect_word2vec_format function, which initializes the new embeddings with the pretrained embeddings for the words that are in the pretraining vocabulary. Quoting the official Gensim documentation for intersect_word2vec_format:

Merge in an input-hidden weight matrix loaded from the original C word2vec-tool format, where it intersects with the current vocabulary.

No words are added to the existing vocabulary, but intersecting words adopt the file’s weights, and non-intersecting words are left alone.

The lockf argument of this function defaults to 0.0, which freezes the imported vectors, so pass lockf=1.0 to keep them trainable:

w2vModel.wv.intersect_word2vec_format('pretrained.bin', binary=True, lockf=1.0)
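
To make the intersection and lock-factor semantics concrete, here is a plain-Python illustration. It is not Gensim's implementation: the helper names intersect and apply_update, the dictionaries, and the toy vectors are all made up for this sketch, and it assumes a training update is simply scaled by the word's lock factor.

```python
def intersect(current, pretrained, lockf):
    """Overwrite vectors of words already in `current`; never add new words.

    Returns a per-word lock factor (hypothetical helper, for illustration only).
    """
    locks = {word: 1.0 for word in current}   # every word starts fully trainable
    for word, vector in pretrained.items():
        if word in current:                   # intersecting word adopts the file's weights
            current[word] = list(vector)
            locks[word] = lockf               # imported words get the requested lock factor
    return locks


def apply_update(vector, step, lock):
    """Apply a training step scaled by the lock factor (0.0 means frozen)."""
    return [v + lock * s for v, s in zip(vector, step)]


current = {"cat": [0.0, 0.0], "purred": [0.5, 0.5]}
pretrained = {"cat": [1.0, 2.0], "dog": [3.0, 4.0]}

locks = intersect(current, pretrained, lockf=0.0)
frozen = apply_update(current["cat"], [0.1, 0.1], locks["cat"])      # imported, locked
free = apply_update(current["purred"], [0.1, 0.1], locks["purred"])  # not in file, trainable
```

With lockf=0.0 the imported "cat" vector ignores the update, while "purred", which is absent from the pretrained file, still moves. "dog" is never added because it is not in the current vocabulary.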

Now you can train the model on the new corpus:

w2vModel.train(my_corpus, total_examples=len(my_corpus), epochs=...)
