Why do we need to 'train' word2vec when word2vec itself is said to be 'pretrained'?

I am really confused about why we need to 'train word2vec' when word2vec itself is said to be 'pretrained'. I searched for a word2vec pretrained embedding, thinking I could get a mapping table that directly maps the vocabulary of my dataset to pretrained embeddings, but to no avail. Instead, all I can find is how to literally train our own:

Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)

But I'm confused: isn't word2vec already pretrained? Why do we need to 'train' it again? If it's pretrained, what do we modify in the model (or specifically, which part) with our new 'training'? And how does our new 'training' differ from its 'pretraining'? TIA.

Which types of word embeddings are truly 'pretrained', so that we can just use, for instance, model['word'] and get the corresponding embedding?

Topic: word2vec, word-embeddings, nlp

Category: Data Science


word2vec is an algorithm for training word embeddings: given a raw text corpus, it learns a vector for every word in the vocabulary. Those vectors can then be reused in other applications, and a set of vectors that somebody has already trained and published this way (for example the Google News vectors) is what people call a pretrained model. So word2vec itself is not pretrained; particular embeddings produced by running it on some corpus are.
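If you just want ready-made vectors to look up with model['word'], gensim ships a downloader for several published embedding sets. A minimal sketch, assuming the gensim package and its word2vec-google-news-300 dataset (the download is large and happens once):

import gensim.downloader as api

# Download (on first use) and load published pretrained vectors; returns KeyedVectors.
wv = api.load("word2vec-google-news-300")

print(wv["king"].shape)                  # (300,) -- the pretrained embedding for 'king'
print(wv.most_similar("king", topn=3))   # nearest neighbours in the pretrained space

Here nothing is trained on your side; you only look up vectors that were already learned from the Google News corpus.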

It's important to understand that the resulting embeddings depend heavily on the data they were trained on. Many simple applications can use a general-purpose pretrained model, but applications tied to a specific technical domain often need embeddings trained on custom, domain-specific data, as in the sketch below.
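Training your own embeddings is exactly the one-liner from the question, run on your own tokenized sentences. A minimal sketch (the two-sentence corpus below is made up purely for illustration):

from gensim.models import Word2Vec

# Tiny made-up domain corpus: each sentence is a list of tokens.
sentences = [
    ["the", "patient", "received", "intravenous", "antibiotics"],
    ["the", "mri", "showed", "no", "acute", "findings"],
]

# Train word2vec embeddings from scratch on this corpus.
model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=1, workers=4)

vec = model.wv["patient"]   # 100-dimensional vector learned from *your* data

The 'training' here is not fine-tuning some universal model; it learns a fresh set of vectors whose vocabulary and geometry reflect your corpus rather than Google News.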
