How to fit Word2Vec on test data?

I am working on a Sentiment Analysis problem. I am using Gensim's Word2Vec to vectorize my data in the following way:

# PREPROCESSING THE DATA

# SPLITTING THE DATA
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(x, y, test_size = 0.2, random_state = 69, stratify = y)

train_x2 = train_x['review'].to_list()
test_x2 = test_x['review'].to_list()

# CONVERT TRAIN DATA INTO A NESTED LIST, AS WORD2VEC EXPECTS A LIST OF TOKEN LISTS
train_x3 = [nltk.word_tokenize(k) for k in train_x2]
test_x3 = [nltk.word_tokenize(k) for k in test_x2]

# TRAIN THE MODEL ON TRAIN SET
from gensim.models import Word2Vec
model = Word2Vec(train_x3, min_count = 1)
key_index = model.wv.key_to_index

# MAKE A DICT
we_dict = {word:model.wv[word] for word in key_index}

# CONVERT TO DATAFRAME
import pandas as pd
new = pd.DataFrame.from_dict(we_dict)

The new dataframe is the vectorized form of the train data. Now how do I do the same process for the test data? I can't pass the whole corpus (train+test) to the Word2Vec instance as it might lead to data leakage. Should I simply pass the test list to another instance of the model as:

model = Word2Vec(test_x3, min_count = 1)

I don't think this would be the correct way. Any help is appreciated!

PS: I am not using a pretrained word2vec in an LSTM model. What I am doing is training Word2Vec on the data that I have and then feeding it to an ML algorithm like RF or LGBM. Hence I need to vectorize the test data separately.

PS: Here is a sample dataset:

train_x3 is a list of tokenized sentences which I am feeding to the Word2Vec model.

id review
1 ['bad', 'quality', 'poor', 'color', 'wash']
2 ['product', 'quality', 'good']
3 ['kindly', 'return', 'order', 'asap']

and after vectorizing it each of the tokenized words will have a dimension of 100 as follows:

id good product colour
1 -0.00103 0.00788 0.004578
2 0.0051 0.00478 0.00653
3 0.0015 0.00413 0.00051

Topic data-leakage gensim word2vec sentiment-analysis python

Category Data Science


Your dataframe new already contains the embeddings to use for the test set. Just tokenize the test reviews, restrict them to the words in your training vocabulary, and look up their vectors in new.

I can't pass the whole corpus (train+test) to the Word2Vec instance as it might lead to data leakage.

Correct.

Should I simply pass the test list to another instance of the model

No, then the embeddings would have nothing to do with each other, and any subsequent sentiment analysis model will be very confused.

assign indexes to all the words in train, train word2vec on train, assign same index to similar words in test set and map the embeddings from train to test. But then again in the second method there is the problem of OOV words. Not all words present in the train will be present in test and vice versa!

Not similar words between train and test, only exact matches. Doing some manual intervention might be fine, but any score obtained that way is only valid if you perform the same interventions on the model in production...

As for out-of-vocabulary words, that's just how these things work generally. You (rather, your model) don't know anything about those words, so you just have to discard them.
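Discarding OOV words is a one-line filter. The vocabulary dict below is a hypothetical stand-in for model.wv.key_to_index:

```python
# hypothetical training vocabulary (stand-in for model.wv.key_to_index)
vocab = {"bad": 0, "quality": 1, "poor": 2, "good": 3}

test_tokens = ["good", "quality", "but", "pricey"]

# "but" and "pricey" were never seen in training, so they are dropped
kept = [w for w in test_tokens if w in vocab]
```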


A word2vec model is different from other machine learning models: you cannot simply call the model on test data to get the output. You should do the following:

  1. Convert the test data and assign the same index to words that also appear in the train data
  2. Once Word2Vec is trained on the training data, apply the same transformation to the test data to convert sentences to vectors
  3. Word2Vec gives vectors at the word level, so for a sentence you may want to take the average of the word2vec vectors before training your ML model
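Step 3 can be sketched as follows. The per-word vectors here are hypothetical stand-ins for model.wv, with dimension 4 instead of 100 for readability:

```python
import numpy as np

# hypothetical per-word vectors (stand-in for model.wv), dimension 4 here
wv = {
    "good":    np.array([0.1, 0.2, 0.0, 0.4]),
    "product": np.array([0.3, 0.0, 0.1, 0.1]),
    "quality": np.array([0.2, 0.2, 0.2, 0.2]),
}
DIM = 4

def sentence_vector(tokens):
    # average the vectors of in-vocabulary words; fall back to a
    # zero vector if every token in the sentence is OOV
    known = [wv[w] for w in tokens if w in wv]
    return np.mean(known, axis=0) if known else np.zeros(DIM)

# one fixed-length feature row per review, ready for RF / LGBM
features = sentence_vector(["product", "quality", "good"])
```

Applying sentence_vector to every review in train and test yields a fixed-width feature matrix, which is what tree-based models like RF or LGBM expect.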

Also, I don't think running word2vec on the whole data will lead to data leakage, since you are not learning a classifier; you are just doing feature engineering with word2vec. It is once you pass the data to a training model like logistic regression or XGBoost that you should worry about leakage.
