How to fit Word2Vec on test data?
I am working on a Sentiment Analysis problem. I am using Gensim's Word2Vec to vectorize my data in the following way:
# PREPROCESSING THE DATA
# SPLITTING THE DATA
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(x, y, test_size = 0.2, random_state = 69, stratify = y)
train_x2 = train_x['review'].to_list()
test_x2 = test_x['review'].to_list()
# CONVERT TRAIN DATA INTO NESTED LISTS, AS WORD2VEC EXPECTS A LIST OF TOKEN LISTS
import nltk
train_x3 = [nltk.word_tokenize(k) for k in train_x2]
test_x3 = [nltk.word_tokenize(k) for k in test_x2]
# TRAIN THE MODEL ON TRAIN SET
from gensim.models import Word2Vec
model = Word2Vec(train_x3, min_count = 1)
key_index = model.wv.key_to_index
# MAKE A DICT
we_dict = {word:model.wv[word] for word in key_index}
# CONVERT TO DATAFRAME
import pandas as pd
new = pd.DataFrame.from_dict(we_dict)
The new dataframe is the vectorized form of the train data. Now how do I do the same process for the test data? I can't fit Word2Vec on the whole corpus (train + test), as that might lead to data leakage. Should I simply pass the test list to another instance of the model, as:
model = Word2Vec(test_x3, min_count = 1)
I don't think this would be the correct way. Any help is appreciated!
PS: I am not using a pretrained word2vec in an LSTM model. What I am doing is training Word2Vec on the data that I have and then feeding it to an ML algorithm like RF or LGBM. Hence I need to vectorize the test data separately.
PPS: Here is a sample dataset:
train_x3 is a list of tokenized sentences which I am feeding to the Word2Vec model.
| id | review |
|---|---|
| 1 | ['bad', 'quality', 'poor', 'color', 'wash'] |
| 2 | ['product', 'quality', 'good'] |
| 3 | ['kindly', 'return', 'order', 'asap'] |
and after vectorizing, each tokenized word will have a 100-dimensional vector, as follows:
| id | good | product | colour |
|---|---|---|---|
| 1 | -0.00103 | 0.00788 | 0.004578 |
| 2 | 0.0051 | 0.00478 | 0.00653 |
| 3 | 0.0015 | 0.00413 | 0.00051 |
Topic data-leakage gensim word2vec sentiment-analysis python
Category Data Science