Splitting before tfidf or after?

When should I perform preprocessing and matrix creation for text data in NLP: before or after train_test_split? Below is my sample code, where I do the preprocessing and build the tf-idf matrix before train_test_split. Will this cause data leakage?

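import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# assumption: the original post does not show which stemmer is used
stemmer = PorterStemmer()
# build the stop-word set once instead of once per token
stop_words = set(stopwords.words('english'))
corpus = []

for i in range(len(data1)):
    # keep letters only, lowercase, and tokenize on whitespace
    review = re.sub('[^a-zA-Z]', ' ', data1['features'][i])
    review = review.lower().split()
    # drop stop words and stem the remaining tokens
    review = [stemmer.stem(j) for j in review if j not in stop_words]
    corpus.append(' '.join(review))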

from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer(max_features = 6000)
x = cv.fit_transform(corpus).toarray()

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(data1['label'])

from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.2, random_state=69, stratify=y)

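from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score

spam_model = MultinomialNB().fit(train_x, train_y)
pred = spam_model.predict(test_x)
c_matrix = confusion_matrix(test_y, pred)
acc_score = accuracy_score(test_y, pred)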



You should split before tf-idf.

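If you also fit tf-idf on the test set, you will have data leakage, because the vocabulary and the IDF weights are then computed from the test documents as well. For example, the test set will never contain out-of-vocabulary words at inference time, although that can easily happen in the "real world".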
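One way to do this, as a minimal sketch rather than a drop-in replacement for the code in the question: split the cleaned text first, then let a scikit-learn pipeline fit the vectorizer on the training part only. The `texts` and `labels` lists below are made-up toy data standing in for `corpus` and `y`; the other settings mirror the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the cleaned corpus and encoded labels.
texts = [
    "win free prize now", "free cash offer today",
    "claim your free reward", "cheap loans win cash",
    "meeting moved to monday", "lunch at noon tomorrow",
    "see you at the office", "project report due friday",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so the test documents stay unseen.
train_txt, test_txt, train_y, test_y = train_test_split(
    texts, labels, test_size=0.25, random_state=69, stratify=labels
)

# fit() learns the vocabulary and IDF weights from the training text only.
model = make_pipeline(TfidfVectorizer(max_features=6000), MultinomialNB())
model.fit(train_txt, train_y)

# predict() only transforms the test text with the already fitted vectorizer,
# so words that never appeared in training are simply ignored.
pred = model.predict(test_txt)
vocab = model.named_steps["tfidfvectorizer"].vocabulary_
```

With this setup, `vocab` contains only words seen in `train_txt`, so the test score reflects how the model copes with unseen vocabulary. Keeping the vectorizer inside the pipeline also means that cross-validation refits it on each training fold.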
