Train/Test Split after performing SMOTE

I am dealing with a highly imbalanced dataset, so I used SMOTE to resample it.

After SMOTE resampling, I split the resampled dataset into training and test sets, using the training set to build a model and the test set to evaluate it.

However, I am worried that some data points in the test set might actually be jittered versions of data points in the training set (i.e., information is leaking from the training set into the test set), so the test set is not really a clean set for evaluation.

Does anyone have any similar experience? Does the information really leak from the training set into the test set? Or does SMOTE actually take care of this and we do not need to worry about it?



The method by Bashar Haddad is preferred (split the data first and apply SMOTE to the training set only), although when the dataset is small and imbalanced, RepeatedStratifiedKFold helps.

from numpy import mean
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold

# 5 stratified splits repeated twice: scores are aggregated over 10 fitted models
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)
# sklearn's scorer is named 'neg_mean_squared_error' and returns negated MSE
scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=cv, n_jobs=-1)
print('Mean MSE: %.4f' % -mean(scores))

In this example there are 5 splits and 2 repeats, so the aggregated score comes from 10 models in total.
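If you want SMOTE in the mix, a common pattern is to put it inside an imbalanced-learn Pipeline, so the oversampling is applied to each training fold only and every held-out fold is scored on original samples. A minimal sketch, assuming the imbalanced-learn package is installed; LogisticRegression is just a placeholder classifier, and X, y are the original (un-resampled) data:

from numpy import mean
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

# SMOTE runs inside each training fold only; held-out folds stay original
pipeline = Pipeline([('smote', SMOTE(random_state=42)),
                     ('clf', LogisticRegression(max_iter=1000))])
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.4f' % mean(scores))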


I have dealt with the same problem. No, you should not evaluate on generated samples, especially with an algorithm such as SMOTE, because the reported metrics (e.g. accuracy and precision) become unreliable.

SMOTE does not take neighboring examples from other classes into account when generating synthetic examples, which can increase class overlap and introduce noise. This is especially bad if you have a high-dimensional dataset.
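To make this concrete, here is a minimal sketch of SMOTE's core interpolation step (after Chawla et al., 2002); the function name and parameters are illustrative, not from any library:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Pick a minority sample, pick one of its k nearest minority neighbors,
# and interpolate at a random point on the segment between them.
def smote_one_sample(X_minority, k=5, rng=np.random.default_rng(0)):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    i = rng.integers(len(X_minority))
    # drop index 0: each point is its own nearest neighbor
    neighbors = nn.kneighbors(X_minority[i:i + 1], return_distance=False)[0][1:]
    j = rng.choice(neighbors)
    lam = rng.random()  # interpolation factor in [0, 1)
    return X_minority[i] + lam * (X_minority[j] - X_minority[i])

Because only minority-class neighbors are consulted, the interpolated point can land inside a majority-class region, which is exactly the overlap problem described above.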

So the answer is: with SMOTE, you definitely should not split after resampling. Maybe you can with another method in rare cases, if that is your last resort.


Per your last question:

Then I am wondering this way, I won't be able to perform n-fold cross validation, right? Because my data is so small (especially for the minority class)

This is not true. You can try upsampling if your data is really small (but how small is it, exactly?).
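For instance, a minimal upsampling sketch using imbalanced-learn's RandomOverSampler, applied to the training split only (X_train and y_train are placeholders):

from imblearn.over_sampling import RandomOverSampler

# duplicate minority samples at random until the classes are balanced
ros = RandomOverSampler(random_state=42)
X_train_up, y_train_up = ros.fit_resample(X_train, y_train)

Random duplication does not create new, jittered points, so it avoids the SMOTE-specific overlap issue, but it should still happen after the split (or inside each cross-validation fold).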


When you use any sampling technique (especially a synthetic one), divide your data first and then apply the sampling to the training data only. After training, you use the test set (which contains only original samples) to evaluate.
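Concretely, a minimal sketch of that workflow, assuming the imbalanced-learn package; X, y, and model are placeholders for your data and estimator:

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# split first, stratifying to preserve the class ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# resample the training set only; X_test and y_test stay untouched
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# fit on resampled training data, evaluate on original test data
model.fit(X_train_res, y_train_res)
print(model.score(X_test, y_test))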

The risk with your strategy is ending up with an original sample in the training (or test) set and the synthetic sample that was created from it in the test (or training) set; since the two are nearly identical, information leaks across the split.
