How to properly use oversampling without inflating results?

I am working with a tiny private dataset (only 192 samples) with 4 classes. Some preprocessing step is clearly needed before any classification can be done. Alongside feature selection and extraction techniques, I decided to apply oversampling (SMOTE). Here is what I did:

  • Start from the entire dataset (the original 192 samples).
  • Create synthetic samples for each class using SMOTE, so I get a total of 500 samples per class (2000 in total); a sketch of this step is shown below.
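For reference, a minimal sketch of that step, assuming imbalanced-learn's SMOTE and stand-in random data in place of the private dataset:

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Stand-in for the private data: 192 samples, 4 balanced classes, hypothetical features
rng = np.random.default_rng(0)
X = rng.normal(size=(192, 10))
y = np.repeat(np.arange(4), 48)

target = {c: 500 for c in range(4)}       # grow every class to 500 samples (2000 total)
X_res, y_res = SMOTE(sampling_strategy=target, random_state=0).fit_resample(X, y)
print(X_res.shape, np.bincount(y_res))    # (2000, 10) [500 500 500 500]
```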

I am quite suspicious of this procedure, because when I apply SMOTE I get very high accuracy (even 100% in some cases) with even the simplest models, such as a 15-neuron MLP. So, I have some questions to verify the correctness of my experiments:

  1. Is it OK to oversample the entire dataset, or should I apply SMOTE only to the training data (keeping in mind that this will leave few samples for testing)?

  2. SMOTE was originally designed to deal with imbalanced datasets by creating samples for the classes with few samples. Is it OK to use it to generate samples for ALL classes in order to enlarge the entire dataset?



Here are some notes I can share from my experience.

  • I wouldn't trust SMOTE to create test instances. Fundamentally, it may create artificial samples that are too similar to one another, hence the high accuracy. Thus, I would create SMOTE samples only in the training set.
  • Since your dataset is small, I would use something like leave-one-out cross-validation.
  • I assume the point of creating artificial samples for all the classes is to make your model more accurate. Again, you can try this with the training data only, but note that similar statistical-sampling approaches are already built into some classifiers (e.g., bootstrapping in random forests) to make them perform better. Speaking of which, since your dataset is small, you could try a decision tree implementation. A sketch combining these suggestions follows this list.
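A minimal sketch of these suggestions combined, assuming scikit-learn, imbalanced-learn, and the same stand-in random data as above: SMOTE is fitted inside each leave-one-out training fold, so the held-out sample is never synthetic.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE

# Stand-in for the private data: 192 samples, 4 balanced classes
rng = np.random.default_rng(0)
X = rng.normal(size=(192, 10))
y = np.repeat(np.arange(4), 48)

correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    # Oversample the training fold only; the single held-out sample stays real
    sm = SMOTE(sampling_strategy={c: 500 for c in range(4)}, random_state=0)
    X_res, y_res = sm.fit_resample(X[train_idx], y[train_idx])
    clf = DecisionTreeClassifier(random_state=0).fit(X_res, y_res)
    correct += int(clf.predict(X[test_idx])[0] == y[test_idx][0])

print(f"LOOCV accuracy: {correct / len(y):.3f}")
```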

First of all - have you tried modeling without oversampling? If you don't have imbalanced classes, why do you need oversampling at all? You don't always need tons of data to get a good baseline model.

Regarding "should I apply the SMOTE only in train data" - always leave some test data out, and don't fit any transformation on the whole dataset (not even normalization or missing-value imputation). If you have preprocessing steps, fit them on the training data and train the model; then apply the same preprocessing pipeline to the test data and predict on it. A sketch of such a pipeline is shown below.
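A minimal sketch of that idea, assuming imbalanced-learn's Pipeline (which applies samplers such as SMOTE during fit only, never to data being scored) and the same stand-in data as above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Stand-in for the private data: 192 samples, 4 balanced classes
rng = np.random.default_rng(0)
X = rng.normal(size=(192, 10))
y = np.repeat(np.arange(4), 48)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),           # fitted on the training data only
    ("smote", SMOTE(random_state=0)),      # resamples the training data only
    ("mlp", MLPClassifier(hidden_layer_sizes=(15,), max_iter=2000, random_state=0)),
])
pipe.fit(X_train, y_train)                 # scale + resample train, then fit the model
print(pipe.score(X_test, y_test))          # the test set is scaled but never resampled
```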

"simplest models such as a 15-neuron MLP" - that does not sound like the simplest solution to me. Have you tried any other approaches? Also, have you checked the confusion matrix to see the classification error on the test data for each class?
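A quick way to do that per-class check, continuing from the hypothetical pipe fitted in the sketch above:

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = pipe.predict(X_test)                  # predictions on the untouched test set
print(confusion_matrix(y_test, y_pred))        # rows: true class, columns: predicted class
print(classification_report(y_test, y_pred))   # per-class precision, recall, and F1
```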
