How to properly use oversampling without inflating results?
I am working with a tiny private dataset (only 192 samples) with 4 classes, so a preprocessing step is essential in order to do any classification. Among feature selection and extraction techniques, I decided to apply oversampling (SMOTE). Here is what I did:
- Using the entire dataset (the original 192 samples):
- Create synthetic samples for each class using SMOTE, so I get a total of 500 samples per class (2000 in total); see the sketch after this list.
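For reference, here is a minimal sketch of the procedure. Since the dataset is private, I use `make_classification` as a stand-in; the class labels 0-3 mirror my setup, and the target counts and `k_neighbors` are placeholder choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic stand-in for the private dataset: 192 samples, 4 classes (0-3).
X, y = make_classification(n_samples=192, n_features=20, n_informative=8,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# Oversample the ENTIRE dataset to 500 samples per class (2000 in total).
# The dict maps each class label to its target sample count.
smote = SMOTE(sampling_strategy={0: 500, 1: 500, 2: 500, 3: 500},
              k_neighbors=5, random_state=0)
X_res, y_res = smote.fit_resample(X, y)
print(X_res.shape, np.bincount(y_res))  # (2000, 20) [500 500 500 500]
```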
I have a strong suspicion about this procedure, because when I apply SMOTE I get very high accuracy rates (even 100% in some cases) with the simplest models, such as a 15-neuron MLP. So I have some questions to verify the correctness of my experiments:
Is it OK to oversample the entire dataset, or should I apply SMOTE only to the training data (keeping in mind that this would leave few samples for testing)? A sketch of what I mean by the train-only variant is below.
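To make that alternative concrete, this is what I understand the train-only variant would look like (reusing `X` and `y` from the snippet above; the split ratio is an arbitrary choice):

```python
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Split first, stratified, so the test set keeps only real samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)  # 48 real test samples

# Oversample the training portion only; the test set never sees SMOTE.
smote = SMOTE(sampling_strategy={0: 500, 1: 500, 2: 500, 3: 500},
              k_neighbors=5, random_state=0)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
# Any model would be fit on X_train_res and evaluated on the untouched X_test.
```

If this split leaves too few test samples, my understanding is that wrapping the sampler and model in an imbalanced-learn Pipeline inside stratified cross-validation would apply SMOTE only to each training fold, but I am not sure whether that is the recommended practice either.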
SMOTE was originally designed to deal with imbalanced datasets, creating synthetic samples for the classes with few samples. Is it OK to use it to generate samples for ALL classes in order to enlarge the entire dataset?
Topic oversampling smote preprocessing classification python
Category Data Science