Should synthetic data be oversampled as well?

I'm building a binary text classifier; the ratio of positives to negatives is 1:100 (100 / 10000).

Using back translation as an augmentation, I was able to generate 400 more positives. I then decided to upsample to balance the data. Should I upsample only the original positive data points (100), or should I also include the 400 that I generated?

I will definitely try both, but I wanted to know if there is a rule of thumb for what to do in such a case.

Thanks.

Tags: oversampling, data-augmentation, class-imbalance, classification

Category: Data Science


Class imbalance is mostly a problem when you have little data. With your 100:10000 ratio, for your model to do well you should increase the number of minority-class records. There is no rule of thumb here (see the No Free Lunch theorem in ML). You will have to try three scenarios and see what works best for you:

  1. Upsampling using only the actual data
  2. Creating new synthetic data with techniques like SMOTE
  3. A combination of both: generate some synthetic data, then oversample
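To make the comparison concrete, here is a minimal sketch of scenarios 1 and 3 using plain random oversampling with replacement (the dataset sizes mirror the question; the texts and the `upsample_minority` helper are hypothetical, and SMOTE from a library such as imbalanced-learn would replace the random duplication in scenario 2):

```python
import random

def upsample_minority(majority, minority, seed=0):
    """Randomly duplicate minority examples (with replacement)
    until both classes have the same number of records."""
    rng = random.Random(seed)
    needed = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(needed)]
    return majority + minority + extra

# Hypothetical data matching the question: 10000 negatives,
# 100 real positives, 400 back-translated (synthetic) positives.
negatives = [("negative text", 0)] * 10000
real_pos  = [("positive text", 1)] * 100
synth_pos = [("back-translated text", 1)] * 400

# Scenario 1: upsample only the real positives.
balanced_real = upsample_minority(negatives, real_pos)

# Scenario 3: pool real + synthetic positives, then upsample.
# Fewer duplicates are needed, so positives are more varied.
balanced_both = upsample_minority(negatives, real_pos + synth_pos)
```

The practical difference is duplication: in scenario 1 each real positive appears roughly 100 times, while pooling in the synthetic positives first means each positive is repeated only about 20 times, which usually reduces overfitting to the few original examples.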
