Should synthetic data be oversampled as well?

I'm building a binary text classifier; the ratio of positives to negatives is 1:100 (100 / 10000).

Using back translation as an augmentation, I was able to generate 400 more positives. I then decided to upsample to balance the data. Should I upsample only the original positive data points (100), or should I also include the 400 that I generated?

I will definitely try both, but I wanted to know if there is a rule of thumb for what to do in such a case.

Thanks.

Tags: oversampling, data-augmentation, class-imbalance, classification

Category: Data Science


Class imbalance is mostly a problem when you have little data. With your 100:10000 ratio, for your model to do well you should increase the number of minority-class records. There is no rule of thumb here (see the No Free Lunch theorem in ML). You will have to try three scenarios and see what works best for you:

  1. Upsampling using only the actual data
  2. Creating new synthetic data with techniques like SMOTE
  3. A combination of both: generate some synthetic data, then oversample
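To make the comparison concrete, here is a minimal sketch of scenarios 1 and 3 using plain random oversampling with replacement (the dataset sizes mirror the question; the texts and the `upsample_minority` helper are hypothetical, and SMOTE from a library such as imbalanced-learn would replace the random duplication in scenario 2):

```python
import random

def upsample_minority(majority, minority, seed=0):
    """Randomly duplicate minority examples (with replacement)
    until both classes have the same number of records."""
    rng = random.Random(seed)
    needed = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(needed)]
    return majority + minority + extra

# Hypothetical data matching the question: 10000 negatives,
# 100 real positives, 400 back-translated (synthetic) positives.
negatives = [("negative text", 0)] * 10000
real_pos  = [("positive text", 1)] * 100
synth_pos = [("back-translated text", 1)] * 400

# Scenario 1: upsample only the real positives.
balanced_real = upsample_minority(negatives, real_pos)

# Scenario 3: pool real + synthetic positives, then upsample.
# Fewer duplicates are needed, so positives are more varied.
balanced_both = upsample_minority(negatives, real_pos + synth_pos)
```

The practical difference is duplication: in scenario 1 each real positive appears roughly 100 times, while pooling in the synthetic positives first means each positive is repeated only about 20 times, which usually reduces overfitting to the few original examples.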
