Will oversampling help with generalization (small imbalanced dataset)?

I have an imbalanced dataset (2:1 ratio) with about 60 patients and 80 features.

I performed Recursive Feature Elimination (RFE) with stratified cross-validation to reduce the features to 15, and I get an AUC of 0.9 with logistic regression and/or SVM. I don't fully trust that AUC because I think it won't generalize well with such a small positive class. So I was thinking of oversampling the minority class (K-means + PCA) and re-running the RFE approach. Would this help? Thanks.
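For concreteness, here is a minimal sketch of what I have in mind, on synthetic data of the same shape as mine. SMOTE is just a stand-in for the K-means + PCA oversampler, and it sits inside an imblearn pipeline so it is only fit on the training folds:

```python
# Sketch: oversample -> RFE to 15 features -> classifier, scored with
# AUC under stratified CV. The oversampler is a pipeline step, so
# synthetic samples never leak into the validation folds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Toy stand-in for the real data: 60 patients, 80 features, 2:1 classes
X, y = make_classification(n_samples=60, n_features=80, n_informative=10,
                           weights=[2 / 3, 1 / 3], random_state=0)

pipe = Pipeline([
    ("oversample", SMOTE(random_state=0)),  # applied to training folds only
    ("rfe", RFE(LogisticRegression(max_iter=1000), n_features_to_select=15)),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(f"AUC per fold: {np.round(scores, 3)}, mean: {scores.mean():.3f}")
```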

My question is more or less the same as this one: Why will the accuracy of a highly unbalanced dataset reduce after oversampling?, except that I use AUC instead of accuracy.

I have found a useful article on this here. It sounds like you have already done a decent amount, but the best question to ask first is whether you can get more data. If you can't get more data, then you'll have to try some of the things outlined in the article. In these situations, I find it useful to look at the confusion matrix alongside other metrics, since the accuracy metric can hide the underlying details. It's possible to get a high accuracy score but a poor confusion matrix simply because you accurately predict the majority class. Hope this helps.
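As a toy sketch (synthetic data, not your setup) of how accuracy can look fine while the confusion matrix shows the minority class being missed:

```python
# On an imbalanced toy set, accuracy alone can look decent while the
# confusion matrix exposes weak minority-class recall.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=60, n_features=80, n_informative=5,
                           weights=[2 / 3, 1 / 3], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          test_size=0.33, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("accuracy:", accuracy_score(y_te, pred))
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
print("confusion matrix (rows = true class):")
print(confusion_matrix(y_te, pred))
```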


The bigger issue might be the small n. With 60 samples and a 2:1 ratio, you have only 20 samples in the minority class. Generalization, no matter what machine learning technique is used, will be limited with just 20 positive samples.
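A quick way to see this is to repeat the cross-validation with many different splits and look at how much the AUC moves around. A rough sketch on synthetic data of the same shape:

```python
# With n=60 and ~20 positives, the AUC estimate is noisy: repeated
# stratified CV over many random splits gives a wide spread of scores.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=60, n_features=80, n_informative=5,
                           weights=[2 / 3, 1 / 3], random_state=2)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=2)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
print(f"AUC mean {scores.mean():.2f}, std {scores.std():.2f}, "
      f"range [{scores.min():.2f}, {scores.max():.2f}]")
```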
