Will oversampling help with generalization (small imbalanced dataset)?

I have an imbalanced dataset (2:1 ratio) with about 60 patients and 80 features.

I performed Recursive Feature Elimination (RFE) with stratified cross-validation to reduce the features to 15, and I get an AUC of 0.9 with logistic regression and/or SVM. I don't fully trust that AUC because I think it won't generalize well with such a small positive class. So I was thinking of oversampling the minority class (K-means + PCA) and re-running the RFE approach. Would this help? Thanks.
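For concreteness, here is a minimal sketch of what I have in mind, on synthetic data of the same shape as mine. SMOTE is just a stand-in for the K-means + PCA oversampler, and it sits inside an imblearn pipeline so it is only fit on the training folds:

```python
# Sketch: oversample -> RFE to 15 features -> classifier, scored with
# AUC under stratified CV. The oversampler is a pipeline step, so
# synthetic samples never leak into the validation folds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Toy stand-in for the real data: 60 patients, 80 features, 2:1 classes
X, y = make_classification(n_samples=60, n_features=80, n_informative=10,
                           weights=[2 / 3, 1 / 3], random_state=0)

pipe = Pipeline([
    ("oversample", SMOTE(random_state=0)),  # applied to training folds only
    ("rfe", RFE(LogisticRegression(max_iter=1000), n_features_to_select=15)),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(f"AUC per fold: {np.round(scores, 3)}, mean: {scores.mean():.3f}")
```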

My question is more or less the same as this one: Why will the accuracy of a highly unbalanced dataset reduce after oversampling?, except that I use AUC instead of accuracy.

I have found a useful article on this here. It sounds like you have already done a decent amount, but the best question to ask first is whether you can get more data. If you can't get more data, then you'll have to try some of the things outlined in the article. In these situations, I find it useful to look at the confusion matrix alongside other metrics, since the accuracy metric can hide the underlying details. It's possible to get a high accuracy score but a poor confusion matrix simply because you accurately predict the majority class. Hope this helps.
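As a toy sketch (synthetic data, not your setup) of how accuracy can look fine while the confusion matrix shows the minority class being missed:

```python
# On an imbalanced toy set, accuracy alone can look decent while the
# confusion matrix exposes weak minority-class recall.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=60, n_features=80, n_informative=5,
                           weights=[2 / 3, 1 / 3], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          test_size=0.33, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("accuracy:", accuracy_score(y_te, pred))
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
print("confusion matrix (rows = true class):")
print(confusion_matrix(y_te, pred))
```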


The bigger issue might be the small n. With 60 samples and a 2:1 ratio, you have only 20 samples in the minority class. Generalization, no matter what machine learning technique is used, will be limited with just 20 positive samples.
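A quick way to see this is to repeat the cross-validation with many different splits and look at how much the AUC moves around. A rough sketch on synthetic data of the same shape:

```python
# With n=60 and ~20 positives, the AUC estimate is noisy: repeated
# stratified CV over many random splits gives a wide spread of scores.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=60, n_features=80, n_informative=5,
                           weights=[2 / 3, 1 / 3], random_state=2)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=2)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
print(f"AUC mean {scores.mean():.2f}, std {scores.std():.2f}, "
      f"range [{scores.min():.2f}, {scores.max():.2f}]")
```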
