How to increase accuracy on an imbalanced dataset (not precision)?

There's an imbalanced dataset in a Kaggle competition I'm trying. The target variable is binary and biased towards 0: 70% of the labels are 0 and 30% are 1. I tried several machine learning algorithms such as Logistic Regression, Random Forest, and Decision Trees, but all of them give an accuracy around 70%. It seems the models always tend to predict 0. So I tried several methods to get an unbiased dataset, like the following:

  1. Upsampling the dataset using SMOTE and other techniques.
  2. Undersampling the dataset.
  3. Changing the class weights of the model.
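For reference, the three approaches above can be sketched with plain scikit-learn (a minimal illustration on synthetic data standing in for the competition dataset; SMOTE itself lives in the separate imbalanced-learn package, so simple duplication is used here for upsampling):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Synthetic 70/30 imbalanced binary dataset (stand-in for the real data).
X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)

majority = X[y == 0]
minority = X[y == 1]

# 1. Upsampling: replicate minority samples until the classes are balanced.
#    (SMOTE, from imbalanced-learn, would synthesize new points instead
#    of duplicating existing ones.)
minority_up = resample(minority, n_samples=len(majority), replace=True,
                       random_state=0)
X_up = np.vstack([majority, minority_up])
y_up = np.array([0] * len(majority) + [1] * len(minority_up))

# 2. Undersampling: discard majority samples until the classes are balanced.
majority_down = resample(majority, n_samples=len(minority), replace=False,
                         random_state=0)
X_down = np.vstack([majority_down, minority])
y_down = np.array([0] * len(majority_down) + [1] * len(minority))

# 3. Class weights: keep the data as-is, but penalize errors on the
#    minority class more heavily during training.
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```

All three push the classifier away from "always predict 0", which is exactly why plain accuracy tends to drop while recall on the minority class improves.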

But all of these steps reduced the accuracy instead of increasing it. Area under the curve and precision improved, but unfortunately I have to increase the accuracy somehow to win the competition.

So I would really appreciate it if you could tell me about techniques to improve accuracy on an imbalanced dataset.

Topic imbalanced-data preprocessing visualization dataset

Category Data Science


Following your comment, I'll detail here (basically too long for a comment).

Accuracy may not be a good way to measure your model's performance. Imagine a problem with 99 '0's and 1 '1'. A model always guessing '0' will have 99% accuracy yet is useless, since you want to detect the '1'. A model flagging 10 samples as '1', including the real one, is far more useful, and has a much lower accuracy.
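The 99-to-1 example above in concrete numbers (a toy array, not data from the competition):

```python
from sklearn.metrics import accuracy_score, recall_score

# 99 negatives and 1 positive (the true '1' is the last sample).
y_true = [0] * 99 + [1]

# Model A: always guesses '0' -- 99% accurate, detects nothing.
always_zero = [0] * 100
acc_a = accuracy_score(y_true, always_zero)   # 0.99
rec_a = recall_score(y_true, always_zero)     # 0.0

# Model B: flags 10 samples as '1', including the real one.
ten_flags = [0] * 90 + [1] * 10
acc_b = accuracy_score(y_true, ten_flags)     # 0.91
rec_b = recall_score(y_true, ten_flags)       # 1.0
```

Model B is strictly worse on accuracy (0.91 vs 0.99) yet is the only one that actually finds the positive case.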

You then have to define your problem correctly and choose a metric accordingly. For example, one useful metric in such cases is AUC, since it is not affected by class imbalance.

So one method you could apply is to maximize AUC, and once you have found a good model, manually select the 30% highest-scored samples in your test set as predicted '1'. If that selection contains half of the true '1's, it can already be a really good result (depending on the problem difficulty), even though the accuracy would be far worse.
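A sketch of that workflow, again on synthetic 70/30 data (the model, dataset, and 30% cutoff are illustrative assumptions, not details from the competition):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 70/30 imbalanced binary dataset.
X, y = make_classification(n_samples=2000, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]

# AUC only cares about how predictions are ranked, so it is
# insensitive to the class ratio.
auc = roc_auc_score(y_te, scores)

# Manually take the 30% highest-scored test samples as predicted '1'
# and check how many true '1's that selection recovers.
k = int(0.3 * len(scores))
top_k = np.argsort(scores)[::-1][:k]
recall_at_30 = y_te[top_k].sum() / y_te.sum()
```

Here the "decision" step (which samples to call '1') is decoupled from the model: you train for ranking quality (AUC), then pick a cutoff that matches the known class proportion.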

You really have to adapt the metric you try to maximise to your problem: since there are more '0's than '1's here, accuracy looks good for a classifier that always guesses '0', and tuning your model on accuracy can push you toward exactly such a classifier.
