In which situations should we consider a dataset imbalanced?

I'm working on a classification problem. The target variable is binary (two classes, 0 and 1), and I have 8,161 samples in the training dataset. For each class, I have:

  • class 0: 6,008 samples (73.6% of the total)
  • class 1: 2,153 samples (26.4% of the total)

My questions are:

  • In this case, should I consider the dataset imbalanced?

  • If it is, should I process the data before using RandomForest to make predictions?

  • If it is not imbalanced, could somebody tell me in which situations (e.g., at what class ratio) a dataset should be considered imbalanced?

Tags: class-imbalance, random-forest, classification, machine-learning



Intuitively, a ~75/25 split of class labels does look imbalanced.

If you want to look at it more formally, you can run a hypothesis test. With a sample size of 8,161, take a 50/50 split as the null hypothesis, compute the probability of observing a count as extreme as 6,008 or more in one class (the p-value), and reject the null hypothesis if the p-value is low (below 0.05 or 0.01, per your choice).

This can be done using a binomial distribution.
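A minimal sketch of that test in Python, assuming SciPy 1.7+ for scipy.stats.binomtest:

    from scipy.stats import binomtest

    n = 8161  # total number of samples
    k = 6008  # samples observed in the majority class (class 0)

    # H0: the classes are balanced, i.e. each sample lands in class 0 with p = 0.5.
    result = binomtest(k, n, p=0.5, alternative="two-sided")
    print(f"p-value: {result.pvalue:.3e}")

With k = 6,008 out of 8,161, the p-value is astronomically small, so the 50/50 null hypothesis is rejected at any conventional significance level. Note that with a sample this large, even mild deviations from 50/50 will be flagged as significant.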


Imbalanced data is a hot topic, and in my opinion there are a couple of misconceptions around it.

For the metric: you should always be aware of what the metric you are using actually measures and of the ratio of the classes. For example, with your 73.6/26.4 split, a model that always predicts class 0 already scores 73.6% accuracy while having learned nothing.
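A hypothetical sketch of that accuracy trap, using sklearn's DummyClassifier as the always-predict-the-majority baseline:

    import numpy as np
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import accuracy_score, balanced_accuracy_score

    # Labels with the question's 6,008 / 2,153 split; features don't matter
    # for a majority-class baseline, so a dummy column suffices.
    y = np.array([0] * 6008 + [1] * 2153)
    X = np.zeros((len(y), 1))

    baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
    pred = baseline.predict(X)

    print(accuracy_score(y, pred))           # ~0.736: looks decent, learned nothing
    print(balanced_accuracy_score(y, pred))  # 0.5: exposes the useless baseline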

For the optimization: unless your ratio is more extreme than 1:999, I wouldn't consider changing the optimization of your algorithm (and in my opinion you should never use synthetic generators such as SMOTE). If that is your case, I recommend having a look at the paper "Practical Lessons from Predicting Clicks on Ads at Facebook".


I think you can speak of imbalanced targets whenever (in the case of a binary classification problem) the classes are not represented in a 50:50 manner, which is almost always the case.

With about 25/75 in your case, I would call this "imbalanced". There are some strategies to deal with the problem, such as (re)sampling so that you end up with a 50:50 balanced sample (essentially, you lose observations from the majority class here). Alternatively, you can use synthetic oversampling (SMOTE) and related techniques; a sketch of both follows below.
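A minimal sketch of both strategies, assuming the third-party imbalanced-learn package (pip install imbalanced-learn); the features here are random placeholders:

    import numpy as np
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.over_sampling import SMOTE

    rng = np.random.default_rng(0)
    X = rng.normal(size=(8161, 5))         # placeholder features
    y = np.array([0] * 6008 + [1] * 2153)  # the 73.6/26.4 split from the question

    # Undersampling: drop majority-class rows until the classes match (loses data).
    X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
    print(np.bincount(y_under))  # [2153 2153]

    # SMOTE: synthesize new minority rows by interpolating between nearest neighbours.
    X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)
    print(np.bincount(y_smote))  # [6008 6008]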

However, some packages come with built-in options to deal with imbalanced targets, e.g. sklearn's random forest (option class_weight). Check the docs.
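A short sketch of that built-in option, which re-weights the classes during training instead of resampling the data:

    from sklearn.ensemble import RandomForestClassifier

    # class_weight="balanced" weights each class inversely to its frequency,
    # so errors on class 1 count roughly 6008/2153 ≈ 2.8x as much as on class 0.
    clf = RandomForestClassifier(class_weight="balanced", random_state=0)
    # clf.fit(X_train, y_train)  # X_train / y_train: your own training data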
