Preferred approaches for imbalanced data
I am building a binary classification model with an imbalanced target variable (13% Class 1 vs. 87% Class 0). I am considering the following three options to handle the imbalance (sketched in code after the list):
- Option 1: Create a balanced training dataset with a 50% / 50% split of the target variable (e.g., by undersampling the majority class).
- Option 2: Sample the dataset as-is (i.e., 87% / 13% split) and use upsampling methods (e.g., SMOTE) to balance the target variable to a 50% / 50% split.
- Option 3: Use learning methods with hyperparameters that account for the imbalance, for example scale_pos_weight in XGBoost, or class_weight in LGBMClassifier and RandomForestClassifier.
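To make the three options concrete, here is a minimal sketch of what I have in mind, assuming scikit-learn, imbalanced-learn, and xgboost are available; the synthetic X, y are placeholders standing in for my real features and target:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

# Placeholder data with roughly the 87% / 13% split described above.
X, y = make_classification(n_samples=10_000, weights=[0.87, 0.13], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Option 1: undersample the majority class to get a 50/50 training set.
X_bal, y_bal = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)

# Option 2: keep all real samples and add synthetic minority samples via SMOTE.
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X_train, y_train)

# Option 3a: reweight the positive class inside the learner (XGBoost).
ratio = (y_train == 0).sum() / (y_train == 1).sum()  # ~87/13, roughly 6.7
xgb = XGBClassifier(scale_pos_weight=ratio).fit(X_train, y_train)

# Option 3b: equivalent idea via class_weight in a scikit-learn estimator.
rf = RandomForestClassifier(class_weight="balanced").fit(X_train, y_train)
```

Note that only the training set is resampled in Options 1 and 2; the test set keeps the original class distribution so that evaluation reflects the real-world imbalance.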
Assuming I have enough available data, is the first option always the best approach? What are the pros and cons of each of the three methods, especially the second and third? (I assume it is always preferable to avoid creating new synthetic samples.)
Topic: imbalanced-learn, smote, class-imbalance, classification
Category: Data Science