Preferred approaches for imbalanced data
I am building a binary classification model with an imbalanced target variable (13% Class 1 vs. 87% Class 0). I am considering the following three options to handle the imbalance (sketched in code after the list):
- Option 1: Create a balanced training dataset with a 50% / 50% split of the target variable (e.g., by undersampling the majority class).
- Option 2: Sample the dataset as-is (i.e., 87% / 13% split) and use upsampling methods (e.g., SMOTE) to balance the target variable to a 50% / 50% split.
- Option 3: Use learning methods with hyperparameters that account for the imbalance, for example scale_pos_weight in XGBoost, or class_weight in LGBMClassifier and RandomForestClassifier.
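To make the three options concrete, here is a minimal sketch of what I have in mind, assuming scikit-learn, imbalanced-learn, and xgboost are available; the synthetic X, y are placeholders standing in for my real features and target:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

# Placeholder data with roughly the 87% / 13% split described above.
X, y = make_classification(n_samples=10_000, weights=[0.87, 0.13], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Option 1: undersample the majority class to get a 50/50 training set.
X_bal, y_bal = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)

# Option 2: keep all real samples and add synthetic minority samples via SMOTE.
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X_train, y_train)

# Option 3a: reweight the positive class inside the learner (XGBoost).
ratio = (y_train == 0).sum() / (y_train == 1).sum()  # ~87/13, roughly 6.7
xgb = XGBClassifier(scale_pos_weight=ratio).fit(X_train, y_train)

# Option 3b: equivalent idea via class_weight in a scikit-learn estimator.
rf = RandomForestClassifier(class_weight="balanced").fit(X_train, y_train)
```

Note that only the training set is resampled in Options 1 and 2; the test set keeps the original class distribution so that evaluation reflects the real-world imbalance.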
Assuming I have enough available data, is the first option always the best approach? What are the pros and cons of each of the three methods, especially the second and third? (I assume it is always preferable to avoid creating new synthetic samples.)
Topic: imbalanced-learn, smote, class-imbalance, classification
Category: Data Science