Overfitted model produces similar AUC on test set, so which model do I go with?
I was trying to compare the effect of running GridSearchCV on a dataset that was oversampled before the cross-validation splits versus one oversampled only after the training folds are selected (i.e. inside the pipeline). The oversampling approach I used was random oversampling.
I understand that the first approach is wrong, since duplicates of observations the model has already seen bleed into the validation folds. I was just curious how much of a difference this causes.
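To make the comparison concrete, the leaky variant I mean looks roughly like this (a simplified sketch, not my exact code; X_train and y_train come from the 60/40 split described below):

# Leaky setup: oversample the whole training set *before* cross-validation,
# so duplicated minority rows can land in both the training and validation
# folds of each CV split.
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

ros = RandomOverSampler()
X_res, y_res = ros.fit_resample(X_train, y_train)  # duplicates minority rows

leaky_grid = GridSearchCV(RandomForestClassifier(),
                          param_grid={'n_estimators': [500, 1000]},
                          scoring='roc_auc', cv=5)
leaky_grid.fit(X_res, y_res)  # folds now share duplicated observations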
I generated a binary classification dataset with the following:
# Generate a binary classification dataset with a 5% minority class,
# 3 informative features, and 15% label noise via flip_y
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=3, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1,
                           weights=[0.95, 0.05],
                           flip_y=0.15,
                           class_sep=0.8)
I split this into a 60/40 train/test split and ran GridSearchCV with both approaches on a random forest model. I ended up with the following output based on best_estimator_ from each approach:
Best Params from Post-Oversampled Grid CV: {'n_estimators': 1000}
Best Params from Pre-Oversampled Grid CV: {'classifier__n_estimators': 500}
AUC of Post-Oversampled Grid CV - training set: 0.9996723239846446
AUC of Post-Oversampled Grid CV - test set: 0.6060618701968091
AUC of Pre-Oversampled Grid CV - training set: 0.6578310812852065
AUC of Pre-Oversampled Grid CV - test set: 0.6112671617024038
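The split and the AUC numbers above were produced along these lines (a simplified sketch rather than my exact code; best_post and best_pre stand in for the two best_estimator_ objects):

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# 60/40 split, stratified to preserve the 5% minority rate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    stratify=y)

# best_post / best_pre = best_estimator_ from the two GridSearchCV runs
for name, model in [('Post-Oversampled', best_post),
                    ('Pre-Oversampled', best_pre)]:
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f'AUC of {name} Grid CV - training set: {train_auc}')
    print(f'AUC of {name} Grid CV - test set: {test_auc}')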
As expected, the Post-Oversampled Grid CV training AUC is very high due to the leakage-driven overfitting. However, evaluating both models on the test set led to very similar AUCs (60.6% vs 61.1%).
I have two questions. First, why is this observed? I didn't assign a random_state to any of these steps and retried it many times, but I still end up with the same results. Second, in such a case, which is the better model to progress with, given that they produce similar results on the test set?
For oversampling and handling it through the pipeline, I made use of imblearn:
# imblearn functions
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline as Imb_Pipeline
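The non-leaky setup then chains the sampler and the classifier so that oversampling happens only on the training folds inside each CV split, roughly like this (a simplified sketch; the classifier__n_estimators key matches the output above):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Sampler + classifier chained in an imblearn pipeline: the sampler is
# applied only to the training folds of each CV split, never to the
# validation fold, so no duplicated rows leak across folds.
pipe = Imb_Pipeline([('sampler', RandomOverSampler()),
                     ('classifier', RandomForestClassifier())])

pipe_grid = GridSearchCV(pipe,
                         param_grid={'classifier__n_estimators': [500, 1000]},
                         scoring='roc_auc', cv=5)
pipe_grid.fit(X_train, y_train)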
Happy to share my code if needed. Thanks.
Topic gridsearchcv overfitting sampling class-imbalance random-forest
Category Data Science