XGBoost: how to adjust the probabilities of a binary classifier to match training data?

Question

Henrique Nader

August 18, 2020, 16:28

Training and testing data have around 1% positives, but the model predicts only around 0.1% as positives.

The model is an xgboost classifier.

I’ve tried calibration but it didn’t improve much. I also don’t want to pick thresholds since the final goal is to output probabilities.

What I want is for the model to have a number of classified positives similar to the number of positives in the actual data.

Topic probability-calibration xgboost python machine-learning

Category Data Science

lcrmorin · Accepted Answer · August 18, 2020, 16:28

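The first (and easiest) option is to make sure that your model outputs probabilities in the first place. In Python, this means setting the objective to binary:logistic when you configure the model (the objective parameter of XGBClassifier, or the objective entry in the params passed to xgboost.train), not in the fit call itself.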

The alternative is to transform the output of your model into probabilities. There are different approaches for that.

This can be done with a regression that learns the relationship between your model's output and the observed frequency of positives. Isotonic regression (available in scikit-learn as IsotonicRegression) should work for that purpose. However, without more information on your score distribution, it may not work well.
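To make the isotonic idea concrete, here is a minimal, dependency-free sketch of the pool-adjacent-violators procedure that underlies isotonic regression. The function names (fit_isotonic, predict_isotonic) and the toy scores and labels are made up for illustration. Unlike scikit-learn's IsotonicRegression, this sketch returns a plain step function with no interpolation and no special handling of tied scores. In practice the mapping should be fitted on held-out data, not on the data the model was trained on.

```python
import bisect

def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function from scores to label frequencies."""
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block is [sum of labels, count, largest score in block]
    for s, y in pairs:
        blocks.append([float(y), 1, s])
        # Pool adjacent blocks while the earlier mean exceeds the later one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    uppers = [b[2] for b in blocks]
    means = [b[0] / b[1] for b in blocks]
    return uppers, means

def predict_isotonic(uppers, means, score):
    """Look up the block covering score; scores above the last block are clipped."""
    idx = bisect.bisect_left(uppers, score)
    return means[min(idx, len(means) - 1)]

# Hypothetical held-out scores and true labels.
scores = [0.001, 0.002, 0.004, 0.008, 0.015, 0.03, 0.06, 0.12]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

uppers, means = fit_isotonic(scores, labels)
calibrated = [predict_isotonic(uppers, means, s) for s in scores]
```

The fitted values are block averages of the labels, so on the fitting data the mean calibrated probability equals the observed positive rate, which is the property asked for in the question.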

This can also be achieved with Platt scaling: fit a logistic regression that maps your model's raw output to the true binary labels (0 and 1), then use its predictions as the calibrated probabilities. It is relatively easy to do, but in my experience it doesn't necessarily work well on unbalanced problems with non-linear relationships.
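As a rough illustration, Platt scaling in its usual form fits a one-dimensional logistic regression p = sigmoid(a * score + b) from the model's raw scores to the observed labels. The sketch below does this with plain gradient descent on the log loss; the function name fit_platt, the learning rate, the iteration count, and the toy data are all assumptions chosen for illustration. In practice, scikit-learn's CalibratedClassifierCV with method="sigmoid" is the more common route, fitted on held-out data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, labels, lr=0.5, n_iter=5000):
    """Fit a and b in p = sigmoid(a * score + b) by gradient descent on log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Hypothetical held-out raw scores (margins) and true labels.
scores = [-3.0, -2.5, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
labels = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

a, b = fit_platt(scores, labels)
calibrated = [sigmoid(a * s + b) for s in scores]
```

Because the fit includes an intercept b, the mean calibrated probability on the fitting data matches the observed positive rate once the optimization has converged. The mapping is a single sigmoid, which is one reason it can struggle when the true relationship between score and frequency is not sigmoid-shaped.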

Finally, some approaches simply correct the output, depending on your model. For logistic regression, that means changing the bias (intercept) term so that the overall predicted proportion matches that of your dataset. This can also be used to counter the effects of rare events (see this). I have found this to work quite well with logistic regression. I am not sure whether it is directly applicable to XGBoost, but it could be worth a try.
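One way to try the bias correction on an XGBoost model is to apply it after the fact: with the binary:logistic objective the predicted probability is a sigmoid of a summed margin, so adding a constant to the logit of each prediction plays the same role as changing the intercept of a logistic regression. The sketch below searches for that constant by bisection so that the mean adjusted probability equals a target rate. The function names, the search bounds, and the toy predictions are assumptions for illustration, and this is a post-hoc adjustment, not a change to the trained model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)  # keep the log finite at 0 and 1
    return math.log(p / (1.0 - p))

def find_logit_shift(probs, target_rate, lo=-20.0, hi=20.0, n_iter=100):
    """Bisection for delta so that mean(sigmoid(logit(p) + delta)) is target_rate."""
    logits = [logit(p) for p in probs]
    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        mean_p = sum(sigmoid(z + mid) for z in logits) / len(logits)
        # The mean is increasing in the shift, so bisection is valid.
        if mean_p < target_rate:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def shift_probs(probs, delta):
    return [sigmoid(logit(p) + delta) for p in probs]

# Hypothetical predictions averaging about 0.1%, against a 1% base rate.
raw = [0.0002, 0.0005, 0.001, 0.0008, 0.003, 0.0004, 0.0006, 0.0009]
delta = find_logit_shift(raw, target_rate=0.01)
adjusted = shift_probs(raw, delta)
```

The shift is monotone, so the ranking of the predictions is unchanged; only their overall level moves. It fixes the average predicted rate but does nothing about miscalibration that varies across the score range.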
