Predictions using a calibrated classifier

I find myself asking a lot of calibration-related questions recently, but I cannot find adequate material on the topic!

I am training a binary classifier to predict default. The predicted probability will be used as follows: the customers predicted to be class '1' form our target set, and we simply provide summary stats on that set. Since we sometimes move the decision threshold for class 1 away from 0.5 (lower or higher), we have decided that calibration is important. (Please correct me if that's wrong.)
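
To illustrate what I mean by moving the threshold, here is a minimal sketch (`clf` and `X_holdout` are stand-ins for a fitted model and the scoring data, not my actual names):

```python
# clf: any fitted classifier with predict_proba; X_holdout: data to score
proba = clf.predict_proba(X_holdout)[:, 1]  # P(class 1) = P(default)

# the target set grows or shrinks as the threshold moves away from 0.5
for threshold in (0.3, 0.5, 0.7):
    target = X_holdout[proba >= threshold]
    print(f"threshold {threshold}: {len(target)} customers targeted")
```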

Here is what I have done (a code sketch follows this list):

  1. split into train, validation, and test sets (I undersampled the majority class on x_train only)
  2. built an sklearn pipeline that imputes missing values with 0 and applies a feature selection method
  3. calibrated the model coming out of the pipeline, taking care to fit the calibrator on x_validation and to plot the calibration curve on x_test, transforming both sets through the pipeline
  4. assessed the results below
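
For concreteness, my setup is roughly the sketch below. The undersampler, `SelectKBest`, and all hyperparameters are placeholders rather than my exact code, and `X`/`y` stand in for my data:

```python
from imblearn.under_sampling import RandomUnderSampler  # assuming imblearn for undersampling
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# X, y: feature matrix and default labels (assumed available)

# 1. train / validation / test split; undersample the majority class on train only
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)
X_train, y_train = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)

# 2. pipeline: impute missing values with 0, select features, fit xgboost
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=0)),
    ("select", SelectKBest(f_classif, k=20)),  # stand-in for my feature selection step
    ("model", XGBClassifier(eval_metric="logloss")),
])
pipe.fit(X_train, y_train)

# 3. calibrate the already-fitted pipeline on the validation set
#    (cv="prefit" reuses the fit from step 2; newer sklearn versions deprecate
#    it in favour of FrozenEstimator, but the idea is the same)
cal_sigmoid = CalibratedClassifierCV(pipe, method="sigmoid", cv="prefit").fit(X_val, y_val)

# 4. reliability on the untouched test set: mean predicted probability per bin
#    vs. actual fraction of defaulters
proba_test = cal_sigmoid.predict_proba(X_test)[:, 1]
frac_pos, mean_pred = calibration_curve(y_test, proba_test, n_bins=10)
```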

My results are as follows:

  1. uncalibrated: accuracy 0.90, recall 0.60
  2. sigmoid: accuracy 0.95, recall 0.50
  3. isotonic: accuracy 0.95, recall 0.50

When I run predictions on a holdout set to see which customers would be classified as defaulters, the uncalibrated model predicts the most (50% of customers), then sigmoid (1%), then isotonic (0.5%).
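
Those proportions come from counting class-1 predictions at the 0.5 threshold, along these lines (continuing the sketch above; `X_holdout` stands in for my holdout set):

```python
# isotonic counterpart, calibrated on the same validation set
cal_isotonic = CalibratedClassifierCV(pipe, method="isotonic", cv="prefit").fit(X_val, y_val)

for name, model in [("uncalibrated", pipe),
                    ("sigmoid", cal_sigmoid),
                    ("isotonic", cal_isotonic)]:
    pos_rate = (model.predict_proba(X_holdout)[:, 1] >= 0.5).mean()
    print(f"{name}: {pos_rate:.1%} predicted as defaulters")
```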

Note: in the calibration plot (not shown here), blue = uncalibrated, yellow = sigmoid, red = isotonic; the x-axis is the mean predicted probability and the y-axis is the actual fraction of positives. My questions are:

  1. Why has calibration led to fewer customers being predicted as defaulters (i.e. class 1)? Is this because the uncalibrated model was 'overconfident', as the plot suggests?
  2. The sole aim is to provide summary stats on the customers predicted to be class 1. Is calibration even necessary in this case? Would it be necessary if we then decide to move the threshold away from 0.5?

Tags: probability-calibration, probability, xgboost, classification, machine-learning
