Optimising directly for the Brier score with a custom objective gives a worse Brier score than the default objective - what does that tell me?

I am training an XGBoost model, and since I care most about the resulting probabilities rather than the classification itself, I chose the Brier score as my metric, so that the probabilities would be well calibrated. I tuned my hyperparameters using GridSearchCV, scoring on the Brier score. Here's an example of a tuning step:

from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

model = XGBClassifier(learning_rate=0.1, n_estimators=200, gamma=0, subsample=0.8,
                      colsample_bytree=0.8, scale_pos_weight=1, verbosity=1, seed=0)
parameters = {'max_depth': [3, 5, 7],
              'min_child_weight': [1, 3, 5]}
# GridSearchCV maximises its score, so the negated Brier score is used
# ('neg_brier_score' replaces the deprecated 'brier_score_loss' scorer name)
gs = GridSearchCV(model, parameters, scoring='neg_brier_score', n_jobs=1, cv=cv)
gs_results = gs.fit(X_train, y_train)

Finally, I train my main model with the chosen hyperparameters in two ways:

optimising for a custom objective (Brier), using a custom brier_error function as the evaluation metric

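The brier and brier_error functions themselves are not shown here; a minimal sketch of what they could look like, assuming the squared-error loss (p - y)^2 on p = sigmoid(raw margin), with the hessian clipped because the exact second derivative can go negative:

import numpy as np
from scipy.special import expit  # sigmoid

def brier(y_true, y_pred):
    # custom objective: y_pred are raw margins z, p = sigmoid(z), loss = (p - y)^2
    p = expit(y_pred)
    grad = 2 * (p - y_true) * p * (1 - p)                                 # dL/dz
    hess = 2 * p * (1 - p) * (p * (1 - p) + (p - y_true) * (1 - 2 * p))  # d2L/dz2
    return grad, np.maximum(hess, 1e-6)  # clip: the exact hessian can be negative

def brier_error(y_pred, dtrain):
    # custom eval metric (old sklearn-API signature func(preds, DMatrix)):
    # with a custom objective, y_pred are raw margins, so apply the sigmoid
    y_true = dtrain.get_label()
    p = expit(y_pred)
    return 'brier_error', float(np.mean((p - y_true) ** 2))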
from sklearn.metrics import brier_score_loss, roc_auc_score

model1 = XGBClassifier(objective=brier, learning_rate=0.02, n_estimators=2000, max_depth=5,
                       min_child_weight=1, gamma=0.3, reg_lambda=20, subsample=1, colsample_bytree=0.6,
                       scale_pos_weight=1, seed=0, disable_default_eval_metric=1)
model1.fit(X_train, y_train, eval_metric=brier_error, eval_set=[(X_train, y_train), (X_test, y_test)],
           early_stopping_rounds=100)
y_proba1 = model1.predict_proba(X_test)[:, 1]
brier_score_loss(y_test, y_proba1) # 0.005439
roc_auc_score(y_test, y_proba1) # 0.8567

optimising for the default binary:logistic objective, with AUC as the evaluation metric

model2 = XGBClassifier(learning_rate=0.02, n_estimators=2000, max_depth=5,
                       min_child_weight=1, gamma=0.3, reg_lambda=20, subsample=1, colsample_bytree=0.6,
                       scale_pos_weight=1, seed=0, disable_default_eval_metric=1)
model2.fit(X_train, y_train, eval_metric='auc', eval_set=[(X_train, y_train), (X_test, y_test)],
           early_stopping_rounds=100)
y_proba2 = model2.predict_proba(X_test)[:, 1]
brier_score_loss(y_test, y_proba2) # 0.004914
roc_auc_score(y_test, y_proba2) # 0.8721

I would expect the Brier score to be lower for model1, since we optimise directly for it, but apparently that is not the case (see the results above). What does this tell me? Is optimising the Brier score somehow harder? Should I use more boosting rounds? (Although the hyperparameters were found by grid search scored on the Brier score...) Can it be explained by the data distribution? (e.g. could such an issue occur with unbalanced classes or something like that?) I have no idea where this comes from, but there is probably a reason behind it.

Topic: objective-function, machine-learning-model, xgboost, optimization

Category: Data Science


One thing you can do to try to optimize the Brier score, which is often done in Kaggle competitions, is to optimize another loss and do early stopping based on the Brier score.

One example would be minimizing the classic binary logistic loss while plotting the Brier score at each iteration.

The binary logistic loss will keep being minimized, but the Brier score doesn't have to be. At some point the Brier score can start increasing, and that is when you stop training, rather than stopping based on the binary classification results.

You could run this experiment with different loss functions and see which one performs best.
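In XGBoost's sklearn API, as used in the question, the idea could look roughly like the sketch below, reusing the question's train/test split. It assumes that with the built-in binary:logistic objective, the predictions passed to a custom metric are probabilities:

import numpy as np
from xgboost import XGBClassifier

def brier_eval(y_pred, dtrain):
    # with binary:logistic, y_pred are probabilities, so this is the Brier score
    return 'brier', float(np.mean((y_pred - dtrain.get_label()) ** 2))

model = XGBClassifier(objective='binary:logistic', learning_rate=0.02, n_estimators=2000, seed=0)
model.fit(X_train, y_train,
          eval_metric=brier_eval,       # monitor Brier, not logloss/AUC
          eval_set=[(X_test, y_test)],  # validation set for stopping
          early_stopping_rounds=100)    # stop once Brier stops improving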

Here are some links about early stopping:

https://ai.stackexchange.com/questions/16/what-is-early-stopping-in-machine-learning

https://www.kaggle.com/vincentf/early-stopping-for-xgboost-python

Is there a way to change the metric used by the Early Stopping callback in Keras?


The Brier score has known shortcomings for very rare or very frequent events.

The binary logistic objective function is relatively more robust to rare or frequent events than the Brier score is.

It is possible that the difference in performance between the two objective functions is due to the event frequency in the data set.
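For intuition, with hypothetical numbers: at a 0.5% event rate, the trivial model that always predicts 0 already achieves a Brier score of 0.005, which is in the same range as the scores reported in the question, so small Brier differences can be hard to interpret on such data:

import numpy as np
from sklearn.metrics import brier_score_loss

y = np.zeros(10_000)
y[:50] = 1  # hypothetical 0.5% positive rate
print(brier_score_loss(y, np.zeros_like(y)))  # 0.005 - the event rate itself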
