XGBoost Log Loss different from GridSearchCV Log Loss

I have a binary classification problem: I am trying to predict whether each row is a 1 or a 0. I have split my data into the independent variables (the features I train on) and the dependent variable (the target I am predicting, either 0 or 1). I am using log loss as the scoring metric for my model.

First, I use the cv function in xgboost to determine the number of estimators I need; it stops once the log loss has not improved for 50 rounds. I then train my model and predict. My code is below:

import xgboost as xgb
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

def modelfit(alg, dtrain, dtarget, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    if useTrainCV:
        # get the xgboost-specific parameters from the sklearn wrapper.
        xgb_param = alg.get_xgb_params()
        # DMatrix is xgboost's internal data structure, used for efficiency. It pairs the training data with the labels.
        xgtrain = xgb.DMatrix(dtrain.values, label=dtarget)
        # cross-validate on the dataset. As our data is not really time dependent we can afford to cross
        # validate. It stops when the log loss hasn't improved for 50 rounds. This is only for determining n_estimators.
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
            metrics='logloss', early_stopping_rounds=early_stopping_rounds)
        print(f'Optimal n_estimators - {cvresult.shape[0]}')
        # set the optimal n_estimators found by cross-validation on the classifier.
        alg.set_params(n_estimators=cvresult.shape[0])
    # fit the algorithm on the data and set the evaluation metric
    alg.fit(dtrain.values, dtarget, eval_metric='logloss', eval_set=[(dtrain.values, dtarget)])
    # predict on the training set:
    dtrain_predictions = alg.predict(dtrain.values)
    dtrain_predprob = alg.predict_proba(dtrain.values)[:,1]
    # print model report (note: this is the log loss on the training data itself):
    print('\nModel Report')
    print('Log Loss Score (Train): %f' % metrics.log_loss(dtarget, dtrain_predprob))

I then run this function on this particular XGBoostClassifier:

# Choose all predictors
xgb1 = XGBClassifier(
    learning_rate=0.1,
    max_depth=5,
    min_child_weight=1,
    objective='binary:logistic',
    # ... (other parameters not shown)
)

modelfit(xgb1, X, y)

The returned log loss is 0.577496 and the number of estimators is 65.

I then turn to GridSearchCV to tune the other parameters and I start with:

param_test1 = {
    'max_depth': range(1, 10),
    'min_child_weight': range(1, 6)
}

Note that these ranges contain the max_depth and min_child_weight values I used in the xgb1 classifier.

xgb2 = XGBClassifier(
    learning_rate=0.1,
    objective='binary:logistic',
    # ... (other parameters not shown)
)

gsearch1 = GridSearchCV(
    estimator=xgb2,
    param_grid=param_test1, scoring='neg_log_loss', n_jobs=-1, cv=5
)

gsearch1.fit(X, y)
gsearch1.best_params_, gsearch1.best_score_

However, this returns:

{'max_depth': 1, 'min_child_weight': 1}, -0.6275341839742403

So my question is: why does the grid search report max_depth = 1 and min_child_weight = 1 as the best parameters, with a log loss of 0.628, when my model previously returned a better log loss of 0.577 with max_depth = 5 and min_child_weight = 1, before I used GridSearchCV?

Any help would be appreciated, please. Thanks.


Your modelfit prints the training score: the model is evaluated on the same data it was fitted on. GridSearchCV, by contrast, bases its decisions on out-of-fold scores, and best_score_ is the average out-of-fold score for the best parameter combination. The comparison is therefore unfair, and your 0.577 is probably quite optimistically biased.
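
To see the gap for yourself, you can compute both numbers for one and the same model. The following is only a rough sketch under some assumptions: it uses synthetic data from make_classification in place of your X and y, and scikit-learn's GradientBoostingClassifier as a stand-in for XGBClassifier so that it runs without xgboost installed. The names X_demo, y_demo, train_loss and cv_loss are made up for illustration. The same comparison should work with your xgb1 and your own data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_score

# Synthetic, noisy binary data (hypothetical stand-in for X, y in the question).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    flip_y=0.2, random_state=0,
)

# Stand-in for XGBClassifier; the comparison applies to any probabilistic classifier.
model = GradientBoostingClassifier(
    n_estimators=65, learning_rate=0.1, max_depth=5, random_state=0
)

# 1) Training log loss: fit and score on the same rows (what modelfit prints).
model.fit(X_demo, y_demo)
train_loss = log_loss(y_demo, model.predict_proba(X_demo)[:, 1])

# 2) Out-of-fold log loss: each fold is scored by a model that never saw it.
#    scoring="neg_log_loss" returns negated losses, so flip the sign.
cv_scores = cross_val_score(model, X_demo, y_demo, scoring="neg_log_loss", cv=5)
cv_loss = -cv_scores.mean()

print(f"Train log loss:       {train_loss:.4f}")
print(f"Out-of-fold log loss: {cv_loss:.4f}")
```

The out-of-fold number is the one that is comparable to gsearch1.best_score_ (after flipping the sign). Deep trees tend to look much better on the training data than on held-out folds, which would also explain why the grid search can prefer max_depth = 1.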

