XGBoost Log Loss different from GridSearchCV Log Loss
I have a classification problem where I am trying to predict whether the data returns a 1 or 0, so your classic binary classification. I have split my data into the independent variables (the features I am training on) and the dependent variable (the target I am predicting, either 0 or 1). I am using log loss as the scoring metric for my model.
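For clarity, this is the metric I mean; a minimal sketch of computing log loss with sklearn (the numbers here are made up):

from sklearn.metrics import log_loss

# y_true holds the 0/1 labels, y_prob the predicted probabilities of class 1
y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.8, 0.65, 0.3]

# lower is better; confident wrong predictions are penalised heavily
print(log_loss(y_true, y_prob))  # roughly 0.28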
Firstly, I am using the cv function in xgboost to figure out the number of estimators I need, since it stops when the log loss has not improved for 50 rounds. I then train my model and predict. My code is below:
import xgboost as xgb
from sklearn import metrics

def modelfit(alg, dtrain, dtarget, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    if useTrainCV:
        # get the xgb-specific parameters from the sklearn wrapper
        xgb_param = alg.get_xgb_params()
        # DMatrix is xgboost's internal data structure, used for efficiency;
        # we map the training data to the labels
        xgtrain = xgb.DMatrix(dtrain.values, label=dtarget)
        # cross-validate on the dataset (as our data is not really time dependent, we can
        # afford to cross validate). It stops when the log loss has not improved for
        # `early_stopping_rounds` rounds. This is only for determining n_estimators.
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'],
                          nfold=cv_folds, metrics='logloss',
                          early_stopping_rounds=early_stopping_rounds)
        print(f'Optimal n_estimators - {cvresult.shape[0]}')
        # set the optimal n_estimators on the booster
        alg.set_params(n_estimators=cvresult.shape[0])

    # fit the algorithm on the data and set the evaluation metric
    alg.fit(dtrain.values, dtarget, eval_metric='logloss', eval_set=[(dtrain.values, dtarget)])
    print(alg.evals_result())

    # predict the training set
    dtrain_predictions = alg.predict(dtrain.values)
    print(dtrain_predictions)
    dtrain_predprob = alg.predict_proba(dtrain.values)[:, 1]

    # print model report
    print('\nModel Report')
    print('Log Loss Score (Train): %f' % metrics.log_loss(dtarget, dtrain_predprob))
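For reference, my understanding is that cvresult returned by xgb.cv is a pandas DataFrame with one row per boosting round kept after early stopping, which is why I take its row count as the optimal n_estimators. A tiny standalone sketch on dummy data (the names here are just for illustration):

import numpy as np
import xgboost as xgb

# dummy data purely to inspect what xgb.cv returns
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 5)
y_demo = rng.randint(0, 2, 200)

dtrain_demo = xgb.DMatrix(X_demo, label=y_demo)
params = {'objective': 'binary:logistic', 'eta': 0.1, 'max_depth': 3}

cv_demo = xgb.cv(params, dtrain_demo, num_boost_round=200, nfold=5,
                 metrics='logloss', early_stopping_rounds=50, seed=27)

# one row per surviving boosting round; columns include 'test-logloss-mean'
print(cv_demo.shape[0])
print(cv_demo.tail())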
I then run this function on this particular XGBClassifier:
from xgboost import XGBClassifier

# Choose all predictors
xgb1 = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    scale_pos_weight=1,
    nthread=-1,
    seed=27)

modelfit(xgb1, X, y)
The log loss value returned is 0.577496 and the optimal number of estimators is 65.
I then turn to GridSearchCV to tune the other parameters and I start with:
param_test1 = {
    'max_depth': range(1, 10),
    'min_child_weight': range(1, 6)
}
Note that the original max_depth and min_child_weight values I used in the xgb1 classifier are contained within these ranges.
from sklearn.model_selection import GridSearchCV

xgb2 = XGBClassifier(
    learning_rate=0.1,
    n_estimators=65,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    nthread=-1,
    scale_pos_weight=1,
    seed=27)

gsearch1 = GridSearchCV(
    estimator=xgb2,
    param_grid=param_test1,
    scoring='neg_log_loss',
    n_jobs=-1,
    cv=5)

gsearch1.fit(X, y)
gsearch1.best_params_, gsearch1.best_score_
However, this returns:

({'max_depth': 1, 'min_child_weight': 1}, -0.6275341839742403)
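For completeness, here is how I can list the mean cross-validated score for every parameter combination (a small sketch assuming the standard cv_results_ attribute of GridSearchCV):

# mean neg_log_loss across the 5 folds for each (max_depth, min_child_weight) pair
for params, mean_score in zip(gsearch1.cv_results_['params'],
                              gsearch1.cv_results_['mean_test_score']):
    print(params, mean_score)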
So my question is: how can the grid search report the best parameters as max_depth = 1 and min_child_weight = 1 with a log loss of 0.628, when previously, before using GridSearchCV, my model returned a better log loss of 0.577 with max_depth = 5 and min_child_weight = 1?
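In case it helps with the comparison, this is a minimal sketch of how I could score the original settings (max_depth=5, min_child_weight=1, n_estimators=65) with the same cross-validated metric, assuming sklearn's cross_val_score and the same X and y as above:

from sklearn.model_selection import cross_val_score

# cross-validated log loss for the xgb2 settings, measured the same way
# GridSearchCV scores its candidates
scores = cross_val_score(xgb2, X, y, scoring='neg_log_loss', cv=5)
print(-scores.mean())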
Any help would be appreciated. Thanks.