LightGBM eval_set - what to do when I fit the final model (there's no test data left)

I'm using LightGBM's eval_set feature when fitting my model. This enables early stopping, which determines how many of the estimators are actually used.

callbacks = [lgb.early_stopping(80, verbose=0), lgb.log_evaluation(period=0)]

fit_params = {'callbacks': callbacks, 'eval_metric': 'auc',
              'eval_set': [(x_train, y_train), (x_test, y_test)],
              'eval_names': ['train', 'valid']}

lg = LGBMClassifier(n_estimators=5000, verbose=-1, objective='binary',
                    **{'scale_pos_weight': train_weight, 'metric': 'auc'})  # or 'binary_logloss'
This works great when doing cross validation and early stopping is triggered.
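For reference, inside each fold the call is roughly this (x_train/x_test here are just that fold's train and validation parts, so the names in fit_params line up):

lg.fit(x_train, y_train, **fit_params)
n_trees = lg.best_iteration_  # number of trees picked by early stopping on this fold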

But when I have finally selected a model and want to train it on the full data set, I have no test data left to trigger early stopping.

What's the accepted practice here? Can I use the holdout data?

Or shall I keep another set of data purely for the eval_set?

EDIT:

Come to think of it, is there data leakage if I pass my test data to eval_set during cross validation? Am I doing this all wrong?



It's always good practice to keep a completely unused evaluation data set for stopping your final model.

Repeating the early stopping procedure many times may result in the model overfitting the validation dataset. This can happen just as easily as overfitting the training dataset.

One approach is to use early stopping only once, after all other hyperparameters of the model have been chosen.
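As a rough sketch of that idea (X, y and the other names below are placeholders, and the hyperparameters are simply the ones from the question): fix everything else first, run one early-stopping fit on a small validation split carved out of the full training data, then refit on all of the data with the tree count that early stopping found.

import lightgbm as lgb
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split

# one final early-stopping run on a fresh split of the full training data
x_tr, x_val, y_tr, y_val = train_test_split(X, y, test_size=0.1, stratify=y, random_state=42)
tmp = LGBMClassifier(n_estimators=5000, objective='binary', verbose=-1)
tmp.fit(x_tr, y_tr, eval_set=[(x_val, y_val)], eval_metric='auc',
        callbacks=[lgb.early_stopping(80, verbose=0)])

# refit on everything, using the number of trees found above; no eval_set needed
final_model = LGBMClassifier(n_estimators=tmp.best_iteration_, objective='binary', verbose=-1)
final_model.fit(X, y)

The refit step is what answers the original question: the final model does not need an eval_set at all, because the number of estimators has already been pinned down.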

Another strategy may be to use a different split of the training dataset into train and validation sets each time early stopping is used.
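A minimal sketch of that variant (placeholder names again, reusing the imports from the sketch above): redraw the validation split for every early-stopping run, then combine the stopping points, for example by averaging them, before the final fit on all of the data.

import numpy as np

best_iters = []
for seed in range(5):
    # a different train/validation split for every early-stopping run
    x_tr, x_val, y_tr, y_val = train_test_split(X, y, test_size=0.1, stratify=y, random_state=seed)
    m = LGBMClassifier(n_estimators=5000, objective='binary', verbose=-1)
    m.fit(x_tr, y_tr, eval_set=[(x_val, y_val)], eval_metric='auc',
          callbacks=[lgb.early_stopping(80, verbose=0)])
    best_iters.append(m.best_iteration_)

n_estimators_final = int(np.mean(best_iters))  # averaging is one reasonable choice, not the only one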
