What is the proper way to use early stopping with cross-validation?

I am not sure what the proper way is to use early stopping with cross-validation for a gradient boosting algorithm. With a simple train/validation split, we can use the validation set as the evaluation dataset for early stopping, and when refitting we use the best number of iterations.

But in the case of cross-validation such as k-fold, my intuition would be to use the validation set of each fold as the evaluation dataset for early stopping, but that means the best number of iterations would differ from one fold to another. So when refitting, what do we use as the final best number of iterations? The mean?

Thanks!

Topic: early-stopping lightgbm xgboost cross-validation

Category: Data Science


I think this is well addressed in the answers to (and comments on) these related questions:

  1. https://stats.stackexchange.com/q/402403
  2. https://stats.stackexchange.com/q/361494

In my mind, the tl;dr summary as it relates to your question is that after cross-validation one could (or perhaps should) retrain a model on a single very large training set, keeping a small validation set in place to determine the iteration at which to stop early. While one can certainly think of ways to derive an early-stopping parameter from the cross-validation folds and then use all of the data to train the final model, it is not at all clear that this will result in the best performance. It seems reasonable to think that simply using cross-validation to test model performance and determine the other hyperparameters, and then retaining a small validation set to determine the early-stopping parameter for the final model training, may yield the best performance.
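For concreteness, here is a minimal sketch of that "retrain with a small held-out validation set" route, using LightGBM's scikit-learn interface. The synthetic data, the 10% validation split, the 50-round patience, and the other hyperparameters are all placeholders rather than recommendations, and the early-stopping callback signature can vary slightly between LightGBM versions:

```python
import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Placeholder data; substitute your full training data here.
X, y = make_regression(n_samples=5000, n_features=20, random_state=0)

# Keep a small validation set purely to pick the stopping iteration.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=0)

final_model = lgb.LGBMRegressor(n_estimators=10_000, learning_rate=0.05)
final_model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
print("stopped at iteration:", final_model.best_iteration_)
```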

If one wants to proceed as you suggest, by using cross-validation to train many different models on different folds, each set to stop early based on its own validation set, and then using those folds to determine an early-stopping parameter for a final model trained on all of the data, my inclination would be to use the mean, as you suggest (see the sketch below). This is just a hunch, and I have no evidence to support it (though it does seem to be an opinion mentioned in numerous reputable-seeming sources). If you are set on proceeding this way, I would suggest testing the performance of that choice against other candidates such as the maximum or minimum. I wouldn't take anyone's word for which is best unless they provide evidence for their assertion.
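A rough sketch of that fold-wise approach, again with LightGBM's scikit-learn interface; everything here (the synthetic data, the 5 folds, the patience, the learning rate) is illustrative only:

```python
import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=5000, n_features=20, random_state=0)

best_iters = []
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = lgb.LGBMRegressor(n_estimators=10_000, learning_rate=0.05)
    model.fit(
        X[train_idx], y[train_idx],
        eval_set=[(X[valid_idx], y[valid_idx])],
        callbacks=[lgb.early_stopping(stopping_rounds=50, verbose=False)],
    )
    best_iters.append(model.best_iteration_)

# Refit on all of the data with the mean best iteration (the choice debated above);
# the max, min, or median are the obvious alternatives to compare against.
final_model = lgb.LGBMRegressor(n_estimators=int(np.mean(best_iters)), learning_rate=0.05)
final_model.fit(X, y)
```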

Finally, I want to mention that if one is not necessarily interested in constructing a newly trained final model after cross-validation, but rather just wants to obtain predictions for a specific instance of a problem, a third route is to forgo training a final model altogether. By this I mean that one could train one model per fold during cross-validation and, while the cross-validation loop is running, record the predictions each fold's model makes for the test set. At the end of cross-validation, one is left with one trained model per fold (each with its own early-stopping iteration), as well as one list of test-set predictions per fold's model. Finally, one can average these predictions across folds to produce a final prediction list for the test set (or combine the per-fold prediction lists into a single one in any other way). A sketch follows.
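A sketch of that no-refit route, assuming `X`, `y`, and a separate `X_test` already exist (they are not defined here); the patience and hyperparameters are again placeholders:

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import KFold

test_preds = []
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = lgb.LGBMRegressor(n_estimators=10_000, learning_rate=0.05)
    model.fit(
        X[train_idx], y[train_idx],
        eval_set=[(X[valid_idx], y[valid_idx])],
        callbacks=[lgb.early_stopping(stopping_rounds=50, verbose=False)],
    )
    # Record this fold's predictions for the test set during the loop.
    test_preds.append(model.predict(X_test))

# Average the per-fold prediction lists into a single final prediction.
final_prediction = np.mean(test_preds, axis=0)
```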

Note: This response may be more appropriate as a comment since I don't provide an answer to the question, but it was a bit long for that.


I suspect this is a "no free lunch" situation, and the best thing to do is experiment with (subsets of) your data (or ideally, similar data disjoint from your training data) to see how the final model's ideal number of estimators compares to those of the cv iterations.

For example, if your validation performance rises sharply with additional estimators, then levels out, and finally decreases very slowly, then going too far isn't such a problem but cutting off early is. If instead your validation performance grows slowly to a peak but then plummets with overfitting, then you'll want to set a smaller number of estimators for the final model. And then there are all the other considerations for your model aside from straight validation score; maybe you're particularly averse to overfitting and want to set a smaller number of estimators, say the minimum among the cv iterations.
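One quick way to see which of those regimes you are in is to plot the per-iteration validation metric that LightGBM records during training. This assumes `model` is an LGBMRegressor that was fitted with an `eval_set` (so that `evals_result_` is populated); the `"valid_0"` and `"l2"` keys depend on the eval set name and the objective, so treat them as assumptions:

```python
import matplotlib.pyplot as plt

# Per-iteration validation loss recorded while fitting with an eval_set.
curve = model.evals_result_["valid_0"]["l2"]

plt.plot(range(1, len(curve) + 1), curve)
plt.axvline(model.best_iteration_, linestyle="--", label="best iteration")
plt.xlabel("boosting iteration")
plt.ylabel("validation l2")
plt.legend()
plt.show()
```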

Another wrench: with more data, your model may want more estimators than any of the cv-estimates. If you have the resources to experiment, also look into this.
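If you do have the resources, the experiment is cheap to sketch: train on growing fractions of the data against a fixed validation set and watch where early stopping kicks in. Here `X` and `y` are assumed to exist, and the fractions, patience, and hyperparameters are placeholders:

```python
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Fixed validation set; the training pool is then subsampled at several sizes.
X_pool, X_val, y_pool, y_val = train_test_split(X, y, test_size=0.1, random_state=0)

for frac in (0.25, 0.5, 0.75, 1.0):
    n = int(frac * len(X_pool))
    model = lgb.LGBMRegressor(n_estimators=10_000, learning_rate=0.05)
    model.fit(
        X_pool[:n], y_pool[:n],
        eval_set=[(X_val, y_val)],
        callbacks=[lgb.early_stopping(stopping_rounds=50, verbose=False)],
    )
    print(f"train fraction {frac:.2f}: stopped at iteration {model.best_iteration_}")
```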

Finally, you may consider leaving an early-stopping validation set aside even for the final model. That trades away some extra training data for the convenience of not needing to estimate the optimal number of estimators as above.
