I think much of this is addressed well in the answers to (and comments on) these related questions:
- https://stats.stackexchange.com/q/402403
- https://stats.stackexchange.com/q/361494
In my mind, the TL;DR as it relates to your question is that, after cross-validation, one could (or perhaps should) retrain a model on a single very large training set, with a small validation set left in place to determine the iteration at which to stop early. While one can certainly think of ways to derive an early stopping parameter from the cross-validation folds and then use all of the data to train the final model, it is not at all clear that this will give the best performance. It seems reasonable to use cross-validation to estimate model performance and tune the other hyperparameters, and then to retain a small validation set to determine the early stopping point when training the final model; that may well yield the best performance.
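For concreteness, here is a minimal sketch of that "retrain with a small held-out validation set" route. It assumes XGBoost purely for illustration; the synthetic data, parameter values, and split size are placeholders, and any gradient-boosting library with an early-stopping option would work the same way:

```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Placeholder data; in practice this would be all of your available training data.
X, y = make_regression(n_samples=1000, n_features=20, random_state=0)

# Hold out a small validation set solely to trigger early stopping.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1, random_state=0)

# Hyperparameters assumed to have been chosen already (e.g. via cross-validation).
params = {"objective": "reg:squarederror", "eta": 0.05, "max_depth": 4}

dtrain = xgb.DMatrix(X_tr, label=y_tr)
dval = xgb.DMatrix(X_val, label=y_val)

# Give the booster a generous round budget and stop once the validation
# metric has not improved for 50 consecutive rounds.
final_model = xgb.train(
    params,
    dtrain,
    num_boost_round=5000,
    evals=[(dval, "validation")],
    early_stopping_rounds=50,
    verbose_eval=False,
)
print("stopped at iteration:", final_model.best_iteration)
```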
If one wants to proceed as you suggest, i.e. use cross-validation to train many models on different folds, each stopping early based on its own validation fold, and then use those folds to determine an early stopping parameter for a final model trained on all of the data, my inclination would be to use the mean, as you suggest. This is just a hunch, and I have no evidence to support it (though it does seem to be the opinion in numerous reputable-seeming sources). If you are set on proceeding this way, I would suggest testing the performance of this choice against other candidates such as the max, the min, etc. I wouldn't take anyone's word for which is the best way to proceed unless they provide proof or evidence for their assertion.
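If you do go that route, the sketch below shows what it might look like, again assuming XGBoost with hypothetical parameters and synthetic data. The only substantive steps are collecting `best_iteration` from each fold, aggregating (the mean here, but max/min/median are equally easy to swap in), and retraining on all of the data for that fixed number of rounds with no early stopping:

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold

# Placeholder data and hyperparameters.
X, y = make_regression(n_samples=1000, n_features=20, random_state=0)
params = {"objective": "reg:squarederror", "eta": 0.05, "max_depth": 4}

best_iters = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    dtr = xgb.DMatrix(X[train_idx], label=y[train_idx])
    dva = xgb.DMatrix(X[val_idx], label=y[val_idx])
    bst = xgb.train(
        params,
        dtr,
        num_boost_round=5000,
        evals=[(dva, "validation")],   # each fold stops early on its own validation fold
        early_stopping_rounds=50,
        verbose_eval=False,
    )
    best_iters.append(bst.best_iteration)

# Aggregate the per-fold stopping points; the mean is the heuristic
# discussed above, and worth comparing against max/min/median.
n_rounds = max(1, int(round(np.mean(best_iters))))

# Retrain on all of the data for a fixed number of rounds (no early stopping).
final_model = xgb.train(params, xgb.DMatrix(X, label=y), num_boost_round=n_rounds)
```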
Finally, I want to mention that if one is not necessarily interested in training a new final model after cross-validation, but rather just wants predictions for a specific instance of a problem, a third route is to forgo training a final model altogether. By this I mean one can train one model per fold during cross-validation and, while the cross-validation loop is running, record the predictions each fold's model makes for the test set. At the end of cross-validation one is left with one trained model per fold (each with its own early stopping iteration) and one test-set prediction list per fold. Finally, one can average these predictions across folds to produce a single prediction list for the test set (or combine the per-fold prediction lists in any other way).
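A sketch of that third route, under the same hypothetical XGBoost setup, might look like the following; the key point is that no final model is ever trained, and only the per-fold test-set predictions are kept and averaged:

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, train_test_split

# Placeholder setup: a training set and a separate test set we want predictions for.
X, y = make_regression(n_samples=1200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=200, random_state=0)

params = {"objective": "reg:squarederror", "eta": 0.05, "max_depth": 4}
dtest = xgb.DMatrix(X_test)

fold_preds = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_train):
    dtr = xgb.DMatrix(X_train[train_idx], label=y_train[train_idx])
    dva = xgb.DMatrix(X_train[val_idx], label=y_train[val_idx])
    bst = xgb.train(
        params,
        dtr,
        num_boost_round=5000,
        evals=[(dva, "validation")],   # each fold's model stops early on its own fold
        early_stopping_rounds=50,
        verbose_eval=False,
    )
    # Record this fold's predictions for the test set inside the loop.
    # Depending on the XGBoost version, you may want to pass iteration_range
    # (or ntree_limit) to restrict prediction to the best iteration found.
    fold_preds.append(bst.predict(dtest))

# Average the per-fold prediction lists into a single prediction per test row.
final_pred = np.mean(fold_preds, axis=0)
```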
Note: This response may be more appropriate as a comment, since I don't directly answer the question, but it was a bit long for that.