Should hyperparameter optimisation run many trials with few epochs first, then a second round with a few models and many epochs?
Rather than a single hyperparameter search with, say, kt.tuners.RandomSearch that runs X model trials (e.g. 100) with Y epochs each (say 100, i.e. 10,000 epochs in total across all models), where Y is 'enough epochs per trial to give a good estimate of each model's quality' (option A), would it be more appropriate to split the experiment into two stages (option B)?
1. Run 2X model trials (200) with Y/4 epochs each (say 25), so we scan many more models for 5,000 total epochs.
2. Choose the top N models (say 5) and re-run these 5 best configurations with 1,000 epochs each, then pick the final model from the results of step 2. (A rough KerasTuner sketch of this two-stage setup is below.)
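For concreteness, here is roughly what I mean by option B in KerasTuner terms; `build_model`, the data arrays, directories and budgets are just placeholders for my actual setup, not a finished implementation:

```python
import keras_tuner as kt

# Stage 1: many random configurations, each trained for only a few epochs.
tuner = kt.tuners.RandomSearch(
    build_model,                     # placeholder: the usual hypermodel function build_model(hp)
    objective="val_loss",
    max_trials=200,                  # 2X trials
    directory="hp_search",
    project_name="stage1_short_runs",
)
tuner.search(
    x_train, y_train,                # placeholders for my training data
    epochs=25,                       # Y/4 epochs per trial
    validation_data=(x_val, y_val),
)

# Stage 2: re-train only the top-N configurations with a much larger epoch budget,
# then pick the final model from these long runs.
top_hps = tuner.get_best_hyperparameters(num_trials=5)
results = []
for hp in top_hps:
    model = build_model(hp)
    history = model.fit(
        x_train, y_train,
        epochs=1000,
        validation_data=(x_val, y_val),
    )
    # Score each candidate by its best validation loss over the long run.
    results.append((min(history.history["val_loss"]), model))

best_val_loss, best_model = min(results, key=lambda r: r[0])
```

The exact budgets and callbacks aren't the point; the question is whether this two-stage structure is sounder than one long search.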
I'm aware that changing hyperparameters also changes how quickly the loss converges, so options A and B suffer from the same fundamental flaw: a small fixed epoch budget can misrank models that converge slowly. But is B still a better option than A?
And are there other ways to account for this, e.g. loss values 'adjusted' for batch size (even if the right adjustment isn't really knowable without running the training anyway)? Or are we limited to 'blindly' comparing models on the raw loss, without allowing for the fact that one model may have a very low learning rate and ultimately be the better model, while another converges quickly, plateaus, and merely looks 'better' after 100 epochs?
Thanks!