Changing the batch size during training
The choice of batch size is, in some sense, a measure of stochasticity:
- On the one hand, smaller batch sizes make gradient descent more stochastic: SGD can deviate significantly from exact GD on the whole data set, but this allows for more exploration and, in some sense, performs a kind of Bayesian inference.
- On the other hand, larger batch sizes approximate the exact gradient better, but this makes it more likely to overfit the data or get stuck in a local optimum. Processing larger batches also speeds up computation on parallel architectures, but increases the demand for RAM or GPU RAM.
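To make "measure of stochasticity" a bit more concrete, here is a rough sketch of what I have in mind. It assumes the batch is drawn i.i.d. and uniformly from the data (with replacement), and the symbols $B$ (batch size), $\ell_i$ (per-example loss), $L$ (full-data loss) and $\Sigma(\theta)$ (covariance of per-example gradients) are my own notation:

```latex
g_B(\theta) = \frac{1}{B} \sum_{i \in \text{batch}} \nabla \ell_i(\theta),
\qquad
\mathbb{E}\left[ g_B(\theta) \right] = \nabla L(\theta),
\qquad
\operatorname{Cov}\left[ g_B(\theta) \right] = \frac{\Sigma(\theta)}{B}
```

So under these assumptions the mini-batch gradient is unbiased, and its noise shrinks like $1/B$ as the batch grows.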
A sensible strategy would seem to be to start with smaller batch sizes, to get a lot of exploration in the initial stages, and then gradually increase the batch size to fine-tune the model.
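To illustrate what I mean, here is a minimal sketch of such a schedule on a toy linear regression with plain NumPy. Everything in it is an assumption made for illustration: the function names `batch_size_for_epoch` and `train_linear`, the geometric growth from 8 to 256, the fixed learning rate, and the synthetic data. It is not taken from any library or paper.

```python
import numpy as np

def batch_size_for_epoch(epoch, n_epochs, start=8, end=256):
    """Hypothetical schedule: grow the batch size geometrically from start to end."""
    if n_epochs <= 1:
        return end
    frac = epoch / (n_epochs - 1)          # 0.0 at the first epoch, 1.0 at the last
    return int(round(start * (end / start) ** frac))

def train_linear(X, y, n_epochs=20, lr=0.05, seed=0):
    """Mini-batch SGD on the loss 0.5 * mean squared error, with a growing batch size."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    sizes = []
    for epoch in range(n_epochs):
        bs = batch_size_for_epoch(epoch, n_epochs)
        sizes.append(bs)
        perm = rng.permutation(n)          # reshuffle the data every epoch
        for lo in range(0, n, bs):
            idx = perm[lo:lo + bs]
            residual = X[idx] @ w - y[idx]
            grad = X[idx].T @ residual / len(idx)
            w -= lr * grad
    return w, sizes

# Synthetic data (assumed for the example): y = X @ true_w + small noise
rng = np.random.default_rng(1)
X = rng.normal(size=(1024, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1024)

w, sizes = train_linear(X, y)
print(sizes)   # batch size used in each epoch
print(w)       # learned weights, to compare against true_w
```

Early epochs take many noisy steps with small batches, and late epochs take a few low-noise steps with large batches. The learning rate is deliberately kept constant here so that the batch size is the only thing changing.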
However, I have not seen this strategy implemented in practice. Did it turn out to be inefficient? Or does an appropriate choice of learning rate scheduler, together with Dropout, do this well enough?
Topic sgd mini-batch-gradient-descent bayesian gradient-descent
Category Data Science