Changing the batch size during training

Question

spiridon_the_sun_rotator

January 29, 2021, 11:30

The choice of batch size is, in some sense, a measure of stochasticity:

On the one hand, smaller batch sizes make gradient descent more stochastic: SGD can deviate significantly from exact GD on the whole dataset, but this allows for more exploration and, in some sense, performs Bayesian inference.
On the other hand, larger batch sizes approximate the exact gradient better, but one is then more likely to overfit the data or get stuck in a local optimum. Processing larger batches also speeds up computation on parallel architectures, but increases the demand for RAM or GPU RAM.

It seems that a sensible strategy would be to start with smaller batch sizes, to get a lot of exploration in the initial stages, and then gradually increase the batch size to fine-tune the model.

However, I have not seen this strategy implemented in practice. Did it turn out to be inefficient? Or does an appropriate choice of learning rate scheduler, combined with Dropout, do this well enough?
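To make the strategy from the question concrete, here is a minimal sketch of a batch size that starts small and grows in steps. The function names, the doubling rule and the default values are all assumptions chosen for illustration; nothing here comes from a particular library, and a real training loop would plug the batches into its own model update.

```python
def batch_size_for_epoch(epoch, initial=32, maximum=512, double_every=10):
    """Hypothetical schedule: double the batch size every `double_every`
    epochs, starting from `initial` and never exceeding `maximum`."""
    if epoch < 0 or double_every <= 0:
        raise ValueError("epoch must be >= 0 and double_every must be > 0")
    return min(maximum, initial * 2 ** (epoch // double_every))


def iterate_batches(data, batch_size):
    """Yield consecutive slices of `data` holding at most `batch_size` items."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


# Toy usage: the "training step" is replaced by counting batches.
data = list(range(100))
for epoch in (0, 10, 20):
    size = batch_size_for_epoch(epoch, initial=8, maximum=32, double_every=10)
    n_batches = sum(1 for _ in iterate_batches(data, size))
    print(f"epoch={epoch} batch_size={size} batches_per_epoch={n_batches}")
```

Early epochs therefore take many noisy steps per pass over the data, while later epochs take fewer, smoother ones.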

Topic sgd mini-batch-gradient-descent bayesian gradient-descent

Category Data Science

n1k31t4 · Accepted Answer · January 29, 2021, 11:30

Efficient use of resources

It is a balancing game with the learning rate, and one reason you don't normally see people do this is that you want to utilise as much of the GPU as possible.

It is commonly preferred to start with the maximum batch size you can fit in memory, then increase the learning rate accordingly. This also applies to "effective batch sizes": for example, when you have 4 GPUs, each running with batch_size=10, the effective batch size is 40, and you might use a global learning rate of nb_gpu * initial_lr (used with the sum or average over all 4 GPUs).
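The arithmetic for the 4-GPU example can be written out as a small sketch. The helper names and the value of initial_lr are assumptions made for illustration; only nb_gpu, batch_size and the nb_gpu * initial_lr rule come from the answer.

```python
def effective_batch_size(batch_size, nb_gpu):
    """Number of samples contributing to one global update across all GPUs."""
    return batch_size * nb_gpu


def global_learning_rate(initial_lr, nb_gpu):
    """Scale the single-GPU learning rate linearly with the number of GPUs."""
    return nb_gpu * initial_lr


nb_gpu = 4
batch_size = 10     # per-GPU batch size, as in the example above
initial_lr = 0.01   # assumed single-GPU learning rate, for illustration only

print(effective_batch_size(batch_size, nb_gpu))   # 40
print(global_learning_rate(initial_lr, nb_gpu))
```

Whether this linear scaling is appropriate depends on the model and on how the per-GPU results are combined, so treat it as a starting point rather than a rule.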

The final "best approach" is usually problem specific - small batch sizes might not work for GAN type models, and large batch sizes might be slow and sub-optimal for certain vision based tasks.

Friends don't let friends use large batch sizes

There is literature supporting the use of small batch sizes at almost all times. Although this idea was supported by Yann LeCun, opinions differ.

Super convergence

There are also other tricks you might consider if you are interested in faster convergence, such as learning rate cycling.
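As one illustration of learning rate cycling, the sketch below implements a triangular schedule that ramps linearly from a lower bound to an upper bound and back again. The function name, the bounds and the step size are assumptions chosen for the example, and this is only one of several possible cycling policies.

```python
import math


def triangular_lr(iteration, base_lr=0.001, max_lr=0.01, step_size=100):
    """Triangular cyclical learning rate (illustrative sketch).

    The rate rises linearly from `base_lr` to `max_lr` over `step_size`
    iterations, then falls back over the next `step_size` iterations,
    and the cycle repeats.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    # x is the normalised distance from the peak of the current cycle (0 at the peak).
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# Print the rate at a few points across one full cycle.
for it in (0, 50, 100, 150, 200):
    print(it, triangular_lr(it))
```

With these defaults the rate starts at base_lr, peaks at max_lr at iteration 100, and returns to base_lr at iteration 200.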
