Can we use decreasing step size to replace mini-batch in SGD?

Question

Can we use decreasing step size to replace mini-batch in SGD?

coolcat

2022年5月22日 14:01

As far as I know, mini-batch can be used to reduce the variance of the gradient, but I am also considering if we can achieve the same result if we use the decreasing step size and only single sample in each iteration? Can we compare the convergence rate of them?

Topic mini-batch-gradient-descent gradient-descent optimization machine-learning

Category Data Science

Abhishek Singla · Accepted Answer · 2020年7月23日 15:08

Main objective of mini-batch gradient descent is to achieve faster results over full-batch gradient descent as it will start learning weights before completion of one epoch. SGD will start learning earlier than Mini-batch, isn't it? But mini-batch reduces variance of the gradient compared to SGD.

Coming to the question, you're right it's possible to compare the convergence of both scenarios. People used to use SGD with decreasing step-size until Mini-batch algorithm came. Because in practice, Mini-batch gives better performance over SGD due to it's vectorisation property. This property helps in making the computation faster with comparable results wrt SGD.

mirror2image · Accepted Answer · 2019年2月28日 07:39

Generally answer is "it's not known". Similarity of effects of increasing minibatches size and decreasing learning rate is mostly empirical, there is no known asymptotic formula for it. Also effect of small LR and big minibatch is not the same. For example batch normalization layer would act completely different on those two approaches. Probabilistic distribution of gradients produced by minibatches and single sample (or mb of significantly different size) would be also quite different

Can we use decreasing step size to replace mini-batch in SGD?

About