Is it beneficial to use a batch size > 1 even when all computing power can be used?

Regarding neural network training, it is often said that increasing the batch size reduces the network's ability to generalize, as alluded to here. This is because training on large batches tends to make the network converge to sharp minima rather than wide ones, as explained here.

This raises the question: in situations where all available computing power is already saturated by training with a batch size of one, is there any benefit to using a batch size greater than one?

A situation like this would likely occur when training on a CPU, or when training a very large network on any hardware.
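One relevant consideration is gradient noise: averaging over a mini-batch reduces the variance of the gradient estimate, which is part of the sharp-versus-wide-minima discussion above. A minimal NumPy sketch (toy linear-regression data, invented for illustration) showing how the spread of the gradient estimate shrinks as the batch size grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2x + noise (hypothetical example problem)
X = rng.normal(size=(1000,))
y = 2.0 * X + rng.normal(scale=0.5, size=1000)

def grad_estimate(w, batch_size):
    """Gradient of the MSE loss on a random mini-batch at weight w."""
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx], y[idx]
    residual = w * xb - yb
    # d/dw mean((w*x - y)^2) = mean(2 * (w*x - y) * x)
    return np.mean(2.0 * residual * xb)

# Compare the spread of the gradient estimate at w = 0
# for batch size 1 versus batch size 32.
for bs in (1, 32):
    grads = [grad_estimate(0.0, bs) for _ in range(2000)]
    print(f"batch size {bs:>2}: gradient std = {np.std(grads):.3f}")
```

The standard deviation falls roughly as 1/sqrt(batch_size), so larger batches give smoother steps per update; whether that smoothing helps or hurts generalization is exactly what the question is about.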

Tags: training, gradient-descent, neural-network, optimization, machine-learning

Category: Data Science
