Understanding Learning Rate in depth
I am trying to understand why a given learning rate does not work universally. I have two different data sets and have tested three learning rates: 0.001, 0.01 and 0.1. For the first data set, optimization with stochastic gradient descent converged for all three learning rates.
For the second data set, training with a learning rate of 0.1 did not converge. I understand the logic of a too-large step overshooting the minimum; however, I fail to understand why this happened for one data set but not the other. I was unable to find much about this online, but I have been advised that it is due to the shape of the data and may require deeper insight into the data.
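To make the "data shape" advice concrete, here is a minimal sketch of how I currently read it. This is not my actual setup: the data is synthetic, the model is a one-parameter linear fit, and it uses plain full-batch gradient descent rather than SGD. All names (`run_gd`, `max_stable_lr`, `TRUE_W`, the two toy data sets) are made up for illustration. On a quadratic loss like this one, the largest stable step size is 2 divided by the curvature, and here the curvature is the mean of the squared inputs, so simply rescaling the features changes which learning rates converge.

```python
import math


def mse_loss(w, xs, ys):
    # Half mean squared error of the one-parameter model y_hat = w * x.
    total = 0.0
    for x, y in zip(xs, ys):
        r = w * x - y
        total += r * r
    return total / (2 * len(xs))


def gradient(w, xs, ys):
    # Derivative of mse_loss with respect to w.
    return sum(x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)


def run_gd(xs, ys, lr, steps=100):
    # Full-batch gradient descent starting from w = 0; returns the final loss.
    w = 0.0
    for _ in range(steps):
        w -= lr * gradient(w, xs, ys)
    return mse_loss(w, xs, ys)


def max_stable_lr(xs):
    # For this quadratic loss the curvature is mean(x^2); plain gradient
    # descent is stable only for lr < 2 / curvature.
    curvature = sum(x * x for x in xs) / len(xs)
    return 2.0 / curvature


TRUE_W = 2.0  # hypothetical ground-truth slope for the synthetic targets
base_xs = [-1.0, -0.5, 0.5, 1.0]
datasets = {
    "unit-scale": base_xs,
    "10x-scale": [10.0 * x for x in base_xs],  # same data, features scaled by 10
}

results = {}
for name, xs in datasets.items():
    ys = [TRUE_W * x for x in xs]
    start = mse_loss(0.0, xs, ys)
    print(f"{name}: stable only for lr < {max_stable_lr(xs):.3f}")
    for lr in (0.001, 0.01, 0.1):
        final = run_gd(xs, ys, lr)
        # "improved" means the loss stayed finite and ended below its start value.
        improved = math.isfinite(final) and final < start
        results[(name, lr)] = improved
        print(f"  lr={lr}: final loss={final:.3e}, improved={improved}")
```

In this toy case the stability threshold is 3.2 for the unit-scale data and 0.032 for the rescaled data, so 0.1 is fine for the first and blows up for the second, while 0.01 and 0.001 work for both. If this is the right picture, standardizing the features of my second data set might be worth trying, but I am not sure how far the analogy carries over to SGD on a deep network.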
If there is any relevant literature to read, pointers would be highly appreciated.
Topic sgd gradient-descent deep-learning optimization machine-learning
Category Data Science