Learning rate terminology: what is 'reducing' a learning rate?

I'm investigating a loss plateau and various techniques for overcoming it, which led me to this page and statement:

Models often benefit from reducing the learning rate by a factor of 2-10 once learning stagnates. This callback monitors a quantity and if no improvement is seen for a 'patience' number of epochs, the learning rate is reduced.

https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ReduceLROnPlateau

I'm confused by this terminology. If my learning rate is 0.001, am I reducing the learning rate when it moves to 0.01, or to 0.0001? I initially would have thought the latter, since 0.001 > 0.0001, but it doesn't make sense to me to move the learning rate to a smaller value when a model has reached a plateau, since you would end up making smaller changes to your model than before, which seems like it would make the situation worse.

Topic learning-rate tensorflow machine-learning

Category Data Science


The learning rate (proportionally) determines the step size that the optimizer takes in parameter space every time the model parameters are updated. A small (or "slow") learning rate means you take smaller steps in the expected optimal direction (the direction in which the loss function decreases most steeply). A larger learning rate means you take larger steps, so you learn faster, assuming your cost function is well-behaved and your optimizer is moving in the right direction.
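To make the role of the learning rate concrete, here is a minimal sketch of plain gradient descent on a made-up quadratic loss (nothing TensorFlow-specific; the loss and starting point are invented purely for illustration). The update subtracts the gradient scaled by the learning rate, so a smaller learning rate directly means smaller parameter changes per step:

def loss(w):
    return (w - 3.0) ** 2            # toy loss with its minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)           # derivative of the toy loss

w = 0.0                              # arbitrary starting point
learning_rate = 0.1                  # compare with 0.001 or 0.0001
for step in range(50):
    w -= learning_rate * grad(w)     # step size scales with the learning rate

print(w, loss(w))

With learning_rate = 0.0001 the same 50 steps barely move w at all, which is exactly the "smaller changes to your model than before" effect you describe.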

That said, as you observe in your post, increasing or decreasing the learning rate does not always make learning faster/slower or better/worse, because on messy, noisy loss surfaces (and many loss functions are like that) with many local optima, plateaus, and so on, changing the learning rate can have different effects.

I am not entirely sure about this part, but in theory, when you approach a local optimum and your learning rate is too large, the optimizer can jump back and forth across the actual minimum without ever reaching it, because the steps it takes are too large. Visually this shows up as a zig-zag pattern. So the idea is: as soon as you see the loss plateau with a little up-and-down movement, it may be advantageous to lower the learning rate, because the optimizer can then approach the minimum more closely instead of jumping around it. That might not be the only reason, though.
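In practice, that is what the callback you linked automates; here is a short sketch of wiring it up (the monitor, factor, patience, and min_lr values below are illustrative choices, not recommendations):

import tensorflow as tf

reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",   # quantity watched for stagnation
    factor=0.1,           # new_lr = old_lr * factor, e.g. 0.001 -> 0.0001
    patience=5,           # epochs with no improvement before reducing
    min_lr=1e-6,          # lower bound on the learning rate
)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[reduce_lr])

Since the factor is below 1, "reducing" here always means moving to a smaller value (0.001 -> 0.0001), never a larger one.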

All in all, that is why the learning rate is a hyperparameter that needs to be scanned over a range and tuned thoroughly in order to find a value that suits your cost function. I recommend the book Deep Learning by Goodfellow, Bengio, and Courville on this topic. They say that if you have many hyperparameters but only the resources to tune one, in most cases you should prioritize the learning rate; that is how important a hyperparameter it is.
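As a small self-contained illustration of such a scan, the sketch below reuses the toy quadratic loss from above and tries a few arbitrary candidate learning rates on a log scale; a value that is too small barely moves the parameter, while one that is too large oscillates and never settles:

def final_loss(learning_rate, steps=50):
    w = 0.0
    for _ in range(steps):
        w -= learning_rate * 2.0 * (w - 3.0)   # gradient descent on (w - 3)^2
    return (w - 3.0) ** 2

for lr in (1e-4, 1e-3, 1e-2, 1e-1, 1.0):
    print(lr, final_loss(lr))   # tiny lr: barely moves; lr = 1.0: oscillates around the minimum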
