How can I improve the learning rate of an MLP for regression when tanh is used as the activation function with the Adam solver?
I'm trying to use an MLP to approximate a smooth function f : R^3 -> R that takes a point in space as an argument and returns a scalar value.
The MLP architecture has a 3-dimensional input layer (for the 3 point coordinates), N hidden layers, and a single linear scalar output layer, since the output should be the function value:
    x   x         x
x   x   x         x
x   x   x   ...   x   x
x   x   x         x
    x   x         x
I'm using the tanh activation function because I want the model (MLP) to be continuously differentiable.
I'm playing with different hidden layer architectures, using the Adam solver, and I get the behavior shown in the attached MSE loss curves.
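For reference, this is roughly my setup - a minimal sketch assuming scikit-learn's MLPRegressor; the target function f and the sampled data below are placeholders, not my actual problem:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Placeholder for the smooth target function f : R^3 -> R
def f(X):
    return np.sin(X[:, 0]) * np.cos(X[:, 1]) + X[:, 2] ** 2

X = rng.uniform(-1.0, 1.0, size=(10_000, 3))   # points in R^3
y = f(X)                                       # scalar targets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

mlp = MLPRegressor(
    hidden_layer_sizes=(16, 16, 16, 16, 16),   # one of the architectures I tried
    activation="tanh",                         # continuously differentiable
    solver="adam",
    learning_rate_init=1e-3,                   # default Adam step size
    max_iter=2000,
)
mlp.fit(X_train, y_train)
print("validation R^2:", mlp.score(X_val, y_val))
# mlp.loss_curve_ holds the training MSE per iteration (what I plotted above)
```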
The maximum validation accuracy I get with this MSE loss is 99.982942% - is this generally considered accurate for regression?
For the network with hidden layers (16, 16, 16, 16, 16), the error stagnates, then drops and oscillates. I suspect the oscillation is due to vanishing gradients caused by the tanh activation function - is this true?
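To check the vanishing-gradient suspicion, this is how I imagine inspecting the per-layer gradient norms - a PyTorch sketch with placeholder data mirroring the (16, 16, 16, 16, 16) architecture; I'm not sure this is the right diagnostic:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder data: points in R^3 with a smooth scalar target (stand-in for my f)
X = torch.rand(1024, 3) * 2 - 1
y = (X[:, 0].sin() * X[:, 1].cos() + X[:, 2] ** 2).unsqueeze(1)

# 3 -> (16, 16, 16, 16, 16) -> 1, tanh hidden layers, linear scalar output
layers, in_dim = [], 3
for _ in range(5):
    layers += [nn.Linear(in_dim, 16), nn.Tanh()]
    in_dim = 16
layers.append(nn.Linear(in_dim, 1))
model = nn.Sequential(*layers)

# One forward/backward pass, then look at the gradient norm layer by layer;
# norms shrinking sharply towards the input layer would point to vanishing gradients
loss = nn.MSELoss()(model(X), y)
loss.backward()
for name, param in model.named_parameters():
    print(f"{name:12s} grad norm = {param.grad.norm().item():.3e}")
```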
How can I set or improve the learning rate? Are there techniques that prevent oscillations when the solver (Adam, SGD, ...) approaches the optimum?
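For context, this is the kind of adjustment I have in mind - a sketch only, using PyTorch's ReduceLROnPlateau to shrink the step size when the validation loss plateaus (as far as I can tell, scikit-learn's MLPRegressor only supports its adaptive schedule with the SGD solver); the data and schedule parameters below are placeholders:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder data, same shape of problem as above
X = torch.rand(10_000, 3) * 2 - 1
y = (X[:, 0].sin() * X[:, 1].cos() + X[:, 2] ** 2).unsqueeze(1)
X_train, X_val, y_train, y_val = X[:8000], X[8000:], y[:8000], y[8000:]

# Same architecture: tanh hidden layers, linear scalar output
layers, in_dim = [], 3
for _ in range(5):
    layers += [nn.Linear(in_dim, 16), nn.Tanh()]
    in_dim = 16
layers.append(nn.Linear(in_dim, 1))
model = nn.Sequential(*layers)

loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve the step size whenever the validation MSE stops improving,
# hoping to damp oscillations as the solver approaches the optimum
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=50
)

for epoch in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    optimizer.step()

    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val)
    scheduler.step(val_loss)   # pass the monitored metric to the scheduler
```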
Topic: mlp, learning-rate, regression
Category: Data Science