Why do gradient methods work for finding the parameters of neural networks?

After reading quite a lot of papers (20-30 or so), I feel that I am still not really understanding things.

Let us focus on supervised learning, for example. Given a training set $\mathcal{D}_{train}=\{(x_i^{train},y_i^{train})\}$ and a test set $\mathcal{D}_{test}=\{(x_i^{test},y_i^{test})\}$, where we assume the $y_i^{test}$ are unknown, the goal is to find a function $$ f_\theta(x), \qquad \text{such that} \quad f_\theta(x_i^{test}) \approx y_i^{test}. $$ To do this, we need a model for $f$; typically, a neural network is employed. Thus we have $$ f_\theta(x) = W^{(L+1)}\sigma(W^{(L)}\sigma(\cdots \sigma(W^{(1)}\sigma(W^{(0)}x+b^{(0)})+b^{(1)})\cdots )+b^{(L)})+b^{(L+1)}, $$ where $\theta = \{W^{(i)},b^{(i)}\}_{i=0}^{L+1}$; then $f_\theta$ is a neural network with $L+1$ hidden layers. In order to find $\theta$, one typically defines a loss function $\mathcal{L}$. One popular choice is $$ \mathcal{L}(\theta;\mathcal{D}_{train}):= \sum_{(x_i^{train},y_i^{train})\in \mathcal{D}_{train}} \left(f_\theta(x_i^{train}) - y_i^{train} \right)^2. $$ In order to find a $\theta^*$ which minimizes the loss function $\mathcal{L}$, the typical approach (and seemingly the only one) is to apply a gradient method.
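To make the setup concrete, here is a minimal NumPy sketch of this pipeline (one hidden layer, squared loss, plain gradient descent; the sizes, data, and learning rate are illustrative choices of mine, not taken from any particular paper):

```python
import numpy as np

# Minimal sketch of the setup above: a one-hidden-layer network
# f_theta(x) = W1 @ sigma(W0 @ x + b0) + b1, trained by plain gradient
# descent on the (mean) squared loss.
rng = np.random.default_rng(0)

# Toy training set: x in R, y = sin(x) + noise.
X = rng.uniform(-3, 3, size=(64, 1))
Y = np.sin(X) + 0.1 * rng.normal(size=X.shape)

# Parameters theta = {W0, b0, W1, b1}; hidden width 16.
W0 = rng.normal(size=(16, 1))
b0 = np.zeros((16, 1))
W1 = rng.normal(size=(1, 16))
b1 = np.zeros((1, 1))

lr = 0.1
for step in range(5000):
    # Forward pass.
    Z = W0 @ X.T + b0            # pre-activations, (16, 64)
    H = np.tanh(Z)               # hidden activations
    pred = W1 @ H + b1           # network output, (1, 64)
    resid = pred - Y.T
    loss = np.mean(resid**2)     # the squared loss above, divided by n

    # Backward pass (chain rule, written out by hand).
    dpred = 2.0 * resid / resid.size
    dW1 = dpred @ H.T
    db1 = dpred.sum(axis=1, keepdims=True)
    dH = W1.T @ dpred
    dZ = dH * (1.0 - H**2)       # tanh'(z) = 1 - tanh(z)^2
    dW0 = dZ @ X
    db0 = dZ.sum(axis=1, keepdims=True)

    # Gradient step: theta <- theta - lr * grad L(theta).
    W0 -= lr * dW0; b0 -= lr * db0
    W1 -= lr * dW1; b1 -= lr * db1

print(f"final training loss: {loss:.4f}")
```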

As far as I know, gradient methods do not guarantee convergence to the minimizer.
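A toy illustration (mine, for concreteness): even on the simplest convex function $f(x)=x^2$, the plain update $x \leftarrow x - \eta f'(x)$ converges or diverges depending on the step size $\eta$.

```python
# Gradient descent on f(x) = x^2, whose gradient is 2x.  Each step
# multiplies x by (1 - 2*lr), so the iterates converge for lr < 1
# and diverge for lr > 1.
def gradient_descent(x0, lr, steps=20):
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x   # x <- x - lr * f'(x)
    return x

print(gradient_descent(1.0, lr=0.1))  # ~0.01: heads to the minimizer 0
print(gradient_descent(1.0, lr=1.1))  # ~38: the iterates blow up
```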

However, it seems that a lot of research papers simply mention something like

> We apply a standard gradient method (e.g., Adam, Adadelta, Adagrad, etc.) to find the parameters.

It seems that we don't know whether those methods return the minimizer. This makes me think that all the papers relying on this argument (using the parameters found by gradient methods) might be wrong. Typically, their justification rests heavily on examples showing that it works well.

In addition to that, they sometimes mention that they tuned some parameters to run the gradient methods. What does "tune" mean? A gradient method depends highly on the initialization of the parameters $\theta^{(0)}$. If the initial choice were already close enough to the minimizer, i.e., $\theta^{(0)} \approx \theta^*$, it would not be surprising that it works well. But it seems that a lot of trial and error is necessary to find a proper (good and well-working) initialization. It sounds to me that they already found the good solution via trial and error, not via gradient methods. Thus "tuning" sounds to me like they already found a parameter that is close to $\theta^*$.
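As a toy illustration of this initialization dependence (again my own example): on a non-convex function, gradient descent started from different points settles into different local minima with different loss values.

```python
# Gradient descent on the non-convex f(x) = x^4 - 3x^2 + x.  The final
# point, and the value of f there, depend on where we start.
def descend(x, lr=0.01, steps=5000):
    for _ in range(steps):
        x -= lr * (4 * x**3 - 6 * x + 1)   # f'(x)
    return x

for x0 in (-2.0, 2.0):
    x = descend(x0)
    print(f"start {x0:+.1f} -> x* = {x:+.3f}, f(x*) = {x**4 - 3*x**2 + x:+.3f}")
```

Starting from $-2$ the iterates reach the deeper minimum near $-1.3$; starting from $+2$ they get stuck in the shallower one near $+1.1$.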

I am starting to think that there may be something I am not aware of, since the volume of such research is huge. Did I miss something? Or can we just do research in this manner? I am so confused... I am not trying to attack or criticize a specific paper or line of research; I am trying to understand.

Any comments/answers will be greatly appreciated.



You can see my answer to this question: Does gradient descent always converge to an optimum?

  1. You are right that there is no guarantee (no proven theoretical result) that the standard gradient descent method converges, let alone converges to a critical point or a minimum point. (The exception is the case where the gradient, i.e. the derivative, of the function is globally Lipschitz continuous. [If these conditions are not satisfied, there are simple counter-examples showing that no convergence result is possible; see the cited paper below for some.]) However,

In a very recent joint work (cited in that answer, and for your convenience here), we showed that:

  1. Backtracking gradient descent, when applied to an arbitrary $C^1$ function, under the mild assumption that the function has at most countably many critical points (which is the generic case), will always either converge to a critical point or diverge to infinity. (Think of functions like $\cos(x)$, $\sin(x)$, or combinations thereof.) Therefore, if you assume that the cost function has compact sublevel sets, as is usually the case in applications, then convergence to a critical point is guaranteed.

  2. We showed that the limit point in the above is, in a sense, rarely a saddle point.

  3. We argued that, in the long run, backtracking gradient descent turns into the standard gradient descent method (the step sizes stabilise). This explains why standard gradient descent usually works well in practice, and thus, I believe, answers your question.

  4. Based on the above, we proposed a new method in deep learning, which is on par with current state-of-the-art methods and does not require manual fine-tuning (= trial and error, in your language) of the learning rate. (In a nutshell, the idea is that you run backtracking gradient descent for a certain amount of time, until you see that the learning rates, which change with each iteration, stabilise. We expect this stabilisation, in particular near a critical point at which the function is $C^2$ and non-degenerate, because of the convergence result I mentioned in 1. At that point, you switch to the standard gradient descent method. Please see the cited paper for more detail. This method can also be applied to other optimisation algorithms, for example the ones you mentioned in your question.)

P.S. The above answer (in particular, point 4) shows that you can find a good learning rate automatically with the backtracking gradient descent method. If you have other hyperparameters, then you can usually state the question as a two-level (bilevel) optimisation problem, and again use the backtracking gradient descent method as above.
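For concreteness, here is a minimal sketch of backtracking gradient descent with the Armijo sufficient-decrease rule, the general kind of method described above (the variant in the cited paper differs in its details):

```python
import numpy as np

def backtracking_gd(f, grad, x0, lr0=1.0, beta=0.5, c=1e-4,
                    tol=1e-8, max_iter=10_000):
    """Gradient descent where the learning rate is re-chosen at every
    iterate by backtracking, so no manual tuning is needed."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:      # (approximately) a critical point
            break
        lr = lr0
        # Armijo test: shrink lr until f(x - lr*g) <= f(x) - c*lr*||g||^2.
        while f(x - lr * g) > f(x) - c * lr * (g @ g):
            lr *= beta
        x = x - lr * g
    return x

# Toy usage: a double-well function with minima at (+-1, 0) and a saddle
# point at the origin.  The iterates converge to a critical point, here
# the minimum near (1, 0), consistent with points 1 and 2 above.
f = lambda x: (x[0]**2 - 1)**2 + x[1]**2
grad = lambda x: np.array([4 * x[0] * (x[0]**2 - 1), 2 * x[1]])
print(backtracking_gd(f, grad, [0.2, 1.0]))
```

Because the step size is re-chosen at every iterate, there is no learning rate to tune by hand; once the accepted step sizes stabilise, one can switch to standard gradient descent with that step size, as described in point 4.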


If you have a convex cost function, gradient descent will find the global minimizer. Typically, the cost functions associated with neural networks are not convex. However, if you read this question on Stats.SE, you will find that the cost functions of many neural networks are "close enough" to convex that gradient descent still works. Technically, you will find a local minimizer, but neural networks are often invariant to node-switching (permuting hidden units, as mentioned in the answer to the question I linked to), so you are still finding a very good minimizer.
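A toy check of the convex case (illustrative, not taken from the linked question): for a positive-definite quadratic, gradient descent with step size $1/L$, where $L$ is the largest eigenvalue (the Lipschitz constant of the gradient), recovers the unique global minimizer.

```python
import numpy as np

# f(theta) = 0.5 * theta' A theta - b' theta with A positive definite is
# convex, and its unique global minimizer solves A theta = b.  Gradient
# descent with step size 1/L (L = largest eigenvalue of A) reaches it.
rng = np.random.default_rng(1)
M = rng.normal(size=(5, 5))
A = M @ M.T + np.eye(5)      # symmetric positive definite
b = rng.normal(size=5)

theta = np.zeros(5)
lr = 1.0 / np.linalg.eigvalsh(A).max()
for _ in range(5000):
    theta -= lr * (A @ theta - b)    # gradient of f at theta

print(np.allclose(theta, np.linalg.solve(A, b)))  # True
```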
