CNN gradients with different magnitudes

I have a CNN architecture with two cross-entropy losses $\mathcal{L}_1$ and $\mathcal{L}_2$ summed into the total loss $\mathcal{L} = \mathcal{L}_1 + \mathcal{L}_2$. The task I want to solve is Unsupervised Domain Adaptation.

I have observed the following behavior:

  • The gradients coming from $\mathcal{L}_1$ have a much smaller magnitude than those coming from $\mathcal{L}_2$, so the supervision provided by the first loss is negligible.
  • $\mathcal{L}_1$ stays at a constant positive value and does not decrease during training, while $\mathcal{L}_2$ does decrease.

How can I minimize $\mathcal{L}_1$ and how can I make the gradient from $\mathcal{L}_1$ more important? Currently I have two options:

  1. Add a tradeoff parameter to one of the two losses, $\mathcal{L} = \mathcal{L}_1 + \gamma \cdot \mathcal{L}_2$ (see the sketch below)
  2. Normalize the gradients at some step

The last option would be to leave everything as it is, on the grounds that one of the losses does not provide supervision for the task I want to solve. Do you have any advice on which path to follow?
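For concreteness, here is a minimal PyTorch-style sketch of what I mean by option 1; the model, data shapes, and the value of $\gamma$ are just placeholders, not my actual setup:

```python
import torch
import torch.nn as nn

# Placeholders for the actual CNN and data (not the real architecture).
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(8 * 32 * 32, 10))
criterion1 = nn.CrossEntropyLoss()   # produces L1
criterion2 = nn.CrossEntropyLoss()   # produces L2
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
gamma = 0.1                          # trade-off hyperparameter (placeholder value)

x = torch.randn(8, 3, 32, 32)        # dummy batch
y1 = torch.randint(0, 10, (8,))      # dummy targets for L1
y2 = torch.randint(0, 10, (8,))      # dummy targets for L2

logits = model(x)
loss = criterion1(logits, y1) + gamma * criterion2(logits, y2)  # L = L1 + gamma * L2

optimizer.zero_grad()
loss.backward()
optimizer.step()
```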

Topic gradient cnn loss-function machine-learning

Category Data Science


If you are able to tune the weighting hyperparameter $\gamma$ (e.g. against a validation metric), then the relative importance of the two losses becomes an empirical question: the validation results will guide the mixture of the losses.
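A rough sketch of that empirical approach, assuming a hypothetical `train_and_evaluate` helper that trains with $\mathcal{L} = \mathcal{L}_1 + \gamma \cdot \mathcal{L}_2$ and returns a validation score (e.g. target-domain accuracy); the grid of $\gamma$ values is only illustrative:

```python
# Hypothetical helper: train with L = L1 + gamma * L2 and return a validation
# score. The body is a stub and must be replaced with your own pipeline.
def train_and_evaluate(gamma: float) -> float:
    return 0.0  # placeholder

candidate_gammas = [0.01, 0.1, 0.3, 1.0, 3.0]   # assumed grid, not prescriptive

best_gamma, best_score = None, float("-inf")
for gamma in candidate_gammas:
    score = train_and_evaluate(gamma)
    if score > best_score:
        best_gamma, best_score = gamma, score

print(f"selected gamma = {best_gamma} (validation score {best_score:.3f})")
```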

Normalizing the gradients is more ad hoc: it simplifies the balancing problem, but it could also oversimplify it.
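If you do go that route, one simple (still ad hoc) variant is to rescale the gradients of $\mathcal{L}_1$ so their global norm matches that of $\mathcal{L}_2$ before combining them. A PyTorch sketch with a placeholder model and dummy data:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)                        # placeholder for the CNN
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

x = torch.randn(8, 16)                          # dummy batch
y1 = torch.randint(0, 4, (8,))                  # dummy targets for L1
y2 = torch.randint(0, 4, (8,))                  # dummy targets for L2

params = [p for p in model.parameters() if p.requires_grad]
logits = model(x)
loss1 = criterion(logits, y1)
loss2 = criterion(logits, y2)

# Compute each loss's gradients separately.
g1 = torch.autograd.grad(loss1, params, retain_graph=True)
g2 = torch.autograd.grad(loss2, params)

# Rescale L1's gradients so their global norm matches L2's.
norm1 = torch.sqrt(sum(g.pow(2).sum() for g in g1))
norm2 = torch.sqrt(sum(g.pow(2).sum() for g in g2))
scale = norm2 / (norm1 + 1e-12)

optimizer.zero_grad()
for p, a, b in zip(params, g1, g2):
    p.grad = scale * a + b                      # combined, rebalanced gradient
optimizer.step()
```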
