What are the effects of clipping the reward on stability?

I am trying to stabilize my DQN results. I found that clipping is one technique to do this, but I did not understand it completely!

1- What are the effects of clipping the reward, clipping the gradient, and clipping the error on stability, and how do they make results more stable?

2- In the Nature DQN paper it is written that they clip the reward. Could you please explain this in more detail?

3- Which of them is most effective for stability?

Topic dqn keras-rl training tensorflow deep-learning

Category Data Science


You could clip for several reasons.

  • If you clip the gradient, the stabilizing effect is that the optimizer can only make small changes in the backward step. Of course, you could also decrease the learning rate, but the effect is slightly different. When you decrease the learning rate you basically say "learn more slowly". With gradient clipping, you instead say "learn as usual, but if you have to change your mind rapidly, don't do it" (I'm not sure that sentence is clear; English isn't my first language). A small sketch of gradient clipping is given after this list.
  • If you clip the error, the effect is essentially the same. The mathematics differ a bit, but the end result is equivalent to clipping the gradient. In practice this is often done with the Huber loss, as in the second sketch after this list.
  • Clipping the reward doesn't give you any direct stabilizing effect. It is only a particular case of the more general reward shaping. In Playing Atari with Deep Reinforcement Learning, it's stated that:

    Since the scale of scores varies greatly from game to game, we fixed all positive rewards to be 1 and all negative rewards to be −1, leaving 0 rewards unchanged. Clipping the rewards in this manner limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games. At the same time, it could affect the performance of our agent since it cannot differentiate between rewards of different magnitude.

    Basically, they did it to make the environments look similar to each other, in terms of rewards, from the agent's point of view. If you think about it, it's not difficult for you to play a game where the score comes in multiples of a thousand and then switch to another one where even one hundred is a great score. The same doesn't hold for RL agents, so they reshaped the reward (the last sketch below shows exactly this clipping).
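Here is a minimal sketch of gradient clipping, assuming TensorFlow/Keras (as in the question's tags); the `model`, `loss_fn`, `states` and `targets` names are placeholders I made up, not something from the question. The idea is simply to cap the size of the gradient before the optimizer applies it:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

@tf.function
def train_step(model, loss_fn, states, targets):
    with tf.GradientTape() as tape:
        loss = loss_fn(targets, model(states, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # Rescale the whole gradient so its global norm is at most 10: the update
    # direction is kept, only its size is capped ("don't change your mind too fast").
    grads, _ = tf.clip_by_global_norm(grads, clip_norm=10.0)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```

Keras optimizers also accept `clipnorm`/`clipvalue` arguments (e.g. `tf.keras.optimizers.Adam(learning_rate=1e-4, clipnorm=10.0)`), which do the same thing without a custom training step.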
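For clipping the error, the usual trick is the Huber loss: it is quadratic for small errors and linear for large ones, so the error's contribution to the gradient is capped at `delta`. A minimal sketch, again assuming TensorFlow/Keras and using made-up numbers:

```python
import tensorflow as tf

# Quadratic inside |error| <= delta, linear outside: large TD errors no longer
# produce proportionally large gradients.
huber = tf.keras.losses.Huber(delta=1.0)

q_targets   = tf.constant([10.0, 0.5, -3.0])  # TD targets (illustrative values)
q_estimates = tf.constant([ 0.0, 0.4, -2.5])  # current Q-value estimates
print(huber(q_targets, q_estimates).numpy())
```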
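Finally, the reward clipping described in the quote is just a sign function applied to the raw reward; a tiny sketch (the `clip_reward` helper is hypothetical, for illustration only):

```python
import numpy as np

def clip_reward(reward):
    # Positive rewards become +1, negative rewards -1, zero stays 0.
    return float(np.sign(reward))

# A score gain of 400 and one of 10 now look identical to the agent.
assert clip_reward(400.0) == 1.0
assert clip_reward(10.0) == 1.0
assert clip_reward(0.0) == 0.0
assert clip_reward(-25.0) == -1.0
```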
