How to choose between discounted reward and average reward?

How to select between average reward and discounted reward?

  • When is average reward more effective than discounted reward, and when is the opposite true?

  • Is it possible to use both of them in one problem? As I understand it, an RL reward signal is based on either the average reward or the discounted future reward, but this paper seems to use the discounted and average reward together. Is it correct that the discounted future reward is used for training and the average reward for testing and evaluation? If not, what is wrong with my understanding?

In figure 2 of the paper "Playing Atari with Deep Reinforcement Learning", the authors report the "average reward". However, in the same paper, they also mention the "discounted reward". So I'm confused: what is the difference between discounted reward and average reward?

Topic discounted-reward dqn reinforcement-learning

Category Data Science


To consider whether they can be used together or not, let's look at it this way.

Discounting is controlled by the "discount factor", the gamma symbol in the paper. This hyper-parameter always appears in the calculation of the return. You can configure your setup to have no discounting at all by setting gamma = 1, or you can discount by setting gamma below 1. Adjusting this is likely to affect your learning performance.
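
As a minimal sketch (the reward sequence here is a made-up example, not from the paper), this is how gamma enters the return of a single episode:

```python
def discounted_return(rewards, gamma):
    """Return G = r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 0.0, 1.0]
print(discounted_return(rewards, gamma=1.0))  # 2.0   -> no discounting at all
print(discounted_return(rewards, gamma=0.9))  # 1.729 -> later rewards count less
```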

For the average reward in figure 2, the paper says "One epoch corresponds to 50000 minibatch weight updates or roughly 30 minutes of training time". During those 30 minutes the agent does not play just one episode but many of them. Each played episode generates a return (the total reward, which contains that gamma in its calculation). The average reward is computed directly over the episodes in the same epoch. If you don't like the average, you may choose the max, the min, or any other operator; it depends on what you want to see.
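
A rough sketch of that reporting step might look like the following; `run_episode` is a hypothetical stand-in for playing one game with the current policy, not anything from the paper:

```python
import random

def run_episode():
    # Placeholder: play one episode and return its total score.
    return random.uniform(0, 300)

def evaluate_epoch(num_episodes=20):
    scores = [run_episode() for _ in range(num_episodes)]
    return sum(scores) / len(scores)  # average; max(scores) or min(scores) also work

print("average reward this epoch:", evaluate_epoch())
```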

These are two knobs that you can adjust independently.


The average reward in that figure is used as a measure of performance, in other words the score of the agent playing the game. You do not track the reward of each individual episode, as that doesn't indicate a general improvement in the learning process. Instead, you track the average reward over training epochs; if it steadily increases, your agent is indeed learning.

The discounted reward is used to create a dependency on future rewards and appears in the learning equations. So, instead of evaluating how good a particular state is based only on the immediate reward you received, you also take into account the future reward obtainable from the next state. In RL you attempt to maximize your expected return, and some methods estimate the expected return from every state.

Please note that my answer gives a high-level description and does not refer to a specific RL algorithm (there are many variations). I would suggest understanding the simple tabular form of Q-learning very well before moving on to combinations of RL and function approximators; a sketch of its update rule follows below.
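
Here is a minimal sketch of the tabular Q-learning update, just to show where the discount factor enters the learning rule (the states, actions, and parameter values are hypothetical placeholders):

```python
from collections import defaultdict

Q = defaultdict(float)       # Q[(state, action)] -> estimated return
alpha, gamma = 0.1, 0.99     # learning rate and discount factor

def q_update(state, action, reward, next_state, actions):
    # The target bootstraps on the discounted value of the best next action,
    # which is how future rewards influence the value of the current state.
    best_next = max(Q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```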
