Having a reward structure which gives high positive rewards compared to the negative rewards

I am training an RL agent with the PPO algorithm for a control problem. The objective of the agent is to maintain the temperature in a room. It is an episodic task with an episode length of 9 hours and a step size (one action taken) of 15 minutes. During training, the agent takes an action from a given state. I then check the temperature of the room 15 minutes later (the step size): if this temperature is within limits, I give the action a very high positive reward, and if the temperature is not within the limits, I give a negative reward. An episode ends after 36 actions (9 hours * 4 actions/hour, with a 15-minute step size).

My formulation of the reward structure:

zone_temperature = output[4]  # temperature of the zone 15 mins after the action is taken

thermal_coefficient = -10

if zone_temperature < self.temp_limit_min:
    temp_penalty = self.temp_limit_min - zone_temperature
elif zone_temperature > self.temp_limit_max:
    temp_penalty = zone_temperature - self.temp_limit_max
else:
    temp_penalty = -100

reward = thermal_coefficient * temp_penalty

The value of zone_temperature deviates from the limits by 0 to 5 degrees. So the reward when the actions are bad (temperature not within limits) varies from 0 to -50, but when the actions are good (temperature within limits) the reward is +1000. I chose such a formulation so that the agent can easily tell which actions are good and which are bad. Is my understanding correct, and is it recommended to have such a reward structure for my use case?

Thanks



Is my understanding correct and is it recommended to have such a reward structure for my use case ?

Your understanding is not correct, and setting extremely high rewards for the goal state in this case can backfire.

Probably the most important way it could backfire in your case is that your scaling of bad results becomes irrelevant. The difference between 0 and -50 is insignificant compared to the +1000 result. In turn, that means the agent will not really care by how much it fails when it does, except as a matter of fine-tuning once it is already close to an optimal solution.
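To put rough numbers on that (illustrative only, using your +1000 / -10 * deviation scale and the 36-step episode): a policy that misses the acceptable band by 5 degrees on half of its steps and one that misses by only 0.5 degrees earn almost the same return, so the signal to "miss by less" is comparatively weak:

# Illustrative comparison only, not part of your environment code:
# undiscounted episode returns under the +1000 / -10 * deviation scale.
GOOD_REWARD = 1000
STEPS = 36

def episode_return(miss_deg, frac_in_range=0.5):
    in_range = int(STEPS * frac_in_range)   # steps inside the limits
    out_range = STEPS - in_range            # steps outside the limits
    return in_range * GOOD_REWARD + out_range * (-10 * miss_deg)

print(episode_return(5.0))   # misses by 5 deg   -> 17100
print(episode_return(0.5))   # misses by 0.5 deg -> 17910 (under 5% higher)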

If the environment is stochastic, then the agent will prioritise a small chance of being at the target temperatures over a large chance of ending up at an extremely bad temperature.

If you are using a discount factor, $\gamma$, then the agent will prioritise being at the target temperatures immediately, maybe overshooting and ending up with an unwanted temperature within a few timesteps.

Working in your favour, your environment is one where the goal is some managed stability, like the "cartpole" environment, with a negative feedback loop (the correction applied to the measured quantity always pushes in the opposite direction to the deviation). Agents for these are often quite robust to changes in hyperparameters, so you may still find your agent learns successfully.

However, I would advise sticking with a simple and relatively small scale for the reward function. Once you are certain that it expresses your goals for the agent, experimenting with it further is unlikely to lead to better solutions. Instead you should focus your efforts on how the agent is performing, and what changes you can make to the learning algorithm.

What I would do (without knowing more about your environment):

  • Reward +1 per time step when temperature is in acceptable range

  • Reward of -0.1 * (temperature difference from the nearest limit) per time step when temperature is outside the acceptable range. It doesn't really matter whether you measure that in Fahrenheit or Celsius.

  • No discounting (set discount factor $\gamma =1$ if you are using a formula that includes discounting)

The maximum total reward possible is then +36, and you probably don't expect a worse episode than around -100 or so. This will plot neatly on a graph and be easy to interpret (every unit below 36 is roughly equivalent to the agent spending one 15-minute step per episode just outside the acceptable temperatures). More importantly, these smaller numbers should not cause massive error values whilst the agent is learning, which will help when training a neural network to predict future reward.
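As a sketch of that scheme (the names are illustrative and would need adapting to your environment code):

def reward_fn(zone_temperature, temp_limit_min, temp_limit_max):
    # +1 per step inside the acceptable band.
    if temp_limit_min <= zone_temperature <= temp_limit_max:
        return 1.0
    # Otherwise a small penalty proportional to how far outside the band we are.
    if zone_temperature < temp_limit_min:
        deviation = temp_limit_min - zone_temperature
    else:
        deviation = zone_temperature - temp_limit_max
    return -0.1 * deviation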


As an aside (as you didn't ask), if you are using a value-based method, like DQN, then you will need to include the current timestep (or timesteps remaining) in the state features. That is because the total remaining reward - as represented by action value Q - depends on the remaining time that the agent has to act. It also doesn't matter to the agent what happens after the last time step, so it is OK for it to choose actions just before then that would make the system go outside acceptable temperatures at that point.
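For example (a minimal sketch, assuming the observation is a flat numeric vector and that you track the step index yourself):

import numpy as np

EPISODE_STEPS = 36

def augment_state(raw_obs, step_index):
    # Append the normalised number of steps remaining so that the
    # value estimate can depend on how much time is left in the episode.
    steps_remaining = (EPISODE_STEPS - step_index) / EPISODE_STEPS
    return np.concatenate([np.asarray(raw_obs, dtype=np.float32),
                           [steps_remaining]]).astype(np.float32)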
