Having a reward structure which gives high positive rewards compared to the negative rewards
I am training an RL agent with the PPO algorithm on a control problem. The objective of the agent is to maintain the temperature of a room. It is an episodic task with an episode length of 9 hours and a step size (one action taken) of 15 minutes. During training, the agent takes an action from a given state; I then check the temperature of the room 15 minutes later (the step size). If this temperature is within the limits, I give the action a very high positive reward, and if it is not, I give a negative reward. An episode ends after 36 actions (9 hours × 4 actions/hour, with a 15-minute step size).
My formulation of the reward structure:
zone_temperature = output[4]  # temperature of the zone 15 mins after the action is taken
thermal_coefficient = -10
if zone_temperature < self.temp_limit_min:
    temp_penalty = self.temp_limit_min - zone_temperature
elif zone_temperature > self.temp_limit_max:
    temp_penalty = zone_temperature - self.temp_limit_max
else:
    temp_penalty = -100
reward = thermal_coefficient * temp_penalty
The zone_temperature deviates from the limits by 0 to 5 degrees, so the reward for bad actions (temperature outside the limits) ranges from 0 to -50, while the reward for good actions (temperature within the limits) is +1000. I chose this formulation so that the agent can easily tell which actions are good and which are bad. Is my understanding correct, and is such a reward structure recommended for my use case?
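For reference, the reward logic described above can be sketched as a standalone, runnable function. The limit values 20 and 24 degrees are assumptions purely for illustration; the actual self.temp_limit_min and self.temp_limit_max come from my environment:

```python
def compute_reward(zone_temperature, temp_limit_min=20.0, temp_limit_max=24.0):
    """Reward: +1000 when the temperature is within limits,
    otherwise -10 per degree of violation (0 to -50 for
    deviations of 0 to 5 degrees).
    NOTE: the limit defaults are illustrative assumptions."""
    thermal_coefficient = -10
    if zone_temperature < temp_limit_min:
        temp_penalty = temp_limit_min - zone_temperature   # positive violation in degrees
    elif zone_temperature > temp_limit_max:
        temp_penalty = zone_temperature - temp_limit_max   # positive violation in degrees
    else:
        temp_penalty = -100                                # in limits -> reward of +1000
    return thermal_coefficient * temp_penalty

# compute_reward(22.0)  -> 1000  (within limits)
# compute_reward(18.0)  -> -20   (2 degrees below the minimum)
```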
Thanks
Topic actor-critic dqn ai monte-carlo reinforcement-learning
Category Data Science