How to formulate the reward of an RL agent with two objectives

I have started learning reinforcement learning and am trying to apply it to my use case. I am developing an RL agent that maintains temperature at a particular value while minimizing the energy consumption of the equipment, by taking the different actions available to it.

I am trying to formulate a reward function for it.

Both energy and temp_act can be measured:

import numpy as np

energy_coeff = -10  # weight on energy consumption
temp_coeff = -10    # weight on deviation from the setpoint

# penalty grows with the distance from the setpoint
temp_penalty = np.abs(temp_setpoint - temp_act)

reward = energy_coeff * energy + temp_coeff * temp_penalty

This is the reward function I am using, but intuitively I feel it could be better, because the absolute values of energy and temp_penalty are on different scales. How do I take the scaling problem into account while structuring the reward?

Topic: discounted-reward, dqn, monte-carlo, q-learning, reinforcement-learning

Category: Data Science


In general it is not possible to simultaneously optimise two separate objective functions. Your approach of weighting each objective with a coefficient and then summing the scaled terms is a standard way of resolving that, often called linear scalarisation.
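In your notation, that scalarised objective is

$$R = c_E \, E + c_T \, |T_{\text{set}} - T_{\text{act}}|,$$

with both coefficients negative, so that higher energy use and larger temperature error each reduce the reward.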

As your penalties are on different scales and in different units, it is your task as the engineer setting the objective to provide the conversion to a single scale. That is what the coefficients represent - you can even think of them as $\text{points}/\text{Joule}$ for energy and $\text{points}/\Delta K$ for temperature difference.
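A common practical way to choose that conversion is to normalise each term by a typical magnitude, so both penalties are roughly of order one, and then weight them by relative importance. A minimal sketch, where ENERGY_SCALE, TEMP_SCALE and the default weights are hypothetical values you would estimate from your own system:

import numpy as np

ENERGY_SCALE = 5000.0   # assumed typical per-step energy use, in Joules
TEMP_SCALE = 2.0        # assumed typical temperature error, in Kelvin

def reward(energy, temp_act, temp_setpoint, w_energy=0.5, w_temp=0.5):
    # normalise each penalty to roughly O(1), then weight by importance
    energy_term = energy / ENERGY_SCALE
    temp_term = np.abs(temp_setpoint - temp_act) / TEMP_SCALE
    return -(w_energy * energy_term + w_temp * temp_term)

With this structure the weights no longer hide a unit conversion; they only express how much you care about one objective relative to the other.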

Sometimes analysis will show you that there is a natural combined scale. For instance, in business settings it may be possible to frame compromises as financial costs, e.g. your coefficients might be $\text{GBP}/\text{Joule}$ for energy and $\text{GBP}/\Delta K$ for temperature difference. Then you have one clear objective: minimise cost or maximise profit.
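A minimal sketch of that framing, where the two prices are hypothetical placeholders rather than real tariffs:

GBP_PER_JOULE = 3e-8    # assumed electricity price, expressed per Joule
GBP_PER_KELVIN = 0.05   # assumed cost of one Kelvin of discomfort per step

def reward(energy, temp_act, temp_setpoint):
    energy_cost = GBP_PER_JOULE * energy
    comfort_cost = GBP_PER_KELVIN * abs(temp_setpoint - temp_act)
    # maximising this reward is exactly minimising total cost in GBP
    return -(energy_cost + comfort_cost)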

If that is not possible - putting a financial cost on exceeding temperature bounds is hard when it concerns human comfort in a building - deeper analysis might lead to thinking about longer-term outcomes. Perhaps your initial rewards are too focused on immediate numerical quantities (which appear easy to collect, but don't represent your true goals), and a re-framing of the problem could work. For instance, it may be more reasonable to require that both the temperature and the energy cost stay within strict bounds over a year of varying external temperatures and system workload, with a penalty that scales with how badly those bounds are exceeded.
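One way such a bounded formulation might look in code (a sketch only; the band width, energy budget and penalty scale are assumptions you would tune):

COMFORT_BAND = 1.0      # Kelvin either side of the setpoint with no penalty (assumed)
ENERGY_BUDGET = 5000.0  # per-step energy allowance, in Joules (assumed)
PENALTY_SCALE = 4.0     # relative weight on comfort violations (assumed)

def reward(energy, temp_act, temp_setpoint):
    # zero penalty inside the bounds; each penalty scales with how badly
    # its bound is exceeded
    temp_violation = max(0.0, abs(temp_setpoint - temp_act) - COMFORT_BAND)
    energy_violation = max(0.0, (energy - ENERGY_BUDGET) / ENERGY_BUDGET)
    return -(PENALTY_SCALE * temp_violation + energy_violation)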
