Reward function to avoid illegal actions, minimize the number of actions, and learn to win - Reinforcement Learning

I'm currently implementing PPO for a game with the following characteristics:

  • Observation space: 9x9x(>150)
  • Action space: 144
  • In a given state, only a handful of actions (~1-10) are legal
  • The state in time-step t can vary a lot from state t+1
  • The environment is episodic (~25 steps, depending on the level) and ends with a win or a loss
  • In some levels, a random policy (restricted to legal actions) can win; in other levels, strategy is needed.

I want the algorithm to learn:

  • To avoid illegal actions as much as possible, ideally taking only legal actions
  • To play the best combination of legal actions so as to win as fast as possible

I've tried different reward functions and the best one so far is:

  • Illegal action: -0.10
  • Legal action: +0.10
  • Win: +1.0

I tested it on a simple level that a random policy will beat. During the time I trained, my policy learned to take around 80% legal actions, but it never won. The legal-action rate would probably increase further with longer training, and the policy might eventually start winning.

The problem with the above reward function is that it doesn't encourage winning with as few actions as possible, since each legal action gets a positive reward.

I've also tried the following reward function:

  • Illegal action: -0.10
  • Legal action: -0.02
  • Win: +1.0

But it converges to around 20% legal moves.

I have two questions:

  1. Is anyone familiar with a similar problem who has an idea of how I might design the reward function?
  2. Does anyone know of any papers that discuss the problem of learning three different objectives: avoiding illegal actions, winning, and winning in the minimum number of actions?

Topic reinforcement-learning deep-learning neural-network machine-learning

Category Data Science


Most of the grid-world examples given in RL books use a reward of -1 for every step until the terminal state is reached, which encourages the algorithm to reach the terminal state in as few steps as possible. So for every legal action you can give a reward of -1, and to discourage illegal actions in a state you can give a reward of something like -10. Since you said each episode has around 25 steps, increase the win reward to more than 25, say 50, so the algorithm understands that winning is more important because it yields more reward.
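
For concreteness, here is a minimal sketch of that reward scheme as a Python function; the action_was_legal and episode_won flags are hypothetical and would come from your environment, and the constants are just the example values above:

    def compute_reward(action_was_legal: bool, episode_won: bool) -> float:
        # Win bonus exceeds ~25 steps of -1, so winning dominates the step cost.
        if episode_won:
            return 50.0
        # Strong penalty for illegal actions.
        if not action_was_legal:
            return -10.0
        # Per-step cost for legal actions encourages short episodes.
        return -1.0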


Do you want your agent to be able to perform illegal actions occasionally, or not at all? One way to avoid illegal actions entirely is to apply a mask over your 144-dimensional action vector (so that the action indices are preserved) when you calculate the softmax probabilities, and then sample actions according to the masked probabilities. First you need to determine which actions are illegal at a given step, which depends on the dynamics of the game. The algorithms used to learn to play the mini-games in StarCraft II are one example of this approach.
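
Below is a minimal sketch of such masking, assuming a PyTorch policy network that outputs raw logits over the 144 actions and a boolean legal_mask computed from the game state (both names are hypothetical):

    import torch

    def masked_action_distribution(logits: torch.Tensor,
                                   legal_mask: torch.Tensor) -> torch.distributions.Categorical:
        # logits: (batch, 144) raw policy outputs; legal_mask: (batch, 144) bools, True = legal.
        # Illegal logits are pushed to a large negative value, so their softmax
        # probability is effectively zero while the action indices stay aligned.
        masked_logits = logits.masked_fill(~legal_mask, -1e9)
        return torch.distributions.Categorical(logits=masked_logits)

    # Usage: dist = masked_action_distribution(policy_net(obs), legal_mask)
    #        action = dist.sample()  # only legal actions are ever sampled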

In order to encourage your agent to act in the minimum number of steps, you should introduce a small per-step penalty (let's say, for example, -0.01) so that your agent also tries to optimize this part of the reward.
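
As a sketch, with masking in place the reward could then be as simple as the following (the constants are just example values):

    def step_reward(done: bool, won: bool) -> float:
        # Small per-step penalty: shorter wins accumulate more total reward.
        if done and won:
            return 1.0
        return -0.01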

As for the question of winning the game, there is no general answer. It depends on many factors, to name a few: how appropriate the agent's architecture is for the game, reward sparsity, the type of observations (images, non-vector data), data preprocessing, the inputs to your agent, the amount of exploration, etc. There is no guarantee that using method X will solve your task (unless your task has already been solved with a specific type of learner). You could also name the game/task you want to solve.
