Reward function to avoid illegal actions, minimize the number of actions, and learn to win - Reinforcement Learning
I'm currently implementing PPO for a game with the following characteristics:
- Observation space: 9x9x(>150)
- Action space: 144
- In a given state, only a handful of actions (~1-10) are legal
- The state at time-step t can differ substantially from the state at t+1
- The environment is episodic (~25 steps, depending on level) and ends with a win or a loss
- In some levels, a random policy (as long as it only takes legal actions) can result in a win; in other levels, strategy is needed.
I want the algorithm to learn:
- To avoid illegal actions as much as possible, ideally taking only legal actions
- To play the best combination of legal actions and win as quickly as possible
I've tried different reward functions and the best one so far is:
- Illegal action: -0.10
- Legal action: +0.10
- Win: +1.0
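
In code, a minimal sketch of this scheme (my own illustration, not library code) looks like this, where `legal` and `won` are flags the environment would have to provide:

```python
def reward(legal: bool, won: bool) -> float:
    """Per-step reward: -0.10 for an illegal action, +0.10 for a legal one,
    plus +1.0 on the step that wins the episode."""
    if not legal:
        return -0.10
    return 0.10 + (1.0 if won else 0.0)
```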
I tested it on a simple level that a random policy can beat. Over the time I trained, my policy learned to take around 80% legal actions, but it never won. The legal-action rate would probably keep rising with longer training, and the policy might eventually start winning.
The problem with the above reward function is that it doesn't encourage winning with as few actions as possible, since each legal action earns a positive reward.
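
To make that concrete, a quick comparison of undiscounted returns (my own arithmetic, ignoring discounting) shows that a slower win actually pays more under this scheme:

```python
def episode_return(n_legal_steps: int, step_r: float, win_r: float = 1.0) -> float:
    # Undiscounted return of a winning episode made up of only legal actions.
    return n_legal_steps * step_r + win_r

print(episode_return(10, 0.10))  # 2.0 -- win in 10 steps
print(episode_return(25, 0.10))  # 3.5 -- win in 25 steps pays more
```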
I've also tried the following reward function:
- Illegal action: -0.10
- Legal action: -0.02
- Win: +1.0
But with this function, the policy converges to only around 20% legal moves.
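
Plugging the same arithmetic into this scheme shows it does prefer faster wins, so the incentive I wanted is there:

```python
def episode_return(n_legal_steps: int, step_r: float, win_r: float = 1.0) -> float:
    # Undiscounted return of a winning episode made up of only legal actions.
    return n_legal_steps * step_r + win_r

print(episode_return(10, -0.02))  # 0.8 -- faster win now pays more
print(episode_return(25, -0.02))  # 0.5
```

One thing I notice is that the per-step gap between a legal and an illegal action shrank from 0.20 to 0.08, which might be part of why the legal-move rate dropped.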
I have two questions:
- Is anyone familiar with a similar problem who has an idea of how I might design the reward function?
- Does anyone know of any papers that discuss the problem of learning three different objectives: avoiding illegal actions, winning, and winning with the minimum number of actions?