Problem when cherry picking actions - Proximal Policy Optimization
I am using the implementation of PPO2 in stable-baselines (a fork of OpenAI's baselines) for a Reinforcement Learning problem.
My observation space is $9 \times 9 \times 191$ and my action space has $144$ discrete actions. Given a state, only some actions are legal. If an illegal action is taken, the environment returns the same state. Think of it as the game of Go, where you try to place a stone on an intersection that is already occupied.
When a legal action is taken, it can result in a state that is completely different from the one before, and in a set of legal actions that is completely different as well. So it is not as in Go, where the stones stay on the board.
I currently feed the model a vector containing the legal actions (a multi-hot vector: 1 for legal, 0 for illegal). When the model selects an action, I first run the action means through a sigmoid so they all become positive while keeping their relative order. Then I multiply the means with the legal-action vector, which sets the illegal actions to zero, and the model picks among the remaining legal actions with argmax. I do not change the underlying action probabilities, so when the negative log-probability (neglog) of an action is calculated, it is the true value.
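For concreteness, here is a minimal NumPy sketch of that masking scheme (outside the stable-baselines graph; action_means and legal_mask are placeholder names for what the policy network and the environment actually provide):

```python
import numpy as np

def pick_legal_action(action_means, legal_mask):
    """Pick an action from the legal set only.

    action_means: raw action means (logits) from the policy, shape (144,)
    legal_mask:   multi-hot vector, 1 for legal actions, 0 for illegal, shape (144,)
    """
    # Sigmoid makes every score positive while preserving the relative order.
    scores = 1.0 / (1.0 + np.exp(-action_means))
    # Zeroing the illegal entries means argmax can only land on a legal action.
    masked_scores = scores * legal_mask
    return int(np.argmax(masked_scores))
```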
This works well for some time, but after a while I get the following error:
```
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
```
The error is thrown at the tf.clip_by_global_norm call:

```python
grads, _grad_norm = tf.clip_by_global_norm(grads, self.max_grad_norm)
```
I am guessing that the gradients become infinitely large.
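To check that guess, one option is a small helper that wraps each gradient in tf.check_numerics before clipping, so TensorFlow names the first offending variable instead of failing inside the global-norm op. This is a hypothetical debugging sketch, not part of the original code; grads and params are assumed to be the gradient list and the matching variable list from the surrounding PPO2 setup:

```python
import tensorflow as tf

def clip_with_nan_check(grads, params, max_grad_norm):
    """Clip gradients by global norm, but first assert each gradient is finite,
    so the error message names the offending variable."""
    checked = [
        tf.check_numerics(g, message="NaN/Inf in gradient of " + v.name)
        if g is not None else None
        for g, v in zip(grads, params)
    ]
    return tf.clip_by_global_norm(checked, max_grad_norm)
```

Calling this in place of the tf.clip_by_global_norm line would localize which gradient first produces NaN.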
A general trend is that the new and old neglog action probabilities start to increase (which means the action probabilities under the new and old policies start to decrease). The Kullback–Leibler divergence between the two policies also increases rapidly. This causes the ratio ($r(\theta)$ in the paper) to increase, which in turn makes the policy loss very volatile.
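For reference, the ratio from the PPO paper can be written in terms of those two negative log-probabilities:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} = \exp\big(\mathrm{neglogp}_{\text{old}} - \mathrm{neglogp}_{\text{new}}\big),$$

so when the new and old neglogps drift apart, the ratio, and with it the surrogate loss, becomes very unstable.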
If I increase the number of steps per update, it can run for longer, but eventually it will crash.
I have tried to penalize illegal actions instead of masking them out, but convergence is very slow.
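Roughly, that penalty variant looks like the sketch below (a hypothetical gym wrapper; the legal_actions attribute and the penalty value are placeholders, not my actual environment):

```python
import gym

class IllegalActionPenaltyWrapper(gym.Wrapper):
    """Sketch of the penalty variant: illegal actions leave the state unchanged
    and return a fixed negative reward instead of being masked out."""

    def __init__(self, env, penalty=-1.0):
        super().__init__(env)
        self.penalty = penalty
        self._last_obs = None

    def reset(self, **kwargs):
        self._last_obs = self.env.reset(**kwargs)
        return self._last_obs

    def step(self, action):
        # legal_actions (a multi-hot vector) is assumed to be exposed by the env.
        if self.env.legal_actions[action] == 0:
            return self._last_obs, self.penalty, False, {"illegal": True}
        self._last_obs, reward, done, info = self.env.step(action)
        return self._last_obs, reward, done, info
```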
My questions are:
- What might be the reason for the gradients blowing up?
- Can it affect the model in an unexpected way if I cherry-pick the legal actions like this?
- Is there any other, better way to limit the action space to just the legal actions? I have not found any papers that analyze this. Does anyone know how AlphaZero handles illegal actions?