Policy Gradient with continuous action space

How do I apply REINFORCE/policy-gradient algorithms to a continuous action space? I have learnt that one of the advantages of policy gradients is that they are applicable to continuous action spaces. One way I can think of is discretizing the action space, the same way we do for DQN. Should we follow the same method for policy-gradient algorithms, or is there another way this is done?

Thanks

Tags: policy-gradients, dqn, ai, reinforcement-learning

Category: Data Science


Yes, that is possible. It can be done in the following way:

We assume that the action distribution is Gaussian, i.e., we need to learn the parameters $\theta$ of $\mathcal{N}(a \mid \mu_\theta(s), \sigma_\theta(s))$. Let's say that $\theta$ is given by the weights of a neural network, which we find by optimizing the objective $$\max_\theta \; \mathbb{E}_{a \sim p_\theta(\cdot \mid s)}\left[ R(s,a)\right],$$ where $p_\theta(a \mid s) = \mathcal{N}(a \mid \mu_\theta(s), \sigma_\theta(s))$ and $R(s,a)$ is the cumulative discounted reward. By the policy gradient theorem, the gradient is then simply $\mathbb{E}_{p_\theta}\left[ R(s,a)\, \nabla_\theta \log p_\theta(a \mid s) \right]$.
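As a concrete illustration, here is a minimal sketch (assuming PyTorch, which the answer does not specify) of the REINFORCE loss for such a Gaussian policy. The tensors `mu`, `sigma`, `actions`, and `returns` are hypothetical placeholders for the network outputs, the sampled actions, and the cumulative discounted rewards $R(s,a)$:

```python
import torch
from torch.distributions import Normal

def reinforce_loss(mu, sigma, actions, returns):
    """Negative of the surrogate objective: autograd on this loss gives a
    sample estimate of E[ R(s,a) * grad_theta log p_theta(a|s) ]."""
    dist = Normal(mu, sigma)
    # Sum log-probabilities over action dimensions to get log p(a|s) per sample.
    log_prob = dist.log_prob(actions).sum(dim=-1)
    # R(s,a) is treated as a constant; no gradient flows through it.
    return -(returns.detach() * log_prob).mean()
```

Minimizing this loss with a standard optimizer performs stochastic gradient ascent on the objective above.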

In practice, we design the neural network to output one $\mu$ per action dimension; $\sigma$ can either be learned or kept fixed. If learned, we interpret the corresponding output as $\log \sigma$, so that it can take any real value (the actual $\sigma$ is recovered by exponentiating and is therefore always positive). To sample an action, we draw from the Gaussian defined by the network's outputs.
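The sketch below (again assuming PyTorch; the class name `GaussianPolicy` and the layer sizes are illustrative, not from the original answer) shows one common way to set this up: a small network produces one $\mu$ per action dimension, and $\log \sigma$ is a learned parameter shared across states:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """Outputs one mu per action dimension; log_sigma is learned so that
    sigma = exp(log_sigma) is always positive."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mu_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        self.log_sigma = nn.Parameter(torch.zeros(act_dim))  # could also be kept fixed

    def forward(self, obs):
        mu = self.mu_net(obs)
        sigma = self.log_sigma.exp().expand_as(mu)
        return Normal(mu, sigma)

# Sampling a continuous action for a single (random, illustrative) observation:
policy = GaussianPolicy(obs_dim=4, act_dim=2)
dist = policy(torch.randn(1, 4))
action = dist.sample()                     # one continuous value per action dimension
log_prob = dist.log_prob(action).sum(-1)   # feeds into the REINFORCE loss above
```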

See e.g. T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. “Continuous control with deep reinforcement learning,” International Conference on Learning Representations, 2016.
