I am using the implementation of PPO2 in stable-baselines (a fork of OpenAI's baselines) for a reinforcement learning problem. My observation space is $9 \times 9 \times 191$ and my action space has size $144$. Given a state, only some actions are "legal". If an "illegal" action is taken, the environment returns the same state. Think of it as the game of Go, where you try to place a stone on an intersection that is already occupied. When a legal action is taken, it …
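To make the setup concrete, here is a toy sketch of the kind of environment I mean; the shapes match my problem, but the legality rule, transition and rewards below are placeholders, not my actual environment:

```python
import numpy as np
import gym
from gym import spaces

class MaskedBoardEnv(gym.Env):
    """Toy environment where illegal actions leave the state unchanged (placeholder logic)."""

    def __init__(self):
        self.observation_space = spaces.Box(low=0, high=1, shape=(9, 9, 191), dtype=np.float32)
        self.action_space = spaces.Discrete(144)
        self.state = np.zeros((9, 9, 191), dtype=np.float32)

    def legal_actions(self):
        # Placeholder legality rule; the real rule depends on the game.
        return [a for a in range(self.action_space.n) if a % 2 == 0]

    def step(self, action):
        if action not in self.legal_actions():
            # Illegal move: return the same state, no reward.
            return self.state, 0.0, False, {"illegal": True}
        self.state = self._apply(action)
        return self.state, 1.0, False, {"illegal": False}  # dummy reward

    def _apply(self, action):
        nxt = self.state.copy()
        nxt[0, 0, action % 191] = 1.0  # dummy state change for illustration
        return nxt

    def reset(self):
        self.state = np.zeros((9, 9, 191), dtype=np.float32)
        return self.state
```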
I am learning reinforcement learning and, as a practice exercise, I am trying to stabilize an inverted pendulum (gym: Pendulum-v0) in the upright position using a policy gradient method: REINFORCE. I have some questions; I have tried a lot but could not work these out, and an answer to any of them would help me. Thanks in advance. 1- Why are the observations in the pendulum code cos(theta), sin(theta) and theta_dot, and not just theta and theta_dot? 2- the action which I should send to the environment …
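For question 1, a small sketch of what I mean by the observation layout (assuming the standard Pendulum-v0 ordering of the observation vector); the angle can always be recovered from the (cos, sin) pair, and the pair varies smoothly instead of jumping at ±π:

```python
import gym
import numpy as np

env = gym.make("Pendulum-v0")
obs = env.reset()                      # obs = [cos(theta), sin(theta), theta_dot]
cos_th, sin_th, theta_dot = obs
theta = np.arctan2(sin_th, cos_th)     # recover the angle; no discontinuity at +/- pi
print(theta, theta_dot)
```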
David Silver's course on Reinforcement Learning explains how to obtain the optimal policy from the optimal value function. It seems very simple: you just act greedily, maximizing the value function at each step. In the case of a small grid world, once you have applied the Policy Evaluation algorithm, you get, for example, the following matrix for the value function: You start from the top-left corner and the only actions are the …
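For reference, here is roughly what I understand "act greedily with respect to the value function" to mean on a small grid world. The grid size and value matrix below are made up for illustration (not the exact example from the lecture), and since the per-step reward is constant in that example, picking the neighbour with the highest value is equivalent to maximizing reward plus discounted value:

```python
import numpy as np

# Hypothetical 4x4 value function produced by policy evaluation (illustrative numbers).
V = np.array([[  0., -14., -20., -22.],
              [-14., -18., -20., -20.],
              [-20., -20., -18., -14.],
              [-22., -20., -14.,   0.]])

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def greedy_action(state, V):
    """Pick the action whose successor state has the highest value."""
    best, best_val = None, -np.inf
    for name, (dr, dc) in ACTIONS.items():
        r, c = state[0] + dr, state[1] + dc
        if 0 <= r < V.shape[0] and 0 <= c < V.shape[1] and V[r, c] > best_val:
            best, best_val = name, V[r, c]
    return best

print(greedy_action((0, 1), V))  # -> "left", towards the terminal corner
```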
I am working on training an RNN model for caption generation with the REINFORCE algorithm. I adopt the self-critic strategy (see the paper Self-Critical Sequence Training for Image Captioning) to reduce the variance. I initialize the model with a pre-trained RNN model (a.k.a. a warm start). This pre-trained model (trained with a log-likelihood objective) got a 0.6 F1 score on my task. When I use the Adam optimizer to train this policy gradient objective, the performance of my model drops to 0 after a few epochs. However, if …
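For concreteness, this is a rough sketch of the self-critical loss I have in mind; the tensor names and the toy usage are placeholders, not the paper's actual code. The reward of the sampled caption minus the reward of the greedy caption weights the log-probability of the sample:

```python
import torch

def self_critical_loss(log_probs, sampled_reward, greedy_reward):
    """
    log_probs:      (batch, seq_len) log-probabilities of the *sampled* tokens
    sampled_reward: (batch,) sequence-level reward (e.g. F1) of the sampled caption
    greedy_reward:  (batch,) reward of the greedy/argmax caption (the baseline)
    """
    advantage = (sampled_reward - greedy_reward).detach()  # no gradient through rewards
    # REINFORCE with a self-critical baseline: minimize -(r_sample - r_greedy) * log p(sample)
    return -(advantage.unsqueeze(1) * log_probs).sum(dim=1).mean()

# toy usage with fake per-token log-probabilities
lp = torch.nn.functional.log_softmax(torch.randn(2, 5, 10, requires_grad=True), dim=-1)
lp_taken = lp[:, :, 0]   # pretend token 0 was sampled at every step
loss = self_critical_loss(lp_taken, torch.tensor([0.6, 0.4]), torch.tensor([0.5, 0.5]))
loss.backward()
```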
So, I am using an agent with a state-action policy and I am trying to understand the concept of experience replay memory (ERM). As far as I have learned so far, the ERM is basically a buffer that stores sets of experiences $e_t = (s_t, a_t, r_{t+1}, s_{t+1})$, where $s$ is the state, $a$ the action and $r$ the reward, as usual. Basically, in order to use a network that learns to predict the correct action from such experiences, the network's input should …
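To check my understanding, a minimal sketch of such a buffer (the capacity and uniform random sampling are arbitrary choices here):

```python
import random
from collections import deque

class ReplayMemory:
    """Stores transitions e_t = (s_t, a_t, r_{t+1}, s_{t+1}) and samples random minibatches."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped when full

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)
```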
I implemented a self-critical policy gradient (as described here) for text summarization. However, after training, the results are not as high as expected (actually lower than without RL...). I'm looking for general guidelines on how to debug RL-based algorithms. I tried: overfitting on small datasets (~6 samples): I could increase the average reward, but it does not converge; sometimes the average reward would go down again. Changing the learning rate: I changed the learning rate and …
I am currently going through the Berkeley lectures on Reinforcement Learning. Specifically, I am at slide 5 of this lecture. At the bottom of that slide, the gradient of the expected sum of rewards is given by $$ \nabla_\theta J(\theta) = \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( Q(s_{i,t},a_{i,t}) - V(s_{i,t}) \right) $$ The Q-value function is defined as $$Q(s_t,a_t) = \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'}) \mid s_t,a_t]$$ At first glance, this makes sense, because I compare the value of taking the chosen action …
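In code, I picture that estimator as a loss like the following (a sketch with placeholder tensors; `advantages` stands for $Q(s_{i,t},a_{i,t}) - V(s_{i,t})$, however it is estimated):

```python
import torch

def policy_gradient_loss(log_probs, advantages):
    """
    log_probs:  (N, T) log pi_theta(a_{i,t} | s_{i,t}) of the actions actually taken
    advantages: (N, T) estimates of Q(s_{i,t}, a_{i,t}) - V(s_{i,t})
    Minimizing this loss follows the gradient estimator from the slide.
    """
    return -(log_probs * advantages.detach()).sum(dim=1).mean()

# toy usage with fake numbers
lp = torch.nn.functional.log_softmax(torch.randn(4, 10, 6, requires_grad=True), dim=-1)[..., 0]
adv = torch.randn(4, 10)
policy_gradient_loss(lp, adv).backward()
```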
In policy gradient, we have something like this: Is my understanding correct that if I apply log cross-entropy on the last layer, the gradient will be automatically calculated as per the formula above?
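Here is a sketch of what I mean (hypothetical tensors, not any particular library's built-in): the cross-entropy of the taken action is $-\log \pi(a \mid s)$, so weighting it by the return before backpropagating reproduces the REINFORCE-style gradient.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(32, 4, requires_grad=True)    # policy network output: 32 steps, 4 actions
actions = torch.randint(0, 4, (32,))               # actions actually taken
returns = torch.randn(32)                          # return (or advantage) for each step

neg_log_prob = F.cross_entropy(logits, actions, reduction="none")  # = -log pi(a|s) per step
loss = (neg_log_prob * returns).mean()             # return-weighted cross-entropy
loss.backward()                                    # gradient matches the policy gradient formula
```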
I am looking for a little clarity on what the policy gradient theorem means. My confusion lies in the fact that the reward $R$ in reinforcement learning is non-differentiable with respect to the policy parameters. Since that is the case, how does the central objective of policy gradients, finding the gradient of the reward $R$ with respect to the parameters of the policy function, even make sense?
So, I am training a deterministic policy, represented by a convolutional network. I have an action space which is basically a vector of weights / probabilities output by the network. The actions encoded in that vector then determine the value of my reward function, which is to be minimized. I train sequentially on time-series data and, at time t, always provide the last action a(t-1) as input to the CNN in addition to the state s(t). Thus: a(t) = model(a(t-1), s(t)) …
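To make the architecture concrete, a rough sketch of the kind of model I mean; the channel counts, action size, input resolution and the softmax head are placeholders, not my actual network:

```python
import torch
import torch.nn as nn

class RecurrentActionPolicy(nn.Module):
    """Deterministic policy a_t = f(a_{t-1}, s_t): CNN features of the state are
    concatenated with the previous action vector (illustrative shapes)."""

    def __init__(self, n_channels=3, n_actions=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(n_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(16 + n_actions, 64), nn.ReLU(),
            nn.Linear(64, n_actions), nn.Softmax(dim=-1),  # vector of weights / probabilities
        )

    def forward(self, prev_action, state):
        x = torch.cat([self.features(state), prev_action], dim=-1)
        return self.head(x)

policy = RecurrentActionPolicy()
a_prev = torch.zeros(1, 10)
s_t = torch.randn(1, 3, 32, 32)
a_t = policy(a_prev, s_t)   # a(t) = model(a(t-1), s(t))
```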
I'm following a tutorial on YouTube about reinforcement learning. They are going through the steps to understand policy gradient optimisation. In one of the steps he says that (delta policy)/policy == delta log policy, i.e. $\nabla \pi / \pi = \nabla \log \pi$. How can he make that jump? I have attached a screenshot from the video and also a link to the video. https://www.youtube.com/watch?v=wDVteayWWvU&list=PLMrJAkhIeNNR20Mz-VpzgfQs5zrYi085m&index=48&ab_channel=SteveBrunton
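For reference, written out with the chain rule the identity in question is

$$
\nabla_\theta \log \pi_\theta(a \mid s) = \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)},
$$

since $\frac{d}{dx} \log f(x) = \frac{f'(x)}{f(x)}$; rearranging gives $\nabla_\theta \pi_\theta = \pi_\theta \, \nabla_\theta \log \pi_\theta$, which is the form used inside the expectation.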
I was experimenting with my policy gradient reinforcement learning algorithm, and I was wondering if I could use a method similar to supervised cross-entropy. So, instead of using existing labels, I would generate a label for every step in the trajectory. Depending on the value of the action, I would shift the output of the stochastic policy (a neural network) towards a more efficient output and train on it as a label with a cross-entropy loss function. Example of an action: Real …
I'm attempting to implement the policy gradient example taken from the "Hands-On Machine Learning" book by Geron, which can be found here. The notebook uses TensorFlow and I'm attempting to do it with PyTorch. My model looks as follows:

```python
model = nn.Sequential(
    nn.Linear(4, 128),
    nn.ELU(),
    nn.Linear(128, 2),
)
```

Criterion and optimiser:

```python
criterion = nn.BCEWithLogitsLoss()
optim = torch.optim.Adam(model.parameters(), lr=0.01)
```

Training:

```python
env = gym.make("CartPole-v0")
n_games_per_update = 10
n_max_steps = 1000
n_iterations = 250
save_iterations = 10
discount_rate = 0.95

for iteration in range(n_iterations):
    …
```
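For comparison, here is a minimal REINFORCE-style loop in PyTorch that I would expect to work on CartPole, using a categorical distribution and log-probabilities instead of BCEWithLogitsLoss. This is not the book's exact recipe, and the hyperparameters and return normalization are guesses:

```python
import gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v0")
model = nn.Sequential(nn.Linear(4, 128), nn.ELU(), nn.Linear(128, 2))
optim = torch.optim.Adam(model.parameters(), lr=0.01)
discount_rate = 0.95

for iteration in range(250):
    log_probs, rewards = [], []
    obs = env.reset()
    done = False
    while not done:
        logits = model(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, done, _ = env.step(action.item())
        rewards.append(reward)

    # discounted returns, normalized to reduce variance
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + discount_rate * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    loss = -(torch.stack(log_probs) * returns).sum()
    optim.zero_grad()
    loss.backward()
    optim.step()
```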
I am reading through the paper Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review by Sergey Levine. I am having difficulty understanding this part of the derivation on Maximum Entropy Policy Gradients (Section 4.1). Note that in the above derivation, the term $\mathcal{H}(q_\theta(a_t \mid s_t))$ should have been $\log q_\theta(a_t \mid s_t)$, and that $\log$ refers to log base e (i.e. the natural logarithm). In the first line of the gradient, it should have been $r(s_t, a_t) - \log q_\theta(a_t \mid s_t)$. In particular, I …
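As I understand it, the reason the entropy term can be replaced by a log inside the expectation is just the definition of entropy as an expected negative log-probability:

$$
\mathcal{H}\big(q_\theta(\cdot \mid s_t)\big)
  = \mathbb{E}_{a_t \sim q_\theta(\cdot \mid s_t)}\!\left[-\log q_\theta(a_t \mid s_t)\right],
$$

so under the expectation over $a_t$ the per-step term $r(s_t, a_t) + \mathcal{H}(q_\theta(\cdot \mid s_t))$ can be written as $r(s_t, a_t) - \log q_\theta(a_t \mid s_t)$.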
How do I apply REINFORCE / policy gradient algorithms to a continuous action space? I have learnt that one of the advantages of policy gradients is that they are applicable to continuous action spaces. One way I can think of is discretizing the action space, the same way we do it for DQN (see the sketch below). Should we follow the same method for policy gradient algorithms too, or is there another way this is done? Thanks
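To make the discretization idea concrete, a rough sketch for a 1-D continuous action space like Pendulum's torque (the bin count and network size are arbitrary, and the policy would still be trained with REINFORCE as usual):

```python
import gym
import numpy as np
import torch

env = gym.make("Pendulum-v0")            # continuous action (torque) in [-2, 2]
n_bins = 11
bins = np.linspace(env.action_space.low[0], env.action_space.high[0], n_bins)

policy = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.Tanh(), torch.nn.Linear(64, n_bins))

obs = env.reset()
logits = policy(torch.as_tensor(obs, dtype=torch.float32))
dist = torch.distributions.Categorical(logits=logits)
idx = dist.sample()                       # discrete action index
obs, reward, done, _ = env.step([bins[idx.item()]])   # map the index back to a continuous torque
```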
I was reading a document about reinforcement learning policy gradients, http://web.stanford.edu/class/cs234/CS234Win2019/slides/lnotes8.pdf, when I encountered this expression, which is on page 6 just below (11): $ \nabla_{\theta} \mathbb{E}_{\pi_{\theta}}[r_{t'}] = \mathbb{E}_{\pi_{\theta}} \left[ r_{t'} \sum_{t = 0}^{t'} \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \right] $ The problem is I have no idea how this expression is derived. The document says that it can be derived the same way as (11), but I do not understand how. Any pointers or hints would be appreciated.
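My own attempt, applying the same likelihood-ratio trick as in (11) and writing $\tau_{0:t'}$ for the trajectory up to time $t'$ (so it may not be exactly how the notes intend it):

$$
\nabla_\theta \mathbb{E}_{\pi_\theta}[r_{t'}]
 = \nabla_\theta \sum_{\tau_{0:t'}} P_\theta(\tau_{0:t'})\, r_{t'}
 = \sum_{\tau_{0:t'}} P_\theta(\tau_{0:t'})\, r_{t'}\, \nabla_\theta \log P_\theta(\tau_{0:t'})
 = \mathbb{E}_{\pi_\theta}\!\left[ r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right],
$$

where the last step uses $\log P_\theta(\tau_{0:t'}) = \log \mu(s_0) + \sum_{t=0}^{t'} \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{t'-1} \log P(s_{t+1} \mid s_t, a_t)$ and the fact that only the policy terms depend on $\theta$.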
I am training an RL model using PPO on AAPL stock. There are 3 actions to take: Buy, Sell or Hold. If there is a Buy (or Sell) signal, the environment buys (or sells) everything. To trade in a given year, the model learns to trade by using the previous 5 years of data (it randomly selects a year out of these 5 years to train on). In the process, I accidentally put future data in the state, and the model for 2010 learnt that …
On this page of Keras's website, a reinforcement learning algorithm based on an actor-critic scheme is described. It is a deep policy gradient algorithm (hence DPG). Of course, Keras functions are central in this code; for this reason TensorFlow tries to get access to an NVIDIA GPU for acceleration, and otherwise it uses the available CPU cores. I believe that this code is not optimized because it uses only one core; the main part of the code …
I was wondering whether using a GPU will be effective if I am using an on-policy RL algorithm (e.g. PPO) as the model. That is, how can we use a GPU to decrease training time for an on-policy RL model? I recently trained a model and GPU utilization was around 2%.