I am using the implementation of PPO2 in stable-baselines (a fork of OpenAI's baselines) for a reinforcement learning problem. My observation space is $9 \times 9 \times 191$ and my action space has size $144$. Given a state, only some actions are "legal". If an "illegal" action is taken, the environment returns the same state. Think of it as the game of Go, where you try to place a stone on an intersection that is already occupied. When a legal action is taken, it …
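To make the setup concrete, here is a toy sketch of the kind of environment I mean; the shapes match my problem, but the legality rule, transition and rewards below are placeholders, not my actual environment:

```python
import numpy as np
import gym
from gym import spaces

class MaskedBoardEnv(gym.Env):
    """Toy environment where illegal actions leave the state unchanged (placeholder logic)."""

    def __init__(self):
        self.observation_space = spaces.Box(low=0, high=1, shape=(9, 9, 191), dtype=np.float32)
        self.action_space = spaces.Discrete(144)
        self.state = np.zeros((9, 9, 191), dtype=np.float32)

    def legal_actions(self):
        # Placeholder legality rule; the real rule depends on the game.
        return [a for a in range(self.action_space.n) if a % 2 == 0]

    def step(self, action):
        if action not in self.legal_actions():
            # Illegal move: return the same state, no reward.
            return self.state, 0.0, False, {"illegal": True}
        self.state = self._apply(action)
        return self.state, 1.0, False, {"illegal": False}  # dummy reward

    def _apply(self, action):
        nxt = self.state.copy()
        nxt[0, 0, action % 191] = 1.0  # dummy state change for illustration
        return nxt

    def reset(self):
        self.state = np.zeros((9, 9, 191), dtype=np.float32)
        return self.state
```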
I am learning reinforcement learning and, as a practice exercise, I am trying to stabilize an inverted pendulum (gym: Pendulum-v0) in the upright position using a policy gradient method: REINFORCE. I have some questions; I have tried a lot but could not work these out, and an answer to any of them would help me. Thanks in advance. 1- Why are the observations in the pendulum code cos(theta), sin(theta) and theta_dot, and not just theta and theta_dot? 2- the action which I should send to the environment …
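For question 1, a small sketch of what I mean by the observation layout (assuming the standard Pendulum-v0 ordering of the observation vector); the angle can always be recovered from the (cos, sin) pair, and the pair varies smoothly instead of jumping at ±π:

```python
import gym
import numpy as np

env = gym.make("Pendulum-v0")
obs = env.reset()                      # obs = [cos(theta), sin(theta), theta_dot]
cos_th, sin_th, theta_dot = obs
theta = np.arctan2(sin_th, cos_th)     # recover the angle; no discontinuity at +/- pi
print(theta, theta_dot)
```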
David Silver's course on Reinforcement Learning explains how to obtain the optimal policy from the optimal value function. It seems very simple: you just act greedily, maximizing the value function at each step. In the case of a small grid world, once you have applied the Policy Evaluation algorithm, you get, for example, the following matrix for the value function: You start from the top-left corner and the only actions are the …
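For reference, here is roughly what I understand "act greedily with respect to the value function" to mean on a small grid world. The grid size and value matrix below are made up for illustration (not the exact example from the lecture), and since the per-step reward is constant in that example, picking the neighbour with the highest value is equivalent to maximizing reward plus discounted value:

```python
import numpy as np

# Hypothetical 4x4 value function produced by policy evaluation (illustrative numbers).
V = np.array([[  0., -14., -20., -22.],
              [-14., -18., -20., -20.],
              [-20., -20., -18., -14.],
              [-22., -20., -14.,   0.]])

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def greedy_action(state, V):
    """Pick the action whose successor state has the highest value."""
    best, best_val = None, -np.inf
    for name, (dr, dc) in ACTIONS.items():
        r, c = state[0] + dr, state[1] + dc
        if 0 <= r < V.shape[0] and 0 <= c < V.shape[1] and V[r, c] > best_val:
            best, best_val = name, V[r, c]
    return best

print(greedy_action((0, 1), V))  # -> "left", towards the terminal corner
```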
I am working on training an RNN model for caption generation with the REINFORCE algorithm. I adopt the self-critic strategy (see the paper Self-Critical Sequence Training for Image Captioning) to reduce the variance. I initialize the model with a pre-trained RNN model (a.k.a. a warm start). This pre-trained model (trained with a log-likelihood objective) got a 0.6 F1 score on my task. When I use the Adam optimizer to train this policy gradient objective, the performance of my model drops to 0 after a few epochs. However, if …
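For concreteness, this is a rough sketch of the self-critical loss I have in mind; the tensor names and the toy usage are placeholders, not the paper's actual code. The reward of the sampled caption minus the reward of the greedy caption weights the log-probability of the sample:

```python
import torch

def self_critical_loss(log_probs, sampled_reward, greedy_reward):
    """
    log_probs:      (batch, seq_len) log-probabilities of the *sampled* tokens
    sampled_reward: (batch,) sequence-level reward (e.g. F1) of the sampled caption
    greedy_reward:  (batch,) reward of the greedy/argmax caption (the baseline)
    """
    advantage = (sampled_reward - greedy_reward).detach()  # no gradient through rewards
    # REINFORCE with a self-critical baseline: minimize -(r_sample - r_greedy) * log p(sample)
    return -(advantage.unsqueeze(1) * log_probs).sum(dim=1).mean()

# toy usage with fake per-token log-probabilities
lp = torch.nn.functional.log_softmax(torch.randn(2, 5, 10, requires_grad=True), dim=-1)
lp_taken = lp[:, :, 0]   # pretend token 0 was sampled at every step
loss = self_critical_loss(lp_taken, torch.tensor([0.6, 0.4]), torch.tensor([0.5, 0.5]))
loss.backward()
```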
So, I am using an agent with a state-action policy and I am trying to understand the concept of experience replay memory (ERM). As far as I have learned so far, the ERM is basically a buffer that stores sets of experiences $e_t = (s_t, a_t, r_{t+1}, s_{t+1})$, where $s$ is the state, $a$ the action and $r$ the reward, as usual. Basically, in order to use a network that learns to predict the correct action from such experiences, the network's input should …
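To check my understanding, a minimal sketch of such a buffer (the capacity and uniform random sampling are arbitrary choices here):

```python
import random
from collections import deque

class ReplayMemory:
    """Stores transitions e_t = (s_t, a_t, r_{t+1}, s_{t+1}) and samples random minibatches."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped when full

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)
```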
I implemented a self-critical policy gradient (as described here) for text summarization. However, after training, the results are not as high as expected (actually lower than without RL...). I'm looking for general guidelines on how to debug RL-based algorithms. I tried: overfitting on small datasets (~6 samples): I could increase the average reward, but it does not converge; sometimes the average reward would go down again. Changing the learning rate: I changed the learning rate and …
I am currently going through the Berkeley lectures on Reinforcement Learning. Specifically, I am at slide 5 of this lecture. At the bottom of that slide, the gradient of the expected sum of rewards is given by $$ \nabla_\theta J(\theta) = \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( Q(s_{i,t},a_{i,t}) - V(s_{i,t}) \right) $$ The Q-value function is defined as $$Q(s_t,a_t) = \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'}) \mid s_t,a_t]$$ At first glance, this makes sense, because I compare the value of taking the chosen action …
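In code, I picture that estimator as a loss like the following (a sketch with placeholder tensors; `advantages` stands for $Q(s_{i,t},a_{i,t}) - V(s_{i,t})$, however it is estimated):

```python
import torch

def policy_gradient_loss(log_probs, advantages):
    """
    log_probs:  (N, T) log pi_theta(a_{i,t} | s_{i,t}) of the actions actually taken
    advantages: (N, T) estimates of Q(s_{i,t}, a_{i,t}) - V(s_{i,t})
    Minimizing this loss follows the gradient estimator from the slide.
    """
    return -(log_probs * advantages.detach()).sum(dim=1).mean()

# toy usage with fake numbers
lp = torch.nn.functional.log_softmax(torch.randn(4, 10, 6, requires_grad=True), dim=-1)[..., 0]
adv = torch.randn(4, 10)
policy_gradient_loss(lp, adv).backward()
```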
In policy gradient, we have something like this: Is my understanding correct that if I apply log cross-entropy on the last layer, the gradient will be automatically calculated as per the formula above?
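Here is a sketch of what I mean (hypothetical tensors, not any particular library's built-in): the cross-entropy of the taken action is $-\log \pi(a \mid s)$, so weighting it by the return before backpropagating reproduces the REINFORCE-style gradient.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(32, 4, requires_grad=True)    # policy network output: 32 steps, 4 actions
actions = torch.randint(0, 4, (32,))               # actions actually taken
returns = torch.randn(32)                          # return (or advantage) for each step

neg_log_prob = F.cross_entropy(logits, actions, reduction="none")  # = -log pi(a|s) per step
loss = (neg_log_prob * returns).mean()             # return-weighted cross-entropy
loss.backward()                                    # gradient matches the policy gradient formula
```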
I am looking for a little clarity on what the policy gradient theorem means. My confusion lies in the fact that the reward $R$ in reinforcement learning is non-differentiable with respect to the policy parameters. Since that is the case, how does the central objective of policy gradients, finding the gradient of the reward $R$ with respect to the parameters of the policy function, even make sense?
So, I am training a deterministic policy, represented by a convolutional network. I have an action space which is basically a vector of weights / probabilities output by the network. The actions encoded in that vector then determine the value of my reward function, which is to be minimized. I train sequentially on time-series data and, at time t, always provide the last action a(t-1) as input to the CNN in addition to the state s(t). Thus: a(t) = model(a(t-1), s(t)) …
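To make the architecture concrete, a rough sketch of the kind of model I mean; the channel counts, action size, input resolution and the softmax head are placeholders, not my actual network:

```python
import torch
import torch.nn as nn

class RecurrentActionPolicy(nn.Module):
    """Deterministic policy a_t = f(a_{t-1}, s_t): CNN features of the state are
    concatenated with the previous action vector (illustrative shapes)."""

    def __init__(self, n_channels=3, n_actions=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(n_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(16 + n_actions, 64), nn.ReLU(),
            nn.Linear(64, n_actions), nn.Softmax(dim=-1),  # vector of weights / probabilities
        )

    def forward(self, prev_action, state):
        x = torch.cat([self.features(state), prev_action], dim=-1)
        return self.head(x)

policy = RecurrentActionPolicy()
a_prev = torch.zeros(1, 10)
s_t = torch.randn(1, 3, 32, 32)
a_t = policy(a_prev, s_t)   # a(t) = model(a(t-1), s(t))
```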
I'm following a tutorial on YouTube about reinforcement learning. They are going through the steps to understand policy gradient optimisation. In one of the steps he says that (delta policy)/policy == delta log policy, i.e. $\nabla \pi / \pi = \nabla \log \pi$. How can he make that jump? I have attached a screenshot from the video and also a link to the video. https://www.youtube.com/watch?v=wDVteayWWvU&list=PLMrJAkhIeNNR20Mz-VpzgfQs5zrYi085m&index=48&ab_channel=SteveBrunton
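For reference, written out with the chain rule the identity in question is

$$
\nabla_\theta \log \pi_\theta(a \mid s) = \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)},
$$

since $\frac{d}{dx} \log f(x) = \frac{f'(x)}{f(x)}$; rearranging gives $\nabla_\theta \pi_\theta = \pi_\theta \, \nabla_\theta \log \pi_\theta$, which is the form used inside the expectation.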
I was experimenting with my policy gradient reinforcement learning algorithm, and I was wondering if I could use a method similar to supervised cross-entropy. So, instead of using existing labels, I would generate a label for every step in the trajectory. Depending on the value of the action, I would shift the output of the stochastic policy (a neural network) towards a more efficient output and train on it as a label with a cross-entropy loss function. Example of an action: Real …
I'm attempting to implement the policy gradient example taken from the "Hands-On Machine Learning" book by Geron, which can be found here. The notebook uses TensorFlow and I'm attempting to do it with PyTorch. My model looks as follows:

```python
model = nn.Sequential(
    nn.Linear(4, 128),
    nn.ELU(),
    nn.Linear(128, 2),
)
```

Criterion and optimiser:

```python
criterion = nn.BCEWithLogitsLoss()
optim = torch.optim.Adam(model.parameters(), lr=0.01)
```

Training:

```python
env = gym.make("CartPole-v0")
n_games_per_update = 10
n_max_steps = 1000
n_iterations = 250
save_iterations = 10
discount_rate = 0.95

for iteration in range(n_iterations):
    …
```
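For comparison, here is a minimal REINFORCE-style loop in PyTorch that I would expect to work on CartPole, using a categorical distribution and log-probabilities instead of BCEWithLogitsLoss. This is not the book's exact recipe, and the hyperparameters and return normalization are guesses:

```python
import gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v0")
model = nn.Sequential(nn.Linear(4, 128), nn.ELU(), nn.Linear(128, 2))
optim = torch.optim.Adam(model.parameters(), lr=0.01)
discount_rate = 0.95

for iteration in range(250):
    log_probs, rewards = [], []
    obs = env.reset()
    done = False
    while not done:
        logits = model(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, done, _ = env.step(action.item())
        rewards.append(reward)

    # discounted returns, normalized to reduce variance
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + discount_rate * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    loss = -(torch.stack(log_probs) * returns).sum()
    optim.zero_grad()
    loss.backward()
    optim.step()
```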
I am reading through the paper Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review by Sergey Levine. I am having difficulty understanding this part of the derivation on Maximum Entropy Policy Gradients (Section 4.1). Note that in the above derivation, the term $\mathcal{H}(q_\theta(a_t \mid s_t))$ should have been $\log q_\theta(a_t \mid s_t)$, and that $\log$ refers to log base e (i.e. the natural logarithm). In the first line of the gradient, it should have been $r(s_t, a_t) - \log q_\theta(a_t \mid s_t)$. In particular, I …
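As I understand it, the reason the entropy term can be replaced by a log inside the expectation is just the definition of entropy as an expected negative log-probability:

$$
\mathcal{H}\big(q_\theta(\cdot \mid s_t)\big)
  = \mathbb{E}_{a_t \sim q_\theta(\cdot \mid s_t)}\!\left[-\log q_\theta(a_t \mid s_t)\right],
$$

so under the expectation over $a_t$ the per-step term $r(s_t, a_t) + \mathcal{H}(q_\theta(\cdot \mid s_t))$ can be written as $r(s_t, a_t) - \log q_\theta(a_t \mid s_t)$.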
How do I apply REINFORCE / policy gradient algorithms to a continuous action space? I have learnt that one of the advantages of policy gradients is that they are applicable to continuous action spaces. One way I can think of is discretizing the action space, the same way we do it for DQN (see the sketch below). Should we follow the same method for policy gradient algorithms too, or is there another way this is done? Thanks
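To make the discretization idea concrete, a rough sketch for a 1-D continuous action space like Pendulum's torque (the bin count and network size are arbitrary, and the policy would still be trained with REINFORCE as usual):

```python
import gym
import numpy as np
import torch

env = gym.make("Pendulum-v0")            # continuous action (torque) in [-2, 2]
n_bins = 11
bins = np.linspace(env.action_space.low[0], env.action_space.high[0], n_bins)

policy = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.Tanh(), torch.nn.Linear(64, n_bins))

obs = env.reset()
logits = policy(torch.as_tensor(obs, dtype=torch.float32))
dist = torch.distributions.Categorical(logits=logits)
idx = dist.sample()                       # discrete action index
obs, reward, done, _ = env.step([bins[idx.item()]])   # map the index back to a continuous torque
```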
I was reading a document about reinforcement learning policy gradients, http://web.stanford.edu/class/cs234/CS234Win2019/slides/lnotes8.pdf, when I encountered this expression, which is on page 6 just below (11): $ \nabla_{\theta} \mathbb{E}_{\pi_{\theta}}[r_{t'}] = \mathbb{E}_{\pi_{\theta}} \left[ r_{t'} \sum_{t = 0}^{t'} \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \right] $ The problem is I have no idea how this expression is derived. The document says that it can be derived the same way as (11), but I do not understand how. Any pointers or hints would be appreciated.
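My own attempt, applying the same likelihood-ratio trick as in (11) and writing $\tau_{0:t'}$ for the trajectory up to time $t'$ (so it may not be exactly how the notes intend it):

$$
\nabla_\theta \mathbb{E}_{\pi_\theta}[r_{t'}]
 = \nabla_\theta \sum_{\tau_{0:t'}} P_\theta(\tau_{0:t'})\, r_{t'}
 = \sum_{\tau_{0:t'}} P_\theta(\tau_{0:t'})\, r_{t'}\, \nabla_\theta \log P_\theta(\tau_{0:t'})
 = \mathbb{E}_{\pi_\theta}\!\left[ r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right],
$$

where the last step uses $\log P_\theta(\tau_{0:t'}) = \log \mu(s_0) + \sum_{t=0}^{t'} \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{t'-1} \log P(s_{t+1} \mid s_t, a_t)$ and the fact that only the policy terms depend on $\theta$.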
I am training an RL model using PPO on AAPL stock. There are 3 actions to take: Buy, Sell or Hold. If there is a Buy (or Sell) signal, the environment buys (or sells) everything. To trade in a given year, the model learns to trade by using the previous 5 years of data (it randomly selects a year out of these 5 years to train on). In the process, I accidentally put future data in the state, and the model for 2010 learnt that …
On this page of Keras's website, a reinforcement learning algorithm based on an actor-critic scheme is described. It is a deep policy gradient algorithm (hence DPG). Of course, Keras functions are central in this code; for this reason TensorFlow tries to get access to an NVIDIA GPU for acceleration, and otherwise it uses the available CPU cores. I believe that this code is not optimized because it uses only one core; the main part of the code …
I was wondering whether using a GPU will be effective if I am using an on-policy RL algorithm (e.g. PPO) as the model. That is, how can we use a GPU to decrease training time for an on-policy RL model? I recently trained a model and GPU utilization was around 2%.