Policy Gradient not "learning"
I'm attempting to implement the policy gradient algorithm from the "Hands-On Machine Learning" book by Géron, which can be found here. The notebook uses TensorFlow, and I'm attempting to reproduce it with PyTorch.
My imports and model look as follows:
import gym
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

model = nn.Sequential(
    nn.Linear(4, 128),
    nn.ELU(),
    nn.Linear(128, 2),
)
Criterion and optimiser:
criterion = nn.BCEWithLogitsLoss()
optim = torch.optim.Adam(model.parameters(), lr=0.01)
Training:
env = gym.make("CartPole-v0")
n_games_per_update = 10
n_max_steps = 1000
n_iterations = 250
save_iterations = 10
discount_rate = 0.95

for iteration in range(n_iterations):  # Run the game 250 times
    all_rewards = []
    all_gradients = []
    n_steps = []
    optim.zero_grad()
    for game in range(n_games_per_update):  # Run the game 10 times to accumulate gradients
        current_rewards = []
        current_gradients = []
        obs = env.reset()
        for step in range(n_max_steps):  # Run a single game for a maximum of 1000 steps
            logit = model(torch.tensor(obs, dtype=torch.float))
            output = F.softmax(logit, dim=0)
            c = Categorical(output)
            action = c.sample()
            y = torch.tensor([1.0 - action, action], dtype=torch.float)
            loss = criterion(logit, y)
            loss.backward()
            obs, reward, done, info = env.step(int(action))
            current_rewards.append(reward)
            current_gradients.append([p.grad for p in model.parameters()])
            if done:
                break
        n_steps.append(step)
        all_rewards.append(current_rewards)
        all_gradients.append(current_gradients)

    # Performs the discount and normalises
    all_rewards = discount_and_normalize_rewards(all_rewards, discount_rate=discount_rate)

    # For each batch of 10 games, multiply the discounted rewards against the gradients of the
    # network, then take the mean for each parameter tensor
    new_gradients = []
    for var_index, _ in enumerate(model.parameters()):  # one slot per parameter tensor (the gradient_placeholders of the TF notebook)
        means = []
        for game_index, rewards in enumerate(all_rewards):
            for step, reward in enumerate(rewards):
                means.append(reward * all_gradients[game_index][step][var_index])
        new_gradients.append(torch.mean(torch.stack(means), 0, True).squeeze(0))

    # Apply the new gradients to the network
    for p, g in zip(model.parameters(), new_gradients):
        p.grad = g.clone()
    optim.step()
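For completeness, discount_and_normalize_rewards is meant to mirror the helper from the notebook: compute the discounted return at every step of every game, then normalise all of them using the mean and standard deviation of the whole batch. Roughly along these lines:

import numpy as np

def discount_rewards(rewards, discount_rate):
    # Walk backwards through one game's rewards, accumulating the discounted return per step
    discounted = np.zeros(len(rewards))
    cumulative = 0.0
    for step in reversed(range(len(rewards))):
        cumulative = rewards[step] + cumulative * discount_rate
        discounted[step] = cumulative
    return discounted

def discount_and_normalize_rewards(all_rewards, discount_rate):
    # Discount each game's rewards, then normalise across every step of every game in the batch
    all_discounted = [discount_rewards(rewards, discount_rate) for rewards in all_rewards]
    flat = np.concatenate(all_discounted)
    return [(d - flat.mean()) / flat.std() for d in all_discounted]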
When I run the code for 250 iterations and print the average game length, I get:
Iteration: 50, Average Length: 18.2
Iteration: 100, Average Length: 23.4
Iteration: 150, Average Length: 29.9
Iteration: 200, Average Length: 11.2
Iteration: 250, Average Length: 38.6
The network isn't really improving, and training for longer doesn't help. My two questions are:
1. Is there anything obviously wrong that I'm doing?
2. I've noticed that the log of the probability is used in the TensorFlow implementation, but I'm not sure how to integrate it here (my rough attempt is sketched below).
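For question 2, this is the direction I think the log-probability version should take: instead of BCEWithLogitsLoss, score each sampled action by the negative log-probability the Categorical distribution assigns to it, weight those terms by the discounted and normalised returns, and only call backward() once per batch. The names episode_log_probs and returns are placeholders I've made up for this sketch (they would be collected inside the game loop, e.g. episode_log_probs[game].append(c.log_prob(action))), so I'm not sure this is the right way to slot it into the code above:

# Hypothetical REINFORCE-style update for one batch of 10 games.
# episode_log_probs[g] holds the c.log_prob(action) tensors from game g,
# returns[g] holds the matching discounted, normalised rewards.
loss_terms = []
for game_log_probs, game_returns in zip(episode_log_probs, returns):
    for log_prob, ret in zip(game_log_probs, game_returns):
        # Maximising expected return == minimising -log_prob * return
        loss_terms.append(-log_prob * ret)

optim.zero_grad()
loss = torch.stack(loss_terms).mean()  # single scalar loss for the whole batch
loss.backward()
optim.step()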
Topic policy-gradients pytorch implementation reinforcement-learning
Category Data Science