Policy Gradient not "learning"
I'm attempting to implement the policy gradient algorithm from the "Hands-On Machine Learning" book by Géron, which can be found here. The notebook uses TensorFlow, and I'm attempting to reproduce it with PyTorch.
My imports and model look as follows:
import gym
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

model = nn.Sequential(
    nn.Linear(4, 128),
    nn.ELU(),
    nn.Linear(128, 2),
)
Criterion and optimiser:
criterion = nn.BCEWithLogitsLoss()
optim = torch.optim.Adam(model.parameters(), lr=0.01)
Training:
env = gym.make("CartPole-v0")
n_games_per_update = 10
n_max_steps = 1000
n_iterations = 250
save_iterations = 10
discount_rate = 0.95

for iteration in range(n_iterations):  # Run the game 250 times
    all_rewards = []
    all_gradients = []
    n_steps = []
    optim.zero_grad()
    for game in range(n_games_per_update):  # Run the game 10 times to accumulate gradients
        current_rewards = []
        current_gradients = []
        obs = env.reset()
        for step in range(n_max_steps):  # Run a single game for a maximum of 1000 steps
            logit = model(torch.tensor(obs, dtype=torch.float))
            output = F.softmax(logit, dim=0)
            c = Categorical(output)
            action = c.sample()
            y = torch.tensor([1.0 - action, action], dtype=torch.float)
            loss = criterion(logit, y)
            loss.backward()
            obs, reward, done, info = env.step(int(action))
            current_rewards.append(reward)
            current_gradients.append([p.grad for p in model.parameters()])
            if done:
                break
        n_steps.append(step)
        all_rewards.append(current_rewards)
        all_gradients.append(current_gradients)

    # Performs the discount and normalises
    all_rewards = discount_and_normalize_rewards(all_rewards, discount_rate=discount_rate)

    # For each batch of 10 games, multiply the discounted rewards against the gradients of the
    # network, then take the mean for each parameter tensor
    new_gradients = []
    for var_index, _ in enumerate(model.parameters()):  # one slot per parameter tensor (the gradient_placeholders of the TF notebook)
        means = []
        for game_index, rewards in enumerate(all_rewards):
            for step, reward in enumerate(rewards):
                means.append(reward * all_gradients[game_index][step][var_index])
        new_gradients.append(torch.mean(torch.stack(means), 0, True).squeeze(0))

    # Apply the new gradients to the network
    for p, g in zip(model.parameters(), new_gradients):
        p.grad = g.clone()
    optim.step()
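For completeness, discount_and_normalize_rewards is meant to mirror the helper from the notebook: compute the discounted return at every step of every game, then normalise all of them using the mean and standard deviation of the whole batch. Roughly along these lines:

import numpy as np

def discount_rewards(rewards, discount_rate):
    # Walk backwards through one game's rewards, accumulating the discounted return per step
    discounted = np.zeros(len(rewards))
    cumulative = 0.0
    for step in reversed(range(len(rewards))):
        cumulative = rewards[step] + cumulative * discount_rate
        discounted[step] = cumulative
    return discounted

def discount_and_normalize_rewards(all_rewards, discount_rate):
    # Discount each game's rewards, then normalise across every step of every game in the batch
    all_discounted = [discount_rewards(rewards, discount_rate) for rewards in all_rewards]
    flat = np.concatenate(all_discounted)
    return [(d - flat.mean()) / flat.std() for d in all_discounted]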
When I run the code for 250 iterations and print the average game length, I get:
Iteration: 50, Average Length: 18.2
Iteration: 100, Average Length: 23.4
Iteration: 150, Average Length: 29.9
Iteration: 200, Average Length: 11.2
Iteration: 250, Average Length: 38.6
The network isn't really improving, and training for longer doesn't help. My two questions are:
1. Is there anything obviously wrong that I'm doing?
2. I've noticed that the log of the probability is used in the TensorFlow implementation, but I'm not sure how to integrate it here (my rough attempt is sketched below).
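For question 2, this is the direction I think the log-probability version should take: instead of BCEWithLogitsLoss, score each sampled action by the negative log-probability the Categorical distribution assigns to it, weight those terms by the discounted and normalised returns, and only call backward() once per batch. The names episode_log_probs and returns are placeholders I've made up for this sketch (they would be collected inside the game loop, e.g. episode_log_probs[game].append(c.log_prob(action))), so I'm not sure this is the right way to slot it into the code above:

# Hypothetical REINFORCE-style update for one batch of 10 games.
# episode_log_probs[g] holds the c.log_prob(action) tensors from game g,
# returns[g] holds the matching discounted, normalised rewards.
loss_terms = []
for game_log_probs, game_returns in zip(episode_log_probs, returns):
    for log_prob, ret in zip(game_log_probs, game_returns):
        # Maximising expected return == minimising -log_prob * return
        loss_terms.append(-log_prob * ret)

optim.zero_grad()
loss = torch.stack(loss_terms).mean()  # single scalar loss for the whole batch
loss.backward()
optim.step()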
Topic policy-gradients pytorch implementation reinforcement-learning
Category Data Science