What is a minimal setup to solve the CartPole-v0 with DQN?

I solved CartPole-v0 with a CEM agent pretty easily (experiments and code), but I struggle to find a setup that works with DQN.

Do you know which parameters should be adjusted so that the mean reward is about 200 for this problem?

What I tried

  • Adjustments to the model: deeper / shallower, neurons per layer
  • Memory size (how many steps are stored for replay)

What I'm unsure about

  • How should I choose the memory size? Is bigger always better? Some quick experiments indicate that there might be a sweet spot (neither too large nor too small), but I have no idea how to find the region where it lies.
  • Window size: A window size of 1 seems to work well in this case; bigger window sizes seem to be worse. Is there any indicator for when to increase the window size?
  • How to deal with delayed rewards: Suppose the pole did not start upright but hanging down; then the agent would only receive rewards late in the episode. Would that be a case for increasing the window size?

My current code

I use Keras-RL for the DQN agent and OpenAI Gym for the environment.

Here is my code:

#!/usr/bin/env python

import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy
from rl.memory import EpisodeParameterMemory


def main(env_name, nb_steps):
    # Get the environment and extract the number of actions.
    env = gym.make(env_name)
    np.random.seed(123)
    env.seed(123)

    nb_actions = env.action_space.n
    input_shape = (1,) + env.observation_space.shape
    model = create_nn_model(input_shape, nb_actions)

    # Finally, we configure and compile our agent.
    memory = EpisodeParameterMemory(limit=2000, window_length=1)

    policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps', value_max=1.,
                                  value_min=.1, value_test=.05,
                                  nb_steps=1000000)
    agent = DQNAgent(model=model, nb_actions=nb_actions, policy=policy,
                     memory=memory, nb_steps_warmup=50000,
                     gamma=.99, target_model_update=10000,
                     train_interval=4, delta_clip=1.)
    agent.compile(Adam(lr=.00025), metrics=['mae'])
    agent.fit(env, nb_steps=nb_steps, visualize=False, verbose=2)

    # After training is done, we save the final weights.
    agent.save_weights('dqn_{}_params.h5f'.format(env_name), overwrite=True)

    # Finally, evaluate the agent
    history = agent.test(env, nb_episodes=100, visualize=False)
    rewards = np.array(history.history['episode_reward'])
    print(("Test rewards (#episodes={}): mean={:5.2f}, std={:5.2f}, "
           "min={:5.2f}, max={:5.2f}")
          .format(len(rewards),
                  rewards.mean(),
                  rewards.std(),
                  rewards.min(),
                  rewards.max()))


def create_nn_model(input_shape, nb_actions):
    """
    Create a neural network model which maps the input to actions.

    Parameters
    ----------
    input_shape : tuple of int
    nb_actions : int

    Returns
    -------
    model : keras Model object
    """
    model = Sequential()
    model.add(Flatten(input_shape=input_shape))
    model.add(Dense(32))
    model.add(Activation('relu'))
    model.add(Dense(64))
    model.add(Activation('relu'))
    model.add(Dense(64))
    model.add(Activation('relu'))
    model.add(Dense(512))
    model.add(Activation('relu'))
    model.add(Dense(nb_actions))
    model.add(Activation('linear'))
    model.summary()
    return model


def get_parser():
    """Get parser object."""
    from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter
    parser = ArgumentParser(description=__doc__,
                            formatter_class=ArgumentDefaultsHelpFormatter)
    parser.add_argument("--env",
                        dest="environment",
                        help="OpenAI Gym environment",
                        metavar="ENVIRONMENT",
                        default="CartPole-v0")
    parser.add_argument("--steps",
                        dest="steps",
                        default=10000,
                        type=int,
                        help="how many steps is the model trained?")
    return parser


if __name__ == "__main__":
    args = get_parser().parse_args()
    main(args.environment, args.steps)

Topic dqn openai-gym keras-rl reinforcement-learning

Category Data Science


As previously stated in the comment, you could simply look at the example in the repository you're using.

Here are some comments on your choices:

  • A rule of thumb is to decrease the number of hidden units as you go deeper in a dense NN. You do the opposite: you start with 32 hidden units and end with 512. There is also no need for that many units; you could easily solve this environment with 32 or even 16 units per hidden layer (see the sketch after this list).
  • Another rule of thumb: the deeper your NN is, the slower it learns. That said, it would learn faster if you put only one hidden layer with the same total number of units you currently spread over several hidden layers. However, learning faster does not mean learning better: usually a deeper NN shows better results in the long run. Of course it doesn't if you have thousands of hidden layers with only one unit each.
  • The learning rate: the lower it is, the slower your NN learns, but the better the result it tends to reach in the long run. But what is the long run? This is a trade-off, because you cannot wait years to train a simple NN. I had good results with a learning rate of 1e-3, but this depends on the size of the network, how deep it is, your batch size, and so on.
  • Memory and warmup steps: what is the purpose of 50000 warmup time-steps if you only store 2000 transitions? It makes no sense. Also, for this kind of environment you probably don't need any warmup at all. On the other hand, a bigger memory is almost always good; the only downside is usually the space it takes.
  • Target model update: are you really sure about that value? With target_model_update=10000, the target network is only hard-updated every 10000 steps, which is as long as your whole default training run, so it essentially never changes. I think you used it without looking at what it represents.
  • Policy: I think you should use a policy built for Q-learning. There are several already implemented, like EpsGreedyQPolicy, GreedyQPolicy, BoltzmannQPolicy, MaxBoltzmannQPolicy, and BoltzmannGumbelQPolicy. Also note that annealing epsilon over 1000000 steps while training for only a few thousand means epsilon stays close to 1, so the agent acts almost randomly for the whole run.
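
For reference, here is a minimal sketch in the spirit of the cart-pole DQN example in the keras-rl repository. It is not your code: the train_and_evaluate helper, its default values, and the choice of BoltzmannQPolicy are my own illustration of the points above. Note that it uses SequentialMemory, which is the replay memory the DQN agent expects, rather than the EpisodeParameterMemory you carried over from the CEM experiment.

#!/usr/bin/env python
"""Minimal DQN setup for CartPole-v0 (illustrative sketch, not your script)."""

import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory


def train_and_evaluate(memory_limit=50000, nb_steps=50000, lr=1e-3):
    """Train a DQN agent on CartPole-v0 and return the mean test reward."""
    env = gym.make('CartPole-v0')
    np.random.seed(123)
    env.seed(123)
    nb_actions = env.action_space.n

    # A small network is enough: three hidden layers with 16 units each.
    model = Sequential()
    model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
    for _ in range(3):
        model.add(Dense(16))
        model.add(Activation('relu'))
    model.add(Dense(nb_actions))
    model.add(Activation('linear'))

    # DQN replays individual transitions, so it needs SequentialMemory
    # (EpisodeParameterMemory is meant for CEM).
    memory = SequentialMemory(limit=memory_limit, window_length=1)
    policy = BoltzmannQPolicy()

    # Almost no warmup is needed here; target_model_update < 1 is treated by
    # keras-rl as a soft-update factor instead of a hard-update interval.
    agent = DQNAgent(model=model, nb_actions=nb_actions, memory=memory,
                     nb_steps_warmup=10, target_model_update=1e-2,
                     policy=policy)
    agent.compile(Adam(lr=lr), metrics=['mae'])
    agent.fit(env, nb_steps=nb_steps, visualize=False, verbose=2)

    history = agent.test(env, nb_episodes=100, visualize=False)
    rewards = np.array(history.history['episode_reward'])
    return rewards.mean()


if __name__ == '__main__':
    print("mean test reward: {:.2f}".format(train_and_evaluate()))

The point is not these exact values but the structure: a small network, a replay memory that actually covers many episodes, little or no warmup, a soft target update, and a simple Q-learning policy.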

To conclude, look at the examples; they are not self-explanatory, but they are a good starting point. Modify them and observe what changes when you change only one parameter at a time. After that, try a different environment, because the cart-pole environment is very easy for a DQN. The mountain car environment is a good next step (but try not to reshape the reward, otherwise it is too easy).
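
If you want to follow the one-parameter-at-a-time advice programmatically, a tiny sweep like the following would do. It assumes the hypothetical train_and_evaluate helper from the sketch above is defined in the same file.

# Vary only the replay memory limit and keep everything else fixed.
for memory_limit in (500, 2000, 10000, 50000):
    mean_reward = train_and_evaluate(memory_limit=memory_limit)
    print("memory_limit={:6d} -> mean test reward {:6.2f}".format(
        memory_limit, mean_reward))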

Happy learning!
