Agent always takes the same action in DQN - Reinforcement Learning

I have trained an RL agent using the DQN algorithm. After 20,000 episodes my rewards have converged. However, when I now test this agent, it always takes the same action, irrespective of the state. I find this very strange. Can someone help me with this? Is there any reason anyone can think of why the agent would behave this way?

Reward plot

When I test the agent:

import numpy as np
import matplotlib.pyplot as plt

state = env.reset()
print('State: ', state)

# Reshape the state into a batch of size 1 for the network
state_encod = np.reshape(state, [1, state_size])
q_values = model.predict(state_encod)

# Greedy action: index of the largest Q-value
action_key = np.argmax(q_values)
print(action_key)
print(index_to_action_mapping[action_key])
print(q_values[0][0])
print(q_values[0][action_key])

# Plot the Q-value of every action for this state
q_values_plotting = []
for i in range(action_size):
    q_values_plotting.append(q_values[0][i])

plt.plot(np.arange(action_size), q_values_plotting)
plt.show()

Every time it gives the same Q-value plot, even though the initial state is different every time. Below is the Q-value plot.

Testing code:

from copy import deepcopy

test_rewards = []
for episode in range(1000):
    terminal_state = False
    state = env.reset()
    episode_reward = 0
    while not terminal_state:
        print('State: ', state)
        # Encode the state and pick the greedy action from the Q-network
        state_encod = np.reshape(state, [1, state_size])
        q_values = model.predict(state_encod)
        action_key = np.argmax(q_values)
        action = index_to_action_mapping[action_key]
        print('Action: ', action)
        next_state, reward, terminal_state = env.step(state, action)
        print('Next_state: ', next_state)
        print('Reward: ', reward)
        print('Terminal_state: ', terminal_state, '\n')
        print('----------------------------')
        episode_reward += reward
        state = deepcopy(next_state)
    print('Episode Reward: ' + str(episode_reward))
    test_rewards.append(episode_reward)

plt.plot(test_rewards)
plt.show()

Thanks.

Topic policy-gradients actor-critic dqn reinforcement-learning

Category Data Science


This may seem obvious, but have you tried using a Boltzmann distribution for action selection instead of argmax? This is known to encourage exploration and can be done by setting the action policy to

$$p(a|s) = \frac{\exp(\beta Q(a,s))}{\sum_{a'} \exp(\beta Q(a',s))},$$

where $\beta$ is the temperature parameter and governs the exploration-exploitation trade-off. This is also known as the softmax distribution.

Put into code, this would be something like:

beta = 1.0
q = q_values[0]  # Q-values for the current state, shape (action_size,)
p_a_s = np.exp(beta * q) / np.sum(np.exp(beta * q))
action_key = np.random.choice(a=action_size, p=p_a_s)

This can lead to numerical instabilities because of the exponential, but that can be handled e.g. by first subtracting the highest q value:

q = q - np.max(q)
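
Putting both pieces together, a minimal sketch of a numerically stable Boltzmann action selector could look like this (the names q_values, action_size and beta are taken from the snippets above, and the helper function boltzmann_action is only an illustration):

import numpy as np

def boltzmann_action(q_values, beta=1.0):
    """Sample an action index from a softmax (Boltzmann) distribution over Q-values."""
    q = np.asarray(q_values).flatten()   # shape (action_size,)
    q = q - np.max(q)                    # subtract the max for numerical stability
    exp_q = np.exp(beta * q)
    p_a_s = exp_q / np.sum(exp_q)        # probabilities over actions
    return np.random.choice(len(q), p=p_a_s)

# Example usage with the prediction from the question:
# q_values = model.predict(state_encod)
# action_key = boltzmann_action(q_values, beta=1.0)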

  • The action taken by the agent may actually be the optimal action for every state it encounters.
  • If the same state is fed in every time, you will get the same Q-values and the same reward. The state may not be getting updated properly; since next_state comes back from the environment, check the deepcopy call and the values returned by env.step.
  • The model might not be updating its parameters, so its Q-values never change. Check how the model updates its parameters and Q-values during training (a quick sanity check is sketched below).
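
As a quick sanity check (a sketch only, reusing model, env and state_size from the question), you can feed a few different states through the trained network and compare the outputs. If the predicted Q-values are identical for clearly different states, the network output does not depend on its input, which usually points to a training or state-encoding problem:

import numpy as np

# Hypothetical sanity check: compare Q-values for several freshly reset states.
states = [env.reset() for _ in range(5)]
predictions = [model.predict(np.reshape(s, [1, state_size]))[0] for s in states]

for s, q in zip(states, predictions):
    print('State:', s, '-> Q-values:', q, '-> argmax:', np.argmax(q))

# If every Q-value vector is (nearly) identical, the network has collapsed
# to a constant function of its input.
print('Max difference between Q-value vectors:',
      np.max([np.max(np.abs(predictions[0] - q)) for q in predictions[1:]]))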
