Agent always takes the same action in DQN - Reinforcement Learning

I have trained an RL agent using the DQN algorithm. After 20,000 episodes my rewards have converged. However, when I now test this agent, it always takes the same action, irrespective of the state. I find this very strange. Can someone help me with this? Is there any reason anyone can think of why the agent would behave this way?

Reward plot

When I test the agent:

import numpy as np
import matplotlib.pyplot as plt

state = env.reset()
print('State: ', state)

# Reshape the state into a batch of size 1 for the network
state_encod = np.reshape(state, [1, state_size])
q_values = model.predict(state_encod)

# Greedy action: index of the largest Q-value
action_key = np.argmax(q_values)
print(action_key)
print(index_to_action_mapping[action_key])
print(q_values[0][0])
print(q_values[0][action_key])

# Plot the Q-value of every action for this state
q_values_plotting = []
for i in range(action_size):
    q_values_plotting.append(q_values[0][i])

plt.plot(np.arange(action_size), q_values_plotting)
plt.show()

Every time it gives the same Q-value plot, even though the initial state is different every time. Below is the Q-value plot.

Testing code:

from copy import deepcopy

test_rewards = []
for episode in range(1000):
    terminal_state = False
    state = env.reset()
    episode_reward = 0
    while not terminal_state:
        print('State: ', state)
        # Encode the state and pick the greedy action from the Q-network
        state_encod = np.reshape(state, [1, state_size])
        q_values = model.predict(state_encod)
        action_key = np.argmax(q_values)
        action = index_to_action_mapping[action_key]
        print('Action: ', action)
        next_state, reward, terminal_state = env.step(state, action)
        print('Next_state: ', next_state)
        print('Reward: ', reward)
        print('Terminal_state: ', terminal_state, '\n')
        print('----------------------------')
        episode_reward += reward
        state = deepcopy(next_state)
    print('Episode Reward: ' + str(episode_reward))
    test_rewards.append(episode_reward)

plt.plot(test_rewards)
plt.show()

Thanks.

Topic policy-gradients actor-critic dqn reinforcement-learning

Category Data Science


This may seem obvious, but have you tried using a Boltzmann distribution for action selection instead of argmax? This is known to encourage exploration and can be done by setting the action policy to

$$p(a|s) = \frac{\exp(\beta Q(a,s))}{\sum_{a'} \exp(\beta Q(a',s))},$$

where $\beta$ is the temperature parameter and governs the exploration-exploitation trade-off. This is also known as the softmax distribution.

Put into code, this would be something like:

beta = 1.0
q = q_values[0]  # Q-values for the current state, shape (action_size,)
p_a_s = np.exp(beta * q) / np.sum(np.exp(beta * q))
action_key = np.random.choice(a=action_size, p=p_a_s)

This can lead to numerical instabilities because of the exponential, but that can be handled e.g. by first subtracting the highest q value:

q = q - np.max(q)
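
Putting both pieces together, a minimal sketch of a numerically stable Boltzmann action selector could look like this (the names q_values, action_size and beta are taken from the snippets above, and the helper function boltzmann_action is only an illustration):

import numpy as np

def boltzmann_action(q_values, beta=1.0):
    """Sample an action index from a softmax (Boltzmann) distribution over Q-values."""
    q = np.asarray(q_values).flatten()   # shape (action_size,)
    q = q - np.max(q)                    # subtract the max for numerical stability
    exp_q = np.exp(beta * q)
    p_a_s = exp_q / np.sum(exp_q)        # probabilities over actions
    return np.random.choice(len(q), p=p_a_s)

# Example usage with the prediction from the question:
# q_values = model.predict(state_encod)
# action_key = boltzmann_action(q_values, beta=1.0)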

  • The action taken by the agent may actually be the optimal action for every state it encounters.
  • If the same state is fed in every time, you will get the same Q-values and the same reward. The state may not be getting updated properly; since next_state comes back from the environment, check the deepcopy call and the values returned by env.step.
  • The model might not be updating its parameters, so its Q-values never change. Check how the model updates its parameters and Q-values during training (a quick sanity check is sketched below).
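
As a quick sanity check (a sketch only, reusing model, env and state_size from the question), you can feed a few different states through the trained network and compare the outputs. If the predicted Q-values are identical for clearly different states, the network output does not depend on its input, which usually points to a training or state-encoding problem:

import numpy as np

# Hypothetical sanity check: compare Q-values for several freshly reset states.
states = [env.reset() for _ in range(5)]
predictions = [model.predict(np.reshape(s, [1, state_size]))[0] for s in states]

for s, q in zip(states, predictions):
    print('State:', s, '-> Q-values:', q, '-> argmax:', np.argmax(q))

# If every Q-value vector is (nearly) identical, the network has collapsed
# to a constant function of its input.
print('Max difference between Q-value vectors:',
      np.max([np.max(np.abs(predictions[0] - q)) for q in predictions[1:]]))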
