When should the last action be included in the state in reinforcement learning?

Question

user91315

April 10, 2022, 06:06

I am confused about whether the last action should be included as part of the state input to an agent in a reinforcement learning setting (i.e., as a state-action pair). From what I have observed, the answer is not clear-cut: different agent/environment combinations seem to perform differently depending on whether the action is included in or excluded from the input state (though I might be wrong).

For my specific problem:

the agent can't influence/control the states through its actions (similar to the case of a simple multi-armed bandit)
the action space is discrete
I am using a DQN-based approach

I would also appreciate a general overview, or some rules of thumb, for when to include or exclude actions as state inputs.

P.S. By "different agent/environment combinations" at the beginning, I mean either using different agents to solve the same environment or using the same agent to solve different environments.

Topic dqn reinforcement-learning deep-learning machine-learning

Category Data Science

Johannes Ackermann · Accepted Answer · March 11, 2020, 18:05

If your environment satisfies the Markov property, there is no reason to include the action $a$ in the state $s$: the action $a_t$ that led to the new state $s_{t+1}$ should not provide any additional information. In other words, knowing the previous action should not change the reward and transition functions from $s_{t+1}$ onwards.
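Stated formally, using the convention that taking action $a_{t+1}$ in state $s_{t+1}$ produces the reward $r_{t+2}$ and the next state $s_{t+2}$, the Markov property says that additionally conditioning on the previous action $a_t$ changes nothing:

$$
p\left(s_{t+2}, r_{t+2} \mid s_{t+1}, a_{t+1}, a_t\right) = p\left(s_{t+2}, r_{t+2} \mid s_{t+1}, a_{t+1}\right)
$$

Appending $a_t$ to the state therefore enlarges the network's input without giving it anything the dynamics depend on.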

Therefore, if your environment is a proper MDP, there is no reason to include actions in the next state.

Including the last action in your state might help in some rare cases where your environment is not actually Markov, but in that case you should also consider whether there is a better way to represent your state so that it becomes Markov.
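A minimal sketch of that rare case, under illustrative assumptions: `SwitchEnv` is a hypothetical toy environment whose raw observation is constant but whose reward depends on the previous action, and `LastActionAugmenter` is a hypothetical helper (not part of any library) that appends a one-hot encoding of the last action to each observation. In this toy, the last action is exactly the missing information, so appending it makes the state Markov.

```python
from typing import List, Optional, Sequence, Tuple


def one_hot(index: Optional[int], n: int) -> List[float]:
    """One-hot encode an action index; None (no previous action yet) gives all zeros."""
    vec = [0.0] * n
    if index is not None:
        vec[index] = 1.0
    return vec


class SwitchEnv:
    """Hypothetical toy environment that is NOT Markov in its raw observation:
    the observation is always [0.0], but the reward is 1 only when the action
    differs from the previous action."""

    def __init__(self) -> None:
        self.prev: Optional[int] = None

    def reset(self) -> List[float]:
        self.prev = None
        return [0.0]

    def step(self, action: int) -> Tuple[List[float], float]:
        reward = 1.0 if self.prev is not None and action != self.prev else 0.0
        self.prev = action
        return [0.0], reward


class LastActionAugmenter:
    """Hypothetical helper: appends a one-hot encoding of the last action to each observation."""

    def __init__(self, n_actions: int) -> None:
        self.n_actions = n_actions

    def reset(self, obs: Sequence[float]) -> List[float]:
        # No action has been taken yet, so the action part is all zeros.
        return list(obs) + one_hot(None, self.n_actions)

    def step(self, obs: Sequence[float], action: int) -> List[float]:
        # `obs` is the observation returned after taking `action`.
        return list(obs) + one_hot(action, self.n_actions)


def run(steps: int, use_last_action: bool) -> float:
    """Roll out a fixed deterministic policy and return the total reward."""
    env = SwitchEnv()
    aug = LastActionAugmenter(n_actions=2)
    state = aug.reset(env.reset())
    total = 0.0
    for _ in range(steps):
        if use_last_action:
            # The augmented state [obs, onehot_0, onehot_1] reveals the last
            # action, so the policy can simply pick the other one.
            action = 1 if state[1] == 1.0 else 0
        else:
            # A deterministic policy that sees only the raw observation [0.0]
            # must always return the same action, so it never earns a reward.
            action = 0
        obs, reward = env.step(action)
        total += reward
        state = aug.step(obs, action)
    return total


if __name__ == "__main__":
    print("with last action:   ", run(10, use_last_action=True))
    print("without last action:", run(10, use_last_action=False))
```

With a real environment library, the same idea would typically live in an observation wrapper. Whether it helps in practice depends on whether the last action really carries the information the raw observation is missing.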

In your concrete case, where the actions do not change the state, there is no reason to include the last action as part of your state.

Side note: If your problem is a simple discrete bandit, you will probably be better off looking into bandit approaches rather than a DQN-based approach.
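For reference, here is a minimal sketch of one such bandit approach, the standard UCB1 rule, run on a hypothetical Bernoulli bandit with fixed arm probabilities. It is an illustration under that assumption, not a drop-in replacement for a specific setup: if your states are discrete and unaffected by actions, one simple option is to keep a separate UCB1 instance per state, which turns this into a basic contextual bandit.

```python
import math
import random
from typing import List, Sequence


class UCB1:
    """UCB1 for a discrete, non-contextual bandit."""

    def __init__(self, n_arms: int) -> None:
        self.counts = [0] * n_arms    # number of pulls per arm
        self.values = [0.0] * n_arms  # running mean reward per arm
        self.t = 0                    # total number of selections so far

    def select(self) -> int:
        self.t += 1
        # Pull every arm once before relying on the confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        # Mean reward plus an exploration bonus that shrinks as an arm is pulled more.
        ucb = [
            value + math.sqrt(2.0 * math.log(self.t) / count)
            for value, count in zip(self.values, self.counts)
        ]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        # Incremental update of the sample mean for the pulled arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def simulate(probs: Sequence[float], steps: int, seed: int = 0) -> List[int]:
    """Run UCB1 on a hypothetical Bernoulli bandit and return the pull counts per arm."""
    rng = random.Random(seed)
    agent = UCB1(len(probs))
    for _ in range(steps):
        arm = agent.select()
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        agent.update(arm, reward)
    return agent.counts


if __name__ == "__main__":
    print(simulate([0.2, 0.5, 0.8], steps=2000))
```

Unlike a DQN, this needs no replay buffer, target network, or discount factor, because each action's reward is evaluated on its own rather than through future states.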
