Deep Q-learning

I am working on the DDQN algorithm which is given in the following paper.

I am facing a problem with the Q value.

The authors calculate the Q value as $Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + A(s, a; \theta, \alpha)$. The Q value is divided into two parts: the state–action value and the action-advantage value. The action-advantage value is independent of state and environment noise; it is the value of each action in a state relative to the other, unselected actions.
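For context, here is a rough sketch of how I currently picture this decomposition in code (a toy PyTorch-style module I wrote to check my understanding; the layer sizes and names are my own, not from the paper):

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Toy sketch of the decomposition Q = V + A (my own example, not from the paper)."""
    def __init__(self, feature_dim=64, n_actions=4):
        super().__init__()
        self.value = nn.Linear(feature_dim, 1)               # V(s; theta, beta)
        self.advantage = nn.Linear(feature_dim, n_actions)   # A(s, a; theta, alpha)

    def forward(self, features):
        v = self.value(features)        # shape: (batch, 1)
        a = self.advantage(features)    # shape: (batch, n_actions)
        return v + a                    # Q(s, a) for every action

# Example: a batch of 2 made-up feature vectors
q_values = DuelingHead()(torch.randn(2, 64))   # shape: (2, 4)
```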

Can anyone help me understand the state–action value and the action-advantage value?

Any help will be appreciated.

Topic q-learning reinforcement-learning machine-learning

Category Data Science


It's better to start with understanding what the state-value and action-value functions are, and then move on to the advantage. The explanation below is based on Reinforcement Learning: An Introduction by Sutton and Barto.

As you take more and more steps in the environment, you collect more and more rewards, and you can denote the discounted sum of future rewards by:

$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T $$

The $R_i$ are random variables representing the future rewards, $\gamma \in [0, 1]$ is the discount factor, and $T$ is the final timestep of the episode. The quantity $G_t$ is called the return.
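If it helps, here is a tiny plain-Python sketch of this sum (the reward values and $\gamma$ below are arbitrary, just for illustration):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Arbitrary example: three future rewards with gamma = 0.9
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))   # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```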

The state-value function is the expected return given that you start from a specific state and you follow the policy $\pi$ afterwards.

$$ V_\pi(s_t) = \mathbb{E}_\pi [G_t|S_t = s_t] $$

The action-value function is the expected return given that you start from a specific state and take a specific action, and follow the policy $\pi$ afterwards.

$$ Q_\pi(s_t, a_t) = \mathbb{E}_\pi [G_t|S_t = s_t, A_t = a_t] $$
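To make the two definitions concrete, here is a small Monte Carlo sketch on a made-up chain environment (the environment, the random policy, and all the numbers are invented for illustration): $V_\pi(s)$ averages returns when the policy chooses every action, while $Q_\pi(s, a)$ fixes the first action and only then follows the policy.

```python
import random

GAMMA = 0.9

# Made-up chain MDP: states 0..4; an episode ends at state 0 (no reward) or state 4 (reward +1).
# Actions: 0 = step left, 1 = step right. The policy pi picks left/right uniformly at random.
def step(state, action):
    next_state = state - 1 if action == 0 else state + 1
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward, next_state in (0, 4)

def policy(state):
    return random.choice([0, 1])

def rollout(state, first_action=None):
    """Return G_t for one episode; optionally force the first action (for Q), then follow pi."""
    g, discount, done = 0.0, 1.0, False
    action = first_action if first_action is not None else policy(state)
    while not done:
        state, reward, done = step(state, action)
        g += discount * reward
        discount *= GAMMA
        if not done:
            action = policy(state)   # after the first step we always follow pi
    return g

n = 20000
v_est = sum(rollout(2) for _ in range(n)) / n                    # estimates V_pi(2)
q_est = sum(rollout(2, first_action=1) for _ in range(n)) / n    # estimates Q_pi(2, right)
print(f"V_pi(2) ~ {v_est:.3f}   Q_pi(2, right) ~ {q_est:.3f}")
```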

In simple words, if you have a policy (a way of acting), then the state-value function, $V_\pi$, tells you what return you can expect to get from any state by following that policy. The action-value function, $Q_\pi$, tells you something very similar, but instead of following the policy for the first action, you can take an action that your policy wouldn't choose. The reason you would want to do that is that your policy is usually not the optimal policy, so you want to understand how your expected return changes if you take small deviations from it. However, after that first action, even in the action-value function, you follow your policy.

The difference between the state-value function and the action-value function is the advantage:

$$ A_\pi(s_t, a_t) = Q_\pi(s_t, a_t) - V_\pi(s_t) $$
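Numerically, if you know $Q_\pi(s_t, a)$ for every action in one state and the policy's action probabilities there, then $V_\pi(s_t) = \sum_a \pi(a \mid s_t) Q_\pi(s_t, a)$ and the advantage is just the difference. A quick sketch with made-up numbers:

```python
import numpy as np

# Made-up Q-values for one state and a stochastic policy over 3 actions
q = np.array([1.0, 2.0, 0.5])     # Q_pi(s_t, a) for a = 0, 1, 2
pi = np.array([0.2, 0.5, 0.3])    # pi(a | s_t)

v = pi @ q                        # V_pi(s_t) = sum_a pi(a|s_t) * Q_pi(s_t, a) = 1.35
advantage = q - v                 # A_pi(s_t, a) = Q_pi(s_t, a) - V_pi(s_t)
print(advantage)                  # approximately [-0.35  0.65 -0.85]
```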

If you understand the definitions above, then it's easy to see what the advantage function represents. Let's keep the state fixed for now, so $s_t$ is constant and $V_\pi(s_t)$ is what you can expect to get by following the policy. The first term, $Q_\pi(s_t, a_t)$, allows a small deviation from the policy through the choice of $a_t$. Let's consider three scenarios (a small numeric illustration follows the list):

  1. The deviation is an action your policy would have chosen anyway (i.e. you don't actually deviate): the action-value and state-value functions are equal, and your advantage is zero.
  2. You take an action that is not part of your policy and end up with a smaller return: the action-value function reflects that and is smaller than the state-value function, so your advantage is negative.
  3. You take an action that is not part of your policy and end up with a larger return: the action-value function is larger than the state-value function, so your advantage is positive.
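A tiny numeric illustration of these three cases, assuming a deterministic policy that always picks action 1 in $s_t$ (so $V_\pi(s_t) = Q_\pi(s_t, 1)$; the Q-values are invented):

```python
q = {0: 0.5, 1: 1.0, 2: 1.5}   # invented Q_pi(s_t, a) for three actions
policy_action = 1              # the deterministic policy always picks action 1 in s_t
v = q[policy_action]           # so V_pi(s_t) = Q_pi(s_t, 1) = 1.0

for a, q_sa in q.items():
    print(a, q_sa - v)   # advantage: -0.5 for a=0 (case 2), 0.0 for a=1 (case 1), 0.5 for a=2 (case 3)
```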

The advantage function is useful because it segments your actions: those with a positive advantage can be used to improve your policy.
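For example, one simple way to use this (with made-up Q-values): wherever some action has a positive advantage, switch the policy to the action with the highest advantage in that state.

```python
import numpy as np

# Invented Q_pi for 2 states x 3 actions, and the current (deterministic) policy
q = np.array([[1.0, 0.5, 0.2],
              [0.1, 0.4, 0.9]])
policy = np.array([0, 1])                # action currently chosen in each state

v = q[np.arange(2), policy]              # V_pi(s) under this deterministic policy
advantage = q - v[:, None]               # A_pi(s, a)

# Improvement: in any state where some action has a positive advantage, switch to the best one
improved_policy = np.where(advantage.max(axis=1) > 0,
                           advantage.argmax(axis=1),
                           policy)
print(improved_policy)   # [0 2]: state 0 keeps action 0, state 1 switches to action 2
```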
