In DQN, why not use the target network to predict current-state Q-values?

In DQN, why not use the target network to predict the current state's Q-values as well, and not only the next state's Q-values? I'm writing a basic deep Q-learning algorithm with a neural network from scratch, with replay memory and minibatch gradient descent, and I'm using the target network to predict both the current-state and next-state Q-values for every minibatch sample; at the end of each minibatch I sync the target network. But I notice the weights diverge very easily. Could that be because I used the target network to predict the current-state Q-values? If so, why is that a problem?

The agent code:

def start_episode_and_evaluate(self, discount_factor, learning_rate,
                               epsilon, epsilon_decay=0.99, min_epsilon=0.01):
    state = self.env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection with the main network
        if np.random.uniform(0, 1) < epsilon: action = self.env.action_space.sample()
        else: action = np.argmax(self._nn.predict(state))
        next_state, reward, done, _ = self.env.step(action)
        self._replay_memory.put(state, action, reward, done, next_state)
        if len(self._replay_memory) >= self._batch_size:
            for state_exp, action_exp, reward_exp, done_exp, next_state_exp in self._replay_memory.get(batch_size=self._batch_size):
                z, a = self._target_nn.forward_propagate(state_exp) # HERE DO I HAVE TO USE THE TARGET NETWORK (_target_nn) OR THE MAIN NETWORK (_nn)?
                q_values_target = np.copy(a[-1])
                # Build the TD target from the sampled transition's reward and done flag
                if done_exp: q_values_target[action_exp] = reward_exp
                else: q_values_target[action_exp] = reward_exp + discount_factor * np.max(self._target_nn.predict(next_state_exp))
                self._nn.backpropagate(z, a, q_values_target, learning_rate)
            self._sync_target_nn_weights()
            epsilon *= epsilon_decay
            if epsilon < min_epsilon:
                epsilon = min_epsilon
        state = next_state

source: https://github.com/LorenzoTinfena/deep-q-learning-itt-final-project/blob/main/src/core/dqn_agent.py

Topics: dqn, q-learning, reinforcement-learning, deep-learning, machine-learning

Category: Data Science


The Bellman equation for deterministic environments is given as follows:
$$ V(s) = \max_a \left[ R(s, a) + \gamma V(s') \right] $$
where $V$ is the value function, $R$ is the reward function, $s$ is the current state, $s'$ is the next state, and $a$ is an action. In DQN, when optimizing $V$, it is assumed that $V(s')$ has on average a lower error than $V(s)$. Intuitively, this is because $s'$ is closer to the end state, for which $V$ is known, so there are fewer steps in which error can accumulate. This way we can progressively lower the error of $V$ for all states, starting from the end states and working back to the starting state.

I don't know exactly how you use the current state to update the weights, but assuming the update looks like this:
$$ \delta V(s) := \alpha \left( \max_a \left[ R(s, a) + \gamma V(s) \right] - V(s) \right) $$
instead of
$$ \delta V(s) := \alpha \left( \max_a \left[ R(s, a) + \gamma V(s') \right] - V(s) \right) $$
i.e. you don't use $V(s')$ in the weight update, then the same error propagates over and over again, because information about rewards in the later states is never propagated back to the earlier states.
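To see how the update with $V(s')$ is usually wired up in DQN, here is a minimal, self-contained sketch in plain NumPy (a tabular weight matrix stands in for the neural network; the names and constants are mine, not from the question's code). The online network produces the current-state prediction that the gradient flows through, and the target network is queried only for the bootstrapped next-state value $\max_{a'} Q_{\text{target}}(s', a')$:

import numpy as np

# Sketch of the standard DQN target computation: the online network predicts
# Q(s, .) (the prediction the gradient flows through), and the target network
# is used only for the bootstrapped next-state value max_a' Q_target(s', a').
n_states, n_actions, gamma, lr = 4, 2, 0.99, 0.1

# A tiny "network": Q(s, a) = W[s, a] (tabular weights standing in for an MLP).
W_online = np.zeros((n_states, n_actions))
W_target = W_online.copy()

def dqn_update(s, a, r, done, s_next):
    # One gradient step on the squared TD error for a single transition.
    q_pred = W_online[s, a]                                 # online net: current-state prediction
    bootstrap = 0.0 if done else np.max(W_target[s_next])   # target net: next state only
    td_target = r + gamma * bootstrap
    W_online[s, a] += lr * (td_target - q_pred)             # move the prediction toward the target

dqn_update(s=0, a=1, r=1.0, done=False, s_next=2)           # example transition
W_target = W_online.copy()                                  # periodic target-network sync

In terms of the question's code, the usual structure would therefore be to call self._nn.forward_propagate(state_exp) for the current state, and to keep self._target_nn only inside the np.max(self._target_nn.predict(next_state_exp)) term.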

As a concrete example, imagine an environment with 3 states $s_1$, $s_2$ and $s_3$, where $s_1$ is the starting state, $s_3$ is the end state, the action space contains only one move (go from state $s_i$ to state $s_{i+1}$), and the reward is 1 for moving to $s_3$ and 0 otherwise. What happens then is
$$ \delta V(s_3) = 0 \\ \delta V(s_2) = \alpha \left( \max_a \left[ R(s_2, a) + \gamma V(s_2) \right] - V(s_2) \right) = \alpha \left( 1 + \gamma V(s_2) - V(s_2) \right) \\ \delta V(s_1) = \alpha \left( \max_a \left[ R(s_1, a) + \gamma V(s_1) \right] - V(s_1) \right) = \alpha \left( 0 + \gamma V(s_1) - V(s_1) \right) $$
Because the initial value of $V(s_1)$ is 0, its update will always be 0, and the network will never learn the value of this state. That's why it must look at the value of its next state.
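The same chain can be checked numerically. Below is a small tabular sketch (the constants $\gamma = 0.9$, $\alpha = 0.5$, the number of sweeps and the variable names are assumptions of mine) that runs both update rules on the 3-state example: with the correct rule, the reward for reaching $s_3$ propagates back to $s_1$, while the rule that bootstraps from $V(s)$ itself leaves $V(s_1)$ at 0 and drives $V(s_2)$ toward $1/(1-\gamma)$ instead of the true value 1.

import numpy as np

# Tabular version of the 3-state example: s1 -> s2 -> s3 (terminal), reward 1
# only for reaching s3. V_correct bootstraps from the next state's value,
# V_broken bootstraps from the current state's own value.
gamma, alpha, n_sweeps = 0.9, 0.5, 200
R = {1: 0.0, 2: 1.0}           # reward of the single action "move from s_i to s_{i+1}"
V_correct = np.zeros(4)        # indices 1..3 used; V(s3) stays 0 (terminal)
V_broken = np.zeros(4)

for _ in range(n_sweeps):
    for s in (1, 2):
        s_next = s + 1
        V_correct[s] += alpha * (R[s] + gamma * V_correct[s_next] - V_correct[s])
        V_broken[s] += alpha * (R[s] + gamma * V_broken[s] - V_broken[s])

print(V_correct[1:4])   # ~[0.9, 1.0, 0.0]: the reward information reached s1
print(V_broken[1:4])    # ~[0.0, 10.0, 0.0]: V(s1) never learns; V(s2) drifts to 1/(1-gamma)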
