In DQN, why not use the target network to predict current-state Q-values?
In DQN, why is the target network used only to predict next-state Q-values, and not the current-state Q-values as well? I'm implementing a basic deep Q-learning algorithm with a neural network from scratch, with replay memory and minibatch gradient descent. I use the target network to predict both the current-state and next-state Q-values for every sample in the minibatch, and at the end of the minibatch I sync the target network. However, I notice that the weights diverge very easily. Is that because I used the target network to predict the current-state Q-values? If so, why is that a problem?
The agent code:
# method of the agent class (see source link below); assumes numpy is imported as np
def start_episode_and_evaluate(self, discount_factor, learning_rate,
                               epsilon, epsilon_decay=0.99, min_epsilon=0.01):
    state = self.env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection using the main (online) network
        if np.random.uniform(0, 1) < epsilon:
            action = self.env.action_space.sample()
        else:
            action = np.argmax(self._nn.predict(state))
        next_state, reward, done, _ = self.env.step(action)
        self._replay_memory.put(state, action, reward, done, next_state)
        if len(self._replay_memory) >= self._batch_size:
            for state_exp, action_exp, reward_exp, done_exp, next_state_exp in self._replay_memory.get(batch_size=self._batch_size):
                z, a = self._target_nn.forward_propagate(state_exp)  # HERE DO I HAVE TO USE THE TARGET NETWORK (_target_nn) OR THE MAIN NETWORK (_nn)?
                q_values_target = np.copy(a[-1])
                if done_exp:
                    q_values_target[action_exp] = reward_exp
                else:
                    q_values_target[action_exp] = reward_exp + discount_factor * np.max(self._target_nn.predict(next_state_exp))
                self._nn.backpropagate(z, a, q_values_target, learning_rate)
            self._sync_target_nn_weights()
            epsilon *= epsilon_decay
            if epsilon < min_epsilon:
                epsilon = min_epsilon
        state = next_state
source: https://github.com/LorenzoTinfena/deep-q-learning-itt-final-project/blob/main/src/core/dqn_agent.py
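For comparison, here is a minimal sketch of what I understand to be the conventional arrangement: the main network (_nn) produces the current-state Q-values that get trained, and the target network (_target_nn) is used only for the next-state bootstrap term, with the sync happening only every few updates instead of after every minibatch. It reuses the same NN interface as my code above; _updates_done and _sync_every are placeholder attributes I added just for illustration, not part of my class.

    # inside the minibatch loop, for one sampled transition:
    z, a = self._nn.forward_propagate(state_exp)  # main network gives the current-state Q-values being trained
    q_values_target = np.copy(a[-1])
    if done_exp:
        q_values_target[action_exp] = reward_exp
    else:
        # target network is used only for the next-state bootstrap term
        q_values_target[action_exp] = reward_exp + discount_factor * np.max(self._target_nn.predict(next_state_exp))
    self._nn.backpropagate(z, a, q_values_target, learning_rate)

    # after the minibatch loop: sync the target network only every _sync_every updates
    # (_updates_done and _sync_every are placeholder names for illustration)
    self._updates_done += 1
    if self._updates_done % self._sync_every == 0:
        self._sync_target_nn_weights()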
Topic: dqn, q-learning, reinforcement-learning, deep-learning, machine-learning
Category: Data Science