Does convergence equal learning in Deep Q-learning?

In my current research project I'm using the Deep Q-learning algorithm. The setup is as follows: I'm training the model (using Deep Q-learning) on a static dataset made up of experiences extracted from N levels of a given game. Then, I want to use the trained model to solve M new levels of the same game, i.e., I want to test the generalization ability of the agent on new levels of the same game.

Currently, I have managed to find a complex (deep) CNN architecture that is able to converge: after training it for a large number of iterations (by the way, I'm using prioritized experience replay), the training error (the squared difference between the Q-values and the Q-targets) is very low.
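For concreteness, by "training error" I mean the standard one-step DQN regression loss. The sketch below (PyTorch, with placeholder names `q_net` and `target_net`, a simplified batch layout, and the prioritized-replay importance weights left out) is only meant to illustrate it, not my exact code:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared error between Q-values and Q-targets for one mini-batch.

    `q_net`, `target_net` and the batch layout are placeholder assumptions;
    the importance-sampling weights from prioritized replay are omitted.
    """
    states, actions, rewards, next_states, dones = batch

    # Q-values of the actions that were actually taken (actions: int64 tensor).
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Q-targets: one-step bootstrap from a frozen target network.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        q_targets = rewards + gamma * (1.0 - dones) * next_q

    # The "training error" referred to above.
    return F.mse_loss(q_values, q_targets)
```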

Since I want the agent to be able to generalize to new, unseen levels, I thought about finding the simplest possible CNN architecture that is still able to converge on the training levels, since a simpler model should generalize better (be less prone to overfitting on the training levels).

In supervised learning, my reasoning would be correct. However, I don't know if it holds true in Reinforcement Learning. Given a model which is able to converge (minimize error between Q-targets and Q-values), is that model always learning to solve the levels optimally, i.e., does the model find the optimal policy which maximizes reward? In other words, is it possible for a Deep Q-learning agent to find a non-optimal policy which converges to a very small error between Q-targets and Q-values?

From what I have read, I think that as long as Deep Q-learning converges, the policy found is always the optimal one. Please, correct me if I'm wrong.

Tags: generalization, convergence, q-learning, reinforcement-learning, deep-learning



Policies found by Deep Q-Learning, even after convergence, are not guaranteed to be optimal. The reason is that the neural networks that approximate the Q function in DQN inherently come with a statistical error (bias and variance); a pointer can be found here.
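To make the gap concrete, here is the distinction in standard DQN notation (my notation, not part of the original answer). The network is trained to minimize an empirical regression loss over the replay distribution $\mathcal{D}$,

$$
L(\theta) \;=\; \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\Big[\big(r + \gamma \max_{a'} Q_{\theta^-}(s',a') - Q_\theta(s,a)\big)^2\Big],
$$

whereas optimality requires the Bellman optimality equation to hold at every state-action pair,

$$
Q^*(s,a) \;=\; \mathbb{E}\big[\, r + \gamma \max_{a'} Q^*(s',a') \;\big|\; s,a \,\big].
$$

A small value of $L(\theta)$ only means the first condition approximately holds on the state-action pairs that are well represented in $\mathcal{D}$, and only relative to the bootstrapped target $Q_{\theta^-}$ rather than $Q^*$, so the greedy policy with respect to $Q_\theta$ can still be far from optimal.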

Furthermore, even for tabular Q-learning, convergence to the optimal policy is only guaranteed when every action is sampled infinitely often in every state (and the learning rates are decayed appropriately). This can make obtaining a truly 'converged and optimal' policy in experiments very hard, even before considering the additional complexities introduced by function approximation in DQN.
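For reference, the classical tabular convergence result (Watkins & Dayan, 1992) states that the update

$$
Q(s_t,a_t) \;\leftarrow\; Q(s_t,a_t) + \alpha_t \big[\, r_t + \gamma \max_{a} Q(s_{t+1},a) - Q(s_t,a_t) \,\big]
$$

converges to $Q^*$ with probability 1 provided every state-action pair is visited infinitely often and the step sizes satisfy $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$ -- conditions that a finite static dataset cannot satisfy exactly.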

Please be aware, additionally, that generalization in supervised learning requires that the training and test data are sampled from the same distribution. Similarly for RL, the training and test environments are assumed to be sampled from the same 'distribution'. Generalization to environments not reflected in the training set is referred to as transfer learning in the RL literature.

This is an important and very interesting line of research, though, so please do not be discouraged by this answer.
