Does convergence equal learning in Deep Q-learning?
In my current research project, I'm using the Deep Q-learning algorithm. The setup is as follows: I'm training the model (using Deep Q-learning) on a static dataset made up of experiences extracted from N levels of a given game. Then, I want to use the trained model to solve M new levels of the same game, i.e., I want to test the agent's ability to generalize to unseen levels.
Currently, I have managed to find a complex (deep) CNN architecture that converges. This means that after training it for a large number of iterations (using prioritized experience replay), the training error (the squared difference between the Q-values and the Q-targets) is very low.
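For reference, this is roughly the loss I'm minimizing. It's a simplified sketch in PyTorch, not my actual code: the network, batch layout, and hyperparameters are placeholders.

```python
import torch

def td_loss(q_net, target_net, batch, gamma=0.99):
    """Squared TD error between Q-values and Q-targets, weighted by the
    importance-sampling weights from prioritized experience replay."""
    states, actions, rewards, next_states, dones, is_weights = batch

    # Q-values of the actions that were actually taken in the dataset
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Q-targets: one-step bootstrap from a frozen target network
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        q_targets = rewards + gamma * (1.0 - dones) * next_q

    td_errors = q_values - q_targets
    loss = (is_weights * td_errors.pow(2)).mean()
    return loss, td_errors.detach().abs()  # abs TD errors feed back into the priorities
```

This loss (averaged over the replay buffer) is what I mean by "training error" below.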
Since I want the agent to generalize to new, unseen levels, my idea was to find the simplest CNN architecture that still converges on the training levels, on the assumption that a simpler model would generalize better (be less prone to overfitting the training levels).
In supervised learning, this reasoning would be sound. However, I don't know whether it carries over to reinforcement learning. Given a model that converges (i.e., minimizes the error between the Q-values and the Q-targets), has that model necessarily learned to solve the levels optimally, i.e., has it found the optimal policy that maximizes the expected return? In other words, can a Deep Q-learning agent reach a very small error between Q-values and Q-targets while its resulting policy is still non-optimal?
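To make the question concrete (the notation is mine): the quantity I see going to zero is the empirical Bellman error over my fixed dataset $\mathcal{D}$,

$$
L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a) \right)^2 \right],
$$

whereas, as I understand it, an optimal policy requires the Bellman optimality equation

$$
Q^*(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \right]
$$

to hold for all state-action pairs the agent can encounter, not only those covered by $\mathcal{D}$. I'm unsure whether making $L(\theta)$ very small is enough to conclude anything about the second condition.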
From what I have read, my impression is that as long as Deep Q-learning converges, the resulting policy is always the optimal one. Please correct me if I'm wrong.