Guidelines to debug REINFORCE-type algorithms?

I implemented a self-critical policy gradient (as described here) for text summarization.
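In case it helps to pin down the objective: the self-critical baseline boils down to a small loss term, roughly as sketched below in PyTorch (the tensor names `log_probs`, `sample_reward` and `greedy_reward` are my own, not from the paper):

    import torch

    def self_critical_loss(log_probs, sample_reward, greedy_reward):
        # REINFORCE with a greedy baseline ("self-critical"):
        #   log_probs:     (batch, seq_len) log-probabilities of the *sampled* tokens
        #   sample_reward: (batch,) reward (e.g. ROUGE) of the sampled summaries
        #   greedy_reward: (batch,) reward of the greedy-decoded summaries (the baseline)
        advantage = (sample_reward - greedy_reward).detach()  # no gradient through rewards
        # Maximising expected reward == minimising the negative advantage-weighted log-likelihood
        return -(advantage.unsqueeze(1) * log_probs).sum(dim=1).mean()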

However, after training, the results are not as high as expected (actually lower than without RL...).

I'm looking for general guidelines on how to debug RL-based algorithms.


I tried:

  • Overfitting on a small dataset (~6 samples): I could increase the average reward, but it does not converge; sometimes the average reward goes back down (see the smoke-test sketch after this list).
  • Changing the learning rate: I varied the learning rate and watched its effect on the small dataset. Based on these experiments I chose a fairly large learning rate (0.02 vs. 1e-4 in the paper).
  • Watching how the average reward evolves as training (on the full dataset) progresses: the average reward barely moves at all...
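For the overfitting check, something along these lines is what I mean by a smoke test. This is only a sketch: `model.sample_and_greedy` and `compute_reward` are hypothetical stand-ins for whatever decoding and ROUGE code you already have.

    import torch

    def overfit_smoke_test(model, tiny_batch, compute_reward, steps=500, lr=1e-4):
        # Sanity check on ~6 fixed samples: the average reward should climb
        # steadily and stay up; oscillation or collapse points at a bad gradient signal.
        params = [p for p in model.parameters() if p.requires_grad]
        optim = torch.optim.Adam(params, lr=lr)
        for step in range(steps):
            # Hypothetical helper: returns sampled-token log-probs plus sampled and greedy outputs
            log_probs, sampled, greedy = model.sample_and_greedy(tiny_batch)
            r_sample = compute_reward(sampled, tiny_batch)   # e.g. ROUGE vs. references
            r_greedy = compute_reward(greedy, tiny_batch)    # greedy baseline
            loss = -((r_sample - r_greedy).detach().unsqueeze(1) * log_probs).sum(1).mean()
            optim.zero_grad()
            loss.backward()
            optim.step()
            if step % 50 == 0:
                print(f"step {step:4d}  avg sampled reward {r_sample.mean().item():.4f}")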



The only resource I could find so far:

https://github.com/williamFalcon/DeepRLHacks


In my specific case, I made a few errors:

  • I had frozen parts of the network that should not have been frozen (see the parameter audit after this list).
  • I used the wrong learning rate.
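The frozen-parameter mistake is easy to catch with a quick audit of the `requires_grad` flags, and of which gradients are actually populated after a backward pass. This is a generic PyTorch check, not tied to any particular summarization model:

    def audit_trainable_params(model, loss=None):
        # Print which parameters are frozen and, if a loss is supplied, which
        # trainable parameters received no gradient from that loss.
        if loss is not None:
            loss.backward(retain_graph=True)
        for name, p in model.named_parameters():
            if not p.requires_grad:
                print(f"FROZEN       {name}")
            elif p.grad is None or p.grad.abs().sum().item() == 0:
                print(f"NO GRADIENT  {name}")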

Even though I could overfit a small dataset, that did not mean much: when training on the whole dataset, the average reward still was not going up.

The main signal to look for is an average reward that trends upward during training.
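Since per-batch rewards are noisy, a running average makes that trend much easier to read; something as simple as an exponential moving average will do (the 0.99 decay below is an arbitrary choice):

    class RewardTracker:
        # Exponential moving average of the batch reward, for spotting the trend.
        def __init__(self, decay=0.99):
            self.decay, self.avg = decay, None

        def update(self, batch_reward):
            r = float(batch_reward)
            self.avg = r if self.avg is None else self.decay * self.avg + (1 - self.decay) * r
            return self.avg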


I'm not accepting this answer, as I believe it is incomplete: it lacks general and systematic guidelines for debugging a Reinforcement Learning algorithm.
