Guidelines for debugging REINFORCE-type algorithms?
I implemented a self-critical policy gradient (as described here) for text summarization.
However, after training, the results are not as good as expected (actually lower than without RL...).
I'm looking for general guidelines on how to debug RL-based algorithms.
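For context, the self-critical loss I'm optimizing is essentially REINFORCE with the reward of the greedy-decoded summary as the baseline. A minimal sketch of that loss (tensor names and shapes are illustrative, and the reward function, e.g. ROUGE, is left abstract here):

```python
import torch

def self_critical_loss(sample_log_probs, sample_reward, greedy_reward):
    """Self-critical policy-gradient loss.

    sample_log_probs: (batch, seq_len) log-probs of the sampled tokens
    sample_reward:    (batch,) reward of the sampled summaries
    greedy_reward:    (batch,) reward of the greedy-decoded summaries (baseline)
    """
    # Advantage = sampled reward minus greedy baseline; detach so no gradient
    # flows back through the reward computation.
    advantage = (sample_reward - greedy_reward).detach()
    # Sum the token log-probs over the sequence and weight by the advantage.
    seq_log_prob = sample_log_probs.sum(dim=1)
    return -(advantage * seq_log_prob).mean()
```

One thing I'm double-checking is that the sign of the advantage and the log-probs (sampled vs. greedy sequence) are not mixed up, since either mistake would quietly push the reward down.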
I tried:
- Overfitting on a small dataset (~6 samples): I could increase the average reward, but it does not converge; sometimes the average reward goes down again (see the loop sketch after this list).
- Changing the learning rate: I varied the learning rate and observed its effect on the small dataset. Based on those experiments I chose a fairly large learning rate (0.02 vs. 1e-4 in the paper).
- Watching how the average reward evolves as training on the full dataset progresses: the average reward barely moves at all...
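For the small-dataset overfitting check and the reward tracking mentioned above, my loop looks roughly like this (simplified sketch reusing `self_critical_loss` and the `torch` import from above; `model.sample_and_score`, the optimizer, and the tiny batch are placeholders for my actual pipeline):

```python
def overfit_sanity_check(model, optimizer, tiny_batch, num_steps=500):
    """Train on a handful of samples (~6) and track the average sampled reward.

    `model.sample_and_score(batch)` is a placeholder assumed to return
    (sample_log_probs, sample_reward, greedy_reward) for the batch.
    """
    reward_history = []
    for step in range(num_steps):
        sample_log_probs, sample_reward, greedy_reward = model.sample_and_score(tiny_batch)
        loss = self_critical_loss(sample_log_probs, sample_reward, greedy_reward)

        optimizer.zero_grad()
        loss.backward()
        # Gradient clipping, since exploding gradients can make the reward
        # collapse again after it has improved.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0)
        optimizer.step()

        reward_history.append(sample_reward.mean().item())
        if step % 50 == 0:
            # Moving average over the last 50 steps to smooth the high variance.
            recent = reward_history[-50:]
            print(f"step {step}: mean sampled reward = {sum(recent) / len(recent):.4f}")
    return reward_history
```

Even with this setup, the smoothed reward climbs for a while and then drifts back down on the tiny dataset, which is what makes me think I'm missing something more fundamental.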
Topic policy-gradients pytorch reinforcement-learning nlp
Category Data Science