RL agent behaves differently for different data

I am training an RL model using PPO to trade AAPL stock. There are 3 actions: Buy, Sell, or Hold. On a Buy (or Sell) signal, the environment buys (or sells) everything. To trade in a given year, the model learns from the previous 5 years of data (each training episode randomly selects one of those 5 years). In the process, I accidentally included future data in the state; the model trained for 2010 learned to exploit that leak, while the model trained for 2006 did not.
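For context, the environment logic looks roughly like the sketch below (a simplified, gym-style version; `TradingEnv`, `prices_by_year`, `lookback`, etc. are placeholder names, not my exact code). The comment in `_state` marks the kind of indexing that can let future data leak into the observation.

```python
import numpy as np

# Simplified sketch of the environment (placeholder names, not my exact code).
# prices_by_year: dict mapping year -> numpy array of daily close prices.

class TradingEnv:
    ACTIONS = ["BUY", "SELL", "HOLD"]

    def __init__(self, prices_by_year, target_year, window=5, lookback=10):
        self.prices_by_year = prices_by_year
        # Train on the `window` years preceding the target year.
        self.train_years = list(range(target_year - window, target_year))
        self.lookback = lookback

    def reset(self):
        # Randomly pick one of the 5 training years per episode.
        self.year = np.random.choice(self.train_years)
        self.prices = self.prices_by_year[self.year]
        self.t = self.lookback
        self.cash, self.shares = 10_000.0, 0.0
        return self._state()

    def _state(self):
        # The state should only contain past prices. Slicing one step too far
        # (e.g. self.prices[self.t - self.lookback : self.t + 1]) is the kind
        # of off-by-one that leaks future data into the observation.
        return self.prices[self.t - self.lookback : self.t]

    def step(self, action):
        price = self.prices[self.t]
        value_before = self.cash + self.shares * price
        if self.ACTIONS[action] == "BUY" and self.cash > 0:
            self.shares, self.cash = self.cash / price, 0.0    # buy everything
        elif self.ACTIONS[action] == "SELL" and self.shares > 0:
            self.cash, self.shares = self.shares * price, 0.0  # sell everything
        self.t += 1
        next_price = self.prices[self.t]
        reward = self.cash + self.shares * next_price - value_before  # one-step PnL
        done = self.t >= len(self.prices) - 1
        return self._state(), reward, done, {}
```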

I did some thinking and came up with two possible reasons:

  1. This happens due to the different distributions of the training data, i.e. the model trained for 2010 used data from 2005 to 2009 (which fully includes the 2008–2009 stock market crash and a few good years before it), while the model for 2006 used data from 2001 to 2005 (a mostly rising market).
  2. The initial weight distribution affects learning. For example, when I trained models with a small state size on a rising market, some converged to always Buy while a few others converged to always Hold, even though every other parameter was the same (see the seed-sweep sketch after this list).
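To check the second point, one thing I can do is train several agents that differ only in the random seed and compare what each policy converges to. The sketch below assumes Stable-Baselines3 and the old gym reset/step API; `make_env` is a hypothetical helper that wraps the environment above as a proper `gym.Env`.

```python
from stable_baselines3 import PPO  # assuming SB3; swap in your own PPO implementation

def action_frequencies(model, env, episodes=20):
    """Roll out the trained policy and count how often each action is taken."""
    counts = {0: 0, 1: 0, 2: 0}  # BUY, SELL, HOLD
    for _ in range(episodes):
        obs, done = env.reset(), False   # old gym API; adapt if you use gymnasium
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            counts[int(action)] += 1
            obs, reward, done, info = env.step(action)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

for seed in range(10):
    env = make_env(target_year=2006)     # same data, same hyperparameters, only the seed changes
    model = PPO("MlpPolicy", env, seed=seed, verbose=0)
    model.learn(total_timesteps=100_000)
    print(seed, action_frequencies(model, env))
```

If some seeds end up near 100% Buy and others near 100% Hold on identical data and hyperparameters, that would support point 2; if all seeds agree within one training window but disagree across windows, that points back to point 1 instead.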

It would be great if someone could help me confirm my thinking or offer a critique of the above-mentioned points.

Please do also mention if you come up with any other possible reasons for such behaviour.

Any workaround to fix this is also highly appreciated.

Tags: policy-gradients, convergence, reinforcement-learning

Category: Data Science
