How can we have RF-Q-Learning or SVR-Q-Learning (combining these algorithms with Q-Learning)?

How can we have RF-Q-Learning or SVR-Q-Learning (combining these algorithms with Q-Learning)? I want to replace the DNN part of Q-learning with an RF or SVR, but the problem is that there is no clear training data that I can feed to my code in TensorFlow or Keras. How can we do this?

Topic svr dqn q-learning reinforcement-learning random-forest

Category Data Science


You would need to train the RF-QN or SVR-QN on a very large batch/sample generated in the same way as a mini-batch in the DQN version. The input data would be the states and actions visited whilst simulating or running the environment for the batch, and the label would be the TD target $r + \gamma \max_{a'}\hat{q}(s', a')$, using an old copy of the RF or SVR model to calculate $\hat{q}$, with MSE as the loss function.
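Here is a minimal sketch of how such a batch could be built, assuming scikit-learn's `RandomForestRegressor` (an SVR would work the same way), states that are flat numeric vectors, a one-hot action encoding concatenated to the state, and illustrative helper names (`build_batch`, `one_hot`, `GAMMA`) that are not from the question:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

GAMMA = 0.99  # discount factor (assumed value)


def one_hot(a, n_actions):
    """Encode a discrete action as a one-hot vector (assumed encoding)."""
    v = np.zeros(n_actions)
    v[a] = 1.0
    return v


def build_batch(transitions, target_model, n_actions):
    """Turn (s, a, r, s_next, done) tuples into a supervised (X, y) batch.

    Each label is the TD target r + gamma * max_a' q_hat(s', a'), computed
    with a frozen "target" copy of the RF/SVR model.
    """
    X, y = [], []
    for s, a, r, s_next, done in transitions:
        if done:
            td_target = r
        else:
            # Query the old (target) model once per candidate action a'
            q_next = [
                target_model.predict([np.concatenate([s_next, one_hot(a2, n_actions)])])[0]
                for a2 in range(n_actions)
            ]
            td_target = r + GAMMA * max(q_next)
        # Input is the (state, action) pair, label is the TD target
        X.append(np.concatenate([s, one_hot(a, n_actions)]))
        y.append(td_target)
    return np.array(X), np.array(y)


# Fitting a fresh RF on the batch plays the role of the DQN gradient step
# (squared error / MSE is the default split criterion for RandomForestRegressor):
# learning_model = RandomForestRegressor(n_estimators=100).fit(X, y)
```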

The reason you would need a very large batch (and why you generally don't see this done - it will be far slower than a NN) is that neither RF nor SVR in its basic form can be trained online. However, if you manage to find an online version of the algorithm (I know there are some for RF), then you can use it almost identically to a DNN.

Wherever you see $\hat{q}$ in the pseudo-code, you know you need to run the model forward to estimate Q values. Typically you need to do this to select actions (by finding the best Q value over the possible actions) and to generate TD targets for further training.
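As a sketch, epsilon-greedy action selection with the fitted RF/SVR standing in for $\hat{q}$ could look like the following (it reuses the `one_hot` helper and state-action encoding assumed in the previous snippet):

```python
import numpy as np


def select_action(model, s, n_actions, epsilon=0.1):
    """Epsilon-greedy action selection using the fitted RF/SVR as q_hat."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    # One forward pass per candidate action; pick the highest predicted Q value
    q_values = [
        model.predict([np.concatenate([s, one_hot(a, n_actions)])])[0]
        for a in range(n_actions)
    ]
    return int(np.argmax(q_values))
```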

Here's how you would need to change a DQN implementation in general terms:

  • You need two copies of your SVR or RF model: a "target model" and a "learning model". You would start with a basic target model that predicts either random values or fixed values for $Q(s,a)$ - those will clearly be wrong, but it should not matter. Each model trains using output from the previous one, plus the real data about transitions and rewards. With each training session the model becomes more accurate, which allows for better action selection.

  • Wherever you see a call to predict or train the DNN, replace it with the same call to predict or train the SVR or RF. When you see the current network being cloned to the "target network", do the same with the SVR or RF model. Pay attention to whether the code uses the learning network or the target network, and use the corresponding SVR or RF model.

  • If you are using the default full-batch training of RF or SVR, then you will need to generate a large dataset to train on at once, and you should clone the newly trained SVR or RF model to the target model immediately. Also, as the learning model in SVR or RF does not work online, you will need to use the target model to select actions in your case - that should be fine, although it may slow down learning further. Depending on your RL problem, without online versions of the algorithms, your dataset could easily need to be 100,000 or more records per training and cloning step (see the sketch after this list).
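Putting the pieces together, the overall loop might look like the sketch below. It assumes a Gymnasium-style environment API, reuses the hypothetical `build_batch`, `one_hot` and `select_action` helpers from the earlier snippets, and uses a trivial placeholder model for the first round; all of these names and the batch/round sizes are illustrative, not a definitive implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


class DummyModel:
    """Placeholder target model for the first round: predicts 0 for any input."""
    def predict(self, X):
        return np.zeros(len(X))


def train(env, n_actions, n_rounds=20, batch_size=100_000):
    """Batch RF-Q-learning loop: collect a large batch with the current target
    model, fit a fresh learning model on the TD targets, then clone it to
    become the new target model for the next round."""
    target_model = DummyModel()
    for _ in range(n_rounds):
        # Collect a large batch of transitions, selecting actions with the
        # target model (the learning model cannot be updated online)
        transitions = []
        s, _ = env.reset()
        while len(transitions) < batch_size:
            a = select_action(target_model, s, n_actions)
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            transitions.append((s, a, r, s_next, done))
            s = env.reset()[0] if done else s_next

        # Supervised regression on TD targets replaces the DQN gradient step
        X, y = build_batch(transitions, target_model, n_actions)
        learning_model = RandomForestRegressor(n_estimators=100).fit(X, y)

        # Clone the newly trained model straight to the target model
        target_model = learning_model
    return target_model
```

Because a fresh regressor is fitted from scratch each round, "cloning" here is just reassignment; with an online RF variant you would instead keep updating the learning model and copy it to the target model periodically, as in a standard DQN.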
