Offline/Batch Reinforcement Learning: Doubly Robust Off-policy Estimator takes huge values
Context:
My team and I are working on an RL problem for a specific application. We have data collected from user interactions (states, actions, etc.).
It is too costly for us to simulate agents, so we decided to concentrate on offline RL techniques. For this, we are currently using Intel's RL-Coach library, which offers support for batch/offline RL. More specifically, to evaluate policies in offline settings, we train a DDQN-BCQ model and evaluate the learned policies using Off-Policy Estimators (OPEs).
Tools:
RL-Coach library implements different OPEs:
- For Contextual Bandit problems: Inverse Propensity Score (IPS), Direct Method Reward (DM), Doubly Robust (DR)
- For full-blown RL problems: Weighted Importance Sampling (WIS) and a sequential version of Doubly Robust (Seq-DR)
The implementation of Seq-DR is based on the following paper https://arxiv.org/pdf/1511.03722.pdf and is defined as follows: $V_{DR}^{H+1-t} := \hat{V}(s_t) + \rho_t(r_t + \gamma V_{DR}^{H-t} - \hat{Q}(s_t, a_t))$, where $\rho_t = \frac{\pi_e(a_t|s_t)}{\pi_b(a_t|s_t)}$. The target policy probabilities $\pi_e(a_t|s_t)$ are computed from the softmax of the learned Q-table $\hat{Q}(s,a)$. The state-value estimate $\hat{V}(s_t)$ is likewise computed from the Q-table values and the softmax probabilities. The behavior policy probabilities $\pi_b(a_t|s_t)$ are estimated directly from the data.
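For concreteness, here is a minimal sketch of how I understand this estimator for a single episode (my own reconstruction, not the RL-Coach code; the function name, array shapes and the `temperature` parameter are assumptions):

```python
import numpy as np

def seq_dr_episode(q_values, behavior_probs, actions, rewards, gamma=0.99, temperature=1.0):
    """Sequential Doubly Robust estimate for one episode (sketch, not RL-Coach code).

    q_values       : (T, num_actions) learned Q-values for the visited states
    behavior_probs : (T,) pi_b(a_t | s_t) estimated from the data
    actions        : (T,) actions actually taken in the episode
    rewards        : (T,) observed rewards
    """
    v_dr = 0.0
    # The recursion runs backwards:
    # V_DR^{H+1-t} = V_hat(s_t) + rho_t * (r_t + gamma * V_DR^{H-t} - Q_hat(s_t, a_t))
    for t in reversed(range(len(rewards))):
        q_t = q_values[t]
        pi_e = np.exp((q_t - q_t.max()) / temperature)   # target policy = softmax over Q-values
        pi_e /= pi_e.sum()
        v_hat = float(np.dot(pi_e, q_t))                 # V_hat(s_t) = sum_a pi_e(a|s_t) * Q_hat(s_t, a)
        rho_t = pi_e[actions[t]] / behavior_probs[t]     # importance ratio pi_e / pi_b
        v_dr = v_hat + rho_t * (rewards[t] + gamma * v_dr - q_t[actions[t]])
    return v_dr
```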
Problem:
At each epoch, we evaluate the learned policy with Seq-DR. Systematically, we get gigantic values, on the order of $10^{18}$ to $10^{38}$, while the WIS estimate gives values ranging from 100 to 500, which is far more sensible given the returns in our data.
Problem cause explorations:
I have looked at the different terms of the above equation separately to see where these explosions come from. What I discovered is that the Seq-DR value always explodes when evaluating the same specific episodes in our data. I focused on one of them and looked at the estimated value after each transition. What happens is that at some transition $t$, $\rho_t \approx 7$ for about 5 consecutive transitions (while it generally takes values between 0 and 2), and these transitions are enough to start the explosion of the Seq-DR estimate (since the definition is recursive). All the other terms in the formula have reasonable values, and I am fairly sure the explosion does not come from them.
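To illustrate what I mean by the recursion compounding these ratios, here is a toy computation (the magnitudes of $\hat{V}$, $\hat{Q}$ and $r$ are made up; only the ratio pattern mimics what I observe):

```python
# Toy illustration of how a short run of large importance ratios compounds
# through the backward Seq-DR recursion (made-up magnitudes).
gamma = 0.99
v_hat, q_hat, r = 50.0, 50.0, 1.0            # arbitrary but "reasonable" values
rhos = [1.0] * 10 + [7.0] * 5 + [1.0] * 10   # 5 consecutive steps with rho ~ 7

v_dr = 0.0
for rho in reversed(rhos):
    v_dr = v_hat + rho * (r + gamma * v_dr - q_hat)

# Each of the five rho ~ 7 steps multiplies the downstream correction by ~gamma * 7,
# i.e. roughly a (0.99 * 7)**5 ~ 1.6e4 factor overall, and every earlier step then
# carries that inflated value along. v_dr ends up around -6e5 here, far outside the
# range of the actual returns.
print(v_dr)
```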
We also looked at the length of our episodes. Our episodes have varying lengths and can sometimes be quite long, with no clear horizon (effectively an infinite-horizon setting). We tried cutting those episodes into smaller episodes of length 20 to see if the problem would disappear, since long episodes might lead to unstable values, but this didn't change anything and the Seq-DR values still explode.
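For reference, the chunking we applied is simply the following (hypothetical helper, not an RL-Coach function; `transitions` stands for the per-episode sequence of transitions fed to the estimator):

```python
def split_episode(transitions, max_len=20):
    """Cut a long episode into consecutive sub-episodes of at most max_len transitions,
    each of which is then evaluated as its own episode by the estimator."""
    return [transitions[i:i + max_len] for i in range(0, len(transitions), max_len)]
```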
Questions:
Why would Doubly Robust estimators give such gigantic values? Doubly Robust estimators combine a Model-Based (Direct Method) estimator with an Importance Sampling estimator. They are supposed to still give good values when at least one of the two following conditions holds: 1) the reward/value model is accurate, or 2) the estimated behavior policy is close to the true one. If the values we get with the WIS estimator are reasonable, why are the values we get from the DR estimator so completely off?
Is it possible that this happens simply because of a distributional shift between the target policy probabilities (derived from the Q-values) and the behavior policy probabilities, i.e. the learned policy wants to take actions for which we have very little data? But in that case, shouldn't the WIS estimates be completely off as well?
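For context on why I find this surprising, here is my understanding of how the per-episode importance weights are treated with and without self-normalization (my own sketch, not the RL-Coach implementation; `wis_estimate` and `ois_estimate` are hypothetical helper names):

```python
import numpy as np

def wis_estimate(returns, weights):
    """Weighted (self-normalized) importance sampling over a batch of episodes:
    the per-episode weight is the product of the per-step ratios rho_t, and
    normalizing by the sum of weights keeps the estimate within the range of
    the observed returns."""
    w, g = np.asarray(weights, float), np.asarray(returns, float)
    return float(np.sum(w * g) / np.sum(w))

def ois_estimate(returns, weights):
    """Ordinary (unnormalized) importance sampling: one huge weight dominates."""
    w, g = np.asarray(weights, float), np.asarray(returns, float)
    return float(np.mean(w * g))

# Three episodes; the third one contains the run of rho ~ 7 steps.
returns = [300.0, 250.0, 280.0]
weights = [1.2, 0.8, 7.0 ** 5]
print(wis_estimate(returns, weights))   # ~280, close to the observed returns
print(ois_estimate(returns, weights))   # ~1.6e6, blown up by the unnormalized weight
```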
Topic: estimators, q-learning, reinforcement-learning, dataset, machine-learning
Category: Data Science