Testing Multi-Armed Bandits on Historical Data

Suppose I want to test a multi-armed bandit algorithm in the contextual setting on a set of historical data. For simplicity, let's assume there are only two arms, A and B, and that the rewards are binary. Furthermore, suppose I have a data set in which users were shown one of the two arms and I have a record of the rewards. What would be the best approach to simulating the scenario of running the algorithm online?

I was thinking of doing the following: if the algorithm outputs A, then I want to record a reward so the algorithm can learn; to do this, we sample uniformly at random from the set of data where users were shown that output to obtain the reward.

I'd like to know whether this approach is suitable, and whether anyone knows of better ways to simulate online learning algorithms with historical data, in particular for the multi-armed bandit problem.



The problem you have is that the users were originally shown A or B under a different policy from the one the optimiser is learning. You probably don't know the probabilities of A or B under that policy (if you do, that allows for a bit more subtlety).

You will need to reject samples that do not match what the optimiser does at any point. This may leave you with far less historical data to mine, and the possibility of getting a biased result.

> to do this, we sample uniformly at random from the set of data where users were shown that output to obtain the reward.

If I understand this correctly, you are planning to:

  1. Pick an event (including user data) for input context
  2. Generate the choice of A or B using your learning agent
  3. Find another user that was presented the same output in the historical data, by sampling randomly from "all users shown A" or "all users shown B"
  4. Count the reward generated in the sampled historical record

Sadly this will not work, because in step 3 you divorce the context from the agent's decision. You will therefore not be measuring the agent's likely reward, but the hybrid reward of an agent that sees one user population while collecting rewards generated under the historic policies (which you may not even know). For example, if the historic policy tended to show A to users who were already likely to respond to it, sampling from "all users shown A" will overstate how well A works for the users your agent actually chose it for.

A less biased approach might be (a minimal sketch in code follows the list):

  1. Pick an event (including user data) randomly from historic data
  2. Generate the choice of A or B using your learning agent
  3. If the agent's choice and historic choice for that event match, then count the reward, else discard the event from the test
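
Here is a minimal sketch of that replay-style evaluation, assuming the log is a list of `(context, action, reward)` tuples and a hypothetical agent object exposing `select_arm(context)` and `update(context, action, reward)` methods - adapt the interface to whatever bandit implementation you actually have:

```python
import random

def replay_evaluate(agent, logged_data, shuffle=True):
    """Replay evaluation: only events where the agent's choice matches
    the logged action are counted and fed back to the agent."""
    events = list(logged_data)
    if shuffle:
        random.shuffle(events)

    total_reward = 0.0
    matched = 0
    for context, logged_action, reward in events:
        chosen = agent.select_arm(context)
        if chosen == logged_action:
            # The logged reward is valid for the agent's choice,
            # so count it and let the agent learn from it.
            total_reward += reward
            matched += 1
            agent.update(context, chosen, reward)
        # Otherwise discard the event: we have no record of the
        # reward the agent's choice would have produced.

    avg = total_reward / matched if matched else float("nan")
    return avg, matched
```

Note that `matched` will shrink as the agent's policy diverges from the historic one, which is exactly the data-loss problem described next.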

You may still get sampling bias here if the agent's policy differs from the historic policy for a large number of users. The problem is that you systematically remove users where the old policy said one thing and the new policy another, due to the lack of data about the reward. So unless the original policy was completely random, the samples you actually train/evaluate on will not be drawn in the correct ratios across the user population as it varies. Importance sampling could help mitigate that, but only if you know the probabilities of selecting A or B under both the historic policy and the learning agent's policy.
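
For illustration, a sketch of the inverse-propensity (importance-sampling) estimate of the learned policy's value, under the assumption that you know the logged propensities. The names are illustrative: `logging_prob` is the probability the historic policy assigned to the logged action, and `agent_prob(context, action)` is the probability the learned policy assigns to that same action:

```python
def ips_value(events, agent_prob):
    """events: iterable of (context, action, reward, logging_prob) tuples.
    Returns the importance-weighted average reward of the agent's policy."""
    total, n = 0.0, 0
    for context, action, reward, logging_prob in events:
        # Reweight each logged reward by how much more (or less) likely
        # the agent is to take the logged action than the historic policy was.
        weight = agent_prob(context, action) / logging_prob
        total += weight * reward
        n += 1
    return total / n if n else float("nan")
```

In practice the weights are often clipped to some maximum value, trading a little bias for much lower variance when the two policies disagree strongly.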


Another possibility is to train a supervised model on your historic data that predicts reward given user data and action, and use that to drive a simulator of the online environment. If you are also training your agent during this test, you may need to simulate variance - e.g. turn a predicted click-through probability into either a click or a no-click event by sampling $x \sim U(0,1)$ and recording a click when $x < p(\text{click})$ - because that is the data your online algorithm needs to learn from. This is clearly limited by the accuracy of the supervised model, but it could reduce variance, because you will get some indication of reward for events that did not happen historically (with some bounds on accuracy that you can estimate). That is due to the function approximation in the supervised model - the idea is quite similar to bootstrapping the critic's evaluation in an actor-critic reinforcement learning model.
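
A minimal sketch of that simulator idea, assuming numeric context vectors and scikit-learn; here a logistic regression stands in for whatever reward model you would actually fit:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_reward_model(contexts, actions, rewards):
    """contexts: (n, d) array; actions: (n,) array of 0/1 arm ids;
    rewards: (n,) array of 0/1 clicks from the historic log."""
    X = np.column_stack([contexts, actions])
    model = LogisticRegression()
    model.fit(X, rewards)
    return model

def simulated_reward(model, context, action, rng=None):
    """Turn the model's predicted p(click) into a sampled 0/1 reward,
    i.e. the x < p(click) step described above."""
    if rng is None:
        rng = np.random.default_rng()
    x = np.append(context, action).reshape(1, -1)
    p_click = model.predict_proba(x)[0, 1]
    return int(rng.random() < p_click)
```

The agent can then be run against `simulated_reward` for any (context, action) pair, including actions the historic policy never took for that user, at the cost of inheriting the reward model's errors.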
