Testing Multi-Armed Bandits on Historical Data
Suppose I want to test a multi-armed bandit algorithm in the contextual setting on a set of historical data. For simplicity, assume there are only two arms, A and B, and that the rewards are binary. Furthermore, suppose I have a data set in which each user was shown one of the two arms and the resulting reward was recorded. What would be the best approach to simulating the scenario of running the algorithm online?
I was thinking of doing the following: if the algorithm outputs A, then to give it a reward to learn from, I sample uniformly at random from the subset of historical records where users were shown A, and feed the observed reward back to the algorithm.
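To make the proposal concrete, here is a minimal sketch of that simulation loop. The log format, the `choose_arm`/`update` interface, and the epsilon-greedy policy used as a stand-in bandit are all my own illustrative assumptions, not part of the question:

```python
import random

# Hypothetical historical log: each record is (arm_shown, binary_reward).
history = [("A", 1), ("A", 0), ("A", 0), ("B", 1), ("B", 1), ("B", 0)]

# Group observed rewards by the arm that was actually shown.
rewards_by_arm = {"A": [], "B": []}
for arm, reward in history:
    rewards_by_arm[arm].append(reward)

def simulate_round(choose_arm, update):
    """One simulated online round: the bandit picks an arm, and we sample
    a reward uniformly at random from the historical records for that arm."""
    arm = choose_arm()
    reward = random.choice(rewards_by_arm[arm])
    update(arm, reward)
    return arm, reward

# A trivial epsilon-greedy policy, just to drive the loop (hypothetical).
counts = {"A": 0, "B": 0}
totals = {"A": 0, "B": 0}

def choose_arm(eps=0.1):
    # Explore with probability eps, or while an arm is still unplayed.
    if random.random() < eps or min(counts.values()) == 0:
        return random.choice(["A", "B"])
    # Otherwise exploit the arm with the higher empirical mean reward.
    return max(["A", "B"], key=lambda a: totals[a] / counts[a])

def update(arm, reward):
    counts[arm] += 1
    totals[arm] += reward

for _ in range(1000):
    simulate_round(choose_arm, update)
```

Note that because each pull resamples uniformly with replacement from the per-arm records, this treats every arm's reward distribution as stationary and ignores the contexts under which the logged rewards were collected.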
I'd like to know whether this approach is suitable, and whether anyone knows of better ways to simulate online learning algorithms on historical data, in particular for the multi-armed bandit problem.