Multi-armed bandit problem for Bernoulli reward distributions
Suppose in the multi-armed bandit problem I know my rewards are distributed as $0$ or $1$, i.e., according to a Bernoulli distribution, rather than merely satisfying the condition that they lie in the range $[0,1]$. Does anyone know if we can do better with our confidence bounds under this restricted condition?
In particular, how does the upper confidence bound (UCB) algorithm change, and what is the corresponding upper bound on the expected regret?
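
For concreteness, here is a minimal sketch of the standard UCB1 index (empirical mean plus $\sqrt{2\ln t / n_i}$) that I have in mind as the baseline for general $[0,1]$-bounded rewards; the arm means and simulation harness below are hypothetical, just to make it runnable:

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Standard UCB1 for rewards in [0, 1] (Auer et al., 2002).

    `pull(i)` returns a reward in [0, 1] for arm i; here the rewards
    happen to be Bernoulli, but the index does not exploit that.
    """
    counts = [0] * n_arms   # times each arm was pulled
    sums = [0.0] * n_arms   # total reward collected per arm
    rewards = []

    for t in range(1, horizon + 1):
        if t <= n_arms:
            i = t - 1       # pull each arm once to initialize
        else:
            # index: empirical mean + exploration bonus sqrt(2 ln t / n_i)
            i = max(range(n_arms),
                    key=lambda a: sums[a] / counts[a]
                                  + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull(i)
        counts[i] += 1
        sums[i] += r
        rewards.append(r)
    return rewards

# Hypothetical Bernoulli arms, purely for illustration.
means = [0.3, 0.5, 0.7]
random.seed(0)
history = ucb1(lambda i: 1.0 if random.random() < means[i] else 0.0,
               n_arms=len(means), horizon=10_000)
print(sum(history))
```

My question is essentially whether knowing the rewards are Bernoulli lets us replace this exploration bonus with a tighter one, and what that does to the regret guarantee.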
Can someone provide links to a paper or a set of notes on this particular problem?