Multi-armed bandit problem for Bernoulli reward distribution

Suppose that in the multi-armed bandit problem I know my rewards are $0$ or $1$, i.e. distributed according to a Bernoulli distribution, rather than merely lying in the range $[0,1]$. Does anyone know whether we can obtain tighter confidence bounds under this restricted condition?

In particular, how does the upper confidence bound (UCB) algorithm change, and what is the corresponding upper bound on the expected regret?
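For concreteness, here is a minimal sketch of the vanilla UCB1 baseline I have in mind (the arm means, horizon, and seed are made up for illustration; UCB1 itself only assumes rewards bounded in $[0,1]$). My question is essentially whether the exploration bonus $\sqrt{2 \ln t / n_i}$ below can be tightened when the rewards are known to be Bernoulli:

```python
import math
import random

def ucb1_bernoulli(arm_means, horizon, seed=0):
    """Run the standard UCB1 policy on simulated Bernoulli arms.

    arm_means and horizon are hypothetical inputs for illustration;
    UCB1 only uses the fact that rewards lie in [0, 1].
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # total reward per arm
    regret = 0.0
    best = max(arm_means)

    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # pull each arm once to initialise
        else:
            # UCB1 index: empirical mean + sqrt(2 ln t / n_i)
            arm = max(range(k), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2.0 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - arm_means[arm]
    return regret

# Example: three Bernoulli arms, 10,000 rounds.
print(ucb1_bernoulli([0.3, 0.5, 0.6], 10_000))
```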

Can someone provide links to a paper or a set of notes on this particular problem?

Topic randomized-algorithms probability reinforcement-learning machine-learning

Category Data Science
