Multi-armed bandit problem for Bernoulli reward distributions
Suppose in the multi-armed bandit problem I know my rewards are distributed as $0$ or $1$, i.e., according to a Bernoulli distribution, rather than merely satisfying the condition that they lie in the range $[0,1]$. Does anyone know if we can do better with our confidence bounds under this restricted condition?
In particular, how does the upper confidence bound (UCB) algorithm change, and what is the corresponding upper bound on the expected regret?
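
For concreteness, here is a minimal sketch of the standard UCB1 index (empirical mean plus $\sqrt{2\ln t / n_i}$) that I have in mind as the baseline for general $[0,1]$-bounded rewards; the arm means and simulation harness below are hypothetical, just to make it runnable:

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Standard UCB1 for rewards in [0, 1] (Auer et al., 2002).

    `pull(i)` returns a reward in [0, 1] for arm i; here the rewards
    happen to be Bernoulli, but the index does not exploit that.
    """
    counts = [0] * n_arms   # times each arm was pulled
    sums = [0.0] * n_arms   # total reward collected per arm
    rewards = []

    for t in range(1, horizon + 1):
        if t <= n_arms:
            i = t - 1       # pull each arm once to initialize
        else:
            # index: empirical mean + exploration bonus sqrt(2 ln t / n_i)
            i = max(range(n_arms),
                    key=lambda a: sums[a] / counts[a]
                                  + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull(i)
        counts[i] += 1
        sums[i] += r
        rewards.append(r)
    return rewards

# Hypothetical Bernoulli arms, purely for illustration.
means = [0.3, 0.5, 0.7]
random.seed(0)
history = ucb1(lambda i: 1.0 if random.random() < means[i] else 0.0,
               n_arms=len(means), horizon=10_000)
print(sum(history))
```

My question is essentially whether knowing the rewards are Bernoulli lets us replace this exploration bonus with a tighter one, and what that does to the regret guarantee.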
Can someone provide links to a paper or a set of notes on this particular problem?