There is confusion because there is no consensus in the community. I think the definitions are well explained in your references, but let me phrase them in my own words in the hope that this clarifies your question.
Notice that the two regrets and the two pseudo-regrets below are not the same things.
In the book of Lattimore and Szepesvári, they start with a definition that is tailored to the stochastic MAB setup. In fact, $\mu^{\star}$ is well defined because each arm has an associated distribution and thus an associated mean $\mu_a$; we can then set $\mu^{\star}=\max_{a\in\mathcal{A}}\mu_a$. After that, we take $\mu^{\star}$ as the comparator and define the regret as an expected value, which is just a real number.
$$ \text{Regret: }T\mu^{\star}- \mathbb{E}\Bigg[\sum_{t=1}^{T} X_{a_t,t}\Bigg] $$
On the other hand, as mentioned in the reply of Valentas, both the random regret and the pseudo-regret are random variables.
$$ \text{Random regret: } T\mu^{\star}- \sum_{t=1}^{T} X_{a_t,t}\\
\text{Pseudo-regret: } T\mu^{\star}- \sum_{t=1}^{T}\mu_{a_t} $$
The expectation of both of these is the regret defined above (they coincide by the tower rule, because the conditional mean of $X_{a_t,t}$ given the choice $a_t$ is $\mu_{a_t}$). While both are random variables that vary from realization to realization, the difference is that in the pseudo-regret we evaluate the performance of the learner by the expected reward of the pulled arm, while in the random regret we use the actual realized feedback.
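To make this concrete, here is a minimal simulation sketch (my own construction, not from the book), assuming a two-armed Bernoulli bandit with hypothetical means $0.6$ and $0.4$ and a simple $\varepsilon$-greedy learner. Across runs, the random regret fluctuates much more than the pseudo-regret, but averaging either one approximates the same regret $T\mu^{\star}-\mathbb{E}\big[\sum_{t=1}^{T} X_{a_t,t}\big]$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.6, 0.4])          # hypothetical arm means; mu_star = 0.6
T, n_runs, eps = 1000, 1000, 0.1   # horizon, Monte Carlo runs, exploration rate

random_regrets, pseudo_regrets = [], []
for _ in range(n_runs):
    counts, sums = np.zeros(2), np.zeros(2)
    realized, pulled_means = 0.0, 0.0
    for t in range(T):
        # epsilon-greedy: explore with probability eps (or if an arm is unpulled),
        # otherwise play the arm with the best empirical mean
        if rng.random() < eps or counts.min() == 0:
            a = int(rng.integers(2))
        else:
            a = int(np.argmax(sums / counts))
        x = float(rng.random() < mu[a])    # Bernoulli reward X_{a_t, t}
        counts[a] += 1
        sums[a] += x
        realized += x                      # accumulates sum_t X_{a_t, t}
        pulled_means += mu[a]              # accumulates sum_t mu_{a_t}
    random_regrets.append(T * mu.max() - realized)       # random regret
    pseudo_regrets.append(T * mu.max() - pulled_means)   # pseudo-regret

# Both averages estimate the same quantity: T*mu_star - E[sum_t X_{a_t, t}]
print(np.mean(random_regrets), np.mean(pseudo_regrets))
# ...but the random regret has a much larger variance across runs
print(np.std(random_regrets), np.std(pseudo_regrets))
```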
In the paper of Bubeck and Cesa-Bianchi, they first define the regret as a random variable.
$$\text{Regret: } \max_{a\in\mathcal{A}} \sum_{t=1}^{T} X_{a,t} - \sum_{t=1}^{T} X_{a_t,t}$$
On the other hand, both the pseudo-regret and the expected regret are real numbers. The difference is in how we define the "best" action, or the comparator.
- In expected regret, the "best" action depends on the randomness and is itself a random variable, because the maximum is inside the expectation (a toy example after these two definitions makes the difference concrete).
$$\text{Expected regret: } \mathbb{E}\Bigg[\max_{a\in\mathcal{A}} \sum_{t=1}^{T} X_{a,t} - \sum_{t=1}^{T} X_{a_t,t}\Bigg]$$
- In pseudo-regret, the "best" action is only defined with respect to the expectation and is not random. The randomness only plays a role once the "deterministic" comparator is fixed.
$$\text{Pseudo-regret: } \max_{a\in\mathcal{A}}\mathbb{E}\Bigg[ \sum_{t=1}^{T} X_{a,t} - \sum_{t=1}^{T} X_{a_t,t}\Bigg]$$
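A toy case (my own, not from the paper) makes the gap visible: take $T=1$, two arms with $X_{1,1}, X_{2,1}$ i.i.d. Bernoulli$(1/2)$, and a learner that always plays arm $1$. Then
$$\max_{a\in\mathcal{A}}\mathbb{E}\big[X_{a,1}-X_{1,1}\big]=0, \qquad \mathbb{E}\Big[\max_{a\in\mathcal{A}} X_{a,1}-X_{1,1}\Big]=\mathbb{P}(X_{1,1}=0,\,X_{2,1}=1)=\tfrac14,$$
so the pseudo-regret is $0$ while the expected regret is $1/4$: putting the maximum inside the expectation lets the comparator adapt to each realization.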
If we consider the stochastic MAB setup again, we get the following.
- The pseudo-regret is exactly what we referred to as the regret in the first part. In fact, $$\mathbb{E}\Bigg[ \sum_{t=1}^{T} X_{a,t} - \sum_{t=1}^{T} X_{a_t,t}\Bigg]=\mathbb{E}\Bigg[ \sum_{t=1}^{T} \mu_a - \sum_{t=1}^{T} \mu_{a_t}\Bigg]=T\mu_a - \mathbb{E}\Bigg[\sum_{t=1}^{T} \mu_{a_t}\Bigg],$$ and since only the first term depends on $a$, maximizing over $a$ turns it into $T\mu^{\star}$.
- The expected regret is in general larger because, for each realization, we compare against the action that maximizes the realized reward. If I remember correctly, we cannot avoid $\sqrt{T}$ expected regret even in the simple stochastic MAB case. This is in sharp contrast with the $\log T$ instance-dependent bound that we can get for the regret / pseudo-regret (the small simulation after this list illustrates where the $\sqrt{T}$ comes from).
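To illustrate the $\sqrt{T}$ phenomenon, here is a small Monte Carlo sketch (again my own construction, not from the paper), assuming two Bernoulli arms that share the same mean $1/2$ and a learner that always plays arm $0$. That learner is optimal in expectation, so its pseudo-regret is exactly $0$, yet its expected regret grows because $\mathbb{E}\big[\max_a \sum_t X_{a,t}\big]$ exceeds $T\mu^{\star}$ by roughly $\sqrt{T/(4\pi)}$ (a Gaussian approximation of $\mathbb{E}[(S_2-S_1)^+]$).

```python
import numpy as np

rng = np.random.default_rng(1)
n_runs = 100_000   # Monte Carlo runs

for T in [100, 400, 1600, 6400]:
    # Two Bernoulli(1/2) arms; the learner always plays arm 0, which is optimal
    # in expectation, so its pseudo-regret is exactly 0 for every T.
    # Since the learner's choices never depend on the feedback, each per-arm
    # total sum_t X_{a,t} is Binomial(T, 1/2) and can be sampled directly.
    S = rng.binomial(T, 0.5, size=(n_runs, 2))           # S[run, a] = sum_t X_{a, t}
    expected_regret = (S.max(axis=1) - S[:, 0]).mean()   # E[max_a sum_t X_{a,t} - sum_t X_{a_t,t}]
    print(T, round(expected_regret, 2), round((T / (4 * np.pi)) ** 0.5, 2))
```

The second and third printed columns should agree closely and grow like $\sqrt{T}$, while the pseudo-regret of this learner stays at $0$ for every horizon.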
For more context:
The notion of regret is a general one in online learning; it applies to problems like bandits, online convex optimization, learning in games, and probably many others. The essential idea is to compare the cumulative performance of the learner against some type of comparator. Beyond this intuition, I am not sure whether there is any consensus in the community. The comparator can be a fixed action (external regret), can change over time (dynamic regret), or can even be a mapping from the learner's actions to alternative actions (swap regret), while the environment can be stochastic, adversarial (oblivious or adaptive), modeled by a game, and so on. I think this diversity is the main source of confusion.
As mentioned above, the definition you give here is very specific to the stochastic MAB setup and is only a special case of what can be referred to as regret.