Bandits with heavy tail

Sébastien Bubeck, Nicolò Cesa-Bianchi, Gábor Lugosi

Introduction

In this paper we investigate the classical stochastic multi-armed bandit problem introduced by Robbins (1952) and described as follows: an agent facing KK actions (or bandit arms) selects one arm at every time step. With each arm i{1,,K}i\in\{1,\ldots,K\} there is an associated probability distribution νi\nu_{i} with finite mean μi\mu_{i}. These distributions are unknown to the agent. At each round t=1,,nt=1,\ldots,n, the agent chooses an arm ItI_{t}, and observes a reward drawn from νIt\nu_{I_{t}} independently from the past given ItI_{t}. The goal of the agent is to minimize the regret

We refer the reader to Bubeck and Cesa-Bianchi (2012) for a survey of the extensive literature of this problem and its variations. The vast majority of authors assume that the unknown distributions νi\nu_{i} are sub-Gaussian, that is, the moment generating function of each νi\nu_{i} is such that if XX is a random variable drawn according to the distribution νi\nu_{i}, then for all λ0\lambda\geq 0,

Then one can show that the so-called ψ\psi-UCB strategy (a variant of the basic UCB strategy of Auer et al. (2002)) satisfies the following regret guarantee. Let Δi=maxj=1,,Kμjμi\Delta_{i}=\max_{j=1,\ldots,K}\mu_{j}-\mu_{i}, and ψ\psi^{*} the Legendre-Fenchel transform of ψ\psi, defined by

Then ψ\psi-UCBMore precisely, (α,ψ)(\alpha,\psi)-UCB with α=4\alpha=4. satisfies

In particular, when the reward distributions are sub-Gaussian, the regret bound is of the order i(logn)/Δi\sum_{i}(\log n)/\Delta_{i}, which is known to be optimal even for bounded reward distributions, see Auer et al. (2002).

While this result shows that assumptions weaker than sub-Gaussian distributions may suffice for a logarithmic regret, it still requires the distributions to have finite moment generating function. Another disadvantage of the bound above is that the dependence on the gaps Δi\Delta_{i} deteriorates as the tail of the distributions become heavier. In fact, as we show it in this paper, the bound is sub-optimal when the tails are heavier than sub-Gaussian.

In this paper we investigate the behavior of the regret when the distributions are heavy-tailed, and might not have a finite moment generating function. We show that under significantly weaker assumptions, regret bounds of the same form as in the sub-Gaussian case may be achieved. In fact, the only condition we need is that the reward distributions have a finite variance. Moreover, even if the variance is infinite but the distributions have finite moments of order 1+ε1+\varepsilon for some ε>0\varepsilon>0, one may still achieve a regret logarithmic in the number nn of rounds though the dependency on the Δi\Delta_{i}’s worsens as ε\varepsilon gets smaller. For instance, for distributions with moment of order 1+ε1+\varepsilon bounded by 11 we derive a strategy that satisfies

The key to this result is to replace the empirical mean by more refined robust estimators of the mean and construct “upper confidence bound” strategies.

We also prove matching lower bounds that show that the proposed strategies are optimal up to constant factors. In particular the dependency in 1/Δi1/ε1/\Delta_{i}^{1/\varepsilon} is unavoidable.

In the following we start by defining a general class of sampling strategies that are based on the availability of estimators of the mean with certain performance guarantees. Then we examine various estimators for the mean. For each estimator we describe their performance (in terms of concentration to the mean) and deduce the corresponding regret bound.

Robust upper confidence bound strategies

This property of the empirical mean turns out to be crucial in order to achieve a regret of optimal order. However, when the sub-Gaussian assumption does not hold, one cannot expect the empirical mean to have such an accuracy. In fact, if one only knows, say, that the variance of each Xi,rX_{i,r} is bounded, then the best possible confidence intervals are significantly wider, deteriorating the performance of standard ucb strategies. (See Appendix A for properties of the empirical mean under distributions of heavy tails.)

The key to successful handling heavy-tailed reward distributions is to replace the empirical mean with other, more robust, estimators of the mean. All we need is a performance guarantee like the one shown above for the empirical mean. More precisely, we need a mean estimator with the following property.

Let ε(0,1]\varepsilon\in(0,1] be a positive parameter and let c,vc,v be positive constants. Let X1,,XnX_{1},\ldots,X_{n} be i.i.d. random variables with finite mean μ\mu. Suppose that for all δ(0,1)\delta\in(0,1) there exists an estimator μ^=μ^(n,δ)\widehat{\mu}=\widehat{\mu}(n,\delta) such that, with probability at least 1δ1-\delta,

and also, with probability at least 1δ1-\delta,

For example, if the distribution of the XtX_{t} satisfies the sub-Gaussian condition (1), then Assumption 1 is satisfied for ε=1\varepsilon=1, c=2c=2, and variance factor vv. Interestingly, the assumption may be satisfied for significantly more general distributions by using more sophisticated mean estimators. We recall some of these estimators in the following subsections, where we also show how they satisfy Assumption 1. As we shall see, the basic requirement for Assumption 1 to be satisfied is that the distribution of the XtX_{t} has a finite moment of order 1+ε1+\varepsilon.

We are now ready to define our generalized robust ucb strategy, described in Figure 1. We denote by Ti(t)T_{i}(t) the (random) number of times arm ii is selected up to time tt.

The following proposition gives a performance bound for the robust ucb policy provided that the reward distributions and the mean estimator used by the policy jointly satisfy Assumption 1. Below we exhibit several mean estimators that, under various moment assumptions, lead to regret bounds of optimal order.

Let ε(0,1]\varepsilon\in(0,1] and let μ^(s,δ)\widehat{\mu}(s,\delta) be a mean estimator. Suppose that the distributions ν1,,νK\nu_{1},\ldots,\nu_{K} are such that the mean estimator satisfies Assumption 1 for all i=1,,Ki=1,\ldots,K. Then the regret of the Robust ucb policy satisfies

Also, if nn is such that \log n\geq\max_{i}\left(5\Delta_{i}^{(1+\varepsilon)/\varepsilon}\big{/}(2cv^{1/\varepsilon})\right), then

Note that a regret of at least iΔi\sum_{i}\Delta_{i} is suffered by any strategy that pulls each arm at least once. Thus, the interesting term in (3) is the one of the order of \sum_{i\,:\,\Delta_{i}>0}\bigl{(}v/\Delta_{i}\bigr{)}^{\frac{1}{\varepsilon}}\log n. We show below in Theorem 2 that this term is of optimal order under a moment assumption on the reward distributions. We also show in Theorem 2 that the gap-independent inequality (4) is optimal up to a logarithmic factor.

Proof. Both proofs of (3) and (4) rely on bounding the expected number of pulls for a suboptimal arm. More precisely, in the first two steps of the proof we prove that, for any ii such that Δi>0\Delta_{i}>0,

First step. We show that if It=iI_{t}=i, then one the following three inequalities is true: either

Indeed, assume that all three inequalities are false. Then we have

which implies, in particular, that ItiI_{t}\neq i.

Second step. Here we first bound the probability that (6) or (7) hold. By Assumption 1 as well as an union bound over the value of Ti(t1)T_{i^{*}}(t-1) and Ti(t1)T_{i}(t-1) we obtain

In the next sections we show how Proposition 1 may be applied, with different mean estimators, to obtain optimal regret bounds for possibly heavy-tailed reward distributions.

In this section we consider the simplest of the proposed mean estimators, a truncated version of the empirical mean. This estimator is similar to the “winsorized mean” and “trimmed mean” of Tukey, see Bickel (1965).

The following lemma shows that if the (1+ε)(1+\varepsilon)-th raw moment is bounded, then the truncated mean satisfies Assumption 1.

Let δ(0,1),ε(0,1]\delta\in(0,1),\varepsilon\in(0,1], and u>0u>0. Consider the truncated empirical mean μ^T\widehat{\mu}_{T} defined as

An easy computation concludes the proof. \Box

The following is now a straightforward corollary of Proposition 1 and Lemma 1.

Let ε(0,1]\varepsilon\in(0,1] and u>0u>0. Assume that the reward distributions ν1,,νK\nu_{1},\ldots,\nu_{K} satisfy

Then the regret of the Robust-ucb policy used with the truncated mean estimator defined above satisfies

When ε=1\varepsilon=1, the only assumption of the theorem above is that each reward distribution has a finite variance. In this case the obtained regret bound is of the order of i(logn)/Δi\sum_{i}(\log n)/\Delta_{i}, which is known to be not improvable in general, even when the rewards are bounded —note, however, that the kl-ucb algorithm of Garivier and Cappé (2011) is never worse than Robust-ucb in case of bounded rewards. We find it remarkable that regret of this order may be achieved under the only assumption of finite variance and one cannot improve the order by imposing stronger tail conditions.

When the variance is infinite but moments of order 1+ε1+\varepsilon are available, we still have a regret that depends only logarithmically on nn. The bound deteriorates slightly as the dependency on 1/Δi1/\Delta_{i} is replaced by 1/Δi1/ε1/\Delta_{i}^{1/\varepsilon}. We show next that this dependency is inevitable.

Furthermore, for any fixed nn, there exists a set of KK distributions satisfying (9) with u=1u=1 and such that for any algorithm, one has

The proof of (11) follows the same scheme. We use the same distributions as above and we consider the multi-armed bandit problem where one arm has distribution ν1\nu_{1}, and the K1K-1 remaining arms have distribution ν2\nu_{2}. Furthermore we set Δ=(K/n)ε1+ε\Delta=(K/n)^{\frac{\varepsilon}{1+\varepsilon}} for this part of the proof. Now we can use the same proof as for (Bubeck, 2010, Theorem 2.6) on the modified algorithm A\mathcal{A}^{\prime} that runs on the Bernoulli distributions corresponding to ν1\nu_{1} and ν2\nu_{2}. We leave the straightforward details to the reader. \Box

2 Median of means

The median-of-means estimator was proposed by Alon et al. (2002). The simple idea is to divide the data into various disjoint blocks. Within each block one calculates the standard empirical mean and takes a median value of these empirical means. The next lemma shows that for certain block size the estimator has the property required by our robust ucb strategy.

be kk empirical mean estimates, each one computed on NN data points. Consider a median μ^M\widehat{\mu}_{M} of these empirical means. Then, with probability at least 1δ1-\delta,

we have p1/4p\leq 1/4. Thus using Hoeffding’s inequality for the tail of a binomial distribution, we get

The next performance bound is a straightforward consequence of Proposition 1 and Lemma 2. In some situations it significantly improves on Theorem 1 as the bound depends on the centered moments of order 1+ε1+\varepsilon rather than on raw moments.

Let ε(0,1]\varepsilon\in(0,1] and v>0v>0. Assume that the reward distributions ν1,,νK\nu_{1},\ldots,\nu_{K} satisfy

Then the regret of the Robust-ucb policy used with the median-of-means mean estimator defined in Lemma 2 satisfies

3 Catoni’s M𝑀M estimator

Finally, we consider an elegant mean estimator introduced by Catoni (2010). As we will see, this estimator has similar performance guarantees as the median-of-means estimator but with better, near optimal, numerical constants. However, we only have a good guarantee in terms of the variance. Thus, in this section we assume that the variance is finite and we do not consider the case ε<1\varepsilon<1.

Let δ(0,1)\delta\in(0,1) be such that n>2log(1/δ)n>2\log(1/\delta) and introduce

If X1,,XnX_{1},\ldots,X_{n} be i.i.d. random variables, then Catoni’s estimator is defined as the unique value μ^C=μ^C(n,δ)\widehat{\mu}_{C}=\widehat{\mu}_{C}(n,\delta) such that

Catoni (2010) proves that if n4log(1/δ)n\geq 4\log(1/\delta) and the XiX_{i} have mean μ\mu and variance at most vv, then, with probability at least 1δ1-\delta,

and a similar bound holds for the lower-tail. This bound has the same form as in Assumption 1, though it only holds with the additional requirement that n4log(1/δ)n\geq 4\log(1/\delta) and therefore it does not fomally fit in the framework of the robust ucb strategy as described in Section 2. However, by a simple modification, one may define a strategy that incorporates such a restriction. In Figure 2 we describe a policy based on Catoni’s mean estimator. The policy assumes that there is a known upper bound vv for the largest variance of any reward distribution. Then by a simple modification of the proof of Proposition 1, we obtain the following performance bound.

Let v>0v>0. Assume that the reward distributions ν1,,νK\nu_{1},\ldots,\nu_{K} satisfy

Then the regret of the modified robust ucb policy satisfies

The regret bound has better numerical constants than its analogue based on the median-of-means estimator. However, a term of the order iΔilogn\sum_{i}\Delta_{i}\log n appears due to the restricted range of validity of Catoni’s estimator.

Discussion and conclusions

In this work we have extended the ucb algorithm to heavy-tailed stochastic multi-armed bandit problems in which the reward distributions have only moments of order 1+ε1+\varepsilon for some ε(0,1]\varepsilon\in(0,1]. In this setting, we have compared three estimators for the mean reward of the arms: median-of-means, truncated mean, and Catoni’s MM-estimator. The median-of-means estimator gives a regret bound that depends on the central (1+ε)(1+\varepsilon)-moments of the reward distributions, without need of knowing bounds on these moments. The truncated mean estimator, instead, delivers a regret bound that depends on the raw (1+ε)(1+\varepsilon)-moments, and requires the knowledge of a bound uu on these moments. Finally, Catoni’s estimator depends on the central moments like the median-of-means, but it requires the knowledge of a bound vv on the central moments, and only works in the special case ε=1\varepsilon=1 (where it gives the best leading constants on the regret). A trade-off in the choice of the estimator appears if we take into account the computational costs involved in the update of each estimator as new rewards are observed. Indeed, while the truncated mean requires constant time and space per update, the median-of-means is slightly more difficult to update, requiring O(logδ1)O(\log\delta^{-1}) space and O(loglogδ1)O(\log\log\delta^{-1}) time per update. Finally, Catoni’s MM-estimator requires linear space per update, which is an unfortunate feature in this sequential setting.

It is an interesting question whether there exists an estimator with the same good concentration properties as the median-of-means, but requiring only constant time and space per update. The truncated mean has good computational properties but the knowledge of raw moment bounds is required. So it is natural to ask whether we may drop this requirement for the truncated mean or some variants of it. Finally, our proof techniques heavily rely on the independence of rewards for each arm. It is unclear whether similar results could be obtained for heavy-tailed bandits with dependent reward processes.

While we focused our attention on bandit problems, the concentration results presented in this work may be naturally applied to other related sequential decision settings. Such examples include the racing algorithms of Maron and Moore (1997), and more generally nonparametric Monte Carlo estimation, see Dagum et al. (2000) and Domingo et al. (2002). These techniques are based on mean estimators, and current results are limited to the application of the empirical mean to bounded reward distributions.

Appendix A Empirical mean

In this appendix we discuss the behavior of the standard empirical mean when only moments of order 1+ε1+\varepsilon are available. We focus on finite-sample guarantees (i.e., non-asymptotic results), as this is the key property to obtain finite-time results for the multi-armed bandit problem.

Let μ^\widehat{\mu} be the empirical mean:

Then for any δ(0,1)\delta\in(0,1), with probability at least 1δ1-\delta, one has

The first term on the right-hand side can be bounded by using a union bound followed by Chebyshev’s inequality (for moments of order 1+ε1+\varepsilon):

By applying a trivial manipulation on the first term, and using Hölder’s inequality with exponents p=1+εp=1+\varepsilon and q=1+1/εq=1+1/\varepsilon for the second term, we obtain that the last expression above is upper bounded by

Note that if vnεη1+ε>1\frac{v}{n^{\varepsilon}\eta^{1+\varepsilon}}>1 then the bound is trivial, and thus we always have

The proof now follows by straightforward computations. \Box

We can restrict our attention to the case where η>nε1+ε\eta>n^{-\frac{\varepsilon}{1+\varepsilon}}, for otherwise the above upper bound is trivial. Now consider γ=12nη\gamma=\frac{1}{2n\eta}. Note that we have μ=γε=1(2nη)ε<η\mu=\gamma^{\varepsilon}=\frac{1}{(2n\eta)^{\varepsilon}}<\eta and in particular this implies 1/γ=2nη>n(η+μ)1/\gamma=2n\eta>n(\eta+\mu). From this last inequality and basic computations we get

which shows that (12) is tight up to a constant factor for this distribution.

Clearly, the concentration properties of the empirical mean are much weaker than for the truncated empirical mean or the median-of-means. Indeed, while the dependency on nn in the confidence term is similar for the three estimators, the dependency on 1/δ1/\delta is polynomial for the empirical mean and polylogarithmic for the truncated empirical mean and the median-of-means. As we just showed, this is not an artifact of the proof method, and the empirical mean indeed has polynomial deviations (as opposed to the exponential deviations of the two other estimators). This remark is at the basis of the theory of robust statistics and many approaches to fix the above issue have been proposed, see for example Huber (1964, 1981). The empirical mean estimator has been previously applied to heavy-tailed stochastic bandits in Liu and Zhao (2011) obtaining polynomial, rather than logarithmic, regret bounds.

References