The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond

Aurélien Garivier, Olivier Cappé

Introduction

The multi-armed bandit problem is a simple, archetypal setting of reinforcement learning, where an agent facing a slot machine with several arms tries to maximize her profit by a proper choice of arm draws. In the stochasticAnother interesting setting is the adversarial bandit problem, where the rewards are not stochastic but chosen by an opponent - this setting is not the subject of this paper. bandit problem, the agent sequentially chooses, for $t=1,2,\dots,n$ , an arm $A_{t}\in\{1,\dots,K\}$ , and receives a reward $X_{t}$ such that, conditionally on the arm choices $A_{1},A_{2},\dots$ , the rewards are independent and identically distributed, with expectation $\mu_{A_{1}},\mu_{A_{2}},\dots$ . Her policy is the (possibly randomized) decision rule that, to every past observations $(A_{1},X_{1},\dots,A_{t-1},X_{t-1})$ , associates her next choice $A_{t}$ . The best choice is any arm $a^{*}$ with maximal expected reward $\mu_{a^{*}}$ . The performance of her policy can be measured by the regret $R_{n}$ , defined as the difference between the rewards she accumulates up to the horizon $t=n$ , and the rewards that she would have accumulated during the same period, had she known from the beginning which arm had the highest expected reward.

The agent typically faces an “exploration versus exploitation dilemma” : at time $t$ , she can take advantage of the information she has gathered, by choosing the so-far best performing arm, but she has to consider the possibility that the other arms are actually under-rated and she must play sufficiently often all of them. Since the works of Gittins (1979) in the 1970s, this problem raised much interest and several variants, solutions and extensions have been proposed, see Even-Dar et al. (2002) and references therein.

Two families of bandit settings can be distinguished: in the first family, the distribution of $X_{t}$ given $A_{t}=a$ is assumed to belong to a family $\{p_{\theta},\theta\in\Theta_{a}\}$ of probability distributions. In a particular parametric framework, Lai and Robbins (1985) proved a lower-bound on the performance of any policy, and determined optimal policies. This framework was extended to multi-parameter models by Burnetas and Katehakis (1997) who showed that the number of draws up to time $n$ , $N_{a}(n)$ , of any sub-optimal arm $a$ is lower-bounded by

where $KL$ denotes the Kullback-Leibler divergence and $E(p_{\theta})$ is the expectation under $p_{\theta}$ ; hence, the regret is lower-bounded as follows:

Recently, Honda and Takemura (2010) proposed an algorithm called Deterministic Minimum Empirical Divergence (DMED) which they proved to be first order optimal. This algorithm, which maintains a list of arms that are close enough to the best one (and which thus must be played), is inspired by large deviations ideas and relies on the availability of the rate function associated to the reward distribution.

In the second family of bandit problems, the rewards are only assumed to be bounded (say, between and $1$ ), and policies rely directly on the estimates of the expected rewards for each arm. The KL-UCB algorithm considered in this paper is primarily meant to address this second, non-parametric, setting. We will nonetheless show that KL-UCB also matches the lower bound of Burnetas and Katehakis (1997) in the binary case and that it can be extended to a larger class of parametric bandit problems.

Among all the bandit policies that have been proposed, a particular family has raised a strong interest, after Gittins (1979) proved that (in the Bayesian setting he considered) optimal policies could be chosen in the following very special form: compute for each arm a dynamic allocation index (which only depends on the draws of this arm), and simply choose an arm with maximal index. These policies not only compute an estimate of the expected rewards, but rather an upper-confidence bound (UCB), and the agent’s choice is an arm with highest UCB. This approach is sometimes called “optimistic”, as the agent acts as if, at each instant, the expected rewards were equal to the highest possible values that are compatible with her past observations. Auer et al. (2002), following Agrawal (1995), proposed and analyzed two variants, UCB1 (usually called simply UCB in latter works) and UCB2, for which they provided regret bounds. UCB is an online, horizon-free procedure for which (Auer et al., 2002) proves that there exists a constant $C$ such that

The UCB2 variant relies on a parameter $\alpha$ that needs to be tuned, depending in particular on the horizon, and satisfies the tighter regret bound

where $\epsilon(\alpha)>0$ is a constant that can get arbitrary small when $\alpha$ is small, at the expense of an increased value of the constant $C(\alpha)$ . The constant $1/2$ in front of the factor $\log(n)/\left(\mu_{a^{*}}-\mu_{a}\right)$ cannot be improved. We show in Proposition 3.4, as a by-product of our analysis, that a correctly tuned UCB algorithm satisfies a similar bound. However, Auer et al. (2002) found in numerical experiments that UCB and UCB2 were outperformed by a third heuristic variant called UCB-Tuned, which includes estimates of the variance, but no theoretical guarantee was given. In a latter work, Audibert et al. (2009) proposed a related policy, called UCB-V, which uses an empirical version of the Bernstein bound to obtain refined upper confidence bounds. Recently, Audibert and Bubeck (2010) introduced an improved UCB algorithm, termed MOSS, which achieves the distribution-free optimal rate.

In this contribution, we first consider the stochastic, non-parametric, bounded bandit problem. We consider an online index policy, called KL-UCB (for Kullback-Leibler UCB), that requires no problem- or horizon-dependent tuning. This algorithm was recently advocated by Filippi (2010), together with a similar procedure for Markov Decision Processes (Filippi et al., 2010), and we learnt since our initial submission that an analysis of the Bernoulli case can also be found in Maillard et al. (2011), together with other results. We prove in Theorem 3.1 below that the regret of KL-UCB satisfies

where $d(p,q)=p\log(p/q)+(1-p)\log((1-p/(1-q))$ denotes the Kullback-Leibler divergence between Bernoulli distributions of parameters $p$ and $q$ , respectively. This comes as a consequence of Theorem 3.2, a non-asymptotic upper-bound on the number of draws of a sub-optimal arm $a$ : for all $\epsilon>0$ there exist $C_{1},C_{2}(\epsilon)$ and $\beta(\epsilon)$ such that

We insist on the fact that, despite the presence of divergence $d$ , this bound is not specific to the Bernoulli case and applies to all reward distributions bounded in $ $(and thus, by rescaling, to all bounded reward distributions). By Pinsker’s inequality,$ d(\mu_{a},\mu_{a^{*}})>2(\mu_{a}-\mu_{a^{*}})^{2}$, and thus KL-UCB has strictly better theoretical guarantees than UCB, while it has the same range of application. The improvement appears to be significant in simulations. Moreover, KL-UCB is the first index policy that reaches the lower-bound of Lai and Robbins (1985) for binary rewards; it does also achieve lower regret than UCB-V in that case. Hence, KL-UCB is both a general-purpose procedure for bounded bandits, and an optimal solution for the binary case.

Furthermore, it is easy to adapt KL-UCB to particular (possibly non-bounded) bandit settings, when the distribution of reward is known to belong to some family of probability laws. By simply changing the definition of the divergence $d$ , optimal algorithms may be built in a great variety of situations.

The proofs we give for these results are short and elementary. They rely on deviation results for bounded variables that are of independent interest : Lemma A.1 shows that Bernoulli variable are, in a sense, the “least concentrated” bounded variables with a given expectation (as is well-known for variance), and Theorem A.3 shows an efficient way to build confidence intervals for sums of bounded variables with an unknown number of summands.

In practice, numerical experiments confirm the significant advantage of KL-UCB over existing procedures; not only does this method outperform UCB, MOSS, UCB-V and even UCB-tuned in various scenarios, but it also compares well to DMED in the Bernoulli case, especially for small or moderate horizons.

The paper is organized as follows: in Section 2, we introduce some notation and present the KL-UCB algorithm. Section 3 contains the main results of the paper, namely the regret bound for KL-UCB and the optimality in the Bernoulli case. In Section 4, we show how to adapt the KL-UCB algorithm to address general families of reward distributions, and we provide finite-time regret bounds showing asymptotic optimality. Section 5 reports the results of extensive numerical experiments, showing the practical superiority of KL-UCB. Section 6 is devoted to an elementary proof of the main theorem. Finally, the Appendix gathers some deviation results that are useful in the proofs of our regret bounds, but which are also of independent interest.

The KL-UCB Algorithm

with, by convention, $0\log 0=0\log 0/0=0$ and $x\log x/0=+\infty$ for $x>0$ .

Algorithm 1 provides the pseudo-code for KL-UCB. On line 6, $c$ is a parameter that, in the regret bound stated below in Theorem 3.1 is chosen equal to $3$ ; in practice, however, we recommend to take $c=0$ for optimal performance. For each arm $a$ the upper-confidence bound

can be efficiently computed using Newton iterations, as for any $p\in$ the function $q\mapsto d(p,q)$ is strictly convex and increasing on the interval $[p,1]$ . In case of ties between severals arms, any maximizer can be chosen (for instance, at random). The KL-UCB elaborates on ideas suggested in Sections 3 and 4 of Lai and Robbins (1985).

Regret bounds and optimality

We first state the main result of this paper. It is a direct consequence of the non-asymptotic bound in Theorem 3.2 stated below.

Consider a bandit problem with $K$ arms and independent rewards bounded in $ $, and denote by$ a^{*} $an optimal arm. Choosing$ c=3$, the regret of the KL-UCB algorithm satisfies:

Consider a bandit problem with $K$ arms and independent rewards bounded in $ $. Let$ \epsilon>0 $, and take$ c=3 $in Algorithm 1. Let$ a^{*} $denote an arm with maximal expected reward$ \mu_{a^{*}} $, and let$ a $be an arm such that$ \mu_{a}<\mu_{a^{*}} $. For any positive integer$ n $, the number of times algorithm KL-UCB chooses arm$ a$ is upper-bounded by

where $C_{1}$ denotes a positive constant and where $C_{2}(\epsilon)$ and $\beta(\epsilon)$ denote positive functions of $\epsilon$ . Hence,

If the reward distributions are Bernoulli, the KL-UCB algorithm is asymptotically optimal, in the sense that the regret of KL-UCB matches the lower-bound proved by Lai and Robbins (1985) and generalized by Burnetas and Katehakis (1997):

The KL-UCB algorithm thus appears to be (asymptotically) optimal for Bernoulli rewards. However, Lemma A.1 shows that the Chernoff bounds obtained for Bernoulli variables actually apply to any variable with range $$. This is why KL-UCB is not only efficient in the binary case, but also for general bounded rewards.

As a by-product, we obtain an improved upper-bound for the regret of the UCB algorithm:

Consider the UCB algorithm tuned as follows: at step $t>K$ , the arm that maximizes the upper-bound $S[a]/N[a]+\sqrt{(\log(t)+c\log\log(t))/(2N[a])}$ is chosen. Then, for the choice $c=3$ , the number of draws of a sub-optimal arm $a$ is upper-bounded as:

This bound is “optimal”, in the sense that the constant $1/2$ in the logarithmic term cannot be improved. The proof of this proposition just mimics that of Section 6 (which concerns KL-UCB), using the quadratic divergence $d(p,q):=2(p-q)^{2}$ instead of the Kullback-Leibler divergence; it is thus omitted. In contrast, Pinsker’s inequality $d(\mu_{a},\mu_{a^{*}})\geq 2(\mu_{a}-\mu_{a^{*}})^{2}$ shows that KL-UCB dominates UCB, and we will see in the simulation study that the difference is significant, even for smaller values of the horizon.

At line $6$ , Algorithm 1 computes for each arm $a\in\{1,\dots,K\}$ the upper-confidence bound

The level of this confidence bound is parameterized by the exploration function $\log(t)+c\log(\log(t))$ , and the results of Theorems 3.1 and 3.2 are true as soon as $c\geq 3$ . However, similar results can be proved with an exploration function equal to $(1+\epsilon)\log(t)$ (instead of $\log(t)+c\log(\log(t))$ ) for every $\epsilon>0$ ; this is no surprise, as $(1+\epsilon)\log(t)\geq\log(t)+c\log(\log(t))$ when $t$ is large enough. But “large enough”, in that case, can be quite large : for $\epsilon=0.1$ , this holds true only for $t>2.10^{51}$ . This is why, in practice (and for the simulations presented in Section 5), we rather suggest to choose $c=0$ .

KL-UCB for parametric families of reward distributions

The KL-UCB algorithm makes no assumption on the distribution of the rewards, except that they are bounded. Actually, the definition of the divergence function $d$ in KL-UCB is dictated by the rate function of the Large Deviations Principle satisfied by Bernoulli random variables: the proof of Theorem A.3 relies on the control of the Fenchel-Legendre transform of the Bernoulli distribution. Thanks to Lemma A.1, this choice also makes sense for all bounded variables.

But the method presented here is not limited to the Bernoulli case: KL-UCB can very easily be adapted to other settings by choosing an appropriate divergence function $d$ . As an illustration, we will assume in this section that, for each arm $a$ , the distribution of rewards belongs to a canonical exponential family, i.e., that its density with respect to some reference measure can be written as $p_{\theta_{a}}(x)$ for some real parameter $\theta_{a}$ , with

For example, in the case of exponential rewards, one should choose $d(x,y)=x/y-1-\log(x/y)$ . Or, for Poisson rewards, the right choice is $d(x,y)=y-x+x\log(x/y)$ . Then, all the results stated above apply (as the proofs do not involve the particular form of the function $d$ ), and in particular :

In order to prove that, for those families of rewards, this version of the KL-UCB algorithm matches the bound of Lai and Robbins (1985) , it remains only to show that $d(x,y)=KL(p_{\mu^{-1}(x)},p_{\mu^{-1}(y)})$ . This is the object of Lemma 4.1. Generalizations to other families of reward distributions (possibly different from arm to arm) are conceivable, but require more technical, topological discussions, as in Burnetas and Katehakis (1997) and Honda and Takemura (2010).

To conclude, observe that it is not required to work with the divergence function $d$ corresponding exactly to the family of reward distributions: using an upper-bound instead often leads to more simple and versatile policies at the price of only a slight loss in performance. This is illustrated in the third scenario of the simulation study presented in Section 5, but also by Theorems 3.1 and 3.2 for bounded variables.

Let $(\beta,\theta)$ be two real numbers, let $p_{\beta}$ and $p_{\theta}$ be two probability densities of the canonical exponential family defined in (5), and let $X$ have density $p_{\theta}$ . Then Kullback-Leibler divergence $KL(p_{\beta},p_{\theta})$ is equal to $d(\mu(\beta),\mu(\theta))$ . More precisely,

is reached for $\lambda=\lambda^{*}$ such that $x=\dot{b}(\theta+\lambda^{*})=\mu(\theta+\lambda^{*})$ . Thus, if $x=\mu(\beta)$ , the fact that $\mu$ is one-to-one implies that $\theta+\lambda^{*}=\beta$ and thus that:

Numerical experiments and comparisons of the policies

Simulations studies require particular attention in the case of bandit algorithms. As pointed out by Audibert et al. (2009), for a fixed horizon $n$ the distribution of the regret is very poorly concentrated around its expectation. This can be explained as follows: most of the time, the estimates of all arms remain correctly ordered for almost all instants $t=1,\dots,n$ and the regret is of order $\log(n)$ . But sometimes, at the beginning, the best arm is under-estimated while one of the sub-optimal arms is over-estimated, so that the agent keeps playing the latter; and as she neglects the best arm, she has hardly an occasion to realize her mistake, and the error perpetuates for a very long time. This happens with a small, but not negligible probability, because the regret is very important (of order $n$ ) on these occasions. Bandit algorithms are typically designed to control the probability of such adverse events but usually at a rate which only decays slightly faster than $1/n$ , which results in very skewed regret distributions with slowly vanishing upper tails.

We first consider the basic two arm scenario with Bernoulli rewards of expectations $\mu_{1}=0.9$ and $\mu_{2}=0.8$ , respectively. The left panel of Figure 1 shows the average number of draws of the suboptimal arm as a function of time (on a logarithmic scale) for KL-UCB compared to five benchmark algorithms (UCB, MOSS, UCB-Tuned, UCB-V and DMED). The right panel of Figure 1 shows the empirical distributions of suboptimal draws, represented as box-and-whiskers plots, at a particular time ( $t=5,000$ ) for all six algorithms. These plot are obtained from $N=50,000$ independent runs of the algorithms and the right panel of Figure 1 clearly highlight the tail effect mentioned above. On this very simple example, we observed that results obtained from $N=1,000$ or less simulations were not reliable, typically resulting in a significant over-estimationIncidentally, Theorem A.3 could be used to construct sharp confidence bounds for the regret. of the performance of “risky” algorithms, in particular of UCB-Tuned. More generally, results obtained in configurations where $N$ is much smaller than $n$ are likely to be unreliable. For this reason, we limit our investigations to a final instant of $n=20,000$ . Note however that the average number of suboptimal draws of most algorithms at $n=20,000$ is only of the order of $300$ , showing that there is no point in considering larger horizons for such a simple problem.

MOSS, UCB-Tuned and UCB-V are run exactly as described by Audibert and Bubeck (2010), Auer et al. (2002) and Audibert et al. (2009), respectively. For UCB, we use an upper confidence bound $S[a]/N[a]+\sqrt{\log(t)/(2N[a])}$ , as in Proposition 3.4, again with $c=0$ . Note that in our two arm scenario, $\{2(\mu_{1}-\mu_{2})^{2}\}^{-1}=50$ while $d^{-1}(\mu_{2},\mu_{1})=22.5$ . Hence, the performance of DMED and KL-UCB should be about two times better than that of UCB. The left panel of Figure 1 does show the expected behavior but with a difference of lesser magnitude. Indeed, one can observe that the bound $d^{-1}(\mu_{2},\mu_{1})\log(n)$ (shown in dashed line) is quite pessimistic for the values of the horizon $n$ considered here as the actual performance of KL-UCB is significantly below the bound. For DMED, we follow Honda and Takemura (2010) but using

as the criterion to decide whether arm $a$ should be included in the list of arms to be played. This criterion is clearly related to the decision rule used by KL-UCB when $c=0$ (see line 6 of Algorithm 1) except for the fact that in KL-UCB the estimate $S[a]/N[a]$ is not compared to that of the current best arm $\max_{b}S[b]/N[b]$ but to the corresponding upper confidence bound. As a consequence, any arm that is not included in the list of arms to be played by DMED would not be played by KL-UCB either (assuming that both algorithms share a common history). As one can observe on the left panel of Figure 1, this results in a degraded performance for DMED. We also observed this effect on UCB, for instance, and it seems that index algorithms are generally preferable to their “arm elimination” variant.

The original proposal of Honda and Takemura (2010) consists in using in the exploration function a factor $\log(t/N[a])$ instead of $\log(t)$ , as in the MOSS algorithm. As will be seen below on Figure 2, this variant (which we refer to as DMED+) indeed outperforms DMED. But our previous conjecture appears to hold also in this case as the heuristic variant of KL-UCB (termed KL-UCB+) in which $\log(t)$ in line 6 of Algorithm 1 is replaced by $\log(t/N[a])$ remains preferable to DMED+.

As final comments on Figure 1, first note that UCB-Tuned performs as expected —though slightly worse than KL-UCB— but is a very risky algorithm: the right panel of Figure 1 casts some doubts on the fact that the tails of $N_{a}(n)$ are indeed controlled uniformly in $n$ for UCB-Tuned. Second, the performance of UCB-V is somewhat disappointing. Indeed, the upper-confidence bounds of UCB-V differ from those of UCB-Tuned simply by the non-asymptotic correction term $3\log(t)/N[a]$ required by Bennett’s and Bernstein’s inequalities (Audibert et al., 2009). This correction term appears to have a significant impact on moderate time horizons: for a sub-optimal arm $a$ , the number of draws $N[a]$ does not grow faster than the $\log(t)$ exploration function, and $\log(t)/N[a]$ does not vanish.

2 Scenario 2: low rewards

In Figure 2 we consider a significantly more difficult scenario, again with Bernoulli rewards, inspired by a situation (frequent in applications like marketing or Internet advertising) where the mean reward of each arm is very low. In this scenario, there are ten arms: the optimal arm has expected reward 0.1, and the nine suboptimal arms consist of three different groups of three (stochastically) identical arms each with expected rewards 0.05, 0.02 and 0.01, respectively. We again used $N=50,000$ simulations to obtain the regret plots of Figure 2. These plots show, for each algorithm, the average cumulated regret together with quantiles of the cumulated regret distribution as a function of time (again on a logarithmic scale).

In this scenario, the difference is more pronounced between UCB and KL-UCB. The performance gain of UCB-Tuned is also much less significant. KL-UCB and DMED reach a performance that is on par with the lower bound of Burnetas and Katehakis (1997) in (2), although the performance of KL-UCB is here again significantly better. Using KL-UCB+ and DMED+ results in significant mean improvements, although there are hints that those algorithms might indeed be too risky with occasional very large deviations from the mean regret curve.

The Clopper-Pearson Upper-Confidence Bound algorithm (CP-UCB) differs from KL-UCB only in the way the upper-confidence bound on the performance of each arm is computed, replacing line 6 of Algorithm 1 by

As the Clopper-Wilson confidence intervals are always sharper than the Kullback-Leibler intervals, one can very easily adapt the proof of Section 6 to show that the regret bounds proved for the KL-UCB algorithm also hold for CP-UCB in the case of Bernoulli rewards. However, the improvement over KL-UCB is very limited (often, the two algorithms actually take exactly the same decisions). In terms of results, one can observe on Figure 2 that CP-UCB only achieves a performance that is marginally better than that of KL-UCB. Besides, there is no guarantee that the CP-UCB algorithm is also efficient on arbitrary bounded distributions.

3 Scenario 3: bounded exponential rewards

In the third example, there are $5$ arms: the rewards are exponential variables, with parameters $1/5$ , $1/4$ , $1/3$ , $1/2$ and $1$ respectively, truncated at $x_{\max}=10$ (thus, they are bounded in $). The interest of this scenario is twofold: first, it shows the performance of KL-UCB for non-binary, non-discrete, non$ -valued rewards. Second, it illustrates that, as explained in Section 4, specific variants of the KL-UCB algorithm can reach an even better performance.

In this scenario, UCB and MOSS, but also KL-UCB are clearly sub-optimal. UCB-Tuned and UCB-V, by taking into account the variance of the reward distributions (much smaller than the variance of a $\{0,10\}$ -valued distribution with the same expectation), were expected to perform significantly better. For the reasons mentionned above this is not the case for UCB-V on a time horizon $n=20,000$ . Yet, UCB-Tuned is spectacularly more efficient, and is only caught up by KL-UCB-exp, the variant of KL-UCB designed for exponential rewards. Actually, the KL-UCB-exp algorithm ignores the fact that the rewards are truncated, and uses the divergence $d(x,y)=x/y-1-\log(x/y)$ prescribed for genuine exponential distributions. One can easily show that this choice leads to slightly too large upper confidence bounds. Yet, the performance is still excellent, stable, and the algorithm is particularly simple.

Proof of Theorem 3.2

Consider a positive integer $n$ , a small $\epsilon>0$ , an optimal arm $a^{*}$ and a sub-optimal arm $a$ such that $\mu_{a}<\mu_{a^{*}}$ . Without loss of generality, we will assume that $a^{*}=1$ . For any arm $b$ , the past average performance of arm $b$ is denoted by $\hat{\mu}_{b}(t)=S_{b}(t)/N_{b}(t)$ ; by convenience, for every positive integer $s$ we will also denote $\hat{\mu}_{b,s}=\left(X_{b,1}+\dots+X_{b,s}\right)/s$ , so that $\hat{\mu}_{t}(b)=\hat{\mu}_{b,N_{b}(t)}$ . KL-UCB relies on the following upper-confidence bound for $\mu_{b}$ :

where the last inequality is a consequence of Lemma 6.1. The first summand is upper-bounded as follows: by Theorem A.3 (proved in the Appendix),

for some positive constant $C^{\prime}_{1}$ ( $C^{\prime}_{1}\leq 7$ is sufficient). For the second summand, let

according to Lemma 6.3. This will conclude the proof, provided that we prove the following two lemmas.

Observe that $A_{t}=a$ and $\mu_{1}\leq u_{1}(t)$ together imply that $u_{a}(t)\geq u_{1}(t)\geq\mu_{1}$ , and hence that

For each $\epsilon>0$ , there exist $C_{2}(\epsilon)>0$ and $\beta(\epsilon)>0$ such that

If $d^{+}(\hat{\mu}_{a,s},\mu_{1})<d(\mu_{a},\mu_{1})/(1+\epsilon)\;$ , then $\hat{\mu}_{a,s}>r(\epsilon)$ , where $r(\epsilon)\in]\mu_{a},\mu_{1}[$ is such that $d(r(\epsilon),\mu_{1})=d(\mu_{a},\mu_{1})/(1+\epsilon)$ . Hence,

with $C_{2}(\epsilon)=(1-\exp(-d(r(\epsilon),\mu_{a})))^{-1}$ and $\beta(\epsilon)=(1+\epsilon)d(r(\epsilon),\mu_{a})/d(\mu_{a},\mu_{1})$ . Easy computations show that $r(\epsilon)=\mu_{a}+O(\epsilon)$ , so that $C_{2}(\epsilon)=O(\epsilon^{-2})$ and $\beta(\epsilon)=O(\epsilon^{2})$ .

Conclusion

The self-normalized deviation bound of Theorems A.3 and A.5, together with the new analysis presented in Section 6, allowed us to design and analyze improved UCB algorithms. In this approach, only an upper-bound of the deviations (more precisely, of the exponential moments) of the rewards is required, which makes it possible to obtain versatile policies satisfying interesting regret bounds for large classes of reward distributions. The resulting index policies are simple, fast, and very efficient in practice, even for small time horizons.

References

Appendix A Kullback-Leibler deviations for bounded variables with a random number of summands

We start with a simple lemma justifying the focus on binary rewards.

$\left(W^{\lambda}_{t}\right)_{t\geq 0}$ is a super-martingale relative to $\left(\mathcal{F}_{t}\right)_{t\geq 0}$ . In fact,

But as $\eta=1/(\delta-1)$ , $D=\left\lceil\frac{\log n}{\log\left(1+1/(\delta-1)\right)}\right\rceil$ and as $\log(1+1/(\delta-1))\geq 1/\delta$ , we obtain:

Of course, a symmetric bound for the probability of over-estimating $\mu$ can be derived following the same lines. Together, they show that for all $\delta>0$ :

Finally, we state a more general deviation bound for arbitrary reward distributions with finite exponential moments. The proof (very similar to that of Theorem A.3) is omitted.

Let $\mathcal{F}_{t}$ be an increasing sequence of $\sigma$ -fields of $\mathcal{F}$ such that for each $t$ , $\sigma(X_{1}\ldots,X_{t})\subset\mathcal{F}_{t}$ and for $s>t$ , $X_{s}$ is independent from $\mathcal{F}_{t}$ . Consider a previsible sequence $(\epsilon_{t})_{t\geq 1}$ of Bernoulli variables (for all $t>0$ , $\epsilon_{t}$ is $\mathcal{F}_{t-1}$ -measurable). Let $\delta>0$ and for every $t\in\{1,\dots,n\}$ let