Thompson Sampling for 1-Dimensional Exponential Family Bandits

Nathaniel Korda, Emilie Kaufmann, Remi Munos

Introduction

An algorithm, ϕ\phi, for a KK-armed bandit problem is a (possibly randomised) method for choosing which distribution to sample from next, given a history of previous arm choices and obtained rewards, Ht1:=((as,xs))s=1t1\mathcal{H}_{t-1}:=((a_{s},x_{s}))_{s=1}^{t-1}: each reward xsx_{s} is drawn from the distribution pasp_{a_{s}}. We denote by ϕt\phi_{t} the distribution over {1,,K}\{1,\dots,K\} induced by the history Ht1\mathcal{H}_{t-1}: at time tt the agent using ϕ\phi picks arm aa with probability ϕt(a)\phi_{t}(a). The agent’s goal is to design an algorithm with low regret:

This quantity measures the expected performance of algorithm ϕ\phi compared to the expected performance of an optimal algorithm given knowledge of the reward distributions, i.e. sampling always from the distribution with the highest expectation.

Since the early 2000s the “optimisim in the face of uncertainty” heuristic has been a popular approach to this problem, providing both simplicity of implementation and finite-time upper bound on the regret (e.g. ). However in the last two years there has been renewed interest in the Thompson Sampling heuristic (TS). While this heuristic was first put forward to solve bandit problems eighty years ago in , it was not until recently that theoretical analyses of its performance were achieved . In this paper we take a major step towards generalising these analyses to the same level of generality already achieved for “optimistic” algorithms.

Unlike optimistic algorithm which are often based on confidence intervals, the Thompson Sampling algorithm ϕTS,π0\phi^{TS,\pi_{0}} uses Bayesian tools and puts a prior distribution πa,0=π0\pi_{a,0}=\pi_{0} on each θa\theta_{a}. A posterior distribution, πa,t\pi_{a,t}, is then maintained according to the rewards observed in Ht1\mathcal{H}_{t-1}. At each time a sample θa,t\theta_{a,t} is drawn from each posterior πa,t\pi_{a,t} and then the algorithm chooses to sample at=argmaxa{1,,K}{μ(θa,t)}a_{t}=\arg\max_{a\in\{1,\dots,K\}}\{\mu(\theta_{a,t})\}. Therefore ϕtTS,π0(a)\phi^{TS,\pi_{0}}_{t}(a) is the posterior probability that a=aa=a^{*} given the history Ht1\mathcal{H}_{t-1}.

Our Contributions

TS has proved to have impressive empirical performances, very close to those of state of the art algorithms such as DMED and KL-UCB . Furthermore recent works have shown that in the special case where each pap_{a} is a Bernoulli distribution B(θa)\mathcal{B}(\theta_{a}), TS using a uniform prior over the arms is asymptotically optimal in the sense that it achieves the asymptotic lower bound on the regret provided by Lai and Robbins in (that holds for univariate parametric bandits). In this paper, we show this optimality property also holds for 11-dimensional exponential families if the algorithm uses the Jeffrey’s prior:

Suppose that the rewards distributions belong to a 11-dimensional canonical exponential family and that πJ\pi_{J} is the Jeffrey’s prior. Then,

where K(θ,θ):=KL(pθ,pθ)\text{K}(\theta,\theta^{\prime}):=\text{KL}(p_{\theta},p_{\theta}^{\prime}) is the Kullback-Leibler divergence between pθp_{\theta} and pθp_{\theta}^{\prime}.

This theorem follows directly from Theorem 2. In the proof of this result we provide in Theorem 4 a finite-time, exponential concentration bound for posterior distributions of exponential family random variables, something that to the best of our knowledge is new to the literature and of interest in its own right. Our proof also exploits the explicit connection between the Jeffreys prior, Fisher information and the Kullback-Leibler divergence in exponential families.

Related Work

Another line of recent work has focused on distribution-independent bounds for Thompson Sampling. establishes that R(ϕTS,πU,T)=O(KTln(T))\mathcal{R}(\phi^{TS,\pi_{U}},T)=O(\sqrt{KT\ln(T)}) for Thompson Sampling for bounded rewards (with the classic uniform prior on the underlying Bernoulli parameter). go beyond the Bernoulli model, and give an upper bound on the Bayes risk (i.e. the regret averaged over the prior) independent of the prior distribution. For the parametric multi-armed bandit with KK arms described above, their result states that the regret of Thompson Sampling using a prior π0\pi_{0} is not too big when averaged over this same prior:

Building on the same ideas, have improved this upper bound to 14KT14\sqrt{KT}. In our paper, we rather see the prior used by Thompson Sampling as a tool, and we want therefore to obtain garantees for any given problem parametrized by θ\theta.

also use Thompson Sampling in more general models, like the linear bandit model. Their result is a bound on the Bayes risk that does not depend on the prior, whereas Agrawal and Goyal give in a first regret bound for this model. Linear bandits consider a possibly infinite number of arms whose mean rewards are linearly related by a single, unknown coefficient vector. Once again, the analysis in encounters the problem of describing the concentration of posterior distributions. However by using a conjugate normal prior, they can employ explicit the concentration bounds available for Normal distributions to complete their argument.

Paper Structure

In Section 2 we describe important features of the one-dimensional canonical exponential families we consider, including closed-form expression for KL-divergences and the Jeffrey’s prior. Section 3 gives statements of the main results, and provides the proof of the regret bound. Section 4 proves the posterior concentration result used in the proof of the regret bound.

Exponential Families and Jeffreys Priors

A distribution is said to belong to a one-dimensional canonical exponential family if it has a density with respect to some reference measure ν\nu of the form:

showing in particular that FF is strictly convex. The mean function μ\mu is differentiable and stricly increasing, since we can show that

In particular, this shows that μ\mu is one-to-one in θ\theta.

In an exponential family, a direct computation show that the Kullback-Leibler divergence can be expressed as a Bregman divergence of the normalisation function, F:

Jeffreys prior in Exponential Families

In the Bayesian literature, a special “non-informative” prior, one which is invariant under re-parametrisation of the parameter space, is sometimes considered. It is called the Jeffrey’s prior, and it can be shown to be proportional to the square-root of the Fisher information I(θ)I(\theta). In the special case of the canonical exponential family, the Fisher information takes the form I(θ)=F(θ)I(\theta)=F^{\prime\prime}(\theta), hence the Jeffrey’s prior for the model (2) is

Under the Jeffrey’s prior, the posterior on θ\theta after nn observations is given by

When ΘF(θ)dθ<+\int_{\Theta}\sqrt{F^{\prime\prime}(\theta)}d\theta<+\infty, the prior is called proper. However, stasticians often use priors which are not proper: the prior is called improper if ΘF(θ)dθ=+\int_{\Theta}\sqrt{F^{\prime\prime}(\theta)}d\theta=+\infty and any observation makes the corresponding posterior (4) integrable.

Some Intuition for choosing the Jeffreys Prior

In the proof of our concentration result for posterior distributions (Theorem 4) it will be crucial to lower bound the prior probability of an ϵ\epsilon-sized KL-divergence ball around each of the parameters θa\theta_{a}. Since the Fisher information F(θ)=limθθK(θ,θ)/θθ2F^{\prime\prime}(\theta)=\lim_{\theta^{\prime}\rightarrow\theta}K(\theta,\theta^{\prime})/|\theta-\theta^{\prime}|^{2}, choosing a prior proportional to F(θ)F^{\prime\prime}(\theta) ensures that the prior measure of such balls are Ω(ϵ)\Omega(\sqrt{\epsilon}).

Examples and Pseudocode

Algorithm 1 presents pseudocode for Thompson Sampling with the Jeffreys prior for distributions parametrized by their natural parameter θ\theta. But as the Jeffreys prior is invariant under reparametrization, if a distribution is parametrised by some parameter λ≢θ\lambda\not\equiv\theta, the algorithm can use the Jeffrey’s prior I(λ)\propto\sqrt{I(\lambda)} on λ\lambda, drawing samples from the posterior on λ\lambda. Note that the posterior sampling step (in bold) is always tractable using, for example, a Hastings-Metropolis algorithm.

Some examples of common exponential family models are given in Figure 1, together with the posterior distributions on the parameter λ\lambda that is used by TS with Jeffreys prior. In addition to examples already studied in for which T(x)=xT(x)=x, we also give two examples of more general canonical exponential families, namely the Pareto distribution with known min value and unknown tail index λ\lambda, Pareto(xm,λ)\text{Pareto}(x_{m},\lambda), for which T(x)=log(x)T(x)=\log(x), and the Weibul distribution with known shape and unknown rate parameter, Weibull(k,λ)\text{Weibull}(k,\lambda), for which T(x)=xkT(x)=x^{k}. These last two distributions are not covered even by the work in , and belong to the family of heavy-tailed distributions.

For the Bernoulli model, one note futher that the use of the Jeffreys prior is not covered by the previous analyses. These analyses make an extensive use of the uniform prior, through the fact that the coefficient of the Beta posteriors they consider have to be integers.

Results and Proof of Regret Bound

An exponential family KK-armed bandit is a KK-armed bandit for which the reward distributions pap_{a} are known to be elements of an exponential family of distributions P(Θ)\mathcal{P}(\Theta). We denote by pθap_{\theta_{a}} the distribution of arm aa and its mean by μa=μ(θa)\mu_{a}=\mu(\theta_{a}).

Assume that μ1>μa\mu_{1}>\mu_{a} for all a1a\neq 1, and that πa,0\pi_{a,0} is taken to be the Jeffrey’s prior over Θ\Theta. Then for every ϵ>0\epsilon>0 there exists a constant C(ϵ,P)\mathcal{C}(\epsilon,\mathcal{P}) depending on ϵ\epsilon and on the problem P\mathcal{P} such that the regret of Thompson Sampling using the Jeffrey’s prior satisfies

We give here the main argument of the proof of the regret bound, which proceed by bounding the expected number of draws of any suboptimal arm. Along the way we shall state concentration results whose proofs are postponed to later sections.

Step 0: Notation

We denote by ya,sy_{a,s} the ss-th observation of arm aa and by Na,tN_{a,t} the number of times arm aa is chosen up to time tt. (ya,s)s1(y_{a,s})_{s\geq 1} is i.i.d. with distribution pθap_{\theta_{a}}. Let Yau:=(ya,s)1suY_{a}^{u}:=(y_{a,s})_{1\leq s\leq u} be the vector of first uu observations from arm aa. Ya,t:=YaNa,tY_{a,t}:=Y_{a}^{N_{a,t}} is therefore the vector of observations from arm aa available at the beginning of round tt. Recall that πa,t\pi_{a,t}, respectively πa,0\pi_{a,0}, is the posterior, respectively the prior, on θa\theta_{a} at round tt of the algorithm.

For all a1a\neq 1 and Δa\Delta_{a} such that μa<μa+Δa<μ1\mu_{a}<\mu_{a}+\Delta_{a}<\mu_{1}, we introduce

Step 1: Concentration Results

We state here the two concentration results that are necessary to evaluate the probability of the above events.

Let (ys)(y_{s}) be an i.i.d sequence of distribution p(θ)p(\cdot\mid\theta) and δ>0\delta>0. Then

The two following inequalities that will be useful in the sequel can easily be deduced from Lemma 3. Their proof is gathered in Appendix A with that of Lemma 3. For any arm aa,

The second result tells us that concentration of the empirical sufficient statistic around its mean implies concentration of the posterior distribution around the true parameter:

Let πa,0\pi_{a,0} be the Jeffreys’ prior. There exists constants C1,a=C1(F,θa)>0C_{1,a}=C_{1}(F,\theta_{a})>0, C2,a=C2(F,θa,Δa)>0C_{2,a}=C_{2}(F,\theta_{a},\Delta_{a})>0, and N(θa,F)N(\theta_{a},F) s.t., Na,tN(θa,F)\forall N_{a,t}\geq N(\theta_{a},F),

whenever δa<1\delta_{a}<1 and Δa\Delta_{a} are such that 1δaC2,a(Δa)>01-\delta_{a}C_{2,a}(\Delta_{a})>0.

Step 2: Lower Bound the Number of Optimal Arm Plays with High Probability

The main difficulty adressed in previous regret analyses for Thompson Sampling is the control of the number of draws of the optimal arm. We provide this control in the form of Proposition 5 which is adapted from Proposition 1 in whose proof, an outline of which is given in Appendix D, explores in depth the randomised nature of Thompson Sampling. In particular, we show that the proof in can be significantly simplified, but at the expense of no longer being able to describe the constant CbC_{b} explicitly:

For any b(0,1)b\in(0,1) there exists a constant Cb(π,μ1,μ2,K)<C_{b}(\pi,\mu_{1},\mu_{2},K)<\infty such that

Step 3: Decomposition

The idea in this step is to decompose the probability of playing a suboptimal arm into principle and negligible components and control these components with the results from Steps 1 and 2:

The terms (B) and (C) are about concentration of the posterior on the suboptimal arm. An upper bound on term (C) is given in (6), whereas a bound on term (B) follows from Lemma 6 below. Although the proof of this lemma is standard, and bears a strong similarity to Lemma 3 of , we provide it in Appendix C for the sake of completeness.

For all actions aa and for all ϵ>0\epsilon>0, \exists Nϵ=Nϵ(δa,Δa,θa)>0N_{\epsilon}=N_{\epsilon}(\delta_{a},\Delta_{a},\theta_{a})>0 such that

where Nϵ=Nϵ(δa,Δa,θa)N_{\epsilon}=N_{\epsilon}(\delta_{a},\Delta_{a},\theta_{a}) is the smallest integer such that for all nNϵn\geq N_{\epsilon}

and N(θa,F)N(\theta_{a},F) is the constant from Theorem 4.

When we have seen enough observations on the optimal arm, term (A) also becomes a result about the concentration of the posterior, but this time for the optimal arm:

For any Δa>0\Delta_{a}^{\prime}>0, one can choose δ1\delta_{1} such that 1δ1C1,1>01-\delta_{1}C_{1,1}>0. Then, with N=N(P)N=N(\mathcal{P}) such that the function

is decreasing for uNu\geq N, (B)(B^{\prime}) is bounded by

So far, we have shown that for any ϵ>0\epsilon>0 and for any choice of δa>0\delta_{a}>0 and 0<Δa<μ1μa0<\Delta_{a}<\mu_{1}-\mu_{a} such that 1δaC2,a>01-\delta_{a}C_{2,a}>0, there exists a constant C(δa,Δa,ϵ,P)\mathcal{C}(\delta_{a},\Delta_{a},\epsilon,\mathcal{P}) such that

The constant is of course increasing (dramatically) when δa\delta_{a} goes to zero, Δa\Delta_{a} to μ1μa\mu_{1}-\mu_{a}, or ϵ\epsilon to zero. But one can choose Δa\Delta_{a} close enough to μ1μa\mu_{1}-\mu_{a} and δa\delta_{a} small enough, such that

Posterior Concentration: Proof of Theorem 4

The argument rests on the following Lemma, whose proof can be found in Appendix B

Step 2: Upper bounding the numerator of (10)

We first note that on Θθ,Δ\Theta_{\theta,\Delta} the leading term in the exponential is K(θ,θ)K(\theta,\theta^{\prime}). Indeed, from (3) we know that

which, by strict convexity of FF, is strictly increasing in θθ|\theta-\theta^{\prime}| for any fixed θ\theta. Now since μ\mu is one-to-one and continuous, Θθ,Δc\Theta_{\theta,\Delta}^{c} is an interval whose interior contains θ\theta, and hence, on Θθ,Δ\Theta_{\theta,\Delta},

So for δ\delta such that 1δC2>01-\delta C_{2}>0 we can bound the numerator of (10) by:

where we have used that π(ys)\pi(\cdot|y_{s^{\prime}}) is a probability distribution, and that, since μ\mu is increasing, K(θ,μ1(μ+Δ))=infθΘθ,ΔK(θ,θ)\text{K}(\theta,\mu^{-1}(\mu+\Delta))=\inf_{\theta^{\prime}\in\Theta_{\theta,\Delta}}K(\theta,\theta^{\prime}).

Step 3: Lower bounding the denominator of (10)

To lower bound the denominator, we reduce the integral on the whole space Θ\Theta to a KL-ball, and use the structure of the prior to lower bound the measure of that KL-ball under the posterior obtained with the well-chosen observation ysy_{s^{\prime}}. We introduce the following notation for KL balls: for any xΘ, ϵ>0x\in\Theta,\ \epsilon>0, we define

We have K(θ,θ)(θθ)2F(θ)0\frac{K(\theta,\theta^{\prime})}{(\theta-\theta^{\prime})^{2}}\rightarrow F^{\prime\prime}(\theta)\neq 0 (since FF is strictly convex). Therefore, there exists N1(θ,F)N_{1}(\theta,F) such that for uN1(θ,F)u\geq N_{1}(\theta,F), on B1u2(θ)B_{\frac{1}{u^{2}}}(\theta),

Using this inequality we can then bound the denominator of (10) whenever uN1(θ,F)u\geq N_{1}(\theta,F) and δ<1\delta<1:

Finally we turn our attention to the quantity

Now since the KL divergence is convex in the second argument, we can write B1/u(θ)=(a,b)B_{1/u}(\theta)=(a,b). So, from the convexity of FF we deduce that

As p(yθ)0p(y\mid\theta)\rightarrow 0 as y±y\rightarrow\pm\infty, the set C(θ)={y:p(yθ)L(θ)}\mathcal{C}(\theta)=\{y:p(y\mid\theta)\geq L(\theta)\} is compact. The map yΘp(yθ)F(θ)dθ<y\mapsto\int_{\Theta}p(y|\theta^{\prime})\sqrt{F^{\prime\prime}(\theta^{\prime})}d\theta^{\prime}<\infty is continuous on the compact C(θ)\mathcal{C}(\theta). Thus, it follows that

is an upper bound on the denominator of (13).

Now by the continuity of FF^{\prime\prime}, and the continuity of (y,θ)p(yθ)(y,\theta)\mapsto p(y|\theta) in both coordinates, there exists an N2(θ,F)N_{2}(\theta,F) such that for all uN2(θ,F)u\geq N_{2}(\theta,F)

Finally, for uN2(θ,F)u\geq N_{2}(\theta,F), we have a lower bound on the numerator of (13):

Puting everything together, we get

that there exist constants C2=C2(F,θ,Δ)C_{2}=C_{2}(F,\theta,\Delta) and N(θ,F)=max{N1,N2}N(\theta,F)=\max\{N_{1},N_{2}\} such that for every δ<1\delta<1 satisfying 1δC2>01-\delta C_{2}>0, and for every uNu\geq N, one has

Note that when the prior is proper we do not need to introduce the observation ysy_{s^{\prime}}, which significantly simplifies the argument. Indeed in this case, in (11) we can use π0\pi_{0} in place of π(ys)\pi(\cdot|y_{s^{\prime}}) which is already a probability distribution. In particular, the quantity (13) is replaced by π0(B1/u2(θ))\pi_{0}\left(B_{1/u^{2}}(\theta)\right), and so the constants LL and LL^{\prime} are not needed.

Conclusions

We have shown that choosing to use the Jeffrey’s prior in Thompson Sampling leads to an asymptotically optimal algorithm for bandit models whose rewards belong to a 11-dimensional canonical exponential family. The cornerstone of our proof is a finite time concentration bound for posterior distributions in exponential families, which, to the best of our knowledge, is new to the literature. With this result we built on previous analyses and avoided Bernoulli-specific arguments. Thompson Sampling with Jeffreys prior is now a provably competitive alternative to KL-UCB for exponential family bandits. Moreover our proof holds for slightly more general problems than those for which KL-UCB is provably optimal, including some heavy-tailed exponential family bandits.

Our arguments are potentially generalisable. Notably generalising to nn-dimensional exponential family bandits requires only generalising Lemma 3 and Step 3 in the proof of Theorem 4. Our result is asymptotic, but the only stage where the constants are not explicitly derivable from knowledge of FF, TT, and θ0\theta_{0} is in Lemma 9. Future work will investigate these open problems. Another possible future direction lies the optimal choice of prior distribution. Our theoretical guarantees only hold for Jeffreys prior, but a careful examination of our proof shows that the important property is to have, for every θa\theta_{a},

which could hold for prior distributions other than the Jeffreys prior.

References

Appendix A Concentration of the Sufficient Statistics: Proof of Lemma 3, and Inequalities (6) and (7)

The proof of Lemma 3 follows from the classical Cramér-Chenoff technique (see ). For any λ>0\lambda>0.

where we have used the Markov inequality, and where

Now we optimize in λ\lambda by choosing λ>0\lambda>0 that maximizes

f(λ)f(\lambda) is differentiable in λ\lambda and its minimum, λ\lambda^{*}, satisfies f(λ)=0f^{\prime}(\lambda^{*})=0 i.e.

(Note that λ>0\lambda^{*}>0 since FF^{\prime} is increasing). Finally, we get

The same reasoning leads to the upper bound

where ν\nu^{*} is such that F(θν)=F(θ)δF^{\prime}(\theta-\nu^{*})=F^{\prime}(\theta)-\delta. ∎

which gives inequality (6). To proof (7), we write:

Appendix B Extracting the KL-divergence: Proof of Lemma 7

where π(θys)\pi(\theta|y_{s^{\prime}}) denotes the posterior distribution on θ\theta after observation ysy_{s^{\prime}} and

denotes the empirical KL-divergence obtained from the observations Ysu=Yu{ys}Y_{s^{\prime}}^{u}=Y^{u}\setminus\{y_{s^{\prime}}\}. Introducing

Indeed, for that for any θ,θΘ\theta,\theta^{\prime}\in\Theta

Appendix C Proof of Lemma 6

From Theorem 4 we know that, for Na,tN(θa,F)N_{a,t}\geq N(\theta_{a},F),

Let Nϵ=Nϵ(δa,Δa,θa)N_{\epsilon}=N_{\epsilon}(\delta_{a},\Delta_{a},\theta_{a}) be the smallest integer such that for all nNϵn\geq N_{\epsilon}

we have that for all tt and TT such that Na,t1max(LT,Nϵ,N(θa,F)),N_{a,t}-1\geq\max(L_{T},N_{\epsilon},N(\theta_{a},F)),

Appendix D Controling the Number of Optimal Plays: Outline Proof of Proposition 5

The proof of this proposition is quite detailed, and essentially the same as the proof given for Proposition 1 in , which we will sometimes refer to. However, in generalising to the case of exponential family bandits we show how to avoid the need to explicity calculate posterior probabilities that lead to Lemma 4 in . While simplifying the proof we loose the ability to specify the constants explicitly, and so the analysis becomes asymptotic, but holds for every b]0,1[b\in]0,1[.

We introduce the interval Ij={τj,τj+t1b1}\mathcal{I}_{j}=\{\tau_{j},\tau_{j}+\lceil t^{1-b}-1\rceil\}: on the event Ej\mathcal{E}_{j}, Ij\mathcal{I}_{j} is included in {τj,τj+1}\{\tau_{j},\tau_{j+1}\} and no draw of arm 1 occurs on I\mathcal{I}. We also introduce for each arm a1a\neq 1 da:=μ1μa2d_{a}:=\frac{\mu_{1}-\mu_{a}}{2}.

The idea of the rest of the analysis is based on the following remark. If on a subinterval I[τj,τj+1[\mathcal{I}\subseteq[\tau_{j},\tau_{j+1}[ of size f(t)f(t) arm 1 is not drawn and all the samples of the suboptimal arms fall below μ2+d2<μ1\mu_{2}+d_{2}<\mu_{1}, then for all sIs\in\mathcal{I}, μ(θ1,s)μ2+d2\mu(\theta_{1,s})\leq\mu_{2}+d_{2}. On I\mathcal{I}, the sequence (θ1,s)(\theta_{1,s}) is i.i.d. with distribution π1,τj\pi_{1,\tau_{j}}, and hence,

At this point, an asymptotic result, telling that the posterior on θ1\theta_{1} concentrates to a Dirac in θ1\theta_{1} (the Bernstein-Von-Mises theorem, see ) , leads to

There exists a constant C=C(π0)<1C=C(\pi_{0})<1, such that for every (random) interval I\mathcal{I} included in Ij\mathcal{I}_{j} and for every positive function ff, one has

For every aA, δ>0a\in A,\ \delta>0, there exist constants Ca=Ca(μa,δ,F)C_{a}=C_{a}(\mu_{a},\delta,F) and NN such that for tNt\geq N,

The rest of the proof proceeds by finding a subinterval of Ij\mathcal{I}_{j} on which all the samples of all the suboptimal arms indeed fall below the corresponding thresholds μa+da\mu_{a}+d_{a}. This is done exactly as in and we recall the main steps of the proof below. Before that, we need to introduce the notion of saturated, suboptimal action.

Let tt be fixed. For any a1a\neq 1, an action aa is said to be saturated at time ss if it has been chosen at least Caln(t)C_{a}\ln(t) times, i.e. Na,tCaln(t)N_{a,t}\geq C_{a}\ln(t). We shall say that it is unsaturated otherwise. Furthermore at any time we call a choice of an unsaturated, suboptimal action an interruption.

We want to study the process of saturation on the event Ej={ξjt1b1}\mathcal{E}_{j}=\{\xi_{j}\geq t^{1-b}-1\}. We start by decomposing the interval Ij={τj,τj+t1b1}\mathcal{I}_{j}=\{\tau_{j},\tau_{j}+\lceil t^{1-b}-1\rceil\} into KK subintervals:

Now for each interval Ij,l\mathcal{I}_{j,l}, we introduce:

Fj,l\mathcal{F}_{j,l}: the event that by the end of the interval Ij,l\mathcal{I}_{j,l} at least ll suboptimal actions are saturated;

nj,ln_{j,l}: the number of interruptions during this interval.

We use the following decomposition to bound the probability of the event Ej\mathcal{E}_{j}:

Note that the quantities Ej, Ij,l, Fj,l\mathcal{E}_{j},\ \mathcal{I}_{j,l},\ \mathcal{F}_{j,l} and nj,ln_{j,l} all depend on tt, however we suppress this dependency for notational convenience. However, we keep in mind that we bound the different probabilities for tNt\geq N, so that Lemma 10 applies.

On the event EjFj,K1\mathcal{E}_{j}\cap\mathcal{F}_{j,K-1}, only saturated suboptimal arms are drawn on the interval Ij,K\mathcal{I}_{j,K}. Using Lemma 10, we get

for 0<C<10<C<1 as in Lemma 9. The second last inequality comes from the fact that if arm 1 is not drawn, the sample θ1,s\theta_{1,s} must be smaller than some sample θa,s\theta_{a,s} and therefore smaller than μ2+d2\mu_{2}+d_{2}.

A similar argument to that employed in Step 2 can be used in an induction to show that for all 2lK2\leq l\leq K, if tt is larger than some deterministic constant Nμ1,μ2,bN_{\mu_{1},\mu_{2},b} specified in the base case,

We refer the reader to for a precise description of the induction. For l=Kl=K we then get

Step 4: Conclusion

Putting Steps 2 and 3 together we obtain that for tN0:=max(N,Nμ1,μ2,b)t\geq N_{0}:=\max(N,N_{\mu_{1},\mu_{2},b}),