Further Optimal Regret Bounds for Thompson Sampling

Shipra Agrawal, Navin Goyal

Introduction

Multi-armed bandit problem models the exploration/exploitation trade-off inherent in sequential decision problems. One of the early motivations for studying MAB problem was clinical trials: suppose that we have NN different treatments of unknown efficacy for a certain disease. Patients arrive sequentially, and we must decide on a treatment to administer for each arriving patient. To make this decision, we could learn from how the previous choices of treatments fared for the previous patients. After a sufficient number of trials, we may have a reasonable idea of which treatment is most effective, and from then on, we could administer that treatment for all the patients. However, initially, when there is no or very little information available, we need to explore and try each treatment sufficient number of times. We wish to do this exploration in such a way that we can find the best treatment and start exploiting it as soon as possible. The MAB problem is to decide how to choose the treatment for the next patient, given the outcomes of the treatments so far. Today, multi-armed bandit problem has a diverse set of applications some of which will be mentioned shortly.

Many versions and generalizations of the multi-armed bandit problem have been studied in the literature; in this paper we will consider a basic and well-studied version of this problem: the stochastic multi-armed bandit problem. Among many algorithms available for the stochastic bandit problem, some popular ones include Upper Confidence Bound (UCB) family of algorithms, (e.g., , and more recently ), which have good theoretical guarantees, and the algorithm by , which gives optimal strategy under Bayesian setting with known priors and geometric time-discounted rewards. In one of the earliest works on stochastic bandit problems, proposed a natural randomized Bayesian algorithm to minimize regret. The basic idea is to assume a simple prior distribution on the parameters of the reward distribution of every arm, and at any time step, play an arm according to its posterior probability of being the best arm. This algorithm is known as Thompson Sampling (TS), and it is a member of the family of randomized probability matching algorithms. TS is a very natural algorithm and the same idea has been rediscovered many times independently in the context of reinforcement learning, e.g., in . We emphasize that although TS algorithm is a Bayesian approach, the description of the algorithm and our analysis apply to the prior-free stochastic multi-armed bandit model where parameters of the reward distribution of every arm are fixed, though unknown (see Section 1.1). One could interpret the “assumed” Bayesian priors as the current knowledge of the algorithm about the arms. Thus, our regret bounds for Thompson Sampling are directly comparable to the regret bounds for UCB family of algorithms which are a frequentist approach to the same problem.

Recently, TS has attracted considerable attention. Several studies (e.g., ) have empirically demonstrated the efficacy of Thompson Sampling: provides a detailed discussion of probability matching techniques in many general settings along with favorable empirical comparisons with other techniques. demonstrate that empirically TS achieves regret comparable to the lower bound of ; and in applications like display advertising and news article recommendation, it is competitive to or better than popular methods such as UCB. In their experiments, TS is also more robust to delayed or batched feedback (delayed feedback means that the result of a play of an arm may become available only after some time delay, but we are required to make immediate decisions for which arm to play next) than the other methods. A possible explanation may be that TS is a randomized algorithm and so it is unlikely to get trapped in an early bad decision during the delay. Microsoft’s adPredictor () for CTR prediction of search ads on Bing uses the idea of Thompson Sampling.

Despite being easy to implement, competitive to the state of the art methods, and popular in practice, TS lacked a strong theoretical analysis. provide weak guarantees, namely, a bound of o(T)o(T) on expected regret in time TT. Significant progress was made in the recent work of and . In , the first logarithmic bound on expected regret of TS algorithm were proven. provided a bound that matches the asymptotic lower bound of for this problem. However, both these bounds were problem dependent, i.e. the regret bounds are logarithmic in TT when the problem parameters, namely the mean rewards for each arm, and their differences, are assumed to be constants. The problem-independent bounds implied by these existing works were far from optimal. Obtaining a problem-independent bound that is close to the lower bound of Ω(NT)\Omega(\sqrt{NT}) was also posed as an open problem by Chapaelle and Li .

In this paper, we give a regret analysis for Thompson Sampling that provides both optimal problem-dependent and near-optimal problem-independent regret bounds for Thompson Sampling. Our novel martingale-based analysis technique is conceptually simple (arguably simpler than the previous work). Our technique easily extends to distributions other than Beta distribution, and it also extends to the more general contextual bandits setting . While the basic idea for the analysis in the contextual bandits setting of is inspired by the idea in this paper, the details are substantially different.

Before stating our results, we describe the MAB problem and the TS algorithm formally.

We consider the stochastic multi-armed bandit (MAB) problem: We are given a slot machine with NN arms; at each time step t=1,2,3,t=1,2,3,\ldots, one of the NN arms must be chosen to be played. Each arm ii, when played, yields a random real-valued reward according to some fixed (unknown) distribution with support in $$. The random reward obtained from playing an arm repeatedly are i.i.d. and independent of the plays of the other arms. The reward is observed immediately after playing the arm.

Other performance measures include PAC-style guarantees; we do not consider those measures here.

2 Thompson Sampling

We provide the details of Thompson Sampling algorithm and our analysis for the Bernoulli bandit problem, i.e. when the rewards are either or 11, and for arm ii the probability of success (reward =11) is μi\mu_{i}. This description of Thompson Sampling follows closely that of . A simple extension of this algorithm to general reward distributions with support $$ is described in , which seamlessly extends our analysis for Bernoulli bandits to general stochastic bandit problem.

The algorithm for Bernoulli bandits maintains Bayesian priors on the Bernoulli means μi\mu_{i}’s. Beta distribution turns out to be a very convenient choice of priors for Bernoulli rewards. Let us briefly recall that beta distributions form a family of continuous probability distributions on the interval (0,1)(0,1). The pdf of Beta(α,β)\text{Beta}(\alpha,\beta), the beta distribution with parameters α>0\alpha>0, β>0\beta>0, is given by f(x;α,β)=Γ(α+β)Γ(α)Γ(β)xα1(1x)β1f(x;\alpha,\beta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}. The mean of Beta(α,β)\text{Beta}(\alpha,\beta) is α/(α+β)\alpha/(\alpha+\beta); and as is apparent from the pdf, higher the α,β\alpha,\beta, tighter is the concentration of Beta(α,β)\text{Beta}(\alpha,\beta) around the mean. Beta distribution is useful for Bernoulli rewards because if the prior is a Beta(α,β)\text{Beta}(\alpha,\beta) distribution, then after observing a Bernoulli trial, the posterior distribution is simply Beta(α+1,β)\text{Beta}(\alpha+1,\beta) or Beta(α,β+1)\text{Beta}(\alpha,\beta+1), depending on whether the trial resulted in a success or failure, respectively.

The Thompson Sampling algorithm initially assumes arm ii to have prior Beta(1,1)\text{Beta}(1,1) on μi\mu_{i}, which is natural because Beta(1,1)\text{Beta}(1,1) is the uniform distribution on (0,1)(0,1). At time tt, having observed Si(t)S_{i}(t) successes (reward = 11) and Fi(t)F_{i}(t) failures (reward = ) in ki(t)=Si(t)+Fi(t)k_{i}(t)=S_{i}(t)+F_{i}(t) plays of arm ii, the algorithm updates the distribution on μi\mu_{i} as Beta(Si(t)+1,Fi(t)+1)\text{Beta}(S_{i}(t)+1,F_{i}(t)+1). The algorithm then samples from these posterior distributions of the μi\mu_{i}’s, and plays an arm according to the probability of its mean being the largest. We summarize the Thompson Sampling algorithm below.

3 Our results

In this article, we bound the finite time expected regret of Thompson Sampling. From now on we will assume that the first arm is the unique optimal arm, i.e., μ=μ1>argmaxi1μi\mu^{*}=\mu_{1}>\arg\max_{i\neq 1}\mu_{i}. Assuming that the first arm is an optimal arm is a matter of convenience for stating the results and for the analysis and of course the algorithm does not use this assumption. The assumption of unique optimal arm is also without loss of generality, since adding more arms with μi=μ\mu_{i}=\mu^{*} can only decrease the expected regret; details of this argument were provided in .

(Problem-dependent bound) For the NN-armed stochastic bandit problem, Thompson Sampling algorithm has expected regret

in time TT, where d(μi,μ1)=μilogμiμ1+(1μi)log(1μi)(1μ1)d(\mu_{i},\mu_{1})=\mu_{i}\log\frac{\mu_{i}}{\mu_{1}}+(1-\mu_{i})\log\frac{(1-\mu_{i})}{(1-\mu_{1})}. The big-Oh notation For any two functions f(n),g(n)f(n),g(n), f(n)=O(g(n))f(n)=O(g(n)) if there exist two constants n0n_{0} and cc such that for all nn0n\geq n_{0}, f(n)cg(n)f(n)\leq cg(n). in above assumes μi,Δi,i=1,,N\mu_{i},\Delta_{i},i=1,\ldots,N to be constants.

(Problem-independent bound) For the NN-armed stochastic bandit problem, Thompson Sampling algorithm has expected regret

in time TT, where the big-Oh notation hides only the absolute constants.

Let us contrast our bounds with the previous work. Let us first consider the problem-dependent regret bounds, i.e., regret bounds that depend on problem parameters μi,Δi,i=1,,N\mu_{i},\Delta_{i},i=1,\ldots,N. Lai and Robbins essentially proved the following lower bound on the regret of any bandit algorithm (see for a precise statement):

They also gave algorithms asymptotically achieving this guarantee, though unfortunately their algorithms are not efficient. Auer et al. gave the UCB1 algorithm, which is efficient and achieves the following bound:

More recently, Kaufmann et al. gave Bayes-UCB algorithm which achieves the lower bound of for Bernoulli rewards. Bayes-UCB is a UCB like algorithm, where the upper confidence bounds are based on the quantiles of Beta posterior distributions. Interestingly, these upper confidence bounds turn out to be similar to those used by algorithms in and . Our bounds in Theorem 1 achieve the asymptotic lower bounds of , and match those provided by for Thompson Sampling.

Theorem 2 shows that Thompson Sampling also achieves a problem independent regret bound of O(NTlnT)O(\sqrt{NT\ln T}) on regret. This is the first analyis for TS that matches the Ω(NT)\Omega(\sqrt{NT}) problem-inpdependent lower bound (see ) for this problem within logarithmic factors. The problem-dependent bounds in the existing work implied only suboptimal problem-independent bounds: implied a problem independent bound of O(T2/3)O(T^{2/3}). In , the additive problem dependent term was not explicitly calculated, which makes it difficult to derive the corresponding problem independent bound, but on a preliminary examination, it appears that it would involve an even higher power of TT. To compare with other existing algorithms for this problem, note that the best known problem-independent bound for the expected regret of UCB1 is also O(NTlnT)O(\sqrt{NT\ln T}) (see ). More recently, Audibert and Bubeck gave an algorithm MOSS, inspired by UCB1, with regret O(NT)O(\sqrt{NT}).

Proofs

In this section, we prove Theorem 1 and Theorem 2. The proofs of the two theorems follow the same steps, and diverge only towards the end of the analysis.

Our proof uses a martingale based analysis. Essentially, we prove that conditioned on any history of execution in the preceding steps, the probability of playing any suboptimal arm ii at the current step can be bounded by a linear function of the probability of playing the optimal arm at the current step. This is proven in Lemma 1, which forms the core of our analysis. Further, we show that the coefficient in this linear function decreases exponentially fast with the increase in the number of plays of optimal arm (refer to Lemma 4), this allows us to bound the total number of plays of every suboptimal arm, to bound the regret as desired. The difference between the analysis for obtaining the logarithmic problem-dependent bound of Theorem 1, and the problem-independent bound of Theorem 2 is merely technical, and occurs only towards the end of the proof. We recall some of the definitions introduced earlier, and introduce some new notations used in the proof. Fn,pB()F^{B}_{n,p}(\cdot) denotes the cdf and fn,pB()f^{B}_{n,p}(\cdot) denotes the probability mass function of the binomial distribution with parameters n,pn,p. Let Fα,βbeta()F^{beta}_{\alpha,\beta}(\cdot) denote the cdf of the beta distribution with parameters α,β\alpha,\beta.

ki(t)k_{i}(t) is defined as the number of plays of arm ii until time t1t-1, and Si(t)S_{i}(t) as the number of successes among the plays of arm ii until time t1t-1. Also, i(t)i(t) denotes the arm played at time tt.

For each arm ii, we will choose two thresholds xix_{i} and yiy_{i} such that μi<xi<yi<μ1\mu_{i}<x_{i}<y_{i}<\mu_{1}. The specific choice of these thresholds will depend on whether we are proving problem-dependent bound or problem-independent bound, and will be described at the approporiate points in the proof. Define Li(T)=lnTd(xi,yi)L_{i}(T)=\frac{\ln T}{d(x_{i},y_{i})}, and μ^i(t)=Si(t)/ki(t)\hat{\mu}_{i}(t)=S_{i}(t)/k_{i}(t) (define μ^i(t)=1\hat{\mu}_{i}(t)=1 when ki(t)=0k_{i}(t)=0). Define Eiμ(t)E^{\mu}_{i}(t) as the event that μ^i(t)xi\hat{\mu}_{i}(t)\leq x_{i}. Define Eiθ(t)E^{\theta}_{i}(t) as the event that θi(t)yi\theta_{i}(t)\leq y_{i}.

Intuitively, Eiμ(t)E^{\mu}_{i}(t), Eiθ(t)E^{\theta}_{i}(t) are the events that μ^i(t)\hat{\mu}_{i}(t) and θi(t)\theta_{i}(t), respectively, are not too far from the mean μi\mu_{i}. As we show later, these events will hold with high probability for most time steps.

Define filtration Ft1{\cal F}_{t-1} as the history of plays until time t1t-1, i.e.

where i(t)i(t) denotes the arm played at time tt, and ri(t)r_{i}(t) denotes the reward observed for arm ii at time tt.

Note that pi,tp_{i,t} is determined by Ft1{\cal F}_{t-1}.

where pi,t=Pr(θ1(t)>yiFt1)p_{i,t}=\Pr(\theta_{1}(t)>y_{i}|{\cal F}_{t-1}).

Note that whether Eiμ(t)E^{\mu}_{i}(t) is true or not is determined by Ft1{\cal F}_{t-1}. Assume that filtration Ft1{\cal F}_{t-1} is such that Eiμ(t)E^{\mu}_{i}(t) is true (otherwise the probability on the left hand side is and the inequality is trivially true). It then suffices to prove that

Let Mi(t)M_{i}(t) denote the event that arm ii exceeds all the suboptimal arms at time tt. That is,

We will prove the following two inequalities which immediately give (1).

Now, given Mi(t),Eiθ(t)M_{i}(t),E^{\theta}_{i}(t), it holds that for all ji,j1j\neq i,j\neq 1,

The second last equality follows because the events Mi(t)M_{i}(t) and Eiθ(t),i1E^{\theta}_{i}(t),\forall i\neq 1 involve conditions on only θj(t)\theta_{j}(t) for j1j\neq 1, and given Ft1{\cal F}_{t-1} (and hence μ^j(t),kj(t),j\hat{\mu}_{j}(t),k_{j}(t),\forall j), θ1(t)\theta_{1}(t) is independent of all the other θj(t),j1\theta_{j}(t),j\neq 1, and hence independent of these events. This together with (4) gives (2).

Since Eiθ(t)E^{\theta}_{i}(t) is the event that θi(t)yi\theta_{i}(t)\leq y_{i}, therefore, given Eiθ(t)E^{\theta}_{i}(t), i(t)=ii(t)=i only if θ1(t)<yi\theta_{1}(t)<y_{i}. This gives (3):

This essentially follows from application of Chernoff-Hoeffding bounds for concentration of μ^i(t)\hat{\mu}_{i}(t). Refer to Appendix B for details. ∎

This essentially follows from the observation that the beta-distributed random variable θi(t)\theta_{i}(t) is well-concentrated around its mean when ki(t)k_{i}(t) is large, that is, larger than Li(T)L_{i}(T). Refer to Appendix C for details. ∎

Let τj\tau_{j} denote the time step at which jthj^{th} trial of first arm happens, then

where Δi=μ1yi\Delta^{\prime}_{i}=\mu_{1}-y_{i}, Di=yilogyiμ1+(1yi)log1yi1μ1D_{i}=y_{i}\log\frac{y_{i}}{\mu_{1}}+(1-y_{i})\log\frac{1-y_{i}}{1-\mu_{1}}.

The proof of this inequality involves some careful algebraic manipulations using tight estimates for partial Binomial sums provided by . Refer to Appendix D for details. ∎

Proof of Theorem 1 and Theorem 2

Let τk\tau_{k} denote the time step at which arm 11 is played for the kthk^{th} time for k1k\geq 1, and let τ0=0\tau_{0}=0. Using the above lemmas,

The inequality marked ()(*) uses the observation that pi,t=Pr(θ1(t)>yiFt1)p_{i,t}=\Pr(\theta_{1}(t)>y_{i}|{\cal F}_{t-1}) changes only when the distribution of θ1(t)\theta_{1}(t) changes, that is, only on the time step after each play of first arm. Thus, pi,tp_{i,t} is same at all time steps t{τk+1,,τk+1}t\in\{\tau_{k}+1,\ldots,\tau_{k+1}\}, for every kk.

For logarithmic problem-dependent bound of Theorem 1, for some 0<ϵ10<\epsilon\leq 1, we set xi(μi,μ1)x_{i}\in(\mu_{i},\mu_{1}) such that d(xi,μ1)=d(μi,μ1)/(1+ϵ)d(x_{i},\mu_{1})=d(\mu_{i},\mu_{1})/(1+\epsilon), and set yi(xi,μ1)y_{i}\in(x_{i},\mu_{1}) such that d(xi,yi)=d(xi,μ1)/(1+ϵ)=d(μi,μ1)/(1+ϵ)2d(x_{i},y_{i})=d(x_{i},\mu_{1})/(1+\epsilon)=d(\mu_{i},\mu_{1})/(1+\epsilon)^{2} (This way of choosing thresholds, in order to obtain bounds in terms of KL-divergences d(μi,μ1)d(\mu_{i},\mu_{1}) rather than Δi\Delta_{i}s, is inspired by .). This gives

Also, by some simple algebraic manipulations of the equality d(xi,μ1)=d(μi,μ1)/(1+ϵ)d(x_{i},\mu_{1})=d(\mu_{i},\mu_{1})/(1+\epsilon), we can obtain

Here order notation is hiding functions of μi\mu_{i}s and Δi\Delta_{i}s, since they are assumed to be constants.

where ϵ=3ϵ\epsilon^{\prime}=3\epsilon, and the order notation in above hides μi\mu_{i}s and Δi\Delta_{i}s in addition to the absolute constants.

For obtaining O(NTlnT)O(\sqrt{NT\ln T}) problem-independent bound of Theorem 2, we pick xi=μi+Δi3,yi=μ1Δi3x_{i}=\mu_{i}+\frac{\Delta_{i}}{3},y_{i}=\mu_{1}-\frac{\Delta_{i}}{3}, so that Δ2=(μ1yi)2=Δi29\Delta^{\prime 2}=(\mu_{1}-y_{i})^{2}=\frac{\Delta_{i}^{2}}{9}, and using Pinsker’s inequality, d(xi,μi)12(xiμi)2=Δi218d(x_{i},\mu_{i})\geq\frac{1}{2}(x_{i}-\mu_{i})^{2}=\frac{\Delta_{i}^{2}}{18}, d(xi,yi)12(yixi)2Δi218d(x_{i},y_{i})\geq\frac{1}{2}(y_{i}-x_{i})^{2}\geq\frac{\Delta_{i}^{2}}{18}. Then,

We observe that in the worst case, for all suboptimal ii, ΔiNlnTT\Delta_{i}\geq\sqrt{\frac{N\ln T}{T}}. This is because the total regret on playing arms with Δi<NlnTT\Delta_{i}<\sqrt{\frac{N\ln T}{T}} instead of the optimal arm is at most NTlnT\sqrt{NT\ln T}. Thus, all the arms with Δi<NlnTT\Delta_{i}<\sqrt{\frac{N\ln T}{T}} can be assumed to be optimal arms. Also, in we proved that multiple optimal arms can only help.

Therefore, substituting Δi=NlnTT\Delta_{i}=\sqrt{\frac{N\ln T}{T}},

Acknowledgement.

We thank Emil Jeřábek for telling us about his estimates of partial binomial sums. We also thank MathOverflow for connecting us with Emil.

References

Appendix A Some results used in the proofs

Let X1,,XnX_{1},\ldots,X_{n} be independent 010-1 r.v.s with E[Xi]=piE[X_{i}]=p_{i} (not necessarily equal). Let X=1ni=XiX=\frac{1}{n}\sum_{i}=X_{i}, μ=E[X]=1ni=1npi\mu=E[X]=\frac{1}{n}\sum_{i=1}^{n}p_{i}. Then, for any 0<λ<1μ0<\lambda<1-\mu,

where d(a,b)=alnab+(1a)ln(1a)(1b)d(a,b)=a\ln\frac{a}{b}+(1-a)\ln\frac{(1-a)}{(1-b)}.

for all positive integers α,β\alpha,\beta.

Appendix B Proof of Lemma 2

Let τk\tau_{k} denote the time at which kthk^{th} trial of arm ii happens. Let τ0=0\tau_{0}=0; Then,

The second last inequality follows from the observation that the event Eiμ(t)\overline{E^{\mu}_{i}(t)} was defined as μ^i(t)>xi\hat{\mu}_{i}(t)>x_{i}, where μ^i(t)\hat{\mu}_{i}(t) is the average of the outcomes observed from the plays of arm ii until time t1t-1. Thus at time τk+1\tau_{k}+1, μ^i(τk+1)\hat{\mu}_{i}(\tau_{k}+1) is simply the average of the outcomes observed from kk i.i.d. plays of arm ii, each of which is a Bernoulli trial with mean μi\mu_{i}. Using Chernoff-Hoeffding bounds (Fact 1), we obtain that Pr(μ^i(τk+1)>xi)ekd(xi,μi)\Pr(\hat{\mu}_{i}(\tau_{k}+1)>x_{i})\leq e^{-kd(x_{i},\mu_{i})}. ∎

Appendix C Proof of Lemma 3

where the last inequality follows from Chernoff-Hoeffding bounds (refer to Fact 1). Therefore, for tt such that ki(t)>Li(T)k_{i}(t)>L_{i}(T),

Let τ\tau be the largest time step until ki(t)Li(T)k_{i}(t)\leq L_{i}(T), then,

Appendix D Proof of Lemma 4

Let k1(t)=j,S1(t)=sk_{1}(t)=j,S_{1}(t)=s. Let y=yiy=y_{i}. Then, pi,t=Pr(θ1(t)>y)=Fj+1,yB(s)p_{i,t}=\Pr(\theta_{1}(t)>y)=F^{B}_{j+1,y}(s). Let τj+1\tau_{j}+1 denote the time step after the (j)th(j)^{th} play of arm 11. Then, k1(τj+1)=jk_{1}(\tau_{j}+1)=j, and

Let R=μ1(1y)y(1μ1)R=\frac{\mu_{1}(1-y)}{y(1-\mu_{1})}, D=ylogyμ1+(1y)log1y1μ1D=y\log\frac{y}{\mu_{1}}+(1-y)\log\frac{1-y}{1-\mu_{1}}.

We will divide the sum Sum(0,j)=s=0jfj,μ1(s)Fj+1,y(s)Sum(0,j)=\sum_{s=0}^{j}\frac{f_{j,\mu_{1}}(s)}{F_{j+1,y}(s)} into four partial sums and prove that

Together, the above estimates will prove the required bound.

We use the following bounds on the cdf of Binomial distribution [12, Prop. A.4]. For sy(j+1)(j+1)y(1y)s\leq y(j+1)-\sqrt{(j+1)y(1-y)},

Bounding S​u​m​(0,⌊y​j⌋−1)𝑆𝑢𝑚0𝑦𝑗1Sum(0,\lfloor yj\rfloor-1).

Using the bounds just given, for any ss,

We now bound the first expression on the RHS.

Now, R1=μ1(1y)y(1μ1)1=μ1yy(1μ1)R-1=\frac{\mu_{1}(1-y)}{y(1-\mu_{1})}-1=\frac{\mu_{1}-y}{y(1-\mu_{1})}. And, RR1=μ1(1y)μ1y\frac{R}{R-1}=\frac{\mu_{1}(1-y)}{\mu_{1}-y}. Therefore,

Bounding S​u​m​(⌊y​j⌋,⌊y​j⌋)𝑆𝑢𝑚𝑦𝑗𝑦𝑗Sum(\lfloor yj\rfloor,\lfloor yj\rfloor).

We use fj,μ1(s)Fj+1,y(s)fj,μ1(s)fj+1,y(s)=(1sj+1)Rs(1μ1)j(1y)j+1\frac{f_{j,\mu_{1}}(s)}{F_{j+1,y}(s)}\leq\frac{f_{j,\mu_{1}}(s)}{f_{j+1,y}(s)}=\left(1-\frac{s}{j+1}\right)R^{s}\frac{(1-\mu_{1})^{j}}{(1-y)^{j+1}}, to get

The last inequality uses j1Δ11yj\geq\frac{1}{\Delta^{\prime}}\geq\frac{1}{1-y}.

Now, if j>1Δj>\frac{1}{\Delta^{\prime}},then (j+1)y(1y)>y>y\sqrt{(j+1)y(1-y)}>\sqrt{y}>y, so y(j+1)(j+1)y(1y)<yjyjy(j+1)-\sqrt{(j+1)y(1-y)}<yj\leq\lceil yj\rceil. Therefore, for syjs\geq\lceil yj\rceil, Fj+1,y(s)=Θ(1).F_{j+1,y}(s)=\Theta(1). Using this observation, we derive the following.

where the inequality follows using Chernoff-Hoeffding bounds (refer to Fact 2).

For sμ1jΔ2j=yj+Δ2js\geq\lceil\mu_{1}j-\frac{\Delta^{\prime}}{2}j\rceil=\lceil yj+\frac{\Delta^{\prime}}{2}j\rceil, again using Chernoff-Hoeffding bounds from Fact 2,

The last inequality uses j8Δj\geq\frac{8}{\Delta^{\prime}}.

Combining, we get for j8Δj\geq\frac{8}{\Delta^{\prime}},