Prior-free and prior-dependent regret bounds for Thompson Sampling

Sébastien Bubeck, Che-Yu Liu

Introduction

where the expectation is taken with respect to the parameter θ\theta, the rewards (Xi,s)s1(X_{i,s})_{s\geq 1}, and the external source of randomness (Us)s1(U_{s})_{s\geq 1}. We will also be interested in the individual regret Rn(θ)R_{n}(\theta) which is defined similarly except that θ\theta is fixed (instead of being integrated over π0\pi_{0}). When it is clear from the context we drop the dependency on θ\theta in the various quantities defined above.

In other words the value of the best arm is known as well as a non-trivial lower bound on the gap between the values of the best and second best arms. For this problem a new algorithm was proposed in Bubeck et al. (2013) (which we call the BPR policy), and it was shown that the BPR policy satisfies

Thus the BPR policy attains a regret uniformly bounded over time in the BPR setting, a feature that standard bandit algorithms such as UCB of Auer et al. (2002) cannot achieve. It is natural to view the assumptions of the BPR setting as a prior over the reward distributions and to ask what regret guarantees attains Thompson Sampling in that situation. More precisely we consider Thompson Sampling with Gaussian reward distributions and uniform prior over the possible range of parameters. We then prove individual regret bounds for any sub-Gaussian distributions (similarly to Bubeck et al. (2013)). We obtain that Thompson Sampling uses optimally the prior information in the sense that it also attains uniformly bounded over time regret. Furthermore as an added bonus we remove the extraneous log-log factor of the BPR policy’s regret bound.

The results presented in Section 3 and 4 can be viewed as a first step towards a better understanding of prior-dependent regret bounds for Thompson Sampling. Generalizing these results to arbitrary priors is a challenging open problem which is beyond the scope of our current techniques.

Optimal prior-free regret bound for Thompson Sampling

In this section we prove the following result.

For any prior distribution π0\pi_{0} over reward distributions in $$, Thompson Sampling satisfies

Remark that the above result is unimprovable in the sense that there exist prior distributions π0\pi_{0} such that for any algorithm one has Rn120nKR_{n}\geq\frac{1}{20}\sqrt{nK} (see e.g. [Theorem 3.5, Bubeck and Cesa-Bianchi (2012)]). This theorem also implies an optimal rate of identification for the best arm, see Bubeck et al. (2009) for more details on this.

Step 1: rewriting of the Bayesian regret in terms of upper confidence bounds. This step is given by [Proposition 1, Russo and Roy (2013)] which we reprove for the sake of completeness. Let Bi,tB_{i,t} be a random variable measurable with respect to σ(Ht)\sigma(H_{t}). Note that by definition θ(t)\theta^{(t)} and θ\theta are identically distributed conditionally on HtH_{t}. This implies by the tower rule:

Inspired by the MOSS strategy of Audibert and Bubeck (2009) we will now take

where μ^i,s=1st=1sXi,t\widehat{\mu}_{i,s}=\frac{1}{s}\sum_{t=1}^{s}X_{i,t}, and log+(x)=log(x)\mathds1x1\log_{+}(x)=\log(x)\mathds{1}_{x\geq 1}. In the following we denote δ0=2Kn\delta_{0}=2\sqrt{\frac{K}{n}}. From now on we work conditionally on θ\theta and thus we drop all the dependency on θ\theta.

Next we extract the following inequality from Audibert and Bubeck (2010) (see p2683–2684), for any i[K]i\in[K],

Next we use the following simple inequality:

Now for uδ0u\geq\delta_{0} let s(u)=3log(nu2K)/u2s(u)=\lceil 3\log\left(\frac{nu^{2}}{K}\right)/u^{2}\rceil where x\lceil x\rceil is the smallest integer large than xx. Let c=113c=1-\frac{1}{\sqrt{3}}. It is easy to see that one has:

Using an integration already done in Step 2 we have

Next using Hoeffding’s inequality and the fact that the rewards are in $onehasforone has foru\geq\delta_{0}$

Now using that 1exp(x)xx2/21-\exp(-x)\geq x-x^{2}/2 for x0x\geq 0 one obtains

which concludes the proof together with the results of Step 1 and Step 2.

Thompson Sampling in the two-armed BPR setting

In order to derive the Thompson Sampling strategy for this problem we further assume that the reward distributions are in fact Gaussian with variance 11. In other words let Θ={θ1,θ2}\Theta=\{\theta_{1},\theta_{2}\}, π0(θ1)=π0(θ2)=1/2\pi_{0}(\theta_{1})=\pi_{0}(\theta_{2})=1/2, and under θ1\theta_{1} one has X1,sN(μ,1)X_{1,s}\sim\mathcal{N}(\mu^{*},1) and X2,sN(μΔ,1)X_{2,s}\sim\mathcal{N}(\mu^{*}-\Delta,1) while under θ2\theta_{2} one has X2,sN(μ,1)X_{2,s}\sim\mathcal{N}(\mu^{*},1) and X1,sN(μΔ,1)X_{1,s}\sim\mathcal{N}(\mu^{*}-\Delta,1). Then a straightforward computation (using Bayes rule and induction) shows that one has for some normalizing constant c>0c>0:

Recall that Thompson Sampling draws θ(t)\theta^{(t)} from πt\pi_{t} and then pulls the best arm for the environment θ(t)\theta^{(t)}. Observe that under θ1\theta_{1} the best arm is arm 11 and under θ2\theta_{2} the best arm is arm 22. In other words Thompson Sampling draws ItI_{t} at random with the probabilities given by the posterior πt\pi_{t}. This leads to a general algorithm for the two-armed BPR setting with sub-Gaussian reward distributions that we summarize in Figure 1. The next result shows that it attains optimal performances in this setting up to a numerical constant (see Bubeck et al. (2013) for lower bounds), for any sub-Gaussian reward distribution (not necessarily Gaussian) with largest mean μ\mu^{*} and gap Δ\Delta.

The policy of Figure 1 has regret bounded as RnΔ+578ΔR_{n}\leq\Delta+\frac{578}{\Delta}, uniformly in nn.

Note that we did not try to optimize the numerical constant in the above bound. Figure 2 shows an empirical comparison of the policy of Figure 1 with Policy 1 of Bubeck et al. (2013). Note in particular that a regret bound of order 16/Δ16/\Delta was proved for the latter algorithm and the (limited) numerical simulation presented here suggests that Thompson Sampling outperforms this strategy.

Proof Without loss of generality we assume that arm 11 is the optimal arm, that is μ1=μ\mu_{1}=\mu^{*} and μ2=μΔ\mu_{2}=\mu^{*}-\Delta. Let μ^i,s=1st=1sXi,t\widehat{\mu}_{i,s}=\frac{1}{s}\sum_{t=1}^{s}X_{i,t}, γ^1,s=μ1μ^1,s\widehat{\gamma}_{1,s}=\mu_{1}-\widehat{\mu}_{1,s} and γ^2,s=μ^2,sμ2\widehat{\gamma}_{2,s}=\widehat{\mu}_{2,s}-\mu_{2}. Note that large (positive) values of γ^1,s\widehat{\gamma}_{1,s} or γ^2,s\widehat{\gamma}_{2,s} might mislead the algorithm into bad decisions, and we will need to control what happens in various regimes for these γ\gamma coefficients. We decompose the proof into three steps.

Step 1. This first step will be useful in the rest of the analysis, it shows how the probability ratio of a bad pull over a good pull evolves as a function of the γ\gamma coefficients introduced above. One has:

Step 2. We decompose the regret RnR_{n} as follows:

We use Hoeffding’s inequality to control the first term:

For the second term, using the rewriting of Step 1 as an upper bound on pt(2)p_{t}(2), one obtains:

The third term is more difficult to control, and we further decompose the corresponding event as follows:

The cumulative probability of the first event in the above decomposition is easy to control thanks to Hoeffding’s maximal inequalityIt is an easy exercise to verify that Azuma-Hoeffding holds for martingale differences with sub-Gaussian increments, which implies Hoeffding’s maximal inequality for sub-Gaussian distributions. which states that for any m1m\geq 1 and x>0x>0 one has

where the last inequality follows from Step 1. The last step is devoted to bounding from above this last term.

Step 3. By integrating the deviations and using again Hoeffding’s maximal inequality one obtains

which concludes the proof by putting this together with the results of the previous step.

Optimal strategy for the BPR setting inspired by Thompson Sampling

Similarly to the previous section we assume that the reward distributions are Gaussian with variance 11 for the derivation of the Thompson Sampling strategy (but we do not make this assumption for the analysis of the resulting algorithm). Then the set of possible parameters is described as follows:

Assuming a uniform prior over the index of the best arm, and a prior λ\lambda over the mean of a suboptimal arm one obtains by Bayes rule that the probability density function of the posterior is given by:

Now remark that with Thompson Sampling arm ii is played at time tt if and only if θ(t)Θi\theta^{(t)}\in\Theta_{i}. In other words ItI_{t} is played at random from probability ptp_{t} where

Taking inspiration from the above calculation we consider the following policy, where λ\lambda is the Lebesgue measure and we assume a slightly larger value for the variance (this is necessary for the proof).

The following theorem shows that this policy attains the best known performance for the BPR setting, shaving off a log-log term in the regret bound of the BPR policy.

The policy of Figure 3 has regret bounded as Rni:Δi>0(Δi+80+log(Δi/ε)Δi)R_{n}\leq\sum_{i:\Delta_{i}>0}\left(\Delta_{i}+\frac{80+\log(\Delta_{i}/\varepsilon)}{\Delta_{i}}\right), uniformly in nn.

Proof The general structure of the proof is superficially similar to the proof of Theorem 2 but many details are different. Without loss of generality we assume that arm 11 is the optimal arm, that is μ1=μ\mu_{1}=\mu^{*} and i2,μi=μΔi\forall i\geq 2,\mu_{i}=\mu^{*}-\Delta_{i}. Let γ^1,s=μ1μ^1,s\widehat{\gamma}_{1,s}=\mu_{1}-\widehat{\mu}_{1,s} and γ^i,s=μ^i,sμi\widehat{\gamma}_{i,s}=\widehat{\mu}_{i,s}-\mu_{i} for i2i\geq 2. We decompose the proof into four steps.

Step 1: Rewriting of the ratio pi,tp1,t\frac{p_{i,t}}{p_{1,t}}. Let i2i\geq 2, the following rewriting will be useful in the rest of the proof:

where the last step follows by a simple change of variable.

Step 2: Decomposition of RnR_{n}. For i2i\geq 2. Let Ai=6Δi2log(e6Δiε)A_{i}=\lceil\frac{6}{\Delta_{i}^{2}}\log(\frac{e^{6}\Delta_{i}}{\varepsilon})\rceil where x\lceil x\rceil is the smallest integer larger than xx. We decompose the regret RnR_{n} as follows.

The first expectation can be bounded by using Hoeffding’s inequality.

The second expectation is more difficult to bound from above and the next two steps are dedicated to this task.

We have now to control the term pt(i)pt(1)\frac{p_{t}(i)}{p_{t}(1)} on the event {γ^i,Ti(t1)Δi4,Ti(t1)Ai}\{\widehat{\gamma}_{i,T_{i}(t-1)}\leq\frac{\Delta_{i}}{4},T_{i}(t-1)\geq A_{i}\}. The following bounds on the tail of the standard Gaussian distribution will be useful, for any x>0x>0 one has

Next, using the fact that the function x1xe16xΔi2x\rightarrow\frac{1}{x}e^{\frac{1}{6}x\Delta_{i}^{2}} is increasing on [6Δi2,+)[\frac{6}{\Delta_{i}^{2}},+\infty), we get

Plugging into the expression of pt(i)pt(1)\frac{p_{t}(i)}{p_{t}(1)}, we obtain

The first term is straightforward to compute:

For the second term, we first integrate the deviations and we use Hoeffding’s inequality to obtain

Putting together all the steps finishes the proof.

References