Prior-free and prior-dependent regret bounds for Thompson Sampling
Sébastien Bubeck, Che-Yu Liu
Introduction
where the expectation is taken with respect to the parameter , the rewards , and the external source of randomness . We will also be interested in the individual regret which is defined similarly except that is fixed (instead of being integrated over ). When it is clear from the context we drop the dependency on in the various quantities defined above.
In other words the value of the best arm is known as well as a non-trivial lower bound on the gap between the values of the best and second best arms. For this problem a new algorithm was proposed in Bubeck et al. (2013) (which we call the BPR policy), and it was shown that the BPR policy satisfies
Thus the BPR policy attains a regret uniformly bounded over time in the BPR setting, a feature that standard bandit algorithms such as UCB of Auer et al. (2002) cannot achieve. It is natural to view the assumptions of the BPR setting as a prior over the reward distributions and to ask what regret guarantees attains Thompson Sampling in that situation. More precisely we consider Thompson Sampling with Gaussian reward distributions and uniform prior over the possible range of parameters. We then prove individual regret bounds for any sub-Gaussian distributions (similarly to Bubeck et al. (2013)). We obtain that Thompson Sampling uses optimally the prior information in the sense that it also attains uniformly bounded over time regret. Furthermore as an added bonus we remove the extraneous log-log factor of the BPR policy’s regret bound.
The results presented in Section 3 and 4 can be viewed as a first step towards a better understanding of prior-dependent regret bounds for Thompson Sampling. Generalizing these results to arbitrary priors is a challenging open problem which is beyond the scope of our current techniques.
Optimal prior-free regret bound for Thompson Sampling
In this section we prove the following result.
For any prior distribution over reward distributions in $$, Thompson Sampling satisfies
Remark that the above result is unimprovable in the sense that there exist prior distributions such that for any algorithm one has (see e.g. [Theorem 3.5, Bubeck and Cesa-Bianchi (2012)]). This theorem also implies an optimal rate of identification for the best arm, see Bubeck et al. (2009) for more details on this.
Step 1: rewriting of the Bayesian regret in terms of upper confidence bounds. This step is given by [Proposition 1, Russo and Roy (2013)] which we reprove for the sake of completeness. Let be a random variable measurable with respect to . Note that by definition and are identically distributed conditionally on . This implies by the tower rule:
Inspired by the MOSS strategy of Audibert and Bubeck (2009) we will now take
where , and . In the following we denote . From now on we work conditionally on and thus we drop all the dependency on .
Next we extract the following inequality from Audibert and Bubeck (2010) (see p2683–2684), for any ,
Next we use the following simple inequality:
Now for let where is the smallest integer large than . Let . It is easy to see that one has:
Using an integration already done in Step 2 we have
Next using Hoeffding’s inequality and the fact that the rewards are in $u\geq\delta_{0}$
Now using that for one obtains
which concludes the proof together with the results of Step 1 and Step 2.
Thompson Sampling in the two-armed BPR setting
In order to derive the Thompson Sampling strategy for this problem we further assume that the reward distributions are in fact Gaussian with variance . In other words let , , and under one has and while under one has and . Then a straightforward computation (using Bayes rule and induction) shows that one has for some normalizing constant :
Recall that Thompson Sampling draws from and then pulls the best arm for the environment . Observe that under the best arm is arm and under the best arm is arm . In other words Thompson Sampling draws at random with the probabilities given by the posterior . This leads to a general algorithm for the two-armed BPR setting with sub-Gaussian reward distributions that we summarize in Figure 1. The next result shows that it attains optimal performances in this setting up to a numerical constant (see Bubeck et al. (2013) for lower bounds), for any sub-Gaussian reward distribution (not necessarily Gaussian) with largest mean and gap .
The policy of Figure 1 has regret bounded as , uniformly in .
Note that we did not try to optimize the numerical constant in the above bound. Figure 2 shows an empirical comparison of the policy of Figure 1 with Policy 1 of Bubeck et al. (2013). Note in particular that a regret bound of order was proved for the latter algorithm and the (limited) numerical simulation presented here suggests that Thompson Sampling outperforms this strategy.
Proof Without loss of generality we assume that arm is the optimal arm, that is and . Let , and . Note that large (positive) values of or might mislead the algorithm into bad decisions, and we will need to control what happens in various regimes for these coefficients. We decompose the proof into three steps.
Step 1. This first step will be useful in the rest of the analysis, it shows how the probability ratio of a bad pull over a good pull evolves as a function of the coefficients introduced above. One has:
Step 2. We decompose the regret as follows:
We use Hoeffding’s inequality to control the first term:
For the second term, using the rewriting of Step 1 as an upper bound on , one obtains:
The third term is more difficult to control, and we further decompose the corresponding event as follows:
The cumulative probability of the first event in the above decomposition is easy to control thanks to Hoeffding’s maximal inequalityIt is an easy exercise to verify that Azuma-Hoeffding holds for martingale differences with sub-Gaussian increments, which implies Hoeffding’s maximal inequality for sub-Gaussian distributions. which states that for any and one has
where the last inequality follows from Step 1. The last step is devoted to bounding from above this last term.
Step 3. By integrating the deviations and using again Hoeffding’s maximal inequality one obtains
which concludes the proof by putting this together with the results of the previous step.
Optimal strategy for the BPR setting inspired by Thompson Sampling
Similarly to the previous section we assume that the reward distributions are Gaussian with variance for the derivation of the Thompson Sampling strategy (but we do not make this assumption for the analysis of the resulting algorithm). Then the set of possible parameters is described as follows:
Assuming a uniform prior over the index of the best arm, and a prior over the mean of a suboptimal arm one obtains by Bayes rule that the probability density function of the posterior is given by:
Now remark that with Thompson Sampling arm is played at time if and only if . In other words is played at random from probability where
Taking inspiration from the above calculation we consider the following policy, where is the Lebesgue measure and we assume a slightly larger value for the variance (this is necessary for the proof).
The following theorem shows that this policy attains the best known performance for the BPR setting, shaving off a log-log term in the regret bound of the BPR policy.
The policy of Figure 3 has regret bounded as , uniformly in .
Proof The general structure of the proof is superficially similar to the proof of Theorem 2 but many details are different. Without loss of generality we assume that arm is the optimal arm, that is and . Let and for . We decompose the proof into four steps.
Step 1: Rewriting of the ratio . Let , the following rewriting will be useful in the rest of the proof:
where the last step follows by a simple change of variable.
Step 2: Decomposition of . For . Let where is the smallest integer larger than . We decompose the regret as follows.
The first expectation can be bounded by using Hoeffding’s inequality.
The second expectation is more difficult to bound from above and the next two steps are dedicated to this task.
We have now to control the term on the event . The following bounds on the tail of the standard Gaussian distribution will be useful, for any one has
Next, using the fact that the function is increasing on , we get
Plugging into the expression of , we obtain
The first term is straightforward to compute:
For the second term, we first integrate the deviations and we use Hoeffding’s inequality to obtain
Putting together all the steps finishes the proof.