Further Optimal Regret Bounds for Thompson Sampling

Shipra Agrawal, Navin Goyal

Introduction

Multi-armed bandit problem models the exploration/exploitation trade-off inherent in sequential decision problems. One of the early motivations for studying MAB problem was clinical trials: suppose that we have $N$ different treatments of unknown efficacy for a certain disease. Patients arrive sequentially, and we must decide on a treatment to administer for each arriving patient. To make this decision, we could learn from how the previous choices of treatments fared for the previous patients. After a sufficient number of trials, we may have a reasonable idea of which treatment is most effective, and from then on, we could administer that treatment for all the patients. However, initially, when there is no or very little information available, we need to explore and try each treatment sufficient number of times. We wish to do this exploration in such a way that we can find the best treatment and start exploiting it as soon as possible. The MAB problem is to decide how to choose the treatment for the next patient, given the outcomes of the treatments so far. Today, multi-armed bandit problem has a diverse set of applications some of which will be mentioned shortly.

Many versions and generalizations of the multi-armed bandit problem have been studied in the literature; in this paper we will consider a basic and well-studied version of this problem: the stochastic multi-armed bandit problem. Among many algorithms available for the stochastic bandit problem, some popular ones include Upper Confidence Bound (UCB) family of algorithms, (e.g., , and more recently ), which have good theoretical guarantees, and the algorithm by , which gives optimal strategy under Bayesian setting with known priors and geometric time-discounted rewards. In one of the earliest works on stochastic bandit problems, proposed a natural randomized Bayesian algorithm to minimize regret. The basic idea is to assume a simple prior distribution on the parameters of the reward distribution of every arm, and at any time step, play an arm according to its posterior probability of being the best arm. This algorithm is known as Thompson Sampling (TS), and it is a member of the family of randomized probability matching algorithms. TS is a very natural algorithm and the same idea has been rediscovered many times independently in the context of reinforcement learning, e.g., in . We emphasize that although TS algorithm is a Bayesian approach, the description of the algorithm and our analysis apply to the prior-free stochastic multi-armed bandit model where parameters of the reward distribution of every arm are fixed, though unknown (see Section 1.1). One could interpret the “assumed” Bayesian priors as the current knowledge of the algorithm about the arms. Thus, our regret bounds for Thompson Sampling are directly comparable to the regret bounds for UCB family of algorithms which are a frequentist approach to the same problem.

Recently, TS has attracted considerable attention. Several studies (e.g., ) have empirically demonstrated the efficacy of Thompson Sampling: provides a detailed discussion of probability matching techniques in many general settings along with favorable empirical comparisons with other techniques. demonstrate that empirically TS achieves regret comparable to the lower bound of ; and in applications like display advertising and news article recommendation, it is competitive to or better than popular methods such as UCB. In their experiments, TS is also more robust to delayed or batched feedback (delayed feedback means that the result of a play of an arm may become available only after some time delay, but we are required to make immediate decisions for which arm to play next) than the other methods. A possible explanation may be that TS is a randomized algorithm and so it is unlikely to get trapped in an early bad decision during the delay. Microsoft’s adPredictor () for CTR prediction of search ads on Bing uses the idea of Thompson Sampling.

Despite being easy to implement, competitive to the state of the art methods, and popular in practice, TS lacked a strong theoretical analysis. provide weak guarantees, namely, a bound of $o(T)$ on expected regret in time $T$ . Significant progress was made in the recent work of and . In , the first logarithmic bound on expected regret of TS algorithm were proven. provided a bound that matches the asymptotic lower bound of for this problem. However, both these bounds were problem dependent, i.e. the regret bounds are logarithmic in $T$ when the problem parameters, namely the mean rewards for each arm, and their differences, are assumed to be constants. The problem-independent bounds implied by these existing works were far from optimal. Obtaining a problem-independent bound that is close to the lower bound of $\Omega(\sqrt{NT})$ was also posed as an open problem by Chapaelle and Li .

In this paper, we give a regret analysis for Thompson Sampling that provides both optimal problem-dependent and near-optimal problem-independent regret bounds for Thompson Sampling. Our novel martingale-based analysis technique is conceptually simple (arguably simpler than the previous work). Our technique easily extends to distributions other than Beta distribution, and it also extends to the more general contextual bandits setting . While the basic idea for the analysis in the contextual bandits setting of is inspired by the idea in this paper, the details are substantially different.

Before stating our results, we describe the MAB problem and the TS algorithm formally.

We consider the stochastic multi-armed bandit (MAB) problem: We are given a slot machine with $N$ arms; at each time step $t=1,2,3,\ldots$ , one of the $N$ arms must be chosen to be played. Each arm $i$ , when played, yields a random real-valued reward according to some fixed (unknown) distribution with support in $$. The random reward obtained from playing an arm repeatedly are i.i.d. and independent of the plays of the other arms. The reward is observed immediately after playing the arm.

Other performance measures include PAC-style guarantees; we do not consider those measures here.

2 Thompson Sampling

We provide the details of Thompson Sampling algorithm and our analysis for the Bernoulli bandit problem, i.e. when the rewards are either or $1$ , and for arm $i$ the probability of success (reward = $1$ ) is $\mu_{i}$ . This description of Thompson Sampling follows closely that of . A simple extension of this algorithm to general reward distributions with support $$ is described in , which seamlessly extends our analysis for Bernoulli bandits to general stochastic bandit problem.

The algorithm for Bernoulli bandits maintains Bayesian priors on the Bernoulli means $\mu_{i}$ ’s. Beta distribution turns out to be a very convenient choice of priors for Bernoulli rewards. Let us briefly recall that beta distributions form a family of continuous probability distributions on the interval $(0,1)$ . The pdf of $\text{Beta}(\alpha,\beta)$ , the beta distribution with parameters $\alpha>0$ , $\beta>0$ , is given by $f(x;\alpha,\beta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}$ . The mean of $\text{Beta}(\alpha,\beta)$ is $\alpha/(\alpha+\beta)$ ; and as is apparent from the pdf, higher the $\alpha,\beta$ , tighter is the concentration of $\text{Beta}(\alpha,\beta)$ around the mean. Beta distribution is useful for Bernoulli rewards because if the prior is a $\text{Beta}(\alpha,\beta)$ distribution, then after observing a Bernoulli trial, the posterior distribution is simply $\text{Beta}(\alpha+1,\beta)$ or $\text{Beta}(\alpha,\beta+1)$ , depending on whether the trial resulted in a success or failure, respectively.

The Thompson Sampling algorithm initially assumes arm $i$ to have prior $\text{Beta}(1,1)$ on $\mu_{i}$ , which is natural because $\text{Beta}(1,1)$ is the uniform distribution on $(0,1)$ . At time $t$ , having observed $S_{i}(t)$ successes (reward = $1$ ) and $F_{i}(t)$ failures (reward = ) in $k_{i}(t)=S_{i}(t)+F_{i}(t)$ plays of arm $i$ , the algorithm updates the distribution on $\mu_{i}$ as $\text{Beta}(S_{i}(t)+1,F_{i}(t)+1)$ . The algorithm then samples from these posterior distributions of the $\mu_{i}$ ’s, and plays an arm according to the probability of its mean being the largest. We summarize the Thompson Sampling algorithm below.

3 Our results

In this article, we bound the finite time expected regret of Thompson Sampling. From now on we will assume that the first arm is the unique optimal arm, i.e., $\mu^{*}=\mu_{1}>\arg\max_{i\neq 1}\mu_{i}$ . Assuming that the first arm is an optimal arm is a matter of convenience for stating the results and for the analysis and of course the algorithm does not use this assumption. The assumption of unique optimal arm is also without loss of generality, since adding more arms with $\mu_{i}=\mu^{*}$ can only decrease the expected regret; details of this argument were provided in .

(Problem-dependent bound) For the $N$ -armed stochastic bandit problem, Thompson Sampling algorithm has expected regret

in time $T$ , where $d(\mu_{i},\mu_{1})=\mu_{i}\log\frac{\mu_{i}}{\mu_{1}}+(1-\mu_{i})\log\frac{(1-\mu_{i})}{(1-\mu_{1})}$ . The big-Oh notation For any two functions $f(n),g(n)$ , $f(n)=O(g(n))$ if there exist two constants $n_{0}$ and $c$ such that for all $n\geq n_{0}$ , $f(n)\leq cg(n)$ . in above assumes $\mu_{i},\Delta_{i},i=1,\ldots,N$ to be constants.

(Problem-independent bound) For the $N$ -armed stochastic bandit problem, Thompson Sampling algorithm has expected regret

in time $T$ , where the big-Oh notation hides only the absolute constants.

Let us contrast our bounds with the previous work. Let us first consider the problem-dependent regret bounds, i.e., regret bounds that depend on problem parameters $\mu_{i},\Delta_{i},i=1,\ldots,N$ . Lai and Robbins essentially proved the following lower bound on the regret of any bandit algorithm (see for a precise statement):

They also gave algorithms asymptotically achieving this guarantee, though unfortunately their algorithms are not efficient. Auer et al. gave the UCB1 algorithm, which is efficient and achieves the following bound:

More recently, Kaufmann et al. gave Bayes-UCB algorithm which achieves the lower bound of for Bernoulli rewards. Bayes-UCB is a UCB like algorithm, where the upper confidence bounds are based on the quantiles of Beta posterior distributions. Interestingly, these upper confidence bounds turn out to be similar to those used by algorithms in and . Our bounds in Theorem 1 achieve the asymptotic lower bounds of , and match those provided by for Thompson Sampling.

Theorem 2 shows that Thompson Sampling also achieves a problem independent regret bound of $O(\sqrt{NT\ln T})$ on regret. This is the first analyis for TS that matches the $\Omega(\sqrt{NT})$ problem-inpdependent lower bound (see ) for this problem within logarithmic factors. The problem-dependent bounds in the existing work implied only suboptimal problem-independent bounds: implied a problem independent bound of $O(T^{2/3})$ . In , the additive problem dependent term was not explicitly calculated, which makes it difficult to derive the corresponding problem independent bound, but on a preliminary examination, it appears that it would involve an even higher power of $T$ . To compare with other existing algorithms for this problem, note that the best known problem-independent bound for the expected regret of UCB1 is also $O(\sqrt{NT\ln T})$ (see ). More recently, Audibert and Bubeck gave an algorithm MOSS, inspired by UCB1, with regret $O(\sqrt{NT})$ .

Proofs

In this section, we prove Theorem 1 and Theorem 2. The proofs of the two theorems follow the same steps, and diverge only towards the end of the analysis.

Our proof uses a martingale based analysis. Essentially, we prove that conditioned on any history of execution in the preceding steps, the probability of playing any suboptimal arm $i$ at the current step can be bounded by a linear function of the probability of playing the optimal arm at the current step. This is proven in Lemma 1, which forms the core of our analysis. Further, we show that the coefficient in this linear function decreases exponentially fast with the increase in the number of plays of optimal arm (refer to Lemma 4), this allows us to bound the total number of plays of every suboptimal arm, to bound the regret as desired. The difference between the analysis for obtaining the logarithmic problem-dependent bound of Theorem 1, and the problem-independent bound of Theorem 2 is merely technical, and occurs only towards the end of the proof. We recall some of the definitions introduced earlier, and introduce some new notations used in the proof. $F^{B}_{n,p}(\cdot)$ denotes the cdf and $f^{B}_{n,p}(\cdot)$ denotes the probability mass function of the binomial distribution with parameters $n,p$ . Let $F^{beta}_{\alpha,\beta}(\cdot)$ denote the cdf of the beta distribution with parameters $\alpha,\beta$ .

$k_{i}(t)$ is defined as the number of plays of arm $i$ until time $t-1$ , and $S_{i}(t)$ as the number of successes among the plays of arm $i$ until time $t-1$ . Also, $i(t)$ denotes the arm played at time $t$ .

For each arm $i$ , we will choose two thresholds $x_{i}$ and $y_{i}$ such that $\mu_{i}<x_{i}<y_{i}<\mu_{1}$ . The specific choice of these thresholds will depend on whether we are proving problem-dependent bound or problem-independent bound, and will be described at the approporiate points in the proof. Define $L_{i}(T)=\frac{\ln T}{d(x_{i},y_{i})}$ , and $\hat{\mu}_{i}(t)=S_{i}(t)/k_{i}(t)$ (define $\hat{\mu}_{i}(t)=1$ when $k_{i}(t)=0$ ). Define $E^{\mu}_{i}(t)$ as the event that $\hat{\mu}_{i}(t)\leq x_{i}$ . Define $E^{\theta}_{i}(t)$ as the event that $\theta_{i}(t)\leq y_{i}$ .

Intuitively, $E^{\mu}_{i}(t)$ , $E^{\theta}_{i}(t)$ are the events that $\hat{\mu}_{i}(t)$ and $\theta_{i}(t)$ , respectively, are not too far from the mean $\mu_{i}$ . As we show later, these events will hold with high probability for most time steps.

Define filtration ${\cal F}_{t-1}$ as the history of plays until time $t-1$ , i.e.

where $i(t)$ denotes the arm played at time $t$ , and $r_{i}(t)$ denotes the reward observed for arm $i$ at time $t$ .

Note that $p_{i,t}$ is determined by ${\cal F}_{t-1}$ .

where $p_{i,t}=\Pr(\theta_{1}(t)>y_{i}|{\cal F}_{t-1})$ .

Note that whether $E^{\mu}_{i}(t)$ is true or not is determined by ${\cal F}_{t-1}$ . Assume that filtration ${\cal F}_{t-1}$ is such that $E^{\mu}_{i}(t)$ is true (otherwise the probability on the left hand side is and the inequality is trivially true). It then suffices to prove that

Let $M_{i}(t)$ denote the event that arm $i$ exceeds all the suboptimal arms at time $t$ . That is,

We will prove the following two inequalities which immediately give (1).

Now, given $M_{i}(t),E^{\theta}_{i}(t)$ , it holds that for all $j\neq i,j\neq 1$ ,

The second last equality follows because the events $M_{i}(t)$ and $E^{\theta}_{i}(t),\forall i\neq 1$ involve conditions on only $\theta_{j}(t)$ for $j\neq 1$ , and given ${\cal F}_{t-1}$ (and hence $\hat{\mu}_{j}(t),k_{j}(t),\forall j$ ), $\theta_{1}(t)$ is independent of all the other $\theta_{j}(t),j\neq 1$ , and hence independent of these events. This together with (4) gives (2).

Since $E^{\theta}_{i}(t)$ is the event that $\theta_{i}(t)\leq y_{i}$ , therefore, given $E^{\theta}_{i}(t)$ , $i(t)=i$ only if $\theta_{1}(t)<y_{i}$ . This gives (3):

This essentially follows from application of Chernoff-Hoeffding bounds for concentration of $\hat{\mu}_{i}(t)$ . Refer to Appendix B for details. ∎

This essentially follows from the observation that the beta-distributed random variable $\theta_{i}(t)$ is well-concentrated around its mean when $k_{i}(t)$ is large, that is, larger than $L_{i}(T)$ . Refer to Appendix C for details. ∎

Let $\tau_{j}$ denote the time step at which $j^{th}$ trial of first arm happens, then

where $\Delta^{\prime}_{i}=\mu_{1}-y_{i}$ , $D_{i}=y_{i}\log\frac{y_{i}}{\mu_{1}}+(1-y_{i})\log\frac{1-y_{i}}{1-\mu_{1}}$ .

The proof of this inequality involves some careful algebraic manipulations using tight estimates for partial Binomial sums provided by . Refer to Appendix D for details. ∎

Proof of Theorem 1 and Theorem 2

Let $\tau_{k}$ denote the time step at which arm $1$ is played for the $k^{th}$ time for $k\geq 1$ , and let $\tau_{0}=0$ . Using the above lemmas,

The inequality marked $(*)$ uses the observation that $p_{i,t}=\Pr(\theta_{1}(t)>y_{i}|{\cal F}_{t-1})$ changes only when the distribution of $\theta_{1}(t)$ changes, that is, only on the time step after each play of first arm. Thus, $p_{i,t}$ is same at all time steps $t\in\{\tau_{k}+1,\ldots,\tau_{k+1}\}$ , for every $k$ .

For logarithmic problem-dependent bound of Theorem 1, for some $0<\epsilon\leq 1$ , we set $x_{i}\in(\mu_{i},\mu_{1})$ such that $d(x_{i},\mu_{1})=d(\mu_{i},\mu_{1})/(1+\epsilon)$ , and set $y_{i}\in(x_{i},\mu_{1})$ such that $d(x_{i},y_{i})=d(x_{i},\mu_{1})/(1+\epsilon)=d(\mu_{i},\mu_{1})/(1+\epsilon)^{2}$ (This way of choosing thresholds, in order to obtain bounds in terms of KL-divergences $d(\mu_{i},\mu_{1})$ rather than $\Delta_{i}$ s, is inspired by .). This gives

Also, by some simple algebraic manipulations of the equality $d(x_{i},\mu_{1})=d(\mu_{i},\mu_{1})/(1+\epsilon)$ , we can obtain

Here order notation is hiding functions of $\mu_{i}$ s and $\Delta_{i}$ s, since they are assumed to be constants.

where $\epsilon^{\prime}=3\epsilon$ , and the order notation in above hides $\mu_{i}$ s and $\Delta_{i}$ s in addition to the absolute constants.

For obtaining $O(\sqrt{NT\ln T})$ problem-independent bound of Theorem 2, we pick $x_{i}=\mu_{i}+\frac{\Delta_{i}}{3},y_{i}=\mu_{1}-\frac{\Delta_{i}}{3}$ , so that $\Delta^{\prime 2}=(\mu_{1}-y_{i})^{2}=\frac{\Delta_{i}^{2}}{9}$ , and using Pinsker’s inequality, $d(x_{i},\mu_{i})\geq\frac{1}{2}(x_{i}-\mu_{i})^{2}=\frac{\Delta_{i}^{2}}{18}$ , $d(x_{i},y_{i})\geq\frac{1}{2}(y_{i}-x_{i})^{2}\geq\frac{\Delta_{i}^{2}}{18}$ . Then,

We observe that in the worst case, for all suboptimal $i$ , $\Delta_{i}\geq\sqrt{\frac{N\ln T}{T}}$ . This is because the total regret on playing arms with $\Delta_{i}<\sqrt{\frac{N\ln T}{T}}$ instead of the optimal arm is at most $\sqrt{NT\ln T}$ . Thus, all the arms with $\Delta_{i}<\sqrt{\frac{N\ln T}{T}}$ can be assumed to be optimal arms. Also, in we proved that multiple optimal arms can only help.

Therefore, substituting $\Delta_{i}=\sqrt{\frac{N\ln T}{T}}$ ,

Acknowledgement.

We thank Emil Jeřábek for telling us about his estimates of partial binomial sums. We also thank MathOverflow for connecting us with Emil.

References

Appendix A Some results used in the proofs

Let $X_{1},\ldots,X_{n}$ be independent $0-1$ r.v.s with $E[X_{i}]=p_{i}$ (not necessarily equal). Let $X=\frac{1}{n}\sum_{i}=X_{i}$ , $\mu=E[X]=\frac{1}{n}\sum_{i=1}^{n}p_{i}$ . Then, for any $0<\lambda<1-\mu$ ,

where $d(a,b)=a\ln\frac{a}{b}+(1-a)\ln\frac{(1-a)}{(1-b)}$ .

for all positive integers $\alpha,\beta$ .

Appendix B Proof of Lemma 2

Let $\tau_{k}$ denote the time at which $k^{th}$ trial of arm $i$ happens. Let $\tau_{0}=0$ ; Then,

The second last inequality follows from the observation that the event $\overline{E^{\mu}_{i}(t)}$ was defined as $\hat{\mu}_{i}(t)>x_{i}$ , where $\hat{\mu}_{i}(t)$ is the average of the outcomes observed from the plays of arm $i$ until time $t-1$ . Thus at time $\tau_{k}+1$ , $\hat{\mu}_{i}(\tau_{k}+1)$ is simply the average of the outcomes observed from $k$ i.i.d. plays of arm $i$ , each of which is a Bernoulli trial with mean $\mu_{i}$ . Using Chernoff-Hoeffding bounds (Fact 1), we obtain that $\Pr(\hat{\mu}_{i}(\tau_{k}+1)>x_{i})\leq e^{-kd(x_{i},\mu_{i})}$ . ∎

Appendix C Proof of Lemma 3

where the last inequality follows from Chernoff-Hoeffding bounds (refer to Fact 1). Therefore, for $t$ such that $k_{i}(t)>L_{i}(T)$ ,

Let $\tau$ be the largest time step until $k_{i}(t)\leq L_{i}(T)$ , then,

Appendix D Proof of Lemma 4

Let $k_{1}(t)=j,S_{1}(t)=s$ . Let $y=y_{i}$ . Then, $p_{i,t}=\Pr(\theta_{1}(t)>y)=F^{B}_{j+1,y}(s)$ . Let $\tau_{j}+1$ denote the time step after the $(j)^{th}$ play of arm $1$ . Then, $k_{1}(\tau_{j}+1)=j$ , and

Let $R=\frac{\mu_{1}(1-y)}{y(1-\mu_{1})}$ , $D=y\log\frac{y}{\mu_{1}}+(1-y)\log\frac{1-y}{1-\mu_{1}}$ .

We will divide the sum $Sum(0,j)=\sum_{s=0}^{j}\frac{f_{j,\mu_{1}}(s)}{F_{j+1,y}(s)}$ into four partial sums and prove that

Together, the above estimates will prove the required bound.

We use the following bounds on the cdf of Binomial distribution [12, Prop. A.4]. For $s\leq y(j+1)-\sqrt{(j+1)y(1-y)}$ ,

Bounding Sum(0,⌊yj⌋−1)𝑆𝑢𝑚0𝑦𝑗1Sum(0,\lfloor yj\rfloor-1).

Using the bounds just given, for any $s$ ,

We now bound the first expression on the RHS.

Now, $R-1=\frac{\mu_{1}(1-y)}{y(1-\mu_{1})}-1=\frac{\mu_{1}-y}{y(1-\mu_{1})}$ . And, $\frac{R}{R-1}=\frac{\mu_{1}(1-y)}{\mu_{1}-y}$ . Therefore,

Bounding Sum(⌊yj⌋,⌊yj⌋)𝑆𝑢𝑚𝑦𝑗𝑦𝑗Sum(\lfloor yj\rfloor,\lfloor yj\rfloor).

We use $\frac{f_{j,\mu_{1}}(s)}{F_{j+1,y}(s)}\leq\frac{f_{j,\mu_{1}}(s)}{f_{j+1,y}(s)}=\left(1-\frac{s}{j+1}\right)R^{s}\frac{(1-\mu_{1})^{j}}{(1-y)^{j+1}}$ , to get

The last inequality uses $j\geq\frac{1}{\Delta^{\prime}}\geq\frac{1}{1-y}$ .

Now, if $j>\frac{1}{\Delta^{\prime}}$ ,then $\sqrt{(j+1)y(1-y)}>\sqrt{y}>y$ , so $y(j+1)-\sqrt{(j+1)y(1-y)}<yj\leq\lceil yj\rceil$ . Therefore, for $s\geq\lceil yj\rceil$ , $F_{j+1,y}(s)=\Theta(1).$ Using this observation, we derive the following.

where the inequality follows using Chernoff-Hoeffding bounds (refer to Fact 2).

For $s\geq\lceil\mu_{1}j-\frac{\Delta^{\prime}}{2}j\rceil=\lceil yj+\frac{\Delta^{\prime}}{2}j\rceil$ , again using Chernoff-Hoeffding bounds from Fact 2,

The last inequality uses $j\geq\frac{8}{\Delta^{\prime}}$ .

Combining, we get for $j\geq\frac{8}{\Delta^{\prime}}$ ,