Thompson Sampling for Contextual Bandits with Linear Payoffs

Shipra Agrawal, Navin Goyal

Introduction

Multi-armed bandit (MAB) problems model the exploration/exploitation trade-off inherent in many sequential decision problems. There are many versions of multi-armed bandit problems; a particularly useful version is the contextual multi-armed bandit problem. In this problem, in each of TT rounds, a learner is presented with the choice of taking one out of NN actions, referred to as NN arms. Before making the choice of which arm to play, the learner sees dd-dimensional feature vectors bib_{i}, referred to as “context”, associated with each arm ii. The learner uses these feature vectors along with the feature vectors and rewards of the arms played by her in the past to make the choice of the arm to play in the current round. Over time, the learner’s aim is to gather enough information about how the feature vectors and rewards relate to each other, so that she can predict, with some certainty, which arm is likely to give the best reward by looking at the feature vectors. The learner competes with a class of predictors, in which each predictor takes in the feature vectors and predicts which arm will give the best reward. If the learner can guarantee to do nearly as well as the predictions of the best predictor in hindsight (i.e., have low regret), then the learner is said to successfully compete with that class.

Thompson Sampling (TS) is one of the earliest heuristics for multi-armed bandit problems. The first version of this Bayesian heuristic is around 80 years old, dating to Thompson (1933). Since then, it has been rediscovered numerous times independently in the context of reinforcement learning, e.g., in Wyatt (1997); Ortega & Braun (2010); Strens (2000). It is a member of the family of randomized probability matching algorithms. The basic idea is to assume a simple prior distribution on the underlying parameters of the reward distribution of every arm, and at every time step, play an arm according to its posterior probability of being the best arm. The general structure of TS for the contextual bandits problem involves the following elements:

past observations D{\cal D} consisting of (context bb, reward rr) for the past time steps;

In each round, TS plays an arm according to its posterior probability of having the best parameter. A simple way to achieve this is to produce a sample of parameter for each arm, using the posterior distributions, and play the arm that produces the best sample. In this paper, we design and analyze a natural generalization of Thompson Sampling (TS) for contextual bandits; this generalization fits the above general structure, and uses Gaussian prior and Gaussian likelihood function. We emphasize that although TS is a Bayesian approach, the description of the algorithm and our analysis apply to the prior-free stochastic MAB model, and our regret bounds will hold irrespective of whether or not the actual reward distribution matches the Gaussian likelihood function used to derive this Bayesian heuristic. Thus, our bounds for TS algorithm are directly comparable to the UCB family of algorithms which form a frequentist approach to the same problem. One could interpret the priors used by TS as a way of capturing the current knowledge about the arms.

Recently, TS has attracted considerable attention. Several studies (e.g., Granmo (2010); Scott (2010); Graepel et al. (2010); Chapelle & Li (2011); May & Leslie (2011); Kaufmann et al. (2012)) have empirically demonstrated the efficacy of TS: Scott (2010) provides a detailed discussion of probability matching techniques in many general settings along with favorable empirical comparisons with other techniques. Chapelle & Li (2011) demonstrate that for the basic stochastic MAB problem, empirically TS achieves regret comparable to the lower bound of Lai & Robbins (1985); and in applications like display advertising and news article recommendation modeled by the contextual bandits problem, it is competitive to or better than the other methods such as UCB. In their experiments, TS is also more robust to delayed or batched feedback than the other methods. TS has been used in an industrial-scale application for CTR prediction of search ads on search engines (Graepel et al., 2010). Kaufmann et al. (2012) do a thorough comparison of TS with the best known versions of UCB and show that TS has the lowest regret in the long run.

However, the theoretical understanding of TS is limited. Granmo (2010) and May et al. (2011) provided weak guarantees, namely, a bound of o(T)o(T) on the expected regret in time TT. For the the basic (i.e. without contexts) version of the stochastic MAB problem, some significant progress was made by Agrawal & Goyal (2012), Kaufmann et al. (2012) and, more recently, by Agrawal & Goyal (2013b), who provided optimal regret bounds on the expected regret. But, many questions regarding theoretical analysis of TS remained open, including high probability regret bounds, and regret bounds for the more general contextual bandits setting. In particular, the contextual MAB problem does not seem easily amenable to the techniques used so far for analyzing TS for the basic MAB problem. In Section 3.1, we describe some of these challenges. Some of these questions and difficulties were also formally raised as a COLT 2012 open problem (Chapelle & Li, 2012).

In this paper, we use novel martingale-based analysis techniques to demonstrate that TS (i.e., our Gaussian prior based generalization of TS for contextual bandits) achieves high probability, near-optimal regret bounds for stochastic contextual bandits with linear payoff functions. To our knowledge, ours are the first non-trivial regret bounds for TS for the contextual bandits problem. Additionally, our results are the first high probability regret bounds for TS, even in the case of basic MAB problem. This essentially solves the COLT 2012 open problem by(Chapelle & Li, 2012) for contextual bandits with linear payoffs.

Our version of Thompson Sampling algorithm for the contextual MAB problem, described formally in Section 2.2, uses Gaussian prior and Gaussian likelihood functions. Our techniques can be extended to the use of other prior distributions, satisfying certain conditions, as discussed in Section 4.

Problem setting and algorithm description

Ht1={a(τ),ra(τ)(τ),bi(τ),i=1,,N,τ=1,,t1},{\cal H}_{t-1}=\{a(\tau),r_{a(\tau)}(\tau),b_{i}(\tau),i=1,\ldots,N,\tau=1,\ldots,t-1\},

An algorithm for the contextual bandit problem needs to choose, at every time tt, an arm a(t)a(t) to play, using history Ht1{\cal H}_{t-1} and current contexts bi(t),i=1,,Nb_{i}(t),i=1,\ldots,N. Let a(t)a^{*}(t) denote the optimal arm at time tt, i.e. a(t)=argmaxibi(t)Tμ.a^{*}(t)=\arg\max_{i}b_{i}(t)^{T}\mu. And let Δi(t)\Delta_{i}(t) be the difference between the mean rewards of the optimal arm and of arm ii at time tt, i.e.,

Δi(t)=ba(t)(t)Tμbi(t)Tμ.\Delta_{i}(t)=b_{a^{*}(t)}(t)^{T}\mu-b_{i}(t)^{T}\mu.

Then, the regret at time tt is defined as

The objective is to minimize the total regret R(T)=t=1T\mboxregret(t){\cal R}(T)=\sum_{t=1}^{T}\mbox{regret}(t) in time TT. The time horizon TT is finite but possibly unknown.

We assume that ηi,t=ri(t)bi(t)Tμ\eta_{i,t}=r_{i}(t)-b_{i}(t)^{T}\mu is conditionally RR-sub-Gaussian for a constant R0R\geq 0, i.e.,

An alternative definition of regret that appears in the literature is

\mboxregret(t)=ra(t)(t)ra(t)(t).{\mbox{regret}(t)=r_{a^{*}(t)}(t)-r_{a(t)}(t).}

We can obtain the same regret bounds for this alternative definition of regret. The details are provided in the supplementary material in Appendix LABEL:app:proof.

2 Thompson Sampling algorithm

We use Gaussian likelihood function and Gaussian prior to design our version of Thompson Sampling algorithm. More precisely, suppose that the likelihood of reward ri(t)r_{i}(t) at time tt, given context bi(t)b_{i}(t) and parameter μ\mu, were given by the pdf of Gaussian distribution N(bi(t)Tμ,v2){\cal N}(b_{i}(t)^{T}\mu,v^{2}). Here, v=R9dln(Tδ)v=R\sqrt{9d\ln(\frac{T}{\delta})}.Let

B(t)=Id+τ=1t1ba(τ)(τ)ba(τ)(τ)TB(t)=I_{d}+\sum_{\tau=1}^{t-1}b_{a(\tau)}(\tau)b_{a(\tau)}(\tau)^{T}

μ^(t)=B(t)1(τ=1t1ba(τ)(τ)ra(τ)(τ)).\hat{\mu}(t)=B(t)^{-1}\left(\sum_{\tau=1}^{t-1}b_{a(\tau)}(\tau)r_{a(\tau)}(\tau)\right).

Then, if the prior for μ\mu at time tt is given by N(μ^(t),v2B(t)1){\cal N}(\hat{\mu}(t),v^{2}B(t)^{-1}), it is easy to compute the posterior distribution at time t+1t+1,

We emphasize that the Gaussian priors and the Gaussian likelihood model for rewards are only used above to design the Thompson Sampling algorithm for contextual bandits. Our analysis of the algorithm allows these models to be completely unrelated to the actual reward distribution. The assumptions on the actual reward distribution are only those mentioned in Section 2.1, i.e., the RR-sub-Gaussian assumption.

The parameter v=R9dln(Tδ)v=R\sqrt{9d\ln(\frac{T}{\delta})} can be replaced by vt=R9dln(tδ)v_{t}=R\sqrt{9d\ln(\frac{t}{\delta})} at time tt, if the time horizon TT is not known. In fact, this is the version of Thompson Sampling that we will analyze. The analysis we provide can be applied as it is (with only notational changes) to the version using the fixed value of vv for all time steps, to get the same regret upper bound.

Computational efficiency:

3 Our Results

With probability 1δ1-\delta, the total regret for Thompson Sampling algorithm in time TT is bounded as

whichever is smaller, for any 0<δ<10<\delta<1, where δ\delta is a parameter used by the algorithm.

The regret bound in Equation (1) does not depend on NN, and are applicable to the case of infinite arms, with only notational changes required in the analysis.

4 Related Work

The contextual bandit problem with linear payoffs is a widely studied problem in statistics and machine learning often under different names as mentioned by Chu et al. (2011): bandit problems with co-variates (Woodroofe, 1979; Sarkar, 1991), associative reinforcement learning (Kaelbling, 1994), associative bandit problems (Auer, 2002; Strehl et al., 2006), bandit problems with expert advice (Auer et al., 2002), and linear bandits (Dani et al., 2008; Abbasi-Yadkori et al., 2011; Bubeck et al., 2012). The name contextual bandits was coined in Langford & Zhang (2007).

For finite NN, Chu et al. (2011) show a lower bound of Ω(Td)\Omega(\sqrt{Td}) for d2Td^{2}\leq T. Auer (2002) and Chu et al. (2011) analyze SupLinUCB, a complicated algorithm using UCB as a subroutine, for this problem. Chu et al. (2011) achieve a regret bound of O(Tdln3(NTln(T)/δ))O(\sqrt{Td\ln^{3}(NT\ln(T)/\delta)}) with probability at least 1δ1-\delta (Auer (2002) proves similar results). This regret bound is not applicable to the case of infinite arms, and assumes that context vectors are generated by an oblivious adversary. Also, this regret bound would give O(d2T)O(d^{2}\sqrt{T}) regret if NN is exponential in dd. The state-of-the-art bounds for linear bandits problem in case of finite NN are given by Bubeck et al. (2012). They provide an algorithm based on exponential weights, with regret of order dTlogN\sqrt{dT\log N} for any finite set of NN actions. This also gives O(dT)O(d\sqrt{T}) regret when NN is exponential in dd.

Our results demonstrate that the natural and efficient heuristic of Thompson Sampling can achieve theoretical bounds that are close to the best bounds. The main contribution of this paper is to provide new tools for analysis of Thompson Sampling algorithm for contextual bandits, which despite being popular and empirically attractive, has eluded theoretical analysis. We believe the techniques used in this paper will provide useful insights into the workings of this Bayesian algorithm, and may be useful for further improvements and extensions.

Regret Analysis: Proof of Theorem 1

The contextual version of the multi-armed bandit problem presents new challenges for the analysis of TS algorithm, and the techniques used so far for analyzing the basic multi-armed bandit problem by Agrawal & Goyal (2012); Kaufmann et al. (2012) do not seem directly applicable. Let us describe some of these difficulties and our novel ideas to resolve them.

In general, the basis of regret analysis for stochastic MAB is to prove that the variances of empirical estimates for all arms decrease fast enough, so that the regret incurred until the variances become small enough, is small. In the basic MAB, the variance of the empirical mean is inversely proportional to the number of plays ki(t)k_{i}(t) of arm ii at time tt. Thus, every time the suboptimal arm ii is played, we know that even though a regret of μiμi1\mu_{i^{*}}-\mu_{i}\leq 1 is incurred, there is also an improvement of exactly 11 in the number of plays of that arm, and hence, corresponding decrease in the variance. The techniques for analyzing basic MAB rely on this observation to precisely quantify the exploration-exploitation tradeoff. On the other hand, the variance of the empirical mean for the contextual case is given by inverse of Bi(t)=τ=1:a(τ)=itbi(τ)2B_{i}(t)=\sum_{\tau=1:a(\tau)=i}^{t}b_{i}(\tau)^{2}. When a suboptimal arm ii is played, if bi(t)b_{i}(t) is small, the regret ba(t)(t)μa(t)bi(t)μib_{a^{*}(t)}(t)\mu_{a^{*}(t)}-b_{i}(t)\mu_{i} could be much higher than the improvement bi(t)2b_{i}(t)^{2} in Bi(t)B_{i}(t).

In our proof, we overcome this difficulty by dividing the arms into two groups at any time: saturated and unsaturated arms, based on whether the standard deviation of the estimates for an arm is smaller or larger compared to the standard deviation for the optimal arm. The optimal arm is included in the group of unsaturated arms. We show that for the unsaturated arms, the regret on playing the arm can be bounded by a factor of the standard deviation, which improves every time the arm is played. This allows us to bound the total regret due to unsaturated arms. For the saturated arms, standard deviation is small, or in other words, the estimates of the means constructed so far are quite accurate in the direction of the current contexts of these arms, so that the algorithm is able to distinguish between them and the optimal arm. We utilize this observation to show that the probability of playing such arms is small, and at every time step an unsaturated arm will be played with some constant probability. Below is a more technical outline of the proof of Theorem 1. At any time step tt, we divide the arms into two groups:

saturated arms defined as those with Δi(t)>gt si(t)\Delta_{i}(t)>g_{t}\ s_{i}(t),

unsaturated arms defined as those with Δi(t)gt si(t)\Delta_{i}(t)\leq g_{t}\ s_{i}(t),

To bound the regret irrespective of whether a saturated or unsaturated arm is played at time tt, we lower bound the probability of playing an unsaturated arm at any time tt. More precisely, we define Ft1{\cal F}_{t-1} as the union of history Ht1{\cal H}_{t-1} and the contexts bi(t),i=1,,Nb_{i}(t),i=1,\ldots,N at time tt, and prove that for “most” (in a high probability sense) Ft1{\cal F}_{t-1},

\Pr\left(\mbox{a(t)is a unsaturated arm}\ \vline\ {\cal F}_{t-1}\right)\geq p-\frac{1}{t^{2}},

where p=14eπp=\frac{1}{4e\sqrt{\pi}}. Note that for pp is constant for ϵt=1/ln(t)\epsilon_{t}=1/\ln(t). This observation allows us to establish that the expected regret at any time step tt is upper bounded in terms of regret due to playing an unsaturated arm at that time, i.e. in terms of sa(t)(t)s_{a(t)}(t). More precisely, we prove that for “most” Ft1{\cal F}_{t-1}

We use these observations to establish that (Xt;t0)(X_{t};t\geq 0), where

Xt\mboxregret(t)3gtpsa(t)(t)2gtpt2,X_{t}\simeq\mbox{regret}(t)-\frac{3g_{t}}{p}s_{a(t)}(t)-\frac{2g_{t}}{pt^{2}},

is a super-martingale difference process adapted to filtration Ft{\cal F}_{t}. Then, using the Azuma-Hoeffding inequality for super-martingales, along with the inequality tsa(t)(t)=O(TdlnT)\sum_{t}s_{a(t)}(t)=O(\sqrt{Td\ln T}), we will obtain the desired high probability regret bound.

2 Formal proof

As mentioned earlier, we will analyze the version of Algorithm 1 that uses vt=R9dln(tδ)v_{t}=R\sqrt{9d\ln(\frac{t}{\delta})} instead of v=R9dln(Tδ)v=R\sqrt{9d\ln(\frac{T}{\delta})} at time tt.

We start with introducing some notations. For quick reference, the notations introduced below also appear in a table of notations at the beginning of the supplementary material.

Recall that Δi(t)=ba(t)(t)Tμbi(t)Tμ\Delta_{i}(t)=b_{a^{*}(t)}(t)^{T}\mu-b_{i}(t)^{T}\mu, the difference between the mean reward of optimal arm and arm ii at time tt.

Define Eμ(t)E^{\mu}(t) and Eθ(t)E^{\theta}(t) as the events that bi(t)Tμ^(t)b_{i}(t)^{T}\hat{\mu}(t) and θi(t)\theta_{i}(t) are concentrated around their respective means. More precisely, define Eμ(t)E^{\mu}(t) as the event that

i:θi(t)bi(t)Tμ^(t)min{4dln(t),4log(tN) }vt si(t).\forall i:|\theta_{i}(t)-b_{i}(t)^{T}\hat{\mu}(t)|\leq\min\{\sqrt{4d\ln(t)},\sqrt{4\log(tN)}\ \}v_{t}\ s_{i}(t).

An arm ii is called saturated at time tt if Δi(t)>gtsi(t)\Delta_{i}(t)>g_{t}s_{i}(t), and unsaturated otherwise. Let C(t)C(t) denote the set of saturated arms at time tt. Note that the optimal arm is always unsaturated at time tt, i.e., a(t)C(t)a^{*}(t)\notin C(t). An arm may keep shifting from saturated to unsaturated and vice-versa over time.

Define filtration Ft1{\cal F}_{t-1} as the union of history until time t1t-1, and the contexts at time tt, i.e., Ft1={Ht1,bi(t),i=1,,N}{\cal F}_{t-1}=\{{\cal H}_{t-1},b_{i}(t),i=1,\ldots,N\}.

By definition, F1F2FT1{\cal F}_{1}\subseteq{\cal F}_{2}\cdots\subseteq{\cal F}_{T-1}. Observe that the following quantities are determined by the history Ht1{\cal H}_{t-1} and the contexts bi(t)b_{i}(t) at time tt, and hence are included in Ft1{\cal F}_{t-1},

the identity of the optimal arm a(t)a^{*}(t) and the set of saturated arms C(t)C(t),

For all tt, 0<δ<10<\delta<1, Pr(Eμ(t))1δt2.\Pr(E^{\mu}(t))\geq 1-\frac{\delta}{t^{2}}. And, for all possible filtrations Ft1{\cal F}_{t-1}, Pr(Eθ(t)Ft1)11t2.\Pr(E^{\theta}(t)|{\cal F}_{t-1})\geq 1-\frac{1}{t^{2}}.

The complete proof of this lemma appears in Appendix LABEL:app:E(t). The probability bound for Eμ(t)E^{\mu}(t) will be proven using a concentration inequality given by Abbasi-Yadkori et al. (2011), stated as Lemma LABEL:lem:dDim-concentration in Appendix LABEL:app:concentration. The RR-sub-Gaussian assumption on rewards will be utilized here. The probability bound for Eθ(t)E^{\theta}(t) will be proven using a concentration inequality for Gaussian random variables from Abramowitz & Stegun (1964) stated as Lemma LABEL:lem:gauss in Appendix LABEL:app:concentration . ∎

For any filtration Ft1{\cal F}_{t-1} such that Eμ(t)E^{\mu}(t) is true,

Pr(θa(t)(t)>ba(t)(t)Tμ \vline Ft1)p.\Pr\left(\theta_{a^{*}(t)}(t)>b_{a^{*}(t)}(t)^{T}\mu\ \vline\ {\cal F}_{t-1}\right)\geq p.

The following lemma bounds the probability of playing saturated arms in terms of the probability of playing unsaturated arms.

For any filtration Ft1{\cal F}_{t-1} such that Eμ(t)E^{\mu}(t) is true,

By definition, for all saturated arms, i.e. for all jC(t)j\in C(t), Δj(t)>gtst,j\Delta_{j}(t)>g_{t}s_{t,j}. Also, if both the events Eμ(t)E^{\mu}(t) and Eθ(t)E^{\theta}(t) are true then, by the definitions of these events, for all jC(t)j\in C(t), θj(t)bj(t)Tμ+gtst,j\theta_{j}(t)\leq b_{j}(t)^{T}\mu+g_{t}s_{t,j}. Therefore, given an Ft1{\cal F}_{t-1} such that Eμ(t)E^{\mu}(t) is true, either Eθ(t)E^{\theta}(t) is false, or else for all jC(t)j\in C(t),

Hence, for any Ft1{\cal F}_{t-1} such that Eμ(t)E^{\mu}(t) is true,

The last inequality uses Lemma 2 and Lemma 1. ∎

For any filtration Ft1{\cal F}_{t-1} such that Eμ(t)E^{\mu}(t) is true,

Let aˉ(t)\bar{a}(t) denote the unsaturated arm with smallest si(t)s_{i}(t), i.e.

Note that since C(t)C(t) and si(t)s_{i}(t) for all ii are fixed on fixing Ft1{\cal F}_{t-1}, so is aˉ(t)\bar{a}(t).

Now, using Lemma 3, for any Ft1{\cal F}_{t-1} such that Eμ(θ)E^{\mu}(\theta) is true,

Now, if events Eμ(t)E^{\mu}(t) and Eθ(t)E^{\theta}(t) are true, then for all ii, by definition, θi(t)bi(t)Tμ+gtsi(t)\theta_{i}(t)\leq b_{i}(t)^{T}\mu+g_{t}s_{i}(t). Using this observation along with the fact that θa(t)(t)θi(t)\theta_{a(t)}(t)\geq\theta_{i}(t) for all ii,

Therefore, for any Ft1{\cal F}_{t-1} such that Eμ(θ)E^{\mu}(\theta) is true either Δa(t)(t)2gtst,aˉ(t)+gtsa(t)(t)\Delta_{a(t)}(t)\leq 2g_{t}s_{t,\bar{a}}(t)+g_{t}s_{a(t)}(t) or Eθ(t)E^{\theta}(t) is false. Therefore,

In the first inequality we used that for all ii, Δi(t)1\Delta_{i}(t)\leq 1. The second inequality used the inequality derived in the beginning of this proof, and Lemma 1 to apply Pr(Eθ(t))1t2\Pr\left(\overline{E^{\theta}(t)}\right)\leq\frac{1}{t^{2}}. The third inequality used the observation that 0st,a(t)ba(t)(t)10\leq s_{t,a(t)}\leq||b_{a(t)}(t)||\leq 1.

Recall that \mboxregret(t)\mbox{regret}(t) was defined as, \mboxregret(t)=Δa(t)(t)=ba(t)(t)Tμba(t)(t)Tμ.\mbox{regret}(t)=\Delta_{a(t)}(t)=b_{a^{*}(t)}(t)^{T}\mu-b_{a(t)}(t)^{T}\mu. Define \mboxregret(t)=\mboxregret(t)I(Eμ(t)).\mbox{regret}^{\prime}(t)=\mbox{regret}(t)\cdot I(E^{\mu}(t)).

Next, we establish a super-martingale process that will form the basis of our proof of the high-probability regret bound.

(Yt;t=0,,T)(Y_{t};t=0,\ldots,T) is a super-martingale process with respect to filtration Ft{\cal F}_{t}.

Note that whether Eμ(t)E^{\mu}(t) is true or not is completely determined by Ft1{\cal F}_{t-1}. If Ft1{\cal F}_{t-1} is such that Eμ(t)E^{\mu}(t) is not true, then \mboxregret(t)=\mboxregret(t)I(Eμ(t))=0\mbox{regret}^{\prime}(t)=\mbox{regret}(t)\cdot I(E^{\mu}(t))=0, and the above inequality holds trivially. And, for Ft1{\cal F}_{t-1} such that Eμ(t)E^{\mu}(t) holds, the inequality follows from Lemma 4. ∎

Note that XtX_{t} is bounded, Xt1+3pgt+2pt2gt6pgt|X_{t}|\leq 1+\frac{3}{p}g_{t}+\frac{2}{pt^{2}}g_{t}\leq\frac{6}{p}g_{t}. Thus, we can apply Azuma-Hoeffding inequality (see Lemma LABEL:lem:azuma in Appendix LABEL:app:concentration), to obtain that with probability 1δ21-\frac{\delta}{2},

Note that pp is a constant. Also, by definition, gtgTg_{t}\leq g_{T}. Therefore, from above equation, with probability 1δ21-\frac{\delta}{2},

Now, we can use t=1Tsa(t)(t)5dTlnT,\sum_{t=1}^{T}s_{a(t)}(t)\leq 5\sqrt{dT\ln T}, which can be derived along the lines of Lemma 3 of Chu et al. (2011) using Lemma 11 of Auer (2002) (see Appendix LABEL:app:proof for details). Also, by definition gT=O(dln(Tδ)(min{d,log(N)}))g_{T}=O(\sqrt{d\ln(\frac{T}{\delta})}\cdot(\min\{\sqrt{d},\sqrt{\log(N)}\})) (see the Table of notations in the beginning of the supplementary material). Substituting in above, we get

Also, because Eμ(t)E^{\mu}(t) holds for all tt with probability at least 1δ21-\frac{\delta}{2} (see Lemma 1), \mboxregret(t)=\mboxregret(t)\mbox{regret}^{\prime}(t)=\mbox{regret}(t) for all tt with probability at least 1δ21-\frac{\delta}{2}. Hence, with probability 1δ1-\delta,

The proof for the alternate definition of regret mentioned in Remark 1 is provided in Appendix LABEL:app:proof.

Conclusions

We provided a theoretical analysis of Thompson Sampling for the stochastic contextual bandits problem with linear payoffs. Our results resolve some open questions regarding the theoretical guarantees for Thompson Sampling, and establish that even for the contextual version of the stochastic MAB problem, TS achieves regret bounds close to the state-of-the-art methods. We used a novel martingale-based analysis technique which is arguably simpler than the techniques in the past work on TS (Agrawal & Goyal, 2012; Kaufmann et al., 2012), and is amenable to extensions.

This is an extended and slightly modified version of Agrawal & Goyal (2013a).

References