An Information-Theoretic Analysis of Thompson Sampling
Daniel Russo, Benjamin Van Roy
Introduction
This paper considers the problem of repeated decision making in the presence of model uncertainty. A decision-maker repeatedly chooses among a set of possible actions, observes an outcome, and receives a reward representing the utility derived from this outcome. The decision-maker is uncertain about the underlying system and is therefore initially unsure of which action is best. However, as outcomes are observed, she is able to learn over time to make increasingly effective decisions. Her objective is to choose actions sequentially so as to maximize the expected cumulative reward.
We focus on settings with partial feedback, under which the decision-maker does not generally observe what the reward would have been had she selected a different action. This feedback structure leads to an inherent tradeoff between exploration and exploitation: by experimenting with poorly understood actions one can learn to make more effective decisions in the future, but focusing on better understood actions may lead to higher rewards in the short term. The classical multi–armed bandit problem is an important special case of this formulation. In such problems, the decision–maker only observes rewards she receives, and rewards from one action provide no information about the reward that can be attained by selecting other actions. We are interested here primarily in models and algorithms that accommodate cases where the number of actions is very large and there is a richer information structure relating actions and observations. This category of problems is often referred to as online optimization with partial feedback.
A large and growing literature treats the design and analysis of algorithms for such problems. An online optimization algorithm typically starts with two forms of prior knowledge. The first – hard knowledge – posits that the mapping from action to outcome distribution lies within a particular family of mappings. The second – soft knowledge – concerns which of these mappings are more or less likely to match reality. Soft knowledge evolves with observations and is typically represented in terms of a probability distribution or a confidence set.
Much recent work concerning online optimization algorithms focuses on establishing performance guarantees in the form of regret bounds. Surprisingly, virtually all of these regret bounds depend on hard knowledge but not soft knowledge, a notable exception being the bounds of Srinivas et al. (2012) which we discuss further in Section 1.2. Regret bounds that depend on hard knowledge yield insight into how an algorithm’s performance scales with the complexity of the family of mappings and are useful for delineating algorithms on that basis, but if a regret bound does not depend on soft knowledge, it does not have much to say about how future performance should improve as data is collected. The latter sort of insight should be valuable for designing better ways of trading off between exploration and exploitation, since it is soft knowledge that is refined as a decision-maker learns. Another important benefit to understanding how performance depends on soft knowledge arises in practical applications: when designing an online learning algorithm, one may have access to historical data and want to understand how this prior information benefits future performance.
In this paper, we establish regret bounds that depend on both hard and soft knowledge for a simple online learning algorithm alternately known as Thompson sampling, posterior sampling, or probability matching. The bounds strengthen results from prior work not only with respect to Thompson sampling but relative to regret bounds for any online optimization algorithm. Further, the bounds offer new insight into how regret depends on soft knowledge. Indeed, forthcoming work of ours leverages this to produce an algorithm that outperforms Thompson sampling.
We found information theory to provide tools ideally suited for deriving our new regret bounds, and our analysis inherits the simplicity and elegance enjoyed by work in that field. Our formulation encompasses a broad family of information structures, including as special cases multi–armed bandit problems with independent arms, online optimization problems with full information, linear bandit problems, and problems with combinatorial action sets and “semi–bandit” feedback. We leverage information theory to provide a unified analysis that applies to each of those special cases, establishing that Thompson sampling satisfies order optimal Bayesian regret bounds for each one.For two of these special cases, the bounds are only tight up to a logarithmic factor.
A novel feature of our bounds is their dependence on the entropy of the optimal-action distribution. To our knowledge, these are the first bounds on the expected regret of any algorithm that depend on the magnitude of the agent’s uncertainty about which action is optimal. The fact that our bounds only depend on uncertainties relevant to optimizing performance highlights the manner in which Thompson sampling naturally exploits complex information structures. Further, in practical contexts, a decision-maker may begin with an understanding that some actions are more likely to be optimal than others. For example, when dealing with a shortest path problem, one might expect paths that traverse fewer edges to generally incur less cost. Our bounds are the first to formalize the performance benefits afforded by such an understanding.
Our analysis is based on a general probabilistic, or Bayesian, formulation in which uncertain quantities are modeled as random variables. In principle, the optimal strategy for such a problem could be calculated via dynamic programing, but for problem classes of practical interest this would be computationally intractable. Thompson sampling serves as a simple and elegant heuristic strategy. In this section, we provide a somewhat informal problem statement, and a preview of our main results about Thompson sampling. In the next subsection we discuss how these results relate to the existing literature.
for all . When an action is sampled, a random reward in $T$ is bounded by
Because the entropy of is always less than , (2) yields a bound of order , and this scaling cannot be improved in general (see Section 6.5). The bound is stronger than this worst–case bound, since the entropy of can be much smaller than .
Thompson sampling incorporates prior knowledge in a flexible and coherent way, and the benefits of this are reflected in two distinct ways by the above bound. First, as in past work (see e.g. Dani et al., 2008; Rusmevichientong and Tsitsiklis, 2010), the bound depends on the dimension of the linear model instead of the number of actions. This reflects that the algorithm is able to learn more rapidly by exploiting the known model, since observations from selecting one action provide information about rewards that would have been generated by other actions. Second, the bound depends on the entropy of the prior distribution of instead of a worst case measure like the logarithm of the number of actions. This highlights the benefit of prior knowledge that some actions are more likely to be optimal than others. In particular, this bound exhibits the natural property that as the entropy of the prior distribution of goes to zero, expected regret does as well.
Our main result extends beyond the class of linear bandit problems. Instead of depending on the linear dimension of the model, it depends on a more general measure of the problem’s information complexity: what we call the problem’s information ratio. By bounding the information ratio in specific settings, we recover the bound (2) as a special case, along with bounds for problems with full feedback and problems with combinatorial action sets and “semi–bandit” feedback.
2 Related Work
Though Thompson sampling was first proposed in 1933 (Thompson, 1933), until recently it was largely ignored in the academic literature. Interest in the algorithm grew after empirical studies (Scott, 2010; Chapelle and Li, 2011) demonstrated performance exceeding state of the art. Over the past several years, it has also been adopted in industry.Microsoft (Graepel et al., 2010), Google analytics (Scott, 2014) and Linkedin (Tang et al., 2013) have used Thompson sampling. This has prompted a surge of interest in providing theoretical guarantees for Thompson sampling.
One of the first theoretical guarantees for Thompson sampling was provided by May et al. (2012), but they showed only that the algorithm converges asymptotically to optimality. Agrawal and Goyal (2012); Kauffmann et al. (2012); Agrawal and Goyal (2013a) and Korda et al. (2013) studied on the classical multi-armed bandit problem, where sampling one action provides no information about other actions. They provided frequentist regret bounds for Thompson sampling that are asymptotically optimal in the sense defined by Lai and Robbins (1985). To attain these bounds, the authors fixed a specific uninformative prior distribution, and studied the algorithm’s performance assuming this prior is used.
Our interest in Thompson sampling is motivated by its ability to incorporate rich forms of prior knowledge about the actions and the relationship among them. Accordingly, we study the algorithm in a very general framework, allowing for an arbitrary prior distribution over the true outcome distributions . To accommodate this level of generality while still focusing on finite–time performance, we study the algorithm’s expected regret under the prior distribution. This measure is sometimes called Bayes risk or Bayesian regret.
Our recent work (Russo and Van Roy, 2013) provided the first results in this setting. That work leverages a close connection between Thompson sampling and upper confidence bound (UCB) algorithms, as well as existing analyses of several UCB algorithms. This confidence bound analysis was then extended to a more general setting, leading to a general regret bound stated in terms of a new notion of model complexity – what we call the eluder dimension. While the connection with UCB algorithms may be of independent interest, it’s desirable to have a simple, self–contained, analysis that does not rely on the–often complicated–construction of confidence bounds.
Other very recent work (gopalan2014thompson) provided a general asymptotic guarantee for Thompson sampling. They studied the growth rate of the cumulative number of times a suboptimal action is chosen as the time horizon tends to infinity. One of their asymptotic results applies to the problem of online linear optimization under “semi–bandit” feedback, which we study in Section 6.6.
An important aspect of our regret bound is its dependence on soft knowledge through the entropy of the optimal-action distribution. One of the only other regret bounds that depends on soft–knowledge was provided very recently by Li (2013). Inspired by a connection between Thompson sampling and exponential weighting schemes, that paper introduced a family of Thompson sampling like algorithms and studied their application to contextual bandit problems. While our analysis does not currently treat contextual bandit problems, we improve upon their regret bound in several other respects. First, their bound depends on the entropy of the prior distribution of mean rewards, which is never smaller, and can be much larger, than the entropy of the distribution of the optimal action. In addition, their bound has an order dependence on the problem’s time horizon, and, in order to guarantee each action is explored sufficiently often, requires that actions are frequently selected uniformly at random. In contrast, our focus is on settings where the number of actions is large and the goal is to learning without sampling each one.
Another regret bound that to some extent captures dependence on soft knowledge is that of Srinivas et al. (2012). This excellent work focuses on extending algorithms and expanding theory to address multi-armed bandit problems with arbitrary reward functions and possibly an infinite number of actions. In a sense, there is no hard knowledge while soft knowledge is represented in terms of a Gaussian process over a possibly infinite dimensional family of functions. An upper-confidence-bound algorithm is proposed and analyzed. Our earlier work (Russo and Van Roy, 2013) showed similar bounds also apply to Thompson sampling. Though our results here do not treat infinite action spaces, it should be possible to extend the analysis in that direction. One substantive difference is that our results apply to a much broader class of models: distributions are not restricted to Gaussian and more complex information structures are allowed. Further, the results of Srinivas et al. (2012) do not capture the benefits of soft knowledge to the extent that ours do. For example, their regret bounds do not depend on the mean of the reward function distribution, even though mean rewards heavily influence the chances that each action is optimal. Our regret bounds, in contrast, establish that regret vanishes as the probability that a particular action is optimal grows.
Our review has discussed the recent literature on Thompson sampling as well as two papers that have established regret bounds that depend on soft knowledge. There is of course an immense body of work on alternative approaches to efficient exploration, including work on the Gittins index approach (Gittins et al., 2011), Knowledge Gradient strategies (Ryzhov et al., 2012), and upper-confidence bound strategies (Lai and Robbins, 1985; Auer et al., 2002). In an adversarial framework, authors often study exponential-weighting shemes or, more generally, strategies based on online stochastic mirror descent. Bubeck and Cesa-Bianchi (2012) provided a general review of upper–confidence strategies and algorithms for the adversarial multi-armed bandit problem.
Problem Formulation
which measures the cumulative difference between the reward earned by an algorithm that always chooses the optimal action, and actual accumulated reward up to time . In this paper we study expected regret
where the expectation is taken over the randomness in the actions and the outcomes , and over the prior distribution over . This measure of performance is commonly called Bayesian regret or Bayes risk.
Further Assumptions:
To simplify the exposition, our main results will be stated under two further assumptions. The first requires that rewards are uniformly bounded, effectively controlling the worst-case variance of the reward distribution. In particular, this assumption is used only in proving Fact 9. In the technical appendix, we show that Fact 9 can be extended to the case where reward distributions are sub-Gaussian, which yields results in that more general setting.
Our next assumption requires that the action set is finite. In the technical appendix we show that some cases where is infinite can be addressed through discretization arguments.
Because the Thompson sampling algorithm only chooses actions from the support of , all of our results hold when the finite set denotes only the actions in the support of . This difference can be meaningful. For example, when is a polytope and the objective function is linear, the support of contains only extreme points of the polytope: a finite set.
Basic Measures and Relations in Information Theory
Before proceeding, we will define several common information measures – entropy, mutual information, and Kullback-Leibler divergence — and state several facts about these measures that will be referenced in our analysis. When all random variables are discrete, each of these facts is shown in chapter 2 of Cover and Thomas (2012). A treatment that applies to general random variables is provided in chapter 5 of Gray (2011).
Throughout this section, we will fix random variables , , and that are defined on a joint probability space. We will allow and to be general random variables, but will restrict to be supported on a finite set . This is sufficient for our purposes, as we will typically apply these relations when is , and is useful, in part, because the entropy of a general random variable can be infinite.
The first fact establishes uniform bounds on the entropy of a probability distribution.
If is a discrete random variable, the entropy of conditional on is
and the conditional entropy is of given is
For a general random variable , the conditional entropy of given is,
where this expectation is taken over the marginal distribution of . For two probability measures and , if is absolutely continuous with respect to , the Kullback–Leibler divergence between them is
where is the Radon–Nikodym derivative of with respect to . This is the expected value under of the log-likelihood ratio between and , and is a measure of how different and are. The next fact establishes the non-negativity of Kullback–Leibler divergence.
(Gibbs’ inequality) For any probability distributions and such that is absolutely continuous with respect to , with equality if and only if –almost everywhere.
The mutual information between and
is the Kullback–Leibler divergence between the joint distribution of and and the product of the marginal distributions. From the definition, it’s clear that , and Gibbs’ inequality implies that and when and are independent.
The next fact, which is Lemma 5.5.6 of Gray (2011), states that the mutual information between and is the expected reduction in the entropy of the posterior distribution of due to observing .
(Entropy reduction form of mutual information)
The mutual information between and , conditional on a third random variable is
the expected additional reduction in entropy due to observing given that is also observed. This definition is also a natural generalization of the one given in (6), since
The next fact shows that conditioning on a random variable that is independent of and does not affect mutual information.
If is jointly independent of and , then .
The mutual information between a random variable and a collection of random variables can be expressed elegantly using the following “chain rule.”
We provide some details related to the derivation of Fact 6 in the appendix.
(KL divergence form of mutual information)
While Facts 1 and 6, are standard properties of mutual information, it’s worth highlighting their surprising power. It’s useful to think as being , the optimal action, and as being , the observation when selecting some action . Then, combining these properties, we see that the next observation is expected to greatly reduce uncertainty about the optimal action if and only if the distribution of varies greatly depending on the realization of , in the sense that is large on average. This fact is crucial to our analysis.
One implication of the next fact is that the expected reduction in entropy from observing the outcome is always at least as large as that from observing the reward .
(Weak Version of the Data Processing Inequality) If for a deterministic function , then . If is invertible, so is also a deterministic function of , then .
Because these quantities depend on the realizations of , they are random variables. By taking their expectation, we recover the standard definition of conditional entropy and conditional mutual information:
Thompson Sampling
Algorithm 1 is efficient as long as the linear objective can be maximized efficiently over the action set . For this reason, the algorithm is implementable in important cases where other popular approaches, like the algorithm of Dani et al. (2008), are computationally intractable. Because the posterior distribution of has a closed form, Algorithm 1 is particularly efficient. Even when the posterior distribution is complex, however, one can often generate samples from this distribution using Markov chain Monte Carlo algorithms, enabling efficient implementations of Thompson sampling. A more detailed discussion of the strengths and potential advantages of Thompson sampling can be found in earlier work (Scott, 2010; Chapelle and Li, 2011; Russo and Van Roy, 2013; gopalan2014thompson).
The Information Ratio and a General Regret Bound
Our analysis will relate the expected regret of Thompson sampling of Thompson sampling to its expected information gain: the expected reduction in the entropy of the posterior distribution of . The relationship between these quantities is characterized by what we call the information ratio,
Notice that if the information ratio is small, Thompson sampling can only incur large regret when it is expected to gain a lot of information about which action is optimal. This suggests its expected regret is bounded in terms of the maximum amount of information any algorithm could expect to acquire, which is at most the entropy of the prior distribution of the optimal action. Our next result shows this formally. We provide a general upper bound on the expected regret of Thompson sampling that depends on the time horizon , the entropy of the prior distribution of , , and any worst–case upper bound on the information ratio . In the next section, we will provide bounds on for some of the most widely studied classes of online optimization problems.
where (a) follows from the tower property of conditional expectation, and (b) follows from the Cauchy-Schwartz inequality. We complete the proof by showing that expected information gain cannot exceed the entropy of the prior distribution. For the remainder of this proof, let . Then, as discussed in Subsection 3.1,
where (c) follows from the chain rule for mutual information (Fact 5), and (d) follows from the non-negativity of entropy (Fact 1).
Bounding the Information Ratio
This section establishes upper bounds on the information ratio in several important settings. This yields explicit regret bounds when combined with Proposition 1, and also helps to clarify the role the information ratio plays in our results: it roughly captures the extent to which sampling some actions allows the decision maker to make inferences about different actions. In the worst case, the information ratio depends on the number of actions, reflecting the fact that actions could provide no information about others. For problems with full information, the information ratio is bounded by a numerical constant, reflecting that sampling one action perfectly reveals the rewards that would have been earned by selecting any other action. The problems of online linear optimization under “bandit feedback” and under “semi–bandit feedback” lie between these two extremes, and the information ratio provides a natural measure of each problem’s information structure. In each case, our bounds reflect that Thompson sampling is able to automatically exploit this structure.
Recall that the information ratio is defined to be
The numerator of the information ratio measures the average difference between rewards generated from , the posterior predictive distribution at , and , the posterior predictive distribution at conditioned on being the optimal action. It roughly captures how much knowing that the selected action is optimal influences the expected reward observed. The denominator measures how much, on average, knowing which action is optimal changes the observations at the selected action. Intuitively, the information ratio tends to be small when knowing which action is optimal significantly influences the anticipated observations at many other actions.
It’s worth pointing out that this form of the information ratio bears a superficial resemblance to fundamental complexity terms in the multi-armed bandit literature. The results of Lai and Robbins (1985) and agrawal1989asymptotically show the optimal asymptotic growth rate of regret is characterized by a ratio where the numerator depends on the difference between means of the reward distributions and the denominator depends on Kullback–Leibler divergences.
Proof Both proofs will use that the action is selected based on past observations and independent random noise. Therefore, conditioned on the history, is jointly independent of and the outcome vector .
where (a) follows from the chain rule for mutual information (Fact 5), (b) uses that and are independent and the mutual information between independent random variables is zero (Fact 4), (c) uses Fact 4 and that is jointly independent of and , and equality (d) uses Fact 6. Now, the numerator can be rewritten as,
2 Preliminaries
Here we state two basic facts that are used in bounding the information ratio. Proofs of both results are provided in the appendix for completeness.
The first fact lower bounds the Kullback–Leibler divergence between two bounded random variables in terms of the difference between their means. It follows trivially from an application of pinsker’s inequality.
denote respectively the Nuclear norm, Frobenius norm and trace of .
3 Worst Case Bound
The next proposition provides a bound on the information ratio that holds whenever rewards are bounded, but that has an explicit dependence on the number of actions. This scaling cannot be improved in general, but we go on to show tighter bounds are possible for problems with different information structures.
Proof We bound the numerator of the information ratio by times its denominator:
4 Full Information
Proof As before we bound the numerator of by times the denominator:
where again (a) and (e) follow from Proposition 2, (b) follows from Fact 9 and (c) follows from Jensen’s inequality. Equality (d) follows because does not depend on the sampled action under full information.
5 Linear Optimization Under Bandit Feedback
The stochastic linear bandit problem has been widely studied (e.g Dani et al., 2008; Rusmevichientong and Tsitsiklis, 2010; Abbasi-Yadkori et al., 2011) and is one of the most important examples of a multi-armed bandit problem with “correlated arms.” In this setting, each action is associated with a finite dimensional feature vector, and the mean reward generated by an action is the inner product between its known feature vector and some unknown parameter vector. Because of this structure, observations from taking one action allow the decision–maker to make inferences about other actions. The next proposition bounds the information ratio for such problems. Its proof is essentially a generalization of the proof of Proposition 3.
for all . Then, by Proposition 2,
where inequality (a) uses Fact 9. This shows, by Fact 10, that
We now show . Define
Then, by the linearity of the expectation operator,
6 Combinatorial Action Sets and “Semi–Bandit” Feedback
To motivate the information structure studied here, consider a simple resource allocation problem. There are possible projects, but the decision–maker can allocate resources to at most of them at a time. At time , project yields a random reward , and the reward from selecting a subset of projects is . In the linear bandit formulation of this problem, upon choosing a subset of projects the agent would only observe the overall reward . It may be natural instead to assume that the outcome of each selected project is observed. This type of observation structure is sometimes called “semi–bandit” feedback (Audibert et al., 2013).
A naive application of Proposition 5 to address this problem would show . The next proposition shows that since the entire parameter vector is observed upon selecting action , we can provide an improved bound on the information ratio. The proof of the proposition is provided in the appendix.
Conclusion
This paper has provided a new analysis of Thompson sampling based on tools from information theory. As such, our analysis inherits the simplicity and elegance enjoyed by work in that field. Further, our results apply to a much broader range of information structures than those studied in prior work on Thompson sampling. Our analysis leads to regret bounds that highlight the benefits of soft knowledge, quantified in terms of the entropy of the optimal-action distribution. Such regret bounds yield insight into how future performance depends on past observations. This is key to assessing the benefits of exploration, and as such, to the design of more effective schemes that trade off between exploration and exploitation. In forthcoming work, we leverage this insight to produce an algorithm that outperforms Thompson sampling.
While our focus has been on providing theoretical guarantees for Thompson sampling, we believe the techniques and quantities used in the analysis may be of broader interest. Our formulation and notation may be complex, but the proofs themselves essentially follow from combining known relations in information theory with the tower property of conditional expectation, Jensen’s inequality, and the Cauchy–Schwartz inequality. In addition, the information theoretic view taken in this paper may provide a fresh perspective on this class of problems.
Daniel Russo is supported by a Burt and Deedee McMurty Stanford Graduate Fellowship. This work was supported, in part, by the National Science Foundation under Grant CMMI-0968707. The authors would like to thank the anonymous reviewers for their helpful comments.
A Proof of Basic Facts
This result is a consequence of Pinsker’s inequality, (see Lemma 5.2.8 of Gray (2011)) which states that
where is the total variation distance between and . When is countable, . More generally, if and are both absolutely continuous with respect to some base measure , then , where and are the Radon–Nikodym derivatives of and with respect to . We now prove Fact 9.
Proof Choose a base measure so that and are absolutely continuous with respect to . This is always possible: since is absolutely continuous with respect to by hypothesis, one could always choose this base measure to be . Let so that and and differ only by a constant. Then,
where the first inequality is Pinsker’s inequality.
A.2 Proof of Fact 10
Here (b) follows from the triangle inequality and the fact that norms are homogeneous of degree one. Inequality (c) uses that the singular values of and are the same, and inequality (d) follows from equation (8).
B Proof of Proposition 6
The proof relies on the following lemma, which lower bounds the information gain due to selecting an action . The proof is provided below, and relies on the chain–rule of Kullback–Leibler divergence.
Under the conditions of Proposition 6, for any ,
Proof In the proof of this lemma, for any we let . Since is fixed throughout this proof, we do not display the dependence of on . Recall that by Fact 6,
Equality (a) follows from the chain rule of Kullback–Leibler divergence, (b) follows from Fact 9, (c) follows from the assumption that are independent conditioned on any history of observations, and (d) follows from Jensen’s inequality and the tower property of conditional expectation. Now, we can show
where (e) and (f) follow from Jensen’s inequality and the tower property of conditional expectation, respectively.
Proof By Lemma 1 and the tower property of conditional expectation,
Now, we complete the proof of Proposition 6.
Proof The proof establishes that the numerator of the information ratio is less than times its denominator:
C Proof of Fact 6
We could not find a reference that proves Fact 6 in a general setting, and will therefore provide a proof here.
Consider random variables and where is assumed to be a finite set but is a general random variable. We show
The extension follows by considering quantized versions of the random variable . For a finite partition , we denote by the quantization of defined so that if and only if . As shown in Gray (2011), for two probability measures and on ,
where the supremum is taken over all finite partitions of , and whenever is a refinement of ,
Let the finite partition denote the refinement of the partitions . That is, for each , every set in can be written as the union of sets in . Then,
hence the inequalities above are equalities and our claim is established.
D Technical Extensions
We introduce a quantization of the random variable . is supported only on the set , but satisfies for each .
This new bound depends on the time horizon , the dimension of the linear model, the entropy of the prior distribution of the quantized random variable and the discretization error . Since , by choosing and choosing a finite cover with we attain a regret bound of order . Note that for , this bound is of order , but for , we have a trivial regret bound of .
Here, for simplicity, we have considered a variant of Thompson sampling that randomizes according to the posterior distribution of the quantized random variable . However, with a somewhat more careful analysis, one can provide a similar result for an algorithm that randomizes according to the posterior distribution of .
D.2 Unbounded Noise
In this section we relax Assumption 1, which requires that reward distributions are uniformly bounded. This assumption is required for Fact 9 to hold, but otherwise was never used in our analysis. Here, we show that an analogue to Fact 9 holds in the more general setting where reward distributions are sub–Gaussian. This yields more general regret bounds.
A random variable is sub–Gaussian if its moment generating function is dominated by that of a Gaussian random variable. Gaussian random variables are sub–Gaussian, as are real-valued random varaibles with bounded support.
The next lemma extends Fact 9 to sub-Gaussian random variables.
Suppose there is a measurable random variable such that for each , is sub-Gaussian conditional on . Then for every
Using this Lemma in place of Fact 9 in our analysis leads to the following corollary.
It’s worth highlighting two cases under which the conditions of Corollary 1 are satisfied. The first is a setting with a Gaussian prior and Guassian reward noise. The second case is when under any model expected rewards lie in a bounded interval and reward noise is sub–Gaussian.
Consider random variables . If for each there is a deterministic such that is sub–Gaussian conditional on , then is sub-Gaussian.
The proof of Lemma 3 relies on the following property of Kullback–Leibler divergence.
(Variational form of KL–Divergence given in Theorem 5.2.1 of Gray (2011)) Fix two probability distributions and such that is absolutely continuous with respect to . Then,
We now show that Lemma 3 follows from the variational form of KL–Divergence.
Maximizing over yields the result.