Linearly Parameterized Bandits

Paat Rusmevichientong, John N. Tsitsiklis

Introduction

Since its introduction by Thompson (1933), the multiarmed bandit problem has served as an important model for decision making under uncertainty. Given a set of arms with unknown reward profiles, the decision maker must choose a sequence of arms to maximize the expected total payoff, where the decision in each period may depend on the previously observed rewards. The multiarmed bandit problem elegantly captures the tradeoff between the need to exploit arms with high payoff and the incentive to explore previously untried arms for information gathering purposes.

Much of the previous work on the multiarmed bandit problem assumes that the rewards of the arms are statistically independent (see, for example, Lai and Robbins (1985) and Lai (1987)). This assumption enables us to consider each arm separately, but it leads to policies whose regret scales linearly with the number of arms. Most policies that assume independence require each arm to be tried at least once, and are impractical in settings involving many arms. In such settings, we want a policy whose regret is independent of the number of arms.

When the mean rewards of the arms are assumed to be independent random variables, Lai and Robbins (1985) show that the regret under an arbitrary policy must increase linearly with the number of arms. However, the assumption of independence is quite strong in practice. In many applications, the information obtained from pulling one arm can change our understanding of other arms. For instance, in marketing applications, we expect a priori that similar products should have similar sales. By exploiting the correlation among products/arms, we should be able to obtain a policy whose regret scales more favorably than traditional bandit algorithms that ignore correlation and assume independence.

Mersereau et al. (2009) propose a simple model that demonstrates the benefits of exploiting the underlying structure of the rewards. They consider a bandit problem where the expected reward of each arm is a linear function of an unknown scalar, with a known prior distribution. Since the reward of each arm depends on a single random variable, the mean rewards are perfectly correlated. They prove that, under certain assumptions, the cumulative Bayes risk over TT periods (defined below) under a greedy policy admits an O(logT)O\left(\log T\right) upper bound, independent of the number of arms.

The linearly parameterized bandit is an important model that has been studied by many researchers, including Ginebra and Clayton (1995), Abe and Long (1999), and Auer (2002). The results in this paper complement and extend the earlier and independent work of Dani et al. (2008a) in a number of directions. We provide a detailed comparison between our work and the existing literature in Sections 1.3 and 1.4.

where for any t1t\geq 1, UtUr\mathbf{U}_{t}\in\mathcal{U}_{r} is the arm chosen under ψ\mathbf{\psi} in period tt. Since Ur\mathcal{U}_{r} is compact, maxvUrvz\max_{\mathbf{v}\,\in\,\mathcal{U}_{r}}\mathbf{v}^{\prime}\mathbf{z} is well defined for all z\mathbf{z}. The TT-period cumulative Bayes risk under ψ\mathbf{\psi} is defined by

where the expectation is taken with respect to the prior distribution of Z\mathbf{Z}. We aim to develop a policy that minimizes the cumulative regret and Bayes risk. We note that minimizing the TT-period cumulative Bayes risk is equivalent to maximizing the expected total reward over TT periods.

2 Potential Applications

3 Related Literature

The multiarmed bandit literature can be divided into two streams, depending on the objective function criteria: maximizing the total discounted reward over an infinite horizon or minimizing the cumulative regret and Bayes risk over a finite horizon. Our paper focuses exclusively on the second criterion. Much of the work in this area focuses on understanding the rate with which the regret and risk under various policies increase over time. In their pioneering work, Lai and Robbins (1985) establish an asymptotic lower bound of Ω(mlogT)\Omega(m\log T) for the TT-period cumulative regret for bandit problems with mm independent arms whose mean rewards are “well-separated,” where the difference between the expected reward of the best and second best arms is fixed and bounded away from zero. They further demonstrate a policy whose regret asymptotically matches the lower bound. In contrast, our paper focuses on problems where the number of arms is large (possibly infinite), and where the gap between the maximum expected reward and the expected reward of the second best arm can be arbitrarily small. Lai (1987) extends these results to a Bayesian setting, with a prior distribution on the reward characteristics of each arm. He shows that when we have mm arms, the TT-period cumulative Bayes risk is of order Θ(mlog2T)\Theta(m\log^{2}T), when the prior distribution has a continuous density function satisfying certain properties (see Theorem 3 in Lai, 1987). Subsequent papers along this line include Agrawal et al. (1989), Agrawal (1995), and Auer et al. (2002).

There has been relatively little research, however, on policies that exploit the dependence among the arms. Thompson (1933) allows for correlation among arms in his initial formulation, though he only analyzes a special case involving independent arms. Robbins (1952) formulates a continuum-armed bandit regression problem, but does not provide an analysis of the regret or risk. Berry and Fristedt (1985) allow for dependence among arms in their formulation in Chapter 2, but mostly focus on the case of independent arms. Feldman (1962) and Keener (1985) consider two-armed bandit problems with two hidden states, where the rewards of each arm depend on the underlying state that prevails. Pressman and Sonin (1990) formulate a general multiarmed bandit problem with an arbitrary number of hidden states, and provide a detailed analysis for the case of two hidden states. Pandey et al. (2007) study bandit problems where the dependence of the arm rewards is represented by a hierarchical model.

A somewhat related literature on bandits with dependent arms is the recent work by Wang et al. (2005a, b) and Goldenshluger and Zeevi (2008, 2009) who consider bandit problems with two arms, where the expected reward of each arm depends on an exogenous variable that represents side information. These models, however, differ from ours because they assume that the side information variables are independent and identically distributed over time, and moreover, these variables are perfectly observed before we choose which arm to play. In contrast, we assume that the underlying random vector Z\mathbf{Z} is unknown and fixed over time, to be estimated based on past rewards and decisions.

Our model generalizes the “response surface bandits” introduced by Ginebra and Clayton (1995), who assume a normal prior on Z\mathbf{Z} and provide a simple tunable heuristic, without any analysis on the regret or risk. Abe and Long (1999), Auer (2002), and Dani et al. (2008a) all consider a special case of our model where the random vector Z\mathbf{Z} and the error random variables WtuW^{\mathbf{u}}_{t} are bounded almost surely, and with the exception of the last paper, focus on the regret criterion. Abe and Long (1999) demonstrate a class of bandits where the dimension rr is at least Ω(T)\Omega(\sqrt{T}), and show that the TT-period regret under an arbitrary policy must be at least Ω(T3/4)\Omega\left(T^{3/4}\right). Auer (2002) describes an algorithm based on least squares estimation and confidence bounds, and establishes an O(rTlog3/2(TUr))O\left(\sqrt{r}\sqrt{T}\log^{3/2}\left(T\left|\mathcal{U}_{r}\right|\right)\right) upper bound on the regret, for the case of finitely many arms. Dani et al. (2008a) show that the policy of Auer (2002) can be extended to problems having an arbitrary compact set of arms, and also make use of a barycentric spanner. They establish an O(rTlog3/2T)O(r\sqrt{T}\log^{3/2}T) upper bound on the regret, and discuss a variation of the policy that is more computationally tractable (at the expense of higher regret). Dani et al. (2008a) also establish an Ω(rT)\Omega(r\sqrt{T}) lower bound on the Bayes risk when the set of arms is the Cartesian product of circlesThe original lower bound (Theorem 3 on page 360 of Dani et al., 2008a) was not entirely correct; a correct version was provided later, in Dani et al. (2008b).. However, this leaves a O(log3/2T)O(\log^{3/2}T) gap from the upper bound, leaving open the question of the exact order of regret and risk.

4 Contributions and Organizations

Although we obtain the same lower bound of Ω(rT)\Omega(r\sqrt{T}), our example and proof techniques are very different from Dani et al. (2008a). We consider the unit sphere, with a multivariate normal prior on Z\mathbf{Z}, and standard normal errors. The analysis in Section 2 also illuminates the behavior of the least mean squares estimator in this setting, and we believe that it provides an approach that can be used to address more general classes of linear estimation and adaptive control problems.

We also prove that the phase-based policy remains effective (that is, admits an O(rT)O(r\sqrt{T}) upper bound) for a broad class of bandit problems in which the set of arms is strongly convexOne can show that the Cartesian product of circles is not strongly convex, and thus, our phase-based policy cannot be applied to give the matching upper bound for the example used in Dani et al. (2008a). (defined in Section 3). To our knowledge, this is the first result that establishes the connection between a geometrical property (strong convexity) of the underlying set of arms and the effectiveness of separating exploration from exploitation. We suspect that strong convexity may have similar implications for other types of bandit and learning problems.

When the set of arms is an arbitrary compact set, the separation of exploration and exploitation may not be effective, and we consider in Section 4 an active exploration policy based on least squares estimation and confidence regions. We prove that the regret and risk under this policy are bounded above by O(rTlog3/2T)O(r\sqrt{T}\log^{3/2}T), which is within a logarithmic factor of the lower bound. Our policy is closely related to the one considered in Auer (2002) and further analyzed in Dani et al. (2008a), with differences in a number of respects. First, our model allows the random vector Z\mathbf{Z} and the errors WtuW^{\mathbf{u}}_{t} to have unbounded support, which requires a somewhat more complicated analysis. Second, our policy is an “anytime” policy, in the sense that the policy does not depend on the time horizon TT of interest. In contrast, the policies of Auer (2002) and Dani et al. (2008a) involve a certain parameter δ\delta whose value must be set in advance as a function of the time horizon TT in order to obtain the O(rTlog3/2T)O\left(r\sqrt{T}\log^{3/2}T\right) regret bound.

We finally comment on the case where the set of arms is finite and fixed. We show that the regret and risk under our active exploration policy increase gracefully with time, as logT\log T and log2T\log^{2}T, respectively. These results show that our policy is within a constant factor of the asymptotic lower bounds established by Lai and Robbins (1985) and Lai (1987). In contrast, for the policies of Auer (2002) and Dani et al. (2008a), the available regret upper bounds grow over time as Tlog3/2T\sqrt{T}\log^{3/2}T and log3T\log^{3}T, respectively.

We note that the bounds on the cumulative Bayes risk given in Table 1 hold under certain assumptions on the prior distribution of the random vector Z\mathbf{Z}. For r=1r=1, Z\mathbf{Z} is assumed to be a continuous random variable with a bounded density function (Theorem 3.2 in Mersereau et al., 2009). When the collection of arms is a unit sphere with r2r\geq 2, we require that both \mboxE[Z]\mbox{\sf E}\left[\left\|\mathbf{Z}\right\|\right] and \mboxE[1/Z]\mbox{\sf E}\left[1/\left\|\mathbf{Z}\right\|\right] are bounded (see Theorems 2.1 and 3.1, and Lemma 3.2). For general compact sets of arms where our risk bound is not tight, we only require that Z\left\|\mathbf{Z}\right\| has a bounded expectation.

Lower Bounds

In this section, we establish Ω(rT)\Omega(r\sqrt{T}) lower bounds on the regret and risk under an arbitrary policy when the set of arms is the unit sphere. This result is stated in the following theoremThe result of Theorem 2.1 easily extends to the case where the covariance matrix is Ir\mathbf{I}_{r}, rather than Ir/r\mathbf{I}_{r}/r. The proof is essentially the same.

Let St1,,Str1\mathbf{S}_{t}^{1},\ldots,\mathbf{S}_{t}^{r-1} denote a collection of orthogonal unit vectors that are also orthogonal to Z^t\widehat{\mathbf{Z}}_{t}. Note that Z^t\widehat{\mathbf{Z}}_{t} and St1,,Str1\mathbf{S}_{t}^{1},\ldots,\mathbf{S}_{t}^{r-1} are functions of Ht\mathbf{H}_{t}.

Using the fact that for any two unit vectors w\mathbf{w} and v\mathbf{v}, 1wv=wv2/21-\mathbf{w}^{\prime}\mathbf{v}=\left\|\mathbf{w}-\mathbf{v}\right\|^{2}/2, the instantaneous regret in period tt satisfies

where the inequality follows from the fact that ST1,,STr1\mathbf{S}_{T}^{1},\ldots,\mathbf{S}_{T}^{r-1} are orthogonal unit vectors. Therefore, the cumulative conditional risk satisfies

with probability one. From the definition of STk\mathbf{S}^{k}_{T}, we have Z^TSTk=0\widehat{\mathbf{Z}}_{T}^{\prime}\mathbf{S}_{T}^{k}=0 for k=1,r1k=1\ldots,r-1. Therefore, for tTt\leq T,

which eliminates the middle term in the summand above. Furthermore, we see that ZSTk=(ZZ^T)STk\mathbf{Z}^{\prime}{\mathbf{S}}_{T}^{k}=\left(\mathbf{Z}-\widehat{\mathbf{Z}}_{T}\right)^{\prime}{\mathbf{S}}_{T}^{k} for all kk. Thus, with probability one,

and the desired result follows by taking the expectation of both sides. ∎

Since STk{\mathbf{S}}_{T}^{k} is orthogonal to Z^T\widehat{\mathbf{Z}}_{T}, we can interpret t=1T(UtSTk)2\sum_{t=1}^{T}\left(\mathbf{U}_{t}^{\prime}{\mathbf{S}}^{k}_{T}\right)^{2} and {(ZZ^T)STk}2\left\{\left(\mathbf{Z}-\widehat{\mathbf{Z}}_{T}\right)^{\prime}{\mathbf{S}}_{T}^{k}\right\}^{2} as the total amount of exploration over TT periods and the squared estimation error, respectively, in the direction STk\mathbf{S}^{k}_{T}. Thus, Lemma 2.2 tells us that the cumulative risk is bounded below by the sum of the squared estimation error and the total amount of exploration in the past TT periods. This result suggests an approach for establishing a lower bound on the risk. If the amount of exploration t=1T(UtSTk)2\sum_{t=1}^{T}\left(\mathbf{U}_{t}^{\prime}{\mathbf{S}}_{T}^{k}\right)^{2} is large, then the risk will be large. On the other hand, if the amount of exploration is small, we expect significant estimation errors, which in turn imply large risk. This intuition is made precise in Lemma 2.3, which relates the squared estimation error and the amount of exploration.

Let \mathbf{Q}_{T}=\widehat{\mathbf{Z}}_{T}\big{/}\|\widehat{\mathbf{Z}}_{T}\|. For any tt, we have that Ut=k=1r1(UtSTk)STk+(UtQT)QT\mathbf{U}_{t}=\sum_{k=1}^{r-1}\left(\mathbf{U}_{t}^{\prime}\mathbf{S}_{T}^{k}\right)\mathbf{S}_{T}^{k}+\left(\mathbf{U}_{t}^{\prime}\mathbf{Q}_{T}\right)\mathbf{Q}_{T}. Let

be an r×rr\times r orthonormal matrix whose columns are the vectors ST1,,STr1\mathbf{S}_{T}^{1},\ldots,\mathbf{S}_{T}^{r-1}, and QT\mathbf{Q}_{T}, respectively. Then, it is easy to verify that

Since Z\mathbf{Z} has a multivariate normal prior distribution with covariance matrix Ir/r\mathbf{I}_{r}/r, it is a standard result (use, for example, Corollary E.3.5 in Appendix E in Bertsekas, 1995) that

Since STk\mathbf{S}^{k}_{T} is a function of HT\mathbf{H}_{T} and VSTk=ek\mathbf{V}^{\prime}\mathbf{S}^{k}_{T}=\mathbf{e}_{k}, we have, for kr1k\leq r-1, that

where the inequality follows from Fiedler’s Inequality (see, for example, Theorem 2.1 in Fiedler and Pták, 1997), and the final equality follows from the definition of A\mathbf{A}. ∎

The next lemma gives a lower bound on the probability that Z\mathbf{Z} is bounded away from the origin. The proof follows from simple calculations involving normal densities, and the details are given in Appendix A.1.

For any θ1/2\theta\leq 1/2 and β>0\beta>0, Pr{θZβ}14θ21β2 \Pr\left\{\theta\leq\left\|\mathbf{Z}\right\|\leq\beta\right\}\geq 1-4\theta^{2}-\frac{1}{\beta^{2}}~{}.

The last lemma establishes a lower bound on the sum of the total amount of exploration and the squared estimation error, which is also the minimum cumulative Bayes risk along the direction STk\mathbf{S}^{k}_{T} by Lemma 2.2.

We note that if Z\left\|\mathbf{Z}\right\| were a constant, rather than a random variable, Lemma 2.5 would follow immediately. Hence, most of the work in the proof below involves constraining Z\left\|\mathbf{Z}\right\| to a certain range [θ,β][\theta,\beta].

Consider an arbitrary kk, and let Ξ=t=1T(UtSTk)2\Xi=\sum_{t=1}^{T}\left(\mathbf{U}_{t}^{\prime}{\mathbf{S}}_{T}^{k}\right)^{2}, Γ={(ZZ^T)STk}2\Gamma=\left\{\left(\mathbf{Z}-\widehat{\mathbf{Z}}_{T}\right)^{\prime}{\mathbf{S}}_{T}^{k}\right\}^{2}. Our proof will make use of positive constants θ\theta, β\beta, and η\eta, whose values will be chosen later. Note that

where we use the fact that Ξ\Xi is a function of HT\mathbf{H}_{T} in the final inequality. We will now lower bound the last term on the right hand side of the above inequality. Let \Theta=\mbox{\sf E}\left[\Gamma~{}\big{|}~{}\mathbf{H}_{T}\right]. Since Θ\Theta is a function of HT\mathbf{H}_{T},

where the last inequality follows from Lemma 2.3 which implies that, with probability one,

and where the last inequality follows from the fact that Tr2T\geq r^{2}, and thus, 1/(r+T)1/(2T)1/\left(r+\sqrt{T}\right)\geq 1/(2\sqrt{T}).

with probability one. By the Bonferroni Inequality, we have that

with probability one. Conditioned on HT\mathbf{H}_{T}, (ZZ^T)STk\left(\mathbf{Z}-\widehat{\mathbf{Z}}_{T}\right)^{\prime}{\mathbf{S}}^{k}_{T} is normally distributed with mean zero and variance

Let Φ()\Phi(\cdot) be the cumulative distribution function of the standard normal random variable, that is, Φ(x)=12πxeu2/2du\Phi(x)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}e^{-u^{2}/2}\,du. Then,

from which it follows that, with probability one,

where the last inequality follows from Lemma 2.4. Set θ=0.09\theta=0.09, β=3\beta=3, and η=0.5\eta=0.5, to obtain \mboxE[ZΞ+TΓZ]0.027T ,\mbox{\sf E}\left[\left\|\mathbf{Z}\right\|\Xi+\frac{T\,\Gamma}{\left\|\mathbf{Z}\right\|}\right]\geq 0.027\sqrt{T}~{}, which is the desired result. ∎

Finally, here is the proof of Theorem 2.1.

where we have used the fact r2r\geq 2, which implies that r1r/2r-1\geq r/2. ∎

Matching Upper Bounds

We have established Ω(rT)\Omega\left(r\sqrt{T}\right) lower bounds when the set of arms Ur\mathcal{U}_{r} is the unit sphere. We now prove that a policy that alternates between exploration and exploitation phases yields matching upper bounds on the regret and risk, and is therefore optimal for this problem. Surprisingly, we will see that the phase-based policy is effective for a large class of bandit problems, involving a strongly convex set of arms. We introduce the following assumption on the tails of the error random variables WtuW_{t}^{\mathbf{u}} and on the set of arms Ur\mathcal{U}_{r}, which will remain in effect throughout the rest of paper.

There exist positive constants uˉ\bar{u} and λ0\lambda_{0} such that for any r2r\geq 2,

Under Assumption 1(a), the tails of the distribution of the errors WtuW^{\mathbf{u}}_{t} decay at least as fast as for a normal random variable with variance σ02\sigma_{0}^{2}. The first part of Assumption 1(b) ensures that the expected reward of the arms remain bounded as the dimension rr increases, while the arms b1,,br\mathbf{b}_{1},\ldots,\mathbf{b}_{r} given in the second part of Assumption 1(b) will be used during the exploration phase of our policy.

Our policy – which we refer to as the Phased Exploration and Greedy Exploitation (PEGE) – operates in cycles, and in each cycle, we alternate between exploration and exploitation phases. During the exploration phase of cycle cc, we play the rr linearly independent arms from Assumption 1(b). Using the rewards observed during the exploration phases in the past cc cycles, we compute an ordinary least squares (OLS) estimate Z^(c)\widehat{\mathbf{Z}}(c). In the exploitation phase of cycle cc, we use Z^(c)\widehat{\mathbf{Z}}(c) as a proxy for Z\mathbf{Z} and compute a greedy decision G(c)Ur\mathbf{G}(c)\in\mathcal{U}_{r} defined by:

where we break ties arbitrarily. We then play the arm G(c)\mathbf{G}(c) for an additional cc periods to complete cycle cc. Here is a formal description of the policy.

Phased Exploration and Greedy Exploitation (PEGE)

Description: For each cycle c1c\geq 1, complete the following two phases.

where for any kk, Xbk(s)X^{\mathbf{b}_{k}}(s) and Wbk(s)W^{\mathbf{b}_{k}}(s) denote the observed reward and the error random variable associated with playing arm bk\mathbf{b}_{k} in cycle ss. Note that the last equality follows from Equation (1) defining our model.

Exploitation (cc periods): Play the greedy arm G(c)=argmaxvUrvZ^(c)\mathbf{G}(c)=\arg\max_{\mathbf{v}\in\mathcal{U}_{r}}\mathbf{v}^{\prime}\widehat{\mathbf{Z}}(c) for cc periods.

Even though the SBAR condition appears to be an implicit one, it admits a simple interpretation. According to Corollary 4 of Polovinkin (1996), a compact set Ur\mathcal{U}_{r} satisfies condition SBAR(JJ) if and only if it is strongly convex with parameter JJ, in the sense that the set Ur\mathcal{U}_{r} can be represented as the intersection of closed balls of radius JJ. Intuitively, the SBAR condition requires the boundary of Ur\mathcal{U}_{r} to have a curvature that is bounded below by a positive constant. For some examples, the unit ball satisfies the SBAR(1) condition. Furthermore, according to Theorem 3 of Polovinkin (1996), an ellipsoid of the form {uRr:uQ1u1}\{\mathbf{u}\in R^{r}:\mathbf{u}^{\prime}\mathbf{Q}^{-1}\mathbf{u}\leq 1\}, where Q\mathbf{Q} is a symmetric positive definite matrix, satisfies the condition SBAR (λmax(Q)/λmin(Q))\left(\lambda_{\max}(\mathbf{Q})/\sqrt{\lambda_{\min}(\mathbf{Q})}\right).

The main result of this section is stated in the following theorem. The proof is given in Section 3.1.

Suppose in addition, that there exists a constant M>0M>0 such that for every r2r\geq 2 we have \mboxE[Z]M\mbox{\sf E}\left[\,\left\|\mathbf{Z}\right\|\,\right]\leq M and \mboxE[1/Z]M\mbox{\sf E}\left[\,1/\left\|\mathbf{Z}\right\|\,\right]\leq M. Then, there exists a positive constant a2a_{2} that depends only on σ0\sigma_{0}, uˉ\bar{u}, λ0\lambda_{0}, JJ, and MM, such that for any TrT\geq r,

The above result shows that the performance of our policy does not deteriorate as the norm of z\mathbf{z} approaches zero.

Intuitively, the requirement \mboxE[Z]M\mbox{\sf E}\left[\left\|\mathbf{Z}\right\|\right]\leq M in Theorem 3.1 implies that, as rr increases, the maximum expected reward (over all arms) remains bounded. Moreover, the assumption on the boundedness of \mboxE[1/Z]\mbox{\sf E}\left[\,1/\left\|\mathbf{Z}\right\|\,\right] means that Z\mathbf{Z} does not have too much mass near the origin. The following lemma provides conditions under which this assumption holds, and shows that the case of the multivariate normal distribution used in Theorem 2.1 is also covered. The proof is given in Appendix A.2.

The following corollary shows that the example in Section 2 admits tight matching upper bounds on the regret and risk.

Moreover, if Z\mathbf{Z} has a multivariate normal distribution with mean 0\mathbf{0} and covariance matrix Ir/r\mathbf{I}_{r}/r, then for all TrT\geq r,

Since the set of arms is the unit sphere and the errors are standard normal, Assumption 1 is satisfied with σ0=uˉ=λ0=1\sigma_{0}=\bar{u}=\lambda_{0}=1. Moreover, as already discussed, the unit sphere satisfies the SBAR(1) condition. Finally, By Lemma 3.2, the random vector Z\mathbf{Z} satisfies the hypotheses of Theorem 3.1. The regret and risk bounds then follow immediately. ∎

The proof of Theorem 3.1 relies on the following upper bound on the square of the norm difference between Z^(c)\widehat{\mathbf{Z}}(c) and Z\mathbf{Z}.

Recall from the definition of the PEGE policy that the estimate Z^(c)\widehat{\mathbf{Z}}(c) at the end of the exploration phase of cycle cc is given by

where B=(k=1rbkbk)1\mathbf{B}=\left(\sum_{k=1}^{r}\mathbf{b}_{k}\mathbf{b}_{k}^{\prime}\right)^{-1} and V(s)=k=1rbkWbk(s)\mathbf{V}(s)=\sum_{k=1}^{r}\mathbf{b}_{k}W^{\mathbf{b}_{k}}(s). Note that the mean-zero random variables Wbk(s)W^{\mathbf{b}_{k}}(s) are independent of each other and their variance is bounded by some constant γ0\gamma_{0} that depends only on σ0\sigma_{0}. Then, it follows from Assumption 1 that

The next lemma gives an upper bound on the difference between two normalized vectors in terms of the difference of the original vectors.

where we define 0/0\mathbf{0}/\left\|\mathbf{0}\right\| to be some fixed unit vector.

The inequality is easily seen to hold if either w=0\mathbf{w}=\mathbf{0} or z=0\mathbf{z}=\mathbf{0}. So, assume that both w\mathbf{w} and z\mathbf{z} are nonzero. Using the triangle inequality and the fact that \big{|}\left\|\mathbf{w}\right\|-\left\|\mathbf{z}\right\|\big{|}\leq\left\|\mathbf{w}-\mathbf{z}\right\|, we have that

By symmetry, we also have wwzz2wzz\left\|\frac{\mathbf{w}}{\left\|\mathbf{w}\right\|}-\frac{\mathbf{z}}{\left\|\mathbf{z}\right\|}\right\|\leq\frac{2\left\|\mathbf{w}-\mathbf{z}\right\|}{\left\|\mathbf{z}\right\|}, which gives the desired result. ∎

The following lemma gives an upper bound on the expected instantaneous regret under the greedy decision G(c)\mathbf{G}(c) during the exploitation phase of cycle cc.

where the inequality follows from the definition of the greedy decision in Equation (2), and the final equality follows from the fact that G(c)=u(Z^(c))\mathbf{G}(c)=\mathbf{u}^{*}\left(\widehat{\mathbf{Z}}(c)\right). As a convention, we define 0/0\mathbf{0}/\left\|\mathbf{0}\right\| to some fixed unit vector and set u(0)=u(0/0)\mathbf{u}^{*}(\mathbf{0})=\mathbf{u}^{*}(\mathbf{0}/\left\|\mathbf{0}\right\|).

It then follows from the Cauchy-Schwarz Inequality that, with probability one,

where the equality follows from the fact that u(z)=u(λz)\mathbf{u}^{*}(\mathbf{z})=\mathbf{u}^{*}(\lambda\mathbf{z}) for all λ>0\lambda>0. The second inequality follows from condition SBAR(JJ), and the final inequality follows from Lemma 3.5. The desired result follows by taking conditional expectations, given Z=z{\bf Z=z}, and applying Lemma 3.4. ∎

We can now complete the proof of Theorem 3.1, by adding the regret over the differnt times and cycles. By Assumption 1 and the Cauchy-Schwarz Inequality, the instantaneous regret from playing any arm uUr\mathbf{u}\in\mathcal{U}_{r} is bounded above by maxvUrz(vu)2uˉz\max_{\mathbf{v}\in\mathcal{U}_{r}}\mathbf{z}^{\prime}\left(\mathbf{v}-\mathbf{u}\right)\leq 2\,\bar{u}\left\|\mathbf{z}\right\|. Consider an arbitrary cycle cc. Then, the total regret incurred during the exploration phase (with rr periods) in this cycle is bounded above by 2uˉrz2\,\bar{u}\,r\,\left\|\mathbf{z}\right\|. During the exploitation phase of cycle cc, we always play the greedy arm G(c)\mathbf{G}(c). The expected instantaneous regret in each period during the exploitation phase is bounded above by rh2/czrh_{2}/c\left\|\mathbf{z}\right\|. So, the total regret during cycle cc is bounded above by 2uˉrz+h2r/z2\,\bar{u}\,r\,\left\|\mathbf{z}\right\|+h_{2}\,r/\left\|\mathbf{z}\right\|. Summing over KK cycles, we obtain

for some positive constants h3h_{3} and h4h_{4} that depend only on σ0\sigma_{0}, uˉ\bar{u}, λ0\lambda_{0}, and JJ.

where the final inequality follows because K0=2T3TK_{0}=\left\lceil\sqrt{2T}\right\rceil\leq 3\sqrt{T}. The risk bound follows by taking expectations and using the assumption on the boundedness of \mboxE[Z]\mbox{\sf E}[\,\left\|\mathbf{Z}\right\|\,] and \mboxE[1/Z]\mbox{\sf E}[\,1/\left\|\mathbf{Z}\right\|\,].

A Policy for General Bandits

We have shown that when a bandit has a smooth best arm response, the PEGE policy achieves optimal O(rT)O(r\sqrt{T}) regret and Bayes risk. The general idea is that when the estimation error is small, the instantaneous regret of the greedy decision based on our estimate Z^(c)\widehat{\mathbf{Z}}(c) can be of the same order as ZZ^(c)\|\mathbf{Z}-\widehat{\mathbf{Z}}(c)\|. However, under the smoothness assumption, this upper bound on the instantaneous regret is improved to O(ZZ^(c)2)O\left(\|\mathbf{Z}-\widehat{\mathbf{Z}}(c)\|^{2}\right), as shown in the proof of Lemma 3.6, and this enables us to separate exploration from exploitation.

However, if the number of arms is finite or if the collection of arms is an arbitrary compact set, then the PEGE policy may not be effective. This is because a small estimation error may have a disproportionately large effect on the arm chosen by a greedy policy, leading to a large instantaneous regret. In this section, we discuss a policy – which we refer to as the Uncertainty Ellipsoid (UE) policy – that can be applied to any bandit problem, at the price of slightly higher regret and Bayes risk. In contrast to the PEGE policy, the UE policy combines active exploration and exploitation in every period.

As discussed in the introduction, the UE policy is closely related to the algorithms described in Auer (2002) and Dani et al. (2008a), but also has the “anytime” property (the policy does not require prior knowledge of the time horizon TT), and also allows the random vector Z\mathbf{Z} and the errors WtuW^{\mathbf{u}}_{t} to be unbounded. For the sake of completeness, we give a detailed description of our policy and state the regret and risk bounds that we obtain. The reader can find the proofs of these bounds in Appendix B.

To facilitate exposition, we introduce a constant that will appear in the description of the policy, namely,

where the parameters uˉ\bar{u} and λ0\lambda_{0} are given in Assumption 1. The UE policy maintains, at each time period tt, the following two pieces of information.

The ordinary least squares (OLS) estimate defined as follows: if U1,,Ut\mathbf{U}_{1},\ldots,\mathbf{U}_{t} are the arms chosen during the first tt periods, then the OLS estimate Z^t\widehat{\mathbf{Z}}_{t} is given byLet us note that we are abusing notation here. Throughout this section Z^t\widehat{\mathbf{Z}}_{t} stands for the OLS estimate, which is different from the least mean squares estimator \mbox{\sf E}\left[\mathbf{Z}~{}\big{|}~{}\mathbf{H}_{t}\right] introduced in Section 2.:

In contrast to the PEGE policy, whose estimates relied only on the rewards observed in the exploration phases, the estimate Z^t\widehat{\mathbf{Z}}_{t} incorporates all available information up to time tt. We initialize the policy by playing rr linearly independent arms, so that Ct\mathbf{C}_{t} is positive definite for trt\geq r.

where the parameters σ0\sigma_{0} and κ0\kappa_{0} are given in Assumption 1(a) and Equation (3). The uncertainty ellipsoid Et{\mathcal{E}}_{t} represents the set of likely “errors” associated with the estimate Z^t\widehat{\mathbf{Z}}_{t}. We define the uncertainty radius RtuR_{t}^{\mathbf{u}} associated with each arm u\mathbf{u} as follows:

A formal description of the policy is given below.

Initialization: During the first rr periods, play the rr linearly independent arms b1,b2,,br\mathbf{b}_{1},\mathbf{b}_{2},\ldots,\mathbf{b}_{r} given in Assumption 1(b). Determine the OLS estimate Z^r\widehat{\mathbf{Z}}_{r}, the uncertainty ellipsoid Er{\mathcal{E}}_{r}, and the uncertainty radius associated with each arm.

Description: For tr+1t\geq r+1, do the following:

Let UtUr\mathbf{U}_{t}\in\mathcal{U}_{r} be an arm that gives the maximum estimated reward over the ellipsoid Z^t1+Et1\widehat{\mathbf{Z}}_{t-1}+{\mathcal{E}}_{t-1}, that is,

where the uncertainty radius Rt1vR_{t-1}^{\mathbf{v}} is defined in Equation (6); ties are broken arbitrarily.

Play arm Ut\mathbf{U}_{t} and observe the resulting reward XtX_{t}.

Update the OLS estimate Z^t\widehat{\mathbf{Z}}_{t}, the uncertainty ellipsoid Et{\mathcal{E}}_{t}, and the uncertainty radius RtuR_{t}^{\mathbf{u}} of each arm u\mathbf{u}, using the formulas in Equations (4), (5), and (6).

The main results of this section are given in the following two theorems. The first theorem establishes upper bounds on the regret and risk when the set of arms is an arbitrary compact set. This result shows that the UE policy is nearly optimal, admitting upper bounds that are within a logarithmic factor of the Ω(rT)\Omega(r\sqrt{T}) lower bounds given in Theorem 2.1. Although the proof of this theorem makes use of somewhat different (and novel) large deviation inequalities for adaptive least squares estimators, the argument shares similarities with the proofs given in Dani et al. (2008a), and we omit the details. The reader can find a complete proof in Appendix B.2.

When the number of arms is finite, it turns out that we can obtain bounds on regret and risk that scale more gracefully over time, growing as logT\log T and log2T\log^{2}T, respectively. This result is stated in Theorem 4.2, which shows that, for a fixed set of arms, the UE policy is asymptotically optimal as a function time, within a constant factor of the lower bounds established by Lai and Robbins (1985) and Lai (1987).

The reader can find a proof of this result in Appendix B.3.

The regret bound in Theorem 4.2 then follows immediately from the above upper bound and the fact that Nu(z,T)TN^{\mathbf{u}}(\mathbf{z},T)\leq T with probability one, because

and the desired result follows from the fact that Δu(z)=maxvUr(vu)z2uˉz\Delta^{\mathbf{u}}(\mathbf{z})=\max_{\mathbf{v}\in\mathcal{U}_{r}}\left(\mathbf{v}-\mathbf{u}\right)^{\prime}\mathbf{z}\leq 2\bar{u}\left\|\mathbf{z}\right\|, by the Cauchy-Schwarz Inequality.

We will now establish an upper bound on the Bayes risk. From the regret bound, it suffices to show that for any uUr\mathbf{u}\in\mathcal{U}_{r},

Let qu()q^{\mathbf{u}}(\cdot) denote the density function associated with the random variable Δu(Z)\Delta^{\mathbf{u}}\left(\mathbf{Z}\right). Then,

We will now proceed to bound each of the three terms on the right hand side of the above equality. Having assumed that qu()M0q^{\mathbf{u}}(\cdot)\leq M_{0}, the first term satisfies

where the last inequality follows from the fact that logTloglogT2logT\log T-\log\log T\leq 2\log T for all T2T\geq 2. To evaluate the last term, note that logTxlogT\frac{\log T}{x}\leq\log T for all x1x\geq 1, and thus, 1min{logTx,Tx}qu(x)dxlogT1qu(x)logT .\int_{1}^{\infty}\min\left\{\frac{\log T}{x},Tx\right\}q^{\mathbf{u}}(x)dx\leq\log T\int_{1}^{\infty}q^{\mathbf{u}}(x)\leq\log T~{}. Putting everything together, we have that \mboxE[min{logTΔu(Z),TΔu(Z)}](M0+1)logT+M0log2T{\mbox{\sf E}\left[\min\left\{\frac{\log T}{\Delta^{\mathbf{u}}\left(\mathbf{Z}\right)}\,,\,T\Delta^{\mathbf{u}}\left(\mathbf{Z}\right)\right\}\right]}\leq(M_{0}+1)\log T+M_{0}\log^{2}T, which is the desired result. ∎

We conclude this section by giving an example of a random vector Z\mathbf{Z} that satisfies the condition in Theorem 4.2. A similar example also appears in Example 2 of Lai (1987).

Since a polyhedral set has a finite number of extreme points (E(Ur)<\left|\mathcal{E}(\mathcal{U}_{r})\right|<\infty), the parameterized bandit problem can be reduced to the standard multi-armed bandit problem, where each arm corresponds to an extreme point of Ur\mathcal{U}_{r}. We can thus apply the algorithm of Lai and Robbins (1985) and obtain the following upper bound on the TT-period cumulative regret for polyhedra

where the denominator corresponds to the difference between the expected reward of the optimal and the second best extreme points. The algorithm of Lai and Robbins (1985) is effective only when the polyhedral set Ur\mathcal{U}_{r} has a small number of extreme points, as shown by the following examples.

Conclusion

Acknowledgement

The authors would like to thank Adrian Lewis and Mike Todd for helpful and stimulating discussions on the structure of positive definite matrices and the eigenvalues associated with least squares estimators, Gena Samorodnitsky for sharing his deep insights on the application of large deviation theory to this problem, Adam Mersereau for his contributions to the problem formulation, and Assaf Zeevi for helpful suggestions and discussions during the first author’s visit to Columbia Graduate School of Business. We also want to thank the Associate Editor and the referee for their helpful comments and suggestions on the paper. This research is supported in part by the National Science Foundation through grants DMS-0732196, ECCS-0701623, CMMI-0856063, and CMMI-0855928.

References

Appendix A Properties of Normal Vectors

We want to establish a lower bound on Pr{θ  Z  β}\Pr\left\{\theta~{}\leq~{}\left\|\mathbf{Z}\right\|~{}\leq~{}\beta\right\}. Let Y=(Y1,,Yr)\mathbf{Y}=\left(Y_{1},\ldots,Y_{r}\right) denote the standard multivariate normal random vector with mean 0\mathbf{0} and identity covariance matrix Ir\mathbf{I}_{r}. By our hypothesis, Z\mathbf{Z} has the same distribution as Y/r\mathbf{Y}/\sqrt{r}, which implies that

By definition, Y2=Y12++Yr2\left\|\mathbf{Y}\right\|^{2}=Y_{1}^{2}+\cdots+Y_{r}^{2} has a chi-square distribution with rr degrees of freedom. By the Markov Inequality, Pr{Y2>β2r}\mboxE[Y2]/(β2r)=1/β2\Pr\left\{\left\|\mathbf{Y}\right\|^{2}>\beta^{2}r\right\}\leq\mbox{\sf E}\left[\left\|\mathbf{Y}\right\|^{2}\right]/(\beta^{2}r)=1/\beta^{2}. We will now establish an upper bound on Pr{Y2<θ2r}\Pr\left\{\left\|\mathbf{Y}\right\|^{2}<\theta^{2}r\right\}. Note that, for any λ>0\lambda>0,

where last equality follows from the fact that Y1,,YrY_{1},\ldots,Y_{r} are independent standard normal random variables and thus, \mboxE[eλYk2]=1/1+2λ\mbox{\sf E}\left[e^{-\lambda Y_{k}^{2}}\right]=1/\sqrt{1+2\lambda} for λ>0\lambda>0. Set λ=1/θ2\lambda=1/\theta^{2}, and use the facts θ1/22/e\theta\leq 1/2\leq\sqrt{2}/e and r2r\geq 2, to obtain

which implies that Pr{θZβ}11β24θ2\Pr\left\{\theta\,\leq\,\left\|\mathbf{Z}\right\|\,\leq\,\beta\right\}\geq 1-\frac{1}{\beta^{2}}-4\theta^{2}, which is the desired result.

A.2 Proof of Lemma 3.2

For the proof of part (b), let Y=(Y1,,Yr)\mathbf{Y}=\left(Y_{1},\ldots,Y_{r}\right) be a standard multivariate normal random vector with mean 0\mathbf{0} and identity covariance matrix, Ir\mathbf{I}_{r}. Then, Z\mathbf{Z} has the same distribution as Y/r\mathbf{Y}/\sqrt{r}. Note that Y2\left\|\mathbf{Y}\right\|^{2} has a chi-square distribution with rr degrees of freedom. Thus,

We will now establish an upper bound on \mboxE[1/Z]=r\mboxE[1/Y]\mbox{\sf E}[\,1/\left\|\mathbf{Z}\right\|\,]=\sqrt{r}\,\mbox{\sf E}[\,1/\left\|\mathbf{Y}\right\|\,]. For r=2r=2, since Y\left\|\mathbf{Y}\right\| has a chi distribution with two degrees of freedom, we have that

Using the formula for the density of the chi-square distribution, we have

where the third equality follows from the fact that Γ(r/2)=((r/2)1)Γ((r/2)1)\Gamma(r/2)=\left((r/2)-1\right)\cdot\Gamma((r/2)-1) for r3r\geq 3 and the integrand is the density function of the chi-square distribution with r2r-2 degrees of freedom and evaluates to 11. The last inequality follows because r3r\geq 3. Thus, we have \mboxE[1/Z]3π\mbox{\sf E}[\,1/\left\|\mathbf{Z}\right\|\,]\leq\sqrt{3}\leq\sqrt{\pi}, which is the desired result.

Appendix B Proof of Theorems 4.1 and 4.2

In the next section, we establish large deviation inequalities for adaptive least squares estimators (with unbounded error random variables), which will be used in the proof of Theorems 4.1 and 4.2, given in Sections B.2 and B.3, respectively.

The first result extend the standard Chernoff Inequality to our setting involving uncertainty ellipsoids when we have finitely many arms.

We will only prove the second inequality because the proof of the first one follows the same argument. If the sequence of arms U1,U2,\mathbf{U}_{1},\mathbf{U}_{2},\ldots is deterministic (and thus, the matrix Ct\mathbf{C}_{t} is also deterministic), then

The classical Chernoff Inequality for the sum of independent random variables (see, for example, Chapter 1 in Dudley, 1999) then yields

In our setting, however, the arms Ut\mathbf{U}_{t} are random variables that depend on the accumulated history, and we cannot apply the standard Chernoff inequality directly. If Nu(z,t)N^{\mathbf{u}}(\mathbf{z},t) denotes the total number of times that arm u\mathbf{u} has been chosen during the first tt periods given that Z=z\mathbf{Z}=\mathbf{z}, then

which shows that the matrix Ct\mathbf{C}_{t} is completely determined by the nonnegative integer random variables Nu(z,t)N^{\mathbf{u}}(\mathbf{z},t). Since 0Nu(z,t)t0\leq N^{\mathbf{u}}(\mathbf{z},t)\leq t, the number of possible values of the vector (Nu(z,t):uUr)\left(N^{\mathbf{u}}(\mathbf{z},t):\mathbf{u}\in\mathcal{U}_{r}\right) is at most tUrt^{\left|\mathcal{U}_{r}\right|}. It then follows easily that the number of different values of the ordered pair (Ut+1,Ct)\left(\mathbf{U}_{t+1},\mathbf{C}_{t}\right) is at most UrtUrt5Ur\left|\mathcal{U}_{r}\right|t^{\left|\mathcal{U}_{r}\right|}\leq t^{5\left|\mathcal{U}_{r}\right|}. To get the desired result, we can then use the union bound and apply the classical Chernoff Inequality to each ordered pair. ∎

When the number of arms is infinite, the bounds in Theorem B.1 are vacuous. The following theorem provides an extension of the Chernoff inequality to the case of infinitely many arms.

The proof of Theorem B.2 makes use of the following series of lemmas. The first lemma establishes a tail inequality for a ratio of two random variables. De la Peña et al. (2004) gave a proof of this result in Corollary 2.2 (page 1908) of their paper .

Using a standard argument involving iterated expectations, we obtain

We can thus apply Lemma B.3 to the random variables AA and BB. Moreover, it follows from the definition of uˉ\bar{u} and λ0\lambda_{0} in Assumption 1(b) that, with probability one,

Therefore, B2+λ02B2B^{2}+\lambda_{0}\leq 2B^{2}, and

where the last inequality follows from the definition of κ0\kappa_{0} and the fact that tr2t\geq r\geq 2. These two upper bounds imply that (B2+λ0)(1+12log(1+B2λ0))κ0logt  B .\sqrt{\left(B^{2}+\lambda_{0}\right)\left(1+\frac{1}{2}\log\left(1+\frac{B^{2}}{\lambda_{0}}\right)\right)}\leq\kappa_{0}\sqrt{\log t}\;B~{}. Therefore,

and the desired result then follows immediately from Lemma B.3. ∎

The next and final lemma extends the previous result to show that the matrix ζ2κ02σ02(logt)Ct1MtMt\zeta^{2}\,\kappa_{0}^{2}\,\sigma_{0}^{2}\,(\log t)\,\mathbf{C}_{t}^{-1}-\mathbf{M}_{t}\mathbf{M}_{t}^{\prime} is positive semidefinite with a high probability. The proof of this result makes use of the fact that for the matrix ζ2κ02σ02(logt)Ct1MtMt\zeta^{2}\,\kappa_{0}^{2}\,\sigma_{0}^{2}\,(\log t)\,\mathbf{C}_{t}^{-1}-\mathbf{M}_{t}\mathbf{M}_{t}^{\prime} to be positive semidefinite, it suffices for the inequality xMtMtxζ2κ02σ02(logt)xCt1x\mathbf{x}^{\prime}\mathbf{M}_{t}\mathbf{M}_{t}^{\prime}\mathbf{x}\leq\zeta^{2}\,\kappa_{0}^{2}\,\sigma_{0}^{2}\,(\log t)\,\mathbf{x}^{\prime}\mathbf{C}_{t}^{-1}\mathbf{x} to hold for vectors x\mathbf{x} in a sufficiently dense subset. We can then apply Lemma B.4 for each such vector x\mathbf{x} and use the union bound.

Under Assumption 1, for any trt\geq r and ζ2\zeta\geq 2,

where the constants λ0\lambda_{0} and uˉ\bar{u} are given in Assumption 1(b). Without loss of generality, we can assume that δ1/2\delta\leq 1/2 and that 1/δ1/\delta is an integer. Let Xr\mathcal{X}^{r} be a covering of Sr\mathcal{S}^{r}, that is, for any xSr\mathbf{x}\in\mathcal{S}^{r}, there exists yXr\mathbf{y}\in\mathcal{X}^{r} such that xyδ\left\|\mathbf{x}-\mathbf{y}\right\|\leq\delta. It is easy to verify that Xr\mathcal{X}^{r} can be chosen to have a cardinality of at most (2r/δ)r\left(2\sqrt{r}/\delta\right)^{r} because we can consider a rectangular grid on r^{r} with a grid spacing of δ/r\delta/\sqrt{r}. Then, for any point xSr\mathbf{x}\in\mathcal{S}^{r}, there is a point y\mathbf{y} on the rectangular grid such that the magnitude of each component of xy\mathbf{x}-\mathbf{y} is at most δ/r\delta/\sqrt{r}, which implies that xyδ\left\|\mathbf{x}-\mathbf{y}\right\|\leq\delta.

Let trt\geq r and ζ2\zeta\geq 2 be given. To facilitate our exposition, let β=ζ2κ02σ02logt\beta=\zeta^{2}\,\kappa_{0}^{2}\,\sigma_{0}^{2}\,\log t. Let G\mathcal{G} denote the event that the following inequalities hold:

Using the union bound, it follows from Lemma B.4 that the event G\mathcal{G} happens with a probability at least

where we have used the fact that tr2t\geq\sqrt{r}\geq 2 in the penultimate inequality. The final inequality follows from the definition of κ0\kappa_{0} in Equation (3), which implies that κ024(1+log(36uˉ2/λ0))4\kappa_{0}^{2}\geq 4\left(1+\log(36\bar{u}^{2}/\lambda_{0})\right)\geq 4, and thus, 36uˉ2t2λ0t2eκ02/4tκ02/2(t2)κ02/4=tκ02 .\frac{36\,\bar{u}^{2}\,t^{2}}{\lambda_{0}}\leq t^{2}e^{\kappa_{0}^{2}/4}\leq t^{\kappa_{0}^{2}/2}\left(t^{2}\right)^{\kappa_{0}^{2}/4}=t^{\kappa_{0}^{2}}~{}.

To complete the proof, it suffices to show that when the event G\mathcal{G} occurs, we have that xMtMtxβxCt1x\mathbf{x}^{\prime}\mathbf{M}_{t}\mathbf{M}_{t}^{\prime}\mathbf{x}\leq\beta\,\mathbf{x}^{\prime}\mathbf{C}_{t}^{-1}\mathbf{x} for all xSr\mathbf{x}\in\mathcal{S}^{r}. Consider an arbitrary xSr\mathbf{x}\in\mathcal{S}^{r}, and let yXr\mathbf{y}\in\mathcal{X}^{r} be such that xyδ\left\|\mathbf{x}-\mathbf{y}\right\|\leq\delta. This implies that x+y2x+yx2+δ3\left\|\mathbf{x}+\mathbf{y}\right\|\leq\left\|2\mathbf{x}\right\|+\left\|\mathbf{y}-\mathbf{x}\right\|\leq 2+\delta\leq 3. Moreover, xMtMtxyMtMty=(xy)MtMt(x+y)3δMt2\mathbf{x}^{\prime}\mathbf{M}_{t}\mathbf{M}_{t}^{\prime}\mathbf{x}-\mathbf{y}^{\prime}\mathbf{M}_{t}\mathbf{M}_{t}^{\prime}\mathbf{y}=\left(\mathbf{x}-\mathbf{y}\right)^{\prime}\mathbf{M}_{t}\mathbf{M}_{t}^{\prime}\left(\mathbf{x}+\mathbf{y}\right)\leq 3\delta\left\|\mathbf{M}_{t}\right\|^{2} where we use the Cauchy-Schwarz for the last inequality.

Similarly, we can show that for all ss, yUsUsyxUsUsx+3δuˉ2\mathbf{y}^{\prime}\mathbf{U}_{s}\mathbf{U}_{s}^{\prime}\mathbf{y}\leq\mathbf{x}^{\prime}\mathbf{U}_{s}\mathbf{U}_{s}^{\prime}\mathbf{x}+3\delta\bar{u}^{2}. Summing over all ss, we obtain that yCt1yxCt1x+3δtuˉ2\mathbf{y}^{\prime}\mathbf{C}_{t}^{-1}\mathbf{y}\leq\mathbf{x}\mathbf{C}_{t}^{-1}\mathbf{x}+3\delta t\bar{u}^{2}. Putting everything together, we have that

where the last inequality follows from the fact that Ct1=s=1tUsUsλ0Ir\mathbf{C}_{t}^{-1}=\sum_{s=1}^{t}\mathbf{U}_{s}\mathbf{U}_{s}^{\prime}\geq\lambda_{0}\mathbf{I}_{r} from our definition of λ0\lambda_{0}. Finally, note that under the event G\mathcal{G},

where the last inequality follows from the definition of δ\delta. Thus, we have that xMtMtxβxCt1x\mathbf{x}^{\prime}\mathbf{M}_{t}\mathbf{M}_{t}^{\prime}\mathbf{x}\leq\beta\mathbf{x}^{\prime}\mathbf{C}_{t}^{-1}\mathbf{x}, which is the desired result. ∎

We are now ready to give a proof of Theorem B.2.

It suffices to establish the second inequality in Theorem B.2 because the proof for the first inequality follows the same argument. It follows from the Cauchy-Schwarz inequality that

where the last equality follows from the definition of the least squares estimate Z^t\widehat{\mathbf{Z}}_{t}.

It is a well-known result in linear algebra (see, for example, Theorem 1.3.3 in Bhatia, 2007) that if A\mathbf{A} and B\mathbf{B} are two symmetric positive definite matrices, then the block matrix \left(\begin{array}[]{cc}\mathbf{A}&\mathbf{X}\\ \mathbf{X}^{\prime}&\mathbf{B}\end{array}\right) is positive semidefinite if and only if XB1XA\mathbf{X}\mathbf{B}^{-1}\mathbf{X}^{\prime}\leq\mathbf{A}. Applying this result to the two “equivalent” (r+1)×(r+1)(r+1)\times(r+1) matrices \left(\begin{array}[]{cc}\zeta^{2}\,\kappa_{0}^{2}\,\sigma_{0}^{2}\,\log t&\mathbf{M}_{t}^{\prime}\\ \mathbf{M}_{t}&\mathbf{C}_{t}^{-1}\end{array}\right)\quad\textrm{and}\quad\left(\begin{array}[]{cc}\mathbf{C}_{t}^{-1}&\mathbf{M}_{t}\\ \mathbf{M}_{t}^{\prime}&\zeta^{2}\,\kappa_{0}^{2}\,\sigma_{0}^{2}\,\log t\end{array}\right)~{}, we conclude that MtCtMtζ2κ02σ02logt\mathbf{M}_{t}^{\prime}\mathbf{C}_{t}\mathbf{M}_{t}\leq\zeta^{2}\,\kappa_{0}^{2}\,\sigma_{0}^{2}\,\log t if and only if MtMtζ2κ02σ02(logt)Ct1\mathbf{M}_{t}\mathbf{M}_{t}^{\prime}\leq\zeta^{2}\,\kappa_{0}^{2}\,\sigma_{0}^{2}\,(\log t)\mathbf{C}_{t}^{-1}. The desired result then follows from the fact that

where the last inequality follows from Lemma B.5. ∎

B.2 Bounds for General Compact Sets of Arms: Proof of Theorem 4.1

The proof of Theorem 4.1 makes use of a number of auxiliary results. The first result provides a motivation for the choice of the parameter α\alpha in Equation (5) and our definition of the uncertainty radius RtuR^{\mathbf{u}}_{t} in Equation (6). They are chosen to keep the probability of overestimating the reward of an arm by more than RtuR^{\mathbf{u}}_{t} bounded by 1/t21/t^{2}. This will limit the growth rate of the cumulative regret due to such overestimation.

Under Assumption 1, for any arm uUr\mathbf{u}\in\mathcal{U}_{r} and trt\geq r,

where α=4σ0κ02\alpha=4\sigma_{0}\kappa_{0}^{2}.

It suffices to establish the first inequality because the proof of the second one is exactly the same. Let βt=4σ0κ02logtmin{rlogt,Ur}\beta_{t}=4\sigma_{0}\kappa_{0}^{2}\,\sqrt{\log t}\,\sqrt{\min\{r\log t\,,\,\left|\mathcal{U}_{r}\right|\}}. Recall from Equations (5) and (6) that Rtu=βtuCtR^{\mathbf{u}}_{t}=\beta_{t}\left\|\mathbf{u}\right\|_{\mathbf{C}_{t}}. By applying Theorem B.1 (with ζ=4κ02logtmin{rlogt,Ur}\zeta=4\kappa_{0}^{2}\,\sqrt{\log t}\,\sqrt{\min\{r\log t\,,\,\left|\mathcal{U}_{r}\right|\}}) and Theorem B.2 (with ζ=4κ0min{rlogt,Ur}\zeta=4\kappa_{0}\,\sqrt{\min\{r\log t\,,\,\left|\mathcal{U}_{r}\right|\}}), we obtain

There are two cases to consider: rlogt>Urr\log t>\left|\mathcal{U}_{r}\right| and rlogtUrr\log t\leq\left|\mathcal{U}_{r}\right|. Suppose that rlogt>Urr\log t>\left|\mathcal{U}_{r}\right|. Then,

where the last inequality follows from the fact that (8κ045)Ur2\left(8\kappa_{0}^{4}-5\right)\left|\mathcal{U}_{r}\right|\geq 2. In the second case where rlogtUrr\log t\leq\left|\mathcal{U}_{r}\right|, we have that

where the last inequality follows from the fact that 3rκ0223r\kappa_{0}^{2}\geq 2. Since the probability is bounded by 1/t21/t^{2} in both cases, this gives the desired result. ∎

For any t1t\geq 1, let the random variable Qt(z)Q_{t}(\mathbf{z}) denote the instantaneous regret in period tt given that Z=z\mathbf{Z}=\mathbf{z}, that is,

Lemma B.6 shows that the probability of a large estimation error in period tt is at most O(1/t2)O\left(1/t^{2}\right). Consequently, as shown in the following lemma, the probability of having a large instantaneous regret in period tt is also small.

Suppose that the event Qt+1(z)>2βtUt+1CtQ_{t+1}(\mathbf{z})>2\beta_{t}\,\left\|\mathbf{U}_{t+1}\right\|_{\mathbf{C}_{t}} occurs. Then, it follows that

which implies that (Ut+1w)(Z^tz)>βt(Ut+1Ct+wCt)βtUt+1wCt .\left(\mathbf{U}_{t+1}-\mathbf{w}\right)^{\prime}\left(\widehat{\mathbf{Z}}_{t}-\mathbf{z}\right)>\beta_{t}\,\left(\left\|\mathbf{U}_{t+1}\right\|_{\mathbf{C}_{t}}+\left\|\mathbf{w}\right\|_{\mathbf{C}_{t}}\right)\geq\beta_{t}\,\left\|\mathbf{U}_{t+1}-\mathbf{w}\right\|_{\mathbf{C}_{t}}~{}. Thus,

the last inequality follows from Lemma B.6. ∎

Lemma B.7 suggests the following approach for bounding the cumulative regret over TT periods. In the first rr periods (during the initialization), we incur a regret of O(r)O(r). For each time period between r+1r+1 and TT, we consider the two cases: 1) where the instantaneous regret is large with Qt+1(z)>2αlogtmin{rlogt,Ur}Ut+1CtQ_{t+1}(\mathbf{z})>2\alpha\,\sqrt{\log t}\sqrt{\min\left\{r\log t,\left|\mathcal{U}_{r}\right|\right\}}\,\left\|\mathbf{U}_{t+1}\right\|_{\mathbf{C}_{t}}; and, 2) the instantaneous regret is small. By the above lemma, the contribution to the cumulative regret from the first case is bounded above by O(t1/t2)O\left(\sum_{t}1/t^{2}\right), which is finite. In the second case, we have a simple upper bound of 2αr(logt)Ut+1Ct2\alpha\,\sqrt{r}\,\left(\log t\right)\left\|\mathbf{U}_{t+1}\right\|_{\mathbf{C}_{t}} for the instantaneous regret. This argument leads to the following bound on the cumulative regret over TT periods.

For any trt\geq r, let the indicator random variable Gt+1(z)G_{t+1}(\mathbf{z}) be defined by:

The contribution to the expected instantaneous regret \mbox{\sf E}\left[Q_{t+1}(\mathbf{z})~{}\big{|}~{}\mathbf{Z}=\mathbf{z}\right] comes from two cases: 1) when Gt+1(z)=0G_{t+1}(\mathbf{z})=0 and 2) when Gt+1(z)=1G_{t+1}(\mathbf{z})=1. We will upper bound each of these two contributions separately. In the first case, we know from Lemma B.7 that \Pr\left\{G_{t+1}(\mathbf{z})=0~{}\big{|}~{}\mathbf{Z}=\mathbf{z}\right\}=\Pr\left\{Q_{t+1}(\mathbf{z})>2\alpha\,\sqrt{\log t}\sqrt{\min\left\{r\log t,\left|\mathcal{U}_{r}\right|\right\}}\,\left\|\mathbf{U}_{t+1}\right\|_{\mathbf{C}_{t}}~{}\big{|}~{}\mathbf{Z}=\mathbf{z}\right\}\leq 1/t^{2}. Since t=11/t22\sum_{t=1}^{\infty}1/t^{2}\leq 2, we have that

On the other hand, when Gt+1(z)=1G_{t+1}(\mathbf{z})=1, we have that Qt+1(z)2αr(logt)Ut+1CtQ_{t+1}(\mathbf{z})\leq 2\alpha\sqrt{r}(\log t)\,\left\|\mathbf{U}_{t+1}\right\|_{\mathbf{C}_{t}}. This implies that, with probability one,

where we use the Cauchy-Schwarz Inequality in the second inequality and the final inequality follows from the fact that t=rT1log2tt=1T1log2T(logT)T .\sqrt{\sum_{t=r}^{T-1}\log^{2}t}\,\leq\,\sqrt{\sum_{t=1}^{T-1}\log^{2}T}\leq\left(\log T\right)\sqrt{T}~{}. Putting the two cases together gives the desired upper bound because

The eigenvectors of the matrix Ct=(s=1tUsUs)1\mathbf{C}_{t}=\left(\sum_{s=1}^{t}\mathbf{U}_{s}\mathbf{U}_{s}^{\prime}\right)^{-1} reflect the directions of the arms that are chosen during the first tt periods. The corresponding eigenvalues then measure the frequency with which these directions are explored. Frequently explored directions will have small eigenvalues, while the eigenvalues for unexplored directions will be large. Thus, the weighted norm Ut+1Ct\left\|\mathbf{U}_{t+1}\right\|_{\mathbf{C}_{t}} has two interpretations. First, it measures the size of the regret in period t+1t+1. In addition, since Ut+1Ct2\left\|\mathbf{U}_{t+1}\right\|_{\mathbf{C}_{t}}^{2} is a linear combination of the eigenvalues of Ct\mathbf{C}_{t}, it also reflects the amount of exploration in period t+1t+1 in the unexplored directions.

The above interpretation suggests that if we incur large regrets in the past (equivalently, we have done a lot of exploration), then the current regret should be small. Our intuition is confirmed in the following lemma that establishes a recursive relationship between the weighted norm Ut+1Ct\left\|\mathbf{U}_{t+1}\right\|_{\mathbf{C}_{t}} in period t+1t+1 and the norms in the preceding periods.

For any trt\geq r, let Υt=(Ct)1=s=1tUsUs\mathbf{\Upsilon}_{t}=\left(\mathbf{C}_{t}\right)^{-1}=\sum_{s=1}^{t}\mathbf{U}_{s}\mathbf{U}_{s}^{\prime}. By the Rayleigh Principle,

where the last inequality follows from the definition of uˉ\bar{u} and the fact that

where the last equality follows from the fact that Υr=k=1rbkbk\mathbf{\Upsilon}_{r}=\sum_{k=1}^{r}{\mathbf{b}}_{k}{\mathbf{b}}_{k}^{\prime} where the vectors b1,,br\mathbf{b}_{1},\ldots,\mathbf{b}_{r} are given in Assumption 1(b). This proves the claimed upper bound on Ut+1Ct2\left\|\mathbf{U}_{t+1}\right\|_{\mathbf{C}_{t}}^{2}.

We will now establish the inequality that relates Ut+1Ct2\left\|\mathbf{U}_{t+1}\right\|_{\mathbf{C}_{t}}^{2} to Us+1Cs2\left\|\mathbf{U}_{{s+1}}\right\|_{\mathbf{C}_{s}}^{2} for s<ts<t. Note that

where the second to last equality follows the matrix determinant lemma.

We will now establish bounds on the determinants det(Υt+1)\det\left(\mathbf{\Upsilon}_{t+1}\right) and det(Υt)\det\left(\mathbf{\Upsilon}_{t}\right). Note that

where the last inequality follows from the definition of uˉ\bar{u} . Therefore, det(Υt+1)[λmax(Υt+1)]r(t+1)ruˉ2r\det\left(\mathbf{\Upsilon}_{t+1}\right)\leq\left[\lambda_{\max}\left(\mathbf{\Upsilon}_{t+1}\right)\right]^{r}\leq(t+1)^{r}\bar{u}^{2r}. Moreover, using Equation (10) repeatedly, we obtain

where the last inequality follows from the fact that Υr=k=1rbkbk\mathbf{\Upsilon}_{r}=\sum_{k=1}^{r}\mathbf{b}_{k}\mathbf{b}_{k}^{\prime} and det(Υr)[λmin(Υr)]rλ0r\det\left(\mathbf{\Upsilon}_{r}\right)\geq\left[\lambda_{\min}\left(\mathbf{\Upsilon}_{r}\right)\right]^{r}\geq\lambda_{0}^{r}, where the vectors b1,,br\mathbf{b}_{1},\ldots,\mathbf{b}_{r} and the parameter λ0\lambda_{0} are defined in Assumption 1(b).

Putting everything together, we have that

The above result shows that if the weighted norms in the preceding periods, as measured by s=rt1(1+Us+1Cs2){\prod_{s=r}^{t-1}\left(1+\left\|\mathbf{U}_{{s+1}}\right\|_{\mathbf{C}_{s}}^{2}\right)}, are large, then the weighted norm in the current period t+1t+1 will be small. Moreover, since the weighted norm in the current period depends on the product of the norms in the past, we hope that the growth rate of the sum t=rT1Ut+1Ct2\sum_{t=r}^{T-1}\left\|\mathbf{U}_{t+1}\right\|_{\mathbf{C}_{t}}^{2} should be small. To formalize our conjecture, we introduce a related optimization problem. For any c0c\geq 0 and t1t\geq 1, let V(c,t)V^{*}(c,t) be defined by:

where we define q=10(1+yq)=1\prod_{q=1}^{0}(1+y_{q})=1. The following lemma gives an upper bound in terms of the function VV^{*}.

For all s1s\geq 1, let ys=Ur+sCr+s12y_{s}=\left\|\mathbf{U}_{r+s}\right\|_{\mathbf{C}_{r+s-1}}^{2}. Then, t=rT1Ut+1Ct2=s=1Trys\sum_{t=r}^{T-1}\left\|\mathbf{U}_{t+1}\right\|_{\mathbf{C}_{t}}^{2}=\sum_{s=1}^{T-r}y_{s}. Let c0=uˉ2/λ0c_{0}=\bar{u}^{2}/\lambda_{0}. It follows from Lemma B.9 that for all ss, with probability one, 0ysc00\leq y_{s}\leq c_{0} and ys{c0(r+s)}r/q=1s1(1+yq)y_{s}\leq\left\{c_{0}\,(r+s)\right\}^{r}/\prod_{q=1}^{s-1}\left(1+y_{q}\right). Therefore, we have t=rT1Ut+1Ct2V(c0 , Tr)\sum_{t=r}^{T-1}\left\|\mathbf{U}_{t+1}\right\|_{\mathbf{C}_{t}}^{2}\leq V^{*}\left(c_{0}~{},~{}T-r\right). ∎

It follows from Lemma B.10 that it suffices to develop an upper bound on V(c,t)V^{*}(c,t). This result is given in the following lemma.

Any feasible solution {ys:s=1,,t}\{y_{s}:s=1,\ldots,t\} for the problem defining V(c,t)V^{*}(c,t), also satisfies the constraints

where the last inequality follows from the fact that for any aa\in, we have 1+aea/21+a\geq e^{a/2}. Thus, by letting as=ys/2c0a_{s}=y_{s}/2c_{0}, we obtain V(c,t)2c0W(c0,t)V^{*}(c,t)\leq 2c_{0}W^{*}(c_{0},t), where W(c0,t)W^{*}(c_{0},t) is the maximum possible value of s=1tas\sum_{s=1}^{t}a_{s}, subject to

Let us introduce a continuous-time variable τ\tau, and define a(τ)=asa(\tau)=a_{s}, for τ[s1,s)\tau\in[s-1,s). Let b(τ)=0τa(τ)dτb(\tau)=\int_{0}^{\tau}a(\tau^{\prime})\,d\tau^{\prime}, and note that b(s)=q=1saqb(s)=\sum_{q=1}^{s}a_{q}. For any τ[s1,s)\tau\in[s-1,s), we have

Let d(τ)=eb(τ)d(\tau)=e^{b(\tau)}. Then, for any τ0\tau\geq 0,

By integrating both sides, we obtain d(t)ec0r(r+t+1)r+1r+1d(t)\leq\frac{e\,c_{0}^{r}\,(r+t+1)^{r+1}}{r+1} for all t0t\geq 0. Since e/(r+1)1e/(r+1)\leq 1 because r2r\geq 2, taking logarithms, we obtain

The right-hand side above is therefore an upper bound on W(c0,t)W^{*}(c_{0},t), which leads to the upper bound on V(c,t)V^{*}(c,t), which gives the desired result. ∎

Finally, here is the proof of Theorem 4.1.

It suffices to establish the regret bound because the risk bound follows immediately from taking the expectation. Let A0=max{1,uˉ2/λ0}A_{0}=\max\{1,\bar{u}^{2}/\lambda_{0}\}. It follows from Lemmas B.8, B.10, and B.11 that

for some positive constants a4a_{4} and a5a_{5} that depend only on σ0\sigma_{0}, uˉ\bar{u}, and λ0\lambda_{0}. ∎

B.3 Bounds for Finitely Many Arms: Proof of Theorem 4.2

Since Pr{u(Z^tz)>RtuZ=z}\Pr\left\{\mathbf{u}^{\prime}\left(\widehat{\mathbf{Z}}_{t}-\mathbf{z}\right)>R_{t}^{\mathbf{u}}\mid\mathbf{Z}=\mathbf{z}\right\} and Pr{w(Z^tz)<RtwZ=z}\Pr\left\{\mathbf{w}^{\prime}\left(\widehat{\mathbf{Z}}_{t}-\mathbf{z}\right)<-R_{t}^{\mathbf{w}}\mid\mathbf{Z}=\mathbf{z}\right\} are bounded above by 1/t21/t^{2} by Lemma B.6, we can show that

Let H=vUr:vuNv(z,t)vv\mathbf{H}=\sum_{\mathbf{v}\in\mathcal{U}_{r}:\mathbf{v}\neq\mathbf{u}}N^{\mathbf{v}}(\mathbf{z},t)\mathbf{v}\mathbf{v}^{\prime}. It follows from Equation (4) and the Sherman-Morrison Formula (see Sherman and Morrison, 1950) that

and therefore, 2R_{t}^{\mathbf{u}}=2\alpha\,\sqrt{\log t}\sqrt{\min\left\{r\log t,\left|\mathcal{U}_{r}\right|\right\}}\left\|\mathbf{u}\right\|_{\mathbf{C}_{t}}\leq\big{(}2\alpha\,\sqrt{\left|\mathcal{U}_{r}\right|\,\log t}\,\big{)}/\sqrt{N^{\mathbf{u}}(\mathbf{z},t)}.

By setting θ=1+4α2UrlogT(Δu(z))2 ,\theta=1+\left\lceil\frac{4\alpha^{2}\left|\mathcal{U}_{r}\right|\log T}{\left(\Delta^{\mathbf{u}}\left(\mathbf{z}\right)\right)^{2}}\right\rceil~{}, we conclude that 2Rtu<Δu(z)=(wu)z2R_{t}^{\mathbf{u}}<\Delta^{\mathbf{u}}\left(\mathbf{z}\right)=\left(\mathbf{w}-\mathbf{u}\right)^{\prime}\mathbf{z} whenever Nu(z,t)θN^{\mathbf{u}}(\mathbf{z},t)\geq\theta. This implies that 1 ⁣l{(wu)z  2Rtu  and  Nu(z,t)θ}=0 {\mathchoice{\rm 1\mskip-5.0mul}{\rm 1\mskip-4.0mul}{\rm 1\mskip-4.5mul}{\rm 1\mskip-5.0mul}}_{\left\{\left(\mathbf{w}-\mathbf{u}\right)^{\prime}\mathbf{z}~{}\leq~{}2R_{t}^{\mathbf{u}}\;\textrm{and}\;N^{\mathbf{u}}(\mathbf{z},t)\geq\theta\right\}}=0~{}, and we have that