Multi-Armed Bandits in Metric Spaces

Robert Kleinberg, Aleksandrs Slivkins, Eli Upfal

Introduction

In a multi-armed bandit problem, an online algorithm must choose from a set of strategies in a sequence of nn trials so as to maximize the total payoff of the chosen strategies. These problems are the principal theoretical tool for modeling the exploration/exploitation tradeoffs inherent in sequential decision-making under uncertainty. Studied intensively for the last three decades , bandit problems are having an increasingly visible impact on computer science because of their diverse applications including online auctions, adaptive routing, and the theory of learning in games. The performance of a multi-armed bandit algorithm is often evaluated in terms of its regret, defined as the gap between the expected payoff of the algorithm and that of an optimal strategy. While the performance of bandit algorithms with a small finite strategy set is quite well understood, bandit problems with exponentially or infinitely large strategy sets are still a topic of very active investigation .

Absent any assumptions about the strategies and their payoffs, bandit problems with large strategy sets allow for no non-trivial solutions — any multi-armed bandit algorithm performs as badly, on some inputs, as random guessing. But in most applications it is natural to assume a structured class of payoff functions, which often enables the design of efficient learning algorithms . In this paper, we consider a broad and natural class of problems in which the structure is induced by a metric on the space of strategies. While bandit problems have been studied in a few specific metric spaces (such as a one-dimensional interval) , the case of general metric spaces has not been treated before, despite being an extremely natural setting for bandit problems. As a motivating example, consider the problem faced by a website choosing from a database of thousands of banner ads to display to users, with the aim of maximizing the click-through rate of the ads displayed by matching ads to users’ characterizations and the web content that they are currently watching. Independently experimenting with each advertisement is infeasible, or at least highly inefficient, since the number of ads is too large. Instead, the advertisements are usually organized into a taxonomy based on metadata (such as the category of product being advertised) which allows a similarity measure to be defined. The website can then attempt to optimize its learning algorithm by generalizing from experiments with one ad to make inferences about the performance of similar ads . Abstractly, we have a bandit problem of the following form: there is a strategy set XX, with an unknown payoff function μ:X\mu\,:\,X\rightarrow satisfying a set of predefined constraints of the form μ(u)μ(v)δ(u,v)|\mu(u)-\mu(v)|\leq\delta(u,v) for some u,vXu,v\in X and δ(u,v)>0\delta(u,v)>0. In each period the algorithm chooses a point xXx\in X and observes an independent random sample from a payoff distribution whose expectation is μ(x)\mu(x).

A moment’s thought reveals that this abstract problem can be regarded as a bandit problem in a metric space. Specifically, if L(u,v)L(u,v) is defined to be the infimum, over all finite sequences u=x0,x1,,xk=vu=x_{0},x_{1},\ldots,x_{k}=v in XX, of the quantity iδ(xi,xi+1)\sum_{i}\delta(x_{i},x_{i+1}), then LL is a metricMore precisely, it is a pseudometric because some pairs of distinct points x,yXx,y\in X may satisfy L(x,y)=0L(x,y)=0. and the constraints μ(u)μ(v)<δ(u,v)|\mu(u)-\mu(v)|<\delta(u,v) may be summarized by stating that μ\mu is a Lipschitz function (of Lipschitz constant 11) on the metric space (L,X)(L,X). We refer to this problem as the Lipschitz MAB problem on (L,X)(L,X), and we refer to the ordered triple (L,X,μ)(L,X,\mu) as an instance of the Lipschitz MAB problem. When the metric space (L,X)(L,X) is understood from context, we may also refer to μ\mu as an instance.

While our work is the first to treat the Lipschitz MAB problem in general metric spaces, special cases of the problem are implicit in prior work on the continuum-armed bandit problem — which corresponds to the space $underthemetricunder the metricL_{d}(x,y)=|x-y|^{1/d},,d\geq 1andtheexperimentalworkonbanditsfortaxonomies,whichcorrespondstothecaseinwhich— and the experimental work on “bandits for taxonomies” , which corresponds to the case in which(L,X)isatreemetric.Beforedescribingourresultsingreaterdetail,itishelpfultoputthemincontextbyrecountingthenearlyoptimalboundsfortheonedimensionalcontinuumarmedbanditproblem,aproblemfirstformulatedbyR.Agrawalin1995andrecentlysolved(uptologarithmicfactors)byvariousauthors.Inthefollowingtheoremandthroughoutthispaper,theregretofamultiarmedbanditalgorithmis a tree metric. Before describing our results in greater detail, it is helpful to put them in context by recounting the nearly optimal bounds for the one-dimensional continuum-armed bandit problem, a problem first formulated by R. Agrawal in 1995 and recently solved (up to logarithmic factors) by various authors . In the following theorem and throughout this paper, the regret of a multi-armed bandit algorithm\mathcal{A}runningonaninstancerunning on an instance(L,X,\mu)isdefinedtobethefunctionis defined to be the functionR_{\mathcal{A}}(t)whichmeasuresthedifferencebetweenitsexpectedpayoffattimewhich measures the difference between its expected payoff at timetandthequantityand the quantityt\,\sup_{x\in X}\mu(x).Thelatterquantityistheexpectedpayoffofalwaysplayingastrategy. The latter quantity is the expected payoff of always playing a strategyx\in\operatornamewithlimits{argmax}\mu(x)$ if such strategy exists.

In fact, if the time horizon tt is known in advance, the upper bound in the theorem can be achieved by an extremely naïve algorithm which simply uses an optimal kk-armed bandit algorithm (such as the ucb1 algorithm ) to choose strategies from the set S={0,1k,2k,,1}S=\{0,\tfrac{1}{k},\,\tfrac{2}{k},\,\ldots,1\}, for a suitable choice of the parameter kk. While the regret bound in Theorem 1.1 is essentially optimal for the Lipschitz MAB problem in (Ld,)(L_{d},), it is strikingly odd that it is achieved by such a simple algorithm. In particular, the algorithm approximates the strategy set by a fixed mesh SS and does not refine this mesh as it gains information about the location of the optimal strategy. Moreover, the metric contains seemingly useful proximity information, but the algorithm ignores this information after choosing its initial mesh. Is this really the best algorithm?

A closer examination of the lower bound proof raises further reasons for suspicion: it is based on a contrived, highly singular payoff function μ\mu that alternates between being constant on some distance scales and being very steep on other (much smaller) distance scales, to create a multi-scale “needle in haystack” phenomenon which nearly obliterates the usefulness of the proximity information contained in the metric LdL_{d}. Can we expect algorithms to do better when the payoff function is more benign? For the Lipschitz MAB problem on (L1,)(L_{1},), the question was answered affirmatively in for some classes of instances, with algorithms that are tuned to the specific classes.

Our results and techniques.

In this paper we consider the Lipschitz MAB problem on arbitrary metric spaces. We are concerned with the following two main questions motivated by the discussion above:

What is the best possible bound on regret for a given metric space?

Can one take advantage of benign payoff functions?

In this paper we give a complete solution to (i), by describing for every metric space XX a family of algorithms which come arbitrarily close to achieving the best possible regret bound for XX. We also give a satisfactory answer to (ii); our solution is arbitrarily close to optimal in terms of the zooming dimension defined below. In fact, our algorithm for (i) is an extension of the algorithmic technique used to solve (ii).

Our main technical contribution is a new algorithm, the zooming algorithm, that combines the upper confidence bound technique used in earlier bandit algorithms such as ucb1 with a novel adaptive refinement step that uses past history to zoom in on regions near the apparent maxima of μ\mu and to explore a denser mesh of strategies in these regions. This algorithm is a key ingredient in our design of an optimal bandit algorithm for every metric space (L,X)(L,X). Moreover, we show that the zooming algorithm can perform significantly better on benign problem instances. That is, for every instance (L,X,μ)(L,X,\mu) we define a parameter called the zooming dimension, and use it to bound the algorithm’s performance in a way that is often significantly stronger than the corresponding per-metric bound. Note that the zooming algorithm is self-tuning, i.e. it achieves this bound without requiring prior knowledge of the zooming dimension.

To state our theorem on the per-metric optimal solution for (i), we need to sketch a few definitions which arise naturally as one tries to extend the lower bound from to general metric spaces. Let us say that a subset YY in a metric space XX has covering dimension dd if it can be covered by O(δd)O(\delta^{-d}) sets of diameter δ\delta for all δ>0\delta>0. A point xXx\in X has local covering dimension dd if it has an open neighborhood of covering dimension dd. The space XX has max-min-covering dimension d=MaxMinCOV(X)d=\mathtt{MaxMinCOV}(X) if it has no subspace whose local covering dimension is uniformly bounded below by a number greater than dd.

Consider the Lipschitz MAB problem on a compact metric space (L,X)(L,X). Let d=MaxMinCOV(X)d=\mathtt{MaxMinCOV}(X). If γ>d+1d+2\gamma>\tfrac{d+1}{d+2} then there exists a bandit algorithm A\mathcal{A} such that for every problem instance I\mathcal{I} it satisfies RA(t)=OI(tγ)R_{\mathcal{A}}(t)=O_{\mathcal{I}}(t^{\gamma}) for all tt. No such algorithm exists if d>0d>0 and γ<d+1d+2\gamma<\tfrac{d+1}{d+2}.

In general MaxMinCOV(X)\mathtt{MaxMinCOV}(X) is bounded above by the covering dimension of XX. For metric spaces which are highly homogeneous (in the sense that any two ϵ\epsilon-balls are isometric to one another) the two dimensions are equal, and the upper bound in the theorem can be achieved using a generalization of the naïve algorithm described earlier. The difficulty in Theorem 1.2 lies in dealing with inhomogeneities in the metric space.To appreciate this issue, it is very instructive to consider a concrete example of a metric space (L,X)(L,X) where MaxMinCOV(X)\mathtt{MaxMinCOV}(X) is strictly less than the covering dimension, and for this specific example design a bandit algorithm whose regret bounds are better than those suggested by the covering dimension. This is further discussed in Section 3. It is important to treat the problem at this level of generality, because some of the most natural applications of the Lipschitz MAB problem, e.g. the web advertising problem described earlier, are based on highly inhomogeneous metric spaces. (That is, in web taxonomies, it is unreasonable to expect different categories at the same level of a topic hierarchy to have the roughly the same number of descendants.)

The algorithm in Theorem 1.2 combines the zooming algorithm described earlier with a delicate transfinite construction over closed subsets consisting of “fat points” whose local covering dimension exceeds a given threshold dd. For the lower bound, we craft a new dimensionality notion, the max-min-covering dimension introduced above, which captures the inhomogeneity of a metric space, and we connect this notion with the transfinite construction that underlies the algorithm.

For “benign” input instances we provide a better performance guarantee for the zooming algorithm. The lower bounds in Theorems 1.1 and 1.2 are based on contrived, highly singular, “needle in haystack” instances in which the set of near-optimal strategies is astronomically larger than the set of precisely optimal strategies. Accordingly, we quantify the tractability of a problem instance in terms of the number of near-optimal strategies. We define the zooming dimension of an instance (L,X,μ)(L,X,\mu) as the smallest dd such that the following covering property holds: for every δ>0\delta>0 we require only O(δd)O(\delta^{-d}) sets of diameter δ/8\delta/8 to cover the set of strategies whose payoff falls short of the maximum by an amount between δ\delta and 2δ2\delta.

The zooming dimension can be significantly smaller than the max-min-covering dimension. Let us illustrate this point with two examples (where for simplicity the max-min-covering dimension is equal to the covering dimension). For the first example, consider a metric space consisting of a high-dimensional part and a low-dimensional part. For concreteness, consider a rooted tree TT with two top-level branches TT^{\prime} and TT^{\prime\prime} which are complete infinite kk-ary trees, k=2,10k=2,10. Assign edge weights in TT that are exponentially decreasing with distance to the root, and let LL be the resulting shortest-path metric on the leaf set XX.Here a leaf is defined as an infinite path away from the root. If there is a unique optimal strategy that lies in the low-dimensional part TT^{\prime} then the zooming dimension is bounded above by the covering dimension of TT^{\prime}, whereas the “global” covering dimension is that of TT^{\prime\prime}. In the second example, let (L,X)(L,X) be a homogeneous high-dimensional metric, e.g. the Euclidean metric on the unit kk-cube, and the payoff function is μ(x)=1L(x,S)\mu(x)=1-L(x,S) for some subset SS. Then the zooming dimension is equal to the covering dimension of SS, e.g. it is if SS is a finite point set.

Discussion.

In stating the theorems above, we have been imprecise about specifying the model of computation. In particular, we have ignored the thorny issue of how to provide an algorithm with an input containing a metric space which may have an infinite number of points. The simplest way to interpret our theorems is to ignore implementation details and interpret “algorithm” to mean an abstract decision rule, i.e. a (possibly randomized) function mapping a history of past observations (xi,ri)X×(x_{i},r_{i})\in X\times to a strategy xXx\in X which is played in the current period. All of our theorems are valid under this interpretation, but they can also be made into precise algorithmic results provided that the algorithm is given appropriate oracle access to the metric space. In most cases, our algorithms require only a covering oracle which takes a finite collection of open balls and either declares that they cover XX or outputs an uncovered point. We refer to this setting as the standard Lipschitz MAB problem. For example, the zooming algorithm uses only a covering oracle for (L,X)(L,X), and requires only one oracle query per round (with at most tt balls in round tt). However, the per-metric optimal algorithm in Theorem 1.2 uses more complicated oracles, and we defer the definition of these oracles to Section 3.

While our definitions and results so far have been tailored for the Lipschitz MAB problem on infinite metrics, some of them can be extended to the finite case as well. In particular, for the zooming algorithm we obtain sharp results (that are meaningful for both finite and infinite metrics) using a more precise, non-asymptotic version of the zooming dimension. Extending the notions in Theorem 1.2 to the finite case is an open question.

Extensions.

We provide a number of extensions in which we elaborate on our analysis of the zooming algorithm. First, we provide sharper bounds for several examples in which the reward from playing each strategy uu is μ(u)\mu(u) plus an independent noise of a known and “benign” shape. Second, we upgrade the zooming algorithm so that it satisfies the guarantee in Theorem 1.3 and enjoys a better guarantee if the maximal reward is exactly 1. Third, we apply this result to a version where μ()=1L(,S)\mu(\cdot)=1-L(\,\cdot\,,S) for some target set SS which is not revealed to the algorithm. Fourth, we relax some assumptions in the analysis of the zooming algorithm, and use this generalization to analyze the version in which μ()=1f(L(,S))\mu(\cdot)=1-f(L(\,\cdot\,,S)) for some known function ff. Finally, we extend our analysis from reward distributions supported on $$ to those with unbounded support and finite absolute third moment.

Follow-up work.

For metric spaces whose max-min-covering dimension is exactly 0, this paper provides an upper bound R(T)=OI(Tγ)R(T)=O_{\mathcal{I}}(T^{\gamma}) for any γ>12\gamma>\tfrac{1}{2}, but no matching lower bound. Characterizing the optimal regret for such metric spaces remained an open question. Following the publication of the conference version, this question has been settled in , revealing the following dichotomy: for every metric space, the optimal regret of a Lipschitz MAB algorithm is either bounded above by any fω(logt)f\in\omega(\log t), or bounded below by any go(T)g\in o(\sqrt{T}), depending on whether the completion of the metric space is compact and countable.

1 Preliminaries

Given a metric space, B(x,r)B(x,r) denotes an open ball of radius rr around point xx. Throughout the paper, he constants in the O()O(\cdot) notation are absolute unless specified otherwise.

In the Lipschitz MAB problem on (L,X)(L,X), there is a strategy set XX, a metric space (L,X)(L,X) of diameter 1\leq 1, and a payoff function μ:X\mu\,:\,X\rightarrow such that the following Lipschitz condition holds:

Call LL is the similarity function. The metric space (L,X)(L,X) is revealed to an algorithm, whereas the payoff function μ\mu is not. In each round the algorithm chooses a strategy xXx\in X and observes an independent random sample from a payoff distribution D(x)\mathcal{D}(x) with support S\mathcal{S}\subset and expectation μ(x)\mu(x).

The regret of a bandit algorithm A\mathcal{A} running on a given problem instance is RA(t)=WA(t)tμR_{\mathcal{A}}(t)=W_{\mathcal{A}}(t)-t\mu^{*}, where WA(t)W_{\mathcal{A}}(t) is the expected payoff of A\mathcal{A} at time tt and μ=supxXμ(x)\mu^{*}=\sup_{x\in X}\mu(x) is the maximal expected reward.

The CC-zooming dimension of the problem instance (L,X,μ)(L,X,\mu) is the smallest dd such that for every r(0,1]r\in(0,1] the set Xr={xX:r2<μμ(x)r}X_{r}=\{x\in X:\,\tfrac{r}{2}<\mu^{*}-\mu(x)\leq r\} can be covered by CrdC\,r^{-d} sets of diameter at most r/8r/8.

Fix a metric space on set XX. Let N(r)N(r) be the smallest number of sets of diameter rr required to cover XX. The covering dimension of XX is

The cc-covering dimension of XX is defined as the infimum of all dd such that N(r)crdN(r)\leq cr^{-d} for all r>0r>0.

In Section 2 we prove Theorem 1.3. In Section 3 we discuss the per-metric optimality and prove Theorem 1.2. Section 4 covers the extensions.

Adaptive exploration: the zooming algorithm

In this section we introduce the zooming algorithm which uses adaptive exploration to take advantage of the ”benign” input instances, and prove the main guarantee (Theorem 1.3).

Consider the standard Lipschitz MAB problem on (L,X)(L,X). The zooming algorithm proceeds in phases i=1,2,3,i=1,2,3,\ldots of 2i2^{i} rounds each. Let us consider a single phase iphi_{\mathtt{ph}} of the algorithm. For each strategy vXv\in X and time tt, let nt(v)n_{t}(v) be the number of times this strategy has been played in this phase before time tt, and let μt(v)\mu_{t}(v) be the corresponding average reward. Define μt(v)=0\mu_{t}(v)=0 if nt(v)=0n_{t}(v)=0. Note that at time tt both quantities are known to the algorithm. Define the confidence radius of vv at time tt as

Let μ(v)\mu(v) be the expected reward of strategy vv. Note that E[μt(v)]=μ(v)E[\mu_{t}(v)]=\mu(v). Using Chernoff Bounds, we can bound μt(v)μ(v)|\mu_{t}(v)-\mu(v)| in terms of the confidence radius:

A phase is called clean if for each strategy vXv\in X that has been played at least once during this phase and each time tt we have μt(v)μ(v)rt(v)|\mu_{t}(v)-\mu(v)|\leq r_{t}(v).

Phase iphi_{\mathtt{ph}} is clean with probability at least 14iph1-4^{-i_{\mathtt{ph}}}.

Throughout the execution of the algorithm, a finite number of strategies are designated active. Our algorithm only plays active strategies, among which it chooses a strategy vv with the maximal index

Say that strategy vv covers strategy uu at time tt if uB(v,rt(v))u\in B(v,\,r_{t}(v)). Say that a strategy uu is covered at time tt if at this time it is covered by some active strategy vv. Note that the covering oracle (as defined in Section 1) can return a strategy which is not covered if such strategy exists, or else inform the algorithm that all strategies are covered. Now we are ready to state the algorithm:

Each phase ii runs for 2i2^{i} rounds. In the beginning of the phase no strategies are active. In each round do the following:

If some strategy is not covered, make it active.

Play an active strategy with the maximal index (3); break ties arbitrarily.

We formulate the main result of this section as follows:

Consider the standard Lipschitz MAB problem. Let A\mathcal{A} be Algorithm 2.3. Then C>0\forall\,C>0

where dd is the CC-zooming dimension of the problem instance.

The zooming algorithm is not parameterized by the CC in (4), yet satisfies (4) for all C>0C>0. For sharper guarantees, CC can be tuned to the specific problem instance and specific time tt.

Let us prove Theorem 2.4. Note that after step 1 in Algorithm 2.3 all strategies are covered. (Indeed, if some strategy is activated in step 1 then it covers the entire metric.) Let μ=supuXμ(u)\mu^{*}=\sup_{u\in X}\mu(u) be the maximal expected reward; note that we do not assume that the supremum is achieved by some strategy. Let Δ(v)=μμ(v)\Delta(v)=\mu^{*}-\mu(v). Let us focus on a given phase iphi_{\mathtt{ph}} of the algorithm.

If phase iphi_{\mathtt{ph}} is clean then we have Δ(v)4rt(v)\Delta(v)\leq 4\,r_{t}(v) for any time tt and any strategy vv. It follows that nt(v)O(iph)Δ2(v)n_{t}(v)\leq O(i_{\mathtt{ph}})\,\Delta^{-2}(v).

Suppose strategy vv is played at time tt. First we claim that It(v)μI_{t}(v)\geq\mu^{*}. Indeed, fix ϵ>0\epsilon>0. By definition of μ\mu^{*} there exists a strategy vv^{*} such that Δ(v)<ϵ\Delta(v^{*})<\epsilon. Let vtv_{t} be an active strategy that covers vv^{*}. By the algorithm specification It(v)It(vt)I_{t}(v)\geq I_{t}(v_{t}). Since vv is clean at time tt, by definition of index we have It(vt)μ(vt)+rt(vt)I_{t}(v_{t})\geq\mu(v_{t})+r_{t}(v_{t}). By the Lipschitz property we have μ(vt)μ(v)L(vt,v)\mu(v_{t})\geq\mu(v^{*})-L(v_{t},v^{*}). Since vtv_{t} covers vv^{*}, we have L(vt,v)rt(vt)L(v_{t},v^{*})\leq r_{t}(v_{t}) Putting all these inequalities together, we have It(v)μ(v)μϵI_{t}(v)\geq\mu(v^{*})\geq\mu^{*}-\epsilon. Since this inequality holds for an arbitrary ϵ>0\epsilon>0, we in fact have It(v)μI_{t}(v)\geq\mu^{*}. Claim proved.

Furthermore, note that by the definitions of “clean phase” and “index” we have μIt(v)μ(v)+3rt(v)\mu^{*}\leq I_{t}(v)\leq\mu(v)+3\,r_{t}(v) and therefore Δ(v)3rt(v)\Delta(v)\leq 3\,r_{t}(v).

Now suppose strategy vv is not played at time tt. If it has never been played before time tt in this phase, then rt(v)>1r_{t}(v)>1 and thus the lemma is trivial. Else, let ss be the last time strategy vv has been played before time tt. Then by definition of the confidence radius rt(v)=rs+1(v)2/3rs(v)14Δ(v)r_{t}(v)=r_{s+1}(v)\geq\sqrt{2/3}\,r_{s}(v)\geq\tfrac{1}{4}\,\Delta(v). ∎

In a clean phase, for any active strategies u,vu,v we have L(u,v)>14min(Δ(u),Δ(v))L(u,v)>\tfrac{1}{4}\min(\Delta(u),\Delta(v)).

Assume uu has been activated before vv. Let ss be the time when vv has been activated. Then by the algorithm specification we have L(u,v)>rs(u)L(u,v)>r_{s}(u). By Lemma 2.5 rs(u)14Δ(u)r_{s}(u)\geq\tfrac{1}{4}\Delta(u). ∎

Let dd be the the CC-zooming dimension. For a given time tt in the current phase, let S(t)S(t) be the set of all strategies that are active at time tt, and let

We claim that A(i,t)C2id|A(i,t)|\leq C\,2^{id}. Indeed, set A(i,t)A(i,t) can be covered by C2idC\,2^{id} sets of diameter at most 2i/82^{-i}/8; by Corollary 2.6 each of these sets contains at most one strategy from A(i,t)A(i,t).

In a clean phase iphi_{\mathtt{ph}}, for each time tt we have

where γ=d+1d+2\gamma=\tfrac{d+1}{d+2} and dd is the CC-zooming dimension.

Fix the time horizon tt. For a subset SXS\subset X of strategies, let RS=vSΔ(v)nt(v)R_{S}=\sum_{v\in S}\Delta(v)\,n_{t}(v). Let us choose ρ(0,1)\rho\in(0,1) such that

Define BB as the set of all strategies vS(t)v\in S(t) such that Δ(v)ρ\Delta(v)\leq\rho. Recall that by Lemma 2.5 for each vA(i,t)v\in A(i,t) we have nt(v)O(iph)Δ2(v)n_{t}(v)\leq O(i_{\mathtt{ph}})\,\Delta^{-2}(v). Then

The left-hand side of (5) is essentially the contribution of the current phase to the overall regret. It remains to sum these contributions over all past phases.

Let iphi_{\mathtt{ph}} be the current phase, let tt be the time spend in this phase, and let TT be the total time since the beginning of phase 11. Let Rph(iph,t)R_{\mathtt{ph}}(i_{\mathtt{ph}},t) be the left-hand side of (5). Combining Claim 2.2 and Claim 2.7, we have

Attaining the optimal per-metric performance

In this section we ask, “What is the best possible algorithm for the Lipschitz MAB problem on a given metric space?” We consider the per-metric performance, which we define as the worst-case performance of a given algorithm over all possible problem instances on a given metric. As everywhere else in this paper, we focus on minimizing the exponent γ\gamma such that RA(t)tγR_{\mathcal{A}}(t)\leq t^{\gamma} for all sufficiently large tt. Motivated by the shape of the guarantees in Theorem 1.1, let us define the regret dimension of an algorithm as follows.

Consider the Lipschitz MAB problem on a given metric space. For algorithm A\mathcal{A} and problem instance I\mathcal{I} let

The regret dimension of A\mathcal{A} is DIM(A)=supIDIMI(A)\mathtt{DIM}(\mathcal{A})=\sup_{\mathcal{I}}\,\mathtt{DIM}_{\mathcal{I}}(\mathcal{A}), where the supremum is taken over all problem instances I\mathcal{I} on the given metric space.

Then Theorem 1.1 states that for the Lipschitz MAB problem on (Ld,)(L_{d},), the regret dimension of the “naïve algorithm” is at most dd. In fact, it is easy to extend the “naïve algorithm” to arbitrary metric spaces. Such algorithm is parameterized by the covering dimension dd of the metric space. It divides time into phases of exponentially increasing length, chooses a δ\delta-net during each phase,It is easy to see that the cardinality of this δ\delta-net is K=O(δd)K=O(\delta^{-d}). and runs a KK-armed bandit algorithm such as ucb1 on the elements of the δ\delta-net. The parameter δ\delta is tuned optimally given dd and the phase length TT; the optimal value turns out to be δ=T1/(d+2)\delta=T^{-1/(d+2)}. Using the technique from it is easy to prove that the regret dimension of this algorithm is at most dd.

Consider the Lipschitz MAB problem on a metric space (L,X)(L,X) of covering dimension dd. Let A\mathcal{A} be the naïve algorithm that uses ucb1 in each phase. Then DIM(A)d\mathtt{DIM}(\mathcal{A})\leq d.

Let us focus on some phase ii. In this phase the algorithm chooses a δ\delta-net, call it SS. We claim that Scδd|S|\leq c\,\delta^{-d}. Indeed, for any δ<δ\delta^{\prime}<\delta the metric space can be covered by cδdc\,\delta^{-d} sets of diameter at most δ\delta^{\prime}, each of which can contain only one point from SS. Claim proved. The algorithm proceeds to run ucb1 on the elements of SS. By the expected regret of ucb1 on KK arms in tt rounds is at most O(Ktlogt)O(\sqrt{K\,t\log t}). Since the maximal μ\mu on SS is at most δ\delta off of the maximal μ\mu on XX, we have

Thus we ask: is it possible to achieve a better regret dimension, perhaps using a more sophisticated algorithm? We show that this is indeed the case. Moreover, we provide an algorithm such that for any given metric space its regret dimension is arbitrarily close to optimal.

The rest of this section is organized as follows. In Section 3.1 we develop a lower bound on regret dimension. In Section 3.2 we will show that for some metric spaces, there exist algorithms whose regret dimension is smaller than the covering dimension. We develop these ideas further in Section 3.3 and provide an algorithm whose regret dimension is arbitrarily close to optimal.

Let us develop a lower bound on regret dimension of any algorithm on a given metric space. This bound is equal to the covering dimension for highly homogeneous metric spaces (such as those in which all balls of a given radius are isometric to each other), but in general it can be much smaller.

It is known that a worst-case instance of the KK-armed bandit problem consists of K1K-1 strategies with identical payoff distributions, and one which is slightly better. We refer to this as a “needle-in-haystack” instance. The known constructions of lower bounds for Lipschitz MAB problems rely on creating a multi-scale needle-in-haystack instance in which there are KK disjoint open sets, and K1K-1 of them consist of strategies with identical payoff distributions, but in the remaining open set there are strategies whose payoff is slightly better. Moreover, this special open set contains KKK^{\prime}\gg K disjoint subsets, only one of which contains strategies superior to the others, and so on down through infinitely many levels of recursion. To ensure that this construction can be continued indefinitely, one needs to assume a covering property which ensures that each of the open sets arising in the construction has sufficiently many disjoint subsets to continue to the next level of recursion.

For a metric space (L,X)(L,X), we say that dd is the min-covering dimension of XX, d=MinCOV(X)d=\mathtt{MinCOV}(X), if dd is the infimum of COV(U)\mathtt{COV}(U) over all non-empty open subsets UXU\subseteq X. The max-min-covering dimension of XX is defined by

The infimum over open UXU\subseteq X in the definition of min-covering dimension ensures that every open set which may arise in the needle-in-haystack construction described above will contain Ω(δεd)\Omega(\delta^{\varepsilon-d}) disjoint δ\delta-balls for some sufficiently small δ,ε\delta,\varepsilon. Constructing lower bounds for Lipschitz MAB algorithms in a metric space XX only requires that XX should have subsets with large min-covering dimension, which explains the supremum over subsets in the definition of max-min-covering dimension.

We will use the following simple packing lemma.This is a folklore result; we provide the proof for convenience.

If YY is a metric space of covering dimension dd, then for any b<db<d and r0>0r_{0}>0, there exists r(0,r0)r\in(0,r_{0}) such that YY contains a collection of at least rbr^{-b} disjoint open balls of radius rr.

Let r<r0r<r_{0} be a positive number such that every covering of YY requires more than rbr^{-b} balls of radius 2r2r. Such an rr exists, because the covering dimension of YY is strictly greater than bb. Now let P={B1,B2,,BM}\mathcal{P}=\{B_{1},B_{2},\ldots,B_{M}\} be any maximal collection of disjoint rr-balls. For every yYy\in Y there must exist some ball Bi  (1iM)B_{i}\;(1\leq i\leq M) whose center is within distance 2r2r of yy, as otherwise B(y,r)B(y,r) would be disjoint from every element of P\mathcal{P} contradicting the maximality of that collection. If we enlarge each ball BiB_{i} to a ball Bi+B_{i}^{+} of radius 2r2r, then every yYy\in Y is contained in one of the balls {Bi+1iM}\{B_{i}^{+}\,|\,1\leq i\leq M\}, i.e. they form a covering of YY. Hence MrbM\geq r^{-b} as desired. ∎

If XX is a metric space and dd is the max-min-covering dimension of XX then DIM(A)d\mathtt{DIM}(\mathcal{A})\geq d for every bandit algorithm A\mathcal{A}.

Without loss of generality let us assume that d>0d>0. Given γ<d+1d+2,\gamma<\tfrac{d+1}{d+2}, let a<b<c<da<b<c<d be such that γ<a+1a+2\gamma<\tfrac{a+1}{a+2}. Let YY be a subset of XX such that MinCOV(Y)c\mathtt{MinCOV}(Y)\geq c. Using Lemma 3.4 we recursively construct an infinite sequence of sets P0,P1,\mathcal{P}_{0},\mathcal{P}_{1},\ldots each consisting of finitely many disjoint open balls in XX, centered at points of YY. Let P0={X}\mathcal{P}_{0}=\{X\} consist of a single ball that contains all of XX. If i>0i>0, for every ball BPi1B\in\mathcal{P}_{i-1}, let rr denote the radius of BB and choose a number ri(B)(0,r/4)r_{i}(B)\in(0,r/4) such that BB contains ni(B)=ri(B)bn_{i}(B)=\lceil r_{i}(B)^{-b}\rceil disjoint balls of radius ri(B)r_{i}(B) centered at points of YY. Such a collection of disjoint balls exists, by Lemma 3.4. Let Pi(B)\mathcal{P}_{i}(B) denote this collection of disjoint balls and let Pi=BPi1Pi(B).\mathcal{P}_{i}=\bigcup_{B\in\mathcal{P}_{i-1}}\mathcal{P}_{i}(B). Now sample a random sequence of balls B1,B2,B_{1},B_{2},\ldots by picking B1P1B_{1}\in\mathcal{P}_{1} uniformly at random, and for i>1i>1 picking BiPi(Bi1)B_{i}\in\mathcal{P}_{i}(B_{i-1}) uniformly at random.

Given a ball B=B(x,r)B=B(x_{*},r_{*}), let fB(x)f_{B}(x) be a Lipschitz function on XX defined by

Let fi=fBif_{i}=f_{B_{i}} for i1i\geq 1. Define f0f_{0} by setting f0(x)=1/3f_{0}(x)=1/3 for all xXx\in X. The reader may verify that the sum μ=i=0fi\mu=\sum_{i=0}^{\infty}f_{i} is a Lipschitz function. Define the payoff distribution for xXx\in X to be a Bernoulli random variable with expectation μ(x)\mu(x). We have thus specified a randomized construction of an instance (L,X,μ)(L,X,\mu).

We claim that for any algorithm A\mathcal{A} and any constant CC,

The proof of this claim is based on a “needle in haystack” lemma (Lemma 3.6 below) which states that for all ii, conditional on the sequence B1,,Bi1B_{1},\ldots,B_{i-1}, with probability at least 1O((ri(Bi))(ba)/2)1-O((r_{i}(B_{i}))^{(b-a)/2}), no more than half of the first ti(Bi)=ri(Bi)a2t_{i}(B_{i})=r_{i}(B_{i})^{-a-2} strategies picked by A\mathcal{A} lie inside BiB_{i}. The proof of the lemma is deferred to the end of this section.

Any strategy x∉Bix\not\in B_{i} satisfies μ(x)<μ(x)ri/2\mu(x)<\mu(x^{*})-r_{i}/2, so we may conclude that

Denoting ri(Bi)r_{i}(B_{i}) and ti(Bi)t_{i}(B_{i}) by rir_{i} and tit_{i}, respectively, we have 14ria1=14ti(a+1)/(a+2)>Ctiγ\tfrac{1}{4}\,r_{i}^{-a-1}=\tfrac{1}{4}\,t_{i}^{(a+1)/(a+2)}>Ct_{i}^{\gamma} for all sufficiently large ii. As ii runs through the positive integers, the terms on the right side of (8) are dominated by a geometric progression because ri(Bi)4i.r_{i}(B_{i})\leq 4^{-i}. By the Borel-Cantelli Lemma, almost surely there are only finitely many ii such that the events on the left side of (8) occur. Thus (7) follows. ∎

To prove Theorem 3.5 it suffices to show that for every given algorithm there exists a “hard” problem instance. In fact we proved a stronger result (7): essentially, we construct a probability distribution over problem instances which is hard, almost surely, for every given algorithm. This seems to be the best possible bound since, obviously, a single problem instance cannot be hard for every algorithm.

In rest of this subsection we prove the “needle in haystack” lemma used in the proof of Theorem 3.5.

Consider the randomized construction of an instance (L,X,μ)(L,X,\mu) in the proof of Theorem 3.5. Fix a bandit algorithm A\mathcal{A}. Then for all ii, conditional on the sequence B1,,Bi1B_{1},\ldots,B_{i-1}, with probability at least 1O((ri(Bi))(ba)/2)1-O((r_{i}(B_{i}))^{(b-a)/2}), no more than half of the first ri(Bi)a2r_{i}(B_{i})^{-a-2} strategies picked by A\mathcal{A} lie inside BiB_{i}.

Let us introduce some notation needed to prove the lemma. Let us fix an arbitrary Lipschitz MAB algorithm A\mathcal{A}. We will assume that A\mathcal{A} is deterministic; the corresponding result for randomized algorithms follows by conditioning on the algorithm’s random bits (so that its behavior, conditional on these bits, is deterministic), invoking the lemma for deterministic algorithms, and then removing the conditioning by averaging over the distribution of random bits. Note that since our construction uses only {0,1}\{0,1\}-valued payoffs, and the algorithm A\mathcal{A} is deterministic, the entire history of play in the first tt rounds can be summarized by a binary vector σ{0,1}t\sigma\in\{0,1\}^{t}, consisting of the payoffs observed by A\mathcal{A} in the first tt rounds. Thus a payoff function μ\mu determines a probability distribution PμP_{\mu} on the set {0,1}t\{0,1\}^{t}, i.e. the distribution on tt-step histories realized when using algorithm A\mathcal{A} on instance μ\mu.

Let BB be any ball in the set Pi1\mathcal{P}_{i-1}, let n=ni(B)n=n_{i}(B), r=ri(B)r=r_{i}(B), and t=ti(B)=ri(B)a2t=t_{i}(B)=r_{i}(B)^{-a-2}. Let B1,B2,,BnB^{1},B^{2},\ldots,B^{n} be an enumeration of the balls in Pi(B)\mathcal{P}_{i}(B). Choose an arbitrary sequence of balls B1B2Bi1=BB_{1}\supseteq B_{2}\supseteq\ldots\supseteq B_{i-1}=B such that B1P1B_{1}\in\mathcal{P}_{1} and for all j>0j>0 BjP(Bj1).B_{j}\in\mathcal{P}(B_{j-1}). Similarly, for k=1,2,,nk=1,2,\ldots,n, choose an arbitrary sequence of balls Bk=BikBi+1kB^{k}=B^{k}_{i}\supseteq B^{k}_{i+1}\supseteq\ldots such that BjkP(Bj1k)B^{k}_{j}\in\mathcal{P}(B^{k}_{j-1}) for all ji.j\geq i. Define functions fj  (1ji1)f_{j}\;(1\leq j\leq i-1) and fjk  (ji)f^{k}_{j}\;(j\geq i) using the balls Bj,BjkB_{j},B^{k}_{j}, as in the proof of Theorem 3.5. Specifically, use definition (6) and set fj=fBjf_{j}=f_{B_{j}} and fjk=fBjkf^{k}_{j}=f_{B^{k}_{j}}. Let μ0=j=0i1fj\mu^{0}=\sum_{j=0}^{i-1}f_{j} and

Note that the instances μk  (1kn)\mu^{k}\;(1\leq k\leq n) are equiprobable under our distribution on input instances μ\mu. The instance μ0\mu^{0} is not one that could be randomly sampled by our construction, but it is useful as a “reference measure” in the following proof. Note that the functions μk\mu^{k} have the following properties, by construction.

1/3μk(x)2/31/3\leq\mu^{k}(x)\leq 2/3 for all xXx\in X.

0μk(x)μ0(x)r0\leq\mu^{k}(x)-\mu^{0}(x)\leq r for all xXx\in X.

If xXBk,x\in X\setminus B^{k}, then μk(x)=μ0(x)\mu^{k}(x)=\mu^{0}(x).

If xXBk,x\in X\setminus B^{k}, then there exists some point xkBkx^{k}\in B^{k} such that μk(xk)μk(x)r/2.\mu^{k}(x^{k})-\mu^{k}(x)\geq r/2.

Each of the payoff functions μk  (0kn)\mu^{k}\;(0\leq k\leq n) gives rise to a probability distribution PμkP_{\mu^{k}} on {0,1}t\{0,1\}^{t} as described in the preceding section. We will use the shorthand notation PkP_{k} instead of PμkP_{\mu^{k}}. We will also use Ek\mathbf{E}_{k} to denote the expectation of a random variable under distribution PkP_{k}. Finally, we let NkN_{k} denote the random variable defined on {0,1}t\{0,1\}^{t} that counts the number of rounds s  (1st)s\;(1\leq s\leq t) in which algorithm A\mathcal{A} chooses a strategy in BkB^{k} given the history σ\sigma.

The following lemma is analogous to Lemma A.1 of , and its proof is identical to the proof of that lemma.

Let f:{0,1}t[0,M]f\,:\,\{0,1\}^{t}\rightarrow[0,M] be any function defined on reward sequences σ\sigma. Then for any kk,

Applying Lemma 3.7 with f=Nkf=N_{k} and M=tM=t, and averaging over kk, we may apply exactly the same reasoning as in the proof of Theorem A.2 of to derive the bound

Recalling that the actual ball BkB_{k} sampled when randomly constructing μ\mu in the proof of Theorem 3.5 is a uniform random sample from B1,B2,,BnB^{1},B^{2},\ldots,B^{n}, we may write NN_{*} to denote the random variable which counts the number of rounds in which the algorithm plays a strategy in BkB_{k} and the bound (9) implies

Recalling that t=ra2t=r^{-a-2} and n=rbn=r^{-b}, we see that the O(trt/n)O(tr\sqrt{t/n}) term is the dominant term on the right side, and that it is bounded by O(tr(ba)/2).O(tr^{(b-a)/2}). An application of Markov’s inequality now yields:

2 Beyond the covering dimension

Thus far, we have seen that every metric space XX has a bandit algorithm A\mathcal{A} such that DIM(A)=COV(X)\mathtt{DIM}(\mathcal{A})=\mathtt{COV}(X) (the naïve algorithm), and we have seen (via the needle-in-haystack construction, Theorem 3.5) that XX can never have a bandit algorithm satisfying DIM(A)<MaxMinCOV(X)\mathtt{DIM}(\mathcal{A})<\mathtt{MaxMinCOV}(X). When COV(X)MaxMinCOV(X)\mathtt{COV}(X)\neq\mathtt{MaxMinCOV}(X), which of these two bounds is correct, or can they both be wrong?

In both of the metrics described above, the zooming algorithm (Algorithm 2.3) performs poorly when the optimum xx^{*} is located inside the fat subset SS, because it is too burdensome to keep coveringRecall that a strategy uu is called covered at time tt if for some active strategy vv we have L(u,v)rt(v)L(u,v)\leq r_{t}(v). the profusion of strategies located near xx^{*} as the ball containing xx^{*} shrinks. An improved algorithm, achieving regret exponent dd, modifies the zooming algorithm by imposing quotas on the number of active strategies that lie outside SS. At any given time, some strategies outside SS may not be covered; however, it is guaranteed that there exists an optimal strategy which eventually becomes covered and remains covered forever afterward. Intuitively, if some optimal strategy lies in SS then imposing a quota on active strategies outside SS does not hurt. If no optimal strategy lies in SS then all of SS gets covered eventually and stays covered thereafter, in which case the uncovered part of the strategy set has low covering dimension and (starting after the time when SS becomes permanently covered) no quota is ever exceeded.

This use of quotas extends to the following general setting which abstracts the idea of “fat subsets”:

Fix a metric space (L,X)(L,X). A closed subset SXS\subset X is dd-fat if COV(S)d\mathtt{COV}(S)\leq d and for any open superset UU of SS we have COV(XU)d\mathtt{COV}(X\setminus U)\leq d. More generally, a dd-fat decomposition of depth kk is a decreasing sequence X=S0SkSk+1=X=S_{0}\supset\ldots\supset S_{k}\supset S_{k+1}=\emptyset of closed subsets such that COV(Sk)d\mathtt{COV}(S_{k})\leq d and COV(SiU)d\mathtt{COV}(S_{i}\setminus U)\leq d whenever i[k]i\in[k] and UU is an open superset of Si+1S_{i+1}.

Let (L,X)(L,X) be the metric space in either of the two “tree with a fat subset” examples. Then the corresponding “fat subset” SS is dd-fat. For an example of a fat decomposition of depth k=2k=2, consider the product metric (L,X×X)(L^{*},X\times X) defined by

with a fat decomposition given by S1=(S×X)(X×S)S_{1}=(S\times X)\cup(X\times S) and S2=S×SS_{2}=S\times S.

When XX is a metric space with a dd^{*}-fat decomposition D\mathcal{D}, the algorithm described earlier can be modified to achieve regret O(tγ)O\left(t^{\gamma}\right) for any γ>11/(d+2)\gamma>1-1/(d^{*}+2), by instituting a separate quota for each subset Si.S_{i}. The algorithm requires access to a D\mathcal{D}-covering oracle which for a given ii and a given finite set of open balls (given by the centers and the radii) either reports that the balls cover SiS_{i}, or returns some strategy in SiS_{i} which is not covered by the balls. No further knowledge of D\mathcal{D} or the metric space is required.

Consider the Lipschitz MAB problem on a fixed compact metric space with a dd^{*}-fat decomposition D\mathcal{D}. Then for any d>dd>d^{*} there is an algorithm AD\mathcal{A}_{\mathcal{D}} such that DIM(AD)d\mathtt{DIM}(\mathcal{A}_{\mathcal{D}})\leq d.

(1) We can relax the compactness assumption in Theorem 3.10: instead, we can assume that the completion of the metric space is compact and re-define the sets in the dd-fat decomposition as subsets of the completion (possibly disjoint with the strategy set). This corresponds to the “fat leaf” which lies outside the strategy set. Such extension requires some minor modifications.

(2) The per-metric guarantee expressed by Theorem 3.10 can be complemented with sharper per-instance guarantees. First, for every problem instance I\mathcal{I} the per-instance regret dimension DIMI(A)\mathtt{DIM}_{\mathcal{I}}(\mathcal{A}) is upper-bounded by the zooming dimension of I\mathcal{I}. Second, if for some c>0c>0 the cc-covering dimension of XX is finite then for some γ<1\gamma<1 and all tt we have RA(t)O(ctγ)R_{\mathcal{A}}(t)\leq O(c\,t^{\gamma}). However, as this extension is tangential to our main storyline, we focus on analyzing the regret dimension.

Our algorithm proceeds in phases i=1,2,3,  i=1,2,3,\,\ldots\; of 2i2^{i} rounds each. In a given phase, we run a fresh instance of the following phase algorithm Aph(T,d,D)\mathcal{A}_{\mathtt{ph}}(T,d,\mathcal{D}) parameterized by the phase length T=2iT=2^{i}, target dimension d>dd>d^{*} and the D\mathcal{D}-covering oracle. The phase algorithm is a version of a single phase of the zooming algorithm (Algorithm 2.3) with very different rules for activating strategies. As in Algorithm 2.3, the confidence radius and the index are defined by (2) and (3), respectively. At the start of each round some strategies are activated, and then an active strategy with the maximal index is played.

where ρ=T1/(d+2)\rho=T^{-1/(d+2)} and Cρ=(64klog1ρ)1C_{\rho}=(64k\,\log\tfrac{1}{\rho})^{-1}. In the beginning of each round the following activation routine is performed. If there exists a set SiS_{i} such that some strategy in SiS_{i} is not covered and there is room under the corresponding quota QiQ_{i}, pick one such strategy, activate it, and add it to the corresponding pool PiP_{i}. Since for a given strategy uu the confidence radius rt(u)r_{t}(u) is non-increasing in tt, the constraint (10) is never violated. Repeat until there are no such sets SiS_{i} left. This completes the description of the algorithm.

Analysis.

As was the case in Section 2, the analysis of the unbounded-time-horizon algorithm reduces to proving a lemma about the regret of each phase algorithm.

Fix a problem instance in the setting of Theorem 3.10. Let Aph(T)=Aph(T,d,D)\mathcal{A}_{\mathtt{ph}}(T)=\mathcal{A}_{\mathtt{ph}}(T,d,\mathcal{D}). Then

Note that the lemma bounds the regret of Aph(T)\mathcal{A}_{\mathtt{ph}}(T) for time TtminT\geq t_{\text{min}} only. Proving Theorem 3.10 is now straightforward:

Let Aph(T)\mathcal{A}_{\mathtt{ph}}(T) be the phase algorithm from Lemma 3.11. Recall that in each phase ii in the overall algorithm A\mathcal{A} we simply run a fresh instance of algorithm Aph(2i)\mathcal{A}_{\mathtt{ph}}(2^{i}) for 2i2^{i} steps.

Let t0t_{0} be the tmint_{\text{min}} from (11) rounded up to the nearest end-of-phase time. Let i0i_{0} be the phase starting at time t0+1t_{0}+1. Note that RA(t0)t0R_{\mathcal{A}}(t_{0})\leq t_{0}. Let RiR_{i} be the regret accumulated by A\mathcal{A} during phase ii. Let γ=d+1d+2\gamma=\tfrac{d+1}{d+2}. Then for any time tt01/γt\geq t_{0}^{1/\gamma} in phase ii we have RA(t)t0+j=i0iRjt0+j=i0i(2j)γO(tγ).R_{\mathcal{A}}(t)\leq t_{0}+\sum_{j=i_{0}}^{i}R_{j}\leq t_{0}+\sum_{j=i_{0}}^{i}(2^{j})^{\gamma}\leq O(t^{\gamma}).

In the remainder of this section we prove Lemma 3.11. Let us fix a problem instance of the Lipschitz MAB problem on a compact metric space (L,X)(L,X) with a depth-kk dd^{*}-fat decomposition D={Si}i=0k+1\mathcal{D}=\{S_{i}\}_{i=0}^{k+1}. Fix d>dd>d^{*} and let Aph(T)=Aph(T,d,D)\mathcal{A}_{\mathtt{ph}}(T)=\mathcal{A}_{\mathtt{ph}}(T,d,\mathcal{D}) be the phase algorithm. Let μ\mu be the expected reward function and let μ=supuXμ(u)\mu^{*}=\sup_{u\in X}\mu(u) be the optimal reward. Let Δ(u)=μμ(u)\Delta(u)=\mu^{*}-\mu(u).

By definition of the Lipschitz MAB problem, μ\mu is a continuous function on the metric space (L,X)(L,X). Therefore the supremum μ\mu^{*} is achieved by some strategy (call such strategies optimal). Say that a run of algorithm Aph(T)\mathcal{A}_{\mathtt{ph}}(T) is well-covered if at every time tTt\leq T some optimal strategy is covered.

Say that a run of algorithm Aph(T)\mathcal{A}_{\mathtt{ph}}(T) is clean if the property in Claim 2.2 holds for all times tTt\leq T. Note that a given run is clean with probability at least 1T21-T^{-2}. The following lemma adapts the technique from Lemma 2.5 to the present setting:

Consider a clean run of algorithm Aph(T)\mathcal{A}_{\mathtt{ph}}(T).

If strategies u,vu,v are active at time tTt\leq T then Δ(v)Δ(u)4rt(v)\Delta(v)-\Delta(u)\leq 4r_{t}(v).

if the run is well-covered and strategy vv is active at time tTt\leq T then Δ(v)4rt(v)\Delta(v)\leq 4r_{t}(v).

The quotas (10) are chosen so that the regret computation in Claim 2.7 works out for a clean and well-covered run of algorithm Aph(T)\mathcal{A}_{\mathtt{ph}}(T).

RA(T)T11/(d+2)R_{\mathcal{A}}(T)\leq T^{1-1/(d+2)} for any clean well-covered run of algorithm A=Aph(T)\mathcal{A}=\mathcal{A}_{\mathtt{ph}}(T).

Let At(δ)A_{t}(\delta) be the set of all strategies uXu\in X such that uu is active at time tTt\leq T and δrt(u)<2δ\delta\leq r_{t}(u)<2\delta. Note that for any such strategy we have nt(u)O(logT)δ2n_{t}(u)\leq O(\log T)\,\delta^{-2} and Δ(u)4rt(u)<8δ\Delta(u)\leq 4r_{t}(u)<8\delta. Write

where ρ=T1/(d+2)\rho=T^{-1/(d+2)} and apply the quotas (10). ∎

Recall that in the beginning of algorithm A(T)\mathcal{A}(T) all strategies in some 2j2^{-j}-net N\mathcal{N} are activated. Suppose TT is large enough so that 2jϵ2^{-j}\leq\epsilon.

Δ(v)=μ(u)μ(v)L(u,v)ϵ\Delta(v^{*})=\mu(u^{*})-\mu(v^{*})\leq L(u^{*},v^{*})\leq\epsilon.

Since L(v,w)ϵL(v,w)\leq\epsilon and Δ(w)>8ϵ\Delta(w)>8\epsilon, we have Δ(v)>7ϵ\Delta(v)>7\epsilon.

By Claim 3.12 we have Δ(v)Δ(v)4rt(v)\Delta(v)-\Delta(v^{*})\leq 4r_{t}(v^{*}).

Combining (a-c), it follows that rt(v)32ϵL(u,v)r_{t}(v)\geq\tfrac{3}{2}\,\epsilon\geq L(u,v), so vv covers uu. Claim proved. ∎

3 The per-metric optimal algorithm

The algorithm in Theorem 3.10 requires a fat decomposition of finite depth, which in general might not exist. To extend the ideas of the preceding section to arbitrary metric spaces, we must generalize Definition 3.8 to transfinitely infinite depth.

Fix a metric space (L,X)(L,X). Let β\beta denote an arbitrary ordinal. A transfinite dd-fat decomposition of depth β\beta is a transfinite sequence {Sλ}0λβ\{S_{\lambda}\}_{0\leq\lambda\leq\beta} of closed subsets of XX such that:

S0=XS_{0}=X, Sβ=S_{\beta}=\emptyset, and SνSλS_{\nu}\supseteq S_{\lambda} whenever ν<λ\nu<\lambda.

if VXV\subset X is closed, then the set {ordinals νβ\{\text{ordinals }\nu\leq\beta: V\mboxintersectsSν}V\mbox{ intersects }S_{\nu}\} has a maximum element.

for any ordinal λβ\lambda\leq\beta and any open set UXU\subset X containing Sλ+1S_{\lambda+1} we have COV(SλU)d\mathtt{COV}(S_{\lambda}\setminus U)\leq d.

Note that for a finite depth β\beta the above definition is equivalent to Definition 3.8. In Theorem 3.17 below, we will show how to modify the “quota algorithms” from the previous section to achieve regret dimension dd in any metric with a transfinite dd^{*}-fat decomposition for d<dd^{*}<d. This gives an optimal algorithm for every metric space XX because of the following surprising relation between the max-min-covering dimension and transfinite fat decompositions.

For every compact metric space (L,X)(L,X), the max-min-covering dimension of XX is equal to the infimum of all dd such that XX has a transfinite dd-fat decomposition.

If YX\emptyset\neq Y\subseteq X and MinCOV(Y)>d\mathtt{MinCOV}(Y)>d then, by transfinite induction, YSλY\subseteq S_{\lambda} for all λ\lambda in any transfinite dd-fat decomposition, contradicting the fact that Sβ=S_{\beta}=\emptyset. Thus, the existence of a transfinite dd-fat decomposition of XX implies dMaxMinCOV(X)d\geq\mathtt{MaxMinCOV}(X). To complete the proof we will construct, given any d>MaxMinCOV(X)d>\mathtt{MaxMinCOV}(X), a transfinite dd-fat decomposition of depth β\beta, where β\beta is any ordinal whose cardinality exceeds that of XX. For a metric space YY, define the set of dd-thin points TP(Y,d){\operatorname{TP}}(Y,d) to be the union of all open sets UYU\subseteq Y satisfying COV(U)<d\mathtt{COV}(U)<d. Its complement, the set of dd-fat points, is denoted by FP(Y,d){\operatorname{FP}}(Y,d). Note that it is a closed subset of YY.

For an ordinal λβ\lambda\leq\beta, we define a set SλS_{\lambda} using transfinite induction as follows:

S0=XS_{0}=X and Sλ+1=FP(Sλ,d)S_{\lambda+1}={\operatorname{FP}}(S_{\lambda},d) for each ordinal λ\lambda.

If λ\lambda is a limit ordinal then Sλ=ν<λSνS_{\lambda}=\bigcap_{\nu<\lambda}S_{\nu}.

Note that each SλS_{\lambda} is closed, by transfinite induction. It remains to show that D={Sλ}λO\mathcal{D}=\{S_{\lambda}\}_{\lambda\in\mathcal{O}} satisfies the properties (a-c) in Definition 3.15. It follows immediately from the construction that S0=XS_{0}=X and SνSλS_{\nu}\supseteq S_{\lambda} when ν<λ.\nu<\lambda. To prove that Sβ=S_{\beta}=\emptyset, observe first that the sets SλSλ+1  (\mboxfor0λ<β)S_{\lambda}\setminus S_{\lambda+1}\;(\mbox{for }0\leq\lambda<\beta) are disjoint subsets of XX, and the number of such sets is greater than the cardinality of XX, so at least one of them is empty. This means that Sλ=Sλ+1S_{\lambda}=S_{\lambda+1} for some λ<β.\lambda<\beta. If Sλ=S_{\lambda}=\emptyset then Sβ=S_{\beta}=\emptyset as desired. Otherwise, the relation FP(Sλ,d)=Sλ{\operatorname{FP}}(S_{\lambda},d)=S_{\lambda} implies that MinCOV(Sλ)d\mathtt{MinCOV}(S_{\lambda})\geq d contradicting the assumption that MaxMinCOV(X)<d.\mathtt{MaxMinCOV}(X)<d. This completes the proof of property (a). To prove property (b), suppose {νiiI}\{\nu_{i}\,|\,i\in\mathcal{I}\} is a set of ordinals such that SνiS_{\nu_{i}} intersects VV for every ii. Let ν=sup{νi}.\nu=\sup\{\nu_{i}\}. Then SνV=iI(SνiV)S_{\nu}\cap V=\bigcap_{i\in\mathcal{I}}(S_{\nu_{i}}\cap V), and the latter set is nonempty because XX is compact and the closed sets {SνiViI}\{S_{\nu_{i}}\cap V\,|\,i\in\mathcal{I}\} have the finite intersection property. Finally, to prove property (c), note that if UU is an open neighborhood of Sλ+1S_{\lambda+1} then the set T=SλUT=S_{\lambda}\setminus U is closed (hence compact) and is contained in TP(Sλ,d){\operatorname{TP}}(S_{\lambda},d). Consequently TT can be covered by open sets VV satisfying COV(V)<d\mathtt{COV}(V)<d. By compactness of TT, this covering has a finite subcover V1,,VmV_{1},\ldots,V_{m}, and consequently COV(T)=max1imCOV(Vi)<d.\mathtt{COV}(T)=\max_{1\leq i\leq m}\mathtt{COV}(V_{i})<d.

Consider the Lipschitz MAB problem on a compact metric space (L,X)(L,X). For any d>MaxMinCOV(X)d>\mathtt{MaxMinCOV}(X) there exists an algorithm Ad\mathcal{A}_{d} such that DIM(Ad)d\mathtt{DIM}(\mathcal{A}_{d})\leq d.

Note that Theorem 1.2 follows immediately by combining Theorem 3.17 with Theorem 3.5.

We next describe an algorithm Ad\mathcal{A}_{d} satisfying Theorem 3.17. The algorithm requires two oracles: a depth oracle Depth(){\mathtt{Depth}}(\cdot) and a D\mathcal{D}-covering oracle {\mbox{\mathcal{D}-\mathtt{Cov}}}(\cdot). For any finite set of open balls B0,B1,,BnB_{0},B_{1},\ldots,B_{n} (given via the centers and the radii) whose union is denoted by BB, Depth(B0,B1,,Bn){\mathtt{Depth}}(B_{0},B_{1},\ldots,B_{n}) returns the maximum ordinal λ\lambda such that SλS_{\lambda} intersects the closure B\overline{B}; such an ordinal exists by Definition 3.15(b).To avoid the question of how arbitrary ordinals are represented on the oracle’s output tape, we can instead say that the oracle outputs a point uSλu\in S_{\lambda} instead of outputting λ.\lambda. In this case, the definition of D\mathcal{D}-Cov\mathtt{Cov} should be modified so that its first argument is a point of SλS_{\lambda} rather than λ\lambda itself. Given a finite set of open balls B0,B1,,BnB_{0},B_{1},\ldots,B_{n} with union BB as above, and an ordinal λ\lambda, {\mbox{\mathcal{D}-\mathtt{Cov}}}(\lambda,B_{0},B_{1},\ldots,B_{n}) either reports that BB covers SλS_{\lambda}, or it returns a strategy xSλB.x\in S_{\lambda}\setminus B.

Our algorithm proceeds in phases i=1,2,3,i=1,2,3,\ldots of 2i2^{i} rounds each. In any given phase ii, there is a “target ordinal” λ(i)\lambda(i) (defined at the end of the preceding phase), and we run an algorithm during the phase which: (i) activates some nodes initially; (ii) plays a version of the zooming algorithm which only activates strategies in Sλ(i)S_{\lambda(i)}; (iii) concludes the phase by computing λ(i+1)\lambda(i+1). The details are as follows. In a given phase we run a fresh instance of a phase algorithm Aph(T,d,λ)\mathcal{A}_{\mathtt{ph}}(T,d,\lambda) where T=2iT=2^{i} and λ=λ(i)\lambda=\lambda(i) is a target ordinal for phase ii, defined below when we give the full description of Aph(T,d,λ).\mathcal{A}_{\mathtt{ph}}(T,d,\lambda). The goal of Aph(T,d,λ)\mathcal{A}_{\mathtt{ph}}(T,d,\lambda) is to satisfy the per-phase bound

for all T>T0T>T_{0}, where γ=11/(d+2)\gamma=1-1/(d+2) and T0T_{0} is a number which may depend on the instance μ\mu. Then, to derive the bound RAd(t)=O~(tγ)R_{\mathcal{A}_{d}}(t)=\widetilde{O}(t^{\gamma}) for all tt we simply sum per-phase bounds over all phases ending before time 2t2t.

Initially Aph(T,d,λ)\mathcal{A}_{\mathtt{ph}}(T,d,\lambda) uses the covering oracle to construct 2j2^{-j}-nets Nj\mathcal{N}_{j}, j=0,1,2,j=0,1,2,\ldots, until it finds the largest jj such that N=Nj\mathcal{N}=\mathcal{N}_{j} contains at most 12  Td/(d+2)log(T)\tfrac{1}{2}\;T^{d/(d+2)}\log(T) points. It activates all strategies in N\mathcal{N} and sets

After this initialization step, for every active strategy vv we define the confidence radius

where nt(v)n_{t}(v) is the number of times vv has been played by the phase algorithm Aph(T,d,λ)\mathcal{A}_{\mathtt{ph}}(T,d,\lambda) before time tt. Let B0,B1,,BnB_{0},B_{1},\ldots,B_{n} be an enumeration of the open balls belonging to the collection

If n<12  Td/(d+2)log(T)n<\tfrac{1}{2}\;T^{d/(d+2)}\log(T) then we perform the oracle call {\mbox{\mathcal{D}-\mathtt{Cov}}}(\lambda,B_{0},\ldots,B_{n}), and if it reports that a point xSλx\in S_{\lambda} is uncovered, we activate xx and set nt(x)=0.n_{t}(x)=0. The index of an active strategy vv is defined as μt(v)+4rt(v)\mu_{t}(v)+4r_{t}(v) — note the slight difference from the index defined in Algorithm 2.3 — and we always play the active strategy with maximum index. To complete the description of the algorithm, it remains to explain how the ordinals λ(i)\lambda(i) are defined. The definition is recursive, beginning with λ(1)=0\lambda(1)=0. At the end of phase i  (i1)i\;(i\geq 1), we let B0,B1,,BmB_{0},B_{1},\ldots,B_{m} be an enumeration of the open balls in the set {B(v,ε(i))v\mboxactive,rT(v)<ε(i)/2}.\{B(v,\varepsilon(i))\,|\,v\mbox{ active},\,r_{T}(v)<\varepsilon(i)/2\}. Finally, we set λ(i+1)=Depth(B0,B1,,Bm).\lambda(i+1)={\mathtt{Depth}}(B_{0},B_{1},\ldots,B_{m}).

Since we have modified the definition of index, we must prove a variant of Claim 3.12 which asserts the following:

To prove it, let ss be the latest round in {1,2,,t}\{1,2,\ldots,t\} when vv was played. We have rt(v)=rs(v)r_{t}(v)=r_{s}(v), and Δ(v)Δ(u)=μ(u)μ(v)\Delta(v)-\Delta(u)=\mu(u)-\mu(v), so it remains to prove that

From the fact that vv was played instead of uu at time ss, together with the fact that both strategies are clean,

We obtain (14) by adding (15)-(17), noting that rs(u)>0r_{s}(u)>0. This completes the proof of (13).

Let λ\lambda be the maximum ordinal such that SλS_{\lambda} contains an optimal strategy uu^{*}; such an ordinal exists by Definition 3.15(b). We will prove that for sufficiently large ii, if the ii-th phase is clean, then λ(i)=λ\lambda(i)=\lambda. The set Sλ+1S_{\lambda+1} is compact, and the function μ\mu is continuous, so it assumes a maximum value on Sλ+1S_{\lambda+1} which is, by construction, strictly less than μ\mu^{*}. Choose ε>0\varepsilon>0 such that Δ(w)>5ε\Delta(w)>5\varepsilon for all wSλ+1w\in S_{\lambda+1}, and choose T0=2i0T_{0}=2^{i_{0}} such that ε(i0)ε.\varepsilon(i_{0})\leq\varepsilon. We shall prove that for all T=2iT0T=2^{i}\geq T_{0} and all ordinals ν\nu, a clean run of Aph(T,d,ν)\mathcal{A}_{\mathtt{ph}}(T,d,\nu) results in setting λ(i+1)=λ\lambda(i+1)=\lambda. First, let vNv^{*}\in\mathcal{N} be such that L(u,v)ε(i).L(u^{*},v^{*})\leq\varepsilon(i). If vv is active and rT(v)<ε(i)/2r_{T}(v)<\varepsilon(i)/2 then (13) implies that Δ(v)Δ(v)52ε(i)\Delta(v)-\Delta(v^{*})\leq\tfrac{5}{2}\,\varepsilon(i) hence Δ(v)72ε(i)\Delta(v)\leq\tfrac{7}{2}\,\varepsilon(i). As Δ(w)>5ε5ε(i)\Delta(w)>5\varepsilon\geq 5\varepsilon(i) for all wSλ+1w\in S_{\lambda+1}, it follows that the closure of B(v,ε(i))B(v,\varepsilon(i)) does not intersect Sλ+1.S_{\lambda+1}. This guarantees that Depth(B0,B1,,Bm){\mathtt{Depth}}(B_{0},B_{1},\ldots,B_{m}) returns an ordinal less than or equal to λ.\lambda. Next we must prove that this ordinal is greater than or equal to λ.\lambda. Note that the total number of strategies activated by Aph(T,d,ν)\mathcal{A}_{\mathtt{ph}}(T,d,\nu) is bounded above by Td/(d+2)log(T)T^{d/(d+2)}\log(T). Let ATA_{T} denote the set of strategies active at time TT and let

By the pigeonhole principle, nT(v0)T2/(d+2)/log(T)n_{T}(v^{0})\geq T^{2/(d+2)}/\log(T) and hence rT(v0)<3T1/(d+2)log(T).r_{T}(v^{0})<3T^{-1/(d+2)}\log(T). If tt denotes the last time at which v0v^{0} was played, then we have

provided that the phase is clean and that TT0.T\geq T_{0}. Since v0v^{0} had maximum index at time tt, we deduce that It(v)<μ+ε(i)/2I_{t}(v^{*})<\mu^{*}+\varepsilon(i)/2 as well. As L(u,v)ε(i)L(u^{*},v^{*})\leq\varepsilon(i) we have μt(v)με(i)rt(v)\mu_{t}(v^{*})\geq\mu^{*}-\varepsilon(i)-r_{t}(v^{*}) provided the phase is clean. To finish the proof we observe that

which implies rt(v)<ε(i)/2r_{t}(v^{*})<\varepsilon(i)/2. Since the confidence radius does not increase over time, we have rT(v)<ε(i)/2r_{T}(v^{*})<\varepsilon(i)/2 so B(v,ε(i))B(v^{*},\varepsilon(i)) is one of the balls B0,B1,,Bm.B_{0},B_{1},\ldots,B_{m}. Since uu^{*} is contained in the closure of this ball, we may conclude that Depth(B0,B1,,Bm){\mathtt{Depth}}(B_{0},B_{1},\ldots,B_{m}) returns the ordinal λ\lambda as desired.

Let U=B(Sλ+1,ε(i)/2)U=B(S_{\lambda+1},\varepsilon(i)/2). As in Claim 3.14 it holds that in any clean phase, UU is covered throughout the phase by balls centered at points of N\mathcal{N}. Hence for any pair of consecutive clean phases, in the second phase of the pair our algorithm only calls the covering oracle D\mathcal{D}-Cov\mathtt{Cov} with the proper ordinal λ\lambda (i.e. the maximum λ\lambda such that SλS_{\lambda} contains an optimal strategy) and with a set of balls B0,B1,,BnB_{0},B_{1},\ldots,B_{n} that covers UU. Also, note that an active strategy vv during a run of Aph(T,d,λ)\mathcal{A}_{\mathtt{ph}}(T,d,\lambda) never has a confidence radius rt(v)r_{t}(v) less than δ=T1/(d+2)\delta=T^{-1/(d+2)}, so the strategies activated by the covering oracle form a δ\delta-net in the space SλUS_{\lambda}\setminus U. By Definition 3.15(c), a δ\delta-net in SλUS_{\lambda}\setminus U contains fewer than O(δd)O(\delta^{-d}) points. Hence for sufficiently large TT the “quota” of 12  Td/(d+2)\tfrac{1}{2}\;T^{d/(d+2)} active strategies is never reached, which implies that every point of SλS_{\lambda} — including uu^{*} — is covered throughout the phase. The upper bound on the regret of Aph(T,d,λ)\mathcal{A}_{\mathtt{ph}}(T,d,\lambda) concludes as in the proof of Theorem 2.4. ∎

Zooming algorithm: extensions and examples

We extend the analysis in Section 2 in several directions, and follow up with examples.

In Section 4.1 we note that our analysis works under a more abstract notion of the confidence radius: essentially, it can be any function of the history of playing a given strategy such that Claim 2.2 holds. This observation leads to sharper results if the reward from playing each strategy uu is μ(u)\mu(u) plus an independent noise of a known and “benign” shape; we provide several concrete examples.

In Section 4.2 we provide an improved version of the confidence radius such that the zooming algorithm satisfies the guarantee in Theorem 2.4 and achieves a better regret exponent dd+1\tfrac{d}{d+1} if the maximal reward is exactly 1. The analysis builds on a novel Chernoff-style bound which, to the best of our knowledge, has not appeared in the literature.

In Section 4.3 we consider the an example which show-cases both the notion of the zooming dimension and the improved algorithm from Section 4.2. It is the target MAB problem, a version of the Lipschitz MAB problem in which the expected reward of a given strategy is a equal to its distance to some (unknown) target set SS. We show that the zooming algorithm performs much better in this setting; in particular, if the metric is doubling and SS is finite, it achieves poly-logarithmic regret.

In Section 4.4 we relax some of the assumptions in the Lipschitz MAB problem: we do not require the similarity function LL to satisfy the triangle inequality, and we need the Lipschitz condition (1) to hold only if one of the two strategies is optimal. We use this extension to analyze a generalization of the target MAB problem in which μ(u)=f(L(u,S))\mu(u)=f(L(u,S)) for some known function ff.

Finally, in Section 4.5 we extend the analysis in Section 2 from reward distributions with bounded supportIn Section 4.1 we also consider stochastically bounded distributions such as Gaussians. to arbitrary reward distributions with a finite absolute third moment. Our analysis relies on the extension of Azuma inequality known as the non-uniform Berry-Esseen theorem .

Let us recap some conventions we’ll be using throughout this section. The zooming algorithm proceeds in phases i=1,2,3,i=1,2,3,\ldots of 2i2^{i} rounds each. Within a given phase, for each strategy vXv\in X and time tt, nt(v)n_{t}(v) is the number of times vv has been played before time tt, and μt(v)\mu_{t}(v) is the corresponding average reward. Also, we denote Δ(v)=μμ(v)\Delta(v)=\mu^{*}-\mu(v), where μ=supvXμ(v)\mu^{*}=\sup_{v\in X}\mu(v) is the maximal reward.

In Section 4.1 the confidence radius of a given strategy was defined by (2). Here we generalize this definition to any function of the history of playing this strategy that satisfies certain properties.

Consider a single phase iphi_{\mathtt{ph}} of the algorithm. For each strategy vv and any time tt within this phase, let r^t(v)\hat{r}_{t}(v) and μ^t(v)\hat{\mu}_{t}(v) be non-negative functions of iphi_{\mathtt{ph}}, tt, and the history of playing vv up to round tt. Call r^t(v)\hat{r}_{t}(v) a confidence radius with respect to μ^t(v)\hat{\mu}_{t}(v) if

μ^t(v)μ(t)r^t(v)|\hat{\mu}_{t}(v)-\mu(t)|\leq\hat{r}_{t}(v) with probability at least 18iph1-8^{-i_{\mathtt{ph}}}.

34r^t(v)r^t+1(v)r^t(v)\tfrac{3}{4}\,\hat{r}_{t}(v)\leq\hat{r}_{t+1}(v)\leq\hat{r}_{t}(v).

The confidence radius is (β,C)(\beta,C)-good if nt(v)(Ciph)Δβ(v)n_{t}(v)\leq(C\,i_{\mathtt{ph}})\,\Delta^{-\beta}(v) whenever Δ(v)4r^t(v)\Delta(v)\leq 4\hat{r}_{t}(v).

Property (i) says that Claim 2.2 holds for the appropriately redefined clean phase. Property (ii) is a “smoothness” condition: r^t(v)\hat{r}_{t}(v) does not increase with time, and does not decrease too fast. It is needed for the last line of the proof of Lemma 2.5.

Given such confidence radius, we can carry out the proof of Theorem 2.4 with very minor modifications.

Consider an instance of the standard Lipschitz MAB problem for which there exists a (β,c0)(\beta,c_{0})-good confidence radius, β0\beta\geq 0. Let A\mathcal{A} be an instance of Algorithm 2.3 defined with respect to this confidence radius. Suppose the problem instance has cc-zooming dimension dd. Then:

If d+β>1d+\beta>1 then RA(t)a(t)  t11/(d+β)R_{\mathcal{A}}(t)\leq a(t)\;t^{1-1/(d+\beta)} for all tt, where a(t)=O(cc0log2t)1/(d+β)a(t)=O(c\,c_{0}\,\log^{2}t)^{1/(d+\beta)}.

If d+β1d+\beta\leq 1 then RA(t)O(cc0log2t)R_{\mathcal{A}}(t)\leq O(c\,c_{0}\,\log^{2}t).

A new feature of this theorem (as compared to Theorem 2.4) is the poly-logarithmic bound on regret in part (b). For better intuition on this, note that the exponent in part (a) becomes negative if d+β<1d+\beta<1. Since the regret bound should not be decreasing in tt, one would expect this term to vanish from the “correct” bound. Indeed, it is easy to check that the computation in the proof of Claim 2.7 results in part (b).

A natural application of Theorem 4.2 if a setting in which the reward from playing each strategy uu is μ(u)\mu(u) plus an independent noise of known shape.

The Noisy Lipschitz MAB problem is a standard Lipschitz MAB problem such that every time any strategy uu is played, the reward is μ(u)\mu(u) plus an independent random sample from some fixed distribution P\mathcal{P} (called the noise distribution) which is revealed to the algorithm.

We present several examples in which we take advantage of a “benign” shape of P\mathcal{P}. Interestingly, in these examples the payoff distributions are not restricted to have bounded support.Recall that throughout the paper the payoff distribution of each strategy xx has support S(x)\mathcal{S}(x)\subset. In this subsection, by a slight abuse of notation, we do not make this assumption. Technically the results are simple corollaries of Theorem 4.2.

We start with perhaps the most natural example when the noise distribution is normal.

Consider the Noisy Lipschitz MAB problem with normal noise distribution P=N(0,σ2)\mathcal{P}=\mathcal{N}(0,\sigma^{2}). Then there exists an algorithm A\mathcal{A} which enjous guarantee (4) with the right-hand side multiplied by σ\sigma.

Define the confidence radius as (2) with the right-hand side multiplied by σ\sigma. It is easy to see that this is a (2,O(σ))(2,O(\sigma))-good confidence radius. The result follows from Theorem 4.2(a). ∎

In fact, Corollary 4.4 can be extended to noise distributions of a somewhat more general form: let us say that a random variable XX is stochastically (ρ,σ)(\rho,\sigma)-bounded if its moment-generating function satisfies

The derivation and the theorem statement needs to be modified slightly to account for the parameter ρ\rho; we omit the details from this version.

Consider the Noisy Lipschitz MAB problem such that the noise distribution P\mathcal{P} has at least one point mass. Then the problem admits a confidence radius which is (β,c)(\beta,c)-good for any given β>0\beta>0 and a constant c=c(β,P)c=c(\beta,\mathcal{P}). The corresponding low-regret guarantees follow via Theorem 4.2.

Let S=argmaxP(x)S=\operatornamewithlimits{argmax}\mathcal{P}(x) be the set of all points with the largest point mass p=maxxP(x)p=\max_{x}\mathcal{P}(x), and let q=maxx:P(x)<pP(x)q=\max_{x:\,\mathcal{P}(x)<p}\mathcal{P}(x) be the second largest point mass. Then n=Θ(logt)n=\Theta(\log t) samples suffices to ensure that with high probability each node in SS will get at least n(p+q)/2n(p+q)/2 hits whereas any other node will get less, which exactly locates all points in SS. We use confidence radius rt(v)=Θ(iph)(34)nt(v)r_{t}(v)=\Theta(i_{\mathtt{ph}})(\tfrac{3}{4})^{n_{t}(v)}. ∎

Third, we consider noise distributions with a ”special region” which can be located using a few samples. This may be a more efficient way to estimate μ(v)\mu(v) than using the standard Chernoff-style tail bounds. Moreover, in our examples P\mathcal{P} may be heavy-tailed, so that Chernoff-style bounds do not hold.

Consider the Noisy Lipschitz MAB problem with noise distribution P\mathcal{P}. Suppose P\mathcal{P} has a density f(x)f(x) which is symmetric around and non-increasing for x>0x>0. Assume one of the following:

f(x)f(x) has a sharp peak: f(x)=Θ(xα)f(x)=\Theta(|x|^{-\alpha}) for all small enough x|x|, where α(0,1)\alpha\in(0,1).

f(x)f(x) piecewise continuous on (0,)(0,\infty) with at least one jump.

Then for some constant cPc_{\mathcal{P}} that depends only on P\mathcal{P} the problem admits a (β,cP)(\beta,c_{\mathcal{P}})-good confidence radius, where (a)(a) β=1α\beta=1-\alpha, (b)(b) β=1\beta=1. The corresponding low-regret guarantees follow via Theorem 4.2.

For part (aa), note that for any x>0x>0 in a neighborhood of we have P[(x,x)]=Θ(x1α)\mathcal{P}[(-x,x)]=\Theta(x^{1-\alpha}). Therefore n=Θ(xα1logt)n=\Theta(x^{\alpha-1}\,\log t) samples suffices to separate with high probability any length-xx sub-interval of (x,x)(-x,x) from any length-xx sub-interval of (2x,)(2x,\infty). It follows that using nn samples we can approximate the mean reward up to ±O(x)\pm O(x). Accordingly, we set rt(v)=Θ(iph/nt(v))1/(1α)r_{t}(v)=\Theta(i_{\mathtt{ph}}/n_{t}(v))^{1/(1-\alpha)}.

For part (bb), let x0x_{0} be the smallest positive point where density ff has a jump. Then by continuity there exists some ϵ>0\epsilon>0 such that infx(x0ϵ,x0)f(x)>supx(x0,x0+ϵ)f(x)\inf_{x\in(x_{0}-\epsilon,\,x_{0})}f(x)>\sup_{x\in(x_{0},\,x_{0}+\epsilon)}f(x). Therefore for any x<ϵx<\epsilon using n=Θ(1xlogn)n=\Theta(\tfrac{1}{x}\log n) samples suffices to separate with high probability any length-xx sub-interval of (0,x0)(0,x_{0}) from any length-xx sub-interval of (x0,)(x_{0},\infty). It follows that using nn samples we can approximate the mean reward up to ±O(x)\pm O(x). Accordingly, we set rt(v)=Θ(iph/nt(v))r_{t}(v)=\Theta(i_{\mathtt{ph}}/n_{t}(v)). ∎

2 What if the maximal expected reward is 1?

We elaborate the algorithm from Section 2 so that it satisfies the guarantee (4) and performs much better if the maximal expected reward is 11.

Consider the Lipschitz MAB problem. Call an algorithm β\beta-good if there exists an absolute constant c0c_{0} such that for any problem instance of cc-zooming dimension dd it has the properties (ab) in Theorem 4.2. Call a confidence radius β\beta-good if it is (β,c0)(\beta,c_{0})-good for some absolute constant c0c_{0}.

Consider the standard Lipschitz MAB problem. There is an algorithm A\mathcal{A} which is 22-good in general, and 11-good when the maximal expected reward is 11.

The key ingredient here is a refined version of the confidence radius which is much sharper than (2) when the sample average is close to 11. For phase iphi_{\mathtt{ph}}, we define

In order to analyze (20) we need to establish the following Chernoff-style bound which, to the best of our knowledge, has not appeared in the literature:

Consider nn i.i.d. random variables X1XnX_{1}\ldots X_{n} on $.Let. Let\mubetheirmean,andletbe their mean, and letXbetheiraverage.Thenforanybe their average. Then for any\alpha>0$ the following holds:

We will use two well-known Chernoff Bounds which we state below (e.g. see p. 64 of ):

Pr[Xμ>δμ]<2eμnδ2/3\Pr[|X-\mu|>\delta\mu]<2\,e^{-\mu n\delta^{2}/3} for any δ(0,1)\delta\in(0,1).

First, suppose μα6n\mu\geq\tfrac{\alpha}{6n}. Apply (cb1) with δ=12α6μn\delta=\tfrac{1}{2}\sqrt{\tfrac{\alpha}{6\mu n}}. Thus with probability at least 1eΩ(α)1-e^{-\Omega(\alpha)} we have Xμ<δμμ/2|X-\mu|<\delta\mu\leq\mu/2. Moreover, plugging in the value for δ\delta,

Now suppose μ<α6n\mu<\tfrac{\alpha}{6n}. Then using (cb2) with a=αna=\tfrac{\alpha}{n}, we obtain that with probability at least 12Ω(α)1-2^{-\Omega(\alpha)} we have X<αnX<\tfrac{\alpha}{n}, and therefore

Let us fix a strategy vv and time tt. Let us use Lemma 4.9 with n=nt(v)n=n_{t}(v) and α=Θ(iph)\alpha=\Theta(i_{\mathtt{ph}}) as in (20), setting each random variable XiX_{i} equal to 1 minus the reward from the ii-th time strategy vv is played in the current phase. Then μ=μ(v)\mu=\mu(v) and X=μt(v)X=\mu_{t}(v), so the Lemma says that

Note that (20) is indeed a confidence radius with respect to μt(v)\mu_{t}(v): property (i) in Definition 4.1 holds by (21), and it is easy to check that property (ii) holds, too. It is easy to see that (20) is a 22-good confidence radius. It remains to show that it is 11-good when the maximal reward is 11; this is where we use the upper bound on rt(v)r_{t}(v) in (21). It suffices to prove the following claim:

Indeed, let n=nt(v)n=n_{t}(v) and Δ=Δ(v)\Delta=\Delta(v), and suppose that the maximal reward is 11 and Δ(v)4rt(v)\Delta(v)\leq 4\,r_{t}(v). Then by (21) we have Δ4rt(v)αn+αΔ/n\Delta\leq 4\,r_{t}(v)\leq\tfrac{\alpha}{n}+\sqrt{\alpha\Delta/n} for some α=O(logt)\alpha=O(\log t). Now there are two cases. If αn<Δ/2\tfrac{\alpha}{n}<\Delta/2 then αΔ/nΔαn>Δ(v)/2\sqrt{\alpha\Delta/n}\geq\Delta-\tfrac{\alpha}{n}>\Delta(v)/2, which implies the desired inequality. Else we simply have nO(α/Δ)n\leq O(\alpha/\Delta). Claim proved. ∎

3 Example: expected reward == distance to the target

We consider a version of the Lipschitz MAB problem where the expected reward of a given strategy is equal to its distance to some target set which is not revealed to the algorithm.

The Target MAB problem on a metric space (L,X)(L,X) with a target set SXS\subset X is the standard Lipschitz MAB problem on (L,X)(L,X) with payoff function μ(u)=1L(u,S)\mu(u)=1-L(u,S).

It is a well-known fact that L(u,v)L(u,S)L(v,S)L(u,v)\geq L(u,S)-L(v,S) for any u,vXu,v\in X and any set SXS\subset X. Therefore the payoff function μ\mu in Definition 4.10 is Lipschitz on (L,X)(L,X).

Let us refine this bound for metric spaces of finite doubling dimension. In particular, we show that for a finite target set the zooming algorithm from Theorem 4.8 achieves poly-logarithmic regret.

Consider the Target MAB problem on a metric space of finite doubling dimension dd^{*}. Let A\mathcal{A} be the zooming algorithm from Theorem 4.8. Then

where dd is the cc-covering dimension of the target set SS.

By Theorem 4.8 it suffices to prove that the KK-zooming dimension of the pair (L,μ)(L,\mu) is at most dd, for some K=c2O(d)K=c\,2^{O(d^{*})}. In other words, it suffices to cover the set Sδ={uY:Δ(u)δ}S_{\delta}=\{u\in Y:\Delta(u)\leq\delta\} with KδdK\,\delta^{-d} sets of diameter δ/16\leq\delta/16, for any given δ>0\delta>0.

Fix δ>0\delta>0 and note that Δ(u)=L(u,S)\Delta(u)=L(u,S). Note that set SS can be covered with cδdc\,\delta^{-d} sets {Ci}i\{\,C_{i}\,\}_{i} of diameter δ\leq\delta. It follows that the set SδS_{\delta} can be covered with rdr^{-d} sets {B(Ci,r)}i\{\,B(C_{i},r)\,\}_{i} of diameter 3r\leq 3r. Moreover, each set B(Ci,r)B(C_{i},r) can be covered with 2O(d)2^{O(d^{*})} of sets of diameter δ/16\leq\delta/16. ∎

This theorem is useful when d<dd<d^{*}, i.e. when the target set is a low-dimensional subset of the metric space. Recall that the zooming algorithm is self-tuning: it does not need to know dd^{*} and dd, and in fact it does not even need to know that it is presented with an instance of the Target MAB problem!

We note in passing that it is very easy to extend Theorem 4.11 to a setting in which the strategy set YY is a proper subset of the metric space (L,X)(L,X) and does not contain the target set SS. If L(Y,S)=0L(Y,S)=0 then the guarantee (22) holds as is. If L(Y,S)>0L(Y,S)>0 then the following guarantee holds:

where dd is the cc-covering dimension of the set B(S,r)B(S,r), r=L(Y,S)r=L(Y,S).

4 The Lipschitz MAB problem under relaxed assumptions

The analysis in Section 2 does not require all the assumptions in the Lipschitz MAB problem. In fact, it never uses the triangle inequality, and applies the Lipschitz condition (1) only if (essentially) one of the two strategies in (1) is optimal. Let us formulate our results under the properly relaxed assumptions. In what follows, the zooming algorithm will refer to the algorithm in Theorem 4.8.

Consider a version of the Lipschitz MAB problem on (L,X)(L,X) in which the similarity metric is not required to satisfy triangle inequality,Formally, we require LL to be a symmetric function X×X[0,]X\times X\rightarrow[0,\infty] such that L(x,x)=0L(x,x)=0 for all xXx\in X. We call such function a quasi-distance on XX. and the Lipschitz condition (1) is replaced by

More generally, if such node vv^{*} does not exist, assume

Then the guarantees for the zooming algorithm in Theorem 4.8 still hold.

We apply this theorem to a generalization of the Target MAB problem in which μ(u)=f(L(u,S))\mu(u)=f(L(u,S)) for some known non-decreasing shape function f:f:\rightarrow. Let us define a quasi-distance LfL_{f} by Lf(u,v)=f(L(u,v))f(0)L_{f}(u,v)=f(L(u,v))-f(0). It is easy to see that LfL_{f} satisfies (23). Indeed, fix any uSu^{*}\in S. Then

Thus we can use the zooming algorithm on the quasi-distance LfL_{f} and enjoy the guarantees in Theorem 4.8. Below we refine these guarantees for several examples.

Our goal here is to provide clean illustrative statements rather than cover the most general setting to which our refined guarantees apply. Therefore we start with the most concrete example which we formulate as a theorem, and follow up with some extensions which we list without a proof.

Consider the Target MAB problem on a metric space (L,X)(L,X) of finite doubling dimension dd^{*}, with shape function f(x)=x1/αf(x)=x^{1/\alpha}, α>0\alpha>0. Let A\mathcal{A} be the zooming algorithm on (Lf,X)(L_{f},X). Then

where dd is the cc-covering dimension of the target set SS.

Consider the pair (Lf,μ)(L_{f},\mu). Since the maximal reward is 1, by Theorem 4.8 it suffices to prove that for some c=c2O(d)c^{*}=c\,2^{O(d^{*})} the cc^{*}-zooming dimension of this pair is at most αd\alpha d. Specifically, for each δ>0\delta>0 we need to cover the set Sδ={uX:Δ(u)δ}S_{\delta}=\{u\in X:\,\Delta(u)\leq\delta\} with cδαdc^{*}\,\delta^{-\alpha d} sets of LfL_{f}-diameter at most δ/16\delta/16.

Indeed, since Δ(u)=Lf(u,S)\Delta(u)=L_{f}(u,S), for each uSδu\in S_{\delta} we have L(u,S)δαL(u,S)\leq\delta^{\alpha}. Thus SδB(S,δα)S_{\delta}\subset B(S,\delta^{\alpha}). Let ϵ=16α\epsilon=16^{-\alpha}. As in the proof of Theorem 4.11, we can show that SδS_{\delta} can be covered by cϵO(d)δαdc\,\epsilon^{-O(d^{*})}\,\delta^{-\alpha d} sets of diameter ϵδα\epsilon\,\delta^{\alpha}. Each of these sets has LfL_{f}-diameter at most f(ϵδα)f(\epsilon\,\delta^{\alpha}), which is at most δ/16\delta/16. ∎

This theorem includes Theorem 4.3 as a special case f(x)=xf(x)=x. Like the latter, this theorem is useful when the target set is a low-dimensional subset of the metric space.

We consider extensions to more general shape functions and to strategy sets which do not contain SS:

Suppose the shape function ff satisfies the following constraints for some constants αα>0\alpha\geq\alpha^{*}>0:

where g(x)=f(x)f(0)g(x)=f(x)-f(0). Then for β=1+1{f(0)>0}\beta=1+1_{\{f(0)>0\}} we have

Consider the setting in which the strategy set YY is a proper subset of the metric space (L,X)(L,X) and does not contain the target set SS. If L(u,S)=0L(u^{*},S)=0 for some uYu^{*}\in Y then the guarantee (25) holds as is. In general, if we restrict the shape function to f(x)=c+x1/αf(x)=c+x^{1/\alpha}, α(0,1]\alpha\in(0,1] then

where dd is the cc-covering dimension of the set S=B(S,r)S^{*}=B(S,r), r=L(Y,S)r=L(Y,S). Moreover, one can prove similar guarantees with dd^{*} being the doubling dimension of an open neighborhood of SS^{*}, rather than that of the entire metric space

5 Heavy-tailed reward distributions

Consider the standard Lipschitz MAB problem. Let Xn(v)X_{n}(v) be the reward from the nn-th trial of strategy vv. Assume that each Xn(v)X_{n}(v) is an independent random variable with mean μ(v)\mu(v)\in and furthermore that E[Xn(v)3]<ρE\left[\,|X_{n}(v)|^{3}\,\right]<\rho for some constant ρ\rho. Then there is an algorithm A\mathcal{A} such that if for some cc the problem instance has cc-zooming dimension dd then

The proof relies on the non-uniform Berry-Esseen theorem (e.g. see for a nice survey) which we use to obtain a tail inequality similar to Claim 2.2: for any α>0\alpha>0

However, this inequality gives much higher failure probability than Claim 2.2; in particular, we cannot take a union bound over all active strategies. Accordingly, we need a more refined version of Theorem 4.2 which is parameterized by the failure probability in (27). In the analysis, instead of the failure events when the phase is not clean (see Definition 2.1) we need to consider the ρ\rho-failure events when the tail bound from (27) is violated by some strategy vv such that Δ(v)>ρ\Delta(v)>\rho. Then using the technique from Section 2 we can upper-bound RA(T)R_{\mathcal{A}}(T) in terms of TT, dd, ρ\rho and α\alpha and choose the optimal values for ρ\rho and α\alpha.

References