X-Armed Bandits

Sébastien Bubeck, Rémi Munos, Gilles Stoltz, Csaba Szepesvari

Introduction

In the classical stochastic bandit problem a gambler tries to maximize his revenue by sequentially playing one of a finite number of slot machines that are associated with initially unknown (and potentially different) payoff distributions . Assuming old-fashioned slot machines, the gambler pulls the arms of the machines one by one in a sequential manner, simultaneously learning about the machines’ payoff-distributions and gaining actual monetary reward. Thus, in order to maximize his gain, the gambler must choose the next arm by taking into consideration both the urgency of gaining reward (“exploitation”) and acquiring new information (“exploration”).

Maximizing the total cumulative payoff is equivalent to minimizing the (total) regret, i.e., minimizing the difference between the total cumulative payoff of the gambler and the one of another clairvoyant gambler who chooses the arm with the best mean-payoff in every round. The quality of the gambler’s strategy can be characterized as the rate of growth of his expected regret with time. In particular, if this rate of growth is sublinear, the gambler in the long run plays as well as the clairvoyant gambler. In this case the gambler’s strategy is called Hannan consistent.

Bandit problems have been studied in the Bayesian framework , as well as in the frequentist parametric [25; 2] and non-parametric settings , and even in non-stochastic scenarios [5; 10]. While in the Bayesian case the question is whether the optimal actions can be computed efficiently, in the frequentist case the question is how to achieve low rate of growth of the regret in the lack of prior information, i.e., it is a statistical question. In this paper we consider the stochastic, frequentist, non-parametric setting.

Although the first papers studied bandits with a finite number of arms, researchers have soon realized that bandits with infinitely many arms are also interesting, as well as practically significant. One particularly important case is when the arms are identified by a finite number of continuous-valued parameters, resulting in online optimization problems over continuous finite-dimensional spaces. Such problems are ubiquitous to operations research and control. Examples are “pricing a new product with uncertain demand in order to maximize revenue, controlling the transmission power of a wireless communication system in a noisy channel to maximize the number of bits transmitted per unit of power, and calibrating the temperature or levels of other inputs to a reaction so as to maximize the yield of a chemical process” . Other examples are optimizing parameters of schedules, rotational systems, traffic networks or online parameter tuning of numerical methods. During the last decades numerous authors have investigated such “continuum-armed” bandit problems [3; 21; 6; 22; 12]. A special case of interest, which forms a bridge between the case of a finite number of arms and the continuum-armed setting, is formed by bandit linear optimization, see and the references therein.

In many of the above-mentioned problems, however, the natural domain of some of the optimization parameters is a discrete set, while other parameters are still continuous-valued. For example, in the pricing problem different product lines could also be tested while tuning the price, or in the case of transmission power control different protocols could be tested while optimizing the power. In other problems, such as in online sequential search, the parameter-vector to be optimized is an infinite sequence over a finite alphabet [13; 7].

The motivation for this paper is to handle all these various cases in a unified framework. More precisely, we consider a general setting that allows us to study bandits with almost no restriction on the set of arms. In particular, we allow the set of arms to be an arbitrary measurable space. Since we allow non-denumerable sets, we shall assume that the gambler has some knowledge about the behavior of the mean-payoff function (in terms of its local regularity around its maxima, roughly speaking). This is because when the set of arms is uncountably infinite and absolutely no assumptions are made on the payoff function, it is impossible to construct a strategy that simultaneously achieves sublinear regret for all bandits problems (see, e.g., [9, Corollary 4]). When the set of arms is a metric space (possibly with the power of the continuum) previous works have assumed either the global smoothness of the payoff function [3; 21; 22; 12] or local smoothness in the vicinity of the maxima . Here, smoothness means that the payoff function is either Lipschitz or Hölder continuous (locally or globally). These smoothness assumptions are indeed reasonable in many practical problems of interest.

In this paper, we assume that there exists a dissimilarity function that constrains the behavior of the mean-payoff function, where a dissimilarity function is a measure of the discrepancy between two arms that is neither symmetric, nor reflexive, nor satisfies the triangle inequality. (The same notion was introduced simultaneously and independently of us by [23, Section 4.4] under the name “quasi-distance.”) In particular, the dissimilarity function is assumed to locally set a bound on the decrease of the mean-payoff function at each of its global maxima. We also assume that the decision maker can construct a recursive covering of the space of arms in such a way that the diameters of the sets in the covering shrink at a known geometric rate when measured with this dissimilarity.

Our work generalizes and improves previous works on continuum-armed bandits.

In particular, Kleinberg and Auer et al. focused on one-dimensional problems, while we allow general spaces. In this sense, the closest work to the present contribution is that of Kleinberg et al. , who considered generic metric spaces assuming that the mean-payoff function is Lipschitz with respect to the (known) metric of the space; its full version relaxed this condition and only requires that the mean-payoff function is Lipschitz at some maximum with respect to some (known) dissimilarity. The present paper paper is a concurrent and independent work with respect to the paper of Kleinberg, Slivkins, and Upfal . An extended abstract of the latter was published in May 2008 at STOC’08, while the NIPS’08 version of the present paper was submitted at the beginning of June 2008. At that time, we were not aware of the existence of the full version , which was released in September 2008. Kleinberg et al. proposed a novel algorithm that achieves essentially the best possible regret bound in a minimax sense with respect to the environments studied, as well as a much better regret bound if the mean-payoff function has a small “zooming dimension”.

Our contribution furthers these works in two ways:

our algorithms, motivated by the recent successful tree-based optimization algorithms [24; 18; 13], are easy to implement;

we show that a version of our main algorithm is able to exploit the local properties of the mean-payoff function at its maxima only, which, as far as we know, was not investigated in the approach of Kleinberg et al. .

The precise discussion of the improvements (and drawbacks) with respect to the papers by Kleinberg et al. requires the introduction of somewhat extensive notations and is therefore deferred to Section 5. However, in a nutshell, the following can be said.

First, by resorting to a hierarchical approach, we are able to avoid the use of the doubling trick, as well as the need for the (covering) oracle, both of which the so-called zooming algorithm of Kleinberg et al. relies on. This comes at the cost of slightly more restrictive assumptions on the mean-payoff function, as well as a more involved analysis. Moreover, the oracle is replaced by an a priori choice of a covering tree. In standard metric spaces, such as the Euclidean spaces, such trees are trivial to construct, though, in full generality they may be difficult to obtain when their construction must start from (say) a distance function only. We also propose a variant of our algorithm that has smaller computational complexity of order nlnnn\ln n compared to the quadratic complexity n2n^{2} of our basic algorithm. However, the cheaper algorithm requires the doubling trick to achieve an anytime guarantee (just like the zooming algorithm).

Second, we are also able to weaken our assumptions and to consider only properties of the mean-payoff function in the neighborhoods of its maxima; this leads to regret bounds scaling as \widetilde{O}\bigl{(}\sqrt{n}\bigr{)} We write un=O~(vn)u_{n}=\widetilde{O}(v_{n}) when un=O(vn)u_{n}=O(v_{n}) up to a logarithmic factor. when, e.g., the space is the unit hypercube and the mean-payoff function has a finite number of global maxima xx^{*} around which it is locally equivalent to a function \Arrowvertxx\Arrowvertα\Arrowvert x-x^{*}\Arrowvert^{\alpha} with some known degree α>0\alpha>0. Thus, in this case, we get the desirable property that the rate of growth of the regret is independent of the dimensionality of the input space. (Comparable dimensionality-free rates are obtained under different assumptions in .)

Finally, in addition to the strong theoretical guarantees, we expect our algorithm to work well in practice since the algorithm is very close to the recent, empirically very successful tree-search methods from the games and planning literature [16; 17; 27; 11; 15].

In Section 2 we formalize the X\mathcal{X}–armed bandit problem.

In Section 3 we describe the basic strategy proposed, called HOO (hierarchical optimistic optimization).

We present the main results in Section 4. We start by specifying and explaining our assumptions (Section 4.1) under which various regret bounds are proved. Then we prove a distribution-dependent bound for the basic version of HOO (Section 4.2). A problem with the basic algorithm is that its computational cost increases quadratically with the number of time steps. Assuming the knowledge of the horizon, we thus propose a computationally more efficient variant of the basic algorithm, called truncated HOO and prove that it enjoys a regret bound identical to the one of the basic version (Section 4.3) while its computational complexity is only log-linear in the number of time steps. The first set of assumptions constrains the mean-payoff function everywhere. A second set of assumptions is therefore presented that puts constraints on the mean-payoff function only in a small vicinity of its global maxima; we then propose another algorithm, called local-HOO, which is proven to enjoy a regret again essentially similar to the one of the basic version (Section 4.4). Finally, we prove the minimax optimality of HOO in metric spaces (Section 4.5).

In Section 5 we compare the results of this paper with previous works.

Problem setup

The mean-payoff function ff thus defined is assumed to be measurable. For simplicity, we shall also assume that all MxM_{x} have bounded supports, included in some fixed bounded intervalMore generally, our results would also hold when the tails of the reward distributions are uniformly sub-Gaussian. , say, the unit interval $.Then,. Then,falsotakesboundedvalues,inalso takes bounded values, in$.

A decision maker (the gambler of the introduction) that interacts with a stochastic bandit problem B{\cal B} plays a game at discrete time steps according to the following rules. In the first round the decision maker can select an arm X1XX_{1}\in\mathcal{X} and receives a reward Y1Y_{1} drawn at random from MX1M_{X_{1}}. In round n>1n>1 the decision maker can select an arm XnXX_{n}\in\mathcal{X} based on the information available up to time nn, i.e., (X1,Y1,,Xn1,Yn1)(X_{1},Y_{1},\ldots,X_{n-1},Y_{n-1}), and receives a reward YnY_{n} drawn from MXnM_{X_{n}}, independently of (X1,Y1,,Xn1,Yn1)(X_{1},Y_{1},\ldots,X_{n-1},Y_{n-1}) given XnX_{n}. Note that a decision maker may randomize his choice, but can only use information available up to the point in time when the choice is made.

Formally, a strategy of the decision maker in this game (“bandit strategy”) can be described by an infinite sequence of measurable mappings, φ=(φ1,φ2,)\varphi=(\varphi_{1},\varphi_{2},\ldots), where φn\varphi_{n} maps the space of past observations,

to the space of probability measures over X\mathcal{X}. By convention, φ1\varphi_{1} does not take any argument. A strategy is called deterministic if for every nn, φn\varphi_{n} is a Dirac distribution.

The goal of the decision maker is to maximize his expected cumulative reward. Equivalently, the goal can be expressed as minimizing the expected cumulative regret, which is defined as follows. Let

be the best expected payoff in a single round. At round nn, the cumulative regret of a decision maker playing B{\cal B} is

Finally, we define the cumulative pseudo-regret as

that is, the actual rewards used in the definition of the regret are replaced by the mean-payoffs of the arms pulled. Since (by the tower rule)

As it is argued in , in many real-world problems, the decision maker is not interested in his cumulative regret but rather in its simple regret. The latter can be defined as follows. After nn rounds of play in a stochastic bandit problem B\mathcal{B}, the decision maker is asked to make a recommendation ZnXZ_{n}\in\mathcal{X} based on the nn obtained rewards Y1,,YnY_{1},\ldots,Y_{n}. The simple regret of this recommendation equals

In this paper we focus on the cumulative regret RnR_{n}, but all the results can be readily extended to the simple regret by considering the recommendation Zn=XTnZ_{n}=X_{T_{n}}, where TnT_{n} is drawn uniformly at random in {1,,n}\{1,\ldots,n\}. Indeed, in this case,

The Hierarchical Optimistic Optimization (HOO) strategy

The HOO strategy (cf. Algorithm 1) incrementally builds an estimate of the mean-payoff function ff over X\mathcal{X}. The core idea (as in previous works) is to estimate ff precisely around its maxima, while estimating it loosely in other parts of the space X\mathcal{X}. To implement this idea, HOO maintains a binary tree whose nodes are associated with measurable regions of the arm-space X\mathcal{X} such that the regions associated with nodes deeper in the tree (further away from the root) represent increasingly smaller subsets of X\mathcal{X}. The tree is built in an incremental manner. At each node of the tree, HOO stores some statistics based on the information received in previous rounds. In particular, HOO keeps track of the number of times a node was traversed up to round nn and the corresponding empirical average of the rewards received so far. Based on these, HOO assigns an optimistic estimate (denoted by BB) to the maximum mean-payoff associated with each node. These estimates are then used to select the next node to “play”. This is done by traversing the tree, beginning from the root, and always following the node with the highest BB–value (cf. lines 4–14 of Algorithm 1). Once a node is selected, a point in the region associated with it is chosen (line 16) and is sent to the environment. Based on the point selected and the received reward, the tree is updated (lines 18–33).

The tree of coverings which HOO needs to receive as an input is an infinite binary tree whose nodes are associated with subsets of X\mathcal{X}. The nodes in this tree are indexed by pairs of integers (h,i)(h,i); node (h,i)(h,i) is located at depth h0h\geqslant 0 from the root. The range of the second index, ii, associated with nodes at depth hh is restricted by 1i2h1\leqslant i\leqslant 2^{h}. Thus, the root node is denoted by (0,1)(0,1). By convention, (h+1,2i1)(h+1,2i-1) and (h+1,2i)(h+1,2i) are used to refer to the two children of the node (h,i)(h,i). Let Ph,iX\mathcal{P}_{h,i}\subset\mathcal{X} be the region associated with node (h,i)(h,i). By assumption, these regions are measurable and must satisfy the constraints

As a corollary, the regions Ph,i\mathcal{P}_{h,i} at any level h0h\geqslant 0 cover the space X\mathcal{X},

In the algorithm listing the recursive computation of the BB–values (lines 28–33) makes a local copy of the tree; of course, this part of the algorithm could be implemented in various other ways. Other arbitrary choices in the algorithm as shown here are how tie breaking in the node selection part is done (lines 9–12), or how a point in the region associated with the selected node is chosen (line 16). We note in passing that implementing these differently would not change our theoretical results.

To facilitate the formal study of the algorithm, we shall need some more notation. In particular, we shall introduce time-indexed versions (Tn\mathcal{T}_{n}, (Hn,In)(H_{n},I_{n}), XnX_{n}, YnY_{n}, μ^h,i(n)\widehat{\mu}_{h,i}(n), etc.) of the quantities used by the algorithm. The convention used is that the indexation by nn is used to indicate the value taken at the end of the nthn^{\rm th} round.

In particular, Tn\mathcal{T}_{n} is used to denote the finite subtree stored by the algorithm at the end of round nn. Thus, the initial tree is T0={(0,1)}\mathcal{T}_{0}=\{(0,1)\} and it is expanded round after round as

where (Hn,In)(H_{n},I_{n}) is the node selected in line 15. We call (Hn,In)(H_{n},I_{n}) the node played in round nn. We use XnX_{n} to denote the point selected by HOO in the region associated with the node played in round nn, while YnY_{n} denotes the received reward.

Node selection works by comparing BB–values and always choosing the node with the highest BB–value. The BB–value, Bh,i(n)B_{h,i}(n), at node (h,i)(h,i) by the end of round nn is an estimated upper bound on the mean-payoff function at node (h,i)(h,i). To define it we first need to introduce the average of the rewards received in rounds when some descendant of node (h,i)(h,i) was chosen (by convention, each node is a descendant of itself):

Here, C(h,i)\mathcal{C}(h,i) denotes the set of all descendants of a node (h,i)(h,i) in the infinite tree,

and Th,i(n)T_{h,i}(n) is the number of times a descendant of (h,i)(h,i) is played up to and including round nn, that is,

A key quantity determining Bh,i(n)B_{h,i}(n) is Uh,i(n)U_{h,i}(n), an initial estimate of the maximum of the mean-payoff function in the region Ph,i\mathcal{P}_{h,i} associated with node (h,i)(h,i):

In the expression corresponding to the case Th,i(n)>0T_{h,i}(n)>0, the first term added to the average of rewards accounts for the uncertainty arising from the randomness of the rewards that the average is based on, while the second term, ν1ρh\nu_{1}\rho^{h}, accounts for the maximum possible variation of the mean-payoff function over the region Ph,i\mathcal{P}_{h,i}. The actual bound on the maxima used in HOO is defined recursively by

The role of Bh,i(n)B_{h,i}(n) is to put a tight, optimistic, high-probability upper bound on the best mean-payoff that can be achieved in the region Ph,i\mathcal{P}_{h,i}. By assumption, Ph,i=Ph+1,2i1Ph+1,2i\mathcal{P}_{h,i}=\mathcal{P}_{h+1,2i-1}\cup\mathcal{P}_{h+1,2i}. Thus, assuming that Bh+1,2i1(n)B_{h+1,2i-1}(n) (resp., Bh+1,2i(n)B_{h+1,2i}(n)) is a valid upper bound for region Ph+1,2i1\mathcal{P}_{h+1,2i-1} (resp., Ph+1,2i\mathcal{P}_{h+1,2i}), we see that \max\bigl{\{}B_{h+1,2i-1}(n),B_{h+1,2i}(n)\bigr{\}} must be a valid upper bound for region Ph,i\mathcal{P}_{h,i}. Since Uh,i(n)U_{h,i}(n) is another valid upper bound for region Ph,i\mathcal{P}_{h,i}, we get a tighter (less overoptimistic) upper bound by taking the minimum of these bounds.

Obviously, for leafs (h,i)(h,i) of the tree Tn\mathcal{T}_{n}, one has Bh,i(n)=Uh,i(n)B_{h,i}(n)=U_{h,i}(n), while close to the root one may expect that Bh,i(n)<Uh,i(n)B_{h,i}(n)<U_{h,i}(n); that is, the upper bounds close to the root are expected to be less biased than the ones associated with nodes farther away from the root.

Note that at the beginning of round nn, the algorithm uses Bh,i(n1)B_{h,i}(n-1) to select the node (Hn,In)(H_{n},I_{n}) to be played (since Bh,i(n)B_{h,i}(n) will only be available at the end of round nn). It does so by following a path from the root node to an inner node with only one child or a leaf and finally considering a child (Hn,In)(H_{n},I_{n}) of the latter; at each node of the path, the child with highest BB–value is chosen, till the node (Hn,In)(H_{n},I_{n}) with infinite BB–value is reached.

Figure 1 illustrates the computation done by HOO in round nn, as well as the correspondence between the nodes of the tree constructed by the algorithm and their associated regions. Figure 2 shows trees built by running HOO for a specific environment.

At the end of round nn, the size of the active tree Tn\mathcal{T}_{n} is at most nn, making the storage requirements of HOO linear in nn. In addition, the statistics and BB–values of all nodes in the active tree need to be updated, which thus takes time O(n)O(n). HOO runs in time O(n)O(n) at each round nn, making the algorithm’s total running time up to round nn quadratic in nn. In Section 4.3 we modify HOO so that if the time horizon n0n_{0} is known in advance, the total running time is O(n0lnn0)O(n_{0}\ln n_{0}), while the modified algorithm will be shown to enjoy essentially the same regret bound as the original version.

Main results

We start by describing and commenting on the assumptions that we need to analyze the regret of HOO. This is followed by stating the first upper bound, followed by some improvements on the basic algorithm. The section is finished by the statement of our results on the minimax optimality of HOO.

The main assumption will concern the “smoothness” of the mean-payoff function. However, somewhat unconventionally, we shall use a notion of smoothness that is built around dissimilarity functions rather than distances, allowing us to deal with function classes of highly different smoothness degrees in a unified manner. Before stating our smoothness assumptions, we define the notion of a dissimilarity function and some associated concepts.

However, it is also natural to wonder what is the class of functions for which the algorithm (given a fixed tree) can achieve non-trivial regret bounds; a similar question for regression was investigated e.g., by Yang . We shall indicate below how to construct a subset of such a class, right after stating our assumptions connecting the tree, the dissimilarity, and the environment (the mean-payoff function). Of these, Assumption A2. will be interpreted, discussed, and equivalently reformulated below into (4), a form that might be more intuitive. The form (3) stated below will turn out to be the most useful one in the proofs.

There exists ν2>0\nu_{2}>0 such that for all integers h0h\geqslant 0,

for all i=1,,2hi=1,\ldots,2^{h}, there exists xh,iPh,ix^{\circ}_{h,i}\in\mathcal{P}_{h,i} such that

Bh,iBh,j=\mathcal{B}_{h,i}\cap\mathcal{B}_{h,j}=\emptyset for all 1i<j2h1\leqslant i<j\leqslant 2^{h}.

The mean-payoff function ff satisfies that for all x,yXx,y\in\mathcal{X},

We show next how a tree induces in a natural way first a dissimilarity and then a class of environments. For this, we need to assume that the tree of coverings (Ph,i)(\mathcal{P}_{h,i}) –in addition to (1a) and (1b)– is such that the subsets Ph,i\mathcal{P}_{h,i} and Ph,j\mathcal{P}_{h,j} are disjoint whenever 1i<j2h1\leqslant i<j\leqslant 2^{h} and that none of them is empty. Then, each xXx\in\mathcal{X} corresponds to a unique path in the tree, which can be represented as an infinite binary sequence x0x1x2x_{0}x_{1}x_{2}\ldots, where

For points x,yXx,y\in\mathcal{X} with respective representations x0x1x_{0}x_{1}\ldots and y0y1y_{0}y_{1}\ldots, we let

It is not hard to see that this dissimilarity satisfies A1.. Thus, the associated class of environments C{\cal C} is formed by those with mean-payoff functions satisfying A2. with the so-defined dissimilarity. This is a “natural class” underlying the tree for which our tree-based algorithm can achieve non-trivial regret. (However, we do not know if this is the largest such class.)

In general, Assumption A1. ensures that the regions in the tree of coverings (Ph,i)(\mathcal{P}_{h,i}) shrink exactly at a geometric rate. The following example shows how to satisfy A1. when the domain X\mathcal{X} is a DD–dimensional hyper-rectangle and the dissimilarity is some positive power of the Euclidean (or supremum) norm.

We now argue that Assumption A1. is satisfied. With no loss of generality we take X=D\mathcal{X}=^{D}. Then, for all integers u0u\geqslant 0 and 0kD10\leqslant k\leqslant D-1,

It is now easy to see that Assumption A1. is satisfied for the indicated dissimilarity, e.g., with the choice of the parameters ρ=2a/D\rho=2^{-a/D} and \nu_{1}=b\,\bigl{(}2\sqrt{D}\bigr{)}^{a} for HOO, and the value ν2=b/2a\nu_{2}=b/2^{a}.

This time, Assumption A1. is satisfied, e.g., with the choice of the parameters ρ=2a/D\rho=2^{-a/D} and ν1=b2a\nu_{1}=b\,2^{a} for HOO, and the value ν2=b/2a\nu_{2}=b/2^{a}.

Here is another interpretation of these two facts; it will be useful when considering local assumptions in Section 4.4 (a weaker set of assumptions). First, concerning the behavior around global maxima, Assumption A2. implies that for any set AX\mathcal{A}\subset\mathcal{X} with supxAf(x)=f\sup_{x\in\mathcal{A}}f(x)=f^{*},

denotes the set of ε\varepsilon–optimal arms. This second property essentially states that there is no sudden and large drop in the mean-payoff function around the global maxima (note that this property can be satisfied even for discontinuous functions).

Figure 3 presents an illustration of the two properties discussed above.

Before stating our main results, we provide a straightforward, though useful consequence of Assumptions A1. and A2., which should be seen as an intuitive justification for the third term in (2).

Δh,i\Delta_{h,i} is called the suboptimality factor of node (h,i)(h,i). Depending whether it is positive or not, a node (h,i)(h,i) is called suboptimal (Δh,i>0\Delta_{h,i}>0) or optimal (Δh,i=0\Delta_{h,i}=0).

Under Assumptions A1. and A2., if the suboptimality factor Δh,i\Delta_{h,i} of a region Ph,i\mathcal{P}_{h,i} is bounded by cν1ρhc\nu_{1}\rho^{h} for some c0c\geqslant 0, then all arms in Ph,i\mathcal{P}_{h,i} are max{2c,c+1}ν1ρh\max\{2c,c+1\}\,\nu_{1}\rho^{h}–optimal, that is,

Proof For all δ>0\delta>0, we denote by xh,i(δ)x^{*}_{h,i}(\delta) an element of Ph,i\mathcal{P}_{h,i} such that

By the weak Lipschitz property (Assumption A2.), it then follows that for all yPh,iy\in\mathcal{P}_{h,i},

Letting δ0\delta\to 0 and substituting the bounds on the suboptimality and on the diameter of Ph,i\mathcal{P}_{h,i} (Assumption A1) concludes the proof.

2 Upper bound for the regret of HOO

Auer et al. [6, Assumption 2] observed that the regret of a continuum-armed bandit algorithm should depend on how fast the volumes of the sets of ε\varepsilon–optimal arms shrink as ε0\varepsilon\rightarrow 0. Here, we capture this by defining a new notion, the near-optimality dimension of the mean-payoff function. The connection between these concepts, as well as with the zooming dimension defined by Kleinberg et al. , will be further discussed in Section 5. We start by recalling the definition of packing numbers.

The following example shows that using a dissimilarity (rather than a metric, for instance) may sometimes allow for a significant reduction of the near-optimality dimension.

Note that in all these cases the cc–near-optimality dimension of ff is independent of the value of cc.

We can now state our first main result. The proof is presented in Section A.1.

Note that if dd is infinite, then the bound is vacuous. The constant γ\gamma in the theorem depends on dd^{\prime} and on all other parameters of HOO and of the assumptions, as well as on the bandit environment MM. (The value of γ\gamma is determined in the analysis; it is in particular proportional to ν2d\nu_{2}^{-d^{\prime}}.) The next section will exhibit a refined upper bound with a more explicit value of γ\gamma in terms of all these parameters.

The tuning of the parameters of HOO is critical for the assumptions to be satisfied, thus to achieve a good regret; given some environment, one should select the parameters of HOO such that the near-optimality dimension of the mean-payoff function is minimized. Since the mean-payoff function is unknown to the user, this might be difficult to achieve. Thus, ideally, these parameters should be selected adaptively based on the observation of some preliminary sample. For now, the investigation of this possibility is left for future work.

3 Improving the running time when the time horizon is known

A deficiency of the basic HOO algorithm is that its computational complexity scales quadratically with the number of time steps. In this section we propose a simple modification to HOO that achieves essentially the same regret as HOO and whose computational complexity scales only log-linearly with the number of time steps. The needed amount of memory is still linear. We work out the case when the time horizon, n0n_{0}, is known in advance. The case of unknown horizon can be dealt with by resorting to the so-called doubling trick, see, e.g., [10, Section 2.3], which consists of periodically restarting the algorithm for regimes of lengths that double at each such fresh start, so that the rthr^{\rm th} instance of the algorithm runs for 2r2^{r} rounds.

We consider two modifications to the algorithm described in Section 3. First, the quantities Uh,i(n)U_{h,i}(n) of (2) are redefined by replacing the factor lnn\ln n by lnn0\ln n_{0}, that is, now

(This results in a policy which explores the arms with a slightly increased frequency.) The definition of the BB–values in terms of the Uh,i(n)U_{h,i}(n) is unchanged. A pleasant consequence of the above modification is that the BB–value of a given node changes only when this node is part of a path selected by the algorithm. Thus at each round nn, only the nodes along the chosen path need to be updated according to the obtained reward.

However, and this is the reason for the second modification, in the basic algorithm, a path at round nn may be of length linear in nn (because the tree could have a depth linear in nn). This is why we also truncate the trees Tn\mathcal{T}_{n} at a depth Dn0D_{n_{0}} of the order of lnn0\ln n_{0}. More precisely, the algorithm now selects the node (Hn,In)(H_{n},I_{n}) to pull at round nn by following a path in the tree Tn1\mathcal{T}_{n-1}, starting from the root and choosing at each node the child with the highest BB–value (with the new definition above using lnn0\ln n_{0}), and stopping either when it encounters a node which has not been expanded before or a node at depth equal to

(It is assumed that n0>1/ν12n_{0}>1/\nu_{1}^{2} so that Dn01D_{n_{0}}\geqslant 1.) Note that since no child of a node (Dn0,i)(D_{n_{0}},i) located at depth Dn0D_{n_{0}} will ever be explored, its BB–value at round nn0n\leqslant n_{0} simply equals UDn0,i(n)U_{D_{n_{0}},i}(n).

We call this modified version of HOO the truncated HOO algorithm. The computational complexity of updating all BB–values at each round nn is of the order of Dn0D_{n_{0}} and thus of the order of lnn0\ln n_{0}. The total computational complexity up to round n0n_{0} is therefore of the order of n0lnn0n_{0}\ln n_{0}, as claimed in the introduction of this section.

As the next theorem indicates this new procedure enjoys almost the same cumulative regret bound as the basic HOO algorithm.

Fix a horizon n0n_{0} such that Dn01D_{n_{0}}\geqslant 1. Then, the regret bound of Theorem 6 still holds true at round n0n_{0} for truncated HOO up to an additional additive 4n04\sqrt{n_{0}} factor.

4 Local assumptions

In this section we further relax the weak Lipschitz assumption and require it only to hold locally around the maxima. Doing so, we will be able to deal with an even larger class of functions and in fact we will show that the algorithm studied in this section achieves a O(n)O(\sqrt{n}) bound on the regret regret when it is used for functions that are smooth around their maxima (e.g., equivalent to \Arrowvertxx\Arrowvertα\Arrowvert x-x^{*}\Arrowvert^{\alpha} for some known smoothness degree α>0\alpha>0).

For the sake of simplicity and to derive exact constants we also state in a more explicit way the assumption on the near-optimality dimension. We then propose a simple and efficient adaptation of the HOO algorithm suited for this context.

Further, there exists L>0L>0 such that for all xXε0x\in\mathcal{X}_{\varepsilon_{0}} and ε[0,ε0]\varepsilon\in[0,\varepsilon_{0}],

There exist C>0C>0 and d>0d>0 such that for all εε0\varepsilon\leqslant\varepsilon_{0},

Assumption A3. is not a real assumption but merely a reformulation of the definition of near optimality (with the small added ingredient that the limit can be achieved, see the second step of the proof of Theorem 6 in Section A.1).

Proof Note that x=(0,,0)x^{*}=(0,\ldots,0) is such that f=1=f(x)f^{*}=1=f(x^{*}). Therefore, for all xXx\in\mathcal{X},

which concludes the proof of the second part of A2’..

4.2 Modified HOO algorithm

We now describe the proposed modifications to the basic HOO algorithm.

We first consider, as a building block, the algorithm called zz–HOO, which takes an integer zz as an additional parameter to those of HOO. Algorithm zz–HOO works as follows: it never plays any node with depth smaller or equal to z1z-1 and starts directly the selection of a new node at depth zz. To do so, it first picks the node at depth zz with the best BB–value, chooses a path and then proceeds as the basic HOO algorithm. Note in particular that the initialization of this algorithm consists (in the first 2z2^{z} rounds) in playing once each of the 2z2^{z} nodes located at depth zz in the tree (since by definition a node that has not been played yet has a BB–value equal to ++\infty). We note in passing that when z=0z=0, algorithm zz–HOO coincides with the basic HOO algorithm.

Algorithm local-HOO employs the doubling trick in conjunction with consecutive instances of zz–HOO. It works as follows. The integers r1r\geqslant 1 will index different regimes. The rthr^{\rm th} regime starts at round 2r12^{r}-1 and ends when the next regime starts; it thus lasts for 2r2^{r} rounds. At the beginning of regime rr, a fresh copy of zrz_{r}–HOO, where zr=log2rz_{r}=\lceil\log_{2}r\rceil, is initialized and is then used throughout the regime.

Note that each fresh start needs to pull each of the 2zr2^{z_{r}} nodes located at depth zrz_{r} at least once (the number of these nodes is r\approx r). However, since round rr lasts for 2r2^{r} time steps (which is exponentially larger than the number of nodes to explore), the time spent on the initialization of zrz_{r}–HOO in any regime rr is greatly outnumbered by the time spent in the rest of the regime.

In the rest of this section, we propose first an upper bound on the regret of zz–HOO (with exact and explicit constants). This result will play a key role in proving a bound on the performance of local-HOO.

4.3 Adaptation of the regret bound

In the following we write h0h_{0} for the smallest integer such that

and consider the algorithm zz–HOO, where zh0z\geqslant h_{0}. In particular, when z=0z=0 is chosen, the obtained bound is the same as the one of Theorem 6, except that the constants are given in analytic forms.

then the following bound holds for the expected regret of zz–HOO:

The proof, which is a modification of the proof to Theorem 6, can be found in Section A.3 of the Appendix. The main complication arises because the weakened assumptions do not allow one to reason about the smoothness at an arbitrary scale; this is essentially due to the threshold ε0\varepsilon_{0} used in the formulation of the assumptions. This is why in the proposed variant of HOO we discard nodes located too close to the root (at depth smaller than h01h_{0}-1). Note that in the bound the second term arises from playing in regions corresponding to the descendants of “poor” nodes located at level zz. In particular, this term disappears when z=0z=0, in which case we get a bound on the regret of HOO provided that 2ν1<ε02\nu_{1}<\varepsilon_{0} holds.

Under the prescribed assumptions, the rate of convergence is of order n\sqrt{n} no matter the ambient dimension DD. Although the rate is independent of DD, the latter impacts the performance through the multiplicative factor in front of the rate, which is exponential in DD. This is, however, not an artifact of our analysis, since it is natural that exploration in a DD–dimensional space comes at a cost exponential in DD. (The exploration performed by HOO naturally combines an initial global search, which is bound to be exponential in DD, and a local optimization, whose regret is of the order of n\sqrt{n}.)

The following theorem is an almost straightforward consequence of Theorem 9 (the detailed proof can be found in Section A.4 of the Appendix). Note that local-HOO does not require the knowledge of the parameter ε0\varepsilon_{0} in A2’..

5 Minimax optimality in metric spaces

In this section we provide two theorems showing the minimax optimality of HOO in metric spaces. The notion of packing dimension is key.

The reader interested in the explicit expressions of N(c,D)N(c,D) and γ(c,D)\gamma(c,D) is referred to the last lines of the proof of the theorem in the Appendix.

Discussion

In this section we would like to shed some light on the results of the previous sections. In particular we generalize the situation of Example 5, discuss the regret that we can obtain, and compare it with what could be obtained by previous works.

We equip X=D\mathcal{X}=^{D} with a norm \|\,\cdot\,\|. We assume that the mean-payoff function ff has a finite number of global maxima and that it is locally equivalent to the function xxα\|x-x^{*}\|^{\alpha} –with degree α[0,)\alpha\in[0,\infty)– around each such global maximum xx^{*} of ff; that is,

This means that there exist c1,c2,δ>0c_{1},c_{2},\delta>0 such that for all xx satisfying xxδ\|x-x^{*}\|\leqslant\delta,

Smoothness overestimated: Now, if the true smoothness is overestimated by choosing β>α\beta>\alpha or α=β\alpha=\beta and c<c1c<c_{1}, then the assumption of weak Lipschitzness is violated and we are unable to provide any guarantee on the behavior of HOO. The latter, when used with an overestimated smoothness parameter, may lack exploration and exploit too heavily from the beginning. As a consequence, it may get stuck in some local optimum of ff, missing the global one(s) for a very long time (possibly indefinitely). Such a behavior is illustrated in the example provided in and showing the possible problematic behavior of the closely related algorithm UCT of . UCT is an example of an algorithm overestimating the smoothness of the function; this is because the BB–values of UCT are defined similarly to the ones of the HOO algorithm but without the third term in the definition (2) of the UU–values. This corresponds to an assumed infinite degree of smoothness (that is, to a locally constant mean-payoff function).

2 Relation to previous works

Several works [3; 21; 12; 6; 22] have considered continuum-armed bandits in Euclidean or, more generally, normed or metric spaces and provided upper and lower bounds on the regret for given classes of environments.

Kleinberg considered mean-payoff functions ff on the real line that are Hölder continuous with degree 0<α10<\alpha\leqslant 1. The derived regret bound is \Theta\bigl{(}n^{(\alpha+1)/(\alpha+2)}\bigr{)}.

Auer et al. extended the analysis to classes of functions that are equivalent to \Arrowvertxx\Arrowvertα\Arrowvert x-x^{*}\Arrowvert^{\alpha} around their maxima xx^{*}, where the allowed smoothness degree is also larger: α[0,)\alpha\in[0,\infty). They derived the regret bound

where the parameter β\beta is such that the Lebesgue measure of ε\varepsilon–optimal arm is O(εβ)O(\varepsilon^{\beta}).

For an illustration, consider again the example of Section 5.1. The result of Auer et al. shows that for D=1D=1, the regret is Θ(n)\Theta(\sqrt{n}) (since here β=1/α\beta=1/\alpha, with the notation above). Our result extends the n\sqrt{n} rate of the regret bound to any dimension DD.

On the other hand the analysis of Kleinberg et al. does not apply because in this example f(x)f(x)f(x^{*})-f(x) is controlled only when xx is close in some sense to xx^{*} (i.e., when \Arrowvertxx\Arrowvertδ\Arrowvert x-x^{*}\Arrowvert\leqslant\delta), while (9) requires such a control over the whole set X\mathcal{X}. However, note that the local weak-Lipschitz assumption A2’. requires an extra condition in the vicinity of xx^{*} compared to (9) as it is based on the notion of weak Lipschitzness. Thus, A2’. and (9) are in general incomparable (both capture a different phenomenon at the maxima).

We now compare our results to those of and under Assumption A2. (which does not cover the example of Section 5.1 unless δ\delta is large). Under this assumption, our algorithms enjoy essentially the same theoretical guarantees as the zooming algorithm of [22; 23]. Further, the following hold.

Our algorithms do not require the oracle needed by the zooming algorithm.

Our truncated HOO algorithm achieves a computational complexity of order O(nlogn)O(n\log n), whereas the complexity of a naive implementation of the zooming algorithm is likely to be much larger.The zooming algorithm requires a covering oracle that is able to return a point which is not covered by the set of active strategies, if there exists one. Thus a straightforward implementation of this covering oracle might be computationally expensive in (general) continuous spaces and would require a ‘global’ search over the whole space.

Both truncated HOO and the zooming algorithms use the doubling trick. The basic HOO algorithm, however, avoids the doubling trick, while meeting the computational complexity of the zooming algorithm.

The fact that the doubling trick can be avoided is good news since an algorithm that uses the doubling trick must start from tabula rasa time to time, which results in predictable, yet inevitable, sharp performance drops –a quite unpleasant property. In particular, for this reason algorithms that rely on the doubling trick are often neglected by practitioners. In addition, the fact that we avoid the oracle needed by the zooming algorithm is attractive as this oracle might be difficult to implement for general (non-metric) dissimilarities.

Acknowledgements

We thank one of the anonymous referee for his valuable comments, which helped us to provide a fair and detailed comparison of our work to prior contributions.

This work was supported in part by French National Research Agency (ANR, project EXPLO-RA, ANR-08-COSI-004), the Alberta Ingenuity Centre of Machine Learning, Alberta Innovates Technology Futures (formerly iCore and AIF), NSERC and the PASCAL2 Network of Excellence under EC grant no. 216886.

Appendix A Proofs

We begin with three lemmas. The proofs of Lemmas 15 and 16 rely on concentration-of-measure techniques, while the one of Lemma 14 follows from a simple case study. Let us fix some path (0,1)(0,1), (1,i1)(1,i^{*}_{1}), (2,i2)(2,i^{*}_{2}), \ldots\, of optimal nodes, starting from the root. That is, denoting i0=1i^{*}_{0}=1, we mean that for all j1j\geqslant 1, the suboptimality of (j,ij)(j,i^{*}_{j}) equals Δj,ij=0\Delta_{j,i^{*}_{j}}=0 and (j,ij)(j,i^{*}_{j}) is a child of (j1,ij1)(j-1,i^{*}_{j-1}).

Let (h,i)(h,i) be a suboptimal node. Let 0kh10\leqslant k\leqslant h-1 be the largest depth such that (k,ik)(k,i^{*}_{k}) is on the path from the root (0,1)(0,1) to (h,i)(h,i). Then for all integers u0u\geqslant 0, we have

Proof Consider a given round t{1,,n}t\in\{1,\ldots,n\}. If (Ht,It)C(h,i)(H_{t},I_{t})\in\mathcal{C}(h,i), then this is because the child (k+1,i)(k+1,i^{\prime}) of (k,ik)(k,i^{*}_{k}) on the path to (h,i)(h,i) had a better BB–value than its brother (k+1,ik+1)(k+1,i^{*}_{k+1}). Since by definition, BB–values can only increase on a chosen path, this entails that Bk+1,ik+1Bk+1,i(t)Bh,i(t)B_{k+1,i^{*}_{k+1}}\leqslant B_{k+1,i^{\prime}}(t)\leqslant B_{h,i}(t). This is turns implies, again by definition of the BB–values, that Bk+1,ik+1(t)Uh,i(t)B_{k+1,i^{*}_{k+1}}(t)\leqslant U_{h,i}(t). Thus,

But, once again by definition of BB–values,

and the argument can be iterated. Since up to round tt no more than tt nodes have been played (including the suboptimal node (h,i)(h,i)), we know that (t,it)(t,i^{*}_{t}) has not been played so far and thus has a BB–value equal to ++\infty. (Some of the previous optimal nodes could also have had an infinite UU–value, if not played so far.) We thus have proved the inclusion

Now, for any integer u0u\geqslant 0 it holds that

where we used for the inequality the fact that the quantities Th,i(t)T_{h,i}(t) are constant from tt to t+1t+1, except when (Ht,It)C(h,i)(H_{t},I_{t})\in\mathcal{C}(h,i), in which case, they increase by 1; therefore, on the one hand, at most uu of the Th,i(t)T_{h,i}(t) can be smaller than uu and on the other hand, Th,i(t)>uT_{h,i}(t)>u can only happen if t>ut>u. Using (11) and then taking expectations yields the result.

Let Assumptions A1. and A2. hold. Then, for all optimal nodes (h,i)(h,i) and for all integers n1n\geqslant 1,

Proof On the event that (h,i)(h,i) was not played during the first nn rounds, one has, by convention, Uh,i(n)=+U_{h,i}(n)=+\infty. In the sequel, we therefore restrict our attention to the event \bigl{\{}T_{h,i}(n)\geqslant 1\bigr{\}}.

Lemma 3 with c=0c=0 ensures that ff(x)ν1ρhf^{*}-f(x)\leqslant\nu_{1}\rho^{h} for all arms xPh,ix\in\mathcal{P}_{h,i}. Hence,

We take care of the last term with a union bound and the Hoeffding-Azuma inequality for martingale differences.

To do this in a rigorous manner, we need to define a sequence of (random) stopping times when arms in C(h,i)\mathcal{C}(h,i) were pulled:

Note that 1T1<T2<1\leqslant T_{1}<T_{2}<\ldots, hence it holds that TjjT_{j}\geqslant j. We denote by X~j=XTj\widetilde{X}_{j}=X_{T_{j}} the jthj^{\rm th} arm pulled in the region corresponding to C(h,i)\mathcal{C}(h,i). Its associated corresponding reward equals Y~j=YTj\widetilde{Y}_{j}=Y_{T_{j}} and

where we used a union bound to get the last inequality.

is a martingale w.r.t. the filtration \mathcal{G}_{t}=\sigma\bigl{(}\widetilde{X}_{1},Z_{1},\ldots,\widetilde{X}_{t},Z_{t},\widetilde{X}_{t+1}\bigr{)}. This follows, via optional skipping (see [14, Chapter VII, adaptation of Theorem 2.3]), from the facts that

is a martingale w.r.t. the filtration Ft=σ(X1,Y1,,Xt,Yt,Xt+1)\mathcal{F}_{t}=\sigma(X_{1},Y_{1},\ldots,X_{t},Y_{t},X_{t+1}) and that the events {Tj=k}\{T_{j}=k\} are Fk1\mathcal{F}_{k-1}–measurable for all kjk\geqslant j.

Applying the Hoeffding-Azuma inequality for martingale differences (see ), using the boundedness of the ranges of the induced martingale difference sequence, we then get, for each t1t\geqslant 1,

For all integers tnt\leqslant n, for all suboptimal nodes (h,i)(h,i) such that Δh,i>ν1ρh\Delta_{h,i}>\nu_{1}\rho^{h}, and for all integers u1u\geqslant 1 such that

Proof The uu mentioned in the statement of the lemma are such that

Now it follows from the same arguments as in the proof of Lemma 15 (optional skipping, the Hoeffding-Azuma inequality, and a union bound) that

where we used the stated bound on uu to obtain the last inequality.

Combining the results of Lemmas 14, 15, and 16 leads to the following key result bounding the expected number of visits to descendants of a “poor” node.

Under Assumptions A1. and A2., for all suboptimal nodes (h,i)(h,i) with Δh,i>ν1ρh\Delta_{h,i}>\nu_{1}\rho^{h}, we have, for all n1n\geqslant 1,

Proof We take uu as the upper integer part of (8lnn)/(Δh,iν1ρh)2(8\,\ln n)/(\Delta_{h,i}-\nu_{1}\rho^{h})^{2} and use union bounds to get from Lemma 14 the bound

Lemmas 15 and 16 further bound the quantity of interest as

Proof (of Theorem 6) First, let us fix d>dd^{\prime}>d. The statement will be proven in four steps.

First step. For all h=0,1,2,h=0,1,2,\ldots, denote by Ih\mathcal{I}_{h} the set of those nodes at depth hh that are 2ν1ρh2\nu_{1}\rho^{h}–optimal, i.e., the nodes (h,i)(h,i) such that fh,if2ν1ρhf_{h,i}^{*}\geqslant f^{*}-2\nu_{1}\rho^{h}. (Of course, I0={(0,1)}\mathcal{I}_{0}=\{(0,1)\}.) Then, let I\mathcal{I} be the union of these sets when hh varies. Further, let J\mathcal{J} be the set of nodes that are not in I\mathcal{I} but whose parent is in I\mathcal{I}. Finally, for h=1,2,h=1,2,\ldots we denote by Jh\mathcal{J}_{h} the nodes in J\mathcal{J} that are located at depth hh in the tree (i.e., whose parent is in Ih1\mathcal{I}_{h-1}).

Lemma 17 bounds in particular the expected number of times each node (h,i)Jh(h,i)\in\mathcal{J}_{h} is visited. Since for these nodes Δh,i>2ν1ρh\Delta_{h,i}>2\nu_{1}\rho^{h}, we get

Second step. We bound the cardinality Ih|\mathcal{I}_{h}| of Ih\mathcal{I}_{h}. We start with the case h1h\geqslant 1. By definition, when (h,i)Ih(h,i)\in\mathcal{I}_{h}, one has Δh,i2ν1ρh\Delta_{h,i}\leqslant 2\nu_{1}\rho^{h}, so that by Lemma 3 the inclusion Ph,iX4ν1ρh\mathcal{P}_{h,i}\subset\mathcal{X}_{4\nu_{1}\rho^{h}} holds. Since by Assumption A1., the sets Ph,i\mathcal{P}_{h,i} contain disjoint balls of radius ν2ρh\nu_{2}\rho^{h}, we have that

We prove below that there exists a constant CC such that for all εν2\varepsilon\leqslant\nu_{2},

Thus we obtain the bound |\mathcal{I}_{h}|\leqslant C\,\bigl{(}\nu_{2}\rho^{h}\bigr{)}^{-d^{\prime}} for all h1h\geqslant 1. We note that the obtained bound |\mathcal{I}_{h}|\leqslant C\,\bigl{(}\nu_{2}\rho^{h}\bigr{)}^{-d^{\prime}} is still valid for h=0h=0, since I0=1|\mathcal{I}_{0}|=1.

It only remains to prove (13). Since d>dd^{\prime}>d, where dd is the near-optimality of ff, we have, by definition, that

and thus, there exists εd>0\varepsilon_{d^{\prime}}>0 such that for all εεd\varepsilon\leqslant\varepsilon_{d^{\prime}},

which in turn implies that for all εεd\varepsilon\leqslant\varepsilon_{d^{\prime}},

The result is proved with C=1C=1 if εdν2\varepsilon_{d^{\prime}}\geqslant\nu_{2}. Now, consider the case εd<ν2\varepsilon_{d^{\prime}}<\nu_{2}. Given the definition of packing numbers, it is straightforward that for all \varepsilon\in\bigl{[}\varepsilon_{d^{\prime}},\,\nu_{2}\bigr{]},

therefore, for all \varepsilon\in\bigl{[}\varepsilon_{d^{\prime}},\,\nu_{2}\bigr{]},

for the choice C=\max\bigl{\{}1,\,\,u_{d^{\prime}}\,\nu_{2}^{d^{\prime}}\bigr{\}}. Because we take the maximum with 1, the stated inequality also holds for εεd\varepsilon\leqslant\varepsilon^{-d^{\prime}}, which concludes the proof of (13).

Third step. Let H1H\geqslant 1 be an integer to be chosen later. We partition the nodes of the infinite tree T\mathcal{T} into three subsets, T=T1T2T3\mathcal{T}=\mathcal{T}^{1}\cup\mathcal{T}^{2}\cup\mathcal{T}^{3}, as follows. Let the set T1\mathcal{T}^{1} contain the descendants of the nodes in IH\mathcal{I}_{H} (by convention, a node is considered its own descendant, hence the nodes of IH\mathcal{I}_{H} are included in T1\mathcal{T}^{1}); let T2=0h<HIh\mathcal{T}^{2}=\cup_{0\leqslant h<H}\,\mathcal{I}_{h}; and let T3\mathcal{T}^{3} contain the descendants of the nodes in 1hHJh\cup_{1\leqslant h\leqslant H}\,\mathcal{J}_{h}. Thus, T1\mathcal{T}^{1} and T3\mathcal{T}^{3} are potentially infinite, while T2\mathcal{T}^{2} is finite.

We recall that we denote by (Ht,It)(H_{t},I_{t}) the node that was chosen by HOO in round tt. From the definition of the algorithm, each node is played at most once, thus no two such random variables are equal when tt varies. We decompose the regret according to which of the sets Tj\mathcal{T}^{j} the nodes (Ht,It)(H_{t},I_{t}) belong to:

The contribution from T1\mathcal{T}^{1} is easy to bound. By definition any node in IH\mathcal{I}_{H} is 2ν1ρH2\nu_{1}\rho^{H}–optimal. Hence, by Lemma 3, the corresponding domain is included in X4ν1ρH\mathcal{X}_{4\nu_{1}\rho^{H}}. By definition of a tree of coverings, the domains of the descendants of these nodes are still included in X4ν1ρH\mathcal{X}_{4\nu_{1}\rho^{H}}. Therefore,

For h0h\geqslant 0, consider a node (h,i)T2(h,i)\in\mathcal{T}^{2}. It belongs to Ih\mathcal{I}_{h} and is therefore 2ν1ρh2\nu_{1}\rho^{h}–optimal. By Lemma 3, the corresponding domain is included in X4ν1ρh\mathcal{X}_{4\nu_{1}\rho^{h}}. By the result of the second step of this proof and using that each node is played at most once, one gets

We finish by bounding the contribution from T3\mathcal{T}^{3}. We first remark that since the parent of any element (h,i)Jh(h,i)\in\mathcal{J}_{h} is in Ih1\mathcal{I}_{h-1}, by Lemma 3 again, we have that Ph,iX4ν1ρh1\mathcal{P}_{h,i}\subset\mathcal{X}_{4\nu_{1}\rho^{h-1}}. We now use the first step of this proof to get

Now, it follows from the fact that the parent of Jh\mathcal{J}_{h} is in Ih1\mathcal{I}_{h-1} that Jh2Ih1|\mathcal{J}_{h}|\leqslant 2|\mathcal{I}_{h-1}| when h1h\geqslant 1. Substituting this and the bound on Ih1|\mathcal{I}_{h-1}| obtained in the second step of this proof, we get

Fourth step. Putting the obtained bounds together, we get

(recall that ρ<1\rho<1). Note that all constants hidden in the OO symbol only depend on ν1\nu_{1}, ν2\nu_{2}, ρ\rho and dd^{\prime}.

Now, by choosing HH such that ρH(d+2)\rho^{-H(d^{\prime}+2)} is of the order of n/lnnn/\ln n, that is, ρH\rho^{H} is of the order of (n/lnn)1/(d+2)(n/\ln n)^{-1/(d^{\prime}+2)}, we get the desired result, namely,

A.2 Proof of Theorem 8 (regret bound for truncated HOO)

The proof follows from an adaptation of the proof of Theorem 6 and of its associated lemmas; for the sake of clarity and precision, we explicitly state the adaptations of the latter.

Adaptations of the lemmas. Remember that Dn0D_{n_{0}} denotes the maximum depth of the tree, given horizon n0n_{0}. The adaptation of Lemma 14 is done as follows. Let (h,i)(h,i) be a suboptimal node with hDn0h\leqslant D_{n_{0}} and let 0kh10\leqslant k\leqslant h-1 be the largest depth such that (k,ik)(k,i^{*}_{k}) is on the path from the root (0,1)(0,1) to (h,i)(h,i). Then, for all integers u0u\geqslant 0, one has

As for Lemma 15, its straightforward adaptation states that under Assumptions A1. and A2., for all optimal nodes (h,i)(h,i) with hDn0h\leqslant D_{n_{0}} and for all integers 1tn01\leqslant t\leqslant n_{0},

Similarly, the same changes yield from Lemma 16 the following result for truncated HOO. For all integers tn0t\leqslant n_{0}, for all suboptimal nodes (h,i)(h,i) such that hDn0h\leqslant D_{n_{0}} and Δh,i>ν1ρh\Delta_{h,i}>\nu_{1}\rho^{h}, and for all integers u1u\geqslant 1 such that

Combining these three results (using the same methodology as in the proof of Lemma 17) shows that under Assumptions A1. and A2., for all suboptimal nodes (h,i)(h,i) such that hDn0h\leqslant D_{n_{0}} and Δh,i>ν1ρh\Delta_{h,i}>\nu_{1}\rho^{h}, one has

(We thus even improve slightly the bound of Lemma 17.)

Adaptation of the proof of Theorem 6. The main change here comes from the fact that trees are cut at the depth Dn0D_{n_{0}}. As a consequence, the sets Ih\mathcal{I}_{h}, I\mathcal{I}, J\mathcal{J}, and Jh\mathcal{J}_{h} are defined only by referring to nodes of depth smaller than Dn0D_{n_{0}}. All steps of the proof can then be repeated, except the third step; there, while the bounds on the regret resulting from nodes of T1\mathcal{T}^{1} and T3\mathcal{T}^{3} go through without any changes (as these sets were constructed by considering all descendants of some base nodes), the bound on the regret Rn,2R_{n,2} associated with the nodes T2\mathcal{T}^{2} calls for a modified proof since at this stage we used the property that each node is played at most once. But this is not true anymore for nodes (h,i)(h,i) located at depth Dn0D_{n_{0}}, which can be played several times. Therefore the proof is modified as follows.

Consider a node at depth h=Dn0h=D_{n_{0}}. Then, by definition of Dn0D_{n_{0}},

Since the considered nodes are 2ν1ρDn02\nu_{1}\rho^{D_{n_{0}}}–optimal, the corresponding domains are 4ν1ρDn04\nu_{1}\rho^{D_{n_{0}}}–optimal by Lemma 3, thus also 4/n04/\sqrt{n_{0}}–optimal. The instantaneous regret incurred when playing any of these nodes is therefore bounded by 4/n04/\sqrt{n_{0}}; and the associated cumulative regret (over n0n_{0} rounds) can be bounded by 4n04\sqrt{n_{0}}. In conclusion, with the notations of Theorem 6, we get the new bound

The rest of the proof goes through and only this additional additive factor of 4n04\sqrt{n_{0}} is suffered in the final regret bound. (The additional factor can be included in the OO notation.)

A.3 Proof of Theorem 9 (regret bound for z𝑧z–HOO)

We start with the following equivalent of Lemma 3 in this new local context. Remember that h0h_{0} is the smallest integer such that

Under Assumptions A1. and A2’., for all hh0h\geqslant h_{0}, if the suboptimality factor Δh,i\Delta_{h,i} of a region Ph,i\mathcal{P}_{h,i} is bounded by cν1ρhc\nu_{1}\rho^{h} for some cc\in, then all arms in Ph,i\mathcal{P}_{h,i} are Lmax{2c,c+1}ν1ρhL\max\{2c,\,c+1\,\}\,\nu_{1}\rho^{h}–optimal, that is,

When c=0c=0, i.e., the node (h,i)(h,i) is optimal, the bound improves to

Since xXε0x\in\mathcal{X}_{\varepsilon_{0}} and εν1ρhν1ρh0<ε0\varepsilon\leqslant\nu_{1}\rho^{h}\leqslant\nu_{1}\rho^{h_{0}}<\varepsilon_{0}, the second part of Assumption A2’. then yields

It follows from the definition of ε\varepsilon that f^{*}-f(x)+\varepsilon=\max\bigl{\{}f^{*}-f(x),\,\nu_{1}\rho^{h}\bigr{\}}, and this implies

But xXcν1ρh+δx\in\mathcal{X}_{c\nu_{1}\rho^{h}+\delta}, i.e., ff(x)cν1ρh+δf^{*}-f(x)\leqslant c\nu_{1}\rho^{h}+\delta, we thus have proved

In conclusion, Ph,iXLmax{2c,c+1}ν1ρh+2Lδ\mathcal{P}_{h,i}\subset\mathcal{X}_{L\max\{2c,\,c+1\}\nu_{1}\rho^{h}+2L\delta} for all sufficiently small δ>0\delta>0. Letting δ0\delta\to 0 concludes the proof.

that is, Ph,iXν1ρh\mathcal{P}_{h,i}\subset\mathcal{X}_{\nu_{1}\rho^{h}}.

We now provide an adaptation of Lemma 17 (actually based on adaptations of Lemmas 14 and 15), providing the same bound under local conditions that relax the assumptions of Lemma 17 to some extent.

Consider a depth zh0z\geqslant h_{0}. Under Assumptions A1. and A2.’, the algorithm zz–HOO satisfies that for all n1n\geqslant 1 and all suboptimal nodes (h,i)(h,i) with Δh,i>ν1ρh\Delta_{h,i}>\nu_{1}\rho^{h} and hzh\geqslant z,

Proof We consider some path (z,iz),(z+1,iz+1),(z,i^{*}_{z}),\,(z+1,i^{*}_{z+1}),\,\ldots of optimal nodes, starting at depth zz. We distinguish two cases, depending on whether there exists zkh1z\leqslant k^{\prime}\leqslant h-1 such that (h,i)C(k,ik)(h,i)\in\mathcal{C}(k^{\prime},i^{*}_{k^{\prime}}) or not.

In the first case, we denote kk^{\prime} the largest such kk. The argument of Lemma 14 can be used without any change and shows that for all integers u0u\geqslant 0,

In the second case, we denote by (z,ih)(z,i_{h}) the ancestor of (h,i)(h,i) located at depth zz. By definition of zz–HOO, (Ht,It)C(h,i)(H_{t},I_{t})\in\mathcal{C}(h,i) at some round t1t\geqslant 1 only if Bz,iz(t)Bz,ih(t)B_{z,i^{*}_{z}}(t)\leqslant B_{z,i_{h}}(t) and since BB–values can only increase on a chosen path, (Ht,It)C(h,i)(H_{t},I_{t})\in\mathcal{C}(h,i) can only happen if Bz,iz(t)Bh,i(t)B_{z,i^{*}_{z}}(t)\leqslant B_{h,i}(t). Repeating again the argument of Lemma 14, we get that for all integers u0u\geqslant 0,

Now, notice that Lemma 16 is valid without any assumption. On the other hand, with the modified assumptions, Lemma 15 is still true but only for optimal nodes (h,i)(h,i) with hh0h\geqslant h_{0}. Indeed, the only point in its proof where the assumptions were used was in the fourth line, when applying Lemma 3; here, Lemma 18 with c=0c=0 provides the needed guarantee.

The proof is concluded with the same computations as in the proof of Lemma 17.

Proof (of Theorem 9) We follow the four steps in the proof of Theorem 6 with some slight adjustments. In particular, for hzh\geqslant z, we use the sets of nodes Ih\mathcal{I}_{h} and Jh\mathcal{J}_{h} defined therein.

First step. Lemma 19 bounds the expected number of times each node (h,i)Jh(h,i)\in\mathcal{J}_{h} is visited. Since for these nodes Δh,i>2ν1ρh\Delta_{h,i}>2\nu_{1}\rho^{h}, we get

Second step. We bound here the cardinality Ih|\mathcal{I}_{h}|. By Lemma 18 with c=2c=2, when (h,i)Ih(h,i)\in\mathcal{I}_{h} and hzh\geqslant z, one has Ph,iX4Lν1ρh\mathcal{P}_{h,i}\subset\mathcal{X}_{4L\nu_{1}\rho^{h}}.

Now, by Assumption A1. and by using the same argument as in the second step of the proof of Theorem 6,

Assumption A3. can be applied since ν2ρh2ν1ρh2ν1ρh0ε0\nu_{2}\rho^{h}\leqslant 2\nu_{1}\rho^{h}\leqslant 2\nu_{1}\rho^{h_{0}}\leqslant\varepsilon_{0} and yields the inequality |\mathcal{I}_{h}|\leqslant C\bigl{(}\nu_{2}\rho^{h}\bigr{)}^{-d}.

Third step. We consider some integer HzH\geqslant z to be defined by the analysis in the fourth step. We define a partition of the nodes located at a depth equal to or larger than zz; more precisely,

T1\mathcal{T}^{1} contains the nodes of IH\mathcal{I}_{H} and their descendants,

T2=zhH1Ih\mathcal{T}^{2}=\displaystyle{\bigcup_{z\leqslant h\leqslant H-1}\mathcal{I}_{h}},

T3\mathcal{T}^{3} contains the nodes z+1hHJh\displaystyle{\bigcup_{z+1\leqslant h\leqslant H}\mathcal{J}_{h}} and their descendants,

T4\mathcal{T}^{4} is formed by the nodes (z,i)(z,i) located at depth zz not belonging to Iz\mathcal{I}_{z}, i.e., such that Δz,i>2ν1ρz\Delta_{z,i}>2\nu_{1}\rho^{z}, and their descendants.

As in the proof of Theorem 6 we denote by Rn,iR_{n,i} the regret resulting from the selection of nodes in Ti\mathcal{T}^{i}, for i{1,2,3,4}i\in\{1,2,3,4\}.

As for Rn,3R_{n,3}, we also use here that nodes in T3\mathcal{T}^{3} belong to some Jh\mathcal{J}_{h}, with z+1hHz+1\leqslant h\leqslant H; in particular, they are the child of some element of Ih1\mathcal{I}_{h-1} and as such, firstly, they are 4Lν1ρh14L\nu_{1}\rho^{h-1}–optimal (by Lemma 18) and secondly, their number is bounded by |\mathcal{J}_{h}|\leqslant 2|\mathcal{I}_{h-1}|\leqslant 2C\bigl{(}\nu_{2}\rho^{h-1}\bigr{)}^{-d}. Thus,

where we used the bound of Lemma 19. Finally, for T4\mathcal{T}^{4}, we use that it contains at most 2z12^{z}-1 nodes, each of them being associated with a regret controlled by Lemma 19; therefore,

Fourth step. Putting things together, we have proved that

where (using that ρ<1\rho<1 in the second inequality)

It remains to define the parameter HzH\geqslant z. In particular, we propose to choose it such that the terms

are balanced. To this end, let HH be the smallest integer kk such that 4Lν1ρknγρk(d+1)lnn4L\nu_{1}\rho^{k}n\leqslant\gamma\rho^{-k(d+1)}\ln n; in particular,

Note from the inequality that this HH is such that

and thus this HH satisfies HzH\geqslant z in view of the assumption of the theorem indicating that nn is large enough. The final bound on the regret is then

A.4 Proof of Theorem 10 (regret bound for local-HOO)

Proof We use the notation of the proof of Theorem 9. Let r0r_{0} be a positive integer such that for rr0r\geqslant r_{0}, one has

we can therefore apply the result of Theorem 9 in regimes indexed by rr0r\geqslant r_{0}. For previous regimes, we simply upper bound the regret by the number of rounds, that is, 2r022r02^{r_{0}}-2\leqslant 2^{r_{0}}. For round nn, we denote by rnr_{n} the index of the regime where nn lies in (regime rn=log2(n+1)r_{n}=\lfloor\log_{2}(n+1)\rfloor). Since regime rnr_{n} terminates at round 2rn+122^{r_{n}+1}-2, we have

where C1,C2>0C_{1},C_{2}>0 denote some constants depending only on the parameters but not on nn. Note that for the last equality we used that the first term in the sum of the two terms that depend on nn dominates the second term.

A.5 Proof of Theorem 12 (uniform upper bound on the regret of HOO against the class of all weak Lipschitz environments)

A.6 Proof of Theorem 13 (minimax lower bound in metric spaces)

The set A\mathcal{A} contains “difficult” environments for the {1,,K+1}\{1,\ldots,K+1\}–armed bandit problem.

For any strategy φ(X)\varphi^{(\mathcal{X})} suited to the X\mathcal{X}–armed bandit problem, one can construct a strategy ψ(K+1)\psi^{(K+1)} for the {1,,K+1}\{1,\ldots,K+1\}–armed bandit problem such that

Proof We only deal with the case of deterministic strategies. The extension to randomized strategies can be done using Fubini’s theorem (by integrating also w.r.t. the auxiliary randomizations used).

First step. Let η(0,1/2)\eta\in(0,1/2) be a real number and K2K\geqslant 2 be an integer, both to be defined during the course of the analysis. The set A\mathcal{A} only contains KK elements, denoted by ν1,,νK\nu^{1},\ldots,\nu^{K} and given by product distributions. For 1jK1\leqslant j\leqslant K, the distribution νj\nu^{j} is obtained as the product of the νij\nu^{j}_{i} when i{1,,K+1}i\in\{1,\ldots,K+1\} and where

One can extract the following result from the proof of the lower bound of [10, Section 6.9].

For all strategies ψ(K+1)\psi^{(K+1)} for the {1,,K+1}\{1,\ldots,K+1\}–armed bandit (where K2K\geqslant 2), one has

Second step. We now need to construct F\mathcal{F}^{\prime} such that item (ii) is satisfied. We assume that KK is such that X\mathcal{X} contains KK disjoint balls with radius η\eta. (We shall quantify later in this proof a suitable value of KK.) Denoting by x1,,xKx_{1},\ldots,x_{K} the corresponding centers, these disjoint balls are then B(x1,η),,B(xK,η)\mathcal{B}(x_{1},\eta),\,\ldots,\,\mathcal{B}(x_{K},\eta).

With each of these balls we now associate a bandit environment over X\mathcal{X}, in the following way. For all xXx^{*}\in\mathcal{X}, we introduce a mapping gx,ηg_{x^{*},\eta} on X\mathcal{X} defined by

for all xXx\in\mathcal{X}. This mapping is used to define an environment Mx,ηM_{x^{*},\eta} over X\mathcal{X}, as follows. For all xXx\in\mathcal{X},

Let fx,ηf_{x^{*},\eta} be the corresponding mean-payoff function; its values equal

Third step. We describe how to associate with each (deterministic) strategy φ(X)\varphi^{(\mathcal{X})} on X\mathcal{X} a (random) strategy ψ(K+1)\psi^{(K+1)} on the finite set of arms {1,,K+1}\{1,\ldots,K+1\}. Each of these strategies is indeed given by a sequence of mappings,

where for t1t\geqslant 1, the mappings φt(X)\varphi^{(\mathcal{X})}_{t} and ψt(K+1)\psi^{(K+1)}_{t} should only depend on the past up to the beginning of round tt. Since the strategy φ(X)\varphi^{(\mathcal{X})} is deterministic, the mapping φt(X)\varphi^{(\mathcal{X})}_{t} takes only into account the past rewards Y1,,Yt1Y_{1},\ldots,Y_{t-1} and is therefore a mapping t1X^{t-1}\to\mathcal{X}. (In particular, φ1(X)\varphi^{(\mathcal{X})}_{1} equals a constant.)

We use the notations ItI^{\prime}_{t} and YtY^{\prime}_{t} for, respectively, the arms pulled and the rewards obtained by the strategy ψ(K+1)\psi^{(K+1)} at each round tt. The arms ItI^{\prime}_{t} are drawn at random according to the distributions

which we now define. (Actually, they will depend on the obtained payoffs Y1,,Yt1Y^{\prime}_{1},\ldots,Y^{\prime}_{t-1} only.) To do that, we need yet another mapping TT that links elements in X\mathcal{X} to probability distributions over {1,,K+1}\{1,\ldots,K+1\}. Denoting by δk\delta_{k} the Dirac probability on k{1,,K+1}k\in\{1,\ldots,K+1\}, the mapping TT is defined as

for all xXx\in\mathcal{X}. Note that this definition is legitimate because the balls B(xj,η)\mathcal{B}(x_{j},\eta) are disjoint when jj varies between 11 and KK.

Finally, ψ(K+1)\psi^{(K+1)} is defined as follows. For all t1t\geqslant 1,

Before we proceed, we study the distribution of the reward YY^{\prime} obtained under νi\nu^{i} (for i{1,,K}i\in\{1,\ldots,K\}) by the choice of a random arm II^{\prime} drawn according to T(x)T(x), for some xXx\in\mathcal{X}. Since YY^{\prime} can only take the values 0 or 1, its distribution is a Bernoulli distribution whose parameter μi(x)\mu_{i}(x) we compute now. The computation is based on the fact that under νi\nu^{i}, the Bernoulli distribution corresponding to arm jj has 1/21/2 as an expectation, except if j=ij=i, in which case it is 1/2+η1/2+\eta. Thus, for all xXx\in\mathcal{X},

That is, μi=fxi,η\mu_{i}=f_{x_{i},\eta} on X\mathcal{X}.

Fourth step. We now prove that the distributions of the regrets of φ(X)\varphi^{(\mathcal{X})} under Mxj,ηM_{x_{j},\eta} and of ψ(K+1)\psi^{(K+1)} under νj\nu^{j} are equal for all j=1,,Kj=1,\ldots,K. On the one hand, the expectations of rewards associated with the best arms equal 1/2+η1/2+\eta under the two environments. On the other hand, one can prove by induction that the sequences Y1,Y2,Y_{1},Y_{2},\ldots and Y1,Y2,Y^{\prime}_{1},Y^{\prime}_{2},\ldots have the same distribution. (In the argument below, conditioning by empty sequences means no conditioning. This will be the case only for t=1t=1.)

Under νj\nu^{j} and given Y1,,Yt1Y^{\prime}_{1},\ldots,Y^{\prime}_{t-1}, the distribution of YtY^{\prime}_{t} is obtained by definition as the two-step random draw of ItT(Xt)I^{\prime}_{t}\sim T(X^{\prime}_{t}) and then, conditionally on this first draw, YtνItjY^{\prime}_{t}\sim\nu^{j}_{I^{\prime}_{t}}. By the above results, the distribution of YtY^{\prime}_{t} is thus a Bernoulli distribution with parameter μj(Xt)\mu_{j}(X^{\prime}_{t}).

At the same time, under Mxj,ηM_{x_{j},\eta} and given Y1,,Yt1Y_{1},\ldots,Y_{t-1}, the choice of

yields a reward YtY_{t} distributed according to Mxj,η(Xt)M_{x_{j},\eta}(X_{t}), that is, by definition and with the notations above, a Bernoulli distribution with parameter fxj,η(Xt)=μj(Xt)f_{x_{j},\eta}(X_{t})=\mu_{j}(X_{t}).

The argument is concluded by induction and by using the fact that rewards are drawn independently in each round.

Fifth step. We summarize what we proved so far. For η(0,1/2)\eta\in(0,1/2), provided that there exist K2K\geqslant 2 disjoint balls B(xj,η)\mathcal{B}(x_{j},\eta) in X\mathcal{X}, we could construct, for all strategies φ(X)\varphi^{(\mathcal{X})} for the X\mathcal{X}–armed bandit problem, a strategy ψ(K+1)\psi^{(K+1)} for the {1,,K+1}\{1,\ldots,K+1\}–armed bandit problem such that, for all j=1,,Kj=1,\ldots,K and all n1n\geqslant 1,

But by the assumption on the packing dimension, there exists c>0c>0 such that for all η<1/2\eta<1/2, the choice of Kη=cηD2K_{\eta}=\lceil c\,\eta^{-D}\rceil\geqslant 2 guarantees the existence of such KηK_{\eta} disjoint balls. Substituting this value, and using the results of the first and fourth steps of the proof, we get

the left-hand side is smaller than the maximal regret w.r.t. all weak Lipschitz environments;

the right-hand side can be lower bounded and then optimized over η<1/2\eta<1/2 in the following way.

By definition of KηK_{\eta} and the fact that it is larger than 2, one has

where C=\sqrt{\bigl{(}4\ln(4/3)\bigr{)}\,\big{/}\,c}. We can optimize the final lower bound over η[0,1/2]\eta\in[0,\,1/2].

To that end, we choose, for instance, η\eta such that Cη1+D/2n=1/4C\,\eta^{1+D/2}\sqrt{n}=1/4, that is,

To ensure that this choice of η\eta is valid we need to show that η1/2\eta\leqslant 1/2. Since the latter requirement is equivalent to

it suffices to choose the right-hand side to be N(c,D)N(c,D); we then get that η1/2\eta\leqslant 1/2 indeed holds for all nN(c,D)n\geqslant N(c,D), thus concluding the proof of the theorem.

References