Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

Sébastien Bubeck, Nicolò Cesa-Bianchi

Chapter 1 Introduction

A multi-armed bandit problem (or, simply, a bandit problem) is a sequential allocation problem defined by a set of actions. At each time step, a unit resource is allocated to an action and some observable payoff is obtained. The goal is to maximize the total payoff obtained in a sequence of allocations. The name bandit refers to the colloquial term for a slot machine (“one-armed bandit” in American slang). In a casino, a sequential allocation problem is obtained when the player is facing many slot machines at once (a “multi-armed bandit”), and must repeatedly choose where to insert the next coin.

Bandit problems are basic instances of sequential decision making with limited information, and naturally address the fundamental trade-off between exploration and exploitation in sequential experiments. Indeed, the player must balance the exploitation of actions that did well in the past and the exploration of actions that might give higher payoffs in the future.

Although the original motivation of Thompson (1933) for studying bandit problems came from clinical trials (when different treatments are available for a certain disease and one must decide which treatment to use on the next patient), modern technologies have created many opportunities for new applications, and bandit problems now play an important role in several industrial domains. In particular, online services are natural targets for bandit algorithms, because there one can benefit from adapting the service to the individual sequence of requests. We now describe a few concrete examples in various domains.

Ad placement is the problem of deciding which advertisement to display on the web page delivered to the next visitor of a website. Similarly, website optimization deals with the problem of sequentially choosing design elements (font, images, layout) for the web page. Here the payoff is associated with visitor’s actions, e.g., clickthroughs or other desired behaviors. Of course there are important differences with the basic bandit problem: in ad placement the pool of available ads (bandit arms) may change over time, and there might be a limit on the number of times each ad could be displayed.

In source routing a sequence of packets must be routed from a source host to a destination host in a given network, and the protocol allows to choose a specific source-destination path for each packet to be sent. The (negative) payoff is the time it takes to deliver a packet, and depends additively on the congestion of the edges in the chosen path.

In computer game-playing, each move is chosen by simulating and evaluating many possible game continuations after the move. Algorithms for bandits (more specifically, for a tree-based version of the bandit problem) can be used to explore more efficiently the huge tree of game continuations by focusing on the most promising subtrees. This idea has been successfully implemented in the MoGo player of Gelly et al. (2006), which plays Go at world-class level. MoGo is based on the UCT strategy for hierarchical bandits of Kocsis and Szepesvári (2006), which is in turn derived from the UCB bandit algorithm —see Chapter 2.

There are three fundamental formalizations of the bandit problem depending on the assumed nature of the reward process: stochastic, adversarial, and Markovian. Three distinct playing strategies have been shown to effectively address each specific bandit model: the UCB algorithm in the stochastic case, the Exp3 randomized algorithm in the adversarial case, and the so-called Gittins indices in the Markovian case. In this survey, we focus on stochastic and adversarial bandits, and refer the reader to the survey by Mahajan and Teneketzis (2008) or to the recent monograph by Gittins et al. (2011) for an extensive analysis of Markovian bandits.

In order to analyze the behavior of a player or forecaster (i.e., the agent implementing a bandit strategy), we may compare its performance with that of an optimal strategy that, for any horizon of nn time steps, consistently plays the arm that is best in the first nn steps. In other terms, we may study the regret of the forecaster for not playing always optimally. More specifically, given K2K\geq 2 arms and given sequences Xi,1,Xi,2,X_{i,1},X_{i,2},\dots of unknown rewards associated with each arm i=1,,Ki=1,\dots,K, we study forecasters that at each time step t=1,2,t=1,2,\dots select an arm ItI_{t} and receive the associated reward XIt,tX_{I_{t},t}. The regret after nn plays I1,,InI_{1},\dots,I_{n} is defined by

If the time horizon is not known in advance we say that the forecaster is anytime.

In general, both rewards Xi,tX_{i,t} and forecaster’s choices ItI_{t} might be stochastic. This allows to distinguish between the two following notions of averaged regret: the expected regret

In the original formalization of Robbins (1952), which builds on the work of Wald (1947) —see also Arrow et al. (1949), each arm i=1,,Ki=1,\dots,K corresponds to an unknown probability distribution νi\nu_{i} on $,andrewards, and rewardsX_{i,t}areindependentdrawsfromthedistributionare independent draws from the distribution\nu_{i}$ corresponding to the selected arm.

The stochastic bandit problem Known parameters: number of arms KK and (possibly) number of rounds nKn\geq K. Unknown parameters: KK probability distributions ν1,,νK\nu_{1},\ldots,\nu_{K} on $.Foreachround. For each roundt=1,2,\ldots(1)theforecasterchooses(1) the forecaster choosesI_{t}\in\{1,\ldots,K\};(2)given; (2) givenI_{t},theenvironmentdrawsthereward, the environment draws the rewardX_{I_{t},t}\sim\nu_{I_{t}}$ independently from the past and reveals it to the forecaster.

For i=1,,Ki=1,\dots,K we denote by μi\mu_{i} the mean of νi\nu_{i} (mean reward of arm ii). Let

In the stochastic setting, it is easy to see that the pseudo-regret can be written as

The analysis of the stochastic bandit model was pioneered in the seminal paper of Lai and Robbins (1985), who introduced the technique of upper confidence bounds for the asymptotic analysis of regret. In Chapter 2 we describe this technique using the simpler formulation of Agrawal (1995), which naturally lends itself to a finite-time analysis.

In parallel to the research on stochastic bandits, a game-theoretic formulation of the trade-off between exploration and exploitation has been independently investigated, although for quite some time this alternative formulation was not recognized as an instance of the multi-armed bandit problem. In order to motivate these game-theoretic bandits, consider again the initial example of gambling on slot machines. We now assume that we are in a rigged casino, where for each slot machine i=1,,Ki=1,\ldots,K and time step t1t\geq 1 the owner sets the gain Xi,tX_{i,t} to some arbitrary (and possibly maliciously chosen) value gi,tg_{i,t}\in. Note that it is not in the interest of the owner to simply set all the gains to zero (otherwise, no gamblers would go to that casino). Now recall that a forecaster selects sequentially one arm It{1,,K}I_{t}\in\{1,\ldots,K\} at each time step t=1,2,t=1,2,\dots and observes (and earns) the gain gIt,tg_{I_{t},t}. Is it still possible to minimize regret in such a setting?

Following a standard terminology, we call adversary, or opponent, the mechanism setting the sequence of gains for each arm. If this mechanism is independent of the forecaster’s actions, then we call it an oblivious adversary. In general, however, the adversary may adapt to the forecaster’s past behaviour, in which case we speak of a non-oblivious adversary. For instance, in the rigged casino the owner may observe the way a gambler plays in order to design even more evil sequences of gains. Clearly, the distinction between oblivious and non-oblivious adversary is only meaningful when the player is randomized (if the player is deterministic, then the adversary can pick a bad sequence of gains right at the beginning of the game by simulating the player’s future actions). Note, however, that in presence of a non-oblivious adversary the interpretation of regret is ambiguous. Indeed, in this case the assignment of gains gi,tg_{i,t} to arms i=1,,Ki=1,\dots,K made by the adversary at each step tt is allowed to depend on the player’s past randomized actions I1,,It1I_{1},\dots,I_{t-1}. In other words, gi,t=gi,t(I1,,It1)g_{i,t}=g_{i,t}(I_{1},\dots,I_{t-1}) for each ii and tt. Now, the regret compares the player’s cumulative gain to that obtained by playing the single best arm for the first nn rounds. However, had the player consistently chosen the same arm ii in each round, namely It=iI_{t}=i for t=1,,nt=1,\dots,n, the adversarial gains gi,t(I1,,It1)g_{i,t}(I_{1},\dots,I_{t-1}) would have been possibly different than those actually experienced by the player.

The study of non-oblivious regret is mainly motivated by the connection between regret minimization and equilibria in games —see, e.g. (Auer et al., 2002b, Section 9). Here we just observe that game-theoretic equilibria are indeed defined similarly to regret: in equilibrium, the player has nSo incentive to behave differently provided the opponent does not react to changes in the player’s behaviour. Interestingly, regret minimization has been also studied against reactive opponents, see for instance the works of Pucci de Farias and Megiddo (2006) and Arora et al. (2012a).

The adversarial bandit problem Known parameters: number of arms K2K\geq 2 and (possibly) number of rounds nKn\geq K. For each round t=1,2,t=1,2,\ldots (1) the forecaster chooses It{1,,K}I_{t}\in\{1,\ldots,K\}, possibly with the help of external randomization; (2) simultaneously, the adversary selects a gain vector gt=(g1,t,,gK,t)Kg_{t}=(g_{1,t},\ldots,g_{K,t})\in^{K}, possibly with the help of external randomization; (3) the forecaster receives (and observes) the reward gIt,tg_{I_{t},t}, while the gains of the other arms are not observed.

In this adversarial setting the goal is to obtain regret bounds in high probability or in expectation with respect to any possible randomization in the strategies used by the forecaster or the opponent, and irrespective of the opponent. In the case of a non-oblivious adversary this is not an easy task, and for this reason we usually start by bounding the pseudo-regret

Note that the randomization of the adversary is not very important here since we ask for bounds which hold for any opponent. On the other hand, it is fundamental to allow randomization for the forecaster —see Chapter 3 for details and basic results in the adversarial bandit model. This adversarial, or non-stochastic, version of the bandit problem was originally proposed as a way of playing an unknown game against an opponent. The problem of playing a game repeatedly, now a classical topic in game theory, was initiated by the groundbreaking work of James Hannan and David Blackwell. In Hannan’s seminal paper Hannan (1957), the game (i.e., the payoff matrix) is assumed to be known by the player, who also observes the opponent’s moves in each play. Later, Baños (1968) considered the problem of a repeated unknown game, where in each game round the player only observes its own payoff. This problem turns out to be exactly equivalent to the adversarial bandit problem with a non-oblivious adversary. Simpler strategies for playing unknown games were more recently proposed by Foster and Vohra (1998) and Hart and Mas-Colell (2000, 2001). Approximately at the same time, the problem was re-discovered in computer science by Auer et al. (2002b). It was them who made apparent the connection to stochastic bandits by coining the term nonstochastic multi-armed bandit problem.

The third fundamental model of multi-armed bandits assumes that the reward processes are neither i.i.d. (like in stochastic bandits) nor adversarial. More precisely, arms are associated with KK Markov processes, each with its own state space. Each time an arm ii is chosen in state ss, a stochastic reward is drawn from a probability distribution νi,s\nu_{i,s}, and the state of the reward process for arm ii changes in a Markovian fashion, based on an underlying stochastic transition matrix MiM_{i}. Both reward and new state are revealed to the player. On the other hand, the state of arms that are not chosen remains unchanged. Going back to our initial interpretation of bandits as sequential resource allocation processes, here we may think of KK competing projects that are sequentially allocated a unit resource of work. However, unlike the previous bandit models, in this case the state of a project that gets the resource may change. Moreover, the underlying stochastic transition matrices MiM_{i} are typically assumed to be known, thus the optimal policy can be computed via dynamic programming and the problem is essentially of computational nature. The seminal result of Gittins (1979) provides an optimal greedy policy which can be computed efficiently.

A notable special case of Markovian bandits is that of Bayesian bandits. These are parametric stochastic bandits, where the parameters of the reward distributions are assumed to be drawn from known priors, and the regret is computed by also averaging over the draw of parameters from the prior. The Markovian state change associated with the selection of an arm corresponds here to updating the posterior distribution of rewards for that arm after observing a new reward.

Markovian bandits are a standard model in the areas of Operations Research and Economics. However, the techniques used in their analysis are significantly different from those used to analyze stochastic and adversarial bandits. For this reason, in this survey we do not cover Markovian bandits and their many variants.

Chapter 2 Stochastic bandits: fundamental results

We start by recalling the basic definitions for the stochastic bandit problem. Each arm i{1,,K}i\in\{1,\ldots,K\} corresponds to an unknown probability distribution νi\nu_{i}. At each time step t=1,2,,t=1,2,\ldots, the forecaster selects some arm It{1,,K}I_{t}\in\{1,\ldots,K\} and receives a reward XIt,tX_{I_{t},t} drawn from νIt\nu_{I_{t}} (independently from the past). Denote by μi\mu_{i} the mean of arm ii and define

We focus here on the pseudo-regret, which is defined as

We choose the pseudo-regret as our main quantity of interest because in a stochastic framework it is more natural to compete against the optimal action in expectation, rather than the optimal action on the sequence of realized rewards (as in the definition of the plain regret (1)). Furthermore, because of the order of magnitude of typical random fluctuations, in general one cannot hope to prove a bound on the expected regret (2) better than \Theta\bigl{(}\sqrt{n}\bigr{)}. On the contrary, the pseudo-regret can be controlled so well that we are able to bound it by a logarithmic function of nn.

In the following we also use a different formula for the pseudo-regret. Let Ti(s)=t=1s\mathds1It=iT_{i}(s)=\sum_{t=1}^{s}\mathds{1}_{I_{t}=i} denote the number of times the player selected arm ii on the first ss rounds. Let Δi=μμi\Delta_{i}=\mu^{*}-\mu_{i} be the suboptimality parameter of arm ii. Then the pseudo-regret can be written as:

The difficulty of the stochastic multi-armed bandit problem lies in the exploration-exploitation dilemma that the forecaster is facing. Indeed, there is an intrinsic tradeoff between exploiting the current knowledge to focus on the arm that seems to yield the highest rewards, and exploring further the other arms to identify with better precision which arm is actually the best. As we shall see, the key to obtain a good strategy for this problem is, in a certain sense, to simultaneously perform exploration and exploitation.

A simple heuristic principle for doing that is the so-called optimism in face of uncertainty. The idea is very general, and applies to many sequential decision making problems in uncertain environments. Assume that the forecaster has accumulated some data on the environment and must decide how to act next. First, a set of “plausible” environments which are “consistent” with the data (typically, through concentration inequalities) is constructed. Then, the most “favorable” environment is identified in this set. Based on that, the heuristic prescribes that the decision which is optimal in this most favorable and plausible environment should be made. As we see below, this principle gives simple and yet almost optimal algorithms for the stochastic multi-armed bandit problem. More complex algorithms for various extensions of the stochastic multi-armed bandit problem are also based on the same idea which, along with the exponential weighting scheme presented in Section 3, is an algorithmic cornerstone of regret analysis in bandits.

Upper Confidence Bound (UCB) strategies

In this section we assume that the distribution of rewards XX satisfy the following moment conditions. There exists a convex function One can easily generalize the discussion to functions ψ\psi defined only on an interval [0,b][0,b]. ψ\psi on the reals such that, for all λ0\lambda\geq 0,

For example, when XX\in one can take ψ(λ)=λ28\psi(\lambda)=\tfrac{\lambda^{2}}{8}. In this case (6) is known as Hoeffding’s lemma.

We attack the stochastic multi-armed bandit using the optimism in face of uncertainty principle. In order do so, we use assumption (6) to construct an upper bound estimate on the mean of each arm at some fixed confidence level, and then choose the arm that looks best under this estimate. We need a standard notion from convex analysis: the Legendre-Fenchel transform of ψ\psi, defined by

For instance, if ψ(x)=ex\psi(x)=e^{x} then ψ(x)=xlnxx\psi^{*}(x)=x\ln x-x for x>0x>0. If ψ(x)=1pxp\psi(x)=\frac{1}{p}|x|^{p} then ψ(x)=1qxq\psi^{*}(x)=\frac{1}{q}|x|^{q} for any pair 1<p,q<1<p,q<\infty such that 1p+1q=1\frac{1}{p}+\frac{1}{q}=1 —see also Section 15, where the same notion is used in a different bandit model.

Let μ^i,s\widehat{\mu}_{i,s} be the sample mean of rewards obtained by pulling arm ii for ss times. Note that since the rewards are i.i.d., we have that in distribution μ^i,s\widehat{\mu}_{i,s} is equal to 1st=1sXi,t\frac{1}{s}\sum_{t=1}^{s}X_{i,t}.

Using Markov’s inequality, from (6) one obtains that

In other words, with probability at least 1δ1-\delta,

We thus consider the following strategy, called (α,ψ)(\alpha,\psi)-UCB, where α>0\alpha>0 is an input parameter: At time tt, select

Assume that the reward distributions satisfy (6). Then (α,ψ)(\alpha,\psi)-UCB with α>2\alpha>2 satisfies

In case of $valuedrandomvariables,taking-valued random variables, taking\psi(\lambda)=\tfrac{\lambda^{2}}{8}in(6)theHoeffdingsLemmagivesin (6) —the Hoeffding’s Lemma— gives\psi^{*}(\varepsilon)=2\varepsilon^{2}$, which in turns gives the following pseudo-regret bound

In this important special case of bounded random variables we refer to (α,ψ)(\alpha,\psi)-UCB simply as α\alpha-UCB.

First note that if It=iI_{t}=i, then at least one of the three following equations must be true:

Indeed, assume that the three equations are all false, then we have:

which implies, in particular, that ItiI_{t}\neq i. In other words, letting

Thus it suffices to bound the probability of the events (9) and (10). Using an union bound and (7) one directly obtains:

The same upper bound holds for (10). Straightforward computations conclude the proof.

Lower bound

In order to compare this result with (8) we use the following standard inequalities (the left hand side follows from Pinsker’s inequality, and the right hand side simply uses lnxx1\ln x\leq x-1),

The proof is organized in three steps. For simplicity, we only consider the case of two arms.

We now slightly change the notation for rewards and denote by X2,1,,X2,nX_{2,1},\ldots,X_{2,n} the sequence of random variables obtained when pulling arm 22 for nn times (that is, X2,sX_{2,s} is the reward obtained from the ss-th pull). For s{1,,n}s\in\{1,\ldots,n\}, let

In order to link the behavior of the forecaster on the initial and modified bandits we introduce the event

Using again (15) and Markov’s inequality, the above implies

Now we use the maximal version of the strong law of large numbers: for any sequence \bigl{(}X_{t}\bigr{)} of independent real random variables with positive mean μ>0\mu>0,

Thus, by the result of the second step and (16), we get

Refinements and bibliographic remarks

The UCB strategy presented in Section 2 was introduced by Auer et al. (2002a) for bounded random variables. Theorem 3.3 is extracted from Lai and Robbins (1985). Note that in this last paper the result is more general than ours, which is restricted to Bernoulli distributions. Although Burnetas and Katehakis (1997) prove an even more general lower bound, Theorem 3.3 and the UCB regret bound provide a reasonably complete solution to the problem. We now discuss some of the possible refinements. In the following, we restrict our attention to the case of bounded rewards (except in Section 4.7).

The regret bound proof for UCB can be improved in two ways. First, the union bound over the different time steps can be replaced by a “peeling” argument. This allows to show a logarithmic regret for any α>1\alpha>1, whereas the proof of Section 2 requires α>2\alpha>2 —see (Bubeck, 2010, Section 2.2) for more details. A second improvement, proposed by Garivier and Cappé (2011), is to use a more subtle set of conditions than (9)–(11). In fact, the authors take both improvements into account, and show that α\alpha-UCB has a regret of order α2lnn\frac{\alpha}{2}\ln n for any α>1\alpha>1. In the limit when α\alpha tends to 11, this constant is unimprovable in light of Theorem 3.3 and (12).

2 Second order bounds

where α>1\alpha>1 is a parameter of the algorithm. Thus, KL-UCB is optimal for Bernoulli distributions, and strictly dominates α\alpha-UCB for any bounded reward distributions.

3 Distribution-free bounds

In the limit when Δi\Delta_{i} tends to , the upper bound in (8) becomes vacuous. On the other hand, it is clear that the regret incurred from pulling arm ii cannot be larger than nΔin\,\Delta_{i}. Using this idea, it is easy to show that the regret of α\alpha-UCB is always smaller than αnKlnn\sqrt{\alpha nK\ln n} (up to a numerical constant). However, as we shall see in the next chapter, one can show a minimax lower bound on the regret of order nK\sqrt{nK}. Audibert and Bubeck (2009) proposed a modification of α\alpha-UCB that gets rid of the extraneous logarithmic term in the upper bound. More precisely, let Δ=miniiΔi\Delta=\min_{i\neq i^{*}}\Delta_{i}, Audibert and Bubeck (2010) show that MOSS (Minimax Optimal Strategy in the Stochastic case) attains a regret smaller than

up to a numerical constant. The weakness of this result is that the second term in the above equation only depends on the smallest gap Δ\Delta. In Auer and Ortner (2010) (see also Perchet and Rigollet (2011)) the authors designed a strategy, called improved UCB, with a regret of order

This latter regret bound can be better than the one for MOSS in some regimes, but it does not attain the minimax optimal rate of order nK\sqrt{nK}. It is an open problem to obtain a strategy with a regret always better than those of MOSS and improved UCB. A plausible conjecture is that a regret of order

is attainable. Note that the quantity HH appears in other variants of the stochastic multi-armed bandit problem, see Audibert et al. (2010).

4 High probability bounds

While bounds on the pseudo-regret Rn\overline{R}_{n} are important, one typically wants to control the quantity R^n=nμt=1nμIt\hat{R}_{n}=n\mu^{*}-\sum_{t=1}^{n}\mu_{I_{t}} with high probability. Showing that R^n\hat{R}_{n} is close to its expectation Rn\overline{R}_{n} is a challenging task, since naively one might expect the fluctuations of R^n\hat{R}_{n} to be of order n\sqrt{n}, which would dominate the expectation Rn\overline{R}_{n} which is only of order lnn\ln n. The concentration properties of R^n\hat{R}_{n} for UCB are analyzed in detail in Audibert et al. (2009), where it is shown that R^n\hat{R}_{n} concentrates around its expectation, but that there is also a polynomial (in nn) probability that R^n\hat{R}_{n} is of order nn. In fact the polynomial concentration of R^n\hat{R}_{n} around Rn\overline{R}_{n} can be directly derived from our proof of Theorem 2.1. In Salomon and Audibert (2011) it is proved that for anytime strategies (i.e., strategies that do not use the time horizon nn) it is basically impossible to improve this polynomial concentration to a classical exponential concentration. In particular this shows that, as far as high probability bounds are concerned, anytime strategies are surprisingly weaker than strategies using the time horizon information (for which exponential concentration of R^n\hat{R}_{n} around lnn\ln n are possible, see Audibert et al. (2009)).

5 ε𝜀\varepsilon-greedy

A simple and popular heuristic for bandit problems is the ε\varepsilon-greedy strategy —see, e.g., Sutton and Barto (1998). The idea is very simple. First, pick a parameter 0<ε<10<\varepsilon<1. Then, at each step greedily play the arm with highest empirical mean reward with probability 1ε1-\varepsilon, and play a random arm with probability ε\varepsilon. Auer et al. (2002a) proved that, if ε\varepsilon is allowed to be a certain function εt\varepsilon_{t} of the current time step tt, namely εt=K/(d2t)\varepsilon_{t}=K/(d^{2}t), then the regret grows logarithmically like (Klnn)/d2(K\ln n)/d^{2}, provided 0<d<miniiΔi0<d<\min_{i\neq i^{*}}\Delta_{i}. While this bound has a suboptimal dependence on dd, Auer et al. (2002a) show that this algorithm performs well in practice, but the performance degrades quickly if dd is not chosen as a tight lower bound of miniiΔi\min_{i\neq i^{*}}\Delta_{i}.

6 Thompson sampling

7 Heavy-tailed distributions

We showed in Section 2 how to obtain a UCB-type strategy through a bound on the moment generating function. Moreover one can see that the resulting bound in Theorem 2.1 deteriorates as the tail of the distributions become heavier. In particular, we did not provide any result for the case of distributions for which the moment generating function is not finite. Surprisingly, it was shown in Bubeck et al. (2012b) that in fact there exists strategy with essentially the same regret than in (8), as soon as the variance of the distributions are finite. More precisely, using more refined robust estimators of the mean than the basic empirical mean, one can construct a UCB-type strategy such that for distributions with moment of order 1+ε1+\varepsilon bounded by 11 it satisfies

We refer the interested reader to Bubeck et al. (2012b) for more details on these ’robust’ versions of UCB.

Chapter 3 Adversarial bandits: fundamental results

In this chapter we consider the important variant of the multi-armed bandit problem where no stochastic assumption is made on the generation of rewards. Denote by gi,tg_{i,t} the reward (or gain) of arm ii at time step tt. We assume all rewards are bounded, say gi,tg_{i,t}\in. At each time step t=1,2,t=1,2,\ldots, simultaneously with the player’s choice of the arm It{1,,K}I_{t}\in\{1,\ldots,K\}, an adversary assigns to each arm i=1,,Ki=1,\dots,K the reward gi,tg_{i,t}. Similarly to the stochastic setting, we measure the performance of the player compared to the performance of the best arm through the regret

The key idea to get around this difficulty is to add randomization to the selection of the action ItI_{t} to play. By doing so, the forecaster can “surprise” the adversary, and this surprise effect suffices to get a regret essentially as low as the minimax regret for the stochastic model. Since the regret RnR_{n} then becomes a random variable, the goal is thus to obtain bounds in high probability or in expectation on RnR_{n} (with respect to both eventual randomization of the forecaster and of the adversary). This task is fairly difficult, and a convenient first step is to bound the pseudo-regret

As we pointed out, in order to obtain non-trivial regret guarantees in the adversarial framework it is necessary to consider randomized forecasters. Below we describe the randomized forecaster Exp3, which is based on two fundamental ideas.

First, despite the fact that only the loss of the played arm is observed, with a simple trick it is still possible to build an unbiased estimator for the loss of any other arm. Namely, if the next arm ItI_{t} to be played is drawn from a probability distribution p_{t}=\bigl{(}p_{1,t},\ldots,p_{K,t}\bigr{)}, then

The second idea is to use an exponential reweighting of the cumulative estimated losses to define the probability distribution ptp_{t} from which the forecaster will select the arm ItI_{t}. Exponential weighting schemes are a standard tool in the study of sequential prediction schemes under adversarial assumptions. The reader is referred to the monograph by Cesa-Bianchi and Lugosi (2006) for a general introduction to prediction of individual sequences, and to the recent survey by Arora et al. (2012b) focussed on computer science applications of exponential weighting.

We provide two different pseudo-regret bounds for this strategy. The bound (19) is obtained assuming that the forecaster does not know the number of rounds nn. This is the anytime version of the algorithm. The bound (18), instead, shows that a better constant can be achieved using the knowledge of the time horizon.

If Exp3 is run with ηt=η=2lnKnK\eta_{t}=\eta=\sqrt{\frac{2\ln K}{nK}}, then

Moeover, if Exp3 is run with ηt=lnKtK\eta_{t}=\sqrt{\frac{\ln K}{tK}}, then

Inequality (18) then trivially follows from (20). For (19) we use (20) and t=1n1t0n1tdt=2n\sum_{t=1}^{n}\frac{1}{\sqrt{t}}\leq\int_{0}^{n}\frac{1}{\sqrt{t}}dt=2\sqrt{n}. The proof of (20) in divided in five steps.

The following equalities can be easily verified:

Second step: Study of the first term in (23).

We use the inequalities lnxx1\ln x\leq x-1 and exp(x)1+xx2/2\exp(-x)-1+x\leq x^{2}/2, for all x0x\geq 0, to obtain:

where the last step comes from the third equality in (21).

Third step: Study of the second term in (23).

Let L~i,0=0\widetilde{L}_{i,0}=0, Φ0(η)=0\Phi_{0}(\eta)=0 and Φt(η)=1ηln1Ki=1Kexp(ηL~i,t)\Phi_{t}(\eta)=\frac{1}{\eta}\ln\frac{1}{K}\sum_{i=1}^{K}\exp{\left(-\eta\widetilde{L}_{i,t}\right)}. Then, by definition of ptp_{t} we have

Fourth step: Summing.

Putting together (22), (23), (24) and (25) we obtain

The first term is easy to bound in expectation since, by the rule of conditional expectations and the last equality in (21) we have

For the second term we start with an Abel transformation,

To conclude the proof, we show that Φt(η)0\Phi_{t}^{\prime}(\eta)\geq 0. Since ηt+1ηt\eta_{t+1}\leq\eta_{t}, we then obtain Φt(ηt+1)Φt(ηt)0\Phi_{t}(\eta_{t+1})-\Phi_{t}(\eta_{t})\leq 0. Let

Simplifying, we get (since p1p_{1} is the uniform distribution over {1,,K}\{1,\ldots,K\}),

High probability and expected regret bounds

This issue can be solved by combining the mixing idea with a different estimate for losses. In fact, the core idea is more transparent when expressed in terms of gains, and so we turn to the gain version of the problem. The trick is to introduce a bias in the gain estimate which allows to derive a high probability statement on this estimate.

Then, with probability at least 1δ1-\delta,

where the last inequality uses 1+uexp(u)1+u\leq\exp(u). As a consequence, we have

The strategy associated with these new estimates, called Exp3.P, is described in Figure 1. Note that, for the sake of simplicity, the strategy is described in the setting with known time horizon (η\eta is constant). Anytime results can easily be derived with the same techniques as in the proof of Theorem 5.1.

In the next theorem we propose two different high probability bounds. In (26) the algorithm needs the confidence level δ\delta as an input parameter. In (27) the algorithm satisfies a high probability bound for any confidence level. This latter property is particularly important to derive good bounds on the expected regret.

For any given δ(0,1)\delta\in(0,1), if Exp3.P is run with

then, with probability at least 1δ1-\delta,

Moreover, if Exp3.P is run with β=ln(K)nK\beta=\sqrt{\tfrac{\ln(K)}{nK}} while η\eta and γ\gamma are chosen as before, then, with probability at least 1δ1-\delta,

We first prove (in three steps) that if γ1/2\gamma\leq 1/2 and (1+β)Kηγ(1+\beta)K\eta\leq\gamma, then Exp3.P satisfies, with probability at least 1δ1-\delta,

The key step is, again, to consider the cumulant-generating function of g~i,t\widetilde{g}_{i,t}. However, because of the mixing, we need to introduce a few more notations. Let u=\bigl{(}\tfrac{1}{K},\ldots,\tfrac{1}{K}\bigr{)} be the uniform distribution over the arms, let and wt=ptu1γw_{t}=\tfrac{p_{t}-u}{1-\gamma} be the distribution induced by Exp3.P at time tt without the mixing. Then we have:

Second step: Study of the first term in (30).

We use the inequalities lnxx1\ln x\leq x-1 and exp(x)1+x+x2\exp(x)\leq 1+x+x^{2}, for all x1x\leq 1, as well as the fact that ηg~i,t1\eta\widetilde{g}_{i,t}\leq 1 since (1+β)ηKγ(1+\beta)\eta K\leq\gamma:

where we used wi,tpi,t11γ\frac{w_{i,t}}{p_{i,t}}\leq\frac{1}{1-\gamma} in the last step.

Third step: Summing.

Set G~i,0=0\widetilde{G}_{i,0}=0. Recall that w_{t}=\bigl{(}w_{1,t},\ldots,w_{K,t}\bigr{)} with

Then substituting (31) in (30) and summing using (32), we obtain

The last inequality comes from Lemma 6.3, the union bound, and γ(1+β)ηK1\gamma-(1+\beta)\eta K\leq 1 which is a consequence of (1+β)ηKγ1/2(1+\beta)\eta K\leq\gamma\leq 1/2. Combining this last inequality with (29) we obtain

We integrate the deviations in (27) using the formula

for a real-valued random variable WW. In particular, taking

Lower Bound

The next theorem shows that the results of the previous sections are essentially unimprovable, up to logarithmic factors. The result is proven via the probabilistic method: we show that there exists a distribution of rewards for the arms such that the pseudo-regret of any forecaster must be high when averaged over this distribution. Owing to this probabilistic construction, the lower bound proof is based on the same Kullback-Leibler divergence as the one used in the proof of the lower bound for stochastic bandits —see Subsection 3. We are not aware of other techniques for proving bandit lower bounds.

We find it more convenient to prove the results for rewards rather than losses. In order to emphasize that our rewards are stochastic (in particular, Bernoulli random variables), we use Yi,t{0,1}Y_{i,t}\in\{0,1\} to denote the reward obtained by pulling arm ii at time tt.

Let sup\sup be the supremum over all distribution of rewards such that, for i=1,,Ki=1,\dots,K, the rewards Yi,1,Yi,2,{0,1}Y_{i,1},Y_{i,2},\ldots\in\{0,1\} are i.i.d., and let inf\inf be the infimum over all forecasters. Then

where expectations are with respect to both the random generation of rewards and the internal randomization of the forecaster.

The general idea of the proof goes as follows. Since at least one arm is pulled less than n/Kn/K times, for this arm one cannot differentiate between a Bernoulli of parameter 1/21/2 and and a Bernoulli of parameter 1/2+K/n1/2+\sqrt{K/n}. Thus, if all arms are Bernoulli of parameter 1/21/2 but one, whose parameter is 1/2+K/n1/2+\sqrt{K/n}, then the forecaster should incur a regret of order nK/n=nKn\sqrt{K/n}=\sqrt{nK}. In order to formalize this idea, we use the Kullback-Leibler divergence, and in particular Pinsker’s inequality, to compare the behavior of a given forecaster against: (1) the distribution where all arms are Bernoulli of parameter 1/21/2; (2) the same distribution where the parameter of one arm is increased by ε\varepsilon.

We start by proving a more general lemma, which could also be used to derive lower bounds in other contexts. The proof of Theorem 7.9 then follows by a simple optimization over ε\varepsilon.

Second step: Pinsker’s inequality.

Fourth step: conclusion for deterministic forecasters.

Fifth step: randomized forecasters via Fubini’s Theorem.

Refinements and bibliographic remarks

One can see that there is a logarithmic gap between the pseudo-regret of Exp3, presented in Theorem 5.1, and the minimax lower bound of Theorem 7.9. This gap was closed by Audibert and Bubeck (2009), who constructed a new class of strategies and showed that some of them have a pseudo-regret of order nK\sqrt{nK}. This new class of strategies, called INF (Implicitily Normalized Forecaster), is based on the following idea. First, note that one can generalize the exponential weighting scheme of Exp3 as follows: given a potential function ψ\psi, assign the probability

This type of strategy is called a weighted average forecaster, see (Cesa-Bianchi and Lugosi, 2006, Chapter 2). In INF the normalization is done implicitily, by a translation of the losses. More precisely, INF with potential ψ\psi assigns the probability p_{i,t+1}=\psi\bigl{(}C_{t}-\widetilde{L}_{i,t}\bigr{)}, where CtC_{t} is the constant such that pt+1p_{t+1} sum to 11. The key to obtain a minimax optimal pseudo-regret is to take ψ\psi of the form ψ(x)=(ηx)q\psi(x)=(-\eta x)^{-q} with q>1q>1, while Exp3 corresponds to ψ(x)=exp(ηx)\psi(x)=\exp(\eta x). Audibert et al. (2011) realized that the INF strategy can be formulated as a Mirror Descent algorithm. This point of view significantly simplifies the proofs. We refer the reader to Chapter 5 (and in particular Theorem 17.20) for more details.

While it is possible to get log-free pseudo-regret bounds, the situation becomes significantly more complicated when one considers high probability regret and expected regret. Audibert and Bubeck (2010) proved that one can get a log-free expected regret if the adversary is oblivious, i.e., the sequence of loss vectors is independent of the forecaster’s actions. Moreover, it is also possible to get a log-free high probability regret if the adversary is fully oblivious (i.e., the loss vectors are independently drawn, but not identically distributed —this includes the oblivious adversary). It is conjectured (in Audibert and Bubeck (2010)) that it is not possible to obtain a log-free expected regret bound against a general non-oblivious adversary.

2 Adaptive bounds

One of the strengths of the bounds proposed in this chapter is also one of its weaknesses: the bounds hold against any adversary. It is clear that in some cases it is possible to obtain a much smaller regret than the worst case regret. For example, when the sequence of losses is an i.i.d. sequence, we proved in Chapter 2 that it is is possible to obtain a logarithmic pseudo-regret (provided that the gap Δ\Delta is considered as a constant). Thus it is natural to ask if it possible to have strategies with minimax optimal regret, but also with much smaller regret when the loss sequence is not worst case.

The first bound in this direction was proved by Auer et al. (2002b), who showed that, for the gain version of the problem and against an oblivious adversary, Exp3 has a pseudo-regret of order KGn\sqrt{KG_{n}^{*}} (omitting log factors), where GnnG_{n}^{*}\leq n is the maximal cumulative reward of the optimal arm after nn rounds. This result was improved by Audibert and Bubeck (2010), who showed that using the gain estimate

one can bound the regret with high probability by essentially the same quantity as before, and against any adversary.

Another direction was explored by Hazan and Kale (2009) building on previous works in the full information setting —see Cesa-Bianchi et al. (2007). In this work the authors proved that one can attain a regret of order i=1KVi,n\sqrt{\sum_{i=1}^{K}V_{i,n}} excluding log factors, where

is the total variation of the loss for arm ii. In fact their result is more general, as it applies to the linear bandit framework —see Chapter 5. The main new ingredient in their analysis is a “reservoir sampling” procedure. We refer the reader to Hazan and Kale (2009) for details. See also the works of Slivkins and Upfal (2008); Slivkins (2011) for related results on slowly changing bandits.

In Section 8.4 below we describe another type of adaptive bound, where one combines minimax optimal regret for the adversarial model with logarithmic pseudo-regret for the stochastic model.

3 Competing with the best switching strategy

While competing against the policy consistently playing the best fixed arm is a natural way of defining regret, in some applications it might be interesting to consider regret with respect to a bigger class of policies. Though this problem is the focus of Chapter 4, there is a class of natural policies that can be directly dealt with by the methods of this chapter. Namely, consider the problem of competing against any policy constrained to make at most SnS\leq n switches (a switch is when the arm played at time tt is different from the arm played at time t+1t+1). This problem was studied by Auer (2002), where it was first shown that a simple variant of Exp3 attains a low switching regret against oblivious adversaries. Later, Audibert and Bubeck (2010) proved that Exp3.P attains an expected regret (and a high probability regret) of order nKSln(nK/S)\sqrt{nKS\ln(nK/S)} for this problem.

4 Stochastic versus adversarial bandits

From a practical viewpoint, Exp3 should be a safe choice when we have reasons to believe that the sequence of rewards is not well matched by any i.i.d. process. Indeed, it is easy to prove that UCB can have linear regret, i.e. Rn=Ω(n)\overline{R}_{n}=\Omega(n), on certain deterministic sequences. In Bubeck and Slivkins (2012) a new strategy was described, called SAO (Stochastic and Adversarial Optimal), which enjoys (up to logarithmic factors) both the guarantee of Exp3 for the adversarial model and the guarantee of UCB for the stochastic model. More precisely SAO satisfies Rn=O(KΔlog2(n)log(K))\overline{R}_{n}=\mathcal{O}\left(\frac{K}{\Delta}\log^{2}(n)\log(K)\right) in the stochastic model and Rn=O(nKlog3/2(n)log(K))\overline{R}_{n}=\mathcal{O}\left(\sqrt{nK}\log^{3/2}(n)\log(K)\right) in the adversarial model. Note that while this result is a step towards more flexible strategies, the very notion of regret Rn\overline{R}_{n} can become vacuous with nonstationarities in the reward sequence, since the total reward of the best fixed action might be very small. In that case the notion of switching regret —see Subsection 8.3— is more relevant, and it would be interesting to derive a strategy with logarithmic regret in the stochastic model, and a switching regret of order nKS\sqrt{nKS} in the adversarial model.

5 Alternative feedback structures

As mentioned at the beginning of this section, the adversarial multi-armed bandit is a variation of the full information setting, with a weaker feedback signal (only the incurred loss versus the full vector of losses is observed). Many other feedback structures can be considered, and we conclude the chapter by describing a few of them.

In the label efficient setting, originally proposed by Helmbold and Panizza (1997), at the end of each round the forecaster has to decide whether to ask for the losses of the current round, knowing that this can be done for at most mnm\leq n times. In this setting, Cesa-Bianchi et al. (2005) proved that the minimax pseudo-regret is of order nlnKmn\sqrt{\frac{\ln K}{m}}. A bandit label efficient version was proposed by Allenberg et al. (2006). Audibert and Bubeck (2010) proved that the minimax pseudo-regret for the bandit label efficient version is of order nKmn\sqrt{\frac{K}{m}}. These results do not require any fundamentally new algorithmic idea, besides the fact the forecaster has to randomize to select the rounds in which the losses are revealed. Roughly speaking, a simple coin toss with parameter ε=m/n\varepsilon=m/n is sufficient to obtain an optimal regret.

Mannor and Shamir (2011) study a model that interpolates between the full information and the bandit setting. The basic idea is that there is an undirected graph GG with KK vertices (one vertex for each arm) that encodes the feedback structure. When one pulls arm ii the losses of all neighboring arms jN(i)j\in N(i) in the graph are observed. Thus, a graph with no edges is equivalent to the bandit problem, while the complete graph is equivalent to the full information setting. Given the feedback structure GG, it is natural to consider the following unbiased loss estimate

Using Exp3 with this loss estimate, the authors show that the minimax pseudo-regret (up to logarithmic factors) is of order of α(G)n\sqrt{\alpha(G)n}, where α(G)\alpha(G) is the independence number of graph GG. Note that this interpolated setting naturally arises in applications like ad placement on websites. Indeed, if a user clicks on an advertisement, it is plausible to assume that the same user would have clicked on similar advertisements, had they been displayed.

The relationship between the signals and the incurred losses defines the instance of a partial monitoring problem. We refer the interested reader to Cesa-Bianchi and Lugosi (2006) for more details, including an historical account. Recent progress on this problem has been made by Bartok et al. (2010) and Foster and Rakhlin (2012).

Chapter 4 Contextual bandits

A natural extension of the multi-armed armed problem is obtained by associating side information with each arm. Based on this side information, or context, a notion of “contextual regret” is introduced where optimality is defined with respect to the best policy (i.e., mapping from contexts to arms) rather than the best arm. The space of policies, within which the optimum is sought, is typically chosen in order to have some desired structure. A different viewpoint is obtained when contexts are privately accessed by the policies (which are then appropriately called “experts”). In this case the contextual information is hidden from the forecaster, and arms must be chosen based only on the past estimated performance of the experts.

Contextual bandits naturally arise in many applications. For example, in personalized news article recommendation the task is to select, from a pool of candidates, a news article to display whenever a new user visits a website. The articles correspond to arms, and a reward is obtained whenever the user clicks on the selected article. Side information, in the form of features, can be extracted from both user and articles. For the user this may include historical activities, demographic information, and geolocation; for the articles, we may have content information and categories. See Li et al. (2010) for more details on this application of contextual bandits.

In general, the presence of contexts creates a wide spectrum of possible variations obtained by combining assumptions on the rewards with assumptions on the nature of contexts and policies. In this chapter we describe just a few of the results available in the literature, and use the bibliographic remarks to mention all those that we are aware of.

The most basic example of contextual bandits is obtained when game rounds t=1,2,t=1,2,\dots are marked by contexts s1,s2,s_{1},s_{2},\dots from a given context set S\mathcal{S}. The forecaster must learn the best mapping g:S{1,,K}g:\mathcal{S}\to\{1,\dots,K\} of contexts to arms. We analyze this simple side information setting in the case of adversarial rewards, and we further assume that the sequence of contexts sts_{t} is arbitrary but fixed. The approach we take is the simplest: run a separate instance of Exp3 on each distinct context.

We introduce the following notion of pseudoregret

Here stSs_{t}\in\mathcal{S} denotes the context marking the tt-th game round. A bound on this pseudoregret is almost immediately obtained using the adversarial bandit results from Section 3.

There exists a randomized forecaster for bandits with side information (the S\mathcal{S}-Exp3 forecaster, defined in the proof) that satisfies

Let S=SS=|\mathcal{S}|. The S\mathcal{S}-Exp3 forecaster runs an instance of Exp3 on each context sSs\in\mathcal{S}. Let nsn_{s} the number of times when st=ss_{t}=s within the first nn time steps. Using the bound (18) established in Theorem 5.1 we get

where in the last step we used Jensen’s inequality and the identity sns=n\sum_{s}n_{s}=n.

In subsection 10.1, we extend this construction by considering several context sets simultaneously.

A lower bound \Omega\bigl{(}\sqrt{nSK}\bigl{)} is an immediate consequence of the adversarial bandit lower bound (Theorem 7.9) under the assumption that a constant fraction of the contexts in S\mathcal{S} marks at least constant fraction of the nn game rounds.

The expert case

We now consider the contextual variant of the basic adversarial bandit model of Chapter 3. In this variant there is a finite set of NN randomized policies. Following the setting of prediction with expert advice, no assumptions are made on the way policies compute their randomized predictions, and the forecaster experiences the contexts only through the advice provided by the policies. For this reason, in what follows we use the word expert to denote a policy. Calling this a model of contextual bandits may sound a little strange, as the structure of contexts does not seem to play a role here. However, we have decided to include this setting in this chapter because bandit with experts have been used in practical contextual bandit problems -see, e.g., the news recommendation experiment in Beygelzimer et al. (2011b).

Exp4 without mixing and with ηt=η=2lnNnK\eta_{t}=\eta=\sqrt{\frac{2\ln N}{nK}} satisfies

On the other hand, with ηt=lnNtK\eta_{t}=\sqrt{\frac{\ln N}{tK}} it satisfies

We apply the analysis of Exp3 (Theorem 5.1) to a forecaster using distributions qtq_{t} over NN experts, whose pseudo-losses are y~j,t\widetilde{y}_{j,t} for j=1,,Nj=1,\dots,N. This immediately gives the inequality

Now, similarly to (21) in the proof of Theorem 5.1, we establish the following inequalities

where we used Jensen’s inequality to prove (44). By applying (43) and (44) to (41) we get

Now note that, if we take expectation over the draw of I1,,InI_{1},\dots,I_{n}, using (42) we obtain

Choosing ηt\eta_{t} as in the statement of the Theorem, and using the inequality t=1nt1/22n\sum_{t=1}^{n}t^{-1/2}\leq 2\sqrt{n}, concludes the proof.

Besides pseudo-regret, the contextual regret

can be also bounded, at least with high probability. Indeed, similarly to the variant Exp3.P of Exp3 (see Section 6), an analogous modification of Exp4, called Exp4.P, satisfies

for some constant c>0c>0 and with probability at least 1δ1-\delta, where δ(0,1)\delta\in(0,1) is a parameter of the algorithm.

We revisit the basic contextual scenario introduced in Section 9, where the goal is to compete against the best mapping from contexts to arms. Consider now a class {Sθ:θΘ}\left\{{\mathcal{S}_{\theta}}\,:\,{\theta\in\Theta}\right\} of context sets. In this new game, each time step t=1,2,t=1,2,\dots is marked by the vector \bigl{(}s_{\theta,t}\bigr{)}_{\theta\in\Theta} of contexts, one for each set in Θ\Theta. Introduce the pseudoregret

When Θ=1|\Theta|=1 we recover the contextual pseudoregret RnS\overline{R}^{\mathcal{S}}_{n}. In general, when Θ\Theta contains more than one set, the forecaster must learn both the best set Sθ\mathcal{S}_{\theta} and the best function g:Sθ{1,,K}g:\mathcal{S}_{\theta}\to\{1,\dots,K\} from that set to the set of arms.

We find this variant of contextual bandits interesting because its solution involves a nontrivial combination of two of the main algorithms examined in this chapter: Exp4 and S\mathcal{S}-Exp3. In particular, we consider a scenario in which Exp4 uses instances of S\mathcal{S}-Exp3 as experts. The interesting aspect is that these experts are learning themselves, and thus the analysis of the combined algorithm requires taking into account the learning process at both levels.

Note that in order to solve this problem we could simply lump all contexts in a big set and use the proof of Theorem 9.1. However, this would give a regret bound that depends exponentially in Θ|\Theta|. On the other hand, by using Exp4 directly on the set of all policies gg (which is of cardinality exponential in Θ×S|\Theta|\times|S|), we could improve this to a bound that scales with Θ\sqrt{|\Theta|}. The idea we explore here is to use Exp4 over the class Θ\Theta of “experts”, and combine this with the S\mathcal{S}-Exp3 algorithm of Theorem 9.1. This gets us down to a logarithmic dependency on Θ|\Theta|, albeit at the price of a worse dependency on nn.

Intuitively, Exp4 provides competitiveness against the best context set Sθ\mathcal{S}_{\theta}, while the instances of the S\mathcal{S}-Exp3 algorithm, acting as experts for Exp4, ensure that we are competitive against the best function g:Sθ{1,,K}g:\mathcal{S}_{\theta}\to\{1,\dots,K\} for each θΘ\theta\in\Theta. However, by doing so we immediately run into a problem: the ptp_{t} used by Exp4 is not the same as the ptp_{t}’s used by each expert. In order to address this issue, we now show that the analysis of Exp3 holds even when the sequence of plays I1,I2,I_{1},I_{2},\dots is drawn from a sequence of distributions q1,q2,q_{1},q_{2},\dots possibly different from the one chosen by the forecaster. The only requirement we need is that each probability in qtq_{t} be bounded away from zero.

where InqnI^{n}\sim q^{n} means that each ItI_{t} is drawn from qtq_{t} for t=1,,nt=1,\dots,n, and ptp_{t} is the distribution used by Exp3 at time tt.

The first term is bounded in a manner slightly different from the proof of Theorem 5.1,

The analysis of the second term is unchanged: Let L~i,0=0\widetilde{L}_{i,0}=0, Φ0(η)=0\Phi_{0}(\eta)=0 and Φt(η)=1ηlog1Ki=1Kexp(ηL~i,t)\Phi_{t}(\eta)=\frac{1}{\eta}\log\frac{1}{K}\sum_{i=1}^{K}\exp{\left(-\eta\widetilde{L}_{i,t}\right)}. Then by definition of ptp_{t} we have:

Proceeding again as in the proof of Theorem 5.1 we obtain

Choosing η\eta as in the statement of the theorem concludes the proof.

It is left to the reader to verify that the analysis of S\mathcal{S}-Exp in Theorem 9.1 can be combined with the above analysis to give the bound

where γ>0\gamma>0 is the mixing coefficient. This mixing clearly achieves the desired property for each pi,tp_{i,t}.

Exp4 with mixing coefficient γ\gamma and with ηt=η=γ/K\eta_{t}=\eta=\gamma/K satisfies

There exists a randomized forecaster achieving

for any class {Sθ:θΘ}\left\{{\mathcal{S}_{\theta}}\,:\,{\theta\in\Theta}\right\} of context sets.

We run the Exp4 forecaster with mixing coefficient γ\gamma using instances of the S\mathcal{S}-Exp3 algorithm (defined in the proof of Theorem 9.1) as experts. Each S\mathcal{S}-Exp3 instance is run on a different context set Sθ\mathcal{S}_{\theta} for θΘ\theta\in\Theta. Let ξtθ\xi_{t}^{\theta} be the distribution used at time tt by the S\mathcal{S}-Exp3 instance running on context set Sθ\mathcal{S}_{\theta} and let pnp^{n} be the joint distribution of In=(I1,,In)I^{n}=(I_{1},\dots,I_{n}) used by Exp4. Since pi,tγKp_{i,t}\geq\tfrac{\gamma}{K} for all i=1,,Ki=1,\dots,K and t1t\geq 1, we can use (46) with ε=γ/K\varepsilon=\gamma/K. Thus, Theorem 10.7 implies

Substituting ε=γ/K\varepsilon=\gamma/K in the above expression and choosing γ\gamma of the order of n1/3(maxθΘSθKlnK)1/3lnΘn^{-1/3}\left(\max_{\theta\in\Theta}|\mathcal{S}_{\theta}|K\ln K\right)^{1/3}\sqrt{\ln|\Theta|} gives the desired result.

Note that in Theorem 10.9 the rate is n2/3n^{2/3}, in contrast to the more usual n1/2n^{1/2} bandit rate. This worsening is inherent in the Exp4-over-Exp3 construction. It is not known whether the rate could be improved while keeping the same logarithmic dependence on Θ|\Theta| guaranteed by this construction.

Stochastic contextual bandits

We now move on to consider the case in which policies have a known structure. More specifically, each policy is a function ff mapping the context space to the arm space {1,,K}\{1,\dots,K\} and the set F\mathcal{F} of policies is given as an input parameter to the forecaster.

Under this assumption on the policies, the problem can be viewed as a bandit variant of supervised learning. For this reason, here and in the next section we follow the standard notation of supervised learning and use xx rather than ss to denote contexts.

the risk-minimizing policy in the class. The regret with respect to the class F\mathcal{F} of a forecaster choosing arms I1,I2,I_{1},I_{2},\dots is then defined by

In the rest of this section we focus on the case of K=2K=2 arms and parametrize classes F\mathcal{F} of policies f:X{1,2}f:\mathcal{X}\to\{1,2\} by their VC-dimension dd —see Boucheron et al. (2005) for a modern introduction to VC theory. For this setting, we consider the following forecaster.

VE (VC dimension by Exponentiation): Parameters: number nn of rounds, nn^{\prime} satisfying 1nn1\leq n^{\prime}\leq n. (1) For the first nn^{\prime} rounds, choose arms uniformly at random. (2) Build FF\mathcal{F}^{\prime}\subseteq\mathcal{F} such that for any fFf\in\mathcal{F} there is exactly one fFf^{\prime}\in\mathcal{F}^{\prime} satisfying f(xt)=f(xt)f(x_{t})=f^{\prime}(x_{t}) for all t=1,,nt=1,\dots,n^{\prime}. (3) For t=n+1,,nt=n^{\prime}+1,\dots,n play by simulating Exp4.P using the policies of F\mathcal{F}^{\prime} as experts.

We now show that the per round regret of VE is of order d/n\sqrt{d/n}, excluding logarithmic factors. This rate is equal to the optimal rate for supervised learning of VC-classes, showing that —in this case— the price of bandit information is essentially zero.

For any class F\mathcal{F} of binary policies f:X{0,1}f:\mathcal{X}\to\{0,1\} of VC-dimension dd and for all n>dn>d, the forecaster VE run with n=n(2dlnend+ln3δ)n^{\prime}=\sqrt{n\left(2d\ln\frac{en}{d}+\ln\frac{3}{\delta}\right)} satisfies

for some constant c>0c>0 and with probability at least 1δ1-\delta with respect to both the random data generation and VE’s internal randomization.

with probability at least 2δ3\tfrac{2\delta}{3} with respect to both the random sample draw and VE’s internal randomization.

holds with probability at least δ3\tfrac{\delta}{3} for any sample realization, it holds with the same probability for a random sample. Hence, by choosing nn^{\prime} as in the statement of the theorem and overapproximating, we get the desired result.

The multiclass case

A simple but effective learning strategy for (non-bandit) online classification is the multiclass Perceptron algorithm. This algorithm updates WW at time tt using the rule WW+XtW\leftarrow W+X_{t}, where XtX_{t} is a K×dK\times d matrix with components \bigl{(}X_{t}\bigr{)}_{i,j}=x_{t,j}\bigl{(}\mathds{1}_{y_{t}=i}-\mathds{1}_{\widehat{y}_{t}=i}\bigr{)}. Therefore, the update rule can be rewritten as

where wiw_{i} denotes the ii-th row of matrix WW. Note, in particular, that no update takes place (i.e., XtX_{t} is the all zero matrix) when y^t=yt\widehat{y}_{t}=y_{t}, which means that yty_{t} is predicted correctly.

uniformly over n1n\geq 1, where the infimum is over all K×dK\times d matrices UU and \left\|{\,\cdot\,}\right\| denotes the Frobenius norm. Here Ln(U)L_{n}(U) denotes the cumulative hinge loss of policy UU,

where []+=max{0,}[\,\cdot\,]_{+}=\max\{0,\,\cdot\,\} is the hinge function. Finally, Lˉn(U)=1nLn(U)\bar{L}_{n}(U)=\tfrac{1}{n}L_{n}(U) is the average hinge loss of UU.

We now introduce a bandit variant of the multiclass Perceptron called Banditron.

We now analyze the expected number of prediction mistakes made by the Banditron algorithm on any sequence of examples (xt,yt)(x_{t},y_{t}). Unlike the non-bandit case, where the number of mistakes MnM_{n} after nn steps of the multiclass Perceptron provides a “multiclass regret” bound M_{n}-L_{n}(U)=\mathcal{O}\bigl{(}\sqrt{n}\bigr{)}, in the bandit case the regret achieved by the variant of the Perceptron is only bounded by \mathcal{O}\bigl{(}n^{2/3}\bigr{)}.

where the infimum is over all K×dK\times d matrices UU and Lˉn(U)=1nLn(U)\bar{L}_{n}(U)=\tfrac{1}{n}L_{n}(U) is the average hinge loss of UU.

Therefore, if yty^ty_{t}\neq\widehat{y}_{t},

because pi,t=γp_{i,t}=\gamma when yty^ty_{t}\neq\widehat{y}_{t}. Otherwise, if yt=y^ty_{t}=\widehat{y}_{t}

because pi,t=1γp_{i,t}=1-\gamma when yt=y^ty_{t}=\widehat{y}_{t} and γ12\gamma\leq\tfrac{1}{2}. Hence,

In order to see why the inequality holds, note that

Now we lower bound \bigl{\langle}{U},{W_{n+1}}\bigr{\rangle} as follows,

Therefore, using again the fact that W1W_{1} is the zero matrix,

Combining the upper and lower bounds on \bigl{\langle}{U},{W_{n+1}}\bigr{\rangle} we get

Choosing γ\gamma as in the statement of the theorem yields the desired result. Note that γ12\gamma\leq\tfrac{1}{2} because we assume n8Kn\geq 8K.

Bibliographic remarks

A model of contextual stochastic bandits close to those studied here is introduced by Wang et al. (2005b). The context is provided by a i.i.d. sequence of random variables and the rewards for each arm depend on the context through an unknown parametric model beloging to a known class. This result has been generalized to the non i.i.d. case by Wang et al. (2005a), to the multivariate linear case by Rusmevichientong and Tsitsiklis (2010), and to the multivariate and nonparametric case by Perchet and Rigollet (2011). In Slivkins (2011), contexts and rewards belong to arbitrary metric spaces, and the unknown function mapping contexts to rewards satisfies a Lipschitz assumption (remarkably, the same algorithm also applies to slowly changing expected rewards and sleeping bandit settings). The case of deterministic covariates (fixed design), finitely many arms, and a linear stochastic dependence between covariates and rewards has been studied in Auer (2002); Chu et al. (2011) —see also Abe and Long (1999). The work of Filippi et al. (2010) extends the analysis of fixed design by assuming a generalized linear model to capture the dependence of rewards on covariates.

The simple contextual model analyzed in Section 9, as well as its extension described in Subsection 10.1, are due to Maillard and Munos (2011). The Exp4 algorithm for the adversarial case was introduced in Auer et al. (2002b). Subsequent improvements were proposed in the two papers Beygelzimer et al. (2011a) (Exp4.P with high-probability bounds) and McMahan and Streeter (2009) (exploitation of correlations in expert advice). The VE algorithm and its analysis in Section 11 are also taken from Beygelzimer et al. (2011a).

A stochastic model for contextual bandits with finitely many arms and finitely many states has been investigated by Seldin et al. (2011) using new sophisticated tools of PAC-Bayesian analysis.

The general stochastic model of Section 11 for contextual bandits with finitely many arms is due to Langford and Zhang (2007). An efficient algorithm for this model has been recently proposed in Dudik et al. (2011).

The bandit multiclass model of Section 12 is due to Langford and Zhang (2007). The Banditron algorithm and its analysis are from Kakade et al. (2008). See also Crammer and Gentile (2011) and Hazan and Kale (2011) for recent variants and improvements.

Chapter 5 Linear bandits

It is important to note that, without any loss of generality (as far as regret bounds are concerned), one can always assume that K\mathcal{K} has size O(nd)\mathcal{O}(n^{d}). Indeed, since K\mathcal{K} is a compact set and the loss is linear (and therefore Lipschitz), one can cover K\mathcal{K} with O(nd)\mathcal{O}(n^{d}) points such that one incurs a vanishing extra cumulative regret by playing on the discretization rather than on the original set. Of course, the downside of this approach is that a strategy resulting from such a cover is often not computationally efficient. On the other hand, this assumption allows us to apply ideas from Chapter 3 to this more general setting. We analyze the pseudo-regret for finite classes in Section 14. Without loss of generality, it is also possible to assume that K\mathcal{K} is a convex set. Indeed, the pseudo-regret is the same if one plays xtx_{t}, or if one plays a point at random in K\mathcal{K} such that the expectation of the played point is xtx_{t}. This remark is critical, and allows us to develop a powerful technology known as the Mirror Descent algorithm. We describe this approach in Section 15, and explore it further in subsequent sections.

To describe the strategy, we first need a useful result from convex geometry: John’s theorem. This result concerns the ellipsoid E\mathcal{E} of minimal volume enclosing a given convex set K\mathcal{K} (which we call the John’s ellipsoid of K\mathcal{K}). Basically, the theorem states that E\mathcal{E} has many contact points with K\mathcal{K}, and these contact points are “nicely” distributed, in the sense that they almost form an orthonormal basis —see Ball (1997) for a proof.

Now the Exp2 algorithm with John’s exploration corresponds to playing according to the following probability distribution

For any η,γ>0\eta,\gamma>0 such that ηdγ{\eta d}\leq{\gamma}, Exp2 with John’s exploration satisfies

In particular, with η=lnN3nd\eta=\sqrt{\frac{\ln N}{3nd}} and γ=ηd\gamma=\eta d,

Since K\mathcal{K} is finite, we can easily adapt the analysis of Exp3 in Theorem 5.1 to take into account the exploration term. This gives

Now we use a spectral decomposition of PtP_{t} in an orthonormal basis for ,\langle\cdot,\cdot\rangle and write Pt=i=1dλiviviP_{t}=\sum_{i=1}^{d}\lambda_{i}v_{i}\otimes v_{i}. In particular, we have Pt1=i=1d1λiviviP_{t}^{-1}=\sum_{i=1}^{d}\frac{1}{\lambda_{i}}v_{i}\otimes v_{i} and thus

where the last inequality follows from the fact that x,x1\langle x,x\rangle\leq 1 for any xKx\in\mathcal{K}, since K\mathcal{K} is included in the unit ball. To conclude the proof, we need to lower bound the smallest eigenvalue of PtP_{t}. Using Theorem 14.1, one can see that PtγdIdP_{t}\succeq\frac{\gamma}{d}I_{d}, and thus λiγd\lambda_{i}\geq\frac{\gamma}{d}. Since ηdγ{\eta d}\leq{\gamma}, the proof is concluded.

Online Mirror Descent (OMD)

The rest of this chapter is organized as follows. Next, we briefly recall a few key concepts from convex analysis. Then we describe the OMD strategy and prove a general regret bound. In Section 16 we introduce Online Stochastic Mirror Descent (OMSD), which is a variant of OMD based on a stochastic estimate of the gradient. We apply this strategy to linear losses in two different bandit settings. Finally, in Section 18 we show how OMSD obtains improved bounds in certain special cases. The case of convex losses with bandit feedback is treated in Chapter 6.

Such a gg is called a subgradient of ff at xx.

Abusing notation, we use f(x)\nabla f(x) to denote both the gradient of ff at xx when ff is differentiable, and any subgradient of ff at xx when ff is subdifferentiable (a sufficient condition for subdifferentiability of ff is that ff is convex and X\mathcal{X} is open).

This definition directly implies the Fenchel-Young inequality for convex functions, uxf(x)+f(u)u^{\top}x\leq f(x)+f^{*}(u).

FF is strictly convex and admits continuous first partial derivatives on D\mathcal{D};

Let FF be a Legendre function. Then F=FF^{**}=F, and F=(F)1\nabla F^{*}=(\nabla F)^{-1} restricted on the set D\mathcal{D}^{*}. Moreover, for all x,yDx,y\in\mathcal{D},

The gradient F\nabla F maps D\mathcal{D} to the dual space D\mathcal{D}^{*}, and F\nabla F^{*} is the inverse mapping from the dual space to the original (primal) space. Note also that (50) shows that the Bregman divergence in the primal exactly corresponds to the Bregman divergence of the Legendre-transform in the dual.

The next lemma shows that the geometry induced by a Bregman divergence resembles to the geometry of the squared Euclidean distance. See (Cesa-Bianchi and Lugosi, 2006, Lemma 11.3) for a proof.

Let KD\mathcal{K}\subseteq\overline{\mathcal{D}} be a closed convex set such that KD\mathcal{K}\cap\mathcal{D}\neq\emptyset. Then, for all xDx\in\mathcal{D}, the Bregman projection

exists and is unique. Moreover, for all zKDz\in\mathcal{K}\cap\mathcal{D} and yKy\in\mathcal{K},

The idea of OMD is very simple: first, select a Legendre function FF on DK\overline{\mathcal{D}}\supset\mathcal{K} (such that KD\mathcal{K}\cap\mathcal{D}\neq\emptyset); second, perform a gradient descent step, where the update with the gradient is performed in the dual space D\mathcal{D}^{*} rather than in the primal D\mathcal{D}; third, project back to K\mathcal{K} according to the Bregman divergence defined by FF.

Note that step (2) is well defined if the following consistency condition is satisfied:

Note also that step (2) can be rewritten as

Finally, note that the Bregman projection in step (3) is a convex program, in the sense that xDF(x,y)x\mapsto D_{F}(x,y) is always a convex function. This does not necessarily imply that step (3) can be performed efficiently, since in some cases the feasible set K\mathcal{K} might only be described with an exponential number of constraints (we encounter examples like this in Section 17).

When K\mathcal{K} is the simplex and F(x)=i=1dxilnxii=1dxiF(x)=\sum_{i=1}^{d}x_{i}\ln x_{i}-\sum_{i=1}^{d}x_{i}, OMD reduces to an exponentially weighted average forecaster, similar to those studied in Chapter 3. The well-known online gradient descent strategy corresponds to taking F(x)=12x22F(x)=\frac{1}{2}\left\|{x}\right\|^{2}_{2}. In the following we shall see other possibilities for the Legendre function FF.

We prove now a very general and powerful theorem concerning the regret of OMD.

Let K\mathcal{K} be a compact and convex set of arms, L\mathcal{L} be a set of subdifferentiable functions, and FF a Legendre function defined on DK\overline{\mathcal{D}}\supset\mathcal{K}, such that (51) is satisfied. Then OMD satisfies for any xKx\in\mathcal{K},

Let xKx\in\mathcal{K}. Since L\mathcal{L} is a set of subdifferentiable functions, we have:

Using (52), and applying the definition of DFD_{F}, one obtains

By Lemma 15.8, one gets DF(x,wt+1)DF(x,xt+1)+DF(xt+1,wt+1),D_{F}(x,w_{t+1})\geq D_{F}(x,x_{t+1})+D_{F}(x_{t+1},w_{t+1}), hence

By the nonnegativity of the Bregman divergences, we get

Let xKx\in\mathcal{K}. Since L\mathcal{L} is a set of subdifferentiable functions, we have

Using the Fenchel-Young inequality, one obtains

Observe that F(0)=F(x1)F^{*}(0)=-F(x_{1}) and, for each term in the above sum,

Online Stochastic Mirror Descent (OSMD)

The following theorem establishes a general regret bound for OSMD. Note that here the pseudo-regret is defined as

Note also that we state the theorem for a Legendre function FF, but a similar result can be obtained under the same assumptions as those of Theorem 15.11.

which concludes the proof of the first regret bound. The case of a linear loss follows very easily from the same computations.

Online combinatorial optimization

In this section we consider an interesting special case of online linear optimization. In the online combinatorial optimization setting the set of arms is C{0,1}d\mathcal{C}\subseteq\{0,1\}^{d} and the set of linear loss functions is L=d\mathcal{L}=^{d}. We assume v1=m\left\|{v}\right\|_{1}=m for all vCv\in\mathcal{C} and for some integer mdm\leq d. Many interesting problems fall into this framework, including ranking/selection of mm items, or path planning.

Using Theorem 16.13 one directly obtains:

We show now how to use this bound to obtain concrete performances for OSMD using the negative entropy as Legendre function. Later, we show that one can improve the results by logarithmic factors, using a more subtle Legendre function.

We start with OSMD and the negative entropy.

In particular, with the estimate (53) and η=2mndlndm\eta=\sqrt{\frac{2m}{nd}\ln\frac{d}{m}},

Moreover, straightforward computations give

We now greatly generalize the negative entropy with the following definition. When used with OSMD, this more general entropy allows us to obtain a bound tighter than that of Theorem 17.15.

With a potential ψ\psi we associate the function FψF_{\psi} defined on D=(ω,+)d\mathcal{D}=(\omega,+\infty)^{d} by

We restrict our attention to -potentials. A non-zero ω\omega might be used to derive high probability regret bounds (instead of pseudo-regret bounds). Note that with ψ(x)=ex\psi(x)=e^{x} we have that FψF_{\psi} reduces to the negative entropy.

Let ψ\psi be a -potential. Then FψF_{\psi} is Legendre and for all u,vD=(,a)du,v\in\mathcal{D}^{*}=(-\infty,a)^{d} such that uiviu_{i}\leq v_{i} for i=1,,di=1,\ldots,d,

It is easy to check that FF is a Legendre function. Moreover, since \nabla F^{*}(u)=(\nabla F)^{-1}(u)=\big{(}\psi(u_{1}),\dots,\psi(u_{d})\big{)} we obtain

Since the function ψ\psi is convex, and uiviu_{i}\leq v_{i}, we have

We are now ready to bound the pseudo-regret of OSMD run with an arbitrary -potential. For a specific choice of the potential we obtain an improvement of Theorem 17.15. In particular for m=1m=1 this result gives the log-free bound for the adversarial multi-armed bandit that was discussed in Section 8.1.

In particular, choosing the -potential ψ(x)=(x)q\psi(x)=(-x)^{-q}, the estimate (53), and η=2q1m12/qd12/q\eta=\sqrt{\frac{2}{q-1}\frac{m^{1-2/q}}{d^{1-2/q}}},

The first inequality trivially follows from (54), Lemma 17.18, and the fact that \psi^{\prime}\bigl{(}\psi^{-1}(s)\bigr{)}=\frac{1}{(\psi^{-1})^{\prime}(s)}.

Let ψ(x)=(x)q\psi(x)=(-x)^{-q}. Then we have that ψ1(x)=x1/q\psi^{-1}(x)=-x^{-1/q} and F(x)=qq1i=1dxi11/qF(x)=-\frac{q}{q-1}\sum_{i=1}^{d}x_{i}^{1-1/q}. In particular, by Hölder’s inequality, since i=1dx1(i)=m\sum_{i=1}^{d}x_{1}(i)=m,

Moreover, note that (ψ1)(x)=1qx11/q(\psi^{-1})^{\prime}(x)=\frac{1}{q}x^{-1-1/q}, and

Improved regret bounds for bandit feedback

where ξt\xi_{t} is a Bernoulli random variable of parameter xt\left\|{x_{t}}\right\|, ItI_{t} is drawn uniformly at random in {1,,d}\{1,\ldots,d\}, and εt\varepsilon_{t} is a Rademacher random variable with parameter 12\frac{1}{2}.

In particular, with γ=1n\gamma=\frac{1}{\sqrt{n}} and η=lnn2nd\eta=\sqrt{\frac{\ln n}{2nd}},

First, it is clear that by playing on K=(1γ)K\mathcal{K}^{\prime}=(1-\gamma)\mathcal{K} instead of K\mathcal{K}, OSMD incurs an extra γn\gamma n regret. Second, note that FF is stricly convex (it is the composition of a convex and nondecreasing function with the Euclidean norm) and

The first term is clearly bounded by 1ηln1γ\frac{1}{\eta}\ln\frac{1}{\gamma} (since x1=0x_{1}=0). For the second term, we need to do a few computations. The first one follows from (58)), the others follow from simple algebra

Let Θ(u,v)\Theta(u,v) such that DF(u,v)=11+vΘ(u,v)D_{F^{*}}(u,v)=\frac{1}{1+\left\|{v}\right\|}\Theta(u,v). First note that

Now using that ln(1+x)xx2\ln(1+x)\geq x-x^{2} for all x12x\geq-\frac{1}{2}, we obtain that for u,vu,v such that uv1+v12\frac{\left\|{u}\right\|-\left\|{v}\right\|}{1+\left\|{v}\right\|}\geq-\frac{1}{2},

which concludes the proof of (56). For the proof of (57) it suffices to note that

and perform with straightforward computations.

Refinements and bibliographic remarks

Online convex optimization in the full information setting was introduced by Zinkevich (2003). Online linear optimization with bandit feedback was pioneered in Awerbuch and Kleinberg (2004); McMahan and Blum (2004). For this problem, Dani et al. (2008a) were the first to obtain optimal \mathcal{O}\bigl{(}\sqrt{n}\bigr{)} bounds in terms of the number nn of rounds. This was done using the Exp2 strategy with an exploration uniform over a barycentric spanner for K\mathcal{K}. The exploration part was first improved by Cesa-Bianchi and Lugosi (2012) for combinatorial sets K\mathcal{K}. Finally, the optimal exploration based on John’s theorem was introduced by Bubeck et al. (2012a). Theorem 14.2 is extracted from this last paper.

Simultaneously with the line of research on Exp2, algorithms based on Online Mirror Descent were also investigated. Mirror Descent was originally introduced in the seminal work of Nemirovski (1979); Nemirovski and Yudin (1983) as a standard (offline) convex optimization method. A somewhat similar class of algorithms was rediscovered in the online learning community, see Herbster and Warmuth (1998); Grove et al. (2001); Kivinen and Warmuth (2001); Shalev-Shwartz (2007). The connection between existing online learning algorithms (such as Exponential weights or Online Gradient Descent) and Mirror Descent was first made explicit in Cesa-Bianchi and Lugosi (2006) —see also Rakhlin (2009) and Hazan (2011). Earlier applications of Mirror Descent in the learning community can be found in Juditsky et al. (2005). The first application of Mirror Descent to online linear optimization with bandit feedback was given by Abernethy et al. (2008). In this pioneering paper, the authors describe the first computationally efficient strategy (i.e., with complexity polynomial in dd) with O(n)\mathcal{O}(\sqrt{n}) regret. The main idea is to use Mirror Descent with a self-concordant barrier FF for the set K\mathcal{K}. Unfortunately, the drawback is a suboptimal dependency on dd in the regret. More precisely. they obtain a O(d2n)\mathcal{O}(d^{2}\sqrt{n}) regret under the bounded scalar loss assumption, while Exp2 with John’s exploration attains O(dn)\mathcal{O}(d\sqrt{n}). However, Mirror Descent can also deliver optimal regret bounds in the bandit case, as we showed in Section 18, which is extracted from Bubeck et al. (2012a).

The presentation of the Online Mirror Descent algorithm in Section 15 is inspired by Bubeck (2011). The definition of Legendre functions comes from (Cesa-Bianchi and Lugosi, 2006, Chapter 11) —further developments on convex analysis can be found in Hiriart-Urruty and Lemaréchal (2001); Boyd and Vandenberghe (2004). Theorem 15.9 is taken from Audibert et al. (2011), but the proof technique goes back at least to Ben-Tal and Nemirovski (1999). The proof of Theorem 15.11 is adapted from Kakade et al. (2012). Section 16 is inspired by gradient-free optimization, a topic extensively studied since Robbins and Monro (1951); Kiefer and Wolfowitz (1952) —see Nemirovski et al. (2009); Conn et al. (2009); Nesterov (2011); Bach and Moulines (2011) for recent accounts on this theory. Alternative views have been proposed on the Online Mirror Descent strategy. In particular, it is equivalent to a Follow The Regularized Leader, and to proximal algorithms, see Rakhlin (2009). This viewpoint was pioneered by Beck and Teboulle (2003) —see also Bartok et al. (2011) for more details. Finally, a notion of universality of Online Mirror Descent in the online prediction setting was proposed by Srebro et al. (2011).

The online combinatorial optimization problem studied in Section 17 was introduced by Kalai and Vempala (2005) for the full information setting. Several works have studied this problem for specific sets C\mathcal{C}, see in particular Takimoto and Warmuth (2003); Warmuth and Kuzmin (2008); Helmbold and Warmuth (2009); Hazan et al. (2010); Koolen et al. (2010); Warmuth et al. (2011); Cesa-Bianchi and Lugosi (2012). The semi-bandit feedback was studied in the series of papers György et al. (2007); Kale et al. (2010); Uchiya et al. (2010); Audibert et al. (2011). The presentation adopted in this section is based on the last paper. OSMD with negative entropy was first studied by Helmbold and Warmuth (2009) for the full information setting and for a specific set C\mathcal{C}. It was then studied more generally in Koolen et al. (2010) for any set C\mathcal{C}. The generalization to semi-bandit feedback was done by Audibert et al. (2011). OSMD with a Legendre derived from a potential was introduced by Audibert et al. (2011). In the case of the simplex, this strategy corresponds to the INF strategy of Audibert and Bubeck (2009) discussed in Section 8.1.

Online linear optimization is still far from being completely understood. For instance, see (Bubeck, 2011, Chapter 9) for a list of open problems. In this section we also omitted a few important topics related to online linear optimization. We briefly review some of them below.

Under the bounded scalar loss assumption, it was proved by Dani et al. (2008a) that for K=d\mathcal{K}=^{d} the minimax regret in the full information setting is at least of order dn\sqrt{dn}, while under bandit feedback it is of order dnd\sqrt{n}. In both cases Exp2 is matching these lower bounds (using John’s exploration in the bandit case).

In the combinatorial setting, where K{0,1}d\mathcal{K}\subset\{0,1\}^{d} and L=d\mathcal{L}=^{d}, Audibert et al. (2011) show that the minimax regret in the full information and semi-bandit cases is at least of order dnd\sqrt{n}, while in the bandit case it is of order d3/2nd^{3/2}\sqrt{n}. OSMD with the negative entropy matches the bounds in the full information and semi-bandit cases. However, in the bandit case the best known bound is obtained by Exp2 (with John’s exploration) and gives a regret of order d2nd^{2}\sqrt{n}. It is important to remark that Audibert et al. (2011) show that Exp2 is a provably suboptimal strategy in the combinatorial setting.

Finally, lower bounds for the full information case, and for a few specific sets K\mathcal{K} of interest, were derived by Koolen et al. (2010).

2 High probability bounds

In this chapter we focused on the pseudo-regret Rn\overline{R}_{n}. However, just like in Chapter 3, a much more important and interesting statement concerns high probability bounds for the regret RnR_{n}. Partial results in this direction can be found in Bartlett et al. (2008) for the Exp2 strategy, and in Abernethy and Rakhlin (2009) for the OSMD algorithm.

3 Stochastic online linear optimization

Chapter 6 Nonlinear bandits

This modification has important consequences. For instance with strictly convex losses one has to do local perturbations in order to estimate the loss gradient, this is in contrast to the global perturbations studied in the previous chapter. In agreement with the setting of Chapter 5, we initially focus on the nonstochastic setting, where the forecaster faces an unknown sequence of convex Lipschitz and differentiable losses (in the nonlinear case the regret scales with the Lipschitz constant of losses). Problems of this kind can be viewed as dynamic variants of convex optimization problems, in which the convex function to optimize evolves over time. The bandit constraint can be simply interpreted as the impossibility of computing gradients (because, for instance, we do not have a explicit representation of the function, but it can only be accessed by querying for values at desired points).

We look at two feedback models. In the first one, at each step the forecaster evaluates the loss function at two points: the played point plus an additional point of its choice. In the second one, only the value of the loss evaluated at the played point is made available to the forecaster. We show that while the two-points model allows for a \mathcal{O}\bigl{(}\sqrt{n}\bigr{)} bound on pseudo-regret, in the one-point model a pseudo-regret bound of only \mathcal{O}\bigl{(}n^{3/4}\bigr{)} is achieved. The stochastic setting is investigated in Section 22 where, similarly to Chapter 2, we assume that each play of an arm returns a stochastic loss with fixed but unknown mean. Unlike the nonstochastic case, the mean loss function is assumed to be Lipschitz and unimodal, but not necessarily convex. For keeping things simple, the stochastic setting is studied in 11-dimensional case, when arms are points in the unit interval. For this case we show a bound on the pseudo-regret of \mathcal{O}\bigl{(}\sqrt{n}(\log n)\bigr{)}.

We focus our analysis on OSMD with Legendre function F=122F=\tfrac{1}{2}\left\|{\cdot}\right\|^{2}, where \left\|{\cdot}\right\| is the Euclidean norm. The resulting strategy, Online Stochastic Gradient Descent (OSGD) is sketched below here.

We now introduce our main technical tool: the two-point gradient estimate. The two points on which the loss value is queried at time tt are denoted by Xt+X_{t}^{+} and XtX_{t}^{-}. OSGD always plays one of these two points at random.

In order to compute the expectation of g~t\widetilde{g}_{t}, first note that by symmetry

In order to compute the expectation in the right-hand side we need the following preliminary lemma.

where σ\sigma is the unnormalized spherical measure.

The proof of this result is an easy consequence of the Divergence Theorem,

We are now fully equipped to compute the expectation of g~t\widetilde{g}_{t}.

First consider the easy one-dimensional case. Namely, K=[a,b]\mathcal{K}=[a,b] for some reals a<ba<b. Note that, in this case, SS is uniform in {1,+1}\{-1,+1\} whereas BB is uniform in [1,+1][-1,+1]. Then

for all realizations of the random process \bigl{(}X_{t}^{+},X_{t}^{-}\bigr{)}_{t\geq 1}.

We are now ready to prove the main result of this section. Namely, that the pseudo-regret of OSGD run using the gradient estimate (60) is of order n\sqrt{n}. We assume that the point X~t\widetilde{X}_{t} played by OSGD at each time tt is randomly drawn between the two points Xt+X_{t}^{+} and XtX_{t}^{-} where the loss function is queried.

Moreover, if η=RGDn\eta=\tfrac{R}{GD\sqrt{n}} then for δ0\delta\to 0 we have that

where we overapproximated \left\|{\bigl{(}1-\tfrac{\delta}{r}\bigr{)}\mathcal{K}}\right\|\leq\left\|{\mathcal{K}}\right\|=R. This concludes the proof.

One-point bandit feedback

Building on the analysis of the previous section, it is not hard to show that the pseudo-regret can be bounded even when the loss function at each time tt is queried in only one point. However, we pay this reduced bandit feedback with a worse rate of n3/4n^{3/4} in the pseudo-regret bound. It is not known if this rate is optimal, or if it is possible to get a n\sqrt{n} regret as in the two-points setting.

The one-point estimate at time tt is defined by

Note the inverse dependence on δ\delta. This dependence plays a key role in the final bound, as the next result shows.

Applying Theorem 16.13 as in the proof of Theorem 20.7 gives

Nonlinear stochastic bandits

We conclude with a simple example of nonlinear bandits in the stochastic setting. Unlike the gain-based analysis of stochastic bandits of Chapter 2, here we keep in with the convention used throughout this chapter and work exclusively with losses.

where x1,,xnx_{1},\dots,x_{n}\in denote the points played by the strategy.

The bandit strategy we analyze in this section is based on the golden section search due to Kiefer (1953), which is a general algorithm for finding the extremum of a unimodal function. Similarly to binary search, each step of golden section search narrows the interval in which the extremum is found by querying the function value at certain points that are chosen depending on the outcome of previous queries. Each query shrinks the interval by a factor of 1φ=0.618\tfrac{1}{\varphi}=0.618\dots, where \varphi=\tfrac{1}{2}\bigl{(}1+\sqrt{5}\bigr{)} is the golden ratio.

In our case, queries (i.e., plays) at xx return a perturbed version of μ(x)\mu(x). Since μ\mu is bounded, Hoeffding bounds ensure that we can find the minimum of μ\mu by repeatedly querying each point xx requested by the golden search algorithm. However, in order to have a lower bound on the accuracy with which each μ\mu needs to be estimated, we must assume the following condition: there exists CL>0C_{L}>0 such that

for each x,xx,x^{\prime} that belong either to [0,x1/CL][0,x^{*}-1/C_{L}] or to [x+1/CL,1][x^{*}+1/C_{L},1].

Finally, irrespective to the uncertainty in the evaluation of μ\mu, in order to bound the regret incurred by golden section search we need a Lipschitz condition on μ\mu. Namely, there exists CH>0C_{H}>0 such that \bigl{|}\mu(x)-\mu(x^{\prime})\bigr{|}\leq C_{H}|x-x^{\prime}| for all x,xx,x^{\prime}\in.

We are now ready to introduce our stochastic version of the golden section search algorithm.

SGS (Stochastic Golden Search): Parameters: ε1,ε2,>0\varepsilon_{1},\varepsilon_{2},\dots>0. Initialize: xA=0xB=1φ2xC=1x_{A}=0\quad x_{B}=\frac{1}{\varphi^{2}}\quad x_{C}=1. For each stage s=1,,ns=1,\dots,n (1) Let {\displaystyle x_{B}^{\prime}=\left\{\begin{array}[]{cl}x_{B}-\tfrac{1}{\varphi^{2}}(x_{B}-x_{A})&x_{B}-x_{A}>x_{C}-x_{B}\\[2.84526pt] x_{B}+\tfrac{1}{\varphi^{2}}(x_{C}-x_{B})&\text{otherwise}\end{array}\right.} and rename points xB,xBx_{B},x_{B}^{\prime} so that xA<xB<xB<xCx_{A}<x_{B}<x_{B}^{\prime}<x_{C}. (2) Play each point in {xA,xB,xB,xC}\{x_{A},x_{B},x_{B}^{\prime},x_{C}\} for 2εs2ln(6n)\tfrac{2}{\varepsilon_{s}^{2}}\ln(6n) times and let x^\widehat{x} be the point with lowest total loss in this stage. (3) If x^{xA,xB}\widehat{x}\in\{x_{A},x_{B}\} then eliminate interval (xB,xC](x_{B}^{\prime},x_{C}] and let xC=xBx_{C}=x_{B}^{\prime}, (4) else eliminate interval [xA,xB)[x_{A},x_{B}) and let xA=xBx_{A}=x_{B}.

Recall that golden section search proceeds as follows: given three queried points xA<xB<xCx_{A}<x_{B}<x_{C} where the distance of xBx_{B} to the other two points is in the golden ratio (xBx_{B} might be closer to xAx_{A} or to xCx_{C} depending on past queries), the next point xBx_{B}^{\prime} is queried in the largest interval between xBxAx_{B}-x_{A} and xCxBx_{C}-x_{B} so that the distance of xBx_{B}^{\prime} to the extrema of that largest interval is in the golden ratio. Assume the resulting ordering is xA<xB<xB<xCx_{A}<x_{B}<x_{B}^{\prime}<x_{C}. Then we drop either [xA,xB)[x_{A},x_{B}) or (xB,xC](x_{B}^{\prime},x_{C}] according to whether the smallest value of μ\mu is found in, respectively, {xB,xC}\{x_{B}^{\prime},x_{C}\} or {xB,xC}\{x_{B}^{\prime},x_{C}\}. The remaining triplet is such that the distance of the middle point to the other two is again in the golden ratio.

Using elementary algebraic identities for φ\varphi, one can show that setting xCxA=1x_{C}-x_{A}=1 the following equalities hold at any step of SGS:

Since either xBxAx_{B}-x_{A} or xCxBx_{C}-x_{B}^{\prime} are eliminated at each stage, at each stage SGS shrinks the search interval by a factor of 1φ2=1φ1-\varphi^{-2}=\tfrac{1}{\varphi}.

Let [xA,s,xB,s][x_{A,s},x_{B,s}] be the search interval at the beginning of stage s+1s+1, where xA,0=0x_{A,0}=0 and xC,0=1x_{C,0}=1.

If εs=CLφ(s+3)\varepsilon_{s}=C_{L}\varphi^{-(s+3)} then

holds uniformly over all stages s1s\geq 1.

Once the interval containing xx^{*} is eliminated it is never recovered, thus we have

Now assume x[xA,s1,xB,s1]x^{*}\in[x_{A,s-1},x_{B,s-1}]. Then x∉[xA,s,xC,s]x^{*}\not\in[x_{A,s},x_{C,s}] implies μ^s(xB,s1)<μ^(xB,s1)\widehat{\mu}_{s}(x_{B^{\prime},s-1})<\widehat{\mu}(x_{B,s-1}) or μ^s(xC,s1)<μ^(xB,s1)\widehat{\mu}_{s}(x_{C,s-1})<\widehat{\mu}(x_{B,s-1}). Similarly, assume x[xB,s1,xC,s1]x^{*}\in[x_{B^{\prime},s-1},x_{C,s-1}]. Then x∉[xA,s,xC,s]x^{*}\not\in[x_{A,s},x_{C,s}] implies μ^s(xA,s1)<μ^(xB,s1)\widehat{\mu}_{s}(x_{A,s-1})<\widehat{\mu}(x_{B^{\prime},s-1}) or μ^s(xB,s1)<μ^(xB,s1)\widehat{\mu}_{s}(x_{B,s-1})<\widehat{\mu}(x_{B^{\prime},s-1}). In both cases, we need to compare three values of μ\mu on the same side with respect to xx^{*}. (When x[xB,s1,xB,s1]x^{*}\in[x_{B,s-1},x_{B^{\prime},s-1}] we always have x[xA,s,xC,s]x^{*}\in[x_{A,s},x_{C,s}].) Hence, we can apply our assumption involving CLC_{L}. More precisely, (64) implies that after ss stages the search interval has size φs\varphi^{-s} and min{xB,sxA,s,xB,sxB,s,xC,sxB,s}=φ(s+3)\min\{x_{B,s}-x_{A,s},x_{B,s}^{\prime}-x_{B,s},x_{C,s}-x_{B,s}^{\prime}\}=\varphi^{-(s+3)} . Hence, introducing

Let Ts=8εs2ln(6n)T_{s}=\tfrac{8}{\varepsilon_{s}^{2}}\ln(6n) the length of stage ss. We can write

Substituting this in (65) and recurring gives the desired result.

For any unimodal and CHC_{H}-Lipschitz mean loss function μ:\mu:\to that satisfies (63), if the SGS algorithm is run with εs=CLφ(s+3)\varepsilon_{s}=C_{L}\varphi^{-(s+3)} then

We start by decomposing the pseudo-regret as follows,

and recalling that \bigl{|}x_{C,s}-x_{A,s}\bigr{|}\leq\varphi^{-s}, we bound the first term in this decomposition as follows

Hence we get an easy expression for the regret,

We now compute an upper bound on the number SS of stages,

Solving for nn and overapproximating we get

Therefore, the sum in (66) is bounded as follows

Substituting the above in (66) concludes the proof.

Bibliographic remarks

Gradient-free methods for stochastic approximation have been studied for a long time —see the bibliographic remarks at the end of Chapter 5 for some references. However, relatively few results provide regret bounds. The approach presented in this chapter for online convex optimization with bandit feedback was pioneered by (Flaxman et al., 2005) and (Kleinberg, 2004). The improved rate for the two-points model was later shown in Agarwal et al. (2010b).

Another direction for nonlinear stochastic bandits was recently investigated in (Agarwal et al., 2011b). In this work the authors study the case of a convex mean loss function, and they show how to combine the zeroth-order optimization method of (Nemirovski and Yudin, 1983) with a “center point device” to obtain a regret of O~(n)\widetilde{O}(\sqrt{n}).

Finally, a related problem in a Bayesian setting was studied in Srinivas et al. (2010) and Grünewälder et al. (2010), where it is assumed that the payoff functions are drawn from Gaussian processes.

Chapter 7 Variants

In the previous chapters we explored a few fundamental variations around the basic multi-armed bandit problem. In both the stochastic and adversarial frameworks, these variants basically revolved around a single principle: by adding constraints on the losses (or rewards), it is possible to compete against larger sets of arms. While this is indeed a fundamental axis in the space of bandit problems, it is important to realize that there are many other directions. Indeed, we might sketch a “bandit space” spanning the following coordinates:

Evolution of payoffs over time: stochastic, adversarial, Markovian, …

Structure of payoff functions: linear, Lipschitz, Gaussian process, …

Feedback structure: full information, bandit, semi-bandit, partial monitoring, …

Clearly, such extensions greatly increase the number of potential applications of bandit models. While many of these extensions were already discussed in the previous chapters, in the following we focus on others (such as the sleeping bandits or the thruthful bandits) so to visit more exotic regions of the bandit space.

Extending further the model of Markovian bandits (mentioned at the end of Chapter 1), one can also define a general Markov Decision Process (MDP) —see also Section 24. For example, the stochastic bandit of Chapter 2 corresponds to a single-state MDP.

In full generality, a finite MDP can be described by a set of states {1,,S}\{1,\ldots,S\}, a set of actions {1,,K}\{1,\ldots,K\}, a set {pi,s,1iK,1sS}\{p_{i,s},\,1\leq i\leq K,\,1\leq s\leq S\} of transition distributions over SS, and a set {νi,s,1iK,1sS}\{\nu_{i,s},\,1\leq i\leq K,\,1\leq s\leq S\} of reward distributions over $.Inthismodel,takingaction. In this model, taking actioniinstatein statesgeneratesastochasticrewarddrawnfromgenerates a stochastic reward drawn from\nu_{i,s}andaMarkoviantransitiontoastatedrawnfromand a Markovian transition to a state drawn fromp_{i,s}$. Similarly to the multi-armed bandit problem, here one typically assumes that the reward distributions and transition distributions are unknown, and the goal is to navigate through the MDP so as to maximize some function of the obtained rewards. The field that studies this type of problem is called Reinforcement Learning. The interested reader is addressed to Sutton and Barto (1998); Kakade (2003); Szepesvári (2010). Reinforcement learning results with a flavor similar to those described in the previous chapters can be found in Yu et al. (2009); Bubeck and Munos (2010); Jaksch et al. (2010); Neu et al. (2010).

An intermediate model, between stochastic multi-armed bandits and MDPs, is the one of restless bandits. As in Markovian bandits, each arm is associated with a Markovian reward process with its own state space. Each time an arm is chosen, the associated Markov process generates an observable reward and makes a transition to a new state, which is also observed. However, unlike Markovian bandits an unobserved transition occurs for each arm that is not chosen. Using concentration inequalities for Markov chains —see, e.g., Lezaud (1998), one can basically show that, under suitable assumptions, UCB attains a logarithmic regret for restless bandits as well —see Tekin and Liu (2012) and Filippi et al. (2011). A more general regret bound for restless bandits has been recently proven by Ortner et al. (2012).

An apparently similar problem was studied by Garivier and Moulines (2011), where they assume that the reward distributions can abruptly change at unknown time instants (and there is a small number of such change-points). Within this model, the authors prove that the best possible regret is of order n\sqrt{n}, which is matched by the Exp3.P algorithm —see the discussion in Section 8.3. Thus, while the two problems (restless bandits and bandits with change-points) might look similar, they are fundamentally different. In particular, note that the latter problem cannot be cast as a MDP.

Another intermediate model, with important applications, is that of the sleeping bandits. There, it is assumed that the set of available actions is varying over time. We refer the interested reader to Kleinberg et al. (2010); Kanade et al. (2009); Slivkins (2011); Kanade and Steinke (2012) for the details of this model as well as the theoretical guarantees that can be obtained. A somewhat related problem was also studied in György et al. (2007) where it is assumed that the set of arms becomes unavailable for a random time after each arm pull (and the distribution of this random time depends on the selected arm).

Pure exploration problems

Bubeck et al. (2009a) prove that minimizing the simple regret is fundamentally different from minimizing the pseudo-regret Rn\overline{R}_{n}, in the sense that one always have rnexp(CRn)r_{n}\geq\exp(-C\overline{R}_{n}) for some constant C>0C>0 (which depends on the reward distributions). Thus, this regret calls for different bandit algorithms. Audibert et al. (2010) exhibit a simple strategy with optimal performances up to a logarithmic factor. The idea is very simple: the strategy SR (Successive Rejects) works in K1K-1 phases. SR keeps a set of active arms, that are sampled uniformly in each phase. At the end of a phase, the arm with smallest empirical mean is removed from the set of active arms. It can be shown that this strategy has a simple regret of order exp(cnHlnK)\exp\left(-c\frac{n}{H\ln K}\right), where H=ii1Δi2H=\sum_{i\neq i^{*}}\frac{1}{\Delta_{i}^{2}} is the complexity measure of identifying the best arm, and cc is a numerical constant. Moreover, a matching lower bound (up to logarithmic factors) was also proved. These ideas were extended in different ways by Gabillon et al. (2011); Bui et al. (2011); Bubeck et al. (2012c).

A similar problem was studied in a PAC model by Even-Dar et al. (2002). The goal is to find, with probability at least 1δ1-\delta, an arm with mean at least ε\varepsilon close the optimal mean, and the relevant quantity is the number of pulls needed to achieve this goal. For this problem, the authors derive an algorithm called Successive Elimination that achieves an optimal number of pulls (up to logarithmic factors). Successive Elimination works as follows: it keeps an estimate of the mean of each arm, together with a confidence interval. If two confidence intervals are disjoint, then the arm with the lowest confidence interval is eliminated. Using this procedure, one can achieve the (ε,δ)(\varepsilon,\delta) guarantee with a number of pulls of order HlnKΔH\ln\tfrac{K}{\Delta}. A matching lower bound is due to Mannor and Tsitsiklis (2004), and further results are discussed by Even-Dar et al. (2006).

In some applications one is not interested in the best arm, but rather in having a good estimate of the mean μi\mu_{i} for each arm. In this setting a reasonable measure of performance is given by

Clearly, the optimal static allocation depends only on the variances of the arms, and we denote by LnL_{n}^{*} the performance of this strategy. This setting was introduced by Antos et al. (2008), where the authors studied the regret LnLnL_{n}-L_{n}^{*}, and showed that a regret of order n3/2n^{-3/2} was achievable. This result was then refined by Carpentier et al. (2011); Carpentier and Munos (2011). The basic idea in these papers is to resort to the optimism in face of uncertainty principle, and to approximate the optimal static allocation by replacing the true variance with an upper confidence bound on it.

Dueling bandits

An interesting variation of stochastic bandits was recently studied by Yue et al. (2009). The model considered in that paper is called dueling bandits. The main idea is that the player has to choose a pair or arms (It,Jt)(I_{t},J_{t}) at each round, and can only observe the relative performances of these two arms, i.e., the player only knows which arm had the highest reward. More formally, in dueling bandits we assume that there exists a total ordering \succ on {1,,K}\{1,\ldots,K\} with the following properties:

If iji\succ j, then the probability that the reward of arm ii is larger than the reward of arm jj is equal to 12+Δi,j\frac{1}{2}+\Delta_{i,j} with Δi,j>0\Delta_{i,j}>0.

If ijki\succ j\succ k, then \Delta_{i,j}+\Delta_{j,k}\geq\Delta_{i,k}\geq\max\bigl{\{}\Delta_{i,j},\Delta_{j,k}\bigr{\}}.

Upon selecting a pair (It,Jt)(I_{t},J_{t}), the player receives a random variable drawn from a Bernoulli distribution with parameter 12+Δi,j\frac{1}{2}+\Delta_{i,j}. In this setting a natural regret notion is the following quantity, where ii^{*} is the largest element in the ordering \succ,

It was proved in Yue et al. (2009) that the optimal regret for this problem is of order KΔlogn\frac{K}{\Delta}\log n, where Δ=minijΔi,j\Delta=\min_{i\neq j}\Delta_{i,j}. A simple strategy that attains this rate, based on the Successive Elimination algorithm of Even-Dar et al. (2002), was proposed by Yue and Joachims (2011).

Discovery with probabilistic expert advice

Bubeck et al. (2011a) study a model with a stochastic bandit flavor (in fact it can be cast as an MDP), where the key for the analysis is a sort of ’non-linear’ regret bound. In this model rewards represent items in some set X\mathcal{X} which is partitioned in a subset AXA\subset\mathcal{X} of interesting items and in a subset XA\mathcal{X}\setminus A of non-interesting items. The goal is to maximize the total expected number of interesting items found after nn pulls, where observing twice the same item does not help. A natural notion of regret is obtained by comparing the number of interesting items F(n)F(n) found by a given strategy to the number F(n)F^{*}(n) found by the optimal strategy. It turns out that analyzing such regret directly is difficult. The first issue is that in this problem the notion of a “good” arm is dynamic, in the sense that an arm could be very good for a few pulls and then completely useless. Furthermore a strategy making bad decisions in the beginning will have better opportunities in the future than the optimal strategy (which already discovered some interesting items). Taking into account these issues, it turns out that it is easier to show that for good strategies, F(n)F(n) is not too far from F(n)F^{*}(n^{\prime}), where nn^{\prime} is not much smaller than nn. Such a statement – which can be interepreted as a non-linear regret bound – shows that the analyzed strategy slightly ’lags’ behind the optimal strategy. In Bubeck et al. (2011a) a non-linear regret bound is derived for an algorithm based on estimating the mass of interesting items left on each arm (the so-called Good-Turing estimator), combined with the optimism in face of uncertainty principle of Chapter 2. We refer the reader to Bubeck et al. (2011a) for more precise statements.

Many-armed bandits

The many-armed bandit setting was introduced by Berry et al. (1997), and then extended and refined by Wang et al. (2008). This setting corresponds to a stochastic bandit with an infinite number of arms. The extra assumption that makes this problem feasible is a prior knowledge on the distribution of the arms. More precisely, when the player ask to “add” a new arm to his current set of active arms, one assumes that the probability that this arm is ε\varepsilon-optimal is of order εβ\varepsilon^{\beta}, for some known β>0\beta>0. Thus the player faces a trade-off between exploitation, exploration, and discovery, where the last component comes from the fact that the player needs to consider new arms to make sure that he has an active ε\varepsilon-optimal arm. Using a UCB strategy on the active arms, and by adding new arms at a rate which depends on β\beta, Wang et al. (2008) prove that a regret of order

Truthful bandits

A popular application domain for bandit algorithms is ad placement on the Web. In the pay-per-click model, for each incoming user t=1,2,t=1,2,\dots the publisher selects an advertiser ItI_{t} from a pool of KK advertisers, and display the corresponding ad to the user. The publisher then gets a reward if the ad is clicked by the user. This problem is well modeled by the multi-armed bandit setting. However, there is a fundamental aspect of the ad placement process which is overlooked by this formulation. Indeed, prior to running an ad-selection algorithm (i.e., a bandit algorithm), each advertiser i{1,,K}i\in\{1,\ldots,K\} issues a bet bib_{i}. This number is how much ii is willing to pay for a click. Each bidder keeps also a private value viv_{i}, which is the true value ii assigns to a click. Because a rational bidder ensures that bivib_{i}\leq v_{i}, the difference vibiv_{i}-b_{i} defines the utility for bidder ii. The basic idea of truthful bandits is to construct a bandit algorithm such that each advertiser has no incentive in submitting a bet bib_{i} such that bi<vib_{i}<v_{i}. A natural question to ask is whether this restriction to truthful algorithms changes the dynamics of the multi-armed bandit problem. This has investigated in a number of papers, including Babaioff et al. (2009); Devanur and Kakade (2009); Babaioff et al. (2010); Wilkens and Sivan (2012). Thruthful bandits are part of a more general thread of research at the interface between bandits and Mechanism Design.

Concluding remarks

As pointed out in the introduction, the growing interest for bandits arises from the large number of industrially relevant problems that can be modeled as a multi-armed bandit. In particular, the sequential nature of the bandit setting makes it perfectly suited to various Internet and Web applications. These include search engine optimization with dueling bandits, or ad placement with contextual bandits and truthful bandits, see the references in, respectively, Section 26, Section 13 and Section 29.

Multi-armed bandits also proved to be very useful in other areas. For example, thanks to the strong connections between bandits and Markov Decision Processes, a breakthrough in Monte Carlo Tree Search (MCTS) was achieved using bandits ideas. More precisely, based on the sparse planning idea of Kearns et al. (2002), Kocsis and Szepesvári (2006) introduced a new MCTS strategy called UCT (UCB applied to Trees) that led to a substantial advancement in Computer Go performance, see Gelly et al. (2006). Note that, from a theoretical point of view UCT was proved to perform poorly by Coquelin and Munos (2007), and a strategy based on a similar idea, but with improved theoretical performance, was proposed by Bubeck and Munos (2010). Other applications in related directions have also been explored, see for example Teytaud and Teytaud (2009); Hoock and Teytaud (2010) and many others.

Many new domains of application for bandits problems are currently investigated. For example: multichannel opportunistic communications Liu et al. (2010), model selection Agarwal et al. (2011a), boosting Busa-Fekete and Kegl (2011), management of dark pools of liquidity (a recent type of stock exchange) Agarwal et al. (2010a), security analysis of power systems Bubeck et al. (2011a).

Given the fast pace of new variants, extensions, and applications coming out every week, we had to make tough decisions about what to present in this survey. We apologize for everything we had to leave out. On the other hand, we do hope that what we decided to put in will enthuse more researchers about entering this exciting field.

References