Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

Sébastien Bubeck, Nicolò Cesa-Bianchi

Chapter 1 Introduction

A multi-armed bandit problem (or, simply, a bandit problem) is a sequential allocation problem defined by a set of actions. At each time step, a unit resource is allocated to an action and some observable payoff is obtained. The goal is to maximize the total payoff obtained in a sequence of allocations. The name bandit refers to the colloquial term for a slot machine (“one-armed bandit” in American slang). In a casino, a sequential allocation problem is obtained when the player is facing many slot machines at once (a “multi-armed bandit”), and must repeatedly choose where to insert the next coin.

Bandit problems are basic instances of sequential decision making with limited information, and naturally address the fundamental trade-off between exploration and exploitation in sequential experiments. Indeed, the player must balance the exploitation of actions that did well in the past and the exploration of actions that might give higher payoffs in the future.

Although the original motivation of Thompson (1933) for studying bandit problems came from clinical trials (when different treatments are available for a certain disease and one must decide which treatment to use on the next patient), modern technologies have created many opportunities for new applications, and bandit problems now play an important role in several industrial domains. In particular, online services are natural targets for bandit algorithms, because there one can benefit from adapting the service to the individual sequence of requests. We now describe a few concrete examples in various domains.

Ad placement is the problem of deciding which advertisement to display on the web page delivered to the next visitor of a website. Similarly, website optimization deals with the problem of sequentially choosing design elements (font, images, layout) for the web page. Here the payoff is associated with visitor’s actions, e.g., clickthroughs or other desired behaviors. Of course there are important differences with the basic bandit problem: in ad placement the pool of available ads (bandit arms) may change over time, and there might be a limit on the number of times each ad could be displayed.

In source routing a sequence of packets must be routed from a source host to a destination host in a given network, and the protocol allows to choose a specific source-destination path for each packet to be sent. The (negative) payoff is the time it takes to deliver a packet, and depends additively on the congestion of the edges in the chosen path.

In computer game-playing, each move is chosen by simulating and evaluating many possible game continuations after the move. Algorithms for bandits (more specifically, for a tree-based version of the bandit problem) can be used to explore more efficiently the huge tree of game continuations by focusing on the most promising subtrees. This idea has been successfully implemented in the MoGo player of Gelly et al. (2006), which plays Go at world-class level. MoGo is based on the UCT strategy for hierarchical bandits of Kocsis and Szepesvári (2006), which is in turn derived from the UCB bandit algorithm —see Chapter 2.

There are three fundamental formalizations of the bandit problem depending on the assumed nature of the reward process: stochastic, adversarial, and Markovian. Three distinct playing strategies have been shown to effectively address each specific bandit model: the UCB algorithm in the stochastic case, the Exp3 randomized algorithm in the adversarial case, and the so-called Gittins indices in the Markovian case. In this survey, we focus on stochastic and adversarial bandits, and refer the reader to the survey by Mahajan and Teneketzis (2008) or to the recent monograph by Gittins et al. (2011) for an extensive analysis of Markovian bandits.

In order to analyze the behavior of a player or forecaster (i.e., the agent implementing a bandit strategy), we may compare its performance with that of an optimal strategy that, for any horizon of $n$ time steps, consistently plays the arm that is best in the first $n$ steps. In other terms, we may study the regret of the forecaster for not playing always optimally. More specifically, given $K\geq 2$ arms and given sequences $X_{i,1},X_{i,2},\dots$ of unknown rewards associated with each arm $i=1,\dots,K$ , we study forecasters that at each time step $t=1,2,\dots$ select an arm $I_{t}$ and receive the associated reward $X_{I_{t},t}$ . The regret after $n$ plays $I_{1},\dots,I_{n}$ is defined by

If the time horizon is not known in advance we say that the forecaster is anytime.

In general, both rewards $X_{i,t}$ and forecaster’s choices $I_{t}$ might be stochastic. This allows to distinguish between the two following notions of averaged regret: the expected regret

In the original formalization of Robbins (1952), which builds on the work of Wald (1947) —see also Arrow et al. (1949), each arm $i=1,\dots,K$ corresponds to an unknown probability distribution $\nu_{i}$ on $ $, and rewards$ X_{i,t} $are independent draws from the distribution$ \nu_{i}$ corresponding to the selected arm.

The stochastic bandit problem Known parameters: number of arms $K$ and (possibly) number of rounds $n\geq K$ . Unknown parameters: $K$ probability distributions $\nu_{1},\ldots,\nu_{K}$ on $ $. For each round$ t=1,2,\ldots $(1) the forecaster chooses$ I_{t}\in\{1,\ldots,K\} $; (2) given$ I_{t} $, the environment draws the reward$ X_{I_{t},t}\sim\nu_{I_{t}}$ independently from the past and reveals it to the forecaster.

For $i=1,\dots,K$ we denote by $\mu_{i}$ the mean of $\nu_{i}$ (mean reward of arm $i$ ). Let

In the stochastic setting, it is easy to see that the pseudo-regret can be written as

The analysis of the stochastic bandit model was pioneered in the seminal paper of Lai and Robbins (1985), who introduced the technique of upper confidence bounds for the asymptotic analysis of regret. In Chapter 2 we describe this technique using the simpler formulation of Agrawal (1995), which naturally lends itself to a finite-time analysis.

In parallel to the research on stochastic bandits, a game-theoretic formulation of the trade-off between exploration and exploitation has been independently investigated, although for quite some time this alternative formulation was not recognized as an instance of the multi-armed bandit problem. In order to motivate these game-theoretic bandits, consider again the initial example of gambling on slot machines. We now assume that we are in a rigged casino, where for each slot machine $i=1,\ldots,K$ and time step $t\geq 1$ the owner sets the gain $X_{i,t}$ to some arbitrary (and possibly maliciously chosen) value $g_{i,t}\in$ . Note that it is not in the interest of the owner to simply set all the gains to zero (otherwise, no gamblers would go to that casino). Now recall that a forecaster selects sequentially one arm $I_{t}\in\{1,\ldots,K\}$ at each time step $t=1,2,\dots$ and observes (and earns) the gain $g_{I_{t},t}$ . Is it still possible to minimize regret in such a setting?

Following a standard terminology, we call adversary, or opponent, the mechanism setting the sequence of gains for each arm. If this mechanism is independent of the forecaster’s actions, then we call it an oblivious adversary. In general, however, the adversary may adapt to the forecaster’s past behaviour, in which case we speak of a non-oblivious adversary. For instance, in the rigged casino the owner may observe the way a gambler plays in order to design even more evil sequences of gains. Clearly, the distinction between oblivious and non-oblivious adversary is only meaningful when the player is randomized (if the player is deterministic, then the adversary can pick a bad sequence of gains right at the beginning of the game by simulating the player’s future actions). Note, however, that in presence of a non-oblivious adversary the interpretation of regret is ambiguous. Indeed, in this case the assignment of gains $g_{i,t}$ to arms $i=1,\dots,K$ made by the adversary at each step $t$ is allowed to depend on the player’s past randomized actions $I_{1},\dots,I_{t-1}$ . In other words, $g_{i,t}=g_{i,t}(I_{1},\dots,I_{t-1})$ for each $i$ and $t$ . Now, the regret compares the player’s cumulative gain to that obtained by playing the single best arm for the first $n$ rounds. However, had the player consistently chosen the same arm $i$ in each round, namely $I_{t}=i$ for $t=1,\dots,n$ , the adversarial gains $g_{i,t}(I_{1},\dots,I_{t-1})$ would have been possibly different than those actually experienced by the player.

The study of non-oblivious regret is mainly motivated by the connection between regret minimization and equilibria in games —see, e.g. (Auer et al., 2002b, Section 9). Here we just observe that game-theoretic equilibria are indeed defined similarly to regret: in equilibrium, the player has nSo incentive to behave differently provided the opponent does not react to changes in the player’s behaviour. Interestingly, regret minimization has been also studied against reactive opponents, see for instance the works of Pucci de Farias and Megiddo (2006) and Arora et al. (2012a).

The adversarial bandit problem Known parameters: number of arms $K\geq 2$ and (possibly) number of rounds $n\geq K$ . For each round $t=1,2,\ldots$ (1) the forecaster chooses $I_{t}\in\{1,\ldots,K\}$ , possibly with the help of external randomization; (2) simultaneously, the adversary selects a gain vector $g_{t}=(g_{1,t},\ldots,g_{K,t})\in^{K}$ , possibly with the help of external randomization; (3) the forecaster receives (and observes) the reward $g_{I_{t},t}$ , while the gains of the other arms are not observed.

In this adversarial setting the goal is to obtain regret bounds in high probability or in expectation with respect to any possible randomization in the strategies used by the forecaster or the opponent, and irrespective of the opponent. In the case of a non-oblivious adversary this is not an easy task, and for this reason we usually start by bounding the pseudo-regret

Note that the randomization of the adversary is not very important here since we ask for bounds which hold for any opponent. On the other hand, it is fundamental to allow randomization for the forecaster —see Chapter 3 for details and basic results in the adversarial bandit model. This adversarial, or non-stochastic, version of the bandit problem was originally proposed as a way of playing an unknown game against an opponent. The problem of playing a game repeatedly, now a classical topic in game theory, was initiated by the groundbreaking work of James Hannan and David Blackwell. In Hannan’s seminal paper Hannan (1957), the game (i.e., the payoff matrix) is assumed to be known by the player, who also observes the opponent’s moves in each play. Later, Baños (1968) considered the problem of a repeated unknown game, where in each game round the player only observes its own payoff. This problem turns out to be exactly equivalent to the adversarial bandit problem with a non-oblivious adversary. Simpler strategies for playing unknown games were more recently proposed by Foster and Vohra (1998) and Hart and Mas-Colell (2000, 2001). Approximately at the same time, the problem was re-discovered in computer science by Auer et al. (2002b). It was them who made apparent the connection to stochastic bandits by coining the term nonstochastic multi-armed bandit problem.

The third fundamental model of multi-armed bandits assumes that the reward processes are neither i.i.d. (like in stochastic bandits) nor adversarial. More precisely, arms are associated with $K$ Markov processes, each with its own state space. Each time an arm $i$ is chosen in state $s$ , a stochastic reward is drawn from a probability distribution $\nu_{i,s}$ , and the state of the reward process for arm $i$ changes in a Markovian fashion, based on an underlying stochastic transition matrix $M_{i}$ . Both reward and new state are revealed to the player. On the other hand, the state of arms that are not chosen remains unchanged. Going back to our initial interpretation of bandits as sequential resource allocation processes, here we may think of $K$ competing projects that are sequentially allocated a unit resource of work. However, unlike the previous bandit models, in this case the state of a project that gets the resource may change. Moreover, the underlying stochastic transition matrices $M_{i}$ are typically assumed to be known, thus the optimal policy can be computed via dynamic programming and the problem is essentially of computational nature. The seminal result of Gittins (1979) provides an optimal greedy policy which can be computed efficiently.

A notable special case of Markovian bandits is that of Bayesian bandits. These are parametric stochastic bandits, where the parameters of the reward distributions are assumed to be drawn from known priors, and the regret is computed by also averaging over the draw of parameters from the prior. The Markovian state change associated with the selection of an arm corresponds here to updating the posterior distribution of rewards for that arm after observing a new reward.

Markovian bandits are a standard model in the areas of Operations Research and Economics. However, the techniques used in their analysis are significantly different from those used to analyze stochastic and adversarial bandits. For this reason, in this survey we do not cover Markovian bandits and their many variants.

Chapter 2 Stochastic bandits: fundamental results

We start by recalling the basic definitions for the stochastic bandit problem. Each arm $i\in\{1,\ldots,K\}$ corresponds to an unknown probability distribution $\nu_{i}$ . At each time step $t=1,2,\ldots,$ the forecaster selects some arm $I_{t}\in\{1,\ldots,K\}$ and receives a reward $X_{I_{t},t}$ drawn from $\nu_{I_{t}}$ (independently from the past). Denote by $\mu_{i}$ the mean of arm $i$ and define

We focus here on the pseudo-regret, which is defined as

We choose the pseudo-regret as our main quantity of interest because in a stochastic framework it is more natural to compete against the optimal action in expectation, rather than the optimal action on the sequence of realized rewards (as in the definition of the plain regret (1)). Furthermore, because of the order of magnitude of typical random fluctuations, in general one cannot hope to prove a bound on the expected regret (2) better than $\Theta\bigl{(}\sqrt{n}\bigr{)}$ . On the contrary, the pseudo-regret can be controlled so well that we are able to bound it by a logarithmic function of $n$ .

In the following we also use a different formula for the pseudo-regret. Let $T_{i}(s)=\sum_{t=1}^{s}\mathds{1}_{I_{t}=i}$ denote the number of times the player selected arm $i$ on the first $s$ rounds. Let $\Delta_{i}=\mu^{*}-\mu_{i}$ be the suboptimality parameter of arm $i$ . Then the pseudo-regret can be written as:

The difficulty of the stochastic multi-armed bandit problem lies in the exploration-exploitation dilemma that the forecaster is facing. Indeed, there is an intrinsic tradeoff between exploiting the current knowledge to focus on the arm that seems to yield the highest rewards, and exploring further the other arms to identify with better precision which arm is actually the best. As we shall see, the key to obtain a good strategy for this problem is, in a certain sense, to simultaneously perform exploration and exploitation.

A simple heuristic principle for doing that is the so-called optimism in face of uncertainty. The idea is very general, and applies to many sequential decision making problems in uncertain environments. Assume that the forecaster has accumulated some data on the environment and must decide how to act next. First, a set of “plausible” environments which are “consistent” with the data (typically, through concentration inequalities) is constructed. Then, the most “favorable” environment is identified in this set. Based on that, the heuristic prescribes that the decision which is optimal in this most favorable and plausible environment should be made. As we see below, this principle gives simple and yet almost optimal algorithms for the stochastic multi-armed bandit problem. More complex algorithms for various extensions of the stochastic multi-armed bandit problem are also based on the same idea which, along with the exponential weighting scheme presented in Section 3, is an algorithmic cornerstone of regret analysis in bandits.

Upper Confidence Bound (UCB) strategies

In this section we assume that the distribution of rewards $X$ satisfy the following moment conditions. There exists a convex function One can easily generalize the discussion to functions $\psi$ defined only on an interval $[0,b]$ . $\psi$ on the reals such that, for all $\lambda\geq 0$ ,

For example, when $X\in$ one can take $\psi(\lambda)=\tfrac{\lambda^{2}}{8}$ . In this case (6) is known as Hoeffding’s lemma.

We attack the stochastic multi-armed bandit using the optimism in face of uncertainty principle. In order do so, we use assumption (6) to construct an upper bound estimate on the mean of each arm at some fixed confidence level, and then choose the arm that looks best under this estimate. We need a standard notion from convex analysis: the Legendre-Fenchel transform of $\psi$ , defined by

For instance, if $\psi(x)=e^{x}$ then $\psi^{*}(x)=x\ln x-x$ for $x>0$ . If $\psi(x)=\frac{1}{p}|x|^{p}$ then $\psi^{*}(x)=\frac{1}{q}|x|^{q}$ for any pair $1<p,q<\infty$ such that $\frac{1}{p}+\frac{1}{q}=1$ —see also Section 15, where the same notion is used in a different bandit model.

Let $\widehat{\mu}_{i,s}$ be the sample mean of rewards obtained by pulling arm $i$ for $s$ times. Note that since the rewards are i.i.d., we have that in distribution $\widehat{\mu}_{i,s}$ is equal to $\frac{1}{s}\sum_{t=1}^{s}X_{i,t}$ .

Using Markov’s inequality, from (6) one obtains that

In other words, with probability at least $1-\delta$ ,

We thus consider the following strategy, called $(\alpha,\psi)$ -UCB, where $\alpha>0$ is an input parameter: At time $t$ , select

Assume that the reward distributions satisfy (6). Then $(\alpha,\psi)$ -UCB with $\alpha>2$ satisfies

In case of $ $-valued random variables, taking$ \psi(\lambda)=\tfrac{\lambda^{2}}{8} $in (6) —the Hoeffding’s Lemma— gives$ \psi^{*}(\varepsilon)=2\varepsilon^{2}$, which in turns gives the following pseudo-regret bound

In this important special case of bounded random variables we refer to $(\alpha,\psi)$ -UCB simply as $\alpha$ -UCB.

First note that if $I_{t}=i$ , then at least one of the three following equations must be true:

Indeed, assume that the three equations are all false, then we have:

which implies, in particular, that $I_{t}\neq i$ . In other words, letting

Thus it suffices to bound the probability of the events (9) and (10). Using an union bound and (7) one directly obtains:

The same upper bound holds for (10). Straightforward computations conclude the proof.

Lower bound

In order to compare this result with (8) we use the following standard inequalities (the left hand side follows from Pinsker’s inequality, and the right hand side simply uses $\ln x\leq x-1$ ),

The proof is organized in three steps. For simplicity, we only consider the case of two arms.

We now slightly change the notation for rewards and denote by $X_{2,1},\ldots,X_{2,n}$ the sequence of random variables obtained when pulling arm $2$ for $n$ times (that is, $X_{2,s}$ is the reward obtained from the $s$ -th pull). For $s\in\{1,\ldots,n\}$ , let

In order to link the behavior of the forecaster on the initial and modified bandits we introduce the event

Using again (15) and Markov’s inequality, the above implies

Now we use the maximal version of the strong law of large numbers: for any sequence $\bigl{(}X_{t}\bigr{)}$ of independent real random variables with positive mean $\mu>0$ ,

Thus, by the result of the second step and (16), we get

Refinements and bibliographic remarks

The UCB strategy presented in Section 2 was introduced by Auer et al. (2002a) for bounded random variables. Theorem 3.3 is extracted from Lai and Robbins (1985). Note that in this last paper the result is more general than ours, which is restricted to Bernoulli distributions. Although Burnetas and Katehakis (1997) prove an even more general lower bound, Theorem 3.3 and the UCB regret bound provide a reasonably complete solution to the problem. We now discuss some of the possible refinements. In the following, we restrict our attention to the case of bounded rewards (except in Section 4.7).

The regret bound proof for UCB can be improved in two ways. First, the union bound over the different time steps can be replaced by a “peeling” argument. This allows to show a logarithmic regret for any $\alpha>1$ , whereas the proof of Section 2 requires $\alpha>2$ —see (Bubeck, 2010, Section 2.2) for more details. A second improvement, proposed by Garivier and Cappé (2011), is to use a more subtle set of conditions than (9)–(11). In fact, the authors take both improvements into account, and show that $\alpha$ -UCB has a regret of order $\frac{\alpha}{2}\ln n$ for any $\alpha>1$ . In the limit when $\alpha$ tends to $1$ , this constant is unimprovable in light of Theorem 3.3 and (12).

2 Second order bounds

where $\alpha>1$ is a parameter of the algorithm. Thus, KL-UCB is optimal for Bernoulli distributions, and strictly dominates $\alpha$ -UCB for any bounded reward distributions.

3 Distribution-free bounds

In the limit when $\Delta_{i}$ tends to , the upper bound in (8) becomes vacuous. On the other hand, it is clear that the regret incurred from pulling arm $i$ cannot be larger than $n\,\Delta_{i}$ . Using this idea, it is easy to show that the regret of $\alpha$ -UCB is always smaller than $\sqrt{\alpha nK\ln n}$ (up to a numerical constant). However, as we shall see in the next chapter, one can show a minimax lower bound on the regret of order $\sqrt{nK}$ . Audibert and Bubeck (2009) proposed a modification of $\alpha$ -UCB that gets rid of the extraneous logarithmic term in the upper bound. More precisely, let $\Delta=\min_{i\neq i^{*}}\Delta_{i}$ , Audibert and Bubeck (2010) show that MOSS (Minimax Optimal Strategy in the Stochastic case) attains a regret smaller than

up to a numerical constant. The weakness of this result is that the second term in the above equation only depends on the smallest gap $\Delta$ . In Auer and Ortner (2010) (see also Perchet and Rigollet (2011)) the authors designed a strategy, called improved UCB, with a regret of order

This latter regret bound can be better than the one for MOSS in some regimes, but it does not attain the minimax optimal rate of order $\sqrt{nK}$ . It is an open problem to obtain a strategy with a regret always better than those of MOSS and improved UCB. A plausible conjecture is that a regret of order

is attainable. Note that the quantity $H$ appears in other variants of the stochastic multi-armed bandit problem, see Audibert et al. (2010).

4 High probability bounds

While bounds on the pseudo-regret $\overline{R}_{n}$ are important, one typically wants to control the quantity $\hat{R}_{n}=n\mu^{*}-\sum_{t=1}^{n}\mu_{I_{t}}$ with high probability. Showing that $\hat{R}_{n}$ is close to its expectation $\overline{R}_{n}$ is a challenging task, since naively one might expect the fluctuations of $\hat{R}_{n}$ to be of order $\sqrt{n}$ , which would dominate the expectation $\overline{R}_{n}$ which is only of order $\ln n$ . The concentration properties of $\hat{R}_{n}$ for UCB are analyzed in detail in Audibert et al. (2009), where it is shown that $\hat{R}_{n}$ concentrates around its expectation, but that there is also a polynomial (in $n$ ) probability that $\hat{R}_{n}$ is of order $n$ . In fact the polynomial concentration of $\hat{R}_{n}$ around $\overline{R}_{n}$ can be directly derived from our proof of Theorem 2.1. In Salomon and Audibert (2011) it is proved that for anytime strategies (i.e., strategies that do not use the time horizon $n$ ) it is basically impossible to improve this polynomial concentration to a classical exponential concentration. In particular this shows that, as far as high probability bounds are concerned, anytime strategies are surprisingly weaker than strategies using the time horizon information (for which exponential concentration of $\hat{R}_{n}$ around $\ln n$ are possible, see Audibert et al. (2009)).

5 ε𝜀\varepsilon-greedy

A simple and popular heuristic for bandit problems is the $\varepsilon$ -greedy strategy —see, e.g., Sutton and Barto (1998). The idea is very simple. First, pick a parameter $0<\varepsilon<1$ . Then, at each step greedily play the arm with highest empirical mean reward with probability $1-\varepsilon$ , and play a random arm with probability $\varepsilon$ . Auer et al. (2002a) proved that, if $\varepsilon$ is allowed to be a certain function $\varepsilon_{t}$ of the current time step $t$ , namely $\varepsilon_{t}=K/(d^{2}t)$ , then the regret grows logarithmically like $(K\ln n)/d^{2}$ , provided $0<d<\min_{i\neq i^{*}}\Delta_{i}$ . While this bound has a suboptimal dependence on $d$ , Auer et al. (2002a) show that this algorithm performs well in practice, but the performance degrades quickly if $d$ is not chosen as a tight lower bound of $\min_{i\neq i^{*}}\Delta_{i}$ .

6 Thompson sampling

7 Heavy-tailed distributions

We showed in Section 2 how to obtain a UCB-type strategy through a bound on the moment generating function. Moreover one can see that the resulting bound in Theorem 2.1 deteriorates as the tail of the distributions become heavier. In particular, we did not provide any result for the case of distributions for which the moment generating function is not finite. Surprisingly, it was shown in Bubeck et al. (2012b) that in fact there exists strategy with essentially the same regret than in (8), as soon as the variance of the distributions are finite. More precisely, using more refined robust estimators of the mean than the basic empirical mean, one can construct a UCB-type strategy such that for distributions with moment of order $1+\varepsilon$ bounded by $1$ it satisfies

We refer the interested reader to Bubeck et al. (2012b) for more details on these ’robust’ versions of UCB.

Chapter 3 Adversarial bandits: fundamental results

In this chapter we consider the important variant of the multi-armed bandit problem where no stochastic assumption is made on the generation of rewards. Denote by $g_{i,t}$ the reward (or gain) of arm $i$ at time step $t$ . We assume all rewards are bounded, say $g_{i,t}\in$ . At each time step $t=1,2,\ldots$ , simultaneously with the player’s choice of the arm $I_{t}\in\{1,\ldots,K\}$ , an adversary assigns to each arm $i=1,\dots,K$ the reward $g_{i,t}$ . Similarly to the stochastic setting, we measure the performance of the player compared to the performance of the best arm through the regret

The key idea to get around this difficulty is to add randomization to the selection of the action $I_{t}$ to play. By doing so, the forecaster can “surprise” the adversary, and this surprise effect suffices to get a regret essentially as low as the minimax regret for the stochastic model. Since the regret $R_{n}$ then becomes a random variable, the goal is thus to obtain bounds in high probability or in expectation on $R_{n}$ (with respect to both eventual randomization of the forecaster and of the adversary). This task is fairly difficult, and a convenient first step is to bound the pseudo-regret

As we pointed out, in order to obtain non-trivial regret guarantees in the adversarial framework it is necessary to consider randomized forecasters. Below we describe the randomized forecaster Exp3, which is based on two fundamental ideas.

First, despite the fact that only the loss of the played arm is observed, with a simple trick it is still possible to build an unbiased estimator for the loss of any other arm. Namely, if the next arm $I_{t}$ to be played is drawn from a probability distribution $p_{t}=\bigl{(}p_{1,t},\ldots,p_{K,t}\bigr{)}$ , then

The second idea is to use an exponential reweighting of the cumulative estimated losses to define the probability distribution $p_{t}$ from which the forecaster will select the arm $I_{t}$ . Exponential weighting schemes are a standard tool in the study of sequential prediction schemes under adversarial assumptions. The reader is referred to the monograph by Cesa-Bianchi and Lugosi (2006) for a general introduction to prediction of individual sequences, and to the recent survey by Arora et al. (2012b) focussed on computer science applications of exponential weighting.

We provide two different pseudo-regret bounds for this strategy. The bound (19) is obtained assuming that the forecaster does not know the number of rounds $n$ . This is the anytime version of the algorithm. The bound (18), instead, shows that a better constant can be achieved using the knowledge of the time horizon.

If Exp3 is run with $\eta_{t}=\eta=\sqrt{\frac{2\ln K}{nK}}$ , then

Moeover, if Exp3 is run with $\eta_{t}=\sqrt{\frac{\ln K}{tK}}$ , then

Inequality (18) then trivially follows from (20). For (19) we use (20) and $\sum_{t=1}^{n}\frac{1}{\sqrt{t}}\leq\int_{0}^{n}\frac{1}{\sqrt{t}}dt=2\sqrt{n}$ . The proof of (20) in divided in five steps.

The following equalities can be easily verified:

Second step: Study of the first term in (23).

We use the inequalities $\ln x\leq x-1$ and $\exp(-x)-1+x\leq x^{2}/2$ , for all $x\geq 0$ , to obtain:

where the last step comes from the third equality in (21).

Third step: Study of the second term in (23).

Let $\widetilde{L}_{i,0}=0$ , $\Phi_{0}(\eta)=0$ and $\Phi_{t}(\eta)=\frac{1}{\eta}\ln\frac{1}{K}\sum_{i=1}^{K}\exp{\left(-\eta\widetilde{L}_{i,t}\right)}$ . Then, by definition of $p_{t}$ we have

Fourth step: Summing.

Putting together (22), (23), (24) and (25) we obtain

The first term is easy to bound in expectation since, by the rule of conditional expectations and the last equality in (21) we have

For the second term we start with an Abel transformation,

To conclude the proof, we show that $\Phi_{t}^{\prime}(\eta)\geq 0$ . Since $\eta_{t+1}\leq\eta_{t}$ , we then obtain $\Phi_{t}(\eta_{t+1})-\Phi_{t}(\eta_{t})\leq 0$ . Let

Simplifying, we get (since $p_{1}$ is the uniform distribution over $\{1,\ldots,K\}$ ),

High probability and expected regret bounds

This issue can be solved by combining the mixing idea with a different estimate for losses. In fact, the core idea is more transparent when expressed in terms of gains, and so we turn to the gain version of the problem. The trick is to introduce a bias in the gain estimate which allows to derive a high probability statement on this estimate.

Then, with probability at least $1-\delta$ ,

where the last inequality uses $1+u\leq\exp(u)$ . As a consequence, we have

The strategy associated with these new estimates, called Exp3.P, is described in Figure 1. Note that, for the sake of simplicity, the strategy is described in the setting with known time horizon ( $\eta$ is constant). Anytime results can easily be derived with the same techniques as in the proof of Theorem 5.1.

In the next theorem we propose two different high probability bounds. In (26) the algorithm needs the confidence level $\delta$ as an input parameter. In (27) the algorithm satisfies a high probability bound for any confidence level. This latter property is particularly important to derive good bounds on the expected regret.

For any given $\delta\in(0,1)$ , if Exp3.P is run with

then, with probability at least $1-\delta$ ,

Moreover, if Exp3.P is run with $\beta=\sqrt{\tfrac{\ln(K)}{nK}}$ while $\eta$ and $\gamma$ are chosen as before, then, with probability at least $1-\delta$ ,

We first prove (in three steps) that if $\gamma\leq 1/2$ and $(1+\beta)K\eta\leq\gamma$ , then Exp3.P satisfies, with probability at least $1-\delta$ ,

The key step is, again, to consider the cumulant-generating function of $\widetilde{g}_{i,t}$ . However, because of the mixing, we need to introduce a few more notations. Let $u=\bigl{(}\tfrac{1}{K},\ldots,\tfrac{1}{K}\bigr{)}$ be the uniform distribution over the arms, let and $w_{t}=\tfrac{p_{t}-u}{1-\gamma}$ be the distribution induced by Exp3.P at time $t$ without the mixing. Then we have:

Second step: Study of the first term in (30).

We use the inequalities $\ln x\leq x-1$ and $\exp(x)\leq 1+x+x^{2}$ , for all $x\leq 1$ , as well as the fact that $\eta\widetilde{g}_{i,t}\leq 1$ since $(1+\beta)\eta K\leq\gamma$ :

where we used $\frac{w_{i,t}}{p_{i,t}}\leq\frac{1}{1-\gamma}$ in the last step.

Third step: Summing.

Set $\widetilde{G}_{i,0}=0$ . Recall that $w_{t}=\bigl{(}w_{1,t},\ldots,w_{K,t}\bigr{)}$ with

Then substituting (31) in (30) and summing using (32), we obtain

The last inequality comes from Lemma 6.3, the union bound, and $\gamma-(1+\beta)\eta K\leq 1$ which is a consequence of $(1+\beta)\eta K\leq\gamma\leq 1/2$ . Combining this last inequality with (29) we obtain

We integrate the deviations in (27) using the formula

for a real-valued random variable $W$ . In particular, taking

Lower Bound

The next theorem shows that the results of the previous sections are essentially unimprovable, up to logarithmic factors. The result is proven via the probabilistic method: we show that there exists a distribution of rewards for the arms such that the pseudo-regret of any forecaster must be high when averaged over this distribution. Owing to this probabilistic construction, the lower bound proof is based on the same Kullback-Leibler divergence as the one used in the proof of the lower bound for stochastic bandits —see Subsection 3. We are not aware of other techniques for proving bandit lower bounds.

We find it more convenient to prove the results for rewards rather than losses. In order to emphasize that our rewards are stochastic (in particular, Bernoulli random variables), we use $Y_{i,t}\in\{0,1\}$ to denote the reward obtained by pulling arm $i$ at time $t$ .

Let $\sup$ be the supremum over all distribution of rewards such that, for $i=1,\dots,K$ , the rewards $Y_{i,1},Y_{i,2},\ldots\in\{0,1\}$ are i.i.d., and let $\inf$ be the infimum over all forecasters. Then

where expectations are with respect to both the random generation of rewards and the internal randomization of the forecaster.

The general idea of the proof goes as follows. Since at least one arm is pulled less than $n/K$ times, for this arm one cannot differentiate between a Bernoulli of parameter $1/2$ and and a Bernoulli of parameter $1/2+\sqrt{K/n}$ . Thus, if all arms are Bernoulli of parameter $1/2$ but one, whose parameter is $1/2+\sqrt{K/n}$ , then the forecaster should incur a regret of order $n\sqrt{K/n}=\sqrt{nK}$ . In order to formalize this idea, we use the Kullback-Leibler divergence, and in particular Pinsker’s inequality, to compare the behavior of a given forecaster against: (1) the distribution where all arms are Bernoulli of parameter $1/2$ ; (2) the same distribution where the parameter of one arm is increased by $\varepsilon$ .

We start by proving a more general lemma, which could also be used to derive lower bounds in other contexts. The proof of Theorem 7.9 then follows by a simple optimization over $\varepsilon$ .

Second step: Pinsker’s inequality.

Fourth step: conclusion for deterministic forecasters.

Fifth step: randomized forecasters via Fubini’s Theorem.

Refinements and bibliographic remarks

One can see that there is a logarithmic gap between the pseudo-regret of Exp3, presented in Theorem 5.1, and the minimax lower bound of Theorem 7.9. This gap was closed by Audibert and Bubeck (2009), who constructed a new class of strategies and showed that some of them have a pseudo-regret of order $\sqrt{nK}$ . This new class of strategies, called INF (Implicitily Normalized Forecaster), is based on the following idea. First, note that one can generalize the exponential weighting scheme of Exp3 as follows: given a potential function $\psi$ , assign the probability

This type of strategy is called a weighted average forecaster, see (Cesa-Bianchi and Lugosi, 2006, Chapter 2). In INF the normalization is done implicitily, by a translation of the losses. More precisely, INF with potential $\psi$ assigns the probability $p_{i,t+1}=\psi\bigl{(}C_{t}-\widetilde{L}_{i,t}\bigr{)}$ , where $C_{t}$ is the constant such that $p_{t+1}$ sum to $1$ . The key to obtain a minimax optimal pseudo-regret is to take $\psi$ of the form $\psi(x)=(-\eta x)^{-q}$ with $q>1$ , while Exp3 corresponds to $\psi(x)=\exp(\eta x)$ . Audibert et al. (2011) realized that the INF strategy can be formulated as a Mirror Descent algorithm. This point of view significantly simplifies the proofs. We refer the reader to Chapter 5 (and in particular Theorem 17.20) for more details.

While it is possible to get log-free pseudo-regret bounds, the situation becomes significantly more complicated when one considers high probability regret and expected regret. Audibert and Bubeck (2010) proved that one can get a log-free expected regret if the adversary is oblivious, i.e., the sequence of loss vectors is independent of the forecaster’s actions. Moreover, it is also possible to get a log-free high probability regret if the adversary is fully oblivious (i.e., the loss vectors are independently drawn, but not identically distributed —this includes the oblivious adversary). It is conjectured (in Audibert and Bubeck (2010)) that it is not possible to obtain a log-free expected regret bound against a general non-oblivious adversary.

2 Adaptive bounds

One of the strengths of the bounds proposed in this chapter is also one of its weaknesses: the bounds hold against any adversary. It is clear that in some cases it is possible to obtain a much smaller regret than the worst case regret. For example, when the sequence of losses is an i.i.d. sequence, we proved in Chapter 2 that it is is possible to obtain a logarithmic pseudo-regret (provided that the gap $\Delta$ is considered as a constant). Thus it is natural to ask if it possible to have strategies with minimax optimal regret, but also with much smaller regret when the loss sequence is not worst case.

The first bound in this direction was proved by Auer et al. (2002b), who showed that, for the gain version of the problem and against an oblivious adversary, Exp3 has a pseudo-regret of order $\sqrt{KG_{n}^{*}}$ (omitting log factors), where $G_{n}^{*}\leq n$ is the maximal cumulative reward of the optimal arm after $n$ rounds. This result was improved by Audibert and Bubeck (2010), who showed that using the gain estimate

one can bound the regret with high probability by essentially the same quantity as before, and against any adversary.

Another direction was explored by Hazan and Kale (2009) building on previous works in the full information setting —see Cesa-Bianchi et al. (2007). In this work the authors proved that one can attain a regret of order $\sqrt{\sum_{i=1}^{K}V_{i,n}}$ excluding log factors, where

is the total variation of the loss for arm $i$ . In fact their result is more general, as it applies to the linear bandit framework —see Chapter 5. The main new ingredient in their analysis is a “reservoir sampling” procedure. We refer the reader to Hazan and Kale (2009) for details. See also the works of Slivkins and Upfal (2008); Slivkins (2011) for related results on slowly changing bandits.

In Section 8.4 below we describe another type of adaptive bound, where one combines minimax optimal regret for the adversarial model with logarithmic pseudo-regret for the stochastic model.

3 Competing with the best switching strategy

While competing against the policy consistently playing the best fixed arm is a natural way of defining regret, in some applications it might be interesting to consider regret with respect to a bigger class of policies. Though this problem is the focus of Chapter 4, there is a class of natural policies that can be directly dealt with by the methods of this chapter. Namely, consider the problem of competing against any policy constrained to make at most $S\leq n$ switches (a switch is when the arm played at time $t$ is different from the arm played at time $t+1$ ). This problem was studied by Auer (2002), where it was first shown that a simple variant of Exp3 attains a low switching regret against oblivious adversaries. Later, Audibert and Bubeck (2010) proved that Exp3.P attains an expected regret (and a high probability regret) of order $\sqrt{nKS\ln(nK/S)}$ for this problem.

4 Stochastic versus adversarial bandits

From a practical viewpoint, Exp3 should be a safe choice when we have reasons to believe that the sequence of rewards is not well matched by any i.i.d. process. Indeed, it is easy to prove that UCB can have linear regret, i.e. $\overline{R}_{n}=\Omega(n)$ , on certain deterministic sequences. In Bubeck and Slivkins (2012) a new strategy was described, called SAO (Stochastic and Adversarial Optimal), which enjoys (up to logarithmic factors) both the guarantee of Exp3 for the adversarial model and the guarantee of UCB for the stochastic model. More precisely SAO satisfies $\overline{R}_{n}=\mathcal{O}\left(\frac{K}{\Delta}\log^{2}(n)\log(K)\right)$ in the stochastic model and $\overline{R}_{n}=\mathcal{O}\left(\sqrt{nK}\log^{3/2}(n)\log(K)\right)$ in the adversarial model. Note that while this result is a step towards more flexible strategies, the very notion of regret $\overline{R}_{n}$ can become vacuous with nonstationarities in the reward sequence, since the total reward of the best fixed action might be very small. In that case the notion of switching regret —see Subsection 8.3— is more relevant, and it would be interesting to derive a strategy with logarithmic regret in the stochastic model, and a switching regret of order $\sqrt{nKS}$ in the adversarial model.

5 Alternative feedback structures

As mentioned at the beginning of this section, the adversarial multi-armed bandit is a variation of the full information setting, with a weaker feedback signal (only the incurred loss versus the full vector of losses is observed). Many other feedback structures can be considered, and we conclude the chapter by describing a few of them.

In the label efficient setting, originally proposed by Helmbold and Panizza (1997), at the end of each round the forecaster has to decide whether to ask for the losses of the current round, knowing that this can be done for at most $m\leq n$ times. In this setting, Cesa-Bianchi et al. (2005) proved that the minimax pseudo-regret is of order $n\sqrt{\frac{\ln K}{m}}$ . A bandit label efficient version was proposed by Allenberg et al. (2006). Audibert and Bubeck (2010) proved that the minimax pseudo-regret for the bandit label efficient version is of order $n\sqrt{\frac{K}{m}}$ . These results do not require any fundamentally new algorithmic idea, besides the fact the forecaster has to randomize to select the rounds in which the losses are revealed. Roughly speaking, a simple coin toss with parameter $\varepsilon=m/n$ is sufficient to obtain an optimal regret.

Mannor and Shamir (2011) study a model that interpolates between the full information and the bandit setting. The basic idea is that there is an undirected graph $G$ with $K$ vertices (one vertex for each arm) that encodes the feedback structure. When one pulls arm $i$ the losses of all neighboring arms $j\in N(i)$ in the graph are observed. Thus, a graph with no edges is equivalent to the bandit problem, while the complete graph is equivalent to the full information setting. Given the feedback structure $G$ , it is natural to consider the following unbiased loss estimate

Using Exp3 with this loss estimate, the authors show that the minimax pseudo-regret (up to logarithmic factors) is of order of $\sqrt{\alpha(G)n}$ , where $\alpha(G)$ is the independence number of graph $G$ . Note that this interpolated setting naturally arises in applications like ad placement on websites. Indeed, if a user clicks on an advertisement, it is plausible to assume that the same user would have clicked on similar advertisements, had they been displayed.

The relationship between the signals and the incurred losses defines the instance of a partial monitoring problem. We refer the interested reader to Cesa-Bianchi and Lugosi (2006) for more details, including an historical account. Recent progress on this problem has been made by Bartok et al. (2010) and Foster and Rakhlin (2012).

Chapter 4 Contextual bandits

A natural extension of the multi-armed armed problem is obtained by associating side information with each arm. Based on this side information, or context, a notion of “contextual regret” is introduced where optimality is defined with respect to the best policy (i.e., mapping from contexts to arms) rather than the best arm. The space of policies, within which the optimum is sought, is typically chosen in order to have some desired structure. A different viewpoint is obtained when contexts are privately accessed by the policies (which are then appropriately called “experts”). In this case the contextual information is hidden from the forecaster, and arms must be chosen based only on the past estimated performance of the experts.

Contextual bandits naturally arise in many applications. For example, in personalized news article recommendation the task is to select, from a pool of candidates, a news article to display whenever a new user visits a website. The articles correspond to arms, and a reward is obtained whenever the user clicks on the selected article. Side information, in the form of features, can be extracted from both user and articles. For the user this may include historical activities, demographic information, and geolocation; for the articles, we may have content information and categories. See Li et al. (2010) for more details on this application of contextual bandits.

In general, the presence of contexts creates a wide spectrum of possible variations obtained by combining assumptions on the rewards with assumptions on the nature of contexts and policies. In this chapter we describe just a few of the results available in the literature, and use the bibliographic remarks to mention all those that we are aware of.

The most basic example of contextual bandits is obtained when game rounds $t=1,2,\dots$ are marked by contexts $s_{1},s_{2},\dots$ from a given context set $\mathcal{S}$ . The forecaster must learn the best mapping $g:\mathcal{S}\to\{1,\dots,K\}$ of contexts to arms. We analyze this simple side information setting in the case of adversarial rewards, and we further assume that the sequence of contexts $s_{t}$ is arbitrary but fixed. The approach we take is the simplest: run a separate instance of Exp3 on each distinct context.

We introduce the following notion of pseudoregret

Here $s_{t}\in\mathcal{S}$ denotes the context marking the $t$ -th game round. A bound on this pseudoregret is almost immediately obtained using the adversarial bandit results from Section 3.

There exists a randomized forecaster for bandits with side information (the $\mathcal{S}$ -Exp3 forecaster, defined in the proof) that satisfies

Let $S=|\mathcal{S}|$ . The $\mathcal{S}$ -Exp3 forecaster runs an instance of Exp3 on each context $s\in\mathcal{S}$ . Let $n_{s}$ the number of times when $s_{t}=s$ within the first $n$ time steps. Using the bound (18) established in Theorem 5.1 we get

where in the last step we used Jensen’s inequality and the identity $\sum_{s}n_{s}=n$ .

In subsection 10.1, we extend this construction by considering several context sets simultaneously.

A lower bound $\Omega\bigl{(}\sqrt{nSK}\bigl{)}$ is an immediate consequence of the adversarial bandit lower bound (Theorem 7.9) under the assumption that a constant fraction of the contexts in $\mathcal{S}$ marks at least constant fraction of the $n$ game rounds.

The expert case

We now consider the contextual variant of the basic adversarial bandit model of Chapter 3. In this variant there is a finite set of $N$ randomized policies. Following the setting of prediction with expert advice, no assumptions are made on the way policies compute their randomized predictions, and the forecaster experiences the contexts only through the advice provided by the policies. For this reason, in what follows we use the word expert to denote a policy. Calling this a model of contextual bandits may sound a little strange, as the structure of contexts does not seem to play a role here. However, we have decided to include this setting in this chapter because bandit with experts have been used in practical contextual bandit problems -see, e.g., the news recommendation experiment in Beygelzimer et al. (2011b).

Exp4 without mixing and with $\eta_{t}=\eta=\sqrt{\frac{2\ln N}{nK}}$ satisfies

On the other hand, with $\eta_{t}=\sqrt{\frac{\ln N}{tK}}$ it satisfies

We apply the analysis of Exp3 (Theorem 5.1) to a forecaster using distributions $q_{t}$ over $N$ experts, whose pseudo-losses are $\widetilde{y}_{j,t}$ for $j=1,\dots,N$ . This immediately gives the inequality

Now, similarly to (21) in the proof of Theorem 5.1, we establish the following inequalities

where we used Jensen’s inequality to prove (44). By applying (43) and (44) to (41) we get

Now note that, if we take expectation over the draw of $I_{1},\dots,I_{n}$ , using (42) we obtain

Choosing $\eta_{t}$ as in the statement of the Theorem, and using the inequality $\sum_{t=1}^{n}t^{-1/2}\leq 2\sqrt{n}$ , concludes the proof.

Besides pseudo-regret, the contextual regret

can be also bounded, at least with high probability. Indeed, similarly to the variant Exp3.P of Exp3 (see Section 6), an analogous modification of Exp4, called Exp4.P, satisfies

for some constant $c>0$ and with probability at least $1-\delta$ , where $\delta\in(0,1)$ is a parameter of the algorithm.

We revisit the basic contextual scenario introduced in Section 9, where the goal is to compete against the best mapping from contexts to arms. Consider now a class $\left\{{\mathcal{S}_{\theta}}\,:\,{\theta\in\Theta}\right\}$ of context sets. In this new game, each time step $t=1,2,\dots$ is marked by the vector $\bigl{(}s_{\theta,t}\bigr{)}_{\theta\in\Theta}$ of contexts, one for each set in $\Theta$ . Introduce the pseudoregret

When $|\Theta|=1$ we recover the contextual pseudoregret $\overline{R}^{\mathcal{S}}_{n}$ . In general, when $\Theta$ contains more than one set, the forecaster must learn both the best set $\mathcal{S}_{\theta}$ and the best function $g:\mathcal{S}_{\theta}\to\{1,\dots,K\}$ from that set to the set of arms.

We find this variant of contextual bandits interesting because its solution involves a nontrivial combination of two of the main algorithms examined in this chapter: Exp4 and $\mathcal{S}$ -Exp3. In particular, we consider a scenario in which Exp4 uses instances of $\mathcal{S}$ -Exp3 as experts. The interesting aspect is that these experts are learning themselves, and thus the analysis of the combined algorithm requires taking into account the learning process at both levels.

Note that in order to solve this problem we could simply lump all contexts in a big set and use the proof of Theorem 9.1. However, this would give a regret bound that depends exponentially in $|\Theta|$ . On the other hand, by using Exp4 directly on the set of all policies $g$ (which is of cardinality exponential in $|\Theta|\times|S|$ ), we could improve this to a bound that scales with $\sqrt{|\Theta|}$ . The idea we explore here is to use Exp4 over the class $\Theta$ of “experts”, and combine this with the $\mathcal{S}$ -Exp3 algorithm of Theorem 9.1. This gets us down to a logarithmic dependency on $|\Theta|$ , albeit at the price of a worse dependency on $n$ .

Intuitively, Exp4 provides competitiveness against the best context set $\mathcal{S}_{\theta}$ , while the instances of the $\mathcal{S}$ -Exp3 algorithm, acting as experts for Exp4, ensure that we are competitive against the best function $g:\mathcal{S}_{\theta}\to\{1,\dots,K\}$ for each $\theta\in\Theta$ . However, by doing so we immediately run into a problem: the $p_{t}$ used by Exp4 is not the same as the $p_{t}$ ’s used by each expert. In order to address this issue, we now show that the analysis of Exp3 holds even when the sequence of plays $I_{1},I_{2},\dots$ is drawn from a sequence of distributions $q_{1},q_{2},\dots$ possibly different from the one chosen by the forecaster. The only requirement we need is that each probability in $q_{t}$ be bounded away from zero.

where $I^{n}\sim q^{n}$ means that each $I_{t}$ is drawn from $q_{t}$ for $t=1,\dots,n$ , and $p_{t}$ is the distribution used by Exp3 at time $t$ .

The first term is bounded in a manner slightly different from the proof of Theorem 5.1,

The analysis of the second term is unchanged: Let $\widetilde{L}_{i,0}=0$ , $\Phi_{0}(\eta)=0$ and $\Phi_{t}(\eta)=\frac{1}{\eta}\log\frac{1}{K}\sum_{i=1}^{K}\exp{\left(-\eta\widetilde{L}_{i,t}\right)}$ . Then by definition of $p_{t}$ we have:

Proceeding again as in the proof of Theorem 5.1 we obtain

Choosing $\eta$ as in the statement of the theorem concludes the proof.

It is left to the reader to verify that the analysis of $\mathcal{S}$ -Exp in Theorem 9.1 can be combined with the above analysis to give the bound

where $\gamma>0$ is the mixing coefficient. This mixing clearly achieves the desired property for each $p_{i,t}$ .

Exp4 with mixing coefficient $\gamma$ and with $\eta_{t}=\eta=\gamma/K$ satisfies

There exists a randomized forecaster achieving

for any class $\left\{{\mathcal{S}_{\theta}}\,:\,{\theta\in\Theta}\right\}$ of context sets.

We run the Exp4 forecaster with mixing coefficient $\gamma$ using instances of the $\mathcal{S}$ -Exp3 algorithm (defined in the proof of Theorem 9.1) as experts. Each $\mathcal{S}$ -Exp3 instance is run on a different context set $\mathcal{S}_{\theta}$ for $\theta\in\Theta$ . Let $\xi_{t}^{\theta}$ be the distribution used at time $t$ by the $\mathcal{S}$ -Exp3 instance running on context set $\mathcal{S}_{\theta}$ and let $p^{n}$ be the joint distribution of $I^{n}=(I_{1},\dots,I_{n})$ used by Exp4. Since $p_{i,t}\geq\tfrac{\gamma}{K}$ for all $i=1,\dots,K$ and $t\geq 1$ , we can use (46) with $\varepsilon=\gamma/K$ . Thus, Theorem 10.7 implies

Substituting $\varepsilon=\gamma/K$ in the above expression and choosing $\gamma$ of the order of $n^{-1/3}\left(\max_{\theta\in\Theta}|\mathcal{S}_{\theta}|K\ln K\right)^{1/3}\sqrt{\ln|\Theta|}$ gives the desired result.

Note that in Theorem 10.9 the rate is $n^{2/3}$ , in contrast to the more usual $n^{1/2}$ bandit rate. This worsening is inherent in the Exp4-over-Exp3 construction. It is not known whether the rate could be improved while keeping the same logarithmic dependence on $|\Theta|$ guaranteed by this construction.

Stochastic contextual bandits

We now move on to consider the case in which policies have a known structure. More specifically, each policy is a function $f$ mapping the context space to the arm space $\{1,\dots,K\}$ and the set $\mathcal{F}$ of policies is given as an input parameter to the forecaster.

Under this assumption on the policies, the problem can be viewed as a bandit variant of supervised learning. For this reason, here and in the next section we follow the standard notation of supervised learning and use $x$ rather than $s$ to denote contexts.

the risk-minimizing policy in the class. The regret with respect to the class $\mathcal{F}$ of a forecaster choosing arms $I_{1},I_{2},\dots$ is then defined by

In the rest of this section we focus on the case of $K=2$ arms and parametrize classes $\mathcal{F}$ of policies $f:\mathcal{X}\to\{1,2\}$ by their VC-dimension $d$ —see Boucheron et al. (2005) for a modern introduction to VC theory. For this setting, we consider the following forecaster.

VE (VC dimension by Exponentiation): Parameters: number $n$ of rounds, $n^{\prime}$ satisfying $1\leq n^{\prime}\leq n$ . (1) For the first $n^{\prime}$ rounds, choose arms uniformly at random. (2) Build $\mathcal{F}^{\prime}\subseteq\mathcal{F}$ such that for any $f\in\mathcal{F}$ there is exactly one $f^{\prime}\in\mathcal{F}^{\prime}$ satisfying $f(x_{t})=f^{\prime}(x_{t})$ for all $t=1,\dots,n^{\prime}$ . (3) For $t=n^{\prime}+1,\dots,n$ play by simulating Exp4.P using the policies of $\mathcal{F}^{\prime}$ as experts.

We now show that the per round regret of VE is of order $\sqrt{d/n}$ , excluding logarithmic factors. This rate is equal to the optimal rate for supervised learning of VC-classes, showing that —in this case— the price of bandit information is essentially zero.

For any class $\mathcal{F}$ of binary policies $f:\mathcal{X}\to\{0,1\}$ of VC-dimension $d$ and for all $n>d$ , the forecaster VE run with $n^{\prime}=\sqrt{n\left(2d\ln\frac{en}{d}+\ln\frac{3}{\delta}\right)}$ satisfies

for some constant $c>0$ and with probability at least $1-\delta$ with respect to both the random data generation and VE’s internal randomization.

with probability at least $\tfrac{2\delta}{3}$ with respect to both the random sample draw and VE’s internal randomization.

holds with probability at least $\tfrac{\delta}{3}$ for any sample realization, it holds with the same probability for a random sample. Hence, by choosing $n^{\prime}$ as in the statement of the theorem and overapproximating, we get the desired result.

The multiclass case

A simple but effective learning strategy for (non-bandit) online classification is the multiclass Perceptron algorithm. This algorithm updates $W$ at time $t$ using the rule $W\leftarrow W+X_{t}$ , where $X_{t}$ is a $K\times d$ matrix with components $\bigl{(}X_{t}\bigr{)}_{i,j}=x_{t,j}\bigl{(}\mathds{1}_{y_{t}=i}-\mathds{1}_{\widehat{y}_{t}=i}\bigr{)}$ . Therefore, the update rule can be rewritten as

where $w_{i}$ denotes the $i$ -th row of matrix $W$ . Note, in particular, that no update takes place (i.e., $X_{t}$ is the all zero matrix) when $\widehat{y}_{t}=y_{t}$ , which means that $y_{t}$ is predicted correctly.

uniformly over $n\geq 1$ , where the infimum is over all $K\times d$ matrices $U$ and $\left\|{\,\cdot\,}\right\|$ denotes the Frobenius norm. Here $L_{n}(U)$ denotes the cumulative hinge loss of policy $U$ ,

where $[\,\cdot\,]_{+}=\max\{0,\,\cdot\,\}$ is the hinge function. Finally, $\bar{L}_{n}(U)=\tfrac{1}{n}L_{n}(U)$ is the average hinge loss of $U$ .

We now introduce a bandit variant of the multiclass Perceptron called Banditron.

We now analyze the expected number of prediction mistakes made by the Banditron algorithm on any sequence of examples $(x_{t},y_{t})$ . Unlike the non-bandit case, where the number of mistakes $M_{n}$ after $n$ steps of the multiclass Perceptron provides a “multiclass regret” bound $M_{n}-L_{n}(U)=\mathcal{O}\bigl{(}\sqrt{n}\bigr{)}$ , in the bandit case the regret achieved by the variant of the Perceptron is only bounded by $\mathcal{O}\bigl{(}n^{2/3}\bigr{)}$ .

where the infimum is over all $K\times d$ matrices $U$ and $\bar{L}_{n}(U)=\tfrac{1}{n}L_{n}(U)$ is the average hinge loss of $U$ .

Therefore, if $y_{t}\neq\widehat{y}_{t}$ ,

because $p_{i,t}=\gamma$ when $y_{t}\neq\widehat{y}_{t}$ . Otherwise, if $y_{t}=\widehat{y}_{t}$

because $p_{i,t}=1-\gamma$ when $y_{t}=\widehat{y}_{t}$ and $\gamma\leq\tfrac{1}{2}$ . Hence,

In order to see why the inequality holds, note that

Now we lower bound $\bigl{\langle}{U},{W_{n+1}}\bigr{\rangle}$ as follows,

Therefore, using again the fact that $W_{1}$ is the zero matrix,

Combining the upper and lower bounds on $\bigl{\langle}{U},{W_{n+1}}\bigr{\rangle}$ we get

Choosing $\gamma$ as in the statement of the theorem yields the desired result. Note that $\gamma\leq\tfrac{1}{2}$ because we assume $n\geq 8K$ .

Bibliographic remarks

A model of contextual stochastic bandits close to those studied here is introduced by Wang et al. (2005b). The context is provided by a i.i.d. sequence of random variables and the rewards for each arm depend on the context through an unknown parametric model beloging to a known class. This result has been generalized to the non i.i.d. case by Wang et al. (2005a), to the multivariate linear case by Rusmevichientong and Tsitsiklis (2010), and to the multivariate and nonparametric case by Perchet and Rigollet (2011). In Slivkins (2011), contexts and rewards belong to arbitrary metric spaces, and the unknown function mapping contexts to rewards satisfies a Lipschitz assumption (remarkably, the same algorithm also applies to slowly changing expected rewards and sleeping bandit settings). The case of deterministic covariates (fixed design), finitely many arms, and a linear stochastic dependence between covariates and rewards has been studied in Auer (2002); Chu et al. (2011) —see also Abe and Long (1999). The work of Filippi et al. (2010) extends the analysis of fixed design by assuming a generalized linear model to capture the dependence of rewards on covariates.

The simple contextual model analyzed in Section 9, as well as its extension described in Subsection 10.1, are due to Maillard and Munos (2011). The Exp4 algorithm for the adversarial case was introduced in Auer et al. (2002b). Subsequent improvements were proposed in the two papers Beygelzimer et al. (2011a) (Exp4.P with high-probability bounds) and McMahan and Streeter (2009) (exploitation of correlations in expert advice). The VE algorithm and its analysis in Section 11 are also taken from Beygelzimer et al. (2011a).

A stochastic model for contextual bandits with finitely many arms and finitely many states has been investigated by Seldin et al. (2011) using new sophisticated tools of PAC-Bayesian analysis.

The general stochastic model of Section 11 for contextual bandits with finitely many arms is due to Langford and Zhang (2007). An efficient algorithm for this model has been recently proposed in Dudik et al. (2011).

The bandit multiclass model of Section 12 is due to Langford and Zhang (2007). The Banditron algorithm and its analysis are from Kakade et al. (2008). See also Crammer and Gentile (2011) and Hazan and Kale (2011) for recent variants and improvements.

Chapter 5 Linear bandits

It is important to note that, without any loss of generality (as far as regret bounds are concerned), one can always assume that $\mathcal{K}$ has size $\mathcal{O}(n^{d})$ . Indeed, since $\mathcal{K}$ is a compact set and the loss is linear (and therefore Lipschitz), one can cover $\mathcal{K}$ with $\mathcal{O}(n^{d})$ points such that one incurs a vanishing extra cumulative regret by playing on the discretization rather than on the original set. Of course, the downside of this approach is that a strategy resulting from such a cover is often not computationally efficient. On the other hand, this assumption allows us to apply ideas from Chapter 3 to this more general setting. We analyze the pseudo-regret for finite classes in Section 14. Without loss of generality, it is also possible to assume that $\mathcal{K}$ is a convex set. Indeed, the pseudo-regret is the same if one plays $x_{t}$ , or if one plays a point at random in $\mathcal{K}$ such that the expectation of the played point is $x_{t}$ . This remark is critical, and allows us to develop a powerful technology known as the Mirror Descent algorithm. We describe this approach in Section 15, and explore it further in subsequent sections.

To describe the strategy, we first need a useful result from convex geometry: John’s theorem. This result concerns the ellipsoid $\mathcal{E}$ of minimal volume enclosing a given convex set $\mathcal{K}$ (which we call the John’s ellipsoid of $\mathcal{K}$ ). Basically, the theorem states that $\mathcal{E}$ has many contact points with $\mathcal{K}$ , and these contact points are “nicely” distributed, in the sense that they almost form an orthonormal basis —see Ball (1997) for a proof.

Now the Exp2 algorithm with John’s exploration corresponds to playing according to the following probability distribution

For any $\eta,\gamma>0$ such that ${\eta d}\leq{\gamma}$ , Exp2 with John’s exploration satisfies

In particular, with $\eta=\sqrt{\frac{\ln N}{3nd}}$ and $\gamma=\eta d$ ,

Since $\mathcal{K}$ is finite, we can easily adapt the analysis of Exp3 in Theorem 5.1 to take into account the exploration term. This gives

Now we use a spectral decomposition of $P_{t}$ in an orthonormal basis for $\langle\cdot,\cdot\rangle$ and write $P_{t}=\sum_{i=1}^{d}\lambda_{i}v_{i}\otimes v_{i}$ . In particular, we have $P_{t}^{-1}=\sum_{i=1}^{d}\frac{1}{\lambda_{i}}v_{i}\otimes v_{i}$ and thus

where the last inequality follows from the fact that $\langle x,x\rangle\leq 1$ for any $x\in\mathcal{K}$ , since $\mathcal{K}$ is included in the unit ball. To conclude the proof, we need to lower bound the smallest eigenvalue of $P_{t}$ . Using Theorem 14.1, one can see that $P_{t}\succeq\frac{\gamma}{d}I_{d}$ , and thus $\lambda_{i}\geq\frac{\gamma}{d}$ . Since ${\eta d}\leq{\gamma}$ , the proof is concluded.

Online Mirror Descent (OMD)

The rest of this chapter is organized as follows. Next, we briefly recall a few key concepts from convex analysis. Then we describe the OMD strategy and prove a general regret bound. In Section 16 we introduce Online Stochastic Mirror Descent (OMSD), which is a variant of OMD based on a stochastic estimate of the gradient. We apply this strategy to linear losses in two different bandit settings. Finally, in Section 18 we show how OMSD obtains improved bounds in certain special cases. The case of convex losses with bandit feedback is treated in Chapter 6.

Such a $g$ is called a subgradient of $f$ at $x$ .

Abusing notation, we use $\nabla f(x)$ to denote both the gradient of $f$ at $x$ when $f$ is differentiable, and any subgradient of $f$ at $x$ when $f$ is subdifferentiable (a sufficient condition for subdifferentiability of $f$ is that $f$ is convex and $\mathcal{X}$ is open).

This definition directly implies the Fenchel-Young inequality for convex functions, $u^{\top}x\leq f(x)+f^{*}(u)$ .

$F$ is strictly convex and admits continuous first partial derivatives on $\mathcal{D}$ ;

Let $F$ be a Legendre function. Then $F^{**}=F$ , and $\nabla F^{*}=(\nabla F)^{-1}$ restricted on the set $\mathcal{D}^{*}$ . Moreover, for all $x,y\in\mathcal{D}$ ,

The gradient $\nabla F$ maps $\mathcal{D}$ to the dual space $\mathcal{D}^{*}$ , and $\nabla F^{*}$ is the inverse mapping from the dual space to the original (primal) space. Note also that (50) shows that the Bregman divergence in the primal exactly corresponds to the Bregman divergence of the Legendre-transform in the dual.

The next lemma shows that the geometry induced by a Bregman divergence resembles to the geometry of the squared Euclidean distance. See (Cesa-Bianchi and Lugosi, 2006, Lemma 11.3) for a proof.

Let $\mathcal{K}\subseteq\overline{\mathcal{D}}$ be a closed convex set such that $\mathcal{K}\cap\mathcal{D}\neq\emptyset$ . Then, for all $x\in\mathcal{D}$ , the Bregman projection

exists and is unique. Moreover, for all $z\in\mathcal{K}\cap\mathcal{D}$ and $y\in\mathcal{K}$ ,

The idea of OMD is very simple: first, select a Legendre function $F$ on $\overline{\mathcal{D}}\supset\mathcal{K}$ (such that $\mathcal{K}\cap\mathcal{D}\neq\emptyset$ ); second, perform a gradient descent step, where the update with the gradient is performed in the dual space $\mathcal{D}^{*}$ rather than in the primal $\mathcal{D}$ ; third, project back to $\mathcal{K}$ according to the Bregman divergence defined by $F$ .

Note that step (2) is well defined if the following consistency condition is satisfied:

Note also that step (2) can be rewritten as

Finally, note that the Bregman projection in step (3) is a convex program, in the sense that $x\mapsto D_{F}(x,y)$ is always a convex function. This does not necessarily imply that step (3) can be performed efficiently, since in some cases the feasible set $\mathcal{K}$ might only be described with an exponential number of constraints (we encounter examples like this in Section 17).

When $\mathcal{K}$ is the simplex and $F(x)=\sum_{i=1}^{d}x_{i}\ln x_{i}-\sum_{i=1}^{d}x_{i}$ , OMD reduces to an exponentially weighted average forecaster, similar to those studied in Chapter 3. The well-known online gradient descent strategy corresponds to taking $F(x)=\frac{1}{2}\left\|{x}\right\|^{2}_{2}$ . In the following we shall see other possibilities for the Legendre function $F$ .

We prove now a very general and powerful theorem concerning the regret of OMD.

Let $\mathcal{K}$ be a compact and convex set of arms, $\mathcal{L}$ be a set of subdifferentiable functions, and $F$ a Legendre function defined on $\overline{\mathcal{D}}\supset\mathcal{K}$ , such that (51) is satisfied. Then OMD satisfies for any $x\in\mathcal{K}$ ,

Let $x\in\mathcal{K}$ . Since $\mathcal{L}$ is a set of subdifferentiable functions, we have:

Using (52), and applying the definition of $D_{F}$ , one obtains

By Lemma 15.8, one gets $D_{F}(x,w_{t+1})\geq D_{F}(x,x_{t+1})+D_{F}(x_{t+1},w_{t+1}),$ hence

By the nonnegativity of the Bregman divergences, we get

Let $x\in\mathcal{K}$ . Since $\mathcal{L}$ is a set of subdifferentiable functions, we have

Using the Fenchel-Young inequality, one obtains

Observe that $F^{*}(0)=-F(x_{1})$ and, for each term in the above sum,

Online Stochastic Mirror Descent (OSMD)

The following theorem establishes a general regret bound for OSMD. Note that here the pseudo-regret is defined as

Note also that we state the theorem for a Legendre function $F$ , but a similar result can be obtained under the same assumptions as those of Theorem 15.11.

which concludes the proof of the first regret bound. The case of a linear loss follows very easily from the same computations.

Online combinatorial optimization

In this section we consider an interesting special case of online linear optimization. In the online combinatorial optimization setting the set of arms is $\mathcal{C}\subseteq\{0,1\}^{d}$ and the set of linear loss functions is $\mathcal{L}=^{d}$ . We assume $\left\|{v}\right\|_{1}=m$ for all $v\in\mathcal{C}$ and for some integer $m\leq d$ . Many interesting problems fall into this framework, including ranking/selection of $m$ items, or path planning.

Using Theorem 16.13 one directly obtains:

We show now how to use this bound to obtain concrete performances for OSMD using the negative entropy as Legendre function. Later, we show that one can improve the results by logarithmic factors, using a more subtle Legendre function.

We start with OSMD and the negative entropy.

In particular, with the estimate (53) and $\eta=\sqrt{\frac{2m}{nd}\ln\frac{d}{m}}$ ,

Moreover, straightforward computations give

We now greatly generalize the negative entropy with the following definition. When used with OSMD, this more general entropy allows us to obtain a bound tighter than that of Theorem 17.15.

With a potential $\psi$ we associate the function $F_{\psi}$ defined on $\mathcal{D}=(\omega,+\infty)^{d}$ by

We restrict our attention to -potentials. A non-zero $\omega$ might be used to derive high probability regret bounds (instead of pseudo-regret bounds). Note that with $\psi(x)=e^{x}$ we have that $F_{\psi}$ reduces to the negative entropy.

Let $\psi$ be a -potential. Then $F_{\psi}$ is Legendre and for all $u,v\in\mathcal{D}^{*}=(-\infty,a)^{d}$ such that $u_{i}\leq v_{i}$ for $i=1,\ldots,d$ ,

It is easy to check that $F$ is a Legendre function. Moreover, since $\nabla F^{*}(u)=(\nabla F)^{-1}(u)=\big{(}\psi(u_{1}),\dots,\psi(u_{d})\big{)}$ we obtain

Since the function $\psi$ is convex, and $u_{i}\leq v_{i}$ , we have

We are now ready to bound the pseudo-regret of OSMD run with an arbitrary -potential. For a specific choice of the potential we obtain an improvement of Theorem 17.15. In particular for $m=1$ this result gives the log-free bound for the adversarial multi-armed bandit that was discussed in Section 8.1.

In particular, choosing the -potential $\psi(x)=(-x)^{-q}$ , the estimate (53), and $\eta=\sqrt{\frac{2}{q-1}\frac{m^{1-2/q}}{d^{1-2/q}}}$ ,

The first inequality trivially follows from (54), Lemma 17.18, and the fact that $\psi^{\prime}\bigl{(}\psi^{-1}(s)\bigr{)}=\frac{1}{(\psi^{-1})^{\prime}(s)}$ .

Let $\psi(x)=(-x)^{-q}$ . Then we have that $\psi^{-1}(x)=-x^{-1/q}$ and $F(x)=-\frac{q}{q-1}\sum_{i=1}^{d}x_{i}^{1-1/q}$ . In particular, by Hölder’s inequality, since $\sum_{i=1}^{d}x_{1}(i)=m$ ,

Moreover, note that $(\psi^{-1})^{\prime}(x)=\frac{1}{q}x^{-1-1/q}$ , and

Improved regret bounds for bandit feedback

where $\xi_{t}$ is a Bernoulli random variable of parameter $\left\|{x_{t}}\right\|$ , $I_{t}$ is drawn uniformly at random in $\{1,\ldots,d\}$ , and $\varepsilon_{t}$ is a Rademacher random variable with parameter $\frac{1}{2}$ .

In particular, with $\gamma=\frac{1}{\sqrt{n}}$ and $\eta=\sqrt{\frac{\ln n}{2nd}}$ ,

First, it is clear that by playing on $\mathcal{K}^{\prime}=(1-\gamma)\mathcal{K}$ instead of $\mathcal{K}$ , OSMD incurs an extra $\gamma n$ regret. Second, note that $F$ is stricly convex (it is the composition of a convex and nondecreasing function with the Euclidean norm) and

The first term is clearly bounded by $\frac{1}{\eta}\ln\frac{1}{\gamma}$ (since $x_{1}=0$ ). For the second term, we need to do a few computations. The first one follows from (58)), the others follow from simple algebra

Let $\Theta(u,v)$ such that $D_{F^{*}}(u,v)=\frac{1}{1+\left\|{v}\right\|}\Theta(u,v)$ . First note that

Now using that $\ln(1+x)\geq x-x^{2}$ for all $x\geq-\frac{1}{2}$ , we obtain that for $u,v$ such that $\frac{\left\|{u}\right\|-\left\|{v}\right\|}{1+\left\|{v}\right\|}\geq-\frac{1}{2}$ ,

which concludes the proof of (56). For the proof of (57) it suffices to note that

and perform with straightforward computations.

Refinements and bibliographic remarks

Online convex optimization in the full information setting was introduced by Zinkevich (2003). Online linear optimization with bandit feedback was pioneered in Awerbuch and Kleinberg (2004); McMahan and Blum (2004). For this problem, Dani et al. (2008a) were the first to obtain optimal $\mathcal{O}\bigl{(}\sqrt{n}\bigr{)}$ bounds in terms of the number $n$ of rounds. This was done using the Exp2 strategy with an exploration uniform over a barycentric spanner for $\mathcal{K}$ . The exploration part was first improved by Cesa-Bianchi and Lugosi (2012) for combinatorial sets $\mathcal{K}$ . Finally, the optimal exploration based on John’s theorem was introduced by Bubeck et al. (2012a). Theorem 14.2 is extracted from this last paper.

Simultaneously with the line of research on Exp2, algorithms based on Online Mirror Descent were also investigated. Mirror Descent was originally introduced in the seminal work of Nemirovski (1979); Nemirovski and Yudin (1983) as a standard (offline) convex optimization method. A somewhat similar class of algorithms was rediscovered in the online learning community, see Herbster and Warmuth (1998); Grove et al. (2001); Kivinen and Warmuth (2001); Shalev-Shwartz (2007). The connection between existing online learning algorithms (such as Exponential weights or Online Gradient Descent) and Mirror Descent was first made explicit in Cesa-Bianchi and Lugosi (2006) —see also Rakhlin (2009) and Hazan (2011). Earlier applications of Mirror Descent in the learning community can be found in Juditsky et al. (2005). The first application of Mirror Descent to online linear optimization with bandit feedback was given by Abernethy et al. (2008). In this pioneering paper, the authors describe the first computationally efficient strategy (i.e., with complexity polynomial in $d$ ) with $\mathcal{O}(\sqrt{n})$ regret. The main idea is to use Mirror Descent with a self-concordant barrier $F$ for the set $\mathcal{K}$ . Unfortunately, the drawback is a suboptimal dependency on $d$ in the regret. More precisely. they obtain a $\mathcal{O}(d^{2}\sqrt{n})$ regret under the bounded scalar loss assumption, while Exp2 with John’s exploration attains $\mathcal{O}(d\sqrt{n})$ . However, Mirror Descent can also deliver optimal regret bounds in the bandit case, as we showed in Section 18, which is extracted from Bubeck et al. (2012a).

The presentation of the Online Mirror Descent algorithm in Section 15 is inspired by Bubeck (2011). The definition of Legendre functions comes from (Cesa-Bianchi and Lugosi, 2006, Chapter 11) —further developments on convex analysis can be found in Hiriart-Urruty and Lemaréchal (2001); Boyd and Vandenberghe (2004). Theorem 15.9 is taken from Audibert et al. (2011), but the proof technique goes back at least to Ben-Tal and Nemirovski (1999). The proof of Theorem 15.11 is adapted from Kakade et al. (2012). Section 16 is inspired by gradient-free optimization, a topic extensively studied since Robbins and Monro (1951); Kiefer and Wolfowitz (1952) —see Nemirovski et al. (2009); Conn et al. (2009); Nesterov (2011); Bach and Moulines (2011) for recent accounts on this theory. Alternative views have been proposed on the Online Mirror Descent strategy. In particular, it is equivalent to a Follow The Regularized Leader, and to proximal algorithms, see Rakhlin (2009). This viewpoint was pioneered by Beck and Teboulle (2003) —see also Bartok et al. (2011) for more details. Finally, a notion of universality of Online Mirror Descent in the online prediction setting was proposed by Srebro et al. (2011).

The online combinatorial optimization problem studied in Section 17 was introduced by Kalai and Vempala (2005) for the full information setting. Several works have studied this problem for specific sets $\mathcal{C}$ , see in particular Takimoto and Warmuth (2003); Warmuth and Kuzmin (2008); Helmbold and Warmuth (2009); Hazan et al. (2010); Koolen et al. (2010); Warmuth et al. (2011); Cesa-Bianchi and Lugosi (2012). The semi-bandit feedback was studied in the series of papers György et al. (2007); Kale et al. (2010); Uchiya et al. (2010); Audibert et al. (2011). The presentation adopted in this section is based on the last paper. OSMD with negative entropy was first studied by Helmbold and Warmuth (2009) for the full information setting and for a specific set $\mathcal{C}$ . It was then studied more generally in Koolen et al. (2010) for any set $\mathcal{C}$ . The generalization to semi-bandit feedback was done by Audibert et al. (2011). OSMD with a Legendre derived from a potential was introduced by Audibert et al. (2011). In the case of the simplex, this strategy corresponds to the INF strategy of Audibert and Bubeck (2009) discussed in Section 8.1.

Online linear optimization is still far from being completely understood. For instance, see (Bubeck, 2011, Chapter 9) for a list of open problems. In this section we also omitted a few important topics related to online linear optimization. We briefly review some of them below.

Under the bounded scalar loss assumption, it was proved by Dani et al. (2008a) that for $\mathcal{K}=^{d}$ the minimax regret in the full information setting is at least of order $\sqrt{dn}$ , while under bandit feedback it is of order $d\sqrt{n}$ . In both cases Exp2 is matching these lower bounds (using John’s exploration in the bandit case).

In the combinatorial setting, where $\mathcal{K}\subset\{0,1\}^{d}$ and $\mathcal{L}=^{d}$ , Audibert et al. (2011) show that the minimax regret in the full information and semi-bandit cases is at least of order $d\sqrt{n}$ , while in the bandit case it is of order $d^{3/2}\sqrt{n}$ . OSMD with the negative entropy matches the bounds in the full information and semi-bandit cases. However, in the bandit case the best known bound is obtained by Exp2 (with John’s exploration) and gives a regret of order $d^{2}\sqrt{n}$ . It is important to remark that Audibert et al. (2011) show that Exp2 is a provably suboptimal strategy in the combinatorial setting.

Finally, lower bounds for the full information case, and for a few specific sets $\mathcal{K}$ of interest, were derived by Koolen et al. (2010).

2 High probability bounds

In this chapter we focused on the pseudo-regret $\overline{R}_{n}$ . However, just like in Chapter 3, a much more important and interesting statement concerns high probability bounds for the regret $R_{n}$ . Partial results in this direction can be found in Bartlett et al. (2008) for the Exp2 strategy, and in Abernethy and Rakhlin (2009) for the OSMD algorithm.

3 Stochastic online linear optimization

Chapter 6 Nonlinear bandits

This modification has important consequences. For instance with strictly convex losses one has to do local perturbations in order to estimate the loss gradient, this is in contrast to the global perturbations studied in the previous chapter. In agreement with the setting of Chapter 5, we initially focus on the nonstochastic setting, where the forecaster faces an unknown sequence of convex Lipschitz and differentiable losses (in the nonlinear case the regret scales with the Lipschitz constant of losses). Problems of this kind can be viewed as dynamic variants of convex optimization problems, in which the convex function to optimize evolves over time. The bandit constraint can be simply interpreted as the impossibility of computing gradients (because, for instance, we do not have a explicit representation of the function, but it can only be accessed by querying for values at desired points).

We look at two feedback models. In the first one, at each step the forecaster evaluates the loss function at two points: the played point plus an additional point of its choice. In the second one, only the value of the loss evaluated at the played point is made available to the forecaster. We show that while the two-points model allows for a $\mathcal{O}\bigl{(}\sqrt{n}\bigr{)}$ bound on pseudo-regret, in the one-point model a pseudo-regret bound of only $\mathcal{O}\bigl{(}n^{3/4}\bigr{)}$ is achieved. The stochastic setting is investigated in Section 22 where, similarly to Chapter 2, we assume that each play of an arm returns a stochastic loss with fixed but unknown mean. Unlike the nonstochastic case, the mean loss function is assumed to be Lipschitz and unimodal, but not necessarily convex. For keeping things simple, the stochastic setting is studied in $1$ -dimensional case, when arms are points in the unit interval. For this case we show a bound on the pseudo-regret of $\mathcal{O}\bigl{(}\sqrt{n}(\log n)\bigr{)}$ .

We focus our analysis on OSMD with Legendre function $F=\tfrac{1}{2}\left\|{\cdot}\right\|^{2}$ , where $\left\|{\cdot}\right\|$ is the Euclidean norm. The resulting strategy, Online Stochastic Gradient Descent (OSGD) is sketched below here.

We now introduce our main technical tool: the two-point gradient estimate. The two points on which the loss value is queried at time $t$ are denoted by $X_{t}^{+}$ and $X_{t}^{-}$ . OSGD always plays one of these two points at random.

In order to compute the expectation of $\widetilde{g}_{t}$ , first note that by symmetry

In order to compute the expectation in the right-hand side we need the following preliminary lemma.

where $\sigma$ is the unnormalized spherical measure.

The proof of this result is an easy consequence of the Divergence Theorem,

We are now fully equipped to compute the expectation of $\widetilde{g}_{t}$ .

First consider the easy one-dimensional case. Namely, $\mathcal{K}=[a,b]$ for some reals $a<b$ . Note that, in this case, $S$ is uniform in $\{-1,+1\}$ whereas $B$ is uniform in $[-1,+1]$ . Then

for all realizations of the random process $\bigl{(}X_{t}^{+},X_{t}^{-}\bigr{)}_{t\geq 1}$ .

We are now ready to prove the main result of this section. Namely, that the pseudo-regret of OSGD run using the gradient estimate (60) is of order $\sqrt{n}$ . We assume that the point $\widetilde{X}_{t}$ played by OSGD at each time $t$ is randomly drawn between the two points $X_{t}^{+}$ and $X_{t}^{-}$ where the loss function is queried.

Moreover, if $\eta=\tfrac{R}{GD\sqrt{n}}$ then for $\delta\to 0$ we have that

where we overapproximated $\left\|{\bigl{(}1-\tfrac{\delta}{r}\bigr{)}\mathcal{K}}\right\|\leq\left\|{\mathcal{K}}\right\|=R$ . This concludes the proof.

One-point bandit feedback

Building on the analysis of the previous section, it is not hard to show that the pseudo-regret can be bounded even when the loss function at each time $t$ is queried in only one point. However, we pay this reduced bandit feedback with a worse rate of $n^{3/4}$ in the pseudo-regret bound. It is not known if this rate is optimal, or if it is possible to get a $\sqrt{n}$ regret as in the two-points setting.

The one-point estimate at time $t$ is defined by

Note the inverse dependence on $\delta$ . This dependence plays a key role in the final bound, as the next result shows.

Applying Theorem 16.13 as in the proof of Theorem 20.7 gives

Nonlinear stochastic bandits

We conclude with a simple example of nonlinear bandits in the stochastic setting. Unlike the gain-based analysis of stochastic bandits of Chapter 2, here we keep in with the convention used throughout this chapter and work exclusively with losses.

where $x_{1},\dots,x_{n}\in$ denote the points played by the strategy.

The bandit strategy we analyze in this section is based on the golden section search due to Kiefer (1953), which is a general algorithm for finding the extremum of a unimodal function. Similarly to binary search, each step of golden section search narrows the interval in which the extremum is found by querying the function value at certain points that are chosen depending on the outcome of previous queries. Each query shrinks the interval by a factor of $\tfrac{1}{\varphi}=0.618\dots$ , where $\varphi=\tfrac{1}{2}\bigl{(}1+\sqrt{5}\bigr{)}$ is the golden ratio.

In our case, queries (i.e., plays) at $x$ return a perturbed version of $\mu(x)$ . Since $\mu$ is bounded, Hoeffding bounds ensure that we can find the minimum of $\mu$ by repeatedly querying each point $x$ requested by the golden search algorithm. However, in order to have a lower bound on the accuracy with which each $\mu$ needs to be estimated, we must assume the following condition: there exists $C_{L}>0$ such that

for each $x,x^{\prime}$ that belong either to $[0,x^{*}-1/C_{L}]$ or to $[x^{*}+1/C_{L},1]$ .

Finally, irrespective to the uncertainty in the evaluation of $\mu$ , in order to bound the regret incurred by golden section search we need a Lipschitz condition on $\mu$ . Namely, there exists $C_{H}>0$ such that $\bigl{|}\mu(x)-\mu(x^{\prime})\bigr{|}\leq C_{H}|x-x^{\prime}|$ for all $x,x^{\prime}\in$ .

We are now ready to introduce our stochastic version of the golden section search algorithm.

SGS (Stochastic Golden Search): Parameters: $\varepsilon_{1},\varepsilon_{2},\dots>0$ . Initialize: $x_{A}=0\quad x_{B}=\frac{1}{\varphi^{2}}\quad x_{C}=1$ . For each stage $s=1,\dots,n$ (1) Let $x_{B}^{\prime}=\left\{\begin{array}[]{cl}x_{B}-\tfrac{1}{\varphi^{2}}(x_{B}-x_{A})&x_{B}-x_{A}>x_{C}-x_{B}\\[2.84526pt] x_{B}+\tfrac{1}{\varphi^{2}}(x_{C}-x_{B})&\text{otherwise}\end{array}\right.$ and rename points $x_{B},x_{B}^{\prime}$ so that $x_{A}<x_{B}<x_{B}^{\prime}<x_{C}$ . (2) Play each point in $\{x_{A},x_{B},x_{B}^{\prime},x_{C}\}$ for $\tfrac{2}{\varepsilon_{s}^{2}}\ln(6n)$ times and let $\widehat{x}$ be the point with lowest total loss in this stage. (3) If $\widehat{x}\in\{x_{A},x_{B}\}$ then eliminate interval $(x_{B}^{\prime},x_{C}]$ and let $x_{C}=x_{B}^{\prime}$ , (4) else eliminate interval $[x_{A},x_{B})$ and let $x_{A}=x_{B}$ .

Recall that golden section search proceeds as follows: given three queried points $x_{A}<x_{B}<x_{C}$ where the distance of $x_{B}$ to the other two points is in the golden ratio ( $x_{B}$ might be closer to $x_{A}$ or to $x_{C}$ depending on past queries), the next point $x_{B}^{\prime}$ is queried in the largest interval between $x_{B}-x_{A}$ and $x_{C}-x_{B}$ so that the distance of $x_{B}^{\prime}$ to the extrema of that largest interval is in the golden ratio. Assume the resulting ordering is $x_{A}<x_{B}<x_{B}^{\prime}<x_{C}$ . Then we drop either $[x_{A},x_{B})$ or $(x_{B}^{\prime},x_{C}]$ according to whether the smallest value of $\mu$ is found in, respectively, $\{x_{B}^{\prime},x_{C}\}$ or $\{x_{B}^{\prime},x_{C}\}$ . The remaining triplet is such that the distance of the middle point to the other two is again in the golden ratio.

Using elementary algebraic identities for $\varphi$ , one can show that setting $x_{C}-x_{A}=1$ the following equalities hold at any step of SGS:

Since either $x_{B}-x_{A}$ or $x_{C}-x_{B}^{\prime}$ are eliminated at each stage, at each stage SGS shrinks the search interval by a factor of $1-\varphi^{-2}=\tfrac{1}{\varphi}$ .

Let $[x_{A,s},x_{B,s}]$ be the search interval at the beginning of stage $s+1$ , where $x_{A,0}=0$ and $x_{C,0}=1$ .

If $\varepsilon_{s}=C_{L}\varphi^{-(s+3)}$ then

holds uniformly over all stages $s\geq 1$ .

Once the interval containing $x^{*}$ is eliminated it is never recovered, thus we have

Now assume $x^{*}\in[x_{A,s-1},x_{B,s-1}]$ . Then $x^{*}\not\in[x_{A,s},x_{C,s}]$ implies $\widehat{\mu}_{s}(x_{B^{\prime},s-1})<\widehat{\mu}(x_{B,s-1})$ or $\widehat{\mu}_{s}(x_{C,s-1})<\widehat{\mu}(x_{B,s-1})$ . Similarly, assume $x^{*}\in[x_{B^{\prime},s-1},x_{C,s-1}]$ . Then $x^{*}\not\in[x_{A,s},x_{C,s}]$ implies $\widehat{\mu}_{s}(x_{A,s-1})<\widehat{\mu}(x_{B^{\prime},s-1})$ or $\widehat{\mu}_{s}(x_{B,s-1})<\widehat{\mu}(x_{B^{\prime},s-1})$ . In both cases, we need to compare three values of $\mu$ on the same side with respect to $x^{*}$ . (When $x^{*}\in[x_{B,s-1},x_{B^{\prime},s-1}]$ we always have $x^{*}\in[x_{A,s},x_{C,s}]$ .) Hence, we can apply our assumption involving $C_{L}$ . More precisely, (64) implies that after $s$ stages the search interval has size $\varphi^{-s}$ and $\min\{x_{B,s}-x_{A,s},x_{B,s}^{\prime}-x_{B,s},x_{C,s}-x_{B,s}^{\prime}\}=\varphi^{-(s+3)}$ . Hence, introducing

Let $T_{s}=\tfrac{8}{\varepsilon_{s}^{2}}\ln(6n)$ the length of stage $s$ . We can write

Substituting this in (65) and recurring gives the desired result.

For any unimodal and $C_{H}$ -Lipschitz mean loss function $\mu:\to$ that satisfies (63), if the SGS algorithm is run with $\varepsilon_{s}=C_{L}\varphi^{-(s+3)}$ then

We start by decomposing the pseudo-regret as follows,

and recalling that $\bigl{|}x_{C,s}-x_{A,s}\bigr{|}\leq\varphi^{-s}$ , we bound the first term in this decomposition as follows

Hence we get an easy expression for the regret,

We now compute an upper bound on the number $S$ of stages,

Solving for $n$ and overapproximating we get

Therefore, the sum in (66) is bounded as follows

Substituting the above in (66) concludes the proof.

Bibliographic remarks

Gradient-free methods for stochastic approximation have been studied for a long time —see the bibliographic remarks at the end of Chapter 5 for some references. However, relatively few results provide regret bounds. The approach presented in this chapter for online convex optimization with bandit feedback was pioneered by (Flaxman et al., 2005) and (Kleinberg, 2004). The improved rate for the two-points model was later shown in Agarwal et al. (2010b).

Another direction for nonlinear stochastic bandits was recently investigated in (Agarwal et al., 2011b). In this work the authors study the case of a convex mean loss function, and they show how to combine the zeroth-order optimization method of (Nemirovski and Yudin, 1983) with a “center point device” to obtain a regret of $\widetilde{O}(\sqrt{n})$ .

Finally, a related problem in a Bayesian setting was studied in Srinivas et al. (2010) and Grünewälder et al. (2010), where it is assumed that the payoff functions are drawn from Gaussian processes.

Chapter 7 Variants

In the previous chapters we explored a few fundamental variations around the basic multi-armed bandit problem. In both the stochastic and adversarial frameworks, these variants basically revolved around a single principle: by adding constraints on the losses (or rewards), it is possible to compete against larger sets of arms. While this is indeed a fundamental axis in the space of bandit problems, it is important to realize that there are many other directions. Indeed, we might sketch a “bandit space” spanning the following coordinates:

Evolution of payoffs over time: stochastic, adversarial, Markovian, …

Structure of payoff functions: linear, Lipschitz, Gaussian process, …

Feedback structure: full information, bandit, semi-bandit, partial monitoring, …

Clearly, such extensions greatly increase the number of potential applications of bandit models. While many of these extensions were already discussed in the previous chapters, in the following we focus on others (such as the sleeping bandits or the thruthful bandits) so to visit more exotic regions of the bandit space.

Extending further the model of Markovian bandits (mentioned at the end of Chapter 1), one can also define a general Markov Decision Process (MDP) —see also Section 24. For example, the stochastic bandit of Chapter 2 corresponds to a single-state MDP.

In full generality, a finite MDP can be described by a set of states $\{1,\ldots,S\}$ , a set of actions $\{1,\ldots,K\}$ , a set $\{p_{i,s},\,1\leq i\leq K,\,1\leq s\leq S\}$ of transition distributions over $S$ , and a set $\{\nu_{i,s},\,1\leq i\leq K,\,1\leq s\leq S\}$ of reward distributions over $ $. In this model, taking action$ i $in state$ s $generates a stochastic reward drawn from$ \nu_{i,s} $and a Markovian transition to a state drawn from$ p_{i,s}$. Similarly to the multi-armed bandit problem, here one typically assumes that the reward distributions and transition distributions are unknown, and the goal is to navigate through the MDP so as to maximize some function of the obtained rewards. The field that studies this type of problem is called Reinforcement Learning. The interested reader is addressed to Sutton and Barto (1998); Kakade (2003); Szepesvári (2010). Reinforcement learning results with a flavor similar to those described in the previous chapters can be found in Yu et al. (2009); Bubeck and Munos (2010); Jaksch et al. (2010); Neu et al. (2010).

An intermediate model, between stochastic multi-armed bandits and MDPs, is the one of restless bandits. As in Markovian bandits, each arm is associated with a Markovian reward process with its own state space. Each time an arm is chosen, the associated Markov process generates an observable reward and makes a transition to a new state, which is also observed. However, unlike Markovian bandits an unobserved transition occurs for each arm that is not chosen. Using concentration inequalities for Markov chains —see, e.g., Lezaud (1998), one can basically show that, under suitable assumptions, UCB attains a logarithmic regret for restless bandits as well —see Tekin and Liu (2012) and Filippi et al. (2011). A more general regret bound for restless bandits has been recently proven by Ortner et al. (2012).

An apparently similar problem was studied by Garivier and Moulines (2011), where they assume that the reward distributions can abruptly change at unknown time instants (and there is a small number of such change-points). Within this model, the authors prove that the best possible regret is of order $\sqrt{n}$ , which is matched by the Exp3.P algorithm —see the discussion in Section 8.3. Thus, while the two problems (restless bandits and bandits with change-points) might look similar, they are fundamentally different. In particular, note that the latter problem cannot be cast as a MDP.

Another intermediate model, with important applications, is that of the sleeping bandits. There, it is assumed that the set of available actions is varying over time. We refer the interested reader to Kleinberg et al. (2010); Kanade et al. (2009); Slivkins (2011); Kanade and Steinke (2012) for the details of this model as well as the theoretical guarantees that can be obtained. A somewhat related problem was also studied in György et al. (2007) where it is assumed that the set of arms becomes unavailable for a random time after each arm pull (and the distribution of this random time depends on the selected arm).

Pure exploration problems

Bubeck et al. (2009a) prove that minimizing the simple regret is fundamentally different from minimizing the pseudo-regret $\overline{R}_{n}$ , in the sense that one always have $r_{n}\geq\exp(-C\overline{R}_{n})$ for some constant $C>0$ (which depends on the reward distributions). Thus, this regret calls for different bandit algorithms. Audibert et al. (2010) exhibit a simple strategy with optimal performances up to a logarithmic factor. The idea is very simple: the strategy SR (Successive Rejects) works in $K-1$ phases. SR keeps a set of active arms, that are sampled uniformly in each phase. At the end of a phase, the arm with smallest empirical mean is removed from the set of active arms. It can be shown that this strategy has a simple regret of order $\exp\left(-c\frac{n}{H\ln K}\right)$ , where $H=\sum_{i\neq i^{*}}\frac{1}{\Delta_{i}^{2}}$ is the complexity measure of identifying the best arm, and $c$ is a numerical constant. Moreover, a matching lower bound (up to logarithmic factors) was also proved. These ideas were extended in different ways by Gabillon et al. (2011); Bui et al. (2011); Bubeck et al. (2012c).

A similar problem was studied in a PAC model by Even-Dar et al. (2002). The goal is to find, with probability at least $1-\delta$ , an arm with mean at least $\varepsilon$ close the optimal mean, and the relevant quantity is the number of pulls needed to achieve this goal. For this problem, the authors derive an algorithm called Successive Elimination that achieves an optimal number of pulls (up to logarithmic factors). Successive Elimination works as follows: it keeps an estimate of the mean of each arm, together with a confidence interval. If two confidence intervals are disjoint, then the arm with the lowest confidence interval is eliminated. Using this procedure, one can achieve the $(\varepsilon,\delta)$ guarantee with a number of pulls of order $H\ln\tfrac{K}{\Delta}$ . A matching lower bound is due to Mannor and Tsitsiklis (2004), and further results are discussed by Even-Dar et al. (2006).

In some applications one is not interested in the best arm, but rather in having a good estimate of the mean $\mu_{i}$ for each arm. In this setting a reasonable measure of performance is given by

Clearly, the optimal static allocation depends only on the variances of the arms, and we denote by $L_{n}^{*}$ the performance of this strategy. This setting was introduced by Antos et al. (2008), where the authors studied the regret $L_{n}-L_{n}^{*}$ , and showed that a regret of order $n^{-3/2}$ was achievable. This result was then refined by Carpentier et al. (2011); Carpentier and Munos (2011). The basic idea in these papers is to resort to the optimism in face of uncertainty principle, and to approximate the optimal static allocation by replacing the true variance with an upper confidence bound on it.

Dueling bandits

An interesting variation of stochastic bandits was recently studied by Yue et al. (2009). The model considered in that paper is called dueling bandits. The main idea is that the player has to choose a pair or arms $(I_{t},J_{t})$ at each round, and can only observe the relative performances of these two arms, i.e., the player only knows which arm had the highest reward. More formally, in dueling bandits we assume that there exists a total ordering $\succ$ on $\{1,\ldots,K\}$ with the following properties:

If $i\succ j$ , then the probability that the reward of arm $i$ is larger than the reward of arm $j$ is equal to $\frac{1}{2}+\Delta_{i,j}$ with $\Delta_{i,j}>0$ .

If $i\succ j\succ k$ , then $\Delta_{i,j}+\Delta_{j,k}\geq\Delta_{i,k}\geq\max\bigl{\{}\Delta_{i,j},\Delta_{j,k}\bigr{\}}$ .

Upon selecting a pair $(I_{t},J_{t})$ , the player receives a random variable drawn from a Bernoulli distribution with parameter $\frac{1}{2}+\Delta_{i,j}$ . In this setting a natural regret notion is the following quantity, where $i^{*}$ is the largest element in the ordering $\succ$ ,

It was proved in Yue et al. (2009) that the optimal regret for this problem is of order $\frac{K}{\Delta}\log n$ , where $\Delta=\min_{i\neq j}\Delta_{i,j}$ . A simple strategy that attains this rate, based on the Successive Elimination algorithm of Even-Dar et al. (2002), was proposed by Yue and Joachims (2011).

Discovery with probabilistic expert advice

Bubeck et al. (2011a) study a model with a stochastic bandit flavor (in fact it can be cast as an MDP), where the key for the analysis is a sort of ’non-linear’ regret bound. In this model rewards represent items in some set $\mathcal{X}$ which is partitioned in a subset $A\subset\mathcal{X}$ of interesting items and in a subset $\mathcal{X}\setminus A$ of non-interesting items. The goal is to maximize the total expected number of interesting items found after $n$ pulls, where observing twice the same item does not help. A natural notion of regret is obtained by comparing the number of interesting items $F(n)$ found by a given strategy to the number $F^{*}(n)$ found by the optimal strategy. It turns out that analyzing such regret directly is difficult. The first issue is that in this problem the notion of a “good” arm is dynamic, in the sense that an arm could be very good for a few pulls and then completely useless. Furthermore a strategy making bad decisions in the beginning will have better opportunities in the future than the optimal strategy (which already discovered some interesting items). Taking into account these issues, it turns out that it is easier to show that for good strategies, $F(n)$ is not too far from $F^{*}(n^{\prime})$ , where $n^{\prime}$ is not much smaller than $n$ . Such a statement – which can be interepreted as a non-linear regret bound – shows that the analyzed strategy slightly ’lags’ behind the optimal strategy. In Bubeck et al. (2011a) a non-linear regret bound is derived for an algorithm based on estimating the mass of interesting items left on each arm (the so-called Good-Turing estimator), combined with the optimism in face of uncertainty principle of Chapter 2. We refer the reader to Bubeck et al. (2011a) for more precise statements.

Many-armed bandits

The many-armed bandit setting was introduced by Berry et al. (1997), and then extended and refined by Wang et al. (2008). This setting corresponds to a stochastic bandit with an infinite number of arms. The extra assumption that makes this problem feasible is a prior knowledge on the distribution of the arms. More precisely, when the player ask to “add” a new arm to his current set of active arms, one assumes that the probability that this arm is $\varepsilon$ -optimal is of order $\varepsilon^{\beta}$ , for some known $\beta>0$ . Thus the player faces a trade-off between exploitation, exploration, and discovery, where the last component comes from the fact that the player needs to consider new arms to make sure that he has an active $\varepsilon$ -optimal arm. Using a UCB strategy on the active arms, and by adding new arms at a rate which depends on $\beta$ , Wang et al. (2008) prove that a regret of order

Truthful bandits

A popular application domain for bandit algorithms is ad placement on the Web. In the pay-per-click model, for each incoming user $t=1,2,\dots$ the publisher selects an advertiser $I_{t}$ from a pool of $K$ advertisers, and display the corresponding ad to the user. The publisher then gets a reward if the ad is clicked by the user. This problem is well modeled by the multi-armed bandit setting. However, there is a fundamental aspect of the ad placement process which is overlooked by this formulation. Indeed, prior to running an ad-selection algorithm (i.e., a bandit algorithm), each advertiser $i\in\{1,\ldots,K\}$ issues a bet $b_{i}$ . This number is how much $i$ is willing to pay for a click. Each bidder keeps also a private value $v_{i}$ , which is the true value $i$ assigns to a click. Because a rational bidder ensures that $b_{i}\leq v_{i}$ , the difference $v_{i}-b_{i}$ defines the utility for bidder $i$ . The basic idea of truthful bandits is to construct a bandit algorithm such that each advertiser has no incentive in submitting a bet $b_{i}$ such that $b_{i}<v_{i}$ . A natural question to ask is whether this restriction to truthful algorithms changes the dynamics of the multi-armed bandit problem. This has investigated in a number of papers, including Babaioff et al. (2009); Devanur and Kakade (2009); Babaioff et al. (2010); Wilkens and Sivan (2012). Thruthful bandits are part of a more general thread of research at the interface between bandits and Mechanism Design.

Concluding remarks

As pointed out in the introduction, the growing interest for bandits arises from the large number of industrially relevant problems that can be modeled as a multi-armed bandit. In particular, the sequential nature of the bandit setting makes it perfectly suited to various Internet and Web applications. These include search engine optimization with dueling bandits, or ad placement with contextual bandits and truthful bandits, see the references in, respectively, Section 26, Section 13 and Section 29.

Multi-armed bandits also proved to be very useful in other areas. For example, thanks to the strong connections between bandits and Markov Decision Processes, a breakthrough in Monte Carlo Tree Search (MCTS) was achieved using bandits ideas. More precisely, based on the sparse planning idea of Kearns et al. (2002), Kocsis and Szepesvári (2006) introduced a new MCTS strategy called UCT (UCB applied to Trees) that led to a substantial advancement in Computer Go performance, see Gelly et al. (2006). Note that, from a theoretical point of view UCT was proved to perform poorly by Coquelin and Munos (2007), and a strategy based on a similar idea, but with improved theoretical performance, was proposed by Bubeck and Munos (2010). Other applications in related directions have also been explored, see for example Teytaud and Teytaud (2009); Hoock and Teytaud (2010) and many others.

Many new domains of application for bandits problems are currently investigated. For example: multichannel opportunistic communications Liu et al. (2010), model selection Agarwal et al. (2011a), boosting Busa-Fekete and Kegl (2011), management of dark pools of liquidity (a recent type of stock exchange) Agarwal et al. (2010a), security analysis of power systems Bubeck et al. (2011a).

Given the fast pace of new variants, extensions, and applications coming out every week, we had to make tough decisions about what to present in this survey. We apologize for everything we had to leave out. On the other hand, we do hope that what we decided to put in will enthuse more researchers about entering this exciting field.

Chapter 1 Introduction

Chapter 2 Stochastic bandits: fundamental results

Upper Confidence Bound (UCB) strategies

Lower bound

Refinements and bibliographic remarks

2 Second order bounds

3 Distribution-free bounds

4 High probability bounds

5 ε𝜀\varepsilon-greedy

6 Thompson sampling

7 Heavy-tailed distributions

Chapter 3 Adversarial bandits: fundamental results

Second step: Study of the first term in (23).

Third step: Study of the second term in (23).

Fourth step: Summing.

High probability and expected regret bounds

Second step: Study of the first term in (30).

Third step: Summing.

Lower Bound

Second step: Pinsker’s inequality.

Fourth step: conclusion for deterministic forecasters.

Fifth step: randomized forecasters via Fubini’s Theorem.

Refinements and bibliographic remarks

2 Adaptive bounds

3 Competing with the best switching strategy

4 Stochastic versus adversarial bandits

5 Alternative feedback structures

Chapter 4 Contextual bandits

The expert case

Stochastic contextual bandits

The multiclass case

Bibliographic remarks

Chapter 5 Linear bandits

Online Mirror Descent (OMD)

Online Stochastic Mirror Descent (OSMD)

Online combinatorial optimization

Improved regret bounds for bandit feedback

Refinements and bibliographic remarks

2 High probability bounds

3 Stochastic online linear optimization

Chapter 6 Nonlinear bandits

One-point bandit feedback

Nonlinear stochastic bandits

Bibliographic remarks

Chapter 7 Variants

Pure exploration problems

Dueling bandits

Discovery with probabilistic expert advice

Many-armed bandits

Truthful bandits

Concluding remarks

References