Introduction to Online Convex Optimization

Elad Hazan

Preface

This book serves as an introduction to the expanding theory of online convex optimization (OCO). It was written as an advanced textbook to serve as a basis for a graduate course, and/or as a reference to researchers diving into this fascinating world at the intersection of optimization and machine learning.

Such a course was given at the Technion in 2010–2014, with slight variations from year to year, and later at Princeton University in 2015–2020. The core material in these courses is fully covered in this book, along with exercises that allow students to complete parts of proofs, or that were found illuminating and thought-provoking by those taking the course. Most of the material is given with examples of applications, which are interlaced throughout various topics. These include prediction from expert advice, portfolio selection, matrix completion and recommendation systems, and support vector machine training.

Our hope is that this compendium of material and exercises will be useful to you; the researcher and/or educator.

The broad field of machine learning, as in the sub-disciplines of online learning, boosting, regret minimization in games, universal prediction, and other related topics, have seen a plethora of introductory books in recent years. With this note, we can hardly do justice to all of these, but perhaps point to the most related books on the topics of machine learning, learning in games, and optimization, whose intersection is our main focus.

The most closely related book, which served as an inspiration to the current, and indeed an inspiration to the entire field of learning in games, is the wonderful text of Cesa-Bianchi and Lugosi (2006). From the literature on mathematical optimization theory, there are the numerous introductory essays to convex optimization and convex analysis, to name only a few (Boyd and Vandenberghe, 2004; Nesterov, 2004; Nemirovski and Yudin, 1983; Nemirovskii, 2004; Borwein and Lewis, 2006; Rockafellar, 1997). The author fondly recommends the text from which he has learned about mathematical optimization theory (Nemirovskii, 2004). The more broad texts on machine learning are too numerous to state here.

The primary purpose of this is to serve as an educational textbook for a dedicated course on OCO and the convex optimization approach to machine learning. Online convex optimization has already had enough impact to appear in several surveys and introductory texts (Hazan, 2011; Shalev-Shwartz, 2011; Rakhlin, 2009; Rakhlin and Sridharan, 2014). We hope this compilation of material and exercises will further enrich the literature.

Book’s tructure

This book is intended to serve as a reference for a self-contained course for graduate students in computer science/electrical engineering/operations research/statistics and related fields. As such, its organization follows the structure of the course “Decision Analysis”, taught at the Technion, and later “Theoretical Machine Learning”, taught at Princeton University.

Each chapter should take one or two weeks of classes, depending on the depth and breadth of the intended course. Chapter 1 is designed to be a teaser for the field, and it is less rigorous than the rest of the book.

Roughly speaking, the book can be conceived as three units. The first, from chapter 2 through 4, contains the basic definitions, framework and core algorithms for OCO. Chapters 5 to 7 contain more advanced algorithms and in-depth analysis of the framework and its extensions to other computational and information access models. The rest of the book deals with more advanced algorithms, more difficult settings, and relationships to well-known machine learning paradigms.

This book can assist educators in designing a complete course on the topic of online convex optimization, or it can serve as a component in a comprehensive course on machine learning. A accompanying manual of solutions to selected exercises given in the book is available for educators only.

The main additions to the second edition of this book include the following:

Expanded coverage of optimization in chapter 2, with a unified gradient descent analysis of the Polyak stepsize.

Expanded coverage of learning theory in chapter 9, with an introduction to compression and its use in generalization theory.

An expanded chapter 4, with addition of the exponential weighted optimizer for exp-concave loss functions.

A revised chapter 5, with the addition of mirror descent analysis, as well as a revised section on adaptive gradient methods.

New chapter 10 on the notion of adaptive regret and algorithms for OCO with near-optimal adaptive regret bounds.

New chapter 11 on boosting and its relationship to OCO. Derivation of boosting algorithms from regret minimization.

New chapter 13 on Blackwell approachability and its strong connection to OCO.

In addition, numerous typos are fixed, exercises are corrected, and solutions to several questions have been made available in a separate manual for educators.

Acknowledgements

First of all, I gratefully acknowledge the numerous contributions and insight of the students of the course “decision analysis” given at the Technion during 2010-2014, as well as the students of “theoretical machine learning” taught at Princeton University during 2015-2016.

I would like to thank the friends, colleagues and students that have contributed many suggestions and corrections. A partial list includes: Sanjeev Arora, Shai Shalev-Shwartz, Aleksander Madry, Yoram Singer, Satyen Kale, Alon Gonen, Roi Livni, Gal Lavee, Maayan Harel, Daniel Khasabi, Shuang Liu, Jason Altschuler, Haipeng Luo, Zeyuan Allen-Zhu, Mehrdad Mahdavi, Jaehyun Park, Baris Ungun, Maria Gregori, Tengyu Ma, Kayla McCue, Esther Rolf, Jeremy Cohen, Daniel Suo, Lydia Liu, Fermi Ma, Mert Al, Amir Reza Asadi, Carl Gabel, Nati Srebro, Abbas Mehrabian, Chris Liaw, Nikhil Bansal, Naman Agarwal, Raunak Kumar, Zhize Li, Sheng Zhang, Swati Gupta, Xinyi Chen, Liang Zeng and Kunal Mittal.

I thank Udi Aharoni for his artwork and illustrations depicting algorithms in this book.

I am forever indebted to my teacher and mentor, Sanjeev Arora, without him this book would not be possible.

Finally, I am grateful for the love and support of my wife and children: Dana, Hadar, Yoav, and Oded.

The Second Version

I am most thankful to my students and colleagues that have collaborated with me on research, some of which appears in the second version of this book. Notably my collaborators on boosting methods including Nataly Brukhim, Xinyi Chen, Shay Moran, Naman Agarwal and Karan Singh.

Thanks to my students that helped me proof check new and existing sections: Edgar Minasyan, Paula Gradu, Karan Singh, Nataly Brukhim, Xinyi Chen, Naman Agarwal and Udaya Ghai.

I am thankful to Shay Moran for explaining compression schemes and how they simplify generalization for boosting.

I gratefully acknowledges the help of Ahmed Farah, Charlie Cowen-Breen and the students of “COS 597C: Computational Control Theory” for many helpful suggestions, corrections and solutions to many of the problem sets.

I am very thankful to Wouter Koolen-Wijkstra for a helpful suggestion in the analysis of the Online Newton Step algorithm.

Thanks to Shiyun Lin for finding typos and improving the presentation of the randomized regularization section.

I thank the extremely helpful and rigorous reviewers of this book found by MIT press who gave fantastic suggestions and improve the final manuscript.

As in the first version and even more so, I am grateful for the love and support of my wife and children: Dana, Hadar, Yoav, and Oded.

List of Symbols

Geometry and Calculus

Learning Theory

Optimization

Regularization

Chapter 1 Introduction

This book considers optimization as a process. In many practical applications, the environment is so complex that it is not feasible to lay out a comprehensive theoretical model and use classical algorithmic theory and mathematical optimization. It is necessary, as well as beneficial, to take a robust approach, by applying an optimization method that learns as more aspects of the problem are observed. This view of optimization as a process has become prominent in various fields, which has led to spectacular successes in modeling and systems that are now part of our daily lives.

The growing body of literature of machine learning, statistics, decision science, and mathematical optimization blurs the classical distinctions between deterministic modeling, stochastic modeling, and optimization methodology. We continue this trend in this book, studying a prominent optimization framework whose precise location in the mathematical sciences is unclear: the framework of online convex optimization (OCO), which was first defined in the machine learning literature (see section 1.4, later in this chapter). The metric of success is borrowed from game theory, and the framework is closely tied to statistical learning theory and convex optimization.

We embrace these fruitful connections and, on purpose, do not try to use any particular jargon in the discussion. Rather, this book will start with actual problems that can be modeled and solved via OCO. We will proceed to present rigorous definitions, backgrounds, and algorithms. Throughout, we provide connections to the literature in other fields. It is our hope that you, the reader, will contribute to our understanding of these connections from your domain of expertise, and expand the growing amount of literature on this fascinating subject.

In OCO, an online player iteratively makes decisions. At the time of each decision, the outcome or outcomes associated with it are unknown to the player.

After committing to a decision, the decision maker suffers a loss: every possible decision incurs a (possibly different) loss. These losses are unknown to the decision maker beforehand. The losses can be adversarially chosen, and even depend on the action taken by the decision maker.

Already at this point, several restrictions are necessary in order for this framework to make any sense at all:

The losses determined by an adversary should not be allowed to be unbounded 1Alternatively and equivalently, any performance metric for this setting should depend on the magnitude of the largest loss. This is the viewpoint taken later in the rest of this book. . Otherwise, the adversary could keep decreasing the scale of the loss at each step, and never allow the algorithm to recover from the loss of the first step. Thus, we assume that the losses lie in some bounded region.

The decision set must be somehow bounded and/or structured, though not necessarily finite.

To see why this is necessary, consider decision making with an infinite set of possible decisions. An adversary can assign high loss to all the strategies chosen by the player indefinitely, while setting apart some strategies with zero loss. This precludes any meaningful performance metric.

The OCO framework can be seen as a structured repeated game. The protocol of this learning framework is as follows.

What would make an algorithm a good OCO algorithm? As the framework is game-theoretic and adversarial in nature, the appropriate performance metric also comes from game theory: define the regret of the decision maker to be the difference between the total cost she has incurred and that of the best fixed decision in hindsight. In OCO, we are usually interested in an upper bound on the worst-case regret of an algorithm.

Let A{\mathcal{A}} be an algorithm for OCO, which maps a certain game history to a decision in the decision set:

We formally define the regret of A{\mathcal{A}} after TT iterations as:

The running time of an algorithm for OCO is defined to be the worst-case expected time to produce xt\mathbf{x}_{t}, for an iteration t[T]t\in[T]2Here and henceforth, we denote as [n][n] the set of integers {1,...,n}\{1,...,n\}. in a TT-iteration repeated game. Typically, the running time will depend on nn (the dimensionality of the decision set K\mathcal{K}), TT (the total number of game iterations), and the parameters of the cost functions and underlying convex set.

2 Examples of Problems That Can Be Modeled via Online Convex Optimization

Perhaps the main reason that OCO has become a leading online learning framework in recent years is its powerful modeling capability: problems from diverse domains such as online routing, ad selection for search engines, and spam filtering can all be modeled as special cases. In this section, we briefly survey a few special cases and how they fit into the OCO framework.

Perhaps the most well known problem in prediction theory is the experts problem. The decision maker has to choose among the advice of nn given experts. After making her choice, a loss between zero and 11 is incurred. This scenario is repeated iteratively, and at each iteration, the costs of the various experts are arbitrary (and possibly even adversarial, trying to mislead the decision maker). The goal of the decision maker is to do as well as the best expert in hindsight.

The fundamental importance of the experts problem in machine learning warrants special attention, and we shall return to it and analyze it in detail at the end of this chapter.

2.2 Online spam filtering

Consider an online spam-filtering system. Repeatedly, emails arrive in the system and are classified as spam or valid. Obviously, such a system has to cope with adversarially generated data and dynamically change with the varying input—a hallmark of the OCO model.

At this point the reader may wonder - why use a square loss rather than any other function? The most natural choice being perhaps a loss of one if b=b^b=\hat{b} and zero otherwise.

To answer this, notice first that if both bb and b^\hat{b} are binary and in {1,1}\{-1,1\}, then the square loss is indeed one or zero. However, moving to a continuous function allows us much more flexibility in the decision making process. We can allow, for example, the algorithm to return a number in the interval $$ depending on its confidence.

Another reason has to do with the algorithmic efficiency of finding a a good solution. This will be the subject of future chapters.

2.3 Online shortest paths

The discrete description of this problem as an experts problem, where we have an expert for each path, presents an efficiency challenge. There are potentially exponentially many paths in terms of the graph representation size.

2.4 Portfolio selection

In this section we consider a portfolio selection model that does not make any statistical assumptions about the stock market (as opposed to the standard geometric Brownian motion model for stock prices), and is called the “universal portfolio selection” model.

The goal of regret minimization, which in this case corresponds to minimizing the difference maxxΔnt=1Tlog(rtx)t=1Tlog(rtxt)\max_{\mathbf{x}^{\star}\in\Delta_{n}}{\textstyle\sum}_{t=1}^{T}\log(\mathbf{r}_{t}^{\top}\mathbf{x}^{\star})-{\textstyle\sum}_{t=1}^{T}\log(\mathbf{r}_{t}^{\top}\mathbf{x}_{t}), has an intuitive interpretation. The first term is the logarithm of the wealth accumulated by the best possible in-hindsight distribution x\mathbf{x}^{\star}. Since this distribution is fixed, it corresponds to a strategy of rebalancing the position after every trading period, and hence, is called a constant rebalanced portfolio. The second term is the logarithm of the wealth accumulated by the online decision maker. Hence regret minimization corresponds to maximizing the ratio of the investor’s wealth to the wealth of the best benchmark from a pool of investing strategies.

A universal portfolio selection algorithm is defined to be one that, in this setting, attains regret converging to zero. Such an algorithm, albeit requiring exponential time, was first described by Cover (see bibliographic notes at the end of this chapter). The online convex optimization framework has given rise to much more efficient algorithms based on Newton’s method. We shall return to study these in detail in chapter 4.

2.5 Matrix completion and recommendation systems

The prevalence of large-scale media delivery systems such as the Netflix online video library, Spotify music service and many others, give rise to very large scale recommendation systems. One of the most popular and successful models for automated recommendation is the matrix completion model.

In this mathematical model, recommendations are thought of as composing a matrix. The customers are represented by the rows, the different media are the columns, and at the entry corresponding to a particular user/media pair we have a value scoring the preference of the user for that particular media.

For example, for the case of binary recommendations for music, we have a matrix X{0,1}n×mX\in\{0,1\}^{n\times m} where nn is the number of persons considered, mm is the number of songs in our library, and 0/10/1 signifies dislike/like respectively:

In the online setting, for each iteration the decision maker outputs a preference matrix XtKX_{t}\in\mathcal{K}, where K{0,1}n×m\mathcal{K}\subseteq\{0,1\}^{n\times m} is a subset of all possible zero/one matrices. An adversary then chooses a user/song pair (it,jt)(i_{t},j_{t}) along with a “real” preference for this pair yt{0,1}y_{t}\in\{0,1\}. Thus, the loss experienced by the decision maker can be described by the convex loss function,

The natural comparator in this scenario is a low-rank matrix, which corresponds to the intuitive assumption that preference is determined by few unknown factors. Regret with respect to this comparator means performing, on the average, as few preference-prediction errors as the best low-rank matrix.

We return to this problem and explore efficient algorithms for it in chapter 7.

3 A Gentle Start: Learning from Expert Advice

Consider the following fundamental iterative decision making problem:

At each time step t=1,2,,Tt=1,2,\ldots,T, the decision maker faces a choice between two actions AA or BB (i.e., buy or sell a certain stock). The decision maker has assistance in the form of NN “experts” that offer their advice. After a choice between the two actions has been made, the decision maker receives feedback in the form of a loss associated with each decision. For simplicity one of the actions receives a loss of zero (i.e., the “correct” decision) and the other a loss of one.

We make the following elementary observations:

A decision maker that chooses an action uniformly at random each iteration, trivially attains a loss of T2\frac{T}{2} and is “correct” 50%50\% of the time.

In terms of the number of mistakes, no algorithm can do better in the worst case! In a later exercise, we will devise a randomized setting in which the expected number of mistakes of any algorithm is at least T2\frac{T}{2}.

We are thus motivated to consider a relative performance metric: can the decision maker make as few mistakes as the best expert in hindsight? The next theorem shows that the answer in the worst case is negative for a deterministic decision maker.

Let LT2L\leq\frac{T}{2} denote the number of mistakes made by the best expert in hindsight. Then there does not exist a deterministic algorithm that can guarantee less than 2L2L mistakes.

Assume that there are only two experts and one always chooses option AA while the other always chooses option BB. Consider the setting in which an adversary always chooses the opposite of our prediction (she can do so, since our algorithm is deterministic). Then, the total number of mistakes the algorithm makes is TT. However, the best expert makes no more than T2\frac{T}{2} mistakes (at every iteration exactly one of the two experts is mistaken). Therefore, there is no algorithm that can always guarantee less than 2L2L mistakes.

This observation motivates the design of random decision making algorithms, and indeed, the OCO framework gracefully models decisions on a continuous probability space. Henceforth we prove Lemmas 1.3 and 1.4 that show the following:

Let ε(0,12)\varepsilon\in(0,\frac{1}{2}). Suppose the best expert makes LL mistakes. Then:

There is an efficient deterministic algorithm that can guarantee less than 2(1+ε)L+2logNε2(1+\varepsilon)L+\frac{2\log N}{\varepsilon} mistakes;

There is an efficient randomized algorithm for which the expected number of mistakes is at most (1+ε)L+logNε(1+\varepsilon)L+\frac{\log N}{\varepsilon}.

The weighted majority (WM) algorithm is intuitive to describe: each expert ii is assigned a weight Wt(i)W_{t}(i) at every iteration tt. Initially, we set W1(i)=1W_{1}(i)=1 for all experts i[N]i\in[N]. For all t[T]t\in[T] let St(A),St(B)[N]S_{t}(A),S_{t}(B)\subseteq[N] be the set of experts that choose AA (and respectively BB) at time tt. Define,

Next, update the weights Wt(i)W_{t}(i) as follows:

where ε\varepsilon is a parameter of the algorithm that will affect its performance. This concludes the description of the WM algorithm. We proceed to bound the number of mistakes it makes.

Denote by MtM_{t} the number of mistakes the algorithm makes until time tt, and by Mt(i)M_{t}(i) the number of mistakes made by expert ii until time tt. Then, for any expert i[N]i\in[N] we have

We can optimize ε\varepsilon to minimize the above bound. The expression on the right hand side is of the form f(x)=ax+b/xf(x)=ax+b/x, that reaches its minimum at x=b/ax=\sqrt{b/a}. Therefore the bound is minimized at ε=logN/MT(i)\varepsilon^{\star}=\sqrt{\log N/M_{T}(i)}. Using this optimal value of ε\varepsilon, we get that for the best expert ii^{\star}

Of course, this value of ε\varepsilon^{\star} cannot be used in advance since we do not know which expert is the best one ahead of time (and therefore we do not know the value of MT(i)M_{T}(i^{\star})). However, we shall see later on that the same asymptotic bound can be obtained even without this prior knowledge.

Let Φt=i=1NWt(i)\Phi_{t}=\sum_{i=1}^{N}W_{t}(i) for all t[T]t\in[T], and note that Φ1=N\Phi_{1}=N.

Notice that Φt+1Φt\Phi_{t+1}\leq\Phi_{t}. However, on iterations in which the WM algorithm erred, we have

the reason being that experts with at least half of total weight were wrong (else WM would not have erred), and therefore

On the other hand, by definition we have for any expert ii that

Since the value of WT(i)W_{T}(i) is always less than the sum of all weights ΦT\Phi_{T}, we conclude that

Taking the logarithm of both sides we get

which follow from the Taylor series of the logarithm function, to obtain that

3.2 Randomized weighted majority

In the randomized version of the WM algorithm, denoted RWM, we choose expert ii w.p. pt(i)=Wt(i)/j=1NWt(j)p_{t}(i)=W_{t}(i)/\sum_{j=1}^{N}W_{t}(j) at time tt.

Let MtM_{t} denote the number of mistakes made by RWM until iteration tt. Then, for any expert i[N]i\in[N] we have

The proof of this lemma is very similar to the previous one, where the factor of two is saved by the use of randomness:

On the other hand, by definition we have for any expert ii that

Since the value of WT(i)W_{T}(i) is always less than the sum of all weights ΦT\Phi_{T}, we conclude that

Taking the logarithm of both sides we get

3.3 Hedge

Historically, this was observed by a different and closely related algorithm called Hedge, whose total loss bound will be of interest to us later on in the book.

Henceforth, denote in vector notation the expected loss of the algorithm by

As before, let Φt=i=1NWt(i)\Phi_{t}=\sum_{i=1}^{N}W_{t}(i) for all t[T]t\in[T], and note that Φ1=N\Phi_{1}=N.

On the other hand, by definition, for expert ii^{\star} we have that

Since the value of WT(i)W_{T}(i^{\star}) is always less than the sum of all weights Φt\Phi_{t}, we conclude that

Taking the logarithm of both sides we get

and the theorem follows by simplifying. ∎

4 Bibliographic Remarks

The OCO model was first defined by Zinkevich (2003) and has since become widely influential in the learning community and significantly extended since (see thesis and surveys (Hazan, 2006, 2011; Shalev-Shwartz, 2011)).

The problem of prediction from expert advice and the Weighted Majority algorithm were devised in (Littlestone and Warmuth, 1989, 1994). This seminal work was one of the first uses of the multiplicative updates method—a ubiquitous meta-algorithm in computation and learning, see the survey (Arora et al., 2012) for more details. The Hedge algorithm was introduced by Freund and Schapire (1997).

The Universal Portfolios model was put forth in (Cover, 1991), and is one of the first examples of a worst-case online learning model. Cover gave an optimal-regret algorithm for universal portfolio selection that runs in exponential time. A polynomial time algorithm was given in (Kalai and Vempala, 2003), which was further sped up in (Agarwal et al., 2006; Hazan et al., 2007). Numerous extensions to the model also appeared in the literature, including addition of transaction costs (Blum and Kalai, 1999) and relation to the Geometric Brownian Motion model for stock prices (Hazan and Kale, 2009).

In their influential paper, Awerbuch and Kleinberg (2008) put forth the application of online convex optimization to online routing. A great deal of work has been devoted since then to improve the initial bounds, and generalize it into a complete framework for decision making with limited feedback. This framework is an extension of OCO, called Bandit Convex Optimization (BCO). We defer further bibliographic remarks to chapter 6 which is devoted to the BCO framework.

5 Exercises

1. (Attributed to Claude Shannon) Construct market returns over two stocks for which the wealth accumulated over any single stock decreases exponentially, whereas the best constant rebalanced portfolio increases wealth exponentially. More precisely, construct two sequences of numbers in the range (0,)(0,\infty), that represent returns, such that:

(a) Investing in any of the individual stocks results in exponential decrease in wealth. This means that the product of the prefix of numbers in each of these sequences decreases exponentially.

(b) Investing evenly on the two assets and rebalancing after every iteration increases wealth exponentially.

(a) Consider the experts problem in which the losses are between zero and a positive real number G>0G>0. Give an algorithm that attains expected loss upper bounded by:

for the best constant cc you can (the constant cc should be independent of the number of game iterations TT, and the number of experts NN. Assume that TT is known in advance).

(b) Suppose the upper bound GG is not known in advance. Give an algorithm whose performance is asymptotically as good as your algorithm in part (a), up to an additive and/or multiplicative constant which is independent of T,N,GT,N,G. Prove your claim.

3. Consider the experts problem in which the losses can be negative and are real numbers in the range $.Giveanalgorithmwithregretguaranteeof. Give an algorithm with regret guarantee ofO(\sqrt{T\log N})$ and prove your claim.

Chapter 2 Basic Concepts in Convex Optimization

In this chapter we give a gentle introduction to convex optimization and present some basic algorithms for solving convex mathematical programs. Although offline convex optimization is not our main topic, it is useful to recall the basic definitions and results before we move on to OCO. This will help in assessing the advantages and limitations of OCO. Furthermore, we describe some tools that will be our bread-and-butter later on.

The material in this chapter is far from being new. A broad and significantly more detailed literature exists, and the reader is deferred to the bibliography at the end of this chapter for references. We give here only the most elementary analysis, and focus on the techniques that will be of use to us later on.

A set K\mathcal{K} is convex if for any x,yK\mathbf{x},\mathbf{y}\in\mathcal{K}, all the points on the line segment connecting x\mathbf{x} and y\mathbf{y} also belong to K\mathcal{K}, i.e.,

This inequality, and generalizations thereof, is also known as Jensen’s inequality. Equivalently, if ff is differentiable, that is, its gradient f(x)\nabla f(\mathbf{x}) exists for all xK\mathbf{x}\in\mathcal{K}, then it is convex if and only if x,yK\forall\mathbf{x},\mathbf{y}\in\mathcal{K}

For convex and non-differentiable functions ff, the subgradient at x\mathbf{x} is defined to be any member of the set of vectors {f(x)}\{\nabla f(\mathbf{x})\} that satisfies the above for all yK\mathbf{y}\in\mathcal{K}.

We denote by G>0G>0 an upper bound on the norm of the subgradients of ff over K\mathcal{K}, i.e., f(x)G\|\nabla f(\mathbf{x})\|\leq G for all xK\mathbf{x}\in\mathcal{K}. Such an upper bound implies that the function ff is Lipschitz continuous with parameter GG, that is, for all x,yK\mathbf{x},\mathbf{y}\in\mathcal{K}

The optimization and machine learning literature studies special types of convex functions that admit useful properties, which in turn allow for more efficient optimization. Notably, we say that a function is α\alpha-strongly convex if

The latter condition is equivalent to a Lipschitz condition over the gradients, i.e.,

If the function is twice differentiable and admits a second derivative, known as a Hessian for a function of several variables, the above conditions are equivalent to the following condition on the Hessian, denoted 2f(x)\nabla^{2}f(\mathbf{x}):

where ABA\preccurlyeq B if the matrix BAB-A is positive semidefinite.

When the function ff is both α\alpha-strongly convex and β\beta-smooth, we say that it is γ\gamma-well-conditioned where γ\gamma is the ratio between strong convexity and smoothness, also called the condition number of ff

In the following algorithms we shall make use of a projection operation onto a convex set, which is defined as the closest point in terms of Euclidean distance 3 We will discuss projections with respect to other distance notions in chapter 5. inside the convex set to a given point. Formally,

The computational complexity of projections is a subtle issue that depends much on the characterization of K\mathcal{K} itself. Most generally, K\mathcal{K} can be represented by a membership oracle—an efficient procedure that is capable of deciding whether a given x\mathbf{x} belongs to K\mathcal{K} or not. In this case, projections can be computed in polynomial time. In certain special cases, projections can be computed very efficiently in near-linear time. The computational cost of projections, as well as optimization algorithms that avoid them altogether, is discussed in chapter 7.

A crucial property of projections that we shall make extensive use of is the Pythagorean theorem, which we state here for completeness:

We note that there exists a more general version of the Pythagorean theorem. The above theorem and the definition of projections are true and valid not only for Euclidean norms, but for projections according to other distances that are not norms. In particular, an analogue of the Pythagorean theorem remains valid with respect to Bregman divergences (see chapter 5).

1.2 Introduction to optimality conditions

The standard curriculum of high school mathematics contains the basic facts concerning when a function (usually in one dimension) attains a local optimum or saddle point. The generalization of these conditions to more than one dimension is called the KKT (Karush-Kuhn-Tucker) conditions, and the reader is referred to the bibliographic material at the end of this chapter for an in-depth rigorous discussion of optimality conditions in general mathematical programming.

We will require a slightly more general, but equally intuitive, fact for constrained optimization: at a minimum point of a constrained convex function, the inner product between the negative gradient and direction towards the interior of K\mathcal{K} is non-positive. This is depicted in figure 2.2, which shows that f(x)-\nabla f(\mathbf{x}^{\star}) defines a supporting hyperplane to K\mathcal{K}. The intuition is that if the inner product were positive, one could improve the objective by moving in the direction of the projected negative gradient. This fact is stated formally in the following theorem.

2 Gradient Descent

Gradient descent (GD) is the simplest and oldest of optimization methods. It is an iterative method—the optimization procedure proceeds in iterations, each improving the objective value. The basic method amounts to iteratively moving the current point in the direction of the gradient, which is a linear time operation if the gradient is given explicitly (indeed, for many functions computing the gradient at a certain point is a simple linear-time operation).

The basic template algorithm, for unconstrained optimization, is given in 2, and a depiction of the iterates it produced in figure 2.3.

For a convex function there always exists a choice of step sizes that will cause GD to converge to the optimal solution. The rates of convergence, however, differ greatly and depend on the smoothness and strong convexity properties of the objective function. The following table summarises the convergence rates of GD variants for convex functions with different convexity parameters. The rates described omit the (usually small) constants in the bounds—we focus on asymptotic rates.

In this section we address only the first row of Table 2.1. For accelerated methods and their analysis see references at the bibliographic section.

Luckily, there exists a simple choice of step sizes that yields the optimal convergence rate, called the Polyak stepsize. It has a huge advantage of not depending on the strong convexity and/or smoothness parameters of the objective function.

However, it does depend on the distance in function value to optimality and gradient norm. While the latter can be efficiently estimated, the distance to optimality is not always available if f(x)f(\mathbf{x}^{*}) is not known ahead of time. This can be remedied, as referred to in the bibliography.

Distance to optimality in value: ht=h(xt)=f(xt)f(x)h_{t}=h(\mathbf{x}_{t})=f(\mathbf{x}_{t})-f(\mathbf{x}^{*})

Euclidean distance to optimality: dt=xtxd_{t}=\|\mathbf{x}_{t}-\mathbf{x}^{*}\|

Current gradient norm t=f(xt)\|\nabla_{t}\|=\|\nabla f(\mathbf{x}_{t})\|

With these notations we can describe the algorithm precisely in Algorithm 3:

To prove precise convergence bounds, assume tG\|\nabla_{t}\|\leq G, and define:

We can now state the main guarantee of GD with the Polyak stepsize:

(GD with the Polyak Step Size) Algorithm 3 guarantees the following after TT steps:

2.2 Measuring distance to optimality

When analyzing convergence of gradient methods, it is useful to use potential functions in lieu of function distance to optimality, such as gradient norm and/or Euclidean distance. The following relationships hold between these quantities.

12βt2ht\frac{1}{2\beta}\|\nabla_{t}\|^{2}\leq h_{t}

ht12αt2h_{t}\leq\frac{1}{2\alpha}\|\nabla_{t}\|^{2}

where the last inequality follows since the gradient at the global optimum is zero.

where the last inequality follows since the gradient at the global optimum is zero.

ht12βt2h_{t}\geq\frac{1}{2\beta}\|\nabla_{t}\|^{2}: Using smoothness, and let xt+1=xtηt\mathbf{x}_{t+1}=\mathbf{x}_{t}-\eta\nabla_{t} for η=1β\eta=\frac{1}{\beta},

ht12αt2h_{t}\leq\frac{1}{2\alpha}\|\nabla_{t}\|^{2}:

In particular, taking x=xt , y=x\mathbf{x}=\mathbf{x}_{t}\ ,\ \mathbf{y}=\mathbf{x}^{\star}, we get

2.3 Analysis of the Polyak stepsize

We are now ready to prove Theorem 2.3, which directly follows from the following lemma.

Suppose that a sequence x0,xt\mathbf{x}_{0},\ldots\mathbf{x}_{t} satisfies:

then for xˉ\bar{\mathbf{x}} as defined in the algorithm, we have:

For convex functions with gradient bounded by GG,

Summing up over TT iterations, and using Cauchy-Schwartz on the TT-dimensional vectors of 1T1\frac{1}{T}\mathbf{1} and (h1,...,hT)(h_{1},...,h_{T}), we have

For smooth functions whose gradient is bounded by GG, Lemma 2.4 implies:

For strongly convex functions, Lemma 2.4 implies:

In other words, dt+12dt2(1α2dt24G2).d_{t+1}^{2}\leq d_{t}^{2}(1-\frac{\alpha^{2}d_{t}^{2}}{4G^{2}})\,. Defining at:=α2dt24G2a_{t}:=\frac{\alpha^{2}d_{t}^{2}}{4G^{2}}, we have:

This implies that at1t+1a_{t}\leq\frac{1}{t+1}, which can be seen by induction5That a01a_{0}\leq 1 follows from Lemma 2.4. For t=1t=1, a112a_{1}\leq\frac{1}{2} since a1a0(1a0)a_{1}\leq a_{0}(1-a_{0}) and 0a010\leq a_{0}\leq 1. For the induction step, atat1(1at1)1t(11t)=t1t2=1t+1(t21t2)1t+1a_{t}\leq a_{t-1}(1-a_{t-1})\leq\frac{1}{t}(1-\frac{1}{t})=\frac{t-1}{t^{2}}=\frac{1}{t+1}(\frac{t^{2}-1}{t^{2}})\leq\frac{1}{t+1}.. The proof is completed as follows6This assumes TT is even. TT odd leads to the same constants. :

Thus, there exists a tt for which ht29G4α2T2h_{t}^{2}\leq\frac{9G^{4}}{\alpha^{2}T^{2}}. Taking the square root completes the claim.

For both strongly convex and smooth functions:

3 Constrained Gradient/Subgradient Descent

The vast majority of the problems considered in this text include constraints. Consider the examples given in section 1.2: a path is a point in the flow polytope, a portfolio is a point in the simplex and so on. In the language of optimization, we require x\mathbf{x} not only to minimize a certain objective function, but also to belong to a convex set K\mathcal{K}.

In this section we describe and analyze constrained gradient descent. Algorithmically, the change from the previous section is small: after updating the current point in the direction of the gradient, one may need to project back to the decision set. However, the analysis is somewhat more involved, and instructive for the later parts of this text.

Algorithmic box 4 describes a template for gradient descent over a constrained set. It is a template since the sequence of step sizes {ηt}\{\eta_{t}\} is left as an input parameter, and the several variants of the algorithm differ on its choice.

As opposed to the unconstrained setting, here we require a precise setting of the learning rate to obtain the optimal convergence rate.

For constrained minimization of γ\gamma-well-conditioned functions and ηt=1β\eta_{t}=\frac{1}{\beta}, Algorithm 4 converges as

By strong convexity we have for every x,xtK\mathbf{x},\mathbf{x}_{t}\in\mathcal{K} (where we denote t=f(xt)\nabla_{t}=\nabla f(\mathbf{x}_{t}) as before):

Next, appealing to the algorithm’s definition and the choice ηt=1β\eta_{t}=\frac{1}{\beta}, we have

The minimum can only grow if we take it over a subset of K\mathcal{K}. Thus we can restrict our attention to all points that are convex combination of xt\mathbf{x}_{t} and x\mathbf{x}^{\star}, which we denote by the interval [xt,x]={(1μ)xt+μx,μ}[\mathbf{x}_{t},\mathbf{x}^{\star}]=\{(1-\mu)\mathbf{x}_{t}+\mu\mathbf{x}^{\star},\mu\in\}, and write

Where the equality is by writing x\mathbf{x} as x=(1μ)xt+μx\mathbf{x}=(1-\mu)\mathbf{x}_{t}+\mu\mathbf{x}^{\star}. Using strong convexity, we have for any xt\mathbf{x}_{t} and the minimizer x\mathbf{x}^{\star}:

Thus, plugging this into equation (2.3.1), we get

This gives the theorem statement by induction. ∎

4 Reductions to Non-smooth and Non-strongly Convex Functions

The previous section dealt with γ\gamma-well-conditioned functions, which may seem like a significant restriction over vanilla convexity. Indeed, many interesting convex functions are not strongly convex nor smooth, and as we have seen, the convergence rate of gradient descent greatly differs for these functions. We have completed the picture for unconstrained optimization, and in this section we complete it for a bounded set.

The literature on first order methods is abundant with specialized analyses that explore the convergence rate of gradient descent for more general functions. In this manuscript we take a different approach: instead of analyzing variants of GD from scratch, we use reductions to derive near-optimal convergence rates for smooth functions that are not strongly convex, or strongly convex functions that are not smooth, or general convex functions without any further restrictions.

While attaining sub-optimal convergence bounds (by logarithmic factors), the advantage of this approach is two-fold: first, the reduction method is very simple to state and analyze, and its analysis is significantly shorter than analyzing GD from scratch. Second, the reduction method is generic, and thus extends to the analysis of accelerated gradient descent (or any other first order method) along the same lines. We turn to these reductions next.

Our first reduction applies the GD algorithm to functions that are β\beta-smooth but not strongly convex.

The idea is to add a controlled amount of strong convexity to the function ff, and then apply the algorithm 4 to optimize the new function. The solution is distorted by the added strong convexity, but a tradeoff guarantees a meaningful convergence rate.

where we ignore constants and terms depending on DD and h1gh_{1}^{g}. ∎

Stronger convergence rates of O(βt)O(\frac{\beta}{t}) can be obtained by analyzing GD from scratch, and these are known to be tight. Thus, our reduction is suboptimal by a factor of O(logT)O(\log T), which we tolerate for the reasons stated at the beginning of this section.

4.2 Reduction to strongly convex, non-smooth functions

Our reduction from non-smooth functions to γ\gamma-well-conditioned functions is similar in spirit to the one of the previous subsection. However, whereas for strong convexity the obtained rates were off by a factor of logT\log T, in this section we will also be off by factor of dd, the dimension of the decision variable x\mathbf{x}, as compared to the standard analyses in convex optimization. For tight bounds, the reader is referred to the excellent reference books and surveys listed in the bibliography section 2.6.

We apply the GD algorithm to a smoothed variant of the objective function. In contrast to the previous reduction, smoothing cannot be obtained by simple addition of a smooth (or any other) function. Instead, we need a smoothing operation. The one we describe is particularly simple and amounts to taking a local integral of the function. More sophisticated, but less general, smoothing operators exist that are based on the Moreau-Yoshida regularization, see bibliographic section for more details.

Let ff be GG-Lipschitz continuous and α\alpha-strongly convex. Define for any δ>0\delta>0,

f^δ\hat{f}_{\delta} has the following properties:

If ff is α\alpha-strongly convex, then so is f^δ{\hat{f}}_{\delta}

f^δ\hat{f}_{\delta} is dGδ\frac{dG}{\delta}-smooth

f^δ(x)f(x)δG|\hat{f}_{\delta}(\mathbf{x})-f(\mathbf{x})|\leq\delta G for all xK\mathbf{x}\in\mathcal{K} .

Before proving this lemma, let us first complete the reduction. Using Lemma 2.8 and the convergence for γ\gamma-well-conditioned functions the following approximation bound is obtained.

For δ=dGαlogtt\delta=\frac{dG}{\alpha}\frac{\log{t}}{t} Algorithm 6 converges as

Before proving this lemma, notice that the gradient descent method is applied with gradients of the smoothed function f^δ{\hat{f}}_{\delta} rather than gradients of the original objective ff. In this section we ignore the computational cost of computing such gradients given only access to gradients of ff, which may be significant. Techniques for estimating these gradients are further explored in chapter 6.

Note that by Lemma 2.8 the function f^δ{\hat{f}}_{\delta} is γ\gamma-well-conditioned for γ=αδdG.\gamma=\frac{\alpha\delta}{dG}.

We proceed to prove that f^δ{\hat{f}}_{\delta} is indeed a good approximation to the original function.

Recall that a function ff is β\beta-smooth if and only if for all x,yK\mathbf{x},\mathbf{y}\in\mathcal{K}, it holds that f(x)f(y)βxy\|\nabla f(\mathbf{x})-\nabla f(\mathbf{y})\|\leq\beta\|\mathbf{x}-\mathbf{y}\|. Now,

This proves the second property of Lemma 2.8. We proceed to show the third property, namely that f^δ{\hat{f}}_{\delta} is a good approximation to ff.

We note that GD variants for α\alpha-strongly convex functions, even without the smoothing approach used in our reduction, are known to converge quickly and without dependence on the dimension. We state the known algorithm and result here without proof (see bibliography for references).

Let ff be α\alpha-strongly convex, and let x1,...,xt\mathbf{x}_{1},...,\mathbf{x}_{t} be the iterates of Algorithm 4 applied to ff with ηt=2α(t+1)\eta_{t}=\frac{2}{\alpha(t+1)}. Then

4.3 Reduction to general convex functions

5 Example: Support Vector Machine Training

To illustrate the usefulness of the gradient descent method, let us describe an optimization problem that has gained much attention in machine learning and can be solved efficiently using the methods we have just analyzed.

A very basic and successful learning paradigm is the linear classification model. In this model, the learner is presented with positive and negative examples of a concept. Each example, denoted by ai\mathbf{a}_{i}, is represented in Euclidean space by a dd dimensional feature vector. For example, a common representation for emails in the spam-classification problem are binary vectors in Euclidean space, where the dimension of the space is the number of words in the language. The ii’th email is a vector ai\mathbf{a}_{i} whose entries are given as ones for coordinates corresponding to words that appear in the email, and zero otherwise7Such a representation may seem naïve at first as it completely ignores the words’ order of appearance and their context. Extensions to capture these features are indeed studied in the Natural Language Processing literature.. In addition, each example has a label bi{1,+1}b_{i}\in\{-1,+1\}, corresponding to whether the email has been labeled spam/not spam. The goal is to find a hyperplane separating the two classes of vectors: those with positive labels and those with negative labels. If such a hyperplane, which completely separates the training set according to the labels, does not exist, then the goal is to find a hyperplane that achieves a separation of the training set with the smallest number of mistakes.

where \mboxsign(x){1,+1}\mathop{\mbox{\rm sign}}(x)\in\{-1,+1\} is the sign function, and δ(z){0,1}\delta(z)\in\{0,1\} is the indicator function that takes the value 11 if the condition zz is satisfied and zero otherwise.

This optimization problem, which is at the heart of the linear classification formulation, is NP-hard, and in fact NP-hard to even approximate non-trivially 8see bibliography for references to these results.. However, in the special case that a linear classifier (a hyperplane x\mathbf{x}) that classifies all of the examples correctly exists, the problem is solvable in polynomial time via linear programming.

Various relaxations have been proposed to solve the more general case, when no perfect linear classifier exists. One of the most successful in practice is the Support Vector Machine (SVM) formulation.

The soft margin SVM relaxation replaces the 0/10/1 loss in (2.7) with a convex loss function, called the hinge-loss, given by

In figure 2.4 we depict how the hinge loss is a convex relaxation for the non-convex 0/10/1 loss.

Further, the SVM formulation adds to the loss minimization objective a term that regularizes the size of the elements in x\mathbf{x}. The reason and meaning of this additional term shall be addressed in later sections. For now, let us consider the SVM convex program:

This is an unconstrained non-smooth and strongly convex program. It follows from Theorems 2.3 and 2.10 that O(1ε){O}(\frac{1}{\varepsilon}) iterations suffice to attain an ε\varepsilon-approximate solution. We spell out the details of applying the subgradient descent algorithm to this formulation in Algorithm 7.

Notice that the learning rates are left unspecified, even though they can be explicitly set as in Theorem 2.10, or using the Polyak rate. The Polyak rate requires knowing the function value at optimality, although this can be relaxed (see bibliography).

A caveat of using gradient descent for SVM is the requirement to compute the full gradient, which may require a full pass over the data for each iteration. We will see a significantly more efficient algorithm in the next chapter!

6 Bibliographic Remarks

The reader is referred to dedicated books on convex optimization for much more in-depth treatment of the topics surveyed in this background chapter. For background in convex analysis see the texts (Borwein and Lewis, 2006; Rockafellar, 1997). The classic textbook of Boyd and Vandenberghe (2004) gives a broad introduction to convex optimization with numerous applications, see also (Boyd, 2014). For detailed rigorous convergence proofs and in depth analysis of first order methods, see lecture notes by Nesterov (2004) and books by Nemirovski and Yudin (1983); Nemirovskii (2004), as well as more recent lecture notes and texts (Bubeck, 2015; Hazan, 2019). Theorem 2.10 is taken from (Bubeck, 2015) Theorem 3.9.

The logarithmic overhead in the reductions of section 2.4 can be removed with a more careful reduction and analysis, for details see (Allen-Zhu and Hazan, 2016). A more sophisticated smoothing operator is the Moreau-Yoshida regularization: it avoids the dimension factor loss. However, it is sometimes less computationally efficient to work with (Parikh and Boyd, 2014).

The Polyak learning rate is detailed in (Polyak, 1987). A recent exposition allows obtaining the same optimal rate without knowledge of the optimal function value (Hazan and Kakade, 2019).

Using linear separators and halfspaces to learn and separate data was considered in the very early days of AI (Rosenblatt, 1958; Minsky and Papert, 1969). Notable the Perceptron algorithm was one of the first learning algorithms, and closely related to gradient descent. Support vector machines were introduced in (Cortes and Vapnik, 1995; Boser et al., 1992), see also the book of Schölkopf and Smola (2002).

Learning halfspaces with the zero-one loss is computationally hard, and hard to even approximate non-trivially (Daniely, 2016). Proving that a problem is hard to approximate is at the forefront of computational complexity, and based on novel characterizations of the complexity class NP (Arora and Barak, 2009).

7 Exercises

(a) f(y)f(x)+(yx)f(x)+α2xy2f(\mathbf{y})\geq f(\mathbf{x})+(\mathbf{y}-\mathbf{x})^{\top}\nabla{}f(\mathbf{x})+\frac{\alpha}{2}\|{\mathbf{x}-\mathbf{y}}\|^{2}

(b) f(y)f(x)+(yx)f(x)+β2xy2f(\mathbf{y})\leq f(\mathbf{x})+(\mathbf{y}-\mathbf{x})^{\top}\nabla{}f(\mathbf{x})+\frac{\beta}{2}\|{\mathbf{x}-\mathbf{y}}\|^{2} Prove that if ff is twice differentiable and it holds that βI2f(x)αI\beta\textbf{I}\succcurlyeq\nabla^{2}f(\mathbf{x})\succcurlyeq\alpha\textbf{I} for any xK\mathbf{x}\in\mathcal{K}, then the condition number of ff over K\mathcal{K} is α/β\alpha/\beta.

(a) The sum of convex functions is convex.

(b) Let ff be α1\alpha_{1}-strongly convex and gg be α2\alpha_{2}-strongly convex. Then f+gf+g is (α1+α2)(\alpha_{1}+\alpha_{2})-strongly convex.

(c) Let ff be β1\beta_{1}-smooth and gg be β2\beta_{2}-smooth. Then f+gf+g is (β1+β2)(\beta_{1}+\beta_{2})-smooth.

8. * Extending Nesterov’s accelerated GD algorithm: Assume a black-box access to Nesterov’s algorithm that attains the rate of eγ Te^{-\sqrt{\gamma}\ T} for γ\gamma-well-conditioned functions, as in Table 2.1. Apply a reduction to obtain the βT2\frac{\beta}{T^{2}} rate for β\beta-smooth functions, as in Table 2.1, up to logarithmic factors.

Chapter 3 First-Order Algorithms for Online Convex Optimization

In this chapter we describe and analyze the most simple and basic algorithms for online convex optimization (recall the definition of the model as introduced in chapter 1), which are also surprisingly useful and applicable in practice. We use the same notation introduced in §2.1. However, in contrast to the previous chapter, the goal of the algorithms introduced in this chapter is to minimize regret, rather than the optimization error (which is ill-defined in an online setting).

Table 3.1 details known upper and lower bounds on the regret for different types of convex functions as it depends on the number of prediction iterations.

The reader may recall Table 2.1 describing offline convergence of first order methods: as opposed to offline optimization, smoothness does not improve asymptotic regret rates. However, exp-concavity, a weaker property than strong convexity, comes into play and gives improved regret rates.

This chapter will present algorithms and lower bounds that realize the above known results for OCO. The property of exp-concavity and its applications, as well as logarithmic regret algorithms for exp-concave functions are deferred to the next chapter.

Perhaps the simplest algorithm that applies to the most general setting of online convex optimization is online gradient descent. This algorithm, which is based on standard gradient descent from offline optimization, was introduced in its online form by Zinkevich (see bibliography at the end of this section).

Pseudo-code for the algorithm is given in Algorithm 8, and a conceptual illustration is given in figure 3.1.

In each iteration, the algorithm takes a step from the previous point in the direction of the gradient of the previous cost. This step may result in a point outside of the underlying convex set. In such cases, the algorithm projects the point back to the convex set, i.e. finds its closest point in the convex set. Despite the fact that the next cost function may be completely different than the costs observed thus far, the regret attained by the algorithm is sublinear. This is formalized in the following theorem (recall the definition of GG and DD from the previous chapter).

Online gradient descent with step sizes {ηt=DGt, t[T]}\{\eta_{t}=\frac{D}{G\sqrt{t}},\ t\in[T]\} guarantees the following for all T1T\geq 1:

Let xargminxKt=1Tft(x)\mathbf{x}^{\star}\in\operatorname*{\arg\min}_{\mathbf{x}\in\mathcal{K}}\sum_{t=1}^{T}f_{t}(\mathbf{x}). Define t=defft(xt)\nabla_{t}\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}\nabla f_{t}(\mathbf{x}_{t}). By convexity

We first upper-bound t(xtx)\nabla_{t}^{\top}(\mathbf{x}_{t}-\mathbf{x}^{\star}) using the update rule for xt+1\mathbf{x}_{t+1} and Theorem 2.1 (the Pythagorean theorem):

Summing (3.1) and (3.1) from t=1t=1 to TT, and setting ηt=DGt\eta_{t}=\frac{D}{G\sqrt{t}} (with 1η0=def0\frac{1}{\eta_{0}}\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}0):

The last inequality follows since ηt=DGt\eta_{t}=\frac{D}{G\sqrt{t}} and t=1T1t2T\sum_{t=1}^{T}\frac{1}{\sqrt{t}}\leq 2\sqrt{T}. ∎

The online gradient descent algorithm is straightforward to implement, and updates take linear time given the gradient. However, there is a projection step which may take significantly longer, as discussed in §2.1.1 and chapter 7.

2 Lower Bounds

The previous section introduces and analyzes a very simple and natural approach to online convex optimization. Before continuing our venture, it is worthwhile to consider whether the previous bound can be improved? We measure performance of OCO algorithms both by regret and by computational efficiency. Therefore, we ask ourselves whether even simpler algorithms that attain tighter regret bounds exist.

The computational efficiency of online gradient descent seemingly leaves little room for improvement, putting aside the projection step it runs in linear time per iteration. What about obtaining better regret?

Perhaps surprisingly, the answer is negative: online gradient descent attains, in the worst case, tight regret bounds up to small constant factors! This is formally given in the following theorem.

Any algorithm for online convex optimization incurs Ω(DGT)\Omega(DG\sqrt{T}) regret in the worst case. This is true even if the cost functions are generated from a fixed stationary distribution.

We give a sketch of the proof; filling in all details is left as an exercise at the end of this chapter.

Consider an instance of OCO where the convex set K\mathcal{K} is the nn-dimensional hypercube, i.e.

There are 2n2^{n} linear cost functions, one for each vertex v{±1}n\mathbf{v}\in\{\pm 1\}^{n}, defined as

Notice that both the diameter of K\mathcal{K} and the bound on the norm of the cost function gradients, denoted G, are bounded by

The cost functions in each iteration are chosen at random, with uniform probability, from the set {fv,v{±1}n}\{f_{\mathbf{v}},\mathbf{v}\in\{\pm 1\}^{n}\}. Denote by vt{±1}n\mathbf{v}_{t}\in\{\pm 1\}^{n} the vertex chosen in iteration tt, and denote ft=fvtf_{t}=f_{\mathbf{v}_{t}}. By uniformity and independence, for any tt and xt\mathbf{x}_{t} chosen online, \mboxEvt[ft(xt)]=\mboxEvt[vtxt]=0\mathop{\mbox{\bf E}}_{\mathbf{v}_{t}}[f_{t}(\mathbf{x}_{t})]=\mathop{\mbox{\bf E}}_{\mathbf{v}_{t}}[\mathbf{v}_{t}^{\top}\mathbf{x}_{t}]=0. However,

The last equality is left as an exercise.

The facts above nearly complete the proof of Theorem 3.2; see the exercises at the end of this chapter.

3 Logarithmic Regret

At this point, the reader may wonder: we have introduced a seemingly sophisticated and obviously general framework for learning and prediction, as well as a linear-time algorithm for the most general case, complete with tight regret bounds, and done so with elementary proofs! Is this all OCO has to offer?

Simple is good: the philosophy behind OCO treats simplicity as a merit. The main reason OCO has taken the stage in online learning in recent years is the simplicity of its algorithms and their analysis, which allow for numerous variations and tweaks in their host applications.

A very wide class of settings, which will be the subject of the next sections, admit more efficient algorithms, in terms of both regret and computational complexity.

In §2 we surveyed optimization algorithms with convergence rates that vary greatly according to the convexity properties of the function to be optimized. Do the regret bounds in online convex optimization vary as much as the convergence bounds in offline convex optimization over different classes of convex cost functions?

Indeed, next we show that for important classes of loss functions significantly better regret bounds are possible.

The first algorithm that achieves regret logarithmic in the number of iterations is a twist on the online gradient descent algorithm, changing only the step size. The following theorem establishes logarithmic bounds on the regret if the cost functions are strongly convex.

For α\alpha-strongly convex loss functions, online gradient descent with step sizes ηt=1αt\eta_{t}=\frac{1}{\alpha{t}} achieves the following guarantee for all T1T\geq 1

Let xargminxKt=1Tft(x)\mathbf{x}^{\star}\in\operatorname*{\arg\min}_{\mathbf{x}\in\mathcal{K}}\sum_{t=1}^{T}f_{t}(\mathbf{x}). Recall the definition of regret

Define t=defft(xt)\nabla_{t}\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}\nabla f_{t}(\mathbf{x}_{t}). Applying the definition of α\alpha-strong convexity to the pair of points {xt\{\mathbf{x}_{t},x}\mathbf{x}^{*}\}, we have

We proceed to upper-bound t(xtx)\nabla_{t}^{\top}(\mathbf{x}_{t}-\mathbf{x}^{\star}). Using the update rule for xt+1\mathbf{x}_{t+1} and the Pythagorean theorem 2.1, we get

Summing (3.5) from t=1t=1 to TT, setting ηt=1αt\eta_{t}=\frac{1}{\alpha t} (define 1η0=def0\frac{1}{\eta_{0}}\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}0), and combining with (3.4), we have:

4 Application: Stochastic Gradient Descent

A special case of Online Convex Optimization is the well-studied setting of stochastic optimization. In stochastic optimization, the optimizer attempts to minimize a convex function over a convex domain as given by the mathematical program:

However, unlike standard offline optimization, the optimizer is given access to a noisy gradient oracle, defined by

That is, given a point in the decision set, a noisy gradient oracle returns a random vector whose expectation is the gradient at the point and whose variance is bounded by G2G^{2}.

We will show that regret bounds for OCO translate to convergence rates for stochastic optimization. As a special case, consider the online gradient descent algorithm whose regret is bounded by

Applying the OGD algorithm over a sequence of linear functions that are defined by the noisy gradient oracle at consecutive points, and finally returning the average of all points along the way, we obtain the stochastic gradient descent algorithm, presented in Algorithm 9.

Algorithm 9 with step sizes ηt=DGt\eta_{t}=\frac{D}{G\sqrt{t}} guarantees

It is important to note that in the proof above, we have used the fact that the regret bounds of online gradient descent hold against an adaptive adversary. This need arises since the cost functions ftf_{t} defined in Algorithm 9 depend on the choice of decision xtK\mathbf{x}_{t}\in\mathcal{K}.

Recall our example of Support Vector Machine training from §2.5. The task of training an SVM over a given data set amounts to solving the following convex program (equation (2.8)):

Using the technique described in this chapter, namely the OGD and SGD algorithms, we can devise a much faster algorithm than the one presented in the previous chapter. The idea is to generate an unbiased estimator for the gradient of the objective using a single example in the dataset, and use it in lieu of the entire gradient. This is given formally in the SGD algorithm for SVM training presented in Algorithm 10.

This matches the convergence rate of standard offline gradient descent. However, observe that each iteration is significantly cheaper—only one example in the data set need be considered! That is the magic of SGD; we have matched the nearly optimal convergence rate of first order methods using extremely cheap iterations. This makes it the method of choice in numerous applications.

5 Bibliographic Remarks

The OCO framework was introduced by Zinkevich (2003), where the OGD algorithm was introduced and analyzed. Precursors to this algorithm, albeit for less general settings, were introduced and analyzed in (Kivinen and Warmuth, 1997). Logarithmic regret algorithms for Online Convex Optimization were introduced and analyzed in (Hazan et al., 2007).

The stochastic gradient descent (SGD) algorithm dates back to Robbins and Monro (1951), where it was called “stochastic approximation”. The importance of SGD for machine learning was advocated for in (Bottou, 1998; Bottou and Bousquet, 2008). The literature on SGD is vast and the reader is referred to the text of Bubeck (2015) and paper by Lan (2012).

Application of SGD to soft-margin SVM training was explored in (Shalev-Shwartz et al., 2011a). Tight convergence rates of SGD for strongly convex and non-smooth functions were only recently obtained in (Hazan and Kale, 2011; Rakhlin et al., 2012; Shamir and Zhang, 2013).

6 Exercises

2. ∗ In this exercise we show how to remove some a-priory knowledge from the design of online convex optimization algorithms.

(a) Design an OCO algorithm that attains the same asymptotic regret bound as OGD, up to factors logarithmic in GG without knowing the parameter GG ahead of time.

(b) Do the same for the parameter DD: design an OCO algorithm that attains the same asymptotic regret bound as OGD, up to factors logarithmic in DD without knowing the parameter DD ahead of time. This time you may assume GG is known. You may assume that it is possible to compute projections onto K\mathcal{K} without knowing its diameter.

3. In this exercise we prove a tight lower bound on the regret of any algorithm for online convex optimization.

(a) For any sequence of TT fair coin tosses, let NhN_{h} be the number of head outcomes and NtN_{t} be the number of tails. Give an asymptotically tight upper and lower bound on \mboxE[NhNt]\mathop{\mbox{\bf E}}[\left|N_{h}-N_{t}\right|]. That is, give an order of growth of this random variable as a function of TT, up to multiplicative and additive constants.

(b) Consider a 2-expert problem, in which the losses are inversely correlated: either expert one incurs a loss of one and the second expert negative one, or vice versa. Use the fact above to design a setting in which any experts algorithm incurs regret asymptotically matching the upper bound.

(c) Consider the general OCO setting over a convex set K\mathcal{K}. Design a setting in which the cost functions have gradients whose norm is bounded by GG, and obtain a lower bound on the regret as a function of GG, the diameter of K\mathcal{K}, and the number of game iterations.

4. Implement the SGD algorithm for SVM training. Apply it on the MNIST dataset. Compare your results to the offline GD algorithm from the previous chapter.

Chapter 4 Second-Order Methods

The motivation for this chapter is the application of online portfolio selection, considered in the first chapter of this book. We begin with a detailed description of this application. We proceed to describe a new class of convex functions that model this problem. This new class of functions is more general than the class of strongly convex functions discussed in the previous chapter. It allows for logarithmic regret algorithms, which are based on second order methods from convex optimization. In contrast to first order methods, which have been our focus thus far and relied on (sub)gradients, second order methods exploit information about the second derivative of the objective function.

In this subsection we give the formal definition of the universal portfolio selection problem that was informally described in §1.2.

Mainstream financial theory models stock prices as a stochastic process known as Geometric Brownian Motion (GBM). This model assumes that the fluctuations in the prices of the stocks behave essentially as a random walk. It is perhaps easier to think about a price of an asset (stock) on time segments, obtained from a discretization of time into equal segments. Thus, the logarithm of the price at segment t+1t+1, denoted lt+1l_{t+1}, is given by the sum of the logarithm of the price at segment tt and a Gaussian random variable with a particular mean and variance,

This is only an informal way of thinking about GBM. The formal model is a continuous time process, similar to the discrete time stochastic process we have just described, obtained as the time intervals, means, and variances approach zero.

The GBM model gives rise to particular algorithms for portfolio selection (as well as more sophisticated applications such as options pricing). Given the means and variances of the stock prices over time of a set of assets, as well as their cross-correlations, a portfolio with maximal expected gain (mean) for a specific risk (variance) threshold can be formulated.

The fundamental question is, of course, how does one obtain the mean and variance parameters, not to mention the cross-correlations, of a given set of stocks? One accepted solution is to estimate these from historical data, e.g., by taking the recent history of stock prices.

1.2 Universal portfolio theory

The theory of universal portfolio selection is very different from the GBM model. The main difference being the lack of statistical assumptions about the stock market. The idea is to model investing as a repeated decision making scenario, which fits nicely into our OCO framework, and to measure regret as a performance metric.

How does the decision maker’s wealth change? Let WtW_{t} be her total wealth at iteration tt. Then, ignoring transaction costs, we have

Over TT iterations, the total wealth of the investor is given by

The goal of the decision maker, to maximize the overall wealth gain WT/W0{W_{T}}/{W_{0}}, can be attained by maximizing the following more convenient logarithm of this quantity, given by

The above formulation is already very similar to our OCO setting, albeit phrased as a gain maximization rather than a loss minimization setting. Let

The convex set is the nn-dimensional simplex K=Δn\mathcal{K}=\Delta_{n}, and define the regret to be

The functions ftf_{t} are concave rather than convex, which is perfectly fine as we are framing the problem as a maximization rather than a minimization. Note also that the regret is the negation of the usual regret notion (1.1) we have considered for minimization problems.

Since this is an online convex optimization instance, we can use the online gradient descent algorithm from the previous chapter to invest, which ensures O(T)O(\sqrt{T}) regret (see exercises). What guarantee do we attain in terms of investing? To answer this, in the next section we reason about what x\mathbf{x}^{\star} in the above expression may be.

1.3 Constant rebalancing portfolios

As xK=Δn\mathbf{x}^{\star}\in\mathcal{K}=\Delta_{n} is a point in the nn-dimensional simplex, consider the special case of x=e1\mathbf{x}^{\star}=\mathbf{e}_{1}, i.e., the first standard basis vector (the vector that has zero in all coordinates except the first, which is set to one). The term t=1Tft(e1)\sum_{t=1}^{T}f_{t}(\mathbf{e}_{1}) becomes t=1Tlogrt(1)\sum_{t=1}^{T}\log\mathbf{r}_{t}(1), or

As TT becomes large, any sublinear regret guarantee (e.g., the O(T)O(\sqrt{T}) regret guarantee achieved using online gradient descent) achieves an average regret that approaches zero. In this context, this implies that the log-wealth gain achieved (in average over TT rounds) is as good as that of the first stock. Since x\mathbf{x}^{\star} can be taken to be any vector, sublinear regret guarantees average log-wealth growth as good as any stock!

However, x\mathbf{x}^{\star} can be significantly better, as shown in the following example. Consider a market of two stocks that fluctuate wildly. The first stock increases by 100%100\% every even day and returns to its original price the following (odd) day. The second stock does exactly the opposite: decreases by 50%50\% on even days and rises back on odd days. Formally, we have

Clearly, any investment in either of the stocks will not gain in the long run. However, the portfolio x=(0.5,0.5)\mathbf{x}^{\star}=(0.5,0.5) increases wealth by a factor of rtx=(12)2+1=1.25\mathbf{r}_{t}^{\top}\mathbf{x}^{\star}=(\frac{1}{2})^{2}+1=1.25 daily! Such a mixed distribution is called a fixed rebalanced portfolio, as it needs to rebalance the proportion of total capital invested in each stock at each iteration to maintain this fixed distribution strategy.

Thus, vanishing average regret guarantees long-run growth as the best constant rebalanced portfolio in hindsight. Such a portfolio strategy is called universal. We have seen that the online gradient descent algorithm gives essentially a universal algorithm with regret O(T)O(\sqrt{T}). Can we get better regret guarantees?

2 Exp-Concave Functions

For convenience, we return to considering losses of convex functions, rather than gains of concave functions as in the application for portfolio selection. The two problems are equivalent: we simply replace the maximization of the concave f(x)=log(rtx)f(\mathbf{x})=\log(\mathbf{r}_{t}^{\top}\mathbf{x}) with the minimization of the convex f(x)=log(rtx)f(\mathbf{x})=-\log(\mathbf{r}_{t}^{\top}\mathbf{x}).

In the previous chapter we have seen that the OGD algorithm with carefully chosen step sizes can deliver logarithmic regret for strongly convex functions. However, the loss function for the OCO setting of portfolio selection, ft(x)=log(rtx),f_{t}(\mathbf{x})=-\log(\mathbf{r}_{t}^{\top}\mathbf{x}), is not strongly convex. Instead, the Hessian of this function is given by

which is a rank one matrix. Recall that the Hessian of a twice-differentiable strongly convex function is larger than a multiple of identity matrix and is positive definite and in particular has full rank. Thus, the loss function above is quite far from being strongly convex.

However, an important observation is that this Hessian is large in the direction of the gradient. This property is called exp-concavity. We proceed to define this property rigorously and show that it suffices to attain logarithmic regret.

For the following discussion, recall the notation of §2.1, and in particular our convention over matrices that ABA\succcurlyeq B if and only if ABA-B is positive semidefinite. Exp-concavity implies strong-convexity in the direction of the gradient. This reduces to the following property:

The proof of this lemma is given as a guided exercise at the end of this chapter. We prove a slightly stronger lemma below.

The composition of a concave and non-decreasing function with another concave function is concave 9see exercises.. Therefore, since 2γα2\gamma\leq\alpha, the composition of g(x)=x2γ/αg(x)=x^{2\gamma/\alpha} with f(x)=exp(αf(x))f(\mathbf{x})=\exp(-\alpha f(\mathbf{x})) is concave. It follows that the function h(x)=defexp(2γf(x))h(\mathbf{x})\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}\exp(-2\gamma f(\mathbf{x})) is concave. Then by the concavity of h(x)h(\mathbf{x}),

Plugging in h(y)=2γexp(2γf(y))f(y)\nabla h(\mathbf{y})=-2\gamma\exp(-2\gamma f(\mathbf{y}))\nabla f(\mathbf{y}) gives

Next, note that 2γf(y)(xy)  2γGD1|2\gamma\nabla f(\mathbf{y})^{\top}(\mathbf{x}-\mathbf{y})|\ \leq\ 2\gamma GD\leq 1 and that using the Taylor approximation, for z1z\geq-1, it holds that log(1z)z+14z2-\log(1-z)\geq z+\frac{1}{4}{z^{2}}. Applying the inequality for z=2γf(y)(xy)z=2\gamma\nabla f(\mathbf{y})^{\top}(\mathbf{x}-\mathbf{y}) implies the lemma. ∎

3 Exponentially Weighted Online Convex Optimization

Before diving into efficient second order methods, we first describe a simple algorithm based on the multiplicative updates method which gives logarithmic regret for exp-concave losses. Algorithm (11) below, called EWOO, is a close relative to the Hedge Algorithm (1). Its regret guarantee is robust: it does not include a Lipschitz constant or a diameter bound. In addition, it is particularly simple to describe and analyze.

The downside of EWOO is its running time. A naive implementation would run in exponential time of the dimension. It is possible to given a randomized polynomial time implementation based on random sampling techniques, where the polynomial depends both on the dimension as well as the number of iterations, see bibliographic section for more details.

In the analysis below, it can be observed that choosing xt\mathbf{x}_{t} at random with density proportional to wt(x)w_{t}(\mathbf{x}), instead of computing the entire integral, also guarantees our regret bounds on the expectation. This is the basis for the polynomial time implementation. We proceed to give the logarithmic regret bounds.

Let ht(x)=eαft(x)h_{t}(\mathbf{x})=e^{-\alpha f_{t}(\mathbf{x})}. Since ftf_{t} is α\alpha-exp-concave, we have that hth_{t} is concave and thus

By definition of x\mathbf{x}^{\star} we have xargmaxxKt=1Tht(x)\mathbf{x}^{\star}\in\arg\max_{\mathbf{x}\in\mathcal{K}}\prod_{t=1}^{T}h_{t}(\mathbf{x}). Denote by SδKS_{\delta}\subset\mathcal{K} the translated Minkowski set given by

By concavity of hth_{t} and the fact that hth_{t} is non-negative, we have that,

Finally, since Sδ=(1δ)x+δKS_{\delta}=(1-\delta)\mathbf{x}^{\star}+\delta\mathcal{K} is simply a rescaling of K\mathcal{K} followed by a translation, and we are in dd dimensions, \mboxvol(Sδ)=\mboxvol(K)×δd\mbox{vol}(S_{\delta})=\mbox{vol}(\mathcal{K})\times\delta^{d}. Putting this together with equation (4.1), we have

We can now simplify by taking logarithms and changing sides,

where the last step is by choosing δ=1T\delta=\frac{1}{T}. ∎

4 The Online Newton Step Algorithm

Thus far we have only considered first order methods for regret minimization. In this section we consider a quasi-Newton approach, i.e., an online convex optimization algorithm that approximates the second derivative, or Hessian in more than one dimension. However, strictly speaking, the algorithm we analyze is also first order, in the sense that it only uses gradient information.

The algorithm we introduce and analyze, called online Newton step, is detailed in Algorithm 12. At each iteration, this algorithm chooses a vector that is the projection of the sum of the vector chosen at the previous iteration and an additional vector. Whereas for the online gradient descent algorithm this added vector was the gradient of the previous cost function, for online Newton step this vector is different: it is reminiscent to the direction in which the Newton-Raphson method would proceed if it were an offline optimization problem for the previous cost function. The Newton-Raphson algorithm would move in the direction of the vector which is the inverse Hessian times the gradient. In online Newton step, this direction is At1tA_{t}^{-1}\nabla_{t}, where the matrix AtA_{t} is related to the Hessian as will be shown in the analysis.

Since adding a multiple of the Newton vector At1tA_{t}^{-1}\nabla_{t} to the current vector may result in a point outside the convex set, an additional projection step is required to obtain xt\mathbf{x}_{t}, the decision at time tt. This projection is different than the standard Euclidean projection used by online gradient descent in Section 3.1. It is the projection according to the norm defined by the matrix AtA_{t}, rather than the Euclidean norm.

The advantage of the online Newton step algorithm is its logarithmic regret guarantee for exp-concave functions, as defined in the previous section. The following theorem bounds the regret of online Newton step.

Algorithm 12 with parameters γ=12min{1GD,α}\gamma=\frac{1}{2}\min\{\frac{1}{GD},\alpha\}, ε=1γ2D2\varepsilon=\frac{1}{\gamma^{2}D^{2}} and T4T\geq 4 guarantees

As a first step, we prove the following lemma.

The regret of online Newton step is bounded by

Let xargminxKt=1Tft(x)\mathbf{x}^{\star}\in\arg\min_{\mathbf{x}\in\mathcal{K}}\sum_{t=1}^{T}f_{t}(\mathbf{x}) be the best decision in hindsight. By Lemma 4.3, we have for γ=12min{1GD,α}\gamma=\frac{1}{2}\min\{\frac{1}{GD},\alpha\},

According to the update rule of the algorithm xt+1=ΠKAt(yt+1)\mathbf{x}_{t+1}=\mathop{\Pi}_{\mathcal{K}}^{{A}_{t}}(\mathbf{y}_{t+1}). Now, by the definition of yt+1\mathbf{y}_{t+1}:

Multiplying the transpose of (4.2) by (4.3) we get

Since xt+1\mathbf{x}_{t+1} is the projection of yt+1\mathbf{y}_{t+1} in the norm induced by At{A}_{t}, we have by the Pythagorean theorem (see §2.1.1)

This inequality is the reason for using generalized projections as opposed to standard projections, which were used in the analysis of online gradient descent (see §3.1 Equation (3.2)). This fact together with (4.4) gives

Now, summing up over t=1t=1 to TT we get that

In the last inequality we use the fact that AtAt1=ttA_{t}-A_{t-1}=\nabla_{t}\nabla_{t}^{\top}, and the fact that the matrix ATA_{T} is PSD and hence the last term before the inequality is negative. Thus,

Using the algorithm parameters A111=εIn{A}_{1}-\nabla_{1}\nabla_{1}^{\top}=\varepsilon\mathbf{I}_{n} , ε=1γ2D2\varepsilon=\frac{1}{\gamma^{2}D^{2}} and our notation for the diameter x1x2D2\|\mathbf{x}_{1}-\mathbf{x}^{\star}\|^{2}\leq D^{2} we have

Since γ=12min{1GD,α}\gamma=\frac{1}{2}\min\{\frac{1}{GD},\alpha\}, we have 1γ2(1α+GD)\frac{1}{\gamma}\leq 2(\frac{1}{\alpha}+GD). This gives the lemma. ∎

First we show that the term t=1TtAt1t\sum_{t=1}^{T}\nabla_{t}^{\top}{A}_{t}^{-1}\nabla_{t} is upper bounded by a telescoping sum. Notice that

Since AT=t=1Ttt+εInA_{T}=\sum_{t=1}^{T}\nabla_{t}\nabla_{t}^{\top}+\varepsilon I_{n} and tG\|\nabla_{t}\|\leq G, the largest eigenvalue of ATA_{T} is at most TG2+εTG^{2}+\varepsilon. Hence the determinant of ATA_{T} can be bounded by AT(TG2+ε)n|A_{T}|\leq(TG^{2}+\varepsilon)^{n}. Hence recalling that ε=1γ2D2\varepsilon=\frac{1}{\gamma^{2}D^{2}} and γ=12min{1GD,α}\gamma=\frac{1}{2}\min\{\frac{1}{GD},\alpha\}, for T>4T>4,

which implies the theorem for n>1, T4n>1,\ T\geq 4.

It remains to prove the technical lemma for positive semidefinite (PSD) matrices used above.

Let AB0A\succcurlyeq B\succ 0 be positive definite matrices. Then

For any positive definite matrix CC, denote by λ1(C),,λn(C)\lambda_{1}(C),\ldots,\lambda_{n}(C) its eigenvalues (which are positive).

In the last equality we use the facts AB=AB|AB|=|A||B| and A1=1A|A^{-1}|=\frac{1}{|A|} for positive definite matrices (see exercises). ∎

The online Newton step algorithm requires O(n2)O(n^{2}) space to store the matrix AtA_{t}. Every iteration requires the computation of the matrix At1A_{t}^{-1}, the current gradient, a matrix-vector product, and possibly a projection onto the underlying convex set K\mathcal{K}.

A naïve implementation would require computing the inverse of the matrix AtA_{t} on every iteration. However, in the case that AtA_{t} is invertible, the matrix inversion lemma (see bibliography) states that for invertible matrix AA and vector x\mathbf{x},

Thus, given At11A_{t-1}^{-1} and t\nabla_{t} one can compute At1A_{t}^{-1} in time O(n2)O(n^{2}) using only matrix-vector and vector-vector products.

The online Newton step algorithm also needs to make projections onto K\mathcal{K}, but of a slightly different nature than online gradient descent and other online convex optimization algorithms. The required projection, denoted by ΠKAt\mathop{\Pi}_{\mathcal{K}}^{A_{t}}, is in the vector norm induced by the matrix AtA_{t}, viz. xAt=xAtx\|\mathbf{x}\|_{A_{t}}=\sqrt{\mathbf{x}^{\top}A_{t}\mathbf{x}}. It is equivalent to finding the point xK\mathbf{x}\in\mathcal{K} which minimizes (xy)At(xy)(\mathbf{x}-\mathbf{y})^{\top}A_{t}(\mathbf{x}-\mathbf{y}) where y\mathbf{y} is the point we are projecting. This is a convex program which can be solved up to any degree of accuracy in polynomial time.

Modulo the computation of generalized projections, the online Newton step algorithm can be implemented in time and space O(n2)O(n^{2}). In addition, the only information required is the gradient at each step (and the exp-concavity constant α\alpha of the loss functions).

5 Bibliographic Remarks

The Geometric Brownian Motion model for stock prices was suggested and studied as early as 1900 in the PhD thesis of Louis Bachelier (Bachelier, 1900), see also (Osborne, 1959), and used in the Nobel Prize winning work of Black and Scholes on options pricing (Black and Scholes, 1973). In a strong deviation from standard financial theory, Thomas Cover (Cover, 1991) put forth the universal portfolio model, whose algorithmic theory we have historically sketched in chapter 1. The EWOO algorithm was essentially given in Cover’s paper for the application of portfolio selection and logarithmic loss functions, and extended to exp-concave loss functions in (Hazan et al., 2006). The randomized extension of Cover’s algorithm that runs in polynomial running time is due to Kalai and Vempala (2003), and it naturally extends to EWOO.

Some bridges between classical portfolio theory and the universal model appear in (Abernethy et al., 2012). Options pricing and its relation to regret minimization was recently also explored in the work of (DeMarzo et al., 2006).

Exp-concave functions have been considered in the context of prediction in (Kivinen and Warmuth, 1999), see also (Cesa-Bianchi and Lugosi, 2006) (chapter 3.3 and bibliography). A more general condition than exp-concavity called mixability was used by Vovk (1990) to give a general multiplicative update algorithm, see also (Foster et al., 2018). For a thorough discussion of various conditions that allow logarithmic regret in online learning see (van Erven et al., 2015).

For the square-loss, (Azoury and Warmuth, 2001) gave a specially tailored and near-optimal prediction algorithm. Logarithmic regret algorithms for online convex optimization and the Online Newton Step algorithm were presented in (Hazan et al., 2007).

The Sherman-Morrison formula, a.k.a. the matrix inversion lemma, gives the form of the inverse of a matrix after a rank-1 update, see (Riedel, 1991).

6 Exercises

1. For this question, assume all functions are twice differentiable. Prove that exp-concave functions are a larger class than strongly convex and Lipschitz functions. That is, prove that a strongly convex function over a bounded domain that is also GG-Lipschitz is also exp-concave. Show that the converse is not necessarily true.

2. Prove that a twice-differentiable function ff is α\alpha-exp-concave over K\mathcal{K} if and only if for all xK\mathbf{x}\in\mathcal{K},

Hint: consider the Hessian of the function eαf(x)e^{-\alpha f(\mathbf{x})}, and use the fact that the Hessian of a convex function is always positive semidefinite.

3. Write pseudo-code for a portfolio selection algorithm based on online gradient descent. That is, given a set of return vectors, spell out the exact constants and updates based upon the gradients of the reward functions. Derive the regret bound based on Theorem 3.1. You may assume that the multiplicative change in price for any single asset is bounded, and use this quantity in your regret bound. Do the same (pseudo-code and regret bound) for the Online Newton Step algorithm applied to portfolio selection. Note: you are not required to give pseudo-code for projections onto the simplex.

4. Download stock prices from your favorite online finance website over a period of at least three years. Create a dataset for testing portfolio selection algorithms by creating price-return vectors. Implement the OGD and ONS algorithms and benchmark them on your data.

5. Prove that for positive definite matrices, A,B0A,B\succ 0, the following hold: AB=AB|AB|=|A||B| and A1=1A|A^{-1}|=\frac{1}{|A|}, where A|A| denotes the determinant of AA.

Chapter 5 Regularization

In the previous chapters we have explored algorithms for OCO that are motivated by convex optimization. However, unlike convex optimization, the OCO framework optimizes the Regret performance metric. This distinction motivates a family of algorithms, called “Regularized Follow The Leader” (RFTL), which we introduce in this chapter.

In an OCO setting of regret minimization, the most straightforward approach for the online player is to use at any time the optimal decision (i.e., point in the convex set) in hindsight. Formally, let

This flavor of strategy is known as “fictitious play” in economics, and has been named “Follow the Leader” (FTL) in machine learning. It is not hard to see that this simple strategy fails miserably in a worst-case sense. That is, this strategy’s regret can be linear in the number of iterations, as the following example shows: Consider K=\mathcal{K}=, let f1(x)=12xf_{1}(x)=\frac{1}{2}x, and let fτf_{\tau} for τ=2,,T\tau=2,\ldots,T alternate between x-x or xx. Thus,

The FTL strategy will keep shifting between xt=1x_{t}=-1 and xt=1x_{t}=1, always making the wrong choice.

The intuitive FTL strategy fails in the example above because it is unstable. Can we modify the FTL strategy such that it won’t change decisions often, thereby causing it to attain low regret?

This question motivates the need for a general means of stabilizing the FTL method. Such a means is referred to as “regularization”.

Although it is not strictly necessary, we assume that the regularization functions in this chapter are twice differentiable over K\mathcal{K} and, for all points xint(K)\mathbf{x}\in\text{int}(\mathcal{K}) in the interior of the decision set, have a Hessian 2R(x)\nabla^{2}R(\mathbf{x}) that is, by the strong convexity of RR, positive definite.

We denote the diameter of the set K\mathcal{K} relative to the function RR as

Henceforth we make use of general norms and their dual. The dual norm to a norm \|\cdot\| is given by the following definition:

A positive definite matrix AA gives rise to the matrix norm xA=xAx\|\mathbf{x}\|_{A}=\sqrt{\mathbf{x}^{\top}A\mathbf{x}}. The dual norm of a matrix norm is xA=xA1\|\mathbf{x}\|_{A}^{*}=\|\mathbf{x}\|_{A^{-1}}.

The generalized Cauchy-Schwarz theorem asserts xyxy\mathbf{x}^{\top}\mathbf{y}\leq\|\mathbf{x}\|\|\mathbf{y}\|^{*} and in particular for matrix norms, xyxAyA\mathbf{x}^{\top}\mathbf{y}\leq\|\mathbf{x}\|_{A}\|\mathbf{y}\|_{A}^{*} (see exercises).

In our derivations, we usually consider matrix norms with respect to 2R(x)\nabla^{2}R(\mathbf{x}), the Hessian of the regularization function R(x)R(\mathbf{x}), as well as the inverse Hessian denoted 2R(x)\nabla^{-2}R(\mathbf{x}). In such cases, we use the notation

A crucial quantity in the analysis of OCO algorithms that use regularization is the remainder term of the Taylor approximation of the regularization function, and especially the remainder term of the first order Taylor approximation. The difference between the value of the regularization function at x\mathbf{x} and the value of the first order Taylor approximation is known as the Bregman divergence, given by

Denote by BR(xy)B_{R}(\mathbf{x}||\mathbf{y}) the Bregman divergence with respect to the function R{R}, defined as

For twice differentiable functions, Taylor expansion and the mean-value theorem assert that the Bregman divergence is equal to the second derivative at an intermediate point, i.e., (see exercises)

for some point z[x,y]\mathbf{z}\in[\mathbf{x},\mathbf{y}], meaning there exists some α\alpha\in such that z=αx+(1α)y\mathbf{z}=\alpha\mathbf{x}+(1-\alpha)\mathbf{y}. Therefore, the Bregman divergence defines a local norm, which has a dual norm. We shall denote this dual norm by

In online convex optimization, we commonly refer to the Bregman divergence between two consecutive decision points xt\mathbf{x}_{t} and xt+1\mathbf{x}_{t+1}. In such cases, we shorthand notation for the norm defined by the Bregman divergence with respect to R{R} on the intermediate point in [xt,xt+1][\mathbf{x}_{t},\mathbf{x}_{t+1}] as t=defxt,xt+1\|\cdot\|_{t}\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}\|\cdot\|_{\mathbf{x}_{t},\mathbf{x}_{t+1}}. The latter norm is called the local norm at iteration tt. With this notation, we have BR(xtxt+1)=12xtxt+1t2B_{R}(\mathbf{x}_{t}||\mathbf{x}_{t+1})=\frac{1}{2}\|\mathbf{x}_{t}-\mathbf{x}_{t+1}\|_{t}^{2}.

Finally, we consider below generalized projections that use the Bregman divergence as a distance instead of a norm. Formally, the projection of a point y\mathbf{y} according to the Bregman divergence with respect to function RR is given by

2 The RFTL Algorithm and its Analysis

Recall the caveat with straightforward use of follow-the-leader: as in the bad example we have considered, the predictions of FTL may vary wildly from one iteration to the next. This motivates the modification of the basic FTL strategy in order to stabilize the prediction. By adding a regularization term, we obtain the RFTL (Regularized Follow the Leader) algorithm.

We proceed to formally describe the RFTL algorithmic template and analyze it. The analysis gives asymptotically optimal regret bounds. However, we do not optimize the constants in the regret bounds in order to improve clarity of presentation.

Throughout this chapter, recall the notation of t\nabla_{t} to denote the gradient of the current cost function at the current point, i.e.,

In the OCO setting, the regret of convex cost functions can be bounded by a linear function via the inequality ft(xt)ft(x)t(xtx)f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{x}^{\star})\leq\nabla_{t}^{\top}(\mathbf{x}_{t}-\mathbf{x}^{\star}). Thus, the overall regret (recall definition (1.1)) of an OCO algorithm can be bounded by

The generic RFTL meta-algorithm is defined in Algorithm 13. The regularization function R{R} is assumed to be strongly convex, smooth, and twice differentiable.

2.2 The regret bound

The RFTL Algorithm 13 attains for every uK\mathbf{u}\in\mathcal{K} the following bound on the regret:

If an upper bound on the local norms is known, i.e., ttGR\|\nabla_{t}\|_{t}^{*}\leq G_{R} for all times tt, then we can further optimize over the choice of η\eta to obtain

To prove Theorem 5.2, we first relate the regret to the “stability” in prediction. This is formally captured by the following lemma10Historically, this lemma is known as the “FTL-BTL,” which stands for follow-the-leader vs. be-the-leader. BTL is a hypothetical algorithm that predicts xt+1\mathbf{x}_{t+1} at iteration tt, where xt\mathbf{x}_{t} is the prediction made by FTL. These terms were coined by Kalai and Vempala (Kalai and Vempala, 2005)..

Algorithm 13 guarantees the following regret bound

For convenience of the derivations, define the functions

By equation (5.1), it suffices to bound t=1T[gt(xt)gt(u)]\sum_{t=1}^{T}[g_{t}(\mathbf{x}_{t})-g_{t}(\mathbf{u})]. As a first step, we prove the following inequality:

by induction on TT: Induction base: By definition, we have that x1=argminxK{R(x)}\mathbf{x}_{1}=\operatorname*{\arg\min}_{\mathbf{x}\in\mathcal{K}}\{R(\mathbf{x})\}, and thus g0(u)g0(x1)g_{0}(\mathbf{u})\geq g_{0}(\mathbf{x}_{1}) for all u\mathbf{u}. Induction step: Assume that for TT, we have

and let us prove the statement for T+1T+1. Since xT+2=argminxK{t=0T+1gt(x)}\mathbf{x}_{T+2}=\operatorname*{\arg\min}_{\mathbf{x}\in\mathcal{K}}\{\sum_{t=0}^{T+1}g_{t}(\mathbf{x})\} we have:

Where in the third line we used the induction hypothesis for u=xT+2\mathbf{u}=\mathbf{x}_{T+2}. ∎

Recall that R(x){R}(\mathbf{x}) is a convex function and K\mathcal{K} is a convex set. Denote:

By the Taylor expansion (with its explicit remainder term via the mean-value theorem) at xt+1\mathbf{x}_{t+1}, and by the definition of the Bregman divergence,

The inequality holds since xt+1\mathbf{x}_{t+1} is a minimum of Φt\Phi_{t} over K\mathcal{K}, as in Theorem 2.2. The last equality holds since the component sx\nabla_{s}^{\top}\mathbf{x} is linear and thus does not affect the Bregman divergence. Thus,

To proceed, recall the shorthand for the norm induced by the Bregman divergence with respect to R{R} on point xt,xt+1\mathbf{x}_{t},\mathbf{x}_{t+1} as t=xt,xt+1\|\cdot\|_{t}=\|\cdot\|_{\mathbf{x}_{t},\mathbf{x}_{t+1}}. Similarly for the dual local norm t=xt,xt+1\|\cdot\|^{*}_{t}=\|\cdot\|^{*}_{\mathbf{x}_{t},\mathbf{x}_{t+1}}. With this notation, we have BR(xtxt+1)=12xtxt+1t2B_{R}(\mathbf{x}_{t}||\mathbf{x}_{t+1})=\frac{1}{2}\|\mathbf{x}_{t}-\mathbf{x}_{t+1}\|_{t}^{2}. By the generalized Cauchy-Schwarz inequality,

Combining this inequality with Lemma 5.3 we obtain the theorem statement. ∎

3 Online Mirror Descent

In the convex optimization literature, “Mirror Descent” refers to a general class of first order methods generalizing gradient descent. Online Mirror descent (OMD) is the online counterpart of this class of methods. This relationship is analogous to the relationship of online gradient descent to traditional (offline) gradient descent.

For the RFTL algorithm the intuition was straightforward—the regularization was used to ensure stability of the decision. For OMD, regularization has an additional purpose: regularization transforms the space in which gradient updates are performed. This transformation enables better bounds in terms of the geometry of the space.

The OMD algorithm comes in two flavors: an agile and a lazy version. The lazy version keeps track of a point in Euclidean space and projects onto the convex decision set K\mathcal{K} only at decision time. In contrast, the agile version maintains a feasible point at all times, much like OGD.

Both versions can be analyzed to give roughly the same regret bounds as the RFTL algorithm. In light of what we will see next, this is not surprising: for linear cost functions, the RFTL and lazy-OMD algorithms are equivalent!

Thus, we get regret bounds for free for the lazy version. The agile version can be shown to attain similar regret bounds, and is in fact superior in certain settings that require adaptivity. This issue is further explored in chapter 10. The analysis of the agile version is of independent interest and we give it below.

The OMD (lazy version) and RFTL are identical for linear cost functions, as we show next.

Let f1,...,fTf_{1},...,f_{T} be linear cost functions. The lazy OMD and RFTL algorithms produce identical predictions, i.e.,

First, observe that the unconstrained minimum

By definition, yt\mathbf{y}_{t} also satisfies the above equation, but since R(x){R}(\mathbf{x}) is strictly convex, there is only one solution for the above equation and thus yt=xt\mathbf{y}_{t}=\mathbf{x}^{\star}_{t}. Hence,

Since R(yt){R}(\mathbf{y}_{t}) and s=1t1syt\sum_{s=1}^{t-1}\nabla_{s}^{\top}\mathbf{y}_{t} are independent of x\mathbf{x}, it follows that BR(xyt)B_{R}(\mathbf{x}||\mathbf{y}_{t}) is minimized at the point x\mathbf{x} that minimizes R(x)+ηs=1t1sx{R}(\mathbf{x})+\eta\,\sum_{s=1}^{t-1}\nabla_{s}^{\top}\mathbf{x} over K\mathcal{K} which, in turn, implies that

3.2 Regret bounds for Mirror Descent

In this subsection we prove regret bounds for the agile version of the RFTL algorithm. The analysis is quite different than the one for the lazy version, and of independent interest.

The OMD Algorithm 14 attains for every uK\mathbf{u}\in\mathcal{K} the following bound on the regret:

If an upper bound on the local norms is known, i.e., ttGR\|\nabla_{t}\|_{t}^{*}\leq G_{R} for all times tt, then we can further optimize over the choice of η\eta to obtain

Since the functions ft\mathbf{f}_{t} are convex, for any xK\mathbf{x}^{*}\in K,

The following property of Bregman divergences follows from the definition: for any vectors x,y,z\mathbf{x},\mathbf{y},\mathbf{z},

where the last inequality follows from the generalized Pythagorean theorem, as xt+1\mathbf{x}_{t+1} is the projection w.r.t the Bregman divergence of yt+1\mathbf{y}_{t+1} and xK\mathbf{x}^{*}\in K is in the convex set. Summing over all iterations,

We proceed to bound BR(xt,yt+1)B_{\mathcal{R}}(\mathbf{x}_{t},\mathbf{y}_{t+1}). By definition of Bregman divergence, and the generalized Cauchy-Schwartz inequality,

where in the last inequality follows from (ab)20(a-b)^{2}\geq 0. Thus, we have

Plugging back into Equation (5.3), and by non-negativity of the Bregman divergence, we get

by taking η=DRTGR\eta=\frac{D_{R}}{\sqrt{T}G_{R}}

4 Application and Special Cases

In this section we illustrate the generality of the regularization technique: we show how to derive the two most important and famous online algorithms—the online gradient descent algorithm and the online exponentiated gradient (based on the multiplicative update method)—from the RFTL meta-algorithm.

Other important special cases of the RFTL meta-algorithm are derived with matrix-norm regularization—namely, the von Neumann entropy function, and the log-determinant function, as well as self-concordant barrier regularization—which we shall explore in detail in the next chapter.

To derive the online gradient descent algorithm, we take R(x)=12xx022{R}(\mathbf{x})=\frac{1}{2}\|\mathbf{x}-\mathbf{x}_{0}\|_{2}^{2} for an arbitrary x0K\mathbf{x}_{0}\in\mathcal{K}. Projection with respect to this divergence is the standard Euclidean projection (see exercises), and in addition, R(x)=xx0\nabla{R}(\mathbf{x})=\mathbf{x}-\mathbf{x}_{0}. Hence, the update rule for the OMD Algorithm 14 becomes:

The latter algorithm is exactly online gradient descent, as described in Algorithm 8 in chapter 3. However, both variants behave very differently, as explored in chapter 10 (see also exercises).

Theorem 5.2 gives us the following bound on the regret (where DR,tD_{R},\|\cdot\|_{t} are the diameter and local norm defined with respect to the regularizer RR as defined in the beginning of this chapter, and DD is the Euclidean diameter as defined in chapter 2)

where the second inequality follows since for R(x)=12xx02{R}(\mathbf{x})=\frac{1}{2}\|\mathbf{x}-\mathbf{x}_{0}\|^{2}, the local norm t\|\cdot\|_{t} reduces to the Euclidean norm.

4.2 Deriving multiplicative updates

Let R(x)=xlogx=ixilogxi{R}(\mathbf{x})=\mathbf{x}\log\mathbf{x}=\sum_{i}\mathbf{x}_{i}\log\mathbf{x}_{i} be the negative entropy function, where logx\log\mathbf{x} is to be interpreted element-wise. Then R(x)=1+logx\nabla{R}(\mathbf{x})=\mathbf{1}+\log\mathbf{x}, and hence the update rules for the OMD algorithm become:

which is exactly the Hedge algorithm from the first chapter!

Theorem 5.6 gives us the following bound on the regret:

If the costs per individual expert are in the range $$, it can be shown that

In addition, when RR is the negative entropy function, the diameter over the simplex can be shown to be bounded by DR2lognD_{R}^{2}\leq\log n (see exercises), giving rise to the bound

For an arbitrary range of costs, we obtain the exponentiated gradient algorithm described in Algorithm 15.

The regret achieved by the exponentiated gradient algorithm can be bounded using the following corollary of Theorem 5.2:

The exponentiated gradient algorithm with gradients bounded by tG\|\nabla_{t}\|_{\infty}\leq G_{\infty} and parameter η=logn2TG2\eta=\sqrt{\frac{\log n}{2TG_{\infty}^{2}}} has regret bounded by

5 Randomized Regularization

The connection between stability in decision making and low regret has motivated our discussion of regularization thus far. However, this stability need not be achieved only using strongly convex regularization functions. An alternative method to achieve stability in decisions is by introducing randomization into the algorithm. In fact, historically, this method preceded methods based on strongly convex regularization (see bibliography).

In this section we first describe a deterministic algorithm for online convex optimization that is easily amenable to speedup via randomization. We then give an efficient randomized algorithm for the special case of OCO with linear losses.

For simplicity, we consider ourselves in this section with a slightly restricted version of OCO. So far, we have not restricted the cost functions in any way, and they could depend on the choice of decision by the online learner. However, when dealing with randomized algorithms, this issue becomes a bit more subtle: can the cost functions depend on the randomness of the decision making algorithm itself? Furthermore, when analyzing the regret, which is now a random variable, dependencies across different iterations require probabilistic machinery which adds little to the fundamental understanding of randomized OCO algorithms. To avoid these complications, we make the following assumption throughout this section: the cost functions {ft}\{\mathbf{f}_{t}\} are adversarially chosen ahead of time, and do not depend on the actual decisions of the online learner. This version of OCO is called the oblivious setting, to distinguish it from the adaptive setting.

5.1 Perturbation for convex losses

The prediction in Algorithm 16 is according to a version of the follow-the-leader algorithm, augmented with an additional component of randomization. It is a deterministic algorithm that computes the expected decision according to a random variable. The random variable is the minimizer over the decision set according to the sum of gradients of the cost functions and an additional random vector.

In practice, the expectation need not be computed exactly. Estimation (via random sampling) up to a precision that depends linearly on the number of iterations would suffice.

The first parameter, σ\sigma, is related to the variance of D{\mathcal{D}}, while the second, LL, is a measure of the sensitivity of the distribution11In harmonic analysis of Boolean functions, a similar quantity is called “average sensitivity”.. For example, if D{\mathcal{D}} is the uniform distribution over the hypercube n^{n}, then it holds that (see exercises) for the Euclidean norm

Reusing notation from previous chapters, denote by D=DaD=D_{a} the diameter of the set K\mathcal{K} according to the norm a\|\cdot\|_{a}, and by D=DaD^{*}=D_{a}^{*} the diameter according to its dual norm. Similarly, denote by G=GaG=G_{a} and G=GaG^{*}=G_{a}^{*} an upper bound on the norm (and dual norm) of the gradients.

Let the distribution D{\mathcal{D}} be (σ,L)(\sigma,L)-stable with respect to norm a\|\cdot\|_{a}. The FPL algorithm attains the following bound on the regret:

We can further optimize over the choice of η\eta to obtain

Define the random variable xtn=argminxK{ηs=1tsx+nx}\mathbf{x}_{t}^{\mathbf{n}}=\operatorname*{\arg\min}_{\mathbf{x}\in\mathcal{K}}\left\{\eta\sum_{s=1}^{t}\nabla_{s}^{\top}\mathbf{x}+\mathbf{n}^{\top}\mathbf{x}\right\}, and the random function g0ng_{0}^{\mathbf{n}} as

It follows from Lemma 5.4 applied to the functions {gt(x)=tx}\{g_{t}(\mathbf{x})=\nabla_{t}^{\top}\mathbf{x}\} that

We now argue that xtxt+1=O(η)\|\mathbf{x}_{t}-\mathbf{x}_{t+1}\|=O(\eta). Let

Notice that xt,xt+1\mathbf{x}_{t},\mathbf{x}_{t+1} may depend on each other. However, by linearity of expectation, we have that

Substituting this bound back into (5.5) we have

For the choice of D{\mathcal{D}} as the uniform distribution over the unit hypercube n^{n}, which has parameters σ2n\sigma_{2}\leq\sqrt{n} and L21L_{2}\leq 1 for the Euclidean norm, the optimal choice of η\eta gives a regret bound of DGn1/4TDGn^{1/4}\sqrt{T}. This is a factor n1/4{n}^{1/4} worse than the online gradient descent regret bound of Theorem 3.1. For certain decision sets K\mathcal{K} a better choice of distribution D{\mathcal{D}} results in near-optimal regret bounds.

5.2 Perturbation for linear cost functions

The case of linear cost functions ft(x)=gtxf_{t}(\mathbf{x})=\mathbf{g}_{t}^{\top}\mathbf{x} is of particular interest in the context of randomized regularization. Denote

By linearity of expectation, we have that

Thus, instead of computing xt\mathbf{x}_{t} precisely, we can sample a single vector n0D\mathbf{n}_{0}\sim{\mathcal{D}}, and use it to compute x^t=wt(n0)\hat{\mathbf{x}}_{t}=w_{t}(\mathbf{n}_{0}), as illustrated in Algorithm 17.

By the above arguments, we have that the expected regret for the random variables x^t\hat{\mathbf{x}}_{t} is the same as that for xt\mathbf{x}_{t}. We obtain the following Corollary:

The main advantage of this algorithm is computational: with a single linear optimization step over the decision set K\mathcal{K} (which does not even have to be convex!), we attain near optimal expected regret bounds.

5.3 Follow-the-perturbed-leader for expert advice

An interesting special case (and in fact the first use of perturbation in decision making) is that of non-negative linear cost functions over the unit nn-dimensional simplex with costs bounded by one, or the problem of prediction of expert advice we have considered in chapter 1.

Algorithm 17 applied to the probability simplex and with exponentially distributed noise is known as the follow-the-perturbed-leader for prediction from expert advice method. We spell it out in Algorithm 18.

Notice that we take the perturbation to be distributed according to the one-sided negative exponential distribution, i.e., n(i)eηx\mathbf{n}(i)\sim e^{-\eta x}, or more precisely

Corollary 5.9 gives regret bounds that are suboptimal for this special case, thus we give here an alternative analysis that gives tight bounds up to constants amounting to the following theorem.

Algorithm 18 outputs a sequence of predictions x^1,...,x^TΔn\hat{\mathbf{x}}_{1},...,\hat{\mathbf{x}}_{T}\in\Delta_{n} such that:

Notice that as a special case of the above theorem, choosing η=lognT\eta=\sqrt{\frac{\log n}{T}} yields a regret bound of

which is equivalent up to constant factors to the guarantee given for the Hedge algorithm in Theorem 1.5.

We start with the same analysis technique throughout this chapter: let g0=n\mathbf{g}_{0}=-\mathbf{n}. It follows from Lemma 5.4 applied to the functions {ft(x)=gtx}\{f_{t}(\mathbf{x})=\mathbf{g}_{t}^{\top}\mathbf{x}\} that

where the second inequality follows by the generalized Cauchy-Schwarz inequality, and the last inequality follows since (see exercises)

Above we have that gt1\|\mathbf{g}_{t}\|_{\infty}\leq 1 by assumption that the losses are bounded by one.

To bound the latter, notice that the probability x^t=eit\hat{\mathbf{x}}_{t}=\mathbf{e}_{i_{t}} is the leader at time tt is the probability that n(it)>v-\mathbf{n}({i_{t}})>v for some value vv that depends on the entire loss sequence till now. On the other hand, given x^t\hat{\mathbf{x}}_{t}, we have that x^t+1=x^t\hat{\mathbf{x}}_{t+1}=\hat{\mathbf{x}}_{t} remains the leader if n(it)>v+gt(it)-\mathbf{n}(i_{t})>v+\mathbf{g}_{t}(i_{t}), since it was a leader by a margin of more than the cost it will suffer. Thus,

Substituting this bound back into (5.5.3) we have

6 * Adaptive Gradient Descent

Thus far we have introduced regularization as a general methodology for deriving online convex optimization algorithms. The main theorem of this chapter, Theorem 5.2, bounds the regret of the RFTL algorithm for any strongly convex regularizer as

In addition, we have seen how to derive the online gradient descent and the multiplicative weights algorithms as special cases of the RFTL methodology. But are there other special cases of interest, besides these two basic algorithms, that warrant such general and abstract treatment?

There are surprisingly few cases of interest besides the Euclidean and Entropic regularizations and their matrix analogues12One such example is the self-concordant barrier regularization which we shall explore in the next chapter.. However, in this chapter we will give some justification of the abstract treatment of regularization.

Our treatment is motivated by the following question: thus far we have thought of RR as a strongly convex function. But which strongly convex function should we choose to minimize regret? This is a deep and difficult question which has been considered in the optimization literature since its early developments. Naturally, the optimal regularization should depend on both the convex underlying decision set, as well as the actual cost functions (see exercises for a natural candidate of a regularization function that depends on the convex decision set).

We shall treat this question no differently than we treat other optimization problems throughout this manuscript itself: we’ll learn the optimal regularization online! That is, a regularizer that adapts to the sequence of cost functions and is in a sense the “optimal” regularization to use in hindsight. This gives rise to the AdaGrad (Adaptive subGradient method) algorithm 19, which explicitely optimizes over the regularization choice in line (5) to minimize the gradient norms, which is the dominant expression in (5.7).

AdaGrad comes in two versions: diagonal and full matrix, the first being particularly efficient to implement with negligible computational overhead over online gradient descent. In the algorithm definition and throughout this chapter, the notation A1A^{-1} refers to the Moore-Penrose pseudoinverse of the matrix AA.

The computation in line (5) finds the regularization matrix HH which minimizes the norm of the gradients from within the positive semi-definite cone, with or without a diagonal constraint. This is closely related, as we shall see, to optimization w.r.t. two natural sets of matrices:

H1={H=diag(H),H0 , Tr(H)1}{\mathcal{H}}_{1}=\{H={\bf diag}(H),H\succeq 0\ ,\ {\bf Tr}(H)\leq 1\}

H2={H0 , Tr(H)1}{\mathcal{H}}_{2}=\{H\succeq 0\ ,\ {\bf Tr}(H)\leq 1\}.

This results in a regularization matrix that is provably optimal in the following sense,

For Hi{H1,H2}{\mathcal{H}}_{i}\in\{{\mathcal{H}}_{1},{\mathcal{H}}_{2}\} with the corresponding HTH_{T},

Using this lemma, we show the regret of AdaGrad is at most a constant factor larger than the minimum regret of all RFTL algorithm with regularization functions whose Hessian is fixed and belongs to the class Hi{\mathcal{H}}_{i}. Furthermore, the regret of the diagonal version can be a factor d\sqrt{d} smaller than that of online gradient descent for certain gradient geometries. The regret bound on AdaGrad is formally stated in the following theorem.

Let {xt}\{\mathbf{x}_{t}\} be defined by Algorithm 19 with parameters η=D\eta={D} (full matrix) or η=D\eta=D_{\infty} (diagonal). Then for any xK\mathbf{x}^{\star}\in\mathcal{K},

Before proceeding to the analysis, we consider when the regret bounds for AdaGrad improve upon those of Online Gradient Descent. One such case is when K\mathcal{K} is the unit cube in dd-dimensional Euclidean space. This convex set has D=1D_{\infty}=1 and D=dD=\sqrt{d}. Lemma 5.11 and Theorems 5.12,5.2 imply that the regret of diagonal AdaGrad and OGD are bounded by

The relationship between the two terms depends on the matrix diag(GT){\bf diag}(G_{T}). If this matrix is sparse, then AdaGrad has a superior bound by at most d\sqrt{d} factor. For other convex bodies, such as the Euclidean ball, and when the matrix GTG_{T} is dense, the regret of OGD can be a factor d\sqrt{d} lower.

We proceed with the proof of Theorem 5.12. The first component is the following Lemma, which generalizes the RFTL analysis to changing regularization.

Let H0=argminH0{Tr(H)}=0H_{0}=\operatorname*{\arg\min}_{H\succeq 0}\left\{{\bf Tr}(H)\right\}=0,

Multiplying the transpose of the first equation by the second we get

Since xt+1\mathbf{x}_{t+1} is the projection of yt+1\mathbf{y}_{t+1} in the norm induced by HtH_{t}, we have (see §2.1.1)

This inequality is the reason for using generalized projections as opposed to standard projections, which were used in the analysis of online gradient descent (see §3.1 Equation (3.2)). This fact together with (5.8) gives

Now, summing up over t=1t=1 to TT we get that

In the last inequality we use the definition H0=0H_{0}=0. We proceed to bound the first term. To this end, define the functions

By definition, HtH_{t} is the minimizer of i=0tΨi\sum_{i=0}^{t}\Psi_{i} over H{\mathcal{H}}. Therefore, using the BTL Lemma 5.4, we have that

We can now continue with the proof of Theorem 5.12.

We bound both parts of Lemma 5.13, with the following two lemmas,

For both the diagonal and full matrix versions of AdaGrad, the following holds

Now combining Lemma 5.13 with the above two lemmas, and using η=D2\eta=\frac{D}{\sqrt{2}} or η=D2\eta=\frac{D_{\infty}}{\sqrt{2}} appropriately, we obtain the theorem. ∎

We proceed to complete the proof of the two lemmas above.

The optimization problem of choosing HtH_{t} in line (5) of Algorithm 19 has an explicit solution, given in the following proposition (whose proof is left as an exercise).

Consider the following optimization problems, for A0A\succcurlyeq 0:

Then the global optimizer to these problems is obtained at X=A1/2Tr(A1/2)X=\frac{A^{1/2}}{{\bf Tr}(A^{1/2})} and X=A1/2X=A^{1/2} respectively. Over the set of diagonal matrices, the global optimizer is obtained at X=diag(A)1/2Tr(A1/2)X=\frac{{\bf diag}(A)^{1/2}}{{\bf Tr}(A^{1/2})} and X=diag(A)1/2X={\bf diag}(A)^{1/2} respectively.

A direct corollary of this proposition gives Lemma 5.11 as follows:

The remaining term from Lemma 5.13 is the expression t=0TxtxHtHt12\sum_{t=0}^{T}\|\mathbf{x}_{t}-\mathbf{x}^{\star}\|^{2}_{H_{t}-H_{t-1}}, which we proceed to bound.

By definition GtGt1G_{t}\succcurlyeq G_{t-1}, and hence using proposition 5.16 and the definition of HtH_{t} in line (5), we have that Ht=diag(Gt1/2)diag(Gt11/2)=Ht1H_{t}={\bf diag}(G_{t}^{1/2})\succcurlyeq{\bf diag}(G_{t-1}^{1/2})=H_{t-1}. Since for a diagonal matrix HH it holds that xHxx2Tr(H)\mathbf{x}^{\top}H\mathbf{x}\leq\|\mathbf{x}\|_{\infty}^{2}{\bf Tr}(H), we have

Next, we consider the full matrix case. By definition GtGt1G_{t}\succcurlyeq G_{t-1}, and hence HtHt1H_{t}\succcurlyeq H_{t-1}. Thus,

7 Bibliographic Remarks

Regularization in the context of online learning was first studied in (Grove et al., 2001) and (Kivinen and Warmuth, 2001). The influential paper of Kalai and Vempala (2005) coined the term “follow-the-leader” and introduced many of the techniques that followed in OCO. The latter paper studies random perturbation as a regularization and analyzes the follow-the-perturbed-leader algorithm, following an early development by Hannan (1957) that was overlooked in learning for many years.

In the context of OCO, the term follow-the-regularized-leader was coined in (Shalev-Shwartz and Singer, 2007; Shalev-Shwartz, 2007), and at roughly the same time an essentially identical algorithm was called “RFTL” in (Abernethy et al., 2008). The equivalence of RFTL and Online Mirror Descent was observed by (Hazan and Kale, 2008). The AdaGrad algorithm was introduced in (Duchi et al., 2010, 2011), its diagonal version was also discovered in parallel in (McMahan and Streeter, 2010). The analysis of AdaGrad presented in this chapter is due to (Gupta et al., 2017).

Adaptive regularization has received significant attention due to its success in training deep neural networks, and notably the development of adaptive algorithms that incorporate momentum and other heuristics, most popular of which are AdaGrad, RMSprop (Tieleman and Hinton, 2012) and Adam (Kingma and Ba, 2014). For a survey of optimization for deep learning, see the comprehensive text of Goodfellow et al. (2016).

There is a strong connection between randomized perturbation and deterministic regularization. For some special cases, adding randomization can be thought of as a special case of deterministic strongly convex regularization, see (Abernethy et al., 2014, 2016).

8 Exercises

1. This exercise concerns the notion of a dual norm.

(a) Show that the dual norm to a matrix norm given by A0A\succ 0 corresponds to the matrix norm of A1A^{-1}.

(b) Prove the generalized Cauchy-Schwarz inequality for any norm, i.e.,

2. Prove that the Bregman divergence is equal to the local norm at an intermediate point, that is:

where z[x,y]\mathbf{z}\in[\mathbf{x},\mathbf{y}], and the interval [x,y][\mathbf{x},\mathbf{y}] is defined as

3. Let R(x)=12xx02{R}(\mathbf{x})=\frac{1}{2}\|\mathbf{x}-\mathbf{x}_{0}\|^{2} be the (shifted) Euclidean regularization function. Prove that the corresponding Bregman divergence is the Euclidean metric. Conclude that projections with respect to this divergence are standard Euclidean projections.

4. Prove that the agile and lazy versions of the OMD meta-algorithm are different, in the sense that they can produce different predictions over the same setting and cost functions. Show this for the case that the regularization is Euclidean and the decision set is the Euclidean ball.

6. Prove that for the uniform distribution D{\mathcal{D}} over the unit hypercube n^{n}, the parameters σ,L\sigma,L defined in §5.5 with respect to the Euclidean norm can be bounded as σ<n , L1\sigma<\sqrt{n}\ ,\ L\leq 1.

7. Let D{\mathcal{D}} be a one-sided multi-dimensional exponential distribution, such that a vector nD\mathbf{n}\sim{\mathcal{D}} is distributed over each coordinate exponentially:

(Hint: use the Chernoff bound) Extra credit: prove that \mboxEnD[n]=Hn,\mathop{\mbox{\bf E}}_{\mathbf{n}\sim{\mathcal{D}}}[\|\mathbf{n}\|_{\infty}]=H_{n}, where HnH_{n} is the nn-th harmonic number.

Prove that K\|\cdot\|_{\mathcal{K}} is a norm if and only if K\mathcal{K} is convex.

9. ∗∗ Prove a lower bound of Ω(T)\Omega(T) on the regret of the RFTL algorithm with K\|\cdot\|_{\mathcal{K}} as a regularizer.

10. ∗ Prove that for positive definite matrices AB0A\succcurlyeq B\succ 0 it holds that

(b) 2Tr((AB)1/2)+Tr(A1/2B)2Tr(A1/2).2{\bf Tr}(({A-B})^{1/2})+{\bf Tr}(A^{-1/2}B)\leq 2{\bf Tr}({A}^{1/2}).

11. ∗ Consider the following minimization problem where A0A\succ 0:

Prove that its minimizer is given by X=A1/2/Tr(A1/2)X=A^{1/2}/{\bf Tr}(A^{1/2}), and the minimum is obtained at Tr2(A1/2){\bf Tr}^{2}(A^{1/2}).

Chapter 6 Bandit Convex Optimization

In many real-world scenarios the feedback available to the decision maker is noisy, partial or incomplete. Such is the case in online routing in data networks, in which an online decision maker iteratively chooses a path through a known network, and her loss is measured by the length (in time) of the path chosen. In data networks, the decision maker can measure the RTD (round trip delay) of a packet through the network, but rarely has access to the congestion pattern of the entire network.

Another useful example is that of online ad placement in web search. The decision maker iteratively chooses an ordered set of ads from an existing pool. Her reward is measured by the viewer’s response—if the user clicks a certain ad, a reward is generated according to the weight assigned to the particular ad. In this scenario, the search engine can inspect which ads were clicked through, but cannot know whether different ads, had they been chosen to be displayed, would have been clicked through or not.

The examples above can readily be modeled in the OCO framework, with the underlying sets being the convex hull of decisions. The pitfall of the general OCO model is the feedback; it is unrealistic to expect that the decision maker has access to a gradient oracle at any point in the space for every iteration of the game.

The Bandit Convex Optimization (short: BCO) model is identical to the general OCO model we have explored in previous chapters with the only difference being the feedback available to the decision maker.

As before, let TT denote the total number of game iterations (i.e., predictions and their incurred loss). Let A{\mathcal{A}} be an algorithm for BCO, which maps a certain game history to a decision in the decision set. We formally define the regret of A{\mathcal{A}} that predicted x1,...,xTx_{1},...,x_{T} to be

2 The Multiarmed Bandit (MAB) Problem

A classical model for decision making under uncertainty is the multiarmed bandit (MAB) model. The term MAB nowadays refers to a multitude of different variants and sub-scenarios that are too large to survey. This section addresses perhaps the simplest variant—the non-stochastic MAB problem—which is defined as follows:

Iteratively, a decision maker chooses between nn different actions it{1,2,...,n}i_{t}\in\{1,2,...,n\}, while, at the same time, an adversary assigns each action a loss in the range $.Thedecisionmakerreceivesthelossfor. The decision maker receives the loss fori_{t}$ and observes this loss, and nothing else. The goal of the decision maker is to minimize her regret.

The reader undoubtedly observes this setting is identical to the setting of prediction from expert advice, the only difference being the feedback available to the decision maker: whereas in the expert setting the decision maker can observe the rewards or losses for all experts in retrospect, in the MAB setting, only the losses of the decisions actually chosen are known.

It is instructive to explicitly model this problem as a special case of BCO. Take the decision set to be the set of all distributions over nn actions, i.e., K=Δn\mathcal{K}=\Delta_{n} is the nn-dimensional simplex. The loss function is taken to be the linearization of the costs of the individual actions, that is:

The MAB problem exhibits an exploration-exploitation tradeoff: an efficient (low regret) algorithm has to explore the value of the different actions in order to make the best decision. On the other hand, having gained sufficient information about the environment, a reasonable algorithm needs to exploit this action by picking the best action.

The simplest way to attain a MAB algorithm would be to separate exploration and exploitation. Such a method would proceed by

With some probability, explore the action space (i.e., by choosing an action uniformly at random). Use the feedback to construct an estimate of the actions’ losses.

Otherwise, use the estimates to apply a full-information experts algorithm as if the estimates are the true historical costs.

This simple scheme already gives a sublinear regret algorithm, presented in algorithm 20.

Algorithm 20, with A\mathcal{A} being the the online gradient descent algorithm, guarantees the following regret bound:

Therefore the regret of the simple algorithm can be related to that of A\mathcal{A} on the estimated functions.

On the other hand, the simple MAB algorithm does not always play according to the distribution generated by A\mathcal{A}: with probability δ\delta it plays uniformly at random, which may lead to a regret of one on these exploration iterations. Let St[T]S_{t}\subseteq[T] be those iterations in which bt=1b_{t}=1. This is captured by the following lemma:

The simple algorithm of the previous section can be improved by combining the exploration and exploitation steps. This gives a near-optimal regret algorithm, called EXP3, presented below.

Ultimately, the EXP3 algorithm attains a worst case regret bound of O(Tnlogn)O(\sqrt{Tn\log n}), which is nearly optimal (up to a logarithmic term in the number of actions).

Algorithm 21 with non-negative losses and ε=lognTn\varepsilon=\sqrt{\frac{\log n}{Tn}} guarantees the following regret bound:

We proceed to derive an algorithm for the more general setting of bandit convex optimization that attains near-optimal regret.

3 A Reduction from Limited Information to Full Information

In this section we derive a low regret algorithm for the general setting of bandit convex optimization. In fact, we shall describe a general technique for designing bandit algorithms, which is composed of two parts:

A general technique for taking an online convex optimization algorithm that uses only the gradients of the cost functions (formally defined below), and applying it to a family of vector random variables with carefully chosen properties.

Designing the random variables that allow the template reduction to produce meaningful regret guarantees.

We proceed to describe the two parts of this reduction, and in the remainder of this chapter we describe two examples of using this reduction to design bandit convex optimization algorithms.

The key idea behind many of the efficient algorithms for bandit convex optimization is the following: although we cannot calculate ft(xt)\nabla f_{t}(\mathbf{x}_{t}) explicitly, it is possible to find an observable random variable gt\mathbf{g}_{t} that satisfies \mboxE[gt]ft(xt)=t\mathop{\mbox{\bf E}}[\mathbf{g}_{t}]\approx\nabla f_{t}(\mathbf{x}_{t})=\nabla_{t}. Thus, gt\mathbf{g}_{t} can be seen as an estimator of the gradient. By substituting gt\mathbf{g}_{t} for t\nabla_{t} in an OCO algorithm, we will show that many times it retains its sublinear regret bound.

Formally, the family of regret minimization algorithms for which this reduction works is captured in the following definition.

(first order OCO Algorithm) Let A{\mathcal{A}} be an OCO (deterministic) algorithm receiving an arbitrary sequence of differential loss functions f1,,fTf_{1},\ldots,f_{T}, and producing decisions x1A(),xtA(f1,,ft1)\mathbf{x}_{1}\leftarrow{\mathcal{A}}(\emptyset),\mathbf{x}_{t}\leftarrow{\mathcal{A}}(f_{1},\ldots,f_{t-1}). A{\mathcal{A}} is called a first order online algorithm if the following holds:

Let f^t\hat{f}_{t} be the linear function f^t(x)=ft(xt)x\hat{f}_{t}(\mathbf{x})=\nabla f_{t}(\mathbf{x}_{t})^{\top}\mathbf{x}, then for every iteration t[T]t\in[T]:

We can now consider a formal reduction from any first order online algorithm to a bandit convex optimization algorithm as follows.

Perhaps surprisingly, under very mild conditions the reduction above guarantees the same regret bounds as the original first order algorithm up to the magnitude of the estimated gradients. This is captured in the following lemma.

Then the following holds for all uK\mathbf{u}\in\mathcal{K}:

Therefore, deterministically applying a first order method A{\mathcal{A}} on the random functions hth_{t} is equivalent to applying A{\mathcal{A}} on a stochastic first order approximation of the deterministic functions ftf_{t}. Thus by the full-information regret bound of A{\mathcal{A}} we have:

where we used \mboxE[ξtx1,f1,,xt,ft]=0\mathop{\mbox{\bf E}}[\boldsymbol{\xi}_{t}|\mathbf{x}_{1},f_{1},\ldots,\mathbf{x}_{t},f_{t}]=0. Similarly, since uK\mathbf{u}\in\mathcal{K} is fixed we have that \mboxE[ht(u)]=ft(u)\mathop{\mbox{\bf E}}[h_{t}(\mathbf{u})]=f_{t}(\mathbf{u}). The lemma follows from taking the expectation of Equation (6.2). ∎

3.2 Part 2: point-wise gradient estimators

In the preceding part we have described how to convert a first order algorithm for OCO to one that uses bandit information, using specially tailored random variables. We now describe how to create these vector random variables.

Although we cannot calculate ft(xt)\nabla f_{t}(\mathbf{x}_{t}) explicitly, it is possible to find an observable random variable gt\mathbf{g}_{t} that satisfies \mboxE[gt]ft\mathop{\mbox{\bf E}}[\mathbf{g}_{t}]\approx\nabla f_{t}, and serves as an estimator of the gradient.

The question is how to find an appropriate gt\mathbf{g}_{t}, and in order to answer it we begin with an example in a 1-dimensional case.

A 1-dimensional gradient estimate Recall the definition of the derivative:

The above shows that for a 1-dimensional derivative, two evaluations of ff are required. Since in our problem we can perform only one evaluation, let us define g(x)g(x) as follows:

Thus, in expectation, for small δ\delta, g(x)g(x) approximates f(x)f^{\prime}(x).

We define f^(x)=f^δ(x)\hat{f}(\mathbf{x})=\hat{f}_{\delta}(\mathbf{x}) to be a δ\delta-smoothed version of f(x)f(\mathbf{x}):

where v\mathbf{v} is drawn from a uniform distribution over the unit ball. This construction is very similar to the one used in Lemma 2.8 in context of convergence analysis for convex optimization. However, our goal here is very different.

Note that when ff is linear, we have f^δ(x)=f(x)\hat{f}_{\delta}(\mathbf{x})=f(\mathbf{x}). We shall address the case in which ff is indeed linear as a special case, and show how to estimate the gradient of f^(x)\hat{f}(\mathbf{x}), which, under the assumption, is also the gradient of f(x)f(\mathbf{x}). The following lemma shows a simple relation between the gradient f^δ\nabla\hat{f}_{\delta} and a uniformly drawn unit vector.

Using Stokes’ theorem from calculus, we have

From (6.4), and by definition of expectation, we have

where \mboxvol(Bδ)\mbox{vol}(B_{\delta}) is the volume of an n-dimensional ball of radius δ\delta. Similarly,

Combining (6.4), (6.5), (6.6), and (6.7), and the fact that the ratio of the volume of a ball in nn dimensions and the sphere of dimension n1n-1 is volnBδ/voln1Sδ=δ/n\textrm{vol}_{n}B_{\delta}/\textrm{vol}_{n-1}S_{\delta}=\delta/n gives the desired result. ∎

Under the assumption that ff is linear, Lemma 6.7 suggests a simple estimator for the gradient f\nabla f. Draw a random unit vector u\mathbf{u}, and let g(x)=nδf(x+δu)ug\left(\mathbf{x}\right)=\frac{n}{\delta}f\left(\mathbf{x}+\delta\mathbf{u}\right)\mathbf{u}.

The sphere estimator above is at times difficult to use: when the center of the sphere is very close to the boundary of the decision set only a very small sphere can fit completely inside. This results in a gradient estimator with large variance.

In such cases, it is useful to consider ellipsoids rather than spheres. Luckily, the generalisation to ellipsoidal sampling for gradient estimation is a simple corollary of our derivation above:

4 Online Gradient Descent without a Gradient

The simplest and historically earliest application of the BCO-to-OCO reduction outlined before is the application of the online gradient descent algorithm to the bandit setting. The FKM algorithm (named after its inventors, see bibliographic section) is outlined in algorithm 23.

For simplicity, we assume that the set K\mathcal{K} contains the unit ball centered at the zero vector, denoted 0\mathbf{0}. Denote Kδ={x  11δxK}\mathcal{K}_{\delta}=\{\mathbf{x}\ |\ \frac{1}{1-\delta}\mathbf{x}\in\mathcal{K}\}. It is left as an exercise to show that Kδ\mathcal{K}_{\delta} is convex for any 0<δ<10<\delta<1 and that all balls of radius δ\delta around points in Kδ\mathcal{K}_{\delta} are contained in K\mathcal{K}.

We also assume for simplicity that the adversarially chosen cost functions are bounded by one over K\mathcal{K}, i.e., that ft(x)1|\mathbf{f}_{t}(\mathbf{x})|\leq 1 for all xK\mathbf{x}\in\mathcal{K}.

The FKM algorithm is an instantiation of the generic reduction from bandit convex optimization to online convex optimization with spherical gradient estimators over the set Kδ\mathcal{K}_{\delta}. It iteratively projects onto Kδ\mathcal{K}_{\delta}, in order to have enough space for spherical gradient estimation. This degrades its performance by a controlled quantity. Its regret is bounded as follows.

Algorithm 23 with parameters  η=DnT3/4,δ=1T1/4\ \eta=\frac{D}{nT^{3/4}},\delta=\frac{1}{T^{1/4}} guarantees the following expected regret bound

Recall our notation of x=argminxKt=1Tft(x)\mathbf{x}^{\star}=\operatorname*{\arg\min}_{\mathbf{x}\in\mathcal{K}}\sum_{t=1}^{T}f_{t}(\mathbf{x}). Denote

Then by properties of projections we have xδxδD\|\mathbf{x}_{\delta}^{\star}-\mathbf{x}^{\star}\|\leq\delta D, where DD is the diameter of K\mathcal{K}. Thus, assuming that the cost functions {ft}\{f_{t}\} are GG-Lipschitz, we have

5 * Optimal Regret Algorithms for Bandit Linear Optimization

A special case of BCO that is of considerable interest is BLO—Bandit Linear Optimization. This setting has linear cost functions, and captures the network routing and ad placement examples discussed in the beginning of this chapter, as well as the non-stochastic MAB problem.

In this section we give near-optimal regret bounds for BLO using techniques from interior point methods for convex optimization.

The generic OGD method of the previous section suffers from three pitfalls:

The gradient estimators are biased, and estimate the gradient of a smoothed version of the real cost function.

The gradient estimators require enough “wiggle room” and are thus ill-defined on the boundary of the decision set.

The gradient estimates have potentially large magnitude, proportional to the distance from the boundary.

Fortunately, the first issue is non-existent for linear functions - the gradient estimators turn out to be unbiased for linear functions. In the notation of the previous chapters, we have for linear functions:

Thus, Lemma 6.7 gives us a stronger guarantee:

To resolve the second and third issues we use self-concordant barrier functions, a rather advanced technique from interior point methods for convex optimization.

Self-concordant barrier functions were devised in the context of interior point methods for optimization as a way of ensuring that the Newton method converges in polynomial time over bounded convex sets. In this brief introduction we survey some of their beautiful properties that will allow us to derive an optimal regret algorithm for BLO.

R\mathcal{R} is three times continuously differentiable and convex, and approaches infinity along any sequence of points approaching the boundary of K\mathcal{K}.

where the third order differential is defined as:

We assume that 2R(x)\nabla^{2}\mathcal{R}(\mathbf{x}) always has full rank. In BCO applications this is easy to ensure by adding a fictitious quadratic function to the barrier, which does not affect the overall regret by more than a constant.

Let R\mathcal{R} be a self-concordant barrier and xint(K)\mathbf{x}\in\text{int}(\mathcal{K}). The Dikin ellipsoid is

i.e., the x\|\cdot\|_{\mathbf{x}}-unit ball centered around x\mathbf{x}, is completely contained in K\mathcal{K}.

In our next analysis we will need to bound R(y)R(x)\mathcal{R}(\mathbf{y})-\mathcal{R}(\mathbf{x}) for x,yint(K)\mathbf{x},\mathbf{y}\in\text{int}(\mathcal{K}), for which the following lemma is useful:

Let R\mathcal{R} be a ν\nu-self concordant function over K\mathcal{K}, then for all x,yint(K)\mathbf{x},\mathbf{y}\in\text{int}(\mathcal{K}):

where πx(y)=inf{t0:x+t1(yx)K}.\pi_{\mathbf{x}}(\mathbf{y})=\inf\{t\geq 0:\mathbf{x}+t^{-1}(\mathbf{y}-\mathbf{x})\in\mathcal{K}\}.

The function πx(y)\pi_{\mathbf{x}}(\mathbf{y}) is called the Minkowski function for K\mathcal{K}, and its output is always in the interval $.Moreover,as. Moreover, asyapproachestheboundaryofapproaches the boundary of\mathcal{K}thenthen\pi_{\mathbf{x}}(\mathbf{y})\to 1$.

Another important property of self-concordant functions is the relationship between a point and the optimum, and the norm of the gradient at the point, according to the local norm, as given by the following lemma.

Let xint(K)\mathbf{x}\in\text{int}(\mathcal{K}) be such that R(x)x14\|\nabla\mathcal{R}(\mathbf{x})\|_{\mathbf{x}}^{*}\leq\frac{1}{4}, and let x=argminxKR(x)\mathbf{x}^{\star}=\operatorname*{\arg\min}_{\mathbf{x}\in\mathcal{K}}\mathcal{R}(\mathbf{x}). Then

5.2 A near-optimal algorithm

We have now set up all the necessary tools to derive a near-optimal BLO algorithm, presented in algorithm 24.

For appropriate choice of η,δ\eta,\delta, the SCRIBLE algorithm guarantees

First, we note that ytK\mathbf{y}_{t}\in\mathcal{K} never steps outside of the decision set. The reason is that xtK\mathbf{x}_{t}\in\mathcal{K} and yt\mathbf{y}_{t} lies in the Dikin ellipsoid centered at xt\mathbf{x}_{t}.

where the latter equality follows since ftf_{t} is linear, and thus its smoothed version is identical to itself.

A final observation is that line 6 in the algorithm is an invocation of the RFTL algorithm with the self-concordant barrier R\mathcal{R} serving as a regularisation function. The RFTL algorithm for linear functions is a first order OCO algorithm and thus Lemma 6.5 applies.

Here we use our notation from the previous chapter for the local norm ht=hxt=h2R(xt)h\|\mathbf{h}\|_{t}=\|\mathbf{h}\|_{\mathbf{x}_{t}}=\sqrt{\mathbf{h}^{\top}\nabla^{2}\mathcal{R}(\mathbf{x}_{t})\mathbf{h}}.

To bound the last expression, we use Lemma 6.12, and the definition of xt+1=argminxKΦt(x)\mathbf{x}_{t+1}=\operatorname*{\arg\min}_{\mathbf{x}\in\mathcal{K}}\Phi_{t}(\mathbf{x}) where Φt(x)=ητ=1tgτx+R(x)\Phi_{t}(\mathbf{x})=\eta\sum_{\tau=1}^{t}\mathbf{g}_{\tau}^{\top}\mathbf{x}+\mathcal{R}(\mathbf{x}) is a self-concordant barrier. Thus,

since Φt1(xt)=0\nabla\Phi_{t-1}(\mathbf{x}_{t})=0 by definition of xt\mathbf{x}_{t}. Recall that to use Lemma 6.12, we need Φt(xt)t=ηgtt14\|\nabla\Phi_{t}(\mathbf{x}_{t})\|_{t}^{*}=\eta\|\mathbf{g}_{t}\|_{t}^{*}\leq\frac{1}{4}, which is true by choice of η\eta and since

It remains to bound the Bregman divergence with respect to x\mathbf{x}^{\star}, for which we use a similar technique as in the analysis of algorithm 23, and bound the regret with respect to xδ\mathbf{x}^{\star}_{\delta}, which is the projection of x\mathbf{x}^{\star} onto Kδ\mathcal{K}_{\delta}. Using equation (6.8), we can bound the overall regret by:

Taking η=O(1T)\eta=O(\frac{1}{\sqrt{T}}) and δ=O(1T)\delta=O(\frac{1}{T}), the above bound implies our theorem.

6 Bibliographic Remarks

The Multi-Armed Bandit problem has history going back more than fifty years to the work of Robbins (1952), see the survey of Bubeck and Cesa-Bianchi (2012) for a much more detailed history. The non-stochastic MAB problem and the EXP3 algorithm, as well as tight lower bounds were given in the seminal paper of Auer et al. (2003). The logarithmic gap in attainable regret for non-stochastic MAB was resolved in (Audibert and Bubeck, 2009).

Bandit Convex Optimization for the special case of linear cost functions and the flow polytope, was introduced and studied by Awerbuch and Kleinberg (2008) in the context of online routing. The full generality BCO setting was introduced by Flaxman et al. (2005), who gave the first efficient and low-regret algorithm for BCO. Tight bounds for BCO were obtained by Bubeck et al. (2015) for the one dimensional case, via an inefficient algorithm by Hazan and Li (2016), and finally with a polynomial time algorithm in Bubeck et al. (2017).

The special case in which the cost functions are linear, called Bandit Linear Optimization, received significant attention. Dani et al. (2008) gave an optimal regret algorithm up to constants depending on the dimension. Abernethy et al. (2008) gave an efficient algorithm and introduced self-concordant barriers to the bandit setting. Self-concordant barrier functions were devised in the context of polynomial-time algorithms for convex optimization in the seminal work of Nesterov and Nemirovskii (1994). Lower bounds for regret in the bandit linear optimization setting were studied by Shamir (2015).

In this chapter we have considered the expected regret as a performance metric. Significant literature is devoted to high probability guarantees on the regret. High probability bounds for the MAB problem were given in (Auer et al., 2003), and for bandit linear optimization in (Abernethy and Rakhlin, 2009). Other more refined metrics have been recently explored in (Dekel et al., 2012) and in the context of adaptive adversaries in (Neu et al., 2014; Yu and Mannor, 2009; Even-Dar et al., 2009; Mannor and Shimkin, 2003; Yu et al., 2009).

For a recent comprehensive text on bandit algorithms see (Lattimore and Szepesvári, 2020).

7 Exercises

1. Prove a lower bound on the regret of any algorithm for BCO: show that for the special case of BCO over the unit sphere, any online algorithm must incur a regret of Ω(T)\Omega(\sqrt{T}).

3. Let K\mathcal{K} be convex. Show that the set Kδ\mathcal{K}_{\delta} is convex.

4. Let K\mathcal{K} be convex and contain the unit ball centered at zero. Show that for any point xKδ\mathbf{x}\in\mathcal{K}_{\delta}, the ball of radius δ\delta centered at x\mathbf{x} is contained in K\mathcal{K}.

6. Consider the BCO setting with the following twist: at every iteration, the player is allowed to observe two evaluations of the function, as opposed to just one. That is, the player gives xt,ytx_{t},y_{t}, and observes ft(xt),ft(yt)f_{t}(x_{t}),f_{t}(y_{t}). Regret is measured w.r.t. xtx_{t}, as usual:

(a) Show how to construct a biased gradient estimator for ftf_{t} with arbitrary small bias and constant variance, that is independent of the bias.

(b) Show how to use the gradient estimator from the previous part to give an efficient algorithm for this setting that attains O(T)O(\sqrt{T}) regret.

Chapter 7 Projection-Free Algorithms

In many computational and learning scenarios the main bottleneck of optimization, both online and offline, is the computation of projections onto the underlying decision set (see §2.1.1). In this chapter we introduce projection-free methods for online convex optimization, that yield more efficient algorithms in these scenarios.

The motivating example throughout this chapter is the problem of matrix completion, which is a widely used and accepted model in the construction of recommendation systems. For matrix completion and related problems, projections amount to expensive linear algebraic operations and avoiding them is crucial in big data applications.

We start with a detour into classical offline convex optimization and describe the conditional gradient algorithm, also known as the Frank-Wolfe algorithm. Afterwards, we describe problems for which linear optimization can be carried out much more efficiently than projections. We conclude with an OCO algorithm that eschews projections in favor of linear optimization, in much the same flavor as its offline counterpart.

The vectors u,v\mathbf{u},\mathbf{v} are called the left and right singular vectors respectively. The non-zero singular values are the square roots of the eigenvalues of the matrix XXXX^{\top} (and XXX^{\top}X). The matrix XX can be written as

where ρ=min{n,m}\rho=\min\{n,m\}, the matrix UU is an orthogonal basis of the left singular vectors of XX, the matrix VV is an orthogonal basis of right singular vectors, and Σ\Sigma is a diagonal matrix of singular values. This form is called the singular value decomposition for XX.

It can be shown (see exercises) that the nuclear norm is equal to the trace of the square root of the matrix times its transpose, i.e.,

2 Motivation: Recommender Systems

Media recommendations have changed significantly with the advent of the Internet and rise of online media stores. The large amounts of data collected allow for efficient clustering and accurate prediction of users’ preferences for a variety of media. A well-known example is the so called “Netflix challenge”—a competition of automated tools for recommendation from a large dataset of users’ motion picture preferences.

One of the most successful approaches for automated recommendation systems, as proven in the Netflix competition, is matrix completion. Perhaps the simplest version of the problem can be described as follows.

The entire dataset of user-media preference pairs is thought of as a partially-observed matrix. Thus, every person is represented by a row in the matrix, and every column represents a media item (movie). For simplicity, let us think of the observations as binary—a person either likes or dislikes a particular movie. Thus, we have a matrix M{0,1,}n×mM\in\{0,1,*\}^{n\times m} where nn is the number of persons considered, mm is the number of movies at our library, and 0/10/1 and * signify “dislike”, “like” and “unknown” respectively:

The natural goal is to complete the matrix, i.e., correctly assign or 11 to the unknown entries. As defined so far, the problem is ill-posed, since any completion would be equally good (or bad), and no restrictions have been placed on the completions.

The intuitive interpretation of this property is that each entry in MM can be explained by only kk numbers. In matrix completion this means, intuitively, that there are only kk factors that determine a persons preference over movies, such as genre, director, actors and so on.

Now the simplistic matrix completion problem can be well-formulated as in the following mathematical program. Denote by ob\|\cdot\|_{ob} the Euclidean norm only on the observed (non starred) entries of MM, i.e.,

The mathematical program for matrix completion is given by

Since the constraint over the rank of a matrix is non-convex, it is standard to consider a relaxation that replaces the rank constraint by the nuclear norm. It is known that the nuclear norm is a lower bound on the matrix rank if the singular values are bounded by one (see exercises). Thus, we arrive at the following convex program for matrix completion:

We consider algorithms to solve this convex optimization problem next.

3 The Conditional Gradient Method

In this section we return to the basics of convex optimization—minimization of a convex function over a convex domain as studied in chapter 2.

Note that in the CG method, the update to the iterate xt\mathbf{x}_{t} may be not be in the direction of the gradient, as vt\mathbf{v}_{t} is the result of a linear optimization procedure in the direction of the negative gradient. This is depicted in figure 7.1.

The following theorem gives an essentially tight performance guarantee of this algorithm over smooth functions. Recall our notation from chapter 2: x\mathbf{x}^{\star} denotes the global minimizer of ff over K\mathcal{K}, DD denotes the diameter of the set K\mathcal{K}, and ht=f(xt)f(x)h_{t}=f(\mathbf{x}_{t})-f(\mathbf{x}^{\star}) denotes the suboptimality of the objective value in iteration tt.

The CG algorithm applied to β\beta-smooth functions with step sizes ηt=min{1,2t}\eta_{t}=\min\{1,\frac{2}{t}\} attains the following convergence guarantee

As done before in this manuscript, we denote t=f(xt)\nabla_{t}=\nabla f(\mathbf{x}_{t}). For any set of step sizes, we have

We reached the recursion ht+1(1ηt)ht+ηt2βD22h_{t+1}\leq(1-\eta_{t})h_{t}+\eta_{t}^{2}\frac{\beta D^{2}}{2}, and by Lemma 7.2 we obtain,

Let {ht}\{h_{t}\} be a sequence that satisfies the recurence

Then taking ηt=min{1,2t}\eta_{t}=\min\{1,\frac{2}{t}\} implies

As an example of an application for the conditional gradient algorithm, recall the mathematical program given by (7.1). The gradient of the objective function at point XtX^{t} is

Over the set of bounded-nuclear norm matrices, the linear optimization of line 3 in algorithm 25 becomes,

For simplicity, let’s consider square symmetric matrices, for which the nuclear norm is equivalent to the trace norm, and the above optimization problem becomes

It can be shown that this program is equivalent to the following (see exercises):

Hence, this is an eigenvector computation in disguise! Computing the largest eigenvector of a matrix takes linear time via the power method, which also applies more generally to computing the largest singular value of rectangular matrices. With this, step 3 of algorithm 25, which amounts to mathematical program (7.1), becomes computing vmax(f(Xt))v_{\max}(-\nabla f(X^{t})), the largest eigenvector of f(Xt)-\nabla f(X^{t}). Algorithm 25 takes on the modified form described in Algorithm 26.

How does this compare to previous convex optimization methods for solving the same matrix completion problem? As a convex program, we can apply gradient descent, or even more advantageously in this setting, stochastic gradient descent as in §3.4. Recall that the gradient of the objective function at point XtX^{t} takes the simple form (7.3). A stochastic estimate for the gradient can be attained by observing just a single entry of the matrix MM, and the update itself takes constant time as the gradient estimator is sparse. However, the projection step is significantly more difficult.

In this setting, the convex set K\mathcal{K} is the set of bounded nuclear norm matrices. Projecting a matrix onto this set amounts to calculating the SVD of the matrix, which is similar in computational complexity to algorithms for matrix diagonalization or inversion. The best known algorithms for matrix diagonalization are superlinear in the matrices’ size, and thus impractical for large datasets that are common in applications.

In contrast, the CG method does not require projections at all, and replaces them with linear optimization steps over the convex set, which we have observed to amount to singular vector computations. The latter can be implemented to take linear time via the power method or the Lanczos algorithm (see bibliography).

Thus, the Conditional Gradient method allows for optimization of the mathematical program (7.1) with a linear-time operation (eigenvector using power method) per iteration, rather than a significantly more expensive computation (SVD) needed for gradient descent.

4 Projections versus Linear Optimization

The conditional gradient (Frank-Wolfe) algorithm described before does not resort to projections, but rather computes a linear optimization problem of the form

When is the CG method computationally preferable? The overall computational complexity of an iterative optimization algorithm is the product of the number of iterations and the computational cost per iteration. The CG method does not converge as well as the most efficient gradient descent algorithms, meaning it requires more iterations to produce a solution of a comparable level of accuracy. However, for many interesting scenarios the computational cost of a linear optimization step (7.4) is significantly lower than that of a projection step.

Let us point out several examples of problems for which we have very efficient linear optimization algorithms, whereas our state-of-the-art algorithms for computing projections are significantly slower.

In the example pointed out in the preceding section of matrix completion, known methods for projection onto the spectahedron, or more generally the bounded nuclear-norm ball, require singular value decompositions, which take superlinear time via our best known methods. In contrast, the CG method requires maximal eigenvector computations which can be carried out in linear time via the power method (or the more sophisticated Lanczos algorithm).

Various routing and graph problems can be modeled as convex optimization problems over a convex set called the flow polytope.

Since the flow polytope is just the convex hull of sstt paths in the graph, minimizing a linear objective over it amounts to finding a minimum weight path given weights for the edges. For the shortest path problem we have very efficient combinatorial optimization algorithms, namely Dijkstra’s algorithm.

Thus, applying the CG algorithm to solve any convex optimization problem over the flow polytope will only require iterative shortest path computations.

A common way to represent a permutation or ordering is by a permutation matrix. Such are square matrices over {0,1}n×n\{0,1\}^{n\times n} that contain exactly one 11 entry in each row and column.

Doubly-stochastic matrices are square, real-valued matrices with non-negative entries, in which the sum of entries of each row and each column amounts to 1. The polytope that defines all doubly-stochastic matrices is called the Birkhoff-von Neumann polytope. The Birkhoff-von Neumann theorem states that this polytope is the convex hull of exactly all n×nn\times{n} permutation matrices.

Since a permutation matrix corresponds to a perfect matching in a fully connected bipartite graph, linear minimization over this polytope corresponds to finding a minimum weight perfect matching in a bipartite graph.

Consider a convex optimization problem over the Birkhoff-von Neumann polytope. The CG algorithm will iteratively solve a linear optimization problem over the BVN polytope, thus iteratively solving a minimum weight perfect matching in a bipartite graph problem, which is a well-studied combinatorial optimization problem for which we know of efficient algorithms. In contrast, other gradient based methods will require projections, which are quadratic optimization problems over the BVN polytope.

A matroid is pair (E,I)(E,I) where EE is a set of elements and II is a set of subsets of EE called the independent sets which satisfy various interesting proprieties that resemble the concept of linear independence in vector spaces. Matroids have been studied extensively in combinatorial optimization and a key example of a matroid is the graphical matroid in which the set EE is the set of edges of a given graph and the set II is the set of all subsets of EE which are cycle-free. In this case, II contains all the spanning trees of the graph. A subset SIS\in{I} could be represented by its identifying vector which lies in {0,1}E\{{0,1}\}^{|{E}|} which also gives rise to the matroid polytope which is just the convex hull of all identifying vectors of sets in II. It can be shown that some matroid polytopes are defined by exponentially many linear inequalities (exponential in E|{E}|), which makes optimization over them difficult.

On the other hand, linear optimization over matroid polytopes is easy using a simple greedy procedure which runs in nearly linear time. Thus, the CG method serves as an efficient algorithm to solve any convex optimization problem over matroids iteratively using only a simple greedy procedure.

5 The Online Conditional Gradient Algorithm

In this section we give a projection-free algorithm for OCO based on the conditional gradient method, which is projection-free and thus carries the computational advantages of the CG method to the online setting.

It is tempting to apply the CG method straightforwardly to the online appearance of functions in the OCO setting, such as the OGD algorithm in §3.1. However, it can be shown that an approach that only takes into account the last cost function is doomed to fail. The reason is that the conditional gradient method takes into account the direction of the gradient, and is insensitive to its magnitude.

Instead, we apply the CG algorithm step to the aggregate sum of all previous cost functions with added Euclidean regularization. The resulting algorithm is given formally in Algorithm 27.

We can prove the following regret bound for this algorithm. While this regret bound is suboptimal in light of the previous upper bounds we have seen, its suboptimality is compensated by the algorithm’s lower computational cost.

Online conditional gradient (Algorithm 27) with parameters η=D2GT3/4,σt=min{1,2t1/2}\eta=\frac{D}{2GT^{3/4}},\sigma_{t}=\min\{1,\frac{2}{t^{1/2}}\}, attains the following guarantee

As a first step in analyzing Algorithm 27, consider the points

These are exactly the iterates of the RFTL algorithm from chapter 5, namely Algorithm 13 with the regularization being R(x)=xx12R(\mathbf{x})=\|\mathbf{x}-\mathbf{x}_{1}\|^{2}, applied to cost functions with a shift, namely:

Using our previous notation, denote by ht(x)=Ft(x)Ft(xt)h_{t}(\mathbf{x})={F_{t}(\mathbf{x})-F_{t}(\mathbf{x}^{\star}_{t})}, and by ht=ht(xt)h_{t}=h_{t}(\mathbf{x}_{t}). The main lemma we require to proceed is the following, which relates the iterates xt\mathbf{x}_{t} to the optimal point according to the aggregate function FtF_{t}.

The iterates xt\mathbf{x}_{t} of Algorithm 27 satisfy for all t1t\geq 1

As the functions FtF_{t} are 11-smooth, applying the offline Frank-Wolfe analysis technique, and in particular Equation (7.3) to the function FtF_{t} we obtain:

In addition, by definition of FtF_{t} and hth_{t} we have

Since FtF_{t} is 11-strongly convex, we have

Above we used the following derivation, that holds by choice of parameters η=D2GT3/4\eta=\frac{D}{2GT^{3/4}} and σt=min{1,2t1/2}\sigma_{t}=\min\{1,\frac{2}{t^{1/2}}\}: since η,G,ht\eta,G,h_{t} are all non-negative, we have

We now claim that the theorem follows inductively. The base of the induction holds since, for t=1t=1, the definition of F1F_{1} implies

Assuming the bound is true for tt, we now show it holds for t+1t+1 as well:

as required. The last inequality follows by the definition of σt\sigma_{t} (see exercises).

We proceed to use this lemma in order to prove our theorem:

By definition, the functions FtF_{t} are 11-strongly convex. Thus, we have for xt=argminxKFt(x)\mathbf{x}_{t}^{\star}=\arg\min_{\mathbf{x}\in\mathcal{K}}F_{t}(\mathbf{x}):

Let η=D2GT3/4\eta=\frac{D}{2GT^{3/4}}, and notice that this satisfies the constraint of Lemma 7.4, which requires ηGht+1D22σt2\eta G\sqrt{h_{t+1}}\leq\frac{D^{2}}{2}\sigma_{t}^{2}. In addition, η<1\eta<1 for TT large enough. Hence,

6 Bibliographic Remarks

The matrix completion model has been extremely popular since its inception in the context of recommendation systems (Srebro, 2004; Rennie and Srebro, 2005; Salakhutdinov and Srebro, 2010; Lee et al., 2010; Candes and Recht, 2009; Shamir and Shalev-Shwartz, 2011).

The conditional gradient algorithm was devised in the seminal paper by Frank and Wolfe (1956). Due to the applicability of the FW algorithm to large-scale constrained problems, it has been a method of choice in recent machine learning applications, to name a few: (Hazan, 2008; Jaggi and Sulovský, 2010; Lacoste-Julien et al., 2013; Jaggi, 2013; Dudík et al., 2012; Harchaoui et al., 2012; Hazan and Kale, 2012; Shalev-Shwartz et al., 2011b; Bach et al., 2012; Tewari et al., 2011; Garber and Hazan, 2011, 2013; Bellet et al., 2014). In the context of matrix completion and recommendation systems, several faster variants of the Frank-Wolfe method were proposed (Garber, 2016; Allen-Zhu et al., 2017)

The online conditional gradient algorithm is due to Hazan and Kale (2012). An optimal regret algorithm, attaining the O(T)O(\sqrt{T}) bound, for the special case of polyhedral sets was devised in (Garber and Hazan, 2013).

Recent works consider accelerating projection-free optimization using variance reduction (Lan and Zhou, 2016; Hazan and Luo, 2016), and the case of projection-free algorithms with stochastic gradient oracles (Mokhtari et al., 2018; Chen et al., 2018; Xie et al., 2019).

For an analysis of the running time of the power and Lanczos methods for computing eigenvectors see (Kuczyński and Woźniakowski, 1992). For modern algorithms for fast computation of the singular value decomposition see (Allen-Zhu and Li, 2016; Musco and Musco, 2015).

7 Exercises

1. Prove that if the singular values are smaller than or equal to one, then the nuclear norm is a lower bound on the rank, i.e., show

2. Prove that the trace is related to the nuclear norm via

3. In this question we show that maximizing a linear function over the spectahedron can be reduced to a maximal eigenvector computation.

Prove that it has the same solution as the mathematical program:

(b) Show how to use eigenvector computations to maximize a general (a-symmetric) linear functions over the spectahedron.

4. Prove that for positive integers t>0t>0, any cc\in and σt=2tc\sigma_{t}=\frac{2}{t^{c}} it holds that

5. Download the MovieLens dataset from the web. Implement an online recommendation system based on the matrix completion model: implement the OCG and OGD algorithms for matrix completion. Benchmark your results.

Chapter 8 Games, Duality, and Regret

In this chapter we tie the material covered thus far to some of the most intriguing concepts in optimization and game theory. We shall use the existence of online convex optimization algorithms with sublinear regret to prove two fundamental properties: convex duality in mathematical optimization, and von Neumann’s minimax theorem in game theory.

Historically, the theory of games was developed by von Neumann in the early 1930’s. In an entirely different scientific thread, the theory of linear programming (LP) was advanced by Dantzig a decade later. Dantzig describes in his memoir a notable meeting between himself and von Neumann at Princeton in 1947. In this meeting, according to Dantzig, after describing the geometric and algebraic versions of linear programming, von Neumann essentially formulated and proved linear programming duality:

“I don’t want you to think I am pulling all this out of my sleeve at the spur of the moment like a magician. I have just recently completed a book with Oscar Morgenstern on the theory of games. What I am doing is conjecturing that the two problems are equivalent. The theory that I am outlining for your problem is an analogue to the one we have developed for games.” 13Taken from (Gass, 2006)

At that time, the topic of discussion was not the existence and uniqueness of equilibrium in zero-sum games, which is captured by the minimax theorem. Both concepts were originally captured and proved using very different mathematical techniques: the minimax theorem was originally proved using machinery from mathematical topology, whereas linear programming duality was shown using convexity and geometric tools.

More than half a century later, Freund and Schapire tied both concepts, which were by then known to be strongly related, to regret minimization. We shall follow their lead in this chapter, introduce the relevant concepts and give concise proofs using the machinery developed earlier in this manuscript.

The chapter can be read with basic familiarity with linear programming and little or no background in game theory. We define linear programming and zero-sum games succinctly, barely enough to prove the duality theorem and the minimax theorem. The reader is referred to the numerous wonderful texts available on linear programming and game theory for a much more thorough introduction and definitions.

The above formulation can be transformed into several different forms via basic manipulations. For example, any LP can be transformed to an equivalent LP with the variables taking only non-negative values. This can be accomplished by writing every variable xx as x=x+xx=x^{+}-x^{-}, with x+,x0x^{+},x^{-}\geq 0. It can be verified that this transformation leaves us with another LP, whose variables are non-negative, and contains at most twice as many variables (see exercises section for more details).

We are now ready to define a central notion in LP and state the duality theorem:

and the objectives of both problems are either equal or unbounded.

Instead of studying duality directly, we proceed to define zero-sum games and an analogous concept to duality.

2 Zero-sum Games and Equilibria

The theory of games is an established research field in economic theory. We give here brief definitions of the main concepts studied in this chapter.

Let us start with an example of a zero-sum game we all know: the rock-paper-scissors game. In this game each of the two players chooses a strategy: either rock, scissors or paper. The winner is determined according to the following table, where denotes a draw, 1-1 denotes that the row player wins, and 11 denotes a column player victory.

The rock-paper-scissors game is called a “zero-sum” game since one can think of the numbers as losses for the row player (loss of 1-1 resembles victory, 11 loss and draw), in which case the column player receives a loss which is exactly the negation of the loss of the row player. Thus the sum of losses which both players suffer is zero in every outcome of the game.

Noticed that we termed one player as the “row player” and the other as the “column player” corresponding to the matrix losses. Such a matrix representation is far more general:

A two-player zero-sum-game in normal form is given by a matrix An×mA\in^{n\times m}. The loss for the row player playing strategy i[n]i\in[n] is equal to the negative loss (reward) of the column player playing strategy j[m]j\in[m] and equal to AijA_{ij}.

The fact that the losses were defined in the range $$ is arbitrary, as the concept of main importance we define next is invariant to scaling and shifting by a constant.

A central concept in game theory is equilibrium. There are many different notions of equilibria. In two-player zero-sum games, a pure equilibrium is a pair of strategies (i,j)[n]×[m](i,j)\in[n]\times[m] with the following property: given that the column player plays jj, there is no strategy that dominates ii - i.e., every other strategy k[n]k\in[n] gives higher or equal loss to the row player. Equilibrium also requires that a symmetric property for strategy jj holds - it is not dominated by any other strategy given that the row player plays ii.

It can be shown that some games do not have a pure equilibrium as defined above, e.g., the rock-paper-scissors game. However, we can extend the notion of a strategy to a mixed strategy - a distribution over pure strategies. The loss of a mixed strategy is the expected loss according to the distribution over pure strategies. More formally, if the row player chooses xΔn\mathbf{x}\in\Delta_{n} and column player chooses yΔm,\mathbf{y}\in\Delta_{m}, then the expected loss of the row player (which is the negative reward to the column player) is given by:

We say that x\mathbf{x} is dominant with respect to y\mathbf{y} if and only if it is not dominated by any other mixed strategy. A pair (x,y)(\mathbf{x},\mathbf{y}) is an equilibrium for game AA if and only if both x\mathbf{x} and y\mathbf{y} are dominant with respect to each other. It is a good exercise for the reader at this point to find an equilibrium for the rock-paper-scissors game.

At this point, some natural questions arise: Is there always an equilibrium in a given zero-sum game? Can it be computed efficiently? Are there natural repeated-game-playing strategies that reach it?

As we shall see, the answer to all questions above is affirmative. Let us rephrase these questions in a different way. Consider the optimal row strategy, i.e., a mixed strategy x\mathbf{x}, such that the expected loss is minimized, no matter what the column player does. The optimal strategy for the row player would be:

Notice that we use the notation x\mathbf{x}^{\star}\in rather than x=\mathbf{x}^{\star}=, since in general the set of strategies attaining the minimal loss over worst-case column strategies can contain more than a single strategy. Similarly, the optimal strategy for the column player would be:

Playing these strategies, no matter what the column player does, the row player would pay no more than

With these definitions we can state von Neumann’s famous minimax theorem:

In any zero-sum game, it holds that λR=λC\lambda_{R}=\lambda_{C}.

This theorem answers all our above questions on the affirmative. The value λ=λC=λR\lambda^{\star}=\lambda_{C}=\lambda_{R} is called the value of the game, and its existence and uniqueness imply that any x\mathbf{x}^{\star} and y\mathbf{y}^{\star} in the appropriate optimality sets are an equilibrium.

We proceed to give a constructive proof of von Neumann’s theorem which also yields an efficient algorithm as well as natural repeated-game playing strategies that converge to it.

The von Neumann theorem is equivalent to the duality theorem of linear programming in a very strong sense, and either implies the other via simple reduction. Thus, it suffices to prove only von Neumann’s theorem to prove the duality theorem.

The first part of this equivalence is shown by representing a zero-sum game as a primal-dual linear program instance, as we do now.

Observe that the definition of an optimal row strategy and value is equivalent to the following LP:

To see that the optimum of the above LPLP is attained at λR\lambda_{R}, note that the constraint xAeiλi[m]\mathbf{x}^{\top}Ae_{i}\leq\lambda\quad\forall i\in[m] is equivalent to the constraint yΔm . xAyλ\forall\mathbf{y}\in\Delta_{m}\ .\ \mathbf{x}^{\top}A\mathbf{y}\leq\lambda, since:

The dual program to the above LP is given by

By similar arguments, the dual program precisely defines λC\lambda_{C} and y\mathbf{y}^{\star}. The duality theorem asserts that λR=λC=λ\lambda_{R}=\lambda_{C}=\lambda^{\star}, which gives von Neumann’s theorem.

The other direction, i.e., showing that von Neumann’s theorem implies LP duality, is slightly more involved. Basically, one can convert any LP into the format of a zero-sum game. Special care is needed to ensure that the original LP is indeed feasible, as zero-sum games are always feasible and linear programs need not be. The details are left as an exercise at the end of this chapter.

3 Proof of von Neumann Theorem

In this section we give a proof of von Neumann’s theorem using online convex optimization algorithms with sublinear regret.

The first part of the theorem, which is also known as weak duality in the LP context, is rather straightforward:

Direction 1 (λRλC\lambda_{R}\geq\lambda_{C}):

The second and main direction, known as strong duality in the LP context, requires the technology of online convex optimization we have proved thus far:

Direction 2 (λRλC\lambda_{R}\leq\lambda_{C}):

We consider a repeated game defined by the n×mn\times m matrix AA. For t=1,2,...,Tt=1,2,...,T, the row player provides a mixed strategy xtΔn\mathbf{x}_{t}\in\Delta_{n}, column player plays mixed strategy ytΔm\mathbf{y}_{t}\in\Delta_{m}, and the loss of the row player, which equals to the reward of the column player, equals xtAyt\mathbf{x}_{t}^{\top}A\mathbf{y}_{t}.

Spelling out the EG strategy for this particular instance, we have

Then, by appropriate choice of η\eta and Corollary 5.7, we have

The column player plays her best response to the row player’s strategy, that is:

Thus λRλC+2logn/T\lambda_{R}\leq\lambda_{C}+\sqrt{2\log n/T}. As TT\rightarrow\infty, we obtain part 2 of the theorem. ∎

Notice that besides the basic definitions, the only tool used in the proof is the existence of sublinear regret algorithms for online convex optimization. The fact that the regret bounds for OCO algorithms were defined without restricting the cost functions, and that they can be adversarially chosen, is crucial for the proof. The functions ftf_{t} are defined according to yt\mathbf{y}_{t}, which is chosen based on xt\mathbf{x}_{t}. Thus, the cost functions we constructed are adversarially chosen after the decision xt\mathbf{x}_{t} was made by the row player.

4 Approximating Linear Programs

The technique in the preceding section not only proves the minimax theorem, and thus linear programming duality, but also entails an efficient algorithm. Using the equivalence of zero-sum games and linear programs, this efficient algorithm can be used to solve linear programming. We now spell out the details of this algorithm in the context of zero-sum games.

Almost immediately we obtain from the previous derivation the following:

The returned vector xˉ\bar{\mathbf{x}} of Algorithm 28 is a 2lognT\frac{\sqrt{2\log n}}{\sqrt{T}}-approximate solution to the zero-sum game and linear program it describes.

Following the exact same steps of the previous derivation, we have

Thus, to obtain an ε\varepsilon-approximate solution, one would need 2lognε2\frac{2\log n}{\varepsilon^{2}} iterations, each involving a simple update procedure.

5 Bibliographic Remarks

Game theory was founded in the late 1920’s-early ’30s, whose cornerstone was laid in the classic text “Theory of Games and Economic Behavior” by Neumann and Morgenstern (1944).

Linear programming is a fundamental mathematical optimization and modeling tool, dating back to the 1940’s and the work of Kantorovich (1940) and Dantzig (1951). Duality for linear programming was conceived by von Neumann, as described by Dantzig in an interview (Albers et al., 1986). For in depth treatment of the theory of linear programming there are numerous comprehensive texts, e.g., (Bertsimas and Tsitsiklis, 1997; Matousek and Gärtner, 2007).

The beautiful connection between low-regret algorithms and solving zero-sum games was discovered by Freund and Schapire (1999). More general connections of convergence of low-regret algorithms to equilibria in games were studied by Hart and Mas-Colell (2000), and more recently in (Even-dar et al., 2009; Roughgarden, 2015).

Approximation algorithms that arise via simple Lagrangian relaxation techniques were pioneered by Plotkin et al. (1995). See also the survey (Arora et al., 2012) and more recent developments that give rise to sublinear time algorithms (Clarkson et al., 2012; Hazan et al., 2011).

6 Exercises

1. Prove that equilibrium strategy pairs in zero-sum games are not unique. That is, construct a zero-sum game for which there is more than one equilibrium.

3. Read Adler’s exposition on the equivalence of linear programming and zero sum games (Adler, 2013). Explain how to convert a linear program to a zero-sum game.

4. Consider a repeated zero-sum game over a matrix AA in which both players change their mixed strategies according to a low-regret algorithm over the linear cost/reward functions of the game. Prove that the average value of the game approaches that of an equilibrium of the game given by AA.

5. ∗ Write a semidefinite program as a zero-sum game. Write down an algorithm for approximating the solution of a semidefinite program using OCO algorithms, and sketch an analysis of its correctness and performance bound.

Chapter 9 Learning Theory, Generalization, and Online Convex Optimization

In our treatment of online convex optimization so far we have only implicitly discussed learning theory. The framework of OCO was shown to capture applications such as learning classifiers online, prediction with expert advice, online portfolio selection and matrix completion, all of which have a learning aspect. We have introduced the metric of regret and gave efficient algorithms to minimize regret in various settings. We have also argued that minimizing regret is a meaningful approach for many online prediction problems. However, the relation to other theories of learning was not discussed thus far.

In this section we draw a formal and strong connection between OCO and the theory of statistical learning. We begin by giving the basic definitions of statistical learning theory, and proceed to describe how the applications studied in this manuscript relate to this model. We then continue to show how regret minimization in the setting of online convex optimization gives rise to computationally efficient statistical learning algorithms.

The theory of statistical learning addresses the problem of learning a concept from examples. A concept is a mapping from domain X{\mathcal{X}} to labels Y{\mathcal{Y}}, denoted C:XYC:{\mathcal{X}}\mapsto{\mathcal{Y}}.

As an example, consider the problem of optical character recognition. In this setting, the domain X{\mathcal{X}} can be all n×nn\times n bitmap images, the label set Y{\mathcal{Y}} is the Latin (or other) alphabet, and the concept CC maps a bitmap into the character depicted in the image.

Statistical theory models the problem of learning a concept by allowing access to labelled examples from the target distribution. The learning algorithm has access to pairs, or samples, from an unknown distribution

The goal is to be able to predict yy as a function of x\mathbf{x}, i.e., to learn a hypothesis, or a mapping from X{\mathcal{X}} to Y{\mathcal{Y}}, denoted h:XYh:{\mathcal{X}}\mapsto{\mathcal{Y}}, with small error with respect to the distribution D{\mathcal{D}}. In the case that the label set is binary Y={0,1}{\mathcal{Y}}=\{0,1\}, or discrete such as in optical character recognition, the generalization error of an hypothesis hh with respect to distribution D{\mathcal{D}} is given by

We henceforth consider learning algorithms A{\mathcal{A}} that observe a sample from the distribution D{\mathcal{D}} , denoted SDmS\sim{\mathcal{D}}^{m} for a sample of mm examples, S={(x1,y1),...,(xm,ym)}S=\{(\mathbf{x}_{1},y_{1}),...,(\mathbf{x}_{m},y_{m})\}, and produce a hypothesis A(S):XY{\mathcal{A}}(S):{\mathcal{X}}\mapsto{\mathcal{Y}} based on this sample.

The goal of statistical learning can thus be summarised as follows:

Given access to i.i.d. samples from an arbitrary distribution over X×Y{\mathcal{X}}\times{\mathcal{Y}} corresponding to a certain concept, learn a hypothesis h:XYh:{\mathcal{X}}\mapsto{\mathcal{Y}} which has arbitrarily small generalization error with respect to a given loss function.

In the problem of optical character recognition the task is to recognize a character from a given image in bitmap format. To model it in the statistical learning setting, the domain X{\mathcal{X}} is the set of all n×nn\times n bitmap images for some integer nn. The label set Y{\mathcal{Y}} is the latin alphabet, and the concept CC maps a bitmap into the character depicted in the image.

Consider the naive algorithm which fits the perfect hypothesis for a given sample, in this case set of bitmaps. Namely, A(S){\mathcal{A}}(S) is the hypothesis which correctly maps any given bitmap input xi\mathbf{x}_{i} to its correct label yiy_{i}, and maps all unseen bitmaps to the character 1."``1."

Clearly, this hypothesis does a very poor job of generalizing from experience - all images that have not been observed yet will be classified without regard to their properties, surely an erroneous classification most times. However - the training set, or observed examples, are perfectly classified by this hypothesis!

This disturbing phenomenon is called “overfitting,” a central concern in machine learning. Before continuing to add the necessary components in learning theory to prevent overfitting, we turn our attention to a formal statement of when overfitting can appear.

1.2 No free lunch?

The following theorem shows that learning, as stated in the goal of statistical learning theory, is impossible without restricting the hypothesis class being considered. For simplicity, we consider the zero-one loss in this section.

Consider any domain X\mathcal{X} of size X=2m>4|\mathcal{X}|=2m>4, and any algorithm A{\mathcal{A}} which outputs a hypothesis A(S){\mathcal{A}}(S) given a sample SS of size mm. Then there exists a concept C:X{0,1}C:\mathcal{X}\rightarrow\{0,1\} and a distribution D\mathcal{D} such that:

The generalization error of the concept CC is zero.

With probability at least 110\frac{1}{10}, the error of the hypothesis generated by A{\mathcal{A}} is at least \mboxerror(A(S))110\mathop{\mbox{\rm error}}(A(S))\geq\frac{1}{10}.

The proof of this theorem is based on the probabilistic method, a useful technique for showing the existence of combinatorial objects by showing that the probability they exist in some distributional setting is bounded away from zero. In our setting, instead of explicitly constructing a concept CC with the required properties, we show it exists by a probabilistic argument.

We show that for any learning algorithm, there is some learning task (i.e., “hard” concept) that it will not learn well. Formally, take D\mathcal{D} to be the uniform distribution over X{\mathcal{X}}. Our proof strategy will be to show the following inequality, where we take a uniform distribution over all concepts X{0,1}{\mathcal{X}}\mapsto\{0,1\}

After showing this step, we will use Markov’s Inequality to conclude the theorem.

We proceed by using the linearity property of expectations, which allows us to swap the order of expectations, and then conditioning on the event that xS\mathbf{x}\in S.

All terms in the above expression, and in particular the first term, are non-negative and at least . Also note that since the domain size is 2m2m and the sample is of size Sm|S|\leq m, we have Pr(x∉S)12\Pr(\mathbf{x}\not\in S)\geq\frac{1}{2}. Finally, observe that Pr[A(S)(x)C(x)]=12\Pr[{\mathcal{A}}(S)(\mathbf{x})\neq C(\mathbf{x})]=\frac{1}{2} for all x∉S\mathbf{x}\not\in S since we are given that the “true” concept CC is chosen uniformly at random over all possible concepts. Hence, we get that:

which is the intermediate step we wanted to show. The random variable \mboxESDm[\mboxerror(A(S))]\mathop{\mbox{\bf E}}_{S\sim\mathcal{D}^{m}}[\mathop{\mbox{\rm error}}({\mathcal{A}}(S))] attains values in the range $.Sinceitsexpectationisatleast. Since its expectation is at least\frac{1}{4},theeventthatitattainsavalueofatleast, the event that it attains a value of at least\frac{1}{4}$ is non-empty. Thus, there exists a concept such that

where, as assumed beforehand, D\mathcal{D} is the uniform distribution over X{\mathcal{X}}.

We now conclude with Markov’s Inequality: since the expectation above over the error is at least one-fourth, the probability over examples such that the error of A{\mathcal{A}} over a random sample is at least one-tenth is at least

1.3 Examples of learning problems

The conclusion of the previous theorem is that the space of possible concepts being considered in a learning problem needs to be restricted for any meaningful guarantee. Thus, learning theory concerns itself with concept classes, also called hypothesis classes, which are sets of possible hypotheses from which one would like to learn. We denote the concept (hypothesis) class by H={h:XY}{\mathcal{H}}=\{h:{\mathcal{X}}\mapsto{\mathcal{Y}}\}.

Common examples of learning problems that can be formalized in this model and the corresponding definitions include:

Optimal character recognition: In the problem of optical character recognition the domain X{\mathcal{X}} consists of all n×nn\times n bitmap images for some integer nn, the label set Y{\mathcal{Y}} is a certain alphabet, and the concept CC maps a bitmap image into the character depicted in it. A common (finite) hypothesis class for this problem is the set of all decision trees with bounded depth.

Recommendation systems: recall the online convex optimization formulation of this problem in section 7.2. A statistical learning formulation for this problem is very similar. The domain is a direct sum of two sets X=X1X2{\mathcal{X}}={\mathcal{X}}_{1}\oplus{\mathcal{X}}_{2}. Here x1X1\mathbf{x}_{1}\in{\mathcal{X}}_{1} is a certain media item, and every person is an item x2X2\mathbf{x}_{2}\in{\mathcal{X}}_{2}. The label set Y{\mathcal{Y}} is binary, where one indicates a positive sentiment for the person to the particular media item, and zero a negative sentiment.

1.4 Defining generalization and learnability

We are now ready to give the fundamental definition of statistical learning theory, called Probably Approximately Correct (PAC) learning:

The set STS_{T} of samples from the underlying distribution is called the training set. The error in the above definition is called the generalization error, as it describes the overall error of the concept as generalized from the observed training set. The behavior of the number of samples TT as a function of the parameters ε,δ\varepsilon,\delta and the concept class is called the sample complexity of H{\mathcal{H}}.

The definition of PAC learning says nothing about computational efficiency. Computational learning theory usually requires, in addition to the definition above, that the algorithm A{\mathcal{A}} is efficient, i.e., polynomial running time with respect to ε,log1δ\varepsilon,\log\frac{1}{\delta} and the representation of the hypothesis class. The representation size for a discrete set of concepts is taken to be the logarithm of the number of hypotheses in H{\mathcal{H}}, denoted logH\log|{\mathcal{H}}|.

If the hypothesis A(ST){\mathcal{A}}(S_{T}) returned by the learning algorithm belongs to the hypothesis class H{\mathcal{H}}, as in the definition above, we say that H{\mathcal{H}} is properly learnable. More generally, A{\mathcal{A}} may return hypothesis from a different hypothesis class, in which case we say that H{\mathcal{H}} is improperly learnable.

The fact that the learning algorithm can learn up to any desired accuracy ε>0\varepsilon>0 is called the realizability assumption and greatly reduces the generality of the definition. It amounts to requiring that a hypothesis with near-zero error belongs to the hypothesis class. In many cases, concepts are only approximately learnable by a given hypothesis class, or inherent noise in the problem prohibits realizability (see exercises).

This issue is addressed in the definition of a more general learning concept, called agnostic learning:

With these definitions, we can state the fundamental theorem of statistical learning theory for finite hypothesis classes:

Every finite concept class H{\mathcal{H}} is agnostically PAC learnable with sample complexity that is \mboxpoly(ε,δ,logH)\mathop{\mbox{\rm poly}}(\varepsilon,\delta,\log|{\mathcal{H}}|).

In the following sections we prove this theorem, and in fact a more general statement that holds also for certain infinite hypothesis classes. The complete characterization of which infinite hypothesis classes are learnable is a deep and fundamental question, whose complete answer was given by Vapnik and Chervonenkis (see bibliography). The question of which (finite or infinite) hypothesis classes are efficiently PAC learnable, especially in the improper sense, is still at the forefront of learning theory today.

2 Agnostic Learning using Online Convex Optimization

In this section we show how to use online convex optimization for agnostic PAC learning. Following the paradigm of this manuscript, we describe and analyze a reduction from agnostic learning to online convex optimization. The reduction is formally described in Algorithm 29.

For this reduction we assumed that the concept (hypothesis) class is a convex subset of Euclidean space. A similar reduction can be carried out for discrete hypothesis classes (see exercises). In fact, the technique we explore below will work for any hypothesis set H{\mathcal{H}} that admits a low regret algorithm, and can be generalized to infinite hypothesis classes that are known to be learnable.

Let h=argminhH{\mboxerror(h)}h^{\star}=\arg\min_{h\in{\mathcal{H}}}\{\mathop{\mbox{\rm error}}(h)\} be the hypothesis in the class H{\mathcal{H}} that minimizes the generalization error. Using the assumption that A{\mathcal{A}} guarantees sublinear regret, our simple reduction implies PAC learning, as given in the following theorem.

How general is the theorem above? In the previous chapters we have described and analyzed OCO algorithms with regret guarantees that behave asymptotically as O(T)O(\sqrt{T}) or better. This translates to sample complexity of O(1ε2log1δ)O(\frac{1}{\varepsilon^{2}}\log\frac{1}{\delta}) (see exercises), which is known to be tight for certain scenarios.

To prove this theorem we need some tools from probability theory, such as the concentration inequalities that we survey next.

Let us briefly discuss the notion of a martingale in probability theory. For intuition, it is useful to recall the simple random walk. Let XiX_{i} be a Rademacher random variable which takes values

A simple symmetric random walk is described by the sum of such random variables, depicted in figure 9.1. Let X=i=1TXiX=\sum_{i=1}^{T}X_{i} be the position after TT steps of this random walk. The expectation and variance of this random variable are \mboxE[X]=0 , \mboxVar(X)=T\mathop{\mbox{\bf E}}[X]=0\ ,\ \mbox{Var}(X)=T.

The phenomenon of measure concentration addresses the probability of a random variable to attain values within range of its standard deviation. For the random variable XX, this probability is much higher than one would expect using only the first and second moments. Using only the variance, it follows from Chebychev’s inequality that

However, the event that X|X| is centred around O(T)O(\sqrt{T}) is in fact much tighter, and can be bounded by the Hoeffding-Chernoff lemma as follows

Thus, deviating by a constant from the standard deviation decreases the probability exponentially, rather than polynomially. This well-studied phenomenon generalizes to sums of weakly dependent random variables and martingales, which are important for our application.

A sequence of random variables X1,X2,...X_{1},X_{2},... is called a martingale if it satisfies:

A similar concentration phenomenon to the random walk sequence occurs in martingales. This is captured in the following theorem by Azuma.

Let \big{\{}X_{i}\big{\}}_{i=1}^{T} be a martingale of TT random variables that satisfy XiXi+11|X_{i}-X_{i+1}|\leq{1}. Then:

2.2 Analysis of the reduction

We start by defining a sequence of random variables that form a martingale. Let

Let us verify that {Xt}\{X_{t}\} is indeed a bounded martingale. Notice that by definition of \mboxerror(h)\mathop{\mbox{\rm error}}(h), we have that

In addition, by our assumption that the loss is bounded, we have that (see exercises)

Therefore we can apply Azuma’s theorem to the martingale {Xt}\{X_{t}\}, or rather its consequence (9.2), and get

Plugging in the definition of XTX_{T}, dividing by TT and using c=2Tlog(2δ)c=\sqrt{2T\log(\frac{2}{\delta})}:

A similar martingale can be defined for hh^{\star} rather than hth_{t}, and repeating the analogous definitions and applying Azuma’s inequality we get:

For notational convenience, let us use the following notation:

By convexity we have that \mboxerror(hˉ)1Tt=1T\mboxerror(ht)\mathop{\mbox{\rm error}}(\bar{h})\leq\frac{1}{T}\sum_{t=1}^{T}\mathop{\mbox{\rm error}}(h_{t}). Thus, with probability at least 1δ1-\delta,

3 Learning and Compression

Thus far we have considered finite and certain infinite hypothesis classes, and shown that they are efficiently learnable if there exists an efficient regret-minimization algorithm for a corresponding OCO setting.

In this section we describe yet another property which is sufficient for PAC learnability: the ability to compress the training set. This property is particularly easy to state and use, especially for infinite hypothesis classes. It does not, however, imply efficient algorithms.

Intuitively, if a learning algorithm is capable to express an hypothesis using a small fraction of the training set, we will show that it generalizes well to unseen data. For simplicity, we only consider learning problems that satisfy a variant of the realizablility assumption, i.e., the compression scheme generates a hypothesis that attains zero error.

More formally, we define the notion of a compression scheme for a given learning problem as follows. The definition and theorem henceforth can be generalized to allow for loss functions, but for simplicity, consider only the zero-one loss function for this section.

(Compression Scheme) A distribution D{\mathcal{D}} over X×Y{\mathcal{X}}\times{\mathcal{Y}} admits a compression scheme of size kk, realized by an algorithm A{\mathcal{A}}, if the following holds. For any T>kT>k, let ST={(xt,yt), t[T]}S_{T}=\{(\mathbf{x}_{t},y_{t}),\ t\in[T]\} be a sample from D{\mathcal{D}}. There exists an SST , S=kS^{\prime}\subseteq S_{T}\ ,\ |S^{\prime}|=k, such that the algorithm A{\mathcal{A}} accepts the set of kk examples SS^{\prime}, and returns a hypothesis A(S){XY}{\mathcal{A}}(S^{\prime})\in\{{\mathcal{X}}\mapsto{\mathcal{Y}}\}, which satsifies:

The main conclusion of this section is that a learning problem that admits a compression scheme of size kk is PAC learnable with sample complexity proportional to kk. This is formally given the following theorem.

Let D{\mathcal{D}} be a data distribution that admits a compression scheme of size kk realized by algorithm A{\mathcal{A}}. Then with probability at least 1δ1-\delta over the choice of a training set ST=T|S_{T}|=T, it holds that

Denote by \mboxerrorS(h)\mathop{\mbox{\rm error}}_{S}(h) the error of an hypothesis hh on a sample SS of i.i.d. examples, where the sample is taken independently of hh. Since the examples are chosen independently, the probability that a hypothesis with \mboxerror(h)>ε\mathop{\mbox{\rm error}}(h)>\varepsilon has \mboxerrorS(h)=0\mathop{\mbox{\rm error}}_{S}(h)=0 is at most (1ε)S(1-\varepsilon)^{|S|}. Denote the event of hh satisfying these two conditions by h\mboxbad{h\in{\mbox{bad}}}.

Consider a compression scheme for distribution D{\mathcal{D}} of size kk, realized by A{\mathcal{A}}, and a sample of size ST=Tk|S_{T}|=T\gg k. By definition of a compression scheme, the hypothesis returned by A{\mathcal{A}} is based on kk examples chosen from the set SSTS^{\prime}\subseteq S_{T}. We can bounds the probability of the event that \mboxerrorST(A(S))=0\mathop{\mbox{\rm error}}_{S_{T}}({\mathcal{A}}(S^{\prime}))=0 and \mboxerror(A(S))>ε\mathop{\mbox{\rm error}}({\mathcal{A}}(S^{\prime}))>\varepsilon, denoted by A(S)\mboxbad{{\mathcal{A}}(S^{\prime})\in\mbox{bad}}, as follows,

For ε=8klogTδT\varepsilon=\frac{8k\log\frac{T}{\delta}}{T}, we have that

Since the compression scheme is guaranteed to return a hypothesis such that \mboxerrorST(A(S))=0\mathop{\mbox{\rm error}}_{S_{T}}({\mathcal{A}}(S^{\prime}))=0, this implies that with probability at least 1δ1-\delta, the hypothesis A(S){\mathcal{A}}(S^{\prime}) has \mboxerror(A(S))ε\mathop{\mbox{\rm error}}({\mathcal{A}}(S^{\prime}))\leq\varepsilon. ∎

4 Bibliographic Remarks

The foundations of statistical and computational learning theory were put forth in the seminal works of Vapnik (1998) and Valiant (1984) respectively. There are numerous comprehensive texts on statistical and computational learning theory, see e.g., (Kearns and Vazirani, 1994), and the recent text (Shalev-Shwartz and Ben-David, 2014).

Reductions from the online to the statistical (a.k.a. “batch”) setting were initiated by Littlestone (Littlestone, 1989). Tighter and more general bounds were explored in (Cesa-Bianchi et al., 2006; Cesa-Bianchi and Gentile, 2008; Zhang, 2005).

The probabilistic method is attributed to Paul Erdos, see the illuminating text of Alon and Spencer (Alon and Spencer, 1992).

The relationship between compression and PAC learning was studied in the seminal work of Littlestone and Warmuth (1986). For more on the relationship and historical connections between statistical learning and compression see the inspiring chapter in (Wigderson, 2019). More recently Moran and Yehudayoff (2016); David et al. (2016) show that compression is equivalent to learnability in general supervised learning tasks and give quantitative bounds for this relationship.

The use of compression for proving generalization error bounds has been applied in (Hanneke et al., 2019) for regression and in (Gottlieb et al., 2018; Kontorovich et al., 2017) for nearest neighbor classification. Another application is the recent work of Bousquet et al. (2020) which gives optimal generalization error bounds for support vector machines using compression.

5 Exercises

1. Strengthen the no free lunch Theorem 9.1 to show the following: For any ε>0\varepsilon>0, there exists a finite domain X{\mathcal{X}}, such that for any learning algorithm A{\mathcal{A}} which given a sample SS produces hypothesis A(S){\mathcal{A}}(S), there exists a distribution DD and a concept C:X{0,1}C:X\mapsto\{0,1\} such that

(b) \mboxESDm[\mboxerror(A(S))]12ε\mathop{\mbox{\bf E}}_{S\sim{\mathcal{D}}^{m}}[\mathop{\mbox{\rm error}}({\mathcal{A}}(S))]\geq\frac{1}{2}-\varepsilon.

2. Let A{\mathcal{A}} be an agnostic learning algorithm for the finite hypothesis class H:XY{\mathcal{H}}:{\mathcal{X}}\mapsto{\mathcal{Y}} and the zero-one loss. Consider any concept C:XYC:X\mapsto Y which is realized by H{\mathcal{H}}, and the concept C^\hat{C} which is obtained by replacing the label associated with each domain entry xXx\in X randomly with probability ε0>0\varepsilon_{0}>0 every time xx is sampled independently. That is:

Prove that A{\mathcal{A}} can ε\varepsilon-approximate the concept C^\hat{C}: that is, show that A{\mathcal{A}} can be used to produce a hypothesis hAh_{{\mathcal{A}}} that has error

with probability at least 1δ1-\delta for every ε,δ\varepsilon,\delta with sample complexity polynomial in 1ε,log1δ,logH\frac{1}{\varepsilon},\log\frac{1}{\delta},\log|H|.

4. (Sample complexity of SVM) Consider the class of hypothesis given by hyperplanes in Euclidean space with bounded norm

Give an algorithm to PAC-learn this class with respect to the hinge loss function using reduction (29). Analyze the resulting computational and sample complexity.

5. Show how to use a modification of reduction 29 to learn a finite (non-convex) hypothesis class efficiently, i.e., without enumerating over all hypothesis. For this question, success probability of 12\frac{1}{2} is sufficient. Hint: instead of returning hˉ\bar{h}, consider returning a hypothesis at random.

6. Consider the hypothesis class of all axis-aligned rectangles in the plane. That is, consider all hypothesis parametrized by four real numbers, ax,bx,ay,bya_{x},b_{x},a_{y},b_{y}, such that

Prove that this hypothesis class admits a compression scheme of size 44.

Prove that this class has a compression scheme of size dd.

Chapter 10 Learning in Changing Environments

In online convex optimization the decision maker iteratively makes a decision without knowledge of the future, and pays a cost based on her decision and the observed outcome. The algorithms that we have studied thus far are designed to perform nearly as well as the best single decision in hindsight. The performance metric we have advocated for, average regret of the online player, approaches zero as the number of game iterations grows.

In scenarios in which the outcomes are sampled from some (unknown) distribution, regret minimization algorithms effectively “learn” the environment and approach the optimal strategy. This was formalized in chapter 9. However, if the underlying distribution changes, no such claim can be made.

Consider for example the online shortest path problem we have studied in the first chapter. It is a well observed fact that traffic in networks exhibits changing cyclic patterns. A commuter may choose one path from home to work on a weekday, but a completely different path on the weekend when traffic patterns are different. Another example is the stock market: in a bull market the investor may want to purchase technology stocks, but in a bear market perhaps they would shift their investments to gold or government bonds.

When the environment undergoes many changes, standard regret may not be the best measure of performance. In changing environments, the online convex optimization algorithms we have studied thus far for strongly convex or exp-concave loss functions exhibit undesirable “static” behavior, and converge to a fixed solution.

In this chapter we introduce and study a generalization of the concept of regret called adaptive regret, to allow for a changing prediction strategy. We start with examining the notion of adapting in the problem of prediction from expert advice. We then continue to the more challenging setting of online convex optimization, and derive efficient algorithms for minimizing this more refined regret metric.

Before giving the main performance metric studied in this chapter, we consider the first natural approach: measuring regret w.r.t. any sequence of decisions. Clearly, in general it is impossible to compete with an arbitrary changing benchmark. However, it is possible to give a refined analysis that shows what happens to the regret of an online convex optimization algorithm vs. changing decisions.

More precisely, define the dynamic regret of an OCO algorithm with respect to a sequence u1,,uT\mathbf{u}_{1},\ldots,\mathbf{u}_{T} as:

To analyze the dynamic regret, some measure of the complexity of the sequence u1,,uT\mathbf{u}_{1},\ldots,\mathbf{u}_{T} is necessary. Let P(u1,,uT){\mathcal{P}}(\mathbf{u}_{1},\ldots,\mathbf{u}_{T}) be the path length of the comparison sequence defined as

It is natural to expect the regret to scale with the path length, as indeed the following theorem shows. For a fixed comparator ut=x\mathbf{u}_{t}=\mathbf{x}^{\star}, the path length is one, and thus Theorem 10.1 recovers the O(T)O(\sqrt{T}) standard regret bound. For simplicity, we assume that the time horizon TT is known ahead of time, and so is the path length of the comparator sequence, although these limitations can be removed (see bibliographic section).

Online Gradient Descent (algorithm 8) with step size ηP(u1,...,uT)T\eta\approx\sqrt{\frac{{\mathcal{P}}(\mathbf{u}_{1},...,\mathbf{u}_{T})}{T}} guarantees the following dynamic regret bound:

Using our notation, and following the steps of the proof of Theorem 3.1,

Using convexity and summing this inequality across time we get

The theorem now follows by choice of η\eta. ∎

This simple modification to the analysis of online gradient descent naturally extends to online mirror descent, as well as to other notions of path distance of the comparison sequence.

We now turn to another metric of performance that requires more advanced methods than we have seen thus far. This metric can be shown to be more general than dynamic regret, in the sense that the bounds we prove also imply low dynamic regret.

2 The Notion of Adaptive Regret

The main performance metric we consider in this chapter is designed to measure the performance of a decision maker in a changing environment. It is formally given in the following definition.

The adaptive regret of an online convex optimization algorithm A{\mathcal{A}} is defined as the maximum regret it achieves over any contiguous time interval. Formally,

As opposed to standard regret, the power of this definition stems from the fact that the comparator is allowed to change. In fact, it is allowed to change indefinitely with every interval of time.

For an algorithm with low adaptive regret, as opposed to standard regret, how would its performance guarantee differ in a changing environment? Consider the problem of portfolio selection, for which time can be divided into disjoint segments with different characteristics: bear market in the first T/2T/2 iterations and bull market in the last T/2T/2 iterations. A (standard) sublinear regret algorithm is only required to converge to the average of both optimal portfolios, clearly an undesirable outcome. However, an algorithm with sublinear adaptive regret bounds would necessarily converge to the optimal portfolio in both intervals.

The Online Gradient Descent algorithm over general convex losses, with step sizes O(1t)O(\frac{1}{\sqrt{t}}), attains an adaptive regret guarantee of

and this bound is tight. This is a simple consequence of the analysis we have already seen in chapter 3, and left as an exercise. Unfortunately this guarantee is meaningless for intervals of length o(T)o(\sqrt{T}).

Recall that for strongly convex loss functions, the OGD algorithm with the optimal learning rate schedule attains O(logT)O(\log T) regret. However, it does not attain any non-trivial adaptive regret guarantee: its adaptive regret can be as large as Ω(T)\Omega(T), and this is also left as an exercise.

An OCO algorithm A{\mathcal{A}} is said to be strongly adaptive if its adaptive regret can be bounded by its regret over the interval up to logarithmic terms in TT, i.e.

The natural question is thus: are there algorithms that attain the optimal regret guarantee, and simultaneously the optimal adaptive regret guarantee? As we shall see, the answer is affirmative in a strong sense: we shall describe and analyze algorithms that are optimal in both metrics. Furthermore, these algorithms can be implemented with small computational overhead over the non-adaptive methods we have already studied.

3 Tracking the Best Expert

Consider the fundamental problem studied in the first chapter of this text, prediction from expert advice, but with a small twist. Instead of a static best expert, consider the setting in which different experts are the “best expert” in different time intervals. More precisely, consider the situation in which time [T][T] can be divided into kk disjoint intervals such that each admits a different “locally best” expert. Can we learn to track the best expert?

This tracking problem was historically the first motivation to study adaptivity in online learning. Indeed, as shown by Herbster and Warmuth (see bibliographic section), there is a natural algorithm that attains optimal regret bounds.

The Fixed Share algorithm, describe in Algorithm 30, is a variant of the Hedge Algorithm 1. On top of the familiar multiplicative updates, it adds a uniform exploration term whose purpose is to avoid the weight of any expert from becoming too small. This provably allows a regret bound that tracks the best expert in any interval.

In line with the notation we have used throughout this manuscript, we denote decisions in a convex decision set by xK\mathbf{x}\in\mathcal{K}. An expert ii suggests decision xti\mathbf{x}_{t}^{i}, and suffers loss according to a convex loss function denoted ft(xti)f_{t}(\mathbf{x}_{t}^{i}). The main performance guarantee for the Fixed Share algorithm is given in the theorem below.

Given a sequence of α\alpha-exp-concave loss functions, the Fixed-Share algorithm with δ=12T\delta=\frac{1}{2T} guarantees

Notice that this is a different guarantee than adaptive regret as per Definition 10.2, as the decision set is discrete. However, it is a crucial component in the adaptive algorithms we will explore in the next section.

As a direct conclusion from this theorem, it can be shown (see exercises) that if the best expert changes kk times in a sequence of length TT, the overall regret compared to the best expert in every interval is bounded by

To prove this theorem, we start with the following lemma, which is a fine-grained analysis of the multiplicative weights properties:

Using the α\alpha-exp concavity of ftf_{t},

Theorem 10.3 can now be derived as a corollary:

By summing up over the interval I=[r,s]I=[r,s], and using the lower bound on ptip_{t}^{i}, we have

4 Efficient Adaptive Regret for Online Convex Optimization

The Fixed-Share algorithm described in the previous section is extremely practical and efficient for discrete sets of experts. However, to exploit the full power of OCO we require an efficient algorithm for continuous decision sets.

Consider for example the problems of online portfolio selection and online shortest paths: naïvely applying the Fixed-Share algorithm is computationally inefficient. Instead, we seek an algorithm which takes advantage of the efficient representation of these problems in the language of convex programming.

We present such a method called FLH, or Follow the Leading History. The basic idea is to think of different online convex optimization algorithms starting at different time points as experts, and apply a version of Fixed Share to these experts.

The main performance guarantee is given in the following theorem.

In particular, taking AONS{\mathcal{A}}\equiv ONS guarantees

Notice that FLH invokes A{\mathcal{A}} at iteration tt at most TT times. Hence its running time is bounded by TT times that of A{\mathcal{A}}. This can still be prohibitive as the number of iterations grows large. In the next section, we show how the ideas from this algorithm can give rise to an efficient adaptive algorithm with only O(logT)O(\log T) computational overhead and slightly worse regret bounds.

The analysis of FLH is very similar to that of Fixed Share, with the main subtleties due to the fact that the time horizon TT is not assumed to be known ahead of time, and thus the number of experts varies with time.

Instead of giving the full analysis, which is deferred to the exercises, we give a simplified version of FLH which does assume a-priory knowledge of TT, and whose analysis can be directly reduced to that of Theorem 10.3.

The simplified version of FLH is given in Algorithm 32, and it guarantees the following adaptive regret bound.

Applying Theorem 10.3 to the experts defined in Simple FLH, guarantees for every interval in time I=[r,s]I=[r,s], and by choice of NN, for every isi\leq s,

In particular, consider the sequence of predictions given by the rr’th expert, for which we have

The theorem now follows since this holds for every iterval I[T]I\subseteq[T]. ∎

5 * Computationally Efficient Methods

In the previous section we studied adaptive regret, introduced and analyzed an algorithm that attains near optimal adaptive regret bounds. However, FLH suffers from a significant computational and memory overhead: it requires maintaining O(T)O(T) copies of an online convex optimization algorithm. This computational overhead, which is proportional to the number of iterations, can be prohibitive in many applications. In this section our goal is to implement the algorithmic template of FLH efficiently and using little space.

To be more precise, henceforth denote the running time per iteration of algorithm A{\mathcal{A}} as Vt(A)V_{t}({\mathcal{A}}). Recall that at time tt, FLH stores all predictions {xti  i[t]}\{\mathbf{x}_{t}^{i}\ |\ i\in[t]\} and has to compute weights for all of them. This requires running time of at least O(Vt(A)t)O(V_{t}({\mathcal{A}})\cdot t).

The FLH2 algorithm, described in Algorithm 33, significantly cuts down this running time to being only logarithmic in the current time iteration parameter tt. To achieve this, FLH2 applies a pruning method to cut down the number of active online algorithms from tt to O(logt)O(\log t). However, its adaptive regret guarantee is slightly worse, and suffers a multiplicative factor of O(logT)O(\log T) as compared to FLH.

Before giving the exact details of this pruning method, we state the performance guarantee for FLH2.

The main conclusion from this theorem is obtained by using FLH2 with A{\mathcal{A}} being the ONS algorithm from chapter 4. This gives adaptive regret of O(1αlog2T)O(\frac{1}{\alpha}\log^{2}T) and running time which is polynomial in natural parameters of the problem and poly-logarithmic in the number of iterations.

Before diving into the analysis, we explain the main new ingredient. At the heart of this algorithm is a new method for incorporating history. We will show that it suffices to store only O(logt)O(\log t) experts at time tt, rather than all tt experts as in FLH.

At time tt, there is a working set StS_{t} of experts. In FLH, this set can be thought of to contain E1,,EtE^{1},\cdots,E^{t}, where each EiE^{i} is the algorithm A{\mathcal{A}} starting from iteration ii. For the next round, a new expert Et+1E^{t+1} is added to get St+1S_{t+1}. The complexity and regret of FLH is directly related to the cardinality of these sets.

The key to decreasing the sizes of the sets StS_{t} is to also remove (or prune) some experts. Once an expert is removed, it is never used again. The algorithm will perform the multiplicative update and mixing steps (steps 5 and 7 in algorithm 33) only on the working set of experts.

The problem of maintaining the set of active experts can be thought of as the following abstract data streaming problem. Suppose the integers 1,2,1,2,\cdots are being “processed” in a streaming fashion. At time tt, we have “read” the positive integers up to tt and maintain a very small subset of them in StS_{t}. At time tt we create St+1S_{t+1} from StS_{t}: we are allowed to add to StS_{t} only the integer t+1t+1, and remove some integers already in StS_{t}. Our aim is to maintain a set StS_{t} which satisfies:

For every positive sts\leq t, [s,(s+t)/2]St[s,(s+t)/2]\cap S_{t}\neq\emptyset.

For all tt, St+1\St={t+1}S_{t+1}\backslash S_{t}=\{t+1\}.

The first property of the sets StS_{t} intuitively means that StS_{t} is “well spread out” in a logarithmic scale. This is depicted in Figure 10.1. The second property ensures computational efficiency.

Indeed, the procedure “Prune” maintains StS_{t} with these exact properties, and is detailed after we prove Theorem 10.7.

We proceed to prove the main theorem. We start with an analogue of Lemma 10.4.

The following holds for all iSti\in S_{t},

ft(xt)ft(xti)α1(logp^t+1ilogp^ti+logt1t)f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{x}^{i}_{t})\leq\alpha^{-1}(\log\hat{p}^{i}_{t+1}-\log\hat{p}^{i}_{t}+\log\frac{t-1}{t})

ft(xt)ft(xtt)α1(logp^t+1t+logt)f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{x}^{t}_{t})\leq\alpha^{-1}(\log\hat{p}^{t}_{t+1}+\log t)

Using the α\alpha-exp concavity of ftf_{t} -

To complete the proof, we note the following two facts that are analogous to the ones used in Claim 10.4:

For 1i<t1\leq i<t, logptilogp^ti+logt1t\log p^{i}_{t}\geq\log\hat{p}^{i}_{t}+\log\frac{t-1}{t}

Proving these facts is left as an exercise.

Using this we can prove the following Lemma.

Consider the regret in II with respect to expert ErE^{r},

Given the properties of StS_{t}, we can show that in any interval the regret incurred is small.

base case: For q=0q=0 the regret is bounded by

The interval [r,i1][r,i-1] has size at most I2[2q1,2q)\frac{|I|}{2}\in[2^{q-1},2^{q}), and by induction the algorithm has regret of at most RTqR_{T}\cdot q on this interval. This gives a total of RT(q+1)R_{T}(q+1) regret on II. ∎

The running time of FLH2 is bounded by StVT(A)|S_{t}|\cdot V_{T}({\mathcal{A}}). Since St=O(logt)|S_{t}|=O(\log t), we can bound the running time by O(VT(A)logT)O(V_{T}({\mathcal{A}})\log T). This fact, together with Lemma 10.10, completes the proof. ∎

We now explain the pruning procedure used to maintain the set St{1,2,...,t}S_{t}\subseteq\{1,2,...,t\}.

We specify the lifetime of integer ii - if i=r2ki=r2^{k}, where rr is odd, then the lifetime of ii is 2k+2+12^{k+2}+1. Suppose the lifetime of ii is mm. Then for any time t[i,i+m]t\in[i,i+m], integer ii is alive at tt. The set StS_{t} is simply the set of all integers that are alive at time tt. Obviously, at time tt, the only integer added to StS_{t} is tt - this immediately proves Property (3). We now prove the other properties.

(Property (2)) For each 0klogt0\leq k\leq\lfloor\log t\rfloor, let us count the number of integers of the form r2kr2^{k} (rr odd) alive at tt. The lifetimes of these integers are 2k+2+12^{k+2}+1. The only integers alive lie in the interval [t2k+21,t][t-2^{k+2}-1,t]. Since all of these integers of this form are separated by gaps of size at least 2k2^{k}, there are at most a constant number of such integers alive at tt. In total, the size of StS_{t} is O(logt)O(\log t). ∎

6 Bibliographic Remarks

Dynamic regret bounds for online gradient descent were proposed by Zinkevich (2003), and further studied in (Besbes et al., 2015). It was shown in (Zhang et al., 2018) that adaptive regret bounds imply dynamic regret bounds.

The study of learning in changing environments can be traced to the seminal work of Herbster and Warmuth (1998) in the context of tracking for the problem of prediction from expert advice. Their technique was later extended to tracking of experts from a small pool (Bousquet and Warmuth, 2003).

The problem of tracking a large set of experts efficiently was studied using the Fixed-Share technique in (Singer, 1998; Kozat and Singer, 2007; György et al., 2005).

The deviation from Fixed-Share to the FLH technique and the notion of adaptive regret were introduced in Hazan and Seshadhri (2007). These techniques were subject of later study and extensions (Adamskiy et al., 2016; Zhang et al., 2019). Daniely et al. (2015) study adaptive regret for weakly convex loss functions and introduced the term “strongly adaptive”, which differentiates the weakly and strongly convex settings. They note that FLH is a strongly adaptive algorithm.

The use of an exponential look-back for prediction has roots in information theory (Willems and Krom, 1997; Shamir and Merhav, 2006). Efficient methods for streaming, that were used in this chapter to maintaining a small set of active experts, were studied in the steaming algorithms literature (Gopalan et al., 2007).

Adaptive regret algorithms were motivated by applications involving changing environments, such as the portfolio selection problem. More recently they were applied for time series prediction (Anava et al., 2013) and the control of dynamical systems (Gradu et al., 2020).

7 Exercises

2. Prove that the OGD algorithm for convex functions, with step sizes O(1t)O(\frac{1}{\sqrt{t}}), has an adaptive regret guarantee of O(T)O(\sqrt{T}), and that this is tight. Prove that the lazy version of OGD, from chapter 5, behaves differently and has an adaptive regret bound of Ω(T)\Omega(T).

3. Prove that the OGD algorithm with step size O(1t)O(\frac{1}{t}) for strongly convex functions, has adaptive regret which is lower bounded by Ω(T)\Omega(T).

4. Consider the problem of prediction from expert advice with α\alpha-exp-concave loss functions, where the best expert switches kk times. That is, time can be divided into kk segments I1,...,IkI_{1},...,I_{k}, such that the best expert in each segment is different. Show that the regret of the Fixed Share algorithm vs. the best kk-switching expert (a strategy that is allowed to change experts kk times) is bounded by

5. Spell out a choice of δ\delta parameter for the Fixed Share algorithm that does not require knowledge of the number of iterations TT in advance. Prove an analogue of Theorem 10.3 with your choice of δ\delta.

7. Prove the following facts used in the proof of Proposition 10.8:

(a) For 1i<t1\leq i<t, logptilogp^ti+logt1t\log p^{i}_{t}\geq\log\hat{p}^{i}_{t}+\log\frac{t-1}{t}

8. Implement meta-algorithms FLH and FLH2 with the ONS algorithm, and apply the resulting method on the portfolio selection problem. Benchmark your results and compare to ONS and OGD.

Chapter 11 Boosting and Regret

In this chapter we consider a fundamental methodology of machine learning: boosting. In the statistical learning setting, roughly speaking, boosting refers to the process of taking a set of rough “rules of thumb” and combining them into a more accurate predictor.

Consider for example the problem of Optical Character Recognition (OCR) in its simplest form: given a set of bitmap images depicting hand-written postal-code digits, classify those that contain the digit “1” from those of “0”.

Seemingly, discerning the two digits seems a difficult task taking into account the different styles of handwriting, inconsistent styles even for the same person, label errors in the training data, etc. However, an inaccurate rule of thumb is rather easy to produce: in the bottom-left area of the picture we’d expect many more dark bits for “1”s than if the image depicts a “0”. This is, of course, a rather inaccurate classifier. It does not consider the alignment of the digit, thickness of the handwriting, and numerous other factors. Nevertheless, as a rule of thumb - we’d expect better-than-random performance and some correlation with the ground truth.

The inaccuracy of the crude single-bit predictor is compensated by its simplicity. It is not hard to implement a classifier based upon this rule of thumb, which is very efficient indeed. The natural and fundamental question which now arises is: can several such rules of thumb be combined into a single, accurate and efficient classifier?

In the rest of this chapter we shall formalize this question in the statistical learning theory framework. We then proceed to use the technology developed in this manuscript, namely regret minimization algorithms for online convex optimization, to answer this question in the affirmative. Our development will be somewhat non-standard: we’ll describe a black-box reduction from regret-minimization to boosting. This allows any of the OCO methods previously discussed in this text to be used as the main component of a boosting algorithm.

Throughout this chapter we use the notation and definitions of chapter 9 on learning theory, and focus on statistical learnability rather than agnostic learnability. More formally, we assume the so called “realizability assumption”, which states that for a learning problem over hypothesis class H{\mathcal{H}} there exists some hHh^{\star}\in\mathcal{H} such that its generalization error is zero, or formally \mboxerror(h)=0.\mathop{\mbox{\rm error}}(h^{\star})=0.

Using the notations of the previous chapter, we can define the following seemingly weaker notion than statistical learnability.

The concept class H:XY{\mathcal{H}}:{\mathcal{X}}\mapsto{\mathcal{Y}} is said to be γ\gamma-weakly-learnable if the following holds. There exists an algorithm A{\mathcal{A}} that accepts Sm={(x,y)}S_{m}=\{(\mathbf{x},y)\} and returns an hypothesis in A(Sm)H{\mathcal{A}}(S_{m})\in{\mathcal{H}} that satisfies: for any δ>0\delta>0 there exists m=m(δ)m=m(\delta) large enough such that for any distribution D{\mathcal{D}} over pairs (x,y)(\mathbf{x},y), for y=h(x)y=h^{\star}(\mathbf{x}), and mm samples from this distribution, it holds that with probability 1δ1-\delta,

This is an apparent weakening of the definition of statistical learnability that we have described in chapter 9: the error is not required to approach zero. The standard case of statistical learning in the context of boosting is called “strong learnability”. An algorithm that achieves weak learning is referred to as a weak learner, and respectively we can refer to a strong learner as an algorithm that attains statistical learning, i.e., allows for generalization error arbitrarily close to zero, for a certain concept class.

The central question of boosting can now be formalized: are weak learning and strong learning equivalent? In other words, is there an (efficient?) procedure that has access to a weak oracle for a concept class, and returns a strong learner for the class?

Miraculously, the answer is affirmative, and gives rise to one of the most effective paradigms in machine learning.

2 Boosting by Online Convex Optimization

In this section we describe a reduction from OCO to boosting. The template is similar to the one we have used in chapter 9: using one of the numerous algorithms for online convex optimization we have explored in this manuscript, as well as access to a weak learner, we create a procedure for strong learning.

Our derivation focuses on simplicity rather than generality. As such, we make the following assumptions:

We restrict ourselves to the classical setting of binary classification. Boosting to real-valued losses is also possible, but outside our scope. Thus, we assume the loss function to be the zero-one loss, that is:

We assume that the concept class is realizable, i.e., there exists an hHh^{\star}\in{\mathcal{H}} such that \mboxerror(h)=0\mathop{\mbox{\rm error}}(h^{\star})=0. There are results on boosting in the agnostic learning setting, these are surveyed in the bibliographic section.

We denote the distribution over examples X×Y={(x,y)}{\mathcal{X}}\times{\mathcal{Y}}=\{(x,y)\}, where y=h(x)y=h^{\star}(\mathbf{x}), as a point in ΔX\Delta_{{\mathcal{X}}}. That is, a point pΔX\mathbf{p}\in\Delta_{{\mathcal{X}}} is a non-negative vector that integrates to one over all examples. For simplicity, we think of X,Y{\mathcal{X}},{\mathcal{Y}} as a finite, and therefore pΔm\mathbf{p}\in\Delta_{m} belongs to the mm dimensional simplex, i.e., is a discrete distribution over mm elements.

We henceforth denote the weak learning algorithm by W{\mathcal{W}}, and denote by W(p,δ){\mathcal{W}}(\mathbf{p},\delta) a call to the weak learning algorithm over distribution p\mathbf{p} that satisfies

With these assumptions and definitions we are ready to prove the main result: a reduction from weak learning to strong learning using an online convex optimization algorithm with a sublinear regret bound. Essentially, our task would be to find a hypothesis which attains zero error on a given sample.

2.2 Algorithm and analysis

Pseudocode for the boosting algorithm is given in Algorithm 34. This reduction accepts as input a γ\gamma-weak learner and treats it as a black box, returning a function which we’ll prove is a strong learner.

The reduction also accepts as input an online convex optimization algorithm denoted AOCO{\mathcal{A}}^{OCO}. The underlying decision set for the OCO algorithm is the mm-dimensional simplex, where mm is the sample size. Thus, its decisions are distributions over examples. The cost functions are linear, and assign a value of zero or one, depending on whether the current hypothesis errs on a particular example. Hence, the cost at a certain iteration is the expected error of the current hypothesis (chosen by the weak learner) over the distribution chosen by the low-regret algorithm.

It is important to note that the final hypothesis hˉ\bar{h} which the algorithm outputs does not necessarily belong to H{\mathcal{H}} - the initial hypothesis class we started off with.

Algorithm 34 returns a hypothesis hˉ\bar{h} such that with probability at least 1δ1-\delta,

Given hHh\in\mathcal{H}, we denote its empirical error on the sample SS, weighted by the distribution pδm\mathbf{p}\in\delta_{m}, by:

Notice that by definition of rt\mathbf{r}_{t} we have rtpt=1\mboxerrorS,pt(ht)\mathbf{r}_{t}^{\top}\mathbf{p}_{t}=1-\mathop{\mbox{\rm error}}_{S,\mathbf{p}_{t}}(h_{t}). Since hth_{t} is the output of a γ\gamma-weak-learner on the distribution pt\mathbf{p}_{t}, we have for all t[T]t\in[T],

This applies for each tt separately, and by the union bound we have

Denote by SϕSS_{\phi}\subseteq S be the set of all missclassified examples by hˉ\bar{h}. Let p\mathbf{p}^{*} the uniform distribution over SϕS_{\phi}.

Combining the previous two observations, we have with probability at least 1δ1-\delta that

This is a contradiction. We conclude that a distribution p\mathbf{p}^{*} cannot exist, and thus all examples in SS are classified correctly.

2.3 AdaBoost

A special case of the template reduction we have described is obtained when the OCO algorithm is taken to be the Multiplicative Updates method we have come to know in the manuscript.

Corollary 5.7 gives a bound of O(Tlogm)O(\sqrt{T\log m}) on the regret of the EG algorithm in our context. This bounds TT in Algorithm 34 by O(1γ2logm)O(\frac{1}{\gamma^{2}}\log m).

Closely related is the AdaBoost algorithm, which is one of the most useful and successful algorithms in Machine Learning at large (see bibliography). Unlike the Boosting algorithm that we have analyzed, AdaBoost doesn’t have to know in advance the parameter γ\gamma of the weak learners. Pseudo code for the AdaBoost algorithm is given in 35.

2.4 Completing the picture

In our discussion so far we have focused only on the empirical error over a sample. To show generalization and complete the Boosting theorem, one must show that zero empirical error on a large enough sample implies ε\varepsilon generalization error on the underlying distribution.

Notice that the hypothesis returned by the Boosting algorithms does not belong to the original concept class. This presents a challenge for certain methods of proving generalization error bounds that are based on measure concentration over a fixed hypothesis class.

Both issues are resolved using the implication that compression implies generalization, as given in Theorem 9.9. We sketch the argument below, and the precise derivation is left as an exercise.

Roughly speaking, boosting algorithm 34 runs on mm examples for T=O(logmγ2)T=O(\frac{\log m}{\gamma^{2}}) rounds, returns a final hypothesis hˉ\bar{h} that is the majority vote of TT hypothesis, and classifies correctly all mm examples of the training set.

Suppose that the weak learning algorithm has sample complexity of size k(γ,δ)k(\gamma,\delta): given k=k(γ,δ)k=k(\gamma,\delta) examples, it returns a hypothesis with generalization error at most 12γ\frac{1}{2}-\gamma with probability at least 1δ1-\delta. Further, suppose the original training set of mm examples was sampled from distribution D{\mathcal{D}}.

Since hˉ\bar{h} classifies correctly the entire training set, it follows that the distribution D{\mathcal{D}} has a compression scheme of size

Therefore, using Theorem 9.9, we have that,

Now one can obtain an arbitrary small generalization error by choosing mm as a function of k,δ,γk,\delta,\gamma. Notice that this argument makes an assumption only about the sample complexity of the weak learning algorithm, rather than the hypothesis class H{\mathcal{H}}.

3 Bibliographic Remarks

The theoretical question of Boosting and posed and addressed in the work of Schapire (1990); Freund (1995). The AdaBoost algorithm was proposed in the seminal paper of Freund and Schapire (1997). The latter paper also contains the essential ingredients for the reduction from general low-regret algorithms to boosting.

Boosting has had significant impact on theoretical and practical data analysis as described by the statistician Leo Breiman (Olshen, 2001). For a much more comprehensive survey of Boosting theory and applications see the recent book (Schapire and Freund, 2012).

The theory for agnostic boosting is more recent, and several different definitions and settings exist, see (Kalai et al., 2008; Kalai and Servedio, 2005; Kanade and Kalai, 2009; Feldman, 2009; Ben-David et al., 2001), the most general of which is perhaps by Kanade and Kalai (2009).

A unified framework for realizable and agnostic boosting, for both the statistical and online settings, is given in (Brukhim et al., 2020).

The theory of boosting has been extended to real valued learning via the theory of gradient boosting (Friedman, 2002). More recently it was extended to online learning (Leistner et al., 2009; Chen et al., 2012, 2014; Beygelzimer et al., 2015a, b; Agarwal et al., 2019; Jung et al., 2017; Jung and Tewari, 2018; Brukhim and Hazan, 2020).

4 Exercises

1. Describe a boosting algorithm based on the online gradient descent algorithm. Give a bound on its running time.

2. Download the MNIST dataset from the web. Implement weak learners to differentiate a pair of digits of your choice according to a single bit. Implement AdaBoost and apply it to your weak learners. Summarize your results and conclusions.

3. ∗ Consider the problem of agnostic boosting, in which the existence of a zero-error hypothesis is not assumed.

(a) Write down an alternative definition of a weak learning algorithm for the agnostic setting.

(b) Write down a reasonable goal for a boosting algorithm.

(c) Write down an analogue of Theorem 11.2 for agnostic boosting (without proof).

4. Compute the number of samples required to achieve a generalization error of ε\varepsilon, using boosting algorithm 34 and a weak learning algorithm with sample complexity bound k(γ,δ)k(\gamma,\delta).

Chapter 12 Online Boosting

This text considers online optimization and learning, and it is a natural question to ask whether the technique of boosting has an analogue in the online world? What is a “weak learner” in online convex optimization, and how can one strengthen it? This is the subject of this chapter, and we shall see that boosting can be extremely powerful and useful in the setting of online convex optimization.

Recall the classical problem of prediction from expert advice from the first chapter of this text. A learner iteratively makes decisions and receives loss according to an arbitrarily chosen loss function. For its decision making, the learner is assisted by a pool of experts. Classical algorithms such as the Hedge algorithm 1, guarantees a regret bound of O(TlogN)O(\sqrt{T\log N}), where NN is the number of experts, and this is known to be tight.

However, in many problems of interest, the class of experts is too large to efficiently manipulate. This is particularly evident in contextual learning, as formally defined below, where the experts are policies – functions mapping contexts to action. In such instances, even if a regret bound of O(TlogN)O(\sqrt{T\log N}) is meaningful, the algorithms achieving this bound are computationally inefficient; their running time is linear in NN. This linear dependence is many times unacceptable: the effective number of policies mapping contexts to actions is exponential in the number of contexts.

The boosting approach to address this computational intractability is motivated by the observation that it is often possible to design simple rules-of-thumb that perform slightly better than random guesses. Analogously to the weak learning oracles from chapter 11, We propose that the learner has access to an “online weak learner” - a computationally cheap mechanism capable of guaranteeing multiplicatively approximate regret against a base hypotheses class.

In the rest of this chapter we describe efficient algorithms that when provided weak learners, compete with the convex hull of the base hypotheses class with near-optimal regret.

As a more precise example to the motivation we just surveyed, we formalize online boosting for binary prediction from expert advice. At iteration tt, a set of experts denoted hHh\in{\mathcal{H}}, observe a context at\mathbf{a}_{t}, and predict a binary outcome h(at){1,1}h(\mathbf{a}_{t})\in\{-1,1\}. The loss of each expert is taken to be the binary loss, h(at)yt-h(\mathbf{a}_{t})\cdot y_{t} for a true label yt{1,1}y_{t}\in\{-1,1\}. The Hedge algorithm from the first chapter applies to this problem, and guarantees a regret of O(TlogH)O(\sqrt{T\log|{\mathcal{H}}|}) for a finite H{\mathcal{H}}. However, the case in which H{\mathcal{H}} is extremely large, maintaining the weights is prohibitive computationally.

A weak online learner W{\mathcal{W}} in this setting is an algorithm which is guaranteed to attain at most a factor γ\gamma loss from the best expert in class, for some γ\gamma\in, up to an additive sublinear regret term. Formally, for any sequence of contexts and labels {at,yt}\{\mathbf{a}_{t},y_{t}\},

The online boosting question can now be phrased as follows: given access to a weak online learning algorithm W{\mathcal{W}}, can we design an efficient online algorithm A{\mathcal{A}} that guarantees vanishing regret over H{\mathcal{H}}? More formally, let

1.2 Example: personalized article placement

In the problem of matching articles to visitors of a web-page on the Internet, a number of articles are available to be placed in a given web-page for a particular visitor. The goal of the decision maker, in this case the article placer, is to find the most relevant article that will maximize the probability of a visitor click.

It is usually the case that context is available, in the form of user profile, preferences surfing history and so forth. This context is invaluable in terms of placing the most relevant article. Thus, the decision of the article-placer is to choose from a policy: a mapping from context to article.

The space of policies is significantly larger than the space of articles and space of contexts: its size is the power of articles to the cardinality of contexts. This motivates the use of online learning algorithms whose computational complexity is independent of the number of experts.

The natural formulation of this problem is not binary prediction, but rather multi-class prediction. Formulating this problem in the language of online convex optimization is left as an exercise.

2 The Contextual Learning Model

Boosting in the context of online convex optimization is most useful for the contextual learning problem which we now describe.

We have studied numerous methods capable of minimizing regret for this setting in this text, all assumed that we have access to the set H{\mathcal{H}}, and depend on its diameter in some way.

To avoid this dependence, we consider an alternative access model to H{\mathcal{H}}. A weak learner for the OCO setting is defined as follows.

An online learning algorithm W{\mathcal{W}} is a γ\gamma-weak OCO learner (WOCL) for H{\mathcal{H}} and γ(0,1)\gamma\in(0,1), if for any sequence of contexts {at}\{\mathbf{a}_{t}\} and linear loss functions f1,...,fTf_{1},...,f_{T}, for which maxxKft(x)minyKft(y)1\max_{\mathbf{x}\in\mathcal{K}}f_{t}(\mathbf{x})-\min_{\mathbf{y}\in\mathcal{K}}f_{t}(\mathbf{y})\leq 1 15The results henceforth hold with a different constant if we replace one by a different scaling., we have

where xˉ=xKx\bar{\mathbf{x}}=\int_{\mathbf{x}\in\mathcal{K}}\mathbf{x} is the center of mass of K\mathcal{K}.

This definition differs in two aspects from the types of regret minimization guarantees we have seen thus far. For one, the algorithm competes with a γ\gamma-multiple of the best comparator in hindsight, and is “weak” in this precise manner.

Secondly, a multiplicative guarantee is not invariant for a constant shift. This is the reason for the existence of an additional component, tft(xˉ)\sum_{t}f_{t}(\bar{\mathbf{x}}), in the regret bound. This can be thought of as the cost of a random, or naive predictor. A weak learner must, at the very least, perform better than this naive and non-anticipating predictor!

It is convenient to henceforth assume that the loss functions are shifted such that ft(xˉ)=0f_{t}(\bar{\mathbf{x}})=0. Under this assumption, we can rephrase γ\gamma-WOCL as

3 The Extension Operator

The main difficulty is coping with the approximate guarantee that the WOCL provides. Therefore the algorithm we describe henceforth scales the predictions returned by the weak learner by a factor of 1γ\frac{1}{\gamma}. This means that the scaled decisions do not belong to the original decision set anymore, and need to be projected back.

First, denote the Euclidean distance function to a set K\mathcal{K} as (see also section 13.2),

where the smoothing operator SδS_{\delta} was defined as per Lemma 2.8.

The important take-away from these operators is the following lemma, whose importance is crucial in the OCO boosting algorithm 36, as it projects infeasible points that are obtained from the weak learners to the feasible domain.

The (K,κ,δ)(\mathcal{K},\kappa,\delta)-extension of a function f^=X[f]{\hat{f}}=X[f] satisfies the following:

For every point xK\mathbf{x}\in\mathcal{K}, we have f^(x)f(x)2δG\|{\hat{f}}(\mathbf{x})-f(\mathbf{x})\|_{2}\leq\delta G.

The projection of a point, whose gradient is bounded by GG, onto K\mathcal{K} improves the (K,κ,δ)(\mathcal{K},\kappa,\delta)-extension function, for κ=G\kappa=G, value up to a small term,

Since Dist(x,K)=0{\bf Dist}(\mathbf{x},\mathcal{K})=0 for all xKx\in\mathcal{K}, this follows immediately from Lemma 2.8.

Denote xπ=K(x)\mathbf{x}_{\pi}=\prod_{\mathcal{K}}(\mathbf{x}) for brevity. Then

4 The Online Boosting Method

The online boosting algorithm we describe in this section is closely related to the online Frank-Wolfe algorithm from chapter 7. Not only does it deliver in boosting WOCL to strong learning, but it gives an even stronger guarantee: low regret over the convex hull of the hypothesis class.

Algorithm 36 efficiently converts a weak online learning algorithm into an OCO algorithm with vanishing regret in a black-box manner. The idea is to apply the weak learning algorithm on linear functions that are gradients of the loss. The algorithm then recursively applies another weak learner on the gradients of the residual loss, and so forth.

However, the Frank-Wolfe method is not applied directly to the loss functions, but rather to a proxy loss which defined using the extension operation in 12.2. Importantly, algorithm 36 has a running time that is independent of H|{\mathcal{H}}|.

Notice that if γ=1\gamma=1, the algorithm stills gives a significant advantage as compared to the weak learner: the regret guarantee is vs. the convex hull of H{\mathcal{H}}, as compared to the best single hypothesis.

The predictions xt\mathbf{x}_{t} generated by Algorithm 36 with δ=D2γN,ηi=min{2i,1}\delta=\sqrt{\frac{D^{2}}{\gamma N}},\eta_{i}=\min\{\frac{2}{i},1\} satisfy

It is possible to obtain tighter bounds by a factor of the dimension, and other constant terms, using a more sophisticated smoothing operator. References for these tighter results are given in the bibliographic section at the end of this chapter.

The regret bound of Theorem 12.4 is nearly as good as we could hope for. The first term approaches zero as the number of weak learners NN grows. The second term is sublinear as the regret of the weak learner. It is scaled by a factor of 1γ\frac{1}{\gamma}, which we can expect due to the approximate guarantee of the weak learner.

Before proving the theorem, let us define some notations we use. The algorithm defines the extension of the loss functions as

We apply the setting of κ=G\kappa=G, as required by Lemma 12.3, and by Lemma 2.8, f^t{\hat{f}}_{t} is dGδ\frac{dG}{\delta}-smooth. Also, denote by CH(H)={hHphhpΔH}\mathbf{CH}({\mathcal{H}})=\{\sum_{h\in{\mathcal{H}}}\mathbf{p}_{h}h|\mathbf{p}\in\Delta_{\mathcal{H}}\} the convex hull of the set H{\mathcal{H}}, and let

to be the best hypothesis in the convex hull of H{\mathcal{H}} in hindsight, i.e., the best convex combination of hypothesis from H{\mathcal{H}}. Notice that since the loss functions are generally convex and non-linear, this convex combination is not necessarily a singleton. We define xt=h(at)\mathbf{x}_{t}^{\star}=h^{\star}(\mathbf{a}_{t}) as the decisions of this hypothesis.

The main crux of the proof is given by the following lemma.

For smoothed loss functions {f^t}\{{\hat{f}}_{t}\} that are β{\beta}-smooth and G^\hat{G} Lipschitz, it holds that

Recall that f^t{\hat{f}}_{t} is β{\beta} smooth by our assumption. Therefore:

By using the definition and linearity of ftif_{t}^{i}, we have

Now, note the following equivalent restatement of the WOCL guarantee, which again utilizes linearity of ftif_{t}^{i} to conclude: linear loss on a convex combination of a set is equal to the same convex combination of the linear loss applied to individual elements.

Using the above and that hCH(H)h^{\star}\in\mathbf{CH}({\mathcal{H}}), we have

Denoting Δ^i=ΔiRT\hat{\Delta}_{i}=\Delta_{i}-{R_{T}}, we are left with

This is a recursive relation that can be simplified by applying Lemma 7.2 from chapter 7. We obtain that Δ^N2βD2Tγ2N\hat{\Delta}_{N}\leq\frac{2{\beta}D^{2}T}{\gamma^{2}N}. ∎

We are ready to prove the main guarantee of Algorithm 36.

Using both parts of Lemma 12.3 in succession, we have

Next, recall by Lemma 2.8, that f^t{\hat{f}}_{t} is dGδ\frac{dG}{\delta}-smooth. By applying Lemma 12.5, and optimizing δ\delta, we have

where the last inequality is only to obtain a nicer expression.

It remains to bound G^\hat{G}, and we claim that G^2G\hat{G}\leq 2G. To see this, notice that the function Dist(x,K){\bf Dist}(\mathbf{x},\mathcal{K}) is 11-Lipschitz, since

Thus, by the definition of the extension operator and the functions ftif_{t}^{i}, we have that fti(xti)=f^t(xti)2G\|\nabla f_{t}^{i}(\mathbf{x}_{t}^{i})\|=\|\nabla\hat{f}_{t}(\mathbf{x}_{t}^{i})\|\leq 2G. ∎

5 Bibliographic Remarks

The theory of boosting, which we have surveyed in chapter 11, originally applied to binary classification problems. Boosting for real-valued regression was studied in the theory of gradient boosting by Friedman (2002).

Online boosting, for both the classification and regression settings was studied much later (Leistner et al., 2009; Chen et al., 2012, 2014; Beygelzimer et al., 2015a, b; Agarwal et al., 2019; Jung et al., 2017; Jung and Tewari, 2018; Brukhim and Hazan, 2020). The relationship to the Frank-Wolfe method was explicit in these works, and also studied in (Freund et al., 2017; Wang et al., 2015). A framework which encapsulates both agnostic and realizable boosting, for both offline and online settings, is given in (Brukhim et al., 2020).

Boosting for the full online convex optimization setting, with a multiplicative approximation and general convex decision set, was obtained in (Hazan and Singh, 2021). The latter also give tighter bounds by a factor of the dimension than those presented in this text using a more sophisticated smoothing technique known as the Moreau-Yoshida regularization (Beck, 2017).

The contextual experts and bandits problems have been proposed by Langford and Zhang (2008) as a decision making framework with large number of policies. In the online setting, several works study the problem with emphasis on efficient algorithms given access to an optimization oracle (Rakhlin and Sridharan, 2016; Syrgkanis et al., 2016a, b; Rakhlin and Sridharan, 2016). For surveys on contextual bandit algorithms and applications of this model see (Zhou, 2015; Bouneffouf and Rish, 2019).

6 Exercises

1. Derive, from Algorithm 36, an algorithm for online binary classification. Spell out its regret guarantee.

2. Formulate the online personalized article placement problem in the language of online convex optimization (what is the decision set?). Define a weak learner in this context, and the final boosting guarantee from Theorem 12.4.

3. ∗ Define the Moreau-Yoshida regularization (or smoothing operator) as:

Prove that given any GG-Lipschitz ff, the smoothed function f^δ=Mδ[f]{\hat{f}}_{\delta}=M_{\delta}[f] satisfies:

(a) f^δ\hat{f}_{\delta} is 1δ\frac{1}{\delta}-smooth, and GG-Lispchitz.

4. ∗ Use the Moreau-Yoshida regularization to improve the bounds of Theorem 12.4 by a factor of the dimension.

Chapter 13 Blackwell Approachability and Online Convex Optimization

The history of adversarial prediction started with the seminal works of mathematicians David Blackwell and James Hannan. In most of the text thus far, we have presented the viewpoint of sequential prediction and loss minimization, taken by Hannan. This was especially true in chapter 5, as the FPL algorithm dates back to his work. In this chapter we turn to a dual view of regret minimization, called “Blackwell approachability”. Approachability theory originated in the work of Blackwell, and was discovered simultaneously to that of Hannan. A short historical account is surveyed in the bibliographic materials at the end of this chapter.

For decades the relationship between regret minimization in general convex games and Blackwell approachability was not fully understood. The common thought was, in fact, that Blackwell approachability is a stronger notion. In this chapter we show that approachability and online convex optimization are equivalent in a strong sense: an algorithm for one task implies an algorithm for the other with no loss of computational efficiency.

As a side benefit to this equivalence, we deduce a proof of Blackwell’s approachability theorem using the existence of online convex optimization algorithms. This proof applies to a more general version of approachabilty, over general vector games, and comes with rates of convergence that are borrowed from the OCO algorithms we have already studied.

While previous chapters had a practical motivation and introduced methods for online learning, this chapter is purely theoretical, and devoted to give an alternate viewpoint of online convex optimization from a game theoretic perspective.

Von Neumann’s minimax theorem, that we have studied in chapter 8, establishes a central result in the theory of two-player zero-sum games by providing a prescription to both players. This prescription is in the form of a pair of optimal mixed strategies: each strategy attains the optimal worst-case value of the game without knowledge of the opponent’s strategy. However, the theorem fundamentally requires that both players have a utility function that can be expressed as a scalar.

In 1956, in response to von Neumann’s result, David Blackwell posed an intriguing question: what guarantee can we hope to achieve when playing a two-player game with a vector-valued payoff?

A vector-valued game is defined similarly to zero-sum games as we have defined in Definition 8.2, with reward/loss vectors replacing the scalar rewards/losses.

Similar to scalar games, we can define mixed strategies as distributions over pure strategies, and denote the expected reward vector for playing mixed strategies by

We henceforth consider more general vector games than originally considered in the literature. The additional generality allows for uncountably many strategies for both players, and allows the strategies to originate from bounded convex and closed sets in Euclidean space.

The goal of a zero-sum game is clear: to guarantee a certain loss/reward. What should be the vector game generalization? Blackwell proposed to ask “can we guarantee that our vector payoff lies in some closed convex set SS?”

It is left as an exercise at the end of this chapter to show that an immediate analogue of Von Neumann’s theorem does not exist: there is no single mixed strategy that ensures the vector payoff lies in a given set. However, this does not rule out an asymptotic notion, if we allow the game to repeat indefinitely, and ask whether there exists a strategy to ensure that the average reward vector lies in a certain set, or at least approaches it in terms of Euclidean distance. This is exactly the solution concept that Blackwell proposed as defined formally below.

Using the notation we have used throughout this text, we denote the (Euclidean) distance to a bounded, closed and convex set SS as

Under this notion, we can now allow the player to implement an adaptive strategy for a repeated version of the game, and we require that the average reward vector comes arbitrarily close to SS. Blackwell’s theorem characterizes which sets in Euclidean space are approachable. We give it below in generalized form,

The approachability condition spelled out in the equation above is both necessary and sufficient. The necessity of this condition is left as an exercise, and the more interesting implication is that any set that satisfies this condition is, in fact, approachable. Our reductions henceforth give an explicit proof of Blackwell’s theorem, and we leave it as an exercise to draw the explicit conclusion of this theorem from the first efficient reduction.

The relationship between Blackwell approachability in vector games and OCO may not be evident at this point. However, we proceed to show that the two notions are in fact algorithmically equivalent.

In the next section we show that any algorithm for OCO can be efficiently converted to an approachability algorithm for vector games. Following this, we show the other direction as well: an approachability algorithm for vector games gives an OCO algorithm with no loss of efficiency!

2 From Online Convex Optimization to Approachability

In this section we give an efficient reduction from OCO to approachability. Namely, assume that we have an OCO algorithm denoted A{\mathcal{A}}, that attains sublinear regret. Our goal is to design a Blackwell approachability algorithm for a given vector game and closed, bounded convex set SS. Thus, the reduction in this section shows that OCO is a stronger notion than approachabiliy. This direction is perhaps the more surprising one, and was discovered more recently, see bibliographic section for an historical account of this development.

Since we are looking to approach a given set, it is natural to consider minimizing the distance of our reward vector to the set. Recall we denote the (Euclidean) distance to a set as Dist(w,S)=minxSwx{\bf Dist}(\mathbf{w},S)=\min_{\mathbf{x}\in S}\|\mathbf{w}-\mathbf{x}\|. The support function of closed convex set SS is given by

Notice that this function is convex, since it is a maximum over linear functions.

Using the definition of the support function,

Blackwell’s theorem characterizes approachable sets: it is necessary and sufficient to be able to find a best response xK1\mathbf{x}\in\mathcal{K}_{1}, to any yK2\mathbf{y}\in\mathcal{K}_{2}, such that u(x,y)S\mathbf{u}(\mathbf{x},\mathbf{y})\in S. To proceed with the reduction, we need an equivalent condition stated formally as follows.

For a generalized vector game K1,K2,{u}\mathcal{K}_{1},\mathcal{K}_{2},\{\mathbf{u}\}, the following to conditions are equivalent:

Blackwell’s theorem asserts that λ=0\lambda=0 if and only if SS is approachable. Using Sion’s generalization to the Von Neumann minimax theorem from chapter 8,

Thus, the second statement of the lemma is satisfied if and only if λ=0\lambda=0.

As mentioned previously, the necessity of Blackwell’s condition is left as an exercise. To prove sufficiency, we assume that form (2) of Blackwell’s condition as in Lemma 13.6 is satisfied. Formally, we henceforth assume that the vector game and set SS are equipped with a best response oracle O{\mathcal{O}}, such that

We proceed with the formal proof of sufficiency, constructively specified in Algorithm 37. Notice that in this reduction, the functions ftf_{t} are concave, and the OCO algorithm is used for maximization.

Algorithm 37, with input OCO algorithm A{\mathcal{A}}, returns the vector uˉT=1Tt=1Tu(xt,yt)\bar{\mathbf{u}}_{T}=\frac{1}{T}\sum_{t=1}^{T}\mathbf{u}(\mathbf{x}_{t},\mathbf{y}_{t}) that approaches the set SS at a rate of

Notice that equation (13.2) implies that for wt\mathbf{w}_{t} as defined in the algorithm, we have

This theorem explicitly relates OCO with approachability, and since we have already proved the existence of efficient OCO algorithms in this text, it can be used to formally prove Blackwell’s theorem. Completing the details is left as an exercise.

3 From Approachability to Online Convex Optimization

In this subsection we show the converse reduction: given an approachability algorithm, we design an OCO algorithm with no loss of computational efficiency. This direction was essentially shown by Blackwell for discrete decision problems, as described in more detail in the bibliographic section. We prove it here in the full generality of OCO.

Formally, given an approachability algorithm A{\mathcal{A}}, denote by DistT(A){\bf Dist}_{T}({\mathcal{A}}) an upper bound on its rate of convergence to the set SS as a function of the number of iterations TT. That is, for a given vector game, denote by uˉT=1Tt=1Tu(xt,yt)\bar{\mathbf{u}}_{T}=\frac{1}{T}\sum_{t=1}^{T}\mathbf{u}(\mathbf{x}_{t},\mathbf{y}_{t}) the average reward vector. Then A{\mathcal{A}} guarantees

Given an approachability algorithm A{\mathcal{A}}, we henceforth create an OCO algorithm with vanishing regret.

Approachability is in a certain geometric sense a dual to OCO. To see this, we require several geometric notions, that are explicitly required for the reduction from approachability to OCO.

It is left as an exercise to prove that K0\mathcal{K}^{0} is a convex set, and that for cones, the polar to the polar is the original set.

That is, we take all points in the polar set to the direct sum 1K1\oplus\mathcal{K}.

This definition of the polar set gives rise to the following quantitative characterization.

3.2 The reduction

Algorithm 38 takes as an input a Blackwell approachability algorithm that guarantees, under the necessary and sufficient condition, convergence to a given set. It also takes as an input a set K\mathcal{K} for OCO.

The reduction considers a vector game with decision sets K,F\mathcal{K},{\mathcal{F}} and approachability set S=Q(K)S=Q(\mathcal{K}), and generates a sequence of decisions that guarantee low regret as we prove next.

Since this reduction creates the approachability set SS as a function of K\mathcal{K}, we need to prove that indeed the set SS is approachable. We show this in the next subsection.

The reduction defined in Algorithm 38, for any input algorithm A{\mathcal{A}}, produces an OLO algorithm L{\mathcal{L}} such that

The approachability algorithm guarantees Dist(uTˉ,S)DistT(A){\bf Dist}(\bar{\mathbf{u}_{T}},S)\leq{\bf Dist}_{T}({\mathcal{A}}). Using the definition of SS and Lemma 13.8 we have

3.3 Existence of a best response oracle

Notice that the reduction of this section from approachability to OCO does not require the best response oracle. However, Blackwell’s approachability theorem does require this oracle as sufficient and necessary, and thus for the set SS we constructed to be approachable at all, such an oracle needs to exist. This is what we show next.

Consider the vectors ut\mathbf{u}_{t} constructed in the reduction. A best response oracle finds, for every vector y\mathbf{y}, a vector x\mathbf{x} that guarantees u(x,y)S\mathbf{u}(\mathbf{x},\mathbf{y})\in S. In our case, this translates to the condition

In other words, the best response oracle corresponds to a procedure that given ff, finds a vector x\mathbf{x}^{\star} such that

This is an optimization oracle for the set K\mathcal{K}!

4 Bibliographic Remarks

David Blackwell’s celebrated Approachability Theorem was published in (Blackwell, 1956). The first no-regret algorithm for a discrete action setting was given in a seminal paper by James Hannan in (Hannan, 1957) the next year. That same year, Blackwell pointed out (Blackwell, 1954) that his approachability result leads, as a special case, to an algorithm with essentially the same low-regret guarantee proven by Hannan. For Hannan’s account of events see (Gilliland et al., 2010).

Over the years several other problems have been reduced to Blackwell approachability, including asymptotic calibration (Foster and Vohra, 1998), online learning with global cost functions (Even-Dar et al., 2009) and more (Mannor and Shimkin, 2008). Indeed, it has been presumed that approachability, while establishing the existence of a no-regret algorithm, is strictly more powerful than regret-minimization; hence its utility in such a wide range of problems.

However, this was recently shown not to be the case. Abernethy et al. (2011) showed that approachability is in fact equivalent to OCO. This result is the basis of the material presented in this chapter. One side of their reduction was simplified and generalized in (Shimkin, 2016).

5 Exercises

1. Prove that Von Neumann’s theorem does not hold for vector games, i.e., give an example of a vector game such that there is no single mixed strategy that ensures the vector payoff lies in a given set. Show that this is true even for approachable sets.

3. Prove that Blackwell’s condition is necessary, i.e., prove that for a set SS to be approachable in a given vector game, there must exist an oracle O{\mathcal{O}} such that

References