Analysis and Design of Optimization Algorithms via Integral Quadratic Constraints

Laurent Lessard, Benjamin Recht, Andrew Packard

Introduction

Convex optimization algorithms provide a powerful toolkit for robust, efficient, large-scale optimization algorithms. They provide not only effective tools for solving optimization problems, but are guaranteed to converge to accurate solutions in provided time budgets , are robust to errors and time delays , and are amendable to declarative modeling that decouples the algorithm design from the problem formulation . However, as we push up against the boundaries of the convex analysis framework, try to build more complicated models, and aim to deploy optimization systems in highly complex environments, the mathematical guarantees of convexity start to break. The standard proof techniques for analyzing convex optimization rely on deep insights by experts and are devised on an algorithm-by-algorithm basis. It is thus not clear how to extend the toolkit to more diverse scenarios where multiple objectives—such as robustness, accuracy, and speed—need to be delicately balanced.

This paper marks an attempt at providing a systematized approach to the design and analysis optimization algorithms using techniques from control theory. Our strategy is to adapt the notion of an integral quadratic constraint from robust control theory . These constraints link sequences of inputs and outputs of operators, and are ideally suited to proving algorithmic convergence. We will see that for convex functions, we can derive these constraints using only the standard first-order characterization of convex functions, and that these inequalities will be sufficient to reduce the analysis of first-order methods to the solution of a very small semidefinite program. Our IQC framework puts the analysis of algorithms in a unified proof framework, and enables new analyses of algorithms by minor perturbations of existing proofs. This new system aims to simplify and automate the analysis of optimization programs, and perhaps to open new directions for algorithm design.

Our methods are inspired by the recent work of Drori and Teboulle . In their manuscript, the authors propose writing down the first-order convexity inequality for all steps of an algorithmic procedure. They then derive a semidefinite program that analytically verifies very tight bounds for the convergence rate for the Gradient method, and numerically precise bounds for convergence of Nesterov’s method and other first-order methods. The main drawback of the Drori and Teboulle approach is that the size of the semidefinite program scales with the number of time steps desired. Thus, it becomes computationally laborious to analyze algorithms that require more than a few hundred iterations.

Integral quadratic constraints will allow us to circumvent this issue. A typical example of one of our semidefinite programs might have a 3×33\times 3 positive semidefinite decision variable, 3 scalar variables, a 5×55\times 5 semidefinite cone constraint, and 4 scalar constraints. Such a problem can be solved in less than 10 milliseconds on a laptop with standard solvers.

We are able to analyze a variety of methods in our framework. We show that our framework recovers the standard rates of convergence for the Gradient method applied to strongly convex functions. We show that we can numerically estimate the performance of Nesterov’s method. Indeed, our analysis provides slightly sharper bounds than Nesterov’s proof. We show how our system fails to certify the stability of the popular Heavy-ball method of Polyak for strongly convex functions whose condition ratio is larger than 18. Based on this analysis, we are able to construct a one-dimensional strongly convex function whose condition ratio is 25 and prove analytically that the Heavy-ball method fails to find the global minimum of this function. This suggests that our tools can also be used as a way to guide the construction of counterexamples.

We show that our methods extend immediately to the projected and proximal variants of all the first order methods we analyze. We also show how to extend our analysis to functions that are convex but not strongly convex, and provide bounds on convergence that are within a logarithmic factor of the best upper bounds. We also demonstrate that our methods can bound convergence rates when the gradient is perturbed by relative deterministic noise. We show how different parameter settings lead to very different degradations in performance bounds as the noise increases.

Finally, we turn to algorithm design. Since our semidefinite program takes as input the parameters of our iterative scheme, we can search over these parameters. For simple two-step methods, our algorithms are parameterized by 3 parameters, and we show how we can derive first-order methods that achieve nearly the same rate of convergence as Nesterov’s accelerated method but are more robust to noise.

The manuscript is organized as follows. We begin with a discussion of discrete-time dynamical system and how common optimization algorithms can be viewed as feedback interconnections between a known linear system with an uncertain nonlinear component. We then turn to show how quadratic Lyapunov functions can be used to certify rates of convergence for optimization problems and can be found by semidefinite programming. This immediately leads to the notion of an integral quadratic constraint. Another contribution of this work is a new form of IQC analysis geared specifically toward rate-of-convergence conclusions, and accessible to optimization researchers. We also discuss their history in robust control theory and how they can be derived. With these basic IQCs in hand, we then turn to analyzing the Gradient method and Nesterov method, their projected and proximal variants, and their robustness to noise. We discuss one possible brute-force technique for designing new algorithms, and how we can outperform existing methods. Finally, we conclude with many directions for future work.

Norms and sequences.

Convex functions.

Kronecker product

Two useful properties of the Kronecker product are that (AB)T=ATBT(A\otimes B)^{\mathsf{T}}=A^{\mathsf{T}}\otimes B^{\mathsf{T}} and that (AB)(CD)=(AC)(BD)(A\otimes B)(C\otimes D)=(AC)\otimes(BD) whenever the matrix dimensions are such that the products ACAC and BDBD make sense.

Optimization algorithms as dynamical systems

A linear dynamical system is a set of recursive linear equations of the form

We can connect this linear system in feedback with a nonlinearity ϕ\phi by defining the rule

In this paper, we will be interested in the case when the interconnected nonlinearity has the form ϕ(y)= ⁣f(y)\phi(y)=\nabla\!f(y) where fS(m,L)f\in S(m,L). In particular, we will consider algorithms designed to solve the optimization problem

as dynamical systems and see how this new viewpoint can give us insights into convergence analysis. Section 5.3 considers variants of (2.3) where the decision variable xx is constrained or ff is non-smooth.

Standard first order methods such as the Gradient method, Heavy-ball method, and Nesterov’s accelerated method, can all be cast in the form (2.2). In all cases, the nonlinearity is the mapping ϕ(y)= ⁣f(y)\phi(y)=\nabla\!f(y). The state transition matrices AA, BB, CC, DD differ for each algorithm. The Gradient method can be expressed as

To verify this, substitute (2.4) into (2.2) and obtain

Eliminating yky_{k} and uku_{k} and renaming ξ\xi to xx yields

which is the familiar Gradient method with constant stepsize. Nesterov’s accelerated method for strongly convex functions is given by the dynamical system

Verifying that (2.5) is equivalent to Nesterov’s method takes only slightly more effort than it did for the Gradient method. Substituting (2.5) into (2.2) now yields

Note that (2.6b) asserts that the partial state ξ(2)\xi^{(2)} is a delayed version of the state ξ(1)\xi^{(1)}. Substituting (2.6b) into (2.6a) gives the simplified system

Eliminating uku_{k} and renaming ξ(1)\xi^{(1)} to xx yields the common form of Nesterov’s method

Note that other variants of this algorithm exist for which the α\alpha and β\beta parameters are updated at each iteration. In this paper, we restrict our analysis to the constant-parameter version above. The Heavy-ball method is given by

One can check by similar analysis that (2.7) is equivalent to the update rule

Convergence analysis of convex optimization algorithms typically follows a two step procedure. First one must show that the algorithm has a fixed point that solves the optimization problem in question. Then, one must verify that from a reasonable starting point, the algorithm converges to this optimal solution at a specified rate.

In dynamical systems, such proofs are called stability analysis. By writing common first order methods as dynamical systems, we can unify their stability analysis. For a general problem with minimum occurring at yy_{\star}, a necessary condition for optimality is that u= ⁣f(y)=0u_{\star}=\nabla\!f(y_{\star})=0. Substituting into (2.1), the fixed point satisfies

In particular, AA must have an eigenvalue of 11. If the blocks of AA are diagonal as in the Gradient, Heavy-ball, or Nesterov methods shown above, then the eigenvalue of 11 will have a geometric multiplicity of at least dd.

Proving that all paths lead to the optimal solution requires more effort and constitutes the bulk of what is studied herein. Before we proceed for general convex ff, it is instructive to study what happens for quadratic ff.

2 Quadratic problems

Suppose ff is a convex, quadratic function f(y)=12yTQypTy+rf(y)=\tfrac{1}{2}y^{\mathsf{T}}Qy-p^{\mathsf{T}}y+r, where mIdQLIdmI_{d}\preceq Q\preceq LI_{d} in the positive definite ordering. The gradient of ff is simply  ⁣f(y)=Qyp\nabla\!f(y)=Qy-p and the optimal solution is y=Q1py_{\star}=Q^{-1}p.

What happens when we run a first order method on a quadratic problem? Assume throughout this section that D=0D=0. Substituting the equation for yy_{\star} and  ⁣f(y)\nabla\!f(y) back into (2.2), we obtain the system of equations:

Now make use of the fixed-point equations y=Cξy_{\star}=C\xi_{\star} and ξ=Aξ\xi_{\star}=A\xi_{\star} and we obtain uk=QC(ξkξ)u_{k}=QC(\xi_{k}-\xi_{\star}). Eliminating yky_{k} and uku_{k} from the above equations, we obtain

Let T\mathrel{\mathchoice{\vbox{\hbox{\displaystyle:}}}{\vbox{\hbox{\textstyle:}}}{\vbox{\hbox{\scriptstyle:}}}{\vbox{\hbox{\scriptscriptstyle:}}}{=}}A+BQC denote the closed-loop state transition matrix. A necessary and sufficient condition for ξk\xi_{k} to converge to ξ\xi_{\star} is that the spectral radius of TT is strictly less than 11. Recall that the spectral radius of a matrix MM is defined as the largest magnitude of the eigenvalues of MM. We denote the spectral radius by ρ(M)\rho(M). It is a fact that

where \|\cdot\| is the induced 22-norm. Therefore, for any ε>0\varepsilon>0, we have for all kk sufficiently large that ρ(T)kTk(ρ(T)+ε)k\rho(T)^{k}\leq\|T^{k}\|\leq(\rho(T)+\varepsilon)^{k}. Hence, we can bound the convergence rate:

So the spectral radius also determines the rate of convergence of the algorithm. With only bounds on the eigenvalues of QQ, we can provide conditions under which the algorithms above converge for quadratic ff.

All of these results are proven by elementary linear algebra and the bounds are tight. In other words, there exists a quadratic function that achieves the worst-case ρ\rho. See Appendix A for more detail.

Unfortunately, the proof technique used in Proposition 1 does not extend to the case where ff is a more general strongly convex function. However, a different characterization of stability does generalize and will be described in Section 3. It turns out that for linear systems, stability is equivalent to the feasibility of a particular semidefinite program. We will see in the sequel that similar semidefinite programs can be used to certify stability of nonlinear systems.

The proof of Proposition 2 is elementary so we omit it. The use of Linear Matrix Inequalities (LMI) to characterize stability of a linear time-invariant system dates back to Lyapunov , and we give a more detailed account of this history in Section 3.4. Now suppose we are studying a dynamical system of the form ξk+1ξ=T(ξkξ)\xi_{k+1}-\xi_{\star}=T(\xi_{k}-\xi_{\star}) as in (2.8). Then, if there exists a P0P\succ 0 satisfying TTPTρ2P0T^{\mathsf{T}}PT-\rho^{2}P\prec 0,

along all trajectories. If ρ<1\rho<1, then the sequence {ξk}k0\{\xi_{k}\}_{k\geq 0} converges linearly to ξ\xi_{\star}. Iterating (2.9) down to k=0k=0, we see that

where cond(P)\operatorname{cond}(P) is the condition number of PP. In what follows, we will generalize this semidefinite programming approach to yield feasibility problems that are sufficient to characterize when the closed loop system (2.2) converges and which provide bounds on the distance to optimality as well. The function

is called a Lyapunov function for the dynamical system. This function strictly decreases over all trajectories and hence certifies that the algorithm is stable, i.e., converges to nominal values. The conventional method for proving stability of an electromechanical system is to show that some notion of total energy always decreases over time. Lyapunov functions provide a convenient mathematical formulation of this notion of total energy.

The question for the remainder of the paper is how can we search for Lyapunov-like functions that guarantee algorithmic convergence when ff is not quadratic.

Proving convergence using integral quadratic constraints

When the function being minimized is quadratic as explored in Section 2.2, its gradient is affine and the interconnected dynamical system is a simple linear difference equation whose stability and convergence rate is analyzed solely in terms of eigenvalues of the closed-loop system. When the cost function is not quadratic, the gradient update is not an affine function and hence a different analysis technique is required.

A popular technique in the control theory literature is to use integral quadratic constraints (IQCs) to capture features of the behavior of partially-known components. The term IQC was introduced in the seminal paper by Megretski and Rantzer . In that work, the authors analyzed continuous time dynamical systems and the constraints involved integrals of quadratic functions, hence the name IQC.

In the development that follows, we repurpose the classical IQC theory for use in algorithm analysis. This requires using discrete time dynamical systems so our constraints will involve sums of quadratics rather than integrals. We also adapt the theory in a way that allows us to certify a specific convergence rate in addition to stability.

IQCs provide a convenient framework for analyzing interconnected dynamical systems that contain components that are noisy, uncertain, or otherwise difficult to model. The idea is to replace this troublesome component by a quadratic constraint on its inputs and outputs that is known to be satisfied by all possible instances of the component. If we can certify that the newly constrained system performs as desired, then the original system must do so as well.

Although we do not know ϕ\phi exactly, we assume that we have some knowledge of the constraints it imposes on the pair (y,u)(y,u). For example, suppose it is known that ϕ\phi satisfies the following properties:

Instead of analyzing a system that contains ϕ\phi, we analyze the system where ϕ\phi is removed, but we enforce the constraints (3.1) on the signals (y,u)(y,u). Since (3.1) is true for all admissible choices of ϕ\phi, then any properties we can prove for the constrained system must hold for the original system as well.

where we will define the initial condition ζ\zeta_{\star} shortly. The equations (3.2) define an affine map z=Ψ(y,u)z=\Psi(y,u). Assuming a reference point (y,u)(y_{\star},u_{\star}) as before, we can define the associated reference (ζ,z)(\zeta_{\star},z_{\star}) that is a fixed point of (3.2). In other words,

We will require that ρ(AΨ)<1\rho(A_{\Psi})<1, which ensures that (3.3) has a unique solution (ζ,z)(\zeta_{\star},z_{\star}) for any choice of (y,u)(y_{\star},u_{\star}). Note that the reference points are defined in such a way that if we use y=(y,y,)y=(y_{\star},y_{\star},\dots) and u=(u,u,)u=(u_{\star},u_{\star},\dots) in (3.2), we will obtain ζ=(ζ,ζ,)\zeta=(\zeta_{\star},\zeta_{\star},\dots) and z=(z,z,)z=(z_{\star},z_{\star},\dots).

We then consider the quadratic forms (zkz)TM(zkz)(z_{k}-z_{\star})^{\mathsf{T}}M(z_{k}-z_{\star}) for a given symmetric matrix MM (typically indefinite). Note that each such quadratic form is a function of (y0,,yk,u0,,uk)(y_{0},\dots,y_{k},u_{0},\dots,u_{k}) that is determined by our choice of (Ψ,M,y,u)(\Psi,M,y_{\star},u_{\star}). In our previous example (3.1), Ψ\Psi has no dynamics and the corresponding Ψ\Psi and MM are

In other words, if we use the definitions (3.4), then (zkz)TM(zkz)0(z_{k}-z_{\star})^{\mathsf{T}}M(z_{k}-z_{\star})\geq 0 is the same as (3.1). In general, these sorts of quadratic constraints are called IQCs. We consider four different types of IQCs, which we now define.

Note that the example (3.1) is a pointwise IQC. Examples of the other types of IQCs will be described in Section 3.3. Note that the sets of maps satisfying the various IQCs defined above are nested as follows:

For example, if ϕ\phi satisfies a pointwise IQC defined by (Ψ,M,y,u)(\Psi,M,y_{\star},u_{\star}) then it must also satisfy the hard IQC defined by the same (Ψ,M,y,u)(\Psi,M,y_{\star},u_{\star}). The notions of hard IQC and the more general soft IQC (sometimes simply called IQC) were introduced in and their relationship is discussed in . These concepts are useful in proving that a dynamic system is stable, but do not directly allow for the derivation of useful bounds on convergence rates. The definitions of pointwise and ρ\rho-hard IQCs are new, and were created for the purpose of better characterizing convergence rates, as we will see in Section 3.2.

Finally, note that yy_{\star} and uu_{\star} are nominal inputs and outputs for the unknown ϕ\phi, and they can be tuned to certify different fixed points of the interconnected system. We will see in Section 3.2 that certifying a particular convergence rate to some fixed point does not require prior knowledge of fixed point; only knowledge that the fixed point exists.

2 Stability and performance results

In this section, we show how IQCs can be used to prove that iterative algorithms converge and to bound the rate of convergence. In both cases, the certification requires solving a tractable convex program. We note that the original work on IQCs only proved stability (boundedness). Some other works have addressed exponential stability , but the emphasis of these works is on proving the existence of an exponential decay rate, and so the rates constructed are very conservative. We require rates that are less conservative, and this is reflected in the inclusion of ρ\rho in the LMI of our main result, Theorem 4.

where (A,B,C)(A,B,C) are matrices of appropriate dimensions. The map is affine rather than linear because of the initial condition ξ0\xi_{0}. As in Section 2, GG is the iterative algorithm we wish to analyze, and using the general formalism of Section 3.1, ϕ\phi is the nonlinear map (y0,y1,)(u0,u1,)(y_{0},y_{1},\dots)\mapsto(u_{0},u_{1},\dots) that characterizes the feedback. Of course, this framework subsumes the special case of interest in which uk= ⁣f(yk)u_{k}=\nabla\!f(y_{k}) for each kk. We assume that ϕ\phi satisfies an IQC, and this IQC is characterized by a map Ψ\Psi and matrix MM. We can interpret z=Ψ(y,u)z=\Psi(y,u) as a filtered version of the signals uu and yy. These equations can be represented using a block-diagram as in Figure 2(a).

Consider the dynamics of GG and Ψ\Psi from (3.5) and (3.2), respectively. Upon eliminating yy, the recursions may be combined to obtain

The dynamical system (3.7) is represented in Figure 2(b) by the dashed box. Our main result is as follows.

Consider the block interconnection of Figure 2(a). Suppose GG is given by (3.5) and Ψ\Psi is given by (3.2). Define (A^,B^,C^,D^)(\hat{A},\hat{B},\hat{C},\hat{D}) as in (3.6)–(3.7). Suppose (ξ,ζ,y,u,z)(\xi_{\star},\zeta_{\star},y_{\star},u_{\star},z_{\star}) is a fixed point of (3.5) and (3.2). In other words,

Suppose ϕ\phi satisfies the ρ\rho-hard IQC defined by (Ψ,M,ρ,y,u)(\Psi,M,\rho,y_{\star},u_{\star}) where 0ρ10\leq\rho\leq 1. Consider the following LMI.

If (3.9) is feasible for some P0P\succ 0 and λ0\lambda\geq 0, then for any ξ0\xi_{0}, we have

where cond(P)\operatorname{cond}(P) is the condition number of PP.

Multiply (3.10) by ρ2k\rho^{-2k} for each kk and sum over kk. The first two terms yield a telescoping sum and we obtain

Because ϕ\phi satisfies the ρ\rho-hard IQC defined by (Ψ,M,ρ,y,u)(\Psi,M,\rho,y_{\star},u_{\star}), the summation part of the inequality is nonnegative for all kk. Therefore,

for all kk and consequently xkxcond(P)ρkx0x\|x_{k}-x_{\star}\|\leq\sqrt{\operatorname{cond}(P)}\,\rho^{k}\,\|x_{0}-x_{\star}\|. Recall from (3.7) that xk=(ξk,ζk)x_{k}=(\xi_{k},\zeta_{k}) and from (3.2a) that ζ0=ζ\zeta_{0}=\zeta_{\star}. Therefore,

We now make several comments regarding Theorem 4.

Theorem 4 can easily be adapted to other types of IQCs.

If the pointwise IQC defined by some (Ψ,M,y,u)(\Psi,M,y_{\star},u_{\star}) is satisfied, then so is the ρ\rho-hard IQC defined by (Ψ,M,ρ,y,u)(\Psi,M,\rho,y_{\star},u_{\star}) for any ρ\rho. Therefore, we may apply Theorem 4 directly and ignore the ρ\rho-hardness constraint. The smallest ρ\rho that makes (3.9) feasible will correspond to the best exponential rate we can guarantee.

Hard IQCs are a special case of ρ\rho-hard IQCs with ρ=1\rho=1. Therefore, if the LMI (3.9) is feasible, Theorem 4 guarantees that ξkξcond(P)ξ0ξ\|\xi_{k}-\xi_{\star}\|\leq\sqrt{\operatorname{cond}(P)}\|\xi_{0}-\xi_{\star}\|. In other words, the iterates are bounded (but not necessarily convergent).

If a ρ1\rho_{1}-hard IQC is satisfied, then so is the ρ\rho-hard IQC for any ρρ1\rho\geq\rho_{1}. Also, if (3.9) is feasible for some ρ2\rho_{2}, it will also be feasible for any ρρ2\rho\geq\rho_{2}. Therefore, if we use a ρ1\rho_{1}-hard IQC and (3.9) is feasible for ρ2\rho_{2}, then the smallest exponential rate we can guarantee is ρ=max(ρ1,ρ2)\rho=\max(\rho_{1},\rho_{2}).

Multiple IQCs

Theorem 4 can also be generalized to the case where ϕ\phi satisfies multiple IQCs. Suppose ϕ\phi satisfies the ρ\rho-hard IQCs defined by (Ψi,Mi,ρ,y(i),u(i))(\Psi_{i},M_{i},\rho,y_{\star}^{(i)},u_{\star}^{(i)}) for i=1,,ri=1,\dots,r. Simply redefine the matrices (A^,B^,C^,D^)(\hat{A},\hat{B},\hat{C},\hat{D}) in a manner analogous to (3.7), but where the output is now (zk(1),,zk(r))(z_{k}^{(1)},\dots,z_{k}^{(r)}). Instead of (3.9), use

where λ1,,λr0\lambda_{1},\dots,\lambda_{r}\geq 0. Thus, when (3.11) is multiplied out as in (3.10), we now obtain

and the rest of the proof proceeds as in Theorem 4.

Remark on Lyapunov functions

In the quadratic case treated in Section 2.2, a quadratic Lyapunov function is constructed from the solution PP in (2.12). In the case of IQCs, such a quadratic function cannot serve as a Lyapunov function because it does not strictly decrease over all trajectories. Nevertheless, Theorem 4 shows how ρ\rho-hard IQCs can be used to certify a convergence rate and no Lyapunov function is explicitly constructed. We can explain this difference more explicitly. If V(x)V(x) is a Lyapunov function, then by definition it satisfies the properties;

λ1xx2V(x)λ2xx2\lambda_{1}\|x-x_{\star}\|^{2}\leq V(x)\leq\lambda_{2}\|x-x_{\star}\|^{2} for all xx and kk.

V(xk+1)ρ2V(xk)V(x_{k+1})\leq\rho^{2}V(x_{k}) for all system trajectories {xk}k0\{x_{k}\}_{k\geq 0}.

which, combined with Property (i) implies that xkxλ2/λ1ρkx0x\|x_{k}-x_{\star}\|\leq\sqrt{\lambda_{2}/\lambda_{1}}\,\rho^{k}\|x_{0}-x_{\star}\|. In Theorem 4, we use V(x)=(xx)TP(xx)V(x)=(x-x_{\star})^{\mathsf{T}}P(x-x_{\star}), which satisfies (i) but not (ii). So V(x)V(x) is not a Lyapunov function in the technical sense. Nevertheless, we prove directly that (3.12) holds, and so the desired result still holds. That is, V(x)V(x) serves the same purpose as a Lyapunov function.

3 IQCs for convex functions

We will derive three IQCs that are useful for describing gradients of strongly convex functions: the sector (pointwise) IQC, the off-by-one (hard) IQC, and weighted off-by-one (ρ\rho-hard) IQC. In general, gradients of strongly convex functions satisfy an infinite family of IQCs, originally characterized by Zames and Falb for the single-input-single-output case . A generalization of the Zames-Falb IQCs to multidimensional functions is derived in . Both the sector and off-by-one IQCs are special cases of Zames-Falb, while the weighted off-by-one IQC is a convex combination of the sector and off-by-one IQCs. While the Zames-Falb family is infinite, the three simple IQCs mentioned above are the only ones used in this paper. IQCs can be used to describe many other types of functions as well, and further examples are available in . We begin with some fundamental inequalities that describe strongly convex function.

Property (3.13a) follows from the definition of Lipschitz gradients. Properties (3.13b) and (3.13c) are commonly known as co-coercivity. To prove (3.13d), define g(x)\mathrel{\mathchoice{\vbox{\hbox{\displaystyle:}}}{\vbox{\hbox{\textstyle:}}}{\vbox{\hbox{\scriptstyle:}}}{\vbox{\hbox{\scriptscriptstyle:}}}{=}}f(x)-\tfrac{m}{2}\|x\|^{2} and note that gS(0,Lm)g\in S(0,L-m). Applying (3.13b) to gg and rearranging, we obtain

which is precisely (3.13d). Detailed derivations of these properties can be found for example in .

Suppose fkS(m,L)f_{k}\in S(m,L) for each kk, and (y,u)(y_{\star},u_{\star}) is a common reference point for the gradients of fkf_{k}. In other words, u= ⁣fk(y)u_{\star}=\nabla\!f_{k}(y_{\star}) for all k0k\geq 0. Let \phi\mathrel{\mathchoice{\vbox{\hbox{\displaystyle:}}}{\vbox{\hbox{\textstyle:}}}{\vbox{\hbox{\scriptstyle:}}}{\vbox{\hbox{\scriptscriptstyle:}}}{=}}(\nabla\!f_{0},\nabla\!f_{1},\dots). If u=ϕ(y)u=\phi(y), then ϕ\phi satisfies the pointwise IQC defined by

Equation (3.14) follows immediately from (3.13d) by using (f,x,y)(fk,y,yk)(f,x,y)\to(f_{k},y_{\star},y_{k}). It can be verified that

and therefore (3.14) is equivalent to (zkz)TM(zkz)0(z_{k}-z_{\star})^{\mathsf{T}}M(z_{k}-z_{\star})\geq 0 as required.

Suppose fS(m,L)f\in S(m,L) and (y,u)(y_{\star},u_{\star}) is a reference for the gradient of ff. In other words, u= ⁣f(y)u_{\star}=\nabla\!f(y_{\star}). Let \phi\mathrel{\mathchoice{\vbox{\hbox{\displaystyle:}}}{\vbox{\hbox{\textstyle:}}}{\vbox{\hbox{\scriptstyle:}}}{\vbox{\hbox{\scriptscriptstyle:}}}{=}}(\nabla\!f,\nabla\!f,\dots). Then ϕ\phi satisfies the hard IQC defined by

where the inequality follows this time from applying (3.13c) using (f,x,y)(g,yt,yt1)(f,x,y)\to(g,y_{t},y_{t-1}). Substituting (3.4) and (3.4) into the left-hand side of (3.15), the sum telescopes and we obtain the lower bound qkq_{k}, which is nonnegative from (3.16).

To verify the IQC factorization, note that the state equations for Ψ\Psi given in the statement of Lemma 8 are

Moreover, the solution to the fixed-point equations (3.3) are

and it follows that t=0k(ztz)TM(ztz)0\sum_{t=0}^{k}(z_{t}-z_{\star})^{\mathsf{T}}M(z_{t}-z_{\star})\geq 0 is equivalent to (3.15), as required.

Note that the sector IQC (3.14) is a special case of the off-by-one IQC when k=0k=0. The off-by-one IQC is itself a special case of the Zames-Falb IQC, which we now describe.

Suppose fS(m,L)f\in S(m,L) has the optimal point u= ⁣f(y)=0u_{\star}=\nabla\!f(y_{\star})=0. Let \phi\mathrel{\mathchoice{\vbox{\hbox{\displaystyle:}}}{\vbox{\hbox{\textstyle:}}}{\vbox{\hbox{\scriptstyle:}}}{\vbox{\hbox{\scriptscriptstyle:}}}{=}}(\nabla\!f,\nabla\!f,\dots) and let h1,h2,h_{1},h_{2},\dots be any sequence of real numbers that satisfies

{hτ}τ1\{h_{\tau}\}_{\tau\geq 1} is finitely nonzero, and hsh_{s} is the last nonzero component.

0hτ10\leq h_{\tau}\leq 1 for all τ1\tau\geq 1.

Then ϕ\phi satisfies the hard IQC defined by

We will construct a proof for a general sequence h1,h2,h_{1},h_{2},\dots by first considering a specific set of sequences. Fix some j1j\geq 1 and consider the case where

For t<jt<j, the terms in the sum (3.19) have the form

which are bounded below by qt0q_{t}\geq 0, as proven in Lemma 8, (3.16)–(3.4). For tjt\geq j, the terms in the sum (3.19) have the form

which are bounded below by qtqtjq_{t}-q_{t-j}, as proven in (3.4). Summing up (3.19) for all tt yields a telescoping sum, thereby proving that (3.19) holds. This can be thought of an “off-by-jj” IQC. Indeed, when j=1j=1, we recover the off-by-one IQC of Lemma 8.

Now note that if we take a convex combination of the inequalities (3.19) corresponding to each off-by-jj IQC and let the associated coefficient be hjh_{j}, we have proven (3.19) for the case of a general sequence h1,h2,h_{1},h_{2},\dots.

Though we will not make use of the more general Zames-Falb family of inequalities, we include them as they are interesting in their own right and may find applications in future work. We conclude this section with a ρ\rho-hard version of the off-by-one IQC. This final IQC will be critical for deriving convergence rates.

Suppose fS(m,L)f\in S(m,L) and (y,u)(y_{\star},u_{\star}) is a reference for the gradient of ff. In other words, u= ⁣f(y)u_{\star}=\nabla\!f(y_{\star}). Let \phi\mathrel{\mathchoice{\vbox{\hbox{\displaystyle:}}}{\vbox{\hbox{\textstyle:}}}{\vbox{\hbox{\scriptstyle:}}}{\vbox{\hbox{\scriptscriptstyle:}}}{=}}(\nabla\!f,\nabla\!f,\dots). Then for any (ρˉ,ρ)(\bar{\rho},\rho) satisfying 0ρˉρ10\leq\bar{\rho}\leq\rho\leq 1, ϕ\phi satisfies the ρ\rho-hard IQC defined by

Note that the weighted off-by-one IQC is a Zames-Falb IQC with h=(ρˉ2,0,)h=(\bar{\rho}^{2},0,\dots). Thus the hardness and the factorization (Ψ,M)(\Psi,M) follows from Lemma 9. In order to prove ρ\rho-hardness (3.20), a bit more work is required. First, observe (see remarks on pointwise and hard IQCs after Theorem 4) that it suffices to show ρˉ\bar{\rho}-hardness, and this will imply ρ\rho-hardness. The ttht^{\text{th}} term in the sum in (3.20) can be bounded as follows. First, define the general terms in the sector (Lemma 6) and off-by-one (Lemma 8) inequalities:

Algebraic manipulations reveal that the general term in the sum (3.20) satisfies

where the inequalities follow from (3.4) and (3.4). Substituting the general term back into (3.20) with ρ=ρˉ\rho=\bar{\rho}, the ρˉ2t\bar{\rho}^{-2t} coefficient causes the sum to telescope and we are left with ρˉ2kqk\bar{\rho}^{-2k}q_{k}, which is nonnegative from (3.16). This completes the proof.

In implementing the weighted off-by-one IQC, one can simply set ρˉ=ρ\bar{\rho}=\rho. However, a less conservative approach is to keep ρˉ\bar{\rho} as an additional degree of freedom. In Theorem 4, the IQC constraint is included in (3.9) in the final term and is multiplied by the constant λ0\lambda\geq 0. When using the weighted off-by-one IQC, this amounts to:

By defining λ1=λ(1ρˉ2)\lambda_{1}=\lambda(1-\bar{\rho}^{2}) and λ2=λρˉ2\lambda_{2}=\lambda\bar{\rho}^{2}, an equivalent expression is

4 Historical context of IQCs and Lyapunov theory

Constructing Lyapunov functions has a long history in control and dynamical systems, and the central focus of this paper is borrowing tools from this literature to see how we can generalize our analysis from quadratic functions to more general, nonlinear convex functions.

One of the most fundamental problems in control theory is certifying the stability of nonlinear systems. In interconnected systems such as electric circuits or chemical plants, individual components are typically modeled using differential (or difference) equations. Interconnected systems often contain nonlinearities or components that are otherwise difficult to model. The earliest results on such systems date back to the work of Lur’e and Postnikov . The goal was to prove stability under a wide range of admissible uncertainties. This notion of robust stability was called absolute stability. Indeed, Lur’e studied precisely the model we are concerned with: a known linear system interconnected in feedback to an uncertain nonlinear system.

In the 1960’s and 70’s, several sufficient conditions for absolute stability were expressed as frequency-domain conditions. In other words, the main objects of interest are ratios of the Laplace transforms of the outputs to the inputs, also known as transfer functions. Examples include the Popov criterion , the small-gain theorem, the circle criterion, and passivity theory . Frequency-domain conditions were popular at the time because they could be verified graphically. The work of Willems unified many of the existing results by casting them in the time domain in a framework called dissipativity theory. This notion is on one hand a generalization of Lyapunov functions to include systems with exogenous inputs, and on the other hand a generalization of passivity theory and the small-gain theorem. These ideas form the core of modern nonlinear control theory, and are covered in many textbooks such as Khalil .

With the advent of computers, graphical methods were no longer required. The connection between frequency-domain conditions and Linear Matrix Inequalities (LMIs) was made by Kalman and Yakubovich and culminated in the Kalman-Yakubovich-Popov (KYP) lemma, also known as the Positive-Real lemma. This paved the way for the use of modern computational tools such as semidefinite programming. Another important development is the concept of the structured singular value , also known as μ\mu-analysis. While previous theory had been used to describe static nonlinearities or uncertainties, μ\mu-analysis is a computationally tractable framework for describing a system containing multiple dynamic uncertainties. A survey of μ\mu-related techniques and results is given in . For a comprehensive overview of the history and development of LMIs in control theory, we refer the reader to .

Integral Quadratic Constraints (IQCs) were first introduced by Yakubovich, who considered the notion of imposing quadratic constraints on an infinite-horizon control problem , and combining multiple constraints via the S-procedure . The definitive work on IQCs is Megretski and Rantzer . In this seminal paper, the authors showed that dissipativity theory, as well as all the frequency-domain conditions, could be formulated as IQCs. Furthermore, the KYP lemma in conjunction with the S-procedure allows stability to be verified by solving an LMI.

The seminal paper on IQCs develops the theory primarily in the frequency domain, but also alludes to time-domain versions of the results by introducing hard IQCs. This notion of hard IQCs is pursued in , where the main IQC stability theorem is rederived entirely in the time domain. In the time domain, these constraints parallel the development of Nesterov, where we are able to construct inequalities linking multiple inputs and outputs of uncertain functions. This allows us to provide a wholly self-contained development of the theory. Moreover, we are able to enhance the techniques of , providing new IQCs and considerably sharper rates of convergence than those discussed in the earlier work. In this sense, our work provides useful methods for control theorists interested in estimating rates of stabilization of their control systems.

Case studies

We now use the results of Section 3 to rederive some existing results from the literature on iterative large-scale algorithms. The IQC approach gives a unified method to analyze many different algorithms. In addition to verifying existing results, we also present a negative result that was not previously known.

Given an iterative algorithm, our first step is to express it as a feedback interconnection of a discrete linear time-invariant dynamical system with a nonlinearity representing  ⁣f\nabla\!f. This procedure is explained in Section 2 and yields matrices (A,B,C)(A,B,C).

The next step is to decide which IQCs will be used to characterize the nonlinearity. A simple but conservative choice is the sector IQC defined in Lemma 6. A less conservative choice is the weighted off-by-one IQC of Lemma 10. For the chosen (Ψ,M)(\Psi,M), we find the smallest ρ\rho such that the semidefinite program (SDP) (3.9) of Theorem 4 is feasible. In the case of the sector IQC, the SDP has variables (P,λ,ρ)(P,\lambda,\rho). For the weighted off-by-one IQC, the SDP has variables (P,λ1,λ2,ρ)(P,\lambda_{1},\lambda_{2},\rho) as explained in Remark 11. The resulting ρ\rho is an upper bound for the worst-case convergence rate of the algorithm. Specifically,

To solve the SDP (3.9) numerically, observe that it is a quasiconvex program. In particular, for every fixed ρ\rho, (3.9) is an LMI. The simplest way to solve (3.9) is to use a bisection search on ρ\rho. For a fixed ρ\rho, the SDP (3.9) or (3.11) become an LMI and can be efficiently solved using interior-point methods. Popular implementations include SDPT3, SeDuMi, and Mosek. This approach was used for all the simulations presented herein.

More sophisticated methods exist to solve (3.9) as well. A quasiconvex program of the type (3.9) is known as a generalized eigenvalue optimization problem (GEVP) . The GEVP is well-studied and modified interior-point methods such as the method of centers and the long-step method of analytic centers can be used to solve it.

2 Lossless dimensionality reduction

The size of the SDP in (3.9) is proportional to dd, the size of the state ξk\xi_{k} in the optimization algorithm. This can be problematic in cases where dd is large because it can be computationally costly to solve large SDPs. In many cases of interest, however, the algorithms we wish to analyze have a block-diagonal structure. For example, Nesterov’s accelerated method has the form (2.5), which is

Each of the matrices (A,B,C)(A,B,C) is a block matrix with repeated diagonal blocks. Using Kronecker product notation (see Section 1.1, this means for example that

and similarly for BB and CC. Moreover, the IQCs we use to describe  ⁣f\nabla\!f have the same sort of structure. That is, (AΨ,BΨy,BΨu,CΨ,DΨy,DΨu)(A_{\Psi},B_{\Psi}^{y},B_{\Psi}^{u},C_{\Psi},D_{\Psi}^{y},D_{\Psi}^{u}) are block matrices with repeated diagonal blocks. Now consider the SDP (3.9) from Theorem 4.

3 Known bounds for first-order optimization algorithms

The following proposition summarizes some of the known bounds for optimizing strongly convex functions.

The Gradient bounds in the table above follow from the bound ρ12αmLL+m\rho\leq\sqrt{1-\frac{2\alpha mL}{L+m}}, which is proven in . A tighter Gradient bound \rho\leq\max\bigl{\{}|1-\alpha m|,|1-\alpha L|\bigr{\}} is proven in but makes the additional assumption that ff is twice differentiable. The Nesterov bound in Proposition 12 is proven in using the technique of estimate sequences. There are no known global convergence guarantees for the Heavy-ball method in the case of strongly convex functions, but it is proven in that the Heavy-ball method converges locally with the same rate as in Proposition 1.

In the following sections, we will use IQC machinery to demonstrate that the first two bounds in Proposition 12 are loose. We will construct tighter bounds for the strongly convex case without requiring additional assumptions about locality or twice-differentiability. We will then use our framework to help guide a refutation of the convergence of the Heavy-ball method.

4 The Gradient method

The Gradient method with constant stepsize is among the simplest optimization schemes. The recursion is given by

We will analyze this algorithm by applying Theorem 4. Since fS(m,L)f\in S(m,L), we may use the sector IQC of Lemma 6 and (3.9) together with the dimensionality reduction of Section 4.2 yields the following SDP.

Note that PP is 1×11\times 1, so we may set P=1P=1 without loss of generality and we obtain the following LMI in (ρ2,λ)(\rho^{2},\lambda).

Using Schur complements, (4.6) is equivalent to

By analyzing the lower bound on ρ\rho in (4.7), we can find the optimal choice of λ\lambda as a function of the stepsize α\alpha. Omitting the details, we eventually obtain the simple expression \rho=\max\bigl{\{}|1-\alpha m|,|1-\alpha L|\bigr{\}}. This is precisely the bound found for the quadratic case, as derived in Appendix A. However, we have shown something much stronger here, since the only assumption we made about ff is that  ⁣f\nabla\!f satisfies the sector IQC of Lemma 6. In particular, the Gradient method rates in Proposition 1 hold not only for quadratics, but also for strongly convex functions, and even for functions that change or switch over time (either stochastically, adversarially, or otherwise), so long as each function satisfies the pointwise sector constraint. Note that (4.6) can be transformed using Schur complements:

And now (4.8) is linear in (ρ2,λ,α)(\rho^{2},\lambda,\alpha). This formulation allows one to directly answer questions such as “what range of stepsizes can yield a given rate?”.

5 Nesterov’s accelerated method

Nesterov’s accelerated method with constant stepsize converges at a linear rate. There exists some c>0c>0 such that for any initial condition ξ0\xi_{0},

when applied to functions fS(m,L)f\in S(m,L). In this case, the parameters are the standard parameters from Proposition 12, which are \alpha\mathrel{\mathchoice{\vbox{\hbox{\displaystyle:}}}{\vbox{\hbox{\textstyle:}}}{\vbox{\hbox{\scriptstyle:}}}{\vbox{\hbox{\scriptscriptstyle:}}}{=}}1/L and \beta\mathrel{\mathchoice{\vbox{\hbox{\displaystyle:}}}{\vbox{\hbox{\textstyle:}}}{\vbox{\hbox{\scriptstyle:}}}{\vbox{\hbox{\scriptscriptstyle:}}}{=}}(\sqrt{L}-\sqrt{m})/(\sqrt{L}+\sqrt{m}) . Nesterov also showed that a lower bound on convergence rate for any algorithm of the form (2.2) and for any fS(m,L)f\in S(m,L) is given by

Since ρ\rho and ρopt\rho_{\textup{opt}} behave similarly as L/mL/m\to\infty, Nesterov’s accelerated method is sometimes called “optimal” or “nearly optimal”.

We computed the rate bounds using Theorem 4 using either the sector IQC of Lemma 6, or a combination of the sector IQC and the weighted off-by-one IQC of Lemma 10. It is important to note that unlike the Gradient method case, the LMI (3.9) is no longer linear in ρ2\rho^{2}. Therefore, we found the minimal ρ\rho by performing a bisection search on ρ\rho, see the first plot in Figure 3.

The rate obtained using the sector IQC alone is very poor. To understand why, recall from Lemma 6 that the sector IQC allows for fkf_{k} to be different at each iteration. Unlike the Gradient method, Nesterov’s accelerated method is not robust to having a changing fkf_{k}. However, convergence can nevertheless be guaranteed as long as ρ<1\rho<1, which corresponds approximately to L/m<11.7L/m<11.7.

The rate obtained using the weighted off-by-one IQC improves upon the rate proven in using the estimate sequence approach (see Proposition 12). Note that we do not have an analytical expression for the improved bound; it was found numerically by solving the LMI of Theorem 4.

Given that xkcond(P)ρkx0\|x_{k}\|\leq\sqrt{\operatorname{cond}(P)}\rho^{k}\|x_{0}\|, if we seek the smallest kk such that xkε\|x_{k}\|\leq\varepsilon, then it suffices that cond(P)ρkx0ε\sqrt{\operatorname{cond}(P)}\rho^{k}\|x_{0}\|\leq\varepsilon. This implies that

For the second plot in Figure 3, we plotted 1/logρ-1/\log\rho versus L/mL/m to get a sense of how the relative iteration count scales as a function of condition number. As we can see from Figure 3, Nesterov’s method applied to quadratics is within a factor of 22 of the theoretical lower bound, and the bound we can prove for Nesterov’s method applied to strongly convex functions is within a factor of 1.41.4 of the bound for quadratics.

Finally, we must also ensure that PP is reasonably well-conditioned. In Figure 4, we see that cond(P)\operatorname{cond}(P) appears to be proportional to L/mL/m, which agrees with the scale factor found by Nesterov .

If we repeat the above experiments, but instead using the optimal tuning of Nesterov’s method given in Proposition 1, the resulting plots are virtually identical. The only differences are that the curves are shifted down slightly because the optimal rate for quadratics is now 123κ+11-\tfrac{2}{\sqrt{3\kappa+1}} instead of 11κ1-\tfrac{1}{\sqrt{\kappa}}. The sector-IQC curve goes unstable a little sooner as well, at around L/m10L/m\approx 10. Roughly speaking, if we use the optimal tuning we can guarantee slightly faster convergence but slightly less robustness.

6 The Heavy-ball method

The optimal Heavy-ball rate for quadratics in Proposition 1 matches Nesterov’s lower bound (4.9) for strongly convex functions. Although the Heavy-ball method and Nesterov’s accelerated method have similar recursions, Figures 3 and 5 tell very different stories. When we allow for a different fkf_{k} at every iteration (sector IQC), we can guarantee stability when L/m6L/m\approx 6 or less. When we include the weighted off-by-one IQC as well, we can only guarantee stability when L/m18L/m\approx 18 or less. While it seems possible that using more IQCs could potentially improve this upper bound, it turns out that the poor quality of these bounds is due to something more serious: the Heavy-ball method optimized for quadratics does not converge for general fS(m,L)f\in S(m,L).

To find an example of an f(x)f(x) that leads to a non-convergent Heavy-ball method, Figure 5 indicates that we should search for L/m>18L/m>18. The following one-dimensional example does the job.

It is easy to check that  ⁣f(x)\nabla\!f(x) is continuous and monotone, and so fS(m,L)f\in S(m,L) with m=1m=1 and L=25L=25. When using an initial condition in the interval 3.07x03.463.07\leq x_{0}\leq 3.46, the Heavy-ball method produces a limit cycle with oscillations that never damp out. The first 50 iterates for x0=3.3x_{0}=3.3 are shown in Figure 6, and a plot of f(x)f(x) with the limit cycle overlaid is shown in Figure 7.

For a detailed proof that ff can indeed converge to a limit cycle, see Appendix B. We further investigate the stability of the Heavy-ball method in Section 5.

Further applications

We saw in Section 4.6 that the Heavy-ball method that uses α\alpha and β\beta optimized for quadratic functions is unstable for general strongly convex functions. A natural question to ask is whether the Heavy-ball method is stable over the class S(m,L)S(m,L) for some choice of α\alpha and β\beta. This experiment is easy to carry out in our framework, because choosing new values of α\alpha and β\beta simply amounts to changing parameters in the LMI. We chose α=1L\alpha=\tfrac{1}{L}, and for a sampling of points in β\beta\in, we evaluated the corresponding Heavy-ball method using Theorem 4 together with the weighted off-by-one IQC. See Figure 8.

The first plot shows convergence rate. When β=0\beta=0, the Heavy-ball method becomes the Gradient method, which is always convergent. However, we can improve upon the gradient rate by optimizing over β\beta. The best achievable rate is given by the black curve. The black curve lies strictly above the optimal Heavy-ball rate for quadratics, but below the optimal gradient rate.

In the second plot, we show the iterations required to achieve convergence. Again, the black curve represents the optimal parameter choice. As L/mL/m gets large, the envelope veers away from the optimal Heavy-ball curve and becomes parallel to the optimal gradient curve. So when L/mL/m is large, even when β\beta is chosen optimally, the Heavy-ball method is comparable to the Gradient method in worst-case for general strongly convex functions.

2 Multiplicative gradient noise

A common consideration is the inclusion of noise in the gradient computation. One possible model is relative deterministic noise where we assume the gradient error is proportional to the distance to optimality . Instead of directly observing  ⁣f(y)\nabla\!f(y), we see uk= ⁣f(yk)+rk,u_{k}=\nabla\!f(y_{k})+r_{k}\,, where

for some small nonnegative δ\delta. The IQC framework can be used to analyze such situations to study the robustness of various algorithms to this type of noise.

If wkw_{k} is the true gradient, we actually measure uk=Δkwku_{k}=\Delta_{k}w_{k}, where the gradient error is bounded above by a quantity proportional to the true gradient. In other words, we assume there is some δ>0\delta>0 such that ukwkδwk\|u_{k}-w_{k}\|\leq\delta\|w_{k}\|. Squaring both sides of the inequality and rearranging, we obtain the IQC

Note that this is simply the sector IQC with m=1δm=1-\delta and L=1+δL=1+\delta. We make no assumptions on how the noise is generated; it may be the output of a stochastic process, or could even be chosen adversarially. The modified block-diagram is shown in Figure 9.

By making a small modification, we can apply Theorem 4. We will look to show that the following inequality holds over all trajectories

for some λ1,λ20\lambda_{1},\lambda_{2}\geq 0. In order to formulate an LMI that implies a solution to (5.1), we use the signal [xkTukTwkT]\begin{bmatrix}x_{k}^{\mathsf{T}}&u_{k}^{\mathsf{T}}&w_{k}^{\mathsf{T}}\end{bmatrix}. Consequently, the matrices (A^,B^,C^,D^)(\hat{A},\hat{B},\hat{C},\hat{D}) from (3.6)–(3.7) now become a map (wk,uk)zk(w_{k},u_{k})\mapsto z_{k}. This leads to an LMI of the form (3.11) which is now block-3×33\times 3 instead of the 2×22\times 2 LMI of Theorem 4. The proof is identical to that of Theorem 4.

Our first experiment is to test the Gradient method. We used noise values of δ{0.01,0.02,0.05,0.1,0.2,0.5}\delta\in\{0.01,0.02,0.05,0.1,0.2,0.5\}. See Figure 10.

In examining Figure 10, we observe that the Gradient method with stepsize 2L+m\tfrac{2}{L+m} is not very robust to multiplicative noise. Even with noise as low as 1% (δ=0.01\delta=0.01), the Gradient method is no longer stable for L/m>100L/m>100. An explanation for this phenomenon is that in choosing the stepsize α\alpha, we are trading off convergence rate with robustness. The choice 2L+m\tfrac{2}{L+m} yields the minimum worst-case rate, but is fragile to noise. If we pick a more conservative stepsize such as the popular choice α=1L\alpha=\tfrac{1}{L}, we obtain a very different picture. See Figure 11.

Notice that with the updated stepsize of α=1L\alpha=\tfrac{1}{L}, the Gradient method is now robust to multiplicative noise. Robustness comes at the expense of a degradation in the best achievable convergence rate. This degradation manifests itself as a gap in Figure 11 between the black curves and the other ones.

Nesterov’s accelerated method

We can carry out an experiment similar to the one we did with the Gradient method, but now with Nesterov’s method. As before, we examine the trade-off between the magnitude of the multiplicative noise and the degradation of the optimal convergence rate. This time, we use δ{0.05,0.1,0.2,0.3,0.4,0.5}\delta\in\{0.05,0.1,0.2,0.3,0.4,0.5\}. See Figure 12.

As with our first Gradient method test, Nesterov’s method is not robust to multiplicative noise. For moderate L/mL/m, the degradation is minor, but eventually leads to instability when we reach a certain threshold. The idea that accelerated methods are sensitive to noise and can lead to an accumulation of error was noted in the recent work , using a different notion of gradient perturbation.

Robustness of Nesterov’s method can be improved by modifiying the α\alpha and β\beta parameters. Choosing a smaller α\alpha pushes back the instability threshold, while choosing a smaller β\beta simultaneously pushes back the instability threshold and degrades the rate. In the limit β0\beta\to 0, Nesterov’s method becomes the Gradient method, so we recover the plots of Figure 11.

3 Proximal point methods

Suppose we are interested in solving a problem of the form

As an illustrative example, we will show how to analyze the proximal version of Nesterov’s algorithm. Iterations take the form:

Note that when Πν=I\Pi_{\nu}=I, we recover the standard Nesterov algorithm. When β=0\beta=0, we recover the proximal gradient method.

In order to analyze this algorithm, we must characterize Πν\Pi_{\nu} using IQCs. To this end, let T\mathrel{\mathchoice{\vbox{\hbox{\displaystyle:}}}{\vbox{\hbox{\textstyle:}}}{\vbox{\hbox{\scriptstyle:}}}{\vbox{\hbox{\scriptscriptstyle:}}}{=}}\partial P be the subdifferential of PP. Then, Πν(x)\Pi_{\nu}(x) is the unique point such that xΠν(x)νT(Πν(x))x-\Pi_{\nu}(x)\in\nu T(\Pi_{\nu}(x)). Or, written another way,

Since TT is a subdifferential, it satisfies the incremental passivity condition. Namely,

Therefore, TT satisfies the sector IQC with m=0m=0 and L=L=\infty. In fact, via minor modifications of Lemma 8 and Lemma 9 using the definition of a subdifferential rather than (3.13c), TT satisfies the off-by-one and weighted off-by-one IQCs as well. Now transform (5.2) by introducing the auxiliary signals u_{k}\mathrel{\mathchoice{\vbox{\hbox{\displaystyle:}}}{\vbox{\hbox{\textstyle:}}}{\vbox{\hbox{\scriptstyle:}}}{\vbox{\hbox{\scriptscriptstyle:}}}{=}}\nabla\!f(y_{k}), w_{k}\mathrel{\mathchoice{\vbox{\hbox{\displaystyle:}}}{\vbox{\hbox{\textstyle:}}}{\vbox{\hbox{\scriptstyle:}}}{\vbox{\hbox{\scriptscriptstyle:}}}{=}}\Pi_{\nu}(y_{k}-\alpha u_{k}), v_{k}\mathrel{\mathchoice{\vbox{\hbox{\displaystyle:}}}{\vbox{\hbox{\textstyle:}}}{\vbox{\hbox{\scriptstyle:}}}{\vbox{\hbox{\scriptscriptstyle:}}}{=}}\nu T(w_{k}). The definitions of wkw_{k} and vkv_{k} together with (5.3) immediately imply that wk=ykαukvkw_{k}=y_{k}-\alpha u_{k}-v_{k}. Therefore, we can rewrite (5.2) as

These equations may be succinctly represented as a block diagram, as in Figure 13.

Analyzing this interconnection is done by accounting for the IQCs for both unknown blocks. If (Ψ1,M1)(\Psi_{1},M_{1}) is the IQC for  ⁣f\nabla\!f with output zk1z^{1}_{k} and (Ψ2,M2)(\Psi_{2},M_{2}) is the IQC for νT\nu T with output zk2z^{2}_{k}, then we seek to show that for all trajectories satisfy

where xkx_{k} now includes the states ξk\xi_{k} as well as the internal states of Ψ1\Psi_{1} and Ψ2\Psi_{2}. As in the proof of Theorem 4, for each fixed ρ\rho, we can write (5.4) as an LMI in the variables P0P\succ 0, λ10\lambda_{1}\geq 0, λ20\lambda_{2}\geq 0.

Applying this approach to the proximal version of Nesterov’s accelerated method, we recover the exact same plots as in Figure 3. This is to be expected because it is known that the proximal gradient and accelerated methods achieves the same worst-case convergence rates as their unconstrained counterparts . We conjecture that any algorithm GG of the form (2.2) which converges with rate ρ\rho has a proximal variant that converges at precisely the same rate.

4 Weakly convex functions

With minor modifications to our analysis, we can immediately extend our results to the case where the function to be optimized is convex, but not strongly convex. Specifically, we will assume throughout this subsection that fS(0,L)f\in S(0,L). The following development is due to Elad Hazan .

Suppose we want to minimize ff over a compact, convex domain D\mathcal{D} for which we can readily compute the Euclidean projection. Let RR denote the diameter of the set D\mathcal{D}. Define the function f_{\varepsilon}(x)\mathrel{\mathchoice{\vbox{\hbox{\displaystyle:}}}{\vbox{\hbox{\textstyle:}}}{\vbox{\hbox{\scriptstyle:}}}{\vbox{\hbox{\scriptscriptstyle:}}}{=}}f(x)+\tfrac{\varepsilon}{2R^{2}}\|x\|^{2}. Note that fεf_{\varepsilon} is differentiable and strongly convex; it satisfies fεS(εR2,L+εR2)f_{\varepsilon}\in S(\tfrac{\varepsilon}{R^{2}},L+\tfrac{\varepsilon}{R^{2}}). Therefore, we may apply our analysis to fεf_{\varepsilon}.

Suppose we execute on fεf_{\varepsilon} an algorithm with interleaved projections as in Section 5.3. Let xx_{\star} be any minimizer of ff on D\mathcal{D} and x(ε)x_{\star}^{(\varepsilon)} be the minimizer of fεf_{\varepsilon}. Let ρ\rho denote the rate of convergence achieved when the condition ratio is set as κ=(1+LR2/ε)\kappa=(1+LR^{2}/\varepsilon) and let Pε0P_{\varepsilon}\succ 0 be the associated solution to the LMI. Let \sigma\mathrel{\mathchoice{\vbox{\hbox{\displaystyle:}}}{\vbox{\hbox{\textstyle:}}}{\vbox{\hbox{\scriptstyle:}}}{\vbox{\hbox{\scriptscriptstyle:}}}{=}}\operatorname{cond}(P_{\varepsilon}). After kk steps,

Now apply (3.13a) from Proposition 5 using (f,x,y)=(fε,x(ε),xk)(f,x,y)=(f_{\varepsilon},x_{\star}^{(\varepsilon)},x_{k}) and obtain

Where the last inequality follows from the definition of set diameter. Therefore, if

then f(xk)f(x)εf(x_{k})-f(x_{\star})\leq\varepsilon. Substituting the rates found algebraically for the quadratic case (Section 2.2) or our numerical results for the strongly convex case (Sections 4.4–4.5), the convergence rate ρ\rho satisfies

Finally, note that σ=cond(Pε)\sigma=\operatorname{cond}(P_{\varepsilon}) also depends on ε\varepsilon. We can control the growth of σ\sigma directly by including a constraint of the form IPσII\preceq P\preceq\sigma I when solving the SDP of Theorem 4. Alternatively, we can observe (see Figure 4) that σκ=(1+LR2/ε)\sigma\propto\kappa=(1+LR^{2}/\varepsilon). Therefore, we conclude that

This analysis matches the standard bounds up to the logarithmic terms .

Algorithm design

In this section, we show one way in which the IQC analysis framework can be used for algorithm design. We saw in Section 5.2 that the Gradient method can be very robust to noise (Figure 11), or not robust at all (Figure 10), depending on whether we use a stepsize of α=1/L\alpha=1/L or α=2/(L+m)\alpha=2/(L+m), respectively.

A natural question to ask is whether such a trade-off between performance and robustness exists with Nesterov’s method as well. As can be seen in Figure 12, Nesterov’s method is only somewhat robust to noise. In the sequel, we will synthesize variants of Nesterov’s method that explore the performance-robustness trade-off space.

In light of the discussion in Section 2, we see that the Gradient, Heavy-ball, and Nesterov methods are all special cases of (6.1). In particular,

We may also rewrite (6.1) in more familiar recursion form as

Our approach is straightforward: for each choice of condition ratio L/mL/m and noise strength δ\delta, we generate a large grid of tuples (α,β1,β2)(\alpha,\beta_{1},\beta_{2}) and use the approach of Section 5.2 to evaluate each algorithm. We then choose the algorithm with the lowest ρ\rho. In other words, given bounds on the condition ratio and noise strength, we choose the algorithm for which we can certify the best possible convergence rate over all admissible choices of ff and gradient noise. The performance of each optimized algorithm is plotted in Figure 14.

By design, this new family of algorithms must have a performance superior to the Gradient method, Nesterov’s method, and the Heavy-ball method for any choice of tuning parameters. In the limit δ0\delta\to 0, we appear to recover the performance of Nesterov’s method when it is applied to quadratics. That is, we have used numerical search to find an algorithm whose worst case performance guarantee is slightly better than what is guaranteed by Nesterov’s method.

In the second plot of Figure 14, the algorithms robust to higher noise levels have greater slopes. When the noise level is low (δ=0.01\delta=0.01), we approach a slope of 0.5, the same as Nesterov. When the noise level is high (δ=0.5\delta=0.5), the slope is roughly 0.75. Note that the Gradient method, which was robust for all noise levels, has a slope of 1. Therefore, the new algorithms we found explore the trade-off between noise robustness and performance, and may be useful in instances where Nesterov’s method would be too fragile and the Gradient method would be too slow.

Future work

We are only beginning to get a sense of what IQCs can tell us about optimization schemes, and there are many more control theory tools and techniques left to adapt to the context of optimization and machine learning. We conclude this paper with several interesting directions for future work.

One of the drawbacks of our numerical proofs is that we are always pushing up against numerical error and conditioning error. Analytic proofs would alleviate this issue and could provide more interpretable results about how parameters of algorithms should vary to meet performance and robustness demands. To provide such analytic proofs, one would have to solve small LMIs in closed form. This amounts to solving small semidefinite programming problems, and this may be doable using analytic tools from algebraic geometry .

Lower Bounds

Our IQC conditions are merely sufficient for verifying the convergence of an optimization problem. However, as pointed out by Megretski and Rantzer, the derived conditions are necessary in a restricted sense . If we fail to find a solution to our LMI, then there is necessarily a sequence of point that satisfy all of the IQC constraints and that do not converge to an equilibrium . It is thus possible that this tool can be used to construct a convex function to serve as a counterexample for convergence. This intuition was what guided our construction of a counterexample for the convergence of the Heavy-ball method. It may be possible that this construction can be generalized to systematically produce counterexamples.

Time-varying algorithms

In many practical scenarios, we know neither the Lipschitz constant LL nor the strong convexity parameter mm. Under such conditions, some sort of estimation scheme is used to choose the appropriate step size. This could be a simple backoff scheme to ensure a sufficient decrease, or a more intricate search method to find the appropriate parameters . From our control vantage point, it may be possible to use techniques from adaptive control to certify when such line search methods are stable. In particular, these could be used to differentiate between the different sorts of schemes used to choose the parameters of the nonlinear conjugate gradient method. Useful connections are made between robustness analysis of adaptive controllers and Lyapunov theory in .

A related area of study is that of linear parameter varying (LPV) systems. This extension of linear systems analysis considers parameterized variations in the dynamical system matrices (A,B,C,D)(A,B,C,D). Algorithms with variable stepsize are examples of LPV systems. Some recent work discussing IQCs applied to LPV systems appeared in . Another possible direction would be to use optimal control techniques directly to choose algorithm parameters, possibly solving a small SDP at every iteration to choose new assignments.

Algorithm synthesis

Perhaps even more ambitiously than using our framework for parameter selection, our initial results show that we can use IQCs as a way of designing new algorithms. We restricted our attention to algorithms with one-step of memory, as then we only had to search over 3 parameters. However, new techniques would be necessary to explore more complicated algorithms. Local search heuristics could be used here to probe the feasible region of the associated LMIs, but convex methods and convex relaxations may also be applicable and should be investigated for these searches.

Noise analysis

Our robustness analysis only allows us to consider certain forms of deterministic noise. Expanding our techniques to study stochastic noise would expand the applicability of our techniques and could provide new insight into popular stochastic optimization algorithms such as stochastic coordinate descent and stochastic gradient descent . Many of the most common techniques for proving convergence of stochastic methods rely on Lyapunov-type arguments, and we may be able to generalize this approach to account for the variety of different methods. In order to expand our techniques to this space, we would need to introduce IQCs that were valid in expectation. Stability methods from stochastic control may be applicable to such investigations.

Beyond convexity

Since our analysis decouples the derivation of constraints on function classes from the algorithm analysis, it is possible that it can be generalized to nonconvex optimization. If we can characterize the function class by reasonable quadratic constraints, our framework immediately applies, and may lead to entirely new analyses for nonconvex function classes. For example, IQCs for saturating nonlinearities are readily available in the controls literature . From a complementary perspective, if we know that our function is not merely convex, but has additional structure, this can be incorporated as additional IQCs. With extra constraints, it is possible that we can derive faster rates or more robustness for smaller function classes.

Non-quadratic Lyapunov functions

There has been substantial work in the past decade on efficient algorithms to search over non-quadratic Lyapunov functions . These techniques use sum-of-squares hierarchies to certify that non-quadratic polynomials are nonnegative, and still reduce to solving small semidefinite programming problems. This more general class of Lyapunov functions could be better matched to certain classes of functions than quadratics, and we could perhaps analyze more complicated algorithms and interconnections.

Large-scale composite system analysis

Perhaps the most ambitious goal of this program is to move beyond convex models and attempt to analyze complicated optimization systems used in science and industry. Powerful modeling languages like AMPL or GAMS allow for local analysis of large, complex systems, and certifying that the decisions about these systems are valid and safe would have impact in a variety of fields including process technology, web-scale analytics, and power management. Since our methods nicely abstract beyond two interconnected systems, it is our hope that they can be extended to analyze the variety of optimization algorithms deployed to handle large, high throughput data processing.

Acknowledgments

We would like to thank Peter Seiler for many helpful pointers on time-domain IQCs, Elad Hazan for his suggestion of how to analyze functions that are not strongly convex, and Bin Hu for pointing out a misreading of Nesterov’s results in an earlier draft of this paper. We would also like to thank Ali Jadbabaie, Pablo Parrilo, and Stephen Wright for many helpful discussions and suggestions.

LL and AP are partially supported by AFOSR award FA9550-12-1-0339 and NASA Grant No. NRA NNX12AM55A. BR is generously supported by ONR awards N00014-11-1-0723 and N00014-13-1-0129, NSF award CCF-1148243, AFOSR award FA9550-13-1-0138, and a Sloan Research Fellowship. This research was also supported in part by NSF CISE Expeditions Award CCF-1139158, LBNL Award 7076018, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, The Thomas and Stacey Siebel Foundation, Adobe, Apple, Inc., C3Energy, Cisco, Cloudera, EMC, Ericsson, Facebook, GameOnTalis, Guavus, HP, Huawei, Intel, Microsoft, NetApp, Pivotal, Splunk, Virdata, Fanuc, VMware, and Yahoo!.

References

Appendix A Proof of Proposition 1

Suppose QQ has eigenvalues that satisfy 0<mλdλd1λ2λ1L0<m\leq\lambda_{d}\leq\lambda_{d-1}\leq\cdots\leq\lambda_{2}\leq\lambda_{1}\leq L. Throughout, we assume (A,B,C)(A,B,C) are the state transition matrices of the algorithm we would like to analyze (see Section 2). The state transition matrices are functions of the algorithm parameters (e.g. α\alpha and β\beta for the Heavy-ball method). Let TT be the closed loop system T\mathrel{\mathchoice{\vbox{\hbox{\displaystyle:}}}{\vbox{\hbox{\textstyle:}}}{\vbox{\hbox{\scriptstyle:}}}{\vbox{\hbox{\scriptscriptstyle:}}}{=}}A+BQC. The worst-case convergence rate is found by maximizing the spectral radius over all admissible QQ. In other words,

The first observation is that for our algorithms of interest, we may assume d=1d=1. To see why, take for example the Heavy-ball method, where

Write the eigenvalue decomposition of QQ as Q=UΛUTQ=U\Lambda U^{\mathsf{T}}, where Λ=diag(λ1,λ2,,λd)\Lambda=\operatorname{diag}(\lambda_{1},\lambda_{2},\dotsc,\lambda_{d}) and UU is orthogonal. Then,

Therefore, by similarity, the eigenvalues of TT are the eigenvalues of all the matrices

and we may without loss of generality let d=1d=1. The simplified problem is therefore

It is now a matter of algebraic substitution to find the optimal rates for each parameter choice. For example, with the Gradient method,

The second equality follows from the fact that the maximum of a convex function must occur at the boundary. We can now see that when α=1/L\alpha=1/L, we have ρmax=11/κ\rho_{\textup{max}}=1-1/\kappa. Finding the optimal α\alpha is straightforward in this case because the pointwise maximum of convex functions is itself convex. In this case, the minimum ρmax\rho_{\textup{max}} occurs when α=2L+m\alpha=\tfrac{2}{L+m} and the result is ρmax=κ1κ+1\rho_{\textup{max}}=\tfrac{\kappa-1}{\kappa+1}.

The analyses for Nesterov’s method and the Heavy-ball method are similar in spirit to that of the Gradient method, but computing the spectral radius is more complicated. For Nesterov’s method, we have

where ν1\nu_{1}, ν2\nu_{2} are the roots of the characteristic polynomial of T0T_{0}, which is

where \Delta\mathrel{\mathchoice{\vbox{\hbox{\displaystyle:}}}{\vbox{\hbox{\textstyle:}}}{\vbox{\hbox{\scriptstyle:}}}{\vbox{\hbox{\scriptscriptstyle:}}}{=}}(1+\beta)^{2}(1-\alpha\lambda)^{2}-4\beta(1-\alpha\lambda). It is straightforward to verify that if α,β\alpha,\beta are fixed, \max\bigl{\{}|\nu_{1}|,|\nu_{2}|\bigr{\}} is a continuous and quasiconvex function of λ\lambda. So, the maximum over λ\lambda must occur at boundary point. For the case where α=1L\alpha=\tfrac{1}{L} and β=κ1κ+1\beta=\tfrac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}, choosing λ=L\lambda=L yields zero, so the maximum must be achieved at λ=m\lambda=m, which yields

as required. When optimizing over quadratic functions, the above tuning of Nesterov’s method is suboptimal. Finding the choice of α\alpha and β\beta that yields the smallest ρmax\rho_{\textup{max}} requires careful examination of several subcases, and we omit the details in the interest of space. The result is shown in the second last row of the table in Proposition 1.

A similar eigenvalue analysis was used in to derive the optimal parameter tuning for the Heavy-ball method applied to quadratics (last row of the table).

Appendix B Proof of the Heavy-ball counterexample

We would like to minimize the function whose gradient is given by (4.11). The Heavy-ball method with L=25L=25 and m=1m=1 is given in Section 2.2:

where we use the initialization x1=x0x_{-1}=x_{0}. Based on the plot of Figure 6, we will look for limit points p,q,rp,q,r such that we have a cycle of period 3:

where p<1p<1, q<1q<1, and r>2r>2. Substituting the forms (B.2) and (4.11) directly into (B.1), we obtain the system of linear equations

and the unique solution of these equations is

In other words, the trajectory (B.2) with values (B.3) is a fixed point of (B.1). Let us call this limit sequence {xk}k0\{x_{k}^{\star}\}_{k\geq 0}. In order to show that the limit cycle is attractive (nearby trajectories will eventually converge to the cycle) consider a perturbed version of this sequence {xk+εk}k0\{x_{k}^{\star}+\varepsilon_{k}\}_{k\geq 0}. If we assume that the kthk^{\text{th}} iterate still belongs to the same piece of the function (e.g. if xk<1x^{\star}_{k}<1 then xk+εk<1x^{\star}_{k}+\varepsilon_{k}<1, and if xk>2x^{\star}_{k}>2 then xk+εk>2x^{\star}_{k}+\varepsilon_{k}>2) then we can use the Heavy-ball equations to compute the perturbation in the subsequent iterate. Upon doing so, we find that {εk}k0\{\varepsilon_{k}\}_{k\geq 0} must satisfy

It is immediate that ρ(P)=23<1\rho(P)=\tfrac{2}{3}<1 so as long as no transient value of εk\varepsilon_{k} strays too far from zero and causes an unscheduled crossing of the dotted lines on Figure 6, then we will have εk0\varepsilon_{k}\to 0 and the limit cycle will be attractive. Undesired transient behavior can be ruled out by ensuring that the error eventually decreases monotonically. One can easily verify that

where \|\cdot\| is the induced 2-norm. Therefore, P8P^{8} is a contraction, and we have:

If eight consecutive perturbations {εi2,εi+12,,εi+72}\{\varepsilon_{i}^{2},\varepsilon_{i+1}^{2},\dots,\varepsilon_{i+7}^{2}\} are each less than some εˉ2\bar{\varepsilon}^{2}, then apply (B.4) twice to conclude that

Continuing in this fashion, we conclude that the entire tail {εk2}ki\{\varepsilon_{k}^{2}\}_{k\geq i} also satisfies the bound εk2<εˉ2\varepsilon_{k}^{2}<\bar{\varepsilon}^{2}. The closest that our proposed limit cycle comes to a transition point of f(x)f(x) (either 11 or 22) is r2=14212250.1159r-2=\tfrac{142}{1225}\approx 0.1159. Therefore, if we set this number to be εˉ\bar{\varepsilon}, and we can find eight consecutive iterates of the Heavy-ball method that are each within εˉ\bar{\varepsilon} of the limit cycle, then the remaining iterates must converge exponentially to the limit cycle. It is straightforward to check that if x0=3.3x_{0}=3.3, then the iterates x4,x5,,x11x_{4},x_{5},\dots,x_{11} are each within a distance εˉ\bar{\varepsilon} of their respective limit points. Therefore, the limit cycle is attractive.