Lower Bounds for Finding Stationary Points I

Yair Carmon, John C. Duchi, Oliver Hinder, Aaron Sidford

Introduction

We prove lower bounds on the number of function and derivative evaluations required for algorithms to find a point xx satisfying inequality (1). While for arbitrary smooth ff, a near-stationary point (1) is certainly insufficient for any type of optimality, there are a number of reasons to study algorithms and complexity for finding stationary points. In several statistical and engineering problems, including regression models with non-convex penalties and objectives , phase retrieval , and non-convex (low-rank) reformulations of semidefinite programs and matrix completion , it is possible to show that all first- or second-order stationary points are (near) global minima. The strong empirical success of local search strategies for such problems, as well as for neural networks , motivates a growing body of work on algorithms with strong complexity guarantees for finding stationary points . In contrast to this algorithmic progress, algorithm-independent lower bounds for finding stationary points are largely unexplored.

By evaluation of higher order derivatives, such as the Hessian, it is possible to achieve better ϵ\epsilon dependence. Nesterov and Polyak’s cubic regularization of Newton’s method guarantees ϵ\epsilon-stationarity (1) in ϵ3/2\epsilon^{-3/2} iterations, but each iteration may be expensive when the dimension dd is large. More generally, ppth-order regularization methods iterate by sequentially minimizing models of ff based on order pp Taylor approximations, and Birgin et al. show that these methods converge in ϵ(p+1)/p\epsilon^{-(p+1)/p} iterations. Each iteration requires finding an approximate stationary point of a high-dimensional, potentially non-convex, degree p+1p+1 polynomial, which suggests that the methods will be practically challenging for p>2p>2. The methods nonetheless provide fundamental upper complexity bounds.

In this paper and its companion , we focus on the converse problem: providing dimension-free complexity lower bounds for finding ϵ\epsilon-stationary points. We show fundamental limits on the best achievable ϵ\epsilon dependence, as well as dependence on other problem parameters. Together with known upper bounds, our results shed light on the optimal rates of convergence for finding stationary points.

In the case of convex optimization, we have a deep understanding of the complexity of finding ϵ\epsilon-suboptimal points, that is, xx satisfying f(x)f(x)+ϵf(x)\leq f(x^{\star})+\epsilon for some ϵ>0\epsilon>0, where xargminxf(x)x^{\star}\in\mathop{\rm arg\hskip 1.00006ptmin}_{x}f(x). Here we review only the dimension-free optimal rates, as those are most relevant for our results. Given a point x(0)x^{(0)} satisfying x(0)xD<\|{x^{(0)}-x^{\star}}\|\leq D<\infty, if ff is convex with L1L_{1}-Lipschitz gradient, Nesterov’s accelerated gradient method finds an ϵ\epsilon-suboptimal point in L1Dϵ1/2\sqrt{L_{1}}D\epsilon^{-1/2} gradient evaluations, which is optimal even among randomized, higher-order algorithms .Higher order methods can yield improvements under additional smoothness: if in addition ff has L2L_{2}-Lipschitz Hessian and ϵL17/3L24/3D2/3\epsilon\leq L_{1}^{7/3}L_{2}^{-4/3}D^{2/3}, an accelerated Newton method achieves the (optimal) rate (L2D3/ϵ)2/7(L_{2}D^{3}/\epsilon)^{2/7} . For non-smooth problems, that is, when ff is L0L_{0}-Lipschitz, subgradient methods achieve the optimal rate of L02D2/ϵ2L_{0}^{2}D^{2}/\epsilon^{2} subgradient evaluations (cf. ). In Part II of this paper , we consider the impact of convexity on the difficulty of finding stationary points using first-order methods.

A related line of work considers algorithm-dependent lower bounds, describing functions that are challenging for common classes of algorithms, such as Newton’s method and gradient descent. In this vein, Jarre shows that the Chebyshev-Rosenbrock function is difficult to optimize, and that any algorithm that employs line search to determine the step size will require an exponential (in ϵ\epsilon) number of iterations to find an ϵ\epsilon-suboptimal point, even though the Chebyshev-Rosenbrock function has only a single stationary point. While this appears to contradict the polynomial complexity guarantees mentioned above, Cartis et al. explain this by showing that the difficult Chebyshev-Rosenbrock instances have ϵ\epsilon-stationary point with function value that is ω(ϵ)\omega(\epsilon)-suboptimal. Cartis et al. also develop algorithm-specific lower bounds on the iteration complexity of finding approximate stationary points. Their works show that the performance guarantees for gradient descent and cubic regularization of Newton’s method are tight for two-dimensional functions they construct, and they also extend these results to certain structured classes of methods .

2 The importance of high-dimensional constructions

To tightly characterize the algorithm- and dimension-independent complexity of finding ϵ\epsilon-stationary points, one must construct hard instances whose domain has dimension that grows with 1/ϵ1/\epsilon. The reason for this is simple: there exist algorithms with complexity that trades dependence on dimension dd in favor of better 1/ϵ1/\epsilon dependence. Indeed, Vavasis gives a grid-search method that, for functions with Lipschitz gradient, finds an ϵ\epsilon-stationary point in max{2d,ϵ2d/(d+2)}\max\{2^{d},\epsilon^{-2d/(d+2)}\} gradient and function evaluations. Moreover, Hinder exhibits a cutting-plane method that, for functions with Lipschitz first and third derivatives, finds an ϵ\epsilon-stationary point in dϵ4/3log1ϵd\cdot\epsilon^{-4/3}\log\frac{1}{\epsilon} gradient and function evaluations.

High-dimensional constructions are similarly unavoidable when developing lower bounds in convex optimization. There, the center-of-gravity cutting plane method (cf. ) finds an ϵ\epsilon-suboptimal point in dlog1ϵd\log\frac{1}{\epsilon} (sub)gradient evaluations, for any continuous convex function with bounded distance to optimality. Consequently, proofs of the dimension-free lower bound for convex optimization (as we cite in the previous section) all rely on constructions whose dimensionality grows polynomially in 1/ϵ1/\epsilon.

3 Our contributions

oracle queries to find an ϵ\epsilon-stationary point of ff, where cp>0c_{p}>0 is a constant decreasing at most polynomially in pp. As explained in the previous section, the domain of the constructed function ff has dimension polynomial in 1/ϵ1/\epsilon.

For every pp, our lower bound matches (up to a constant) known upper bounds, thereby characterizing the optimal complexity of finding stationary points. For p=1p=1, our results imply that gradient descent is optimal among all methods (even randomized, high-order methods) operating on functions with Lipschitz continuous gradient and bounded initial sub-optimality. Therefore, to strengthen the guarantees of gradient descent one must introduce additional assumptions, such as convexity of ff or Lipschitz continuity of 2f\nabla^{{2}}f. Similarly, in the case p=2p=2 we establish that cubic regularization of Newton’s method achieves the optimal rate ϵ3/2\epsilon^{-3/2}, and for general pp we show that ppth order Taylor-approximation methods are optimal.

These results say little about the potential of first-order methods on functions with higher-order Lipschitz derivatives, where first-order methods attain rates better than ϵ2\epsilon^{-2} . In Part II of this series , we address this issue and show lower bounds for deterministic algorithms using only first-order information. The lower bounds exhibit a fundamental gap between first- and second-order methods, and nearly match the known upper bounds .

4 Our approach and paper organization

In Section 2 we introduce the classes of functions and algorithms we consider as well as our notion of complexity. Then, in Section 3, we present the generic technique we use to prove lower bound for deterministic algorithms in both this paper and Part II . While essentially present in previous work, our technique abstracts away and generalizes the central arguments in many lower bounds . The technique applies to higher-order methods and provides lower bounds for general optimization goals, including finding stationary points (our main focus), approximate minimizers, and second-order stationary points. It is also independent of whether the functions under consideration are convex, applying to any function class with appropriate rotational invariance . The key building blocks of the technique are Nesterov’s notion of a “chain-like” function , which is difficult for a certain subclass of algorithms, and a “resisting oracle” reduction that turns a lower bound for this subclass into a lower bound for all deterministic algorithms.

In Section 4 we apply this generic method to produce lower bounds for deterministic methods (Theorem 1). The deterministic results underpin our analysis for randomized algorithms, which culminates in Theorem 2 in Section 5. Following Woodworth and Srebro , we consider random rotations of our deterministic construction, and show that for any algorithm such a randomly rotated function is, with high probability, difficult. For completeness, in Section 6 we provide lower bounds on finding stationary points of functions where x(0)x\|{x^{(0)}-x^{\star}}\| is bounded, rather than the function value gap f(x(0))f(x)f(x^{(0)})-f(x^{\star}); these bounds have the same ϵ\epsilon dependence as their bounded function value counterparts.

If TT is a symmetric order kk tensor, meaning that Ti1,,ikT_{i_{1},\ldots,i_{k}} is invariant to permutations of the indices (for example, kf(x)\nabla^{{k}}f(x) is always symmetric), then Zhang et al. [47, Thm. 2.1] show that

For vectors, the Euclidean and operator norms are identical.

Preliminaries

We begin our development with definitions of the classes of functions (§ 2.1), classes of algorithms (§ 2.2), and notions of complexity (§ 2.3) that we study.

Measures of function regularity are crucial for the design and analysis of optimization algorithms . We focus on two types of regularity conditions: Lipschitzian properties of derivatives and bounds on function value.

Complexity guarantees for finding stationary points of non-convex functions ff typically depend on the function value bound f(x(0))infxf(x)f(x^{(0)})-\inf_{x}f(x), where x(0)x^{(0)} is a pre-specified point. Without loss of generality, we take the pre-specified point to be for the remainder of the paper. With that in mind, we define the following classes of functions.

Let p1p\geq 1, Δ>0\Delta>0 and Lp>0L_{p}>0. Then the set

For our results, we also require the following important invariance notion, proposed (in the context of optimization) by Nemirovski and Yudin [35, Ch. 7.2].

Every function class we consider is orthogonally invariant, as f(0)infxf(x)=fU(0)infxfU(x)f(0)-\inf_{x}f(x)=f_{U}(0)-\inf_{x}f_{U}(x) and fUf_{U} has the same Lipschitz constants to all orders as ff, as their collections of associated directional projections are identical.

2 Algorithm classes

To model the computational cost of an algorithm, we adopt the information-based complexity framework, which Nemirovski and Yudin develop (see also ), and view every every iterate x(t)x^{(t)} as a query to an information oracle. Typically, one places restrictions on the information the oracle returns (e.g. only the function value and gradient at the query point) and makes certain assumptions on how the algorithm uses this information (e.g. deterministically). Our approach is syntactically different but semantically identical: we build the oracle restriction, along with any other assumption, directly into the structure of the algorithm. To formalize this, we define

as shorthand for the response of a ppth order oracle to a query at point xx. When p=p=\infty this corresponds to an oracle that reveals all derivatives at xx. Our algorithm classes follow.

As a concrete example, for any p1p\geq 1 and L>0L>0 consider the algorithm REGp,LAdet(p){\mathsf{REG}_{p,L}}\in\mathcal{A}_{\textnormal{{det}}}^{(p)} that produces iterates by minimizing the sum of a ppth order Taylor expansion and an order p+1p+1 proximal term:

For p=1p=1, REGp,L{\mathsf{REG}_{p,L}} is gradient descent with step-size 1/L1/L, for p=2p=2 it is cubic-regularized Newton’s method , and for general pp it is a simplified form of the scheme that Birgin et al. propose.

Randomized algorithms (and function-informed processes)

A ppth-order randomized algorithm A\mathsf{A} is a distribution on ppth-order deterministic algorithms. We can write any such algorithm as a deterministic algorithm given access to a random uniform variable on $(i.e.infinitelymanyrandombits).Thusthealgorithmoperateson(i.e. infinitely many random bits). Thus the algorithm operates onfbydrawingby drawing\xi\sim\mathsf{Uni}(independentlyof(independently off$), then producing iterates of the form

Zero-respecting sequences and algorithms

While deterministic and randomized algorithms are the natural collections for which we prove lower bounds, it is useful to define an additional structurally restricted class. This class forms the backbone of our lower bound strategy (Sec. 3), as it is both ‘small’ enough to uniformly underperform on a single function, and ‘large’ enough to imply lower bounds on the natural algorithm classes.

The definition (5) says that xi(t)=0x^{(t)}_{i}=0 if all partial derivatives involving the iith coordinate of ff (up to the ppth order) are zero. For p=1p=1, this definition is equivalent to the requirement that for every tt and j[d]j\in[d], if jf(x(s))=0\nabla_{j}f(x^{(s)})=0 for s<ts<t, then xj(t)=0x^{(t)}_{j}=0. The requirement (5) implies that x(1)=0x^{(1)}=0.

In the literature on lower bounds for first-order convex optimization, it is common to assume that methods only query points in the span of the gradients they observe . Our notion of zero-respecting algorithms generalizes this assumption to higher-order methods, but even first-order zero-respecting algorithms are slightly more general. For example, coordinate descent methods are zero-respecting, but they generally do not remain in the span of the gradients.

3 Complexity measures

To measure the performance of algorithm A\mathsf{A} on function ff, we evaluate the iterates it produces from ff, and with mild abuse of notation, we define

as the complexity of A\mathsf{A} on ff. With this setup, we define the complexity of algorithm class A\mathcal{A} on function class F\mathcal{F} as

Many algorithms guarantee “dimension independent” convergence and thus provide upper bounds for the quantity (7). A careful tracing of constants in the analysis of Birgin et al. implies that the generalized regularization scheme REGp,L{\mathsf{REG}_{p,L}} defined by the recursion (3) guarantees

While definition (7) is our primary notion of complexity, our proofs provide bounds on smaller quantities than (7) that also carry meaning. For zero-respecting algorithms, we exhibit a single function ff and bound \inf_{\mathsf{A}\in\mathcal{A}_{\textnormal{{zr}}}}\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f\big{)} from below, in effect interchanging the inf\inf and sup\sup in (7). This implies that all zero-respecting algorithms share a common vulnerability. For randomized algorithms, we exhibit a distribution PP supported on functions of a fixed dimension dd, and we lower bound the average \inf_{\mathsf{A}\in\mathcal{A}_{\textnormal{{rand}}}}\int\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f\big{)}dP(f), bounding the distributional complexity , which is never greater than worst-case complexity (and is equal for randomized and deterministic algorithms). Even randomized algorithms share a common vulnerability: functions drawn from PP.

Anatomy of a lower bound

In this section we present a generic approach to proving lower bounds for optimization algorithms. The basic techniques we use are well-known and applied extensively in the literature on lower bounds for convex optimization . However, here we generalize and abstract away these techniques, showing how they apply to high-order methods, non-convex functions, and various optimization goals (e.g. ϵ\epsilon-stationarity, ϵ\epsilon-optimality).

Nesterov [37, Chapter 2.1.2] proves lower bounds for smooth convex optimization problems using the “chain-like” quadratic function

which he calls the “worst function in the world.” The important property of ff is that for every i[d]i\in[d], if(x)=0\nabla_{i}f(x)=0 whenever xi1=xi=xi+1=0x_{i-1}=x_{i}=x_{i+1}=0 (with x0:=1x_{0}:=1 and xd+1:=0x_{d+1}:=0). Thus, if we “know” only the first t1t-1 coordinates of ff, i.e. are able to query only vectors xx such xt=xt+1==xd=0x_{t}=x_{t+1}=\cdots=x_{d}=0, then any xx we query satisfies sf(x)=0\nabla_{s}f(x)=0 for s>ts>t; we only “discover” a single new coordinate tt. We generalize this chain structure to higher-order derivatives as follows.

2 A lower bound strategy

The preceding discussion shows that zero-respecting algorithms take many iterations to “discover” all the coordinates of a zero-chain. In the following observation, we formalize how finding a suitable zero-chain provides a lower bound on the performance of zero-respecting algorithms.

ff belongs to the function class, i.e. fFf\in\mathcal{F}, and

f(x)>ϵ\left\|{\nabla f(x)}\right\|>\epsilon for every xx such that xT=0x_{T}=0; We can readily adapt this property for lower bounds on other termination criteria, e.g. require f(x)infyf(y)>ϵf(x)-\inf_{y}f(y)>\epsilon for every xx such that xT=0x_{T}=0.

then \mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{zr}}}^{(p)},\mathcal{F}\big{)}\geq\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{zr}}}^{(p)},\{f\}\big{)}>T.

3 From deterministic to zero-respecting algorithms

Zero-chains allow us to generate strong lower bounds for zero-respecting algorithms. The following reduction shows that these lower bounds are valid for deterministic algorithms as well.

We also give a variant of Proposition 1 that is tailored to lower bounds constructed by means of Observation 2 and allows explicit accounting of dimensionality.

where fU:=f(Uz)f_{U}:=f(U^{\top}z) and O(d+T,d)\mathsf{O}(d+T,d) is the set of (d+T)×d(d+T)\times d orthogonal matrices, so that {fUUO(d+T,d)}\{f_{U}\mid U\in\mathsf{O}(d+T,d)\} contains only function with domain of dimension d+Td+T.

giving Proposition 1; Proposition 2 follows similarly, and for it we may take d=d+Td^{\prime}=d+T.

The adversarial rotation argument that yields Propositions 1 and 2 is more or less apparent in the proofs of previous lower bounds in convex optimization for deterministic algorithms. We believe it is instructive to separate the proof of lower bounds on \mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{zr}}},\mathcal{F}\big{)} and the reduction from Adet\mathcal{A}_{\textnormal{{det}}} to Azr\mathcal{A}_{\textnormal{{zr}}}, as the latter holds in great generality. Indeed, Propositions 1 and 2 hold for any complexity measure \mathsf{T}_{\epsilon}\big{(}\cdot,\cdot\big{)} that satisfies

4 Randomized algorithms

Proposition 1 and 2 do not apply to randomized algorithms, as they require the adversary (maximizing choice of ff) to simulate the action of A\mathsf{A} on ff. To handle randomized algorithms, we strengthen the notion of a zero-chain as follows.

A robust zero-chain is also an “ordinary” zero-chain. In Section 5 we replace the adversarial rotation UU of § 3.3 with an orthogonal matrix drawn uniformly at random, and consider the random function fU(x)=f(Ux)f_{U}(x)=f(U^{\top}x), where ff is a robust zero-chain. We adapt a lemma by Woodworth and Srebro , and use it to show that for every AArand\mathsf{A}\in\mathcal{A}_{\textnormal{{rand}}}, A[fU]A[f_{U}] satisfies an approximate form of Observation 1 (w.h.p.) whenever the iterates A[fU]\mathsf{A}[f_{U}] have bounded norm. With further modification of fUf_{U} to handle unbounded iterates, our zero-chain strategy yields a strong distributional complexity lower bound on Arand\mathcal{A}_{\textnormal{{rand}}}.

Lower bounds for zero-respecting and deterministic algorithms

Our construction, illustrated in Figure 1, has two key properties. First is that ff is a zero-chain (Observation 3 in the sequel). Second, as we show in Lemma 2, fˉT(x)\|{\nabla\bar{f}_{T}(x)}\| is large unless xi1|x_{i}|\geq 1 for every i[T]i\in[T]. These properties make it hard for any zero-respecting method to find a stationary point of scaled versions of fˉT\bar{f}_{T}, and coupled with Proposition 1, this gives a lower bound for deterministic algorithms.

Before turning to the main theorem of this section, we catalogue the important properties of the functions Ψ\Psi, Φ\Phi and fˉT\bar{f}_{T}.

The functions Ψ\Psi and Φ\Phi satisfy the following.

For all x1x\geq 1 and y<1|y|<1, Ψ(x)Φ(y)>1\Psi(x)\Phi^{\prime}(y)>1.

The functions and derivatives Ψ,Ψ,Φ\Psi,\Psi^{\prime},\Phi and Φ\Phi^{\prime} are non-negative and bounded, with

We prove Lemma 1 in Appendix B.1. The remainder our development relies on Ψ\Psi and Φ\Phi only through Lemma 1. Therefore, the precise choice of Ψ,Φ\Psi,\Phi is not particularly special; any two functions with properties similar to Lemma 1 will yield similar lower bounds.

The key consequence of Lemma 1.i is that the function ff is a robust zero-chain (see Definition 4) and consequently also a zero-chain (Definition 3):

For any j>1j>1, if xj1,xj<1/2|x_{j-1}|,|x_{j}|<1/2 then fˉT(y)=fˉT(y1,,yj1,0,yj+1,,yT)\bar{f}_{T}(y)=\bar{f}_{T}(y_{1},\ldots,y_{j-1},0,y_{j+1},\ldots,y_{T}) for all yy in a neighborhood of xx.

Applying Observation 3 for j=i+1,,Tj=i+1,\ldots,T gives that fˉT\bar{f}_{T} is a robust zero-chain by Definition 4. Taking derivatives of fˉT(x1,,xi,0,,0)\bar{f}_{T}(x_{1},\ldots,x_{i},0,\ldots,0) with respect to xjx_{j}, j>ij>i, shows that fˉT\bar{f}_{T} is also a zero-chain by Definition 3. Thus, Observation 1 shows that any zero-respecting algorithm operating on fˉT\bar{f}_{T} requires T+1T+1 iterations to find a point where xT0x_{T}\neq 0.

Next, we establish the “large gradient property” that fˉT(x)\nabla\bar{f}_{T}(x) must be large if any coordinate of xx is near zero.

If xi<1|x_{i}|<1 for any iTi\leq T, then there exists jij\leq i such that xj<1|x_{j}|<1 and

We take jij\leq i to be the smallest jj for which xj<1|x_{j}|<1, so that xj11|x_{j-1}|\geq 1 (where we use the shorthand x01x_{0}\equiv 1). Therefore, we have

In the chain of inequalities, inequality (i)(i) follows because Ψ(x)Φ(y)0\Psi^{\prime}(x)\Phi(y)\geq 0 for every x,yx,y; inequality (ii)(ii) follows because Ψ(x)=0\Psi(x)=0 for x1/2x\leq 1/2, while equality (iii)(iii) follows from Lemma 1.ii and the pairing of xj<1|x_{j}|<1 and xj11|x_{j-1}|\geq 1. ∎

Finally, we verify that fˉT\bar{f}_{T} meets the smoothness and boundedness requirements of the function classes we consider.

The function fˉT\bar{f}_{T} satisfies the following.

We have fˉT(0)infxfˉT(x)12T\bar{f}_{T}(0)-\inf_{x}\bar{f}_{T}(x)\leq 12T.

The proof of Lemma 3 is technical, so we defer it to Appendix B.2. In the lemma, Properties i and iii allow us to guarantee that appropriately scaled versions of fˉT\bar{f}_{T} are in Fp(Δ,Lp)\mathcal{F}_{p}(\Delta,L_{p}). Property is ii is necessary for analysis of the randomized construction in Section 5.

2 Lower bounds for zero-respecting and deterministic algorithms

We can now state and prove a lower bound for finding stationary points of ppth order smooth functions using full derivative information and zero-respecting algorithms (the class Azr\mathcal{A}_{\textnormal{{zr}}}). Proposition 1 transforms this bound into one on all deterministic algorithms (the class Adet\mathcal{A}_{\textnormal{{det}}}).

3 Proof of Theorem 1

Lower bounds for randomized algorithms

With our lower bounds on the complexity of deterministic algorithms established, we turn to the class of all randomized algorithms. We provide strong distributional complexity lower bounds by exhibiting a distribution on functions such that a function drawn from it is “difficult” for any randomized algorithm, with high probability. We do this via the composition of a random orthogonal transformation with the function fˉT\bar{f}_{T} defined in (10).

The result of Lemma 4 is identical (to constant factors) to an important result of Woodworth and Srebro [45, Lemma 7], but we must be careful with the sequential conditioning of randomness between the iterates x(t)x^{(t)}, the random orthogonal UU, and how much information the sequentially computed derivatives may leak. Because of this additional care, we require a modification of their original proof, In a recent note Woodworth and Srebro independently provide a revision of their proof that is similar, but not identical, to the one we propose here. which we provide in Section B.3, giving a rough outline here. For a fixed t<Tt<T, assume that u(j),x(s)<1/2|\langle u^{(j)},x^{(s)}\rangle|<1/2 holds for every pair sts\leq t and j{s,,T}j\in\{s,\ldots,T\}; we argue that this (roughly) implies that u(j),x(t+1)<1/2|\langle u^{(j)},x^{(t+1)}\rangle|<1/2 for every j{t+1,,T}j\in\{t+1,\ldots,T\} with high probability, completing the induction. When the assumption that u(j),x(s)<1/2|\langle u^{(j)},x^{(s)}\rangle|<1/2 holds, the robust zero-chain property of fˉT\bar{f}_{T} (Definition 4 and Observation 3) implies that for every sts\leq t we have

2 Handling unbounded iterates

The quadratic term in f^T;U\hat{f}_{T;U} guarantees that all points beyond a certain norm have a large gradient, which prevents the algorithm from trivially making the gradient small by increasing the norm of the iterates. The following lemma captures the hardness of f^T;U\hat{f}_{T;U} for randomized algorithms.

Let δ>0\delta>0, and let x(1),,x(T)x^{(1)},\ldots,x^{(T)} be informed by f^T;U\hat{f}_{T;U}. If d522302T2log2T2δd\geq 52\cdot 230^{2}\cdot T^{2}\log\frac{2T^{2}}{\delta} then, with probability at least 1δ1-\delta,

Therefore, by Lemma 2 with i=Ti=T, for each tt there exists jTj\leq T such that

To show that f^T;U(x(t))\|{\nabla\hat{f}_{T;U}(x^{(t)})}\| is also large, we consider separately the cases x(t)R/2\|{x^{(t)}}\|\leq R/2 and x(t)R/2\|{x^{(t)}}\|\geq R/2. For the first case, we use ρx(x)=Iρ(x)ρ(x)/R21+x2/R2\frac{\partial\rho}{\partial x}(x)=\frac{I-\rho(x)\rho(x)^{\top}/R^{2}}{\sqrt{1+\left\|{x}\right\|^{2}/R^{2}}} to write

Therefore, for y(t)x(t)R/2\|{y^{(t)}}\|\leq\|{x^{(t)}}\|\leq R/2 we have

In the second case, x(t)R/2\left\|{x^{(t)}}\right\|\geq R/2, we have for any xx satisfying xR/2\left\|{x}\right\|\geq R/2 and y=ρ(x)y=\rho(x) that

As our lower bounds repose on appropriately scaling the function f^T;U\hat{f}_{T;U}, it remains to verify that f^T;U\hat{f}_{T;U} satisfies the few boundedness properties we require. We do so in the following lemma.

The function f^T;U\hat{f}_{T;U} satisfies the following.

We have f^T;U(0)infxf^T;U(x)12T\hat{f}_{T;U}(0)-\inf_{x}\hat{f}_{T;U}(x)\leq 12T.

We defer the (computationally involved) proof of this lemma to Section B.4.

3 Final lower bounds

With Lemmas 5 and 6 in hand, we can state our lower bound for all algorithms, randomized or otherwise, given access to all derivatives of a C\mathcal{C}^{\infty} function. Note that our construction also implies an identical lower bound for (slightly) more general algorithms that use any local oracle , meaning that the information the oracle returns about a function ff when queried at a point xx is identical to that it returns when a function gg is queried at xx whenever f(z)=g(z)f(z)=g(z) for all zz in a neighborhood of xx.

implying Theorem 2 for any δ1/2\delta\geq 1/2. Thus, we exhibit a randomized procedure for finding hard instances for any randomized algorithm that requires no knowledge of the algorithm itself.

Theorem 2 is stronger than Theorem 1 in that it applies to the broad class of all randomized algorithms. Our probabilistic analysis requires that the functions constructed to prove Theorem 2 have dimension scaling proportional to T2log(T)T^{2}\log(T) where TT is the lower bound on the number of iterations. Contrast this to Theorem 1, which only requires dimension 2T+12T+1. A similar gap exists in complexity results for convex optimization . At present, it unclear if these gaps are fundamental or a consequence of our specific constructions.

4 Proof of Theorem 2

Fix AArand\mathsf{A}\in\mathcal{A}_{\textnormal{{rand}}} and let x(1),x(2),,x(T)x^{(1)},x^{(2)},\ldots,x^{(T)} be the iterates produced by A\mathsf{A} applied on fUf_{U}. Since ff and f^T;U\hat{f}_{T;U} differ only by scaling, the iterates x(1)/σ,x(2)/σ,,x(T)/σx^{(1)}/\sigma,x^{(2)}/\sigma,\ldots,x^{(T)}/\sigma are informed by f^T;U\hat{f}_{T;U} (recall Sec. 2.2), and therefore we may apply Lemma 5 with δ=1/2\delta=1/2 and our large enough choice of dimension dd to conclude that

It remains to choose TT to guarantee that fUf_{U} belongs to the relevant function class (bounded and smooth) for every orthogonal UU. By Lemma 6.ii, fUf_{U} has LpL_{p}-Lipschitz continuous ppth order derivatives. By Lemma 6.i, we have

Distance-based lower bounds

We have so far considered finding approximate stationary points of smooth functions with bounded sub-optimality at the origin, i.e. f(0)infxf(x)Δf(0)-\inf_{x}f(x)\leq\Delta. In convex optimization, it is common to consider instead functions with bounded distance between the origin and a global minimum. We may consider a similar restriction for non-convex functions; for p1p\geq 1 and positive Lp,DL_{p},D, let

be the class of C\mathcal{C}^{\infty} functions with LpL_{p}-Lipschitz ppth order derivatives satisfying

that is, all global minima have bounded distance to the origin.

In this section we give a lower bound on the complexity of this function class that has the same ϵ\epsilon dependence as our bound for the class Fp(Δ,Lp)\mathcal{F}_{p}(\Delta,L_{p}). This is in sharp contrast to convex optimization, where distance-bounded functions enjoy significantly better ϵ\epsilon dependence than their value-bounded counterparts (see Section LABEL:sec:convex in the companion ). Qualitatively, the reason for this difference is that the lack of convexity allows us to “hide” global minima close to the origin that are difficult to find for any algorithm with local function access .

We postpone the construction and proof to Appendix C, and move directly to the final bound.

There exist numerical constants 0<c0,c1<0<c_{0},c_{1}<\infty such that the following lower bound holds. For any p1p\geq 1, let D,LpD,L_{p}, and ϵ\epsilon be positive. Then

We remark that a lower-dimensional construction suffices for proving the lower bound for deterministic algorithm, similarly to Theorem 1.

While we do not have a matching upper bound for Theorem 3, we can match its ϵ\epsilon dependence in the smaller function class

Conclusion

This work provides the first algorithm independent and tight lower bounds on the dimension-free complexity of finding stationary points. As a consequence, we have characterized the optimal rates of convergence to ϵ\epsilon-stationarity, under the assumption of high dimension and an oracle that provides all derivatives. Yet, given the importance of high-dimensional problems, the picture is incomplete: high-order algorithms—even second-order method—are often impractical in large scale settings. We address this in the companion , which provides sharper lower bounds for the more restricted class of first-order methods. In we also provide a full conclusion for this paper sequence, discussing in depth the implications and questions that arise from our results.

Acknowledgments

OH was supported by the PACCAR INC fellowship. YC and JCD were partially supported by the SAIL-Toyota Center for AI Research, NSF-CAREER award 1553086, and a Sloan Foundation Fellowship in Mathematics. YC was partially supported by the Stanford Graduate Fellowship and the Numerical Technologies Fellowship.

References

Appendix A Proof of Propositions 1 and 2

The core of the proofs of Propositions 1 and 2 is the following construction.

Before explaining the construction of ZA\mathsf{Z}_{\mathsf{A}}, let us see how its defining property implies the lemma. If \mathsf{T}_{\epsilon}\big{(}\mathsf{A},f_{U}\big{)}>T_{0}, we are done. Otherwise, \mathsf{T}_{\epsilon}\big{(}\mathsf{A},f_{U}\big{)}\leq T_{0} and we have

for every tT0t\leq T_{0} (we set z(i)=0z^{(i)}=0 for every i>T0i>T_{0} without loss of generality).

Finally, note that the arguments above hold unchanged for p=p=\infty. ∎

With Lemma 7 in hand, the propositions follow easily.

We may assume that \mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{det}}}^{(p)},\mathcal{F}\big{)}<T_{0} for some integer T0<T_{0}<\infty, as otherwise we have \mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{det}}}^{(p)},\mathcal{F}\big{)}=\infty and the result holds trivially. For any AAdet(p)\mathsf{A}\in\mathcal{A}_{\textnormal{{det}}}^{(p)} and the value T0T_{0}, we invoke Lemma 7 to construct ZAAzr(p)\mathsf{Z}_{\mathsf{A}}\in\mathcal{A}_{\textnormal{{zr}}}^{(p)} such that \mathsf{T}_{\epsilon}\big{(}\mathsf{A},f_{U}\big{)}\geq\min\{T_{0},\mathsf{T}_{\epsilon}\big{(}\mathsf{Z}_{\mathsf{A}},f\big{)}\} for every fFf\in\mathcal{F} and some orthogonal matrix UU that depends on ff and A\mathsf{A}. Consequently, we have

where inequality (i)(i) uses that fUFf_{U}\in\mathcal{F} because F\mathcal{F} is orthogonally invariant, step (ii)(ii) uses \mathsf{T}_{\epsilon}\big{(}\mathsf{A},f_{U}\big{)}\geq\min\{T_{0},\mathsf{T}_{\epsilon}\big{(}\mathsf{Z}_{\mathsf{A}},f\big{)}\} and step (iii)(iii) is due to ZAAzr(p)\mathsf{Z}_{\mathsf{A}}\in\mathcal{A}_{\textnormal{{zr}}}^{(p)} by construction. As we chose T0T_{0} for which \mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{det}}}^{(p)},\mathcal{F}\big{)}<T_{0}, the chain of inequalities implies \mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{det}}}^{(p)},\mathcal{F}\big{)}\geq\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{zr}}}^{(p)},\mathcal{F}\big{)}, concluding the proof. ∎

For any AAdet(p)\mathsf{A}\in\mathcal{A}_{\textnormal{{det}}}^{(p)}, we invoke Lemma 7 with T0=TT_{0}=T to obtain ZAAzr(p)\mathsf{Z}_{\mathsf{A}}\in\mathcal{A}_{\textnormal{{zr}}}^{(p)} and orthogonal matrix UU^{\prime} (dependent on ff and A\mathsf{A}) for which

where the last equality is due to \inf_{\mathsf{B}\in\mathcal{A}_{\textnormal{{zr}}}^{(p)}}\mathsf{T}_{\epsilon}\big{(}\mathsf{B},f\big{)}=\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{zr}}}^{(p)},\{f\}\big{)}\geq T. Since fU{fUUO(d+T,d)}f_{U^{\prime}}\in\{f_{U}\mid U\in\mathsf{O}(d+T,d)\}, we have

and taking the infimum over AAdet(p)\mathsf{A}\in\mathcal{A}_{\textnormal{{det}}}^{(p)} concludes the proof. ∎

Appendix B Technical Results

Each of the statements in the lemma is immediate except for part iii. To see this part, we require a few further calculations. We begin by providing bounds on the derivatives of Φ(x)=e12xe12t2dt\Phi(x)=e^{\frac{1}{2}}\int_{-\infty}^{x}e^{-\frac{1}{2}t^{2}}dt. To avoid annoyances with scaling factors, we define ϕ(t)=e12t2\phi(t)=e^{-\frac{1}{2}t^{2}}.

We prove the result by induction. We have ϕ(t)=te12t2\phi^{\prime}(t)=-te^{-\frac{1}{2}t^{2}}, so that the base case of the induction is satisfied. Now, assume for our induction that

where ci(k)2k(max{i,1})k|c_{i}^{(k)}|\leq 2^{k}(\max\{i,1\})^{k}. Then taking derivatives, we have

where ci(k+1)=(i+1)ci+1(k)ci1(k)c_{i}^{(k+1)}=(i+1)c_{i+1}^{(k)}-c_{i-1}^{(k)} (and we treat ck+1(k)=0c_{k+1}^{(k)}=0) and ck+1(k+1)=1|c_{k+1}^{(k+1)}|=1. With the induction hypothesis that ci(k)(2max{i,1})kc_{i}^{(k)}\leq(2\max\{i,1\})^{k}, we obtain

With this result, we find that for any k1k\geq 1,

The function log(xiϕ(x))=ilogx12x2\log(x^{i}\phi(x))=i\log x-\frac{1}{2}x^{2} is maximized at x=ix=\sqrt{i}, so that xiϕ(x)exp(i2logie)x^{i}\phi(x)\leq\exp(\frac{i}{2}\log\frac{i}{e}). We thus obtain the numerically verifiable upper bound

Now, we turn to considering the function Ψ(x)\Psi(x). We assume w.l.o.g. that x>12x>\frac{1}{2}, as otherwise Ψ(k)(x)=0\Psi^{(k)}(x)=0 for all kk. Recall Ψ(x)=exp(11(2x1)2)\Psi(x)=\exp\left(1-\frac{1}{(2x-1)^{2}}\right) for x>12x>\frac{1}{2}. We have the following lemma regarding its derivatives.

We provide the proof by induction over kk. For k=1k=1, we have that

which yields the base case of the induction. Now, assume that for some kk, we have

where ck+1(k)=0c_{k+1}^{(k)}=0 and c0(k)=0c_{0}^{(k)}=0. Defining c11=4c_{1}^{1}=4 and ci(k+1)=4ci1(k)2(k+2i)ci(k)c_{i}^{(k+1)}=4c_{i-1}^{(k)}-2(k+2i)c_{i}^{(k)} for i>1i>1, then, under the inductive hypothesis that ci(k)6k(2i+k)k|c_{i}^{(k)}|\leq 6^{k}(2i+k)^{k}, we have

As in the derivation immediately following Lemma 8, by replacing t=12x1t=\frac{1}{2x-1}, we have that tk+2iet2t^{k+2i}e^{-t^{2}} is maximized by t=(k+2i)/2t=\sqrt{(k+2i)/2}, so that

which yields the numerically verifiable upper bound

B.2 Proof of Lemma 3

Part i follows because fˉT(0)<0\bar{f}_{T}(0)<0 and, since 0Ψ(x)e0\leq\Psi(x)\leq e and 0Φ(x)2πe0\leq\Phi(x)\leq\sqrt{2\pi e},

Part ii follows additionally from Ψ(x)=0\Psi(x)=0 on x<1/2x<1/2, 0Ψ(x)54e10\leq\Psi^{\prime}(x)\leq\sqrt{54e^{-1}} and 0Φ(x)e0\leq\Phi^{\prime}(x)\leq\sqrt{e}, which when substituted into

for every xx and jj. Consequently, fˉT(x)T23T\left\|{\nabla\bar{f}_{T}(x)}\right\|\leq\sqrt{T}\leq 23\sqrt{T}.

Examining fˉT\bar{f}_{T}, we see that i1ip+1fˉT\partial_{i_{1}}\cdots\partial_{i_{p+1}}\bar{f}_{T} is non-zero if and only if ijik1\left|i_{j}-i_{k}\right|\leq 1 for every j,k[p+1]j,k\in\left[p+1\right]. Consequently, we can rearrange the above summation as

where we take v0:=0v_{0}:=0 and vT+1:=0v_{T+1}:=0. Brief calculation show that

where we have used i=1Tvi+δ1vi+δpvi1|\sum_{i=1}^{T}v_{i+\delta_{1}}\cdots v_{i+\delta_{p}}v_{i}|\leq 1 for every δ{0,1}p{0,1}p\delta\in\{0,1\}^{p}\cup\{0,-1\}^{p}. To see this last claim is true, recall that vv is a unit vector and note that

If δ=0\delta=0 then i=1Tvi+δ1vi+δpvi=i=1Tvip+1i=1Tvi2=1|\sum_{i=1}^{T}v_{i+\delta_{1}}\cdots v_{i+\delta_{p}}v_{i}|=|\sum_{i=1}^{T}v_{i}^{p+1}|\leq\sum_{i=1}^{T}v_{i}^{2}=1. Otherwise, letting 1j=1pδj=np1\leq\sum_{j=1}^{p}|\delta_{j}|=n\leq p, the Cauchy-Swartz inequality implies

where s=1s=-1 or 11. This gives the result. ∎

B.3 Proof of Lemma 4

The following linear-algebraic result justifies the definition (21) of GtG_{t}.

For all tTt\leq T, GtG_{\leq t} implies u(j),x(s)<1/2|\langle u^{(j)},x^{(s)}\rangle|<1/2 for every s{1,,t}s\in\{1,\ldots,t\} and every j{s,,T}j\in\{s,\ldots,T\}.

First, notice that since GtG_{\leq t} implies GsG_{\leq s} for every sts\leq t, it suffices to show that GtG_{\leq t} implies u(j),x(t)<1/2|\langle u^{(j)},x^{(t)}\rangle|<1/2 for every j{t,,T}j\in\{t,\ldots,T\}. We will in fact prove a stronger statement:

Since GtG_{t} holds, its definition (21) implies u(j),Pt1x(t)αPt1x(t)αx(t)|\langle u^{(j)},P_{t-1}^{\perp}x^{(t)}\rangle|\leq\alpha\left\|{P_{t-1}^{\perp}x^{(t)}}\right\|\leq\alpha\left\|{x^{(t)}}\right\|. Moreover, by Cauchy-Schwarz and the implication (22), we have u(j),Pt1x(t)Pt1u(j)x(t)2α2(t1)x(t)|\langle u^{(j)},P_{t-1}x^{(t)}\rangle|\leq\left\|{P_{t-1}u^{(j)}}\right\|\left\|{x^{(t)}}\right\|\leq\sqrt{2\alpha^{2}(t-1)}\left\|{x^{(t)}}\right\|. Combining the two bounds, we obtain the result of the lemma,

where we have used x(t)R\left\|{x^{(t)}}\right\|\leq R and α=1/(5RT)\alpha=1/(5R\sqrt{T}).

We prove bound (22) by induction. The basis of the induction, t=1t=1, is trivial, as P0=0P_{0}=0. We shall assume (22) holds for s{1,,t1}s\in\{1,\ldots,t-1\} and show that it consequently holds for s=ts=t as well. We may apply the Graham-Schmidt procedure on the sequence x(1),u(1),,x(t1),u(t1)x^{(1)},u^{(1)},\ldots,x^{(t-1)},u^{(t-1)} to write

where P^k\hat{P}_{k} is the projection to the span of {x(1),u(1),,x(k),u(k),x(k+1)}\{x^{(1)},u^{(1)},\ldots,x^{(k)},u^{(k)},x^{(k+1)}\},

where the equalities hold by u(i),u(j)=0\left\langle u^{(i)},u^{(j)}\right\rangle=0, P^i1=IP^i1\hat{P}_{i-1}^{\perp}=I-\hat{P}_{i-1}, and the definition of P^i1\hat{P}_{i-1}.

The PiP_{i} matrices are projections, so Pi12=Pi1{P}_{i-1}^{2}={P}_{i-1}, and Cauchy-Swartz and the induction hypothesis imply

Moreover, the event GiG_{i} implies u(i),Pi1x(i)u(j),Pi1x(i)α2Pi1x(i)2\left|\langle u^{(i)},P_{i-1}^{\perp}x^{(i)}\rangle\langle u^{(j)},P_{i-1}^{\perp}x^{(i)}\rangle\right|\leq\alpha^{2}\left\|{P_{i-1}^{\perp}x^{(i)}}\right\|^{2}, so

where the first equality uses (Pi1)2=Pi1(P_{i-1}^{\perp})^{2}=P_{i-1}^{\perp}, the second the definition of P^i1\hat{P}_{i-1}, and the inequality uses u(j),Pi1x(i)αPi1x(i)\langle u^{(j)},P_{i-1}^{\perp}x^{(i)}\rangle\leq\alpha\|{P_{i-1}^{\perp}x^{(i)}}\| and Pi1u(j)22α2(i1)\|{P_{i-1}u^{(j)}}\|^{2}\leq 2\alpha^{2}\left(i-1\right).

Combining the observations (24a) and (24b), we can bound each summand in the second summation in (23). Since the summands in the first summation are bounded by α2\alpha^{2} by definition (21) of GiG_{i}, we obtain

where U(<t)U_{(<t)} is shorthand for u(1),,u(t1)u^{(1)},\ldots,u^{(t-1)} and ξ\xi is the random variable generating x(1),,x(T)x^{(1)},\ldots,x^{(T)}.

In the following lemma, we state formally that conditioned on G<iG_{<i}, the iterate x(i)x^{(i)} depends on UU only through its first (i1)(i-1) columns.

For every iTi\leq T, there exist measurable functions A+(i)\mathsf{A}^{(i)}_{+} and A(i)\mathsf{A}^{(i)}_{-} such that

Let tTt\leq T, and j{t,,T}j\in\{t,\ldots,T\}. Then conditioned on ξ,U(<t)\xi,U_{(<t)} and G<tG_{<t}, the vector Pt1u(j)Pt1u(j)\frac{P_{t-1}^{\perp}u^{(j)}}{\left\|{P_{t-1}^{\perp}u^{(j)}}\right\|} is uniformly distributed on the unit sphere in the subspace to which Pt1P_{t-1}^{\perp} projects.

This lemma is subtle. The vectors u(j)u^{(j)}, jtj\geq t, conditioned on U(<t)U_{(<t)}, are certainly uniformly distributed on the unit sphere in the subspace orthogonal to U(<t)U_{(<t)}. However, the additional conditioning on G<tG_{<t} requires careful handling. Throughout the proof we fix tTt\leq T and j{t,,T}j\in\{t,\ldots,T\}. We begin by noting that by (22), G<tG_{<t} implies

Therefore, when G<tG_{<t} holds we have Pt1u(j)0P_{t-1}^{\perp}u^{(j)}\neq 0 so Pt1u(j)/Pt1u(j){P_{t-1}^{\perp}u^{(j)}}/{\left\|{P_{t-1}^{\perp}u^{(j)}}\right\|} is well-defined.

To establish our result, we will show that the density of U(t)=[u(t),,u(T)]U_{(\geq t)}=[u^{(t)},\ldots,u^{(T)}] conditioned on ξ,U(<t),G<t\xi,U_{(<t)},G_{<t} is invariant to rotations that preserve the span of x(1),u(1),,x(t1),u(t1)x^{(1)},u^{(1)},\ldots,x^{(t-1)},u^{(t-1)}. More formally, let ptp_{\geq t} denote the density of U(t)U_{(\geq t)} conditional on ξ,U(<t)\xi,U_{(<t)} and G<tG_{<t}. We wish to show that

Throughout, we let ZZ denote such a rotation. Letting pξ,Up_{\xi,U} and pUp_{U} denote the densities of (ξ,U)(\xi,U) and UU, respectively, we have

where the first equality holds by the definition of conditional probability and second by the independence of ξ\xi and UU. We have ZU(<t)=U(<t)ZU_{(<t)}=U_{(<t)} and therefore, by the invariance of UU to rotations, pU([U(<t),ZU(t)])=pU(ZU)=pU(U)p_{U}([U_{(<t)},ZU_{(\geq t)}])=p_{U}(ZU)=p_{U}(U). Hence, replacing UU with ZUZU in the above display yields

Marginalizing the density (28) to obtain a density for u(j)u^{(j)} and recalling that Pt1P_{t-1}^{\perp} is measurable ξ,U(<t),G<t\xi,U_{(<t)},G_{<t}, we conclude that, conditioned on ξ,U(<t),G<t\xi,U_{(<t)},G_{<t} the random variable Pt1u(j)Pt1u(j)\frac{P_{t-1}^{\perp}u^{(j)}}{\left\|{P_{t-1}^{\perp}u^{(j)}}\right\|} has the same density as Pt1Zu(j)Pt1Zu(j)\frac{P_{t-1}^{\perp}Zu^{(j)}}{\left\|{P_{t-1}^{\perp}Zu^{(j)}}\right\|}. However, Pt1Z=ZPt1P_{t-1}^{\perp}Z=ZP_{t-1}^{\perp} by assumption on ZZ, and therefore

We conclude that the conditional distribution of the unit vector Pt1u(j)Pt1u(j)\frac{P_{t-1}^{\perp}u^{(j)}}{\left\|{P_{t-1}^{\perp}u^{(j)}}\right\|} is invariant to rotations in the subspace to which Pt1P_{t-1}^{\perp} projects. ∎

Substituting this bound back into the probability (26) gives

B.4 Proof of Lemma 6

and therefore by Lemma 3.i, we have f^T;U(0)infxf^T;U(x)fˉT(0)infxfˉT(x)12T\hat{f}_{T;U}(0)-\inf_{x}\hat{f}_{T;U}(x)\leq\bar{f}_{T}(0)-\inf_{x}\bar{f}_{T}(x)\leq 12T.

Establishing part ii requires substantially more work. Since smoothness with respect to Euclidean distances is invariant under orthogonal transformations, we take UU to be the first TT columns of the dd-dimensional identity matrix, denoted U=Id,TU=I_{d,T}. Recall the scaling ρ(x)=Rx/R2+x2\rho(x)=Rx/\sqrt{R^{2}+\left\|{x}\right\|^{2}} with “radius” R=230TR=230\sqrt{T} and the definition f^T;U(x)=fˉT(Uρ(x))+110x2\hat{f}_{T;U}(x)=\bar{f}_{T}(U^{\top}\rho(x))+\frac{1}{10}\left\|{x}\right\|^{2}. The quadratic 110x2\frac{1}{10}\left\|{x}\right\|^{2} term in f^T;U\hat{f}_{T;U} has 15\frac{1}{5}-Lipschitz first derivative and -Lipschitz higher order derivatives (as they are all constant or zero), and we take U=Id,TU=I_{d,T} without loss of generality, so we consider the function

We now compute the partial derivatives of f^T;I\hat{f}_{T;I}. Defining y=ρ(x)y=\rho(x), let ~j1,...,jkk:=kyj1yjk\widetilde{\nabla}^{{k}}_{j_{1},...,j_{k}}:=\frac{\partial^{k}}{\partial y_{j_{1}}\cdots\partial y_{j_{k}}} denote derivatives with respect to yy. In addition, define Pk\mathcal{P}_{k} to be the set of all partitions of [k]={1,,k}[k]=\{1,\ldots,k\}, i.e. (S1,,SL)Pk(S_{1},\ldots,S_{L})\in\mathcal{P}_{k} if and only if the SiS_{i} are disjoint and lSl=[k]\cup_{l}S_{l}=[k]. Using the chain rule, we have for any kk and set of indices i1,,ikTi_{1},\ldots,i_{k}\leq T that

where we have used the shorthand iSS\nabla^{{|S|}}_{i_{S}} to denote the partial derivatives with respect to each of xijx_{i_{j}} for jSj\in S. We use the equality (29) to argue that (recall the identity (2))

for some numerical constant To simplify notation we allow cc to change from equation to equation throughout the proof, always representing a finite numerical constant independent of dd, TT, kk or pp., 0<c<0<c<\infty and every p1p\geq 1. As explained in Section 2.1, this implies f^T;U\hat{f}_{T;U} has ecplogp+ce^{cp\log p+c}-Lipschitz ppth order derivative, giving part ii of the lemma.

algebraic manipulations and rearrangement of the sum (29) yield

Before proving inequality (30), we show how it implies the desired lemma. By the preceding display, we have

Lemma 3 shows that there exists a numerical constant c<c<\infty such that

When the number of partitions L=1L=1, we have S1=p+12|S_{1}|=p+1\geq 2, and so Lemma 3.ii yields

where we have used R=230TR=230\sqrt{T}. Using S1++SL=p+1|S_{1}|+\cdots+|S_{L}|=p+1 and the fact that q(x)=(x+1)log(x+1)q(x)=(x+1)\log(x+1) satisfies q(x)+q(y)q(x+y)q(x)+q(y)\leq q(x+y) for every x,y>0x,y>0, we have

for some c<c<\infty and every (S1,,SL)Pp+1(S_{1},\ldots,S_{L})\in\mathcal{P}_{p+1}. Bounds on Bell numbers [6, Thm. 2.1] give that there are at most exp(klogk)\exp(k\log k) partitions in Pk\mathcal{P}_{k}, which combined with the bound above gives desired result.

where P|P| denotes the number of disjoint elements of partition PPkP\in\mathcal{P}_{k}. Define the function ρ(ξ)=ξ/1+ξ2\overline{\rho}(\xi)=\xi/\sqrt{1+\|{\xi}\|^{2}}, and let λ(ξ)=1+ξ2\lambda(\xi)=\sqrt{1+\|{\xi}\|^{2}} so that ρ(ξ)=λ(ξ)\overline{\rho}(\xi)=\nabla\lambda(\xi) and ρ(ξ)=Rρ(ξ/R)\rho(\xi)=R\overline{\rho}(\xi/R). Let vjk(ξ)=kρj(ξ),vk\overline{v}_{j}^{k}(\xi)=\langle\nabla^{{k}}\overline{\rho}_{j}(\xi),v^{\otimes k}\rangle, so that

With this in mind, we consider the quantity kλ(ξ),vk\langle\nabla^{{k}}\lambda(\xi),v^{\otimes k}\rangle. Defining temporarily the functions α(r)=1+2r\alpha(r)=\sqrt{1+2r} and β(t)=12ξ+tv2\beta(t)=\frac{1}{2}\|{\xi+tv}\|^{2}, and their composition h(t)=α(β(t))h(t)=\alpha(\beta(t)), we evidently have

where the second equality used Faá di Bruno’s formula (31). Now, we note the following immediate facts:

Thus, if we let Pk,2\mathcal{P}_{k,2} denote the partitions of [k][k] consisting only of subsets with one or two elements, we have

where Ci(P)\mathsf{C}_{i}(P) denotes the number of sets in PP with precisely ii elements. Noting that v=1\left\|{v}\right\|=1, we may rewrite this as

We would like to bound al(ξ)ξ,vl1a_{l}(\xi)\langle\xi,v\rangle^{l-1} and bl(ξ)ξ,vlξb_{l}(\xi)\langle\xi,v\rangle^{l}\xi. Note that PC1(P)|P|\geq\mathsf{C}_{1}(P) for every PPkP\in\mathcal{P}_{k}, so Pl|P|\geq l in the sums above. Moreover, bounds for Bell numbers [6, Thm. 2.1] show that there are at most exp(klogk)\exp(k\log k) partitions of [k][k], and (2k1)!!exp(klogk)(2k-1)!!\leq\exp(k\log k) as well. As a consequence, we obtain

where we have used ξ,vξ|\langle\xi,v\rangle|\leq\left\|{\xi}\right\| due to v=1\left\|{v}\right\|=1. We similarly bound supξbl(ξ)ξ,vlξ\sup_{\xi}|b_{l}(\xi)||\langle\xi,v\rangle|^{l}\left\|{\xi}\right\|. Returning to expression (32), we have

for a numerical constant c<c<\infty. This is the desired bound (30), completing the proof. ∎

Appendix C Proof of Theorem 3

We divide the proof of the theorem into two parts, as in our previous results, first providing a few building blocks, then giving the theorem. The basic idea is to introduce a negative “bump” that is challenging to find, but which is close to the origin.

As Figure 2 shows, hˉT\bar{h}_{T} features a unit-height peak centered at 45e(T)\frac{4}{5}e^{(T)}, and it is identically zero when the distance from that peak exceeds 15\frac{1}{5}. The volume of the peak vanishes exponentially with TT, making it hard to find by querying hˉT\bar{h}_{T} locally. We list the properties of hˉT\bar{h}_{T} necessary for our analysis.

The function hˉT\bar{h}_{T} satisfies the following.

We prove the lemma in Section C.1; the proof is similar to that of Lemma 6. With these properties in hand, we can prove Theorem 3.

As in the proof of Lemma 6, we write hx,v(t)=Ψ(β(t)){h}_{x,v}(t)=\Psi(\beta(t)) where β(t)=112x+tv2\beta(t)=1-\frac{1}{2}\left\|{x+tv}\right\|^{2}, and use Faá di Bruno’s formula (31) to write, for any k1k\geq 1,

where Pk\mathcal{P}_{k} is the set of partitions of [k][k] and P|P| denotes the number of set in partition PP. Noting that β(0)=x,v\beta^{\prime}(0)=-\langle x,v\rangle, β(0)=v2\beta^{\prime\prime}(0)=-\left\|{v}\right\|^{2} and β(n)(0)=0\beta^{(n)}(0)=0 for any n>2n>2, we have

where Pk,2\mathcal{P}_{k,2} denote the partitions of [k][k] consisting only of subsets with one or two elements and Ci(P)\mathsf{C}_{i}(P) denotes the number of sets in PP with precisely ii elements.

Noting that Ψ(k)(112x2)=0\Psi^{(k)}(1-\frac{1}{2}\left\|{x}\right\|^{2})=0 for any k0k\geq 0 and x>1\left\|{x}\right\|>1, we may assume x1\left\|{x}\right\|\leq 1. Since v1\left\|{v}\right\|\leq 1, we may bound hx,v(p+1)(0)|{h}_{x,v}^{(p+1)}(0)| by

for some absolute constant c<c<\infty, where inequality (i)(i) follows from Lemma 1.iv and that the number of matchings in the complete graph (or the kkth telephone number [21, Lem. 2]) has bound Pk,2ek2logk|\mathcal{P}_{k,2}|\leq e^{\frac{k}{2}\log k}. This gives the result.

C.2 Proof of Theorem 3

with the final inequality using our assumption σD\sigma\leq D. On the other hand, for any xx such that hˉT(Ux/D)=0\bar{h}_{T}(U^{\top}x/D)=0, we have by Lemma 6.i (along with f^T;U(0)=0)\hat{f}_{T;U}(0)=0) that

Combining the two displays above, we conclude that if

then all global minima xx^{\star} of fUf_{U} must satisfy hˉT(Ux/D)>0\bar{h}_{T}(U^{\top}x^{\star}/D)>0. Inspecting the definition (18) of hˉT\bar{h}_{T}, this implies x/D0.8u(T)<15\left\|{x^{\star}/D-0.8u^{(T)}}\right\|<\frac{1}{5}, and therefore xD\left\|{x^{\star}}\right\|\leq D. Thus, by setting

we guarantee that fUFpdist(D,Lp)f_{U}\in\mathcal{F}^{\rm dist}_{p}(D,L_{p}) as long as σD\sigma\leq D.

We defer the proof of claim (35) to the end of this section.

By claim (35), this implies hˉT(Ux(t)/D)=0\nabla\bar{h}_{T}(U^{\top}x^{(t)}/D)=0, and by Lemma 5, f^T;U(x(t)/σ)>1/2\|{\nabla\hat{f}_{T;U}(x^{(t)}/\sigma)}\|>1/2. Thus, after scaling,

where T=Dp+1/13σp+1T=\left\lfloor D^{p+1}/13\sigma^{p+1}\right\rfloor is defined in Eq. (34). Thus, as fUFpdist(D,Lp)f_{U}\in\mathcal{F}^{\rm dist}_{p}(D,L_{p}) for our choice of TT, we immediately obtain

Finally, we return to demonstrate claim (35). Note that u(T),ρ(x/σ)<1/2|\langle u^{(T)},\rho(x/\sigma)\rangle|<1/2 is equivalent to u(T),x<σ21+xσR2|\langle u^{(T)},x\rangle|<\frac{\sigma}{2}\sqrt{1+\|{\frac{x}{\sigma R}}\|^{2}}, and consider separately the cases x/σR/2\left\|{x/\sigma}\right\|\leq R/2 and x/σ>R/2=115T\left\|{x/\sigma}\right\|>R/2=115\sqrt{T}. In the first case, we have u(T),x<(5/4)σ<(3/5)D|\langle u^{(T)},x\rangle|<(\sqrt{5}/4)\sigma<(3/5)D, by our assumption σD\sigma\leq D. Therefore, by Lemma 10.ii we have that hˉT(Uy/D)=0\bar{h}_{T}(U^{\top}y/D)=0 for yy near xx. In the second case, we have x>(4R/5)u(T),x>230u(T),x\left\|{x}\right\|>(4R/\sqrt{5})|\langle u^{(T)},x\rangle|>230|\langle u^{(T)},x\rangle|. If in addition u(T),x<(3/5)D|\langle u^{(T)},x\rangle|<(3/5)D then our conclusion follows as before. Otherwise, x/D>230(3/5)>1\left\|{x}\right\|/D>230\cdot(3/5)>1, so again the conclusion follows by Lemma 10.ii.