Lower Bounds for Finding Stationary Points I

Yair Carmon, John C. Duchi, Oliver Hinder, Aaron Sidford

Introduction

We prove lower bounds on the number of function and derivative evaluations required for algorithms to find a point $x$ satisfying inequality (1). While for arbitrary smooth $f$, a near-stationary point (1) is certainly insufficient for any type of optimality, there are a number of reasons to study algorithms and complexity for finding stationary points. In several statistical and engineering problems, including regression models with non-convex penalties and objectives, phase retrieval, and non-convex (low-rank) reformulations of semidefinite programs and matrix completion, it is possible to show that all first- or second-order stationary points are (near) global minima. The strong empirical success of local search strategies for such problems, as well as for neural networks, motivates a growing body of work on algorithms with strong complexity guarantees for finding stationary points. In contrast to this algorithmic progress, algorithm-independent lower bounds for finding stationary points are largely unexplored.

By evaluation of higher order derivatives, such as the Hessian, it is possible to achieve better $\epsilon$ dependence. Nesterov and Polyak’s cubic regularization of Newton’s method guarantees $\epsilon$ -stationarity (1) in $\epsilon^{-3/2}$ iterations, but each iteration may be expensive when the dimension $d$ is large. More generally, $p$ th-order regularization methods iterate by sequentially minimizing models of $f$ based on order $p$ Taylor approximations, and Birgin et al. show that these methods converge in $\epsilon^{-(p+1)/p}$ iterations. Each iteration requires finding an approximate stationary point of a high-dimensional, potentially non-convex, degree $p+1$ polynomial, which suggests that the methods will be practically challenging for $p>2$ . The methods nonetheless provide fundamental upper complexity bounds.

In this paper and its companion , we focus on the converse problem: providing dimension-free complexity lower bounds for finding $\epsilon$ -stationary points. We show fundamental limits on the best achievable $\epsilon$ dependence, as well as dependence on other problem parameters. Together with known upper bounds, our results shed light on the optimal rates of convergence for finding stationary points.

In the case of convex optimization, we have a deep understanding of the complexity of finding $\epsilon$-suboptimal points, that is, $x$ satisfying $f(x)\leq f(x^{\star})+\epsilon$ for some $\epsilon>0$, where $x^{\star}\in\arg\min_{x}f(x)$. Here we review only the dimension-free optimal rates, as those are most relevant for our results. Given a point $x^{(0)}$ satisfying $\|{x^{(0)}-x^{\star}}\|\leq D<\infty$, if $f$ is convex with $L_{1}$-Lipschitz gradient, Nesterov’s accelerated gradient method finds an $\epsilon$-suboptimal point in $\sqrt{L_{1}}D\epsilon^{-1/2}$ gradient evaluations, which is optimal even among randomized, higher-order algorithms. Higher-order methods can yield improvements under additional smoothness: if in addition $f$ has $L_{2}$-Lipschitz Hessian and $\epsilon\leq L_{1}^{7/3}L_{2}^{-4/3}D^{2/3}$, an accelerated Newton method achieves the (optimal) rate $(L_{2}D^{3}/\epsilon)^{2/7}$. For non-smooth problems, that is, when $f$ is $L_{0}$-Lipschitz, subgradient methods achieve the optimal rate of $L_{0}^{2}D^{2}/\epsilon^{2}$ subgradient evaluations. In Part II of this paper, we consider the impact of convexity on the difficulty of finding stationary points using first-order methods.

A related line of work considers algorithm-dependent lower bounds, describing functions that are challenging for common classes of algorithms, such as Newton’s method and gradient descent. In this vein, Jarre shows that the Chebyshev-Rosenbrock function is difficult to optimize, and that any algorithm that employs line search to determine the step size will require an exponential (in $\epsilon$) number of iterations to find an $\epsilon$-suboptimal point, even though the Chebyshev-Rosenbrock function has only a single stationary point. While this appears to contradict the polynomial complexity guarantees mentioned above, Cartis et al. explain this by showing that the difficult Chebyshev-Rosenbrock instances have $\epsilon$-stationary points with function value that is $\omega(\epsilon)$-suboptimal. Cartis et al. also develop algorithm-specific lower bounds on the iteration complexity of finding approximate stationary points. Their works show that the performance guarantees for gradient descent and cubic regularization of Newton’s method are tight for two-dimensional functions they construct, and they also extend these results to certain structured classes of methods.

2 The importance of high-dimensional constructions

To tightly characterize the algorithm- and dimension-independent complexity of finding $\epsilon$ -stationary points, one must construct hard instances whose domain has dimension that grows with $1/\epsilon$ . The reason for this is simple: there exist algorithms with complexity that trades dependence on dimension $d$ in favor of better $1/\epsilon$ dependence. Indeed, Vavasis gives a grid-search method that, for functions with Lipschitz gradient, finds an $\epsilon$ -stationary point in $\max\{2^{d},\epsilon^{-2d/(d+2)}\}$ gradient and function evaluations. Moreover, Hinder exhibits a cutting-plane method that, for functions with Lipschitz first and third derivatives, finds an $\epsilon$ -stationary point in $d\cdot\epsilon^{-4/3}\log\frac{1}{\epsilon}$ gradient and function evaluations.
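The trade-off between dimension and accuracy described above can be made concrete with a little arithmetic. The sketch below compares the grid-search bound $\max\{2^{d},\epsilon^{-2d/(d+2)}\}$ with the dimension-free rate $\epsilon^{-2}$ of gradient descent. All constants and Lipschitz parameters are dropped, and the function names are illustrative only, so the numbers indicate orders of magnitude rather than actual evaluation counts.

```python
def grid_search_bound(d, eps):
    """Grid-search bound max{2^d, eps^(-2d/(d+2))}, with constants ignored."""
    return max(2.0 ** d, eps ** (-2.0 * d / (d + 2)))


def dimension_free_bound(eps):
    """The eps^(-2) rate of gradient descent, with constants ignored."""
    return eps ** -2.0


eps = 1e-4
for d in (1, 2, 10, 40):
    # In low dimension the grid search wins; the 2^d term takes over as d grows.
    print(d, grid_search_bound(d, eps), dimension_free_bound(eps))
```

For fixed $d$ the exponent $2d/(d+2)$ is strictly below $2$, which is why a hard instance of fixed dimension cannot certify an $\epsilon^{-2}$ lower bound.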

High-dimensional constructions are similarly unavoidable when developing lower bounds in convex optimization. There, the center-of-gravity cutting plane method (cf. ) finds an $\epsilon$ -suboptimal point in $d\log\frac{1}{\epsilon}$ (sub)gradient evaluations, for any continuous convex function with bounded distance to optimality. Consequently, proofs of the dimension-free lower bound for convex optimization (as we cite in the previous section) all rely on constructions whose dimensionality grows polynomially in $1/\epsilon$ .

3 Our contributions

oracle queries to find an $\epsilon$ -stationary point of $f$ , where $c_{p}>0$ is a constant decreasing at most polynomially in $p$ . As explained in the previous section, the domain of the constructed function $f$ has dimension polynomial in $1/\epsilon$ .

For every $p$ , our lower bound matches (up to a constant) known upper bounds, thereby characterizing the optimal complexity of finding stationary points. For $p=1$ , our results imply that gradient descent is optimal among all methods (even randomized, high-order methods) operating on functions with Lipschitz continuous gradient and bounded initial sub-optimality. Therefore, to strengthen the guarantees of gradient descent one must introduce additional assumptions, such as convexity of $f$ or Lipschitz continuity of $\nabla^{{2}}f$ . Similarly, in the case $p=2$ we establish that cubic regularization of Newton’s method achieves the optimal rate $\epsilon^{-3/2}$ , and for general $p$ we show that $p$ th order Taylor-approximation methods are optimal.

These results say little about the potential of first-order methods on functions with higher-order Lipschitz derivatives, where first-order methods attain rates better than $\epsilon^{-2}$ . In Part II of this series , we address this issue and show lower bounds for deterministic algorithms using only first-order information. The lower bounds exhibit a fundamental gap between first- and second-order methods, and nearly match the known upper bounds .

4 Our approach and paper organization

In Section 2 we introduce the classes of functions and algorithms we consider as well as our notion of complexity. Then, in Section 3, we present the generic technique we use to prove lower bounds for deterministic algorithms in both this paper and Part II. While essentially present in previous work, our technique abstracts away and generalizes the central arguments in many lower bounds. The technique applies to higher-order methods and provides lower bounds for general optimization goals, including finding stationary points (our main focus), approximate minimizers, and second-order stationary points. It is also independent of whether the functions under consideration are convex, applying to any function class with appropriate rotational invariance. The key building blocks of the technique are Nesterov’s notion of a “chain-like” function, which is difficult for a certain subclass of algorithms, and a “resisting oracle” reduction that turns a lower bound for this subclass into a lower bound for all deterministic algorithms.

In Section 4 we apply this generic method to produce lower bounds for deterministic methods (Theorem 1). The deterministic results underpin our analysis for randomized algorithms, which culminates in Theorem 2 in Section 5. Following Woodworth and Srebro , we consider random rotations of our deterministic construction, and show that for any algorithm such a randomly rotated function is, with high probability, difficult. For completeness, in Section 6 we provide lower bounds on finding stationary points of functions where $\|{x^{(0)}-x^{\star}}\|$ is bounded, rather than the function value gap $f(x^{(0)})-f(x^{\star})$ ; these bounds have the same $\epsilon$ dependence as their bounded function value counterparts.

If $T$ is a symmetric order $k$ tensor, meaning that $T_{i_{1},\ldots,i_{k}}$ is invariant to permutations of the indices (for example, $\nabla^{{k}}f(x)$ is always symmetric), then Zhang et al. [47, Thm. 2.1] show that

For vectors, the Euclidean and operator norms are identical.

Preliminaries

We begin our development with definitions of the classes of functions (§ 2.1), classes of algorithms (§ 2.2), and notions of complexity (§ 2.3) that we study.

Measures of function regularity are crucial for the design and analysis of optimization algorithms . We focus on two types of regularity conditions: Lipschitzian properties of derivatives and bounds on function value.

Complexity guarantees for finding stationary points of non-convex functions $f$ typically depend on the function value bound $f(x^{(0)})-\inf_{x}f(x)$, where $x^{(0)}$ is a pre-specified point. Without loss of generality, we take the pre-specified point to be $x^{(0)}=0$ for the remainder of the paper. With that in mind, we define the following classes of functions.

Let $p\geq 1$ , $\Delta>0$ and $L_{p}>0$ . Then the set

For our results, we also require the following important invariance notion, proposed (in the context of optimization) by Nemirovski and Yudin [35, Ch. 7.2].

Every function class we consider is orthogonally invariant, as $f(0)-\inf_{x}f(x)=f_{U}(0)-\inf_{x}f_{U}(x)$ and $f_{U}$ has the same Lipschitz constants to all orders as $f$ , as their collections of associated directional projections are identical.
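The invariance claim can be checked numerically on a small example. The sketch below assumes the rotated function is $f_{U}(z)=f(U^{\top}z)$ for a matrix $U$ with orthonormal columns (the form used later in the paper), and uses a toy test function $f(x)=\sum_{i}\sin(x_{i})$ chosen only for illustration. It verifies that $f_{U}(0)=f(0)$ and that $\|\nabla f_{U}(Ux)\|=\|\nabla f(x)\|$.

```python
import math
import random


def orthonormal_columns(n, d, seed=0):
    """Return d orthonormal vectors in R^n (the columns of U), via Gram-Schmidt."""
    rng = random.Random(seed)
    cols = []
    while len(cols) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for u in cols:
            c = sum(a * b for a, b in zip(u, v))
            v = [a - c * b for a, b in zip(v, u)]
        nrm = math.sqrt(sum(a * a for a in v))
        if nrm > 1e-8:
            cols.append([a / nrm for a in v])
    return cols


def apply_U(cols, x):
    """Compute U x in R^n."""
    n = len(cols[0])
    return [sum(x[j] * cols[j][i] for j in range(len(cols))) for i in range(n)]


def apply_Ut(cols, z):
    """Compute U^T z in R^d."""
    return [sum(a * b for a, b in zip(col, z)) for col in cols]


def f(x):
    return sum(math.sin(v) for v in x)          # toy function, illustration only


def grad_f(x):
    return [math.cos(v) for v in x]


def f_U(cols, z):
    return f(apply_Ut(cols, z))


def grad_f_U(cols, z):
    # Chain rule: grad f_U(z) = U grad f(U^T z).
    return apply_U(cols, grad_f(apply_Ut(cols, z)))


def norm(v):
    return math.sqrt(sum(a * a for a in v))


U = orthonormal_columns(7, 3)
x = [0.3, -1.2, 2.0]
print(norm(grad_f(x)), norm(grad_f_U(U, apply_U(U, x))))
```

Because $U^{\top}U=I$, multiplying by $U$ preserves norms, so stationarity of $f$ at $x$ and of $f_{U}$ at $Ux$ coincide.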

2 Algorithm classes

To model the computational cost of an algorithm, we adopt the information-based complexity framework, which Nemirovski and Yudin develop, and view every iterate $x^{(t)}$ as a query to an information oracle. Typically, one places restrictions on the information the oracle returns (e.g. only the function value and gradient at the query point) and makes certain assumptions on how the algorithm uses this information (e.g. deterministically). Our approach is syntactically different but semantically identical: we build the oracle restriction, along with any other assumption, directly into the structure of the algorithm. To formalize this, we define

as shorthand for the response of a $p$ th order oracle to a query at point $x$ . When $p=\infty$ this corresponds to an oracle that reveals all derivatives at $x$ . Our algorithm classes follow.

As a concrete example, for any $p\geq 1$ and $L>0$ consider the algorithm ${\mathsf{REG}_{p,L}}\in\mathcal{A}_{\textnormal{{det}}}^{(p)}$ that produces iterates by minimizing the sum of a $p$ th order Taylor expansion and an order $p+1$ proximal term:

For $p=1$ , ${\mathsf{REG}_{p,L}}$ is gradient descent with step-size $1/L$ , for $p=2$ it is cubic-regularized Newton’s method , and for general $p$ it is a simplified form of the scheme that Birgin et al. propose.
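The case $p=1$ can be sketched in a few lines. Assuming the proximal term is $\frac{L}{2}\|y-x\|^{2}$ (the scaling that produces step size $1/L$), minimizing the first-order model $f(x)+\langle\nabla f(x),y-x\rangle+\frac{L}{2}\|y-x\|^{2}$ over $y$ gives $y=x-\nabla f(x)/L$. The example below runs this update on the one-dimensional toy function $f(x)=\sin(x)$, chosen for illustration, which has $L=1$ and $f(0)-\inf_{x}f(x)=1$, and counts gradient evaluations until $|f^{\prime}(x)|\leq\epsilon$.

```python
import math


def gradient_descent(grad, x0, L, eps, max_iters=10**6):
    """Iterate x <- x - grad(x)/L; return (x, t) with t the first index where |grad| <= eps."""
    x = x0
    for t in range(1, max_iters + 1):
        g = grad(x)
        if abs(g) <= eps:
            return x, t
        x = x - g / L   # closed-form minimizer of the first-order model
    raise RuntimeError("no eps-stationary point found within max_iters")


# Toy instance (an assumption for illustration): f(x) = sin(x), so f'(x) = cos(x).
L = 1.0
delta_gap = 1.0          # f(0) - inf f = 0 - (-1)
eps = 1e-3
x_final, num_evals = gradient_descent(math.cos, 0.0, L, eps)
print(x_final, num_evals, 2 * L * delta_gap / eps**2)
```

The last printed value is the standard descent-lemma bound $2L\Delta/\epsilon^{2}$: each step with $|f^{\prime}(x)|>\epsilon$ decreases $f$ by at least $\epsilon^{2}/(2L)$, so the count can never exceed it, although on easy instances such as this one it is far smaller.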

Randomized algorithms (and function-informed processes)

A $p$th-order randomized algorithm $\mathsf{A}$ is a distribution on $p$th-order deterministic algorithms. We can write any such algorithm as a deterministic algorithm given access to a random uniform variable on $[0,1]$ (i.e. infinitely many random bits). Thus the algorithm operates on $f$ by drawing $\xi\sim\mathsf{Uni}[0,1]$ (independently of $f$), then producing iterates of the form

Zero-respecting sequences and algorithms

While deterministic and randomized algorithms are the natural collections for which we prove lower bounds, it is useful to define an additional structurally restricted class. This class forms the backbone of our lower bound strategy (Sec. 3), as it is both ‘small’ enough to uniformly underperform on a single function, and ‘large’ enough to imply lower bounds on the natural algorithm classes.

The definition (5) says that $x^{(t)}_{i}=0$ if all partial derivatives involving the $i$ th coordinate of $f$ (up to the $p$ th order) are zero. For $p=1$ , this definition is equivalent to the requirement that for every $t$ and $j\in[d]$ , if $\nabla_{j}f(x^{(s)})=0$ for $s<t$ , then $x^{(t)}_{j}=0$ . The requirement (5) implies that $x^{(1)}=0$ .

In the literature on lower bounds for first-order convex optimization, it is common to assume that methods only query points in the span of the gradients they observe . Our notion of zero-respecting algorithms generalizes this assumption to higher-order methods, but even first-order zero-respecting algorithms are slightly more general. For example, coordinate descent methods are zero-respecting, but they generally do not remain in the span of the gradients.
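For $p=1$ the zero-respecting condition is easy to test mechanically. The sketch below is a minimal checker under the first-order reading given above: coordinate $j$ of $x^{(t)}$ may be nonzero only if some earlier gradient had a nonzero $j$th entry. The two quadratic test functions and the function name `is_zero_respecting` are illustrative choices, not part of the paper, and exact floating-point zeros are assumed.

```python
def is_zero_respecting(iterates, grad):
    """Check the first-order zero-respecting property of a finite iterate sequence."""
    seen = set()   # coordinates with a nonzero partial derivative observed so far
    for x in iterates:
        # x may only be nonzero on coordinates revealed by strictly earlier queries.
        if any(xj != 0.0 and j not in seen for j, xj in enumerate(x)):
            return False
        g = grad(x)
        seen.update(j for j, gj in enumerate(g) if gj != 0.0)
    return True


def grad_separable(x):
    """Gradient of 0.5*(x1 - 1)^2 + 0.5*(x2 - 1)^2."""
    return [x[0] - 1.0, x[1] - 1.0]


def grad_chain(x):
    """Gradient of 0.5*(x1 - 1)^2 + 0.5*(x1 - x2)^2."""
    return [(x[0] - 1.0) + (x[0] - x[1]), x[1] - x[0]]


# Coordinate descent on the separable function: (1, 0) is not a multiple of the
# only observed gradient (-1, -1), yet the sequence is zero-respecting.
coordinate_descent = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
print(is_zero_respecting(coordinate_descent, grad_separable))

# On the chain function the gradient at 0 is (-1, 0), so touching x2 first is not allowed.
print(is_zero_respecting([[0.0, 0.0], [0.0, 1.0]], grad_chain))
```

The first sequence illustrates the remark above: it leaves the span of the observed gradients while still respecting zeros.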

3 Complexity measures

To measure the performance of algorithm $\mathsf{A}$ on function $f$ , we evaluate the iterates it produces from $f$ , and with mild abuse of notation, we define

as the complexity of $\mathsf{A}$ on $f$ . With this setup, we define the complexity of algorithm class $\mathcal{A}$ on function class $\mathcal{F}$ as

Many algorithms guarantee “dimension independent” convergence and thus provide upper bounds for the quantity (7). A careful tracing of constants in the analysis of Birgin et al. implies that the generalized regularization scheme ${\mathsf{REG}_{p,L}}$ defined by the recursion (3) guarantees

While definition (7) is our primary notion of complexity, our proofs provide bounds on smaller quantities than (7) that also carry meaning. For zero-respecting algorithms, we exhibit a single function $f$ and bound $\inf_{\mathsf{A}\in\mathcal{A}_{\textnormal{{zr}}}}\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f\big{)}$ from below, in effect interchanging the $\inf$ and $\sup$ in (7). This implies that all zero-respecting algorithms share a common vulnerability. For randomized algorithms, we exhibit a distribution $P$ supported on functions of a fixed dimension $d$ , and we lower bound the average $\inf_{\mathsf{A}\in\mathcal{A}_{\textnormal{{rand}}}}\int\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f\big{)}dP(f)$ , bounding the distributional complexity , which is never greater than worst-case complexity (and is equal for randomized and deterministic algorithms). Even randomized algorithms share a common vulnerability: functions drawn from $P$ .
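The effect of interchanging $\inf$ and $\sup$ can be seen on a finite toy table. In the sketch below, `T[a][f]` stands for the complexity of a hypothetical algorithm `a` on a hypothetical function `f`; the entries are invented purely for illustration. The single-function quantity $\sup_{f}\inf_{\mathsf{A}}$ never exceeds the worst-case quantity $\inf_{\mathsf{A}}\sup_{f}$, so a lower bound on the former is the stronger statement.

```python
def worst_case_complexity(T):
    """inf over algorithms (rows) of sup over functions (columns)."""
    return min(max(row) for row in T)


def single_function_bound(T):
    """sup over functions (columns) of inf over algorithms (rows)."""
    n_funcs = len(T[0])
    return max(min(row[j] for row in T) for j in range(n_funcs))


# Made-up complexities: 3 algorithms (rows) by 3 functions (columns).
T = [
    [3, 9, 4],
    [8, 2, 5],
    [6, 7, 1],
]
print(single_function_bound(T), worst_case_complexity(T))
```

In this table every algorithm has some function on which it is slow, but no single function is slow for all algorithms; exhibiting one function that is hard for the whole class, as the paper does for zero-respecting algorithms, rules out this situation.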

Anatomy of a lower bound

In this section we present a generic approach to proving lower bounds for optimization algorithms. The basic techniques we use are well-known and applied extensively in the literature on lower bounds for convex optimization . However, here we generalize and abstract away these techniques, showing how they apply to high-order methods, non-convex functions, and various optimization goals (e.g. $\epsilon$ -stationarity, $\epsilon$ -optimality).

Nesterov [37, Chapter 2.1.2] proves lower bounds for smooth convex optimization problems using the “chain-like” quadratic function

which he calls the “worst function in the world.” The important property of $f$ is that for every $i\in[d]$, $\nabla_{i}f(x)=0$ whenever $x_{i-1}=x_{i}=x_{i+1}=0$ (with $x_{0}:=1$ and $x_{d+1}:=0$). Thus, if we “know” only the first $t-1$ coordinates of $f$, i.e. are able to query only vectors $x$ such that $x_{t}=x_{t+1}=\cdots=x_{d}=0$, then any $x$ we query satisfies $\nabla_{s}f(x)=0$ for $s>t$; we only “discover” a single new coordinate $t$. We generalize this chain structure to higher-order derivatives as follows.
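The one-coordinate-per-query behaviour is easy to observe numerically. The displayed formula for the chain function does not survive in this copy, so the sketch below assumes the standard form $f(x)=\frac{1}{2}(x_{1}-1)^{2}+\frac{1}{2}\sum_{i=1}^{d-1}(x_{i}-x_{i+1})^{2}$, which is consistent with the gradient property stated above. It runs gradient descent from the origin (a zero-respecting method) and records how many coordinates are nonzero at each iterate.

```python
def chain_grad(x):
    """Gradient of the assumed chain function 0.5*sum_{i=1}^{d} (x_{i-1} - x_i)^2 with x_0 := 1."""
    d = len(x)
    padded = [1.0] + list(x)              # padded[i] = x_i, with x_0 := 1
    g = []
    for i in range(1, d + 1):
        gi = padded[i] - padded[i - 1]    # from the term coupling x_{i-1} and x_i
        if i < d:
            gi += padded[i] - padded[i + 1]   # from the term coupling x_i and x_{i+1}
        g.append(gi)
    return g


def support_sizes(d, steps, step_size=0.25):
    """Number of nonzero coordinates of each gradient-descent iterate, starting at 0."""
    x = [0.0] * d
    sizes = []
    for _ in range(steps):
        sizes.append(sum(1 for v in x if v != 0.0))
        g = chain_grad(x)
        x = [xi - step_size * gi for xi, gi in zip(x, g)]
    return sizes


print(support_sizes(d=10, steps=6))
```

Each query reveals at most one new coordinate, so the $t$th iterate has at most $t-1$ nonzero entries; the step size $0.25$ is an arbitrary choice and any other value gives the same support pattern bound.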

2 A lower bound strategy

The preceding discussion shows that zero-respecting algorithms take many iterations to “discover” all the coordinates of a zero-chain. In the following observation, we formalize how finding a suitable zero-chain provides a lower bound on the performance of zero-respecting algorithms.

$f$ belongs to the function class, i.e. $f\in\mathcal{F}$ , and

$\left\|{\nabla f(x)}\right\|>\epsilon$ for every $x$ such that $x_{T}=0$ (this property adapts readily to lower bounds on other termination criteria, e.g. by requiring $f(x)-\inf_{y}f(y)>\epsilon$ for every $x$ such that $x_{T}=0$),

then $\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{zr}}}^{(p)},\mathcal{F}\big{)}\geq\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{zr}}}^{(p)},\{f\}\big{)}>T$ .

3 From deterministic to zero-respecting algorithms

Zero-chains allow us to generate strong lower bounds for zero-respecting algorithms. The following reduction shows that these lower bounds are valid for deterministic algorithms as well.

We also give a variant of Proposition 1 that is tailored to lower bounds constructed by means of Observation 2 and allows explicit accounting of dimensionality.

where $f_{U}(z):=f(U^{\top}z)$ and $\mathsf{O}(d+T,d)$ is the set of $(d+T)\times d$ orthogonal matrices, so that $\{f_{U}\mid U\in\mathsf{O}(d+T,d)\}$ contains only functions with domain of dimension $d+T$.

giving Proposition 1; Proposition 2 follows similarly, and for it we may take $d^{\prime}=d+T$ .

The adversarial rotation argument that yields Propositions 1 and 2 is more or less apparent in the proofs of previous lower bounds in convex optimization for deterministic algorithms. We believe it is instructive to separate the proof of lower bounds on $\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{zr}}},\mathcal{F}\big{)}$ and the reduction from $\mathcal{A}_{\textnormal{{det}}}$ to $\mathcal{A}_{\textnormal{{zr}}}$ , as the latter holds in great generality. Indeed, Propositions 1 and 2 hold for any complexity measure $\mathsf{T}_{\epsilon}\big{(}\cdot,\cdot\big{)}$ that satisfies

4 Randomized algorithms

Propositions 1 and 2 do not apply to randomized algorithms, as they require the adversary (maximizing choice of $f$) to simulate the action of $\mathsf{A}$ on $f$. To handle randomized algorithms, we strengthen the notion of a zero-chain as follows.

A robust zero-chain is also an “ordinary” zero-chain. In Section 5 we replace the adversarial rotation $U$ of § 3.3 with an orthogonal matrix drawn uniformly at random, and consider the random function $f_{U}(x)=f(U^{\top}x)$, where $f$ is a robust zero-chain. We adapt a lemma by Woodworth and Srebro, and use it to show that for every $\mathsf{A}\in\mathcal{A}_{\textnormal{{rand}}}$, $\mathsf{A}[f_{U}]$ satisfies an approximate form of Observation 1 (w.h.p.) whenever the iterates $\mathsf{A}[f_{U}]$ have bounded norm. With further modification of $f_{U}$ to handle unbounded iterates, our zero-chain strategy yields a strong distributional complexity lower bound on $\mathcal{A}_{\textnormal{{rand}}}$.

Lower bounds for zero-respecting and deterministic algorithms

Our construction, illustrated in Figure 1, has two key properties. First, $\bar{f}_{T}$ is a zero-chain (Observation 3 in the sequel). Second, as we show in Lemma 2, $\|{\nabla\bar{f}_{T}(x)}\|$ is large unless $|x_{i}|\geq 1$ for every $i\in[T]$. These properties make it hard for any zero-respecting method to find a stationary point of scaled versions of $\bar{f}_{T}$, and coupled with Proposition 1, this gives a lower bound for deterministic algorithms.

Before turning to the main theorem of this section, we catalogue the important properties of the functions $\Psi$ , $\Phi$ and $\bar{f}_{T}$ .

The functions $\Psi$ and $\Phi$ satisfy the following.

For all $x\geq 1$ and $|y|<1$ , $\Psi(x)\Phi^{\prime}(y)>1$ .

The functions and derivatives $\Psi,\Psi^{\prime},\Phi$ and $\Phi^{\prime}$ are non-negative and bounded, with

We prove Lemma 1 in Appendix B.1. The remainder of our development relies on $\Psi$ and $\Phi$ only through Lemma 1. Therefore, the precise choice of $\Psi,\Phi$ is not particularly special; any two functions with properties similar to Lemma 1 will yield similar lower bounds.
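The properties in Lemma 1 can be spot-checked numerically. The definitions of $\Psi$ and $\Phi$ do not survive in this copy, so the sketch below assumes the forms $\Psi(x)=\exp\big(1-\frac{1}{(2x-1)^{2}}\big)$ for $x>1/2$ (and $0$ otherwise) and $\Phi(x)=\sqrt{e}\int_{-\infty}^{x}e^{-t^{2}/2}\,dt$. These are consistent with every property quoted in the text, but should be treated as an assumption. The check covers $\Psi(x)=0$ for $x\leq 1/2$ and $\Psi(x)\Phi^{\prime}(y)>1$ on a grid of $x\geq 1$, $|y|<1$.

```python
import math


def psi(x):
    """Assumed form: 0 for x <= 1/2, exp(1 - 1/(2x - 1)^2) otherwise."""
    if x <= 0.5:
        return 0.0
    return math.exp(1.0 - 1.0 / (2.0 * x - 1.0) ** 2)


def phi(x):
    """Assumed form: sqrt(e) * integral_{-inf}^{x} exp(-t^2/2) dt, written via erf."""
    return math.sqrt(2.0 * math.pi * math.e) * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def phi_prime(x):
    """Derivative of phi: sqrt(e) * exp(-x^2/2)."""
    return math.exp(0.5 - 0.5 * x * x)


def min_product_on_grid():
    """Smallest value of psi(x) * phi_prime(y) over a grid with x >= 1 and |y| < 1."""
    xs = [1.0 + 0.1 * k for k in range(21)]
    ys = [-0.99 + 0.01 * k for k in range(199)]
    return min(psi(x) * phi_prime(y) for x in xs for y in ys)


print(psi(0.5), psi(1.0), min_product_on_grid())
```

A grid check is of course not a proof; it only confirms that the assumed $\Psi$ and $\Phi$ behave as Lemma 1 requires at the sampled points.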

The key consequence of Lemma 1.i is that the function $\bar{f}_{T}$ is a robust zero-chain (see Definition 4) and consequently also a zero-chain (Definition 3):

For any $j>1$ , if $|x_{j-1}|,|x_{j}|<1/2$ then $\bar{f}_{T}(y)=\bar{f}_{T}(y_{1},\ldots,y_{j-1},0,y_{j+1},\ldots,y_{T})$ for all $y$ in a neighborhood of $x$ .

Applying Observation 3 for $j=i+1,\ldots,T$ gives that $\bar{f}_{T}$ is a robust zero-chain by Definition 4. Taking derivatives of $\bar{f}_{T}(x_{1},\ldots,x_{i},0,\ldots,0)$ with respect to $x_{j}$ , $j>i$ , shows that $\bar{f}_{T}$ is also a zero-chain by Definition 3. Thus, Observation 1 shows that any zero-respecting algorithm operating on $\bar{f}_{T}$ requires $T+1$ iterations to find a point where $x_{T}\neq 0$ .

Next, we establish the “large gradient property” that $\nabla\bar{f}_{T}(x)$ must be large if any coordinate of $x$ is near zero.

If $|x_{i}|<1$ for any $i\leq T$ , then there exists $j\leq i$ such that $|x_{j}|<1$ and

We take $j\leq i$ to be the smallest $j$ for which $|x_{j}|<1$ , so that $|x_{j-1}|\geq 1$ (where we use the shorthand $x_{0}\equiv 1$ ). Therefore, we have

In the chain of inequalities, inequality $(i)$ follows because $\Psi^{\prime}(x)\Phi(y)\geq 0$ for every $x,y$; equality $(ii)$ follows because $\Psi(x)=0$ for $x\leq 1/2$, while inequality $(iii)$ follows from Lemma 1.ii and the pairing of $|x_{j}|<1$ and $|x_{j-1}|\geq 1$. ∎
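The large-gradient property can also be tested on random points. The sketch below again assumes the forms of $\Psi$ and $\Phi$ used in the published construction, together with $\bar{f}_{T}(x)=\sum_{i=1}^{T}\big[\Psi(-x_{i-1})\Phi(-x_{i})-\Psi(x_{i-1})\Phi(x_{i})\big]$ with $x_{0}\equiv 1$. None of these formulas survives in this copy, so they are assumptions. For each sampled $x$ it takes the smallest $j$ with $|x_{j}|<1$, as in the proof, and checks that the $j$th partial derivative exceeds $1$ in absolute value.

```python
import math
import random


def psi(x):
    return 0.0 if x <= 0.5 else math.exp(1.0 - 1.0 / (2.0 * x - 1.0) ** 2)


def psi_prime(x):
    # d/dx exp(1 - (2x-1)^(-2)) = psi(x) * 4 / (2x-1)^3 for x > 1/2.
    return 0.0 if x <= 0.5 else psi(x) * 4.0 / (2.0 * x - 1.0) ** 3


def phi(x):
    return math.sqrt(2.0 * math.pi * math.e) * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def phi_prime(x):
    return math.exp(0.5 - 0.5 * x * x)


def grad_f_bar(x):
    """Gradient of the assumed hard function, using the convention x_0 := 1."""
    T = len(x)
    xp = [1.0] + list(x)                 # xp[i] = x_i
    g = []
    for j in range(1, T + 1):
        gj = -psi(-xp[j - 1]) * phi_prime(-xp[j]) - psi(xp[j - 1]) * phi_prime(xp[j])
        if j < T:
            gj += -psi_prime(-xp[j]) * phi(-xp[j + 1]) - psi_prime(xp[j]) * phi(xp[j + 1])
        g.append(gj)
    return g


def first_small_coordinate(x):
    """Index (0-based) of the first coordinate with |x_j| < 1, or None."""
    for j, v in enumerate(x):
        if abs(v) < 1.0:
            return j
    return None


rng = random.Random(0)
x = [rng.uniform(-2.0, 2.0) for _ in range(6)]
j = first_small_coordinate(x)
print(j, grad_f_bar(x))
```

At the origin only the first partial derivative is nonzero, which is the zero-chain behaviour: a zero-respecting method starting at $0$ can initially move only along the first coordinate.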

Finally, we verify that $\bar{f}_{T}$ meets the smoothness and boundedness requirements of the function classes we consider.

The function $\bar{f}_{T}$ satisfies the following.

We have $\bar{f}_{T}(0)-\inf_{x}\bar{f}_{T}(x)\leq 12T$ .

The proof of Lemma 3 is technical, so we defer it to Appendix B.2. In the lemma, Properties i and iii allow us to guarantee that appropriately scaled versions of $\bar{f}_{T}$ are in $\mathcal{F}_{p}(\Delta,L_{p})$. Property ii is necessary for analysis of the randomized construction in Section 5.

2 Lower bounds for zero-respecting and deterministic algorithms

We can now state and prove a lower bound for finding stationary points of $p$ th order smooth functions using full derivative information and zero-respecting algorithms (the class $\mathcal{A}_{\textnormal{{zr}}}$ ). Proposition 1 transforms this bound into one on all deterministic algorithms (the class $\mathcal{A}_{\textnormal{{det}}}$ ).

3 Proof of Theorem 1

Lower bounds for randomized algorithms

With our lower bounds on the complexity of deterministic algorithms established, we turn to the class of all randomized algorithms. We provide strong distributional complexity lower bounds by exhibiting a distribution on functions such that a function drawn from it is “difficult” for any randomized algorithm, with high probability. We do this via the composition of a random orthogonal transformation with the function $\bar{f}_{T}$ defined in (10).

The result of Lemma 4 is identical (to constant factors) to an important result of Woodworth and Srebro [45, Lemma 7], but we must be careful with the sequential conditioning of randomness between the iterates $x^{(t)}$, the random orthogonal $U$, and how much information the sequentially computed derivatives may leak. Because of this additional care, we require a modification of their original proof, which we provide in Section B.3, giving a rough outline here. (In a recent note, Woodworth and Srebro independently provide a revision of their proof that is similar, but not identical, to the one we propose here.) For a fixed $t<T$, assume that $|\langle u^{(j)},x^{(s)}\rangle|<1/2$ holds for every pair $s\leq t$ and $j\in\{s,\ldots,T\}$; we argue that this (roughly) implies that $|\langle u^{(j)},x^{(t+1)}\rangle|<1/2$ for every $j\in\{t+1,\ldots,T\}$ with high probability, completing the induction. When the assumption that $|\langle u^{(j)},x^{(s)}\rangle|<1/2$ holds, the robust zero-chain property of $\bar{f}_{T}$ (Definition 4 and Observation 3) implies that for every $s\leq t$ we have

2 Handling unbounded iterates

The quadratic term in $\hat{f}_{T;U}$ guarantees that all points beyond a certain norm have a large gradient, which prevents the algorithm from trivially making the gradient small by increasing the norm of the iterates. The following lemma captures the hardness of $\hat{f}_{T;U}$ for randomized algorithms.

Let $\delta>0$ , and let $x^{(1)},\ldots,x^{(T)}$ be informed by $\hat{f}_{T;U}$ . If $d\geq 52\cdot 230^{2}\cdot T^{2}\log\frac{2T^{2}}{\delta}$ then, with probability at least $1-\delta$ ,

Therefore, by Lemma 2 with $i=T$ , for each $t$ there exists $j\leq T$ such that

To show that $\|{\nabla\hat{f}_{T;U}(x^{(t)})}\|$ is also large, we consider separately the cases $\|{x^{(t)}}\|\leq R/2$ and $\|{x^{(t)}}\|\geq R/2$ . For the first case, we use $\frac{\partial\rho}{\partial x}(x)=\frac{I-\rho(x)\rho(x)^{\top}/R^{2}}{\sqrt{1+\left\|{x}\right\|^{2}/R^{2}}}$ to write

Therefore, for $\|{y^{(t)}}\|\leq\|{x^{(t)}}\|\leq R/2$ we have

In the second case, $\left\|{x^{(t)}}\right\|\geq R/2$ , we have for any $x$ satisfying $\left\|{x}\right\|\geq R/2$ and $y=\rho(x)$ that

As our lower bounds repose on appropriately scaling the function $\hat{f}_{T;U}$ , it remains to verify that $\hat{f}_{T;U}$ satisfies the few boundedness properties we require. We do so in the following lemma.

The function $\hat{f}_{T;U}$ satisfies the following.

We have $\hat{f}_{T;U}(0)-\inf_{x}\hat{f}_{T;U}(x)\leq 12T$ .

We defer the (computationally involved) proof of this lemma to Section B.4.
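The Jacobian identity for $\rho$ used in the case analysis above can be sanity-checked with finite differences. The definition of $\rho$ is not legible in this copy; the sketch assumes $\rho(x)=x/\sqrt{1+\|x\|^{2}/R^{2}}$, which is the map whose Jacobian matches the stated formula $\frac{I-\rho(x)\rho(x)^{\top}/R^{2}}{\sqrt{1+\|x\|^{2}/R^{2}}}$. It also confirms that $\|\rho(x)\|<R$ even for very large $x$.

```python
import math


def rho(x, R):
    """Assumed form of the norm-compressing map: x / sqrt(1 + ||x||^2 / R^2)."""
    s = math.sqrt(1.0 + sum(v * v for v in x) / R**2)
    return [v / s for v in x]


def rho_jacobian(x, R):
    """Closed-form Jacobian (I - rho rho^T / R^2) / sqrt(1 + ||x||^2 / R^2)."""
    s = math.sqrt(1.0 + sum(v * v for v in x) / R**2)
    r = rho(x, R)
    d = len(x)
    return [[((1.0 if i == j else 0.0) - r[i] * r[j] / R**2) / s for j in range(d)]
            for i in range(d)]


def finite_difference_jacobian(x, R, h=1e-6):
    """Central-difference approximation of the Jacobian of rho."""
    d = len(x)
    J = [[0.0] * d for _ in range(d)]
    for j in range(d):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        rp, rm = rho(xp, R), rho(xm, R)
        for i in range(d):
            J[i][j] = (rp[i] - rm[i]) / (2.0 * h)
    return J


def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


x, R = [0.7, -1.3, 2.1], 2.0
print(max_abs_diff(rho_jacobian(x, R), finite_difference_jacobian(x, R)))
```

The boundedness $\|\rho(x)\|<R$ is what lets the bounded-iterate argument be applied to the compressed points $y^{(t)}=\rho(x^{(t)})$.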

3 Final lower bounds

With Lemmas 5 and 6 in hand, we can state our lower bound for all algorithms, randomized or otherwise, given access to all derivatives of a $\mathcal{C}^{\infty}$ function. Note that our construction also implies an identical lower bound for (slightly) more general algorithms that use any local oracle , meaning that the information the oracle returns about a function $f$ when queried at a point $x$ is identical to that it returns when a function $g$ is queried at $x$ whenever $f(z)=g(z)$ for all $z$ in a neighborhood of $x$ .

implying Theorem 2 for any $\delta\geq 1/2$ . Thus, we exhibit a randomized procedure for finding hard instances for any randomized algorithm that requires no knowledge of the algorithm itself.

Theorem 2 is stronger than Theorem 1 in that it applies to the broad class of all randomized algorithms. Our probabilistic analysis requires that the functions constructed to prove Theorem 2 have dimension scaling proportional to $T^{2}\log(T)$ where $T$ is the lower bound on the number of iterations. Contrast this to Theorem 1, which only requires dimension $2T+1$. A similar gap exists in complexity results for convex optimization. At present, it is unclear if these gaps are fundamental or a consequence of our specific constructions.
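To get a feel for the size of this gap, the sketch below evaluates the sufficient dimension $52\cdot 230^{2}\cdot T^{2}\log\frac{2T^{2}}{\delta}$ from Lemma 5 next to the $2T+1$ needed in the deterministic case. It assumes $\log$ denotes the natural logarithm and uses the lemma's condition as a proxy for the dimension in Theorem 2; the function names are illustrative.

```python
import math


def randomized_dimension(T, delta):
    """Smallest integer d with d >= 52 * 230^2 * T^2 * log(2 T^2 / delta) (natural log assumed)."""
    return math.ceil(52 * 230**2 * T**2 * math.log(2 * T**2 / delta))


def deterministic_dimension(T):
    """Dimension 2T + 1 used by the deterministic construction."""
    return 2 * T + 1


for T in (10, 100, 1000):
    print(T, deterministic_dimension(T), randomized_dimension(T, delta=0.5))
```

The randomized construction is larger both through the constant $52\cdot 230^{2}$ and through the $T^{2}\log T$ growth, compared with linear growth in the deterministic case.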

4 Proof of Theorem 2

Fix $\mathsf{A}\in\mathcal{A}_{\textnormal{{rand}}}$ and let $x^{(1)},x^{(2)},\ldots,x^{(T)}$ be the iterates produced by $\mathsf{A}$ applied on $f_{U}$ . Since $f$ and $\hat{f}_{T;U}$ differ only by scaling, the iterates $x^{(1)}/\sigma,x^{(2)}/\sigma,\ldots,x^{(T)}/\sigma$ are informed by $\hat{f}_{T;U}$ (recall Sec. 2.2), and therefore we may apply Lemma 5 with $\delta=1/2$ and our large enough choice of dimension $d$ to conclude that

It remains to choose $T$ to guarantee that $f_{U}$ belongs to the relevant function class (bounded and smooth) for every orthogonal $U$ . By Lemma 6.ii, $f_{U}$ has $L_{p}$ -Lipschitz continuous $p$ th order derivatives. By Lemma 6.i, we have

Distance-based lower bounds

We have so far considered finding approximate stationary points of smooth functions with bounded sub-optimality at the origin, i.e. $f(0)-\inf_{x}f(x)\leq\Delta$ . In convex optimization, it is common to consider instead functions with bounded distance between the origin and a global minimum. We may consider a similar restriction for non-convex functions; for $p\geq 1$ and positive $L_{p},D$ , let

be the class of $\mathcal{C}^{\infty}$ functions with $L_{p}$ -Lipschitz $p$ th order derivatives satisfying

that is, all global minima have bounded distance to the origin.

In this section we give a lower bound on the complexity of this function class that has the same $\epsilon$ dependence as our bound for the class $\mathcal{F}_{p}(\Delta,L_{p})$. This is in sharp contrast to convex optimization, where distance-bounded functions enjoy significantly better $\epsilon$ dependence than their value-bounded counterparts (see the treatment of convex functions in the companion). Qualitatively, the reason for this difference is that the lack of convexity allows us to “hide” global minima close to the origin that are difficult to find for any algorithm with local function access.

We postpone the construction and proof to Appendix C, and move directly to the final bound.

There exist numerical constants $0<c_{0},c_{1}<\infty$ such that the following lower bound holds. For any $p\geq 1$ , let $D,L_{p}$ , and $\epsilon$ be positive. Then

We remark that a lower-dimensional construction suffices for proving the lower bound for deterministic algorithms, similarly to Theorem 1.

While we do not have a matching upper bound for Theorem 3, we can match its $\epsilon$ dependence in the smaller function class

Conclusion

This work provides the first algorithm-independent and tight lower bounds on the dimension-free complexity of finding stationary points. As a consequence, we have characterized the optimal rates of convergence to $\epsilon$-stationarity, under the assumption of high dimension and an oracle that provides all derivatives. Yet, given the importance of high-dimensional problems, the picture is incomplete: high-order algorithms, even second-order methods, are often impractical in large-scale settings. We address this in the companion, which provides sharper lower bounds for the more restricted class of first-order methods. In the companion we also provide a full conclusion for this paper sequence, discussing in depth the implications and questions that arise from our results.

Acknowledgments

OH was supported by the PACCAR INC fellowship. YC and JCD were partially supported by the SAIL-Toyota Center for AI Research, NSF-CAREER award 1553086, and a Sloan Foundation Fellowship in Mathematics. YC was partially supported by the Stanford Graduate Fellowship and the Numerical Technologies Fellowship.

References

Appendix A Proof of Propositions 1 and 2

The core of the proofs of Propositions 1 and 2 is the following construction.

Before explaining the construction of $\mathsf{Z}_{\mathsf{A}}$ , let us see how its defining property implies the lemma. If $\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f_{U}\big{)}>T_{0}$ , we are done. Otherwise, $\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f_{U}\big{)}\leq T_{0}$ and we have

for every $t\leq T_{0}$ (we set $z^{(i)}=0$ for every $i>T_{0}$ without loss of generality).

Finally, note that the arguments above hold unchanged for $p=\infty$ . ∎

With Lemma 7 in hand, the propositions follow easily.

We may assume that $\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{det}}}^{(p)},\mathcal{F}\big{)}<T_{0}$ for some integer $T_{0}<\infty$ , as otherwise we have $\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{det}}}^{(p)},\mathcal{F}\big{)}=\infty$ and the result holds trivially. For any $\mathsf{A}\in\mathcal{A}_{\textnormal{{det}}}^{(p)}$ and the value $T_{0}$ , we invoke Lemma 7 to construct $\mathsf{Z}_{\mathsf{A}}\in\mathcal{A}_{\textnormal{{zr}}}^{(p)}$ such that $\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f_{U}\big{)}\geq\min\{T_{0},\mathsf{T}_{\epsilon}\big{(}\mathsf{Z}_{\mathsf{A}},f\big{)}\}$ for every $f\in\mathcal{F}$ and some orthogonal matrix $U$ that depends on $f$ and $\mathsf{A}$ . Consequently, we have

where inequality $(i)$ uses that $f_{U}\in\mathcal{F}$ because $\mathcal{F}$ is orthogonally invariant, step $(ii)$ uses $\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f_{U}\big{)}\geq\min\{T_{0},\mathsf{T}_{\epsilon}\big{(}\mathsf{Z}_{\mathsf{A}},f\big{)}\}$ and step $(iii)$ is due to $\mathsf{Z}_{\mathsf{A}}\in\mathcal{A}_{\textnormal{{zr}}}^{(p)}$ by construction. As we chose $T_{0}$ for which $\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{det}}}^{(p)},\mathcal{F}\big{)}<T_{0}$ , the chain of inequalities implies $\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{det}}}^{(p)},\mathcal{F}\big{)}\geq\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{zr}}}^{(p)},\mathcal{F}\big{)}$ , concluding the proof. ∎

For any $\mathsf{A}\in\mathcal{A}_{\textnormal{{det}}}^{(p)}$ , we invoke Lemma 7 with $T_{0}=T$ to obtain $\mathsf{Z}_{\mathsf{A}}\in\mathcal{A}_{\textnormal{{zr}}}^{(p)}$ and an orthogonal matrix $U^{\prime}$ (dependent on $f$ and $\mathsf{A}$ ) for which

where the last equality is due to $\inf_{\mathsf{B}\in\mathcal{A}_{\textnormal{{zr}}}^{(p)}}\mathsf{T}_{\epsilon}\big{(}\mathsf{B},f\big{)}=\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{zr}}}^{(p)},\{f\}\big{)}\geq T$ . Since $f_{U^{\prime}}\in\{f_{U}\mid U\in\mathsf{O}(d+T,d)\}$ , we have

and taking the infimum over $\mathsf{A}\in\mathcal{A}_{\textnormal{{det}}}^{(p)}$ concludes the proof. ∎

Appendix B Technical Results

Each of the statements in the lemma is immediate except for part iii. To see this part, we require a few further calculations. We begin by providing bounds on the derivatives of $\Phi(x)=e^{\frac{1}{2}}\int_{-\infty}^{x}e^{-\frac{1}{2}t^{2}}dt$ . To avoid annoyances with scaling factors, we define $\phi(t)=e^{-\frac{1}{2}t^{2}}$ .

We prove the result by induction. We have $\phi^{\prime}(t)=-te^{-\frac{1}{2}t^{2}}$ , so that the base case of the induction is satisfied. Now, assume for our induction that

where $|c_{i}^{(k)}|\leq 2^{k}(\max\{i,1\})^{k}$ . Then taking derivatives, we have

where $c_{i}^{(k+1)}=(i+1)c_{i+1}^{(k)}-c_{i-1}^{(k)}$ (and we treat $c_{k+1}^{(k)}=0$ ) and $|c_{k+1}^{(k+1)}|=1$ . With the induction hypothesis that $|c_{i}^{(k)}|\leq(2\max\{i,1\})^{k}$ , we obtain

With this result, we find that for any $k\geq 1$ ,

The function $\log(x^{i}\phi(x))=i\log x-\frac{1}{2}x^{2}$ is maximized at $x=\sqrt{i}$ , so that $x^{i}\phi(x)\leq\exp(\frac{i}{2}\log\frac{i}{e})$ . We thus obtain the numerically verifiable upper bound
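As a numerical sanity check of the maximization step above, the following Python sketch compares a brute-force grid maximum of $x^{i}\phi(x)$ with the closed-form value $\exp(\frac{i}{2}\log\frac{i}{e})$ attained at $x=\sqrt{i}$ . The grid range, the resolution, and the helper names are our own illustrative choices and play no role in the proof.

```python
import math


def phi(x: float) -> float:
    """Unnormalized Gaussian phi(x) = exp(-x^2 / 2)."""
    return math.exp(-0.5 * x * x)


def claimed_max(i: int) -> float:
    """Closed-form bound exp((i/2) * log(i/e)), the value at x = sqrt(i)."""
    return math.exp(0.5 * i * math.log(i / math.e))


def grid_max(i: int, hi: float = 12.0, steps: int = 20_000) -> float:
    """Largest value of x^i * phi(x) on a uniform grid over (0, hi]."""
    best = 0.0
    for s in range(1, steps + 1):
        x = hi * s / steps
        best = max(best, x ** i * phi(x))
    return best


for i in range(1, 9):
    value_at_sqrt_i = math.sqrt(i) ** i * phi(math.sqrt(i))
    print(i, grid_max(i), value_at_sqrt_i, claimed_max(i))
```

For each $i$ the grid maximum should sit just below the closed-form value, which coincides with the function value at $x=\sqrt{i}$ up to rounding.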

Now, we turn to considering the function $\Psi(x)$ . We assume w.l.o.g. that $x>\frac{1}{2}$ , as otherwise $\Psi^{(k)}(x)=0$ for all $k$ . Recall $\Psi(x)=\exp\left(1-\frac{1}{(2x-1)^{2}}\right)$ for $x>\frac{1}{2}$ . We have the following lemma regarding its derivatives.

We provide the proof by induction over $k$ . For $k=1$ , we have that

which yields the base case of the induction. Now, assume that for some $k$ , we have

where $c_{k+1}^{(k)}=0$ and $c_{0}^{(k)}=0$ . Defining $c_{1}^{(1)}=4$ and $c_{i}^{(k+1)}=4c_{i-1}^{(k)}-2(k+2i)c_{i}^{(k)}$ for $i>1$ , then, under the inductive hypothesis that $|c_{i}^{(k)}|\leq 6^{k}(2i+k)^{k}$ , we have

As in the derivation immediately following Lemma 8, by replacing $t=\frac{1}{2x-1}$ , we have that $t^{k+2i}e^{-t^{2}}$ is maximized by $t=\sqrt{(k+2i)/2}$ , so that

which yields the numerically verifiable upper bound
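The coefficient recursion used in the induction above can be checked mechanically. The sketch below iterates $c_{i}^{(k+1)}=4c_{i-1}^{(k)}-2(k+2i)c_{i}^{(k)}$ from $c_{1}^{(1)}=4$ and tests the bound $|c_{i}^{(k)}|\leq 6^{k}(2i+k)^{k}$ in exact integer arithmetic. It also assumes the expansion $\Psi^{(k)}(x)=\Psi(x)\sum_{i=1}^{k}c_{i}^{(k)}(2x-1)^{-(k+2i)}$ , which is our reading of the induction hypothesis (it matches the base case $\Psi^{\prime}(x)=4(2x-1)^{-3}\Psi(x)$ ). The function names are hypothetical.

```python
import math


def psi(x: float) -> float:
    """Psi(x) = exp(1 - 1/(2x-1)^2) for x > 1/2, and 0 otherwise."""
    if x <= 0.5:
        return 0.0
    return math.exp(1.0 - 1.0 / (2.0 * x - 1.0) ** 2)


def psi_coefficients(k_max: int) -> dict:
    """Return {k: [c_0^(k), ..., c_k^(k)]} for k = 1..k_max via the recursion."""
    coeffs = {1: [0, 4]}  # c_0^(1) = 0, c_1^(1) = 4
    for k in range(1, k_max):
        prev = coeffs[k] + [0]  # treat c_{k+1}^(k) as 0
        nxt = [0] * (k + 2)     # c_0^(k+1) stays 0
        for i in range(1, k + 2):
            nxt[i] = 4 * prev[i - 1] - 2 * (k + 2 * i) * prev[i]
        coeffs[k + 1] = nxt
    return coeffs


def psi_derivative(k: int, x: float, coeffs: dict) -> float:
    """k-th derivative of Psi at x > 1/2 under the assumed expansion."""
    u = 2.0 * x - 1.0
    return psi(x) * sum(c * u ** (-(k + 2 * i)) for i, c in enumerate(coeffs[k]))


coeffs = psi_coefficients(10)
bound_holds = all(
    abs(c) <= 6 ** k * (2 * i + k) ** k
    for k, row in coeffs.items()
    for i, c in enumerate(row)
)
print(coeffs[2], bound_holds)

# Cross-check the k = 2 formula against a central difference of the k = 1 formula.
h = 1e-6
finite_diff = (psi_derivative(1, 0.9 + h, coeffs) - psi_derivative(1, 0.9 - h, coeffs)) / (2 * h)
print(finite_diff, psi_derivative(2, 0.9, coeffs))
```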

B.2 Proof of Lemma 3

Part i follows because $\bar{f}_{T}(0)<0$ and, since $0\leq\Psi(x)\leq e$ and $0\leq\Phi(x)\leq\sqrt{2\pi e}$ ,

Part ii follows additionally from $\Psi(x)=0$ on $x<1/2$ , $0\leq\Psi^{\prime}(x)\leq\sqrt{54e^{-1}}$ and $0\leq\Phi^{\prime}(x)\leq\sqrt{e}$ , which when substituted into

for every $x$ and $j$ . Consequently, $\left\|{\nabla\bar{f}_{T}(x)}\right\|\leq 23\sqrt{T}$ .
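The scalar bounds on $\Psi$ , $\Phi$ and their first derivatives quoted above are easy to confirm numerically. The following sketch evaluates the four functions on a grid, using the closed forms $\Psi^{\prime}(x)=4(2x-1)^{-3}\Psi(x)$ and $\Phi^{\prime}(x)=\sqrt{e}\,e^{-x^{2}/2}$ and expressing $\Phi$ through the error function. The grid is an arbitrary choice. The final combination of constants is only our guess at how the factor $23$ arises, since it depends on the form of $\bar{f}_{T}$ , which is not restated here.

```python
import math


def Psi(x: float) -> float:
    """Psi(x) = exp(1 - 1/(2x-1)^2) for x > 1/2, and 0 otherwise."""
    if x <= 0.5:
        return 0.0
    return math.exp(1.0 - 1.0 / (2.0 * x - 1.0) ** 2)


def dPsi(x: float) -> float:
    """First derivative of Psi."""
    if x <= 0.5:
        return 0.0
    return 4.0 / (2.0 * x - 1.0) ** 3 * Psi(x)


def Phi(x: float) -> float:
    """Phi(x) = sqrt(e) * integral_{-inf}^{x} exp(-t^2/2) dt, written via erf."""
    return math.sqrt(math.e) * math.sqrt(2.0 * math.pi) * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def dPhi(x: float) -> float:
    """First derivative of Phi."""
    return math.sqrt(math.e) * math.exp(-0.5 * x * x)


xs = [-6.0 + 0.001 * s for s in range(12001)]  # grid on [-6, 6]
max_Psi = max(Psi(x) for x in xs)
max_dPsi = max(dPsi(x) for x in xs)
max_Phi = max(Phi(x) for x in xs)
max_dPhi = max(dPhi(x) for x in xs)

print(max_Psi, math.e)
print(max_dPsi, math.sqrt(54.0 / math.e))
print(max_Phi, math.sqrt(2.0 * math.pi * math.e))
print(max_dPhi, math.sqrt(math.e))

# Assumed combination: one (Psi' * Phi) term plus one (Psi * Phi') term.
combined = math.sqrt(54.0 / math.e) * math.sqrt(2.0 * math.pi * math.e) + math.e * math.sqrt(math.e)
print(combined)
```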

Examining $\bar{f}_{T}$ , we see that $\partial_{i_{1}}\cdots\partial_{i_{p+1}}\bar{f}_{T}$ is non-zero if and only if $\left|i_{j}-i_{k}\right|\leq 1$ for every $j,k\in\left[p+1\right]$ . Consequently, we can rearrange the above summation as

where we take $v_{0}:=0$ and $v_{T+1}:=0$ . A brief calculation shows that

where we have used $|\sum_{i=1}^{T}v_{i+\delta_{1}}\cdots v_{i+\delta_{p}}v_{i}|\leq 1$ for every $\delta\in\{0,1\}^{p}\cup\{0,-1\}^{p}$ . To see this last claim is true, recall that $v$ is a unit vector and note that

If $\delta=0$ then $|\sum_{i=1}^{T}v_{i+\delta_{1}}\cdots v_{i+\delta_{p}}v_{i}|=|\sum_{i=1}^{T}v_{i}^{p+1}|\leq\sum_{i=1}^{T}v_{i}^{2}=1$ . Otherwise, letting $1\leq\sum_{j=1}^{p}|\delta_{j}|=n\leq p$ , the Cauchy-Schwarz inequality implies

where $s=-1$ or $1$ . This gives the result. ∎
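The claim $|\sum_{i=1}^{T}v_{i+\delta_{1}}\cdots v_{i+\delta_{p}}v_{i}|\leq 1$ for unit vectors $v$ can be probed empirically. The sketch below draws random unit vectors (with a fixed seed, an arbitrary choice), enumerates every $\delta\in\{0,1\}^{p}\cup\{0,-1\}^{p}$ for small $p$ , and records the largest absolute value observed. It illustrates the inequality on samples and does not replace the argument above.

```python
import itertools
import math
import random


def shifted_product_sum(v, delta):
    """Compute sum_{i=1}^{T} v_{i+delta_1} ... v_{i+delta_p} v_i with v_0 = v_{T+1} = 0."""
    T = len(v)
    padded = [0.0] + list(v) + [0.0]  # padded[i] equals v_i for i = 0, ..., T+1
    total = 0.0
    for i in range(1, T + 1):
        term = padded[i]
        for d in delta:
            term *= padded[i + d]
        total += term
    return total


def random_unit_vector(T, rng):
    """Normalize a standard Gaussian sample to unit Euclidean norm."""
    g = [rng.gauss(0.0, 1.0) for _ in range(T)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]


rng = random.Random(0)
worst = 0.0
for _ in range(200):
    v = random_unit_vector(6, rng)
    for p in (1, 2, 3):
        deltas = set(itertools.product((0, 1), repeat=p)) | set(itertools.product((0, -1), repeat=p))
        for delta in deltas:
            worst = max(worst, abs(shifted_product_sum(v, delta)))
print(worst)
```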

B.3 Proof of Lemma 4

The following linear-algebraic result justifies the definition (21) of $G_{t}$ .

For all $t\leq T$ , $G_{\leq t}$ implies $|\langle u^{(j)},x^{(s)}\rangle|<1/2$ for every $s\in\{1,\ldots,t\}$ and every $j\in\{s,\ldots,T\}$ .

First, notice that since $G_{\leq t}$ implies $G_{\leq s}$ for every $s\leq t$ , it suffices to show that $G_{\leq t}$ implies $|\langle u^{(j)},x^{(t)}\rangle|<1/2$ for every $j\in\{t,\ldots,T\}$ . We will in fact prove a stronger statement:

Since $G_{t}$ holds, its definition (21) implies $|\langle u^{(j)},P_{t-1}^{\perp}x^{(t)}\rangle|\leq\alpha\left\|{P_{t-1}^{\perp}x^{(t)}}\right\|\leq\alpha\left\|{x^{(t)}}\right\|$ . Moreover, by Cauchy-Schwarz and the implication (22), we have $|\langle u^{(j)},P_{t-1}x^{(t)}\rangle|\leq\left\|{P_{t-1}u^{(j)}}\right\|\left\|{x^{(t)}}\right\|\leq\sqrt{2\alpha^{2}(t-1)}\left\|{x^{(t)}}\right\|$ . Combining the two bounds, we obtain the result of the lemma,

where we have used $\left\|{x^{(t)}}\right\|\leq R$ and $\alpha=1/(5R\sqrt{T})$ .

We prove bound (22) by induction. The basis of the induction, $t=1$ , is trivial, as $P_{0}=0$ . We shall assume (22) holds for $s\in\{1,\ldots,t-1\}$ and show that it consequently holds for $s=t$ as well. We may apply the Gram-Schmidt procedure on the sequence $x^{(1)},u^{(1)},\ldots,x^{(t-1)},u^{(t-1)}$ to write

where $\hat{P}_{k}$ is the projection to the span of $\{x^{(1)},u^{(1)},\ldots,x^{(k)},u^{(k)},x^{(k+1)}\}$ ,

where the equalities hold by $\left\langle u^{(i)},u^{(j)}\right\rangle=0$ , $\hat{P}_{i-1}^{\perp}=I-\hat{P}_{i-1}$ , and the definition of $\hat{P}_{i-1}$ .
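The orthogonalization step rests on a standard fact: if $q_{1},\ldots,q_{m}$ is an orthonormal basis of a subspace, then the squared norm of the projection of a vector $u$ onto that subspace equals $\sum_{i}\langle q_{i},u\rangle^{2}$ . The following minimal sketch illustrates this with toy vectors of our own choosing; they are not the iterates or the random directions of the proof.

```python
import math


def dot(a, b):
    """Euclidean inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def gram_schmidt(vectors, tol=1e-12):
    """Orthonormalize the vectors in order, skipping (near-)dependent ones."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = dot(q, w)
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def projection_norm_sq(basis, u):
    """Squared norm of the projection of u onto span(basis), basis orthonormal."""
    return sum(dot(q, u) ** 2 for q in basis)


# Toy example: the two vectors span the first two coordinates of R^3.
basis = gram_schmidt([[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
print(projection_norm_sq(basis, [3.0, 4.0, 5.0]))  # about 3^2 + 4^2 = 25
```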

The $P_{i}$ matrices are projections, so ${P}_{i-1}^{2}={P}_{i-1}$ , and Cauchy-Schwarz and the induction hypothesis imply

Moreover, the event $G_{i}$ implies $\left|\langle u^{(i)},P_{i-1}^{\perp}x^{(i)}\rangle\langle u^{(j)},P_{i-1}^{\perp}x^{(i)}\rangle\right|\leq\alpha^{2}\left\|{P_{i-1}^{\perp}x^{(i)}}\right\|^{2}$ , so

where the first equality uses $(P_{i-1}^{\perp})^{2}=P_{i-1}^{\perp}$ , the second the definition of $\hat{P}_{i-1}$ , and the inequality uses $\langle u^{(j)},P_{i-1}^{\perp}x^{(i)}\rangle\leq\alpha\|{P_{i-1}^{\perp}x^{(i)}}\|$ and $\|{P_{i-1}u^{(j)}}\|^{2}\leq 2\alpha^{2}\left(i-1\right)$ .

Combining the observations (24a) and (24b), we can bound each summand in the second summation in (23). Since the summands in the first summation are bounded by $\alpha^{2}$ by definition (21) of $G_{i}$ , we obtain

where $U_{(<t)}$ is shorthand for $u^{(1)},\ldots,u^{(t-1)}$ and $\xi$ is the random variable generating $x^{(1)},\ldots,x^{(T)}$ .

In the following lemma, we state formally that conditioned on $G_{<i}$ , the iterate $x^{(i)}$ depends on $U$ only through its first $(i-1)$ columns.

For every $i\leq T$ , there exist measurable functions $\mathsf{A}^{(i)}_{+}$ and $\mathsf{A}^{(i)}_{-}$ such that

Let $t\leq T$ , and $j\in\{t,\ldots,T\}$ . Then conditioned on $\xi,U_{(<t)}$ and $G_{<t}$ , the vector $\frac{P_{t-1}^{\perp}u^{(j)}}{\left\|{P_{t-1}^{\perp}u^{(j)}}\right\|}$ is uniformly distributed on the unit sphere in the subspace to which $P_{t-1}^{\perp}$ projects.

This lemma is subtle. The vectors $u^{(j)}$ , $j\geq t$ , conditioned on $U_{(<t)}$ , are certainly uniformly distributed on the unit sphere in the subspace orthogonal to $U_{(<t)}$ . However, the additional conditioning on $G_{<t}$ requires careful handling. Throughout the proof we fix $t\leq T$ and $j\in\{t,\ldots,T\}$ . We begin by noting that by (22), $G_{<t}$ implies

Therefore, when $G_{<t}$ holds we have $P_{t-1}^{\perp}u^{(j)}\neq 0$ so ${P_{t-1}^{\perp}u^{(j)}}/{\left\|{P_{t-1}^{\perp}u^{(j)}}\right\|}$ is well-defined.

To establish our result, we will show that the density of $U_{(\geq t)}=[u^{(t)},\ldots,u^{(T)}]$ conditioned on $\xi,U_{(<t)},G_{<t}$ is invariant to rotations that preserve the span of $x^{(1)},u^{(1)},\ldots,x^{(t-1)},u^{(t-1)}$ . More formally, let $p_{\geq t}$ denote the density of $U_{(\geq t)}$ conditional on $\xi,U_{(<t)}$ and $G_{<t}$ . We wish to show that

Throughout, we let $Z$ denote such a rotation. Letting $p_{\xi,U}$ and $p_{U}$ denote the densities of $(\xi,U)$ and $U$ , respectively, we have

where the first equality holds by the definition of conditional probability and the second by the independence of $\xi$ and $U$ . We have $ZU_{(<t)}=U_{(<t)}$ and therefore, by the invariance of $U$ to rotations, $p_{U}([U_{(<t)},ZU_{(\geq t)}])=p_{U}(ZU)=p_{U}(U)$ . Hence, replacing $U$ with $ZU$ in the above display yields

Marginalizing the density (28) to obtain a density for $u^{(j)}$ and recalling that $P_{t-1}^{\perp}$ is measurable with respect to $\xi,U_{(<t)},G_{<t}$ , we conclude that, conditioned on $\xi,U_{(<t)},G_{<t}$ , the random variable $\frac{P_{t-1}^{\perp}u^{(j)}}{\left\|{P_{t-1}^{\perp}u^{(j)}}\right\|}$ has the same density as $\frac{P_{t-1}^{\perp}Zu^{(j)}}{\left\|{P_{t-1}^{\perp}Zu^{(j)}}\right\|}$ . However, $P_{t-1}^{\perp}Z=ZP_{t-1}^{\perp}$ by assumption on $Z$ , and therefore

We conclude that the conditional distribution of the unit vector $\frac{P_{t-1}^{\perp}u^{(j)}}{\left\|{P_{t-1}^{\perp}u^{(j)}}\right\|}$ is invariant to rotations in the subspace to which $P_{t-1}^{\perp}$ projects. ∎

Substituting this bound back into the probability (26) gives

B.4 Proof of Lemma 6

and therefore by Lemma 3.i, we have $\hat{f}_{T;U}(0)-\inf_{x}\hat{f}_{T;U}(x)\leq\bar{f}_{T}(0)-\inf_{x}\bar{f}_{T}(x)\leq 12T$ .

Establishing part ii requires substantially more work. Since smoothness with respect to Euclidean distances is invariant under orthogonal transformations, we take $U$ to be the first $T$ columns of the $d$ -dimensional identity matrix, denoted $U=I_{d,T}$ . Recall the scaling $\rho(x)=Rx/\sqrt{R^{2}+\left\|{x}\right\|^{2}}$ with “radius” $R=230\sqrt{T}$ and the definition $\hat{f}_{T;U}(x)=\bar{f}_{T}(U^{\top}\rho(x))+\frac{1}{10}\left\|{x}\right\|^{2}$ . The quadratic $\frac{1}{10}\left\|{x}\right\|^{2}$ term in $\hat{f}_{T;U}$ has $\frac{1}{5}$ -Lipschitz first derivative and $0$ -Lipschitz higher-order derivatives (as they are all constant or zero), and we take $U=I_{d,T}$ without loss of generality, so we consider the function

We now compute the partial derivatives of $\hat{f}_{T;I}$ . Defining $y=\rho(x)$ , let $\widetilde{\nabla}^{{k}}_{j_{1},...,j_{k}}:=\frac{\partial^{k}}{\partial y_{j_{1}}\cdots\partial y_{j_{k}}}$ denote derivatives with respect to $y$ . In addition, define $\mathcal{P}_{k}$ to be the set of all partitions of $[k]=\{1,\ldots,k\}$ , i.e. $(S_{1},\ldots,S_{L})\in\mathcal{P}_{k}$ if and only if the $S_{i}$ are disjoint and $\cup_{l}S_{l}=[k]$ . Using the chain rule, we have for any $k$ and set of indices $i_{1},\ldots,i_{k}\leq T$ that

where we have used the shorthand $\nabla^{{|S|}}_{i_{S}}$ to denote the partial derivatives with respect to each of $x_{i_{j}}$ for $j\in S$ . We use the equality (29) to argue that (recall the identity (2))

for some numerical constant $0<c<\infty$ and every $p\geq 1$ . (To simplify notation, we allow $c$ to change from equation to equation throughout the proof, always representing a finite numerical constant independent of $d$ , $T$ , $k$ or $p$ .) As explained in Section 2.1, this implies $\hat{f}_{T;U}$ has $e^{cp\log p+c}$ -Lipschitz $p$ th order derivative, giving part ii of the lemma.

Algebraic manipulations and rearrangement of the sum (29) yield

Before proving inequality (30), we show how it implies the desired lemma. By the preceding display, we have

Lemma 3 shows that there exists a numerical constant $c<\infty$ such that

When the number of partitions $L=1$ , we have $|S_{1}|=p+1\geq 2$ , and so Lemma 3.ii yields

where we have used $R=230\sqrt{T}$ . Using $|S_{1}|+\cdots+|S_{L}|=p+1$ and the fact that $q(x)=(x+1)\log(x+1)$ satisfies $q(x)+q(y)\leq q(x+y)$ for every $x,y>0$ , we have

for some $c<\infty$ and every $(S_{1},\ldots,S_{L})\in\mathcal{P}_{p+1}$ . Bounds on Bell numbers [6, Thm. 2.1] give that there are at most $\exp(k\log k)$ partitions in $\mathcal{P}_{k}$ , which combined with the bound above gives the desired result.
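Two elementary facts used in this step can be checked directly: the superadditivity $q(x)+q(y)\leq q(x+y)$ of $q(x)=(x+1)\log(x+1)$ , and the bound of $\exp(k\log k)=k^{k}$ on the number of partitions of $[k]$ . The sketch below tests the first on a grid and the second in exact integer arithmetic, computing Bell numbers with the Bell triangle. The grid and the range of $k$ are arbitrary choices.

```python
import math


def q(x: float) -> float:
    """q(x) = (x + 1) * log(x + 1)."""
    return (x + 1.0) * math.log(x + 1.0)


def bell_numbers(n: int) -> list:
    """Return [B_0, ..., B_n], the Bell numbers, via the Bell triangle."""
    row = [1]
    bells = [1]  # B_0 = 1
    for _ in range(n):
        new_row = [row[-1]]  # each row starts with the last entry of the previous row
        for val in row:
            new_row.append(new_row[-1] + val)
        row = new_row
        bells.append(row[0])
    return bells


# Largest value of q(x) + q(y) - q(x + y) on a grid; superadditivity means it is <= 0.
grid = [0.1 * s for s in range(1, 101)]
worst_gap = max(q(x) + q(y) - q(x + y) for x in grid for y in grid)
print(worst_gap)

bells = bell_numbers(30)
bell_bound_holds = all(b <= k ** k for k, b in enumerate(bells) if k >= 1)
print(bells[:6], bell_bound_holds)
```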

where $|P|$ denotes the number of disjoint elements of partition $P\in\mathcal{P}_{k}$ . Define the function $\overline{\rho}(\xi)=\xi/\sqrt{1+\|{\xi}\|^{2}}$ , and let $\lambda(\xi)=\sqrt{1+\|{\xi}\|^{2}}$ so that $\overline{\rho}(\xi)=\nabla\lambda(\xi)$ and $\rho(\xi)=R\overline{\rho}(\xi/R)$ . Let $\overline{v}_{j}^{k}(\xi)=\langle\nabla^{{k}}\overline{\rho}_{j}(\xi),v^{\otimes k}\rangle$ , so that

With this in mind, we consider the quantity $\langle\nabla^{{k}}\lambda(\xi),v^{\otimes k}\rangle$ . Defining temporarily the functions $\alpha(r)=\sqrt{1+2r}$ and $\beta(t)=\frac{1}{2}\|{\xi+tv}\|^{2}$ , and their composition $h(t)=\alpha(\beta(t))$ , we evidently have

where the second equality used Faà di Bruno’s formula (31). Now, we note the following immediate facts:

Thus, if we let $\mathcal{P}_{k,2}$ denote the partitions of $[k]$ consisting only of subsets with one or two elements, we have

where $\mathsf{C}_{i}(P)$ denotes the number of sets in $P$ with precisely $i$ elements. Noting that $\left\|{v}\right\|=1$ , we may rewrite this as

We would like to bound $a_{l}(\xi)\langle\xi,v\rangle^{l-1}$ and $b_{l}(\xi)\langle\xi,v\rangle^{l}\xi$ . Note that $|P|\geq\mathsf{C}_{1}(P)$ for every $P\in\mathcal{P}_{k}$ , so $|P|\geq l$ in the sums above. Moreover, bounds for Bell numbers [6, Thm. 2.1] show that there are at most $\exp(k\log k)$ partitions of $[k]$ , and $(2k-1)!!\leq\exp(k\log k)$ as well. As a consequence, we obtain

where we have used $|\langle\xi,v\rangle|\leq\left\|{\xi}\right\|$ due to $\left\|{v}\right\|=1$ . We similarly bound $\sup_{\xi}|b_{l}(\xi)||\langle\xi,v\rangle|^{l}\left\|{\xi}\right\|$ . Returning to expression (32), we have

for a numerical constant $c<\infty$ . This is the desired bound (30), completing the proof. ∎

Appendix C Proof of Theorem 3

We divide the proof of the theorem into two parts, as in our previous results, first providing a few building blocks, then giving the theorem. The basic idea is to introduce a negative “bump” that is challenging to find, but which is close to the origin.

As Figure 2 shows, $\bar{h}_{T}$ features a unit-height peak centered at $\frac{4}{5}e^{(T)}$ , and it is identically zero when the distance from that peak exceeds $\frac{1}{5}$ . The volume of the peak vanishes exponentially with $T$ , making it hard to find by querying $\bar{h}_{T}$ locally. We list the properties of $\bar{h}_{T}$ necessary for our analysis.

The function $\bar{h}_{T}$ satisfies the following.

We prove the lemma in Section C.1; the proof is similar to that of Lemma 6. With these properties in hand, we can prove Theorem 3.

As in the proof of Lemma 6, we write ${h}_{x,v}(t)=\Psi(\beta(t))$ where $\beta(t)=1-\frac{1}{2}\left\|{x+tv}\right\|^{2}$ , and use Faà di Bruno’s formula (31) to write, for any $k\geq 1$ ,

where $\mathcal{P}_{k}$ is the set of partitions of $[k]$ and $|P|$ denotes the number of sets in partition $P$ . Noting that $\beta^{\prime}(0)=-\langle x,v\rangle$ , $\beta^{\prime\prime}(0)=-\left\|{v}\right\|^{2}$ and $\beta^{(n)}(0)=0$ for any $n>2$ , we have

where $\mathcal{P}_{k,2}$ denotes the partitions of $[k]$ consisting only of subsets with one or two elements and $\mathsf{C}_{i}(P)$ denotes the number of sets in $P$ with precisely $i$ elements.

Noting that $\Psi^{(k)}(1-\frac{1}{2}\left\|{x}\right\|^{2})=0$ for any $k\geq 0$ and $\left\|{x}\right\|>1$ , we may assume $\left\|{x}\right\|\leq 1$ . Since $\left\|{v}\right\|\leq 1$ , we may bound $|{h}_{x,v}^{(p+1)}(0)|$ by

for some absolute constant $c<\infty$ , where inequality $(i)$ follows from Lemma 1.iv and the fact that the number of matchings in the complete graph (or the $k$ th telephone number [21, Lem. 2]) satisfies the bound $|\mathcal{P}_{k,2}|\leq e^{\frac{k}{2}\log k}$ . This gives the result.
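The counting bound $|\mathcal{P}_{k,2}|\leq e^{\frac{k}{2}\log k}=k^{k/2}$ can be verified for small $k$ in exact arithmetic. The sketch below counts partitions of $[k]$ into blocks of size one or two through the standard telephone-number recurrence and compares squares to avoid floating-point rounding. The function name and the tested range are our own choices.

```python
def telephone_numbers(n: int) -> list:
    """Return [t_0, ..., t_n], where t_k counts partitions of [k] into blocks of size 1 or 2.

    Recurrence: element k is either a singleton (t_{k-1} ways) or is paired
    with one of the other k-1 elements ((k-1) * t_{k-2} ways).
    """
    t = [1, 1]
    for k in range(2, n + 1):
        t.append(t[k - 1] + (k - 1) * t[k - 2])
    return t[: n + 1]


tel = telephone_numbers(40)
# t_k <= k^(k/2) is equivalent to t_k^2 <= k^k, which we test with integers.
telephone_bound_holds = all(t * t <= k ** k for k, t in enumerate(tel) if k >= 1)
print(tel[:8], telephone_bound_holds)
```

The bound is attained with equality at $k=1$ and $k=2$ , which is why the comparison uses integers and not floating-point exponentials.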

C.2 Proof of Theorem 3

with the final inequality using our assumption $\sigma\leq D$ . On the other hand, for any $x$ such that $\bar{h}_{T}(U^{\top}x/D)=0$ , we have by Lemma 6.i (along with $\hat{f}_{T;U}(0)=0$ ) that

Combining the two displays above, we conclude that if

then all global minima $x^{\star}$ of $f_{U}$ must satisfy $\bar{h}_{T}(U^{\top}x^{\star}/D)>0$ . Inspecting the definition (18) of $\bar{h}_{T}$ , this implies $\left\|{x^{\star}/D-0.8u^{(T)}}\right\|<\frac{1}{5}$ , and therefore $\left\|{x^{\star}}\right\|\leq D$ . Thus, by setting

we guarantee that $f_{U}\in\mathcal{F}^{\rm dist}_{p}(D,L_{p})$ as long as $\sigma\leq D$ .

We defer the proof of claim (35) to the end of this section.

By claim (35), this implies $\nabla\bar{h}_{T}(U^{\top}x^{(t)}/D)=0$ , and by Lemma 5, $\|{\nabla\hat{f}_{T;U}(x^{(t)}/\sigma)}\|>1/2$ . Thus, after scaling,

where $T=\left\lfloor D^{p+1}/13\sigma^{p+1}\right\rfloor$ is defined in Eq. (34). Thus, as $f_{U}\in\mathcal{F}^{\rm dist}_{p}(D,L_{p})$ for our choice of $T$ , we immediately obtain

Finally, we return to demonstrate claim (35). Note that $|\langle u^{(T)},\rho(x/\sigma)\rangle|<1/2$ is equivalent to $|\langle u^{(T)},x\rangle|<\frac{\sigma}{2}\sqrt{1+\|{\frac{x}{\sigma R}}\|^{2}}$ , and consider separately the cases $\left\|{x/\sigma}\right\|\leq R/2$ and $\left\|{x/\sigma}\right\|>R/2=115\sqrt{T}$ . In the first case, we have $|\langle u^{(T)},x\rangle|<(\sqrt{5}/4)\sigma<(3/5)D$ , by our assumption $\sigma\leq D$ . Therefore, by Lemma 10.ii we have that $\bar{h}_{T}(U^{\top}y/D)=0$ for $y$ near $x$ . In the second case, we have $\left\|{x}\right\|>(4R/\sqrt{5})|\langle u^{(T)},x\rangle|>230|\langle u^{(T)},x\rangle|$ . If in addition $|\langle u^{(T)},x\rangle|<(3/5)D$ then our conclusion follows as before. Otherwise, $\left\|{x}\right\|/D>230\cdot(3/5)>1$ , so again the conclusion follows by Lemma 10.ii.
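The equivalence between $|\langle u^{(T)},\rho(x/\sigma)\rangle|<1/2$ and $|\langle u^{(T)},x\rangle|<\frac{\sigma}{2}\sqrt{1+\|{\frac{x}{\sigma R}}\|^{2}}$ follows from $\rho(x/\sigma)=(x/\sigma)/\sqrt{1+\|{\frac{x}{\sigma R}}\|^{2}}$ , and it can be sanity-checked on random inputs. In the sketch below we assume, purely for illustration, that $u^{(T)}$ is a coordinate vector, and we sample $x$ at two scales so that both cases of the argument occur. The dimensions, the seed, and the sampling distribution are arbitrary.

```python
import math
import random


def rho(x, R):
    """Scaling rho(x) = R * x / sqrt(R^2 + ||x||^2)."""
    norm_sq = sum(c * c for c in x)
    scale = R / math.sqrt(R * R + norm_sq)
    return [scale * c for c in x]


def lhs(x, sigma, R, j):
    """|<e_j, rho(x / sigma)>| < 1/2."""
    return abs(rho([c / sigma for c in x], R)[j]) < 0.5


def rhs(x, sigma, R, j):
    """|<e_j, x>| < (sigma / 2) * sqrt(1 + ||x / (sigma * R)||^2)."""
    norm = math.sqrt(sum(c * c for c in x))
    return abs(x[j]) < 0.5 * sigma * math.sqrt(1.0 + (norm / (sigma * R)) ** 2)


rng = random.Random(0)
T, d = 4, 6
R = 230.0 * math.sqrt(T)
j = T - 1  # index of the assumed coordinate direction u^(T) = e_T
mismatches, n_true, max_norm_ratio = 0, 0, 0.0
for _ in range(2000):
    sigma = rng.uniform(0.1, 2.0)
    scale = rng.choice([1.0, 1000.0])  # small and large ||x / sigma||
    x = [sigma * scale * rng.gauss(0.0, 1.0) for _ in range(d)]
    a, b = lhs(x, sigma, R, j), rhs(x, sigma, R, j)
    mismatches += a != b
    n_true += a
    y = rho([c / sigma for c in x], R)
    max_norm_ratio = max(max_norm_ratio, math.sqrt(sum(c * c for c in y)) / R)
print(mismatches, n_true, max_norm_ratio)
```

The two predicates should agree on every sample, and $\left\|{\rho(\cdot)}\right\|$ stays below $R$ . The constant in the first case can be read off in the same way: when $\left\|{x/\sigma}\right\|\leq R/2$ the square root is at most $\sqrt{5}/2$ , giving $(\sqrt{5}/4)\sigma<(3/5)\sigma$ .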