Lower Bounds for Finding Stationary Points I
Yair Carmon, John C. Duchi, Oliver Hinder, Aaron Sidford
Introduction
We prove lower bounds on the number of function and derivative evaluations required for algorithms to find a point satisfying inequality (1). While for arbitrary smooth , a near-stationary point (1) is certainly insufficient for any type of optimality, there are a number of reasons to study algorithms and complexity for finding stationary points. In several statistical and engineering problems, including regression models with non-convex penalties and objectives , phase retrieval , and non-convex (low-rank) reformulations of semidefinite programs and matrix completion , it is possible to show that all first- or second-order stationary points are (near) global minima. The strong empirical success of local search strategies for such problems, as well as for neural networks , motivates a growing body of work on algorithms with strong complexity guarantees for finding stationary points . In contrast to this algorithmic progress, algorithm-independent lower bounds for finding stationary points are largely unexplored.
By evaluation of higher order derivatives, such as the Hessian, it is possible to achieve better dependence. Nesterov and Polyak’s cubic regularization of Newton’s method guarantees -stationarity (1) in iterations, but each iteration may be expensive when the dimension is large. More generally, th-order regularization methods iterate by sequentially minimizing models of based on order Taylor approximations, and Birgin et al. show that these methods converge in iterations. Each iteration requires finding an approximate stationary point of a high-dimensional, potentially non-convex, degree polynomial, which suggests that the methods will be practically challenging for . The methods nonetheless provide fundamental upper complexity bounds.
In this paper and its companion , we focus on the converse problem: providing dimension-free complexity lower bounds for finding -stationary points. We show fundamental limits on the best achievable dependence, as well as dependence on other problem parameters. Together with known upper bounds, our results shed light on the optimal rates of convergence for finding stationary points.
In the case of convex optimization, we have a deep understanding of the complexity of finding -suboptimal points, that is, satisfying for some , where . Here we review only the dimension-free optimal rates, as those are most relevant for our results. Given a point satisfying , if is convex with -Lipschitz gradient, Nesterov’s accelerated gradient method finds an -suboptimal point in gradient evaluations, which is optimal even among randomized, higher-order algorithms .Higher order methods can yield improvements under additional smoothness: if in addition has -Lipschitz Hessian and , an accelerated Newton method achieves the (optimal) rate . For non-smooth problems, that is, when is -Lipschitz, subgradient methods achieve the optimal rate of subgradient evaluations (cf. ). In Part II of this paper , we consider the impact of convexity on the difficulty of finding stationary points using first-order methods.
A related line of work considers algorithm-dependent lower bounds, describing functions that are challenging for common classes of algorithms, such as Newton’s method and gradient descent. In this vein, Jarre shows that the Chebyshev-Rosenbrock function is difficult to optimize, and that any algorithm that employs line search to determine the step size will require an exponential (in ) number of iterations to find an -suboptimal point, even though the Chebyshev-Rosenbrock function has only a single stationary point. While this appears to contradict the polynomial complexity guarantees mentioned above, Cartis et al. explain this by showing that the difficult Chebyshev-Rosenbrock instances have -stationary point with function value that is -suboptimal. Cartis et al. also develop algorithm-specific lower bounds on the iteration complexity of finding approximate stationary points. Their works show that the performance guarantees for gradient descent and cubic regularization of Newton’s method are tight for two-dimensional functions they construct, and they also extend these results to certain structured classes of methods .
2 The importance of high-dimensional constructions
To tightly characterize the algorithm- and dimension-independent complexity of finding -stationary points, one must construct hard instances whose domain has dimension that grows with . The reason for this is simple: there exist algorithms with complexity that trades dependence on dimension in favor of better dependence. Indeed, Vavasis gives a grid-search method that, for functions with Lipschitz gradient, finds an -stationary point in gradient and function evaluations. Moreover, Hinder exhibits a cutting-plane method that, for functions with Lipschitz first and third derivatives, finds an -stationary point in gradient and function evaluations.
High-dimensional constructions are similarly unavoidable when developing lower bounds in convex optimization. There, the center-of-gravity cutting plane method (cf. ) finds an -suboptimal point in (sub)gradient evaluations, for any continuous convex function with bounded distance to optimality. Consequently, proofs of the dimension-free lower bound for convex optimization (as we cite in the previous section) all rely on constructions whose dimensionality grows polynomially in .
3 Our contributions
oracle queries to find an -stationary point of , where is a constant decreasing at most polynomially in . As explained in the previous section, the domain of the constructed function has dimension polynomial in .
For every , our lower bound matches (up to a constant) known upper bounds, thereby characterizing the optimal complexity of finding stationary points. For , our results imply that gradient descent is optimal among all methods (even randomized, high-order methods) operating on functions with Lipschitz continuous gradient and bounded initial sub-optimality. Therefore, to strengthen the guarantees of gradient descent one must introduce additional assumptions, such as convexity of or Lipschitz continuity of . Similarly, in the case we establish that cubic regularization of Newton’s method achieves the optimal rate , and for general we show that th order Taylor-approximation methods are optimal.
These results say little about the potential of first-order methods on functions with higher-order Lipschitz derivatives, where first-order methods attain rates better than . In Part II of this series , we address this issue and show lower bounds for deterministic algorithms using only first-order information. The lower bounds exhibit a fundamental gap between first- and second-order methods, and nearly match the known upper bounds .
4 Our approach and paper organization
In Section 2 we introduce the classes of functions and algorithms we consider as well as our notion of complexity. Then, in Section 3, we present the generic technique we use to prove lower bound for deterministic algorithms in both this paper and Part II . While essentially present in previous work, our technique abstracts away and generalizes the central arguments in many lower bounds . The technique applies to higher-order methods and provides lower bounds for general optimization goals, including finding stationary points (our main focus), approximate minimizers, and second-order stationary points. It is also independent of whether the functions under consideration are convex, applying to any function class with appropriate rotational invariance . The key building blocks of the technique are Nesterov’s notion of a “chain-like” function , which is difficult for a certain subclass of algorithms, and a “resisting oracle” reduction that turns a lower bound for this subclass into a lower bound for all deterministic algorithms.
In Section 4 we apply this generic method to produce lower bounds for deterministic methods (Theorem 1). The deterministic results underpin our analysis for randomized algorithms, which culminates in Theorem 2 in Section 5. Following Woodworth and Srebro , we consider random rotations of our deterministic construction, and show that for any algorithm such a randomly rotated function is, with high probability, difficult. For completeness, in Section 6 we provide lower bounds on finding stationary points of functions where is bounded, rather than the function value gap ; these bounds have the same dependence as their bounded function value counterparts.
If is a symmetric order tensor, meaning that is invariant to permutations of the indices (for example, is always symmetric), then Zhang et al. [47, Thm. 2.1] show that
For vectors, the Euclidean and operator norms are identical.
Preliminaries
We begin our development with definitions of the classes of functions (§ 2.1), classes of algorithms (§ 2.2), and notions of complexity (§ 2.3) that we study.
Measures of function regularity are crucial for the design and analysis of optimization algorithms . We focus on two types of regularity conditions: Lipschitzian properties of derivatives and bounds on function value.
Complexity guarantees for finding stationary points of non-convex functions typically depend on the function value bound , where is a pre-specified point. Without loss of generality, we take the pre-specified point to be for the remainder of the paper. With that in mind, we define the following classes of functions.
Let , and . Then the set
For our results, we also require the following important invariance notion, proposed (in the context of optimization) by Nemirovski and Yudin [35, Ch. 7.2].
Every function class we consider is orthogonally invariant, as and has the same Lipschitz constants to all orders as , as their collections of associated directional projections are identical.
2 Algorithm classes
To model the computational cost of an algorithm, we adopt the information-based complexity framework, which Nemirovski and Yudin develop (see also ), and view every every iterate as a query to an information oracle. Typically, one places restrictions on the information the oracle returns (e.g. only the function value and gradient at the query point) and makes certain assumptions on how the algorithm uses this information (e.g. deterministically). Our approach is syntactically different but semantically identical: we build the oracle restriction, along with any other assumption, directly into the structure of the algorithm. To formalize this, we define
as shorthand for the response of a th order oracle to a query at point . When this corresponds to an oracle that reveals all derivatives at . Our algorithm classes follow.
As a concrete example, for any and consider the algorithm that produces iterates by minimizing the sum of a th order Taylor expansion and an order proximal term:
For , is gradient descent with step-size , for it is cubic-regularized Newton’s method , and for general it is a simplified form of the scheme that Birgin et al. propose.
Randomized algorithms (and function-informed processes)
A th-order randomized algorithm is a distribution on th-order deterministic algorithms. We can write any such algorithm as a deterministic algorithm given access to a random uniform variable on $f\xi\sim\mathsf{Uni}f$), then producing iterates of the form
Zero-respecting sequences and algorithms
While deterministic and randomized algorithms are the natural collections for which we prove lower bounds, it is useful to define an additional structurally restricted class. This class forms the backbone of our lower bound strategy (Sec. 3), as it is both ‘small’ enough to uniformly underperform on a single function, and ‘large’ enough to imply lower bounds on the natural algorithm classes.
The definition (5) says that if all partial derivatives involving the th coordinate of (up to the th order) are zero. For , this definition is equivalent to the requirement that for every and , if for , then . The requirement (5) implies that .
In the literature on lower bounds for first-order convex optimization, it is common to assume that methods only query points in the span of the gradients they observe . Our notion of zero-respecting algorithms generalizes this assumption to higher-order methods, but even first-order zero-respecting algorithms are slightly more general. For example, coordinate descent methods are zero-respecting, but they generally do not remain in the span of the gradients.
3 Complexity measures
To measure the performance of algorithm on function , we evaluate the iterates it produces from , and with mild abuse of notation, we define
as the complexity of on . With this setup, we define the complexity of algorithm class on function class as
Many algorithms guarantee “dimension independent” convergence and thus provide upper bounds for the quantity (7). A careful tracing of constants in the analysis of Birgin et al. implies that the generalized regularization scheme defined by the recursion (3) guarantees
While definition (7) is our primary notion of complexity, our proofs provide bounds on smaller quantities than (7) that also carry meaning. For zero-respecting algorithms, we exhibit a single function and bound \inf_{\mathsf{A}\in\mathcal{A}_{\textnormal{{zr}}}}\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f\big{)} from below, in effect interchanging the and in (7). This implies that all zero-respecting algorithms share a common vulnerability. For randomized algorithms, we exhibit a distribution supported on functions of a fixed dimension , and we lower bound the average \inf_{\mathsf{A}\in\mathcal{A}_{\textnormal{{rand}}}}\int\mathsf{T}_{\epsilon}\big{(}\mathsf{A},f\big{)}dP(f), bounding the distributional complexity , which is never greater than worst-case complexity (and is equal for randomized and deterministic algorithms). Even randomized algorithms share a common vulnerability: functions drawn from .
Anatomy of a lower bound
In this section we present a generic approach to proving lower bounds for optimization algorithms. The basic techniques we use are well-known and applied extensively in the literature on lower bounds for convex optimization . However, here we generalize and abstract away these techniques, showing how they apply to high-order methods, non-convex functions, and various optimization goals (e.g. -stationarity, -optimality).
Nesterov [37, Chapter 2.1.2] proves lower bounds for smooth convex optimization problems using the “chain-like” quadratic function
which he calls the “worst function in the world.” The important property of is that for every , whenever (with and ). Thus, if we “know” only the first coordinates of , i.e. are able to query only vectors such , then any we query satisfies for ; we only “discover” a single new coordinate . We generalize this chain structure to higher-order derivatives as follows.
2 A lower bound strategy
The preceding discussion shows that zero-respecting algorithms take many iterations to “discover” all the coordinates of a zero-chain. In the following observation, we formalize how finding a suitable zero-chain provides a lower bound on the performance of zero-respecting algorithms.
belongs to the function class, i.e. , and
for every such that ; We can readily adapt this property for lower bounds on other termination criteria, e.g. require for every such that .
then \mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{zr}}}^{(p)},\mathcal{F}\big{)}\geq\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{zr}}}^{(p)},\{f\}\big{)}>T.
3 From deterministic to zero-respecting algorithms
Zero-chains allow us to generate strong lower bounds for zero-respecting algorithms. The following reduction shows that these lower bounds are valid for deterministic algorithms as well.
We also give a variant of Proposition 1 that is tailored to lower bounds constructed by means of Observation 2 and allows explicit accounting of dimensionality.
where and is the set of orthogonal matrices, so that contains only function with domain of dimension .
giving Proposition 1; Proposition 2 follows similarly, and for it we may take .
The adversarial rotation argument that yields Propositions 1 and 2 is more or less apparent in the proofs of previous lower bounds in convex optimization for deterministic algorithms. We believe it is instructive to separate the proof of lower bounds on \mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{zr}}},\mathcal{F}\big{)} and the reduction from to , as the latter holds in great generality. Indeed, Propositions 1 and 2 hold for any complexity measure \mathsf{T}_{\epsilon}\big{(}\cdot,\cdot\big{)} that satisfies
4 Randomized algorithms
Proposition 1 and 2 do not apply to randomized algorithms, as they require the adversary (maximizing choice of ) to simulate the action of on . To handle randomized algorithms, we strengthen the notion of a zero-chain as follows.
A robust zero-chain is also an “ordinary” zero-chain. In Section 5 we replace the adversarial rotation of § 3.3 with an orthogonal matrix drawn uniformly at random, and consider the random function , where is a robust zero-chain. We adapt a lemma by Woodworth and Srebro , and use it to show that for every , satisfies an approximate form of Observation 1 (w.h.p.) whenever the iterates have bounded norm. With further modification of to handle unbounded iterates, our zero-chain strategy yields a strong distributional complexity lower bound on .
Lower bounds for zero-respecting and deterministic algorithms
Our construction, illustrated in Figure 1, has two key properties. First is that is a zero-chain (Observation 3 in the sequel). Second, as we show in Lemma 2, is large unless for every . These properties make it hard for any zero-respecting method to find a stationary point of scaled versions of , and coupled with Proposition 1, this gives a lower bound for deterministic algorithms.
Before turning to the main theorem of this section, we catalogue the important properties of the functions , and .
The functions and satisfy the following.
For all and , .
The functions and derivatives and are non-negative and bounded, with
We prove Lemma 1 in Appendix B.1. The remainder our development relies on and only through Lemma 1. Therefore, the precise choice of is not particularly special; any two functions with properties similar to Lemma 1 will yield similar lower bounds.
The key consequence of Lemma 1.i is that the function is a robust zero-chain (see Definition 4) and consequently also a zero-chain (Definition 3):
For any , if then for all in a neighborhood of .
Applying Observation 3 for gives that is a robust zero-chain by Definition 4. Taking derivatives of with respect to , , shows that is also a zero-chain by Definition 3. Thus, Observation 1 shows that any zero-respecting algorithm operating on requires iterations to find a point where .
Next, we establish the “large gradient property” that must be large if any coordinate of is near zero.
If for any , then there exists such that and
We take to be the smallest for which , so that (where we use the shorthand ). Therefore, we have
In the chain of inequalities, inequality follows because for every ; inequality follows because for , while equality follows from Lemma 1.ii and the pairing of and . ∎
Finally, we verify that meets the smoothness and boundedness requirements of the function classes we consider.
The function satisfies the following.
We have .
The proof of Lemma 3 is technical, so we defer it to Appendix B.2. In the lemma, Properties i and iii allow us to guarantee that appropriately scaled versions of are in . Property is ii is necessary for analysis of the randomized construction in Section 5.
2 Lower bounds for zero-respecting and deterministic algorithms
We can now state and prove a lower bound for finding stationary points of th order smooth functions using full derivative information and zero-respecting algorithms (the class ). Proposition 1 transforms this bound into one on all deterministic algorithms (the class ).
3 Proof of Theorem 1
Lower bounds for randomized algorithms
With our lower bounds on the complexity of deterministic algorithms established, we turn to the class of all randomized algorithms. We provide strong distributional complexity lower bounds by exhibiting a distribution on functions such that a function drawn from it is “difficult” for any randomized algorithm, with high probability. We do this via the composition of a random orthogonal transformation with the function defined in (10).
The result of Lemma 4 is identical (to constant factors) to an important result of Woodworth and Srebro [45, Lemma 7], but we must be careful with the sequential conditioning of randomness between the iterates , the random orthogonal , and how much information the sequentially computed derivatives may leak. Because of this additional care, we require a modification of their original proof, In a recent note Woodworth and Srebro independently provide a revision of their proof that is similar, but not identical, to the one we propose here. which we provide in Section B.3, giving a rough outline here. For a fixed , assume that holds for every pair and ; we argue that this (roughly) implies that for every with high probability, completing the induction. When the assumption that holds, the robust zero-chain property of (Definition 4 and Observation 3) implies that for every we have
2 Handling unbounded iterates
The quadratic term in guarantees that all points beyond a certain norm have a large gradient, which prevents the algorithm from trivially making the gradient small by increasing the norm of the iterates. The following lemma captures the hardness of for randomized algorithms.
Let , and let be informed by . If then, with probability at least ,
Therefore, by Lemma 2 with , for each there exists such that
To show that is also large, we consider separately the cases and . For the first case, we use to write
Therefore, for we have
In the second case, , we have for any satisfying and that
As our lower bounds repose on appropriately scaling the function , it remains to verify that satisfies the few boundedness properties we require. We do so in the following lemma.
The function satisfies the following.
We have .
We defer the (computationally involved) proof of this lemma to Section B.4.
3 Final lower bounds
With Lemmas 5 and 6 in hand, we can state our lower bound for all algorithms, randomized or otherwise, given access to all derivatives of a function. Note that our construction also implies an identical lower bound for (slightly) more general algorithms that use any local oracle , meaning that the information the oracle returns about a function when queried at a point is identical to that it returns when a function is queried at whenever for all in a neighborhood of .
implying Theorem 2 for any . Thus, we exhibit a randomized procedure for finding hard instances for any randomized algorithm that requires no knowledge of the algorithm itself.
Theorem 2 is stronger than Theorem 1 in that it applies to the broad class of all randomized algorithms. Our probabilistic analysis requires that the functions constructed to prove Theorem 2 have dimension scaling proportional to where is the lower bound on the number of iterations. Contrast this to Theorem 1, which only requires dimension . A similar gap exists in complexity results for convex optimization . At present, it unclear if these gaps are fundamental or a consequence of our specific constructions.
4 Proof of Theorem 2
Fix and let be the iterates produced by applied on . Since and differ only by scaling, the iterates are informed by (recall Sec. 2.2), and therefore we may apply Lemma 5 with and our large enough choice of dimension to conclude that
It remains to choose to guarantee that belongs to the relevant function class (bounded and smooth) for every orthogonal . By Lemma 6.ii, has -Lipschitz continuous th order derivatives. By Lemma 6.i, we have
Distance-based lower bounds
We have so far considered finding approximate stationary points of smooth functions with bounded sub-optimality at the origin, i.e. . In convex optimization, it is common to consider instead functions with bounded distance between the origin and a global minimum. We may consider a similar restriction for non-convex functions; for and positive , let
be the class of functions with -Lipschitz th order derivatives satisfying
that is, all global minima have bounded distance to the origin.
In this section we give a lower bound on the complexity of this function class that has the same dependence as our bound for the class . This is in sharp contrast to convex optimization, where distance-bounded functions enjoy significantly better dependence than their value-bounded counterparts (see Section LABEL:sec:convex in the companion ). Qualitatively, the reason for this difference is that the lack of convexity allows us to “hide” global minima close to the origin that are difficult to find for any algorithm with local function access .
We postpone the construction and proof to Appendix C, and move directly to the final bound.
There exist numerical constants such that the following lower bound holds. For any , let , and be positive. Then
We remark that a lower-dimensional construction suffices for proving the lower bound for deterministic algorithm, similarly to Theorem 1.
While we do not have a matching upper bound for Theorem 3, we can match its dependence in the smaller function class
Conclusion
This work provides the first algorithm independent and tight lower bounds on the dimension-free complexity of finding stationary points. As a consequence, we have characterized the optimal rates of convergence to -stationarity, under the assumption of high dimension and an oracle that provides all derivatives. Yet, given the importance of high-dimensional problems, the picture is incomplete: high-order algorithms—even second-order method—are often impractical in large scale settings. We address this in the companion , which provides sharper lower bounds for the more restricted class of first-order methods. In we also provide a full conclusion for this paper sequence, discussing in depth the implications and questions that arise from our results.
Acknowledgments
OH was supported by the PACCAR INC fellowship. YC and JCD were partially supported by the SAIL-Toyota Center for AI Research, NSF-CAREER award 1553086, and a Sloan Foundation Fellowship in Mathematics. YC was partially supported by the Stanford Graduate Fellowship and the Numerical Technologies Fellowship.
References
Appendix A Proof of Propositions 1 and 2
The core of the proofs of Propositions 1 and 2 is the following construction.
Before explaining the construction of , let us see how its defining property implies the lemma. If \mathsf{T}_{\epsilon}\big{(}\mathsf{A},f_{U}\big{)}>T_{0}, we are done. Otherwise, \mathsf{T}_{\epsilon}\big{(}\mathsf{A},f_{U}\big{)}\leq T_{0} and we have
for every (we set for every without loss of generality).
Finally, note that the arguments above hold unchanged for . ∎
With Lemma 7 in hand, the propositions follow easily.
We may assume that \mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{det}}}^{(p)},\mathcal{F}\big{)}<T_{0} for some integer , as otherwise we have \mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{det}}}^{(p)},\mathcal{F}\big{)}=\infty and the result holds trivially. For any and the value , we invoke Lemma 7 to construct such that \mathsf{T}_{\epsilon}\big{(}\mathsf{A},f_{U}\big{)}\geq\min\{T_{0},\mathsf{T}_{\epsilon}\big{(}\mathsf{Z}_{\mathsf{A}},f\big{)}\} for every and some orthogonal matrix that depends on and . Consequently, we have
where inequality uses that because is orthogonally invariant, step uses \mathsf{T}_{\epsilon}\big{(}\mathsf{A},f_{U}\big{)}\geq\min\{T_{0},\mathsf{T}_{\epsilon}\big{(}\mathsf{Z}_{\mathsf{A}},f\big{)}\} and step is due to by construction. As we chose for which \mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{det}}}^{(p)},\mathcal{F}\big{)}<T_{0}, the chain of inequalities implies \mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{det}}}^{(p)},\mathcal{F}\big{)}\geq\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{zr}}}^{(p)},\mathcal{F}\big{)}, concluding the proof. ∎
For any , we invoke Lemma 7 with to obtain and orthogonal matrix (dependent on and ) for which
where the last equality is due to \inf_{\mathsf{B}\in\mathcal{A}_{\textnormal{{zr}}}^{(p)}}\mathsf{T}_{\epsilon}\big{(}\mathsf{B},f\big{)}=\mathcal{T}_{\epsilon}\big{(}\mathcal{A}_{\textnormal{{zr}}}^{(p)},\{f\}\big{)}\geq T. Since , we have
and taking the infimum over concludes the proof. ∎
Appendix B Technical Results
Each of the statements in the lemma is immediate except for part iii. To see this part, we require a few further calculations. We begin by providing bounds on the derivatives of . To avoid annoyances with scaling factors, we define .
We prove the result by induction. We have , so that the base case of the induction is satisfied. Now, assume for our induction that
where . Then taking derivatives, we have
where (and we treat ) and . With the induction hypothesis that , we obtain
With this result, we find that for any ,
The function is maximized at , so that . We thus obtain the numerically verifiable upper bound
Now, we turn to considering the function . We assume w.l.o.g. that , as otherwise for all . Recall for . We have the following lemma regarding its derivatives.
We provide the proof by induction over . For , we have that
which yields the base case of the induction. Now, assume that for some , we have
where and . Defining and for , then, under the inductive hypothesis that , we have
As in the derivation immediately following Lemma 8, by replacing , we have that is maximized by , so that
which yields the numerically verifiable upper bound
B.2 Proof of Lemma 3
Part i follows because and, since and ,
Part ii follows additionally from on , and , which when substituted into
for every and . Consequently, .
Examining , we see that is non-zero if and only if for every . Consequently, we can rearrange the above summation as
where we take and . Brief calculation show that
where we have used for every . To see this last claim is true, recall that is a unit vector and note that
If then . Otherwise, letting , the Cauchy-Swartz inequality implies
where or . This gives the result. ∎
B.3 Proof of Lemma 4
The following linear-algebraic result justifies the definition (21) of .
For all , implies for every and every .
First, notice that since implies for every , it suffices to show that implies for every . We will in fact prove a stronger statement:
Since holds, its definition (21) implies . Moreover, by Cauchy-Schwarz and the implication (22), we have . Combining the two bounds, we obtain the result of the lemma,
where we have used and .
We prove bound (22) by induction. The basis of the induction, , is trivial, as . We shall assume (22) holds for and show that it consequently holds for as well. We may apply the Graham-Schmidt procedure on the sequence to write
where is the projection to the span of ,
where the equalities hold by , , and the definition of .
The matrices are projections, so , and Cauchy-Swartz and the induction hypothesis imply
Moreover, the event implies , so
where the first equality uses , the second the definition of , and the inequality uses and .
Combining the observations (24a) and (24b), we can bound each summand in the second summation in (23). Since the summands in the first summation are bounded by by definition (21) of , we obtain
where is shorthand for and is the random variable generating .
In the following lemma, we state formally that conditioned on , the iterate depends on only through its first columns.
For every , there exist measurable functions and such that
Let , and . Then conditioned on and , the vector is uniformly distributed on the unit sphere in the subspace to which projects.
This lemma is subtle. The vectors , , conditioned on , are certainly uniformly distributed on the unit sphere in the subspace orthogonal to . However, the additional conditioning on requires careful handling. Throughout the proof we fix and . We begin by noting that by (22), implies
Therefore, when holds we have so is well-defined.
To establish our result, we will show that the density of conditioned on is invariant to rotations that preserve the span of . More formally, let denote the density of conditional on and . We wish to show that
Throughout, we let denote such a rotation. Letting and denote the densities of and , respectively, we have
where the first equality holds by the definition of conditional probability and second by the independence of and . We have and therefore, by the invariance of to rotations, . Hence, replacing with in the above display yields
Marginalizing the density (28) to obtain a density for and recalling that is measurable , we conclude that, conditioned on the random variable has the same density as . However, by assumption on , and therefore
We conclude that the conditional distribution of the unit vector is invariant to rotations in the subspace to which projects. ∎
Substituting this bound back into the probability (26) gives
B.4 Proof of Lemma 6
and therefore by Lemma 3.i, we have .
Establishing part ii requires substantially more work. Since smoothness with respect to Euclidean distances is invariant under orthogonal transformations, we take to be the first columns of the -dimensional identity matrix, denoted . Recall the scaling with “radius” and the definition . The quadratic term in has -Lipschitz first derivative and -Lipschitz higher order derivatives (as they are all constant or zero), and we take without loss of generality, so we consider the function
We now compute the partial derivatives of . Defining , let denote derivatives with respect to . In addition, define to be the set of all partitions of , i.e. if and only if the are disjoint and . Using the chain rule, we have for any and set of indices that
where we have used the shorthand to denote the partial derivatives with respect to each of for . We use the equality (29) to argue that (recall the identity (2))
for some numerical constant To simplify notation we allow to change from equation to equation throughout the proof, always representing a finite numerical constant independent of , , or ., and every . As explained in Section 2.1, this implies has -Lipschitz th order derivative, giving part ii of the lemma.
algebraic manipulations and rearrangement of the sum (29) yield
Before proving inequality (30), we show how it implies the desired lemma. By the preceding display, we have
Lemma 3 shows that there exists a numerical constant such that
When the number of partitions , we have , and so Lemma 3.ii yields
where we have used . Using and the fact that satisfies for every , we have
for some and every . Bounds on Bell numbers [6, Thm. 2.1] give that there are at most partitions in , which combined with the bound above gives desired result.
where denotes the number of disjoint elements of partition . Define the function , and let so that and . Let , so that
With this in mind, we consider the quantity . Defining temporarily the functions and , and their composition , we evidently have
where the second equality used Faá di Bruno’s formula (31). Now, we note the following immediate facts:
Thus, if we let denote the partitions of consisting only of subsets with one or two elements, we have
where denotes the number of sets in with precisely elements. Noting that , we may rewrite this as
We would like to bound and . Note that for every , so in the sums above. Moreover, bounds for Bell numbers [6, Thm. 2.1] show that there are at most partitions of , and as well. As a consequence, we obtain
where we have used due to . We similarly bound . Returning to expression (32), we have
for a numerical constant . This is the desired bound (30), completing the proof. ∎
Appendix C Proof of Theorem 3
We divide the proof of the theorem into two parts, as in our previous results, first providing a few building blocks, then giving the theorem. The basic idea is to introduce a negative “bump” that is challenging to find, but which is close to the origin.
As Figure 2 shows, features a unit-height peak centered at , and it is identically zero when the distance from that peak exceeds . The volume of the peak vanishes exponentially with , making it hard to find by querying locally. We list the properties of necessary for our analysis.
The function satisfies the following.
We prove the lemma in Section C.1; the proof is similar to that of Lemma 6. With these properties in hand, we can prove Theorem 3.
As in the proof of Lemma 6, we write where , and use Faá di Bruno’s formula (31) to write, for any ,
where is the set of partitions of and denotes the number of set in partition . Noting that , and for any , we have
where denote the partitions of consisting only of subsets with one or two elements and denotes the number of sets in with precisely elements.
Noting that for any and , we may assume . Since , we may bound by
for some absolute constant , where inequality follows from Lemma 1.iv and that the number of matchings in the complete graph (or the th telephone number [21, Lem. 2]) has bound . This gives the result.
C.2 Proof of Theorem 3
with the final inequality using our assumption . On the other hand, for any such that , we have by Lemma 6.i (along with that
Combining the two displays above, we conclude that if
then all global minima of must satisfy . Inspecting the definition (18) of , this implies , and therefore . Thus, by setting
we guarantee that as long as .
We defer the proof of claim (35) to the end of this section.
By claim (35), this implies , and by Lemma 5, . Thus, after scaling,
where is defined in Eq. (34). Thus, as for our choice of , we immediately obtain
Finally, we return to demonstrate claim (35). Note that is equivalent to , and consider separately the cases and . In the first case, we have , by our assumption . Therefore, by Lemma 10.ii we have that for near . In the second case, we have . If in addition then our conclusion follows as before. Otherwise, , so again the conclusion follows by Lemma 10.ii.