Theoretical insights into the optimization landscape of over-parameterized shallow neural networks

Mahdi Soltanolkotabi, Adel Javanmard, Jason D. Lee

Introduction

Neural network architectures (a.k.a. deep learning) have recently emerged as powerful tools for automatic knowledge extraction from raw data. These learning architectures have lead to major breakthroughs in applications such as visual object classification , speech recognition and natural language processing . Despite their wide empirical use the mathematical success of these architectures remains a mystery. Although the expressive ability of neural networks is relatively well-understood , computational tractability of training such networks remains a major challenge. In fact, training neural nets is known to be NP-hard even for very small networks . The main challenge is that training neural networks correspond to extremely high-dimensional and nonconvex optimization problems and it is not clear how to provably solve them to global optimality. Worst-case pessimism aside, these networks are trained successfully in practice via local search heuristics on real or randomly generated data. In particular, over-parameterized neural networks-where the number of parameters exceed the number of data samples-can be optimized to global optimality using local search heuristics such as gradient or stochastic gradient methods . In this paper we wish to provide theoretical insights into this phenomenon by developing a better understanding of optimization landscape of such over-parameterized shallow neural networks.

2 Problem formulation and models

We can now rewrite our input-output model (1.1) in the more succinct form

In this paper we wish to understand the landscape of optimization problems of the form (1.4).

Main results

In this section we discuss the main results of this paper. The first set of results focus on understanding the global landscape of neural network optimization with one hidden layer with a particular activation function. We also discuss how this landscape characterization enables algorithms to find a global optima in polynomial time. The second set of results focuses on the local convergence of gradient descent but applies to a broad set of activation functions.

We begin by discussing some global properties of the loss function of training neural networks.

There are no spurious local minima, i.e. all local minima are global.

Furthermore, for almost every data inputs $\{\bm{x}_{i}\}_{i=1}^{n}$ , as long as

the global optimum of $\mathcal{L}(\bm{W})$ is zero. Here, $c>0$ is a fixed numerical constant.

The above result states that given an arbitrary data set, the optimization landscape of fitting neural networks have favorable properties that facilitate finding globally optimal models. In particular, by setting the weights of the last layer to have diverse signs all local minima are global minima and all saddles have a direction of negative curvature. This in turn implies that gradient descent on the input-to-hidden weights, when initialized at random, converges to a global optima. All of this holds as long as the neural network is sufficiently wide in the sense that the number of hidden units exceed the dimension of the inputs by a factor of two ( $k\geq 2d$ ).

An interesting and perhaps surprising aspect of the first part of this theorem is its generality: it applies to any arbitrary data set of input/label pairs of any size! However, this result only guarantees convergence to a global optima but does not explain how good this global model is at fitting the data. Stated differently, it does not provide a bound on the optimal value. Such a bound may not be possible with adversarial data. However, at least intuitively, one expects the optimal value to be small when the input data $\bm{x}_{i}$ are sufficiently diverse and the neural network is sufficiently over-parameterized. The second part of Theorem 2.1 confirms that this is indeed true. In particular assuming the input data $\bm{x}_{i}$ are distributed i.i.d. $\mathcal{N}(0,\bm{I}_{d})$ , and the total number of weights ( $kd$ ) exceeds the number of data samples ( $n$ ), the globally optimal model perfectly fits the labels (has optimal value of ). The interesting part of this result is that it still holds with arbitrary labels. Thus, this theorem shows that with random inputs, and a sufficiently diverse choice of the hidden-output weights, using gradient descent to iteratively update the input-hidden layer weights converges to a model that provably fits arbitrary labels! This result is also inline with recent numerical evidence in that demonstrates that stochastic gradient descent learns deep, over-parametrized models with zero training error even with an arbitrary choice of labels.

While the above theorem shows that the saddles are strict and there is a direction of negative curvature at every saddle point, the margin of negativity is not quantified. Thus, the above theorem does not provide explicit convergence guarantees. In the theorem below we provide such a guarantee albeit for a more restrictive data model.

with $c$ a fixed numerical constant, then the training loss as a function of the input-output weights $\bm{W}$ of the hidden layer

obeys the following two properties, for almost every input data $\{\bm{x}_{i}\}_{i=1}^{n}$ .

There are no spurious local minima, i.e. all local minima are global optima.

The global optima has loss value equal to zero ( $\mathcal{L}(\bm{W})=0$ ).

Then all points $\bm{W}$ satisfying the approximate local minima condition

with probability at least $1-12de^{-\gamma d}-8/d$ . Here $\gamma$ is a fixed numerical constant.

The above result considers the setting where the input-output data set is generated according to a neural network model with Gaussian random input vectors. We show that if the data is generated according this model, then as long as the neural network is over-parameterized ( $n\lesssim kd$ ) then all points obeying condition (2.1) have small objective value.

The reason this result is useful is that points obeying the approximate local minimum condition (2.1) can be found in time depending polynomially on $\frac{1}{\epsilon_{g}}$ and $\frac{1}{\epsilon_{H}}$ . Algorithms that provably work in this setting include cubic regularization , trust region methods , approximate cubic regularization schemes , randomly initialized and perturbed (stochastic) gradients . Therefore, the above theorem demonstrates that a weight matrix with small training error(i.e. $\mathcal{L}(\bm{W})\leq\epsilon$ ) can be found in a computationally tractable manner (specifically with poly $(\frac{1}{\epsilon})$ computational effort).

2 Local convergence analysis with general activations

We now extend our analysis to general activations functions. However, our results in this section require a sufficiently close initialization to the global optimum. Starting from such a sufficiently close initialization we apply gradient descent updates based on the loss function (1.4). Our results apply to any differentiable activation function with bounded first and second order derivatives. We believe our result will eventually extend also to non-differentiable activations by smoothing techniques but we do not pursue this direction in this paper.

Before we state our theorem, we make a technical assumption regarding the activation $\phi$ .

Function $\gamma(\sigma)$ is nonzero everywhere.

Function $\gamma(\sigma)$ is identical to zero.

We note that $\mu(\sigma)$ and $\gamma(\sigma)$ can be thought of as the average slope and curvature of the activation function. Thus the assumption on $\mu(\sigma)$ can be interpreted as the activation should have a nonzero average slope for all $\sigma>0$ (i.e. the mapping is somewhat correlated with the identity mapping $\phi(x)=x$ , under any gaussian measure) or has average slope equal to zero for all $\sigma>0$ (i.e. the mapping has no correlation with the identity mapping, under any gaussian measure).

We provide several examples of an activation function that satisfy Assumption 2.3.

(Softplus) $\phi_{b}(z)=\frac{1}{b}\log(1+e^{bz})$ , for a fixed $b>0$ .

(Sigmoid) $\phi_{b}(z)=\frac{1}{1+e^{-bz}}$ , for a fixed $b>0$ .

(Erf) $\phi(z)=\frac{2}{\sqrt{\pi}}\int_{0}^{z}e^{-t^{2}/2}{\rm d}t$ .

Note that all of these activations obey $\mu(\sigma)>0$ as they are all strictly increasing. Furthermore, the Softplus activation satisfies Assumption 2.3 (a) because it is strictly convex, while the other three activations satisfy Assumption 2.3 (b), because $\phi^{\prime\prime}$ is an odd function in these cases. These activations along with the popular ReLU activation ( $\phi(z)=\max(0,z)$ ) are depicted in Figure 1.

We now have all the elements to state our local convergence result.

for some fixed constants $v_{\min},v_{\max},w_{\min}$ , and $w_{\max}$ . Further, assume that $k\geq d$ and

for a sufficiently small constant $c_{0}>0$ .

To estimate the weights $\bm{v}^{*}$ and $\bm{W}^{*}$ , we start from initial weights $\bm{v}_{0}$ and $\bm{W}_{0}$

and apply gradient descent updates of the form

with the learning rate obeying $\alpha\leq 1/\beta$ . Then there is an event of probability at least $1-n^{-1}-2ne^{-b\sqrt{d}}$ , such that on this event starting from any initial point obeying (2.4)-(2.5) the gradient descent updates (2.5) satisfy

and $b,C_{0},c_{0},C,c>0$ are fixed numerical constants.

We note that the above theorem is still valid if Assumption 2.3 (b) holds in lieu of Assumption 2.3 (a). The only change is that the term $\sigma_{\min}(\bm{W})$ should be now replaced by one in Equations (2.3), (2.4) and (2.5).

The above result considers the setting where the input-output data set is generated according to a neural network model with a general activation and Gaussian random input vectors. We show that if the data is generated according this model, then as long as the neural network is over-parameterized ( $n\lesssim kd$ ) then gradient descent converges to the planted model when initialized close to this planted model. This implies that a training error of zero can be achieved locally, using gradient descent. We would like to note that assumptions (2.2) on the weights of the planted neural networks are only made to avoid unnecessarily complicated expressions in the statement of the theorem. As it will become clear in the proofs (Section 6) our result continues to hold without these assumptions.

Numerical experiments

In this section, we numerically investigate whether gradient descent finds the global optimum for various configurations of $n,k,d$ . In our first experiment, the data is generated from a planted Gaussian model of the form

and $\bm{v}^{*}$ the all-one vector. We consider two activations, namely the softplus $\phi(z)=\frac{1}{10}\log\left(1+e^{10z}\right)$ and the quadratic activation $\phi(z)=z^{2}$ .

In Figure 2, we show the results for the softplus activation with $n=100$ . In Figure 2(a), we fix the input dimension at $d=10$ and vary the number of hidden nodes $k$ . For each value of ( $n$ , $k$ , $d$ ) we carryout $10$ experiments. In each experiment, we generate $\bm{W}^{*}$ at random with i.i.d. $\mathcal{N}(0,1)/\sqrt{d}$ entries. For each value of $\bm{W}^{*}$ we carry out $10$ trials where in each trial we run gradient descent starting from a random initialization pair $(\bm{v}_{0},\bm{W}_{0})$ generated at random with $\bm{v}_{0}$ consisting of i.i.d. Rademacher $\pm 1$ entries and $\bm{W}_{0}$ with i.i.d. $\mathcal{N}(0,1)/\sqrt{d}$ entries. If a global minimum was obtained for every single initialization, we declare that the loss function has no spurious local minima for the corresponding $\bm{W}^{*}$ . In Figure 2 we plot the empirical probability that the loss function has no spurious local optima for the softplus activation. Dots correspond to the simulation results and the solid curve is obtained by fitting a logistic model to the results. In Figure 2(a) we focus on different number of hidden units, with the input dimension fixed at $d=10$ . The dashed line indicates the value of $k$ ( $k=11.40$ ) at which the fitted logistic model crosses the probability $1/2$ . As $k$ grows we enter a region that the gradient descent frequently converges to global minimizers, and for $k\geq 13$ every single local minimizer found by gradient descent is a global minimizer. Note that in this regime the number of parameters $(kd)$ exceeds the sample size $n$ . This suggests that a modest amount of over-parameterization is sufficient for the no spurious local optima conclusion to hold and the global convergence to occur. In Figure 2(b), we fix the number of hidden nodes at $k=10$ and vary the input dimension $d$ . We observe a similar trend: As $d$ grows we enter a region where every single local minimizer found by gradient descent is a global minimizer. Here, the dashed line is located at $d=10.70$ , corresponding to a $1/2$ probability of converging to a global minima. For $d\geq 12$ , gradient descent always finds a global minimizer.

We also repeat a similar experiment with quadratic activations and report the results in Figure 3. In Figure 3(a), we fix the input dimension at $d=20$ and the number of samples at $n=100$ and vary $k$ . The dashed line here is located at $k=5.5$ . In Figure 3(b), we fix the number of hidden units at $k=5$ and vary $d$ . The dashed line in this experiment is located at $d=25.40$ . These experiments further corroborate our theory that over-parameterization helps gradient descent find a global minimizer with quadratic activations.

In the next set of experiments, we generate the labels and features i.i.d. at random

We fit a softplus one-hidden layer network of the form $\bm{x}\rightarrow\bm{1}^{T}\phi\left(\bm{W}\bm{x}\right)$ to this data with varying number of hidden units $k$ . In Figure 4 we plot the square root of the objective function $\mathcal{L}$ as a function of iterations. Interestingly, we observe that even for randomly labeled data when $kd>1.5n$ , gradient descent is able to find a global minimizer. These experiments show that simple gradient descent is almost always able to find a global minimizer in the sufficiently over-parametrized regime, regardless of whether the data is generated from a planted model.

Related Work

As mentioned earlier, neural networks have enjoyed great empirical success . To explain this success, many papers have studied the expressive ability of shallow neural networks dating back to the 80s (e.g. see ). More recently, interesting results have focused on the expressive ability of deeper and sparser architectures . Computational tractability of training networks however is still a major challenge. In fact, training neural nets is known to be NP-hard even for very small networks . Despite this worst-case pessimism, local search heuristics such as gradient descent are surprisingly effective. For instance, empirically demonstrated that sufficiently over-parametrized networks can be efficiently optimized to near global optimality with stochastic gradient descent.

For a neural network with zero hidden units and a single output with a monotonic activation function $\sigma$ , numerous authors have shown that gradient-based methods converge to the global optimum under various assumptions and models. As a result of this literature a good understanding of the optimization landscape of learning single activations have emerged (at least when the data set is generic). Roughly stated, in this case the loss function only has a single local optima that coincides with the global optima and the gradient descent direction is sufficiently correlated with the direction pointing to the global optimum. Thus these results are able to explain the success of gradient-based methods. However, when there are hidden units, the loss function exhibits a completely different landscape, so that such analyses do not easily generalize.

We now turn our attention to global convergence results known for deeper architectures. For a 2-layered network with leaky ReLU activation, showed that gradient descent on a modified loss function can obtain a global minimum of the modified loss function; however, this does not imply reaching a global minimum of the original loss function. Under the same setting, showed that critical points with large “diversity” are near global optimality. used several assumptions to simplify the loss function to a polynomial with i.i.d. Gaussian coefficients. They then showed that every local minima of the simplified loss has objective value comparable to the global minima. used similar assumptions to show that all local minimum are global minimum in a nonlinear network. However, require an independent activations assumption meaning the activations of the hidden units are independent of the input and/or mutually independent, which is violated in practice. In comparison, our global convergence results hold for any arbitrary data set, without any additional assumptions. However, our global convergence results (see Section 2.1) only focus on quadratic activations with a single hidden layer.

We would like to mention a few interesting results regarding asymptotic characterizations of the dynamics of training shallow neural networks with Gaussian inputs using ideas from statistical physics . Saad and Solla study the online dynamics of learning a fully connected neural network with one-hidden layer in a Gaussian planted model (or Gaussian student-teacher model). In this model the output labels are generated according to planted weight vectors (a.k.a. a teacher network) with Gaussian input data. Then a new “student” network is trained to find the teacher weights. This result focuses on the case where the hidden-to-output weights are fixed to $+1$ (a.k.a. soft committee machines) and the asymptotic regime where the number of inputs and the iterations tend to infinity ( $d\rightarrow+\infty$ and $\tau\rightarrow\infty$ ). The authors provide certain dynamical equations that describes the asymptotic behavior of the training process for one-hidden layer neural networks using an “online SGD” or “resampled SGD” procedure where the SGD update is performed using a fresh and independent data point per new iteration. These dynamical equations do not admit a closed form solution, but do enable the analysis of the convergence behavior close to stationarity. Furthermore, by simulating these dynamical equations it is possible to gain some insights about the evolution of the online SGD updates. In particular, the authors use such simulations to gain insights into the convergence/divergence of such neural networks in the special case where diagonal input-hidden weights ( $\bm{W}$ ) are used. The paper also studies studies soft-committee machines under similar assumptions to those of . The authors of use tools from statistical physics (and in particular the aforementioned dynamical equations) to demonstrate that the dynamics of training such neural networks has several fixed points and find learning-rate dependent phenomena, such as splitting and disappearing of fixed points. This paper also differs from in that it also applies to transient iterations (finite $\tau$ ). Finally, the paper also studies soft committee machines but instead of online SGD, they present the analysis for a locally optimized online learning algorithm that focuses on extracting the largest possible amount of information from each new sample. The authors also demonstrate numerically that this choice leads to faster escape from the plateaux and hence faster decay of the generalization error. Our results are not directly comparable to this literature as it differs in terms of assumptions and conclusions in a variety of ways. In particular we focus on (1) analytic formulas, (2) gradient descent without resampling, (3) a non-asymptotic regime (finite $d$ ) and transient dynamics (finite $\tau$ ), (4) the over-parameterized case $n<kd$ , and (5) general hidden-output weights. Some of the results presented in this paper also apply without assuming a planted model or random/Gaussian input data (e.g. see Theorem 2.1).

We would like to note that there is also interesting growing literature on learning shallow neural networks with a single hidden layer with i.i.d. inputs, and under a realizable model (i.e. the labels are generated from a network with planted weights) . For isotropic Gaussian inputs, shows that with two hidden unites ( $k=2$ ) there are no critical points for configurations where both weight vectors fall into (or outside) the cone of ground truth weights. With the same assumptions, proves that for a single-hidden ReLU network with a single non-overlapping convolutional filter, all local minimizers of the population loss are global; they also give counter-examples in the overlapping case and prove the problem is NP-hard when inputs are not Gaussian. show that for single-hidden layer networks with non-standard activation functions gradient descent converges to global minimizers. focuses on a Recursive Neural Net (RNN) model where the hidden to output weights are close to the identity and shows that stochastic methods converge to the planted model over the population objective. studies general single-hidden layer networks and shows that a version of gradient descent which uses a fresh batch of samples in each iteration converges to the planted model. This holds using an initialization obtained via a tensor decomposition method. Our approach and local convergence results differ from this literature in a variety of different ways. First, unlike some of these results such as , we study the optimization properties of the empirical function, not its expected value. Second, we focus on the over-parametrized regime ( $n<<kd$ ) which is the regime where most popular neural networks are trained. Mathematically, this is an important distinction as in this data-poor regime the empirical loss is no longer close to the population loss and therefore one can no longer infer the convergence behavior of gradient descent based on connecting the empirical loss to the population loss. Third, we optimize over both weights $\bm{v}$ and $\bm{W}$ , while most of this literature assumes $\bm{v}=\bm{v}^{*}=\bm{1}$ . Finally, our framework does not require a fresh batch of samples per new gradient iteration as in .

Another line of research focuses on improper learning models using kernel-based approaches. These results hold under much more general data models than the realizable case. However, the practical success of deep learning is intricately tied to using gradient-based training procedures, and the learnability of these networks using improper learning does not explain the success of gradient-based methods. Related, proposes a method of moments estimator using tensor decomposition.

Finally, we would like to mention a few interesting recent developments that appeared after the initial arxiv submission of this paper. The authors of demonstrated analytically (with variable precision arithmetics) that without over-parameterization the landscape of one-hidden layer neural networks has bad local minima as soon as the number of hidden unites exceeds six ( $k\geq 6$ ). This interesting result clearly demonstrates that the over-parametrization is necessary to ensure a favorable optimization landscape when training one-hidden layer neural networks. Another recent paper by Soudry and Hoffer suggests that “most” local minima are global in the over-parameterized regime. This paper provides further evidence that over-parametrizing neural networks leads to more favorable optimization landscapes when training such neural nets. Finally, the recent paper proves that gradient descent when initialized at zero can lead to training errors close to (but not equal to) zero. This result holds for one-hidden layer neural networks with quadratic activations and Gaussian inputs. Developing rigorous convergence guarantees to a global optima with training error equal to zero which also applies to other activations is an important future research direction.

Preliminaries and notations

Before we dive into the proofs in the next two sections we collect some useful results and notations in this section.

Throughout the paper we use $c,C$ to refer to constants whose values may change from line to line.

2 Derivative calculations

In this section we gather some derivative calculations that we will use throughout the proof. As a reminder the loss function is given by

We begin by calculating the gradient with respect to the weight matrix $\bm{W}$ . To this aim we begin by calculating the gradient with respect to the $q$ th row of $\bm{W}$ denoted by $\bm{w}_{q}$ . This is equal to

Aggregating these gradients as a row vector and setting $\bm{D}_{\bm{v}}={\rm diag}(v_{1},\dotsc,v_{k})$ we arrive at

Using this Jacobian matrix we can write the vectorized version of the gradient, i.e. gradient with respect to vect $(\bm{W})$ as

Taking the derivative with respect to $v_{q}$ we arrive at

Thus the gradient with respect to $\bm{v}$ is equal to

2.2 Second order derivatives

Using the expression for the Jacobian matrix the latter can also be written in the form

3 Useful identities involving matrix products

In this section we gather a few preliminary results regarding matrix products that will be useful throughout our proofs. Most of these identities are trivial and we thus skip their proof. The only exception is Lemma 5.3 which is proved in Appendix A.

where ${\rm vec}(M)$ denotes the vectorization of the matrix $\bm{M}$ formed by stacking the columns of $\bm{M}$ into a single column vector.

For any two matrices $\bm{A}$ and $\bm{B}$ , we have

4 Useful probabilistic identities involving random vectors and matrices

In this Section we gather some useful probabilistic identities that will be used throughout our proofs. We defer the proof of these results to Appendix A. The first result, proven in Appendix A.2, relates the tail of a random variable to a proper Orlicz norm.

Suppose that a random variable $Y$ satisfies the following tail bound

Then, $\|Y\|_{\psi_{1}}\leq 9\max(\sqrt{A/(2c)},B/c)$ .

The second result concerns the minimum eigenvalue of the Khatrio-Rao product of a generic matrix with itself.

We refer to Appendix A.3 for the proof of Lemma 5.6.

Finally, we state a Lemma based on that allows us to lower bound the loss function $\mathcal{L}(\bm{v},\bm{W})$ in terms of how close $(\bm{v},\bm{W})$ is to $(\bm{v}^{*},\bm{W}^{*})$ . We prove this Lemma in Appendix A.4.

holds with probability at least $1-10de^{-\gamma d}-8/d$ . Here, $\left\|\cdot\right\|_{*}$ denotes the nuclear norm i.e. sum of singular values of the input matrix.

Proof of global landscape results

In this section we prove the two global theorems.

We begin by gathering the derivatives of the loss function for quadratic activations. We note that for quadratic activations the loss function (1.4) as a function of $\bm{W}$ takes the form

Using (5.4) the Jacobian matrix is given by

Using this Jacobian matrix we can write the vectorized version of the gradient, i.e. gradient with respect to vect $(\bm{W})$ as

Also (5.2.2) specialized to quadratic activations results in the partial Hessian

2 Proof of Theorem 2.1

The objective value is convex in terms of $\bm{M}$ (in fact its a quadratic). Therefore, at any point where the gradient with respect to $\widetilde{\bm{M}}$ is zero i.e. $\nabla f(\widetilde{\bm{M}})=0$ , that point must be a global optimum of $f$ . More specifically, any point $\widetilde{\bm{M}}$ obeying

Thus any $\widetilde{\bm{W}}$ obeying (6.5) is a global optima of $\mathcal{L}(\bm{W})$ , concluding the proof.

We will prove the first two conclusions of the theorem (i.e. no spurious local optima and saddles have a direction of negative curvature) simultaneously. To show this note that since the function $\mathcal{L}$ is twice differentiable all local optima or saddles that do not have a direction of negative curvature obey

We will prove that all points obeying (6.8) are a global optima of $\mathcal{L}(\bm{W})$ . To this aim let us define the sets

i.e. the indices of the positive and negative entries of $\bm{v}$ . Now note that two cases are possible

Case I: At least one of the sub-matrices $\left(\bm{D}_{\bm{v}}\bm{W}\right)_{\mathcal{S}_{+}}$ and $\left(\bm{D}_{\bm{v}}\bm{W}\right)_{\mathcal{S}_{-}}$ is not rank deficient. That is,

Case II: Both of the sub-matrices $\left(\bm{D}_{\bm{v}}\bm{W}\right)_{\mathcal{S}_{+}}$ and $\left(\bm{D}_{\bm{v}}\bm{W}\right)_{\mathcal{S}_{-}}$ are rank deficient. That is,

In both cases we will show that for any $\bm{W}$ obeying (6.8), we have

The latter together with Lemma 6.1 immediately implies that any $\bm{W}$ obeying (6.8) is a global optimum, proving that there are no spurious local minima and no saddles which do not have a direction of strict negative curvature. We now proceed to show that indeed $\sum_{i=1}^{n}r_{i}\bm{x}_{i}\bm{x}_{i}^{T}=0$ holds in both cases mentioned above.

Multiplying both sides of the above equality on the left by $\bm{M}$ we conclude that

Equations (6.13) and (6.14) together imply that

2.2 Proof of zero training error

Note that in the previous section we proved that all global optima of $\mathcal{L}(\bm{W})$ obey

The latter identity can be rewritten in the form

3 Proof of Theorem 2.2

First note that since we assume both $\bm{v}$ and $\bm{v}^{*}$ have the same sign pattern we can without loss of generality assume both $\bm{v}$ and $\bm{v}^{*}$ have non-negative entries. We will first prove that there are no spurious local minima, and all saddles have a direction of negative curvature, and all global optima have zero loss value. We then proceed to show all approximate local minima have small objective value in Section 6.3.2.

We begin by noting that the function $\mathcal{L}$ is twice differentiable and thus all local optima or saddle points that do not have a direction of negative curvature obey

Note that $\nabla\mathcal{L}(\bm{W})=\bm{0}$ implies that

Since we assume $k\geq d$ , two cases can occur.

Case I: rank( $\left(\bm{D}_{\bm{v}}\bm{W}\right)_{\mathcal{S}})=d$ .

Case II: rank( $\left(\bm{D}_{\bm{v}}\bm{W}\right)_{\mathcal{S}})<d$ .

In both cases we will show that any $\bm{W}$ obeying (6.8), must be a global optimum.

The latter together with Lemma 6.1 immediately implies that any $\bm{W}$ obeying (6.16) and the condition of Case I is a global optimum, proving that there are no spurious local minima. Furthermore using Lemma 5.6, $\sigma_{\min}(\bm{X}\ast\bm{X})>0$ as long as $n\leq cd^{2}$ with $c$ a fixed numerical constant. This combined with (6.18) implies that $\bm{r}=\bm{0}$ , completing the proof for zero training error in this case.

The latter together with (6.3.1) implies that

Thus for any $\bm{W}$ obeying (6.16) and the condition of Case II we have

where (a) follows from (6.17) and (b) follows from (6.21) combined with the fact that $\bm{v}^{*}$ has nonnegative entries. Thus, $\mathcal{L}(\bm{W})=0$ and therefore such a $\bm{W}$ must be a global optimum. This completes the proof in Case II.

3.2 Bounding the objective value for approximate local minima

Assume $\bm{W}$ is an approximate local minima. That is it obeys

For such a point by Cauchy-Schwarz we have

Furthermore, for such a point using (5.2.2)

Thus using (6.27), (6.23), and (6.26) we conclude that for any point $\bm{W}$ obeying (6.22) we have

Here, (a) follows from (6.27), (b) from (6.23) and (c) from Holder’s inequality combined with (6.26).

We proceed by showing that for a point $\bm{W}$ obeying $\left\|\nabla\mathcal{L}(\bm{W})\right\|_{F}\leq\epsilon_{g}$ it’s Frobenius norm norm is also sufficiently small. To this aim note that for such a point by Cauchy-Schwarz have

Using Holder’s inequality for matrices we have

Thus by standard concentration of sample covariance matrices we conclude that for any $\delta>0$ , as long as $n\geq\frac{c}{\delta^{2}}d$ for a fixed numerical constant $c$ , then for all $\bm{W}$

holds with probability at least $1-2e^{-bd}$ , for a fixed constant $b>0$ . Thus with high probability for all $\bm{W}$ we have

Plugging the latter into (6.3.2) we conclude that

We now apply Lemma 5.7 to conclude that as long as $n\geq c(\delta)d\log d$

holds with probability at least $1-10de^{-bd}-8/d$ . The latter implies that with high probability for all $\bm{W}$ we have

Plugging the latter into (6.31) and we conclude that with high probability for all $\bm{W}$ we have

Plugging the latter into (6.29) implies that

Using the assumption that $\bm{v}=\nu\bm{1}$ , the latter is equivalent to

From the latter we can conclude that for all $a>0$ we have

Setting $a=\frac{\epsilon_{g}}{\nu(1-\delta)^{2}}$ and $\delta=1/2$ , the latter inequality implies that

Plugging the latter into (6.3.2) we conclude that for all $\bm{W}$ obeying (6.22)

holds with high probability, concluding the proof.

Proof of local convergence results

In this section we prove Theorem 2.5. We begin by explaining a crucial component of our proof which involves bounding the spectrum of the Jacobian matrix. These results are the subject of the next section with the corresponding proofs appearing in Section 7.2. We then utilize these intermediate results to finalize the proof of Theorem 2.5 in Section 7.3.

Recall that the Jacobian matrix is given by $\bm{J}=\bm{D_{v}}\phi^{\prime}(\bm{W}\bm{X})\ast\bm{X}$ . To bound the spectrum of the Jacobian we first introduce a new matrix obtained by centering the $\phi^{\prime}(\bm{W}\bm{X})$ component. Precisely, we let

with probability at least $1-2e^{-c^{\prime}\sqrt{d}}$ , where $c^{\prime},C>0$ are constants (depending on $c$ ).

In addition, with probability at least $1-2e^{-n/2}$ , we have

holds for a sufficiently small constant $c>0$ . Then, there exist constants $C>0$ , such that,

holds with probability at least $1-ne^{-b_{1}\sqrt{n}}-n^{-1}-2ne^{-b_{2}d}$ with $b_{1},b_{2}>0$ fixed numerical constants.

As discussed in the proof of Proposition 7.2, when Assumption 2.3 (b) holds in lieu of Assumption 2.3 (a), then $\bm{\Gamma}=\bm{0}$ . In this case the statement of the above proposition must be modified as follows. Equation (7.6) should be replaced with

Further, we have the following bounds in lieu of equations (7.7) and (7.8).

We now combine Propositions 7.1 and 7.2 to characterize the spectrum of $\bm{J}$ . We defer the proof of this result to Appendix C.

for some fixed constants $v_{\min},v_{\max},w_{\min},w_{\max}$ . Further, assume that $k\geq d$ and

for a sufficiently small constant $c_{0}>0$ . Then, there exist constants $C\geq c>0$ , such that,

holds with probability at least $1-n^{-1}-2ne^{-b\sqrt{d}}$ with $b>0$ a fixed numerical constant.

Also when Assumption 2.3 $(b)$ holds in lieu of Assumption 2.3 $(a)$ , the term $\sigma_{\min}(\bm{W})$ should be replaced by one in Equations (7.11) and (7.12).

holds with probability at least $1-ne^{-b_{1}\sqrt{n}}-n^{-1}-2ne^{-b_{2}d}$ with $b_{1},b_{2}>0$ fixed numerical constants.

2 Proof of Proposition 7.2

Our proof strategy consists of four main steps:

(Bounding singular values). We prove bounds on singular values of a matrix with independent columns and bounded sub-exponential norms.

where the expectation is with respect to the randomness in the inputs $\bm{x}_{i}$ .

Proof We prove this lemma by contradiction i.e. assume

which is a contradiction. Note that in the penultimate inequality we used the fact that $\bm{M}\bm{D_{v}}\bm{\Gamma}\bm{W}=\bm{I}$ and the last inequality holds because $\|\bm{D_{M}}\|=\sigma_{\max}(\bm{M})$ .

2.2 Dropping rows

2.3 Bounding sub-exponential norms

In order to bound the right-hand-side of (7.18), we need to study the tail of the random variable

holds with $C$ a fixed numerical constant and $K=\sqrt{L^{2}\sigma_{\max}^{2}(\bm{W})+1}$ .

We apply Proposition 7.8, with $\bm{U}=\bm{\widetilde{U}}\bm{M}\bm{D_{v}}$ and use Lemma 5.5, to conclude that

2.4 Bounding minimum singular value

Let $\bm{u_{1}},\bm{u_{2}},\dotsc,\bm{u_{n}}$ be independent sub-exponential random vectors and also let $\psi=\max_{1\leq i\leq n}\,\|\bm{u_{i}}\|_{\psi_{1}}$ . Furthermore, let $\bm{U}$ be a matrix with $\bm{u_{1}},\dotsc,\bm{u_{n}}$ as columns. Define $\eta_{\min}=\min_{1\leq i\leq n}\|\bm{u_{i}}\|$ and $\eta_{\max}=\max_{1\leq i\leq n}\|\bm{u_{i}}\|$ . Further, set $\xi=\psi K+K^{\prime}$ , where $K,K^{\prime}\geq 1$ are arbitrary but fixed. Also define

We continue by stating a lemma that bounds the Euclidean norm of the random vector $\bm{z}$ . We defer the proof of this lemma to Appendix G.

holds with probability at least $1-2e^{-\frac{t^{2}}{2}}$ .

Further, by concentration of $\chi^{2}$ random variables, with probability at least $1-2e^{-(d-1)/32}$ , the following bounds hold. For any fixed $1\leq i\leq n$ ,

Plugging bound (7.24), with $t=2\sqrt{\log n}$ , and bound (7.25) into (7.23), the followings inequalities hold with probability at least $1-2n^{-2}-2e^{-(d-1)/32}$ ,

Note that the second inequality in (7.26) holds because by our assumption (7.6),

Therefore, by union bounding over $n$ columns with probability at least $1-2n^{-1}-2ne^{-(d-1)/32}$ we have

Also, by (7.22), the maximum sub-exponential norms of the columns is bounded by

We now have all the elements to apply Proposition 7.9. In particular we use $K=1$ and $K^{\prime}=10\rho(\bm{W})$ , to conclude that

which holds true by the assumption given in Equation (7.6).

Note that under Assumption 2.3 (b), we skip the whitening step and by following the same proof, we obtain

2.5 Bounding maximum singular value

where the second inequality follows from our assumption given by Equation (7.6).

Combining Equations (7.38) and (7.39) via the triangular inequality we arrive at:

where the second inequality holds because $n\geq d$ and $\|{\bm{v}}\|_{\infty}\sigma_{\min}(\bm{\Gamma}\bm{W})\leq\sigma_{\max}(\bm{D_{v}}\bm{\Gamma}\bm{W})$ as per Lemma 5.3.

3 Proof of Theorem 2.5

We begin by stating a few useful lemmas. We defer the proofs of these lemmas to the Appendices. Throughout, we assume $|\phi^{\prime}|<B$ and $|\phi^{\prime\prime}|<L$ . We present the proof under Assumption 2.3 (a). The proof under Assumption 2.3 (b) follows by an analogous argument. We begin by a lemma that bound the perturbation of the Jacobian matrix as a function of its inputs. We defer the proof to Appendix H.

holds with probability at least $1-ne^{-b_{1}\sqrt{n}}-n^{-1}-2ne^{-b_{2}d}$ . Here, $b_{1},b_{2}>0$ are fixed numerical constants.

Now note that by Proposition 7.4 for ${\bm{v}}^{*}$ and $\bm{W}^{*}$ , we have

Using this value of $c$ we define the radius

where $C^{\prime}$ is the same quantity that appears in Equation (7.41). Without loss of generality, we can assume $c<4C^{\prime}w_{\max}v_{\max}$ . Plugging this into the definition of $R$ together with the fact that $d\leq n$ , allows us to conclude that $R\leq v_{\max}$ .

In the next lemma, proven in Appendix I, we relate the gradients $\nabla_{(\bm{W})}\mathcal{L}(\bm{v},\bm{W})$ and $\nabla_{\bm{v}}\mathcal{L}(\bm{v},\bm{W})$ to the function value $\mathcal{L}(\bm{v},\bm{W})$ .

For $({\bm{v}},\bm{W})\in\Omega$ , the following inequalities hold with probability at least $1-n^{-1}-2ne^{-b\sqrt{d}}$ for some constant $b>0$ .

Our next lemma, proven in Appendix J, upper bounds the function value $\mathcal{L}(\bm{v},\bm{W})$ in terms of distance of $(\bm{v},\bm{W})$ to the planted solution $(\bm{v},\bm{W})$ .

The following bound holds with probability at least $1-2e^{-b_{0}d}$ for some constant $b_{0}>0$ .

Finally, the lemma below, proven in Appendix K, controls the second order derivative of the loss function $\mathcal{L}({\bm{v}},\bm{W})$ .

The function $\mathcal{L}({\bm{v}},\bm{W})$ is $\beta$ -smooth on $\Omega$ . Namely, there exists an event of probability at least $1-n^{-1}-2ne^{-b\sqrt{d}}$ , such that on this event, for any $({\bm{v}},\bm{W})\in\Omega$ , we have

With this lemmas in place we are now ready to present our local convergence analysis. To this aim note that Lemma 7.12 (Equation (7.44)) implies that for $({\bm{v}},\bm{W})\in\Omega$ , the function $\mathcal{L}({\bm{v}},\bm{W})$ satisfies the Polyak-Lojasiewicz (PL) inequality

We shall now focus on how the loss function value changes in one iteration. Using the $\beta$ -smoothness condition for $\alpha\leq 1/\beta$ we have

where in the last inequality we used (7.50).

Our assumptions on the initial weights $\bm{v}_{0}$ and $\bm{W}_{0}$ in equations (2.5) and (2.4) imply the following identities.

We next show that starting with ${\bm{v}}_{0},\bm{W}_{0}$ obeying (7.52) and (7.53), the entire trajectory of gradient descent remains in the set $\Omega$ . To establish this we use induction on $\tau$ . The induction basis $\tau=0$ is trivial. Assuming the induction hypothesis for $0\leq t\leq\tau-1$ , we show that the result continues to hold for $t=\tau$ .

By our induction hypothesis ( $(\bm{v}_{t},\bm{W}_{t})\in\Omega$ for $t=0,1,\ldots,\tau-1$ ) and (7.51), we have

By iterating the above identity we arrive at

To show that $(\bm{v}_{\tau},\bm{W}_{\tau})\in\Omega$ we proceed by quantifying how far the gradient descent trajectory can get from $(\bm{v}_{0},\bm{W}_{0})$ .

where $(a)$ follows from $({\bm{v}}_{t-1},\bm{W}_{t-1})\in\Omega$ and Lemma 7.12 equation (7.45) and $(b)$ follows from (7.54). Likewise, we obtain

Using bounds (7.55) and (7.56), in order to show that $({\bm{v}}_{\tau},\bm{W}_{\tau})\in\Omega$ , it suffices to show that

The right-hand sides of Equation (7.57) depends on $\max(\widetilde{m}_{\rm{U}},m_{\rm{U}}v_{\max}^{2})$ . The dominant term is $m_{\rm{U}}v_{\max}^{2}$ because it is of order at least $k$ , while $m_{\rm{U}}$ is $O(1)$ . Therefore, the desired bound in (7.57) is equivalent to

We can verify Equation (7.58) by using Lemma 7.13 combined with (7.52) and (7.53). This completes the induction argument and shows that $({\bm{v}}_{\tau},\bm{W}_{\tau})\in\Omega$ for all $\tau\geq 1$ .

Finally, since $({\bm{v}}_{\tau},\bm{W}_{\tau})\in\Omega$ for all $\tau\geq 0$ , (7.54) holds for all $\tau\geq 0$ . Substituting for $m_{\rm{L}}$ , we obtain

where the last step holds because $\sigma_{\min}(\bm{W}^{*})<w_{\max}$ .

This concludes the proof under Assumption 2.3 (a). The claim under Assumption 2.3 (b) can be proven by a similar argument. The only required adjustment is that the initial radius $R$ and the term $m_{\rm{L}}$ should now be defined via $R=\frac{c}{4C^{\prime}}\sqrt{{d}/{n}}$ and $m_{\rm{L}}=c^{2}{d^{2}}/(2n)$ .

Acknowledgements

This work was done in part while M.S. was visiting the Simons Institute for the Theory of Computing. M. Soltanolkotabi is supported by the Packard Fellowship in Science and Engineering, a Sloan Research Fellowship in Mathematics, an NSF-CAREER under award #1846369, the Air Force Office of Scientific Research Young Investigator Program (AFOSR-YIP) under award #FA9550-18-1-0078, an NSF-CIF award #1813877, and a Google faculty research award. A.J. was partially supported by a Google Faculty Research Award. A.J. would also like to acknowledge the financial support of the Office of the Provost at the University of Southern California through the Zumberge Fund Individual Grant Program. We would like to thank Marco Mondelli and Simone Bombari for bringing to our attention a mistake in the argument of Proposition 7.1 in the previous version of this manuscript which has now been corrected. M.S. would like to thank Peter Bartlett for discussions related to .

References

Appendix A Proof of preliminary lemmas

If $\sigma_{\min}(\bm{A})=0$ then the claim is obvious. We therefore assume that $\bm{A}$ has full column rank. Let $i^{*}=\arg\max_{1\leq i\leq k}|v_{i}|$ . We choose $\bm{u}$ such that $\bm{A}\bm{u}=e_{i^{*}}$ . We then have

Equations (A.1) and (A.2) together implies the desired result.

A.2 Proof of Lemma 5.5

Let $\xi=\max(\sqrt{\frac{A}{2c}},\frac{B}{c})$ . This implies that

A.3 Proof of Lemma 5.6

A.4 Proof of Lemma 5.7

We begin by considering the eigenvalue decomposition of the matrix $\bm{A}$ given by

To continue we state a Lemma due to (see also for closely related results).

with $c$ a constant depending only on $\delta$ . Then

holds with probability at least $1-10e^{-\gamma d}-8/d^{2}$ with $\gamma$ a fixed numerical constant.

Applying the union bound to the above lemma we conclude that for the unit norm vectors $\{\bm{v}_{j}\}_{j=1}^{d}$

hold simultaneously with probability at least $1-10de^{-\gamma d}-8/d$ . Combining the latter inequalities together with (A.4) we conclude that

Appendix B Proof of Proposition 7.1

Subtracting the above two expressions we obtain

along with the fact that the Hadamard (entrywise) product of two positive-semidefinite matrices is positive-semidefinite, implies that

with high probability. To this aim note that the latter matrix can be written in the form $\bm{Y}^{T}\bm{Y}$ where

To continue we state the following result from .

To use this theorem we need to verify the assumptions. We first verify the sub-exponential assumption. Observe that $\bm{d}_{i}$ are generated i.i.d. according to the distribution

We next verify the norm bound condition (B.2). Note that we have

Therefore, when $n\leq\frac{1}{4}e^{\sqrt{d}}$ using

Putting the latter two together we conclude that

in (B.2). Note that this choice of $K$ obeys $K\geq 1$ as long as

We next upper bound $\left\|\bm{\Sigma}\right\|$ . Recall the definition

Combining the latter with (B.4) we can conclude that as long as $n\leq cd^{2}$ we have

We next prove the bound in (7.5). Note that

By [51, Corollary 5.35], $\sigma_{\max}(\bm{X})\leq\sqrt{d}+2\sqrt{n}$ , with probability at least $1-2e^{-n/2}$ . The result follows be recalling that $n\geq d$ .

Appendix C Proof of Proposition 7.4

We first prove the claim under Assumption 2.3 (a). We start by bounding the entries of $\bm{\Gamma}$ and mean vector $\bm{\mu}$ .

We next verify the assumption of Proposition 7.1. By our assumptions in the current proposition,

Therefore, the assumption of Proposition 7.1 holds. Using this proposition, the following bounds hold with probability at least $1-2e^{-c^{\prime}\sqrt{d}}$ :

We next lower bound the right-hand side. By our assumption on $n/d^{2}$ , we have

Further, by invoking assumption (7.11), we have

Putting (C.6), (C.7) and (C.8) together we arrive at

We then move to upper bound $\sigma_{\max}(\bm{J})$ .

which immediately implies assumption (7.6). Here, (a) follows from $\sigma_{\min}(\bm{\Gamma}\bm{W})\geq\gamma_{\min}\sigma_{\min}(\bm{W})$ , (b) from $\sigma_{\max}(\bm{W})\geq w_{\min}$ , and (c) from (7.11). With assumption (7.6) in place, we can now combine Proposition 7.2 with (C.9) to conclude that

for some constant $c>0$ by choosing $c_{0}$ sufficiently small. In addition,

for a constants $C^{\prime}>0$ that depends on $v_{\max}$ , $w_{\min}$ , $L$ . The claim under Assumption 2.3 (b) follows in a similar manner by using Remark 7.3 in lieu of Proposition 7.2.

Appendix D Proof of Proposition 7.8

We prove Proposition 7.8 via an asymmetric version of Hanson-Wright inequality. We begin by the definition of the convex concentration property which is the most general condition under which it is known that Hanson-Wright inequality holds.

We will prove the proposition by using a result by on the Hanson-Wright inequality stated below.

We wish to apply this result to the vector $\bm{u}=(\bm{x},\bm{z})^{T}$ . Note that $\bm{u}$ has zero mean. To this aim, we first state a lemma about the convex concentration property for the random vector $\bm{u}$ whose proof appears in Section D.1 below.

Lemmas D.2 and D.3 combined allows us to conclude that

For $\bm{A}=\begin{bmatrix}\bm{0}&\bm{U}\\ \bm{U}^{T}&\bm{0}\end{bmatrix}$ , this is equivalent to

The proof is complete by replacing $t$ with $2t$ .

Thus $\psi(\bm{u})$ is a Lipschitz function of $\bm{x}$ with Lipschitz constant $K=\sqrt{1+L^{2}\sigma_{\max}^{2}(\bm{W})}$ . The convex concentration property follows from Gaussian isoperimetry .

Appendix E Proof of Lemma 7.6

Recalling the definition $\bm{A}:=\bm{D_{v}}\bm{\Gamma}\bm{W}$ and using Lemma 5.1, we arrive at

Appendix F Proof of Proposition 7.9

The following Lemma is similar to [Theorem 3.2].

Let $\bm{u_{1}},\dotsc,\bm{u_{n}}$ be independent sub-exponential random vectors with $\psi=\max_{1\leq i\leq n}\|\bm{u_{i}}\|_{\psi_{1}}$ . Let $\theta\in(0,1/4)$ , $K,K^{\prime}\geq 1$ and assume that

Then setting $\xi=\psi K+K^{\prime}$ , the inequality

We apply Lemma F.1 by setting $\eta=\eta_{\min}$ and letting $\theta$ be the solution of $n\log^{2}(2/\theta)=\theta^{2}\eta^{2}$ . Clearly, $\theta>\sqrt{n}/\eta_{\min}$ . and therefore $\theta<(\sqrt{n}/\eta_{\min})\log(2\eta_{\min}/\sqrt{n})$ . This gives

Using bound (F.2) in (F.1) we obtain the desired result.

Proof We first recall the following Lemma from Adamczak (this applies to vectors with non-trivial covariance).

Let $\bm{u_{1}},\dotsc,\bm{u_{n}}$ be independent sub-exponential vectors with $\psi=\max_{1\leq i\leq n}\|\bm{u_{i}}\|_{\psi_{1}}$ . For $\theta\in(0,1/4)$ and $K\geq 1$ , one has

Thus if $\eta_{\max}\leq K^{\prime}\eta$ for some $K^{\prime}$ , then

where $C_{1}$ is an absolute constant. This completes the proof.

Appendix G Proof of Lemma 7.10

Appendix H Proof of Lemma 7.11

Writing the above identity in matrix form, we get

We proceed by bounding the two terms above. For the first term note that by applying (H.2), we arrive at

holds with probability at least $1-ne^{-b_{1}\sqrt{n}}-n^{-1}-2ne^{-b_{2}d}$ for some constants $b_{1},b_{2}>0$ .

Combining inequalities (H.5) and (H.6), we have

where $C$ depends on constants $L$ and $B$ .

Appendix I Proof of Lemma 7.12

Given that $({\bm{v}},\bm{W})\in\Omega$ , we have

where the first inequality follows from Lemma 7.11 and the second one follows readily from definition of set $\Omega$ and Equation (7.42).

To prove the last bound (i.e. (7.46)), we recall Equation (5.6)

Here, we used the assumption that $\phi(z)$ is $B$ -Lipschitz.

Next, by the Bai-Yin law , we have $\left\|\bm{X}\right\|\leq 4\sqrt{n}$ , with probability at least $1-2e^{-2n}$ . Continuing with Equation (I.5), we have

where in the second inequality, we use the fact that $({\bm{v}},\bm{W})\in\Omega$ and hence $\left\|\bm{W}-\bm{W}^{*}\right\|_{F}\leq 1$ (since $R<v_{\max}$ ). Using bound (I.6) in (I.3), we obtain

Appendix J Proof of Lemma 7.13

By standard concentration of sample covariance matrices we conclude that for any $\delta>0$ , there exist constants $c,b_{0}>0$ , such that for $n\geq\frac{c^{2}}{\delta^{2}}d$ , we have

with probability at least $1-2e^{-b_{0}d}$ . Using this bound, with $\delta=1$ , along with the $1$ -Lipschitz property of the activation $\phi(z)$ , we arrive at

To bound the second term note that we have

where in the last step, we used (J.2) and the fact that $\left\|\bm{W}^{*}\right\|_{F}^{2}\leq kw_{\max}^{2}$ . Combining the bounds on $T_{1}$ and $T_{2}$ , completes the proof.

Appendix K Proof of Lemma 7.14

Note that the spectral norm of a matrix is bounded above by the sum of the spectral norms of the diagonal blocks of that matrix. Hence,

We proceed by bounding each of these two terms. Using Identity (5.2.2), we have

Since $({\bm{v}},\bm{W})\in\Omega$ , we have

Furthermore using (J.2) together with $|\phi^{\prime\prime}|<L$ we conclude that

holds with probability at least $1-2e^{-b_{0}d}$ . Moreover, using the B-Lipschitz property of the activation, we have

where (a) follows from the definition of the set $\Omega$ and (b) follows from $R\leq v_{\max}$ and $k\geq d$ . Note that while we can plug in for $R$ to obtain a tighter bound, we use a looser bound on $R$ as this term will be dominated by the other term (i.e. $\|\bm{J}\bm{J}^{T}/n\|$ ) and thus a more accurate bound is unnecessary.

Restating the bound on $\sigma_{\max}(\bm{J}({\bm{v}},\bm{W}))$ for $({\bm{v}},\bm{W})\in\Omega$ , given by (I.2), we have

We next bound $\left\|\nabla_{{\bm{v}}}^{2}\mathcal{L}({\bm{v}},\bm{W})\right\|$ . Starting with Equation (5.9), we have

Here, $(a)$ holds because $\phi$ is $B$ -Lipschitz; $(b)$ holds because $\left\|\bm{W}\right\|_{F}^{2}\leq kw_{\max}^{2}$ and $\left\|\bm{X}\right\|\leq\sqrt{2n}$ , with probability at least $1-2e^{-2n}$ .