Theoretical insights into the optimization landscape of over-parameterized shallow neural networks
Mahdi Soltanolkotabi, Adel Javanmard, Jason D. Lee
Introduction
Neural network architectures (a.k.a. deep learning) have recently emerged as powerful tools for automatic knowledge extraction from raw data. These learning architectures have lead to major breakthroughs in applications such as visual object classification , speech recognition and natural language processing . Despite their wide empirical use the mathematical success of these architectures remains a mystery. Although the expressive ability of neural networks is relatively well-understood , computational tractability of training such networks remains a major challenge. In fact, training neural nets is known to be NP-hard even for very small networks . The main challenge is that training neural networks correspond to extremely high-dimensional and nonconvex optimization problems and it is not clear how to provably solve them to global optimality. Worst-case pessimism aside, these networks are trained successfully in practice via local search heuristics on real or randomly generated data. In particular, over-parameterized neural networks-where the number of parameters exceed the number of data samples-can be optimized to global optimality using local search heuristics such as gradient or stochastic gradient methods . In this paper we wish to provide theoretical insights into this phenomenon by developing a better understanding of optimization landscape of such over-parameterized shallow neural networks.
2 Problem formulation and models
We can now rewrite our input-output model (1.1) in the more succinct form
In this paper we wish to understand the landscape of optimization problems of the form (1.4).
Main results
In this section we discuss the main results of this paper. The first set of results focus on understanding the global landscape of neural network optimization with one hidden layer with a particular activation function. We also discuss how this landscape characterization enables algorithms to find a global optima in polynomial time. The second set of results focuses on the local convergence of gradient descent but applies to a broad set of activation functions.
We begin by discussing some global properties of the loss function of training neural networks.
There are no spurious local minima, i.e. all local minima are global.
Furthermore, for almost every data inputs , as long as
the global optimum of is zero. Here, is a fixed numerical constant.
The above result states that given an arbitrary data set, the optimization landscape of fitting neural networks have favorable properties that facilitate finding globally optimal models. In particular, by setting the weights of the last layer to have diverse signs all local minima are global minima and all saddles have a direction of negative curvature. This in turn implies that gradient descent on the input-to-hidden weights, when initialized at random, converges to a global optima. All of this holds as long as the neural network is sufficiently wide in the sense that the number of hidden units exceed the dimension of the inputs by a factor of two ().
An interesting and perhaps surprising aspect of the first part of this theorem is its generality: it applies to any arbitrary data set of input/label pairs of any size! However, this result only guarantees convergence to a global optima but does not explain how good this global model is at fitting the data. Stated differently, it does not provide a bound on the optimal value. Such a bound may not be possible with adversarial data. However, at least intuitively, one expects the optimal value to be small when the input data are sufficiently diverse and the neural network is sufficiently over-parameterized. The second part of Theorem 2.1 confirms that this is indeed true. In particular assuming the input data are distributed i.i.d. , and the total number of weights () exceeds the number of data samples (), the globally optimal model perfectly fits the labels (has optimal value of ). The interesting part of this result is that it still holds with arbitrary labels. Thus, this theorem shows that with random inputs, and a sufficiently diverse choice of the hidden-output weights, using gradient descent to iteratively update the input-hidden layer weights converges to a model that provably fits arbitrary labels! This result is also inline with recent numerical evidence in that demonstrates that stochastic gradient descent learns deep, over-parametrized models with zero training error even with an arbitrary choice of labels.
While the above theorem shows that the saddles are strict and there is a direction of negative curvature at every saddle point, the margin of negativity is not quantified. Thus, the above theorem does not provide explicit convergence guarantees. In the theorem below we provide such a guarantee albeit for a more restrictive data model.
with a fixed numerical constant, then the training loss as a function of the input-output weights of the hidden layer
obeys the following two properties, for almost every input data .
There are no spurious local minima, i.e. all local minima are global optima.
The global optima has loss value equal to zero ().
Then all points satisfying the approximate local minima condition
with probability at least . Here is a fixed numerical constant.
The above result considers the setting where the input-output data set is generated according to a neural network model with Gaussian random input vectors. We show that if the data is generated according this model, then as long as the neural network is over-parameterized () then all points obeying condition (2.1) have small objective value.
The reason this result is useful is that points obeying the approximate local minimum condition (2.1) can be found in time depending polynomially on and . Algorithms that provably work in this setting include cubic regularization , trust region methods , approximate cubic regularization schemes , randomly initialized and perturbed (stochastic) gradients . Therefore, the above theorem demonstrates that a weight matrix with small training error(i.e. ) can be found in a computationally tractable manner (specifically with poly computational effort).
2 Local convergence analysis with general activations
We now extend our analysis to general activations functions. However, our results in this section require a sufficiently close initialization to the global optimum. Starting from such a sufficiently close initialization we apply gradient descent updates based on the loss function (1.4). Our results apply to any differentiable activation function with bounded first and second order derivatives. We believe our result will eventually extend also to non-differentiable activations by smoothing techniques but we do not pursue this direction in this paper.
Before we state our theorem, we make a technical assumption regarding the activation .
Function is nonzero everywhere.
Function is identical to zero.
We note that and can be thought of as the average slope and curvature of the activation function. Thus the assumption on can be interpreted as the activation should have a nonzero average slope for all (i.e. the mapping is somewhat correlated with the identity mapping , under any gaussian measure) or has average slope equal to zero for all (i.e. the mapping has no correlation with the identity mapping, under any gaussian measure).
We provide several examples of an activation function that satisfy Assumption 2.3.
(Softplus) , for a fixed .
(Sigmoid) , for a fixed .
(Erf) .
Note that all of these activations obey as they are all strictly increasing. Furthermore, the Softplus activation satisfies Assumption 2.3 (a) because it is strictly convex, while the other three activations satisfy Assumption 2.3 (b), because is an odd function in these cases. These activations along with the popular ReLU activation () are depicted in Figure 1.
We now have all the elements to state our local convergence result.
for some fixed constants , and . Further, assume that and
for a sufficiently small constant .
To estimate the weights and , we start from initial weights and
and apply gradient descent updates of the form
with the learning rate obeying . Then there is an event of probability at least , such that on this event starting from any initial point obeying (2.4)-(2.5) the gradient descent updates (2.5) satisfy
and are fixed numerical constants.
We note that the above theorem is still valid if Assumption 2.3 (b) holds in lieu of Assumption 2.3 (a). The only change is that the term should be now replaced by one in Equations (2.3), (2.4) and (2.5).
The above result considers the setting where the input-output data set is generated according to a neural network model with a general activation and Gaussian random input vectors. We show that if the data is generated according this model, then as long as the neural network is over-parameterized () then gradient descent converges to the planted model when initialized close to this planted model. This implies that a training error of zero can be achieved locally, using gradient descent. We would like to note that assumptions (2.2) on the weights of the planted neural networks are only made to avoid unnecessarily complicated expressions in the statement of the theorem. As it will become clear in the proofs (Section 6) our result continues to hold without these assumptions.
Numerical experiments
In this section, we numerically investigate whether gradient descent finds the global optimum for various configurations of . In our first experiment, the data is generated from a planted Gaussian model of the form
and the all-one vector. We consider two activations, namely the softplus and the quadratic activation .
In Figure 2, we show the results for the softplus activation with . In Figure 2(a), we fix the input dimension at and vary the number of hidden nodes . For each value of (, , ) we carryout experiments. In each experiment, we generate at random with i.i.d. entries. For each value of we carry out trials where in each trial we run gradient descent starting from a random initialization pair generated at random with consisting of i.i.d. Rademacher entries and with i.i.d. entries. If a global minimum was obtained for every single initialization, we declare that the loss function has no spurious local minima for the corresponding . In Figure 2 we plot the empirical probability that the loss function has no spurious local optima for the softplus activation. Dots correspond to the simulation results and the solid curve is obtained by fitting a logistic model to the results. In Figure 2(a) we focus on different number of hidden units, with the input dimension fixed at . The dashed line indicates the value of () at which the fitted logistic model crosses the probability . As grows we enter a region that the gradient descent frequently converges to global minimizers, and for every single local minimizer found by gradient descent is a global minimizer. Note that in this regime the number of parameters exceeds the sample size . This suggests that a modest amount of over-parameterization is sufficient for the no spurious local optima conclusion to hold and the global convergence to occur. In Figure 2(b), we fix the number of hidden nodes at and vary the input dimension . We observe a similar trend: As grows we enter a region where every single local minimizer found by gradient descent is a global minimizer. Here, the dashed line is located at , corresponding to a probability of converging to a global minima. For , gradient descent always finds a global minimizer.
We also repeat a similar experiment with quadratic activations and report the results in Figure 3. In Figure 3(a), we fix the input dimension at and the number of samples at and vary . The dashed line here is located at . In Figure 3(b), we fix the number of hidden units at and vary . The dashed line in this experiment is located at . These experiments further corroborate our theory that over-parameterization helps gradient descent find a global minimizer with quadratic activations.
In the next set of experiments, we generate the labels and features i.i.d. at random
We fit a softplus one-hidden layer network of the form to this data with varying number of hidden units . In Figure 4 we plot the square root of the objective function as a function of iterations. Interestingly, we observe that even for randomly labeled data when , gradient descent is able to find a global minimizer. These experiments show that simple gradient descent is almost always able to find a global minimizer in the sufficiently over-parametrized regime, regardless of whether the data is generated from a planted model.
Related Work
As mentioned earlier, neural networks have enjoyed great empirical success . To explain this success, many papers have studied the expressive ability of shallow neural networks dating back to the 80s (e.g. see ). More recently, interesting results have focused on the expressive ability of deeper and sparser architectures . Computational tractability of training networks however is still a major challenge. In fact, training neural nets is known to be NP-hard even for very small networks . Despite this worst-case pessimism, local search heuristics such as gradient descent are surprisingly effective. For instance, empirically demonstrated that sufficiently over-parametrized networks can be efficiently optimized to near global optimality with stochastic gradient descent.
For a neural network with zero hidden units and a single output with a monotonic activation function , numerous authors have shown that gradient-based methods converge to the global optimum under various assumptions and models. As a result of this literature a good understanding of the optimization landscape of learning single activations have emerged (at least when the data set is generic). Roughly stated, in this case the loss function only has a single local optima that coincides with the global optima and the gradient descent direction is sufficiently correlated with the direction pointing to the global optimum. Thus these results are able to explain the success of gradient-based methods. However, when there are hidden units, the loss function exhibits a completely different landscape, so that such analyses do not easily generalize.
We now turn our attention to global convergence results known for deeper architectures. For a 2-layered network with leaky ReLU activation, showed that gradient descent on a modified loss function can obtain a global minimum of the modified loss function; however, this does not imply reaching a global minimum of the original loss function. Under the same setting, showed that critical points with large “diversity” are near global optimality. used several assumptions to simplify the loss function to a polynomial with i.i.d. Gaussian coefficients. They then showed that every local minima of the simplified loss has objective value comparable to the global minima. used similar assumptions to show that all local minimum are global minimum in a nonlinear network. However, require an independent activations assumption meaning the activations of the hidden units are independent of the input and/or mutually independent, which is violated in practice. In comparison, our global convergence results hold for any arbitrary data set, without any additional assumptions. However, our global convergence results (see Section 2.1) only focus on quadratic activations with a single hidden layer.
We would like to mention a few interesting results regarding asymptotic characterizations of the dynamics of training shallow neural networks with Gaussian inputs using ideas from statistical physics . Saad and Solla study the online dynamics of learning a fully connected neural network with one-hidden layer in a Gaussian planted model (or Gaussian student-teacher model). In this model the output labels are generated according to planted weight vectors (a.k.a. a teacher network) with Gaussian input data. Then a new “student” network is trained to find the teacher weights. This result focuses on the case where the hidden-to-output weights are fixed to (a.k.a. soft committee machines) and the asymptotic regime where the number of inputs and the iterations tend to infinity ( and ). The authors provide certain dynamical equations that describes the asymptotic behavior of the training process for one-hidden layer neural networks using an “online SGD” or “resampled SGD” procedure where the SGD update is performed using a fresh and independent data point per new iteration. These dynamical equations do not admit a closed form solution, but do enable the analysis of the convergence behavior close to stationarity. Furthermore, by simulating these dynamical equations it is possible to gain some insights about the evolution of the online SGD updates. In particular, the authors use such simulations to gain insights into the convergence/divergence of such neural networks in the special case where diagonal input-hidden weights () are used. The paper also studies studies soft-committee machines under similar assumptions to those of . The authors of use tools from statistical physics (and in particular the aforementioned dynamical equations) to demonstrate that the dynamics of training such neural networks has several fixed points and find learning-rate dependent phenomena, such as splitting and disappearing of fixed points. This paper also differs from in that it also applies to transient iterations (finite ). Finally, the paper also studies soft committee machines but instead of online SGD, they present the analysis for a locally optimized online learning algorithm that focuses on extracting the largest possible amount of information from each new sample. The authors also demonstrate numerically that this choice leads to faster escape from the plateaux and hence faster decay of the generalization error. Our results are not directly comparable to this literature as it differs in terms of assumptions and conclusions in a variety of ways. In particular we focus on (1) analytic formulas, (2) gradient descent without resampling, (3) a non-asymptotic regime (finite ) and transient dynamics (finite ), (4) the over-parameterized case , and (5) general hidden-output weights. Some of the results presented in this paper also apply without assuming a planted model or random/Gaussian input data (e.g. see Theorem 2.1).
We would like to note that there is also interesting growing literature on learning shallow neural networks with a single hidden layer with i.i.d. inputs, and under a realizable model (i.e. the labels are generated from a network with planted weights) . For isotropic Gaussian inputs, shows that with two hidden unites () there are no critical points for configurations where both weight vectors fall into (or outside) the cone of ground truth weights. With the same assumptions, proves that for a single-hidden ReLU network with a single non-overlapping convolutional filter, all local minimizers of the population loss are global; they also give counter-examples in the overlapping case and prove the problem is NP-hard when inputs are not Gaussian. show that for single-hidden layer networks with non-standard activation functions gradient descent converges to global minimizers. focuses on a Recursive Neural Net (RNN) model where the hidden to output weights are close to the identity and shows that stochastic methods converge to the planted model over the population objective. studies general single-hidden layer networks and shows that a version of gradient descent which uses a fresh batch of samples in each iteration converges to the planted model. This holds using an initialization obtained via a tensor decomposition method. Our approach and local convergence results differ from this literature in a variety of different ways. First, unlike some of these results such as , we study the optimization properties of the empirical function, not its expected value. Second, we focus on the over-parametrized regime () which is the regime where most popular neural networks are trained. Mathematically, this is an important distinction as in this data-poor regime the empirical loss is no longer close to the population loss and therefore one can no longer infer the convergence behavior of gradient descent based on connecting the empirical loss to the population loss. Third, we optimize over both weights and , while most of this literature assumes . Finally, our framework does not require a fresh batch of samples per new gradient iteration as in .
Another line of research focuses on improper learning models using kernel-based approaches. These results hold under much more general data models than the realizable case. However, the practical success of deep learning is intricately tied to using gradient-based training procedures, and the learnability of these networks using improper learning does not explain the success of gradient-based methods. Related, proposes a method of moments estimator using tensor decomposition.
Finally, we would like to mention a few interesting recent developments that appeared after the initial arxiv submission of this paper. The authors of demonstrated analytically (with variable precision arithmetics) that without over-parameterization the landscape of one-hidden layer neural networks has bad local minima as soon as the number of hidden unites exceeds six (). This interesting result clearly demonstrates that the over-parametrization is necessary to ensure a favorable optimization landscape when training one-hidden layer neural networks. Another recent paper by Soudry and Hoffer suggests that “most” local minima are global in the over-parameterized regime. This paper provides further evidence that over-parametrizing neural networks leads to more favorable optimization landscapes when training such neural nets. Finally, the recent paper proves that gradient descent when initialized at zero can lead to training errors close to (but not equal to) zero. This result holds for one-hidden layer neural networks with quadratic activations and Gaussian inputs. Developing rigorous convergence guarantees to a global optima with training error equal to zero which also applies to other activations is an important future research direction.
Preliminaries and notations
Before we dive into the proofs in the next two sections we collect some useful results and notations in this section.
Throughout the paper we use to refer to constants whose values may change from line to line.
2 Derivative calculations
In this section we gather some derivative calculations that we will use throughout the proof. As a reminder the loss function is given by
We begin by calculating the gradient with respect to the weight matrix . To this aim we begin by calculating the gradient with respect to the th row of denoted by . This is equal to
Aggregating these gradients as a row vector and setting we arrive at
Using this Jacobian matrix we can write the vectorized version of the gradient, i.e. gradient with respect to vect as
Taking the derivative with respect to we arrive at
Thus the gradient with respect to is equal to
2.2 Second order derivatives
Using the expression for the Jacobian matrix the latter can also be written in the form
3 Useful identities involving matrix products
In this section we gather a few preliminary results regarding matrix products that will be useful throughout our proofs. Most of these identities are trivial and we thus skip their proof. The only exception is Lemma 5.3 which is proved in Appendix A.
where denotes the vectorization of the matrix formed by stacking the columns of into a single column vector.
For any two matrices and , we have
4 Useful probabilistic identities involving random vectors and matrices
In this Section we gather some useful probabilistic identities that will be used throughout our proofs. We defer the proof of these results to Appendix A. The first result, proven in Appendix A.2, relates the tail of a random variable to a proper Orlicz norm.
Suppose that a random variable satisfies the following tail bound
Then, .
The second result concerns the minimum eigenvalue of the Khatrio-Rao product of a generic matrix with itself.
We refer to Appendix A.3 for the proof of Lemma 5.6.
Finally, we state a Lemma based on that allows us to lower bound the loss function in terms of how close is to . We prove this Lemma in Appendix A.4.
holds with probability at least . Here, denotes the nuclear norm i.e. sum of singular values of the input matrix.
Proof of global landscape results
In this section we prove the two global theorems.
We begin by gathering the derivatives of the loss function for quadratic activations. We note that for quadratic activations the loss function (1.4) as a function of takes the form
Using (5.4) the Jacobian matrix is given by
Using this Jacobian matrix we can write the vectorized version of the gradient, i.e. gradient with respect to vect as
Also (5.2.2) specialized to quadratic activations results in the partial Hessian
2 Proof of Theorem 2.1
The objective value is convex in terms of (in fact its a quadratic). Therefore, at any point where the gradient with respect to is zero i.e. , that point must be a global optimum of . More specifically, any point obeying
Thus any obeying (6.5) is a global optima of , concluding the proof.
We will prove the first two conclusions of the theorem (i.e. no spurious local optima and saddles have a direction of negative curvature) simultaneously. To show this note that since the function is twice differentiable all local optima or saddles that do not have a direction of negative curvature obey
We will prove that all points obeying (6.8) are a global optima of . To this aim let us define the sets
i.e. the indices of the positive and negative entries of . Now note that two cases are possible
Case I: At least one of the sub-matrices and is not rank deficient. That is,
Case II: Both of the sub-matrices and are rank deficient. That is,
In both cases we will show that for any obeying (6.8), we have
The latter together with Lemma 6.1 immediately implies that any obeying (6.8) is a global optimum, proving that there are no spurious local minima and no saddles which do not have a direction of strict negative curvature. We now proceed to show that indeed holds in both cases mentioned above.
Multiplying both sides of the above equality on the left by we conclude that
Equations (6.13) and (6.14) together imply that
2.2 Proof of zero training error
Note that in the previous section we proved that all global optima of obey
The latter identity can be rewritten in the form
3 Proof of Theorem 2.2
First note that since we assume both and have the same sign pattern we can without loss of generality assume both and have non-negative entries. We will first prove that there are no spurious local minima, and all saddles have a direction of negative curvature, and all global optima have zero loss value. We then proceed to show all approximate local minima have small objective value in Section 6.3.2.
We begin by noting that the function is twice differentiable and thus all local optima or saddle points that do not have a direction of negative curvature obey
Note that implies that
Since we assume , two cases can occur.
Case I: rank(.
Case II: rank(.
In both cases we will show that any obeying (6.8), must be a global optimum.
The latter together with Lemma 6.1 immediately implies that any obeying (6.16) and the condition of Case I is a global optimum, proving that there are no spurious local minima. Furthermore using Lemma 5.6, as long as with a fixed numerical constant. This combined with (6.18) implies that , completing the proof for zero training error in this case.
The latter together with (6.3.1) implies that
Thus for any obeying (6.16) and the condition of Case II we have
where (a) follows from (6.17) and (b) follows from (6.21) combined with the fact that has nonnegative entries. Thus, and therefore such a must be a global optimum. This completes the proof in Case II.
3.2 Bounding the objective value for approximate local minima
Assume is an approximate local minima. That is it obeys
For such a point by Cauchy-Schwarz we have
Furthermore, for such a point using (5.2.2)
Thus using (6.27), (6.23), and (6.26) we conclude that for any point obeying (6.22) we have
Here, (a) follows from (6.27), (b) from (6.23) and (c) from Holder’s inequality combined with (6.26).
We proceed by showing that for a point obeying it’s Frobenius norm norm is also sufficiently small. To this aim note that for such a point by Cauchy-Schwarz have
Using Holder’s inequality for matrices we have
Thus by standard concentration of sample covariance matrices we conclude that for any , as long as for a fixed numerical constant , then for all
holds with probability at least , for a fixed constant . Thus with high probability for all we have
Plugging the latter into (6.3.2) we conclude that
We now apply Lemma 5.7 to conclude that as long as
holds with probability at least . The latter implies that with high probability for all we have
Plugging the latter into (6.31) and we conclude that with high probability for all we have
Plugging the latter into (6.29) implies that
Using the assumption that , the latter is equivalent to
From the latter we can conclude that for all we have
Setting and , the latter inequality implies that
Plugging the latter into (6.3.2) we conclude that for all obeying (6.22)
holds with high probability, concluding the proof.
Proof of local convergence results
In this section we prove Theorem 2.5. We begin by explaining a crucial component of our proof which involves bounding the spectrum of the Jacobian matrix. These results are the subject of the next section with the corresponding proofs appearing in Section 7.2. We then utilize these intermediate results to finalize the proof of Theorem 2.5 in Section 7.3.
Recall that the Jacobian matrix is given by . To bound the spectrum of the Jacobian we first introduce a new matrix obtained by centering the component. Precisely, we let
with probability at least , where are constants (depending on ).
In addition, with probability at least , we have
holds for a sufficiently small constant . Then, there exist constants , such that,
holds with probability at least with fixed numerical constants.
As discussed in the proof of Proposition 7.2, when Assumption 2.3 (b) holds in lieu of Assumption 2.3 (a), then . In this case the statement of the above proposition must be modified as follows. Equation (7.6) should be replaced with
Further, we have the following bounds in lieu of equations (7.7) and (7.8).
We now combine Propositions 7.1 and 7.2 to characterize the spectrum of . We defer the proof of this result to Appendix C.
for some fixed constants . Further, assume that and
for a sufficiently small constant . Then, there exist constants , such that,
holds with probability at least with a fixed numerical constant.
Also when Assumption 2.3 holds in lieu of Assumption 2.3 , the term should be replaced by one in Equations (7.11) and (7.12).
holds with probability at least with fixed numerical constants.
2 Proof of Proposition 7.2
Our proof strategy consists of four main steps:
(Bounding singular values). We prove bounds on singular values of a matrix with independent columns and bounded sub-exponential norms.
where the expectation is with respect to the randomness in the inputs .
Proof We prove this lemma by contradiction i.e. assume
which is a contradiction. Note that in the penultimate inequality we used the fact that and the last inequality holds because .
2.2 Dropping rows
2.3 Bounding sub-exponential norms
In order to bound the right-hand-side of (7.18), we need to study the tail of the random variable
holds with a fixed numerical constant and .
We apply Proposition 7.8, with and use Lemma 5.5, to conclude that
2.4 Bounding minimum singular value
Let be independent sub-exponential random vectors and also let . Furthermore, let be a matrix with as columns. Define and . Further, set , where are arbitrary but fixed. Also define
We continue by stating a lemma that bounds the Euclidean norm of the random vector . We defer the proof of this lemma to Appendix G.
holds with probability at least .
Further, by concentration of random variables, with probability at least , the following bounds hold. For any fixed ,
Plugging bound (7.24), with , and bound (7.25) into (7.23), the followings inequalities hold with probability at least ,
Note that the second inequality in (7.26) holds because by our assumption (7.6),
Therefore, by union bounding over columns with probability at least we have
Also, by (7.22), the maximum sub-exponential norms of the columns is bounded by
We now have all the elements to apply Proposition 7.9. In particular we use and , to conclude that
which holds true by the assumption given in Equation (7.6).
Note that under Assumption 2.3 (b), we skip the whitening step and by following the same proof, we obtain
2.5 Bounding maximum singular value
where the second inequality follows from our assumption given by Equation (7.6).
Combining Equations (7.38) and (7.39) via the triangular inequality we arrive at:
where the second inequality holds because and as per Lemma 5.3.
3 Proof of Theorem 2.5
We begin by stating a few useful lemmas. We defer the proofs of these lemmas to the Appendices. Throughout, we assume and . We present the proof under Assumption 2.3 (a). The proof under Assumption 2.3 (b) follows by an analogous argument. We begin by a lemma that bound the perturbation of the Jacobian matrix as a function of its inputs. We defer the proof to Appendix H.
holds with probability at least . Here, are fixed numerical constants.
Now note that by Proposition 7.4 for and , we have
Using this value of we define the radius
where is the same quantity that appears in Equation (7.41). Without loss of generality, we can assume . Plugging this into the definition of together with the fact that , allows us to conclude that .
In the next lemma, proven in Appendix I, we relate the gradients and to the function value .
For , the following inequalities hold with probability at least for some constant .
Our next lemma, proven in Appendix J, upper bounds the function value in terms of distance of to the planted solution .
The following bound holds with probability at least for some constant .
Finally, the lemma below, proven in Appendix K, controls the second order derivative of the loss function .
The function is -smooth on . Namely, there exists an event of probability at least , such that on this event, for any , we have
With this lemmas in place we are now ready to present our local convergence analysis. To this aim note that Lemma 7.12 (Equation (7.44)) implies that for , the function satisfies the Polyak-Lojasiewicz (PL) inequality
We shall now focus on how the loss function value changes in one iteration. Using the -smoothness condition for we have
where in the last inequality we used (7.50).
Our assumptions on the initial weights and in equations (2.5) and (2.4) imply the following identities.
We next show that starting with obeying (7.52) and (7.53), the entire trajectory of gradient descent remains in the set . To establish this we use induction on . The induction basis is trivial. Assuming the induction hypothesis for , we show that the result continues to hold for .
By our induction hypothesis ( for ) and (7.51), we have
By iterating the above identity we arrive at
To show that we proceed by quantifying how far the gradient descent trajectory can get from .
where follows from and Lemma 7.12 equation (7.45) and follows from (7.54). Likewise, we obtain
Using bounds (7.55) and (7.56), in order to show that , it suffices to show that
The right-hand sides of Equation (7.57) depends on . The dominant term is because it is of order at least , while is . Therefore, the desired bound in (7.57) is equivalent to
We can verify Equation (7.58) by using Lemma 7.13 combined with (7.52) and (7.53). This completes the induction argument and shows that for all .
Finally, since for all , (7.54) holds for all . Substituting for , we obtain
where the last step holds because .
This concludes the proof under Assumption 2.3 (a). The claim under Assumption 2.3 (b) can be proven by a similar argument. The only required adjustment is that the initial radius and the term should now be defined via and .
Acknowledgements
This work was done in part while M.S. was visiting the Simons Institute for the Theory of Computing. M. Soltanolkotabi is supported by the Packard Fellowship in Science and Engineering, a Sloan Research Fellowship in Mathematics, an NSF-CAREER under award #1846369, the Air Force Office of Scientific Research Young Investigator Program (AFOSR-YIP) under award #FA9550-18-1-0078, an NSF-CIF award #1813877, and a Google faculty research award. A.J. was partially supported by a Google Faculty Research Award. A.J. would also like to acknowledge the financial support of the Office of the Provost at the University of Southern California through the Zumberge Fund Individual Grant Program. We would like to thank Marco Mondelli and Simone Bombari for bringing to our attention a mistake in the argument of Proposition 7.1 in the previous version of this manuscript which has now been corrected. M.S. would like to thank Peter Bartlett for discussions related to .
References
Appendix A Proof of preliminary lemmas
If then the claim is obvious. We therefore assume that has full column rank. Let . We choose such that . We then have
Equations (A.1) and (A.2) together implies the desired result.
A.2 Proof of Lemma 5.5
Let . This implies that
A.3 Proof of Lemma 5.6
A.4 Proof of Lemma 5.7
We begin by considering the eigenvalue decomposition of the matrix given by
To continue we state a Lemma due to (see also for closely related results).
with a constant depending only on . Then
holds with probability at least with a fixed numerical constant.
Applying the union bound to the above lemma we conclude that for the unit norm vectors
hold simultaneously with probability at least . Combining the latter inequalities together with (A.4) we conclude that
Appendix B Proof of Proposition 7.1
Subtracting the above two expressions we obtain
along with the fact that the Hadamard (entrywise) product of two positive-semidefinite matrices is positive-semidefinite, implies that
with high probability. To this aim note that the latter matrix can be written in the form where
To continue we state the following result from .
To use this theorem we need to verify the assumptions. We first verify the sub-exponential assumption. Observe that are generated i.i.d. according to the distribution
We next verify the norm bound condition (B.2). Note that we have
Therefore, when using
Putting the latter two together we conclude that
in (B.2). Note that this choice of obeys as long as
We next upper bound . Recall the definition
Combining the latter with (B.4) we can conclude that as long as we have
We next prove the bound in (7.5). Note that
By [51, Corollary 5.35], , with probability at least . The result follows be recalling that .
Appendix C Proof of Proposition 7.4
We first prove the claim under Assumption 2.3 (a). We start by bounding the entries of and mean vector .
We next verify the assumption of Proposition 7.1. By our assumptions in the current proposition,
Therefore, the assumption of Proposition 7.1 holds. Using this proposition, the following bounds hold with probability at least :
We next lower bound the right-hand side. By our assumption on , we have
Further, by invoking assumption (7.11), we have
Putting (C.6), (C.7) and (C.8) together we arrive at
We then move to upper bound .
which immediately implies assumption (7.6). Here, (a) follows from , (b) from , and (c) from (7.11). With assumption (7.6) in place, we can now combine Proposition 7.2 with (C.9) to conclude that
for some constant by choosing sufficiently small. In addition,
for a constants that depends on , , . The claim under Assumption 2.3 (b) follows in a similar manner by using Remark 7.3 in lieu of Proposition 7.2.
Appendix D Proof of Proposition 7.8
We prove Proposition 7.8 via an asymmetric version of Hanson-Wright inequality. We begin by the definition of the convex concentration property which is the most general condition under which it is known that Hanson-Wright inequality holds.
We will prove the proposition by using a result by on the Hanson-Wright inequality stated below.
We wish to apply this result to the vector . Note that has zero mean. To this aim, we first state a lemma about the convex concentration property for the random vector whose proof appears in Section D.1 below.
Lemmas D.2 and D.3 combined allows us to conclude that
For , this is equivalent to
The proof is complete by replacing with .
Thus is a Lipschitz function of with Lipschitz constant . The convex concentration property follows from Gaussian isoperimetry .
Appendix E Proof of Lemma 7.6
Recalling the definition and using Lemma 5.1, we arrive at
Appendix F Proof of Proposition 7.9
The following Lemma is similar to [Theorem 3.2].
Let be independent sub-exponential random vectors with . Let , and assume that
Then setting , the inequality
We apply Lemma F.1 by setting and letting be the solution of . Clearly, . and therefore . This gives
Using bound (F.2) in (F.1) we obtain the desired result.
Proof We first recall the following Lemma from Adamczak (this applies to vectors with non-trivial covariance).
Let be independent sub-exponential vectors with . For and , one has
Thus if for some , then
where is an absolute constant. This completes the proof.
Appendix G Proof of Lemma 7.10
Appendix H Proof of Lemma 7.11
Writing the above identity in matrix form, we get
We proceed by bounding the two terms above. For the first term note that by applying (H.2), we arrive at
holds with probability at least for some constants .
Combining inequalities (H.5) and (H.6), we have
where depends on constants and .
Appendix I Proof of Lemma 7.12
Given that , we have
where the first inequality follows from Lemma 7.11 and the second one follows readily from definition of set and Equation (7.42).
To prove the last bound (i.e. (7.46)), we recall Equation (5.6)
Here, we used the assumption that is -Lipschitz.
Next, by the Bai-Yin law , we have , with probability at least . Continuing with Equation (I.5), we have
where in the second inequality, we use the fact that and hence (since ). Using bound (I.6) in (I.3), we obtain
Appendix J Proof of Lemma 7.13
By standard concentration of sample covariance matrices we conclude that for any , there exist constants , such that for , we have
with probability at least . Using this bound, with , along with the -Lipschitz property of the activation , we arrive at
To bound the second term note that we have
where in the last step, we used (J.2) and the fact that . Combining the bounds on and , completes the proof.
Appendix K Proof of Lemma 7.14
Note that the spectral norm of a matrix is bounded above by the sum of the spectral norms of the diagonal blocks of that matrix. Hence,
We proceed by bounding each of these two terms. Using Identity (5.2.2), we have
Since , we have
Furthermore using (J.2) together with we conclude that
holds with probability at least . Moreover, using the B-Lipschitz property of the activation, we have
where (a) follows from the definition of the set and (b) follows from and . Note that while we can plug in for to obtain a tighter bound, we use a looser bound on as this term will be dominated by the other term (i.e. ) and thus a more accurate bound is unnecessary.
Restating the bound on for , given by (I.2), we have
We next bound . Starting with Equation (5.9), we have
Here, holds because is -Lipschitz; holds because and , with probability at least .