Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks

Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruosong Wang

Introduction

The well-known work of Zhang et al. (2017) highlighted intriguing experimental phenomena about deep net training – specifically, optimization and generalization – and asked whether theory could explain them. They showed that sufficiently powerful nets (with vastly more parameters than number of training samples) can attain zero training error, regardless of whether the data is properly labeled or randomly labeled. Obviously, training with randomly labeled data cannot generalize, whereas training with properly labeled data generalizes. See Figure 2 replicating some of these results.

Recent papers have begun to provide explanations, showing that gradient descent can allow an overparametrized multi-layer net to attain arbitrarily low training error on fairly generic datasets (Du et al., 2018a, c; Li & Liang, 2018; Allen-Zhu et al., 2018b; Zou et al., 2018), provided the amount of overparametrization is a high polynomial of the relevant parameters (i.e. vastly more than the overparametrization in Zhang et al. (2017)). Under further assumptions it can also be shown that the trained net generalizes (Allen-Zhu et al., 2018a). But some issues were not addressed in these papers, and the goal of the current paper is to address them.

First, the experiments in Zhang et al. (2017) show that though the nets attain zero training error on even random data, the convergence rate is much slower. See Figure 1.

Why do true labels give faster convergence rate than random labels for gradient descent?

The above papers do not answer this question, since their proof of convergence does not distinguish between good and random labels.

The next issue is about generalization: clearly, some property of properly labeled data controls generalization, but what? Classical measures used in generalization theory such as VC-dimension and Rademacher complexity are much too pessimistic. A line of research proposed norm-based (e.g. Bartlett et al. (2017a)) and compression-based bounds (Arora et al., 2018). But the sample complexity upper bounds obtained are still far too weak. Furthermore they rely on some property of the trained net that is revealed/computed at the end of training. There is no property of data alone that determine upfront whether the trained net will generalize. A recent paper (Allen-Zhu et al., 2018a) assumed that there exists an underlying (unknown) neural network that achieves low error on the data distribution, and the amount of data available is quite a bit more than the minimum number of samples needed to learn this underlying neural net. Under this condition, the overparametrized net (which has way more parameters) can learn in a way that generalizes. However, it is hard to verify from data whether this assumption is satisfied, even after the larger net has finished training.In Section 2, we discuss the related works in more details. Thus the assumption is in some sense unverifiable.

Is there an easily verifiable complexity measure that can differentiate true labels and random labels?

Without explicit regularization, to attack this problem, one must resort to algorithm-dependent generalization analysis. One such line of work established that first-order methods can automatically find minimum-norm/maximum-margin solutions that fit the data in the settings of logistic regression, deep linear networks, and symmetric matrix factorization (Soudry et al., 2018; Gunasekar et al., 2018a, b; Ji & Telgarsky, 2018; Li et al., 2018b). However, how to extend these results to non-linear neural networks remains unclear (Wei et al., 2018). Another line of algorithm-dependent analysis of generalization (Hardt et al., 2015; Mou et al., 2017; Chen et al., 2018) used stability of specific optimization algorithms that satisfy certain generic properties like convexity, smoothness, etc. However, as the number of epochs becomes large, these generalization bounds are vacuous.

We give a new analysis that provides answers to Questions 1 and 2 for overparameterized two-layer neural networks with ReLU activation trained by gradient descent (GD), when the number of neurons in the hidden layer is sufficiently large. In this setting, Du et al. (2018c) have proved that GD with random initialization can achieve zero training error for any non-degenerate data. We give a more refined analysis of the trajectory of GD which enables us to provide answers to Questions 1 and 2. In particular:

In Section 4, using the trajectory of the network predictions on the training data during optimization, we accurately estimate the magnitude of training loss in each iteration. Our key finding is that the number of iterations needed to achieve a target accuracy depends on the projections of data labels on the eigenvectors of a certain Gram matrix to be defined in Equation (3). On MNIST and CIFAR datasets, we find that such projections are significantly different for true labels and random labels, and as a result we are able to answer Question 1.

In Section 5, we give a generalization bound for the solution found by GD, based on accurate estimates of how much the network parameters can move during optimization (in suitable norms). Our generalization bound depends on a data-dependent complexity measure (c.f. Equation (10)), and notably, is completely independent of the number of hidden units in the network. Again, we test this complexity measure on MNIST and CIFAR, and find that the complexity measures for true and random labels are significantly different, which thus answers Question 2.

Notice that because zero training error is achieved by the solution found by GD, a generalization bound is an upper bound on the error on the data distribution (test error). We also remark that our generalization bound is valid for any data labels – it does not require the existence of a small ground-truth network as in (Allen-Zhu et al., 2018a). Moreover, our bound can be efficiently computed for any data labels.

In Section 6, we further study what kind of functions can be provably learned by two-layer ReLU networks trained by GD. Combining the optimization and generalization results, we uncover a broad class of learnable functions, including linear functions, two-layer neural networks with polynomial activation $\phi(z)=z^{2l}$ or cosine activation, etc. Our requirement on the smoothness of learnable functions is weaker than that in (Allen-Zhu et al., 2018a).

Finally, we note that the intriguing generalization phenomena in deep learning were observed in kernel methods as well Belkin et al. (2018). The analysis in the current paper is also related to a kernel from the ReLU activation (c.f. Equation (3)).

Related Work

In this section we survey previous works on optimization and generalization aspects of neural networks.

Many papers tried to characterize geometric landscapes of objective functions (Safran & Shamir, 2017; Zhou & Liang, 2017; Freeman & Bruna, 2016; Hardt & Ma, 2016; Nguyen & Hein, 2017; Kawaguchi, 2016; Venturi et al., 2018; Soudry & Carmon, 2016; Du & Lee, 2018; Soltanolkotabi et al., 2018; Haeffele & Vidal, 2015). The hope is to leverage recent advance in first-order algorithms (Ge et al., 2015; Lee et al., 2016; Jin et al., 2017) which showed that if the landscape satisfies (1) all local minima are global and (2) all saddle points are strict (i.e., there exists a negative curvature), then first-order methods can escape all saddle points and find a global minimum. Unfortunately, these desired properties do not hold even for simple non-linear shallow neural networks (Yun et al., 2018) or 3-layer linear neural networks (Kawaguchi, 2016).

Another approach is to directly analyze trajectory of the optimization method and to show convergence to global minimum. A series of papers made strong assumptions on input distribution as well as realizability of labels, and showed global convergence of (stochastic) gradient descent for some shallow neural networks (Tian, 2017; Soltanolkotabi, 2017; Brutzkus & Globerson, 2017; Du et al., 2017a, b; Li & Yuan, 2017). Some local convergence results have also been proved (Zhong et al., 2017; Zhang et al., 2018). However, these assumptions are not satisfied in practice.

For two-layer neural networks, a line of papers used mean field analysis to establish that for infinitely wide neural networks, the empirical distribution of the neural network parameters can be described as a Wasserstein gradient flow (Mei et al., 2018; Chizat & Bach, 2018a; Sirignano & Spiliopoulos, 2018; Rotskoff & Vanden-Eijnden, 2018; Wei et al., 2018). However, it is unclear whether this framework can explain the behavior of first-order methods on finite-size neural networks.

Recent breakthroughs were made in understanding optimization of overparameterized neural networks through the trajectory-based approach. They proved global polynomial time convergence of (stochastic) gradient descent on non-linear neural networks for minimizing empirical risk. Their proof techniques can be roughly classified into two categories. Li & Liang (2018); Allen-Zhu et al. (2018b); Zou et al. (2018) analyzed the trajectory of parameters and showed that on the trajectory, the objective function satisfies certain gradient dominance property. On the other hand, Du et al. (2018a, c) analyzed the trajectory of network predictions on training samples and showed that it enjoys a strongly-convex-like property.

Generalization.

It is well known that the VC-dimension of neural networks is at least linear in the number of parameters (Bartlett et al., 2017b), and therefore classical VC theory cannot explain the generalization ability of modern neural networks with more parameters than training samples. Researchers have proposed norm-based generalization bounds (Bartlett & Mendelson, 2002; Bartlett et al., 2017a; Neyshabur et al., 2015, 2017, 2019; Konstantinos et al., 2017; Golowich et al., 2017; Li et al., 2018a) and compression-based bounds (Arora et al., 2018). Dziugaite & Roy (2017); Zhou et al. (2019) used the PAC-Bayes approach to compute non-vacuous generalization bounds for MNIST and ImageNet, respectively. All these bounds are posterior in nature – they depend on certain properties of the trained neural networks. Therefore, one has to finish training a neural network to know whether it can generalize. Comparing with these results, our generalization bound only depends on training data and can be calculated without actually training the neural network.

Another line of work assumed the existence of a true model, and showed that the (regularized) empirical risk minimizer has good generalization with sample complexity that depends on the true model (Du et al., 2018b; Ma et al., 2018; Imaizumi & Fukumizu, 2018). These papers ignored the difficulty of optimization, while we are able to prove generalization of the solution found by gradient descent. Furthermore, our generic generalization bound does not assume the existence of any true model.

Our paper is closely related to (Allen-Zhu et al., 2018a) which showed that two-layer overparametrized neural networks trained by randomly initialized stochastic gradient descent can learn a class of infinite-order smooth functions. In contrast, our generalization bound depends on a data-dependent complexity measure that can be computed for any dataset, without assuming any ground-truth model. Furthermore, as a consequence of our generic bound, we also show that two-layer neural networks can learn a class of infinite-order smooth functions, with a less strict requirement for smoothness. Allen-Zhu et al. (2018a) also studied the generalization performance of three-layer neural nets.

Lastly, our work is related to kernel methods, especially recent discoveries of the connection between deep learning and kernels (Jacot et al., 2018; Chizat & Bach, 2018b; Daniely et al., 2016; Daniely, 2017). Our analysis utilized several properties of a related kernel from the ReLU activation (c.f. Equation (3)).

Preliminaries and Overview of Results

1 Setting: Two-Layer Neural Network Trained by Randomly Initialized Gradient Descent

We consider a two-layer ReLU activated neural network with $m$ neurons in the hidden layer:

We train the neural network by randomly initialized gradient descent (GD) on the quadratic loss over data $S$ . In particular, we first initialize the parameters randomly:

where $0<\kappa\leq 1$ controls the magnitude of initialization, and all randomnesses are independent. We then fix the second layer $\mathbf{a}$ and optimize the first layer $\mathbf{W}$ through GD on the following objective function:

The GD update rule can be written as:Since ReLU is not differentiable at , we just define “gradient” using this formula, and this is indeed what is used in practice.

2 The Gram Matrix from ReLU Kernel

This matrix can be viewed as a Gram matrix from a kernel associated with the ReLU function, and has been studied in (Xie et al., 2017; Tsuchida et al., 2017; Du et al., 2018c).

In our setting of training a two-layer ReLU network, Du et al. (2018c) showed that if $\mathbf{H}^{\infty}$ is positive definite, GD converges to training loss if $m$ is sufficiently large:

Assume $\lambda_{0}=\lambda_{\min}(\mathbf{H}^{\infty})>0$ . For $\delta\in(0,1)$ , if $m=\Omega\left(\frac{n^{6}}{\lambda_{0}^{4}\kappa^{2}\delta^{3}}\right)$ and $\eta=O\left(\frac{\lambda_{0}}{n^{2}}\right)$ , then with probability at least $1-\delta$ over the random initialization (1), we have:

$\Phi(\mathbf{W}(k+1))\leq\left(1-\frac{\eta\lambda_{0}}{2}\right)\Phi(\mathbf{W}(k)),\ \forall k\geq 0$ .

Our results on optimization and generalization also crucially depend on this matrix $\mathbf{H}^{\infty}$ .

3 Overview of Our Results

Now we give an informal description of our main results. It assumes that the initialization magnitude $\kappa$ is sufficiently small and the network width $m$ is sufficiently large (to be quantified later).

The following theorem gives a precise characterization of how the objective decreases to . It says that this process is essentially determined by a power method for matrix $\mathbf{I}-\eta\mathbf{H}^{\infty}$ applied on the label vector $\mathbf{y}$ .

As a consequence, we are able to distinguish the convergence rates for different labels $\mathbf{y}$ , which can be determined by the projections of $\mathbf{y}$ on the eigenvectors of $\mathbf{H}^{\infty}$ . This allows us to obtain an answer to Question 1. See Section 4 for details.

Our main result for generalization is the following:

For any $1$ -Lipschitz loss function, the generalization error of the two-layer ReLU network found by GD is at most

Notice that our generalization bound (4) can be computed from data $\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$ , and is completely independent of the network width $m$ . We observe that this bound can clearly distinguish true labels and random labels, thus providing an answer to Question 2. See Section 5 for details.

Finally, using Theorem 3.3, we prove that we can use our two-layer ReLU network trained by GD to learn a broad class of functions, including linear functions, two-layer neural networks with polynomial activation $\phi(z)=z^{2l}$ or cosine activation, etc. See Section 6 for details.

4 Additional Notation

We introduce some additional notation that will be used.

We define two matrices $\mathbf{Z}$ and $\mathbf{H}$ which will play a key role in our analysis of the GD trajectory:

and $\mathbf{H}=\mathbf{Z}^{\top}\mathbf{Z}$ . Note that

With this notation we have a more compact form of the gradient (5):

Analysis of Convergence Rate

Although Theorem 3.1 already predicts linear convergence of GD to loss, it only provides an upper bound on the loss and does not distinguish different types of labels. In particular, it cannot answer Question 1. In this section we give a fine-grained analysis of the convergence rate.

where $\mathbf{H}^{\infty}$ is the Gram matrix defined in (3).

Suppose $\lambda_{0}=\lambda_{\min}(\mathbf{H}^{\infty})>0$ , $\kappa=O\left(\frac{\epsilon\delta}{\sqrt{n}}\right)$ , $m=\Omega\left(\frac{n^{7}}{\lambda_{0}^{4}\kappa^{2}\delta^{4}\epsilon^{2}}\right)$ and $\eta=O\left(\frac{\lambda_{0}}{n^{2}}\right)$ . Then with probability at least $1-\delta$ over the random initialization, for all $k=0,1,2,\ldots$ we have:

The proof of Theorem 4.1 is given in Appendix C.

In light of (8), it suffices to understand how fast $\sum_{i=1}^{n}(1-\eta\lambda_{i})^{2k}\left(\mathbf{v}_{i}^{\top}\mathbf{y}\right)^{2}$ converges to as $k$ grows. Define $\xi_{i}(k)=(1-\eta\lambda_{i})^{2k}(\mathbf{v}_{i}^{\top}\mathbf{y})^{2}$ , and notice that each sequence $\{\xi_{i}(k)\}_{k=0}^{\infty}$ is a geometric sequence which starts at $\xi_{i}(0)=(\mathbf{v}_{i}^{\top}\mathbf{y})^{2}$ and decreases at ratio $(1-\eta\lambda_{i})^{2}$ . In other words, we can think of decomposing the label vector $\mathbf{y}$ into its projections onto all eigenvectors $\mathbf{v}_{i}$ of $\mathbf{H}^{\infty}$ : $\left\|\mathbf{y}\right\|_{2}^{2}=\sum_{i=1}^{n}(\mathbf{v}_{i}^{\top}\mathbf{y})^{2}=\sum_{i=1}^{n}\xi_{i}(0)$ , and the $i$ -th portion shrinks exponentially at ratio $(1-\eta\lambda_{i})^{2}$ . The larger $\lambda_{i}$ is, the faster $\{\xi_{i}(k)\}_{k=0}^{\infty}$ decreases to , so in order to have faster convergence we would like the projections of $\mathbf{y}$ onto top eigenvectors to be larger. Therefore we obtain the following intuitive rule to compare the convergence rates on two sets of labels in a qualitative manner (for fixed $\left\|\mathbf{y}\right\|_{2}$ ):

For a set of labels $\mathbf{y}$ , if they align with the top eigenvectors, i.e., $(\mathbf{v}_{i}^{\top}\mathbf{y})^{2}$ is large for large $\lambda_{i}$ , then gradient descent converges quickly.

For a set of labels $\mathbf{y}$ , if the projections on eigenvectors $\{\left(\mathbf{v}_{i}^{\top}\mathbf{y}\right)^{2}\}_{i=1}^{n}$ are uniform, or labels align with eigenvectors with respect to small eigenvalues, then gradient descent converges with a slow rate.

We now use this reasoning to answer Question 1. In Figure 1(b), we compute the eigenvalues of $\mathbf{H}^{\infty}$ (blue curve) for the MNIST dataset. The plot shows the eigenvalues of $\mathbf{H}^{\infty}$ admit a fast decay. We further compute the projections $\{\left|\mathbf{v}_{i}^{\top}\mathbf{y}\right|\}_{i=1}^{n}$ of true labels (red) and random labels (cyan). We observe that there is a significant difference between the projections of true labels and random labels: true labels align well with top eigenvectors whereas projections of random labels are close to being uniform. Furthermore, according to our theory, if a set of labels align with the eigenvector associated with the least eigenvalue, the convergence rate of gradient descent will be extremely slow. We construct such labels and in Figure 1(a) we indeed observe slow convergence. We repeat the same experiments on CIFAR and have similar observations (Figures 1(c) and 1(d)). These empirical findings support our theory on the convergence rate of gradient descent. See Appendix A for implementation details.

1 Proof Sketch of Theorem 4.1

Analysis of Generalization

In this section, we study the generalization ability of the two-layer neural network $f_{\mathbf{W}(k),\mathbf{a}}$ trained by GD.

First, in order for optimization to succeed, i.e., zero training loss is achieved, we need a non-degeneracy assumption on the data distribution, defined below:

Note that as long as no two $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ are parallel to each other, we have $\lambda_{\min}(\mathbf{H}^{\infty})>0$ . (See (Du et al., 2018c)). For most real-world distributions, any two training inputs are not parallel.

The proof of Theorem 5.1 is given in Appendix D and we sketch the proof in Section 5.1.

Note that in Theorem 5.1 there are three sources of possible failures: (i) failure of satisfying $\lambda_{\min}(\mathbf{H}^{\infty})\geq\lambda_{0}$ , (ii) failure of random initialization, and (iii) failure in the data sampling procedure (c.f. Theorem B.1). We ensure that all these failure probabilities are at most $\delta/3$ so that the final failure probability is at most $\delta$ .

As a corollary of Theorem 5.1, for binary classification problems (i.e., labels are $\pm 1$ ), we can show that (9) also bounds the population classification error of the learned classifier. See Appendix D for the proof.

Now we discuss our generalization bound. The dominating term in (9) is:

This can be viewed as a complexity measure of data that one can use to predict the test accuracy of the learned neural network. Our result has the following advantages: (i) our complexity measure (10) can be directly computed given data $\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$ , without the need of training a neural network or assuming a ground-truth model; (ii) our bound is completely independent of the network width $m$ .

To illustrate that the complexity measure in (10) effectively determines test error, in Figure 2 we compare this complexity measure versus the test error with true labels and random labels (and mixture of true and random labels). Random and true labels have significantly different complexity measures, and as the portion of random labels increases, our complexity measure also increases. See Appendix A for implementation details.

1 Proof Sketch of Theorem 5.1

The main ingredients in the proof of Theorem 5.1 are Lemmas 5.3 and 5.4. We defer the proofs of these lemmas as well as the full proof of Theorem 5.1 to Appendix D.

Our proof is based on a careful characterization of the trajectory of $\left\{\mathbf{W}(k)\right\}_{k=0}^{\infty}$ during GD. In particular, we bound its distance to initialization as follows:

$\left\|\mathbf{w}_{r}(k)-\mathbf{w}_{r}(0)\right\|_{2}=O\left(\frac{n}{\sqrt{m}\lambda_{0}\sqrt{\delta}}\right)$ $(\forall r\in[m])$ , and

The bound on the movement of each $\mathbf{w}_{r}$ was proved in Du et al. (2018c). Our main contribution is the bound on $\left\|\mathbf{W}(k)-\mathbf{W}(0)\right\|_{F}$ which corresponds to the total movement of all neurons. The main idea is to couple the trajectory of $\left\{\mathbf{W}(k)\right\}_{k=0}^{\infty}$ with another simpler trajectory $\left\{\widetilde{\mathbf{W}}(k)\right\}_{k=0}^{\infty}$ defined as:

We prove $\left\|\widetilde{\mathbf{W}}(\infty)-\widetilde{\mathbf{W}}(0)\right\|_{F}=\sqrt{\mathbf{y}^{\top}\mathbf{H}(0)^{-1}\mathbf{y}}$ in Section 5.2.Note that we have $\mathbf{H}(0)\approx\mathbf{H}^{\infty}$ from standard concentration. See Lemma C.3. The actually proof of Lemma 5.3 is essentially a perturbed version of this.

Lemma 5.3 implies that the learned function $f_{\mathbf{W}(k),\mathbf{a}}$ from GD is in a restricted class of neural nets whose weights are close to initialization $\mathbf{W}(0)$ . The following lemma bounds the Rademacher complexity of this function class:

Given $R>0$ , with probability at least $1-\delta$ over the random initialization ( $\mathbf{W}(0),\mathbf{a})$ , simultaneously for every $B>0$ , the following function class

has empirical Rademacher complexity bounded as:

Finally, combining Lemmas 5.3 and 5.4, we are able to conclude that the neural network found by GD belongs to a function class with Rademacher complexity at most $\sqrt{\mathbf{y}^{\top}(\mathbf{H}^{\infty})^{-1}\mathbf{y}/(2n)}$ (plus negligible errors). This gives us the generalization bound in Theorem 5.1 using the theory of Rademacher complexity (Appendix B).

Provable Learning using Two-Layer ReLU Neural Networks

for some quantity $M_{g}$ that is independent of the number of samples $n$ , then Theorem 5.1 implies we can provably learn the function $g$ on the underlying data distribution using $O\left(\frac{M_{g}+\log(1/\delta)}{\epsilon^{2}}\right)$ samples. The following theorem shows that this is indeed the case for a broad class of functions.

The proof of Theorem 6.1 is given in Appendix E.

Notice that for two label vectors $\mathbf{y}^{(1)}$ and $\mathbf{y}^{(2)}$ , we have

This implies that the sum of learnable functions is also learnable. Therefore, the following is a direct corollary of Theorem 6.1:

Corollary 6.2 shows that overparameterized two-layer ReLU network can learn any function of the form (12) for which (13) is bounded. One can view (12) as two-layer neural networks with polynomial activation $\phi(z)=z^{p}$ , where $\{\bm{\beta}_{j}\}$ are weights in the first layer and $\{\alpha_{j}\}$ are the second layer. Below we give some specific examples.

For $g(\mathbf{x})=\bm{\beta}^{\top}\mathbf{x}$ , we have $M_{g}=O(\left\|\bm{\beta}\right\|_{2}^{2})$ .

Finally, we note that our “smoothness” requirement (13) is weaker than that in (Allen-Zhu et al., 2018a), as illustrated in the following example.

Suppose $g(\mathbf{x})=\phi(\bm{\beta}^{\top}\mathbf{x})$ , where $\phi(z)=z\cdot\arctan(\frac{z}{2})$ and $\left\|\bm{\beta}\right\|_{2}\leq 1$ . We have $g(\mathbf{x})=\sum_{j=1}^{\infty}\frac{(-1)^{j-1}2^{1-2j}}{2j-1}\left(\bm{\beta}^{\top}\mathbf{x}\right)^{2j}$ since $\left|\bm{\beta}^{\top}\mathbf{x}\right|\leq 1$ . Thus $M_{g}=O\left(\sum_{j=1}^{\infty}\frac{j\cdot 2^{1-2j}}{2j-1}\left\|\bm{\beta}\right\|_{2}^{2j}\right)\leq O\left(\sum_{j=1}^{\infty}2^{1-2j}\left\|\bm{\beta}\right\|^{2j}\right)=O\left(\left\|\bm{\beta}\right\|_{2}^{2}\right)$ , so our result implies that this function is learnable by $2$ -layer ReLU nets.

However, Allen-Zhu et al. (2018a)’s generalization theorem would require $\sum_{j=1}^{\infty}\frac{\left(C\sqrt{\log(1/\epsilon)}\right)^{2j}2^{1-2j}}{2j-1}$ to be bounded, where $C$ is a large constant and $\epsilon$ is the target generalization error. This is clearly not satisfied.

Conclusion

This paper shows how to give a fine-grained analysis of the optimization trajectory and the generalization ability of overparameterized two-layer neural networks trained by gradient descent. We believe that our approach can also be useful in analyzing overparameterized deep neural networks and other machine learning models.

Acknowledgements

SA, WH and ZL acknowledge support from NSF, ONR, Simons Foundation, Schmidt Foundation, Mozilla Research, Amazon Research, DARPA and SRC. SSD acknowledges support from AFRL grant FA8750-17-2-0212 and DARPA D17AP00001. RW acknowledges support from ONR grant N00014-18-1-2562. Part of the work was done while SSD and RW were visiting the Simons Institute.

References

Appendix

Appendix A Experiment Setup

We use two image datasets, the CIFAR dataset Krizhevsky & Hinton (2009) and the MNIST dataset LeCun et al. (1998), in our experiments. We only use the first two classes of images in the the CIFAR dataset and the MNIST dataset, with $10,000$ training images and $2,000$ validation images in total for each dataset. In both datasets, for each image $\mathbf{x}_{i}$ , we set the corresponding label $y_{i}$ to be $+1$ if the image belongs to the first class, and $-1$ otherwise. For each image $\mathbf{x}_{i}$ in the dataset, we normalize the image so that $\|\mathbf{x}_{i}\|_{2}=1$ , following the setup in Section 3.1.

Our neural networks are trained using the PyTorch package Paszke et al. (2017), using (possibly multiple) NVIDIA Tesla V100 GPUs.

Appendix B Background on Generalization and Rademacher Complexity

Generalization error refers to the gap $L_{\mathcal{D}}(f)-L_{S}(f)$ for the learned function $f$ given sample $S$ .

Recall the standard definition of Rademacher complexity:

Rademacher complexity directly gives an upper bound on generalization error (see e.g. (Mohri et al., 2012)):

Therefore, as long as we can bound the Rademacher complexity of a certain function class that contains our learned predictor, we can obtain a generalization bound.

Appendix C Proofs for Section 4

We first show some technical lemmas. Most of them are already proved in (Du et al., 2018c) and we give proofs for them for completeness.

First, we have the following lemma which gives an upper bound on how much each weight vector can move during optimization.

Under the same setting as Theorem 3.1, i.e., $\lambda_{0}=\lambda_{\min}(\mathbf{H}^{\infty})>0$ , $m=\Omega\left(\frac{n^{6}}{\lambda_{0}^{4}\kappa^{2}\delta^{3}}\right)$ and $\eta=O\left(\frac{\lambda_{0}}{n^{2}}\right)$ , with probability at least $1-\delta$ over the random initialization we have

From Theorem 3.1 we know $\left\|\mathbf{y}-\mathbf{u}(k+1)\right\|_{2}\leq\sqrt{1-\frac{\eta\lambda_{0}}{2}}\left\|\mathbf{y}-\mathbf{u}(k)\right\|_{2}\leq\left(1-\frac{\eta\lambda_{0}}{4}\right)\left\|\mathbf{y}-\mathbf{u}(k)\right\|_{2}$ for all $k\geq 0$ , which implies

As a consequence of Lemma C.1, we can upper bound the norms of $\mathbf{H}(k)-\mathbf{H}(0)$ and $\mathbf{Z}(k)-\mathbf{Z}(0)$ for all $k$ . These bounds show that the matrices $\mathbf{H}$ and $\mathbf{Z}$ do not change much during optimization if $m$ is sufficiently large.

Under the same setting as Theorem 3.1, with probability at least $1-4\delta$ over the random initialization, for all $k\geq 0$ we have:

Let $R=\frac{Cn}{\sqrt{m}\lambda_{0}\sqrt{\delta}}$ for some universal constant $C>0$ . From Lemma C.1 we know that with probability at least $1-\delta$ we have $\left\|\mathbf{w}_{r}(k)-\mathbf{w}_{r}(0)\right\|\leq R$ for all $r\in[m]$ and all $k\geq 0$ .

Next, notice that $\mathbf{w}_{r}(0)^{\top}\mathbf{x}_{i}$ has the same distribution as $\mathcal{N}(0,\kappa^{2})$ . So we have

Now taking the expectation of (16), we have

Then from Markov’s inequality we know that with probability at least $1-\delta$ we have $\left\|\mathbf{H}(k)-\mathbf{H}(0)\right\|_{F}\leq\frac{4n^{2}R}{\sqrt{2\pi}\kappa\delta}+\frac{2n^{2}}{m}=O\left(\frac{n^{3}}{\sqrt{m}\lambda_{0}\kappa\delta^{3/2}}\right)$ .

To bound $\left\|\mathbf{Z}(k)-\mathbf{Z}(0)\right\|_{F}$ , we have

Using Markov’s inequality, with probability at least $1-\delta$ we have $\left\|\mathbf{Z}(k)-\mathbf{Z}(0)\right\|_{F}^{2}\leq\frac{2nR}{\sqrt{2\pi}\kappa\delta}+\frac{n}{m}=O\left(\frac{n^{2}}{\sqrt{m}\lambda_{0}\kappa\delta^{3/2}}\right)$ , proving the second part of the lemma. ∎

The next lemma shows that $\mathbf{H}(0)$ is close to $\mathbf{H}^{\infty}$ .

With probability at least $1-\delta$ , we have $\left\|\mathbf{H}(0)-\mathbf{H}^{\infty}\right\|_{F}=O\left(\frac{n\sqrt{\log\frac{n}{\delta}}}{\sqrt{m}}\right)$ .

Letting $\delta^{\prime}=\delta/n^{2}$ and applying a union bound over all $i,j\in[n]$ , we know that with probability at least $1-\delta$ :

We assume that all the high-probability (“with probability $1-\delta$ ”) events happen. By a union bound at the end, the success probability is at least $1-\Omega(\delta)$ , and then we can rescale $\delta$ by a constant such that the success probability is at least $1-\delta$ .

For each $i\in[n]$ , we partition all $m$ neurons into two parts:

where $A_{r,i}$ is defined in (15). We know from the proof of Lemma C.2 that all neurons in $S_{i}$ will not change activation pattern on data-point $\mathbf{x}_{i}$ during optimization, i.e.,

We consider the two terms in (20) separately. We denote the second term as $\epsilon_{i}(k)$ and treat it as a perturbation term, which we bound as

Combining (20), (21), (22), (23), we have

Next, since $\mathbf{H}(k)$ is close to $\mathbf{H}^{\infty}$ according to Lemmas C.2 and C.3, we rewrite (24) as

where $\bm{\zeta}(k)=\eta(\mathbf{H}^{\infty}-\mathbf{H}(k))(\mathbf{u}(k)-\mathbf{y})+\bm{\epsilon}(k)$ . From Lemmas C.2 and C.3 we have

Therefore we can bound $\bm{\zeta}(k)$ as

Finally, applying (25) recursively, we get

The third term in (27) can be bounded using (26). Also note that we have $\left\|\mathbf{u}(k)-\mathbf{y}\right\|_{2}\leq\left(1-\frac{\eta\lambda_{0}}{4}\right)^{k}\left\|\mathbf{u}(0)-\mathbf{y}\right\|_{2}=\left(1-\frac{\eta\lambda_{0}}{4}\right)^{k}O(\sqrt{n/\delta})$ (Theorem 3.1). Therefore we have

Combining (27), (28), (29) and (30), we obtain

where we have used $\max\limits_{k\geq 0}\left\{k(1-\eta\lambda_{0}/4)^{k-1}\right\}=O(1/(\eta\lambda_{0}))$ . From our choices of $\kappa$ and $m$ , the above error term is at most $\epsilon$ . This completes the proof of Theorem 4.1. ∎

Appendix D Proofs for Section 5

The first part of Lemma 5.3 is proved as Lemma C.1 (note that $\left\|\mathbf{y}-\mathbf{u}(0)\right\|_{2}=O\left(\sqrt{n/\delta}\right)$ ). Now we prove the second part.

Recall the update rule (6) for $\mathbf{W}$ :

According to the proof of Theorem 4.1 ((27), (29) and (30)) we can write

Plugging (32) into (31) and taking a sum over $k=0,1,\ldots,K-1$ , we get:

The second and the third terms in (34) are considered perturbations, and we can upper bound their norms easily. For the second term, using $\left\|\mathbf{Z}(k)-\mathbf{Z}(0)\right\|_{F}=O\left(\frac{n}{\sqrt{m^{1/2}\lambda_{0}\kappa\delta^{3/2}}}\right)$ (Lemma C.2), we have:

For the third term in (34), we use $\left\|\mathbf{Z}(k)\right\|_{F}\leq\sqrt{n}$ and (33) to get:

Define $\mathbf{T}=\eta\sum_{k=0}^{K-1}(\mathbf{I}-\eta\mathbf{H}^{\infty})^{k}$ . For the first term in (34), using $\left\|\mathbf{H}(0)-\mathbf{H}^{\infty}\right\|_{F}=O\left(\frac{n\sqrt{\log\frac{n}{\delta}}}{\sqrt{m}}\right)$ (Lemma C.3) we have

Let the eigen-decomposition of $\mathbf{H}^{\infty}$ be $\mathbf{H}^{\infty}=\sum_{i=1}^{n}\lambda_{i}\mathbf{v}_{i}\mathbf{v}_{i}^{\top}$ . Since $\mathbf{T}$ is a polynomial of $\mathbf{H}^{\infty}$ , it has the same set of eigenvectors as $\mathbf{H}^{\infty}$ , and we have

Finally, plugging the three bounds (35), (36) and (38) into (34), we have

D.2 Proof of Lemma 5.4

where $\left\|\mathbf{W}-\mathbf{W}(0)\right\|_{2,\infty}=\max\limits_{r\in[m]}\left\|\mathbf{w}_{r}-\mathbf{w}_{r}(0)\right\|_{2}$ .

Similar to the proof of Lemma C.2, we define events:

Thus we can bound the Rademacher complexity as:

For $\left\|\mathbf{Z}(0)\right\|_{F}$ , notice that

Therefore, with probability at least $1-\delta$ , the Rademacher complexity is bounded as:

completing the proof of Lemma 5.4. (Note that the high probability events used in the proof do not depend on the value of $B$ , so the above bound holds simultaneously for every $B$ .) ∎

D.3 Proof of Theorem 5.1

First of all, since the distribution $\mathcal{D}$ is $(\lambda_{0},\delta/3,n)$ -non-degenerate, with probability at least $1-\delta/3$ we have $\lambda_{\min}(\mathbf{H}^{\infty})\geq\lambda_{0}$ . The rest of the proof is conditioned on this happening.

Next, from Theorem 3.1, Lemma 5.3 and Lemma 5.4, we know that for any sample $S$ , with probability at least $1-\delta/3$ over the random initialization, the followings hold simultaneously:

Let $B_{i}=i$ ( $i=1,2,\ldots$ ). Simultaneously for all $i$ , the function class $\mathcal{F}_{R,B_{i}}^{\mathbf{W}(0),\mathbf{a}}$ has Rademacher complexity bounded as

Let $i^{*}$ be the smallest integer such that $B\leq B_{i^{*}}$ . Then we have $i^{*}\leq O\left(\sqrt{\frac{n}{\lambda_{0}}}\right)$ and $B_{i^{*}}\leq B+1$ . From above we know $f_{\mathbf{W}(k),\mathbf{a}}\in\mathcal{F}_{R,B_{i^{*}}}^{\mathbf{W}(0),\mathbf{a}}$ , and

Next, from the theory of Rademacher complexity (Theorem B.1) and a union bound over a finite set of different $i$ ’s, for any random initialization $\left(\mathbf{W}(0),\mathbf{a}\right)$ , with probability at least $1-\delta/3$ over the sample $S$ , we have

Finally, taking a union bound, we know that with probability at least $1-\frac{2}{3}\delta$ over the sample $S$ and the random initialization $(\mathbf{W}(0),\mathbf{a})$ , the followings are all satisfied (for some $i^{*}$ ):

D.4 Proof of Corollary 5.2

Therefore we have with probability at least $1-\delta$ :

Appendix E Proofs for Section 6

We prove a lemma before proving Theorem 6.1.

Note that when $\mathbf{A}$ and $\mathbf{B}$ are both invertible, this result reads $\mathbf{A}^{-1}\succeq\mathbf{B}^{-1}$ .

W.L.O.G. we can assume $\mathbf{B}$ is invertible, which means $\mathbf{B}^{-1}=\mathbf{B}^{{\dagger}}$ . Additionally, we can assume $\mathbf{A}$ is diagonal and $\mathbf{A}=\begin{pmatrix}\mathbf{\Lambda}&\mathbf{0}\\ \mathbf{0}&\mathbf{0}\end{pmatrix}$ . Thus $\mathbf{P}_{\mathbf{A}}=\begin{pmatrix}\mathbf{I}&\mathbf{0}\\ \mathbf{0}&\mathbf{0}\end{pmatrix}$ . We define $\mathbf{P}_{\mathbf{A}}^{\perp}=\mathbf{I}-\mathbf{P}_{\mathbf{A}}=\begin{pmatrix}\mathbf{0}&\mathbf{0}\\ \mathbf{0}&\mathbf{I}\end{pmatrix}$ .

Now we will show that all the solutions to the equation $\det(\mathbf{P}_{\mathbf{A}}\mathbf{B}^{-1}\mathbf{P}_{\mathbf{A}}-\lambda(\mathbf{A}^{\dagger}+\mathbf{P}_{\mathbf{A}}^{\perp}))=0$ are between and $1$ . If this is shown, we must have $\mathbf{P}_{\mathbf{A}}\mathbf{B}^{-1}\mathbf{P}_{\mathbf{A}}\preceq\mathbf{A}^{\dagger}+\mathbf{P}_{\mathbf{A}}^{\perp}$ , which would imply $\mathbf{P}_{\mathbf{A}}\mathbf{B}^{-1}\mathbf{P}_{\mathbf{A}}\preceq\mathbf{A}^{\dagger}$ .

Note $\det(\mathbf{B}^{-1})>0$ and $\det(\mathbf{A}^{\dagger}+\mathbf{P}_{\mathbf{A}}^{\perp})>0$ . Thus all the solutions to $\det(\mathbf{P}_{\mathbf{A}}\mathbf{B}^{-1}\mathbf{P}_{\mathbf{A}}-\lambda(\mathbf{A}^{\dagger}+\mathbf{P}_{\mathbf{A}}^{\perp}))=0$ are exactly all the solutions to $\det(\mathbf{A}-\lambda\mathbf{B})=0$ . Since $\mathbf{B}\succeq\mathbf{A}\succeq\mathbf{0}$ and $\mathbf{B}\succ\mathbf{0}$ , we have $\det(\mathbf{A}-\lambda\mathbf{B})\neq 0$ when $\lambda<0$ or $\lambda>1$ . ∎

First we consider the case $p=1$ . In this case we have $\mathbf{y}=\alpha\mathbf{X}^{\top}\bm{\beta}_{1}$ . Since $\mathbf{H}^{\infty}\succeq\frac{\mathbf{K}}{4}$ , from Lemma E.1 we have

where $\mathbf{P}_{\mathbf{K}}=\mathbf{K}^{1/2}\mathbf{K}^{\dagger}\mathbf{K}^{1/2}$ is the projection matrix for the subspace spanned by $\mathbf{K}$ . Since $\mathbf{K}=\mathbf{X}^{\top}\mathbf{X}$ , we have $\mathbf{P}_{\mathbf{K}}\mathbf{X}^{\top}=\mathbf{X}^{\top}$ . Therefore, we have