Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers

Zeyuan Allen-Zhu, Yuanzhi Li, Yingyu Liang

Introduction

Neural network learning has become a key machine learning approach and has achieved remarkable success in a wide range of real-world domains, such as computer vision, speech recognition, and game playing . In contrast to the widely accepted empirical success, much less theory is known. Despite a recent boost of theoretical studies, many questions remain largely open, including fundamental ones about the optimization and generalization in learning neural networks.

One key challenge in analyzing neural networks is that the corresponding optimization is non-convex and is theoretically hard in the general case . This is in sharp contrast to the fact that simple optimization algorithms like stochastic gradient descent (SGD) and its variants usually produce good solutions in practice even on both training and test data. Therefore,

what functions can neural networks provably learn?

Another key challenge is that, in practice, neural networks are heavily overparameterized (e.g., ): the number of learnable parameters is much larger than the number of the training samples. It is observed that overparameterization empirically improves both optimization and generalization, appearing to contradict traditional learning theory.For example, Livni et al. observed that on synthetic data generated from a target network, SGD converges faster when the learned network has more parameters than the target. Perhaps more interestingly, Arora et al. found that overparameterized networks learned in practice can often be compressed to simpler ones with much fewer parameters, without hurting their ability to generalize; however, directly learning such simpler networks runs into worse results due to the optimization difficulty. We also have experiments in Figure 1(a). Therefore,

why do overparameterized networks (found by those training algorithms) generalize?

Most existing works analyzing the learnability of neural networks make unrealistic assumptions about the data distribution (such as being random Gaussian), and/or make strong assumptions about the network (such as using linear activations). Other works focusing on structured data such as sparse coding model and Li and Liang shows that two-layer ReLU networks can learn classification tasks when the data come from mixtures of arbitrary but well-separated distributions.

A theorem without distributional assumptions on data is often more desirable. Indeed, how to obtain a result that does not depend on the data distribution, but only on the concept class itself, lies in the center of PAC-learning which is one of the foundations of machine learning theory . Also, studying non-linear activations is critical because otherwise one can only learn linear functions, which can also be easily learned via linear models without neural networks.

Brutzkus et al. prove that two-layer networks with ReLU activations can learn linearly-separable data (and thus the class of linear functions) using just SGD. This is an (improper) PAC-learning type of result because it makes no assumption on the data distribution. Andoni et al. proves that two-layer networks can learn polynomial functions of degree rr over dd-dimensional inputs in sample complexity O(dr)O(d^{r}). Their learner networks use exponential activation functions, where in practice the rectified linear unit (ReLU) activation has been the dominant choice across vastly different domains.

On a separate note, if one treats all but the last layer of neural networks as generating a random feature map, then training only the last layer is a convex task, so one can learn the class of linear functions in this implicit feature space . This result implies low-degree polynomials and compositional kernels can be learned by neural networks in polynomial time. Empirically, training last layer greatly weakens the power of neural networks (see Figure 1).

Our Result. We prove that an important concept class that contains three-layer (resp. two-layer) neural networks equipped with smooth activations can be efficiently learned by three-layer (resp. two-layer) ReLU neural networks via SGD or its variants.

Specifically, suppose in aforementioned class the best network (called the target function or target network) achieves a population risk OPT\mathsf{OPT} with respect to some convex loss function. We show that one can learn up to population risk OPT+ε\mathsf{OPT}+\varepsilon, using three-layer (resp. two-layer) ReLU networks of size greater than a fixed polynomial in the size of the target network, in 1/ε1/\varepsilon, and in the “complexity” of the activation function used in the target network. Furthermore, the sample complexity is also polynomial in these parameters, and only poly-logarithmic in the size of the learner ReLU network.

We stress here that this is agnostic PAC-learning because we allow the target function to have error (e.g., OPT\mathsf{OPT} can be positive for regression), and is improper learning because the concept class consists of smaller neural networks comparing to the networks being trained.

Our Contributions. We believe our result gives further insights to the fundamental questions about the learning theory of neural networks.

To the best of our knowledge, this is the first result showing that using hidden layers of neural networks one can provably learn the concept class containing two (or even three) layer neural networks with non-trivial activation functions.In contrast, Daniely focuses on training essentially only the last layer (and the hidden-layer movement is negligible). After this paper has appeared online, Arora et al. showed that neural networks can provably learn two-layer networks with a slightly weaker class of smooth activation functions. Namely, the activation functions that are either linear functions or even functions.

Our three-layer result gives the first theoretical proof that learning neural networks, even with non-convex interactions across layers, can still be plausible. In contrast, in the two-layer case the optimization landscape with overparameterization is almost convex ; and in previous studies on the multi-layer case, researchers have weakened the network by applying the so-called NTK (neural tangent kernel) linearization to remove all non-convex interactions .

In fact, the techniques introduced in this work is later used in to show how multi-layer neural networks can provably learn concept classes where no kernel methods can learn efficiently, and used in to show how the initial large learning rate in training a neural network can improve generalization, also not achievable by the convex kernel methods.

To some extent we explain the reason why overparameterization improves testing accuracy: with larger overparameterization, one can hope to learn better target functions with possibly larger size, more complex activations, smaller risk OPT\mathsf{OPT}, and to a smaller error ε\varepsilon.

We establish new tools to tackle the learning process of neural networks in general, which can be useful for studying other network architectures and learning tasks. (E.g., the new tools here have allowed researchers to study also the learning of recurrent neural networks .)

Other Related Works. We acknowledge a different line of research using kernels as improper learners to learn the concept class of neural networks . This is very different from us because we use “neural networks” as learners. In other words, we study the question of “what can neural networks learn” but they study “what alternative methods can replace neural networks.”

There is also a line of work studying the relationship between neural networks and NTKs (neural tangent kernels) . These works study neural networks by considering their “linearized approximations.” There is a known performance gap between the power of real neural networks and the power of their linearized approximations. For instance, ResNet achieves 96% test error on the CIFAR-10 data set but NTK (even with infinite width) achieves 77% . We also illustrate this in Figure 1.

With sufficient overparameterization, for two-layer neural networks Li and Liang and Du et al. show that gradient descent can perfectly fit the training samples when the data is not degenerated. Allen-Zhu et al. show that deep neural networks can also provably fit the training samples in polynomial time. These results do not cover generalization so are not relevant in studying the learnability of neural networks.

2 Why Do Overparameterized Networks Generalize?

Our result above assumes that the learner network is sufficiently overparameterized. So, why does it generalize to the population risk and give small test error? More importantly, why does it generalize with a number of samples that is (almost) independent of the number of parameters?

This question cannot be studied under the traditional VC-dimension learning theory since the VC dimension grows with the number of parameters. Several works explain generalization by studying some other “complexity” of the learned networks. Most related to the discussion here is where the authors prove a generalization bound in the norms (of weight matrices) of each layer, as opposed to the number of parameters. There are two main concerns with those results.

Learnability == Trainability ++ Generalization. It is not clear from those results how a network with both low “complexity” and small training loss can be found by the training method. Therefore, they do not directly imply PAC-learnability for non-trivial concept classes (at least for those concept classes studied by this paper).

Their norms are “sparsity induced norms”: for the norm not to scale with the number of hidden neurons mm, essentially, it requires the number of neurons with non-zero weights not to scale with mm. This more or less reduces the problem to the non-overparameterized case.

At a high level, our generalization is made possible with the following sequence of conceptual steps.

Good networks with small risks are plentiful: thanks to overparameterization, with high probability over random initialization, there exists a good network in the close neighborhood of any point on the SGD training trajectory. (This corresponds to Section 6.2 and 6.3.)

The optimization in overparameterized neural networks has benign properties: essentially along the training trajectory, there is no second-order critical points for learning three-layer networks, and no first-order critical points for two-layer. (This corresponds to Section 6.4.)

In the learned networks, information is also evenly distributed among neurons, by utilizing either implicit or explicit regularization. This structure allows a new generalization bound that is (almost) independent of the number of neurons. (This corresponds to Section 6.5 and 6.6, and we also empirically verify it in Section 7.1.)

Since practical neural networks are typically overparameterized, we genuinely hope that our results can provide theoretical insights to networks used in various applications.

3 Roadmap

In the main body of this paper, we introduce notations in Section 2, present our main results and contributions for two and three-layer networks in Section 3 and 4, and conclude in Section 5.

For readers interested in our novel techniques, we present in Section 6 an 8-paged proof sketch of our three-layer result. For readers more interested in the practical relevance, we give more experiments in Section 7. In the appendix, we begin with mathematical preliminaries in Appendix A. Our full three-layer proof is in Appendix C. Our two-layer proof is much easier and in Appendix B.

Notations

For notation simplicity, with high probability (or w.h.p.) means with probability 1eclog2m1-e^{-c\log^{2}m} for a sufficiently large constant cc for two-layer network where mm is the number of hidden neurons, and 1eclog2(m1m2)1-e^{-c\log^{2}(m_{1}m_{2})} for three-layer network where m1,m2m_{1},m_{2} are the number of hidden neurons for the first and second layer (to be precisely defined later). In this paper, O~\widetilde{O} hides factors of \polylog(m)\polylog(m) for two-layer networks, or \polylog(m1,m2)\polylog(m_{1},m_{2}) for three-layer networks.

Function complexity. The following notion measures the complexity of any smooth activation function ϕ(z)\phi(z). Suppose ϕ(z)=i=0cizi\phi(z)=\sum_{i=0}^{\infty}c_{i}z^{i}. Given a non-negative RR, the complexity

where CC^{*} is a sufficiently large constant (e.g., 10410^{4}). Intuitively, Cs\mathfrak{C}_{\mathfrak{s}} measures the sample complexity: how many samples are required to learn ϕ\phi correctly; while Cε\mathfrak{C}_{\varepsilon} bounds the network size: how much over-parameterization is needed for the algorithm to efficiently learn ϕ\phi up to ε\varepsilon error. It is always true that Cs(ϕ,R)Cε(ϕ,R)Cs(ϕ,O(R))×\poly(1/ε)\mathfrak{C}_{\mathfrak{s}}(\phi,R)\leq\mathfrak{C}_{\varepsilon}(\phi,R)\leq\mathfrak{C}_{\mathfrak{s}}(\phi,O(R))\times\poly(1/\varepsilon).Recall \big{(}\frac{\sqrt{\log(1/\varepsilon)}}{\sqrt{i}}C^{*}\big{)}^{i}\leq e^{O(\log(1/\varepsilon))}=\frac{1}{\poly(\varepsilon)} for every i1i\geq 1. While for sinz,exp(z)\sin z,\exp(z) or low degree polynomials, Cs(ϕ,O(R))\mathfrak{C}_{\mathfrak{s}}(\phi,O(R)) and Cε(ϕ,R)\mathfrak{C}_{\varepsilon}(\phi,R) only differ by o(1/ε)o(1/\varepsilon).

Result for Two-Layer Networks

the complexity of FF^{*} and assume they are bounded.

Standard two-layer networks fr(x)=i=1par,iϕ(w1,i,x)f_{r}^{*}(x)=\sum_{i=1}^{p}a^{*}_{r,i}\phi(\langle w^{*}_{1,i},x\rangle) are special cases of (3.1) (by setting w2,i=(0,,0,1)w_{2,i}^{*}=(0,\dots,0,1) and ϕi=ϕ\phi_{i}=\phi). Our formulation (3.1) additionally captures combinations of correlations between non-linear and linear measurements of different directions of xx.

Learning Process. Let W(0)W^{(0)} denote the initial value of the hidden weight matrix, and let W(0)+WtW^{(0)}+W_{t} denote the value at time tt. (Note that WtW_{t} is the matrix of increments.) The weights are initialized with Gaussians and then WW is updated by the vanilla SGD. More precisely,

entries of W(0)W^{(0)} and b(0)b^{(0)} are i.i.d. random Gaussians from N(0,1/m)\mathcal{N}(0,1/m),

entries of each ar(0)a^{(0)}_{r} are i.i.d. random Gaussians from N(0,εa2)\mathcal{N}(0,\varepsilon_{a}^{2}) for some fixed εa(0,1]\varepsilon_{a}\in(0,1].We shall choose εa=Θ~(ε)\varepsilon_{a}=\widetilde{\Theta}(\varepsilon) in the proof due to technical reason. As we shall see in the three-layer case, if weight decay is used, one can relax this to εa=1\varepsilon_{a}=1.

For notation simplicity, with high probability (or w.h.p.) means with probability 1eclog2m1-e^{-c\log^{2}m} for a sufficiently large constant cc, and O~\widetilde{O} hides factors of \polylog(m)\polylog(m).

For every \varepsilon\in\big{(}0,\frac{1}{pk\mathfrak{C}_{\mathfrak{s}}(\phi,1)}\big{)}, there exists

such that for every mM0m\geq M_{0} and every NΩ~(N0)N\geq\widetilde{\Omega}(N_{0}), choosing εa=ε/Θ~(1)\varepsilon_{a}=\varepsilon/\widetilde{\Theta}(1) for the initialization, choosing learning rate \eta=\widetilde{\Theta}\big{(}\frac{1}{\varepsilon km}\big{)} and

with high probability over the random initialization, SGD after TT iteration satisfies

SGD only takes one example per iteration. Thus, sample complexity is also at most TT.

We note sample complexity TT is (almost) independent of mm, the amount of overparametrization.

2 Our Interpretations

Overparameterization improves generalization. By increasing mm, Theorem 1 supports more target functions with possibly larger size, more complex activations, and smaller population risk OPT\mathsf{OPT}. In other words, when mm is fixed, among the class of target functions whose complexities are captured by mm, SGD can learn the best function approximator of the data, with the smallest population risk. This gives intuition how overparameterization improves test error, see Figure 1(a).

Large margin non-linear classifier. Theorem 1 is a nonlinear analogue of the margin theory for linear classifiers. The target function with a small population risk (and of bounded norm) can be viewed as a “large margin non-linear classifier.” In this view, Theorem 1 shows that assuming the existence of such large-margin classifier, SGD finds a good solution with sample complexity mostly determined by the margin, instead of the dimension of the data.

Inductive bias. Recent works (e.g., ) show that when the network is heavily overparameterized (that is, mm is polynomial in the number of training samples) and no two training samples are identical, then SGD can find a global optimum with classification error (or find a solution with ε\varepsilon training loss) in polynomial time. This does not come with generalization, since it can even fit random labels. Our theorem, combined with , confirms the inductive bias of SGD for two-layer networks: when the labels are random, SGD finds a network that memorizes the training data; when the labels are (even only approximately) realizable by some target network, then SGD learns and generalizes. This gives an explanation towards the well-known empirical observations of such inductive bias (e.g., ) in the two-layer setting, and is more general than Brutzkus et al. in which the target network is only linear.

Result for Three-Layer Networks

Concept class and target function F(x)F^{*}(x). This time we consider more powerful target functions F=(f1,,fk)F^{*}=(f_{1}^{*},\cdots,f_{k}^{*}) of the form

to denote the complexity of the two layers, and assume they are bounded.

Our concept class contains measures of correlations between composite non-linear functions and non-linear functions of the input, there are plenty of functions in this new concept class that may not necessarily have small-complexity representation in the previous formulation (3.1), and as we shall see in Figure 1(a), this is the critical advantage of using three-layer networks compared to two-layer ones or their NTKs. The learnability of this correlation is due to the non-convex interactions between hidden layers. As a comparison, studies the regime where the changes in hidden layers are negligible thus can not show how to learn this concept class with a three-layer network.

are only special cases of (4.1). Also, even in the special case of Φi(z)=z\Phi_{i}(z)=z, the target

captures combinations of correlations of combinations of non-linear measurements in different directions of xx. This we do not know how to compute using two-layer networks.

In fact, our results of this paper even apply to the following general form:

with the mild requirement that w1,j,w3,j=0\langle w^{*}_{1,j},w^{*}_{3,j}\rangle=0 for each j[p2]j\in[p_{2}]. We choose to present the slightly weaker formulation (4.1) for cleaner proofs.

Learner network F(x;W,V)F(x;W,V). Our learners are three-layer networks F=(f1,,fk)F=(f_{1},\ldots,f_{k}) with

Again for simplicity, we only update WW and VV. The weights are randomly initialized as:

entries of W(0)W^{(0)} and b1=b1(0)b_{1}=b^{(0)}_{1} are i.i.d. from N(0,1/m1)\mathcal{N}(0,1/m_{1}),

entries of V(0)V^{(0)} and b2=b2(0)b_{2}=b^{(0)}_{2} are i.i.d. from N(0,1/m2)\mathcal{N}(0,1/m_{2}),

entries of each ar=ar(0)a_{r}=a^{(0)}_{r} are i.i.d. from N(0,εa2)\mathcal{N}(0,\varepsilon_{a}^{2}) for εa=1\varepsilon_{a}=1. Recall in our two-layer result we have chosen εa<1\varepsilon_{a}<1 due to technical reasons; thanks to weight decay, we can simply select εa=1\varepsilon_{a}=1 in our three-layer case.

As for the optimization algorithm, we use SGD with weight decay and an explicit regularizer.

For some λ(0,1]\lambda\in(0,1], we will use λF(x;W,V)\lambda F(x;W,V) as the learner network, i.e., linearly scale FF down by λ\lambda. This is equivalent to replacing WW, VV with λW\sqrt{\lambda}W, λV\sqrt{\lambda}V, since a ReLU network is positive homogenous. The SGD will start with λ=1\lambda=1 and slowly decrease it, similar to weight decay.We illustrate the technical necessity of adding weight decay. During training, it is easy to add new information to the current network, but hard to forget “false” information that is already in the network. Such false information can be accumulated from randomness of SGD, non-convex landscapes, and so on. Thus, by scaling down the network we can effectively forget false information.

We also use an explicit regularizer for some λw,λv>0\lambda_{w},\lambda_{v}>0 with This 2,4\|\cdot\|_{2,4} norm on WW encourages weights to be more evenly distributed across neurons. It can be replaced with λt1Wt12,2+α2+α\|\sqrt{\lambda_{t-1}}W_{t-1}\|_{2,2+\alpha}^{2+\alpha} for any constant α>0\alpha>0 for our theoretical purpose. We choose α=2\alpha=2 for simplicity, and observe that in practice, weights are automatically spread out due to data randomness, so this explicit regularization may not be needed. See Section 7.1 for an experiment.

Now, in each round t=1,2,,Tt=1,2,\dots,T, we use (noisy) SGD to minimize the following stochastic objective for some fixed λt1\lambda_{t-1}:

2 Main Theorems

For notation simplicity, with high probability (or w.h.p.) means with probability 1eclog2(m1m2)1-e^{-c\log^{2}(m_{1}m_{2})} and O~\widetilde{O} hides factors of \polylog(m1,m2)\polylog(m_{1},m_{2}).

Consider Algorithm 2. For every constant γ(0,1/4]\gamma\in(0,1/4], every ε0(0,1/100]\varepsilon_{0}\in(0,1/100], every ε=ε0kp1p22Cs(Φ,p2Cs(ϕ,1))Cs(ϕ,1)2\varepsilon=\frac{\varepsilon_{0}}{kp_{1}p_{2}^{2}\mathfrak{C}_{\mathfrak{s}}(\Phi,p_{2}\mathfrak{C}_{\mathfrak{s}}(\phi,1))\mathfrak{C}_{\mathfrak{s}}(\phi,1)^{2}}, there exists

such that for every m2=m1=mMm_{2}=m_{1}=m\geq M, and properly set λw,λv,σw,σv\lambda_{w},\lambda_{v},\sigma_{w},\sigma_{v} in Table 1, as long as

there is a choice η=1/\poly(m1,m2)\eta=1/\poly(m_{1},m_{2}) and T=\poly(m1,m2)T=\poly(m_{1},m_{2}) such that with probability 99/100\geq 99/100,

For general Φ\Phi, ignoring p2p_{2}, the complexity of three-layer networks is essentially Cε(Φ,Cε(ϕ,1))\mathfrak{C}_{\varepsilon}(\Phi,\mathfrak{C}_{\varepsilon}(\phi,1)). This is necessary in some sense: consider the case when ϕ(x)=Cx\phi(x)=Cx for a very large parameter CC, then Φ(ϕ(z))\Phi(\phi(z)) is just a function Φ(z)=Φ(Cz)\Phi^{\prime}(z)=\Phi(Cz), and we have Cε(Φ,1)=Cε(Φ,C)Cε(Φ,Θ(Cε(ϕ,1)))\mathfrak{C}_{\varepsilon}(\Phi^{\prime},1)=\mathfrak{C}_{\varepsilon}(\Phi,C)\approx\mathfrak{C}_{\varepsilon}(\Phi,\Theta(\mathfrak{C}_{\varepsilon}(\phi,1))).

3 Our Contributions

Our sample complexity NN scales polynomially with the complexity of the target network, and is (almost) independent of mm, the amount of overparameterization. This itself can be quite surprising, because recent results on neural network generalization require NN to be polynomial in mm. Furthermore, Theorem 2 shows three-layer networks can efficiently learn a bigger concept class (4.1) comparing to what we know about two-layer networks (3.1).

From a practical standpoint, one can construct target functions of the form (4.1) that cannot be (efficiently) approximated by any two-layer target function in (3.1). If data is generated according to such functions, then it may be necessary to use three-layer networks as learners (see Figure 1).

From a theoretical standpoint, even in the special case of Φ(z)=z\Phi(z)=z, our target function can capture correlations between non-linear measurements of the data (recall Remark 4.1). This means Cε(Φ,Cε(ϕ,1)p2)O(p2Cε(ϕ,1))\mathfrak{C}_{\varepsilon}(\Phi,\mathfrak{C}_{\varepsilon}(\phi,1)\sqrt{p_{2}})\approx O(\sqrt{p_{2}}\mathfrak{C}_{\varepsilon}(\phi,1)), so learning it is essentially in the same complexity as learning each ϕs,j\phi_{s,j}. For example, a three-layer network can learn cos(100w1,x)e100w2,x\cos(100\langle w^{*}_{1},x\rangle)\cdot e^{100\langle w^{*}_{2},x\rangle} up to accuracy ε\varepsilon in complexity \poly(1/ε)\poly(1/\varepsilon), while it is unclear how to do so using two-layer networks.

Technical Contributions. We highlight some technical contributions in the proof of Theorem 2.

In recent results on the training convergence of neural networks for more than two layers , the optimization process stays in a close neighborhood of the initialization so that, with heavy overparameterization, the network becomes “linearized” and the interactions across layers are negligible. In our three-layer case, this means that the matrix WW never interacts with VV. They then argue that SGD simulates a neural tangent kernel (NTK) so the learning process is almost convex .

In our analysis, we directly tackle non-convex interactions between WW and VV, by studying a “quadratic approximation” of the network. See Remark 6.1 for a mathematical comparison. In other words, we introduced a second-order version of NTK, and our new proof techniques could be useful for future theoretical applications.

Comparison to Daniely . Daniely studies the learnability of multi-layer networks when (essentially) only the output layer is trained, which reduces to a convex task. He shows that multi-layer networks can learn a compositional kernel space, which implies two/three-layer networks can efficiently learn low-degree polynomials. He did not derive the general sample/time complexity bounds for more complex functions such as those in our concept classes (3.1) and (4.1), but showed that they are finite.

In contrast, our learnability result of concept class (4.1) is due to the non-convex interaction between hidden layers. Since Daniely studies the regime when the changes in hidden layers are negligible, if three layer networks are used, to the best of our knowledge, their theorem cannot lead to similar sample complexity bounds comparing to Theorem 2 by only training the last layer of a three-layer network. Empirically, one can also observe that training hidden layers is better than training the last layer (see Figure 1).

Conclusion and Discussion

We show by training the hidden layers of two-layer (resp. three-layer) overparameterized neural networks, one can efficiently learn some important concept classes including two-layer (resp. three-layer) networks equipped with smooth activation functions. Our result is in the agnostic PAC-learning language thus is distribution-free. We believe our work opens up a new direction in both algorithmic and generalization perspectives of overparameterized neural networks, and pushing forward can possibly lead to more understanding about deep learning.

One can also combine this paper with properties of recurrent neural networks (RNNs) to derive PAC-learning results for RNNs , or use the existential tools of this paper to derive PAC-learning results for three-layer residual networks (ResNet) . The latter gives a provable separation between neural networks and kernels in the efficient PAC-learning regime.

There are plenty of open directions following our work, especially how to extend our result to larger number of layers. Even for three layers, our algorithm currently uses an explicit regularizer R(W,V)R(W^{\prime},V^{\prime}) on the weights to control the generalization. In contrast, in practice, it is observed that neural networks actually implicitly regularizes: even without any restriction to weights, the network learned by an unconstraint overparameterized neural network still generalizes. It is an interesting direction to explain this inductive bias for three layers and beyond. (This present paper only explains inductive bias for two-layer networks.) As for a third direction, currently our result does not directly apply to target networks with ReLU activations. While there are functions ϕ\phi with Cε(ϕ,1)=2O(1/ε)\mathfrak{C}_{\varepsilon}(\phi,1)=2^{O(1/\varepsilon)} that approximate ReLU function in $withwith\varepsilon$ error, obtaining an polynomial or sub-exponential bound on the sampling complexity would be of great interest. On the other hand, the result by indicates that learning ReLU could also be hard for algorithms such as SGD.

We present key technical lemmas we used for proving the three-layer network theorems in Section 6 so that interested readers do not need to go through the appendix. Many of them can be of independent interests and have found further applications beyond this paper (such as for residual networks and for recurrent networks ). The full proof is in Appendix C.

The two-layer result is based on similar ideas but simpler, and the full proof is in Appendix B.

We give more experiments in Section 7 on Page 7. Our appendix starts on page A.

Main Lemmas for Three Layer Networks

In Section 6.1, we state the main theorem for the first variant of the SGD, which we excluded from the main body due to space limitation. In Section 6.2, we show the existence of some good “pseudo network” that can approximate the target. In Section 6.3, we present our coupling technique between a real network and a pseudo network. In Section 6.4, we present the key lemma about the optimization procedure. In Section 6.5, we state a simple generalization bound that is compatible with our algorithm. These techniques together give rise to the proof of Theorem 3. In Section 6.6, we present additional techniques needed to show Theorem 2.

In the first variant of SGD, in each round t=1,2,,Tt=1,2,\dots,T, we use (noisy) SGD to minimize the following stochastic objective for some fixed λt1\lambda_{t-1}:

Below is the main theorem for using the first variant of SGD to train three-layer networks.

Consider Algorithm 3. In the same setting as Theorem 2, for every m2=m1=mMm_{2}=m_{1}=m\geq M, as long as

there is choice η=1/\poly(m1,m2)\eta=1/\poly(m_{1},m_{2}), T=\poly(m1,m2)T=\poly(m_{1},m_{2}) such that with probability at least 99/10099/100,

As mm goes large, this sample complexity Nm3/2N\approx m^{3/2} polynomially scales with mm so may not be very efficient (we did not try hard to improve the exponent 3/23/2). Perhaps interesting enough, this is already non-trivial, because NN can be much smaller than m2m^{2}, the number of parameters of the network or equivalently the naive VC dimension bound. Recall in our second variant of SGD, the sample complexity NN only grows polylogarithmically in mm.

2 Existence

Dv,x{0,1}m2×m2D_{v,x}\in\{0,1\}^{m_{2}\times m_{2}} denote the diagonal sign matrix of the second layer at random initialization.

Consider the output of a three-layer network at randomly initialized sign without bias as

The above pseudo network can be reminiscent of the simpler linearized NTK approximation of a network used in prior work , which in our language means

In such linearization it is clear that the weight matrices WW and VV do not interact with each other. In contrast, in our quadratic formula arDv,xVDw,xWxa_{r}D_{v,x}VD_{w,x}Wx, the matrices WW and VV are multiplied together, resulting in a non-convex interaction after putting into the loss function. This can be viewed as a second-order version of NTK (but is itself not NTK). We shall see in Section 6.3 that the pseudo network can be made close to the real network in some sense.

For every \varepsilon\in\big{(}0,\frac{1}{kp_{1}p_{2}^{2}\mathfrak{C}_{\mathfrak{s}}(\Phi,p_{2}\mathfrak{C}_{\mathfrak{s}}(\phi,1))\mathfrak{C}_{\mathfrak{s}}(\phi,1)^{2}}\big{)}, there exists

such that if m1,m2Mm_{1},m_{2}\geq M, then with high probability, there exists weights W,VW^{\divideontimes},V^{\divideontimes} with

In other words, at randomly initialized signs, there exist choices of WW^{\divideontimes} and VV^{\divideontimes} with small norms so that G(0)(x;W,V)G^{(0)}(x;W^{\divideontimes},V^{\divideontimes}) approximates the target. Later, we will combine this with the coupling lemma Section 6.3 to show a main structural property of overparameterized networks: solutions with good population risks are dense in the parameter space, in the sense that with high probability over the random initialization, there exists a good solution in the “close ” neighborhood of the initialized weights.

where α1,β1,b0N(0,1)\alpha_{1},\beta_{1},b_{0}\sim\mathcal{N}(0,1) are independent random variables.

The reason Lemma lem:fit_fun_mainb is equivalent to Lemma lem:fit_fun_maina consists of a few thinking steps. Without loss of generality, one can assume w=(1,0,0,,0)w^{*}=(1,0,0,\dots,0) and write ϕ(w,x)=ϕ(x1)\phi(\langle w^{*},x\rangle)=\phi(x_{1}) and write h(w,w,b0)=h(w1,b0)h(\langle w,w^{*}\rangle,b_{0})=h(w_{1},b_{0}) so it only depends on the first coordinate of ww and b0b_{0}. Under such simplifications, the second through last coordinates do not make any difference, so we can assume without loss of generality that w,w,xw^{*},w,x are only in 2 dimensions, and write w=(α1,β1)w=(\alpha_{1},\beta_{1}) and x=(x1,1x12)x=(x_{1},\sqrt{1-x_{1}^{2}}). In sum, proving Lemma lem:fit_fun_maina suffices in establishing Lemma lem:fit_fun_mainb.

Given Lemma 6.3, we can directly apply it to the two-layer case (see Appendix B.1) to show the existence of good pseudo networks. As for the three-layer case, we need to apply Lemma 6.3 twice: once for (each neuron of) the second hidden layer and once for the output.

Consider the the input (without bias) to a single neuron of the second hidden layer at random initialization. Without loss of generality, say the first neuron, given as:

Even though n1(x)n_{1}(x) is completely random, using Lemma 6.3, we can derive the following lemma which rewrites n1(x)n_{1}(x) in the direction of an arbitrary function ϕ\phi.

Moreover, letting C=Cε(ϕ,1)C=\mathfrak{C}_{\varepsilon}(\phi,1) be the complexity of ϕ\phi, and if v1,i(0)N(0,1m2)v_{1,i}^{(0)}\sim\mathcal{N}(0,\frac{1}{m_{2}}) and wi,j(0),b1,i(0)N(0,1m1)w_{i,j}^{(0)},b^{(0)}_{1,i}\sim\mathcal{N}(0,\frac{1}{m_{1}}) are at random initialization, then we have

For every fixed xx, ρ(v1(0),W(0),b1(0))\rho\left(v_{1}^{(0)},W^{(0)},b_{1}^{(0)}\right) is independent of B(x,v1(0),W(0),b1(0))B\left(x,v_{1}^{(0)},W^{(0)},b_{1}^{(0)}\right).

ρ(v1(0),W(0),b1(0))N(0,1100C2m2)\rho\left(v_{1}^{(0)},W^{(0)},b_{1}^{(0)}\right)\sim\mathcal{N}\left(0,\frac{1}{100C^{2}m_{2}}\right).

For every xx with x2=1\|x\|_{2}=1, ϕε(x)ϕ(w,x)ε|\phi_{\varepsilon}(x)-\phi(\langle w^{*},x\rangle)|\leq\varepsilon.

For every fixed xx with x2=1\|x\|_{2}=1, with high probability R(x,v1(0),W(0),b1(0))O~(1m1m2)\left|R\left(x,v_{1}^{(0)},W^{(0)},b_{1}^{(0)}\right)\right|\leq\widetilde{O}\left(\frac{1}{{\sqrt{m_{1}m_{2}}}}\right) and B(x,v1(0),W(0),b1(0))O~(1m2)\left|B\left(x,v_{1}^{(0)},W^{(0)},b_{1}^{(0)}\right)\right|\leq\widetilde{O}\left(\frac{1}{{\sqrt{m_{2}}}}\right).

Furthermore, there exists real-valued function ρ~(v1(0))\widetilde{\rho}(v_{1}^{(0)}) satisfying with high probability:

Lemma 6.4 shows that, up to some small noise BB and RR, we can “view” the input to any neuron of the second layer essentially as “a Gaussian variable ρ\rho” times “the target function ϕ(w,x)\phi(\langle w^{*},x\rangle)” in the first layer of the target network. This allows us to apply Lemma 6.3 again for the output layer, so as to construct for instance a composite function Φ()\Phi(\cdot) with ϕi\phi_{i} as its inputs.One technical issue is the following. When considering multiple neurons of the second layer, those Gaussian variables ρ\rho are not independent because they all depend on W(0)W^{(0)}. The notion of ρ~\widetilde{\rho} in Lemma 6.4 removes such dependency across multiple neurons, at the expense of small Wasserstein distance.

Lemma 6.4 may sound weird at first look because random initialization cannot carry any information about the target. There is no contradiction here, because we will show, BB is essentially another Gaussian (with the same distribution as ρ\rho) times constantϕ2(w,x)\sqrt{\text{constant}-\phi^{2}(\langle w^{*},x\rangle)}, thus n1(x)n_{1}(x) can still be independent of the value of ϕ(w,x)\phi(\langle w^{*},x\rangle). Nevertheless, the decomposition in Lemma 6.4 shall enable us to show that, when we start to modify the hidden weights W,VW,V, the learning process will start to discover this structure and make the weight of the term relating to ϕ(w,x)\phi(\langle w^{*},x\rangle) stand out from other terms.

Using Lemma 6.4 and applying Lemma 6.3 once more, we can prove Lemma 6.2.

3 Coupling Between Real and Pseudo Networks

Suppose we are currently at weights W(0)+W+WρW^{(0)}+W^{\prime}+W^{\rho}, V(0)+V+VρV^{(0)}+V^{\prime}+V^{\rho}, where matrices Wρ,VρW^{\rho},V^{\rho} are random Gaussian matrices such that:

for some σv,σw[1/(m1m2),1]\sigma_{v},\sigma_{w}\in[1/(m_{1}m_{2}),1] to be specified later, and W,VW^{\prime},V^{\prime} are matrices with bounded norms that can depend on the randomness of W(0),b1(0),V(0),b2(0)W^{(0)},b_{1}^{(0)},V^{(0)},b_{2}^{(0)}. Intuitively, WW^{\prime} and VV^{\prime} capture how much the algorithm has moved away from the initialization, while Wρ,VρW^{\rho},V^{\rho} are introduced for adding smoothness in the optimization, see Section 6.4.

Let us introduce the notion of pseudo networks at the current weights. Let

Dw,xD_{w,x} denote the diagonal sign matrix of the first layer at random initialization W(0)W^{(0)},

Dv,xD_{v,x} denote the diagonal sign matrix of the second layer at random initialization W(0),V(0)W^{(0)},V^{(0)},

Dv,x+Dv,x{0,1}m2×m2D_{v,x}+D_{v,x}^{\prime}\in\{0,1\}^{m_{2}\times m_{2}} denote the diagonal sign matrix of the second layer at these weights.

For a fixed r[k]r\in[k], let us denote row vector ar=(ar,i)i[m2]a_{r}=(a_{r,i})_{i\in[m_{2}]}. Define the pseudo network (and its semi-bias, bias-free versions) as

As a sanity check, at W(0)+W+WρW^{(0)}+W^{\prime}+W^{\rho}, V(0)+V+VρV^{(0)}+V^{\prime}+V^{\rho} the pseudo network equals the true one:

Suppose \tau_{v}\in\big{[}0,1\big{]}, \tau_{w}\in\big{(}\frac{1}{m_{1}^{3/2}},\frac{1}{m_{1}^{1/2}}\big{]}, \sigma_{w}\in\big{(}\frac{1}{m_{1}^{3/2}},\frac{\tau_{w}}{m_{1}^{1/4}}\big{]}, \sigma_{v}\in\big{(}0,\frac{1}{m_{2}^{1/2}}\big{]} and η>0\eta>0. Given fixed unit vector xx, and perturbation matrices W,V,W,VW^{\prime},V^{\prime},W^{\prime\prime},V^{\prime\prime} (that may depend on the randomness of W(0),b1(0),V(0),b2(0)W^{(0)},b_{1}^{(0)},V^{(0)},b_{2}^{(0)} and xx) satisfying

(Sparse sign change). Dw,x0O~(τw4/5m16/5)\|D_{w,x}^{\prime}\|_{0}\leq\widetilde{O}(\tau_{w}^{4/5}m_{1}^{6/5}), Dv,x0O~(σvm23/2+τv2/3m2+τw2/3m11/6m2)\|D_{v,x}^{\prime}\|_{0}\leq\widetilde{O}\left(\sigma_{v}m_{2}^{3/2}+\tau_{v}^{2/3}m_{2}+\tau_{w}^{2/3}m_{1}^{1/6}m_{2}\right).

The first statement “sparse sign change” of Lemma 6.5 says that, if we move from random initialization W(0),V(0)W^{(0)},V^{(0)} to W(0)+W+Wρ,V(0)+V+VρW^{(0)}+W^{\prime}+W^{\rho},V^{(0)}+V^{\prime}+V^{\rho}, then how many signs of the ReLUs (in each layer) will change, as a function of the norms of WW^{\prime} and VV^{\prime}. This calculation is similar to but slightly more involved due to the 2,4\|\cdot\|_{2,4} norm that we use here.

We quickly point out a corollary by applying the coupling and existential lemmas together.

Recall in the existential Lemma 6.2, we have studied a pseudo network G(0)G^{(0)} where the signs are determined at the random initialization W(0),V(0)W^{(0)},V^{(0)}. Now, the coupling Lemma 6.5 says that the amount of sign change from G(0))G^{(0))} to G(b,b)G^{(b,b)} can be controlled. Therefore, if parameters are chosen appropriately, the existential Lemma 6.2 should also apply to G(b,b)G^{(b,b)}. Formally,

In the same setting as Lemma 6.2, given perturbation matrices W,VW^{\prime},V^{\prime} (that may depend on the randomness of the initialization and the data distribution D\mathcal{D}) with

Using parameter choices from Table 1, w.h.p. there exist WW^{\divideontimes} and VV^{\divideontimes} (independent of the randomness of Wρ,VρW^{\rho},V^{\rho}) satisfying

As we shall see later, Corollary 6.6 gives rise to WW^{\divideontimes} and VV^{\divideontimes} that shall be used as a descent direction for the objective. (Corollary 6.6 did not use all the parameters inside Table 1, and some of those parameters shall be used in later subsections.)

4 Optimization

use the property that solutions with good risks are dense in the parameter space (i.e., Corollary 6.6) to show that the optimization landscape of the overparameterized three-layer neural network is benign: it has no spurious local minimal or (more or less) even any second-order critical points; and

use existing theorems on escaping saddle points (such as ) to show that SGD will not be stuck in saddle points and thus converges.

Key issue. Unfortunately, before digging into the details, there is already a big hole in this approach. ReLU networks are not second-order-differentiable: a ReLU activation does not have a well-defined Hessian/sub-Hessian at zero. One may na ïvelythink that since a ReLU network is infinite-order differentiable everywhere except a measure zero set, so we can safely ignore the Hessian issue and proceed by pretending that the Hessian of ReLU is always zero. This intuition is very wrong. Following it, we could have run into the absurd conclusion that any piece-wise linear function is convex, since the Hessian of it is zero almost everywhere. In other words, the only non-smooth point of ReLU has a Hessian value equal to the Dirac δ\delta-function, but such non-smooth points, albeit being measure zero, are actually turning points in the landscape. If we want a meaningful second-order statement of the ReLU network, we must not na ïvelyignore the “Hessian” of ReLU at zeros.

In practice, since the solution Wt,VtW_{t},V_{t} are found by stochastic gradient descent starting from random initialization, they will have a non-negligible amount of intrinsic noise. Thus, the additional smoothing in the algorithm might not be needed by an observation in . Smooth analysis might also be used for analyzing the effect of such noise, but this is beyond the scope of this paper.

Actual algorithm. Let us consider the following smoothed, and regularized objective:

where Wρ,VρW^{\rho},V^{\rho} are Gaussian random matrices with each entry i.i.d. from N(0,σw2)\mathcal{N}(0,\sigma_{w}^{2}) and N(0,σv2)\mathcal{N}(0,\sigma_{v}^{2}), respectively. R(λtWt,λtVt)=λvλtVtF2+λwλtWt2,44R(\sqrt{\lambda_{t}}W_{t},\sqrt{\lambda_{t}}V_{t})=\lambda_{v}\|\sqrt{\lambda_{t}}V_{t}\|_{F}^{2}+\lambda_{w}\|\sqrt{\lambda_{t}}W_{t}\|_{2,4}^{4} and λv,λw\lambda_{v},\lambda_{w} are set such that λvλtVF2ε0\lambda_{v}\|\sqrt{\lambda_{t}}V^{\divideontimes}\|_{F}^{2}\leq\varepsilon_{0} and λwλtW2,44ε0\lambda_{w}\|\sqrt{\lambda_{t}}W^{\divideontimes}\|_{2,4}^{4}\leq\varepsilon_{0} for every WW^{\divideontimes} and VV^{\divideontimes} coming from Lemma 6.2. See Algorithm 3 (first SGD variant) for the details. We prove the following lemma.

For every ε0(0,1)\varepsilon_{0}\in(0,1) and ε=ε0kp1p22Cs(Φ,p2Cs(ϕ,1))Cs(ϕ,1)2\varepsilon=\frac{\varepsilon_{0}}{kp_{1}p_{2}^{2}\mathfrak{C}_{\mathfrak{s}}(\Phi,p_{2}\mathfrak{C}_{\mathfrak{s}}(\phi,1))\mathfrak{C}_{\mathfrak{s}}(\phi,1)^{2}}, for every constant γ(0,1/4]\gamma\in(0,1/4], consider the parameter choices in Table 1, and consider any λt,Wt,Vt\lambda_{t},W_{t},V_{t} (that may depend on the randomness of W(0),b(0),V(0),b(1)W^{(0)},b^{(0)},V^{(0)},b^{(1)} and Z\mathcal{Z}) with

With high probability over the random initialization, there exists W,VW^{\divideontimes},V^{\divideontimes} with WF,VF1\|W^{\divideontimes}\|_{F},\|V^{\divideontimes}\|_{F}\leq 1 such that for every η[0,1\poly(m1,m2)]\eta\in\left[0,\frac{1}{\poly(m_{1},m_{2})}\right]:

More interestingly, since (noisy) stochastic gradient descent is capable of finding approximate second-order critical points with a global convergence rate, one can show the following final convergence rate for minimizing LL^{\prime} for Algorithm 3.

In the setting of Theorem 3, with probability at least 99/10099/100, Algorithm 3 (the first SGD variant) converges in TTw=\poly(m1,m2)TT_{w}=\poly\left(m_{1},m_{2}\right) iterations to a point

Dw,x,ρD_{w,x,\rho} denote the diagonal matrix with diagonals being 0-1 signs of the first layer at W+WρW+W^{\rho};

Dv,x,ρD_{v,x,\rho} denote that of the second layer at weights W+WρW+W^{\rho} and V+VρV+V^{\rho}; and

For a fixed r[k]r\in[k], consider the real network Pρ,ηP_{\rho,\eta} and the pseudo network Pρ,ηP^{\prime}_{\rho,\eta}:

There exists η0=1\poly(m1,m2)\eta_{0}=\frac{1}{\poly(m_{1},m_{2})} such that, for every ηη0\eta\leq\eta_{0}, for every fixed xx with x2=1\|x\|_{2}=1, for every W,V,W,VW^{\prime},V^{\prime},W^{\prime\prime},V^{\prime\prime} that may depend on the randomness of the initialization and

where OpO_{p} hides polynomial factor of m1,m2m_{1},m_{2}.

In other words, when working with a Gaussian smoothed network (with output Pρ,ηP_{\rho,\eta}), it suffices to study a pseudo network (with output Pρ,ηP_{\rho,\eta}^{\prime}). In the proof of Lemma 6.7, this allows us to go from the real (smoothed) network to the pseudo (smoothed) network, and then apply Lemma 6.5.

5 Generalization

In our first variant of SGD Algorithm 3, we show a very crude Rademacher complexity bound that can be derived from the contraction lemma (a.k.a. Talagrand’s concentration inequality).

For every τv,τw0\tau^{\prime}_{v},\tau^{\prime}_{w}\geq 0, every σv(0,1/m2]\sigma_{v}\in(0,1/\sqrt{m_{2}}], w.h.p. for every r[k]r\in[k] and every N1N\geq 1, the empirical Rademacher complexity is bounded by

Since the population risk is bounded by the Rademacher complexity, combining this with Lemma 6.8, one can easily prove Theorem 3.

6 Second Variant of SGD

To show that this gives a better sampling complexity bound, we need the following stronger coupling lemma. It gives a somewhat better bound on the “cross term vanish” part comparing to the old coupling Lemma 6.5.

Under parameter choices Table 1, the last error term is at most ε/k\varepsilon/k.

For this SGD variant, we also have the following the stronger Rademacher complexity bound. It relies on Lemma 6.11 to reduce the function class to pseudo networks, which are only linear functions in WW^{\prime} and VV^{\prime}, and then computes its Rademacher complexity.

For every τv\tau^{\prime}_{v}\in, \tau^{\prime}_{w}\in\big{[}\frac{1}{m_{1}^{3/4}},\frac{1}{m_{1}^{9/16}}\big{]}, every σw[0,1/m1]\sigma_{w}\in[0,1/\sqrt{m_{1}}] and σv[0,1/m2]\sigma_{v}\in[0,1/\sqrt{m_{2}}], w.h.p. for every r[k]r\in[k] and every N1N\geq 1, we have by our choice of parameters in Lemma 6.7, the empirical Rademacher complexity is bounded by

Under parameter choices in Table 1, this is at most \widetilde{O}\big{(}\frac{\tau^{\prime}_{w}\tau^{\prime}_{v}m_{1}^{1/4}\sqrt{m_{2}}}{\sqrt{N}}\big{)}+\varepsilon/k.

After plugging Lemma 6.11 and Lemma 6.12 into the final proof, we can show Theorem 2.

Empirical Evaluations

To be consistent with our theorems, we choose random initialization as follows. Entries of ara_{r} are i.i.d. from N(0,1)\mathcal{N}(0,1), and entries of W,V,b1,b2W,V,b_{1},b_{2} are i.i.d. from N(0,1m)\mathcal{N}(0,\frac{1}{m}). This ensures that the output at random initialization is Θ(1)\Theta(1).

In Figure 2, we provide an additional experiment for target function y=F(x)=(tanh(8x1)+tanh(8x2)+tanh(8x3)2)2tanh(8x4)y=F^{*}(x)=(\tanh(8x_{1})+\tanh(8x_{2})+\tanh(8x_{3})-2)^{2}\cdot\tanh(8x_{4}).

Recall our three-layer theorem requires a slightly unconventional regularizer, namely, the W2,4\|W\|_{2,4} norm of the first layer to encourage weights to be more evenly distributed across neurons. Is this really necessary? To get some preliminary idea we run our aforementioned three-layer experiment on the first data set,

once with the traditional weight decay on WW (which corresponds to minimizing WF\|W\|_{F}), and

once with the W2,4\|W\|_{2,4} norm without weight decay.

In both cases we best tune the learning rates as well as the weight decay parameter (or the weight of the W2,44\|W\|_{2,4}^{4} regularizer). We present our findings in Figure 3.

In Figure 3(a), we observe that there is no real difference between the two regularizers in terms of test error. In one case (m=200m=200) using W2,4\|W\|_{2,4} even gives slightly better test accuracy (but we do not want to claim this as a general phenomenon).

More importantly, since by Cauchy-Schwarz we have WFW2,4m1/4WF\|W\|_{F}\geq\|W\|_{2,4}\geq m^{-1/4}\|W\|_{F}. Let us consider the following norm ratio:

In Figure 3(b), we see that ratioratio is roughly the same between two types of regularizers. This means, even if we regularize only the Frobenius norm, weights are still sufficiently distributed across neurons comparing to what we can do by regularizing W2,4\|W\|_{2,4}. We leave it a future work to study why SGD encourages such implicit regularization.

We provide technical preliminaries in Appendix A, give our two-layer proofs in Appendix B, and three-layer proofs in Appendix C.

Appendix A Technical Preliminaries

where the infimum is taken over all possible joint distributions over (X,Y)(X,Y) where the marginal on XX (resp. YY) is distributed in the same way as AA (resp. BB).

Slightly abusing notation, in this paper, we say a random variable XX satisfies XB|X|\leq B with high probability if (1) XB|X|\leq B w.h.p. and (2) W2(X,0)B\mathcal{W}_{2}(X,0)\leq B. For instance, if g=N(0,1/m)g=\mathcal{N}(0,1/m), then gO~(1/m)|g|\leq\widetilde{O}(1/\sqrt{m}) with high probability.

Let (n1,α1,a1,1,a2,1),,(nm,αm,a1,m,a2,m)(n_{1},\alpha_{1},a_{1,1},a_{2,1}),\cdots,(n_{m},\alpha_{m},a_{1,m},a_{2,m}) be mm i.i.d. samples from some distribution, where within a 4-tuples:

the marginal distribution of a1,ia_{1,i} and a2,ia_{2,i} is standard Gaussian N(0,1)\mathcal{N}(0,1);

nin_{i} and αi\alpha_{i} are not necessarily independent;

a1,ia_{1,i} and a2,ia_{2,i} are independent; and

nin_{i} and αi\alpha_{i} are independent of a1,ia_{1,i} and a2,ia_{2,i}.

Since this holds for every choice of {ni,αi}i[m]\{n_{i},\alpha_{i}\}_{i\in[m]} we can complete the proof. The second inequality follows from sub-exponential concentration bounds. ∎

If X1,X2X_{1},X_{2} are independent, and X1,X3X_{1},X_{3} are independent conditional on X2X_{2}, then X1X_{1} and X3X_{3} are independent.

Multiplying Pr[X2=x2]\operatornamewithlimits{\mathbf{Pr}}[X_{2}=x_{2}] on both side leads to:

Marginalizing away X2X_{2} gives Pr[X1=x1,X3=x3]=Pr[X1=x1]Pr[X3=x3]\operatornamewithlimits{\mathbf{Pr}}[X_{1}=x_{1},X_{3}=x_{3}]=\operatornamewithlimits{\mathbf{Pr}}[X_{1}=x_{1}]\operatornamewithlimits{\mathbf{Pr}}[X_{3}=x_{3}], so X1X_{1} and X3X_{3} are independent. ∎

A.2 Central Limit Theorem

The following Wasserstein distance bound of central limit theorem is easy to derive from known results:

In other words—by choosing t=Ri/Σit=R_{i}/\Sigma_{i}— we have for every Ri5C2R_{i}\geq 5C^{2}, letting ZiN(0,Ri)Z_{i}\sim\mathcal{N}(0,R_{i}), Zi+1N(0,Ri+Σi)Z_{i+1}\sim\mathcal{N}(0,R_{i}+\Sigma_{i}) be independent gaussian (also independent of XiX_{i}), it satisfies

Repeatedly applying the above inequality and starting with Z1N(0,5C2)Z_{1}\sim\mathcal{N}(0,5C^{2}) and choosing Ri=5C2+j=1i1ΣjR_{i}=5C^{2}+\sum_{j=1}^{i-1}\Sigma_{j}, we have ZiN(0,5C2+j=1i1Σj)Z_{i}\sim\mathcal{N}(0,5C^{2}+\sum_{j=1}^{i-1}\Sigma_{j}) and Zm+1N(0,5C2+V)Z_{m+1}\sim\mathcal{N}(0,5C^{2}+V). Using Σ1Σ2Σm\Sigma_{1}\geq\Sigma_{2}\geq\cdots\geq\Sigma_{m}, we know that 5CΣiRi5Ci1\frac{5C\Sigma_{i}}{R_{i}}\leq\frac{5C}{i-1} for i2i\geq 2. This implies that

Finally, since X1C|X_{1}|\leq C we have W2(X1,0)C\mathcal{W}_{2}(X_{1},0)\leq C and Σ12C2\Sigma_{1}^{2}\leq C^{2}. By triangle inequality we have

There exists ZN(0,V)Z\sim\mathcal{N}(0,V) such that W2(Zm+1,Z)=O(C)\mathcal{W}_{2}(Z_{m+1},Z)=O(C). All of these together imply

A.3 Interval Partition

(Indicator). s(y,g)=0\mathfrak{s}(y,g)=0 if gI(y)g\notin I(y), and s(y,g){1,1}\mathfrak{s}(y,g)\in\{-1,1\} otherwise.

(Balanced). PrgN(0,1)[gI(y)]=τ\operatornamewithlimits{\mathbf{Pr}}_{g\sim\mathcal{N}(0,1)}[g\in I(y)]=\tau for every yy\in.

(Symmetric). PrgN(0,1)[s(y,g)=1]=PrgN(0,1)[s(y,g)=1]\operatornamewithlimits{\mathbf{Pr}}_{g\sim\mathcal{N}(0,1)}\left[\mathfrak{s}(y,g)=1\right]=\operatornamewithlimits{\mathbf{Pr}}_{g\sim\mathcal{N}(0,1)}\left[\mathfrak{s}(y,g)=-1\right].

(Bounded). maxxI(y){s(y,x)x}minxI(y){s(y,x)x}10τ\max_{x\in I(y)}\{\mathfrak{s}(y,x)x\}-\min_{x\in I(y)}\{\mathfrak{s}(y,x)x\}\leq 10\tau.

We refer to I(y)I(y) as an “interval” although it may actually consist of two disjoint closed intervals.

Let us just prove the case when y0y\geq 0 and the other case is by symmetry. It is clear that, since there are only two degrees of freedom, there is a unique interval I1(y)=[ya(y),y+b(y)]I_{1}(y)=[y-a(y),y+b(y)] with a(y),b(y)0a(y),b(y)\geq 0 such that

(Half probability). PrgN(0,1)[gI1(y)]=τ2\operatornamewithlimits{\mathbf{Pr}}_{g\sim\mathcal{N}(0,1)}[g\in I_{1}(y)]=\frac{\tau}{2}.

[ya(y),y+b(y)][y-a(y),y+b(y)] and [yb(y),y+a(y)][-y-b(y),-y+a(y)] intersect. In this case, consider the unique interval

It must satisfy PrgN(0,1)[gI2(y)g>0]<τ/2\operatornamewithlimits{\mathbf{Pr}}_{g\sim\mathcal{N}(0,1)}[g\in I_{2}(y)\,\wedge\,g>0]<\tau/2, because otherwise we must have ya(y)0y-a(y)\geq 0 and the two intervals should not have intersected.

Define τ(y)=PrgN(0,1)[gI2(y)]<τ\tau^{\prime}(y)=\operatornamewithlimits{\mathbf{Pr}}_{g\sim\mathcal{N}(0,1)}[g\in I_{2}(y)]<\tau. Let c(y)>e(y)c(y)>e(y) be the unique positive real such that

Let d(y)[e(y),c(y)]d(y)\in[e(y),c(y)] be the unique real such that

Finally, we define I(y)=[c(y),c(y)]I(y)=[-c(y),c(y)] and

In both cases, one can carefully verify that properties 1, 2, 3, 4 hold. Property 5 follows from the standard property of Gaussian random variable under condition τ1/100\tau\leq 1/100 and yy\in.

To check the final Lipschitz continuity property, recall for a standard Gaussian distribution, inside interval [110,110][-\frac{1}{10},\frac{1}{10}] it behaves, up to multiplicative constant factor, similar to a uniform distribution. Therefore, the above defined functions a(y)a(y) and b(y)b(y) are O(1)O(1)-Lipschitz continuous in yy. Let y00y_{0}\geq 0 be the unique constant such that ya(y)=0y-a(y)=0 (it is unique because ya(y)y-a(y) monotonically decreases as y0+y\to 0+. It is clear that for y0y1y2y_{0}\leq y_{1}\leq y_{2} it satisfies

As for the turning point of y=y0y=y_{0}, it is clear that

so the function I()I(\cdot) is continuous at point y=y0y=y_{0}. Finally, consider y[y0,y0]y\in[-y_{0},y_{0}]. One can verify that e(y)e(y) is O(1)O(1)-Lipschitz continuous in yy, and therefore the above defined τ(y)\tau^{\prime}(y), d(y)d(y) and c(y)c(y) are also O(1)O(1)-Lipschitz in yy. This means, for y0y1y2y0-y_{0}\leq y_{1}\leq y_{2}\leq y_{0}, it also satisfies

This proves the Lipschitz continuity of I(y)I(y). ∎

A.4 Hermite polynomials

Let hi(i0)h_{i}(i\geq 0) denote the degree-ii (probabilists’) Hermite polynomial

where δi,j=1\delta_{i,j}=1 if i=ji=j and δi,j=0\delta_{i,j}=0 otherwise. They have the following summation and multiplication formulas.

For even i>0i>0, for any x1x_{1}\in and bb,

For odd i>0i>0, for any x1x_{1}\in and bb,

(Throughout this paper, the binomial (nm)\binom{n}{m} is defined as Γ(n+1)Γ(m+1)Γ(n+1m)\frac{\Gamma(n+1)}{\Gamma(m+1)\Gamma(n+1-m)}, and this allows us to write for instance (5/21/2)\binom{5/2}{-1/2} without notation change.)

Using the summation formula of Hermite polynomial, we have:

Using the multiplication formula of Hermite polynomial, we have:

Consider even i>0i>0. By Lemma A.7, we have for even i0i\geq 0:

Consider odd i>0i>0. By Lemma A.7, we have for odd i>0i>0:

by a similar calculation as in the even ii case.

Then Li,bL_{i,b}’s are given by the recursive formula:

As a result (with the convention that 0!!=10!!=1 and (1)!!=1(-1)!!=1)

The base cases L0,bL_{0,b} and L1,bL_{1,b} are easy to verify. Then the lemma comes from induction. ∎

A.5 Optimization

Then, λmin(2f(x))ε\lambda_{\min}(\nabla^{2}f(x))\leq-{\varepsilon}, where λmin\lambda_{\min} is the minimal eigenvalue.

We also recall the following convergence theorem of SGD for escaping saddle point.The original proof of was for constant probability but it is easy to change it to 1p1-p at the expense of paying a polynomial factor in 1/p1/p. The original proof did not state the objective “non-increasing” guarantee but it is easy to show it for mini-batch SGD. Recall η=1\poly(d,B,1/ε,1/δ)\eta=\frac{1}{\poly(d,B,1/\varepsilon,1/\delta)} in so in each iteration, the original SGD may increase the objective by no more than η2Bb\frac{\eta^{2}B}{b} if a batch size bb is used. If bb is polynomially large in 1/η1/\eta, this goal is easily achievable. In practice, however, using a very small batch size suffices.

A.6 Rademacher Complexity

Suppose X=(x1,,xN)\mathcal{X}=(x_{1},\dots,x_{N}) where each xix_{i} is generated i.i.d. from a distribution D\mathcal{D}. If every fFf\in\mathcal{F} satisfies fb|f|\leq b, for every δ(0,1)\delta\in(0,1) with probability at least 1δ1-\delta over the randomness of Z\mathcal{Z}, it satisfies

Let F\mathcal{F}^{\prime} be the class of functions by composing LL with F1,,Fk\mathcal{F}_{1},\dots,\mathcal{F}_{k}, that is, F={Lx(f1,,fk)f1F1fkFk}\mathcal{F}^{\prime}=\{L_{x}\circ(f_{1},\dots,f_{k})\mid f_{1}\in\mathcal{F}_{1}\cdots f_{k}\in\mathcal{F}_{k}\}. By the (vector version) of the contraction lemma of Rademacher complexity There are slightly different versions of the contraction lemma in the literature. For the scalar case without absolute value, see [43, Section 3.8]; for the scalar case with absolute value, see [14, Theorem 12]; and for the vector case without absolute value, see . it satisfies R^(Z;F)O(1)r=1kR^(Z;Fr)\widehat{\mathfrak{R}}(\mathcal{Z};\mathcal{F}^{\prime})\leq O(1)\cdot\sum_{r=1}^{k}\widehat{\mathfrak{R}}(\mathcal{Z};\mathcal{F}_{r}). ∎

(addition) R^(X;F1+F2)=R^(X;F1)+R^(X;F2)\widehat{R}(\mathcal{X};\mathcal{F}_{1}+\mathcal{F}_{2})=\widehat{R}(\mathcal{X};\mathcal{F}_{1})+\widehat{R}(\mathcal{X};\mathcal{F}_{2}).

(contraction) R^(X;σF)R^(X;F)\widehat{R}(\mathcal{X};\sigma\circ\mathcal{F})\leq\widehat{R}(\mathcal{X};\mathcal{F}).

satisfies R^(X;F)2w1maxj[m]R^(X;Fj)\widehat{R}(\mathcal{X};\mathcal{F}^{\prime})\leq 2\|w\|_{1}\max_{j\in[m]}\widehat{R}(\mathcal{X};\mathcal{F}_{j}).

satisfies \widehat{R}(\mathcal{X};\mathcal{F}^{\prime})\leq 2D\sum_{j\in[m]}\widehat{R}(\mathcal{X};\mathcal{F}_{j})+O\big{(}\frac{BR\log m}{\sqrt{N}}\big{)}.

The first three are trivial, and the contraction lemma is classical (see for instance ). The derivations of Lemma prop:rade and Lemma prop:radf are less standard.

For Lemma prop:rade, let us choose an arbitrary fj(0)f_{j}^{(0)} from each Fj\mathcal{F}_{j}. We write fFf\in\mathcal{F} to denote (f1,,fm)F1××Fm(f_{1},\dots,f_{m})\in\mathcal{F}_{1}\times\cdots\times\mathcal{F}_{m}.

Above, ① is because ξiwjσ(fj(0)(xi))\xi_{i}w_{j}\sigma(f_{j}^{(0)}(x_{i})) is independent of fjf_{j} and thus zero in expectation; ② uses the non-negativity of \sup_{f_{j}\in\mathcal{F}_{j}}\sum_{i\in[N]}\xi_{i}\big{(}\sigma(f_{j}(x_{i}))-\sigma(f_{j}^{(0)}(x_{i}))\big{)}; ③ is for the same reason as ①; and ④ is by Proposition prop:radd.

Above, ① is from the same derivation as Proposition prop:rade; and ② uses Proposition prop:radb by viewing function class \big{\{}x\mapsto\sum_{j=1}^{m}v_{j}\cdot\sigma(f_{j}^{(0)}(x))\big{\}} as linear in vv. ∎

Appendix B Proofs for Two-Layer Networks

Recall from (3.1) the target F=(f1,,fk)F^{*}=(f_{1}^{*},\cdots,f_{k}^{*}) for our two-layer case is

We consider another function defined as G(x;W)=(g1(x;W),,gk(x;W))G(x;W)=(g_{1}(x;W),\ldots,g_{k}(x;W)) for the weight matrix WW, where

where wiw_{i} is the rr-th row of WW and wi(0)w_{i}^{(0)} is the rr-th row of W(0)W^{(0)} (our random initialization). We call this GG a pseudo network. For convenience, we also define a pseudo network G(b)(x;W)=(g1(b)(x;W),,gk(b)(x;W))G^{(b)}(x;W)=(g^{(b)}_{1}(x;W),\ldots,g^{(b)}_{k}(x;W)) without bias

Roadmap. Our analysis begins with showing that w.h.p. over the random initialization, there exists a pseudo network in the neighborhood of the initialization that can approximate the target function (see Section B.1). We then show that, near the initialization, the pseudo network approximates the actual ReLU network FF (see Section B.2). Therefore, there exists a ReLU network near the initialization approximating the target. Furthermore, it means that the loss surface of the ReLU network is close to that of the pseudo network, which is convex. This then allows us to show the training converges (see Section B.3). Combined with a generalization bound localized to the initialization, we can prove the final theorem that SGD learns a network with small risk.

The proof is much simpler than the three-layer case. First, the optimization landscape is almost convex, so a standard argument for convex optimization applies. While for three-layer case, the optimization landscape is no longer this nice and needs an escaping-from-saddle-point argument, which in turn requires several technicalities (smoothing, explicit regularization, weight decay, and Dropout-like noise). Second, the main technical part of the analysis, the proof of the existential result, only needs to deal with approximating one layer of neurons in the target, while in the three-layer case, a composition of the neurons needs to be approximating, requiring additional delicate arguments. It is also similar with the generalization bounds. However, we believe that the analysis in the two-layer case already shows some of the key ideas, and suggest the readers to read it before reading that for the three-layer case.

The main focus of this subsection is to show that there exists a good pseudo network near the initialization. (Combining with the coupling result of the next subsection, this translates to the real network near the initialization.)

For every ε(0,1pkCs(ϕ,1))\varepsilon\in(0,\frac{1}{pk\mathfrak{C}_{\mathfrak{s}}(\phi,1)}), letting εa=ε/Θ~(1)\varepsilon_{a}=\varepsilon/\widetilde{\Theta}(1), there exists

such that if mMm\geq M, then with high probability there exists W=(w1,,wm)W^{\divideontimes}=(w^{\divideontimes}_{1},\ldots,w^{\divideontimes}_{m}) with W2,kpC0εam\|W^{\divideontimes}\|_{2,\infty}\leq\frac{kpC_{0}}{\varepsilon_{a}m} and WFO~(kpCs(ϕ,1)εam)\|W^{\divideontimes}\|_{F}\leq\widetilde{O}(\frac{kp\mathfrak{C}_{\mathfrak{s}}(\phi,1)}{\varepsilon_{a}\sqrt{m}}) and

In the same setting as Lemma B.1, we have that w.h.p.

Recall the pseudo network without bias is given by

where α1,β1,b0N(0,1)\alpha_{1},\beta_{1},b_{0}\sim\mathcal{N}(0,1) are independent random Gaussians.

where mwj(0),w1,i,mbj(0)\sqrt{m}\langle w_{j}^{(0)},w_{1,i}^{*}\rangle,\sqrt{m}b_{j}^{(0)} has the same distribution with α1,b0\alpha_{1},b_{0} in Lemma 6.3. By Lemma 6.3, we have that

Fit a combination i[p]ar,iϕi(w1,i,x)w2,i,x\sum_{i\in[p]}a^{*}_{r,i}\phi_{i}(\langle w_{1,i}^{*},x\rangle)\langle w_{2,i}^{*},x\rangle. We can re-define (the norm grows by a maximum factor of pp)

Fit multiple outputs. If there are kk outputs let us re-define (the norm grows by a maximum factor of kk)

Now, re-scaling each wjw^{\divideontimes}_{j} by a factor of 1m\frac{1}{m} and re-scaling ε\varepsilon by 12pk\frac{1}{2pk}, we can write

Now, we apply the concentration from Lemma A.1, which implies for our parameter choice of mm, with high probability

The above concentration holds for every fixed xx with high probability, and thus also holds in expectation with respect to (x,y)D(x,y)\sim\mathcal{D}. This proves the first statement. As for the second statement on L(G(b)(x;W),y)L(G^{(b)}(x;W^{\divideontimes}),y), it follows from the Lipschitz continuity of LL.

Norm on WW^{\divideontimes}. According to its definition in (B.2), we have for each j[m]j\in[m], with high probability \|w^{\divideontimes}_{j}\|_{2}\leq\widetilde{O}\big{(}\frac{kpC_{0}}{\varepsilon_{a}m}\big{)} (here the additional 1m\frac{1}{m} is because we have re-scaled wjw^{\divideontimes}_{j} by 1m\frac{1}{m}). This means \|W^{\divideontimes}\|_{2,\infty}\leq\widetilde{O}\big{(}\frac{kpC_{0}}{\varepsilon_{a}m}\big{)}. As for the Frobenius norm,

Now, for each i[p]i\in[p], we know that j[m]h(i)(mwj(0),w1,i,mbj(0))2\sum_{j\in[m]}h^{(i)}\left(\sqrt{m}\langle w_{j}^{(0)},w_{1,i}^{*}\rangle,\sqrt{m}b_{j}^{(0)}\right)^{2} is a summation of i.i.d. random variables, each with expectation at most Cs(ϕ,1)2\mathfrak{C}_{\mathfrak{s}}(\phi,1)^{2} by Lemma 6.3. Applying concentration, we have with high probability

Putting this back to (B.3) we have WF2O~(k2p2Cs(ϕ,1)2εa2m)\|W^{\divideontimes}\|_{F}^{2}\leq\widetilde{O}(\frac{k^{2}p^{2}\mathfrak{C}_{\mathfrak{s}}(\phi,1)^{2}}{\varepsilon_{a}^{2}m}). ∎

Let WW^{\divideontimes} be the weights constructed in Lemma B.1 to approximate up to error ε/2\varepsilon/2. Then

By standard concentration (which uses the randomness of w(0)w^{(0)} together with the randomness of ar,1(0),,ar,m(0)a^{(0)}_{r,1},\dots,a^{(0)}_{r,m}), the above quantity is with high probability bounded by O~(εa)=ε/2\widetilde{O}(\varepsilon_{a})=\varepsilon/2. This is the only place we need parameter choice εa\varepsilon_{a}. ∎

B.2 Coupling

Here we show that the weights after a properly bounded amount of updates stay close to the initialization, and thus the pseudo network is close to the real network using the same weights.

For every unit vector xx, w.h.p. over the random initialization, for every time step t1t\geq 1, we have the following. Denote τ=εaηt\tau=\varepsilon_{a}\eta t.

For at most O~(τkm)\widetilde{O}(\tau\sqrt{km}) fraction of i[m]i\in[m]:

W.h.p. over the random initialization, there is B=O~(1)B=\widetilde{O}(1) so that every ar,i(0)εaB|a^{(0)}_{r,i}|\leq\varepsilon_{a}B. Thus, by the 1-Lipschitz continuity of LL, for every i[m]i\in[m] and every t0t\geq 0,

which implies that wi(t)wi(0)2kBεaηt=kBτ\left\|w^{(t)}_{i}-w^{(0)}_{i}\right\|_{2}\leq\sqrt{k}B\varepsilon_{a}\eta t=\sqrt{k}B\tau. Accordingly, define H\mathcal{H}

so it satisfies for every iHi\in\mathcal{H},

Now, we need to bound the size of H\mathcal{H}. Since wi(0),xN(0,1/m)\langle w_{i}^{(0)},x\rangle\sim\mathcal{N}(0,1/m), and bi(0)N(0,1/m)b^{(0)}_{i}\sim\mathcal{N}(0,1/m), by standard property of Gaussian one can derive |\mathcal{H}|\geq m\big{(}1-\widetilde{O}(\sqrt{k}B\tau\sqrt{m})\big{)}=m\big{(}1-\widetilde{O}(\tau\sqrt{km})\big{)} with high probability.

It is clear from (B.4) that frf_{r} and grg_{r} only differ on the indices i∉Hi\not\in\mathcal{H}. For such an index i∉Hi\not\in\mathcal{H}, the sign of wi(t),x+bi(0)\langle w^{(t)}_{i},x\rangle+b^{(0)}_{i} is different from that of wi(0),x+bi(0)\langle w^{(0)}_{i},x\rangle+b^{(0)}_{i} and their difference is at most kBτ\sqrt{k}B\tau. This contributes at most εaBkBτ\varepsilon_{a}B\cdot\sqrt{k}B\tau difference between frf_{r} and grg_{r}. Then the bound follows from that there are only \widetilde{O}\big{(}\tau m\sqrt{km}\big{)} many i∉Hi\not\in\mathcal{H}.

By the Lipschitz smoothness assumption on LL and Lemma lem:couplingb, we have

For i∉Hi\not\in\mathcal{H}, it contributes at most kεaB\sqrt{k}\varepsilon_{a}B because of (B.5), and there are O~(τmkm)\widetilde{O}(\tau m\sqrt{km}) many such ii’s, totaling O~(εakτm3/2)\widetilde{O}(\varepsilon_{a}k\tau m^{3/2}).

B.3 Optimization

For the set of samples Z\mathcal{Z}, define

For every ε(0,1pkCs(ϕ,1))\varepsilon\in(0,\frac{1}{pk\mathfrak{C}_{\mathfrak{s}}(\phi,1)}), letting εa=ε/Θ~(1)\varepsilon_{a}=\varepsilon/\widetilde{\Theta}(1) and \eta=\widetilde{\Theta}\big{(}\frac{1}{\varepsilon km}\big{)}, there exists

Let WW^{\divideontimes} be constructed in Corollary B.2. Recall L(,y)L(\cdot,y) is convex and G(x;W)G(x;W) is linear in WW so LG(z;W)L_{G}(z;W) is convex in WW. By such convexity, we have

Recall from (B.5) that with high probability over the random initialization,

where C0C_{0} is as in Lemma B.1. By Lemma lem:couplingc, w.h.p. we know

Therefore, averaging up (B.7) from t=0t=0 to T1T-1 we have that

Note that \|W_{0}-W^{\divideontimes}\|_{F}^{2}=\|W^{\divideontimes}\|_{F}^{2}\leq\widetilde{O}\big{(}\frac{k^{2}p^{2}\mathfrak{C}_{\mathfrak{s}}(\phi,1)^{2}}{\varepsilon_{a}^{2}m}\big{)} from Lemma B.1. Also recall εa=Θ(ε)\varepsilon_{a}=\Theta(\varepsilon). By choosing \eta=\widetilde{\Theta}\big{(}\frac{\varepsilon}{km\varepsilon_{a}^{2}}\big{)} and T=Θ~(k3p2Cs(ϕ,1)2/ε2)T=\widetilde{\Theta}\left(k^{3}p^{2}\mathfrak{C}_{\mathfrak{s}}(\phi,1)^{2}/\varepsilon^{2}\right) we have Δ=O~(k6p4Cs(ϕ,1)4m1/2/ε2)\Delta=\widetilde{O}(k^{6}p^{4}\mathfrak{C}_{\mathfrak{s}}(\phi,1)^{4}m^{1/2}/\varepsilon^{2}). Thus when mm is large enough we have:

By the coupling in Lemma lem:couplingb, we know that LF(Z;Wt)L_{F}(\mathcal{Z};W_{t}) is o(ε)o(\varepsilon)-close to LG(Z;Wt)L_{G}(\mathcal{Z};W_{t}); by applying Corollary B.2, we know that LG(Z;W)L_{G}(\mathcal{Z};W^{\divideontimes}) is ε\varepsilon-close to OPT\mathsf{OPT}. This finishes the proof. ∎

B.4 Generalization

The generalization can be bounded via known Rademacher complexity results. Recall

We have the following simple lemma (see also [43, Theorem 43])We note that [43, Theorem 43] did not have W(0)W^{(0)} but the same result holds with the introduction of W(0)W^{(0)}.

For every τw,0\tau_{w,\infty}\geq 0, w.h.p. for every r[k]r\in[k] and every N1N\geq 1, we have the empirical Rademacher complexity bounded by

The proof consists of the following simple steps.

{xwj,xwj2τw,}\{x\mapsto\langle w^{\prime}_{j},x\rangle\mid\|w^{\prime}_{j}\|_{2}\leq\tau_{w,\infty}\} has Rademacher complexity O(τw,N)O(\frac{\tau_{w,\infty}}{\sqrt{N}}) by Proposition prop:rada.

{xwj(0)+wj,x+bjwj2τw,}\{x\mapsto\langle w^{(0)}_{j}+w^{\prime}_{j},x\rangle+b_{j}\mid\|w^{\prime}_{j}\|_{2}\leq\tau_{w,\infty}\} has Rademacher complexity O(τw,N)O(\frac{\tau_{w,\infty}}{\sqrt{N}}) because singleton class has zero complexity and adding it does not affect complexity by Proposition prop:radc.

{xfr(x;W(0)+W)W2,τw,}\{x\mapsto f_{r}(x;W^{(0)}+W^{\prime})\mid\|W^{\prime}\|_{2,\infty}\leq\tau_{w,\infty}\} has Rademacher complexity O~(εamτw,N)\widetilde{O}(\frac{\varepsilon_{a}m\tau_{w,\infty}}{\sqrt{N}}) because w.h.p. ar(0)1O~(εam)\|a^{(0)}_{r}\|_{1}\leq\widetilde{O}(\varepsilon_{a}m) and Proposition prop:rade. ∎

B.5 Theorem 1: Two-Layer

First, we can apply Lemma B.4 to bound the training loss. That is

Also recall from (B.8) and our parameter choices for η,T\eta,T that

so we can choose τw,=O(\poly(k,p,logm)Cs(ϕ,1)2ε2m)\tau_{w,\infty}=O(\frac{\poly(k,p,\log m)\cdot\mathfrak{C}_{\mathfrak{s}}(\phi,1)^{2}}{\varepsilon^{2}m}). For each (x,y)D(x,y)\sim\mathcal{D}, it is a simple exercise to verify that L(F(x;Wt+W(0)),y)O(\poly(k,p,logm)Cs(ϕ,1)2ε)|L(F(x;W_{t}+W^{(0)}),y)|\leq O(\frac{\poly(k,p,\log m)\cdot\mathfrak{C}_{\mathfrak{s}}(\phi,1)^{2}}{\varepsilon}) with high probability.Indeed, with high probability gr(x;W)O~(εa)|g_{r}(x;W)|\leq\widetilde{O}(\varepsilon_{a}) and since Wt2,τw,\|W_{t}\|_{2,\infty}\leq\tau_{w,\infty} we have gr(b)(x;Wt)O~(εaτw,m)|g_{r}^{(b)}(x;W_{t})|\leq\widetilde{O}(\varepsilon_{a}\tau_{w,\infty}m). Together we have gr(x;W+Wt)O~(εaτw,m)|g_{r}(x;W+W_{t})|\leq\widetilde{O}(\varepsilon_{a}\tau_{w,\infty}m). By the coupling Lemma lem:couplingb, this implies fr(x;W+Wt)O~(εaτw,m)|f_{r}(x;W+W_{t})|\leq\widetilde{O}(\varepsilon_{a}\tau_{w,\infty}m) as well. Using L(0,y)L(0,y)\in and the 11-Lipschitz continuity finishes the proof. Thus, we can plug the Rademacher complexity Lemma B.5 together with b=O(\poly(k,p,logm)Cs(ϕ,1)2ε)b=O(\frac{\poly(k,p,\log m)\cdot\mathfrak{C}_{\mathfrak{s}}(\phi,1)^{2}}{\varepsilon}) into standard generalization statement Corollary A.11. It gives

This completes the proof with large enough NN. ∎

In the above proof, it appears that NN scales with ε4\varepsilon^{-4} which may seemingly be larger than TT which only scales with ε2\varepsilon^{-2}. We are aware of a proof that tightens NN to be on the order of ε2\varepsilon^{-2}. It uses standard (but complicated) martingale analysis and creates extra difficulty that is irrelevant to neural networks in general. We choose not to present it for simplicity.

Appendix C Proofs for Three-Layer Networks

Our three-layer proofs follow the same structure as our proof overview in Section 6.

Recall without loss of generality it suffices to prove Lemma lem:fit_fun_maina.

where α1,β1N(0,1)\alpha_{1},\beta_{1}\sim\mathcal{N}(0,1) and b0N(0,1)b_{0}\sim\mathcal{N}(0,1) are independent random variables. Furthermore:

hh is Cε(ϕ,1)\mathfrak{C}_{\varepsilon}(\phi,1)-Lipschitz on the first coordinate.

For notation simplicity, let us denote w0=(α1,β1)w_{0}=(\alpha_{1},\beta_{1}) and x=(x1,1x12)x=(x_{1},\sqrt{1-x_{1}^{2}}) where α1,β1\alpha_{1},\beta_{1} are two independent random standard Gaussians.

Throughout the proof, we also take an alternative view of the randomness. We write w0,x=α\langle w_{0},x\rangle=\alpha and α1=αx1+1x12β\alpha_{1}=\alpha x_{1}+\sqrt{1-x_{1}^{2}}\beta for two independent α,βN(0,1)\alpha,\beta\sim\mathcal{N}(0,1).This is possible for the following reason. Let x=(1x12,x1)x^{\perp}=(\sqrt{1-x_{1}^{2}},-x_{1}) be unit vector orthogonal to xx. We can write w0=αx+βxw_{0}=\alpha x+\beta x^{\perp} where α,βN(0,1)\alpha,\beta\sim\mathcal{N}(0,1) are two independent Gaussians.

We first make a technical claim involving in fitting monomials in x1x_{1}. Its proof is in Section C.1.2.

Recall hi(x)h_{i}(x) is the degree-ii Hermite polynomial (see Definition A.5). For every integer i1i\geq 1 there exists constant pip_{i}^{\prime} with pi(i1)!!200i2|p_{i}^{\prime}|\geq\frac{(i-1)!!}{200i^{2}} such that

We next use Claim C.1 to fit arbitrary functions ϕ(x1)\phi(x_{1}). By Taylor expansion, we have

The next technical claim carefully bounds the absolute values of the Hermite polynomials. Its proof is in Section C.1.2.

where R(x1)<ϵ/4|R^{\prime}(x_{1})|<\epsilon/4 uses Claim claim:fit_fun:UP-LOa and Claim claim:fit_fun:UP-LOb. In other words, if we define

As for the range of hh, we use Claim claim:fit_fun:UP-LOb and Claim claim:fit_fun:UP-LOc to derive that

As for the Lipschitz continuity of hh on its first coordinate α1\alpha_{1}, we observe that for each i>0i>0, h^i(z)\widehat{h}_{i}(z) has zero sub-gradient for all zBi|z|\geq B_{i}. Therefore, it suffices to bound \big{|}\frac{d}{dz}h_{i}(z)\big{|} for z<Bi|z|<B_{i}. Replacing the use of Claim claim:fit_fun:UP-LOc by Claim claim:fit_fun:UP-LOd immediately give us the same bound on the Lipschitz continuity of hh with respect to α1\alpha_{1}.

Above, ① uses inequality i!((i1)!!)22i\frac{i!}{((i-1)!!)^{2}}\leq 2\sqrt{i} for all i1i\geq 1.

This finishes the proof of Lemma lem:fit_fun_maina. \blacksquare

C.1.2 Proofs of Claim C.1 and Claim C.2

Then, for 0b012i0\leq-b_{0}\leq\frac{1}{2i}, we know that for all 1<ri11<r\leq i-1, rr odd:

is independent of the randomness of b0b_{0}. Therefore, using the formula of pip_{i} in (C.4):

Odd ii. Similarly, by Lemma A.6, we have

Then, for b012i|b_{0}|\leq\frac{1}{2i}, we know that for all even rr in 1<ri11<r\leq i-1 it satisfies

is independent of the randomness of b0b_{0}. Therefore, using the formula of pip_{i} in (C.5):

By the definition of Hermite polynomial (see Definition A.5), we have that

Using our bound on cic_{i}^{\prime} (see (C.3)), we have

Denote by b=Bi=100i1/2θb=B_{i}=100i^{1/2}\theta for notational simplicity (for some parameter θ1\theta\geq 1 that we shall choose later). We have

Above, inequality ① uses j=0i1ijj!3i\sum_{j=0}^{i-1}\frac{i^{j}}{j!}\leq 3^{i}; inequality ② uses our definition of b=Bib=B_{i}; inequality ③ uses (θe104θ2)iε2100000i(\theta\cdot e^{-10^{4}\theta^{2}})^{i}\leq\frac{\varepsilon^{2}}{100000^{i}} for θ=1+log(1/ε)10i\theta=1+\frac{\sqrt{\log(1/\varepsilon)}}{10\sqrt{i}}; and inequality ④ uses εci1\varepsilon|c_{i}|\leq 1. Putting this back to (C.8), we have

Above, in the last inequality we have used i4i!!ii/2404i\frac{i^{4}}{i!!}i^{i/2}\leq 40\cdot 4^{i} for i1i\geq 1.

Similar to the previous case, we calculate that

Above, inequality ① uses b2j=Bi2j(10i)jb^{2j}=B_{i}^{2j}\geq(10i)^{j}; and inequality ② uses again j=0i1ijj!3i\sum_{j=0}^{i-1}\frac{i^{j}}{j!}\leq 3^{i}. Using this and continue from (C.9) of the previous case, we finish the proof.

Again denote by b=Bi=100i1/2θb=B_{i}=100i^{1/2}\theta for notational simplicity. By Eq. (C.7), it holds that

Here, in ① we use the fact that ijj!10i\frac{i^{j}}{j!}\leq 10^{i}; in ② we use the fact that \big{(}\frac{a}{b}\big{)}^{b}\leq e^{a} for all b[1,a]b\in[1,a].

By the definition of Hermite polynomial (see Definition A.5), we can also bound

which is the same upper bound comparing to (C.6). Therefore, the same proof of Claim claim:fit_fun:UP-LOc also applies to \big{|}\frac{d}{dx}h_{i}(x)\big{|}. ∎

C.1.3 Lemma 6.4: Information out of Randomness

Let us consider a single neural of the second layer at random initialization, given as:

Moreover, letting C=Cε(ϕ,1)C=\mathfrak{C}_{\varepsilon}(\phi,1) be the complexity of ϕ\phi, and if v1,i(0)N(0,1m2)v_{1,i}^{(0)}\sim\mathcal{N}(0,\frac{1}{m_{2}}) and wi,j(0),b1,i(0)N(0,1m1)w_{i,j}^{(0)},b^{(0)}_{1,i}\sim\mathcal{N}(0,\frac{1}{m_{1}}) are at random initialization, then we have

For every fixed xx, ρ(v1(0),W(0),b1(0))\rho\left(v_{1}^{(0)},W^{(0)},b_{1}^{(0)}\right) is independent of B(x,v1(0),W(0),b1(0))B\left(x,v_{1}^{(0)},W^{(0)},b_{1}^{(0)}\right).

ρ(v1(0),W(0),b1(0))N(0,1100C2m2)\rho\left(v_{1}^{(0)},W^{(0)},b_{1}^{(0)}\right)\sim\mathcal{N}\left(0,\frac{1}{100C^{2}m_{2}}\right).

For every xx with x2=1\|x\|_{2}=1, ϕε(x)ϕ(w,x)ε|\phi_{\varepsilon}(x)-\phi(\langle w^{*},x\rangle)|\leq\varepsilon.

For every fixed xx with x2=1\|x\|_{2}=1, with high probability R(x,v1(0),W(0),b1(0))O~(1m1m2)\left|R\left(x,v_{1}^{(0)},W^{(0)},b_{1}^{(0)}\right)\right|\leq\widetilde{O}\left(\frac{1}{{\sqrt{m_{1}m_{2}}}}\right) and B(x,v1(0),W(0),b1(0))O~(1m2)\left|B\left(x,v_{1}^{(0)},W^{(0)},b_{1}^{(0)}\right)\right|\leq\widetilde{O}\left(\frac{1}{{\sqrt{m_{2}}}}\right).

Furthermore, there exists real-valued function ρ~(v1(0))\widetilde{\rho}(v_{1}^{(0)}) satisfying with high probability:

Before going to proofs, we recall from Section A that we have overridden the notion of “with high probability” so the above statement implies W2(R,0)O(1m1m2)\mathcal{W}_{2}(R,0)\leq O\left(\frac{1}{{\sqrt{m_{1}m_{2}}}}\right) and W2(B,0)O~(1m2)\mathcal{W}_{2}(B,0)\leq\widetilde{O}\left(\frac{1}{{\sqrt{m_{2}}}}\right).

Without loss of generality we assume w=e1w^{*}=e_{1}. Recall that

By Lemma 6.3, for every ε>0\varepsilon>0, there exists a function hh such that for every unit xx with xd=12x_{d}=\frac{1}{2} and every i[m1]i\in[m_{1}]: If one wants to work with the more general target function in Remark 4.2, we can assume without loss of generality that w3=edw^{*}_{3}=e_{d} (because we have w1,w3=0\langle w^{*}_{1},w^{*}_{3}\rangle=0 in Remark 4.2).

and h(wi,1(0),b1,i(0))[0,1]\left|h\left(w^{(0)}_{i,1},b_{1,i}^{(0)}\right)\right|\in\left[0,1\right]. (Here to apply Lemma 6.3, we have re-scaled hh in Lemma 6.3 by 1C\frac{1}{C} and re-scaled α1,β1,b0\alpha_{1},\beta_{1},b_{0} in Lemma 6.3 by 1m1\frac{1}{\sqrt{m_{1}}}.)

We define what we call “effective sign of v1,i(0)v_{1,i}^{(0)}” to be

Since each iSi\in\mathcal{S} with probability τ\tau, we also know with high probability:

A simple observation here is that, although αN(0,1m1)\alpha\sim\mathcal{N}\left(0,\frac{1}{m_{1}}\right), if we fix W(0)W^{(0)} (or b1(0)b^{(0)}_{1}) then the distribution of α\alpha is not Gaussian. Fortunately, fixing W(0)W^{(0)} and b1(0)b^{(0)}_{1}, we still have that the coordinates of s=(s1,,sm1)s=(s_{1},\dots,s_{m_{1}}) are i.i.d. (each sis_{i} is zero with probability 1τ1-\tau, and sis_{i} is ±1\pm 1 each with probability τ/2\tau/2).

Let αW(0),b1(0)\alpha|_{W^{(0)},b^{(0)}_{1}} denote the conditional distribution of α\alpha. With high probability over W(0)W^{(0)}, W(0)edO~(1m1)\|W^{(0)}e_{d}\|_{\infty}\leq\widetilde{O}\left(\frac{1}{\sqrt{m_{1}}}\right). Fixing the support of uu to be S\mathcal{S}, we know that for every iSi\in\mathcal{S}, uiu_{i} is i.i.d. ±1S\pm\frac{1}{\sqrt{|\mathcal{S}|}}. This implies that fixing W(0),b1(0),SW^{(0)},b^{(0)}_{1},\mathcal{S}, the quantity

is a sum of S|\mathcal{S}| many independent, mean zero random variables with each |u_{i}[W^{(0)}e_{d}]_{i}|\leq\widetilde{O}\big{(}\frac{1}{\sqrt{m_{1}|\mathcal{S}|}}\big{)} and \sum_{i\in\mathcal{S}}u_{i}[W^{(0)}e_{d}]_{i}^{2}=\frac{1}{m_{1}}\pm\widetilde{O}\big{(}\frac{1}{\sqrt{\tau}m_{1}^{3/2}}\big{)} w.h.p. Applying any Wasserstein distance bound of central limit theorem (see Lemma A.3), we know that there exists some random Gaussian gN(0,1m1)g\sim\mathcal{N}(0,\frac{1}{m_{1}}) that is independent of W0W_{0} or b1(0)b^{(0)}_{1} such that w.h.p.

By definition of B1B_{1}, conditioning on the randomness of uu, we know that B1B_{1} is independent of α\alpha— because wi(0),x+b1,i(0)=βi,x+b1,i(0)\langle w_{i}^{(0)},x\rangle+b_{1,i}^{(0)}=\langle\beta_{i},x\rangle+b_{1,i}^{(0)} for i∉Si\not\in\mathcal{S}. Since uu and α\alpha are independent, we know that α\alpha and B1B_{1} are independent by Proposition A.2. We continue to write

First consider T3T_{3}. For each iSi\in\mathcal{S} we have

Above, the first row is because m2v1,i(0)Ii\sqrt{m_{2}}v^{(0)}_{1,i}\in I_{i} and the Bounded-ness property of the interval (see Lemma A.4); and the second row by the Unbiased property of the interval (see Lemma A.4). By concentration, for fixed vector xx, with high probability over the randomness of V(0)V^{(0)}:

where R1=R1(x,v1(0),W(0),b1(0))R_{1}=R_{1}\left(x,v_{1}^{(0)},W^{(0)},b_{1}^{(0)}\right) satisfies R1O~(αSSm2)O~(1m1m2)|R_{1}|\leq\widetilde{O}\left(\frac{|\alpha|}{\sqrt{|\mathcal{S}|}}\cdot\frac{\sqrt{|\mathcal{S}|}}{\sqrt{m_{2}}}\right)\leq\widetilde{O}\left(\frac{1}{\sqrt{m_{1}m_{2}}}\right). We write

By (C.11) (i.e., the property of hh), we know that for every fixed xx, using concentration bound, with high probability over W(0)W^{(0)} and b(0)b^{(0)}:

Note that α\alpha is independent of uu so ρ\rho is also independent of uu. We can also define

and using (C.14) we can derive the desired bound on W2(ρW(0),b1(0),ρ~)\mathcal{W}_{2}(\rho|_{W^{(0)},b^{(0)}_{1}},\widetilde{\rho}) in the statement of Lemma 6.4.

For fixed unit vector xx, with high probability, we have that

Thus, for fixed S\mathcal{S}, since W(0)W^{(0)} is independent of S\mathcal{S}, with high probability over the randomness of W(0)W^{(0)}, there are at most \widetilde{O}\big{(}\sqrt{|\mathcal{S}|}\big{)} many indices iSi\in\mathcal{S} satisfying (C.15). In other words, using v1,i(0)O~(1/m2)|v_{1,i}^{(0)}|\leq\widetilde{O}(1/\sqrt{m_{2}}) with high probability, we have

with R3=R3(x,v1(0),W(0),b1(0))R_{3}=R_{3}\left(x,v_{1}^{(0)},W^{(0)},b_{1}^{(0)}\right) satisfying R3O~(1m1m2)|R_{3}|\leq\widetilde{O}\left(\frac{1}{{\sqrt{m_{1}m_{2}}}}\right) with high probability.

Therefore, we have S=S\mathcal{S}=\mathcal{S}^{\prime} and can write

Again, conditioning on the randomness of uu, we know that B2B_{2} is independent of α\alpha. Since uu and α\alpha are independent, we know that α\alpha and B2B_{2} are also independent (by Proposition A.2). In other words, setting B=B1+B2B=B_{1}+B_{2}, BB and α\alpha (and therefore ρ\rho) are independent.

Let R=R1+R2+R3R=R_{1}+R_{2}+R_{3} be the residual term, setting τ=1100\tau=\frac{1}{100}, we have RO~(1m1m2)\left|R\right|\leq\widetilde{O}\left(\frac{1}{{\sqrt{m_{1}m_{2}}}}\right) with high probability.

and by our random initialization, n_{1}(x)\sim\mathcal{N}\big{(}0,\frac{1}{m_{2}}\big{\|}\sigma\big{(}W^{(0)}x+b_{1}^{(0)}\big{)}\big{\|}_{2}^{2}\big{)}. At the same time, with high probability \big{\|}\sigma\big{(}W^{(0)}x+b_{1}^{(0)}\big{)}\big{\|}_{2}^{2}=O(1). Therefore, we know n1(x)O~(1m2)|n_{1}(x)|\leq\widetilde{O}(\frac{1}{\sqrt{m_{2}}}), and this implies BO~(1m2)|B|\leq\widetilde{O}(\frac{1}{\sqrt{m_{2}}}) with high probability. ∎

C.1.4 Lemma 6.2: Existence

For every \varepsilon\in\big{(}0,\frac{1}{kp_{1}p_{2}^{2}\mathfrak{C}_{\mathfrak{s}}(\Phi,p_{2}\mathfrak{C}_{\mathfrak{s}}(\phi,1))\mathfrak{C}_{\mathfrak{s}}(\phi,1)^{2}}\big{)}, there exists

such that if m1,m2Mm_{1},m_{2}\geq M, then with high probability, there exists weights W,VW^{\divideontimes},V^{\divideontimes} with

Let us mostly focus on proving Lemma 6.2 for a single term

Extending it to multiple terms and multiple outputs, that is

is rather straightforward (and indeed a repetition of the proof of Lemma B.1 in the two-layer case).

Step 1: Existence in expectation

Recall that the input (without bias) to each neuron at random initialization in the second hidden layer is

We first use Lemma 6.4 to derive the following claim:

For every ε(0,1/Cs(ϕ,1))\varepsilon\in(0,1/\mathfrak{C}_{\mathfrak{s}}(\phi,1)), there exists real-value functions ϕ1,j,ε()\phi_{1,j,\varepsilon}(\cdot) satisfying

For every i[m2]i\in[m_{2}], there exist independent Gaussians More specifically, αi,j=αi,j(vi(0),W(0),b1(0))\alpha_{i,j}=\alpha_{i,j}(v_{i}^{(0)},W^{(0)},b^{(0)}_{1}) and βi=βi(x,vi(0),W(0),b1(0))\beta_{i}=\beta_{i}(x,v_{i}^{(0)},W^{(0)},b^{(0)}_{1}) depend on the randomness of vi(0)v_{i}^{(0)}, W(0)W^{(0)} and b1(0)b^{(0)}_{1}.

Let us define p2Sp_{2}S many chunks of the first layer, each chunk corresponds to a set Sj,l\mathcal{S}_{j,l} of cardinality Sj,l=m1p2S|\mathcal{S}_{j,l}|=\frac{m_{1}}{p_{2}S} for j[p2],l[S]j\in[p_{2}],l\in[S], such that

Let us then denote vi[j,l]v_{i}[j,l] to be (vi,r)rSj,l(v_{i,r})_{r\in\mathcal{S}_{j,l}} and W[j,l]W[j,l] to be (Wr)rSj,l(W_{r})_{r\in\mathcal{S}_{j,l}}. Recall that the input (without bias) to each neuron at random initialization in the second hidden layer is

For each j[p2]j\in[p_{2}] and l[S]l\in[S], let us apply Lemma 6.4 to the summation rSj,l\sum_{r\in\mathcal{S}_{j,l}}\dots in the above formula, to approximate ϕ1,j(w1,j,x)\phi_{1,j}(\langle w^{*}_{1,j},x\rangle). (We need to replace m1m_{1} with m1p2S\frac{m_{1}}{p_{2}S} and scale up W(0)W^{(0)} and b1(0)b^{(0)}_{1} by p2S\sqrt{p_{2}S} before applying Lemma 6.4)). It tells us we can write ni(x)n_{i}(x) as:

Moreover, ϕ1,j,ε(w1,j,x)ϕ1,j(w1,j,x)ε|\phi_{1,j,\varepsilon}(\langle w_{1,j}^{*},x\rangle)-\phi_{1,j}(\langle w_{1,j}^{*},x\rangle)|\leq\varepsilon for each j[p2]j\in[p_{2}].

Lemma 6.4 tells us that random variables ρj\rho_{j} are independent of Bj(x)B_{j}(x), and with high probability

Define β(x)=j[p2]βj(x)\beta^{\prime}(x)=\sum_{j\in[p_{2}]}\beta_{j}(x), we know that β(x)\beta^{\prime}(x) is a Gaussian random variable independent of all the ρj\rho_{j} with

Let us slightly override notation and denote

Since by our random initialization, ni(x)N(0,1m2σ(W(0)x+b1(0))22)n_{i}(x)\sim\mathcal{N}\left(0,\frac{1}{m_{2}}\left\|\sigma\left(W^{(0)}x+b_{1}^{(0)}\right)\right\|_{2}^{2}\right), and since for every every unit vector xx, with high probability σ(W(0)x+b1(0))22=1±O~(1m1)\left\|\sigma\left(W^{(0)}x+b_{1}^{(0)}\right)\right\|_{2}^{2}=1\pm\widetilde{O}\left(\frac{1}{\sqrt{m_{1}}}\right), we can write

Since we can write g=j[p2]αi,jϕ1,j,ε(x)+βi(x)g=\sum_{j\in[p_{2}]}\alpha_{i,j}\phi_{1,j,\varepsilon}(x)+\beta_{i}(x) for

being an independent from αi,1,,αi,p2\alpha_{i,1},\dots,\alpha_{i,p_{2}}, we conclude that (by choosing S=(m1/p2)1/3S=(m_{1}/p_{2})^{1/3})

Let us slightly override notation and denote

Next, we wish to use the Wasserstein bound from Claim C.3 to replace j[p2]αi,jϕ1,j,ε(x)+βi(x)\sum_{j\in[p_{2}]}\alpha_{i,j}\phi_{1,j,\varepsilon}(x)+\beta_{i}(x) with ni(x)n_{i}(x). We derive that

Above, ① uses the fact that niN(0,1m2σ(W(0)x+b1(0))22)n_{i}\sim\mathcal{N}\left(0,\frac{1}{m_{2}}\left\|\sigma\left(W^{(0)}x+b_{1}^{(0)}\right)\right\|_{2}^{2}\right) with σ(W(0)x+b1(0))22\left\|\sigma\left(W^{(0)}x+b_{1}^{(0)}\right)\right\|_{2}\leq 2 w.h.p. and hC|h|\leq C^{\prime\prime} is bounded. ② uses Claim C.3 and (C.18). ③ uses Cϕ1,j,ε(x)=ϕ1,j,ε(w1,j,x)C^{\prime}\phi_{1,j,\varepsilon}(x)=\phi_{1,j,\varepsilon}(\langle w_{1,j}^{*},x\rangle), ϕ2,j(x)=ϕ2,j(w2,j,x)\phi_{2,j}(x)=\phi_{2,j}(\langle w_{2,j}^{*},x\rangle), and denote by LΦL_{\Phi} the Lipschitz continuity parameter of Φ\Phi (namely, Φ(x)Φ(y)LΦxy|\Phi(x)-\Phi(y)|\leq L_{\Phi}|x-y| for all x,y\in\big{[}-p_{2}\mathfrak{C}_{\mathfrak{s}}(\phi,1),p_{2}\mathfrak{C}_{\mathfrak{s}}(\phi,1)\big{]})). ④ uses LΦCs(Φ,p2Cs(ϕ,1))L_{\Phi}\leq\mathfrak{C}_{\mathfrak{s}}(\Phi,p_{2}\mathfrak{C}_{\mathfrak{s}}(\phi,1)) and our assumption m1Mm_{1}\geq M. This proves Claim C.4. ∎

Step 2: From expectation to finite neurons

Intuitively, we wish to apply concentration bound on Claim C.4 with respect to all neurons i[m2]i\in[m_{2}] on the second layer. Recall aiN(0,εa)a_{i}\sim\mathcal{N}(0,\varepsilon_{a}) is the weight of the ii-th neuron at the output layer. Our main result of Step 2 is the following claim.

Across different choices of ii, the values of ni(x)n_{i}(x) and αi,j\alpha_{i,j} can be correlated. This makes it not a trivial thing to apply concentration. In the remainder of this proof, let us try to modify the two quantities to make them independent across ii.

First modify αi,j\alpha_{i,j}. Recall αi,j=Cρj=Cl[S]ρj,l\alpha_{i,j}=C^{\prime}\rho_{j}=C^{\prime}\sum_{l\in[S]}\rho_{j,l} where each ρj,l\rho_{j,l} is a function on vi(0)[j,l],W(0)[j,l],v_{i}^{(0)}[j,l],W^{(0)}[j,l], and b1(0)[j,l]b_{1}^{(0)}[j,l]. Now, using the ρ~\widetilde{\rho} notion from Lemma 6.4, let us also define ρ~j,l=ρ~j(vi(0)[j,l])\widetilde{\rho}_{j,l}=\widetilde{\rho}_{j}\left(v_{i}^{(0)}[j,l]\right) which is in the same distribution as ρj,l\rho_{j,l} except that it does not depend on W(0)W^{(0)} or b1(0)b_{1}^{(0)}. We can similarly let ρ~j=l[S]ρ~j,l\widetilde{\rho}_{j}=\sum_{l\in[S]}\widetilde{\rho}_{j,l}. From Lemma 6.4, we know that with high probability over W(0)W^{(0)}:Recall from the proof of Claim C.3 that, we need to replace m1m_{1} with m1p2S\frac{m_{1}}{p_{2}S} and scale up W(0)W^{(0)} and b1(0)b^{(0)}_{1} by p2S\sqrt{p_{2}S} before applying Lemma 6.4.

According, we define α~i,j=Cρ~j=Cl[S]ρ~j,l\widetilde{\alpha}_{i,j}=C^{\prime}\widetilde{\rho}_{j}=C^{\prime}\sum_{l\in[S]}\widetilde{\rho}_{j,l}.

Next modifty ni(x)n_{i}(x). Recall ni(x)=r[m1]vi,r(0)σ(wr(0),x+b1,r(0))n_{i}(x)=\sum_{r\in[m_{1}]}v_{i,r}^{(0)}\sigma\left(\langle w_{r}^{(0)},x\rangle+b_{1,r}^{(0)}\right) and accordingly we define

is a Gaussian variable and is independent of uu. As a consequence, the quantities n~i(x)\widetilde{n}_{i}(x) are independent among different choices of i[m1]i\in[m_{1}]. Using standard concentration on u2\|u\|_{2} (see the line above (C.16)), we have for every xx with x2=1\|x\|_{2}=1,

Concentration. Using Claim C.4 of Step 1 we have

Using the notions of n~\widetilde{n} and α~\widetilde{\alpha} and the Wasserstein distance bounds (C.20) and (C.21), it implies We have skipped the details since it is analogous to (C.19).

Using again the Wasserstein distance bounds bounds (C.20) and (C.21), we can combine the above two equations to derive that w.h.p.

Step 3: From finite neurons to the network

We now construct the network and prove Lemma 6.2. Recall we focus on constructing

We shall explain towards the end how to extend it to multiple terms and multiple outputs.

(If one wants the more general target in Remark 4.2, we can replace the use of ede_{d} with w4w^{*}_{4} following the similar treatment as (B.1) in the two-layer existential proof.) We have

Given these weights, suppose the signs of ReLU’s are determined by the random initialization (i.e., by W(0)W^{(0)} and V(0)V^{(0)}). We can consider the network output

Above, we have used ed,x=xd=1/2\langle e_{d},x\rangle=x_{d}=1/2. Since hϕ,j[C,C]h_{\phi,j}\in[-C^{\prime\prime},C^{\prime\prime}], we can apply concentration on (C.22) and get (recalling m1m_{1} is sufficiently large) w.h.p

Putting this into (C.23), and using aa^{*}\in and h[C,C]h\in[-C^{\prime\prime},C^{\prime\prime}], we know that with high probability

Finally, scaling down the value of ε\varepsilon by factor p22Cs(Φ,p2Cs(ϕ,1))Cs(ϕ,1)p_{2}^{2}\mathfrak{C}_{\mathfrak{s}}(\Phi,p_{2}\mathfrak{C}_{\mathfrak{s}}(\phi,1))\mathfrak{C}_{\mathfrak{s}}(\phi,1) we complete the proof of Lemma 6.2 for a single output and for a single Φ\Phi.

The proof generalizes to fitting a combination of multiple functions

in the same way as the proof of Lemma B.1, if we choose the vector v{1,1}m1v\in\{-1,1\}^{m_{1}} uniformly at random and apply concentration. Note that we have to further scale down ε\varepsilon by 1p1\frac{1}{p_{1}} for the same reason as Lemma B.1, and the norm of VV^{\divideontimes} shall grow by a factor of p1p_{1}.

The proof generalizes to multiple outputs in the same way as Lemma B.1, using the fact that weights ar,ia_{r,i} are independent across different outputs rr. Note that we have to further scale down ε\varepsilon by 1k\frac{1}{k} for the same reason as Lemma B.1, and the norm of VV^{\divideontimes} shall grow by a factor of kk.

Finally, applying the two remarks above, we have

We thus finish the proof of Lemma 6.2 with our choice of C0C_{0}. \blacksquare

C.2 Coupling

Suppose \tau_{v}\in\big{[}0,1\big{]}, \tau_{w}\in\big{(}\frac{1}{m_{1}^{3/2}},\frac{1}{m_{1}^{1/2}}\big{]}, \sigma_{w}\in\big{(}\frac{1}{m_{1}^{3/2}},\frac{\tau_{w}}{m_{1}^{1/4}}\big{]}, \sigma_{v}\in\big{(}0,\frac{1}{m_{2}^{1/2}}\big{]} and η>0\eta>0. Given fixed unit vector xx, and perturbation matrices W,V,W,VW^{\prime},V^{\prime},W^{\prime\prime},V^{\prime\prime} (that may depend on the randomness of W(0),b1(0),V(0),b2(0)W^{(0)},b_{1}^{(0)},V^{(0)},b_{2}^{(0)} and xx) satisfying

(Sparse sign change). Dw,x0O~(τw4/5m16/5)\|D_{w,x}^{\prime}\|_{0}\leq\widetilde{O}(\tau_{w}^{4/5}m_{1}^{6/5}), Dv,x0O~(σvm23/2+τv2/3m2+τw2/3m11/6m2)\|D_{v,x}^{\prime}\|_{0}\leq\widetilde{O}\left(\sigma_{v}m_{2}^{3/2}+\tau_{v}^{2/3}m_{2}+\tau_{w}^{2/3}m_{1}^{1/6}m_{2}\right).

Part I, Sparsity

For notation simplicity, we only do the proof when there is no bias term. The proof with bias term is analogous (but more notationally involved). Let us denote

diagonal matrix Dw,xD_{w,x} denotes the sign of ReLU’s at weights W(0)W^{(0)},

diagonal matrix Dw,x+Dw,xD_{w,x}+D_{w,x}^{\prime\prime} denotes the sign of ReLU’s at weights W(0)+WρW^{(0)}+W^{\rho}, and

diagonal matrix Dw,x+Dw,xD_{w,x}+D^{\prime}_{w,x} denotes the sign of ReLU’s at weights W(0)+Wρ+WW^{(0)}+W^{\rho}+W^{\prime}.

Sign change in Dw,xD_{w,x}^{\prime\prime}. Each coordinate of W(0)xN(0,1m1)W^{(0)}x\sim\mathcal{N}\left(0,\frac{1}{m_{1}}\right) and each coordinate of WρxN(0,σw2)W^{\rho}x\sim\mathcal{N}(0,\sigma_{w}^{2}). Thus, by standard property of Gaussian, for each ii, we have Pr[WiρxWi(0)x]O~(σwm1)\operatornamewithlimits{\mathbf{Pr}}[|W^{\rho}_{i}x|\geq|W^{(0)}_{i}x|]\leq\widetilde{O}\left(\sigma_{w}\sqrt{m_{1}}\right). By concentration bound, with high probability, the number of sign changes of the ReLU activations in the first hidden layer caused by adding WρW^{\rho} is no more than

Moreover, for each coordinate ii with [Dw,x]i,i0[D^{\prime\prime}_{w,x}]_{i,i}\neq 0, we must have (Dw,xW(0)x)i(Wρx)iO~(σw)|(D_{w,x}^{\prime\prime}W^{(0)}x)_{i}|\leq|(W^{\rho}x)_{i}|\leq\widetilde{O}(\sigma_{w}) with high probability, and thus

By our assumption σwτw/m11/4\sigma_{w}\leq\tau_{w}/m_{1}^{1/4}, we have

Next, for each coordinate ii where (Dw,xDw,x)i,i0(D^{\prime}_{w,x}-D^{\prime\prime}_{w,x})_{i,i}\neq 0, we must have ((W(0)+Wρ)x)i(Wx)i|((W^{(0)}+W^{\rho})x)_{i}|\leq|(W^{\prime}x)_{i}|, and since (Wx)i4(W^{\prime}x)_{i}^{4} must sum up to at most τw\tau_{w} for those ss coordinates, we have

Sum up: First Layer. Combining (C.24) and (C.25) and using τwm11/4\tau_{w}\leq m_{1}^{-1/4} from assumption, we have

By the 1-Lipschitz continuity of ReLU, we know that w.h.p.

where we have used our assumption σwτwm11/4\sigma_{w}\leq\tau_{w}m_{1}^{-1/4}.

Second Layer Sign Change. The sign change in the second layer is caused by input vector

Here, using w.h.p. z2O~(1)\|z\|_{2}\leq\widetilde{O}(1), we have

In comparison (at random initialization) we have V(0)z0N(0,z022m2I)V^{(0)}z_{0}\sim\mathcal{N}\left(0,\frac{\|z_{0}\|_{2}^{2}}{m_{2}}I\right) with z02=Ω~(1)\|z_{0}\|_{2}=\widetilde{\Omega}(1). Using a careful two-step argument (see Claim C.8), we can bound

Part II, Diagonal Cross Term

Second error term. We write down the second error term

For the second term, since Wx2τwm11/4\|W^{\prime\prime}x\|_{2}\leq\tau_{w}m_{1}^{1/4} and w.h.p.We note that the derivation of arDv,x(V(0)+Vρ)O~(1)\|a_{r}D_{v,x}(V^{(0)}+V^{\rho})\|_{\infty}\leq\widetilde{O}(1) may be non-trivial for some readers, because Dv,xD_{v,x} is dependent of the randomness of V(0)+V(ρ)V^{(0)}+V^{(\rho)}. In fact, for every fixed basis vector eje_{j}, we have w.h.p. Dv,x(V(0)+Vρ)ej2O~(1)\|D_{v,x}(V^{(0)}+V^{\rho})e_{j}\|_{2}\leq\widetilde{O}(1), and thus by the randomness of ara_{r} it satisfies arDv,x(V(0)+Vρ)ejO~(1)|a_{r}D_{v,x}(V^{(0)}+V^{\rho})e_{j}|\leq\widetilde{O}(1). Taking a union bound over j=1,2,,m2j=1,2,\dots,m_{2} gives the bound. Such proof idea was repeatedly used in .

For the third term, again by Wxτw\|W^{\prime\prime}x\|_{\infty}\leq\tau_{w} and Fact C.9, we have: w.h.p.

Tool

The following are some tools used in the above proofs.

A variant of the following claim has appeared in .

We also observe that (D)j,j(D^{\prime})_{j,j} is non-zero for some diagonal j[m2]j\in[m_{2}] only if

Let ξ12m2\xi\leq\frac{1}{2\sqrt{m_{2}}} be a parameter to be chosen later. We shall make sure that g2ξ/2\|g^{\prime}_{2}\|_{\infty}\leq\xi/2.

We denote by S1[m2]S_{1}\subseteq[m_{2}] the index sets where jj satisfies (g)jξ|(g)_{j}|\leq\xi. Since we know (g)jN(0,1/m2)(g)_{j}\sim\mathcal{N}(0,1/m_{2}), we have Pr[(g)jξ]O(ξm2)\operatornamewithlimits{\mathbf{Pr}}[|(g)_{j}|\leq\xi]\leq O\left(\xi\sqrt{m_{2}}\right) for each j[m2]j\in[m_{2}]. Using Chernoff bound for all j[m2]j\in[m_{2}], we have with high probability

We denote by S2[m2]S1S_{2}\subseteq[m_{2}]\setminus S_{1} the index set of all j[m2]S1j\in[m_{2}]\setminus S_{1} where xj0x_{j}\neq 0. Using (C.28), we have for each jS2j\in S_{2}:

This means S24g12ξ2.|S_{2}|\leq\frac{4\|g^{\prime}_{1}\|^{2}}{\xi^{2}}\enspace. Now, for each jS2j\in S_{2} where xj0x_{j}\neq 0, we know that the signs of (g+g1+g2)j(g+g^{\prime}_{1}+g^{\prime}_{2})_{j} and (g)j(g)_{j} are opposite. Therefore, we must have

From above, we have \|x\|_{0}\leq|S_{1}|+|S_{2}|\leq O\big{(}\xi m_{2}^{3/2}+\frac{\|g^{\prime}_{1}\|^{2}}{\xi^{2}}\big{)}. Choosing \xi=\max\big{\{}2\|g^{\prime}_{2}\|_{\infty},\Theta(\frac{\|g^{\prime}_{1}\|^{2/3}}{m_{2}^{1/2}})\big{\}} we have the desired result on sparsity.

Choosing \xi=\max\big{\{}2\|g^{\prime}_{2}\|_{\infty},\Theta(\frac{\|g^{\prime}_{1}\|^{2/3}}{m_{2}^{1/2}})\big{\}}, we have the desired bound on Euclidean norm. ∎

C.2.2 Corollary 6.6: Existence After Coupling

Corollary 6.6 is a corollary to Lemma 6.2 with g(0)g^{(0)} replaced with g(b,b)g^{(b,b)}. Recall that g(b,b)g^{(b,b)} is different from g(0)g^{(0)} only by the diagonal signs, namely,

where Dv,x+Dv,xD_{v,x}+D_{v,x}^{\prime} and Dw,x+Dw,xD_{w,x}+D_{w,x}^{\prime} are the diagonal sign matrices determined at W(0)+W+WρW^{(0)}+W^{\prime}+W^{\rho}, V(0)+V+VρV^{(0)}+V^{\prime}+V^{\rho}.

In the same setting as Lemma 6.2, perturbation matrices W,VW^{\prime},V^{\prime} (that may depend on the randomness of the initialization and D\mathcal{D}) with

Using parameter choices from Table 1, w.h.p. there exist WW^{\divideontimes} and VV^{\divideontimes} (independent of the randomness of Wρ,VρW^{\rho},V^{\rho}) satisfying

The idea to prove Corollary 6.6 is simple. First construct WW^{\divideontimes} and VV^{\divideontimes} from Lemma 6.2, and then show that g(0)g^{(0)} and g(b,b)g^{(b,b)} are close using Lemma 6.5. By Lemma 6.5 and our parameter choices Table 1, we know

Now, recall that Lemma 6.2 says W2,τw,\|W^{\divideontimes}\|_{2,\infty}\leq\tau_{w,\infty} and V2,τv,\|V^{\divideontimes}\|_{2,\infty}\leq\tau_{v,\infty} for τw,τv,=C0m1m2\tau_{w,\infty}\tau_{v,\infty}=\frac{C_{0}}{\sqrt{m_{1}}m_{2}}. Therefore,

C.2.3 Lemma 6.9: Smoothed Real vs Pseudo Networks

There exists η0=1\poly(m1,m2)\eta_{0}=\frac{1}{\poly(m_{1},m_{2})} such that, for every ηη0\eta\leq\eta_{0}, for every fixed xx with x2=1\|x\|_{2}=1, for every W,V,W,VW^{\prime},V^{\prime},W^{\prime\prime},V^{\prime\prime} that may depend on the randomness of the initialization and xx, with

where OpO_{p} hides polynomial factor of m1,m2m_{1},m_{2}.

Since Pρ,ηP_{\rho,\eta} and Pρ,ηP^{\prime}_{\rho,\eta} only differ in the sign pattern, we try to bound the (expected) output difference Pρ,ηPρ,ηP_{\rho,\eta}-P^{\prime}_{\rho,\eta} by analyzing these sign changes. We use the same proof structure as Lemma 6.5, that is to first bound the sign changes in the first layer, and then the second layer. One can carefully verify

We call \clubsuit the output difference caused by sign change of the first layer; and \spadesuit that caused by the sign change of the second layer.

One consequence of (C.29) is Pr[z02]Op(η2)\operatornamewithlimits{\mathbf{Pr}}[\|z^{\prime}\|_{0}\geq 2]\leq O_{p}(\eta^{2}). When z02\|z^{\prime}\|_{0}\geq 2, the contribution to \clubsuit is Op(η)O_{p}(\eta). Multiplying them together, the total contribution to \clubsuit in expectation is at most Op(η3)O_{p}(\eta^{3}).

Thus, we only need to consider the case z0=1\|z\|_{0}=1. Let ii be this coordinate so that zi0z^{\prime}_{i}\neq 0. This happens with probability at most O(ητw,/σw)O(\eta\tau_{w,\infty}/\sigma_{w}) for each i[m1]i\in[m_{1}]. The contribution of zz^{\prime} to \clubsuit is

and let us deal with the three terms separately:

For the term arDv,x,ρηVza_{r}D_{v,x,\rho}\eta V^{\prime\prime}z^{\prime}, it is of absolute value at most Op(η2)O_{p}(\eta^{2}). Since z00=1\|z_{0}\|_{0}=1 happens with probability Op(η)O_{p}(\eta), the total contribution to the expected value of \clubsuit is only Op(η3)O_{p}(\eta^{3}).

For the term arDv,x,ρ(V+Vρ)za_{r}D_{v,x,\rho}(V+V^{\rho})z^{\prime}, we first observe that with high probability arDv,x,ρ(V+Vρ)O~(ar2m2)O~(1)\|a_{r}D_{v,x,\rho}(V+V^{\rho})\|_{\infty}\leq\widetilde{O}(\frac{\|a_{r}\|_{2}}{\sqrt{m_{2}}})\leq\widetilde{O}(1) (for a proof see Footnote 29). Therefore, given z0=1\|z^{\prime}\|_{0}=1 and zητw,\|z^{\prime}\|_{\infty}\leq\eta\tau_{w,\infty}, we have that arDv,x,ρ(V+Vρ)zO~(ητw,)\|a_{r}D_{v,x,\rho}(V+V^{\rho})z^{\prime}\|\leq\widetilde{O}(\eta\tau_{w,\infty}). Since this happens with probability at most O(ητw,/σw)×m1O(\eta\tau_{w,\infty}/\sigma_{w})\times m_{1}—recall there are m1m_{1} many possible i[m1]i\in[m_{1}]— the total contribution to the expected value of \clubsuit is O~(η2m1τw2σw)+Op(η3)\widetilde{O}\left(\eta^{2}m_{1}\frac{\tau_{w}^{2}}{\sigma_{w}}\right)+O_{p}(\eta^{3}).

Sign Change of Second Layer. Recall that the sign of the ReLU of the second layer is changed from Dw,x,ρD_{w,x,\rho} to Dw,x,ρ,ηD_{w,x,\rho,\eta}. Let us compare the vector inputs of these two matrices before ReLU is applied, that is

This difference δ\delta has the following four terms:

With Wxτw,\|W^{\prime\prime}x\|_{\infty}\leq\tau_{w,\infty} and V2O~(1)\|V\|_{2}\leq\widetilde{O}(1), by Fact C.9 we know that w.h.p.

This is non-zero with probability Op(η)O_{p}(\eta), and when it is non-zero, its Euclidean norm is Op(η)O_{p}(\eta).

Since zO~(τw+σw+m11/2)=O(m11/2)\|z\|_{\infty}\leq\widetilde{O}(\tau_{w}+\sigma_{w}+m_{1}^{-1/2})=O(m_{1}^{-1/2}) owing to (C.27), by Fact C.9, we know that w.h.p.

Since originally each coordinate of VρzV^{\rho}z follows from N(0,σvz22)\mathcal{N}(0,\sigma_{v}\|z\|_{2}^{2}) with z2=Ω~(1)\|z\|_{2}=\widetilde{\Omega}(1), using a similar argument as (C.29)Namely, to first show that each coordinate i[m2]i\in[m_{2}] satisfies (Dv,x,ρ,ηDv,x,ρ)i,i0(D_{v,x,\rho,\eta}-D_{v,x,\rho})_{i,i}\neq 0 with probability \widetilde{O}\big{(}\eta\frac{(\frac{1}{\sqrt{m_{1}}}\tau_{v,\infty}+\tau_{w,\infty})}{\sigma_{v}}\big{)}. Then, since we can ignoring terms of magnitude Op(η3)O_{p}(\eta^{3}), it suffices to consider the case of Dv,x,ρ,ηDv,x,ρ0=1\|D_{v,x,\rho,\eta}-D_{v,x,\rho}\|_{0}=1, which occurs with probability at most m_{2}\times\widetilde{O}\big{(}\eta\frac{(\frac{1}{\sqrt{m_{1}}}\tau_{v,\infty}+\tau_{w,\infty})}{\sigma_{v}}\big{)} by union bound. Finally, each coordinate changes by at most \widetilde{O}\big{(}\eta\big{(}\frac{1}{\sqrt{m_{1}}}\tau_{v,\infty}+\tau_{w,\infty}\big{)}\big{)} by the argument above. we can bound the contribution to the expected value of \spadesuit by:

C.2.4 Lemma 6.11: Stronger Coupling

Under parameter choices Table 1, the last error term is at most ε/k\varepsilon/k.

For notation simplicity, let us do the proof without the bias term b1b_{1} and b2b_{2}. The proof with them are analogous.

On the other hand, applying Lemma 6.5 with W=0W^{\prime\prime}=0 and V=0V^{\prime\prime}=0, we have z20s=O~(τw4/5m16/5)\|z_{2}\|_{0}\leq s=\widetilde{O}(\tau_{w}^{4/5}m_{1}^{6/5}). Therefore, using the same derivation as (C.26), we can bound its Euclidean norm

Using the same derivation as (C.30), we also have w.h.p

Now, recall using Cauchy-Shwartz and the 1-Lipschitz continuity of the ReLU function, we have for every (xi,yi)i[m2](x_{i},y_{i})_{i\in[m_{2}]}, with high probability (over aa):

To bound ①, we consider the difference between

where Dv,xD_{v,x}^{\prime\prime} is the diagonal sign change matrix due to moving input from V(0)zV^{(0)}z to V(0)(z+z1+z2)+VDw,x(0)WxV^{(0)}(z+z_{1}+z_{2})+V^{\prime}D^{(0)}_{w,x}W^{\prime}x. This difference has the following three terms.

V(0)z2V^{(0)}z_{2}. Using the sparsity of z2z_{2} we know w.h.p. V(0)z2O~(z22sm21/2)O~(τw8/5m19/10m2)\|V^{(0)}z_{2}\|_{\infty}\leq\widetilde{O}(\|z_{2}\|_{2}\sqrt{s}m_{2}^{-1/2})\leq\widetilde{O}\left(\frac{\tau_{w}^{8/5}m_{1}^{9/10}}{\sqrt{m_{2}}}\right)

VDw,x(0)Wx2VFWx2τwm11/4τv\|V^{\prime}D^{(0)}_{w,x}W^{\prime}x\|_{2}\leq\|V^{\prime}\|_{F}\cdot\|W^{\prime}x\|_{2}\leq\tau_{w}m_{1}^{1/4}\tau_{v}.

Together, using arO~(1)\|a_{r}\|_{\infty}\leq\widetilde{O}(1) and invoking Claim C.8, we can bound it by:

(When invoking Claim C.8, we need τwm19/16\tau_{w}\leq m_{1}^{-9/16} and τwm11/4τv1\tau_{w}m_{1}^{1/4}\tau_{v}\leq 1.)

Since w.h.p. arDv,x(0)V(0)=O~(1)\|a_{r}D^{(0)}_{v,x}V^{(0)}\|_{\infty}=\widetilde{O}(1) and z2z_{2} is ss sparse, we know that w.h.p.

Since Wx2τwm11/4\|W^{\prime}x\|_{2}\leq\tau_{w}m_{1}^{1/4} and w.h.p. arDv,x(0)V(0)Dw,x(0)O~(1)\|a_{r}D^{(0)}_{v,x}V^{(0)}D^{(0)}_{w,x}\|_{\infty}\leq\widetilde{O}(1) (see Footnote 29), by Fact C.9,

Putting together (C.31), (C.32), (C.33), one can carefully verify that

Above, we have used our parameter choices τv\tau_{v}\in and τw[1m13/4,1m19/16]\tau_{w}\in[\frac{1}{m_{1}^{3/4}},\frac{1}{m_{1}^{9/16}}]. ∎

C.3 Optimization

For every ε0(0,1)\varepsilon_{0}\in(0,1) and ε=ε0kp1p22Cs(Φ,p2Cs(ϕ,1))Cs(ϕ,1)2\varepsilon=\frac{\varepsilon_{0}}{kp_{1}p_{2}^{2}\mathfrak{C}_{\mathfrak{s}}(\Phi,p_{2}\mathfrak{C}_{\mathfrak{s}}(\phi,1))\mathfrak{C}_{\mathfrak{s}}(\phi,1)^{2}}, for every constant γ(0,1/4]\gamma\in(0,1/4], consider the parameter choices in Table 1, and consider any λt,Wt,Vt\lambda_{t},W_{t},V_{t} (that may depend on the randomness of W(0),b(0),V(0),b(1)W^{(0)},b^{(0)},V^{(0)},b^{(1)} and Z\mathcal{Z}) with

With high probability over the random initialization, there exists W,VW^{\divideontimes},V^{\divideontimes} with WF,VF1\|W^{\divideontimes}\|_{F},\|V^{\divideontimes}\|_{F}\leq 1 such that for every η[0,1\poly(m1,m2)]\eta\in\left[0,\frac{1}{\poly(m_{1},m_{2})}\right]:

Define the “pseudo function” for every WW^{\prime}, VV^{\prime} as

where Dv,x,ρ,tD_{v,x,\rho,t} and Dw,x,ρ,tD_{w,x,\rho,t} are the diagonal sign matrices at weights W(0)+Wρ+Wt,V(0)+Vρ+VtW^{(0)}+W^{\rho}+W_{t},V^{(0)}+V^{\rho}+V_{t}.

where Dv,x,ρ,VD_{v,x,\rho,V^{\prime}} and Dw,x,ρ,WD_{w,x,\rho,W^{\prime}} are the diagonal sign matrices at weights W(0)+Wρ+W,V(0)+Vρ+VW^{(0)}+W^{\rho}+W^{\prime},V^{(0)}+V^{\rho}+V^{\prime}.

G(x;W,V)=(g1,,gk)G(x;W^{\prime},V^{\prime})=(g_{1},\dots,g_{k}) and F(x;W,V)=(f1,,fk)F(x;W^{\prime},V^{\prime})=(f_{1},\dots,f_{k}).

As a sanity check, we have G(x;Wt,Vt)=F(x;Wt,Vt)G(x;W_{t},V_{t})=F(x;W_{t},V_{t}). (Here we slightly override the notation and use fr(x;W,V)f_{r}(x;W^{\prime},V^{\prime}) to denote fr(x;W(0)+Wρ+W,V(0)+Vρ+V)f_{r}(x;W^{(0)}+W^{\rho}+W^{\prime},V^{(0)}+V^{\rho}+V^{\prime}).)

By our regularizer parameters λw,λv\lambda_{w},\lambda_{v} in Table 1, as long as L(λt,Wt,Vt)O~(1)L^{\prime}(\lambda_{t},W_{t},V_{t})\leq\widetilde{O}(1), we know

Applying Corollary 6.6 (but scaling up the target FF^{*} by 1λt\frac{1}{\lambda_{t}}), we know that there exists W,VW^{\divideontimes},V^{\divideontimes} with (here we have scaled up WW^{\divideontimes} and scaled down VV^{\divideontimes} both by m10.005m_{1}^{0.005})

By our parameter choices in Table 1, this implies

Change in Regularizer. We first consider the change of the regularizer. We know that

For each term i[m1]i\in[m_{1}], we can bound

(Recall we use Op()O_{p}(\cdot) to hide polynomial factors in m1m_{1} and m2m_{2}.) By Cauchy-Schwarz,

By λvλtVF2ε0\lambda_{v}\|\sqrt{\lambda_{t}}V^{\divideontimes}\|_{F}^{2}\leq\varepsilon_{0}, λwλtW2,44ε0\lambda_{w}\|\sqrt{\lambda_{t}}W^{\divideontimes}\|_{2,4}^{4}\leq\varepsilon_{0}, and λwλtWt2,44R(λtWt,λtVt)\lambda_{w}\|\sqrt{\lambda_{t}}W_{t}\|_{2,4}^{4}\leq R(\sqrt{\lambda_{t}}W_{t},\sqrt{\lambda_{t}}V_{t}), we know that

Change in Objective. We now consider the change in the objective value. Recall from (C.35) the construction of good network W,VW^{\divideontimes},V^{\divideontimes} satisfies τv,1m1999/2000\tau_{v,\infty}\leq\frac{1}{m_{1}^{999/2000}} and τw,1m1999/1000\tau_{w,\infty}\leq\frac{1}{m_{1}^{999/1000}}. For polynomially small η\eta, by Lemma 6.9 (replacing its η\eta with η\sqrt{\eta}), we have:

First we focus on G(x;W^,V^)G(x;\widehat{W},\widehat{V}). Since Wt2,4τw\|W_{t}\|_{2,4}\leq\tau_{w} and Vt2,2τv\|V_{t}\|_{2,2}\leq\tau_{v} from (C.34), we can apply Lemma 6.5 to get

Combining (C.38) and (C.39), we know that for every fixed x,yx,y in the support of distribution Z\mathcal{Z}:

Above, ① uses the 1-Lipschitz smoothness of LL which implies

Next, by convexity of the loss function, we have

For sufficiently small η\eta, we know that

The objective growth inequalities (C.40) and (C.42) together imply

The regularizer growth inequality (C.37) implies

Putting (C.43) and (C.44) together we have

Multiplying 12(1η)\frac{1}{2(1-\eta)} on both sides, we have:

Therefore, as long as c3(1+γ)OPT+Ω(ε0/γ)c_{3}^{\prime}\geq(1+\gamma)\mathsf{OPT}+{\Omega}(\varepsilon_{0}/\gamma) and γ\gamma\in, we have:

C.3.2 Lemma 6.8: Convergence

In the setting of Theorem 3, with probability at least 99/10099/100, Algorithm 3 (the first SGD variant) converges in TTw=\poly(m1,m2)TT_{w}=\poly\left(m_{1},m_{2}\right) iterations to a point

For the first variant of SGD, note that there are T=Θ(η1loglog(m1m2)ε0)T=\Theta(\eta^{-1}\log\frac{\log(m_{1}m_{2})}{\varepsilon_{0}}) rounds of weight decay, which implies that λt(ε/log(m1m2))O(1)\lambda_{t}\geq(\varepsilon/\log(m_{1}m_{2}))^{O(1)} is always satisfied (because γ\gamma is a constant). By Lemma 6.7, we know that as long as L[(1+γ)OPT+Ω(ε0/γ),O~(1)]L^{\prime}\in[(1+\gamma)\mathsf{OPT}+\Omega(\varepsilon_{0}/\gamma),\widetilde{O}(1)], then there exists WF,VF1\|W^{\divideontimes}\|_{F},\|V^{\divideontimes}\|_{F}\leq 1 such that either

In the first case, recall LL^{\prime} is B=\poly(m1,m2)B=\poly(m_{1},m_{2}) second-order smooth,When convoluted with Gaussian noise (fg)=fg\nabla(f*g)=f*\nabla g, every bounded function ff becomes infinite-order differentiable with parameter BB inversely-polynomially dependent on the noise level gg. by Fact A.8, it satisfies (λt1\lambda_{t-1} is fixed and the Hessian is with respect to WW and VV):

On the other hand, for every t1t\geq 1, since WtW_{t} is the output of noisy SGD, by the escape saddle point theorem of (stated in Lemma A.9), we know with probability at least 1p1-p it satisfies λmin(2L(λt1,Wt,Vt))>1/(m1m2)8.\lambda_{\min}\left(\nabla^{2}L^{\prime}(\lambda_{t-1},W_{t},V_{t})\right)>-1/(m_{1}m_{2})^{8}\enspace. Choosing p=11000Tp=\frac{1}{1000T}, we know with probability at least 0.9990.999, this holds for all rounds t=1,2,,Tt=1,2,\dots,T. In other words, for all rounds t=1,2,,Tt=1,2,\dots,T, the first case cannot happen and therefore as long as L(1+γ)OPT+Ω(ε0/γ)L^{\prime}\geq(1+\gamma)\mathsf{OPT}+\Omega(\varepsilon_{0}/\gamma),

On the other hand, for each round t=0,1,,T1t=0,1,\dots,T-1, as long as LO~(1)L^{\prime}\leq\widetilde{O}(1), by Lemma A.9, it holds that

Since initially, L(λ1,W0,V0)O~(1)L^{\prime}(\lambda_{1},W_{0},V_{0})\leq\widetilde{O}(1) w.h.p., we have w.h.p. LO~(1)L^{\prime}\leq\widetilde{O}(1) throughout the process. Since γ\gamma is a constant, after T=Θ(η1loglogmε0)T=\Theta(\eta^{-1}\log\frac{\log m}{\varepsilon_{0}}) rounds of weight decay, we have L(1+γ)OPT+O(ε0/γ)L^{\prime}\leq(1+\gamma)\mathsf{OPT}+O(\varepsilon_{0}/\gamma). Since γ\gamma is a constant, re-scaling ε0\varepsilon_{0} down by a constant factor finishes the proof. ∎

For the second variant of the SGD, note that

C.4 Generalization

We derive a very crude Rademacher complexity bound for our three-layer neural network. We have not tried to tighten the polynomial dependency in m1m_{1} and m2m_{2}.

For every τv,τw0\tau^{\prime}_{v},\tau^{\prime}_{w}\geq 0, every σv(0,1/m2]\sigma_{v}\in(0,1/\sqrt{m_{2}}], w.h.p. for every r[k]r\in[k] and every N1N\geq 1, the empirical Rademacher complexity is bounded by

Let W=W(0)+WρW=W^{(0)}+W^{\rho} and V=V(0)+VρV=V^{(0)}+V^{\rho} for notation simplicity. Recall the input to the jj-th neuron on the second layer is

We now bound the Rademacher complexity of fr(xi;W+W,V+V)f_{r}(x_{i};W+W^{\prime},V+V^{\prime\prime}) in the following simple steps.

{xwi,xwi2τw}\{x\mapsto\langle w^{\prime}_{i},x\rangle\mid\|w^{\prime}_{i}\|_{2}\leq\tau^{\prime}_{w}\} has Rademacher complexity O(τwN)O(\frac{\tau^{\prime}_{w}}{\sqrt{N}}) by Proposition prop:rada.

{xwi+wi,x+biwi2τw}\{x\mapsto\langle w_{i}+w^{\prime}_{i},x\rangle+b_{i}\mid\|w^{\prime}_{i}\|_{2}\leq\tau^{\prime}_{w}\} has Rademacher complexity O(τwN)O(\frac{\tau^{\prime}_{w}}{\sqrt{N}}) because singleton class has zero complexity and adding it does not affect complexity by Proposition prop:radc.

{xnj(x;W+W,V+V)W2,τwvj1m1τvvjδ}\{x\mapsto n_{j}(x;W+W^{\prime},V+V^{\prime\prime})\mid\|W^{\prime}\|_{2,\infty}\leq\tau^{\prime}_{w}\wedge\|v^{\prime\prime}_{j}\|_{1}\leq\sqrt{m_{1}}\tau^{\prime}_{v}\wedge\|v^{\prime\prime}_{j}\|_{\infty}\leq\delta\} has Rademacher complexity O(τwN)O~(m1m2+δm1)+O~(τvN)O(\frac{\tau^{\prime}_{w}}{\sqrt{N}})\cdot\widetilde{O}(\frac{m_{1}}{\sqrt{m_{2}}}+\delta m_{1})+\widetilde{O}(\frac{\tau^{\prime}_{v}}{\sqrt{N}}). This is because w.h.p. vj1O~(m1m2)\|v_{j}\|_{1}\leq\widetilde{O}(\frac{m_{1}}{\sqrt{m_{2}}}) so we can apply Proposition prop:rade, and because vj1m1τv\|v^{\prime\prime}_{j}\|_{1}\leq\sqrt{m_{1}}\tau^{\prime}_{v} and vjδ\|v^{\prime\prime}_{j}\|_{\infty}\leq\delta so we can apply Proposition prop:radf by choosing fi(0)(x)=wi,x+bif^{(0)}_{i}(x)=\langle w_{i},x\rangle+b_{i} which satisfies fi(0)(x)O~(1m1)|f^{(0)}_{i}(x)|\leq\widetilde{O}(\frac{1}{\sqrt{m_{1}}}) w.h.p.

{xfr(x;W+W,V+V)W2,τwj[m2],vj1m1τv}\{x\mapsto f_{r}(x;W+W^{\prime},V+V^{\prime\prime})\mid\|W^{\prime}\|_{2,\infty}\leq\tau^{\prime}_{w}\wedge\forall j\in[m_{2}],\|v^{\prime\prime}_{j}\|_{1}\leq\sqrt{m_{1}}\tau^{\prime}_{v}\} has Rademacher complexity \Big{(}O(\frac{\tau^{\prime}_{w}}{\sqrt{N}})\cdot\widetilde{O}(\frac{m_{1}}{\sqrt{m_{2}}}+\delta m_{1})+\widetilde{O}(\frac{\tau^{\prime}_{v}}{\sqrt{N}})\Big{)}\cdot\widetilde{O}(m_{2}) because w.h.p. ar1O~(m2)\|a_{r}\|_{1}\leq\widetilde{O}(m_{2}) and Proposition prop:rade.

Finally, noticing that W2,4τw\|W^{\prime}\|_{2,4}\leq\tau^{\prime}_{w} implies W2,τw\|W^{\prime}\|_{2,\infty}\leq\tau^{\prime}_{w} and VFτv\|V^{\prime\prime}\|_{F}\leq\tau^{\prime}_{v} implies vj1m1vj2m1τv\|v^{\prime\prime}_{j}\|_{1}\leq\sqrt{m_{1}}\|v^{\prime\prime}_{j}\|_{2}\leq\sqrt{m_{1}}\tau^{\prime}_{v}, we finish the proof that the Rademacher complexity of fr(xi;W+W,V+V)f_{r}(x_{i};W+W^{\prime},V+V^{\prime\prime}) is at most

Combining this with (C.45), and tuning the best choice of δ\delta gives the desired result. ∎

For every τv\tau^{\prime}_{v}\in, \tau^{\prime}_{w}\in\big{[}\frac{1}{m_{1}^{3/4}},\frac{1}{m_{1}^{9/16}}\big{]}, every σw[0,1/m1]\sigma_{w}\in[0,1/\sqrt{m_{1}}] and σv[0,1/m2]\sigma_{v}\in[0,1/\sqrt{m_{2}}], w.h.p. for every r[k]r\in[k] and every N1N\geq 1, we have by our choice of parameters in Lemma 6.7, the empirical Rademacher complexity is bounded by

Under parameter choices in Table 1, this is at most \widetilde{O}\big{(}\frac{\tau^{\prime}_{w}\tau^{\prime}_{v}m_{1}^{1/4}\sqrt{m_{2}}}{\sqrt{N}}\big{)}+\varepsilon/k.

Let W=W(0)+WρW=W^{(0)}+W^{\rho} and V=V(0)+VρV=V^{(0)}+V^{\rho} for notation simplicity. Applying Lemma 6.11 (with τw\tau_{w} chosen as τw\tau^{\prime}_{w}), we know that by our choice of parameters,

for B=\widetilde{O}\big{(}(\tau^{\prime}_{w})^{8/5}m_{1}^{9/10}+(\tau^{\prime}_{w})^{16/5}m_{1}^{9/5}\sqrt{m_{2}}+\frac{\sqrt{m_{2}}}{\sqrt{m_{1}}}\tau^{\prime}_{v}\big{)}.

We bound the Rademacher complexity of the right hand side. It consists of three terms and the Rademacher complexity is the summation of the three (recall Proposition prop:radc).

The first term does not depend on WW^{\prime} or VV^{\prime} so has Rademacher complexity zero.

The third term has Rademacher complexity at most BB.

The second term corresponds to the function class

We calculate its Rademacher complexity as follows.

Let us bound the last term entry by entry. Let [Dw,xj]q[D_{w,x_{j}}]_{q} denote the qq-th column of Dw,xjD_{w,x_{j}}, [V]q[V^{\prime}]_{q} the qq-th column of VV^{\prime}.

For random ξj\xi_{j}, we know that w.h.p. over the randomness of ξj\xi_{j} (notice that we can do so because Dw,xjD_{w,x_{j}} only depends on W(0)W^{(0)} but not on WW^{\prime}, so we can take randomness argument on ξ\xi),

This implies \widehat{\mathfrak{R}}(\mathcal{X};\mathcal{F})\leq\widetilde{O}\big{(}\frac{\tau^{\prime}_{w}\tau^{\prime}_{v}m_{1}^{1/4}\sqrt{m_{2}}}{\sqrt{N}}\big{)} and finishes the proof. ∎

C.5 Final Theorems

Consider Algorithm 3. For every constant γ(0,1/4]\gamma\in(0,1/4], every ε0(0,1/100]\varepsilon_{0}\in(0,1/100], every ε=ε0kp1p22Cs(Φ,p2Cs(ϕ,1))Cs(ϕ,1)2\varepsilon=\frac{\varepsilon_{0}}{kp_{1}p_{2}^{2}\mathfrak{C}_{\mathfrak{s}}(\Phi,p_{2}\mathfrak{C}_{\mathfrak{s}}(\phi,1))\mathfrak{C}_{\mathfrak{s}}(\phi,1)^{2}}, there exists

such that for every m2=m1=mMm_{2}=m_{1}=m\geq M, and properly set λw,λv,σw,σv\lambda_{w},\lambda_{v},\sigma_{w},\sigma_{v} in Table 1, as long as

there is a choice η=1/\poly(m1,m2)\eta=1/\poly(m_{1},m_{2}) and T=\poly(m1,m2)T=\poly(m_{1},m_{2}) such that with probability 99/100\geq 99/100,

Since LF(z;λT,W(0)+WT+Wρ,V(0)+VT+Vρ)[0,O~(1)]L_{F}(z;\lambda_{T},W^{(0)}+W_{T}+W^{\rho},V^{(0)}+V_{T}+V^{\rho})\in[0,\widetilde{O}(1)] w.h.p., by randomly sampling O~(1/ε02)\widetilde{O}(1/\varepsilon_{0}^{2}) many Wρ,j,Vρ,jW^{\rho,j},V^{\rho,j}, we know w.h.p. there exists one jj^{*} with

due to our regularizer (see (C.34)). By simple spectral norm bound, we also know w.h.p. for every (x,y)D(x,y)\sim\mathcal{D} and jj,Indeed, with our parameter choices in Table 1, the spectral norms λTW(0)+Wρ,j+WT2O(1)+λTWTFO(1+m11/4τw)O(1)\sqrt{\lambda_{T}}\|W^{(0)}+W^{\rho,j}+W_{T}\|_{2}\leq O(1)+\|\sqrt{\lambda_{T}}W_{T}\|_{F}\leq O(1+m_{1}^{1/4}\tau^{\prime}_{w})\leq O(1) and λTV(0)+Vρ,j+VT2O(1)+λTVTFO(1+τv)O(1)\sqrt{\lambda_{T}}\|V^{(0)}+V^{\rho,j}+V_{T}\|_{2}\leq O(1)+\|\sqrt{\lambda_{T}}V_{T}\|_{F}\leq O(1+\tau^{\prime}_{v})\leq O(1). Therefore, the network output λTF(x;W(0)+Wρ,j+WT,V(0)+Vρ,j+VT)\lambda_{T}F(x;W^{(0)}+W^{\rho,j}+W_{T},V^{(0)}+V^{\rho,j}+V_{T}) must be bounded by O~(km2)\widetilde{O}(\sqrt{km_{2}}) in Euclidean norm. By the assumption that L(0,y)L(0,y)\in and L(,y)L(\cdot,y) is 1-Lipschitz continuous in the first variable, we have that LFL_{F} is bounded as stated.

Therefore, we can plug in the Rademacher complexity from Lemma 6.10 and b=O~(km2)b=\widetilde{O}(\sqrt{km_{2}}) into standard generalization statement Corollary A.11. Using our choices of τw\tau^{\prime}_{w} and τv\tau^{\prime}_{v} from Table 1 as well as m1=m2m_{1}=m_{2}, this bound implies as long as NO~(M(m2)3/2)N\geq\widetilde{O}(M(m_{2})^{3/2}), w.h.p. for every pair Wρ,j,Vρ,jW^{\rho,j},V^{\rho,j}, it holds

C.5.2 Theorem 2: Second SGD Variant

Consider Algorithm 2. For every constant γ(0,1/4]\gamma\in(0,1/4], every ε0(0,1/100]\varepsilon_{0}\in(0,1/100], every ε=ε0kp1p22Cs(Φ,p2Cs(ϕ,1))Cs(ϕ,1)2\varepsilon=\frac{\varepsilon_{0}}{kp_{1}p_{2}^{2}\mathfrak{C}_{\mathfrak{s}}(\Phi,p_{2}\mathfrak{C}_{\mathfrak{s}}(\phi,1))\mathfrak{C}_{\mathfrak{s}}(\phi,1)^{2}}, there exists

such that for every m2=m1=mMm_{2}=m_{1}=m\geq M, and properly set λw,λv,σw,σv\lambda_{w},\lambda_{v},\sigma_{w},\sigma_{v} in Table 1, as long as

there is a choice η=1/\poly(m1,m2)\eta=1/\poly(m_{1},m_{2}) and T=\poly(m1,m2)T=\poly(m_{1},m_{2}) such that with probability 99/100\geq 99/100,

For the same reason as the proof of Theorem 3, we know w.h.p. among O~(1/ε02)\widetilde{O}(1/\varepsilon_{0}^{2}) choices of jj,

Without loss of generality, in the remainder of the proof we assume OPTO(ε0)\mathsf{OPT}\leq O(\varepsilon_{0}). This can be done because is ε0\varepsilon_{0} is too small we can increase it to ε0=Θ(OPT)\varepsilon_{0}=\Theta(\mathsf{OPT}). By our regularizer parameters λw,λv\lambda_{w},\lambda_{v} in Table 1, we know

Using the 1-Lipschitz continuity of LL together with (C.47) and (C.48), it is not hard to derive that for every jj, with high probability over (x,y)D(x,y)\sim\mathcal{D},W(0),V(0),Wρ,j,Vρ,jW^{(0)},V^{(0)},W^{\rho,j},V^{\rho,j} Indeed, fr(x;W(0)+Wρ,j,V(0)+Vρ,j)O~(1)|f_{r}(x;W^{(0)}+W^{\rho,j},V^{(0)}+V^{\rho,j})|\leq\widetilde{O}(1) with high probability, and λTgr(b,b)(x;WT,VT)O~(m2λTVT2λTWT2)O~(m2λTVTFλTWTF)O~(ε03/4m2m11/4τwτv)O~(C0)\lambda_{T}|g_{r}^{(b,b)}(x;W_{T},V_{T})|\leq\widetilde{O}(\sqrt{m_{2}}\|\sqrt{\lambda_{T}}V_{T}\|_{2}\|\sqrt{\lambda_{T}}W_{T}\|_{2})\leq\widetilde{O}(\sqrt{m_{2}}\|\sqrt{\lambda_{T}}V_{T}\|_{F}\|\sqrt{\lambda_{T}}W_{T}\|_{F})\leq\widetilde{O}(\varepsilon_{0}^{3/4}\sqrt{m_{2}}m_{1}^{1/4}\tau^{\prime}_{w}\tau^{\prime}_{v})\leq\widetilde{O}(C_{0}) by spectral norm bounds.

Therefore, we can plug in the Rademacher complexity from Lemma 6.12 with b=O~(C0)b=\widetilde{O}(C_{0}) into standard generalization statement Corollary A.11.Strictly speaking, Corollary A.11 requires an absolute value bound bb as opposed to a high probability bound. It is a simple exercise to deal with this issue, see for instance Remark B.6 in our two-layer proof. Using our choices of τw\tau^{\prime}_{w} and τv\tau^{\prime}_{v} from Table 1 as well as m1=m2m_{1}=m_{2}, the Rademacher complexity is dominated by

In other words, as long as N(kC0/ε0)2N\geq(kC_{0}/\varepsilon_{0})^{2}, the Rademacher complexity of a single output is at most ε0k\frac{\varepsilon_{0}}{k}, so the generalization error is at most ε0\varepsilon_{0} by Corollary A.11. Or, in symbols, for every jj, w.h.p. over W(0),V(0),Wρ,j,Vρ,jW^{(0)},V^{(0)},W^{\rho,j},V^{\rho,j},

For similar reason, replacing D\mathcal{D} with Z\mathcal{Z}, we have

Putting (C.52) and (C.51) into (C.49), we have

Combining (C.53) and (C.54), we immediately have

as desired. Scaling down ε0\varepsilon_{0} by constant finishes the proof. ∎

References