Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks

Difan Zou, Yuan Cao, Dongruo Zhou, Quanquan Gu

Introduction

Deep neural networks have achieved great success in many applications like image processing (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012) and Go games (Silver et al., 2016). However, the reason why deep networks work well in these fields remains a mystery for long time. Different lines of research try to understand the mechanism of deep neural networks from different aspects. For example, a series of work tries to understand how the expressive power of deep neural networks are related to their architecture, including the width of each layer and depth of the network (Telgarsky, 2015, 2016; Lu et al., 2017; Liang and Srikant, 2016; Yarotsky, 2017, 2018; Hanin, 2017; Hanin and Sellke, 2017). These work shows that multi-layer networks with wide layers can approximate arbitrary continuous function.

In this paper, we mainly focus on the optimization perspective of deep neural networks. It is well known that without any additional assumption, even training a shallow neural network is an NP-hard problem (Blum and Rivest, 1989). Researchers have made various assumptions to get a better theoretical understanding of training neural networks, such as Gaussian input assumption (Brutzkus et al., 2017; Du et al., 2017a; Zhong et al., 2017) and independent activation assumption (Choromanska et al., 2015; Kawaguchi, 2016). A recent line of work tries to understand the optimization process of training deep neural networks from two aspects: over-parameterization and random weight initialization. It has been observed that over-parameterization and proper random initialization can help the optimization in training neural networks, and various theoretical results have been established (Safran and Shamir, 2017; Du and Lee, 2018; Arora et al., 2018a; Allen-Zhu et al., 2018c; Du et al., 2018b; Li and Liang, 2018). More specifically, Safran and Shamir (2017) showed that over-parameterization can help reduce the spurious local minima in one-hidden-layer neural networks with Rectified Linear Unit (ReLU) activation function. Du and Lee (2018) showed that with over-parameterization, all local minima in one-hidden-layer networks with quardratic activation function are global minima. Arora et al. (2018b) showed that over-parameterization introduced by depth can accelerate the training process using gradient descent (GD). Allen-Zhu et al. (2018c) showed that with over-parameterization and random weight initialization, both gradient descent and stochastic gradient descent (SGD) can find the global minima of recurrent neural networks.

In this paper, we aim to advance this line of research by studying the optimization properties of gradient-based methods for deep ReLU neural networks. In specific, we consider an LL-hidden-layer fully-connected neural network with ReLU activation function. Similar to the one-hidden-layer case studied in Li and Liang (2018) and Du et al. (2018b), we study binary classification problem and show that both GD and SGD can achieve global minima of the training loss for any L1L\geq 1, with the aid of over-parameterization and random initialization. At the core of our analysis is to show that Gaussian random initialization followed by (stochastic) gradient descent generates a sequence of iterates within a small perturbation region centering around the initial weights. In addition, we will show that the empirical loss function of deep ReLU networks has very good local curvature properties inside the perturbation region, which guarantees the global convergence of (stochastic) gradient descent. More specifically, our main contributions are summarized as follows:

We show that with Gaussian random initialization on each layer, when the number of hidden nodes per layer is at least \widetilde{\Omega}\big{(}\text{poly}(n,\phi^{-1},L)\big{)}, GD can achieve zero training error within \widetilde{O}\big{(}\text{poly}(n,\phi^{-1},L)\big{)} iterations, where ϕ\phi is the data separation distance, nn is the number of training examples, and LL is the number of hidden layers. Our result can be applied to a broad family of loss functions, as opposed to cross entropy loss studied in Li and Liang (2018) and quadratic loss considered in Du et al. (2018b).

We also prove a similar convergence result for SGD. We show that with Gaussian random initialization on each layer, when the number of hidden nodes per layer is at least \widetilde{\Omega}\big{(}\text{poly}(n,\phi^{-1},L)\big{)}, SGD can also achieve zero training error within \widetilde{O}\big{(}\text{poly}(n,\phi^{-1},L)\big{)} iterations.

In terms of data distribution, we only make the so-called data separation assumption, which is more realistic than the assumption on the gram matrix made in Du et al. (2018b). The data separation assumption in this work is similar, but slightly milder, than that in Li and Yuan (2017) in the sense that it holds as long as the data are sampled from a distribution with a constant margin separating different classes.

When we were preparing this manuscript, we were informed that two concurrent work (Allen-Zhu et al., 2018b; Du et al., 2018a) has appeared on-line very recently. Our work bears a similarity to Allen-Zhu et al. (2018b) in the high-level proof idea, which is to extend the results for two-layer ReLU networks in Li and Liang (2018) to deep ReLU networks. However, while Allen-Zhu et al. (2018b) mainly focuses on the regression problems with least square loss, we study the classification problems for a broad class of loss functions based on a milder data distribution assumption. Du et al. (2018a) also studies the regression problem. Compared to their work, our work is based on a different assumption on the training data and is able to deal with the nonsmooth ReLU activation function.

The remainder of this paper is organized as follows: In Section 2, we discuss the literature that are most related to our work. In Section 3, we introduce the problem setup and preliminaries of our work. In Sections 4 and 5, we present our main theoretical results and their proofs respectively. We conclude our work and discuss some future work in Section 6.

Related Work

Due to the huge amount of literature on deep learning theory, we are not able to include all papers in this big vein here. Instead, we review the following three major lines of research, which are most related to our work.

One-hidden-layer neural networks with ground truth parameters Recently a series of work (Tian, 2017; Brutzkus and Globerson, 2017; Li and Yuan, 2017; Du et al., 2017a, b; Zhang et al., 2018) study a specific class of shallow two-layer (one-hidden-layer) neural networks, whose training data are generated by a ground truth network called “teacher network”. This series of work aim to provide recovery guarantee for gradient-based methods to learn the teacher networks based on either the population or empirical loss functions. More specifically, Tian (2017) proved that for two-layer ReLU networks with only one hidden neuron, GD with arbitrary initialization on the population loss is able to recover the hidden teacher network. Brutzkus and Globerson (2017) proved that GD can learn the true parameters of a two-layer network with a convolution filter. Li and Yuan (2017) proved that SGD can recover the underlying parameters of a two-layer residual network in polynomial time. Moreover, Du et al. (2017a, b) proved that both GD and SGD can recover the teacher network of a two-layer CNN with ReLU activation function. Zhang et al. (2018) showed that GD on the empirical loss function can recover the ground truth parameters of one-hidden-layer ReLU networks at a linear rate.

Deep linear networks Beyond shallow one-hidden-layer neural networks, a series of recent work (Hardt and Ma, 2016; Kawaguchi, 2016; Bartlett et al., 2018; Arora et al., 2018a, b) focus on the optimization landscape of deep linear networks. More specifically, Hardt and Ma (2016) showed that deep linear residual networks have no spurious local minima. Kawaguchi (2016) proved that all local minima are global minima in deep linear networks. Arora et al. (2018b) showed that depth can accelerate the optimization of deep linear networks. Bartlett et al. (2018) proved that with identity initialization and proper regularizer, GD can converge to the least square solution on a residual linear network with quadratic loss function, while Arora et al. (2018a) proved the same properties for general deep linear networks.

Generalization bounds for deep neural networks The phenomenon that deep neural networks generalize better than shallow neural networks have been observed in practice for a long time (Langford and Caruana, 2002). Besides classical VC-dimension based results (Vapnik, 2013; Anthony and Bartlett, 2009), a vast literature have recently studied the connection between the generalization performance of deep neural networks and their architectures (Neyshabur et al., 2015, 2017a, 2017b; Bartlett et al., 2017; Golowich et al., 2017; Arora et al., 2018c; Allen-Zhu et al., 2018a). More specifically, Neyshabur et al. (2015) derived Rademacher complexity for a class of norm-constrained feed-forward neural networks with ReLU activation function. Bartlett et al. (2017) derived margin bounds for deep ReLU networks based on Rademacher complexity and covering number. Neyshabur et al. (2017a, b) also derived similar spectrally-normalized margin bounds for deep neural networks with ReLU activation function using PAC-Bayes approach. Golowich et al. (2017) studied size-independent sample complexity of deep neural networks and showed that the sample complexity can be independent of both depth and width under additional assumptions. Arora et al. (2018c) proved generalization bounds via compression-based framework. Allen-Zhu et al. (2018a) showed that an over-parameterized one-hidden-layer neural network can learn a one-hidden-layer neural network with fewer parameters using SGD up to a small generalization error, while similar results also hold for over-parameterized two-hidden-layer neural networks.

Problem Setup and Preliminaries

2 Problem Setup

where p1p\leq 1 is a positive constant. Note that a0a_{0} can be ++\infty.

This assumption holds for a large class of loss functions including hinge loss, cross-entropy loss and exponential loss. It is worthy noting that when α0=+\alpha_{0}=+\infty and p=1/2p=1/2, this reduces to Polyak-Łukojasiewicz (PL) condition (Polyak, 1963).

In addition, we make the following assumptions on the training data.

xi2=1\|\mathbf{x}_{i}\|_{2}=1 and (xi)d=μ(\mathbf{x}_{i})_{d}=\mu for all i{1,,n}i\in\{1,\ldots,n\}, where μ(0,1)\mu\in(0,1) is a constant.

As is shown in the assumption above, the last entry of input x\mathbf{x} is considered to be a constant μ\mu, which introduces the bias term in the input layer of the network.

For all i,i{1,,n}i,i^{\prime}\in\{1,\ldots,n\}, if yiyiy_{i}\neq y_{i^{\prime}}, then xixi2ϕ\|\mathbf{x}_{i}-\mathbf{x}_{i^{\prime}}\|_{2}\geq\phi for some ϕ>0\phi>0.

Assumption 3.5 is a weaker version of Assumption 2.1 in Allen-Zhu et al. (2018b), which assumes that every two different data points are separated by a constant. In comparison, Assumption 3.5 only requires that inputs with different labels are separated, which is a much more practical assumption since it holds for all data distributions with margin ϕ\phi, while the data separation distance in Allen-Zhu et al. (2018b) is usually dependent on the sample size nn when the examples are generated independently.

Define M=max{m1,,mL}M=\max\{m_{1},\ldots,m_{L}\}, m=min{m1,,mL}m=\min\{m_{1},\ldots,m_{L}\}. We assume that M2mM\leq 2m.

Assumption 3.6 states that the number of nodes at all layers are of the same order. The constant 22 is not essential and can be replaced with an arbitrary constant greater than or equal to 11.

3 Optimization Algorithms

In this paper, we consider training a deep neural network with Gaussian initialization followed by gradient descent/stochastic gradient descent.

Gaussian initialization. We say that the weight matrices W1,,WL\mathbf{W}_{1},\ldots,\mathbf{W}_{L} are generated from Gaussian initialization if each column of Wl\mathbf{W}_{l} is generated independently from the Gaussian distribution N(0,2/mlI)N(\mathbf{0},2/m_{l}\mathbf{I}) for all l=1,,Ll=1,\ldots,L.

Gradient descent. We consider solving the empirical risk minimization problem (3.4) with gradient descent with Gaussian initialization: let W1(0),,WL(0)\mathbf{W}_{1}^{(0)},\ldots,\mathbf{W}_{L}^{(0)} be weight matrices generated from Gaussian initialization, we consider the following gradient descent update rule:

where η>0\eta>0 is the step size (a.k.a., learning rate).

Stochastic gradient descent. We also consider solving (3.4) using stochastic gradient descent with Gaussian initialization. Again, let {Wl(0)}l=1L\{\mathbf{W}_{l}^{(0)}\}_{l=1}^{L} be generated from Gaussian initialization. At the kk-th iteration, a minibatch B(k)\mathcal{B}^{(k)} of training examples with batch size BB is sampled from the training set, and the stochastic gradient is calculated as follows:

The update rule for stochastic gradient descent is then defined as follows:

4 Preliminaries

Here we briefly introduce some useful notations and provide some basic calculations regarding the neural network under our setting.

Output after the ll-th layer: Given an input xi\mathbf{x}_{i}, the output of the neural network after the ll-th layer is

Output of the neural network: The output of the neural network with input xi\mathbf{x}_{i} is as follows:

where we define x0,i=xi\mathbf{x}_{0,i}=\mathbf{x}_{i} and the last equality holds for any l1l\geq 1.

Gradient of the neural network: The partial gradient of the training loss LS(W)L_{S}(\mathbf{W}) with respect to Wl\mathbf{W}_{l} is as follows:

Main Theory

In this section, we show that with random Gaussian initialization, over-parameterization helps gradient based algorithms, including gradient descent and stochastic gradient descent, converge to the global minimum, i.e., find some points with arbitrary small training loss.

We provide the following theorem which characterizes the required numbers of hidden nodes and iterations such that the gradient descent can attain the global minimum of the empirical training loss function.

Suppose W1(0),,WL(0)\mathbf{W}_{1}^{(0)},\dots,\mathbf{W}_{L}^{(0)} are generated by Gaussian initialization. Then under Assumptions 3.1-3.6, if set the step size η=O(n3L9m1)\eta=O(n^{-3}L^{-9}m^{-1}), the number of hidden nodes per layer

then with high probability, gradient descent can find a point W(K)\mathbf{W}^{(K)} such that LS(W(K))ϵL_{S}(\mathbf{W}^{(K)})\leq\epsilon.

Theorem 4.1 suggests that the required number of hidden nodes and the number of iterations are both polynomial in the number of training examples nn, and the separation parameter ϕ\phi. This is consistent with the recent work on the global convergence in training neural networks (Li and Yuan, 2017; Du et al., 2018b; Allen-Zhu et al., 2018c; Du et al., 2018a; Allen-Zhu et al., 2018b). Moreover, we prove that the dependence on the number of hidden layers LL is also polynomial, which is similar to Allen-Zhu et al. (2018b) and strictly better than Du et al. (2018a), where the dependence on LL is proved to be eΩ(L)e^{\Omega(L)}. Regarding different loss functions (depending on α0\alpha_{0} and pp according to Assumption 3.2), the dependence in ϵ\epsilon ranges from O~(ϵ1)\widetilde{O}(\epsilon^{-1}) to O~(1)\widetilde{O}(1).

Based on the results in Theorem 4.1, we are able to characterize the required number of hidden nodes per layer that gradient descent can find a point with zero training error in the following corollary.

2 Stochastic Gradient Descent

where p1p\leq 1 is a positive constant and ρ0/α0=O(1)\rho_{0}/\alpha_{0}=O(1).

Suppose W1(0),,WL(0)\mathbf{W}_{1}^{(0)},\dots,\mathbf{W}_{L}^{(0)} are generated by Gaussian random. Then under Assumptions 3.1-3.6 and 4.4, if the step size η=O(n3L9m1)\eta=O(n^{-3}L^{-9}m^{-1}), the number of hidden nodes per layer satisfies

then with high probability, stochastic gradient descent can find a point W(K)\mathbf{W}^{(K)} such that LS(W(K))ϵL_{S}(\mathbf{W}^{(K)})\leq\epsilon.

Similar to gradient descent, the following corollary characterizes the required number of hidden nodes per layer that stochastic gradient descent can achieve zero training error.

Theorem 4.5 suggests that, to find the global minimum, both the required number of hidden nodes and the number of iterations for stochastic gradient descent are also polynomial in nn, ϕ\phi and LL, which matches the result in Allen-Zhu et al. (2018b) for the regression problem. In addition, as it cannot be directly observed in Corollaries 4.3 and 4.6, we remark here that compared with gradient descent, the required numbers of hidden nodes and iterations of stochastic gradient to achieve zero training error is worse by a factor ranging from O(n2)O(n^{2}) to O(n4)O(n^{4}). The detailed comparison can be found in the proofs of Theorems 4.1 and 4.5.

Proof of the Main Theory

In this section, we provide the proof of the main theory, including Theorems 4.1 and 4.5. Our proofs for these two theorems can be decomposed into the following five steps:

We prove the basic properties for Gaussian random matrices {Wl}l,,L\{\mathbf{W}_{l}\}_{l,\dots,L} in Theorem 5.1, which constitutes a basic structure of the neural network after Gaussian random initialization.

Based on Theorem 5.1, we analyze the effect of 2\|\cdot\|_{2}-perturbations on Gaussian initialized weight matrices within a perturbation region with radius τ\tau, and show that the neural network enjoys good local curvature properties in Theorem 5.3.

Based on the assumption that all iterates are within the perturbation region centering at W(0)\mathbf{W}^{(0)} with radius τ\tau, we establish the convergence results in Lemmas 5.4 and 5.6, and derive conditions on the product of iteration number kk and step size η\eta that guarantees convergence.

We show that as long as the product of iteration number kk and step size η\eta is smaller than some quantity TT, (stochastic) gradient descent with kk iterations remains in the perturbation region centering around the Gaussian initialization W(0)\mathbf{W}^{(0)}, which justifies the application of Theorem 5.3 to the iterates of (stochastic) gradient descent.

We finalize the proof by ensuring that (stochastic) gradient descent converges before kηk\eta exceeds TT by setting on the number of hidden nodes in each layer mm to be large enough.

The following theorem summarizes some high probability results of neural networks with Gaussian random initialization, which is pivotal to establish the subsequent theoretical analyses.

Suppose that W1,,WL\mathbf{W}_{1},\ldots,\mathbf{W}_{L} are generated by Gaussian initialization. Then under Assumptions 3.4, 3.5 and 3.6, there exist absolute constants C,C,C,C,C>0\overline{C},\overline{C}^{\prime},\underline{C},\underline{C}^{\prime},\underline{C}^{\prime\prime}>0 such that for any δ>0\delta>0, β>0\beta>0 and positive integer ss, as long as

and ϕCmin{L1,μ}\phi\leq\underline{C}\min\{L^{-1},\mu\}, with probability at least 1δ1-\delta, all the following results hold:

\big{|}\|\mathbf{x}_{l,i}\|_{2}-1\big{|}\leq\overline{C}^{\prime}L\sqrt{\log(nL/\delta)/m},\|\mathbf{W}_{l}\|_{2}\leq\overline{C}^{\prime} for all l=1,Ll=1\ldots,L and i=1,,ni=1,\ldots,n.

\big{\|}\|\mathbf{x}_{l,i}\|_{2}^{-1}\mathbf{x}_{l,i}-\|\mathbf{x}_{l,i^{\prime}}\|_{2}^{-1}\mathbf{x}_{l,i^{\prime}}\big{\|}_{2}\geq\phi/2 for all l=1,,Ll=1,\ldots,L and i,i{1,,n}i,i^{\prime}\in\{1,\ldots,n\} such that yiyiy_{i}\neq y_{i^{\prime}}.

y^iClog(n/δ)|\widehat{y}_{i}|\leq\overline{C}^{\prime}\sqrt{\log(n/\delta)} for all i=1,,ni=1,\ldots,n.

\big{|}\{j\in[m_{l}]:|\langle\mathbf{w}_{l,j},\mathbf{x}_{l-1,i}\rangle|\leq\beta\}\big{|}\leq 2m_{l}^{3/2}\beta for all l=1,,Ll=1,\ldots,L and i=1,,ni=1,\ldots,n.

\big{\|}\mathbf{W}_{l_{2}}^{\top}\big{(}\prod_{r=l_{1}}^{l_{2}-1}\bm{\Sigma}_{r,i}\mathbf{W}_{r}^{\top}\big{)}\big{\|}_{2}\leq\overline{C}^{\prime}L for all 1l1<l2L1\leq l_{1}<l_{2}\leq L and i=1,,ni=1,\ldots,n.

\mathbf{v}^{\top}\big{(}\prod_{r=l}^{L}\bm{\Sigma}_{r,i}\mathbf{W}_{r}^{\top}\big{)}\mathbf{a}\leq\overline{C}^{\prime}L\sqrt{s\log(M)} for all l=1,,Ll=1,\ldots,L, i=1,,ni=1,\ldots,n and all aSml11\mathbf{a}\in S^{m_{l-1}-1} with a0s\|\mathbf{a}\|_{0}\leq s .

Theorem 5.1 summarizes all the properties we need for Gaussian initialization. In the sequel, we always assume that results (i)-(viii) hold for the Gaussian initialization. The parameters β\beta and ss in Theorem 5.1 are introduced to characterize the activation pattern of the ReLU activation functions in each layer. Their values that directly help the final convergence proof is derived during the proof of Theorem 5.3 as β=O(L4/3τ2/3m1/2)\beta=O(L^{4/3}\tau^{2/3}m^{-1/2}) and s=O(L4/3τ2/3m)s=O(L^{4/3}\tau^{2/3}m), where τ=O~(n9L17ϕ3)\tau=\widetilde{O}(n^{-9}L^{-17}\phi^{3}) is the perturbation level. Therefore, the condition on the rate of m1,,mLm_{1},\ldots,m_{L} given by (5.1) is satisfied under the final assumptions on mm given in Theorem 4.1 and Theorem 4.5.

We perform 2\|\cdot\|_{2}-perturbation on the collection of random matrices {Wl}l=1,,L\{\mathbf{W}_{l}\}_{l=1,\dots,L} with perturbation level τ\tau, which formulates a perturbation region centering at {Wl}l=1,,L\{\mathbf{W}_{l}\}_{l=1,\dots,L} with radius τ\tau. Let W~={W~1,,W~L}\widetilde{\mathbf{W}}=\{\widetilde{\mathbf{W}}_{1},\ldots,\widetilde{\mathbf{W}}_{L}\} and W^={W^1,,W^L}\widehat{\mathbf{W}}=\{\widehat{\mathbf{W}}_{1},\ldots,\widehat{\mathbf{W}}_{L}\} be two collections of weight matrices. For l=1,,Ll=1,\ldots,L, denote x~l,i\widetilde{\mathbf{x}}_{l,i}, x^l,i\widehat{\mathbf{x}}_{l,i} be the output of the ll-th hidden layer of the ReLU network with input xi\mathbf{x}_{i} and weight matrices W~\widetilde{\mathbf{W}} and W^\widehat{\mathbf{W}} respectively. Define x~0,i=x~0,i=xi\widetilde{\mathbf{x}}_{0,i}=\widetilde{\mathbf{x}}_{0,i}=\mathbf{x}_{i}, and

for all l=1,,Ll=1,\ldots,L. We summarize their properties in the following theorem.

Suppose that W1,,WL\mathbf{W}_{1},\ldots,\mathbf{W}_{L} are generated via Gaussian initialization, and all results (i)-(viii) in Theorem 5.1 hold. Let {W~l}l=1,,L\{\widetilde{\mathbf{W}}_{l}\}_{l=1,\dots,L}, {W^l}l=1,,L\{\widehat{\mathbf{W}}_{l}\}_{l=1,\dots,L} be perturbed weight matrices satisfying W~lWl2,W^lWl2τ\|\widetilde{\mathbf{W}}_{l}-\mathbf{W}_{l}\|_{2},\|\widehat{\mathbf{W}}_{l}-\mathbf{W}_{l}\|_{2}\leq\tau, l=1,,Ll=1,\ldots,L. Then under Assumptions 3.5 and 3.6, there exist absolute constants C,C,C,C\overline{C},\overline{C}^{\prime},\underline{C},\underline{C}^{\prime}, such that as long as τC{[L11(log(M))3/2](ϕ3/2n3L2)}\tau\leq\underline{C}\{[L^{-11}(\log(M))^{-3/2}]\land(\phi^{3/2}n^{-3}L^{-2})\}, the following results hold:

W~l2C\|\widetilde{\mathbf{W}}_{l}\|_{2}\leq\overline{C} for all l[L]l\in[L].

x^l,ix~l,i2CLr=1lW^rW~r2\|\widehat{\mathbf{x}}_{l,i}-\widetilde{\mathbf{x}}_{l,i}\|_{2}\leq\overline{C}L\cdot\sum_{r=1}^{l}\|\widehat{\mathbf{W}}_{r}-\widetilde{\mathbf{W}}_{r}\|_{2} for all l[L]l\in[L] and i[n]i\in[n].

Σ^l,iΣ~l,i0CL4/3τ2/3ml\|\widehat{\bm{\Sigma}}_{l,i}-\widetilde{\bm{\Sigma}}_{l,i}\|_{0}\leq\overline{C}L^{4/3}\tau^{2/3}m_{l} for all l[L]l\in[L] and i[n]i\in[n].

\big{|}\{j\in[m_{L}]:\text{there exists }i\in[n]\mbox{ such that }(\widetilde{\bm{\Sigma}}_{L,i}-\bm{\Sigma}_{L,i})_{jj}\neq 0\}\big{|}\leq\overline{C}nL^{4/3}\tau^{2/3}m_{L}.

\big{\|}\prod_{r=l_{1}}^{l_{2}}\widetilde{\bm{\Sigma}}_{r,i}\widetilde{\mathbf{W}}_{r}^{\top}\big{\|}_{2}\leq\overline{C}L for all 1l1<l2L1\leq l_{1}<l_{2}\leq L.

The squared Frobenius norm of the partial gradient with respect to the weight matrix in the last hidden layer has the following lower bound:

where y~i=fW~(xi)\widetilde{y}_{i}=f_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i}).

The spectral norms of gradients and stochastic gradients at each layer have the following upper bounds:

where y~i=fW~(xi)\widetilde{y}_{i}=f_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i}), B=BB=|\mathcal{B}| denotes the minibatch size in SGD.

The gradient lower bound provided in (vii) implies that within the perturbation region, the empirical loss function of deep neural network enjoys good local curvature properties, which play an essential role in the convergence proof of (stochastic) gradient descent. The gradient upper bound in (viii) quantifies how much the weight matrices of the neural network would change during (stochastic) gradient descent, which is utilized to guarantee that the weight matrices won’t escape from the perturbation region during the training process.

We organize our proof as the following three steps: (1) we first assume that during gradient descent, each iterate is in the preset perturbation region centering at W(0)\mathbf{W}^{(0)} with radius τ\tau, and use the results in Theorem 5.3 to establish the convergence guarantee; (2) we prove the upper bound of the number of iteration such that the distance between the iterate and the initial point does not exceed τ\tau; (3) we compute the minimum number of hidden nodes such that gradient descent achieves the target accuracy before exceeding the upper bound derived in step (2).

For step (1), the following lemma provides the convergence guarantee of gradient descent while assuming all iterates are in the preset perturbation region, i.e., Wl(k)Wl(0)2τ\|\mathbf{W}_{l}^{(k)}-\mathbf{W}_{l}^{(0)}\|_{2}\leq\tau for all l=1,,Ll=1,\ldots,L and k=1,,Kk=1,\ldots,K.

Suppose that W1(0),,WL(0)\mathbf{W}_{1}^{(0)},\dots,\mathbf{W}_{L}^{(0)} are generated via Gaussian initialization, and all results (i)-(viii) in Theorem 5.1 hold. Under Assumptions 3.1-3.6, if Wl(k)Wl(0)2τ\|\mathbf{W}_{l}^{(k)}-\mathbf{W}_{l}^{(0)}\|_{2}\leq\tau for all l=1,,Ll=1,\ldots,L and k=1,,Kk=1,\ldots,K with perturbation level τ=O(n9L17ϕ3)\tau=O(n^{-9}L^{-17}\phi^{3}), the step size η=O(n3L9m1ϕ)\eta=O(n^{-3}L^{-9}m^{-1}\phi) and

when α0<\alpha_{0}<\infty, then gradient descent is able to find a point W(K)\mathbf{W}^{(K)} such that LS(W(K))ϵL_{S}(\mathbf{W}^{(K)})\leq\epsilon.

The following lemma provides the upper bound of the iteration number such that the distance between the iterate and the initial point does not exceed the perturbation radius τ\tau.

Suppose that W1(0),,WL(0)\mathbf{W}_{1}^{(0)},\dots,\mathbf{W}_{L}^{(0)} are generated by Gaussian initialization, and all results (i)-(viii) in Theorem 5.1 hold. Then there exist a constant T=O(L4n3τ2ϕ)T=O(L^{-4}n^{-3}\tau^{2}\phi) such for all iteration number kk and step size η\eta satisfying kηTk\eta\leq T, it holds that Wl(k)Wl(0)2τ\|\mathbf{W}_{l}^{(k)}-\mathbf{W}_{l}^{(0)}\|_{2}\leq\tau for all l[L]l\in[L].

Now we are ready to prove Theorem 4.1 based on Lemmas 5.4 and 5.5.

The proof is straightforward. By Lemma 5.5, it suffices to show that the lower bound of KηK\eta derived in Lemma 5.4 is smaller than T=O(L4n3τ2ϕ)=O(L38n21ϕ7)T=O(L^{-4}n^{-3}\tau^{2}\phi)=O(L^{-38}n^{-21}\phi^{7}), where we plug in the assumption that τ=O(n9L17ϕ3)\tau=O(n^{-9}L^{-17}\phi^{3}). Therefore, we can derive the following lower bound on the number of hidden nodes per layer, i.e., mm:

Moreover, the required number of iterations, i.e., KK can be directly derived by combining the results of KηK\eta in Lemma 5.4 and the choice of the step size η=O(n3L9m1ϕ)\eta=O(n^{-3}L^{-9}m^{-1}\phi), thus we omit the detail here. ∎

2 Proof of Theorem 4.5

Similar to the proof for gradient descent, we first deliver the following lemma which characterizes the convergence of stochastic gradient descent for the training of ReLU network under the assumption that all iterates are in the preset perturbation region.

Suppose that W1(0),,WL(0)\mathbf{W}_{1}^{(0)},\dots,\mathbf{W}_{L}^{(0)} are generated by Gaussian initialization, and all results (i)-(viii) in Theorem 5.1 hold. Under Assumptions 3.1-3.6 and 4.4, if Wl(k)Wl(0)2τ\|\mathbf{W}_{l}^{(k)}-\mathbf{W}_{l}^{(0)}\|_{2}\leq\tau for all l=1,,Ll=1,\ldots,L and k=1,,Kk=1,\ldots,K with perturbation level τ=O(n9L17ϕ3)\tau=O(n^{-9}L^{-17}\phi^{3}), the step size η=O(Bn4L9m1ϕ)\eta=O(Bn^{-4}L^{-9}m^{-1}\phi), and

when α0<\alpha_{0}<\infty, then stochastic gradient descent is able to find a point W(K)\mathbf{W}^{(K)} such that LS(W(K))ϵL_{S}(\mathbf{W}^{(K)})\leq\epsilon.

The following lemma provides the upper bound of the iteration number such that the distance between the iterate and the initial point does not exceed the perturbation radius τ\tau.

Suppose that W1(0),,WL(0)\mathbf{W}_{1}^{(0)},\dots,\mathbf{W}_{L}^{(0)} are generated by Gaussian initialization, and all results (i)-(viii) in Theorem 5.1 hold. If the step size η=O(ϕm1n2p7L8B2)\eta=O(\phi m^{-1}n^{2p-7}L^{-8}B^{2}), then there exists TT with rate T=O~(L2n1Bm1/2τ)T=\widetilde{O}(L^{-2}n^{-1}Bm^{-1/2}\tau) when α0=\alpha_{0}=\infty and T=O(L2m1/2τ)T=O(L^{-2}m^{-1/2}\tau) when α<\alpha<\infty such that for all iteration number kk satisfying kηTk\eta\leq T, with high probability it holds that Wl(k)Wl(0)2τ\|\mathbf{W}_{l}^{(k)}-\mathbf{W}_{l}^{(0)}\|_{2}\leq\tau for all l[L]l\in[L].

Similar to the proof of Theorem 4.1, the minimum required number of hidden nodes per layer can be derived by setting the lower bound of KηK\eta in Lemma 5.6 to be smaller than T=O~(L2n1m1/2τ)=O~(n10L19m1/2ϕ3)T=\widetilde{O}(L^{-2}n^{-1}m^{-1/2}\tau)=\widetilde{O}(n^{-10}L^{-19}m^{-1/2}\phi^{3}) when α0=\alpha_{0}=\infty or T=O(L2m1/2τ)=O(n9L19m1/2ϕ3)T=O(L^{-2}m^{-1/2}\tau)=O(n^{-9}L^{-19}m^{-1/2}\phi^{3}) when α0<\alpha_{0}<\infty. Therefore the minimum required number of hidden nodes per layer satisfies

We now proceed to derive the required iteration numbers. By Lemmas 5.6 and 5.7, we can set the step size to be η=O(ϕm1n2p7L9B2)\eta=O(\phi m^{-1}n^{2p-7}L^{-9}B^{2}). Then using the bound of KηK\eta derived in Lemma 5.6, we have

Conclusions and Future Work

In this paper, we studied training deep neural networks by gradient descent and stochastic gradient descent. We proved that both gradient descent and stochastic gradient descent can achieve global minima of over-parameterized deep ReLU networks with random initialization, for a general class of loss functions, with only mild assumption on training data. Our theory sheds light on understanding why stochastic gradient descent can train deep neural networks very well in practice, and paves the way to study the optimization dynamics of training more sophisticated deep neural networks.

In the future, we will sharpen the polynomial dependence of our results on those problem-specific parameters.

Acknowledgment

We would like to thank Spencer Frei for helpful comments on the first version of this paper.

Appendix A Proof of Theorem 5.1

In this section we provide the proof of Theorem 5.1. The bound for Wl2\|\mathbf{W}_{l}\|_{2} given by result (i) in Theorem 5.1 follows from standard results for Gaussian random matrices with independent entries (See Corollary 5.35 in Vershynin (2010)). We split the rest results into several lemmas and prove them separately.

We first give the bound for the norms of the outputs of each layer. Intuitively, since the columns of Wl\mathbf{W}_{l} are sampled independently from N(0,2/mlI)N(\mathbf{0},2/m_{l}\mathbf{I}), given the output of the previous layer xl1,i\mathbf{x}_{l-1,i}, the expectation of Wlxl1,i22\|\mathbf{W}_{l}^{\top}\mathbf{x}_{l-1,i}\|_{2}^{2} is 2xl1,i222\|\mathbf{x}_{l-1,i}\|_{2}^{2}. Moreover, the ReLU activation function truncates roughly half of the entries of Wlxl1,i\mathbf{W}_{l}^{\top}\mathbf{x}_{l-1,i} to zero, and therefore xl,i22\|\mathbf{x}_{l,i}\|_{2}^{2} should be approximately equal to xl1,i22\|\mathbf{x}_{l-1,i}\|_{2}^{2}. This leads to Lemma A.1 and Corollary A.2.

Denote by wl,j\mathbf{w}_{l,j} the jj-th column of Wl\mathbf{W}_{l}. Suppose that for any l=1,,Ll=1,\ldots,L, wl,1,,wl,ml\mathbf{w}_{l,1},\ldots,\mathbf{w}_{l,m_{l}} are generated independently from N(0,2/mlI)N(\mathbf{0},2/m_{l}\mathbf{I}). Then there exists an absolute constant CC such that for any δ>0\delta>0, as long as mlC2log(nL/δ)m_{l}\geq C^{2}\log(nL/\delta), with probability at least 1δ1-\delta,

for all i=1,,ni=1,\ldots,n and l=1,,Ll=1,\ldots,L.

For any fixed i{1,,n}i\in\{1,\ldots,n\}, l{1,,L}l\in\{1,\ldots,L\} and j{1,,ml}j\in\{1,\ldots,m_{l}\}, condition on xl1,i\mathbf{x}_{l-1,i} we have wl,jxl1,iN(0,2xl1,i22/ml)\mathbf{w}_{l,j}^{\top}\mathbf{x}_{l-1,i}\sim N(\mathbf{0},2\|\mathbf{x}_{l-1,i}\|_{2}^{2}/m_{l}). Therefore,

Since xl,i22=j=1mlσ2(wl,jxl1,i)\|\mathbf{x}_{l,i}\|_{2}^{2}=\sum_{j=1}^{m_{l}}\sigma^{2}(\mathbf{w}_{l,j}^{\top}\mathbf{x}_{l-1,i}) and condition on xl1\mathbf{x}_{l-1}, σ2(wl,jxl1,i)ψ1C1xl1,i22/ml\|\sigma^{2}(\mathbf{w}_{l,j}^{\top}\mathbf{x}_{l-1,i})\|_{\psi_{1}}\leq C_{1}\|\mathbf{x}_{l-1,i}\|_{2}^{2}/m_{l} for some absolute constant C1C_{1}, by Bernstein inequality (See Proposition 5.16 in Vershynin (2010)), for any ξ0\xi\geq 0 we have

Taking union bound over ll and ii gives

The inequality above further implies that if mlC32log(nL/δ)m_{l}\geq C_{3}^{2}\log(nL/\delta), then with probability at least 1δ1-\delta, we have

for any i=1,,ni=1,\ldots,n and l=1,,Ll=1,\ldots,L, where C3C_{3} is an absolute constant. This completes the proof. ∎

Under the same conditions as Lemma A.1, with probability at least 1δ1-\delta,

where m=min{m1,,mL}m=\min\{m_{1},\ldots,m_{L}\}, and CC is an absolute constant.

The result directly follows by Lemma A.1 and induction. ∎

By Corollary A.2, we can see that the norms of the inputs are roughly preserved after passing through layers in the ReLU neural network with properly scaled Gaussian random weights. We now proceed to show (ii) in Theorem 5.1, which states that the inner product of any two samples, although may not be preserved throughout layers, also share a common upper bound based on the Assumption 3.5. The detailed results are given in Lemmas A.3, A.4 and Corollary A.5 below.

For I1I_{1}, it follows by direct calculation that I1=(1α2)3/2/(2π)=O(θ3)I_{1}=(1-\alpha^{2})^{3/2}/(2\pi)=O(\theta^{3}). For I2I_{2}, integration by parts gives

For arccos\arccos function we have the following expansion:

Therefore when α=1θ2/2\alpha=1-\theta^{2}/2 for small enough constant θ\theta, we have

Combining the obtained rates with (A.1) completes the proof. ∎

Suppose that wl,1,,wl,ml\mathbf{w}_{l,1},\ldots,\mathbf{w}_{l,m_{l}} are generated independently from N(0,2/mlI)N(\mathbf{0},2/m_{l}\mathbf{I}). Let xl1,i=xl1,i/xl1,i2\overline{\mathbf{x}}_{l-1,i}=\mathbf{x}_{l-1,i}/\|\mathbf{x}_{l-1,i}\|_{2}, i=1,,ni=1,\ldots,n. For any fixed i,i{1,,n}i,i^{\prime}\in\{1,\ldots,n\}, if ϕκL1\phi\leq\kappa L^{-1} for some small enough absolute constant κ\kappa, then for any δ>0\delta>0, if m=min{m1,,mL}CL4ϕ4log(4n2L/δ)m=\min\{m_{1},\ldots,m_{L}\}\geq CL^{4}\phi^{-4}\log(4n^{2}L/\delta) for some large enough absolute constant CC, with probability at least 1δ1-\delta,

for all i,j=1,,ni,j=1,\ldots,n, l=1,,Ll=1,\ldots,L.

We first consider any fixed l1l\geq 1. Suppose that xl1,ixl1,i2[1(2L)1log(2)]l1ϕ\|\overline{\mathbf{x}}_{l-1,i}-\overline{\mathbf{x}}_{l-1,i^{\prime}}\|_{2}\geq[1-(2L)^{-1}\log(2)]^{l-1}\phi. If we can show that under this condition, with high probability

then the result of the lemma follows by union bound and induction. Denote

Then by assumption we have xl1,ixl1,i22ϕl12\|\overline{\mathbf{x}}_{l-1,i}-\overline{\mathbf{x}}_{l-1,i^{\prime}}\|_{2}^{2}\geq\phi_{l-1}^{2}. Therefore xl1,ixl1,i1ϕl12/2\overline{\mathbf{x}}_{l-1,i}^{\top}\overline{\mathbf{x}}_{l-1,i^{\prime}}\leq 1-\phi_{l-1}^{2}/2. It follows by direct calculation that

By Lemma A.3 and the assumption that ϕl1ϕκ\phi_{l-1}\leq\phi\leq\kappa, we have

Condition on xl1,i\mathbf{x}_{l-1,i} and xl1,i\mathbf{x}_{l-1,i^{\prime}}, by Lemma 5.14 in Vershynin (2010) we have

where C1C_{1} is an absolute constant. Therefore if mlC22log(4n2L/δ)m_{l}\geq C_{2}^{2}\log(4n^{2}L/\delta), similar to the proof of Lemma A.1, by Bernstein inequality and union bound, with probability at least 1δ/(4n2L)1-\delta/(4n^{2}L) we have

where C2C_{2} is an absolute constant. Therefore with probability at least 1δ/(4n2L)1-\delta/(4n^{2}L) we have

By union bound and Lemma A.1, if mrC3L4ϕl4log(4n2L/δ)m_{r}\geq C_{3}L^{4}\phi_{l}^{-4}\log(4n^{2}L/\delta), r=1,,lr=1,\ldots,l for some large enough absolute constant C3C_{3} and ϕκL1\phi\leq\kappa L^{-1} for some small enough absolute constant κ\kappa, then with probability at least 1δ/(2n2L)1-\delta/(2n^{2}L) we have

Moreover, by Lemma A.1, with probability at least 1δ/(2n2L)1-\delta/(2n^{2}L) we have

and therefore with probability at least 1δ/(n2L)1-\delta/(n^{2}L), we have

Applying union bound and induction over l=1,,Ll=1,\ldots,L completes the proof. ∎

Under the same conditions in Lemma A.4, with probability at least 1δ1-\delta,

for all i,i=1,,ni,i^{\prime}=1,\ldots,n, l=1,,Ll=1,\ldots,L.

It directly follows by Lemma A.4 and the fact that 1x/2ex1-x/2\geq e^{-x} for x[0,log(2)]x\in[0,\log(2)]. ∎

The following lemma gives the bound of y^i\widehat{y}_{i}. It relies on the fact that half of the entries of v\mathbf{v} are 11’s and the other half are 1-1’s.

Suppose that {Wl}l=1L\{\mathbf{W}_{l}\}_{l=1}^{L} are generated via Gaussian initialization. Then for any δ>0\delta>0, with probability at least 1δ1-\delta, it holds that

for all i=1,,ni=1,\ldots,n, where CC is an absolute constant.

Apparently, we have σ(wL,jxL1,i)σ(wL,j+mL/2xL1,i)ψ2C1mL1/2\|\sigma(\mathbf{w}_{L,j}^{\top}\mathbf{x}_{L-1,i})-\sigma(\mathbf{w}_{L,j+m_{L}/2}^{\top}\mathbf{x}_{L-1,i})\|_{\psi_{2}}\leq C_{1}m_{L}^{-1/2} for some absolute constant C1C_{1}. Therefore by Hoeffding’s inequality and Corollary A.2, with probability at least 1δ1-\delta, it holds that

We now prove (iv) in Theorem 5.1, which characterizes the activation pattern of ReLU networks with a parameter β\beta. We summarize the result in Lemma A.7.

For any δ>0\delta>0, if mlCmax{[β1log(nL/δ)]2/3,L2log(nL/δ)}m_{l}\geq C\max\{[\beta^{-1}\log(nL/\delta)]^{2/3},L^{2}\log(nL/\delta)\} for some large enough constant C>0C>0, then with probability at least 1δ1-\delta, Sl,i(β)2ml3/2β|{\mathcal{S}}_{l,i}(\beta)|\leq 2m_{l}^{3/2}\beta for all l=1,,Ll=1,\ldots,L and i=1,,ni=1,\ldots,n.

For fixed l{1,,L}l\in\{1,\ldots,L\} and i{1,,n}i\in\{1,\ldots,n\}, define Zj=\mathds1{wl,j,xl1,iβ}Z_{j}=\operatorname{\mathds{1}}\{|\langle\mathbf{w}_{l,j},\mathbf{x}_{l-1,i}\rangle|\leq\beta\}. Note that by Corollary A.2, with probability at least 1exp[Ω(m/L2)]1-\exp[-\Omega(m/L^{2})] we have xl1,i21/2\|\mathbf{x}_{l-1,i}\|_{2}\geq 1/2. Condition on xl1,i\mathbf{x}_{l-1,i}, we have

Then by Bernstein inequality, with probability at least 1exp(Ω(ml3/2β))1-\exp(-\Omega(m_{l}^{3/2}\beta)),

Applying union bound over l=1,,Ll=1,\ldots,L, i=1,,ni=1,\ldots,n and using the assumption that ml3/2βClog(nL/δ)m_{l}^{3/2}\beta\geq C\log(nL/\delta) completes the proof. ∎

We now prove (v)-(vii) in Theorem 5.1. These results together characterize the Lipschitz continuity of the gradients. To show these results, we utilize similar proof technique we used in Lemma A.1 and combine the resulting bound with covering number arguments. The details are given in Lemmas A.8, A.9 and A.10.

For any i=1,,ni=1,\ldots,n and 1l1<l2L1\leq l_{1}<l_{2}\leq L, define

If mClog(nL2/δ)m\geq C\log(nL^{2}/\delta) for some absolute constant CC, then with probability at least 1δ1-\delta we have

for all i=1,,ni=1,\ldots,n and 1l1<l2L1\leq l_{1}<l_{2}\leq L, where CC^{\prime} is an absolute constant.

Denote l0=l11l_{0}=l_{1}-1, δ=δ/(nL2)\delta^{\prime}=\delta/(nL^{2}) and

Let N1=N[Sml01,1/4]\mathcal{N}_{1}=\mathcal{N}[S^{m_{l_{0}}-1},1/4], N2=N[Sml21,1/4]\mathcal{N}_{2}=\mathcal{N}[S^{m_{l_{2}}-1},1/4] be 1/41/4-nets covering Sml01S^{m_{l_{0}}-1} and Sml21S^{m_{l_{2}}-1} respectively. Then by Lemma 5.2 in Vershynin (2010), we have

where α ⁣/ ⁣/ ⁣=αxl1,i/xl1,i22xl1,i\bm{\alpha}_{\mathbin{\!/\mkern-5.0mu/\!}}=\bm{\alpha}^{\top}\mathbf{x}_{l-1,i}/\|\mathbf{x}_{l-1,i}\|_{2}^{2}\cdot\mathbf{x}_{l-1,i} and α=αα ⁣/ ⁣/ ⁣\bm{\alpha}_{\perp}=\bm{\alpha}-\bm{\alpha}_{\!/\mkern-5.0mu/\!}. Therefore, similar to the proof of Lemma A.1 and Corollary A.2, with probability at least 1δ/21-\delta^{\prime}/2 we have

for all a^N1\widehat{\mathbf{a}}\in\mathcal{N}_{1}, where C0C_{0} is an absolute constant. Therefore by Assumption 3.6 and the assumption that mClog(1/δ)m\geq C\log(1/\delta^{\prime}) for some absolute constant CC, with probability at least 1δ1-\delta^{\prime}, we have

where C4C_{4} is an absolute constant. Applying union bound over all i=1,,ni=1,\ldots,n and all 1l1<l2L1\leq l_{1}<l_{2}\leq L completes the proof. ∎

For any i=1,,ni=1,\ldots,n and 1lL1\leq l\leq L, If sClog(nL/δ)s\geq C\log(nL/\delta) for some absolute constant CC, then with probability at least 1δ1-\delta, we have

for all i=1,,ni=1,\ldots,n and l=1,,Ll=1,\ldots,L, where CC^{\prime} is an absolute constant.

then similar to the proof of Lemma A.8, with probability at least 1δ1-\delta^{\prime},

for all a^N\widehat{\mathbf{a}}\in\mathcal{N}, where C1,C2C_{1},C_{2} are absolute constants. For any aSml01\mathbf{a}\in S^{m_{l_{0}}-1} with at most ss non-zero entries, there exists a^N\widehat{\mathbf{a}}\in\mathcal{N} such that aa^21/2\|\mathbf{a}-\widehat{\mathbf{a}}\|_{2}\leq 1/2. Therefore, we have

Since g(a)g(\mathbf{a}) is linear and v2=mL\|\mathbf{v}\|_{2}=\sqrt{m_{L}}, we have

where C3C_{3} is an absolute constant. Applying union bound over all i=1,,ni=1,\ldots,n and all 1lL1\leq l\leq L completes the proof. ∎

For any i=1,,ni=1,\ldots,n and 1l1<l2L1\leq l_{1}<l_{2}\leq L, define

If sClog(nL2/δ)s\geq C\log(nL^{2}/\delta) for some absolute constant CC, then with probability at least 1δ1-\delta we have

for all i=1,,ni=1,\ldots,n and 1l1<l2L1\leq l_{1}<l_{2}\leq L, where CC^{\prime} is an absolute constant.

Similar to previous proofs, it can be shown that with probability at least 1δ/21-\delta^{\prime}/2,

for all a^N1\widehat{a}\in\mathcal{N}_{1}, where C0C_{0} is an absolute constant. Therefore with probability at least 1δ1-\delta^{\prime}, we have

where C4C_{4} is an absolute constant. Applying union bound over all i=1,,ni=1,\ldots,n and all 1l1<l2L1\leq l_{1}<l_{2}\leq L completes the proof. ∎

We now prove (viii) in Theorem 5.1. In order to prove this result, we first show that the inner products between normalized hidden layer outputs are lower bounded by a constant related to μ\mu.

The proof follows by direct calculation. By definition, we have

Suppose that W1,,WL\mathbf{W}_{1},\ldots,\mathbf{W}_{L} are generated via Gaussian initialization. Let xl,i=xl,i/xl,i2\overline{\mathbf{x}}_{l,i}=\mathbf{x}_{l,i}/\|\mathbf{x}_{l,i}\|_{2}, i=1,,ni=1,\ldots,n. For any δ>0\delta>0, if m=CL4μ4log(n2L/δ)m=\geq CL^{4}\mu^{-4}\log(n^{2}L/\delta) for some large enough absolute constant CC, then with probability at least 1δ1-\delta,

for all i,i{1,,n}i,i^{\prime}\in\{1,\ldots,n\} and all l[L]l\in[L].

We prove the result by induction. Suppose that xl,ixl,iμ2(1(2L)1log(2))l\overline{\mathbf{x}}_{l,i}^{\top}\overline{\mathbf{x}}_{l,i}\geq\mu^{2}(1-(2L)^{-1}\log(2))^{l}. Then by Lemma A.11, we have

Condition on xl1,i\mathbf{x}_{l-1,i} and xl1,i\mathbf{x}_{l-1,i^{\prime}}, we have

where C1,C2C_{1},C_{2} are absolute constants. Note that we have [1(2L)1log(2)]l[1(2L)1log(2)]L1/2[1-(2L)^{-1}\log(2)]^{l}\geq[1-(2L)^{-1}\log(2)]^{L}\geq 1/2. Therefore if mCL4μ4log(n2L/δ)m\geq CL^{4}\mu^{-4}\log(n^{2}L/\delta) for some large enough constant CC, then by Bernstein inequality, union bound and (i), with probability at least 1δ1-\delta we have

Suppose that W1,,WL\mathbf{W}_{1},\ldots,\mathbf{W}_{L} are generated via Gaussian initialization. Let xl,i=xl,i/xl,i2\overline{\mathbf{x}}_{l,i}=\mathbf{x}_{l,i}/\|\mathbf{x}_{l,i}\|_{2}, i=1,,ni=1,\ldots,n. For any δ>0\delta>0, if mCL4μ4log(n2L/δ)m\geq CL^{4}\mu^{-4}\log(n^{2}L/\delta) for some large enough absolute constant CC, then with probability at least 1δ1-\delta,

for all i,i{1,,n}i,i^{\prime}\in\{1,\ldots,n\} and all l[L]l\in[L].

It directly follows by Lemma A.12 and the fact that 1x/2ex1-x/2\geq e^{-x} for x[0,log(2)]x\in[0,\log(2)]. ∎

where u:=(u2,,ud)\mathbf{u}^{\prime}:=(u_{2},\ldots,u_{d})^{\top} is independent of u1u_{1}. We define the following two events based on a parameter γ(0,1]\gamma\in(0,1]:

Moreover, by definition, for any i=1,,ni=1,\ldots,n we have

Let I={i:ziz12ϕ~}\mathcal{I}=\{i:\|\mathbf{z}_{i}-\mathbf{z}_{1}\|_{2}\geq\widetilde{\phi}\}. By the assumption that ϕμ~/2\phi\leq\widetilde{\mu}/2, for any iIi\in\mathcal{I}, we have

and if ϕ~22\widetilde{\phi}^{2}\leq 2, then

By union bound over I\mathcal{I}, we have

where the last equality follows from the fact that conditioning on event E\mathcal{E}, for all iIi\in\mathcal{I}, it holds that Qu,ziu1u1z1,zi|\langle\mathbf{Q}^{\prime}\mathbf{u}^{\prime},\mathbf{z}_{i}\rangle|\geq|u_{1}|\geq|u_{1}\langle\mathbf{z}_{1},\mathbf{z}_{i}\rangle|. We then consider two cases: u1>0u_{1}>0 and u1<0u_{1}<0, which occur equally likely conditioning on E\mathcal{E}. Therefore we have

For any u1(1)>0u_{1}^{(1)}>0 and u1(2)<0u_{1}^{(2)}<0, denote w1=u1(1)z1+Qu\mathbf{w}_{1}=u_{1}^{(1)}\mathbf{z}_{1}+\mathbf{Q}^{\prime}\mathbf{u}^{\prime}, w2=u1(2)z1+Qu\mathbf{w}_{2}=u_{1}^{(2)}\mathbf{z}_{1}+\mathbf{Q}^{\prime}\mathbf{u}^{\prime}. We now proceed to give lower bound for h(w1)h(w2)2\|\mathbf{h}(\mathbf{w}_{1})-\mathbf{h}(\mathbf{w}_{2})\|_{2}. By (A), we have

Note that for all iIi\in\mathcal{I}^{\prime}, we have yi=y1y_{i}=y_{1} and z1,zi1ϕ~2/20\langle\mathbf{z}_{1},\mathbf{z}_{i}\rangle\geq 1-\widetilde{\phi}^{2}/2\geq 0. Therefore, since u1(1)>0>u1(2)u_{1}^{(1)}>0>u_{1}^{(2)}, we have

Therefore ai0a_{i}^{\prime}\geq 0 for all iIi\in\mathcal{I}^{\prime} and

We have shown that zi,z10\langle\mathbf{z}_{i},\mathbf{z}_{1}\rangle\geq 0 for all iIi\in\mathcal{I}^{\prime}. Therefore we have

Since the inequality above holds for any u1(1)>0u_{1}^{(1)}>0 and u1(2)<0u_{1}^{(2)}<0, taking infimum gives

where CC and CC^{\prime} are absolute constants. This completes the proof. ∎

For any given j{1,,mL}j\in\{1,\ldots,m_{L}\} and a^\widehat{\mathbf{a}} with a^=1\|\widehat{\mathbf{a}}\|_{\infty}=1, by Lemma A.14, we have

Let pϕ=C2ϕ/np_{\phi}=C_{2}\phi/n. Then by Bernstein inequality and union bound, with probability at least 1exp[C3mLpϕ+nlog(4n/C1)]1exp(C4mLϕ/n)1-\exp[-C_{3}m_{L}p_{\phi}+n\log(4n/C_{1})]\geq 1-\exp(C_{4}m_{L}\phi/n), we have

where C3,C4C_{3},C_{4} are absolute constants. For any aS,+n1\mathbf{a}\in S_{\infty,+}^{n-1}, there exists a^N\widehat{\mathbf{a}}\in\mathcal{N} such that

By (A.7) and (A.8), it is clear that with probability at least 1exp(C4mLϕ/n)1-\exp(C_{4}m_{L}\phi/n), for any aS,+n1\mathbf{a}\in S_{\infty,+}^{n-1}, there exist at least mLpϕ/2m_{L}p_{\phi}/2 nodes on layer LL that satisfy

Appendix B Proof of Theorem 5.3

It is clear that the bound on W~l\widetilde{\mathbf{W}}_{l} can be proved trivially, since by triangle inequality we have W~2W2+W~W2C\|\widetilde{\mathbf{W}}\|_{2}\leq\|\mathbf{W}\|_{2}+\|\widetilde{\mathbf{W}}-\mathbf{W}\|_{2}\leq C for some large enough absolute constant CC. In the following we directly use this result without referring to Theorem 5.3. Similar to the proof of Theorem 5.1, we split the rest results into the following lemmas and prove them separately.

Suppose that W1,,WL\mathbf{W}_{1},\ldots,\mathbf{W}_{L} are generated via Gaussian initialization, and all results (i)-(viii) in Theorem 5.1 hold. For τ>0\tau>0, let W~1,,W~L\widetilde{\mathbf{W}}_{1},\ldots,\widetilde{\mathbf{W}}_{L} with W~lWl2τ\|\widetilde{\mathbf{W}}_{l}-\mathbf{W}_{l}\|_{2}\leq\tau, l=1,,Ll=1,\ldots,L be the perturbed matrices. Let Σ~l,i\widetilde{\bm{\Sigma}}_{l,i}, l=1,,Ll=1,\ldots,L, i=1,,ni=1,\ldots,n be diagonal matrices satisfying Σ~l,iΣl,i0s\|\widetilde{\bm{\Sigma}}_{l,i}-\bm{\Sigma}_{l,i}\|_{0}\leq s and (Σ~l,iΣl,i)jj,(Σ~l,i)jj1|(\widetilde{\bm{\Sigma}}_{l,i}-\bm{\Sigma}_{l,i})_{jj}|,|(\widetilde{\bm{\Sigma}}_{l,i})_{jj}|\leq 1 for all l=1,,Ll=1,\ldots,L, i=1,,ni=1,\ldots,n and j=1,,mlj=1,\ldots,m_{l}. If τ,slog(M)/mκL3\tau,\sqrt{s\log(M)/m}\leq\kappa L^{-3} for some small enough absolute constant κ\kappa, then

for any 1l1<l2L1\leq l_{1}<l_{2}\leq L, where CC is an absolute constant.

Therefore, let Ar,i={Σr,iWr,(Σ~r,iΣr,i)Wr,Σ~r,i(W~rWr)}\mathcal{A}_{r,i}=\{\bm{\Sigma}_{r,i}\mathbf{W}_{r}^{\top},(\widetilde{\bm{\Sigma}}_{r,i}-\bm{\Sigma}_{r,i})\mathbf{W}_{r}^{\top},\widetilde{\bm{\Sigma}}_{r,i}(\widetilde{\mathbf{W}}_{r}-\mathbf{W}_{r})^{\top}\}, r=l1,,l2r=l_{1},\ldots,l_{2}, then we have

For the rest of the proof, we denote by Σ|\bm{\Sigma}| the diagonal matrix with absolute values of elements of Σ\bm{\Sigma} on the corresponding entries. For each sequence Al1,i,,Al2,i\mathbf{A}_{l_{1},i},\ldots,\mathbf{A}_{l_{2},i}, denote

When Ar,i=Σr,iWr\mathbf{A}_{r,i}=\bm{\Sigma}_{r,i}\mathbf{W}_{r} for all r=l1,,l2r=l_{1},\ldots,l_{2}, then the bound of r=l1l2Ar,i2\|\prod_{r=l_{1}}^{l_{2}}\mathbf{A}_{r,i}\|_{2} is given by (v) in Theorem 5.1. For all the other terms in the expansion, we consider sequences of the form Br2,i(r=r1+1r21Σr,iWr)Br1,i\mathbf{B}_{r_{2},i}(\prod_{r=r_{1}+1}^{r_{2}-1}\bm{\Sigma}_{r,i}\mathbf{W}_{r}^{\top})\mathbf{B}_{r_{1},i}, where

By (vii) in Theorem 5.1, there exists an absolute constant C1C_{1} such that Br2,i(r=r1+1r21Σr,iWr)Br1,i2\|\mathbf{B}_{r_{2},i}(\prod_{r=r_{1}+1}^{r_{2}-1}\bm{\Sigma}_{r,i}\mathbf{W}_{r}^{\top})\mathbf{B}_{r_{1},i}\|_{2} with different choices of Br1,i\mathbf{B}_{r_{1},i} and Br2,i\mathbf{B}_{r_{2},i} have the following bounds:

If Br1,i=Σ~r1,iΣr1,i\mathbf{B}_{r_{1},i}=|\widetilde{\bm{\Sigma}}_{r_{1},i}-\bm{\Sigma}_{r_{1},i}|, Br2,i=(Σ~r2,iΣr2,i)Wr2\mathbf{B}_{r_{2},i}=(\widetilde{\bm{\Sigma}}_{r_{2},i}-\bm{\Sigma}_{r_{2},i})\mathbf{W}_{r_{2}}^{\top}, then Br2,i(r=r1+1r21Σr,iWr)Br1,i2C1Lslog(M)/m\|\mathbf{B}_{r_{2},i}(\prod_{r=r_{1}+1}^{r_{2}-1}\bm{\Sigma}_{r,i}\mathbf{W}_{r}^{\top})\mathbf{B}_{r_{1},i}\|_{2}\leq C_{1}L\sqrt{s\log(M)/m}.

If Br1,i=Σ~r1,i(W~r1Wr1)\mathbf{B}_{r_{1},i}=\widetilde{\bm{\Sigma}}_{r_{1},i}(\widetilde{\mathbf{W}}_{r_{1}}-\mathbf{W}_{r_{1}})^{\top}, Br2,i=Σ~r2,i(W~r2Wr2)\mathbf{B}_{r_{2},i}=\widetilde{\bm{\Sigma}}_{r_{2},i}(\widetilde{\mathbf{W}}_{r_{2}}-\mathbf{W}_{r_{2}})^{\top}, then Br2,i(r=r1+1r21Σr,iWr)Br1,i2C1Lτ2\|\mathbf{B}_{r_{2},i}(\prod_{r=r_{1}+1}^{r_{2}-1}\bm{\Sigma}_{r,i}\mathbf{W}_{r}^{\top})\mathbf{B}_{r_{1},i}\|_{2}\leq C_{1}L\tau^{2}

Otherwise, Br2,i(r=r1+1r21Σr,iWr)Br1,i2C1Lτ\|\mathbf{B}_{r_{2},i}(\prod_{r=r_{1}+1}^{r_{2}-1}\bm{\Sigma}_{r,i}\mathbf{W}_{r}^{\top})\mathbf{B}_{r_{1},i}\|_{2}\leq C_{1}L\tau.

For any fixed sequence Al1,iAl1,i,,Al2,iAl2,i\mathbf{A}_{l_{1},i}\in\mathcal{A}_{l_{1},i},\ldots,\mathbf{A}_{l_{2},i}\in\mathcal{A}_{l_{2},i}, let

Then by the discussion above we see that the bound of r=l1l2Ar,i2\|\prod_{r=l_{1}}^{l_{2}}\mathbf{A}_{r,i}\|_{2} has a term (C1Lτ)p1(C_{1}L\tau)^{p_{1}} granted by the matrices of the form Σ~r,i(W~rWr)\widetilde{\bm{\Sigma}}_{r,i}(\widetilde{\mathbf{W}}_{r}-\mathbf{W}_{r})^{\top}. In addition, if p2>p1+1p_{2}>p_{1}+1, then the bound also has a term C1Lslog(M)/mC_{1}L\sqrt{s\log(M)/m} with power at least p2p11p_{2}-p_{1}-1. Note that when p1=p2=0p_{1}=p_{2}=0, we still have r=l1l2Ar,i2C2L\|\prod_{r=l_{1}}^{l_{2}}\mathbf{A}_{r,i}\|_{2}\leq C_{2}L for some absolute constant C2C_{2}. Therefore,

Suppose that W1,,WL\mathbf{W}_{1},\ldots,\mathbf{W}_{L} are generated via Gaussian initialization, and all results (i)-(viii) in Theorem 5.1 hold. For τ>0\tau>0, let W~1,,W~L\widetilde{\mathbf{W}}_{1},\ldots,\widetilde{\mathbf{W}}_{L} with W~lWl2τ\|\widetilde{\mathbf{W}}_{l}-\mathbf{W}_{l}\|_{2}\leq\tau, l=1,,Ll=1,\ldots,L be the perturbed matrices. Let Σ~l,i\widetilde{\bm{\Sigma}}_{l,i}, l=1,,Ll=1,\ldots,L, i=1,,ni=1,\ldots,n be diagonal matrices satisfying Σ~l,iΣl,i0s\|\widetilde{\bm{\Sigma}}_{l,i}-\bm{\Sigma}_{l,i}\|_{0}\leq s and (Σ~l,iΣl,i)jj,(Σ~l,i)jj1|(\widetilde{\bm{\Sigma}}_{l,i}-\bm{\Sigma}_{l,i})_{jj}|,|(\widetilde{\bm{\Sigma}}_{l,i})_{jj}|\leq 1 for all l=1,,Ll=1,\ldots,L, i=1,,ni=1,\ldots,n and j=1,,mlj=1,\ldots,m_{l}. If τ,slog(M)/mκL3\tau,\sqrt{s\log(M)/m}\leq\kappa L^{-3} for some small enough absolute constant κ\kappa, then

The proof is similar to the proof of Lemma B.1. Again, let Ar,i={Σr,iWr,(Σ~r,iΣr,i)Wr,Σ~r,i(W~rWr)}\mathcal{A}_{r,i}=\{\bm{\Sigma}_{r,i}W_{r}^{\top},(\widetilde{\bm{\Sigma}}_{r,i}-\bm{\Sigma}_{r,i})\mathbf{W}_{r}^{\top},\widetilde{\bm{\Sigma}}_{r,i}(\widetilde{\mathbf{W}}_{r}-\mathbf{W}_{r})^{\top}\}, r=l,,Lr=l,\ldots,L, then we have

Similar to the proof of Lemma B.1, we denote by Σ|\bm{\Sigma}| the diagonal matrix with absolute values of elements of Σ\bm{\Sigma} on the corresponding entries. For each sequence Al,i,,AL,i\mathbf{A}_{l,i},\ldots,\mathbf{A}_{L,i}, denote

When Ar,i=Σr,iWr\mathbf{A}_{r,i}=\bm{\Sigma}_{r,i}\mathbf{W}_{r} for all r=l,,Lr=l,\ldots,L, then the bound of m_{L}^{-1/2}\mathbf{v}^{\top}\big{(}\prod_{r=l}^{L}\mathbf{A}_{r,i}\big{)}\mathbf{a} is given by (v) in Theorem 5.1. For all the other terms in the expansion, we consider sequences of the form Br2,i(r=r1+1r21Σr,iWr)Br1,i\mathbf{B}_{r_{2},i}(\prod_{r=r_{1}+1}^{r_{2}-1}\bm{\Sigma}_{r,i}\mathbf{W}_{r}^{\top})\mathbf{B}_{r_{1},i}, where

By (vii) in Theorem 5.1, there exists an absolute constant C1C_{1} such that Br2,i(r=r1+1r21Σr,iWr)Br1,i2\|\mathbf{B}_{r_{2},i}(\prod_{r=r_{1}+1}^{r_{2}-1}\bm{\Sigma}_{r,i}\mathbf{W}_{r}^{\top})\mathbf{B}_{r_{1},i}\|_{2} with different choices of Br1,i\mathbf{B}_{r_{1},i} and Br2,i\mathbf{B}_{r_{2},i} have the following bounds:

If Br1,i{Σ~r1,iΣr1,i,a}\mathbf{B}_{r_{1},i}\in\{|\widetilde{\bm{\Sigma}}_{r_{1},i}-\bm{\Sigma}_{r_{1},i}|,\mathbf{a}\}, Br2,i{(Σ~r2,iΣr2,i)Wr2,mL1/2v}\mathbf{B}_{r_{2},i}\in\{(\widetilde{\bm{\Sigma}}_{r_{2},i}-\bm{\Sigma}_{r_{2},i})\mathbf{W}_{r_{2}}^{\top},m_{L}^{-1/2}\mathbf{v}^{\top}\}, then Br2,i(r=r1+1r21Σr,iWr)Br1,i2C1Lslog(M)/m\|\mathbf{B}_{r_{2},i}(\prod_{r=r_{1}+1}^{r_{2}-1}\bm{\Sigma}_{r,i}\mathbf{W}_{r}^{\top})\mathbf{B}_{r_{1},i}\|_{2}\leq C_{1}L\sqrt{s\log(M)/m}.

If Br1,i=Σ~r1,i(W~r1Wr1)\mathbf{B}_{r_{1},i}=\widetilde{\bm{\Sigma}}_{r_{1},i}(\widetilde{\mathbf{W}}_{r_{1}}-\mathbf{W}_{r_{1}})^{\top}, Br2,i=Σ~r2,i(W~r2Wr2)\mathbf{B}_{r_{2},i}=\widetilde{\bm{\Sigma}}_{r_{2},i}(\widetilde{\mathbf{W}}_{r_{2}}-\mathbf{W}_{r_{2}})^{\top}, then Br2,i(r=r1+1r21Σr,iWr)Br1,i2C1Lτ2\|\mathbf{B}_{r_{2},i}(\prod_{r=r_{1}+1}^{r_{2}-1}\bm{\Sigma}_{r,i}\mathbf{W}_{r}^{\top})\mathbf{B}_{r_{1},i}\|_{2}\leq C_{1}L\tau^{2}

Otherwise, Br2,i(r=r1+1r21Σr,iWr)Br1,i2C1Lτ\|\mathbf{B}_{r_{2},i}(\prod_{r=r_{1}+1}^{r_{2}-1}\bm{\Sigma}_{r,i}\mathbf{W}_{r}^{\top})\mathbf{B}_{r_{1},i}\|_{2}\leq C_{1}L\tau.

Let α=max{τ,slog(m)/m}\alpha=\max\{\tau,\sqrt{s\log(m)/m}\} Then similar to the proof of Lemma B.1, by (vi) in Theorem 5.1 we have

where C2C_{2} is an absolute constant, and

Plugging the bounds of I1I_{1} and I2I_{2} into (B.1) completes the proof. ∎

The following lemma is inspired by a similar result given by Allen-Zhu et al. (2018b).

Suppose that W1,,WL\mathbf{W}_{1},\ldots,\mathbf{W}_{L} are generated via Gaussian initialization, and all results (i)-(viii) in Theorem 5.1 hold. Let W~=(W~1,,W~L)\widetilde{\mathbf{W}}=(\widetilde{\mathbf{W}}_{1},\ldots,\widetilde{\mathbf{W}}_{L}), W^=(W^1,,W^L)\widehat{\mathbf{W}}=(\widehat{\mathbf{W}}_{1},\ldots,\widehat{\mathbf{W}}_{L}) be two collections of weight matrices satisfying W~lWl2,W^lWl2τ\|\widetilde{\mathbf{W}}_{l}-\mathbf{W}_{l}\|_{2},\|\widehat{\mathbf{W}}_{l}-\mathbf{W}_{l}\|_{2}\leq\tau, l=1,,Ll=1,\ldots,L. Let Σl,i,Σ~l,i,Σ^l,i\bm{\Sigma}_{l,i},\widetilde{\bm{\Sigma}}_{l,i},\widehat{\bm{\Sigma}}_{l,i} and xl,i,x~l,i,x^l,i\mathbf{x}_{l,i},\widetilde{\mathbf{x}}_{l,i},\widehat{\mathbf{x}}_{l,i} be the binary matrices and hidden layer outputs at the ll-th layer with parameter matrices W,W~,W^\mathbf{W},\widetilde{\mathbf{W}},\widehat{\mathbf{W}} respectively. Let C0,C0C_{0},C_{0}^{\prime} be the absolute constants in the bounds of \big{\|}\prod_{r=l_{1}}^{l_{2}}\widetilde{\bm{\Sigma}}_{r,i}\widetilde{\mathbf{W}}_{r}^{\top}\big{\|}_{2}, 1l1<l2L1\leq l_{1}<l_{2}\leq L and Wl2\|\mathbf{W}_{l}\|_{2}, l=1,,Ll=1,\ldots,L given in Lemma B.1 and (i) in Theorem 5.1 respectively. Then it holds that

x^l,ix~l,i2CLr=1lW^rW~r2\|\widehat{\mathbf{x}}_{l,i}-\widetilde{\mathbf{x}}_{l,i}\|_{2}\leq CL\cdot\sum_{r=1}^{l}\|\widehat{\mathbf{W}}_{r}-\widetilde{\mathbf{W}}_{r}\|_{2},

Σ^l,iΣ~l,i0CL4/3τ2/3ml\|\widehat{\bm{\Sigma}}_{l,i}-\widetilde{\bm{\Sigma}}_{l,i}\|_{0}\leq C^{\prime}L^{4/3}\tau^{2/3}m_{l},

for all l=1,,Ll=1,\ldots,L and i=1,,ni=1,\ldots,n, where C=2(C01)C=2(C_{0}\lor 1) and C=8C2/3C02/3C^{\prime}=8C^{2/3}C_{0}^{\prime 2/3}.

For all i{1,,n}i\in\{1,\ldots,n\} and l{1,,L}l\in\{1,\ldots,L\}, we prove the following stronger results:

We prove the results above by induction in ll. Suppose that for r=1,,l1r=1,\ldots,l-1 it holds that

We first prove the bounds for the diagonal matrices. Since Σ^l,iΣ~l,i0Σ^l,iΣl,i0+Σ~l,iΣl,i0\|\widehat{\bm{\Sigma}}_{l,i}-\widetilde{\bm{\Sigma}}_{l,i}\|_{0}\leq\|\widehat{\bm{\Sigma}}_{l,i}-\bm{\Sigma}_{l,i}\|_{0}+\|\widetilde{\bm{\Sigma}}_{l,i}-\bm{\Sigma}_{l,i}\|_{0}, it suffices to show that Σ^l,iΣl,i0,Σ~l,iΣl,i0C/2L4/3τ2/3ml\|\widehat{\bm{\Sigma}}_{l,i}-\bm{\Sigma}_{l,i}\|_{0},\|\widetilde{\bm{\Sigma}}_{l,i}-\bm{\Sigma}_{l,i}\|_{0}\leq C^{\prime}/2L^{4/3}\tau^{2/3}m_{l}. We show that Σ~l,iΣl,i0C/2L4/3τ2/3ml\|\widetilde{\bm{\Sigma}}_{l,i}-\bm{\Sigma}_{l,i}\|_{0}\leq C^{\prime}/2L^{4/3}\tau^{2/3}m_{l}. Then the same bound for Σ^l,iΣl,i0\|\widehat{\bm{\Sigma}}_{l,i}-\bm{\Sigma}_{l,i}\|_{0} follows from the exact same proof. To show Σ~l,iΣl,i0C/2L4/3τ2/3ml\|\widetilde{\bm{\Sigma}}_{l,i}-\bm{\Sigma}_{l,i}\|_{0}\leq C^{\prime}/2L^{4/3}\tau^{2/3}m_{l}, it suffices to give upper bound for the number of sign changes between vectors W~lx~l1,i\widetilde{\mathbf{W}}_{l}\widetilde{\mathbf{x}}_{l-1,i} and Wlxl1,i\mathbf{W}_{l}\mathbf{x}_{l-1,i}. We characterize their difference as follows:

Note that we have W~lWl2τ\|\widetilde{\mathbf{W}}_{l}^{\top}-\mathbf{W}_{l}^{\top}\|_{2}\leq\tau, xl1,i22\|\mathbf{x}_{l-1,i}\|_{2}\leq 2 and W~l22C0\|\widetilde{\mathbf{W}}_{l}^{\top}\|_{2}\leq 2C_{0}^{\prime}, and by definition we have C2,c1C\geq 2,\overline{c}\geq 1. Therefore

be the set of indices such that the absolute values of the corresponding entries of Wlxl1,i\mathbf{W}_{l}^{\top}\mathbf{x}_{l-1,i} are bounded by β\beta. Denote

For sl,i(1)(β)s_{l,i}^{(1)}(\beta), we directly use the upper bound sl,i(1)(β)Sl,i(β)2ml3/2βs_{l,i}^{(1)}(\beta)\leq|{\mathcal{S}}_{l,i}(\beta)|\leq 2m_{l}^{3/2}\beta given by (iv) in Theorem 5.1. We now focus on the upper bound of sl,i(2)(β)s_{l,i}^{(2)}(\beta). It is clear that if the sign of node jj changes, we must have

Therefore, we have the following upper bound of Σ~l,iΣl,i0\|\widetilde{\bm{\Sigma}}_{l,i}-\bm{\Sigma}_{l,i}\|_{0}:

Setting β=2C2/3C02/3L4/3τ2/3ml1/2\beta=2C^{2/3}C_{0}^{\prime 2/3}L^{4/3}\tau^{2/3}m_{l}^{-1/2}, we obtain

Now, combining bounds above and the inductive assumption on the bounds of Σ~r,iΣ^r,i0\|\widetilde{\bm{\Sigma}}_{r,i}-\widehat{\bm{\Sigma}}_{r,i}\|_{0}, r=1,,l1r=1,\ldots,l-1, we show that x^l,ix~l,i2CLr=1lW^rW~r2\|\widehat{\mathbf{x}}_{l,i}-\widetilde{\mathbf{x}}_{l,i}\|_{2}\leq CL\cdot\sum_{r=1}^{l}\|\widehat{\mathbf{W}}_{r}-\widetilde{\mathbf{W}}_{r}\|_{2}. For l=1,,Ll=1,\ldots,L, we define binary matrix Σˇl,i\widecheck{\bm{\Sigma}}_{l,i} as follows:

It then follows by definition that (Σ^l,i+Σˇl,i)jj,(Σˇl,i)jj1|(\widehat{\bm{\Sigma}}_{l,i}+\widecheck{\bm{\Sigma}}_{l,i})_{jj}|,|(\widecheck{\bm{\Sigma}}_{l,i})_{jj}|\leq 1 for all j=1,,mlj=1,\ldots,m_{l}, and

Let W,W~\mathbf{W},\widetilde{\mathbf{W}} be the collections of Gaussian initialized and perturbed weight matrices respectively. Define

Then under the same assumptions as Lemma B.3, it holds that

The result directly follows by the bound of Σ~L,iΣL,i0\|\widetilde{\bm{\Sigma}}_{L,i}-\bm{\Sigma}_{L,i}\|_{0} given by Lemma B.3. ∎

Suppose that W1,,WL\mathbf{W}_{1},\ldots,\mathbf{W}_{L} are generated via Gaussian initialization, and all results (i) to (viii) hold. If \|\widetilde{\mathbf{W}}_{l}-\mathbf{W}_{l}\|_{2}\leq\tau=O\big{(}\phi^{3/2}n^{-3}L^{-2}\big{)} for all ll, then there exists an absolute constant CC such that

where y~i=fW~(xi)\widetilde{y}_{i}=f_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i}) denotes the output of the network using the perturbed weight matrices. By (viii) in Theorem 5.1, the inequality

holds for at least C2mLϕ/nC_{2}m_{L}\phi/n nodes, where C1,C2>0C_{1},C_{2}>0 are positive absolute constants. Moreover, we rewrite the gradient WL,jLS(W~)\nabla_{\mathbf{W}_{L,j}}L_{S}(\widetilde{\mathbf{W}}) as follows:

According to Lemma B.3, the number of nodes satisfying σ(w~L,j,x~L1,i)σ(wL,j,xL1,i0\sigma^{\prime}(\langle\widetilde{\mathbf{w}}_{L,j},\widetilde{\mathbf{x}}_{L-1,i}\rangle)-\sigma^{\prime}(\langle\mathbf{w}_{L,j},\mathbf{x}_{L-1,i}\rangle\neq 0 for at least one ii is at most C3nL4/3τ2/3mLC_{3}nL^{4/3}\tau^{2/3}m_{L}, where C3C_{3} is an absolute constant. For the rest of the nodes in this layer, we have

where C4C_{4} is an absolute constant, the first inequality holds since these nodes satisfy σ(w~L,j,x~L1,i)σ(wL,j,xL1,i=0\sigma^{\prime}(\langle\widetilde{\mathbf{w}}_{L,j},\widetilde{\mathbf{x}}_{L-1,i}\rangle)-\sigma^{\prime}(\langle\mathbf{w}_{L,j},\mathbf{x}_{L-1,i}\rangle=0 for all ii, the second inequality follows from Lemma B.3 and triangle inequality. Let

If W~lWl2τ\|\widetilde{\mathbf{W}}_{l}-\mathbf{W}_{l}\|_{2}\leq\tau, there exists an absolute constant CC such that the following bounds hold on the norm of the partial gradient Wl[LS(W~)]\nabla_{\mathbf{W}_{l}}[L_{S}(\widetilde{\mathbf{W}})] and stochastic partial gradient G~l\widetilde{\mathbf{G}}_{l}:

where y~i=fW~(xi)\widetilde{y}_{i}=f_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i}), B=BB=|\mathcal{B}| denotes the minibatch size.

where the last inequality follows from the fact that v2=mL1/2M1/2\|\mathbf{v}\|_{2}=m_{L}^{1/2}\leq M^{1/2}. Moreover, we have the following for Wl[LS(W~)]\nabla_{\mathbf{W}_{l}}[L_{S}(\widetilde{\mathbf{W}})]:

Similarly, regarding the stochastic gradient G~l\widetilde{\mathbf{G}}_{l}, we have

Appendix C Proof of Technical Lemmas in Section 5

Consider two positive constants aa and bb. For any p[0,1/2)(1/2,1]p\in[0,1/2)\cup(1/2,1], the following inequality holds:

Let W1(0),,WL(0)\mathbf{W}_{1}^{(0)},\dots,\mathbf{W}_{L}^{(0)} be generated via Gaussian random initialization. Let W(k)={Wl(k)}l=1,,L\mathbf{W}^{(k)}=\{\mathbf{W}_{l}^{(k)}\}_{l=1,\dots,L} be the kk-th iterate in the gradient descent. Assume all iterates are in the perturbation region centering at W0\mathbf{W}^{0} with radius τ\tau, i.e., Wl(k)Wl(0)2τ\|\mathbf{W}_{l}^{(k)}-\mathbf{W}_{l}^{(0)}\|_{2}\leq\tau holds for any kKk\leq K and l[L]l\in[L], where KK is the maximum iteration number. Assume all results in Theorem 5.3 hold, there exist absolute constants C\overline{C}, C\underline{C}^{\prime} and C\underline{C}^{\prime\prime} such that the following upper and lower bounds on Δi(k)=yi(y^i(k+1)y^i(k))\Delta_{i}^{(k)}=y_{i}(\widehat{y}_{i}^{(k+1)}-\widehat{y}_{i}^{(k)}) hold:

Then we have the following upper bound on LS(W(k+1))LS(W(k))L_{S}(\mathbf{W}^{(k+1)})-L_{S}(\mathbf{W}^{(k)}),

where \Delta_{i}^{(k)}=y_{i}\big{(}\widehat{y}_{i}^{(k+1)}-\widehat{y}_{i}^{(k)}\big{)}. By Lemma C.2, we know that there exist constants C1C_{1} and C2C_{2} such that

Moreover, using the definition of ul,iu_{l,i} in Lemma C.2, we have

Then, plugging the above result into (C.1) gives

Note that by Lemma B.5, we know that there exists a constant c0c_{0} such that

We only take advantage of the gradient of the weight matrix in the last hidden layer to make loss function decrease. Thus, substituting the above inequality into (C.1), we obtain

According to Assumption 3.2, we know that

Note that min{a,b}1/(1/a+1/b)\min\{a,b\}\geq 1/(1/a+1/b), we have the following by plugging the above inequality into (C.6)

When p1/2p\neq 1/2, we have the following by Lemma C.1

Then taking telescope sum over kk and rearranging terms give

Similarly, taking telescope sum and rearranging terms give

Now we need to guarantee that after KK gradient descent steps the loss function LS(W(K))L_{S}(\mathbf{W}^{(K)}) is smaller than the target accuracy ϵ\epsilon. By (iii), we know that the output yi(0)y_{i}^{(0)} is in the order of O~(1)\widetilde{O}(1), which implies that the training loss LS(W(0))=O~(1)L_{S}(\mathbf{W}^{(0)})=\widetilde{O}(1) due to the smoothness assumption. Therefore, by (C.1) and (C.9), we require the following in terms of KηK\eta,

Here we preserve the parameter α0\alpha_{0} since it might take on value \infty. When α0=\alpha_{0}=\infty, we can remove the term n5/(mϕα02)n^{5}/(m\phi\alpha_{0}^{2}) in (C.13) since it becomes zero. When α0<\alpha_{0}<\infty, we treat it as a constant of order O(1)O(1). This completes the proof.

Appendix D Proof of Lemma 5.5

We prove this lemma by induction. Assume the argument holds for any t<kt<k, thus according to (C.6), we have

for any t<kt<k, where c0c_{0} is an absolute constant. Moreover, by Lemma B.6, we have

where c1c_{1} is an absolute constant. Therefore, we have

it follows that W(k)Wl(0)2τ\|\mathbf{W}^{(k)}-\mathbf{W}_{l}^{(0)}\|_{2}\leq\tau. This completes the proof. ∎

The following three lemmas are necessary to prove Lemma 5.6.

Let W10,,WL(0)\mathbf{W}_{1}^{0},\dots,\mathbf{W}_{L}^{(0)} be generated via Gaussian random initialization. Let W(k)={Wl(k)}l=1,,L\mathbf{W}^{(k)}=\{\mathbf{W}_{l}^{(k)}\}_{l=1,\dots,L} be the kk-th iterate in the stochastic gradient descent. Assume there exist two constants τ,s>0\tau,s>0 satisfying τκL3\tau\leq\kappa L^{-3} and slog(M)/mκL3\sqrt{s\log(M)/m}\leq\kappa L^{-3} for some small enough absolute constant κ\kappa such that Wl(k)Wl(0)2τ\|\mathbf{W}_{l}^{(k)}-\mathbf{W}_{l}^{(0)}\|_{2}\leq\tau and Σl,i(k)Σl,i(0)0s\|\bm{\Sigma}_{l,i}^{(k)}-\bm{\Sigma}_{l,i}^{(0)}\|_{0}\leq s hold for any kKk\leq K and l[L]l\in[L], where KK is the maximum iteration number. If M=Ω~(1)M=\widetilde{\Omega}(1), there exist absolute constants C\overline{C}, C\underline{C}^{\prime} and C\underline{C}^{\prime\prime} such that the following upper and lower bounds on Δ~i(k)=yi(y^i(k+1)y^i(k))\widetilde{\Delta}_{i}^{(k)}=y_{i}(\widehat{y}_{i}^{(k+1)}-\widehat{y}_{i}^{(k)}) hold:

where B(k)\mathcal{B}^{(k)} with B(k)=B|\mathcal{B}^{(k)}|=B denotes the set of minibatch for stochastic gradient calculation in the kk-th iteration, and

where Gl(k)\mathbf{G}_{l}^{(k)} denotes the stochastic partial gradient with respect to Wl(k)\mathbf{W}_{l}^{(k)}.

Regarding nn random variables u1,,unu_{1},\dots,u_{n} satisfying i=1nui=0\sum_{i=1}^{n}u_{i}=0. Let B[n]\mathcal{B}\in[n] denote a subset of [n][n] and B=Bn|\mathcal{B}|=B\leq n, the following holds,

Under Assumptions 4.4, with probability at least 1δ1-\delta, there exists an absolute constant CC such that the output of the stochastic gradient descent, i.e., W(k)\mathbf{W}^{(k)}, satisfies

By Assumption 3.3, LS(W(k+1))LS(W(k))L_{S}(\mathbf{W}^{(k+1)})-L_{S}(\mathbf{W}^{(k)}) can be upper bounded as follows,

where \widetilde{\Delta}_{i}^{(k)}=y_{i}\big{(}\widehat{y}_{i}^{(k+1)}-\widehat{y}_{i}^{(k)}\big{)}. Then taking expectation conditioning on W(k)\mathbf{W}^{(k)} gives

where C2C_{2} is an absolute constant. By Lemma D.2, we have

where the last inequality holds since (i=1nzi)2i=1nzi2(\sum_{i=1}^{n}z_{i})^{2}\geq\sum_{i=1}^{n}z_{i}^{2} for any z1,,zn0z_{1},\dots,z_{n}\geq 0. Plugging this into (D.6), we obtain

Combining (D.7) and (D.1) and then plugging into (D.1), we have

where C3C_{3} is an absolute constant, and we use the fact that BnB\leq n. By Lemma B.5, there exists a positive constant c0c_{0} such that

Therefore, after one-step stochastic gradient descent, we have

Similar to the analysis for gradient descent, we have

where we use the fact that min{a,b}1/(1/a+1/b)\min\{a,b\}\geq 1/(1/a+1/b). When p1/2p\neq 1/2, rearranging terms and applying Lemma C.1 give

Then we take telescope sum over kk on both sides, and obtain

Similar to the proof for gradient descent, when p=1/2p=1/2, it is easy to show that

Set the step size η=O(ϕM1n2p5)\eta=O(\phi M^{-1}n^{2p-5}) and plug the results in Lemma D.3 into (D.10), with probability at least 1δ1-\delta, there exists an absolute constant c1c_{1} such that

when p=1/2p=1/2. Then it requires to set k=Ω~(n124pB2ϕ2)k=\widetilde{\Omega}(n^{12-4p}B^{-2}\phi^{-2}) to make the L.H.S. of the above two inequalities be lower bounded by c2kηc_{2}k\eta, where c2(0,1)c_{2}\in(0,1) is an absolute constant. Moreover, in order to guarantee that LS(W(k))ϵL_{S}(\mathbf{W}^{(k)})\leq\epsilon, it suffices to set the quantity kηk\eta as follows,

Similar to the proof of Lemma 5.4, by setting α0=\alpha_{0}=\infty or α0=O(1)\alpha_{0}=O(1), we are able to complete the proof.

D.2 Proof of Lemma 5.7

when p=1/2p=1/2. We tackle these two cases separately.

Case p>1/2p>1/2: Then by Lemma D.3, with probability at least 1δ1-\delta, there exists an absolute constant c1c_{1} such that the following holds

Combining with (D.16), the following holds with probability at least 1δ1-\delta

Note that LS(W(0))=O~(1)L_{S}(\mathbf{W}^{(0)})=\widetilde{O}(1), then we can choose the step size η=O(ϕm1n2p7L8B2)\eta=O(\phi m^{-1}n^{2p-7}L^{-8}B^{2}) such that the second term on the R.H.S. of (D.2) is upper bounded by LS12p(W(0))/(24p)|L_{S}^{1-2p}(\mathbf{W}^{(0)})/(2-4p)|. Therefore, with probability at least 1δ1-\delta,

Since we have p[0,1/2)(1/2,1]p\in[0,1/2)\cup(1/2,1], and LS(W(0))=O~(1)L_{S}(\mathbf{W}^{(0)})=\widetilde{O}(1), LS(W(k))=O~(1)L_{S}(\mathbf{W}^{(k)})=\widetilde{O}(1) holds with probability at least 1δ1-\delta.

Case p=1/2p=1/2: By Lemma D.3, with probability at least 1δ1-\delta,

Combining with (D.17), we can set η=O(ϕm1n2p7L8B2)\eta=O(\phi m^{-1}n^{2p-7}L^{-8}B^{2}), and thus

holds with probability at least 1δ1-\delta. Therefore, with probability at least 1δ1-\delta, we have

Now we have proved that LS(W(k))=O~(1)L_{S}(\mathbf{W}^{(k)})=\widetilde{O}(1) with probability at least 1δ1-\delta for one particular kk. Then take union bound, we have LS(W(k))=O~(1)L_{S}(\mathbf{W}^{(k)})=\widetilde{O}(1) for all kKk\leq K with probability at least 1Kδ1-K\delta.

Moreover, when ρ0=\rho_{0}=\infty, we have

where CC is an absolute constant. Then we finalize the proof based on two cases: α0=\alpha_{0}=\infty and α0<\alpha_{0}<\infty. When α0<\alpha_{0}<\infty, since

by induction it is easy to show that there exists T=O~(τL2M1/2n1)T=\widetilde{O}(\tau L^{-2}M^{-1/2}n^{-1}) such that for any kk and η=O(ϕm1n2p7L8B2)\eta=O(\phi m^{-1}n^{2p-7}L^{-8}B^{2}) satisfying kηTk\eta\leq T, it holds that Wl(k)W(0)2τ\|\mathbf{W}_{l}^{(k)}-\mathbf{W}^{(0)}\|_{2}\leq\tau for any ll, i.e., W(k)\mathbf{W}^{(k)} is in the preset perturbation region with radius τ\tau. When α0<\alpha_{0}<\infty, we have Gl(k)2C1L2M1/2ρ0=O(L2M1/2)\|\mathbf{G}_{l}^{(k)}\|_{2}\leq-C_{1}L^{2}M^{1/2}\rho_{0}=O(L^{2}M^{1/2}), where C1C_{1} is an absolute constant. Thus, it can be show that for any kk and η\eta satisfying kηT=O(τL2M1/2)k\eta\leq T=O(\tau L^{-2}M^{-1/2}), we have Wl(k)Wl(0)2τ\|\mathbf{W}_{l}^{(k)}-\mathbf{W}_{l}^{(0)}\|_{2}\leq\tau. This completes the proof. ∎

Appendix E Proof of Auxiliary Lemmas

It is simple to verify that function f(x)=x12p/(12p)f(x)=x^{1-2p}/(1-2p) is concave when p[0,1/2)(1/2,1]p\in[0,1/2)\cup(1/2,1]. Thus we have

E.2 Proof of Lemma C.2

The upper bound of Δi(k)|\Delta_{i}^{(k)}| can be derived straightforwardly. By (ii) in Theorem 5.3, we know that there exists a constant C1C_{1} such that

By (viii) in Theorem 5.3, we further have

where C2C_{2} is an absolute constant. Therefore, it follows that

where we use the fact that v2M1/2\|\mathbf{v}\|_{2}\leq M^{1/2}.

In what follows we are going to prove the lower bound of Δi(k)\Delta_{i}^{(k)}. Based on the definition of Δi(k)\Delta_{i}^{(k)}, we have

where C3C_{3} is an absolute constant and the last inequality follows from (viii) in Theorem 5.3 and (E.1). Moreover, according to Lemma (vi), for any vector a\mathbf{a} with a0s\|\mathbf{a}\|_{0}\leq s, the following holds,

where C4C_{4} is an absolute constant. Then let a=(Σl,i(k+1)Σl,i(k))Wl(k+1)xl1,i(k+1)\mathbf{a}=(\bm{\Sigma}_{l,i}^{(k+1)}-\bm{\Sigma}_{l,i}^{(k)})\mathbf{W}_{l}^{(k+1)\top}\mathbf{x}_{l-1,i}^{(k+1)} and apply (E.2), we get the following bound of Il,i1|I_{l,i}^{1}|

where C5=C3C4C_{5}=C_{3}\cdot C_{4} is an absolute constant. Then we pay attention to Il,i2I_{l,i}^{2}. Using Wl(k+1)Wl(k)=ηWl[LS(W(k))]\mathbf{W}_{l}^{(k+1)}-\mathbf{W}_{l}^{(k)}=-\eta\nabla_{\mathbf{W}_{l}}[L_{S}(\mathbf{W}^{(k)})], we have

We now proceed to bound Il,i32\|I_{l,i}^{3}\|_{2}. Based on Lemma B.6 and (E.1), we have

where C6C_{6} is an absolute constant. Moreover, note that Wl(k)Wl(0)2τ\|\mathbf{W}_{l}^{(k)}-\mathbf{W}_{l}^{(0)}\|_{2}\leq\tau holds for all ll, by (ii) in Theorem 5.3, we have

where the last inequality follows from (iii) in Theorem 5.1, and C7C_{7} is an absolute constant. Therefore, by Assumption 3.3, we have

Assume M1/2logn/δM^{1/2}\geq\sqrt{\log{n/\delta}} and τ1\tau\leq 1, plugging the above inequality into (E.2), we obtain

where C10C_{10} is an absolute constant. Finally, plugging (E.4), (E.2) and (E.7) into (E.2), we have

Let ul,i=Il,i4\mathbf{u}_{l,i}=I_{l,i}^{4} we complete the proof. ∎

E.3 Proof of Lemma D.1

This lemma can be proved by following the same technique for proving Lemma C.2, while we only need to replace the upper bound of Wl[LS(W(k))]2\|\nabla_{\mathbf{W}_{l}}[L_{S}(\mathbf{W}^{(k)})]\|_{2} with the stochastic gradient Gl(k)\|\mathbf{G}_{l}^{(k)}\| based on (viii) in Theorem 5.3. Since the proof technique of this lemma is essentially identical to that of Lemma C.2, we omit the detail here.

E.4 Proof of Lemma D.2

For random variables u1,,unu_{1},\dots,u_{n}, we have

where the last equality is due to the fact that 1ni=1nui=0\frac{1}{n}\sum_{i=1}^{n}u_{i}=0. ∎

E.5 Proof of Lemma D.3

We prove this Lemma by two cases: p1/2p\neq 1/2 and p=1/2p=1/2.

Using the upper bound of Δ~i(k)\widetilde{\Delta}_{i}^{(k)} provided in Lemma D.1, we have

where C1C_{1} and C2C_{2} are absolute constant and the last inequality follows from the fact that 1/ni=1nzip(1/ni=1nzi)p1/n\sum_{i=1}^{n}z_{i}^{p}\leq(1/n\sum_{i=1}^{n}z_{i})^{p} for any z1,,zn0z_{1},\dots,z_{n}\geq 0. Dividing by LS2p(W(k))L_{S}^{2p}(\mathbf{W}^{(k)}) on both sides and applying Lemma C.1,

Then by Azuma’s inequality, we have with probability at least 1δ1-\delta

Case p=1/2p=1/2: Similar to (E.8), we can derive the following upper bound on the difference LS(W(k+1))LS(W(k))L_{S}(\mathbf{W}^{(k+1)})-L_{S}(\mathbf{W}^{(k)}),

which leads to following by using inequality log(x)x1\log(x)\leq x-1,

Then by Azuma’s inequality, we have with probability at least 1δ1-\delta

References