On the Convergence Rate of Training Recurrent Neural Networks

Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song

Neural networks have been one of the most powerful tools in machine learning over the past few decades. The multi-layer structure of a neural network gives it supreme power in expressibility and learning performance. However, it raises complexity concerns: the training objective is generally non-convex and non-smooth. In practice, local-search algorithms such as stochastic gradient descent (SGD) are capable of finding global optima, at least on the training data. How SGD avoids local minima for such objectives has remained an open theoretical question since Goodfellow et al.

In recent years, there have been a number of theoretical results aiming at a better understanding of this phenomenon. Many of them focus on two-layer (thus one-hidden-layer) neural networks and assume that the inputs are random Gaussian or sufficiently close to Gaussian. Some study deep neural networks but assume that the activation function is linear. Some study the convex task of training essentially only the last layer of the network. On the technique side, some of these results try to understand the gradient dynamics, while others focus on the geometric properties of the training objective.

In this paper, we show that GD and SGD are capable of training multi-layer neural networks (with ReLU activation) to global minima on any non-degenerate training data set. Furthermore, the running time is polynomial in the number of layers and the number of data points. Since there are many different types of multi-layer networks (convolutional, feedforward, recurrent, etc.), in this paper we focus on recurrent neural networks (RNN) as our choice of multi-layer networks; feedforward networks are only a “special case” (see for instance a follow-up work).

Recurrent Neural Networks. Among different architectures of neural networks, one of the least theoretically-understood structures is the recurrent one. A recurrent neural network recurrently applies the same network unit to a sequence of input tokens, such as a sequence of words in a language sentence. RNN is particularly useful when there are long-term, non-linear interactions between input tokens in the same sequence. These networks are widely used in practice for natural language processing, language generation, machine translation, speech recognition, video and music processing, and many other tasks. On the theory side, while there are some attempts to show that an RNN is more expressive than a feedforward neural network, when and how an RNN can be efficiently learned has little theoretical explanation.

In practice, RNN is usually trained by simple local-search algorithms such as SGD. However, unlike shallow networks, the training process of RNN often runs into the trouble of vanishing or exploding gradient. That is, the value of the gradient becomes exponentially small or large in the time horizon, even when the training objective is still constant. Intuitively, an RNN recurrently applies the same network unit $L$ times if the input sequence is of length $L$. When this unit has “operator norm” larger than one or smaller than one, the final output can possibly exponentially explode or vanish in $L$. More importantly, when one back-propagates through time (which intuitively corresponds to applying the reverse unit multiple times), the gradient can also vanish or explode. Controlling the operator norm of a non-linear operator can be quite challenging. In practice, one of the popular ways to resolve this is by the long short term memory (LSTM) structure. However, one can also use rectified linear units (ReLUs) as activation functions to avoid vanishing or exploding gradient. In fact, one of the earliest adoptions of ReLUs was on applications of RNNs for this purpose twenty years ago. For a detailed survey on RNN, we refer the readers to Salehinejad et al.

In this paper, we study the following general questions:

Can ReLU provably stabilize the training process and avoid vanishing/exploding gradient?

Can RNN be trained close to zero training error efficiently under mild assumptions?

When there is no activation function, an RNN is known as a linear dynamical system. Hardt, Ma, and Recht first proved the convergence of finding global minima for such linear dynamical systems. Followups in this line of research include .

Motivations. One may also want to study whether RNN can be trained close to zero test error. However, unlike feedforward networks, the training error, or the ability to memorize examples, may actually be desirable for RNN. After all, many tasks involving RNN are related to memories, and certain RNN units are even referred to as memory cells. Since RNN applies the same network unit to all input tokens in a sequence, the following question can possibly be of its own interest:

How does RNN learn mappings (say from token 3 to token 7) without destroying others?

Another motivation is the following. An RNN can be viewed as a space-constrained, differentiable Turing machine, except that the input is only allowed to be read in a fixed order. It was shown in Siegelmann and Sontag that all Turing machines can be simulated by fully-connected recurrent networks built of neurons with non-linear activations. In practice, RNN is also used as a tool to build neural Turing machines, equipped with a grand goal of automatically learning an algorithm based on the observation of the inputs and outputs. To this extent, we believe the task of understanding the trainability, as a first step towards understanding RNN, can be meaningful on its own.

Our Result. To present the simplest result, we focus on the classical Elman network with ReLU activation:

If the number of neurons $m\geq\poly(n,d,L,\delta^{-1},\log\varepsilon^{-1})$ is polynomially large, we can find weight matrices $W,A,B$ where the RNN gives $\varepsilon$ training error

if gradient descent (GD) is applied for $T=\Omega\big{(}\frac{\poly(n,d,L)}{\delta^{2}}\log\frac{1}{\varepsilon}\big{)}$ iterations, starting from random Gaussian initializations; or

if (mini-batch or regular) stochastic gradient descent (SGD) is applied for $T=\Omega\big(\frac{\poly(n,d,L)}{\delta^{2}}\log\frac{1}{\varepsilon}\big)$ iterations, starting from random Gaussian initializations.

At first glance, one may question how it is possible for SGD to enjoy a logarithmic time dependency in $\varepsilon^{-1}$; after all, even when minimizing strongly-convex and Lipschitz-smooth functions, the typical convergence rate of SGD is $T\propto 1/\varepsilon$ as opposed to $T\propto\log(1/\varepsilon)$. We quickly point out that there is no contradiction here if the stochastic pieces of the objective enjoy a common global minimizer. In math terms, suppose we want to minimize some function $f(x)=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x)$, and suppose $x^{*}$ is the global minimizer of each of the convex functions $f_{1}(x),\dots,f_{n}(x)$. Then, if $f(x)$ is $\sigma$-strongly convex and each $f_{i}(x)$ is $L$-Lipschitz smooth, SGD (moving in the negative direction of $\nabla f_{i}(x)$ for a random $i\in[n]$ per step) can find an $\varepsilon$-minimizer of this function in $O\big(\frac{L^{2}}{\sigma^{2}}\log\frac{1}{\varepsilon}\big)$ iterations.
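The remark on SGD with a common global minimizer can be checked on a toy problem. The following is only a sketch under assumptions that are ours and not the paper's: we take $f_{i}(x)=\frac{1}{2}(a_{i}^{\top}x-b_{i})^{2}$ with $b=Ax^{*}$, so every $f_{i}$ is convex and minimized at $x^{*}$, and we use the per-sample step size $1/\|a_{i}\|^{2}$. The names `sgd_common_minimizer` and `x_star` are hypothetical.

```python
import numpy as np

def sgd_common_minimizer(n=20, dim=5, steps=2000, seed=0):
    """SGD on f(x) = (1/n) * sum_i 0.5 * (a_i @ x - b_i)**2 with b = A @ x_star.

    Every f_i is convex and minimized at x_star (illustrative assumption),
    so all stochastic gradients vanish at the optimum.
    Returns the distance ||x_t - x_star|| after each step.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, dim))
    x_star = rng.standard_normal(dim)
    b = A @ x_star                      # consistent system: common minimizer
    x = np.zeros(dim)
    dists = np.empty(steps)
    for t in range(steps):
        i = rng.integers(n)             # random index i in [n]
        a = A[i]
        grad_i = (a @ x - b[i]) * a     # gradient of f_i at x
        x = x - grad_i / (a @ a)        # step size 1 / (smoothness of f_i)
        dists[t] = np.linalg.norm(x - x_star)
    return dists

if __name__ == "__main__":
    d = sgd_common_minimizer()
    for t in (0, 100, 200, 400):
        print(t, d[t])
```

On a log scale the printed distances should fall roughly along a straight line, which is the $\log(1/\varepsilon)$ behavior described above; if $b$ were perturbed so that the $f_{i}$ no longer share a minimizer, a constant step size would no longer drive the distance to zero.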

(To present the simplest possible result, we have not tried to tighten the polynomial dependency with respect to $n,d$ and $L$ . We only tightened the dependency with respect to $\delta$ and $\varepsilon$ .)

Our Contribution. We summarize our contributions as follows.

We believe this is the first proof of convergence of GD/SGD for training the hidden layers of recurrent neural networks (or even for any multi-layer network of more than two layers) when activation functions are present. Our theorem holds even when $A,B$ are at random initialization and only the hidden weight matrix $W$ is trained. This is much more difficult to analyze than the convex task of training only the last layer $B$. Training only the last layer can significantly reduce the learning power of (recurrent or not) neural networks in practice.

Our results provide arguably the first theoretical evidence towards the empirical finding of Goodfellow et al. on multi-layer networks, regarding the ability of SGD to avoid (spurious) local minima. Our theorem does not exclude the existence of bad local minima.

We build new technical toolkits to analyze multi-layer networks with ReLU activation, which have now found many applications. For instance, combining this paper with new techniques, one can derive guarantees on testing error for RNN in the PAC-learning language.

Extension: Deep RNN. Elman RNN is also referred to as three-layer RNN, and one may also study the convergence of RNNs with more hidden layers. This is referred to as deep RNN . Our theorem also applies to deep RNNs (by combining this paper together with ).

2 Other Related Works

Another relevant work is Brutzkus et al., where the authors studied over-parameterization in the case of two-layer neural networks under a linearly-separable assumption.

Instead of using randomly initialized weights like this paper, there is a line of work proposing algorithms using weights generated from some “tensor initialization” process.

There is a huge literature on using mean-field theory to study neural networks. At a high level, they study the network dynamics at random initialization when the number of hidden neurons grows to infinity, and use such initialization theory to predict performance after training. However, they do not provide a theoretical convergence rate for the training process (at least when the number of neurons is finite).

Notations and Preliminaries

Note that in the occasion that $\prod_{j=1}^{i-1}(I-\widehat{v}_{j}\widehat{v}_{j}^{\top})v_{i}$ is the zero vector, we let $\widehat{v}_{i}$ be an arbitrary unit vector that is orthogonal to $\widehat{v}_{1},\dots,\widehat{v}_{i-1}$ .
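The orthonormalization rule above, including the degenerate case, can be sketched as follows. This is a minimal illustration rather than the paper's definition verbatim: the function name `gram_schmidt`, the tolerance `tol`, and the choice of standard basis vectors as the "arbitrary" fallback are our assumptions.

```python
import numpy as np

def gram_schmidt(vectors, tol=1e-12):
    """Orthonormalize v_1, ..., v_n (given as rows) in order.

    hat_v_i is the normalized residual of v_i after applying
    (I - hat_v_j hat_v_j^T) for every earlier j. If that residual is
    (numerically) zero, we fall back to the first standard basis vector
    whose residual is non-zero; this is one arbitrary valid choice.
    Requires n <= dimension.
    """
    V = np.asarray(vectors, dtype=float)
    n, dim = V.shape
    if n > dim:
        raise ValueError("need n <= dimension")
    eye = np.eye(dim)
    basis = []
    for v in V:
        for c in [v, *eye]:              # try v first, then fallbacks
            r = c.copy()
            for u in basis:
                r -= (u @ r) * u         # apply (I - u u^T)
            norm = np.linalg.norm(r)
            if norm > tol * max(1.0, np.linalg.norm(c)):
                basis.append(r / norm)
                break
    return np.array(basis)

if __name__ == "__main__":
    # The second vector is parallel to the first, so the fallback is used.
    print(gram_schmidt([[1.0, 0, 0], [2.0, 0, 0], [0, 3.0, 0]]))
```

A fallback always exists when $n$ is at most the dimension, because the standard basis vectors cannot all lie in the span of fewer than that many orthonormal vectors.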

We make the following assumption on the input data (see Footnote 10 for how to relax it):

$\|x_{i,1}-x_{j,1}\|\geq\delta$ for some parameter $\delta\in(0,1]$ and every pair of $i\neq j\in[n]$ .

A very important notion that this entire paper relies on is the following:

We consider the following random initialization distributions for $W$ , $A$ and $B$ .

We say that $W,A,B$ are at random initialization, if the entries of $W$ and $A$ are i.i.d. generated from $\mathcal{N}(0,\frac{2}{m})$, and the entries of $B$ are i.i.d. generated from $\mathcal{N}(0,\frac{1}{d})$.
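For concreteness, the random initialization can be sampled as in the sketch below. The matrix shapes ($W$ of size $m\times m$, $A$ of size $m\times d_{x}$, $B$ of size $d\times m$) are our assumption, based on the usual Elman recursion $h_{\ell}=\phi(Wh_{\ell-1}+Ax_{\ell})$, $y_{\ell}=Bh_{\ell}$; the helper name `random_init` and the symbol `d_x` are hypothetical.

```python
import numpy as np

def random_init(m, d_x, d, seed=0):
    """Sample W, A with i.i.d. N(0, 2/m) entries and B with i.i.d. N(0, 1/d) entries.

    Shapes are an assumption: W is (m, m), A is (m, d_x), B is (d, m).
    Note that rng.normal takes the standard deviation, not the variance.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m))
    A = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d_x))
    B = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, m))
    return W, A, B

if __name__ == "__main__":
    W, A, B = random_init(m=500, d_x=7, d=10)
    print("m * Var(W) / 2 =", 500 * W.var() / 2)
    print("d * Var(B)     =", 10 * B.var())
```

The variance $\frac{2}{m}$ is the scaling under which a ReLU layer preserves the squared norm of its input in expectation, which is what the forward-propagation bounds later rely on.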

We assume $m\geq\poly(n,d,L,\frac{1}{\delta},\log\frac{1}{\varepsilon})$ for some sufficiently large polynomial.

Without loss of generality, we assume $\delta\leq\frac{1}{CL^{2}\log^{3}m}$ for some sufficiently large constant $C$ (if this is not satisfied one can decrease $\delta$ ). Throughout the paper except the detailed appendix, we use $\widetilde{O}$ , $\widetilde{\Omega}$ and $\widetilde{\Theta}$ notions to hide polylogarithmic dependency in $m$ . To simplify notations, we denote by

2 Objective and Gradient

Using chain rule, one can write down a closed form of the (sub-)gradient:

For $k\in[m]$ , the gradient with respect to $W_{k}$ (denoted by $\nabla_{k}$ ) and the full gradient are

Our Results

Our main results can be formally stated as follows.

Suppose $\eta=\widetilde{\Theta}\big{(}\frac{\delta}{m}\poly(n,d,L)\big{)}$ and $m\geq\poly(n,d,L,\delta^{-1},\log\varepsilon^{-1})$ . Let $W^{(0)},{A},{B}$ be at random initialization. With high probability over the randomness of $W^{(0)},{A},{B}$ , if we apply gradient descent for $T$ steps $W^{(t+1)}=W^{(t)}-\eta\nabla f(W^{(t)})$ , then it satisfies

Suppose $\eta=\widetilde{\Theta}\big{(}\frac{\delta}{m}\poly(n,d,L)\big{)}$ and $m\geq\poly(n,d,L,\delta^{-1},\log\varepsilon^{-1})$ . Let $W^{(0)},{A},{B}$ be at random initialization. If we apply stochastic gradient descent for $T$ steps $W^{(t+1)}=W^{(t)}-\eta\nabla f_{i}(W^{(t)})$ for a random index $i\in[n]$ per step, then with high probability (over $W^{(0)},A,B$ and the randomness of SGD), it satisfies

In both cases, we essentially have linear convergence rates. We remark here that the $\widetilde{O}$ notation may hide additional polynomial dependency in $\log\log\varepsilon^{-1}$. This is not necessary, at the expense of slightly complicating the proofs, as shown by a follow-up. Notably, our results show that the dependency on the number of layers $L$ is polynomial. Thus, even when an RNN is applied to long input sequences, it does not suffer from exponential gradient explosion or vanishing (e.g., $2^{\Omega(L)}$ or $2^{-\Omega(L)}$) through the entire training process.
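To make the update $W^{(t+1)}=W^{(t)}-\eta\nabla f(W^{(t)})$ concrete, the sketch below computes the (sub-)gradient with respect to $W$ by back-propagation through time and runs plain GD while keeping $A,B$ fixed. It rests on assumptions of ours: the Elman recursion $h_{\ell}=\phi(Wh_{\ell-1}+Ax_{\ell})$ with $h_{0}=0$, and a squared loss $\frac{1}{2}\sum_{\ell\geq 2}\|Bh_{\ell}-y^{*}_{\ell}\|^{2}$ per sequence. All function names are hypothetical, and the step size in the usage note is chosen for the toy problem, not taken from the theorems.

```python
import numpy as np

def forward(W, A, X):
    """X has shape (L, d_x). Returns pre-activations g (L, m) and states h (L+1, m)."""
    L, m = X.shape[0], W.shape[0]
    h = np.zeros((L + 1, m))            # h[0] = 0 is the initial hidden state
    g = np.zeros((L, m))
    for l in range(L):
        g[l] = W @ h[l] + A @ X[l]
        h[l + 1] = np.maximum(g[l], 0.0)   # ReLU
    return g, h

def loss_and_grad(W, A, B, X, Y):
    """Squared loss over tokens 2..L (assumed objective) and its gradient in W."""
    L = X.shape[0]
    g, h = forward(W, A, X)
    resid = h[1:] @ B.T - Y             # row l: B h_{l+1} - y*_{l+1}
    resid[0] = 0.0                      # the first token carries no loss
    f = 0.5 * np.sum(resid ** 2)
    grad = np.zeros_like(W)
    back = np.zeros(W.shape[0])         # gradient flowing in from later steps
    for l in reversed(range(L)):
        dh = B.T @ resid[l] + back      # d f / d h[l+1]
        dg = dh * (g[l] > 0)            # ReLU sub-gradient
        grad += np.outer(dg, h[l])      # since g[l] = W h[l] + A x_l
        back = W.T @ dg
    return f, grad

def gradient_descent(W0, A, B, data, eta, steps):
    """Train only W; data is a list of (X, Y) pairs. Returns final W and loss history."""
    W = W0.copy()
    losses = []
    for _ in range(steps):
        total, grad = 0.0, np.zeros_like(W)
        for X, Y in data:
            f, gr = loss_and_grad(W, A, B, X, Y)
            total += f
            grad += gr
        losses.append(total)
        W -= eta * grad
    return W, losses
```

For SGD, the inner loop over `data` would be replaced by a single randomly chosen pair per step. A finite-difference comparison against `loss_and_grad` is a cheap way to validate the backward pass before running either method.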

Main Technical Theorems. Our main Theorem 1 and Theorem 2 are in fact natural consequences of the following two technical theorems. They both talk about the first-order behavior of RNNs when the weight matrix $W$ is sufficiently close to some random initialization.

The first theorem is similar to the classical Polyak-Łojasiewicz condition, and says that $\|\nabla f(W)\|_{F}^{2}$ is at least as large as the objective value.

With high probability over random initialization $\widetilde{W},A,B$ , it satisfies

The second theorem shows a special “semi-smoothness” property of the objective.

At a high level, the convergence proofs of GD and SGD are careful applications of the two technical theorems above: indeed, Theorem 3 shows that as long as the objective value is high, the gradient is large; and Theorem 4 shows that if one moves in the (negative) gradient direction, then the objective value can be sufficiently decreased. These two technical theorems together ensure that GD/SGD does not hit any saddle point or (bad) local minima along its training trajectory. This was practically observed by Goodfellow et al., and a theoretical justification has been open since then.
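The mechanism "large gradient whenever the objective is large, plus sufficient decrease along the gradient" can be seen on a one-dimensional toy function. The sketch below is not from the paper: it uses $f(x)=x^{2}+3\sin^{2}x$, a standard example of a non-convex function that satisfies a Polyak-Łojasiewicz inequality, with minimum value $0$ at $x=0$. Since $f''(x)=2+6\cos(2x)\leq 8$, the function is $8$-smooth and we take step size $1/8$.

```python
import numpy as np

def f(x):
    return x ** 2 + 3.0 * np.sin(x) ** 2

def grad_f(x):
    # d/dx [x^2 + 3 sin^2 x] = 2x + 3 sin(2x)
    return 2.0 * x + 3.0 * np.sin(2.0 * x)

def gradient_descent(x0, eta=1.0 / 8.0, steps=60):
    """Plain GD; returns the sequence f(x_0), f(x_1), ..., f(x_steps)."""
    x = float(x0)
    values = [f(x)]
    for _ in range(steps):
        x -= eta * grad_f(x)
        values.append(f(x))
    return np.array(values)

if __name__ == "__main__":
    vals = gradient_descent(3.0)
    print(vals[:12])
```

Although $f$ is not convex, its only stationary point is the global minimizer, so GD never stalls; this is the same qualitative picture that Theorems 3 and 4 establish for the RNN objective near random initialization.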

An Open Question. We did not try to tighten the polynomial dependencies of $(n,d,L)$ in the proofs. When $m$ is sufficiently large, we make use of the randomness at initialization to argue that, for all the points within a certain radius from initialization, for instance Theorem 3 holds. In practice, however, SGD can create additional randomness as time goes on; also, in practice, it suffices for those points on the SGD trajectory to satisfy Theorem 3. Unfortunately, such randomness can, in principle, be correlated with the SGD trajectory, so we do not know how to use that in the proofs. Analyzing such correlated randomness is certainly beyond the scope of this paper, but can possibly explain why in practice, the size of $m$ needed is not that large.

Overall, we provide the first proof of convergence of GD/SGD for non-linear neural networks that have more than two layers. We show that with overparameterization GD/SGD can avoid hitting any (bad) local minima along its training trajectory. This was practically observed by Goodfellow et al., and a theoretical justification has been open since then. We present our result using recurrent neural networks (as opposed to the simpler feedforward networks) in this very first paper, because memorization in RNN could be of independent interest. Also, our result proves that RNN can learn mappings from different input tokens to different output tokens simultaneously using the same recurrent unit.

Last but not least, we build new tools to analyze multi-layer networks with ReLU activations that could facilitate much new research on deep learning. For instance, our techniques in Section 4 provide a general theory for why ReLU activations avoid exponential exploding (see e.g. (4.1), (4.4)) or exponential vanishing (see e.g. (4.1), (4.3)); and our techniques in Section 5 give a general theory for the stability of multi-layer networks against adversarial weight perturbations, which is at the heart of showing the semi-smoothness Theorem 4, and used by all the follow-up works.

The main difficulty of this paper is to prove Theorems 3 and 4, and we shall sketch the proof ideas in Sections 4 through 7. In such high-level discussions, we shall put our emphasis on

how to avoid exponential blow up in $L$ , and

how to deal with the issue of randomness dependence across layers.

We genuinely hope that this high-level sketch can (1) give readers a clear overview of the proof without the necessity of going to the appendix, and (2) help readers appreciate our proof and understand why it is necessarily long. For instance, proving the gradient norm lower bound in Theorem 3 for a single neuron $k\in[m]$ is easy, but how does one apply concentration across neurons? Crucially, due to the recurrent structure these quantities are never independent, so we have to build the necessary probabilistic tools to tackle this. If one is willing to ignore such subtleties, then our sketched proof is sufficiently short and gives a good overview.

Basic Properties at Random Initialization

In this section we derive basic properties of the RNN when the weight matrices $W,A,B$ are all at random initialization. The corresponding precise statements and proofs are in Appendix B.

The first one says that the forward propagation neither explodes nor vanishes, that is,

This relies on a more involved inductive argument than (4.1). At a high level, one needs to show that in each layer, the amount of “fresh new randomness” reduces only by a factor at most $1-\frac{1}{10L}$.

Using (4.1) and (4.2), we obtain the following property about the data separability:

Here, we say two vectors $x$ and $y$ are $\delta$ -separable if $\big{\|}(I-yy^{\top}/\|y\|_{2}^{2})x\big{\|}\geq\delta$ and vice versa. Property (4.3) shows that the separability information (say on input token $1$ ) does not diminish by more than a polynomial factor even if the information is propagated for $L$ layers.
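The two notions in this part, bounded hidden-state norms and $\delta$-separability, can be probed numerically. The sketch below is illustrative only: the Elman recursion $h_{\ell}=\phi(Wh_{\ell-1}+Ax_{\ell})$ with $h_{0}=0$, the unit-norm random tokens, and the names `residual_norm`, `is_separable` and `hidden_norms` are our assumptions.

```python
import numpy as np

def residual_norm(x, y):
    """|| (I - y y^T / ||y||^2) x ||: the norm of the part of x orthogonal to y."""
    return np.linalg.norm(x - (y @ x) / (y @ y) * y)

def is_separable(x, y, delta):
    """delta-separability as defined above, checked in both directions."""
    return bool(residual_norm(x, y) >= delta and residual_norm(y, x) >= delta)

def hidden_norms(m=1000, d_x=8, L=20, seed=0):
    """Norms ||h_1||, ..., ||h_L|| of a ReLU Elman RNN at random initialization."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, np.sqrt(2.0 / m), (m, m))
    A = rng.normal(0.0, np.sqrt(2.0 / m), (m, d_x))
    X = rng.standard_normal((L, d_x))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm tokens (assumption)
    h = np.zeros(m)
    norms = []
    for x in X:
        h = np.maximum(W @ h + A @ x, 0.0)
        norms.append(np.linalg.norm(h))
    return np.array(norms)

if __name__ == "__main__":
    print(hidden_norms())
    e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    print(is_separable(e1, e2, 1.0), is_separable(e1, 2 * e1, 0.1))
```

In such runs the norms should grow slowly with the layer index instead of exploding or collapsing, in line with the polynomial (not exponential) dependence on $L$ discussed above.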

Intermediate Layers and Backward Propagation. Training a neural network is not only about forward propagation. We also have to bound intermediate layers and backward propagation.

Intuitively, one cannot use a spectral bound argument to derive (4.4) or (4.5): the spectral norm of ${W}$ is $2$, and even if ReLU activations cancel half of its mass, the spectral norm $\|DW\|_{2}$ remains $\sqrt{2}$. When stacked together, this grows exponentially in $L$.

We did not try to tighten the polynomial factor here in $L$ . We conjecture that proving an $O(1)$ bound may be possible, but that question itself may be a sufficiently interesting random matrix theory problem on its own.

Its proof is in the same spirit as (4.5), with the only difference being the spectral norm of $B$ is around $\sqrt{m/d}$ as opposed to $O(1)$ .

Stability After Adversarial Perturbation

In this section we study the behavior of RNN after adversarial perturbation. The corresponding precise statements and proofs are in Appendix C.

Letting $\widetilde{W},A,B$ be at random initialization, we consider some matrix $W=\widetilde{W}+W^{\prime}$ for $\|W^{\prime}\|_{2}\leq\frac{\poly(\varrho)}{\sqrt{m}}$ . Here, $W^{\prime}$ may depend on the randomness of $\widetilde{W},A$ and $B$ , so we say it can be adversarially chosen. The results of this section will later be applied essentially twice:

Once for those updates generated by GD or SGD, where $W^{\prime}$ is how much the algorithm has moved away from the random initialization.

The other time (see Section 6.3) for a technique that we call “randomness decomposition” where we decompose the true random initialization $W$ into $W=\widetilde{W}+W^{\prime}$, where $\widetilde{W}$ is a “fake” random initialization but identically distributed as $W$. Such a technique at least traces back to smoothed analysis.

To illustrate our high-level idea, from this section on (so in Sections 5, 6 and 7)

Forward Stability. Our first, and most technical, result is the following:

In our actual proof of (5.1), instead of applying induction on ③, we recursively expand ③ by the above formula. This results in a total of $L$ terms of ① type and $L$ terms of ② type. The main difficulty is to bound a term of ② type, that is:

Our argument consists of two conceptual steps.

The two steps above enable us to perform induction without exponential blow up. Indeed, they together enable us to go through the following logic chain:

Since there is a gap between $m^{-1/2}$ and $m^{-2/3}$ , we can make sure that all blow-up factors are absorbed into this gap, using the property that $m$ is polynomially large. This enables us to perform induction to prove (5.1) without exponential blow-up.

Intermediate Layers and Backward Stability. Using (5.1), and especially using the sparsity $\|D^{\prime}\|_{0}\leq m^{2/3}$ from (5.1), one can apply the results in Section 4 to derive the following stability bounds for intermediate layers and backward propagation:

Special Rank-1 Perturbation. For technical reasons, we also need two bounds in the special case of $W^{\prime}=yz^{\top}$ for some unit vector $z$ and sparse $y$ with $\|y\|_{0}\leq\poly(\varrho)$ . We prove that, for this type of rank-one adversarial perturbation, it satisfies for every $k\in[m]$ :

Proof Sketch of Theorem 3: Polyak-Łojasiewicz Condition

The upper bound in Theorem 3 is easy to prove (based on Section 4 and 5), but the lower bound (a.k.a. the Polyak-Łojasiewicz condition) is the most technically involved result to prove in this paper. We introduce the notion of “fake gradient”. Given fixed vectors $\{\operatorname{\mathsf{loss}}_{i,a}\}_{i\in[n],a\in\{2,\dots,L\}}$ , we define

For every fixed vectors $\{\operatorname{\mathsf{loss}}_{i,a}\}_{i\in[n],a\in\{2,\dots,L\}}$ , if $W,A,B$ are at random initialization, then with high probability

There are only two conceptually simple steps from Theorem 5 to Theorem 3 (see Appendix F).

First, one can use the stability lemmas in Section 5 to show that the fake gradient $\|\widehat{\nabla}f(W+W^{\prime})\|_{F}$ after adversarial perturbation $W^{\prime}$ (with $\|W^{\prime}\|_{2}\leq\frac{1}{\sqrt{m}}$) is also large.

Second, one can apply $\varepsilon$ -net and union bound to turn “fixed $\operatorname{\mathsf{loss}}$ ” into “for all $\operatorname{\mathsf{loss}}$ ”. This allows us to turn the lower bound on the fake gradient into a lower bound on the true gradient $\|\nabla_{k}f(W+W^{\prime})\|_{F}$ .

Therefore, in the rest of this section, we only sketch the ideas behind proving Theorem 5.

This should be quite intuitive to prove, in the following two steps.

6.2 Thought Experiment: Adding Small Rank-One Perturbation

At the same time, using the fact that $k$ satisfies (6.2), one can show that

Putting (a), (b), (c), and (d) together, we know for such specially chosen $k$ , at least with constant probability over the random perturbation of $W^{\prime}_{k}$ ,

6.3 Real Proof: Randomness Decomposition and McDiarmid’s Inequality

There are only two main differences between (6.6) and our desired Theorem 5. First, (6.6) gives a gradient lower bound at $W+W^{\prime}_{k}$ , while in Theorem 5 we need a gradient lower bound at random initialization $W$ . Second, (6.6) gives a lower bound on $\widehat{\nabla}_{k}f(\cdot)$ with constant probability for a small fraction of good coordinates $k$ , but in Theorem 5 we need a lower bound for the entire $\widehat{\nabla}f(\cdot)$ .

Randomness Decomposition. To fix the first issue, we resort to a randomness decomposition technique at least tracing back to the smoothed analysis of Spielman and Teng:

Given small constant $\theta\in(0,1)$ and $m$ -dimensional random $g\sim\mathcal{N}(0,\frac{1}{m}{I})$ , we can rewrite $g=g_{1}+g_{3}$ where $g_{1}$ follows from $\mathcal{N}(0,\frac{1}{m}{I})$ and $g_{3}$ is very close to $\mathcal{N}(0,\frac{\theta^{2}}{m}{I})$ .

(Note that there is no contradiction here because $g_{1}$ and $g_{3}$ shall be correlated.)
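One way to realize such a decomposition is sketched below; we present it as an assumption of ours rather than as the construction used in the appendix. Draw independent $g_{1},g_{2}\sim\mathcal{N}(0,\frac{1}{m}I)$, set $g=\sqrt{1-\theta^{2}}\,g_{1}+\theta g_{2}$, and define $g_{3}=g-g_{1}$. Then $g\sim\mathcal{N}(0,\frac{1}{m}I)$ exactly, while each entry of $g_{3}$ has variance $\frac{1}{m}\big(\theta^{2}+(1-\sqrt{1-\theta^{2}})^{2}\big)=\frac{\theta^{2}}{m}(1+O(\theta^{2}))$. The function names are hypothetical.

```python
import numpy as np

def decompose(m, theta, seed=0):
    """Return (g, g1, g3) with g = g1 + g3, g and g1 both N(0, I/m)."""
    rng = np.random.default_rng(seed)
    s = 1.0 / np.sqrt(m)
    g1 = s * rng.standard_normal(m)                     # g1 ~ N(0, I/m)
    g2 = s * rng.standard_normal(m)                     # independent of g1
    g = np.sqrt(1.0 - theta ** 2) * g1 + theta * g2     # g ~ N(0, I/m)
    g3 = g - g1      # = theta * g2 - (1 - sqrt(1 - theta^2)) * g1
    return g, g1, g3

def g3_variance_ratio(theta):
    """Per-entry variance of g3 divided by theta^2 / m (close to 1 for small theta)."""
    return (theta ** 2 + (1.0 - np.sqrt(1.0 - theta ** 2)) ** 2) / theta ** 2

if __name__ == "__main__":
    g, g1, g3 = decompose(10000, 0.1)
    print("m * Var(g)  =", 10000 * g.var())
    print("m * Var(g1) =", 10000 * g1.var())
    print("Var(g3) / (theta^2/m), exact:", g3_variance_ratio(0.1))
```

In this construction $g_{3}$ contains a small negative multiple of $g_{1}$, which is exactly the correlation that resolves the apparent contradiction noted above.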

Using Proposition 6.1, for each good coordinate $k$ , instead of “adding” perturbation $W^{\prime}_{k}$ to $W$ , we can instead decompose $W$ into $W=W_{0}+W^{\prime}_{k}$ , where $W_{0}$ is distributed in the same way as $W$ . In other words, $W_{0}$ is also at random initialization. If this idea is carefully implemented, one can immediately turn (6.6) into

Extended McDiarmid’s Inequality. To fix the second issue, one may wish to consider all the indices $k\in[m]$ satisfying (6.5) and (6.2). Since there are at least a $\frac{\delta}{\poly(\rho)}$ fraction of such coordinates, if all of them satisfied (6.7), then we would have already proved Theorem 5. Unfortunately, we can neither apply the Chernoff bound (because the events with different $k\in[m]$ are correlated), nor apply the union bound (because the event occurs only with constant probability).

Our technique is to resort to (an extended probabilistic variant of) McDiarmid’s inequality (see Appendix A.6) in a very non-trivial way to boost the confidence.

In other words, although there are $|N|$ difference terms, their total summation grows only at rate $\frac{|N|}{m^{1/6}}$ according to (6.9). After applying a variant of McDiarmid’s inequality, we derive that with high probability over $W^{\prime}_{N}$, it satisfies

Finally, by sampling sufficiently many random sets $N$ to cover the entire space $[m]$ , we can show

Proof Sketch of Theorem 4: Objective Semi-Smoothness

The objective semi-smoothness Theorem 4 turns out to be much simpler to prove than Theorem 3. It only relies on Section 4 and 5, and does not need randomness decomposition or McDiarmid’s inequality. (Details in Appendix G.)

Recall that in Theorem 4, $\widetilde{W},A,B$ are at random initialization. $\breve{W}$ is an adversarially chosen matrix with $\|\breve{W}-\widetilde{W}\|\leq\frac{\poly(\varrho)}{\sqrt{m}}$ , and $W^{\prime}$ is some other adversarial perturbation on top of $\breve{W}$ , satisfying $\|W^{\prime}\|\leq\frac{\tau_{0}}{\sqrt{m}}$ . We denote by

and plug (7.1) into (7.2) to derive our final Theorem 4.

Appendix Roadmap

Appendix A recalls some old lemmas and derives some new lemmas in probability theory.

Appendix B serves for Section 4, the basic properties at random initialization.

Appendix C serves for Section 5, the stability after adversarial perturbation.

Appendix D, E and F together serve for Section 6 and prove Theorem 3, the Polyak-Łojasiewicz condition and gradient upper bound. In particular:

Appendix D serves for Section 6.1 (the indicator and backward coordinate bounds).

Appendix E serves for Section 6.3 (the randomness decomposition and McDiarmid’s inequality) and proves Theorem 5.

Appendix F shows how to go from Theorem 5 to Theorem 3.

Appendix G serves for Section 7, the proof of Theorem 4, the objective semi-smoothness.

Appendix H gives the final proof for Theorem 1, the GD convergence theorem.

Appendix I gives the final proof for Theorem 2, the SGD convergence theorem.

Parameters. We also summarize a few parameters we shall use in the proofs.

In Definition D.1, we shall introduce two parameters

to control the thresholds of indicator functions (recall Section 6.1).

In Definition E.3, we shall introduce parameter

to describe how much randomness we want to decompose out of $W$ (recall Section 6.2).

In (E.13) of Appendix E.4, we shall choose

which controls the size of the set $N$ where we apply McDiarmid’s inequality (recall Section 6.3).

Appendix

The goal of this section is to present a list of probability tools.

In Section A.1, we recall how to swap randomness.

In Section A.2, we recall concentration bounds for the chi-square distribution.

In Section A.3, we prove a concentration bound for the sum of squares of ReLU of Gaussians.

In Section A.4 and Section A.5, we show some properties for random Gaussian vectors.

In Section A.6, we recall the classical McDiarmid’s inequality and then prove a general version of it.

If $f(X,Y)$ holds with probability at least $1-\varepsilon$ , then

with probability at least $1-\sqrt{\varepsilon}$ (over randomness of $X$ ), the following event holds,

$f(X,Y)$ holds with probability at least $1-\sqrt{\varepsilon}$ (over randomness of $Y$ ).
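The first swapping statement is an application of Markov's inequality to the quantity $1-\operatorname{Pr}_{Y}[f(X,Y)]$, whose expectation over $X$ is at most $\varepsilon$. The sketch below checks it on a finite table; the setup (independent, uniformly distributed $X$ and $Y$ indexing the rows and columns of a boolean array) and the name `swap_randomness_check` are our assumptions.

```python
import numpy as np

def swap_randomness_check(event):
    """event[x, y] is True iff f(X=x, Y=y) holds; X and Y are uniform and independent.

    Returns (eps, frac) where eps = 1 - Pr_{X,Y}[f] and frac is the fraction of
    x for which Pr_Y[f(x, Y)] >= 1 - sqrt(eps). The claim is frac >= 1 - sqrt(eps).
    """
    event = np.asarray(event, dtype=bool)
    eps = 1.0 - event.mean()
    per_x = event.mean(axis=1)                      # Pr_Y[f(x, Y)] for each x
    good_x = per_x >= 1.0 - np.sqrt(eps) - 1e-12    # small slack for rounding
    return eps, good_x.mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ev = rng.random((200, 300)) < 0.97
    ev[:5] = False                                  # a few entirely bad values of X
    eps, frac = swap_randomness_check(ev)
    print("eps =", eps, " fraction of good x =", frac, " bound =", 1 - np.sqrt(eps))
```

The bound holds for every table, not just random ones, which is why it can be applied when the bad event is chosen adversarially with respect to $X$.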

If $\operatornamewithlimits{\mathbf{Pr}}_{X,Y}[f(X,Y)\geq a]\geq\varepsilon$ , then

A.2 Concentration of Chi-Square Distribution

Let $X\sim\chi_{k}^{2}$ be a chi-squared distributed random variable with $k$ degrees of freedom, where each underlying Gaussian has zero mean and variance $\sigma^{2}$. Then

Let $x_{1},x_{2},\cdots,x_{n}$ denote i.i.d. samples from ${\cal N}(0,\sigma^{2})$ . For any $b\geq 1$ , we have

Since $\frac{2n}{\sqrt{8}b}+\frac{2n}{8b^{2}}\leq\frac{2n}{\sqrt{8}b}+\frac{2n}{8b}\leq n/b$ . Thus,

Let $x_{1},x_{2},\cdots,x_{m}$ denote i.i.d. samples from $\mathcal{N}(0,1)$ , and $y_{i}=\max\{x_{i}^{2}-\log m,0\}$ . We have

On the other hand, each random variable $y_{i}$ is $O(1)$ -subgaussian. By subgaussian concentration,

A.3 Concentration of Sum of Squares of ReLU of Gaussians

Given $n$ i.i.d. Gaussian random variables $x_{1},x_{2},\cdots,x_{n}\sim\mathcal{N}(0,\sigma^{2})$ , we have

Using the Chernoff bound, we know that with probability $1-\exp(-\varepsilon^{2}n/6)$, $\sum_{i=1}^{n}\max(x_{i},0)^{2}$ is at most a degree-$(1+\varepsilon)\frac{n}{2}$ chi-square random variable. Let us say this is the first event.

Thus, we have with probability at least $1-\exp(-\varepsilon^{2}n/2)$ ,

Let the above event denote the second event.

By taking the union bound of two events, we have with probability $1-2\exp(-\varepsilon^{2}n/6)$

Then rescaling the $\varepsilon$ , we get the desired result. ∎

Given $n$ i.i.d. Gaussian random variables $x_{1},x_{2},\cdots,x_{n}\sim\mathcal{N}(0,\sigma^{2})$ , we have

Using the Chernoff bound, we know that with probability $1-\exp(-\varepsilon^{2}n/6)$, $\sum_{i=1}^{n}\max(x_{i},0)^{2}$ is at least a degree-$(1-\varepsilon)\frac{n}{2}$ chi-square random variable. Let us say this is the first event.

Thus, we have with probability at least $1-\exp(-\varepsilon^{2}n/4)$ ,

Let the above event denote the second event.

By taking the union bound of two events, we have with probability at least $1-2\exp(-\varepsilon^{2}n/6)$ ,

Then rescaling the $\varepsilon$ , we get the desired result. ∎

Combining Lemma A.7 and Lemma A.6, we have

Given $n$ i.i.d. Gaussian random variables $x_{1},x_{2},\cdots,x_{n}\sim{\cal N}(0,\sigma^{2})$ , let $\phi(a)=\max(a,0)^{2}$ . We have

For each $i\in[m]$ , let $y_{i}=(Ax)_{i}$ . Then $y_{i}\sim\mathcal{N}(0,\widetilde{\sigma}^{2})$ , where $\widetilde{\sigma}^{2}=2\sigma^{2}/m\cdot\|x\|_{2}^{2}$ . Using Corollary A.8, we have

Since $y=Ax$, this completes the proof. ∎

$|v_{i}|$ follows i.i.d. from the following distribution: with probability $1/2$, $|v_{i}|=0$, and with the remaining probability $1/2$, $|v_{i}|$ follows the folded Gaussian distribution $|\mathcal{N}(0,\frac{2\|h\|^{2}}{m})|$.

$\frac{m\|v\|^{2}}{2\|h\|^{2}}$ is in distribution identical to $\chi^{2}_{\omega}$ (chi-square distribution of order $\omega$ ) where $\omega$ follows from binomial distribution $\mathcal{B}(m,1/2)$ ( $m$ trials each with success rate $1/2$ ).
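Both properties are easy to observe empirically. The sketch below assumes (our reading of the lost lemma statement) that $v=\phi(Wh)$ for a fixed $h$ and a matrix $W$ with i.i.d. $\mathcal{N}(0,\frac{2}{m})$ entries; it reports the fraction of zero coordinates of $v$ and the ratio $\|v\|^{2}/\|h\|^{2}$, whose expectation is $\frac{2}{m}\cdot\mathbb{E}[\omega]=1$. The name `relu_layer_stats` and the inner dimension `k` are hypothetical.

```python
import numpy as np

def relu_layer_stats(m=20000, k=30, seed=0):
    """Sample v = ReLU(W h) with W of shape (m, k), entries i.i.d. N(0, 2/m).

    Returns (fraction of zero coordinates of v, ||v||^2 / ||h||^2).
    """
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(k)
    W = rng.normal(0.0, np.sqrt(2.0 / m), (m, k))
    v = np.maximum(W @ h, 0.0)
    frac_zero = float(np.mean(v == 0.0))
    energy_ratio = float((v @ v) / (h @ h))
    return frac_zero, energy_ratio

if __name__ == "__main__":
    print(relu_layer_stats())
```

The zero fraction concentrates around $1/2$ and the energy ratio around $1$, which is the single-layer fact behind the statement that forward propagation neither explodes nor vanishes.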

A.4 Gaussian Vector Percentile: Center

Suppose $x\sim\mathcal{N}(0,\sigma^{2})$ is a Gaussian random variable. For any $t\in(0,\sigma]$ we have

Similarly, if $x\sim\mathcal{N}(\mu,\sigma^{2})$ , for any $t\in(0,\sigma]$ , we have

Let $x\sim\mathcal{N}(0,\sigma^{2}{I})$ . For any $\alpha\in(0,1/2)$ , we have with probability at least $1-\exp(-\alpha^{2}m/100)$ ,

there exists at least a $\frac{1}{2}(1-\alpha)$ fraction of indices $i$ such that $x_{i}\geq 5\alpha\sigma/16$, and

there exists at least a $\frac{1}{2}(1-\alpha)$ fraction of indices $i$ such that $x_{i}\leq-5\alpha\sigma/16$.

Let $c_{1}=4/5$ . For each $i\in[m]$ , we define random variable $y_{i}$ as

Choosing $\delta=\frac{1}{4}\alpha$ , we have

We provide the definition of $(\alpha,\sigma)$ -good. Note that this definition will be used often in the later proof.

at least a $\frac{1}{2}(1-\alpha)$ fraction of the coordinates satisfy $w_{i}\geq\alpha\sigma$; and

at least a $\frac{1}{2}(1-\alpha)$ fraction of the coordinates satisfy $w_{i}\leq-\alpha\sigma$.

Lemma A.13 gives the following immediate corollary:

Let $x\sim\mathcal{N}(0,\sigma^{2}{I})$ . For any $\alpha\in(0,1/2)$ , we have with probability at least $1-\exp(-\alpha^{2}m/100)$ that $x$ is $(\alpha,\sigma/4)$ -good.

It is clear that $Ax$ follows from a Gaussian distribution $\mathcal{N}\big{(}0,\big{(}\sum_{i=1}^{k}\sigma_{i}^{2}\|x_{i}\|_{2}^{2}\big{)}{I}\big{)}$ , so we can directly apply Corollary A.15. ∎

A.5 Gaussian Vector Percentile: Tail

Without loss of generality we only prove the result for $\|x\|=1$ .

Fixing any such $x$ and letting $\beta=\frac{\log m}{2\sqrt{m}}$ , we have $y_{i}\sim\mathcal{N}\big{(}0,\frac{2}{m}\big{)}$ so for every $p\geq 1$ , by Gaussian tail bound

Since $\beta^{2}p^{2}m\geq\beta^{2}m\gg\Omega(\log m)$ , we know that if $|y_{i}|\geq\beta p$ occurs for $q/p^{2}$ indices $i$ out of $[m]$ , this cannot happen with probability more than

Finally, by applying union bound over $p=1,2,4,8,16,\dots$ we have with probability $\geq 1-e^{-\Omega(\beta^{2}qm)}\cdot\log q$ ,

In other words, vector $y$ can be written as $y=y_{1}+y_{2}$ where $\|y_{2}\|_{\infty}\leq\beta$ and $\|y_{1}\|^{2}\leq 4q\beta^{2}\log q$ .

A.6 McDiarmid’s Inequality and An Extension

We state the standard McDiarmid’s inequality,

We prove a more general version of McDiarmid’s inequality,

Let $w_{1},\dots,w_{N}$ be independent random variables and $f\colon(w_{1},\dots,w_{N})\mapsto$ . Suppose it satisfies:

With probability at least $1-p$ over $w_{1},\dots,w_{N}$ , it satisfies

Then, $\operatornamewithlimits{\mathbf{Pr}}[f(w_{1},\dots,w_{N})\geq\mu/2]\geq 1-N^{2}\sqrt{p}-e^{-\Omega(\frac{\mu^{2}}{N(c^{2}+p)})}\enspace.$

For each $t\in[N]$ , we have with probability at least $1-\sqrt{p}$ over $w_{1},\dots,w_{t}$ , it satisfies

Define those $(w_{1},\dots,w_{t})$ satisfying the above event to be $K_{\leq t}$ .

Define random variable $X_{t}$ (which depends only on $w_{1},\dots,w_{t}$ ) as

For every $t$ and fixed $w_{1},\dots,w_{t-1}$:

If $(w_{\leq 1},\dots,w_{<t})\not\in K_{\leq 1}\times\cdots\times K_{<t}$ , then $X_{t}=X_{t-1}=N$ .

If $(w_{\leq 1},\dots,w_{<t})\in K_{\leq 1}\times\cdots\times K_{<t}$ ,

If $w_{\leq t}\not\in K_{\leq t}$ , then $X_{t}-X_{t-1}=N-\cdots\geq 0$ .

Recall from our assumption that, with probability at least $1-\sqrt{p}$ over $w_{t}$ and $w_{>t}$ , it satisfies

Taking expectation over $w_{t}$ and $w_{>t}$ , we have

This precisely means $X_{t}-X_{t-1}\geq-(c+\sqrt{p})$ .

In sum, we have just shown that $X_{t}-X_{t-1}\geq-(c+\sqrt{p})$ always holds. By applying martingale concentration (with one-sided bound),

Notice that $X_{0}=\mu$ so if we choose $t=\mu/2$ , we have

and we have $X_{N}=f(w_{1},\dots,w_{N})$ with probability at least $1-N\sqrt{p}$ (and $X_{N}=N$ with the remaining probabilities). Together, we have the desired theorem.

Appendix B Basic Properties at Random Initialization

Recall that the recursive update equation of RNN can be described as follows
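For orientation, here is a minimal sketch of an Elman-style recursion at random initialization. It assumes $h_{0}=0$ and $h_{\ell}=\phi(Wh_{\ell-1}+Ax_{\ell})$ with ReLU $\phi$ and with the entries of $W$ and $A$ drawn i.i.d. from $\mathcal{N}(0,\frac{2}{m})$; the dimensions, the unit-norm inputs and all function names are hypothetical choices made for illustration.

```python
import math
import random


def relu(z):
    return [max(t, 0.0) for t in z]


def matvec(M, v):
    return [sum(r * x for r, x in zip(row, v)) for row in M]


def gaussian_matrix(rows, cols, var, rng):
    """Matrix with i.i.d. N(0, var) entries."""
    std = math.sqrt(var)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def rnn_forward(W, A, xs):
    """Hidden states h_1..h_L with h_0 = 0 and h_l = relu(W h_{l-1} + A x_l)."""
    h = [0.0] * len(W)
    hs = []
    for x in xs:
        g = [u + v for u, v in zip(matvec(W, h), matvec(A, x))]
        h = relu(g)
        hs.append(h)
    return hs


rng = random.Random(0)
m, d_x, L = 200, 5, 6  # hypothetical sizes
W = gaussian_matrix(m, m, 2.0 / m, rng)
A = gaussian_matrix(m, d_x, 2.0 / m, rng)
xs = []
for _ in range(L):
    x = [rng.gauss(0.0, 1.0) for _ in range(d_x)]
    nx = math.sqrt(sum(t * t for t in x))
    xs.append([t / nx for t in x])  # unit-norm inputs (assumption)
hs = rnn_forward(W, A, xs)
norms = [math.sqrt(sum(t * t for t in h)) for h in hs]
print([round(nm, 2) for nm in norms])
```

With this scaling the hidden norms grow slowly with the layer index rather than exploding, which is the qualitative behaviour that the lemmas of this appendix quantify.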

We introduce two notations that shall repeatedly appear in our proofs.

Section B.4 and Section B.5 prove that the consecutive intermediate layers, in terms of spectral norm, do not explode (for full and sparse vectors respectively).

Section B.6 proves that the backward propagation does not explode.

With probability at least $1-\exp(-\Omega(m/L^{2}))$ over $W$ and $A$ ,

Next, applying Corollary A.10, we know that if $z_{1},z_{2},z_{3}$ are fixed (instead of defined as in (B.1)), then, choosing $\varepsilon=1/2L$ , we have

In particular, since we have “for all” quantifiers on $z_{1}$ and $z_{2}$ above, we can substitute the choice of $z_{1}$ and $z_{2}$ in (B.1) (which may depend on the randomness of $W$ and $A$ ). This, together with (B.1), gives

With probability at least $1-\exp(-\Omega(m/L^{2}))$ over $W$ and $A$ ,

We can define $z_{1},z_{2},z_{3}$ in the same way as (B.1). This time, we show a lower bound

Applying Corollary A.10, we know if $z_{1},z_{2},z_{3}$ are fixed (instead of defined as in (B.1)), then choosing $\varepsilon=1/8L$ ,

Substituting the choice of $z_{1},z_{2},z_{3}$ in (B.1), and the lower bound (B.4), we have

Choosing $\varepsilon=1/(4L)$ gives the desired statement. ∎

This finishes the proof of Lemma B.3. $\blacksquare$

B.2 Forward Correlation

in the same way as (B.1) and (B.3) as in the proof of Lemma B.3. We again have the entries of $M_{1},M_{2},M_{3}$ are i.i.d. from $\mathcal{N}(0,\frac{2}{m})$ (recall Footnote 13). For the quantity $M_{2}z_{2}$ , we can further decompose it as follows

where $c_{5}=\frac{1}{16\log m}$ is some fixed parameter, $\nu$ and $\nu^{\prime}$ denote two vectors that are independently generated from $\mathcal{N}(0,\frac{2{I}}{m})$ , and

It is clear that the two sides of (B.7) are identical in distribution (because $M_{2}\sim\mathcal{N}(0,\frac{2{I}}{m})$ and $(z_{2}-c_{5}\alpha)_{+}^{2}+(z_{2}^{\prime})^{2}=z_{2}^{2}$ ).

Next, suppose $z_{1}$ and $z_{2}$ are fixed (instead of depending on the randomness of $W$ and $A$ ) and satisfies Note that if $z_{1}$ and $z_{2}$ are random, then they satisfy such constraints by Lemma B.3.

We can apply Corollary A.16 to obtain the following statement: with probability at least $1-\exp(-\Omega(\alpha^{2}m))$ ,

is $(\alpha,\sigma/4)$ -good where $\sigma=\left(\frac{2}{m}\|z_{1}\|_{2}^{2}+\frac{2}{m}((z_{2}-c_{5}\alpha)_{+})^{2}+\frac{2}{m}\|z_{3}\|_{2}^{2}\right)^{1/2}$ . Using $\|z_{1}\|_{2}^{2}\geq 1/2$ , we can lower bound $\sigma^{2}$ as

$w_{1}$ corresponds to the coordinates that are $\geq\alpha/8\sqrt{m}$ ,

$w_{2}$ corresponds to the remaining coordinates, which lie within $(-\alpha/8\sqrt{m},\alpha/8\sqrt{m})$ , and

$w_{3}$ corresponds to the coordinates that are $\leq-\alpha/8\sqrt{m}$ .

We write $v=(v_{1},v_{2},v_{3})$ according to the same partition. We consider the following three cases.

For each index $k$ in the first block, we have $\big{(}\phi(w_{1}+rv_{1})\big{)}_{k}\neq\big{(}w_{1}+rv_{1}\big{)}_{k}$ only if $(rv_{1})_{k}\leq-\alpha/8\sqrt{m}$ . However, if this happens, we have

Applying Lemma A.5, we know with probability at least $1-e^{-\Omega(\sqrt{m})}$ ,

Similarly, for each index $k$ in the third block, we have $\big{(}\phi(w_{3}+rv_{3})\big{)}_{k}\neq 0$ only if $(rv_{3})_{k}\geq\alpha/8\sqrt{m}$ . Therefore, we can similarly derive that

To prove (B.9), we use triangle inequality,

Since $\|\phi(w_{2})\|_{\infty}\leq\alpha/8\sqrt{m}$ and the size of support of $\phi(w_{2})$ is at most $\alpha m$ , we have

Since $v\sim\mathcal{N}(0,\frac{2}{m}{I})$ , and since the size of support of $v_{2}$ is at most $\alpha m$ , we have $\|v_{2}\|\leq 2\sqrt{\alpha}$ with probability at least $1-e^{-\Omega(\alpha m)}$ (due to chi-square distribution concentration). Thus

Together, by triangle inequality we have $\|\phi(w_{2}+rv_{2})\|\leq\frac{\alpha^{3/2}}{4}\enspace.$ This finishes the proof of (B.9).
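The chi-square step used for $v_{2}$ can be illustrated numerically. The sketch below assumes a block of exactly $\alpha m$ coordinates of $v\sim\mathcal{N}(0,\frac{2}{m}I)$ (the worst case for the support size) and compares the largest observed block norm with $2\sqrt{\alpha}$; the values of $m$, $\alpha$ and the number of trials are hypothetical.

```python
import math
import random


def sparse_block_norm(m, alpha, rng):
    """Norm of v restricted to alpha*m coordinates, where v ~ N(0, (2/m) I)."""
    k = int(alpha * m)
    std = math.sqrt(2.0 / m)
    return math.sqrt(sum(rng.gauss(0.0, std) ** 2 for _ in range(k)))


rng = random.Random(0)
m, alpha = 4000, 0.05  # hypothetical values for illustration
worst = max(sparse_block_norm(m, alpha, rng) for _ in range(200))
bound = 2.0 * math.sqrt(alpha)
print(f"largest observed ||v_2|| = {worst:.3f}, bound 2*sqrt(alpha) = {bound:.3f}")
```

Since $\mathbb{E}\|v_{2}\|^{2}=2\alpha$ for such a block, the bound $\|v_{2}\|^{2}\leq 4\alpha$ leaves a factor-two slack, which is what makes the failure probability exponentially small in $\alpha m$.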

Denoting by $\delta=(\delta_{1},\delta_{2},\delta_{3})$ , we have

On the other hand, the random vector $z=(I-UU^{\top})w_{1}+rv_{1}$ follows from distribution $\mathcal{N}(\mu,\frac{2r^{2}}{m})$ for some fixed vector $\mu=(I-UU^{\top})w_{1}$ and has at least $\frac{m}{2}(1-\alpha)$ dimensions. By chi-square concentration, we have $\|z\|\geq r(1-3\alpha/2)$ with probability at least $1-e^{-\Omega(\alpha^{2}m)}$ . Putting these together, we have

B.3 Forward $\delta$-Separateness

We first give the definition of $\delta$ -Separable,

For any two vectors $x,y$ , we say $x$ and $y$ are $\delta$ -separable if

We say a finite set $X$ is $\delta$ -separable if for any two vectors $x,y\in X$ , $x$ and $y$ are $\delta$ -separable.

holds with probability at least $1-\exp(-\Omega(\sqrt{m}))$ .

This also implies that the randomness in $y_{1}$ is independent of the randomness in $y_{2}$ .

It is easy to see that $\|y_{1}\|_{2}=\frac{\langle x,y\rangle}{\|x\|_{2}}$ , so we can rewrite $Wy$ as follows

and we know the entries of $M_{1},M_{2},M_{3},M_{4}$ are i.i.d. from $\mathcal{N}(0,\frac{2}{m})$ , owing to a similar treatment as Footnote 13. For the vector $M_{4}z_{4}$ , we further rewrite it as

Our plan is to first use the randomness in $w$ to argue that $w$ is $(\alpha,\gamma/4\sqrt{m})$ -good. Then, conditioned on $w$ being good, we prove that the norm of $(I-UU^{\top})\phi(w+\nu^{\prime}z_{4}^{\prime})$ is lower bounded (using (B.7)).

we can use Corollary A.16 to obtain the following statement: $w$ is $(\alpha,\sigma/4)$ -good with probability at least $1-\exp(-\Omega(\alpha^{2}m))$ where $\sigma=\left(\frac{2}{m}\|z_{1}\|_{2}^{2}+\frac{2}{m}z_{2}^{2}+\frac{2}{m}\|z_{3}\|_{2}^{2}+\frac{2}{m}(z_{4}^{2}-c_{5}^{2}\alpha^{2})\right)^{1/2}$ . Using $\|z_{1}\|_{2}^{2}\geq 1/2$ , we can lower bound $\sigma^{2}$ as

In other words, $w$ is $(\alpha,\frac{1}{4\sqrt{2}\sqrt{m}})$ -good. Next, applying standard $\varepsilon$ -net argument, we have with probability at least $1-\exp(-\Omega(\alpha^{2}m))$ , for all $z_{1},z_{2},z_{4}$ satisfying (B.12), it satisfies $w$ is $(\alpha,\frac{1}{8\sqrt{m}})$ -good. This allows us to plug in the random choice of $z_{1},z_{2},z_{4}$ in (B.11).

We apply Lemma B.7 with the following setting

where $w$ is $(\alpha,\gamma/8\sqrt{m})$ -good. (We can do so because the randomness of $v$ is independent of the randomness of $U$ and $w$ .) Lemma B.7 tells us that, with probability at least $1-\exp(-\Omega(\sqrt{m}))$ over the randomness of $v$ ,

B.4 Intermediate Layers: Spectral Norm

The following lemma bounds the spectral norm of (consecutive) intermediate layers.

We start with an important claim whose proof is almost identical to Lemma B.3.

Now, to prove the spectral norm bound in Lemma B.11, we need to go from “for each $z_{b-1}$ (see Claim B.12)” to “for all $z_{b-1}$ ”. Since $z_{b-1}$ has $m$ dimensions, we cannot afford taking union bound over all possible $z_{b-1}$ (or its $\varepsilon$ -net).

By applying an $\varepsilon$ -net argument over all such possible (but sparse) $(z_{b-1})_{j}$ , we have with probability at least $1-2^{O(m/L^{3})}\exp(-\Omega(m/L^{2}))\geq 1-\exp(-\Omega(m/L^{2}))$ , the above equation holds for all possible $(z_{b-1})_{j}$ .

Next, taking a union bound over all $j\in[L^{3}]$ , we have with probability at least $1-\exp(-\Omega(m/L^{2}))$ :

Without loss of generality we assume $\|z\|_{2}=1$ . We can rewrite $My$ as follows

It is easy to see that $M_{1}$ is independent of $M_{2}$ . We can rewrite

Using Fact A.11 together with concentration bounds (for binomial distribution and for chi-square distribution), we have with probability at least $1-\exp(-\Omega(m/L^{2}))$ ,

B.5 Intermediate Layers: Sparse Spectral Norm

This section proves two results corresponding to the spectral norm of intermediate layers with respect to sparse vectors. We first show Lemma B.14; Corollary B.15 and Corollary B.16 are then direct applications of Lemma B.14.

For every $k\in[m]$ and $t\geq 2$ , with probability at least

Similar to the proof of Lemma B.11, we let $U$ denote the following column orthonormal matrix using Gram-Schmidt:

Applying $\varepsilon$ -net over all $k$ -sparse vectors $y$ , we have with probability at least $1-e^{O(k\log m)}e^{-\Omega(nLt^{2})}$ , it satisfies $\|y^{\top}WU\|^{2}\leq\frac{2nLt^{2}}{m}$ for all $k$ -sparse vectors $y$ .

Conditioning on both events happening, we have

B.6 Backward Propagation

This section proves an upper bound on the backward propagation against sparse vectors. We first show Lemma B.17; Corollary B.18 and Corollary B.19 are then immediate corollaries.

with probability at least $1-\exp(-\Omega(m/L^{2}))$ we have

Taking a union bound over the above two events, we get with probability at least $1-\exp(-\Omega(m/L^{2}))-\exp(-\Omega(t^{2}))$ ,

Appendix C Stability After Adversarial Perturbation

Section C.4 considers a special type of rank-one perturbation matrix $W^{\prime}$ , and provides stability bounds on the forward and backward propagation.

As discussed in Section 5, the results of Section C.1, C.2 and C.3 shall be used twice, once for the final training updates (see Appendix G), and once for the randomness decomposition (see Appendix E). In contrast, the results of Section C.4 shall only be used once in Appendix E.

The goal of this section is to prove Lemma C.2,

We emphasize that all these parameters are polynomial in $\varrho$ and are therefore negligible compared to $m$ .

After recursively applying (C.2), we can write

Applying Claim C.6, we have with probability at least $1-e^{-\Omega(\tau_{1}^{4/3}m^{1/3})}$ , one can write

where inequality ① is due to (C.5), and ② follows from our choice of $\tau_{2}=4L\tau_{5}\log m$ . Thus, we have shown I and II of (C.1):

so III of (C.1) holds. IV and V of (C.1) are implied by Claim C.4, and VI is implied because

Then, we have with probability at least $1-e^{-\Omega(m/L^{2})}$

Using triangle inequality, we can calculate

With probability at least $1-e^{-\Omega(\tau_{1}^{4/3}m^{1/3})}$ the following holds. Whenever

then letting $\tau_{4}=10(\tau_{1})^{2/3}$ and $\tau_{5}=3\tau_{1}$ , we have

Next, for each $j\in S_{2}$ , we must have

Similar to the proof of Lemma B.11, we let $U$ denote the following column orthonormal matrix using Gram-Schmidt:

Finally, taking $\varepsilon$ -net over all $s$ -sparse vectors $x$ , we have the desired result. ∎

Combining Claim C.4 and Claim C.5, we have

With probability at least $1-e^{-\Omega(\tau_{1}^{4/3}m^{1/3})}$ , whenever

C.2 Intermediate Layers

The goal of this subsection is to prove Lemma C.7.

On the other hand, since $\|W^{\prime}\|\leq\frac{\tau_{0}}{\sqrt{m}}$ , we also have

C.3 Backward

Without loss of generality we assume $\|a\|=1$ . We define set ${\cal C}$ to be

Again we assume $\|a\|=1$ without loss of generality. We define set ${\cal C}$ to be

It is similar to the proof of Lemma lem:backwardb. ∎

It is similar to the proof of Lemma lem:backwarda. ∎

C.4 Special Rank-One Perturbation

We prove two lemmas regarding the special rank-one perturbation with respect to a coordinate $k\in[m]$ . The first one talks about forward propagation.

After recursively calculating as above, using $h_{i,0}^{\prime}=0$ , and applying triangle inequality, we have

where the last step follows by $\|y\|_{\infty}\leq\frac{\tau_{0}}{\sqrt{m}}$ , $\|W^{\prime}\|_{2}\leq\frac{\sqrt{N}\tau_{0}}{\sqrt{m}}$ and Lemma lem:forwarda.

where ① follows by $\|z\|_{2}\leq 1$ , ② follows by Lemma B.3, ③ follows by Corollary B.16, and ④ follows by $\|y\|_{\infty}\leq\tau_{0}\frac{1}{\sqrt{m}}$ .

where the last step follows by Corollary B.15 (which relies on Lemma lem:forwardb to give $s=O(L^{5/3}\tau_{0}^{1/3}N^{1/6})$ ) together with Lemma lem:forwardc.

Putting the three bounds into (C.9) (where the first term is the dominating one), we have

The next one talks about backward propagation.

By Corollary B.15 (which relies on Lemma lem:forwardb to give $s=O(L^{5/3}\tau_{0}^{1/3}N^{1/6})$ ), we know

Finally, using the randomness of $B$ (recall that $v$ does not depend on $B$ ), we conclude that with probability at least $1-e^{-\Omega(m)}$ , it satisfies

This time, we define set ${\cal C}$ to be

Appendix D Indicator and Backward Coordinate Bounds

Suppose $W$ and $A$ follow from random initialization. Given a fixed set $N_{1}\subseteq[m]$ , define

Now, suppose $z_{1}$ is fixed (instead of depending on the randomness of $W$ and $A$ ), and it satisfies $\|z_{1}\|\leq 6L$ . Each entry of $y$ is distributed as ${\cal N}\big{(}0,\frac{2\|z_{1}\|^{2}+2\|z_{2}\|^{2}}{m}\big{)}$ . By a property of Gaussians (see Fact A.12), we have for each $y\in[m]$ ,

Since we have $|N_{1}|$ independent random variables, applying a Chernoff bound, we have

and therefore for a fixed vector $\mu$ it satisfies

Applying Chernoff bound, with probability at least $1-e^{-\Omega(|N_{2}|/n)}$ , we have $|N_{3,i}|\geq(1-\frac{1}{2n})|N_{2}|$ . Applying union bound over all possible $i\in[n]\setminus\{i^{*}\}$ , we have $N_{3}=\bigcap_{i\neq i^{*}}N_{3,i}$ has cardinality at least $\frac{|N_{2}|}{2}$ . ∎
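The final counting step is deterministic once each $|N_{3,i}|\geq(1-\frac{1}{2n})|N_{2}|$ holds: the $n-1$ sets together miss at most $(n-1)\cdot\frac{|N_{2}|}{2n}<\frac{|N_{2}|}{2}$ elements of $N_{2}$. The sketch below illustrates this with hypothetical sizes, building each subset by deleting an arbitrary batch of at most $|N_{2}|/(2n)$ elements.

```python
import random


def intersect_large_subsets(n2_size, n, rng):
    """Intersect n-1 subsets of N_2 = {0..n2_size-1}, each missing at most
    |N_2| / (2n) elements (chosen at random here, purely for illustration)."""
    base = set(range(n2_size))
    max_missing = n2_size // (2 * n)
    result = set(base)
    for _ in range(n - 1):
        missing = set(rng.sample(range(n2_size), max_missing))
        result &= base - missing
    return result


rng = random.Random(0)
n2_size, n = 1000, 10  # hypothetical sizes
n3 = intersect_large_subsets(n2_size, n, rng)
print(len(n3), n2_size / 2)
```

The bound holds for any choice of the deleted elements, so the randomness in the sketch plays no role in the guarantee.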

and therefore for a fixed vector $\mu$ it satisfies

D.2 Backward Coordinate Bound

The goal of this section is to prove Lemma D.7.

We have $h(t)$ is $\mathfrak{L}_{h}$ -Lipschitz continuous.

We define two probabilistic events $E_{1},E_{2}$ .

Event $E_{1}$ depends on the randomness of $W$ and $A$ :

Corollary B.16 says $\operatornamewithlimits{\mathbf{Pr}}[E_{1}]\geq 1-e^{-\Omega(\rho^{2})}$ .

Event $E_{2}$ depends on the randomness of $B$ :

By Gaussian tail bound, we have $\operatornamewithlimits{\mathbf{Pr}}[E_{2}]\geq 1-e^{-\Omega(\rho^{2})}$ .

In the rest of the proof, we assume $W$ and $A$ are fixed and satisfy $E_{1}$ . We let $B$ be the only source of randomness, conditioned on $B$ satisfying $E_{2}$ .

Step 1. Fixing $b_{-N}$ and letting $b_{N}$ be the only randomness, we claim for each $k\in N$ :

We prove this inequality as follows. We can rewrite

and $v_{k}$ is distributed as Gaussian random variable ${\cal N}(\mu,\sigma^{2})$ , and $\mu$ and $\sigma^{2}$ are defined as follows

where the second step follows by $|\langle a,b\rangle|\leq\|a\|_{2}\cdot\|b\|_{2}$ , the third step follows by (D.2).

Above, inequality ① follows by $h(t)\in$ , inequality ② follows from (D.2). Using McDiarmid inequality (see Lemma A.18), we have

where we choose $\varepsilon=\frac{|N|}{4nL}$ .

Appendix E Gradient Bound at Random Initialization (Theorem 5)

The goal of this section is to understand the lower and upper bounds of gradients. Instead of analyzing the true gradient directly (where the forward and backward propagation have correlated randomness), we assume that the loss vectors $\{\operatorname{\mathsf{loss}}_{i,a}\}_{i\in[n],a\in\{2,\dots,L\}}$ are fixed (no randomness) in this section, as opposed to being defined as $\operatorname{\mathsf{loss}}_{i,a}=Bh_{i,a}-y^{*}_{i,a}$ , which is random. We call this the “fake loss.” We define a corresponding “fake gradient” with respect to this fixed loss.

Given fixed vectors $\{\operatorname{\mathsf{loss}}_{i,a}\}_{i\in[n],a\in\{2,\dots,L\}}$ , we define

where inequality ① uses Lemma B.11, Lemma B.3, and $\|B\|_{2}\leq O(\sqrt{m})$ with high probability. Applying triangle inequality and using the definition of fake gradient (see Definition E.1), we finish the proof of the first statement.

As for the second statement, we replace the use of Lemma B.11 with Corollary B.19. ∎

The rest of this section is devoted to proving a (much more involved) lower bound on this fake gradient.

In the rest of this section, we first present an elegant way to decompose the randomness in Section E.1 (motivated by smoothed analysis). We give a lower bound on the expected fake gradient in Section E.2. We then calculate the stability of fake gradient against rank-one perturbations in Section E.3. Finally, in Section E.4, we apply our extended McDiarmid’s inequality to prove Theorem 5.

We introduce a parameter $\theta$ as follows:

where the notions $\beta_{-}$ and $\beta_{+}$ come from Definition D.1. We have $\theta\leq\frac{\delta}{\rho}$ .

entries of $W_{2}$ are i.i.d. drawn from $\mathcal{N}(0,\frac{2}{m})$ (so $W_{2}$ is in the same distribution as $W$ );

With probability at least $1-e^{-\Omega(\rho^{2})}$ over the randomness of $W_{2}$ :

We verify that the entries of $W_{2}$ are i.i.d. Gaussian in two steps.

By standard Gaussian tail bound, we have with probability at least $1-e^{-\Omega(\rho^{2})}$ , it satisfies $\|g_{1}\|_{\infty}\leq\frac{\rho}{\sqrt{m}}$ and therefore $\|(1-\sqrt{1-\theta^{2}})g_{1}\|_{\infty}\leq\frac{\theta^{2}\rho}{\sqrt{m}}$ . If this happens, for every $k\in[m]$ , we have (see for instance Fact A.12) that $\operatornamewithlimits{\mathbf{Pr}}_{g_{2}}[(\theta g_{2})_{k}>\frac{\theta}{\sqrt{m}}]\geq\frac{1}{4}$ and $\operatornamewithlimits{\mathbf{Pr}}_{g_{2}}[(\theta g_{2})_{k}<-\frac{\theta}{\sqrt{m}}]\geq\frac{1}{4}$ . Recalling

and using the fact that $\theta\leq\frac{1}{2\rho}$ , we conclude that

Again by Gaussian tail bound, we have with probability at least $1-e^{-\Omega(\rho^{2})}$ , it satisfies $\|g_{1}\|_{\infty},\|g_{2}\|_{\infty}\leq\frac{\rho}{\sqrt{m}}$ . Therefore, $\|(1-\sqrt{1-\theta^{2}})g_{1}\|_{\infty}\leq\frac{\theta^{2}\rho}{\sqrt{m}}$ and using our assumption on $\theta$ , we have

Now, we introduce another two ways of decomposing randomness based on this definition.

E.2 Gradient Lower Bound in Expectation

The goal of this subsection is to prove Lemma E.6 and then translate it into our Core Lemma A (see Lemma E.7).

and recall from Lemma lem:random-decomposed, we have

We let $N_{1}=N$ , apply Lemma D.2 to obtain $N_{4}\subseteq N_{1}$ with $|N_{4}|\geq\frac{\beta_{-}|N_{1}|}{64L}$ , and apply Lemma D.7 to obtain $N_{5}\subseteq N_{4}$ . According to the statements of these lemmas, we know that with probability at least $1-e^{-\Omega(\rho^{2})}$ , the random choice of $W_{1},A,B$ will satisfy $|N_{5}|\geq\frac{\beta_{-}|N_{1}|}{100L}$ .

This implies, by triangle inequality and the fact that $\|\operatorname{\mathsf{loss}}_{i,a}\|\leq\|\operatorname{\mathsf{loss}}_{i^{*},a^{*}}\|$ ,

Using Lemma B.3 and Lemma lem:forwarda we have

This concludes that, with probability at least $1-e^{-\Omega(\rho^{2})}$ over $W_{1}$ and $A$ , it satisfies

This is a fixed value, independent of $W^{\prime}_{N}$ .

Finally, for each $k\in N_{5}$ , owing to Lemma D.7 and Lemma B.3, we have

Putting (E.2) and (E.7) back to (E.1), we conclude that

In Lemma E.6, we split $W$ into $W=W_{1}+W_{N}^{\prime}$ for a fixed $N\subseteq[m]$ . In this subsection, we split $W$ into three parts $W=W_{2}+W_{N}^{\prime}+W_{-N}^{\prime}$ following Lemma def:random-decompc. Our purpose is to rewrite Lemma E.6 into the following variant:

with probability $\geq 1-e^{-\Omega(\rho^{2})}$ over the randomness of $W_{2},W_{-N}^{\prime},A,B,N$ , the following holds:

or putting it in another way, using our choice of $q$ ,

with probability $\geq 1-e^{-\Omega(\rho^{2})}$ over the randomness of $W_{2},W_{-N}^{\prime},A,B,N$ , the following holds ( $q=\frac{\beta_{-}|N|}{c\rho^{2}}$ for some sufficiently large constant $c$ ):

Applying simple tricks to switch the ordering of randomness (see Fact A.2), we have

with probability $\geq 1-e^{-\Omega(\rho^{2})}$ over the randomness of $W_{2},A,B$ , the following holds:

with probability $\geq 1-e^{-\Omega(\rho^{2})}$ over the randomness of $N$ , the following holds:

with probability $\geq 1-e^{-\Omega(\rho^{2})}$ over randomness of $W_{-N}^{\prime}$ , the following holds:

Repeating the choice of $N$ for $t$ times, and using union bound, we have

with probability $\geq 1-e^{-\Omega(\rho^{2})}$ over the randomness of $W_{2},A,B$ , the following holds:

with probability $\geq 1-e^{-\Omega(\rho^{2})}$ over the randomness of $N_{1},N_{2},\cdots,N_{t}$ , the following holds:

for all $N\in\{N_{1},N_{2},\cdots,N_{t}\}$ , the following holds:

with probability $\geq 1-e^{-\Omega(\rho^{2})}$ over randomness of $W_{-N}^{\prime}$ , the following holds:

Combining the above statement with Fact A.1 (to merge randomness), we conclude the proof. ∎

E.3 Gradient Stability

The goal of this subsection is to prove Lemma E.8 and then translate it into our Core Lemma B (see Lemma E.10).

$\left|\left\{k\in[m]~{}\bigg{|}~{}\Big{|}\|\widehat{\nabla}_{k}f(\widetilde{W}+W^{\prime\prime}_{j})\|_{2}^{2}-\|\widehat{\nabla}_{k}f(\widetilde{W})\|_{2}^{2}\Big{|}\leq O\left(\frac{\rho^{11}\theta^{1/3}}{m^{1/6}}\right)\right\}\right|\geq\left(1-\frac{\rho^{5}\theta^{2/3}}{m^{1/3}}\right)m\enspace.$

$\left|\|\widehat{\nabla}_{k}f(\widetilde{W}+W^{\prime\prime}_{j})\|_{2}^{2}-\|\widehat{\nabla}_{k}f(\widetilde{W})\|_{2}^{2}\right|\leq O(\rho^{6})$ for every $k\in[m]$ .

Combining this with $\|\widehat{\nabla}_{k}f(\widetilde{W})\|_{2}\leq O(\rho^{4})$ from Lemma E.2, we finish the proof of the first item.

Now, for the indices in $N$ that are randomly sampled from $[m]$ , if $|N\cap J|\geq|N|-S$ for a parameter $S=\rho^{2}$ , then we have

Otherwise, if $|N\cap J|\leq|N|-S$ , this means at least $S$ indices that are chosen from $N$ are outside $J\subseteq[m]$ . This happens with probability at most $\left(\frac{\rho^{5}\theta^{2/3}}{m^{1/3}}\right)^{S}\leq e^{-\Omega(\rho^{2})}$ . ∎

In this subsection, we split $W$ into three parts $W=W_{2}+W_{N}^{\prime}+W_{-N}^{\prime}$ following def:random-decomp:N-N. Our purpose is to rewrite Lemma E.8 into the following variant:

with probability $\geq 1-e^{-\Omega(\rho^{2})}$ over random $N\subset[m]$ , random $j\in N$ , random $W,A,B$ , and random $W_{j}^{\prime}$ , the following holds:

We can split the randomness (see Fact A.1) and derive that

With probability $\geq 1-e^{-\Omega(\rho^{2})}$ over $N\subseteq[m]$ , $W,A,B$ , the following holds:

With probability $\geq 1-e^{-\Omega(\rho^{2})}$ over $j\in N$ and $W_{j}^{\prime\prime}$ , Eq. (E.11) holds.

Applying standard $\varepsilon$ -net argument, we derive that

With probability $\geq 1-e^{-\Omega(\rho^{2})}$ over $N\subseteq[m]$ , $W,A,B$ , the following holds:

Finally, letting $W=W_{2}+W_{N}^{\prime}+W_{-N}^{\prime}$ , we have

With probability $\geq 1-e^{-\Omega(\rho^{2})}$ over the randomness of $N\subseteq[m],W_{2},W_{N},W_{-N}^{\prime},A,B$ , the following holds:

Finally, splitting the randomness (using Fact A.1), and applying a union bound over multiple samples $N_{1},\dots,N_{t}$ , we finish the proof. ∎

E.4 Main Theorem: Proof of Theorem 5

Combining Core Lemma A and B (i.e., Lemma E.7 and Lemma E.10), we know that if $|N|$ is appropriately chosen, with probability $\geq 1-e^{-\Omega(\rho^{2})}$ over the randomness of $W_{2},A,B$ and $N_{1},\cdots,N_{t}$ , the following holds:

for every $N\in\{N_{1},N_{2},\cdots,N_{t}\}$ , the following holds:

with probability at least $1-e^{-\Omega(\rho^{2})}$ over the randomness of $W_{-N}^{\prime}$ , the following boxed statement holds:

Below, we condition on the high probability event (see Lemma lem:random-decomposed) that $\|u_{N}\|_{\infty}\leq\frac{3\theta\rho}{2\sqrt{m}}$ . The boxed statement tells us:

With probability $\geq 1-e^{-\Omega(\rho^{2})}$ over $u_{N}$ , it satisfies

At the same time, we have $0\leq\digamma(u_{N})\leq O(N\rho^{8})$ owing to Lemma E.2. Therefore, scaling down $\digamma(u_{N})$ by $\Theta(N\rho^{8})$ (to make sure the function value stays in $[0,1]$ ), we can apply the extended McDiarmid’s inequality (see Lemma A.19) and obtain

As long as $N\geq\frac{\rho^{22}}{\beta_{-}^{2}}$ , we have that the above probability is at least $1-e^{-\Omega(\rho^{2})}$ .

in order to satisfy Lemma E.7. We replace the boxed statement with (E.12). This tells us

with probability $\geq 1-e^{-\Omega(\rho^{2})}$ over the randomness of $W_{2},A,B$ and $N_{1},\cdots,N_{t}$ , the following holds:

for every $N\in\{N_{1},N_{2},\cdots,N_{t}\}$ , the following holds:

with probability at least $1-e^{-\Omega(\rho^{2})}$ over the randomness of $W_{-N}^{\prime}$ and $W^{\prime}_{N}$ , the following holds:

After rearranging randomness, and using $W=W_{2}+W_{N}^{\prime}+W_{-N}^{\prime}$ , we have

with probability $\geq 1-e^{-\Omega(\rho^{2})}$ over the randomness of $W,A,B$ , the following holds:

with probability $\geq 1-e^{-\Omega(\rho^{2})}$ over the randomness of $N_{1},\cdots,N_{t}$ , the following holds:

for every $N\in\{N_{1},N_{2},\cdots,N_{t}\}$ , the following holds:

Finally, since $N_{1},\dots,N_{t}$ are $t$ random subsets of $[m]$ , we know that with probability at least $1-e^{-\Omega(\rho^{2})}$ , for each index $k\in[m]$ , it is covered by at most $\rho^{2}$ random subsets. Therefore,

Appendix F Gradient Bound After Perturbation (Theorem 3)

We use the same fake gradient notion (see Definition E.1) and first derive the following result based on Lemma E.2 and Theorem 5.

Lower bound on $\|\widehat{\nabla}f(\widetilde{W}+W^{\prime})\|_{F}$ . Recall

In Theorem 5 of the previous section, we already have a lower bound on $\|\widehat{\nabla}f(\widetilde{W})\|_{F}$ . We just need to upper bound $\|\widehat{\nabla}f(\widetilde{W}+W^{\prime})-\widehat{\nabla}f(\widetilde{W})\|_{F}$ .

where ① follows by definition and ② and ③ follow by the triangle inequality. Note that in inequality ③ we have hidden four more higher order terms in $o(m^{1/3})$ . We ignore the details for how to bound them, for the ease of presentation. We bound the three terms separately:

Using Corollary B.18 (with $s^{2}=O(L^{10/3}\tau_{0}^{2/3})$ from Lemma lem:forwardb) and Lemma B.3 we have

Finally, using $(a-b)^{2}\geq\frac{1}{2}a^{2}-b^{2}$ , we have

where the last step follows by (F.1) (with our sufficiently large choice of $m$ ) and Theorem 5.

Upper bound on $\|\widehat{\nabla}f(\widetilde{W}+W^{\prime})\|_{F}$ . Using $(a+b)^{2}\leq 2a^{2}+2b^{2}$ , we have

where ① follows by Lemma E.2 and ② follows by (F.1).
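For completeness, the two elementary inequalities invoked in this proof follow from bounding the cross term $2ab$. A short derivation for arbitrary real $a,b$, using $2ab\leq\frac{1}{2}a^{2}+2b^{2}$ (equivalent to $(a/\sqrt{2}-\sqrt{2}b)^{2}\geq 0$) and $2ab\leq a^{2}+b^{2}$:

```latex
\begin{align*}
(a-b)^{2} &= a^{2}-2ab+b^{2}
  \;\geq\; a^{2}-\Big(\tfrac{1}{2}a^{2}+2b^{2}\Big)+b^{2}
  \;=\; \tfrac{1}{2}a^{2}-b^{2}, \\
(a+b)^{2} &= a^{2}+2ab+b^{2}
  \;\leq\; a^{2}+\big(a^{2}+b^{2}\big)+b^{2}
  \;=\; 2a^{2}+2b^{2}.
\end{align*}
```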

Upper bound on $\|\widehat{\nabla}f_{i}(\widetilde{W}+W^{\prime})\|_{F}$ . This is completely analogous so we do not replicate the proofs here. ∎

Appendix G Objective Semi-Smoothness (Theorem 4)

$\|h_{a}\|\leq O(L)$ (by $\|\widetilde{h}_{a}\|\leq O(L)$ from Lemma B.3 and $\|\widetilde{h}_{a}-h_{a}\|\leq o(1)$ from Lemma lem:forwarda); and

$\|W^{\prime}h_{a}\|\leq\|W^{\prime}\|_{2}\|h_{a}\|$ .

Above, ① is by the definition of $f(\cdot)$ ; ② is by (G.2); ③ is by the definition of $\nabla f(\cdot)$ (see Fact 2.5 for an explicit form of the gradient); ④ is by Claim G.2.

where ① uses Lemma C.7 and ② uses Claim G.2 to bound $\|h_{a}-\breve{h}_{a}\|_{2}$ .

Putting (G.4), (G.5) and (G.6) back to (G.3), and using triangle inequality, we have the desired result. ∎

$|D^{\prime\prime}_{k,k}|\leq 1$ for every $k\in[m]$ ,

$D^{\prime\prime}_{k,k}\neq 0$ only when $\mathds{1}_{a_{k}\geq 0}\neq\mathds{1}_{b_{k}\geq 0}$ , and

$\phi(a)-\phi(b)=D(a-b)+D^{\prime\prime}(a-b)$

We verify coordinate by coordinate for each $k\in[m]$ .

If $a_{k}\geq 0$ and $b_{k}\geq 0$ , then $(\phi(a)-\phi(b))_{k}=a_{k}-b_{k}=\big{(}D(a-b)\big{)}_{k}$ .

If $a_{k}<0$ and $b_{k}<0$ , then $(\phi(a)-\phi(b))_{k}=0-0=\big{(}D(a-b)\big{)}_{k}$ .

If $a_{k}\geq 0$ and $b_{k}<0$ , then $(\phi(a)-\phi(b))_{k}=a_{k}=(a_{k}-b_{k})+\frac{b_{k}}{a_{k}-b_{k}}(a_{k}-b_{k})=\big{(}D(a-b)+D^{\prime\prime}(a-b)\big{)}_{k}$ , if we define $(D^{\prime\prime})_{k,k}=\frac{b_{k}}{a_{k}-b_{k}}\in[-1,0]$ .

If $a_{k}<0$ and $b_{k}\geq 0$ , then $(\phi(a)-\phi(b))_{k}=-b_{k}=0\cdot(a_{k}-b_{k})+\frac{b_{k}}{b_{k}-a_{k}}(a_{k}-b_{k})=\big{(}D(a-b)+D^{\prime\prime}(a-b)\big{)}_{k}$ , if we define $(D^{\prime\prime})_{k,k}=\frac{b_{k}}{b_{k}-a_{k}}\in[0,1]$ . ∎
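The coordinate-wise case analysis can be checked mechanically. The sketch below takes $D_{k,k}=\mathds{1}_{a_{k}\geq 0}$ (our reading of the four cases, stated here as an assumption) and builds $D^{\prime\prime}$ from the two mixed-sign cases; the function names are illustrative.

```python
import random


def relu(z):
    return [max(t, 0.0) for t in z]


def decompose(a, b):
    """Diagonals (D, D2) such that relu(a) - relu(b) = D*(a-b) + D2*(a-b) coordinate-wise.
    D_k = 1 if a_k >= 0 else 0 (assumed definition of D)."""
    D, D2 = [], []
    for ak, bk in zip(a, b):
        D.append(1.0 if ak >= 0 else 0.0)
        if ak >= 0 and bk < 0:
            D2.append(bk / (ak - bk))  # lies in [-1, 0)
        elif ak < 0 and bk >= 0:
            D2.append(bk / (bk - ak))  # lies in [0, 1)
        else:
            D2.append(0.0)  # same sign pattern: no correction needed
    return D, D2


rng = random.Random(0)
a = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
b = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
D, D2 = decompose(a, b)
lhs = [p - q for p, q in zip(relu(a), relu(b))]
rhs = [(d + d2) * (p - q) for d, d2, p, q in zip(D, D2, a, b)]
max_err = max(abs(l - r) for l, r in zip(lhs, rhs))
print(f"max coordinate error = {max_err:.2e}")
```

The correction $D^{\prime\prime}$ is supported only on coordinates where the signs of $a_{k}$ and $b_{k}$ differ, which is what makes it sparse in the applications.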

Appendix H Convergence Rate of Gradient Descent (Theorem 1)

In the rest of the proof, we first assume that for every $t=0,1,\dots,T-1$ , the following holds

We shall prove the convergence of gradient descent assuming (H.1), so that previous statements such as Theorem 4 and Theorem 3 can be applied. At the end of the proof, we shall verify that (H.1) is satisfied throughout the gradient descent process.

We need to verify for each $t$ , $\|W^{(t)}-W^{(0)}\|_{F}$ is small so that (H.1) holds. By Theorem 3,

where the last step follows by our choice of $T$ . ∎

Appendix I Convergence Rate of Stochastic Gradient Descent (Theorem 2)

The proof is almost identical to that of Theorem 1. We again have with probability at least $1-e^{-\Omega(\rho^{2})}$

Again, we first assume for every $t=0,1,\dots,T-1$ , the following holds

We shall prove the convergence of SGD assuming (I.1), so that previous statements such as Theorem 4 and Theorem 3 can be applied. At the end of the proof, we shall verify that (I.1) is satisfied throughout the SGD with high probability.

At the same time, we also have the following absolute value bound:

Above, ① uses Theorem 4 and Cauchy-Schwarz $\langle A,B\rangle\leq\|A\|_{F}\|B\|_{F}$ , and ② uses Theorem 3 and the derivation from (I.2).

By one-sided Azuma’s inequality (a.k.a. martingale concentration), we have with probability at least $1-e^{-\Omega(\rho^{2})}$ , for every $t=1,2,\dots,T$ :

On one hand, after $T=\Omega(\frac{\rho^{15}\log(nL^{2}/\varepsilon)}{\eta\delta m})$ iterations we have

Therefore, we have $f(W^{(T)})\leq\varepsilon$ .

On the other hand, for every $t=1,2,\dots,T$ , we have

where in ① we have used $2a\sqrt{t}-b^{2}t=-(b\sqrt{t}-a/b)^{2}+a^{2}/b^{2}$ , and in ② we have used $\eta\leq O\big{(}\frac{\delta}{\rho^{42}m}\big{)}$ . This implies $f(W^{(t)})\leq O(n\rho^{2}L^{3})$ . We can now verify for each $t$ , $\|W^{(t)}-W^{(0)}\|_{F}$ is small so that (I.1) holds. By Theorem 3,

where the last step follows by our choice of $T$ . ∎
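For reference, the completing-the-square identity invoked in step ① of this proof can be verified by direct expansion. For any real $a$, any $b\neq 0$ and any $t\geq 0$ (the condition $b\neq 0$ is our assumption, needed for $a/b$ to be defined):

```latex
\begin{align*}
-\Big(b\sqrt{t}-\frac{a}{b}\Big)^{2}+\frac{a^{2}}{b^{2}}
  &= -\Big(b^{2}t-2a\sqrt{t}+\frac{a^{2}}{b^{2}}\Big)+\frac{a^{2}}{b^{2}}
   = 2a\sqrt{t}-b^{2}t ,
\end{align*}
```

so in particular $2a\sqrt{t}-b^{2}t\leq a^{2}/b^{2}$ uniformly in $t$.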
