On the Convergence Rate of Training Recurrent Neural Networks
Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song
Neural networks have been one of the most powerful tools in machine learning over the past a few decades . The multi-layer structure of neural network gives it supreme power in expressibility and learning performance. However, it raises complexity concerns: the training objective is generally non-convex and non-smooth. In practice, local-search algorithms such as stochastic gradient descent (SGD) are capable of finding global optima, at least on the training data . How SGD avoids local minima for such objectives remains an open theoretical question since Goodfellow et al. .
In recent years, there have been a number of theoretical results aiming at a better understanding of this phenomenon. Many of them focus on two-layer (thus one-hidden-layer) neural networks and assume that the inputs are random Gaussian or sufficiently close to Gaussian . Some study deep neural networks but assuming the activation function is linear . Some study the convex task of training essentially only the last layer of the network . On the technique side, some of these results try to understand the gradient dynamics , while others focus on the geometry properties of the training objective .
In this paper, we show GD and SGD are capable of training multi-layer neural networks (with ReLU activation) to global minima on any non-degenerate training data set. Furthermore, the running time is polynomial in the number of layers and the number of data points. Since there are many different types of multi-layer networks (convolutional, feedforward, recurrent, etc.), in this present paper, we focus on recurrent neural networks (RNN) as our choice of multi-layer networks, and feedforward networks are only its “special case” (see for instance a follow-up work ).
Recurrent Neural Networks. Among different architectures of neural networks, one of the least theoretically-understood structure is the recurrent one . A recurrent neural network recurrently applies the same network unit to a sequence of input tokens, such as a sequence of words in a language sentence. RNN is particularly useful when there are long-term, non-linear interactions between input tokens in the same sequence. These networks are widely used in practice for natural language processing, language generation, machine translation, speech recognition, video and music processing, and many other tasks . On the theory side, while there are some attempts to show that an RNN is more expressive than a feedforward neural network , when and how an RNN can be efficiently learned has little theoretical explanation.
In practice, RNN is usually trained by simple local-search algorithms such as SGD. However, unlike shallow networks, the training process of RNN often runs into the trouble of vanishing or exploding gradient . That is, the value of the gradient becomes exponentially small or large in the time horizon, even when the training objective is still constant. Intuitively, an RNN recurrently applies the same network unit for times if the input sequence is of length . When this unit has “operator norm” larger than one or smaller than one, the final output can possibly exponentially explode or vanish in . More importantly, when one back propagates through time—which intuitively corresponds to applying the reverse unit multiple times— the gradient can also vanish or explode. Controlling the operator norm of a non-linear operator can be quite challenging. In practice, one of the popular ways to resolve this is by the long short term memory (LSTM) structure . However, one can also use rectified linear units (ReLUs) as activation functions to avoid vanishing or exploding gradient . In fact, one of the earliest adoptions of ReLUs was on applications of RNNs for this purpose twenty years ago . For a detailed survey on RNN, we refer the readers to Salehinejad et al. .
In this paper, we study the following general question
Can ReLU provably stabilize the training process and avoid vanishing/exploding gradient?
Can RNN be trained close to zero training error efficiently under mild assumptions?
When there is no activation function, RNN is known as linear dynamical system. Hardt, Ma, and Recht first proved the convergence of finding global minima for such linear dynamical systems. Followups in this line of research include .
Motivations. One may also want to study whether RNN can be trained close to zero test error. However, unlike feedforward networks, the training error, or the ability to memorize examples, may actually be desirable for RNN. After all, many tasks involving RNN are related to memories, and certain RNN units are even referred to memory cells. Since RNN applies the same network unit to all input tokens in a sequence, the following question can possibly of its own interest:
How does RNN learn mappings (say from token 3 to token 7) without destroying others?
Another motivation is the following. An RNN can be viewed as a space constraint, differentiable Turing machine, except that the input is only allowed to be read in a fixed order. It was shown in Siegelmann and Sontag that all Turing machines can be simulated by fully-connected recurrent networks built of neurons with non-linear activations. In practice, RNN is also used as a tool to build neural Turing machines , equipped with a grand goal of automatically learning an algorithm based on the observation of the inputs and outputs. To this extent, we believe the task of understanding the trainability as a first step towards understanding RNN can be meaningful on its own.
Our Result. To present the simplest result, we focus on the classical Elman network with ReLU activation:
If the number of neurons is polynomially large, we can find weight matrices where the RNN gives training error
if gradient descent (GD) is applied for T=\Omega\big{(}\frac{\poly(n,d,L)}{\delta^{2}}\log\frac{1}{\varepsilon}\big{)} iterations, starting from random Gaussian initializations; or
if (mini-batch or regular) stochastic gradient descent (SGD) is applied for T=\Omega\big{(}\frac{\poly(n,d,L)}{\delta^{2}}\log\frac{1}{\varepsilon}\big{)} iterations, starting from random Gaussian initializations.At a first glance, one may question how it is possible for SGD to enjoy a logarithmic time dependency in ; after all, even when minimizing strongly-convex and Lipschitz-smooth functions, the typical convergence rate of SGD is as opposed to . We quickly point out there is no contradiction here if the stochastic pieces of the objective enjoy a common global minimizer. In math terms, suppose we want to minimize some function , and suppose is the global minimizer of convex functions . Then, if is -strongly convex, and each each is -Lipschitz smooth, then SGD—moving in negative direction of for a random per step— can find -minimizer of this function in O\big{(}\frac{L^{2}}{\sigma^{2}}\log\frac{1}{\varepsilon}\big{)} iterations.
(To present the simplest possible result, we have not tried to tighten the polynomial dependency with respect to and . We only tightened the dependency with respect to and .)
Our Contribution. We summarize our contributions as follows.
We believe this is the first proof of convergence of GD/SGD for training the hidden layers of recurrent neural networks (or even for any multi-layer networks of more than two layers) when activation functions are present.Our theorem holds even when are at random initialization and only the hidden weight matrix is trained. This is much more difficult to analyze than the convex task of training only the last layer . Training only the last layer can significantly reduce the learning power of (recurrent or not) neural networks in practice.
Our results provide arguably the first theoretical evidence towards the empirical finding of Goodfellow et al. on multi-layer networks, regarding the ability of SGD to avoid (spurious) local minima. Our theorem does not exclude the existence of bad local minima
We build new technical toolkits to analyze multi-layer networks with ReLU activation, which have now found many applications . For instance, combining this paper with new techniques, one can derive guarantees on testing error for RNN in the PAC-learning language .
Extension: Deep RNN. Elman RNN is also referred to as three-layer RNN, and one may also study the convergence of RNNs with more hidden layers. This is referred to as deep RNN . Our theorem also applies to deep RNNs (by combining this paper together with ).
2 Other Related Works
Another relevant work is Brutzkus et al. where the authors studied over-paramterization in the case of two-layer neural network under a linear-separable assumption.
Instead of using randomly initialized weights like this paper, there is a line of work proposing algorithms using weights generated from some “tensor initialization” process .
There is huge literature on using the mean-field theory to study neural networks . At a high level, they study the network dynamics at random initialization when the number of hidden neurons grow to infinity, and use such initialization theory to predict performance after training. However, they do not provide theoretical convergence rate for the training process (at least when the number of neurons is finite).
Notations and Preliminaries
Note that in the occasion that is the zero vector, we let be an arbitrary unit vector that is orthogonal to .
We make the following assumption on the input data (see Footnote 10 for how to relax it):
for some parameter and every pair of .
A very important notion that this entire paper relies on is the following:
We consider the following random initialization distributions for , and .
We say that are at random initialization, if the entries of and are i.i.d. generated from , and the entries of are i.i.d. generated from .
We assume for some sufficiently large polynomial.
Without loss of generality, we assume for some sufficiently large constant (if this is not satisfied one can decrease ). Throughout the paper except the detailed appendix, we use , and notions to hide polylogarithmic dependency in . To simplify notations, we denote by
2 Objective and Gradient
Using chain rule, one can write down a closed form of the (sub-)gradient:
For , the gradient with respect to (denoted by ) and the full gradient are
Our Results
Our main results can be formally stated as follows.
Suppose \eta=\widetilde{\Theta}\big{(}\frac{\delta}{m}\poly(n,d,L)\big{)} and . Let be at random initialization. With high probability over the randomness of , if we apply gradient descent for steps , then it satisfies
Suppose \eta=\widetilde{\Theta}\big{(}\frac{\delta}{m}\poly(n,d,L)\big{)} and . Let be at random initialization. If we apply stochastic gradient descent for steps for a random index per step, then with high probability (over and the randomness of SGD), it satisfies
In both cases, we essentially have linear convergence rates. We remark here that the notation may hide additional polynomial dependency in . This is not necessary, at the expense of slightly complicating the proofs, as shown by follow up . Notably, our results show that the dependency of the number of layers , is polynomial. Thus, even when RNN is applied to sequences of long input data, it does not suffer from exponential gradient explosion or vanishing (e.g., or ) through the entire training process.
Main Technical Theorems. Our main Theorem 1 and Theorem 2 are in fact natural consequences of the following two technical theorems. They both talk about the first-order behavior of RNNs when the weight matrix is sufficiently close to some random initialization.
The first theorem is similar to the classical Polyak-Łojasiewicz condition , and says that is at least as large as the objective value.
With high probability over random initialization , it satisfies
The second theorem shows a special “semi-smoothness” property of the objective.
At a high level, the convergence of GD and SGD are careful applications of the two technical theorems above: indeed, Theorem 3 shows that as long as the objective value is high, the gradient is large; and Theorem 4 shows that if one moves in the (negative) gradient direction, then the objective value can be sufficiently decreased. These two technical theorems together ensure that GD/SGD does not hit any saddle point or (bad) local minima along its training trajectory. This was practically observed by Goodfellow et al. and a theoretical justification was open since then.
An Open Question. We did not try to tighten the polynomial dependencies of in the proofs. When is sufficiently large, we make use of the randomness at initialization to argue that, for all the points within a certain radius from initialization, for instance Theorem 3 holds. In practice, however, the SGD can create additional randomness as time goes; also, in practice, it suffices for those points on the SGD trajectory to satisfy Theorem 3. Unfortunately, such randomness can— in principle— be correlated with the SGD trajectory, so we do not know how to use that in the proofs. Analyzing such correlated randomness is certainly beyond the scope of this paper, but can possibly explain why in practice, the size of needed is not that large.
Overall, we provide the first proof of convergence of GD/SGD for non-linear neural networks that have more two layers. We show with overparameterization GD/SGD can avoid hitting any (bad) local minima along its training trajectory. This was practically observed by Goodfellow et al. and a theoretical justification was open since then. We present our result using recurrent neural networks (as opposed to the simpler feedforward networks ) in this very first paper, because memorization in RNN could be of independent interest. Also, our result proves that RNN can learn mappings from different input tokens to different output tokens simultaneously using the same recurrent unit.
Last but not least, we build new tools to analyze multi-layer networks with ReLU activations that could facilitate many new research on deep learning. For instance, our techniques in Section 4 provide a general theory for why ReLU activations avoid exponential exploding (see e.g. (4.1), (4.4)) or exponential vanishing (see e.g. (4.1), (4.3)); and our techniques in Section 5 give a general theory for the stability of multi-layer networks against adversarial weight perturbations, which is at the heart of showing the semi-smoothness Theorem 4, and used by all the follow-up works .
The main difficulty of this paper is to prove Theorem 3 and 4, and we shall sketch the proof ideas in Section 4 through 7. In such high-level discussions, we shall put our emphasize on
how to avoid exponential blow up in , and
how to deal with the issue of randomness dependence across layers.
We genuinely hope that this high-level sketch can (1) give readers a clear overview of the proof without the necessity of going to the appendix, and (2) appreciate our proof and understand why it is necessarily long.For instance, proving gradient norm lower bound in Theorem 3 for a single neuron is easy, but how to apply concentration across neurons? Crucially, due to the recurrent structure these quantities are never independent, so we have to build necessary probabilistic tools to tackle this. If one is willing to ignore such subtleties, then our sketched proof is sufficiently short and gives a good overview.
Basic Properties at Random Initialization
In this section we derive basic properties of the RNN when the weight matrices are all at random initialization. The corresponding precise statements and proofs are in Appendix B.
The first one says that the forward propagation neither explodes or vanishes, that is,
This relies on a more involved inductive argument than (4.1). At high level, one needs to show that in each layer, the amount of “fresh new randomness” reduces only by a factor at most .
Using (4.1) and (4.2), we obtain the following property about the data separability:
Here, we say two vectors and are -separable if \big{\|}(I-yy^{\top}/\|y\|_{2}^{2})x\big{\|}\geq\delta and vice versa. Property (4.3) shows that the separability information (say on input token ) does not diminish by more than a polynomial factor even if the information is propagated for layers.
Intermediate Layers and Backward Propagation. Training neural network is not only about forward propagation. We also have to bound intermediate layers and backward propagation.
Intuitively, one cannot use spectral bound argument to derive (4.4) or (4.5): the spectral norm of is , and even if ReLU activations cancel half of its mass, the spectral norm remains to be . When stacked together, this grows exponential in .
We did not try to tighten the polynomial factor here in . We conjecture that proving an bound may be possible, but that question itself may be a sufficiently interesting random matrix theory problem on its own.
Its proof is in the same spirit as (4.5), with the only difference being the spectral norm of is around as opposed to .
Stability After Adversarial Perturbation
In this section we study the behavior of RNN after adversarial perturbation. The corresponding precise statements and proofs are in Appendix C.
Letting be at random initialization, we consider some matrix for . Here, may depend on the randomness of and , so we say it can be adversarially chosen. The results of this section will later be applied essentially twice:
Once for those updates generated by GD or SGD, where is how much the algorithm has moved away from the random initialization.
The other time (see Section 6.3) for a technique that we call “randomness decomposition” where we decompose the true random initialization into , where is a “fake” random initialization but identically distributed as . Such technique at least traces back to smooth analysis .
To illustrate our high-level idea, from this section on (so in Section 5, 6 and 7)
Forward Stability. Our first, and most technical result is the following:
In our actual proof of (5.1), instead of applying induction on ③, we recursively expand ③ by the above formula. This results in a total of terms of ① type and terms of ② type. The main difficulty is to bound a term of ② type, that is:
Our argument consists of two conceptual steps.
The two steps above enable us to perform induction without exponential blow up. Indeed, they together enable us to go through the following logic chain:
Since there is a gap between and , we can make sure that all blow-up factors are absorbed into this gap, using the property that is polynomially large. This enables us to perform induction to prove (5.1) without exponential blow-up.
Intermediate Layers and Backward Stability. Using (5.1), and especially using the sparsity from (5.1), one can apply the results in Section 4 to derive the following stability bounds for intermediate layers and backward propagation:
Special Rank-1 Perturbation. For technical reasons, we also need two bounds in the special case of for some unit vector and sparse with . We prove that, for this type of rank-one adversarial perturbation, it satisfies for every :
Proof Sketch of Theorem 3: Polyak-Łojasiewicz Condition
The upper bound in Theorem 3 is easy to prove (based on Section 4 and 5), but the lower bound (a.k.a. the Polyak-Łojasiewicz condition) is the most technically involved result to prove in this paper. We introduce the notion of “fake gradient”. Given fixed vectors , we define
For every fixed vectors , if are at random initialization, then with high probability
There are only two conceptually simple steps from Theorem 5 to Theorem 3 (see Appendix F).
First, one can use the stability lemmas in Section 5 to show that, the fake gradient after adversarial perturbation (with ) is also large.
Second, one can apply -net and union bound to turn “fixed ” into “for all ”. This allows us to turn the lower bound on the fake gradient into a lower bound on the true gradient .
Therefore, in the rest of this section, we only sketch the ideas behind proving Theorem 5.
This should be quite intuitive to prove, in the following two steps.
2 Thought Experiment: Adding Small Rank-One Perturbation
At the same time, using the fact that satisfies (6.2), one can show that
Putting (a), (b), (c), and (d) together, we know for such specially chosen , at least with constant probability over the random perturbation of ,
3 Real Proof: Randomness Decomposition and McDiarmid’s Inequality
There are only two main differences between (6.6) and our desired Theorem 5. First, (6.6) gives a gradient lower bound at , while in Theorem 5 we need a gradient lower bound at random initialization . Second, (6.6) gives a lower bound on with constant probability for a small fraction of good coordinates , but in Theorem 5 we need a lower bound for the entire .
Randomness Decomposition. To fix the first issue, we resort to a randomness decomposition technique at least tracing back to the smooth analysis of Spielman and Teng :
Given small constant and -dimensional random , we can rewrite where follows from and is very close to .
(Note that there is no contradiction here because and shall be correlated.)
Using Proposition 6.1, for each good coordinate , instead of “adding” perturbation to , we can instead decompose into , where is distributed in the same way as . In other words, is also at random initialization. If this idea is carefully implemented, one can immediately turn (6.6) into
Extended McDiarmid’s Inequality. To fix the second issue, one may wish to consider all the indices satisfying (6.5) and (6.2). Since there are at least fraction of such coordinates, if all of them satisfied (6.7), then we would have already proved Theorem 5. Unfortunately, neither can we apply Chernoff bound (because the events with different are correlated), nor can we apply union bound (because the event occurs only with constant probability).
Our technique is to resort to (an extended probabilistic variant of) McDiarmid’s inequality (see Appendix A.6) in a very non-trivial way to boost the confidence.
In other words, although there are difference terms, their total summation only grows in rate according to (6.9). After applying a variant of McDiarmid’s inequality, we derive that with high probability over , it satisfies
Finally, by sampling sufficiently many random sets to cover the entire space , we can show
Proof Sketch of Theorem 4: Objective Semi-Smoothness
The objective semi-smoothness Theorem 4 turns out to be much simpler to prove than Theorem 3. It only relies on Section 4 and 5, and does not need randomness decomposition or McDiarmid’s inequality. (Details in Appendix G.)
Recall that in Theorem 4, are at random initialization. is an adversarially chosen matrix with , and is some other adversarial perturbation on top of , satisfying . We denote by
and plug (7.1) into (7.2) to derive our final Theorem 4.
Appendix Roadmap
Appendix A recalls some old lemmas and derives some new lemmas in probability theory.
Appendix B serves for Section 4, the basic properties at random initialization.
Appendix C serves for Section 5, the stability after adversarial perturbation.
Appendix D, E and F together serve for Section 6 and prove Theorem 3, the Polyak-Łojasiewicz condition and gradient upper bound. In particular:
Appendix D serves for Section 6.1 (the indicator and backward coordinate bounds).
Appendix E serves for Section 6.3 (the randomness decomposition and McDiarmid’s inequality) and proves Theorem 5.
Appendix F shows how to go from Theorem 5 to Theorem 3.
Appendix G serves for Section 7, the proof of Theorem 4, the objective semi-smoothness.
Appendix H gives the final proof for Theorem 1, the GD convergence theorem.
Appendix I gives the final proof for Theorem 2, the SGD convergence theorem.
Parameters. We also summarize a few parameters we shall use in the proofs.
In Definition D.1, we shall introduce two parameters
to control the thresholds of indicator functions (recall Section 6.1).
In Definition E.3, we shall introduce parameter
to describe how much randomness we want to decompose out of (recall Section 6.2).
In (E.13) of Appendix E.4, we shall choose
which controls the size of the set where we apply McDiarmid’s inequality (recall Section 6.3).
Appendix
The goal of this section is to present a list of probability tools.
In Section A.1, we recall how to swap randomness.
In Section A.2, we recall concentration bounds for the chi-square distribution.
In Section A.3, we proved a concentration bound of sum of squares of ReLU of Gaussians.
In Section A.4 and Section A.5, we show some properties for random Gaussian vectors.
In Section A.6, we recall the classical McDiarmid’s inequality and then prove a general version of it.
If holds with probability at least , then
with probability at least (over randomness of ), the following event holds,
holds with probability at least (over randomness of ).
If , then
A.2 Concentration of Chi-Square Distribution
Let be a chi-squared distributed random variable with degrees of freedom. Each one has zero mean and variance. Then
Let denote i.i.d. samples from . For any , we have
Since . Thus,
Let denote i.i.d. samples from , and . We have
On the other hand, each random variable is -subgaussian. By subgaussian concentration,
A.3 Concentration of Sum of Squares of ReLU of Gaussians
Given i.i.d. Gaussian random variables , we have
Using Chernoff bound, we know that with probability , is a at most degree- Chi-square random variable. Let us say this is the first event.
Thus, we have with probability at least ,
Let the above event denote the second event.
By taking the union bound of two events, we have with probability
Then rescaling the , we get the desired result. ∎
Given i.i.d. Gaussian random variables , we have
Using Chernoff bound, we know that with probability , is a at most degree- Chi-square random variable. Let us say this is the first event.
Thus, we have with probability at least ,
Let the above event denote the second event.
By taking the union bound of two events, we have with probability at least ,
Then rescaling the , we get the desired result. ∎
Combining Lemma A.7 and Lemma A.6, we have
Given i.i.d. Gaussian random variables , let . We have
For each , let . Then , where . Using Corollary A.8, we have
Since , thus we complete the proof. ∎
follows i.i.d. from the following distribution: with half probability , and with the other half probability follows from folded Gaussian distributions .
is in distribution identical to (chi-square distribution of order ) where follows from binomial distribution ( trials each with success rate ).
A.4 Gaussian Vector Percentile: Center
Suppose is a Gaussian random variable. For any we have
Similarly, if , for any , we have
Let . For any , we have with probability at least ,
there exists at least fraction of such that , and
there exists at least fraction of such that
Let . For each , we define random variable as
Choosing , we have
We provide the definition of -good. Note that this definition will be used often in the later proof.
there are at least fraction coordinates satisfy that ; and
there are at least fraction coordinates satisfy that .
Lemma A.13 gives the following immediate corollary:
Let . For any , we have with probability at least that is -good.
It is clear that follows from a Gaussian distribution \mathcal{N}\big{(}0,\big{(}\sum_{i=1}^{k}\sigma_{i}^{2}\|x_{i}\|_{2}^{2}\big{)}{I}\big{)}, so we can directly apply Corollary A.15. ∎
A.5 Gaussian Vector Percentile: Tail
Without loss of generality we only prove the result for .
Fixing any such and letting , we have y_{i}\sim\mathcal{N}\big{(}0,\frac{2}{m}\big{)} so for every , by Gaussian tail bound
Since , we know that if occurs for indices out of , this cannot happen with probability more than
Finally, by applying union bound over we have with probability ,
In other words, vector can be written as where and .
A.6 McDiarmid’s Inequality and An Extension
We state the standard McDiarmid’s inequality,
We prove a more general version of McDiarmid’s inequality,
Let be independent random variables and . Suppose it satisfies:
With probability at least over , it satisfies
Then,
For each , we have with probability at least over , it satisfies
Define those satisfying the above event to be .
Define random variable (which depends only on ) as
For every and fixed .
If , then .
If ,
If , then .
Recall from our assumption that, with probability at least over and , it satisfies
Taking expectation over and , we have
This precisely means .
In sum, we have just shown that always holds. By applying martingale concentration (with one-sided bound),
Notice that so if we choose , we have
and we have with probability at least (and with the remaining probabilities). Together, we have the desired theorem.
Appendix B Basic Properties at Random Initialization
Recall that the recursive update equation of RNN can be described as follows
We introduce two notations that shall repeatedly appear in our proofs.
Section B.4 and Section B.5 prove that the consecutive intermediate layers, in terms of spectral norm, do not explode (for full and sparse vectors respectively).
Section B.6 proves that the backward propagation does not explode.
With probability at least over and ,
Next, applying Corollary A.10, we know that if are fixed (instead of defined as in (B.1)), then, letting Choosing , we have
In particular, since we have “for all” quantifies on and above, we can substitute the choice of and in (B.1) (which may depend on the randomness of and ). This, together with (B.1), gives
With probability at least over and ,
We can define in the same way as (B.1). This time, we show a lower bound
Applying Corollary A.10, we know if are fixed (instead of defined as in (B.1)), then choosing ,
Substituting the choice of in (B.1), and the lower bound (B.4), we have
Choosing gives the desired statement. ∎
This finishes the proof of Lemma B.3.
B.2 Forward Correlation
in the same way as (B.1) and (B.3) as in the proof of Lemma B.3. We again have the entries of are i.i.d. from (recall Footnote 13). For the quantity , we can further decompose it as follows
where is some fixed parameter, and denote two vectors that are independently generated from , and
It is clear that the two sides of (B.7) are identical in distribution (because and ).
Next, suppose and are fixed (instead of depending on the randomness of and ) and satisfies Note that if and are random, then they satisfy such constraints by Lemma B.3.
We can apply Corollary A.16 to obtain the following statement: with probability at least ,
is -good where . Using , we can lower bound as
corresponds to the coordinates that are ,
is the remaining, which corresponds to the coordinates that are within .
corresponds to the coordinates that are , and
We write according to the same partition. We consider the following three cases.
For each index in the first block, we have \big{(}\phi(w_{1}+rv_{1})\big{)}_{k}\neq\big{(}w_{1}+rv_{1}\big{)}_{k} only if . However, if this happens, we have
Applying Lemma A.5, we know with probability at least ,
Similarly, for each index in the third block, we have \big{(}\phi(w_{1}+rv_{1})\big{)}_{k}\neq 0 only if . Therefore, we can similarly derive that
To prove (B.9), we use triangle inequality,
Since and the size of support of is at most , we have
Since , and since the size of support of is at most , we have with probability at least (due to chi-square distribution concentration). Thus
Together, by triangle inequality we have This finishes the proof of (B.9).
Denoting by , we have
On the other hand, the random vector follows from distribution for some fixed vector and has at least dimensions. By chi-square concentration, we have with probability at least . Putting these together, we have
B.3 Forward δ𝛿\delta-Separateness
We first give the definition of -Separable,
For any two vectors , we say and are -separable if
We say a finite set is -separable if for any two vectors , and are -separable.
holds with probability at least .
This also implies that the randomness in is independent of the randomness in .
It is easy to see that , so we can rewrite as follows
and we know the entries of are i.i.d. from , owing to a similar treatment as Footnote 13. For the vector , we further rewrite it as
Our plan is to first use the randomness in to argue that is -good. Then we conditioned is good, and prove that the norm of is lower bounded (using (B.7)).
we can use Corollary A.16 to obtain the following statement: is -good with probability at least where . Using , we can lower bound as
In other words, is -good. Next, applying standard -net argument, we have with probability at least , for all satisfying (B.12), it satisfies is -good. This allows us to plug in the random choice of in (B.11).
We apply Lemma B.7 with the following setting
where is -good. (We can do so because the randomness of is independent of the randomness of and .) Lemma B.7 tells us that, with probability at least over the randomness of ,
B.4 Intermediate Layers: Spectral Norm
The following lemma bounds the spectral norm of (consecutive) intermediate layers.
We start with an important claim whose proof is almost identical to Lemma B.3.
Now, to prove the spectral norm bound in Lemma B.11, we need to go from “for each (see Claim B.12)” to “for all ”. Since has dimensions, we cannot afford taking union bound over all possible (or its -net).
By applying an -net argument over all such possible (but sparse) , we have with probability at least , the above equation holds for all possible .
Next, taking a union bound over all , we have with probability at least :
Without loss of generality we assume . We can rewrite as follows
It is easy to see that is independent of . We can rewrite
Using Fact A.11 together with concentration bounds (for binomial distribution and for chi-square distribution), we have with probability at least ,
B.5 Intermediate Layers: Sparse Spectral Norm
This section proves two results corresponding to the spectral norm of intermediate layers with respect to sparse vectors. We first show Lemma B.14 and our Corollary B.15 and B.16 shall be direct applications of Lemma B.14.
For every and , with probability at least
Similar to the proof of Lemma B.11, we let denote the following column orthonormal matrix using Gram-Schmidt:
Applying -net over all -sparse vectors , we have with probability at least , it satisfies for all -sparse vectors .
Conditioning on both events happen, we have
B.6 Backward Propagation
This section proves upper bound on the backward propagation against sparse vectors. We first show Lemma B.17 and our Corollary B.18 and B.19 shall be immediate corollaries.
with probability at least we have
Taking a union of the above two events, we get with probability ,
Appendix C Stability After Adversarial Perturbation
Section C.4 considers a special type of rank-one perturbation matrix , and provides stability bounds on the forward and backward propagation.
As discussed in Section 5, the results of Section C.1, C.2 and C.3 shall be used twice, once for the final training updates (see Appendix G), and once for the randomness decomposition (see Appendix E). In contrast, the results of Section C.4 shall only be used once in Appendix E.
The goal of this section is to prove Lemma C.2,
We emphasize that all these parameters are polynomial in so negligible when comparing to .
After recursively applying (C.2), we can write
Applying Claim C.6, we have with probability at least , one can write
where inequality ① is due to (C.5), and ② follows from our choice of . Thus, we have showed I and II of (C.1):
so III of (C.1) holds. IV and V of (C.1) are implied by Claim C.4, and VI is implied because
Then, we have with probability at least
Using triangle inequality, we can calculate
With probability at least the following holds. Whenever
then letting and , we have
Next, for each , we must have
Similar to the proof of Lemma B.11, we let denote the following column orthonormal matrix using Gram-Schmidt:
Finally, taking -net over all -sparse vectors , we have the desired result. ∎
Combining Claim C.4 and Claim C.5, we have
With probability at least , whenever
C.2 Intermediate Layers
The goal of this subsection is to prove Lemma C.7.
On the other hand, since , we also have
C.3 Backward
Without loss of generality we assume . We define set to be
Again we assume without loss of generality. We define set to be
It is similar to the proof of Lemma lem:backwardb. ∎
It is similar to the proof of Lemma lem:backwarda. ∎
C.4 Special Rank-One Perturbation
We prove two lemmas regarding the special rank-one perturbation with respect to a coordinate . The first one talks about forward propagation.
After recursively calculating as above, using , and applying triangle inequality, we have
where the last step follows by , and Lemma lem:forwarda.
where ① follows by , ② follows by Lemma B.3, ③ follows by Corollary B.16, and ④ follows by .
where the last step follows by Corollary B.15 (which relies on Lemma lem:forwardb to give ) together with Lemma lem:forwardc.
Putting the three bounds into (C.9) (where the term one is the dominating term), we have
The next one talks about backward propagation.
By Corollary B.15 (which relies on Lemma lem:forwardb to give ), we know
Finally, we use the randomness of (recall that does not depend on ), we conclude that with probability at least , it satisfies
This time, we define set to be
Appendix D Indicator and Backward Coordinate Bounds
Suppose and follow from random initialization. Given fixed set and define
Now, suppose is fixed (instead of depending on the randomness of and ), and it satisfies . We have each entry of is is distributed as {\cal N}\big{(}0,\frac{2\|z_{1}\|^{2}+2\|z_{2}\|^{2}}{m}\big{)}. By property of Gaussian (see Fact A.12), we have for each ,
Since we have independent random variables, applying a Chernoff bound, we have
and therefore for a fixed vector it satisfies
Applying Chernoff bound, with probability at least , we have . Applying union bound over all possible , we have has cardinality at least . ∎
and therefore for a fixed vector it satisfies
D.2 Backward Coordinate Bound
The goal of this section is to prove Lemma D.7.
We have is -Lipschitz continuous.
We define two probabilistic events .
Event depends on the randomness of and :
Corollary B.16 says .
Event depends on the randomness of :
By Gaussian tail bound, we have .
In the rest of the proof, we assume and are fixed and satisfy . We let be the only source of randomness but we condition on satisfies .
Step 1. Fixing and letting be the only randomness, we claim for each :
We prove this inequality as follows. We can rewrite
and is distributed as Gaussian random variable , and and are defined as follows
where the second step follows by , the third step follows by (D.2).
Above, inequality ① follows by , inequality ② follows from (D.2). Using McDiarmid inequality (see Lemma A.18), we have
where we choose .
Appendix E Gradient Bound at Random Initialization (Theorem 5)
The goal of this section is to understand the lower and upper bounds of gradients. Instead of analyzing the true gradient directly—where the forward and backward propagation have correlated randomness— we assume that the loss vectors are fixed (no randomness) in this section, as opposed to being defined as which is random. We call this the “fake loss.” We define a corresponding “fake gradient” with respect to this fixed loss.
Given fixed vectors , we define
where inequality ① uses Lemma B.11, Lemma B.3, and with high probability. Applying triangle inequality and using the definition of fake gradient (see Definition E.1), we finish the proof of the first statement.
As for the second statement, we replace the use of Lemma B.11 with Corollary B.19. ∎
The rest of this section is devoted to proving a (much more involved) lower bound on this fake gradient.
In the rest of this section, we first present a elegant way to decompose the randomness in Section E.1 (motivated by smooth analysis ). We give a lower bound on the expected fake gradient in Section E.2. We then calculate the stability of fake gradient against rank-one perturbations in Section E.3. Finally, in Section E.4, we apply our extended McDiarmid’s inequality to prove Theorem 5.
We introduce a parameter as follows:
where the notions and come from Definition D.1. We have .
entries of are i.i.d. drawn from (so is in the same distribution as );
With probability at least over the randomness of :
With probability at least over the randomness of :
We verify that the entries of are i.i.d. Gaussian in two steps.
By standard Gaussian tail bound, we have with probability at least , it satisfies and therefore . If this happens, for every , we have (see for instance Fact A.12) that and . Recalling
and using the fact that , we conclude that
Again by Gaussian tail bound, we have with probability at least , it satisfies . Therefore, and using our assumption on , we have
Now, we introduce another two ways of decomposing randomness based on this definition.
E.2 Gradient Lower Bound in Expectation
The goal of this subsection is to prove Lemma E.6 and then translate it into our Core Lemma A (see Lemma E.7).
and recall from Lemma lem:random-decomposed, we have
We let , apply Lemma D.2 to obtain with , and apply Lemma D.7 to obtain . According to the statements of these lemmas, we know that with probability at least , the random choice of will satisfy .
This implies, by triangle inequality and the fact that ,
This implies, by triangle inequality and the fact that ,
Using Lemma B.3 and Lemma lem:forwarda we have
This concludes that, with probability at least over and , it satisfies
This concludes that, with probability at least over and , it satisfies
This is a fixed value, independent of .
Finally, for each , owing to Lemma D.7 and Lemma B.3, we have
Putting (E.2) and (E.7) back to (E.1), we conclude that
In Lemma E.6, we split into for a fixed . In this subsection, we split into three parts following Lemma def:random-decompc. Our purpose is to rewrite Lemma E.6 into the following variant:
with probability over the randomness of , the following holds:
or putting it in another way, using our choice of ,
with probability over the randomness of , the following holds ( for some sufficiently large constant):
Applying simple tricks to switch the ordering of randomness (see Fact A.2), we have
with probability over the randomness of , the following holds:
with probability over the randomness of , the following holds:
with probability over randomness of , the following holds:
Repeating the choice of for times, and using union bound, we have
with probability over the randomness of , the following holds:
with probability over the randomness of , the following holds:
for all , the following holds:
with probability over randomness of , the following holds:
Combining the above statement with Fact A.1 (to merge randomness), we conclude the proof. ∎
E.3 Gradient Stability
The goal of this subsection is to prove Lemma E.8 and then translate it into our Core Lemma B (see Lemma E.10).
\left|\left\{k\in[m]~{}\bigg{|}~{}\Big{|}\|\widehat{\nabla}_{k}f(\widetilde{W}+W^{\prime\prime}_{j})\|_{2}^{2}-\|\widehat{\nabla}_{k}f(\widetilde{W})\|_{2}^{2}\Big{|}\leq O\left(\frac{\rho^{11}\theta^{1/3}}{m^{1/6}}\right)\right\}\right|\geq\left(1-\frac{\rho^{5}\theta^{2/3}}{m^{1/3}}\right)m\enspace.
for every .
Combining this with from Lemma E.2, we finish the proof of the first item.
Now, for the indices in that are randomly sampled from , if for a parameter , then we have
Otherwise, if , this means at least indices that are chosen from are outside . This happens with probability at most . ∎
In this subsection, we split into three parts followingdef:random-decomp:N-N. Our purpose is to rewrite Lemma E.8 into the following variant:
with probability over random , random , random , and random , the following holds:
We can split the randomness (see Fact A.1) and derive that
With probability over , , the following holds:
With probability over and , Eq. (E.11) holds.
Applying standard -net argument, we derive that
With probability over , , the following holds:
Finally, letting , we have
With probability over the randomness of , the following holds:
Finally, splitting the randomness (using Fact A.1), and applying a union bound over multiple samples , we finish the proof. ∎
E.4 Main Theorem: Proof of Theorem 5
Combining Core Lemma A and B (i.e., Lemma E.7 and Lemma E.10), we know that if is appropriately chosen, with probability over the randomness of and , the following holds:
for every , the following holds:
with probability at over the randomness of , the following boxed statement holds:
Below, we condition on the high probability event (see Lemma lem:random-decomposed) that . The boxed statement tells us:
With probability over , it satisfies
At the same time, we have owing to Lemma E.2. Therefore, scaling down by (to make sure the function value stays in $$), we can applying extended McDiarmid’s inequality (see Lemma A.19), we have
As long as , we have that the above probability is at least .
in order to satisfy Lemma E.7. We replace the boxed statement with (E.12). This tells us
with probability over the randomness of and , the following holds:
for every , the following holds:
with probability at over the randomness of and , the following holds:
After rearranging randomness, and using , we have
with probability over the randomness of , the following holds:
with probability over the randomness of , the following holds:
for every , the following holds:
Finally, since are random subsets of , we know that with probability at least , for each index , it is covered by at most random subsets. Therefore,
Appendix F Gradient Bound After Perturbation (Theorem 3)
We use the same fake gradient notion (see Definition E.1) and first derive the following result based on Lemma E.2 and Theorem 5.
Lower bound on . Recall
In Theorem 5 of the previous section, we already have a lower bound on . We just need to upper bound .
where ① follows by definition and ② and ③ follow by the triangle inequality. Note that in inequality ③ we have hidden four more higher order terms in . We ignore the details for how to bound them, for the ease of presentation. We bound the three terms separately:
Using Corollary B.18 (with from Lemma lem:forwardb) and Lemma B.3 we have
Finally, using , we have
where the last step follows by (F.1) (with our sufficiently large choice of ) and Theorem 5.
Upper bound on . Using , we have
where ① follows by Lemma E.2 and ② follows by (F.1).
Upper bound on . This is completely analogous so we do not replicate the proofs here. ∎
Appendix G Objective Semi-Smoothness (Theorem 4)
(by from Lemma B.3 and from Lemma lem:forwarda); and
.
Above, ① is by the definition of ; ② is by (G.2); ③ is by the definition of (see Fact 2.5 for an explicit form of the gradient); ④ is by Claim G.2.
where ① uses Lemma C.7 and ② uses Claim G.2 to bound .
Putting (G.4), (G.5) and (G.6) back to (G.3), and using triangle inequality, we have the desired result. ∎
for every ,
only when , and
We verify coordinate by coordinate for each .
If and , then (\phi(a)-\phi(b))_{k}=a_{k}-b_{k}=\big{(}D(a-b)\big{)}_{k}.
If and , then (\phi(a)-\phi(b))_{k}=0-0=\big{(}D(a-b)\big{)}_{k}.
If and , then (\phi(a)-\phi(b))_{k}=a_{k}=(a_{k}-b_{k})+\frac{b_{k}}{a_{k}-b_{k}}(a_{k}-b_{k})=\big{(}D(a-b)+D^{\prime\prime}(a-b)\big{)}_{k}, if we define .
If and , then (\phi(a)-\phi(b))_{k}=-b_{k}=0\cdot(a_{k}-b_{k})-\frac{b_{k}}{b_{k}-a_{k}}(a_{k}-b_{k})=\big{(}D(a-b)+D^{\prime\prime}(a-b)\big{)}_{k}, if we define . ∎
Appendix H Convergence Rate of Gradient Descent (Theorem 1)
In the rest of the proof, we first assume that for every , the following holds
We shall prove the convergence of gradient descent assuming (H.1), so that previous statements such as Theorem 4 and Theorem 3 can be applied. At the end of the proof, we shall verify that (H.1) is satisfied throughout the gradient descent process.
We need to verify for each , is small so that (H.1) holds. By Theorem 3,
where the last step follows by our choice of . ∎
Appendix I Convergence Rate of Stochastic Gradient Descent (Theorem 2)
The proof is almost identical to that of Theorem 1. We again have with probability at least
Again, we first assume for every , the following holds
We shall prove the convergence of SGD assuming (I.1), so that previous statements such as Theorem 4 and Theorem 3 can be applied. At the end of the proof, we shall verify that (I.1) is satisfied throughout the SGD with high probability.
At the same time, we also have the following absolute value bound:
Above, ① uses Theorem 4 and Cauchy-Shwartz , and ② uses Theorem 3 and the derivation from (I.2).
By one-sided Azuma’s inequality (a.k.a. martingale concentration), we have with probability at least , for every :
On one hand, after iterations we have
Therefore, we have .
On the other hand, for every , we have
where in ① we have used , and in ② we have used \eta\leq O\big{(}\frac{\delta}{\rho^{42}m}\big{)}. This implies . We can now verify for each , is small so that (I.1) holds. By Theorem 3,
where the last step follows by our choice of . ∎