On the Convergence Rate of Training Recurrent Neural Networks

Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song

Neural networks have been one of the most powerful tools in machine learning over the past a few decades . The multi-layer structure of neural network gives it supreme power in expressibility and learning performance. However, it raises complexity concerns: the training objective is generally non-convex and non-smooth. In practice, local-search algorithms such as stochastic gradient descent (SGD) are capable of finding global optima, at least on the training data . How SGD avoids local minima for such objectives remains an open theoretical question since Goodfellow et al. .

In recent years, there have been a number of theoretical results aiming at a better understanding of this phenomenon. Many of them focus on two-layer (thus one-hidden-layer) neural networks and assume that the inputs are random Gaussian or sufficiently close to Gaussian . Some study deep neural networks but assuming the activation function is linear . Some study the convex task of training essentially only the last layer of the network . On the technique side, some of these results try to understand the gradient dynamics , while others focus on the geometry properties of the training objective .

In this paper, we show GD and SGD are capable of training multi-layer neural networks (with ReLU activation) to global minima on any non-degenerate training data set. Furthermore, the running time is polynomial in the number of layers and the number of data points. Since there are many different types of multi-layer networks (convolutional, feedforward, recurrent, etc.), in this present paper, we focus on recurrent neural networks (RNN) as our choice of multi-layer networks, and feedforward networks are only its “special case” (see for instance a follow-up work ).

Recurrent Neural Networks. Among different architectures of neural networks, one of the least theoretically-understood structure is the recurrent one . A recurrent neural network recurrently applies the same network unit to a sequence of input tokens, such as a sequence of words in a language sentence. RNN is particularly useful when there are long-term, non-linear interactions between input tokens in the same sequence. These networks are widely used in practice for natural language processing, language generation, machine translation, speech recognition, video and music processing, and many other tasks . On the theory side, while there are some attempts to show that an RNN is more expressive than a feedforward neural network , when and how an RNN can be efficiently learned has little theoretical explanation.

In practice, RNN is usually trained by simple local-search algorithms such as SGD. However, unlike shallow networks, the training process of RNN often runs into the trouble of vanishing or exploding gradient . That is, the value of the gradient becomes exponentially small or large in the time horizon, even when the training objective is still constant. Intuitively, an RNN recurrently applies the same network unit for LL times if the input sequence is of length LL. When this unit has “operator norm” larger than one or smaller than one, the final output can possibly exponentially explode or vanish in LL. More importantly, when one back propagates through time—which intuitively corresponds to applying the reverse unit multiple times— the gradient can also vanish or explode. Controlling the operator norm of a non-linear operator can be quite challenging. In practice, one of the popular ways to resolve this is by the long short term memory (LSTM) structure . However, one can also use rectified linear units (ReLUs) as activation functions to avoid vanishing or exploding gradient . In fact, one of the earliest adoptions of ReLUs was on applications of RNNs for this purpose twenty years ago . For a detailed survey on RNN, we refer the readers to Salehinejad et al. .

In this paper, we study the following general question

Can ReLU provably stabilize the training process and avoid vanishing/exploding gradient?

Can RNN be trained close to zero training error efficiently under mild assumptions?

When there is no activation function, RNN is known as linear dynamical system. Hardt, Ma, and Recht first proved the convergence of finding global minima for such linear dynamical systems. Followups in this line of research include .

Motivations. One may also want to study whether RNN can be trained close to zero test error. However, unlike feedforward networks, the training error, or the ability to memorize examples, may actually be desirable for RNN. After all, many tasks involving RNN are related to memories, and certain RNN units are even referred to memory cells. Since RNN applies the same network unit to all input tokens in a sequence, the following question can possibly of its own interest:

How does RNN learn mappings (say from token 3 to token 7) without destroying others?

Another motivation is the following. An RNN can be viewed as a space constraint, differentiable Turing machine, except that the input is only allowed to be read in a fixed order. It was shown in Siegelmann and Sontag that all Turing machines can be simulated by fully-connected recurrent networks built of neurons with non-linear activations. In practice, RNN is also used as a tool to build neural Turing machines , equipped with a grand goal of automatically learning an algorithm based on the observation of the inputs and outputs. To this extent, we believe the task of understanding the trainability as a first step towards understanding RNN can be meaningful on its own.

Our Result. To present the simplest result, we focus on the classical Elman network with ReLU activation:

If the number of neurons m\poly(n,d,L,δ1,logε1)m\geq\poly(n,d,L,\delta^{-1},\log\varepsilon^{-1}) is polynomially large, we can find weight matrices W,A,BW,A,B where the RNN gives ε\varepsilon training error

if gradient descent (GD) is applied for T=\Omega\big{(}\frac{\poly(n,d,L)}{\delta^{2}}\log\frac{1}{\varepsilon}\big{)} iterations, starting from random Gaussian initializations; or

if (mini-batch or regular) stochastic gradient descent (SGD) is applied for T=\Omega\big{(}\frac{\poly(n,d,L)}{\delta^{2}}\log\frac{1}{\varepsilon}\big{)} iterations, starting from random Gaussian initializations.At a first glance, one may question how it is possible for SGD to enjoy a logarithmic time dependency in ε1\varepsilon^{-1}; after all, even when minimizing strongly-convex and Lipschitz-smooth functions, the typical convergence rate of SGD is T1/εT\propto 1/\varepsilon as opposed to Tlog(1/ε)T\propto\log(1/\varepsilon). We quickly point out there is no contradiction here if the stochastic pieces of the objective enjoy a common global minimizer. In math terms, suppose we want to minimize some function f(x)=1ni=1nfi(x)f(x)=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x), and suppose xx^{*} is the global minimizer of convex functions f1(x),,fn(x)f_{1}(x),\dots,f_{n}(x). Then, if f(x)f(x) is σ\sigma-strongly convex, and each each fi(x)f_{i}(x) is LL-Lipschitz smooth, then SGD—moving in negative direction of fi(x)\nabla f_{i}(x) for a random i[n]i\in[n] per step— can find ε\varepsilon-minimizer of this function in O\big{(}\frac{L^{2}}{\sigma^{2}}\log\frac{1}{\varepsilon}\big{)} iterations.

(To present the simplest possible result, we have not tried to tighten the polynomial dependency with respect to n,dn,d and LL. We only tightened the dependency with respect to δ\delta and ε\varepsilon.)

Our Contribution. We summarize our contributions as follows.

We believe this is the first proof of convergence of GD/SGD for training the hidden layers of recurrent neural networks (or even for any multi-layer networks of more than two layers) when activation functions are present.Our theorem holds even when A,BA,B are at random initialization and only the hidden weight matrix WW is trained. This is much more difficult to analyze than the convex task of training only the last layer BB . Training only the last layer can significantly reduce the learning power of (recurrent or not) neural networks in practice.

Our results provide arguably the first theoretical evidence towards the empirical finding of Goodfellow et al. on multi-layer networks, regarding the ability of SGD to avoid (spurious) local minima. Our theorem does not exclude the existence of bad local minima

We build new technical toolkits to analyze multi-layer networks with ReLU activation, which have now found many applications . For instance, combining this paper with new techniques, one can derive guarantees on testing error for RNN in the PAC-learning language .

Extension: Deep RNN. Elman RNN is also referred to as three-layer RNN, and one may also study the convergence of RNNs with more hidden layers. This is referred to as deep RNN . Our theorem also applies to deep RNNs (by combining this paper together with ).

2 Other Related Works

Another relevant work is Brutzkus et al. where the authors studied over-paramterization in the case of two-layer neural network under a linear-separable assumption.

Instead of using randomly initialized weights like this paper, there is a line of work proposing algorithms using weights generated from some “tensor initialization” process .

There is huge literature on using the mean-field theory to study neural networks . At a high level, they study the network dynamics at random initialization when the number of hidden neurons grow to infinity, and use such initialization theory to predict performance after training. However, they do not provide theoretical convergence rate for the training process (at least when the number of neurons is finite).

Notations and Preliminaries

Note that in the occasion that j=1i1(Iv^jv^j)vi\prod_{j=1}^{i-1}(I-\widehat{v}_{j}\widehat{v}_{j}^{\top})v_{i} is the zero vector, we let v^i\widehat{v}_{i} be an arbitrary unit vector that is orthogonal to v^1,,v^i1\widehat{v}_{1},\dots,\widehat{v}_{i-1}.

We make the following assumption on the input data (see Footnote 10 for how to relax it):

xi,1xj,1δ\|x_{i,1}-x_{j,1}\|\geq\delta for some parameter δ(0,1]\delta\in(0,1] and every pair of ij[n]i\neq j\in[n].

A very important notion that this entire paper relies on is the following:

We consider the following random initialization distributions for WW, AA and BB.

We say that W,A,BW,A,B are at random initialization, if the entries of WW and AA are i.i.d. generated from N(0,2m)\mathcal{N}(0,\frac{2}{m}), and the entries of Bi,jB_{i,j} are i.i.d. generated from N(0,1d)\mathcal{N}(0,\frac{1}{d}).

We assume m\poly(n,d,L,1δ,log1ε)m\geq\poly(n,d,L,\frac{1}{\delta},\log\frac{1}{\varepsilon}) for some sufficiently large polynomial.

Without loss of generality, we assume δ1CL2log3m\delta\leq\frac{1}{CL^{2}\log^{3}m} for some sufficiently large constant CC (if this is not satisfied one can decrease δ\delta). Throughout the paper except the detailed appendix, we use O~\widetilde{O}, Ω~\widetilde{\Omega} and Θ~\widetilde{\Theta} notions to hide polylogarithmic dependency in mm. To simplify notations, we denote by

2 Objective and Gradient

Using chain rule, one can write down a closed form of the (sub-)gradient:

For k[m]k\in[m], the gradient with respect to WkW_{k} (denoted by k\nabla_{k}) and the full gradient are

Our Results

Our main results can be formally stated as follows.

Suppose \eta=\widetilde{\Theta}\big{(}\frac{\delta}{m}\poly(n,d,L)\big{)} and m\poly(n,d,L,δ1,logε1)m\geq\poly(n,d,L,\delta^{-1},\log\varepsilon^{-1}). Let W(0),A,BW^{(0)},{A},{B} be at random initialization. With high probability over the randomness of W(0),A,BW^{(0)},{A},{B}, if we apply gradient descent for TT steps W(t+1)=W(t)ηf(W(t))W^{(t+1)}=W^{(t)}-\eta\nabla f(W^{(t)}), then it satisfies

Suppose \eta=\widetilde{\Theta}\big{(}\frac{\delta}{m}\poly(n,d,L)\big{)} and m\poly(n,d,L,δ1,logε1)m\geq\poly(n,d,L,\delta^{-1},\log\varepsilon^{-1}). Let W(0),A,BW^{(0)},{A},{B} be at random initialization. If we apply stochastic gradient descent for TT steps W(t+1)=W(t)ηfi(W(t))W^{(t+1)}=W^{(t)}-\eta\nabla f_{i}(W^{(t)}) for a random index i[n]i\in[n] per step, then with high probability (over W(0),A,BW^{(0)},A,B and the randomness of SGD), it satisfies

In both cases, we essentially have linear convergence rates. We remark here that the O~\widetilde{O} notation may hide additional polynomial dependency in loglogε1\log\log\varepsilon^{-1}. This is not necessary, at the expense of slightly complicating the proofs, as shown by follow up . Notably, our results show that the dependency of the number of layers LL, is polynomial. Thus, even when RNN is applied to sequences of long input data, it does not suffer from exponential gradient explosion or vanishing (e.g., 2Ω(L)2^{\Omega(L)} or 2Ω(L)2^{-\Omega(L)}) through the entire training process.

Main Technical Theorems. Our main Theorem 1 and Theorem 2 are in fact natural consequences of the following two technical theorems. They both talk about the first-order behavior of RNNs when the weight matrix WW is sufficiently close to some random initialization.

The first theorem is similar to the classical Polyak-Łojasiewicz condition , and says that f(W)F2\|\nabla f(W)\|_{F}^{2} is at least as large as the objective value.

With high probability over random initialization W~,A,B\widetilde{W},A,B, it satisfies

The second theorem shows a special “semi-smoothness” property of the objective.

At a high level, the convergence of GD and SGD are careful applications of the two technical theorems above: indeed, Theorem 3 shows that as long as the objective value is high, the gradient is large; and Theorem 4 shows that if one moves in the (negative) gradient direction, then the objective value can be sufficiently decreased. These two technical theorems together ensure that GD/SGD does not hit any saddle point or (bad) local minima along its training trajectory. This was practically observed by Goodfellow et al. and a theoretical justification was open since then.

An Open Question. We did not try to tighten the polynomial dependencies of (n,d,L)(n,d,L) in the proofs. When mm is sufficiently large, we make use of the randomness at initialization to argue that, for all the points within a certain radius from initialization, for instance Theorem 3 holds. In practice, however, the SGD can create additional randomness as time goes; also, in practice, it suffices for those points on the SGD trajectory to satisfy Theorem 3. Unfortunately, such randomness can— in principle— be correlated with the SGD trajectory, so we do not know how to use that in the proofs. Analyzing such correlated randomness is certainly beyond the scope of this paper, but can possibly explain why in practice, the size of mm needed is not that large.

Overall, we provide the first proof of convergence of GD/SGD for non-linear neural networks that have more two layers. We show with overparameterization GD/SGD can avoid hitting any (bad) local minima along its training trajectory. This was practically observed by Goodfellow et al. and a theoretical justification was open since then. We present our result using recurrent neural networks (as opposed to the simpler feedforward networks ) in this very first paper, because memorization in RNN could be of independent interest. Also, our result proves that RNN can learn mappings from different input tokens to different output tokens simultaneously using the same recurrent unit.

Last but not least, we build new tools to analyze multi-layer networks with ReLU activations that could facilitate many new research on deep learning. For instance, our techniques in Section 4 provide a general theory for why ReLU activations avoid exponential exploding (see e.g. (4.1), (4.4)) or exponential vanishing (see e.g. (4.1), (4.3)); and our techniques in Section 5 give a general theory for the stability of multi-layer networks against adversarial weight perturbations, which is at the heart of showing the semi-smoothness Theorem 4, and used by all the follow-up works .

The main difficulty of this paper is to prove Theorem 3 and 4, and we shall sketch the proof ideas in Section 4 through 7. In such high-level discussions, we shall put our emphasize on

how to avoid exponential blow up in LL, and

how to deal with the issue of randomness dependence across layers.

We genuinely hope that this high-level sketch can (1) give readers a clear overview of the proof without the necessity of going to the appendix, and (2) appreciate our proof and understand why it is necessarily long.For instance, proving gradient norm lower bound in Theorem 3 for a single neuron k[m]k\in[m] is easy, but how to apply concentration across neurons? Crucially, due to the recurrent structure these quantities are never independent, so we have to build necessary probabilistic tools to tackle this. If one is willing to ignore such subtleties, then our sketched proof is sufficiently short and gives a good overview.

Basic Properties at Random Initialization

In this section we derive basic properties of the RNN when the weight matrices W,A,BW,A,B are all at random initialization. The corresponding precise statements and proofs are in Appendix B.

The first one says that the forward propagation neither explodes or vanishes, that is,

This relies on a more involved inductive argument than (4.1). At high level, one needs to show that in each layer, the amount of “fresh new randomness” reduces only by a factor at most 1110L1-\frac{1}{10L}.

Using (4.1) and (4.2), we obtain the following property about the data separability:

Here, we say two vectors xx and yy are δ\delta-separable if \big{\|}(I-yy^{\top}/\|y\|_{2}^{2})x\big{\|}\geq\delta and vice versa. Property (4.3) shows that the separability information (say on input token 11) does not diminish by more than a polynomial factor even if the information is propagated for LL layers.

Intermediate Layers and Backward Propagation. Training neural network is not only about forward propagation. We also have to bound intermediate layers and backward propagation.

Intuitively, one cannot use spectral bound argument to derive (4.4) or (4.5): the spectral norm of W{W} is 22, and even if ReLU activations cancel half of its mass, the spectral norm DW2\|DW\|_{2} remains to be 2\sqrt{2}. When stacked together, this grows exponential in LL.

We did not try to tighten the polynomial factor here in LL. We conjecture that proving an O(1)O(1) bound may be possible, but that question itself may be a sufficiently interesting random matrix theory problem on its own.

Its proof is in the same spirit as (4.5), with the only difference being the spectral norm of BB is around m/d\sqrt{m/d} as opposed to O(1)O(1).

Stability After Adversarial Perturbation

In this section we study the behavior of RNN after adversarial perturbation. The corresponding precise statements and proofs are in Appendix C.

Letting W~,A,B\widetilde{W},A,B be at random initialization, we consider some matrix W=W~+WW=\widetilde{W}+W^{\prime} for W2\poly(ϱ)m\|W^{\prime}\|_{2}\leq\frac{\poly(\varrho)}{\sqrt{m}}. Here, WW^{\prime} may depend on the randomness of W~,A\widetilde{W},A and BB, so we say it can be adversarially chosen. The results of this section will later be applied essentially twice:

Once for those updates generated by GD or SGD, where WW^{\prime} is how much the algorithm has moved away from the random initialization.

The other time (see Section 6.3) for a technique that we call “randomness decomposition” where we decompose the true random initialization WW into W=W~+WW=\widetilde{W}+W^{\prime}, where W~\widetilde{W} is a “fake” random initialization but identically distributed as WW. Such technique at least traces back to smooth analysis .

To illustrate our high-level idea, from this section on (so in Section 5, 6 and 7)

Forward Stability. Our first, and most technical result is the following:

In our actual proof of (5.1), instead of applying induction on ③, we recursively expand ③ by the above formula. This results in a total of LL terms of ① type and LL terms of ② type. The main difficulty is to bound a term of ② type, that is:

Our argument consists of two conceptual steps.

The two steps above enable us to perform induction without exponential blow up. Indeed, they together enable us to go through the following logic chain:

Since there is a gap between m1/2m^{-1/2} and m2/3m^{-2/3}, we can make sure that all blow-up factors are absorbed into this gap, using the property that mm is polynomially large. This enables us to perform induction to prove (5.1) without exponential blow-up.

Intermediate Layers and Backward Stability. Using (5.1), and especially using the sparsity D0m2/3\|D^{\prime}\|_{0}\leq m^{2/3} from (5.1), one can apply the results in Section 4 to derive the following stability bounds for intermediate layers and backward propagation:

Special Rank-1 Perturbation. For technical reasons, we also need two bounds in the special case of W=yzW^{\prime}=yz^{\top} for some unit vector zz and sparse yy with y0\poly(ϱ)\|y\|_{0}\leq\poly(\varrho). We prove that, for this type of rank-one adversarial perturbation, it satisfies for every k[m]k\in[m]:

Proof Sketch of Theorem 3: Polyak-Łojasiewicz Condition

The upper bound in Theorem 3 is easy to prove (based on Section 4 and 5), but the lower bound (a.k.a. the Polyak-Łojasiewicz condition) is the most technically involved result to prove in this paper. We introduce the notion of “fake gradient”. Given fixed vectors {lossi,a}i[n],a{2,,L}\{\operatorname{\mathsf{loss}}_{i,a}\}_{i\in[n],a\in\{2,\dots,L\}}, we define

For every fixed vectors {lossi,a}i[n],a{2,,L}\{\operatorname{\mathsf{loss}}_{i,a}\}_{i\in[n],a\in\{2,\dots,L\}}, if W,A,BW,A,B are at random initialization, then with high probability

There are only two conceptually simple steps from Theorem 5 to Theorem 3 (see Appendix F).

First, one can use the stability lemmas in Section 5 to show that, the fake gradient ^f(W+W)F\|\widehat{\nabla}f(W+W^{\prime})\|_{F} after adversarial perturbation WW^{\prime} (with W21m\|W^{\prime}\|_{2}\leq\frac{1}{\sqrt{m}}) is also large.

Second, one can apply ε\varepsilon-net and union bound to turn “fixed loss\operatorname{\mathsf{loss}}” into “for all loss\operatorname{\mathsf{loss}}”. This allows us to turn the lower bound on the fake gradient into a lower bound on the true gradient kf(W+W)F\|\nabla_{k}f(W+W^{\prime})\|_{F}.

Therefore, in the rest of this section, we only sketch the ideas behind proving Theorem 5.

This should be quite intuitive to prove, in the following two steps.

2 Thought Experiment: Adding Small Rank-One Perturbation

At the same time, using the fact that kk satisfies (6.2), one can show that

Putting (a), (b), (c), and (d) together, we know for such specially chosen kk, at least with constant probability over the random perturbation of WkW^{\prime}_{k},

3 Real Proof: Randomness Decomposition and McDiarmid’s Inequality

There are only two main differences between (6.6) and our desired Theorem 5. First, (6.6) gives a gradient lower bound at W+WkW+W^{\prime}_{k}, while in Theorem 5 we need a gradient lower bound at random initialization WW. Second, (6.6) gives a lower bound on ^kf()\widehat{\nabla}_{k}f(\cdot) with constant probability for a small fraction of good coordinates kk, but in Theorem 5 we need a lower bound for the entire ^f()\widehat{\nabla}f(\cdot).

Randomness Decomposition. To fix the first issue, we resort to a randomness decomposition technique at least tracing back to the smooth analysis of Spielman and Teng :

Given small constant θ(0,1)\theta\in(0,1) and mm-dimensional random gN(0,1mI)g\sim\mathcal{N}(0,\frac{1}{m}{I}), we can rewrite g=g1+g3g=g_{1}+g_{3} where g1g_{1} follows from N(0,1mI)\mathcal{N}(0,\frac{1}{m}{I}) and g3g_{3} is very close to N(0,θ2mI)\mathcal{N}(0,\frac{\theta^{2}}{m}{I}).

(Note that there is no contradiction here because g1g_{1} and g3g_{3} shall be correlated.)

Using Proposition 6.1, for each good coordinate kk, instead of “adding” perturbation WkW^{\prime}_{k} to WW, we can instead decompose WW into W=W0+WkW=W_{0}+W^{\prime}_{k}, where W0W_{0} is distributed in the same way as WW. In other words, W0W_{0} is also at random initialization. If this idea is carefully implemented, one can immediately turn (6.6) into

Extended McDiarmid’s Inequality. To fix the second issue, one may wish to consider all the indices k[m]k\in[m] satisfying (6.5) and (6.2). Since there are at least δ\poly(ρ)\frac{\delta}{\poly(\rho)} fraction of such coordinates, if all of them satisfied (6.7), then we would have already proved Theorem 5. Unfortunately, neither can we apply Chernoff bound (because the events with different k[m]k\in[m] are correlated), nor can we apply union bound (because the event occurs only with constant probability).

Our technique is to resort to (an extended probabilistic variant of) McDiarmid’s inequality (see Appendix A.6) in a very non-trivial way to boost the confidence.

In other words, although there are N|N| difference terms, their total summation only grows in rate Nm1/6\frac{|N|}{m^{1/6}} according to (6.9). After applying a variant of McDiarmid’s inequality, we derive that with high probability over WNW^{\prime}_{N}, it satisfies

Finally, by sampling sufficiently many random sets NN to cover the entire space [m][m], we can show

Proof Sketch of Theorem 4: Objective Semi-Smoothness

The objective semi-smoothness Theorem 4 turns out to be much simpler to prove than Theorem 3. It only relies on Section 4 and 5, and does not need randomness decomposition or McDiarmid’s inequality. (Details in Appendix G.)

Recall that in Theorem 4, W~,A,B\widetilde{W},A,B are at random initialization. W˘\breve{W} is an adversarially chosen matrix with W˘W~\poly(ϱ)m\|\breve{W}-\widetilde{W}\|\leq\frac{\poly(\varrho)}{\sqrt{m}}, and WW^{\prime} is some other adversarial perturbation on top of W˘\breve{W}, satisfying Wτ0m\|W^{\prime}\|\leq\frac{\tau_{0}}{\sqrt{m}}. We denote by

and plug (7.1) into (7.2) to derive our final Theorem 4.

Appendix Roadmap

Appendix A recalls some old lemmas and derives some new lemmas in probability theory.

Appendix B serves for Section 4, the basic properties at random initialization.

Appendix C serves for Section 5, the stability after adversarial perturbation.

Appendix D, E and F together serve for Section 6 and prove Theorem 3, the Polyak-Łojasiewicz condition and gradient upper bound. In particular:

Appendix D serves for Section 6.1 (the indicator and backward coordinate bounds).

Appendix E serves for Section 6.3 (the randomness decomposition and McDiarmid’s inequality) and proves Theorem 5.

Appendix F shows how to go from Theorem 5 to Theorem 3.

Appendix G serves for Section 7, the proof of Theorem 4, the objective semi-smoothness.

Appendix H gives the final proof for Theorem 1, the GD convergence theorem.

Appendix I gives the final proof for Theorem 2, the SGD convergence theorem.

Parameters. We also summarize a few parameters we shall use in the proofs.

In Definition D.1, we shall introduce two parameters

to control the thresholds of indicator functions (recall Section 6.1).

In Definition E.3, we shall introduce parameter

to describe how much randomness we want to decompose out of WW (recall Section 6.2).

In (E.13) of Appendix E.4, we shall choose

which controls the size of the set NN where we apply McDiarmid’s inequality (recall Section 6.3).

Appendix

The goal of this section is to present a list of probability tools.

In Section A.1, we recall how to swap randomness.

In Section A.2, we recall concentration bounds for the chi-square distribution.

In Section A.3, we proved a concentration bound of sum of squares of ReLU of Gaussians.

In Section A.4 and Section A.5, we show some properties for random Gaussian vectors.

In Section A.6, we recall the classical McDiarmid’s inequality and then prove a general version of it.

If f(X,Y)f(X,Y) holds with probability at least 1ε1-\varepsilon, then

with probability at least 1ε1-\sqrt{\varepsilon} (over randomness of XX), the following event holds,

f(X,Y)f(X,Y) holds with probability at least 1ε1-\sqrt{\varepsilon} (over randomness of YY).

If PrX,Y[f(X,Y)a]ε\operatornamewithlimits{\mathbf{Pr}}_{X,Y}[f(X,Y)\geq a]\geq\varepsilon, then

A.2 Concentration of Chi-Square Distribution

Let XXk2X\sim{\cal X}_{k}^{2} be a chi-squared distributed random variable with kk degrees of freedom. Each one has zero mean and σ2\sigma^{2} variance. Then

Let x1,x2,,xnx_{1},x_{2},\cdots,x_{n} denote i.i.d. samples from N(0,σ2){\cal N}(0,\sigma^{2}). For any b1b\geq 1, we have

Since 2n8b+2n8b22n8b+2n8bn/b\frac{2n}{\sqrt{8}b}+\frac{2n}{8b^{2}}\leq\frac{2n}{\sqrt{8}b}+\frac{2n}{8b}\leq n/b. Thus,

Let x1,x2,,xmx_{1},x_{2},\cdots,x_{m} denote i.i.d. samples from N(0,1)\mathcal{N}(0,1), and yi=max{xi2logm,0}y_{i}=\max\{x_{i}^{2}-\log m,0\}. We have

On the other hand, each random variable yiy_{i} is O(1)O(1)-subgaussian. By subgaussian concentration,

A.3 Concentration of Sum of Squares of ReLU of Gaussians

Given nn i.i.d. Gaussian random variables x1,x2,,xnN(0,σ2)x_{1},x_{2},\cdots,x_{n}\sim\mathcal{N}(0,\sigma^{2}), we have

Using Chernoff bound, we know that with probability 1exp(ε2n/6)1-\exp(-\varepsilon^{2}n/6), i=1nmax(xi,0)2\sum_{i=1}^{n}\max(x_{i},0)^{2} is a at most degree-(1+ε)n2(1+\varepsilon)\frac{n}{2} Chi-square random variable. Let us say this is the first event.

Thus, we have with probability at least 1exp(ε2n/2)1-\exp(-\varepsilon^{2}n/2),

Let the above event denote the second event.

By taking the union bound of two events, we have with probability 12exp(ε2n/6)1-2\exp(-\varepsilon^{2}n/6)

Then rescaling the ε\varepsilon, we get the desired result. ∎

Given nn i.i.d. Gaussian random variables x1,x2,,xnN(0,σ2)x_{1},x_{2},\cdots,x_{n}\sim\mathcal{N}(0,\sigma^{2}), we have

Using Chernoff bound, we know that with probability 1exp(ε2n/6)1-\exp(-\varepsilon^{2}n/6), i=1nmax(xi,0)2\sum_{i=1}^{n}\max(x_{i},0)^{2} is a at most degree-(1ε)n2(1-\varepsilon)\frac{n}{2} Chi-square random variable. Let us say this is the first event.

Thus, we have with probability at least 1exp(ε2n/4)1-\exp(-\varepsilon^{2}n/4),

Let the above event denote the second event.

By taking the union bound of two events, we have with probability at least 12exp(ε2n/6)1-2\exp(-\varepsilon^{2}n/6),

Then rescaling the ε\varepsilon, we get the desired result. ∎

Combining Lemma A.7 and Lemma A.6, we have

Given nn i.i.d. Gaussian random variables x1,x2,,xnN(0,σ2)x_{1},x_{2},\cdots,x_{n}\sim{\cal N}(0,\sigma^{2}), let ϕ(a)=max(a,0)2\phi(a)=\max(a,0)^{2}. We have

For each i[m]i\in[m], let yi=(Ax)iy_{i}=(Ax)_{i}. Then yiN(0,σ~2)y_{i}\sim\mathcal{N}(0,\widetilde{\sigma}^{2}), where σ~2=2σ2/mx22\widetilde{\sigma}^{2}=2\sigma^{2}/m\cdot\|x\|_{2}^{2}. Using Corollary A.8, we have

Since y=Axy=Ax, thus we complete the proof. ∎

vi|v_{i}| follows i.i.d. from the following distribution: with half probability vi=0|v_{i}|=0, and with the other half probability vi|v_{i}| follows from folded Gaussian distributions N(0,2h2m)|\mathcal{N}(0,\frac{2\|h\|^{2}}{m})|.

mv22h2\frac{m\|v\|^{2}}{2\|h\|^{2}} is in distribution identical to χω2\chi^{2}_{\omega} (chi-square distribution of order ω\omega) where ω\omega follows from binomial distribution B(m,1/2)\mathcal{B}(m,1/2) (mm trials each with success rate 1/21/2).

A.4 Gaussian Vector Percentile: Center

Suppose xN(0,σ2)x\sim\mathcal{N}(0,\sigma^{2}) is a Gaussian random variable. For any t(0,σ]t\in(0,\sigma] we have

Similarly, if xN(μ,σ2)x\sim\mathcal{N}(\mu,\sigma^{2}), for any t(0,σ]t\in(0,\sigma], we have

Let xN(0,σ2I)x\sim\mathcal{N}(0,\sigma^{2}{I}). For any α(0,1/2)\alpha\in(0,1/2), we have with probability at least 1exp(α2m/100)1-\exp(-\alpha^{2}m/100),

there exists at least 12(1α)\frac{1}{2}(1-\alpha) fraction of ii such that xi5ασ/16x_{i}\geq 5\alpha\sigma/16, and

there exists at least 12(1α)\frac{1}{2}(1-\alpha) fraction of ii such that xi5ασ/16.x_{i}\leq-5\alpha\sigma/16\enspace.

Let c1=4/5c_{1}=4/5. For each i[m]i\in[m], we define random variable yiy_{i} as

Choosing δ=14α\delta=\frac{1}{4}\alpha, we have

We provide the definition of (α,σ)(\alpha,\sigma)-good. Note that this definition will be used often in the later proof.

there are at least 12(1α)\frac{1}{2}(1-\alpha) fraction coordinates satisfy that wiασw_{i}\geq\alpha\sigma; and

there are at least 12(1α)\frac{1}{2}(1-\alpha) fraction coordinates satisfy that wiασw_{i}\leq-\alpha\sigma.

Lemma A.13 gives the following immediate corollary:

Let xN(0,σ2I)x\sim\mathcal{N}(0,\sigma^{2}{I}). For any α(0,1/2)\alpha\in(0,1/2), we have with probability at least 1exp(α2m/100)1-\exp(-\alpha^{2}m/100) that xx is (α,σ/4)(\alpha,\sigma/4)-good.

It is clear that AxAx follows from a Gaussian distribution \mathcal{N}\big{(}0,\big{(}\sum_{i=1}^{k}\sigma_{i}^{2}\|x_{i}\|_{2}^{2}\big{)}{I}\big{)}, so we can directly apply Corollary A.15. ∎

A.5 Gaussian Vector Percentile: Tail

Without loss of generality we only prove the result for x=1\|x\|=1.

Fixing any such xx and letting β=logm2m\beta=\frac{\log m}{2\sqrt{m}}, we have y_{i}\sim\mathcal{N}\big{(}0,\frac{2}{m}\big{)} so for every p1p\geq 1, by Gaussian tail bound

Since β2p2mβ2mΩ(logm)\beta^{2}p^{2}m\geq\beta^{2}m\gg\Omega(\log m), we know that if yiβp|y_{i}|\geq\beta p occurs for q/p2q/p^{2} indices ii out of [m][m], this cannot happen with probability more than

Finally, by applying union bound over p=1,2,4,8,16,p=1,2,4,8,16,\dots we have with probability 1eΩ(β2qm)logq\geq 1-e^{-\Omega(\beta^{2}qm)}\cdot\log q,

In other words, vector yy can be written as y=y1+y2y=y_{1}+y_{2} where y2β\|y_{2}\|_{\infty}\leq\beta and y124qβ2logq\|y_{1}\|^{2}\leq 4q\beta^{2}\log q.

A.6 McDiarmid’s Inequality and An Extension

We state the standard McDiarmid’s inequality,

We prove a more general version of McDiarmid’s inequality,

Let w1,,wNw_{1},\dots,w_{N} be independent random variables and f ⁣:(w1,,wN)f\colon(w_{1},\dots,w_{N})\mapsto. Suppose it satisfies:

With probability at least 1p1-p over w1,,wNw_{1},\dots,w_{N}, it satisfies

Then, Pr[f(w1,,wN)μ/2]1N2peΩ(μ2N(c2+p)).\operatornamewithlimits{\mathbf{Pr}}[f(w_{1},\dots,w_{N})\geq\mu/2]\geq 1-N^{2}\sqrt{p}-e^{\Omega(\frac{-\mu^{2}}{N(c^{2}+p)})}\enspace.

For each t[N]t\in[N], we have with probability at least 1p1-\sqrt{p} over w1,,wtw_{1},\dots,w_{t}, it satisfies

Define those (w1,,wt)(w_{1},\dots,w_{t}) satisfying the above event to be KtK_{\leq t}.

Define random variable XtX_{t} (which depends only on w1,,wtw_{1},\dots,w_{t}) as

For every tt and fixed w1,,wt1w_{1},\dots,w_{t-1}.

If (w1,,w<t)∉K1××K<t(w_{\leq 1},\dots,w_{<t})\not\in K_{\leq 1}\times\cdots\times K_{<t}, then Xt=Xt1=NX_{t}=X_{t-1}=N.

If (w1,,w<t)K1××K<t(w_{\leq 1},\dots,w_{<t})\in K_{\leq 1}\times\cdots\times K_{<t},

If wt∉Ktw_{\leq t}\not\in K_{\leq t}, then XtXt1=N0X_{t}-X_{t-1}=N-\cdots\geq 0.

Recall from our assumption that, with probability at least 1p1-\sqrt{p} over wtw_{t} and w>tw_{>t}, it satisfies

Taking expectation over wtw_{t} and w>tw_{>t}, we have

This precisely means XtXt1c+pX_{t}-X_{t-1}\geq c+\sqrt{p}.

In sum, we have just shown that XtXt1(c+p)X_{t}-X_{t-1}\geq-(c+\sqrt{p}) always holds. By applying martingale concentration (with one-sided bound),

Notice that X0=μX_{0}=\mu so if we choose t=μ/2t=\mu/2, we have

and we have XN=f(w1,,wN)X_{N}=f(w_{1},\dots,w_{N}) with probability at least 1Np1-N\sqrt{p} (and XN=NX_{N}=N with the remaining probabilities). Together, we have the desired theorem.

Appendix B Basic Properties at Random Initialization

Recall that the recursive update equation of RNN can be described as follows

We introduce two notations that shall repeatedly appear in our proofs.

Section B.4 and Section B.5 prove that the consecutive intermediate layers, in terms of spectral norm, do not explode (for full and sparse vectors respectively).

Section B.6 proves that the backward propagation does not explode.

With probability at least 1exp(Ω(m/L2))1-\exp(-\Omega(m/L^{2})) over WW and AA,

Next, applying Corollary A.10, we know that if z1,z2,z3z_{1},z_{2},z_{3} are fixed (instead of defined as in (B.1)), then, letting Choosing ε=1/2L\varepsilon=1/2L, we have

In particular, since we have “for all” quantifies on z1z_{1} and z2z_{2} above, we can substitute the choice of z1z_{1} and z2z_{2} in (B.1) (which may depend on the randomness of WW and AA). This, together with (B.1), gives

With probability at least 1exp(Ω(m/L2))1-\exp(-\Omega(m/L^{2})) over WW and AA,

We can define z1,z2,z3z_{1},z_{2},z_{3} in the same way as (B.1). This time, we show a lower bound

Applying Corollary A.10, we know if z1,z2,z3z_{1},z_{2},z_{3} are fixed (instead of defined as in (B.1)), then choosing ε=1/8L\varepsilon=1/8L,

Substituting the choice of z1,z2,z3z_{1},z_{2},z_{3} in (B.1), and the lower bound (B.4), we have

Choosing ε=1/(4L)\varepsilon=1/(4L) gives the desired statement. ∎

This finishes the proof of Lemma B.3. \blacksquare

B.2 Forward Correlation

in the same way as (B.1) and (B.3) as in the proof of Lemma B.3. We again have the entries of M1,M2,M3M_{1},M_{2},M_{3} are i.i.d. from N(0,2m)\mathcal{N}(0,\frac{2}{m}) (recall Footnote 13). For the quantity M2z2M_{2}z_{2}, we can further decompose it as follows

where c5=116logmc_{5}=\frac{1}{16\log m} is some fixed parameter, ν\nu and ν\nu^{\prime} denote two vectors that are independently generated from N(0,2Im)\mathcal{N}(0,\frac{2{I}}{m}), and

It is clear that the two sides of (B.7) are identical in distribution (because M2N(0,2Im)M_{2}\sim\mathcal{N}(0,\frac{2{I}}{m}) and (z2c5α)+2+(z2)2=z22(z_{2}-c_{5}\alpha)_{+}^{2}+(z_{2}^{\prime})^{2}=z_{2}^{2}).

Next, suppose z1z_{1} and z2z_{2} are fixed (instead of depending on the randomness of WW and AA) and satisfies Note that if z1z_{1} and z2z_{2} are random, then they satisfy such constraints by Lemma B.3.

We can apply Corollary A.16 to obtain the following statement: with probability at least 1exp(Ω(α2m))1-\exp(-\Omega(\alpha^{2}m)),

is (α,σ/4)(\alpha,\sigma/4)-good where σ=(2mz122+2m((z2c5α)+)2+2mz322)1/2\sigma=\left(\frac{2}{m}\|z_{1}\|_{2}^{2}+\frac{2}{m}((z_{2}-c_{5}\alpha)_{+})^{2}+\frac{2}{m}\|z_{3}\|_{2}^{2}\right)^{1/2}. Using z1221/2\|z_{1}\|_{2}^{2}\geq 1/2, we can lower bound σ2\sigma^{2} as

w1w_{1} corresponds to the coordinates that are α/8m\geq\alpha/8\sqrt{m},

w2w_{2} is the remaining, which corresponds to the coordinates that are within (α/8m,α/8m)(-\alpha/8\sqrt{m},\alpha/8\sqrt{m}).

w3w_{3} corresponds to the coordinates that are α/8m\leq-\alpha/8\sqrt{m}, and

We write v=(v1,v2,v3)v=(v_{1},v_{2},v_{3}) according to the same partition. We consider the following three cases.

For each index kk in the first block, we have \big{(}\phi(w_{1}+rv_{1})\big{)}_{k}\neq\big{(}w_{1}+rv_{1}\big{)}_{k} only if (rv1)kα/8m(rv_{1})_{k}\leq-\alpha/8\sqrt{m}. However, if this happens, we have

Applying Lemma A.5, we know with probability at least 1eΩ(m)1-e^{-\Omega(\sqrt{m})},

Similarly, for each index kk in the third block, we have \big{(}\phi(w_{1}+rv_{1})\big{)}_{k}\neq 0 only if (rv1)kα/8m(rv_{1})_{k}\geq\alpha/8\sqrt{m}. Therefore, we can similarly derive that

To prove (B.9), we use triangle inequality,

Since ϕ(w2)α/8m\|\phi(w_{2})\|_{\infty}\leq\alpha/8\sqrt{m} and the size of support of ϕ(w2)\phi(w_{2}) is at most αm\alpha m, we have

Since vN(0,2mI)v\sim\mathcal{N}(0,\frac{2}{m}{I}), and since the size of support of v2v_{2} is at most αm\alpha m, we have v22α\|v_{2}\|\leq 2\sqrt{\alpha} with probability at least 1eΩ(αm)1-e^{-\Omega(\alpha m)} (due to chi-square distribution concentration). Thus

Together, by triangle inequality we have ϕ(w2+rv2)α3/24.\|\phi(w_{2}+rv_{2})\|\leq\frac{\alpha^{3/2}}{4}\enspace. This finishes the proof of (B.9).

Denoting by δ=(δ1,δ2,δ3)\delta=(\delta_{1},\delta_{2},\delta_{3}), we have

On the other hand, the random vector z=(IUU)w1+rv1z=(I-UU^{\top})w_{1}+rv_{1} follows from distribution N(μ,2r2m)\mathcal{N}(\mu,\frac{2r^{2}}{m}) for some fixed vector μ=(IUU)w1\mu=(I-UU^{\top})w_{1} and has at least m2(1α)\frac{m}{2}(1-\alpha) dimensions. By chi-square concentration, we have zr(13α/2)\|z\|\geq r(1-3\alpha/2) with probability at least 1eΩ(α2m)1-e^{-\Omega(\alpha^{2}m)}. Putting these together, we have

B.3 Forward δ𝛿\delta-Separateness

We first give the definition of δ\delta-Separable,

For any two vectors x,yx,y, we say xx and yy are δ\delta-separable if

We say a finite set XX is δ\delta-separable if for any two vectors x,yXx,y\in X, xx and yy are δ\delta-separable.

holds with probability at least 1exp(Ω(m))1-\exp(-\Omega(\sqrt{m})).

This also implies that the randomness in y1y_{1} is independent of the randomness in y2y_{2}.

It is easy to see that y12=x,yx2\|y_{1}\|_{2}=\frac{\langle x,y\rangle}{\|x\|_{2}}, so we can rewrite WyWy as follows

and we know the entries of M1,M2,M3,M4M_{1},M_{2},M_{3},M_{4} are i.i.d. from N(0,2m)\mathcal{N}(0,\frac{2}{m}), owing to a similar treatment as Footnote 13. For the vector M4z4M_{4}z_{4}, we further rewrite it as

Our plan is to first use the randomness in ww to argue that ww is (α,γ/4m)(\alpha,\gamma/4\sqrt{m})-good. Then we conditioned ww is good, and prove that the norm of (IUU)ϕ(w+νz4)(I-UU^{\top})\phi(w+\nu^{\prime}z_{4}^{\prime}) is lower bounded (using (B.7)).

we can use Corollary A.16 to obtain the following statement: ww is (α,σ/4)(\alpha,\sigma/4)-good with probability at least 1exp(Ω(α2m))1-\exp(-\Omega(\alpha^{2}m)) where σ=(2mz122+2mz22+2mz322+2m(z42c52α2))1/2\sigma=\left(\frac{2}{m}\|z_{1}\|_{2}^{2}+\frac{2}{m}z_{2}^{2}+\frac{2}{m}\|z_{3}\|_{2}^{2}+\frac{2}{m}(z_{4}^{2}-c_{5}^{2}\alpha^{2})\right)^{1/2}. Using z1221/2\|z_{1}\|_{2}^{2}\geq 1/2, we can lower bound σ2\sigma^{2} as

In other words, ww is (α,142m)(\alpha,\frac{1}{4\sqrt{2}\sqrt{m}})-good. Next, applying standard ε\varepsilon-net argument, we have with probability at least 1exp(Ω(α2m))1-\exp(-\Omega(\alpha^{2}m)), for all z1,z2,z4z_{1},z_{2},z_{4} satisfying (B.12), it satisfies ww is (α,18m)(\alpha,\frac{1}{8\sqrt{m}})-good. This allows us to plug in the random choice of z1,z2,z4z_{1},z_{2},z_{4} in (B.11).

We apply Lemma B.7 with the following setting

where ww is (α,γ/8m)(\alpha,\gamma/8\sqrt{m})-good. (We can do so because the randomness of vv is independent of the randomness of UU and ww.) Lemma B.7 tells us that, with probability at least 1exp(Ω(m))1-\exp(-\Omega(\sqrt{m})) over the randomness of vv,

B.4 Intermediate Layers: Spectral Norm

The following lemma bounds the spectral norm of (consecutive) intermediate layers.

We start with an important claim whose proof is almost identical to Lemma B.3.

Now, to prove the spectral norm bound in Lemma B.11, we need to go from “for each zb1z_{b-1} (see Claim B.12)” to “for all zb1z_{b-1}”. Since zb1z_{b-1} has mm dimensions, we cannot afford taking union bound over all possible zb1z_{b-1} (or its ε\varepsilon-net).

By applying an ε\varepsilon-net argument over all such possible (but sparse) (zb1)j(z_{b-1})_{j}, we have with probability at least 12O(m/L3)exp(Ω(m/L2))1exp(Ω(m/L2))1-2^{O(m/L^{3})}\exp(-\Omega(m/L^{2}))\geq 1-\exp(-\Omega(m/L^{2})), the above equation holds for all possible (zb1)j(z_{b-1})_{j}.

Next, taking a union bound over all j[L3]j\in[L^{3}], we have with probability at least 1exp(Ω(m/L2))1-\exp(-\Omega(m/L^{2})):

Without loss of generality we assume z2=1\|z\|_{2}=1. We can rewrite MyMy as follows

It is easy to see that M1M_{1} is independent of M2M_{2}. We can rewrite

Using Fact A.11 together with concentration bounds (for binomial distribution and for chi-square distribution), we have with probability at least 1exp(Ω(m/L2))1-\exp(-\Omega(m/L^{2})),

B.5 Intermediate Layers: Sparse Spectral Norm

This section proves two results corresponding to the spectral norm of intermediate layers with respect to sparse vectors. We first show Lemma B.14 and our Corollary B.15 and B.16 shall be direct applications of Lemma B.14.

For every k[m]k\in[m] and t2t\geq 2, with probability at least

Similar to the proof of Lemma B.11, we let UU denote the following column orthonormal matrix using Gram-Schmidt:

Applying ε\varepsilon-net over all kk-sparse vectors yy, we have with probability at least 1eO(klogm)eΩ(nLt2)1-e^{O(k\log m)}e^{-\Omega(nLt^{2})}, it satisfies yWU22nLt2m\|y^{\top}WU\|^{2}\leq\frac{2nLt^{2}}{m} for all kk-sparse vectors yy.

Conditioning on both events happen, we have

B.6 Backward Propagation

This section proves upper bound on the backward propagation against sparse vectors. We first show Lemma B.17 and our Corollary B.18 and B.19 shall be immediate corollaries.

with probability at least 1exp(Ω(m/L2))1-\exp(-\Omega(m/L^{2})) we have

Taking a union of the above two events, we get with probability 1exp(Ω(m/L2))exp(Ω(t2))1-\exp(-\Omega(m/L^{2}))-\exp(-\Omega(t^{2})),

Appendix C Stability After Adversarial Perturbation

Section C.4 considers a special type of rank-one perturbation matrix WW^{\prime}, and provides stability bounds on the forward and backward propagation.

As discussed in Section 5, the results of Section C.1, C.2 and C.3 shall be used twice, once for the final training updates (see Appendix G), and once for the randomness decomposition (see Appendix E). In contrast, the results of Section C.4 shall only be used once in Appendix E.

The goal of this section is to prove Lemma C.2,

We emphasize that all these parameters are polynomial in ϱ\varrho so negligible when comparing to mm.

After recursively applying (C.2), we can write

Applying Claim C.6, we have with probability at least 1eΩ(τ14/3m1/3)1-e^{-\Omega(\tau_{1}^{4/3}m^{1/3})}, one can write

where inequality ① is due to (C.5), and ② follows from our choice of τ2=4Lτ5logm\tau_{2}=4L\tau_{5}\log m. Thus, we have showed I and II of (C.1):

so III of (C.1) holds. IV and V of (C.1) are implied by Claim C.4, and VI is implied because

Then, we have with probability at least 1eΩ(m/L2)1-e^{-\Omega(m/L^{2})}

Using triangle inequality, we can calculate

With probability at least 1eΩ(τ14/3m1/3)1-e^{-\Omega(\tau_{1}^{4/3}m^{1/3})} the following holds. Whenever

then letting τ4=10(τ1)2/3\tau_{4}=10(\tau_{1})^{2/3} and τ5=3τ1\tau_{5}=3\tau_{1}, we have

Next, for each jS2j\in S_{2}, we must have

Similar to the proof of Lemma B.11, we let UU denote the following column orthonormal matrix using Gram-Schmidt:

Finally, taking ε\varepsilon-net over all ss-sparse vectors xx, we have the desired result. ∎

Combining Claim C.4 and Claim C.5, we have

With probability at least 1eΩ(τ14/3m1/3)1-e^{-\Omega(\tau_{1}^{4/3}m^{1/3})}, whenever

C.2 Intermediate Layers

The goal of this subsection is to prove Lemma C.7.

On the other hand, since Wτ0m\|W^{\prime}\|\leq\frac{\tau_{0}}{\sqrt{m}}, we also have

C.3 Backward

Without loss of generality we assume a=1\|a\|=1. We define set C{\cal C} to be

Again we assume a=1\|a\|=1 without loss of generality. We define set C{\cal C} to be

It is similar to the proof of Lemma lem:backwardb. ∎

It is similar to the proof of Lemma lem:backwarda. ∎

C.4 Special Rank-One Perturbation

We prove two lemmas regarding the special rank-one perturbation with respect to a coordinate k[m]k\in[m]. The first one talks about forward propagation.

After recursively calculating as above, using hi,0=0h_{i,0}^{\prime}=0, and applying triangle inequality, we have

where the last step follows by yτ0m\|y\|_{\infty}\leq\frac{\tau_{0}}{\sqrt{m}}, W2Nτ0m\|W^{\prime}\|_{2}\leq\frac{\sqrt{N}\tau_{0}}{\sqrt{m}} and Lemma lem:forwarda.

where ① follows by z21\|z\|_{2}\leq 1, ② follows by Lemma B.3, ③ follows by Corollary B.16, and ④ follows by yτ01m\|y\|_{\infty}\leq\tau_{0}\frac{1}{\sqrt{m}}.

where the last step follows by Corollary B.15 (which relies on Lemma lem:forwardb to give s=O(L5/3τ01/3N1/6)s=O(L^{5/3}\tau_{0}^{1/3}N^{1/6})) together with Lemma lem:forwardc.

Putting the three bounds into (C.9) (where the term one is the dominating term), we have

The next one talks about backward propagation.

By Corollary B.15 (which relies on Lemma lem:forwardb to give s=O(L5/3τ01/3N1/6)s=O(L^{5/3}\tau_{0}^{1/3}N^{1/6})), we know

Finally, we use the randomness of BB (recall that vv does not depend on BB), we conclude that with probability at least 1eΩ(m)1-e^{-\Omega(m)}, it satisfies

This time, we define set C{\cal C} to be

Appendix D Indicator and Backward Coordinate Bounds

Suppose WW and AA follow from random initialization. Given fixed set N1[m]N_{1}\subseteq[m] and define

Now, suppose z1z_{1} is fixed (instead of depending on the randomness of WW and AA), and it satisfies z16L\|z_{1}\|\leq 6L. We have each entry of yy is is distributed as {\cal N}\big{(}0,\frac{2\|z_{1}\|^{2}+2\|z_{2}\|^{2}}{m}\big{)}. By property of Gaussian (see Fact A.12), we have for each y[m]y\in[m],

Since we have N1|N_{1}| independent random variables, applying a Chernoff bound, we have

and therefore for a fixed vector μ\mu it satisfies

Applying Chernoff bound, with probability at least 1eΩ(N2/n)1-e^{-\Omega(|N_{2}|/n)}, we have N3,i(112n)N2|N_{3,i}|\geq(1-\frac{1}{2n})|N_{2}|. Applying union bound over all possible i[n]{i}i\in[n]\setminus\{i^{*}\}, we have N3=iiN3,iN_{3}=\bigcap_{i\neq i^{*}}N_{3,i} has cardinality at least N22\frac{|N_{2}|}{2}. ∎

and therefore for a fixed vector μ\mu it satisfies

D.2 Backward Coordinate Bound

The goal of this section is to prove Lemma D.7.

We have h(t)h(t) is Lh\mathfrak{L}_{h}-Lipschitz continuous.

We define two probabilistic events E1,E2E_{1},E_{2}.

Event E1E_{1} depends on the randomness of WW and AA:

Corollary B.16 says Pr[E1]1eΩ(ρ2)\operatornamewithlimits{\mathbf{Pr}}[E_{1}]\geq 1-e^{-\Omega(\rho^{2})}.

Event E2E_{2} depends on the randomness of BB:

By Gaussian tail bound, we have Pr[E2]1eΩ(ρ2)\operatornamewithlimits{\mathbf{Pr}}[E_{2}]\geq 1-e^{-\Omega(\rho^{2})}.

In the rest of the proof, we assume WW and AA are fixed and satisfy E1E_{1}. We let BB be the only source of randomness but we condition on BB satisfies E2E_{2}.

Step 1. Fixing bNb_{-N} and letting bNb_{N} be the only randomness, we claim for each kNk\in N:

We prove this inequality as follows. We can rewrite

and vkv_{k} is distributed as Gaussian random variable N(μ,σ2){\cal N}(\mu,\sigma^{2}), and μ\mu and σ2\sigma^{2} are defined as follows

where the second step follows by a,ba2b2|\langle a,b\rangle|\leq\|a\|_{2}\cdot\|b\|_{2}, the third step follows by (D.2).

Above, inequality ① follows by h(t)h(t)\in, inequality ② follows from (D.2). Using McDiarmid inequality (see Lemma A.18), we have

where we choose ε=N4nL\varepsilon=\frac{|N|}{4nL}.

Appendix E Gradient Bound at Random Initialization (Theorem 5)

The goal of this section is to understand the lower and upper bounds of gradients. Instead of analyzing the true gradient directly—where the forward and backward propagation have correlated randomness— we assume that the loss vectors {lossi,a}i[n],a{2,,L}\{\operatorname{\mathsf{loss}}_{i,a}\}_{i\in[n],a\in\{2,\dots,L\}} are fixed (no randomness) in this section, as opposed to being defined as lossi,a=Bhi,ayi,a\operatorname{\mathsf{loss}}_{i,a}=Bh_{i,a}-y^{*}_{i,a} which is random. We call this the “fake loss.” We define a corresponding “fake gradient” with respect to this fixed loss.

Given fixed vectors {lossi,a}i[n],a{2,,L}\{\operatorname{\mathsf{loss}}_{i,a}\}_{i\in[n],a\in\{2,\dots,L\}}, we define

where inequality ① uses Lemma B.11, Lemma B.3, and B2O(m)\|B\|_{2}\leq O(\sqrt{m}) with high probability. Applying triangle inequality and using the definition of fake gradient (see Definition E.1), we finish the proof of the first statement.

As for the second statement, we replace the use of Lemma B.11 with Corollary B.19. ∎

The rest of this section is devoted to proving a (much more involved) lower bound on this fake gradient.

In the rest of this section, we first present a elegant way to decompose the randomness in Section E.1 (motivated by smooth analysis ). We give a lower bound on the expected fake gradient in Section E.2. We then calculate the stability of fake gradient against rank-one perturbations in Section E.3. Finally, in Section E.4, we apply our extended McDiarmid’s inequality to prove Theorem 5.

We introduce a parameter θ\theta as follows:

where the notions β\beta_{-} and β+\beta_{+} come from Definition D.1. We have θδρ\theta\leq\frac{\delta}{\rho}.

entries of W2W_{2} are i.i.d. drawn from N(0,2m)\mathcal{N}(0,\frac{2}{m}) (so W2W_{2} is in the same distribution as WW);

With probability at least 1eΩ(ρ2)1-e^{-\Omega(\rho^{2})} over the randomness of W2W_{2}:

With probability at least 1eΩ(ρ2)1-e^{-\Omega(\rho^{2})} over the randomness of W2W_{2}:

We verify that the entries of W2W_{2} are i.i.d. Gaussian in two steps.

By standard Gaussian tail bound, we have with probability at least 1eΩ(ρ2)1-e^{-\Omega(\rho^{2})}, it satisfies g1ρm\|g_{1}\|_{\infty}\leq\frac{\rho}{\sqrt{m}} and therefore (11θ2)g1θ2ρm\|(1-\sqrt{1-\theta^{2}})g_{1}\|_{\infty}\leq\frac{\theta^{2}\rho}{\sqrt{m}}. If this happens, for every k[m]k\in[m], we have (see for instance Fact A.12) that Prg2[(θg2)k>θm]14\operatornamewithlimits{\mathbf{Pr}}_{g_{2}}[(\theta g_{2})_{k}>\frac{\theta}{\sqrt{m}}]\geq\frac{1}{4} and Prg2[(θg2)k<θm]14\operatornamewithlimits{\mathbf{Pr}}_{g_{2}}[(\theta g_{2})_{k}<-\frac{\theta}{\sqrt{m}}]\geq\frac{1}{4}. Recalling

and using the fact that θ12ρ\theta\leq\frac{1}{2\rho}, we conclude that

Again by Gaussian tail bound, we have with probability at least 1eΩ(ρ2)1-e^{-\Omega(\rho^{2})}, it satisfies g1,g2ρm\|g_{1}\|_{\infty},\|g_{2}\|_{\infty}\leq\frac{\rho}{\sqrt{m}}. Therefore, (11θ2)g1θ2ρm\|(1-\sqrt{1-\theta^{2}})g_{1}\|_{\infty}\leq\frac{\theta^{2}\rho}{\sqrt{m}} and using our assumption on θ\theta, we have

Now, we introduce another two ways of decomposing randomness based on this definition.

E.2 Gradient Lower Bound in Expectation

The goal of this subsection is to prove Lemma E.6 and then translate it into our Core Lemma A (see Lemma E.7).

and recall from Lemma lem:random-decomposed, we have

We let N1=NN_{1}=N, apply Lemma D.2 to obtain N4N1N_{4}\subseteq N_{1} with N4βN164L|N_{4}|\geq\frac{\beta_{-}|N_{1}|}{64L}, and apply Lemma D.7 to obtain N5N4N_{5}\subseteq N_{4}. According to the statements of these lemmas, we know that with probability at least 1eΩ(ρ2)1-e^{-\Omega(\rho^{2})}, the random choice of W1,A,BW_{1},A,B will satisfy N5βN1100L|N_{5}|\geq\frac{\beta_{-}|N_{1}|}{100L}.

This implies, by triangle inequality and the fact that lossi,alossi,a\|\operatorname{\mathsf{loss}}_{i,a}\|\leq\|\operatorname{\mathsf{loss}}_{i^{*},a^{*}}\|,

This implies, by triangle inequality and the fact that lossi,alossi,a\|\operatorname{\mathsf{loss}}_{i,a}\|\leq\|\operatorname{\mathsf{loss}}_{i^{*},a^{*}}\|,

Using Lemma B.3 and Lemma lem:forwarda we have

This concludes that, with probability at least 1eΩ(ρ2)1-e^{-\Omega(\rho^{2})} over W1W_{1} and AA, it satisfies

This concludes that, with probability at least 1eΩ(ρ2)1-e^{-\Omega(\rho^{2})} over W1W_{1} and AA, it satisfies

This is a fixed value, independent of WNW^{\prime}_{N}.

Finally, for each kN5k\in N_{5}, owing to Lemma D.7 and Lemma B.3, we have

Putting (E.2) and (E.7) back to (E.1), we conclude that

In Lemma E.6, we split WW into W=W1+WNW=W_{1}+W_{N}^{\prime} for a fixed N[m]N\subseteq[m]. In this subsection, we split WW into three parts W=W2+WN+WNW=W_{2}+W_{N}^{\prime}+W_{-N}^{\prime} following Lemma def:random-decompc. Our purpose is to rewrite Lemma E.6 into the following variant:

with probability 1eΩ(ρ2)\geq 1-e^{-\Omega(\rho^{2})} over the randomness of W2,WN,A,B,NW_{2},W_{-N}^{\prime},A,B,N, the following holds:

or putting it in another way, using our choice of qq,

with probability 1eΩ(ρ2)\geq 1-e^{-\Omega(\rho^{2})} over the randomness of W2,WN,A,B,NW_{2},W_{-N}^{\prime},A,B,N, the following holds (q=βNcρ2q=\frac{\beta_{-}|N|}{c\rho^{2}} for some sufficiently large constant):

Applying simple tricks to switch the ordering of randomness (see Fact A.2), we have

with probability 1eΩ(ρ2)\geq 1-e^{-\Omega(\rho^{2})} over the randomness of W2,A,BW_{2},A,B, the following holds:

with probability 1eΩ(ρ2)\geq 1-e^{-\Omega(\rho^{2})} over the randomness of NN, the following holds:

with probability 1eΩ(ρ2)\geq 1-e^{-\Omega(\rho^{2})} over randomness of WNW_{-N}^{\prime}, the following holds:

Repeating the choice of NN for tt times, and using union bound, we have

with probability 1eΩ(ρ2)\geq 1-e^{-\Omega(\rho^{2})} over the randomness of W2,A,BW_{2},A,B, the following holds:

with probability 1eΩ(ρ2)\geq 1-e^{-\Omega(\rho^{2})} over the randomness of N1,N2,,NtN_{1},N_{2},\cdots,N_{t}, the following holds:

for all N{N1,N2,,Nt}N\in\{N_{1},N_{2},\cdots,N_{t}\}, the following holds:

with probability 1eΩ(ρ2)\geq 1-e^{-\Omega(\rho^{2})} over randomness of WNW_{-N}^{\prime}, the following holds:

Combining the above statement with Fact A.1 (to merge randomness), we conclude the proof. ∎

E.3 Gradient Stability

The goal of this subsection is to prove Lemma E.8 and then translate it into our Core Lemma B (see Lemma E.10).

\left|\left\{k\in[m]~{}\bigg{|}~{}\Big{|}\|\widehat{\nabla}_{k}f(\widetilde{W}+W^{\prime\prime}_{j})\|_{2}^{2}-\|\widehat{\nabla}_{k}f(\widetilde{W})\|_{2}^{2}\Big{|}\leq O\left(\frac{\rho^{11}\theta^{1/3}}{m^{1/6}}\right)\right\}\right|\geq\left(1-\frac{\rho^{5}\theta^{2/3}}{m^{1/3}}\right)m\enspace.

^kf(W~+Wj)22^kf(W~)22O(ρ6)\left|\|\widehat{\nabla}_{k}f(\widetilde{W}+W^{\prime\prime}_{j})\|_{2}^{2}-\|\widehat{\nabla}_{k}f(\widetilde{W})\|_{2}^{2}\right|\leq O(\rho^{6}) for every k[m]k\in[m].

Combining this with ^kf(W~)2O(ρ4)\|\widehat{\nabla}_{k}f(\widetilde{W})\|_{2}\leq O(\rho^{4}) from Lemma E.2, we finish the proof of the first item.

Now, for the indices in NN that are randomly sampled from [m][m], if NJNS|N\cap J|\geq|N|-S for a parameter S=ρ2S=\rho^{2}, then we have

Otherwise, if NJNS|N\cap J|\leq|N|-S, this means at least SS indices that are chosen from NN are outside J[m]J\subseteq[m]. This happens with probability at most (ρ5θ2/3m1/3)SeΩ(ρ2)\left(\frac{\rho^{5}\theta^{2/3}}{m^{1/3}}\right)^{S}\leq e^{-\Omega(\rho^{2})}. ∎

In this subsection, we split WW into three parts W=W2+WN+WNW=W_{2}+W_{N}^{\prime}+W_{-N}^{\prime} followingdef:random-decomp:N-N. Our purpose is to rewrite Lemma E.8 into the following variant:

with probability 1eΩ(ρ2)\geq 1-e^{-\Omega(\rho^{2})} over random N[m]N\subset[m], random jNj\in N, random W,A,BW,A,B, and random WjW_{j}^{\prime}, the following holds:

We can split the randomness (see Fact A.1) and derive that

With probability 1eΩ(ρ2)\geq 1-e^{-\Omega(\rho^{2})} over N[m]N\subseteq[m], W,A,BW,A,B, the following holds:

With probability 1eΩ(ρ2)\geq 1-e^{-\Omega(\rho^{2})} over jNj\in N and WjW_{j}^{\prime\prime}, Eq. (E.11) holds.

Applying standard ε\varepsilon-net argument, we derive that

With probability 1eΩ(ρ2)\geq 1-e^{-\Omega(\rho^{2})} over N[m]N\subseteq[m], W,A,BW,A,B, the following holds:

Finally, letting W=W2+WN+WNW=W_{2}+W_{N}^{\prime}+W_{-N}^{\prime}, we have

With probability 1eΩ(ρ2)\geq 1-e^{-\Omega(\rho^{2})} over the randomness of N[m],W2,WN,WN,A,BN\subseteq[m],W_{2},W_{N},W_{-N}^{\prime},A,B, the following holds:

Finally, splitting the randomness (using Fact A.1), and applying a union bound over multiple samples N1,,NtN_{1},\dots,N_{t}, we finish the proof. ∎

E.4 Main Theorem: Proof of Theorem 5

Combining Core Lemma A and B (i.e., Lemma E.7 and Lemma E.10), we know that if N|N| is appropriately chosen, with probability 1eΩ(ρ2)\geq 1-e^{-\Omega(\rho^{2})} over the randomness of W2,A,BW_{2},A,B and N1,NtN_{1},\cdots N_{t}, the following holds:

for every N{N1,N2,,Nt}N\in\{N_{1},N_{2},\cdots,N_{t}\}, the following holds:

with probability at 1eΩ(ρ2)1-e^{-\Omega(\rho^{2})} over the randomness of WNW_{-N}^{\prime}, the following boxed statement holds:

Below, we condition on the high probability event (see Lemma lem:random-decomposed) that uN3θρ2m\|u_{N}\|_{\infty}\leq\frac{3\theta\rho}{2\sqrt{m}}. The boxed statement tells us:

With probability 1eΩ(ρ2)\geq 1-e^{-\Omega(\rho^{2})} over uNu_{N}, it satisfies

At the same time, we have 0ϝ(uN)O(Nρ8)0\leq\digamma(u_{N})\leq O(N\rho^{8}) owing to Lemma E.2. Therefore, scaling down ϝ(uN)\digamma(u_{N}) by Θ(Nρ8)\Theta(N\rho^{8}) (to make sure the function value stays in $$), we can applying extended McDiarmid’s inequality (see Lemma A.19), we have

As long as Nρ22β2N\geq\frac{\rho^{22}}{\beta_{-}^{2}}, we have that the above probability is at least 1eΩ(ρ2)1-e^{-\Omega(\rho^{2})}.

in order to satisfy Lemma E.7. We replace the boxed statement with (E.12). This tells us

with probability 1eΩ(ρ2)\geq 1-e^{-\Omega(\rho^{2})} over the randomness of W2,A,BW_{2},A,B and N1,NtN_{1},\cdots N_{t}, the following holds:

for every N{N1,N2,,Nt}N\in\{N_{1},N_{2},\cdots,N_{t}\}, the following holds:

with probability at 1eΩ(ρ2)1-e^{-\Omega(\rho^{2})} over the randomness of WNW_{-N}^{\prime} and WNW^{\prime}_{N}, the following holds:

After rearranging randomness, and using W=W2+WN+WNW=W_{2}+W_{N}^{\prime}+W_{-N}^{\prime}, we have

with probability 1eΩ(ρ2)\geq 1-e^{-\Omega(\rho^{2})} over the randomness of W,A,BW,A,B, the following holds:

with probability 1eΩ(ρ2)\geq 1-e^{-\Omega(\rho^{2})} over the randomness of N1,NtN_{1},\cdots N_{t}, the following holds:

for every N{N1,N2,,Nt}N\in\{N_{1},N_{2},\cdots,N_{t}\}, the following holds:

Finally, since N1,,NtN_{1},\dots,N_{t} are tt random subsets of [m][m], we know that with probability at least 1eΩ(ρ2)1-e^{-\Omega(\rho^{2})}, for each index k[m]k\in[m], it is covered by at most ρ2\rho^{2} random subsets. Therefore,

Appendix F Gradient Bound After Perturbation (Theorem 3)

We use the same fake gradient notion (see Definition E.1) and first derive the following result based on Lemma E.2 and Theorem 5.

Lower bound on ^f(W~+W)F\|\widehat{\nabla}f(\widetilde{W}+W^{\prime})\|_{F}. Recall

In Theorem 5 of the previous section, we already have a lower bound on ^f(W~)F\|\widehat{\nabla}f(\widetilde{W})\|_{F}. We just need to upper bound ^f(W~+W)^f(W~)F\|\widehat{\nabla}f(\widetilde{W}+W^{\prime})-\widehat{\nabla}f(\widetilde{W})\|_{F}.

where ① follows by definition and ② and ③ follow by the triangle inequality. Note that in inequality ③ we have hidden four more higher order terms in o(m1/3)o(m^{1/3}). We ignore the details for how to bound them, for the ease of presentation. We bound the three terms separately:

Using Corollary B.18 (with s2=O(L10/3τ02/3)s^{2}=O(L^{10/3}\tau_{0}^{2/3}) from Lemma lem:forwardb) and Lemma B.3 we have

Finally, using (ab)212a2b2(a-b)^{2}\geq\frac{1}{2}a^{2}-b^{2}, we have

where the last step follows by (F.1) (with our sufficiently large choice of mm) and Theorem 5.

Upper bound on ^f(W~+W)F\|\widehat{\nabla}f(\widetilde{W}+W^{\prime})\|_{F}. Using (a+b)22a2+2b2(a+b)^{2}\leq 2a^{2}+2b^{2}, we have

where ① follows by Lemma E.2 and ② follows by (F.1).

Upper bound on ^fi(W~+W)F\|\widehat{\nabla}f_{i}(\widetilde{W}+W^{\prime})\|_{F}. This is completely analogous so we do not replicate the proofs here. ∎

Appendix G Objective Semi-Smoothness (Theorem 4)

haO(L)\|h_{a}\|\leq O(L) (by h~aO(L)\|\widetilde{h}_{a}\|\leq O(L) from Lemma B.3 and h~ahao(1)\|\widetilde{h}_{a}-h_{a}\|\leq o(1) from Lemma lem:forwarda); and

WhaW2ha\|W^{\prime}h_{a}\|\leq\|W^{\prime}\|_{2}\|h_{a}\|.

Above, ① is by the definition of f()f(\cdot); ② is by (G.2); ③ is by the definition of f()\nabla f(\cdot) (see Fact 2.5 for an explicit form of the gradient); ④ is by Claim G.2.

where ① uses Lemma C.7 and ② uses Claim G.2 to bound hah˘a2\|h_{a}-\breve{h}_{a}\|_{2}.

Putting (G.4), (G.5) and (G.6) back to (G.3), and using triangle inequality, we have the desired result. ∎

Dk,k1|D^{\prime\prime}_{k,k}|\leq 1 for every k[m]k\in[m],

Dk,k0D^{\prime\prime}_{k,k}\neq 0 only when \mathds1ak0\mathds1bk0\mathds{1}_{a_{k}\geq 0}\neq\mathds{1}_{b_{k}\geq 0}, and

ϕ(a)ϕ(b)=D(ab)+D(ab)\phi(a)-\phi(b)=D(a-b)+D^{\prime\prime}(a-b)

We verify coordinate by coordinate for each k[m]k\in[m].

If ak0a_{k}\geq 0 and bk0b_{k}\geq 0, then (\phi(a)-\phi(b))_{k}=a_{k}-b_{k}=\big{(}D(a-b)\big{)}_{k}.

If ak<0a_{k}<0 and bk<0b_{k}<0, then (\phi(a)-\phi(b))_{k}=0-0=\big{(}D(a-b)\big{)}_{k}.

If ak0a_{k}\geq 0 and bk<0b_{k}<0, then (\phi(a)-\phi(b))_{k}=a_{k}=(a_{k}-b_{k})+\frac{b_{k}}{a_{k}-b_{k}}(a_{k}-b_{k})=\big{(}D(a-b)+D^{\prime\prime}(a-b)\big{)}_{k}, if we define (D)k,k=bkakbk(D^{\prime\prime})_{k,k}=\frac{b_{k}}{a_{k}-b_{k}}\in.

If ak<0a_{k}<0 and bk0b_{k}\geq 0, then (\phi(a)-\phi(b))_{k}=-b_{k}=0\cdot(a_{k}-b_{k})-\frac{b_{k}}{b_{k}-a_{k}}(a_{k}-b_{k})=\big{(}D(a-b)+D^{\prime\prime}(a-b)\big{)}_{k}, if we define (D)k,k=bkbkba(D^{\prime\prime})_{k,k}=\frac{b_{k}}{b_{k}-b_{a}}\in. ∎

Appendix H Convergence Rate of Gradient Descent (Theorem 1)

In the rest of the proof, we first assume that for every t=0,1,,T1t=0,1,\dots,T-1, the following holds

We shall prove the convergence of gradient descent assuming (H.1), so that previous statements such as Theorem 4 and Theorem 3 can be applied. At the end of the proof, we shall verify that (H.1) is satisfied throughout the gradient descent process.

We need to verify for each tt, W(t)W(0)F\|W^{(t)}-W^{(0)}\|_{F} is small so that (H.1) holds. By Theorem 3,

where the last step follows by our choice of TT. ∎

Appendix I Convergence Rate of Stochastic Gradient Descent (Theorem 2)

The proof is almost identical to that of Theorem 1. We again have with probability at least 1eΩ(ρ2)1-e^{-\Omega(\rho^{2})}

Again, we first assume for every t=0,1,,T1t=0,1,\dots,T-1, the following holds

We shall prove the convergence of SGD assuming (I.1), so that previous statements such as Theorem 4 and Theorem 3 can be applied. At the end of the proof, we shall verify that (I.1) is satisfied throughout the SGD with high probability.

At the same time, we also have the following absolute value bound:

Above, ① uses Theorem 4 and Cauchy-Shwartz A,BAFBF\langle A,B\rangle\leq\|A\|_{F}\|B\|_{F}, and ② uses Theorem 3 and the derivation from (I.2).

By one-sided Azuma’s inequality (a.k.a. martingale concentration), we have with probability at least 1eΩ(ρ2)1-e^{-\Omega(\rho^{2})}, for every t=1,2,,Tt=1,2,\dots,T:

On one hand, after T=Ω(ρ15log(nL2/ε)ηδm)T=\Omega(\frac{\rho^{15}\log(nL^{2}/\varepsilon)}{\eta\delta m}) iterations we have

Therefore, we have f(W(T))εf(W^{(T)})\leq\varepsilon.

On the other hand, for every t=1,2,,Tt=1,2,\dots,T, we have

where in ① we have used 2atb2t=(bta/b)2+a2/b22a\sqrt{t}-b^{2}t=-(b\sqrt{t}-a/b)^{2}+a^{2}/b^{2}, and in ② we have used \eta\leq O\big{(}\frac{\delta}{\rho^{42}m}\big{)}. This implies f(W(t))O(nρ2L3)f(W^{(t)})\leq O(n\rho^{2}L^{3}). We can now verify for each tt, W(t)W(0)F\|W^{(t)}-W^{(0)}\|_{F} is small so that (I.1) holds. By Theorem 3,

where the last step follows by our choice of TT. ∎

References