Products of Many Large Random Matrices and Gradients in Deep Neural Networks

Boris Hanin, Mihai Nica

Introduction

Products of independent random matrices are a classical topic in probability and mathematical physics with applications to a variety of fields, from wireless communication networks to the physics of black holes , random dynamical systems , and recently to the numerical stability of randomly initialized neural networks . In the context of neural networks, products of random matrices are related to the numerical stability of gradients at initialization and therefore give precise information about the exploding and vanishing gradient problem (see Section 1.5, Proposition 2, and Corollary 3). The purpose of this article is to prove several new results about such products in the regime where the number of terms and the sizes of matrices grow simultaneously. This regime has attracted attention but remains poorly understood. We find new phenomena not present when the number of terms and the size of the matrices are sent to infinity sequentially rather than simultaneously (see Section 1.1 for more on this point).

where D(i){D}^{(i)} are ni×nin_{i}\times n_{i} diagonal matrices

As explained in Section 1.4 below, Zd(u)Z_{d}(\vec{u}) can be thought of as a line-to-line partition function in a disordered medium given by the computation graph underlying the matrix product defining M(d){M}^{(d)}, with u\vec{u} corresponding to a kind of initial condition. The diagonal matrices D(i){D}^{(i)} then correspond to {0,1}\{0,1\}-valued spins on the vertices of this graph, restricting the allowed paths of the directed random polymer. With this interpretation, our main result, Theorem 1, shows that the analogue of the free energy, namely

is Gaussian up to an error that tends to zero when ni,dn_{i},d tend to infinity.

Let M(d)=X(d)X(1){M}^{(d)}={X}^{(d)}\cdots{X}^{(1)} be as in (2). Then, the norm of the vector M(d)u{M}^{(d)}\vec{u} is approximately log-normal distributed:

where the implicit constant is uniform for pp in a compact subset of (0,1](0,1], and β\beta in a compact set bounded away from β=0\beta=0. Moreover, for every k0,k\geq 0, satisfying (k2)<min1idni,\binom{k}{2}<\min_{1\leq i\leq d}n_{i}, we have

where the implicit constant depends on kk and the moments of μ\mu but not on β,d,ni.\beta,d,n_{i}.

The two conclusions, equation (7) and equation (6), of Theorem 1 are proven separately and have independent proofs. We prove equation (6) by a path-counting type argument in Section 3. The argument in Section 4 for equation (7), in contrast, uses a central limit theorem for martingales.

Theorem 1 shows that the free energy ln(Zd(u))=ln(M(d)u22)\ln(Z_{d}(\vec{u}))=\ln(||{M}^{(d)}\vec{u}||_{2}^{2}) from (4) is Gaussian in the double scaling limit

achieved for instance when ni(d)n_{i}(d) are equal and proportional to d.d. This asymptotic normality for ln(Zd(u))\ln(Z_{d}(\vec{u})) cannot be seen by taking the limits dd\to\infty and min{ni}\min\{n_{i}\} to infinity one after the other. Indeed, consider the case when p=1p=1 and μ=N(0,1),\mu=\mathcal{N}(0,1), the standard Gaussian measure. A simple computation using the rotational invariance of i.i.d. Gaussian matrices shows the equality in distribution

where χn2\chi_{n}^{2} is a chi-squared random variable with nn degrees of freedom and the terms in the product are independent. In the limit where dd is fixed and min{ni,i1},\min\{n_{i},\,i\geq 1\}\to\infty, we have χni2/ni1+O(ni1/2)\chi_{n_{i}}^{2}/n_{i}\approx 1+O(n_{i}^{-1/2}) and so

On the other hand, if the nin_{i} are uniformly bounded, then we have

In fact, for ninn_{i}\equiv n fixed, ln(Zd(u))\ln(Z_{d}(\vec{u})) converges only with an addition 1/d1/d scaling:

making (8) an interesting regime for ln(Zd(u))\ln(Z_{d}(\vec{u})). The non-commutativity of the ni,dn_{i},d\to\infty limits is well-known and is related to the fact that the local statistics of the singular values of M(d){M}^{(d)} are sensitive to the order in which the limits above are taken. Remaining in the simple case of p=1p=1 and μ\mu Gaussian, a simple application of the central limit results show that when all the nin_{i} are equal and are related to dd by n=2β1dn=2\beta^{-1}d, then the exact chi-squared representation of equation (9) gives the convergence in distribution:

which is of course consistent with Theorem 1. Part of the content of Theorem 1 is therefore that this result is essentially independent of the parameter pp and the measure μ\mu according to which the entries of the matrices W(i){W}^{(i)} are distributed. See Section 1.3 for more discussion on the novel aspects of Theorem 1.

2 Connection to previous work in Random Matrix Theory

The literature on products of random matrices is vast. Much of the previous work concerns products of dd i.i.d. random matrices, each of size n×nn\times n. Such ensembles have been well studied in two distinct regimes: (a) when nn is fixed and dd\to\infty and (b) when dd is fixed and nn\to\infty. Case (a) is related to multiplicative ergodic theory and the study of Lyapunov exponents. The seminal articles in this regime are the results of Furstenberg and Kesten , which gives general conditions for the existence of the top Lyapunov exponent

and the multiplicative ergodic theorem of Osceledets , which gives conditions for almost sure (deterministic) values for all the Lyapunov exponents. Many more recent works characterize the Lyapunov exponents under more specific assumptions, most notably for matrices which are rotationally invariant or which have entries that are real or complex Gaussians, see e.g. as well as the survey and references therein.

Case (b), where dd is fixed and nn\to\infty, falls into the setting of free probability. Indeed, one of the great successes of free probability is the idea of “asymptotic freeness”: in the limit nn\to\infty, a collection of dd independent n×nn\times n random matrices behave like a collection of dd freely independent random variables on a non-commutative probability space (see e.g. Chapter 5 or Chapter 1 and 4). Therefore, case (b) is closely related to a product of dd freely independent random variables; precise results are obtained in . Earlier results examine case (b) without explicit use of free probability. The problem of first taking nn\to\infty and afterwards taking dd\to\infty can also be handled using the tools of free probability in the case of Gaussian matrices, see .

As explained in the Introduction and in Section 1.1, the regimes (a), (b) are asymptotically incompatible in the sense that the limits dd\to\infty, and nn\to\infty do not generally commute on the level of the local behavior of the singular value distribution. Indeed, the problem of understanding what happens when both are scaled simultaneously is mentioned as an open problem in . To explain this further, we note that the work of Newman [16, Thm. 1] in regime (a) shows that when p=1p=1 and njnn_{j}\equiv n is fixed, the density of the Lyapunov exponents of M(d){M}^{(d)} converges in the limit when first dd\to\infty and then nn\to\infty to the triangular density

The work of Tucci [27, Thm. 3.2, Ex. 3.4] shows that for Gaussian ensembles related to M(d){M}^{(d)} one obtains the same global limit in the regime (b) when first nn\to\infty and then d.d\to\infty. However, as explained in Section 5, while the global density of all the Lyapunov exponents is the triangular law in both cases, the local behavior (e.g. the fluctuations of the top Lyapunov exponent) is observed to be different depending on the order of the limits even in the exactly solvable special case of products of complex Ginibre matrices.

From this spectral point of view, Theorem 1 gives information about certain averages of the Lyapunov exponents. To see this, fix n0n_{0} and let ni,i1,n_{i},\,i\geq 1, and dd tend to infinity in accordance with (8). Note we we specifically do not take n0n_{0} to infinity. Denote by σ1,,σn0\sigma_{1},\ldots,\sigma_{n_{0}} the non-zero singular values of M(d){M}^{(d)}, and by v1,,vn0\vec{v}_{1},\ldots,\vec{v}_{n_{0}} the corresponding left-singular vectors. Then we have

Hence, Theorem 1 can be interpreted as the statement that the logarithm of the average of the non-zero singular values for M(d)u2||{M}^{(d)}\vec{u}||^{2} is a Gaussian with mean β/2-\beta/2 and variance β\beta in the limit (8). These non-trivial corrections in β\beta can be seen as a finite temperature correction to the maximal entropy regime of Tucci in which first nn\to\infty and then dd\to\infty. For more on this point of view, we refer the reader to [1, Section 3.2].

Finally, in the specific case where the random matrices X(j){X}^{(j)} are complex Ginibre matrices (i.e. the matrix entries are iid complex Gaussian), very recent work looks at the limiting spectrum under the joint scaling limit dd\to\infty, nn\to\infty where the ratio d/nd/n is fixed or going to \infty. This work analyzes exact determinental formulas for the joint distribution of singular values available in the case of complex Ginibre matrices. The analogous formulas for real Gaussian matrices given in are significantly more complicated and such an explicit analysis appears to be much more difficult.

3 Contribution of the Present Work

In the context of these previous random matrix results, let us point out four novel aspects of Theorem 1. First, it deals with the joint d,nd,n\to\infty limit for a large class of non-Gaussian matrices with real entries. There is no integrable structure to our matrix ensembles, and we rely instead on a sum-over-paths approach to analyze the moments (6) and a martingale CLT approach for obtaining the KS distance estimates (7).

An additional novelty of Theorem 1 is it proves the distribution of \big{|}\big{|}{M}^{(d)}\vec{u}\big{|}\big{|}_{2}^{2} is (mostly) universal: it does not depend on the higher moments of the distribution μ\mu beyond the mean and variance, with the exception of the fourth moment μ4\mu_{4} appearing in β\beta in the term u44(μ43)/pn1\left\|\vec{u}\right\|_{4}^{4}(\mu_{4}-3)/pn_{1}. In the regime njnn_{j}\equiv n and d/nβd/n\equiv\beta, this term is a 1/n1/n correction.

The fourth and final novelty of Theorem 1 we would like to emphasize is that our results are non-asymptotic, i.e. we obtain an explicit error term of the form i=1dni2\sum_{i=1}^{d}{n^{-2}_{i}}. This is particularly useful when using Theorem 1 for studying gradients in randomly initialized neural networks (see Section 1.5).

4 Connection to Random Polymers

The matrix ensembles M(d){M}^{(d)} studied in this article, in the case ni=n,i=1,,dn_{i}=n,\,\,i=1,\ldots,d, are related to directed random polymers on the complete graph of size nn. This model were recently explored in detail (c.f. e.g. ). A key object for these polymers is the line-to-line partition function

The fact that the weights are positive makes the analysis of the partition function of this traditional random polymer model different than the analysis of the matrix product M(d){M}^{(d)} defined in (1). In particular, no cancellation is possible between the terms in the definition of Z(d)Z(d) above, causing Z(d)Z(d) to be exponential in dd. The nn fixed and dd\to\infty limit of of the partition function in the case of these positive weights is the object of study in . As explained in Section 1.3, Theorem 1 studies a different regime where both d,nd\to\infty,n\to\infty at the same time. The fact that our weights are mean zero, gives rise to significant cancellation in the terms of Zd(u)Z_{d}(\vec{u}) from (4), so that the partition function in our mean zero model does not grow exponentially with dd provided nn grows with dd as in (8). Additionally, if p<1p<1 in our model, the effect of the diagonal Bernoulli matrices D(i){D}^{(i)} is to close every vertex with probability 1p1-p. The sum over paths in our partition function then becomes the sum only over those paths that pass through vertices that are open.

5 The Case p=1/2𝑝12p=1/2 as Gradients in Random Neural Nets

to be the vectors of activities before and after applying ReLU\operatorname{ReLU} at the neurons at layer jj.

where λ>0\lambda>0 is the learning rate. An important practical impediment to gradient based learning is the exploding and vanishing gradient problem (EVGP), which occurs when gradients are numerically unstable:

making the parameter update (10) too small to be meaningful or too large to be precise. An important intuition is that the EVGP will be most pronounced at the start of training, when the weights and biases of N\mathcal{N} are random and the implicit structure of the data being processed has not yet regularized the function computed by N\mathcal{N}.

As explained below, the EVGP for a depth dd ReLU\operatorname{ReLU} net N\mathcal{N} with hidden layer widths n0,,ndn_{0},\ldots,n_{d} is essentially equivalent to having large fluctuations for the entries (or, in the worst case, for the singular values) of the Jacobians of the transformations between various layers:

where D(i){D}^{(i)} are diagonal {0,1}\{0,1\}-Bernoulli matrices as in (2) with parameter p=12p=\frac{1}{2} and =d\stackrel{{\scriptstyle d}}{{=}} denotes equality in distribution.

and the implicit constant is uniform when β\beta ranges over a compact subset of (0,)(0,\infty).

For more information about the EVGP, statistics of gradients in random ReLU\operatorname{ReLU} nets, and distribution of the singular values of the input-output Jacobian we refer the interested reader to for more details.

Proof of Proposition 2

It remains to see that these diagonal matrices are independent of each other (since the outputs of the previous layer are fed into to subsequent layers, so are not a priori independent). This again will be a consequence of the fact the underlying random variables are symmetrically distributed, and will be formally verified by conjugating the weights and biases of the network by random ±1\pm 1 random variables. This doesn’t change the distribution of the network, but will allow us to see the independence between layers in a more concrete way.

2 Proof of Proposition 2

Fix a neural net N\mathcal{N} as in the statement of proposition 2 and denote its weights and biases at layer jj by Wa,b(j)W_{a,b}^{(j)} and Ba(j)B_{a}^{(j)}. For each j,j, let

be an i.i.d. collection of random variables that each take values ±1\pm 1 with probability 1/21/2. We will also define

Consider the neural net N^\widehat{\mathcal{N}} with weights W^a,b(j)\widehat{W}_{a,b}^{(j)} and biases B^a(j)\widehat{B}_{a}^{(j)} defined by changing the signs of the weights and biases of N\mathcal{N} as follows:

Note that since we’ve assumed that W(j),B(j){W}^{(j)},\vec{B}^{(j)} have distributions that are symmetric around , we have

Hence, since the weights of the two networks are identically distributed,

To prove (14), we will use the fact that two random variables X,YX,Y are independent if the distribution of XX given Y=yY=y does not depend on the value of y.y. That is, (14) will follow once we show that for any fixed sequences sa(j),ra(j){±1}s_{a}^{(j)},r_{a}^{(j)}\in\{\pm 1\} that

To check this equality, it suffices to show that given {sa(j),ra(j)}\{s_{a}^{(j)},\,r_{a}^{(j)}\} there is exactly one possible configuration for the variables ξa(j),ηa(j)\xi_{a}^{(j)},\eta_{a}^{(j)} for which the event a,j{D^a(j)=sa(j),σa(j)=ra(j)}\cap_{a,j}\left\{\widehat{D}_{a}^{(j)}=s_{a}^{(j)},\sigma_{a}^{(j)}=r_{a}^{(j)}\right\} occurs. The resulting probability then follows since ξa(j),ηa(j)\xi_{a}^{(j)},\eta_{a}^{(j)} are i.i.d. variables that each take the values ±1\pm 1 with probability 1/2.1/2. The proof is by induction: we will show that for each i=1,,di=1,\ldots,d, given W(j),B(j),jiW^{(j)},B^{(j)},\,j\leq i there is a unique configuration for the variables ξa(j),ηa(j),ji\xi_{a}^{(j)},\eta_{a}^{(j)},\,\,j\leq i that leads to the event a,ji{D^a(j)=sa(j),σa(j)=ra(j)}\cap_{a,j\leq i}\left\{\widehat{D}_{a}^{(j)}=s_{a}^{(j)},\sigma_{a}^{(j)}=r_{a}^{(j)}\right\}. When i=1,i=1, we have

Recalling that ηb(0)=1\eta_{b}^{(0)}=1 for all bb, we see that for each a,a, there is a unique value of ξa(1)\xi_{a}^{(1)} for which

Proof of Theorem 1: Moment Estimates and Path Counting

We begin by indicating the general plan for the proof of Equation (6) from Theorem 1, which consists of two steps. First, in Proposition 4 below, we express the expectation in (6) as a sum over kk-tuples of paths V[n0]k××[nd]kV\in[n_{0}]^{k}\times\cdots\times[n_{d}]^{k}. The precise result is the following

With the notation of Theorem 1, for each kk we have

where uV(0)=Δj=1kuVj(0)u_{V(0)}\stackrel{{\scriptstyle\Delta}}{{=}}\prod_{j=1}^{k}u_{V_{j}(0)}, and CC is defined by:

Note that the definition of CC in equation (16) depends only on the collection of vertices V(i1)[ni1]kV(i-1)\in[n_{i-1}]^{k} and V(i)[ni]kV(i)\in[n_{i}]^{k}, the moments of the measure μ\mu according to which the entries of the matrices W(i)\underline{W}^{(i)} are distributed, and the parameter p.p. The utility of equation (15) is that it is written as a product over this function of adjacent layers (rather than the whole path VV), which will make it much easier to analyze.

The next step in the proof of the moment estimate (6) is to obtain upper and lower bounds for the expression in (15) that match up to corrections of size i=1dni2.\sum_{i=1}^{d}n_{i}^{-2}. This is done in Section 3.4. The main idea here is to treat the sum (15) as an expectation where each V(i)[ni]kV(i)\in[n_{i}]^{k}, i1i\geq 1 is chosen independently according to the uniform distribution on [ni]k[n_{i}]^{k}. The leading term in this expectation comes from event that the entries {V1(i),Vk(i)}\left\{V_{1}(i),\ldots V_{k}(i)\right\} are all distinct, which happens in layer ii with probability 1O(ni1)1-O(n^{-1}_{i}). When this happens, C(V(i1),V(i))=1.C(V(i-1),V(i))=1. The subleading term comes from the event that {V1(i),,Vk(i)}\left\{V_{1}(i),\ldots,V_{k}(i)\right\} has exactly one element that appears twice, with the others distinct. In each layer, the probability of this type of “collision” is (k2)ni1,\binom{k}{2}n_{i}^{-1}, and C(V(i1),V(i))C(V(i-1),V(i)) typically contributes 3/p3/p when this happens. Hence, heuristically speaking, we have

This is almost correct, except at the first layer, where the vector u\vec{u} acts as special initial condition and slightly deforms the term in this product when i=1.i=1. Section 3.4 makes this argument precise.

2 Edge Sets, Multiplicities, and Paths

In this section we develop some notation and basic results which is used to clarify the “path counting” needed to prove Proposition 4 below. The major result that is developed in this section, and is needed for Proposition 4, is the enumeration the set of paths in Lemma 8. We will use the following notation conventions:

This is the set one gets by drawing an edge between each entry of xx and the corresponding entry of yy and then forgetting the order in which the edges were drawn but remembering the multiplicity. Note that this map from ordered sets of left and right endpoints is many to one. This will come up in our computations, and to keep track of this, we make the following definition.

where the expectation is with respect to μ.\mu. In the proof of Proposition 4 we will consider sequences of compatible edge sets in the sense of the following definition.

The formula (24) below will be used in the proof of Proposition 4.

3 Proof of Proposition 4

The first step in proving Proposition 4 is to express E[Mu2k]\mathbf{E}\left[\left\|{M}\vec{u}\right\|^{2k}\right] as a sum over certain collections of 2k2k paths.

Let Q2kQ^{2k} be the set of 2k2k tuples of paths:

where Γ2k\Gamma^{2k} was defined in (22). Our notation is that if γQ2k\gamma\in Q^{2k}, then γ(i)[ni]2k\gamma(i)\in[n_{i}]^{2k} is a 2k2k-tuple for each 0id0\leq i\leq d.

Note that the entries of the nd×n0n_{d}\times n_{0} matrix M{M} can be written as a sum over certain paths in Γ\Gamma, namely:

Using this interpretation in terms of paths, we obtain by indexing the starting points as b1,b2[n0]b_{1},b_{2}\in[n_{0}] and the ending point as a[nd]a\in[n_{d}], that we can write Mu2\left\|{M}\vec{u}\right\|^{2} as a sum over γQ2\gamma\in Q^{2}:

Similarly, the kk-th power is then given by:

The result of the lemma follows by taking expectation of both sides, using the independence of the random variables ξb(i),Wa,b(i)\xi_{b}^{(i)},\,W_{a,b}^{(i)}’s, and relations

Since the law μ\mu of the entries of the matrices W(i){W}^{(i)} is assumed to symmetric around 0,0, the odd moments of μ\mu are all zero, and it will be useful to consider only edge sets that are “even” in the following sense:

With the same notation as in Lemma 10, we have

Because the variables Wα,β(i)W_{\alpha,\beta}^{(i)} are symmetric around , all their odd moments vanish. Thus, in the expression (26), only collections of paths in which every edge is traversed an even number of times given a non-zero contribution. What remains are exactly paths from Qeven2kQ_{even}^{2k} by the definition in equation (27). ∎

Summing over all possible endpoints v[nd]kv\in[n_{d}]^{k} now gives the identity:

Finally, using this identity on (29), with ff being the function that appears inside the sum over Qeven2kQ^{2k}_{even}, gives the desired result of Proposition 4. ∎

4 Completion of Proof of Equation (6)

We think of the sum in Proposition 4 as an expectation over discrete random variables V(i)V(i). Specifically, we write:

Informally, UU stands for “unique entries”, and consists of those kk-tuples with no repeated entries; PP stands for “one pair” and consists of those kk-tuples with exactly one repeated entry; BB stands for “bad” and consists of everything else. Formally,

For each i1i\geq 1, under the uniform measure on [ni]k[n_{i}]^{k}, each random variable V(i)V(i) has the following probabilities for the events {V(i)Uni}\left\{V(i)\in U_{n_{i}}\right\},{V(i)Pni}\left\{V(i)\in P_{n_{i}}\right\},{V(i)Bni}\left\{V(i)\in B_{n_{i}}\right\}:

The proof is an elementary exercise in discrete probability. ∎

Subdivide the “one pair” set by which indices are paired: Pn=ΔabPn(a,b)P_{n}\stackrel{{\scriptstyle\Delta}}{{=}}\cup_{a\neq b}P_{n}(a,b) where Pn(a,b)=Pn{x[n]k:xa=xb}P_{n}(a,b)=P_{n}\cap\left\{x\in[n]^{k}:x_{a}=x_{b}\right\} for aba\neq b. Then for each k1,i=1,,d,k\geq 1,\,i=1,\ldots,d, we have

where the implicit constant in Θ(1)\Theta(1) is bounded below by 11 and above by μ2k(2k1)!!p1k.\mu_{2k}(2k-1)!!p^{1-k}.

This is an elementary calculation from the definition of CC. If V(i)UniV(i)\in U_{n_{i}}, #V(i)=k\#V(i)=k and the multiplicities of edges in the edge set EV(i1),V(i)E^{V(i-1),V(i)} are all 11 which makes the combinatorial factor in C(V(i1),V(i))C(V(i-1),V(i)) equal to 11, and every edge is covered exactly twice giving a factor of μ2k=1\mu^{k}_{2}=1 in the weight term. If V(i)PniV(i)\in P_{n_{i}}, #V(i)=k1\#V(i)=k-1 giving a factor of 1p\frac{1}{p}. Moreover, in this case, when the indices which are paired in V(i)V(i) are also paired in V(i1)V(i-1), all the combinatorial factors are again 11, and the weight term is μ2k1=μ4\mu^{k-1}_{2}=\mu_{4}. If the paired indices from V(i)V(i) are not paired in V(i1)V(i-1), then there the combinatorial term is 1k2(2221)(21)=31^{k-2}\frac{\binom{2\cdot 2}{2\cdot 1}}{\binom{2}{1}}=3, and the weight term is μ2k=1\mu^{k}_{2}=1. ∎

where the quantities ψU,TU,ψP,TP,ψB,TB\psi_{U},T_{U},\psi_{P},T_{P},\psi_{B},T_{B} are:

Note that for any fixed j=1,,d1j=1,\ldots,d-1 and v[nj]kv\in[n_{j}]^{k}, we have the following conditional independence of layers before V(j)V(j) and after V(j)V(j):

where in the second term we write E\mathcal{E} instead of Eu\mathcal{E}_{\vec{u}} since the measure no longer depends on u.u. Applying this with j=1,j=1, we find

where in the second term the random variables V(2),,V(d)V(2),\ldots,V(d) are uniform and do not depend on uu or vv. An elementary probability computation using Lemma 15 and the measure {u02,,un02}\{u^{2}_{0},\ldots,u^{2}_{n_{0}}\} on V(0)V(0) shows that

where the implicit constant in the last term is bounded below by 11 and above by μ2k(2k1)!!p1k.\mu_{2k}(2k-1)!!p^{1-k}. Combining the result of Lemma 14, with (33) and (34), proves Lemma 16.

Then, for any choice of the label {U,B,P}\ast\in\{U,B,P\}, we have that:

By using the possible values for CC computed in Lemma 15 and the definition of TT_{\ast}, we have that for any label {U,P,B}\ast\in\{U,P,B\}:

This proves the upper bound in (35). The lower bound similarly follows:

We first notice, by application of the elementary probability estimate recorded in Lemma 18, that the upper and lower bounds on TT_{\ast} given in Lemma 17 are equal up to small errors. We have for {U,P,B}\ast\in\{U,P,B\}:

(where K^\widehat{K} is as in Lemma 17). Finally, putting these values for TU,TP,TBT_{U},T_{P},T_{B} into the result of Lemma 14 we see:

The last line follows from the elementary fact for exponentials ex12x21+xexe^{x-\frac{1}{2}x^{2}}\leq 1+x\leq e^{x} for x0x\geq 0. ∎

5 An elementary probability estimate

Let A0,A1,,AdA_{0},A_{1},\ldots,A_{d} be independent events with probabilities p0,,pdp_{0},\ldots,p_{d} and B0,,BDB_{0},\ldots,B_{D} be independent events with probabilities q0,,qdq_{0},\ldots,q_{d} such that

Denote by XiX_{i} the indicator that the event AiA_{i} happens, Xi:=1{Ai}X_{i}:=\mathtt{1}\left\{A_{i}\right\}, and by YiY_{i} the indicator that BiB_{i} happens, Yi=1{Bi}Y_{i}=\mathtt{1}\{B_{i}\}. Further, fix for every i1,,di\in 1,\ldots,d some αi1,Ki1\alpha_{i}\geq 1,K_{i}\geq 1 as well as γi>0\gamma_{i}>0. Define

Then, if γi1\gamma_{i}\geq 1 for every ii, we have:

where by convention α0=γ0=1.\alpha_{0}=\gamma_{0}=1. In contrast, if γi1\gamma_{i}\leq 1 for every ii, we have:

The proof goes by induction on dd. The base case d=1d=1 can be computed directly

which is verified to obey the stated inequalities under the convention α0=γ0=1\alpha_{0}=\gamma_{0}=1. To see the induction step, suppose that d2d\geq 2. Define the filtration Fj=σ(X0,Y0,,Xj,Yj)\mathcal{F}_{j}=\sigma\left(X_{0},Y_{0},\ldots,X_{j},Y_{j}\right). We have, from the definition of ZiZ_{i} that

We compute by directly examining what happens when Xd1=0X_{d-1}=0 and when Xd1=1X_{d-1}=1, that

Now notice that, since Xd1Zd1X_{d-1}Z_{d-1} vanishes when Xd1=0X_{d-1}=0, and since Ad1Bd1=A_{d-1}\cap B_{d-1}=\emptyset, we have that

Hence, when γd1\gamma_{d}\geq 1, since ZjZj+1Z_{j}\leq Z_{j+1} for every jj, we have the estimate

which is the desired inequality to prove the induction step for the upper bound. To see the lower bound, we will actually prove the lower bound for the sequence

so by equation (39) applied to the Z^i\widehat{Z}_{i} sequence, we have then (keeping in mind γd1<0\gamma_{d}-1<0 reverses the inequality):

which is the desired inequality for the induction step on the lower bound. ∎

Proof of Theorem 1: Quantitative Martingale CLT

In the section, we explain the proof of the distribution estimates in equation (7) in Theorem 1 modulo the proof of several key technical results, which are proved in Sections 4.1 and 4.2 below.

We first recall the notation. Namely, fix 0<p10<p\leq 1 and consider a fixed measure μ\mu satisfying (3). For every i=1,,ni=1,\ldots,n take independent random ni×ni1n_{i}\times n_{i-1} matrices W(i){W}^{(i)} with all the entries of W(i){W}^{(i)} drawn i.i.d. from μ\mu and for each i=1,,di=1,\ldots,d, consider ni×nin_{i}\times n_{i} diagonal matrices D(i)=diag(ξ1(i),,ξni(i)){D}^{(i)}=\text{diag}\left(\xi^{(i)}_{1},\ldots,\xi^{(i)}_{n_{i}}\right) where ξa(i)\xi_{a}^{(i)} are iid {0,1}\left\{0,1\right\}-valued independent Bernoulli(p)(p) variables P(ξ=1)=1P(ξ=0)=p\mathbf{P}\left(\xi=1\right)=1-\mathbf{P}\left(\xi=0\right)=p. The key objects of study are, for i=0,1,i=0,1,\ldots, the random ni×n0n_{i}\times n_{0} matrices

Notice that the sequence u(i)\vec{u}^{(i)} is equivalently defined recursively as:

With this notation the relation (7) we seek to show becomes the statement that for every m1m\geq 1

where N(μ,σ2)\mathcal{N}(\mu,\sigma^{2}) is the Gaussian, and

The idea of the proof is to look at the quantity ln(u(d)22)\ln\left(\left\|\vec{u}^{(d)}\right\|_{2}^{2}\right) as the value of a martingale at time dd with respect to the filtration

i.e. Fi\mathcal{F}_{i} the sigma algebra generated by the random variables in the first ii layers. The basic idea of our proof is to deduce the approximate normality of ln(u(d)22)\ln\left(\left\|\vec{u}^{(d)}\right\|_{2}^{2}\right) by applying a martingale CLT with rate (see Theorem 23). Specifically, note that ln(u(0)22)=0\ln\left(\left\|\vec{u}^{(0)}\right\|_{2}^{2}\right)=0, since u(0)\vec{u}^{(0)} is a unit vector. Hence, ln(u(d)22)\ln\left(\left\|\vec{u}^{(d)}\right\|_{2}^{2}\right), is a telescoping sum (modulo the complication discussed below that u(i)\left\|\vec{u}^{(i)}\right\| could vanish):

and we will think of each entry of the sum as an increment. By subtracting off the conditional means, this will yield a martingale difference sequence which can be analyzed. It will turn out that the variance of these increments satisfy:

For i1i\geq 1, we will typically have that

and therefore the term involving the fourth moment μ4\mu_{4} will be of size O(ni2)O(n_{i}^{-2}) for all except the first layer when i=0i=0. The sum of these increment variances is precisely our variance parameter β\beta (modulo terms like ni2n_{i}^{-2}). This informally explains the appearance of u(0)44\left\|\vec{u}^{(0)}\right\|_{4}^{4} in the formula for β\beta, and why the terms from other layers do not depend on the higher moments of μ\mu.

To give a precise proof of (42), we must deal with a wrinkle in the strategy described above: with a small but positive probability the vectors u(i)=0\vec{u}^{(i)}=0, making the ratio of the norms of the vectors u(i),u(i1)\vec{u}^{(i)},\vec{u}^{(i-1)} in (43) undefined. Since the weight matrices W{W} are assumed to have no atoms, this can only happen if the Bernoulli variables are all equal to zero. To take this into account, we define the events

In addition, we will find it convenient to fix a truncation level 0<α<1,0<\alpha<1, and set

We will study the sequence of martingale increments

that coincide, with high probability, with the martingale difference sequence associated to ln(S(i))\ln(S^{(i)}) (see Lemma 22), where by convention we define the product lnα(S(i)/S(i1))1Ai1\ln_{\alpha}\left(S^{(i)}/S^{(i-1)}\right)\mathtt{1}_{A_{i-1}} is zero on the event Ai1cA^{c}_{i-1} when S(i1)=0.S^{(i-1)}=0. To prove the approximate normality of ln(u(i)22),\ln(||\vec{u}^{(i)}||_{2}^{2}), we first prove the approximate normality of iXi\sum_{i}X_{i} in the following Proposition.

Moreover, for any fixed 0<α<10<\alpha<1, the sum i=1dXi\sum_{i=1}^{d}X_{i} is approximately normally distributed in the sense that

We prove Proposition 19 in Section 4.1 below. The next result shows that the sum of the conditional expectations in iXi\sum_{i}X_{i} contributes a constant β/2\beta/2 up to errors of the form ini2.\sum_{i}n_{i}^{-2}.

where YY is a random variable satisfying

Proposition follows from Proposition 28 below. To combine Propositions 19 and 20, we will need the following simple result about perturbations under the KSKS-distance.

If N(0,β)\mathcal{N}(0,\beta) is centered Gaussian with variance β\beta, XX is any random variable, and YY is a positive random variable then there is a universal constant CC so that we have:

For any k>0k>0, there exists a constant CC so that

Further, if X,YX,Y are any two random variables on the same probability space, then:

Combining Propositions 19, Proposition 20 and Lemma 21, we obtain

Finally, combining the following estimate with (49) completes the proof of Theorem 1.

For any fixed 0<α<10<\alpha<1 and any m1m\geq 1 we have

In the proof of Proposition 19, we will use the notation

and we will say that a random variable YY is Oa.s.(f(ni1))O_{a.s.}(f(n_{i-1})) if there exists C>0C>0 independent of ni,dn_{i},d so that YCf(ni1)\left|Y\right|\leq Cf(n_{i-1}) almost surely. The constant CC may depend on the moments of the random variable μ\mu and pp, which we think of as fixed. To conclude the approximate normality (46) we will use the following theorem.

Suppose that X0,X1,X_{0},X_{1},\ldots is a martingale difference sequence with respect to a filtration {Fi,i=0,1,}\{\mathcal{F}_{i},\,i=0,1,\ldots\}. Then

The following Proposition allows us to control the 2nd2^{nd} and 4th4^{th} moments of XiX_{i} appearing on in (51).

For any 0<α<10<\alpha<1, we have that the conditional 2nd and 4th moments of XiX_{i} are:

Moreover, for any i1i\geq 1 and any ji1j\leq i-1,

We will prove Proposition 24 in Section 4.1.1 below. To complete the proof of Proposition 19, note that Proposition 24 yields

Thus, (46) follows from the previous line together with (53), (48), and

To prove this bound, we begin by using Proposition 24 to establish two inequalities which hold for any fixed ii:

With these inequalities in hand, we proceed by expanding the square as follows:

Finally, summing all the bounds for diagonal and off-diagonal entries we see that this entire numerator from equation (51) is bounded by O(i=1dni2)O(\sum_{i=1}^{d}n_{i}^{-2}), which proves (55) and completes the proof of Proposition 19 modulo checking Proposition 24.

We begin by establishing some preliminary results.

proving the first relation (59). Since each ZjZ_{j} is mean 11, the relations (60) follow by standard esimates of the 3rd,2rth3^{rd},{2r}^{th} moments of a sum of nn iid centered random variables. Finally to check the second relation in (59), we write

The following corollary immediately yields (54).

The tail estimate follows by using the Chebyshev inequality and (60). The bound on the expectation is obtained as follows:

To complete the proof of Proposition 24, it remains to check (52) and (53). To do this, we begin with the following observation.

Let lnα(x)={ln(α)x<αln(x)xα\ln_{\alpha}(x)=\begin{cases}\ln(\alpha)&x<\alpha\\ \ln(x)&x\geq\alpha\end{cases}. Suppose YY is a non-negative random variable. Then there are absolute constants CC (here we let CC refer to a generic constant which may change value from line to line) so that for any 0<α<10<\alpha<1:

The proof is an elementary exercise in the Taylor series expansion for ln(x)\ln(x) applied to points in the interval [α,)[\alpha,\infty) on which the derivatives of ln(x)\ln(x) are bounded. ∎

Lemma 27 together with Lemma 25 gives the following information on the conditional moments of Xi,X_{i}, which directly allows us to conclude (52) and (53) and hence complete the proof of Proposition 24.

Recall the vectors u(i)\vec{u}^{(i)} and their norms normalized L2L^{2} and L4L^{4} norms, S(i)S^{(i)} and F(i)F^{(i)} defined in (45) and (50). We have for each i0i\geq 0

On the event AicA_{i}^{c}, both sides of the equation are zero and the equality trivially holds. Therefore we have only to consider what happens on the event AiA_{i}, where u(i)0\vec{u}^{(i)}\neq 0. By equation (41), we have that

By Lemma 25 and Lemma 27, all the terms in equation (63) are Oa.s.(ni2)O_{a.s.}(n_{i}^{-2}) which completes the result. A similar argument, combining the moment calculations from Lemma 25 and the series expansion estimates from Lemma 27 in the natural way, gives the higher moments of lnα(S(i+1)S(i))\ln_{\alpha}\left(\frac{S^{(i+1)}}{S^{(i)}}\right). ∎

2 Facts about KS distance - Proof of Lemma 21

Let us use the notation N=dN(0,1).\mathcal{N}\stackrel{{\scriptstyle d}}{{=}}\mathcal{N}(0,1). Since

we may assume without loss that β=1\beta=1 when proving (47) and (48). We begin by checking (47). We will show that for every s>0s>0, we have that

This is proven by examining the two possibilities of the absolute value, P(X+Yt,Ys)P(Nt,Ys)\mathbf{P}\left(X+Y\leq t,\left|Y\right|\leq s\right)-\mathbf{P}(\mathcal{N}\leq t,\left|Y\right|\leq s) and P(Nt,Ys)P(X+Yt,Ys)\mathbf{P}(\mathcal{N}\leq t,\left|Y\right|\leq s)-\mathbf{P}\left(X+Y\leq t,\left|Y\right|\leq s\right) by using the inclusions {X+Yt,Ys}{Xst,Ys}\left\{X+Y\leq t,\left|Y\right|\leq s\right\}\subset\left\{X-s\leq t,\left|Y\right|\leq s\right\} and {X+st,Ys}{X+Yt,Ys}\left\{X+s\leq t,\left|Y\right|\leq s\right\}\subset\left\{X+Y\leq t,\left|Y\right|\leq s\right\} respectively for the two cases. In the first case, consider:

To verify (48), note that there exists a constant C>0C>0 so that for every k>0k>0

3 Proof of Lemma 22

As in the proof of Proposition 28, on the event S(i1)0,S^{(i-1)}\neq 0, we have that

where ZjZ_{j} are iid random variables each equal in distribution to p1ξjWj(i),u^p^{-1}\xi_{j}\left\langle W_{j}^{(i)},\widehat{u}\right\rangle where u^\widehat{u} is an fixed unit vector. In particular, by Lemma 25, EZj=1\mathbf{E}{Z_{j}}=1 and the higher moments of ZjZ_{j} are uniformly bounded in terms of the moments of μ\mu and 1p.1-p. In particular, for each m1,m\geq 1, there exists Cm>0C_{m}>0 depending only on the moments of μ\mu and 1p1-p so that

Hence, by using a union bound and the estimate on the probability AiA_{i} in (44), we have for each m1,m\geq 1,

References