Products of Many Large Random Matrices and Gradients in Deep Neural Networks

Boris Hanin, Mihai Nica

Introduction

Products of independent random matrices are a classical topic in probability and mathematical physics with applications to a variety of fields, from wireless communication networks to the physics of black holes , random dynamical systems , and recently to the numerical stability of randomly initialized neural networks . In the context of neural networks, products of random matrices are related to the numerical stability of gradients at initialization and therefore give precise information about the exploding and vanishing gradient problem (see Section 1.5, Proposition 2, and Corollary 3). The purpose of this article is to prove several new results about such products in the regime where the number of terms and the sizes of matrices grow simultaneously. This regime has attracted attention but remains poorly understood. We find new phenomena not present when the number of terms and the size of the matrices are sent to infinity sequentially rather than simultaneously (see Section 1.1 for more on this point).

where ${D}^{(i)}$ are $n_{i}\times n_{i}$ diagonal matrices

As explained in Section 1.4 below, $Z_{d}(\vec{u})$ can be thought of as a line-to-line partition function in a disordered medium given by the computation graph underlying the matrix product defining ${M}^{(d)}$ , with $\vec{u}$ corresponding to a kind of initial condition. The diagonal matrices ${D}^{(i)}$ then correspond to $\{0,1\}-$ valued spins on the vertices of this graph, restricting the allowed paths of the directed random polymer. With this interpretation, our main result, Theorem 1, shows that the analogue of the free energy, namely

is Gaussian up to an error that tends to zero when $n_{i},d$ tend to infinity.

Let ${M}^{(d)}={X}^{(d)}\cdots{X}^{(1)}$ be as in (2). Then, the norm of the vector ${M}^{(d)}\vec{u}$ is approximately log-normal distributed:

where the implicit constant is uniform for $p$ in a compact subset of $(0,1]$ , and $\beta$ in a compact set bounded away from $\beta=0$ . Moreover, for every $k\geq 0,$ satisfying $\binom{k}{2}<\min_{1\leq i\leq d}n_{i},$ we have

where the implicit constant depends on $k$ and the moments of $\mu$ but not on $\beta,d,n_{i}.$

The two conclusions, equation (7) and equation (6), of Theorem 1 are proven separately and have independent proofs. We prove equation (6) by a path-counting type argument in Section 3. The argument in Section 4 for equation (7), in contrast, uses a central limit theorem for martingales.

Theorem 1 shows that the free energy $\ln(Z_{d}(\vec{u}))=\ln(||{M}^{(d)}\vec{u}||_{2}^{2})$ from (4) is Gaussian in the double scaling limit

achieved for instance when $n_{i}(d)$ are equal and proportional to $d.$ This asymptotic normality for $\ln(Z_{d}(\vec{u}))$ cannot be seen by taking the limits $d\to\infty$ and $\min\{n_{i}\}$ to infinity one after the other. Indeed, consider the case when $p=1$ and $\mu=\mathcal{N}(0,1),$ the standard Gaussian measure. A simple computation using the rotational invariance of i.i.d. Gaussian matrices shows the equality in distribution

where $\chi_{n}^{2}$ is a chi-squared random variable with $n$ degrees of freedom and the terms in the product are independent. In the limit where $d$ is fixed and $\min\{n_{i},\,i\geq 1\}\to\infty,$ we have $\chi_{n_{i}}^{2}/n_{i}\approx 1+O(n_{i}^{-1/2})$ and so

On the other hand, if the $n_{i}$ are uniformly bounded, then we have

In fact, for $n_{i}\equiv n$ fixed, $\ln(Z_{d}(\vec{u}))$ converges only with an addition $1/d$ scaling:

making (8) an interesting regime for $\ln(Z_{d}(\vec{u}))$ . The non-commutativity of the $n_{i},d\to\infty$ limits is well-known and is related to the fact that the local statistics of the singular values of ${M}^{(d)}$ are sensitive to the order in which the limits above are taken. Remaining in the simple case of $p=1$ and $\mu$ Gaussian, a simple application of the central limit results show that when all the $n_{i}$ are equal and are related to $d$ by $n=2\beta^{-1}d$ , then the exact chi-squared representation of equation (9) gives the convergence in distribution:

which is of course consistent with Theorem 1. Part of the content of Theorem 1 is therefore that this result is essentially independent of the parameter $p$ and the measure $\mu$ according to which the entries of the matrices ${W}^{(i)}$ are distributed. See Section 1.3 for more discussion on the novel aspects of Theorem 1.

2 Connection to previous work in Random Matrix Theory

The literature on products of random matrices is vast. Much of the previous work concerns products of $d$ i.i.d. random matrices, each of size $n\times n$ . Such ensembles have been well studied in two distinct regimes: (a) when $n$ is fixed and $d\to\infty$ and (b) when $d$ is fixed and $n\to\infty$ . Case (a) is related to multiplicative ergodic theory and the study of Lyapunov exponents. The seminal articles in this regime are the results of Furstenberg and Kesten , which gives general conditions for the existence of the top Lyapunov exponent

and the multiplicative ergodic theorem of Osceledets , which gives conditions for almost sure (deterministic) values for all the Lyapunov exponents. Many more recent works characterize the Lyapunov exponents under more specific assumptions, most notably for matrices which are rotationally invariant or which have entries that are real or complex Gaussians, see e.g. as well as the survey and references therein.

Case (b), where $d$ is fixed and $n\to\infty$ , falls into the setting of free probability. Indeed, one of the great successes of free probability is the idea of “asymptotic freeness”: in the limit $n\to\infty$ , a collection of $d$ independent $n\times n$ random matrices behave like a collection of $d$ freely independent random variables on a non-commutative probability space (see e.g. Chapter 5 or Chapter 1 and 4). Therefore, case (b) is closely related to a product of $d$ freely independent random variables; precise results are obtained in . Earlier results examine case (b) without explicit use of free probability. The problem of first taking $n\to\infty$ and afterwards taking $d\to\infty$ can also be handled using the tools of free probability in the case of Gaussian matrices, see .

As explained in the Introduction and in Section 1.1, the regimes (a), (b) are asymptotically incompatible in the sense that the limits $d\to\infty$ , and $n\to\infty$ do not generally commute on the level of the local behavior of the singular value distribution. Indeed, the problem of understanding what happens when both are scaled simultaneously is mentioned as an open problem in . To explain this further, we note that the work of Newman [16, Thm. 1] in regime (a) shows that when $p=1$ and $n_{j}\equiv n$ is fixed, the density of the Lyapunov exponents of ${M}^{(d)}$ converges in the limit when first $d\to\infty$ and then $n\to\infty$ to the triangular density

The work of Tucci [27, Thm. 3.2, Ex. 3.4] shows that for Gaussian ensembles related to ${M}^{(d)}$ one obtains the same global limit in the regime (b) when first $n\to\infty$ and then $d\to\infty.$ However, as explained in Section 5, while the global density of all the Lyapunov exponents is the triangular law in both cases, the local behavior (e.g. the fluctuations of the top Lyapunov exponent) is observed to be different depending on the order of the limits even in the exactly solvable special case of products of complex Ginibre matrices.

From this spectral point of view, Theorem 1 gives information about certain averages of the Lyapunov exponents. To see this, fix $n_{0}$ and let $n_{i},\,i\geq 1,$ and $d$ tend to infinity in accordance with (8). Note we we specifically do not take $n_{0}$ to infinity. Denote by $\sigma_{1},\ldots,\sigma_{n_{0}}$ the non-zero singular values of ${M}^{(d)}$ , and by $\vec{v}_{1},\ldots,\vec{v}_{n_{0}}$ the corresponding left-singular vectors. Then we have

Hence, Theorem 1 can be interpreted as the statement that the logarithm of the average of the non-zero singular values for $||{M}^{(d)}\vec{u}||^{2}$ is a Gaussian with mean $-\beta/2$ and variance $\beta$ in the limit (8). These non-trivial corrections in $\beta$ can be seen as a finite temperature correction to the maximal entropy regime of Tucci in which first $n\to\infty$ and then $d\to\infty$ . For more on this point of view, we refer the reader to [1, Section 3.2].

Finally, in the specific case where the random matrices ${X}^{(j)}$ are complex Ginibre matrices (i.e. the matrix entries are iid complex Gaussian), very recent work looks at the limiting spectrum under the joint scaling limit $d\to\infty$ , $n\to\infty$ where the ratio $d/n$ is fixed or going to $\infty$ . This work analyzes exact determinental formulas for the joint distribution of singular values available in the case of complex Ginibre matrices. The analogous formulas for real Gaussian matrices given in are significantly more complicated and such an explicit analysis appears to be much more difficult.

3 Contribution of the Present Work

In the context of these previous random matrix results, let us point out four novel aspects of Theorem 1. First, it deals with the joint $d,n\to\infty$ limit for a large class of non-Gaussian matrices with real entries. There is no integrable structure to our matrix ensembles, and we rely instead on a sum-over-paths approach to analyze the moments (6) and a martingale CLT approach for obtaining the KS distance estimates (7).

An additional novelty of Theorem 1 is it proves the distribution of $\big{|}\big{|}{M}^{(d)}\vec{u}\big{|}\big{|}_{2}^{2}$ is (mostly) universal: it does not depend on the higher moments of the distribution $\mu$ beyond the mean and variance, with the exception of the fourth moment $\mu_{4}$ appearing in $\beta$ in the term $\left\|\vec{u}\right\|_{4}^{4}(\mu_{4}-3)/pn_{1}$ . In the regime $n_{j}\equiv n$ and $d/n\equiv\beta$ , this term is a $1/n$ correction.

The fourth and final novelty of Theorem 1 we would like to emphasize is that our results are non-asymptotic, i.e. we obtain an explicit error term of the form $\sum_{i=1}^{d}{n^{-2}_{i}}$ . This is particularly useful when using Theorem 1 for studying gradients in randomly initialized neural networks (see Section 1.5).

4 Connection to Random Polymers

The matrix ensembles ${M}^{(d)}$ studied in this article, in the case $n_{i}=n,\,\,i=1,\ldots,d$ , are related to directed random polymers on the complete graph of size $n$ . This model were recently explored in detail (c.f. e.g. ). A key object for these polymers is the line-to-line partition function

The fact that the weights are positive makes the analysis of the partition function of this traditional random polymer model different than the analysis of the matrix product ${M}^{(d)}$ defined in (1). In particular, no cancellation is possible between the terms in the definition of $Z(d)$ above, causing $Z(d)$ to be exponential in $d$ . The $n$ fixed and $d\to\infty$ limit of of the partition function in the case of these positive weights is the object of study in . As explained in Section 1.3, Theorem 1 studies a different regime where both $d\to\infty,n\to\infty$ at the same time. The fact that our weights are mean zero, gives rise to significant cancellation in the terms of $Z_{d}(\vec{u})$ from (4), so that the partition function in our mean zero model does not grow exponentially with $d$ provided $n$ grows with $d$ as in (8). Additionally, if $p<1$ in our model, the effect of the diagonal Bernoulli matrices ${D}^{(i)}$ is to close every vertex with probability $1-p$ . The sum over paths in our partition function then becomes the sum only over those paths that pass through vertices that are open.

5 The Case p=1/2𝑝12p=1/2 as Gradients in Random Neural Nets

to be the vectors of activities before and after applying $\operatorname{ReLU}$ at the neurons at layer $j$ .

where $\lambda>0$ is the learning rate. An important practical impediment to gradient based learning is the exploding and vanishing gradient problem (EVGP), which occurs when gradients are numerically unstable:

making the parameter update (10) too small to be meaningful or too large to be precise. An important intuition is that the EVGP will be most pronounced at the start of training, when the weights and biases of $\mathcal{N}$ are random and the implicit structure of the data being processed has not yet regularized the function computed by $\mathcal{N}$ .

As explained below, the EVGP for a depth $d$ $\operatorname{ReLU}$ net $\mathcal{N}$ with hidden layer widths $n_{0},\ldots,n_{d}$ is essentially equivalent to having large fluctuations for the entries (or, in the worst case, for the singular values) of the Jacobians of the transformations between various layers:

where ${D}^{(i)}$ are diagonal $\{0,1\}$ -Bernoulli matrices as in (2) with parameter $p=\frac{1}{2}$ and $\stackrel{{\scriptstyle d}}{{=}}$ denotes equality in distribution.

and the implicit constant is uniform when $\beta$ ranges over a compact subset of $(0,\infty)$ .

For more information about the EVGP, statistics of gradients in random $\operatorname{ReLU}$ nets, and distribution of the singular values of the input-output Jacobian we refer the interested reader to for more details.

Proof of Proposition 2

It remains to see that these diagonal matrices are independent of each other (since the outputs of the previous layer are fed into to subsequent layers, so are not a priori independent). This again will be a consequence of the fact the underlying random variables are symmetrically distributed, and will be formally verified by conjugating the weights and biases of the network by random $\pm 1$ random variables. This doesn’t change the distribution of the network, but will allow us to see the independence between layers in a more concrete way.

2 Proof of Proposition 2

Fix a neural net $\mathcal{N}$ as in the statement of proposition 2 and denote its weights and biases at layer $j$ by $W_{a,b}^{(j)}$ and $B_{a}^{(j)}$ . For each $j,$ let

be an i.i.d. collection of random variables that each take values $\pm 1$ with probability $1/2$ . We will also define

Consider the neural net $\widehat{\mathcal{N}}$ with weights $\widehat{W}_{a,b}^{(j)}$ and biases $\widehat{B}_{a}^{(j)}$ defined by changing the signs of the weights and biases of $\mathcal{N}$ as follows:

Note that since we’ve assumed that ${W}^{(j)},\vec{B}^{(j)}$ have distributions that are symmetric around , we have

Hence, since the weights of the two networks are identically distributed,

To prove (14), we will use the fact that two random variables $X,Y$ are independent if the distribution of $X$ given $Y=y$ does not depend on the value of $y.$ That is, (14) will follow once we show that for any fixed sequences $s_{a}^{(j)},r_{a}^{(j)}\in\{\pm 1\}$ that

To check this equality, it suffices to show that given $\{s_{a}^{(j)},\,r_{a}^{(j)}\}$ there is exactly one possible configuration for the variables $\xi_{a}^{(j)},\eta_{a}^{(j)}$ for which the event $\cap_{a,j}\left\{\widehat{D}_{a}^{(j)}=s_{a}^{(j)},\sigma_{a}^{(j)}=r_{a}^{(j)}\right\}$ occurs. The resulting probability then follows since $\xi_{a}^{(j)},\eta_{a}^{(j)}$ are i.i.d. variables that each take the values $\pm 1$ with probability $1/2.$ The proof is by induction: we will show that for each $i=1,\ldots,d$ , given $W^{(j)},B^{(j)},\,j\leq i$ there is a unique configuration for the variables $\xi_{a}^{(j)},\eta_{a}^{(j)},\,\,j\leq i$ that leads to the event $\cap_{a,j\leq i}\left\{\widehat{D}_{a}^{(j)}=s_{a}^{(j)},\sigma_{a}^{(j)}=r_{a}^{(j)}\right\}$ . When $i=1,$ we have

Recalling that $\eta_{b}^{(0)}=1$ for all $b$ , we see that for each $a,$ there is a unique value of $\xi_{a}^{(1)}$ for which

Proof of Theorem 1: Moment Estimates and Path Counting

We begin by indicating the general plan for the proof of Equation (6) from Theorem 1, which consists of two steps. First, in Proposition 4 below, we express the expectation in (6) as a sum over $k$ -tuples of paths $V\in[n_{0}]^{k}\times\cdots\times[n_{d}]^{k}$ . The precise result is the following

With the notation of Theorem 1, for each $k$ we have

where $u_{V(0)}\stackrel{{\scriptstyle\Delta}}{{=}}\prod_{j=1}^{k}u_{V_{j}(0)}$ , and $C$ is defined by:

Note that the definition of $C$ in equation (16) depends only on the collection of vertices $V(i-1)\in[n_{i-1}]^{k}$ and $V(i)\in[n_{i}]^{k}$ , the moments of the measure $\mu$ according to which the entries of the matrices $\underline{W}^{(i)}$ are distributed, and the parameter $p.$ The utility of equation (15) is that it is written as a product over this function of adjacent layers (rather than the whole path $V$ ), which will make it much easier to analyze.

The next step in the proof of the moment estimate (6) is to obtain upper and lower bounds for the expression in (15) that match up to corrections of size $\sum_{i=1}^{d}n_{i}^{-2}.$ This is done in Section 3.4. The main idea here is to treat the sum (15) as an expectation where each $V(i)\in[n_{i}]^{k}$ , $i\geq 1$ is chosen independently according to the uniform distribution on $[n_{i}]^{k}$ . The leading term in this expectation comes from event that the entries $\left\{V_{1}(i),\ldots V_{k}(i)\right\}$ are all distinct, which happens in layer $i$ with probability $1-O(n^{-1}_{i})$ . When this happens, $C(V(i-1),V(i))=1.$ The subleading term comes from the event that $\left\{V_{1}(i),\ldots,V_{k}(i)\right\}$ has exactly one element that appears twice, with the others distinct. In each layer, the probability of this type of “collision” is $\binom{k}{2}n_{i}^{-1},$ and $C(V(i-1),V(i))$ typically contributes $3/p$ when this happens. Hence, heuristically speaking, we have

This is almost correct, except at the first layer, where the vector $\vec{u}$ acts as special initial condition and slightly deforms the term in this product when $i=1.$ Section 3.4 makes this argument precise.

2 Edge Sets, Multiplicities, and Paths

In this section we develop some notation and basic results which is used to clarify the “path counting” needed to prove Proposition 4 below. The major result that is developed in this section, and is needed for Proposition 4, is the enumeration the set of paths in Lemma 8. We will use the following notation conventions:

This is the set one gets by drawing an edge between each entry of $x$ and the corresponding entry of $y$ and then forgetting the order in which the edges were drawn but remembering the multiplicity. Note that this map from ordered sets of left and right endpoints is many to one. This will come up in our computations, and to keep track of this, we make the following definition.

where the expectation is with respect to $\mu.$ In the proof of Proposition 4 we will consider sequences of compatible edge sets in the sense of the following definition.

The formula (24) below will be used in the proof of Proposition 4.

3 Proof of Proposition 4

The first step in proving Proposition 4 is to express $\mathbf{E}\left[\left\|{M}\vec{u}\right\|^{2k}\right]$ as a sum over certain collections of $2k$ paths.

Let $Q^{2k}$ be the set of $2k$ tuples of paths:

where $\Gamma^{2k}$ was defined in (22). Our notation is that if $\gamma\in Q^{2k}$ , then $\gamma(i)\in[n_{i}]^{2k}$ is a $2k$ -tuple for each $0\leq i\leq d$ .

Note that the entries of the $n_{d}\times n_{0}$ matrix ${M}$ can be written as a sum over certain paths in $\Gamma$ , namely:

Using this interpretation in terms of paths, we obtain by indexing the starting points as $b_{1},b_{2}\in[n_{0}]$ and the ending point as $a\in[n_{d}]$ , that we can write $\left\|{M}\vec{u}\right\|^{2}$ as a sum over $\gamma\in Q^{2}$ :

Similarly, the $k$ -th power is then given by:

The result of the lemma follows by taking expectation of both sides, using the independence of the random variables $\xi_{b}^{(i)},\,W_{a,b}^{(i)}$ ’s, and relations

Since the law $\mu$ of the entries of the matrices ${W}^{(i)}$ is assumed to symmetric around $0,$ the odd moments of $\mu$ are all zero, and it will be useful to consider only edge sets that are “even” in the following sense:

With the same notation as in Lemma 10, we have

Because the variables $W_{\alpha,\beta}^{(i)}$ are symmetric around , all their odd moments vanish. Thus, in the expression (26), only collections of paths in which every edge is traversed an even number of times given a non-zero contribution. What remains are exactly paths from $Q_{even}^{2k}$ by the definition in equation (27). ∎

Summing over all possible endpoints $v\in[n_{d}]^{k}$ now gives the identity:

Finally, using this identity on (29), with $f$ being the function that appears inside the sum over $Q^{2k}_{even}$ , gives the desired result of Proposition 4. ∎

4 Completion of Proof of Equation (6)

We think of the sum in Proposition 4 as an expectation over discrete random variables $V(i)$ . Specifically, we write:

Informally, $U$ stands for “unique entries”, and consists of those $k$ -tuples with no repeated entries; $P$ stands for “one pair” and consists of those $k$ -tuples with exactly one repeated entry; $B$ stands for “bad” and consists of everything else. Formally,

For each $i\geq 1$ , under the uniform measure on $[n_{i}]^{k}$ , each random variable $V(i)$ has the following probabilities for the events $\left\{V(i)\in U_{n_{i}}\right\}$ , $\left\{V(i)\in P_{n_{i}}\right\}$ , $\left\{V(i)\in B_{n_{i}}\right\}$ :

The proof is an elementary exercise in discrete probability. ∎

Subdivide the “one pair” set by which indices are paired: $P_{n}\stackrel{{\scriptstyle\Delta}}{{=}}\cup_{a\neq b}P_{n}(a,b)$ where $P_{n}(a,b)=P_{n}\cap\left\{x\in[n]^{k}:x_{a}=x_{b}\right\}$ for $a\neq b$ . Then for each $k\geq 1,\,i=1,\ldots,d,$ we have

where the implicit constant in $\Theta(1)$ is bounded below by $1$ and above by $\mu_{2k}(2k-1)!!p^{1-k}.$

This is an elementary calculation from the definition of $C$ . If $V(i)\in U_{n_{i}}$ , $\#V(i)=k$ and the multiplicities of edges in the edge set $E^{V(i-1),V(i)}$ are all $1$ which makes the combinatorial factor in $C(V(i-1),V(i))$ equal to $1$ , and every edge is covered exactly twice giving a factor of $\mu^{k}_{2}=1$ in the weight term. If $V(i)\in P_{n_{i}}$ , $\#V(i)=k-1$ giving a factor of $\frac{1}{p}$ . Moreover, in this case, when the indices which are paired in $V(i)$ are also paired in $V(i-1)$ , all the combinatorial factors are again $1$ , and the weight term is $\mu^{k-1}_{2}=\mu_{4}$ . If the paired indices from $V(i)$ are not paired in $V(i-1)$ , then there the combinatorial term is $1^{k-2}\frac{\binom{2\cdot 2}{2\cdot 1}}{\binom{2}{1}}=3$ , and the weight term is $\mu^{k}_{2}=1$ . ∎

where the quantities $\psi_{U},T_{U},\psi_{P},T_{P},\psi_{B},T_{B}$ are:

Note that for any fixed $j=1,\ldots,d-1$ and $v\in[n_{j}]^{k}$ , we have the following conditional independence of layers before $V(j)$ and after $V(j)$ :

where in the second term we write $\mathcal{E}$ instead of $\mathcal{E}_{\vec{u}}$ since the measure no longer depends on $u.$ Applying this with $j=1,$ we find

where in the second term the random variables $V(2),\ldots,V(d)$ are uniform and do not depend on $u$ or $v$ . An elementary probability computation using Lemma 15 and the measure $\{u^{2}_{0},\ldots,u^{2}_{n_{0}}\}$ on $V(0)$ shows that

where the implicit constant in the last term is bounded below by $1$ and above by $\mu_{2k}(2k-1)!!p^{1-k}.$ Combining the result of Lemma 14, with (33) and (34), proves Lemma 16.

Then, for any choice of the label $\ast\in\{U,B,P\}$ , we have that:

By using the possible values for $C$ computed in Lemma 15 and the definition of $T_{\ast}$ , we have that for any label $\ast\in\{U,P,B\}$ :

This proves the upper bound in (35). The lower bound similarly follows:

We first notice, by application of the elementary probability estimate recorded in Lemma 18, that the upper and lower bounds on $T_{\ast}$ given in Lemma 17 are equal up to small errors. We have for $\ast\in\{U,P,B\}$ :

(where $\widehat{K}$ is as in Lemma 17). Finally, putting these values for $T_{U},T_{P},T_{B}$ into the result of Lemma 14 we see:

The last line follows from the elementary fact for exponentials $e^{x-\frac{1}{2}x^{2}}\leq 1+x\leq e^{x}$ for $x\geq 0$ . ∎

5 An elementary probability estimate

Let $A_{0},A_{1},\ldots,A_{d}$ be independent events with probabilities $p_{0},\ldots,p_{d}$ and $B_{0},\ldots,B_{D}$ be independent events with probabilities $q_{0},\ldots,q_{d}$ such that

Denote by $X_{i}$ the indicator that the event $A_{i}$ happens, $X_{i}:=\mathtt{1}\left\{A_{i}\right\}$ , and by $Y_{i}$ the indicator that $B_{i}$ happens, $Y_{i}=\mathtt{1}\{B_{i}\}$ . Further, fix for every $i\in 1,\ldots,d$ some $\alpha_{i}\geq 1,K_{i}\geq 1$ as well as $\gamma_{i}>0$ . Define

Then, if $\gamma_{i}\geq 1$ for every $i$ , we have:

where by convention $\alpha_{0}=\gamma_{0}=1.$ In contrast, if $\gamma_{i}\leq 1$ for every $i$ , we have:

The proof goes by induction on $d$ . The base case $d=1$ can be computed directly

which is verified to obey the stated inequalities under the convention $\alpha_{0}=\gamma_{0}=1$ . To see the induction step, suppose that $d\geq 2$ . Define the filtration $\mathcal{F}_{j}=\sigma\left(X_{0},Y_{0},\ldots,X_{j},Y_{j}\right)$ . We have, from the definition of $Z_{i}$ that

We compute by directly examining what happens when $X_{d-1}=0$ and when $X_{d-1}=1$ , that

Now notice that, since $X_{d-1}Z_{d-1}$ vanishes when $X_{d-1}=0$ , and since $A_{d-1}\cap B_{d-1}=\emptyset$ , we have that

Hence, when $\gamma_{d}\geq 1$ , since $Z_{j}\leq Z_{j+1}$ for every $j$ , we have the estimate

which is the desired inequality to prove the induction step for the upper bound. To see the lower bound, we will actually prove the lower bound for the sequence

so by equation (39) applied to the $\widehat{Z}_{i}$ sequence, we have then (keeping in mind $\gamma_{d}-1<0$ reverses the inequality):

which is the desired inequality for the induction step on the lower bound. ∎

Proof of Theorem 1: Quantitative Martingale CLT

In the section, we explain the proof of the distribution estimates in equation (7) in Theorem 1 modulo the proof of several key technical results, which are proved in Sections 4.1 and 4.2 below.

We first recall the notation. Namely, fix $0<p\leq 1$ and consider a fixed measure $\mu$ satisfying (3). For every $i=1,\ldots,n$ take independent random $n_{i}\times n_{i-1}$ matrices ${W}^{(i)}$ with all the entries of ${W}^{(i)}$ drawn i.i.d. from $\mu$ and for each $i=1,\ldots,d$ , consider $n_{i}\times n_{i}$ diagonal matrices ${D}^{(i)}=\text{diag}\left(\xi^{(i)}_{1},\ldots,\xi^{(i)}_{n_{i}}\right)$ where $\xi_{a}^{(i)}$ are iid $\left\{0,1\right\}-$ valued independent Bernoulli $(p)$ variables $\mathbf{P}\left(\xi=1\right)=1-\mathbf{P}\left(\xi=0\right)=p$ . The key objects of study are, for $i=0,1,\ldots$ , the random $n_{i}\times n_{0}$ matrices

Notice that the sequence $\vec{u}^{(i)}$ is equivalently defined recursively as:

With this notation the relation (7) we seek to show becomes the statement that for every $m\geq 1$

where $\mathcal{N}(\mu,\sigma^{2})$ is the Gaussian, and

The idea of the proof is to look at the quantity $\ln\left(\left\|\vec{u}^{(d)}\right\|_{2}^{2}\right)$ as the value of a martingale at time $d$ with respect to the filtration

i.e. $\mathcal{F}_{i}$ the sigma algebra generated by the random variables in the first $i$ layers. The basic idea of our proof is to deduce the approximate normality of $\ln\left(\left\|\vec{u}^{(d)}\right\|_{2}^{2}\right)$ by applying a martingale CLT with rate (see Theorem 23). Specifically, note that $\ln\left(\left\|\vec{u}^{(0)}\right\|_{2}^{2}\right)=0$ , since $\vec{u}^{(0)}$ is a unit vector. Hence, $\ln\left(\left\|\vec{u}^{(d)}\right\|_{2}^{2}\right)$ , is a telescoping sum (modulo the complication discussed below that $\left\|\vec{u}^{(i)}\right\|$ could vanish):

and we will think of each entry of the sum as an increment. By subtracting off the conditional means, this will yield a martingale difference sequence which can be analyzed. It will turn out that the variance of these increments satisfy:

For $i\geq 1$ , we will typically have that

and therefore the term involving the fourth moment $\mu_{4}$ will be of size $O(n_{i}^{-2})$ for all except the first layer when $i=0$ . The sum of these increment variances is precisely our variance parameter $\beta$ (modulo terms like $n_{i}^{-2}$ ). This informally explains the appearance of $\left\|\vec{u}^{(0)}\right\|_{4}^{4}$ in the formula for $\beta$ , and why the terms from other layers do not depend on the higher moments of $\mu$ .

To give a precise proof of (42), we must deal with a wrinkle in the strategy described above: with a small but positive probability the vectors $\vec{u}^{(i)}=0$ , making the ratio of the norms of the vectors $\vec{u}^{(i)},\vec{u}^{(i-1)}$ in (43) undefined. Since the weight matrices ${W}$ are assumed to have no atoms, this can only happen if the Bernoulli variables are all equal to zero. To take this into account, we define the events

In addition, we will find it convenient to fix a truncation level $0<\alpha<1,$ and set

We will study the sequence of martingale increments

that coincide, with high probability, with the martingale difference sequence associated to $\ln(S^{(i)})$ (see Lemma 22), where by convention we define the product $\ln_{\alpha}\left(S^{(i)}/S^{(i-1)}\right)\mathtt{1}_{A_{i-1}}$ is zero on the event $A^{c}_{i-1}$ when $S^{(i-1)}=0.$ To prove the approximate normality of $\ln(||\vec{u}^{(i)}||_{2}^{2}),$ we first prove the approximate normality of $\sum_{i}X_{i}$ in the following Proposition.

Moreover, for any fixed $0<\alpha<1$ , the sum $\sum_{i=1}^{d}X_{i}$ is approximately normally distributed in the sense that

We prove Proposition 19 in Section 4.1 below. The next result shows that the sum of the conditional expectations in $\sum_{i}X_{i}$ contributes a constant $\beta/2$ up to errors of the form $\sum_{i}n_{i}^{-2}.$

where $Y$ is a random variable satisfying

Proposition follows from Proposition 28 below. To combine Propositions 19 and 20, we will need the following simple result about perturbations under the $KS$ -distance.

If $\mathcal{N}(0,\beta)$ is centered Gaussian with variance $\beta$ , $X$ is any random variable, and $Y$ is a positive random variable then there is a universal constant $C$ so that we have:

For any $k>0$ , there exists a constant $C$ so that

Further, if $X,Y$ are any two random variables on the same probability space, then:

Combining Propositions 19, Proposition 20 and Lemma 21, we obtain

Finally, combining the following estimate with (49) completes the proof of Theorem 1.

For any fixed $0<\alpha<1$ and any $m\geq 1$ we have

In the proof of Proposition 19, we will use the notation

and we will say that a random variable $Y$ is $O_{a.s.}(f(n_{i-1}))$ if there exists $C>0$ independent of $n_{i},d$ so that $\left|Y\right|\leq Cf(n_{i-1})$ almost surely. The constant $C$ may depend on the moments of the random variable $\mu$ and $p$ , which we think of as fixed. To conclude the approximate normality (46) we will use the following theorem.

Suppose that $X_{0},X_{1},\ldots$ is a martingale difference sequence with respect to a filtration $\{\mathcal{F}_{i},\,i=0,1,\ldots\}$ . Then

The following Proposition allows us to control the $2^{nd}$ and $4^{th}$ moments of $X_{i}$ appearing on in (51).

For any $0<\alpha<1$ , we have that the conditional 2nd and 4th moments of $X_{i}$ are:

Moreover, for any $i\geq 1$ and any $j\leq i-1$ ,

We will prove Proposition 24 in Section 4.1.1 below. To complete the proof of Proposition 19, note that Proposition 24 yields

Thus, (46) follows from the previous line together with (53), (48), and

To prove this bound, we begin by using Proposition 24 to establish two inequalities which hold for any fixed $i$ :

With these inequalities in hand, we proceed by expanding the square as follows:

Finally, summing all the bounds for diagonal and off-diagonal entries we see that this entire numerator from equation (51) is bounded by $O(\sum_{i=1}^{d}n_{i}^{-2})$ , which proves (55) and completes the proof of Proposition 19 modulo checking Proposition 24.

We begin by establishing some preliminary results.

proving the first relation (59). Since each $Z_{j}$ is mean $1$ , the relations (60) follow by standard esimates of the $3^{rd},{2r}^{th}$ moments of a sum of $n$ iid centered random variables. Finally to check the second relation in (59), we write

The following corollary immediately yields (54).

The tail estimate follows by using the Chebyshev inequality and (60). The bound on the expectation is obtained as follows:

To complete the proof of Proposition 24, it remains to check (52) and (53). To do this, we begin with the following observation.

Let $\ln_{\alpha}(x)=\begin{cases}\ln(\alpha)&x<\alpha\\ \ln(x)&x\geq\alpha\end{cases}$ . Suppose $Y$ is a non-negative random variable. Then there are absolute constants $C$ (here we let $C$ refer to a generic constant which may change value from line to line) so that for any $0<\alpha<1$ :

The proof is an elementary exercise in the Taylor series expansion for $\ln(x)$ applied to points in the interval $[\alpha,\infty)$ on which the derivatives of $\ln(x)$ are bounded. ∎

Lemma 27 together with Lemma 25 gives the following information on the conditional moments of $X_{i},$ which directly allows us to conclude (52) and (53) and hence complete the proof of Proposition 24.

Recall the vectors $\vec{u}^{(i)}$ and their norms normalized $L^{2}$ and $L^{4}$ norms, $S^{(i)}$ and $F^{(i)}$ defined in (45) and (50). We have for each $i\geq 0$

On the event $A_{i}^{c}$ , both sides of the equation are zero and the equality trivially holds. Therefore we have only to consider what happens on the event $A_{i}$ , where $\vec{u}^{(i)}\neq 0$ . By equation (41), we have that

By Lemma 25 and Lemma 27, all the terms in equation (63) are $O_{a.s.}(n_{i}^{-2})$ which completes the result. A similar argument, combining the moment calculations from Lemma 25 and the series expansion estimates from Lemma 27 in the natural way, gives the higher moments of $\ln_{\alpha}\left(\frac{S^{(i+1)}}{S^{(i)}}\right)$ . ∎

2 Facts about KS distance - Proof of Lemma 21

Let us use the notation $\mathcal{N}\stackrel{{\scriptstyle d}}{{=}}\mathcal{N}(0,1).$ Since

we may assume without loss that $\beta=1$ when proving (47) and (48). We begin by checking (47). We will show that for every $s>0$ , we have that

This is proven by examining the two possibilities of the absolute value, $\mathbf{P}\left(X+Y\leq t,\left|Y\right|\leq s\right)-\mathbf{P}(\mathcal{N}\leq t,\left|Y\right|\leq s)$ and $\mathbf{P}(\mathcal{N}\leq t,\left|Y\right|\leq s)-\mathbf{P}\left(X+Y\leq t,\left|Y\right|\leq s\right)$ by using the inclusions $\left\{X+Y\leq t,\left|Y\right|\leq s\right\}\subset\left\{X-s\leq t,\left|Y\right|\leq s\right\}$ and $\left\{X+s\leq t,\left|Y\right|\leq s\right\}\subset\left\{X+Y\leq t,\left|Y\right|\leq s\right\}$ respectively for the two cases. In the first case, consider:

To verify (48), note that there exists a constant $C>0$ so that for every $k>0$

3 Proof of Lemma 22

As in the proof of Proposition 28, on the event $S^{(i-1)}\neq 0,$ we have that

where $Z_{j}$ are iid random variables each equal in distribution to $p^{-1}\xi_{j}\left\langle W_{j}^{(i)},\widehat{u}\right\rangle$ where $\widehat{u}$ is an fixed unit vector. In particular, by Lemma 25, $\mathbf{E}{Z_{j}}=1$ and the higher moments of $Z_{j}$ are uniformly bounded in terms of the moments of $\mu$ and $1-p.$ In particular, for each $m\geq 1,$ there exists $C_{m}>0$ depending only on the moments of $\mu$ and $1-p$ so that

Hence, by using a union bound and the estimate on the probability $A_{i}$ in (44), we have for each $m\geq 1,$