Products of Many Large Random Matrices and Gradients in Deep Neural Networks
Boris Hanin, Mihai Nica
Introduction
Products of independent random matrices are a classical topic in probability and mathematical physics with applications to a variety of fields, from wireless communication networks to the physics of black holes , random dynamical systems , and recently to the numerical stability of randomly initialized neural networks . In the context of neural networks, products of random matrices are related to the numerical stability of gradients at initialization and therefore give precise information about the exploding and vanishing gradient problem (see Section 1.5, Proposition 2, and Corollary 3). The purpose of this article is to prove several new results about such products in the regime where the number of terms and the sizes of matrices grow simultaneously. This regime has attracted attention but remains poorly understood. We find new phenomena not present when the number of terms and the size of the matrices are sent to infinity sequentially rather than simultaneously (see Section 1.1 for more on this point).
where are diagonal matrices
As explained in Section 1.4 below, can be thought of as a line-to-line partition function in a disordered medium given by the computation graph underlying the matrix product defining , with corresponding to a kind of initial condition. The diagonal matrices then correspond to valued spins on the vertices of this graph, restricting the allowed paths of the directed random polymer. With this interpretation, our main result, Theorem 1, shows that the analogue of the free energy, namely
is Gaussian up to an error that tends to zero when tend to infinity.
Let be as in (2). Then, the norm of the vector is approximately log-normal distributed:
where the implicit constant is uniform for in a compact subset of , and in a compact set bounded away from . Moreover, for every satisfying we have
where the implicit constant depends on and the moments of but not on
The two conclusions, equation (7) and equation (6), of Theorem 1 are proven separately and have independent proofs. We prove equation (6) by a path-counting type argument in Section 3. The argument in Section 4 for equation (7), in contrast, uses a central limit theorem for martingales.
Theorem 1 shows that the free energy from (4) is Gaussian in the double scaling limit
achieved for instance when are equal and proportional to This asymptotic normality for cannot be seen by taking the limits and to infinity one after the other. Indeed, consider the case when and the standard Gaussian measure. A simple computation using the rotational invariance of i.i.d. Gaussian matrices shows the equality in distribution
where is a chi-squared random variable with degrees of freedom and the terms in the product are independent. In the limit where is fixed and we have and so
On the other hand, if the are uniformly bounded, then we have
In fact, for fixed, converges only with an addition scaling:
making (8) an interesting regime for . The non-commutativity of the limits is well-known and is related to the fact that the local statistics of the singular values of are sensitive to the order in which the limits above are taken. Remaining in the simple case of and Gaussian, a simple application of the central limit results show that when all the are equal and are related to by , then the exact chi-squared representation of equation (9) gives the convergence in distribution:
which is of course consistent with Theorem 1. Part of the content of Theorem 1 is therefore that this result is essentially independent of the parameter and the measure according to which the entries of the matrices are distributed. See Section 1.3 for more discussion on the novel aspects of Theorem 1.
2 Connection to previous work in Random Matrix Theory
The literature on products of random matrices is vast. Much of the previous work concerns products of i.i.d. random matrices, each of size . Such ensembles have been well studied in two distinct regimes: (a) when is fixed and and (b) when is fixed and . Case (a) is related to multiplicative ergodic theory and the study of Lyapunov exponents. The seminal articles in this regime are the results of Furstenberg and Kesten , which gives general conditions for the existence of the top Lyapunov exponent
and the multiplicative ergodic theorem of Osceledets , which gives conditions for almost sure (deterministic) values for all the Lyapunov exponents. Many more recent works characterize the Lyapunov exponents under more specific assumptions, most notably for matrices which are rotationally invariant or which have entries that are real or complex Gaussians, see e.g. as well as the survey and references therein.
Case (b), where is fixed and , falls into the setting of free probability. Indeed, one of the great successes of free probability is the idea of “asymptotic freeness”: in the limit , a collection of independent random matrices behave like a collection of freely independent random variables on a non-commutative probability space (see e.g. Chapter 5 or Chapter 1 and 4). Therefore, case (b) is closely related to a product of freely independent random variables; precise results are obtained in . Earlier results examine case (b) without explicit use of free probability. The problem of first taking and afterwards taking can also be handled using the tools of free probability in the case of Gaussian matrices, see .
As explained in the Introduction and in Section 1.1, the regimes (a), (b) are asymptotically incompatible in the sense that the limits , and do not generally commute on the level of the local behavior of the singular value distribution. Indeed, the problem of understanding what happens when both are scaled simultaneously is mentioned as an open problem in . To explain this further, we note that the work of Newman [16, Thm. 1] in regime (a) shows that when and is fixed, the density of the Lyapunov exponents of converges in the limit when first and then to the triangular density
The work of Tucci [27, Thm. 3.2, Ex. 3.4] shows that for Gaussian ensembles related to one obtains the same global limit in the regime (b) when first and then However, as explained in Section 5, while the global density of all the Lyapunov exponents is the triangular law in both cases, the local behavior (e.g. the fluctuations of the top Lyapunov exponent) is observed to be different depending on the order of the limits even in the exactly solvable special case of products of complex Ginibre matrices.
From this spectral point of view, Theorem 1 gives information about certain averages of the Lyapunov exponents. To see this, fix and let and tend to infinity in accordance with (8). Note we we specifically do not take to infinity. Denote by the non-zero singular values of , and by the corresponding left-singular vectors. Then we have
Hence, Theorem 1 can be interpreted as the statement that the logarithm of the average of the non-zero singular values for is a Gaussian with mean and variance in the limit (8). These non-trivial corrections in can be seen as a finite temperature correction to the maximal entropy regime of Tucci in which first and then . For more on this point of view, we refer the reader to [1, Section 3.2].
Finally, in the specific case where the random matrices are complex Ginibre matrices (i.e. the matrix entries are iid complex Gaussian), very recent work looks at the limiting spectrum under the joint scaling limit , where the ratio is fixed or going to . This work analyzes exact determinental formulas for the joint distribution of singular values available in the case of complex Ginibre matrices. The analogous formulas for real Gaussian matrices given in are significantly more complicated and such an explicit analysis appears to be much more difficult.
3 Contribution of the Present Work
In the context of these previous random matrix results, let us point out four novel aspects of Theorem 1. First, it deals with the joint limit for a large class of non-Gaussian matrices with real entries. There is no integrable structure to our matrix ensembles, and we rely instead on a sum-over-paths approach to analyze the moments (6) and a martingale CLT approach for obtaining the KS distance estimates (7).
An additional novelty of Theorem 1 is it proves the distribution of \big{|}\big{|}{M}^{(d)}\vec{u}\big{|}\big{|}_{2}^{2} is (mostly) universal: it does not depend on the higher moments of the distribution beyond the mean and variance, with the exception of the fourth moment appearing in in the term . In the regime and , this term is a correction.
The fourth and final novelty of Theorem 1 we would like to emphasize is that our results are non-asymptotic, i.e. we obtain an explicit error term of the form . This is particularly useful when using Theorem 1 for studying gradients in randomly initialized neural networks (see Section 1.5).
4 Connection to Random Polymers
The matrix ensembles studied in this article, in the case , are related to directed random polymers on the complete graph of size . This model were recently explored in detail (c.f. e.g. ). A key object for these polymers is the line-to-line partition function
The fact that the weights are positive makes the analysis of the partition function of this traditional random polymer model different than the analysis of the matrix product defined in (1). In particular, no cancellation is possible between the terms in the definition of above, causing to be exponential in . The fixed and limit of of the partition function in the case of these positive weights is the object of study in . As explained in Section 1.3, Theorem 1 studies a different regime where both at the same time. The fact that our weights are mean zero, gives rise to significant cancellation in the terms of from (4), so that the partition function in our mean zero model does not grow exponentially with provided grows with as in (8). Additionally, if in our model, the effect of the diagonal Bernoulli matrices is to close every vertex with probability . The sum over paths in our partition function then becomes the sum only over those paths that pass through vertices that are open.
5 The Case p=1/2𝑝12p=1/2 as Gradients in Random Neural Nets
to be the vectors of activities before and after applying at the neurons at layer .
where is the learning rate. An important practical impediment to gradient based learning is the exploding and vanishing gradient problem (EVGP), which occurs when gradients are numerically unstable:
making the parameter update (10) too small to be meaningful or too large to be precise. An important intuition is that the EVGP will be most pronounced at the start of training, when the weights and biases of are random and the implicit structure of the data being processed has not yet regularized the function computed by .
As explained below, the EVGP for a depth net with hidden layer widths is essentially equivalent to having large fluctuations for the entries (or, in the worst case, for the singular values) of the Jacobians of the transformations between various layers:
where are diagonal -Bernoulli matrices as in (2) with parameter and denotes equality in distribution.
and the implicit constant is uniform when ranges over a compact subset of .
For more information about the EVGP, statistics of gradients in random nets, and distribution of the singular values of the input-output Jacobian we refer the interested reader to for more details.
Proof of Proposition 2
It remains to see that these diagonal matrices are independent of each other (since the outputs of the previous layer are fed into to subsequent layers, so are not a priori independent). This again will be a consequence of the fact the underlying random variables are symmetrically distributed, and will be formally verified by conjugating the weights and biases of the network by random random variables. This doesn’t change the distribution of the network, but will allow us to see the independence between layers in a more concrete way.
2 Proof of Proposition 2
Fix a neural net as in the statement of proposition 2 and denote its weights and biases at layer by and . For each let
be an i.i.d. collection of random variables that each take values with probability . We will also define
Consider the neural net with weights and biases defined by changing the signs of the weights and biases of as follows:
Note that since we’ve assumed that have distributions that are symmetric around , we have
Hence, since the weights of the two networks are identically distributed,
To prove (14), we will use the fact that two random variables are independent if the distribution of given does not depend on the value of That is, (14) will follow once we show that for any fixed sequences that
To check this equality, it suffices to show that given there is exactly one possible configuration for the variables for which the event occurs. The resulting probability then follows since are i.i.d. variables that each take the values with probability The proof is by induction: we will show that for each , given there is a unique configuration for the variables that leads to the event . When we have
Recalling that for all , we see that for each there is a unique value of for which
Proof of Theorem 1: Moment Estimates and Path Counting
We begin by indicating the general plan for the proof of Equation (6) from Theorem 1, which consists of two steps. First, in Proposition 4 below, we express the expectation in (6) as a sum over -tuples of paths . The precise result is the following
With the notation of Theorem 1, for each we have
where , and is defined by:
Note that the definition of in equation (16) depends only on the collection of vertices and , the moments of the measure according to which the entries of the matrices are distributed, and the parameter The utility of equation (15) is that it is written as a product over this function of adjacent layers (rather than the whole path ), which will make it much easier to analyze.
The next step in the proof of the moment estimate (6) is to obtain upper and lower bounds for the expression in (15) that match up to corrections of size This is done in Section 3.4. The main idea here is to treat the sum (15) as an expectation where each , is chosen independently according to the uniform distribution on . The leading term in this expectation comes from event that the entries are all distinct, which happens in layer with probability . When this happens, The subleading term comes from the event that has exactly one element that appears twice, with the others distinct. In each layer, the probability of this type of “collision” is and typically contributes when this happens. Hence, heuristically speaking, we have
This is almost correct, except at the first layer, where the vector acts as special initial condition and slightly deforms the term in this product when Section 3.4 makes this argument precise.
2 Edge Sets, Multiplicities, and Paths
In this section we develop some notation and basic results which is used to clarify the “path counting” needed to prove Proposition 4 below. The major result that is developed in this section, and is needed for Proposition 4, is the enumeration the set of paths in Lemma 8. We will use the following notation conventions:
This is the set one gets by drawing an edge between each entry of and the corresponding entry of and then forgetting the order in which the edges were drawn but remembering the multiplicity. Note that this map from ordered sets of left and right endpoints is many to one. This will come up in our computations, and to keep track of this, we make the following definition.
where the expectation is with respect to In the proof of Proposition 4 we will consider sequences of compatible edge sets in the sense of the following definition.
The formula (24) below will be used in the proof of Proposition 4.
3 Proof of Proposition 4
The first step in proving Proposition 4 is to express as a sum over certain collections of paths.
Let be the set of tuples of paths:
where was defined in (22). Our notation is that if , then is a -tuple for each .
Note that the entries of the matrix can be written as a sum over certain paths in , namely:
Using this interpretation in terms of paths, we obtain by indexing the starting points as and the ending point as , that we can write as a sum over :
Similarly, the -th power is then given by:
The result of the lemma follows by taking expectation of both sides, using the independence of the random variables ’s, and relations
Since the law of the entries of the matrices is assumed to symmetric around the odd moments of are all zero, and it will be useful to consider only edge sets that are “even” in the following sense:
With the same notation as in Lemma 10, we have
Because the variables are symmetric around , all their odd moments vanish. Thus, in the expression (26), only collections of paths in which every edge is traversed an even number of times given a non-zero contribution. What remains are exactly paths from by the definition in equation (27). ∎
Summing over all possible endpoints now gives the identity:
Finally, using this identity on (29), with being the function that appears inside the sum over , gives the desired result of Proposition 4. ∎
4 Completion of Proof of Equation (6)
We think of the sum in Proposition 4 as an expectation over discrete random variables . Specifically, we write:
Informally, stands for “unique entries”, and consists of those -tuples with no repeated entries; stands for “one pair” and consists of those -tuples with exactly one repeated entry; stands for “bad” and consists of everything else. Formally,
For each , under the uniform measure on , each random variable has the following probabilities for the events ,,:
The proof is an elementary exercise in discrete probability. ∎
Subdivide the “one pair” set by which indices are paired: where for . Then for each we have
where the implicit constant in is bounded below by and above by
This is an elementary calculation from the definition of . If , and the multiplicities of edges in the edge set are all which makes the combinatorial factor in equal to , and every edge is covered exactly twice giving a factor of in the weight term. If , giving a factor of . Moreover, in this case, when the indices which are paired in are also paired in , all the combinatorial factors are again , and the weight term is . If the paired indices from are not paired in , then there the combinatorial term is , and the weight term is . ∎
where the quantities are:
Note that for any fixed and , we have the following conditional independence of layers before and after :
where in the second term we write instead of since the measure no longer depends on Applying this with we find
where in the second term the random variables are uniform and do not depend on or . An elementary probability computation using Lemma 15 and the measure on shows that
where the implicit constant in the last term is bounded below by and above by Combining the result of Lemma 14, with (33) and (34), proves Lemma 16.
Then, for any choice of the label , we have that:
By using the possible values for computed in Lemma 15 and the definition of , we have that for any label :
This proves the upper bound in (35). The lower bound similarly follows:
We first notice, by application of the elementary probability estimate recorded in Lemma 18, that the upper and lower bounds on given in Lemma 17 are equal up to small errors. We have for :
(where is as in Lemma 17). Finally, putting these values for into the result of Lemma 14 we see:
The last line follows from the elementary fact for exponentials for . ∎
5 An elementary probability estimate
Let be independent events with probabilities and be independent events with probabilities such that
Denote by the indicator that the event happens, , and by the indicator that happens, . Further, fix for every some as well as . Define
Then, if for every , we have:
where by convention In contrast, if for every , we have:
The proof goes by induction on . The base case can be computed directly
which is verified to obey the stated inequalities under the convention . To see the induction step, suppose that . Define the filtration . We have, from the definition of that
We compute by directly examining what happens when and when , that
Now notice that, since vanishes when , and since , we have that
Hence, when , since for every , we have the estimate
which is the desired inequality to prove the induction step for the upper bound. To see the lower bound, we will actually prove the lower bound for the sequence
so by equation (39) applied to the sequence, we have then (keeping in mind reverses the inequality):
which is the desired inequality for the induction step on the lower bound. ∎
Proof of Theorem 1: Quantitative Martingale CLT
In the section, we explain the proof of the distribution estimates in equation (7) in Theorem 1 modulo the proof of several key technical results, which are proved in Sections 4.1 and 4.2 below.
We first recall the notation. Namely, fix and consider a fixed measure satisfying (3). For every take independent random matrices with all the entries of drawn i.i.d. from and for each , consider diagonal matrices where are iid valued independent Bernoulli variables . The key objects of study are, for , the random matrices
Notice that the sequence is equivalently defined recursively as:
With this notation the relation (7) we seek to show becomes the statement that for every
where is the Gaussian, and
The idea of the proof is to look at the quantity as the value of a martingale at time with respect to the filtration
i.e. the sigma algebra generated by the random variables in the first layers. The basic idea of our proof is to deduce the approximate normality of by applying a martingale CLT with rate (see Theorem 23). Specifically, note that , since is a unit vector. Hence, , is a telescoping sum (modulo the complication discussed below that could vanish):
and we will think of each entry of the sum as an increment. By subtracting off the conditional means, this will yield a martingale difference sequence which can be analyzed. It will turn out that the variance of these increments satisfy:
For , we will typically have that
and therefore the term involving the fourth moment will be of size for all except the first layer when . The sum of these increment variances is precisely our variance parameter (modulo terms like ). This informally explains the appearance of in the formula for , and why the terms from other layers do not depend on the higher moments of .
To give a precise proof of (42), we must deal with a wrinkle in the strategy described above: with a small but positive probability the vectors , making the ratio of the norms of the vectors in (43) undefined. Since the weight matrices are assumed to have no atoms, this can only happen if the Bernoulli variables are all equal to zero. To take this into account, we define the events
In addition, we will find it convenient to fix a truncation level and set
We will study the sequence of martingale increments
that coincide, with high probability, with the martingale difference sequence associated to (see Lemma 22), where by convention we define the product is zero on the event when To prove the approximate normality of we first prove the approximate normality of in the following Proposition.
Moreover, for any fixed , the sum is approximately normally distributed in the sense that
We prove Proposition 19 in Section 4.1 below. The next result shows that the sum of the conditional expectations in contributes a constant up to errors of the form
where is a random variable satisfying
Proposition follows from Proposition 28 below. To combine Propositions 19 and 20, we will need the following simple result about perturbations under the -distance.
If is centered Gaussian with variance , is any random variable, and is a positive random variable then there is a universal constant so that we have:
For any , there exists a constant so that
Further, if are any two random variables on the same probability space, then:
Combining Propositions 19, Proposition 20 and Lemma 21, we obtain
Finally, combining the following estimate with (49) completes the proof of Theorem 1.
For any fixed and any we have
In the proof of Proposition 19, we will use the notation
and we will say that a random variable is if there exists independent of so that almost surely. The constant may depend on the moments of the random variable and , which we think of as fixed. To conclude the approximate normality (46) we will use the following theorem.
Suppose that is a martingale difference sequence with respect to a filtration . Then
The following Proposition allows us to control the and moments of appearing on in (51).
For any , we have that the conditional 2nd and 4th moments of are:
Moreover, for any and any ,
We will prove Proposition 24 in Section 4.1.1 below. To complete the proof of Proposition 19, note that Proposition 24 yields
Thus, (46) follows from the previous line together with (53), (48), and
To prove this bound, we begin by using Proposition 24 to establish two inequalities which hold for any fixed :
With these inequalities in hand, we proceed by expanding the square as follows:
Finally, summing all the bounds for diagonal and off-diagonal entries we see that this entire numerator from equation (51) is bounded by , which proves (55) and completes the proof of Proposition 19 modulo checking Proposition 24.
We begin by establishing some preliminary results.
proving the first relation (59). Since each is mean , the relations (60) follow by standard esimates of the moments of a sum of iid centered random variables. Finally to check the second relation in (59), we write
The following corollary immediately yields (54).
The tail estimate follows by using the Chebyshev inequality and (60). The bound on the expectation is obtained as follows:
To complete the proof of Proposition 24, it remains to check (52) and (53). To do this, we begin with the following observation.
Let . Suppose is a non-negative random variable. Then there are absolute constants (here we let refer to a generic constant which may change value from line to line) so that for any :
The proof is an elementary exercise in the Taylor series expansion for applied to points in the interval on which the derivatives of are bounded. ∎
Lemma 27 together with Lemma 25 gives the following information on the conditional moments of which directly allows us to conclude (52) and (53) and hence complete the proof of Proposition 24.
Recall the vectors and their norms normalized and norms, and defined in (45) and (50). We have for each
On the event , both sides of the equation are zero and the equality trivially holds. Therefore we have only to consider what happens on the event , where . By equation (41), we have that
By Lemma 25 and Lemma 27, all the terms in equation (63) are which completes the result. A similar argument, combining the moment calculations from Lemma 25 and the series expansion estimates from Lemma 27 in the natural way, gives the higher moments of . ∎
2 Facts about KS distance - Proof of Lemma 21
Let us use the notation Since
we may assume without loss that when proving (47) and (48). We begin by checking (47). We will show that for every , we have that
This is proven by examining the two possibilities of the absolute value, and by using the inclusions and respectively for the two cases. In the first case, consider:
To verify (48), note that there exists a constant so that for every
3 Proof of Lemma 22
As in the proof of Proposition 28, on the event we have that
where are iid random variables each equal in distribution to where is an fixed unit vector. In particular, by Lemma 25, and the higher moments of are uniformly bounded in terms of the moments of and In particular, for each there exists depending only on the moments of and so that
Hence, by using a union bound and the estimate on the probability in (44), we have for each