Which Neural Net Architectures Give Rise To Exploding and Vanishing Gradients?
Boris Hanin
Introduction
A fundamental obstacle in training deep neural nets using gradient based optimization is the exploding and vanishing gradient problem (EVGP), which has attracted much attention (e.g. [BSF94, HBF+01, MM15, XXP17, PSG17, PSG18]) after first being studied by Hochreiter [Hoc91]. The EVGP occurs when the derivative of the loss in the SGD update
is very large for some trainable parameters and very small for others:
This makes the increment in (1) either too small to be meaningful or too large to be precise. In practice, a number of ways of overcoming the EVGP have been proposed (see e.g. [Sch]). Let us mention three general approaches: (i) using architectures such as LSTMs [HS97], highway networks [SGS15], or ResNets [HZRS16] that are designed specifically to control gradients; (ii) precisely initializing weights (e.g. i.i.d. with properly chosen variances [MM15, HZRS15] or using orthogonal weight matrices [ASB16, HSL16]); (iii) choosing non-linearities that that tend to compute numerically stable gradients or activations at initialization [KUMH17].
A number of articles (e.g. [PLR+16, RPK+17, PSG17, PSG18]) use mean field theory to show that even vanilla fully connected architectures can avoid the EVGP in the limit of infinitely wide hidden layers. In this article, we continue this line of investigation. We focus specifically on fully connected nets, and give a rigorous answer to the question of which combinations of depths and hidden layer widths give nets that suffer from the EVGP at initialization. In particular, we avoid approach (iii) to the EVGP by setting once and for all the activations in to be and that we study approach (ii) in the limited sense that we consider only initializations in which weights and biases are independent (and properly scaled as in Definition 1) but do not investigate other initialization strategies. Instead, we focus on rigorously understanding the effects of finite depth and width on gradients in randomly initialized networks. The main contributions of this work are:
We derive new exact formulas for the joint even moments of the entries of the input-output Jacobian in a fully connected net with random weights and biases. These formulas hold at finite depth and width (see Theorem 3).
We prove that the empirical variance of gradients in a fully connected net is exponential in the sum of the reciprocals of the hidden layer widths. This suggests that when this sum of reciprocals is too large, early training dynamics are very slow and it may take many epochs to achieve better-than-chance performance (see Figure 1).
We prove that, so long as weights and biases are initialized independently with the correct variance scaling (see Definition 1), whether the EVGP occurs (in the precise sense explained in §3) in fully connected nets is a function only of the architecture and not the distributions from which the weights and biases are drawn.
The second of the listed contributions has several concrete consequences for architecture selection and for understanding initial training dynamics in nets. Specifically, our main results, Theorems 1-3, prove that the EVGP will occur in a net (in either the annealed or the quenched sense described in §3) if and only if a single scalar parameter, the sum
of reciprocals of the hidden layer widths of is large. Here denotes the width of the hidden layer, and we prove in Theorem 1 that the variance of entries in the input-output Jacobian of is exponential in Implications for architecture selection then follow from special cases of the power-mean inequality:
in which equality is achieved if and only if are all equal. We interpret the leftmost inequality as follows. Fix and a total budget of hidden layer neurons. Theorems 1 and 2 say that to avoid the EVGP in both the quenched and annealed senses, one should minimize and hence make the leftmost expression in (2) as large as possible. This occurs precisely when are all equal. Fix instead and a budget of trainable parameters, which is close to if the ’s don’t fluctuate too much. Again using (2), we find that from the point of view of avoiding the EVGP, it is advantageous to take the ’s to be equal.
In short, our theoretical results (Theorems 1 and 2) show that if is large then, at initialization, will compute gradients that fluctuate wildly, intuitively leading to slow initial training dynamics. This heuristic is corroborated by an experiment from [HR18] about the start of training on MNIST for fully connected neural nets with varying depths and hidden layer widths (the parameter appeared in [HR18] in a different context). Figure 1 shows that is a good summary statistic for predicting how quickly deep networks will start to train.
We conclude the introduction by mentioning what we see as the principal weaknesses of the present work. First, our analysis holds only for activations and assumes that all non-zero weights are independent and zero centered. Therefore, our conclusions do not directly carry over to convolutional, residual, and recurrent networks. Second, our results yield information about the fluctuations of the entries of the input-output Jacobian at any fixed input to It would be interesting to have information about the joint distribution of the ’s with inputs ranging over an entire dataset. Third, our techniques do not directly extend to initializations such as orthogonal weight matrices. We hope to address these issues in the future and, specifically, believe that the qualitative results of this article will generalize to convolutional networks in which the number of channels grows with the layer number.
Relation to Prior Work
To provide some context for our results, we contrast both our approach and contributions with the recent work [PSG17, PSG18]. These articles consider two senses in which a fully connected neural net with random weights and biases can avoid the EVGP. The first is that the average singular value of the input-output Jacobian remains approximately , while the second, termed dynamical isometry, requires that all the singular values of are approximately . The authors of [PSG17, PSG18] study the full distribution of the singular values of the Jacobian first in the infinite width limit and then in the infinite depth limit
Let us emphasize two particularly attractive features of [PSG17, PSG18]. First, neither the initialization nor the non-linearity in the neural nets is assumed to be fixed, allowing the authors to consider solutions of types (ii) and (iii) above to the EVGP. The techniques used in these articles are also rather general, and point to the emergence of universality classes for singular values of the Jacobian of deep neural nets at initialization. Second, the results in these articles access the full distribution of singular values for the Jacobian , providing significantly more refined information than simply controlling the mean singular value.
The neural nets considered in [PSG17, PSG18] are essentially assumed to be infinitely wide, however. This raises the question of whether there is any finite width at which the behavior of a randomly initialized network will resemble the infinite width regime, and moreover, if such a width exists, how wide is wide enough? In this work we give rigorous answers to such questions by quantifying finite width effects, leaving aside questions about both different choices of non-linearity and about good initializations that go beyond independent weights.
Instead of taking the singular value definition of the EVGP as in [PSG17, PSG18], we propose two non-spectral formulations of the EVGP, which we term annealed and quenched. Their precise definitions are given in §3.2 and §3.3, and we provide in §3.1 a discussion of the relation between the different senses in which the EVGP can occur.
Theorem 1 below implies, in the infinite width limit, that all nets avoid the EVGP in both the quenched and annealed sense. Hence, our definition of the EVGP (see §3.2 and §3.3) is weaker than the dynamical isometry condition from [PSG17, PSG18]. But, as explained in §3.1, it is stronger the condition that the average singular value equal Both the quenched and annealed versions of the EVGP concern the fluctuations of the partial derivatives
of the component of the function computed by with respect to the component of its input ( is an input vector - see (10)). The stronger, quenched version of the EVGP concerns the empirical variance of the squares of all the different
Here, is the input dimension to , is the output dimension, and the index runs over all possible input-output neuron pairs Intuitively, since we will show in Theorem 1 that
independently of the depth, having a large mean for means that for a typical realization of the weights and biases in , the derivatives of with respect to different trainable parameters will vary over several orders of magnitude, leading to inefficient SGD updates (1) for any fixed learning rate (see §3.1 - §3.3).
since , making the indicator function and the value of independent of the asymptotic length for activations. The condition defines the edge of chaos regime. This gives a heuristic explanation for why the nets considered in the present article cannot have just one of vanishing and exploding gradients. It also allows us to interpret our results as a rigorous computation for nets of the corrections at the edge of chaos.
In addition to the mean field theory papers, we mention the article [SPSD17]. It does not deal directly with gradients, but it does treat the finite width corrections to the statistical distribution of pre-activations in a feed-forward network with Gaussian initialized weights and biases. A nice aspect of this work is that the results give the joint distribution not only over all the neurons but also over any number of inputs to the network. In a similar vein, we bring to the reader’s attention [BFL+17], which gives interesting heuristic computations about the structure of correlations between gradients corresponding to different inputs in both fully connected and residual nets.
Defining the EVGP for Feed-Forward Networks
where the entries of the Jacobian (see (3)). A common way to formalize this statement is to interpret “ has large fluctuations” to mean that the Jacobian of the function computed by has both very large and very small singular values [BSF94, HBF+01, PSG17]. We give in §3.1 a brief account of the reasoning behind this formulation of the EVGP and explain why is also natural to define the EVGP via the moments of Then, in §3.2 and §3.3, we define two precise senses, which we call annealed and quenched, in which that EVGP can occur, phrased directly in terms of the joint moments of
Let us recall the rationale behind using the spectral theory of to define the EVGP. The gradient in (1) of the loss with respect to, say, a weight connecting neuron in layer to neuron in layer is
where is the derivative of the non-linearity, the derivative of the loss with respect to the output of is
and we’ve denoted the row in the layer to output Jacobian by
Since is the product of layer-to-layer Jacobians, its inner product with is usually the term considered responsible for the EVGP. The worst case distortion it can achieve on the vector is captured precisely by its condition number, the ratio of its largest and smallest singular values.
However, unlike the case of recurrent networks in which is fold product of a fixed matrix, when the hidden layer widths grow with the depth , the dimensions of the layer to layer Jacobians are not fixed and it is not clear to what extent the vector will actually be stretched or compressed by the worst case bounds coming from estimates on the condition number of
Moreover, on a practical level, the EVGP is about the numerical stability of the increments of the SGD updates (1) over all weights (and biases) in the network, which is directly captured by the joint distribution of the random variables
Due to the relation (6), two terms influence the moments of : one coming from the activations at layer and the other from the entries of We focus in this article on the second term and hence interpret the fluctuations of the entries of as a measure of the EVGP.
To conclude, we recall a simple relationship between the moments of the entries of the input-output Jacobian and the distribution of its singular values, which can be used to directly compare spectral and entrywise definitions of the EVGP. Suppose for instance one is interested in the average singular value of (as in [PSG17, PSG18]). The sum of the singular values of is given by
where is any orthonormal basis. Hence, the average singular value can be obtained directly from the joint even moments of the entries of . Both the quenched and annealed EVGP (see (7),(9)) entail that the average singular value for equals and we prove in Theorem 1 (specifically (11)) that even at finite depth and width the average singular value for equals for all the random nets we consider!
One can push this line of reasoning further. Namely, the singular values of any matrix are determined by the Stieltjes transform of the empirical distribution of the eigenvalues of
Writing as a power series in shows that is determined by traces of powers of and hence by the joint even moments of the entries of . We hope to estimate directly in future work.
2 Annealed Exploding and Vanishing Gradients
Fix a sequence of positive integers For each write for the depth net with hidden layer widths and random weights and biases (see Definition 1 below). As in (3), write for the partial derivative of the component of the output of with respect to component of its input. We say that the family of architectures given by avoids the exploding and vanishing gradient problem in the annealed sense if for each fixed input to and every we have
Here the expectation is over the weights and biases in . Architectures that avoid the EVGP in the annealed sense are ones where the typical magnitude of the partial derivatives have bounded (both above and below) fluctuations around a constant mean value. This allows for a reliable a priori selection of the learning rate from (1) even for deep architectures. Our main result about the annealed EVGP is Theorem 1: a family of neural net architectures avoids the EVGP in the annealed sense if and only if
3 Quenched Exploding and Vanishing Gradients
There is an important objection to defining the EVGP as in the previous section. Namely, if a neural net suffers from the annealed EVGP, then it is impossible to choose an appropriate a priori learning rate that works for a typical initialization. However, it may still be that for a typical realization of the weights and biases there is some choice of (depending on the particular initialization), that works well for all (or most) trainable parameters in To study whether this is the case, we must consider the variation of the ’s across different in a fixed realization of weights and biases. This is the essence of the quenched EVGP.
To formulate the precise definition, we again fix a sequence of positive integers and write for a depth net with hidden layer widths We write as in (4)
for the empirical variance of the squares all the entries of the input-output Jacobian of We will say that the family of architectures given by avoids the exploding and vanishing gradient problem in the quenched sense if
Our main result about the quenched sense of the EVGP is Theorem 2. It turns out, at least for the nets we study, that a family of neural net architectures avoids the quenched EVGP if and only if it also avoids the annealed exploding and vanishing gradient problem (i.e. if (8) holds).
Acknowledgements
I thank Leonid Hanin for a number of useful conversations and for his comments on an early draft. I am also grateful to Jeffrey Pennington for pointing out an important typo in the proof of Theorem 3 and to David Rolnick for several helpful conversations and, specifically, for pointing out the relevance of the power-mean inequality for understanding . Finally, I would like to thank several anonymous referees for their help in improving the exposition. One referee in particular raised concerns about the annealed and quenched definitions of the EVGP. Addressing these concerns resulted in the discussion in §3.1.
Notation and Main Results
The function computed by is determined by a collection of weights and biases
to we define for every
The vectors therefore represent the vectors of inputs and outputs of the neurons in the layer of The function computed by takes the form
A random network is obtained by randomizing weights and biases.
are symmetric around for every
the variance of is .
A random net is obtained by requiring that the weights and biases for neurons at layer are drawn independently from
Condition (iii) is used when we apply Lemma 1 in the proof of Theorem 3. It can be removed under the restriction that Since this yields slightly messier but not meaningfully different results, we do not pursue this point.
2 Results
We also continue to write for the entries of the input-output Jacobian of a neural net (see (3)).
In contrast, the fourth moment of is exponential in
Moreover, there exists a constant depending only on and the first moments of the measures such that if , then
Hence, the family of nets avoids the exploding and vanishing gradient problem in the quenched sense if and only if
so that represents a neuron in the layer of Similarly, given any collection of paths that connect (possibly different) neurons in the input of with neurons in its output and any , denote by
the neurons in the layer of that belong to at least one element of Finally, for every and , denote by
the number of paths in that pass through neuron at layer and through neuron at layer
where the sum is over ordered tuples of paths in from to and
where for every the quantity denotes the moment of the measure
The expression (16) can be generalized to case of mixed even moments. Namely, given and for each integers and , we have
See Appendix A for the proof of Theorem 3.
References
Appendix A Proof of Theorem 3
Using that and that is even, we have
Averaging these two expressions we combine (18) with the fact that is independent of and its law has no atoms to obtain the desired result. s∎
We now turn to the proof of Theorem 3. To this end, fix a collection of positive integers , and let . Let us briefly recall the notation for paths from §5.2. Given and we defined a path from p to q to be a collection of neurons so that and The numbers should be thought of as neurons in the hidden layer of . Given such a collection, we obtain for each a weight
between each two consecutive neurons along the path Our starting point is the expression
where are defined as in (10). This expression is well-known and follows immediately form the chain rule (c.f. e.g. equation (1) in [CHM+15]). We therefore have
We will prove a slightly more general statement than in the formulation of Theorem 3. Namely, suppose is any collection of paths from the input of to the output (the paths are not required to have the same starting and ending neurons) such that for every ,
To evaluate the expectation in (21), note that the computation done by is a Markov chain with respect to the layers (i.e. given , the activations at layers are independent of the weight and biases up to and including layer ) Hence, denoting by the sigma algebra generated by the weight and biases up to and including layer , the tower property for expectation and the Markov property yield
Next, observe that for each conditioned on , the families of random variables are independent for different For this implies
Let us make several observations about and when conditioned on . First, the conditioned random variable is independent of the conditioned random variable . Second, the distribution of conditioned on is symmetric around Third, since we assumed that the bias distributions for have no atoms, the conditional distribution of also has no atoms. Fourth, is a linear combination of the weights with given coefficients Since the weight distributions for are symmetric around the above five observations, together with (24) allow us to apply Lemma 1 and to conclude that
To complete the argument, we must consider two cases. First, recall that by assumption, for every the number of for which is even. If for every and each the number of passing through is even, then we may repeat the preceding argument to directly obtain (21). Otherwise, we apply this argument until we reach so that the number of paths in that pass through and is odd. In this case, the right hand side of (22) vanishes since the measure is symmetric around and thus has vanishing odd moments. Relation (21) therefore again holds since in this case both sides are This completes the proof of Theorem 3.
Appendix B Proof of Theorem 1
In this section, we use Theorem 3 to prove Theorem 1. Let us first check (11). According to Theorem 3, we have
Note that since is symmetric around we have that Thus, the terms where vanish. Using , we find
as claimed. We now turn to proving (12). Using Theorem 3, we have
For each with paths from p to q, we have
We begin by checking the first equality in (28) by induction on Fix When we have . Hence and since both the numerator and denominator on the right hand side of (27) equal . The right hand side of (28) is also since for every This completes the base case. Suppose now that and we have proved (28) for all Let
If , then we are done by the inductive hypothesis. Otherwise, there are two choices of for which
These choices correspond to the two permutations of Similarly, there are choices of for which
The six choices correspond to selecting one of two choices for and three choices of an index so that coincides with for each If , we are done. Otherwise, we apply the inductive hypothesis to paths from to to complete the proof of the first equality in (28). The second equality in (28) follows from the observation that since the number of for which must equal the number of for which . ∎
Observe that since , we have
where the expectation on the right hand side is over the uniform measure on paths from the input of to the output conditioned on and and
The value of corresponding to every such path is at least since for every and is at most for the same reason. Therefore, using that for all we have
This completes the proof of the lower bound. The upper bound is obtained in the same way:
As with the second and fourth moment computations, we note that vanishes unless each is even. Hence, as with (29), we may write
where is the analog of from (27). The same argument as in Lemma 1 shows that
which is precisely the weight in (31) assigned to collections with for every yields
where the expectation is over uniformly chosen collections of paths from the input to the output of and
The value of on each such collection is at most , where is a large but fixed constant. Hence, just as in (30),
This completes the proof of Theorem 1.
Appendix C Proof of Theorem 2
Fixing and using that the second sum in the previous line has terms, we have
To estimate the difference in this sum, we use Theorem 3 to obtain
Note that since the measures of the weights are symmetric around zero, their odd moments vanish and hence the only non-zero terms in (34) and (35) are those for which
Further, observe that each path from some fixed input neuron to some fixed output vertex is determined uniquely by the sequence of hidden neurons through which it passes for . Therefore, we may identify each collection of paths in the sum (34) with a unique collection of paths in (35) by asking that for each and all . Observe further that under this bijection,
For , the terms and are related as follows:
We consider two cases: (i) (i.e. ) and (ii) (i.e. and ). In case (i), we have
Hence, using (37) and (38), we find that in case (i)
where the last estimate is proved by the same argument as the relation (12) in Theorem 1. To obtain the analogous lower bound for case (ii), we write In this case, combining (36) with (38), we have
Moreover, continuing to use the bijection between and above, (37) yields in this case
Using that if then
Writing for any neuron in the first hidden layer of , we rewrite the sum in the previous line as
where the point is now that we are considering paths only from to . According to (12) from Theorem 1, we have
Combining this with (33), (39) and setting
proving (15). Finally, the upper bound in (14) follows from dropping the negative term in (33) and applying the upper bound from (12).