Which Neural Net Architectures Give Rise To Exploding and Vanishing Gradients?

Boris Hanin

Introduction

A fundamental obstacle in training deep neural nets using gradient based optimization is the exploding and vanishing gradient problem (EVGP), which has attracted much attention (e.g. [BSF94, HBF+01, MM15, XXP17, PSG17, PSG18]) after first being studied by Hochreiter [Hoc91]. The EVGP occurs when the derivative of the loss in the SGD update

is very large for some trainable parameters $W$ and very small for others:

This makes the increment in (1) either too small to be meaningful or too large to be precise. In practice, a number of ways of overcoming the EVGP have been proposed (see e.g. [Sch]). Let us mention three general approaches: (i) using architectures such as LSTMs [HS97], highway networks [SGS15], or ResNets [HZRS16] that are designed specifically to control gradients; (ii) precisely initializing weights (e.g. i.i.d. with properly chosen variances [MM15, HZRS15] or using orthogonal weight matrices [ASB16, HSL16]); (iii) choosing non-linearities that that tend to compute numerically stable gradients or activations at initialization [KUMH17].

A number of articles (e.g. [PLR+16, RPK+17, PSG17, PSG18]) use mean field theory to show that even vanilla fully connected architectures can avoid the EVGP in the limit of infinitely wide hidden layers. In this article, we continue this line of investigation. We focus specifically on fully connected $\operatorname{ReLU}$ nets, and give a rigorous answer to the question of which combinations of depths $d$ and hidden layer widths $n_{j}$ give $\operatorname{ReLU}$ nets that suffer from the EVGP at initialization. In particular, we avoid approach (iii) to the EVGP by setting once and for all the activations in $\mathcal{N}$ to be $\operatorname{ReLU}$ and that we study approach (ii) in the limited sense that we consider only initializations in which weights and biases are independent (and properly scaled as in Definition 1) but do not investigate other initialization strategies. Instead, we focus on rigorously understanding the effects of finite depth and width on gradients in randomly initialized networks. The main contributions of this work are:

We derive new exact formulas for the joint even moments of the entries of the input-output Jacobian in a fully connected $\operatorname{ReLU}$ net with random weights and biases. These formulas hold at finite depth and width (see Theorem 3).

We prove that the empirical variance of gradients in a fully connected $\operatorname{ReLU}$ net is exponential in the sum of the reciprocals of the hidden layer widths. This suggests that when this sum of reciprocals is too large, early training dynamics are very slow and it may take many epochs to achieve better-than-chance performance (see Figure 1).

We prove that, so long as weights and biases are initialized independently with the correct variance scaling (see Definition 1), whether the EVGP occurs (in the precise sense explained in §3) in fully connected $\operatorname{ReLU}$ nets is a function only of the architecture and not the distributions from which the weights and biases are drawn.

The second of the listed contributions has several concrete consequences for architecture selection and for understanding initial training dynamics in $\operatorname{ReLU}$ nets. Specifically, our main results, Theorems 1-3, prove that the EVGP will occur in a $\operatorname{ReLU}$ net $\mathcal{N}$ (in either the annealed or the quenched sense described in §3) if and only if a single scalar parameter, the sum

of reciprocals of the hidden layer widths of $\mathcal{N},$ is large. Here $n_{j}$ denotes the width of the $j^{th}$ hidden layer, and we prove in Theorem 1 that the variance of entries in the input-output Jacobian of $\mathcal{N}$ is exponential in $\beta.$ Implications for architecture selection then follow from special cases of the power-mean inequality:

in which equality is achieved if and only if $n_{j}$ are all equal. We interpret the leftmost inequality as follows. Fix $d$ and a total budget $\sum_{j}n_{j}$ of hidden layer neurons. Theorems 1 and 2 say that to avoid the EVGP in both the quenched and annealed senses, one should minimize $\beta$ and hence make the leftmost expression in (2) as large as possible. This occurs precisely when $n_{j}$ are all equal. Fix instead $d$ and a budget of trainable parameters, $\sum_{j}n_{j}(n_{j-1}+1),$ which is close to $\sum_{j}n_{j}^{2}$ if the $n_{j}$ ’s don’t fluctuate too much. Again using (2), we find that from the point of view of avoiding the EVGP, it is advantageous to take the $n_{j}$ ’s to be equal.

In short, our theoretical results (Theorems 1 and 2) show that if $\beta$ is large then, at initialization, $\mathcal{N}$ will compute gradients that fluctuate wildly, intuitively leading to slow initial training dynamics. This heuristic is corroborated by an experiment from [HR18] about the start of training on MNIST for fully connected neural nets with varying depths and hidden layer widths (the parameter $\beta$ appeared in [HR18] in a different context). Figure 1 shows that $\beta$ is a good summary statistic for predicting how quickly deep networks will start to train.

We conclude the introduction by mentioning what we see as the principal weaknesses of the present work. First, our analysis holds only for $\operatorname{ReLU}$ activations and assumes that all non-zero weights are independent and zero centered. Therefore, our conclusions do not directly carry over to convolutional, residual, and recurrent networks. Second, our results yield information about the fluctuations of the entries $Z_{p,q}$ of the input-output Jacobian $J_{\mathcal{N}}$ at any fixed input to $\mathcal{N}.$ It would be interesting to have information about the joint distribution of the $Z_{p,q}$ ’s with inputs ranging over an entire dataset. Third, our techniques do not directly extend to initializations such as orthogonal weight matrices. We hope to address these issues in the future and, specifically, believe that the qualitative results of this article will generalize to convolutional networks in which the number of channels grows with the layer number.

Relation to Prior Work

To provide some context for our results, we contrast both our approach and contributions with the recent work [PSG17, PSG18]. These articles consider two senses in which a fully connected neural net $\mathcal{N}$ with random weights and biases can avoid the EVGP. The first is that the average singular value of the input-output Jacobian $J_{\mathcal{N}}$ remains approximately $1$ , while the second, termed dynamical isometry, requires that all the singular values of $J_{\mathcal{N}}$ are approximately $1$ . The authors of [PSG17, PSG18] study the full distribution of the singular values of the Jacobian $J_{\mathcal{N}}$ first in the infinite width limit $n\rightarrow\infty$ and then in the infinite depth limit $d\rightarrow\infty.$

Let us emphasize two particularly attractive features of [PSG17, PSG18]. First, neither the initialization nor the non-linearity in the neural nets $\mathcal{N}$ is assumed to be fixed, allowing the authors to consider solutions of types (ii) and (iii) above to the EVGP. The techniques used in these articles are also rather general, and point to the emergence of universality classes for singular values of the Jacobian of deep neural nets at initialization. Second, the results in these articles access the full distribution of singular values for the Jacobian $J_{\mathcal{N}}$ , providing significantly more refined information than simply controlling the mean singular value.

The neural nets considered in [PSG17, PSG18] are essentially assumed to be infinitely wide, however. This raises the question of whether there is any finite width at which the behavior of a randomly initialized network will resemble the infinite width regime, and moreover, if such a width exists, how wide is wide enough? In this work we give rigorous answers to such questions by quantifying finite width effects, leaving aside questions about both different choices of non-linearity and about good initializations that go beyond independent weights.

Instead of taking the singular value definition of the EVGP as in [PSG17, PSG18], we propose two non-spectral formulations of the EVGP, which we term annealed and quenched. Their precise definitions are given in §3.2 and §3.3, and we provide in §3.1 a discussion of the relation between the different senses in which the EVGP can occur.

Theorem 1 below implies, in the infinite width limit, that all $\operatorname{ReLU}$ nets avoid the EVGP in both the quenched and annealed sense. Hence, our definition of the EVGP (see §3.2 and §3.3) is weaker than the dynamical isometry condition from [PSG17, PSG18]. But, as explained in §3.1, it is stronger the condition that the average singular value equal $1.$ Both the quenched and annealed versions of the EVGP concern the fluctuations of the partial derivatives

of the $q^{th}$ component of the function $f_{\mathcal{N}}$ computed by $\mathcal{N}$ with respect to the $p^{th}$ component of its input ( $\operatorname{Act}^{(0)}$ is an input vector - see (10)). The stronger, quenched version of the EVGP concerns the empirical variance of the squares of all the different $Z_{p,q}:$

Here, $n_{0}$ is the input dimension to $\mathcal{N}$ , $n_{d}$ is the output dimension, and the index $m$ runs over all $n_{0}n_{d}$ possible input-output neuron pairs $(p_{m},q_{m}).$ Intuitively, since we will show in Theorem 1 that

independently of the depth, having a large mean for $\widehat{\operatorname{Var}}\left[Z^{2}\right]$ means that for a typical realization of the weights and biases in $\mathcal{N}$ , the derivatives of $f_{\mathcal{N}}$ with respect to different trainable parameters will vary over several orders of magnitude, leading to inefficient SGD updates (1) for any fixed learning rate $\lambda$ (see §3.1 - §3.3).

since $\phi=\operatorname{ReLU}$ , making $\phi^{\prime}(z)$ the indicator function ${\bf 1}_{[0,\infty)}(z)$ and the value of $\phi^{\prime}(\sqrt{q^{*}}z)$ independent of the asymptotic length $q^{*}$ for activations. The condition $\chi_{1}=1$ defines the edge of chaos regime. This gives a heuristic explanation for why the nets considered in the present article cannot have just one of vanishing and exploding gradients. It also allows us to interpret our results as a rigorous computation for $\operatorname{ReLU}$ nets of the $1/n_{j}$ corrections at the edge of chaos.

In addition to the mean field theory papers, we mention the article [SPSD17]. It does not deal directly with gradients, but it does treat the finite width corrections to the statistical distribution of pre-activations in a feed-forward network with Gaussian initialized weights and biases. A nice aspect of this work is that the results give the joint distribution not only over all the neurons but also over any number of inputs to the network. In a similar vein, we bring to the reader’s attention [BFL+17], which gives interesting heuristic computations about the structure of correlations between gradients corresponding to different inputs in both fully connected and residual $\operatorname{ReLU}$ nets.

Defining the EVGP for Feed-Forward Networks

where $Z_{p,q}$ the entries of the Jacobian $J_{\mathcal{N}}$ (see (3)). A common way to formalize this statement is to interpret “ $Z_{p,q}$ has large fluctuations” to mean that the Jacobian $J_{\mathcal{N}}$ of the function computed by $\mathcal{N}$ has both very large and very small singular values [BSF94, HBF+01, PSG17]. We give in §3.1 a brief account of the reasoning behind this formulation of the EVGP and explain why is also natural to define the EVGP via the moments of $Z_{p,q}.$ Then, in §3.2 and §3.3, we define two precise senses, which we call annealed and quenched, in which that EVGP can occur, phrased directly in terms of the joint moments of $Z_{p,q}.$

Let us recall the rationale behind using the spectral theory of $J_{\mathcal{N}}$ to define the EVGP. The gradient in (1) of the loss with respect to, say, a weight $W_{\alpha,\beta}^{{}_{(j)}}$ connecting neuron $\alpha$ in layer $j-1$ to neuron $\beta$ in layer $j$ is

where $\phi^{\prime}(\operatorname{Act}_{\beta}^{(j)})$ is the derivative of the non-linearity, the derivative of the loss $\mathcal{L}$ with respect to the output $\operatorname{Act}^{{}_{(d)}}$ of $\mathcal{N}$ is

and we’ve denoted the $\beta^{th}$ row in the layer $j$ to output Jacobian $J_{\mathcal{N}}(j\rightarrow d)$ by

Since $J_{\mathcal{N}}(j\rightarrow d)$ is the product of $d-j$ layer-to-layer Jacobians, its inner product with $\nabla_{\operatorname{Act}^{(d)}}\mathcal{L}$ is usually the term considered responsible for the EVGP. The worst case distortion it can achieve on the vector $\nabla_{\operatorname{Act}^{(d)}}\mathcal{L}$ is captured precisely by its condition number, the ratio of its largest and smallest singular values.

However, unlike the case of recurrent networks in which $J_{\mathcal{N}}(j\rightarrow d)$ is $(d-j)-$ fold product of a fixed matrix, when the hidden layer widths grow with the depth $d$ , the dimensions of the layer $j$ to layer $j^{\prime}$ Jacobians $J_{\mathcal{N}}(j\rightarrow j^{\prime})$ are not fixed and it is not clear to what extent the vector $\nabla_{\operatorname{Act}^{(d)}}\mathcal{L}$ will actually be stretched or compressed by the worst case bounds coming from estimates on the condition number of $J_{\mathcal{N}}(j\rightarrow d).$

Moreover, on a practical level, the EVGP is about the numerical stability of the increments of the SGD updates (1) over all weights (and biases) in the network, which is directly captured by the joint distribution of the random variables

Due to the relation (6), two terms influence the moments of $|\partial\mathcal{L}/\partial W_{\alpha,\beta}^{(j)}|^{2}$ : one coming from the activations at layer $j-1$ and the other from the entries of $J_{\mathcal{N}}(j\rightarrow d).$ We focus in this article on the second term and hence interpret the fluctuations of the entries of $J_{\mathcal{N}}(j\rightarrow d)$ as a measure of the EVGP.

To conclude, we recall a simple relationship between the moments of the entries of the input-output Jacobian $J_{\mathcal{N}}$ and the distribution of its singular values, which can be used to directly compare spectral and entrywise definitions of the EVGP. Suppose for instance one is interested in the average singular value of $J_{\mathcal{N}}$ (as in [PSG17, PSG18]). The sum of the singular values of $J_{\mathcal{N}}$ is given by

where $\{u_{j}\}$ is any orthonormal basis. Hence, the average singular value can be obtained directly from the joint even moments of the entries of $J_{\mathcal{N}}$ . Both the quenched and annealed EVGP (see (7),(9)) entail that the average singular value for $J_{\mathcal{N}}$ equals $1,$ and we prove in Theorem 1 (specifically (11)) that even at finite depth and width the average singular value for $J_{\mathcal{N}}$ equals $1$ for all the random $\operatorname{ReLU}$ nets we consider!

One can push this line of reasoning further. Namely, the singular values of any matrix $M$ are determined by the Stieltjes transform of the empirical distribution $\sigma_{M}$ of the eigenvalues of $M^{T}M:$

Writing $(z-x)^{-1}$ as a power series in $z$ shows that $S_{J_{\mathcal{N}}}$ is determined by traces of powers of $J_{\mathcal{N}}^{T}J_{\mathcal{N}}$ and hence by the joint even moments of the entries of $J_{\mathcal{N}}$ . We hope to estimate $S_{J_{\mathcal{N}}}(z)$ directly in future work.

2 Annealed Exploding and Vanishing Gradients

Fix a sequence of positive integers $n_{0},n_{1},\ldots.$ For each $d\geq 1$ write $\mathcal{N}_{d}$ for the depth $d$ $\operatorname{ReLU}$ net with hidden layer widths $n_{0},\ldots,n_{d}$ and random weights and biases (see Definition 1 below). As in (3), write $Z_{p,q}(d)$ for the partial derivative of the $q^{th}$ component of the output of $\mathcal{N}_{d}$ with respect to $p^{th}$ component of its input. We say that the family of architectures given by $\{n_{0},n_{1},\ldots\}$ avoids the exploding and vanishing gradient problem in the annealed sense if for each fixed input to $\mathcal{N}_{d}$ and every $p,q$ we have

Here the expectation is over the weights and biases in $\mathcal{N}_{d}$ . Architectures that avoid the EVGP in the annealed sense are ones where the typical magnitude of the partial derivatives $Z_{p,q}(d)$ have bounded (both above and below) fluctuations around a constant mean value. This allows for a reliable a priori selection of the learning rate $\lambda$ from (1) even for deep architectures. Our main result about the annealed EVGP is Theorem 1: a family of neural net architectures avoids the EVGP in the annealed sense if and only if

3 Quenched Exploding and Vanishing Gradients

There is an important objection to defining the EVGP as in the previous section. Namely, if a neural net $\mathcal{N}$ suffers from the annealed EVGP, then it is impossible to choose an appropriate a priori learning rate $\lambda$ that works for a typical initialization. However, it may still be that for a typical realization of the weights and biases there is some choice of $\lambda$ (depending on the particular initialization), that works well for all (or most) trainable parameters in $\mathcal{N}.$ To study whether this is the case, we must consider the variation of the $Z_{p,q}$ ’s across different $p,q$ in a fixed realization of weights and biases. This is the essence of the quenched EVGP.

To formulate the precise definition, we again fix a sequence of positive integers $n_{0},n_{1},\ldots$ and write $\mathcal{N}_{d}$ for a depth $d$ $\operatorname{ReLU}$ net with hidden layer widths $n_{0},\ldots,n_{d}.$ We write as in (4)

for the empirical variance of the squares all the entries $Z_{p,q}(d)$ of the input-output Jacobian of $\mathcal{N}_{d}.$ We will say that the family of architectures given by $\{n_{0},n_{1},\ldots\}$ avoids the exploding and vanishing gradient problem in the quenched sense if

Our main result about the quenched sense of the EVGP is Theorem 2. It turns out, at least for the $\operatorname{ReLU}$ nets we study, that a family of neural net architectures avoids the quenched EVGP if and only if it also avoids the annealed exploding and vanishing gradient problem (i.e. if (8) holds).

Acknowledgements

I thank Leonid Hanin for a number of useful conversations and for his comments on an early draft. I am also grateful to Jeffrey Pennington for pointing out an important typo in the proof of Theorem 3 and to David Rolnick for several helpful conversations and, specifically, for pointing out the relevance of the power-mean inequality for understanding $\beta$ . Finally, I would like to thank several anonymous referees for their help in improving the exposition. One referee in particular raised concerns about the annealed and quenched definitions of the EVGP. Addressing these concerns resulted in the discussion in §3.1.

Notation and Main Results

The function $f_{\mathcal{N}}$ computed by $\mathcal{N}\in\mathfrak{N}({\bf n},d)$ is determined by a collection of weights and biases

to $\mathcal{N},$ we define for every $j=1,\ldots,d$

The vectors $\operatorname{act}^{(j)},\,\operatorname{Act}^{(j)}$ therefore represent the vectors of inputs and outputs of the neurons in the $j^{th}$ layer of $\mathcal{N}.$ The function computed by $\mathcal{N}$ takes the form

A random network is obtained by randomizing weights and biases.

$\mu^{(j)},\nu^{(j)}$ are symmetric around for every $1\leq j\leq d.$

the variance of $\mu^{(j)}$ is $2/(n_{j-1})$ .

A random net $\mathcal{N}\in\mathfrak{N}_{{\bf\mu},{\bf\nu}}\left({\bf n},d\right)$ is obtained by requiring that the weights and biases for neurons at layer $j$ are drawn independently from $\mu^{(j)},\nu^{(j)}:$

Condition (iii) is used when we apply Lemma 1 in the proof of Theorem 3. It can be removed under the restriction that $d\ll\exp\left(\sum_{j=1}^{d}n_{j}\right).$ Since this yields slightly messier but not meaningfully different results, we do not pursue this point.

2 Results

We also continue to write $Z_{p,q}$ for the entries of the input-output Jacobian of a neural net (see (3)).

In contrast, the fourth moment of $Z_{p,q}(x)$ is exponential in $\sum_{j}\frac{1}{n_{j}}:$

Moreover, there exists a constant $C_{K,\mu}>0$ depending only on $K$ and the first $2K$ moments of the measures $\{\mu^{(j)}\}_{j=1}^{d}$ such that if $K<\min_{j=1}^{d-1}\{n_{j}\}$ , then

Hence, the family $\mathcal{N}_{d}$ of $\operatorname{ReLU}$ nets avoids the exploding and vanishing gradient problem in the quenched sense if and only if

so that $\gamma(j)$ represents a neuron in the $j^{th}$ layer of $\mathcal{N}.$ Similarly, given any collection of $K\geq 1$ paths $\Gamma=\left(\gamma_{k}\right)_{k=1}^{K}$ that connect (possibly different) neurons in the input of $\mathcal{N}$ with neurons in its output and any $1\leq j\leq d$ , denote by

the neurons in the $j^{th}$ layer of $\mathcal{N}$ that belong to at least one element of $\Gamma.$ Finally, for every $\alpha\in\Gamma(j-1)$ and $\beta\in\Gamma(j)$ , denote by

the number of paths in $\Gamma$ that pass through neuron $\alpha$ at layer $j-1$ and through neuron $\beta$ at layer $j.$

where the sum is over ordered tuples $\Gamma=\left(\gamma_{1},\ldots,\gamma_{2K}\right)$ of paths in $\mathcal{N}$ from $p$ to $q$ and

where for every $r\geq 0,$ the quantity $\mu_{r}^{(j)}$ denotes the $r^{th}$ moment of the measure $\mu^{(j)}.$

The expression (16) can be generalized to case of mixed even moments. Namely, given $m\geq 1$ and for each $1\leq m\leq M$ integers $K_{m}\geq 0$ and $1\leq p_{m}\leq n_{0}\,\,1\leq q_{m}\leq n_{d}$ , we have

See Appendix A for the proof of Theorem 3.

References

Appendix A Proof of Theorem 3

Using that $X\stackrel{{\scriptstyle d}}{{=}}-X,w_{j}\stackrel{{\scriptstyle d}}{{=}}-w_{j}$ and that $\sum_{j}k_{j}$ is even, we have

Averaging these two expressions we combine (18) with the fact that $X$ is independent of $\{w_{j}\}_{j=1}^{n}$ and its law has no atoms to obtain the desired result. s∎

We now turn to the proof of Theorem 3. To this end, fix $d\geq 1,$ a collection of positive integers ${\bf n}=\left(n_{i}\right)_{i=0}^{d}$ , and let $\mathcal{N}\in\mathfrak{N}_{{\bf\mu},{\bf\nu}}\left({\bf n},d\right)$ . Let us briefly recall the notation for paths from §5.2. Given $1\leq p\leq n_{0}$ and $1\leq q\leq n_{d},$ we defined a path $\gamma$ from p to q to be a collection $\{\gamma(j)\}_{j=0}^{d}$ of neurons so that $\gamma(0)=p,\,\gamma(d)=q,$ and $\gamma(j)\in\{1,\ldots,n_{j}\}.$ The numbers $\gamma(j)$ should be thought of as neurons in the $j^{th}$ hidden layer of $\mathcal{N}$ . Given such a collection, we obtain for each $j$ a weight

between each two consecutive neurons along the path $\gamma.$ Our starting point is the expression

where $\operatorname{act}^{(j)}$ are defined as in (10). This expression is well-known and follows immediately form the chain rule (c.f. e.g. equation (1) in [CHM+15]). We therefore have

We will prove a slightly more general statement than in the formulation of Theorem 3. Namely, suppose $\Gamma=\left(\gamma_{1},\ldots,\gamma_{2K}\right)$ is any collection of paths from the input of $\mathcal{N}$ to the output (the paths are not required to have the same starting and ending neurons) such that for every $\beta\in\Gamma(d)$ ,

To evaluate the expectation in (21), note that the computation done by $\mathcal{N}$ is a Markov chain with respect to the layers (i.e. given $\operatorname{Act}^{(j-1)}$ , the activations at layers $j,\ldots,d$ are independent of the weight and biases up to and including layer $j-1.$ ) Hence, denoting by $\mathcal{F}_{\leq d-1}$ the sigma algebra generated by the weight and biases up to and including layer $d-1$ , the tower property for expectation and the Markov property yield

Next, observe that for each $1\leq j\leq d,$ conditioned on $\operatorname{Act}^{(j-1)}$ , the families of random variables $\{w_{\alpha,\beta}^{(j)},\,\operatorname{act}_{\beta}^{(j)}\}_{\alpha=1}^{n_{j-1}}$ are independent for different $\beta.$ For $j=d$ this implies

Let us make several observations about $\widehat{\operatorname{act}}_{\Gamma,\beta}^{(d)}$ and $\operatorname{act}_{\Gamma,\beta}^{(d)}$ when conditioned on $\operatorname{Act}^{(d-1)}$ . First, the conditioned random variable $\widehat{\operatorname{act}}_{\Gamma,\beta}^{(d-1)}$ is independent of the conditioned random variable $\operatorname{act}_{\Gamma,\beta}^{(d-1)}$ . Second, the distribution of $\widehat{\operatorname{act}}_{\Gamma,\beta}^{(d)}$ conditioned on $\operatorname{Act}^{(d-1)}$ is symmetric around $0.$ Third, since we assumed that the bias distributions $\nu^{(j)}$ for $\mathcal{N}$ have no atoms, the conditional distribution of $\widehat{\operatorname{act}}_{\Gamma,\beta}^{(d)}$ also has no atoms. Fourth, $\operatorname{act}_{\Gamma,\beta}^{(d-1)}$ is a linear combination of the weights $\{w_{\alpha,\beta}^{(d)}\}_{\alpha\in\Gamma(j-1)}$ with given coefficients $\{\operatorname{Act}_{\alpha}^{(d-1)}\}_{\alpha\in\Gamma(j-1)}.$ Since the weight distributions $\mu^{(j)}$ for $\mathcal{N}$ are symmetric around $0,$ the above five observations, together with (24) allow us to apply Lemma 1 and to conclude that

To complete the argument, we must consider two cases. First, recall that by assumption, for every $\beta\in\Gamma(d),$ the number of $\gamma\in\Gamma$ for which $\gamma(d)=\beta$ is even. If for every $j\leq d$ and each $\alpha\in\Gamma(j-1)$ the number of $\gamma\in\Gamma$ passing through $\alpha$ is even, then we may repeat the preceding argument to directly obtain (21). Otherwise, we apply this argument until we reach $\alpha\in\Gamma(j-1),\,\beta\in\Gamma(j)$ so that the number $\left|\Gamma_{\alpha,\beta}(j)\right|$ of paths in $\Gamma$ that pass through $\alpha$ and $\beta$ is odd. In this case, the right hand side of (22) vanishes since the measure $\mu^{(d)}$ is symmetric around and thus has vanishing odd moments. Relation (21) therefore again holds since in this case both sides are $0.$ This completes the proof of Theorem 3. $\square$

Appendix B Proof of Theorem 1

In this section, we use Theorem 3 to prove Theorem 1. Let us first check (11). According to Theorem 3, we have

Note that since $\mu$ is symmetric around $0,$ we have that $\mu_{1}=0.$ Thus, the terms where $\gamma_{1}\neq\gamma_{2}$ vanish. Using $\mu_{2}^{(j)}=\frac{2}{n_{j-1}}$ , we find

as claimed. We now turn to proving (12). Using Theorem 3, we have

For each $\Gamma=\left(\gamma_{k}\right)_{k=1}^{2}$ with $\gamma_{k}$ paths from p to q, we have

We begin by checking the first equality in (28) by induction on $d.$ Fix $\Gamma=\left(\gamma_{1},\gamma_{2}\right).$ When $d=1,$ we have $\left|\Gamma(0)\right|=\left|\Gamma(1)\right|=1$ . Hence $\gamma_{1}=\gamma_{2}$ and $A(\Gamma)=1$ since both the numerator and denominator on the right hand side of (27) equal $1$ . The right hand side of (28) is also $1$ since $\left|\Gamma(j)\right|=1$ for every $j.$ This completes the base case. Suppose now that $D\geq 2,$ and we have proved (28) for all $d\leq D-1.$ Let

If $j_{*}=1$ , then we are done by the inductive hypothesis. Otherwise, there are two choices of $\bar{\Gamma}=\{\bar{\gamma}_{k}\}_{k=1}^{2}$ for which

These choices correspond to the two permutations of $\{\gamma_{k}\}_{k=1}^{2}.$ Similarly, there are $6$ choices of $\bar{\Gamma}=\{\bar{\gamma}_{k}\}_{k=1}^{4}$ for which

The six choices correspond to selecting one of two choices for $\gamma_{1}(1)$ and three choices of an index $k=2,3,4$ so that $\gamma_{k}(j)$ coincides with $\gamma_{1}(j)$ for each $j\leq j_{*}.$ If $j_{*}=d$ , we are done. Otherwise, we apply the inductive hypothesis to paths from $\Gamma(j_{*})$ to $\Gamma(d)$ to complete the proof of the first equality in (28). The second equality in (28) follows from the observation that since $\left|\Gamma(0)\right|=\left|\Gamma(d)\right|=1,$ the number of $j\in\{1,\ldots,d\}$ for which $\left|\Gamma(j-1)\right|=1,\,\left|\Gamma(j)\right|=2$ must equal the number of $j$ for which $\left|\Gamma(j-1)\right|=2,\,\left|\Gamma(j)\right|=1$ . ∎

Observe that since $\mu_{2}^{(j)}=2/n_{j-1}$ , we have

where the expectation on the right hand side is over the uniform measure on paths $(\gamma_{1},\gamma_{2})$ from the input of $\mathcal{N}$ to the output conditioned on $\gamma_{1}(0)=\gamma_{2}(0)=p$ and $\gamma_{1}(d)=\gamma_{2}(d)=q,$ and

The value of $X_{d}$ corresponding to every such path is at least $2^{k+1}$ since $\widetilde{\mu}_{4}^{(j)}\geq 1$ for every $j$ and is at most $6\widetilde{\mu}_{4,max}$ for the same reason. Therefore, using that for all $\varepsilon\in,$ we have

This completes the proof of the lower bound. The upper bound is obtained in the same way:

As with the second and fourth moment computations, we note that $\mu_{\left|\Gamma_{\alpha,\beta}(j)\right|}$ vanishes unless each $\left|\Gamma_{\alpha,\beta}(j)\right|$ is even. Hence, as with (29), we may write

where $A_{K}(\Gamma)$ is the analog of $A(\Gamma)$ from (27). The same argument as in Lemma 1 shows that

which is precisely the weight in (31) assigned to collections $\Gamma$ with $\left|\Gamma(j)\right|=K$ for every $1\leq j\leq d-1,$ yields

where the expectation is over uniformly chosen collections $\Gamma=\left(\gamma_{1},\ldots,\gamma_{K}\right)$ of paths from the input to the output of $\mathcal{N}$ and

The value of $X_{d}$ on each such collection is at most $\left(C_{K}\right)^{m}$ , where $C_{K}=2^{K-1}\frac{(2K)!}{K!}$ is a large but fixed constant. Hence, just as in (30),

This completes the proof of Theorem 1. $\square$

Appendix C Proof of Theorem 2

Fixing $p,q$ and using that the second sum in the previous line has $M(M-1)$ terms, we have

To estimate the difference in this sum, we use Theorem 3 to obtain

Note that since the measures $\mu^{(j)}$ of the weights are symmetric around zero, their odd moments vanish and hence the only non-zero terms in (34) and (35) are those for which

Further, observe that each path $\gamma$ from some fixed input neuron to some fixed output vertex is determined uniquely by the sequence of hidden neurons $\gamma(j)\in\{1,\ldots,n_{j}\}$ through which it passes for $j=1,\ldots,d-1$ . Therefore, we may identify each collection of paths $\Gamma=\left(\gamma_{k}\right)_{k=1}^{4}$ in the sum (34) with a unique collection of paths $\bar{\Gamma}=\left(\bar{\gamma}_{k}\right)_{k=1}^{4}$ in (35) by asking that $\gamma_{k}(j)=\bar{\gamma}_{k}(j)$ for each $k$ and all $1\leq j\leq d-1$ . Observe further that under this bijection,

For $j=1,d$ , the terms $C_{j}(\Gamma)$ and $C_{j}(\bar{\Gamma})$ are related as follows:

We consider two cases: (i) $q_{m_{1}}\neq q_{m_{2}}$ (i.e. $\left|\bar{\Gamma}(d)\right|=2$ ) and (ii) $q_{m_{1}}=q_{m_{2}}$ (i.e. $\left|\bar{\Gamma}(d)\right|=1$ and $p_{m_{1}}\neq p_{m_{2}}$ ). In case (i), we have

Hence, using (37) and (38), we find that in case (i)

where the last estimate is proved by the same argument as the relation (12) in Theorem 1. To obtain the analogous lower bound for case (ii), we write $q=q_{m_{1}}=q_{m_{2}},\,p_{m_{1}}\neq p_{m_{2}}.$ In this case, combining (36) with (38), we have

Moreover, continuing to use the bijection between $\Gamma$ and $\bar{\Gamma}$ above, (37) yields in this case

Using that if $\left|\Gamma(0)\right|=\left|\Gamma(1)\right|=1,$ then

Writing $\widehat{p}$ for any neuron in the first hidden layer of $\mathcal{N}$ , we rewrite the sum in the previous line as

where the point is now that we are considering paths only from $\widehat{p}$ to $q$ . According to (12) from Theorem 1, we have

Combining this with (33), (39) and setting

proving (15). Finally, the upper bound in (14) follows from dropping the negative term in (33) and applying the upper bound from (12). $\square$