Which Neural Net Architectures Give Rise To Exploding and Vanishing Gradients?

Boris Hanin

Introduction

A fundamental obstacle in training deep neural nets using gradient based optimization is the exploding and vanishing gradient problem (EVGP), which has attracted much attention (e.g. [BSF94, HBF+01, MM15, XXP17, PSG17, PSG18]) after first being studied by Hochreiter [Hoc91]. The EVGP occurs when the derivative of the loss in the SGD update

is very large for some trainable parameters WW and very small for others:

This makes the increment in (1) either too small to be meaningful or too large to be precise. In practice, a number of ways of overcoming the EVGP have been proposed (see e.g. [Sch]). Let us mention three general approaches: (i) using architectures such as LSTMs [HS97], highway networks [SGS15], or ResNets [HZRS16] that are designed specifically to control gradients; (ii) precisely initializing weights (e.g. i.i.d. with properly chosen variances [MM15, HZRS15] or using orthogonal weight matrices [ASB16, HSL16]); (iii) choosing non-linearities that that tend to compute numerically stable gradients or activations at initialization [KUMH17].

A number of articles (e.g. [PLR+16, RPK+17, PSG17, PSG18]) use mean field theory to show that even vanilla fully connected architectures can avoid the EVGP in the limit of infinitely wide hidden layers. In this article, we continue this line of investigation. We focus specifically on fully connected ReLU\operatorname{ReLU} nets, and give a rigorous answer to the question of which combinations of depths dd and hidden layer widths njn_{j} give ReLU\operatorname{ReLU} nets that suffer from the EVGP at initialization. In particular, we avoid approach (iii) to the EVGP by setting once and for all the activations in N\mathcal{N} to be ReLU\operatorname{ReLU} and that we study approach (ii) in the limited sense that we consider only initializations in which weights and biases are independent (and properly scaled as in Definition 1) but do not investigate other initialization strategies. Instead, we focus on rigorously understanding the effects of finite depth and width on gradients in randomly initialized networks. The main contributions of this work are:

We derive new exact formulas for the joint even moments of the entries of the input-output Jacobian in a fully connected ReLU\operatorname{ReLU} net with random weights and biases. These formulas hold at finite depth and width (see Theorem 3).

We prove that the empirical variance of gradients in a fully connected ReLU\operatorname{ReLU} net is exponential in the sum of the reciprocals of the hidden layer widths. This suggests that when this sum of reciprocals is too large, early training dynamics are very slow and it may take many epochs to achieve better-than-chance performance (see Figure 1).

We prove that, so long as weights and biases are initialized independently with the correct variance scaling (see Definition 1), whether the EVGP occurs (in the precise sense explained in §3) in fully connected ReLU\operatorname{ReLU} nets is a function only of the architecture and not the distributions from which the weights and biases are drawn.

The second of the listed contributions has several concrete consequences for architecture selection and for understanding initial training dynamics in ReLU\operatorname{ReLU} nets. Specifically, our main results, Theorems 1-3, prove that the EVGP will occur in a ReLU\operatorname{ReLU} net N\mathcal{N} (in either the annealed or the quenched sense described in §3) if and only if a single scalar parameter, the sum

of reciprocals of the hidden layer widths of N,\mathcal{N}, is large. Here njn_{j} denotes the width of the jthj^{th} hidden layer, and we prove in Theorem 1 that the variance of entries in the input-output Jacobian of N\mathcal{N} is exponential in β.\beta. Implications for architecture selection then follow from special cases of the power-mean inequality:

in which equality is achieved if and only if njn_{j} are all equal. We interpret the leftmost inequality as follows. Fix dd and a total budget jnj\sum_{j}n_{j} of hidden layer neurons. Theorems 1 and 2 say that to avoid the EVGP in both the quenched and annealed senses, one should minimize β\beta and hence make the leftmost expression in (2) as large as possible. This occurs precisely when njn_{j} are all equal. Fix instead dd and a budget of trainable parameters, jnj(nj1+1),\sum_{j}n_{j}(n_{j-1}+1), which is close to jnj2\sum_{j}n_{j}^{2} if the njn_{j}’s don’t fluctuate too much. Again using (2), we find that from the point of view of avoiding the EVGP, it is advantageous to take the njn_{j}’s to be equal.

In short, our theoretical results (Theorems 1 and 2) show that if β\beta is large then, at initialization, N\mathcal{N} will compute gradients that fluctuate wildly, intuitively leading to slow initial training dynamics. This heuristic is corroborated by an experiment from [HR18] about the start of training on MNIST for fully connected neural nets with varying depths and hidden layer widths (the parameter β\beta appeared in [HR18] in a different context). Figure 1 shows that β\beta is a good summary statistic for predicting how quickly deep networks will start to train.

We conclude the introduction by mentioning what we see as the principal weaknesses of the present work. First, our analysis holds only for ReLU\operatorname{ReLU} activations and assumes that all non-zero weights are independent and zero centered. Therefore, our conclusions do not directly carry over to convolutional, residual, and recurrent networks. Second, our results yield information about the fluctuations of the entries Zp,qZ_{p,q} of the input-output Jacobian JNJ_{\mathcal{N}} at any fixed input to N.\mathcal{N}. It would be interesting to have information about the joint distribution of the Zp,qZ_{p,q}’s with inputs ranging over an entire dataset. Third, our techniques do not directly extend to initializations such as orthogonal weight matrices. We hope to address these issues in the future and, specifically, believe that the qualitative results of this article will generalize to convolutional networks in which the number of channels grows with the layer number.

Relation to Prior Work

To provide some context for our results, we contrast both our approach and contributions with the recent work [PSG17, PSG18]. These articles consider two senses in which a fully connected neural net N\mathcal{N} with random weights and biases can avoid the EVGP. The first is that the average singular value of the input-output Jacobian JNJ_{\mathcal{N}} remains approximately 11, while the second, termed dynamical isometry, requires that all the singular values of JNJ_{\mathcal{N}} are approximately 11. The authors of [PSG17, PSG18] study the full distribution of the singular values of the Jacobian JNJ_{\mathcal{N}} first in the infinite width limit nn\rightarrow\infty and then in the infinite depth limit d.d\rightarrow\infty.

Let us emphasize two particularly attractive features of [PSG17, PSG18]. First, neither the initialization nor the non-linearity in the neural nets N\mathcal{N} is assumed to be fixed, allowing the authors to consider solutions of types (ii) and (iii) above to the EVGP. The techniques used in these articles are also rather general, and point to the emergence of universality classes for singular values of the Jacobian of deep neural nets at initialization. Second, the results in these articles access the full distribution of singular values for the Jacobian JNJ_{\mathcal{N}}, providing significantly more refined information than simply controlling the mean singular value.

The neural nets considered in [PSG17, PSG18] are essentially assumed to be infinitely wide, however. This raises the question of whether there is any finite width at which the behavior of a randomly initialized network will resemble the infinite width regime, and moreover, if such a width exists, how wide is wide enough? In this work we give rigorous answers to such questions by quantifying finite width effects, leaving aside questions about both different choices of non-linearity and about good initializations that go beyond independent weights.

Instead of taking the singular value definition of the EVGP as in [PSG17, PSG18], we propose two non-spectral formulations of the EVGP, which we term annealed and quenched. Their precise definitions are given in §3.2 and §3.3, and we provide in §3.1 a discussion of the relation between the different senses in which the EVGP can occur.

Theorem 1 below implies, in the infinite width limit, that all ReLU\operatorname{ReLU} nets avoid the EVGP in both the quenched and annealed sense. Hence, our definition of the EVGP (see §3.2 and §3.3) is weaker than the dynamical isometry condition from [PSG17, PSG18]. But, as explained in §3.1, it is stronger the condition that the average singular value equal 1.1. Both the quenched and annealed versions of the EVGP concern the fluctuations of the partial derivatives

of the qthq^{th} component of the function fNf_{\mathcal{N}} computed by N\mathcal{N} with respect to the pthp^{th} component of its input (Act(0)\operatorname{Act}^{(0)} is an input vector - see (10)). The stronger, quenched version of the EVGP concerns the empirical variance of the squares of all the different Zp,q:Z_{p,q}:

Here, n0n_{0} is the input dimension to N\mathcal{N}, ndn_{d} is the output dimension, and the index mm runs over all n0ndn_{0}n_{d} possible input-output neuron pairs (pm,qm).(p_{m},q_{m}). Intuitively, since we will show in Theorem 1 that

independently of the depth, having a large mean for Var^[Z2]\widehat{\operatorname{Var}}\left[Z^{2}\right] means that for a typical realization of the weights and biases in N\mathcal{N}, the derivatives of fNf_{\mathcal{N}} with respect to different trainable parameters will vary over several orders of magnitude, leading to inefficient SGD updates (1) for any fixed learning rate λ\lambda (see §3.1 - §3.3).

since ϕ=ReLU\phi=\operatorname{ReLU}, making ϕ(z)\phi^{\prime}(z) the indicator function 1[0,)(z){\bf 1}_{[0,\infty)}(z) and the value of ϕ(qz)\phi^{\prime}(\sqrt{q^{*}}z) independent of the asymptotic length qq^{*} for activations. The condition χ1=1\chi_{1}=1 defines the edge of chaos regime. This gives a heuristic explanation for why the nets considered in the present article cannot have just one of vanishing and exploding gradients. It also allows us to interpret our results as a rigorous computation for ReLU\operatorname{ReLU} nets of the 1/nj1/n_{j} corrections at the edge of chaos.

In addition to the mean field theory papers, we mention the article [SPSD17]. It does not deal directly with gradients, but it does treat the finite width corrections to the statistical distribution of pre-activations in a feed-forward network with Gaussian initialized weights and biases. A nice aspect of this work is that the results give the joint distribution not only over all the neurons but also over any number of inputs to the network. In a similar vein, we bring to the reader’s attention [BFL+17], which gives interesting heuristic computations about the structure of correlations between gradients corresponding to different inputs in both fully connected and residual ReLU\operatorname{ReLU} nets.

Defining the EVGP for Feed-Forward Networks

where Zp,qZ_{p,q} the entries of the Jacobian JNJ_{\mathcal{N}} (see (3)). A common way to formalize this statement is to interpret “Zp,qZ_{p,q} has large fluctuations” to mean that the Jacobian JNJ_{\mathcal{N}} of the function computed by N\mathcal{N} has both very large and very small singular values [BSF94, HBF+01, PSG17]. We give in §3.1 a brief account of the reasoning behind this formulation of the EVGP and explain why is also natural to define the EVGP via the moments of Zp,q.Z_{p,q}. Then, in §3.2 and §3.3, we define two precise senses, which we call annealed and quenched, in which that EVGP can occur, phrased directly in terms of the joint moments of Zp,q.Z_{p,q}.

Let us recall the rationale behind using the spectral theory of JNJ_{\mathcal{N}} to define the EVGP. The gradient in (1) of the loss with respect to, say, a weight Wα,β(j)W_{\alpha,\beta}^{{}_{(j)}} connecting neuron α\alpha in layer j1j-1 to neuron β\beta in layer jj is

where ϕ(Actβ(j))\phi^{\prime}(\operatorname{Act}_{\beta}^{(j)}) is the derivative of the non-linearity, the derivative of the loss L\mathcal{L} with respect to the output Act(d)\operatorname{Act}^{{}_{(d)}} of N\mathcal{N} is

and we’ve denoted the βth\beta^{th} row in the layer jj to output Jacobian JN(jd)J_{\mathcal{N}}(j\rightarrow d) by

Since JN(jd)J_{\mathcal{N}}(j\rightarrow d) is the product of djd-j layer-to-layer Jacobians, its inner product with Act(d)L\nabla_{\operatorname{Act}^{(d)}}\mathcal{L} is usually the term considered responsible for the EVGP. The worst case distortion it can achieve on the vector Act(d)L\nabla_{\operatorname{Act}^{(d)}}\mathcal{L} is captured precisely by its condition number, the ratio of its largest and smallest singular values.

However, unlike the case of recurrent networks in which JN(jd)J_{\mathcal{N}}(j\rightarrow d) is (dj)(d-j)-fold product of a fixed matrix, when the hidden layer widths grow with the depth dd, the dimensions of the layer jj to layer jj^{\prime} Jacobians JN(jj)J_{\mathcal{N}}(j\rightarrow j^{\prime}) are not fixed and it is not clear to what extent the vector Act(d)L\nabla_{\operatorname{Act}^{(d)}}\mathcal{L} will actually be stretched or compressed by the worst case bounds coming from estimates on the condition number of JN(jd).J_{\mathcal{N}}(j\rightarrow d).

Moreover, on a practical level, the EVGP is about the numerical stability of the increments of the SGD updates (1) over all weights (and biases) in the network, which is directly captured by the joint distribution of the random variables

Due to the relation (6), two terms influence the moments of L/Wα,β(j)2|\partial\mathcal{L}/\partial W_{\alpha,\beta}^{(j)}|^{2}: one coming from the activations at layer j1j-1 and the other from the entries of JN(jd).J_{\mathcal{N}}(j\rightarrow d). We focus in this article on the second term and hence interpret the fluctuations of the entries of JN(jd)J_{\mathcal{N}}(j\rightarrow d) as a measure of the EVGP.

To conclude, we recall a simple relationship between the moments of the entries of the input-output Jacobian JNJ_{\mathcal{N}} and the distribution of its singular values, which can be used to directly compare spectral and entrywise definitions of the EVGP. Suppose for instance one is interested in the average singular value of JNJ_{\mathcal{N}} (as in [PSG17, PSG18]). The sum of the singular values of JNJ_{\mathcal{N}} is given by

where {uj}\{u_{j}\} is any orthonormal basis. Hence, the average singular value can be obtained directly from the joint even moments of the entries of JNJ_{\mathcal{N}}. Both the quenched and annealed EVGP (see (7),(9)) entail that the average singular value for JNJ_{\mathcal{N}} equals 1,1, and we prove in Theorem 1 (specifically (11)) that even at finite depth and width the average singular value for JNJ_{\mathcal{N}} equals 11 for all the random ReLU\operatorname{ReLU} nets we consider!

One can push this line of reasoning further. Namely, the singular values of any matrix MM are determined by the Stieltjes transform of the empirical distribution σM\sigma_{M} of the eigenvalues of MTM:M^{T}M:

Writing (zx)1(z-x)^{-1} as a power series in zz shows that SJNS_{J_{\mathcal{N}}} is determined by traces of powers of JNTJNJ_{\mathcal{N}}^{T}J_{\mathcal{N}} and hence by the joint even moments of the entries of JNJ_{\mathcal{N}}. We hope to estimate SJN(z)S_{J_{\mathcal{N}}}(z) directly in future work.

2 Annealed Exploding and Vanishing Gradients

Fix a sequence of positive integers n0,n1,.n_{0},n_{1},\ldots. For each d1d\geq 1 write Nd\mathcal{N}_{d} for the depth dd ReLU\operatorname{ReLU} net with hidden layer widths n0,,ndn_{0},\ldots,n_{d} and random weights and biases (see Definition 1 below). As in (3), write Zp,q(d)Z_{p,q}(d) for the partial derivative of the qthq^{th} component of the output of Nd\mathcal{N}_{d} with respect to pthp^{th} component of its input. We say that the family of architectures given by {n0,n1,}\{n_{0},n_{1},\ldots\} avoids the exploding and vanishing gradient problem in the annealed sense if for each fixed input to Nd\mathcal{N}_{d} and every p,qp,q we have

Here the expectation is over the weights and biases in Nd\mathcal{N}_{d}. Architectures that avoid the EVGP in the annealed sense are ones where the typical magnitude of the partial derivatives Zp,q(d)Z_{p,q}(d) have bounded (both above and below) fluctuations around a constant mean value. This allows for a reliable a priori selection of the learning rate λ\lambda from (1) even for deep architectures. Our main result about the annealed EVGP is Theorem 1: a family of neural net architectures avoids the EVGP in the annealed sense if and only if

3 Quenched Exploding and Vanishing Gradients

There is an important objection to defining the EVGP as in the previous section. Namely, if a neural net N\mathcal{N} suffers from the annealed EVGP, then it is impossible to choose an appropriate a priori learning rate λ\lambda that works for a typical initialization. However, it may still be that for a typical realization of the weights and biases there is some choice of λ\lambda (depending on the particular initialization), that works well for all (or most) trainable parameters in N.\mathcal{N}. To study whether this is the case, we must consider the variation of the Zp,qZ_{p,q}’s across different p,qp,q in a fixed realization of weights and biases. This is the essence of the quenched EVGP.

To formulate the precise definition, we again fix a sequence of positive integers n0,n1,n_{0},n_{1},\ldots and write Nd\mathcal{N}_{d} for a depth dd ReLU\operatorname{ReLU} net with hidden layer widths n0,,nd.n_{0},\ldots,n_{d}. We write as in (4)

for the empirical variance of the squares all the entries Zp,q(d)Z_{p,q}(d) of the input-output Jacobian of Nd.\mathcal{N}_{d}. We will say that the family of architectures given by {n0,n1,}\{n_{0},n_{1},\ldots\} avoids the exploding and vanishing gradient problem in the quenched sense if

Our main result about the quenched sense of the EVGP is Theorem 2. It turns out, at least for the ReLU\operatorname{ReLU} nets we study, that a family of neural net architectures avoids the quenched EVGP if and only if it also avoids the annealed exploding and vanishing gradient problem (i.e. if (8) holds).

Acknowledgements

I thank Leonid Hanin for a number of useful conversations and for his comments on an early draft. I am also grateful to Jeffrey Pennington for pointing out an important typo in the proof of Theorem 3 and to David Rolnick for several helpful conversations and, specifically, for pointing out the relevance of the power-mean inequality for understanding β\beta. Finally, I would like to thank several anonymous referees for their help in improving the exposition. One referee in particular raised concerns about the annealed and quenched definitions of the EVGP. Addressing these concerns resulted in the discussion in §3.1.

Notation and Main Results

The function fNf_{\mathcal{N}} computed by NN(n,d)\mathcal{N}\in\mathfrak{N}({\bf n},d) is determined by a collection of weights and biases

to N,\mathcal{N}, we define for every j=1,,dj=1,\ldots,d

The vectors act(j),Act(j)\operatorname{act}^{(j)},\,\operatorname{Act}^{(j)} therefore represent the vectors of inputs and outputs of the neurons in the jthj^{th} layer of N.\mathcal{N}. The function computed by N\mathcal{N} takes the form

A random network is obtained by randomizing weights and biases.

μ(j),ν(j)\mu^{(j)},\nu^{(j)} are symmetric around for every 1jd.1\leq j\leq d.

the variance of μ(j)\mu^{(j)} is 2/(nj1)2/(n_{j-1}).

A random net NNμ,ν(n,d)\mathcal{N}\in\mathfrak{N}_{{\bf\mu},{\bf\nu}}\left({\bf n},d\right) is obtained by requiring that the weights and biases for neurons at layer jj are drawn independently from μ(j),ν(j):\mu^{(j)},\nu^{(j)}:

Condition (iii) is used when we apply Lemma 1 in the proof of Theorem 3. It can be removed under the restriction that dexp(j=1dnj).d\ll\exp\left(\sum_{j=1}^{d}n_{j}\right). Since this yields slightly messier but not meaningfully different results, we do not pursue this point.

2 Results

We also continue to write Zp,qZ_{p,q} for the entries of the input-output Jacobian of a neural net (see (3)).

In contrast, the fourth moment of Zp,q(x)Z_{p,q}(x) is exponential in j1nj:\sum_{j}\frac{1}{n_{j}}:

Moreover, there exists a constant CK,μ>0C_{K,\mu}>0 depending only on KK and the first 2K2K moments of the measures {μ(j)}j=1d\{\mu^{(j)}\}_{j=1}^{d} such that if K<minj=1d1{nj}K<\min_{j=1}^{d-1}\{n_{j}\}, then

Hence, the family Nd\mathcal{N}_{d} of ReLU\operatorname{ReLU} nets avoids the exploding and vanishing gradient problem in the quenched sense if and only if

so that γ(j)\gamma(j) represents a neuron in the jthj^{th} layer of N.\mathcal{N}. Similarly, given any collection of K1K\geq 1 paths Γ=(γk)k=1K\Gamma=\left(\gamma_{k}\right)_{k=1}^{K} that connect (possibly different) neurons in the input of N\mathcal{N} with neurons in its output and any 1jd1\leq j\leq d, denote by

the neurons in the jthj^{th} layer of N\mathcal{N} that belong to at least one element of Γ.\Gamma. Finally, for every αΓ(j1)\alpha\in\Gamma(j-1) and βΓ(j)\beta\in\Gamma(j), denote by

the number of paths in Γ\Gamma that pass through neuron α\alpha at layer j1j-1 and through neuron β\beta at layer j.j.

where the sum is over ordered tuples Γ=(γ1,,γ2K)\Gamma=\left(\gamma_{1},\ldots,\gamma_{2K}\right) of paths in N\mathcal{N} from pp to qq and

where for every r0,r\geq 0, the quantity μr(j)\mu_{r}^{(j)} denotes the rthr^{th} moment of the measure μ(j).\mu^{(j)}.

The expression (16) can be generalized to case of mixed even moments. Namely, given m1m\geq 1 and for each 1mM1\leq m\leq M integers Km0K_{m}\geq 0 and 1pmn01qmnd1\leq p_{m}\leq n_{0}\,\,1\leq q_{m}\leq n_{d}, we have

See Appendix A for the proof of Theorem 3.

References

Appendix A Proof of Theorem 3

Using that X=dX,wj=dwjX\stackrel{{\scriptstyle d}}{{=}}-X,w_{j}\stackrel{{\scriptstyle d}}{{=}}-w_{j} and that jkj\sum_{j}k_{j} is even, we have

Averaging these two expressions we combine (18) with the fact that XX is independent of {wj}j=1n\{w_{j}\}_{j=1}^{n} and its law has no atoms to obtain the desired result. s∎

We now turn to the proof of Theorem 3. To this end, fix d1,d\geq 1, a collection of positive integers n=(ni)i=0d{\bf n}=\left(n_{i}\right)_{i=0}^{d}, and let NNμ,ν(n,d)\mathcal{N}\in\mathfrak{N}_{{\bf\mu},{\bf\nu}}\left({\bf n},d\right). Let us briefly recall the notation for paths from §5.2. Given 1pn01\leq p\leq n_{0} and 1qnd,1\leq q\leq n_{d}, we defined a path γ\gamma from p to q to be a collection {γ(j)}j=0d\{\gamma(j)\}_{j=0}^{d} of neurons so that γ(0)=p,γ(d)=q,\gamma(0)=p,\,\gamma(d)=q, and γ(j){1,,nj}.\gamma(j)\in\{1,\ldots,n_{j}\}. The numbers γ(j)\gamma(j) should be thought of as neurons in the jthj^{th} hidden layer of N\mathcal{N}. Given such a collection, we obtain for each jj a weight

between each two consecutive neurons along the path γ.\gamma. Our starting point is the expression

where act(j)\operatorname{act}^{(j)} are defined as in (10). This expression is well-known and follows immediately form the chain rule (c.f. e.g. equation (1) in [CHM+15]). We therefore have

We will prove a slightly more general statement than in the formulation of Theorem 3. Namely, suppose Γ=(γ1,,γ2K)\Gamma=\left(\gamma_{1},\ldots,\gamma_{2K}\right) is any collection of paths from the input of N\mathcal{N} to the output (the paths are not required to have the same starting and ending neurons) such that for every βΓ(d)\beta\in\Gamma(d),

To evaluate the expectation in (21), note that the computation done by N\mathcal{N} is a Markov chain with respect to the layers (i.e. given Act(j1)\operatorname{Act}^{(j-1)}, the activations at layers j,,dj,\ldots,d are independent of the weight and biases up to and including layer j1.j-1.) Hence, denoting by Fd1\mathcal{F}_{\leq d-1} the sigma algebra generated by the weight and biases up to and including layer d1d-1, the tower property for expectation and the Markov property yield

Next, observe that for each 1jd,1\leq j\leq d, conditioned on Act(j1)\operatorname{Act}^{(j-1)}, the families of random variables {wα,β(j),actβ(j)}α=1nj1\{w_{\alpha,\beta}^{(j)},\,\operatorname{act}_{\beta}^{(j)}\}_{\alpha=1}^{n_{j-1}} are independent for different β.\beta. For j=dj=d this implies

Let us make several observations about act^Γ,β(d)\widehat{\operatorname{act}}_{\Gamma,\beta}^{(d)} and actΓ,β(d)\operatorname{act}_{\Gamma,\beta}^{(d)} when conditioned on Act(d1)\operatorname{Act}^{(d-1)}. First, the conditioned random variable act^Γ,β(d1)\widehat{\operatorname{act}}_{\Gamma,\beta}^{(d-1)} is independent of the conditioned random variable actΓ,β(d1)\operatorname{act}_{\Gamma,\beta}^{(d-1)}. Second, the distribution of act^Γ,β(d)\widehat{\operatorname{act}}_{\Gamma,\beta}^{(d)} conditioned on Act(d1)\operatorname{Act}^{(d-1)} is symmetric around 0.0. Third, since we assumed that the bias distributions ν(j)\nu^{(j)} for N\mathcal{N} have no atoms, the conditional distribution of act^Γ,β(d)\widehat{\operatorname{act}}_{\Gamma,\beta}^{(d)} also has no atoms. Fourth, actΓ,β(d1)\operatorname{act}_{\Gamma,\beta}^{(d-1)} is a linear combination of the weights {wα,β(d)}αΓ(j1)\{w_{\alpha,\beta}^{(d)}\}_{\alpha\in\Gamma(j-1)} with given coefficients {Actα(d1)}αΓ(j1).\{\operatorname{Act}_{\alpha}^{(d-1)}\}_{\alpha\in\Gamma(j-1)}. Since the weight distributions μ(j)\mu^{(j)} for N\mathcal{N} are symmetric around 0,0, the above five observations, together with (24) allow us to apply Lemma 1 and to conclude that

To complete the argument, we must consider two cases. First, recall that by assumption, for every βΓ(d),\beta\in\Gamma(d), the number of γΓ\gamma\in\Gamma for which γ(d)=β\gamma(d)=\beta is even. If for every jdj\leq d and each αΓ(j1)\alpha\in\Gamma(j-1) the number of γΓ\gamma\in\Gamma passing through α\alpha is even, then we may repeat the preceding argument to directly obtain (21). Otherwise, we apply this argument until we reach αΓ(j1),βΓ(j)\alpha\in\Gamma(j-1),\,\beta\in\Gamma(j) so that the number Γα,β(j)\left|\Gamma_{\alpha,\beta}(j)\right| of paths in Γ\Gamma that pass through α\alpha and β\beta is odd. In this case, the right hand side of (22) vanishes since the measure μ(d)\mu^{(d)} is symmetric around and thus has vanishing odd moments. Relation (21) therefore again holds since in this case both sides are 0.0. This completes the proof of Theorem 3. \square

Appendix B Proof of Theorem 1

In this section, we use Theorem 3 to prove Theorem 1. Let us first check (11). According to Theorem 3, we have

Note that since μ\mu is symmetric around 0,0, we have that μ1=0.\mu_{1}=0. Thus, the terms where γ1γ2\gamma_{1}\neq\gamma_{2} vanish. Using μ2(j)=2nj1\mu_{2}^{(j)}=\frac{2}{n_{j-1}}, we find

as claimed. We now turn to proving (12). Using Theorem 3, we have

For each Γ=(γk)k=12\Gamma=\left(\gamma_{k}\right)_{k=1}^{2} with γk\gamma_{k} paths from p to q, we have

We begin by checking the first equality in (28) by induction on d.d. Fix Γ=(γ1,γ2).\Gamma=\left(\gamma_{1},\gamma_{2}\right). When d=1,d=1, we have Γ(0)=Γ(1)=1\left|\Gamma(0)\right|=\left|\Gamma(1)\right|=1. Hence γ1=γ2\gamma_{1}=\gamma_{2} and A(Γ)=1A(\Gamma)=1 since both the numerator and denominator on the right hand side of (27) equal 11. The right hand side of (28) is also 11 since Γ(j)=1\left|\Gamma(j)\right|=1 for every j.j. This completes the base case. Suppose now that D2,D\geq 2, and we have proved (28) for all dD1.d\leq D-1. Let

If j=1j_{*}=1, then we are done by the inductive hypothesis. Otherwise, there are two choices of Γˉ={γˉk}k=12\bar{\Gamma}=\{\bar{\gamma}_{k}\}_{k=1}^{2} for which

These choices correspond to the two permutations of {γk}k=12.\{\gamma_{k}\}_{k=1}^{2}. Similarly, there are 66 choices of Γˉ={γˉk}k=14\bar{\Gamma}=\{\bar{\gamma}_{k}\}_{k=1}^{4} for which

The six choices correspond to selecting one of two choices for γ1(1)\gamma_{1}(1) and three choices of an index k=2,3,4k=2,3,4 so that γk(j)\gamma_{k}(j) coincides with γ1(j)\gamma_{1}(j) for each jj.j\leq j_{*}. If j=dj_{*}=d, we are done. Otherwise, we apply the inductive hypothesis to paths from Γ(j)\Gamma(j_{*}) to Γ(d)\Gamma(d) to complete the proof of the first equality in (28). The second equality in (28) follows from the observation that since Γ(0)=Γ(d)=1,\left|\Gamma(0)\right|=\left|\Gamma(d)\right|=1, the number of j{1,,d}j\in\{1,\ldots,d\} for which Γ(j1)=1,Γ(j)=2\left|\Gamma(j-1)\right|=1,\,\left|\Gamma(j)\right|=2 must equal the number of jj for which Γ(j1)=2,Γ(j)=1\left|\Gamma(j-1)\right|=2,\,\left|\Gamma(j)\right|=1. ∎

Observe that since μ2(j)=2/nj1\mu_{2}^{(j)}=2/n_{j-1}, we have

where the expectation on the right hand side is over the uniform measure on paths (γ1,γ2)(\gamma_{1},\gamma_{2}) from the input of N\mathcal{N} to the output conditioned on γ1(0)=γ2(0)=p\gamma_{1}(0)=\gamma_{2}(0)=p and γ1(d)=γ2(d)=q,\gamma_{1}(d)=\gamma_{2}(d)=q, and

The value of XdX_{d} corresponding to every such path is at least 2k+12^{k+1} since μ~4(j)1\widetilde{\mu}_{4}^{(j)}\geq 1 for every jj and is at most 6μ~4,max6\widetilde{\mu}_{4,max} for the same reason. Therefore, using that for all ε,\varepsilon\in, we have

This completes the proof of the lower bound. The upper bound is obtained in the same way:

As with the second and fourth moment computations, we note that μΓα,β(j)\mu_{\left|\Gamma_{\alpha,\beta}(j)\right|} vanishes unless each Γα,β(j)\left|\Gamma_{\alpha,\beta}(j)\right| is even. Hence, as with (29), we may write

where AK(Γ)A_{K}(\Gamma) is the analog of A(Γ)A(\Gamma) from (27). The same argument as in Lemma 1 shows that

which is precisely the weight in (31) assigned to collections Γ\Gamma with Γ(j)=K\left|\Gamma(j)\right|=K for every 1jd1,1\leq j\leq d-1, yields

where the expectation is over uniformly chosen collections Γ=(γ1,,γK)\Gamma=\left(\gamma_{1},\ldots,\gamma_{K}\right) of paths from the input to the output of N\mathcal{N} and

The value of XdX_{d} on each such collection is at most (CK)m\left(C_{K}\right)^{m}, where CK=2K1(2K)!K!C_{K}=2^{K-1}\frac{(2K)!}{K!} is a large but fixed constant. Hence, just as in (30),

This completes the proof of Theorem 1. \square

Appendix C Proof of Theorem 2

Fixing p,qp,q and using that the second sum in the previous line has M(M1)M(M-1) terms, we have

To estimate the difference in this sum, we use Theorem 3 to obtain

Note that since the measures μ(j)\mu^{(j)} of the weights are symmetric around zero, their odd moments vanish and hence the only non-zero terms in (34) and (35) are those for which

Further, observe that each path γ\gamma from some fixed input neuron to some fixed output vertex is determined uniquely by the sequence of hidden neurons γ(j){1,,nj}\gamma(j)\in\{1,\ldots,n_{j}\} through which it passes for j=1,,d1j=1,\ldots,d-1. Therefore, we may identify each collection of paths Γ=(γk)k=14\Gamma=\left(\gamma_{k}\right)_{k=1}^{4} in the sum (34) with a unique collection of paths Γˉ=(γˉk)k=14\bar{\Gamma}=\left(\bar{\gamma}_{k}\right)_{k=1}^{4} in (35) by asking that γk(j)=γˉk(j)\gamma_{k}(j)=\bar{\gamma}_{k}(j) for each kk and all 1jd11\leq j\leq d-1. Observe further that under this bijection,

For j=1,dj=1,d, the terms Cj(Γ)C_{j}(\Gamma) and Cj(Γˉ)C_{j}(\bar{\Gamma}) are related as follows:

We consider two cases: (i) qm1qm2q_{m_{1}}\neq q_{m_{2}} (i.e. Γˉ(d)=2\left|\bar{\Gamma}(d)\right|=2) and (ii) qm1=qm2q_{m_{1}}=q_{m_{2}} (i.e. Γˉ(d)=1\left|\bar{\Gamma}(d)\right|=1 and pm1pm2p_{m_{1}}\neq p_{m_{2}}). In case (i), we have

Hence, using (37) and (38), we find that in case (i)

where the last estimate is proved by the same argument as the relation (12) in Theorem 1. To obtain the analogous lower bound for case (ii), we write q=qm1=qm2,pm1pm2.q=q_{m_{1}}=q_{m_{2}},\,p_{m_{1}}\neq p_{m_{2}}. In this case, combining (36) with (38), we have

Moreover, continuing to use the bijection between Γ\Gamma and Γˉ\bar{\Gamma} above, (37) yields in this case

Using that if Γ(0)=Γ(1)=1,\left|\Gamma(0)\right|=\left|\Gamma(1)\right|=1, then

Writing p^\widehat{p} for any neuron in the first hidden layer of N\mathcal{N}, we rewrite the sum in the previous line as

where the point is now that we are considering paths only from p^\widehat{p} to qq. According to (12) from Theorem 1, we have

Combining this with (33), (39) and setting

proving (15). Finally, the upper bound in (14) follows from dropping the negative term in (33) and applying the upper bound from (12). \square