Deep Information Propagation

Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, Jascha Sohl-Dickstein

Introduction

Deep neural network architectures have become ubiquitous in machine learning. The success of deep networks is due to the fact that they are highly expressive (Montufar et al., 2014) while simultaneously being relatively easy to optimize (Choromanska et al., 2015; Goodfellow et al., 2014) with strong generalization properties (Recht et al., 2015). Consequently, developments in machine learning often accompany improvements in our ability to train increasingly deep networks. Despite this, designing novel network architectures is frequently equal parts art and science. This is, in part, because a general theory for neural networks that might inform design decisions has lagged behind the feverish pace of design.

A pair of recent papers (Poole et al., 2016; Raghu et al., 2016) demonstrated that random neural networks are exponentially expressive in their depth. Central to their approach was the consideration of networks after random initialization, whose weights and biases were i.i.d. Gaussian distributed. In particular the paper by Poole et al. (2016) developed a “mean field” formalism for treating wide, untrained, neural networks. They showed that these mean field networks exhibit an order-to-chaos transition as a function of the weight and bias variances. Notably the mean field formalism is not closely tied to a specific choice of activation function or loss.

In this paper, we demonstrate the existence of several characteristic “depth” scales that emerge naturally and control signal propagation in these random networks. We then show that one of these depth scales, $\xi_{c}$, diverges at the boundary between order and chaos. This result is insensitive to many architectural decisions (such as choice of activation function) and will generically be true at any order-to-chaos transition. We then extend these results to include dropout and show that even small amounts of dropout destroy the order-to-chaos critical point and consequently remove the divergence in $\xi_{c}$. Together these results bound the depth to which signal may propagate through random neural networks.

We then develop a corresponding mean field model for gradients and we show that a duality exists between the forward propagation of signals and the backpropagation of gradients. The ordered and chaotic phases that Poole et al. (2016) identified correspond to regions of vanishing and exploding gradients, respectively. We demonstrate the validity of this mean field theory by computing gradients of random networks on MNIST. This provides a formal explanation of the ‘vanishing gradients’ phenomenon that has long been observed in neural networks (Bengio et al., 1993). We continue to show that the covariance between two gradients is controlled by the same depth scale that limits correlated signal propagation in the forward direction.

Finally, we hypothesize that a necessary condition for a random neural network to be trainable is that information should be able to pass through it. Thus, the depth-scales identified here bound the set of hyperparameters that will lead to successful training. To test this ansatz we train ensembles of deep, fully connected, feed-forward neural networks of varying depth on MNIST and CIFAR10, with and without dropout. Our results confirm that neural networks are trainable precisely when their depth is not much larger than $\xi_{c}$. This result is dataset independent and is, therefore, a universal function of network architecture.

A corollary of these results is that asymptotically deep neural networks should be trainable provided they are initialized sufficiently close to the order-to-chaos transition. The notion of “edge of chaos” initialization has been explored previously. Such investigations have been both direct as in Bertschinger et al. (2005); Glorot & Bengio (2010) or indirect, through initialization schemes that favor deep signal propagation such as batch normalization (Ioffe & Szegedy, 2015), orthogonal matrix initialization (Saxe et al., 2014), random walk initialization (Sussillo & Abbott, 2014), composition kernels (Daniely et al., 2016), or residual network architectures (He et al., 2015). The novelty of the work presented here is two-fold. First, our framework predicts the depth at which networks may be trained even far from the order-to-chaos transition. While a skeptic might ask when it would be profitable to initialize a network far from criticality, we respond by noting that there are architectures (such as neural networks with dropout) where no critical point exists and so this more general framework is needed. Second, our work provides a formal, as opposed to intuitive, explanation for why very deep networks can only be trained near the edge of chaos.

Background

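Following Poole et al. (2016), consider a fully-connected, untrained, feed-forward neural network of depth $L$ with layer width $N_{l}$ and nonlinearity $\phi$. Untrained means that the weights and biases are i.i.d. Gaussian, $W_{ij}^{l}\sim\mathcal{N}(0,\sigma_{w}^{2}/N_{l})$ and $b_{i}^{l}\sim\mathcal{N}(0,\sigma_{b}^{2})$. Writing $z_{i}^{l}$ for the pre-activations and $y_{i}^{l}$ for the activations, signal propagates according to,

$$z_{i}^{l}=\sum_{j}W_{ij}^{l}y_{j}^{l}+b_{i}^{l},\qquad y_{i}^{l+1}=\phi(z_{i}^{l}). \tag{1}$$

Since the weights and biases are randomly distributed, these equations define a probability distribution on the activations and pre-activations over an ensemble of untrained neural networks. The “mean-field” approximation is then to replace $z_{i}^{l}$ by a Gaussian whose first two moments match those of $z_{i}^{l}$. For the remainder of the paper we will take the mean field approximation as given.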

Consider first the evolution of a single input, $x_{i;a}$, as it evolves through the network (as quantified by $y_{i;a}^{l}$ and $z_{i;a}^{l}$). Since the weights and biases are independent with zero mean, the first two moments of the pre-activations in the same layer will be,

$$\mathbb{E}[z_{i;a}^{l}]=0,\qquad \mathbb{E}[z_{i;a}^{l}z_{j;a}^{l}]=q_{aa}^{l}\delta_{ij}, \tag{2}$$

where $\delta_{ij}$ is the Kronecker delta. Here $q_{aa}^{l}$ is the variance of the pre-activations in the $l$th layer due to an input $x_{i;a}$ and it is described by the recursion relation,

$$q_{aa}^{l}=\sigma_{w}^{2}\int\mathcal{D}z\,\phi^{2}\left(\sqrt{q_{aa}^{l-1}}\,z\right)+\sigma_{b}^{2}, \tag{3}$$

where $\int\mathcal{D}z=\frac{1}{\sqrt{2\pi}}\int dz\,e^{-\frac{1}{2}z^{2}}$ is the measure for a standard Gaussian distribution. Together these equations completely describe the evolution of a single input through a mean field neural network. For any choice of $\sigma_{w}^{2}$ and $\sigma_{b}^{2}$ with bounded $\phi$, eq. 3 has a fixed point at $q^{*}=\lim_{l\to\infty}q_{aa}^{l}$.
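As an illustration, eq. 3 can be iterated numerically until it settles at $q^{*}$. The following is a minimal sketch rather than the authors' code: it assumes $\phi=\tanh$, approximates the Gaussian measure with a plain trapezoid rule on $[-8,8]$, and the helper names (`gauss_mean`, `variance_map`, `variance_fixed_point`) are hypothetical.

```python
import math

# Trapezoid rule for the standard Gaussian measure Dz on [-8, 8].
# Grid spacing and cutoff are assumptions chosen for illustration.
STEP = 0.1
NODES = [-8.0 + STEP * k for k in range(161)]
WEIGHTS = [STEP * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) for z in NODES]


def gauss_mean(f):
    """Approximate the integral of f(z) against the standard Gaussian measure."""
    return sum(w * f(z) for z, w in zip(NODES, WEIGHTS))


def variance_map(q, sigma_w2, sigma_b2):
    """One step of eq. 3 with phi = tanh: maps q^{l-1} to q^l."""
    root = math.sqrt(q)
    return sigma_w2 * gauss_mean(lambda z: math.tanh(root * z) ** 2) + sigma_b2


def variance_fixed_point(sigma_w2, sigma_b2, q0=0.8, tol=1e-12, max_iter=10000):
    """Iterate eq. 3 from q0 until successive values differ by less than tol."""
    q = q0
    for _ in range(max_iter):
        q_next = variance_map(q, sigma_w2, sigma_b2)
        if abs(q_next - q) < tol:
            return q_next
        q = q_next
    return q


q_star = variance_fixed_point(sigma_w2=1.5, sigma_b2=0.05)
print(f"q* = {q_star:.6f}")
```

The starting value `q0=0.8` mirrors the initial condition used later in the paper; any positive starting value should lead to the same fixed point for bounded $\phi$.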

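The propagation of a pair of inputs, $x_{i;a}$ and $x_{i;b}$, is described in the same way by the covariance between their pre-activations, $q_{ab}^{l}$, which obeys the recursion relation,

$$q_{ab}^{l}=\sigma_{w}^{2}\int\mathcal{D}z_{1}\mathcal{D}z_{2}\,\phi(u_{1})\phi(u_{2})+\sigma_{b}^{2}, \tag{4}$$

where $u_{1}=\sqrt{q_{aa}^{l-1}}z_{1}$ and $u_{2}=\sqrt{q_{bb}^{l-1}}\left(c_{ab}^{l-1}z_{1}+\sqrt{1-(c^{l-1}_{ab})^{2}}z_{2}\right)$, with $c_{ab}^{l}=q_{ab}^{l}/\sqrt{q_{aa}^{l}q_{bb}^{l}}$, are Gaussian approximations to the pre-activations in the preceding layer with the correct covariance matrix. Moreover $c^{l}_{ab}$ is the correlation between the two inputs after $l$ layers.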

Examining eq. 4 it is clear that $c^{*}=1$ is a fixed point of the recurrence relation. To determine whether or not $c^{*}=1$ is an attractive fixed point the quantity,

$$\chi_{1}=\frac{\partial c_{ab}^{l}}{\partial c_{ab}^{l-1}}\bigg|_{c=1}=\sigma_{w}^{2}\int\mathcal{D}z\left[\phi^{\prime}(\sqrt{q^{*}}z)\right]^{2}, \tag{5}$$

is introduced. Poole et al. (2016) note that the $c^{*}=1$ fixed point is stable if $\chi_{1}<1$ and is unstable otherwise. Thus, $\chi_{1}=1$ represents a critical line separating an ordered phase (in which $c^{*}=1$ and all inputs end up asymptotically correlated) and a chaotic phase (in which $c^{*}<1$ and all inputs end up asymptotically decorrelated). For the case of $\phi=\tanh$, the phase diagram in fig. 1 (a) is observed.
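The stability criterion can be checked numerically for a given pair $(\sigma_{w}^{2},\sigma_{b}^{2})$. The sketch below is an illustration under stated assumptions, not the authors' implementation: it takes $\phi=\tanh$ (so $\phi^{\prime}=1-\tanh^{2}$), uses a trapezoid rule for the Gaussian integrals, and evaluates $\chi_{1}=\sigma_{w}^{2}\int\mathcal{D}z[\phi^{\prime}(\sqrt{q^{*}}z)]^{2}$ at the numerically computed $q^{*}$. All function names are hypothetical.

```python
import math

# Trapezoid rule for the standard Gaussian measure (illustrative grid).
STEP = 0.1
NODES = [-8.0 + STEP * k for k in range(161)]
WEIGHTS = [STEP * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) for z in NODES]


def gauss_mean(f):
    """Approximate the integral of f(z) against the standard Gaussian measure."""
    return sum(w * f(z) for z, w in zip(NODES, WEIGHTS))


def fixed_point_variance(sigma_w2, sigma_b2, q0=0.8, n_iter=500):
    """Iterate the single-input variance recursion for phi = tanh."""
    q = q0
    for _ in range(n_iter):
        root = math.sqrt(q)
        q = sigma_w2 * gauss_mean(lambda z: math.tanh(root * z) ** 2) + sigma_b2
    return q


def chi_1(sigma_w2, sigma_b2):
    """chi_1 = sigma_w^2 * E[phi'(sqrt(q*) z)^2], with phi' = 1 - tanh^2."""
    root = math.sqrt(fixed_point_variance(sigma_w2, sigma_b2))
    return sigma_w2 * gauss_mean(lambda z: (1.0 - math.tanh(root * z) ** 2) ** 2)


def phase(sigma_w2, sigma_b2):
    """Classify a hyperparameter pair by the sign of chi_1 - 1."""
    return "ordered" if chi_1(sigma_w2, sigma_b2) < 1.0 else "chaotic"


for sigma_w2 in (0.5, 1.0, 3.0):
    print(sigma_w2, round(chi_1(sigma_w2, 0.05), 4), phase(sigma_w2, 0.05))
```

A fixed number of iterations is used for simplicity; very close to the critical line the variance recursion still converges quickly, but classification by a strict comparison with 1 becomes sensitive to quadrature error.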

Asymptotic Expansions and Depth Scales

Our first contribution is to demonstrate the existence of two depth-scales that arise naturally within the framework of mean field neural networks. To motivate the existence of these depth-scales, we iterate eq. 3 and eq. 4 until convergence for many values of $\sigma_{w}^{2}$ between 0.1 and 3.0 with $\sigma_{b}^{2}=0.05$, starting from $q_{aa}^{0}=q_{bb}^{0}=0.8$ and $c_{ab}^{0}=0.6$. We see, in fig. 1 (b) and (c), that both $q_{aa}^{l}$ and $c_{ab}^{l}$ approach their fixed points, $q^{*}$ and $c^{*}$, exponentially over many orders of magnitude. We therefore anticipate that asymptotically $|q^{l}_{aa}-q^{*}|\sim e^{-l/\xi_{q}}$ and $|c^{l}_{ab}-c^{*}|\sim e^{-l/\xi_{c}}$ for sufficiently large $l$. Here, $\xi_{q}$ and $\xi_{c}$ define depth-scales over which information may propagate about the magnitude of a single input and the correlation between two inputs respectively.

We will presently prove that $q_{aa}^{l}$ and $c_{ab}^{l}$ approach their fixed points exponentially. In both cases we will use the same fundamental strategy wherein we expand one of the recurrence relations (either eq. 3 or eq. 4) about its fixed point to get an approximate “asymptotic” recurrence relation. We find that this asymptotic recurrence relation in turn implies exponential decay towards the fixed point over a depth-scale, $\xi_{x}$.

We first analyze eq. 3 and identify a depth-scale at which information about a single input may propagate. Let $q^{l}_{aa}=q^{*}+\epsilon^{l}$. By construction so long as $\lim_{l\to\infty}q^{l}_{aa}=q^{*}$ exists it follows that $\epsilon^{l}\to 0$ as $l\to\infty$. Eq. 3 may be expanded to lowest order in $\epsilon^{l}$ to arrive at an asymptotic recurrence relation (see Appendix 7.1),

$$\epsilon^{l+1}=\epsilon^{l}\left[\chi_{1}+\sigma_{w}^{2}\int\mathcal{D}z\,\phi^{\prime\prime}(\sqrt{q^{*}}z)\phi(\sqrt{q^{*}}z)\right]+\mathcal{O}\left((\epsilon^{l})^{2}\right). \tag{6}$$

Notably, the term multiplying $\epsilon^{l}$ is a constant. It follows that for large $l$ the asymptotic recurrence relation has an exponential solution, $\epsilon^{l}\sim e^{-l/\xi_{q}}$, with $\xi_{q}$ given by

$$\xi_{q}^{-1}=-\log\left[\chi_{1}+\sigma_{w}^{2}\int\mathcal{D}z\,\phi^{\prime\prime}(\sqrt{q^{*}}z)\phi(\sqrt{q^{*}}z)\right]. \tag{7}$$

This establishes $\xi_{q}$ as a depth scale that controls how deep information from a single input may penetrate into a random neural network.

Next, we consider eq. 4. Using a similar argument (detailed in Appendix 7.2) we can expand about $c^{l}_{ab}=c^{*}+\epsilon^{l}$ to find an asymptotic recurrence relation,

$$\epsilon^{l+1}=\epsilon^{l}\left[\sigma_{w}^{2}\int\mathcal{D}z_{1}\mathcal{D}z_{2}\,\phi^{\prime}(u_{1}^{*})\phi^{\prime}(u_{2}^{*})\right]+\mathcal{O}\left((\epsilon^{l})^{2}\right). \tag{8}$$

Here $u_{1}^{*}=\sqrt{q^{*}}z_{1}$ and $u_{2}^{*}=\sqrt{q^{*}}(c^{*}z_{1}+\sqrt{1-(c^{*})^{2}}z_{2})$. Thus, once again, we expect that for large $l$ this recurrence will have an exponential solution, $\epsilon^{l}\sim e^{-l/\xi_{c}}$, with $\xi_{c}$ given by

$$\xi_{c}^{-1}=-\log\left[\sigma_{w}^{2}\int\mathcal{D}z_{1}\mathcal{D}z_{2}\,\phi^{\prime}(u_{1}^{*})\phi^{\prime}(u_{2}^{*})\right]. \tag{9}$$

In the ordered phase $c^{*}=1$ and so $\xi_{c}^{-1}=-\log\chi_{1}$. Since the transition between order and chaos occurs when $\chi_{1}=1$ it follows that $\xi_{c}$ diverges at any order-to-chaos transition so long as $q^{*}$ and $c^{*}$ exist.
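To make the two depth scales concrete, the sketch below evaluates them in the ordered phase, where $\xi_{c}^{-1}=-\log\chi_{1}$ and $\xi_{q}^{-1}=-\log[\chi_{1}+\sigma_{w}^{2}\int\mathcal{D}z\,\phi^{\prime\prime}\phi]$. It is only an illustration: $\phi=\tanh$ is assumed, the Gaussian integrals use a trapezoid rule, and `depth_scales_ordered` is a hypothetical helper that refuses to run when $\chi_{1}\geq 1$, since the simplification $c^{*}=1$ no longer holds there.

```python
import math

# Trapezoid rule for the standard Gaussian measure (illustrative grid).
STEP = 0.1
NODES = [-8.0 + STEP * k for k in range(161)]
WEIGHTS = [STEP * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) for z in NODES]


def gauss_mean(f):
    """Approximate the integral of f(z) against the standard Gaussian measure."""
    return sum(w * f(z) for z, w in zip(NODES, WEIGHTS))


def fixed_point_variance(sigma_w2, sigma_b2, q0=0.8, n_iter=500):
    """Iterate the single-input variance recursion for phi = tanh."""
    q = q0
    for _ in range(n_iter):
        root = math.sqrt(q)
        q = sigma_w2 * gauss_mean(lambda z: math.tanh(root * z) ** 2) + sigma_b2
    return q


def depth_scales_ordered(sigma_w2, sigma_b2):
    """Return (xi_q, xi_c) for phi = tanh, assuming c* = 1 (ordered phase)."""
    root = math.sqrt(fixed_point_variance(sigma_w2, sigma_b2))

    def dphi_sq(z):
        return (1.0 - math.tanh(root * z) ** 2) ** 2

    def d2phi_phi(z):
        t = math.tanh(root * z)
        # phi''(x) * phi(x) = -2 tanh(x)^2 (1 - tanh(x)^2) for phi = tanh.
        return -2.0 * t * t * (1.0 - t * t)

    chi = sigma_w2 * gauss_mean(dphi_sq)
    second = sigma_w2 * gauss_mean(d2phi_phi)
    if chi >= 1.0:
        raise ValueError("chi_1 >= 1: not in the ordered phase, so c* != 1")
    xi_q = -1.0 / math.log(chi + second)
    xi_c = -1.0 / math.log(chi)
    return xi_q, xi_c


for sigma_w2 in (1.0, 1.5):
    xi_q, xi_c = depth_scales_ordered(sigma_w2, 0.05)
    print(f"sigma_w^2={sigma_w2}: xi_q={xi_q:.2f}, xi_c={xi_c:.2f}")
```

Because the $\phi^{\prime\prime}\phi$ term is never positive for $\tanh$, this sketch always returns $\xi_{q}<\xi_{c}$, and $\xi_{c}$ grows as $\sigma_{w}^{2}$ is moved towards the critical line.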

These results can be investigated intuitively by plotting $c^{l+1}_{ab}$ vs $c^{l}_{ab}$ in fig. 2 (a). In the ordered phase there is only a single fixed point, $c^{l}_{ab}=1$. In the chaotic regime we see that a second fixed point develops and the $c^{l}_{ab}=1$ point becomes unstable. We see that the linearization about the fixed points becomes significantly closer to the identity map near the order-to-chaos transition.
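The iterated map $c^{l}_{ab}\to c^{l+1}_{ab}$ can also be evaluated directly. The following sketch assumes $\phi=\tanh$, sets $q_{aa}^{l}=q_{bb}^{l}=q^{*}$ (the approximation used in the analysis), and evaluates the two-dimensional Gaussian integral of eq. 4 on a coarse trapezoid grid. The function names and grid are assumptions made for illustration only.

```python
import math

# Coarser trapezoid grid, since the correlation map needs a 2D integral.
STEP = 0.2
NODES = [-8.0 + STEP * k for k in range(81)]
WEIGHTS = [STEP * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) for z in NODES]


def fixed_point_variance(sigma_w2, sigma_b2, q0=0.8, n_iter=300):
    """Iterate the single-input variance recursion for phi = tanh."""
    q = q0
    for _ in range(n_iter):
        root = math.sqrt(q)
        q = sigma_w2 * sum(w * math.tanh(root * z) ** 2
                           for z, w in zip(NODES, WEIGHTS)) + sigma_b2
    return q


def correlation_map(c, q_star, sigma_w2, sigma_b2):
    """One step c^l -> c^{l+1} of eq. 4 with q_aa = q_bb = q* held fixed."""
    root = math.sqrt(q_star)
    s = math.sqrt(max(0.0, 1.0 - c * c))
    total = 0.0
    for z1, w1 in zip(NODES, WEIGHTS):
        inner = sum(w2 * math.tanh(root * (c * z1 + s * z2))
                    for z2, w2 in zip(NODES, WEIGHTS))
        total += w1 * math.tanh(root * z1) * inner
    q_ab = sigma_w2 * total + sigma_b2
    # Clamp to guard against quadrature error pushing c slightly above 1.
    return min(1.0, q_ab / q_star)


def iterate_correlation(sigma_w2, sigma_b2, c0=0.6, n_layers=50):
    """Propagate an initial correlation c0 through n_layers layers."""
    q_star = fixed_point_variance(sigma_w2, sigma_b2)
    c = c0
    for _ in range(n_layers):
        c = correlation_map(c, q_star, sigma_w2, sigma_b2)
    return c


c_ordered = iterate_correlation(1.0, 0.05)   # chi_1 < 1
c_chaotic = iterate_correlation(3.0, 0.05)   # chi_1 > 1
print(f"ordered: c -> {c_ordered:.6f}, chaotic: c -> {c_chaotic:.6f}")
```

Starting from the same $c_{ab}^{0}=0.6$, the ordered setting is driven towards $c=1$ while the chaotic setting settles at a fixed point below 1.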

To test these claims we measure $\xi_{q}$ and $\xi_{c}$ directly by iterating the recurrence relations for $q_{aa}^{l}$ and $c_{ab}^{l}$ as before with $q_{aa}^{0}=q_{bb}^{0}=0.8$ and $c_{ab}^{0}=0.6$. In this case we consider values of $\sigma_{w}^{2}$ between $0.1$ and $3.0$ and $\sigma_{b}^{2}$ between $0.01$ and $0.3$. For each hyperparameter setting we fit the resulting residuals, $|q_{aa}^{l}-q^{*}|$ and $|c_{ab}^{l}-c^{*}|$, to exponential functions and infer the depth-scale. We then compare this measured depth-scale to that predicted by the asymptotic expansion. The result of this measurement is shown in fig. 2. In general we see that the agreement is quite good. As expected we see that $\xi_{c}$ diverges at the critical point.
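A simplified version of this comparison for $\xi_{q}$ is sketched below. Instead of a full exponential fit, it estimates the decay rate from the ratio of two consecutive residuals at a moderate depth, which is a swapped-in shortcut and not the fitting procedure used for the figures. $\phi=\tanh$, the trapezoid quadrature, the choice `layer=10`, and the function names are all assumptions.

```python
import math

# Trapezoid rule for the standard Gaussian measure (illustrative grid).
STEP = 0.1
NODES = [-8.0 + STEP * k for k in range(161)]
WEIGHTS = [STEP * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) for z in NODES]


def gauss_mean(f):
    """Approximate the integral of f(z) against the standard Gaussian measure."""
    return sum(w * f(z) for z, w in zip(NODES, WEIGHTS))


def variance_map(q, sigma_w2, sigma_b2):
    """One step of the single-input variance recursion for phi = tanh."""
    root = math.sqrt(q)
    return sigma_w2 * gauss_mean(lambda z: math.tanh(root * z) ** 2) + sigma_b2


def depth_scale_q(sigma_w2, sigma_b2, q0=0.8, layer=10):
    """Return (measured, predicted) xi_q for phi = tanh."""
    # Long iteration to obtain q* to near machine precision.
    q_star = q0
    for _ in range(500):
        q_star = variance_map(q_star, sigma_w2, sigma_b2)

    # Residuals |q^l - q*| at two consecutive layers.
    q = q0
    for _ in range(layer):
        q = variance_map(q, sigma_w2, sigma_b2)
    eps_l = abs(q - q_star)
    eps_next = abs(variance_map(q, sigma_w2, sigma_b2) - q_star)
    measured = -1.0 / math.log(eps_next / eps_l)

    # Prediction from the asymptotic expansion about q*.
    root = math.sqrt(q_star)

    def bracket(z):
        t = math.tanh(root * z)
        sech2 = 1.0 - t * t
        # phi'^2 + phi'' * phi for phi = tanh
        return sech2 * sech2 - 2.0 * t * t * sech2

    predicted = -1.0 / math.log(sigma_w2 * gauss_mean(bracket))
    return measured, predicted


measured, predicted = depth_scale_q(1.5, 0.05)
print(f"measured xi_q = {measured:.3f}, predicted xi_q = {predicted:.3f}")
```

The two-point estimate is only reliable once the residual is small enough for the linearization to hold but still well above floating-point noise, which is why an intermediate layer is used.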

As observed in Poole et al. (2016) we see that the depth scale for the propagation of information in a single input, $\xi_{q}$, is consistently finite and significantly shorter than $\xi_{c}$. To understand why this is the case consider eq. 6 and note that for $\tanh$ nonlinearities the second term is always negative. Thus, even as $\chi_{1}$ approaches 1 we expect $\chi_{1}+\sigma_{w}^{2}\int\mathcal{D}z\,\phi^{\prime\prime}(\sqrt{q^{*}}z)\phi(\sqrt{q^{*}}z)$ to be substantially smaller than 1.

The mean field formalism can be extended to include dropout. The main contribution here will be to argue that even infinitesimal amounts of dropout destroy the mean field critical point, and therefore limit the trainable network depth. In the presence of dropout the propagation equation, eq. 1, becomes,

$$z_{i}^{l}=\frac{1}{\rho}\sum_{j}W_{ij}^{l}p_{j}y_{j}^{l}+b_{i}^{l}, \tag{10}$$

where $p_{j}\sim\text{Bernoulli}(\rho)$ and $\rho$ is the dropout rate. Note that with this convention $\rho$ is the probability that a unit is kept, so $\rho=1$ corresponds to no dropout. As is typically the case we have re-scaled the sum by $\rho^{-1}$ so that the mean of the pre-activation is invariant with respect to our choice of dropout rate.

Following a similar procedure to the original mean field calculation consider the fate of two inputs, $x^{0}_{i;a}$ and $x^{0}_{i;b}$, as they are propagated through such a random network. We take the dropout masks to be chosen independently for the two inputs mimicking the manner in which dropout is employed in practice. With dropout the diagonal term in the covariance matrix will be (see Appendix 7.3),

$$\bar{q}_{aa}^{l}=\frac{\sigma_{w}^{2}}{\rho}\int\mathcal{D}z\,\phi^{2}\left(\sqrt{\bar{q}_{aa}^{l-1}}\,z\right)+\sigma_{b}^{2}. \tag{11}$$

The variance of a single input with dropout will therefore propagate in an identical fashion to the vanilla case with a re-scaling $\sigma_{w}^{2}\to\sigma_{w}^{2}/\rho$. Intuitively, this result implies that, for the case of a single input, the presence of dropout simply increases the effective variance of the weights.

Computing the off-diagonal term of the covariance matrix similarly (see Appendix 7.4),

$$\bar{q}_{ab}^{l}=\sigma_{w}^{2}\int\mathcal{D}z_{1}\mathcal{D}z_{2}\,\phi(\bar{u}_{1})\phi(\bar{u}_{2})+\sigma_{b}^{2}, \tag{12}$$

with $\bar{u}_{1}$, $\bar{u}_{2}$, and $\bar{c}_{ab}^{l}$ defined by analogy to the mean field equations without dropout. Here, unlike in the case of a single input, the recurrence relation is identical to the recurrence relation without dropout. To see that $\bar{c}^{*}=1$ is no longer a fixed point of these dynamics consider what happens to eq. 12 when we input $\bar{c}^{l}=1$. For simplicity, we leverage the short range of $\xi_{q}$ to replace $\bar{q}_{aa}^{l}=\bar{q}_{bb}^{l}=\bar{q}^{*}$. We find (see Appendix 7.5),

$$\bar{c}_{ab}^{l+1}=1-\frac{1-\rho}{\rho\bar{q}^{*}}\sigma_{w}^{2}\int\mathcal{D}z\,\phi^{2}\left(\sqrt{\bar{q}^{*}}z\right). \tag{13}$$

The quantity subtracted from 1 is positive for any $\rho<1$. This implies that if $\bar{c}^{l}_{ab}=1$ for any $l$ then $\bar{c}^{l+1}_{ab}<1$. Thus, $c^{*}=1$ is not a fixed point of eq. 12 for any $\rho<1$. Since eq. 12 is identical in form to eq. 4 it follows that the depth scale for signal propagation with dropout will likewise be given by eq. 9 with the substitutions $q^{*}\to\bar{q}^{*}$ and $c^{*}\to\bar{c}^{*}$ computed using eq. 11 and eq. 12 respectively. Importantly, since there is no longer a sharp critical point with dropout we do not expect a diverging depth scale.
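The loss of the $c^{*}=1$ fixed point can be illustrated numerically. The sketch below assumes $\phi=\tanh$ and treats $\rho$ as the probability that a unit is kept. It computes $\bar{q}^{*}$ with the re-scaled weight variance $\sigma_{w}^{2}/\rho$, then applies one step of the covariance recurrence starting from $\bar{c}=1$. The closed form it is compared against, $1-\frac{1-\rho}{\rho\bar{q}^{*}}\sigma_{w}^{2}\int\mathcal{D}z\,\phi^{2}(\sqrt{\bar{q}^{*}}z)$, is reconstructed here from the two dropout recurrences and should be read as an assumption. Function names are hypothetical.

```python
import math

# Trapezoid rule for the standard Gaussian measure (illustrative grid).
STEP = 0.1
NODES = [-8.0 + STEP * k for k in range(161)]
WEIGHTS = [STEP * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) for z in NODES]


def gauss_mean(f):
    """Approximate the integral of f(z) against the standard Gaussian measure."""
    return sum(w * f(z) for z, w in zip(NODES, WEIGHTS))


def dropout_variance_fixed_point(sigma_w2, sigma_b2, rho, q0=0.8, n_iter=500):
    """Fixed point of the variance recursion with sigma_w^2 -> sigma_w^2 / rho."""
    q = q0
    for _ in range(n_iter):
        root = math.sqrt(q)
        q = (sigma_w2 / rho) * gauss_mean(lambda z: math.tanh(root * z) ** 2) + sigma_b2
    return q


def correlation_after_unity(sigma_w2, sigma_b2, rho):
    """Correlation one layer after c = 1, computed directly and in closed form."""
    q_bar = dropout_variance_fixed_point(sigma_w2, sigma_b2, rho)
    root = math.sqrt(q_bar)
    second_moment = gauss_mean(lambda z: math.tanh(root * z) ** 2)
    # With c = 1 both inputs share the same Gaussian variable, and the
    # off-diagonal recurrence carries sigma_w^2 without the 1/rho factor.
    direct = (sigma_w2 * second_moment + sigma_b2) / q_bar
    closed_form = 1.0 - (1.0 - rho) / (rho * q_bar) * sigma_w2 * second_moment
    return direct, closed_form


for rho in (1.0, 0.99, 0.9, 0.8):
    direct, closed_form = correlation_after_unity(1.5, 0.05, rho)
    print(f"rho={rho}: c after one layer = {direct:.6f} (closed form {closed_form:.6f})")
```

With $\rho=1$ the correlation stays at 1, while for every $\rho<1$ a single layer already pushes it below 1, and further below as more units are dropped.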

As in networks without dropout we plot, in fig. 3 (a), the iterative map $\bar{c}^{l+1}_{ab}$ as a function of $\bar{c}^{l}_{ab}$. Most significantly, we see that $\bar{c}^{l}_{ab}=1$ is no longer a fixed point of the dynamics. Instead, as the dropout rate increases $\bar{c}^{l}_{ab}$ gets mapped to decreasing values and the fixed point monotonically decreases.

To test these results we plot in fig. 3 (b) the asymptotic correlation, $c^{*}$, as a function of $\sigma_{w}^{2}$ for different values of dropout from $\rho=0.8$ to $\rho=1.0$. As expected, we see that for all $\rho<1$ there is no sharp transition between $c^{*}=1$ and $c^{*}<1$. Moreover as the amount of dropout increases (that is, as $\rho$ decreases from 1) the correlation $c^{*}$ monotonically decreases. Intuitively this makes sense. Identical inputs passed through two different dropout masks will become increasingly dissimilar as the amount of dropout increases. In fig. 3 (c) we show the depth scale, $\xi_{c}$, as a function of $\sigma_{w}^{2}$ for the same range of dropout probabilities. We find that, as predicted, the depth of signal propagation with dropout is drastically reduced and, importantly, there is no longer a divergence in $\xi_{c}$. Increasing the amount of dropout continues to decrease the correlation depth for constant $\sigma_{w}^{2}$.

Gradient Backpropagation

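There is a duality between the forward propagation of signals and the backpropagation of gradients. To elucidate this connection consider the backpropagation equations given a loss $E$,

$$\frac{\partial E}{\partial W_{ij}^{l}}=\delta_{i}^{l}\phi(z_{j}^{l-1}),\qquad \delta_{i}^{l}=\phi^{\prime}(z_{i}^{l})\sum_{j}\delta_{j}^{l+1}W_{ji}^{l+1}, \tag{14}$$

with $\delta_{i}^{l}=\partial E/\partial z_{i}^{l}$. Within mean field theory the scale of the gradient fluctuations in layer $l$ is set by $\tilde{q}_{aa}^{l}=\mathbb{E}[(\delta_{i}^{l})^{2}]$ (see Appendix 7.6). Making the additional approximation that the weights used in the backward pass are drawn independently of those used in the forward pass, this variance obeys the recurrence (see Appendix 7.7),

$$\tilde{q}_{aa}^{l}=\tilde{q}_{aa}^{l+1}\frac{N_{l+1}}{N_{l}}\chi_{1}. \tag{15}$$

The presence of $\chi_{1}$ in the above equation should perhaps not be surprising. Poole et al. (2016) show that $\chi_{1}$ is intimately related to the tangent space of a given layer in mean field neural networks. We note that the backpropagation recurrence features an explicit dependence on the ratio of widths of adjacent layers of the network, $N_{l+1}/N_{l}$. Here we will consider exclusively constant width networks where this factor is unity. For a discussion of the case of unequal layer widths see Glorot & Bengio (2010).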

Since $\chi_{1}$ depends only on the asymptotic $q^{*}$ it follows that for constant width networks we expect eq. 15 to again have an exponential solution with,

$$\xi_{\nabla}^{-1}=-\log\chi_{1}. \tag{16}$$

Note that here $\xi_{\nabla}^{-1}=-\log\chi_{1}$ both above and below the transition. It follows that $\xi_{\nabla}$ can be both positive and negative. We conclude that there should be three distinct regimes for the gradients.

In the ordered phase, $\chi_{1}<1$ and so $\xi_{\nabla}>0$. We therefore expect gradients to vanish over a depth $|\xi_{\nabla}|$.

At criticality, $\chi_{1}\to 1$ and so $\xi_{\nabla}\to\infty$. Here gradients should be stable regardless of depth.

In the chaotic phase, $\chi_{1}>1$ and so $\xi_{\nabla}<0$. It follows that in this regime gradients should explode over a depth $|\xi_{\nabla}|$.

Intuitively these three regimes make sense. To see this, recall that perturbations to a weight in layer $l$ can alternatively be viewed as perturbations to the pre-activations in the same layer. In the ordered phase both the perturbed signal and the unperturbed signal will be asymptotically mapped to the same point and the derivative will be small. In the chaotic phase the perturbed and unperturbed signals will become asymptotically decorrelated and the gradient will be large.
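The three regimes follow directly from repeatedly multiplying by $\chi_{1}$ while moving from the output layer back towards the input. The sketch below illustrates this for a constant-width network, taking $\chi_{1}$ as a given number rather than computing it. The function names and the normalization of the last-layer variance to 1 are assumptions.

```python
import math


def gradient_depth_scale(chi_1, tol=1e-12):
    """xi_grad = -1 / log(chi_1); infinite at criticality (chi_1 = 1)."""
    if abs(chi_1 - 1.0) < tol:
        return math.inf
    return -1.0 / math.log(chi_1)


def gradient_variance_profile(chi_1, depth, q_last=1.0):
    """Backpropagate the gradient variance: each step back multiplies by chi_1."""
    profile = [0.0] * (depth + 1)
    profile[depth] = q_last
    for layer in range(depth - 1, -1, -1):
        profile[layer] = profile[layer + 1] * chi_1
    return profile


def regime(chi_1, tol=1e-12):
    """Name the gradient regime implied by chi_1."""
    if abs(chi_1 - 1.0) < tol:
        return "stable"
    return "vanishing" if chi_1 < 1.0 else "exploding"


for chi in (0.9, 1.0, 1.1):
    first_layer = gradient_variance_profile(chi, depth=50)[0]
    print(chi, regime(chi), gradient_depth_scale(chi), first_layer)
```

A value of $\chi_{1}$ only ten percent away from 1 already changes the first-layer gradient variance by more than two orders of magnitude over 50 layers, in either direction.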

To investigate these predictions we construct deep random networks of depth $L=240$ and layer-width $N_{l}=300$. We then consider the cross-entropy loss of these networks on MNIST. In fig. 4 (a) we plot the layer-by-layer 2-norm of the gradient, $||\nabla_{W_{ab}^{l}}E||_{2}^{2}$, as a function of layer, $l$, for different values of $\sigma_{w}^{2}$. We see that $||\nabla_{W_{ab}^{l}}E||_{2}^{2}$ behaves exponentially over many orders of magnitude. Moreover, we see that the gradient vanishes in the ordered phase and explodes in the chaotic phase. We test the quantitative predictions of eq. 16 in fig. 4 (b) where we compare $|\xi_{\nabla}|$ as predicted from theory with the measured depth-scale constructed from exponential fits to the gradient data. Here we see good quantitative agreement between the theoretical predictions from mean field random networks and experimentally realized networks. Together these results suggest that the approximations on the backpropagation equations were representative of deep, wide, random networks.

Experimental Results

Taken together, the results of this paper lead us to the following hypothesis: a necessary condition for a random network to be trained is that information about the inputs should be able to propagate forward through the network, and information about the gradients should be able to propagate backwards through the network. The preceding analysis shows that networks will have this property precisely when the network depth, $L$, is not much larger than the depth-scale $\xi_{c}$. This criterion is data independent and therefore offers a “universal” constraint on the hyperparameters that depends on network architecture alone. We now explore this relationship between depth of signal propagation and network trainability empirically.

To investigate this prediction, we consider random networks of depth $10\leq L\leq 300$ and $1\leq\sigma_{w}^{2}\leq 4$ with $\sigma_{b}^{2}=0.05$. We train these networks using Stochastic Gradient Descent (SGD) and RMSProp on MNIST and CIFAR10. We use a learning rate of $10^{-3}$ for SGD when $L\lesssim 200$, $10^{-4}$ for larger $L$, and $10^{-5}$ for RMSProp. These learning rates were selected by grid search between $10^{-6}$ and $10^{-2}$ in exponentially spaced steps of a factor of $10$. We note that the depth dependence of learning rate was explored in detail in Saxe et al. (2014). In fig. 5 (a)-(d) we color in red the training accuracy that neural networks achieved as a function of $\sigma_{w}^{2}$ and $L$ for different datasets, training time, and choice of minimizer (see Appendix 7.10 for more comparisons). In all cases the neural networks over-fit the data to give a training accuracy of $100\%$ and test accuracies of $98\%$ on MNIST and $55\%$ on CIFAR10. We emphasize that the purpose of this study is to demonstrate trainability as opposed to optimizing test accuracy.

We now make the connection between the depth scale, $\xi_{c}$, and the maximum trainable depth more precise. Given the arguments in the preceding sections we note that if $L=n\xi_{c}$ then signal through the network will be attenuated by a factor of $e^{n}$. To understand how much signal can be lost while still allowing for training, we overlay in fig. 5 (a) curves corresponding to $n\xi_{c}$ from $n=1$ to $6$. We find that networks appear to be trainable when $L\lesssim 6\xi_{c}$. It would be interesting to understand why this is the case.
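As a small worked illustration of this rule of thumb, the sketch below converts a depth scale into the remaining signal fraction at a given depth and into the empirical depth budget of roughly $6\xi_{c}$. The factor of 6 is the empirical observation reported above, not a derived constant, and the function names are hypothetical.

```python
import math


def remaining_signal(depth, xi_c):
    """Fraction of correlated signal left after `depth` layers: exp(-depth / xi_c)."""
    return math.exp(-depth / xi_c)


def depth_budget(xi_c, n=6):
    """Empirical trainable depth, L <~ n * xi_c, with n = 6 taken from the text."""
    return n * xi_c


xi_c = 16.0  # an arbitrary illustrative depth scale
budget = depth_budget(xi_c)
print(f"depth budget ~ {budget:.0f} layers")
print(f"signal remaining at that depth: {remaining_signal(budget, xi_c):.5f}")
```

At the empirical limit the correlated signal has been reduced by a factor of $e^{6}$, so only about a quarter of a percent of it remains.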

Motivated by this argument in fig. 5 (b)-(d) in white, dashed, overlay we plot six times the predicted depth scale, $6\xi_{c}$. There is clearly a relationship between the depth of correlated signal propagation and whether or not these networks are trainable. Networks closer to their critical point appear to train more quickly than those further away. Moreover, this relationship has no obvious dependence on dataset, duration of training, or minimizer. We therefore conclude that these bounds on trainable hyperparameters are universal. This in turn implies that to train increasingly deep networks, one must generically be ever closer to criticality.

Next we consider the effect of dropout. As we showed earlier, even infinitesimal amounts of dropout disrupt the order-to-chaos phase transition and cause the depth scale to become finite. However, since the effect of a single dropout mask is to simply re-scale the weight variance by $\sigma_{w}^{2}\to\sigma_{w}^{2}/\rho$, the gradient magnitude will be stable near criticality, while the input and gradient correlations will not be. This therefore offers a unique opportunity to test whether the relevant depth-scale is $|1/\log\chi_{1}|$ or $\xi_{c}$.

In fig. 6 we repeat the same experimental setup as above on MNIST with dropout rates $\rho=0.99$, $0.98$, and $0.94$. We observe, first and foremost, that even extremely modest amounts of dropout limit the maximum trainable depth to about $L=100$. We additionally notice that the depth-scale, $\xi_{c}$, predicts the trainable region accurately for varying amounts of dropout.

Discussion

In this paper we have elucidated the existence of several depth-scales that control signal propagation in random neural networks. Furthermore, we have shown that the degree to which a neural network can be trained depends crucially on its ability to propagate information about inputs and gradients through its full depth. At the transition between order and chaos, information stored in the correlation between inputs can propagate infinitely far through these random networks. This in turn implies that extremely deep neural networks may be trained sufficiently close to criticality. However, our contribution goes beyond advocating for hyperparameter selection that brings random networks to be nearly critical. Instead, we offer a general purpose framework that predicts, at the level of mean field theory, which hyperparameters should allow a network to be trained. This is especially relevant when analyzing schemes like dropout where there is no critical point and which therefore imply an upper bound on trainable network depth.

An alternative perspective as to why information stored in the covariance between inputs is crucial for training can be understood by appealing to the correspondence between infinitely wide Bayesian neural networks and Gaussian Processes (Neal, 2012). In particular the covariance, $q_{ab}^{l}$, is intimately related to the kernel of the induced Gaussian Process. It follows that cases in which signal stored in the covariance between inputs may propagate through the network correspond precisely to situations in which the associated Gaussian Process is well defined.

Our work suggests that it may be fruitful to investigate pre-training schemes that attempt to perturb the weights of a neural network to favor information flow through the network. In principle this could be accomplished through a layer-by-layer local criterion for information flow or by selecting the mean and variance in schemes like batch normalization to maximize the covariance depth-scale.

These results suggest that theoretical work on random neural networks can be used to inform practical architectural decisions. However, there is still much work to be done. For instance, the framework developed here does not apply to unbounded activations, such as rectified linear units, where it can be shown that there are phases in which eq. 3 does not have a fixed point. Additionally, the analysis here applies directly only to fully connected feed-forward networks, and will need to be extended to architectures with structured weight matrices such as convolutional networks.

We close by noting that in physics it has long been known that, through renormalization, the behavior of systems near critical points can control their behavior even far from the idealized critical case. We therefore make the somewhat bold hypothesis that a broad class of neural network topologies will be controlled by the fully-connected mean field critical point.

We thank Ben Poole, Jeffrey Pennington, Maithra Raghu, and George Dahl for useful discussions. We are additionally grateful to RocketAI for introducing us to Temporally Recurrent Online Learning and two-dimensional time.

References

Appendix

Here we present derivations of results from throughout the paper.

Consider the recurrence relation for the variance of a single input (eq. 3),

and a fixed point of the dynamics, $q^{*}$. $q^{l}_{aa}$ can be expanded about the fixed point to yield the asymptotic recurrence relation (eq. 6),

We begin by first expanding to order $\epsilon^{l}$,

We therefore arrive at the approximate recurrence relation,

Using the identity, $\int\mathcal{D}z\,zf(z)=\int\mathcal{D}z\,f^{\prime}(z)$ we can rewrite this asymptotic recurrence relation as,
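The Gaussian integration-by-parts identity and the resulting rewrite can be checked numerically. The sketch below assumes $\phi=\tanh$ and an arbitrary illustrative value of $q^{*}$ (not a computed fixed point). The intermediate coefficient, $\frac{\sigma_{w}^{2}}{\sqrt{q^{*}}}\int\mathcal{D}z\,z\,\phi(\sqrt{q^{*}}z)\phi^{\prime}(\sqrt{q^{*}}z)$, is reconstructed here from a first-order Taylor expansion of eq. 3 and should be read as an assumption about the omitted step. Applying the identity with $f(z)=\phi(\sqrt{q^{*}}z)\phi^{\prime}(\sqrt{q^{*}}z)$ should turn it into $\chi_{1}+\sigma_{w}^{2}\int\mathcal{D}z\,\phi^{\prime\prime}\phi$.

```python
import math

# Trapezoid rule for the standard Gaussian measure (illustrative grid).
STEP = 0.1
NODES = [-8.0 + STEP * k for k in range(161)]
WEIGHTS = [STEP * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) for z in NODES]


def gauss_mean(f):
    """Approximate the integral of f(z) against the standard Gaussian measure."""
    return sum(w * f(z) for z, w in zip(NODES, WEIGHTS))


Q_STAR = 0.4      # arbitrary illustrative value, not a computed fixed point
SIGMA_W2 = 1.5    # arbitrary illustrative weight variance
ROOT = math.sqrt(Q_STAR)


def f(z):
    """f(z) = phi(root z) * phi'(root z) for phi = tanh."""
    t = math.tanh(ROOT * z)
    return t * (1.0 - t * t)


def f_prime(z):
    """f'(z) = root * (phi'^2 + phi * phi'') evaluated at root z."""
    t = math.tanh(ROOT * z)
    sech2 = 1.0 - t * t
    return ROOT * (sech2 * sech2 - 2.0 * t * t * sech2)


# Identity: E[z f(z)] = E[f'(z)] for a standard Gaussian z.
lhs = gauss_mean(lambda z: z * f(z))
rhs = gauss_mean(f_prime)

# Coefficient of epsilon^l before and after applying the identity.
coeff_before = SIGMA_W2 / ROOT * lhs
chi_1 = SIGMA_W2 * gauss_mean(lambda z: (1.0 - math.tanh(ROOT * z) ** 2) ** 2)
second = SIGMA_W2 * gauss_mean(
    lambda z: -2.0 * math.tanh(ROOT * z) ** 2 * (1.0 - math.tanh(ROOT * z) ** 2)
)
coeff_after = chi_1 + second

print(f"E[z f] = {lhs:.10f}, E[f'] = {rhs:.10f}")
print(f"coefficient before = {coeff_before:.10f}, after = {coeff_after:.10f}")
```

Agreement of the two coefficients is what allows the depth scale $\xi_{q}$ to be written in terms of $\chi_{1}$ plus a correction involving $\phi^{\prime\prime}\phi$.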

7.2 Two input depth-scale

Consider the recurrence relation for the covariance of two inputs (eq. 4),

a correlation between the inputs, $c_{ab}^{l}=q_{ab}^{l}/\sqrt{q_{aa}^{l}q_{bb}^{l}}$, and a fixed point of the dynamics, $c^{*}$. $c^{l}_{ab}$ can be expanded about the fixed point to yield the asymptotic recurrence relation,

Since the relaxation of $q^{l}_{aa}$ and $q^{l}_{bb}$ to $q^{*}$ occurs much more quickly than the convergence of $q_{ab}^{l}$ we approximate $q^{l}_{aa}=q^{l}_{bb}=q^{*}$ as in Poole et al. (2016). We therefore consider the perturbation $q_{ab}^{l}/q^{*}=c_{ab}^{l}=c^{*}+\epsilon^{l}$. It follows that we may make the approximation,

We now consider the cases $c^{*}<1$ and $c^{*}=1$ separately; we will later show that these two results agree with one another. First we consider the case where $c^{*}<1$ in which case we may safely expand the above equation to get,

This allows us to in turn approximate the recurrence relation,

where $u_{1}^{*}$ and $u_{2}^{*}$ are appropriately defined asymptotic random variables. This leads to the asymptotic recurrence relation,

We now consider the case where $c^{*}=1$ and $c^{l}_{ab}=1-\epsilon^{l}$. In this case the expansion of $u_{2}^{l}$ will become,

and so the lowest order correction is of order $\mathcal{O}(\sqrt{\epsilon^{l}})$ as opposed to $\mathcal{O}(\epsilon^{l})$. As usual we now expand the recurrence relation, noting that $u_{2}^{*}=u_{1}^{*}$ is independent of $z_{2}$ when $c^{*}=1$ to find,

It follows that the asymptotic recurrence relation in this case will be,

where $\chi_{1}$ is the stability condition for the ordered phase. We note that although the approximations were somewhat different the asymptotic recurrence relation for $c^{*}<1$ reduces to the eq. 47 result for $c^{*}=1$. We may therefore use 4 for all $c^{*}$.

7.3 Variance of an input with dropout

In the presence of dropout with rate $\rho$, the variance of a single input as it is passed through the network is described by the recurrence relation (eq. 11),

Recall that the recurrence relation for the pre-activations is given by,

where $p_{j}^{l}\sim\text{Bernoulli}(\rho)$. It follows that the variance will be given by,

7.4 Covariance of two inputs with dropout

The covariance between two signals, $z_{i;a}^{l}$ and $z_{i;b}^{l}$, with separate i.i.d. dropout masks $p_{i;a}^{l}$ and $p_{i;b}^{l}$ is given by,

where, in analogy to eq. 4, $\bar{u}_{1}=\sqrt{\bar{q}^{l}_{aa}}z_{1}$ and $\bar{u}_{2}=\sqrt{\bar{q}^{l}_{bb}}\left(\bar{c}_{ab}^{l}z_{1}+\sqrt{1-(\bar{c}_{ab}^{l})^{2}}z_{2}\right)$.

subject to the approximation, $q_{aa}^{l}\approx q_{bb}^{l}\approx q^{*}$. This implies that $c^{l+1}_{ab}<1$.

Plugging in $c^{l}_{ab}=1$ with $q_{aa}^{l}\approx q_{bb}^{l}\approx q^{*}$ we find that $\bar{u}_{1}=\bar{u}_{2}=\sqrt{q^{*}}z_{1}$. It follows that,

as required. Here we have integrated out $z_{2}$ since neither $\bar{u}_{1}$ nor $\bar{u}_{2}$ depend on it.

7.6 Mean field gradient scaling

We first note that since the $W_{ij}^{l}$ are i.i.d. it follows that,

where we have used the fact that the first line is related to the sample expectation over the different realizations of the $W_{ij}^{l}$ to approximate it by the analytic expectation in the second line. In mean field theory since the pre-activations in each layer are assumed to be i.i.d. Gaussian it follows that,

7.7 Mean field backpropagation

Computing the variance directly and using mean field approximation,

as required. In the last step we have made the approximation that $q^{l}_{aa}\approx q^{*}$ since the depth scale for the variance is short ranged.

7.8 Mean field gradient covariance scaling

In mean field theory we expect the covariance between the gradients of two different inputs to scale as,

We proceed in a manner analogous to Appendix 7.6. Note that in mean field theory since the weights are i.i.d. it follows that

where, as before, the final term is approximating the sample expectation. Since the weights in the forward and backwards passes are chosen independently it follows that we can factor the expectation as,

7.9 Mean field backpropagation of covariance

The covariance between the gradients due to two inputs scales as,

As in the analogous derivation for the variance, we compute directly,

7.10 Further experimental results

Here we include some more experimental figures that investigate the effects of training time, minimizer, and dataset more closely.