On the infinite-depth limit of finite-width neural networks

Soufiane Hayou

The empirical success of over-parameterized neural networks has sparked a growing interest in the theoretical understanding of these models. The large number of parameters (millions if not billions) and the non-linear nature of the neural computations make this hypothesis space highly non-trivial. However, in certain situations, increasing the number of parameters has the effect of ‘placing’ the network in some ‘average’ regime that simplifies the theoretical analysis. This is the case with the infinite-width asymptotics of random neural networks. The infinite-width limit of neural network architectures has been extensively studied in the literature, and has led to many interesting theoretical and algorithmic innovations. We summarize these results below.

Initialization schemes: the infinite-width limit of different neural architectures has been extensively studied in the literature. In particular, for multi-layer perceptrons (MLP), a new initialization scheme that stabilizes forward and backward propagation (in the infinite-width limit) was derived in . This initialization scheme is known as the Edge of Chaos, and empirical results show that it significantly improves performance. In , the authors derived similar results for the ResNet architecture, and showed that this architecture is placed by-default on the Edge of Chaos for any choice of the variances of the initialization weights (Gaussian weights). In , the authors showed that an MLP that is initialized on the Edge of Chaos exhibits similar properties to ResNets, which might partially explain the benefits of the Edge of Chaos initialization.

Gaussian process behaviour: Multiple papers (e.g. ) studied the weak limit of neural networks when the width goes to infinity. The results show that a randomly initialized neural network (with Gaussian weights) has a similar behaviour to that of a Gaussian process, for a wide range of neural architectures, and under mild conditions on the activation function. In , the authors leveraged this result and introduced the neural network Gaussian process (NNGP), which is a Gaussian process model with a neural kernel that depends on the architecture and the activation function. Bayesian regression with the NNGP showed that NNGP surprisingly achieves performance close to the one achieved by an SGD-trained finite-width neural network.

The large depth limit of this Gaussian process was studied in , where the authors showed that with proper scaling, the infinite-depth (weak) limit is a Gaussian process with a universal kernel. (A kernel is called universal when any continuous function on some compact set can be approximated arbitrarily well with kernel features.)

Neural Tangent Kernel (NTK): the infinite-width limit of the NTK is the so-called NTK regime or Lazy-training regime. This topic has been extensively studied in the literature. The optimization and generalization properties (and some other aspects) of the NTK have been studied in . The large depth asymptotics of the NTK have been studied in . We refer the reader to for a comprehensive discussion on the NTK.

Others: the theory of infinite-width neural networks has also been utilized for network pruning , regularization , feature learning , and ensembling methods (this is by no means an exhaustive list).

The theoretical analysis of infinite-width neural networks has certainly led to many interesting (theoretical and practical) discoveries. However, most works on this limit consider a fixed depth network. What about infinite-depth? Existing works on the infinite-depth limit can generally be divided into three categories:

Infinite-width-then-infinite-depth limit: in this case, the width is taken to infinity first, then the depth is taken to infinity. This is the infinite-depth limit of infinite-width neural networks. This limit was particularly used to derive the Edge of Chaos initialization scheme , study the impact of the activation function , the behaviour of the NTK , kernel shaping etc.

The joint infinite-width-and-depth limit: in this case, the depth-to-width ratio is fixed, and the width and depth are taken to infinity jointly. There are few works that study the joint width-depth limit. For instance, in , the authors showed that for a special form of residual neural networks (ResNet), the network output exhibits a (scaled) log-normal behaviour in this joint limit. This is different from the sequential limit where width is taken to infinity first, followed by the depth, in which case the distribution of the network output is asymptotically normal (). In , the authors studied the covariance kernel of an MLP in the joint limit, and showed that it converges weakly to the solution of a stochastic differential equation (SDE). In , the authors showed that in the joint limit, the NTK of an MLP remains random when the width and depth jointly go to infinity. This is different from the deterministic limit of the NTK where the width is taken to infinity before depth . More recently, in , the author explored the impact of the depth-to-width ratio on the correlation kernel and the gradient norms in the case of an MLP architecture, and showed that this ratio can be interpreted as an effective network depth.

Infinite-depth limit of finite-width neural networks: in both previous limits (infinite-width-then-infinite-depth limit, and the joint infinite-width-depth limit), the width goes to infinity. Naturally, one might ask what happens if width is fixed and depth goes to infinity? What is the limiting distribution of the network output at initialization? In , the author showed that neural networks with bounded width are still universal approximators, which motivates the study of finite-width large depth neural networks. In , the authors showed that the pre-activations of a particular ResNet architecture converge weakly to a diffusion process in the infinite-depth limit. This is the result of the fact that ResNet can be seen as discretizations of SDEs (see Section 2).

In the present paper, we study the infinite-depth limit of finite-width ResNet with random Gaussian weights (an architecture that is different from the one studied in ). We are particularly interested in the asymptotic behaviour of the pre/post-activation values. Our contributions are four-fold:

We show that, unlike in the infinite-width limit, the distribution of the pre-activations in the infinite-depth limit is not necessarily Gaussian. In the simple case of networks of width $1$, we study two cases where we obtain known but completely different distributions by carefully choosing the activation function.

For the ReLU activation function, we introduce and discuss the phenomenon of network collapse. This phenomenon occurs when the pre-activations in some hidden layer are all non-positive, which results in zero post-activations. This leads to a stagnant network where increasing the depth beyond a certain level has no effect on the network output. For any fixed width, we show that in the infinite-depth limit, network collapse is a zero-probability event, meaning that, almost surely, no hidden layer has all of its post-activations equal to zero.

For networks with general width, where the distribution of the pre-activations is generally intractable, we focus on the norm of the post-activations with ReLU activation function, and show that this norm approximately follows Geometric Brownian Motion (GBM) dynamics. We call this Quasi-GBM. We also shed light on a regime change phenomenon that occurs when the width $n$ increases from $3$ to $4$. For width $n\leq 3$, resp. $n\geq 4$, the logarithmic growth factor of the post-activations is negative, resp. positive.

We study the sequential limit infinite-depth-then-infinite-width, which is the converse of the more commonly studied infinite-width-then-infinite-depth limit, and show some key differences between these limits. We particularly show that the pre-activations converge to the solution of a McKean-Vlasov process, which has Gaussian marginal distributions, and thus we recover the Gaussian behaviour in this limit. We compare the two sequential limits and discuss some differences.

The proofs of the theoretical results are provided in the appendix and referenced after each result. Empirical evaluations of these theoretical findings are also provided.

The infinite-depth limit

Hereafter, we denote the width, resp. depth, of the network by $n$, resp. $L$. We also denote the input dimension by $d$. Let $d,n,L\geq 1$, and consider the following ResNet architecture of width $n$ and depth $L$

The $1/\sqrt{L}$ scaling in Eq. 1 is not arbitrary. This specific scaling was shown to stabilize the norm of $Y_l$ as well as gradient norms in the large depth limit (e.g. ). In the next result, we show that the infinite-depth limit of Eq. 1 (in the sense of distribution) exists and has the same distribution as the solution of a stochastic differential equation. In the case of a single input, this has already been shown in . The details are provided in Section A. We also generalize this result to the case of multiple inputs and obtain similar SDE dynamics (see 5 in the Appendix).

where the constant in $\mathcal{O}$ does not depend on $t$. Moreover, if the activation function $\phi$ is only locally Lipschitz, then $X^{L}_{t}$ converges locally to $X_{t}$. More precisely, for any fixed $r>0$, we consider the stopping times

then the stopped process $X^{L}_{t\land\tau^{L}}$ converges in distribution to the stopped solution $X_{t\land\tau}$ of the above SDE.

The proof of 1 is provided in Section A.6. We use classical results on the numerical approximation of SDEs. 1 shows that the infinite-depth limit of finite-width ResNet (Eq. 1) has a similar behaviour to the solution of the SDE given in Eq. 7. In this limit, $Y_{\lfloor tL\rfloor}$ converges in distribution to $X_{t}$. Hence, properties of the solutions of Eq. 7 should theoretically be ‘shared’ by the pre-activations $Y_{\lfloor tL\rfloor}$ when the depth is large. For the rest of the paper, we study some properties of the solutions of Eq. 7. This requires the definition of filtered probability spaces, which we omit here. All the technical details are provided in Section A. We compare the theoretical findings with empirical results obtained by simulating the pre/post-activations of the original network Eq. 1. We refer to $X_{t}$, the solution of Eq. 7, as the infinite-depth network.
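The finite-depth network that is compared against the SDE can be simulated in a few lines. The following is a minimal sketch, not the author's code: since the display of Eq. 1 is not reproduced in this excerpt, the recursion $Y_0 = W_{in}x$, $Y_l = Y_{l-1} + L^{-1/2} W_l \phi(Y_{l-1})$ with $W_l$ entries drawn from $\mathcal{N}(0, 1/n)$ is an assumption, chosen to be consistent with the $1/\sqrt{L}$ scaling, the $W_{in}\sim\mathcal{N}(0,d^{-1}I)$ input layer, and the volatility term $n^{-1/2}\|\phi(X_t)\|$ mentioned in the text. The function name `simulate_resnet` is hypothetical.

```python
import numpy as np


def simulate_resnet(x, n, L, phi, rng):
    """Pre-activations of a randomly initialized ResNet with 1/sqrt(L) scaling.

    Assumed recursion (the display of Eq. 1 is not available in this excerpt):
        Y_0 = W_in x,    Y_l = Y_{l-1} + L^{-1/2} W_l phi(Y_{l-1}),
    with W_in entries ~ N(0, 1/d) and W_l entries ~ N(0, 1/n).
    Returns an array of shape (L + 1, n) holding Y_0, ..., Y_L.
    """
    d = x.shape[0]
    Y = np.empty((L + 1, n))
    Y[0] = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n, d)) @ x
    for l in range(1, L + 1):
        W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
        # One residual block; this is also one Euler-Maruyama step of size 1/L.
        Y[l] = Y[l - 1] + (W @ phi(Y[l - 1])) / np.sqrt(L)
    return Y


def relu(z):
    return np.maximum(z, 0.0)


rng = np.random.default_rng(0)
paths = simulate_resnet(np.ones(3), n=4, L=200, phi=relu, rng=rng)
print(paths.shape)  # one row per layer, one column per neuron
```

Under this assumed form, row $\lfloor tL\rfloor$ of the output plays the role of $Y_{\lfloor tL\rfloor}$, so histograms of the last row over many seeds can be compared with the law of $X_1$.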

The distribution of $X_{1}$ (the last layer in the infinite-depth limit) is generally intractable, unlike in the infinite-width-then-infinite-depth limit (Gaussian, ) or joint infinite-depth-and-width limit (involves a log-normal distribution in the case of an MLP architecture, ). Intuitively, one should not expect a universal behaviour (e.g. the Gaussian behaviour in the infinite-width case) of the solution of Eq. 7, as the latter is highly sensitive to the choice of the activation function, and different activation functions might yield completely different distributions of $X_{1}$. We demonstrate this in the next section by showing that we can recover closed-form distributions by carefully choosing the activation function. The main ingredient is the use of Itô's lemma. See Section A for more details.

Different behaviours depending on the activation function

In this section, we restrict our analysis to a width-$1$ ResNet with one-dimensional inputs, where each layer consists of a single neuron, i.e. $d=n=1$. In this case, the process $(X_{t})_{0\leq t\leq 1}$ is one-dimensional and is the solution of the following SDE

In financial mathematics nomenclature, the function $\mu$ is called the drift and $\sigma$ is called the volatility of the diffusion process. Itô's lemma is a valuable tool in stochastic calculus and is often used to transform and simplify SDEs to better understand their properties. It can also be used to find candidate functions $g$ and activation functions $\phi$ such that the SDE Eq. 3 admits solutions with known distributions, which yields a closed-form distribution for $X_{t}$. We devote the rest of this section to this purpose.

ReLU is a piece-wise linear activation function. Let us first deal with the simpler case of linear activation functions. In the next result, we show that linear activation functions yield log-normal distributions. In this case, the process $X_{t}$ follows Geometric Brownian motion dynamics. Later in this section, we show that this result can be adapted to the case of the ReLU activation function given by $\phi(x)=\max(x,0)$.

Then, the process $g(X_{t})$ is a solution of the SDE

where $a=\frac{1}{2}\sigma^{2}\gamma^{-1}(\gamma-1)$. As a result, we have that for all $t\in[0,1]$,

The proof of 2 is provided in Section D, and consists of using Itô's lemma and solving a differential equation. When the activation function is ReLU, we still obtain a log-normal distribution conditionally on the event that the initial value $X_{0}$ is positive.

Then, the process $X$ is a mixture of a Geometric Brownian motion and a constant process. More precisely, we have for all $t\in[0,1]$

Hence, given a fixed $X_{0}>0$, the process $X$ is a Geometric Brownian motion.
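The log-normal form can be checked by hand. The following worked step is a sketch under an assumption: for $n=1$, ReLU activation and $X_0>0$, the dynamics are taken to reduce to $dX_t = X_t\,dB_t$ (the SDE display itself is not reproduced in this excerpt; this form is consistent with the identity $\log(X_t/X_0)\sim -t/2+B_t$ used later in the paper). Applying Itô's lemma to $\log X_t$ gives:

```latex
\begin{aligned}
d\log X_t &= \frac{dX_t}{X_t} - \frac{1}{2}\,\frac{d\langle X\rangle_t}{X_t^{2}}
           = dB_t - \frac{1}{2}\,dt, \\
\log\frac{X_t}{X_0} &= -\frac{t}{2} + B_t \sim \mathcal{N}\!\left(-\frac{t}{2},\, t\right),
\qquad
X_t = X_0 \exp\!\left(-\frac{t}{2} + B_t\right).
\end{aligned}
```

The $-t/2$ term is the Itô correction coming from the quadratic variation $d\langle X\rangle_t = X_t^2\,dt$; without it the exponential of a Brownian motion would not be a martingale.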

The proof of 3 is provided in Section E. We show that conditionally on $X_{0}>0$, with probability $1$, the process $X_{t}$ is positive for all $t\in[0,1]$. (In Section E, we show that the stopping time $\tau=\inf\{t\geq 0: X_{t}\leq 0\}$ is infinite almost surely, which is stronger than what we need. This is a classic result in stochastic calculus.) When $X_{t}>0$, the ReLU activation is just the identity function, which justifies the similarity between this result and the one obtained with linear activations (2). Conversely, if $X_{0}\leq 0$, the process is constant and equal to $X_{0}$ since the updates ‘$dX_{t}$’ are equal to zero in this case. A rigorous justification of this is given for general width $n$ later in the paper (Lemma 1).

An empirical verification of 2 is provided in Fig. 1 where we compare the theoretical results to simulations of the neural paths $(Y_{l})_{1\leq l\leq L}$ and $(\log(Y_{l}))_{1\leq l\leq L}$ from the original (finite-depth) ResNet given by Eq. 1. We observe an excellent match with theoretical predictions for depths $L=50$ and $L=100$. In the case of a small depth ($L=5$), the theoretical distribution does not fit the empirical one (obtained by simulations) well, which is expected since the dynamics of $X$ describe (only) the infinite-depth limit of the ResNet. More figures are provided in Section K.

Remark: notice that the log-normal behaviour is a result of the fact that we only consider the case $n=1$ (width one). Indeed, the single neuron case forces ReLU to act like a linear activation when $X_{0}>0$, and like a ‘zero’ activation when $X_{0}\leq 0$. For general width $n>1$, such behaviour does not hold in general, and usually some coordinates of $X_{t}$ will be negative while others are non-negative, which implies that the volatility term $\|\phi(X_{t})\|$ has a non-trivial dependence on $X_{t}$. We discuss this in more detail in Section 4.

In the next section, we illustrate a case of an exotic (non-standard) activation function that yields a completely different closed-form distribution of $X_{t}$.

Exotic activation

The next result shows that with a particular choice of the activation function $\phi$ and mapping $g$, the stochastic process $g(X_{t})$ is the solution of a well-known type of SDE, the Ornstein-Uhlenbeck SDE. In this case, the activation function is non-standard and involves the inverse of the imaginary error function, a variant of the error function.

Consider the stochastic process $X_{t}$ defined by

(In Section C, we show that the activation function $\phi$ is only locally Lipschitz. Hence, the solution of this SDE exists only in the local sense and the convergence in distribution of $Y_{\lfloor tL\rfloor}$ to $X_{t}$ is also in the local sense (1). However, by continuity of the Brownian path, the stopping times $\tau^{L}$ and $\tau$ diverge almost surely when $r$ goes to infinity. Therefore, the conclusion of 4 remains true for all $t\in[0,1]$. Technical details are provided in Section C.)

Then, the stochastic process $g(X_{t})$ follows the Ornstein-Uhlenbeck dynamics on $(0,1]$ given by

where $a=\frac{\pi\alpha^{2}}{4}$. As a result, conditionally on $X_{0}$ (fixed $X_{0}$), we have that for all $t\in[0,1]$,

and the process $X_{t}$ is distributed as $X_{t}\sim\alpha^{-1}\left(h\left(\alpha^{-1}\pi^{-1/2}\,\mathcal{N}\left(g(X_{0})e^{-at},\frac{\pi}{2}(1-e^{-2at})\right)\right)-\beta\right)$.

Fig. 2 shows the graph of the activation function $\phi(y)=\exp(h^{-1}(y)^{2})$ mentioned in 4 with $\alpha=1$ and $\beta=0$. With this choice of the activation function, the infinite-depth network output $X_{1}$ has the distribution $g^{-1}\left(\mathcal{N}\left(g(X_{0})e^{-a},\frac{\pi}{2}(1-e^{-2a})\right)\right)$ (conditionally on $X_{0}$), where $g$ is given in the statement of the proposition. This distribution, although easy to simulate, is different from both the Gaussian distribution that we obtain in the infinite-width limit and the log-normal distribution associated with ReLU activation. This confirms not only that neural networks exhibit completely different behaviours when the depth-to-width ratio is large, but also that, in this case, their behaviour is very sensitive to the choice of the activation function.

The results of 4 are empirically confirmed in Fig. 3. The original ResNet given by Eq. 1 with depth $L=100$ exhibits very similar behaviour to that of the SDE.

General width $n\geq 1$

where $\phi$ is the activation function, and $B$ is an $n$-dimensional Brownian motion, independent from $W_{in}$. Intuitively, if for some $s$, $\|\phi(X_{s})\|=0$, then for all $t\geq s$, $X_{t}=X_{s}$ since the increments ‘$dX_{t}$’ are all zero for $t\geq s$. This holds for any choice of the activation function $\phi$, provided that the process $X$ exists, i.e. the SDE has a unique solution. We summarize this in the next lemma.

Lemma 1 is a particular case of Lemma 7 in the Appendix. The proof consists of using the uniqueness of the solution of Eq. 4 when the volatility term is Lipschitz. This result is trivial in the finite depth case (Eq. 1). When there exists $s$ such that $\phi(X_{s})=0$, the process $X$ becomes constant (equal to $X_{s}$) for all $t\geq s$ (almost surely). We call this phenomenon process collapse. In the case of finite-depth networks (Eq. 1), we call the same phenomenon network collapse. Understanding when, and whether, such an event occurs is useful since it has significant implications for the large depth behaviour of neural networks. Indeed, if such an event occurs, it would mean that increasing depth has no effect on the network output after some time $s$ (or approximately, after layer index $\lfloor sL\rfloor$). In the next result, we show that under mild conditions on the activation function, process collapse is a zero-probability event.

The proof of Lemma 2 is provided in Section F. Many standard activation functions satisfy the conditions of Lemma 2. Examples include the hyperbolic tangent $\textrm{Tanh}(z)=\frac{e^{2z}-1}{e^{2z}+1}$, and smooth versions of the ReLU activation such as GeLU, given by $\phi_{GeLU}(z)=z\Psi(z)$ where $\Psi$ is the cumulative distribution function of the standard Gaussian variable, and Swish (or SiLU), given by $\phi_{Swish}(z)=zh(z)$ where $h(z)=(1+e^{-z})^{-1}$ is the Sigmoid function. The result of Lemma 2 can be extended to the case when $\phi$ is the ReLU function with minor changes.

Consider the stochastic process (7) given by the SDE

where $\phi$ is the ReLU activation function, and $(B_{t})_{t\geq 0}$ is an $n$-dimensional Brownian motion independent from $W_{in}\sim\mathcal{N}(0,d^{-1}I)$. Let $\tau$ be the stopping time given by

The proof of Lemma 3 relies on a particular choice of a sequence of functions $(\phi_{m})_{m\geq 1}$ that approximate the ReLU activation $\phi$. Details are provided in Section F.

The result of Lemma 3 shows that for all $T>0$, with probability $1$, if there exists $j\in[n]$ such that $X_{0}^{j}>0$, then for all $t\in[0,T]$, there exists a coordinate $i$ such that $X^{i}_{t}>0$, which implies that the volatility of the process $X$, given by $\frac{1}{\sqrt{n}}\|\phi(X_{t})\|$, does not vanish in finite time $t$. Notably, this implies that for any $t\in[0,1]$, the norm of post-activations given by $\|\phi(X_{t})\|$ does not vanish (with probability 1). This is important as it ensures that the vector $\phi(X_{t})$, which represents the post-activations in the infinite-depth network, does not vanish, and therefore the process $X_{t}$ does not get stuck in an absorbing point. The dependence between the coordinates of the process $X_{t}$ is crucial in this result. In the opposite case where the coordinates of $X_{t}$ are independent, the event $\{\|\phi(X_{t})\|=0\}$ has probability $2^{-n}$. Notice also that this result holds only in the infinite-depth limit. With finite-depth ResNet (Eq. 1) with ReLU activation, it is not hard to show that the network collapse event $\{\exists l\in[L],\textrm{ s.t. }\|\phi(Y_{l})\|=0\}$ has non-zero probability. However, as the depth increases, the probability of network collapse goes to zero. Fig. 4 shows the probability of network collapse for a finite-width and depth ResNet (Eq. 1). As the depth $L$ increases, it becomes unlikely that the network collapses. This is in agreement with our theoretical prediction that the infinite-depth network represented by the process $X_{t}$ has a zero-probability collapse event, conditionally on the fact that $\|\phi(X_{0})\|>0$. The probability of network collapse also decreases with width, which is expected, since it becomes less likely to have all pre-activations non-positive as the width increases.
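The collapse probability discussed above can be estimated by Monte Carlo. The sketch below is illustrative only and is not the experiment behind Fig. 4: it assumes a scalar input $x=1$, the recursion $Y_l = Y_{l-1} + L^{-1/2} W_l\,\mathrm{ReLU}(Y_{l-1})$ with $W_l$ entries drawn from $\mathcal{N}(0,1/n)$, and it conditions on $\|\phi(Y_0)\|>0$ by resampling $Y_0$. The function name `collapse_probability` and its parameters are hypothetical.

```python
import numpy as np


def collapse_probability(n, L, trials=2000, seed=0):
    """Monte Carlo estimate of P(exists l in [L] with relu(Y_l) = 0).

    Assumptions (illustrative): scalar input x = 1, so Y_0 ~ N(0, I_n);
    Y_l = Y_{l-1} + L^{-1/2} W_l relu(Y_{l-1}) with W_l entries ~ N(0, 1/n);
    Y_0 is resampled until at least one coordinate is positive.
    """
    rng = np.random.default_rng(seed)
    collapsed = 0
    for _ in range(trials):
        Y = rng.normal(size=n)
        while not np.any(Y > 0):  # condition on ||relu(Y_0)|| > 0
            Y = rng.normal(size=n)
        for _ in range(L):
            W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
            Y = Y + (W @ np.maximum(Y, 0.0)) / np.sqrt(L)
            if not np.any(Y > 0):  # every pre-activation is non-positive
                collapsed += 1
                break
    return collapsed / trials


for depth in (5, 20, 80):
    print(depth, collapse_probability(n=3, L=depth, trials=500))
```

A simple sanity check of this setup: for $n=1$ and $L=1$ the network collapses exactly when the single weight falls below $-1$, so the estimate should be close to the standard normal tail probability $\Phi(-1)\approx 0.159$.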

Post-activation norm

As a result of Lemma 3, conditionally on $\|\phi(X_{0})\|>0$, we can safely manipulate functions that require positivity, such as the logarithm of the norm of the post-activations. In the next result, we show that the norm of the post-activations has a distribution that resembles the log-normal distribution. We call this Quasi Geometric Brownian Motion distribution (Quasi-GBM).

where $\mu_{s}=\frac{1}{2}\|\phi^{\prime}(X_{s})\|^{2}-1$, and $(\hat{B}_{t})_{t\geq 0}$ is a one-dimensional Brownian motion. As a result, for all $0\leq s\leq t\leq 1$

Notice that the case of $n=1$ matches the result of 3. Indeed, the latter implies that conditionally on $\phi(X_{0})>0$, we have $\log(\phi(X_{t})/\phi(X_{0}))=\log(X_{t}/X_{0})\sim-t/2+B_{t}$ where $B$ is a one-dimensional Brownian motion, and where we have used the fact that $X_{t}>0$ for all $t$. This result can be readily obtained from 1 by setting $n=1$.
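The logarithmic growth factor $\log(\|\phi(Y_L)\|/\|\phi(Y_0)\|)$ can also be estimated empirically for small widths. The sketch below is a hedged illustration under the same assumed recursion as before (scalar input $x=1$, $W_l$ entries drawn from $\mathcal{N}(0,1/n)$, ReLU activation); the function name `mean_log_growth` is hypothetical, and runs in which the finite-depth network collapses are skipped, which slightly biases the estimate for very small widths and depths.

```python
import numpy as np


def mean_log_growth(n, L=100, trials=1000, seed=0):
    """Average of log(||relu(Y_L)|| / ||relu(Y_0)||) over random ReLU ResNets.

    Assumptions (illustrative): scalar input x = 1, so Y_0 ~ N(0, I_n);
    Y_l = Y_{l-1} + L^{-1/2} W_l relu(Y_{l-1}) with W_l entries ~ N(0, 1/n).
    Y_0 is resampled until relu(Y_0) != 0; collapsed runs are discarded.
    """
    rng = np.random.default_rng(seed)
    logs = []
    for _ in range(trials):
        Y = rng.normal(size=n)
        while not np.any(Y > 0):
            Y = rng.normal(size=n)
        start = np.linalg.norm(np.maximum(Y, 0.0))
        for _ in range(L):
            W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
            Y = Y + (W @ np.maximum(Y, 0.0)) / np.sqrt(L)
        end = np.linalg.norm(np.maximum(Y, 0.0))
        if end > 0:  # the logarithm is undefined for collapsed networks
            logs.append(np.log(end / start))
    return float(np.mean(logs))


for width in (1, 2, 3, 4, 8):
    print(width, mean_log_growth(width, L=100, trials=300))
```

For $n=1$ the estimate should be close to $-1/2$, in line with $\log(X_t/X_0)\sim -t/2+B_t$ at $t=1$; the values printed for larger widths give an empirical view of how the growth factor changes with $n$.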

An interesting question is that of the infinite-width limit of the process $X_{t}$, which corresponds to the sequential limit infinite-depth-then-infinite-width of the ResNet $Y_{\lfloor tL\rfloor}$ (Eq. 1). We discuss this in the next section.

Infinite-width limit of infinite-depth networks

In the next result, we show that when the width goes to infinity, the ratio $\|\phi(X_{t})\|/\|\phi(X_{0})\|$ concentrates around a layer-dependent ($t$-dependent) constant. In this limit, the coordinates of $X_{t}$ converge in $L_{2}$ to a McKean-Vlasov process, which allows us to recover the Gaussian behaviour of the pre-activations of the ResNet. We later compare this with the converse sequential limit infinite-width-then-infinite-depth, where the pre-activations are also normally distributed, and show that the variance of the Gaussian distribution is the same in both limits.

where $X^{i}_{t}$ is the solution of the following (McKean-Vlasov) SDE

As a result, the pre-activations $Y^{i}_{\lfloor tL\rfloor}$ (Eq. 1) converge in distribution to a Gaussian distribution in the limit infinite-depth-then-infinite-width

The proof of 2 requires the use of a special variant of the law of large numbers for non-iid random variables, and a convergence result for particle systems from the theory of McKean-Vlasov processes. Details are provided in Section H. In neural network terms, 2 shows that the logarithmic growth factor of the norm of the post-activations, given by $\log\left(\|\phi(Y_{\lfloor tL\rfloor})\|/\|\phi(Y_{\lfloor sL\rfloor})\|\right)$, converges to $(t-s)/4$ in the sequential limit $L\to\infty$, then $n\to\infty$. More importantly, the pre-activations $Y_{\lfloor tL\rfloor}^{i}$ converge in distribution to a zero-mean Gaussian distribution in this limit, with a layer-dependent variance. In the converse sequential limit, i.e. $n\to\infty$, then $L\to\infty$, the limiting distribution of the pre-activations $Y_{\lfloor tL\rfloor}^{i}$ is also Gaussian with the same variance. We show this in the following result, which uses Lemma 5 in .

Let $t\in[0,1]$. Then, in the limit $\lim_{L\to\infty}\lim_{n\to\infty}$ (infinite width, then infinite depth), we have that

where the convergence holds in probability.

Moreover, the pre-activations $Y^{i}_{\lfloor tL\rfloor}$ (Eq. 1) converge in distribution to a Gaussian distribution in the limit infinite-width-then-infinite-depth

The proof of 3 is provided in Section I. We use existing results from on the infinite-depth asymptotics of the neural network Gaussian process (NNGP). It turns out that the order of the sequential limit (taking the width to infinity first, then taking the depth to infinity, or the converse) does not affect the limiting distribution, which is a Gaussian with variance $\propto\exp(t/2)$. Intuitively, by taking the width to infinity first, we make the coordinates independent from each other, and the processes $(Y^{i}_{l})_{1\leq l\leq L}$ become iid Markov chains. Taking the infinite-depth limit after the infinite-width limit consists of taking the infinite-depth limit of one-dimensional Markov chains. On the other hand, when we take depth to infinity first, the coordinates $(X^{i}_{t})_{1\leq i\leq n}$ remain dependent (through the volatility term $n^{-1/2}\|\phi(X_{t})\|$), which results in the Quasi-log-normal behaviour of the norm of the post-activations (1). Taking the width to infinity then yields an asymptotic norm of the post-activations equal to $\|\phi(X_{0})\|\exp(t/4)$ (2), which is the same norm as in the converse limit (3); this agrees with the growth factor $(t-s)/4$ above, and squaring it gives the $\exp(t/2)$ scaling of the variance. Taking the width to infinity also decouples the coordinates and yields the Gaussian distribution (through the McKean-Vlasov dynamics). Knowing that the variance of the pre-activations is mainly determined by the norm of the post-activations (Eq. 4), we can see why the variance is the same in both sequential limits.

Discussion on the case of multiple inputs

The result of 1 can be easily generalized to the multiple input case, and the resulting dynamics is still an SDE. The generalization to the multiple inputs case is given by 5 in the Appendix.

An important question in the literature on infinite-width neural networks is the behaviour of the correlation of the pre-activations (or the post-activations) for different inputs $a$ and $b$, which is given by $\frac{\langle Y_{\lfloor tL\rfloor}(a),Y_{\lfloor tL\rfloor}(b)\rangle}{\|Y_{\lfloor tL\rfloor}(a)\|\|Y_{\lfloor tL\rfloor}(b)\|}$. This correlation can be seen as a geometric measure of the information as it propagates through the network. In the infinite-width-then-depth limit, this correlation (generally) converges to a degenerate limit (a constant value), which results in either a constant or a sharp landscape of the network output and causes exploding or vanishing gradients . Techniques such as block scaling , or kernel shaping solve this problem and ensure that the correlation is well-behaved in the large depth limit. In our case, when the width $n$ is finite and the depth $L$ is taken to infinity, we can define the correlation for two inputs $a\neq b$ and time $t\in[0,1]$ by

Using Itô's lemma, $c_{t}$ has dynamics of the form

for some non-trivial mapping $\Psi$. Unfortunately, this kind of dynamics (which is not an SDE) is generally intractable, and we are currently investigating these dynamics for future work. However, since we scale the ResNet blocks with the factor $1/\sqrt{L}$ (Eq. 1), which is the same scaling that solves the degeneracy issue in the infinite-width-then-depth limit , it should be expected that the correlation kernel $c_{t}$ does not converge to a degenerate limit.

In Fig. 7, we simulate the correlation path in a ResNet of depth $L=200$ and width $n=20$. The paths exhibit some level of stochasticity but no degeneracy can be observed. Understanding the correlation dynamics (Eq. 5) in the infinite-depth limit of finite-width networks is an interesting open question. The infinite-width limit of these dynamics (that is, of the infinite-depth correlations) is also an interesting open question. We leave this for future work.
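A correlation path of this kind can be simulated by propagating two inputs through the same random weights. The sketch below is illustrative and is not the code used for Fig. 7: the recursion $Y_l = Y_{l-1} + L^{-1/2} W_l\,\mathrm{ReLU}(Y_{l-1})$ with $W_l$ entries drawn from $\mathcal{N}(0,1/n)$ and $W_{in}$ entries drawn from $\mathcal{N}(0,1/d)$ is an assumed form, and the function name `correlation_path` and the two example inputs are hypothetical.

```python
import numpy as np


def correlation_path(a, b, n, L, rng):
    """Cosine similarity between Y_l(a) and Y_l(b) for l = 0, ..., L.

    Both inputs share the same weights. Assumed recursion (illustrative):
        Y_0 = W_in x,   Y_l = Y_{l-1} + L^{-1/2} W_l relu(Y_{l-1}),
    with W_in entries ~ N(0, 1/d) and W_l entries ~ N(0, 1/n).
    """
    d = a.shape[0]
    W_in = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n, d))
    Y = W_in @ np.stack([a, b], axis=1)  # column 0: input a, column 1: input b
    c = np.empty(L + 1)
    for l in range(L + 1):
        if l > 0:
            W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
            Y = Y + (W @ np.maximum(Y, 0.0)) / np.sqrt(L)
        c[l] = (Y[:, 0] @ Y[:, 1]) / (
            np.linalg.norm(Y[:, 0]) * np.linalg.norm(Y[:, 1])
        )
    return c


rng = np.random.default_rng(0)
path = correlation_path(np.array([1.0, 0.0]), np.array([0.6, 0.8]), n=20, L=200, rng=rng)
print(path[0], path[-1])  # correlation at the first and last layer
```

Plotting several such paths for different seeds gives a quick visual check of whether the correlation drifts towards a constant value as the layer index grows.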

Practical implications

Our theoretical analysis has many interesting implications from a practical standpoint. Here we summarize some key insights from our results.

An important factor pertaining to the trainability of neural networks is the behaviour of the neurons (pre/post-activations). Ensuring that the neurons are well-behaved at initialization is crucial for training since the first step of any gradient-based training algorithm depends on the values of the neurons at initialization. This has led to interesting developments in initialization schemes for MLPs such as the Edge of Chaos, which ensures that the variance of the pre-activations does not (exponentially) vanish or explode in the large depth limit. In the case of ResNet, we know from the existing theory on the infinite-width limit of neural networks that scaling the residual blocks with $1/\sqrt{L}$ stabilizes the pre/post-activations in the large depth limit . Hence, we do not need a special initialization scheme with this scaling. However, one could argue that this (approximately) ensures stability only when the width is much larger than the depth. What about the other cases when $n\approx L$ or $n\ll L$? The last case can be studied by fixing the width and taking the depth to infinity. In our paper, we not only show that the neurons remain stable in fixed-width large-depth networks, but we fully characterize their behaviour when the depth is infinite and show that it follows an SDE in this limit. To summarize, we show that initializing ResNet Eq. 1 with standard Gaussian random variables and scaling the blocks with $1/\sqrt{L}$ ensures stability inside the network in large-depth (fixed-width) networks (notice that this is actually equivalent to scaling the variance of the initialization weights with $1/L$, which can be seen as an initialization scheme). Intuitively, by stabilizing the pre-activations, we also stabilize the gradients. To confirm this intuition, we show in Fig. 8 the evolution of gradient norms as they back-propagate through the network. This experiment was conducted by fixing the last layer’s gradient to a constant value and back-propagating the gradient from there. The result shows that the $1/\sqrt{L}$ scaling, along with standard Gaussian initialization, ensures well-behaved gradients, which is a desirable property for gradient-based training.

Another interesting property of the Edge of Chaos (EOC) initialization scheme for MLPs is that it ensures that the correlation kernel (correlation between the pre-activations for different inputs) does not exponentially converge to a degenerate value (constant value). (The correlation still converges to 1 with an EOC initialization. The benefit of the EOC lies in the fact that the convergence rate is much slower: polynomial vs. exponential .) We discussed some aspects of the correlation kernel in Section 5 and showed empirically that with the $1/\sqrt{L}$ scaling, the correlation is well-behaved and does not converge to degenerate values (Fig. 7).

Another issue that could occur in finite-width networks is that of network collapse, i.e. when the pre-activations in a hidden layer are all negative, which causes the post-activations to be all zero. In ResNet (Eq. 1), this implies that increasing depth beyond some level has no effect on the network output. This is problematic since the weights in those ‘inactive’ layers have zero gradient and thus will not be updated when such an event occurs. A simple way to understand network collapse is to see what happens at initialization. When the width $n$ is sufficiently large, one can expect that such an event is unlikely to occur. What about small-width neural networks? We offer a simple answer to this question: for finite-width neural networks, increasing the depth $L$ ensures that such an event is unlikely to happen. This is true even for extremely small widths, e.g. $n=2,3$, which is counter-intuitive. Empirical results in Fig. 4 support this theoretical prediction.
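The collapse phenomenon at initialization can be estimated by Monte Carlo. The following is a minimal sketch, not the code behind Fig. 4: it assumes a ReLU ResNet $Y_{l}=Y_{l-1}+\frac{1}{\sqrt{L}}W_{l}\phi(Y_{l-1})$ with i.i.d. $\mathcal{N}(0,1/n)$ weights and a standard Gaussian $Y_{0}$ conditioned on having at least one positive entry. The function name and sample sizes are hypothetical.

```python
import numpy as np

def collapse_probability(width, depth, n_trials=4000, seed=0):
    """Fraction of random ReLU ResNets in which some hidden layer has no positive pre-activation."""
    rng = np.random.default_rng(seed)
    y = rng.standard_normal((n_trials, width))
    # Condition on a non-collapsed input layer, i.e. ||phi(Y_0)|| > 0.
    y = y[(y > 0).any(axis=1)]
    for _ in range(depth):
        # Independent N(0, 1/width) weights for every trial and layer (assumption).
        w = rng.standard_normal((y.shape[0], width, width)) / np.sqrt(width)
        y = y + np.einsum("bij,bj->bi", w, np.maximum(y, 0.0)) / np.sqrt(depth)
    # Once every coordinate is <= 0 the ReLU output is zero and the state is frozen,
    # so inspecting the last layer detects a collapse at any earlier layer too.
    return float(np.mean((y <= 0).all(axis=1)))

for depth in (1, 10, 100):
    print(f"width 2, depth {depth:3d}: collapse frequency = {collapse_probability(2, depth):.4f}")
```

Under these assumptions, the estimated collapse frequency for width 2 is clearly positive for a single residual block and becomes negligible at depth 100.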

An interesting application of fixed-depth infinite-width neural networks is the so-called Neural Network Gaussian Process (NNGP). This is the Gaussian process limit of neural networks, which can be used to perform posterior inference and obtain uncertainty estimates. The converse case, i.e. fixed width and infinite depth, has however been poorly understood, and whether the infinite-depth limit of finite-width networks has some universal behaviour has remained an open question. We addressed this question in this work and showed that the limit (in the case of the ResNet architecture Eq. 1) does not admit a universal distribution (in contrast to the infinite-width limit, where the limit is a Gaussian process). More precisely, this limit is highly sensitive to the choice of the activation function.

The infinite-depth limit of infinite-width neural networks has been studied in the literature. It is known that in this limit, the network behaves as a Gaussian process with a well-defined kernel. What about the converse limit, i.e. the infinite-width limit of infinite-depth networks? This has so far been an open question, and our work addresses one part of it. We show that the marginal distributions are zero-mean Gaussians with the same variance as in the infinite-width-then-depth limit. Characterizing the full covariance kernel, however, is still an open question (see Section 5 for a discussion on this topic).

Conclusion, discussion, and limitations

Understanding the limiting laws of randomly initialized neural networks is important on many levels. Primarily, understanding these limiting laws allows us to derive new designs that are immune to exploding/vanishing pre-activations/gradients phenomena. Next, they also enable a deeper understanding of overparameterized neural networks, and (often) yield many interesting (and simple) justifications for the apparent advantage of overparameterization. So far, the focus has been mainly on the infinite-width limit (and the infinite-width-then-infinite-depth limit), with few developments on the joint limit. Our work adds to this stream of papers by studying the infinite-depth limit of finite-width neural networks. We showed that unlike the infinite-width limit, where we always obtain (under some mild conditions on the activation function) a Gaussian distribution, the infinite-depth limit is highly sensitive to the choice of the activation function; using Itô’s lemma, we showed how we can obtain certain known distributions by carefully tuning the activation function. In the general-width case, we showed an important characteristic of infinite-depth neural networks with general activation functions (including ReLU, conditionally on $\|\phi(X_{0})\|>0$): the probability of process collapse is zero, meaning that with probability one, the process $X_{t}$ does not get stuck at any absorbent point. This is not true for finite-depth ResNets, as we can see in Fig. 4, which highlights the fact that as we increase depth, the collapse probability tends to decrease and eventually converges to zero in the infinite-depth limit, in agreement with our results.

This work, although novel in many aspects, is still far from depicting a complete picture of the infinite-depth limit of finite-width networks. There are still numerous interesting open questions in this research direction. One of these is the dynamics of the gradient, and more specifically the behaviour of the NTK in the infinite-depth limit of finite-width neural networks. For instance, we already know that in the joint infinite-width-depth limit of MLPs, the NTK is random; but what happens when the width is fixed and the depth goes to infinity? In the MLP case, a degenerate NTK should be expected. Hence, questions remain as to whether a suitable scaling leads to an interesting (non-degenerate) infinite-depth limit of the NTK, as is the case for the infinite-depth limit of the infinite-width NTK.

References

Appendix

The following result gives conditions under which a strong solution of a given SDE exists, and is unique.

Let $n\geq 1$, and consider the following SDE

Then, for all $T\geq 0$, there exists a unique strong solution of the SDE above.

A.2 Itô’s lemma

The following result, known as Itô’s lemma, is a classic result in stochastic calculus. We state a version of this result from . Other versions and extensions exist in the literature (e.g. ).

Let $X_{t}$ be an Itô diffusion process (Definition 1) of the form

where $\nabla_{x}f$ and $\nabla_{x}^{2}f$ refer to the gradient and the Hessian, respectively. This can also be expressed as an SDE

A.3 Convergence of Euler’s scheme to the SDE solution

The following result gives a convergence rate of the Euler discretization scheme to the solution of the SDE.

where $Y^{i},\mu^{i},\sigma^{i,j}$ denote the coordinates of these vectors for $i\in[d],j\in[m]$, and $\Delta B^{j}_{k}\sim\mathcal{N}(0,\delta)$. Then, we have that

We can extend the result of 5 to the case of locally Lipschitz drift and volatility functions $\mu$ and $\sigma$. For this purpose, let us first define local convergence.

Let $(X^{L})_{L\geq 1}$ be a sequence of processes and $X$ be a stochastic process. For $r>0$, define the following stopping times

We say that $X^{L}$ converges locally to $X$ if for any $r>0$, $X^{L}_{t\land\tau^{L}}$ converges to $X_{t\land\tau}$. This definition is general for any type of convergence; we will specify the type of convergence whenever we use this notion of local convergence.

Consider the same setting of 5 with the following conditions instead

where $\tau_{\delta}=\inf\{t\geq 0:\|Y_{\lfloor t\delta^{-1}\rfloor}\|>r\}$, and $\tau=\inf\{t\geq 0:\|X_{t}\|>r\}$.

We omit the proof here as it uses the same techniques as in , the only difference being that we consider the stopped process $X^{\tau}$. By stopping the process, we force it to stay in a region where the coefficients are Lipschitz.
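The convergence of the Euler scheme can be observed numerically on an SDE whose solution is known in closed form. The sketch below uses the geometric Brownian motion $dX_{t}=aX_{t}\,dt+\sigma X_{t}\,dB_{t}$ as a test case (a choice made here for illustration, with Lipschitz coefficients) and compares the Euler endpoint with the exact endpoint driven by the same Brownian increments. The function name, the parameter values, and the step counts are assumptions, and the printed errors are only indicative of the rate.

```python
import numpy as np

def euler_strong_error(n_steps, a=0.1, sigma=0.5, x0=1.0, n_paths=5000, seed=0):
    """Mean |Euler endpoint - exact endpoint| at t = 1 for dX = a X dt + sigma X dB."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    # Brownian increments, each distributed as N(0, dt).
    db = rng.standard_normal((n_paths, n_steps)) * np.sqrt(dt)
    x = np.full(n_paths, x0)
    for k in range(n_steps):
        x = x + a * x * dt + sigma * x * db[:, k]
    # Exact solution at t = 1 built from the same increments: B_1 = sum of increments.
    exact = x0 * np.exp((a - 0.5 * sigma**2) + sigma * db.sum(axis=1))
    return float(np.mean(np.abs(x - exact)))

coarse = euler_strong_error(n_steps=16)
fine = euler_strong_error(n_steps=256)
print(f"strong error with  16 steps: {coarse:.4f}")
print(f"strong error with 256 steps: {fine:.4f}")
```

In the setting of this paper the step size corresponds to $1/L$, so refining the grid plays the role of increasing the depth.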

A.4 Convergence of particles to the solution of the McKean-Vlasov process

The next result gives sufficient conditions for the system of particles to converge to its mean-field limit, known as the McKean-Vlasov process.

Proof This is a direct result of Thm 3 in . The bounded moment condition holds for $k=1$ (dimension of the particles), and the conclusion is straightforward.

A.5 Other results from probability and stochastic calculus

The next trivial lemma has been opportunely used in to derive the limiting distribution of the network output (multi-layer perceptron) in the joint infinite width-depth limit. This simple result will also prove useful in our case of the finite-width-infinite-depth limit.

This concludes the proof, as the latter is the characteristic function of a Gaussian random vector with identity covariance matrix.

The next theorem shows when a stochastic process (ito)

Let $(X_{t})_{t\in[0,T]}$ and $(Y_{t})_{t\in[0,T]}$ be two stochastic processes given by

Let $\mathcal{N}_{t}=\sigma((Y_{s})_{s\leq t})$ be the $\sigma$-algebra generated by $\{Y_{s}:s\leq t\}$. Using Itô’s lemma, we have that for $s<t$,

Hence, $M_{t}$ is a martingale (w.r.t. $\mathcal{N}_{t}$). We conclude that $Y_{t}$ has the same law as $X_{t}$ by the uniqueness of the solution of the martingale problem (see 8.3.6 in ).

The next result is a simple corollary of the existence and uniqueness of the strong solution of an SDE under the Lipschitz conditions on the drift and the volatility. It basically shows that a zero-drift process collapses (becomes constant) once the volatility is zero.

If $g(Z_{0})=0$, then $Z_{t}=Z_{0}$ almost surely.

Proof This follows from the uniqueness of the strong solution of an SDE (4).

A.6 Proof of 1

We are now ready to prove the following result.

where the constant in $\mathcal{O}$ does not depend on $t$. Moreover, if the activation function $\phi$ is only locally Lipschitz, then $X^{L}_{t}$ converges locally to $X_{t}$. More precisely, for any fixed $r>0$, we consider the stopping times

then the stopped process $X^{L}_{t\land\tau^{L}}$ converges in distribution to the stopped solution $X_{t\land\tau}$ of the above SDE.

Proof The proof is based on 5 in the appendix. It remains to express Eq. 1 in the required form and make sure all the conditions are satisfied for the result to hold. Using Lemma 6, we can write Eq. 1 as

Now let $\Psi$ be $K$-Lipschitz for some constant $K>0$. We have that

where $\bar{Y}$ is the Euler scheme as in 5, and where we have used the fact that $Y_{\lfloor tL\rfloor}$ and $\bar{Y}_{\lfloor tL\rfloor}$ have the same distribution.

The result of 1 can be generalized to the case with multiple inputs with minimal changes in the proof. We summarize this result in the next proposition.

where $(\bm{B}_{t})_{t\geq 0}$ is a $kn$-dimensional Brownian motion (Wiener process), independent from $W_{in}$, and $\Sigma(\bm{X}^{k}_{t})$ is the covariance matrix given by

where $\alpha_{i,j}=\langle\phi(\bm{X}_{t}^{k,i}),\phi(\bm{X}_{t}^{k,j})\rangle$, with $((X_{t}^{k,1})^{\top},\dots,(X_{t}^{k,k})^{\top})^{\top}\overset{def}{=}\bm{X}_{t}^{k}$. Moreover, if the activation function $\phi$ is only locally Lipschitz, then $\bm{X}^{L,k}_{t}$ converges locally to $\bm{X}_{t}^{k}$. More precisely, for any fixed $r>0$, we consider the stopping times $\tau^{L}=\inf\{t\geq 0:\|\bm{X}^{L,k}_{t}\|\geq r\}$ and $\tau=\inf\{t\geq 0:\|\bm{X}^{k}_{t}\|\geq r\}$; then the stopped process $\bm{X}^{L,k}_{t\land\tau^{L}}$ converges in distribution to the stopped solution $\bm{X}^{k}_{t\land\tau}$ of the above SDE.

Proof The proof is similar to that of 1. The only difference lies in the definition of the Gaussian vector $\zeta^{L}_{l}$. In this case, we have for all $x_{i}$

where $\zeta^{L}_{l-1}(Y_{l-1}(x_{i}))\overset{def}{=}\sqrt{n}W_{l}\phi(Y_{l-1}(x_{i}))$. Concatenating these identities yields

where $\bm{\zeta}^{L}_{l-1}$ is the concatenation of the vectors $\zeta^{L}_{l-1}(Y_{l-1}(x_{i}))$ for $i=1,\dots,k$. It is straightforward that the covariance matrix of the Gaussian vector $\bm{\zeta}^{L}_{l-1}$ is given by the matrix $\Sigma$ above (with $X$ replaced by $Y$). We conclude using 5.

B Some technical results for the proofs

In the next lemma, we provide an approximate stochastic process $X^{m}$ to $X$ that differs from $X$ in the volatility term. The upper bound on the $L_{2}$ norm of the difference between $X^{m}$ and $X$ will prove useful in the proofs of other results. The proof of this lemma requires Gronwall’s lemma, a tool that is often used in stochastic calculus.

where $\phi_{m}(z)=\int_{0}^{z}h(mu)du$, $h$ is the sigmoid function given by $h(u)=(1+e^{-u})^{-1}$, $\phi$ is the ReLU activation function, and $(B_{t})_{t\geq 0}$ is an $n$-dimensional Brownian motion. We have the following

Using Itô isometry and the fact that $(\|\phi_{m}(X^{m}_{s})\|-\|\phi(X_{s})\|)^{2}\leq\|\phi_{m}(X^{m}_{s})-\phi(X_{s})\|^{2}$, we obtain

where we have used Lemma 9 and the fact that ReLU is $1$-Lipschitz. We conclude using Gronwall’s lemma.

B.2 Approximation of $\phi$

The next lemma provides a simple upper bound on the distance between the ReLU activation $\phi$ and an approximate function $\phi_{m}$ that converges to $\phi$ in the limit of large $m$.

For the case where $z\leq 0$, the proof is the same. We have that
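The quality of the approximation $\phi_{m}\approx\phi$ can also be checked numerically. The sketch below relies on the closed form $\phi_{m}(z)=m^{-1}(\log(1+e^{mz})-\log(2))$ quoted in the proof of Lemma 3, verifies it against a direct quadrature of $\int_{0}^{z}h(mu)du$, and reports the largest gap to ReLU on a grid. The $\log(2)/m$ level mentioned in the comments follows from the closed form and is not a restatement of the bound in the lemma; function names and grid sizes are illustrative.

```python
import numpy as np

def phi_m(z, m):
    """Closed form of int_0^z h(m u) du with h the sigmoid: (log(1 + e^{m z}) - log 2) / m."""
    # logaddexp(0, m z) = log(1 + e^{m z}), evaluated without overflow.
    return (np.logaddexp(0.0, m * z) - np.log(2.0)) / m

def phi_m_quadrature(z, m, n_grid=20001):
    """The same integral evaluated with the trapezoidal rule, as an independent check."""
    u = np.linspace(0.0, z, n_grid)
    h = 1.0 / (1.0 + np.exp(-m * u))
    return float(np.sum((h[1:] + h[:-1]) * np.diff(u)) / 2.0)

z = np.linspace(-5.0, 5.0, 2001)
for m in (1, 10, 100):
    # From the closed form, |phi_m(z) - ReLU(z)| never exceeds log(2) / m.
    gap = np.max(np.abs(phi_m(z, m) - np.maximum(z, 0.0)))
    print(f"m = {m:3d}: max |phi_m - ReLU| on [-5, 5] = {gap:.5f}  (log(2)/m = {np.log(2.0) / m:.5f})")
```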

B.3 Other lemmas

The next lemma shows that the logarithmic growth factor $\log\left(\frac{\|\phi_{m}(X^{m}_{t})\|}{\|\phi_{m}(X^{m}_{0})\|}\right)$ converges to $\log\left(\frac{\|\phi(X_{t})\|}{\|\phi(X_{0})\|}\right)$ when $m$ goes to infinity, where the convergence holds in $L_{1}$. The key ingredient is the use of uniform integrability coupled with convergence in probability, which is sufficient to conclude the $L_{1}$ convergence. This result will help us conclude in the proof of 1.

where $\phi_{m}(z)=\int_{0}^{z}h(mu)du$, $h$ is the sigmoid function given by $h(u)=(1+e^{-u})^{-1}$, $\phi$ is the ReLU activation function, and $(B_{t})_{t\geq 0}$ is an $n$-dimensional Brownian motion. Then, conditionally on the fact that $\|\phi(X_{0})\|>0$, we have that

Let $t>0$. From Lemma 8, we know that $X^{m}$ converges in $L^{2}$ to $X$. Using Lemma 9 and the fact that ReLU is $1$-Lipschitz, we obtain

which implies that $\phi_{m}(X^{m}_{t})$ converges in $L^{2}$ to $\phi(X_{t})$. In particular, the convergence holds in probability. Using this fact with the continuous mapping theorem, we obtain that

where the first term converges to zero by Eq. 8, and the second term converges to zero by Lemma 9. Hence, the convergence in probability holds.

To conclude, it suffices to show that the sequence of random variables $\left(Y^{m}_{t}=\log\left(\frac{\|\phi_{m}(X^{m}_{t})\|}{\|\phi_{m}(X^{m}_{0})\|}\right)\right)_{m\geq 1}$ is uniformly integrable. Let $K>0$. From the proof of Lemma 2, with $\zeta=\phi_{m}$, we have that

where $\sigma_{i}(X^{m}_{s})=\frac{|\phi_{m}^{\prime}(X^{m,i}_{s})\phi_{m}(X^{m,i}_{s})|}{\|\phi_{m}(X^{m}_{s})\|}$, and $\mu(X^{m}_{s})=\frac{1}{2}\sum_{i=1}^{n}\left(\phi_{m}^{\prime\prime}(X^{m,i}_{s})\phi_{m}(X^{m,i}_{s})+\phi_{m}^{\prime}(X^{m,i}_{s})^{2}\right)-\frac{\|\phi_{m}^{\prime}(X^{m}_{s})\circ\phi_{m}(X^{m}_{s})\|^{2}}{\|\phi_{m}(X^{m}_{s})\|^{2}}$. Therefore,

where $\phi$ is the ReLU activation function, and $(B_{t})_{t\geq 0}$ is an $n$-dimensional Brownian motion independent from $X_{0}$. Then, conditionally on the fact that $\|\phi(X_{0})\|>0$, we have that for all $s\in$, $i\in[n]$

where the bound holds uniformly over ss\in.

C The Ornstein-Uhlenbeck (OU) process

The OU process is the (unique) strong solution to the following diffusion

$\textup{Cov}(X_{t},X_{s})=\frac{\sigma^{2}}{2a}\left(e^{-a|t-s|}-e^{-a(t+s)}\right)$.

Proof Consider the process $Z_{t}=e^{at}X_{t}$. Using Itô’s lemma, we have that

We conclude by multiplying both sides by $e^{-at}$.
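The covariance formula can be compared with a direct simulation. The sketch below assumes the standard OU form $dX_{t}=a(b-X_{t})\,dt+\sigma\,dB_{t}$ with a deterministic starting point (an assumption, chosen because it is consistent with the covariance stated above) and estimates $\textup{Cov}(X_{t},X_{s})$ from Euler-Maruyama paths. The function names and parameter values are illustrative.

```python
import numpy as np

def simulate_ou(a, b, sigma, x0, n_steps=200, n_paths=20000, seed=0):
    """Euler-Maruyama paths of dX = a (b - X) dt + sigma dB on [0, 1].

    Returns an array of shape (n_paths, n_steps + 1).
    """
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    x = np.empty((n_paths, n_steps + 1))
    x[:, 0] = x0
    for k in range(n_steps):
        noise = rng.standard_normal(n_paths) * np.sqrt(dt)
        x[:, k + 1] = x[:, k] + a * (b - x[:, k]) * dt + sigma * noise
    return x

def ou_cov(a, sigma, t, s):
    """Closed-form covariance sigma^2 / (2a) * (exp(-a|t-s|) - exp(-a(t+s)))."""
    return sigma**2 / (2 * a) * (np.exp(-a * abs(t - s)) - np.exp(-a * (t + s)))

a, b, sigma = 2.0, 0.5, 1.0
paths = simulate_ou(a, b, sigma, x0=0.0)
i_s, i_t = 100, 200  # grid indices of s = 0.5 and t = 1.0 when n_steps = 200
empirical = np.cov(paths[:, i_s], paths[:, i_t])[0, 1]
print(f"empirical covariance:   {empirical:.4f}")
print(f"closed-form covariance: {ou_cov(a, sigma, 1.0, 0.5):.4f}")
```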

We would like to find sufficient conditions on the activation function $\phi$ and a function $g$ such that the process $g(X_{t})$ (Eq. 7) follows the OU dynamics. For this purpose, we proceed by reverse-engineering the problem. Using Itô’s lemma (Eq. 3), this is satisfied when there exist constants $a,b,\sigma$ such that

This implies that $\frac{g^{\prime\prime}(y)}{g^{\prime 2}(y)}=2a\sigma^{-2}(b-g(y))$. Letting $G=\int g$ be the primitive function of $g$, we obtain that $G$ satisfies a differential equation of the form

for some constants $\zeta,\gamma$. Integrating the left-hand side yields

where Erfi is the imaginary error function (although the name might be misleading, the imaginary error function is real when the input is real), given by

To alleviate the notation, we denote $h:=\textrm{Erfi}$ in the rest of this section. From the above, $G$ should have the form

where $\alpha,\beta,\zeta$ are all constants, and $h^{-1}$ is the inverse function of the imaginary error function. We conclude that the activation function $\phi$ should have the form

In this case, the coefficients $a$ and $b$ are given by

Letting $g=G^{\prime}$, the process $g(X_{t})$ has the following dynamics

Hence $g(X_{t})$ is an OU process, and we can conclude that the network output in the infinite-depth limit $X_{1}$ satisfies

We can then infer the distribution of $X_{1}$ by a simple change of variable. Note that this distribution is non-trivial, and unlike the infinite-width limit of the same ResNet () where the distribution is Gaussian, here the distribution of the pre-activations is directly impacted by the choice of the activation function $\phi$.

However, with this particular choice of the activation function $\phi$, the existence of the process $X$ can only be proven in the local sense, because $\phi$ is only locally Lipschitz. Let us first show this in the next lemma. We will see how we can mitigate this issue later.

Proof It suffices to show that the derivative of $\phi$ is locally bounded to conclude. We have that

Now we can rigorously prove the following result.

Consider the stochastic process $X_{t}$ defined by

where $a=\frac{\sigma^{2}}{\alpha^{2}\pi}\exp(-2\zeta)$.

Proof For $N>0$, consider the stopping time $\tau_{N}$ defined by

Using the continuity of paths of $X$, it is straightforward that $\lim_{N\to\infty}\tau_{N}=\infty$ almost surely. Let $N>0$ be large enough. The SDE satisfied by the process $X$ has a unique strong solution for $t\in[0,\tau_{N})$ since the activation function $\phi$ is Lipschitz on the interval $(-N,N)$. By applying Itô’s lemma for $t\in(0,\tau_{N})$, we have that

(from previous results). Using the fact that $\lim_{N\to\infty}\tau_{N}=\infty$ almost surely, and taking $N$ large enough, we obtain that for all $t\in(0,1]$, we have that

D The Geometric Brownian Motion (GBM)

The GBM dynamics refers to stochastic differential equations of the form

where $a,\sigma$ are constants and $B$ is a one-dimensional Brownian motion. This SDE has played a crucial role in financial mathematics and is often used as a model of stock prices. It admits a closed-form solution given in the next lemma.

The distribution of $X_{t}$ is known as a log-Gaussian distribution. Moreover, the solution is unique.

Proof The existence and uniqueness of the solution follow from 4. Indeed, it suffices to have the drift and the volatility both Lipschitz to obtain the result. This is satisfied in the case of GBM. Now consider the process $Z_{t}=\log(X_{t})$ (notice that here, $X_{t}$ should be positive in order to consider $\log(X_{t})$; this is easy to show and the proof is similar to that of Lemma 15). Using Itô’s lemma, it is easy to verify that

Now let us find sufficient conditions under which the infinite-depth network represented by the process $X$ has a GBM behaviour. In order for this to hold, it suffices to have

This implies $\frac{g^{\prime\prime}}{g^{\prime 2}}\propto\frac{1}{g}$, or equivalently $\frac{g^{\prime\prime}}{g^{\prime}}\propto\frac{g^{\prime}}{g}$, which in turn yields $\log(|g^{\prime}|)=\alpha\log(|g|)+\beta$, and therefore $|g^{\prime}|\propto|g|^{\zeta}$. Assuming that $g^{\prime},g>0$, we can easily verify that functions of the form $g(y)=\alpha(y+\beta)^{\gamma}$ where $\alpha,\beta,\gamma>0$ satisfy the requirements. Hence, the activation function should satisfy $\phi(y)=\sigma\gamma^{-1}(y+\beta)$, i.e. the activation should be linear. In this case, we have $a=\frac{1}{2}\sigma^{2}\gamma^{-1}(\gamma-1)$ and the process $g(X_{t})$ has the following GBM dynamics

Observe that in the special case of $\gamma=1,\beta=0,\alpha=1$, we have $g(y)=y$ and $a=0$. In this case, we obtain $Y_{1}\sim Y_{0}\exp\left(-\frac{1}{2}\sigma^{2}+\sigma B_{1}\right)$.
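As a check on the coefficient $a$, the drift and volatility of $g(X_{t})$ can be worked out directly from Itô’s lemma. The lines below are a sketch under two assumptions made here for illustration: the width-one dynamics $dX_{t}=\phi(X_{t})\,dB_{t}$, and $y+\beta>0$ so that the powers are well defined. They use $g(y)=\alpha(y+\beta)^{\gamma}$ and $\phi(y)=\sigma\gamma^{-1}(y+\beta)$ as above.

```latex
\begin{aligned}
g'(y)\,\phi(y)
  &= \alpha\gamma\,(y+\beta)^{\gamma-1}\cdot\sigma\gamma^{-1}(y+\beta)
   = \sigma\, g(y),\\
\tfrac{1}{2}\,g''(y)\,\phi(y)^{2}
  &= \tfrac{1}{2}\,\alpha\gamma(\gamma-1)\,(y+\beta)^{\gamma-2}\cdot\sigma^{2}\gamma^{-2}(y+\beta)^{2}
   = \tfrac{1}{2}\,\sigma^{2}\gamma^{-1}(\gamma-1)\, g(y),\\
dg(X_{t})
  &= \tfrac{1}{2}\,g''(X_{t})\,\phi(X_{t})^{2}\,dt + g'(X_{t})\,\phi(X_{t})\,dB_{t}
   = a\, g(X_{t})\,dt + \sigma\, g(X_{t})\,dB_{t},
  \qquad a=\tfrac{1}{2}\,\sigma^{2}\gamma^{-1}(\gamma-1).
\end{aligned}
```

Setting $\gamma=1$ removes the drift, which recovers the special case $a=0$ mentioned above.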

We summarize the previous results in the following proposition.

where $\gamma=\sigma\alpha^{-1}$. Consider the stochastic process $X_{t}$ defined by

Then, the process $g(X_{t})$ satisfies the following GBM dynamics

where a=12σ2γ1(γ1)a=\frac{1}{2}\sigma^{2}\gamma^{-1}(\gamma-1). As a result, we have that for all tt\in,

E ReLU in the case $n=d=1$

Consider the process $X$ given by the SDE

It is straightforward that if Xs0X_{s}\leq 0 for some ss\in, then for all ts,Xt=Xst\geq s,X_{t}=X_{s}. This is because dXt=0×dBtdX_{t}=0\times dB_{t} whenever Xt0X_{t}\leq 0. A rigorous justification is provided in Lemma 1. Hence, the event {Xs0}\{X_{s}\leq 0\} constitutes a stopping event where the process becomes constant. We also say that is an absorbent point of the process XX. A classic tool in stochastic calculus to deal with such situations is the notion of stopping time which is a random variable that depend on the trajectory of XX (or equivalently on the natural filtration Ft\mathcal{F}_{t} associated with the Brownian motion BB). Consider the following stopping time

Observe that we have for all t[0,τ]t\in[0,\tau]

which implies that $Y_{t}$ is a Geometric Brownian motion in the interval $[0,\tau]$. Hence, if $\tau>1$ (a.s.), the network output also has a log-normal distribution in the infinite-depth limit. In the next lemma, we show that $\tau=\infty$ with probability $1$, which confirms the above.

Let $\tau$ be the stopping time defined by Eq. 12. We have that

Proof By continuity of the Brownian path and the ReLU function $\phi$, the paths of the process $X$ are also continuous (this is a classic result in stochastic calculus; more rigorously, $X$ can be chosen to have continuous paths with probability 1). We have that $\tau>0$ almost surely. From the observation above, taking the limit $t\to\tau^{-}$ and using the continuity, we obtain

For some $\omega\in\{\tau<\infty\}$, we have that $X_{\tau}(\omega)=0$ (by continuity). Hence $-\frac{1}{2}\tau(\omega)+X_{\tau}(\omega)=-\infty$. This happens with probability zero, which means that the event $\{\tau<\infty\}$ has probability zero. This concludes the proof.

Hence, with the ReLU activation function, given $X_{0}>0$, the network output is distributed as

Now let us go back to the original setup for X0X_{0}. Recall that X0=WinxX_{0}=W_{in}x for some x0x\neq 0 and WinN(0,1)W_{in}\sim\mathcal{N}(0,1). By conditioning on X0X_{0} and observing that is an absorbent point of the process XX, we obtain that

We summarize these results in the next proposition.

Then, the process XX is a mixture of a Geometric Brownian motion and a constant process. More precisely, we have for all tt\in

Hence, conditionally on $X_{0}>0$, the process $X$ is a Geometric Brownian motion.
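The mixture behaviour can be reproduced with a width-one network of large but finite depth. The sketch below assumes the scalar recursion $Y_{l}=Y_{l-1}+\frac{1}{\sqrt{L}}W_{l}\phi(Y_{l-1})$ with $W_{l}\sim\mathcal{N}(0,1)$ and a standard Gaussian $Y_{0}$ (i.e. $x=1$). It checks that trajectories started at $Y_{0}\leq 0$ never move, and that for $Y_{0}>0$ the log-ratio $\log(Y_{L}/Y_{0})$ has mean close to $-1/2$ and variance close to $1$, as expected for $\exp(-\frac{1}{2}+B_{1})$. The function name, depth, and sample size are illustrative.

```python
import numpy as np

def width_one_relu_resnet(y0, depth, rng):
    """Run Y_l = Y_{l-1} + W_l * relu(Y_{l-1}) / sqrt(depth) with scalar W_l ~ N(0, 1)."""
    y = np.array(y0, dtype=float)
    for _ in range(depth):
        w = rng.standard_normal(y.shape)
        y = y + w * np.maximum(y, 0.0) / np.sqrt(depth)
    return y

rng = np.random.default_rng(0)
y0 = rng.standard_normal(20000)          # Y_0 = W_in * x with x = 1 (assumption)
y_out = width_one_relu_resnet(y0, depth=400, rng=rng)

neg = y0 <= 0
pos = ~neg
# Non-positive starting points are absorbed: the ReLU output is zero at every layer.
frozen = np.array_equal(y_out[neg], y0[neg])
# Positive starting points: compare log(Y_L / Y_0) with the N(-1/2, 1) prediction.
log_growth = np.log(y_out[pos] / y0[pos])
print(f"paths with Y_0 <= 0 unchanged: {frozen}")
print(f"mean of log(Y_L / Y_0) given Y_0 > 0:     {log_growth.mean():.3f}")
print(f"variance of log(Y_L / Y_0) given Y_0 > 0: {log_growth.var():.3f}")
```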

F Proof of Lemma 2 and Lemma 3

Proof It is straightforward that with probability $1$ we have $\|\phi(X_{0})\|>0$, which implies that with probability $1$, $\tau>0$. Let $t<\tau$. Using Itô’s lemma with the function $g(x)=\frac{1}{2}\log(\|\zeta(x)\|^{2})$, we obtain

where $\sigma_{i}(X_{s})=\frac{|\phi^{\prime}(X^{i}_{s})\phi(X^{i}_{s})|}{\|\phi(X_{s})\|}$, and $\mu(X_{s})=\frac{1}{2}\sum_{i=1}^{n}\left(\phi^{\prime\prime}(X^{i}_{s})\phi(X^{i}_{s})+\phi^{\prime}(X^{i}_{s})^{2}\right)-\frac{\|\phi^{\prime}(X_{s})\circ\phi(X_{s})\|^{2}}{\|\phi(X_{s})\|^{2}}$, and $\circ$ refers to the Hadamard product of vectors, i.e. the coordinate-wise product.

For some $\omega\in\{\tau<\infty\}$, using the path continuity of the process $X$ and the continuity of $g$, we have that $\lim_{t\to\tau(\omega)^{-}}g(X_{t}(\omega))=-\infty$. Therefore, we should also have

Hence, the random variable $\frac{1}{\sqrt{n}}\int_{0}^{t}\mu(X_{s})ds+\frac{1}{2n}\int_{0}^{t}\sigma(X_{s})d\hat{B}_{s}$ is finite with probability $1$. We conclude that

Lemma 3. Consider the stochastic process (7) given by the SDE

where $\phi$ is the ReLU activation function, and $(B_{t})_{t\geq 0}$ is an $n$-dimensional Brownian motion. Let $\tau$ be the stopping time given by

Proof Let $t_{0}>0$. Using Lemma 7, we know that if for some $t_{1}$, $\|\phi(X_{t_{1}})\|=0$, then for all $t\geq t_{1}$, we have that $X_{t}=X_{t_{1}}$ and $\|\phi(X_{t})\|=0$. Hence, we have that

Let $m\geq 1$ and consider the function $\phi_{m}(z)=\int_{0}^{z}h(m\,u)du$, where $h(t)=(1+e^{-t})^{-1}$ is the sigmoid function. (Note that $\phi_{m}$ has a closed-form formula given by $\phi_{m}(z)=m^{-1}(\log(1+e^{mz})-\log(2))$, which can be seen as a shifted and scaled version of the Softplus function. However, we do not need the closed-form formula in our analysis.) It is straightforward that $\phi_{m}$ satisfies the conditions of Lemma 2. Let $X^{m}$ be the solution of the following SDE (the solution exists and is unique since $\phi_{m}$ is trivially Lipschitz)

We know from Lemma 8 that $X^{m}_{t}$ converges in $L^{2}$ to $X_{t}$ (uniformly over $t\in[0,T]$ for any $T>0$). In particular, this implies convergence in distribution. Moreover, observe that for all $t$

where we used the triangle inequality and the upper bound from Lemma 9. Thus, we have that $\phi_{m}(X^{m}_{t})$ converges in $L^{2}$ (and in distribution) to $\phi(X_{t})$.

Let $\delta_{k}=[1/(k+1),1/{k})$ for $k\geq 1$, and define $\delta_{0}=[1,\infty)$. For $m\geq 1$, using Lemma 9, we have that

Given $k\geq 0$, we have that for $m>n^{1/2}(k+1)$,

Let us deal with the first term. Using Lemma 9, we have that

From Lemma 10, we know that the random variable $\log\left(\frac{\|\phi_{m}(X^{m}_{t_{0}})\|}{\|\phi_{m}(X^{m}_{0})\|}\right)$ converges in $L^{1}$ and thus it is bounded in $L^{1}$ norm (over $m$). Therefore, a simple application of Markov’s inequality yields that the probability above goes to $0$ when $m$ goes to $\infty$.

G Proof of 1

where $\mu_{s}=\frac{1}{2}\|\phi^{\prime}(X_{s})\|^{2}-1$, and $(\hat{B})_{t\geq 0}$ is a one-dimensional Brownian motion. As a result, we have that for all $0\leq s\leq t\leq 1$

Proof Let $t\in$. Let us first consider the case where $\|\phi(X_{0})\|=0$. For all $t$ we have $\|\phi(X_{t})\|=0$ and the result is trivial.

Ideally, we would like to use Itô’s lemma and Lemma 3, which ensures that $\|\phi(X_{t})\|$ remains positive on , and obtain for all $t\in$

where $\phi_{m}(t)=\int_{0}^{t}h(m\,u)du$ and $h(t)=(1+e^{-t})^{-1}$ is the sigmoid function. We have that

Let $X^{m}$ be the solution of the following SDE

and $\sigma^{m,i}_{s}=\frac{|h(mX^{m,i}_{s})\circ\phi_{m}(X^{m,i}_{s})|}{\|\phi_{m}(X^{m}_{s})\|}$. By Lemma 10, we know that $g_{m}(X^{m}_{s})-g_{m}(X^{m}_{0})$ converges in $L_{1}$ to $g(X_{s})-g(X_{0})$. Let us now compute the limit of $g_{m}(X^{m}_{s})-g_{m}(X^{m}_{0})$ from the equation above to conclude. More precisely, let us show that for all $s\in(0,1)$

where $\mu_{s}=\frac{1}{2}\|\phi^{\prime}(X^{i}_{s})\|^{2}-1$, and $\sigma_{s}^{i}=\frac{\phi(X_{s}^{i})}{\|\phi(X_{s})\|}$.

Let $t\in(0,1]$. Using the triangle inequality, Itô isometry, and the Cauchy-Schwarz inequality, we have that

where $J^{m}_{i}=mh(mX^{m,i}_{s})(1-h(mX^{m,i}_{s}))\phi_{m}(X^{m,i}_{s})+h(mX^{m,i}_{s})^{2}$, and $G^{m}=\frac{\|h(mX^{m}_{s})\circ\phi_{m}(X^{m}_{s})\|^{2}}{\|\phi_{m}(X^{m}_{s})\|^{2}}$. Let us start with the term $G^{m}$. Observe that $G^{m}\leq 1$ almost surely. We have that

When $\min_{i}|x^{i}|\geq\log(m)/m$, we have that for all $i\in[n]$

For the remaining term, using the fact that $1-h^{2}\leq 1$, we have that

for some constant $C_{n}$ that depends on $n$. Therefore, for all $i$, we have

Recall that $X^{i}_{s}=X^{i}_{0}+\frac{1}{\sqrt{n}}\int_{0}^{s}\|\phi(X_{u})\|dB_{u}$. By Lemma 11, we know that

Let us now deal with the last term in $J^{m}_{i}$. We have that

Notice that $\tau_{\epsilon}>0$ almost surely since $\|\phi(X_{0})\|\in(\epsilon,\epsilon^{-1})$.

The second term can be upper-bounded as follows

Moreover, letting $E^{c}$ be the complementary event of $E$, we have

Using the fact that $\|\phi_{m}(X^{m}_{s})\|\leq\frac{\sqrt{n}}{m}+\|\phi(X_{s})\|+\|X^{m}_{s}-X_{s}\|$ (by Lemma 9 and the fact that ReLU is Lipschitz), we obtain

Recall that this holds for any $\epsilon$ small enough. Observe that $\tau_{\epsilon}$ is almost surely non-decreasing as we decrease $\epsilon$. Hence $\tau_{\epsilon}$ has a limit almost surely. Using Lemma 3 and the continuity of the paths of $X_{s}$, we have that $\lim_{\epsilon\to 0^{+}}\tau_{\epsilon}=\infty$. Taking the limit $\epsilon\to 0^{+}$, we conclude that almost surely we have

which yields the desired result for the conditional mean by subtraction.

Now let us deal with the variance. To alleviate the notation, we omit the conditioning on the event $\{\|\phi(X_{0})\|>0\}$. All the expectations below are taken conditionally on this event. Let $0\leq s\leq t\leq 1$. Let $\lambda>0$. We have that

where we have used the exchangeability property of the family $\{\phi^{\prime}(X^{i}_{u}),i=1,\dots n\}$. Thus, for the variance $\textrm{Var}\mu_{u}^{2}$, we obtain

where $\Gamma_{s,t}\overset{def}{=}\int_{s}^{t}\frac{1}{4}((p^{u}_{2}-(p^{u}_{1})^{2})+n^{-1}(p^{u}_{1}-p^{u}_{2}))\,du$. Optimizing over $\lambda$ yields

The term $\Gamma_{s,t}$ can be shown to have $\mathcal{O}(n^{-1/2})$ asymptotic behaviour using tools from McKean-Vlasov theory. Thus, the variance term has (at most) $\mathcal{O}(n^{-1})$ behaviour.

H Proof of 2

In this section, we provide the proof of 2. We use the following law of large numbers, which does not require independence.

Let $(Y_{n}^{i})_{1\leq i\leq n,n\geq 1}$ be a triangular array of random variables. Assume that the following holds

Theorem 2. For $0\leq s\leq t\leq 1$, we have

where the convergence holds in $L_{1}$. Moreover, we have that

where $X^{i}_{t}$ is the solution of the following (McKean-Vlasov) SDE

As a result, the pre-activations $Y^{i}_{\lfloor tL\rfloor}$ (Eq. 1) converge in distribution to a Gaussian distribution in the infinite-depth-then-infinite-width limit

Proof Let $0\leq s\leq t\leq 1$. From 1, we have that almost surely

We know that $\frac{1}{\sqrt{n}}(\hat{B}_{t}-\hat{B}_{s})$ converges to zero almost surely (by continuity of Brownian paths) and in $L_{1}$. Let us now deal with the second term $n^{-1}\int_{s}^{t}\mu_{u}du$. We have that $\frac{1}{n}\mu_{u}=\frac{1}{2}\frac{1}{n}\sum_{i=1}^{n}\phi^{\prime}(X^{i}_{u})-\frac{1}{n}$. Fix $u\in[s,t]$ and let $Z^{i}_{n}=\phi^{\prime}(X_{u}^{i})$ (recall that $X^{i}_{u}$ has an implicit dependence on $n$). Since $Z_{n}^{i}$ is uniformly bounded across $i$ and $n$, it is straightforward that the conditions of 8 are satisfied. Therefore, we have the following convergence in $L_{1}$

Let us now deal with the second result on the absolute growth factor. Let $N>0$ and define the event

and let $E^{c}_{N}$ be its complementary event. For $N$ large enough, we have that

The convergence to McKean-Vlasov dynamics is straightforward from 6, and the Gaussian distribution is given by Lemma 16.

Taking the expectation (this should be understood as integrating the SDE, then taking the expectation, then differentiating once again) yields the following ordinary differential equation

which has a closed-form solution given by

Proof. The proof of Lemma 17 is similar to that of Lemma 8, the only difference being that the Euclidean norm is replaced with the $L_{2}$ norm in probability space. Let $t\geq 0$, we have that

where we have used the triangle inequality and Lemma 9. We conclude using Gronwall’s lemma.

I Proof of Theorem 3

Theorem 3. Let $t\in[0,1]$. Then, in the limit $\lim_{L\to\infty}\lim_{n\to\infty}$ (infinite width, then infinite depth), we have that

where the convergence holds in probability.

Moreover, the pre-activations $Y^{i}_{\lfloor tL\rfloor}$ (Eq. 1) converge in distribution to a Gaussian distribution in the limit infinite-width-then-infinite-depth

Let us show that the first term in the right-hand side converges to in the sequential limit ‘infinite width then infinite depth’. The proof is similar for the second term. We have that

Using Lemma 5 in , and the homogeneity property of ReLU, we have that

J Piece-wise linear activation functions

We have seen in Section 4 that the distribution of $X_{t}$ is generally intractable for $n\geq 2$. This is purely due to the finite width $n\geq 2$ and not to the non-linearity of the activation function. To understand this, let us see what happens when the activation function is the identity function. In this case, the process $X_{t}$ is the solution of the following SDE

When $n=1$, the SDE Eq. 15 has a closed-form solution given by the (conditional) GBM distribution (2). For general $n\geq 2$, the entries of $X_{t}$ are dependent and the resulting dynamics (generally) do not admit closed-form solutions. However, we can obtain closed-form solutions for the norm $\|X_{t}\|$. Indeed, a simple application of Itô’s lemma yields the following results.

With the linear activation, we have that for all $t\in[0,1]$,

where $(\hat{B}_{t})_{t\geq 0}$ is a one-dimensional Brownian motion. As a result, we have that for all $0\leq s\leq t\leq 1$

The proof of this result (9) follows directly from Itô’s lemma and is omitted here.
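As a rough numerical sanity check of the identity-activation case, the sketch below simulates a discretized ResNet and compares the empirical mean of $\log(\|Y_{L}\|/\|Y_{0}\|)$ with the deterministic drift $\frac{1}{2}-\frac{1}{n}$ quoted in this section (taken at $t=1$). It assumes the recursion $Y_{l}=Y_{l-1}+L^{-1/2}W_{l}Y_{l-1}$ with i.i.d. $\mathcal{N}(0,1/n)$ weights, which is our reading of Eq. 1 and is not restated here. The comparison of the variance with $1/n$ likewise assumes an $n^{-1/2}$ scaling of the Brownian term, as in the ReLU case. The function name, the all-ones input and the sample sizes are illustrative choices.

```python
import numpy as np


def simulate_identity_log_growth(n, depth=200, num_samples=4000, seed=0):
    """Sample log(||Y_L|| / ||Y_0||) for a ResNet with identity activation.

    Assumed recursion (hypothetical reading of Eq. 1):
        Y_l = Y_{l-1} + W_l Y_{l-1} / sqrt(depth),  W_l entries ~ N(0, 1/n).
    """
    rng = np.random.default_rng(seed)
    y0 = np.ones((num_samples, n))  # arbitrary non-zero input, one row per sample
    y = y0.copy()
    for _ in range(depth):
        # One independent n x n weight matrix per sample and per layer.
        w = rng.normal(0.0, 1.0 / np.sqrt(n), size=(num_samples, n, n))
        y = y + np.einsum("bij,bj->bi", w, y) / np.sqrt(depth)
    return np.log(np.linalg.norm(y, axis=1) / np.linalg.norm(y0, axis=1))


if __name__ == "__main__":
    for width in (1, 2, 4, 8):
        g = simulate_identity_log_growth(width)
        print(
            f"n={width}: empirical mean={g.mean():+.3f}, "
            f"drift 1/2 - 1/n={0.5 - 1.0 / width:+.3f}, "
            f"empirical var={g.var():.3f}, 1/n={1.0 / width:.3f}"
        )
```

With finite depth and a finite number of samples the agreement is only approximate; increasing `depth` and `num_samples` should tighten it.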

By comparing Theorem 1 with the identity-activation result (9), we observe some differences between the case of ReLU and that of the identity activation function. With ReLU, the drift term in $\log(\|\phi(X_{t})\|/\|\phi(X_{0})\|)$ is given by $\frac{1}{n}\int_{0}^{t}\mu_{s}\,ds$, which is a stochastic term with mean $\left(\frac{1-2^{-n}}{4}-\frac{1}{n}\right)t$. With the identity activation, this drift term is deterministic and equal to $\left(\frac{1}{2}-\frac{1}{n}\right)t$. This allows us to conclude the following:

Non-linearity induces stochastic drift: the non-linearity of ReLU induces stochasticity in the drift term of $\log(\|X_{t}\|/\|X_{0}\|)$, which results in the Quasi-GBM dynamics given by Theorem 1.

Non-linearity induces change of regime: with ReLU, the mean drift of $\log(\|\phi(X_{t})\|/\|\phi(X_{0})\|)$ is given by $\left(\frac{1-2^{-n}}{4}-\frac{1}{n}\right)t$ which is negative for $n=1,2,3$. This induces the change of regime we discussed after Theorem 1 (having a negative mean drift implies that there is a significant mass of the distribution of $\|X_{t}\|/\|X_{0}\|$ in the regime $(0,1)$). With the identity activation function, the drift term is always non-negative for $n\geq 2$, and negative for $n=1$. Thus, the change of regime covers some values $n\geq 2$ only when there is a non-linearity. We give more details about this observation in the next result.

To capture the effect of non-linearity in the regime change phenomenon discussed above, we study the dynamics of the norm of the post-activations for a special class of piece-wise linear activations that includes both ReLU and the identity function. The result of Theorem 1 can be easily extended to the case of general piece-wise linear activation functions using the same proof techniques. We obtain the following result, which generalizes Theorem 1 and the identity-activation result (9).

This result (10) generalizes that of ReLU (Theorem 1, $\alpha=1,\beta=0$) and that of the identity activation (9, $\alpha=-\beta=1$). The discontinuity of the mean of $\log\left(\frac{\|\phi_{\alpha,\beta}(X_{t})\|}{\|\phi_{\alpha,\beta}(X_{s})\|}\right)$ at the poles $\alpha=0$ (and $\beta\neq 0$) and $\beta=0$ (and $\alpha\neq 0$) is due to the fact that the event $\{\|\phi_{\alpha,\beta}(X_{0})\|=0\}$ has non-zero probability in these cases and zero probability when $\alpha\neq 0$ and $\beta\neq 0$.

Consider the case when $\alpha=1$ and $\beta=1-\varepsilon$ for some $\varepsilon\ll 1$. The mean logarithmic growth factor is given by

Observe that for $\varepsilon=0$, we recover the identity-activation result (9). Hence, a small perturbation of the identity function has the effect of decreasing the factor $G_{s,t}^{n}$, which results in negative values of $G_{s,t}^{n}$ for certain values of $n$. Indeed, fixing $\alpha=1$, notice that the minimum value of $G_{s,t}^{n}$ is obtained when $\beta\approx 0$, for which $\phi_{1,0}=$ ReLU. Notice that we can also control the change of regime, that is, the sign of $G_{s,t}^{n}$ for any $n$, by tuning the parameter $\alpha$. We leave the analysis of the practical implications of tuning $\alpha$ for future work.

K Additional Experiments

Additional histograms of $Y_{L}$ and $\log(Y_{l})$ (2) are shown in Fig. 9 and Fig. 10.

K.2 Ornstein-Uhlenbeck process

Additional histograms of $Y_{L}$ and $g(Y_{l})$ (4) are shown in Fig. 9 and Fig. 10.

K.3 Histograms of non-scaled log-norm of post-activations

In Fig. 13, we show the histogram of $\log(\|\phi(Y_{L})\|/\|\phi(Y_{0})\|)$ based on $N=5000$ simulations. We observe that as the width $n$ increases, the Gaussian approximation is no longer accurate, which is due to the fact that $\|\phi(Y_{L})\|/\|\phi(Y_{0})\|$ converges to a deterministic value (Theorem 2).

In Fig. 14, Fig. 15, Fig. 16, and Fig. 17, we show the histograms of $\sqrt{n}\log(\|\phi(Y_{l})\|/\|\phi(Y_{0})\|)$ for depth $L=100$, hidden layers $l\in\{10,30,40,60,70,90\}$, and widths $n\in\{2,3,20,100\}$. We observe that the Gaussian distribution fits the last layers better. This was expected since the limiting distribution (Quasi-GBM) given in Theorem 1 is only valid for layer indices $\lfloor tL\rfloor$ when $L$ goes to infinity. Thus, for small $l$, it should be expected that the Gaussian distribution would not be a good approximation.

In Fig. 18, Fig. 19, Fig. 20, and Fig. 21, we show the non-scaled versions of the histograms from the previous section. We observe that the histogram concentrates around a single value (the distribution converges to a Dirac mass) as $n$ increases. This is a result of the asymptotic behaviour of the ResNet in the infinite-depth-then-infinite-width limit as shown in Theorem 2.
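The concentration described above can be reproduced qualitatively with a small simulation. The following is a minimal sketch, not the code used for the figures: it assumes the ReLU recursion $Y_{l}=Y_{l-1}+L^{-1/2}W_{l}\phi(Y_{l-1})$ with i.i.d. $\mathcal{N}(0,1/n)$ weights (our reading of Eq. 1), an all-ones input, and a smaller number of samples than the $N=5000$ used in the paper. The helper name `relu_resnet_log_growth` is hypothetical.

```python
import numpy as np


def relu_resnet_log_growth(n, depth=100, num_samples=300, seed=0):
    """Sample the non-scaled log(||phi(Y_L)|| / ||phi(Y_0)||) with phi = ReLU.

    Assumed recursion (hypothetical reading of Eq. 1):
        Y_l = Y_{l-1} + W_l phi(Y_{l-1}) / sqrt(depth),  W_l entries ~ N(0, 1/n).
    """
    rng = np.random.default_rng(seed)
    y = np.ones((num_samples, n))  # positive input, so ||phi(Y_0)|| > 0
    phi0_norm = np.linalg.norm(np.maximum(y, 0.0), axis=1)
    for _ in range(depth):
        w = rng.normal(0.0, 1.0 / np.sqrt(n), size=(num_samples, n, n))
        y = y + np.einsum("bij,bj->bi", w, np.maximum(y, 0.0)) / np.sqrt(depth)
    phi_norm = np.linalg.norm(np.maximum(y, 0.0), axis=1)
    with np.errstate(divide="ignore"):
        log_growth = np.log(phi_norm / phi0_norm)
    # At small width a discretized path can end with all coordinates negative,
    # which gives log(0); such samples are dropped here.
    return log_growth[np.isfinite(log_growth)]


if __name__ == "__main__":
    for width in (2, 4, 16, 32):
        g = relu_resnet_log_growth(width)
        print(f"n={width}: mean={g.mean():+.3f}, std={g.std():.3f}")
```

Under these assumptions, the printed standard deviation should shrink as the width grows, in line with the histograms collapsing towards a Dirac mass.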