On the infinite-depth limit of finite-width neural networks
Soufiane Hayou
The empirical success of over-parameterized neural networks has sparked a growing interest in the theoretical understanding of these models. The large number of parameters (millions if not billions) and the non-linear nature of the neural computations make this hypothesis space highly non-trivial. However, in certain situations, increasing the number of parameters has the effect of ‘placing’ the network in some ‘average’ regime that simplifies the theoretical analysis. This is the case with the infinite-width asymptotics of random neural networks. The infinite-width limit of neural network architectures has been extensively studied in the literature, and has led to many interesting theoretical and algorithmic innovations. We summarize these results below.
Initialization schemes: the infinite-width limit of different neural architectures has been extensively studied in the literature. In particular, for multi-layer perceptrons (MLP), a new initialization scheme that stabilizes forward and backward propagation (in the infinite-width limit) was derived in . This initialization scheme is known as the Edge of Chaos, and empirical results show that it significantly improves performance. In , the authors derived similar results for the ResNet architecture, and showed that this architecture is placed by default on the Edge of Chaos for any choice of the variances of the initialization weights (Gaussian weights). In , the authors showed that an MLP that is initialized on the Edge of Chaos exhibits similar properties to ResNets, which might partially explain the benefits of the Edge of Chaos initialization.
Gaussian process behaviour: Multiple papers (e.g. ) studied the weak limit of neural networks when the width goes to infinity. The results show that a randomly initialized neural network (with Gaussian weights) has a similar behaviour to that of a Gaussian process, for a wide range of neural architectures, and under mild conditions on the activation function. In , the authors leveraged this result and introduced the neural network Gaussian process (NNGP), which is a Gaussian process model with a neural kernel that depends on the architecture and the activation function. Bayesian regression with the NNGP showed that NNGP surprisingly achieves performance close to the one achieved by an SGD-trained finite-width neural network.
The large depth limit of this Gaussian process was studied in , where the authors showed that with proper scaling, the infinite-depth (weak) limit is a Gaussian process with a universal kernel (a kernel is called universal when any continuous function on some compact set can be approximated arbitrarily well with kernel features).
Neural Tangent Kernel (NTK): the infinite-width limit of the NTK is the so-called NTK regime or Lazy-training regime. This topic has been extensively studied in the literature. The optimization and generalization properties (and some other aspects) of the NTK have been studied in . The large depth asymptotics of the NTK have been studied in . We refer the reader to for a comprehensive discussion on the NTK.
Others: the theory of infinite-width neural networks has also been utilized for network pruning , regularization , feature learning , and ensembling methods (this is by no means an exhaustive list).
The theoretical analysis of infinite-width neural networks has certainly led to many interesting (theoretical and practical) discoveries. However, most works on this limit consider a fixed depth network. What about infinite-depth? Existing works on the infinite-depth limit can generally be divided into three categories:
Infinite-width-then-infinite-depth limit: in this case, the width is taken to infinity first, then the depth is taken to infinity. This is the infinite-depth limit of infinite-width neural networks. This limit was particularly used to derive the Edge of Chaos initialization scheme , study the impact of the activation function , the behaviour of the NTK , kernel shaping etc.
The joint infinite-width-and-depth limit: in this case, the depth-to-width ratio is fixed, and the width and depth are taken to infinity jointly. There are few works that study the joint width-depth limit. For instance, in , the authors showed that for a special form of residual neural networks (ResNet), the network output exhibits a (scaled) log-normal behaviour in this joint limit. This is different from the sequential limit where width is taken to infinity first, followed by the depth, in which case the distribution of the network output is asymptotically normal (). In , the authors studied the covariance kernel of an MLP in the joint limit, and showed that it converges weakly to the solution of a Stochastic Differential Equation (SDE). In , the authors showed that in the joint limit case, the NTK of an MLP remains random when the width and depth jointly go to infinity. This is different from the deterministic limit of the NTK where the width is taken to infinity before depth . More recently, in , the author explored the impact of the depth-to-width ratio on the correlation kernel and the gradient norms in the case of an MLP architecture, and showed that this ratio can be interpreted as an effective network depth.
Infinite-depth limit of finite-width neural networks: in both previous limits (infinite-width-then-infinite-depth limit, and the joint infinite-width-depth limit), the width goes to infinity. Naturally, one might ask what happens if the width is fixed and the depth goes to infinity. What is the limiting distribution of the network output at initialization? In , the author showed that neural networks with bounded width are still universal approximators, which motivates the study of finite-width large depth neural networks. In , the authors showed that the pre-activations of a particular ResNet architecture converge weakly to a diffusion process in the infinite-depth limit. This is the result of the fact that ResNets can be seen as discretizations of SDEs (see Section 2).
In the present paper, we study the infinite-depth limit of finite-width ResNet with random Gaussian weights (an architecture that is different from the one studied in ). We are particularly interested in the asymptotic behaviour of the pre/post-activation values. Our contributions are four-fold:
Unlike the infinite-width limit, we show that the resulting distribution of the pre-activations in the infinite-depth limit is not necessarily Gaussian. In the simple case of networks of width , we study two cases where we obtain known but completely different distributions by carefully choosing the activation function.
For ReLU activation function, we introduce and discuss the phenomenon of network collapse. This phenomenon occurs when the pre-activations in some hidden layer have all non-positive values which results in zero post-activations. This leads to a stagnant network where increasing the depth beyond a certain level has no effect on the network output. For any fixed width, we show that in the infinite-depth limit, network collapse is a zero-probability event, meaning that almost surely, all post-activations in the network are non-zero.
For networks with general width, where the distribution of the pre-activations is generally intractable, we focus on the norm of the post-activations with ReLU activation function, and show that this norm approximately follows Geometric Brownian Motion (GBM) dynamics. We call this Quasi-GBM. We also shed light on a regime change phenomenon that occurs when the width increases from to . For width , resp. , the logarithmic growth factor of the post-activations is , resp. positive.
We study the sequential limit infinite-depth-then-infinite-width, which is the converse of the more commonly studied infinite-width-then-infinite-depth limit, and show some key differences between these limits. In particular, we show that the pre-activations converge to the solution of a McKean-Vlasov process, which has marginal Gaussian distributions, and thus we recover the Gaussian behaviour in this limit.
The proofs of the theoretical results are provided in the appendix and referenced after each result. Empirical evaluations of these theoretical findings are also provided.
The infinite-depth limit
Hereafter, we denote the width, resp. depth, of the network by , resp. . We also denote the input dimension by . Let , and consider the following ResNet architecture of width and depth
The scaling in Eq. 1 is not arbitrary. This specific scaling was shown to stabilize the norm of as well as gradient norms in the large depth limit (e.g. ). In the next result, we show that the infinite-depth limit of Eq. 1 (in the sense of convergence in distribution) exists and has the same distribution as the solution of a stochastic differential equation. In the case of a single input, this has already been shown in . The details are provided in Section A. We also generalize this result to the case of multiple inputs and obtain similar SDE dynamics (see 5 in the Appendix).
where the constant in does not depend on . Moreover, if the activation function is only locally Lipschitz, then converges locally to . More precisely, for any fixed , we consider the stopping times
then the stopped process converges in distribution to the stopped solution of the above SDE.
The proof of 1 is provided in Section A.6. We use classical results on the numerical approximations of SDEs. 1 shows that the infinite-depth limit of finite-width ResNet (Eq. 1) has a similar behaviour to the solution of the SDE given in Eq. 7. In this limit, converges in distribution to . Hence, properties of the solutions of Eq. 7 should theoretically be ‘shared’ by the pre-activations when the depth is large. For the rest of the paper, we study some properties of the solutions of Eq. 7. This requires the definition of filtered probability spaces which we omit here. All the technical details are provided in Section A. We compare the theoretical findings with empirical results obtained by simulating the pre/post-activations of the original network Eq. 1. We refer to , the solution of Eq. 7, by the infinite-depth network.
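To make the simulated object concrete, the following is a minimal Python sketch of a forward pass through a finite-depth ResNet whose residual blocks are scaled by one over the square root of the depth. The explicit update rule, the weight variances (1/d for the input layer, 1/n for the residual layers) and the name `simulate_resnet` are assumptions made for this sketch; they are not taken from the author's code.

```python
import numpy as np


def relu(y):
    """ReLU activation, applied coordinate-wise."""
    return np.maximum(y, 0.0)


def simulate_resnet(x, width, depth, activation=relu, rng=None):
    """Return the pre-activations Y_0, ..., Y_L as an array of shape (depth + 1, width).

    Assumed form (an illustration, not necessarily the paper's exact Eq. 1):
        Y_0 = W_in x,                      W_in entries ~ N(0, 1/d)
        Y_l = Y_{l-1} + W_l phi(Y_{l-1}) / sqrt(L),   W_l entries ~ N(0, 1/n)
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    y = rng.normal(0.0, 1.0 / np.sqrt(d), size=(width, d)) @ x
    path = np.empty((depth + 1, width))
    path[0] = y
    for layer in range(1, depth + 1):
        w = rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width))
        y = y + w @ activation(y) / np.sqrt(depth)  # residual update with 1/sqrt(L) scaling
        path[layer] = y
    return path


if __name__ == "__main__":
    path = simulate_resnet(np.ones(3), width=4, depth=200, rng=np.random.default_rng(0))
    print(path[-1])  # pre-activations of the last layer
```

Under these assumptions, each update is Gaussian conditionally on the previous layer, with covariance proportional to the squared norm of the post-activations divided by nL, so one layer plays the role of one Euler step of size 1/L.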
The distribution of (the last layer in the infinite-depth limit) is generally intractable, unlike in the infinite-width-then-infinite-depth limit (Gaussian, ) or joint infinite-depth-and-width limit (involves a log-normal distribution in the case of an MLP architecture, ). Intuitively, one should not expect a universal behaviour (e.g. the Gaussian behaviour in the infinite-width case) of the solution of Eq. 7, as the latter is highly sensitive to the choice of the activation function, and different activation functions might yield completely different distributions of . We demonstrate this in the next section by showing that we can recover closed-form distributions by carefully choosing the activation function. The main ingredient is the use of Itô's lemma. See Section A for more details.
Different behaviours depending on the activation function
In this section, we restrict our analysis to a width- ResNet with one-dimensional inputs, where each layer consists of a single neuron, i.e. . In this case, the process is one-dimensional and is solution of the following SDE
In financial mathematics nomenclature, the function is called the drift and is called the volatility of the diffusion process. Itô's lemma is a valuable tool in stochastic calculus and is often used to transform and simplify SDEs to better understand their properties. It can also be used to find candidate functions and activation functions such that the SDE Eq. 3 admits solutions with known distributions, which yields a closed-form distribution for . We devote the rest of this section to this purpose.
ReLU is a piece-wise linear activation function. Let us first deal with the simpler case of linear activation functions. In the next result, we show that linear activation functions yield log-normal distributions. In this case, the process follows the Geometric Brownian motion dynamics. Later in this section, we show that this result can be adapted to the case of the ReLU activation function given by .
Then, the process is a solution of the SDE
where . As a result, we have that for all ,
The proof of 2 is provided in Section D, and consists of using Itô's lemma and solving a differential equation. When the activation function is ReLU, we still obtain a log-normal distribution conditionally on the event that the initial value is positive.
Then, the process is a mixture of a Geometric Brownian motion and a constant process. More precisely, we have for all
Hence, given a fixed , the process is a Geometric Brownian motion.
The proof of 3 is provided in Section E. We show that conditionally on , with probability , the process is positive for all (in Section E, we show that the stopping time is infinite almost surely, which is stronger than what we need; this is a classic result in stochastic calculus). When , the ReLU activation is just the identity function, which justifies the similarity between this result and the one obtained with linear activations (2). Conversely, if , the process is constant equal to since the updates ‘’ are equal to zero in this case. A rigorous justification of this is given for general width later in the paper (Lemma 1).
An empirical verification of 2 is provided in Fig. 1 where we compare the theoretical results to simulations of the neural paths and from the original (finite-depth) ResNet given by Eq. 1. We observe an excellent match with theoretical predictions for depths and . In the case of a small depth (), the theoretical distribution does not fit the empirical one (obtained by simulations) well, which is expected since the dynamics of describe (only) the infinite-depth limit of the ResNet. More figures are provided in Section K.
Remark: notice that the log-normal behaviour is a result of the fact that we only consider the case (width one). Indeed, the single neuron case forces ReLU to act like a linear activation when , and like a ‘zero’ activation when . For general width , such behaviour does not hold in general, and usually some coordinates of will be negative while others are non-negative, which implies that the volatility term has non-trivial dependence on . We discuss this in more detail in Section 4. In the next section, we illustrate a case of an exotic (non-standard) activation function that yields a completely different closed-form distribution of .
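The width-one comparison between simulated networks and the log-normal prediction can be sketched in a few lines of Python. The update rule below (standard Gaussian weights, ReLU, a positive starting value of 1) and the name `width_one_log_ratio` are assumptions made for this sketch. Under them, the limiting process is a driftless Geometric Brownian motion, so the log-ratio of the last to the first pre-activation should be approximately Gaussian with mean -1/2 and variance 1 when the depth is large.

```python
import numpy as np


def width_one_log_ratio(depth, n_samples, rng):
    """Simulate log(Y_L / Y_0) for independent width-one ReLU ResNets with Y_0 = 1.

    Assumed update: Y_l = Y_{l-1} + W_l * relu(Y_{l-1}) / sqrt(L), with W_l ~ N(0, 1).
    A path that reaches a non-positive value stops moving (ReLU outputs zero);
    such paths are reported as nan.
    """
    y = np.ones(n_samples)
    for _ in range(depth):
        w = rng.standard_normal(n_samples)
        y = y + w * np.maximum(y, 0.0) / np.sqrt(depth)
    positive = y > 0
    return np.where(positive, np.log(np.where(positive, y, 1.0)), np.nan)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for depth in (5, 50, 500):
        log_ratio = width_one_log_ratio(depth, 4000, rng)
        print(depth, np.nanmean(log_ratio), np.nanvar(log_ratio))
```

At small depth the empirical mean and variance can be expected to deviate from -1/2 and 1, in line with the remark above that the SDE only describes the infinite-depth limit.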
Exotic activation
The next result shows that with a particular choice of the activation function and mapping , the stochastic process is the solution of a well-known type of SDE, the Ornstein-Uhlenbeck SDE. In this case, the activation function is non-standard and involves the inverse of the imaginary error function, a variant of the error function.
Consider the stochastic process defined by
(In Section C, we show that the activation function is only locally Lipschitz. Hence, the solution of this SDE exists only in the local sense and the convergence in distribution of to is also in the local sense (1). However, by continuity of the Brownian path, the stopping times and diverge almost surely when goes to infinity. Therefore, the conclusion of 4 remains true for all . Technical details are provided in Section C.)
Then, the stochastic process follows the Ornstein-Uhlenbeck dynamics on given by
where . As a result, conditionally on (fixed ), we have that for all ,
and the process is distributed as .
Fig. 2 shows the graph of the activation function mentioned in 4 with and . With this choice of the activation function, the infinite-depth network output has the distribution (conditionally on ), where is given in the statement of the proposition. This distribution, although easy to simulate, is different from both the Gaussian distribution that we obtain in the infinite-width limit and the log-normal distribution associated with ReLU activation. This confirms not only that neural networks exhibit completely different behaviours when the depth-to-width ratio is large, but also that, in this case, their behaviour is very sensitive to the choice of the activation function.
The results of 4 are empirically confirmed in Fig. 3. The original ResNet given by Eq. 7 with depth exhibits very similar behaviour to that of the SDE.
General width n ≥ 1
where is the activation function, and is an -dimensional Brownian motion, independent from . Intuitively, if for some , , then for all , since the increments ’’ are all zero for . This holds for any choice of the activation function , provided that the process exists, i.e. the SDE has a unique solution. We summarize this in the next lemma.
Lemma 1 is a particular case of Lemma 7 in the Appendix. The proof consists of using the uniqueness of the solution of Eq. 4 when the volatility term is Lipschitz. This result is trivial in the finite depth case (Eq. 1). When there exists such that , the process becomes constant (equal to ) for all (almost surely). We call this phenomenon process collapse. In the case of finite-depth networks (Eq. 1), we call the same phenomenon network collapse. Understanding when, and whether, such an event occurs is useful since it has significant implications on the large depth behaviour of neural networks. Indeed, if such an event occurs, it would mean that increasing depth has no effect on the network output after some time (or approximately, after layer index ). In the next result, we show that under mild conditions on the activation function, process collapse is a zero-probability event.
The proof of Lemma 2 is provided in Section F. Many standard activation functions satisfy the conditions of Lemma 2. Examples include Hyperbolic Tangent , and smooth versions of ReLU activation such as GeLU given by where is the cumulative distribution function of the standard Gaussian variable, and Swish (or SiLU) given by where is the Sigmoid function. The result of Lemma 2 can be extended to the case when is the ReLU function with minor changes.
Consider the stochastic process (7) given by the SDE
where is the ReLU activation function, and is an -dimensional Brownian motion independent from . Let be the stopping time given by
The proof of Lemma 3 relies on a particular choice of a sequence of functions that approximate the ReLU activation . Details are provided in Section F.
The result of Lemma 3 shows that for all , with probability , if there exists such that , then for all , there exists a coordinate such that , which implies that the volatility of the process given by does not vanish in finite time . Notably, this implies that for any , the norm of post-activations given by does not vanish (with probability 1). This is important as it ensures that the vector , which represents the post-activations in the infinite-depth network, does not vanish, and therefore the process does not get stuck in an absorbing point. The dependence between the coordinates of the process is crucial in this result. In the opposite case where are independent, the event has probability . Notice also that this result holds only in the infinite-depth limit. With finite-depth ResNet (Eq. 1) with ReLU activation, it is not hard to show that the network collapse event has non-zero probability. However, as the depth increases, the probability of network collapse goes to zero. Fig. 4 shows the probability of network collapse for a ResNet (Eq. 1) of finite width and depth. As the depth increases, it becomes unlikely that the network collapses. This is in agreement with our theoretical prediction that the infinite-depth network represented by the process has zero-probability collapse event, conditionally on the fact that . The probability of network collapse also decreases with width, which is expected, since it becomes less likely to have all pre-activations non-positive as the width increases.
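The dependence of the collapse probability on depth and width can be estimated by Monte Carlo. The sketch below is an illustration under stated assumptions rather than the experiment behind Fig. 4: the update rule, the N(0, 1/n) weights, the all-ones starting vector (so that the initial post-activations are non-zero) and the name `collapse_probability` are all choices made here.

```python
import numpy as np


def collapse_probability(width, depth, n_trials, rng):
    """Estimate the probability that some layer has all pre-activations non-positive.

    Assumed update: Y_l = Y_{l-1} + W_l relu(Y_{l-1}) / sqrt(L), W_l entries ~ N(0, 1/n),
    started from Y_0 = (1, ..., 1). All trials are simulated in parallel.
    """
    y = np.ones((n_trials, width))
    collapsed = np.zeros(n_trials, dtype=bool)
    for _ in range(depth):
        w = rng.normal(0.0, 1.0 / np.sqrt(width), size=(n_trials, width, width))
        post = np.maximum(y, 0.0)
        y = y + np.einsum("tij,tj->ti", w, post) / np.sqrt(depth)
        # Once every coordinate is non-positive, ReLU outputs zero and the path is stuck.
        collapsed |= np.all(y <= 0.0, axis=1)
    return float(collapsed.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for width in (1, 2):
        for depth in (1, 4, 100):
            print(width, depth, collapse_probability(width, depth, 20000, rng))
```

In this toy setting with width one and depth one, collapse is exactly the event that a standard Gaussian weight is at most -1, which has probability about 0.159; at depth 100 the same event requires a weight below -10 in some layer and is essentially never observed.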
Post-activation norm
As a result of Lemma 3, conditionally on , we can safely consider manipulating functions that require positivity, such as the logarithm of the norm of the post-activations. In the next result, we show that the norm of the post-activations has a distribution that resembles the log-normal distribution. We call this Quasi Geometric Brownian Motion distribution (Quasi-GBM).
where , and is a one-dimensional Brownian motion. As a result, for all
Notice that the case of matches the result of 3. Indeed, the latter implies that conditionally on , we have where is a one-dimensional Brownian motion, and where we have used the fact that for all . This result can be readily obtained from 1 by setting .
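For the width-one case, the log-normal form can be recovered with a short Itô computation. The sketch below assumes that, on the event of a positive initial value, the width-one SDE reduces to $dX_t = X_t\,dB_t$ (ReLU acting as the identity); the symbols $X_t$ and $B_t$ are notation chosen for this sketch.

```latex
\begin{align*}
d\log X_t &= \frac{1}{X_t}\,dX_t - \frac{1}{2X_t^{2}}\,d\langle X\rangle_t
           = dB_t - \frac{1}{2}\,dt,
           \qquad \text{since } d\langle X\rangle_t = X_t^{2}\,dt, \\
\log\frac{X_t}{X_0} &= B_t - \frac{t}{2}
           \;\sim\; \mathcal{N}\!\left(-\frac{t}{2},\; t\right).
\end{align*}
```

The negative drift $-t/2$ comes entirely from the second-order (quadratic variation) term in Itô's lemma.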
An interesting question is that of the infinite-width limit of the process , which corresponds to the sequential limit infinite-depth-then-infinite-width of the ResNet (Eq. 1). We discuss this in the next section.
Infinite-width limit of infinite-depth networks
In the next result, we show that when the width goes to infinity, the ratio concentrates around a layer dependent (-dependent) constant. In this limit, the coordinates of converge in to a McKean-Vlasov process, which allows us to recover the Gaussian behaviour of the pre-activations of the ResNet. We later compare this with the converse sequential limit infinite-width-then-infinite-depth where the pre-activations are also normally distributed, and show a key difference in the variance of the Gaussian distribution.
where is the solution of the following (McKean-Vlasov) SDE
As a result, the pre-activations (Eq. 1) converge in distribution to a Gaussian distribution in the limit infinite-depth-then-infinite-width
The proof of 2 requires the use of a special variant of the law of large numbers for non-i.i.d. random variables, and a convergence result for particle systems from the theory of McKean-Vlasov processes. Details are provided in Section H. In neural network terms, 2 shows that the logarithmic growth factor of the norm of the post-activations, given by , converges to in the sequential limit , then . More importantly, the pre-activations converge in distribution to a zero-mean Gaussian distribution in this limit, with a layer-dependent variance. In the converse sequential limit, i.e. , then , the limiting distribution of the pre-activations is also Gaussian with the same variance. We show this in the following result, which uses Lemma 5 in .
Let . Then, in the limit (infinite width, then infinite depth), we have that
where the convergence holds in probability.
Moreover, the pre-activations (Eq. 1) converge in distribution to a Gaussian distribution in the limit infinite-width-then-infinite-depth
The proof of 3 is provided in Section I. We use existing results from on the infinite-depth asymptotics of the neural network Gaussian process (NNGP). It turns out that the order of the sequential limits (taking the width to infinity first, then taking the depth to infinity, or the converse) does not affect the limiting distribution, which is a Gaussian with variance . Intuitively, by taking the width to infinity first, we make the coordinates independent from each other, and the processes become i.i.d. Markov chains. Taking the infinite-depth limit after the infinite-width limit consists of taking the infinite-depth limit of one-dimensional Markov chains. On the other hand, when we take depth to infinity first, the coordinates remain dependent (through the volatility term ), which results in the Quasi-log-normal behaviour of the norm of the post-activations (1). Taking the width to infinity then yields an asymptotic norm of the post-activations equal to (2), which is the same norm as in the converse limit (3). It remains to take the width to infinity to decouple the coordinates and obtain the Gaussian distribution (through the McKean-Vlasov dynamics). Knowing that the variance of the pre-activations is mainly determined by the norm of the post-activations (Eq. 4), we can see why the variance is similar in both sequential limits.
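The concentration of the post-activation norm at large width can be probed numerically. The following Python sketch measures the average logarithmic growth of the post-activation norm in a deep, moderately wide ReLU ResNet. The update rule, the N(0, 1/n) residual weights, the standard Gaussian first layer and the name `mean_log_growth` are assumptions made for this sketch. Under these assumptions, a mean-field heuristic (the infinite-width variance recursion multiplies the variance by 1 + 1/(2L) at each layer) suggests a growth of about 1/4 over the full depth; this value is a property of the assumed setup rather than a number quoted from the results above.

```python
import numpy as np


def mean_log_growth(width, depth, n_trials, rng):
    """Average of log(||relu(Y_L)|| / ||relu(Y_0)||) over independent networks.

    Assumed update: Y_l = Y_{l-1} + W_l relu(Y_{l-1}) / sqrt(L), W_l entries ~ N(0, 1/n),
    with Y_0 ~ N(0, I_n). All trials are simulated in parallel.
    """
    y = rng.standard_normal((n_trials, width))
    norm_first = np.linalg.norm(np.maximum(y, 0.0), axis=1)
    for _ in range(depth):
        w = rng.normal(0.0, 1.0 / np.sqrt(width), size=(n_trials, width, width))
        y = y + np.einsum("tij,tj->ti", w, np.maximum(y, 0.0)) / np.sqrt(depth)
    norm_last = np.linalg.norm(np.maximum(y, 0.0), axis=1)
    return float(np.mean(np.log(norm_last / norm_first)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(mean_log_growth(width=100, depth=100, n_trials=50, rng=rng))
```

Repeating the experiment at smaller widths gives a rough sense of how quickly the finite-width behaviour approaches the large-width value.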
Discussion on the case of multiple inputs
The result of 1 can be easily generalized to the multiple input case, and the resulting dynamics is still an SDE. The generalization to the multiple inputs case is given by 5 in the Appendix.
An important question in the literature on infinite-width neural networks is the behaviour of the correlation of the pre-activations (or the post-activations) for different inputs and , which is given by . This correlation can be seen as a geometric measure of the information as it propagates through the network. In the infinite-width-then-depth limit, this correlation (generally) converges to a degenerate limit (a constant value) which results in either a constant or a sharp landscape of the network output and causes gradient exploding/vanishing issues . Techniques such as block scaling , or kernel shaping solve this problem and ensure that the correlation is well-behaved in the large depth limit. In our case, when the width is finite and the depth is taken to infinity, we can define the correlation for two inputs and time by
Using Itô's lemma, has dynamics of the form
for some non-trivial mapping . Unfortunately, this kind of dynamics (which is not an SDE) is generally intractable, and we are currently investigating these dynamics for future work. However, since we scale the ResNet blocks with the factor (Eq. 1), which is the same scaling that solves the degeneracy issue in the infinite-width-then-depth limit , it should be expected that the correlation kernel does not converge to a degenerate limit.
In Fig. 7, we simulate the correlation path in a ResNet of depth and width . The paths exhibit some level of stochasticity but no degeneracy can be observed. Understanding the correlation dynamics (Eq. 5) in the infinite-depth limit of finite-width networks is an interesting open question. The infinite-width limit of these dynamics (that is, the infinite-width limit of infinite-depth correlations) is also an interesting open question. We leave this for future work.
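A correlation path of this kind can be simulated by pushing two inputs through the same random network and recording their similarity at every layer. The sketch below is an illustration under stated assumptions: it uses the cosine similarity of the pre-activations as the correlation, N(0, 1/d) input weights and N(0, 1/n) residual weights, and the name `correlation_path`; the exact definition used for Fig. 7 may differ.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def correlation_path(x_a, x_b, width, depth, rng):
    """Layer-by-layer cosine similarity of the pre-activations of two inputs.

    Both inputs share the same weights. Assumed update:
    Y_l = Y_{l-1} + W_l relu(Y_{l-1}) / sqrt(L), W_l entries ~ N(0, 1/n).
    """
    x_a = np.asarray(x_a, dtype=float)
    x_b = np.asarray(x_b, dtype=float)
    d = x_a.shape[0]
    w_in = rng.normal(0.0, 1.0 / np.sqrt(d), size=(width, d))
    y_a, y_b = w_in @ x_a, w_in @ x_b
    corr = np.empty(depth + 1)
    corr[0] = cosine(y_a, y_b)
    for layer in range(1, depth + 1):
        w = rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width))
        y_a = y_a + w @ np.maximum(y_a, 0.0) / np.sqrt(depth)
        y_b = y_b + w @ np.maximum(y_b, 0.0) / np.sqrt(depth)
        corr[layer] = cosine(y_a, y_b)
    return corr


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    path = correlation_path([1.0, 0.0], [0.6, 0.8], width=8, depth=500, rng=rng)
    print(path[0], path[-1])
```

Plotting several such paths for independent weight draws shows how much the correlation fluctuates at a given width and whether it drifts towards a constant.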
Practical implications
Our theoretical analysis has many interesting implications from a practical standpoint. Here we summarize some key insights from our results.
An important factor pertaining to the trainability of neural networks is the behaviour of the neurons (pre/post-activations). Ensuring that the neurons are well-behaved at initialization is crucial for training since the first step of any gradient-based training algorithm depends on the values of the neurons at initialization. This has led to interesting developments in initialization schemes for MLPs such as the Edge of Chaos, which ensures that the variance of the pre-activation does not (exponentially) vanish or explode in the large depth limit. In the case of ResNet, we know from the existing theory on the infinite-width limit of neural networks that scaling the residual blocks with stabilizes the pre/post-activations in the large depth limit . Hence, we do not need a special initialization scheme with this scaling. However, one could argue that this (approximately) ensures stability only when the width is much larger than the depth. What about the other cases, when or ? The last case can be studied by fixing the width and taking the depth to infinity. In our paper, we not only show that the neurons remain stable in fixed-width large-depth networks, but we fully characterize their behaviour when the depth is infinite and show that it follows an SDE in this limit. To summarize, we show that initializing ResNet Eq. 1 with standard Gaussian random variables and scaling the blocks with ensures stability inside the network in large-depth (fixed-width) networks (notice that this is actually equivalent to scaling the variance of the initialization weights with , which can be seen as an initialization scheme). Intuitively, by stabilizing the pre-activations, we also stabilize the gradients. To confirm this intuition, we show in Fig. 8 the evolution of gradient norms as they back-propagate through the network. This experiment was conducted by fixing the last layer’s gradient to a constant value and back-propagating the gradient from there.
The result shows that the scaling, along with standard Gaussian initialization, ensures well-behaved gradients, which is a desirable property for gradient-based training. Another interesting property of the Edge of Chaos initialization scheme for MLPs is that it ensures that the correlation kernel (correlation between the pre-activations for different inputs) does not exponentially converge to a degenerate value (constant value). (The correlation still converges to 1 with an EOC initialization. The benefit of the EOC lies in the fact that the convergence rate is much slower: polynomial vs. exponential .) We discussed some aspects of the correlation kernel in Section 5 and showed empirically that with the scaling, the correlation is well-behaved and does not converge to degenerate values (Fig. 7).
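The back-propagation experiment described above (fixing the last layer's gradient and propagating it backwards) can be sketched with a manual backward pass. The sketch below is an illustration under stated assumptions, not the code behind Fig. 8: it uses ReLU, N(0, 1/n) residual weights, a standard Gaussian first layer, a unit-norm constant gradient at the last layer, and the name `gradient_norms`.

```python
import numpy as np


def gradient_norms(width, depth, rng):
    """Norm of the back-propagated gradient at every layer of a random ReLU ResNet.

    Forward (assumed): Y_l = Y_{l-1} + W_l relu(Y_{l-1}) / sqrt(L), W_l entries ~ N(0, 1/n).
    Backward: g_{l-1} = g_l + relu'(Y_{l-1}) * (W_l^T g_l) / sqrt(L),
    starting from a fixed unit-norm gradient g_L at the last layer.
    """
    y = rng.standard_normal(width)
    pre_activations, weights = [y], []
    for _ in range(depth):
        w = rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width))
        y = y + w @ np.maximum(y, 0.0) / np.sqrt(depth)
        weights.append(w)
        pre_activations.append(y)

    g = np.ones(width) / np.sqrt(width)  # constant gradient at the last layer
    norms = np.empty(depth + 1)
    norms[depth] = np.linalg.norm(g)
    for layer in range(depth, 0, -1):
        mask = (pre_activations[layer - 1] > 0).astype(float)  # ReLU derivative
        g = g + mask * (weights[layer - 1].T @ g) / np.sqrt(depth)
        norms[layer - 1] = np.linalg.norm(g)
    return norms


if __name__ == "__main__":
    norms = gradient_norms(width=16, depth=500, rng=np.random.default_rng(0))
    print(norms[0], norms[-1])  # gradient norm at the first and last layer
```

With this scaling, the gradient norm at the first layer can be expected to stay within a moderate factor of its value at the last layer, even when the depth is much larger than the width.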
Another issue that could occur in finite-width networks is that of network collapse, i.e. when the pre-activations in a hidden layer are all negative, which causes the post-activations to be all zero. In ResNet (Eq. 1), this implies that increasing depth beyond some level has no effect on the network output. This is problematic since the weights in those ‘inactive’ layers have zero gradient and thus will not be updated when such an event occurs. A simple way to understand network collapse is to see what happens at initialization. When the width is sufficiently large, one can expect that such an event is unlikely to occur. What about small-width neural networks? We offer a simple answer to this question: for finite-width neural networks, increasing the depth ensures that such an event is unlikely to happen. This is true even for extremely small widths, e.g. , which is counter-intuitive. Empirical results in Fig. 4 support this theoretical prediction.
An interesting application of fixed-depth infinite-width neural networks is the so-called Neural Network Gaussian Process (NNGP). This is the Gaussian process limit of neural networks, which can be used to perform posterior inference and obtain uncertainty estimates . The converse case, i.e. fixed-width infinite-depth, has however been poorly understood, and whether the infinite-depth limit of finite-width networks has some universal behaviour has remained an open question. We addressed this question in this work and showed that the limit (in the case of the ResNet architecture Eq. 1) does not admit a universal distribution (unlike, e.g., the Gaussian process in the infinite-width limit). More precisely, this limit is highly sensitive to the choice of the activation function.
The infinite-depth limit of infinite-width neural networks has been studied in the literature . It is known that in this limit, the network behaves as a Gaussian process with a well-defined kernel. What about the converse limit, i.e. the infinite-width limit of infinite-depth networks? This has so far been an open question, and our work addresses one part of it. We show that the marginal distributions are zero-mean Gaussians with the same variance as in the infinite-width-then-depth limit. Characterizing the full covariance kernel is however still an open question (see Section 5 for a discussion on this topic).
Conclusion, discussion, and limitations
Understanding the limiting laws of randomly initialized neural networks is important on many levels. Primarily, understanding these limiting laws allows us to derive new designs that are immune to exploding/vanishing pre-activations/gradients phenomena. Next, they also enable a deeper understanding of overparameterized neural networks, and (often) yield many interesting (and simple) justifications for the apparent advantage of overparameterization. So far, the focus has been mainly on the infinite-width limit (and infinite-width-then-infinite-depth limit) with few developments on the joint limit. Our work adds to this stream of papers by studying the infinite-depth limit of finite-width neural networks. We showed that unlike the infinite-width limit, where we always obtain (under some mild conditions on the activation function) a Gaussian distribution, the infinite-depth limit is highly sensitive to the choice of the activation function; using Itô's lemma, we showed how we can obtain certain known distributions by carefully tuning the activation function. In the general width limit, we showed an important characteristic of infinite-depth neural networks with general activation functions (including ReLU, conditionally on ): the probability of process collapse is zero, meaning that with probability one, the process does not get stuck at any absorbing point. This is not true for finite-depth ResNets as we can see in Fig. 4, which highlights the fact that as we increase depth, the collapse probability tends to decrease, and eventually converges to zero in the infinite-depth limit, which is in agreement with our results.
This work, although novel in many aspects, is still far from depicting a complete picture of the infinite-depth limit of finite-width networks. There are still numerous interesting open questions in this research direction. One of these is the dynamics of the gradient, and more specifically the behaviour of the NTK in the infinite-depth limit of finite-width neural networks. For instance, we already know that in the joint infinite-width-depth limit of MLPs, the NTK is random; but what happens when the width is fixed and the depth goes to infinity? In the MLP case, a degenerate NTK should be expected. Hence, questions remain as to whether a suitable scaling leads to an interesting (non-degenerate) infinite-depth limit of the NTK, as is the case for the infinite-depth limit of the infinite-width NTK .
References
Appendix
The following result gives conditions under which a strong solution of a given SDE exists, and is unique.
Let , and consider the following SDE
Then, for all , there exists a unique strong solution of the SDE above.
A.2 Itô's lemma
The following result, known as Itô's lemma, is a classic result in stochastic calculus. We state a version of this result from . Other versions and extensions exist in the literature (e.g. ).
Let be an Itô diffusion process (Definition 1) of the form
where and refer to the gradient and the Hessian, respectively. This can also be expressed as an SDE
A.3 Convergence of Euler’s scheme to the SDE solution
The following result gives a convergence rate of the Euler discretization scheme to the solution of the SDE.
where denote the coordinates of these vectors for , and . Then, we have that
We can extend the result of 5 to the case of locally Lipschitz drift and volatility functions and . For this purpose, let us first define local convergence.
Let be a sequence of processes and be a stochastic process. For , define the following stopping times
We say that converges locally to if for any , converges to . This definition applies to any type of convergence; we will specify the type of convergence whenever we use this notion of local convergence.
Consider the same setting of 5 with the following conditions instead
where , and .
We omit the proof here, as it uses the same techniques as in ; the only difference is that we consider the stopped process . By stopping the process, we force it to stay in a region where the coefficients are Lipschitz.
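As an illustration of this type of convergence statement, the following minimal sketch measures the L2 gap between the Euler scheme and an exact solution driven by the same Brownian increments. The SDE is chosen purely for illustration (a geometric Brownian motion, whose exact solution is known), and the coefficients `mu`, `sigma`, the horizon `T` and the helper names are assumptions of the sketch, not quantities from the result above.

```python
import numpy as np

def euler_gbm(x0, mu, sigma, h, dB):
    """Euler scheme for dX = mu*X dt + sigma*X dB, driven by increments dB (paths x steps)."""
    x = np.full(dB.shape[0], x0, dtype=float)
    for k in range(dB.shape[1]):
        x = x + mu * x * h + sigma * x * dB[:, k]
    return x

def strong_error(n_steps, n_paths=10_000, x0=1.0, mu=0.5, sigma=0.5, T=1.0, seed=0):
    """Root-mean-square gap at time T between the Euler scheme and the exact solution."""
    rng = np.random.default_rng(seed)
    h = T / n_steps
    dB = rng.normal(0.0, np.sqrt(h), size=(n_paths, n_steps))
    x_euler = euler_gbm(x0, mu, sigma, h, dB)
    # Exact solution evaluated on the same Brownian path (same increments).
    x_exact = x0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * dB.sum(axis=1))
    return float(np.sqrt(np.mean((x_euler - x_exact) ** 2)))

# Hypothetical step counts; the error should shrink as the grid is refined.
errors = {n: strong_error(n) for n in (16, 64, 256)}
for n, err in errors.items():
    print(f"steps={n:4d}  L2 error={err:.4f}")
```

With globally Lipschitz coefficients, the classical L2 rate of the Euler scheme is of order one half in the step size, so multiplying the number of steps by 16 should divide the error by roughly 4.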
A.4 Convergence of particles to the solution of the McKean-Vlasov process
The next result gives sufficient conditions for the system of particles to converge to its mean-field limit, known as the McKean-Vlasov process.
Proof This is a direct result of Thm 3 in . The bounded moment condition holds for (dimension of the particles), and the conclusion is straightforward.
A.5 Other results from probability and stochastic calculus
The next trivial lemma has been opportunely used in to derive the limiting distribution of the network output (multi-layer perceptron) in the joint infinite width-depth limit. This simple result will also prove useful in our case of the finite-width-infinite-depth limit.
This concludes the proof as the latter is the characteristic function of a random Gaussian vector with Identity covariance matrix.
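The conclusion of this proof (a Gaussian vector with identity covariance) can be checked empirically. The sketch below assumes the lemma is of the following standard form: if `W` has i.i.d. standard Gaussian entries and `u` is a fixed unit vector, then `W u` is a standard Gaussian vector. The dimension `n`, the sample size and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_samples = 5, 50_000

# A fixed (non-random once drawn) unit vector u.
u = rng.normal(size=n)
u /= np.linalg.norm(u)

# n_samples independent Gaussian matrices, each multiplied by the same u.
W = rng.normal(size=(n_samples, n, n))
Z = W @ u  # shape (n_samples, n)

emp_mean = Z.mean(axis=0)
emp_cov = np.cov(Z, rowvar=False)
print(np.round(emp_mean, 3))
print(np.round(emp_cov, 3))
```

The empirical mean should be close to zero and the empirical covariance close to the identity, whatever the direction of `u`.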
The next theorem shows when a stochastic process (ito)
Let and be two stochastic processes given by
Let be the σ-algebra generated by . Using Itô's lemma, we have that for ,
Hence, is a martingale (w.r.t. ). We conclude that has the same law as by the uniqueness of the solution of the martingale problem (see 8.3.6 in ).
The next result is a simple corollary of the existence and uniqueness of the strong solution of an SDE under the Lipschitz conditions on the drift and the volatility. It basically shows that a zero-drift process collapses (becomes constant) once the volatility is zero.
If , then almost surely.
Proof This follows from the uniqueness of the strong solution of an SDE (4).
A.6 Proof of 1
We are now ready to prove the following result.
where the constant in does not depend on . Moreover, if the activation function is only locally Lipschitz, then converges locally to . More precisely, for any fixed , we consider the stopping times
then the stopped process converges in distribution to the stopped solution of the above SDE.
Proof The proof is based on 5 in the appendix. It remains to express Eq. 1 in the required form and make sure all the conditions are satisfied for the result to hold. Using Lemma 6, we can write Eq. 1 as
Now let be -Lipschitz for some constant . We have that
where is the Euler scheme as in 5, and where we have used the fact that and have the same distribution.
The result of 1 can be generalized to the case with multiple inputs with minimal changes in the proof. We summarize this result in the next proposition.
where is an -dimensional Brownian motion (Wiener process), independent from , and is the covariance matrix given by
where , with . Moreover, if the activation function is only locally Lipschitz, then converges locally to . More precisely, for any fixed , we consider the stopping times , and then the stopped process converges in distribution to the stopped solution of the above SDE.
Proof The proof is similar to that of 1. The only difference lies in the definition of the Gaussian vector . In this case, we have for all
where . Concatenating these identities yields
where is the concatenation of the vector for . It is straightforward that the covariance matrix of the Gaussian vector is given by the matrix above (with replaced by ). We conclude using 5.
B Some technical results for the proofs
In the next lemma, we provide an approximate stochastic process to , that differs from by the volatility term. The upper-bound on the norm of the difference between and will prove useful in the proofs of other results. The proof of this lemma requires the use of Gronwall’s lemma, a tool that is often used in stochastic calculus.
where where is the Sigmoid function given by , is the ReLU activation function, and is an -dimensional Brownian motion. We have the following
Using Itô isometry and the fact that , we obtain
where we have used Lemma 9 and the fact that ReLU is -Lipschitz. We conclude using Gronwall’s lemma.
B.2 Approximation of ϕ
The next lemma provides a simple upper-bound on the distance between the ReLU activation and an approximate function that converges to in the limit of large .
For the case where , the proof is the same. We have that
B.3 Other lemmas
The next lemma shows that the logarithmic growth factor converges to when goes to infinity, where the convergence holds in . The key ingredient is the use of uniform integrability coupled with convergence in probability, which is sufficient to conclude on the convergence. This result will help us conclude in the proof of 1.
where where is the Sigmoid function given by , is the ReLU activation function, and is an -dimensional Brownian motion. Then, conditionally on the fact that , we have that
Let . From Lemma 8, we know that converges in to . Using Lemma 9 and the fact that ReLU is -Lipschitz, we obtain
which implies that converges in to . In particular, the convergence holds in probability. Using this fact with the Continuous mapping theorem, we obtain that
where the first term converges to zero by Eq. 8, and the second term converges to zero by Lemma 9. Hence, the convergence in probability holds.
To conclude, it suffices to show that the sequence of random variables is uniformly integrable. Let . From the proof of Lemma 2, with , we have that
where , and . Therefore,
where is the ReLU activation function, and is an -dimensional Brownian motion independent from . Then, conditionally on the fact that , we have that for all
where the bound holds uniformly over .
C The Ornstein-Uhlenbeck (OU) process
The OU process is the (unique) strong solution to the following diffusion
.
Proof Consider the process . Using Itô's lemma, we have that
We conclude by multiplying both sides with .
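The integrating-factor argument above also yields an exact sampler for the OU process over any time step. The sketch below assumes the common parametrization dX = a(b - X) dt + sigma dB with a > 0, which may differ from the one used in this section; the parameter values and the helper name `ou_exact_step` are hypothetical.

```python
import numpy as np

def ou_exact_step(x, h, a, b, sigma, rng):
    """Exact transition of dX = a*(b - X) dt + sigma dB over a step of length h."""
    decay = np.exp(-a * h)
    std = sigma * np.sqrt((1.0 - decay**2) / (2.0 * a))
    return b + (x - b) * decay + std * rng.normal(size=np.shape(x))

rng = np.random.default_rng(0)
a, b, sigma, x0, T, n_steps = 1.5, 0.3, 0.8, 2.0, 1.0, 50

x = np.full(50_000, x0)
for _ in range(n_steps):
    x = ou_exact_step(x, T / n_steps, a, b, sigma, rng)

# Closed-form mean and variance of X_T under the assumed parametrization.
mean_theory = b + (x0 - b) * np.exp(-a * T)
var_theory = sigma**2 * (1.0 - np.exp(-2.0 * a * T)) / (2.0 * a)
print(x.mean(), mean_theory)
print(x.var(), var_theory)
```

Because each step samples the exact Gaussian transition, the match with the closed-form moments does not depend on the number of steps, in contrast with an Euler discretization.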
We would like to find sufficient conditions on the activation function and a function such that the process (Eq. 7) follows the OU dynamics. For this purpose, we proceed by reverse-engineering the problem; using Itô's lemma (Eq. 3), this is satisfied when there exist constants such that
This implies that . Letting be the primitive function of , we obtain that satisfies a differential equation of the form
for some constants . Integrating the left-hand side yields
where Erfi is the imaginary error function (although the name might be misleading, the imaginary error function is real when the input is real), given by
To alleviate the notation, we denote in the rest of this section. From the above, should have the form
where are all constants, and is the inverse function of the imaginary error function. We conclude that the activation function should have the form
In this case, the coefficients and are given by
Letting , the process has the following dynamics
Hence is an OU process, and we can conclude that the network output in the infinite-depth limit satisfies
We can then infer the distribution of by a simple change of variable. Note that this distribution is non-trivial, and unlike the infinite-width limit of the same ResNet () where the distribution is Gaussian, here the distribution of the pre-activations is directly impacted by the choice of the activation function .
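The activation above is built from the imaginary error function and its inverse. As a small numerical aid, the sketch below evaluates Erfi through its standard Maclaurin series, Erfi(x) = (2/√π) Σ_k x^(2k+1) / (k! (2k+1)), and inverts it by bisection. The function names, the tolerance and the search bracket are assumptions of the sketch; it does not reproduce the constants of the activation derived in this section.

```python
import math

def erfi(x, tol=1e-16, max_terms=500):
    """Imaginary error function from its Maclaurin series (real for real x)."""
    term = x    # holds x^(2k+1) / k!, starting at k = 0
    total = x   # running sum of term / (2k + 1)
    for k in range(1, max_terms):
        term *= x * x / k
        contrib = term / (2 * k + 1)
        total += contrib
        if abs(contrib) <= tol * abs(total):
            break
    return 2.0 / math.sqrt(math.pi) * total

def erfi_inv(y, lo=-6.0, hi=6.0, iters=100):
    """Inverse of erfi by bisection; valid because erfi is strictly increasing."""
    if not erfi(lo) <= y <= erfi(hi):
        raise ValueError("y is outside the bracket [erfi(lo), erfi(hi)]")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if erfi(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(erfi(1.0))
print(erfi_inv(erfi(0.8)))
```

Since the derivative of Erfi grows like exp(x²), its inverse grows very slowly, which is one way to see why an activation built on the inverse is only locally Lipschitz after composition with fast-growing terms.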
However, with this particular choice of the activation function , the existence of the process can only be proven in the local sense, because is only locally Lipschitz. Let us first show this in the next lemma. We will see how we can mitigate this issue later.
Proof It suffices to show that the derivative of is locally bounded to conclude. We have that
Now we can rigorously prove the following result.
Consider the stochastic process defined by
where .
Proof For , consider the stopping time defined by
Using the continuity of paths of , it is straightforward that almost surely. Let be large enough. The SDE satisfied by the process has a unique strong solution for since the activation function is Lipschitz on the interval . By applying Itô's lemma for , we have that
(from previous results). Using the fact that almost surely, and taking large enough, we obtain that for all , we have that
D The Geometric Brownian Motion (GBM)
The GBM dynamics refers to stochastic differential equations of the form
where are constants and is a one dimensional Brownian motion. This SDE played a crucial role in financial mathematics and is often used as a model of stock prices. It admits a closed-form solution given in the next lemma.
The distribution of is known as a log-Gaussian distribution. Moreover, the solution is unique.
Proof The existence and uniqueness of the solution follow from 4. Indeed, it suffices to have the drift and the volatility both Lipschitz to obtain the result. This is satisfied in the case of GBM. Now consider the process . Using Itô's lemma (notice that here, should be positive in order to consider ; this is easy to show and the proof is similar to that of Lemma 15), it is easy to verify that
Now let us find sufficient conditions under which the infinite-depth network represented by the process has a GBM behaviour. In order for this to hold, it suffices to have
This implies , or equivalently , which in turn yields , and therefore . Assuming that , we can easily verify that functions of the form where satisfy the requirements. Hence, the activation function should satisfy , i.e. the activation should be linear. In this case, we have and the process has the following GBM dynamics
Observe that in the special case of , we have and . In this case, we obtain .
We summarize the previous results in the following proposition.
where . Consider the stochastic process defined by
Then, the process satisfies the following GBM dynamics
where . As a result, we have that for all ,
E ReLU in the case n=d=1
Consider the process given by the SDE
It is straightforward that if for some , then for all . This is because whenever . A rigorous justification is provided in Lemma 1. Hence, the event constitutes a stopping event where the process becomes constant. We also say that is an absorbent point of the process . A classic tool in stochastic calculus to deal with such situations is the notion of stopping time, which is a random variable that depends on the trajectory of (or equivalently on the natural filtration associated with the Brownian motion ). Consider the following stopping time
Observe that we have for all
which implies that is a Geometric Brownian motion in the interval . Hence, if (a.s.), the network output also has a log-normal distribution in the infinite-depth limit. In the next lemma, we show that with probability which confirms the above.
Let be the stopping time defined by Eq. 12. We have that
Proof By continuity of the Brownian path and the ReLU function , the paths of the process are also continuous (this is a classic result in stochastic calculus; more rigorously, can be chosen to have continuous paths with probability 1). We have that almost surely. From the observation above, taking the limit and using the continuity, we obtain
For some , we have that (by continuity). Hence . This happens with probability zero, which means that the event has probability zero. This concludes the proof.
Hence, with the ReLU activation function, given , the network output is distributed as
Now let us go back to the original setup for . Recall that for some and . By conditioning on and observing that is an absorbent point of the process , we obtain that
We summarize these results in the next proposition.
Then, the process is a mixture of a Geometric Brownian motion and a constant process. More precisely, we have for all
Hence, conditionally on , the process is a Geometric Brownian motion.
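The absorbing behaviour and its disappearance with depth can be illustrated with a small simulation. The sketch assumes the one-dimensional recursion Y_l = Y_{l-1} + W_l ReLU(Y_{l-1}) / √L with i.i.d. standard Gaussian W_l and a positive starting point; this recursion, the value `y0 = 1` and the function name are assumptions of the sketch. Under that assumption the limiting SDE is dY = ReLU(Y) dB, a driftless GBM while Y stays positive, so log Y at time 1 should be roughly Gaussian with mean log(y0) - 1/2 and unit variance.

```python
import numpy as np

def simulate_relu_resnet_1d(depth, y0=1.0, n_paths=20_000, seed=0):
    """Final values Y_L of the assumed width-1 ReLU ResNet, one per sample path."""
    rng = np.random.default_rng(seed)
    y = np.full(n_paths, y0, dtype=float)
    for _ in range(depth):
        w = rng.normal(size=n_paths)
        # Once y <= 0, ReLU(y) = 0 and the path stays frozen at that value.
        y = y + w * np.maximum(y, 0.0) / np.sqrt(depth)
    return y

for depth in (2, 5, 20, 200):
    y = simulate_relu_resnet_1d(depth)
    print(f"depth={depth:4d}  fraction of collapsed paths={np.mean(y <= 0):.4f}")

# At large depth, compare log Y_L with the driftless-GBM prediction.
y_deep = simulate_relu_resnet_1d(200)
print(np.mean(np.log(y_deep)), np.var(np.log(y_deep)))
```

A path collapses at a given layer only if W_l falls below -√L, so the per-layer collapse probability decays very quickly with the depth L, in line with the discussion of the stopping time above.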
F Proof of Lemma 2 and Lemma 3
Proof It is straightforward that with probability we have , which implies that with probability , . Let . Using Itô's lemma with the function , we obtain
where , and , and refers to the Hadamard product of vectors, i.e. coordinate-wise product.
For some , using the path continuity of the process and the continuity of , we have that . Therefore, we should also have
Hence, the random variable is finite with probability . We conclude that
Lemma 3. Consider the stochastic process (7) given by the SDE
where is the ReLU activation function, and is an -dimensional Brownian motion. Let be the stopping time given by
Proof Let . Using Lemma 7, we know that if for some , , then for all , we have that and . Hence, we have that
Let and consider the function and is the Sigmoid function (note that has a closed-form formula given by , which can be seen as a shifted and scaled version of the Softplus function; however, we do not need the closed-form formula in our analysis). It is straightforward that satisfies the conditions of Lemma 2. Let be the solution of the following SDE (the solution exists and is unique since is trivially Lipschitz)
We know from Lemma 8 that converges in to (uniformly over for any ). In particular, this implies convergence in distribution. Moreover, observe that for all
where we used the triangle inequality and the upper bound from Lemma 9. Thus, we have that converges in (and in distribution) to .
Let for , and define . For , using Lemma 9, we have that
Given , we have that for ,
Let us deal with the first term. Using Lemma 9, we have that
From Lemma 10, we know that the random variable converges in and thus it is bounded in norm (over ). Therefore, a simple application of Markov’s inequality yields that the probability above goes to when goes to .
G Proof of 1
where , and is a one-dimensional Brownian motion. As a result, we have that for all
Proof Let . Let us first consider the case where . For all we have and the result is trivial.
Ideally, we would like to use Itô's lemma and Lemma 3, which ensures that remains positive on $t\in$
where and is the Sigmoid function. We have that
Let be the solution of the following SDE
and . By Lemma 10, we know that converges in to . Let us now compute the limit of from the equation above to conclude. More precisely, let us show that for all
where , and .
Let . Using the triangle inequality, Itô isometry, and the Cauchy-Schwarz inequality, we have that
where , and . Let us start with the term . Observe that almost surely. We have that
When , we have that for all
For the remaining term, using the fact that , we have that
for some constant that depends on . Therefore, for all , we have
Recall that . By Lemma 11, we know that
Let us now deal with the last term in . We have that
Notice that almost surely since .
The second term can be upper-bounded as follows
Moreover, letting be the complementary event of , we have
Using the fact that (by Lemma 9 and the fact that ReLU is Lipschitz), we obtain
Recall that this holds for any small enough. Observe that is almost surely non-decreasing as we decrease . Hence has a limit almost surely. Using Lemma 3 and the continuity of the paths of we have that . Taking the limit , we conclude that almost surely we have
which yields the desired result for the conditional mean by subtraction.
Now let us deal with the variance. To alleviate the notation, we omit the conditioning on the event . All the expectations below are taken conditionally on this event. Let . Let . We have that
where we have used the exchangeability property of the family . Thus, for the variance , we obtain
where . Optimizing over yields
The term can be shown to have asymptotic behaviour using tools from McKean-Vlasov theory. Thus, the variance term has (at most) behaviour.
H Proof of 2
In this section, we provide the proof of 2. We use the following law of large numbers, which does not require independence.
Let be a triangular array of random variables. Assume that the following holds
Theorem 2. For , we have
where the convergence holds in . Moreover, we have that
where is the solution of the following (McKean-Vlasov) SDE
As a result, the pre-activations (Eq. 1) converge in distribution to a Gaussian distribution in the limit infinite-depth-then-infinite-width
Proof Let . From 1, we have that almost surely
We know that converges to zero almost surely (by continuity of Brownian paths) and in . Let us now deal with the second term . We have that Fix and let (recall that has an implicit dependence on ). Since is uniformly bounded across and , it is straightforward that the conditions of 8 are satisfied. Therefore, we have the following convergence in
Let us now deal with the second result on the absolute growth factor. Let and define the event
and let be its complementary event. For large enough, we have that
The convergence to McKean-Vlasov dynamics is straightforward from 6, and the Gaussian distribution is given by Lemma 16.
Taking the expectation (this should be understood as integrating the SDE, then taking the expectation, then differentiating once again) yields the following ordinary differential equation
which has a closed-form solution given by
Proof The proof of Lemma 17 is similar to that of Lemma 8, with the only difference being that the Euclidean norm is replaced with the norm in probability space. Let , we have that
where we have used the triangle inequality and Lemma 9. We conclude using Gronwall’s lemma.
I Proof of 3
Theorem 3. Let . Then, in the limit (infinite width, then infinite depth), we have that
where the convergence holds in probability.
Moreover, the pre-activations (Eq. 1) converge in distribution to a Gaussian distribution in the limit infinite-width-then-infinite-depth
Let us show that the first term in the right-hand side converges to in the sequential limit ‘infinite width then infinite depth’. The proof is similar for the second term. We have that
Using Lemma 5 in , and the (positive) homogeneity of ReLU, we have that
J Piece-wise linear activation functions
We have seen in Section 4 that the distribution of is generally intractable for . This is purely due to finite width and not to the non-linearity of the activation function. To understand this, let us see what happens when the activation function is the identity function. In this case the process is solution of the following SDE
When , the SDE Eq. 15 has a closed-form solution given by the (conditional) GBM distribution (2). For general , the entries of are dependent and the resulting dynamics (generally) do not admit closed-form solutions. However, we can obtain closed-form solutions for the norm . Indeed, a simple application of Itô's lemma yields the following results.
With the linear activation, we have that for all ,
where is a one-dimensional Brownian motion. As a result, we have that for all
The proof of 9 is a straightforward application of Itô's lemma; we omit it here.
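A quick simulation can serve as a sanity check of the norm dynamics with the identity activation. The sketch assumes the recursion Y_l = Y_{l-1} + W_l Y_{l-1} / √L with i.i.d. N(0, 1/n) entries in W_l, and uses the fact that W_l y has the same law as (‖y‖/√n) g with g a standard Gaussian vector, which avoids sampling full matrices. Under this assumed recursion, a short Itô computation gives a log-norm growth at time 1 with mean 1/2 - 1/n and variance 1/n; these values are part of the sketch's assumptions and are not quoted from the result above.

```python
import numpy as np

def log_norm_growth_linear(width, depth, n_paths=20_000, seed=0):
    """log(||Y_L|| / ||Y_0||) for the assumed linear ResNet, one value per path."""
    rng = np.random.default_rng(seed)
    y = np.ones((n_paths, width))
    log_norm0 = np.log(np.linalg.norm(y, axis=1))
    for _ in range(depth):
        g = rng.normal(size=(n_paths, width))
        # W y has the same distribution as (||y|| / sqrt(width)) * g; the extra
        # 1 / sqrt(depth) factor is the residual-branch scaling.
        scale = np.linalg.norm(y, axis=1, keepdims=True) / np.sqrt(width * depth)
        y = y + scale * g
    return np.log(np.linalg.norm(y, axis=1)) - log_norm0

for n in (1, 2, 4, 16):
    growth = log_norm_growth_linear(width=n, depth=200)
    print(f"n={n:3d}  mean={growth.mean():+.3f} (assumed limit {0.5 - 1.0 / n:+.3f})  "
          f"var={growth.var():.3f} (assumed limit {1.0 / n:.3f})")
```

In this sketch the mean growth is negative only for width 1 and vanishes at width 2, which matches the sign pattern of the deterministic drift discussed for the identity activation.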
By comparing the result of 1 and 9, we observe some differences between the case of ReLU and that of the identity activation function. With ReLU, the drift term in is given by which is a stochastic term with mean given by . With the identity activation, this drift term is deterministic and is equal to . This allows us to conclude the following:
Non-linearity induces stochastic drift: the non-linearity of ReLU induces stochasticity in the drift term of , which results in the Quasi-GBM dynamics given by 1.
Non-linearity induces change of regime: with ReLU, the mean drift of is given by which is negative for . This induces the change of regime we discussed after 1 (having a negative mean drift implies that there is a significant mass of the distribution of in the regime ). With the identity activation function, the drift term is always non-negative for , and negative for . Thus, the change of regime covers some values only when there is a non-linearity. We give more details about this observation in the next result.
To capture the effect of non-linearity in the regime change phenomenon discussed above, we study the dynamics of the post-norm activation for a special class of piece-wise linear activations that include both ReLU and the identity function. The result of 1 can be easily extended to the case of general piece-wise linear activation functions using the same proof techniques. We obtain the following result which generalizes that of 1 and 9.
10 generalizes that of ReLU (1, ) and the identity activation (9, ). The discontinuity of the mean of at the poles (and ) and (and ) is due to the fact that the event has non-zero probability in these cases and zero probability when and .
Consider the case when and for some . The mean logarithmic growth factor is given by
Observe that for , we recover the result of 9 (identity activation). Hence, a small perturbation of the identity function has the effect of decreasing the factor which results in having negative values for for certain values of . Indeed, by fixing , notice that the minimum value of is obtained when , for which ReLU. Notice that we can also control the change of regime by tuning the parameter . This allows us to control the sign of for any by tuning the parameter . We leave the analysis of the practical implications of tuning for future work.
K Additional Experiments
Additional histograms of and (2) are shown in Fig. 9 and Fig. 10.
K.2 Ornstein-Uhlenbeck process
Additional histograms of and (4) are shown in Fig. 9 and Fig. 10.
K.3 Histograms of non-scaled log-norm of post-activations
In Fig. 13, we show the histogram of based on simulations. We observe that as the width increases, the Gaussian approximation is no longer accurate, which is due to the fact that converges to a deterministic value (2).
In Fig. 14, Fig. 15, Fig. 16, and Fig. 17, we show the histograms of for depth , hidden layers , and widths . We observe that the Gaussian distribution fits the last layers better. This was expected since the limiting distribution (Quasi-GBM) given in 1 is only valid for layer indices when goes to infinity. Thus, for small , it should be expected that the Gaussian distribution would not be a good approximation.
In Fig. 18, Fig. 19, Fig. 20, and Fig. 21, we show the non-scaled versions of the histograms from the previous section. We observe that the histogram concentrates around a single value (the distribution converges to a Dirac mass) as increases. This is a result of the asymptotic behaviour of the ResNet in the infinite-depth-then-infinite-width limit as shown in 2.
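The concentration of the non-scaled histograms can be reproduced qualitatively with a short simulation. This is a sketch under stated assumptions, not the code behind the figures: it assumes the recursion Y_l = Y_{l-1} + W_l ReLU(Y_{l-1}) / √L with i.i.d. N(0, 1/n) entries in W_l, starts from an all-ones vector, and replaces W_l φ(y) by the equally distributed (‖φ(y)‖/√n) g with g a standard Gaussian vector. The depth, widths, sample size and function name are hypothetical.

```python
import numpy as np

def relu_log_norm_growth(width, depth=100, n_paths=4_000, seed=0):
    """log(||phi(Y_L)|| / ||phi(Y_0)||) for the assumed ReLU ResNet, one value per path."""
    rng = np.random.default_rng(seed)
    y = np.ones((n_paths, width))
    phi0 = np.linalg.norm(np.maximum(y, 0.0), axis=1)
    for _ in range(depth):
        phi_norm = np.linalg.norm(np.maximum(y, 0.0), axis=1, keepdims=True)
        y = y + phi_norm / np.sqrt(width * depth) * rng.normal(size=(n_paths, width))
    phi_last = np.linalg.norm(np.maximum(y, 0.0), axis=1)
    # Keep only non-collapsed paths, i.e. condition on a non-zero post-activation.
    alive = phi_last > 0
    return np.log(phi_last[alive] / phi0[alive])

# Standard deviation of the non-scaled log-norm for a few hypothetical widths.
spread = {n: float(relu_log_norm_growth(n).std()) for n in (8, 32, 128)}
for n, s in spread.items():
    print(f"width={n:4d}  std of non-scaled log-norm={s:.3f}")
```

The spread shrinking as the width grows is the numerical counterpart of the histogram collapsing towards a Dirac mass.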