Stable ResNet
Soufiane Hayou, Eugenio Clerico, Bobby He, George Deligiannidis, Arnaud Doucet, Judith Rousseau
INTRODUCTION
The limit of infinite width has been the focus of many theoretical studies on Neural Networks (NNs) [Neal, 1995, Poole et al., 2016, Schoenholz et al., 2017, Yang and Schoenholz, 2017, Hayou et al., 2019a, Lee et al., 2019]. Although unachievable in practice, it features many interesting properties which can help grasp the complex behaviour of large networks. Infinitely wide 1-layer random NNs behave like Gaussian Processes (GPs) at initialization [Neal, 1995]. This was recently extended to multilayer NNs, where each layer can be associated to its own GP [Matthews et al., 2018, Lee et al., 2018, Yang, 2019a]. From a theoretical point of view, GPs have the advantage that their behaviour is fully captured by the mean function and the covariance kernel. Moreover, when dealing with GPs that are equivalent to infinite width NNs, these processes are usually centered, and hence fully determined by their covariance kernel. For multilayer networks, these kernels can be computed recursively, layer by layer [Lee et al., 2018]. Interestingly, in apparent contradiction with the naive idea “the deeper, the more expressive”, it was shown in [Schoenholz et al., 2017] that the GP becomes trivial as the number of layers goes to infinity, that is the output completely forgets about the input and hence lacks expressive power. This loss of input information during the forward propagation through the network might be exponential in depth and could lead to trainability issues for extremely deep nets [Schoenholz et al., 2017, Hayou et al., 2019a]. One natural way to prevent this last issue is the introduction of skip connections, commonly known as the ResNet architecture. However, in the regime of large width and depth, the output of standard ResNets becomes inexpressive and the network may suffer from gradient exploding [Yang and Schoenholz, 2017]. In the present work, we propose a new class of residual neural networks, the Stable ResNet, which, in the limit of infinite width and depth, is shown to stabilize the gradient (no gradient vanishing or exploding) and to preserve expressivity in the limit of large depth. The main idea is the introduction of layer/depth dependent scaling factors to the ResNet blocks. For ReLU networks, we provide a comprehensive analysis of two different scalings: a uniform one, where the scaling factor is the same for all the layers, and a decreasing one, where the scaling factor decreases as we go deeper inside the network. We also show that Stable ResNet solve the problem of Neural Tangent kernel (NTK) degeneracy in the limit of large depth [Hayou et al., 2019b]; indeed, with our scalings, the NTK is universal in the limit of infinite depth, which ensures that any continuous function can be approximated to an arbitrary precision by the features of the infinite depth NTK on a compact set.
All theoretical results are substantiated with numerical experiments in Section 7, where we demonstrate the benefits of Stable ResNet scalings both for the corresponding infinite width GP kernels as well as trained ResNets, over a range of moderate and large-scale image classification tasks: MNIST, CIFAR-10, CIFAR-100 and TinyImageNet.
RESNET
Consider a standard ResNet architecture with layers, labelled with Notation: for integers ., of dimensions .
where is the activation function. The weights and bias are initialized with , and , where , , , and is the normal law of mean and variance .
Recent results by [Hayou et al., 2021] suggest that scaling the residual blocks with might have some beneficial properties on model pruning at initialization. This results from the stabilization effect on the gradient due to the scaling.
More generally, we introduce the residual architecture:
where is a sequence of scaling factors. We assume hereafter that there exists such that for all and . In the next proposition, we give a necessary and sufficient condition for the gradient to remain bounded as the depth goes to infinity.
Proposition 1 shows that in order to stabilize the gradient, we have to scale the blocks of the ResNet with scalars such that remains bounded as the depth goes to infinity. Taking , Proposition 1 shows that the standard ResNet architecture (1) suffers from gradient exploding at initialization,In [Yang and Schoenholz, 2017], authors show a similar result with a slightly different ResNet architecture. which may cause instability during the first step of gradient based optimization algorithms such as Stochastic Gradient Descent (SGD). This motivates the following definition of Stable ResNet.
A ResNet of type (2) is called a Stable ResNet if and only if .
The condition on the scaling factors is satisfied by a wide range of sequences . However, it is natural to consider the two categories: Uniform scaling. The scaling factors have similar magnitude and tend to zero at the same time. A simple example is the uniform scaling . Decreasing scaling. The sequence is decreasing and tends to zero. To be clearer, we consider a general sequence such that , and let for all , all .
Note that our theoretical analyses will hold for any decreasing scaling that is square summable, but for simplicity in all empirical results we consider the decreasing scaling:
We study theoretical properties of both ResNets with uniform and decreasing scaling. We show that, in addition to stabilizing the gradient, both scalings ensure that the ResNet is expressive in the infinite depth limit. For this purpose, we use a tool known as Neural Network Gaussian Process (NNGP) [Lee et al., 2018] which is the equivalent Gaussian Process of a Neural Network in limit of infinite width.
2 On Gaussian Process approximation of Neural Networks
Consider a ResNet of type (2). Neurons are iid since the weights with which they are connected to the inputs are iid. Using the Central Limit Theorem, as , is a Gaussian variable for any input and index . Moreover, the variables are iid. Therefore, the processes can be seen as independent (across ) centred Gaussian processes with covariance kernel . This is an idealized version of the true process corresponding to letting width . Doing this recursively over leads to similar approximations for where , and we write accordingly . The approximation of by a Gaussian process was first proposed by [Neal, 1995] in the single layer case and was extended to multiple feedforward layers by [Lee et al., 2019] and [Matthews et al., 2018]. More recently, a powerful framework, known as Tensor Programs, was proposed by [Yang, 2019b], confirming the large-width NNGP association for nearly all NN architectures.
For the ReLU activation function , the recurrence relation can be written more explicitly as in [Daniely et al., 2016]. Let be the correlation kernel, defined as
The recurrence relation reads (see Appendix A1)
This recursion leads to divergent diagonal terms . This was proven in [Yang and Schoenholz, 2017] for a slightly different ResNet architecture. In the next Lemma, we extend this result to the ResNet defined by (1).
Figure 1 plots the diagonal NNGP and NTK (introduced in Section 5) values for a point on the sphere, highlighting the exploding kernel problem for standard ResNets. Stable ResNets do not suffer from this problem.
The symmetry in the above definition has to be understood as for all .
Kernels induce non-negative integral operators [Paulsen and Raghupathi, 2016].
Given a kernel on , the Gaussian Process induced by is a centred GP on whose covariance function is .
We will sometimes use the notation for the law of the GP induced by a kernel . With our definition of a kernel, the samples from the induced GP lies in with probability [Steinwart, 2019].
From now on we will assume that if .We exclude since for is discontinuous in and can’t be a kernel on as in Definition 2, if . For all ResNets, it is straightforward to check that is a kernel, in the sense of Definition 2 (see Appendix A1 or [Daniely et al., 2016]). The induced Gaussian Process is what we refer to as NNGP.
We denote by the Reproducing Kernel Hilbert Space (RKHS)See Appendix A0 for a definition. induced by the kernel on the set . The following hierarchical result holds.
For all , , .
Proposition 2 shows that, as we go deeper, the RKHS cannot become poorer. However, increasing might introduce stability issues as illustrated in Proposition 1. We show in Sections 3 and 4 that Stable ResNets resolve this problem.
By Lemma 2, is a bounded, compact, self-adjoint operator and hence can be written as the sum of the projections on its eigenspaces [Lang, 2012]. By Mercer’s Theorem [Paulsen and Raghupathi, 2016], all the eigenfunctions of are continuous. Finally, it is possible to link the eigen-decomposition of with the distribution of the GP induced by . Denoting respectively by and the eigenvalues and eigenfunctions of the operator , we have the equivalence in law:
where are i.i.d. standard Gaussian random variables [Grenander, 1950]. The expressivity, that is the capacity to approximate a large class of function, of the network at initialization is then closely linked to the eigendecomposition of [Yang and Salman, 2019].
3 Universal kernels and expressive GPs
Let be a kernel on , and its RKHS See Appendix A0.. We say that is universal on if for any and any continuous function on , there exists such that .
The universality of a kernel on a compact set implies that the kernel is strictly positive definite, i.e. for all non-zero [Sriperumbudur et al., 2011]. Moreover, universality also implies the full expressivity of the induced GP, as expressed in the following.
A Gaussian Process on is said to be expressive on if, denoting by a random realisation of the process, for all , for all ,
A universal kernel on induces an expressive GP on .
By definition, universal kernels are characterized by the property that their associated RKHS is dense (w.r.t the uniform norm ) in the space of continuous functions on . This is crucial for Kernel regression and Gaussian Process inference [Kanagawa et al., 2018].The closure of the set of functions described by the mean function of the posterior of a GP regression is exactly the RKHS of the kernel of the GP prior. By Proposition 2, it suffices to prove that is universal for some in order to conclude for all . It turns out this is true for .
If , then is universal on . From Proposition 2, is universal for all .
Although the kernel is universal for fixed depth , it is not guaranteed that as , remains universal. Indeed, for the standard ResNet architecture, the variance grows exponentially with [Yang and Schoenholz, 2017], and therefore, the kernel diverges. In order to analyse the expressivity of the kernel of a standard ResNet in the limit of large depth, we can study the correlation kernel , defined in (3), instead. We show in the following Lemma that, as goes to infinity, the kernel converges to a constant (which has a 1D RKHS).
Therefore, is the space of constant functions.
Lemma 4 shows that in the limit of infinite depth , the RKHS of the correlation kernel is trivial, meaning that the NNGP cannot be expressive. On the contrary, we will show in the next sections that Stable ResNets achieve a universal kernel for infinite depth .
UNIFORM SCALING
Consider a Stable ResNet with layers . Under uniform scaling, the recurrence relation in (5) reads:
In the limit as , (7) converges uniformly to a continuous ODE. Studying the solution of the corresponding Cauchy problem, we show that the covariance kernel remains universal in the limit of infinite depth.
As discussed in Section A2 of the Appendix, for any , the solution of the above Cauchy problem exists and is unique. Moreover, the solutions and are kernels on , in the sense of Definition 2.
Clearly, for finite , the continuous ODE (8) is an approximation. However, the following result holds.
Let be the covariance kernel of the layer in a net of layers , and be the solution of (8), then
2 Universality of the covariance kernel
When , the kernel is universal for .
The proof of the above statement is detailed in Appendix A2. The main idea is to show that the integral operator is strictly positive definite and then use a characterization of universal kernels, due to [Sriperumbudur et al., 2011], which connects the universality of Definition 4 with the strict positivity of the induced integral operator.The details are more involved as we need to show that the kernel induces a strictly positive definite operator on for any finite Borel measure on .
DECREASING SCALING
The convergence of the kernel to the limiting kernel is governed by the convergence rate of the series of scaling factors. Moreover, leveraging the RKHS hierarchy from Proposition 2, we find that is universal.
As in the uniform scaling case, the limiting kernel exists and is universal unlike the standard ResNet architecture that yields a divergent kernel as .
To validate our universality and expressivity results, Figure 2 plots the leading eigenvalues of the NNGP (& NTK, introduced in Section 5) kernels on a set of 1000 points sampled uniformly at random from the circle, normalized so that the largest eigenvalue is 1. We use the recursion formulas for NNGP correlation (Lemma A4) and normalized NTK (Lemma A19) to avoid the exploding variance/gradient problem. We see that the unscaled ResNet NNGP becomes inexpressive with depth because all non-leading eigenvalues converge to 0, whereas our Stable ResNets (decreasing and uniform scaling) are expressive even in the large depth limit.
NEURAL TANGENT KERNEL
In the so-called lazy training regime [Chizat and Bach, 2019], the training dynamics of an infinitely wide network can be described via the Neural Tangent Kernel (NTK) [Lee et al., 2019], introduced in [Jacot et al., 2018] and defined as
where and are respectively the input and output datasets, and is the matrix . The universality of the NTK is crucial for the ResNet to learn beyond initialization, since the residual lies in the RKHS generated by . For unscaled ResNet, [Hayou et al., 2019b] showed that the limiting NTK is trivial in the sense of Lemma 4. However, this is not the case for Stable ResNet.
Consider a ResNet of type (2). We have This is true under the technical assumption that the parameters appearing in the back-propagation can be considered independent from the ones of the forward pass (Gradient Independent Assumption) [Yang, 2019a]
An analogous result can be stated for the uniform scaling, after noticing that a continuous formulation () can be obtained in analogy with what has been done for the covariance kernel (cf Appendix A4).
Figure 2 shows that the non-leading NTK eigenvalues do not decay to 0 with depth for Stable ResNets, unlike for unscaled ResNets. This is in line with findings of Propositions 8 and 9.
A PAC-BAYES RESULT
where is a probability distribution on . For some randomized learning algorithm , the empirical and generalization loss are given by:
The PAC-Bayes theorem gives a probabilistic upper bound on the generalization loss of a randomized learning algorithm in terms of the empirical loss . Fix a prior distribution on the hypothesis set . The Kullback-Leibler divergence between and is defined as . The Bernoulli KL-divergence is given by for . We define the inverse Bernoulli KL-divergence by
The KL-divergence term plays a major role as it controls the generalization gap, i.e. the difference (in terms of Bernoulli KL-divergence) between the empirical loss and the generalization loss. In our setting, we consider an ordinary GP regression with prior . Under the standard assumption that the outputs are noisy versions of with , the Bayesian posterior is also a GP and is given by
, . In this setting, we have the following result
Let be the kernel of a ResNet. Let be a GP with kernel and be the corresponding Bayesian posterior for some fixed noise level . Then, in a fixed setting (fixed sample size N), the following results hold: With a standard ResNet, . With a Stable ResNet, .
The KL-divergence bound diverges for a standard ResNet while it remains bounded for Stable ResNet. Although PAC-Bayes bounds only give an upper bound on the generalization error, Proposition 10 shows that Stable ResNet does not suffer from the “curse of depth”, i.e. the KL-divergence does not explode as the depth becomes large.
EXPERIMENTS
In line with our theory, we now present results demonstrating empirical advantages of Stable ResNets (both uniform and decreasing scaling) compared to their unscaled counterparts on a toy regression task and standard image classification tasks, both for infinite-width NNGP kernels as well as trained finite-width NNs in the latter case. In the interests of space, all experimental details not described in this section can be found in Appendix A7. All error bars in this section correspond to 3 independent runs.
We first present a toy regression posterior regression experiment with NNGP kernel. We compare across different depths and scalings, with target test function and a small amount of observation noise ( as defined in Eq. 10).
We use 5 training points (dark green dots).
We map our 1D inputs onto the circle before performing GP regression. This is so that all inputs have unit norm and we can use the NNGP correlation kernel (Eq. 3) for the vanilla ResNet (ResNet with fully connected blocks), in order to avoid the exploding variance problem. As expected from our theory, in Figure 3, for depth 1000 the NNGP correlation kernel without stable scaling (top row, red) is unable to learn anything beyond a constant function due to inexpressivity, whereas our Stable ResNets (bottom two rows, blue) are still expressive in the large depth limit. We plot mean and 95% posterior predictive credible interval for NNGP posteriors.
Stable NNGP classification results
We first compare the performance of Stable and standard ResNets of varying depths through their infinite-width NNGP kernels, on MNIST & CIFAR-10. For each considered NNGP kernel and training set , we report test accuracy using the mean of the posterior predictive (Eq. 10): , which is also the kernel ridge regression predictor [Kanagawa et al., 2018]. We treat classification labels as one-hot regression targets, similar to recent works [Arora et al., 2019, Lee et al., 2019, Shankar et al., 2020], and tune the noise using prediction accuracy on a held-out validation set.
First, in Table 1, we demonstrate the exploding NNGP variance problem for unscaled Wide-ResNets (WRN) [Zagoruyko and Komodakis, 2016]. For an unscaled WRN of depth 202, the NNGP kernel values explode resulting in numerical errors, whereas Stable ResNets achieve 54% test accuracy with 10K training points (out of full size 50K). Note that any numerical errors from exploding NNGP also afflict the NTK, as the difference between the NTK and NNGP is positive semi-definite [Lee et al., 2019, He et al., 2020] (which is why the NTK lines always lie above their corresponding NNGP in Figure 1).
To isolate the disadvantages of inexpressivity in unscaled Resnets NNGPs compared to our Stable ResNets, we need to avoid the exploding variance problem and ensuing numerical errors. In order to do so, we use the NNGP correlation kernel instead of the NNGP covariance kernel , noting that these two kernels are equal up to multiplicative constant on the sphere, and that the posterior predictive mean is invariant to the scale of (with also tuned relative to the scale of ). Moreover, the formula in Lemma A4 for NNGP correlation recursion for vanilla ResNets without bias can be recast as a ResNet with a modified scaling (see Appendix A6), allowing us to use existing optimised libraries [Novak et al., 2020]. In order to use the vanilla ResNet correlation recursion, we standardise all MNIST & CIFAR-10 images to lie on the 784 & 3072-dimension sphere respectively.
Our expressivity results, as well as Proposition 10, suggest that we expect Stable ResNets to outperform standard ResNets for large depths even when exploding variance numerical errors are alleviated for standard ResNets. In Table 2, we see that unscaled ResNets suffer from a degradation in test accuracy with depth, due to inexpressivity, whereas our Stable ResNets (both decreasing and uniform) do not suffer from a drop in performance. For example, the posterior predictive mean using the NNGP of an unscaled vanilla ResNet with depth 1000 attains only 17.86% accuracy on CIFAR-10 with 10K training points, compared to 48.76% for Stable ResNet (decreasing scale).
We focus on the NNGP rather than the NTK as recent works [Lee et al., 2020, Shankar et al., 2020] have demonstrated that there is no advantage to the state-of-the-art NTK over the NNGP as infinite-width kernel predictors. Moreover, we do not aim for near state-of-the-art kernel results due to computational resources, and instead aim to empirically validate the theoretical advantages of Stable ResNets.
Trained Stable ResNet results
Finally, we consider the benefits of trained Stable ResNets on the large-scale CIFAR-10, CIFAR-100 and TinyImageNetAvailable at http://cs231n.stanford.edu/tiny-imagenet-200.zip datasets. We compare trained convolutional ResNets [He et al., 2016] of depths 32, 50 & 104 in terms of test accuracy. In the main text we present results for ResNets trained with Batch Normalization [Ioffe and Szegedy, 2015] (BatchNorm), while results for trained ResNets without BatchNorm can be found in Appendix A7. Stable ResNet scalings are applied to the residual connection after all convolution, ReLU and BatchNorm layers.
We use initial learning rate which is decayed by at and of the way through training. This learning rate schedule has been used previously [He et al., 2016] for unscaled ResNets and we found it to work well for all ResNets trained with BatchNorm. We train for 160 epochs on CIFAR-10/100 and 250 epochs on TinyImageNet. Test accuracy results are displayed in Table 3. As we can see, Stable ResNets consistently outperform standard ResNets across datasets and depths. Moreover, the performance gap is larger for larger depths: for example on CIFAR-100 our Stable ResNet (decreasing) outperforms its standard counterpart by 1.05% (75.06 vs 74.01) on average for depth 32 whereas for depth 104 the test accuracy gap is 2.36% (77.44 vs 75.08) on average. A similar trend can also be observed for the more challenging TinyImageNet dataset. Interestingly, we see that among the Stable ResNets, decreasing scaling also consistently outperforms uniform scaling.
CONCLUSION
Stable ResNets have the benefit of stabilizing the gradient and ensuring expressivity in the limit of infinite depth. We have demonstrated theoretically and empirically that this type of scaling makes NNGP inference robust and improves test accuracy with SGD on modern ResNet architectures. However, while Stable ResNets with both uniform and decreasing scalings outperform standard ResNet, the selection of an optimal scaling remains an open question; we leave this topic for future work.
ACKNOWLEDGMENTS
This material is based upon work supported in part by the U.S. Army Research Laboratory and the U. S. Army Research Office, and by the U.K. Ministry of Defence (MoD) and the U.K. Engineering and Physical Research Council (EPSRC) under grant number EP/R013616/1. AD is also partially supported by EPSRC EP/R034710/1. BH is supported by the EPSRC and MRC through the OxWaSP CDT programme (EP/L016710/1). The project leading to this work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 834175).
References
Appendix
A0 Mathematical preliminaries
We will make use of functional analysis results on the theory of Hilbert space. We refer to for a comprehensive introduction to the topic. We precise here that, even when not explicitly stated, all Hilbert spaces considered in the present work are real, and all linear operator are bounded. We will make use of the spectral theory for compact self-adjoint operators. We refer again to for a detailed discussion.
We state here a characterisation of kernels, which is an extension of Lemma 2. Despite being a classical result (see the discussion about Mercer kernels in ), we will give a proof, for the sake of completeness.
for any . The operator is a bounded compact self-adjoint definite operator. Moreover, is a kernel if and only if is non-negative definite for all finite Borel measures on .
and the convergence is uniform on . The continuity of the ’s implies that they can be seen as elements of . Moreover, the uniform convergence, along with the fact that , implies the convergence of the sum wrt the operator norm. In particular is a limit of non-negative definite operators and hence non-negative definite. Now, assume that, for all finite Borel , is non-negative definite. Chosen a finite set , in particular we have that is a finite Borel measure (where is the Dirac measure on ). Hence is the matrix . We conclude that is a kernel. ∎
We will now give a definition of the Reproducing Kernel Hilbert Space associated to a kernel. We refer to for a general and comprehensive introduction to the topic.
Given a kernel on , we can associate to it a real Hilbert space , with the following properties:
Denoting as the inner product of , for each , there exists a element such that , for all .
For all , .
Such a Hilbert space exists for each kernel and it is unique up to isomorphism, . is called the Reproducing Kernel Hilbert Space (RKHS) of .
In general, it is not easy to give an explicit form for the RKHS associated to a kernel . However, we can say that it contains the linear span of . Actually, this linear span is a dense subset of , wrt the norm of .
A kernel on is said to be universal if its RKHS is dense in the space of continuous functions , wrt the uniform norm.
Let be a kernel on , and its RKHS. We say that is universal on if for any and any continuous function on , there exists such that .
We can now state a characterization of universal kernels, from .
As a final note, hereafter we often omit the explicit reference to the measure , that is we will speak of the operator on . Unless otherwise stated, this notation implies the choice of an arbitrary finite Borel measure on the compact .
A1 Residual Neural Networks and Gaussian processes
Consider a standard ResNet architecture with layers, labelled with , of dimensions .
Hereafter, denotes the number of neurons in the layer, the activation function and for . The components of weights and bias are respectively initialized with , and where denotes the normal distribution of mean and variance .
In , authors showed that wide deep ResNets might suffer from gradient exploding during backpropagation.
Recent results by suggest that scaling the residual blocks with might have some beneficial properties on model pruning at initialization. This is a result of the stabilization effect of scaling on the gradient.
More generally, we introduce the residual architecture:
where is a sequence of scaling factors. We assume hereafter that there exists such that for all and , we have that .
where we have used the Central Limit Theorem. Therefore, we have
For the remainder of this appendix, we define the function
For all , the diagonal terms of have closed-form expressions. We show this in the next lemma.
where is given by (A2). It is straightforward that . This yields
As a corollary of the previous result, it is easy to show that for a Standard ResNet the diagonal terms explode with depth, which is Lemma 1 in the main paper.
The statement trivially follows from Lemma A3, using that and the fact that for a Standard ResNet (1), all the coefficients ’s are equal to . ∎
In the case of a ResNet with no bias, the correlation kernel follows a simple recursive formula described in the next lemma.
where .
This is direct result of the covariance recursion formula (5). ∎
A1.2 Proof of Proposition 1
We use the following result from in order to derive closed form expressions for the second moment of the gradients.
Consider a ResNet of the form (2) with weights . In the limit of infinite width, we can assume that used in back-propagation is independent from used for forward propagation, for the calculation of Gradient Covariance and NTK.
Next we re-state and prove Proposition 1.
where . We conclude by taking the supremum over and .
where . ∎
Using Lemma A5, we can derive simple recursive formulas for the second moment of the gradient as well as for the Neural Tangent Kernel (NTK). This was previously done in for feedforward neural networks, we prove a similar result for ResNet in the next lemma.
In the limit of infinite width, using the same notation as in proposition 1, we have that
Using lemma A5 and the Central Limit Theorem, we have that
Before moving to the next proofs, recall the definition of Stable ResNet.
A ResNet of type (2) is called a Stable ResNet if and only if .
Leveraging the previous result, the function defined in (4) is analytic. We clarify this in the next lemma.
we get . Moreover, we have that for all
This yields . Then, noticing that
is an odd function, we get that for all . Now let us prove that for all , there exist such that, for all ,
We prove this by induction. For , we have that
so that our claim holds. Assume now that it is true for some , let us prove it for . It is easy to see that
The induction is straightforward. In particular, we have shown that . The conclusion for the coefficients ’s of the expansion of is then trivial. ∎
Using Lemma A8, it will not be hard to show that is continuous. The non-negativity of can be seen as a consequence of the definition of as the covariance of a Gaussian Process. However, we will give a direct proof of it, so that we can state here a general result which we will need later on.
converges uniformly on $\muKT_{\mu}(g(C))g(C)$ is a kernel.
For both Standard and Stable ResNet architectures, for any layer , the covariance function and the correlation function are kernels on , in the sense of Definition 2.
It is straightforward to prove that is a kernel. Now let us show that if is a kernel for some , then is a kernel. Since is symmetric and so is. Moreover, the diagonal elements of are continuous by Lemma A3 and do not vanish (since if we are assuming that ). Hence is continuous. It is then trivial to show that the non-negative definiteness of implies that is non-negative definite, and so is a kernel if is. Now we proceed by induction. Suppose that and are kernels and recall the recursion (5), taking the coefficient to be in the case of a Standard ResNet. Notice that it can be rewritten as
where we have omitted the dependence on for , we have defined and is defined in (A2). Clearly is a kernel. By Lemma A8 and Lemma A9 we have that is a kernel. Using the property that sums and products of kernels are kernels (the sum is trivial, cf Footnote 14 for the product), we conclude that , and so , is a kernel on . ∎
A1.4 Proof of Proposition 2
for all .
We have already shown that is non-negative definite in the proof of Lemma A10. We conclude by using the RKHS hierarchy result (see for instance or page 354 in ). ∎
A1.5 Proof of Lemma 3
A Gaussian Process on is said to be expressive on if, denoted by a random realisation, for all , for all ,
A universal kernel on induces an expressive GP on .
For , we can define the interval , so that, for all we have . Since all these intervals are non empty, we get
By Mercer’s theorem , is trace class and hence for diverging . By Markov’s inequality
A1.6 Proof of Proposition 3
In order to prove Proposition 3 we first need a preliminary result, which will be at the core of the proof of Theorem 1 as well.
where the coefficients ’s are all strictly positive, explicitly . Expanding the inner product , we can express in the form
If , then is universal on . From Proposition 2, is universal for all .
with can be written as a finite linear combination of the functions . This yields
A1.7 Proof of Proposition 4
See the proof of Proposition A7 in Appendix A8. ∎
A1.8 Proof of Proposition 5
Proposition 5 is a well known classical result (see for instance Appendix H in and the references therein. For completeness we give a proof in Appendix A8.
See the proof of Lemma A22 in Appendix A8. ∎
A1.9 Proof of Lemma 4
Therefore, is the space of constant functions.
Since , is non-decreasing wrt and converges to the unique fixed point of which is . This convergence is uniform in , i.e. . Re-writing the recursion yields
where , and . Using Lemma A3, and the boundedness of , a simple Taylor expansion yields
where the expansion is uniform on , and , and for some . The previous dynamical system can be decomposed in two parts, a first part without the term which is the homogeneous system, i.e. the system without bias, and the term which is the contribution of the bias in the dynamical system. Assume , then the term vanishes. Moreover, a Taylor expansion of near 1 yields
Therefore, uniformly in , we have that
Letting , a simple Taylor expansion leads to
Therefore, where . This equivalence is uniform in .
It is likely that the rate holds without assuming . However, the analysis in this requires unnecessarily complicated details. ∎
A2 Stable ResNet with uniform scaling
We provide the results of existence, uniqueness and regularity of the solution of (8) in Lemma A11. Corollary A1 shows that the differential problem can be restated in the operator space. Eventually we give a proof of Lemma 5, assuring uniform convergence to the continuous limit.
For any in , the solution of (8) is unique and well defined for all . The maps and are Lipschitz continuous on and takes values in $q_{t}c_{t}$ are kernels in the sense of Definition 2.
First notice that from (8) we can find, with few algebraic manipulations, an explicit recurrence relation for the correlation , defined in (3). For any we have
We can find a Cauchy problem for the correlation directly from (8) or by noting that , for . With both approaches, we have
where is defined in and
Note that for the diagonal terms , (8) reduces to , whose solution is
is Lipschitz continuous in and in , so there exists such that the Cauchy problem
has a unique solution defined for . Noticing that
we get that for all such that we have , since , and for all such that we have . As a consequence for all and we can take . In particular we get that (A6) has a unique solution , defined for and bounded in $t\geq 0$.
Now notice that is Lipschitz on . let us denote as a Lipschitz constant for . Since both and are , we can find real constants , and such that for all elements of
Let be a Lipschitz constant for . Using the fact that , we can write
where and . Now fix and and consider . We have
So , meaning that (and so ) is Lipschitz on .
Since the mapping is continuous, it defines a compact integral operator on . Since is real and symmetric under the swap of and , the operator is self-adjoint. The same holds true for .
Consider the map , defined on , which is continuous wrt and wrt , as it can be easily checked. Since and $t$
Let be the covariance kernel of the layer in a net of layers , and be the solution of (8), then
We will show that the relation holds for , and hence for . Let , defined on , be such that . Explicitly, with the same notations as in (A6), we have
Since and takes values on compact sets, by uniform continuity, fixed we can write, for
where and are defined as in (A6). As a consequence, we can find a constant and an integer such that, for all , for all , for all
Moreover, there exists a constant such that for all , all and all pairs
Thanks to the two above uniform inequalities, we will now show that, for ,
At this point, using the fact that , it is easy to show by induction that
and so (A9) follows. Finally, the uniform convergence of to implies the one of to and so we conclude. ∎
A2.2 Universality of the covariance kernel
We will now prove the results of universality of Theorem 1 and Proposition 6.
Proof of Theorem 1
Fix any finite Borel measure on , and assume that . Given any non-zero , there exists a such that , for all .
From Corollary A1, we can expand around as
the being wrt the operator norm, where we have defined the kernel via . Since is non-negative, for any , we have
For any finite Borel measure on , for any , the operator on is non-negative definite. In particular, for all we have
Fix and . From (8) we can write
By Lemma A11, is non-negative definite, so we can write
By Lemma A2, it suffices to show that for any finite Borel measure on , is strictly positive definite for all . Fix any nonzero , define the map on $F(t)=\langle T_{\mu}(q_{t})\,\varphi,\varphi\ranglet\in(0,1]s\in(0,t)F(s)>0FF_{t}>0T_{\mu}(q_{t})$ is strictly positive definite. ∎
Proof of Proposition 6
converges in the operator norm. Then is a compact strictly positive definite operator.
both sums converging wrt the operator norm.
The claims for have been already proven in Lemma A8. As for , the analyticity of implies the one of , and it is easy to check the convergence on $f^{\prime}f\beta_{n}>0n$. ∎
The case has been already established in Proposition A2, hence suppose that . First recall (A6)
for small enough. On the other hand, for , we have and so
for small enough. So there is a such that, for , . It follows immediately that the same property is true for . ∎
A3 Stable ResNet with decreasing scaling
Therefore, we can assume without loss of generality that . This yields
Letting and , we have that
Since is non decreasing, is non-decreasing and has a limit .
Now let us prove that the convergence of to happens uniformly with a rate . Using the recursive formula of , and knowing that we have that
Therefore, using the fact that , we have
A3.2 Proof of Corollary 1
A4 Neural Tangent Kernel
Throughout this section, we will consider ResNets with NTK parameterization . This simply means that all the components of the biases and the weights will be initialized as iid standard normal random variables. In order to compensate this change of parameterization, the propagation through the network needs to be slightly modified. Hence (2) will be replaced by
However, it is strightforward to verify that the recurrence (5) for the covariance kernels keeps unchanged. Clearly, the dynamics of a standard ResNet with NTK parameterization can be recovered from (A12) by setting for all .
The Neural Tangent Kernel, introduced by , is defined as
where denotes the gradient wrt the parameters of the network. The NTK of a Stable ResNet can be evaluated recursively. We will now prove the recurrence formula (9). The following result was proven in Lemma 3 in for the case of a standard ResNet without bias. We extend it to ResNet with bias.
For a Stable ResNet, the NTK can be evaluated recursively, layer by layer, as
We prove the second result by induction. The proof is similar to the one of ResNet in . Let . For and
We prove the result by induction. Assume the result is true for layers and let us prove it for . Using the induction hypothesis, as recursively, we have that
where
As , we have that . Using the law of large numbers, as
As a corollary of the above result, using the results in for the ReLU activation function, we can express the recursion more explicitly. We have
where is defined in (4) and is the first derivative of . So we can write
We can now easily check that the NTK is a kernel in the sense of Definition 2.
For all layer , is a kernel in the sense of definition (2).
It’s clear that is a kernel. Now fix any layer . We have already proved in Lemma A10 that is a kernel. With a similar argument, noting that can be expressed as a power series with only non negative coefficients on $1+f^{\prime}(C_{l})\Theta_{l}$ is a kernel. ∎
As a final remark, note that from (A1), we have that . Hence we can rewrite (A13) as
Since is non negative on $\Theta_{l}\geq Q_{l}l$. This is done explicitly in the next Lemma, which is a Corollary of Lemma 1 and show the divergence of the NTK for a Standard ResNet.
By Lemma 1, it suffices to show that . Recall (A13), noticing that on $\left(1+\tfrac{f(C_{l})}{C_{l}}\right)Q_{l}\geq 0\Theta_{0}=Q_{0}\geq 0\Theta_{l}\geq 0l\Theta_{l+1}-\Theta_{l}\geq Q_{l+1}-Q_{l}\Theta_{l}\geq Q_{l}lK^{2}\Theta_{L}(x,x)\geq Q_{L}(x,x)x\in K$. ∎
Consider a ResNet of type (1) without bias, and let . The NTK recursion formula can be written in terms of normalized NTK
where is given by (A2), .
where and . Using the recursive formula for the diagonal elements, we have that . We conclude by dividing both sides by . ∎
Therefore, the NTK can be expressed exclusively in terms of the covariance kernels , more precisely we have that
It is straightforward that converges pointwise to a limiting kernel . Let us prove that the convergence is uniform over . By observing that , we have that for all
where is a constant that depends on the compact . This proves the uniform convergence with a rate of . As a consequence, being a uniform limit of kernels, is a kernel.
Proceeding as in the proof of Lemma A17, it’s easy to prove by induction that for all , is a kernel. In particular,
where is in the operator sense, that is is non-negative definite. This yields
Therefore inherits the universality of naturally by the RKHS hierarchy . We conclude that is universal (for both cases). ∎
With the uniform scaling, for arbitrary , the continuous version of (9) reads
where is the first derivative of , defined in (4).
For any in , the solution of (A16) is unique and well defined for all . Moreover, the map is a kernel in the sense of Definition 2 for all . We have the convergence of the discrete model to the continuous one:
The existence and the uniqueness are clear, since it is a homogeneous first order Cauchy problem, with continuous coefficients. We can write explicitly the solution as
Hence, is the limit of a sequence of non-negative definite operators and hence it is non-negative definite, so that is a kernel on for all . ∎
Fix . The solution of (A16) can be written as , where
Now, let us show that . First, by Lemma A14 it is easy to check that is analytic on and its Taylor expansion around converges on $(1+f^{\prime}(c_{s}))s\in[0,s)(1+f^{\prime}(c_{s}))\,\theta_{s}(1+f^{\prime}(c_{s})\,\theta_{s}Z^{2}s\in[0,t)r_{t}\muKT_{\mu}(r_{t})\varphi\in L^{2}(K,\mu)$, by simple standard arguments we have
and so is a kernel. Now, given two kernels and , it is a classical result that is a kernel and its RKHS contains the RKHS of and , . We conclude that the RKHS of contains the RKHS of . Since is universal, is universal. ∎
A5 A PAC-Bayes Generalization result
Assuming that the samples are distributed as where is a probability distribution on , we define the generalization (true) loss by
For some randomized learning algorithm , the empirical and generalization loss are given by
The PAC-Bayes theorem gives a probabilistic upper bound on the generalization loss of a randomized learning algorithm in terms of the empirical loss . Fix a prior distribution on the hypothesis set . The Kullback-Leibler divergence between and is defined as . The Bernoulli KL-divergence is given by for . We define the inverse Bernoulli KL-divergence by
The PAC-Bayesian theorem gives can also be stated as
The KL-divergence term plays a major role as it controls the generalization gap, i.e. the difference (in terms of Bernoulli KL-divergence) between the empirical loss and the generalization loss. In our setting, we consider an ordinary GP regression with prior . Under the standard assumption that the outputs are noisy versions of with , the Bayesian posterior is also a GP and is given by
where and . In this setting, we have the following result
Let be the kernel of a ResNet. Let be a GP with kernel and be the corresponding Bayesian posterior for some fixed noise level . Then, in a fixed setting (fixed sample size N), the following results hold:
The proof relies on the simple observation that . This yields
where .
Since is symmetric and strictly positive definite, it is straightforward that the largest eigenvalue of is smaller than . This yields
where the last inequality holds for sufficiently large .
Case 2. In the case of Stable ResNet, we know that as , the kernel converges to a strictly positive definite kernel , therefore the first term remains bounded as , which concludes the proof. ∎
A6 NNGP correlation kernel without bias as a modified NNGP kernel
Unscaled ResNets suffer from the exploding variance problem, which needs to be avoided in order to isolate the disadvantages of inexpressivity in their NNGP kernel. In order to do so, we use the NNGP correlation kernel instead of NNGP covariance kernel , noting that Lemma A4 provides a simple recursion formula for if , at depth :
where and defined in (A2). In order to combine this with open-source packages designed for NNGP calculation, we note that (A20) can be viewed as the NNGP kernel of the following modified ResNet layer, using the same notation as in (2):
with
A7 Experimental details and additional results
For our Vanilla ResNet NNGP results, we preprocess all training, validation and test data by first centering the training set and then normalizing all images to lie on the pixel dimension sphere. For our Wide ResNet NNGP results we normalise all data so that the training set is centered and has channel-wise unit variance. We use Kaiming initialisation throughout, with and . Vanilla ResNets have the same structure as type (2) in Table 2 and we use the same WRN kernel architecture as in Table 1 but omit the final average pooling step, which is known to improve kernel performance but dramatically increase computational costs . Throughout this work, where there are residual blocks with multiple layers, we calculate our scaling factors for uniform and decreasing scaled Stable ResNets by the number of residual connections. For example, a WRN-202 has only 99 residual connections, so we set for the uniform scaling factors. We tune the noise variance , which is akin to the regularisation parameter in kernel ridge regression. To do so, we compute validation accuracy on a validation set of size 5000, selecting the best from a logarithmic scale of , where is the training set size and is the training set Gram matrix for NNGP .
A7.2 Trained ResNet results
For all our trained ResNet experiments we use a similar setup to the open-source code for in PyTorch . We repeat each experiment 3 times and report the best test accuracy and error intervals. All ResNets are initialised with Kaiming initialisation and like we adopt ResNets architectures where we double the number of filters in each convolutional layer. For experiments with BatchNorm, on CIFAR-10/100 we use batch size 64 across all depths and on TinyImageNet we used batch size 128 for depths 32 & 50, and batch size 100 for depth 104 in order to allow the model to fit onto a single 11GB VRAM GPU. We use SGD with momentum parameter and weight decay parameter throughout.
We also present results for ResNets trained without BatchNorm . BatchNorm is a normalization layer commonly used with modern ResNets that is known to improve performance and allows deeper ResNets to be trained, though the precise reasons for this are not well understood. Several recent works have studied the possibility of removing the need for BatchNorm layers, by introducing trainable uniform scalings to the residual connection to stabilise variance at initialisation & gradients, demonstrating promising results. Note, our work additionally introduces decreasing scaling and also uses the infinite-width NNGP/NTK connection to assess the theoretical advantages of scaled Stable ResNets in the limit of infinite depth.
Moreover, our focus is not towards the possibility of removing BatchNorm and we show in Table 3 that our scalings can improve BatchNorm ResNets. However, we also present results without BatchNorm in Table 4, where again we see that our scaled stable ResNets improve performance compared to their unscaled counterparts: for example both Decreasing and Uniform scaling outperform the unscaled ResNet by over 3% test accuracy on CIFAR-100 with ResNet-104.
For ResNets trained without BatchNorm, for a fair comparison we tuned the initial learning rate on a small logarithmic scale, using batch size 128.
It can be easily deduced from lemma A8 that there exist such that
where . Following the same approach, we have that
Having the terms of orders 0 and 1 in ensures having a positive coefficient for all terms for , which concludes the proof. ∎
The previous result can be easily extended to general . We have that
Moreover, can be expressed in terms of
where and is such that and and for all . As a result, for all .
Knowing that , we have that
which gives the recursive formulas for the coefficients of the analytic decomposition. Observe that the coefficients are non-decreasing wrt . Using lemma A21 we conclude that . ∎
For depth , proposition A5 shows that all coefficient are (strictly) positive. It turns out that this is a sufficient condition for the kernel to be strictly positive definite. We state this in the next proposition. The result can be seen as a consequence of Lemma A12 and Lemma A13. However we will give here a more direct proof.
for all . We conclude that . ∎
We start by giving a brief review of the theory of Spherical Harmonics (). For some , let be the set of Spherical Harmonics of degree . We have .
For some function , the Hecke-Funk formula reads
form an orthogonal basis of , i.e.
where is the Kronecker symbol.
where . We also have that since is non-negative by definition. The last statement, follows from the spectral theory of compact self-adjoint operators and the orthonormality of the spherical harmonics (see the appendix of for details). ∎
Corollary A3 shows that for any depth , the Spherical Harmonics are the eigenfunctions of the kernel . The fact that is a direct result of Proposition A6. Leveraging this result, we can prove a stronger result, which is the universality of the kernel .