Stable ResNet

Soufiane Hayou, Eugenio Clerico, Bobby He, George Deligiannidis, Arnaud Doucet, Judith Rousseau

INTRODUCTION

The limit of infinite width has been the focus of many theoretical studies on Neural Networks (NNs) [Neal, 1995, Poole et al., 2016, Schoenholz et al., 2017, Yang and Schoenholz, 2017, Hayou et al., 2019a, Lee et al., 2019]. Although unachievable in practice, it features many interesting properties which can help grasp the complex behaviour of large networks. Infinitely wide 1-layer random NNs behave like Gaussian Processes (GPs) at initialization [Neal, 1995]. This was recently extended to multilayer NNs, where each layer can be associated to its own GP [Matthews et al., 2018, Lee et al., 2018, Yang, 2019a]. From a theoretical point of view, GPs have the advantage that their behaviour is fully captured by the mean function and the covariance kernel. Moreover, when dealing with GPs that are equivalent to infinite width NNs, these processes are usually centered, and hence fully determined by their covariance kernel. For multilayer networks, these kernels can be computed recursively, layer by layer [Lee et al., 2018]. Interestingly, in apparent contradiction with the naive idea “the deeper, the more expressive”, it was shown in [Schoenholz et al., 2017] that the GP becomes trivial as the number of layers goes to infinity, that is the output completely forgets about the input and hence lacks expressive power. This loss of input information during the forward propagation through the network might be exponential in depth and could lead to trainability issues for extremely deep nets [Schoenholz et al., 2017, Hayou et al., 2019a]. One natural way to prevent this last issue is the introduction of skip connections, commonly known as the ResNet architecture. However, in the regime of large width and depth, the output of standard ResNets becomes inexpressive and the network may suffer from gradient exploding [Yang and Schoenholz, 2017]. 
In the present work, we propose a new class of residual neural networks, the Stable ResNet, which, in the limit of infinite width and depth, is shown to stabilize the gradient (no vanishing or exploding gradients) and to preserve expressivity. The main idea is to introduce layer/depth dependent scaling factors in the ResNet blocks. For ReLU networks, we provide a comprehensive analysis of two different scalings: a uniform one, where the scaling factor is the same for all the layers, and a decreasing one, where the scaling factor decreases as we go deeper inside the network. We also show that Stable ResNets solve the problem of Neural Tangent Kernel (NTK) degeneracy in the limit of large depth [Hayou et al., 2019b]; indeed, with our scalings, the NTK is universal in the limit of infinite depth, which ensures that any continuous function on a compact set can be approximated to arbitrary precision by the features of the infinite depth NTK.

All theoretical results are substantiated with numerical experiments in Section 7, where we demonstrate the benefits of Stable ResNet scalings both for the corresponding infinite width GP kernels as well as trained ResNets, over a range of moderate and large-scale image classification tasks: MNIST, CIFAR-10, CIFAR-100 and TinyImageNet.

RESNET

Consider a standard ResNet architecture with $L+1$ layers, labelled with $l\in[0:L]$ (where $[m:n]=\{m,m+1,\dots,n\}$ for integers $n\geq m$), of dimensions $\{N_{l}\}_{l\in[0:L]}$ .

where $\phi$ is the activation function. The weights and bias are initialized with $W_{l}\overset{\text{iid}}{\sim}\mathcal{N}(0,\sigma_{w}^{2}/N_{l-1})$ , and $B_{l}\overset{\text{iid}}{\sim}\mathcal{N}(0,\sigma_{b}^{2})$ , where $\sigma_{w}>0$ , $\sigma_{b}\geq 0$ , $N_{-1}=d$ , and $\mathcal{N}(\mu,\sigma^{2})$ is the normal law of mean $\mu$ and variance $\sigma^{2}$ .

Recent results by [Hayou et al., 2021] suggest that scaling the residual blocks with $L^{-1/2}$ might have some beneficial properties on model pruning at initialization. This results from the stabilization effect on the gradient due to the scaling.

More generally, we introduce the residual architecture:

where $\{\lambda_{l,L}\}_{l\in[1:L]}$ is a sequence of scaling factors. We assume hereafter that there exists $\lambda_{\max}\in(0,\infty)$ such that $\lambda_{l,L}\in(0,\lambda_{\max}]$ for all $L\geq 1$ and $l\in[1:L]$ . In the next proposition, we give a necessary and sufficient condition for the gradient to remain bounded as the depth $L$ goes to infinity.

Proposition 1 shows that in order to stabilize the gradient, we have to scale the blocks of the ResNet with scalars $\{\lambda_{l,L}\}_{l\in[1:L]}$ such that $\sum_{l=1}^{L}\lambda_{l,L}^{2}$ remains bounded as the depth $L$ goes to infinity. Taking $\lambda_{\min}=1$ , Proposition 1 shows that the standard ResNet architecture (1) suffers from gradient exploding at initialization (in [Yang and Schoenholz, 2017], the authors show a similar result with a slightly different ResNet architecture), which may cause instability during the first step of gradient based optimization algorithms such as Stochastic Gradient Descent (SGD). This motivates the following definition of Stable ResNet.

A ResNet of type (2) is called a Stable ResNet if and only if $\lim\limits_{L\rightarrow\infty}\sum\limits_{l=1}^{L}\lambda_{l,L}^{2}<\infty$ .
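To make the role of the scaling factors concrete, the following is a minimal NumPy sketch of a forward pass at initialization. It assumes a fully connected block of the form $y_{l}=y_{l-1}+\lambda_{l,L}(W_{l}\phi(y_{l-1})+B_{l})$ with first layer $y_{0}=W_{0}x+B_{0}$, a ReLU activation, and equal widths in all layers. The function names `init_params` and `forward` are hypothetical and are not taken from any library or from the authors' code.

```python
import numpy as np

def init_params(d, widths, sigma_w=1.0, sigma_b=0.0, seed=0):
    """Draw W_l ~ N(0, sigma_w^2 / N_{l-1}) and B_l ~ N(0, sigma_b^2), with N_{-1} = d."""
    rng = np.random.default_rng(seed)
    params, fan_in = [], d
    for n in widths:
        W = (sigma_w / np.sqrt(fan_in)) * rng.standard_normal((n, fan_in))
        B = sigma_b * rng.standard_normal(n)
        params.append((W, B))
        fan_in = n
    return params

def forward(x, params, lambdas):
    """Assumed block form: y_l = y_{l-1} + lambda_l * (W_l relu(y_{l-1}) + B_l)."""
    (W0, B0), blocks = params[0], params[1:]
    y = W0 @ x + B0
    for (W, B), lam in zip(blocks, lambdas):
        y = y + lam * (W @ np.maximum(y, 0.0) + B)
    return y
```

With `lambdas = [1.0] * L` this corresponds to the unscaled network, while `lambdas = [1 / np.sqrt(L)] * L` keeps $\sum_{l}\lambda_{l,L}^{2}=1$ for every depth.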

The condition on the scaling factors is satisfied by a wide range of sequences $\{\lambda_{l,L}\}_{l\in[1:L],L\geq 1}$ . However, it is natural to consider the following two categories.

Uniform scaling. The scaling factors have similar magnitude and tend to zero at the same time. A simple example is the uniform scaling $\lambda_{l,L}=1/\sqrt{L}$ .

Decreasing scaling. The sequence is decreasing and tends to zero. More precisely, we consider a general sequence $\{\lambda_{l}\}_{l\geq 1}$ such that $\sum_{l\geq 1}\lambda_{l}^{2}<\infty$ , and let $\lambda_{l,L}=\lambda_{l}$ for all $L\geq 1$ and all $l\in[1:L]$ .
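The stability condition can be checked numerically by tracking $\sum_{l=1}^{L}\lambda_{l,L}^{2}$ as $L$ grows. The sketch below compares the unscaled case, the uniform scaling $1/\sqrt{L}$, and one square-summable decreasing sequence. The particular decreasing sequence $\lambda_{l}=1/(\sqrt{l}\,\log(l+1))$ is an illustrative assumption chosen because it is square summable; it is not asserted to be the sequence used in the experiments.

```python
import math

def standard_scaling(L):
    """Unscaled ResNet: every block has lambda = 1."""
    return [1.0] * L

def uniform_scaling(L):
    """Uniform scaling: lambda_{l,L} = 1 / sqrt(L)."""
    return [1.0 / math.sqrt(L)] * L

def decreasing_scaling(L):
    """Illustrative square-summable choice (an assumption): 1 / (sqrt(l) * log(l + 1))."""
    return [1.0 / (math.sqrt(l) * math.log(l + 1)) for l in range(1, L + 1)]

def sum_of_squares(lambdas):
    """Quantity that must stay bounded in L for a Stable ResNet."""
    return sum(lam * lam for lam in lambdas)
```

For the unscaled network the sum equals $L$ and diverges, whereas it is identically $1$ under uniform scaling and stays bounded for the decreasing sequence above.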

Note that our theoretical analyses will hold for any decreasing scaling $\{\lambda_{l}\}_{l\geq 1}$ that is square summable, but for simplicity in all empirical results we consider the decreasing scaling:

We study theoretical properties of ResNets with both uniform and decreasing scaling. We show that, in addition to stabilizing the gradient, both scalings ensure that the ResNet is expressive in the infinite depth limit. For this purpose, we use a tool known as the Neural Network Gaussian Process (NNGP) [Lee et al., 2018], which is the Gaussian Process equivalent to a Neural Network in the limit of infinite width.

2 On Gaussian Process approximation of Neural Networks

Consider a ResNet of type (2). Neurons $\{y_{0}^{i}(x)\}_{i\in[1:N_{1}]}$ are iid since the weights with which they are connected to the inputs are iid. Using the Central Limit Theorem, as $N_{0}\rightarrow\infty$ , $y^{i}_{1}(x)$ is a Gaussian variable for any input $x$ and index $i\in[1:N_{1}]$ . Moreover, the variables $\{y^{i}_{1}(x)\}_{i\in[1:N_{1}]}$ are iid. Therefore, the processes $y^{i}_{1}(.)$ can be seen as independent (across $i$ ) centred Gaussian processes with covariance kernel $Q_{1}$ . This is an idealized version of the true process corresponding to letting width $N_{0}\to\infty$ . Doing this recursively over $l$ leads to similar approximations for $y_{l}^{i}(.)$ where $l\in[1:L]$ , and we write accordingly $y_{l}^{i}\stackrel{{\scriptstyle ind}}{{\sim}}\mathcal{GP}(0,Q_{l})$ . The approximation of $y_{l}^{i}(.)$ by a Gaussian process was first proposed by [Neal, 1995] in the single layer case and was extended to multiple feedforward layers by [Lee et al., 2019] and [Matthews et al., 2018]. More recently, a powerful framework, known as Tensor Programs, was proposed by [Yang, 2019b], confirming the large-width NNGP association for nearly all NN architectures.

For the ReLU activation function $\phi:x\mapsto\max(0,x)$ , the recurrence relation can be written more explicitly as in [Daniely et al., 2016]. Let $C_{l}$ be the correlation kernel, defined as

The recurrence relation reads (see Appendix A1)

This recursion leads to divergent diagonal terms $Q_{L}(x,x)$ . This was proven in [Yang and Schoenholz, 2017] for a slightly different ResNet architecture. In the next Lemma, we extend this result to the ResNet defined by (1).

Figure 1 plots the diagonal NNGP and NTK (introduced in Section 5) values for a point on the sphere, highlighting the exploding kernel problem for standard ResNets. Stable ResNets do not suffer from this problem.
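The exploding diagonal can be reproduced with a short NumPy sketch of the layer-wise covariance recursion for ReLU. It assumes $Q_{0}(x,x^{\prime})=\sigma_{b}^{2}+\frac{\sigma_{w}^{2}}{d}x\cdot x^{\prime}$ and the update $Q_{l}=Q_{l-1}+\lambda_{l,L}^{2}\big(\sigma_{b}^{2}+\sigma_{w}^{2}\,\mathbb{E}[\phi(u)\phi(v)]\big)$, where $(u,v)$ is a centred Gaussian pair with covariance given by $Q_{l-1}$ and the expectation uses the standard closed form for ReLU. The helper names `relu_cov` and `nngp_kernel` are hypothetical.

```python
import numpy as np

def relu_cov(q_xx, q_xy, q_yy):
    """E[relu(u) relu(v)] for a centred Gaussian pair with the given second moments."""
    norm = np.sqrt(q_xx * q_yy)
    c = np.clip(q_xy / norm, -1.0, 1.0)  # correlation, clipped against round-off
    return norm * (np.sqrt(1.0 - c ** 2) + (np.pi - np.arccos(c)) * c) / (2.0 * np.pi)

def nngp_kernel(X, lambdas, sigma_w=1.0, sigma_b=0.0):
    """Covariance matrix Q_L on the rows of X for a scaled ReLU ResNet (assumed recursion)."""
    d = X.shape[1]
    Q = sigma_b ** 2 + sigma_w ** 2 * (X @ X.T) / d
    for lam in lambdas:
        diag = np.diag(Q)
        expect = relu_cov(diag[:, None], Q, diag[None, :])
        Q = Q + lam ** 2 * (sigma_b ** 2 + sigma_w ** 2 * expect)
    return Q
```

On the diagonal the update reduces to $Q_{l}(x,x)=(1+\lambda_{l,L}^{2}\sigma_{w}^{2}/2)\,Q_{l-1}(x,x)+\lambda_{l,L}^{2}\sigma_{b}^{2}$, so with $\lambda_{l,L}=1$ the variance grows geometrically in $L$, while with $\lambda_{l,L}=1/\sqrt{L}$ and $\sigma_{b}=0$ it is bounded by $e^{\sigma_{w}^{2}/2}Q_{0}(x,x)$.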

The symmetry in the above definition has to be understood as $Q(x,x^{\prime})=Q(x^{\prime},x)$ for all $x,x^{\prime}\in K$ .

Kernels induce non-negative integral operators [Paulsen and Raghupathi, 2016].

Given a kernel $Q$ on $K$ , the Gaussian Process induced by $Q$ is a centred GP on $K$ whose covariance function is $Q$ .

We will sometimes use the notation $\mathcal{GP}(0,Q)$ for the law of the GP induced by a kernel $Q$ . With our definition of a kernel, the samples from the induced GP lie in $L^{2}(K)$ with probability $1$ [Steinwart, 2019].

From now on we will assume that $0\notin K$ if $\sigma_{b}=0$ . (We exclude $0$ because, for $\sigma_{b}=0$ , $C_{0}$ is discontinuous at the origin and therefore cannot be a kernel on $K$ in the sense of Definition 2 if $0\in K$ .) For all ResNets, it is straightforward to check that $Q_{L}$ is a kernel, in the sense of Definition 2 (see Appendix A1 or [Daniely et al., 2016]). The induced Gaussian Process is what we refer to as NNGP.

We denote by $\mathcal{H}_{Q}(K)$ the Reproducing Kernel Hilbert Space (RKHS) induced by the kernel $Q$ on the set $K$ (see Appendix A0 for a definition). The following hierarchical result holds.

For all $L\geq 1$ , $l\in[0:L-1]$ , $\mathcal{H}_{Q_{l}}(K)\subseteq\mathcal{H}_{Q_{l+1}}(K)$ .

Proposition 2 shows that, as we go deeper, the RKHS cannot become poorer. However, increasing $L$ might introduce stability issues as illustrated in Proposition 1. We show in Sections 3 and 4 that Stable ResNets resolve this problem.

By Lemma 2, $T(Q_{L})$ is a bounded, compact, self-adjoint operator and hence can be written as the sum of the projections on its eigenspaces [Lang, 2012]. By Mercer’s Theorem [Paulsen and Raghupathi, 2016], all the eigenfunctions of $T(Q_{L})$ are continuous. Finally, it is possible to link the eigen-decomposition of $T(Q_{L})$ with the distribution of the GP induced by $Q_{L}$ . Denoting respectively by $\mu_{k}$ and $\psi_{k}$ the eigenvalues and eigenfunctions of the operator $T(Q_{L})$ , we have the equivalence in law:

where $\{Z_{k}\}_{k\geq 0}$ are i.i.d. standard Gaussian random variables [Grenander, 1950]. The expressivity of the network at initialization, that is its capacity to approximate a large class of functions, is then closely linked to the eigendecomposition of $Q_{L}$ [Yang and Salman, 2019].

3 Universal kernels and expressive GPs

Let $Q$ be a kernel on $K$ , and $\mathcal{H}_{Q}(K)$ its RKHS (see Appendix A0). We say that $Q$ is universal on $K$ if for any $\varepsilon>0$ and any continuous function $g$ on $K$ , there exists $h\in\mathcal{H}_{Q}(K)$ such that $\|h-g\|_{\infty}<\varepsilon$ .

The universality of a kernel $Q$ on a compact set implies that the kernel is strictly positive definite, i.e. for all non-zero $\varphi\in L^{2}(K),\langle T(Q)\varphi,\varphi\rangle>0$ [Sriperumbudur et al., 2011]. Moreover, universality also implies the full expressivity of the induced GP, as expressed in the following.

A Gaussian Process on $K$ is said to be expressive on $L^{2}(K)$ if, denoting by $\psi$ a random realisation of the process, for all $\varphi\in L^{2}(K)$ , for all $\varepsilon>0$ ,

A universal kernel $Q$ on $K$ induces an expressive GP on $L^{2}(K)$ .

By definition, universal kernels are characterized by the property that their associated RKHS is dense (w.r.t. the uniform norm $\|.\|_{\infty}$ ) in the space of continuous functions on $K$ . This is crucial for Kernel regression and Gaussian Process inference [Kanagawa et al., 2018]: the closure of the set of functions described by the mean function of the posterior of a GP regression is exactly the RKHS of the kernel of the GP prior. By Proposition 2, it suffices to prove that $Q_{L_{0}}$ is universal for some $L_{0}$ in order to conclude for all $L\geq L_{0}$ . It turns out this is true for $L_{0}=2$ .

If $\sigma_{b}>0$ , then $Q_{2}$ is universal on $K$ . From Proposition 2, $Q_{L}$ is universal for all $L\geq 2$ .

Although the kernel is universal for fixed depth $L$ , it is not guaranteed that as $L\rightarrow\infty$ , $Q_{L}$ remains universal. Indeed, for the standard ResNet architecture, the variance $Q_{L}(x,x)$ grows exponentially with $L$ [Yang and Schoenholz, 2017], and therefore, the kernel diverges. In order to analyse the expressivity of the kernel of a standard ResNet in the limit of large depth, we can study the correlation kernel $C_{L}$ , defined in (3), instead. We show in the following Lemma that, as $L$ goes to infinity, the kernel $C_{L}$ converges to a constant (which has a 1D RKHS).

Therefore, $\mathcal{H}_{C_{\infty}}(K)$ is the space of constant functions.

Lemma 4 shows that in the limit of infinite depth $L$ , the RKHS of the correlation kernel is trivial, meaning that the NNGP cannot be expressive. On the contrary, we will show in the next sections that Stable ResNets achieve a universal kernel for infinite depth $L$ .
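The collapse of the correlation kernel can be illustrated on a single pair of inputs. The sketch below iterates the bias-free correlation recursion stated in Appendix A1, $C_{l}=\frac{1}{1+\alpha_{l,L}}C_{l-1}+\frac{\alpha_{l,L}}{1+\alpha_{l,L}}\hat{f}(C_{l-1})$ with $\alpha_{l,L}=\lambda_{l,L}^{2}\sigma_{w}^{2}/2$. It assumes the ReLU form $\hat{f}(c)=\frac{c}{2}+\frac{1}{\pi}\big(\sqrt{1-c^{2}}+c\arcsin c\big)$, which satisfies $\hat{f}(1)=1$; the names `f_hat` and `correlation_at_depth` are hypothetical.

```python
import math

def f_hat(c):
    """Assumed ReLU correlation map, with f_hat(1) = 1."""
    c = max(-1.0, min(1.0, c))
    return 0.5 * c + (math.sqrt(1.0 - c * c) + c * math.asin(c)) / math.pi

def correlation_at_depth(c0, lambdas, sigma_w=math.sqrt(2.0)):
    """Iterate the bias-free correlation recursion starting from correlation c0."""
    c = c0
    for lam in lambdas:
        alpha = 0.5 * (lam * sigma_w) ** 2
        c = (c + alpha * f_hat(c)) / (1.0 + alpha)
    return c

L = 1000
c_standard = correlation_at_depth(0.0, [1.0] * L)
c_uniform = correlation_at_depth(0.0, [1.0 / math.sqrt(L)] * L)
```

Starting from orthogonal inputs ( $C_{0}=0$ ), the unscaled recursion drives the correlation towards $1$, so all inputs become indistinguishable, whereas under uniform scaling the correlation stays well below $1$ at the same depth.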

UNIFORM SCALING

Consider a Stable ResNet with layers $[0:L]$ . Under uniform scaling, the recurrence relation in (5) reads:

In the limit as $L\to\infty$ , (7) converges uniformly to a continuous ODE. Studying the solution of the corresponding Cauchy problem, we show that the covariance kernel remains universal in the limit of infinite depth.

As discussed in Section A2 of the Appendix, for any $x,x^{\prime}$ , the solution of the above Cauchy problem exists and is unique. Moreover, the solutions $q_{t}$ and $c_{t}$ are kernels on $K$ , in the sense of Definition 2.
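On the diagonal, the convergence of the discrete recursion to its continuous limit can be checked directly. The sketch below assumes that, for $x=x^{\prime}$, the uniform-scaling recursion reads $Q_{l}(x,x)=Q_{l-1}(x,x)+\frac{1}{L}\big(\sigma_{b}^{2}+\frac{\sigma_{w}^{2}}{2}Q_{l-1}(x,x)\big)$ and that the corresponding ODE is $\dot{q}_{t}=\sigma_{b}^{2}+\frac{\sigma_{w}^{2}}{2}q_{t}$, whose solution is $q_{t}=\big(q_{0}+\frac{2\sigma_{b}^{2}}{\sigma_{w}^{2}}\big)e^{\sigma_{w}^{2}t/2}-\frac{2\sigma_{b}^{2}}{\sigma_{w}^{2}}$. The function names are hypothetical, and the off-diagonal terms are not covered by this sketch.

```python
import math

def diag_discrete(q0, L, sigma_w=1.0, sigma_b=1.0):
    """Diagonal of Q_L after L uniformly scaled blocks (assumed recursion)."""
    q = q0
    for _ in range(L):
        q += (sigma_b ** 2 + 0.5 * sigma_w ** 2 * q) / L
    return q

def diag_ode(q0, t, sigma_w=1.0, sigma_b=1.0):
    """Closed-form solution of dq/dt = sigma_b^2 + (sigma_w^2 / 2) q with q(0) = q0."""
    shift = 2.0 * sigma_b ** 2 / sigma_w ** 2
    return (q0 + shift) * math.exp(0.5 * sigma_w ** 2 * t) - shift

# Gap between the depth-L network and the ODE at time t = 1.
errors = [abs(diag_discrete(1.0, L) - diag_ode(1.0, 1.0)) for L in (10, 100, 1000)]
```

The recursion is an explicit Euler scheme with step $1/L$ for this ODE, so the gap in `errors` shrinks roughly in proportion to $1/L$.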

Clearly, for finite $L$ , the continuous ODE (8) is an approximation. However, the following result holds.

Let $Q_{l|L}$ be the covariance kernel of the layer $l$ in a net of $L+1$ layers $[0:L]$ , and $q_{t}$ be the solution of (8), then

2 Universality of the covariance kernel

When $\sigma_{b}>0$ , the kernel $q_{t}$ is universal for $t>0$ .

The proof of the above statement is detailed in Appendix A2. The main idea is to show that the integral operator $T(q_{t})$ is strictly positive definite and then use a characterization of universal kernels, due to [Sriperumbudur et al., 2011], which connects the universality of Definition 4 with the strict positivity of the induced integral operator. (The details are more involved, as we need to show that the kernel induces a strictly positive definite operator on $L^{2}(K,\mu)$ for any finite Borel measure $\mu$ on $K$ .)

DECREASING SCALING

The convergence of the kernel $Q_{L}$ to the limiting kernel $Q_{\infty}$ is governed by the convergence rate of the series of scaling factors. Moreover, leveraging the RKHS hierarchy from Proposition 2, we find that $Q_{\infty}$ is universal.

As in the uniform scaling case, the limiting kernel exists and is universal, unlike the standard ResNet architecture, which yields a divergent kernel $Q_{L}$ as $L\to\infty$ .

To validate our universality and expressivity results, Figure 2 plots the leading eigenvalues of the NNGP (& NTK, introduced in Section 5) kernels on a set of 1000 points sampled uniformly at random from the circle, normalized so that the largest eigenvalue is 1. We use the recursion formulas for NNGP correlation (Lemma A4) and normalized NTK (Lemma A19) to avoid the exploding variance/gradient problem. We see that the unscaled ResNet NNGP becomes inexpressive with depth because all non-leading eigenvalues converge to 0, whereas our Stable ResNets (decreasing and uniform scaling) are expressive even in the large depth limit.

NEURAL TANGENT KERNEL

In the so-called lazy training regime [Chizat and Bach, 2019], the training dynamics of an infinitely wide network can be described via the Neural Tangent Kernel (NTK) [Lee et al., 2019], introduced in [Jacot et al., 2018] and defined as

where $\mathcal{X}$ and $\mathcal{Y}$ are respectively the input and output datasets, $\Theta_{L}(x,\mathcal{X})=\{\Theta_{L}(x,x^{\prime})\}_{x^{\prime}\in\mathcal{X}}$ and $\hat{\Theta}_{L}$ is the matrix $\{\Theta_{L}(x,x^{\prime})\}_{x,x^{\prime}\in\mathcal{X}}$ . The universality of the NTK is crucial for the ResNet to learn beyond initialization, since the residual $F_{\tau}-F_{0}$ lies in the RKHS generated by $\Theta_{L}$ . For unscaled ResNets, [Hayou et al., 2019b] showed that the limiting NTK is trivial in the sense of Lemma 4. However, this is not the case for Stable ResNets.

Consider a ResNet of type (2). Under the technical assumption that the parameters appearing in the back-propagation can be considered independent from the ones of the forward pass (Gradient Independence Assumption) [Yang, 2019a], we have

An analogous result can be stated for the uniform scaling, after noticing that a continuous formulation ( $\Theta_{l}\mapsto\theta_{t(l)}$ ) can be obtained in analogy with what has been done for the covariance kernel (cf Appendix A4).

Figure 2 shows that the non-leading NTK eigenvalues do not decay to 0 with depth for Stable ResNets, unlike for unscaled ResNets. This is in line with findings of Propositions 8 and 9.

A PAC-BAYES RESULT

where $\nu$ is a probability distribution on $X\times Y$ . For some randomized learning algorithm $\mathcal{A}$ , the empirical and generalization loss are given by:

The PAC-Bayes theorem gives a probabilistic upper bound on the generalization loss $r(\mathcal{A})$ of a randomized learning algorithm $\mathcal{A}$ in terms of the empirical loss $r_{S}(\mathcal{A})$ . Fix a prior distribution $\mathcal{P}$ on the hypothesis set $\mathcal{U}$ . The Kullback-Leibler divergence between $\mathcal{A}$ and $\mathcal{P}$ is defined as $\text{KL}(\mathcal{A}\|\mathcal{P})=\int\mathcal{A}(h)\log\frac{\mathcal{A}(h)}{\mathcal{P}(h)}\textrm{d}h\in[0,\infty]$ . The Bernoulli KL-divergence is given by $\text{kl}(a\|p)=a\log\frac{a}{p}+(1-a)\log\frac{1-a}{1-p}$ for $a,p\in[0,1]$ . We define the inverse Bernoulli KL-divergence $\text{kl}^{-1}$ by

The KL-divergence term $\text{KL}(\mathcal{A}\|\mathcal{P})$ plays a major role as it controls the generalization gap, i.e. the difference (in terms of Bernoulli KL-divergence) between the empirical loss and the generalization loss. In our setting, we consider an ordinary GP regression with prior $\mathcal{P}(f)=\mathcal{GP}(f|0,Q(x,x^{\prime}))$ . Under the standard assumption that the outputs $y_{N}=(y_{i})_{i\in[1:N]}$ are noisy versions of $f_{N}=(f(x_{i}))_{i\in[1:N]}$ with $y_{N}|f_{N}\sim\mathcal{N}(y_{N}|f_{N},\sigma^{2}I)$ , the Bayesian posterior $\mathcal{A}$ is also a GP and is given by

$Q_{N}(x)=(Q(x,x_{i}))_{i\in[1:N]}$ , $Q_{NN}=(Q(x_{i},x_{j}))_{1\leq i,j\leq N}$ . In this setting, we have the following result

Let $Q_{L}$ be the kernel of a ResNet. Let $P_{L}$ be a GP with kernel $Q_{L}$ and $\mathcal{A}_{L}$ be the corresponding Bayesian posterior for some fixed noise level $\sigma^{2}>0$ . Then, in a fixed setting (fixed sample size N), the following results hold: $\bullet$ With a standard ResNet, $\textup{KL}(\mathcal{A}_{L}\|P_{L})\gtrsim L$ . $\bullet$ With a Stable ResNet, $\textup{KL}(\mathcal{A}_{L}\|P_{L})=\mathcal{O}_{L}(1)$ .

The KL-divergence bound diverges for a standard ResNet while it remains bounded for Stable ResNet. Although PAC-Bayes bounds only give an upper bound on the generalization error, Proposition 10 shows that Stable ResNet does not suffer from the “curse of depth”, i.e. the KL-divergence does not explode as the depth becomes large.
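The dependence of the KL term on the scale of the kernel can be explored numerically. The sketch below uses the standard closed form for the KL divergence between a GP regression posterior and its prior, which reduces to a KL between Gaussians at the training inputs: $\text{KL}=\frac{1}{2}\big[\log\det(I+Q_{NN}/\sigma^{2})-\text{tr}\big(Q_{NN}(Q_{NN}+\sigma^{2}I)^{-1}\big)+y_{N}^{\top}(Q_{NN}+\sigma^{2}I)^{-1}Q_{NN}(Q_{NN}+\sigma^{2}I)^{-1}y_{N}\big]$. This formula is written out here as an assumption for illustration and is not quoted from the proof of Proposition 10; the function name is hypothetical.

```python
import numpy as np

def gp_posterior_prior_kl(K, y, noise_var):
    """KL(posterior || prior) for GP regression with train kernel matrix K and targets y."""
    n = K.shape[0]
    A = K + noise_var * np.eye(n)
    A_inv_K = np.linalg.solve(A, K)
    A_inv_y = np.linalg.solve(A, y)
    _, logdet = np.linalg.slogdet(np.eye(n) + K / noise_var)
    return 0.5 * (logdet - np.trace(A_inv_K) + A_inv_y @ K @ A_inv_y)
```

Multiplying `K` by a large constant, as happens to the kernel of an unscaled ResNet when the depth grows, inflates the log-determinant term, while a kernel that stays bounded in $L$ keeps this quantity bounded for fixed $N$.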

EXPERIMENTS

In line with our theory, we now present results demonstrating empirical advantages of Stable ResNets (both uniform and decreasing scaling) compared to their unscaled counterparts on a toy regression task and standard image classification tasks, both for infinite-width NNGP kernels as well as trained finite-width NNs in the latter case. In the interests of space, all experimental details not described in this section can be found in Appendix A7. All error bars in this section correspond to 3 independent runs.

We first present a toy posterior regression experiment with NNGP kernels. We compare across different depths and scalings, with target test function $y=x\sin(x)$ and a small amount of observation noise $\sigma=0.1$ ( $\sigma$ as defined in Eq. 10).

We use 5 training points (dark green dots).

We map our 1D inputs $x$ onto the circle $(\cos(x),\sin(x))$ before performing GP regression. This is so that all inputs have unit norm and we can use the NNGP correlation kernel (Eq. 3) for the vanilla ResNet (ResNet with fully connected blocks), in order to avoid the exploding variance problem. As expected from our theory, in Figure 3, for depth 1000 the NNGP correlation kernel without stable scaling (top row, red) is unable to learn anything beyond a constant function due to inexpressivity, whereas our Stable ResNets (bottom two rows, blue) are still expressive in the large depth limit. We plot the mean and 95% posterior predictive credible interval for NNGP posteriors.

Stable NNGP classification results

We first compare the performance of Stable and standard ResNets of varying depths through their infinite-width NNGP kernels, on MNIST & CIFAR-10. For each considered NNGP kernel $Q$ and training set $(x_{i},y_{i})_{i\in[1:N]}$ , we report test accuracy using the mean of the posterior predictive (Eq. 10): $Q_{N}(\cdot)(Q_{NN}+\sigma^{2}I)^{-1}y_{N}$ , which is also the kernel ridge regression predictor [Kanagawa et al., 2018]. We treat classification labels $y$ as one-hot regression targets, similar to recent works [Arora et al., 2019, Lee et al., 2019, Shankar et al., 2020], and tune the noise $\sigma^{2}$ using prediction accuracy on a held-out validation set.
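The predictor $Q_{N}(\cdot)(Q_{NN}+\sigma^{2}I)^{-1}y_{N}$ with one-hot targets can be sketched in a few lines of NumPy. The kernel matrices are taken as inputs, so any NNGP kernel can be plugged in; the function names are hypothetical and the sketch does not reproduce the experimental pipeline.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer class labels as one-hot regression targets."""
    return np.eye(num_classes)[np.asarray(labels)]

def posterior_mean(K_test_train, K_train_train, Y_train, noise_var):
    """Posterior predictive mean, i.e. the kernel ridge regression predictor."""
    n = K_train_train.shape[0]
    A = K_train_train + noise_var * np.eye(n)
    return K_test_train @ np.linalg.solve(A, Y_train)

def predict_labels(K_test_train, K_train_train, labels, num_classes, noise_var):
    """Classify by taking the argmax over the regressed one-hot scores."""
    Y = one_hot(labels, num_classes)
    return posterior_mean(K_test_train, K_train_train, Y, noise_var).argmax(axis=1)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly; `noise_var` plays the role of $\sigma^{2}$ and would be tuned on the held-out validation set.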

First, in Table 1, we demonstrate the exploding NNGP variance problem for unscaled Wide-ResNets (WRN) [Zagoruyko and Komodakis, 2016]. For an unscaled WRN of depth 202, the NNGP kernel values explode resulting in numerical errors, whereas Stable ResNets achieve 54% test accuracy with 10K training points (out of full size 50K). Note that any numerical errors from exploding NNGP also afflict the NTK, as the difference between the NTK and NNGP is positive semi-definite [Lee et al., 2019, He et al., 2020] (which is why the NTK lines always lie above their corresponding NNGP in Figure 1).

To isolate the disadvantages of inexpressivity in unscaled ResNet NNGPs compared to our Stable ResNets, we need to avoid the exploding variance problem and ensuing numerical errors. In order to do so, we use the NNGP correlation kernel $C$ instead of the NNGP covariance kernel $Q$ , noting that these two kernels are equal up to a multiplicative constant on the sphere, and that the posterior predictive mean is invariant to the scale of $Q$ (with $\sigma^{2}$ also tuned relative to the scale of $Q$ ). Moreover, the formula in Lemma A4 for NNGP correlation recursion for vanilla ResNets without bias can be recast as a ResNet with a modified scaling (see Appendix A6), allowing us to use existing optimised libraries [Novak et al., 2020]. In order to use the vanilla ResNet correlation recursion, we standardise all MNIST & CIFAR-10 images to lie on the 784 & 3072-dimension sphere respectively.

Our expressivity results, as well as Proposition 10, suggest that we expect Stable ResNets to outperform standard ResNets for large depths even when exploding variance numerical errors are alleviated for standard ResNets. In Table 2, we see that unscaled ResNets suffer from a degradation in test accuracy with depth, due to inexpressivity, whereas our Stable ResNets (both decreasing and uniform) do not suffer from a drop in performance. For example, the posterior predictive mean using the NNGP of an unscaled vanilla ResNet with depth 1000 attains only 17.86% accuracy on CIFAR-10 with 10K training points, compared to 48.76% for Stable ResNet (decreasing scale).

We focus on the NNGP rather than the NTK as recent works [Lee et al., 2020, Shankar et al., 2020] have demonstrated that there is no advantage to the state-of-the-art NTK over the NNGP as infinite-width kernel predictors. Moreover, we do not aim for near state-of-the-art kernel results due to computational resources, and instead aim to empirically validate the theoretical advantages of Stable ResNets.

Trained Stable ResNet results

Finally, we consider the benefits of trained Stable ResNets on the large-scale CIFAR-10, CIFAR-100 and TinyImageNet (available at http://cs231n.stanford.edu/tiny-imagenet-200.zip) datasets. We compare trained convolutional ResNets [He et al., 2016] of depths 32, 50 & 104 in terms of test accuracy. In the main text we present results for ResNets trained with Batch Normalization [Ioffe and Szegedy, 2015] (BatchNorm), while results for trained ResNets without BatchNorm can be found in Appendix A7. Stable ResNet scalings are applied to the residual connection after all convolution, ReLU and BatchNorm layers.

We use initial learning rate $0.1$ which is decayed by $0.1$ at $50\%$ and $75\%$ of the way through training. This learning rate schedule has been used previously [He et al., 2016] for unscaled ResNets and we found it to work well for all ResNets trained with BatchNorm. We train for 160 epochs on CIFAR-10/100 and 250 epochs on TinyImageNet. Test accuracy results are displayed in Table 3. As we can see, Stable ResNets consistently outperform standard ResNets across datasets and depths. Moreover, the performance gap is larger for larger depths: for example on CIFAR-100 our Stable ResNet (decreasing) outperforms its standard counterpart by 1.05% (75.06 vs 74.01) on average for depth 32 whereas for depth 104 the test accuracy gap is 2.36% (77.44 vs 75.08) on average. A similar trend can also be observed for the more challenging TinyImageNet dataset. Interestingly, we see that among the Stable ResNets, decreasing scaling also consistently outperforms uniform scaling.
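The learning rate schedule described above can be written as a small step function. This is a sketch under the assumption that the decay is applied at the start of the epoch reaching each milestone; the function name and the exact boundary convention are hypothetical.

```python
def learning_rate(epoch, total_epochs, base_lr=0.1, decay=0.1):
    """Step schedule: multiply by `decay` at 50% and again at 75% of training."""
    lr = base_lr
    if epoch >= 0.5 * total_epochs:
        lr *= decay
    if epoch >= 0.75 * total_epochs:
        lr *= decay
    return lr
```

For a 160-epoch CIFAR run this gives a rate of 0.1 for epochs 0 to 79, then 0.01 until epoch 119, and 0.001 afterwards, up to floating-point rounding.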

CONCLUSION

Stable ResNets have the benefit of stabilizing the gradient and ensuring expressivity in the limit of infinite depth. We have demonstrated theoretically and empirically that this type of scaling makes NNGP inference robust and improves test accuracy with SGD on modern ResNet architectures. However, while Stable ResNets with both uniform and decreasing scalings outperform standard ResNet, the selection of an optimal scaling remains an open question; we leave this topic for future work.

ACKNOWLEDGMENTS

This material is based upon work supported in part by the U.S. Army Research Laboratory and the U.S. Army Research Office, and by the U.K. Ministry of Defence (MoD) and the U.K. Engineering and Physical Sciences Research Council (EPSRC) under grant number EP/R013616/1. AD is also partially supported by EPSRC EP/R034710/1. BH is supported by the EPSRC and MRC through the OxWaSP CDT programme (EP/L016710/1). The project leading to this work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 834175).

References

Appendix

A0 Mathematical preliminaries

We will make use of functional analysis results on the theory of Hilbert spaces. We refer to for a comprehensive introduction to the topic. We note here that, even when not explicitly stated, all Hilbert spaces considered in the present work are real, and all linear operators are bounded. We will make use of the spectral theory for compact self-adjoint operators. We refer again to for a detailed discussion.

We state here a characterisation of kernels, which is an extension of Lemma 2. Despite being a classical result (see the discussion about Mercer kernels in ), we will give a proof, for the sake of completeness.

for any $\varphi\in L^{2}(K,\mu)$ . The operator $T_{\mu}(Q)$ is a bounded compact self-adjoint definite operator. Moreover, $Q$ is a kernel if and only if $T_{\mu}(Q)$ is non-negative definite for all finite Borel measures $\mu$ on $K$ .

and the convergence is uniform on $K^{2}$ . The continuity of the $Y_{k}$ ’s implies that they can be seen as elements of $L^{2}(K,\mu)$ . Moreover, the uniform convergence, along with the fact that $\mu(K)<\infty$ , implies the convergence of the sum wrt the $L^{2}(K,\mu)$ operator norm. In particular $T_{\mu}(Q)$ is a limit of non-negative definite operators and hence non-negative definite. Now, assume that, for all finite Borel $\mu$ , $T_{\mu}(Q)$ is non-negative definite. Choosing a finite set $\{x_{1}\dots x_{n}\}\subset K$ , in particular we have that $\mu=\sum_{i=1}^{n}\delta_{x_{i}}$ is a finite Borel measure (where $\delta_{x}$ is the Dirac measure on $x\in K$ ). Hence $T_{\mu}(Q)$ is the matrix $\{Q(x_{i},x_{j})\}_{i,j}$ . We conclude that $Q$ is a kernel. ∎

We will now give a definition of the Reproducing Kernel Hilbert Space associated to a kernel. We refer to for a general and comprehensive introduction to the topic.

Given a kernel $Q$ on $K$ , we can associate to it a real Hilbert space $\mathcal{H}_{Q}$ , with the following properties:

Denoting by $\langle\cdot,\cdot\rangle_{Q}$ the inner product of $\mathcal{H}_{Q}$ , for each $x\in K$ , there exists an element $k_{x}\in\mathcal{H}_{Q}$ such that $h(x)=\langle h,k_{x}\rangle_{Q}$ , for all $h\in\mathcal{H}_{Q}$ .

For all $x,x^{\prime}\in K$ , $\langle k_{x},k_{x^{\prime}}\rangle_{Q}=Q(x,x^{\prime})$ .

Such a Hilbert space exists for each kernel $Q$ and it is unique up to isomorphism. $\mathcal{H}_{Q}$ is called the Reproducing Kernel Hilbert Space (RKHS) of $Q$ .

In general, it is not easy to give an explicit form for the RKHS associated to a kernel $Q$ . However, we can say that it contains the linear span of $\{x\mapsto Q(x,x^{\prime})\}_{x^{\prime}\in K}$ . Actually, this linear span is a dense subset of $\mathcal{H}_{Q}$ , wrt the norm of $\mathcal{H}_{Q}$ .

A kernel on $K$ is said to be universal if its RKHS is dense in the space of continuous functions $C(K)$ , wrt the uniform norm.

Let $Q$ be a kernel on $K$ , and $\mathcal{H}_{Q}(K)$ its RKHS. We say that $Q$ is universal on $K$ if for any $\varepsilon>0$ and any continuous function $g$ on $K$ , there exists $h\in\mathcal{H}_{Q}(K)$ such that $\|h-g\|_{\infty}<\varepsilon$ .

We can now state a characterization of universal kernels, from .

As a final note, hereafter we often omit the explicit reference to the measure $\mu$ , that is we will speak of the operator $T(Q)$ on $L^{2}(K)$ . Unless otherwise stated, this notation implies the choice of an arbitrary finite Borel measure $\mu$ on the compact $K$ .

A1 Residual Neural Networks and Gaussian processes

Consider a standard ResNet architecture with $L+1$ layers, labelled with $l\in[0:L]$ , of dimensions $\{N_{l}\}_{l\in[0:L]}$ .

Hereafter, $N_{l}$ denotes the number of neurons in the $l^{th}$ layer, $\phi$ the activation function and $[m:n]:=\{m,m+1,...,n\}$ for $m\leq n$ . The components of weights and bias are respectively initialized with $W_{l}^{ij}\overset{\text{iid}}{\sim}\mathcal{N}(0,\sigma_{w}^{2}/N_{l-1})$ , and $B_{l}^{i}\overset{\text{iid}}{\sim}\mathcal{N}(0,\sigma_{b}^{2})$ where $\mathcal{N}(\mu,\sigma^{2})$ denotes the normal distribution of mean $\mu$ and variance $\sigma^{2}$ .

In [Yang and Schoenholz, 2017], the authors showed that wide deep ResNets might suffer from gradient exploding during backpropagation.

Recent results by [Hayou et al., 2021] suggest that scaling the residual blocks with $L^{-1/2}$ might have some beneficial properties on model pruning at initialization. This is a result of the stabilization effect of scaling on the gradient.

More generally, we introduce the residual architecture:

where $(\lambda_{k,L})_{k\in[1:L]}$ is a sequence of scaling factors. We assume hereafter that there exists $\lambda_{\max}\in(0,\infty)$ such that for all $L\geq 1$ and $k\in[1:L]$ , we have that $\lambda_{k,L}\in(0,\lambda_{\max}]$ .

where we have used the Central Limit Theorem. Therefore, we have

For the remainder of this appendix, we define the function

For all $l$ , the diagonal terms of $Q_{l}$ have closed-form expressions. We show this in the next lemma.

where $\hat{f}$ is given by (A2). It is straightforward that $\hat{f}(1)=1$ . This yields

As a corollary of the previous result, it is easy to show that for a Standard ResNet the diagonal terms explode with depth, which is Lemma 1 in the main paper.

The statement trivially follows from Lemma A3, using that $Q_{0}(x,x)=\sigma_{b}^{2}+\tfrac{\sigma_{w}^{2}}{d}\|x\|^{2}$ and the fact that for a Standard ResNet (1), all the coefficients $\lambda_{l,L}$ ’s are equal to $1$ . ∎
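As a numerical illustration of this explosion, the sketch below iterates a diagonal recursion of the form $Q_{l}(x,x)=Q_{l-1}(x,x)+\lambda_{l,L}^{2}\left(\sigma_{b}^{2}+\tfrac{\sigma_{w}^{2}}{2}Q_{l-1}(x,x)\right)$. This form is an assumption: it is reconstructed from the continuous-time equation $\dot{q}_{t}=\sigma_{b}^{2}+\tfrac{\sigma_{w}^{2}}{2}q_{t}$ of Appendix A2 and is not copied from Lemma A3. The function name and the parameter values are illustrative only.

```python
import math

def diagonal_variance(q0, lambdas, sigma_w2=2.0, sigma_b2=0.0):
    """Iterate the assumed recursion q <- q + lam^2 * (sigma_b2 + sigma_w2/2 * q)."""
    q = q0
    for lam in lambdas:
        q = q + lam ** 2 * (sigma_b2 + 0.5 * sigma_w2 * q)
    return q

L = 50
# Standard ResNet: all scaling factors equal to 1, so the variance doubles per layer here.
standard = diagonal_variance(1.0, [1.0] * L)
# Uniform scaling lambda = L^{-1/2}: the variance stays bounded as L grows.
uniform = diagonal_variance(1.0, [1.0 / math.sqrt(L)] * L)
print(standard, uniform)
```

Under these assumptions the unscaled variance grows geometrically in $L$, while the uniformly scaled one stays below $e\,Q_{0}(x,x)$.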

In the case of a ResNet with no bias, the correlation kernel follows a simple recursive formula described in the next lemma.

where $\alpha_{l,L}=\frac{\lambda_{l,L}^{2}\sigma_{w}^{2}}{2}$ .

This is a direct result of the covariance recursion formula (5). ∎
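The no-bias correlation recursion can be explored numerically. The sketch below assumes the form $C_{l}=\frac{1}{1+\alpha_{l,L}}C_{l-1}+\frac{\alpha_{l,L}}{1+\alpha_{l,L}}\hat{f}(C_{l-1})$, which is our reading of the recursion (it matches the coefficient $\hat{\alpha}_{l,L}=\frac{\alpha_{l,L}}{1+\alpha_{l,L}}$ of Appendix A6) and is not quoted from the lemma itself. The function names are hypothetical.

```python
import math

def f_hat(c):
    """ReLU correlation map of (A2): (c*asin(c) + sqrt(1 - c^2)) / pi + c / 2."""
    c = max(-1.0, min(1.0, c))  # guard against round-off outside [-1, 1]
    return (c * math.asin(c) + math.sqrt(1.0 - c * c)) / math.pi + 0.5 * c

def correlation(c0, lambdas, sigma_w2=2.0):
    """Assumed no-bias recursion: C <- (C + alpha * f_hat(C)) / (1 + alpha)."""
    c = c0
    for lam in lambdas:
        alpha = 0.5 * sigma_w2 * lam ** 2
        c = (c + alpha * f_hat(c)) / (1.0 + alpha)
    return c

L = 200
c_standard = correlation(0.0, [1.0] * L)                  # unscaled ResNet
c_uniform = correlation(0.0, [1.0 / math.sqrt(L)] * L)    # lambda = L^{-1/2}
print(c_standard, c_uniform)
```

Starting from orthogonal inputs ($C_{0}=0$), the unscaled correlation is driven close to $1$, whereas the uniformly scaled one remains well separated from $1$ in this experiment.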

A1.2 Proof of Proposition 1

We use the following result from in order to derive closed form expressions for the second moment of the gradients.

Consider a ResNet of the form (2) with weights $W$ . In the limit of infinite width, we can assume that $W^{T}$ used in back-propagation is independent from $W$ used for forward propagation, for the calculation of Gradient Covariance and NTK.

Next we re-state and prove Proposition 1.

where $C=\frac{2}{\sigma_{w}^{2}}\left(\sup_{(x,y)\in K\times K^{\prime}}\bar{q}^{l}(x,y)\right)\left(\sup_{x\in K}Q_{0}(x,x)+\frac{2\sigma_{b}^{2}}{\sigma_{w}^{2}}\right)$ . We conclude by taking the supremum over $l$ and $x,y$ .

where $\kappa=\frac{1}{2}\frac{\lambda_{\min}^{2}}{\left(1+\frac{\sigma_{w}^{2}}{2}\lambda_{\max}^{2}\right)\left(1+\frac{\sigma_{w}^{2}}{2}\lambda_{\min}^{2}\right)}Q_{1}(x,x)\,\bar{q}^{l}(x,y)>0$ . ∎

Using Lemma A5, we can derive simple recursive formulas for the second moment of the gradient as well as for the Neural Tangent Kernel (NTK). This was previously done in for feedforward neural networks, we prove a similar result for ResNet in the next lemma.

In the limit of infinite width, using the same notation as in Proposition 1, we have that

Using Lemma A5 and the Central Limit Theorem, we have that

Before moving to the next proofs, recall the definition of Stable ResNet.

A ResNet of type (2) is called a Stable ResNet if and only if $\lim\limits_{L\rightarrow\infty}\sum\limits_{k=1}^{L}\lambda_{k,L}^{2}<\infty$ .
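The stability condition is easy to check for concrete scaling sequences. The sketch below compares the unscaled choice $\lambda_{k,L}=1$, the uniform choice $\lambda_{k,L}=L^{-1/2}$, and a decreasing choice $\lambda_{k}=1/k$. The last sequence is a hypothetical example picked only because its squares are summable; it is not claimed to be the sequence used in the paper.

```python
import math

def sum_sq(lambdas):
    """Return sum_k lambda_k^2, the quantity appearing in the stability condition."""
    return sum(lam * lam for lam in lambdas)

def uniform_scaling(L):
    return [1.0 / math.sqrt(L)] * L            # lambda_{k,L} = L^{-1/2}

def decreasing_scaling(L):
    return [1.0 / k for k in range(1, L + 1)]  # hypothetical choice lambda_k = 1/k

for L in (10, 100, 1000):
    print(L, sum_sq([1.0] * L), sum_sq(uniform_scaling(L)), sum_sq(decreasing_scaling(L)))
```

The first sum grows linearly in $L$, so the standard ResNet is not stable; the second is identically $1$, and the third increases towards $\pi^{2}/6$.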

Leveraging the previous result, the function $f$ defined in (4) is analytic. We clarify this in the next lemma.

we get $a_{0}=\frac{1}{\pi}$ . Moreover, we have that for all $\gamma\in(-1,1)$

This yields $a_{1}=\hat{f}^{\prime}(0)=\frac{1}{2}$ . Then, noticing that

is an odd function, we get that for all $i\geq 1,a_{2i+1}=0$ . Now let us prove that for all $k\geq 1$ , there exist $b_{k,0},b_{k,1},...,b_{k,k-1}>0$ such that, for all $\gamma\in(-1,1)$ ,

We prove this by induction. For $k=1$ , we have that

so that our claim holds. Assume now that it is true for some $k\geq 1$ , let us prove it for $k+1$ . It is easy to see that

The induction is straightforward. In particular, we have shown that $a_{2i}=\frac{\hat{f}^{(2i)}(0)}{(2i)!}=\frac{b_{i,0}}{(2i)!}>0$ . The conclusion for the coefficients $\alpha$ ’s of the expansion of $f$ is then trivial. ∎
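The sign pattern of the coefficients can also be checked numerically. The sketch below uses explicit coefficients that we derive from $\frac{d}{dt}\left[t\arcsin t+\sqrt{1-t^{2}}\right]=\arcsin t$ and the standard power series of $\arcsin$, with $\hat{f}$ as given in (A2). This derivation is ours and is not the induction used in the proof; the function names are hypothetical.

```python
import math

def f_hat(t):
    """f_hat(t) = (t*asin(t) + sqrt(1 - t^2)) / pi + t / 2, as in (A2)."""
    return (t * math.asin(t) + math.sqrt(1.0 - t * t)) / math.pi + 0.5 * t

def taylor_coefficients(n_terms):
    """Coefficients a_0, ..., a_{n_terms-1} of f_hat around 0 (own derivation)."""
    a = [0.0] * n_terms
    a[0] = 1.0 / math.pi
    if n_terms > 1:
        a[1] = 0.5
    n = 0
    while 2 * n + 2 < n_terms:
        # Integrating the asin series term by term gives the even coefficients.
        a[2 * n + 2] = math.comb(2 * n, n) / (4 ** n * (2 * n + 1) * (2 * n + 2) * math.pi)
        n += 1
    return a

coeffs = taylor_coefficients(80)
t = 0.5
series = sum(a_k * t ** k for k, a_k in enumerate(coeffs))
print(series, f_hat(t))
```

The truncated series agrees with the closed form at $t=0.5$, with $a_{0}=\frac{1}{\pi}$, $a_{1}=\frac{1}{2}$, vanishing odd coefficients from order $3$ on, and strictly positive even coefficients.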

Using Lemma A8, it will not be hard to show that $Q_{l}$ is continuous. The non-negativity of $T(Q_{l})$ can be seen as a consequence of the definition of $Q_{l}$ as the covariance of a Gaussian Process. However, we will give a direct proof of it, so that we can state here a general result which we will need later on.

converges uniformly on $[-1,1]$. Then, for any finite Borel measure $\mu$ on $K$, $T_{\mu}(g(C))$ is a non-negative definite compact operator, and in particular $g(C)$ is a kernel.

For both Standard and Stable ResNet architectures, for any layer $l$ , the covariance function $Q_{l}$ and the correlation function $C_{l}$ are kernels on $K$ , in the sense of Definition 2.

It is straightforward to prove that $Q_{0}$ is a kernel. Now let us show that if $Q_{l}$ is a kernel for some $l$, then $C_{l}$ is a kernel. Since $Q_{l}$ is symmetric, so is $C_{l}$. Moreover, the diagonal elements of $Q_{l}$ are continuous by Lemma A3 and do not vanish (since if $\sigma_{b}=0$ we are assuming that $0\notin K$). Hence $C_{l}$ is continuous. It is then trivial to show that the non-negative definiteness of $T(Q_{l})$ implies that $T(C_{l})$ is non-negative definite, and so $C_{l}$ is a kernel if $Q_{l}$ is. Now we proceed by induction. Suppose that $Q_{l-1}$ and $C_{l-1}$ are kernels and recall the recursion (5), taking the coefficient $\lambda$ to be $1$ in the case of a Standard ResNet. Notice that it can be rewritten as

where we have omitted the dependence on $L$ for $\lambda$, we have defined $R_{l-1}(x,x^{\prime})=\sqrt{Q_{l-1}(x,x)Q_{l-1}(x^{\prime},x^{\prime})}$, and $\hat{f}$ is defined in (A2). Clearly $R_{l-1}$ is a kernel. By Lemma A8 and Lemma A9 we have that $\hat{f}(C_{l-1})$ is a kernel. Using the property that sums and products of kernels are kernels (the sum is trivial, cf. Footnote 14 for the product), we conclude that $Q_{l}$, and so $C_{l}$, is a kernel on $K$. ∎

A1.4 Proof of Proposition 2

$\mathcal{H}_{Q_{l}}(K)\subseteq\mathcal{H}_{Q_{l+1}}(K)$ for all $l\in[0:L-1]$ .

We have already shown that $T(Q_{l})-T(Q_{l-1})$ is non-negative definite in the proof of Lemma A10. We conclude by using the RKHS hierarchy result (see for instance or page 354 in ). ∎

A1.5 Proof of Lemma 3

A Gaussian Process on $K$ is said to be expressive on $L^{2}(K)$ if, denoting by $\psi$ a random realisation, for all $\varphi\in L^{2}(K)$ and all $\varepsilon>0$,

A universal kernel $Q$ on $K$ induces an expressive GP on $L^{2}(K)$ .

For $k\in[0:N]$ , we can define the interval $I_{k}=\left[\tfrac{a_{k}}{\sqrt{\mu_{k}}}-\tfrac{\varepsilon}{\sqrt{2(N+1)\mu_{k}}},\tfrac{a_{k}}{\sqrt{\mu_{k}}}+\tfrac{\varepsilon}{\sqrt{2(N+1)\mu_{k}}}\right]$ , so that, for all $z\in I_{k}$ we have $(z\sqrt{\mu_{k}}-a_{k})^{2}\leq\tfrac{\varepsilon^{2}}{2(N+1)}$ . Since all these intervals are non empty, we get

By Mercer’s theorem , $T(Q)$ is trace class and hence $\delta_{N}\to 0$ for diverging $N$ . By Markov’s inequality

A1.6 Proof of Proposition 3

In order to prove Proposition 3 we first need a preliminary result, which will be at the core of the proof of Theorem 1 as well.

where the coefficients $\omega_{k,n}$ ’s are all strictly positive, explicitly $\omega_{k,n}=\zeta^{k}\binom{n}{k}$ . Expanding the inner product $x\cdot x^{\prime}$ , we can express $p_{n}$ in the form

If $\sigma_{b}>0$ , then $Q_{2}$ is universal on $K$ . From Proposition 2, $Q_{L}$ is universal for all $L\geq 2$ .

with $g$ can be written as a finite linear combination of the functions $\{\hat{f}(C_{0})(x,.)\}_{x\in K}$ . This yields

A1.7 Proof of Proposition 4

See the proof of Proposition A7 in Appendix A8. ∎

A1.8 Proof of Proposition 5

Proposition 5 is a well-known classical result (see for instance Appendix H in and the references therein). For completeness, we give a proof in Appendix A8.

See the proof of Lemma A22 in Appendix A8. ∎

A1.9 Proof of Lemma 4

Therefore, $\mathcal{H}_{C_{\infty}}(K)$ is the space of constant functions.

Since $\hat{f}(x)\geq x$ , $C_{L}$ is non-decreasing wrt $L$ and converges to the unique fixed point of $\hat{f}$ which is $1$ . This convergence is uniform in $x,x^{\prime}$ , i.e. $\lim_{L\rightarrow\infty}\sup_{x,x^{\prime}\in K}1-C_{L}(x,x^{\prime})=0$ . Re-writing the recursion yields

where $\alpha=\frac{\sigma_{w}^{2}}{2}$, $\delta_{L}=\left(1+\frac{\sigma_{b}^{2}}{(1+\alpha)Q_{L-1}(x,x)}\right)^{-1/2}\left(1+\frac{\sigma_{b}^{2}}{(1+\alpha)Q_{L-1}(x^{\prime},x^{\prime})}\right)^{-1/2}$ and $\zeta_{L}=\sigma_{b}^{2}(Q_{L}(x,x)Q_{L}(x^{\prime},x^{\prime}))^{-1/2}$. Using Lemma A3, and the boundedness of $C_{L}$, a simple Taylor expansion yields

where the expansion is uniform on $x,x^{\prime}\in K$ , and $f(x)=\hat{f}(x)-x$ , and $g_{L}=\mathcal{O}(e^{-\beta L})$ for some $\beta>0$ . The previous dynamical system can be decomposed in two parts, a first part without the term $\mathcal{O}(e^{-\beta L})$ which is the homogeneous system, i.e. the system without bias, and the term $\mathcal{O}(e^{-\beta L})$ which is the contribution of the bias in the dynamical system. Assume $\sigma_{b}=0$ , then the term $g_{L}$ vanishes. Moreover, a Taylor expansion of $\hat{f}$ near 1 yields

Therefore, uniformly in $x,x^{\prime}\in K$ , we have that

Letting $\gamma_{L}=1-C_{L}$ , a simple Taylor expansion leads to

Therefore, $\gamma_{L}\sim\kappa L^{-2}$ where $\kappa=\frac{4(1+\alpha)^{2}}{s^{2}\alpha^{2}}$ . This equivalence is uniform in $x,x^{\prime}\in K$ .

It is likely that the rate $\mathcal{O}(L^{-2})$ holds without assuming $\sigma_{b}=0$. However, the analysis in this case requires unnecessarily complicated details. ∎
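The $L^{-2}$ rate in the no-bias case can be observed numerically. The sketch below iterates the recursion $C_{L}=\frac{1}{1+\alpha}\left(C_{L-1}+\alpha\hat{f}(C_{L-1})\right)$, which is our assumed form for the standard ResNet without bias, and compares $L^{2}(1-C_{L})$ with $\kappa=\frac{4(1+\alpha)^{2}}{s^{2}\alpha^{2}}$. The value $s=\frac{2\sqrt{2}}{3\pi}$ comes from our own expansion $\hat{f}(1-\gamma)-(1-\gamma)\approx s\,\gamma^{3/2}$ and should be read as an assumption about the constant $s$ of the proof.

```python
import math

def f_hat(c):
    """ReLU correlation map of (A2)."""
    c = max(-1.0, min(1.0, c))  # guard against round-off outside [-1, 1]
    return (c * math.asin(c) + math.sqrt(1.0 - c * c)) / math.pi + 0.5 * c

def gap_after(depth, c0=0.0, alpha=1.0):
    """Return 1 - C_L after `depth` steps of the assumed no-bias recursion."""
    c = c0
    for _ in range(depth):
        c = (c + alpha * f_hat(c)) / (1.0 + alpha)
    return 1.0 - c

alpha = 1.0                                   # sigma_w^2 = 2
s = 2.0 * math.sqrt(2.0) / (3.0 * math.pi)    # assumed constant, own expansion near 1
kappa = 4.0 * (1.0 + alpha) ** 2 / (s ** 2 * alpha ** 2)
for depth in (1000, 2000, 4000):
    print(depth, gap_after(depth, alpha=alpha) * depth ** 2, kappa)
```

Doubling the depth divides the gap $1-C_{L}$ by roughly $4$, as expected for a rate of order $L^{-2}$.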

A2 Stable ResNet with uniform scaling

We provide the results of existence, uniqueness and regularity of the solution of (8) in Lemma A11. Corollary A1 shows that the differential problem can be restated in the operator space. Finally, we give a proof of Lemma 5, which ensures uniform convergence to the continuous limit.

For any $x,x^{\prime}$ in $K$, the solution of (8) is unique and well defined for all $t\in[0,1]$. The maps $(x,x^{\prime})\mapsto q_{t}(x,x^{\prime})$ and $(x,x^{\prime})\mapsto c_{t}(x,x^{\prime})$ are Lipschitz continuous on $K^{2}$ and $c_{t}$ takes values in $[-1,1]$. Moreover, both $q_{t}$ and $c_{t}$ are kernels in the sense of Definition 2.

First notice that from (8) we can find, with a few algebraic manipulations, an explicit recurrence relation for the correlation $C_{l}$, defined in (3). For any $x,x^{\prime}\in K$ we have

We can find a Cauchy problem for the correlation directly from (8) or by noting that $A_{l}(x,x^{\prime})=1-\tfrac{\sigma_{b}^{2}}{2L}\left(\tfrac{1}{Q_{l}(x,x)}+\tfrac{1}{Q_{l}(x^{\prime},x^{\prime})}\right)+o(1/L)$ , for $L\to\infty$ . With both approaches, we have

where $f$ is defined in (4) and

Note that for the diagonal terms $q_{t}(x,x)$ , (8) reduces to $\dot{q}_{t}={\sigma_{b}}^{2}+\frac{{\sigma_{w}}^{2}}{2}\,q_{t}$ , whose solution is

$H$ is Lipschitz continuous in $\gamma$ and $C^{\infty}$ in $t$ , so there exists $\tau>0$ such that the Cauchy problem

has a unique $C^{1}$ solution defined for $t\in[0,\tau)$ . Noticing that

we get that for all $t_{1}$ such that $\gamma(t_{1})=1$ we have $\dot{\gamma}(t_{1})\leq 0$, since $f(1)=0$, and for all $t_{-1}$ such that $\gamma(t_{-1})=-1$ we have $\dot{\gamma}(t_{-1})=\sigma_{b}^{2}(\mathcal{G}_{t}(x,x^{\prime})+\mathcal{A}_{t}(x,x^{\prime}))+\tfrac{\sigma_{w}^{2}}{2}>0$. As a consequence $\gamma(t)\in[-1,1]$ for all $t\in[0,\tau)$ and we can take $\tau=\infty$. In particular we get that (A6) has a unique solution $t\mapsto c_{t}(z)$, defined for $t\in[0,\infty)$ and bounded in $[-1,1]$. As a consequence, (8) has a unique and well defined solution for all $t\geq 0$.

Now notice that $z\mapsto c_{0}(z)$ is Lipschitz on $K^{2}$. Let us denote by $L_{0}$ a Lipschitz constant for $c_{0}$. Since both $\mathcal{G}_{t}$ and $\mathcal{A}_{t}$ are $C^{1}$, we can find real constants $L_{G}$, $L_{A}$ and $M_{A}$ such that for all $z,z^{\prime}$ elements of $K^{2}$

Let $L_{f}$ be a Lipschitz constant for $f$ . Using the fact that $|c_{t}|\leq 1$ , we can write

where $L_{1}=\sigma_{b}^{2}(L_{G}+L_{A})$ and $L_{2}=\sigma_{b}^{2}M_{A}+\frac{\sigma_{w}^{2}}{2}\,L_{f}$ . Now fix $z$ and $z^{\prime}$ and consider $\Delta(t)=c_{t}(z)-c_{t}(z^{\prime})$ . We have

So $|\Delta(t)|\leq\left(\frac{L_{1}}{L_{2}}\,\left(e^{L_{2}\,t}-1\right)+L_{0}\,e^{L_{2}\,t}\right)\|z-z^{\prime}\|$, meaning that $c_{t}$ (and so $q_{t}$) is Lipschitz on $K^{2}$.

Since the mapping $(x,x^{\prime})\mapsto q_{t}(x,x^{\prime})$ is continuous, it defines a compact integral operator $T(q_{t})$ on $L^{2}(K)$ . Since $q_{t}$ is real and symmetric under the swap of $x$ and $x^{\prime}$ , the operator is self-adjoint. The same holds true for $c_{t}$ .

Consider the map $(t,z)\mapsto q_{t}(z)$, defined on $[0,1]\times K^{2}$, which is continuous wrt $z$ and $C^{2}$ wrt $t$, as can be easily checked. Since $K^{2}$ and $[0,1]$ are compact sets, it follows that for any $t$

Let $Q_{l|L}$ be the covariance kernel of the layer $l$ in a net of $L+1$ layers $[0:L]$ , and $q_{t}$ be the solution of (8), then

We will show that the relation holds for $c_{t}$, and hence for $q_{t}$. Let $H$, defined for $z\in K^{2}$, $t\in[0,1]$ and $\gamma\in[-1,1]$, be such that $\dot{c}_{t}(z)=H(z,t,c_{t}(z))$. Explicitly, with the same notations as in (A6), we have

Since $t$ and $z$ take values in compact sets, by uniform continuity, fixed $h$ we can write, for $h\to 0$

where $\mathcal{A}_{t}$ and $\mathcal{G}_{t}$ are defined as in (A6). As a consequence, we can find a constant $M_{1}>0$ and an integer $L_{\star}>0$ such that, for all $\gamma\in[-1,1]$, for all $z\in K^{2}$, for all $L\geq L_{\star}$

Moreover, there exists a constant $M_{2}>0$ such that for all $z\in K^{2}$, all $t\in[0,1]$ and all pairs $(\gamma,\gamma^{\prime})\in[-1,1]^{2}$

Thanks to the two above uniform inequalities, we will now show that, for $L\geq L_{\star}$ ,

At this point, using the fact that $\Delta_{0}=0$ , it is easy to show by induction that

and so (A9) follows. Finally, the uniform convergence of $C$ to $c$ implies the one of $Q$ to $q$ and so we conclude. ∎
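The convergence of the discrete model to its continuous limit can be illustrated on the diagonal terms, where everything is explicit. The sketch below assumes the discrete recursion $Q_{l}(x,x)=Q_{l-1}(x,x)+\tfrac{1}{L}\left(\sigma_{b}^{2}+\tfrac{\sigma_{w}^{2}}{2}Q_{l-1}(x,x)\right)$ for the uniform scaling and compares it at $t=1$ with the standard closed-form solution of the linear equation $\dot{q}_{t}=\sigma_{b}^{2}+\tfrac{\sigma_{w}^{2}}{2}q_{t}$. It only probes the diagonal and says nothing about the constants of the lemma; the names and parameter values are illustrative.

```python
import math

def discrete_diag(q0, L, sigma_w2=2.0, sigma_b2=1.0):
    """Q_L(x,x) under uniform scaling lambda = L^{-1/2} (assumed recursion)."""
    q = q0
    for _ in range(L):
        q += (sigma_b2 + 0.5 * sigma_w2 * q) / L
    return q

def continuous_diag(q0, t, sigma_w2=2.0, sigma_b2=1.0):
    """Closed-form solution of q' = sigma_b2 + alpha * q, with alpha = sigma_w2 / 2."""
    alpha = 0.5 * sigma_w2
    return math.exp(alpha * t) * q0 + sigma_b2 / alpha * (math.exp(alpha * t) - 1.0)

errors = {L: abs(discrete_diag(1.0, L) - continuous_diag(1.0, 1.0)) for L in (10, 100, 1000)}
print(errors)
```

In this experiment the gap shrinks roughly like $1/L$ as the depth increases.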

A2.2 Universality of the covariance kernel

We will now prove the results of universality of Theorem 1 and Proposition 6.

Proof of Theorem 1

Fix any finite Borel measure $\mu$ on $K$ , and assume that $\sigma_{b}>0$ . Given any non-zero $\varphi\in L^{2}(K,\mu)$ , there exists a $t_{\varphi}\in(0,1]$ such that $\langle T_{\mu}(q_{t})\,\varphi,\varphi\rangle>0$ , for all $t\in(0,t_{\varphi})$ .

From Corollary A1, we can expand $T_{\mu}(q_{t})$ around $t=0$ as

the $o(t)$ being wrt the operator norm, where we have defined the kernel $R_{0}$ via $R_{0}(x,x^{\prime})=\tfrac{\sigma_{w}^{2}}{2}\sqrt{(1+\zeta\|x\|^{2})(1+\zeta\|x^{\prime}\|^{2})}$. Since $T_{\mu}(q_{0})$ is non-negative, for any $\varphi\in L^{2}(K,\mu)$, we have

For any finite Borel measure $\mu$ on $K$, for any $t\in[0,1]$, the operator $T_{\mu}(\dot{q}_{t})$ on $L^{2}(K,\mu)$ is non-negative definite. In particular, for all $\varphi\in L^{2}(K,\mu)$ we have

Fix $\mu$ and $\varphi\in L^{2}(K,\mu)$ . From (8) we can write

By Lemma A11, $T_{\mu}(q_{t})$ is non-negative definite, so we can write

By Lemma A2, it suffices to show that for any finite Borel measure $\mu$ on $K$, $T_{\mu}(q_{t})$ is strictly positive definite for all $t\in(0,1]$. Fix any nonzero $\varphi\in L^{2}(K,\mu)$ and define the map $F$ on $[0,1]$ by $F(t)=\langle T_{\mu}(q_{t})\,\varphi,\varphi\rangle$. For any fixed $t\in(0,1]$, by Proposition A2 we can find $s\in(0,t)$ such that $F(s)>0$. Since $F$ is non-decreasing by Proposition A3, we get that $F(t)>0$. Hence $T_{\mu}(q_{t})$ is strictly positive definite. ∎

Proof of Proposition 6

converges in the operator norm. Then $A$ is a compact strictly positive definite operator.

both sums converging wrt the operator norm.

The claims for $f$ have been already proven in Lemma A8. As for $g$, the analyticity of $f$ implies the one of $f^{\prime}$, and it is easy to check the convergence on $[-1,1]$. Moreover, all the odd Taylor coefficients of $f^{\prime}$ are strictly positive, as the even coefficients of $f$ are. It follows that $\beta_{n}>0$ for all odd $n$. ∎

The case $\sigma_{b}>0$ has been already established in Proposition A2, hence suppose that $\sigma_{b}=0$ . First recall (A6)

for $t$ small enough. On the other hand, for $\varphi^{\prime}=0$ , we have $\varphi=\varphi^{\prime\prime}$ and so

for $t$ small enough. So there is a $t_{\varphi}$ such that, for $t\in(0,t_{\varphi})$ , $\langle T_{\nu}(c_{t})\,\varphi,\varphi\rangle>0$ . It follows immediately that the same property is true for $T_{\nu}(q_{t})$ . ∎

A3 Stable ResNet with decreasing scaling

Therefore, we can assume without loss of generality that $\sigma_{b}=0$ . This yields

Letting $\alpha_{l}=\frac{\sigma_{w}^{2}\lambda_{l}^{2}}{2}$ and $C_{l}:=C_{l}(x,x^{\prime})$ , we have that

Since $\hat{f}$ is non-decreasing, $C_{l}$ is non-decreasing and has a limit $C_{\infty}(x,x^{\prime})\leq 1$.

Now let us prove that the convergence of $C_{l}$ to $C_{\infty}$ happens uniformly with a rate $\sum_{k\geq l}\lambda_{k}^{2}$. Using the recursive formula of $C_{l}$, and knowing that we have that

Therefore, using the fact that $C_{l}\leq C_{\infty}$ , we have

A3.2 Proof of Corollary 1

A4 Neural Tangent Kernel

Throughout this section, we will consider ResNets with NTK parameterization . This simply means that all the components of the biases and the weights will be initialized as iid standard normal random variables. In order to compensate this change of parameterization, the propagation through the network needs to be slightly modified. Hence (2) will be replaced by

However, it is straightforward to verify that the recurrence (5) for the covariance kernels remains unchanged. Clearly, the dynamics of a standard ResNet with NTK parameterization can be recovered from (A12) by setting $\lambda_{l+1,L}=1$ for all $l,L$.

The Neural Tangent Kernel, introduced by , is defined as

where $\nabla_{\text{par}}$ denotes the gradient wrt the parameters of the network. The NTK of a Stable ResNet can be evaluated recursively. We will now prove the recurrence formula (9). The following result was proven in Lemma 3 in for the case of a standard ResNet without bias. We extend it to ResNet with bias.

For a Stable ResNet, the NTK can be evaluated recursively, layer by layer, as

We prove the second result by induction. The proof is similar to the one of ResNet in . Let $\theta_{k}=(W_{k},B_{k})$ . For $l\geq 1$ and $i\in[1:N_{l+1}]$

We prove the result by induction. Assume the result is true for layers $1,2,...,l$ and let us prove it for $l+1$ . Using the induction hypothesis, as $N_{1},N_{2},...,N_{l-1}\rightarrow\infty$ recursively, we have that

where $I^{\prime}=\frac{\sigma_{w}^{2}}{N_{l}}W_{l+1}^{ii}(\phi^{\prime}(y_{l}^{i}(x))+\phi^{\prime}(y_{l}^{i}(x^{\prime})))\,\Theta_{l}(x,x^{\prime}).$

As $N_{l}\rightarrow\infty$ , we have that $I^{\prime}\rightarrow 0$ . Using the law of large numbers, as $N_{l}\rightarrow\infty$

As a corollary of the above result, using the results in for the ReLU activation function, we can express the recursion more explicitly. We have

where $f$ is defined in (4) and $f^{\prime}:\gamma\mapsto-\tfrac{1}{\pi}\arccos\gamma$ is the first derivative of $f$ . So we can write

We can now easily check that the NTK is a kernel in the sense of Definition 2.

For every layer $l$, $\Theta_{l}$ is a kernel in the sense of Definition 2.

It’s clear that $\Theta_{0}=Q_{0}$ is a kernel. Now fix any layer $l$. We have already proved in Lemma A10 that $\left(1+\tfrac{f(C_{l})}{C_{l}}\right)Q_{l}$ is a kernel. With a similar argument, noting that $1+f^{\prime}$ can be expressed as a power series with only non-negative coefficients on $[-1,1]$, we conclude by Lemma A9 that $1+f^{\prime}(C_{l})$ is a kernel. Using the usual argument that sums and products of kernels are kernels, we conclude by induction that $\Theta_{l}$ is a kernel. ∎

As a final remark, note that from (A1), we have that $\lambda_{l,L}^{2}\Psi_{l}=Q_{l+1}-Q_{l}$ . Hence we can rewrite (A13) as

Since $1+f^{\prime}$ is non-negative on $[-1,1]$, it is easy to show by induction that $\Theta_{l}\geq Q_{l}$, point-wise, for all $l$. This is done explicitly in the next lemma, which is a corollary of Lemma 1 and shows the divergence of the NTK for a Standard ResNet.

By Lemma 1, it suffices to show that $\Theta_{L}(x,x)\geq Q_{L}(x,x)$. Recalling (A13) and noticing that $1+f^{\prime}\geq 0$ on $[-1,1]$, that $\left(1+\tfrac{f(C_{l})}{C_{l}}\right)Q_{l}\geq 0$ and that $\Theta_{0}=Q_{0}\geq 0$, by an easy induction we have that $\Theta_{l}\geq 0$ for all $l$. As a consequence, from (A14), we get that $\Theta_{l+1}-\Theta_{l}\geq Q_{l+1}-Q_{l}$. Hence, again with a straightforward induction, we have that $\Theta_{l}\geq Q_{l}$ for all $l$ and on the whole of $K^{2}$. In particular $\Theta_{L}(x,x)\geq Q_{L}(x,x)$ for all $x\in K$. ∎

Consider a ResNet of type (1) without bias, and let $\alpha=\frac{\sigma_{w}^{2}}{2}$ . The NTK recursion formula can be written in terms of normalized NTK $\kappa^{l}(x,x^{\prime})=\Theta_{l}(x,x^{\prime})/(1+\alpha)^{l-1}$

where $\hat{f}$ is given by (A2), $\hat{f}(t)=\frac{1}{\pi}(t\,\arcsin{t}+\sqrt{1-t^{2}})+\frac{1}{2}t$ .

where $\Psi_{l-1}=\alpha Q_{l-1}(x,x^{\prime})$ and $\Psi_{l-1}^{\prime}=\alpha\hat{f}^{\prime}(C_{l-1})$ . Using the recursive formula for the diagonal elements, we have that $\Psi_{l-1}=\alpha(1+\alpha)^{l-1}\hat{f}(C_{l-1}(x,x^{\prime}))\sqrt{Q_{0}(x,x)Q_{0}(x^{\prime},x^{\prime})}$ . We conclude by dividing both sides by $(1+\alpha)^{l-1}$ . ∎

Therefore, the NTK can be expressed exclusively in terms of the covariance kernels $(Q_{k})_{k\in[0:l-1]}$ , more precisely we have that

It is straightforward that $\Theta_{l}$ converges pointwise to a limiting kernel $\Theta_{\infty}$ . Let us prove that the convergence is uniform over $K$ . By observing that $|f^{\prime}|\leq 1$ , we have that for all $x,x^{\prime}\in K$

where $\kappa$ is a constant that depends on the compact $K$ . This proves the uniform convergence with a rate of $\mathcal{O}\left(\sum_{k=l+1}^{\infty}\lambda_{k}^{2}\right)$ . As a consequence, being a uniform limit of kernels, $\Theta_{\infty}$ is a kernel.

Proceeding as in the proof of Lemma A17, it’s easy to prove by induction that for all $l$ , $\Theta_{l}-Q_{l}$ is a kernel. In particular,

where $\succeq$ is in the operator sense, that is $T(\Theta_{l})-T(Q_{l})$ is non-negative definite. This yields

Therefore $\Theta_{\infty}$ inherits the universality of $Q_{\infty}$ naturally by the RKHS hierarchy . We conclude that $\Theta_{\infty}$ is universal (for both cases). ∎

With the uniform scaling, for arbitrary $x,x^{\prime}\in K$ , the continuous version of (9) reads

where $f^{\prime}:\gamma\mapsto-\tfrac{1}{\pi}\arccos\gamma$ is the first derivative of $f$ , defined in (4).
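The expression for $f^{\prime}$ can be sanity-checked with a finite difference. The sketch below takes $f(\gamma)=\hat{f}(\gamma)-\gamma$, with $\hat{f}$ as in (A2), which is how $f$ is related to $\hat{f}$ in the proof of Lemma 4; treating this as the $f$ of (4) is an assumption on our side. It also checks that $1+f^{\prime}$ takes values in $[0,1]$ at the sampled points.

```python
import math

def f(gamma):
    """f(gamma) = f_hat(gamma) - gamma, with f_hat as in (A2)."""
    f_hat = (gamma * math.asin(gamma) + math.sqrt(1.0 - gamma * gamma)) / math.pi + 0.5 * gamma
    return f_hat - gamma

def f_prime(gamma):
    """Claimed derivative: -arccos(gamma) / pi."""
    return -math.acos(gamma) / math.pi

h = 1e-6
for gamma in (-0.9, -0.3, 0.0, 0.4, 0.9):
    finite_diff = (f(gamma + h) - f(gamma - h)) / (2 * h)  # central difference
    print(gamma, finite_diff, f_prime(gamma), 1.0 + f_prime(gamma))
```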

For any $x,x^{\prime}$ in $K$, the solution $t\mapsto\Theta_{t}$ of (A16) is unique and well defined for all $t\in[0,1]$. Moreover, the map $(x,x^{\prime})\mapsto\Theta_{t}(x,x^{\prime})$ is a kernel in the sense of Definition 2 for all $t\in[0,1]$. We have the $L^{2}(K)$ convergence of the discrete model to the continuous one:

The existence and the uniqueness are clear, since it is a homogeneous first order Cauchy problem, with continuous coefficients. We can write explicitly the solution as

Hence, $T(\theta_{t})$ is the limit of a sequence of non-negative definite operators and hence it is non-negative definite, so that $\theta_{t}$ is a kernel on $K$ for all $t\in[0,1]$. ∎

Fix $t\in(0,1]$ . The solution of (A16) can be written as $\theta_{t}=q_{t}+r_{t}$ , where

Now, let us show that $r_{t}$ is a kernel. First, by Lemma A14 it is easy to check that $1+f^{\prime}$ is analytic on $(-1,1)$ and that its Taylor expansion around $0$ converges on $[-1,1]$. Moreover, all the Taylor coefficients are non-negative. Hence, Lemma A9 shows that $(1+f^{\prime}(c_{s}))$ is a kernel for all $s\in[0,t)$. It follows that $(1+f^{\prime}(c_{s}))\,\theta_{s}$ is a kernel (see Footnote 14). Now, $(1+f^{\prime}(c_{s}))\,\theta_{s}$ is continuous and symmetric on $K^{2}$, and it is easy to check from (A17) that it is uniformly bounded for $s\in[0,t)$. It follows that $r_{t}$ is continuous and symmetric. Now, fix an arbitrary finite Borel measure $\mu$ on $K$. We have to show that $T_{\mu}(r_{t})$ is non-negative definite, so that we can conclude by Lemma A1. For fixed $\varphi\in L^{2}(K,\mu)$, by simple standard arguments we have

and so $r_{t}$ is a kernel. Now, given two kernels $Q$ and $R$ , it is a classical result that $Q+R$ is a kernel and its RKHS contains the RKHS of $Q$ and $R$ , . We conclude that the RKHS of $\theta_{t}$ contains the RKHS of $q_{t}$ . Since $q_{t}$ is universal, $\theta_{t}$ is universal. ∎

A5 A PAC-Bayes Generalization result

Assuming that the samples are distributed as $(x,y)\sim\nu$ where $\nu$ is a probability distribution on $X\times Y$ , we define the generalization (true) loss by

For some randomized learning algorithm $\mathcal{A}$ , the empirical and generalization loss are given by

The PAC-Bayes theorem gives a probabilistic upper bound on the generalization loss $r(\mathcal{A})$ of a randomized learning algorithm $\mathcal{A}$ in terms of the empirical loss $r_{S}(\mathcal{A})$. Fix a prior distribution $\mathcal{P}$ on the hypothesis set $\mathcal{H}$. The Kullback-Leibler divergence between $\mathcal{A}$ and $\mathcal{P}$ is defined as $KL(\mathcal{A}\|\mathcal{P})=\int\log\frac{\mathcal{A}(h)}{\mathcal{P}(h)}\mathcal{A}(h)dh\in[0,\infty]$. The Bernoulli KL-divergence is given by $kl(a\|p)=a\log\frac{a}{p}+(1-a)\log\frac{1-a}{1-p}$ for $a,p\in[0,1]$. We define the inverse Bernoulli KL-divergence $kl^{-1}$ by

The PAC-Bayesian theorem can also be stated as

The KL-divergence term $KL(\mathcal{A}\|P)$ plays a major role as it controls the generalization gap, i.e. the difference (in terms of Bernoulli KL-divergence) between the empirical loss and the generalization loss. In our setting, we consider an ordinary GP regression with prior $P(f)=\mathcal{GP}(f|0,Q(x,x^{\prime}))$ . Under the standard assumption that the outputs $y_{N}=(y_{i})_{i\in[1:N]}$ are noisy versions of $f_{N}=(f(x_{i}))_{i\in[1:N]}$ with $y_{N}|f_{N}\sim\mathcal{N}(y_{N}|f_{N},\sigma^{2}I)$ , the Bayesian posterior $\mathcal{A}$ is also a GP and is given by

where $Q_{N}(x)=(Q(x,x_{i}))_{i\in[1:N]}$ and $Q_{NN}=(Q(x_{i},x_{j}))_{1\leq i,j\leq N}$ . In this setting, we have the following result

Let $Q_{L}$ be the kernel of a ResNet. Let $P_{L}$ be a GP with kernel $Q_{L}$ and $\mathcal{A}_{L}$ be the corresponding Bayesian posterior for some fixed noise level $\sigma>0$ . Then, in a fixed setting (fixed sample size N), the following results hold:

The proof relies on the simple observation that $P_{L}(f|f_{N})=\mathcal{A}_{L}(f|f_{N})$ . This yields

where $Q_{L,NN}=(Q_{L}(x_{i},x_{j}))_{1\leq i,j\leq N}$ .

Since $Q_{L,NN}$ is symmetric and strictly positive definite, it is straightforward that the largest eigenvalue of $Q_{L,NN}(Q_{L,NN}+\sigma^{2}I)^{-1}$ is smaller than $1$. This yields

where the last inequality holds for sufficiently large $L$ .

Case 2. In the case of Stable ResNet, we know that as $L\rightarrow\infty$ , the kernel $Q_{L}$ converges to a strictly positive definite kernel $Q_{\infty}$ , therefore the first term $\log(\det(Q_{L,NN}+\sigma^{2}I))$ remains bounded as $L\rightarrow\infty$ , which concludes the proof. ∎

A6 NNGP correlation kernel without bias as a modified NNGP kernel

Unscaled ResNets suffer from the exploding variance problem, which needs to be avoided in order to isolate the disadvantages of inexpressivity in their NNGP kernel. In order to do so, we use the NNGP correlation kernel $C$ instead of NNGP covariance kernel $Q$ , noting that Lemma A4 provides a simple recursion formula for $C$ if $\sigma_{b}=0$ , at depth $l\leq L$ :

where $\alpha_{l,L}=\frac{\lambda_{l,L}^{2}\sigma_{w}^{2}}{2}$ and $\hat{f}$ defined in (A2). In order to combine this with open-source packages designed for NNGP calculation, we note that (A20) can be viewed as the NNGP kernel of the following modified ResNet layer, using the same notation as in (2):

with $\hat{\alpha}_{l,L}=\frac{\alpha_{l,L}}{1+\alpha_{l,L}}$

A7 Experimental details and additional results

For our Vanilla ResNet NNGP results, we preprocess all training, validation and test data by first centering the training set and then normalising all images to lie on the pixel dimension sphere. For our Wide ResNet NNGP results we normalise all data so that the training set is centered and has channel-wise unit variance. We use Kaiming initialisation throughout, with $\sigma_{w}^{2}=2$ and $\sigma_{b}^{2}=0$. Vanilla ResNets have the same structure as type (2) in Table 2 and we use the same WRN kernel architecture as in Table 1 but omit the final average pooling step, which is known to improve kernel performance but dramatically increases computational costs. Throughout this work, where there are residual blocks with multiple layers, we calculate our scaling factors for uniform and decreasing scaled Stable ResNets by the number of residual connections. For example, a WRN-202 has only 99 residual connections, so we set $\lambda_{l,L}^{-1}=\sqrt{99}$ for the uniform scaling factors. We tune the noise variance $\sigma^{2}$, which is akin to the regularisation parameter in kernel ridge regression. To do so, we compute validation accuracy on a validation set of size 5000, selecting the best $\sigma^{2}=\lambda\times\text{Trace}(Q_{NN})/N$ from a logarithmic scale of $\lambda\in\{0.001,0.01,0.1\}$, where $N$ is the training set size and $Q_{NN}$ is the $N\times N$ training set Gram matrix for NNGP $Q$.
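The two numerical recipes of this paragraph can be sketched as follows. The helper names and the toy Gram matrix are hypothetical and only illustrate the formulas $\lambda_{l,L}^{-1}=\sqrt{\#\text{residual connections}}$ and $\sigma^{2}=\lambda\times\text{Trace}(Q_{NN})/N$; the validation loop that selects the best candidate is not shown.

```python
import math

def uniform_scaling_factor(num_residual_connections):
    """Uniform scaling with lambda^{-1} = sqrt(number of residual connections)."""
    return 1.0 / math.sqrt(num_residual_connections)

def noise_variance_grid(gram, multipliers=(0.001, 0.01, 0.1)):
    """Candidate sigma^2 = multiplier * Trace(Q_NN) / N for a square Gram matrix."""
    n = len(gram)
    mean_diag = sum(gram[i][i] for i in range(n)) / n
    return [m * mean_diag for m in multipliers]

# Made-up 3x3 Gram matrix, for illustration only.
toy_gram = [[2.0, 0.5, 0.1],
            [0.5, 1.0, 0.3],
            [0.1, 0.3, 3.0]]
print(uniform_scaling_factor(99))       # the WRN-202 example: 99 residual connections
print(noise_variance_grid(toy_gram))
```

Scaling the grid by $\text{Trace}(Q_{NN})/N$ makes the candidate noise levels relative to the average prior variance, so the same multipliers remain meaningful across kernels of different magnitude.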

A7.2 Trained ResNet results

For all our trained ResNet experiments we use a similar setup to the open-source code for in PyTorch . We repeat each experiment 3 times and report the best test accuracy and error intervals. All ResNets are initialised with Kaiming initialisation and like we adopt ResNets architectures where we double the number of filters in each convolutional layer. For experiments with BatchNorm, on CIFAR-10/100 we use batch size 64 across all depths and on TinyImageNet we used batch size 128 for depths 32 & 50, and batch size 100 for depth 104 in order to allow the model to fit onto a single 11GB VRAM GPU. We use SGD with momentum parameter $0.9$ and weight decay parameter $10^{-4}$ throughout.

We also present results for ResNets trained without BatchNorm . BatchNorm is a normalization layer commonly used with modern ResNets that is known to improve performance and allows deeper ResNets to be trained, though the precise reasons for this are not well understood. Several recent works have studied the possibility of removing the need for BatchNorm layers, by introducing trainable uniform scalings to the residual connection to stabilise variance at initialisation & gradients, demonstrating promising results. Note, our work additionally introduces decreasing scaling and also uses the infinite-width NNGP/NTK connection to assess the theoretical advantages of scaled Stable ResNets in the limit of infinite depth.

Moreover, our focus is not towards the possibility of removing BatchNorm and we show in Table 3 that our scalings can improve BatchNorm ResNets. However, we also present results without BatchNorm in Table 4, where again we see that our scaled stable ResNets improve performance compared to their unscaled counterparts: for example both Decreasing and Uniform scaling outperform the unscaled ResNet by over 3% test accuracy on CIFAR-100 with ResNet-104.

For ResNets trained without BatchNorm, for a fair comparison we tuned the initial learning rate on a small logarithmic scale, using batch size 128.

It can be easily deduced from lemma A8 that there exist $\{b_{i}\}_{i\geq 0}$ such that

where $b_{0},b_{1},b_{2i}>0$ . Following the same approach, we have that

Having the terms of orders 0 and 1 in $C_{1}(x,x^{\prime})$ ensures having a positive coefficient for all terms $z^{i}$ for $i\geq 1$ , which concludes the proof. ∎

The previous result can be easily extended to general $L\geq 2$ . We have that

Moreover, $(\alpha_{L+1,i})_{i\geq 0}$ can be expressed in terms of $(\alpha_{L,i})_{i\geq 0}$

where $\beta_{L}=Q_{L}(x,x)=Q_{L}(x^{\prime},x^{\prime})=\sum_{i\geq 0}\alpha_{L,i}$ and $(a_{m})_{m\geq 0}$ is such that $a_{0},a_{1}>0$ and $a_{2i}>0$ and $a_{2i+1}=0$ for all $i\geq 1$ . As a result, for all $L\geq 2,i\geq 0,\alpha_{L,i}>0$ .

Knowing that $C_{l}(x,y)=\frac{1}{\beta_{l}}Q_{l}(x,y)$ , we have that

which gives the recursive formulas for the coefficients of the analytic decomposition. Observe that the coefficients are non-decreasing wrt $L$ . Using lemma A21 we conclude that $\alpha_{L,i}>0$ . ∎

For depth $L\geq 2$, Proposition A5 shows that all coefficients $(\alpha_{L,i})_{i\geq 0}$ are (strictly) positive. It turns out that this is a sufficient condition for the kernel $Q_{L}$ to be strictly positive definite. We state this in the next proposition. The result can be seen as a consequence of Lemma A12 and Lemma A13. However, we will give here a more direct proof.

for all $k\geq 0$ . We conclude that $\varphi=0$ . ∎

We start by giving a brief review of the theory of Spherical Harmonics (). For some $k\geq 1$ , let $(Y_{k,j})_{1\leq j\leq N(d,k)}$ be the set of Spherical Harmonics of degree $k$ . We have $N(d,k)=\frac{2k+d-2}{k}{k+d-3\choose d-2}$ .
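The dimension formula is easy to evaluate and to sanity-check on low-dimensional cases. The sketch below assumes that $d$ denotes the ambient dimension, so that the harmonics live on the sphere $S^{d-1}$; under this convention the formula gives the familiar $2k+1$ for $d=3$. The function name is hypothetical.

```python
import math

def num_spherical_harmonics(d, k):
    """N(d, k) = (2k + d - 2) / k * C(k + d - 3, d - 2), for k >= 1 and d >= 2."""
    if k < 1 or d < 2:
        raise ValueError("the formula is stated for k >= 1 and d >= 2")
    # The product is always divisible by k, so integer division is exact.
    return (2 * k + d - 2) * math.comb(k + d - 3, d - 2) // k

for k in range(1, 6):
    print(k, num_spherical_harmonics(2, k), num_spherical_harmonics(3, k), num_spherical_harmonics(4, k))
```

For $d=2$ the count is always $2$ (a sine and a cosine of frequency $k$), for $d=3$ it is $2k+1$, and for $d=4$ it is $(k+1)^{2}$.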

For some function $p$ , the Hecke-Funk formula reads

$(P^{d}_{k})_{k\geq 0}$ form an orthogonal basis of $L^{2}([-1,1],(1-t^{2})^{\frac{d-3}{2}}dt)$, i.e.

where $\delta_{ij}$ is the Kronecker symbol.

where $\mu_{k}=\frac{\Omega_{d-1}}{\Omega_{d}}\int_{-1}^{1}p(t)P^{d}_{k}(t)(1-t^{2})^{(d-3)/2}dt$. We also have that $\mu_{k}\geq 0$ since $Q$ is non-negative by definition. The last statement follows from the spectral theory of compact self-adjoint operators and the orthonormality of the spherical harmonics (see the appendix of for details). ∎

Corollary A3 shows that for any depth $L$ , the Spherical Harmonics are the eigenfunctions of the kernel $Q_{L}$ . The fact that $\mu_{L,k}>0$ is a direct result of Proposition A6. Leveraging this result, we can prove a stronger result, which is the universality of the kernel $Q_{L}$ .