Stable ResNet

Soufiane Hayou, Eugenio Clerico, Bobby He, George Deligiannidis, Arnaud Doucet, Judith Rousseau

INTRODUCTION

The limit of infinite width has been the focus of many theoretical studies on Neural Networks (NNs) [Neal, 1995, Poole et al., 2016, Schoenholz et al., 2017, Yang and Schoenholz, 2017, Hayou et al., 2019a, Lee et al., 2019]. Although unachievable in practice, it features many interesting properties which can help grasp the complex behaviour of large networks. Infinitely wide 1-layer random NNs behave like Gaussian Processes (GPs) at initialization [Neal, 1995]. This was recently extended to multilayer NNs, where each layer can be associated to its own GP [Matthews et al., 2018, Lee et al., 2018, Yang, 2019a]. From a theoretical point of view, GPs have the advantage that their behaviour is fully captured by the mean function and the covariance kernel. Moreover, when dealing with GPs that are equivalent to infinite width NNs, these processes are usually centered, and hence fully determined by their covariance kernel. For multilayer networks, these kernels can be computed recursively, layer by layer [Lee et al., 2018]. Interestingly, in apparent contradiction with the naive idea “the deeper, the more expressive”, it was shown in [Schoenholz et al., 2017] that the GP becomes trivial as the number of layers goes to infinity, that is the output completely forgets about the input and hence lacks expressive power. This loss of input information during the forward propagation through the network might be exponential in depth and could lead to trainability issues for extremely deep nets [Schoenholz et al., 2017, Hayou et al., 2019a]. One natural way to prevent this last issue is the introduction of skip connections, commonly known as the ResNet architecture. However, in the regime of large width and depth, the output of standard ResNets becomes inexpressive and the network may suffer from gradient exploding [Yang and Schoenholz, 2017]. In the present work, we propose a new class of residual neural networks, the Stable ResNet, which, in the limit of infinite width and depth, is shown to stabilize the gradient (no gradient vanishing or exploding) and to preserve expressivity in the limit of large depth. The main idea is the introduction of layer/depth dependent scaling factors to the ResNet blocks. For ReLU networks, we provide a comprehensive analysis of two different scalings: a uniform one, where the scaling factor is the same for all the layers, and a decreasing one, where the scaling factor decreases as we go deeper inside the network. We also show that Stable ResNet solve the problem of Neural Tangent kernel (NTK) degeneracy in the limit of large depth [Hayou et al., 2019b]; indeed, with our scalings, the NTK is universal in the limit of infinite depth, which ensures that any continuous function can be approximated to an arbitrary precision by the features of the infinite depth NTK on a compact set.

All theoretical results are substantiated with numerical experiments in Section 7, where we demonstrate the benefits of Stable ResNet scalings both for the corresponding infinite width GP kernels as well as trained ResNets, over a range of moderate and large-scale image classification tasks: MNIST, CIFAR-10, CIFAR-100 and TinyImageNet.

RESNET

Consider a standard ResNet architecture with L+1L+1 layers, labelled with l[0:L]l\in[0:L]Notation: [m:n]={m,m+1n}[m:n]=\{m,m+1\dots n\} for integers nmn\geq m., of dimensions {Nl}l[0:L]\{N_{l}\}_{l\in[0:L]}.

where ϕ\phi is the activation function. The weights and bias are initialized with WliidN(0,σw2/Nl1)W_{l}\overset{\text{iid}}{\sim}\mathcal{N}(0,\sigma_{w}^{2}/N_{l-1}), and BliidN(0,σb2)B_{l}\overset{\text{iid}}{\sim}\mathcal{N}(0,\sigma_{b}^{2}), where σw>0\sigma_{w}>0, σb0\sigma_{b}\geq 0, N1=dN_{-1}=d, and N(μ,σ2)\mathcal{N}(\mu,\sigma^{2}) is the normal law of mean μ\mu and variance σ2\sigma^{2}.

Recent results by [Hayou et al., 2021] suggest that scaling the residual blocks with L1/2L^{-1/2} might have some beneficial properties on model pruning at initialization. This results from the stabilization effect on the gradient due to the scaling.

More generally, we introduce the residual architecture:

where {λl,L}l[1:L]\{\lambda_{l,L}\}_{l\in[1:L]} is a sequence of scaling factors. We assume hereafter that there exists λmax(0,)\lambda_{\max}\in(0,\infty) such that λl,L(0,λmax]\lambda_{l,L}\in(0,\lambda_{\max}] for all L1L\geq 1 and l[1:L]l\in[1:L]. In the next proposition, we give a necessary and sufficient condition for the gradient to remain bounded as the depth LL goes to infinity.

Proposition 1 shows that in order to stabilize the gradient, we have to scale the blocks of the ResNet with scalars {λl,L}l[1:L]\{\lambda_{l,L}\}_{l\in[1:L]} such that l=1Lλl,L2\sum_{l=1}^{L}\lambda_{l,L}^{2} remains bounded as the depth LL goes to infinity. Taking λmin=1\lambda_{\min}=1, Proposition 1 shows that the standard ResNet architecture (1) suffers from gradient exploding at initialization,In [Yang and Schoenholz, 2017], authors show a similar result with a slightly different ResNet architecture. which may cause instability during the first step of gradient based optimization algorithms such as Stochastic Gradient Descent (SGD). This motivates the following definition of Stable ResNet.

A ResNet of type (2) is called a Stable ResNet if and only if limLl=1Lλl,L2<\lim\limits_{L\rightarrow\infty}\sum\limits_{l=1}^{L}\lambda_{l,L}^{2}<\infty.

The condition on the scaling factors is satisfied by a wide range of sequences {λl,L}l[1:L],L1\{\lambda_{l,L}\}_{l\in[1:L],L\geq 1}. However, it is natural to consider the two categories: Uniform scaling. The scaling factors have similar magnitude and tend to zero at the same time. A simple example is the uniform scaling λl,L=1/L\lambda_{l,L}=1/\sqrt{L}. Decreasing scaling. The sequence is decreasing and tends to zero. To be clearer, we consider a general sequence {λl}l[1:L]\{\lambda_{l}\}_{l\in[1:L]} such that l1λl2<\sum_{l\geq 1}\lambda_{l}^{2}<\infty, and let λl,L=λl\lambda_{l,L}=\lambda_{l} for all L1L\geq 1, all l[1:L]l\in[1:L].

Note that our theoretical analyses will hold for any decreasing scaling {λl}l1\{\lambda_{l}\}_{l\geq 1} that is square summable, but for simplicity in all empirical results we consider the decreasing scaling:

We study theoretical properties of both ResNets with uniform and decreasing scaling. We show that, in addition to stabilizing the gradient, both scalings ensure that the ResNet is expressive in the infinite depth limit. For this purpose, we use a tool known as Neural Network Gaussian Process (NNGP) [Lee et al., 2018] which is the equivalent Gaussian Process of a Neural Network in limit of infinite width.

2 On Gaussian Process approximation of Neural Networks

Consider a ResNet of type (2). Neurons {y0i(x)}i[1:N1]\{y_{0}^{i}(x)\}_{i\in[1:N_{1}]} are iid since the weights with which they are connected to the inputs are iid. Using the Central Limit Theorem, as N0N_{0}\rightarrow\infty, y1i(x)y^{i}_{1}(x) is a Gaussian variable for any input xx and index i[1:N1]i\in[1:N_{1}]. Moreover, the variables {y1i(x)}i[1:N1]\{y^{i}_{1}(x)\}_{i\in[1:N_{1}]} are iid. Therefore, the processes y1i(.)y^{i}_{1}(.) can be seen as independent (across ii) centred Gaussian processes with covariance kernel Q1Q_{1}. This is an idealized version of the true process corresponding to letting width N0N_{0}\to\infty. Doing this recursively over ll leads to similar approximations for yli(.)y_{l}^{i}(.) where l[1:L]l\in[1:L], and we write accordingly yliindGP(0,Ql)y_{l}^{i}\stackrel{{\scriptstyle ind}}{{\sim}}\mathcal{GP}(0,Q_{l}). The approximation of yli(.)y_{l}^{i}(.) by a Gaussian process was first proposed by [Neal, 1995] in the single layer case and was extended to multiple feedforward layers by [Lee et al., 2019] and [Matthews et al., 2018]. More recently, a powerful framework, known as Tensor Programs, was proposed by [Yang, 2019b], confirming the large-width NNGP association for nearly all NN architectures.

For the ReLU activation function ϕ:xmax(0,x)\phi:x\mapsto\max(0,x), the recurrence relation can be written more explicitly as in [Daniely et al., 2016]. Let ClC_{l} be the correlation kernel, defined as

The recurrence relation reads (see Appendix A1)

This recursion leads to divergent diagonal terms QL(x,x)Q_{L}(x,x). This was proven in [Yang and Schoenholz, 2017] for a slightly different ResNet architecture. In the next Lemma, we extend this result to the ResNet defined by (1).

Figure 1 plots the diagonal NNGP and NTK (introduced in Section 5) values for a point on the sphere, highlighting the exploding kernel problem for standard ResNets. Stable ResNets do not suffer from this problem.

The symmetry in the above definition has to be understood as Q(x,x)=Q(x,x)Q(x,x^{\prime})=Q(x^{\prime},x) for all x,xKx,x^{\prime}\in K.

Kernels induce non-negative integral operators [Paulsen and Raghupathi, 2016].

Given a kernel QQ on KK, the Gaussian Process induced by QQ is a centred GP on KK whose covariance function is QQ.

We will sometimes use the notation GP(0,Q)\mathcal{GP}(0,Q) for the law of the GP induced by a kernel QQ. With our definition of a kernel, the samples from the induced GP lies in L2(K)L^{2}(K) with probability 11 [Steinwart, 2019].

From now on we will assume that 0K0\notin K if σb=0\sigma_{b}=0.We exclude since for σb=0\sigma_{b}=0 C0C_{0} is discontinuous in and can’t be a kernel on KK as in Definition 2, if 0K0\in K. For all ResNets, it is straightforward to check that QLQ_{L} is a kernel, in the sense of Definition 2 (see Appendix A1 or [Daniely et al., 2016]). The induced Gaussian Process is what we refer to as NNGP.

We denote by HQ(K)\mathcal{H}_{Q}(K) the Reproducing Kernel Hilbert Space (RKHS)See Appendix A0 for a definition. induced by the kernel QQ on the set KK. The following hierarchical result holds.

For all L1L\geq 1, l[0,L1]l\in[0,L-1], HQl(K)HQl+1(K)\mathcal{H}_{Q_{l}}(K)\subseteq\mathcal{H}_{Q_{l+1}}(K).

Proposition 2 shows that, as we go deeper, the RKHS cannot become poorer. However, increasing LL might introduce stability issues as illustrated in Proposition 1. We show in Sections 3 and 4 that Stable ResNets resolve this problem.

By Lemma 2, T(QL)T(Q_{L}) is a bounded, compact, self-adjoint operator and hence can be written as the sum of the projections on its eigenspaces [Lang, 2012]. By Mercer’s Theorem [Paulsen and Raghupathi, 2016], all the eigenfunctions of T(QL)T(Q_{L}) are continuous. Finally, it is possible to link the eigen-decomposition of T(QL)T(Q_{L}) with the distribution of the GP induced by QLQ_{L}. Denoting respectively by μk\mu_{k} and ψk\psi_{k} the eigenvalues and eigenfunctions of the operator T(QL)T(Q_{L}), we have the equivalence in law:

where {Zk}k0\{Z_{k}\}_{k\geq 0} are i.i.d. standard Gaussian random variables [Grenander, 1950]. The expressivity, that is the capacity to approximate a large class of function, of the network at initialization is then closely linked to the eigendecomposition of QLQ_{L} [Yang and Salman, 2019].

3 Universal kernels and expressive GPs

Let QQ be a kernel on KK, and HQ(K)\mathcal{H}_{Q}(K) its RKHS See Appendix A0.. We say that QQ is universal on KK if for any ε>0\varepsilon>0 and any continuous function gg on KK, there exists hHQ(K)h\in\mathcal{H}_{Q}(K) such that hg<ε\|h-g\|_{\infty}<\varepsilon.

The universality of a kernel QQ on a compact set implies that the kernel is strictly positive definite, i.e. for all non-zero φL2(K),T(Q)φ,φ>0\varphi\in L^{2}(K),\langle T(Q)\varphi,\varphi\rangle>0 [Sriperumbudur et al., 2011]. Moreover, universality also implies the full expressivity of the induced GP, as expressed in the following.

A Gaussian Process on KK is said to be expressive on L2(K)L^{2}(K) if, denoting by ψ\psi a random realisation ψ\psi of the process, for all φL2(K)\varphi\in L^{2}(K), for all ε>0\varepsilon>0,

A universal kernel QQ on KK induces an expressive GP on L2(K)L^{2}(K).

By definition, universal kernels are characterized by the property that their associated RKHS is dense (w.r.t the uniform norm .\|.\|_{\infty}) in the space of continuous functions on KK. This is crucial for Kernel regression and Gaussian Process inference [Kanagawa et al., 2018].The closure of the set of functions described by the mean function of the posterior of a GP regression is exactly the RKHS of the kernel of the GP prior. By Proposition 2, it suffices to prove that QL0Q_{L_{0}} is universal for some L0L_{0} in order to conclude for all LL0L\geq L_{0}. It turns out this is true for L0=2L_{0}=2.

If σb>0\sigma_{b}>0, then Q2Q_{2} is universal on KK. From Proposition 2, QLQ_{L} is universal for all L2L\geq 2.

Although the kernel is universal for fixed depth LL, it is not guaranteed that as LL\rightarrow\infty, QLQ_{L} remains universal. Indeed, for the standard ResNet architecture, the variance QL(x,x)Q_{L}(x,x) grows exponentially with LL [Yang and Schoenholz, 2017], and therefore, the kernel diverges. In order to analyse the expressivity of the kernel of a standard ResNet in the limit of large depth, we can study the correlation kernel CLC_{L}, defined in (3), instead. We show in the following Lemma that, as LL goes to infinity, the kernel CLC_{L} converges to a constant (which has a 1D RKHS).

Therefore, HC(K)\mathcal{H}_{C_{\infty}}(K) is the space of constant functions.

Lemma 4 shows that in the limit of infinite depth LL, the RKHS of the correlation kernel is trivial, meaning that the NNGP cannot be expressive. On the contrary, we will show in the next sections that Stable ResNets achieve a universal kernel for infinite depth LL.

UNIFORM SCALING

Consider a Stable ResNet with layers [0:L][0:L]. Under uniform scaling, the recurrence relation in (5) reads:

In the limit as LL\to\infty, (7) converges uniformly to a continuous ODE. Studying the solution of the corresponding Cauchy problem, we show that the covariance kernel remains universal in the limit of infinite depth.

As discussed in Section A2 of the Appendix, for any x,xx,x^{\prime}, the solution of the above Cauchy problem exists and is unique. Moreover, the solutions qtq_{t} and ctc_{t} are kernels on KK, in the sense of Definition 2.

Clearly, for finite LL, the continuous ODE (8) is an approximation. However, the following result holds.

Let QlLQ_{l|L} be the covariance kernel of the layer ll in a net of L+1L+1 layers [0:L][0:L], and qtq_{t} be the solution of (8), then

2 Universality of the covariance kernel

When σb>0\sigma_{b}>0, the kernel qtq_{t} is universal for t>0t>0.

The proof of the above statement is detailed in Appendix A2. The main idea is to show that the integral operator T(qt)T(q_{t}) is strictly positive definite and then use a characterization of universal kernels, due to [Sriperumbudur et al., 2011], which connects the universality of Definition 4 with the strict positivity of the induced integral operator.The details are more involved as we need to show that the kernel induces a strictly positive definite operator on L2(K,μ)L^{2}(K,\mu) for any finite Borel measure μ\mu on KK.

DECREASING SCALING

The convergence of the kernel QLQ_{L} to the limiting kernel QQ_{\infty} is governed by the convergence rate of the series of scaling factors. Moreover, leveraging the RKHS hierarchy from Proposition 2, we find that QQ_{\infty} is universal.

As in the uniform scaling case, the limiting kernel exists and is universal unlike the standard ResNet architecture that yields a divergent kernel QLQ_{L} as LL\to\infty.

To validate our universality and expressivity results, Figure 2 plots the leading eigenvalues of the NNGP (& NTK, introduced in Section 5) kernels on a set of 1000 points sampled uniformly at random from the circle, normalized so that the largest eigenvalue is 1. We use the recursion formulas for NNGP correlation (Lemma A4) and normalized NTK (Lemma A19) to avoid the exploding variance/gradient problem. We see that the unscaled ResNet NNGP becomes inexpressive with depth because all non-leading eigenvalues converge to 0, whereas our Stable ResNets (decreasing and uniform scaling) are expressive even in the large depth limit.

NEURAL TANGENT KERNEL

In the so-called lazy training regime [Chizat and Bach, 2019], the training dynamics of an infinitely wide network can be described via the Neural Tangent Kernel (NTK) [Lee et al., 2019], introduced in [Jacot et al., 2018] and defined as

where X\mathcal{X} and Y\mathcal{Y} are respectively the input and output datasets, ΘL(x,X)={ΘL(x,x)}xX\Theta_{L}(x,\mathcal{X})=\{\Theta_{L}(x,x^{\prime})\}_{x^{\prime}\in\mathcal{X}} and Θ^L\hat{\Theta}_{L} is the matrix {Θl(x,x)}x,xX\{\Theta_{l}(x,x^{\prime})\}_{x,x^{\prime}\in\mathcal{X}}. The universality of the NTK is crucial for the ResNet to learn beyond initialization, since the residual FτF0F_{\tau}-F_{0} lies in the RKHS generated by ΘL\Theta_{L}. For unscaled ResNet, [Hayou et al., 2019b] showed that the limiting NTK is trivial in the sense of Lemma 4. However, this is not the case for Stable ResNet.

Consider a ResNet of type (2). We have This is true under the technical assumption that the parameters appearing in the back-propagation can be considered independent from the ones of the forward pass (Gradient Independent Assumption) [Yang, 2019a]

An analogous result can be stated for the uniform scaling, after noticing that a continuous formulation (Θlθt(l)\Theta_{l}\mapsto\theta_{t(l)}) can be obtained in analogy with what has been done for the covariance kernel (cf Appendix A4).

Figure 2 shows that the non-leading NTK eigenvalues do not decay to 0 with depth for Stable ResNets, unlike for unscaled ResNets. This is in line with findings of Propositions 8 and 9.

A PAC-BAYES RESULT

where ν\nu is a probability distribution on X×YX\times Y. For some randomized learning algorithm A\mathcal{A}, the empirical and generalization loss are given by:

The PAC-Bayes theorem gives a probabilistic upper bound on the generalization loss r(A)r(\mathcal{A}) of a randomized learning algorithm A\mathcal{A} in terms of the empirical loss rS(A)r_{S}(\mathcal{A}). Fix a prior distribution P\mathcal{P} on the hypothesis set U\mathcal{U}. The Kullback-Leibler divergence between A\mathcal{A} and P\mathcal{P} is defined as KL(AP)=A(h)logA(h)P(h)dh[0,]\text{KL}(\mathcal{A}\|\mathcal{P})=\int\mathcal{A}(h)\log\frac{\mathcal{A}(h)}{\mathcal{P}(h)}\textrm{d}h\in[0,\infty]. The Bernoulli KL-divergence is given by kl(ap)=alogap+(1a)log1a1p\text{kl}(a||p)=a\log\frac{a}{p}+(1-a)\log\frac{1-a}{1-p} for a,pa,p\in. We define the inverse Bernoulli KL-divergence kl1\text{kl}^{-1} by

The KL-divergence term KL(AP)\text{KL}(\mathcal{A}\|\mathcal{P}) plays a major role as it controls the generalization gap, i.e. the difference (in terms of Bernoulli KL-divergence) between the empirical loss and the generalization loss. In our setting, we consider an ordinary GP regression with prior P(f)=GP(f0,Q(x,x))\mathcal{P}(f)=\mathcal{GP}(f|0,Q(x,x^{\prime})). Under the standard assumption that the outputs yN=(yi)i[1:N]y_{N}=(y_{i})_{i\in[1:N]} are noisy versions of fN=(f(xi))i[1:N]f_{N}=(f(x_{i}))_{i\in[1:N]} with yNfNN(yNfN,σ2I)y_{N}|f_{N}\sim\mathcal{N}(y_{N}|f_{N},\sigma^{2}I), the Bayesian posterior A\mathcal{A} is also a GP and is given by

QN(x)=(Q(x,xi))i[1:N]Q_{N}(x)=(Q(x,x_{i}))_{i\in[1:N]}, QNN=(Q(xi,xj))1i,jNQ_{NN}=(Q(x_{i},x_{j}))_{1\leq i,j\leq N}. In this setting, we have the following result

Let QLQ_{L} be the kernel of a ResNet. Let PLP_{L} be a GP with kernel QLQ_{L} and AL\mathcal{A}_{L} be the corresponding Bayesian posterior for some fixed noise level σ2>0\sigma^{2}>0. Then, in a fixed setting (fixed sample size N), the following results hold: \bullet With a standard ResNet, KL(ALPL)L\textup{KL}(\mathcal{A}_{L}\|P_{L})\gtrsim L. \bullet With a Stable ResNet, KL(ALPL)=OL(1)\textup{KL}(\mathcal{A}_{L}\|P_{L})=\mathcal{O}_{L}(1).

The KL-divergence bound diverges for a standard ResNet while it remains bounded for Stable ResNet. Although PAC-Bayes bounds only give an upper bound on the generalization error, Proposition 10 shows that Stable ResNet does not suffer from the “curse of depth”, i.e. the KL-divergence does not explode as the depth becomes large.

EXPERIMENTS

In line with our theory, we now present results demonstrating empirical advantages of Stable ResNets (both uniform and decreasing scaling) compared to their unscaled counterparts on a toy regression task and standard image classification tasks, both for infinite-width NNGP kernels as well as trained finite-width NNs in the latter case. In the interests of space, all experimental details not described in this section can be found in Appendix A7. All error bars in this section correspond to 3 independent runs.

We first present a toy regression posterior regression experiment with NNGP kernel. We compare across different depths and scalings, with target test function y=xsin(x)y=x\text{sin}(x) and a small amount of observation noise σ=0.1\sigma=0.1 (σ\sigma as defined in Eq. 10).

We use 5 training points (dark green dots).

We map our 1D inputs xx onto the circle (cos(x),sin(x))(\text{cos}(x),\text{sin}(x)) before performing GP regression. This is so that all inputs have unit norm and we can use the NNGP correlation kernel (Eq. 3) for the vanilla ResNet (ResNet with fully connected blocks), in order to avoid the exploding variance problem. As expected from our theory, in Figure 3, for depth 1000 the NNGP correlation kernel without stable scaling (top row, red) is unable to learn anything beyond a constant function due to inexpressivity, whereas our Stable ResNets (bottom two rows, blue) are still expressive in the large depth limit. We plot mean and 95% posterior predictive credible interval for NNGP posteriors.

Stable NNGP classification results

We first compare the performance of Stable and standard ResNets of varying depths through their infinite-width NNGP kernels, on MNIST & CIFAR-10. For each considered NNGP kernel QQ and training set (xi,yi)i[1:N](x_{i},y_{i})_{i\in[1:N]}, we report test accuracy using the mean of the posterior predictive (Eq. 10): QN()(QNN+σ2I)1yNQ_{N}(\cdot)(Q_{NN}+\sigma^{2}I)^{-1}y_{N}, which is also the kernel ridge regression predictor [Kanagawa et al., 2018]. We treat classification labels yy as one-hot regression targets, similar to recent works [Arora et al., 2019, Lee et al., 2019, Shankar et al., 2020], and tune the noise σ2\sigma^{2} using prediction accuracy on a held-out validation set.

First, in Table 1, we demonstrate the exploding NNGP variance problem for unscaled Wide-ResNets (WRN) [Zagoruyko and Komodakis, 2016]. For an unscaled WRN of depth 202, the NNGP kernel values explode resulting in numerical errors, whereas Stable ResNets achieve 54% test accuracy with 10K training points (out of full size 50K). Note that any numerical errors from exploding NNGP also afflict the NTK, as the difference between the NTK and NNGP is positive semi-definite [Lee et al., 2019, He et al., 2020] (which is why the NTK lines always lie above their corresponding NNGP in Figure 1).

To isolate the disadvantages of inexpressivity in unscaled Resnets NNGPs compared to our Stable ResNets, we need to avoid the exploding variance problem and ensuing numerical errors. In order to do so, we use the NNGP correlation kernel CC instead of the NNGP covariance kernel QQ, noting that these two kernels are equal up to multiplicative constant on the sphere, and that the posterior predictive mean is invariant to the scale of QQ (with σ2\sigma^{2} also tuned relative to the scale of QQ). Moreover, the formula in Lemma A4 for NNGP correlation recursion for vanilla ResNets without bias can be recast as a ResNet with a modified scaling (see Appendix A6), allowing us to use existing optimised libraries [Novak et al., 2020]. In order to use the vanilla ResNet correlation recursion, we standardise all MNIST & CIFAR-10 images to lie on the 784 & 3072-dimension sphere respectively.

Our expressivity results, as well as Proposition 10, suggest that we expect Stable ResNets to outperform standard ResNets for large depths even when exploding variance numerical errors are alleviated for standard ResNets. In Table 2, we see that unscaled ResNets suffer from a degradation in test accuracy with depth, due to inexpressivity, whereas our Stable ResNets (both decreasing and uniform) do not suffer from a drop in performance. For example, the posterior predictive mean using the NNGP of an unscaled vanilla ResNet with depth 1000 attains only 17.86% accuracy on CIFAR-10 with 10K training points, compared to 48.76% for Stable ResNet (decreasing scale).

We focus on the NNGP rather than the NTK as recent works [Lee et al., 2020, Shankar et al., 2020] have demonstrated that there is no advantage to the state-of-the-art NTK over the NNGP as infinite-width kernel predictors. Moreover, we do not aim for near state-of-the-art kernel results due to computational resources, and instead aim to empirically validate the theoretical advantages of Stable ResNets.

Trained Stable ResNet results

Finally, we consider the benefits of trained Stable ResNets on the large-scale CIFAR-10, CIFAR-100 and TinyImageNetAvailable at http://cs231n.stanford.edu/tiny-imagenet-200.zip datasets. We compare trained convolutional ResNets [He et al., 2016] of depths 32, 50 & 104 in terms of test accuracy. In the main text we present results for ResNets trained with Batch Normalization [Ioffe and Szegedy, 2015] (BatchNorm), while results for trained ResNets without BatchNorm can be found in Appendix A7. Stable ResNet scalings are applied to the residual connection after all convolution, ReLU and BatchNorm layers.

We use initial learning rate 0.10.1 which is decayed by 0.10.1 at 50%50\% and 75%75\% of the way through training. This learning rate schedule has been used previously [He et al., 2016] for unscaled ResNets and we found it to work well for all ResNets trained with BatchNorm. We train for 160 epochs on CIFAR-10/100 and 250 epochs on TinyImageNet. Test accuracy results are displayed in Table 3. As we can see, Stable ResNets consistently outperform standard ResNets across datasets and depths. Moreover, the performance gap is larger for larger depths: for example on CIFAR-100 our Stable ResNet (decreasing) outperforms its standard counterpart by 1.05% (75.06 vs 74.01) on average for depth 32 whereas for depth 104 the test accuracy gap is 2.36% (77.44 vs 75.08) on average. A similar trend can also be observed for the more challenging TinyImageNet dataset. Interestingly, we see that among the Stable ResNets, decreasing scaling also consistently outperforms uniform scaling.

CONCLUSION

Stable ResNets have the benefit of stabilizing the gradient and ensuring expressivity in the limit of infinite depth. We have demonstrated theoretically and empirically that this type of scaling makes NNGP inference robust and improves test accuracy with SGD on modern ResNet architectures. However, while Stable ResNets with both uniform and decreasing scalings outperform standard ResNet, the selection of an optimal scaling remains an open question; we leave this topic for future work.

ACKNOWLEDGMENTS

This material is based upon work supported in part by the U.S. Army Research Laboratory and the U. S. Army Research Office, and by the U.K. Ministry of Defence (MoD) and the U.K. Engineering and Physical Research Council (EPSRC) under grant number EP/R013616/1. AD is also partially supported by EPSRC EP/R034710/1. BH is supported by the EPSRC and MRC through the OxWaSP CDT programme (EP/L016710/1). The project leading to this work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 834175).

References

Appendix

A0 Mathematical preliminaries

We will make use of functional analysis results on the theory of Hilbert space. We refer to for a comprehensive introduction to the topic. We precise here that, even when not explicitly stated, all Hilbert spaces considered in the present work are real, and all linear operator are bounded. We will make use of the spectral theory for compact self-adjoint operators. We refer again to for a detailed discussion.

We state here a characterisation of kernels, which is an extension of Lemma 2. Despite being a classical result (see the discussion about Mercer kernels in ), we will give a proof, for the sake of completeness.

for any φL2(K,μ)\varphi\in L^{2}(K,\mu). The operator Tμ(Q)T_{\mu}(Q) is a bounded compact self-adjoint definite operator. Moreover, QQ is a kernel if and only if Tμ(Q)T_{\mu}(Q) is non-negative definite for all finite Borel measures μ\mu on KK.

and the convergence is uniform on K2K^{2}. The continuity of the YkY_{k}’s implies that they can be seen as elements of L2(K,μ)L^{2}(K,\mu). Moreover, the uniform convergence, along with the fact that μ(K)<\mu(K)<\infty, implies the convergence of the sum wrt the L2(K,μ)L^{2}(K,\mu) operator norm. In particular Tμ(K)T_{\mu}(K) is a limit of non-negative definite operators and hence non-negative definite. Now, assume that, for all finite Borel μ\mu, Tμ(Q)T_{\mu}(Q) is non-negative definite. Chosen a finite set {x1xn}K\{x_{1}\dots x_{n}\}\subset K, in particular we have that μ=i=1nδxi\mu=\sum_{i=1}^{n}\delta_{x_{i}} is a finite Borel measure (where δx\delta_{x} is the Dirac measure on xKx\in K). Hence Tμ(Q)T_{\mu}(Q) is the matrix {Q(xi,xj)}i,j\{Q(x_{i},x_{j})\}_{i,j}. We conclude that QQ is a kernel. ∎

We will now give a definition of the Reproducing Kernel Hilbert Space associated to a kernel. We refer to for a general and comprehensive introduction to the topic.

Given a kernel QQ on KK, we can associate to it a real Hilbert space HQ\mathcal{H}_{Q}, with the following properties:

Denoting as ,Q\langle\cdot,\cdot\rangle_{Q} the inner product of HQ\mathcal{H}_{Q}, for each xKx\in K, there exists a element kxHQk_{x}\in\mathcal{H}_{Q} such that h(x)=h,kxQh(x)=\langle h,k_{x}\rangle_{Q}, for all hHQh\in\mathcal{H}_{Q}.

For all x,xKx,x^{\prime}\in K, kx,kxQ=Q(x,x)\langle k_{x},k_{x^{\prime}}\rangle_{Q}=Q(x,x^{\prime}).

Such a Hilbert space exists for each kernel QQ and it is unique up to isomorphism, . HQ\mathcal{H}_{Q} is called the Reproducing Kernel Hilbert Space (RKHS) of QQ.

In general, it is not easy to give an explicit form for the RKHS associated to a kernel QQ. However, we can say that it contains the linear span of {xQ(x,x)}xK\{x\mapsto Q(x,x^{\prime})\}_{x^{\prime}\in K}. Actually, this linear span is a dense subset of HQ\mathcal{H}_{Q}, wrt the norm of HQ\mathcal{H}_{Q} .

A kernel on KK is said to be universal if its RKHS is dense in the space of continuous functions C(K)C(K), wrt the uniform norm.

Let QQ be a kernel on KK, and HQ(K)\mathcal{H}_{Q}(K) its RKHS. We say that QQ is universal on KK if for any ε>0\varepsilon>0 and any continuous function gg on KK, there exists hHQ(K)h\in\mathcal{H}_{Q}(K) such that hg<ε\|h-g\|_{\infty}<\varepsilon.

We can now state a characterization of universal kernels, from .

As a final note, hereafter we often omit the explicit reference to the measure μ\mu, that is we will speak of the operator T(Q)T(Q) on L2(K)L^{2}(K). Unless otherwise stated, this notation implies the choice of an arbitrary finite Borel measure μ\mu on the compact KK.

A1 Residual Neural Networks and Gaussian processes

Consider a standard ResNet architecture with L+1L+1 layers, labelled with l[0:L]l\in[0:L], of dimensions {Nl}l[0:L]\{N_{l}\}_{l\in[0:L]}.

Hereafter, NlN_{l} denotes the number of neurons in the lthl^{th} layer, ϕ\phi the activation function and [m:n]:={m,m+1,...,n}[m:n]:=\{m,m+1,...,n\} for mnm\leq n. The components of weights and bias are respectively initialized with WlijiidN(0,σw2/Nl1)W_{l}^{ij}\overset{\text{iid}}{\sim}\mathcal{N}(0,\sigma_{w}^{2}/N_{l-1}), and BliiidN(0,σb2)B_{l}^{i}\overset{\text{iid}}{\sim}\mathcal{N}(0,\sigma_{b}^{2}) where N(μ,σ2)\mathcal{N}(\mu,\sigma^{2}) denotes the normal distribution of mean μ\mu and variance σ2\sigma^{2}.

In , authors showed that wide deep ResNets might suffer from gradient exploding during backpropagation.

Recent results by suggest that scaling the residual blocks with L1/2L^{-1/2} might have some beneficial properties on model pruning at initialization. This is a result of the stabilization effect of scaling on the gradient.

More generally, we introduce the residual architecture:

where (λk,L)k[1:L](\lambda_{k,L})_{k\in[1:L]} is a sequence of scaling factors. We assume hereafter that there exists λmax(0,)\lambda_{\max}\in(0,\infty) such that for all L1L\geq 1 and k[1:L]k\in[1:L], we have that λk,L(0,λmax]\lambda_{k,L}\in(0,\lambda_{\max}].

where we have used the Central Limit Theorem. Therefore, we have

For the remainder of this appendix, we define the function

For all ll, the diagonal terms of QlQ_{l} have closed-form expressions. We show this in the next lemma.

where f^\hat{f} is given by (A2). It is straightforward that f^(1)=1\hat{f}(1)=1. This yields

As a corollary of the previous result, it is easy to show that for a Standard ResNet the diagonal terms explode with depth, which is Lemma 1 in the main paper.

The statement trivially follows from Lemma A3, using that Q0(x,x)=σb2+σw2dx2Q_{0}(x,x)=\sigma_{b}^{2}+\tfrac{\sigma_{w}^{2}}{d}\|x\|^{2} and the fact that for a Standard ResNet (1), all the coefficients λl,L\lambda_{l,L}’s are equal to 11. ∎

In the case of a ResNet with no bias, the correlation kernel follows a simple recursive formula described in the next lemma.

where αl,L=λl,L2σw22\alpha_{l,L}=\frac{\lambda_{l,L}^{2}\sigma_{w}^{2}}{2}.

This is direct result of the covariance recursion formula (5). ∎

A1.2 Proof of Proposition 1

We use the following result from in order to derive closed form expressions for the second moment of the gradients.

Consider a ResNet of the form (2) with weights WW. In the limit of infinite width, we can assume that WTW^{T} used in back-propagation is independent from WW used for forward propagation, for the calculation of Gradient Covariance and NTK.

Next we re-state and prove Proposition 1.

where C=2σw2(sup(x,y)K×Kqˉl(x,y))(supxKQ0(x,x)+2σb2σw2)C=\frac{2}{\sigma_{w}^{2}}\left(\sup_{(x,y)\in K\times K^{\prime}}\bar{q}^{l}(x,y)\right)\left(\sup_{x\in K}Q_{0}(x,x)+\frac{2\sigma_{b}^{2}}{\sigma_{w}^{2}}\right). We conclude by taking the supremum over ll and x,yx,y.

where κ=12λmin2(1+σw22λmax2)(1+σw22λmin2)Q1(x,x)qˉl(x,y)>0\kappa=\frac{1}{2}\frac{\lambda_{\min}^{2}}{\left(1+\frac{\sigma_{w}^{2}}{2}\lambda_{\max}^{2}\right)\left(1+\frac{\sigma_{w}^{2}}{2}\lambda_{\min}^{2}\right)}Q_{1}(x,x)\,\bar{q}^{l}(x,y)>0. ∎

Using Lemma A5, we can derive simple recursive formulas for the second moment of the gradient as well as for the Neural Tangent Kernel (NTK). This was previously done in for feedforward neural networks, we prove a similar result for ResNet in the next lemma.

In the limit of infinite width, using the same notation as in proposition 1, we have that

Using lemma A5 and the Central Limit Theorem, we have that

Before moving to the next proofs, recall the definition of Stable ResNet.

A ResNet of type (2) is called a Stable ResNet if and only if limLk=1Lλk,L2<\lim\limits_{L\rightarrow\infty}\sum\limits_{k=1}^{L}\lambda_{k,L}^{2}<\infty.

Leveraging the previous result, the function ff defined in (4) is analytic. We clarify this in the next lemma.

we get a0=1πa_{0}=\frac{1}{\pi}. Moreover, we have that for all γ(1,1)\gamma\in(-1,1)

This yields a1=f^(0)=12a_{1}=\hat{f}^{\prime}(0)=\frac{1}{2}. Then, noticing that

is an odd function, we get that for all i1,a2i+1=0i\geq 1,a_{2i+1}=0. Now let us prove that for all k1k\geq 1, there exist bk,0,bk,1,...,bk,k1>0b_{k,0},b_{k,1},...,b_{k,k-1}>0 such that, for all γ(1,1)\gamma\in(-1,1),

We prove this by induction. For k=1k=1, we have that

so that our claim holds. Assume now that it is true for some k1k\geq 1, let us prove it for k+1k+1. It is easy to see that

The induction is straightforward. In particular, we have shown that a2i=f^(2i)(0)(2i)!=bi,0(2i)!>0a_{2i}=\frac{\hat{f}^{(2i)}(0)}{(2i)!}=\frac{b_{i,0}}{(2i)!}>0. The conclusion for the coefficients α\alpha’s of the expansion of ff is then trivial. ∎

Using Lemma A8, it will not be hard to show that QlQ_{l} is continuous. The non-negativity of T(Ql)T(Q_{l}) can be seen as a consequence of the definition of QlQ_{l} as the covariance of a Gaussian Process. However, we will give a direct proof of it, so that we can state here a general result which we will need later on.

converges uniformly on $.Then,forallfiniteBorelmeasure. Then, for all finite Borel measure\muononK,,T_{\mu}(g(C))isanonnegativedefinitecompactoperator,andinparticularis a non-negative definite compact operator, and in particularg(C)$ is a kernel.

For both Standard and Stable ResNet architectures, for any layer ll, the covariance function QlQ_{l} and the correlation function ClC_{l} are kernels on KK, in the sense of Definition 2.

It is straightforward to prove that Q0Q_{0} is a kernel. Now let us show that if QlQ_{l} is a kernel for some ll, then ClC_{l} is a kernel. Since QlQ_{l} is symmetric and so ClC_{l} is. Moreover, the diagonal elements of QlQ_{l} are continuous by Lemma A3 and do not vanish (since if σb=0\sigma_{b}=0 we are assuming that 0K0\notin K). Hence ClC_{l} is continuous. It is then trivial to show that the non-negative definiteness of T(Ql)T(Q_{l}) implies that T(Cl)T(C_{l}) is non-negative definite, and so ClC_{l} is a kernel if QlQ_{l} is. Now we proceed by induction. Suppose that Ql1Q_{l-1} and Cl1C_{l-1} are kernels and recall the recursion (5), taking the coefficient λ\lambda to be 11 in the case of a Standard ResNet. Notice that it can be rewritten as

where we have omitted the dependence on LL for λ\lambda, we have defined Rl1(x,x)=Ql1(x,x)Ql1(x,x)R_{l-1}(x,x^{\prime})=\sqrt{Q_{l-1}(x,x)Q_{l-1}(x^{\prime},x^{\prime})} and f^\hat{f} is defined in (A2). Clearly Rl1R_{l-1} is a kernel. By Lemma A8 and Lemma A9 we have that f^(Cl)\hat{f}(C_{l}) is a kernel. Using the property that sums and products of kernels are kernels (the sum is trivial, cf Footnote 14 for the product), we conclude that QlQ_{l}, and so ClC_{l}, is a kernel on KK. ∎

A1.4 Proof of Proposition 2

HQl(K)HQl+1(K)\mathcal{H}_{Q_{l}}(K)\subseteq\mathcal{H}_{Q_{l+1}}(K) for all l[0:L1]l\in[0:L-1].

We have already shown that T(Ql)T(Ql1)T(Q_{l})-T(Q_{l-1}) is non-negative definite in the proof of Lemma A10. We conclude by using the RKHS hierarchy result (see for instance or page 354 in ). ∎

A1.5 Proof of Lemma 3

A Gaussian Process on KK is said to be expressive on L2(K)L^{2}(K) if, denoted by ψ\psi a random realisation, for all φL2(K)\varphi\in L^{2}(K), for all ε>0\varepsilon>0,

A universal kernel QQ on KK induces an expressive GP on L2(K)L^{2}(K).

For k[0:N]k\in[0:N], we can define the interval Ik=[akμkε2(N+1)μk,akμk+ε2(N+1)μk]I_{k}=\left[\tfrac{a_{k}}{\sqrt{\mu_{k}}}-\tfrac{\varepsilon}{\sqrt{2(N+1)\mu_{k}}},\tfrac{a_{k}}{\sqrt{\mu_{k}}}+\tfrac{\varepsilon}{\sqrt{2(N+1)\mu_{k}}}\right], so that, for all zIkz\in I_{k} we have (zμkak)2ε22(N+1)(z\sqrt{\mu_{k}}-a_{k})^{2}\leq\tfrac{\varepsilon^{2}}{2(N+1)}. Since all these intervals are non empty, we get

By Mercer’s theorem , T(Q)T(Q) is trace class and hence δN0\delta_{N}\to 0 for diverging NN. By Markov’s inequality

A1.6 Proof of Proposition 3

In order to prove Proposition 3 we first need a preliminary result, which will be at the core of the proof of Theorem 1 as well.

where the coefficients ωk,n\omega_{k,n}’s are all strictly positive, explicitly ωk,n=ζk(nk)\omega_{k,n}=\zeta^{k}\binom{n}{k}. Expanding the inner product xxx\cdot x^{\prime}, we can express pnp_{n} in the form

If σb>0\sigma_{b}>0, then Q2Q_{2} is universal on KK. From Proposition 2, QLQ_{L} is universal for all L2L\geq 2.

with gg can be written as a finite linear combination of the functions {f^(C0)(x,.)}xK\{\hat{f}(C_{0})(x,.)\}_{x\in K}. This yields

A1.7 Proof of Proposition 4

See the proof of Proposition A7 in Appendix A8. ∎

A1.8 Proof of Proposition 5

Proposition 5 is a well known classical result (see for instance Appendix H in and the references therein. For completeness we give a proof in Appendix A8.

See the proof of Lemma A22 in Appendix A8. ∎

A1.9 Proof of Lemma 4

Therefore, HC(K)\mathcal{H}_{C_{\infty}}(K) is the space of constant functions.

Since f^(x)x\hat{f}(x)\geq x, CLC_{L} is non-decreasing wrt LL and converges to the unique fixed point of f^\hat{f} which is 11. This convergence is uniform in x,xx,x^{\prime}, i.e. limLsupx,xK1CL(x,x)=0\lim_{L\rightarrow\infty}\sup_{x,x^{\prime}\in K}1-C_{L}(x,x^{\prime})=0. Re-writing the recursion yields

where α=σw22\alpha=\frac{\sigma_{w}^{2}}{2}, δl=(1+σb2(1+α)QL1(x,x))1/2(1+σb2(1+α)QL1(x,x))1/2\delta_{l}=\left(1+\frac{\sigma_{b}^{2}}{(1+\alpha)Q_{L-1}(x,x)}\right)^{-1/2}\left(1+\frac{\sigma_{b}^{2}}{(1+\alpha)Q_{L-1}(x,x)}\right)^{-1/2} and ζL=σb2(QL(x,x)QL(x,x))1/2\zeta_{L}=\sigma_{b}^{2}(Q_{L}(x,x)Q_{L}(x^{\prime},x^{\prime}))^{-1/2}. Using Lemma A3, and the boundedness of CLC_{L}, a simple Taylor expansion yields

where the expansion is uniform on x,xKx,x^{\prime}\in K, and f(x)=f^(x)xf(x)=\hat{f}(x)-x, and gL=O(eβL)g_{L}=\mathcal{O}(e^{-\beta L}) for some β>0\beta>0. The previous dynamical system can be decomposed in two parts, a first part without the term O(eβL)\mathcal{O}(e^{-\beta L}) which is the homogeneous system, i.e.​ the system without bias, and the term O(eβL)\mathcal{O}(e^{-\beta L}) which is the contribution of the bias in the dynamical system. Assume σb=0\sigma_{b}=0, then the term gLg_{L} vanishes. Moreover, a Taylor expansion of f^\hat{f} near 1 yields

Therefore, uniformly in x,xKx,x^{\prime}\in K, we have that

Letting γL=1CL\gamma_{L}=1-C_{L}, a simple Taylor expansion leads to

Therefore, γLκL2\gamma_{L}\sim\kappa L^{-2} where κ=4(1+α)2s2α2\kappa=\frac{4(1+\alpha)^{2}}{s^{2}\alpha^{2}}. This equivalence is uniform in x,xKx,x^{\prime}\in K.

It is likely that the rate O(L2)\mathcal{O}(L^{-2}) holds without assuming σb=0\sigma_{b}=0. However, the analysis in this requires unnecessarily complicated details. ∎

A2 Stable ResNet with uniform scaling

We provide the results of existence, uniqueness and regularity of the solution of (8) in Lemma A11. Corollary A1 shows that the differential problem can be restated in the operator space. Eventually we give a proof of Lemma 5, assuring uniform convergence to the continuous limit.

For any x,xx,x^{\prime} in KK, the solution of (8) is unique and well defined for all tt\in. The maps (x,x)qt(x,x)(x,x^{\prime})\mapsto q_{t}(x,x^{\prime}) and (x,x)ct(x,x)(x,x^{\prime})\mapsto c_{t}(x,x^{\prime}) are Lipschitz continuous on K2K^{2} and ctc_{t} takes values in $.Moreover,both. Moreover, bothq_{t}andandc_{t}$ are kernels in the sense of Definition 2.

First notice that from (8) we can find, with few algebraic manipulations, an explicit recurrence relation for the correlation ClC_{l}, defined in (3). For any x,xKx,x^{\prime}\in K we have

We can find a Cauchy problem for the correlation directly from (8) or by noting that Al(x,x)=1σb22L(1Ql(x,x)+1Ql(x,x))+o(1/L)A_{l}(x,x^{\prime})=1-\tfrac{\sigma_{b}^{2}}{2L}\left(\tfrac{1}{Q_{l}(x,x)}+\tfrac{1}{Q_{l}(x^{\prime},x^{\prime})}\right)+o(1/L), for LL\to\infty. With both approaches, we have

where ff is defined in \eqrefdeff\eqref{deff} and

Note that for the diagonal terms qt(x,x)q_{t}(x,x), (8) reduces to q˙t=σb2+σw22qt\dot{q}_{t}={\sigma_{b}}^{2}+\frac{{\sigma_{w}}^{2}}{2}\,q_{t}, whose solution is

HH is Lipschitz continuous in γ\gamma and CC^{\infty} in tt, so there exists τ>0\tau>0 such that the Cauchy problem

has a unique C1C^{1} solution defined for t[0,τ)t\in[0,\tau). Noticing that

we get that for all t1t_{1} such that γ(t1)=1\gamma(t_{1})=1 we have γ˙(t1)0\dot{\gamma}(t_{1})\leq 0, since f(1)=0f(1)=0, and for all t1t_{-1} such that γ(t1)=1\gamma(t_{-1})=-1 we have γ˙(t1)=σb2(Gt(x,x)+At(x,x))+σw22>0\dot{\gamma}(t_{-1})=\sigma_{b}^{2}(\mathcal{G}_{t}(x,x^{\prime})+\mathcal{A}_{t}(x,x^{\prime}))+\tfrac{\sigma_{w}^{2}}{2}>0. As a consequence γ(t)\gamma(t)\in for all t[0,τ)t\in[0,\tau) and we can take τ=\tau=\infty. In particular we get that (A6) has a unique solution tct(z)t\mapsto c_{t}(z), defined for tt\in and bounded in $.Asaconsequence,(8)hasauniqueandwelldefinedsolutionforall. As a consequence, (8) has a unique and well defined solution for allt\geq 0$.

Now notice that zc0(z)z\mapsto c_{0}(z) is Lipschitz on K2K^{2}. let us denote as L0L_{0} a Lipschitz constant for c0c_{0}. Since both Gt\mathcal{G}_{t} and At\mathcal{A}_{t} are C1C^{1}, we can find real constants LGL_{G}, LAL_{A} and MAM_{A} such that for all z,zz,z^{\prime} elements of K2K^{2}

Let LfL_{f} be a Lipschitz constant for ff. Using the fact that ct1|c_{t}|\leq 1, we can write

where L1=σb2(LG+LA)L_{1}=\sigma_{b}^{2}(L_{G}+L_{A}) and L2=σb2MA+σw22LfL_{2}=\sigma_{b}^{2}M_{A}+\frac{\sigma_{w}^{2}}{2}\,L_{f}. Now fix zz and zz^{\prime} and consider Δ(t)=ct(z)ct(z)\Delta(t)=c_{t}(z)-c_{t}(z^{\prime}). We have

So Δ(t)(L1L2(eL2t1)+L0eL2t)zz|\Delta(t)|\leq\left(\frac{L_{1}}{L_{2}}\,\left(e^{L_{2}\,t}-1\right)+L_{0}\,e^{L_{2}\,t}\right)\|z-z^{\prime}\|, meaning that ctc_{t} (and so qtq_{t}) is Lipschitz on L2L^{2}.

Since the mapping (x,x)qt(x,x)(x,x^{\prime})\mapsto q_{t}(x,x^{\prime}) is continuous, it defines a compact integral operator T(qt)T(q_{t}) on L2(K)L^{2}(K) . Since qtq_{t} is real and symmetric under the swap of xx and xx^{\prime}, the operator is self-adjoint. The same holds true for ctc_{t}.

Consider the map (t,z)qt(z)(t,z)\mapsto q_{t}(z), defined on ×K2\times K^{2}, which is continuous wrt zz and C2C^{2} wrt tt, as it can be easily checked. Since K2K^{2} and $arecompactsets,itfollowsthatforanyare compact sets, it follows that for anyt$

Let QlLQ_{l|L} be the covariance kernel of the layer ll in a net of L+1L+1 layers [0:L][0:L], and qtq_{t} be the solution of (8), then

We will show that the relation holds for ctc_{t}, and hence for qtq_{t}. Let HH, defined on ×K2\times K^{2}, be such that c˙t(z)=H(z,t,ct(z))\dot{c}_{t}(z)=H(z,t,c_{t}(z)). Explicitly, with the same notations as in (A6), we have

Since tt and zz takes values on compact sets, by uniform continuity, fixed hh we can write, for h0h\to 0

where At\mathcal{A}_{t} and Gt\mathcal{G}_{t} are defined as in (A6). As a consequence, we can find a constant M1>0M_{1}>0 and an integer L>0L_{\star}>0 such that, for all γ\gamma\in, for all zK2z\in K^{2}, for all LLL\geq L^{\star}

Moreover, there exists a constant M2>0M_{2}>0 such that for all zK2z\in K^{2}, all tt\in and all pairs (γ,γ)2(\gamma,\gamma^{\prime})\in^{2}

Thanks to the two above uniform inequalities, we will now show that, for LLL\geq L_{\star},

At this point, using the fact that Δ0=0\Delta_{0}=0, it is easy to show by induction that

and so (A9) follows. Finally, the uniform convergence of CC to cc implies the one of QQ to qq and so we conclude. ∎

A2.2 Universality of the covariance kernel

We will now prove the results of universality of Theorem 1 and Proposition 6.

Proof of Theorem 1

Fix any finite Borel measure μ\mu on KK, and assume that σb>0\sigma_{b}>0. Given any non-zero φL2(K,μ)\varphi\in L^{2}(K,\mu), there exists a tφ(0,1]t_{\varphi}\in(0,1] such that Tμ(qt)φ,φ>0\langle T_{\mu}(q_{t})\,\varphi,\varphi\rangle>0, for all t(0,tφ)t\in(0,t_{\varphi}).

From Corollary A1, we can expand Tμ(qt)T_{\mu}(q_{t}) around t=0t=0 as

the o(t)o(t) being wrt the operator norm, where we have defined the kernel R0R_{0} via R0(x,x)=σw22(1+ζx2)(1+ζx2)R_{0}(x,x^{\prime})=\tfrac{\sigma_{w}^{2}}{2}\sqrt{(1+\zeta\|x\|^{2})(1+\zeta\|x^{\prime}\|^{2})}. Since Tμ(q0)T_{\mu}(q_{0}) is non-negative, for any φL2(I)\varphi\in L^{2}(I), we have

For any finite Borel measure μ\mu on KK, for any tt\in, the operator Tμ(q˙t)T_{\mu}(\dot{q}_{t}) on L2(K,μ)L^{2}(K,\mu) is non-negative definite. In particular, for all φL2(K,μ)\varphi\in L^{2}(K,\mu) we have

Fix μ\mu and φL2(K,μ)\varphi\in L^{2}(K,\mu). From (8) we can write

By Lemma A11, Tμ(qt)T_{\mu}(q_{t}) is non-negative definite, so we can write

By Lemma A2, it suffices to show that for any finite Borel measure μ\mu on KK, Tμ(qt)T_{\mu}(q_{t}) is strictly positive definite for all t(0,1]t\in(0,1]. Fix any nonzero φL2(K,μ)\varphi\in L^{2}(K,\mu), define the map FF on $bybyF(t)=\langle T_{\mu}(q_{t})\,\varphi,\varphi\rangle.Foranyfixed. For any fixedt\in(0,1],byPropositionA2wecanfind, by Proposition A2 we can finds\in(0,t)suchthatsuch thatF(s)>0.Since. SinceFisnondecreasingbyPropositionA3,wegetthatis non decreasing by Proposition A3, we get thatF_{t}>0.Hence. HenceT_{\mu}(q_{t})$ is strictly positive definite. ∎

Proof of Proposition 6

converges in the operator norm. Then AA is a compact strictly positive definite operator.

both sums converging wrt the operator norm.

The claims for ff have been already proven in Lemma A8. As for gg, the analyticity of ff implies the one of ff^{\prime}, and it is easy to check the convergence on $.Moreover,alltheoddTaylorcoefficientsof. Moreover, all the odd Taylor coefficients off^{\prime}arestricltypositive,astheevencoefficientsofare striclty positive, as the even coefficients offare.Itfollowsthatare. It follows that\beta_{n}>0foralloddfor all oddn$. ∎

The case σb>0\sigma_{b}>0 has been already established in Proposition A2, hence suppose that σb=0\sigma_{b}=0. First recall (A6)

for tt small enough. On the other hand, for φ=0\varphi^{\prime}=0, we have φ=φ\varphi=\varphi^{\prime\prime} and so

for tt small enough. So there is a tφt_{\varphi} such that, for t(0,tφ)t\in(0,t_{\varphi}), Tν(ct)φ,φ>0\langle T_{\nu}(c_{t})\,\varphi,\varphi\rangle>0. It follows immediately that the same property is true for Tν(qt)T_{\nu}(q_{t}). ∎

A3 Stable ResNet with decreasing scaling

Therefore, we can assume without loss of generality that σb=0\sigma_{b}=0. This yields

Letting αl=σw2λl22\alpha_{l}=\frac{\sigma_{w}^{2}\lambda_{l}^{2}}{2} and Cl:=Cl(x,x)C_{l}:=C_{l}(x,x^{\prime}), we have that

Since f^\hat{f} is non decreasing, ClC^{l} is non-decreasing and has a limit C(x,x)1C_{\infty}(x,x^{\prime})\leq 1.

Now let us prove that the convergence of ClC_{l} to CC_{\infty} happens uniformly with a rate klλl2\sum_{k\geq l}\lambda_{l}^{2}. Using the recursive formula of ClC_{l}, and knowing that we have that

Therefore, using the fact that ClCC_{l}\leq C_{\infty}, we have

A3.2 Proof of Corollary 1

A4 Neural Tangent Kernel

Throughout this section, we will consider ResNets with NTK parameterization . This simply means that all the components of the biases and the weights will be initialized as iid standard normal random variables. In order to compensate this change of parameterization, the propagation through the network needs to be slightly modified. Hence (2) will be replaced by

However, it is strightforward to verify that the recurrence (5) for the covariance kernels keeps unchanged. Clearly, the dynamics of a standard ResNet with NTK parameterization can be recovered from (A12) by setting λl+1,L=1\lambda_{l+1,L}=1 for all l,Ll,L.

The Neural Tangent Kernel, introduced by , is defined as

where par\nabla_{\text{par}} denotes the gradient wrt the parameters of the network. The NTK of a Stable ResNet can be evaluated recursively. We will now prove the recurrence formula (9). The following result was proven in Lemma 3 in for the case of a standard ResNet without bias. We extend it to ResNet with bias.

For a Stable ResNet, the NTK can be evaluated recursively, layer by layer, as

We prove the second result by induction. The proof is similar to the one of ResNet in . Let θk=(Wk,Bk)\theta_{k}=(W_{k},B_{k}). For l1l\geq 1 and i[1:Nl+1]i\in[1:N_{l+1}]

We prove the result by induction. Assume the result is true for layers 1,2,...,l1,2,...,l and let us prove it for l+1l+1. Using the induction hypothesis, as N1,N2,...,Nl1N_{1},N_{2},...,N_{l-1}\rightarrow\infty recursively, we have that

where I=σw2NlWl+1ii(ϕ(yli(x))+ϕ(yli(x)))Θl(x,x).I^{\prime}=\frac{\sigma_{w}^{2}}{N_{l}}W_{l+1}^{ii}(\phi^{\prime}(y_{l}^{i}(x))+\phi^{\prime}(y_{l}^{i}(x^{\prime})))\,\Theta_{l}(x,x^{\prime}).

As NlN_{l}\rightarrow\infty, we have that I0I^{\prime}\rightarrow 0. Using the law of large numbers, as NlN_{l}\rightarrow\infty

As a corollary of the above result, using the results in for the ReLU activation function, we can express the recursion more explicitly. We have

where ff is defined in (4) and f:γ1πarccosγf^{\prime}:\gamma\mapsto-\tfrac{1}{\pi}\arccos\gamma is the first derivative of ff. So we can write

We can now easily check that the NTK is a kernel in the sense of Definition 2.

For all layer ll, ΘL\Theta_{L} is a kernel in the sense of definition (2).

It’s clear that Θ0=Q0\Theta_{0}=Q_{0} is a kernel. Now fix any layer ll. We have already proved in Lemma A10 that (1+f(Cl)Cl)Ql\left(1+\tfrac{f(C_{l})}{C_{l}}\right)Q_{l} is a kernel. With a similar argument, noting that 1+f1+f^{\prime} can be expressed as a power series with only non negative coefficients on $,weconcludebyLemmaA9that, we conclude by Lemma A9 that1+f^{\prime}(C_{l})isakernel.Usingtheusualargumentthatsumsandproductofkernelsarekernels,weconcludebyinductionthatis a kernel. Using the usual argument that sums and product of kernels are kernels, we conclude by induction that\Theta_{l}$ is a kernel. ∎

As a final remark, note that from (A1), we have that λl,L2Ψl=Ql+1Ql\lambda_{l,L}^{2}\Psi_{l}=Q_{l+1}-Q_{l}. Hence we can rewrite (A13) as

Since 1+f1+f^{\prime} is non negative on $,itiseasytoshowbyinductionthat, it is easy to show by induction that\Theta_{l}\geq Q_{l},pointwise,forall, point-wise, for alll$. This is done explicitly in the next Lemma, which is a Corollary of Lemma 1 and show the divergence of the NTK for a Standard ResNet.

By Lemma 1, it suffices to show that ΘL(x,x)QL(x,x)\Theta_{L}(x,x)\geq Q_{L}(x,x). Recall (A13), noticing that 1+f01+f^{\prime}\geq 0 on $,,\left(1+\tfrac{f(C_{l})}{C_{l}}\right)Q_{l}\geq 0andthatand that\Theta_{0}=Q_{0}\geq 0,byaneasyinductionwehavethat, by an easy induction we have that\Theta_{l}\geq 0forallfor alll.Asaconsequence,from(A14),wegetthat. As a consequence, from (A14), we get that\Theta_{l+1}-\Theta_{l}\geq Q_{l+1}-Q_{l}.Hence,againwithastraightforwardinductionwehavethat. Hence, again with a straightforward induction we have that\Theta_{l}\geq Q_{l}forallfor alllandthethewholeand the the wholeK^{2}.Inparticular. In particular\Theta_{L}(x,x)\geq Q_{L}(x,x)forallfor allx\in K$. ∎

Consider a ResNet of type (1) without bias, and let α=σw22\alpha=\frac{\sigma_{w}^{2}}{2}. The NTK recursion formula can be written in terms of normalized NTK κl(x,x)=Θl(x,x)/(1+α)l1\kappa^{l}(x,x^{\prime})=\Theta_{l}(x,x^{\prime})/(1+\alpha)^{l-1}

where f^\hat{f} is given by (A2), f^(t)=1π(tarcsint+1t2)+12t\hat{f}(t)=\frac{1}{\pi}(t\,\arcsin{t}+\sqrt{1-t^{2}})+\frac{1}{2}t.

where Ψl1=αQl1(x,x)\Psi_{l-1}=\alpha Q_{l-1}(x,x^{\prime}) and Ψl1=αf^(Cl1)\Psi_{l-1}^{\prime}=\alpha\hat{f}^{\prime}(C_{l-1}). Using the recursive formula for the diagonal elements, we have that Ψl1=α(1+α)l1f^(Cl1(x,x))Q0(x,x)Q0(x,x)\Psi_{l-1}=\alpha(1+\alpha)^{l-1}\hat{f}(C_{l-1}(x,x^{\prime}))\sqrt{Q_{0}(x,x)Q_{0}(x^{\prime},x^{\prime})}. We conclude by dividing both sides by (1+α)l1(1+\alpha)^{l-1}. ∎

Therefore, the NTK can be expressed exclusively in terms of the covariance kernels (Qk)k[0:l1](Q_{k})_{k\in[0:l-1]}, more precisely we have that

It is straightforward that Θl\Theta_{l} converges pointwise to a limiting kernel Θ\Theta_{\infty}. Let us prove that the convergence is uniform over KK. By observing that f1|f^{\prime}|\leq 1, we have that for all x,xKx,x^{\prime}\in K

where κ\kappa is a constant that depends on the compact KK. This proves the uniform convergence with a rate of O(k=l+1λk2)\mathcal{O}\left(\sum_{k=l+1}^{\infty}\lambda_{k}^{2}\right). As a consequence, being a uniform limit of kernels, Θ\Theta_{\infty} is a kernel.

Proceeding as in the proof of Lemma A17, it’s easy to prove by induction that for all ll, ΘlQl\Theta_{l}-Q_{l} is a kernel. In particular,

where \succeq is in the operator sense, that is T(Θl)T(Ql)T(\Theta_{l})-T(Q_{l}) is non-negative definite. This yields

Therefore Θ\Theta_{\infty} inherits the universality of QQ_{\infty} naturally by the RKHS hierarchy . We conclude that Θ\Theta_{\infty} is universal (for both cases). ∎

With the uniform scaling, for arbitrary x,xKx,x^{\prime}\in K, the continuous version of (9) reads

where f:γ1πarccosγf^{\prime}:\gamma\mapsto-\tfrac{1}{\pi}\arccos\gamma is the first derivative of ff, defined in (4).

For any x,xx,x^{\prime} in KK, the solution tΘtt\mapsto\Theta_{t} of (A16) is unique and well defined for all tt\in. Moreover, the map (x,x)Θt(x,x)(x,x^{\prime})\mapsto\Theta_{t}(x,x^{\prime}) is a kernel in the sense of Definition 2 for all tt\in. We have the L2(K)L^{2}(K) convergence of the discrete model to the continuous one:

The existence and the uniqueness are clear, since it is a homogeneous first order Cauchy problem, with continuous coefficients. We can write explicitly the solution as

Hence, T(θt)T(\theta_{t}) is the limit of a sequence of non-negative definite operators and hence it is non-negative definite, so that θt\theta_{t} is a kernel on KK for all tt\in. ∎

Fix t(0,1]t\in(0,1]. The solution of (A16) can be written as θt=qt+rt\theta_{t}=q_{t}+r_{t}, where

Now, let us show that rtr_{t}. First, by Lemma A14 it is easy to check that 1+f1+f^{\prime} is analytic on (1,1)(-1,1) and its Taylor expansion around converges on $.MoreoveralltheTaylorcoefficientsarenonnegative.Hence,LemmaA9showsthat. Moreover all the Taylor coefficients are non negative. Hence, Lemma A9 shows that(1+f^{\prime}(c_{s}))isakernelforallis a kernel for alls\in[0,s).Itfollowsthat. It follows that(1+f^{\prime}(c_{s}))\,\theta_{s}isakernel.Seefootnote14.Now,is a kernel.See footnote 14. Now,(1+f^{\prime}(c_{s})\,\theta_{s}iscontinuousandsymmetriconis continuous and symmetric onZ^{2},anditiseasytocheckfrom(A17)thatitisuniformlyboundedfor, and it is easy to check from (A17) that it is uniformly bounded fors\in[0,t).Itfollowsthat. It follows thatr_{t}iscontinuousandsymmetric.Now,fixanarbitraryfiniteBorelmeasureis continuous and symmetric. Now, fix an arbitrary finite Borel measure\muononK.Wehavetoshowthat. We have to show thatT_{\mu}(r_{t})isnonnegativedefinite,sothatwecanconcludebyLemmaA1.Fixedis non-negative definite, so that we can conclude by Lemma A1. Fixed\varphi\in L^{2}(K,\mu)$, by simple standard arguments we have

and so rtr_{t} is a kernel. Now, given two kernels QQ and RR, it is a classical result that Q+RQ+R is a kernel and its RKHS contains the RKHS of QQ and RR, . We conclude that the RKHS of θt\theta_{t} contains the RKHS of qtq_{t}. Since qtq_{t} is universal, θt\theta_{t} is universal. ∎

A5 A PAC-Bayes Generalization result

Assuming that the samples are distributed as (x,y)ν(x,y)\sim\nu where ν\nu is a probability distribution on X×YX\times Y, we define the generalization (true) loss by

For some randomized learning algorithm A\mathcal{A}, the empirical and generalization loss are given by

The PAC-Bayes theorem gives a probabilistic upper bound on the generalization loss r(A)r(\mathcal{A}) of a randomized learning algorithm A\mathcal{A} in terms of the empirical loss rS(A)r_{S}(\mathcal{A}). Fix a prior distribution P\mathcal{P} on the hypothesis set H\mathcal{H}. The Kullback-Leibler divergence between A\mathcal{A} and P\mathcal{P} is defined as KL(AP)=logA(h)P(h)A(h)dh[0,]KL(\mathcal{A}\|\mathcal{P})=\int\log\frac{\mathcal{A}(h)}{P(h)}\mathcal{A}(h)dh\in[0,\infty]. The Bernoulli KL-divergence is given by kl(ap)=alogap+(1a)log1a1pkl(a||p)=a\log\frac{a}{p}+(1-a)\log\frac{1-a}{1-p} for a,pa,p\in. We define the inverse Bernoulli KL-divergence kl1kl^{-1} by

The PAC-Bayesian theorem gives can also be stated as

The KL-divergence term KL(AP)KL(\mathcal{A}\|P) plays a major role as it controls the generalization gap, i.e. the difference (in terms of Bernoulli KL-divergence) between the empirical loss and the generalization loss. In our setting, we consider an ordinary GP regression with prior P(f)=GP(f0,Q(x,x))P(f)=\mathcal{GP}(f|0,Q(x,x^{\prime})). Under the standard assumption that the outputs yN=(yi)i[1:N]y_{N}=(y_{i})_{i\in[1:N]} are noisy versions of fN=(f(xi))i[1:N]f_{N}=(f(x_{i}))_{i\in[1:N]} with yNfNN(yNfN,σ2I)y_{N}|f_{N}\sim\mathcal{N}(y_{N}|f_{N},\sigma^{2}I), the Bayesian posterior A\mathcal{A} is also a GP and is given by

where QN(x)=(Q(x,xi))i[1:N]Q_{N}(x)=(Q(x,x_{i}))_{i\in[1:N]} and QNN=(Q(xi,xj))1i,jNQ_{NN}=(Q(x_{i},x_{j}))_{1\leq i,j\leq N}. In this setting, we have the following result

Let QLQ_{L} be the kernel of a ResNet. Let PLP_{L} be a GP with kernel QLQ_{L} and AL\mathcal{A}_{L} be the corresponding Bayesian posterior for some fixed noise level σ>0\sigma>0. Then, in a fixed setting (fixed sample size N), the following results hold:

The proof relies on the simple observation that PL(ffN)=AL(ffN)P_{L}(f|f_{N})=\mathcal{A}_{L}(f|f_{N}). This yields

where QL,NN=(QL(xi,xj))1i,jNQ_{L,NN}=(Q_{L}(x_{i},x_{j}))_{1\leq i,j\leq N}.

Since QL,NNQ_{L,NN} is symmetric and strictly positive definite, it is straightforward that the largest eigenvalue of QL,NN(QL,NN+σ2I)1)Q_{L,NN}(Q_{L,NN}+\sigma^{2}I)^{-1}) is smaller than 11. This yields

where the last inequality holds for sufficiently large LL.

Case 2. In the case of Stable ResNet, we know that as LL\rightarrow\infty, the kernel QLQ_{L} converges to a strictly positive definite kernel QQ_{\infty}, therefore the first term log(det(QL,NN+σ2I))\log(\det(Q_{L,NN}+\sigma^{2}I)) remains bounded as LL\rightarrow\infty, which concludes the proof. ∎

A6 NNGP correlation kernel without bias as a modified NNGP kernel

Unscaled ResNets suffer from the exploding variance problem, which needs to be avoided in order to isolate the disadvantages of inexpressivity in their NNGP kernel. In order to do so, we use the NNGP correlation kernel CC instead of NNGP covariance kernel QQ, noting that Lemma A4 provides a simple recursion formula for CC if σb=0\sigma_{b}=0, at depth lLl\leq L:

where αl,L=λl,L2σw22\alpha_{l,L}=\frac{\lambda_{l,L}^{2}\sigma_{w}^{2}}{2} and f^\hat{f} defined in (A2). In order to combine this with open-source packages designed for NNGP calculation, we note that (A20) can be viewed as the NNGP kernel of the following modified ResNet layer, using the same notation as in (2):

with α^l,L=αl,L1+αl,L\hat{\alpha}_{l,L}=\frac{\alpha_{l,L}}{1+\alpha_{l,L}}

A7 Experimental details and additional results

For our Vanilla ResNet NNGP results, we preprocess all training, validation and test data by first centering the training set and then normalizing all images to lie on the pixel dimension sphere. For our Wide ResNet NNGP results we normalise all data so that the training set is centered and has channel-wise unit variance. We use Kaiming initialisation throughout, with σw2=2\sigma_{w}^{2}=2 and σb2=0\sigma_{b}^{2}=0. Vanilla ResNets have the same structure as type (2) in Table 2 and we use the same WRN kernel architecture as in Table 1 but omit the final average pooling step, which is known to improve kernel performance but dramatically increase computational costs . Throughout this work, where there are residual blocks with multiple layers, we calculate our scaling factors for uniform and decreasing scaled Stable ResNets by the number of residual connections. For example, a WRN-202 has only 99 residual connections, so we set λl,L1=99\lambda_{l,L}^{-1}=\sqrt{99} for the uniform scaling factors. We tune the noise variance σ2\sigma^{2}, which is akin to the regularisation parameter in kernel ridge regression. To do so, we compute validation accuracy on a validation set of size 5000, selecting the best σ2=λ×Trace(QNN)/N\sigma^{2}=\lambda\times\text{Trace}(Q_{NN})/N from a logarithmic scale of λ=[0.001,0.01,0.1]\lambda=[0.001,0.01,0.1], where NN is the training set size and QNNQ_{NN} is the N×NN\times N training set Gram matrix for NNGP QQ.

A7.2 Trained ResNet results

For all our trained ResNet experiments we use a similar setup to the open-source code for in PyTorch . We repeat each experiment 3 times and report the best test accuracy and error intervals. All ResNets are initialised with Kaiming initialisation and like we adopt ResNets architectures where we double the number of filters in each convolutional layer. For experiments with BatchNorm, on CIFAR-10/100 we use batch size 64 across all depths and on TinyImageNet we used batch size 128 for depths 32 & 50, and batch size 100 for depth 104 in order to allow the model to fit onto a single 11GB VRAM GPU. We use SGD with momentum parameter 0.90.9 and weight decay parameter 10410^{-4} throughout.

We also present results for ResNets trained without BatchNorm . BatchNorm is a normalization layer commonly used with modern ResNets that is known to improve performance and allows deeper ResNets to be trained, though the precise reasons for this are not well understood. Several recent works have studied the possibility of removing the need for BatchNorm layers, by introducing trainable uniform scalings to the residual connection to stabilise variance at initialisation & gradients, demonstrating promising results. Note, our work additionally introduces decreasing scaling and also uses the infinite-width NNGP/NTK connection to assess the theoretical advantages of scaled Stable ResNets in the limit of infinite depth.

Moreover, our focus is not towards the possibility of removing BatchNorm and we show in Table 3 that our scalings can improve BatchNorm ResNets. However, we also present results without BatchNorm in Table 4, where again we see that our scaled stable ResNets improve performance compared to their unscaled counterparts: for example both Decreasing and Uniform scaling outperform the unscaled ResNet by over 3% test accuracy on CIFAR-100 with ResNet-104.

For ResNets trained without BatchNorm, for a fair comparison we tuned the initial learning rate on a small logarithmic scale, using batch size 128.

It can be easily deduced from lemma A8 that there exist {bi}i0\{b_{i}\}_{i\geq 0} such that

where b0,b1,b2i>0b_{0},b_{1},b_{2i}>0. Following the same approach, we have that

Having the terms of orders 0 and 1 in C1(x,x)C_{1}(x,x^{\prime}) ensures having a positive coefficient for all terms ziz^{i} for i1i\geq 1, which concludes the proof. ∎

The previous result can be easily extended to general L2L\geq 2. We have that

Moreover, (αL+1,i)i0(\alpha_{L+1,i})_{i\geq 0} can be expressed in terms of (αL,i)i0(\alpha_{L,i})_{i\geq 0}

where βL=QL(x,x)=QL(x,x)=i0αL,i\beta_{L}=Q_{L}(x,x)=Q_{L}(x^{\prime},x^{\prime})=\sum_{i\geq 0}\alpha_{L,i} and (am)m0(a_{m})_{m\geq 0} is such that a0,a1>0a_{0},a_{1}>0 and a2i>0a_{2i}>0 and a2i+1=0a_{2i+1}=0 for all i1i\geq 1. As a result, for all L2,i0,αL,i>0L\geq 2,i\geq 0,\alpha_{L,i}>0.

Knowing that Cl(x,y)=1βlQl(x,y)C_{l}(x,y)=\frac{1}{\beta_{l}}Q_{l}(x,y), we have that

which gives the recursive formulas for the coefficients of the analytic decomposition. Observe that the coefficients are non-decreasing wrt LL. Using lemma A21 we conclude that αL,i>0\alpha_{L,i}>0. ∎

For depth L2L\geq 2, proposition A5 shows that all coefficient (αL,i)i0(\alpha_{L,i})_{i\geq 0} are (strictly) positive. It turns out that this is a sufficient condition for the kernel QLQ_{L} to be strictly positive definite. We state this in the next proposition. The result can be seen as a consequence of Lemma A12 and Lemma A13. However we will give here a more direct proof.

for all k0k\geq 0. We conclude that φ=0\varphi=0. ∎

We start by giving a brief review of the theory of Spherical Harmonics (). For some k1k\geq 1, let (Yk,j)1jN(d,k)(Y_{k,j})_{1\leq j\leq N(d,k)} be the set of Spherical Harmonics of degree kk. We have N(d,k)=2k+d2k(k+d3d2)N(d,k)=\frac{2k+d-2}{k}{k+d-3\choose d-2}.

For some function pp, the Hecke-Funk formula reads

(Pkd)k0(P^{d}_{k})_{k\geq 0} form an orthogonal basis of L2(,(1t2)d32dt)L^{2}(,(1-t^{2})^{\frac{d-3}{2}}dt), i.e.

where δij\delta_{ij} is the Kronecker symbol.

where μk=Ωd1Ωd11p(t)Pkd(t)(1t2)(d3)/2dt\mu_{k}=\frac{\Omega_{d-1}}{\Omega_{d}}\int_{-1}^{1}p(t)P^{d}_{k}(t)(1-t^{2})^{(d-3)/2}dt. We also have that μk0\mu_{k}\geq 0 since QQ is non-negative by definition. The last statement, follows from the spectral theory of compact self-adjoint operators and the orthonormality of the spherical harmonics (see the appendix of for details). ∎

Corollary A3 shows that for any depth LL, the Spherical Harmonics are the eigenfunctions of the kernel QLQ_{L}. The fact that μL,k>0\mu_{L,k}>0 is a direct result of Proposition A6. Leveraging this result, we can prove a stronger result, which is the universality of the kernel QLQ_{L}.

References