Neural Tangent Kernel: Convergence and Generalization in Neural Networks

Arthur Jacot, Franck Gabriel, Clément Hongler

Introduction

Artificial neural networks (ANNs) have achieved impressive results in numerous areas of machine learning. While it has long been known that ANNs can approximate any function with sufficiently many hidden neurons (11; 14), it is not known what the optimization of ANNs converges to. Indeed the loss surface of neural networks optimization problems is highly non-convex: it has a high number of saddle points which may slow down the convergence (5). A number of results (3; 17; 18) suggest that for wide enough networks, there are very few “bad” local minima, i.e. local minima with much higher cost than the global minimum. More recently, the investigation of the geometry of the loss landscape at initialization has been the subject of a precise study (12). The analysis of the dynamics of training in the large-width limit for shallow networks has seen recent progress as well (15). To the best of the authors knowledge, the dynamics of deep networks has however remained an open problem until the present paper: see the contributions section below.

A particularly mysterious feature of ANNs is their good generalization properties in spite of their usual over-parametrization (20). It seems paradoxical that a reasonably large neural network can fit random labels, while still obtaining good test accuracy when trained on real data (23). It can be noted that in this case, kernel methods have the same properties (1).

In the infinite-width limit, ANNs have a Gaussian distribution described by a kernel (16; 4; 7; 13; 6). These kernels are used in Bayesian inference or Support Vector Machines, yielding results comparable to ANNs trained with gradient descent (2; 13). We will see that in the same limit, the behavior of ANNs during training is described by a related kernel, which we call the neural tangent network (NTK).

We study the network function fθf_{\theta} of an ANN, which maps an input vector to an output vector, where θ\theta is the vector of the parameters of the ANN. In the limit as the widths of the hidden layers tend to infinity, the network function at initialization, fθf_{\theta} converges to a Gaussian distribution (16; 4; 7; 13; 6).

In this paper, we investigate fully connected networks in this infinite-width limit, and describe the dynamics of the network function fθf_{\theta} during training:

During gradient descent, we show that the dynamics of fθf_{\theta} follows that of the so-called kernel gradient descent in function space with respect to a limiting kernel, which only depends on the depth of the network, the choice of nonlinearity and the initialization variance.

The convergence properties of ANNs during training can then be related to the positive-definiteness of the infinite-width limit NTK. In the case when the dataset is supported on a sphere, we prove this positive-definiteness using recent results on dual activation functions (4). The values of the network function fθf_{\theta} outside the training set is described by the NTK, which is crucial to understand how ANN generalize.

For a least-squares regression loss, the network function fθf_{\theta} follows a linear differential equation in the infinite-width limit, and the eigenfunctions of the Jacobian are the kernel principal components of the input data. This shows a direct connection to kernel methods and motivates the use of early stopping to reduce overfitting in the training of ANNs.

Finally we investigate these theoretical results numerically for an artificial dataset (of points on the unit circle) and for the MNIST dataset. In particular we observe that the behavior of wide ANNs is close to the theoretical limit.

Neural networks

In this paper, we assume that the input distribution pinp^{in} is the empirical distribution on a finite dataset x1,...,xNx_{1},...,x_{N}, i.e the sum of Dirac measures 1Ni=0Nδxi\frac{1}{N}\sum_{i=0}^{N}\delta_{x_{i}}.

where the nonlinearity σ\sigma is applied entrywise. The scalar β>0\beta>0 is a parameter which allows us to tune the influence of the bias on the training.

Kernel gradient

The kernel KK is positive definite with respect to pin||\cdot||_{p^{in}} if fpin>0    fK>0||f||_{p^{in}}>0\implies||f||_{K}>0.

The kernel gradient KCf0F\nabla_{K}C|_{f_{0}}\in\mathcal{F} is defined as ΦK(finCf0)\Phi_{K}\left(\partial_{f}^{in}C|_{f_{0}}\right). In contrast to finC\partial_{f}^{in}C which is only defined on the dataset, the kernel gradient generalizes to values xx outside the dataset thanks to the kernel KK:

A time-dependent function f(t)f(t) follows the kernel gradient descent with respect to KK if it satisfies the differential equation

During kernel gradient descent, the cost C(f(t))C(f(t)) evolves as

Convergence to a critical point of CC is hence guaranteed if the kernel KK is positive definite with respect to pin||\cdot||_{p^{in}}: the cost is then strictly decreasing except at points such that df(t)pin=0||d|_{f(t)}||_{p^{in}}=0. If the cost is convex and bounded from below, the function f(t)f(t) therefore converges to a global minimum as tt\to\infty.

As a starting point to understand the convergence of ANN gradient descent to kernel gradient descent in the infinite-width limit, we introduce a simple model, inspired by the approach of (19).

A kernel KK can be approximated by a choice of PP random functions f(p)f^{(p)} sampled independently from any distribution on F\mathcal{F} whose (non-centered) covariance is given by the kernel KK:

The partial derivatives of the parametrization are given by

Optimizing the cost CFlinC\circ F^{lin} through gradient descent, the parameters follow the ODE:

As a result the function fθ(t)linf_{\theta(t)}^{lin} evolves according to

Neural tangent kernel

For ANNs trained using gradient descent on the composition CF(L)C\circ F^{(L)}, the situation is very similar to that studied in the Section 3.1. During training, the network function fθf_{\theta} evolves along the (negative) kernel gradient

with respect to the neural tangent kernel (NTK)

However, in contrast to FlinF^{lin}, the realization function F(L)F^{(L)} of ANNs is not linear. As a consequence, the derivatives θpF(L)(θ)\partial_{\theta_{p}}F^{(L)}(\theta) and the neural tangent kernel depend on the parameters θ\theta. The NTK is therefore random at initialization and varies during training, which makes the analysis of the convergence of fθf_{\theta} more delicate.

In the next subsections, we show that, in the infinite-width limit, the NTK becomes deterministic at initialization and stays constant during training. Since fθf_{\theta} at initialization is Gaussian in the limit, the asymptotic behavior of fθf_{\theta} during training can be explicited in the function space F\mathcal{F}.

As observed in (16; 4; 7; 13; 6), the output functions fθ,if_{\theta,i} for i=1,...,nLi=1,...,n_{L} tend to iid Gaussian processes in the infinite-width limit (a proof in our setup is given in the appendix):

For a network of depth LL at initialization, with a Lipschitz nonlinearity σ\sigma, and in the limit as n1,...,nL1n_{1},...,n_{L-1}\to\infty, the output functions fθ,kf_{\theta,k}, for k=1,...,nLk=1,...,n_{L}, tend (in law) to iid centered Gaussian processes of covariance Σ(L)\Sigma^{(L)}, where Σ(L)\Sigma^{(L)} is defined recursively by:

taking the expectation with respect to a centered Gaussian process ff of covariance Σ(L)\Sigma^{(L)}.

Strictly speaking, the existence of a suitable Gaussian measure with covariance Σ(L)\Sigma^{(L)} is not needed: we only deal with the values of ff at x,xx,x^{\prime} (the joint measure on f(x),f(x)f(x),f(x^{\prime}) is simply a Gaussian vector in 2D). For the same reasons, in the proof of Proposition 1 and Theorem 1, we will freely speak of Gaussian processes without discussing their existence.

The first key result of our paper (proven in the appendix) is the following: in the same limit, the Neural Tangent Kernel (NTK) converges in probability to an explicit deterministic limit.

For a network of depth LL at initialization, with a Lipschitz nonlinearity σ\sigma, and in the limit as the layers width n1,...,nL1n_{1},...,n_{L-1}\to\infty, the NTK Θ(L)\Theta^{(L)} converges in probability to a deterministic limiting kernel:

taking the expectation with respect to a centered Gaussian process ff of covariance Σ(L)\Sigma^{(L)}, and where σ˙\dot{\sigma} denotes the derivative of σ\sigma.

By Rademacher’s theorem, σ˙\dot{\sigma} is defined everywhere, except perhaps on a set of zero Lebesgue measure.

Note that the limiting Θ(L)\Theta^{(L)}_{\infty} only depends on the choice of σ\sigma, the depth of the network and the variance of the parameters at initialization (which is equal to 11 in our setting).

2 Training

Our second key result is that the NTK stays asymptotically constant during training. This applies for a slightly more general definition of training: the parameters are updated according to a training direction dtFd_{t}\in\mathcal{F}:

In the case of gradient descent, dt=dfθ(t)d_{t}=-d|_{f_{\theta(t)}} (see Section 3), but the direction may depend on another network, as is the case for e.g. Generative Adversarial Networks (10). We only assume that the integral 0Tdtpindt\int_{0}^{T}\|d_{t}\|_{p^{in}}dt stays stochastically bounded as the width tends to infinity, which is verified for e.g. least-squares regression, see Section 5.

Assume that σ\sigma is a Lipschitz, twice differentiable nonlinearity function, with bounded second derivative. For any TT such that the integral 0Tdtpindt\int_{0}^{T}\|d_{t}\|_{p^{in}}dt stays stochastically bounded, as n1,...,nL1n_{1},...,n_{L-1}\to\infty, we have, uniformly for t[0,T]t\in[0,T],

As a consequence, in this limit, the dynamics of fθf_{\theta} is described by the differential equation

As the proof of the theorem (in the appendix) shows, the variation during training of the individual activations in the hidden layers shrinks as their width grows. However their collective variation is significant, which allows the parameters of the lower layers to learn: in the formula of the limiting NTK Θ(L+1)(x,x)\Theta^{(L+1)}_{\infty}(x,x^{\prime}) in Theorem 1, the second summand Σ(L+1)\Sigma^{(L+1)} represents the learning due to the last layer, while the first summand represents the learning performed by the lower layers.

As discussed in Section 3, the convergence of kernel gradient descent to a critical point of the cost CC is guaranteed for positive definite kernels. The limiting NTK is positive definite if the span of the derivatives θpF(L)\partial_{\theta_{p}}F^{(L)}, p=1,...,Pp=1,...,P becomes dense in F\mathcal{F} w.r.t. the pinp^{in}-norm as the width grows to infinity. It seems natural to postulate that the span of the preactivations of the last layer (which themselves appear in θpF(L)\partial_{\theta_{p}}F^{(L)}, corresponding to the connection weights of the last layer) becomes dense in F\mathcal{F}, for a large family of measures pinp^{in} and nonlinearities (see e.g. (11; 14) for classical theorems about ANNs and approximation). In the case when the dataset is supported on a sphere, the positive-definiteness of the limiting NTK can be shown using Gaussian integration techniques and existing positive-definiteness criteria, as given by the following proposition, proven in Appendix A.4:

Least-squares regression

Given a goal function ff^{*} and input distribution pinp^{in}, the least-squares regression cost is

Theorems 1 and 2 apply to an ANN trained on such a cost. Indeed the norm of the training direction d(f)pin=ffpin\|d(f)\|_{p^{in}}=\|f^{*}-f\|_{p^{in}} is strictly decreasing during training, bounding the integral. We are therefore interested in the behavior of a function ftf_{t} during kernel gradient descent with a kernel KK (we are of course especially interested in the case K=Θ(L)IdnLK=\Theta^{(L)}_{\infty}\otimes Id_{n_{L}}):

The solution of this differential equation can be expressed in terms of the map Π:fΦK(<f,>pin)\Pi:f\mapsto\Phi_{K}\left(\left<f,\cdot\right>_{p^{in}}\right):

where etΠ=k=0(t)kk!Πke^{-t\Pi}=\sum_{k=0}^{\infty}\frac{(-t)^{k}}{k!}\Pi^{k} is the exponential of tΠ-t\Pi. If Π\Pi can be diagonalized by eigenfunctions f(i)f^{(i)} with eigenvalues λi\lambda_{i}, the exponential etΠe^{-t\Pi} has the same eigenfunctions with eigenvalues etλie^{-t\lambda_{i}}.

For a finite dataset x1,...,xNx_{1},...,x_{N} of size NN, the map Π\Pi takes the form

The map Π\Pi has at most NnLNn_{L} positive eigenfunctions, and they are the kernel principal components f(1),...,f(NnL)f^{(1)},...,f^{(Nn_{L})} of the data with respect to to the kernel KK (21; 22). The corresponding eigenvalues λi\lambda_{i} is the variance captured by the component.

Decomposing the difference (ff0)=Δf0+Δf1+...+ΔfNnL(f^{*}-f_{0})=\Delta^{0}_{f}+\Delta^{1}_{f}+...+\Delta^{Nn_{L}}_{f} along the eigenspaces of Π\Pi, the trajectory of the function ftf_{t} reads

where Δf0\Delta^{0}_{f} is in the kernel (null-space) of Π\Pi and Δfif(i)\Delta^{i}_{f}\propto f^{(i)}.

The above decomposition can be seen as a motivation for the use of early stopping. The convergence is indeed faster along the eigenspaces corresponding to larger eigenvalues λi\lambda_{i}. Early stopping hence focuses the convergence on the most relevant kernel principal components, while avoiding to fit the ones in eigenspaces with lower eigenvalues (such directions are typically the ‘noisier’ ones: for instance, in the case of the RBF kernel, lower eigenvalues correspond to high frequency functions).

with the NnlNn_{l}-vectors κx,k\kappa_{x,k}, yy^{*} and y0y_{0} given by

The first term, the mean, has an important statistical interpretation: it is the maximum-a-posteriori (MAP) estimate given a Gaussian prior on functions fkN(0,Θ(L))f_{k}\sim\mathcal{N}(0,\Theta^{(L)}_{\infty}) and the conditions fk(xi)=fk(xi)f_{k}(x_{i})=f^{*}_{k}(x_{i}) . Equivalently, it is equal to the kernel ridge regression (22) as the regularization goes to zero (λ0\lambda\to 0). The second term is a centered Gaussian whose variance vanishes on the points of the dataset.

Numerical experiments

In the following numerical experiments, fully connected ANNs of various widths are compared to the theoretical infinite-width limit. We choose the size of the hidden layers to all be equal to the same value n:=n1=...=nL1n:=n_{1}=...=n_{L-1} and we take the ReLU nonlinearity σ(x)=max(0,x)\sigma(x)=\max(0,x).

In the first two experiments, we consider the case n0=2n_{0}=2. Moreover, the input elements are taken on the unit circle. This can be motivated by the structure of high-dimensional data, where the centered data points often have roughly the same norm The classical example is for data following a Gaussian distribution N(0,Idn0)\mathcal{N}(0,Id_{n_{0}}): as the dimension n0n_{0} grows, all data points have approximately the same norm n0\sqrt{n_{0}}..

In all experiments, we took nL=1n_{L}=1 (note that by our results, a network with nLn_{L} outputs behaves asymptotically like nLn_{L} networks with scalar outputs trained independently). Finally, the value of the parameter β\beta is chosen as 0.10.1, see Remark 1.

The first experiment illustrates the convergence of the NTK Θ(L)\Theta^{(L)} of a network of depth L=4L=4 for two different widths n=500,10000n=500,10000. The function Θ(4)(x0,x)\Theta^{(4)}(x_{0},x) is plotted for a fixed x0=(1,0)x_{0}=(1,0) and x=(cos(γ),sin(γ))x=(cos(\gamma),sin(\gamma)) on the unit circle in Figure 2. To observe the distribution of the NTK, 1010 independent initializations are performed for both widths. The kernels are plotted at initialization t=0t=0 and then after 200200 steps of gradient descent with learning rate 1.01.0 (i.e. at t=200t=200). We approximate the function f(x)=x1x2f^{*}(x)=x_{1}x_{2} with a least-squares cost on random N(0,1)\mathcal{N}(0,1) inputs.

For the wider network, the NTK shows less variance and is smoother. It is interesting to note that the expectation of the NTK is very close for both networks widths. After 200200 steps of training, we observe that the NTK tends to “inflate”. As expected, this effect is much less apparent for the wider network (n=10000n=10000) where the NTK stays almost fixed, than for the smaller network (n=500n=500).

2 Kernel regression

For a regression cost, the infinite-width limit network function fθ(t)f_{\theta(t)} has a Gaussian distribution for all times tt and in particular at convergence tt\to\infty (see Section 5). We compared the theoretical Gaussian distribution at tt\to\infty to the distribution of the network function fθ(T)f_{\theta(T)} of a finite-width network for a large time T=1000T=1000. For two different widths n=50,1000n=50,1000 and for 1010 random initializations each, a network is trained on a least-squares cost on 44 points of the unit circle for 10001000 steps with learning rate 1.01.0 and then plotted in Figure 2.

We also approximated the kernels Θ(4)\Theta_{\infty}^{(4)} and Σ(4)\Sigma^{(4)} using a large-width network (n=10000n=10000) and used them to calculate and plot the 10th, 50th and 90-th percentiles of the tt\to\infty limiting Gaussian distribution.

The distributions of the network functions are very similar for both widths: their mean and variance appear to be close to those of the limiting distribution tt\to\infty. Even for relatively small widths (n=50n=50), the NTK gives a good indication of the distribution of fθ(t)f_{\theta(t)} as tt\to\infty.

3 Convergence along a principal component

We now illustrate our result on the MNIST dataset of handwritten digits made up of grayscale images of dimension 28×2828\times 28, yielding a dimension of n0=784n_{0}=784.

We computed the first 3 principal components of a batch of N=512N=512 digits with respect to the NTK of a high-width network n=10000n=10000 (giving an approximation of the limiting kernel) using a power iteration method. The respective eigenvalues are λ1=0.0457\lambda_{1}=0.0457, λ2=0.00108\lambda_{2}=0.00108 and λ3=0.00078\lambda_{3}=0.00078. The kernel PCA is non-centered, the first component is therefore almost equal to the constant function, which explains the large gap between the first and second eigenvaluesIt can be observed numerically, that if we choose β=1.0\beta=1.0 instead of our recommended 0.10.1, the gap between the first and the second principal component is about ten times bigger, which makes training more difficult.. The next two components are much more interesting as can be seen in Figure 3(a), where the batch is plotted with xx and yy coordinates corresponding to the 2nd and 3rd components.

We have seen in Section 5 how the convergence of kernel gradient descent follows the kernel principal components. If the difference at initialization f0ff_{0}-f^{*} is equal (or proportional) to one of the principal components f(i)f^{(i)}, then the function will converge along a straight line (in the function space) to ff^{*} at an exponential rate eλite^{-\lambda_{i}t}.

We tested whether ANNs of various widths n=100,1000,10000n=100,1000,10000 behave in a similar manner. We set the goal of the regression cost to f=fθ(0)+0.5f(2)f^{*}=f_{\theta(0)}+0.5f^{(2)} and let the network converge. At each time step tt, we decomposed the difference fθ(t)ff_{\theta(t)}-f^{*} into a component gtg_{t} proportional to f(2)f^{(2)} and another one hth_{t} orthogonal to f(2)f^{(2)}. In the infinite-width limit, the first component decays exponentially fast gtpin=0.5eλ2t||g_{t}||_{p^{in}}=0.5e^{-\lambda_{2}t} while the second is null (ht=0h_{t}=0), as the function converges along a straight line.

As expected, we see in Figure 3(b) that the wider the network, the less it deviates from the straight line (for each width nn we performed two independent trials). As the width grows, the trajectory along the 2nd principal component (shown in Figure 3(c)) converges to the theoretical limit shown in blue.

A surprising observation is that smaller networks appear to converge faster than wider ones. This may be explained by the inflation of the NTK observed in our first experiment. Indeed, multiplying the NTK by a factor aa is equivalent to multiplying the learning rate by the same factor. However, note that since the NTK of large-width network is more stable during training, larger learning rates can in principle be taken. One must hence be careful when comparing the convergence speed in terms of the number of steps (rather than in terms of the time tt): both the inflation effect and the learning rate must be taken into account.

Conclusion

This paper introduces a new tool to study ANNs, the Neural Tangent Kernel (NTK), which describes the local dynamics of an ANN during gradient descent. This leads to a new connection between ANN training and kernel methods: in the infinite-width limit, an ANN can be described in the function space directly by the limit of the NTK, an explicit constant kernel Θ(L)\Theta^{(L)}_{\infty}, which only depends on its depth, nonlinearity and parameter initialization variance. More precisely, in this limit, ANN gradient descent is shown to be equivalent to a kernel gradient descent with respect to Θ(L)\Theta^{(L)}_{\infty}. The limit of the NTK is hence a powerful tool to understand the generalization properties of ANNs, and it allows one to study the influence of the depth and nonlinearity on the learning abilities of the network. The analysis of training using NTK allows one to relate convergence of ANN training with the positive-definiteness of the limiting NTK and leads to a characterization of the directions favored by early stopping methods.

Acknowledgements

The authors thank K. Kytölä for many interesting discussions. The second author was supported by the ERC CG CRITICAL. The last author acknowledges support from the ERC SG Constamis, the NCCR SwissMAP, the Blavatnik Family Foundation and the Latsis Foundation.

References

Appendix A Appendix

This appendix is dedicated to proving the key results of this paper, namely Proposition 1 and Theorems 1 and 2, which describe the asymptotics of neural networks at initialization and during training.

We study the limit of the NTK as n1,...,nL1n_{1},...,n_{L-1}\to\infty sequentially, i.e. we first take n1n_{1}\to\infty, then n2n_{2}\to\infty, etc. This leads to much simpler proofs, but our results could in principle be strengthened to the more general setting when min(n1,...,nL1)\min(n_{1},...,n_{L-1})\to\infty.

A natural choice of convergence to study the NTK is with respect to the operator norm on kernels:

where the expectation is taken over two independent x,xpinx,x^{\prime}\sim p^{in}. This norm depends on the input distribution pinp^{in}. In our setting, pinp^{in} is taken to be the empirical measure of a finite dataset of distinct samples x1,...,xNx_{1},...,x_{N}. As a result, the operator norm of KK is equal to the leading eigenvalue of the NnL×NnLNn_{L}\times Nn_{L} Gram matrix (Kkk(xi,xj))k,k<nL,i,j<N\left(K_{kk^{\prime}}(x_{i},x_{j})\right)_{k,k^{\prime}<n_{L},i,j<N}. In our setting, convergence in operator norm is hence equivalent to pointwise convergence of KK on the dataset.

It has already been observed that the output functions fθ,if_{\theta,i} for i=1,...,nLi=1,...,n_{L} tend to iid Gaussian processes in the infinite-width limit.

For a network of depth LL at initialization, with a Lipschitz nonlinearity σ\sigma, and in the limit as n1,...,nL1n_{1},...,n_{L-1}\to\infty sequentially, the output functions fθ,kf_{\theta,k}, for k=1,...,nLk=1,...,n_{L}, tend (in law) to iid centered Gaussian processes of covariance Σ(L)\Sigma^{(L)}, where Σ(L)\Sigma^{(L)} is defined recursively by:

taking the expectation with respect to a centered Gaussian process ff of covariance Σ(L)\Sigma^{(L)}.

We prove the result by induction. When L=1L=1, there are no hidden layers and fθf_{\theta} is a random affine function of the form:

All output functions fθ,kf_{\theta,k} are hence independent and have covariance Σ(1)\Sigma^{(1)} as needed.

conditioned on the values of α(L)\alpha^{(L)} are iid centered Gaussians with covariance

By the law of large numbers, as nLn_{L}\to\infty, this covariance tends in probability to the expectation

In particular the covariance is deterministic and hence independent of α(L)\alpha^{(L)}. As a consequence, the conditioned and unconditioned distributions of fθ,if_{\theta,i} are equal in the limit: they are iid centered Gaussian of covariance Σ(L+1)\Sigma^{(L+1)}. ∎

In the infinite-width limit, the neural tangent kernel, which is random at initialization, converges in probability to a deterministic limit.

For a network of depth LL at initialization, with a Lipschitz nonlinearity σ\sigma, and in the limit as the layers width n1,...,nL1n_{1},...,n_{L-1}\to\infty sequentially, the NTK Θ(L)\Theta^{(L)} converges in probability to a deterministic limiting kernel:

taking the expectation with respect to a centered Gaussian process ff of covariance Σ(L)\Sigma^{(L)}, and where σ˙\dot{\sigma} denotes the derivative of σ\sigma.

The proof is again by induction. When L=1L=1, there is no hidden layer and therefore no limit to be taken. The neural tangent kernel is a sum over the entries of W(0)W^{(0)} and those of b(0)b^{(0)}:

For the first sum let us observe that by the chain rule:

By the law of large numbers, as nLn_{L}\to\infty, this tends to its expectation which is equal to

It is then easy to see that the second part of the neural tangent kernel, the sum over W(L)W^{(L)} and b(L)b^{(L)} converges to Σ(L+1)δkk\Sigma^{(L+1)}\delta_{kk^{\prime}} as n1,...,nLn_{1},...,n_{L}\to\infty. ∎

A.2 Asymptotics during Training

Given a training direction tdtFt\mapsto d_{t}\in\mathcal{F}, a neural network is trained in the following manner: the parameters θp\theta_{p} are initialized as iid N(0,1)\mathcal{N}(0,1) and follow the differential equation:

In this context, in the infinite-width limit, the NTK stays constant during training:

Assume that σ\sigma is a Lipschitz, twice differentiable nonlinearity function, with bounded second derivative. For any TT such that the integral 0Tdtpindt\int_{0}^{T}\|d_{t}\|_{p^{in}}dt stays stochastically bounded, as n1,...,nL1n_{1},...,n_{L-1}\to\infty sequentially, we have, uniformly for t[0,T]t\in[0,T],

As a consequence, in this limit, the dynamics of fθf_{\theta} is described by the differential equation

As in the previous theorem, the proof is by induction on the depth of the network. When L=1L=1, the neural tangent kernel does not depend on the parameters, it is therefore constant during training.

To apply the induction hypothesis, we now need to bound 1nLW(L)(t)op\lVert\frac{1}{\sqrt{n_{L}}}W^{(L)}(t)\rVert_{op}. For this, we use the following lemma, which is proven in Appendix A.3 below:

From this lemma, to bound 1nLW(L)(t)op\lVert\frac{1}{\sqrt{n_{L}}}W^{(L)}(t)\rVert_{op}, it is hence enough to bound 1nLW(L)(0)op\lVert\frac{1}{\sqrt{n_{L}}}W^{(L)}(0)\rVert_{op}. From the law of large numbers, we obtain that the norm of each of the nL+1n_{L+1} rows of W(L)(0)W^{(L)}(0) is bounded, and hence that 1nLW(L)(0)op\lVert\frac{1}{\sqrt{n_{L}}}W^{(L)}(0)\rVert_{op} is bounded (keep in mind that nL+1n_{L+1} is fixed, while n1,,nLn_{1},\ldots,n_{L} grow).

From the above considerations, we can apply the induction hypothesis to the smaller network, yielding, in the limit as n1,,nLn_{1},\ldots,n_{L}\to\infty (sequentially), that the dynamics is governed by the constant kernel Θ(L)\Theta^{(L)}_{\infty}:

At the same time, the parameters of the last layer evolve according to

Now, observing that the operator norm of ΦΘ(L)\Phi_{\Theta_{\infty}^{(L)}} is equal to Θ(L)op||\Theta_{\infty}^{(L)}||_{op}, defined in the introduction of Appendix A, and using the Cauchy-Schwarz inequality, we get

where the sup norm \lVert\cdot\rVert_{\infty} is defined by f=supxf(x).\left\lVert f\right\lVert_{\infty}=\sup_{x}|f(x)|.

To bound both quantities simultaneously, study the derivative of the quantity

where, in the first inequality, we have used that σ˙c|\dot{\sigma}|\leq c and, in the second inequality, that the sum Wi(L)(t)2+αi(L)(t)pin\lVert W^{(L)}_{i}(t)\rVert_{2}+||\alpha^{(L)}_{i}(t)||_{p^{in}} is bounded by A(t)A(t). Applying Grönwall’s Lemma, we now get

We can now use these bounds to control the variation of the NTK and to prove the theorem. To understand how the NTK evolves, we study the evolution of the derivatives with respect to the parameters. The derivatives with respect to the bias parameters of the top layer bj(L)fθ,j\partial_{b^{(L)}_{j}}f_{\theta,j^{\prime}} are always equal to δjj\delta_{jj^{\prime}}. The derivatives with respect to the connection weights of the top layer are given by

Finally let us study the derivatives with respect to the parameters of the lower layers

Their contribution to the NTK Θjj(L+1)(x,x)\Theta^{(L+1)}_{jj^{\prime}}(x,x^{\prime}) is

By the induction hypothesis, the NTK of the smaller network Θ(L)\Theta^{(L)} tends to Θ(L)δii\Theta^{(L)}_{\infty}\delta_{ii^{\prime}} as n1,...,nL1n_{1},...,n_{L-1}\to\infty. The contribution therefore becomes

A.3 A Priori Control during Training

The goal of this section is to prove Lemma 1, which is a key ingredient in the proof of Theorem 2. Let us first recall it:

At all times, the evolution of the preactivations and weights is given by:

where the layer-wise training directions d(1),,d(L)d^{(1)},\ldots,d^{(L)} are defined recursively by

Set w(k)(t):=1nkW(k)(t)opw^{(k)}(t):=\left\|\frac{1}{\sqrt{n_{k}}}W^{(k)}(t)\right\|_{op} and a(k)(t):=1nkα(k)(t)pina^{(k)}\left(t\right):=\left\|\frac{1}{\sqrt{n_{k}}}\alpha^{\left(k\right)}\left(t\right)\right\|_{p^{in}}. The identities of the previous step yield the following recursive bounds:

where cc is the Lipschitz constant of σ\sigma. These bounds lead to

For the subnetworks NTKs we have the recursive bounds

This allows one to bound the derivative of A(t)A\left(t\right) as follows:

The polynomial control we obtained on the derivative of A(t)A\left(t\right) now allows one to use (a nonlinear form of, see e.g. ) Grönwall’s Lemma: we obtain that A(t)A\left(t\right) stays uniformly bounded on [0,τ]\left[0,\tau\right] for some τ=τ(n1,,nL)>0\tau=\tau\left(n_{1},\ldots,n_{L}\right)>0, and that τT\tau\to T as min(n1,,nL)\min\left(n_{1},\ldots,n_{L}\right)\to\infty, owing to the 1min{1,,nL}\frac{1}{\sqrt{\min\left\{1,\ldots,n_{L}\right\}}}in front of the polynomial. Since A(t)A\left(t\right) is bounded, the differential bound on A(t)A\left(t\right) gives that the derivative tA(t)\partial_{t}A\left(t\right) converges uniformly to on [0,τ]\left[0,\tau\right] for any τ<T\tau<T, and hence A(t)A(0)A\left(t\right)\to A\left(0\right). This concludes the proof of the lemma.

This subsection is devoted to the proof of Proposition 2, which we now recall:

A key ingredient for the proof of Proposition 2 is the following Lemma, which comes from .

If the expansion of μ\mu in Hermite polynomials (hi)i0\left(h_{i}\right)_{i\geq 0} is given by μ=i=0aihi\mu=\sum_{i=0}^{\infty}a_{i}h_{i}, we have

The other key ingredient for proving Proposition 2 is the following theorem, which is a slight reformulation of Theorem 1(b) in , which itself is a generalization of a classical result of Schönberg:

is positive-definite for any n01n_{0}\geq 1 if and only if the coefficients bnb_{n} are strictly positive for infinitely many even and infinitely many odd integers nn.

With Lemma 2 and Theorem 3 above, we are now ready to prove Proposition 2.

We first decompose the limiting NTK Θ(L)\Theta^{\left(L\right)} recursively, relate its positive-definiteness to that of the activation kernels, then show that the positive-definiteness of the activation kernels at level 22 implies that of the higher levels, and finally show the positive-definiteness at level 22 using Lemma 2 and Theorem 3:

Observe that for any L1L\geq 1, using the notation of Theorem 1, we have

Note that the kernel Σ˙(L)Θ(L)\dot{\Sigma}^{\left(L\right)}\Theta^{\left(L\right)} is positive semi-definite, being the product of two positive semi-definite kernels. Hence, if we show that Σ(L+1)\Sigma^{\left(L+1\right)} is positive-definite, this implies that Θ(L+1)\Theta^{\left(L+1\right)} is positive-definite.

By definition, with the notation of Proposition 1 we have

Hence the left-hand side only vanishes if ciσ(f(xi))\sum c_{i}\sigma\left(f\left(x_{i}\right)\right) is almost surely zero. If Σ(L)\Sigma^{\left(L\right)} is positive-definite, the Gaussian (f(xi))i=1,d\left(f\left(x_{i}\right)\right)_{i=1,\ldots d} is non-degenerate, so this only occurs when c1==cd=0c_{1}=\cdots=c_{d}=0 since σ\sigma is assumed to be non-constant. This shows that the positive-definiteness of Σ(L+1)\Sigma^{\left(L+1\right)} is implied by that of Σ(L)\Sigma^{\left(L\right)}. By induction, if Σ(2)\Sigma^{\left(2\right)} is positive-definite, we obtain that all Σ(L)\Sigma^{\left(L\right)} with L2L\geq 2 are positive-definite as well. By the first step this hence implies that Θ(L)\Theta^{\left(L\right)} is positive-definite as well.

Writing the expansion of μ\mu in Hermite polynomials (hi)i0\left(h_{i}\right)_{i\geq 0}

we obtain that μ^\hat{\mu} is given by the power series

Since σ\sigma is non-polynomial, so is μ\mu, and as a result, there is an infinite number of nonzero aia_{i}’s in the above sum.

where the aia_{i}’s are the coefficients of the Hermite expansion of μ\mu. Now, observe that by the previous step, the power series expansion of ν\nu contains both an infinite number of nonzero even terms and an infinite number of nonzero odd terms. This enables one to apply Theorem 3 to obtain that Σ(2)\Sigma^{\left(2\right)} is indeed positive-definite, thereby concluding the proof.