Towards Understanding the Spectral Bias of Deep Learning

Yuan Cao, Zhiying Fang, Yue Wu, Ding-Xuan Zhou, Quanquan Gu

Introduction

Over-parameterized neural networks have achieved great success in many applications such as computer vision (He et al., 2016), natural language processing (Collobert and Weston, 2008) and speech recognition (Hinton et al., 2012). It has been shown that over-parameterized neural networks can fit complicated target function or even randomly labeled data (Zhang et al., 2017) and still exhibit good generalization performance when trained with real labels. Intuitively, this is at odds with the traditional notion of generalization ability such as model complexity. In order to understand neural network training, a line of work (Soudry et al., 2018; Gunasekar et al., 2018b, a) has made efforts in the perspective of “implicit bias”, which states that training algorithms for deep learning implicitly pose an inductive bias onto the training process and lead to a solution with low complexity measured by certain norms in the parameter space of the neural network.

Among many attempts to establish implicit bias, Rahaman et al. (2019) pointed out an intriguing phenomenon called spectral bias, which says that during training, neural networks tend to learn the components of lower complexity faster. Similar observation has also been pointed out in Xu et al. (2019b, a). The concept of spectral bias is appealing because this may intuitively explain why over-parameterized neural networks can achieve a good generalization performance without overfitting. During training, the networks fit the low complexity components first and thus lie in the concept class of low complexity. Arguments like this may lead to rigorous guarantee for generalization.

Great efforts have been made in search of explanations about the spectral bias. Rahaman et al. (2019) evaluated the Fourier spectrum of ReLU networks and empirically showed that the lower frequencies are learned first; also lower frequencies are more robust to random perturbation. Andoni et al. (2014) showed that for a sufficiently wide two-layer network, gradient descent with respect to the second layer can learn any low degree bounded polynomial. Xu (2018) provided Fourier analysis to two-layer networks and showed similar empirical results on one-dimensional functions and real data. Nakkiran et al. (2019) used information theoretical approach to show that networks obtained by stochastic gradient descent can be explained by a linear classifier during early training. These studies provide certain explanations about why neural networks exhibit spectral bias in real tasks. But explanations in the theoretical aspect, if any, are to some extent limited. For example, Fourier analysis is usually done in the one-dimensional setting, and thus lacks generality.

Inspired by these works mentioned above in the neural tangent kernel regime, in this paper we study the spectral bias of over-parameterized two-layer neural networks and its connection to the neural tangent kernel. Note that a basic connection between neural tangent kernel and the spectral bias can be indicated by the observation that neural networks evolve as linear models (Jacot et al., 2018; Lee et al., 2019) in the NTK regime, which suggests that some corresponding results for linear models (Advani and Saxe, 2017) can be applied. However, direct combinations of existing techniques cannot show how many hidden nodes can guarantee the learning of simple/complex components, which is the key problem of interest. Therefore, although such combinations of existing results can provide some mathematical intuition, they are not sufficient to explain the spectral bias for neural networks. To give a thorough characterization of spectral bias, we study the training of mildly over-parameterized neural networks. We show that, given a training data set that is generated based on a target function, a fairly narrow network, although cannot fit the training data well due to its limited width, can still learn certain low-complexity components of the target function in the eigenspace corresponding to large eigenvalues of neural tangent kernel. As the width of the network increases, more high-frequency components of the target function can be learned with a slower convergence rate. As a special case, our result implies that when the input data follows uniform distribution on the unit sphere, polynomials of lower degrees can be learned by a narrower neural network at a faster rate. We also conduct experiments to corroborate the theory we establish.

We prove a generic theorem for arbitrary data distributions, which states that under certain sample complexity and over-parameterization conditions, the convergence of the training error along different eigendirections of NTK relies on the corresponding eigenvalues. This theorem gives a more precise control on the regression residual than Su and Yang (2019), where the authors focused on the case when the labeling function is close to the subspace spanned by the first few eigenfunctions.

We present a characterization of the spectra of the neural tangent kernel that is more general than existing results. In particular, we show that when the input data follow uniform distribution over the unit sphere, the eigenvalues of neural tangent kernel are μk=Ω(max{kd1,dk+1})\mu_{k}=\Omega(\max\{k^{-d-1},d^{-k+1}\}), k0k\geq 0, with corresponding eigenfunctions being the kk-th order spherical harmonics. Our result is better than the bound Ω(kd1)\Omega(k^{-d-1}) derived in Bietti and Mairal (2019) when dkd\gg k, which is in a more practical setting.

We establish a rigorous explanation for the spectral bias based on the aforementioned theoretical results without any specific assumptions on the target function. We show that the error terms from different frequencies are provably controlled by the eigenvalues of the NTK, and the lower-frequency components can be learned with less training examples and narrower networks at a faster convergence rate.

Related Work

A few theoretical results have been established towards understanding the spectra of neural tangent kernels. To name a few, Bach (2017) studied two-layer ReLU networks by relating it to kernel methods, and proposed a harmonic decomposition for the functions in the reproducing kernel Hilbert space which we utilize in our proof. Based on the technique in Bach (2017), Bietti and Mairal (2019) studied the eigenvalue decay of integrating operator defined by the neural tangent kernel on unit sphere by using spherical harmonics. Vempala and Wilmes (2019) calculated the eigenvalues of neural tangent kernel corresponding to two-layer neural networks with sigmoid activation function. Basri et al. (2019) established similar results as Bietti and Mairal (2019), but considered the case of training the first layer parameters of a two-layer networks with bias terms. Yang and Salman (2019) studied the the eigenvalues of integral operator with respect to the NTK on Boolean cube by Fourier analysis. Very recently, Chen and Xu (2020) studied the connection between NTK and Laplacian kernels. Bordelon et al. (2020) gave a spectral analysis on the generalization error of NTK-based kernel ridge regression. Basri et al. (2020) studied the convergence of full training residual with a focus on one-dimensional, non-uniformly distributed data.

A series of papers (Gunasekar et al., 2017; Soudry et al., 2018; Gunasekar et al., 2018a, b; Nacson et al., 2018; Li et al., 2018; Jacot et al., 2020) studied implicit bias problem, aiming to figure out when there are multiple optimal solutions of a training objective function, what kind of nice properties the optimal found by a certain training algorithm would have. Implicit bias results of gradient descent, stochastic gradient descent, or mirror descent for various problem settings including matrix factorization, logistic regression, deep linear networks as well as homogeneous models. The major difference between these results and our work is that implicit bias results usually focus on the parameter space, while we study the functions a neural network prefer to learn in the function space.

Preliminaries

In this section we introduce the basic problem setup including the neural network structure and the training algorithm, as well as some background on the neural tangent kernel proposed recently in Jacot et al. (2018) and the corresponding integral operator.

2 Problem Setup

Here we introduce the basic problem setup. We consider two-layer fully connected neural networks of the form

We first randomly initialize the parameters of the network, and run gradient descent for both layers. We present our detailed neural network training algorithm in Algorithm 1.

The initialization scheme for W(0)\mathbf{W}^{(0)} given in Algorithm 1 is known as He initialization (He et al., 2015). It is consistent with the initialization scheme used in Cao and Gu (2019).

3 Neural Tangent Kernel

Many attempts have been made to study the convergence of gradient descent assuming the width of the network is extremely large (Du et al., 2019b; Li and Liang, 2018). When the width of the network goes to infinity, with certain initialization on the model weights, the limit of inner product of network gradients defines a kernel function, namely the neural tangent kernel (Jacot et al., 2018). In this paper, we denote the neural tangent kernel as

For two-layer networks, standard concentration results gives

Since we apply gradient descent to both layers, the neural tangent kernel is the sum of the two different kernel functions and clearly it can be reduced to one layer training setting. These two kernels are arc-cosine kernels of degree 0 and 1 (Cho and Saul, 2009), which are given as κ1(x,x)=κ^1(x,x(x2x2))\kappa_{1}(\mathbf{x},\mathbf{x}^{\prime})=\widehat{\kappa}_{1}(\langle\mathbf{x},\mathbf{x}^{\prime}\rangle(\left\|\mathbf{x}\right\|_{2}\left\|\mathbf{x}^{\prime}\right\|_{2})), κ2(x,x)=κ^2(x,x/(x2x2))\kappa_{2}(\mathbf{x},\mathbf{x}^{\prime})=\widehat{\kappa}_{2}(\langle\mathbf{x},\mathbf{x}^{\prime}\rangle/(\left\|\mathbf{x}\right\|_{2}\left\|\mathbf{x}^{\prime}\right\|_{2})), where

4 Integral Operator

It has been pointed out in Cho and Saul (2009) that arc-cosine kernels are positive semi-definite. Thus the kernel function κ\kappa defined by (3.1) is positive semi-definite being a product and a sum of positive semi-definite kernels. Clearly this kernel is also continuous and symmetric, which implies that the neural tangent kernel κ\kappa is a Mercer kernel.

Main Results

In this section we study the convergence of Algorithm 1. Instead of studying the standard convergence of loss function value, we provide a refined analysis on the speed of convergence along different directions defined by the eigenfunctions of LκL_{\kappa}. We first introduce the following notations.

Denote y=(y1,,yn)\mathbf{y}=(y_{1},\ldots,y_{n})^{\top} and y^(t)=θ(fW(t)(x1),,fW(t)(xn))\widehat{\mathbf{y}}^{(t)}=\theta\cdot(f_{\mathbf{W}^{(t)}}(\mathbf{x}_{1}),\ldots,f_{\mathbf{W}^{(t)}}(\mathbf{x}_{n}))^{\top} for t=0,,Tt=0,\ldots,T. Then Lemma 4.1 shows that the convergence rate of Vrk(yy^(t))2\|\mathbf{V}_{r_{k}}^{\top}(\mathbf{y}-\widehat{\mathbf{y}}^{(t)})\|_{2} roughly represents the speed gradient descent learns the components of the target function corresponding to the first rkr_{k} eigenvalues. The following theorem gives the convergence guarantee of Vrk(yy^(t))2\|\mathbf{V}_{r_{k}}^{\top}(\mathbf{y}-\widehat{\mathbf{y}}^{(t)})\|_{2}.

Theorem 4.2 shows that the convergence rate of Vrk(yy^(t))2\|\mathbf{V}_{r_{k}}^{\top}(\mathbf{y}-\widehat{\mathbf{y}}^{(t)})\|_{2} is determined by the rkr_{k}-th eigenvalue λrk\lambda_{r_{k}}. This reveals the spectral bias of neural network training under the NTK regime. Specifically, as long as the network is wide enough and the sample size is large enough, gradient descent first learns the target function along the eigendirections of neural tangent kernel with larger eigenvalues, and then learns the rest components corresponding to smaller eigenvalues. Moreover, by showing that learning the components corresponding to larger eigenvalues can be done with smaller sample size and narrower networks, our theory pushes the study of neural networks in the NTK regime towards a more practical setting. For these reasons, we believe that Theorem 4.2 to certain extent provides an explanation of the empirical observations given in Rahaman et al. (2019), and demonstrates that the difficulty of a function to be learned by neural network can be characterized in the eigenspace of neural tangent kernel: if the target function has a component corresponding to a small eigenvalue of neural tangent kernel, then learning this function takes longer time, and requires more examples and wider networks.

Note that the results in this paper are all in the “neural tangent kernel regime” (Jacot et al., 2018) or the “lazy training regime”(Chizat et al., 2019). Therefore, our results share certain common limitations of NTK-type results discussed in (Chizat et al., 2019; Allen-Zhu and Li, 2019). However, we believe the study of spectral bias in the NTK regime is still an important research direction. It is worth noting that Theorem 4.2 is not based on the common NTK-type assumption that the target function belongs to the NTK-induced RKHS (Arora et al., 2019a; Cao and Gu, 2019; Ji and Telgarsky, 2020). Instead, Theorem 4.2 works for arbitrary labeling of the data, and is therefore rather general. We believe the characterization of spectral bias under general settings beyond NTK regime is an important future work direction.

Our work follows the same intuition as recent results studying the residual dynamics of over-parameterized two-layer neural networks (Arora et al., 2019a; Su and Yang, 2019). Compared with Su and Yang (2019), the major difference is that while Su and Yang (2019) studied the full residual yy^(T)2\|\mathbf{y}-\widehat{\mathbf{y}}^{(T)}\|_{2} and required that the target function lies approximately in the eigenspace of large eigenvalues of the neural tangent kernel, our result in Theorem 4.2 works for arbitrary target function, and shows that even if the target function has very high frequency components, its components in the eigenspace of large eigenvalues can still be learned very efficiently by neural networks. We note that although the major results in Arora et al. (2019a) are presented in terms of the full residual, certain part of their proof in Arora et al. (2019a) can indicate the convergence of projected residual. However, Arora et al. (2019a) do not build any quantitative connection between the Gram matrix and kernel function. Since the eigenvalues and eigenfunctions of NTK Gram matrix depend on the exact realizations of the nn training samples, they are not directly tractable for the study of spectral bias. Moreover, Arora et al. (2019a) focus on the setting where the network is wide enough to guarantee global convergence, while our result works for narrower networks for which global convergence may not even be possible. More recently, Bordelon et al. (2020) studied the solution of NTK-based kernel ridge regression. Compared with thier result, our analysis is directly on the practical neural network training procedure, and specifies the width requirement for a network to successfully learn a certain component of the target function. Another recent work by Basri et al. (2020) provided theoretical analysis for one-dimensional non-uniformly distributed data, and only studied the convergence of the full residual vector. In comparison, our results cover high-dimensional data, and provide a more detailed analysis on the convergence of different projections of the residual.

2 Spectral Analysis of NTK for Uniform Distribution

We now study the case when the data inputs are uniformly distributed over the unit sphere as an example where the eigendecompositon of NTK can be calculated. We present our results (an extension of Proposition 5 in Bietti and Mairal (2019)) of spectral analysis of neural tangent kernel in the form of a Mercer decomposition, which explicitly gives the eigenvalues and eigenfunctions of NTK.

where Yk,jY_{k,j} for j=1,,N(d,k)j=1,\cdots,N(d,k) are linearly independent spherical harmonics of degree kk in d+1d+1 variables with N(d,k)=2k+d1k(k+d2d1)N(d,k)=\frac{2k+d-1}{k}\tbinom{k+d-2}{d-1} and orders of μk\mu_{k} are given by

3 Convergence for Uniformly Distributed Data

In this subsection, we combine our results in the previous two subsections and give explicit convergence rate for uniformly distributed data on the unit sphere.

where rk=k=0k1N(d,k)r_{k}=\sum_{k^{\prime}=0}^{k-1}N(d,k^{\prime}) and Vrk=(n1/2ϕj(xi))n×rk\mathbf{V}_{{r_{k}}}=(n^{-1/2}\phi_{j}(\mathbf{x}_{i}))_{n\times r_{k}} with ϕ1,,ϕrk\phi_{1},\ldots,\phi_{r_{k}} being a set of orthonomal spherical harmonics of degrees up to k1k-1.

where rk=k=0k1N(d,k)r_{k}=\sum_{k^{\prime}=0}^{k-1}N(d,k^{\prime}) and Vrk=(n1/2ϕj(xi))n×rk\mathbf{V}_{{r_{k}}}=(n^{-1/2}\phi_{j}(\mathbf{x}_{i}))_{n\times r_{k}} with ϕ1,,ϕrk\phi_{1},\ldots,\phi_{r_{k}} being a set of orthonomal spherical harmonics of degrees up to k1k-1.

Corollaries 4.6 and 4.7 further illustrate the spectral bias of neural networks by providing exact calculations of λrk\lambda_{r_{k}}, Vrk\mathbf{V}_{r_{k}} and MM in Theorem 4.2. They show that if the input distribution is uniform over unit sphere, then spherical harmonics with lower degrees are learned first by wide neural networks.

In Corollaries 4.6 and 4.7, the conditions on nn and mm depend exponentially on either kk or dd. We would like to emphasize that such exponential dependency is reasonable and unavoidable. Take the dkd\gg k setting as an example. The exponential dependency in kk is a natural consequence of the fact that in high dimensional space, there are a large number of linearly independent low-degree polynomials. When only nn data points are used for training, it is only reasonable to expect to learn less than nn independent components of the true function. Therefore it is unavoidable to assume

Similar arguments can apply to the kdk\gg d setting.

Experiments

In this section we present experimentral results to verify our theory. Across all tasks, we train a two-layer neural networks with 4096 hidden neurons and initialize it exactly as defined in the problem setup. The optimization method is vanilla gradient descent, and the training sample size for all results is 10001000.

We first experimentally verify our theoretical results by learning linear combinations of spherical harmonics with data inputs uniformly distributed over unit sphere. Define

Following our theoretical analysis, we denote vk=n1/2(Pk(x1),,Pk(xn))\mathbf{v}_{k}=n^{-1/2}(P_{k}(\mathbf{x}_{1}),\ldots,P_{k}(\mathbf{x}_{n})). By Lemma 4.2 vk\mathbf{v}_{k}’s are almost orthonormal. So we define the (approximate) projection length of residual r(t)\mathbf{r}^{(t)} onto vk\mathbf{v}_{k} at step tt as a^k=vkr(t)\widehat{a}_{k}=|\mathbf{v}_{k}^{\top}\mathbf{r}^{(t)}|, where r(t)=(f(x1)θfW(t)(x1),,f(xn)θfW(t)(xn))\mathbf{r}^{(t)}=(f^{*}(\mathbf{x}_{1})-\theta f_{\mathbf{W}^{(t)}}(\mathbf{x}_{1}),\dots,f^{*}(\mathbf{x}_{n})-\theta f_{\mathbf{W}^{(t)}}(\mathbf{x}_{n})) and fW(t)(x)f_{\mathbf{W}^{(t)}}(\mathbf{x}) is the neural network function.

The results are shown in Figure 1. It can be seen that the residual at the lowest frequency (k=1k=1) converges to zero first and then the second lowest (k=2k=2). The highest frequency component is the last one to converge. Following the setting of Rahaman et al. (2019) we assign high frequencies a larger scale in Figure 1 (b), expecting that larger scale will introduce a better descending speed. Still, the low frequencies are learned first.

Note that a^k\widehat{a}_{k} is the projection length onto an approximate vector. In the function space, we can also project the residual function r(x)=f(x)θfW(t)(x)r(\mathbf{x})=f^{*}(\mathbf{x})-\theta f_{\mathbf{W}^{(t)}}(\mathbf{x}) onto the orthonormal Gegenbauer functions Pk(x)P_{k}(\mathbf{x}). Replacing the training data with randomly sampled data points xi\mathbf{x}_{i} can lead to a random estimate of the projection length in function space. Experiments in this setting can be found in the appendix.

We also verify the linear convergence rate proved in Theorem 4.2. Figure 2 presents the convergence curve in logarithmic scale. We can see from Figure 2 that the convergence of different projection length is close to linear convergence, which is well-aligned with our theoretical analysis. Note that we performed a moving average of range 2020 on these curves to avoid the heavy oscillation especially at late stage.

2 Learning Functions of Simple Forms

Apart from the synthesized low frequency function, we also show the dynamics of more general functions’ projection to Pk(x)P_{k}(x). These functions, though in a simple form, have non-zero high-frequency components. In this subsection we show our results still apply when all frequencies exist in the target functions, which are given by f(x)=icos(aiζ,x)f^{*}(\mathbf{x})=\sum_{i}\cos(a_{i}\langle{\bm{\zeta}},{\mathbf{x}}\rangle) and f(x)=iζ,xpif^{*}(\mathbf{x})=\sum_{i}\langle{\bm{\zeta}},{\mathbf{x}}\rangle^{p_{i}}, where ζ\bm{\zeta} is a fixed unit vector.

Figure 3 verifies the result of Theorem 4.2 with more general target functions, and backs up our claim that Theorem 4.2 does not make any assumptions on the target function. Notice that in the early training stage, not all the curves monotonically descend. We believe this is due to the unseen components’ influence on the gradient. Again, as the training proceeds, the residual projections converge at the predicted rates.

3 Non-uniform Input Data Distributions

In this subsection, we provide experimental results for non-uniformly distributed input data. Note that the eigendecomposition of NTK for general multi-dimensional input distributions do not necessarily have good analytic forms. Therefore, here we treat the non-uniform distribution of the input data as model misspecification, and test whether residual projections onto spherical harmonics of different degrees can still exhibit various convergence speed. We consider three examples of non-uniform distributions: (i) piece-wise uniform distribution, (ii) normalized non-isotropic Gaussian, and (iii) normalized Gaussian mixture.

Piece-wise uniform distribution We divide the unit sphere into two semi-spheres along a randomly drawn direction ζ0\bm{\zeta}_{0}. A data input is then generated as follows: with probability 1/41/4, draw the input uniformly over the first unit sphere; with probability 3/43/4, draw the input uniformly over the second unit sphere. The results are shown in Figure 4.

Normalized Gaussian mixture The inputs are drawn from a mixture of 3 non-isotropic Gaussian distributions described above in the second example. The results are shown in Figure 6.

Figures 4, 5, 6 show that the residual components corresponding to lower order polynomials are still learned relatively faster, even for the non-uniform distributions described above, where spherical harmonics are no longer exactly the eigenfunctions of NTK. This suggests that our theoretical results can tolerate certain level of model misspecification, and therefore the spectral bias characterized in Theorem 4.2 and Corollaries 4.6, 4.7 holds in a variety of problem settings.

Conclusion

Appendix A Appendix A: A Review on Spherical Harmonics

In this section, we give a brief review on relevant concepts in spherical harmonics. For more detials, see Bach (2017); Bietti and Mairal (2019); Frye and Efthimiou (2012); Atkinson and Han (2012) for references.

where Pk(t)P_{k}(t) is the Legendre polynomial of degree kk in d+1d+1 dimensions, explicitly given by (Rodrigues’ formula)

We can also see that Pk(t)P_{k}(t), the Legendre polynomial of degree kk shares the same parity with kk. By the orthogonality and the addition formula (A.1) we have,

The following recurrence relation holds for Legendre polynomials:

for k1k\geq 1 and tP0(t)=P1(t)tP_{0}(t)=P_{1}(t) for k=0k=0.

The Hecke-Funk formula is given for a spherical harmonic YkY_{k} of degree kk as follows.

Appendix B Appendix B: Proof of Main Theorems

Here we provide the proof of Lemma 4.1, which is based on direct application of concentration inequalities.

where C1,C2C_{1},C_{2} are absolute constants. Applying a union bound over all i,j[rk]×[rk]i,j\in[r_{k}]\times[r_{k}] finishes the proof. ∎

B.2 Proof of Theorem 4.2

In this section we give the proof of Theorem 4.2. The core idea of our proof is to establish connections between neural network gradients throughout training and the neural tangent kernel. To do so, we first introduce the following definitions and notations.

Define K(0)=m1(WfW(0)(xi),WfW(0)(xj))n×n\mathbf{K}^{(0)}=m^{-1}(\langle\nabla_{\mathbf{W}}f_{\mathbf{W}^{(0)}}(\mathbf{x}_{i}),\nabla_{\mathbf{W}}f_{\mathbf{W}^{(0)}}(\mathbf{x}_{j})\rangle)_{n\times n}, K()=(κ(xi,xj))n×n=limmK(0)\mathbf{K}^{(\infty)}=(\kappa(\mathbf{x}_{i},\mathbf{x}_{j}))_{n\times n}=\lim_{m\rightarrow\infty}\mathbf{K}^{(0)}. Let {λ^i}i=1n\{\widehat{\lambda}_{i}\}_{i=1}^{n}, λ^1λ^n\widehat{\lambda}_{1}\geq\cdots\geq\widehat{\lambda}_{n} be the eigenvalues of n1Kn^{-1}\mathbf{K}^{\infty}, and v^1,,v^n\widehat{\mathbf{v}}_{1},\ldots,\widehat{\mathbf{v}}_{n} be the corresponding eigenvectors. Set V^rk=(v^1,,v^rk)\widehat{\mathbf{V}}_{{r_{k}}}=(\widehat{\mathbf{v}}_{1},\ldots,\widehat{\mathbf{v}}_{{r_{k}}}), V^rk=(v^rk+1,,v^n)\widehat{\mathbf{V}}_{{r_{k}}}^{\bot}=(\widehat{\mathbf{v}}_{{r_{k}}+1},\ldots,\widehat{\mathbf{v}}_{n}). For notation simplicity, we denote \nabla_{\mathbf{W}}f_{\mathbf{W}^{(0)}}(\mathbf{x}_{i})=[\nabla_{\mathbf{W}}f_{\mathbf{W}}(\mathbf{x}_{i})]\big{|}_{\mathbf{W}=\mathbf{W}^{(0)}}, \nabla_{\mathbf{W}_{l}}f_{\mathbf{W}^{(0)}}(\mathbf{x}_{i})=[\nabla_{\mathbf{W}_{l}}f_{\mathbf{W}}(\mathbf{x}_{i})]\big{|}_{\mathbf{W}=\mathbf{W}^{(0)}}, l=1,2l=1,2.

The following lemma’s purpose is to further connect the eigenfunctions of NTK with their finite-width, finite-sample counterparts. Its first statement is proved in Su and Yang (2019).

Suppose that ϕi(x)M|\phi_{i}(\mathbf{x})|\leq M for all xSd\mathbf{x}\in S^{d}. There exist absolute constants C,C,c>0C,C^{\prime},c^{\prime\prime}>0, such that for any δ>0\delta>0 and integer kk with rkn{r_{k}}\leq n, if nC(λrkλrk+1)2log(1/δ)n\geq C(\lambda_{{r_{k}}}-\lambda_{{r_{k}}+1})^{-2}\log(1/\delta), then with probability at least 1δ1-\delta,

The following two lemmas gives some preliminary bounds on the function value and gradients of the neural network around random initialization. They are proved in Cao and Gu (2019).

For any δ>0\delta>0, if mClog(n/δ)m\geq C\log(n/\delta) for a large enough absolute constant CC, then with probability at least 1δ1-\delta, fW(0)(xi)O(log(n/δ))|f_{\mathbf{W}^{(0)}}(\bm{x}_{i})|\leq\mathcal{O}(\sqrt{\log(n/\delta)}) for all i[n]i\in[n].

There exists an absolute constant CC such that, with probability at least 1O(n)exp[Ω(mω2/3)]1-\mathcal{O}(n)\cdot\exp[-\Omega(m\omega^{2/3})], for all i[n]i\in[n], l[L]l\in[L] and WB(W(0),ω)\mathbf{W}\in\mathcal{B}(\mathbf{W}^{(0)},\omega) with ωC[log(m)]3\omega\leq C[\log(m)]^{-3}, it holds uniformly that

The following lemma is the key to characterize the dynamics of the residual throughout training. These bounds in Lemma B.4 are the ones that distinguish our analysis from previous works on neural network training in the NTK regime (Du et al., 2019b; Su and Yang, 2019), since our analysis provides more careful characterization on the residual along different directions.

Suppose that the iterates of gradient descent W(0),,W(t)\mathbf{W}^{(0)},\ldots,\mathbf{W}^{(t)} are inside the ball B(W(0),ω)\mathcal{B}(\mathbf{W}^{(0)},\omega). If ωO~(min{[log(m)]3/2,λrk3,(ηm)3})\omega\leq\widetilde{\mathcal{O}}(\min\{[\log(m)]^{-3/2},\lambda_{r_{k}}^{3},(\eta m)^{-3}\}) and nO~(λrk2)n\geq\widetilde{\mathcal{O}}(\lambda_{r_{k}}^{-2}), then with probability at least 1O(t2n2)exp[Ω(mω2/3)]1-\mathcal{O}(t^{2}n^{2})\cdot\exp[-\Omega(m\omega^{2/3})],

Now we are ready to prove Theorem 4.2. To illustrate the intuition behind the proof, the proof can be mainly summarized into three steps: (i) Using a induction argument together with Lemma B.4 to show that under the theorem assumptions, all TT gradient descent iterates W(0),,W(T)\mathbf{W}^{(0)},\ldots,\mathbf{W}^{(T)} are inside the ball B(W(0),ω)\mathcal{B}(\mathbf{W}^{(0)},\omega) with ω=O(T/(θλrkm))\omega=O(T/(\bm{\theta}\lambda_{r_{k}}\sqrt{m})). (ii) We then reapply Lemma B.4 with t=Tt=T and obtain a bound on V^rk(yy^(T))2\|\widehat{\mathbf{V}}_{r_{k}}^{\top}(\mathbf{y}-\widehat{\mathbf{y}}^{(T)})\|_{2}. (iii) At last, we utilize the concentration results given in Lemma B.1 to finalize the proof. The detaled proof is given as follows.

We first show that all the iterates W(0),,W(T)\mathbf{W}^{(0)},\ldots,\mathbf{W}^{(T)} are inside the ball B(W(0),ω)\mathcal{B}(\mathbf{W}^{(0)},\omega). We prove this result by inductively show that W(t)B(W(0),ω)\mathbf{W}^{(t)}\in\mathcal{B}(\mathbf{W}^{(0)},\omega), t=0,,Tt=0,\ldots,T. First of all, it is clear that W(0)B(W(0),ω)\mathbf{W}^{(0)}\in\mathcal{B}(\mathbf{W}^{(0)},\omega). Suppose that W(0),,W(t)B(W(0),ω)\mathbf{W}^{(0)},\ldots,\mathbf{W}^{(t)}\in\mathcal{B}(\mathbf{W}^{(0)},\omega). Then the results of Lemmas B.3 and B.4 hold for W(0),,W(t)\mathbf{W}^{(0)},\ldots,\mathbf{W}^{(t)}. Denote u(t)=yy^(t)\mathbf{u}^{(t)}=\mathbf{y}-\widehat{\mathbf{y}}^{(t)}, tTt\in{T}. Then we have

where the second inequality follows by Lemma B.3. By Lemma B.4, we have

Now by ω=CT/(λrkm)\omega=\overline{C}T/(\lambda_{r_{k}}\sqrt{m}), η=O~(θ2m)1\eta=\widetilde{\mathcal{O}}(\theta^{2}m)^{-1} and the assumption that mm=O~(λrk14ϵ6)m\geq m^{*}=\widetilde{\mathcal{O}}(\lambda_{r_{k}}^{-14}\cdot\epsilon^{-6}), we obtain

By Lemma 4.1, θ=O~(ϵ)\theta=\widetilde{\mathcal{O}}(\epsilon) and the assumptions nΩ~(max{ϵ1(λrkλrk+1)1,ϵ2M2rk2})n\geq\widetilde{\Omega}(\max\{\epsilon^{-1}(\lambda_{r_{k}}-\lambda_{r_{k}+1})^{-1},\epsilon^{-2}M^{2}r_{k}^{2}\}), the eigenvalues of VrkVrk\mathbf{V}_{r_{k}}^{\top}\mathbf{V}_{r_{k}} are all between 1/21/\sqrt{2} and 2\sqrt{2}. Therefore we have

where the second inequality follows by Lemma B.1 and the fact VrkVrk(1/2)I\mathbf{V}_{r_{k}}^{\top}\mathbf{V}_{r_{k}}\succeq(1/\sqrt{2})\mathbf{I}. Similarly,

By Lemma 4.1, with probability at least 1δ1-\delta, Vrk21+CrkM2log(rk/δ)/n\|\mathbf{V}_{r_{k}}^{\top}\|_{2}\leq 1+Cr_{k}M^{2}\sqrt{\log(r_{k}/\delta)/n}. Combining this result with Lemma B.2 gives Vrky^(0)2θO(nlog(n/δ))ϵn/8\|\mathbf{V}_{r_{k}}^{\top}\widehat{\mathbf{y}}^{(0)}\|_{2}\leq\theta\mathcal{O}(\sqrt{n\log{(n/\delta)}})\leq\epsilon\sqrt{n}/8. Plugging the above estimates into (B.4) gives

Applying union bounds completes the proof. ∎

B.3 Proof of Theorem 4.3

where we apply the Hecke-Funk formula (A.4) and addition formula (A.1). Denote

For two kernel function defined in (3.2), we have

The first equality holds because σ\sigma^{\prime} is 0-homogeneous function and the second equality is true since the normalized direction of a multivariate Gaussian random variable satisfies uniform distribution on the unit sphere. Similarly we can derive

By combining (A.2), (B.7), (B.8), (B.9) and (B.10), we can get

Comparing (B.6), (B.11) and (B.12), we can easily show that

In Bach (2017), for α=1,2\alpha=1,2, explicit expressions of βα,k\beta_{\alpha,k} for kα+1k\geq\alpha+1 are presented as follows:

By Stirling formula Γ(x)xx1/2ex2π\Gamma(x)\approx x^{x-1/2}e^{-x}\sqrt{2\pi}, we have following expression of βα+1,k\beta_{\alpha+1,k} for kα+1k\geq\alpha+1

where C(α)=2α!2πexp{α+1}C(\alpha)=\frac{\sqrt{2}\alpha!}{2\pi}\exp\{\alpha+1\}. Also βα+1,0=d14πΓ(α+12)Γ(d2)Γ(d+α+22)\beta_{\alpha+1,0}=\frac{d-1}{4\pi}\frac{\Gamma{\left(\frac{\alpha+1}{2}\right)}\Gamma{\left(\frac{d}{2}\right)}}{\Gamma{\left(\frac{d+\alpha+2}{2}\right)}}, β1,1=d12dπ\beta_{1,1}=\frac{d-1}{2d\pi} and β2,1=d14πdΓ(12)Γ(d+22)Γ(d+32)\beta_{2,1}=\frac{d-1}{4\pi d}\frac{\Gamma{\left(\frac{1}{2}\right)}\Gamma{\left(\frac{d+2}{2}\right)}}{\Gamma{\left(\frac{d+3}{2}\right)}}. Thus combine (B.13) we know that μα+1,k=Ω(dd+1+αkkα1(k+d)kdα)\mu_{\alpha+1,k}=\Omega\left(d^{d+1+\alpha}k^{k-\alpha-1}(k+d)^{-k-d-\alpha}\right). By considering (A.3) and (B.6), we have

for k1k\geq 1 and kkk\neq k^{\prime}. From the discussion above, we thus know exactly that for k1k\geq 1

B.4 Proof of Corollaries 4.6 and 4.7

Thus we know that ϕj(x)N(d,k)|\phi_{j}(\mathbf{x})|\leq\sqrt{N(d,k)}. For kdk\gg d, we have N(d,k)=2k+d1k(k+d2d1)=O(kd1)N(d,k)=\frac{2k+d-1}{k}{\binom{k+d-2}{d-1}}=\mathcal{O}(k^{d-1}). For dkd\gg k, we have N(d,k)=2k+d1k(k+d2d1)=O(dk)N(d,k)=\frac{2k+d-1}{k}{\binom{k+d-2}{d-1}}=\mathcal{O}(d^{k}). This completes the proof. ∎

Appendix C Appendix C: Proof of Lemmas in Appendix B

Moreover, since VrkVrk=AA+BB\mathbf{V}_{r_{k}}^{\top}\mathbf{V}_{r_{k}}=\mathbf{A}^{\top}\mathbf{A}+\mathbf{B}^{\top}\mathbf{B}, by Lemma 4.1 we have

Plugging in the definition of ξ1\xi_{1} and ξ2\xi_{2} completes the proof. ∎

C.2 Proof of Lemma B.4

The following lemma is a direct application of Proposition 1 in Smale and Zhou (2009) or Proposition 10 in Rosasco et al. (2010). It bounds the difference between the eigenvalues of NTK and their finite-width counterparts.

For any δ>0\delta>0, with probability at least 1δ1-\delta, λiλ^iO(log(1/δ)/n)|\lambda_{i}-\widehat{\lambda}_{i}|\leq\mathcal{O}(\sqrt{\log(1/\delta)/n}) for i[n]i\in[n].

The following lemma gives a recursive formula with is key to the proof of Lemma B.4.

Suppose that the iterates of gradient descent W(0),,W(t)\mathbf{W}^{(0)},\ldots,\mathbf{W}^{(t)} are inside the ball B(W(0),ω)\mathcal{B}(\mathbf{W}^{(0)},\omega). If ωO([log(m)]3/2)\omega\leq\mathcal{O}([\log(m)]^{-3/2}), then with probability at least 1O(n2)exp[Ω(mω2/3)]1-\mathcal{O}(n^{2})\cdot\exp[-\Omega(m\omega^{2/3})],

for all t=0,,t1{t^{\prime}}=0,\ldots,t-1, where y=(y1,,yn)\mathbf{y}=(y_{1},\ldots,y_{n})^{\top}, y^(t)=θ(fW(t)(x1),,fW(t)(xn))\widehat{\mathbf{y}}^{({t^{\prime}})}=\theta\cdot(f_{\mathbf{W}^{({t^{\prime}})}}(\mathbf{x}_{1}),\ldots,f_{\mathbf{W}^{({t^{\prime}})}}(\mathbf{x}_{n}))^{\top}.

We also have the following lemma, which provides a uniform bound of the neural network function value over B(W(0),ω)\mathcal{B}(\mathbf{W}^{(0)},\omega).

Suppose that mΩ(ω2/3log(n/δ))m\geq\Omega(\omega^{-2/3}\log(n/\delta)) and ωO([log(m)]3)\omega\leq\mathcal{O}([\log(m)]^{-3}). Then with probability at least 1δ1-\delta, fW(xi)O(log(n/δ)+ωm)|f_{\mathbf{W}}(\mathbf{x}_{i})|\leq\mathcal{O}(\sqrt{\log(n/\delta)}+\omega\sqrt{m}) for all WB(W(0),ω)\mathbf{W}\in\mathcal{B}(\mathbf{W}^{(0)},\omega) i[n]i\in[n].

Denote u(t)=yy^(t)\mathbf{u}^{(t)}=\mathbf{y}-\widehat{\mathbf{y}}^{(t)}, tTt\in{T}. Then we have

where the first inequality follows by Lemma C.2, and the second inequality follows by Lemma C.3. Therefore we have

for t=0,,t{t^{\prime}}=0,\ldots,t. This completes the proof of (B.1). Similarly, we have

for t=0,,t1{t^{\prime}}=0,\ldots,t-1, where the third inequality is by Lemma C.1 and the assumption that ωO~(λrk3)\omega\leq\widetilde{\mathcal{O}}(\lambda_{r_{k}}^{3}), nO~(λrk2)n\geq\widetilde{\mathcal{O}}(\lambda_{r_{k}}^{-2}), and the fourth inequality is by (B.1). Therefore we have

This completes the proof of (B.2). Finally, for (B.3), by assumption we have ω1/3ηmθ2O~(1)\omega^{1/3}\eta m\theta^{2}\leq\widetilde{\mathcal{O}}(1). Therefore

for t=0,,t1{t^{\prime}}=0,\ldots,t-1, where the third inequality is by Lemma C.1 and the assumption that ωO~(λrk3)\omega\leq\widetilde{\mathcal{O}}(\lambda_{r_{k}}^{3}), and the fourth inequality follows by (B.1). Therefore we have

Appendix D Appendix D: Proof of lemmas in Appendix C

There exists an absolute constant κ\kappa such that, with probability at least 1O(n)exp[Ω(mω2/3)]1-\mathcal{O}(n)\cdot\exp[-\Omega(m\omega^{2/3})] over the randomness of W(0)\mathbf{W}^{(0)}, for all i[n]i\in[n] and W,WB(W(0),ω)\mathbf{W},\mathbf{W}^{\prime}\in\mathcal{B}(\mathbf{W}^{(0)},\omega) with ωκ[log(m)]3/2\omega\leq\kappa[\log(m)]^{-3/2}, it holds uniformly that

If ωO([log(m)]3/2)\omega\leq\mathcal{O}([\log(m)]^{-3/2}), then with probability at least 1O(n)exp[Ω(mω2/3)]1-\mathcal{O}(n)\cdot\exp[-\Omega(m\omega^{2/3})],

for all WB(W(0),ω)\mathbf{W}\in\mathcal{B}(\mathbf{W}^{(0)},\omega) and i[n]i\in[n].

The gradient descent update formula gives

For any j[n]j\in[n], subtracting W(t)\mathbf{W}^{(t)} and applying inner product with θWfW(t)(xj)\theta\nabla_{\mathbf{W}}f_{\mathbf{W}^{(t)}}(\mathbf{x}_{j}) gives

with probability at least 1O(n)exp[Ω(mω2/3)]1-\mathcal{O}(n)\cdot\exp[-\Omega(m\omega^{2/3})]. For I2,j,tI_{2,j,t}, by Bernstein inequality and union bound, with probability at least 1O(n2)exp(Ω(mω2/3))1-\mathcal{O}(n^{2})\cdot\exp(-\Omega(m\omega^{2/3})), we have

where the first inequality follows by Lemmas D.1, the second inequality is obtained from (D.1), and the third inequality follows by Lemma B.3. Setting the jj-th entry of e(t)\mathbf{e}^{(t)} as I1,j,t+I2,j,t+I3,j,tI_{1,j,t}+I_{2,j,t}+I_{3,j,t} and writing (D.2) into matrix form completes the proof. ∎

D.2 Proof of Lemma C.3

where the last inequality is by the assumption ω[log(m)]3\omega\leq[\log(m)]^{-3}. Applying triangle inequality and Lemma B.2 then gives

Appendix E Appendix E: Proof of lemmas in Appendix D

The following lemma is given in Lemma 8.2 in Allen-Zhu et al. (2019).

If ωO([log(m)]3/2)\omega\leq\mathcal{O}([\log(m)]^{-3/2}), then with probability at least 1O(n)exp[Ω(mω2/3)]1-\mathcal{O}(n)\cdot\exp[-\Omega(m\omega^{2/3})], DiDi(0)0O(ω2/3m)\|\mathbf{D}_{i}-\mathbf{D}_{i}^{(0)}\|_{0}\leq\mathcal{O}(\omega^{2/3}m) for all WB(W(0),ω)\mathbf{W}\in\mathcal{B}(\mathbf{W}^{(0)},\omega), i[n]i\in[n].

By Lemma 7.4 in Allen-Zhu et al. (2019) and Lemma E.1, with probability at least 1nexp[Ω(m)]1-n\cdot\exp[-\Omega(m)], mW2(0)(Di(0)Di)FO(ω1/3m)\sqrt{m}\cdot\|\mathbf{W}_{2}^{(0)}(\mathbf{D}_{i}^{(0)}-\mathbf{D}_{i})\|_{F}\leq\mathcal{O}(\omega^{1/3}\sqrt{m}) for all i[n]i\in[n]. Moreover, clearly (W2(0)W2)Di2W2(0)W22ω\|(\mathbf{W}_{2}^{(0)}-\mathbf{W}_{2})\mathbf{D}_{i}\|_{2}\leq\|\mathbf{W}_{2}^{(0)}-\mathbf{W}_{2}\|_{2}\leq\omega. Therefore

for all i[n]i\in[n]. This proves the bound for the first layer gradients. For the second layer gradients, we have

It therefore follows by the 11-Lipschitz continuity of σ()\sigma(\cdot) that

This completes the proof of the first inequality. The second inequality directly follows by triangle inequality and Lemma B.3:

Appendix F Appendix F: Additional Experimental Results

As is discussed in the main paper, when using freshly sampled points, we are actually estimating the projection length of residual function r(x)=f(x)θfW(t)(x)r(\mathbf{x})=f^{*}(\mathbf{x})-\theta f_{\mathbf{W}^{(t)}}(\mathbf{x}) onto the given Gegenbauer polynomial Pk(x)P_{k}(\mathbf{x}). Here we present in Figure 7 a comparison between the projection length using training data and that using test data. An interesting phenomenon is that the network generalizes well on the lower-order Gegenbauer polynomial and along the highest-order Gegenbauer polynomial the network suffers overfitting.

References