Towards Understanding the Spectral Bias of Deep Learning
Yuan Cao, Zhiying Fang, Yue Wu, Ding-Xuan Zhou, Quanquan Gu
Introduction
Over-parameterized neural networks have achieved great success in many applications such as computer vision (He et al., 2016), natural language processing (Collobert and Weston, 2008) and speech recognition (Hinton et al., 2012). It has been shown that over-parameterized neural networks can fit complicated target function or even randomly labeled data (Zhang et al., 2017) and still exhibit good generalization performance when trained with real labels. Intuitively, this is at odds with the traditional notion of generalization ability such as model complexity. In order to understand neural network training, a line of work (Soudry et al., 2018; Gunasekar et al., 2018b, a) has made efforts in the perspective of “implicit bias”, which states that training algorithms for deep learning implicitly pose an inductive bias onto the training process and lead to a solution with low complexity measured by certain norms in the parameter space of the neural network.
Among many attempts to establish implicit bias, Rahaman et al. (2019) pointed out an intriguing phenomenon called spectral bias, which says that during training, neural networks tend to learn the components of lower complexity faster. Similar observation has also been pointed out in Xu et al. (2019b, a). The concept of spectral bias is appealing because this may intuitively explain why over-parameterized neural networks can achieve a good generalization performance without overfitting. During training, the networks fit the low complexity components first and thus lie in the concept class of low complexity. Arguments like this may lead to rigorous guarantee for generalization.
Great efforts have been made in search of explanations about the spectral bias. Rahaman et al. (2019) evaluated the Fourier spectrum of ReLU networks and empirically showed that the lower frequencies are learned first; also lower frequencies are more robust to random perturbation. Andoni et al. (2014) showed that for a sufficiently wide two-layer network, gradient descent with respect to the second layer can learn any low degree bounded polynomial. Xu (2018) provided Fourier analysis to two-layer networks and showed similar empirical results on one-dimensional functions and real data. Nakkiran et al. (2019) used information theoretical approach to show that networks obtained by stochastic gradient descent can be explained by a linear classifier during early training. These studies provide certain explanations about why neural networks exhibit spectral bias in real tasks. But explanations in the theoretical aspect, if any, are to some extent limited. For example, Fourier analysis is usually done in the one-dimensional setting, and thus lacks generality.
Inspired by these works mentioned above in the neural tangent kernel regime, in this paper we study the spectral bias of over-parameterized two-layer neural networks and its connection to the neural tangent kernel. Note that a basic connection between neural tangent kernel and the spectral bias can be indicated by the observation that neural networks evolve as linear models (Jacot et al., 2018; Lee et al., 2019) in the NTK regime, which suggests that some corresponding results for linear models (Advani and Saxe, 2017) can be applied. However, direct combinations of existing techniques cannot show how many hidden nodes can guarantee the learning of simple/complex components, which is the key problem of interest. Therefore, although such combinations of existing results can provide some mathematical intuition, they are not sufficient to explain the spectral bias for neural networks. To give a thorough characterization of spectral bias, we study the training of mildly over-parameterized neural networks. We show that, given a training data set that is generated based on a target function, a fairly narrow network, although cannot fit the training data well due to its limited width, can still learn certain low-complexity components of the target function in the eigenspace corresponding to large eigenvalues of neural tangent kernel. As the width of the network increases, more high-frequency components of the target function can be learned with a slower convergence rate. As a special case, our result implies that when the input data follows uniform distribution on the unit sphere, polynomials of lower degrees can be learned by a narrower neural network at a faster rate. We also conduct experiments to corroborate the theory we establish.
We prove a generic theorem for arbitrary data distributions, which states that under certain sample complexity and over-parameterization conditions, the convergence of the training error along different eigendirections of NTK relies on the corresponding eigenvalues. This theorem gives a more precise control on the regression residual than Su and Yang (2019), where the authors focused on the case when the labeling function is close to the subspace spanned by the first few eigenfunctions.
We present a characterization of the spectra of the neural tangent kernel that is more general than existing results. In particular, we show that when the input data follow uniform distribution over the unit sphere, the eigenvalues of neural tangent kernel are , , with corresponding eigenfunctions being the -th order spherical harmonics. Our result is better than the bound derived in Bietti and Mairal (2019) when , which is in a more practical setting.
We establish a rigorous explanation for the spectral bias based on the aforementioned theoretical results without any specific assumptions on the target function. We show that the error terms from different frequencies are provably controlled by the eigenvalues of the NTK, and the lower-frequency components can be learned with less training examples and narrower networks at a faster convergence rate.
Related Work
A few theoretical results have been established towards understanding the spectra of neural tangent kernels. To name a few, Bach (2017) studied two-layer ReLU networks by relating it to kernel methods, and proposed a harmonic decomposition for the functions in the reproducing kernel Hilbert space which we utilize in our proof. Based on the technique in Bach (2017), Bietti and Mairal (2019) studied the eigenvalue decay of integrating operator defined by the neural tangent kernel on unit sphere by using spherical harmonics. Vempala and Wilmes (2019) calculated the eigenvalues of neural tangent kernel corresponding to two-layer neural networks with sigmoid activation function. Basri et al. (2019) established similar results as Bietti and Mairal (2019), but considered the case of training the first layer parameters of a two-layer networks with bias terms. Yang and Salman (2019) studied the the eigenvalues of integral operator with respect to the NTK on Boolean cube by Fourier analysis. Very recently, Chen and Xu (2020) studied the connection between NTK and Laplacian kernels. Bordelon et al. (2020) gave a spectral analysis on the generalization error of NTK-based kernel ridge regression. Basri et al. (2020) studied the convergence of full training residual with a focus on one-dimensional, non-uniformly distributed data.
A series of papers (Gunasekar et al., 2017; Soudry et al., 2018; Gunasekar et al., 2018a, b; Nacson et al., 2018; Li et al., 2018; Jacot et al., 2020) studied implicit bias problem, aiming to figure out when there are multiple optimal solutions of a training objective function, what kind of nice properties the optimal found by a certain training algorithm would have. Implicit bias results of gradient descent, stochastic gradient descent, or mirror descent for various problem settings including matrix factorization, logistic regression, deep linear networks as well as homogeneous models. The major difference between these results and our work is that implicit bias results usually focus on the parameter space, while we study the functions a neural network prefer to learn in the function space.
Preliminaries
In this section we introduce the basic problem setup including the neural network structure and the training algorithm, as well as some background on the neural tangent kernel proposed recently in Jacot et al. (2018) and the corresponding integral operator.
2 Problem Setup
Here we introduce the basic problem setup. We consider two-layer fully connected neural networks of the form
We first randomly initialize the parameters of the network, and run gradient descent for both layers. We present our detailed neural network training algorithm in Algorithm 1.
The initialization scheme for given in Algorithm 1 is known as He initialization (He et al., 2015). It is consistent with the initialization scheme used in Cao and Gu (2019).
3 Neural Tangent Kernel
Many attempts have been made to study the convergence of gradient descent assuming the width of the network is extremely large (Du et al., 2019b; Li and Liang, 2018). When the width of the network goes to infinity, with certain initialization on the model weights, the limit of inner product of network gradients defines a kernel function, namely the neural tangent kernel (Jacot et al., 2018). In this paper, we denote the neural tangent kernel as
For two-layer networks, standard concentration results gives
Since we apply gradient descent to both layers, the neural tangent kernel is the sum of the two different kernel functions and clearly it can be reduced to one layer training setting. These two kernels are arc-cosine kernels of degree 0 and 1 (Cho and Saul, 2009), which are given as , , where
4 Integral Operator
It has been pointed out in Cho and Saul (2009) that arc-cosine kernels are positive semi-definite. Thus the kernel function defined by (3.1) is positive semi-definite being a product and a sum of positive semi-definite kernels. Clearly this kernel is also continuous and symmetric, which implies that the neural tangent kernel is a Mercer kernel.
Main Results
In this section we study the convergence of Algorithm 1. Instead of studying the standard convergence of loss function value, we provide a refined analysis on the speed of convergence along different directions defined by the eigenfunctions of . We first introduce the following notations.
Denote and for . Then Lemma 4.1 shows that the convergence rate of roughly represents the speed gradient descent learns the components of the target function corresponding to the first eigenvalues. The following theorem gives the convergence guarantee of .
Theorem 4.2 shows that the convergence rate of is determined by the -th eigenvalue . This reveals the spectral bias of neural network training under the NTK regime. Specifically, as long as the network is wide enough and the sample size is large enough, gradient descent first learns the target function along the eigendirections of neural tangent kernel with larger eigenvalues, and then learns the rest components corresponding to smaller eigenvalues. Moreover, by showing that learning the components corresponding to larger eigenvalues can be done with smaller sample size and narrower networks, our theory pushes the study of neural networks in the NTK regime towards a more practical setting. For these reasons, we believe that Theorem 4.2 to certain extent provides an explanation of the empirical observations given in Rahaman et al. (2019), and demonstrates that the difficulty of a function to be learned by neural network can be characterized in the eigenspace of neural tangent kernel: if the target function has a component corresponding to a small eigenvalue of neural tangent kernel, then learning this function takes longer time, and requires more examples and wider networks.
Note that the results in this paper are all in the “neural tangent kernel regime” (Jacot et al., 2018) or the “lazy training regime”(Chizat et al., 2019). Therefore, our results share certain common limitations of NTK-type results discussed in (Chizat et al., 2019; Allen-Zhu and Li, 2019). However, we believe the study of spectral bias in the NTK regime is still an important research direction. It is worth noting that Theorem 4.2 is not based on the common NTK-type assumption that the target function belongs to the NTK-induced RKHS (Arora et al., 2019a; Cao and Gu, 2019; Ji and Telgarsky, 2020). Instead, Theorem 4.2 works for arbitrary labeling of the data, and is therefore rather general. We believe the characterization of spectral bias under general settings beyond NTK regime is an important future work direction.
Our work follows the same intuition as recent results studying the residual dynamics of over-parameterized two-layer neural networks (Arora et al., 2019a; Su and Yang, 2019). Compared with Su and Yang (2019), the major difference is that while Su and Yang (2019) studied the full residual and required that the target function lies approximately in the eigenspace of large eigenvalues of the neural tangent kernel, our result in Theorem 4.2 works for arbitrary target function, and shows that even if the target function has very high frequency components, its components in the eigenspace of large eigenvalues can still be learned very efficiently by neural networks. We note that although the major results in Arora et al. (2019a) are presented in terms of the full residual, certain part of their proof in Arora et al. (2019a) can indicate the convergence of projected residual. However, Arora et al. (2019a) do not build any quantitative connection between the Gram matrix and kernel function. Since the eigenvalues and eigenfunctions of NTK Gram matrix depend on the exact realizations of the training samples, they are not directly tractable for the study of spectral bias. Moreover, Arora et al. (2019a) focus on the setting where the network is wide enough to guarantee global convergence, while our result works for narrower networks for which global convergence may not even be possible. More recently, Bordelon et al. (2020) studied the solution of NTK-based kernel ridge regression. Compared with thier result, our analysis is directly on the practical neural network training procedure, and specifies the width requirement for a network to successfully learn a certain component of the target function. Another recent work by Basri et al. (2020) provided theoretical analysis for one-dimensional non-uniformly distributed data, and only studied the convergence of the full residual vector. In comparison, our results cover high-dimensional data, and provide a more detailed analysis on the convergence of different projections of the residual.
2 Spectral Analysis of NTK for Uniform Distribution
We now study the case when the data inputs are uniformly distributed over the unit sphere as an example where the eigendecompositon of NTK can be calculated. We present our results (an extension of Proposition 5 in Bietti and Mairal (2019)) of spectral analysis of neural tangent kernel in the form of a Mercer decomposition, which explicitly gives the eigenvalues and eigenfunctions of NTK.
where for are linearly independent spherical harmonics of degree in variables with and orders of are given by
3 Convergence for Uniformly Distributed Data
In this subsection, we combine our results in the previous two subsections and give explicit convergence rate for uniformly distributed data on the unit sphere.
where and with being a set of orthonomal spherical harmonics of degrees up to .
where and with being a set of orthonomal spherical harmonics of degrees up to .
Corollaries 4.6 and 4.7 further illustrate the spectral bias of neural networks by providing exact calculations of , and in Theorem 4.2. They show that if the input distribution is uniform over unit sphere, then spherical harmonics with lower degrees are learned first by wide neural networks.
In Corollaries 4.6 and 4.7, the conditions on and depend exponentially on either or . We would like to emphasize that such exponential dependency is reasonable and unavoidable. Take the setting as an example. The exponential dependency in is a natural consequence of the fact that in high dimensional space, there are a large number of linearly independent low-degree polynomials. When only data points are used for training, it is only reasonable to expect to learn less than independent components of the true function. Therefore it is unavoidable to assume
Similar arguments can apply to the setting.
Experiments
In this section we present experimentral results to verify our theory. Across all tasks, we train a two-layer neural networks with 4096 hidden neurons and initialize it exactly as defined in the problem setup. The optimization method is vanilla gradient descent, and the training sample size for all results is .
We first experimentally verify our theoretical results by learning linear combinations of spherical harmonics with data inputs uniformly distributed over unit sphere. Define
Following our theoretical analysis, we denote . By Lemma 4.2 ’s are almost orthonormal. So we define the (approximate) projection length of residual onto at step as , where and is the neural network function.
The results are shown in Figure 1. It can be seen that the residual at the lowest frequency () converges to zero first and then the second lowest (). The highest frequency component is the last one to converge. Following the setting of Rahaman et al. (2019) we assign high frequencies a larger scale in Figure 1 (b), expecting that larger scale will introduce a better descending speed. Still, the low frequencies are learned first.
Note that is the projection length onto an approximate vector. In the function space, we can also project the residual function onto the orthonormal Gegenbauer functions . Replacing the training data with randomly sampled data points can lead to a random estimate of the projection length in function space. Experiments in this setting can be found in the appendix.
We also verify the linear convergence rate proved in Theorem 4.2. Figure 2 presents the convergence curve in logarithmic scale. We can see from Figure 2 that the convergence of different projection length is close to linear convergence, which is well-aligned with our theoretical analysis. Note that we performed a moving average of range on these curves to avoid the heavy oscillation especially at late stage.
2 Learning Functions of Simple Forms
Apart from the synthesized low frequency function, we also show the dynamics of more general functions’ projection to . These functions, though in a simple form, have non-zero high-frequency components. In this subsection we show our results still apply when all frequencies exist in the target functions, which are given by and , where is a fixed unit vector.
Figure 3 verifies the result of Theorem 4.2 with more general target functions, and backs up our claim that Theorem 4.2 does not make any assumptions on the target function. Notice that in the early training stage, not all the curves monotonically descend. We believe this is due to the unseen components’ influence on the gradient. Again, as the training proceeds, the residual projections converge at the predicted rates.
3 Non-uniform Input Data Distributions
In this subsection, we provide experimental results for non-uniformly distributed input data. Note that the eigendecomposition of NTK for general multi-dimensional input distributions do not necessarily have good analytic forms. Therefore, here we treat the non-uniform distribution of the input data as model misspecification, and test whether residual projections onto spherical harmonics of different degrees can still exhibit various convergence speed. We consider three examples of non-uniform distributions: (i) piece-wise uniform distribution, (ii) normalized non-isotropic Gaussian, and (iii) normalized Gaussian mixture.
Piece-wise uniform distribution We divide the unit sphere into two semi-spheres along a randomly drawn direction . A data input is then generated as follows: with probability , draw the input uniformly over the first unit sphere; with probability , draw the input uniformly over the second unit sphere. The results are shown in Figure 4.
Normalized Gaussian mixture The inputs are drawn from a mixture of 3 non-isotropic Gaussian distributions described above in the second example. The results are shown in Figure 6.
Figures 4, 5, 6 show that the residual components corresponding to lower order polynomials are still learned relatively faster, even for the non-uniform distributions described above, where spherical harmonics are no longer exactly the eigenfunctions of NTK. This suggests that our theoretical results can tolerate certain level of model misspecification, and therefore the spectral bias characterized in Theorem 4.2 and Corollaries 4.6, 4.7 holds in a variety of problem settings.
Conclusion
Appendix A Appendix A: A Review on Spherical Harmonics
In this section, we give a brief review on relevant concepts in spherical harmonics. For more detials, see Bach (2017); Bietti and Mairal (2019); Frye and Efthimiou (2012); Atkinson and Han (2012) for references.
where is the Legendre polynomial of degree in dimensions, explicitly given by (Rodrigues’ formula)
We can also see that , the Legendre polynomial of degree shares the same parity with . By the orthogonality and the addition formula (A.1) we have,
The following recurrence relation holds for Legendre polynomials:
for and for .
The Hecke-Funk formula is given for a spherical harmonic of degree as follows.
Appendix B Appendix B: Proof of Main Theorems
Here we provide the proof of Lemma 4.1, which is based on direct application of concentration inequalities.
where are absolute constants. Applying a union bound over all finishes the proof. ∎
B.2 Proof of Theorem 4.2
In this section we give the proof of Theorem 4.2. The core idea of our proof is to establish connections between neural network gradients throughout training and the neural tangent kernel. To do so, we first introduce the following definitions and notations.
Define , . Let , be the eigenvalues of , and be the corresponding eigenvectors. Set , . For notation simplicity, we denote \nabla_{\mathbf{W}}f_{\mathbf{W}^{(0)}}(\mathbf{x}_{i})=[\nabla_{\mathbf{W}}f_{\mathbf{W}}(\mathbf{x}_{i})]\big{|}_{\mathbf{W}=\mathbf{W}^{(0)}}, \nabla_{\mathbf{W}_{l}}f_{\mathbf{W}^{(0)}}(\mathbf{x}_{i})=[\nabla_{\mathbf{W}_{l}}f_{\mathbf{W}}(\mathbf{x}_{i})]\big{|}_{\mathbf{W}=\mathbf{W}^{(0)}}, .
The following lemma’s purpose is to further connect the eigenfunctions of NTK with their finite-width, finite-sample counterparts. Its first statement is proved in Su and Yang (2019).
Suppose that for all . There exist absolute constants , such that for any and integer with , if , then with probability at least ,
The following two lemmas gives some preliminary bounds on the function value and gradients of the neural network around random initialization. They are proved in Cao and Gu (2019).
For any , if for a large enough absolute constant , then with probability at least , for all .
There exists an absolute constant such that, with probability at least , for all , and with , it holds uniformly that
The following lemma is the key to characterize the dynamics of the residual throughout training. These bounds in Lemma B.4 are the ones that distinguish our analysis from previous works on neural network training in the NTK regime (Du et al., 2019b; Su and Yang, 2019), since our analysis provides more careful characterization on the residual along different directions.
Suppose that the iterates of gradient descent are inside the ball . If and , then with probability at least ,
Now we are ready to prove Theorem 4.2. To illustrate the intuition behind the proof, the proof can be mainly summarized into three steps: (i) Using a induction argument together with Lemma B.4 to show that under the theorem assumptions, all gradient descent iterates are inside the ball with . (ii) We then reapply Lemma B.4 with and obtain a bound on . (iii) At last, we utilize the concentration results given in Lemma B.1 to finalize the proof. The detaled proof is given as follows.
We first show that all the iterates are inside the ball . We prove this result by inductively show that , . First of all, it is clear that . Suppose that . Then the results of Lemmas B.3 and B.4 hold for . Denote , . Then we have
where the second inequality follows by Lemma B.3. By Lemma B.4, we have
Now by , and the assumption that , we obtain
By Lemma 4.1, and the assumptions , the eigenvalues of are all between and . Therefore we have
where the second inequality follows by Lemma B.1 and the fact . Similarly,
By Lemma 4.1, with probability at least , . Combining this result with Lemma B.2 gives . Plugging the above estimates into (B.4) gives
Applying union bounds completes the proof. ∎
B.3 Proof of Theorem 4.3
where we apply the Hecke-Funk formula (A.4) and addition formula (A.1). Denote
For two kernel function defined in (3.2), we have
The first equality holds because is 0-homogeneous function and the second equality is true since the normalized direction of a multivariate Gaussian random variable satisfies uniform distribution on the unit sphere. Similarly we can derive
By combining (A.2), (B.7), (B.8), (B.9) and (B.10), we can get
Comparing (B.6), (B.11) and (B.12), we can easily show that
In Bach (2017), for , explicit expressions of for are presented as follows:
By Stirling formula , we have following expression of for
where . Also , and . Thus combine (B.13) we know that . By considering (A.3) and (B.6), we have
for and . From the discussion above, we thus know exactly that for
B.4 Proof of Corollaries 4.6 and 4.7
Thus we know that . For , we have . For , we have . This completes the proof. ∎
Appendix C Appendix C: Proof of Lemmas in Appendix B
Moreover, since , by Lemma 4.1 we have
Plugging in the definition of and completes the proof. ∎
C.2 Proof of Lemma B.4
The following lemma is a direct application of Proposition 1 in Smale and Zhou (2009) or Proposition 10 in Rosasco et al. (2010). It bounds the difference between the eigenvalues of NTK and their finite-width counterparts.
For any , with probability at least , for .
The following lemma gives a recursive formula with is key to the proof of Lemma B.4.
Suppose that the iterates of gradient descent are inside the ball . If , then with probability at least ,
for all , where , .
We also have the following lemma, which provides a uniform bound of the neural network function value over .
Suppose that and . Then with probability at least , for all .
Denote , . Then we have
where the first inequality follows by Lemma C.2, and the second inequality follows by Lemma C.3. Therefore we have
for . This completes the proof of (B.1). Similarly, we have
for , where the third inequality is by Lemma C.1 and the assumption that , , and the fourth inequality is by (B.1). Therefore we have
This completes the proof of (B.2). Finally, for (B.3), by assumption we have . Therefore
for , where the third inequality is by Lemma C.1 and the assumption that , and the fourth inequality follows by (B.1). Therefore we have
Appendix D Appendix D: Proof of lemmas in Appendix C
There exists an absolute constant such that, with probability at least over the randomness of , for all and with , it holds uniformly that
If , then with probability at least ,
for all and .
The gradient descent update formula gives
For any , subtracting and applying inner product with gives
with probability at least . For , by Bernstein inequality and union bound, with probability at least , we have
where the first inequality follows by Lemmas D.1, the second inequality is obtained from (D.1), and the third inequality follows by Lemma B.3. Setting the -th entry of as and writing (D.2) into matrix form completes the proof. ∎
D.2 Proof of Lemma C.3
where the last inequality is by the assumption . Applying triangle inequality and Lemma B.2 then gives
Appendix E Appendix E: Proof of lemmas in Appendix D
The following lemma is given in Lemma 8.2 in Allen-Zhu et al. (2019).
If , then with probability at least , for all , .
By Lemma 7.4 in Allen-Zhu et al. (2019) and Lemma E.1, with probability at least , for all . Moreover, clearly . Therefore
for all . This proves the bound for the first layer gradients. For the second layer gradients, we have
It therefore follows by the -Lipschitz continuity of that
This completes the proof of the first inequality. The second inequality directly follows by triangle inequality and Lemma B.3:
Appendix F Appendix F: Additional Experimental Results
As is discussed in the main paper, when using freshly sampled points, we are actually estimating the projection length of residual function onto the given Gegenbauer polynomial . Here we present in Figure 7 a comparison between the projection length using training data and that using test data. An interesting phenomenon is that the network generalizes well on the lower-order Gegenbauer polynomial and along the highest-order Gegenbauer polynomial the network suffers overfitting.