On the Spectral Bias of Neural Networks

Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred A. Hamprecht, Yoshua Bengio, Aaron Courville

stat.ML cs.LG

Introduction

The remarkable success of deep neural networks at generalizing to natural data is at odds with the traditional notions of model complexity and their empirically demonstrated ability to fit arbitrary random data to perfect accuracy (Zhang et al., 2017a; Arpit et al., 2017). This has prompted recent investigations of possible implicit regularization mechanisms inherent in the learning process which induce a bias towards low complexity solutions (Neyshabur et al., 2014; Soudry et al., 2017; Poggio et al., 2018; Neyshabur et al., 2017).

In this work, we take a slightly shifted view on implicit regularization by suggesting that low-complexity functions are learned faster during training by gradient descent. We expose this bias by taking a closer look at neural networks through the lens of Fourier analysis. While they can approximate arbitrary functions, we find that these networks prioritize learning the low frequency modes, a phenomenon we call the spectral bias. This bias manifests itself not just in the process of learning, but also in the parameterization of the model itself: in fact, we show that the lower frequency components of trained networks are more robust to random parameter perturbations. Finally, we also expose and analyze the rather intricate interplay between the spectral bias and the geometry of the data manifold by showing that high frequencies get easier to learn when the data lies on a lower-dimensional manifold of complex shape embedded in the input space of the model. We focus the discussion on networks with rectified linear unit (ReLU) activations, whose continuous piece-wise linear structure enables an analytic treatment.

We exploit the continuous piecewise-linear structure of ReLU networks to evaluate its Fourier spectrum (Section 2).

We find empirical evidence of a spectral bias: i.e. lower frequencies are learned first. We also show that lower frequencies are more robust to random perturbations of the network parameters (Section 3).

We study the role of the shape of the data manifold: we show how complex manifold shapes can facilitate the learning of higher frequencies and develop a theoretical understanding of this behavior (Section 4).

Fourier analysis of ReLU networks

ReLU networks are known to be continuous piece-wise linear (CPWL) functions, where the linear regions are convex polytopes (Raghu et al., 2016; Montufar et al., 2014; Zhang et al., 2018; Arora et al., 2018). Remarkably, the converse also holds: every CPWL function can be represented by a ReLU network (Arora et al., 2018, Theorem 2.1), which in turn endows ReLU networks with universal approximation properties. Given the ReLU network $f$ from Eqn. 1, we can make the piecewise linearity explicit by writing,

where $\epsilon$ is an index for the linear regions $P_{\epsilon}$ and $1_{P_{\epsilon}}$ is the indicator function on $P_{\epsilon}$ . As shown in Appendix B in more detail, each region corresponds to an activation patternWe adopt the terminology of Raghu et al. (2016); Montufar et al. (2014). of all hidden neurons of the network, which is a binary vector with components conditioned on the sign of the input of the respective neuron. The $1\times d$ matrix $W_{\epsilon}$ is given by

where $W^{(k)}_{\epsilon}$ is obtained from the original weight $W^{(k)}$ by setting its $j^{th}$ column to zero whenever the neuron $j$ of the $k^{th}$ layer is inactive.

2 Fourier Spectrum

The Fourier transform of ReLU networks decomposes as,

The Fourier transform of the indicator over linear regions appearing in Eqn. 4 are fairly intricate mathematical objects. Diaz et al. (2016) develop an elegant procedure for evaluating it in arbitrary dimensions via a recursive application of Stokes theorem. We describe this procedure in detailWe also generalize the construction to tempered distributions. in Appendix C.2, and present here its main corollary.

Lemmas 1, 2 together yield the main result of this section.

The Fourier components of the ReLU network $f_{\theta}$ with parameters $\theta$ is given by the rational function:

Note that Eqn 6 applies to general ReLU networks with arbitrary width and depthSymmetries that might arise due to additional assumptions can be used to further develop Eqn 6, see e.g. Eldan & Shamir (2016) for 2-layer networks..

Second, the numerator in Eqn 6 is bounded by $N_{f}L_{f}$ (cf. Appendix C.3), where $N_{f}$ is the number of linear regions and $L_{f}=\max_{\epsilon}\|W_{\epsilon}\|$ is the Lipschitz constant of the network. Further, the Lipschitz constant $L_{f}$ can be bounded as (cf. Appendix C.6):

where $\|\cdot\|$ is the spectral norm and $\|\cdot\|_{\infty}$ the max norm, and $d_{k}$ is the number of units in the $k$ -th layer. This makes the bound on $L_{f}$ scale exponentially in depth and polynomial in width. As for the number $N_{f}$ of linear regions, Montufar et al. (2014) and Raghu et al. (2016) obtain tight bounds that exhibit the same scaling behaviour (Raghu et al., 2016, Theorem 1). In Appendix A.5, we qualitatively ablate over the depth and width of the network to expose how this reflects on the Fourier spectrum of the network.

Lower Frequencies are Learned First

We now present experiments showing that networks tend to fit lower frequencies first during training. We refer to this phenomenon as the spectral bias, and discuss it in light of the results of Section 2.

Discussion . Multiple theoretical aspects may underlie these observations. First, for a fixed architecture, recall that the numerator in Theorem 1 isThe tightness of this bound is verified empirically in appendix A.5. $\mathcal{O}(L_{f})$ (where $L_{f}$ is the Lipschitz constant of the function). However, $L_{f}$ is bounded by the parameter norm, which can only increase gradually during training by gradient descent. This leads to the higher frequencies being learnedThis assumes that the Lipschitz constant of the (noisy) target function is larger than that of the network at initialization. late in the optimization process. To confirm that the bound indeed increases as the model fits higher frequencies, we plot in Fig 1 the spectral norm of weights of each layer during training for both cases of constant and increasing amplitudes.

where we use the fact that $\nicefrac{{d\theta}}{{dt}}$ is just the learning rate times the parameter gradient of the loss which is independentNote however that the loss term might involve a sum or an integral over all frequencies, but the summation is over a different variable. of $k$ , and assume that the target function $\lambda$ is fixed. Eqn 9 shows that the rate of change of the residual decays with increasing frequency, which is what we find in Experiment 1.

2 Real-Data Experiments

While Experiments 1 and 2 establish the spectral bias by explicitly evaluating the Fourier coefficients, doing so becomes prohibitively expensive for larger $d$ (e.g. on MNIST). To tackle this, we propose the following set of experiments to measure the effect of spectral bias indirectly on MNIST.

In this experiment, we investigate how the validation performance dependent on the frequency of noise added to the training target. We find that the best validation performance on MNIST is particularly insensitive to the magnitude of high-frequency noise, yet it is adversely affected by low-frequency noise. We consider a target (binary) function $\tau_{0}:X\rightarrow\{0,1\}$ defined on the space $X=^{784}$ of MNIST inputs. Samples $\{\mathbf{x}_{i},\tau_{0}(\mathbf{x}_{i})\}_{i}$ form a subset of the MNIST dataset comprising samples $\mathbf{x}_{i}$ belonging to two classes. Let $\psi_{k}(\mathbf{x})$ be a noise function:

corresponding to a radial wave defined on the $784$ -dimensional input spaceThe rationale behind using a radial wave is that it induces oscillations (simultaneously) along all spatial directions. Another viable option is to induce oscillations along the principle axes of the data: we have verified that the key trends of interest are preserved.. The final target function $\tau_{k}$ is then given by $\tau_{k}=\tau_{0}+\beta\psi_{k}$ , where $\beta$ is the effective amplitude of the noise. We fit the same network as in Experiment 1 to the target $\tau_{k}$ with the MSE loss. In the first set of experiments, we ablate over $k$ for a pair of fixed $\beta$ s, while in the second set we ablate over $\beta$ for a pair of fixed $k$ s. In Figure 4, we show the respective validation loss curves, where the validation set is obtained by evaluating $\tau_{0}$ on a separate subset of the data, i.e. $\{\mathbf{x}_{j},\tau_{0}(\mathbf{x}_{j})\}_{j}$ . Figure 11 (in appendix A.3) shows the respective training curves.

Discussion. The profile of the loss curves varies significantly with the frequency of noise added to the target. In Figure 4(a), we see that the validation performance is adversely affected by the amplitude of the low-frequency noise, whereas Figure 4(b) shows that the amplitude of high-frequency noise does not significantly affect the best validation score. This is explained by the fact that the network readily fits the noise signal if it is low frequency, whereas the higher frequency noise is only fit later in the training. In the latter case, the dip in validation score early in the training is when the network has learned the low frequency true target function $\tau_{0}$ ; the remainder of the training is spent learning the higher-frequencies in the training target $\tau$ , as we shall see in the next experiment. Figures 4(c) and 4(d) confirm that the dip in validation score exacerbates for increasing frequency of the noise. Further, we observe that for higher frequencies (e.g. $k=0.5$ ), increasing the amplitude $\beta$ does not significantly degrade the best performance at the dip, confirming that the network is fairly robust to the amplitude of high-frequency noise.

Finally, we note that the dip in validation score was also observed by Arpit et al. (2017) with i.i.d. noiseRecall that i.i.d. noise is white-noise, which has a constant Fourier spectrum magnitude in expectation, i.e. it also contains high-frequency components. in a classification setting.

where $N$ is the number of available samples and $\gamma=2$ . Like in Experiment 3, the target function is given by $\tau=\tau_{0}+\beta\psi$ , and the same network is trained to regress $\tau$ . Figure 5 shows the (generalized) spectrum $\tau$ and $\tau_{0}$ , and that of $f$ as training progresses. Figure 13 (in appendix) shows the corresponding dip in validation loss, where the validation set is same as the training set but with true target function $\tau_{0}$ instead of the noised target $\tau$ .

Discussion. From Figure 5, we learn that the drop in validation score observed in Figure 4 is exactly when the higher-frequencies of the noise signal are yet to be learned. As the network gradually learns the higher frequency eigenfunctions, the validation loss increases while the training loss continues to decrease. Thus these experiments show that the phenomenon of spectral bias persists on non-synthetic data and in high dimensional input spaces.

Not all Manifolds are Learned Equal

In this section, we investigate subtleties that arise when the data lies on a lower dimensional manifold embedded in the higher dimensional input space of the model. We find that the shape of the data-manifold impacts the learnability of high frequencies in a non-trivial way. As we shall see, this is because low frequency functions in the input space may have high frequency components when restricted to lower dimensional manifolds of complex shapes. We demonstrate results in an illustrative minimal settingWe include additional experiments on MNIST and CIFAR-10 in appendices A.6 and A.7., free from unwanted confounding factors, and present a theoretical analysis of the phenomenon.

Our findings from the previous section suggest that neural networks are biased towards expressing a particular subset of such solutions, namely those that are low frequency. It is also worth noting that there exist methods that restrict the space of solutions: notably adversarial training (Goodfellow et al., 2014) and Mixup (Zhang et al., 2017b).

The set-up is similar to that of Experiment 1, and $\lambda$ is as defined in Eqn. 8 with frequencies $\kappa=(20,40,...,180,200)$ , and amplitudes $A_{i}=1\,\forall\,i$ . The model $f$ is trained on the dataset $\{\gamma_{L}(z_{i}),\lambda(z_{i})\}_{i=1}^{N}$ with $N=1000$ uniformly spaced samples $z_{i}$ between and $1$ . The spectrum of $f\circ\gamma_{L}$ in expectation over $\varphi_{i}\sim U(0,2\pi)$ is monitored as training progresses, and the result is shown in Fig 8 for various $L$ . Fig 8(e) shows the corresponding mean squared error curves. More experimental details in appendix A.2.

The results demonstrate a clear attenuation of the spectral bias as $L$ grows. Moreover, Fig 8(e) suggests that the larger the $L$ , the easier the learning task.

Here, we adapt the setting of Experiment 5 to binary classification by simply thresholding the function $\lambda$ at $0.5$ to obtain a binary target signal. To simplify visualization, we only use signals with a single frequency mode $k$ , such that $\lambda(z)=\sin(2\pi kz+\varphi)$ . We train the same network on the resulting classification task with cross-entropy lossWe use Pytorch’s BCEWithLogitsLoss. Internally, it takes a sigmoid of the network’s output (the logits) before evaluating the cross-entropy. for $k\in\{50,100,...,350,400\}$ and $L\in\{0,2,...,18,20\}$ . The heatmap in Fig 9 shows the classification accuracy for each $(k,L)$ pair. Fig 7 shows visualizations of the functions learned by the same network, trained on $(k,L)=(200,20)$ under identical conditions up to random initialization.

Observe that increasing $L$ (i.e. going up a column in Fig 9) results in better (classification) performance for the same target signal. This is the same behaviour as we observed in Experiment 5 (Fig 8a-d), but now with binary cross-entropy loss instead of the MSE.

Discussion. These experiments hint towards a rich interaction between the shape of the manifold and the effective difficulty of the learning task. The key mechanism underlying this phenomenon (as we formalize below) is that the relationship between frequency spectrum of the network $f$ and that of the fit $f\circ\gamma_{L}$ is mediated by the embedding map $\gamma_{L}$ . In particular, we argue that a given signal defined on the manifold is easier to fit when the coordinate functions of the manifold embedding itself has high frequency components. Thus, in our experimental setting, the same signal embedded in a flower with more petals can be captured with lower frequencies of the network.

To understand this mathematically, we address the following questions: given a target function $\lambda$ , how small can the frequencies of a solution $f$ be such that $f\circ\gamma=\lambda$ ? And further, how does this relate to the geometry of the data-manifold ${\mathcal{M}}$ induced by $\gamma$ ? To find out, we write the Fourier transform of the composite function,

This is precisely what Experiments 5 and 6 demonstrate in a minimal setting. From Eqn 4, observe that the coordinate functions have a frequency mode at $L$ . For increasing $L$ , it is apparent that the frequency magnitudes $l$ (in the latent space) that can be expressed with the same frequency $k$ (in the input space) increases with increasing $L$ . This allows the remarkable interpretation that the neural network function can express large frequencies on a manifold ( $l$ ) with smaller frequencies w.r.t its input domain ( $k$ ), provided that the coordinate functions of the data manifold embedding itself has high-frequency components.

Related Work

A number of works have focused on showing that neural networks are capable of approximating arbitrarily complex functions. Hornik et al. (1989); Cybenko (1989); Leshno et al. (1993) have shown that neural networks can be universal approximators when given sufficient width; more recently, Lu et al. (2017) proved that this property holds also for width-bounded networks. Montufar et al. (2014) showed that the number of linear regions of deep ReLU networks grows polynomially with width and exponentially with depth; Raghu et al. (2016) generalized this result and provided asymptotically tight bounds. There have been various results of the benefits of depth for efficient approximation (Poole et al., 2016; Telgarsky, 2016; Eldan & Shamir, 2016). These analysis on the expressive power of deep neural networks can in part explain why over-parameterized networks can perfectly learn random input-output mappings (Zhang et al., 2017a).

Our work more directly follows the line of research on implicit regularization in neural networks trained by gradient descent (Neyshabur et al., 2014; Soudry et al., 2017; Poggio et al., 2018; Neyshabur et al., 2017). In fact, while our Fourier analysis of deep ReLU networks also reflects the width and depth dependence of their expressivity, we focused on showing a learning bias of these networks towards simple functions with dominant lower frequency components. We view our results as a first step towards formalizing the findings of Arpit et al. (2017), where it is empirically shown that deep networks prioritize learning simple patterns of the data during training.

A few other works studied neural networks through the lens of harmonic analysis. For example, Candès (1999) used the ridgelet transform to build constructive procedures for approximating a given function by neural networks, in the case of oscillatory activation functions. This approach has been recently generalized to unbounded activation functions by Sonoda & Murata (2017). Eldan & Shamir (2016) use insights on the support of the Fourier spectrum of two-layer networks to derive a worse-case depth-separation result. Barron (1993) makes use of Fourier space properties of the target function to derive an architecture-dependent approximation bound. In a concurrent and independent work, Xu et al. (2018) make the same observation that lower frequencies are learned first. The subsequent work by Xu (2018) proposes a theoretical analysis of the phenomenon in the case of 2-layer networks with sigmoid activation, based on the spectrum of the sigmoid function.

In light of our findings, it is worth comparing the case of neural networks and other popular algorithms such that kernel machines (KM) and $K$ -nearest neighbor classifiers. We refer to the Appendix E for a detailed discussion and references. In summary, our discussion there suggests that 1. DNNs strike a good balance between function smoothness and expressivity/parameter-efficiency compared with KM; 2. DNNs learn a smoother function compared with $K$ NNs since the spectrum of the DNN decays faster compared with $K$ NNs in the experiments shown there.

Conclusion

We studied deep ReLU networks through the lens of Fourier analysis. Several conclusions can be drawn from our analysis. While neural networks can approximate arbitrary functions, we find that they favour low frequency ones – hence they exhibit a bias towards smooth functions – a phenomenon that we called spectral bias. We also illustrated how the geometry of the data manifold impacts expressivity in a non-trivial way, as high frequency functions defined on complex manifolds can be expressed by lower frequency network functions defined in input space.

We view future work that explore the properties of neural networks in Fourier domain as promising. For example, the Fourier transform affords a natural way of measuring how fast a function can change within a small neighborhood in its input domain; as such, it is a strong candidate for quantifying and analyzing the sensitivity of a model – which in turn provides a natural measure of complexity (Novak et al., 2018). We hope to encourage more research in this direction.

Acknowledgements

The authors would like to thank Joan Bruna, Rémi Le Priol, Vikram Voleti, Ullrich Köthe, Steffen Wolf, Lorenzo Cerrone, Sebastian Damrich, as well as the anonymous reviewers for their valuable feedback.

References

Appendix A Experimental Details

We fit a 6 layer ReLU network with 256 units per layer $f_{\theta}$ to the target function $\lambda$ , which is a superposition of sine waves with increasing frequencies:

A.2 Experiment 5

We use the same 6-layer deep 256-unit wide network and define the target function

where $k_{i}=(20,40,...,180,200)$ , $A_{i}=1\,\forall\,i$ and $\varphi\sim U(0,2\pi)$ . We sample $\phi$ on a grid with 1000 uniformly spaced points between 0 and 1 and map it to the input domain via $\gamma_{L}$ to obtain a dataset $\{(\gamma_{L}(z_{j}),\lambda(z_{j}))\}_{j=0}^{999}$ , on which we train the network with 50000 full-batch gradient descent steps of Adam. On the same 1000-point grid, we evaluate the magnitude of the (single-sided) discrete Fourier transform of $f_{\theta}\circ\gamma_{L}$ every 100 training steps at frequencies $k_{i}$ and average over 10 runs (each with a different set of sampled $z_{i}$ ’s). Fig 8 shows the evolution of the spectrum as training progresses for $L=0,4,10,16$ , and Fig 8(e) shows the corresponding loss curves.

A.3 Experiment 3

In Figure 11, we show the training curves corresponding to Figure 4.

A.4 Experiment 4

Consider the Gaussian Radial Basis Kernel, given by:

where $\varphi_{n}$ is the eigenfunction of $k$ satisfying:

where $f_{i}=f(\mathbf{x}_{i})$ . Now, defining $K$ as the positive definite kernel matrix with elements $K_{ij}=k(\mathbf{x}_{i}\mathbf{x}_{j})$ , we consider it’s eigendecomposition $V\Lambda V^{T}$ where $\Lambda$ is the diagonal matrix of (w.l.o.g sorted) eigenvalues $\lambda_{1}\leq...\leq\lambda_{N}$ and the columns of $V$ are the corresponding eigenvectors. This yields:

where $\mathbf{f}=(f(\mathbf{x}_{1}),...,f(\mathbf{x}_{N}))$ . The value $n$ can be thought of a generalized notion of frequency. Indeed, it is known (Fasshauer, 2011; Rasmussen, 2004), for instance, that the eigenfunctions $\varphi_{n}$ resemble sinusoids with increasing frequencies (for increasing $n$ or decreasing $\lambda_{n}$ ). In Figure 6, we plot the eigenvectors $\mathbf{v}_{0}$ and $\mathbf{v}_{N}$ for $\{\mathbf{x}_{i}\}_{i=1}^{50}$ uniformly spaced between $ $. Further, in Figure ? we evaluate the discrete Fourier transform of all$ N=50 $eigenvectors, and find that the eigenfunction index$ n $does indeed coincide with frequency$ k$. Finally, we remark that the link between signal complexity and the spectrum is extensively studied in (Braun et al., 2006).

A.5 Qualitative Ablation over Architectures

Theorem 1 exposes the relationship between the fourier spectrum of a network and its depth, width and max-norm of parameters. The following experiment is a qualitative ablation study over these variables.

In this experiment, we fit various networks to the $\delta$ -function at $x=0.5$ (see Fig 14(a)). Its spectrum is constant for all frequencies (Fig 14(b)), which makes it particularly useful for testing how well a given network can fit large frequencies. Fig 17 shows the ablation over weight clip (i.e. max parameter max-norm), Fig 15 over depth and Fig 16 over width. Fig 18 exemplarily shows how the network prediction evolves with training iterations. All networks are trained for 60K iterations of full-batch gradient descent under identical conditions (Adam optimizer with $lr=0.0003$ , no weight decay).

Fig 15 shows that increasing the depth (for fixed width) significantly improves the network’s ability to fit higher frequencies (note that the depth increases linearly).

Fig 16 shows that increasing the width (for fixed depth) also helps, but the effect is considerably weaker (note that the width increases exponentially).

Fig 17 shows that increasing the weight clip (or the max parameter max-norm) also helps the network fit higher frequencies.

The above observations are all consistent with Theorem 1, and further show that lower frequencies are learned first (i.e. the spectral bias, cf. Experiment 1). Further, Figure 17 shows that constraining the Lipschitz constant (weight clip) prevents the network from learning higher frequencies, furnishing evidence that the $\mathcal{O}(L_{f})$ bound can be tight.

A.6 MNIST: A Proof of Concept

In the following experiment, we show that given two manifolds of the same dimension – one flat and the other not – the task of learning random labels is harder to solve if the input samples lie on the same manifold. We demonstrate on MNIST under the assumption that the manifold hypothesis is true, and use the fact that the spectrum of the target function we use (white noise) is constant in expectation, and therefore independent of the underlying coordinate system when defined on the manifold.

This result complements the findings of (Arpit et al., 2017) and (Zhang et al., 2017a), which show that it’s easier to fit random labels to random inputs if the latter is defined on the full dimensional input space (i.e. the dimension of the flat manifold is the same as that of the input space, and not that of the underlying data-manifold being used for comparison).

A.7 Cifar-10: It’s All Connected

We have seen that deep neural networks are biased towards learning low frequency functions. This should have as a consequence that isolated bubbles of constant prediction are rare. This in turn implies that given any two points in the input space and a network function that predicts the same class for the said points, there should be a path connecting them such that the network prediction does not change along the path. In the following, we present an experiment where we use a path finding method to find such a path between all Cifar-10 input samples indeed exist.

We only consider pairs of images that belong to the same class $c$ (or, for adversarials, that originate from another class $\neq c$ , but that the model classifies to be of the specified class $c$ ). For each class, we randomly select 50 training images and select a total of 50 random images from all other classes and generate adversarial samples from the latter. Then, paths between all pairs from the whole set of images are computed.

The AutoNEB parameters are chosen as follows: We run four NEB iterations with 10 steps of SGD with learning rate $0.001$ and momentum $0.9$ . This computational budget is similar to that required to compute the adversarial samples. The gradient for each NEB step is computed to maximize the logit output of the ResNet-20 for the specified target class $c$ . We use the formulation of NEB without springs (Draxler et al., 2018).

The result is very clear: We can find paths between all pairs of images for all CIFAR10 labels that do not cross a single decision boundary. This means that all paths belong to the same connected component regarding the output of the DNN. This holds for all possible combinations of images in the above list. Figure 21 shows connecting training to adversarial images and Figure 20 paths between pairs of adversarial images. Paths between training images are not shown, they provide no further insight. Note that the paths are strikingly simple: Visually, they are hard to distinguish from the linear interpolation. Quantitatively, they are essentially (but not exactly) linear, with an average length $(3.0\pm 0.3)\%$ longer than the linear connection.

Appendix B The Continuous Piecewise Linear Structure of Deep ReLU Networks

where $W^{(k)}_{\epsilon}$ is obtained from $W^{(k)}$ by setting its $j$ th column to zero whenever $(\epsilon_{k})_{j}=-1$ .

Appendix C Fourier Analysis of ReLU Networks

Case 1: The function $f$ has compact support. The vector-valued function $\mathbf{k}f(\mathbf{x})e^{i\mathbf{k}\cdot\mathbf{x}}$ is continuous everywhere and has well-defined and continuous gradients almost everywhere. So by Stokes’ theorem (see e.g Spivak (2018)), the integral of its divergence is a pure boundary term. Since we restricted to functions with compact support, the theorem yields

The integrand is $(\mathbf{k}\cdot(\nabla_{\mathbf{x}}f)(\mathbf{x})-ik^{2}f(\mathbf{x}))e^{-i\mathbf{k}\cdot\mathbf{x}}$ , so we deduce,

Now, within each polytope of the decomposition (22), $f$ is affine so its gradient is a constant vector, $\nabla_{\mathbf{x}}f_{\epsilon}=W_{\epsilon}^{T}$ , which gives the desired result (1).

Case 2: The function $f$ does not have compact support. Without the assumption of compact support, the function $f$ is not squared-integrable. The Fourier transform therefore only exists in the sense of distributions, as defined below.

where $1_{P_{\epsilon}}$ is the indicator over $P_{\epsilon}$ and we used $\nabla_{\mathbf{x}}f_{\epsilon}=W_{\epsilon}^{T}$ . It then follows from elementary properties of Schwartz spaces (see e.g. Chapter 16 of Serov (2017)) that:

Together with Eqn 29 and linearity of the Fourier transform, this gives the desired result (1).

C.2 Fourier Transform of Polytopes

1. If $\text{Proj}_{F}(\mathbf{k})=0$ , then $\phi_{\mathbf{k}}(x)=\Phi_{\mathbf{k}}$ is constant on $F$ , and we have:

2. But if $\text{Proj}_{F}(\mathbf{k})\neq 0$ , then:

C.2.2 Discussion

The above theorem provides a recursive relation for computing the Fourier transform of an arbitrary polytope. More precisely, the Fourier transform of a $m$ -dimensional polytope is expressed as a sum of fourier transforms over the $m-1$ dimensional boundaries of the said polytope (which are themselves polytopes) times a $\mathcal{O}(k^{-1})$ weight term (with $k=\|\mathbf{k}\|$ ). The recursion terminates if $\text{Proj}_{F}(\mathbf{k})=0$ , which then yields a constant.

To structure this computation, Diaz et al. (2016) introduce a book-keeping device called the face poset of the polytope. It can be understood as a weighted directed acyclic graph (DAG) with polytopes of various dimensions as its nodes. We start at the root node which is the full dimensional polytope $P$ (i.e. we initially set $m=n$ ). For all of the codimension-one boundary faces $F$ of $P$ , we then draw an edge from the root $P$ to node $F$ and weight it with a term given by:

and repeat the process iteratively for each $F$ . Note that the weight term is $\mathcal{O}(k^{-1})$ where $\text{Proj}_{F}(\mathbf{k})\neq 0$ . This process yields tree paths $T:F_{0}=P\rightarrow F_{1}\rightarrow...\rightarrow F_{|T|}$ where each $F_{i+1}\in\partial F_{i}$ has one dimension less than $F_{i}$ . For a given path and $\mathbf{k}$ , the terminal node for this path, $F_{n_{T}}$ , is the first polytope for which $\text{Proj}_{F_{n_{T}}}(\mathbf{k})=0$ . The final Fourier transform is obtained by multiplying the weights along each path and summing over all tree paths:

where $\Phi^{(T)}=\mathbf{k}\cdot\mathbf{x}_{0}^{T}$ for an arbitrary point $\mathbf{x}_{0}^{T}$ in $F_{|T|}$ .

To write this as a weighted sum of indicator functions, as in Lemma 2, let $\mathcal{T}_{n}$ denote the set of all tree paths $T$ of length $n$ , i.e. $|T|=n$ . For a tree path $T$ , let $S(T)$ be the orthogonal to the terminal node $F_{n}$ , i.e the vectors $\mathbf{k}$ such that $\text{Proj}_{F_{n}}(\mathbf{k})=0$ . The sum over $T$ in Eqn (35) can be split as:

where $\bar{W}_{F,G}=kW_{F,G}$ and $G_{n}=\bigcup_{T\in\mathcal{T}_{n}}S(T)$ . In words, $G_{n}$ is the set of all vectors $\mathbf{k}$ that are orthogonal to some $n$ -codimensional face of the polytope. We identify:

and $D_{0}(\mathbf{k})=\text{vol}(P)$ to obtain Lemma 2. Observe that $D_{n}$ depends on $k$ only via the phase term $e^{i\Phi^{(T)}_{\mathbf{k}}}$ , implying that $D_{n}=\Theta(1)\,(k\rightarrow\infty)$ .

C.3 On Theorem 1

Equation 6 can be obtained by swapping the (finite) sum over $\epsilon$ in Lemma 1 with that over the paths $T$ in Eqn 36. In particular, we have:

Now, the sum $\sum_{\epsilon}W_{\epsilon}D_{n}^{\epsilon}(\hat{\mathbf{k}})I_{G_{n}^{\epsilon}}(\mathbf{k})$ is supported on the union:

where $C_{n}(\cdot,\theta)=\mathcal{O}(1)\,(k\rightarrow\infty)$ , we obtain Theorem 1. Further, if $N_{f}$ is the number of linear regions of the network and $L_{f}=\max_{\epsilon}\|W_{\epsilon}\|$ , we see that $C_{n}=\mathcal{O}(L_{f}N_{f})$ . Indeed, in Appendix A.5, we empirically find that relaxing the constraint on the weight clip (which can be identified with $L_{f}$ ) enabled the network to fit higher frequencies, implying that the $\mathcal{O}(L_{f})$ bound can be tight.

C.4 Spectral Decay Rate of the Parameter Gradient

Therefore, if $f=\mathcal{O}(k^{-\Delta-1})$ where $\Delta$ is the codimension of the highest dimensional polytope $\hat{\mathbf{k}}$ is orthogonal to, we have that $\nicefrac{{\partial f}}{{\partial\theta}}=\mathcal{O}(k^{-\Delta})$ .

C.5 Convergence Rate of a Network Trained on Pure-Frequency Targets

In this section, we derive an asymptotic bound on the convergence rate under the assumption that the target function has only one frequency component.

where $\mathcal{L}$ is the sampled MSE loss and the first term is $\mathcal{O}(k_{0}^{-1})$ as can be seen from Proposition 1. With Parceval’s Theorem, we obtain:

For the magnitude of parameter gradient, we obtain:

C.6 Proof of the Lipschtiz bound

The Lipschitz constant $L_{f}$ of the ReLU network $f$ is bound as follows (for all $\epsilon$ ):

The first equality is simply the fact that $L_{f}=\max_{\epsilon}\|W_{\epsilon}\|$ , and the second inequality follows trivially from the parameterization of a ReLU network as a chain of function compositionsRecall that the Lipschitz constant of a composition of two or more functions is the product of their respective Lipschtiz constants., together with the fact that the Lipschitz constant of the ReLU function is 1 (cf. (Miyato et al., 2018), equation 7). To see the third inequality, consider the definition of the spectral norm of a $I\times J$ matrix $W$ :

Now, $\|W\mathbf{h}\|=\sqrt{\sum_{i}|\mathbf{w}_{i}\cdot\mathbf{h}|}$ , where $\mathbf{w}_{i}$ is the $i$ -th row of the weight matrix $W$ and $i=1,...,I$ . Further, if $\|\mathbf{h}\|=1$ , we have $|\mathbf{w}_{i}\cdot\mathbf{h}|\leq\|\mathbf{w}_{i}\|\|\mathbf{h}\|=\|\mathbf{w}_{i}\|$ . Since $\|\mathbf{w}_{i}\|=\sqrt{\sum_{j}|w_{ij}|}$ (with $j=1,...,J$ ) and $|w_{ij}|\leq\|\theta\|_{\infty}$ , we find that $\|\mathbf{w}_{i}\|\leq\sqrt{J}\|\theta\|_{\infty}$ . Consequently, $\sqrt{\sum_{i}|\mathbf{w}_{i}\cdot\mathbf{h}|}\leq\sqrt{IJ}\|\theta\|_{\infty}$ and we obtain:

Now for $W=W^{(k)}$ , we have $I=d_{k-1}$ and $J=d_{k}$ . In the product over $k$ , every $d_{k}$ except the first and the last occur in pairs, which cancels the square root. For $k=1$ , $d_{k-1}=d$ (for the $d$ input neurons) and for $k=L+1$ , $d_{k}=1$ (for a single output neuron). The final inequality now follows. ∎

C.7 The Fourier Transform of a Function Composition

Consider Equation 14. The general idea is to investigate the behaviour of $P_{\gamma}(\mathbf{l},\mathbf{k})$ for large frequencies $\mathbf{l}$ on manifold but smaller frequencies $\mathbf{k}$ in the input domain. In particular, we are interested in the regime where the stationary phase approximation is applicable to $P_{\gamma}$ , i.e. when $l^{2}+k^{2}\rightarrow\infty$ (cf. section 3.2. of (Bergner et al., )). In this regime, the integrand in $P_{\gamma}(\mathbf{k},\mathbf{l})$ oscillates fast enough such that the only constructive contribution originates from where the phase term $u(\mathbf{z})=\mathbf{k}\cdot\gamma(\mathbf{z})-\mathbf{l}\cdot\mathbf{z}$ does not change with changing $\mathbf{z}$ . This yields the condition that $\nabla_{\mathbf{z}}u(\mathbf{z})=0$ , which translates to the condition (with Einstein summation convention implied and $\partial_{\nu}=\nicefrac{{\partial}}{{\partial x_{\nu}}}$ ):

Equation C.7 can be substituted in equation 49 to obtain:

Appendix D Volume of High-Frequency Parameters in Parameter Space

For a given neural network, we now show that the volume of the parameter space containing parameters that contribute $\epsilon$ -non-negligibly to frequency components of magnitude $k^{\prime}$ above a certain cut-off $k$ decays with increasing $k$ . For notational simplicity and without loss of generality, we absorb the direction $\hat{\mathbf{k}}$ of $\mathbf{k}$ in the respective mappings and only deal with the magnitude $k$ .

Given a ReLU network $f_{\theta}$ of fixed depth, width and weight clip $K$ with parameter vector $\theta$ , an $\epsilon>0$ and $\Theta=B^{\infty}_{K}(0)$ a $L^{\infty}$ ball around , we define:

as the set of all parameters vectors $\theta\in\Xi_{\epsilon}(k)$ that contribute more than an $\epsilon$ in expressing one or more frequencies $k^{\prime}$ above a cut-off frequency $k$ .

If $k_{2}\geq k_{1}$ , we have $\Xi_{\epsilon}(k_{2})\subseteq\Xi_{\epsilon}(k_{1})$ and consequently $\text{vol}(\Xi_{\epsilon}(k_{2}))\leq\text{vol}(\Xi_{\epsilon}(k_{1}))$ , where vol is the Lebesgue measure.

Let $1_{k}^{\epsilon}(\theta)$ be the indicator function on $\Xi_{\epsilon}(k)$ . Then:

The relative volume of $\Xi_{\epsilon}(k)$ w.r.t. $\Theta$ is $\mathcal{O}(k^{-\Delta-1})$ where $1\leq\Delta\leq d$ .

The volume is given by the integral over the indicator function, i.e.

For a large enough $k$ , we have from remark 2, the monotonicity of the Lebesgue integral and theorem 1 that:

Appendix E Kernel Machines and KNNs

In this section, in light of our findings, we want to compare DNNs with K-nearest neighbor (k-NN) classifier and kernel machines which are also popular learning algorithms, but are, in contrast to DNNs, better understood theoretically.

Given that we study why DNNs are biased towards learning smooth functions, we note that kernel machines (KM) are also highly Lipschitz smooth (Eg. for Gaussian kernels all derivatives are bounded). However there are crutial differences between the two. While kernel machines can approximate any target function in principal (Hammer & Gersmann, 2003), the number of Gaussian kernels needed scales linearly with the number of sign changes in the target function (Bengio et al., 2009). Ma & Belkin (2017) have further shown that for smooth kernels, a target function cannot be approximated within $\epsilon$ precision in any polynomial of $1/\epsilon$ steps by gradient descent.

Deep networks on the other hand are also capable of approximating any target function (as shown by the universal approximation theorems Hornik et al. (1989); Cybenko (1989)), but they are also parameter efficient in contrast to KM. For instance, we have seen that deep ReLU networks separate the input space into number of linear regions that grow polynomially in width of layers and exponentially in the depth of the network (Montufar et al., 2014; Raghu et al., 2016). A similar result on the exponentially growing expressive power of networks in terms of their depth is also shown in (Poole et al., 2016). In this paper we have further shown that DNNs are inherently biased towards lower frequency (smooth) functions over a finite parameter space. This suggests that DNNs strike a good balance between function smoothness and expressibility/parameter-efficiency compared with KM.

E.2 K-NN Classifier vs. DNN classifier

Finally, we plot $\zeta(k)$ for various $K$ in figure 22(e). Furthermore, we train a DNN on the very same dataset and overlay the radial spectrum of the resulting probability map on the same plot. We find that while DNN’s are as expressive as a $K=1$ KNN classifier at lower (radial) frequencies, the frequency spectrum of DNNs decay faster than KNN classifier for all values of $K$ considered, indicating that the DNN is smoother than the $K$ NNs considered. We also repeat the experiment corresponding to Fig. 9 with KNNs (see Fig. 22(d)) for various $K$ ’s, to find that unlike DNNs, KNNs do not necessarily perform better for larger $L$ ’s, suggesting that KNNs do not exploit the geometry of the manifold like DNNs do.