Training behavior of deep neural network in frequency domain

Zhi-Qin John Xu, Yaoyu Zhang, Yanyang Xiao

Introduction

Although Deep Neural Networks (DNNs) are totally transparent, i.e., the value of each node and each parameter can be easily obtained, it is difficult to interpret how information is processed through DNNs. We can easily record the trajectories of the parameters of DNNs during the training. However, it remains unclear what is the general principle underlying the highly non-convex problem of DNN optimization . Therefore, DNN is often criticized for being a “black box” . Even for the simple problem of fitting one-dimensional (1-d) functions, the training process of DNN is still not well understood . For example, Wu et al. (2017) use DNNs of different depth to fit a few data points sampled from a 1-d target function of third-order polynomial. They find that, even when a DNN is capable of over-fitting, i.e., the number of its parameters is much larger than the size of the training dataset, it often generalizes well (i.e., no overfitting) after training. In practice, the same phenomenon is also observed for much more complicated datasets . Intuitively, for a wide DNN, its solutions of zero training error lies in a huge space where well-generalized ones only occupy a small subset. Therefore, it is mysterious that DNN optimization often ignores a huge set of over-fitting solutions. To find an underlying mechanism, in this work, we characterize the behavior of the DNN optimization process in the frequency domain using 1-d functions as well as real datasets of image classification problems (MNIST and CIFAR10). Our work provides insight into an implicitly bias underlying the training process of DNNs.

We empirically find that, for real datasets or synthetic functions, a DNN with common settings first quickly captures their dominant low-frequency components while keeping its own high-frequency ones small, and then relatively slowly capture their high-frequency components. We call this phenomenon Frequency Principle (F-Principle). From our numerical experiments, this F-Principle can be widely observed for DNNs of different width (tens to thousands of neurons in each layer), depth (one to tens of hidden layers), training algorithms (gradient descent, stochastic gradient descent, Adam) and activation functions (tanh and ReLU). Remark that this strategy of the F-Principle, i.e., fitting the target function progressively in ascending frequency order, is also adopted explicitly in some numerical algorithms to achieve remarkable efficiency. These numerical algorithms include, for example, the Multigrid method for solving large-scale partial differential equations and a recent numerical scheme that efficiently fits the three-dimensional structure of proteins and protein complexes from noisy two-dimensional images .

The F-Principle provides a potential mechanism of why DNNs often generalize well empirically albeit its ability of over-fitting . For a finite training set, there exists an effective frequency range beyond which the information of the signal is lost. By the F-Principle, with no constraint on the high-frequency components beyond the effective frequency range, DNNs tend to keep them small. For a wide class of low-frequency dominant natural signals (e.g., image and sound), this tendency coincides with their behavior of decaying power at high frequencies. Thus, DNNs often generalize well in practice. When the training data is noisy, the small-amplitude high-frequency components are easier to be contaminated. By the F-Principle, DNNs first capture the less noisy low-frequency components of the training data and keep higher-frequency components small. At this stage, although the loss function is not best optimized for the training data, DNNs could generalize better for not fitting the noise dominating the higher-frequencies. Therefore, as widely observed, early-stopping often helps generalization.

Our key contribution in this work is the discovery of an F-Principle underlying the training of DNNs for both synthetic and real datasets. In addition, we demonstrate how the F-Principle provides insight into the effectiveness of early stopping and the good generalization of DNNs in general.

Related works

Consistent with other studies , our analysis shows that over-parameterized DNNs tend to fit training data with low-frequency functions, which are naturally of lower complexity. Intuitively, lower-frequency functions also possess smaller Lipschitz constants. According to the study in Hardt et al. (2015) , which focuses on the relation between stability and generalization, smaller Lipschitz constants can lead to smaller generalization error.

The F-Principle proposed in this work initiates a series of works . A stronger verification of the F-Principle for the high dimensional datasets can be found in Xu et al., (2019) . Theoretical studies on the F-Principle can be found in Xu et al., (2019), Xu et al., (2019) and Zhang et al., (2019) . The F-Principle is also used as an important phenomenon to pursue fundamentally different learning trajectories of meta-learning . The theoretical framework of analyzing the F-Principle is used to analyze a nonlinear collaborative scheme for deep network training . Based on the F-Principle, a fast algorithm by shifting high frequencies to lower ones is developed for fitting high frequency functions . These subsequent works show the importance of the F-Principle.

Experimental setup

We summarize the setups for each figure as follows. All DNNs are trained by the Adam optimizer, whose parameters are set to their default values . The loss function is the mean-squared error. The parameters of DNNs are initialized by a Gaussian distribution with mean .

In Fig. 1, the setting of the fully-connected tanh-DNN is as follows. Width of hidden layers: 200-100-100; Batch size: 100; Learning rate: 10510^{-5} for CIFAR10 and 10610^{-6} for MNIST; Standard deviation of Gaussian initialization: 10410^{-4}. The setting of the ReLU-CNN is as follows: two layers of 3232 features with 3×33\times 3 convolutional kernel and 2×22\times 2 max-pooling, followed by 128-64 densely connected layers; Batch size: 128; Standard deviation of Gaussian initialization: 0.050.05. We select 1000010000 samples from each dataset for the training.

In Figs. (2, 3, 4, 5), we use a fully-connected tanh-DNN of 44 hidden layers of width 200-200-200-100, standard deviation of Gaussian initialization 0.10.1, learning rate 2×1052\times 10^{-5} and full-batch size training.

In addition, F[]\mathcal{F}[\cdot] indicates the Fourier transform, which is experimentally estimated on discrete training or test data points.

F-Principle

In this section, we study the training process of DNNs in the frequency domain. We empirically find that, for a general class of functions dominated by low-frequencies, the training process of DNNs follows the F-Principle by which low-frequency components are first captured, followed by high-frequency ones.

Since the computation of high-dimensional Fourier transform suffers from the curse of dimensionality, to verify the F-Principle in the image classification problems (MNIST and CIFAR10), we perform the Fourier analysis along the first principle component of the training inputs.

The training set is a list of labeled images denoted by {xk;yk}k=0n1\{\vec{x}_{k};y_{k}\}_{k=0}^{n-1}, where each image xkNin\vec{x}_{k}\in^{N_{in}}, NinN_{in} is the number of pixels of an image, each label yk{0,1,2,9}y_{k}\in\{0,1,2,\cdots 9\}. We use DNNs of two structures to learn this training set, that is, a fully-connected DNN and a CNN. Denote xk=xkvPCx_{k}=\vec{x}_{k}\cdot\vec{v}_{PC}, which is the projection of image xk\vec{x}_{k} along the direction of the first principle component of {xk}k=0n1\{\vec{x}_{k}\}_{k=0}^{n-1} denoted by a unit vector vPC\vec{v}_{PC}. Using non-uniform Fourier transform, we obtain

where |\cdot| denotes the absolute value. As shown in the first column in Fig. 1, both datasets are dominated by low-frequency components along the first principle direction. Theoretically, frequency components other than the peaks are susceptible to the artificial periodic boundary condition implicitly applied in the Fourier transform, thereby are not essential to our frequency domain analysis . In the following, we only focus on the convergence behavior of the frequency peaks during the training. By examining the relative error of certain selected key frequency components (marked by black squares), one can clearly observe that DNNs of both structures for both datasets tend to capture the training data in an order from low to high frequencies as stated by the F-Principle Almost at the same time, another research finds a similar result. However, they add noise to MNIST, which contaminates the labels. (second and third column in Fig. 1).

2 Synthetic data

In this section, we demonstrate the F-Principle by using synthetic data sampled from a target function of known intrinsic frequencies. We design a target function by discretizing a smooth function f0(x)f_{0}(x) as follows,

where Round(){\rm Round(\cdot)} takes the nearest integer value. We define y=f0(x)y=f_{0}(x) for α=0\alpha=0. We consider f0(x)=sin(x)+2sin(3x)+3sin(5x)f_{0}(x)=\sin(x)+2\sin(3x)+3\sin(5x) with α=2\alpha=2 as shown in Fig. 2a. As shown in Fig. 2b, for the discrete Fourier transform (DFT) of f(x)f(x), i.e., F[f]{\cal F}[f], there are three most important frequency components and some small peaks due to the discretization. In this case, we can observe a precise convergence order from low- to high-frequency for frequency peaks as shown in Fig. 2c.

We have performed the same frequency domain analysis for various low-frequency dominant functions, such as f0(x)=xf_{0}(x)=|x|, f0(x)=x2f_{0}(x)=x^{2} and f0(x)=sin(x)f_{0}(x)=\sin(x) with different α\alpha’s (results are not shown), for both ReLU and tanh activation functions, and both gradient descent and Adam optimizers. We find that F-Principle always holds during the training of DNNs. Therefore, the F-Principle seems to be an intrinsic character of DNN optimization.

Understanding the training behavior of DNNs by the F-Principle

In this section, we provide an explanation based on the F-Principle of why DNNs capable of over-fitting often generalize well in practice . For a class of functions dominated by low frequencies, with finite training data points, there is an effective frequency range for this training set, which is defined as the range in frequency domain bounded by Nyquist-Shannon sampling theorem when the sampling is evenly spaced, or its extensions otherwise. When the number of parameters of a DNN is greater than the size of the training set, the DNN can overfit these sampling data points (i.e., training set) with different amount of powers outside the effective frequency range. However, by the F-Principle, the training process will implicitly bias the DNN towards a solution with a low power at the high-frequencies outside the effective frequency range. For functions dominated by low frequencies, this bias coincides with their intrinsic feature of low power at high frequencies, thus naturally leading to a well-generalized solution after training. By the above analysis, we can predict that, in the case of insufficient training data, when the higher-frequency components are not negligible, e.g., there exists a significant frequency peak above the effective frequency range, the DNN cannot generalize well after training.

In another case where the training data is contaminated by noise, early-stopping method is usually applied to avoid overfitting in practice . By the F-Principle, early-stopping can help avoid fitting the noisy high-frequency components. Thus, it naturally leads to a well-generalized solution. We use the following example for illustration.

As shown in Fig. 3a, we consider f0(x)=sin(x)f_{0}(x)=\sin(x) with α=0.5\alpha=0.5 in Eq. (2). For each sample xx, we add a noise ϵ\epsilon on f0(x)f_{0}(x), where ϵ\epsilon follows a Gaussian distribution with mean and standard deviation 0.10.1. The DNN can well fit the sampled training set as the loss function of the training set decreases to a very small value (green stars in Fig. 3b). However, the loss function of the test set first decreases and then increases (red dots in Fig. 3b). That is, the generalization performance of the DNN gets worse during the training after a certain step. In Fig. 3c, F[f]|\mathcal{F}[f]| for the training data (red) and the test data (black) only overlap around the dominant low-frequency components. Clearly, the high-frequency components of the training set are severely contaminated by noise. Around the turning step — where the best generalization performance is achieved, indicated by the green dashed line in Fig. 3b — the DNN well captures the dominant peak as shown in Fig. 3c. After that, clearly, the loss function of the test set increases as DNN start to capture the higher-frequency noise (red dots in Fig. 3b). These phenomena conform with our analysis that early-stopping can lead to a better generalization performance of DNNs as it helps prevent fitting the noisy high-frequency components of the training set.

Conclusions and discussion

In this work, we empirically discover an F-Principle underlying the optimization process of DNNs. Specifically, for functions with dominant low-frequency components, a DNN with common settings first capture their low-frequency components while keeping its own high-frequency ones small. In our experiments, this phenomenon can be widely observed for DNNs of different width (tens to thousands in each layer), depth (one to tens), training algorithms (GD, SGD, Adam), and activation functions (tanh and ReLU). The F-Principle provides insights into the good generalization performance of DNNs often observed in experiments. In Appendix 7, we also discuss how the F-Principle helps understand the training behavior of DNNs in the information plane .

Note that initial parameters with large values could complicate the phenomenon of the F-Principle. In previous experiments, the training behavior of DNNs initialized by Gaussian distribution with mean and small standard deviation follows the F-Principle. However, with large initialization, i.e., parameters initialized by a Gaussian distribution of large standard deviation, it is difficult to observe a clear phenomenon of the F-Principle. More importantly, these two initialization strategies could result in very different generalization performances. When the standard deviation for initialization is large The bias terms are always initialized by standard deviation 0.10.1., say, 1010 (see Fig.4a), the initial DNN output fluctuates strongly. In contrast, when the parameters of the DNN are initialized with small values, say, Gaussian distribution with standard deviation 0.10.1, the initial DNN output is flat (see Fig.4d). For both initializations, DNNs can well fit the training data (see Fig.4b and e). However, for test data, the DNN with small initialization generalizes well (Fig.4f) whereas the DNN with large initialization clearly overfits (Fig.4c). Intuitively, the above phenomenon can be understood as follows. Without explicit constraints on the high-frequency components beyond the effective frequency range of the training data, the DNN output after training tends to inherit these high-frequency components from the initial output. Therefore, with large initialization, the DNN output can easily overfit the training data with fluctuating high-frequency components. In practice, the parameters of DNNs are often randomly initialized with standard deviations close to zero. As suggested by our analysis, the small-initialization strategy may implicitly lead to a more efficient and well-generalized optimization process of DNNs as characterized by the F-Principle. Note that a quantitative study of how initialization affects the generalization of DNN can be found in a subsequent work .

Acknowledgments

The authors want to thank David W. McLaughlin for helpful discussions and thank Qiu Yang (NYU), Zheng Ma (Purdue University), and Tao Luo (Purdue University), Shixiao Jiang (Penn State), Kai Chen (SJTU) for critically reading the manuscript. Part of this work was done when ZX, YZ, YX are postdocs at New York University Abu Dhabi and visiting members at Courant Institute supported by the NYU Abu Dhabi Institute G1301. The authors declare no conflict of interest.

References

Appendix

Through the empirical exploration of the training behavior of DNNs in the information plane, regarding information compression phase, Schwartz-Ziv and Tishby (2017) claimed that (i) information compression is a general process; (ii) information compression is induced by SGD. In this section, we demonstrate how the F-Principle can be used to understand the compression phase.

For any random variables UU and VV with a joint distribution P(u,v)P(u,v): the entropy of UU is defined as I(U)=uP(u)logP(u)I(U)=-\sum_{u}P(u)\log P(u); their mutual information is defined as I(U,V)=u,vP(u,v)logP(u,v)P(u)P(v)I(U,V)=\sum_{u,v}P(u,v)\log\frac{P(u,v)}{P(u)P(v)}; the conditional entropy of UU on VV is defined as

By the construction of the DNN, its output TT is a deterministic function of its input XX, thus, I(TX)=0I(T|X)=0 and I(X,T)=I(T)I(X,T)=I(T). To compute entropy numerically, we evenly bin XX, YY, TT to XbX_{b}, YbY_{b}, TbT_{b} with bin size bb as follows. For any value vv, its binned value is define as vb=Round(v/b)×bv_{b}={\rm Round}(v/b)\times b. In our work, I(T)I(T) and I(Y,T)I(Y,T) are approximated by I(Tb)I(T_{b}) and I(Yb,Tb)I(Y_{b},T_{b}), respectively, with b=0.05b=0.05. Note that, after binning, one value of XbX_{b} may map to multiple values of TbT_{b}. Thus, I(TbXb)0I(T_{b}|X_{b})\neq 0 and I(Xb,Tb)I(Tb)I(X_{b},T_{b})\neq I(T_{b}). The difference vanishes as bin size shrinks. Therefore, with a small bin size, I(Tb)I(T_{b}) is a good approximation of I(X,T)I(X,T). In experiments, we also find that I(Xb,Tb)I(X_{b},T_{b}) and I(Tb)I(T_{b}) behave almost the same in the information plane for the default value b=0.05b=0.05.

2 Compression vs. no compression in the information plane

We demonstrate how compression can appear or disappear by tuning the parameter α\alpha in Eq. (2) with f0(x)=xf_{0}(x)=x for xx\in using full batch gradient descent (GD) without stochasticity. In our simulations, the DNN well fits f(x)f(x) for both α\alpha equal to and 0.50.5 after training (see Fig.5a and c). In the information plane, there is no compression phase for I(T)I(T) for α=0\alpha=0 (see Fig.5b). By increasing α\alpha in Eq. (2) we can observe that: i) the fitted function is discretized with only few possible outputs (see Fig.5c); ii) the compression of I(T)I(T) appears (see Fig.5d). For α>0\alpha>0, behaviors of information plane are similar to previous results . To understand why compression happens for α>0\alpha>0, we next focus on the training courses for different α\alpha in the frequency domain.

A key feature of the class of functions described by Eq. (2) is that the dominant low-frequency components for f(x)f(x) with different α\alpha are the same. By the F-Principle, the DNN first captures those dominant low-frequency components, thus, the training courses for different α\alpha at the beginning are similar, i.e., i) the DNN output is close to f0(x)f_{0}(x) at certain training epochs (blue lines in Fig.5a and c); ii) I(T)I(T) in the information plane increases rapidly until it reaches a value close to the entropy of f0(x)f_{0}(x) , i.e., I(f0(x))I(f_{0}(x)) (see Fig.5b and d). For α=0\alpha=0, the target function is f0(x)f_{0}(x), therefore, I(T)I(T) will be closer and closer to I(f0(x))I(f_{0}(x)) during the training. For α>0\alpha>0, the entropy of the target function, I(f(x))I(f(x)), is much less than I(f0(x))I(f_{0}(x)). In the latter stage of capturing high-frequency components, the DNN output TT would converge to the discretized function f(x)f(x). Therefore, I(T)I(T) would decrease from I(f0(x))I(f_{0}(x)) to I(f(x))I(f(x)).

This analysis is also applicable to other functions. As the discretization is in general inevitable for classification problems with discrete labels, we can often observe information compression in practice as described in the previous study .