Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks

Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, Jeffrey Pennington

Introduction

Deep convolutional neural networks (CNNs) have been crucial to the success of deep learning. Architectures based on CNNs have achieved unprecedented accuracy in domains ranging across computer vision (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012), natural language processing (Collobert et al., 2011; Kalchbrenner et al., 2014; Kim, 2014), and recently even the board game Go (Silver et al., 2016, 2017).

The performance of deep convolutional networks has improved as these networks have been made ever deeper. For example, some of the best-performing models on ImageNet (Deng et al., 2009) have employed hundreds or even a thousand layers (He et al., 2016a, b). However, these extremely deep architectures have been trainable only in conjunction with techniques like residual connections (He et al., 2016a) and batch normalization (Ioffe & Szegedy, 2015). It is an open question whether these techniques qualitatively improve model performance or whether they are necessary crutches that solely make the networks easier to train. In this work, we study vanilla CNNs using a combination of theory and experiment to disentangle the notions of trainability and generalization performance. In doing so, we show that through a careful, theoretically-motivated initialization scheme, we can train vanilla CNNs with 10,000 layers using no architectural tricks.

Recent work has used mean field theory to build a theoretical understanding of neural networks with random parameters (Poole et al., 2016; Schoenholz et al., 2017; Yang & Schoenholz, 2017; Karakida et al., 2018; Hayou et al., 2018; Hanin & Rolnick, 2018; Yang & Schoenholz, 2018). These studies revealed a maximum depth through which signals can propagate at initialization, and verified empirically that networks are trainable precisely when signals can travel all the way through them. In the fully-connected setting, the theory additionally predicts the existence of an order-to-chaos phase transition in the space of initialization hyperparameters. For networks initialized on the critical line separating these phases, signals can propagate indefinitely and arbitrarily deep networks can be trained. While mean field theory captures the “average” dynamics of random neural networks, it does not quantify the scale of gradient fluctuations that are crucial to the stability of gradient descent. A related body of work (Saxe et al., 2013; Pennington et al., 2017, 2018) has examined the input-output Jacobian and used random matrix theory to quantify the distribution of its singular values in terms of the activation function and the distribution from which the initial random weight matrices are drawn. These works concluded that networks can be trained most efficiently when the Jacobian is well-conditioned, a criterion that can be achieved with orthogonal, but not Gaussian, weight matrices. Together, these approaches have allowed researchers to efficiently train extremely deep network architectures, but so far they have been limited to neural networks composed of fully-connected layers.

In the present work, we continue this line of research and extend it to the convolutional setting. We show that a well-defined mean-field theory exists for convolutional networks in the limit that the number of channels is large, even when the size of the image is small. Moreover, convolutional networks have precisely the same order-to-chaos transition as fully-connected networks, with vanishing gradients in the ordered phase and exploding gradients in the chaotic phase. And just like fully-connected networks, very deep CNNs that are initialized on the critical line separating those two phases can be trained with relative ease.

Moving beyond mean field theory, we additionally show that the random matrix analysis of Pennington et al. (2017, 2018) carries over to the convolutional setting. Furthermore, we identify an efficient construction from the wavelet literature that generates random orthogonal matrices with the block-circulant structure that corresponds to convolution operators. This construction facilitates random orthogonal initialization for convolutional layers and enables good conditioning of the end-to-end Jacobian matrices of arbitrarily deep networks. We show empirically that networks with this initialization can train significantly more quickly than standard convolutional networks.

Finally, we emphasize that although the order-to-chaos phase boundaries of fully-connected and convolutional networks look identical, the underlying mean-field theories are in fact quite different. In particular, a novel aspect of the convolutional theory is the existence of multiple depth scales that control signal propagation at different spatial frequencies. In the large depth limit, signals can only propagate along modes with minimal spatial structure; all other modes end up deteriorating, even at criticality. We hypothesize that this type of signal degradation is harmful for generalization, and we develop a modified initialization scheme that allows for balanced propagation of signals among all frequencies. In this scheme, which we call Delta-Orthogonal initialization, the orthogonal kernel is drawn from a spatially non-uniform distribution, and it allows us to train vanilla CNNs of 10,000 layers or more with no degradation in performance.

Theoretical results

In this section, we first derive a mean field theory for signal propagation in random convolutional neural networks. We will follow the general methodology established in Poole et al. (2016); Schoenholz et al. (2017); Yang & Schoenholz (2017). We will then arrive at a theory for the singular value distribution of the Jacobian following Pennington et al. (2017, 2018). Together, this will allow us to derive theoretically motivated initialization schemes for convolutional neural networks that we call orthogonal kernels and Delta-Orthogonal kernels (an example implementation of a deep $\tanh$ network initialized critically using the Delta-Orthogonal kernel is provided at https://github.com/brain-research/mean-field-cnns). Later we will demonstrate experimentally that these kernels outperform existing initialization schemes for very deep vanilla convolutional networks.

and is independent of $j$. A more compact representation of this equation can be given as,

$$\bm{\Sigma}^{l+1}=\mathcal{A}\star\mathcal{C}(\bm{\Sigma}^{l}), \tag{2.3}$$

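where $\mathcal{A}=\frac{1}{2k+1}{\bf I}_{2k+1}$ and $\star$ denotes 2D circular cross-correlation, i.e. for any matrix ${\bm{C}}$, $\mathcal{A}\star{\bm{C}}$ is defined as,

$$[\mathcal{A}\star{\bm{C}}]_{\alpha,\alpha^{\prime}}=\frac{1}{2k+1}\sum_{\beta\in{\it ker}}{\bm{C}}_{\alpha+\beta,\alpha^{\prime}+\beta}. \tag{2.4}$$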

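The function $\mathcal{C}:\text{PSD}_{n}\to\text{PSD}_{n}$ is related to the $\mathcal{C}$-map defined in Poole et al. (2016) (see also Daniely et al. (2016)) and is given by,

$$[\mathcal{C}(\bm{\Sigma})]_{\alpha,\alpha^{\prime}}=\sigma_{w}^{2}\,\mathbb{E}_{h\sim\mathcal{N}(0,\bm{\Sigma})}\left[\phi(h_{\alpha})\phi(h_{\alpha^{\prime}})\right]+\sigma_{b}^{2}. \tag{2.5}$$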

All but the two dimensions $\alpha$ and $\alpha^{\prime}$ in eqn. (2.5) marginalize, so, as in Poole et al. (2016), the $\mathcal{C}$-map can be computed by a two-dimensional integral. Unlike in Poole et al. (2016), $\alpha$ and $\alpha^{\prime}$ do not correspond to different examples but rather to different spatial positions, and eqn. (2.5) characterizes how signals from a single input propagate through convolutional networks in the mean-field approximation. (The multi-input analysis proceeds in precisely the same manner as we present here, but comes with increased notational complexity and features no qualitatively different behavior, so we focus our presentation on the single-input case.)
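As a rough numerical illustration of the two-dimensional integral mentioned above, the following sketch evaluates a $\mathcal{C}$-map entry by entry with Gauss-Hermite quadrature. It assumes the standard form $\sigma_w^2\,\mathbb{E}[\phi(h_\alpha)\phi(h_{\alpha'})]+\sigma_b^2$ with $h\sim\mathcal{N}(0,\bm{\Sigma})$; the function name `c_map`, the default hyperparameters, and the quadrature degree are illustrative choices, not taken from the paper.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def c_map(sigma, sigma_w2=1.5, sigma_b2=0.05, phi=np.tanh, deg=60):
    """Assumed C-map: sigma_w2 * E[phi(h_a) phi(h_b)] + sigma_b2 for h ~ N(0, sigma).

    Each entry only needs the 2x2 marginal of (h_a, h_b), so it reduces to a
    two-dimensional Gaussian integral, evaluated here by tensor-product quadrature.
    """
    z, w = hermegauss(deg)            # nodes/weights for the weight exp(-z^2 / 2)
    w = w / np.sqrt(2.0 * np.pi)      # normalise so the weights sum to one
    z1, z2 = z[:, None], z[None, :]
    w2 = w[:, None] * w[None, :]
    n = sigma.shape[0]
    out = np.empty((n, n))
    for a in range(n):
        for b in range(n):
            qa, qb = sigma[a, a], sigma[b, b]
            c = np.clip(sigma[a, b] / np.sqrt(qa * qb), -1.0, 1.0)
            h1 = np.sqrt(qa) * z1                                   # h_a
            h2 = np.sqrt(qb) * (c * z1 + np.sqrt(1.0 - c * c) * z2)  # h_b, correlated with h_a
            out[a, b] = sigma_w2 * np.sum(w2 * phi(h1) * phi(h2)) + sigma_b2
    return out


sigma0 = 0.8 * np.eye(4)   # uncorrelated spatial positions with equal variance
sigma1 = c_map(sigma0)
```

For an odd activation such as $\tanh$ and uncorrelated positions, the off-diagonal entries of the output reduce to $\sigma_b^2$, which is a convenient sanity check on the quadrature.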

2.1.2 Dynamics of signal propagation

We now seek to study the dynamics induced by eqn. (2.3). Schematically, our approach will be to identify fixed points of eqn. (2.3) and then linearize the dynamics around these fixed points. These linearized dynamics will dictate the stability and rate of decay towards the fixed points, which determines the depth scales over which signals in the network can propagate.

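Schoenholz et al. (2017) found that for many activation functions $\phi$ (e.g. $\tanh$) and any choice of $\sigma_{w}$ and $\sigma_{b}$, the $\mathcal{C}$-map has a fixed point $\bm{\Sigma}^{*}$ (i.e. $\mathcal{C}(\bm{\Sigma}^{*})=\bm{\Sigma}^{*}$) of the form,

$$\Sigma^{*}_{\alpha,\alpha^{\prime}}=q^{*}\left(\delta_{\alpha,\alpha^{\prime}}+(1-\delta_{\alpha,\alpha^{\prime}})\,c^{*}\right), \tag{2.6}$$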

where $\delta_{a,b}$ is the Kronecker-$\delta$, $q^{*}$ is the fixed-point variance of a single input, and $c^{*}$ is the fixed-point correlation between two inputs. It follows from the form of eqn. (2.4) that ${\bm{\Sigma}}^{*}$ is also a fixed point of the layer-to-layer covariance map in the convolutional case (eqn. (2.3)), i.e. ${\bm{\Sigma}}^{*}=\mathcal{A}\star\mathcal{C}({\bm{\Sigma}}^{*})$.

To analyze the dynamics of the iteration map (2.3) near the fixed point $\bm{\Sigma}^{*}$, we define $\bm{\epsilon}^{l}=\bm{\Sigma}^{*}-\bm{\Sigma}^{l}$ and expand eqn. (2.3) to lowest order in $\bm{\epsilon}$. This expansion requires the Jacobian of the $\mathcal{C}$-map evaluated at the fixed point, the properties of which we analyze in the SM. In brief, perturbations in $q^{*}$ and $c^{*}$ evolve independently and the Jacobian decomposes into a diagonal eigenspace $V_{\text{d}}$ with eigenvalue $\chi_{q^{*}}$, and an off-diagonal eigenspace $V_{\text{o.d.}}$ with eigenvalue $\chi_{c^{*}}$. The eigenvalues are given by (by the symmetry of $\bm{\Sigma}^{*}$, these expectations are independent of spatial location and of the choice of $h_{1}$ and $h_{2}$),

$$\chi_{q^{*}}=\sigma_{w}^{2}\,\mathbb{E}_{h\sim\mathcal{N}(0,\bm{\Sigma}^{*})}\left[\phi^{\prime\prime}(h_{1})\phi(h_{1})+\phi^{\prime}(h_{1})^{2}\right],\qquad \chi_{c^{*}}=\sigma_{w}^{2}\,\mathbb{E}_{h\sim\mathcal{N}(0,\bm{\Sigma}^{*})}\left[\phi^{\prime}(h_{1})\phi^{\prime}(h_{2})\right]. \tag{2.7}$$

i.e. $V_{\text{d}}=\operatorname{span}(B_{\text{d}})$ and $V_{\text{o.d.}}=\operatorname{span}(B_{\text{o.d.}})$. Note that $\chi_{q^{*}}$ and $\chi_{c^{*}}$ were also found in Schoenholz et al. (2017) to control signal propagation in the fully-connected case. The constant $\gamma$ is given in Lemma B.2 of the SM but does not concern us here. This eigen-decomposition implies that the layer-wise deviations from the fixed point evolve under eqn. (2.3) as,

$$\bm{\epsilon}^{l+1}=\mathcal{A}\star\left(\chi_{q^{*}}\,{\bm{\epsilon}}^{l}_{\text{d}}+\chi_{c^{*}}\,{\bm{\epsilon}}^{l}_{\text{o.d.}}\right), \tag{2.9}$$

where ${\bm{\epsilon}}_{\text{d}}$ and ${\bm{\epsilon}}_{\text{o.d.}}$ are the components of ${\bm{\epsilon}}$ in the eigenspaces $V_{\text{d}}$ and $V_{\text{o.d.}}$.

Eqn. (2.9) defines the linear dynamics of random convolutional neural networks near their fixed points and is the basis for the in-depth analysis of the following subsections.

2.1.3 Multi-dimensional signal propagation

In the fully-connected setting, the dynamics of signal propagation near the fixed point are governed by scalar evolution equations. In contrast, the convolutional setting enjoys much richer dynamics, as eqn. (2.9) describes a multi-dimensional system that we now analyze.

It follows from eqns. (2.4) and (2.8) (see also the SM) that $\mathcal{A}$ does not mix the diagonal and off-diagonal eigenspaces, i.e. $\mathcal{A}\star{\bm{\epsilon}}_{\text{d}}\in V_{\text{d}}$ and $\mathcal{A}\star{\bm{\epsilon}}_{\text{o.d.}}\in V_{\text{o.d.}}$. To see this, note that for $M^{\alpha,\alpha^{\prime}}\in V_{\text{o.d.}}$, the definition implies $M^{\alpha,\alpha^{\prime}}_{\bar{\alpha}+\beta,\bar{\alpha}^{\prime}+\beta}=M^{\alpha-\beta,\alpha^{\prime}-\beta}_{\bar{\alpha},\bar{\alpha}^{\prime}}$. This property ensures that $\mathcal{A}\star M^{\alpha,\alpha^{\prime}}$ can be expressed as a linear combination of matrices in $V_{\text{o.d.}}$, which means it also belongs to $V_{\text{o.d.}}$. The same argument applies to $M^{\alpha,\alpha}\in V_{\text{d}}$. As a result, these eigenspaces evolve entirely independently under the linearization of the covariance iteration map (2.3).

Let $l_{0}$ denote the depth over which transient effects persist and after which eqn. (2.9) accurately describes the linearized dynamics. Therefore, at depths larger than $l_{0}$, we have

with $\lambda_{\alpha,\alpha^{\prime}}=\mathcal{F}(\mathcal{A})^{*}_{\alpha,\alpha^{\prime}}$. Thus, the linearized dynamics of convolutional neural networks decouple into independently-evolving Fourier modes that evolve near the fixed point at frequency-dependent rates.
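The decoupling into Fourier modes can be checked numerically. The sketch below assumes the circular cross-correlation $[\mathcal{A}_v\star C]_{\alpha,\alpha'}=\sum_\beta v_\beta\,C_{\alpha+\beta,\alpha'+\beta}$ with indices taken mod $n$, and verifies that the 2D discrete Fourier transform turns it into an entrywise multiplication. The helper name `xcorr_diag` and the sizes $n=8$, $k=1$ are illustrative choices.

```python
import numpy as np


def xcorr_diag(v, C):
    """[A_v * C]_{a,b} = sum_beta v[beta] * C[a+beta, b+beta], indices mod n."""
    out = np.zeros_like(C)
    for beta in np.flatnonzero(v):
        # np.roll(C, -beta)[a] == C[a + beta], applied along both axes
        out += v[beta] * np.roll(C, shift=(-beta, -beta), axis=(0, 1))
    return out


n, k = 8, 1
v = np.zeros(n)
v[[0, 1, n - 1]] = 1.0 / (2 * k + 1)   # kernel offsets -1, 0, +1 stored mod n

rng = np.random.default_rng(0)
eps = rng.standard_normal((n, n))
eps = eps + eps.T                       # a symmetric perturbation

# Eigenvalue of mode (p, q): conjugate 1D DFT of v evaluated at (p + q) mod n.
lam_1d = np.conj(np.fft.fft(v))
idx = (np.arange(n)[:, None] + np.arange(n)[None, :]) % n
lam = lam_1d[idx]

lhs = np.fft.fft2(xcorr_diag(v, eps))   # transform after cross-correlating
rhs = lam * np.fft.fft2(eps)            # multiply each Fourier mode by its eigenvalue
```

In this convention the eigenvalue depends only on $(\alpha+\alpha')\bmod n$, so the $n\times n$ array `lam` holds $n$ shifted copies of the 1D transform, and the modes with $\alpha'=n-\alpha$ have eigenvalue one.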

2.1.4 Fixed-point analysis

The stability of the fixed point $\Sigma^{*}$ is determined by whether nearby points move closer to or farther from $\Sigma^{*}$ under the dynamics described by eqn. (2.9). Eqn. (2.11) shows that this condition depends on whether the quantities $\lambda_{\alpha,\alpha^{\prime}}\chi_{q^{*}}$ and $\lambda_{\alpha,\alpha^{\prime}}\chi_{c^{*}}$ are less than or greater than one.

Since $\mathcal{A}$ is a diagonal matrix, the eigenvalues $\lambda_{\alpha,\alpha^{\prime}}$ have a specific structure. In particular, the set of eigenvalues is comprised of $n$ copies of the 1D discrete Fourier transform of the diagonal entries of $\mathcal{A}$. Furthermore, since the diagonal entries of $\mathcal{A}$ are non-negative and sum to one, their Fourier coefficients have absolute value no larger than one and the zero-frequency coefficient is equal to one; see Figure 4 for the full distribution in the case of 2D convolutions. It follows that the fixed point $\Sigma^{*}$ will be stable if and only if $\chi_{q^{*}}<1$ and $\chi_{c^{*}}<1$.

These stability conditions are precisely the ones found to govern fully-connected networks (Poole et al., 2016; Schoenholz et al., 2017). Moreover, the fixed point matrix $\Sigma^{*}$ is also the same as in the fully-connected case. Together, these observations imply that the entire fixed-point structure of the convolutional case is identical to that of the fully-connected case. In particular, based on the results of Poole et al. (2016), we can immediately conclude that the $(\sigma_{w},\sigma_{b})$ hyperparameter plane is separated by the line $\chi_{1}=1$ into an ordered phase with $c^{*}=1$ in which all pixels approach the same value, and a chaotic phase with $c^{*}<1$ in which the pixels become decorrelated with one another; see the SM for a review of this phase diagram analysis.
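To make the phase boundary concrete, the following sketch locates a point of the $(\sigma_w,\sigma_b)$ plane in the ordered or chaotic phase for $\phi=\tanh$. It assumes the fully-connected mean-field expressions of Poole et al. (2016), namely $q^{*}=\sigma_w^2\,\mathbb{E}[\tanh^2(\sqrt{q^{*}}z)]+\sigma_b^2$ and $\chi_1=\sigma_w^2\,\mathbb{E}[\operatorname{sech}^4(\sqrt{q^{*}}z)]$ with $z\sim\mathcal{N}(0,1)$; the function names, the quadrature degree, and the iteration count are illustrative.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

Z, W = hermegauss(80)          # nodes/weights for the weight exp(-z^2 / 2)
W = W / np.sqrt(2.0 * np.pi)   # normalise to the standard normal density


def gauss_mean(f):
    """E[f(z)] for z ~ N(0, 1), by Gauss-Hermite quadrature."""
    return float(np.sum(W * f(Z)))


def q_star(sigma_w2, sigma_b2, iters=500, q0=1.0):
    """Fixed-point variance, found by iterating the variance map for tanh."""
    q = q0
    for _ in range(iters):
        q = sigma_w2 * gauss_mean(lambda z: np.tanh(np.sqrt(q) * z) ** 2) + sigma_b2
    return q


def chi_1(sigma_w2, sigma_b2):
    """chi_1 = sigma_w2 * E[tanh'(sqrt(q*) z)^2], using tanh' = sech^2."""
    q = q_star(sigma_w2, sigma_b2)
    return sigma_w2 * gauss_mean(lambda z: np.cosh(np.sqrt(q) * z) ** -4.0)


chi_ordered = chi_1(0.5, 0.1)   # small weight variance
chi_chaotic = chi_1(4.0, 0.0)   # large weight variance, no bias
```

Values of `chi_1` below one place the initialization in the ordered phase and values above one in the chaotic phase; a root-finder over $\sigma_w^2$ at fixed $\sigma_b^2$ would trace out the critical line.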

2.1.5 Depth scales of signal propagation

We now assume that the conditions for a stable fixed point are met, i.e. $\chi_{q^{*}}<1$ and $\chi_{c^{*}}<1$, and we consider the rate at which the fixed point is approached. As in Schoenholz et al. (2017), it is convenient to additionally assume $\chi_{q^{*}}<\chi_{c^{*}}$ so that the dynamics in the diagonal subspace can be neglected. In this case, eqn. (2.11) can be rewritten as

where $\xi_{\alpha,\alpha^{\prime}}=-1/\log(\chi_{c^{*}}\lambda_{\alpha,\alpha^{\prime}})$ are depth scales governing the convergence of the different modes. In particular, we expect signals corresponding to a specific Fourier mode $f_{\alpha,\alpha^{\prime}}$ to be able to travel a depth commensurate to $\xi_{\alpha,\alpha^{\prime}}$ through the network. Thus, unlike fully-connected networks, which exhibit only a single depth scale, convolutional networks feature a hierarchy of depth scales.

Recalling that $\lambda_{\alpha,n-\alpha}=1$, it follows that $\xi_{c}\equiv\xi_{\alpha,n-\alpha}=-1/\log\chi_{c^{*}}$, which is identical to the depth scale governing signal propagation through fully-connected networks. It follows from Schoenholz et al. (2017) that when $\chi_{1}=1$, $\xi_{\alpha,n-\alpha}$ diverges and thus convolutional networks can propagate signals arbitrarily far through the $f_{\alpha,n-\alpha}$ modes. Since $|\lambda_{\alpha,\alpha^{\prime}}|<1$ for $\alpha^{\prime}\neq n-\alpha$, these are the only modes through which signals can propagate without attenuation. Finally, we note that the $f_{\alpha,n-\alpha}$ modes correspond to perturbations that are spatially uniform along the cyclic diagonals of the covariance matrix. The fact that all signals with additional spatial structure attenuate for large depth suggests that deep critical convolutional networks behave quite similarly to fully-connected networks, which also cannot propagate spatially-structured signals.
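The hierarchy of depth scales can be illustrated with a few lines of NumPy. This sketch takes a uniform kernel of width $2k+1$ on $n$ spatial positions and evaluates $\xi=-1/\log(\chi_{c^{*}}|\lambda|)$ for each Fourier coefficient $\lambda$ of the variance vector. Using the modulus $|\lambda|$ (so that negative coefficients are handled) and the value $\chi_{c^{*}}=0.99$ are assumptions made for the illustration.

```python
import numpy as np


def depth_scales(v, chi_c):
    """Depth scale per Fourier mode: xi = -1 / log(chi_c * |lambda|)."""
    lam = np.abs(np.fft.fft(np.asarray(v, dtype=float)))
    with np.errstate(divide="ignore"):   # a vanishing coefficient gives xi = 0
        return -1.0 / np.log(chi_c * lam)


n, k = 8, 1
v_uniform = np.zeros(n)
v_uniform[[0, 1, n - 1]] = 1.0 / (2 * k + 1)   # offsets -1, 0, +1 stored mod n

xi = depth_scales(v_uniform, chi_c=0.99)
xi_c = -1.0 / np.log(0.99)   # depth scale of the zero-frequency mode
```

Only the zero-frequency entry reaches $\xi_c$; every other mode has a strictly shorter depth scale, and only $\xi_c$ diverges as $\chi_{c^{*}}\to 1$.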

2.1.6 Non-uniform kernels

The similarities between signal propagation in convolutional neural networks and fully-connected networks in the limit of large depth are surprising. A consequence may be that the performance of very deep convolutional networks degrades as the signal is forced to propagate along modes with minimal spatial structure. Indeed, Fig. 3 shows that the generalization performance decreases with depth, and that for very large depth it barely surpasses the performance of a fully-connected network.

If increased spatial uniformity is the problem, eqn. (2.12) holds the solution. In order for all modes to propagate without attenuation, it is necessary that $\lambda_{\alpha,\alpha^{\prime}}=1$ for all $\alpha,\alpha^{\prime}$. In fact, it is easy to show that the distribution of $\{\lambda_{\alpha,\alpha^{\prime}}\}$ can be modified by allowing for spatial non-uniformity in the variance of the weights within the kernel. To this end, we introduce a non-negative vector $v=(v_{\beta})_{\beta\in{\it ker}}$ chosen such that $\sum_{\beta}v_{\beta}=1$, and initialize the weights of the network according to $w^{l}_{ij}(\beta)\sim\mathcal{N}(0,\sigma_{w}^{2}v_{\beta}/c)$. Each choice of $v$ will induce a new dynamical equation analogous to eqn. (2.3) (see SM),

$$\bm{\Sigma}^{l+1}=\mathcal{A}_{v}\star\mathcal{C}(\bm{\Sigma}^{l}), \tag{2.13}$$

where $\mathcal{A}_{v}=\operatorname{diag}(v)$. It follows directly from the previous analysis that the linearized dynamics of eqn. (2.13) will be identical to the dynamics of eqn. (2.3), only now with $\lambda_{\alpha,\alpha^{\prime}}=\mathcal{F}(\mathcal{A}_{v})_{\alpha,\alpha^{\prime}}^{*}$. By the same argument presented in Section 2.1.3, the set of eigenvalues is now comprised of $n$ copies of the 1D Fourier transform of $v$. As a result, it is possible to control the depth scales over which different modes of the signal can propagate through the network by changing the variance vector $v$. We will return to this point in Section 2.4.
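A minimal sketch of the non-uniform Gaussian initialization, for a 1D kernel with equal input and output channel counts $c$, is given below. The function name `sample_kernel`, the array layout (kernel position, input channel, output channel), and the example variance vector are illustrative assumptions.

```python
import numpy as np


def sample_kernel(v, c, sigma_w2=1.0, rng=None):
    """Draw w_ij(beta) ~ N(0, sigma_w2 * v[beta] / c) for a 1D kernel.

    v is a non-negative variance vector over kernel positions that sums to one.
    Returns an array of shape (len(v), c, c).
    """
    rng = np.random.default_rng() if rng is None else rng
    v = np.asarray(v, dtype=float)
    if np.any(v < 0) or not np.isclose(v.sum(), 1.0):
        raise ValueError("v must be non-negative and sum to one")
    std = np.sqrt(sigma_w2 * v / c)
    return rng.standard_normal((v.size, c, c)) * std[:, None, None]


rng = np.random.default_rng(0)
w = sample_kernel([0.25, 0.5, 0.25], c=64, sigma_w2=2.0, rng=rng)
w_delta = sample_kernel([0.0, 1.0, 0.0], c=64, sigma_w2=2.0, rng=rng)
```

With a one-hot `v`, every kernel position except the chosen one is exactly zero, which is the Gaussian analogue of the delta kernels discussed later.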

2.2 Back-propagation of signal

We now turn our attention to the back-propagation of error signals through a convolutional network. Let $E$ denote the loss and $\delta^{l}_{j}(\alpha)$ the back-propagated signal at layer $l$, channel $j$ and spatial location $\alpha$, i.e.,

We have observed that the quantity $\chi_{1}$ is crucial for determining signal propagation in CNNs, both in the forward and backward directions. As discussed in Poole et al. (2016), $\chi_{1}$ equals the mean squared singular value of the Jacobian $\bm{J}^{l}$ of the layer-to-layer transition operator. Beyond just the second moment, higher moments and indeed the whole distribution of singular values of the entire end-to-end Jacobian ${\bm{J}}=\prod_{l}{\bm{J}}^{l}$ are important for ensuring trainability of very deep fully-connected networks (Pennington et al., 2017, 2018). Specifically, networks train well when their input-output Jacobians exhibit dynamical isometry, namely the property that the entire distribution of singular values is close to $1$.

In fact, we can carry the entire analysis of Pennington et al. (2017, 2018) over to the convolutional setting with essentially no modification. The reason stems from the fact that, because convolution is a linear operator, it has a matrix representation, $\bm{W}^{l}$, which appears in the end-to-end Jacobian in precisely the same manner as do the weight matrices in the fully-connected case. In particular, $\bm{J}=\prod_{l=1}^{L}{\bm{D}}^{l}{\bm{W}}^{l}$, where ${\bm{D}}^{l}$ is the diagonal matrix whose diagonal elements contain the vectorized representation of derivatives of post-activation neurons in layer $l$. Roughly speaking, since this is the same expression as in Pennington et al. (2017, 2018), the conclusions found in that work regarding dynamical isometry apply equally well in the convolutional setting.

The analysis of Pennington et al. (2017, 2018) reveals that the singular values of $\bm{J}$ depend crucially on the distribution of singular values of $\bm{W}^{l}$ and $\bm{D}^{l}$. In particular, to achieve dynamical isometry, all of these matrices should be close to orthogonal. As in the fully-connected case, the singular values of $\bm{D}^{l}$ can be made arbitrarily close to $1$ by choosing a small value for $q^{*}$ and by using an activation function like $\tanh$ that is smooth and linear near the origin. In the convolutional setting, the matrix representation of the convolution operator $\bm{W}^{l}$ is a $c\times c$ block matrix with $n\times n$ circulant blocks. Note that in the large $c$ limit, $n/c\to 0$ and the relative size of the blocks vanishes. Therefore, if the weights are i.i.d. random variables, we can invoke universality results from random matrix theory to conclude that its singular value distribution converges to the Marchenko-Pastur distribution; see Fig. S4 in the SM. As such, we find that CNNs with i.i.d. weights cannot achieve dynamical isometry. We address this issue in the next section.
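The contrast between i.i.d. and orthogonal weights can be seen in a toy computation. The sketch below is a simplification, not the paper's experiment: it uses dense matrices instead of block-circulant convolution operators and sets every $\bm{D}^{l}$ to the identity (the small-$q^{*}$ limit), then compares the singular values of a product of Gaussian matrices with those of a product of random orthogonal matrices. The size, depth, and function names are illustrative.

```python
import numpy as np


def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))   # sign fix so the draw is not biased by QR conventions


def gaussian(n, rng):
    """i.i.d. Gaussian matrix with entry variance 1/n."""
    return rng.standard_normal((n, n)) / np.sqrt(n)


def product_singular_values(sample, depth, n, rng):
    """Singular values of a product of `depth` independently sampled n x n matrices."""
    j = np.eye(n)
    for _ in range(depth):
        j = sample(n, rng) @ j
    return np.linalg.svd(j, compute_uv=False)


rng = np.random.default_rng(0)
n, depth = 64, 50
s_gauss = product_singular_values(gaussian, depth, n, rng)
s_orth = product_singular_values(random_orthogonal, depth, n, rng)
```

The orthogonal product keeps every singular value at one regardless of depth, whereas the spread of `s_gauss` grows rapidly with `depth`, which is the ill-conditioning that rules out dynamical isometry for i.i.d. weights.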

2.3 Orthogonal Initialization for CNNs

In (Pennington et al., 2017, 2018), it was observed that dynamical isometry can lead to dramatic improvements in training speed, and that achieving these favorable conditions requires orthogonal weight initializations. While the procedure to generate random orthogonal weight matrices in the fully-connected setting is well-known, it is less obvious how to do so in the convolutional setting, and at first sight it is not at all clear whether it is even possible. We resolve this question by invoking a result from the wavelet literature (Kautsky & Turcajová, 1994) and provide an explicit construction. We will focus on the two-dimensional convolution here and begin with some notation.

where the out-of-range matrices are taken to be zero.

Algorithm 1 shows how to construct orthogonal kernels for 2D convolutions of size $\Bbbk\times\Bbbk\times c_{in}\times c_{out}$ with $c_{in}\leq c_{out}$. One can employ the same method to construct kernels of higher (or lower) dimensions. This new initialization method can dramatically boost the learning speed of deep CNNs; see Fig. 5 and Section 3.2.

2.4 Delta-Orthogonal Initialization

In Section 2.1.5 it was observed that, in contrast to fully-connected networks, CNNs have multiple depth scales controlling propagation of signals along different Fourier modes. Even at criticality, for generic variance-averaging vectors $v$, the majority of these depth scales are finite. However, there does exist one special averaging vector for which all of the depth scales are infinite: a one-hot vector, i.e. $v_{i}=\delta_{k,i}$. This kernel places all of its variance in the spatial center of the kernel and zero variance elsewhere. In this case, the eigenvalues $\lambda_{\alpha,\alpha^{\prime}}$ are all equal to $1$ and all depth scales diverge, implying that signals can propagate arbitrarily far along all Fourier modes.

If we combine this special averaging vector with the orthogonal initialization of the previous section, we obtain a powerful new initialization scheme that we call Delta-Orthogonal Initialization. Matrices of this type can be generated from Algorithm 1 with $\Bbbk=1$ and padding with appropriate zeros, or directly from Algorithm 2 in the SM.
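One way to read the zero-padding description above is sketched below: draw a random orthogonal matrix, keep $c_{in}$ of its rows, and place the result at the spatial center of an otherwise zero $\Bbbk\times\Bbbk$ kernel. This is a hedged illustration rather than a transcription of Algorithm 2; the function name `delta_orthogonal`, the `gain` parameter, and the (height, width, input channel, output channel) layout are assumptions.

```python
import numpy as np


def delta_orthogonal(k, c_in, c_out, gain=1.0, rng=None):
    """Kernel of shape (k, k, c_in, c_out) that is zero except at the spatial center.

    The center slice is `gain` times a c_in x c_out matrix with orthonormal rows,
    so a stride-1 circular convolution with it preserves the norm up to `gain`.
    """
    if c_in > c_out:
        raise ValueError("this sketch assumes c_in <= c_out")
    rng = np.random.default_rng() if rng is None else rng
    q, r = np.linalg.qr(rng.standard_normal((c_out, c_out)))
    q = q * np.sign(np.diag(r))          # random orthogonal c_out x c_out matrix
    kernel = np.zeros((k, k, c_in, c_out))
    kernel[k // 2, k // 2] = gain * q[:c_in, :]
    return kernel


kern = delta_orthogonal(3, 4, 8, rng=np.random.default_rng(0))
center = kern[1, 1]
x = np.ones(4)
y = x @ center   # action on the channel vector at one spatial location
```

Because only the center tap is non-zero, the convolution acts on each spatial location independently through `center`, so all spatial Fourier modes are treated identically.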

In the following sections, we demonstrate experimentally that extraordinarily deep convolutional networks can be trained with these initialization techniques.

Experiments

To support the theoretical results built up in Section 2, we trained a large number of very deep CNNs on MNIST and CIFAR-10 with $\tanh$ as the activation function. We use the following vanilla CNN architecture. First we apply three $3\times 3\times c$ convolutions with strides 1, 2 and 2 in order to increase the channel size to $c$ and reduce the spatial dimension to $7\times 7$ (or $8\times 8$ for CIFAR-10), and then a block of $d$ $3\times 3\times c$ convolutions with $d$ varying from $2$ to $10,000$. Finally, an average pooling layer and a fully-connected layer are applied. Here $c=256$ when $d\leq 256$ and $c=128$ otherwise. To maximally support our theories, we used none of the common training techniques (including learning rate decay). Note that the early downsampling is necessary from a computational perspective, but it does diminish the maximum achievable performance; e.g. our best achieved test accuracy with downsampling was $82\%$ on CIFAR-10. We performed an additional experiment training a 50-layer network without downsampling. This resulted in a test accuracy of $89.90\%$, which is comparable to the best performance on CIFAR-10 using a $\tanh$ architecture that we were able to find ($89.82\%$, Mishkin & Matas, 2015).

The analysis in Section 2.1 predicts precisely for which initialization hyperparameters a CNN will be trainable. In particular, we predict that the network ought to be trainable provided $L\lesssim\xi_{c}$. To test this, we train a large number of convolutional neural networks on MNIST with depth varying between $L=10$ and $L=600$ and with weights initialized with $\sigma_{w}^{2}\in$. In Fig. 2 we plot, using a heatmap, the training accuracy obtained by these networks after different numbers of steps. Additionally we overlay the depth scale predicted by our theory, $\xi_{c}$. We find strikingly good agreement between our theory of random networks and the results of our experiments.

3.2 Orthogonal Initialization and Ultra-deep CNNs

We argued in Section 2.2.1 that the input-output Jacobian of CNNs with i.i.d. weights will become increasingly ill-conditioned as the number of layers grows. On the other hand, orthogonal weight initializations can achieve dynamical isometry and dramatically boost the training speed. To verify this, we train a 4,000-layer CNN on MNIST using a critically-tuned Gaussian weight initialization and the orthogonal initialization scheme developed in Section 2.3. Fig. 5 shows that the network with Gaussian initialization learns slowly (test and training accuracy are below $60\%$ after $90,000$ steps, about 60 epochs). In contrast, the network with orthogonal initialization learns quickly, with test accuracy above $60\%$ after only 1 epoch, and achieves $95\%$ after $10,000$ steps or about 7 epochs.

3.3 Multi-dimensional Signal Propagation

The analysis in Section 2.1.3 and Section 2.1.6 suggests that CNNs initialized with kernels with spatially uniform variance may suffer a degradation in generalization performance as the depth increases. Fig. 3 shows the learning curves of CNNs on CIFAR-10 with depth varying from $32$ to $8192$. Although the orthogonal initialization enables even the deepest model to reach $100\%$ training accuracy, the test accuracy decays as the depth increases, with the deepest model generalizing only marginally better than a fully-connected network.

To test whether this degradation in performance may be the result of attenuation of spatially non-uniform signals, we trained a variety of models on CIFAR-10 whose kernels were initialized with spatially non-uniform variance. According to the analysis in Section 2.1.6, changing the shape of this non-uniformity controls the depth scales over which different Fourier components of the signal can propagate through the network. We examined five different non-uniform critical Gaussian initialization methods. The variance vectors $v$ were chosen in the following way: GS0 refers to the one-hot delta initialization for which the eigenvalues $\lambda_{\alpha,\alpha^{\prime}}$ are all equal to 1. GS1, GS2 and GS3 are obtained by interpolating between GS0 and GS4, which is the uniform variance initialization.

Each variance vector has exactly $8\times 8$ singular values, plotted in Fig. 4(b) in descending order. Note that from GS0 to GS4, the singular values become more poorly-conditioned (the distribution becomes more concentrated around 0). Fig. 4(a) shows that the relative fall-off of generalization performance with depth follows the same pattern: the more poorly-conditioned the singular values, the worse the model generalizes. These observations suggest that salient information may be propagating along multiple Fourier modes.
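The trend from a delta kernel to a uniform kernel can be reproduced qualitatively with the sketch below. It makes two assumptions that are not stated in the text: the interpolation is taken to be linear, $v_t=(1-t)\,\delta+t\,u$ for $t\in[0,1]$, and the "singular values" are taken to be the moduli of the 2D DFT coefficients of the $3\times 3$ variance pattern embedded in an $8\times 8$ grid. The function name `variance_spectrum` is likewise illustrative.

```python
import numpy as np


def variance_spectrum(t, n=8, k=3):
    """Sorted |2D DFT| of v_t = (1 - t) * delta + t * uniform for a k x k kernel.

    The interpolation rule is a hypothetical stand-in for the schemes between
    the one-hot (t = 0) and uniform (t = 1) variance patterns.
    """
    delta = np.zeros((k, k))
    delta[k // 2, k // 2] = 1.0
    uniform = np.full((k, k), 1.0 / (k * k))
    v = (1.0 - t) * delta + t * uniform
    grid = np.zeros((n, n))
    for i in range(k):
        for j in range(k):
            # center the kernel at index (0, 0), wrapping offsets mod n
            grid[(i - k // 2) % n, (j - k // 2) % n] = v[i, j]
    return np.sort(np.abs(np.fft.fft2(grid)).ravel())[::-1]


spec_delta = variance_spectrum(0.0)
spec_mid = variance_spectrum(0.5)
spec_uniform = variance_spectrum(1.0)
```

At $t=0$ all 64 values equal one, and as $t$ grows the smallest values move toward zero, mirroring the progressively worse conditioning described for GS0 through GS4.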

3.4 Training 10,000 layers: Delta-Orthogonal Initialization

Our theory predicts that ultra-deep CNNs can train faster and perform better if critically initialized using Delta-Orthogonal kernels. To test this theory, we train CNNs of 1,250, 2,500, 5,000 and 10,000 layers on both MNIST and CIFAR-10 (Fig. 1). All these networks learn surprisingly quickly and, remarkably, the learning time measured in number of training epochs is independent of depth. Furthermore, our experimental results match well with the predicted benefits of this initialization: $99\%$ test accuracy on MNIST for a 10,000-layer network, and $82\%$ on CIFAR-10. To isolate the benefits of the Delta-Orthogonal init, we also train a 2048-layer CNN (Fig. 3) using the spatially-uniform orthogonal initialization proposed in Section 2.3; the test accuracy is about $70\%$. Note that the test accuracy using (spatially uniform) Gaussian (non-orthogonal) initialization is already below $70\%$ when the depth is 259.

Discussion

In this work, we developed a theoretical framework based on mean field theory to study the propagation of signals in deep convolutional neural networks. By examining the necessary conditions for signals to flow both forward and backward through the network without attenuation, we derived an initialization scheme that facilitates training of vanilla CNNs of unprecedented depths. We presented an algorithm for the generation of random orthogonal convolutional kernels, an ingredient that is necessary to enable dynamical isometry, i.e. good conditioning of the network’s input-output Jacobian. In contrast to the fully-connected case, signal propagation in CNNs is intrinsically multi-dimensional – we showed how to decompose those signals into independent Fourier modes and how to promote uniform signal propagation across them. By leveraging these various theoretical insights, we demonstrated empirically that it is possible to train vanilla CNNs with 10,000 layers or more.

Our results indicate that we have removed all the major fundamental obstacles to training arbitrarily deep vanilla convolutional networks. In doing so, we have laid the groundwork to begin addressing some outstanding questions in the deep learning community, such as whether depth alone can deliver enhanced generalization performance. Our initial results suggest that past a certain depth, on the order of tens or hundreds of layers, the test performance for vanilla convolutional architectures saturates. These observations suggest that architectural features such as residual connections and batch normalization are likely to play an important role in defining a good model class, rather than simply enabling efficient training.

Acknowledgements

We thank Xinyang Geng, Justin Gilmer, Alex Kurakin, Jaehoon Lee, Hoang Trieu Trinh, and Greg Yang for useful discussions and feedback.

References

Appendix A Discussion of Mean Field Theory

For $l\geq 0$, note that (a) $\{h^{l+1}_{j}\}_{j}$ are i.i.d. random variables and (b) for each $j$, $h^{l+1}_{j}$ is a sum of $c$ i.i.d. random variables with mean zero. The central limit theorem implies that $\{h^{l+1}_{j}\}_{j}$ are i.i.d. Gaussian random variables. Let $\bm{\Sigma}^{l+1}=\{\Sigma^{l+1}_{\alpha,\alpha^{\prime}}\}_{\alpha,\alpha^{\prime}}$ denote the covariance matrix, where

where the expectation is taken over all random variables in and before layer $(l+1)$. Therefore, we have the following lemma.

As $c\to\infty$, for each $l\geq 0$, $h^{l+1}_{j}$ is a mean zero Gaussian with covariance matrix $\bm{\Sigma}^{l+1}$ satisfying the recurrence relation,

Let $\theta^{l}=[W^{l},b^{l}]$ and $\theta^{0:l}=[\theta^{0},\dots,\theta^{l}]$. Then,

Note that $\bm{\Sigma}^{1}$ can be computed once $h^{0}$ (or $x^{0}$) is given. We will proceed by induction. Let $l\geq 1$ be fixed and assume $\{h_{j}^{l}\}_{j}$ are i.i.d. mean zero Gaussian with covariance $\bm{\Sigma}^{l}$. It is not difficult to see that $\{h_{j}^{l+1}\}_{j}$ are also i.i.d. mean zero Gaussian as $c\to\infty$. To compute the covariance, note that for any fixed pair $(\alpha,\alpha^{\prime})$, $\{x_{i}(\alpha)x_{i}(\alpha^{\prime})\}_{i}$ are i.i.d. random variables. Then,

Thus by eq. (2.5), eq. (S2) can be written as,

The same proof yields the following corollary.

Let $v=(v_{\beta})_{\beta\in{\it ker}}$ be a sequence of non-negative numbers with $\sum_{\beta\in{\it ker}}v_{\beta}=1$. Let $\mathcal{A}_{v}$ be the cross-correlation operator induced by $v$, i.e.,

Suppose the weights $\omega_{i,j}^{l}(\beta)$ are drawn independently from the Gaussian $\mathcal{N}(0,\frac{v_{\beta}}{c}\cdot\sigma_{\omega}^{2})$. Then the recurrence relation for the covariance matrix is given by,

Let $E$ denote the loss associated to a CNN and $\delta^{l}_{j}(\alpha)$ denote a backprop signal given by,

The layer-to-layer recurrence relation is given by,

We need to assume that the weights used during back-propagation are drawn independently from the weights used in forward propagation. This implies that $\{\delta^{l}_{j}\}_{j\in{\it chn}}$ are independent for all $l$ and, for $j\neq j^{\prime}$,

For large $l$, the second parenthesized term can be approximated by $\chi_{1}$ if $\alpha^{\prime}=\alpha$ and by $\chi_{c^{*}}$ otherwise.
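The weight scaling used in the corollary above can be checked numerically. The following is a minimal sketch, not the authors' code. It assumes a 1D layer with circular boundary conditions and the forward rule $h_{j}(\alpha)=\sum_{i}\sum_{\beta}x_{i}(\alpha+\beta)\,\omega_{i,j}(\beta)+b_{j}$ with $b_{j}\sim\mathcal{N}(0,\sigma_{b}^{2})$, a convention taken to match the main text. Under that assumption, for a fixed input $x$ the covariance of $h_{j}$ over the weights is $\sigma_{\omega}^{2}\sum_{\beta}v_{\beta}\,\frac{1}{c}\sum_{i}x_{i}(\alpha+\beta)x_{i}(\alpha^{\prime}+\beta)+\sigma_{b}^{2}$. All function names and numerical values are illustrative.

```python
import numpy as np

def preactivation_samples(x, v, sigma_w, sigma_b, num_out, rng):
    """Sample pre-activations of an assumed 1D circular conv layer.

    x: (c, n) input activations; v: (2k+1,) variance weights summing to 1.
    Returns an array of shape (num_out, n), one row per output channel j.
    """
    c, n = x.shape
    k = (len(v) - 1) // 2
    # omega[j, i, b] ~ N(0, v_b * sigma_w^2 / c), where b indexes beta = -k..k
    omega = rng.normal(size=(num_out, c, len(v))) * np.sqrt(v * sigma_w**2 / c)
    bias = rng.normal(size=(num_out, 1)) * sigma_b
    h = np.zeros((num_out, n))
    for b, beta in enumerate(range(-k, k + 1)):
        # np.roll(x, -beta, axis=1)[i, a] equals x[i, (a + beta) % n]
        h += omega[:, :, b] @ np.roll(x, -beta, axis=1)
    return h + bias

def predicted_covariance(x, v, sigma_w, sigma_b):
    """Covariance over weights implied by the assumed forward rule."""
    c, n = x.shape
    k = (len(v) - 1) // 2
    gram = x.T @ x / c  # gram[a, a'] = (1/c) sum_i x_i(a) x_i(a')
    cov = np.full((n, n), float(sigma_b) ** 2)
    for b, beta in enumerate(range(-k, k + 1)):
        # shifted[a, a'] = gram[(a + beta) % n, (a' + beta) % n]
        shifted = np.roll(gram, (-beta, -beta), axis=(0, 1))
        cov += sigma_w**2 * v[b] * shifted
    return cov

rng = np.random.default_rng(0)
x = np.tanh(rng.normal(size=(8, 6)))   # c = 8 channels, n = 6 positions
v = np.array([0.2, 0.5, 0.3])          # non-uniform variance kernel, 2k+1 = 3
h = preactivation_samples(x, v, sigma_w=1.5, sigma_b=0.5, num_out=50_000, rng=rng)
empirical = h.T @ h / h.shape[0]
predicted = predicted_covariance(x, v, sigma_w=1.5, sigma_b=0.5)
max_abs_error = np.max(np.abs(empirical - predicted))
print("max |empirical - predicted| =", max_abs_error)
```

The sketch averages over output channels $j$, which are i.i.d. given $x$, so the remaining discrepancy is Monte Carlo error that shrinks as `num_out` grows.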

Appendix B The Jacobian of the $\mathcal{C}$-map

Recall that $\mathcal{C}:\text{PSD}_{n}\to\text{PSD}_{n}$ is given by,

Let $J$ be as above and $\mathcal{A}$ be any $n\times n$ diagonal matrix and $U$ be any $n\times n$ symmetric matrix. Then,

Let $\{V_{\alpha,\alpha^{\prime}}\}_{0\leq\alpha\leq\alpha^{\prime}\leq n-1}$ be the canonical basis of the space of $n\times n$ symmetric matrices, i.e. $[V_{\alpha,\alpha^{\prime}}]_{\bar{\alpha},\bar{\alpha}^{\prime}}=1$ if $(\alpha,\alpha^{\prime})=(\bar{\alpha},\bar{\alpha}^{\prime})$ or $(\bar{\alpha}^{\prime},\bar{\alpha})$ and $0$ otherwise. We claim the following:

The Jacobian $J$ has the following representation:

For the off-diagonal terms (i.e. $\alpha\neq\alpha^{\prime}$),

We first prove Theorem B.1 assuming Lemma B.2, and afterwards we prove the latter.

is an eigenspace of $J$ with eigenvalue $\chi_{c^{*}}$. Here ${\rm span}\{X\}$ denotes the linear span of $X$. For $\chi_{q^{*}}\neq\chi_{c^{*}}$, define,

is an eigenspace of $J$ with eigenvalue $\chi_{q^{*}}$ and the direct sum $V_{\text{d}}\bigoplus V_{\text{o.d.}}$ is the whole space of $n\times n$ symmetric matrices. Note that $J$ acts on $V_{\text{o.d.}}$ in a pointwise fashion and that $\mathcal{A}$ maps $V_{\text{o.d.}}$ onto itself (one can form an eigen-decomposition of $\mathcal{A}$ (and $J$) in $V_{\text{o.d.}}$ using Fourier matrices; see below for details). Thus $\mathcal{A}$ commutes with $J$ in $V_{\text{o.d.}}$. It remains to verify that they also commute in $V_{\text{d}}$.

which we can use to form a new basis for $V_{\text{d}}$,

where $F_{\alpha}$ is the diagonal matrix formed by the $\alpha$-th row of the $n\times n$ Fourier matrix, i.e. $F_{\alpha}={\rm diag}((f_{\alpha,\alpha^{\prime}})_{\alpha^{\prime}\in{\it sp}})$ with $f_{\alpha,\alpha^{\prime}}=\frac{1}{\sqrt{n}}e^{2\pi i\alpha\alpha^{\prime}/n}$. Since each $F_{\alpha}$ is an eigenvector of the $2D$ convolutional operator $\mathcal{A}\star\cdot$ ($\mathcal{A}$ is diagonal),

where $\lambda_{\alpha}$ is the eigenvalue of $F_{\alpha}$. This finishes our proof. ∎
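The eigenvector property used at the end of the proof can be illustrated numerically. The sketch below is an assumption-laden example rather than part of the proof: it takes the action of $\mathcal{A}\star\cdot$ on a diagonal matrix with entries $d(\alpha^{\prime})$ to be the circular cross-correlation $\sum_{\beta}v_{\beta}\,d(\alpha^{\prime}+\beta)$ with a uniform kernel $v_{\beta}=1/(2k+1)$, and checks that each Fourier row $f_{\alpha,\cdot}$ is an eigenvector with eigenvalue $\lambda_{\alpha}=\sum_{\beta}v_{\beta}e^{2\pi i\alpha\beta/n}$. The helper name `cross_correlate_diagonal` and the sizes $n=10$, $2k+1=3$ are hypothetical choices.

```python
import numpy as np

def cross_correlate_diagonal(d, v):
    """Apply d -> sum_beta v_beta d(. + beta) with circular boundaries."""
    k = (len(v) - 1) // 2
    out = np.zeros(len(d), dtype=complex)
    for b, beta in enumerate(range(-k, k + 1)):
        # np.roll(d, -beta)[a] equals d[(a + beta) % n]
        out += v[b] * np.roll(d, -beta)
    return out

n = 10
k = 1
v = np.full(2 * k + 1, 1.0 / (2 * k + 1))  # uniform kernel (assumed form of A)
positions = np.arange(n)

eigenvalues = []
max_residual = 0.0
for alpha in range(n):
    # alpha-th row of the n x n Fourier matrix
    f = np.exp(2j * np.pi * alpha * positions / n) / np.sqrt(n)
    lam = sum(v[b] * np.exp(2j * np.pi * alpha * beta / n)
              for b, beta in enumerate(range(-k, k + 1)))
    residual = np.max(np.abs(cross_correlate_diagonal(f, v) - lam * f))
    max_residual = max(max_residual, residual)
    eigenvalues.append(lam)

eigenvalues = np.array(eigenvalues)
print("largest eigenvector residual:", max_residual)
print("eigenvalues (real parts):", np.round(eigenvalues.real, 4))
```

For a symmetric kernel the eigenvalues are real, here $\lambda_{\alpha}=(1+2\cos(2\pi\alpha/n))/3$, and the zero-frequency mode has $\lambda_{0}=1$.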

We first consider perturbing the off-diagonal terms. Let $\epsilon$ be a small number and $\alpha\neq\alpha^{\prime}$. Note that for $(\bar{\alpha},\bar{\alpha}^{\prime})\notin\{(\alpha,\alpha^{\prime}),(\alpha^{\prime},\alpha)\}$,

where $(h_{1},h_{2})\sim\mathcal{N}(0,Q)$ with $Q_{11}=Q_{22}=q^{*}$ and $Q_{12}=Q_{21}=c^{*}q^{*}+\epsilon$. Let $c=c^{*}+\epsilon/q^{*}$ and choose two independent random variables $u_{1}$, $u_{2}\sim\mathcal{N}(0,1)$. Then,

Taylor expanding the term $\phi(\sqrt{q^{*}}(cu_{1}+\sqrt{1-c^{2}}u_{2}))$ about the point $\sqrt{q^{*}}(c^{*}u_{1}+\sqrt{1-(c^{*})^{2}}u_{2})$, one can show,

which proves the first statement of Lemma B.2.
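The off-diagonal perturbation argument can be sanity-checked with numerical quadrature. The sketch below assumes, following the main text and Poole et al. (2016), that the relevant entry of the $\mathcal{C}$-map is $\sigma_{\omega}^{2}\,\mathbb{E}[\phi(h_{1})\phi(h_{2})]+\sigma_{b}^{2}$ and that $\chi_{c^{*}}=\sigma_{\omega}^{2}\,\mathbb{E}[\phi^{\prime}(h_{1})\phi^{\prime}(h_{2})]$. It uses the parametrization $h_{1}=\sqrt{q}\,u_{1}$, $h_{2}=\sqrt{q}(cu_{1}+\sqrt{1-c^{2}}u_{2})$ from the proof and compares a central finite difference in $Q_{12}$ with $\chi_{c}$. The values of $q$, $c$ and $\sigma_{\omega}$ are arbitrary illustrative numbers, not an actual fixed point, and the function name is hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def pair_expectation(f, g, q, c, num_nodes=80):
    """E[f(h1) g(h2)] for (h1, h2) ~ N(0, q * [[1, c], [c, 1]]).

    Uses a tensor-product Gauss-Hermite rule in the independent
    standard normal variables u1, u2.
    """
    nodes, weights = hermegauss(num_nodes)
    weights = weights / np.sqrt(2.0 * np.pi)  # normalize to a N(0, 1) density
    u1, u2 = np.meshgrid(nodes, nodes, indexing="ij")
    w = np.outer(weights, weights)
    h1 = np.sqrt(q) * u1
    h2 = np.sqrt(q) * (c * u1 + np.sqrt(1.0 - c**2) * u2)
    return np.sum(w * f(h1) * g(h2))

phi = np.tanh
dphi = lambda h: 1.0 - np.tanh(h) ** 2

q, c, sigma_w = 1.5, 0.6, 1.3   # illustrative values only
eps = 1e-4                      # perturbation of Q_12 = c * q

# Central finite difference of sigma_w^2 E[phi(h1) phi(h2)] with respect to Q_12.
upper = pair_expectation(phi, phi, q, c + eps / q)
lower = pair_expectation(phi, phi, q, c - eps / q)
finite_difference = sigma_w**2 * (upper - lower) / (2.0 * eps)

# Assumed closed form: chi_c = sigma_w^2 E[phi'(h1) phi'(h2)].
chi_c = sigma_w**2 * pair_expectation(dphi, dphi, q, c)

print("finite difference:", finite_difference)
print("chi_c            :", chi_c)
```

The two numbers should agree up to the $O(\epsilon^{2})$ truncation error of the finite difference, which is the first-order statement obtained from the Taylor expansion above.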

To prove the second statement, let $\alpha$ be fixed and perturb $\bm{\Sigma}^{*}$ by $\epsilon V_{\alpha,\alpha}$. Note that all the terms are unchanged except the ones in the $\alpha$-th row or $\alpha$-th column. It is straightforward to show (see Poole et al., 2016) that

where $u_{1}$ and $u_{2}$ are the same as in eq. (S17) and,

Appendix C Construction of Random Orthogonal Kernels

For simplicity, consider constructing a $\Bbbk\times\Bbbk\times c\times c$ orthogonal kernel. The complexity can be roughly determined as follows:

Constructing $O(\Bbbk)$ $c\times c$ symmetric orthogonal matrices takes $O(\Bbbk c^{3})$ steps.

For $j=1,\dots,\Bbbk-1$, convolving a $j\times j$ block matrix (each entry is a $c\times c$ matrix) with a $2\times 2$ matrix requires $O(j^{2})$ matrix multiplications between two $c\times c$ matrices. Since each matrix multiplication costs $O(c^{3})$, a total of $O((\Bbbk c)^{3})$ steps is required for the block-wise matrix convolutions.

In sum, the computational complexity is about $O((\Bbbk c)^{3})$.

C.2 Delta Orthogonal Kernels

Appendix D Phase Diagram and Vanishing/Exploding gradients

Figure S2 shows the phase diagram derived from the mean field theory of signal propagation in fully-connected networks, reproduced from (Pennington et al., 2017). It depicts the ordered and chaotic phases (with vanishing and exploding gradients, respectively) separated by a transition. The variation in the value of $q^{*}$ along the critical line is shown in color. As discussed in the main text, it also applies to the ordered-to-chaotic phase transition of CNNs.

D.2 Vanishing and Exploding Gradients

Appendix E Distribution of singular values of weight matrices

The end-to-end Jacobian $\bm{J}$ depends on the matrix of weights $\bm{W}^{l}$, and the singular value distribution of the latter plays a key role, as discussed in the main text. Figure S4 compares the singular value distribution of weight matrices $\bm{W}^{l}$ in the convolutional vs. fully-connected setting. In more detail, $\bm{W}^{l}$ in the convolutional case can be considered an $n\times n$ circulant tiling of $c\times c$ dense blocks, where each matrix element is generated i.i.d. from $\mathcal{N}(0,1/(c(2k+1)))$. For fixed $n=26$ and $2k+1=5$ we compute the singular value distribution, as $c$ increases, for single draws. This is compared to the distribution for the weight matrix in the fully-connected setting, obtained from a dense $nc\times nc$ matrix $\bm{W}^{l}$ whose entries are drawn i.i.d. from $\mathcal{N}(0,1/(nc))$. We empirically find that the agreement between the two improves as the channel number increases, suggesting that the random matrix theory analysis of (Pennington et al., 2017, 2018) carries over to the convolutional setting.
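A comparison of this kind can be reproduced in a few lines. The sketch below is illustrative and is not the code used for the figure: it builds the block-circulant matrix of a 1D circular convolution with $n=26$ and $2k+1=5$ as stated above, picks $c=16$ as an arbitrary channel count, and computes singular values for a single draw of each ensemble. The function name `block_circulant_conv_matrix` and the block placement convention (block $\beta$ at block column $\alpha+\beta$ modulo $n$) are assumptions.

```python
import numpy as np

def block_circulant_conv_matrix(n, c, kernel_size, rng):
    """nc x nc matrix of a 1D circular convolution with c channels in and out.

    Each c x c block has i.i.d. N(0, 1 / (c * kernel_size)) entries and is
    repeated along a cyclic block diagonal.
    """
    k = (kernel_size - 1) // 2
    std = np.sqrt(1.0 / (c * kernel_size))
    blocks = rng.normal(size=(kernel_size, c, c)) * std
    W = np.zeros((n * c, n * c))
    for alpha in range(n):
        for b, beta in enumerate(range(-k, k + 1)):
            col = (alpha + beta) % n
            W[alpha * c:(alpha + 1) * c, col * c:(col + 1) * c] = blocks[b]
    return W

rng = np.random.default_rng(0)
n, kernel_size, c = 26, 5, 16   # n and kernel size from the text; c is illustrative

W_conv = block_circulant_conv_matrix(n, c, kernel_size, rng)
W_dense = rng.normal(size=(n * c, n * c)) / np.sqrt(n * c)

sv_conv = np.linalg.svd(W_conv, compute_uv=False)
sv_dense = np.linalg.svd(W_dense, compute_uv=False)

# Both ensembles are normalized so that the mean squared singular value is about 1.
print("mean sv^2 (conv): ", np.mean(sv_conv**2))
print("mean sv^2 (dense):", np.mean(sv_dense**2))
```

Histograms of `sv_conv` and `sv_dense` for increasing `c` give a qualitative version of the comparison described above; the sketch itself only verifies the shared normalization.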

Appendix F Multiple depth scales in signal propagation

Figure S4 empirically demonstrates the existence of multiple depth scales, as discussed in Section 2.1.5. We consider an ensemble of random CNNs and compute the average covariance matrix $\bm{\Sigma}^{l}$ as a function of depth. We consider networks with $\operatorname{erf}$ nonlinearities with $\sigma_{w}=\frac{3}{2}$ and $\sigma_{b}=\frac{1}{2}$ applied to $1$D images of size $n=10$. The initial data covariance $\bm{\Sigma}^{0}$ is chosen so that $\bm{\epsilon}^{0}=\bm{\Sigma}^{*}-\bm{\Sigma}^{0}$ is small and has an off-diagonal structure. In particular, all entries of $\bm{\epsilon}^{0}$ except the first cyclic diagonal entries are taken to be zero, and that diagonal has Fourier transform given by $-\frac{1}{6}[1,\frac{2}{3},(\frac{2}{3})^{3},(\frac{2}{3})^{5},(\frac{2}{3})^{4},(\frac{2}{3})^{2},(\frac{2}{3})^{4},(\frac{2}{3})^{5},(\frac{2}{3})^{3},\frac{2}{3}]$. We used a spatially non-uniform kernel of size $2k+1=3$, with weights $v=[0.025,0.950,0.025]$. We then averaged $\bm{\Sigma}^{l}$ over this ensemble of networks to construct $\bm{\epsilon}^{l}$. By decomposing the vector of first cyclic diagonal entries into Fourier modes, we can observe how the signal decays differently along different modes. Figure S4 plots the absolute value of the coefficients of the Fourier decomposition as a function of depth. Our mean field theory predictions for the different depth scales are in excellent agreement with the empirical simulations.