Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice

Jeffrey Pennington, Samuel S. Schoenholz, Surya Ganguli

Introduction

Deep learning has achieved state-of-the-art performance in many domains, including computer vision , machine translation , human games , education , and neurobiological modeling . A major determinant of success in training deep networks lies in appropriately choosing the initial weights. Indeed the very genesis of deep learning rested upon the initial observation that unsupervised pre-training provides a good set of initial weights for subsequent fine-tuning through backpropagation . Moreover, seminal work in deep learning suggested that appropriately-scaled Gaussian weights can prevent gradients from exploding or vanishing exponentially , a condition that has been found to be necessary to achieve reasonable learning speeds .

These random weight initializations were primarily driven by the principle that the mean squared singular value of a deep network’s Jacobian from input to output should remain close to 11. This condition implies that on average, a randomly chosen error vector will preserve its norm under backpropagation; however, it provides no guarantees on the worst case growth or shrinkage of an error vector. A stronger requirement one might demand is that every Jacobian singular value remain close to 11. Under this stronger requirement, every single error vector will approximately preserve its norm, and moreover all angles between different error vectors will be preserved. Since error information backpropagates faithfully and isometrically through the network, this stronger requirement is called dynamical isometry .

A theoretical analysis of exact solutions to the nonlinear dynamics of learning in deep linear networks revealed that weight initializations satisfying dynamical isometry yield a dramatic increase in learning speed compared to initializations that do not. For such linear networks, orthogonal weight initializations achieve dynamical isometry, and, remarkably, their learning time, measured in number of learning epochs, becomes independent of depth. In contrast, random Gaussian initializations do not achieve dynamical isometry, nor do they achieve depth-independent training times.

It remains unclear, however, how these results carry over to deep nonlinear networks. Indeed, empirically, a simple change from Gaussian to orthogonal initializations in nonlinear networks has yielded mixed results , raising important theoretical and practical questions. First, how does the entire distribution of singular values of a deep network’s input-output Jacobian depend upon the depth, the statistics of random initial weights, and the shape of the nonlinearity? Second, what combinations of these ingredients can achieve dynamical isometry? And third, among the nonlinear networks that have neither vanishing nor exploding gradients, do those that in addition achieve dynamical isometry also achieve much faster learning compared to those that do not? Here we answer these three questions, and we provide a detailed summary of our results in the discussion.

Theoretical Results

In this section we derive expressions for the entire singular value density of the input-output Jacobian for a variety of nonlinear networks in the large-width limit. We compute the mean squared singular value of J\mathbf{J} (or, equivalently, the mean eigenvalue of JJT\mathbf{J}\mathbf{J}^{T}), and deduce a rescaling that sets it equal to 11. We then examine two metrics that help quantify the conditioning of the Jacobian: smaxs_{\text{max}}, the maximum singular value of J\mathbf{J} (or, equivalently, λmax\lambda_{\text{max}}, the maximum eigenvalue of JJT\mathbf{J}\mathbf{J}^{T}); and σJJT2\sigma^{2}_{JJ^{T}}, the variance of the eigenvalue distribution of JJT\mathbf{J}\mathbf{J}^{T}. If λmax1\lambda_{\text{max}}\gg 1 and σJJT21\sigma^{2}_{JJ^{T}}\gg 1 then the Jacobian is ill-conditioned and we expect the learning dynamics to be slow.

Here Dl\mathbf{D}^{l} is a diagonal matrix with entries Dijl=ϕ(hil)δij{D}^{l}_{ij}=\phi^{\prime}({h}^{l}_{i})\,\delta_{ij}. The input-output Jacobian J\mathbf{J} is closely related to the backpropagation operator mapping output errors to weight matrices at a given layer; if the former is well conditioned, then the latter tends to be well-conditioned for all weight layers. We therefore wish to understand the entire singular value spectrum of J\mathbf{J} for deep networks with randomly initialized weights and biases.

In particular, we will take the biases bil{b}^{l}_{i} to be drawn i.i.d. from a zero mean Gaussian with standard deviation σb\sigma_{b}. For the weights, we will consider two random matrix ensembles: (1) random Gaussian weights in which each Wijl{W}^{l}_{ij} is drawn i.i.d from a Gaussian with variance σw2/N\sigma_{w}^{2}/N, and (2) random orthogonal weights, drawn from a uniform distribution over scaled orthogonal matrices obeying (Wl)TWl=σw2I(\mathbf{W}^{l})^{T}\mathbf{W}^{l}=\sigma_{w}^{2}\,\mathbf{I}.

2 Review of signal propagation

The random matrices Dl\mathbf{D}^{l} in eqn. (2) depend on the empirical distribution of pre-activations hl\mathbf{h}^{l} entering the nonlinearity ϕ\phi in eqn. (1). The propagation of this empirical distribution through different layers ll was studied in . There, it was shown that in the large-NN limit this empirical distribution converges to a Gaussian with zero mean and variance qlq^{l}, where qlq^{l} obeys a recursion relation induced by the dynamics in eqn. (1),

with initial condition q0=1Ni=1N(hi0)2q^{0}=\frac{1}{N}\sum_{i=1}^{N}({h}^{0}_{i})^{2}, and where Dh=dh2πexp(h22)\mathcal{D}h=\frac{dh}{\sqrt{2\pi}}\,\exp{(-\frac{h^{2}}{2})} denotes the standard Gaussian measure. This recursion has a fixed point obeying,

If the input h0\mathbf{h}^{0} is chosen so that q0=qq^{0}=q^{*}, then we start at the fixed point, and the distribution of Dl\mathbf{D}^{l} becomes independent of ll. Also, if we do not start at the fixed point, in many scenarios we rapidly approach it in a few layers (see ), so for large LL, assuming ql=qq^{l}=q^{*} at all depths ll is a good approximation in computing the spectrum of J\mathbf{J}.

Another important quantity governing signal propagation through deep networks is

where ϕ\phi^{\prime} is the derivative of ϕ\phi. Here χ\chi is the mean of the distribution of squared singular values of the matrix DW\mathbf{DW}, when the pre-activations are at their fixed point distribution with variance qq^{*}. As shown in and Fig. 1, χ(σw,σb)\chi(\sigma_{w},\sigma_{b}) separates the (σw,σb)(\sigma_{w},\sigma_{b}) plane into two phases, chaotic and ordered, in which gradients exponentially explode or vanish respectively. Indeed, the mean squared singular value of J\mathbf{J} was shown simply to be χL\chi^{L} in , so χ=1\chi=1 is a critical line of initializations with neither vanishing nor exploding gradients.

3 Free probability, random matrix theory and deep networks.

While the previous section revealed that the mean squared singular value of J\mathbf{J} is χL\chi^{L}, we would like to obtain more detailed information about the entire singular value distribution of J\mathbf{J}, especially when χ=1\chi=1. Since eqn. (2) consists of a product of random matrices, free probability becomes relevant to deep learning as a powerful tool to compute the spectrum of J\mathbf{J}, as we now review.

In general, given a random matrix X\mathbf{X}, its limiting spectral density is defined as

where X\langle\cdot\rangle_{X} denotes the mean with respect to the distribution of the random matrix X\mathbf{X}. Also,

is the definition of the Stieltjes transform of ρX\rho_{X}, which can be inverted using,

The Stieltjes transform GXG_{X} is related to the moment generating function MXM_{X},

where the mkm_{k} is the kkth moment of the distribution ρX\rho_{X}, mk=dλ  ρX(λ)λk=1NtrXkX.m_{k}=\int d\lambda\;\rho_{X}(\lambda)\lambda^{k}=\frac{1}{N}\langle\text{tr}\mathbf{X}^{k}\rangle_{X}\,. In turn, we denote the functional inverse of MXM_{X} by MX1M_{X}^{-1}, which by definition satisfies MX(MX1(z))=MX1(MX(z))=zM_{X}(M_{X}^{-1}(z))=M_{X}^{-1}(M_{X}(z))=z. Finally, the S-transform is defined as,

The utility of the S-transform arises from its behavior under multiplication. Specifically, if A\mathbf{A} and B\mathbf{B} are two freely-independent random matrices, then the S-transform of the product random matrix ensemble AB\mathbf{A}\mathbf{B} is simply the product of their S-transforms,

Our first main result will be to use eqn. (11) to write down an implicit definition of the spectral density of JJT\mathbf{J}\mathbf{J}^{T}. To do this we first note that (see Result 1 of the supplementary material),

where we have used the identical distribution of the weights to define SWWT=SWlWlTS_{WW^{T}}=S_{W_{l}W_{l}^{T}} for all ll, and we have also used the fact the pre-activations are distributed independently of depth as hlN(0,q)h_{l}\sim\mathcal{N}(0,q^{*}), which implies that SDl2=SD2S_{D^{2}_{l}}=S_{D^{2}} for all ll.

Eqn. (12) provides a method to compute the spectrum ρJJT(λ)\rho_{JJ^{T}}(\lambda). Starting from ρWTW(λ)\rho_{W^{T}W}(\lambda) and ρD2(λ)\rho_{D^{2}}(\lambda), we compute their respective S-transforms through the sequence of equations eqns. (7), (9), and (10), take the product in eqn. (12), and then reverse the sequence of steps to go from SJJTS_{JJ^{T}} to ρJJT(λ)\rho_{JJ^{T}}(\lambda) through the inverses of eqns. (10), (9), and (8). Thus we must calculate the S-transforms of WWT\mathbf{W}\mathbf{W}^{T} and D2\mathbf{D}^{2}, which we attack next for specific nonlinearities and weight ensembles in the following sections. In principle, this procedure can be carried out numerically for an arbitrary choice of nonlinearity, but we postpone this investigation to future work.

4 Linear networks

As a warm-up, we first consider a linear network in which J=l=1LWl\mathbf{J}=\prod_{l=1}^{L}\mathbf{W}^{l}. Since criticality (χ=1\chi=1 in eqn. (5)) implies σw2=1\sigma_{w}^{2}=1 and eqn. (4) reduces to q=σw2q+σb2q^{*}=\sigma_{w}^{2}q^{*}+\sigma_{b}^{2}, the only critical point is (σw,σb)=(1,0)(\sigma_{w},\sigma_{b})=(1,0). The case of orthogonal weights is simple: J\mathbf{J} is also orthogonal, and all its singular values are 11, thereby achieving perfect dynamic isometry. Gaussian weights behave very differently. The squared singular values si2s^{2}_{i} of J\bm{J} equal the eigenvalues λi\lambda_{i} of JJT\mathbf{J}\mathbf{J}^{T}, which is a product Wishart matrix, whose spectral density was recently computed in . The resulting singular value density of J\mathbf{J} is given by,

Fig. 2(a) demonstrates a match between this theoretical density and the empirical density obtained from numerical simulations of random linear networks. As the depth increases, this density becomes highly anisotropic, both concentrating about zero and developing an extended tail.

Note that ϕ=π/(L+1)\phi=\pi/(L+1) corresponds to the minimum singular value smin=0s_{\text{min}}=0, while ϕ=0\phi=0 corresponds to the maximum eigenvalue, λmax=smax2=LL(L+1)L+1\lambda_{\text{max}}=s_{\text{max}}^{2}=L^{-L}(L+1)^{L+1}, which, for large LL scales as λmaxeL\lambda_{\text{max}}\sim eL. Both eqn. (13) and the methods of Section 2.5 yield the variance of the eigenvalue distribution of JJT\mathbf{J}\mathbf{J}^{T} to be σJJT2=L\sigma^{2}_{JJ^{T}}=L. Thus for linear Gaussian networks, both smaxs_{\text{max}} and σJJT2\sigma^{2}_{JJ^{T}} grow linearly with depth, signaling poor conditioning and the breakdown of dynamical isometry.

5 ReLU and hard-tanh networks

We first discuss the criticality conditions (finite qq^{*} in eqn. (4) and χ=1\chi=1 in eqn. (5)) in these two nonlinear networks. For both networks, since the slope of the nonlinearity ϕ(h)\phi^{\prime}(h) only takes the values and 11, χ\chi in eqn. (5) reduces to χ=σw2p(q)\chi=\sigma_{w}^{2}p(q^{*}) where p(q)p(q^{*}) is the probability that a given neuron is in the linear regime with ϕ(h)=1\phi^{\prime}(h)=1. As discussed above, we take the large-width limit in which the distribution of the pre-activations hh is a zero mean Gaussian with variance qq^{*}. We therefore find that for ReLU, p(q)=12p(q^{*})=\frac{1}{2} is independent of qq^{*}, whereas for hard-tanh, p(q)=11dheh2/2q2πq=erf(1/2q)p(q^{*})=\int_{-1}^{1}dh\frac{e^{-h^{2}/2q^{*}}}{\sqrt{2\pi q^{*}}}=\operatorname{erf}(1/\sqrt{2q^{*}}) depends on qq^{*}. In particular, it approaches 11 as q0q^{*}\to 0.

Thus for ReLU, χ=1\chi=1 if and only if σw2=2\sigma_{w}^{2}=2, in which case eqn. (4) reduces to q=12σw2q+σb2q^{*}=\frac{1}{2}\sigma_{w}^{2}q^{*}+\sigma_{b}^{2}, implying that the only critical point is (σw,σb)=(2,0)(\sigma_{w},\sigma_{b})=(2,0). For hard-tanh, in contrast, χ=σw2p(q)\chi=\sigma_{w}^{2}p(q^{*}), where p(q)p(q^{*}) itself depends on σw\sigma_{w} and σb\sigma_{b} through eqn. (4), and so the criticality condition χ=1\chi=1 yields a curve in the (σw,σb)(\sigma_{w},\sigma_{b}) plane similar to that shown for the tanh network in Fig. 1. As one moves along this curve in the direction of decreasing σw\sigma_{w}, the curve approaches the point (σw,σb)=(1,0)(\sigma_{w},\sigma_{b})=(1,0) with qq^{*} monotonically decreasing towards , i.e. q0q^{*}\rightarrow 0 as σw1\sigma_{w}\rightarrow 1.

The critical ReLU network and the one parameter family of critical hard-tanh networks have neither vanishing nor exploding gradients, due to χ=1\chi=1. Nevertheless, the entire singular value spectrum of J\mathbf{J} of these networks can behave very differently. From eqn. (12), this spectrum depends on the non-linearity ϕ(h)\phi(h) through SD2S_{D^{2}} in eqn. (10), which in turn only depends on the distribution of eigenvalues of D2\mathbf{D}^{2}, or equivalently, the distribution of squared derivatives ϕ(h)2\phi^{\prime}(h)^{2}. As we have seen, this distribution is a Bernoulli distribution with parameter p(q)p(q^{*}): ρD2(z)=(1p(q))δ(z)+p(q)δ(z1)\rho_{D^{2}}(z)=(1-p(q^{*}))\,\delta(z)+p(q^{*})\,\delta(z-1). Inserting this distribution into the sequence eqn. (7), eqn. (9), eqn. (10) then yields

To complete the calculation of SJJTS_{JJ^{T}} in eqn. (12), we must also compute SWWTS_{WW^{T}}. We do this for Gaussian and orthogonal weights in the next two subsections.

We re-derive the well-known expression for the SS-transform of products of random Gaussian matrices with variance σw2\sigma_{w}^{2} in Example 3 of the supplementary material. The result is SWWT=σw2(1+z)1S_{WW^{T}}=\sigma_{w}^{-2}(1+z)^{-1}, which, when combined with eqn. (14) for SD2S_{D^{2}}, eqn. (12) for SJJTS_{JJ^{T}}, and eqn. (10) for MX1(z)M^{-1}_{X}(z), yields

Using eqn. (15) and eqn. (9), we can define a polynomial that the Stieltjes transform GG satisfies,

The correct root of this equation is the one for which G1/zG\sim 1/z as zz\to\infty . From eqn. (8), the spectral density is obtained from the imaginary part of G(λ+iϵ)G(\lambda+i\epsilon) as ϵ0+\epsilon\to 0^{+}.

The positions of the spectral edges, namely locations of the minimum and maximum eigenvalues of JJT\mathbf{J}\mathbf{J}^{T}, can be deduced from the values of zz for which the imaginary part of the root of eqn. (16) vanishes, i.e. when the discriminant of the polynomial in eqn. (16) vanishes. After a detailed but unenlightening calculation, we find, for large LL,

Recalling that χ=σw2p(q)\chi=\sigma_{w}^{2}p(q^{*}), we find exponential growth in λmax\lambda_{\text{max}} if χ>1\chi>1 and exponential decay if χ<1\chi<1. Moreover, even at criticality when χ=1\chi=1, λmax\lambda_{\max} still grows linearly with depth.

Next, we obtain the variance σJJT2\sigma_{JJ^{T}}^{2} of the eigenvalue density of JJT\mathbf{J}\mathbf{J}^{T} by computing its first two moments m1m_{1} and m2m_{2}. We employ the Lagrange inversion theorem ,

which relates the expansions of the moment generating function MJJT(z)M_{JJ^{T}}(z) and its functional inverse MJJT1(z)M_{JJ^{T}}^{-1}(z). Substituting this expansion for MJJT1(z)M_{JJ^{T}}^{-1}(z) into eqn. (15), expanding the right hand side, and equating the coefficients of zz, we find,

Both moments generically either exponentially grow or vanish. However even at criticality, when χ=σw2p(q)=1\chi=\sigma_{w}^{2}p(q^{*})=1, the variance σJJT2=m2m12=Lp(q)\sigma_{JJ^{T}}^{2}=m_{2}-m_{1}^{2}=\frac{L}{p(q^{*})} still exhibits linear growth with depth.

Note that p(q)p(q^{*}) is the fraction of neurons operating in the linear regime, which is always less than 11. Thus for both ReLU and hard-tanh networks, no choice of Gaussian initialization can ever prevent this linear growth, both in σJJT2\sigma_{JJ^{T}}^{2} and λmax\lambda_{\text{max}}, implying that even critical Gaussian initializations will always lead to a failure of dynamical isometry at large depth for these networks.

5.2 Orthogonal weights

For orthogonal W\mathbf{W}, we have WWT=I\mathbf{W}\mathbf{W}^{T}=\bm{I}, and the S-transform is SI=1S_{I}=1 (see Example 2 of the supplementary material). After scaling by σw\sigma_{w}, we have SWWT=Sσw2I=σw2SI=σw2S_{WW^{T}}=S_{\sigma_{w}^{2}I}=\sigma_{w}^{-2}S_{I}=\sigma_{w}^{-2}. Combining this with eqn. (14) and eqn. (12) yields SJJT(z)S_{JJ^{T}}(z) and, through eqn. (10), yields MJJT1M^{-1}_{JJ^{T}}:

Now, combining eqn. (20) and eqn. (9), we obtain a polynomial that the Stieltjes transform GG satisfies:

From this we can extract the eigenvalue and singular value density of JJT\mathbf{J}\mathbf{J}^{T} and J\mathbf{J}, respectively, through eqn. (8). Figs. 2(b) and 2(c) demonstrate an excellent match between our theoretical predictions and numerical simulations of random networks. We find that at modest depths, the singular values are peaked near λmax\lambda_{\text{max}}, but at larger depths, the distribution both accumulates mass at and spreads out, developing a growing tail. Thus at fixed critical values of σw\sigma_{w} and σb\sigma_{b}, both deep ReLU and hard-tanh networks have ill-conditioned Jacobians, even with orthogonal weight matrices.

As above, we can obtain the maximum eigenvalue of JJT\mathbf{J}\mathbf{J}^{T} by determining the values of zz for which the discriminant of the polynomial in eqn. (21) vanishes. This calculation yields,

For large LL, λmax\lambda_{\text{max}} either exponentially explodes or decays, except at criticality when χ=σw2p(q)=1\chi=\sigma_{w}^{2}p(q^{*})=1, where it behaves as λmax=1p(q)p(q)(eLe2)+O(L1)\lambda_{\text{max}}=\frac{1-p(q^{*})}{p(q^{*})}\left(eL-\frac{e}{2}\right)+\mathcal{O}(L^{-1}). Also, as above, we can compute the variance σJJT2\sigma^{2}_{JJ^{T}} by expanding MJJT1M^{-1}_{JJ^{T}} in eqn. (20) and applying eqn. (18). At criticality, we find σJJT2=1p(q)p(q)L\sigma_{JJ^{T}}^{2}=\frac{1-p(q^{*})}{p(q^{*})}L for large LL. Now the large LL asymptotic behavior of both λmax\lambda_{\text{max}} and σJJT2\sigma^{2}_{JJ^{T}} depends crucially on p(q)p(q^{*}), the fraction of neurons in the linear regime.

For ReLU networks, p(q)=1/2p(q^{*})=1/2, and we see that λmax\lambda_{\text{max}} and σJJT2\sigma_{JJ^{T}}^{2} grow linearly with depth and dynamical isometry is unachievable in ReLU networks, even for critical orthogonal weights. In contrast, for hard tanh networks, p(q)=erf(1/2q)p(q^{*})=\operatorname{erf}(1/\sqrt{2q^{*}}). Therefore, one can always move along the critical line in the (σw,σb)(\sigma_{w},\sigma_{b}) plane towards the point (1,0)(1,0), thereby reducing qq^{*}, increasing p(q)p(q^{*}), and decreasing, to an arbitrarily small value, the prefactor 1p(q)p(q)\frac{1-p(q^{*})}{p(q^{*})} controlling the linear growth of both λmax\lambda_{\text{max}} and σJJT2\sigma_{JJ^{T}}^{2}. So unlike either ReLU networks, or Gaussian networks, one can achieve dynamical isometry up to depth LL by choosing qq^{*} small enough so that p(q)11Lp(q^{*})\approx 1-\frac{1}{L}. In essence, this strategy increases the fraction of neurons operating in the linear regime, enabling orthogonal hard-tanh nets to mimic the successful dynamical isometry achieved by orthogonal linear nets. However, this strategy is unavailable for orthogonal ReLU networks. A demonstration of these results is shown in Fig. 3.

Experiments

Having established a theory of the entire singular value distribution of J\mathbf{J}, and in particular of when dynamical isometry is present or not, we now provide empirical evidence that the presence or absence of this isometry can have a large impact on training speed. In our first experiment, summarized in Fig. 4, we compare three different classes of critical neural networks: (1) tanh\tanh with small σw2=1.05\sigma_{w}^{2}=1.05 and σb2=2.01×105\sigma_{b}^{2}=2.01\times 10^{-5}; (2) tanh\tanh with large σw2=2\sigma_{w}^{2}=2 and σb2=0.104\sigma_{b}^{2}=0.104; and (3) ReLU with σw2=2\sigma_{w}^{2}=2 and σb2=2.01×105\sigma_{b}^{2}=2.01\times 10^{-5}. In each case σb\sigma_{b} is chosen appropriately to achieve critical initial conditions at the boundary between order and chaos , with χ=1\chi=1. All three of these networks have a mean squared singular value of 11 with neither vanishing nor exploding gradients in the infinite width limit. These experiments therefore probe the specific effect of dynamical isometry, or the entire shape of the spectrum of J\mathbf{J}, on learning. We also explore the degree to which more sophisticated optimizers can overcome poor initializations. We compare SGD, Momentum, RMSProp , and ADAM .

We train networks of depth L=200L=200 and width N=400N=400 for 10510^{5} steps with a batch size of 10310^{3}. We additionally average our results over 30 different instantiations of the network to reduce noise. For each nonlinearity, initialization, and optimizer, we obtain the optimal learning rate through grid search. For SGD and SGD+Momentum we consider logarithmically spaced rates between [104,101][10^{-4},10^{-1}] in steps 100.110^{0.1}; for ADAM and RMSProp we explore the range [107,104][10^{-7},10^{-4}] at the same step size. To choose the optimal learning rate we select a threshold accuracy pp and measure the first step when performance exceeds pp. Our qualitative conclusions are fairly independent of pp. Here we report results on a version of CIFAR-10We use the standard CIFAR-10 dataset augmented with random flips and crops, and random saturation, brightness, and contrast perturbations.

Based on our theory, we expect the performance advantage of orthogonal over Gaussian initializations to be significant in case (1) and somewhat negligible in cases (2) and (3). This prediction is verified in Fig. 4 (blue solid and dashed learning curves are well-separated, compared to red and black cases). Furthermore, the extent of dynamical isometry at initialization strongly predicts the speed of learning. The effect is large, with the most isometric case (orthogonal tanh\tanh with small σw2\sigma_{w}^{2}) learning faster than the least isometric case (ReLU networks) by several orders of magnitude. Moreover, these conclusions robustly persist across all optimizers. Intriguingly, in the case where dynamical isometry helps the most (tanh\tanh with small σw2\sigma_{w}^{2}), the effect of initialization (orthogonal versus Gaussian) has a much larger impact on learning speed than the choice of optimizer.

These insights suggest a more quantitative analysis of the relation between dynamical isometry and learning speed for orthogonal tanh\tanh networks, summarized in Fig. 5. We focus on SGD, given the lack of a strong dependence on optimizer. Intriguingly, Fig. 5(a) demonstrates the optimal training time is O(L)O(\sqrt{L}) and so grows sublinearly with depth LL. Also Fig. 5(b) reveals that increased dynamical isometry enables faster training by making available larger (i.e. faster) learning rates. Finally, Fig. 5(c,d) and their similarity to Fig. 3(b,d) suggest a strong positive correlation between training time and max singular value of J\mathbf{J}. Overall, these results suggest that dynamical isometry is correlated with learning speed, and controlling the entire distribution of Jacobian singular values may be an important design consideration in deep learning.

In Fig. 6, we explore the relationship between dynamical isometry and performance going beyond initialization by studying the evolution of singular values throughout training. We find that if dynamical isometry is present at initialization, it persists for some time into training. Intriguingly, perfect dynamical isometry at initialization (q=0q^{*}=0) is not the best choice for preserving isometry throughout training; instead, some small but nonzero value of qq^{*} appears optimal. Moreover, both learning speed and generalization accuracy peak at this nonzero value. These results bolster the relationship between dynamical isometry and performance beyond simply the initialization.

Discussion

In summary, we have employed free probability theory to analytically compute the entire distribution of Jacobian singular values as a function of depth, random initialization, and nonlinearity shape. This analytic computation yielded several insights into which combinations of these ingredients enable nonlinear deep networks to achieve dynamical isometry. In particular, deep linear Gaussian networks cannot; the maximum Jacobian singular value grows linearly with depth even if the second moment remains 11. The same is true for both orthogonal and Gaussian ReLU networks. Thus the ReLU nonlinearity destroys the dynamical isometry of orthogonal linear networks. In contrast, orthogonal, but not Gaussian, sigmoidal networks can achieve dynamical isometry; as the depth increases, the max singular value can remain O(1)O(1) in the former case but grows linearly in the latter. Thus orthogonal sigmoidal networks rescue the failure of dynamical isometry in ReLU networks.

Correspondingly, we demonstrate, on CIFAR-10, that orthogonal sigmoidal networks can learn orders of magnitude faster than ReLU networks. This performance advantage is robust to the choice of a variety of optimizers, including SGD, momentum, RMSProp and ADAM. Orthogonal sigmoidal networks moreover have sublinear learning times with depth. While not as fast as orthogonal linear networks, which have depth independent training times , orthogonal sigmoidal networks have training times growing as the square root of depth. Finally, dynamical isometry, if present at initialization, persists for a large amount of time during training. Moreover, isometric initializations with longer persistence times yield both faster learning and better generalization.

Overall, these results yield the insight that the shape of the entire distribution of a deep network’s Jacobian singular values can have a dramatic effect on learning speed; only controlling the second moment, to avoid exponentially vanishing and exploding gradients, can leave significant performance advantages on the table. Moreover, by pursuing the design principle of tightly concentrating the entire distribution around 11, we reveal that very deep feedfoward networks, with sigmoidal nonlinearities, can actually outperform ReLU networks, the most popular type of nonlinear deep network used today.

In future work, it would be interesting to extend our methods to other types of networks, including for example skip connections, or convolutional architectures. More generally, the performance advantage in learning that accompanies dynamical isometry suggests it may be interesting to explicitly optimize this property in reinforcement learning based searches over architectures .

S.G. thanks the Simons, McKnight, James S. McDonnell, and Burroughs Wellcome Foundations and the Office of Naval Research for support.

References

Theoretical results

The SS-transform for JJTJJ^{T} is given by,

First notice that, by eqn. (9), M(z)M(z) and thus S(z)S(z) depend only on the moments of the distribution. The moments, in turn, can be defined in terms of traces, which are invariant to cyclic permutations, i.e.,

where the last line follows since each weight matrix is identically distributed. ∎

Products of Gaussian random matrices with variance σw2\sigma_{w}^{2} have the SS transform,

It is well-known (see, e.g. ) that the moments of a Wishart are proportional to the Catalan numbers, i.e.,

It is straightforward to invert this function,

The SS-transform of the identity is given by SI=1S_{I}=1.

The moments of the identity are all equal to one, so we have,