The Emergence of Spectral Universality in Deep Networks

Jeffrey Pennington, Samuel S. Schoenholz, Surya Ganguli

INTRODUCTION

A well-conditioned initialization is essential for successfully training neural networks. Seminal early work focused on random weight initializations ensuring that the second moment of the singular value spectrum of the network's input-output Jacobian remained one, thereby preventing exponential explosion or vanishing of gradients. However, recent work has shown that even among different random initializations sharing this property, those whose entire spectrum tightly concentrates around one can often yield faster learning by orders of magnitude. For example, deep linear networks with orthogonal initializations, for which every singular value is exactly one, can achieve depth-independent learning speeds, while the corresponding Gaussian initializations cannot.

Recently, it was shown that a similarly well-conditioned Jacobian could be constructed for deep non-linear networks using a combination of orthogonal weights and $\tanh$ nonlinearities. The result of this improved conditioning was an orders-of-magnitude speedup in learning for $\tanh$ networks. However, the same study also proved that a well-conditioned Jacobian could not be achieved with rectified linear units (ReLUs). Together these results explained why, historically, orthogonal weight initialization had in some cases been found to improve training efficiency only slightly.

These empirical results connecting the conditioning of the Jacobian to a dramatic speedup in learning raise an important theoretical question. Namely, how does the entire shape of this spectrum depend on a network’s nonlinearity, weight and bias distribution, and depth? Here we provide a detailed analytic answer by using powerful tools from free probability theory. Our answer provides theoretical guidance on how to choose these different network ingredients so as to achieve tight concentration of deep Jacobian spectra even at very large depths. Along the way, we find several surprises, and we summarize our results in the discussion.

PRELIMINARIES

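Consider an $L$-layer feed-forward neural network of width $N$ with weight matrices $\mathbf{W}^{l}\in\mathbb{R}^{N\times N}$, bias vectors $\mathbf{b}^{l}$, pre-activations $\mathbf{h}^{l}$ and post-activations $\mathbf{x}^{l}$, with $l=1,\dots,L$. The feed-forward dynamics of the network are given by

$$\mathbf{x}^{l}=\phi(\mathbf{h}^{l}),\qquad\mathbf{h}^{l}=\mathbf{W}^{l}\mathbf{x}^{l-1}+\mathbf{b}^{l},\tag{1}$$

where $\phi$ is a pointwise nonlinearity and the input is $\mathbf{x}^{0}\in\mathbb{R}^{N}$. The input-output Jacobian $\mathbf{J}\in\mathbb{R}^{N\times N}$ is

$$\mathbf{J}=\frac{\partial\mathbf{x}^{L}}{\partial\mathbf{x}^{0}}=\prod_{l=1}^{L}\mathbf{D}^{l}\mathbf{W}^{l}.\tag{2}$$

Here $\mathbf{D}^{l}$ is a diagonal matrix with entries $D^{l}_{ij}=\phi^{\prime}(h^{l}_{i})\,\delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta. The input-output Jacobian $\mathbf{J}$ is closely related to the backpropagation operator mapping output errors to weight matrices at a given layer, in the sense that if the former is well-conditioned, then the latter tends to be well-conditioned for all weight layers. We are therefore interested in understanding the entire singular value spectrum of $\mathbf{J}$ for deep networks with randomly initialized weights and biases.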

In particular, we will take the biases $b^{l}_{i}$ to be drawn i.i.d. from a zero-mean Gaussian with standard deviation $\sigma_{b}$. For the weights, we will consider two random matrix ensembles: (1) random Gaussian weights in which each $W^{l}_{ij}$ is drawn i.i.d. from a zero-mean Gaussian with variance $\sigma_{w}^{2}/N$, and (2) random orthogonal weights, drawn from a uniform distribution over scaled orthogonal matrices obeying $(\mathbf{W}^{l})^{T}\mathbf{W}^{l}=\sigma_{w}^{2}\,\mathbf{I}$.

2 Review of Signal Propagation

The random matrices $\mathbf{D}^{l}$ in (2) depend on the empirical distribution of pre-activations $h^{l}_{i}$ for $i=1,\dots,N$ entering the nonlinearity $\phi$ in (1). The propagation of this empirical distribution through different layers $l$ was studied in prior work, where it was shown that in the large $N$ limit this empirical distribution converges to a Gaussian with zero mean and variance $q^{l}$, where $q^{l}$ obeys a recursion relation induced by the dynamics in (1):

$$q^{l}=\sigma_{w}^{2}\int\mathcal{D}h\,\phi\!\left(\sqrt{q^{l-1}}\,h\right)^{2}+\sigma_{b}^{2},\tag{3}$$

with initial condition $q^{1}=\frac{\sigma_{w}^{2}}{N}\sum_{i=1}^{N}(x^{0}_{i})^{2}+\sigma_{b}^{2}$, and $\mathcal{D}h=\frac{dh}{\sqrt{2\pi}}\,\exp{(-\frac{h^{2}}{2})}$ denoting the standard normal measure. This recursion has a fixed point obeying

$$q^{*}=\sigma_{w}^{2}\int\mathcal{D}h\,\phi\!\left(\sqrt{q^{*}}\,h\right)^{2}+\sigma_{b}^{2}.\tag{4}$$

If the input $\mathbf{x}^{0}$ is chosen so that $q^{1}=q^{*}$, then the dynamics start at the fixed point and the distribution of $\mathbf{D}^{l}$ is independent of $l$. Moreover, even if $q^{1}\neq q^{*}$, a few layers are often sufficient to approximately converge to the fixed point. As such, when $L$ is large, it is often a good approximation to assume that $q^{l}=q^{*}$ for all depths $l$ when computing the spectrum of $\mathbf{J}$.
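The convergence of the variance recursion to $q^{*}$ can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes the recursion has the form $q^{l}=\sigma_{w}^{2}\,\mathbb{E}[\phi(\sqrt{q^{l-1}}h)^{2}]+\sigma_{b}^{2}$ with $h$ standard normal, and it approximates the Gaussian integral with a plain trapezoid rule. The helper names `gauss_expect`, `variance_map`, and `fixed_point` are illustrative.

```python
import math

def gauss_expect(f, n=2001, cutoff=10.0):
    """Approximate E[f(h)] for h ~ N(0, 1) with the trapezoid rule on [-cutoff, cutoff]."""
    step = 2.0 * cutoff / (n - 1)
    total = 0.0
    for i in range(n):
        h = -cutoff + i * step
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * f(h) * math.exp(-0.5 * h * h)
    return total * step / math.sqrt(2.0 * math.pi)

def variance_map(q, phi, sigma_w, sigma_b):
    """One step of the variance recursion: q^l as a function of q^{l-1}."""
    second_moment = gauss_expect(lambda h: phi(math.sqrt(q) * h) ** 2)
    return sigma_w ** 2 * second_moment + sigma_b ** 2

def fixed_point(phi, sigma_w, sigma_b, q_init=1.0, tol=1e-12, max_iter=10000):
    """Iterate the recursion until successive values differ by less than tol."""
    q = q_init
    for _ in range(max_iter):
        q_next = variance_map(q, phi, sigma_w, sigma_b)
        if abs(q_next - q) < tol:
            return q_next
        q = q_next
    return q

# Linear network: the fixed point is sigma_b^2 / (1 - sigma_w^2) when sigma_w < 1.
q_linear = fixed_point(lambda h: h, sigma_w=0.5, sigma_b=1.0)
# tanh network with illustrative (assumed) hyperparameters.
q_tanh = fixed_point(math.tanh, sigma_w=1.5, sigma_b=0.5)
```

For the linear case the result can be compared with the closed form $\sigma_{b}^{2}/(1-\sigma_{w}^{2})$; for other nonlinearities only the residual of the fixed-point equation is available as a check.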

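Another important quantity governing signal propagation through deep networks is

$$\chi=\sigma_{w}^{2}\int\mathcal{D}h\,\left[\phi^{\prime}\!\left(\sqrt{q^{*}}\,h\right)\right]^{2},\tag{5}$$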

where $\phi^{\prime}$ is the derivative of $\phi$. Here $\chi$ is the mean of the distribution of squared singular values of the matrix $\mathbf{DW}$, when the pre-activations are at their fixed point distribution with variance $q^{*}$. As shown in prior work, $\chi(\sigma_{w},\sigma_{b})$ separates the $(\sigma_{w},\sigma_{b})$ plane into two regions: (a) when $\chi>1$, forward signal propagation expands and folds space in a chaotic manner and back-propagated gradients exponentially explode; and (b) when $\chi<1$, forward signal propagation contracts space in an ordered manner and back-propagated gradients exponentially vanish. Thus the constraint $\chi(\sigma_{w},\sigma_{b})=1$ determines a critical line in the $(\sigma_{w},\sigma_{b})$ plane separating the ordered and chaotic regimes. Moreover, the mean of the distribution of squared singular values of $\mathbf{J}$ was shown to be simply $\chi^{L}$. Fig. 1 shows an example of an order-chaos transition for the $\tanh$ nonlinearity.
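A point on the critical line can be located numerically by parametrizing it with $q^{*}$. The sketch below is illustrative rather than the authors' procedure: it assumes $\chi=\sigma_{w}^{2}\,\mathbb{E}[\phi^{\prime}(\sqrt{q^{*}}h)^{2}]$ and the fixed-point condition $q^{*}=\sigma_{w}^{2}\,\mathbb{E}[\phi(\sqrt{q^{*}}h)^{2}]+\sigma_{b}^{2}$, sets $\chi=1$ to obtain $\sigma_{w}$, and then solves the fixed-point condition for $\sigma_{b}$. The function names are hypothetical.

```python
import math

def gauss_expect(f, n=2001, cutoff=10.0):
    """Approximate E[f(h)] for h ~ N(0, 1) with the trapezoid rule."""
    step = 2.0 * cutoff / (n - 1)
    total = 0.0
    for i in range(n):
        h = -cutoff + i * step
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * f(h) * math.exp(-0.5 * h * h)
    return total * step / math.sqrt(2.0 * math.pi)

def dphi_tanh(x):
    """Derivative of tanh."""
    return 1.0 / math.cosh(x) ** 2

def chi(q_star, sigma_w, dphi):
    """Mean squared singular value of DW at the fixed point q_star."""
    return sigma_w ** 2 * gauss_expect(lambda h: dphi(math.sqrt(q_star) * h) ** 2)

def critical_point(q_star, phi, dphi):
    """Return (sigma_w, sigma_b) with chi = 1 and fixed-point variance q_star."""
    mean_sq_slope = gauss_expect(lambda h: dphi(math.sqrt(q_star) * h) ** 2)
    sigma_w2 = 1.0 / mean_sq_slope
    sigma_b2 = q_star - sigma_w2 * gauss_expect(lambda h: phi(math.sqrt(q_star) * h) ** 2)
    # Clip tiny negative values caused by round-off when q_star is very small.
    return math.sqrt(sigma_w2), math.sqrt(max(sigma_b2, 0.0))

sw, sb = critical_point(1.5, math.tanh, dphi_tanh)
sw_small, sb_small = critical_point(1e-4, math.tanh, dphi_tanh)
```

Sweeping `q_star` over a range of values traces out the critical curve; as `q_star` shrinks, the returned point approaches $(\sigma_{w},\sigma_{b})=(1,0)$.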

3 Review of Free Probability

The previous section revealed that the mean squared singular value of $\mathbf{J}$ is $\chi^{L}$. Indeed, when $\chi\ll 1$ or $\chi\gg 1$ the vanishing or explosion of gradients, respectively, dominates the learning dynamics and provides a compelling case for choosing an initialization that is critical with $\chi=1$. We would like to investigate whether all cases where $\chi=1$ are the same and, in particular, to obtain more detailed information about the entire singular value distribution of $\mathbf{J}$ when $\chi=1$. Since (2) consists of a product of random matrices, free probability becomes relevant as a powerful tool to compute the spectrum of $\mathbf{J}$, as we now review. Pedagogical introductions to free probability, and prior work applying it to deep learning, are available in the literature.

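In general, given a random matrix $\mathbf{X}$, its limiting spectral density is defined as

$$\rho_{X}(\lambda)\equiv\left\langle\frac{1}{N}\sum_{i=1}^{N}\delta(\lambda-\lambda_{i})\right\rangle_{X},\tag{6}$$

where $\lambda_{i}$ are the eigenvalues of $\mathbf{X}$ and $\langle\cdot\rangle_{X}$ denotes an average w.r.t. the distribution over the random matrix $\mathbf{X}$.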

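The Stieltjes transform of $\rho_{X}$ is defined as,

$$G_{X}(z)\equiv\int_{\mathbb{R}}\frac{\rho_{X}(t)}{z-t}\,dt,\qquad z\in\mathbb{C}\setminus\mathbb{R},\tag{7}$$

which can be inverted using,

$$\rho_{X}(\lambda)=-\frac{1}{\pi}\lim_{\epsilon\to 0^{+}}\operatorname{Im}\,G_{X}(\lambda+i\epsilon).\tag{8}$$

$G_{X}$ is related to the moment generating function $M_{X}$,

$$M_{X}(z)\equiv zG_{X}(z)-1=\sum_{k=1}^{\infty}\frac{m_{k}}{z^{k}},\tag{9}$$

where $m_{k}$ is the $k$th moment of the distribution $\rho_{X}$,

$$m_{k}=\int d\lambda\,\rho_{X}(\lambda)\,\lambda^{k}=\frac{1}{N}\left\langle\operatorname{tr}\mathbf{X}^{k}\right\rangle_{X}.\tag{10}$$

In turn, we denote the functional inverse of $M_{X}$ by $M_{X}^{-1}$, which by definition satisfies $M_{X}(M_{X}^{-1}(z))=M_{X}^{-1}(M_{X}(z))=z$. Finally, the S-transform is defined as,

$$S_{X}(z)=\frac{1+z}{z\,M_{X}^{-1}(z)}.\tag{11}$$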

The utility of the S-transform arises from its behavior under multiplication. Specifically, if $\mathbf{A}$ and $\mathbf{B}$ are two freely independent random matrices, then the S-transform of the product random matrix ensemble $\mathbf{A}\mathbf{B}$ is simply the product of their S-transforms,

$$S_{AB}(z)=S_{A}(z)\,S_{B}(z).\tag{12}$$
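As a worked illustration of this chain of transforms, consider the simplest case of a scaled orthogonal weight matrix, for which $\mathbf{W}^{T}\mathbf{W}=\sigma_{w}^{2}\mathbf{I}$. The steps below assume the standard definitions $G_{X}(z)=\int\rho_{X}(t)/(z-t)\,dt$, $M_{X}(z)=zG_{X}(z)-1$, and $S_{X}(z)=(1+z)/(zM_{X}^{-1}(z))$.

```latex
\begin{align*}
\rho_{W^{T}W}(\lambda) &= \delta(\lambda-\sigma_{w}^{2}), \\
G_{W^{T}W}(z) &= \frac{1}{z-\sigma_{w}^{2}}, \\
M_{W^{T}W}(z) &= zG_{W^{T}W}(z)-1 = \frac{\sigma_{w}^{2}}{z-\sigma_{w}^{2}}, \\
M_{W^{T}W}^{-1}(z) &= \sigma_{w}^{2}\,\frac{1+z}{z}, \\
S_{W^{T}W}(z) &= \frac{1+z}{z\,M_{W^{T}W}^{-1}(z)} = \sigma_{w}^{-2}.
\end{align*}
```

The S-transform is a constant, so a product of $L$ such matrices has S-transform $\sigma_{w}^{-2L}$, which is the S-transform of $\sigma_{w}^{2L}\mathbf{I}$, as expected for a product of scaled orthogonal matrices.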

MASTER EQUATION FOR SPECTRAL DENSITY

We can now write down an implicit expression for the spectral density of $\mathbf{J}\mathbf{J}^{T}$, which is also the distribution of the square of the singular values of $\mathbf{J}$. In particular, in the supplementary material (SM) Sec. 1, we combine (12) with the facts that the S-transform depends only on traces of moments through (9), and that these traces are invariant under cyclic permutations, to derive a simple expression for the S-transform of $\mathbf{J}\mathbf{J}^{T}$,

$$S_{JJ^{T}}=\prod_{l=1}^{L}S_{W_{l}^{T}W_{l}}\,S_{D_{l}^{2}}=S_{W^{T}W}^{L}\,S_{D^{2}}^{L}.\tag{13}$$

Here the lack of dependence on the layer index $l$ on the RHS is valid if the input $\mathbf{x}^{0}$ is such that $q^{1}=q^{*}$ .

Thus, given expressions for the S-transforms associated with the nonlinearity, $S_{D^{2}}$, and the weights, $S_{W^{T}W}$, one can compute the S-transform of the input-output Jacobian $S_{JJ^{T}}$ at any network depth $L$ through (13). Then from $S_{JJ^{T}}$, one can invert the sequence (7), (9), and (11) to obtain $\rho_{JJ^{T}}(\lambda)$.

2 An Efficient Master Equation

The previous section provides a naive method for computing the spectrum $\rho_{JJ^{T}}(\lambda)$ through a complex sequence of calculations. One must start from $\rho_{W^{T}W}(\lambda)$ and $\rho_{D^{2}}(\lambda)$, compute their respective Stieltjes transforms, moment generating functions, inverse moment generating functions, and S-transforms, take the product in (13), and then invert this sequence of steps to finally arrive at $\rho_{JJ^{T}}(\lambda)$. Here we provide a much simpler “master” equation for extracting information about $\rho_{JJ^{T}}(\lambda)$ and its moments directly from knowledge of the moment generating function of the nonlinearity, $M_{D^{2}}(z)$, and the S-transform of the weights, $S_{W^{T}W}(z)$. As we shall see, these latter two functions are the simplest functions to work with for arbitrary nonlinearities.

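To derive the master equation, we insert (11), for $\mathbf{X}=\mathbf{D}^{2}$, into (13), and perform some algebraic manipulations (see SM Sec. 3 for details) to obtain implicit functional equations for $M_{JJ^{T}}(z)$ and $G(z)\equiv G_{JJ^{T}}(z)$,

$$M_{JJ^{T}}(z)=M_{D^{2}}\!\left(z^{\frac{1}{L}}\,F\big(M_{JJ^{T}}(z)\big)\right),\tag{14}$$

$$zG(z)-1=M_{D^{2}}\!\left(z^{\frac{1}{L}}\,F\big(zG(z)-1\big)\right),\tag{15}$$

where

$$F(z)=S_{W^{T}W}(z)\left(\frac{1+z}{z}\right)^{1-\frac{1}{L}}.$$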

In principle, a solution to eq. (15) allows us to compute the entire spectrum of $\mathbf{J}\mathbf{J}^{T}$. In practice, when an exact solution in terms of elementary functions is lacking, it is still possible to extract robust numerical solutions, as we describe in the next subsection.

3 Numerical Extraction of Spectra

Here we describe how to solve (15) numerically. The difficulty is that (15) implicitly defines $G(z)$ through an equation of the form $\mathcal{F}(G,z)=0$. Notice that, for any given $z$, this equation may have multiple roots in $G$. The correct branch can be chosen by requiring that $G(z)\sim 1/z$ as $z\to\infty$. Therefore, one point on the correct branch can be found by taking $|z|$ large, and finding the solution to $\mathcal{F}(G,z)=0$ that is closest to $G=1/z$. Recall that to obtain the density $\rho_{JJ^{T}}(\lambda)$ through the inversion formula (8), we need to extract the behavior of $G(z)$ near the real axis at a point $z=\lambda+i\epsilon$ where $\rho_{JJ^{T}}(\lambda)$ has support. So, practically speaking, for each $\lambda$ we can walk along the imaginary direction obeying $\text{Re}(z)=\lambda$ from large imaginary values to small, and repeatedly solve $\mathcal{F}(G,z)=0$, always choosing the root that is closest to the previous root.
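The walk along $\text{Re}(z)=\lambda$ can be sketched in a few lines. The example below is an illustration under stated assumptions, not a reproduction of Algorithm 1. It treats a linear network with Gaussian weights and $\sigma_{w}=1$, for which $M_{D^{2}}(z)=1/(z-1)$ and $S_{W^{T}W}(z)=1/(1+z)$, so that the implicit equation reduces to the polynomial $\mathcal{F}(G,z)=z^{L}G^{L+1}-zG+1=0$. Instead of computing all roots and selecting the nearest one, it uses Newton's iteration seeded at the previous root, which follows the same branch when the steps in $z$ are small. The function names are hypothetical.

```python
import math

def stieltjes_linear_gaussian(lam, L, eps=1e-6, y_start=1e3, n_steps=600):
    """Track the root of z^L G^(L+1) - z G + 1 = 0 from z = lam + i*y_start down to lam + i*eps."""
    ratio = (eps / y_start) ** (1.0 / n_steps)
    y = y_start
    G = 1.0 / complex(lam, y)  # start on the branch with G ~ 1/z at large |z|
    for _ in range(n_steps + 1):
        z = complex(lam, y)
        for _ in range(100):  # Newton refinement seeded at the previous root
            F = z ** L * G ** (L + 1) - z * G + 1.0
            dF = (L + 1) * z ** L * G ** L - z
            delta = F / dF
            G -= delta
            if abs(delta) < 1e-14 * (1.0 + abs(G)):
                break
        y *= ratio  # move geometrically closer to the real axis
    return G

def density_linear_gaussian(lam, L, eps=1e-6):
    """Spectral density of J J^T from the inversion formula rho = -Im G / pi."""
    return -stieltjes_linear_gaussian(lam, L, eps=eps).imag / math.pi

rho_depth1 = density_linear_gaussian(1.0, L=1)
rho_depth2 = density_linear_gaussian(1.0, L=2)
```

For $L=1$ the result can be compared with the Marchenko-Pastur density $(2\pi)^{-1}\sqrt{4/\lambda-1}$. For other nonlinearities $\mathcal{F}$ is generally not polynomial, and the same continuation idea applies with a generic root finder.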

In the following sections, we demonstrate through many examples a precise numerical match between the outcome of Algorithm 1 and direct simulations of various random neural networks, thereby justifying not only (15), but also the efficacy of our algorithm.

4 Moments of Deep Spectra

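In addition to numerically extracting the spectrum of $\mathbf{J}\mathbf{J}^{T}$, we can also calculate its moments $m_{k}$ encoded in the function

$$M_{JJ^{T}}(z)\equiv\sum_{k=1}^{\infty}\frac{m_{k}}{z^{k}}.\tag{16}$$

These moments in turn can be computed in terms of the series expansions of $S_{W^{T}W}$ and $M_{D^{2}}$, which we define as

$$S_{W^{T}W}(z)\equiv\sigma_{w}^{-2}\left(1+\sum_{k=1}^{\infty}s_{k}z^{k}\right),\qquad M_{D^{2}}(z)\equiv\sum_{k=1}^{\infty}\frac{\mu_{k}}{z^{k}},$$

where the moments $\mu_{k}$ of $\mathbf{D}^{2}$ are given by,

$$\mu_{k}=\int\mathcal{D}h\,\phi^{\prime}\!\left(\sqrt{q^{*}}\,h\right)^{2k}.$$

Substituting these expansions into (14), we obtain equations for the unknown moments $m_{k}$ in terms of the known moments $\mu_{k}$ and $s_{k}$. We can solve for the low-order moments by expanding (14) in powers of $z^{-1}$. By equating the coefficients of $z^{-1}$ and $z^{-2}$, we find equations for $m_{1}$ and $m_{2}$ whose solution yields (see SM Sec. 3),

$$m_{1}=(\sigma_{w}^{2}\mu_{1})^{L},\qquad m_{2}=(\sigma_{w}^{2}\mu_{1})^{2L}\,L\left(\frac{\mu_{2}}{\mu_{1}^{2}}+\frac{1}{L}-1-s_{1}\right).\tag{21}$$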

Note the combination $\sigma_{w}^{2}\mu_{1}$ is none other than $\chi$ defined in (5), and so (21) recovers the result that the mean squared singular value $m_{1}$ of $\mathbf{J}$ either exponentially explodes or vanishes unless $\chi(\sigma_{w},\sigma_{b})=1$ on a critical boundary between order and chaos. However, even on this critical boundary where the mean $m_{1}$ of the spectrum of $\mathbf{J}\mathbf{J}^{T}$ is one for any depth $L$, the variance

$$\sigma_{JJ^{T}}^{2}=m_{2}-m_{1}^{2}=L\left(\frac{\mu_{2}}{\mu_{1}^{2}}-1-s_{1}\right)\tag{22}$$

grows linearly with depth $L$ for generic values of $\mu_{1}$ , $\mu_{2}$ and $s_{1}$ . Thus $\mathbf{J}$ can be highly ill-conditioned at large depths $L$ for generic choices of nonlinearities and weights, even when $\sigma_{w}$ and $\sigma_{b}$ are tuned to criticality.
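The linear growth of the variance can be checked against a direct simulation. The sketch below is illustrative and assumes the moment formulas $m_{1}=(\sigma_{w}^{2}\mu_{1})^{L}$ and $m_{2}=(\sigma_{w}^{2}\mu_{1})^{2L}L(\mu_{2}/\mu_{1}^{2}+1/L-1-s_{1})$. It evaluates them for critical ReLU networks ($\mu_{1}=\mu_{2}=1/2$, $\sigma_{w}^{2}=2$) and compares the prediction for a linear Gaussian network with a small Monte Carlo estimate that uses only matrix products, via $m_{1}=\operatorname{tr}(\mathbf{JJ}^{T})/N$ and $m_{2}=\operatorname{tr}((\mathbf{JJ}^{T})^{2})/N$. The width, depth, and seed are arbitrary choices.

```python
import math
import random

def predicted_moments(L, mu1, mu2, s1, sigma_w2):
    """Return (m1, m2, variance) of the spectrum of J J^T from the moment formulas."""
    chi = sigma_w2 * mu1
    m1 = chi ** L
    m2 = chi ** (2 * L) * L * (mu2 / mu1 ** 2 + 1.0 / L - 1.0 - s1)
    return m1, m2, m2 - m1 ** 2

def matmul(A, B):
    """Dense matrix product for lists of rows."""
    B_cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in B_cols] for row in A]

def simulate_linear_gaussian(N, L, seed=0):
    """Empirical (m1, m2, variance) for J = W_L ... W_1 with N(0, 1/N) entries."""
    rng = random.Random(seed)
    J = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
    for _ in range(L):
        W = [[rng.gauss(0.0, 1.0 / math.sqrt(N)) for _ in range(N)] for _ in range(N)]
        J = matmul(W, J)
    A = matmul(J, [list(col) for col in zip(*J)])  # A = J J^T
    m1 = sum(A[i][i] for i in range(N)) / N
    m2 = sum(a * a for row in A for a in row) / N  # tr(A^2) for symmetric A
    return m1, m2, m2 - m1 ** 2

relu_orthogonal = predicted_moments(10, 0.5, 0.5, 0.0, 2.0)   # variance L
relu_gaussian = predicted_moments(10, 0.5, 0.5, -1.0, 2.0)    # variance 2L
linear_gaussian_sim = simulate_linear_gaussian(N=100, L=2, seed=1)
```

At this small width the simulated variance fluctuates around the predicted value $L=2$; agreement tightens as $N$ grows.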

SPECIAL CASES OF DEEP SPECTRA

Exploiting the master equation (14) requires information about $M_{D^{2}}(z)$ and $S_{W^{T}W}(z)$. We first provide this information and then use it to look at special cases of deep networks.

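First, for any nonlinearity $\phi(h)$, we have, through (7) and (9),

$$M_{D^{2}}(z)=\int\mathcal{D}h\,\frac{\phi^{\prime}\!\left(\sqrt{q^{*}}\,h\right)^{2}}{z-\phi^{\prime}\!\left(\sqrt{q^{*}}\,h\right)^{2}}.\tag{23}$$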

The integral over the Gaussian measure $\mathcal{D}h$ reflects a sum over all the pre-activations $h^{l}_{i}$ in a layer $l$, since in the large $N$ limit the empirical distribution of pre-activations converges to a Gaussian with standard deviation $\sqrt{q^{*}}$. Moreover, a pre-activation $h^{l}_{i}$ feels a squared slope $\phi^{\prime}(h^{l}_{i})^{2}$, which appears as an eigenvalue of the diagonal matrix $(\mathbf{D}^{l})^{2}$. Thus $M_{D^{2}}(z)$ naturally involves an integral over a function of $\phi^{\prime}(\cdot)^{2}$ against a Gaussian.
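These Gaussian integrals are straightforward to evaluate numerically. The sketch below is illustrative: it assumes $\mu_{k}=\mathbb{E}[\phi^{\prime}(\sqrt{q^{*}}h)^{2k}]$ and $M_{D^{2}}(z)=\mathbb{E}[\phi^{\prime 2}/(z-\phi^{\prime 2})]$ with $h$ standard normal, and uses a trapezoid rule for the expectation. As a test case it takes $\phi(x)=\sqrt{\pi/2}\,\operatorname{erf}(x/\sqrt{2})$, whose derivative is $e^{-x^{2}/2}$, so that $\mu_{k}=(1+2kq^{*})^{-1/2}$ in closed form. The function names are hypothetical.

```python
import math

def gauss_expect(f, n=2001, cutoff=10.0):
    """Approximate E[f(h)] for h ~ N(0, 1); f may return complex values."""
    step = 2.0 * cutoff / (n - 1)
    total = 0.0
    for i in range(n):
        h = -cutoff + i * step
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * f(h) * math.exp(-0.5 * h * h)
    return total * step / math.sqrt(2.0 * math.pi)

def mu_k(k, q_star, dphi):
    """k-th moment of D^2 at fixed-point variance q_star."""
    return gauss_expect(lambda h: dphi(math.sqrt(q_star) * h) ** (2 * k))

def M_D2(z, q_star, dphi):
    """Moment generating function of D^2 at a complex point z off the support."""
    def integrand(h):
        d2 = dphi(math.sqrt(q_star) * h) ** 2
        return d2 / (z - d2)
    return gauss_expect(integrand)

def dphi_erf(x):
    """Derivative of sqrt(pi/2) * erf(x / sqrt(2))."""
    return math.exp(-0.5 * x * x)

mu2_numeric = mu_k(2, 0.3, dphi_erf)
M_numeric = M_D2(5.0 + 1.0j, 0.3, dphi_erf)
# Series check: M(z) = sum_k mu_k / z^k, using the closed-form moments.
M_series = sum((1.0 + 2.0 * k * 0.3) ** -0.5 / (5.0 + 1.0j) ** k for k in range(1, 80))
```

The same two functions accept any derivative `dphi`, so other smooth nonlinearities can be substituted directly; piecewise-linear ones are better handled analytically because their squared slope is discontinuous.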

Table 1 provides the moment generating function and moments of $\mathbf{D}^{2}$ for several nonlinearities. Detailed derivations of the results in Table 1, which follow from performing the integral in (23), can be found in the SM Sec. 3. In the Erf case, $\Phi$ is a special function known as the Lerch transcendent, which can be defined by its moments $\mu_{k}$ .

2 Transforms of Weights

The S-transforms of the weights can be obtained through the sequence of equations (7), (9), and (11), starting with $\rho_{W^{T}W}(\lambda)=\delta(\lambda-1)$ for an orthogonal random matrix $\mathbf{W}$, and $\rho_{W^{T}W}(\lambda)=(2\pi)^{-1}\sqrt{4/\lambda-1}$ for $\lambda\in[0,4]$, for a Gaussian random matrix $\mathbf{W}$ with variance $\frac{1}{N}$ (see SM Sec. 5). Furthermore, by scaling $\mathbf{W}\rightarrow\sigma_{w}\mathbf{W}$, the S-transform scales as $S_{W^{T}W}\rightarrow\sigma_{w}^{-2}S_{W^{T}W}$, yielding the S-transforms and first moments in Table 2.

3 Exact Properties of Deep Spectra

Now for different randomly initialized deep networks, we insert the appropriate expressions in Tables 1 and 2 into our master equations (14) and (15) to obtain information about the spectrum of $\mathbf{J}\mathbf{J}^{T}$ , including its entire shape, through Algorithm 1, and its variance $\sigma_{JJ^{T}}^{2}$ through (21) and (22). We always work at criticality, so that in (5), $\chi=\sigma_{w}^{2}\mu_{1}=1$ . The resulting condition for $\sigma_{w}^{2}$ at criticality and the value of $\sigma_{JJ^{T}}^{2}$ are shown in Table 1 for different nonlinearities, both for orthogonal ( $s_{1}=0$ ) and Gaussian ( $s_{1}=-1$ ) weights.

3.1 Linear Networks

For linear networks, the fixed point equation (4) reduces to $q^{*}=\sigma_{w}^{2}q^{*}+\sigma_{b}^{2}$, and $(\sigma_{w},\sigma_{b})=(1,0)$ is the only critical point. Moreover, linear Gaussian networks behave very differently from orthogonal ones. The latter are well conditioned, with $\sigma^{2}_{JJ^{T}}=0$ because the product of orthogonal matrices is orthogonal and so $\rho_{JJ^{T}}(\lambda)=\delta(\lambda-1)$ for all $L$. However, $\sigma^{2}_{JJ^{T}}=L$ for Gaussian weights. This radically different behavior of the spectrum of $\mathbf{JJ}^{T}$ is shown in Fig. 2A.

3.2 ReLU Networks

For ReLU networks, the fixed point equation (4) reduces to $q^{*}=\frac{1}{2}\sigma_{w}^{2}q^{*}+\sigma_{b}^{2}$ , and $(\sigma_{w},\sigma_{b})=(\sqrt{2},0)$ is the only critical point. Unlike the linear case, $\sigma_{JJ^{T}}^{2}$ becomes $L$ for orthogonal and $2L$ for Gaussian weights. In essence, the ReLU nonlinearity destroys the qualitative scaling advantage that linear networks possess for orthogonal weights versus Gaussian. The qualitative similarity of spectra for ReLU Orthogonal and linear Gaussian is shown in Fig. 2AB.

3.3 Hard Tanh and Erf Networks

For Hard Tanh and Erf Networks, the criticality condition $\sigma_{w}^{2}={\mu_{1}^{-1}}$ does not determine a unique value of $\sigma_{w}^{2}$ because $\mu_{1}$ , the mean squared slope $\phi^{\prime}(h)^{2}$ , now depends on the variance $q^{*}$ of the distribution of pre-activations $h$ . Since $q^{*}$ itself is a function of $\sigma_{w}$ and $\sigma_{b}$ through (4), these networks enjoy an entire critical curve in the $(\sigma_{w},\sigma_{b})$ plane, similar to that shown in Fig. 1. As $q^{*}$ decreases monotonically towards zero, the corresponding point on this curve approaches the point $(\sigma_{w},\sigma_{b})=(1,0)$ .

Moreover, Table 1 shows that $\sigma_{JJ^{T}}^{2}=L(\mathcal{F}(q^{*})-1-s_{1})$ with $\lim_{q^{*}\rightarrow 0}\mathcal{F}(q^{*})=1$. This implies that for Gaussian weights ($s_{1}=-1$), no matter how small one makes $\sigma_{w}$, $\sigma^{2}_{JJ^{T}}\propto L$. However, for orthogonal weights ($s_{1}=0$), for any fixed $L$, one can reduce $\sigma_{w}$ and therefore $q^{*}$, so as to make $\sigma_{JJ^{T}}^{2}$ arbitrarily small. Thus Hard Tanh and Erf nonlinearities rescue the scaling advantage that orthogonal weights possess over Gaussian, which was present in linear networks, but destroyed in ReLU networks. Examples of the well-conditioned nature of orthogonal Hard Tanh and Erf networks compared to orthogonal ReLU networks are shown in Fig. 2.

UNIVERSALITY IN DEEP SPECTRA

Table 1 shows that for orthogonal Erf and Hard Tanh networks (but not ReLU networks), since $\sigma_{JJ^{T}}^{2}=L(\mathcal{F}(q^{*})-1)$ with $\lim_{q^{*}\rightarrow 0}\mathcal{F}(q^{*})=1$ , one can always choose $q^{*}$ to vary inversely with $L$ so as to achieve a desired $L$ -independent constant variance $\sigma^{2}_{JJ^{T}}\equiv\sigma^{2}_{0}$ . To achieve this scaling, $q^{*}(L)$ should satisfy the equation $\mathcal{F}(q^{*}(L))=1+\frac{\sigma^{2}_{0}}{L}$ , which implies $\sigma_{w}\to 1$ and $q^{*}\rightarrow 0$ as $L\to\infty$ .
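The depth-dependent choice of $q^{*}$ can be computed with a simple bisection. The sketch below is illustrative and makes two assumptions that should be checked against Table 1: it identifies $\mathcal{F}(q^{*})$ with the moment ratio $\mu_{2}/\mu_{1}^{2}$, and it uses the nonlinearity $\phi(x)=\sqrt{\pi/2}\,\operatorname{erf}(x/\sqrt{2})$, for which $\phi^{\prime}(x)=e^{-x^{2}/2}$ and therefore $\mu_{k}=(1+2kq^{*})^{-1/2}$. The function names are hypothetical.

```python
import math

def moment_ratio(q_star):
    """mu_2 / mu_1^2 for the erf nonlinearity with mu_k = (1 + 2 k q*)^(-1/2)."""
    return (1.0 + 2.0 * q_star) / math.sqrt(1.0 + 4.0 * q_star)

def q_star_for_depth(L, sigma0, q_max=10.0, iters=200):
    """Solve moment_ratio(q*) = 1 + sigma0^2 / L by bisection (the ratio increases with q*)."""
    target = 1.0 + sigma0 ** 2 / L
    lo, hi = 0.0, q_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if moment_ratio(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def critical_sigma_w(q_star):
    """Critical weight scale from sigma_w^2 = 1 / mu_1 = sqrt(1 + 2 q*)."""
    return (1.0 + 2.0 * q_star) ** 0.25

depth = 10000
sigma0 = 0.5
q_deep = q_star_for_depth(depth, sigma0)
sigma_w_deep = critical_sigma_w(q_deep)
```

For large depth the solution behaves like $q^{*}\approx\sigma_{0}/\sqrt{2L}$ for this nonlinearity, so $q^{*}\to 0$ and $\sigma_{w}\to 1$, in line with the scaling described above.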

Remarkably, in this double scaling limit, not only does the variance of the spectrum of $\mathbf{JJ}^{T}$ remain constant at the fixed value $\sigma_{0}^{2}$ , but the entire shape of the distribution converges to a universal limiting distribution as $L\rightarrow\infty$ . There is more than one possible limiting distribution, but its form depends on $\phi$ only through the distribution of $\phi^{\prime}(h)^{2}$ as $q^{*}\to 0$ via the expression for $M_{D^{2}}(z)$ in (23). Therefore, many qualitatively different activation functions may in fact be members of the same universality class. We identify two universality classes that correspond to many common activation functions: the Bernoulli universality class and the smooth universality class, named based on the distribution of $\phi^{\prime}(h)^{2}$ as $q^{*}\to 0$ .

The Bernoulli universality class contains many piecewise linear activation functions, such as Hard Tanh (Fig. 3C) and a version of ReLU shifted so as to be linear at the origin, which for concreteness we define as $\phi(x)=[x+\frac{1}{2}]_{+}-\frac{1}{2}$ (Fig. 3E). While these functions look quite different, their derivatives are both Bernoulli-distributed (Fig. 3DF) and the limiting spectra of their corresponding Jacobians are the same (Fig. 4AB).

The smooth universality class contains many smooth activation functions, such as Erf (Fig. 3G) and a smoothed version of ReLU that we take to be the sigmoid-weighted linear unit (SiLU) (Fig. 3I). In this case, not only do the activation functions themselves look different, but so too do their derivatives (Fig. 3HJ). Nevertheless, in the double scaling limit, the limiting spectra of their corresponding Jacobians are the same (Fig. 4CD). The rate of convergence to the limiting distribution is different, because the moments $\mu_{k}$ differ substantially for non-zero $q^{*}$ .

Unlike the smoothed and shifted versions of ReLU, the vanilla ReLU activation (Fig. 3AB) behaves entirely differently and has no limiting distribution because the $\mu_{k}$ are independent of $q^{*}$ and therefore it is impossible to attain an $L$ -independent constant variance $\sigma^{2}_{JJ^{T}}\equiv\sigma^{2}_{0}$ in this case.

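To understand the mechanism behind the emergence of spectral universality, we now examine orthogonal networks whose activation functions have squared derivatives obeying a Bernoulli distribution and show that they all share a universal limiting distribution as $L\to\infty$. To this end, we suppose that,

$$M_{D^{2}}(z)=p(q^{*})\,\frac{1}{z-1},\tag{24}$$

for some function $p(q^{*})$ that measures the probability of the nonlinearity having slope one as a function of $q^{*}$. We will assume that $p(q^{*})\to 1$ as $q^{*}\to 0$. The relevant ratio of moments and the weight variance $\sigma_{w}^{2}$ are given as,

$$\frac{\mu_{2}}{\mu_{1}^{2}}=\frac{1}{p(q^{*})},\qquad\sigma_{w}^{2}=\frac{1}{p(q^{*})}.\tag{25}$$

Notice that a solution $q^{*}(L)$ to (22) will exist for large $L$ since we are assuming $p(q^{*})\to 1$ as $q^{*}\to 0$. Substituting this solution in (24) and (25) gives for large $L$,

$$M_{D^{2}}(z)=\left(1+\frac{\sigma_{0}^{2}}{L}\right)^{-1}\frac{1}{z-1},\qquad\sigma_{w}^{2}=1+\frac{\sigma_{0}^{2}}{L}.$$

Using these expressions and (11), we find that the S-transform obeys,

$$S_{JJ^{T}}(z)=\left(1+\frac{\sigma_{0}^{2}}{L}\,\frac{z}{1+z}\right)^{-L}\;\longrightarrow\;e^{-\frac{\sigma_{0}^{2}z}{1+z}}\quad\text{as}\quad L\to\infty.$$

Using (9) and (11) to solve for $G(z)$ gives,

$$G(z)=\frac{1}{z}\,\frac{\sigma_{0}^{2}}{\sigma_{0}^{2}+W\!\left(-\sigma_{0}^{2}/z\right)},\tag{30}$$

where $W$ denotes the principal branch of the Lambert-W function and solves the transcendental equation,

$$W(x)\,e^{W(x)}=x.$$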

The spectral density can be extracted from (30) easily using (8). The results are shown in black lines in Fig. 4AB. Both Hard Tanh and Shifted ReLU have Bernoulli-distributed $\phi^{\prime}(h)^{2}$ and, despite being qualitatively different activation functions, have the same limiting spectral distributions. It is evident that the empirical spectral densities converge to this universal limiting distribution as the depth increases.

Next we build some additional understanding of the spectral density implied by (30). Because the spectral density is proportional to the imaginary part of $G(z)$, we expect the locations of the spectral edges to be related to branch points of $G(z)$, or more generally to poles in its derivative. Using the relation,

$$W^{\prime}(x)=\frac{W(x)}{x\,(1+W(x))},$$

we can inspect the derivative of $G(z)$. It may be expressed as,

$$G^{\prime}(z)=-\frac{\sigma_{0}^{2}\left[W^{2}+\sigma_{0}^{2}(1+W)\right]}{z^{2}\,(1+W)\,(\sigma_{0}^{2}+W)^{2}},\qquad W\equiv W\!\left(-\sigma_{0}^{2}/z\right).$$

By inspection, we find that $G^{\prime}(z)$ has double poles at,

$$\lambda_{0}=0,\qquad\lambda_{2}=e^{\sigma_{0}^{2}},$$

which are locations where the spectral density diverges, i.e. there are delta function peaks at $\lambda_{0}$ and $\lambda_{2}$. Note that there is only a pole at $\lambda_{2}$ if $\sigma_{0}\leq 1$. There is also a single pole at,

$$\lambda_{1}=\sigma_{0}^{2}\,e,$$

which defines the right spectral edge, i.e. the maximum value of the bulk of the density.

The above observations regarding $\lambda_{0}$, $\lambda_{1}$, and $\lambda_{2}$ are evident in Fig. 4AB. Noting that in the figure $\sigma_{0}=1/2$, we predict that the bulk of the density has its right edge located at $s=\sqrt{\lambda_{1}}=\sqrt{e}/2\approx 0.82$ and that there should be a delta function peak at $s=\sqrt{\lambda_{2}}=e^{1/8}\approx 1.13$, both of which are reflected in the figure.
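The Bernoulli limit can also be explored numerically without a Lambert-W routine. The sketch below is an illustration under an explicit assumption: the limiting S-transform is taken to be $S_{JJ^{T}}(z)=e^{-\sigma_{0}^{2}z/(1+z)}$, which together with $M=zG-1$ and $S=(1+z)/(zM^{-1})$ implies that $v=1-1/(zG)$ solves $zv=e^{\sigma_{0}^{2}v}$. The code follows the branch with $v\sim 1/z$ at large $|z|$ down to the real axis using Newton's iteration seeded at the previous root, and evaluates the two edge formulas $\lambda_{1}=\sigma_{0}^{2}e$ and $\lambda_{2}=e^{\sigma_{0}^{2}}$. The function names are hypothetical.

```python
import cmath
import math

def bernoulli_limit_G(lam, sigma0, eps=1e-6, y_start=1e3, n_steps=600):
    """Stieltjes transform at z = lam + i*eps, assuming z*v = exp(sigma0^2 * v) with G = 1/(z*(1-v))."""
    s2 = sigma0 ** 2
    ratio = (eps / y_start) ** (1.0 / n_steps)
    y = y_start
    v = 1.0 / complex(lam, y)  # branch with G ~ 1/z at large |z|
    for _ in range(n_steps + 1):
        z = complex(lam, y)
        for _ in range(100):  # Newton refinement seeded at the previous root
            e = cmath.exp(s2 * v)
            delta = (z * v - e) / (z - s2 * e)
            v -= delta
            if abs(delta) < 1e-14 * (1.0 + abs(v)):
                break
        y *= ratio
    return 1.0 / (z * (1.0 - v))

def bernoulli_limit_density(lam, sigma0):
    """Density of squared singular values from rho = -Im G / pi."""
    return -bernoulli_limit_G(lam, sigma0).imag / math.pi

def bulk_edge(sigma0):
    """Right edge of the bulk in squared singular values, lambda_1 = sigma0^2 * e."""
    return sigma0 ** 2 * math.e

def peak_location(sigma0):
    """Location of the isolated peak in squared singular values, lambda_2 = exp(sigma0^2)."""
    return math.exp(sigma0 ** 2)

inside_bulk = bernoulli_limit_density(0.3, 0.5)  # lambda below the bulk edge
inside_gap = bernoulli_limit_density(1.0, 0.5)   # lambda between bulk edge and peak
```

With $\sigma_{0}=1/2$ the density is positive at $\lambda=0.3$ and vanishes (up to order $\epsilon$) at $\lambda=1$, which lies in the gap between $\lambda_{1}$ and $\lambda_{2}$.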

A similar analysis can be carried out for activation functions for which the distribution of $\phi^{\prime}(h)^{2}$ is smooth and concentrates around one as $q^{*}\to 0$. The analysis for Erf is presented in the SM. We find that,

$$S_{JJ^{T}}(z)=e^{-\sigma_{0}^{2}z},$$

and that $G(z)$ can be expressed in terms of a generalized Lambert-W function. The locations of the spectral edges are given by $s_{\pm}=e^{-\frac{1}{4}\sigma_{\pm}^{2}}\sqrt{1+\frac{1}{2}\sigma_{\mp}^{2}}$, where,

$$\sigma_{\pm}^{2}=\sigma_{0}\left(\sigma_{0}\mp\sqrt{\sigma_{0}^{2}+4}\right).$$

For $\sigma_{0}=1/2$, these results give $s_{-}\approx 0.57$ and $s_{+}\approx 1.56$, which is in excellent agreement with the behavior observed in Fig. 4CD. Overall, Fig. 4 provides strong evidence supporting our predictions that orthogonal Hard Tanh and shifted ReLU networks have the Bernoulli limit distribution, while orthogonal Erf and smoothed ReLU networks have the smooth limit distribution.

Finally, we derived these universal limits assuming orthogonal weights. In the SM we show that orthogonality is in fact necessary for the existence of a stable limiting distribution for the spectrum of $\mathbf{JJ}^{T}$ . No other random matrix ensemble can yield a stable distribution for any choice of nonlinearity with $\phi^{\prime}(0)=1$ . Essentially, any spread in the singular values of $\mathbf{W}$ grows in an unbounded way with depth and cannot be nonlinearly damped.

DISCUSSION

In summary, motivated by a lack of theoretical clarity on when and why different weight initializations and nonlinearities combine to yield well-conditioned spectra that speed up deep learning, we developed a calculational framework based on free probability to provide, with unprecedented detail, analytic information about the entire Jacobian spectrum of deep networks with arbitrary nonlinearities. Our results provide a principled framework for the initialization of weights and the choice of nonlinearities in order to produce well-conditioned Jacobians and fast learning. Intriguingly, we find novel universality classes of deep spectra that remain well-conditioned as the depth goes to infinity, as well as theoretical conditions for their existence. Our results lend additional support to the surprising conclusions of prior work, namely that using either Gaussian initializations or ReLU nonlinearities precludes the possibility of obtaining stable spectral distributions for very deep networks. Beyond the sigmoidal units advocated in that work, our results suggest that a wide variety of nonlinearities, including shifted and smoothed variants of ReLU, can achieve dynamical isometry, provided the weights are orthogonal. Interesting future work could involve the discovery of new universality classes of well-conditioned deep spectra for more diverse nonlinearities than considered here.

References

Review of free probability

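For what follows, we define the key objects of free probability. Given a random matrix $\mathbf{X}$, its limiting spectral density is defined as

$$\rho_{X}(\lambda)\equiv\left\langle\frac{1}{N}\sum_{i=1}^{N}\delta(\lambda-\lambda_{i})\right\rangle_{X},\tag{S1}$$

where $\lambda_{i}$ are the eigenvalues of $\mathbf{X}$ and $\langle\cdot\rangle_{X}$ denotes an average w.r.t. the distribution over the random matrix $\mathbf{X}$. For large $N$, the empirical histogram of eigenvalues of a single realization of $\mathbf{X}$ converges to $\rho_{X}$. In turn, the Stieltjes transform of $\rho_{X}$ is defined as,

$$G_{X}(z)\equiv\int_{\mathbb{R}}\frac{\rho_{X}(t)}{z-t}\,dt,\qquad z\in\mathbb{C}\setminus\mathbb{R},\tag{S2}$$

which can be inverted using,

$$\rho_{X}(\lambda)=-\frac{1}{\pi}\lim_{\epsilon\to 0^{+}}\operatorname{Im}\,G_{X}(\lambda+i\epsilon).\tag{S3}$$

$G_{X}$ is related to the moment generating function $M_{X}$,

$$M_{X}(z)\equiv zG_{X}(z)-1=\sum_{k=1}^{\infty}\frac{m_{k}}{z^{k}},\tag{S4}$$

where $m_{k}$ is the $k$th moment of the distribution $\rho_{X}$,

$$m_{k}=\int d\lambda\,\rho_{X}(\lambda)\,\lambda^{k}=\frac{1}{N}\left\langle\operatorname{tr}\mathbf{X}^{k}\right\rangle_{X}.\tag{S5}$$

In turn, we denote the functional inverse of $M_{X}$ by $M_{X}^{-1}$, which by definition satisfies $M_{X}(M_{X}^{-1}(z))=M_{X}^{-1}(M_{X}(z))=z$. Finally, the S-transform is defined in terms of the functional inverse $M_{X}^{-1}$ as,

$$S_{X}(z)=\frac{1+z}{z\,M_{X}^{-1}(z)}.\tag{S6}$$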

Free probability and deep networks

We will now use eqn. (S7) to write down an implicit definition of the spectral density of $\mathbf{J}\mathbf{J}^{T}$, which is also the distribution of the square of the singular values of $\mathbf{J}$. Here $\mathbf{J}$ is the input-output Jacobian of a deep network defined in the main paper. First notice that, by eqn. (9), $M(z)$ and thus $S(z)$ depend only on the moments of the spectral density. The moments, in turn, can be defined in terms of traces (as in eqn. (S5)), which are invariant to cyclic permutations, i.e.,

$$S_{JJ^{T}}=S_{(\mathbf{D}^{L}\mathbf{W}^{L}\cdots\mathbf{D}^{1}\mathbf{W}^{1})(\mathbf{D}^{L}\mathbf{W}^{L}\cdots\mathbf{D}^{1}\mathbf{W}^{1})^{T}}=\prod_{l=1}^{L}S_{W_{l}^{T}W_{l}}\,S_{D_{l}^{2}}=S_{W^{T}W}^{L}\,S_{D^{2}}^{L},\tag{S10}$$

where the last equality follows if each term in the Jacobian product is identically distributed. Given the expression for $S_{JJ^{T}}$, a simple procedure recovers the density of singular values of $\mathbf{J}$:

Use eqn. (S6) to obtain the moment generating function $M_{JJ^{T}}(z)$

Use eqn. (9) to obtain the Stieltjes transform $G_{JJ^{T}}(z)$

Use eqn. (S3) to obtain the spectral density $\rho_{JJ^{T}}(\lambda)$

Use the relation $\lambda=\sigma^{2}$ to obtain the density of singular values of $J$ .

So in order to compute the distribution of singular values of $\mathbf{J}$, all that remains is to compute the S-transforms of $\mathbf{W}^{T}\mathbf{W}$ and of $\mathbf{D}^{2}$. We will attack this problem for specific activation functions and matrix ensembles in the following sections.

Derivation of master equations for the spectrum of the Jacobian

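To derive the master equation, we first insert (S6), for $\mathbf{X}=\mathbf{D}^{2}$, into (S10) to obtain

$$M_{D^{2}}^{-1}(z)=\frac{1+z}{z}\,S_{W^{T}W}(z)\,S_{JJ^{T}}(z)^{-\frac{1}{L}}.$$

Then we find $M_{JJ^{T}}^{-1}=(1+z)(zS_{JJ^{T}})^{-1}$ by inverting (S6), which combined with the above equation yields

$$M_{D^{2}}^{-1}(z)=S_{W^{T}W}(z)\left(\frac{1+z}{z}\right)^{1-\frac{1}{L}}\left(M_{JJ^{T}}^{-1}(z)\right)^{\frac{1}{L}}.$$

Applying $M_{D^{2}}$ to both sides gives,

$$z=M_{D^{2}}\!\left(S_{W^{T}W}(z)\left(\frac{1+z}{z}\right)^{1-\frac{1}{L}}\left(M_{JJ^{T}}^{-1}(z)\right)^{\frac{1}{L}}\right).$$

Finally, evaluating this equation at $z=M_{JJ^{T}}$ gives our sought-after master equation:

$$M_{JJ^{T}}(z)=M_{D^{2}}\!\left(z^{\frac{1}{L}}\,F\big(M_{JJ^{T}}(z)\big)\right),\qquad F(z)=S_{W^{T}W}(z)\left(\frac{1+z}{z}\right)^{1-\frac{1}{L}}.\tag{S11}$$

This is an implicit functional equation for $M_{JJ^{T}}(z)$, an unknown quantity, in terms of the known functions $M_{D^{2}}(z)$ and $S_{W^{T}W}(z)$. Furthermore, by substituting (S4), $M_{JJ^{T}}=zG_{JJ^{T}}-1$, into (S11), we also obtain an implicit functional equation for the Stieltjes transform $G$ of $\rho_{JJ^{T}}(\lambda)$,

$$zG(z)-1=M_{D^{2}}\!\left(z^{\frac{1}{L}}\,F\big(zG(z)-1\big)\right).$$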

Derivation of Moments of deep spectra

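The moments $m_{k}$ of the spectrum of $\mathbf{J}\mathbf{J}^{T}$ are encoded in the moment generating function

$$M_{JJ^{T}}(z)\equiv\sum_{k=1}^{\infty}\frac{m_{k}}{z^{k}}.$$

These moments in turn can be computed in terms of the series expansions of $S_{W^{T}W}$ and $M_{D^{2}}$, which we define as

$$S_{W^{T}W}(z)\equiv\sigma_{w}^{-2}\left(1+\sum_{k=1}^{\infty}s_{k}z^{k}\right),\qquad M_{D^{2}}(z)\equiv\sum_{k=1}^{\infty}\frac{\mu_{k}}{z^{k}},$$

where the moments $\mu_{k}$ of $\mathbf{D}^{2}$ are given by,

$$\mu_{k}=\int\mathcal{D}h\,\phi^{\prime}\!\left(\sqrt{q^{*}}\,h\right)^{2k}.$$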

We can substitute these moment expansions into (S11) to obtain equations for the unknown moments $m_{k}$ of the spectrum of $\mathbf{J}\mathbf{J}^{T}$ , in terms of the known moments $\mu_{k}$ and $s_{k}$ . We can solve for the low order moments by expanding (S11) in powers of $z^{-1}$ . By equating the coefficients of $z^{-1}$ and $z^{-2}$ , we obtain the following equations for $m_{1}$ and $m_{2}$ ,

Transforms of Nonlinearities

Here we compute the moment generating functions $M_{D^{2}}(z)$ for various choices of the nonlinearity $\phi$ , some of which are displayed in Table 1 of the main paper.

3 $\phi(x)=\operatorname{htanh}(x)$

5 $\phi(x)=\operatorname{erf}(\frac{\sqrt{\pi}}{2}x)$

where $\Phi$ is the special function known as the Lerch transcendent.

6 $\phi(x)=\frac{2}{\pi}\arctan(\frac{\pi}{2}x)$

Transforms of Weights

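First consider the case of an orthogonal random matrix satisfying $\mathbf{W}^{T}\mathbf{W}=\mathbf{I}$. Then

$$\rho_{W^{T}W}(\lambda)=\delta(\lambda-1),\qquad G_{W^{T}W}(z)=\frac{1}{z-1},\qquad M_{W^{T}W}(z)=\frac{1}{z-1},\qquad M_{W^{T}W}^{-1}(z)=\frac{1+z}{z},\qquad S_{W^{T}W}(z)=1.$$

The case of a Gaussian random matrix $\mathbf{W}$ with zero-mean entries of variance $\frac{1}{N}$ is more complex, but well known:

$$\begin{aligned}\rho_{W^{T}W}(\lambda)&=\frac{1}{2\pi}\sqrt{\frac{4}{\lambda}-1}\quad\text{for}\ \lambda\in[0,4],\\ G_{W^{T}W}(z)&=\frac{1}{2}\left(1-\sqrt{1-\frac{4}{z}}\right),\\ M_{W^{T}W}(z)&=\frac{1}{2}\left(z-2-\sqrt{z(z-4)}\right),\\ M_{W^{T}W}^{-1}(z)&=\frac{(1+z)^{2}}{z},\\ S_{W^{T}W}(z)&=\frac{1}{1+z}.\end{aligned}$$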

Furthermore, by scaling $\mathbf{W}\rightarrow\sigma_{w}\mathbf{W}$ , the S-transform scales as $S_{W^{T}W}\rightarrow\sigma_{w}^{-2}S_{W^{T}W}$ , yielding the S-transforms in Table 1.

Universality class of orthogonal Hard Tanh networks

We consider hard tanh with orthogonal weights. The moment generating function is,

$$M_{D^{2}}(z)=\operatorname{erf}\!\left(\frac{1}{\sqrt{2q^{*}}}\right)\frac{1}{z-1},$$

if we wish to scale $q^{*}$ with depth $L$ so as to achieve a depth independent constant variance $\sigma_{JJ^{T}}^{2}=\sigma_{0}^{2}$ as $L\rightarrow\infty$ . This expression for $q^{*}$ gives,

where $W$ is the standard Lambert-W function, or product log. The derivative of this function has double poles at,

$$\lambda_{0}=0,\qquad\lambda_{2}=e^{\sigma_{0}^{2}},$$

which are locations where the spectral density diverges. There is also a single pole at,

$$\lambda_{1}=\sigma_{0}^{2}\,e,$$

which is the maximum value of the bulk of the density.

Universality class of orthogonal $\operatorname{erf}$ networks

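Consider $\phi(x)=\sqrt{\frac{\pi}{2}}\operatorname{erf}(\frac{x}{\sqrt{2}})$, which has been scaled so that $\phi^{\prime}(0)=1$ and $\phi^{\prime\prime\prime}(0)=-1$. Since $\phi^{\prime}(x)=e^{-x^{2}/2}$, the $\mu_{k}$ are given by,

$$\mu_{k}=\int\mathcal{D}h\,e^{-kq^{*}h^{2}}=\frac{1}{\sqrt{1+2kq^{*}}}.$$

If we wish to scale $q^{*}$ with depth $L$ so as to achieve a depth-independent constant variance $\sigma_{JJ^{T}}^{2}=\sigma_{0}^{2}$ as $L\rightarrow\infty$, then we can choose, to leading order in $1/L$,

$$q^{*}=\frac{\sigma_{0}}{\sqrt{2L}}.$$

Since we also assume the network is critical, we also have that,

$$\sigma_{w}^{2}=\mu_{1}^{-1}=\sqrt{1+2q^{*}}.$$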

To illustrate universality, we next consider an arbitrary activation function, and assume that it has a Taylor expansion around 0. This allows us to expand the $\mu_{k}$ . First we write,

We will need $\phi_{1}\neq 0$ . First we will assume that $\phi_{2}\neq 0$ . Using this expansion we can write,

where we have used the fact that the network is critical so that we have $\mu_{1}=\sigma_{w}^{-2}$. Using the Lagrange inversion theorem to expand $M_{D^{2}}^{-1}$, we find that

Next we will assume that $\phi_{2}=0$ and $\phi_{3}\neq 0$. (We suspect these additional assumptions are unnecessary and that the results which follow are valid so long as there exists a $k$ for which $\phi_{k}\neq 0$. It would be interesting to prove this.) Using the above expansion we can write,

where we have used the fact that the network is critical so that we have $\mu_{1}=\sigma_{w}^{-2}$ . Using the Lagrange inversion theorem to expand $M_{D^{2}}^{-1}$ , we find that

establishing a universal limiting S-transform (subject to our assumptions). From this result we can extract the Stieltjes transform and thus the spectral density. The result establishes a universal double scaling limiting spectral distribution. Next we observe that the Stieltjes transform can be expressed in terms of a generalization of the Lambert-$W$ function called the r-Lambert function, $W_{r}(z)$, which is defined by

In terms of this function, the Stieltjes transform is,

We can extract the maximum and minimum eigenvalue by finding the branch points of this function. It suffices to look for poles in the derivative of the numerator of $G(z)$. Using $r=-\sigma_{0}^{2}ze^{\sigma_{0}^{2}}$, eqn. (S52) and its total derivative with respect to $z$ yields the following equation defining the locations of these poles,

where $W$ is the standard Lambert-W function. Next we substitute this relation into eqn. (S52); zeros in $z$ then define the location of the branch points. Some straightforward algebra yields the maximum and minimum eigenvalue,

Orthogonal weights are required for stable, universal limiting distributions

We work at criticality so $\chi=\sigma_{w}^{2}\mu_{1}=1$ . This implies that

Observe that Jensen’s inequality requires that $\mu_{2}\geq\mu_{1}^{2}$ . If we require that $\sigma_{JJ^{T}}^{2}$ approach a constant as $L\to\infty$ , we must have that,

we can relate $\sigma_{w}$ and $s_{1}$ to $\mathfrak{m}_{1}$ and $\mathfrak{m}_{2}$ . Specifically, evaluating the relation,

Expanding this equation to second order gives,

Positivity of variance gives $s_{1}\leq 0$ , which, together with eqn. (S58) implies,

Altogether we see that the variance of the distribution of eigenvalues of $WW^{T}$ must be zero. Since its mean is equal to $\sigma_{w}^{2}$ , we see that the only valid distribution for the eigenvalues of $WW^{T}$ is a delta function peaked at $\sigma_{w}^{2}$ , i.e. the distribution corresponding to the singular values of an orthogonal matrix scaled by $\sigma_{w}$ .