Exponential expressivity in deep neural networks through transient chaos

Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, Surya Ganguli

Introduction

Deep feedforward neural networks, with multiple hidden layers, have achieved remarkable performance across many domains . A key factor thought to underlie their success is their high expressivity. This informal notion has manifested itself primarily in two forms of intuition. The first is that deep networks can compactly express highly complex functions over input space in a way that shallow networks with one hidden layer and the same number of neurons cannot. The second piece of intuition, which has captured the imagination of machine learning and neuroscience alike, is that deep neural networks can disentangle highly curved manifolds in input space into flattened manifolds in hidden space, to aid the performance of simple linear readouts. These intuitions, while attractive, have been difficult to formalize mathematically, and thereby rigorously test.

For the first intuition, seminal works have exhibited examples of particular functions that can be computed with a polynomial number of neurons (in the input dimension) in a deep network but require an exponential number of neurons in a shallow network . This raises a central open question: are such functions merely rare curiosities, or is any function computed by a generic deep network not efficiently computable by a shallow network? The theoretical techniques employed in prior work both limited the applicability of theory to specific nonlinearities and dictated the particular measure of deep functional complexity involved. For example focused on ReLu nonlinearities and number of linear regions as a complexity measure, while focused on sum-product networks and the number of monomials as complexity measure, and focused on Pfaffian nonlinearities and topological measures of complexity, like the sum of Betti numbers of a decision boundary. However, see for an interesting analysis of a general class of compositional functions. The limits of prior theoretical techniques raise another central question: is there a unifying theoretical framework for deep neural expressivity that is simultaneously applicable to arbitrary nonlinearities, generic networks, and a natural, general measure of functional complexity?

Here we attack both central problems of deep neural expressivity by combining a very different set of tools, namely Riemannian geometry and dynamical mean field theory . This novel combination enables us to show that for very broad classes of nonlinearities, even random deep neural networks can construct hidden internal representations whose global extrinsic curvature grows exponentially with depth but not width. Our geometric framework enables us to quantitatively define a notion of disentangling and verify this notion even in deep random networks. Furthermore, our methods yield insights into the emergent, deterministic nature of signal propagation through large random feedforward networks, revealing the existence of an order to chaos transition as a function of the statistics of weights and biases. We find that the transient, finite depth evolution in the chaotic regime underlies the origins of exponential expressivity in deep random networks.

In our companion paper , we study several related measures of expressivity in deep random neural networks with piecewise linear activations.

A mean field theory of deep nonlinear signal propagation

where bl\mathbf{b}^{l} is a vector of biases, hl\mathbf{h}^{l} is the pattern of inputs to neurons at layer ll, and ϕ\phi is a single neuron scalar nonlinearity that acts component-wise to transform inputs hl\mathbf{h}^{l} to activities xl\mathbf{x}^{l}. We wish to understand the nature of typical functions computable by such networks, as a consequence of their depth. We therefore study ensembles of random networks in which each of the synaptic weights Wijl{\mathbf{W}}^{l}_{ij} are drawn i.i.d. from a zero mean Gaussian with variance σw2/Nl1{\sigma_{w}^{2}}/N_{l-1}, while the biases are drawn i.i.d. from a zero mean Gaussian with variance σb2\sigma_{b}^{2}. This weight scaling ensures that the input contribution to each individual neuron at layer ll from activities in layer l1l-1 remains O(1)O(1), independent of the layer width Nl1N_{l-1}. This ensemble constitutes a maximum entropy distribution over deep neural networks, subject to constraints on the means and variances of weights and biases. This ensemble induces no further structure in the resulting set of deep functions, so its analysis provides an opportunity to understand the specific contribution of depth alone to the nature of typical functions computed by deep networks.

In the limit of large layer widths, Nl1N_{l}\gg 1, certain aspects of signal propagation through deep random neural networks take on an essentially deterministic character. This emergent determinism in large random neural networks enables us to understand how the Riemannian geometry of simple manifolds in the input layer x0\mathbf{x}^{0} is typically modified as the manifold propagates into the deep layers. For example, consider the simplest case of a single input vector x0\mathbf{x}^{0}. As it propagates through the network, its length in downstream layers will change. We track this changing length by computing the normalized squared length of the input vector at each layer:

This length is the second moment of the empirical distribution of inputs hil\mathbf{h}^{l}_{i} across all NlN_{l} neurons in layer ll. For large NlN_{l}, this empirical distribution converges to a zero mean Gaussian since each hil=jWijlϕ(hjl1)+bil\mathbf{h}^{l}_{i}=\sum_{j}{\mathbf{W}}^{l}_{ij}\phi(\mathbf{h}^{l-1}_{j})+\mathbf{b}^{l}_{i}

is a weighted sum of a large number of uncorrelated random variables - i.e. the weights Wijl{\mathbf{W}}^{l}_{ij} and biases bil\mathbf{b}^{l}_{i}, which are independent of the activity in previous layers. By propagating this Gaussian distribution across one layer, we obtain an iterative map for qlq^{l} in (2):

where Dz=dz2πez22\mathcal{D}z=\frac{dz}{\sqrt{2\pi}}e^{-\frac{z^{2}}{2}} is the standard Gaussian measure, and the initial condition is q1=σw2q0+σb2q^{1}=\sigma^{2}_{w}q^{0}+\sigma^{2}_{b}, where q0=1N0x0x0q^{0}=\frac{1}{N_{0}}\mathbf{x}^{0}\cdot\mathbf{x}^{0} is the length in the initial activity layer. See Supplementary Material (SM) for a derivation of (3). Intuitively, the integral over zz in (3) replaces an average over the empirical distribution of hil\mathbf{h}^{l}_{i} across neurons ii in layer ll at large layer width NlN_{l}.

The function V\mathcal{V} in (3) is an iterative variance, or length, map that predicts how the length of an input in (2) changes as it propagates through the network. This length map is plotted in Fig. 1A for the special case of a sigmoidal nonlinearity, ϕ(h)=tanh(h)\phi(h)=\tanh(h). For monotonic nonlinearities, this length map is a monotonically increasing, concave function whose intersections with the unity line determine its fixed points q(σw,σb)q^{*}(\sigma_{w},\sigma_{b}). For σb=0\sigma_{b}=0 and σw<1\sigma_{w}<1, the only intersection is at q=0q^{*}=0. In this bias-free, small weight regime, the network shrinks all inputs to the origin. For σw>1\sigma_{w}>1 and σb=0\sigma_{b}=0, the q=0q^{*}=0 fixed point becomes unstable and the length map acquires a second nonzero fixed point, which is stable. In this bias-free, large weight regime, the network expands small inputs and contracts large inputs. Also, for any nonzero bias σb\sigma_{b}, the length map has a single stable non-zero fixed point. In such a regime, even with small weights, the injected biases at each layer prevent signals from decaying to . The dynamics of the length map leads to rapid convergence of length to its fixed point with depth (Fig. 1B,D), often within only 44 layers. The fixed points q(σw,σb)q^{*}(\sigma_{w},\sigma_{b}) are shown in Fig. 1C.

Transient chaos in deep networks

Now consider the layer-wise propagation of two inputs x0,1\mathbf{x}^{0,1} and x0,2\mathbf{x}^{0,2}. The geometry of these two inputs as they propagate through the network is captured by the 22 by 22 matrix of inner products:

The dynamics of the two diagonal terms are each theoretically predicted by the length map in (3). We derive (see SM) a correlation map C\mathcal{C} that predicts the layer-wise dynamics of q12lq^{l}_{12}:

where c12l=q12l(q11lq22l)1/2c^{l}_{12}=q^{l}_{12}(q^{l}_{11}q^{l}_{22})^{-1/2} is the correlation coefficient. Here z1z_{1} and z2z_{2} are independent standard Gaussian variables, while u1u_{1} and u2u_{2} are correlated Gaussian variables with covariance matrix uaub=qabl1\langle u_{a}u_{b}\rangle=q^{l-1}_{ab}. Together, (3) and (5) constitute a theoretical prediction for the typical evolution of the geometry of 22 points in (4) in a fixed large network.

Analysis of these equations reveals an interesting order to chaos transition in the σw\sigma_{w} and σb\sigma_{b} plane. In particular, what happens to two nearby points as they propagate through the layers? Their relation to each other can be tracked by the correlation coefficient c12lc^{l}_{12} between the two points, which approaches a fixed point c(σw,σb)c^{*}(\sigma_{w},\sigma_{b}) at large depth. Since the length of each point rapidly converges to q(σw,σb)q^{*}(\sigma_{w},\sigma_{b}), as shown in Fig. 1BD, we can compute cc^{*} by simply setting q11l=q22l=q(σw,σb)q^{l}_{11}=q^{l}_{22}=q^{*}(\sigma_{w},\sigma_{b}) in (5) and dividing by qq^{*} to obtain an iterative correlation coefficient map, or C\mathcal{C}-map, for c12lc^{l}_{12}:

This C\mathcal{C}-map is shown in Fig. 2A. It always has a fixed point at c=1c^{*}=1 as can be checked by direct calculation. However, the stability of this fixed point depends on the slope of the map at 11, which is

See SM for a derivation of (7). If the slope χ1\chi_{1} is less than 11, then the C\mathcal{C}-map is above the unity line, the fixed point at 11 under the C\mathcal{C}-map in (6) is stable, and

nearby points become more similar over time. Conversely, if χ1>1\chi_{1}>1 then this fixed point is unstable, and nearby points separate as they propagate through the layers. Thus we can intuitively understand χ1\chi_{1} as a multiplicative stretch factor. This intuition can be made precise by considering the Jacobian Jijl=Wijlϕ(hjl1)\mathbf{J}^{l}_{ij}={\mathbf{W}}^{l}_{ij}\phi^{\prime}(\mathbf{h}^{l-1}_{j}) at a point hjl1\mathbf{h}^{l-1}_{j} with length qq^{*}. Jl\mathbf{J}^{l} is a linear approximation of the network map from layer l1l-1 to ll in the vicinity of hl1\mathbf{h}^{l-1}. Therefore a small random perturbation hl1+u\mathbf{h}^{l-1}+\mathbf{u} will map to hl+Ju\mathbf{h}^{l}+\mathbf{J}\mathbf{u}. The growth of the perturbation, Ju22/u22||\mathbf{J}\mathbf{u}||_{2}^{2}/||\mathbf{u}||_{2}^{2} becomes χ1(q)\chi_{1}(q^{*}) after averaging over the random perturbation u\mathbf{u}, weight matrix Wl{\mathbf{W}}^{l}, and Gaussian distribution of hil1\mathbf{h}^{l-1}_{i} across ii. Thus χ1\chi_{1} directly reflects the typical multiplicative growth or shrinkage of a random perturbation across one layer.

The dynamics of the iterative C\mathcal{C}-map and its agreement with network simulations is shown in Fig. 2B. The correlation dynamics are much slower than the length dynamics because the C\mathcal{C}-map is closer to the unity line (Fig. 2A) than the length map (Fig. 1A). Thus correlations typically take about 2020 layers to approach the fixed point, while lengths need only 44. The fixed point cc^{*} and slope χ1\chi_{1} of the C\mathcal{C}-map are shown in Fig. 2CD. For any fixed, finite σb\sigma_{b}, as σw\sigma_{w} increases three qualitative regions occur. For small σw\sigma_{w}, c=1c^{*}=1 is the only fixed point, and it is stable because χ1<1\chi_{1}<1. In this strong bias regime, any two input points converge to each other as they propagate through the network. As σw\sigma_{w} increases, χ1\chi_{1} increases and crosses 11, destabilizing the c=1c^{*}=1 fixed point. In this intermediate regime, a new stable fixed point cc^{*} appears, which decreases as σw\sigma_{w} increases. Here an equal footing competition between weights and nonlinearities (which de-correlate inputs) and the biases (which correlate them), leads to a finite cc^{*}. At larger σw\sigma_{w}, the strong weights overwhelm the biases and maximally de-correlate inputs to make them orthogonal, leading to a stable fixed point at c=0c^{*}=0.

Thus the equation χ1(σw,σb)=1\chi_{1}(\sigma_{w},\sigma_{b})=1 yields a phase transition boundary in the (σw,σb)(\sigma_{w},\sigma_{b}) plane, separating it into a chaotic (or ordered) phase, in which nearby points separate (or converge). In dynamical systems theory, the logarithm of χ1\chi_{1} is related to the well known Lyapunov exponent which is positive (or negative) for chaotic (or ordered) dynamics. However, in a feedforward network, the dynamics is truncated at a finite depth DD, and hence the dynamics are a form of transient chaos.

The propagation of manifold geometry through deep networks

To illustrate these concepts, it is useful to compute all of them for the circle h1(θ)\mathbf{h}^{1}(\theta) defined above: gE(θ)=Nqg^{E}(\theta)=Nq, LE=2πNq\mathcal{L}^{E}=2\pi\sqrt{Nq}, κ(θ)=1/Nq\kappa(\theta)=1/\sqrt{Nq}, gG(θ)=1g^{G}(\theta)=1, and LG=2π\mathcal{L}^{G}=2\pi. As expected, κ(θ)\kappa(\theta) is the inverse of the radius of curvature, which is Nq\sqrt{Nq}. Now consider how these quantities change if the circle is scaled up so that h(θ)χh(θ)\mathbf{h}(\theta)\rightarrow\chi\mathbf{h}(\theta). The length LE\mathcal{L}^{E} and radius scale up by χ\chi, but the curvature κ\kappa scales down as χ1{\chi^{-1}}, and so LG\mathcal{L}^{G} does not change. Thus linear expansion increases length and decreases curvature, thereby maintaining constant Grassmannian length LG\mathcal{L}^{G}.

We now show that nonlinear propagation of this same circle through a deep network can behave very differently from linear expansion: in the chaotic regime, length can increase without any decrease in extrinsic curvature! To remove the scaling with NN in the above quantities, we will work with the renormalized quantities κˉ=Nκ\bar{\kappa}=\sqrt{N}\kappa, gˉE=1NgE\bar{g}^{E}=\frac{1}{N}g^{E}, and LˉE=1NLE\bar{\mathcal{L}}^{E}=\frac{1}{\sqrt{N}}\mathcal{L}^{E}. Thus, 1/(κˉ)21/(\bar{\kappa})^{2} can be thought of as a radius of curvature squared per neuron of the osculating circle, while (LˉE)2(\bar{\mathcal{L}}^{E})^{2} is the squared Euclidean length of the curve per neuron. For the circle, these quantities are qq and 2πq2\pi q respectively. For simplicity, in the inputs to the first layer of neurons, we begin with a circle h1(θ)\mathbf{h}^{1}(\theta) with squared radius per neuron q1=qq^{1}=q^{*}, so this radius is already at the fixed point of the length map in (3). In the SM, we derive an iterative formula for the extrinsic curvature and Euclidean metric of this manifold as it propagates through the layers of a deep network:

where χ1\chi_{1} is the stretch factor defined in (7) and χ2\chi_{2} is defined analogously as

χ2\chi_{2} is closely related to the second derivative of the C\mathcal{C}-map in (6) at c12l1=1c^{l-1}_{12}=1; this second derivative is χ2q\chi_{2}q^{*}. See SM for a derivation of the evolution equations (8) for the extrinsic geometry of a curve as it propagates through a deep network.

Intriguingly for a sigmoidal neural network, these evolution equations behave very differently in the chaotic (χ1>1\chi_{1}>1) versus ordered (χ1<1\chi_{1}<1) phase. In the chaotic phase, the Euclidean metric gˉE\bar{g}^{E} grows exponentially with depth due to multiplicative stretching through χ1\chi_{1}. This stretching does multiplicatively attenuate any curvature in layer l1l-1 by a factor 1/χ11/\chi_{1} (see the update equation for κˉl\bar{\kappa}^{l} in (8)), but new curvature is added in due to a nonzero χ2\chi_{2}, which originates from the curvature of the single neuron nonlinearity in (9). Thus, unlike in linear expansion, extrinsic curvature is not lost, but maintained, and ultimately approaches a fixed point κˉ\bar{\kappa}^{*}. This implies that the global curvature measure LˉG\bar{\mathcal{L}}^{G} grows exponentially with depth. These highly nontrivial predictions of the metric and curvature evolution equations in (8) are quantitatively confirmed in simulations in Figure 4C-E.

Intuitively, this exponential growth of global curvature LˉG\bar{\mathcal{L}}^{G} in the chaotic phase implies that the curve explores many different tangent directions in hidden representation space. This further implies that the coordinate functions of the embedding hil(θ)\mathbf{h}^{l}_{i}(\theta) become highly complex curved basis functions on the input manifold coordinate θ\theta, allowing a deep network to compute exponentially complex functions over simple low dimensional manifolds (Figure 5A-C, details in SM). In our companion paper , we further develop the relationship between length and expressivity in terms of the number of achievable classification patterns on a set of inputs. Moreover, we explore how training a single layer at different depths from the output affects network performance.

Shallow networks cannot achieve exponential expressivity

Consider a shallow network with 11 hidden layer x1\mathbf{x}^{1}, one input layer x0\mathbf{x}^{0}, with x1=ϕ(W1x0)+b1\mathbf{x}^{1}=\phi({\mathbf{W}}^{1}\mathbf{x}^{0})+\mathbf{b}^{1}, and a linear readout layer. How complex can the hidden representation be as a function of its width N1N_{1}, relative to the results above for depth? We prove a general upper bound on LE\mathcal{L}^{E} (see SM):

Suppose ϕ(h)\phi(h) is monotonically non-decreasing with bounded dynamic range RR, i.e. maxhϕ(h)minhϕ(h)=R\max_{h}\phi(h)-\min_{h}\phi(h)=R. Further suppose that x0(θ)\mathbf{x}^{0}(\theta) is a curve in input space such that no 1D projection of θx(θ)\partial_{\theta}\mathbf{x}(\theta) changes sign more than ss times over the range of θ\theta. Then for any choice of W1{\mathbf{W}}^{1} and b1\mathbf{b}^{1} the Euclidean length of x1(θ)\mathbf{x}^{1}(\theta), satisfies LEN1(1+s)R\mathcal{L}^{E}\leq N_{1}(1+s)R.

For the circle input, s=1s=1 and for the tanh\tanh nonlinearity, R=2R=2, so in this special case, the normalized length LˉE2N1\bar{\mathcal{L}}^{E}\leq 2\sqrt{N_{1}}. In contrast, for deep networks in the chaotic regime LˉE\bar{\mathcal{L}}^{E} grows exponentially with depth in h\mathbf{h} space, and so consequently also in x\mathbf{x} space. Therefore the length of curves typically expand exponentially in depth even for random deep networks, but can only expand as the square root of width no matter what shallow network is chosen. Moreover, as we have seen above, it is the exponential growth of LEˉ\bar{\mathcal{L^{E}}} that fundamentally drives the exponential growth of LˉG\bar{\mathcal{L}}^{G} with depth. Indeed shallow random networks exhibit minimal growth in expressivity even at large widths (Figure 5D).

Classification boundaries acquire exponential local curvature with depth

We have focused so far on how simple manifolds in input space can acquire both exponential Euclidean and Grassmannian length with depth, thereby exponentially de-correlating and filling up hidden representation space. Another natural question is how the complexity of a decision boundary grows as it is backpropagated to the input layer. Consider a linear classifier y=sgn(βxDβ0)y=\text{sgn}(\boldsymbol{\beta}\cdot\mathbf{x}^{D}-\beta_{0}) acting on the final layer. In this layer, the N1N-1 dimensional decision boundary is the hyperplane βxDβ0=0\boldsymbol{\beta}\cdot\mathbf{x}^{D}-\beta_{0}=0. However, in the input layer x0\mathbf{x}^{0}, the decision boundary is a curved N1N-1 dimensional manifold M\mathcal{M} that arises as the solution set of the nonlinear equation G(x0)βxD(x0)β0=0G(\mathbf{x}^{0})\equiv\boldsymbol{\beta}\cdot\mathbf{x}^{D}(\mathbf{x}^{0})-\beta_{0}=0, where xD(x0)\mathbf{x}^{D}(\mathbf{x}^{0}) is the nonlinear feedforward map from input to output.

At any point x\mathbf{x}^{*} on the decision boundary in layer ll, the gradient G\vec{\nabla}G is perpendicular to the N1N-1 dimensional tangent plane TxM\mathcal{T}_{x^{*}}\mathcal{M} (see Fig. 4F). The normal vector G\vec{\nabla}G, along with any unit tangent vector v^TxM\hat{\mathbf{v}}\in\mathcal{T}_{x^{*}}\mathcal{M}, spans a 22 dimensional subspace whose intersection with M\mathcal{M} yields a geodesic curve in M\mathcal{M} passing through xx^{*} with velocity vector v^\hat{\mathbf{v}}. This geodesic will have extrinsic curvature κ(x,v^)\kappa(\mathbf{x}^{*},\hat{\mathbf{v}}). Maximizing this curvature over v^\hat{\mathbf{v}} yields the first principal curvature κ1(x)\kappa_{1}(\mathbf{x}^{*}). A sequence of successive maximizations of κ(x,v^)\kappa(\mathbf{x}^{*},\hat{\mathbf{v}}), while constraining v^\hat{\mathbf{v}} to be perpendicular to all previous solutions, yields the sequence of principal curvatures κ1(x)κ2(x)κN1(x)\kappa_{1}(\mathbf{x}^{*})\geq\kappa_{2}(\mathbf{x}^{*})\geq\dots\geq\kappa_{N-1}(\mathbf{x}^{*}). These principal curvatures arise as the eigenvalues of a normalized Hessian operator projected onto the tangent plane TxM\mathcal{T}_{x^{*}}\mathcal{M}: H=G21P2GxxTP\mathcal{H}={||\vec{\nabla}G||_{2}^{-1}}{\mathbf{P}\frac{\partial^{2}G}{\partial\mathbf{x}\partial\mathbf{x}^{T}}}\mathbf{P}, where P=I^G^GT\mathbf{P}=\mathbf{I}-\widehat{\nabla}G\widehat{\nabla}G^{T} is the projection operator onto TxM\mathcal{T}_{x^{*}}\mathcal{M} and ^G\widehat{\nabla}G is the unit normal vector . Intuitively, near x\mathbf{x}^{*}, the decision boundary M\mathcal{M} can be approximated as a paraboloid with a quadratic form H\mathcal{H} whose N1N-1 eigenvalues are the principal curvatures κ1,,κN1\kappa_{1},\dots,\kappa_{N-1} (Fig. 4F).

We compute these curvatures numerically as a function of depth in Fig. 4G (see SM for details). We find, remarkably, that a subset of principal curvatures grow exponentially with depth. Here the principal curvatures are signed, with positive (negative) curvature indicating that the associated geodesic curves towards (away from) the normal vector G\vec{\nabla}G. Thus the decision boundary can become exponentially curved with depth, enabling highly complex classifications. Moreover, this exponentially curved boundary is disentangled and mapped to a flat boundary in the output layer.

Discussion

Moreover, our analysis of a maximum entropy distribution over deep networks constitutes an important null model of deep signal propagation that can be used to assess and understand different behavior in trained networks. For example, the metrics we have adapted from Riemannian geometry, combined with an understanding of their behavior in random networks, may provide a basis for understanding what is special about trained networks. Furthermore, while we have focused on the notion of input-output chaos, the duality between inputs and synaptic weights imply a form of weight chaos, in which deep neural networks rapidly traverse function space as weights change (see SM). Indeed, just as autocorrelation lengths between outputs as a function of inputs shrink exponentially with depth, so too will autocorrelations between outputs as a function of weights.

References

Supplementary Material

Below is a series of appendices giving derivations of results in the main paper, followed by details of results along with more visualizations.

Note: code to programmatically reproduce all plots in the paper in Jupyter notebooks will be released upon publication.

where bl\mathbf{b}^{l} is a vector of biases, hl\mathbf{h}^{l} is the pattern of inputs to neurons at layer ll, and ϕ\phi is a single neuron scalar nonlinearity that acts component-wise to transform inputs hl\mathbf{h}^{l} to activities xl\mathbf{x}^{l}. The synaptic weights Wijl{\mathbf{W}}^{l}_{ij} are drawn i.i.d. from a zero mean Gaussian with variance σw2/Nl1{\sigma_{w}^{2}}/N_{l-1}, while the biases are drawn i.i.d. from a zero mean Gaussian with variance σb2\sigma_{b}^{2}. This weight scaling ensures that the input contribution to each individual neuron at layer ll from activities in layer l1l-1 remains O(1)O(1), independent of the layer width Nl1N_{l-1}.

As a single input point x0\mathbf{x}^{0} propagates through the network, it’s length in downstream layers can either grow or shrink. To track the propagation of this length, we track the normalized squared length of the input vector at each layer,

This length is the second moment of the empirical distribution of inputs hil\mathbf{h}^{l}_{i} across all NlN_{l} neurons in layer ll for a fixed set of weights. This empirical distribution is expected to be Gaussian for large NlN_{l}, since each individual hil=wl,iϕ(hl1)+bil\mathbf{h}^{l}_{i}={\mathbf{w}}^{{l},{i}}\cdot\phi(\mathbf{h}^{l-1})+\mathbf{b}^{l}_{i} is Gaussian distributed, as a sum of a large number of independent random variables, and each hil\mathbf{h}^{l}_{i} is independent of hjl\mathbf{h}^{l}_{j} for iji\neq j because the synaptic weights vectors and biases into each neuron are chosen independently.

While the mean of this Gaussian is , its variance can be computed by considering the variance of the input to a single neuron:

where \langle{\cdot}\rangle denotes an average over the distribution of weights and biases into neuron ii at layer ll. Here we have used the identity wjl,iwkl,i=δjkσw2/Nl1\langle{{\mathbf{w}}^{{l},{i}}_{j}{\mathbf{w}}^{{l},{i}}_{k}}\rangle=\delta_{jk}\,\sigma_{w}^{2}/N_{l-1}. Now the empirical distribution of inputs across layer l1l-1 is also Gaussian, with mean zero and variance ql1q^{l-1}. Therefore we can replace the average over neurons in layer l1l-1 in (12) with an integral over a Gaussian random variable, obtaining

where Dz=dz2πez22\mathcal{D}z=\frac{dz}{\sqrt{2\pi}}e^{-\frac{z^{2}}{2}} is the standard Gaussian measure, and the initial condition for the variance map is q1=σw2q0+σb2q^{1}=\sigma^{2}_{w}q_{0}+\sigma^{2}_{b}, where q0=1N0x0x0q^{0}=\frac{1}{N_{0}}\mathbf{x}^{0}\cdot\mathbf{x}^{0} is the length in the initial activity layer. The function V\mathcal{V} in (13) is an iterative variance map that predicts how the length of an input in (11) changes as it propagates through the network. Its derivation relies on the well-known self-averaging assumption in the statistical physics of disordered systems, which, in our context, means that the empirical distribution of inputs across neurons for a fixed network converges for large width, to the distribution of inputs to a single neuron across random networks.

A.2 Derivation of a correlation map for the propagation of two points

Now consider the layer-wise propagation of two inputs x0,1\mathbf{x}^{0,1} and x0,2\mathbf{x}^{0,2}. The geometry of these two inputs as they propagate through the layers is captured by the 22 by 22 matrix of inner products

The joint empirical distribution of hil(x0,a)\mathbf{h}^{l}_{i}(\mathbf{x}^{0,a}) and hil(x0,a)\mathbf{h}^{l}_{i}(\mathbf{x}^{0,a}) across ii at large NlN_{l} will converge to a 2 dimensional Gaussian distribution with covariance qablq^{l}_{ab}. Propagating this joint distribution forward one layer using ideas similar to the derivation above for 11 input yields

where c12l=q12lq11lq12lc^{l}_{12}=\frac{q^{l}_{12}}{\sqrt{q^{l}_{11}}\sqrt{q^{l}_{12}}} is the correlation coefficient (CC). Here z1z_{1} and z2z_{2} are independent standard Gaussian variables, while u1u_{1} and u2u_{2} are correlated Gaussian variables with covariance matrix uaub=qabl1\langle u_{a}u_{b}\rangle=q^{l-1}_{ab}. The integration over z1z_{1} and z2z_{2} can be thought of as the large NlN_{l} limit of sums over hil(x0,a)\mathbf{h}^{l}_{i}(\mathbf{x}^{0,a}) and hil(x0,b)\mathbf{h}^{l}_{i}(\mathbf{x}^{0,b}).

When both input points are at their fixed point length, qq^{*}, the dynamics of their correlation coefficient can be obtained by simply setting q11l=q22l=q(σw,σb)q^{l}_{11}=q^{l}_{22}=q^{*}(\sigma_{w},\sigma_{b}) in (15) and dividing by qq^{*} to obtain a recursion relation for c12lc^{l}_{12}:

Direct calculation reveals that c12l(1)=1c^{l}_{12}(1)=1 as expected. Of particular interest is the slope χ1\chi_{1} of this map at 11. A direct, if tedious calculation shows that

To obtain this result, one has to apply the chain rule and product rule from calculus, as well as employ the identity

which can be obtained via integration by parts. Evaluating the derivative at 11 yields

Appendix B Derivation of evolution equations for Riemannian curvature

Here we derive recursion relations for Riemannian curvature quantitites.

with Q(0)=NqQ(0)=Nq^{*}. At large NN, the inner-product structure of translation invariant manifolds remains approximately translation invariant as it propagates through the network. Therefore, at large NN, we can express inner products of derivatives of h\mathbf{h} in terms of derivatives of QQ. For example, the Euclidean metric gEg^{E} is given by

Here, each dot is a short hand notation for derivative w.r.t. θ\theta. Also, the extrinsic curvature

where v(θ)=θh(θ)\mathbf{v}(\theta)=\partial_{\theta}\mathbf{h}(\theta) and a(θ)=θ2h(θ)\mathbf{a}(\theta)=\partial^{2}_{\theta}\mathbf{h}(\theta), simplifies to

Now if the translation invariant manifold lives on a sphere of radius NqNq^{*} where qq^{*} is the fixed point radius of the length map, then its radius does not change as it propagates through the system. Then we can also express gEg^{E} and κ\kappa in terms of the correlation coefficient function c(θ)=Q(θ)/qc(\theta)=Q(\theta)/q^{*} (up to a factor of NN). Thus to understand the propagation of local quantities like Euclidean length and curvature, we need to understand the propagation of derivatives of c(θ)c(\theta) at θ=0\theta=0 under the C\mathcal{C}-map in (16). Note that c(θ)c(\theta) is symmetric and achieves a maximum value of 11 at θ=0\theta=0. Thus the function H1(θ)=1c(θ)H^{1}(\theta)=1-c(\theta) is symmetric with a minimum at θ=0\theta=0. We consider the propagation of H1H^{1} though the C\mathcal{C}-map. But first we consider the propagation of derivatives under function composition in general.

B.2 Behavior of first and second derivatives under function composition

Assume H1(Δt)H^{1}(\Delta t) is an even function and H1(0)=0H^{1}(0)=0, so that its Taylor expansion can be written as H1(Δt)=12H¨1(0)Δt2+14H....1(0)Δt4+H^{1}(\Delta t)=\frac{1}{2}\ddot{H}^{1}(0)\Delta t^{2}+\frac{1}{4}\ddddot{H}^{1}(0)\Delta t^{4}+\dots. We are interested in determining how the second and fourth derivatives of HH propagate under composition with another function GG, so that H2=G(H1(Δt))H^{2}=G(H^{1}(\Delta t)) . We assume G(0)=0G(0)=0. We can use the chain rule and the product rule to derive:

B.3 Evolution equations for curvature and length

We now apply the above iterations with H1(θ)=1c(θ)H^{1}(\theta)=1-c(\theta) and G(c)=11qC(1c,q,qσw,σb)G(c)=1-\frac{1}{q^{*}}\mathcal{C}(1-c,\,q^{*},\,q^{*}\,|\,\sigma_{w},\sigma_{b}). Clearly, G(0)=0G(0)=0 the symmetric H1H^{1} obeys H1(0)=0H^{1}(0)=0, satisfying the above iterations of second and fourth derivatives. Taking into account these derivative recursions, using the expressions for κ\kappa and gEg^{E} in terms of derivatives of c(θ)c(\theta) at , and carefully accounting for factors of qq^{*} and NN, we obtain the final evolution equations that have been successfully tested against experiments:

where χ1\chi_{1} is the stretch factor defined in (19) and χ2\chi_{2} is defined analogously as

χ2\chi_{2} is closely related to the second derivative of the correlation coefficient map in (16) at c12l1=1c^{l-1}_{12}=1. Indeed this second derivative is χ2q\chi_{2}q^{*}.

Appendix C Upper bounds on the complexity of shallow neural representations

Consider a shallow network with 11 hidden layer x1\mathbf{x}^{1} and one input layer x0\mathbf{x}^{0}, so that x1=ϕ(W1x0)+b\mathbf{x}^{1}=\phi({\mathbf{W}}^{1}\mathbf{x}^{0})+\mathbf{b}. The network can compute functions through a linear readout of the hidden layer x1\mathbf{x}^{1}. We are interested in how complex these neural representations can get, with one layer of synaptic weights and nonlinearities, as a function the number of hidden units N1N_{1}. In particular, we are interested in how the length and curvature of an input manifold x0(θ)\mathbf{x}^{0}(\theta) changes as it propagates to become x1(θ)\mathbf{x}^{1}(\theta) in the hidden layer. We would like to upper bound the maximal achievable length and curvature over all possible choices of W1{\mathbf{W}}^{1} and b\mathbf{b}.

Here, we derive such an upper bound on the Euclidean length for a very general class of nonlinearities ϕ(h)\phi(h). We simply assume that (1) ϕ(h)\phi(h) is monotonically non-decreasing (so that ϕ(h)0h\phi^{\prime}(h)\geq 0\forall h) and (2) has with bounded dynamic range R, i.e. maxhϕ(h)minhϕ(h)=R.\max_{h}\phi(h)-\min_{h}\phi(h)=R. The Euclidean length in hidden space is

where the inequality follows from the triangle inequality. Now suppose that for any ii, θxi1(θ)\partial_{\theta}\mathbf{x}^{1}_{i}(\theta) never changes sign across θ\theta. Furthermore, assume that θ\theta ranges from to Θ\Theta. Then

More generally, let r1r_{1} denote the maximal number of times that any one neuron has a change in sign of the derivative θxi1(θ)\partial_{\theta}\mathbf{x}^{1}_{i}(\theta) across θ\theta. Then applying the above argument to each segment of constant sign yields

Now how many times can θxi1(θ)\partial_{\theta}\mathbf{x}^{1}_{i}(\theta) change sign? Since θxi1(θ)=ϕ(hi)θhi\partial_{\theta}\mathbf{x}^{1}_{i}(\theta)=\phi^{\prime}(\mathbf{h}_{i})\,\partial_{\theta}\mathbf{h}_{i}, where θhi(θ)=[Wlθx0(θ)]i\partial_{\theta}\mathbf{h}_{i}(\theta)=[{\mathbf{W}}^{l}\partial_{\theta}\mathbf{x}^{0}(\theta)]_{i}, and ϕ(hi)\phi(\mathbf{h}_{i}) is monotonically increasing, the number of times θxi1(θ)\partial_{\theta}\mathbf{x}^{1}_{i}(\theta) changes sign equals the number of times the input θhi(θ)\partial_{\theta}\mathbf{h}_{i}(\theta) changes sign. In turn, suppose s0s_{0} is the maximal number of times any one dimensional projection of the derivative vector θx0(θ)\partial_{\theta}\mathbf{x}^{0}(\theta) changes sign across θ\theta. Then the number of times the sign of θhi(θ)\partial_{\theta}\mathbf{h}_{i}(\theta) changes for any ii cannot exceed s0s_{0} because hi\mathbf{h}_{i} is a linear projection of x0\mathbf{x}^{0}. Together this implies r1s0r_{1}\leq s_{0}. We have thus proven:

Appendix D Simulation details

All neural network simulations were implemented in Keras and Theano. For all simulations (except Figure 5C), we used inputs and hidden layers with a width of 1,000 and tanh activations. We found that our results were mostly insensitive to width, but using larger widths decreased the fluctuations in the averaged quantities. Simulation error bars are all standard deviations, with the variance computed across the different inputs, h1(θ)h^{1}(\theta). If not mentioned, the weights in the network are initialized in the chaotic regime with σb=0.3\sigma_{b}=0.3, σw=4.0\sigma_{w}=4.0.

Computing κ(θ)\kappa(\theta) requires the computation of the velocity and acceleration vectors, corresponding to the first and second derivatives of the neural network hl(θ)h^{l}(\theta) with respect to θ\theta. As θ\theta is always one-dimensional, we can greatly speed up these computations by using forward-mode auto-differentiation, evaluating the Jacobian and Hessian in a feedforward manner. We implemented this using the R-op in Theano.

To identify the curvature of the decision boundary, we first had to identify points that lied along the decision boundary. We randomly initialized data points and then optimized G(xD(xl))2G(x^{D}(x^{l}))^{2} with respect to the input xx using Adam. This yields a set of inputs xlx^{l} where we compute the Jacobian and Hessian of G(xD(xl))G(x^{D}(x^{l})) to evaluate principal curvatures.

D.2 Details on Figure 5C-D: evaluating expressivity

To evaluate the set of functions reachable by a network, we first parameterized function space using a Fourier basis up to a particular maximum frequency, ωmax\omega_{\text{max}} on a sampled set of one dimensional inputs parameterized by θ\theta. We then took the output activations of each neural network and linearly regressed the output activations onto each Fourier basis. For each basis, we computed the angle between the predicted basis vector and the true basis vector. These are the quantities that appear in Figure 5C-D. Given any function with bounded frequency, we can represent it in this Fourier basis, and decompose the error in the prediction of the function into the error in prediction of each Fourier component. Thus error in the predicting the Fourier basis is a reasonable proxy for error in prediction of functions with bounded frequency.

Appendix E Additional visualization of hidden actions

Appendix F A view from the function space perspective

We have shown above that for a fixed set of weights and biases in the chaotic regime, the internal representation hl(x0)\mathbf{h}^{l}(\mathbf{x}^{0}) at large depth ll, rapidly de-correlates from itself as the input x0\mathbf{x}^{0} changes (see e.g. Fig. 3B in the main paper). Here we ask a dual question: for a fixed input manifold, how does a deep network move in a function space over this manifold as the weights in a single layer change? Consider for example, a random one parameter family of deep networks parameterized by Δ\Delta\in. In this family, we assume that the bias vectors bl\mathbf{b}^{l} in each layer are chosen as i.i.d. random Gaussian vectors with zero mean and variance σb2\sigma_{b}^{2}, independent of Δ\Delta. Moreover, we assume the weight matrix Wl{\mathbf{W}}^{l} has elements that are drawn i.i.d. from zero mean Gaussians with variance σw2\sigma_{w}^{2}, independent of Δ\Delta for all layers except l=2l=2. The only dependence on Δ\Delta in this family of networks originates in the weights in layer l=2l=2, chosen as

Here both a base matrix W\mathbf{W} and a perturbation matrix dW\mathbf{dW} have matrix elements that are zero mean i.i.d. Gaussians with variance σw2\sigma_{w}^{2}. Each matrix element of W2(Δ){\mathbf{W}}^{2}(\Delta) thus also has variance σw2\sigma^{2}_{w} just like all the other layers. In turn, this family of networks induces a family of functions hD(h1,Δ)\mathbf{h}^{D}(\mathbf{h}^{1},\Delta). For simplicity, we restrict these functions to a simple input manifold, the circle,

and the associated correlation coefficient,

Because of our restriction to an input circle at the fixed point radius, Ql(0,0)=Ql(Δ,Δ)=qQ^{l}(0,0)=Q^{l}(\Delta,\Delta)=q^{*} for all ll and Δ\Delta in the large width limit. By using logic similar to the derivation of (15), we can derive a recursion relation for the function space correlation Ql(0,Δ)Q^{l}(0,\Delta):

where Cl(Δ)=Ql(0,Δ)/qC^{l}(\Delta)=Q^{l}(0,\Delta)/q^{*}. The initial condition for this recursion is C1(Δ)=1C^{1}(\Delta)=1, since the family of functions in the first layer of inputs is independent of Δ\Delta. Now, the difference in weights at a nonzero Δ\Delta reduces the function space correlation to C2(Δ)<1C^{2}(\Delta)<1. At this point, the representation in h2\mathbf{h}^{2} is different for the two networks at parameter values and Δ\Delta. Moreover, in the chaotic regime, this difference will amplify due to the similarity between the function space evolution equation in (37) and the evolution equation for the similarity of two points in (15). In essence, just as two points in the input exponentially separate as they propagate through a single network in the chaotic regime, a pair of different functions separate when computed in the final layer. Thus a small perturbation in the weights into layer 22 can yield a very large change in the space of functions from the input manifold to layer DD. Moreover, as Δ\Delta varies from -1 to 1, the function hD(θ,Δ)\mathbf{h}^{D}(\theta,\Delta) roughly undergoes a random walk in function space whose autocorrelation length decreases exponentially with depth DD. This weight chaos, or a sensitive dependence of the function computed by a deep network with respect to weight changes far from the final layer, is another manifestation of deep neural expressivity. Our companion paper further explores the expressivity of deep random networks in function space and also finds an exponential growth in expressivity with depth.