On the Expressive Power of Deep Neural Networks

Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, Jascha Sohl-Dickstein

Introduction

Deep neural networks have proved astoundingly effective at a wide range of empirical tasks, from image classification (Krizhevsky et al., 2012) to playing Go (Silver et al., 2016), and even modeling human learning (Piech et al., 2015).

Despite these successes, understanding of how and why neural network architectures achieve their empirical successes is still lacking. This includes even the fundamental question of neural network expressivity, how the architectural properties of a neural network (depth, width, layer type) affect the resulting functions it can compute, and its ensuing performance.

This is a foundational question, and there is a rich history of prior work addressing expressivity in neural networks. However, it has been challenging to derive conclusions that provide both theoretical generality with respect to choices of architecture as well as meaningful insights into practical performance.

Indeed, the very first results on this question take a highly theoretical approach, from using functional analysis to show universal approximation results (Hornik et al., 1989; Cybenko, 1989), to analysing expressivity via comparisons to boolean circuits (Maass et al., 1994) and studying network VC dimension (Bartlett et al., 1998). While these results provided theoretically general conclusions, the shallow networks they studied are very different from the deep models that have proven so successful in recent years.

In response, several recent papers have focused on understanding the benefits of depth for neural networks (Pascanu et al., 2013; Montufar et al., 2014; Eldan and Shamir, 2015; Telgarsky, 2015; Martens et al., 2013; Bianchini and Scarselli, 2014). These results are compelling and take modern architectural changes into account, but they only show that a specific choice of weights for a deeper network results in inapproximability by a shallow (typically one or two hidden layers) network.

In particular, the goal of this new line of work has been to establish lower bounds — showing separations between shallow and deep networks — and as such they are based on hand-coded constructions of specific network weights. Even if the weight values used in these constructions are robust to small perturbations (as in (Pascanu et al., 2013; Montufar et al., 2014)), the functions that arise from these constructions tend toward extremal properties by design, and there is no evidence that a network trained on data ever resembles such a function.

This has meant that a set of fundamental questions about neural network expressivity has remained largely unanswered. First, we lack a good understanding of the “typical” case rather than the worst case in these bounds for deep networks, and consequently have no way to evaluate whether the hand-coded extremal constructions provide a reflection of the complexity encountered in more standard settings. Second, we lack an understanding of upper bounds to match the lower bounds produced by this prior work; do the constructions used to date place us near the limit of the expressive power of neural networks, or are there still large gaps? Finally, if we had an understanding of these two issues, we might begin to draw connections between network expressivity and observed performance.

In this paper, we address this set of challenges by defining and analyzing an interrelated set of measures of expressivity for neural networks; our framework applies to a wide range of standard architectures, independent of specific weight choices. We begin our analysis at the start of training, after random initialization, and later derive insights connecting network expressivity and performance.

Our first measure of expressivity is based on the notion of an activation pattern: in a network where the units compute functions based on discrete thresholds, we can ask which units are above or below their thresholds (i.e. which units are “active” and which are not). For the range of standard architectures that we consider, the network is essentially computing a linear function once we fix the activation pattern; thus, counting the number of possible activation patterns provides a concrete way of measuring the complexity beyond linearity that the network provides. We give an upper bound on the number of possible activation patterns, over any setting of the weights. This bound is tight as it matches the hand-constructed lower bounds of earlier work (Pascanu et al., 2013; Montufar et al., 2014).

Key to our analysis is the notion of a transition, in which changing an input xx to a nearby input x+δx+\delta changes the activation pattern. We study the behavior of transitions as we pass the input along a one-dimensional parametrized trajectory x(t)x(t). Our central finding is that the trajectory length grows exponentially in the depth of the network.

Trajectory length serves as a unifying notion in our measures of expressivity, and it leads to insights into the behavior of trained networks. Specifically, we find that the exponential growth in trajectory length as a function of depth implies that small adjustments in parameters lower in the network induce larger changes than comparable adjustments higher in the network. We demonstrate this phenomenon through experiments on MNIST and CIFAR-10, where the network displays much less robustness to noise in the lower layers, and better performance when they are trained well. We also explore the effects of regularization methods on trajectory length as the network trains and propose a less computationally intensive method of regularization, trajectory regularization, that offers the same performance as batch normalization.

The contributions of this paper are thus:

Measures of expressivity: We propose easily computable measures of neural network expressivity that capture the expressive power inherent in different neural network architectures, independent of specific weight settings.

Exponential trajectories: We find an exponential depth dependence displayed by these measures, through a unifying analysis in which we study how the network transforms its input by measuring trajectory length

All weights are not equal (the lower layers matter more): We show how these results on trajectory length suggest that optimizing weights in lower layers of the network is particularly important.

Trajectory Regularization Based on understanding the effect of batch norm on trajectory length, we propose a new method of regularization, trajectory regularization, that offers the same advantages as batch norm, and is computationally more efficient.

In prior work (Poole et al., 2016), we studied the propagation of Riemannian curvature through random networks by developing a mean field theory approach. Here, we take an approach grounded in computational geometry, presenting measures with a combinatorial flavor and explore the consequences during and after training.

Measures of Expressivity

Given a neural network of a certain architecture AA (some depth, width, layer types), we have an associated function, FA(x;W)F_{A}(x;W), where xx is an input and WW represents all the parameters of the network. Our goal is to understand how the behavior of FA(x;W)F_{A}(x;W) changes as AA changes, for values of WW that we might encounter during training, and across inputs xx.

The first major difficulty comes from the high dimensionality of the input. Precisely quantifying the properties of FA(x;W)F_{A}(x;W) over the entire input space is intractable. As a tractable alternative, we study simple one dimensional trajectories through input space. More formally:

Simple examples of a trajectory would be a line (x(t)=tx1+(1t)x0x(t)=tx_{1}+(1-t)x_{0}) or a circular arc (x(t)=cos(πt/2)x0+sin(πt/2)x1x(t)=\cos(\pi t/2)x_{0}+\sin(\pi t/2)x_{1}), but in general x(t)x(t) may be more complicated, and potentially not expressible in closed form.

Armed with this notion of trajectories, we can begin to define measures of expressivity of a network FA(x;W)F_{A}(x;W) over trajectories x(t)x(t).

In (Montufar et al., 2014) the notion of a “linear region” is introduced. Given a neural network with piecewise linear activations (such as ReLU or hard tanh), the function it computes is also piecewise linear, a consequence of the fact that composing piecewise linear functions results in a piecewise linear function. So one way to measure the “expressive power” of different architectures AA is to count the number of linear pieces (regions), which determines how nonlinear the function is.

In fact, a change in linear region is caused by a neuron transition in the output layer. More precisely:

Definition For fixed WW, we say a neuron with piecewise linear region transitions between inputs x,x+δx,x+\delta if its activation function switches linear region between xx and x+δx+\delta.

So a ReLU transition would be given by a neuron switching from off to on (or vice versa) and for hard tanh by switching between saturation at 1-1 to its linear middle region to saturation at 11. For any generic trajectory x(t)x(t), we can thus define T(FA(x(t);W))\mathcal{T}(F_{A}(x(t);W)) to be the number of transitions undergone by output neurons (i.e. the number of linear regions) as we sweep the input x(t)x(t). Instead of just concentrating on the output neurons however, we can look at this pattern over the entire network. We call this an activation patten:

Definition We can define AP(FA(x;W))\mathcal{AP}(F_{A}(x;W)) to be the activation pattern – a string of form {0,1}num neurons\{0,1\}^{\text{num neurons}} (for ReLUs) and {1,0,1}num neurons\{-1,0,1\}^{\text{num neurons}} (for hard tanh) of the network encoding the linear region of the activation function of every neuron, for an input xx and weights WW.

Overloading notation slightly, we can also define (similarly to transitions) A(FA(x(t);W))\mathcal{A}(F_{A}(x(t);W)) as the number of distinct activation patterns as we sweep xx along x(t)x(t). As each distinct activation pattern corresponds to a different linear function of the input, this combinatorial measure captures how much more expressive AA is over a simple linear mapping.

Returning to Montufar et al, they provide a construction i.e. a specific set of weights W0W_{0}, that results in an exponential increase of linear regions with the depth of the architectures. They also appeal to Zaslavsky’s theorem (Stanley, 2011) from the theory of hyperplane arrangements to show that a shallow network, i.e. one hidden layer, with the same number of parameters as a deep network, has a much smaller number of linear regions than the number achieved by their choice of weights W0W_{0} for the deep network.

More formally, letting A1A_{1} be a fully connected network with one hidden layer, and AlA_{l} a fully connected network with the same number of parameters, but ll hidden layers, they show

We derive a much more general result by considering the ‘global’ activation patterns over the entire input space, and prove that for any fully connected network, with any number of hidden layers, we can upper bound the number of linear regions it can achieve, over all possible weight settings WW. This upper bound is asymptotically tight, matched by the construction given in (Montufar et al., 2014). Our result can be written formally as:

Next, suppose we have NN neurons in total. Then we want to compare (for wlog ReLUs), quantities like U(n,N/n,m)U(n^{\prime},N/n^{\prime},m) for different nn^{\prime}.

But U(n,N/n,m)=O((N/n)mn)U(n^{\prime},N/n^{\prime},m)=O((N/n^{\prime})^{mn^{\prime}}), and so, noting that the maxima of (ax)mx\left(\frac{a}{x}\right)^{mx} (for a>ea>e) is x=a/ex=a/e, we get, (for n,k>en,k>e), in comparison to (*),

We prove this via an inductive proof on regions in a hyperplane arrangement. The proof can be found in the Appendix. As noted in the introduction, this result differs from earlier lower-bound constructions in that it is an upper bound that applies to all possible sets of weights. Via our analysis, we also prove

This result is of independent interest for optimization – a linear function over a convex polytope results in a well behaved loss function and an easy optimization problem. Understanding the density of these regions during the training process would likely shed light on properties of the loss surface, and improved optimization methods. A picture of a network’s regions is shown in Figure 1.

We empirically tested the growth of the number of activations and transitions as we varied xx along x(t)x(t) to understand their behavior. We found that for bounded non linearities, especially tanh and hard-tanh, not only do we observe exponential growth with depth (as hinted at by the upper bound) but that the scale of parameter initialization also affects the observations (Figure 2). We also experimented with sweeping the weights WW of a layer through a trajectory W(t)W(t), and counting the different labellings output by the network. This ‘dichotomies’ measure is discussed further in the Appendix, and also exhibits the same growth properties, Figure 14.

2 Trajectory Length

In fact, there turns out to be a reason for the exponential growth with depth, and the sensitivity to initialization scale. Returning to our definition of trajectory, we can define an immediately related quantity, trajectory length

Definition: Given a trajectory, x(t)x(t), we define its length, l(x(t))l(x(t)), to be the standard arc length:

Intuitively, the arc length breaks x(t)x(t) up into infinitesimal intervals and sums together the Euclidean length of these intervals.

If we let A(n,k)A_{(n,k)} denote, as before, fully connected networks with nn hidden layers each of width kk, and initializing with weights N(0,σw2/k)\sim\mathcal{N}(0,\sigma_{w}^{2}/k) (accounting for input scaling as typical), and biases N(0,σb2)\sim\mathcal{N}(0,\sigma_{b}^{2}), we find that:

Bound on Growth of Trajectory Length Let FA(x,W)F_{A}(x^{\prime},W) be a ReLU or hard tanh random neural network and x(t)x(t) a one dimensional trajectory with x(t+δ)x(t+\delta) having a non trival perpendicular component to x(t)x(t) for all t,δt,\delta (i.e, not a line). Then defining z(d)(x(t))=z(d)(t)z^{(d)}(x(t))=z^{(d)}(t) to be the image of the trajectory in layer dd of the network, we have

That is, l(x(t)l(x(t) grows exponentially with the depth of the network, but the width only appears as a base (of the exponent). This bound is in fact tight in the limits of large σw\sigma_{w} and kk.

A schematic image depicting this can be seen in Figure 3 and the proof can be found in the Appendix. A rough outline is as follows: we look at the expected growth of the difference between a point z(d)(t)z^{(d)}(t) on the curve and a small perturbation z(d)(t+dt)z^{(d)}(t+dt), from layer dd to layer d+1d+1. Denoting this quantity δz(d)(t)\left|\left|\delta z^{(d)}(t)\right|\right|, we derive a recurrence relating δz(d+1)(t)\left|\left|\delta z^{(d+1)}(t)\right|\right| and δz(d)(t)\left|\left|\delta z^{(d)}(t)\right|\right| which can be composed to give the desired growth rate.

The analysis is complicated by the statistical dependence on the image of the input z(d+1)(t)z^{(d+1)}(t). So we instead form a recursion by looking at the component of the difference perpendicular to the image of the input in that layer, i.e. δz(d+1)(t)\left|\left|\delta z^{(d+1)}_{\perp}(t)\right|\right|, which results in the condition on x(t)x(t) in the statement.

In Figures 4, 12, we see the growth of an input trajectory for ReLU networks on CIFAR-10 and MNIST. The CIFAR-10 network is convolutional but we observe that these layers also result in similar rates of trajectory length increases to the fully connected layers. We also see, as would be expected, that pooling layers act to reduce the trajectory length. We discuss upper bounds in the Appendix.

For the hard tanh case (and more generally any bounded non-linearity), we can formally prove the relation of trajectory length and transitions under an assumption: assume that while we sweep x(t)x(t) all neurons are saturated unless transitioning saturation endpoints, which happens very rapidly. (This is the case for e.g. large initialization scales). Then we have:

Transitions proportional to trajectory length Let FAn,kF_{A_{n,k}} be a hard tanh network with nn hidden layers each of width kk. And let

Then T(FAn,k(x(t);W)=O(g(k,σw,σb,n))\mathcal{T}(F_{A_{n,k}}(x(t);W)=O(g(k,\sigma_{w},\sigma_{b},n)) for WW initialized with weight and bias scales σw,σb\sigma_{w},\sigma_{b}.

Note that the expression for g(k,σw,σb,n)g(k,\sigma_{w},\sigma_{b},n) is exactly the expression given by Theorem 3 when σw\sigma_{w} is very large and dominates σb\sigma_{b}. We can also verify this experimentally in settings where the simpilfying assumption does not hold, as in Figure 5.

Insights from Network Expressivity

Here we explore the insights gained from applying our measurements of expressivity, particularly trajectory length, to understand network performance. We examine the connection of expressivity and stability, and inspired by this, propose a new method of regularization, trajectory regularization that offers the same advantages as the more computationally intensive batch normalization.

The analysis of network expressivity offers interesting takeaways related to the parameter and functional stability of a network. From the proof of Theorem 3, we saw that a perturbation to the input would grow exponentially in the depth of the network. It is easy to see that this analysis is not limited to the input layer, but can be applied to any layer. In this form, it would say

A perturbation at a layer grows exponentially in the remaining depth after that layer.

This means that perturbations to weights in lower layers should be more costly than perturbations in the upper layers, due to exponentially increasing magnitude of noise, and result in a much larger drop of accuracy. Figure 6, in which we train a conv network on CIFAR-10 and add noise of varying magnitudes to exactly one layer, shows exactly this.

We also find that the converse (in some sense) holds: after initializing a network, we trained a single layer at different depths in the network and found monotonically increasing performance as layers lower in the network were trained. This is shown in Figure 7 and Figure 17 in the Appendix.

2 Trajectory Length and Regularization: The Effect of Batch Normalization

Expressivity measures, especially trajectory length, can also be used to better understand the effect of regularization. One regularization technique that has been extremely successful for training neural networks is Batch Normalization (Ioffe and Szegedy, 2015).

By taking measures of trajectories during training we find that without batch norm, trajectory length tends to increase during training, as shown in Figures 8 and Figure 18 in the Appendix. In these experiments, two networks were initialized with σw2=2\sigma_{w}^{2}=2 and trained to high test accuracy on CIFAR10 and MNIST. We see that in both cases, trajectory length increases as training progresses.

A surprising observation is σw2=2\sigma_{w}^{2}=2 is not in the exponential growth increase regime at initialization for the CIFAR10 architecture (Figure 8 at Step .). But note that even with a smaller weight initialization, weight norms increase during training, shown in Figure 9, pushing typically initialized networks into the exponential growth regime.

While the initial growth of trajectory length enables greater functional expressivity, large trajectory growth in the learnt representation results in an unstable representation, witnessed in Figure 6. In Figure 10 we train another conv net on CIFAR10, but this time with batch normalization. We see that the batch norm layers reduce trajectory length, helping stability.

3 Trajectory Regularization

Motivated by the fact that batch normalization decreases trajectory length and hence helps stability and generalization, we consider directly regularizing on trajectory length: we replace every batch norm layer used in the conv net in Figure 10 with a trajectory regularization layer. This layer adds to the loss λ(current length/orig length)\lambda(\text{current length}/\text{orig length}), and then scales the outgoing activations by λ\lambda, where λ\lambda is a parameter to be learnt. In implementation, we typically scale the additional loss above with a constant (0.010.01) to reduce magnitude in comparison to classification loss. Our results, Figure 11 show that both trajectory regularization and batch norm perform comparably, and considerably better than not using batch norm. One advantage of using Trajectory Regularization is that we don’t require different computations to be performed for train and test, enabling more efficient implementation.

Discussion

Characterizing the expressiveness of neural networks, and understanding how expressiveness varies with parameters of the architecture, has been a challenging problem due to the difficulty in identifying meaningful notions of expressivity and in linking their analysis to implications for these networks in practice. In this paper we have presented an interrelated set of expressivity measures; we have shown tight exponential bounds on the growth of these measures in the depth of the networks, and we have offered a unifying view of the analysis through the notion of trajectory length. Our analysis of trajectories provides insights for the performance of trained networks as well, suggesting that networks in practice may be more sensitive to small perturbations in weights at lower layers. We also used this to explore the empirical success of batch norm, and developed a new regularization method – trajectory regularization.

This work raises many interesting directions for future work. At a general level, continuing the theme of ‘principled deep understanding’, it would be interesting to link measures of expressivity to other properties of neural network performance. There is also a natural connection between adversarial examples, (Goodfellow et al., 2014), and trajectory length: adversarial perturbations are only a small distance away in input space, but result in a large change in classification (the output layer). Understanding how trajectories between the original input and an adversarial perturbation grow might provide insights into this phenomenon. Another direction, partially explored in this paper, is regularizing based on trajectory length. A very simple version of this was presented, but further performance gains might be achieved through more sophisticated use of this method.

Acknowledgements

We thank Samy Bengio, Ian Goodfellow, Laurent Dinh, and Quoc Le for extremely helpful discussion.

References

Appendix

Here we include the full proofs from sections in the paper.

We show inductively that FA(x;W)F_{A}(x;W) partitions the input space into convex polytopes via hyperplanes. Consider the image of the input space under the first hidden layer. Each neuron vi(1)v^{(1)}_{i} defines hyperplane(s) on the input space: letting Wi(0)W^{(0)}_{i} be the iith row of W(0)W^{(0)}, bi(0)b^{(0)}_{i} the bias, we have the hyperplane Wi(0)x+bi=0W^{(0)}_{i}x+b_{i}=0 for a ReLU and hyperplanes Wi(0)x+bi=±1W^{(0)}_{i}x+b_{i}=\pm 1 for a hard-tanh. Considering all such hyperplanes over neurons in the first layer, we get a hyperplane arrangement in the input space, each polytope corresponding to a specific activation pattern in the first hidden layer.

Now, assume we have partitioned our input space into convex polytopes with hyperplanes from layers d1\leq d-1. Consider vi(d)v^{(d)}_{i} and a specific polytope RiR_{i}. Then the activation pattern on layers d1\leq d-1 is constant on RiR_{i}, and so the input to vi(d)v^{(d)}_{i} on RiR_{i} is a linear function of the inputs jλjxj+b\sum_{j}\lambda_{j}x_{j}+b and some constant term, comprising of the bias and the output of saturated units. Setting this expression to zero (for ReLUs) or to ±1\pm 1 (for hard-tanh) again gives a hyperplane equation, but this time, the equation is only valid in RiR_{i} (as we get a different linear function of the inputs in a different region.) So the defined hyperplane(s) either partition RiR_{i} (if they intersect RiR_{i}) or the output pattern of vi(d)v^{(d)}_{i} is also constant on RiR_{i}. The theorem then follows. ∎

This implies that any one dimensional trajectory x(t)x(t), that does not ‘double back’ on itself (i.e. reenter a polytope it has previously passed through), will not repeat activation patterns. In particular, after seeing a transition (crossing a hyperplane to a different region in input space) we will never return to the region we left. A simple example of such a trajectory is a straight line:

Let the hyperplane arrangement be denoted H\mathcal{H}, and let HHH\in\mathcal{H} be one specific hyperplane. Then the number of regions in H\mathcal{H} is precisely the number of regions in HH\mathcal{H}-H plus the number of regions in HH\mathcal{H}\cap H. (This follows from the fact that HH subdivides into two regions exactly all of the regions in HH\mathcal{H}\cap H, and does not affect any of the other regions.)

In particular, we have the recursive formula

We now induct on k+mk+m to assert the claim. The base cases of r(1,0)=r(0,1)=1r(1,0)=r(0,1)=1 are trivial, and assuming the claim for k+m1\leq k+m-1 as the induction hypothesis, we have

where the last equality follows by the well known identity

With this result, we can easily prove Theorem 1 as follows:

First consider the ReLU case. Each neuron has one hyperplane associated with it, and so by Theorem 5, the first hidden layer divides up the inputs space into r(k,m)r(k,m) regions, with r(k,m)O(km)r(k,m)\leq O(k^{m}).

Now consider the second hidden layer. For every region in the first hidden layer, there is a different activation pattern in the first layer, and so (as described in the proof of Theorem 2) a different hyperplane arrangement of kk hyperplanes in an mm dimensional space, contributing at most r(k,m)r(k,m) regions.

In particular, the total number of regions in input space as a result of the first and second hidden layers is r(k,m)r(k,m)O(k2m)\leq r(k,m)*r(k,m)\leq O(k^{2}m). Continuing in this way for each of the nn hidden layers gives the O(kmn)O(k^{m}n) bound.

A very similar method works for hard tanh, but here each neuron produces two hyperplanes, resulting in a bound of O((2k)mn)O((2k)^{mn}).

Appendix B Proofs and additional results from Section 2.2

B.1 Notation and Preliminary Results

Difference of points on trajectory Given x(t)=x,x(t+dt)=x+δxx(t)=x,x(t+dt)=x+\delta x in the trajectory, let δz(d)=z(d)(x+δx)z(d)(x)\delta z^{(d)}=z^{(d)}(x+\delta x)-z^{(d)}(x)

This notation can also be used with a matrix WW, see Lemma 1.

Before stating and proving the main theorem, we need a few preliminary results.

Given W,xW,x as before, and considering WW_{\parallel}, WW_{\perp} with respect to xx (wlog a unit vector) we can express them directly in terms of WW as follows: Letting W(i)W^{(i)} be the iith row of WW, we have

i.e. the projection of each row in the direction of xx. And of course

The motivation to consider such a decomposition of WW is for the resulting independence between different components, as shown in the following lemma.

Independence of Projections Let xx be a given vector (wlog of unit norm.) If WW is a random matrix with WijN(0,σ2)W_{ij}\sim\mathcal{N}(0,\sigma^{2}), then WW_{\parallel} and WW_{\perp} with respect to xx are independent random variables.

We use the rotational invariance of random Gaussian matrices, i.e. if WW is a Gaussian matrix, iid entries N(0,σ2)\mathcal{N}(0,\sigma^{2}), and RR is a rotation, then RWRW is also iid Gaussian, entries N(0,σ2)\mathcal{N}(0,\sigma^{2}). (This follows easily from affine transformation rules for multivariate Gaussians.)

In the following two lemmas, we use the rotational invariance of Gaussians as well as the chi distribution to prove results about the expected norm of a random Gaussian vector.

We will find it useful to bound ratios of the Gamma function (as appear in Lemma 3) and so introduce the following inequality, from [Kershaw, 1983] that provides an extension of Gautschi’s Inequality.

An Extension of Gautschi’s Inequality For 0<s<10<s<1, we have

Norm of Projections Let WW be a kk by kk random Gaussian matrix with iid entries N(0,σ2)\sim\mathcal{N}(0,\sigma^{2}), and x,yx,y two given vectors. Partition WW into components as in Lemma 1 and let xx_{\perp} be a nonzero vector perpendicular to xx. Then

If 1A{1}_{\mathcal{A}} is an identity matrix with non-zeros diagonal entry ii iff iA[k]i\in\mathcal{A}\subset[k], and A>2|A|>2, then

First note we can view 1AW=1AW{1}_{\mathcal{A}}{}^{\perp}W_{\perp}={}^{\perp}{1}_{\mathcal{A}}W_{\perp}. (Projecting down to a random (as WW is random) subspace of fixed size A=m|\mathcal{A}|=m and then making perpendicular commutes with making perpendicular and then projecting everything down to the subspace.)

Norm and Translation Let XX be a centered multivariate Gaussian, with diagonal covariance matrix, and μ\mu a constant vector.

The inequality can be seen intuitively geometrically: as XX has diagonal covariance matrix, the contours of the pdf of X\left|\left|X\right|\right| are circular centered at , decreasing radially. However, the contours of the pdf of Xμ\left|\left|X-\mu\right|\right| are shifted to be centered around μ\left|\left|\mu\right|\right|, and so shifting back μ\mu to reduces the norm.

A more formal proof can be seen as follows: let the pdf of XX be fX()f_{X}(\cdot). Then we wish to show

Now we can pair points x,xx,-x, using the fact that fX(x)=fX(x)f_{X}(x)=f_{X}(-x) and the triangle inequality on the integrand to get

B.2 Proof of Theorem

We use vi(d)v^{(d)}_{i} to denote the ithi^{th} neuron in hidden layer dd. We also let x=z(0)x=z^{(0)} be an input, h(d)h^{(d)} be the hidden representation at layer dd, and ϕ\phi the non-linearity. The weights and bias are called W(d)W^{(d)} and b(d)b^{(d)} respectively. So we have the relations

We first prove the zero bias case. To do so, it is sufficient to prove that

as integrating over tt gives us the statement of the theorem.

For ease of notation, we will suppress the tt in z(d)(t)z^{(d)}(t).

where the division is done with respect to z(d)z^{(d)}. Note that this means h(d+1)=W(d)z(d)h^{(d+1)}=W^{(d)}_{\parallel}z^{(d)} as the other component annihilates (maps to ) z(d)z^{(d)}.

We can also define AW(d)={i:i[k],hi(d+1)<1}\mathcal{A}_{W^{(d)}_{\parallel}}=\{i:i\in[k],|h^{(d+1)}_{i}|<1\} i.e. the set of indices for which the hidden representation is not saturated. Letting WiW_{i} denote the iith row of matrix WW, we now claim that:

Indeed, by Lemma 2 we first split the expectation over W(d)W^{(d)} into a tower of expectations over the two independent parts of WW to get

But conditioning on W(d)W^{(d)}_{\parallel} in the inner expectation gives us h(d+1)h^{(d+1)} and AW(d)\mathcal{A}_{W^{(d)}_{\parallel}}, allowing us to replace the norm over ϕ(W(d)δz(d))\phi(W^{(d)}\delta z^{(d)}) with the sum in the term on the right hand side of the claim.

Till now, we have mostly focused on partitioning the matrix W(d)W^{(d)}. But we can also set δz(d)=δz(d)+δz(d)\delta z^{(d)}=\delta z^{(d)}_{\parallel}+\delta z^{(d)}_{\perp} where the perpendicular and parallel are with respect to z(d)z^{(d)}. In fact, to get the expression in (**), we derive a recurrence as below:

(where the indicator in the right hand side zeros out coordinates not in the active set.)

where the ^\hat{\cdot} indicates a unit vector.

Now note that for any index iAi\in\mathcal{A}, the right hand sides of (1) and (2) are identical, and so the vectors on the left hand side agree for all iAi\in\mathcal{A}. In particular,

Now the claim follows easily by noting that δz(d+1)δz(d+1)1A\left|\left|\delta z_{\perp}^{(d+1)}\right|\right|\geq\left|\left|\delta z^{(d+1)}_{\perp}\cdot{1}_{\mathcal{A}}\right|\right|.

Returning to (*), we split δz(d)=δz(d)+δz(d)\delta z^{(d)}=\delta z_{\perp}^{(d)}+\delta z_{\parallel}^{(d)}, W(d)=W(d)+W(d)W_{\perp}^{(d)}={}^{\parallel}W_{\perp}^{(d)}+{}^{\perp}W_{\perp}^{(d)} (and W(d)W_{\parallel}^{(d)} analogously), and after some cancellation, we have

We would like a recurrence in terms of only perpendicular components however, so we first drop the W(d),W(d){}^{\parallel}W_{\perp}^{(d)},{}^{\parallel}W_{\parallel}^{(d)} (which can be done without decreasing the norm as they are perpendicular to the remaining terms) and using the above claim, have

But in the inner expectation, the term W(d)δz(d){}^{\perp}W_{\parallel}^{(d)}\delta z_{\parallel}^{(d)} is just a constant, as we are conditioning on W(d)W_{\parallel}^{(d)}. So using Lemma 5 we have

We use the fact that we have the probability mass function for an (k,p)(k,p) binomial random variable to bound the j\sqrt{j} term:

But by using Jensen’s inequality with 1/x1/\sqrt{x}, we get

where the last equality follows by recognising the expectation of a binomial(k1,p)(k-1,p) random variable. So putting together, we get

From here, we must analyse the hard tanh and ReLU cases separately. First considering the hard tanh case:

To lower bound pp, we first note that as hi(d+1)h^{(d+1)}_{i} is a normal random variable with variance σ2\leq\sigma^{2}, if AN(0,σ2)A\sim\mathcal{N}(0,\sigma^{2})

where the last inequality holds for σ1\sigma\geq 1 and follows by Taylor expanding ex2/2e^{-x^{2}/2} around . Similarly, we can also show that p1σp\leq\frac{1}{\sigma}.

with the constant cc being the ratio of δx(t)\left|\left|\delta x(t)_{\perp}\right|\right| to δx(t)\left|\left|\delta x(t)\right|\right|. So if our trajectory direction is almost orthogonal to x(t)x(t) (which will be the case for e.g. random circular arcs, cc can be seen to be 1\approx 1 by splitting into components as in Lemma 1, and using Lemmas 3, 4.)

The ReLU case (with no bias) is even easier. Noting that for random weights, p=1/2p=1/2, and plugging in to equation (a), we get

But the expression on the right hand side has exactly the asymptotic form O(σk/k+1)O(\sigma\sqrt{k}/\sqrt{k+1}), and we finish as in (c).

In fact, we can easily extend the above result to the case of non-zero bias. The insight is to note that because δz(d+1)\delta z^{(d+1)} involves taking a difference between z(d+1)(t+dt)z^{(d+1)}(t+dt) and z(d+1)(t)z^{(d+1)}(t), the bias term does not enter at all into the expression for δz(d+1)\delta z^{(d+1)}. So the computations above hold, and equation (a) becomes

For ReLUs, we require hi(d+1)=wi(d+1)zi(d)+bi(d+1)>0h_{i}^{(d+1)}=w_{i}^{(d+1)}z_{i}^{(d)}+b_{i}^{(d+1)}>0 where the bias and weight are drawn from N(0,σb2)\mathcal{N}(0,\sigma^{2}_{b}) and N(0,σw2)\mathcal{N}(0,\sigma^{2}_{w}) respectively. But with p1/4p\geq 1/4, this holds as the signs for w,bw,b are purely random. Substituting in and working through results in the same asymptotic behavior as without bias.

For hard tanh, not that as hi(d+1)h_{i}^{(d+1)} is a normal random variable with variance σw2+σb2\leq\sigma^{2}_{w}+\sigma^{2}_{b} (as equation (b) becomes

Replace hard-tanh with a linear coordinate-wise identity map, hi(d+1)=(W(d)z(d))i+bih^{(d+1)}_{i}=(W^{(d)}z^{(d)})_{i}+b_{i}. This provides an upper bound on the norm. We also then recover a chi distribution with kk terms, each with standard deviation σwk12\frac{\sigma_{w}}{k^{\frac{1}{2}}},

where the second step follows from [Laforgia and Natalini, 2013], and holds for k>1k>1.

Finally, using the identity arctan(x)+arctan(1/x)\arctan(x)+\arctan(1/x) and the Laurent series for arctan(1/x)\arctan(1/x), we can evaluate the right hand side to be O(1/k)O(1/\sqrt{k}). In particular

This means that in expectation, any neuron in layer dd will be sensitive to the transitions of k\sqrt{k} neurons in the layer below. Using this, and the fact the while vi(d1)v^{(d-1)}_{i} might flip very quickly from say 1-1 to 11, the gradation in the transition ensures that neurons in layer dd sensitive to vi(d1)v^{(d-1)}_{i} will transition at distinct times, we get the desired growth rate in expectation as follows:

We replace k\sqrt{k} with k(1+σb2/σw2)\sqrt{k(1+\sigma_{b}^{2}/\sigma_{w}^{2})}, by noting that YN(0,σw2+σb2)Y\sim\mathcal{N}(0,\sigma_{w}^{2}+\sigma_{b}^{2}). This results in a growth rate of form O(k/1+σb2σw2)O(\sqrt{k}/\sqrt{1+\frac{\sigma_{b}^{2}}{\sigma_{w}^{2}}}). ∎

B.3 Dichotomies: a natural dual

Our measures of expressivity have mostly concentrated on sweeping the input along a trajectory x(t)x(t) and taking measures of FA(x(t);W)F_{A}(x(t);W). Instead, we can also sweep the weights WW along a trajectory W(t)W(t), and look at the consequences (e.g. binary labels – i.e. dichotomies), say for a fixed set of inputs x1,...,xsx_{1},...,x_{s}.

In fact, after random initialization, sweeping the first layer weights is statistically very similar to sweeping the input along a trajectory x(t)x(t). In particular, letting WW^{\prime} denote the first layer weights, for a particular input x0x_{0}, x0Wx_{0}W^{\prime} is a vector, each coordinate is iid, N(0,x02σw2)\sim\mathcal{N}(0,||x_{0}||^{2}\sigma_{w}^{2}). Extending this observation, we see that (providing norms are chosen appropriately), x0Wcos(t)+x1Wsin(t)x_{0}W^{\prime}\cos(t)+x_{1}W^{\prime}\sin(t) (fixed x0,x1,Wx_{0},x_{1},W) has the same distribution as x0W0cos(t)+x0W1sin(t)x_{0}W^{\prime}_{0}\cos(t)+x_{0}W^{\prime}_{1}\sin(t) (fixed x0,W0,W1x_{0},W^{\prime}_{0},W^{\prime}_{1}).

So we expect that there will be similarities between results for sweeping weights and for sweeping input trajectories, which we explore through some synthetic experiments, primarily for hard tanh, in Figures 15, 16. We find that the proportionality of transitions to trajectory length extends to dichotomies, as do results on the expressive power afforded by remaining depth.

For non-random inputs and non-random functions, this is a well known question upper bounded by the Sauer-Shelah lemma [Sauer, 1972]. We discuss this further in Appendix LABEL:sec_VC. In the random setting, the statistical duality of weight sweeping and input sweeping suggests a direct proportion to transitions and trajectory length for a fixed input. Furthermore, if the xiSx_{i}\in S are sufficiently uncorrelated (e.g. random) class label transitions should occur independently for each xix_{i} Indeed, we show this in Figure 14.

Appendix C Addtional Experiments from Section 3

Here we include additional experiments from Section 3