Deep ReLU Networks Have Surprisingly Few Activation Patterns

Boris Hanin, David Rolnick

Introduction

A fundamental question in the theory of deep learning is why deeper networks often work better in practice than shallow ones. One proposed explanation is that, while even shallow neural networks are universal approximators , there are functions for which increased depth allows exponentially more efficient representations. This phenomenon has been quantified for various complexity measures . However, authors such as Ba and Caruana have called into question this point of view , observing that shallow networks can often be trained to imitate deep networks and thus that functions learned in practice by deep networks may not achieve the full expressive power of depth.

We aim to count the number of such activation regions. This number has been the subject of previous work (see §1.1), with the majority concerning large lower bounds on the maximum over all θ\theta of the number of regions for a given network architecture. In contrast, we are interested in the typical behavior of ReLU nets as they are used in practice. We therefore focus on small upper bounds for the average number of activation regions present for a typical value of θ\theta. Our main contributions are:

We give precise definitions and prove several fundamental properties of both linear and activation regions, two concepts that are often conflated in the literature (see §2).

This bound holds in particular for deep ReLU nets at initialization, and is in sharp contrast to the maximum possible number of activation patterns, which is exponential in depth .

Theorem 5 also strongly suggests that the bounds on number of activation regions continue to hold approximately throughout training. We empirically verify that this behavior holds, even for networks trained on memorization-based tasks (see §4 and Figures 3-6).

Our results show both theoretically and empirically not only that the number of activation patterns can be small for a ReLU network but that it typically is small both at initialization and throughout training, effectively capped far below its theoretical maximum even for tasks where a higher number of regions would be advantageous (see §4). We provide in §3.2-3.3 two intuitive explanations for this phenomenon. The essence of both is that many activation patterns can be created only when a typical neuron zz in N\mathcal{N} turns on/off repeatedly, forcing the value of its pre-activation z(x)z(x) to cross the level of its bias bzb_{z} many times. This requires (i) significant overlap between the range of z(x)z(x) on the different activation regions of xz(x)x\mapsto z(x) and (ii) the bias bzb_{z} to be picked within this overlap. Intuitively, (i) and (ii) require either large or highly coordinated gradients. In the former case, z(x)z(x) oscillates over a large range of outputs and bzb_{z} can be random, while in the latter z(x)z(x) may oscillate only over a small range of outputs and bzb_{z} is carefully chosen. Neither is likely to happen with a proper initialization. Moreover, both appear to be difficult to learn with gradient-based optimization.

The rest of this article is structured as follows. Section 2 gives formal definitions and some important properties of both activation regions and the closely related notion of linear regions (see Definitions 1 and 2).Section 3 contains our main technical result, Theorem 5, stated in §3.1. Sections 3.2 and 3.3 provide heuristics for understanding Theorem 5 and its implications. Finally, §4 is devoted to experiments that push the limits of how many activation regions a ReLU network can learn in practice.

We consider the typical number of activation regions in ReLU nets. In contrast, interesting bounds on the maximum number of regions are given in . Such worst case bounds are used to control the VC-dimension of ReLU networks in but differ dramatically from our average case analysis.

Our main theoretical result, Theorem 5, is related to , which conjectured that our Theorem 5 should hold and proved bounds for other notions of average complexity of activation regions. Theorem 5 is also related in spirit to , which uses a mean field analysis of wide ReLU nets to show that they are biased towards simple functions. Our empirical work (e.g. §4) is related both to the experiments of and to those of . The last two observe that neural networks are capable of fitting noisy or completely random data. Theorem 5 and experiments in §4 give a counterpoint, suggesting limitations on the complexity of random functions that ReLU nets can fit in practice (see Figures 4-6).

How to Think about Activation Regions

Before stating our main results on counting activation regions in §3, we provide a formal definition and contrast them with linear regions in §2.1. We also note in §2.1 some simple properties of activation regions that are useful both for understanding how they are built up layer by layer in a deep ReLU net and for visualizing them. Then, in §2.2, we explain the relationship between activation regions and arrangements of bent hyperplanes (see Lemma 4).

Our main objects of study in this article are activation regions, which we now define.

Fix θ\theta, a vector of trainable parameters in N\mathcal{N}, and an activation pattern A.\mathcal{A}. The activation region corresponding to A,θ\mathcal{A},\theta is

where neuron zz has pre-activation z(x;θ)z(x;\theta), bias bz,b_{z}, and post-activation max{0,z(x;θ)bz}.\max\{0,z(x;\theta)-b_{z}\}. We say the activation regions of N\mathcal{N} at θ\theta are the non-empty activation regions R(A,θ)\mathcal{R}(\mathcal{A},\theta).

When specifying an activation pattern A\mathcal{A}, the sign aza_{z} assigned to a neuron zz determines whether zz is on or off for inputs in the activation region R(A;θ)\mathcal{R}(\mathcal{A};\theta) since the pre-activation z(x;θ)bzz(x;\theta)-b_{z} of neuron zz is positive (resp. negative) when az=1a_{z}=1 (resp. az=1a_{z}=-1). Perhaps the most fundamental property of activation regions is their convexity.

Let N\mathcal{N} be a ReLU net. Then for every activation pattern A\mathcal{A} and any vector θ\theta of trainable parameters for N\mathcal{N} each activation region R(A;θ)\mathcal{R}(\mathcal{A};\theta) is convex.

We note that Lemma 1 has been observed before (e.g. Theorem 2 in ), but in much of the literature the difference between linear regions (defined below), which are not necessarily convex, and activation regions, which are, is ignored. It turns out that Lemma 1 holds for any piecewise linear activation, such as leaky ReLU and hard hyperbolic tangent/sigmoid. This fact seems to be less well-known (see Appendix B.1 for a proof). To provide a useful alternative description of activation regions for a ReLU net N,\mathcal{N}, a fixed vector θ\theta of trainable parameters and neuron zz of N\mathcal{N}, define

For any ReLU net N\mathcal{N} and any vector θ\theta of trainable parameters

We prove Lemma 2 in Appendix B.2. We may compare activation regions with linear regions, which are the regions of input space on which the network defines different linear functions.

The linear regions of N\mathcal{N} at θ\theta are the connected components of input space with BN\mathcal{B}_{\mathcal{N}} removed:

Linear regions have often been conflated with activation regions, but in some cases they are different. This can, for example, happen when an entire layer of the network is zeroed out by ReLUs, leading many distinct activation regions to coalesce into a single linear region. It can also occur for non-generic configurations of weights and biases where the linear functions computed in two neighboring activation regions happen to coincide. However, the number of activation regions is always at least as large as the number of linear regions.

Let N\mathcal{N} be a ReLU net. For any parameter vector θ\theta for N,\mathcal{N}, the number of linear regions in N\mathcal{N} at θ\theta is always bounded above by the number of activation regions in N\mathcal{N} at θ.\theta. In fact, the closure of every linear region is the closure of the union of some number of activation regions.

Lemma 3 is proved in Appendix B.3. We prove moreover in Appendix B.4 that generically, the gradient of N\nabla\mathcal{N} is different in the interior of most activation regions and hence that most activation regions lie in different linear regions. In particular, this means that the number of linear regions is generically very similar to the number of activation regions.

2 Activation Regions and Hyperplane Arrangements

Lemma 4, which follows immediately from the proof of Lemma 7 in Appendix B.1, ensures that in a small ball near any point that does not belong to zHz(θ)\bigcup_{z}H_{z}(\theta), the collection of bent hyperplanes Hz(θ)H_{z}(\theta) look like an ordinary hyperplane arrangement. Globally, however, Hz(θ)H_{z}(\theta) can define many more regions than ordinary hyperplane arrangements. This reflects the fact that deep ReLU nets may have many more activation regions than shallow networks with the same number of neurons.

Despite their different extremal behaviors, we show in Theorem 5 that the average number of activation regions in a random ReLU net enjoys depth-independent upper bounds at initialization. We show experimentally that this holds throughout training as well (see §4). On the other hand, although we do not prove this here, we believe that the effect of depth can be seen through the fluctuations (e.g. the variance), rather than the mean, of the number of activation regions. For instance, for depth 11 ReLU nets, the variance is since for a generic configuration of weights/biases, the number of activation regions is constant (see (4)). The variance is strictly positive, however, for deeper networks.

Main Result

Theorem 5 gives upper bounds on the average number of activation regions per unit volume of input space for a feed-forward ReLU net with random weights/biases. Note that it applies even to highly correlated weight/bias distributions and hence holds throughout training. Also note that although we require no tied weights, there are no further constraints on the connectivity between adjacent layers.

The vector of weights is a continuous random variable (i.e. has a density).

The conditional distribution of any collection of biases given all weights and other biases is a continuous random variable (for identically zero biases, see Appendix D).

Here, the average is with respect to the distribution of weights and biases in N.\mathcal{N}.

We state and prove a generalization of Theorem 5 in Appendix C.

Below we present two kinds of intuition motivating the Theorem. First, in §3.2 we derive the upper bound (5) via an intuitive geometric argument. Then in §3.3, we explain why, at initialization, we expect the upper bounds (5) to have matching, depth-independent, lower bounds (to leading order in the number of neurons). This suggests that the average total number of activation regions at initialization should be the same for any two ReLU nets with the same number of neurons (see (4) and Figure 3).

2 Geometric Intuition

Geometrically, the number of solutions to z(x)=bzz(x)=b_{z} for inputs xIx\in I is the number of times the horizontal line y=bzy=b_{z} intersects the graph y=z(x)y=z(x) over xI.x\in I. A large number of intersections at a given bias bzb_{z} may only occur if the graph of z(x)z(x) has many oscillations around that level. Hence, since bzb_{z} is random, the graph of z(x)z(x) must oscillate many times over a large range on the yy axis. This can happen only if the total variation xIz(x)\int_{x\in I}\left|z^{\prime}(x)\right| of z(x)z(x) over II is large. Thus, if z(x)\left|z^{\prime}(x)\right| is typically of moderate size, we expect only O(1)O(1) solutions to z(x)=bzz(x)=b_{z} per unit input length, suggesting

3 Is Theorem 5 Sharp?

Maximizing the Number of Activation Regions

While we have seen in Figure 3 that the number of regions does not strongly increase during training on a simple task, such experiments leave open the possibility that the number of regions would go up markedly if the task were more complicated. Will the number of regions grow to achieve the theoretical upper bound (exponential in the depth) if the task is designed so that having more regions is advantageous? We now investigate this possibility. See Appendix A for experimental details.

Memorization tasks on large datasets require learning highly oscillatory functions with large numbers of activation regions. Inspired by the work of Arpit et. al. in , we train on several tasks interpolating between memorization and generalization (see Figure 4) in a certain fraction of MNIST labels have been randomized. We find that the maximum number of activation regions learned does increase with the amount of noise to be memorized, but only slightly. In no case does the number of activation regions change by more than a small constant factor from its initial value. Next, we train a network to memorize binary labels for random 2D points (see Figure 5). Again, the number of activation regions after training increases slightly with increasing memorization, until the task becomes too hard for the network and training fails altogether. Varying the learning rate yields similar results (see Figure 6(a)), suggesting the small increase in activation regions is probably not a result of hyperparameter choice.

2 The Effect of Initialization

Let N\mathcal{N} be a deep ReLU network, and for c>0c>0 let Ncbias\mathcal{N}^{\text{bias}}_{c} be the network obtained by multiplying all biases in N\mathcal{N} by cc. Then, N(x)=Ncbias(cx)/c\mathcal{N}(x)=\mathcal{N}^{\text{bias}}_{c}(cx)/c. Rescaling all biases by the same constant therefore does not change the total number of activation regions.

In the extreme case of biases initialized to zero, Theorem 5 does not apply. However, as we explain in Appendix D, zero biases only create fewer activation regions (see Figure 7). We now consider changing the scale of weights at initialization. In , it was suggested that initializing the weights of a network with greater variance should increase the number of activation regions. Likewise, the upper bound in Theorem 5 on the density of activation regions increases as gradient norms increase, and it has been shown that increased weight variance increases gradient norms . However, this is again a property of the local, rather than global, number of regions.

Conclusion

We have presented theoretical and empirical evidence that the number of activation regions learned in practice by a ReLU network is far from the maximum possible and depends mainly on the number of neurons in the network, rather than its depth. This surprising result implies that, at least when network gradients and biases are well-behaved (see conditions 3,4 in the statement of Theorem 5), the partition of input space learned by a deep ReLU network is not significantly more complex than that of a shallow network with the same number of neurons. We found that this is true even after training on memorization-based tasks, in which we expect a large number of regions to be advantageous for fitting many randomly labeled inputs. Our results are stated for ReLU nets with no tied weights and biases (and arbitrary connectivity). We believe that analogous results and proofs hold for residual and convolutional networks but have not verified the technical details. While our results do not directly influence architecture selection for deep ReLU nets, they present the practical takeaway that the utility of depth may lie more in its effect on optimization than on expressivity.

References

Appendix A Experimental Design

We run several experiments that involve calculating the activation regions intersecting a 1D or 2D subset of input space. In order to compute these regions, we add neurons of the network one by one from the first to last hidden layer, observing how each neuron cuts existing regions. Determining whether a region is cut by a neuron involves identifying whether the corresponding linear function on that region has zeros within the region. This can be solved easily by identifying whether all vertices of the region have the same sign (region is not cut) or some two of the vertices have different signs (region is cut). Thus, our procedure is to maintain a list of regions and the linear functions defined on them, then for each newly added neuron identify on which regions its preactivation vanishes and replace these regions by the resulting split regions. Note that this procedure also works in three dimensions and higher, but becomes slower as the number of regions in higher dimensions grows like the number of neurons to the dimension, as shown in Theorem 5.

Appendix B Proofs of Various Lemmas

Fix θ\theta, a vector of trainable parameters in N\mathcal{N}, and an activation pattern A.\mathcal{A}. The activation region corresponding to A,θ\mathcal{A},\theta is

where the pre-activation of a neuron zz is z(x;θ)z(x;\theta), its bias is bz,b_{z}, and its post-activation is therefore φ(z(x;θ)bz).\varphi(z(x;\theta)-b_{z}). Finally, the activation regions of N\mathcal{N} at θ\theta is the collection of all non-empty activation regions R(A,θ)\mathcal{R}(\mathcal{A},\theta).

We will prove the following generalization of Lemma 8.

Let N\mathcal{N} be a network with non-linearity φ\varphi, and let

be any activation pattern. Then, for any vector θ\theta of trainable parameters for N,\mathcal{N}, the region R(A;θ)\mathcal{R}(\mathcal{A};\theta) is convex.

Write dd for the depth of N\mathcal{N} and note that, by definition,

is the intersection of two convex sets and is therefore convex. Taking the intersection over all zz in layer d+1d+1 completes the proof. ∎

B.2 Proof of Lemma 2

and, by Lemma 1, R(A;θ)\mathcal{R}\left(\mathcal{A};\theta\right) is convex and hence connected. Therefore it is equal to the connected component we started with. \square

B.3 Proof of Lemma 3

Let N\mathcal{N} be a ReLU net, and fix a vector θ\theta of its trainable parameters. Let us first check that

We will use the following simple fact: if X,YX,Y are subsets of a topological space TT, then XYX\subset Y implies that every connected component of XX is the subset of some connected component of YY. Indeed, if XαX_{\alpha} is a connected component of XX, then it is a connected subjset of YY and hence is contained in a unique connected component of Y.Y.

This fact shows that the cardinality of the set of connected components of YY is bounded above by the cardinality of the set of connected components for X.X. Using Lemma 2, the inequality (6) therefore reduces to showing that

Fix any xx in the complement of the right hand side. By definition, we have

then N\mathcal{N} restricted to UU is given by a single linear function and hence has a continuous gradient N(;θ)\nabla\mathcal{N}\left(\cdot\,;\theta\right) on U.U.

B.4 Distinguishability of Activation Regions

In addition to being convex, activation regions for a ReLU net N\mathcal{N} generically correspond to different linear functions:

Fix an activation pattern A\mathcal{A} for a depth dd ReLU network N\mathcal{N} with R(A;θ)\mathcal{R}(\mathcal{A};\theta)\neq\emptyset. Fix xR(A;θ).x\in\mathcal{R}\left(\mathcal{A};\theta\right). We will use the following well-known formula:

where the sum is over all paths in the computational graph of N,\mathcal{N}, a path is open at xx if every neuron zz in γ\gamma satisfies z(x)bz > 0z(x)-b_{z}~{}>~{}0, and wγ(j)w_{\gamma}^{(j)} is the weight on the edge of γ\gamma between layers j1j-1 and j.j. If there exist two different, non-empty activation regions corresponding to activation patterns A\mathcal{A} for which there is at least one open path through the network on which N(;θ)\nabla N(\cdot\,;\theta) has the same value on , then there exists aγ{±1}a_{\gamma}\in\{\pm 1\} and a non-empty collection Γ\Gamma of paths so that

Appendix C Statement and Proof of a Generalization of Theorem 5

We begin by stating a generalization of Theorem 5 to what we term partial activation regions. Let us write

Fix θ\theta, a vector of trainable parameters in N\mathcal{N}, and a kk-partial activation pattern A.\mathcal{A}. The kk-partial activation region corresponding to A,θ\mathcal{A},\theta is

where the pre-activation of a neuron zz is z(x;θ)z(x;\theta), its bias is bz,b_{z}, its post-activation is therefore max{0,z(x;θ)bz}.\max\{0,z(x;\theta)-b_{z}\}. Finally, the activation regions of N\mathcal{N} at θ\theta is the collection of all non-empty activation regions R(A,θ)\mathcal{R}(\mathcal{A},\theta).

The same argument as the proof of Lemma 7 yields the following result:

Let N\mathcal{N} be a ReLU net. Fix a non-negative integer rr and let A\mathcal{A} be a rr-partial activation pattern for N.\mathcal{N}. For every vector θ\theta of trainable parameters for N\mathcal{N}, the rr-partial activation region R(A;θ)\mathcal{R}\left(\mathcal{A};\theta\right) is convex.

We will prove the following generalization of Theorem 5.

Every collection of biases has a density with respect to Lebesgue measure conditional on the values of all weights and other biases (for identically zero biases, see Appendix D).

Here, the average is with respect to the distribution of weights and biases in N.\mathcal{N}.

We begin with the following observation. Let N\mathcal{N} be a ReLU net, fix a vector θ\theta of trainable parameters, and consider a neuron zz in N.\mathcal{N}. The function xz(x;θ)x\mapsto z(x;\theta) is continuous and piecewise linear with a finite number of regions R\mathcal{R} on which z(x,θ)z(x,\theta) is some fixed linear function. On each such R,\mathcal{R}, the gradient z\nabla z is constant. If that constant is non-zero, then Hz(θ)RH_{z}(\theta)\cap\mathcal{R} is either empty if bzb_{z} does not belong to the range of zz on R\mathcal{R} or is a co-dimension 11 hyperplane if it does. In contrast, if z=0\nabla z=0 on R\mathcal{R}, then zz is constant and Hz(θ)RH_{z}(\theta)\cap\mathcal{R} is empty unless bzb_{z} is precisely equal to the value of zz on R.\mathcal{R}. Thus, given any choice of weights in N\mathcal{N} and for all but a finite number of biases, the set Hz(θ)H_{z}(\theta) coincides with co-dimension one hyperplane in each linear region R\mathcal{R} for the function z(;θ)z(\cdot;\theta) that it intersects.

For any vector θ\theta of trainable parameters for N\mathcal{N} define

The collections Vk(θ)\mathcal{V}_{k}(\theta) are useful for the following reason.

With probability 11 with respect to θ\theta, for every kk, the set Vk(θ)\mathcal{V}_{k}(\theta) has dimension (i.e. is a collection of discrete points) and

where #Vk\#\mathcal{V}_{k} is the number of points in Vk\mathcal{V}_{k}.

The statement that generically Vk(θ)\mathcal{V}_{k}(\theta) has dimension follows since, by construction, Ck\mathcal{C}_{k} has dimension kk and, with probability 11, XN,k(θ)X_{\mathcal{N},k}(\theta) coincides locally with a co-dimension kk hyperplane (see Lemma 11) in general position. To prove (12) note that any non-empty, bounded, convex polytope has at least one vertex on its boundary. Thus, with probability 1,1,

where TkT_{k} is the maximum over all vVkv\in\mathcal{V}_{k} of the number of rr-partial activation regions whose boundary contains v.v. To complete the proof, it remains to check that, with probability 11 over θ\theta,

To see this, note that, by definition of XN,kX_{\mathcal{N},k}, in a sufficiently small neighborhood UU of any vVkv\in\mathcal{V}_{k}, all but kk neurons have pre-activations satisfying either z(x)>bzz(x)>b_{z} or z(x)<bzz(x)<b_{z} for all xU.x\in U. Thus, there are at most (kr)2kr\binom{k}{r}2^{k-r} different rr-partial activation patterns A\mathcal{A} whose rr-partial activation region R(A;θ)\mathcal{R}\left(\mathcal{A};\theta\right) can intersect U.U.

To proceed with the proof of Theorem 5, we need the following observation.

Indeed, for any θ\theta in a full measure set, Lemma 13 guarantees the existence of an ε>0\varepsilon>0 so that the ε\varepsilon balls Bε(v)B_{\varepsilon}(v) are contained in Ck,ε\mathcal{C}_{k,\varepsilon} and are disjoint. Hence,

Thus, taking ε0\varepsilon\rightarrow 0, we have

By Fatou’s Lemma and (10) there exists T>0T>0 such that

Combining this with (11) and (12), we find

where in the last line we’ve assumed δ>1/T\delta>1/T and in the first inequality we used that

Appendix D Zero Bias

Suppose that N\mathcal{N} is a deep ReLU net with any bias distribution, and let N0bias\mathcal{N}^{\text{bias}}_{0} be the same network with all biases set to . Then, the total number of activation regions (over all of input space) for N0bias\mathcal{N}^{\text{bias}}_{0} is no more than that for N\mathcal{N}.

A key point in the proof is the scale equivariance of ReLU networks with zero bias.

Let N\mathcal{N} be a ReLU network with biases set identically to zero. Then,

N\mathcal{N} is equivariant under positive constant multiplication: N(cx)=cN(x)\mathcal{N}(cx)=c\mathcal{N}(x) for each c>0c>0.

For every activation region RR of N\mathcal{N}, and every point xx in RR, all points cxcx are also in RR for c>0c>0 (this implies that RR is a convex cone).

Note that each neuron of the network computes a function of the form z(x1,,xm)=ReLU(i=1mwixi)z(x_{1},\ldots,x_{m})=\text{ReLU}(\sum_{i=1}^{m}w_{i}x_{i}). Note that:

Thus, each neuron is equivariant under multiplication by positive constants cc, and thus the overall network must be as well, proving (a). Note also that the activation patterns of ReLUs for xx and cxcx must also be identical, implying that xx and cxcx lie in the same linear region. This proves (b).

Wee now turn to the proof of Proposition 14. We proceed by defining an injective mapping from regions of N0bias\mathcal{N}^{\text{bias}}_{0} to regions of N\mathcal{N}. For each linear region RR of N0bias\mathcal{N}^{\text{bias}}_{0}, pick a point xRRx_{R}\in R. By the Lemma, cxRRcx_{R}\in R for each c>0c>0. Let N1/cbias\mathcal{N}^{\text{bias}}_{1/c} be the network obtained from N\mathcal{N} by dividing all biases by cc, and observe that N(cx)=cN1/cbias(x)\mathcal{N}(cx)=c\mathcal{N}^{\text{bias}}_{1/c}(x), with the same activation pattern between the two networks. But by picking cc sufficiently large, N1/cbias\mathcal{N}^{\text{bias}}_{1/c} becomes arbitrarily close to N0bias\mathcal{N}^{\text{bias}}_{0}. We conclude that for some sufficiently large cRc_{R}, N0bias(cRxR)\mathcal{N}^{\text{bias}}_{0}(c_{R}x_{R}) and N(cRxR)\mathcal{N}(c_{R}x_{R}) have the same pattern of activations, so the regions of N\mathcal{N} in which cRxRc_{R}x_{R} lies must be distinct for all distinct RR. Thus, the number of regions of N\mathcal{N} must be at least as large as the number of regions of N0bias\mathcal{N}^{\text{bias}}_{0}, as desired. ∎

Lemma 15 implies that networks with small bias are almost scale equivariant. Figure 7 shows the activation regions for a network initialized with very small biases (i.i.d. normal with variance 10610^{-6}). Part (a) displays regions intersecting a plane through the origin, indicating that, outside of a small radius around the origin, all regions are infinite and are approximated by cones. Thus, ReLU networks near initialization are almost scale equivariant except for sample points very close to the origin, a simple but potentially useful property that has not been widely recognized. During training, biases grow and the radius grows within which finite regions occur. Note that, as shown in part (b) of the figure, a plane not containing the origin does not reveal this structure, as such a plane can have finite intersection with many regions that are in fact infinite.