Complexity of Linear Regions in Deep Networks
Boris Hanin, David Rolnick
Introduction
A growing field of theory has sought to explain the broad success of deep neural networks via a mathematical characterization of the ability of these networks to approximate different functions of input data. Many such works consider the expressivity of neural networks, showing that certain functions are more efficiently expressible by deep architectures than by shallow ones (e.g. Bianchini & Scarselli (2014); Montufar et al. (2014); Telgarsky (2015); Lin et al. (2017); Rolnick & Tegmark (2018)). It has, however, also been noted that many expressible functions are not efficiently learnable, at least by gradient descent (Shalev-Shwartz et al., 2018). More generally, the typical behavior of a network used in practice, the practical expressivity, may be very different from what is theoretically attainable. To adequately explain the power of deep learning, it is necessary to consider networks with parameters as they will naturally occur before, during, and after training.
Networks with a piecewise linear activation (e.g. ReLU, hard ) compute piecewise linear functions for which input space is divided into pieces, with the network computing a single linear function on each piece (see Figures 1-4). Figure 2 shows how the complexity of these pieces, which we refer to as linear regions, changes in a deep ReLU net with two-dimensional inputs. Each neuron in the first layer splits the input space into two pieces along a hyperplane, fitting a different linear function to each of the pieces. Subsequent layers split the regions of the preceding layers. The local density of linear regions serves as a convenient proxy for the local complexity or smoothness of the network, with the ability to interpolate a complex data distribution seeming to require fitting many relatively small regions. The topic of counting linear regions is taken up by a number of authors (Telgarsky, 2015; Montufar et al., 2014; Serra et al., 2018; Raghu et al., 2017).
A worst case estimate is that every neuron in each new layer splits each of the regions present at the previous layer, giving a number of regions exponential in the depth. Indeed this is possible, as examined extensively e.g. in Montufar et al. (2014). An example of Telgarsky (2015) shows that a sawtooth function with teeth can be expressed exactly using only neurons, as shown in Figure 3. However, even slightly perturbing this network (by adding noise to the weights and biases) ruins this beautiful structure and severely reduces the number of linear pieces, raising the question of whether typical neural networks actually achieve the theoretical bounds for numbers of linear regions.
Figure 1 also invites measures of complexity for piecewise linear networks beyond region counting. The boundary between two linear regions can be straight or can be bent in complex ways, for example, suggesting the volume of the boundary between linear regions as complexity measure for the resulting partition of input space. In the 2D example of Figure 1, this corresponds to computing perimeters of the linear pieces. As we detail below, this measure has another natural advantage: the volume of the boundary controls the typical distance from a random input to the boundary of its linear region (see §2.2). This measures the stability of the function computed by the network, and it is intuitively related to robustness under adversarial perturbation.
Our Contributions. In this paper, we provide mathematical tools for analyzing the complexity of linear regions of a network with piecewise linear activations (such as ReLU) before, during, and after training. Our main contributions are as follows:
For networks at initialization, the total surface area of the boundary between linear regions scales as the number of neurons times the number of breakpoints of the activation function. This is our main result, from which several corollaries follow (see Theorem 3, Corollary 4, and the discussion in §2).
In particular, for any line segment through input space, the average number of regions intersecting it is linear in the number of neurons, far below the exponential number of regions that is theoretically attainable.
Theorem 3 also allows us to conclude that, at initialization, the average distance from a sample point to the nearest region boundary is bounded below by a constant times the reciprocal of the number of neurons (see Corollary 5).
We find empirically that both the number of regions and the distance to the nearest region boundary stay roughly constant during training and in particular are far from their theoretical maxima. That this should be the case is strongly suggested by Theorem 3, though not a direct consequence of it.
Overall, our results stress that practical expressivity lags significantly behind theoretical expressivity. Moreover, both our theoretical and empirical findings suggest that for certain measures of complexity, trained deep networks are remarkably similar to the same networks at initialization.
In the next section, we informally state our theoretical and empirical results and explore the underlying intuitions. Detailed descriptions of our experiments are provided in §3. The precise theorem statements for ReLU networks can be found in §5. The exact formulations for general piecewise linear networks are in Appendix A, with proofs in the rest of the Supplementary Material. In particular, Appendix B contains intuition for how our proofs are shaped, while details are completed in §C-D.
Informal Overview of Results
This section gives an informal introduction to our results. We begin in §2.1 by describing the case of networks with input dimension In §2.2, we consider networks with higher input dimension. For simplicity, we focus throughout this section on fully connected ReLU networks. We emphasize, however, that our results apply to any piecewise linear activation. Moreover, the upper bounds we present in Theorems 1, 2, and 3 (and hence in Corollaries 4 and 5) can also be generalized to hold for feed-forward networks with arbitrary connectivity, though we do not go into details in this work, for the sake of clarity of exposition.
Consider the simple case of a net with input and output dimensions equal to Such a network computes a piecewise linear function (see Figure 4), and we are interested in understanding both at initialization and during training the number of distinct linear regions. There is a simple universal upper bound:
where the maximum is over all settings of weight and biases. This bound depends on the architecture of only via the number of neurons. For more refined upper bounds which take into account the widths of the layers, see Theorem 1 in Raghu et al. (2017) and Theorem 1 in Serra et al. (2018).
The constructions in Montufar et al. (2014); Telgarsky (2015); Raghu et al. (2017); Serra et al. (2018) indicate that the bound in (1) is very far from sharp for shallow and wide networks but that exponential growth in the number of regions can be achieved in deep, skinny networks for very special choices of weights and biases. This is a manifestation of the expressive power of depth, the idea that repeated compositions allow deep networks to capture complex hierarchical relations more efficiently per parameter than their shallow cousins. However, there is no non-trivial lower bound for the number of linear regions:
The minimum is attained by setting all weights and biases to This raises the question of the behavior for the average number of regions when the weights and biases are chosen at random (e.g. at initialization). Intuitively, configurations of weights and biases that come close to saturating the exponential upper bound (1) are numerically unstable in the sense that a small random perturbation of the weights and biases drastically reduces the number of linear regions (see Figure 3 for an illustration). In this direction, we prove a somewhat surprising answer to the question of how many regions has at initialization. We state the result for but note that it holds for any piecewise linear, continuous activation function (see Theorems 3 and 6).
Let be a network with piecewise linear activation with input and output dimensions of both equal . Suppose the weights and biases are randomly initialized so that for each neuron , its pre-activation has bounded mean gradient
where is the number of breakpoints in the non-linearity of (for ReLU nets, ). The same result holds when computing the number of linear regions along any fixed -dimensional curve in a high-dimensional input space.
Theorem 1 defies the common intuition that, on average, each layer in multiplies the number of regions formed up to the previous layer by a constant larger than one. This would imply that the average number of regions is exponential in the depth. To provide intuition for why this is not true for random weights and biases, consider the effect of each neuron separately. Suppose the pre-activation of a neuron satisfies , a hallmark of any reasonable initialization. Then, over a compact set of inputs, the piecewise linear function cannot be highly oscillatory over a large portion of the range of . Thus, if the bias is not too concentrated on any interval, we expect the equation to have solutions. On average, then, we expect that each neuron adds a constant number of new linear regions. Thus, the average total number of regions should scale roughly as the number of neurons.
Theorem 1 follows from a general result, Theorem 3, that holds for essentially any non-degenerate distribution of weights and biases and with any input dimension. If and the bias distribution are well-behaved, then throughout training, Theorem 3 suggests the number of linear regions along a -dimensional curve in input space scales like the number of neurons in . Figures 5-6 show experiments that give empirical verification of this heuristic.
2 Higher-Dimensional Regions
For networks with input dimension exceeding there are several ways to generalize counting linear regions. A unit-matching heuristic applied to Theorem 1 suggests
Experiments
We empirically verified our theorems and further examined how linear regions of a network change during training. All experiments below were performed with fully-connected networks, initialized with He normal weights (i.i.d. with variance ) and biases drawn i.i.d. normal with variance (to prevent collapse of regions at initialization, which occurs when all biases are uniquely zero). Training was performed on the vectorized MNIST (input dimension 784) using the Adam optimizer at learning rate . All networks attain test accuracy in the range .
We calculated the number of regions along lines through the origin and and a random selected training example in input space. For each setting of weights and biases within the network during training, the number of regions along each line is calculated exactly by building up the network one layer at a time and calculating how each region is split by the next layer of neurons. Figure 5 represents the average over 5 independent runs, from each of which we sample 100 lines; variance across the different runs is not significant.
Figure 5 plots the average number of regions along a line, divided by the number of neurons in the network, as a function of epoch during training. We make several observations:
As predicted by Theorem 3, all networks start out with the number of regions along a line equal to a constant times the number of neurons in the network (the constant in fact appears very close to 1 in this case).
Throughout training, the number of regions does not deviate significantly from the number of neurons in the network, staying within a small constant of the value at initialization, in keeping with our intuitive understanding of Theorem 3 described informally around Theorem 1 above.
The number of regions actually decreases during the initial part of training, then increases again. We explore this behavior further in other experiments below.
2 Distance to the Nearest Region Boundary
We calculated the average distance to the nearest boundary for randomly selected input points, for various networks throughout training. Points were selected randomly from a normal distribution with mean and variance matching the componentwise mean and variance of MNIST training data. Results were averaged over independent runs, but variance across runs is not significant. Rerunning these experiments with sample points selected randomly from (i) the training data or (ii) the test data yielded similar results to random sample points.
We find Figure 6(c) fascinating, though we do not completely understand it. It plots the product of number of neurons and distance to the nearest region boundary against the test accuracy. It suggests two phases of training: first regions expand, then they contract. This lines up with observations made in Arpit et al. (2017) that neural networks “learn patterns first” on which generalization is simple and then refine the fit to encompass memorization of individual samples. A generalization phase would suggest that regions are growing, while memorization would suggest smaller regions are fit to individual data points. This is, however, speculation and more experimental (and theoretical) exploration will be required to confirm or disprove this intuition.
We found it instructive to consider the full distribution of distances from sample points to their nearest boundaries, rather than just the average. For a single network (depth 4, width 16), Figure 7 indicates that this distribution does not significantly change during training, although there appears to be a slight skew towards larger regions, in agreement with the findings in Novak et al. (2018). The histogram shows -distances. Hence, distance to the nearest region boundary varies over many orders of magnitude. This is consistent with Figures 1 and 4, which lend credence to the intuition that small distances to the nearest region boundary are explained by the presence of many small regions. According to Theorem 3, this should correlate with a combination of regions in input space at which some neurons have a large gradient and neurons with highly peaked biases distributions. We hope to return to this in future work.
3 Regions Within a 2D Plane
We visualized the regions of a network through training. Specifically, following experiments in Novak et al. (2018), we plotted regions within a plane in the -dimensional input space (Figure 8) through three data points with different labels (, , and , in our case) inside a square centered at the circumcenter of the three examples. The network shown has depth and width . We observe that, as expected from our other plots, the regions expand initially during training and then contract again. We expect the number of regions within a -dimensional subspace to be on the order of the square of the number of neurons – that is, , which we indeed find.
Our approach for calculating regions is simple. We start with a single region (in this case, the square), and subdivide it by adding neurons to the network one by one. For each new neuron, we calculate the linear function it defines on each region, and determine whether that region is split into two. This approach terminates within a reasonable amount of time precisely because our theorem holds: there are relatively few regions. Note that we exactly determine all regions within the given square by calculating all region boundaries; thus our counts are exact and do not miss any small regions, as might occur if we merely estimated regions by sampling points from input space.
Related Work
There are a number of works that touch on the themes of this article: (i) the expressivity of depth; (ii) counting the number of regions in networks with piecewise linear activations; (iii) the behavior of linear regions through training; and (iv) the difference between expressivity and learnability. Related to (i), we refer the reader to Eldan & Shamir (2016); Telgarsky (2016) for examples of functions that can be efficiently represented by deep but not shallow ReLU nets. Next, still related to (i), for uniform approximation over classes of functions, again using deep ReLU nets, see Yarotsky (2017); Rolnick & Tegmark (2018); Yarotsky (2018); Petersen & Voigtlaender (2018). For interesting results on (ii) about counting the maximal possible number of linear regions in networks with piecewise linear activations see Bianchini & Scarselli (2014); Montufar et al. (2014); Poole et al. (2016); Arora et al. (2018); Raghu et al. (2017). Next, in the vein of (iii), for both a theoretical and empirical perspective on the number of regions computed by deep networks and specifically how the regions change during training, see Poole et al. (2016); Novak et al. (2018). In the direction of (iv), we refer the reader to Shalev-Shwartz et al. (2018); Hanin & Rolnick (2018); Hanin (2018). Finally, for general insights into learnability and expressivity in deep vs. shallow networks see Mhaskar & Poggio (2016); Mhaskar et al. (2016); Zhang et al. (2017); Lin et al. (2017); Poggio et al. (2017); Neyshabur et al. (2017).
Formal Statement of Results
More precisely, we set and recursively define to be the set of points so that in a neighborhood of the set coincides with a co-dimension hyperplane.
Theorem 3 holds under the following assumption on the distribution of weights and biases:
of inside is, in the notation (6),
the function is the density of the joint distribution of the biases , and we say a neuron is good at if there exists a path of neurons from to the output in the computational graph of so that each neuron along this path is open at ).
To evaluate the expression in (8) requires information on the distribution of gradients , the pre-activations , and the biases Exact information about these quantities is available at initialization (Hanin, 2018; Hanin & Rolnick, 2018; Hanin & Nica, 2018), yielding the following Corollary.
where depends only on but not on the architecture of and is the width of the hidden layer. Moreover, we also have similar lower bounds
with depending only on the distribution of the weights in .
We prove Corollary 7 in Appendix D. Let us state one final corollary of Theorem 3
We prove Corollary 8 in §E. The basic idea is simple. For every we have
with the probability on the right hand side scaling like
Conclusions and Further Work
The question of why depth is powerful has been a persistent problem for deep learning theory, and one that recently has been answered by works giving enhanced expressivity as the ultimate explanation. However, our results suggest that such explanations may be misleading. While we do not speak to all notions of expressivity in this paper, we have both theoretically and empirically evaluated one common measure: the linear regions in the partition of input space defined by a network with piecewise linear activations. We found that the average size of the boundary of these linear regions depends only on the number of neurons and not on the network depth – both at initialization and during training. This strongly suggests that deeper networks do not learn more complex functions than shallow networks. We plan to test this interpretation further in future work – for example, with experiments on more complex tasks, as well as by investigating higher order statistics, such as the variance.
We do not propose a replacement theory for the success of deep learning; however, prior work has already hinted at how such a theory might proceed. Notably, Ba & Caruana (2014) show that, once deep networks are trained to perform a task successfully, their behavior can often be replicated by shallow networks, suggesting that the advantages of depth may be linked to easier learning.
References
Appendix A Formal Statement of Results for General Piecewise Linear Activations
The analog of Theorem 3 for general is the following.
and write for the density of the joint distribution of the biases . We say a neuron is good at if there exists a path of neurons from to the output in the computational graph of so that each neuron along this path is open at (i.e. ).
of inside is, in the notation of (6),
where equals
multiplied by the indicator function of the event that is good at for every
Note that if in the definition (11) of we have that the possible values do not include , then we may ignore the event that are good at in the definition of
where is the number of breakpoints in the non-linearity of (see (11)) and
We prove Corollary 7 in §D and state a final corollary of Theorem 3:
where, as before, is the number of breakpoints in the non-linearity of .
We prove Corollary 8 in §E. The basic idea is simple. For every we have
with the probability on the right hand side scaling like
Appendix B Outline of Proof of Theorem 6
We remark here that if in the non-linearity there are no linear pieces at which the slopes on equals (i.e. for all in the definition (11) of ). If, for example, is ReLU, then need not be empty.
The overall proof of Theorem 3 can be divided into several steps. The first gives the following representation of
Under Assumptions and of Theorem 3, we have, with probability
The next step in proving Theorem 3 is to identify the portions of of each dimension. To do this, we write for any distinct neurons ,
The set is, intuitively, the collection of inputs at which switches between linear regions for and at which the output of is affected by the post-activations of these neurons. Proposition 9 shows that we may represent as a disjoint union
We prove Proposition 10 in §C.2. The idea is that each is piecewise linear and, with probability , at every point at which exactly the neurons contribute to , its co-dimension is the number of linear conditions needed to define it. Observe that with probability , the bias vector for any collection of distinct neurons is a regular value for . Hence,
Proposition 10 thus implies that, with probability
The final step in the proof of Theorem 3 is therefore to prove the following result.
where is defined as in (13).
We provide a detailed proof of Proposition 11 in §C.3. The intuition is that the image of the volume element under is the volume element
that the vector of biases belongs to the image of under map for some collection of breakpoints The formal argument uses the co-area formula (see (29) and (30)).
Appendix C Proof of Theorem 3
Intuitively, are the neurons that, at the input are open (i.e. contribute to the gradient of the output ) but do not change their contribution in a neighborhood of , are the neurons that are closed, and are the neurons that, at , produce a discontinuity in the derivative of Thus, for example, if then
We begin by proving that by checking the contrapositive
Fix . Note that are locally constant in the sense that there exists so that for all with , we have
Moreover, observe that if in the definition (11) of none of the slopes equal , then for every . To prove (19), consider any path from the input to the output in the computational graph of Such a path consists of neurons, one in each layer:
To each path we may associate a sequence of weights:
For instance, if , then
and in general only one term in the definition of is non-zero for each We may write
Note that if , then for any path through a neuron , we have
This is an open condition in light of (20), and hence for all in a neighborhood of and for any path through a neuron we also have that
Thus, since the summand in (21) vanishes identically if , we find that for in a neighborhood of any we may write
But, again by (20), for any fixed , all in a neighborhood of and each we have as well. Thus, in particular,
Thus, for sufficiently close to we have for every path in the sum (22) that
Therefore, the partial derivatives are independent of in a neighborhood of and hence continuous at . This proves (19). Let us now prove the reverse inclusion:
for any pair of distinct neurons Note also that since is continuous and piecewise linear, the set is closed. Thus, it is enough to show the slightly weaker inclusion
since the closure of \widetilde{S}_{z}\big{\backslash}\bigcup_{\widehat{z}\neq z}S_{\widehat{z}} equals Fix a neuron and suppose x\in\widetilde{S}_{z}\big{\backslash}\bigcup_{\widehat{z}\neq z}S_{\widehat{z}}. By definition, we have that for every neuron either
This has two consequences. First, by (20), the map is linear in a neighborhood of Second, in a neighborhood of the set coincides with . Hence, combining these facts, near the set coincides with the hyperplane
We may take two sequences of inputs on opposite sides of this hyperplane so that
where the index the same as the one that defines the hyperplane (25). Further, since has co-dimension (it is contained in the piecewise linear co-dimension set , for example), we may also assume that Consider any path from the input to the output of the computational graph of passing through (so that ). By construction, for every , we have
and hence, after passing to a subsequence, we may assume that the symmetric difference
Substituting into this expression , we find that there exists a non-empty collection of paths from the input to the output of so that
Note that the expression above is a polynomial in the weights of . Note also that, by construction, this polynomial is not identically zero due to the condition (26). There are only finitely many such polynomials since both and range over a finite alphabet. For each such non-zero polynomial, the set of weights at which it vanishes has co-dimension . Hence, with probability the difference is non-zero. This shows that the partial derivatives are not continuous at and hence that
C.2 Proof of Proposition 10
Fix distinct neurons and suppose but not in for any After relabeling, we may assume that they are ordered by layer index:
Since , we also have that for any Thus, there exists a neighborhood of so for every Thus, there exists a neighborhood of on which is linear.
C.3 Proof of Proposition 11
where is the Jacobian of and
We assumed that the biases have a joint conditional density
given all other weights and biases. The mean of the term in (28) corresponding to a fixed over the conditional distribution of is therefore
where we’ve abbreviated as well as . This can rewritten as
Thus, applying the co-area formula (29), (30) shows that the average of (28) over the conditional distribution of is precisely
Appendix D Proof of Corollary 7
where, as in (13), is
times the indicator function of the even that is good at for every When the weights and biases of are independent, we may write as
is the associated Gram matrix. The Gram identity says that equals
The estimate (14) proves the upper bound (15). For the special case of we use the AM-GM inequality and Jensen’s inequality to write
Therefore, by Theorem 1 of Hanin & Nica (2018), there exist so that
This completes the proof of the upper bound in (15). To prove the power bound, lower bound in (15) we must argue in a different way. Namely, we will induct on and use the following facts to prove the base case :
At initialization, for each fixed input , we have
At initialization, for every neuron and each input we have
This follows easily from Theorem 1 of Hanin (2018).
which using (33) and (32) together with Markov’s inequality, is bounded above by
Next, using Jensen’s inequality twice, we write
where in the last inequality we applied (34). Putting this all together, we find that exists so that
for sufficiently large. This completes the proof of the lower bound in (15) when . To complete the proof of Corollary 7, suppose we have proved the lower bound in (15) for all networks and all collections of distinct neurons. We may assume after relabeling that the neurons are ordered by layer index:
Summing this lower bound over yields
Applying the inductive hypothesis once more completes the proof.
Appendix E Proof of Corollary 8
where we abbreviated Using that
and repeating this argument times completes the proof. ∎