On the number of response regions of deep feed forward networks with piece-wise linear activations

Razvan Pascanu, Guido Montufar, Yoshua Bengio

Introduction

Deep systems are believed to play an important role in information processing of intelligent agents. A common hypothesis underlying this belief is that deep models can be exponentially more efficient at representing some functions than their shallow counterparts (see Bengio, 2009).

The argument is usually a compositional one. Higher layers in a deep model can re-use primitives constructed by the lower layers in order to build gradually more complex functions. For example, on a vision task, one would hope that the first layer learns Gabor filters capable to detect edges of different orientation. These edges are then put together at the second layer to form part-of-object shapes. On higher layers, these part-of-object shapes are combined further to obtain detectors for more complex part-of-object shapes or objects. Such a behaviour is empirically illustrated, for instance, in Zeiler and Fergus (2013); Lee et al. (2009). On the other hand, a shallow model has to construct detectors of target objects based only on the detectors learnt by the first layer.

The representational power of computational systems with shallow and deep architectures has been studied intensively. A well known result Hajnal et al. (1993) derived lower complexity bounds for shallow threshold networks. Other works have explored the representational power of generative models based on Boltzmann machines Montúfar et al. (2011); Martens et al. (2013) and deep belief networks (Sutskever and Hinton, 2008; Le Roux and Bengio, 2010; Montúfar and Ay, 2011), or have compared mixtures and products of experts models (Montúfar and Morton, 2012).

In addition to such inspections, a wealth of evidence for the validity of this hypothesis comes from deep models consistently outperforming shallow ones on a variety of tasks and datasets (see, e.g., Goodfellow et al., 2013; Hinton et al., 2012b, a). However, theoretical results on the representational power of deep models are limited, usually due to the composition of nonlinear functions in deep models, which makes mathematical analysis difficult. Up to now, theoretical results have focussed on circuit operations (neural net unit computations) that are substantially different from those being used in real state-of-the-art deep learning applications, such as logic gates (Håstad, 1986), linear + threshold units with non-negative weights (Håstad and Goldmann, 1991) or polynomials (Bengio and Delalleau, 2011). Bengio and Delalleau (2011) show that deep sum-product networks (Poon and Domingos, 2011) can use exponentially less nodes to express some families of polynomials compared to the shallow ones.

The present note analyzes the representational power of deep MLPs with rectifier units. Rectifier units (Glorot et al., 2011; Nair and Hinton, 2010) and piecewise linearly activated units in general (like the maxout unit (Goodfellow et al., 2013)), are becoming popular choices in designing deep models, and most current state-of-the-art results involve using one of such activations (Goodfellow et al., 2013; Hinton et al., 2012b). Glorot et al. (2011) show that rectifier units have several properties that make the optimization problem easier than the more traditional case using smooth and bounded activations, such as tanh or sigmoid.

In this work we take advantage of the piecewise linear nature of the rectifier unit to mathematically analyze the behaviour of deep rectifier MLPs. Given that the model is a composition of piecewise linear functions, it is itself a piecewise linear function. We compare the flexibility of a deep model with that of a shallow model by counting the number of linear regions they define over the input space for a fixed number of hidden units. This is the number of pieces available to the model in order to approximate some arbitrary nonlinear function. For example, if we want to perfectly approximate some curved boundary between two classes, a rectifier MLP will have to use infinitely many linear regions. In practice we have a finite number of pieces, and if we assume that we can perfectly learn their optimal slopes, then the number of linear regions becomes a good proxy for how well the model approximates this boundary. In this sense, the number of linear regions is an upper bound for the flexibility of the model. In practice, the linear pieces are not independent and the model may not be able to learn the right slope for each linear region. Specifically, for deep models there is a correlation between regions, which results from the sharing of parameters between the functions that describe the output on each region.

This is by no means a negative observation. If all the linear regions of the deep model were independent of each other, by having many more linear regions, deep models would grossly overfit. The correlation of the linear regions of a deep model results in its ability to generalize, by allowing it to better represent only a small family of structured functions. These are functions that look complicated (e.g., a distribution with a huge number of modes) but that have an underlying structure that the network can ‘compress’ into its parameters. The number of regions, which indicates the number of variations that the network can represent, provides a measure of how well it can fit this family of structured functions (whose approximation potentially needs infinitely many linear regions).

We believe that this approach, based on counting the number of linear regions, is extensible to any other piecewise linear activation function and also to other architectures, including the maxout activation and the convolutional networks with rectifier activations.

We know the maximal number of regions of linearity of functions computable by a shallow model with a fixed number of hidden units. This number is given by a well studied geometrical problem. The main insight of the present work is to provide a geometrical construction that describes the regions of linearity of functions computed by deep models. We show that in the asymptotic regime, these functions have many more linear regions than the ones computed by shallow models, for the same number of hidden units.

For the single layer case, each hidden unit divides the input space in two, whereby the boundary is given by a hyperplane. For all input values on one side of the hyperplane, the unit outputs a positive value. For all input values on the other side of the hyperplane, the unit outputs . Therefore, the question that we are asking is: Into how many regions do $n$ hyperplanes split space? This question is studied in geometry under the name of hyperplane arrangements, with classic results such as Zaslavsky’s theorem. Section 3 provides a quick introduction to the subject.

For the multilayer version of the model we rely on the following intuition. By using the rectifier nonlinearity, we identify multiple regions of the input space which are mapped by a given layer into an equivalent set of activations and represent thus equivalent inputs for the next layers. That is, a hidden layer can perform a kind of or operation by reacting similarly to several different inputs. Any subsequent computation made on these activations is replicated on all equivalent inputs.

This paper is organized as follows. In Section 2 we provide definitions and basic observations about piecewise linear functions. In Section 3 we discuss rectifier networks with one single hidden layer and describe their properties in terms of hyperplane arrangements which are fairly well known in the literature. In Section 4 we discuss deep rectifier networks and prove our main result, Theorem 1, which describes their complexity in terms of the number of regions of linearity of functions that they represent. Details about the asymptotic behaviour of the results derived in Sections 3 and 4 are given in the Appendix A. In Section 5 we analyze a special type of deep rectifier MLP and show that even for a small number of hidden layers it can generate a large number of linear regions. In Section 6 we offer a discussion of the results.

Preliminaries

We consider classes of functions (models) defined in the following way.

A rectifier feedforward network is a layered feedforward network, or multilayer perceptron (MLP), as shown in Fig. 1, with following properties. Each hidden unit receives as inputs the real valued activations $x_{1},\ldots,x_{n}$ of all units in the previous layer, computes the weighted sum

The real parameters $w_{1},\ldots,w_{n}$ are the input weights and $b$ is the bias of the unit. The output layer is a linear layer, that is, the units in the last layer compute a linear combination of their inputs and output it unrectified.

In the remainder of this section we state three simple lemmas.

The next lemma states that a piecewise linear function $f=(f_{i})_{i\in[k]}$ has as many regions of linearity as there are distinct intersections of regions of linearity of the coordinates $f_{i}$ .

In regard to the number of regions of linearity of the functions represented by rectifier networks, the number of output dimensions, i.e., the number of linear output units, is irrelevant. This is the statement of the next lemma.

The number of (linear) output units of a rectifier feedforward network does not affect the maximal number of regions of linearity that it can realize.

If the number of regions of $f$ is finite, then the number of differences of gradients is finite and there is a vector outside the union of their orthogonal spaces. Hence a matrix with a single row (a single output unit) suffices to capture all transitions between different regions of linearity of $f$ . ∎

One hidden layer

Let us look at the number of response regions of a single hidden layer MLP with $n_{0}$ input units and $n$ hidden units. We first formulate the rectifier unit as follows:

The general answer to Problem 1 is given by Zaslavsky’s theorem (Zaslavsky, 1975, Theorem A), which is one of the central results from the theory of hyperplane arrangements.

We will only need the special case of hyperplanes in general position, which realize the maximal possible number of regions. Formally, an $n$ -dimensional arrangement $\mathcal{A}$ is in general position if for any subset $\{H_{1},\ldots H_{p}\}\subseteq\mathcal{A}$ the following holds. (1) If $p\leq n$ , then $\dim(H_{1}\cap\cdots\cap H_{p})=n-p$ . (2) If $p>n$ , then $H_{1}\cap\cdots\cap H_{p}=\emptyset$ . An arrangement is in general position if the weights $\mathbf{W}$ , $\mathbf{b}$ defining its hyperplanes are generic. This means that any arrangement can be perturbed by an arbitrarily small perturbation in such a way that the resulting arrangement is in general position.

For arrangements in general position, Zaslavsky’s theorem can be stated in the following way (see Stanley, 2004, Proposition 2.4).

In particular, the number of regions of a $2$ -dimensional arrangement $\mathcal{A}_{m}$ of $m$ lines in general position is equal to

For the purpose of illustration, we sketch a proof of eq. (4) using the sweep hyperplane method. We proceed by induction over the number of lines $m$ .

Base case $m=0$ . It is obvious that in this case there is a single region, corresponding to the entire plane. Therefore, $r(\mathcal{A}_{0})=1$ .

Induction step. Assume that for $m$ lines the number of regions is $r(\mathcal{A}_{m})={m\choose 2}+m+1$ , and add a new line $L_{m+1}$ to the arrangement. Since we assumed the lines are in general position, $L_{m+1}$ intersects each of the existing lines $L_{k}$ at a different point. Fig. 2 depicts the situation for $m=2$ .

The $m$ intersection points split the line $L_{m+1}$ into $m+1$ segments. Each of these segments cuts a region of $\mathcal{A}_{m}$ in two pieces. Therefore, by adding the line $L_{m+1}$ we get $m+1$ new regions. In Fig. 2 the two intersection points result in three segments that split each of the regions $R_{1},R_{01},R_{0}$ in two. Hence

For the number of response regions of MLPs with one single hidden layer we obtain the following.

The regions of linearity of a function in the model $\mathcal{F}_{(n_{0},n_{1},1)}$ with $n_{0}$ inputs and $n_{1}$ hidden units are given by the regions of an arrangement of $n_{1}$ hyperplanes in $n_{0}$ -dimensional space. The maximal number of regions of such an arrangement is $\mathcal{R}(n_{0},n_{1},n_{y})=\sum_{j=0}^{n_{0}}{n_{1}\choose j}$ .

This is a consequence of Lemma 1. The maximal number of regions is produced by an $n_{0}$ -dimensional arrangement of $n_{1}$ hyperplanes in general position, which is given in Proposition 1. ∎

Multiple hidden layers

In order to show that a $k$ hidden layer model can be more expressive than a single hidden layer one with the same number of hidden units, we will need the next three propositions.

Any arrangement can be scaled down and shifted such that all regions of the arrangement intersect the unit ball.

The proposition is illustrated in Fig. 3.

We need some additional notations in order to formulate the next proposition. Given a hyperplane $H=\{\mathbf{x}\colon\mathbf{w}^{\top}\mathbf{x}+b=0\}$ , we consider the region $H^{-}=\{\mathbf{x}\colon\mathbf{w}^{\top}\mathbf{x}+b<0\}$ , and the region $H^{+}=\{\mathbf{x}\colon\mathbf{w}^{\top}\mathbf{x}+b\geq 0\}$ . If we think about the corresponding rectifier unit, then $H^{+}$ is the region where the unit is active and $H^{-}$ is the region where the unit is dead.

Let $R$ be a region delimited by the hyperplanes $\{H_{1},\ldots,H_{n}\}$ . We denote by $R^{+}\subseteq\{1,\ldots,n\}$ the set of all hyperplane-indices $j$ with $R\subset H^{+}_{j}$ . In other words, $R^{+}$ is the list of hidden units that are active (non-zero) in the input-space region $R$ .

The following proposition describes the combinatorics of $2$ -dimensional arrangements in general position. More precisely, the proposition describes the combinatorics of $n$ -dimensional arrangements with $2$ -dimensional essentialization in general position. Recall that the essentialization of an arrangement is the arrangement that it defines in the subspace spanned by the normals of its hyperplanes.

The proposition guarantees the existence of input weights and bias for a rectifier layer such that for any list of consecutive units, there is a region of inputs for which exactly the units from that list are active.

We show that the hyperplanes of a $2$ -dimensional arrangement in general position can be indexed in such a way that the claim of the proposition holds. For higher dimensional arrangements the statement follows trivially, applying the $2$ -dimensional statement to the intersection of the arrangement with a $2$ -subspace.

Now assume that there is an arrangement of $n$ lines with the claimed properties. To add an $(n+1)$ -th line, we first consider the maximal distance $d_{\text{max}}$ from the origin to the intersection of two lines $L_{i}\cap L_{j}$ with $1\leq i<j<n$ . We also consider the radius- $(d_{\text{max}}+r)$ circle $\mathcal{S}_{n}$ centered at the origin. The circle $\mathcal{S}_{n}$ contains all intersection of any of the first $n$ lines. We now choose an angle $\alpha_{n}$ with $0<\alpha_{n}<\alpha_{n-1}$ and define $L_{n+1}$ as the tangent of $\mathcal{S}_{n}$ that forms an angle $\alpha_{n}$ with the y-axis. Fig. 4 depicts adding the third and fourth line to the arrangement.

After adding line $L_{n+1}$ , we have that the arrangement

has regions $R_{1}^{\prime},\ldots,R_{n+1}^{\prime}$ with $R_{i}^{\prime+}=\{i,i+1,\ldots,n+1\}$ for all $i\in[n+1]$ .

The regions of the arrangement are stable under perturbation of the angles and radii used to define the lines. Any slight perturbation of these parameters preserves the list of regions. Therefore, the arrangement is in general position.

The second property comes from the order in which $L_{n+1}$ intersects all previous lines. $L_{n+1}$ intersects the lines in the order in which they were added to the arrangement: $L_{1},L_{2},\ldots,L_{n}$ . The intersection of $L_{n+1}$ and $L_{i}$ , $B_{i{n+1}}=L_{n+1}\cap L_{i}$ , is above the lines $L_{i+1},L_{i+2},\ldots,L_{n}$ , and hence the segment $B_{{i-1}{n+1}}B_{i{n+1}}$ between the intersection with $L_{i-1}$ and with $L_{i}$ , has to cut the region in which only units $i$ to $n$ are active.

The intersection order is ensured by the choice of angles $\alpha_{i}$ and the fact that the lines are tangent to the circles $\mathcal{S}_{i}$ . For any $i<j$ and $B_{ij}=L_{i}\cap L_{j}$ let $T_{ij}$ be the line parallel to the y-axis passing through $B_{ij}$ . Each line $T_{ij}$ divides the space in two. Let $H_{ij}$ be the half-space to the right of $T_{ij}$ . Within any half-space $H_{ij}$ , the intersection $H_{ij}\cap L_{i}$ is above $H_{ij}\cap L_{j}$ , because the angle $\alpha_{i-1}$ of $L_{i}$ with the y-axis is larger than $\alpha_{j-1}$ (this means $L_{j}$ has a stepper decrease). Since $L_{n+1}$ is tangent to the circle that contains all points $B_{ij}$ , the line $L_{n+1}$ will intersect lines $L_{i}$ and $L_{j}$ in $H_{ij}$ , and therefore it has to intersect $L_{i}$ first.

The next proposition guarantees the existence of a collection of affine maps with shared bias, which map a collection of regions to a common output.

We can now set $\mathbf{c}$ to be minus the common center of the scaled balls, to obtain the map:

These $g_{i}$ satisfy claimed property, namely that $g_{i}(R_{i})$ contains the unit ball, for all $i$ . ∎

Before proceeding, we discuss an example illustrating how the previous propositions and lemmas are put together to prove our main result below, in Theorem 1.

Within each input region $R_{i}$ , only two units in the first hidden layer are active. Therefore, for each input region $R_{i}$ , the output of the intermediary layer is an affine transformation of $S_{i}$ . Furthermore, the weights of the intermediary layer can be chosen in such a way that the image of each $R_{i}$ contains the unit ball.

Now, $f^{\prime}$ is the map computed by a rectifier layer with $2$ inputs and $n$ outputs. It is possible to define this map in such a way that it has $\mathcal{R}$ regions of linearity within the unit ball, where $\mathcal{R}$ is the number of regions of a $2$ -dimensional arrangement of $n$ hyperplanes in general position.

Now we are ready to state our main result on the number of response regions of rectifier deep feedforward networks:

A model with $n_{0}$ inputs and $k$ hidden layers of widths $n_{1},n_{2},\ldots,n_{k}$ can divide the input space in $\left(\prod_{i=1}^{k-1}\left\lfloor\frac{n_{i}}{n_{0}}\right\rfloor\right)\sum_{i=0}^{n_{0}}{n_{k}\choose i}$ or possibly more regions.

To be more specific, for an input vectors from $R_{1}$ , exactly the first $n_{0}$ units of the first hidden layer are active, that is, for these input vectors the value of $\mathbf{h}^{(1)}_{j}$ is non-zero if and only if $j\in I_{1}=\{1,\ldots,n_{0}\}$ . For input vectors from $R_{2}$ , only the next $n_{0}$ units of the first hidden layer are active, that is, the units with index in $I_{2}=\{n_{0}+1,\ldots,2n_{0}\}$ , and so on.

Now we consider a ‘fictitious’ intermediary layer consisting of $n_{0}$ linear units between the first and second hidden layers. As this intermediary layer computes an affine function, it can be absorbed into the second hidden layer (see Lemma 3). We use it only for making the next arguments clearer.

The weights $\mathbf{U}^{(2)}$ and $\mathbf{c}^{(2)}$ describe the affine function computed by the intermediary layer, $\mathbf{x}\mapsto\mathbf{U}^{(2)}\mathbf{x}+\mathbf{c}$ . The weights $\mathbf{V}^{(2)}$ and $\mathbf{d}^{(2)}$ are the input and bias weights of the rectifier layer following the intermediary layer.

Let $S_{i}$ be the image of the input-space region $R_{i}$ by the first hidden layer. By Proposition 5, there is a choice of the weights $\mathbf{U}^{(2)}_{i}$ and bias $\mathbf{c}^{(2)}$ such that the image of $S_{i}$ by $\mathbf{x}\to\mathbf{U}^{(2)}_{i}(\mathbf{x})+\mathbf{c}^{(2)}$ contains the $n_{0}$ -dimensional unit ball. Now, for all inputs vectors from $R_{i}$ , only the units $I_{i}$ of the first hidden layer are active. Therefore, $g|_{R_{i}}=g_{i}|_{R_{i}}+\mathbf{c}^{(2)}$ . This implies that the image $g(R_{i})$ of the input-space region $R_{i}$ by the intermediary layer contains the unit ball, for all $i\in[p]$ .

In consequence, the map from input-space to activations of the second hidden layer has $r(\mathcal{A})$ regions of linearity within each input-space region $R_{i}$ . Fig. 6 illustrates the situation. All inputs that are mapped to the same activation of the first hidden layer, are treated as equivalent on the subsequent layers. In this sense, an arrangement $\mathcal{A}$ defined on the set of common outputs of $R_{1},\ldots,R_{p}$ at the first hidden layer, is ‘replicated’ in each input region $R_{1},\ldots,R_{p}$ .

The subsequent layers of the network can be analyzed in a similar way as done above for the first two layers. In particular, the weights $\mathbf{V}^{(2)}$ and $\mathbf{d}^{(2)}$ can be chosen in such a way that they define an arrangement with the properties from Proposition 4. Then, the map taking activations from the second hidden layer to activations from the third hidden layer, can be analyzed by considering again a fictitious intermediary layer between the second and third layers, and so forth, as done above.

For the last hidden layer we choose the input weights $\mathbf{V}^{(k)}$ and bias $\mathbf{d}^{(k)}$ defining an $n_{0}$ -dimensional arrangement of $n_{k}$ hyperplanes in general position. The map of inputs to activations of the last hidden layer has thus $\left(\prod_{i=1}^{k-1}\left\lfloor\frac{n_{i}}{n_{0}}\right\rfloor\right)\sum_{i=0}^{n_{0}}{n_{k}\choose i}$ regions of linearity. This number is a lower bound on the maximal number of regions of linearity of functions computable by the network. This completes the proof. The intuition of the construction is illustrated in Fig. 7. ∎

In the Appendix A we derive an asymptotic expansion of the bound given in Theorem 1.

A special class of deep models

In this section we consider deep rectifier models with $n_{0}$ input units and hidden layers of width $n=2n_{0}$ . This restriction allows us to construct a very efficient deep model in terms of number of response regions. The analysis that we provide in this section complements the results from the previous section, showing that rectifier MLPs can compute functions with many response regions, even when defined with relatively few hidden layers.

Let us assume we have a $2$ -dimensional input, i.e., $n_{0}=2$ , and a layer of $n=4$ rectifiers $f_{1}$ , $f_{2}$ , $f_{3}$ , and $f_{4}$ , followed by a linear projection. We construct the rectifier layer in such a way that it divides the input space into four ‘square’ cones; each of them corresponding to the inputs where two of the rectifier units are active. We define the four rectifiers as:

The absolute-value unit $g_{i}$ divides the input space along the $i$ -th coordinate axis, taking values which are symmetric about that axis. The combination of $g_{1}$ and $g_{2}$ is then a function with four regions of linearity;

Since the values of $g_{i}$ are symmetric about the $i$ -th coordinate axis, each point $\mathbf{x}\in\mathcal{S}_{i}$ has a corresponding point $\mathbf{y}\in\mathcal{S}_{j}$ with $\mathbf{g}(\mathbf{x})=\mathbf{g}(\mathbf{y})$ , for all $i$ and $j$ .

We can apply the same procedure to the image of $\left[g_{1},g_{2}\right]$ to recursively divide the input space, as illustrated in Fig. 8. For instance, if we apply this procedure one more time, we get four regions within each $\mathcal{S}_{i}$ , resulting in $16$ regions in total, within the input space. On the last layer, we may place rectifiers in any way suitable for the task of interest (e.g., classification). The partition computed by the last layer will be copied to each of the input space regions that produced the same input for the last layer. Fig. 9 shows a function that can be implemented efficiently by a deep model using the previous observations.

The foregoing discussion can be easily generalized to $n_{0}>2$ input variables and $k$ hidden layers, each consisting of $2n_{0}$ rectifiers. In that case, the maximal number of linear regions of functions computable by the network is lower-bounded as follows.

The maximal number of regions of linearity of functions computable by a rectifier neural network with $n_{0}$ input variables and $k$ hidden layers of width $2n_{0}$ is at least $2^{(k-1)n_{0}}\sum_{j=0}^{n_{0}}{2n_{0}\choose j}$ .

The theorem shows that even for a small number of layers $k$ , we can have many more linear regions in a deep model than in a shallow one. For example, if we set the input dimensionality to $n_{0}=2$ , a shallow model with $4n_{0}$ units will have at most $37$ linear regions. The equivalent deep model with two layers of $2n_{0}$ units can produce $44$ linear regions. For $6n_{0}$ hidden units the shallow model computes at most $79$ regions, while the equivalent three layer model can compute $176$ regions.

Discussion and conclusions

In this paper we introduced a novel way of understanding the expressiveness of neural networks with piecewise linear activations. We count the number of regions of linearity, also called response regions, of the functions that they can represent. The number of response regions tells us how well the models can approximate arbitrary curved shapes. Computational Geometry provides us the tool to make such statements.

We found that deep and narrow rectifier MPLs can generate many more regions of linearity than their shallow counterparts with the same number of computational units or of parameters. We can express this in terms of the ratio between the maximal number of response regions and the number of parameters of both model classes. For a deep model with $n_{0}=O(1)$ inputs and $k$ hidden layers of width $n$ , the maximal number of response regions per parameter behaves as

For a shallow model with $n_{0}=O(1)$ inputs, the maximal number of response regions per parameter behaves as

We see that the deep model can generate many more response regions per parameter than the shallow model; exponentially more regions per parameter in terms of the number of hidden layers $k$ , and at least order $(k-2)$ polynomially more regions per parameter in terms of the layer width $n$ . In particular, there are deep models which use fewer parameters to produce more linear regions than their shallow counterparts. Details about the asymptotic expansions are given in the Appendix A.

In this paper we only considered linear output units, but this is not a restriction, as the output activation itself is not parametrized. If there is a target function $f_{\text{targ}}$ that we want to model with a rectifier MLP with $\sigma$ as its output activation function, then there exists a function $f^{\prime}_{\text{targ}}$ such that $\sigma(f^{\prime}_{\text{targ}})=f_{\text{targ}}$ , when $\sigma$ has an inverse (e.g., with sigmoid), $f^{\prime}_{\text{targ}}=\sigma^{-1}(f_{\text{targ}})$ . For activations that do not have an inverse, like softmax, there are infinitely many functions $f^{\prime}_{\text{targ}}$ that work. We just need to pick one, e.g., for softmax we can pick $\log(f_{\text{targ}})$ . By analyzing how well we can model $f^{\prime}_{\text{targ}}$ with a linear output rectifier MLP we get an indirect measure of how well we can model $f_{\text{targ}}$ with an MLP that has $\sigma$ as its output activation.

Another interesting observation is that we recover a high ratio of $n$ to $n_{0}$ if the data lives near a low-dimensional manifold (effectively like reducing the input size $n_{0}$ ). One-layer models can reach the upper bound of response regions only by spanning all the dimensions of the input. In other words, shallow models are not capable of concentrating linear response regions in any lower dimensional subspace of the input. If, as commonly assumed, data lives near a low dimensional manifold, then we care only about the number of response regions that a model can generate in the directions of the data manifold. One way of thinking about this is principal component analysis (PCA), where one finds that only few input space directions (say on the MNIST database) are relevant to the underlying data. In such a situation, one cares about the number of response regions that a model can generate only within the directions in which the data does change. In such situations $n\gg n_{0}$ , and our results show a clear advantage of using deep models.

We believe that the proposed framework can be used to answer many other interesting questions about these models. For example, one can look at how the number of response regions is affected by different constraints of the model, like shared weights. We think that this approach can also be used to study other kinds of piecewise linear models, such as convolutional networks with rectifier units or maxout networks, or also for comparing between different piecewise linear models.

Appendix A Asymptotic

Here we derive asymptotic expressions of the formulas contained in Proposition 2 and Theorem 1. We use following standard notation:

$f(n)=O(g(n))$ means that there is a positive constant $c_{2}$ such that $f(n)\leq c_{2}g(n)$ for all $n$ larger than some $N$ .

$f(n)=\Theta(g(n))$ means that there are two positive constants $c_{1}$ and $c_{2}$ such that $c_{1}g(n)\leq f(n)\leq c_{2}g(n)$ for all $n$ larger than some $N$ .

$f(n)=\Omega(g(n))$ means that there is a positive constant $c_{1}$ such that $f(n)\geq c_{1}g(n)$ for all $n$ larger than some $N$ .

Consider a single layer rectified MLP with $kn$ units and $n_{0}$ inputs. Then the maximal number of regions of linearity of the functions represented by this network is

Consider a k layer rectified MLP with hidden layers of width $n$ and $n_{0}$ inputs. Then the maximal number of regions of linearity of the functions represented by this network satisfies

Here only the asymptotic expressions remain to be shown. It is known that

When $n_{0}$ is constant, $n_{0}=O(1)$ , we have that

We now analyze the number of response regions as a function of the number of parameters. When $k$ and $n_{0}$ are fixed, then $\left\lfloor{n}/{n_{0}}\right\rfloor^{k-1}$ grows polynomially in $n$ , and $k^{n_{0}}$ is constant. On the other hand, when $n$ is fixed with $n>2n_{0}$ , then $\left\lfloor{n}/{n_{0}}\right\rfloor^{k-1}$ grows exponentially in $k$ , and $k^{n_{0}}$ grows polynomially in $k$ .

The number of parameters of a deep model with $n_{0}=O(1)$ inputs, $n_{\operatorname{out}}=O(1)$ outputs, and $k$ hidden layers of width $n$ is

The number of parameters of a shallow model with $n_{0}=O(1)$ inputs, $n_{\operatorname{out}}=O(1)$ outputs, and $kn$ hidden units is

For the deep model, each layer, except the first and last, has an input weight matrix with $n^{2}$ entries and a bias vector of length $n$ . This gives a total of $(k-1)n^{2}+(k-1)n$ parameters. The first layer has $nn_{0}$ input weights and $n$ bias. The output layer has $nn_{out}$ input weight matrix and $n_{\operatorname{out}}$ bias. If we sum these together we get

For the shallow model, the hidden layer has $knn_{0}$ input weights and $kn$ bias. The output weights has $knn_{\operatorname{out}}$ input weights and $n_{\operatorname{out}}$ bias. Summing these together we get

The number of linear regions per parameter can be given as follows.

Consider a fixed number of inputs $n_{0}$ and a fixed number of outputs $n_{\operatorname{out}}$ . The maximal ratio of the number of response regions to the number of parameters of a deep model with $k$ layers of width $n$ is

In the case of a shallow model with $kn$ hidden units, the ratio is

This is by combining Proposition 6 and Proposition 7. ∎

We see that fixing the number of parameters, deep models can compute functions with many more regions of linearity that those computable by shallow models. The ratio is exponential in the number of hidden layers $k$ and thus in the number of hidden units.

We would like to thank KyungHyun Cho, Çağlar Gülçehre, and anonymous ICLR reviewers for their comments. Razvan Pascanu is supported by a DeepMind Fellowship.