On the number of response regions of deep feed forward networks with piece-wise linear activations

Razvan Pascanu, Guido Montufar, Yoshua Bengio

Introduction

Deep systems are believed to play an important role in information processing of intelligent agents. A common hypothesis underlying this belief is that deep models can be exponentially more efficient at representing some functions than their shallow counterparts (see Bengio, 2009).

The argument is usually a compositional one. Higher layers in a deep model can re-use primitives constructed by the lower layers in order to build gradually more complex functions. For example, on a vision task, one would hope that the first layer learns Gabor filters capable to detect edges of different orientation. These edges are then put together at the second layer to form part-of-object shapes. On higher layers, these part-of-object shapes are combined further to obtain detectors for more complex part-of-object shapes or objects. Such a behaviour is empirically illustrated, for instance, in Zeiler and Fergus (2013); Lee et al. (2009). On the other hand, a shallow model has to construct detectors of target objects based only on the detectors learnt by the first layer.

The representational power of computational systems with shallow and deep architectures has been studied intensively. A well known result Hajnal et al. (1993) derived lower complexity bounds for shallow threshold networks. Other works have explored the representational power of generative models based on Boltzmann machines Montúfar et al. (2011); Martens et al. (2013) and deep belief networks (Sutskever and Hinton, 2008; Le Roux and Bengio, 2010; Montúfar and Ay, 2011), or have compared mixtures and products of experts models (Montúfar and Morton, 2012).

In addition to such inspections, a wealth of evidence for the validity of this hypothesis comes from deep models consistently outperforming shallow ones on a variety of tasks and datasets (see, e.g., Goodfellow et al., 2013; Hinton et al., 2012b, a). However, theoretical results on the representational power of deep models are limited, usually due to the composition of nonlinear functions in deep models, which makes mathematical analysis difficult. Up to now, theoretical results have focussed on circuit operations (neural net unit computations) that are substantially different from those being used in real state-of-the-art deep learning applications, such as logic gates (Håstad, 1986), linear + threshold units with non-negative weights (Håstad and Goldmann, 1991) or polynomials (Bengio and Delalleau, 2011). Bengio and Delalleau (2011) show that deep sum-product networks (Poon and Domingos, 2011) can use exponentially less nodes to express some families of polynomials compared to the shallow ones.

The present note analyzes the representational power of deep MLPs with rectifier units. Rectifier units (Glorot et al., 2011; Nair and Hinton, 2010) and piecewise linearly activated units in general (like the maxout unit (Goodfellow et al., 2013)), are becoming popular choices in designing deep models, and most current state-of-the-art results involve using one of such activations (Goodfellow et al., 2013; Hinton et al., 2012b). Glorot et al. (2011) show that rectifier units have several properties that make the optimization problem easier than the more traditional case using smooth and bounded activations, such as tanh or sigmoid.

In this work we take advantage of the piecewise linear nature of the rectifier unit to mathematically analyze the behaviour of deep rectifier MLPs. Given that the model is a composition of piecewise linear functions, it is itself a piecewise linear function. We compare the flexibility of a deep model with that of a shallow model by counting the number of linear regions they define over the input space for a fixed number of hidden units. This is the number of pieces available to the model in order to approximate some arbitrary nonlinear function. For example, if we want to perfectly approximate some curved boundary between two classes, a rectifier MLP will have to use infinitely many linear regions. In practice we have a finite number of pieces, and if we assume that we can perfectly learn their optimal slopes, then the number of linear regions becomes a good proxy for how well the model approximates this boundary. In this sense, the number of linear regions is an upper bound for the flexibility of the model. In practice, the linear pieces are not independent and the model may not be able to learn the right slope for each linear region. Specifically, for deep models there is a correlation between regions, which results from the sharing of parameters between the functions that describe the output on each region.

This is by no means a negative observation. If all the linear regions of the deep model were independent of each other, by having many more linear regions, deep models would grossly overfit. The correlation of the linear regions of a deep model results in its ability to generalize, by allowing it to better represent only a small family of structured functions. These are functions that look complicated (e.g., a distribution with a huge number of modes) but that have an underlying structure that the network can ‘compress’ into its parameters. The number of regions, which indicates the number of variations that the network can represent, provides a measure of how well it can fit this family of structured functions (whose approximation potentially needs infinitely many linear regions).

We believe that this approach, based on counting the number of linear regions, is extensible to any other piecewise linear activation function and also to other architectures, including the maxout activation and the convolutional networks with rectifier activations.

We know the maximal number of regions of linearity of functions computable by a shallow model with a fixed number of hidden units. This number is given by a well studied geometrical problem. The main insight of the present work is to provide a geometrical construction that describes the regions of linearity of functions computed by deep models. We show that in the asymptotic regime, these functions have many more linear regions than the ones computed by shallow models, for the same number of hidden units.

For the single layer case, each hidden unit divides the input space in two, whereby the boundary is given by a hyperplane. For all input values on one side of the hyperplane, the unit outputs a positive value. For all input values on the other side of the hyperplane, the unit outputs . Therefore, the question that we are asking is: Into how many regions do nn hyperplanes split space? This question is studied in geometry under the name of hyperplane arrangements, with classic results such as Zaslavsky’s theorem. Section 3 provides a quick introduction to the subject.

For the multilayer version of the model we rely on the following intuition. By using the rectifier nonlinearity, we identify multiple regions of the input space which are mapped by a given layer into an equivalent set of activations and represent thus equivalent inputs for the next layers. That is, a hidden layer can perform a kind of or operation by reacting similarly to several different inputs. Any subsequent computation made on these activations is replicated on all equivalent inputs.

This paper is organized as follows. In Section 2 we provide definitions and basic observations about piecewise linear functions. In Section 3 we discuss rectifier networks with one single hidden layer and describe their properties in terms of hyperplane arrangements which are fairly well known in the literature. In Section 4 we discuss deep rectifier networks and prove our main result, Theorem 1, which describes their complexity in terms of the number of regions of linearity of functions that they represent. Details about the asymptotic behaviour of the results derived in Sections 3 and 4 are given in the Appendix A. In Section 5 we analyze a special type of deep rectifier MLP and show that even for a small number of hidden layers it can generate a large number of linear regions. In Section 6 we offer a discussion of the results.

Preliminaries

We consider classes of functions (models) defined in the following way.

A rectifier feedforward network is a layered feedforward network, or multilayer perceptron (MLP), as shown in Fig. 1, with following properties. Each hidden unit receives as inputs the real valued activations x1,,xnx_{1},\ldots,x_{n} of all units in the previous layer, computes the weighted sum

The real parameters w1,,wnw_{1},\ldots,w_{n} are the input weights and bb is the bias of the unit. The output layer is a linear layer, that is, the units in the last layer compute a linear combination of their inputs and output it unrectified.

In the remainder of this section we state three simple lemmas.

The next lemma states that a piecewise linear function f=(fi)i[k]f=(f_{i})_{i\in[k]} has as many regions of linearity as there are distinct intersections of regions of linearity of the coordinates fif_{i}.

In regard to the number of regions of linearity of the functions represented by rectifier networks, the number of output dimensions, i.e., the number of linear output units, is irrelevant. This is the statement of the next lemma.

The number of (linear) output units of a rectifier feedforward network does not affect the maximal number of regions of linearity that it can realize.

If the number of regions of ff is finite, then the number of differences of gradients is finite and there is a vector outside the union of their orthogonal spaces. Hence a matrix with a single row (a single output unit) suffices to capture all transitions between different regions of linearity of ff. ∎

One hidden layer

Let us look at the number of response regions of a single hidden layer MLP with n0n_{0} input units and nn hidden units. We first formulate the rectifier unit as follows:

The general answer to Problem 1 is given by Zaslavsky’s theorem (Zaslavsky, 1975, Theorem A), which is one of the central results from the theory of hyperplane arrangements.

We will only need the special case of hyperplanes in general position, which realize the maximal possible number of regions. Formally, an nn-dimensional arrangement A\mathcal{A} is in general position if for any subset {H1,Hp}A\{H_{1},\ldots H_{p}\}\subseteq\mathcal{A} the following holds. (1) ​If pnp\leq n, then dim(H1Hp)=np\dim(H_{1}\cap\cdots\cap H_{p})=n-p. (2) ​If p>np>n, then H1Hp=H_{1}\cap\cdots\cap H_{p}=\emptyset. An arrangement is in general position if the weights W\mathbf{W}, b\mathbf{b} defining its hyperplanes are generic. This means that any arrangement can be perturbed by an arbitrarily small perturbation in such a way that the resulting arrangement is in general position.

For arrangements in general position, Zaslavsky’s theorem can be stated in the following way (see Stanley, 2004, Proposition 2.4).

In particular, the number of regions of a 22-dimensional arrangement Am\mathcal{A}_{m} of mm lines in general position is equal to

For the purpose of illustration, we sketch a proof of eq. (4) using the sweep hyperplane method. We proceed by induction over the number of lines mm.

Base case m=0m=0. It is obvious that in this case there is a single region, corresponding to the entire plane. Therefore, r(A0)=1r(\mathcal{A}_{0})=1.

Induction step. Assume that for mm lines the number of regions is r(Am)=(m2)+m+1r(\mathcal{A}_{m})={m\choose 2}+m+1, and add a new line Lm+1L_{m+1} to the arrangement. Since we assumed the lines are in general position, Lm+1L_{m+1} intersects each of the existing lines LkL_{k} at a different point. Fig. 2 depicts the situation for m=2m=2.

The mm intersection points split the line Lm+1L_{m+1} into m+1m+1 segments. Each of these segments cuts a region of Am\mathcal{A}_{m} in two pieces. Therefore, by adding the line Lm+1L_{m+1} we get m+1m+1 new regions. In Fig. 2 the two intersection points result in three segments that split each of the regions R1,R01,R0R_{1},R_{01},R_{0} in two. Hence

For the number of response regions of MLPs with one single hidden layer we obtain the following.

The regions of linearity of a function in the model F(n0,n1,1)\mathcal{F}_{(n_{0},n_{1},1)} with n0n_{0} inputs and n1n_{1} hidden units are given by the regions of an arrangement of n1n_{1} hyperplanes in n0n_{0}-dimensional space. The maximal number of regions of such an arrangement is R(n0,n1,ny)=j=0n0(n1j)\mathcal{R}(n_{0},n_{1},n_{y})=\sum_{j=0}^{n_{0}}{n_{1}\choose j}.

This is a consequence of Lemma 1. The maximal number of regions is produced by an n0n_{0}-dimensional arrangement of n1n_{1} hyperplanes in general position, which is given in Proposition 1. ∎

Multiple hidden layers

In order to show that a kk hidden layer model can be more expressive than a single hidden layer one with the same number of hidden units, we will need the next three propositions.

Any arrangement can be scaled down and shifted such that all regions of the arrangement intersect the unit ball.

The proposition is illustrated in Fig. 3.

We need some additional notations in order to formulate the next proposition. Given a hyperplane H={x ⁣:wx+b=0}H=\{\mathbf{x}\colon\mathbf{w}^{\top}\mathbf{x}+b=0\}, we consider the region H={x ⁣:wx+b<0}H^{-}=\{\mathbf{x}\colon\mathbf{w}^{\top}\mathbf{x}+b<0\}, and the region H+={x ⁣:wx+b0}H^{+}=\{\mathbf{x}\colon\mathbf{w}^{\top}\mathbf{x}+b\geq 0\}. If we think about the corresponding rectifier unit, then H+H^{+} is the region where the unit is active and HH^{-} is the region where the unit is dead.

Let RR be a region delimited by the hyperplanes {H1,,Hn}\{H_{1},\ldots,H_{n}\}. We denote by R+{1,,n}R^{+}\subseteq\{1,\ldots,n\} the set of all hyperplane-indices jj with RHj+R\subset H^{+}_{j}. In other words, R+R^{+} is the list of hidden units that are active (non-zero) in the input-space region RR.

The following proposition describes the combinatorics of 22-dimensional arrangements in general position. More precisely, the proposition describes the combinatorics of nn-dimensional arrangements with 22-dimensional essentialization in general position. Recall that the essentialization of an arrangement is the arrangement that it defines in the subspace spanned by the normals of its hyperplanes.

The proposition guarantees the existence of input weights and bias for a rectifier layer such that for any list of consecutive units, there is a region of inputs for which exactly the units from that list are active.

We show that the hyperplanes of a 22-dimensional arrangement in general position can be indexed in such a way that the claim of the proposition holds. For higher dimensional arrangements the statement follows trivially, applying the 22-dimensional statement to the intersection of the arrangement with a 22-subspace.

Now assume that there is an arrangement of nn lines with the claimed properties. To add an (n+1)(n+1)-th line, we first consider the maximal distance dmaxd_{\text{max}} from the origin to the intersection of two lines LiLjL_{i}\cap L_{j} with 1i<j<n1\leq i<j<n. We also consider the radius-(dmax+r)(d_{\text{max}}+r) circle Sn\mathcal{S}_{n} centered at the origin. The circle Sn\mathcal{S}_{n} contains all intersection of any of the first nn lines. We now choose an angle αn\alpha_{n} with 0<αn<αn10<\alpha_{n}<\alpha_{n-1} and define Ln+1L_{n+1} as the tangent of Sn\mathcal{S}_{n} that forms an angle αn\alpha_{n} with the y-axis. Fig. 4 depicts adding the third and fourth line to the arrangement.

After adding line Ln+1L_{n+1}, we have that the arrangement

has regions R1,,Rn+1R_{1}^{\prime},\ldots,R_{n+1}^{\prime} with Ri+={i,i+1,,n+1}R_{i}^{\prime+}=\{i,i+1,\ldots,n+1\} for all i[n+1]i\in[n+1].

The regions of the arrangement are stable under perturbation of the angles and radii used to define the lines. Any slight perturbation of these parameters preserves the list of regions. Therefore, the arrangement is in general position.

The second property comes from the order in which Ln+1L_{n+1} intersects all previous lines. Ln+1L_{n+1} intersects the lines in the order in which they were added to the arrangement: L1,L2,,LnL_{1},L_{2},\ldots,L_{n}. The intersection of Ln+1L_{n+1} and LiL_{i}, Bin+1=Ln+1LiB_{i{n+1}}=L_{n+1}\cap L_{i}, is above the lines Li+1,Li+2,,LnL_{i+1},L_{i+2},\ldots,L_{n}, and hence the segment Bi1n+1Bin+1B_{{i-1}{n+1}}B_{i{n+1}} between the intersection with Li1L_{i-1} and with LiL_{i}, has to cut the region in which only units ii to nn are active.

The intersection order is ensured by the choice of angles αi\alpha_{i} and the fact that the lines are tangent to the circles Si\mathcal{S}_{i}. For any i<ji<j and Bij=LiLjB_{ij}=L_{i}\cap L_{j} let TijT_{ij} be the line parallel to the y-axis passing through BijB_{ij}. Each line TijT_{ij} divides the space in two. Let HijH_{ij} be the half-space to the right of TijT_{ij}. Within any half-space HijH_{ij}, the intersection HijLiH_{ij}\cap L_{i} is above HijLjH_{ij}\cap L_{j}, because the angle αi1\alpha_{i-1} of LiL_{i} with the y-axis is larger than αj1\alpha_{j-1} (this means LjL_{j} has a stepper decrease). Since Ln+1L_{n+1} is tangent to the circle that contains all points BijB_{ij}, the line Ln+1L_{n+1} will intersect lines LiL_{i} and LjL_{j} in HijH_{ij}, and therefore it has to intersect LiL_{i} first.

The next proposition guarantees the existence of a collection of affine maps with shared bias, which map a collection of regions to a common output.

We can now set c\mathbf{c} to be minus the common center of the scaled balls, to obtain the map:

These gig_{i} satisfy claimed property, namely that gi(Ri)g_{i}(R_{i}) contains the unit ball, for all ii. ∎

Before proceeding, we discuss an example illustrating how the previous propositions and lemmas are put together to prove our main result below, in Theorem 1.

Within each input region RiR_{i}, only two units in the first hidden layer are active. Therefore, for each input region RiR_{i}, the output of the intermediary layer is an affine transformation of SiS_{i}. Furthermore, the weights of the intermediary layer can be chosen in such a way that the image of each RiR_{i} contains the unit ball.

Now, ff^{\prime} is the map computed by a rectifier layer with 22 inputs and nn outputs. It is possible to define this map in such a way that it has R\mathcal{R} regions of linearity within the unit ball, where R\mathcal{R} is the number of regions of a 22-dimensional arrangement of nn hyperplanes in general position.

Now we are ready to state our main result on the number of response regions of rectifier deep feedforward networks:

A model with n0n_{0} inputs and kk hidden layers of widths n1,n2,,nkn_{1},n_{2},\ldots,n_{k} can divide the input space in (i=1k1nin0)i=0n0(nki)\left(\prod_{i=1}^{k-1}\left\lfloor\frac{n_{i}}{n_{0}}\right\rfloor\right)\sum_{i=0}^{n_{0}}{n_{k}\choose i} or possibly more regions.

To be more specific, for an input vectors from R1R_{1}, exactly the first n0n_{0} units of the first hidden layer are active, that is, for these input vectors the value of hj(1)\mathbf{h}^{(1)}_{j} is non-zero if and only if jI1={1,,n0}j\in I_{1}=\{1,\ldots,n_{0}\}. For input vectors from R2R_{2}, only the next n0n_{0} units of the first hidden layer are active, that is, the units with index in I2={n0+1,,2n0}I_{2}=\{n_{0}+1,\ldots,2n_{0}\}, and so on.

Now we consider a ‘fictitious’ intermediary layer consisting of n0n_{0} linear units between the first and second hidden layers. As this intermediary layer computes an affine function, it can be absorbed into the second hidden layer (see Lemma 3). We use it only for making the next arguments clearer.

The weights U(2)\mathbf{U}^{(2)} and c(2)\mathbf{c}^{(2)} describe the affine function computed by the intermediary layer, xU(2)x+c\mathbf{x}\mapsto\mathbf{U}^{(2)}\mathbf{x}+\mathbf{c}. The weights V(2)\mathbf{V}^{(2)} and d(2)\mathbf{d}^{(2)} are the input and bias weights of the rectifier layer following the intermediary layer.

Let SiS_{i} be the image of the input-space region RiR_{i} by the first hidden layer. By Proposition 5, there is a choice of the weights Ui(2)\mathbf{U}^{(2)}_{i} and bias c(2)\mathbf{c}^{(2)} such that the image of SiS_{i} by xUi(2)(x)+c(2)\mathbf{x}\to\mathbf{U}^{(2)}_{i}(\mathbf{x})+\mathbf{c}^{(2)} contains the n0n_{0}-dimensional unit ball. Now, for all inputs vectors from RiR_{i}, only the units IiI_{i} of the first hidden layer are active. Therefore, gRi=giRi+c(2)g|_{R_{i}}=g_{i}|_{R_{i}}+\mathbf{c}^{(2)}. This implies that the image g(Ri)g(R_{i}) of the input-space region RiR_{i} by the intermediary layer contains the unit ball, for all i[p]i\in[p].

In consequence, the map from input-space to activations of the second hidden layer has r(A)r(\mathcal{A}) regions of linearity within each input-space region RiR_{i}. Fig. 6 illustrates the situation. All inputs that are mapped to the same activation of the first hidden layer, are treated as equivalent on the subsequent layers. In this sense, an arrangement A\mathcal{A} defined on the set of common outputs of R1,,RpR_{1},\ldots,R_{p} at the first hidden layer, is ‘replicated’ in each input region R1,,RpR_{1},\ldots,R_{p}.

The subsequent layers of the network can be analyzed in a similar way as done above for the first two layers. In particular, the weights V(2)\mathbf{V}^{(2)} and d(2)\mathbf{d}^{(2)} can be chosen in such a way that they define an arrangement with the properties from Proposition 4. Then, the map taking activations from the second hidden layer to activations from the third hidden layer, can be analyzed by considering again a fictitious intermediary layer between the second and third layers, and so forth, as done above.

For the last hidden layer we choose the input weights V(k)\mathbf{V}^{(k)} and bias d(k)\mathbf{d}^{(k)} defining an n0n_{0}-dimensional arrangement of nkn_{k} hyperplanes in general position. The map of inputs to activations of the last hidden layer has thus (i=1k1nin0)i=0n0(nki)\left(\prod_{i=1}^{k-1}\left\lfloor\frac{n_{i}}{n_{0}}\right\rfloor\right)\sum_{i=0}^{n_{0}}{n_{k}\choose i} regions of linearity. This number is a lower bound on the maximal number of regions of linearity of functions computable by the network. This completes the proof. The intuition of the construction is illustrated in Fig. 7. ∎

In the Appendix A we derive an asymptotic expansion of the bound given in Theorem 1.

A special class of deep models

In this section we consider deep rectifier models with n0n_{0} input units and hidden layers of width n=2n0n=2n_{0}. This restriction allows us to construct a very efficient deep model in terms of number of response regions. The analysis that we provide in this section complements the results from the previous section, showing that rectifier MLPs can compute functions with many response regions, even when defined with relatively few hidden layers.

Let us assume we have a 22-dimensional input, i.e., n0=2n_{0}=2, and a layer of n=4n=4 rectifiers f1f_{1}, f2f_{2}, f3f_{3}, and f4f_{4}, followed by a linear projection. We construct the rectifier layer in such a way that it divides the input space into four ‘square’ cones; each of them corresponding to the inputs where two of the rectifier units are active. We define the four rectifiers as:

The absolute-value unit gig_{i} divides the input space along the ii-th coordinate axis, taking values which are symmetric about that axis. The combination of g1g_{1} and g2g_{2} is then a function with four regions of linearity;

Since the values of gig_{i} are symmetric about the ii-th coordinate axis, each point xSi\mathbf{x}\in\mathcal{S}_{i} has a corresponding point ySj\mathbf{y}\in\mathcal{S}_{j} with g(x)=g(y)\mathbf{g}(\mathbf{x})=\mathbf{g}(\mathbf{y}), for all ii and jj.

We can apply the same procedure to the image of [g1,g2]\left[g_{1},g_{2}\right] to recursively divide the input space, as illustrated in Fig. 8. For instance, if we apply this procedure one more time, we get four regions within each Si\mathcal{S}_{i}, resulting in 1616 regions in total, within the input space. On the last layer, we may place rectifiers in any way suitable for the task of interest (e.g., classification). The partition computed by the last layer will be copied to each of the input space regions that produced the same input for the last layer. Fig. 9 shows a function that can be implemented efficiently by a deep model using the previous observations.

The foregoing discussion can be easily generalized to n0>2n_{0}>2 input variables and kk hidden layers, each consisting of 2n02n_{0} rectifiers. In that case, the maximal number of linear regions of functions computable by the network is lower-bounded as follows.

The maximal number of regions of linearity of functions computable by a rectifier neural network with n0n_{0} input variables and kk hidden layers of width 2n02n_{0} is at least 2(k1)n0j=0n0(2n0j)2^{(k-1)n_{0}}\sum_{j=0}^{n_{0}}{2n_{0}\choose j}.

The theorem shows that even for a small number of layers kk, we can have many more linear regions in a deep model than in a shallow one. For example, if we set the input dimensionality to n0=2n_{0}=2, a shallow model with 4n04n_{0} units will have at most 3737 linear regions. The equivalent deep model with two layers of 2n02n_{0} units can produce 4444 linear regions. For 6n06n_{0} hidden units the shallow model computes at most 7979 regions, while the equivalent three layer model can compute 176176 regions.

Discussion and conclusions

In this paper we introduced a novel way of understanding the expressiveness of neural networks with piecewise linear activations. We count the number of regions of linearity, also called response regions, of the functions that they can represent. The number of response regions tells us how well the models can approximate arbitrary curved shapes. Computational Geometry provides us the tool to make such statements.

We found that deep and narrow rectifier MPLs can generate many more regions of linearity than their shallow counterparts with the same number of computational units or of parameters. We can express this in terms of the ratio between the maximal number of response regions and the number of parameters of both model classes. For a deep model with n0=O(1)n_{0}=O(1) inputs and kk hidden layers of width nn, the maximal number of response regions per parameter behaves as

For a shallow model with n0=O(1)n_{0}=O(1) inputs, the maximal number of response regions per parameter behaves as

We see that the deep model can generate many more response regions per parameter than the shallow model; exponentially more regions per parameter in terms of the number of hidden layers kk, and at least order (k2)(k-2) polynomially more regions per parameter in terms of the layer width nn. In particular, there are deep models which use fewer parameters to produce more linear regions than their shallow counterparts. Details about the asymptotic expansions are given in the Appendix A.

In this paper we only considered linear output units, but this is not a restriction, as the output activation itself is not parametrized. If there is a target function ftargf_{\text{targ}} that we want to model with a rectifier MLP with σ\sigma as its output activation function, then there exists a function ftargf^{\prime}_{\text{targ}} such that σ(ftarg)=ftarg\sigma(f^{\prime}_{\text{targ}})=f_{\text{targ}}, when σ\sigma has an inverse (e.g., with sigmoid), ftarg=σ1(ftarg)f^{\prime}_{\text{targ}}=\sigma^{-1}(f_{\text{targ}}). For activations that do not have an inverse, like softmax, there are infinitely many functions ftargf^{\prime}_{\text{targ}} that work. We just need to pick one, e.g., for softmax we can pick log(ftarg)\log(f_{\text{targ}}). By analyzing how well we can model ftargf^{\prime}_{\text{targ}} with a linear output rectifier MLP we get an indirect measure of how well we can model ftargf_{\text{targ}} with an MLP that has σ\sigma as its output activation.

Another interesting observation is that we recover a high ratio of nn to n0n_{0} if the data lives near a low-dimensional manifold (effectively like reducing the input size n0n_{0}). One-layer models can reach the upper bound of response regions only by spanning all the dimensions of the input. In other words, shallow models are not capable of concentrating linear response regions in any lower dimensional subspace of the input. If, as commonly assumed, data lives near a low dimensional manifold, then we care only about the number of response regions that a model can generate in the directions of the data manifold. One way of thinking about this is principal component analysis (PCA), where one finds that only few input space directions (say on the MNIST database) are relevant to the underlying data. In such a situation, one cares about the number of response regions that a model can generate only within the directions in which the data does change. In such situations nn0n\gg n_{0}, and our results show a clear advantage of using deep models.

We believe that the proposed framework can be used to answer many other interesting questions about these models. For example, one can look at how the number of response regions is affected by different constraints of the model, like shared weights. We think that this approach can also be used to study other kinds of piecewise linear models, such as convolutional networks with rectifier units or maxout networks, or also for comparing between different piecewise linear models.

Appendix A Asymptotic

Here we derive asymptotic expressions of the formulas contained in Proposition 2 and Theorem 1. We use following standard notation:

f(n)=O(g(n))f(n)=O(g(n)) means that there is a positive constant c2c_{2} such that f(n)c2g(n)f(n)\leq c_{2}g(n) for all nn larger than some NN.

f(n)=Θ(g(n))f(n)=\Theta(g(n)) means that there are two positive constants c1c_{1} and c2c_{2} such that c1g(n)f(n)c2g(n)c_{1}g(n)\leq f(n)\leq c_{2}g(n) for all nn larger than some NN.

f(n)=Ω(g(n))f(n)=\Omega(g(n)) means that there is a positive constant c1c_{1} such that f(n)c1g(n)f(n)\geq c_{1}g(n) for all nn larger than some NN.

Consider a single layer rectified MLP with knkn units and n0n_{0} inputs. Then the maximal number of regions of linearity of the functions represented by this network is

Consider a k layer rectified MLP with hidden layers of width nn and n0n_{0} inputs. Then the maximal number of regions of linearity of the functions represented by this network satisfies

Here only the asymptotic expressions remain to be shown. It is known that

When n0n_{0} is constant, n0=O(1)n_{0}=O(1), we have that

We now analyze the number of response regions as a function of the number of parameters. When kk and n0n_{0} are fixed, then n/n0k1\left\lfloor{n}/{n_{0}}\right\rfloor^{k-1} grows polynomially in nn, and kn0k^{n_{0}} is constant. On the other hand, when nn is fixed with n>2n0n>2n_{0}, then n/n0k1\left\lfloor{n}/{n_{0}}\right\rfloor^{k-1} grows exponentially in kk, and kn0k^{n_{0}} grows polynomially in kk.

The number of parameters of a deep model with n0=O(1)n_{0}=O(1) inputs, nout=O(1)n_{\operatorname{out}}=O(1) outputs, and kk hidden layers of width nn is

The number of parameters of a shallow model with n0=O(1)n_{0}=O(1) inputs, nout=O(1)n_{\operatorname{out}}=O(1) outputs, and knkn hidden units is

For the deep model, each layer, except the first and last, has an input weight matrix with n2n^{2} entries and a bias vector of length nn. This gives a total of (k1)n2+(k1)n(k-1)n^{2}+(k-1)n parameters. The first layer has nn0nn_{0} input weights and nn bias. The output layer has nnoutnn_{out} input weight matrix and noutn_{\operatorname{out}} bias. If we sum these together we get

For the shallow model, the hidden layer has knn0knn_{0} input weights and knkn bias. The output weights has knnoutknn_{\operatorname{out}} input weights and noutn_{\operatorname{out}} bias. Summing these together we get

The number of linear regions per parameter can be given as follows.

Consider a fixed number of inputs n0n_{0} and a fixed number of outputs noutn_{\operatorname{out}}. The maximal ratio of the number of response regions to the number of parameters of a deep model with kk layers of width nn is

In the case of a shallow model with knkn hidden units, the ratio is

This is by combining Proposition 6 and Proposition 7. ∎

We see that fixing the number of parameters, deep models can compute functions with many more regions of linearity that those computable by shallow models. The ratio is exponential in the number of hidden layers kk and thus in the number of hidden units.

We would like to thank KyungHyun Cho, Çağlar Gülçehre, and anonymous ICLR reviewers for their comments. Razvan Pascanu is supported by a DeepMind Fellowship.

References