Understanding Deep Neural Networks with Rectified Linear Units

Raman Arora, Amitabh Basu, Poorya Mianjy, Anirbit Mukherjee

cs.LG cond-mat.dis-nn cs.AI cs.CC stat.ML

Introduction

Recently, there has been a resurgence of DNNs with the advent of deep learning LeCun et al. (2015). Deep learning, loosely speaking, refers to a suite of computational techniques that have been developed recently for training DNNs. It started with the work of Hinton et al. (2006), which gave empirical evidence that if DNNs are initialized properly (for instance, using unsupervised pre-training), then we can find good solutions in a reasonable amount of runtime. This work was soon followed by a series of early successes of deep learning at significantly improving the state-of-the-art in speech recognition Hinton et al. (2012). Since then, deep learning has received immense attention from the machine learning community with several state-of-the-art AI systems in speech recognition, image classification, and natural language processing based on deep neural nets Hinton et al. (2012); Dahl et al. (2013); Krizhevsky et al. (2012); Le (2013); Sutskever et al. (2014). While there is less of evidence now that pre-training actually helps, several other solutions have since been put forth to address the issue of efficiently training DNNs. These include heuristics such as dropouts Srivastava et al. (2014), but also considering alternate deep architectures such as convolutional neural networks Sermanet et al. (2014), deep belief networks Hinton et al. (2006), and deep Boltzmann machines Salakhutdinov & Hinton (2009). In addition, deep architectures based on new non-saturating activation functions have been suggested to be more effectively trainable – the most successful and widely popular of these is the rectified linear unit (ReLU) activation, i.e., $\sigma(x)=\max\{0,x\}$ , which is the focus of study in this paper.

In this paper, we formally study deep neural networks with rectified linear units; we refer to these deep architectures as ReLU DNNs. Our work is inspired by these recent attempts to understand the reason behind the successes of deep learning, both in terms of the structure of the functions represented by DNNs, Telgarsky (2015; 2016); Kane & Williams (2015); Shamir (2016), as well as efforts which have tried to understand the non-convex nature of the training problem of DNNs better Kawaguchi (2016); Haeffele & Vidal (2015). Our investigation of the function space represented by ReLU DNNs also takes inspiration from the classical theory of circuit complexity; we refer the reader to Arora & Barak (2009); Shpilka & Yehudayoff (2010); Jukna (2012); Saptharishi (2014); Allender (1998) for various surveys of this deep and fascinating field. In particular, our gap results are inspired by results like the ones by Hastad Hastad (1986), Razborov Razborov (1987) and Smolensky Smolensky (1987) which show a strict separation of complexity classes. We make progress towards similar statements with deep neural nets with ReLU activation.

where $\circ$ denotes function composition. The depth of a ReLU DNN is defined as $k+1$ . The width of a ReLU DNN is $\max\{w_{1},\ldots,w_{k}\}$ . The size of the ReLU DNN is $w_{1}+w_{2}+\ldots+w_{k}$ .

Many of our important statements will be phrased in terms of the following simplex.

Let $M>0$ be any positive real number and $p\geq 1$ be any natural number. Define the following set:

Exact characterization of function class represented by ReLU DNNs

Proofs of Theorems 2.2 and 2.3 are provided in Appendix A. We would like to remark that a weaker version of Theorem 2.1 was observed in (Goodfellow et al., 2013, Proposition 4.1) (with no bound on the depth), along with a universal approximation theorem (Goodfellow et al., 2013, Theorem 4.3) similar to Theorem 2.3. The authors of Goodfellow et al. (2013) also used a previous result of Wang (Wang, 2004) for obtaining their result. In a subsequent work Boris Hanin (Hanin, 2017) has, among other things, found a width and depth upper bound for ReLU net representation of positive PWL functions on $^{n}$ . The width upperbound is n+3 for general positive PWL functions and $n+1$ for convex positive PWL functions. For convex positive PWL functions his depth upper bound is sharp if we disallow dead ReLUs.

Benefits of Depth

In fact our family of hard functions described above has a very intricate structure as stated below.

For every $k\geq 1$ , $w\geq 2$ , every member of the family of hard functions in Theorem 3.1 has $w^{k}$ pieces and this family can be parametrized by

i.e., for every point in the set above, there exists a distinct function with the stated properties.

The following is an immediate corollary of Theorem 3.1 by choosing the parameters carefully.

A particularly illuminative special case is obtained by setting $\epsilon=1$ in Corollary 3.3:

We can also get hardness of approximation versions of Theorem 3.1 and Corollaries 3.3 and 3.4, with the same gaps (upto constant terms), using the following theorem.

For every $k\geq 1$ , $w\geq 2$ , there exists a function $f_{k,w}$ that can be represented by a $(k+1)$ -layer ReLU DNN with $w$ nodes in each layer, such that for all $\delta>0$ and $k^{\prime}\leq k$ the following holds:

where $\mathcal{G}_{k^{\prime},\delta}$ is the family of functions representable by ReLU DNNs with depth at most $k^{\prime}+1$ , and size at most $k^{\prime}\frac{w^{k/k^{\prime}}(1-4\delta)^{1/k^{\prime}}}{2^{1+1/k^{\prime}}}$ .

The depth-size trade-off results in Theorems 3.1, and 3.5 extend and improve Telgarsky’s theorems from (Telgarsky, 2015; 2016) in the following three ways:

Telgarsky’s family of hard functions is parameterized by a single natural number $k$ . In contrast, we show that for every pair of natural numbers $w$ and $k$ , and a point from the set in equation 3.1, there exists a “hard” function which to be represented by a depth $k^{\prime}$ network would need a size of at least $w^{\frac{k}{k^{\prime}}}k^{\prime}$ . With the extra flexibility of choosing the parameter $w$ , for the purpose of showing gaps in representation ability of deep nets we can shows size lower bounds which are super-exponential in depth as explained in Corollaries 3.3 and 3.4.

A characteristic feature of the “hard” functions in Boolean circuit complexity is that they are usually a countable family of functions and not a “smooth” family of hard functions. In fact, in the last section of Telgarsky (2015), Telgarsky states this as a “weakness” of the state-of-the-art results on “hard” functions for both Boolean circuit complexity and neural nets research. In contrast, we provide a smoothly parameterized family of “hard” functions in Section 3.1 (parametrized by the set in equation 3.1). Such a continuum of hard functions wasn’t demonstrated before this work.

Similar measures have been studied in previous works Montufar et al. (2014); Pascanu et al. (2013); Raghu et al. (2016). The best known families $\mathcal{H}$ are the ones from Theorem 4 of (Montufar et al., 2014) and a mild generalization of Theorem $1.1$ of (Telgarsky, 2016) to $k$ layers of ReLU activations with width $w$ ; these constructions achieve $\bigg{(}\lfloor(\frac{w}{n})\rfloor\bigg{)}^{(k-1)n}(\sum_{j=0}^{n}{w\choose j})$ and $\textrm{comp}_{\mathcal{H}}(n,k,s)=O(w^{k})$ , respectively. At the end of this section we would explain the precise sense in which we improve on these numbers. An analysis of this complexity measure is done using integer programming techniques in (Serra et al., 2017).

The following results are well-known in the theory of zonotopes (Ziegler, 1995).

$\gamma_{Z({\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{1},\ldots,{\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{m})}({\mathchoice{\mbox{\boldmath$ \displaystyle\bf r $}}{\mbox{\boldmath$ \textstyle\bf r $}}{\mbox{\boldmath$ \scriptstyle\bf r $}}{\mbox{\boldmath$ \scriptscriptstyle\bf r $}}})=\max_{{\mathchoice{\mbox{\boldmath$ \displaystyle\bf x $}}{\mbox{\boldmath$ \textstyle\bf x $}}{\mbox{\boldmath$ \scriptstyle\bf x $}}{\mbox{\boldmath$ \scriptscriptstyle\bf x $}}}\in Z({\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{1},\ldots,{\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{m})}\langle{\mathchoice{\mbox{\boldmath$ \displaystyle\bf r $}}{\mbox{\boldmath$ \textstyle\bf r $}}{\mbox{\boldmath$ \scriptstyle\bf r $}}{\mbox{\boldmath$ \scriptscriptstyle\bf r $}}},{\mathchoice{\mbox{\boldmath$ \displaystyle\bf x $}}{\mbox{\boldmath$ \textstyle\bf x $}}{\mbox{\boldmath$ \scriptstyle\bf x $}}{\mbox{\boldmath$ \scriptscriptstyle\bf x $}}}\rangle=\max_{{\mathchoice{\mbox{\boldmath$ \displaystyle\bf x $}}{\mbox{\boldmath$ \textstyle\bf x $}}{\mbox{\boldmath$ \scriptstyle\bf x $}}{\mbox{\boldmath$ \scriptscriptstyle\bf x $}}}\in\operatorname*{vert}(Z({\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{1},\ldots,{\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{m}))}\langle{\mathchoice{\mbox{\boldmath$ \displaystyle\bf r $}}{\mbox{\boldmath$ \textstyle\bf r $}}{\mbox{\boldmath$ \scriptstyle\bf r $}}{\mbox{\boldmath$ \scriptscriptstyle\bf r $}}},{\mathchoice{\mbox{\boldmath$ \displaystyle\bf x $}}{\mbox{\boldmath$ \textstyle\bf x $}}{\mbox{\boldmath$ \scriptstyle\bf x $}}{\mbox{\boldmath$ \scriptscriptstyle\bf x $}}}\rangle,$ and $\gamma_{Z({\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{1},\ldots,{\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{m})}$ is therefore a piecewise linear function with $|\operatorname*{vert}(Z({\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{1},\ldots,{\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{m}))|$ pieces.

Given any tuple $({\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{1},\ldots,{\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{m})\in S(n,m)$ and any point

the function $\operatorname{ZONOTOPE}_{k,w,m}^{n}[{\mathchoice{\mbox{\boldmath$ \displaystyle\bf a $}}{\mbox{\boldmath$ \textstyle\bf a $}}{\mbox{\boldmath$ \scriptstyle\bf a $}}{\mbox{\boldmath$ \scriptscriptstyle\bf a $}}}^{1},\ldots,{\mathchoice{\mbox{\boldmath$ \displaystyle\bf a $}}{\mbox{\boldmath$ \textstyle\bf a $}}{\mbox{\boldmath$ \scriptstyle\bf a $}}{\mbox{\boldmath$ \scriptscriptstyle\bf a $}}}^{k},{\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{1},\ldots,{\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{m}]:=H_{{\mathchoice{\mbox{\boldmath$ \displaystyle\bf a $}}{\mbox{\boldmath$ \textstyle\bf a $}}{\mbox{\boldmath$ \scriptstyle\bf a $}}{\mbox{\boldmath$ \scriptscriptstyle\bf a $}}}^{1},\ldots,{\mathchoice{\mbox{\boldmath$ \displaystyle\bf a $}}{\mbox{\boldmath$ \textstyle\bf a $}}{\mbox{\boldmath$ \scriptstyle\bf a $}}{\mbox{\boldmath$ \scriptscriptstyle\bf a $}}}^{k}}\circ\gamma_{Z({\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{1},\ldots,{\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{m})}$ has $(m-1)^{n-1}w^{k}$ pieces and it can be represented by a $k+2$ layer ReLU DNN with size $2m+wk$ .

Finally, we are ready to state the main result of this section.

Every $f\in\operatorname{ZONOTOPE}^{n}_{k,w,m}$ is representable by a ReLU DNN of depth $k+2$ and size $2m+wk$ , and has $\left(\sum_{i=0}^{n-1}{m-1\choose i}\right)w^{k}$ pieces.

Consider any $f\in\operatorname{ZONOTOPE}^{n}_{k,w,m}$ . If $f$ is represented by a $(k^{\prime}+1)$ -layer DNN for any $k^{\prime}\leq k$ , then this $(k^{\prime}+1)$ -layer DNN has size at least $\max\left\{\frac{1}{2}(k^{\prime}w^{\frac{k}{k^{\prime}n}})\cdot(m-1)^{(1-\frac{1}{n})\frac{1}{k^{\prime}}}-1\;\;,\;\;\frac{w^{\frac{k}{k^{\prime}}}}{n^{1/k^{\prime}}}k^{\prime}\right\}\vspace{-2mm}$ .

The family $\operatorname{ZONOTOPE}^{n}_{k,w,m}$ is in one-to-one correspondence with

Firstly we note that the construction in (Montufar et al., 2014) requires all the hidden layers to have width at least as big as the input dimensionality $n$ . In contrast, we do not impose such restrictions and the network size in our construction is independent of the input dimensionality. Thus our result probes networks with bottleneck architectures whose complexity cant be seen from their result.

Secondly, in terms of our complexity measure, there seem to be regimes where our bound does better. One such regime, for example, is when $n\leq w<2n$ and $k\in\Omega(\frac{n}{\log(n)})$ , by setting in our construction $m<n$ .

Thirdly, it is not clear to us whether the construction in (Montufar et al., 2014) gives a smoothly parameterized family of functions other than by introducing small perturbations of the construction in their paper. In contrast, we have a smoothly parameterized family which is in one-to-one correspondence with a well-understood manifold like the higher-dimensional torus.

There exists an algorithm to find a global optimum of Problem 4.1 in time $O(2^{w}(D)^{nw}\textrm{poly}(D,n,w))$ . Note that the running time $O(2^{w}(D)^{nw}\textrm{poly}(D,n,w))$ is polynomial in the data size $D$ for fixed $n,w$ .

Proof Sketch: A full proof of Theorem 4.1 is included in Appendix C. Here we provide a sketch of the proof. When the empirical risk minimization problem is viewed as an optimization problem in the space of weights of the ReLU DNN, it is a nonconvex, quadratic problem. However, one can instead search over the space of functions representable by 2-layer DNNs by writing them in the form similar to (2.1). This breaks the problem into two parts: a combinatorial search and then a convex problem that is essentially linear regression with linear inequality constraints. This enables us to guarantee global optimality.

Algorithm 1 implements the empirical risk minimization (ERM) rule for training ReLU DNN with one hidden layer. To the best of our knowledge there is no other known algorithm that solves the ERM problem to global optimality. We note that due to known hardness results exponential dependence on the input dimension is unavoidable Blum & Rivest (1992); Shalev-Shwartz & Ben-David (2014); Algorithm 1 runs in time polynomial in the number of data points. To the best of our knowledge there is no hardness result known which rules out empirical risk minimization of deep nets in time polynomial in circuit size or data size. Thus our training result is a step towards resolving this gap in the complexity literature.

A related result for improperly learning ReLUs has been recently obtained by Goel et al (Goel et al., 2016). In contrast, our algorithm returns a ReLU DNN from the class being learned. Another difference is that their result considers the notion of reliable learning as opposed to the empirical risk minimization objective considered in (4.1).

Discussion

The running time of the algorithm that we give in this work to find the exact global minima of a two layer ReLU-DNN is exponential in the input dimension $n$ and the number of hidden nodes $w$ . The exponential dependence on $n$ can not be removed unless $P=NP$ ; see Shalev-Shwartz & Ben-David (2014); Blum & Rivest (1992); DasGupta et al. (1995). However, we are not aware of any complexity results which would rule out the possibility of an algorithm which trains to global optimality in time that is polynomial in the data size and/or the number of hidden nodes, assuming that the input dimension is a fixed constant. Resolving this dependence on network size would be another step towards clarifying the theoretical complexity of training ReLU DNNs and is a good open question for future research, in our opinion. Perhaps an even better breakthrough would be to get optimal training algorithms for DNNs with two or more hidden layers and this seems like a substantially harder nut to crack. It would also be a significant breakthrough to get gap results between consecutive constant depths or between logarithmic and constant depths.

We would like to thank Christian Tjandraatmadja for pointing out a subtle error in a previous version of the paper, which affected the complexity results for the number of linear regions in our constructions in Section 3.2. Anirbit would like to thank Ramprasad Saptharishi, Piyush Srivastava and Rohit Gurjar for extensive discussions on Boolean and arithmetic circuit complexity. This paper has been immensely influenced by the perspectives gained during those extremely helpful discussions. Amitabh Basu gratefully acknowledges support from the NSF grant CMMI1452820. Raman Arora was supported in part by NSF BIGDATA grant IIS-1546482.

References

Appendix A Expressing piecewise linear functions using ReLU DNNs

is equal to $\operatorname{sgn}(r)\max\{|r|(x-a),0\}$ , which can be implemented by a 2-layer ReLU DNN with size 1. Similarly, any function of the form,

It is easy to verify that the above set of simultaneous linear equations has a unique solution. Indeed, $r$ must equal $s_{R}$ , and then one can solve for $t_{1},\ldots,t_{m-1}$ starting from the last equation $b_{m-2}=t_{m-1}(a_{m-2}-a_{m-1})$ and then back substitute to compute $t_{m-2},t_{m-3},\ldots,t_{1}$ . The lower bound of $p-1$ on the size for any $2$ -layer ReLU DNN that expresses a $p$ piece function follows from Lemma D.6. ∎

One can do better in terms of size when the rightmost piece of the given function is flat, i.e., $s_{R}=0$ . In this case $r=0$ , which means that $f=0$ ; thus, the decomposition of $h$ above is of size $p-1$ . A similar construction can be done when $s_{L}=0$ . This gives the following statement which will be useful for constructing our forthcoming hard functions.

Appendix B Benefits of Depth

is piecewise linear with at most $(p+1)^{k}+2$ pieces, with $(p+1)^{k}$ of these pieces in the range $[0,M]$ (see Figure 2). Moreover, in each piece in the range $[0,M]$ , the function is affine with minimum value and maximum value $M$ .

Given $k\geq 1$ and $w\geq 2$ , choose any point

By Definition 8, each $h_{{\mathchoice{\mbox{\boldmath$ \displaystyle\bf a $}}{\mbox{\boldmath$ \textstyle\bf a $}}{\mbox{\boldmath$ \scriptstyle\bf a $}}{\mbox{\boldmath$ \scriptscriptstyle\bf a $}}}^{i}}$ , $i=1,\ldots,k$ is a piecewise linear function with $w+1$ pieces and the leftmost piece having slope . Thus, by Corollary A.1, each $h_{{\mathchoice{\mbox{\boldmath$ \displaystyle\bf a $}}{\mbox{\boldmath$ \textstyle\bf a $}}{\mbox{\boldmath$ \scriptstyle\bf a $}}{\mbox{\boldmath$ \scriptscriptstyle\bf a $}}}^{i}}$ , $i=1,\ldots,k$ can be represented by a 2-layer ReLU DNN with size $w$ . Using Lemma D.1, $H_{{\mathchoice{\mbox{\boldmath$ \displaystyle\bf a $}}{\mbox{\boldmath$ \textstyle\bf a $}}{\mbox{\boldmath$ \scriptstyle\bf a $}}{\mbox{\boldmath$ \scriptscriptstyle\bf a $}}}^{1},\ldots,{\mathchoice{\mbox{\boldmath$ \displaystyle\bf a $}}{\mbox{\boldmath$ \textstyle\bf a $}}{\mbox{\boldmath$ \scriptstyle\bf a $}}{\mbox{\boldmath$ \scriptscriptstyle\bf a $}}}^{k}}$ can be represented by a $k+1$ layer DNN with size $wk$ ; in fact, each hidden layer has exactly $w$ nodes. ∎

Follows from Theorem 3.2 and Lemma D.6. ∎

The above integral attains its minimum of $\frac{1}{2w^{k}}$ at $y_{1}=y_{2}=\frac{1}{2}$ . Putting together,

By Theorem 3.6 part 3., $\gamma_{Z({\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{1},\ldots,{\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{m})}({\mathchoice{\mbox{\boldmath$ \displaystyle\bf r $}}{\mbox{\boldmath$ \textstyle\bf r $}}{\mbox{\boldmath$ \scriptstyle\bf r $}}{\mbox{\boldmath$ \scriptscriptstyle\bf r $}}})=|\langle{\mathchoice{\mbox{\boldmath$ \displaystyle\bf r $}}{\mbox{\boldmath$ \textstyle\bf r $}}{\mbox{\boldmath$ \scriptstyle\bf r $}}{\mbox{\boldmath$ \scriptscriptstyle\bf r $}}},{\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{1}\rangle|+\ldots+|\langle{\mathchoice{\mbox{\boldmath$ \displaystyle\bf r $}}{\mbox{\boldmath$ \textstyle\bf r $}}{\mbox{\boldmath$ \scriptstyle\bf r $}}{\mbox{\boldmath$ \scriptscriptstyle\bf r $}}},{\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{m}\rangle|$ . It suffices to observe

The fact that $\operatorname{ZONOTOPE}_{k,w,m}^{n}[{\mathchoice{\mbox{\boldmath$ \displaystyle\bf a $}}{\mbox{\boldmath$ \textstyle\bf a $}}{\mbox{\boldmath$ \scriptstyle\bf a $}}{\mbox{\boldmath$ \scriptscriptstyle\bf a $}}}^{1},\ldots,{\mathchoice{\mbox{\boldmath$ \displaystyle\bf a $}}{\mbox{\boldmath$ \textstyle\bf a $}}{\mbox{\boldmath$ \scriptstyle\bf a $}}{\mbox{\boldmath$ \scriptscriptstyle\bf a $}}}^{k},{\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{1},\ldots,{\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{m}]$ can be represented by a $k+2$ layer ReLU DNN with size $2m+wk$ follows from Lemmas 3.7 and D.1. The number of pieces follows from the fact that $\gamma_{Z({\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{1},\ldots,{\mathchoice{\mbox{\boldmath$ \displaystyle\bf b $}}{\mbox{\boldmath$ \textstyle\bf b $}}{\mbox{\boldmath$ \scriptstyle\bf b $}}{\mbox{\boldmath$ \scriptscriptstyle\bf b $}}}^{m})}$ has $\sum_{i=0}^{n-1}{m-1\choose i}$ distinct linear pieces by parts 1. and 2. of Theorem 3.6, and $H_{{\mathchoice{\mbox{\boldmath$ \displaystyle\bf a $}}{\mbox{\boldmath$ \textstyle\bf a $}}{\mbox{\boldmath$ \scriptstyle\bf a $}}{\mbox{\boldmath$ \scriptscriptstyle\bf a $}}}^{1},\ldots,{\mathchoice{\mbox{\boldmath$ \displaystyle\bf a $}}{\mbox{\boldmath$ \textstyle\bf a $}}{\mbox{\boldmath$ \scriptstyle\bf a $}}{\mbox{\boldmath$ \scriptscriptstyle\bf a $}}}^{k}}$ has $w^{k}$ pieces by Lemma B.1. ∎

Appendix C Exact Empirical Risk Minimization

In other words, the family of functions over which we are searching is of the form

We now use the observation that fitting piecewise linear functions to minimize loss is just a step away from linear regression, which is a special case where the function is contrained to have exactly one affine linear piece. Our algorithm will first guess the optimal partition of the data points such that all points in the same class of the partition correspond to the same affine piece of $f$ , and then do linear regression in each class of the partition. Altenatively, one can think of this as guessing the interval $(x_{i},x_{i+1})$ of data points where the $w-1$ breakpoints of the piecewise linear function will lie, and then doing linear regression between the breakpoints.

More formally, we parametrize piecewise linear functions with $w$ pieces by the $w$ slope-intercept values $(a_{1},b_{1}),\ldots,(a_{2},b_{2}),\ldots,(a_{w},b_{w})$ of the $w$ different pieces. This means that between breakpoints $j$ and $j+1$ , $1\leq j\leq w-2$ , the function is given by $f(x)=a_{j+1}x+b_{j+1},$ and the first and last pieces are $a_{1}x+b_{1}$ and $a_{w}x+b_{w}$ , respectively.

Define $\mathcal{I}$ to be the set of all $(w-1)$ -tuples $(i_{1},\ldots,i_{w-1})$ of natural numbers such that $1\leq i_{1}\leq\ldots\leq i_{w-1}\leq D$ . Given a fixed tuple $I=(i_{1},\ldots,i_{w-1})\in\mathcal{I}$ , we wish to search through all piecewise linear functions whose breakpoints, in order, appear in the intervals $(x_{i_{1}},x_{i_{1}+1}),(x_{i_{2}},x_{i_{2}+1})$ , $\ldots,(x_{i_{w-1}},x_{i_{w-1}+1})$ . Define also $\mathcal{S}=\{-1,1\}^{w-1}$ . Any $S\in\mathcal{S}$ will have the following interpretation: if $S_{j}=1$ then $a_{j}\leq a_{j+1}$ , and if $S_{j}=-1$ then $a_{j}\geq a_{j+1}$ . Now for every $I\in\mathcal{I}$ and $S\in\mathcal{S}$ , requiring a piecewise linear function that respects the conditions imposed by $I$ and $S$ is easily seen to be equivalent to imposing the following linear inequalities on the parameters $(a_{1},b_{1}),\ldots,(a_{2},b_{2}),\ldots,(a_{w},b_{w})$ :

Let the set of piecewise linear functions whose breakpoints satisfy the above be denoted by $\textrm{PWL}^{1}_{I,S}$ for $I\in\mathcal{I},S\in\mathcal{S}$ .

Given a particular $I\in\mathcal{I}$ , we define

The right hand side of the above equation is the problem of minimizing a convex objective subject to linear constraints. Now, to solve (C.3), we need to simply solve the problem (C.5) for all $I\in\mathcal{I},S\in\mathcal{S}$ and pick the minimum. Since $|\mathcal{I}|={D\choose w}=O(D^{w})$ and $|\mathcal{S}|=2^{w-1}$ we need to solve $O(2^{w}\cdot D^{w})$ convex optimization problems, each taking time $O(\textrm{poly}(D))$ . Therefore, the total running time is $O((2D)^{w}\textrm{poly}(D))$ .

Appendix D Auxiliary Lemmas

Now we will collect some straightforward observations that will be used often. The following operations preserve the property of being representable by a ReLU DNN.

Follows from (1.1) and the fact that a composition of affine transformations is another affine transformation.∎

We simply put the two ReLU DNNs in parallel and combine the appropriate coordinates of the outputs. ∎

Simply use the fact that $T=(I\circ\sigma\circ T)+(-I\circ\sigma\circ(-T))$ , and the right hand side can be represented by a 2-layer ReLU DNN of size $2m$ using Lemma D.2. ∎

We prove this by induction on $k$ . The base case is $k=1$ , i.e, we have a 2-layer ReLU DNN. Since every activation node can produce at most one breakpoint in the piecewise linear function, we can get at most $w_{1}$ breakpoints, i.e., $w_{1}+1$ pieces.

Lemma D.5 has the following consequence about the depth and size tradeoffs for expressing functions with agiven number of pieces.

Let widths of the $k$ hidden layers be $w_{1},\ldots,w_{k}$ . By Lemma D.5, we must have

By the AM-GM inequality, minimizing the size $w_{1}+w_{2}+\ldots+w_{k}$ subject to (D.1), means setting $w_{1}+1=w_{2}=\ldots=w_{k}$ . This implies that $w_{1}+1=w_{2}=\ldots=w_{k}\geq\frac{1}{2}p^{1/k}$ . The first statement follows. The second statement follows using the AM-GM inequality again, this time with a restriction on $w_{1}+w_{2}+\ldots+w_{k}$ . ∎