Why Deep Neural Networks for Function Approximation?

Shiyu Liang, R. Srikant

Introduction

Neural networks have drawn significant interest from the machine learning community, especially due to their recent empirical successes (see the surveys (Bengio, 2009)). Neural networks are used to build state-of-art systems in various applications such as image recognition, speech recognition, natural language process and others (see, Krizhevsky et al. 2012; Goodfellow et al. 2013; Wan et al. 2013, for example). The result that neural networks are universal approximators is one of the theoretical results most frequently cited to justify the use of neural networks in these applications. Numerous results have shown the universal approximation property of neural networks in approximations of different function classes, (see, e.g., Cybenko 1989; Hornik et al. 1989; Funahashi 1989; Hornik 1991; Chui & Li 1992; Barron 1993; Poggio et al. 2015).

All these results and many others provide upper bounds on the network size and assert that small approximation error can be achieved if the network size is sufficiently large. More recently, there has been much interest in understanding the approximation capabilities of deep versus shallow networks. Delalleau & Bengio (2011) have shown that there exist deep sum-product networks which cannot be approximated by shallow sum-product networks unless they use an exponentially larger amount of units or neurons. Montufar et al. (2014) have shown that the number of linear region increases exponentially with the number of layers in the neural network. Telgarsky (2016) has established such a result for neural networks, which is the subject of this paper. Eldan & Shamir (2015) have shown that, to approximate a specific function, a two-layer network requires an exponential number of neurons in the input dimension, while a three-layer network requires a polynomial number of neurons. These recent papers demonstrate the power of deep networks by showing that depth can lead to an exponential reduction in the number of neurons required, for specific functions or specific neural networks. Our goal here is different: we are interested in function approximation specifically and would like to show that for a given upper bound on the approximation error, shallow networks require exponentially more neurons than deep networks for a large class of functions.

The multilayer neural networks considered in this paper are allowed to use either rectifier linear units (ReLU) or binary step units (BSU), or any combination of the two. The main contributions of this paper are

We have shown that, for ε\varepsilon-approximation of functions with enough piecewise smoothness, a multilayer neural network which uses Θ(log(1/ε))\Theta(\log(1/\varepsilon)) layers only needs O(polylog(1/ε))\mathcal{O}(\text{poly}\log(1/\varepsilon)) neurons, while Ω(poly(1/ε))\Omega(\text{poly}(1/\varepsilon)) neurons are required by neural networks with o(log(1/ε))o(\log(1/\varepsilon)) layers. In other words, shallow networks require exponentially more neurons than a deep network to achieve the level of accuracy for function approximation.

We have shown that for all differentiable and strongly convex functions, multilayer neural networks need Ω(log(1/ε))\Omega(\log(1/\varepsilon)) neurons to achieve an ε\varepsilon-approximation. Thus, our results for deep networks are tight.

The outline of this paper is as follows. In Section 2, we present necessary definitions and the problem statement. In Section 3, we present upper bounds on network size, while the lower bound is provided in Section 4. Conclusions are presented in Section 5. Around the same time that our paper was uploaded in arxiv, a similar paper was also uploaded in arXiv by Yarotsky (2016). The results in the two papers are similar in spirit, but the details and the general approach are substantially different.

Preliminaries and problem statement

In this section, we present definitions on feedforward neural networks and formally present the problem statement.

Here, σ()\sigma(\cdot) denotes the activation function and [n][n] denotes the index set [n]={1,...,n}[n]=\{1,...,n\}. In this paper, we only consider two important types of activation functions:

We call the number of layers and the number of neurons in the network as the depth and the size of the feedforward neural network, respectively. We use the set F(N,L)\mathcal{F}(N,L) to denote the function set containing all feedforward neural networks of depth LL, size NN and composed of a combination of rectifier linear units (ReLUs) and binary step units. We say one feedforward neural network is deeper than the other network if and only if it has a larger depth. Through this paper, the terms feedforward neural network and multilayer neural network are used interchangeably.

2 Problem Statement

Specifically, we aim to answer the following questions:

Does there exists L(ε)L(\varepsilon) and N(ε)N(\varepsilon) such that (1) is satisfied? We will refer to such L(ε)L(\varepsilon) and N(ε)N(\varepsilon) as upper bounds on the depth and size of the required neural network.

Given a fixed depth LL, what is the minimum value of NN such that (1) is satisfied? We will refer to such an NN as a lower bound on the size of a neural network of a given depth LL.

The first question asks what depth and size are sufficient to guarantee an ε\varepsilon-approximation. The second question asks, for a fixed depth, what is the minimum size of a neural network required to guarantee an ε\varepsilon-approximation. Obviously, tight bounds in the answers to these two questions provide tight bounds on the network size and depth required for function approximation. Besides, solutions to these two questions together can be further used to answer the following question. If a deeper neural network of size NdN_{d} and a shallower neural network of size NsN_{s} are used to approximate the same function with the same error, then how fast does the ratio Nd/NsN_{d}/N_{s} decay to zero as the error decays to zero?

Upper bounds on function approximations

In this section, we present upper bounds on the size of the multilayer neural network which are sufficient for function approximation. Before stating the results, some notations and terminology deserve further explanation. First, the upper bound on the network size represents the number of neurons required at most for approximating a given function with a certain error. Secondly, the notion of the approximation is the LL_{\infty} distance: for two functions ff and gg, the LL_{\infty} distance between these two function is the maximum point-wise disagreement over the cube d^{d}.

In this subsection, we present all results on approximating univariate functions. We first present a theorem on the size of the network for approximating a simple quadratic function. As part of the proof, we present the structure of the multilayer feedforward neural network used and show how the neural network parameters are chosen. Results on approximating general functions can be found in Theorem 2 and 4.

The proof is composed of three parts. For any xx\in, we first use the multilayer neural network to approximate xx by its finite binary expansion i=0nxi2i\sum_{i=0}^{n}\frac{x_{i}}{2^{i}}. We then construct a 2-layer neural network to implement function f(i=0nxi2i)f\left(\sum_{i=0}^{n}\frac{x_{i}}{2^{i}}\right).

For each xx\in, xx can be denoted by its binary expansion x=i=0xi2i,x=\sum_{i=0}^{\infty}\frac{x_{i}}{2^{i}}, where xi{0,1}x_{i}\in\{0,1\} for all i0i\geq 0. It is straightforward to see that the nn-layer neural network shown in Figure 1 can be used to find x0,...,xnx_{0},...,x_{n}.

Finally, we consider the approximation error of this multilayer neural network,

Therefore, in order to achieve ε\varepsilon-approximation error, one should choose n=log21ε+1n=\left\lceil\log_{2}\frac{1}{\varepsilon}\right\rceil+1. In summary, the deep neural network has O(log1ε)\mathcal{O}\left(\log\frac{1}{\varepsilon}\right) layers, O(log1ε)\mathcal{O}\left(\log\frac{1}{\varepsilon}\right) binary step units and O(log(1ε))\mathcal{O}\left(\log\left(\frac{1}{\varepsilon}\right)\right) rectifier linear units. ∎

Next, a theorem on the size of the network for approximating general polynomials is given as follows.

The proof is composed of three parts. We first use the deep structure shown in Figure 1 to find the nn-bit binary expansion i=0naixi\sum_{i=0}^{n}a_{i}x^{i} of xx. Then we construct a multilayer network to approximate polynomials gi(x)=xig_{i}(x)=x^{i}, i=1,...,pi=1,...,p. Finally, we analyze the approximation error.

Using the same deep structure shown in Figure 1, we could find the binary expansion sequence {x0,...,xn}\{x_{0},...,x_{n}\}. In this step, we used nn binary steps units in total. Now we rewrite gm+1(i=0nxi2n)g_{m+1}(\sum_{i=0}^{n}\frac{x_{i}}{2^{n}}),

Clearly, the equation (2) defines iterations between the outputs of neighbor layers. Therefore, the deep neural network shown in Figure 2 can be used to implement the iteration given by (2). Further, to implement this network, one should use O(p)\mathcal{O}(p) layers with O(pn)\mathcal{O}(pn) rectifier linear units in total.

This indicates, to achieve ε\varepsilon-approximation error, one should choose n=logpε+1n=\left\lceil\log\frac{p}{\varepsilon}\right\rceil+1. Besides, since we used O(n+p)\mathcal{O}(n+p) layers with O(n)\mathcal{O}(n) binary step units and O(pn)\mathcal{O}(pn) rectifier linear units in total, this multilayer neural network thus has O(p+logpε)\mathcal{O}\left(p+\log\frac{p}{\varepsilon}\right) layers, O(logpε)\mathcal{O}\left(\log\frac{p}{\varepsilon}\right) binary step units and O(plogpε)\mathcal{O}\left(p\log\frac{p}{\varepsilon}\right) rectifier linear units. ∎

In Theorem 2, we have shown an upper bound on the size of multilayer neural network for approximating polynomials. We can easily observe that the number of neurons in network grows as plogpp\log p with respect to pp, the degree of the polynomial. We note that both Andoni et al. (2014) and Barron (1993) showed the sizes of the networks grow exponentially with respect to pp if only 3-layer neural networks are allowed to be used in approximating polynomials.

Besides, every function ff with p+1p+1 continuous derivatives on a bounded set can be approximated easily with a polynomial with degree pp. This is shown by the following well known result of Lagrangian interpolation. By this result, we could further generalize Theorem 2. The proof can be found in the reference (Gil et al., 2007).

If a function ff is defined at points z0,...,znz_{0},...,z_{n}, zi=cos((i+1/2)π/(n+1))z_{i}=\cos((i+1/2)\pi/(n+1)), i[n]i\in[n], there exists a polynomial of degree not more than nn such that Pn(zi)=f(zi)P_{n}(z_{i})=f(z_{i}), i=0,...,n.i=0,...,n. This polynomial is given by Pn(x)=i=0nf(zi)Li(x)P_{n}(x)=\sum_{i=0}^{n}f(z_{i})L_{i}(x) where Li(x)=πn+1(x)(xzi)πn+1(zi)L_{i}(x)=\frac{\pi_{n+1}(x)}{(x-z_{i})\pi^{\prime}_{n+1}(z_{i})} and πn+1(x)=j=0n(xzj).\pi_{n+1}(x)=\prod_{j=0}^{n}(x-z_{j}). Additionally, if ff is continuous on $andandn+1timesdifferentiableintimes differentiable in(-1,1)$, then

where f(n)(x)f^{(n)}(x) is the derivative of ff of the nnth order and the norm f\left\|f\right\| is the ll_{\infty} norm f=maxxf(x){\left\|f\right\|=\max_{x\in}f(x)}.

Then the upper bound on the network size for approximating more general functions follows directly from Theorem 2 and Lemma 3.

Let N=log2εN=\left\lceil\log\frac{2}{\varepsilon}\right\rceil. From Lemma 3, it follows that there exists polynomial PNP_{N} of degree NN such that for any xx\in,

Remark: Note that, to implement the architecture in Figure 2 using the definition of a feedforward neural network in Section 2, we need the gig_{i}, i[p]i\in[p] at the output. This can be accomplished by using O(p2)\mathcal{O}(p^{2}) additional ReLUs. Since p=O(log(1/ε))p=\mathcal{O}(\log(1/\varepsilon)), this doesn’t change the order result in Theorem 4.

Theorem 4 shows that any function ff with enough smoothness can be approximated by a multilayer neural network containing polylog(1ε)\left(\frac{1}{\varepsilon}\right) neurons with ε\varepsilon error. Further, Theorem 4 can be used to show that for functions h1h_{1},…,hkh_{k} with enough smoothness, then linear combinations, multiplications and compositions of these functions can as well be approximated by multilayer neural networks containing polylog(1ε)\left(\frac{1}{\varepsilon}\right) neurons with ε\varepsilon error. Specific results are given in the following corollaries.

Remark: Clearly, Corollary 5 follows directly from the fact that the linear combination ff satisfies the conditions in Theorem 4 if all the functions h1h_{1},…,hkh_{k} satisfy those conditions. We note here that the upper bound on the network size for approximating linear combinations is independent of kk, the number of component functions.

Remark: Proofs of Corollary 6 and 7 can be found in the appendix. We observe that different from the case of linear combinations, the upper bound on the network size grows as k2log2kk^{2}\log^{2}k in the case of function multiplications and grows as k2(log1ε)2k^{2}\left(\log\frac{1}{\varepsilon}\right)^{2} in the case of function compositions where kk is the number of component functions.

In this subsection, we have shown a polylog(1ε)\log\left(\frac{1}{\varepsilon}\right) upper bound on the network size for ε\varepsilon-approximation of both univariate polynomials and general univariate functions with enough smoothness. Besides, we have shown that linear combinations, multiplications and compositions of univariate functions with enough smoothness can as well be approximated with ε\varepsilon error by a multilayer neural network of size polylog(1ε)\log\left(\frac{1}{\varepsilon}\right). In the next subsection, we will show the upper bound on the network size for approximating multivariate functions.

2 Approximation of multivariate functions

In this subsection, we present all results on approximating multivariate functions. We first present a theorem on the upper bound on the neural network size for approximating a product of multivariate linear functions. We next present a theorem on the upper bound on the neural network size for approximating general multivariate polynomial functions. Finally, similar to the results in the univariate case, we present the upper bound on the neural network size for approximating the linear combination, the multiplication and the composition of multivariate functions with enough smoothness.

Theorem 8 shows an upper bound on the network size for ε\varepsilon-approximation of a product of multivariate linear functions. Furthermore, since any general multivariate polynomial can be viewed as a linear combination of products, the result on general multivariate polynomials directly follows from Theorem 8.

Remark: The proof is given in the appendix. By further analyzing the results on the network size, we obtain the following results: (a) fixing degree pp, N(d,ε)=O(dp+1logdε)N(d,\varepsilon)=\mathcal{O}\left(d^{p+1}\log\frac{d}{\varepsilon}\right) as dd\rightarrow\infty and (b) fixing input dimension dd, N(p,ε)=O(pdlogpε){N(p,\varepsilon)=\mathcal{O}\left(p^{d}\log\frac{p}{\varepsilon}\right)} as pp\rightarrow\infty. Similar results on approximating multivariate polynomials were obtained by Andoni et al. (2014) and Barron (1993). Barron (1993) showed that on can use a 3-layer neural network to approximate any multivariate polynomial with degree pp, dimension dd and network size dp/ε2d^{p}/\varepsilon^{2}. Andoni et al. (2014) showed that one could use the gradient descent to train a 3-layer neural network of size d2p/ε2d^{2p}/\varepsilon^{2} to approximate any multivariate polynomial. However, Theorem 9 shows that the deep neural network could reduce the network size from O(1/ε)\mathcal{O}\left(1/\varepsilon\right) to O(log1ε)\mathcal{O}\left(\log\frac{1}{\varepsilon}\right) for the same ε\varepsilon error. Besides, for a fixed input dimension dd, the size of the 3-layer neural network used by Andoni et al. (2014) and Barron (1993) grows exponentially with respect to the degree pp. However, the size of the deep neural network shown in Theorem 9 grows only polynomially with respect to the degree. Therefore, the deep neural network could reduce the network size from O(exp(p))\mathcal{O}(\exp(p)) to O(poly(p))\mathcal{O}(\text{poly}(p)) when the degree pp becomes large.

Theorem 9 shows an upper bound on the network size for approximating multivariate polynomials. Further, by combining Theorem 4 and Corollary 7, we could obtain an upper bound on the network size for approximating more general functions. The results are shown in the following corollary.

Remark: Corollary 10 shows an upper bound on network size for approximating compositions of multivariate polynomials and general univariate functions. The upper bound can be loose due to the assumption that l(x)l(\bm{x}) is a general multivariate polynomials of degree pp. For some specific cases, the upper bound can be much smaller. We present two specific examples in the Appendix H and I.

In this subsection, we have shown that a similar polylog(1ε)\log\left(\frac{1}{\varepsilon}\right) upper bound on the network size for ε\varepsilon-approximation of general multivariate polynomials and functions which are compositions of univariate functions and multivariate polynomials.

The results in this section can be used to find a multilayer neural network of size polylog(1ε)\left(\frac{1}{\varepsilon}\right) which provides an approximation error of at most ε\varepsilon. In the next section, we will present lower bounds on the network size for approximating both univariate and multivariate functions. The lower bound together with the upper bound shows a tight bound on the network size required for function approximations.

While we have presented results in both the univariate and multivariate cases for smooth functions, the results automatically extend to functions that are piecewise smooth, with a finite number of pieces. In other words, if the domain of the function is partitioned into regions, and the function is sufficiently smooth (in the sense described in the Theorems and Corollaries earlier) in each of the regions, then the results essentially remain unchanged except for an additional factor which will depend on the number of regions in the domain.

Lower bounds on function approximations

In this section, we present lower bounds on the network size in function for certain classes of functions. Next, by combining the lower bounds and the upper bounds shown in the previous section, we could analytically show the advantages of deeper neural networks over shallower ones. The theorem below is inspired by a similar result (DasGupta & Schnitger, 1993) for univariate quadratic functions, where it is stated without a proof. Here we show that the result extends to general multivariate strongly convex functions.

Remark: The proof is in the Appendix F. Theorem 11 shows that every strongly convex function cannot be approximated with error ε\varepsilon by any multilayer neural network with rectifier linear units and binary step units and of size smaller than log2(μ/ε)4\log_{2}(\mu/\varepsilon)-4. Theorem 11 together with Theorem 1 directly shows that to approximate quadratic function f(x)=x2f(x)=x^{2} with error ε\varepsilon, the network size should be of order Θ(log1ε)\Theta\left(\log\frac{1}{\varepsilon}\right). Further, by combining Theorem 11 and Theorem 4, we could analytically show the benefits of deeper neural networks. The result is given in the following corollary.

Remarks: (i) The strong convexity requirement can be relaxed: the result obviously holds if the function is strongly concave and it also holds if the function consists of pieces which are strongly convex or strongly concave. (ii) Corollary 12 shows that in the approximation of the same function, the size of the deep neural network NsN_{s} is only of polynomially logarithmic order of the size of the shallow neural network NdN_{d}, i.e., Nd=O(polylog(Ns))N_{d}=\mathcal{O}(\text{polylog}(N_{s})). Similar results can be obtained for multivariate functions on the type considered in Section 3.2.

Conclusions

In this paper, we have shown that an exponentially large number of neurons are needed for function approximation using shallow networks, when compared to deep networks. The results are established for a large class of smooth univariate and multivariate functions. Our results are established for the case of feedforward neural networks with ReLUs and binary step units.

The research reported here was supported by NSF Grants CIF 14-09106, ECCS 16-09370, and ARO Grant W911NF-16-1-0259.

References

Appendix A Proof of Corollary 5

Then the approximation error is upper bounded by

Appendix B Proof of Corollary 6

Since f(x)=h1(x)h2(x)...hk(x)f(x)=h_{1}(x)h_{2}(x)...h_{k}(x), then the derivative of ff of order nn is

By the assumption that hi(αi)αi!\left\|h_{i}^{(\alpha_{i})}\right\|\leq\alpha_{i}! holds for i=1,...,ki=1,...,k, then we have

Then from Theorem 4, it follows that there exists a polynomial of PNP_{N} degree NN that

Further, function l(x)=x/log2xl(x)=x/\log_{2}x is monotonically increasing on [e,)[e,\infty) and

Therefore, to suffice the inequality (3), one should should choose

Since N=4klog24k+4k+2log22εN=\left\lceil 4k\log_{2}4k+4k+2\log_{2}\frac{2}{\varepsilon}\right\rceil by assumptions, then there exists a polynomial PNP_{N} of degree NN such that

Let i=0Nxi2i\sum_{i=0}^{N}\frac{x_{i}}{2^{i}} denote the binary expansion of xx and let

Appendix C Proof of Corollary 7

Further we assume the derivative of Fm1F_{m-1} has an upper bound Fm11.\left\|F^{\prime}_{m-1}\right\|\leq 1. Then for FmF_{m}, since Fm(x)F_{m}(x) can be rewritten as

In addition, the derivative of FmF_{m} can be upper bounded by

From iterations (4) and (5), we could have for 2mk2\leq m\leq k,

From the iteration (6), we have for 2mk2\leq m\leq k,

Therefore, to approximate f=Fkf=F_{k}, we need at most O(klogklog1ε+logk(log1ε)2)\mathcal{O}\left(k\log k\log\frac{1}{\varepsilon}+\log k\left(\log\frac{1}{\varepsilon}\right)^{2}\right) layers, O(klogklog1ε+logk(log1ε)2)\mathcal{O}\left(k\log k\log\frac{1}{\varepsilon}+\log k\left(\log\frac{1}{\varepsilon}\right)^{2}\right) binary step units and O(k2(log1ε)2+(log1ε)4)\mathcal{O}\left(k^{2}\left(\log\frac{1}{\varepsilon}\right)^{2}+\left(\log\frac{1}{\varepsilon}\right)^{4}\right) rectifier linear units.

Appendix D Proof of Theorem 8

The proof is composed of two parts. As before, we first use the deep structure shown in Figure 1 to find the binary expansion of x\bm{x} and next use a multilayer neural network to approximate the polynomial.

In the rest of proof, we consider the approximation error. Since for k=1,...,dk=1,...,d and xd\forall\bm{x}\in^{d},

By choosing n=log2pdεn=\left\lceil\log_{2}\frac{pd}{\varepsilon}\right\rceil, we have

Since we use ndnd binary step units to convert the input to binary form and dnpdnp neurons in function approximation, we thus use O(dlogpdε)\mathcal{O}\left(d\log\frac{pd}{\varepsilon}\right) binary step units and O(pdlogpdε)\mathcal{O}\left(pd\log\frac{pd}{\varepsilon}\right) rectifier linear units in total. In addition, since we have used nn layers to convert the input to binary form and pp layers in the function approximation section of the network, the whole deep structure has O(p+logpdε)\mathcal{O}\left(p+\log\frac{pd}{\varepsilon}\right) layers. ∎

Appendix E Proof of Theorem 9

Since the total number of multinomial is upper bounded by

the size of deep neural network is thus upper bounded by

If the dimension of the input dd is fixed, then (8) is has the order of

while if the degree pp is fixed, then (8) is has the order of

Appendix F Proof of Theorem 11

Now we calculate the lower bound on M(ε)M(\varepsilon). We first define 4 points x0x_{0}, x1=x0+2ρε/μx_{1}=x_{0}+2\sqrt{\rho\varepsilon/\mu}, x2=x1+2ρε/μx_{2}=x_{1}+2\sqrt{\rho\varepsilon/\mu} and x3=x2+2ρε/μx_{3}=x_{2}+2\sqrt{\rho\varepsilon/\mu}, ρ>1\forall\rho>1. We assume

Further, Telgarsky (2016) has shown that the maximum number of break points that a multilayer neural network of depth LL and size NN could have is (N/L)L(N/L)^{L}. Thus, LL and NN should satisfy

Besides, let m=N/Lm=N/L. Since each layer in network should have at least 2 neurons, i.e., m2m\geq 2, then

Now we consider the multivariate case d>1d>1. Assume input vector to be x=(x1,...,x(d))\bm{x}=(x^{1},...,x^{(d)}). We now fix x(2),...,x(d)x^{(2)},...,x^{(d)} and define two univariate functions

Remark: We make the following remarks about the lower bound in the theorem.

if the depth LL is fixed, as in shallow networks, the number of neurons required is Ω((1/ε)12L)\Omega\left((1/\varepsilon)^{\frac{1}{2L}}\right).

if we are allowed to choose LL optimally to minimize the lower bound, we will choose L=12log(μ16ε)L=\frac{1}{2}\log(\frac{\mu}{16\varepsilon}) and thus the lower bound will become Ω(log1ε)\Omega(\log\frac{1}{\varepsilon}), closed to the O(log21ε)\mathcal{O}(\log^{2}\frac{1}{\varepsilon}) upper bound shown in Theorem 4.

Appendix G Proof of Corollary 12

Substituting for log(1ε)\log\left(\frac{1}{\varepsilon}\right) from (12) to (11), we have

By definition, a shallow neural network has a small number of layers, i.e., LsL_{s}. Thus, the size of the deep neural network is O(log2Ns)\mathcal{O}(\log^{2}N_{s}). This means NdNsN_{d}\ll N_{s}. ∎

Appendix H Proof of Corollary 13

Besides, from Theorem 4, it follows that there exists a deep neural network f^\hat{f} with O(log1ε)\mathcal{O}\left(\log\frac{1}{\varepsilon}\right) layers O(log1ε)\mathcal{O}\left(\log\frac{1}{\varepsilon}\right) binary step units and O((log1ε)2)\mathcal{O}\left(\left(\log\frac{1}{\varepsilon}\right)^{2}\right) such that

By inequalities (13) and (14), the the approximation error is upper bounded by

Now the deep neural network has O(logdε)\mathcal{O}\left(\log\frac{d}{\varepsilon}\right) layers, O(dlogdε)\mathcal{O}\left(d\log\frac{d}{\varepsilon}\right) binary step units and O(dlogdε+(log1ε)2)\mathcal{O}\left(d\log\frac{d}{\varepsilon}+\left(\log\frac{1}{\varepsilon}\right)^{2}\right) rectifier linear units.

Appendix I Proof of Corollary 14