Error bounds for approximations with deep ReLU networks

Dmitry Yarotsky

Introduction

Recently, multiple successful applications of deep neural networks to pattern recognition problems (Schmidhuber , LeCun et al. ) have revived active interest in theoretical properties of such networks, in particular their expressive power. It has been argued that deep networks may be more expressive than shallow ones of comparable size (see, e.g., Delalleau and Bengio , Raghu et al. , Montufar et al. , Bianchini and Scarselli , Telgarsky ). In contrast to a shallow network, a deep one can be viewed as a long sequence of non-commutative transformations, which is a natural setting for high expressiveness (cf. the well-known Solovay-Kitaev theorem on fast approximation of arbitrary quantum operations by sequences of non-commutative gates, see Kitaev et al. , Dawson and Nielsen ).

There are various ways to characterize expressive power of networks. Delalleau and Bengio 2011 consider sum-product networks and prove for certain classes of polynomials that they are much more easily represented by deep networks than by shallow networks. Montufar et al. 2014 estimate the number of linear regions in the network’s landscape. Bianchini and Scarselli 2014 give bounds for Betti numbers characterizing topological properties of functions represented by networks. Telgarsky 2015, 2016 provides specific examples of classification problems where deep networks are provably more efficient than shallow ones.

In the context of classification problems, a general and standard approach to characterizing expressiveness is based on the notion of the Vapnik-Chervonenkis dimension (Vapnik and Chervonenkis ). There exist several bounds for VC-dimension of deep networks with piece-wise polynomial activation functions that go back to geometric techniques of Goldberg and Jerrum 1995 and earlier results of Warren 1968; see Bartlett et al. , Sakurai and the book Anthony and Bartlett . There is a related concept, the fat-shattering dimension, for real-valued approximation problems (Kearns and Schapire , Anthony and Bartlett ).

A very general approach to expressiveness in the context of approximation is the method of nonlinear widths by DeVore et al. 1989 that concerns approximation of a family of functions under assumption of a continuous dependence of the model on the approximated function.

In this paper we examine the problem of shallow-vs-deep expressiveness from the perspective of approximation theory and general spaces of functions having derivatives up to certain order (Sobolev-type spaces). In this framework, the problem of expressiveness is very well studied in the case of shallow networks with a single hidden layer, where it is known, in particular, that to approximate a CnC^{n}-function on a dd-dimensional set with infinitesimal error ϵ\epsilon one needs a network of size about ϵd/n\epsilon^{-d/n}, assuming a smooth activation function (see, e.g., Mhaskar , Pinkus for a number of related rigorous upper and lower bounds and further qualifications of this result). Much less seems to be known about deep networks in this setting, though Mhaskar et al. 2016, 2016 have recently introduced functional spaces constructed using deep dependency graphs and obtained expressiveness bounds for related deep networks.

We will focus our attention on networks with the ReLU activation function σ(x)=max(0,x)\sigma(x)=\max(0,x), which, despite its utter simplicity, seems to be the most popular choice in practical applications LeCun et al. . We will consider LL^{\infty}-error of approximation of functions belonging to the Sobolev spaces Wn,(d)\mathcal{W}^{n,\infty}(^{d}) (without any assumptions of hierarchical structure). We will often consider families of approximations, as the approximated function runs over the unit ball Fd,nF_{d,n} in Wn,(d)\mathcal{W}^{n,\infty}(^{d}). In such cases we will distinguish scenarios of fixed and adaptive network architectures. Our goal is to obtain lower and upper bounds on the expressiveness of deep and shallow networks in different scenarios. We measure complexity of networks in a conventional way, by counting the number of their weights and computation units (cf. Anthony and Bartlett ).

The main body of the paper consists of Sections 2, 3 and 4.

In Section 2 we describe our ReLU network model and show that the ReLU function is replaceable by any other continuous piece-wise linear activation function, up to constant factors in complexity asymptotics (Proposition 1).

In Section 3 we establish several upper bounds on the complexity of approximating by ReLU networks, in particular showing that deep networks are quite efficient for approximating smooth functions. Specifically:

In Subsection 3.1 we show that the function f(x)=x2f(x)=x^{2} can be ϵ\epsilon-approximated by a network of depth and complexity O(ln(1/ϵ))O(\ln(1/\epsilon)) (Proposition 2). This also leads to similar upper bounds on the depth and complexity that are sufficient to implement an approximate multiplication in a ReLU network (Proposition 3).

In Subsection 3.2 we describe a ReLU network architecture of depth O(ln(1/ϵ))O(\ln(1/\epsilon)) and complexity O(ϵd/nln(1/ϵ))O(\epsilon^{-d/n}\ln(1/\epsilon)) that is capable of approximating with error ϵ\epsilon any function from Fd,nF_{d,n} (Theorem 1).

In Subsection 3.3 we show that, even with fixed-depth networks, one can further decrease the approximation complexity if the network architecture is allowed to depend on the approximated function. Specifically, we prove that one can ϵ\epsilon-approximate a given Lipschitz function on the segment $byadepth6ReLUnetworkwithby a depth-6 ReLU network withO(\frac{1}{\epsilon\ln(1/\epsilon)})$ connections and activation units (Theorem 2). This upper bound is of interest since it lies below the lower bound provided by the method of nonlinear widths under assumption of continuous model selection (see Subsection 4.1).

In Section 4 we obtain several lower bounds on the complexity of approximation by deep and shallow ReLU networks, using different approaches and assumptions.

In Subsection 4.1 we recall the general lower bound provided by the method of continuous nonlinear widths. This method assumes that parameters of the approximation continuously depend on the approximated function, but does not assume anything about how the approximation depends on its parameters. In this setup, at least ϵd/n\sim\epsilon^{-d/n} connections and weights are required for an ϵ\epsilon-approximation on Fd,nF_{d,n} (Theorem 3). As already mentioned, for d=n=1d=n=1 this lower bound is above the upper bound provided by Theorem 2.

In Subsection 4.2 we consider the setup where the same network architecture is used to approximate all functions in Fd,nF_{d,n}, but the weights are not assumed to continuously depend on the function. In this case, application of existing results on VC-dimension of deep piece-wise polynomial networks yields a ϵd/(2n)\sim\epsilon^{d/(2n)} lower bound in general and a ϵd/nln2p1(1/ϵ)\sim\epsilon^{-d/n}\ln^{-2p-1}(1/\epsilon) lower bound if the network depth grows as O(lnp(1/ϵ))O(\ln^{p}(1/\epsilon)) (Theorem 4).

In Subsection 4.3 we consider an individual approximation, without any assumptions regarding it as an element of a family as in Subsections 4.1 and 4.2. We prove that for any d,nd,n there exists a function in Wn,(d)\mathcal{W}^{n,\infty}(^{d}) such that its approximation complexity is not o(ϵd/(9n))o(\epsilon^{-d/(9n)}) as ϵ0\epsilon\to 0 (Theorem 5).

In Subsection 4.4 we prove that ϵ\epsilon-approximation of any nonlinear C2C^{2}-function by a network of fixed depth LL requires at least ϵ1/(2(L2))\sim\epsilon^{-1/(2(L-2))} computation units (Theorem 6). By comparison with Theorem 1, this shows that for sufficiently smooth functions approximation by fixed-depth ReLU networks is less efficient than by unbounded-depth networks.

In Section 5 we discuss the obtained bounds and summarize their implications, in particular comparing deep vs. shallow networks and fixed vs. adaptive architectures.

The arXiv preprint of the first version of the present work appeared almost simultaneously with the work of Liang and Srikant Liang and Srikant containing results partly overlapping with our results in Subsections 3.1,3.2 and 4.4. Liang and Srikant consider networks equipped with both ReLU and threshold activation functions. They prove a logarithmic upper bound for the complexity of approximating the function f(x)=x2f(x)=x^{2}, which is analogous to our Proposition 2. Then, they extend this upper bound to polynomials and smooth functions. In contrast to our treatment of generic smooth functions based on standard Sobolev spaces, they impose more complex assumptions on the function (including, in particular, how many derivatives it has) that depend on the required approximation accuracy ϵ\epsilon. As a consequence, they obtain strong O(lnc(1/ϵ))O(\ln^{c}(1/\epsilon)) complexity bounds rather different from our bound in Theorem 1 (in fact, our lower bound proved in Theorem 5 rules out, in general, such strong upper bounds for functions having only finitely many derivatives). Also, Liang and Srikant prove a lower bound for the complexity of approximating convex functions by shallow networks. Our version of this result, given in Subsection 4.4, is different in that we assume smoothness and nonlinearity instead of global convexity.

The ReLU network model

Throughout the paper, we consider feedforward neural networks with the ReLU (Rectified Linear Unit) activation function

The network consists of several input units, one output unit, and a number of “hidden” computation units. Each hidden unit performs an operation of the form

with some weights (adjustable parameters) (wk)k=1N(w_{k})_{k=1}^{N} and bb depending on the unit. The output unit is also a computation unit, but without the nonlinearity, i.e., it computes y=k=1Nwkxk+by=\sum_{k=1}^{N}w_{k}x_{k}+b. The units are grouped in layers, and the inputs (xk)k=1N(x_{k})_{k=1}^{N} of a computation unit in a certain layer are outputs of some units belonging to any of the preceding layers (see Fig. 1). Note that we allow connections between units in non-neighboring layers. Occasionally, when this cannot cause confusion, we may denote the network and the function it implements by the same symbol.

The depth of the network, the number of units and the total number of weights are standard measures of network complexity (Anthony and Bartlett ). We will use these measures throughout the paper. The number of weights is, clearly, the sum of the total number of connections and the number of computation units. We identify the depth with the number of layers (in particular, the most common type of neural networks – shallow networks having a single hidden layer – are depth-3 networks according to this convention).

We finish this subsection with a proposition showing that, given our complexity measures, using the ReLU activation function is not much different from using any other piece-wise linear activation function with finitely many breakpoints: one can replace one network by an equivalent one but having another activation function while only increasing the number of units and weights by constant factors. This justifies our restricted attention to the ReLU networks (which could otherwise have been perceived as an excessively particular example of networks).

Let ξ\xi be a network with the activation function ρ\rho, having depth LL, WW weights and UU computation units. Then there exists a ReLU network η\eta that has depth LL, not more than (M+1)2W(M+1)^{2}W weights and not more than (M+1)U(M+1)U units, and that computes the same function as ξ\xi.

a) Let a1<<aMa_{1}<\ldots<a_{M} be the breakpoints of ρ\rho, i.e., the points where its derivative is discontinuous: ρ(ak+)ρ(ak)\rho^{\prime}(a_{k}+)\neq\rho^{\prime}(a_{k}-). We can then express ρ\rho via the ReLU function σ\sigma, as a linear combination

with appropriately chosen coefficients (cm)m=0M(c_{m})_{m=0}^{M} and hh. It follows that computation performed by a single ρ\rho-unit,

can be equivalently represented by a linear combination of a constant function and computations of M+1M+1 σ\sigma-units,

(here mm is the index of a ρ\rho-unit). We can then replace one-by-one all the ρ\rho-units in the network ξ\xi by σ\sigma-units, without changing the output of the network. Obviously, these replacements do not change the network depth. Since each hidden unit gets replaced by M+1M+1 new units, the number of units in the new network is not greater than M+1M+1 times their number in the original network. Note also that the number of connections in the network is multiplied, at most, by (M+1)2(M+1)^{2}. Indeed, each unit replacement entails replacing each of the incoming and outgoing connections of this unit by M+1M+1 new connections, and each connection is replaced twice: as an incoming and as an outgoing one. These considerations imply the claimed complexity bounds for the resulting σ\sigma-network η\eta.

b) Let aa be any breakpoint of ρ\rho, so that ρ(a+)ρ(a)\rho^{\prime}(a+)\neq\rho^{\prime}(a-). Let r0r_{0} be the distance separating aa from the nearest other breakpoint, so that ρ\rho is linear on [a,a+r0][a,a+r_{0}] and on [ar0,a][a-r_{0},a] (if ρ\rho has only one node, any r0>0r_{0}>0 will do). Then, for any r>0r>0, we can express the ReLU function σ\sigma via ρ\rho in the rr-neighborhood of 0:

It follows that a computation performed by a single σ\sigma-unit,

can be equivalently represented by a linear combination of a constant function and two ρ\rho-units,

holds. Since D\mathcal{D} is a bounded set, we can choose rr at each unit of the initial network η\eta sufficiently large so as to satisfy condition (2) for all network inputs from D\mathcal{D}. Then, like in a), we replace each σ\sigma-unit with two ρ\rho-units, which produces the desired ρ\rho-network. ∎

Upper bounds

Our first key result shows that ReLU networks with unconstrained depth can very efficiently approximate the function f(x)=x2f(x)=x^{2} (more efficiently than any fixed-depth network, as we will see in Section 4.4). Our construction uses the “sawtooth” function that has previously appeared in the paper Telgarsky .

The function f(x)=x2f(x)=x^{2} on the segment $canbeapproximatedwithanyerrorcan be approximated with any error\epsilon>0byaReLUnetworkhavingthedepthandthenumberofweightsandcomputationunitsby a ReLU network having the depth and the number of weights and computation unitsO(\ln(1/\epsilon))$.

Consider the “tooth” (or “mirror”) function g:,g:\to,

Telgarsky has shown (see Lemma 2.4 in Telgarsky ) that gsg_{s} is a “sawtooth” function with 2s12^{s-1} uniformly distributed “teeth” (each application of gg doubles the number of teeth):

(see Fig. 2(a)). Our key observation now is that the function f(x)=x2f(x)=x^{2} can be approximated by linear combinations of the functions gsg_{s}. Namely, let fmf_{m} be the piece-wise linear interpolation of ff with 2m+12^{m}+1 uniformly distributed breakpoints k2m,k=0,,2m\frac{k}{2^{m}},k=0,\ldots,2^{m}:

(see Fig. 2(b)). The function fmf_{m} approximates ff with the error ϵm=22m2\epsilon_{m}=2^{-2m-2}. Now note that refining the interpolation from fm1f_{m-1} to fmf_{m} amounts to adjusting it by a function proportional to a sawtooth function:

Since gg can be implemented by a finite ReLU network (as g(x)=2\sigma(x)-4\sigma\big{(}x-\frac{1}{2}\big{)}+2\sigma(x-1)) and since construction of fmf_{m} only involves O(m)O(m) linear operations and compositions of gg, we can implement fmf_{m} by a ReLU network having depth and the number of weights and computation units all being O(m)O(m) (see Fig. 2(c)). This implies the claim of the proposition.

we can use Proposition 2 to efficiently implement accurate multiplication in a ReLU network. The implementation will depend on the required accuracy and the magnitude of the multiplied quantities.

for any inputs x,yx,y, if xM|x|\leq M and yM,|y|\leq M, then ×~(x,y)xyϵ|\widetilde{\times}(x,y)-xy|\leq\epsilon;

if x=0x=0 or y=0y=0, then ×~(x,y)=0\widetilde{\times}(x,y)=0;

the depth and the number of weights and computation units in η\eta is not greater than c1ln(1/ϵ)+c2c_{1}\ln(1/\epsilon)+c_{2} with an absolute constant c1c_{1} and a constant c2=c2(M)c_{2}=c_{2}(M).

2 Fast deep approximation of general smooth functions

In order to formulate our general result, Theorem 1, we consider the Sobolev spaces Wn,(d)\mathcal{W}^{n,\infty}(^{d}) with n=1,2,n=1,2,\ldots Recall that Wn,(d)\mathcal{W}^{n,\infty}(^{d}) is defined as the space of functions on d^{d} lying in LL^{\infty} along with their weak derivatives up to order nn. The norm in Wn,(d)\mathcal{W}^{n,\infty}(^{d}) can be defined by

where n=(n1,,nd){0,1,}d\mathbf{n}=(n_{1},\ldots,n_{d})\in\{0,1,\ldots\}^{d}, n=n1++nd|\mathbf{n}|=n_{1}+\ldots+n_{d}, and DnfD^{\mathbf{n}}f is the respective weak derivative. Here and in the sequel we denote vectors by boldface characters. The space Wn,(d)\mathcal{W}^{n,\infty}(^{d}) can be equivalently described as consisting of the functions from Cn1(d)C^{n-1}(^{d}) such that all their derivatives of order n1n-1 are Lipschitz continuous.

Throughout the paper, we denote by Fn,dF_{n,d} the unit ball in Wn,(d)\mathcal{W}^{n,\infty}(^{d}):

Also, it will now be convenient to make a distinction between networks and network architectures: we define the latter as the former with unspecified weights. We say that a network architecture is capable of expressing any function from Fd,nF_{d,n} with error ϵ\epsilon meaning that this can be achieved by some weight assignment.

For any d,nd,n and ϵ(0,1)\epsilon\in(0,1), there is a ReLU network architecture that

is capable of expressing any function from Fd,nF_{d,n} with error ϵ\epsilon;

has the depth at most c(ln(1/ϵ)+1)c(\ln(1/\epsilon)+1) and at most cϵd/n(ln(1/ϵ)+1)c\epsilon^{-d/n}(\ln(1/\epsilon)+1) weights and computation units, with some constant c=c(d,n)c=c(d,n).

The proof will consist of two steps. We start with approximating ff by a sum-product combination f1f_{1} of local Taylor polynomials and one-dimensional piecewise-linear functions. After that, we will use results of the previous section to approximate f1f_{1} by a neural network.

Let NN be a positive integer. Consider a partition of unity formed by a grid of (N+1)d(N+1)^{d} functions ϕm\phi_{\mathbf{m}} on the domain d^{d}:

Here m=(m1,,md){0,1,,N}d,\mathbf{m}=(m_{1},\ldots,m_{d})\in\{0,1,\ldots,N\}^{d}, and the function ϕm\phi_{\mathbf{m}} is defined as the product

For any m{0,,N}d\mathbf{m}\in\{0,\ldots,N\}^{d}, consider the degree-(n1)(n-1) Taylor polynomial for the function ff at x=mN\mathbf{x}=\frac{\mathbf{m}}{N}:

with the usual conventions n!=k=1dnk!\mathbf{n}!=\prod_{k=1}^{d}n_{k}! and (xmN)n=k=1d(xkmkN)nk(\mathbf{x}-\frac{\mathbf{m}}{N})^{\mathbf{n}}=\prod_{k=1}^{d}(x_{k}-\frac{m_{k}}{N})^{n_{k}}. Now define an approximation to ff by

We bound the approximation error using the Taylor expansion of ff:

Here in the second step we used the support property (7) and the bound (6), in the third the observation that any xd\mathbf{x}\in^{d} belongs to the support of at most 2d2^{d} functions ϕm\phi_{\mathbf{m}}, in the fourth a standard bound for the Taylor remainder, and in the fifth the property fWn,(d)1.\|f\|_{\mathcal{W}^{n,\infty}(^{d})}\leq 1.

(where \lceil\cdot\rceil is the ceiling function), then

Note that, by (8) the coefficients of the polynomials PmP_{\mathbf{m}} are uniformly bounded for all fFd,nf\in F_{d,n}:

We have therefore reduced our task to the following: construct a network architecture capable of approximating with uniform error ϵ2\frac{\epsilon}{2} any function of the form (9), assuming that NN is given by (10) and the polynomials PmP_{\mathbf{m}} are of the form (12).

The expansion is a linear combination of not more than dn(N+1)dd^{n}(N+1)^{d} terms ϕm(x)(xmN)n\mathbf{\phi}_{\mathbf{m}}(\mathbf{x})(\mathbf{x}-\frac{\mathbf{m}}{N})^{\mathbf{n}}. Each of these terms is a product of at most d+n1d+n-1 piece-wise linear univariate factors: dd functions ψ(3Nxk3mk)\psi(3Nx_{k}-3m_{k}) (see (5)) and at most n1n-1 linear expressions xkmkNx_{k}-\frac{m_{k}}{N}. We can implement an approximation of this product by a neural network with the help of Proposition 3. Specifically, let ×~\widetilde{\times} be the approximate multiplication from Proposition 3 for M=d+nM=d+n and some accuracy δ\delta to be chosen later, and consider the approximation of the product ϕm(x)(xmN)n\mathbf{\phi}_{\mathbf{m}}(\mathbf{x})(\mathbf{x}-\frac{\mathbf{m}}{N})^{\mathbf{n}} obtained by the chained application of ×~\widetilde{\times}:

that Using statement c) of Proposition 3, we see f~m,n\widetilde{f}_{\mathbf{m},\mathbf{n}} can be implemented by a ReLU network with the depth and the number of weights and computation units not larger than (d+n)c1ln(1/δ),(d+n)c_{1}\ln(1/\delta), for some constant c1=c1(d,n)c_{1}=c_{1}(d,n).

Now we estimate the error of this approximation. Note that we have ψ(3Nxk3mk)1|\psi(3Nx_{k}-3m_{k})|\leq 1 and xkmkN1|x_{k}-\frac{m_{k}}{N}|\leq 1 for all kk and all xd\mathbf{x}\in^{d}. By statement a) of Proposition 3, if a1|a|\leq 1 and bM|b|\leq M, then ×~(a,b)b+δ|\widetilde{\times}(a,b)|\leq|b|+\delta. Repeatedly applying this observation to all approximate multiplications in (14) while assuming δ<1\delta<1, we see that the arguments of all these multiplications are bounded by our MM (equal to d+nd+n) and the statement a) of Proposition 3 holds for each of them. We then have

Moreover, by statement b) of Proposition 3,

We estimate the approximation error of f~\widetilde{f}:

where in the first step we use expansion (13), in the second the identity (16), in the third the bound am,n1|a_{\mathbf{m},\mathbf{n}}|\leq 1 and the fact that xsuppϕm\mathbf{x}\in\operatorname{supp}\phi_{\mathbf{m}} for at most 2d2^{d} functions ϕm,\phi_{\mathbf{m}}, and in the fourth the bound (15). It follows that if we choose

then f~f1ϵ2\|\widetilde{f}-f_{1}\|_{\infty}\leq\frac{\epsilon}{2} and hence, by (11),

On the other hand, note that by (17), f~\widetilde{f} can be implemented by a network consisting of parallel subnetworks that compute each of f~m,n\widetilde{f}_{\mathbf{m},\mathbf{n}}; the final output is obtained by weighting the outputs of the subnetworks with the weights am,na_{\mathbf{m},\mathbf{n}}. The architecture of the full network does not depend on ff; only the weights am,na_{\mathbf{m},\mathbf{n}} do. As already shown, each of these subnetworks has not more than c1ln(1/δ)c_{1}\ln(1/\delta) layers, weights and computation units, with some constant c1=c1(d,n)c_{1}=c_{1}(d,n). There are not more than dn(N+1)dd^{n}(N+1)^{d} such subnetworks. Therefore, the full network for f~\widetilde{f} has not more than c1ln(1/δ)+1c_{1}\ln(1/\delta)+1 layers and dn(N+1)d(c1ln(1/δ)+1)d^{n}(N+1)^{d}(c_{1}\ln(1/\delta)+1) weights and computation units. With δ\delta given by (18) and NN given by (10), we obtain the claimed complexity bounds. ∎

3 Faster approximations using adaptive network architectures

Theorem 1 provides an upper bound for the approximation complexity in the case when the same network architecture is used to approximate all functions in Fd,nF_{d,n}. We can consider an alternative, “adaptive architecture” scenario where not only the weights, but also the architecture is adjusted to the approximated function. We expect, of course, that this would decrease the complexity of the resulting architectures, in general (at the price of needing to find the appropriate architecture). In this section we show that we can indeed obtain better upper bounds in this scenario.

For simplicity, we will only consider the case d=n=1d=n=1. Then, Wn,(d)\mathcal{W}^{n,\infty}(^{d}) is the space of Lipschitz functions on the segment $.Theset. The setF_{1,1}consistsoffunctionsconsists of functionsfhavingbothhaving both\|f\|_{\infty}andtheLipschitzconstantboundedby1.Theorem1providesanupperboundand the Lipschitz constant bounded by 1. Theorem 1 provides an upper boundO(\frac{\ln(1/\epsilon)}{\epsilon})forthenumberofweightsandcomputationunits,butinthisspecialcasethereisinfactabetterboundfor the number of weights and computation units, but in this special case there is in fact a better boundO(\frac{1}{\epsilon})$ obtained simply by piece-wise interpolation.

Namely, given fF1,1f\in F_{1,1} and ϵ>0\epsilon>0, set T=1ϵT=\lceil\frac{1}{\epsilon}\rceil and let f~\widetilde{f} be the piece-wise interpolation of ff with T+1T+1 uniformly spaced breakpoints (tT)t=0T(\frac{t}{T})_{t=0}^{T} (i.e., f~(tT)=f(tT),t=0,,T\widetilde{f}(\frac{t}{T})=f(\frac{t}{T}),t=0,\ldots,T). The function f~\widetilde{f} is also Lipschitz with constant 1 and hence ff~1Tϵ\|f-\widetilde{f}\|_{\infty}\leq\frac{1}{T}\leq\epsilon (since for any xx\in we can find tt such that xtT12T|x-\frac{t}{T}|\leq\frac{1}{2T} and then f(x)f~(x)f(x)f(tT)+f~(tT)f~(x)212T=1T|f(x)-\widetilde{f}(x)|\leq|f(x)-f(\frac{t}{T})|+|\widetilde{f}(\frac{t}{T})-\widetilde{f}(x)|\leq 2\cdot\frac{1}{2T}=\frac{1}{T}). At the same time, the function f~\widetilde{f} can be expressed in terms of the ReLU function σ\sigma by

with some coefficients bb and (wt)t=0T1(w_{t})_{t=0}^{T-1}. This expression can be viewed as a special case of the depth-3 ReLU network with O(1ϵ)O(\frac{1}{\epsilon}) weights and computation units.

We show now how the bound O(1ϵ)O(\frac{1}{\epsilon}) can be improved by using adaptive architectures.

For any fF1,1f\in F_{1,1} and ϵ(0,12)\epsilon\in(0,\frac{1}{2}), there exists a depth-6 ReLU network η\eta (with architecture depending on ff) that provides an ϵ\epsilon-approximation of ff while having not more than cϵln(1/ϵ)\frac{c}{\epsilon\ln(1/\epsilon)} weights, connections and computation units. Here cc is an absolute constant.

We first explain the idea of the proof. We start with interpolating ff by a piece-wise linear function, but not on the length scale ϵ\epsilon – instead, we do it on a coarser length scale mϵm\epsilon, with some m=m(ϵ)>1m=m(\epsilon)>1. We then create a “cache” of auxiliary subnetworks that we use to fill in the details and go down to the scale ϵ\epsilon, in each of the mϵm\epsilon-subintervals. This allows us to reduce the amount of computations for small ϵ\epsilon because the complexity of the cache only depends on mm. The assignment of cached subnetworks to the subintervals is encoded in the network architecture and depends on the function ff. We optimize mm by balancing the complexity of the cache with that of the initial coarse approximation. This leads to mln(1/ϵ)m\sim\ln(1/\epsilon) and hence to the reduction of the total complexity of the network by a factor ln(1/ϵ)\sim\ln(1/\epsilon) compared to the simple piece-wise linear approximation on the scale ϵ\epsilon. This construction is inspired by a similar argument used to prove the O(2n/n)O(2^{n}/n) upper bound for the complexity of Boolean circuits implementing nn-ary functions Shannon .

The proof becomes simpler if, in addition to the ReLU function σ\sigma, we are allowed to use the activation function

in our neural network. Since ρ\rho is discontinuous, we cannot just use Proposition 1 to replace ρ\rho-units by σ\sigma-units. We will first prove the analog of the claimed result for the model including ρ\rho-units, and then we will show how to construct a purely ReLU nework.

For any fF1,1f\in F_{1,1} and ϵ(0,12)\epsilon\in(0,\frac{1}{2}), there exists a depth-5 network including σ\sigma-units and ρ\rho-units, that provides an ϵ\epsilon-approximation of ff while having not more than cϵln(1/ϵ)\frac{c}{\epsilon\ln(1/\epsilon)} weights, where cc is an absolute constant.

Given fF1,1f\in F_{1,1}, we will construct an approximation f~\widetilde{f} to ff in the form

Here, f~1\widetilde{f}_{1} is the piece-wise linear interpolation of ff with the breakpoints {tT}t=0T\{\frac{t}{T}\}_{t=0}^{T}, for some positive integer TT to be chosen later. Since ff is Lipschitz with constant 1, f~1\widetilde{f}_{1} is also Lipschitz with constant 1. We will denote by ItI_{t} the intervals between the breakpoints:

We will now construct f~2\widetilde{f}_{2} as an approximation to the difference

Note that f2f_{2} vanishes at the endpoints of the intervals ItI_{t}:

and f2f_{2} is Lipschitz with constant 2:

since ff and f~1\widetilde{f}_{1} are Lipschitz with constant 1.

Note that the size Γ|\Gamma| of Γ\Gamma is not larger than 3m3^{m}.

Moreover, if f2f_{2} is defined by (20), then, using (21), (22), on each interval ItI_{t} the function f2f_{2} can be approximated with error not larger than 2Tm\frac{2}{Tm} by a properly rescaled function γΓ\gamma\in\Gamma. Namely, for each t=0,,T1t=0,\ldots,T-1 we can define the function gg by g(y)=Tf2(t+yT)g(y)=Tf_{2}(\frac{t+y}{T}). Then it is Lipschitz with constant 2 and g(0)=g(1)=0g(0)=g(1)=0, so we can find γtΓ\gamma_{t}\in\Gamma such that

Note that the obtained assignment tγtt\mapsto\gamma_{t} is not injective, in general (TT will be much larger than Γ|\Gamma|).

We can then define f~2\widetilde{f}_{2} on the whole [0,1)[0,1) by

This f~2\widetilde{f}_{2} approximates f2f_{2} with error 2Tm\frac{2}{Tm} on [0,1)[0,1):

and hence, by (20), for the full approximation f~=f~1+f~2\widetilde{f}=\widetilde{f}_{1}+\widetilde{f}_{2} we will also have

Note that the approximation f~2\widetilde{f}_{2} has properties analogous to (21), (22):

in particular, f~2\widetilde{f}_{2} is continuous on [0,1)[0,1).

We will now rewrite f~2\widetilde{f}_{2} in a different form interpretable as a computation by a neural network. Specifically, using our additional activation function ρ\rho given by (19), we can express f~2\widetilde{f}_{2} as

Indeed, given x[0,1)x\in[0,1), observe that all the terms in the inner sum vanish except for the one corresponding to the tt determined by the condition xItx\in I_{t}. For this particular tt we have ρ(Txt)=Txt\rho(Tx-t)=Tx-t. Since γ(0)=0\gamma(0)=0, we conclude that (28) agrees with (23).

Let us also expand γΓ\gamma\in\Gamma over the basis of shifted ReLU functions:

Substituting this expansion in (28), we finally obtain

The first layer contains the single input unit Q(1)Q^{(1)}.

The second layer contains TT units (Qt(2))t=1T(Q^{(2)}_{t})_{t=1}^{T} computing Qt(2)=ρ(TQ(1)t).Q^{(2)}_{t}=\rho(TQ^{(1)}-t).

The third layer contains Γ|\Gamma| units (Qγ(3))γΓ(Q^{(3)}_{\gamma})_{\gamma\in\Gamma} computing Qγ(3)=σ(t:γt=γQt(2))Q^{(3)}_{\gamma}=\sigma(\sum_{t:\gamma_{t}=\gamma}Q^{(2)}_{t}). This is equivalent to Qγ(3)=t:γt=γQt(2)Q^{(3)}_{\gamma}=\sum_{t:\gamma_{t}=\gamma}Q^{(2)}_{t}, because Qt(2)0Q^{(2)}_{t}\geq 0.

The fourth layer contains mΓm|\Gamma| units (Qγ,r(4))r=0,,m1γΓ(Q^{(4)}_{\gamma,r})_{\stackrel{{\scriptstyle\gamma\in\Gamma}}{{r=0,\ldots,m-1}}} computing Qγ,r(4)=σ(Qγ(3)rm).Q^{(4)}_{\gamma,r}=\sigma(Q^{(3)}_{\gamma}-\frac{r}{m}).

The final layer consists of a single output unit Q(5)=γΓr=0m1cγ,rTQγ,r(4).Q^{(5)}=\sum_{\gamma\in\Gamma}\sum_{r=0}^{m-1}\frac{c_{\gamma,r}}{T}Q^{(4)}_{\gamma,r}.

Examining this network, we see that the total number of connections and units in it is O(T+mΓ)O(T+m|\Gamma|) and hence is O(T+m3m)O(T+m3^{m}). This also holds for the full network implementing f~=f~1+f~2\widetilde{f}=\widetilde{f}_{1}+\widetilde{f}_{2}, since the term f~1\widetilde{f}_{1} requires even fewer layers, connections and units. The output units of the subnetworks for f~1\widetilde{f}_{1} and f~2\widetilde{f}_{2} can be merged into the output unit for f~1+f~2\widetilde{f}_{1}+\widetilde{f}_{2}, so the depth of the full network is the maximum of the depths of the networks implementing f~1\widetilde{f}_{1} and f~2\widetilde{f}_{2}, i.e., is 5 (see Fig. 4).

Now, given ϵ(0,12)\epsilon\in(0,\frac{1}{2}), take m=12log3(1/ϵ)m=\lceil\frac{1}{2}\log_{3}(1/\epsilon)\rceil and T=2mϵ.T=\lceil\frac{2}{m\epsilon}\rceil. Then, by (25), the approximation error maxxf(x)f~(x)2Tmϵ\max_{x\in}|f(x)-\widetilde{f}(x)|\leq\frac{2}{Tm}\leq\epsilon, while T+m3m=O(1ϵln(1/ϵ))T+m3^{m}=O(\frac{1}{\epsilon\ln(1/\epsilon)}), which implies the claimed complexity bound. ∎

We show now how to modify the constructed network so as to remove ρ\rho-units. We only need to modify the f~2\widetilde{f}_{2} part of the network. We will show that for any δ>0\delta>0 we can replace f~2\widetilde{f}_{2} with a function f~3,δ\widetilde{f}_{3,\delta} (defined below) that

obeys the following analog of approximation bound (24):

and is implementable by a depth-6 ReLU network having complexity c(T+m3m)c(T+m3^{m}) with an absolute constant cc independent of δ\delta.

Since δ\delta can be taken arbitrarily small, the Theorem then follows by arguing as in Lemma 1, only with f~2\widetilde{f}_{2} replaced by f~3,δ\widetilde{f}_{3,\delta}.

As a first step, we approximate ρ\rho by a continuous piece-wise linear function ρδ\rho_{\delta}, with a small δ>0\delta>0:

Let f~2,δ\widetilde{f}_{2,\delta} be defined as f~2\widetilde{f}_{2} in (29), but with ρ\rho replaced by ρδ\rho_{\delta}:

Since ρδ\rho_{\delta} is a continuous piece-wise linear function with three breakpoints, we can express it via the ReLU function, and hence implement f~2,δ\widetilde{f}_{2,\delta} by a purely ReLU network, as in Proposition 1, and the complexity of the implementation does not depend on δ\delta. However, replacing ρ\rho with ρδ\rho_{\delta} affects the function f~2\widetilde{f}_{2} on the intervals (tδT,tT],t=1,,T(\frac{t-\delta}{T},\frac{t}{T}],t=1,\ldots,T, introducing there a large error (of magnitude O(1T)O(\frac{1}{T})). But recall that both f2f_{2} and f~2\widetilde{f}_{2} vanish at the points tT,t=0,,T,\frac{t}{T},t=0,\ldots,T, by (21), (26). We can then largely remove this newly introduced error by simply suppressing f~2,δ\widetilde{f}_{2,\delta} near the points tT\frac{t}{T}.

Precisely, consider the continuous piece-wise linear function

and the full comb-like filtering function

Note that Φδ\Phi_{\delta} is continuous piece-wise linear with 4T4T breakpoints, and 0Φδ(x)10\leq\Phi_{\delta}(x)\leq 1. We then define our final modification of f~2\widetilde{f}_{2} as

The function f~3,δ\widetilde{f}_{3,\delta} obeys the bound (30).

Given x[0,1)x\in[0,1), let t{0,,T1}t\in\{0,\ldots,T-1\} and y[0,1)y\in[0,1) be determined from the representation x=t+yTx=\frac{t+y}{T} (i.e., yy is the relative position of xx in the respective interval ItI_{t}). Consider several possibilities for yy:

y[1δ,1]y\in[1-\delta,1]. In this case Φδ(x)=0\Phi_{\delta}(x)=0. Note that

because, by construction, supxf~2,δ(x)supxf~2(x)\sup_{x\in}|\widetilde{f}_{2,\delta}(x)|\leq\sup_{x\in}|\widetilde{f}_{2}(x)|, and supxf~2(x)1\sup_{x\in}|\widetilde{f}_{2}(x)|\leq 1 by (26), (27). It follows that both terms in (31) vanish, i.e., f~3,δ(x)=0\widetilde{f}_{3,\delta}(x)=0. But, since f2f_{2} is Lipschitz with constant 2 by (22) and f2(t+1T)=0f_{2}(\frac{t+1}{T})=0, we have f2(x)f2(x)f2(t+1T)2y1T2δT|f_{2}(x)|\leq|f_{2}(x)-f_{2}(\frac{t+1}{T})|\leq\frac{2|y-1|}{T}\leq\frac{2\delta}{T}. This implies f2(x)f~3,δ(x)2δT|f_{2}(x)-\widetilde{f}_{3,\delta}(x)|\leq\frac{2\delta}{T}.

y[δ,12δ]y\in[\delta,1-2\delta]. In this case Φδ(x)=1\Phi_{\delta}(x)=1 and f~2,δ(x)=f~2(x)\widetilde{f}_{2,\delta}(x)=\widetilde{f}_{2}(x). Using (32), we find that f~3,δ(x)=f~2,δ(x)=f~2(x)\widetilde{f}_{3,\delta}(x)=\widetilde{f}_{2,\delta}(x)=\widetilde{f}_{2}(x). It follows that f2(x)f~3,δ(x)=f2(x)f~2(x)2Tm|f_{2}(x)-\widetilde{f}_{3,\delta}(x)|=|f_{2}(x)-\widetilde{f}_{2}(x)|\leq\frac{2}{Tm}.

y[0,δ][12δ,1δ]y\in[0,\delta]\cup[1-2\delta,1-\delta]. In this case f~2,δ(x)=f~2(x)\widetilde{f}_{2,\delta}(x)=\widetilde{f}_{2}(x). Since σ\sigma is Lipschitz with constant 1, f~3,δ(x)f~2,δ(x)=f~2(x)|\widetilde{f}_{3,\delta}(x)|\leq|\widetilde{f}_{2,\delta}(x)|=|\widetilde{f}_{2}(x)|. Both f2f_{2} and f~2\widetilde{f}_{2} are Lipschitz with constant 2 (by (22), (27)) and vanish at tT\frac{t}{T} and t+1T\frac{t+1}{T} (by (21), (26)). It follows that

It remains to verify the complexity property b) of the function f~3,δ\widetilde{f}_{3,\delta}. As already mentioned, f~2,δ\widetilde{f}_{2,\delta} can be implemented by a depth-5 purely ReLU network with not more than c(T+m3m)c(T+m3^{m}) weights, connections and computation units, where cc is an absolute constant independent of δ\delta. The function Φδ\Phi_{\delta} can be implemented by a shallow, depth-3 network with O(T)O(T) units and connection. Then, computation of f~3,δ\widetilde{f}_{3,\delta} can be implemented by a network including two subnetworks for computing f~2,δ\widetilde{f}_{2,\delta} and Ψδ\Psi_{\delta}, and an additional layer containing two σ\sigma-units as written in (31). We thus obtain 6 layers in the resulting full network and, choosing TT and mm in the same way as in Lemma 1, obtain the bound cϵln(1/ϵ)\frac{c}{\epsilon\ln(1/\epsilon)} for the number of its connections, weights, and computation units. ∎

Lower bounds

The method of continuous nonlinear widths (DeVore et al. ) is a very general approach to the analysis of parameterized nonlinear approximations, based on the assumption of continuous selection of their parameters. We are interested in the following lower bound for the complexity of approximations in Wn,(d).\mathcal{W}^{n,\infty}(^{d}).

The hypothesis of continuous weight selection is crucial in Theorem 3. By examining our proof of the counterpart upper bound O(ϵd/nln(1/ϵ))O(\epsilon^{-d/n}\ln(1/\epsilon)) in Theorem 1, the weights are selected there in a continuous manner, so this upper bound asymptotically lies above cϵd/nc\epsilon^{-d/n} in agreement with Theorem 3. We remark, however, that the optimal choice of the network weights (minimizing the error) is known to be discontinuous in general, even for shallow networks (Kainen et al. ).

We also compare the bounds of Theorems 3 and 2. In the case d=n=1d=n=1, Theorem 3 provides a lower bound cϵ\frac{c}{\epsilon} for the number of weights and connections. On the other hand, in the adaptive architecture scenario, Theorem 2 provides the upper bound cϵln(1/ϵ)\frac{c}{\epsilon\ln(1/\epsilon)} for the number of weights, connections and computation units. The fact that this latter bound is asymptotically below the bound of Theorem 3 reflects the extra expressiveness associated with variable network architecture.

2 Bounds based on VC-dimension

In this section we consider the setup where the same network architecture is used to approximate all functions fFd,nf\in F_{d,n}, but the dependence of the weights on ff is not assumed to be necessarily continuous. In this setup, some lower bounds on the network complexity can be obtained as a consequence of existing upper bounds on VC-dimension of networks with piece-wise polynomial activation functions and Boolean outputs (Anthony and Bartlett ). In the next theorem, part a) is a more general but weaker bound, while part b) is a stronger bound assuming a constrained growth of the network depth.

For any ϵ(0,1)\epsilon\in(0,1), a ReLU network architecture capable of approximating any function fFd,nf\in F_{d,n} with error ϵ\epsilon must have at least cϵd/(2n)c\epsilon^{-d/(2n)} weights, with some constant c=c(d,n)>0c=c(d,n)>0.

Let p0,c1>0p\geq 0,c_{1}>0 be some constants. For any ϵ(0,12)\epsilon\in(0,\frac{1}{2}), if a ReLU network architecture of depth Lc1lnp(1/ϵ)L\leq c_{1}\ln^{p}(1/\epsilon) is capable of approximating any function fFd,nf\in F_{d,n} with error ϵ\epsilon, then the network must have at least c2ϵd/nln2p1(1/ϵ)c_{2}\epsilon^{-d/n}\ln^{-2p-1}(1/\epsilon) weights, with some constant c2=c2(d,n,p,c1)>0c_{2}=c_{2}(d,n,p,c_{1})>0.The author thanks Matus Telgarsky for suggesting this part of the theorem.

where WW is the number of weights, LL is the network depth, and c3c_{3} is an absolute constant.

Let us obtain a condition ensuring that such fFd,nf\in F_{d,n}. For any multi-index n\mathbf{n},

with the constant c4=(maxn:nnmaxxDnϕ(x))1c_{4}=(\max_{\mathbf{n}:|\mathbf{n}|\leq n}\max_{\mathbf{x}}|D^{\mathbf{n}}\phi(\mathbf{x})|)^{-1}, then fFd,nf\in F_{d,n}.

Suppose that there is a ReLU network architecture η\eta that can approximate, by adjusting its weights, any fFd,nf\in F_{d,n} with error less than ϵ\epsilon. Denote by η(x,w)\eta(\mathbf{x},\mathbf{w}) the output of the network for the input vector x\mathbf{x} and the vector of weights w\mathbf{w}.

Consider any assignment z\mathbf{z} of Boolean values z1,,zNd{0,1}z_{1},\ldots,z_{N^{d}}\in\{0,1\}. Set

and let ff be given by (35) (see Fig. 5); then (36) holds and hence fFd,nf\in F_{d,n}.

By assumption, there is then a vector of weights, w=wz\mathbf{w}=\mathbf{w}_{\mathbf{z}}, such that for all mm we have η(xm,wz)ymϵ,|\eta(\mathbf{x}_{m},\mathbf{w}_{\mathbf{z}})-y_{m}|\leq\epsilon, and in particular

Since the Boolean values zmz_{m} were arbitrary, we conclude that the subset SS is shattered and hence

Expressing NN through ϵ\epsilon with (37), we obtain

To establish part a) of the Theorem, we apply bound (33) to the network η1\eta_{1}:

where WW is the number of weights in η1\eta_{1}, which is the same as in η\eta if we do not count the threshold parameter. Combining (38) with (39), we obtain the desired lower bound Wcϵd/(2n)W\geq c\epsilon^{-d/(2n)} with c=(c4/3)d/(2n)c31/2c=(c_{4}/3)^{d/(2n)}c_{3}^{-1/2}.

To establish part b) of the Theorem, we use bound (34) and the hypothesis Lc1lnp(1/ϵ)L\leq c_{1}\ln^{p}(1/\epsilon):

Trying a WW of the form Wc2=c2ϵd/nln2p1(1/ϵ)W_{c_{2}}=c_{2}\epsilon^{-d/n}\ln^{-2p-1}(1/\epsilon) with a constant c2c_{2}, we get

Comparing this with (41), we see that if we choose c2<(c4/3)d/nn/(dc3c12)c_{2}<(c_{4}/3)^{d/n}n/(dc_{3}c_{1}^{2}), then for sufficiently small ϵ\epsilon we have WlnWWc2lnWc2W\ln W\geq W_{c_{2}}\ln W_{c_{2}} and hence WWc2W\geq W_{c_{2}}, as claimed. We can ensure that WWc2W\geq W_{c_{2}} for all ϵ(0,12)\epsilon\in(0,\frac{1}{2}) by further decreasing c2c_{2}. ∎

We remark that the constrained depth hypothesis of part b) is satisfied, with p=1p=1, by the architecture used for the upper bound in Theorem 1. The bound stated in part b) of Theorem 4 matches the upper bound of Theorem 1 and the lower bound of Theorem 3 up to a power of ln(1/ϵ)\ln(1/\epsilon).

3 Adaptive network architectures

Our goal in this section is to obtain a lower bound for the approximation complexity in the scenario where the network architecture may depend on the approximated function. This lower bound is thus a counterpart to the upper bound of Section 3.3.

To state this result we define the complexity N(f,ϵ)\mathcal{N}(f,\epsilon) of approximating the function ff with error ϵ\epsilon as the minimal number of hidden computation units in a ReLU network that provides such an approximation.

For any d,nd,n, there exists fWn,(d)f\in\mathcal{W}^{n,\infty}(^{d}) such that N(f,ϵ)\mathcal{N}(f,\epsilon) is not o(ϵd/(9n))o(\epsilon^{-d/(9n)}) as ϵ0\epsilon\to 0.

Fix d,nd,n. For any sufficiently small ϵ>0\epsilon>0 there exists fϵFd,nf_{\epsilon}\in F_{d,n} such that N(fϵ,ϵ)c1ϵd/(8n)\mathcal{N}(f_{\epsilon},\epsilon)\geq c_{1}\epsilon^{-d/(8n)}, with some constant c1=c1(d,n)>0c_{1}=c_{1}(d,n)>0.

Observe that all the networks with not more than mm hidden computation units can be embedded in the single “enveloping” network that has mm hidden layers, each consisting of mm units, and that includes all the connections between units not in the same layer (see Fig. 6(a)). The number of weights in this enveloping network is O(m4)O(m^{4}). On the other hand, Theorem 4a) states that at least cϵd/(2n)c\epsilon^{-d/(2n)} weights are needed for an architecture capable of ϵ\epsilon-approximating any function in Fd,nF_{d,n}. It follows that there is a function fϵFd,nf_{\epsilon}\in F_{d,n} that cannot be ϵ\epsilon-approximated by networks with fewer than c1ϵd/(8n)c_{1}\epsilon^{-d/(8n)} computation units. ∎

Before proceeding to the proof of Theorem 5, note that N(f,ϵ)\mathcal{N}(f,\epsilon) is a monotone decreasing function of ϵ\epsilon with a few obvious properties:

(follows by multiplying the weights of the output unit of the approximating network by a constant);

(follows by approximating f±gf\pm g by an approximation of ff);

(follows by combining approximating networks for f1f_{1} and f2f_{2} as in Fig. 6(b)).

The claim of Theorem 5 is similar to the claim of Lemma 3, but is about a single function ff satisfying a slightly weaker complexity bound at multiple values of ϵ0\epsilon\to 0. We will assume that Theorem 5 is false, i.e.,

for all fWn,(d),f\in\mathcal{W}^{n,\infty}(^{d}), and we will reach contradiction by presenting ff violating this assumption. Specifically, we construct this ff as

We determine ak,fk,ϵka_{k},f_{k},\epsilon_{k} sequentially. Suppose we have already found {as,fs,ϵs}s=1k1\{a_{s},f_{s},\epsilon_{s}\}_{s=1}^{k-1}; let us describe how we define ak,fk,ϵka_{k},f_{k},\epsilon_{k}.

so that the function ff defined by the series (46) will be in Wn,(d)\mathcal{W}^{n,\infty}(^{d}), because fkWn,(d)1\|f_{k}\|_{\mathcal{W}^{n,\infty}(^{d})}\leq 1.

Next, using Lemma 3 and Eq. (42), observe that if ϵk\epsilon_{k} is sufficiently small, then we can find fkFd,nf_{k}\in F_{d,n} such that

In addition, by assumption (45), if ϵk\epsilon_{k} is small enough then

Let us choose ϵk\epsilon_{k} and fkf_{k} so that both (49) and (50) hold. Obviously, we can also make sure that ϵk0\epsilon_{k}\to 0 as kk\to\infty.

Let us check that the above choice of {ak,fk,ϵk}k=1\{a_{k},f_{k},\epsilon_{k}\}_{k=1}^{\infty} ensures that inequality (47) holds for all kk:

Here in the first step we use inequality (43), in the second the monotonicity of N(f,ϵ)\mathcal{N}(f,\epsilon), in the third the monotonicity of N(f,ϵ)\mathcal{N}(f,\epsilon) and the setting (48), in the fourth the inequality (44), and in the fifth the conditions (49) and (50). ∎

4 Slow approximation of smooth functions by shallow networks

In this section we show that, in contrast to deep ReLU networks, shallow ReLU networks relatively inefficiently approximate sufficiently smooth (C2C^{2}) nonlinear functions. We remark that Liang and Srikant 2016 prove a similar result assuming global convexity instead of smoothness and nonlinearity.

Let fC2(d)f\in C^{2}(^{d}) be a nonlinear function (i.e., not of the form f(x1,,xd)a0+k=1dakxkf(x_{1},\ldots,x_{d})\equiv a_{0}+\sum_{k=1}^{d}a_{k}x_{k} on the whole d^{d}). Then, for any fixed LL, a depth-LL ReLU network approximating ff with error ϵ(0,1)\epsilon\in(0,1) must have at least cϵ1/(2(L2))c\epsilon^{-1/(2(L-2))} weights and computation units, with some constant c=c(f,L)>0.c=c(f,L)>0.

Suppose that f~\widetilde{f} is an ϵ\epsilon-approximation of function ff, and let f~\widetilde{f} be implemented by a ReLU network η\eta of depth LL. Let f~1:xf~(x0+xv)\widetilde{f}_{1}:x\mapsto\widetilde{f}(\mathbf{x}_{0}+x\mathbf{v}). Then f~1\widetilde{f}_{1} also approximates f1f_{1} with error not larger than ϵ\epsilon. Moreover, since f~1\widetilde{f}_{1} is obtained from f~\widetilde{f} by a linear substitution x=x0+xv\mathbf{x}=\mathbf{x}_{0}+x\mathbf{v}, f~1\widetilde{f}_{1} can be implemented by a ReLU network η1\eta_{1} of the same depth LL and with the number of units and weights not larger than in η\eta (we can obtain η1\eta_{1} from η\eta by replacing the input layer in η\eta with a single unit, accordingly modifying the input connections, and adjusting the weights associated with these connections). It is thus sufficient to establish the claimed bounds for η1\eta_{1}.

By construction, f~1\widetilde{f}_{1} is a continuous piece-wise linear function of xx. Denote by MM the number of linear pieces in f~1\widetilde{f}_{1}. We will use the following counting lemma.

M(2U)L2M\leq(2U)^{L-2}, where UU is the number of computation units in η1\eta_{1}.

This bound, up to minor details, is proved in Lemma 2.1 of Telgarsky . Precisely, Telgarsky’s lemma states that if a network has a single input, connections only between neighboring layers, at most mm units in a layer, and a piece-wise linear activation function with tt pieces, then the number of linear pieces in the output of the network is not greater than (tm)L(tm)^{L}. By examining the proof of the lemma we see that it will remain valid for networks with connections not necessarily between neighboring layers, if we replace mm by UU in the expression (tm)L(tm)^{L}. Moreover, we can slightly strengthen the bound by noting that in the present paper the input and output units are counted as separate layers, only units of layers 3 to LL have multiple incoming connections, and the activation function is applied only in layers 2 to L1L-1. By following Telgarsky’s arguments, this gives the slightly more accurate bound (tU)L2(tU)^{L-2} (i.e., with the power L2L-2 instead of LL). It remains to note that the ReLU activation function corresponds to t=2t=2. ∎

Lemma 4 implies that there is an interval [a,b][a,b]\subset of length not less than 2(2U)(L2)2(2U)^{-(L-2)} on which the function f~1\widetilde{f}_{1} is linear. Let g=f1f~1.g=f_{1}-\widetilde{f}_{1}. Then, by the approximation accuracy assumption, supx[a,b]g(x)ϵ\sup_{x\in[a,b]}|g(x)|\leq\epsilon, while by (51) and by the linearity of f~1\widetilde{f}_{1} on [a,b][a,b], maxx[a,b]g(x)c1>0.\max_{x\in[a,b]}g^{\prime\prime}(x)\geq c_{1}>0. It follows that max(g(a),g(b))g(a+b2)+c12(ba2)2\max(g(a),g(b))\geq g(\frac{a+b}{2})+\frac{c_{1}}{2}(\frac{b-a}{2})^{2} and hence

which implies the claimed bound U12(4ϵc1)1/(2(L2))U\geq\frac{1}{2}(\frac{4\epsilon}{c_{1}})^{-1/(2(L-2))}. Since there are at least as many weights as computation units in a network, a similar bound holds for the number of weights. ∎

Discussion

We discuss some implications of the obtained bounds.

Our results clearly show that deep ReLU networks more efficiently express smooth functions than shallow ReLU networks. By Theorem 1, functions from the Sobolev space Wn,(d)\mathcal{W}^{n,\infty}(^{d}) can be ϵ\epsilon-approximated by ReLU networks with depth O(ln(1/ϵ))O(\ln(1/\epsilon)) and the number of computation units O(ϵd/nln(1/ϵ))O(\epsilon^{-d/n}\ln(1/\epsilon)). In contrast, by Theorem 6, a nonlinear function from C2(d)C^{2}(^{d}) cannot be ϵ\epsilon-approximated by a ReLU network of fixed depth LL with the number of units less than cϵ1/(2(L2)).c\epsilon^{-1/(2(L-2))}. In particular, it follows that in terms of the required number of computation units, unbounded-depth approximations of functions from Wn,(d)\mathcal{W}^{n,\infty}(^{d}) are asymptotically strictly more efficient than approximations with a fixed depth LL at least when

(assuming also n>2n>2, so that Wn,(d)C2(d)\mathcal{W}^{n,\infty}(^{d})\subset C^{2}(^{d})). The efficiency of depth is even more pronounced for very smooth functions such as polynomials, which can be implemented by deep networks using only O(ln(1/ϵ))O(\ln(1/\epsilon)) units (cf. Propositions 2 and 3 and the proof of Theorem 1). Liang and Srikant describe in Liang and Srikant some conditions on the approximated function (resembling conditions of local analyticity) under which complexity of deep ϵ\epsilon-approximation is O(lnc(1/ϵ))O(\ln^{c}(1/\epsilon)) with a constant power cc.

Continuous model selection vs. function-dependent network architectures.

When approximating a function by a neural network, one can either view the network architecture as fixed and only tune the weights, or optimize the architecture as well. Moreover, when tuning the weights, one can either require them to continuously depend on the approximated function or not. We naturally expect that more freedom in the choice of the approximation should lead to higher expressiveness.

Our bounds confirm this expectation to a certain extent. Specifically, the complexity of ϵ\epsilon-approximation of functions from the unit ball F1,1F_{1,1} in W1,()\mathcal{W}^{1,\infty}() is lower bounded by cϵ\frac{c}{\epsilon} in the scenario with a fixed architecture and continuously selected weights (see Theorem 3). On the other hand, we show in Theorem 2 that this complexity is upper bounded by O(1ϵln(1/ϵ))O(\frac{1}{\epsilon\ln(1/\epsilon)}) if we are allowed to adjust the network architecture. This bound is achieved by finite-depth (depth-6) ReLU networks using the idea of reused subnetworks familiar from the theory of Boolean circuits Shannon .

In the case of fixed architecture, we have not established any evidence of complexity improvement for unconstrained weight selection compared to continuous weight selection. We remark however that, already for approximations with depth-3 networks, the optimal weights are known to discontinuously depend, in general, on the approximated function (Kainen et al. ). On the other hand, part b) of Theorem 4 shows that if the network depth scales as O(lnp(1/ϵ))O(\ln^{p}(1/\epsilon)), discontinuous weight selection cannot improve the continuous-case complexity more than by a factor being some power of ln(1/ϵ)\ln(1/\epsilon).

Upper vs. lower complexity bounds.

We indicate the gaps between respective upper and lower bounds in the three scenarios mentioned above: fixed architectures with continuous selection of weights, fixed architectures with unconstrained selection of weights, or adaptive architectures.

For fixed architectures with continuous selection the lower bound cϵd/nc\epsilon^{-d/n} is provided by Proposition 3, and the upper bound O(ϵd/nln(1/ϵ))O(\epsilon^{-d/n}\ln(1/\epsilon)) by Theorem 1, so these bounds are tight up to a factor O(ln(1/ϵ))O(\ln(1/\epsilon)).

In the case of fixed architecture but unconstrained selection, part b) of Theorem 4 gives a lower bound cϵd/nln2p1(1/ϵ)c\epsilon^{-d/n}\ln^{-2p-1}(1/\epsilon) under assumption that the depth is constrained by O(lnp(1/ϵ))O(\ln^{p}(1/\epsilon)). This is only different by a factor of O(ln2p+2(1/ϵ))O(\ln^{2p+2}(1/\epsilon)) from the upper bound of Theorem 1. Without this depth constraint we only have the significantly weaker bound cϵd/(2n)c\epsilon^{-d/(2n)} (part a) of Theorem 4).

In the case of adaptive architectures, there is a big gap between our upper and lower bounds. The upper bound O(1ϵln(1/ϵ))O(\frac{1}{\epsilon\ln(1/\epsilon)}) is given by Theorem 2 for d=n=1d=n=1. The lower bound, proved for general d,nd,n in Theorem 5, only states that there are fWn,(d)f\in\mathcal{W}^{n,\infty}(^{d}) for which the complexity is not o(ϵd/(9n))o(\epsilon^{-d/(9n)}).

ReLU vs. smooth activation functions.

A popular general-purpose method of approximation is shallow (depth-3) networks with smooth activation functions (e.g., logistic sigmoid). Upper and lower approximation complexity bounds for these networks (Mhaskar , Maiorov and Meir ) show that complexity scales as ϵd/n\sim\epsilon^{-d/n} up to some ln(1/ϵ)\ln(1/\epsilon) factors. Comparing this with our bounds in Theorems 1,2,4, it appears that deep ReLU networks are roughly (up to ln(1/ϵ)\ln(1/\epsilon) factors) as expressive as shallow networks with smooth activation functions.

Conclusion.

We have established several upper and lower bounds for the expressive power of deep ReLU networks in the context of approximation in Sobolev spaces. We should note, however, that this setting may not quite reflect typical real world applications, which usually possess symmetries and hierarchical and other structural properties substantially narrowing the actually interesting classes of approximated functions (LeCun et al. ). Some recent publications introduce and study expressive power of deep networks in frameworks bridging this gap, in particular, graph-based hierarchical approximations are studied in Mhaskar et al. , Mhaskar and Poggio and convolutional arithmetic circuits in Cohen et al. . Theoretical analysis of expressiveness of deep networks taking into account such properties of real data seems to be an important and promising direction of future research.

Acknowledgments

The author thanks Matus Telgarsky and the anonymous referees for multiple helpful comments on the preliminary versions of the paper. The research was funded by Skolkovo Institute of Science and Technology.

References