Universal Function Approximation by Deep Neural Nets with Bounded Width and ReLU Activations

Boris Hanin

Introduction

Over the past several years, neural nets - particularly deep nets - have become the state of the art in a remarkable number of machine learning problems, from mastering Go to image recognition/segmentation and machine translation (see the review article for more background). Despite all their practical successes, a robust theory of why they work so well is in its infancy. Much of the work to date has focused on the problem of explaining and quantifying the expressivity - the ability to approximate a rich class of functions - of deep neural nets . Expressivity can be seen both as an effect of both depth and width. It has been known since at least the work of Cybenko and Hornik-Stinchcombe-White that if no constraint is placed on the width of a hidden layer, then a single hidden layer is enough to approximate essentially any function. The purpose of this article, in contrast, is to investigate the “effect of depth without the aid of width.” More precisely, for each d1d\geq 1 we would like to estimate

In Theorem 1, we prove that ωmin(d)d+2.\omega_{\text{min}}(d)\leq d+2. This raises two questions:

Is the estimate in the previous line sharp?

How efficiently can ReLU\operatorname{ReLU} nets of a given width wwmin(d)w\geq w_{\text{min}}(d) approximate a given continuous function of dd variables?

On the subject of Q1, we will prove in forthcoming work with M. Sellke that in fact ωmin(d)=d+1.\omega_{\text{min}}(d)=d+1. When d=1d=1, the lower bound is simple to check, and the upper bound follows for example from Theorem 3.1 in . The main results in this article, however, concern Q1 and Q2 for convex functions. For instance, we prove in Theorem 1 that

This illustrates a central point of the present paper: the convexity of the ReLU\operatorname{ReLU} activation makes ReLU\operatorname{ReLU} nets well-adapted to representing convex functions on d.^{d}.

Theorem 1 also addresses Q2 by providing quantitative estimates on the depth of a ReLU\operatorname{ReLU} net with width d+1d+1 that approximates a given convex function. We provide similar depth estimates for arbitrary continuous functions on d,^{d}, but this time for nets of width d+3.d+3. Several of our depth estimates are based on the work of Balázs-György-Szepesvári on max-affine estimators in convex regression.

In order to prove Theorem 1, we must understand what functions can be exactly computed by a ReLU\operatorname{ReLU} net. Such functions are always piecewise affine, and we prove in Theorem 2 the converse: every piecewise affine function on d^{d} can be exactly represented by a ReLU\operatorname{ReLU} net with hidden layer width at most d+3d+3. Moreover, we prove that the depth of the network that computes such a function is bounded by the number affine pieces it contains. This extends the results of Arora-Basu-Mianjy-Mukherjee (e.g. Theorem 2.1 and Corollary 2.2 in ).

Convex functions again play a special role. We show that every convex function on d^{d} that is piecewise affine with NN pieces can be represented exactly by a ReLU\operatorname{ReLU} net with width d+1d+1 and depth N.N.

Statement of Results

the modulus of continuity of f,f, whose value at ε\varepsilon is the maximum ff changes when its argument moves by at most ε.\varepsilon. Note that by definition of a continuous function, ωf(ε)0\omega_{f}(\varepsilon)\rightarrow 0 as ε0.\varepsilon\rightarrow 0. Next, given din,dout,d_{\text{in}},d_{\text{out}}, and w1,w\geq 1, we define a feed-forward neural net with ReLU activations, input dimension dind_{\text{in}}, hidden layer width ww, depth n,n, and output dimension doutd_{\text{out}} to be any member of the finite-dimensional family of functions

are affine transformations, and for every m1m\geq 1

We often denote such a net by N\mathcal{N} and write

for the function it computes. Our first result contrasts both the width and depth required to approximate continuous, convex, and smooth functions by ReLU\operatorname{ReLU} nets.

There exists a sequence of feed-forward neural nets Nk\mathcal{N}_{k} with ReLU activations, input dimension d,d, hidden layer width d+2,d+2, output dimension 1,1, such that

In particular, wmin(d)d+2.w_{\text{min}}(d)\leq d+2. Moreover, write ωf\omega_{f} for the modulus of continuity of f,f, and fix ε>0.\varepsilon>0. There exists a feed-forward neural nets Nε\mathcal{N}_{\varepsilon} with ReLU activations, input dimension d,d, hidden layer width d+3,d+3, output dimension 1,1, and

There exists a sequence of feed-forward neural nets Nk\mathcal{N}_{k} with ReLU activations, input dimension d,d, hidden layer width d+1,d+1, and output dimension 1,1, such that

Hence, ωminconv(d)d+1.\omega_{\text{min}}^{\text{conv}}(d)\leq d+1. Further, there exists C>0C>0 such that if ff is both convex and Lipschitz with Lipschitz constant L,L, then the nets Nk\mathcal{N}_{k} in (8) can be taken to satisfy

There exists a constant KK depending only on dd and a constant CC depending only on the maximum of the first KK derivative of ff such that for every k3k\geq 3 the width d+2d+2 nets Nk\mathcal{N}_{k} in (5) can be chosen so that

The main novelty of Theorem 1 is the width estimate wminconv(d)d+1w_{\text{min}}^{\text{conv}}(d)\leq d+1 and the quantitative depth estimates (9) for convex functions as well as the analogous estimates (6) and (7) for continuous functions. Let us breifly explain the origin of the other estimates. The relation (5) and the corresponding estimate wmin(d)d+2w_{\text{min}}(d)\leq d+2 are a combination of the well-known fact that ReLU\operatorname{ReLU} nets with one hidden layer can approximate any continuous function and a simple procedure by which a ReLU\operatorname{ReLU} net with input dimension dd and a single hidden layer of width nn can be replaced by another ReLU\operatorname{ReLU} net that computes the same function but has depth n+2n+2 and width d+2.d+2. For these width d+2d+2 nets, we are unaware of how to obtain quantitative estimates on the depth required to approximate a fixed continuous function to a given precision. At the expense of changing the width of our ReLU\operatorname{ReLU} nets from d+2d+2 to d+3,d+3, however, we furnish the estimates (6) and (7). On the other hand, using Theorem 3.1 in , when ff is sufficiently smooth, we obtain the depth estimates (10) for width d+2d+2 ReLU\operatorname{ReLU} nets.

Our next result concerns the exact representation of piecewise affine functions by ReLU\operatorname{ReLU} nets. Instead of measuring the complexity of a such a function by its Lipschitz constant or modulus of continuity, the complexity of a piecewise affine function can be thought of as the minimal number of affine pieces needed to define it.

Moreover, there exists a feed-forward neural net N\mathcal{N} with ReLU activations, input dimension d,d, hidden layer width d+3,d+3, output dimension 1,1, and

that computes ff exactly. Finally, if ff is convex (and hence hh vanishes), then the width of N\mathcal{N} can be taken to be d+1d+1 and the depth can be taken to N.N.

The fact that the function computed by a ReLU\operatorname{ReLU} net can be written as (11) follows from Theorem 2.1 in . The novelty in Theorem 2 is therefore the uniform width estimate d+3d+3 in the representation on any function computed by a ReLU\operatorname{ReLU} net and the d+1d+1 width estimate for convex functions. Theorem 2 will be used in the proof of Theorem 1.

Relation to Previous Work

This article is related to several strands of prior work:

The results in this article complement the work of Liao-Mhaskar-Poggio and Mhaskar-Poggio , who consider the advantages of depth for representing certain heirarchical or compositional functions by neural nets with both ReLU\operatorname{ReLU} and non-ReLU\operatorname{ReLU} activations. Their results (e.g. Theorem 1 in and Theorem 3.1 in ) give bounds on the width for approximation both for shallow and certain deep heirarchical nets.

Theorems 1-2 are also quantitative analogs of Corollary 2.2 and Theorem 2.4 in the work of Arora-Basu-Mianjy-Mukerjee . Their results give bounds on the depth of a ReLU\operatorname{ReLU} net needed to compute exactly a piecewise linear function of dd variables. However, except when d=1,d=1, they do not obtain an estimate on the number of neurons in such a network and hence cannot bound the width of the hidden layers.

Our results are related to Theorems II.1 and II.4 of Rolnick-Tegmark , which are themselves extensions of Lin-Rolnick-Tegmark . Their results give lower bounds on the total size (number of neurons) of a neural net (with non-ReLU\operatorname{ReLU} activations) that approximates sparse multivariable polynomials. Their bounds do not imply a control on the width of such networks that depends only on the number of variables, however.

This work was inpsired in part by questions raised in the work of Telgarsky . In particular, in Theorems 1.1 and 1.2 of , Telgarsky constructs interesting examples of sawthooth functions that can be computed efficiently by deep width 22 ReLU\operatorname{ReLU} nets that cannot be well-approximated by shallower networks with a simlar number of parameters.

Theorems 1-2 are quantitative statements about the expressive power of depth without the aid of width. This topic, usually without considering bounds on the width, has been taken up by many authors. We refer the reader to for several interesting quantitative measures of the complexity of functions computed by deep neural nets.

Finally, we refer the reader to the interesting work of Yarofsky , which provides bounds on the total number of parameters in a ReLU\operatorname{ReLU} net needed to approximate a given class of functions (mainly balls in various Sobolev spaces).

Acknowledgements

It is a pleasure to thank Elchanan Mossel and Leonid Hanin for many helpful discussions. This paper originated while I attended EM’s class on deep learning . In particular, I would like to thank him for suggesting proving quantitative bounds in Theorem 2 and for suggesting that a lower bound can be obtained by taking piece-wise linear functions with many different directions. He also pointed out that the width estimates for continuous function in Theorem 1 where sub-optimal in a previous draft. l would also like to thank Leonid Hanin for detailed comments on a several previous drafts and for useful references to results in approximation theory. I am also grateful to Brandon Rule and Matus Telgarsky for comments on an earlier version of this article. I am also grateful to BR for the original suggestion to investigate the expressivity of neural nets of width 22. I also would like to thank Max Kleiman-Weiner for useful comments and discussion. Finally, I thank Zhou Lu for pointing out a serious error what used to be Theorem 3 in a previous version of this article. I have removed that result.

Proof of Theorem 2

when ff is convex. We seek to show that ff can be exactly represented by a ReLU\operatorname{ReLU} net with input dimension d,d, hidden layer width d+1d+1, and depth N.N. Our proof relies on the following observation.

where (id,0)(x)=(x,0)\left(\text{id},0\right)(x)=(x,0) maps d^{d} to the graph of the zero function. Note that the ReLU\operatorname{ReLU} in this initial layer is linear. With this notation, repeatedly using Lemma 3, we find that

therefore has input dimension d,d, hidden layer width d+1,d+1, depth NN and computes ff exactly.

Next, consider the general case when ff is given by

as in (11). For this situation, we use a different way of computing the maximum using ReLU\operatorname{ReLU} nets.

There exists a ReLU\operatorname{ReLU} net M\mathcal{M} with input dimension 2,2, hidden layer width 22, output dimension 11 and depth 22 such that

Set A1(x,y):=(xy,y),A2(z,w)=z+w,A_{1}(x,y):=(x-y,y),\,A_{2}(z,w)=z+w, and define

We now describe how to construct a ReLU\operatorname{ReLU} net N\mathcal{N} with input dimension dd, hidden layer width d+3,d+3, output dimension 1,1, and depth 2(M+N)2(M+N) that exactly computes ff. We use width dd to copy the input xx, width 22 to compute successive maximums of the positive affine functions gα,hβg_{\alpha},h_{\beta} using the net M\mathcal{M} from Lemma 4 above, and width 11 as memory in which we store g=supαgαg=\sup_{\alpha}g_{\alpha} while computing h=supβhβ.h=\sup_{\beta}h_{\beta}. The final layer computes the difference f=gh.f=g-h.

Proof of Theorem 1

Combining this result with Theorem 2 proves (9). We turn to checking (5) and (10). We need the following observations, which seems to be well-known but not written down in the literature.

Let N\mathcal{N} be a ReLU\operatorname{ReLU} net with input dimension d,d, a single hidden layer of width n,n, and output dimension 1.1. There exists another ReLU\operatorname{ReLU} net N~\widetilde{\mathcal{N}} that computes the same function as N\mathcal{N} but has input dimension dd and n+2n+2 hidden layers with width d+2.d+2.

Denote by {Aj}j=1n\{A_{j}\}_{j=1}^{n} the affine functions computed by each neuron in the hidden layer of N\mathcal{N} so that

The affine transformations A~j\widetilde{\mathcal{A}}_{j} computed by the jthj^{th} hidden layer of N~\widetilde{\mathcal{N}} are then

We are essentially using width dd to copy in the input variable, width 11 to compute each AjA_{j} and width 11 to store the output. ∎

Recall that positive continuous functions can be arbitrarily well-approximated by smooth functions and hence by ReLU\operatorname{ReLU} nets with a single hidden layer (see e.g. Theorem 3.1 ). The relation (5) therefore follows from Lemma 6. Similarly, by Theorem 3.1 in , if ff is smooth, then there exists K=K(d)>0K=K(d)>0 and a constant CfC_{f} depending only on the maximum value of the first KK derivatives of ff such that

where the infimum is over ReLU\operatorname{ReLU} nets N\mathcal{N} with a single hidden layer of width nn. Combining this with Lemma 6 proves (10).

of d^{d} into d!/ωf(ε)dd!/\omega_{f}(\varepsilon)^{d} copies of ωf(ε)\omega_{f}(\varepsilon) times the standard dd-simplex. Define fεf_{\varepsilon} to be a piecewise linear approximation to ff obtained by setting fεf_{\varepsilon} equal to ff on the vertices of the Pj\mathcal{P}_{j}’s and taking fεf_{\varepsilon} to be affine on their interiors. Since the diameter of each Pj\mathcal{P}_{j} is ωf(ε),\omega_{f}(\varepsilon), we have

Next, since fεf_{\varepsilon} is a piecewise affine function, by Theorem 2.1 in (see Theorem 2), we may write

where gε,hεg_{\varepsilon},h_{\varepsilon} are convex, positive, and piecewise affine. Applying Theorem 2 completes the proof of (6) and (7). ∎

References