Error bounds for approximations with deep ReLU neural networks in $W^{s,p}$ norms

Ingo Gühring, Gitta Kutyniok, Philipp Petersen

Introduction

Powered by modern, highly parallelized hardware and the immense amount of accessible data, deep neural networks substantially outperform both traditional modelling approaches, e.g. those based on differential equations, and classical machine learning methods in a wide range of applications. Prominent areas of application include image classification , speech recognition , and natural language processing .

Despite this overwhelming success in applications, a comprehensive mathematical explanation of this success is has not been found. However, a deep theoretical understanding of these techniques is crucial to design more efficient architectures and essential in safety-critical applications, such as autonomous driving.

Many attempts at unraveling the extreme efficiency of deep neural networks have been made in the context of approximation theory. The universal approximation theorem establishes that every continuous function on a compact domain can be uniformly approximated by neural networks. More refined results that also set into relation the size and the approximation fidelity of neural networks have been reported for smooth activation functions in, for example, .

In most applications, the activation functions are not smooth but taken to be the piecewise linear ReLU function. For these networks, approximation bounds for classes of smooth functions have been established in and for piecewise smooth functions in . Connections to sparse grids and the associated approximation rates were established in and connections to linear finite element approximation were reported in . It was also discovered that the depth of neural networks, i.e., the number of layers, crucially influences the approximation capabilities of these networks in the sense that deeper networks are more efficient approximators .

One of the applications of deep learning where a strong knowledge of the approximation capabilities of neural networks directly translates into quantifiable theoretical results appears when solving partial differential equations using deep learning techniques. Some notable advances in this direction have been made in . In this regard, not only the approximation fidelity with respect to standard Lebesgue spaces is of high interest, but also that with respect to Sobolev-type norms. First results in this direction were reported in .

In this work, we give a comprehensive analysis of the approximation rates of deep neural networks of Sobolev-regular functions with respect to (fractional) Sobolev norms.

In the remaining part of this introduction, we briefly describe neural network architectures relevant to this work. Then, we present the setting that neural networks are typically used in. Afterwards, we give a practical and theoretical motivation for studying approximations of functions and their derivatives with neural networks. Finally, we state our results.

is the affine-linear transformation of the ll-th layer, then the computation in that layer can be described as

Note that ϱ\varrho acts componentwise in all but the last layer and in the last layer no activation function is applied. The parameters defining the affine-linear transformations TlT_{l} are referred to as weights and k=0LNl\sum_{k=0}^{L}N_{l} is called the number of neurons of the network, since each output coordinate of flf_{l} can be seen as a small computational unit, similar to neurons in the brain. If a network has 33 or more layers, then it is usually called deep and a 22-layer network is called shallow. The complexity of a neural network is typically measured in the number of layers, weights, and neurons (see ). The term deep learning refers to the subset of machine learning methods associated with deep neural networks.

Recently, more general feedforward architectures which allow connections between non-neighboring layers, so-called skip connections, have been shown to yield state-of-the-art results in object recognition tasks . Here, the input of a layer consists of the output of all preceding layers. Note, that the case where no skip connections are allowed, is a special case of this more general architecture.

One of today’s most widely-used activation functions is the Rectified Linear Unit (ReLU) (see ), defined as

The popularity of the ReLU can be explained by a number of factors. It is cheap to compute, promotes sparsity in data representation , alleviates the problem of vanishing-gradients , and thus yields better optimization properties. Moreover, the ReLU is the only commonly-used activation function such that the associated function spaces are not necessary non-closed . In this work, we will mostly focus on ReLU networks with a feedforward architecture allowing skip connections.

2 Supervised learning with neural networks

Typically, neural networks are applied in supervised learning problems. The starting point is a dataset of input-output pairs (xi,f(xi))i=1m(x_{i},f(x_{i}))_{i=1}^{m}, called samples, where ff is in most cases an unknown functionWe will later see a case where ff is in fact known. with values only given at sample points xix_{i}. As an example, xix_{i} can be thought of as an image and f(xi)f(x_{i}) as a vector of scores, where each score is the probability of a certain category, e.g. ”dog” or ”cat”, being associated with xix_{i}. During training, one then seeks to learn the function ff by adapting the weights ww of a neural network N\mathcal{N} such that the empirical loss

is minimized for some loss function ll. Ultimately, one is interested in how well the learning algorithm performs on formerly unseen data points xx. The associated error is called the generalization error.

3 Motivation: approximating functions and derivatives

From a learning theory point of view, the task of estimating the generalization error decomposes into a statistical problem depending on the samples and an approximation theoretical problem, independent of the samples. An accessible introduction to learning theory from the perspective of approximation theory can be found in . The aim of this work is to study the simultaneous approximation of a function and its derivative with a neural network. There are multiple scenarios in which this is possible and useful.

In , the jj-th order derivatives of ff are incorporated in the empirical loss function (1.1), resulting in an empirical loss

which encourages the network to encode information about the derivatives of ff in its weights. The authors of call this method Sobolev training and reported reduced generalization errors and better data-efficiency in a network compression task (see ) and in application to synthetic gradients (see ). In case of network compression, the approximated function ff is a function realized by a possibly very large neural network Nlarge(w)\mathcal{N}_{\text{large}}(\cdot|w), that has been trained for some supervised learning task and is learnt by a smaller network Nsmall\mathcal{N}_{\text{small}}. In contrast to usual supervised learning settings, the approximated function f()=Nlarge(w)f(\cdot)=\mathcal{N}_{\text{large}}(\cdot|w) is known and the derivatives can be computed.

Motivated by the performance of deep learning-based solutions in classical machine learning tasks and, in particular, by their ability to overcome the curse of dimension, neural networks are now also applied for the approximative solution of partial differential equations (PDEs) (see ).

In the authors present their deep Galerkin method for approximating solutions of high-dimensional quasilinear parabolic PDEs. For this, a functional J(f)J(f) encoding the differential operator, boundary conditions, and initial conditions is introduced. A neural network NPDE\mathcal{N}_{\text{PDE}} with weights ww is then trained to minimize the functional J(NPDE(w))J(\mathcal{N}_{\text{PDE}}(w)). This is done by a discretization and randomly sampling spatial points.

The theoretical foundation for approximating a function and higher-order derivatives with a neural network was already given in a less known version of the universal approximation theorem by Hornik in [32, Theorem 3]. In particular, it was shown that if the activation function ϱ\varrho is kk-times continuously differentiable, non-constant, and bounded, then any kk-times continuously differentiable function ff and its derivatives up to order kk can be uniformly approximated by a shallow neural network on compact sets. Note though that the conditions on the activation function are very restrictive and that, for example, the ReLU is not included in the above result. However, in , it was shown that the theorem also holds for shallow ReLU networks if k=1k=1. Theorem 3 in was also used in to show the existence of a shallow network approximating solutions of the PDEs considered in this paper.

An important aspect, that is untouched by the previous approximation results is how the complexity of a network and, in particular, its depth relates to its approximation properties.

4 Our contribution

In Theorem 1 in , Yarotsky showed upper complexity bounds for approximations in LL^{\infty} norm of functions from the Sobolev space Wn,((0,1)d)W^{n,\infty}(\left(0,1\right)^{d}) with neural networks for continuous piecewise linear activation functions with a finite number of breakpoints. Precisely, it is shown there, that for any ε>0\varepsilon>0 there exists a neural network Nε\mathcal{N}_{\varepsilon} with at most clog2(\nicefrac1ε)c\cdot\log_{2}(\nicefrac{{1}}{{\varepsilon}}) layers and at most cεd/nlog2(\nicefrac1ε)c\cdot\varepsilon^{-d/n}\log_{2}(\nicefrac{{1}}{{\varepsilon}}) weights and neurons such that for any fWn,((0,1)d)f\in W^{n,\infty}(\left(0,1\right)^{d}) with fWn,((0,1)d)1\lVert f\rVert_{W^{n,\infty}(\left(0,1\right)^{d})}\leq 1 there is a choice of weights wfw_{f} with

where cc is a constant depending on dd and nn.

Furthermore, under the assumption of a possibly discontinuous weight selection the number of weights needed by a neural network Nε\mathcal{N}_{\varepsilon} to be able to realize an ε\varepsilon-approximation in LL^{\infty} norm for any fWn,((0,1)d)f\in W^{n,\infty}(\left(0,1\right)^{d}) with fWn,((0,1)d)1\lVert f\rVert_{W^{n,\infty}(\left(0,1\right)^{d})}\leq 1 is shown to be lower bounded by cεd/(2n)c^{\prime}\cdot\varepsilon^{d/(2n)} in [61, Theorem 4 a)]. The constant cc^{\prime} depends on dd and nn.

We show for the same set of activation functions that the approximation can also be done with respect to higher-order Sobolev norms with arbitrary 1p1\leq p\leq\infty and that there is a trade-off between the regularity used in the approximation norm and the regularity used in the complexity bounds. Specifically, we show that for any approximation accuracy ε>0\varepsilon>0 and regularity 0s10\leq s\leq 1, there is a ReLU neural network Nε\mathcal{N}_{\varepsilon} with at most clog2(εn/(ns))c\cdot\log_{2}(\varepsilon^{-n/(n-s)}) layers and cεd/(ns)log2(εn/(ns))c\cdot\varepsilon^{-d/(n-s)}\cdot\log_{2}(\varepsilon^{-n/(n-s)}) weights and neurons such that and any fWn,p((0,1)d)f\in{W^{n,p}(\left(0,1\right)^{d})} with fWn,p((0,1)d)B\lVert f\rVert_{{W^{n,p}(\left(0,1\right)^{d})}}\leq B there is a choice of weights wfw_{f} with

where cc is a constant depending on d,n,p,sd,n,p,s, and BB. In the boundary case s=0s=0 and p=p=\infty our results corresponds to the theorem shown by Yarotsky and for s=1s=1 and p=p=\infty the function Nε(wf)\mathcal{N}_{\varepsilon}(\cdot|w_{f}) and its weak gradient uniformly approximate ff and the weak gradient of ff, respectively. For non-integer ss the function Nε(wf)\mathcal{N}_{\varepsilon}(\cdot|w_{f}) approximates the function ff and its fractional derivative of order ss approximates the fractional derivative of order ss of ff. The case 0s10\leq s\leq 1 and p=p=\infty was already shown by one of the authors, I. Gühring, in his Master thesis .

Moreover, analogously to [61, Theorem 4 a)] we establish lower bounds where the same regularity-complexity trade-off can be observed. We show that if a neural network Nε\mathcal{N}_{\varepsilon} is able to realize an ε\varepsilon-approximation in W1,W^{1,\infty} norm for any fWn,((0,1)d)f\in W^{n,\infty}(\left(0,1\right)^{d}) with fWn,((0,1)d)1\lVert f\rVert_{W^{n,\infty}(\left(0,1\right)^{d})}\leq 1, then the number of weights of Nε\mathcal{N}_{\varepsilon} is lower bounded by cε\nicefracd2(n1)c^{\prime}\cdot\varepsilon^{\nicefrac{{-d}}{{2(n-1)}}}. Here, cc^{\prime} is a constant depending on dd and nn.

5 Outline

As a preparation, we start by introducing notation and some definitions in Section 1.6. In Section 2, we rigorously define a neural network and its architecture in mathematical terms and develop a network calculus. In Section 3, we briefly introduce (fractional) Sobolev spaces. In Section 4, we present our results which will be discussed in Section 5. To allow a concise presentation of our results most proofs are given in the appendix: Appendix A and B contain the necessary preparation for the proof of the results from Section 4.1 which is given in Appendix C. The results from Section 4.2 are proven in Appendix D.

6 Notation

Of course, the same notation also applies to (block) vectors.

for all x,yΩx,y\in\Omega. If we want to emphasize the constant LL, then we say that ff is LL-Lipschitz.

If XX is a linear space and 1\lVert\cdot\rVert_{1} and 2\lVert\cdot\rVert_{2} are two norms on XX, then we say that 1\lVert\cdot\rVert_{1} and 2\lVert\cdot\rVert_{2} are equivalent if there exist constants C1,C2>0C_{1},C_{2}>0 such that

For two normed linear spaces X,YX,Y we denote by L(X,Y)\mathcal{L}(X,Y) the set of bounded linear operators mapping XX to YY and for TL(X,Y)T\in\mathcal{L}(X,Y) the induced operator norm of TT is denoted by

Let a>0a>0, then we say for two functions f:(0,a)[0,)f:(0,a)\to[0,\infty) and g:(0,a)[0,)g:(0,a)\to[0,\infty) that f(ε)f(\varepsilon) is in O(g(ε))\mathcal{O}(g(\varepsilon)) if there exists 0<δ<a0<\delta<a and C>0C>0 such that f(ε)Cg(ε)f(\varepsilon)\leq Cg(\varepsilon) for all ε(0,δ)\varepsilon\in(0,\delta).

Neural networks

In this section, we introduce the notion of neural networks used in this paper. As in , we will consider a general type of feedforward architecture that also allows for connections of neurons in non-neighboring layers. It can be seen, though, that any function realized by such a network can also be realized by a network with a more restrictive feedforward architecture where only neurons from neighboring layers can be connected (Lemma 2.11). As in , we draw a distinction between the neural network and the function that the network realizes. This gives us the possibility to develop a network calculus in the spirit of [47, Chapter 2]. As in that paper we will introduce the notion of network concatenation and parallelization.

The following definition is similar to [47, Definition 2.1], where the difference is that we also allow connections between non-neighboring layers.

where xLx_{L} results from the following scheme:

where Al,xkA_{l,x_{k}} is an Nl×NkN_{l}\times N_{k} matrix for k=0,,l1k=0,\ldots,l-1 and l=1,,Ll=1,\ldots,L. Then

We will now define the class of neural networks where only connections between neighboring layers are allowed. Networks without skip connections are a special case of the networks of Definition 2.1. Since they are more frequently used in the literature, we coin such networks standard neural networks.

If Φ=((A1,b1),(A2,b2),,(AL,bL))\Phi=((A_{1},b_{1}),(A_{2},b_{2}),\dots,(A_{L},b_{L})) is a neural network as above and we have for

that Al,xi=0A_{l,x_{i}}=0 for l=1,,Ll=1,\ldots,L and i=0,,l2i=0,\ldots,l-2, then we call Φ\Phi a standard neural network. The computation scheme then reduces to the following:

In practice, before training a neural network, i.e. adjusting the weights of the network, one has to decide which network architecture to use. The following definition will clarify the notion of a network architecture.

We say that a neural network Φ=((A1,b1),(A2,b2),,(AL,bL))\Phi=((A^{\prime}_{1},b^{\prime}_{1}),(A^{\prime}_{2},b^{\prime}_{2}),\dots,(A^{\prime}_{L},b^{\prime}_{L})) with input dimension dd and LL layers has architecture A{\mathcal{A}} if

Nl(Φ)=NlN_{l}(\Phi)=N_{l} for all l=1,,Ll=1,\ldots,L and

[Al]i,j0[A^{\prime}_{l}]_{i,j}\neq 0 implies [Al]i,j0[A_{l}]_{i,j}\neq 0 for all l=1,,Ll=1,\ldots,L with i=1,,Nli=1,\ldots,N_{l} and j=1,,k=0l1Nkj=1,\ldots,\sum_{k=0}^{l-1}N_{k}.

In the spirit of Definition 2.2 we say that A{\mathcal{A}} is a standard neural network architecture if all weights connecting neurons from non-neighboring layers are zero.

Note that in general an architecture of a neural network does not need to be unique.

Moreover, we define one of the in practice commonly used (see ) activation functions.

is called ReLU (Rectified Linear Unit) activation function.

It can be shown, see [61, Proposition 1], that in terms of approximation and upper complexity bounds for networks with continuous and piecewise linear activation functions it suffices to focus on the ReLU.

be two neural networks such that the input layer of Φ1\Phi^{1} has the same dimension as the output layer of Φ2\Phi^{2}. Then, Φ1Φ2\Phi^{1}\bullet\Phi^{2} denotes the following L1+L21L_{1}+L_{2}-1 layer network:

for l=1,,L1l=1,\ldots,L_{1}. We call Φ1Φ2\Phi^{1}\bullet\Phi^{2} the concatenation of Φ1\Phi^{1} and Φ2\Phi^{2}.

Then, Rϱ(Φ1Φ2)=Rϱ(Φ1)Rϱ(Φ2)R_{\varrho}(\Phi^{1}\bullet\Phi^{2})=R_{\varrho}(\Phi^{1})\circ R_{\varrho}(\Phi^{2}).

This can be verified with a direct computation. ∎

Next, we adapt [47, Lemma 2.3] to the case where skip connections are allowed and define a network which realizes an identity function if the activation function is the ReLU.

Then the sparse concatenation of Φ1\Phi^{1} and Φ2\Phi^{2} is defined as

for l=1,,L1l=1,\ldots,L_{1}. Furthermore, it holds that L(Φ1Φ2)=L1+L2L(\Phi^{1}\odot\Phi^{2})=L_{1}+L_{2}, and M(Φ1Φ2)2M(Φ1)+2M(Φ2)M(\Phi^{1}\odot\Phi^{2})\leq 2M(\Phi^{1})+2M(\Phi^{2}) and N(Φ1Φ2)2N(Φ1)+2N(Φ2)N(\Phi^{1}\odot\Phi^{2})\leq 2N(\Phi^{1})+2N(\Phi^{2}).

2 Network parallelization

be two neural networks with dd-dimensional input. If L1L2L_{1}\leq L_{2}, then we define

A similar construction is used for the case L1>L2L_{1}>L_{2}. Then P(Φ1,Φ2)P(\Phi^{1},\Phi^{2}) is a neural network with dd-dimensional input and max{L1,L2}\max\{L_{1},L_{2}\} layers, which we call the parallelization of Φ1\Phi^{1} and Φ2\Phi^{2}. We have M(P(Φ1,Φ2))=M(Φ1)+M(Φ2)M(P(\Phi^{1},\Phi^{2}))=M(\Phi^{1})+M(\Phi^{2}), N(P(Φ1,Φ2))=N(Φ1)+N(Φ2)dN(P(\Phi^{1},\Phi^{2}))=N(\Phi^{1})+N(\Phi^{2})-d and Rϱ(P(Φ1,Φ2))(x)=(Rϱ(Φ1)(x),Rϱ(Φ2)(x))R_{\varrho}(P(\Phi^{1},\Phi^{2}))(x)=(R_{\varrho}(\Phi^{1})(x),R_{\varrho}(\Phi^{2})(x)).

with dd-dimensional input and L=max{L1,,Ln}L=\max\{L_{1},\ldots,L_{n}\} layers is called the parallelization of Φ1,Φ2,,Φn\Phi^{1},\Phi^{2},\ldots,\Phi^{n}. We have

and Rϱ(P(Φ1,,Φn))(x)=(Rϱ(Φ1)(x),,Rϱ(Φn)(x))R_{\varrho}(P(\Phi^{1},\ldots,\Phi^{n}))(x)=(R_{\varrho}(\Phi^{1})(x),\ldots,R_{\varrho}(\Phi^{n})(x)).

3 Standard neural networks

In this subsection, we will show that if a function is a realization of a neural network, then it can also be realized by a standard neural network with similar complexity. The proof of this result is heavily based on the idea of the construction of an identity function with ReLU networks (Lemma 2.6).

Let ϱ\varrho be the ReLU. Then there exist absolute constants C1,C2>0C_{1},C_{2}>0 such that for any neural network Φ\Phi with dd-dimensional input and L=L(Φ)L=L(\Phi) layers, N=N(Φ)N=N(\Phi) neurons and M=M(Φ)M=M(\Phi) nonzero weights, there is a standard neural network Φst\Phi_{\text{st}} with dd-dimensional input, and LL layers, at most C1LNC_{1}LN neurons and C2(LN+M)C_{2}\cdot(LN+M) nonzero weights such that

If L=1L=1, then there is nothing to show. For L>1L>1 we start by defining the matrices

In the next remark we collect some properties of the class of functions that can be realized by a neural network.

If Φ\Phi is a neural network with dd-dimensional input and mm-dimensional output and ϱ\varrho is the ReLU, then Rϱ(Φ)R_{\varrho}(\Phi) is a Lipschitz-continuous, piecewise affine-linear function. This follows easily from Definition 2.2 and Lemma 2.11 which show that Rϱ(Φ)R_{\varrho}(\Phi) can be expressed as the composition of Lipschitz-continuous, piecewise affine-linear transformations.

In [4, Theorem 2.1] also the converse was shown, i.e. every piecewise affine-linear function can be realized by a neural network with the ReLU as activation function.

Sobolev spaces

Spaces of functions that admit generalized derivatives fulfilling suitable integrability properties are a crucial concept in modern theory of partial differential equations (cf. e.g. ). In order to study properties of PDEs using functional analytic tools, a differential equation is reformulated via a differential operator mapping one function space to another. For a wide range of differential equations the appropriate spaces in this formulation are Sobolev spaces. A historical background of the development of Sobolev spaces in the context of PDEs can be found in [56, Lecture 1]. For a detailed treatment of the broad theory of Sobolev spaces we refer the reader to .

Furthermore, for fWn,p(Ω)f\in{W^{n,p}(\Omega)} and 1p<1\leq p<\infty, we define the norm

We write f:=(D1f,,Ddf)\nabla f:=(D^{1}f,\ldots,D^{d}f). Moreover, the space Wn,p(Ω){W^{n,p}(\Omega)} equipped with the norm Wn,p(Ω)\lVert\cdot\rVert_{{W^{n,p}(\Omega)}} is a Banach space. For fWn,p(Ω)f\in{W^{n,p}(\Omega)} we shall use the notation

if the domain is clear from the context, or if we want to emphasize the variable xx that the function ff depends on, respectively. If n=0n=0, then the Sobolev space is just a Lebesgue space, i.e. W0,p(Ω)=Lp(Ω){W^{0,p}(\Omega)}=L^{p}(\Omega).

Many results about function spaces defined on a domain Ω\Omega require Ω\Omega to fulfill certain regularity conditions. Different geometrical conditions and the resulting properties have been intensively studied and can for example be found in . For our purposes it is enough to focus on the condition introduced in the next definition which can be found in [20, Appendix C.1].

after possibly relabeling and reorienting the coordinate axes.

In the sequel, we will only work with convex domains. Moreover, every open, bounded, and convex domain is a Lipschitz domain [23, Corollary 1.2.2.3].

Fractional Sobolev spaces play an important role in the analysis of partial differential equations. In particular, they characterize the regularity of functions from Sobolev spaces defined on a domain Ω\Omega restricted to the boundary Ω\partial\Omega of the domain where the restriction is realized by a so-called trace operator (see e.g. ). Moreover, a detailed description of various areas of further application is listed in . For a more in-depth analysis of these spaces the interested reader is referred to and .

Next, we define fractional-order spaces in terms of an intrinsic norm.

We set for 0<s<10<s<1 and 1p1\leq p\leq\infty

The space Ws,p(Ω){W^{s,p}(\Omega)} endowed with the norm Ws,p(Ω)\lVert\cdot\rVert_{{W^{s,p}(\Omega)}} is a Banach space called Sobolev-Slobodeckij space.For p=p=\infty the space coincides with the space of bounded ss-Hölder continuous functions. Precisely, each equivalence class contains a bounded ss-Hölder continuous representative.

Approximations with deep ReLU neural networks in Sobolev type norms

In this section, we derive upper and lower complexity bounds for approximations of functions from certain Sobolev spaces with deep ReLU networks. Our result is a generalization of [61, Theorem 1] (upper bounds) and [61, Theorem 4 a)] (lower bounds) to the case where the approximation error is measured in a Sobolev-type norm.

In [61, Theorem 1] it is shown that for B=1B=1 and an arbitrary function fFn,d,,Bf\in\mathcal{F}_{n,d,\infty,B} a ReLU network can be constructed that realizes an approximation with LL^{\infty}-error at most ε\varepsilon using O(log2(\nicefrac1ε))\mathcal{O}(\log_{2}(\nicefrac{{1}}{{\varepsilon}})) layers and O(εd/nlog2(\nicefrac1ε))\mathcal{O}(\varepsilon^{-d/n}\log_{2}(\nicefrac{{1}}{{\varepsilon}})) nonzero weights and neurons.

We will show that the approximation can also be performed with respect to a continuous scale of higher order Sobolev-type norms for arbitrary 1p1\leq p\leq\infty and additionally, that there is a trade-off between the regularity used in the norm in which the approximation error is measured and the regularity used in the bounds. In particular, we will show the following theorem:

For any ε(0,\nicefrac12)\varepsilon\in\left(0,\nicefrac{{1}}{{2}}\right), there is a neural network architecture Aε=Aε(d,n,p,B,s,ε){\mathcal{A}}_{\varepsilon}={\mathcal{A}}_{\varepsilon}(d,n,p,B,s,\varepsilon) with dd-dimensional input and one-dimensional output such that for any fFn,d,p,Bf\in\mathcal{F}_{n,d,p,B} there is a neural network Φεf\Phi_{\varepsilon}^{f} that has architecture Aε{\mathcal{A}}_{\varepsilon} such that

L(Aε)clog2(εn/(ns))L({\mathcal{A}}_{\varepsilon})\leq c\cdot\log_{2}(\varepsilon^{-n/(n-s)});

M(Aε)cεd/(ns)log2(εn/(ns))M({\mathcal{A}}_{\varepsilon})\leq c\cdot\varepsilon^{-d/(n-s)}\cdot\log_{2}(\varepsilon^{-n/(n-s)});

N(Aε)cεd/(ns)log2(εn/(ns))N({\mathcal{A}}_{\varepsilon})\leq c\cdot\varepsilon^{-d/(n-s)}\cdot\log_{2}(\varepsilon^{-n/(n-s)}).

Clearly, if s=0s=0 and p=p=\infty, then Theorem 4.1 coincides with the theorem shown by Yarotsky in . Note that if Φ\Phi is a neural network and ϱ\varrho is the ReLU, then the restriction of Rϱ(Φ)R_{\varrho}(\Phi) to (0,1)d\left(0,1\right)^{d} is bounded and Lipschitz continuous (cf. Remark 2.12) and hence in view of [10, Chapter 1.3] an element of the Sobolev space W1,((0,1)d){W^{1,\infty}(\left(0,1\right)^{d})}. Therefore, the expressions in the previous theorem are well-defined.

Theorem 4.1 holds for all activation functions that are in some sense similar to the ReLU. In particular, it follows directly from [61, Proposition 1] that the statement holds for any continuous piecewise linear activation function ϱ^\hat{\varrho} with MM breakpoints, where 1M<1\leq M<\infty.

As a consequence of Theorem 4.1 and Lemma 2.11, we can easily derive a similar statement for standard neural networks.

For any ε(0,\nicefrac12)\varepsilon\in\left(0,\nicefrac{{1}}{{2}}\right), there is a standard neural network architecture Ast,ε=Ast,ε(d,n,p,B,s,ε){\mathcal{A}}_{\text{st},\varepsilon}={\mathcal{A}}_{\text{st},\varepsilon}(d,n,p,B,s,\varepsilon) with dd-dimensional input and one-dimensional output such that for any fFn,d,p,Bf\in\mathcal{F}_{n,d,p,B} there is a standard neural network Φst,ε\Phi_{\text{st},\varepsilon} that has architecture Ast,ε{\mathcal{A}}_{\text{st},\varepsilon} such that

L(Ast,ε)clog2(εn/(ns))L({\mathcal{A}}_{\text{st},\varepsilon})\leq c\cdot\log_{2}(\varepsilon^{-n/(n-s)});

M(Ast,ε)cεd/(ns)log22(εn/(ns))M({\mathcal{A}}_{\text{st},\varepsilon})\leq c\cdot\varepsilon^{-d/(n-s)}\cdot\log_{2}^{2}(\varepsilon^{-n/(n-s)});

N(Ast,ε)cεd/(ns)log22(εn/(ns))N({\mathcal{A}}_{\text{st},\varepsilon})\leq c\cdot\varepsilon^{-d/(n-s)}\cdot\log_{2}^{2}(\varepsilon^{-n/(n-s)}).

2 Lower complexity bounds

In this subsection, we show that the same regularity-complexity trade-off that can be observed for upper complexity bounds can be shown for lower bounds. In Theorem 4 a) in Yarotsky proves that a network architecture capable of approximating any function fFn,d,,1f\in\mathcal{F}_{n,d,\infty,1} up to a LL^{\infty}-error ε\varepsilon has at least cεd/(2n)c\cdot\varepsilon^{-d/(2n)} weights. Here we consider approximations in W1,W^{1,\infty} norm. The following theorem combines the result from with our result.

If ε(0,\nicefrac12)\varepsilon\in\left(0,\nicefrac{{1}}{{2}}\right) and Aε=Aε(d,n,B,k,ε){\mathcal{A}}_{\varepsilon}={\mathcal{A}}_{\varepsilon}(d,n,B,k,\varepsilon) is an architecture such that for any fFn,d,,Bf\in\mathcal{F}_{n,d,\infty,B} there is a neural network Φεf\Phi_{\varepsilon}^{f} that has architecture Aε{\mathcal{A}}_{\varepsilon} and

then Aε{\mathcal{A}}_{\varepsilon} has at least M(Aε)cε\nicefracd2(nk)M({\mathcal{A}}_{\varepsilon})\geq c\varepsilon^{\nicefrac{{-d}}{{2(n-k)}}} weights.

In case k=0k=0, Theorem 4.3 coincides with [61, Theorem 4 a)].

Note that the case of standard neural network architectures is included in the theorem above and, thus, the same lower bounds hold true for this case.

Moreover, the proof can easily be adapted to piecewise polynomial activation functions (not necessarily continuous) with a finite number of breakpoints.

Discussion and future work

The motivation behind deriving theoretical results for deep neural networks is to improve our understanding of their empirical success in a broad area of applications. In this section, we briefly describe the practical relevance of our result, but mostly focus on aspects where our theoretical framework differs from empirical observations or fails to incorporate computational limitations met in implementations. Furthermore, we will discuss possible extensions of our results.

Practical relevance. Our results give upper and lower bounds for the complexity of neural network architectures used in Sobolev training (see Section 1.3 and ). Furthermore, we anticipate our results to lead to complexity bounds in the theoretical foundation of deep learning-based approaches for the numerical solution of PDEs. We leave an in-depth analysis of this topic for future research.

Curse of dimension. One of the main reasons for the current interest in deep neural networks is their outstanding performance in high-dimensional problems. As an example, consider the popular ILSVRTypically abbreviated with ILSRC which stands for ImageNet Large Scale Visual Recognition Challenge. challenge (see ), an image recognition task on the ImageNet database (see ) containing variable-resolution images which are typically downsampled before training to yield a roughly (240×240×3)(240\times 240\times 3)-dimensional input space (cf. e.g. ).

Our asymptotic upper complexity bounds for the number of weights and neurons are in O(εd/(ns))\mathcal{O}(\varepsilon^{-d/(n-s)}), and thus depend strongly on the dimension dd of the input space. Moreover, the constants in the upper bounds for the number of layers, weights, and neurons also increase exponentially with increasing dd. Even if there is still a gap between our upper and lower bounds for the weights of a network architecture, Theorem 4.3 shows that one cannot expect to circumvent the curse of dimension in the setting considered in this paper.

The power of depth. For s=0s=0, the approximations obtained in Theorem 4.1 and [61, Theorem 1] coincide. In this case, Yarotsky showed in that the constructed unbounded depth approximations for functions in Wn,((0,1)d){W^{n,\infty}(\left(0,1\right)^{d})} (with n>2n>2) are asymptotically more efficient in the number of weights and neurons than approximations with fixed length LL if

As a consequence, to be more efficient than a shallow network, i.e. a network with depth L=2L=2, one needs n>2dn>2d regularity. Even if this result does not completely explain the success of deep networks over shallow ones, since dd is typically very large, it would be interesting to obtain similar results for higher-order Sobolev norms.

Unbounded complexity of weights. When training a neural network on a computer, the weights have to be stored in memory. In practice, storing weights up to an arbitrary precision or even unbounded weights (in absolute value) is infeasible. In the construction of the neural network Rϱ(Φεf)R_{\varrho}(\Phi^{f}_{\varepsilon}) that approximates a function ff up to an Ws,W^{s,\infty}-error ε\varepsilon and realizes the upper complexity bound from Theorem 4.1 we used weights WεW_{\varepsilon} with Wε\lvert W_{\varepsilon}\rvert\to\infty as ε0\varepsilon\downarrow 0 (see construction of Φm,l\Phi_{m,l} in the proof of Lemma C.3 (v)).

In , neural networks are restricted to possess quantized weights (see [47, Definition 2.9]) which controls the complexity of the weights depending on the approximation error ε\varepsilon. It would be interesting to extend Theorem 4.1 to networks with quantized weights and to analyze how coarse a quantization is possible to maintain the associated upper bounds.

Acknowledgements

I.G. would like to thank Mones Raslan for fruitful discussions on the topic and acknowledges support from the Research Training Group ”Differential Equation- and Data-driven Models in Life Sciences and Fluid Dynamics: An Interdisciplinary Research Training Group (DAEDALUS)” (GRK 2433) funded by the German Research Foundation (DFG). G. K. acknowledges partial support by the Bundesministerium für Bildung und Forschung (BMBF) through the Berliner Zentrum for Machine Learning (BZML), Project AP4, by the Deutsche Forschungsgemeinschaft (DFG) through grants CRC 1114 ”Scaling Cascades in Complex Systems”, Project B07, CRC/TR 109 ”Discretization in Geometry and Dynamics’, Projects C02 and C03, RTG DAEDALUS (RTG 2433), Projects P1 and P3, RTG BIOQIC (RTG 2260), Projects P4 and P9, and SPP 1798 ”Compressed Sensing in Information Processing”, Coordination Project and Project Massive MIMO-I/II, by the Berlin Mathematics Research Center MATH+ , Projects EF1-1 and EF1-4, and by the Einstein Foundation Berlin. P.P. is supported by a DFG Research Fellowship ”Shearlet-based energy functionals for anisotropic phase-field methods”.

Appendix A Interpolation spaces

In this section, we recall some essential notations and results from interpolation theory in Banach spaces. In particular, we present the KK-method of real interpolation.

For two Banach spaces B0B_{0} and B1B_{1} interpolation techniques allow the definition and study of a family of spaces that in some sense bridge the gap between B0B_{0} and B1B_{1}. This rather abstract concept will be applied to Sobolev spaces in Section 3.1 to derive spaces of fractional regularity. Interpolation theory is a valuable tool in harmonic analysis and the study of partial differential equations. A small summary about the history of the development of the theory of interpolation spaces can be found in [56, Lecture 21]. Interested readers are recommended to read the monographs and for a detailed treatment in the context of partial differential equations.

The basic notions of functional analysis are assumed to be known and we refer to for an introduction.

If B0,B1B_{0},B_{1} are two Banach spaces such that B1B_{1} is continuously embedded in B0B_{0}, then we write B1B0B_{1}\subset B_{0} and call (B0,B1)(B_{0},B_{1}) an interpolation couple. One can also consider more general pairs of Banach spaces but this setting is sufficiently general for our purposes.

Let (B0,B1)(B_{0},B_{1}) be an interpolation couple. For every uB1u\in B_{1} and t>0t>0 we define

The first term measures how well uu can be approximated by elements from the smaller space B1B_{1} in B0\lVert\cdot\rVert_{B_{0}} and the second term is a penalty term weighted by tt.

Let (B0,B1)(B_{0},B_{1}) be an interpolation couple. Let 0<θ<10<\theta<1 and 1p1\leq p\leq\infty. We set

The norm uu(B0,B1)θ,pu\mapsto\lVert u\rVert_{\left(B_{0},B_{1}\right)_{\theta,p}} turns (B0,B1)θ,p\left(B_{0},B_{1}\right)_{\theta,p} into a Banach space (cf. [38, Proposition 1.5]) which is called a real interpolation space.

The next lemma shows that in the sense of a continuous embedding the spaces Bθ,pB_{\theta,p} form a family of nested Banach spaces.

Let 1p1\leq p\leq\infty, then the following holds:

If (B0,B1)(B_{0},B_{1}) is an interpolation couple and 0<θ1<θ2<10<\theta_{1}<\theta_{2}<1, then

if BB is a Banach space and 0<θ<10<\theta<1, then (B,B)θ,p=B\left(B,B\right)_{\theta,p}=B.

For (i) see [38, p. 2] and for (ii) [38, Proposition 1.4]. ∎

One of the most important theorems in interpolation theory shows how the norm of an operator defined on interpolation couples relates to the operator norm with respect to the corresponding interpolation spaces. The theorem can be found in [38, Theorem 1.6].

Let (A0,A1)(A_{0},A_{1}) and (B0,B1)(B_{0},B_{1}) be two interpolation couples. If TL(A0,B0)L(A1,B1)T\in\mathcal{L}(A_{0},B_{0})\cap\mathcal{L}(A_{1},B_{1}), then TL(Aθ,p,Bθ,p)T\in\mathcal{L}(A_{\theta,p},B_{\theta,p}) for every 0<θ<10<\theta<1 and 1p1\leq p\leq\infty. Furthermore,

As a corollary from the previous theorem the following useful result can be obtained (see [38, Corollary 1.7]).

Let (B0,B1)(B_{0},B_{1}) be an interpolation couple. Moreover, let 0<θ<10<\theta<1 and 1p1\leq p\leq\infty, then there is a constant c=c(θ,p)>0c=c(\theta,p)>0 such that for all uB1u\in B_{1} we have

For the case p=p=\infty we have c(θ,)=1c(\theta,\infty)=1.

Appendix B Sobolev spaces

As a generalization of Definition 3.1, we now introduce Sobolev spaces of vector-valued functions (see [48, Chapter 1.2.3]).

These spaces are again Banach spaces. To simplify the exposition, we introduce notation for a family of semi-norms.

For m=1m=1 we only write Wk,p(Ω)\lvert\cdot\rvert_{{W^{k,p}(\Omega)}}. Note that W0,p(Ω)\lvert\cdot\rvert_{{W^{0,p}(\Omega)}} coincides with Lp(Ω)\lVert\cdot\rVert_{L^{p}(\Omega)}.

A proof can be found in [29, Theorem 4.1] together with [29, Remark 4.2].

To use the bounds for the gradient obtained in the previous theorem in the Sobolev semi-norm setting from Definition B.2, the following observation turns out to be useful. If fW1,(Ω)f\in W^{1,\infty}(\Omega), then

The next lemma recalls a well known fact about the composition of Lipschitz functions and is not hard to see.

The following corollary establishes a chain rule estimate for W1,W^{1,\infty}.

It then follows from Theorem B.3 that fjf_{j} is LjL_{j}-Lipschitz and hence, ff is Lf\lvert L_{f}\rvert-Lipschitz. In the same manner we define

and have that gg is LgL_{g}-Lipschitz. With Lemma B.4 we conclude that also the composition gfg\circ f is (LfLg)(\lvert L_{f}\rvert\cdot L_{g})-Lipschitz. Since gg is bounded and hence also gfg\circ f applying Theorem B.3 again yields that gfW1,(Ω1)g\circ f\in{W^{1,\infty}(\Omega_{1})} and furthermore

B.2 Product estimate

In the following lemma, we show an estimate for the semi-norm of a product of weakly differentiable functions.

Let fW1,(Ω)f\in{W^{1,\infty}(\Omega)} and gW1,p(Ω)g\in{W^{1,p}(\Omega)} with 1p1\leq p\leq\infty, then fgW1,p(Ω)fg\in{W^{1,p}(\Omega)} and there exists a constant C=C(d,p)>0C=C(d,p)>0 such that

Clearly, fg,fDig+gDifLloc1(Ω)fg,fD^{i}g+gD^{i}f\in L^{1}_{\text{loc}}(\Omega) so that the product formula in [21, Chapter 7.3] yields that for the weak derivatives of fgfg it holds

where C=C(d,p)>0C=C(d,p)>0 is a constant. Furthermore, it is clear that also fgLp(Ω)fg\in L^{p}(\Omega). ∎

B.3 Averaged Taylor polynomial

In this subsection, we develop a polynomial approximation in the spirit of Taylor polynomials but appropriate for Sobolev spaces. A polynomial approximation PfP_{f} of a function fFn,d,p,Bf\in\mathcal{F}_{n,d,p,B} is the first step towards an approximation of ff realized by a neural network in the proof of Theorem 4.1.

A reference for this entire subsection is [10, Chapter 4.1].

and ϕ\phi is an arbitrary cut-off function supported in B\overline{B}, i.e.

A cut-off function as used in the previous definition always exists. A possible choice is

From the linearity of the weak derivative we can easily conclude that the averaged Taylor polynomial is linear in ff.

Recall that the averaged Taylor polynomial is defined via an integral and some cut-off function (cf. (B.3)) that perform an averaging of a polynomial expression (cf. (B.4)). Additionally, the following lemma shows that the averaged Taylor polynomial of order nn is indeed a polynomial of degree less than nn in xx.

Moreover, there exists a constant c=c(n,d,R)>0c=c(n,d,R)>0 such that the coefficients cαc_{\alpha} are bounded with cαcrd/pfWn1,p(Ω)\lvert c_{\alpha}\rvert\leq cr^{-d/p}\lVert f\rVert_{{W^{n-1,p}(\Omega)}} for all α\alpha with αn1\lvert\alpha\rvert\leq n-1.

in multi-index notation. Then, combining Equation (B.3) and (B.4) yields

Combining the last estimate with Equation (B.7) yields

where the second step follows from ϕLcrd\lVert\phi\rVert_{L^{\infty}}\leq cr^{-d} for some constant c=c(d)>0c=c(d)>0 (see [10, Section 4.1]). To estimate the absolute value of the coefficients cγc_{\gamma} (defined in Equation B.6), we have

Here, the second step used Equation (B.8) together with Equation (B.5), and c=c(n,d,R)>0c^{\prime}=c^{\prime}(n,d,R)>0 is a constant. ∎

The next step is to derive approximation properties of the averaged Taylor polynomial. To this end, recall that for the (standard) Taylor expansion of some function ff defined on a domain Ω\Omega in x0x_{0} to yield an approximation at some point x0+hx_{0}+h the whole path x0+thx_{0}+th for 0t10\leq t\leq 1 has to be contained in Ω\Omega (see [40, Theorem 6.8.10]). In case of the averaged Taylor polynomial the expansion point x0x_{0} is replaced by a ball BB and we require that the path between each x0Bx_{0}\in B and each xΩx\in\Omega is contained in Ω\Omega. This geometrical condition is made precise in the following definition.

The next definition introduces a geometric notion which becomes important when given a family of subdivisions Th\mathcal{T}^{h}, 0<h10<h\leq 1 of a domain Ω\Omega which becomes finer for smaller hh. One typically needs to control the nondegeneracy of (Th)h(\mathcal{T}^{h})_{h} which can be done e.g. with a uniformly bounded chunkiness parameter.

If R\mathcal{R}\neq\varnothing, then we define

To emphasize the dependence on the set Ω\Omega, we will occasionally write rmax(Ω)r^{\star}_{\text{max}}(\Omega) and γ(Ω)\gamma(\Omega).

Finally, the next lemma shows approximation properties of the averaged Taylor polynomial.

where QnfQ^{n}f denotes the Taylor polynomial of order nn of ff averaged over BB and h=diam(Ω)h=\operatorname{diam}(\Omega).

A proof can be found in [10, Lemma 4.3.8].

B.4 Fractional Sobolev spaces

In this subsection, we derive a generalization of Sobolev spaces characterized by fractional-order derivatives using two different approaches. First, we interpolate integer-valued Sobolev spaces, and secondly, we define an intrinsic norm. Both approaches are then shown to be equivalent under some regularity condition on the domain Ω\Omega.

We start by defining fractional-order Sobolev spaces Ws,pW^{s,p} for 0<s<10<s<1 via Banach space interpolation (see Section A).

Let 0<s<10<s<1 and 1p1\leq p\leq\infty, then we set

The next theorem shows that, given suitable regularity conditions of the domain Ω\Omega, Definition B.13 and Definition 3.3 yield the same spaces with equivalent norms. The theorem can be found in [10, Theorem 14.2.3] for 1p<1\leq p<\infty. The case p=p=\infty is not included in [10, Theorem 14.2.3] but can be shown with the same technique.

with equivalence of the respective norms.

We make use of the norm equivalence by only using the Ws,pW^{s,p} norm in the following section and transferring results obtained by interpolation theory for the W~s,p\widetilde{W}^{s,p} norm to the Ws,pW^{s,p} norm without mentioning it again.

Appendix C Upper bounds for approximations

The strategy of the proof of Theorem 4.1 follows the idea of the proof of Theorem 1 in : A polynomial approximation PfP_{f} of a function fFn,d,p,Bf\in\mathcal{F}_{n,d,p,B} is approximated by a function realized by a neural network ΦPf\Phi_{P_{f}}. As in , we start by constructing a neural network that approximates the square function on the interval (0,1)\left(0,1\right) up to an approximation error at most ε\varepsilon. In our result the error is measured in the W1,W^{1,\infty} norm (Proposition C.1). This result is then used to obtain an approximation of a multiplication operator (Proposition C.2). Next, we derive a partition of unity (Lemma C.3) as a product of univariate functions that can be realized by neural networks. Using this construction we then build a localized Taylor approximation for a function fFn,d,p,Bf\in\mathcal{F}_{n,d,p,B} in the LpL^{p} and W1,pW^{1,p} norm. An interpolation argument shows that same construction can be used for an approximation in the Ws,pW^{s,p} norm, where 0s10\leq s\leq 1 (Lemma C.4). In the next step, we lay the foundation for approximating products of the partition of unity and monomials, i.e. localized monomials, with ReLU networks (Lemma C.5) in the Wk,W^{k,\infty} norm for k=0,1k=0,1. Via an embedding this result this result is then used to approximate the localized polynomials in Wk,pW^{k,p} with ReLU networks (Lemma C.6).

All Sobolev spaces encountered in the following proofs are defined on open, bounded, and convex domains which are, in particular, Lipschitz-domains (see [23, Corollary 1.2.2.3]).

The results presented in the following two propositions can be found in a similar way in [51, Proposition 3.1]. However, since our results contain some minor extensions we decided to give the proof here for the sake of completeness. In detail, Proposition C.2 (ii) and (iii) are not shown in [51, Proposition 3.1].

In [61, Proposition 2], a network is constructed that approximates the square function xx2x\to x^{2} on the interval (0,1)\left(0,1\right) in the LL^{\infty} norm. Interestingly, the same construction can be used when measuring the approximation error in the W1,W^{1,\infty} norm. In particular, the depth and the number of weights and neurons of the network do not grow asymptotically faster to satisfy the approximation accuracy with respect to this stronger norm.

There exist constants c1,c2,c3,c4>0c_{1},c_{2},c_{3},c_{4}>0, such that for all ε(0,\nicefrac12)\varepsilon\in\left(0,\nicefrac{{1}}{{2}}\right) there is a neural network Φεsq\Phi^{\text{sq}}_{\varepsilon} with at most c1log2(\nicefrac1ε)c_{1}\cdot\log_{2}(\nicefrac{{1}}{{\varepsilon}}) nonzero weights, at most c2log2(\nicefrac1ε)c_{2}\cdot\log_{2}(\nicefrac{{1}}{{\varepsilon}}) layers, at most c3log2(\nicefrac1ε)c_{3}\cdot\log_{2}(\nicefrac{{1}}{{\varepsilon}}) neurons, and with one-dimensional input and output such that

and Rϱ(Φεsq)(0)=0R_{\varrho}(\Phi^{\text{sq}}_{\varepsilon})(0)=0. Furthermore, it holds that

Thus, Rϱ(Φm)R_{\varrho}(\Phi_{m}) is a piecewise linear interpolant of ff with 2m+12^{m}+1 uniformly distributed breakpoints k2m,k=0,,2m\frac{k}{2^{m}},k=0,\ldots,2^{m}. In particular, Rϱ(Φm)(0)=0R_{\varrho}(\Phi_{m})(0)=0. Furthermore, it is shown in the proof of [61, Proposition 2] that

We will now show that the approximation error of the derivative can be bounded in a similar way. In particular, we show the estimate

From Equation (C.3) we get for all k=0,,2m1k=0,\ldots,2^{m}-1

Combining Equation (C.4) and (C.5) yields

Clearly, the weak derivative of Φm\Phi_{m} is a piecewise constant function, which assumes its maximum on the last piece. Hence,

Let now ε(0,\nicefrac12)\varepsilon\in\left(0,\nicefrac{{1}}{{2}}\right) and choose m=log2(\nicefrac1ε)m=\lceil\log_{2}(\nicefrac{{1}}{{\varepsilon}})\rceil. Now, Φεsq:=Φm\Phi^{\text{sq}}_{\varepsilon}:=\Phi_{m} satisfies the approximation bound in Equation (C.1) and Rϱ(Φεsq)(0)=0R_{\varrho}(\Phi^{\text{sq}}_{\varepsilon})(0)=0. The estimate (C.2) holds because of Equation (C.6). The number of weights can be bounded by

In the same way, the number of neurons and layers can be bounded. This concludes the proof.

As in , we will now use Proposition C.1 and the polarization identity

to define an approximate multiplication where the approximation error is again (and in contrast to ) measured in the W1,W^{1,\infty} norm.

For any M1M\geq 1, there exist constants c,c1>0c,c_{1}>0 and c2=c2(M)>0c_{2}=c_{2}(M)>0 such that for any ε(0,\nicefrac12)\varepsilon\in\left(0,\nicefrac{{1}}{{2}}\right), there is a neural network ×~\widetilde{\times} with two-dimensional input and one-dimensional output that satisfies the following properties:

Rϱ(×~)(x,y)xyW1,((M,M)2;dxdy)ε\lVert R_{\varrho}(\widetilde{\times})(x,y)-xy\rVert_{W^{1,\infty}(\left(-M,M\right)^{2};dxdy)}\leq\varepsilon;

if x=0x=0 or y=0y=0, then Rϱ(×~)(x,y)=0R_{\varrho}(\widetilde{\times})\left(x,y\right)=0;

Rϱ(×~ε)W1,((M,M)2)cM\lvert R_{\varrho}(\widetilde{\times}_{\varepsilon})\rvert_{W^{1,\infty}(\left(-M,M\right)^{2})}\leq cM;

the depth and the number of weights and neurons in ×~\widetilde{\times} is at most c1log2(1/ε)+c2c_{1}\log_{2}(1/\varepsilon)+c_{2}.

Set δ:=ε/(6M2C)\delta:=\varepsilon/(6M^{2}C), where CC is the constant from Corollary B.5 for n=2n=2 and m=1m=1, and let Φδsq\Phi^{\text{sq}}_{\delta} be the approximate squaring network from Proposition C.1 such that

As in the proof of [61, Proposition 3], we use the fact that x=ϱ(x)+ϱ(x)\lvert x\rvert=\varrho(x)+\varrho(-x) to see that a network ×~\widetilde{\times} can be constructed with two-dimensional input and one-dimensional output that satisfies

As a consequence of Proposition C.1, there exists a constant c0>0c_{0}>0 such that ×~\widetilde{\times} has at most c0log2(\nicefrac1ε)+c0log2(6M2)+3c0log2(\nicefrac1ε)+c1c_{0}\log_{2}(\nicefrac{{1}}{{\varepsilon}})+c_{0}\log_{2}(6M^{2})+3\leq c_{0}\log_{2}(\nicefrac{{1}}{{\varepsilon}})+c_{1} layers, 3c0log2(\nicefrac1ε)+3c0log2(6M2)+9clog2(\nicefrac1ε)+c23c_{0}\log_{2}(\nicefrac{{1}}{{\varepsilon}})+3c_{0}\log_{2}(6M^{2})+9\leq c^{\prime}\log_{2}(\nicefrac{{1}}{{\varepsilon}})+c_{2} neurons and 3c0log2(\nicefrac1ε)+3c0log2(6M2)+17clog2(\nicefrac1ε)+c33c_{0}\log_{2}(\nicefrac{{1}}{{\varepsilon}})+3c_{0}\log_{2}(6M^{2})+17\leq c^{\prime\prime}\log_{2}(\nicefrac{{1}}{{\varepsilon}})+c_{3} nonzero weights. Here c,c>0c^{\prime},c^{\prime\prime}>0 and c1=c1(M),c2=c2(M),c3=c3(M)>0c_{1}=c_{1}(M),c_{2}=c_{2}(M),c_{3}=c_{3}(M)>0 are suitable constants and (iv) is hence satisfied.

Since Rϱ(Φδsq)(0)=0R_{\varrho}(\Phi^{\text{sq}}_{\delta})(0)=0, (ii) easily follows.

Using the polarization identity (C.7), we can write

To keep the following calculations simple, we introduce some notation. In particular, we define

Note that for the inner functions in the compositions in Equation (C.11), it holds that

Hence, to finally prove (i), we apply Corollary B.5 to (C.11) and get

To show (iii) we use the chain rule estimate from Corollary B.5 and get

where we used Equation (C.12) in the third step and CC^{\prime} is the absolute constant from Equation (C.2) in Proposition C.1. ∎

Next, we construct a partition of unity (in the same way as in [61, Theorem 1]) that can be defined as a product of piecewise linear functions, such that each factor of the product can be realized by a neural network. We will use this product structure in Lemma C.6 together with a generalized version of the approximate multiplication from Proposition C.2 to approximate localized polynomials with ReLU networks.

ϕmΨϕm(x)=1\sum_{\phi_{m}\in\Psi}\phi_{m}(x)=1 for every xdx\in^{d};

there exist absolute constants C,c1C,c\geq 1 such that for each ϕmΨ\phi_{m}\in\Psi there is a neural network Φm\Phi_{m} with dd-dimensional input and dd-dimensional output, with at most three layers, CdCd nonzero weights and neurons, that satisfies

and [Rϱ(Φm)]lWk,((0,1)d)(cN)k\lVert[R_{\varrho}(\Phi_{m})]_{l}\rVert_{{W^{k,\infty}(\left(0,1\right)^{d})}}\leq(cN)^{k} for all l=1,,dl=1,\ldots,d and k{0,1}k\in\{0,1\}.

for m=(m1,,md){0,,N}dm=(m_{1},\ldots,m_{d})\in\{0,\ldots,N\}^{d}. Then, (i),(ii) and (iii) follow easily from the definition.

It follows that ϕmW1,3N\lvert\phi_{m}\rvert_{W^{1,\infty}}\leq 3N.

To show (v), we start by constructing a network Φψ\Phi_{\psi} that realizes the function ψ\psi. For this we set

and Φψ:=((A1,b1),(A2,b2))\Phi_{\psi}:=((A_{1},b_{1}),(A_{2},b_{2})). Then Φψ\Phi_{\psi} is a two-layer network with one-dimensional input and one-dimensional output, with 1212 nonzero weights and 66 neurons such that

which is a three-layer network with dd-dimensional input and dd-dimensional output, with at most d2(12+2)=d28d\cdot 2\cdot(12+2)=d\cdot 28 nonzero weights and at most 9d9d neurons, and

The following lemma uses the partition of unity from Lemma C.3 and the Bramble-Hilbert Lemma B.12 in a classical way to derive an approximation with localized polynomials in the LpL^{p} norm and the W1,pW^{1,p} norm. Using an interpolation argument, we can generalize this result to the case where the approximation is performed with respect to the Ws,W^{s,\infty} norm, where 0s10\leq s\leq 1.

Let 0s10\leq s\leq 1 and set fN:=m{0,,N}dϕmpf,mf_{N}:=\sum_{m\in\{0,\ldots,N\}^{d}}\phi_{m}p_{f,m}, then the operator Ts:Wn,p((0,1)d)Ws,p((0,1)d)T_{s}:{W^{n,p}(\left(0,1\right)^{d})}\to{W^{s,p}(\left(0,1\right)^{d})} with Tsf=ffNT_{s}f=f-f_{N} is linear and bounded with

Furthermore, there is a constant c=c(d,n)>0c=c(d,n)>0 such that for any fWn,p((0,1)d)f\in{W^{n,p}(\left(0,1\right)^{d})} the coefficients of the polynomials pf,mp_{f,m} satisfy

The idea of the proof is similar to the first part of the proof of [61, Theorem 1]. We use approximation properties of averaged Taylor polynomials (see Bramble-Hilbert Lemma B.12) to derive local estimates and then combine them using a partition of unity to obtain a global estimate. In order to use this strategy also near the boundary, we make use of an extension operator.

where CE=CE(n,p,d)C_{E}=C_{E}(n,p,d) is the norm of the extension operator.

Step 1 (Averaged Taylor polynomials): For each m{0,,N}dm\in\{0,\ldots,N\}^{d} we set

for m{0,,N}dm\in\{0,\ldots,N\}^{d}, where c=c(n,d,p)>0c^{\prime\prime}=c^{\prime\prime}(n,d,p)>0 is a suitable constant. It now suffices to show (i) and (ii).

Step 2 (Local estimates in Wk,p,k{0,1}\lVert\cdot\rVert_{W^{k,p}},k\in\{0,1\}): To check that the conditions of the Bramble-Hilbert Lemma B.12 are fulfilled, note that Bm,NΩm,NB_{m,N}\subset\subset\Omega_{m,N}. Furthermore, Bm,NB_{m,N} is a ball in Ωm,N\Omega_{m,N} such that Ωm,N\Omega_{m,N} is star-shaped with respect to Bm,NB_{m,N}. We have diam(Ωm,N)=(2d)/N\operatorname{diam}_{\lvert\cdot\rvert}(\Omega_{m,N})=(2\sqrt{d})/N and rmax(Ωm,N)=1/Nr^{\star}_{\text{max}}(\Omega_{m,N})=1/N and, thus,

Finally, we have for the chunkiness parameter of Ωm,N\Omega_{m,N}

Applying the Bramble-Hilbert Lemma B.12 yields for each m{0,,N}dm\in\{0,\ldots,N\}^{d} the local estimate

Here, C1=C1(n,d)>0C_{1}=C_{1}(n,d)>0 is the constant from Lemma B.12 which only depends on nn and dd, since the chunkiness parameter of Ωm,N\Omega_{m,N} is a constant depending only on dd (see (C.15)) and C2=C2(n,d)>0C_{2}=C_{2}(n,d)>0. In the same way, we get

where C3=C3(n,d)>0C_{3}=C_{3}(n,d)>0 is a suitable constant.

The first step towards a global estimate is now to combine Equation (C.16) and (C.17) with the cut-off functions from the partition of unity. We have

Furthermore, using the product inequality for weak derivatives from Lemma B.6 we get that there is a constant C=C(d,p)>0C^{\prime}=C^{\prime}(d,p)>0 such that

for some constant C5=C5(n,d,p)>0C_{5}=C_{5}(n,d,p)>0.

Step 3 (Global estimate in Wk,p,k{0,1}\lVert\cdot\rVert_{W^{k,p}},k\in\{0,1\}): To derive the global estimate, we start by noting that with property (ii) from Lemma C.3 we have

where the last step follows from (0,1)dm~{0,,N}dΩm~,N\left(0,1\right)^{d}\subset\bigcup_{\widetilde{m}\in\{0,\ldots,N\}^{d}}\Omega_{\widetilde{m},N}. Now we obtain for each m~{0,,N}d\widetilde{m}\in\{0,\ldots,N\}^{d}

where the triangle inequality together with the support property (iii) from Lemma C.3 is used in the first step. The second step follows again from Lemma C.3 (iii) and the third step follows from (C.18) for k=0k=0 and from (C.20) for k=1k=1. Here C6=C6(n,d,p)>0C_{6}=C_{6}(n,d,p)>0 can be chosen independent of kk (e.g. C6:=max{C2,C5}C_{6}:=\max\{C_{2},C_{5}\}).

Finally, the boundedness claim in (i) and (ii) follows from using the definition of fNf_{N} and combining Equation (C.22) with Equation (C.23):

where the last two steps follow from the definition of Ωm~,N\Omega_{\widetilde{m},N}. Thus, we have

for k{0,1}k\in\{0,1\}, where Equation (C.14) was used in the first and second step. Here C7=C7(n,d,p)>0C_{7}=C_{7}(n,d,p)>0 and C8=C8(n,d,p)C_{8}=C_{8}(n,d,p) are constants. The linearity of TkT_{k}, k{0,1}k\in\{0,1\} is a consequence of the linearity of the averaged Taylor polynomial (cf. Remark B.8).

Step 4 (Interpolation): For 0<s<10<s<1 we use a Banach space interpolation argument. Set

Then, we can apply Theorem A.4 to the operator TT with Tf:=ffNTf:=f-f_{N} and get for a constant C=C(n,d,p)>0C=C(n,d,p)>0 that

where we used Lemma A.3 (ii) to see that (A0,A1)s,p=Wn,p((0,1)d)(A_{0},A_{1})_{s,p}={W^{n,p}(\left(0,1\right)^{d})}. ∎

The following technical lemma lays the foundation for approximating localized monomials with neural networks. Using the notation and statement (v) from Lemma C.3, a localized monomial ϕmxα\phi_{m}x^{\alpha} can be expressed as the product of the output components of a network Φ(m,α)\Phi_{(m,\alpha)} with a suitable output dimension nn, i.e.

Given a network Φ\Phi with nn-dimensional output, we construct a network ΨΦ\Psi_{\Phi} that approximates the product of the output components of Φ\Phi.

For any ε(0,\nicefrac12)\varepsilon\in\left(0,\nicefrac{{1}}{{2}}\right), and any neural network Φ\Phi with dd-dimensional input and nn-dimensional output where nmn\leq m, and with number of layers, neurons and weights all bounded by KK, such that

there exists a neural network Ψε,Φ\Psi_{\varepsilon,\Phi} with dd-dimensional input and one-dimensional output, and with number of layers, neurons and weights all bounded by KClog2(\nicefrac1ε)KC\log_{2}(\nicefrac{{1}}{{\varepsilon}}), such that

for k{0,1}k\in\{0,1\} and some constant c=c(d,m,k)c=c(d,m,k). Moreover,

If m=1m=1, then we can choose Ψε,Φ=Φ\Psi_{\varepsilon,\Phi}=\Phi for any ε(0,\nicefrac12)\varepsilon\in\left(0,\nicefrac{{1}}{{2}}\right) and the claim holds.

Case 1: If nmn\leq m, then we use the induction hypothesis and get that there is a constant C0=C0(m)>0C_{0}=C_{0}(m)>0 and a neural network Ψε,Φ\Psi_{\varepsilon,\Phi} with dd-dimensional input and one-dimensional output, and at most KC0log2(\nicefrac1ε)KC_{0}\log_{2}(\nicefrac{{1}}{{\varepsilon}}) layers, neurons and weights such that

for k{0,1}k\in\{0,1\} and c1=c1(d,m)c_{1}=c_{1}(d,m). Moreover,

for any x(0,1)dx\in\left(0,1\right)^{d}. Furthermore, we have Rϱ(Ψε,Φ)W1,((0,1)d)C1N\lvert R_{\varrho}(\Psi_{\varepsilon,\Phi})\rvert_{{W^{1,\infty}(\left(0,1\right)^{d})}}\leq C_{1}N, for C1=C1(d,m)C_{1}=C_{1}(d,m).

We denote by Φm\Phi_{m} the neural network with dd-dimensional input and mm-dimensional output which results from Φ\Phi by removing the last output neuron and corresponding weights. In detail, we write

Using the induction hypothesis and the constants C0,c1C_{0},c_{1} and C1C_{1} from Case 1, we get that there is a neural network Ψε,Φm=((A1,b1),(A2,b2),,(AL,bL))\Psi_{\varepsilon,\Phi_{m}}=((A^{\prime}_{1},b^{\prime}_{1}),(A^{\prime}_{2},b^{\prime}_{2}),\dots,(A^{\prime}_{L^{\prime}},b^{\prime}_{L^{\prime}})) with dd-dimensional input and one-dimensional output, and at most KC0log2(\nicefrac1ε)KC_{0}\log_{2}(\nicefrac{{1}}{{\varepsilon}}) layers, neurons and weights such that

for any x(0,1)dx\in\left(0,1\right)^{d}. Furthermore, we can assume that Rϱ(Ψε,Φm)W1,((0,1)d)C1N\lvert R_{\varrho}(\Psi_{\varepsilon,\Phi_{m}})\rvert_{{W^{1,\infty}(\left(0,1\right)^{d})}}\leq C_{1}N, and that the first L(Φ)1L(\Phi)-1 layers of Ψε,Φm\Psi_{\varepsilon,\Phi_{m}} and Φm\Phi_{m} coincide and, thus, also the first L(Φ)1L(\Phi)-1 layers of Ψε,Φm\Psi_{\varepsilon,\Phi_{m}} and Φ\Phi, i.e. Al=AlA_{l}=A^{\prime}_{l} for l=1,,L(Φ)1l=1,\ldots,L(\Phi)-1.

Now, we add the formerly removed neuron with corresponding weights back to the last layer of Ψε,Φm\Psi_{\varepsilon,\Phi_{m}}. For the resulting network

it holds that the first L1L-1 layers of Ψ~ε,Φ\widetilde{\Psi}_{\varepsilon,\Phi} and Φ\Phi coincide, and Ψ~ε,Φ\widetilde{\Psi}_{\varepsilon,\Phi} is a neural network with two-dimensional output. Note that

where we used Equation (C.26) for k=0k=0. Additionally, we have

Now, we denote by ×~\widetilde{\times} the network from Proposition C.2 with M=m+1M=m+1 and accuracy ε\varepsilon and define

Consequently, Ψε,Φ\Psi_{\varepsilon,\Phi} has dd-dimensional input, one-dimensional output and, combining the induction hypothesis with statement (iv) of Proposition C.2 and Remark 2.8, at most

layers, number of neurons and weights. Here cc^{\prime} and c=c(m+1)c^{\prime\prime}=c^{\prime\prime}(m+1) are the constants from Proposition C.2 (iv) for the choice M=m+1M=m+1 and C=C(m+1)>0C=C(m+1)>0 is a suitable constant. Clearly, the first L1L-1 layers of Ψε,Φ\Psi_{\varepsilon,\Phi} and Φ\Phi coincide and for the approximation properties it holds that

for k{0,1}k\in\{0,1\}. We continue by considering the first term of the Inequality (C.28) for k=0k=0 and obtain

where we used Proposition C.2 (i) for the last step. Next, we consider the same term for k=1k=1 and apply the chain rule from Corollary B.5. For this, let C^=C^(d)\hat{C}=\hat{C}(d) be the constant from Corollary B.5 (for n=dn=d and m=2m=2). We get

where we used the induction hypothesis together with [Rϱ(Φ)]m+1W1,((0,1)d)N\lvert[R_{\varrho}(\Phi)]_{m+1}\rvert_{{W^{1,\infty}(\left(0,1\right)^{d})}}\leq N in the third step, and C1=C1(d,m+1)>0C_{1}^{\prime}=C_{1}^{\prime}(d,m+1)>0 is a suitable constant.

To estimate the second term of (C.28) for k=0k=0 we use the induction hypothesis (for k=0k=0) and get

For k=1k=1 we apply the product rule from Lemma B.6 together with [Rϱ(Φ)]m+1L1\lVert[R_{\varrho}(\Phi)]_{m+1}\rVert_{{L^{\infty}}}\leq 1 and get

where we used the induction hypothesis for k=1k=1, and c1=c1(d,m+1)>0c_{1}^{\prime}=c_{1}^{\prime}(d,m+1)>0.

Combining (C.28) with (C.29) and (C.31) we have

and in the same way a combination of (C.28) with (C.30) and (C.32) yields

where c1=c1(d,m+1)>0c_{1}^{\prime\prime}=c_{1}^{\prime\prime}(d,m+1)>0. Putting together the two previous estimates yields

for a suitable constant c1=c1(d,m+1)>0c_{1}^{\prime\prime\prime}=c_{1}^{\prime\prime\prime}(d,m+1)>0.

We now show Equation (C.25) for m+1m+1. To this end, assume that [Rϱ(Φ)]l(x)=0[R_{\varrho}(\Phi)]_{l}(x)=0 for some l{1,,m+1}l\in\{1,\ldots,m+1\} and x(0,1)dx\in\left(0,1\right)^{d}. If lml\leq m, then Equation (C.27) implies that

Hence, by application of Proposition C.2, we have

Finally, we need to show that there is a constant C1=C1(d,m+1)>0C_{1}^{\prime\prime}=C_{1}^{\prime\prime}(d,m+1)>0 such that

Similarly as in (C.30) we have for a constant C1=C1(d,m+1)>0C_{1}^{\prime\prime}=C_{1}^{\prime\prime}(d,m+1)>0 that

where Corollary B.5 was used for the second step and c^\hat{c} is the constant from Proposition C.2 (iii) which together with an argument as in (C.30) implies the third step.

Taking the maximum of the each pair of constants derived in Case 1 and Case 2 concludes the proof. ∎

In the next lemma, we approximate a sum of localized polynomials with a neural network. One of the difficulties is to control the derivative of the localizing functions from the partition of unity.

For any ε(0,\nicefrac12)\varepsilon\in\left(0,\nicefrac{{1}}{{2}}\right) there is a neural network architecture Aε=Aε(d,n,N,ε){\mathcal{A}}_{\varepsilon}={\mathcal{A}}_{\varepsilon}(d,n,N,\varepsilon) with dd-dimensional input and one-dimensional output, with at most C2log2(\nicefrac1ε)C_{2}\cdot\log_{2}(\nicefrac{{1}}{{\varepsilon}}) layers and C3(N+1)dlog2(\nicefrac1ε)C_{3}(N+1)^{d}\log_{2}(\nicefrac{{1}}{{\varepsilon}}) weights and neurons, such that the following holds: Let fWn,p((0,1)d)f\in{W^{n,p}(\left(0,1\right)^{d})} and pm(x):=pf,m(x)=αn1cm,αxαp_{m}(x):=p_{f,m}(x)=\sum_{\lvert\alpha\rvert\leq n-1}c_{m,\alpha}x^{\alpha} for m{0,,N}dm\in\{0,\ldots,N\}^{d} be the polynomials from Lemma C.4, then there is a neural network ΦP,ε\Phi_{P,\varepsilon} that has architecture Aε{\mathcal{A}}_{\varepsilon} such that

Step 1 (Approximating localized monomials ϕmxα\phi_{m}x^{\alpha}): Let αn1|\alpha|\leq n-1 and m{0,,N}dm\in\{0,\dots,N\}^{d}. It is easy to see that there is a neural network Φα\Phi_{\alpha} with dd-dimensional input and α\lvert\alpha\rvert-dimensional output, with one layer and at most n1n-1 weights and n1+dn-1+d neurons such that

Let now Φm\Phi_{m} be the neural network and C,c1C,c\geq 1 the constants from Lemma C.3 (v) and define the network

for a constant c=c(n,d,k)>0c^{\prime}=c^{\prime}(n,d,k)>0 and k{0,1}k\in\{0,1\}, and

Step 2 (Constructing an architecture capable of approximating sums of localized polynomials): We set

Then, there are constants C2=C2(n,d),C3=C3(n,d)>0C_{2}=C_{2}(n,d),C_{3}=C_{3}(n,d)>0 such that ΦP,ε\Phi_{P,\varepsilon} is a neural network with dd-dimensional input and one-dimensional output, with at most 1+C1log2(\nicefrac1ε)C2log2(\nicefrac1ε)1+C_{1}\log_{2}(\nicefrac{{1}}{{\varepsilon}})\leq C_{2}\log_{2}(\nicefrac{{1}}{{\varepsilon}}) layers,

Note that the network ΦP,ε\Phi_{P,\varepsilon} only depends on pf,mp_{f,m} (and thus on ff) via the coefficients cm,αc_{m,\alpha}. Now, it is easy to see that there exists a neural network architecture Aε=Aε(d,n,N,ε){\mathcal{A}}_{\varepsilon}={\mathcal{A}}_{\varepsilon}(d,n,N,\varepsilon) with L(Aε)C2log2(\nicefrac1ε)L({\mathcal{A}}_{\varepsilon})\leq C_{2}\log_{2}(\nicefrac{{1}}{{\varepsilon}}) layers and number of neurons and weights bounded by C3(N+1)dlog2(\nicefrac1ε)C_{3}(N+1)^{d}\log_{2}(\nicefrac{{1}}{{\varepsilon}}) such that ΦP,ε\Phi_{P,\varepsilon} has architecture Aε{\mathcal{A}}_{\varepsilon} for every of choice of coefficients cm,αc_{m,\alpha}, and hence for every choice of ff.

Step 3 (Estimating the approximation error in Wk,p,k{0,1}\lVert\cdot\rVert_{W^{k,p}},k\in\{0,1\}): For each m{0,,N}dm\in\{0,\ldots,N\}^{d} we set

where the last step is a consequence of (0,1)dm~{0,,N}dΩm~,N\left(0,1\right)^{d}\subset\bigcup_{\widetilde{m}\in\{0,\ldots,N\}^{d}}\Omega_{\widetilde{m},N}. For m~{0,,N}d\widetilde{m}\in\{0,\ldots,N\}^{d} we have

where we used the triangle inequality in the first step and Lemma C.4 in the second step. Here c1=c1(n,d)>0c_{1}=c_{1}(n,d)>0 is a constant. Next, note that

where λ\lambda denotes the Lebesgue measure and c3=c3(d,p)>0c_{3}=c_{3}(d,p)>0 is a constant. Combining Equation (C.39) with the last estimate yields

Combining Equation (C.40) with Equation (C.41) and plugging the result in Equation (C.38) finally yields

where the last step is the same as Step 3 of the proof of Lemma C.4 and c5=c5(n,d,p,k)>0c_{5}=c_{5}(n,d,p,k)>0 is a constant. Hence the case s=0s=0 and s=1s=1 is proven.

Step 4 (Interpolation): To show the general statement for 0s10\leq s\leq 1 we use the interpolation inequality from Corollary A.5 together with Equation (C.42) and directly get

for a constant c6=c6(n,p,d,s)>0c_{6}=c_{6}(n,p,d,s)>0. This concludes the proof of the lemma. ∎

Finally, we are ready to proof the upper complexity bounds.

The proof can be divided into two steps: First, we approximate the function ff by a sum of localized polynomials and then approximate this sum by a network.

where C=C(n,d,p)>0C=C(n,d,p)>0 is the constant from Corollary C.4. Without loss of generality we may assume that CB1CB\geq 1. The same corollary yields that if Ψ=Ψ(d,N)={ϕm:m{0,,N}d}\Psi=\Psi(d,N)=\left\{\phi_{m}:m\in\{0,\ldots,N\}^{d}\right\} is the partition of unity from Lemma C.3, then there exist polynomials pm(x)=αn1cm,αxαp_{m}(x)=\sum_{\lvert\alpha\rvert\leq n-1}c_{m,\alpha}x^{\alpha} for m{0,,N}dm\in\{0,\ldots,N\}^{d} such that

For the second step, let C1=C1(n,d,p,s)>0C_{1}=C_{1}(n,d,p,s)>0, C2=C2(n,d)>0C_{2}=C_{2}(n,d)>0 and C3=C3(n,d)>0C_{3}=C_{3}(n,d)>0 be the constants from Lemma C.6 and ΦP,ε\Phi_{P,\varepsilon} be the neural network provided by Lemma C.6 with εn/(ns)/(8CB2C1)\varepsilon^{n/(n-s)}/(8CB^{2}C_{1}) instead of ε\varepsilon. Then ΦP,ε\Phi_{P,\varepsilon} has at most

layers for a constant C=C(n,d,p,B,s)>0C^{\prime}=C^{\prime}(n,d,p,B,s)>0 and at most

nonzero weights and neurons. Here C=C(n,d,p,B,s)C^{\prime\prime}=C^{\prime\prime}(n,d,p,B,s) is a suitable constant and we used (2CB)/ε1(2CB)/\varepsilon\geq 1 in the first step. It holds that the architecture of ΦP,ε\Phi_{P,\varepsilon} is independent of the function ff. Furthermore, we have

where we used the inequality (x+y)σxσ+yσ(x+y)^{\sigma}\leq x^{\sigma}+y^{\sigma} for x,y0x,y\geq 0 and 0σ10\leq\sigma\leq 1 in the last step. We now continue the above computation using that s/(ns)1s/(n-s)\leq 1 and 2CB12CB\geq 1 and therefore (2CB)s/(ns)2CB(2CB)^{s/(n-s)}\leq 2CB and get

Combining the previous computations we get

Using the triangle inequality and Equations (C.44) and (C.45) we finally obtain

Appendix D Lower bounds for approximations

Lower complexity bounds for approximations in LL^{\infty} norm, which correspond to the case k=0k=0 in Theorem 4.3, have already been shown in Theorem 4 a) in . For lower complexity bounds of W1,W^{1,\infty} approximations we modify the proof strategy outlined in [61, Theorem 4 a)].

We start by showing an auxiliary result, that is used in the proof of Proposition D.2.

Now, set xn:=x+(1/n)νx_{n}:=x+(1/n)\nu. By the pigeonhole principle, there exists q{1,,k}q\in\{1,\ldots,k\}, such that Vq\overline{V_{q}} contains infinitely many xnx_{n}. It is not hard to see that if a closed polyhedron contains a converging sequence on a line, then it also contains a small section of the line including the limit point of the sequence. Thus, there exists δ>0\delta>0 such that {x+λδν:0λ1}Vq(0,1)dVq(0,1)d\{x+\lambda\delta\nu:0\leq\lambda\leq 1\}\subset\overline{V_{q}}\cap\left(0,1\right)^{d}\subset\overline{V_{q}\cap\left(0,1\right)^{d}}. Then, setting T:=Vq(0,1)dT:=V_{q}\cap\left(0,1\right)^{d} shows the claim. ∎

As in , we make use of a combinatorial quantity measuring the expressiveness of a set of binary valued functions HH defined on some set XX, called VC-dimension (see e.g. [3, Chapter 3.3]). We define

On the other hand, [3, Theorem 8.4] yields an upper bound of the VC-dimension of HH in terms of the number of computations and the dimension of the parameterization of HH which can be expressed as a function of M(Aε)M({\mathcal{A}}_{\varepsilon}) (Claim 2). Together this gives the desired relation.

Step 1 (Construction of hh): Let now 0<ε<c1N(n1)/(3d)0<\varepsilon<c_{1}N^{-(n-1)}/(3\sqrt{d}) for some constant c1=c1(n)>0c_{1}=c_{1}(n)>0 to be chosen later, and Aε=Aε(d,n,ε){\mathcal{A}}_{\varepsilon}={\mathcal{A}}_{\varepsilon}(d,n,\varepsilon) be a neural network architecture as in the claim of the proposition.

To this end, we first define a function fyFn,d,,Bf_{y}\in\mathcal{F}_{n,d,\infty,B} with fy(xm)=ymaf_{y}(x_{m})=y_{m}\cdot a for some constant a>0a>0, and then make use of a neural network Φfy\Phi^{f_{y}} that approximates fyf_{y}.

and, in particular, if x=1/(4N)\lvert x\rvert=1/(4N), then

Here, we defined the constant c1=c1(n,B)>0c_{1}=c_{1}(n,B)>0 which was left unspecified in the beginning of the proof by c1:=Bϕ(1/4)4ψWn,((0,1)d)c_{1}:=\frac{B\phi(1/4)}{4\lVert\psi\rVert_{{W^{n,\infty}(\left(0,1\right)^{d})}}}.

where we used Equation (D.5) together with Equation (D.4) and ν(xm)=1\lvert\nu(x_{m})\rvert=1 for the first step and Equation (D.2) together with the upper bound for ε\varepsilon for the second step.

In a similar way, we get for the case ym=0y_{m}=0 that

Finally, combining Equation (D.6) and Equation (D.7) reads as

Claim 2 (VCdim(H)CM(Aε)2\operatorname{VCdim}(H)\leq C\cdot M({\mathcal{A}}_{\varepsilon})^{2}): We start by showing that there exists a constant C=C(d)C^{\prime}=C^{\prime}(d) such that h((w,δ),x)h((w,\delta),x) can be computed using CM(Aε)C^{\prime}\cdot M({\mathcal{A}}_{\varepsilon}) operations of the following types

the arithmetic operations +,,×+,-,\times, and // on real numbers,

jumps conditioned on >,,<,,=>,\geq,<,\leq,=, and \neq comparisons of real numbers

Note that the number of neurons that are needed for the computation of Rϱ(Aε(w))R_{\varrho}({\mathcal{A}}_{\varepsilon}(w)) can be bounded by M(Aε)M({\mathcal{A}}_{\varepsilon}). Thus, Rϱ(Aε(w))R_{\varrho}({\mathcal{A}}_{\varepsilon}(w)) can be computed using at most C2M(Aε)C_{2}\cdot M({\mathcal{A}}_{\varepsilon}) operations where C2>0C_{2}>0 is an absolute constant. Hence, there exists a constant C3=C3(d)C_{3}=C_{3}(d) such that for the number of operations of the specified type tt needed for the computation of h(x)h(x) where xdx\in^{d} it holds that tC3M(Aε)t\leq C_{3}M({\mathcal{A}}_{\varepsilon}).

where C4=C4(d)>0C_{4}=C_{4}(d)>0 is a suitable constant. ∎

The proof of the lower complexity bounds is now a simple consequence of Proposition D.2.

The case k=0k=0 corresponds to [61, Theorem 4 a)].

For the case k=1k=1, let c=c(n,B)c=c(n,B) and C=C(d)C=C(d) be the constants from Proposition D.2 and set

Then, N(c/ε)1/(n1)N\leq(c/\varepsilon)^{1/(n-1)} and, thus, 0<εcN(n1)0<\varepsilon\leq cN^{-(n-1)}. Now, Proposition D.2 implies that

We also have (c/ε)1/(n1)2N(c/\varepsilon)^{1/(n-1)}\leq 2N and hence c1εd/(n1)Ndc_{1}\varepsilon^{-d/(n-1)}\leq N^{d} for a suitable constant c1=c1(n,B)>0c_{1}=c_{1}(n,B)>0. Combining this estimate with Equation D.8 yields

for a constant C=C(d,n,B)>0C^{\prime}=C^{\prime}(d,n,B)>0. ∎

References