Error bounds for approximations with deep ReLU neural networks in $W^{s,p}$ norms

Ingo Gühring, Gitta Kutyniok, Philipp Petersen

Introduction

Powered by modern, highly parallelized hardware and the immense amount of accessible data, deep neural networks substantially outperform both traditional modelling approaches, e.g. those based on differential equations, and classical machine learning methods in a wide range of applications. Prominent areas of application include image classification , speech recognition , and natural language processing .

Despite this overwhelming success in applications, a comprehensive mathematical explanation of this success is has not been found. However, a deep theoretical understanding of these techniques is crucial to design more efficient architectures and essential in safety-critical applications, such as autonomous driving.

Many attempts at unraveling the extreme efficiency of deep neural networks have been made in the context of approximation theory. The universal approximation theorem establishes that every continuous function on a compact domain can be uniformly approximated by neural networks. More refined results that also set into relation the size and the approximation fidelity of neural networks have been reported for smooth activation functions in, for example, .

In most applications, the activation functions are not smooth but taken to be the piecewise linear ReLU function. For these networks, approximation bounds for classes of smooth functions have been established in and for piecewise smooth functions in . Connections to sparse grids and the associated approximation rates were established in and connections to linear finite element approximation were reported in . It was also discovered that the depth of neural networks, i.e., the number of layers, crucially influences the approximation capabilities of these networks in the sense that deeper networks are more efficient approximators .

One of the applications of deep learning where a strong knowledge of the approximation capabilities of neural networks directly translates into quantifiable theoretical results appears when solving partial differential equations using deep learning techniques. Some notable advances in this direction have been made in . In this regard, not only the approximation fidelity with respect to standard Lebesgue spaces is of high interest, but also that with respect to Sobolev-type norms. First results in this direction were reported in .

In this work, we give a comprehensive analysis of the approximation rates of deep neural networks of Sobolev-regular functions with respect to (fractional) Sobolev norms.

In the remaining part of this introduction, we briefly describe neural network architectures relevant to this work. Then, we present the setting that neural networks are typically used in. Afterwards, we give a practical and theoretical motivation for studying approximations of functions and their derivatives with neural networks. Finally, we state our results.

is the affine-linear transformation of the $l$ -th layer, then the computation in that layer can be described as

Note that $\varrho$ acts componentwise in all but the last layer and in the last layer no activation function is applied. The parameters defining the affine-linear transformations $T_{l}$ are referred to as weights and $\sum_{k=0}^{L}N_{l}$ is called the number of neurons of the network, since each output coordinate of $f_{l}$ can be seen as a small computational unit, similar to neurons in the brain. If a network has $3$ or more layers, then it is usually called deep and a $2$ -layer network is called shallow. The complexity of a neural network is typically measured in the number of layers, weights, and neurons (see ). The term deep learning refers to the subset of machine learning methods associated with deep neural networks.

Recently, more general feedforward architectures which allow connections between non-neighboring layers, so-called skip connections, have been shown to yield state-of-the-art results in object recognition tasks . Here, the input of a layer consists of the output of all preceding layers. Note, that the case where no skip connections are allowed, is a special case of this more general architecture.

One of today’s most widely-used activation functions is the Rectified Linear Unit (ReLU) (see ), defined as

The popularity of the ReLU can be explained by a number of factors. It is cheap to compute, promotes sparsity in data representation , alleviates the problem of vanishing-gradients , and thus yields better optimization properties. Moreover, the ReLU is the only commonly-used activation function such that the associated function spaces are not necessary non-closed . In this work, we will mostly focus on ReLU networks with a feedforward architecture allowing skip connections.

2 Supervised learning with neural networks

Typically, neural networks are applied in supervised learning problems. The starting point is a dataset of input-output pairs $(x_{i},f(x_{i}))_{i=1}^{m}$ , called samples, where $f$ is in most cases an unknown functionWe will later see a case where $f$ is in fact known. with values only given at sample points $x_{i}$ . As an example, $x_{i}$ can be thought of as an image and $f(x_{i})$ as a vector of scores, where each score is the probability of a certain category, e.g. ”dog” or ”cat”, being associated with $x_{i}$ . During training, one then seeks to learn the function $f$ by adapting the weights $w$ of a neural network $\mathcal{N}$ such that the empirical loss

is minimized for some loss function $l$ . Ultimately, one is interested in how well the learning algorithm performs on formerly unseen data points $x$ . The associated error is called the generalization error.

3 Motivation: approximating functions and derivatives

From a learning theory point of view, the task of estimating the generalization error decomposes into a statistical problem depending on the samples and an approximation theoretical problem, independent of the samples. An accessible introduction to learning theory from the perspective of approximation theory can be found in . The aim of this work is to study the simultaneous approximation of a function and its derivative with a neural network. There are multiple scenarios in which this is possible and useful.

In , the $j$ -th order derivatives of $f$ are incorporated in the empirical loss function (1.1), resulting in an empirical loss

which encourages the network to encode information about the derivatives of $f$ in its weights. The authors of call this method Sobolev training and reported reduced generalization errors and better data-efficiency in a network compression task (see ) and in application to synthetic gradients (see ). In case of network compression, the approximated function $f$ is a function realized by a possibly very large neural network $\mathcal{N}_{\text{large}}(\cdot|w)$ , that has been trained for some supervised learning task and is learnt by a smaller network $\mathcal{N}_{\text{small}}$ . In contrast to usual supervised learning settings, the approximated function $f(\cdot)=\mathcal{N}_{\text{large}}(\cdot|w)$ is known and the derivatives can be computed.

Motivated by the performance of deep learning-based solutions in classical machine learning tasks and, in particular, by their ability to overcome the curse of dimension, neural networks are now also applied for the approximative solution of partial differential equations (PDEs) (see ).

In the authors present their deep Galerkin method for approximating solutions of high-dimensional quasilinear parabolic PDEs. For this, a functional $J(f)$ encoding the differential operator, boundary conditions, and initial conditions is introduced. A neural network $\mathcal{N}_{\text{PDE}}$ with weights $w$ is then trained to minimize the functional $J(\mathcal{N}_{\text{PDE}}(w))$ . This is done by a discretization and randomly sampling spatial points.

The theoretical foundation for approximating a function and higher-order derivatives with a neural network was already given in a less known version of the universal approximation theorem by Hornik in [32, Theorem 3]. In particular, it was shown that if the activation function $\varrho$ is $k$ -times continuously differentiable, non-constant, and bounded, then any $k$ -times continuously differentiable function $f$ and its derivatives up to order $k$ can be uniformly approximated by a shallow neural network on compact sets. Note though that the conditions on the activation function are very restrictive and that, for example, the ReLU is not included in the above result. However, in , it was shown that the theorem also holds for shallow ReLU networks if $k=1$ . Theorem 3 in was also used in to show the existence of a shallow network approximating solutions of the PDEs considered in this paper.

An important aspect, that is untouched by the previous approximation results is how the complexity of a network and, in particular, its depth relates to its approximation properties.

4 Our contribution

In Theorem 1 in , Yarotsky showed upper complexity bounds for approximations in $L^{\infty}$ norm of functions from the Sobolev space $W^{n,\infty}(\left(0,1\right)^{d})$ with neural networks for continuous piecewise linear activation functions with a finite number of breakpoints. Precisely, it is shown there, that for any $\varepsilon>0$ there exists a neural network $\mathcal{N}_{\varepsilon}$ with at most $c\cdot\log_{2}(\nicefrac{{1}}{{\varepsilon}})$ layers and at most $c\cdot\varepsilon^{-d/n}\log_{2}(\nicefrac{{1}}{{\varepsilon}})$ weights and neurons such that for any $f\in W^{n,\infty}(\left(0,1\right)^{d})$ with $\lVert f\rVert_{W^{n,\infty}(\left(0,1\right)^{d})}\leq 1$ there is a choice of weights $w_{f}$ with

where $c$ is a constant depending on $d$ and $n$ .

Furthermore, under the assumption of a possibly discontinuous weight selection the number of weights needed by a neural network $\mathcal{N}_{\varepsilon}$ to be able to realize an $\varepsilon$ -approximation in $L^{\infty}$ norm for any $f\in W^{n,\infty}(\left(0,1\right)^{d})$ with $\lVert f\rVert_{W^{n,\infty}(\left(0,1\right)^{d})}\leq 1$ is shown to be lower bounded by $c^{\prime}\cdot\varepsilon^{d/(2n)}$ in [61, Theorem 4 a)]. The constant $c^{\prime}$ depends on $d$ and $n$ .

We show for the same set of activation functions that the approximation can also be done with respect to higher-order Sobolev norms with arbitrary $1\leq p\leq\infty$ and that there is a trade-off between the regularity used in the approximation norm and the regularity used in the complexity bounds. Specifically, we show that for any approximation accuracy $\varepsilon>0$ and regularity $0\leq s\leq 1$ , there is a ReLU neural network $\mathcal{N}_{\varepsilon}$ with at most $c\cdot\log_{2}(\varepsilon^{-n/(n-s)})$ layers and $c\cdot\varepsilon^{-d/(n-s)}\cdot\log_{2}(\varepsilon^{-n/(n-s)})$ weights and neurons such that and any $f\in{W^{n,p}(\left(0,1\right)^{d})}$ with $\lVert f\rVert_{{W^{n,p}(\left(0,1\right)^{d})}}\leq B$ there is a choice of weights $w_{f}$ with

where $c$ is a constant depending on $d,n,p,s$ , and $B$ . In the boundary case $s=0$ and $p=\infty$ our results corresponds to the theorem shown by Yarotsky and for $s=1$ and $p=\infty$ the function $\mathcal{N}_{\varepsilon}(\cdot|w_{f})$ and its weak gradient uniformly approximate $f$ and the weak gradient of $f$ , respectively. For non-integer $s$ the function $\mathcal{N}_{\varepsilon}(\cdot|w_{f})$ approximates the function $f$ and its fractional derivative of order $s$ approximates the fractional derivative of order $s$ of $f$ . The case $0\leq s\leq 1$ and $p=\infty$ was already shown by one of the authors, I. Gühring, in his Master thesis .

Moreover, analogously to [61, Theorem 4 a)] we establish lower bounds where the same regularity-complexity trade-off can be observed. We show that if a neural network $\mathcal{N}_{\varepsilon}$ is able to realize an $\varepsilon$ -approximation in $W^{1,\infty}$ norm for any $f\in W^{n,\infty}(\left(0,1\right)^{d})$ with $\lVert f\rVert_{W^{n,\infty}(\left(0,1\right)^{d})}\leq 1$ , then the number of weights of $\mathcal{N}_{\varepsilon}$ is lower bounded by $c^{\prime}\cdot\varepsilon^{\nicefrac{{-d}}{{2(n-1)}}}$ . Here, $c^{\prime}$ is a constant depending on $d$ and $n$ .

5 Outline

As a preparation, we start by introducing notation and some definitions in Section 1.6. In Section 2, we rigorously define a neural network and its architecture in mathematical terms and develop a network calculus. In Section 3, we briefly introduce (fractional) Sobolev spaces. In Section 4, we present our results which will be discussed in Section 5. To allow a concise presentation of our results most proofs are given in the appendix: Appendix A and B contain the necessary preparation for the proof of the results from Section 4.1 which is given in Appendix C. The results from Section 4.2 are proven in Appendix D.

6 Notation

Of course, the same notation also applies to (block) vectors.

for all $x,y\in\Omega$ . If we want to emphasize the constant $L$ , then we say that $f$ is $L$ -Lipschitz.

If $X$ is a linear space and $\lVert\cdot\rVert_{1}$ and $\lVert\cdot\rVert_{2}$ are two norms on $X$ , then we say that $\lVert\cdot\rVert_{1}$ and $\lVert\cdot\rVert_{2}$ are equivalent if there exist constants $C_{1},C_{2}>0$ such that

For two normed linear spaces $X,Y$ we denote by $\mathcal{L}(X,Y)$ the set of bounded linear operators mapping $X$ to $Y$ and for $T\in\mathcal{L}(X,Y)$ the induced operator norm of $T$ is denoted by

Let $a>0$ , then we say for two functions $f:(0,a)\to[0,\infty)$ and $g:(0,a)\to[0,\infty)$ that $f(\varepsilon)$ is in $\mathcal{O}(g(\varepsilon))$ if there exists $0<\delta<a$ and $C>0$ such that $f(\varepsilon)\leq Cg(\varepsilon)$ for all $\varepsilon\in(0,\delta)$ .

Neural networks

In this section, we introduce the notion of neural networks used in this paper. As in , we will consider a general type of feedforward architecture that also allows for connections of neurons in non-neighboring layers. It can be seen, though, that any function realized by such a network can also be realized by a network with a more restrictive feedforward architecture where only neurons from neighboring layers can be connected (Lemma 2.11). As in , we draw a distinction between the neural network and the function that the network realizes. This gives us the possibility to develop a network calculus in the spirit of [47, Chapter 2]. As in that paper we will introduce the notion of network concatenation and parallelization.

The following definition is similar to [47, Definition 2.1], where the difference is that we also allow connections between non-neighboring layers.

where $x_{L}$ results from the following scheme:

where $A_{l,x_{k}}$ is an $N_{l}\times N_{k}$ matrix for $k=0,\ldots,l-1$ and $l=1,\ldots,L$ . Then

We will now define the class of neural networks where only connections between neighboring layers are allowed. Networks without skip connections are a special case of the networks of Definition 2.1. Since they are more frequently used in the literature, we coin such networks standard neural networks.

If $\Phi=((A_{1},b_{1}),(A_{2},b_{2}),\dots,(A_{L},b_{L}))$ is a neural network as above and we have for

that $A_{l,x_{i}}=0$ for $l=1,\ldots,L$ and $i=0,\ldots,l-2$ , then we call $\Phi$ a standard neural network. The computation scheme then reduces to the following:

In practice, before training a neural network, i.e. adjusting the weights of the network, one has to decide which network architecture to use. The following definition will clarify the notion of a network architecture.

We say that a neural network $\Phi=((A^{\prime}_{1},b^{\prime}_{1}),(A^{\prime}_{2},b^{\prime}_{2}),\dots,(A^{\prime}_{L},b^{\prime}_{L}))$ with input dimension $d$ and $L$ layers has architecture ${\mathcal{A}}$ if

$N_{l}(\Phi)=N_{l}$ for all $l=1,\ldots,L$ and

$[A^{\prime}_{l}]_{i,j}\neq 0$ implies $[A_{l}]_{i,j}\neq 0$ for all $l=1,\ldots,L$ with $i=1,\ldots,N_{l}$ and $j=1,\ldots,\sum_{k=0}^{l-1}N_{k}$ .

In the spirit of Definition 2.2 we say that ${\mathcal{A}}$ is a standard neural network architecture if all weights connecting neurons from non-neighboring layers are zero.

Note that in general an architecture of a neural network does not need to be unique.

Moreover, we define one of the in practice commonly used (see ) activation functions.

is called ReLU (Rectified Linear Unit) activation function.

It can be shown, see [61, Proposition 1], that in terms of approximation and upper complexity bounds for networks with continuous and piecewise linear activation functions it suffices to focus on the ReLU.

be two neural networks such that the input layer of $\Phi^{1}$ has the same dimension as the output layer of $\Phi^{2}$ . Then, $\Phi^{1}\bullet\Phi^{2}$ denotes the following $L_{1}+L_{2}-1$ layer network:

for $l=1,\ldots,L_{1}$ . We call $\Phi^{1}\bullet\Phi^{2}$ the concatenation of $\Phi^{1}$ and $\Phi^{2}$ .

Then, $R_{\varrho}(\Phi^{1}\bullet\Phi^{2})=R_{\varrho}(\Phi^{1})\circ R_{\varrho}(\Phi^{2})$ .

This can be verified with a direct computation. ∎

Next, we adapt [47, Lemma 2.3] to the case where skip connections are allowed and define a network which realizes an identity function if the activation function is the ReLU.

Then the sparse concatenation of $\Phi^{1}$ and $\Phi^{2}$ is defined as

for $l=1,\ldots,L_{1}$ . Furthermore, it holds that $L(\Phi^{1}\odot\Phi^{2})=L_{1}+L_{2}$ , and $M(\Phi^{1}\odot\Phi^{2})\leq 2M(\Phi^{1})+2M(\Phi^{2})$ and $N(\Phi^{1}\odot\Phi^{2})\leq 2N(\Phi^{1})+2N(\Phi^{2})$ .

2 Network parallelization

be two neural networks with $d$ -dimensional input. If $L_{1}\leq L_{2}$ , then we define

A similar construction is used for the case $L_{1}>L_{2}$ . Then $P(\Phi^{1},\Phi^{2})$ is a neural network with $d$ -dimensional input and $\max\{L_{1},L_{2}\}$ layers, which we call the parallelization of $\Phi^{1}$ and $\Phi^{2}$ . We have $M(P(\Phi^{1},\Phi^{2}))=M(\Phi^{1})+M(\Phi^{2})$ , $N(P(\Phi^{1},\Phi^{2}))=N(\Phi^{1})+N(\Phi^{2})-d$ and $R_{\varrho}(P(\Phi^{1},\Phi^{2}))(x)=(R_{\varrho}(\Phi^{1})(x),R_{\varrho}(\Phi^{2})(x))$ .

with $d$ -dimensional input and $L=\max\{L_{1},\ldots,L_{n}\}$ layers is called the parallelization of $\Phi^{1},\Phi^{2},\ldots,\Phi^{n}$ . We have

and $R_{\varrho}(P(\Phi^{1},\ldots,\Phi^{n}))(x)=(R_{\varrho}(\Phi^{1})(x),\ldots,R_{\varrho}(\Phi^{n})(x))$ .

3 Standard neural networks

In this subsection, we will show that if a function is a realization of a neural network, then it can also be realized by a standard neural network with similar complexity. The proof of this result is heavily based on the idea of the construction of an identity function with ReLU networks (Lemma 2.6).

Let $\varrho$ be the ReLU. Then there exist absolute constants $C_{1},C_{2}>0$ such that for any neural network $\Phi$ with $d$ -dimensional input and $L=L(\Phi)$ layers, $N=N(\Phi)$ neurons and $M=M(\Phi)$ nonzero weights, there is a standard neural network $\Phi_{\text{st}}$ with $d$ -dimensional input, and $L$ layers, at most $C_{1}LN$ neurons and $C_{2}\cdot(LN+M)$ nonzero weights such that

If $L=1$ , then there is nothing to show. For $L>1$ we start by defining the matrices

In the next remark we collect some properties of the class of functions that can be realized by a neural network.

If $\Phi$ is a neural network with $d$ -dimensional input and $m$ -dimensional output and $\varrho$ is the ReLU, then $R_{\varrho}(\Phi)$ is a Lipschitz-continuous, piecewise affine-linear function. This follows easily from Definition 2.2 and Lemma 2.11 which show that $R_{\varrho}(\Phi)$ can be expressed as the composition of Lipschitz-continuous, piecewise affine-linear transformations.

In [4, Theorem 2.1] also the converse was shown, i.e. every piecewise affine-linear function can be realized by a neural network with the ReLU as activation function.

Sobolev spaces

Spaces of functions that admit generalized derivatives fulfilling suitable integrability properties are a crucial concept in modern theory of partial differential equations (cf. e.g. ). In order to study properties of PDEs using functional analytic tools, a differential equation is reformulated via a differential operator mapping one function space to another. For a wide range of differential equations the appropriate spaces in this formulation are Sobolev spaces. A historical background of the development of Sobolev spaces in the context of PDEs can be found in [56, Lecture 1]. For a detailed treatment of the broad theory of Sobolev spaces we refer the reader to .

Furthermore, for $f\in{W^{n,p}(\Omega)}$ and $1\leq p<\infty$ , we define the norm

We write $\nabla f:=(D^{1}f,\ldots,D^{d}f)$ . Moreover, the space ${W^{n,p}(\Omega)}$ equipped with the norm $\lVert\cdot\rVert_{{W^{n,p}(\Omega)}}$ is a Banach space. For $f\in{W^{n,p}(\Omega)}$ we shall use the notation

if the domain is clear from the context, or if we want to emphasize the variable $x$ that the function $f$ depends on, respectively. If $n=0$ , then the Sobolev space is just a Lebesgue space, i.e. ${W^{0,p}(\Omega)}=L^{p}(\Omega)$ .

Many results about function spaces defined on a domain $\Omega$ require $\Omega$ to fulfill certain regularity conditions. Different geometrical conditions and the resulting properties have been intensively studied and can for example be found in . For our purposes it is enough to focus on the condition introduced in the next definition which can be found in [20, Appendix C.1].

after possibly relabeling and reorienting the coordinate axes.

In the sequel, we will only work with convex domains. Moreover, every open, bounded, and convex domain is a Lipschitz domain [23, Corollary 1.2.2.3].

Fractional Sobolev spaces play an important role in the analysis of partial differential equations. In particular, they characterize the regularity of functions from Sobolev spaces defined on a domain $\Omega$ restricted to the boundary $\partial\Omega$ of the domain where the restriction is realized by a so-called trace operator (see e.g. ). Moreover, a detailed description of various areas of further application is listed in . For a more in-depth analysis of these spaces the interested reader is referred to and .

Next, we define fractional-order spaces in terms of an intrinsic norm.

We set for $0<s<1$ and $1\leq p\leq\infty$

The space ${W^{s,p}(\Omega)}$ endowed with the norm $\lVert\cdot\rVert_{{W^{s,p}(\Omega)}}$ is a Banach space called Sobolev-Slobodeckij space.For $p=\infty$ the space coincides with the space of bounded $s$ -Hölder continuous functions. Precisely, each equivalence class contains a bounded $s$ -Hölder continuous representative.

Approximations with deep ReLU neural networks in Sobolev type norms

In this section, we derive upper and lower complexity bounds for approximations of functions from certain Sobolev spaces with deep ReLU networks. Our result is a generalization of [61, Theorem 1] (upper bounds) and [61, Theorem 4 a)] (lower bounds) to the case where the approximation error is measured in a Sobolev-type norm.

In [61, Theorem 1] it is shown that for $B=1$ and an arbitrary function $f\in\mathcal{F}_{n,d,\infty,B}$ a ReLU network can be constructed that realizes an approximation with $L^{\infty}$ -error at most $\varepsilon$ using $\mathcal{O}(\log_{2}(\nicefrac{{1}}{{\varepsilon}}))$ layers and $\mathcal{O}(\varepsilon^{-d/n}\log_{2}(\nicefrac{{1}}{{\varepsilon}}))$ nonzero weights and neurons.

We will show that the approximation can also be performed with respect to a continuous scale of higher order Sobolev-type norms for arbitrary $1\leq p\leq\infty$ and additionally, that there is a trade-off between the regularity used in the norm in which the approximation error is measured and the regularity used in the bounds. In particular, we will show the following theorem:

For any $\varepsilon\in\left(0,\nicefrac{{1}}{{2}}\right)$ , there is a neural network architecture ${\mathcal{A}}_{\varepsilon}={\mathcal{A}}_{\varepsilon}(d,n,p,B,s,\varepsilon)$ with $d$ -dimensional input and one-dimensional output such that for any $f\in\mathcal{F}_{n,d,p,B}$ there is a neural network $\Phi_{\varepsilon}^{f}$ that has architecture ${\mathcal{A}}_{\varepsilon}$ such that

$L({\mathcal{A}}_{\varepsilon})\leq c\cdot\log_{2}(\varepsilon^{-n/(n-s)})$ ;

$M({\mathcal{A}}_{\varepsilon})\leq c\cdot\varepsilon^{-d/(n-s)}\cdot\log_{2}(\varepsilon^{-n/(n-s)})$ ;

$N({\mathcal{A}}_{\varepsilon})\leq c\cdot\varepsilon^{-d/(n-s)}\cdot\log_{2}(\varepsilon^{-n/(n-s)})$ .

Clearly, if $s=0$ and $p=\infty$ , then Theorem 4.1 coincides with the theorem shown by Yarotsky in . Note that if $\Phi$ is a neural network and $\varrho$ is the ReLU, then the restriction of $R_{\varrho}(\Phi)$ to $\left(0,1\right)^{d}$ is bounded and Lipschitz continuous (cf. Remark 2.12) and hence in view of [10, Chapter 1.3] an element of the Sobolev space ${W^{1,\infty}(\left(0,1\right)^{d})}$ . Therefore, the expressions in the previous theorem are well-defined.

Theorem 4.1 holds for all activation functions that are in some sense similar to the ReLU. In particular, it follows directly from [61, Proposition 1] that the statement holds for any continuous piecewise linear activation function $\hat{\varrho}$ with $M$ breakpoints, where $1\leq M<\infty$ .

As a consequence of Theorem 4.1 and Lemma 2.11, we can easily derive a similar statement for standard neural networks.

For any $\varepsilon\in\left(0,\nicefrac{{1}}{{2}}\right)$ , there is a standard neural network architecture ${\mathcal{A}}_{\text{st},\varepsilon}={\mathcal{A}}_{\text{st},\varepsilon}(d,n,p,B,s,\varepsilon)$ with $d$ -dimensional input and one-dimensional output such that for any $f\in\mathcal{F}_{n,d,p,B}$ there is a standard neural network $\Phi_{\text{st},\varepsilon}$ that has architecture ${\mathcal{A}}_{\text{st},\varepsilon}$ such that

$L({\mathcal{A}}_{\text{st},\varepsilon})\leq c\cdot\log_{2}(\varepsilon^{-n/(n-s)})$ ;

$M({\mathcal{A}}_{\text{st},\varepsilon})\leq c\cdot\varepsilon^{-d/(n-s)}\cdot\log_{2}^{2}(\varepsilon^{-n/(n-s)})$ ;

$N({\mathcal{A}}_{\text{st},\varepsilon})\leq c\cdot\varepsilon^{-d/(n-s)}\cdot\log_{2}^{2}(\varepsilon^{-n/(n-s)})$ .

2 Lower complexity bounds

In this subsection, we show that the same regularity-complexity trade-off that can be observed for upper complexity bounds can be shown for lower bounds. In Theorem 4 a) in Yarotsky proves that a network architecture capable of approximating any function $f\in\mathcal{F}_{n,d,\infty,1}$ up to a $L^{\infty}$ -error $\varepsilon$ has at least $c\cdot\varepsilon^{-d/(2n)}$ weights. Here we consider approximations in $W^{1,\infty}$ norm. The following theorem combines the result from with our result.

If $\varepsilon\in\left(0,\nicefrac{{1}}{{2}}\right)$ and ${\mathcal{A}}_{\varepsilon}={\mathcal{A}}_{\varepsilon}(d,n,B,k,\varepsilon)$ is an architecture such that for any $f\in\mathcal{F}_{n,d,\infty,B}$ there is a neural network $\Phi_{\varepsilon}^{f}$ that has architecture ${\mathcal{A}}_{\varepsilon}$ and

then ${\mathcal{A}}_{\varepsilon}$ has at least $M({\mathcal{A}}_{\varepsilon})\geq c\varepsilon^{\nicefrac{{-d}}{{2(n-k)}}}$ weights.

In case $k=0$ , Theorem 4.3 coincides with [61, Theorem 4 a)].

Note that the case of standard neural network architectures is included in the theorem above and, thus, the same lower bounds hold true for this case.

Moreover, the proof can easily be adapted to piecewise polynomial activation functions (not necessarily continuous) with a finite number of breakpoints.

Discussion and future work

The motivation behind deriving theoretical results for deep neural networks is to improve our understanding of their empirical success in a broad area of applications. In this section, we briefly describe the practical relevance of our result, but mostly focus on aspects where our theoretical framework differs from empirical observations or fails to incorporate computational limitations met in implementations. Furthermore, we will discuss possible extensions of our results.

Practical relevance. Our results give upper and lower bounds for the complexity of neural network architectures used in Sobolev training (see Section 1.3 and ). Furthermore, we anticipate our results to lead to complexity bounds in the theoretical foundation of deep learning-based approaches for the numerical solution of PDEs. We leave an in-depth analysis of this topic for future research.

Curse of dimension. One of the main reasons for the current interest in deep neural networks is their outstanding performance in high-dimensional problems. As an example, consider the popular ILSVRTypically abbreviated with ILSRC which stands for ImageNet Large Scale Visual Recognition Challenge. challenge (see ), an image recognition task on the ImageNet database (see ) containing variable-resolution images which are typically downsampled before training to yield a roughly $(240\times 240\times 3)$ -dimensional input space (cf. e.g. ).

Our asymptotic upper complexity bounds for the number of weights and neurons are in $\mathcal{O}(\varepsilon^{-d/(n-s)})$ , and thus depend strongly on the dimension $d$ of the input space. Moreover, the constants in the upper bounds for the number of layers, weights, and neurons also increase exponentially with increasing $d$ . Even if there is still a gap between our upper and lower bounds for the weights of a network architecture, Theorem 4.3 shows that one cannot expect to circumvent the curse of dimension in the setting considered in this paper.

The power of depth. For $s=0$ , the approximations obtained in Theorem 4.1 and [61, Theorem 1] coincide. In this case, Yarotsky showed in that the constructed unbounded depth approximations for functions in ${W^{n,\infty}(\left(0,1\right)^{d})}$ (with $n>2$ ) are asymptotically more efficient in the number of weights and neurons than approximations with fixed length $L$ if

As a consequence, to be more efficient than a shallow network, i.e. a network with depth $L=2$ , one needs $n>2d$ regularity. Even if this result does not completely explain the success of deep networks over shallow ones, since $d$ is typically very large, it would be interesting to obtain similar results for higher-order Sobolev norms.

Unbounded complexity of weights. When training a neural network on a computer, the weights have to be stored in memory. In practice, storing weights up to an arbitrary precision or even unbounded weights (in absolute value) is infeasible. In the construction of the neural network $R_{\varrho}(\Phi^{f}_{\varepsilon})$ that approximates a function $f$ up to an $W^{s,\infty}$ -error $\varepsilon$ and realizes the upper complexity bound from Theorem 4.1 we used weights $W_{\varepsilon}$ with $\lvert W_{\varepsilon}\rvert\to\infty$ as $\varepsilon\downarrow 0$ (see construction of $\Phi_{m,l}$ in the proof of Lemma C.3 (v)).

In , neural networks are restricted to possess quantized weights (see [47, Definition 2.9]) which controls the complexity of the weights depending on the approximation error $\varepsilon$ . It would be interesting to extend Theorem 4.1 to networks with quantized weights and to analyze how coarse a quantization is possible to maintain the associated upper bounds.

Acknowledgements

I.G. would like to thank Mones Raslan for fruitful discussions on the topic and acknowledges support from the Research Training Group ”Differential Equation- and Data-driven Models in Life Sciences and Fluid Dynamics: An Interdisciplinary Research Training Group (DAEDALUS)” (GRK 2433) funded by the German Research Foundation (DFG). G. K. acknowledges partial support by the Bundesministerium für Bildung und Forschung (BMBF) through the Berliner Zentrum for Machine Learning (BZML), Project AP4, by the Deutsche Forschungsgemeinschaft (DFG) through grants CRC 1114 ”Scaling Cascades in Complex Systems”, Project B07, CRC/TR 109 ”Discretization in Geometry and Dynamics’, Projects C02 and C03, RTG DAEDALUS (RTG 2433), Projects P1 and P3, RTG BIOQIC (RTG 2260), Projects P4 and P9, and SPP 1798 ”Compressed Sensing in Information Processing”, Coordination Project and Project Massive MIMO-I/II, by the Berlin Mathematics Research Center MATH+ , Projects EF1-1 and EF1-4, and by the Einstein Foundation Berlin. P.P. is supported by a DFG Research Fellowship ”Shearlet-based energy functionals for anisotropic phase-field methods”.

Appendix A Interpolation spaces

In this section, we recall some essential notations and results from interpolation theory in Banach spaces. In particular, we present the $K$ -method of real interpolation.

For two Banach spaces $B_{0}$ and $B_{1}$ interpolation techniques allow the definition and study of a family of spaces that in some sense bridge the gap between $B_{0}$ and $B_{1}$ . This rather abstract concept will be applied to Sobolev spaces in Section 3.1 to derive spaces of fractional regularity. Interpolation theory is a valuable tool in harmonic analysis and the study of partial differential equations. A small summary about the history of the development of the theory of interpolation spaces can be found in [56, Lecture 21]. Interested readers are recommended to read the monographs and for a detailed treatment in the context of partial differential equations.

The basic notions of functional analysis are assumed to be known and we refer to for an introduction.

If $B_{0},B_{1}$ are two Banach spaces such that $B_{1}$ is continuously embedded in $B_{0}$ , then we write $B_{1}\subset B_{0}$ and call $(B_{0},B_{1})$ an interpolation couple. One can also consider more general pairs of Banach spaces but this setting is sufficiently general for our purposes.

Let $(B_{0},B_{1})$ be an interpolation couple. For every $u\in B_{1}$ and $t>0$ we define

The first term measures how well $u$ can be approximated by elements from the smaller space $B_{1}$ in $\lVert\cdot\rVert_{B_{0}}$ and the second term is a penalty term weighted by $t$ .

Let $(B_{0},B_{1})$ be an interpolation couple. Let $0<\theta<1$ and $1\leq p\leq\infty$ . We set

The norm $u\mapsto\lVert u\rVert_{\left(B_{0},B_{1}\right)_{\theta,p}}$ turns $\left(B_{0},B_{1}\right)_{\theta,p}$ into a Banach space (cf. [38, Proposition 1.5]) which is called a real interpolation space.

The next lemma shows that in the sense of a continuous embedding the spaces $B_{\theta,p}$ form a family of nested Banach spaces.

Let $1\leq p\leq\infty$ , then the following holds:

If $(B_{0},B_{1})$ is an interpolation couple and $0<\theta_{1}<\theta_{2}<1$ , then

if $B$ is a Banach space and $0<\theta<1$ , then $\left(B,B\right)_{\theta,p}=B$ .

For (i) see [38, p. 2] and for (ii) [38, Proposition 1.4]. ∎

One of the most important theorems in interpolation theory shows how the norm of an operator defined on interpolation couples relates to the operator norm with respect to the corresponding interpolation spaces. The theorem can be found in [38, Theorem 1.6].

Let $(A_{0},A_{1})$ and $(B_{0},B_{1})$ be two interpolation couples. If $T\in\mathcal{L}(A_{0},B_{0})\cap\mathcal{L}(A_{1},B_{1})$ , then $T\in\mathcal{L}(A_{\theta,p},B_{\theta,p})$ for every $0<\theta<1$ and $1\leq p\leq\infty$ . Furthermore,

As a corollary from the previous theorem the following useful result can be obtained (see [38, Corollary 1.7]).

Let $(B_{0},B_{1})$ be an interpolation couple. Moreover, let $0<\theta<1$ and $1\leq p\leq\infty$ , then there is a constant $c=c(\theta,p)>0$ such that for all $u\in B_{1}$ we have

For the case $p=\infty$ we have $c(\theta,\infty)=1$ .

Appendix B Sobolev spaces

As a generalization of Definition 3.1, we now introduce Sobolev spaces of vector-valued functions (see [48, Chapter 1.2.3]).

These spaces are again Banach spaces. To simplify the exposition, we introduce notation for a family of semi-norms.

For $m=1$ we only write $\lvert\cdot\rvert_{{W^{k,p}(\Omega)}}$ . Note that $\lvert\cdot\rvert_{{W^{0,p}(\Omega)}}$ coincides with $\lVert\cdot\rVert_{L^{p}(\Omega)}$ .

A proof can be found in [29, Theorem 4.1] together with [29, Remark 4.2].

To use the bounds for the gradient obtained in the previous theorem in the Sobolev semi-norm setting from Definition B.2, the following observation turns out to be useful. If $f\in W^{1,\infty}(\Omega)$ , then

The next lemma recalls a well known fact about the composition of Lipschitz functions and is not hard to see.

The following corollary establishes a chain rule estimate for $W^{1,\infty}$ .

It then follows from Theorem B.3 that $f_{j}$ is $L_{j}$ -Lipschitz and hence, $f$ is $\lvert L_{f}\rvert$ -Lipschitz. In the same manner we define

and have that $g$ is $L_{g}$ -Lipschitz. With Lemma B.4 we conclude that also the composition $g\circ f$ is $(\lvert L_{f}\rvert\cdot L_{g})$ -Lipschitz. Since $g$ is bounded and hence also $g\circ f$ applying Theorem B.3 again yields that $g\circ f\in{W^{1,\infty}(\Omega_{1})}$ and furthermore

B.2 Product estimate

In the following lemma, we show an estimate for the semi-norm of a product of weakly differentiable functions.

Let $f\in{W^{1,\infty}(\Omega)}$ and $g\in{W^{1,p}(\Omega)}$ with $1\leq p\leq\infty$ , then $fg\in{W^{1,p}(\Omega)}$ and there exists a constant $C=C(d,p)>0$ such that

Clearly, $fg,fD^{i}g+gD^{i}f\in L^{1}_{\text{loc}}(\Omega)$ so that the product formula in [21, Chapter 7.3] yields that for the weak derivatives of $fg$ it holds

where $C=C(d,p)>0$ is a constant. Furthermore, it is clear that also $fg\in L^{p}(\Omega)$ . ∎

B.3 Averaged Taylor polynomial

In this subsection, we develop a polynomial approximation in the spirit of Taylor polynomials but appropriate for Sobolev spaces. A polynomial approximation $P_{f}$ of a function $f\in\mathcal{F}_{n,d,p,B}$ is the first step towards an approximation of $f$ realized by a neural network in the proof of Theorem 4.1.

A reference for this entire subsection is [10, Chapter 4.1].

and $\phi$ is an arbitrary cut-off function supported in $\overline{B}$ , i.e.

A cut-off function as used in the previous definition always exists. A possible choice is

From the linearity of the weak derivative we can easily conclude that the averaged Taylor polynomial is linear in $f$ .

Recall that the averaged Taylor polynomial is defined via an integral and some cut-off function (cf. (B.3)) that perform an averaging of a polynomial expression (cf. (B.4)). Additionally, the following lemma shows that the averaged Taylor polynomial of order $n$ is indeed a polynomial of degree less than $n$ in $x$ .

Moreover, there exists a constant $c=c(n,d,R)>0$ such that the coefficients $c_{\alpha}$ are bounded with $\lvert c_{\alpha}\rvert\leq cr^{-d/p}\lVert f\rVert_{{W^{n-1,p}(\Omega)}}$ for all $\alpha$ with $\lvert\alpha\rvert\leq n-1$ .

in multi-index notation. Then, combining Equation (B.3) and (B.4) yields

Combining the last estimate with Equation (B.7) yields

where the second step follows from $\lVert\phi\rVert_{L^{\infty}}\leq cr^{-d}$ for some constant $c=c(d)>0$ (see [10, Section 4.1]). To estimate the absolute value of the coefficients $c_{\gamma}$ (defined in Equation B.6), we have

Here, the second step used Equation (B.8) together with Equation (B.5), and $c^{\prime}=c^{\prime}(n,d,R)>0$ is a constant. ∎

The next step is to derive approximation properties of the averaged Taylor polynomial. To this end, recall that for the (standard) Taylor expansion of some function $f$ defined on a domain $\Omega$ in $x_{0}$ to yield an approximation at some point $x_{0}+h$ the whole path $x_{0}+th$ for $0\leq t\leq 1$ has to be contained in $\Omega$ (see [40, Theorem 6.8.10]). In case of the averaged Taylor polynomial the expansion point $x_{0}$ is replaced by a ball $B$ and we require that the path between each $x_{0}\in B$ and each $x\in\Omega$ is contained in $\Omega$ . This geometrical condition is made precise in the following definition.

The next definition introduces a geometric notion which becomes important when given a family of subdivisions $\mathcal{T}^{h}$ , $0<h\leq 1$ of a domain $\Omega$ which becomes finer for smaller $h$ . One typically needs to control the nondegeneracy of $(\mathcal{T}^{h})_{h}$ which can be done e.g. with a uniformly bounded chunkiness parameter.

If $\mathcal{R}\neq\varnothing$ , then we define

To emphasize the dependence on the set $\Omega$ , we will occasionally write $r^{\star}_{\text{max}}(\Omega)$ and $\gamma(\Omega)$ .

Finally, the next lemma shows approximation properties of the averaged Taylor polynomial.

where $Q^{n}f$ denotes the Taylor polynomial of order $n$ of $f$ averaged over $B$ and $h=\operatorname{diam}(\Omega)$ .

A proof can be found in [10, Lemma 4.3.8].

B.4 Fractional Sobolev spaces

In this subsection, we derive a generalization of Sobolev spaces characterized by fractional-order derivatives using two different approaches. First, we interpolate integer-valued Sobolev spaces, and secondly, we define an intrinsic norm. Both approaches are then shown to be equivalent under some regularity condition on the domain $\Omega$ .

We start by defining fractional-order Sobolev spaces $W^{s,p}$ for $0<s<1$ via Banach space interpolation (see Section A).

Let $0<s<1$ and $1\leq p\leq\infty$ , then we set

The next theorem shows that, given suitable regularity conditions of the domain $\Omega$ , Definition B.13 and Definition 3.3 yield the same spaces with equivalent norms. The theorem can be found in [10, Theorem 14.2.3] for $1\leq p<\infty$ . The case $p=\infty$ is not included in [10, Theorem 14.2.3] but can be shown with the same technique.

with equivalence of the respective norms.

We make use of the norm equivalence by only using the $W^{s,p}$ norm in the following section and transferring results obtained by interpolation theory for the $\widetilde{W}^{s,p}$ norm to the $W^{s,p}$ norm without mentioning it again.

Appendix C Upper bounds for approximations

The strategy of the proof of Theorem 4.1 follows the idea of the proof of Theorem 1 in : A polynomial approximation $P_{f}$ of a function $f\in\mathcal{F}_{n,d,p,B}$ is approximated by a function realized by a neural network $\Phi_{P_{f}}$ . As in , we start by constructing a neural network that approximates the square function on the interval $\left(0,1\right)$ up to an approximation error at most $\varepsilon$ . In our result the error is measured in the $W^{1,\infty}$ norm (Proposition C.1). This result is then used to obtain an approximation of a multiplication operator (Proposition C.2). Next, we derive a partition of unity (Lemma C.3) as a product of univariate functions that can be realized by neural networks. Using this construction we then build a localized Taylor approximation for a function $f\in\mathcal{F}_{n,d,p,B}$ in the $L^{p}$ and $W^{1,p}$ norm. An interpolation argument shows that same construction can be used for an approximation in the $W^{s,p}$ norm, where $0\leq s\leq 1$ (Lemma C.4). In the next step, we lay the foundation for approximating products of the partition of unity and monomials, i.e. localized monomials, with ReLU networks (Lemma C.5) in the $W^{k,\infty}$ norm for $k=0,1$ . Via an embedding this result this result is then used to approximate the localized polynomials in $W^{k,p}$ with ReLU networks (Lemma C.6).

All Sobolev spaces encountered in the following proofs are defined on open, bounded, and convex domains which are, in particular, Lipschitz-domains (see [23, Corollary 1.2.2.3]).

The results presented in the following two propositions can be found in a similar way in [51, Proposition 3.1]. However, since our results contain some minor extensions we decided to give the proof here for the sake of completeness. In detail, Proposition C.2 (ii) and (iii) are not shown in [51, Proposition 3.1].

In [61, Proposition 2], a network is constructed that approximates the square function $x\to x^{2}$ on the interval $\left(0,1\right)$ in the $L^{\infty}$ norm. Interestingly, the same construction can be used when measuring the approximation error in the $W^{1,\infty}$ norm. In particular, the depth and the number of weights and neurons of the network do not grow asymptotically faster to satisfy the approximation accuracy with respect to this stronger norm.

There exist constants $c_{1},c_{2},c_{3},c_{4}>0$ , such that for all $\varepsilon\in\left(0,\nicefrac{{1}}{{2}}\right)$ there is a neural network $\Phi^{\text{sq}}_{\varepsilon}$ with at most $c_{1}\cdot\log_{2}(\nicefrac{{1}}{{\varepsilon}})$ nonzero weights, at most $c_{2}\cdot\log_{2}(\nicefrac{{1}}{{\varepsilon}})$ layers, at most $c_{3}\cdot\log_{2}(\nicefrac{{1}}{{\varepsilon}})$ neurons, and with one-dimensional input and output such that

and $R_{\varrho}(\Phi^{\text{sq}}_{\varepsilon})(0)=0$ . Furthermore, it holds that

Thus, $R_{\varrho}(\Phi_{m})$ is a piecewise linear interpolant of $f$ with $2^{m}+1$ uniformly distributed breakpoints $\frac{k}{2^{m}},k=0,\ldots,2^{m}$ . In particular, $R_{\varrho}(\Phi_{m})(0)=0$ . Furthermore, it is shown in the proof of [61, Proposition 2] that

We will now show that the approximation error of the derivative can be bounded in a similar way. In particular, we show the estimate

From Equation (C.3) we get for all $k=0,\ldots,2^{m}-1$

Combining Equation (C.4) and (C.5) yields

Clearly, the weak derivative of $\Phi_{m}$ is a piecewise constant function, which assumes its maximum on the last piece. Hence,

Let now $\varepsilon\in\left(0,\nicefrac{{1}}{{2}}\right)$ and choose $m=\lceil\log_{2}(\nicefrac{{1}}{{\varepsilon}})\rceil$ . Now, $\Phi^{\text{sq}}_{\varepsilon}:=\Phi_{m}$ satisfies the approximation bound in Equation (C.1) and $R_{\varrho}(\Phi^{\text{sq}}_{\varepsilon})(0)=0$ . The estimate (C.2) holds because of Equation (C.6). The number of weights can be bounded by

In the same way, the number of neurons and layers can be bounded. This concludes the proof.

As in , we will now use Proposition C.1 and the polarization identity

to define an approximate multiplication where the approximation error is again (and in contrast to ) measured in the $W^{1,\infty}$ norm.

For any $M\geq 1$ , there exist constants $c,c_{1}>0$ and $c_{2}=c_{2}(M)>0$ such that for any $\varepsilon\in\left(0,\nicefrac{{1}}{{2}}\right)$ , there is a neural network $\widetilde{\times}$ with two-dimensional input and one-dimensional output that satisfies the following properties:

$\lVert R_{\varrho}(\widetilde{\times})(x,y)-xy\rVert_{W^{1,\infty}(\left(-M,M\right)^{2};dxdy)}\leq\varepsilon$ ;

if $x=0$ or $y=0$ , then $R_{\varrho}(\widetilde{\times})\left(x,y\right)=0$ ;

$\lvert R_{\varrho}(\widetilde{\times}_{\varepsilon})\rvert_{W^{1,\infty}(\left(-M,M\right)^{2})}\leq cM$ ;

the depth and the number of weights and neurons in $\widetilde{\times}$ is at most $c_{1}\log_{2}(1/\varepsilon)+c_{2}$ .

Set $\delta:=\varepsilon/(6M^{2}C)$ , where $C$ is the constant from Corollary B.5 for $n=2$ and $m=1$ , and let $\Phi^{\text{sq}}_{\delta}$ be the approximate squaring network from Proposition C.1 such that

As in the proof of [61, Proposition 3], we use the fact that $\lvert x\rvert=\varrho(x)+\varrho(-x)$ to see that a network $\widetilde{\times}$ can be constructed with two-dimensional input and one-dimensional output that satisfies

As a consequence of Proposition C.1, there exists a constant $c_{0}>0$ such that $\widetilde{\times}$ has at most $c_{0}\log_{2}(\nicefrac{{1}}{{\varepsilon}})+c_{0}\log_{2}(6M^{2})+3\leq c_{0}\log_{2}(\nicefrac{{1}}{{\varepsilon}})+c_{1}$ layers, $3c_{0}\log_{2}(\nicefrac{{1}}{{\varepsilon}})+3c_{0}\log_{2}(6M^{2})+9\leq c^{\prime}\log_{2}(\nicefrac{{1}}{{\varepsilon}})+c_{2}$ neurons and $3c_{0}\log_{2}(\nicefrac{{1}}{{\varepsilon}})+3c_{0}\log_{2}(6M^{2})+17\leq c^{\prime\prime}\log_{2}(\nicefrac{{1}}{{\varepsilon}})+c_{3}$ nonzero weights. Here $c^{\prime},c^{\prime\prime}>0$ and $c_{1}=c_{1}(M),c_{2}=c_{2}(M),c_{3}=c_{3}(M)>0$ are suitable constants and (iv) is hence satisfied.

Since $R_{\varrho}(\Phi^{\text{sq}}_{\delta})(0)=0$ , (ii) easily follows.

Using the polarization identity (C.7), we can write

To keep the following calculations simple, we introduce some notation. In particular, we define

Note that for the inner functions in the compositions in Equation (C.11), it holds that

Hence, to finally prove (i), we apply Corollary B.5 to (C.11) and get

To show (iii) we use the chain rule estimate from Corollary B.5 and get

where we used Equation (C.12) in the third step and $C^{\prime}$ is the absolute constant from Equation (C.2) in Proposition C.1. ∎

Next, we construct a partition of unity (in the same way as in [61, Theorem 1]) that can be defined as a product of piecewise linear functions, such that each factor of the product can be realized by a neural network. We will use this product structure in Lemma C.6 together with a generalized version of the approximate multiplication from Proposition C.2 to approximate localized polynomials with ReLU networks.

$\sum_{\phi_{m}\in\Psi}\phi_{m}(x)=1$ for every $x\in^{d}$ ;

there exist absolute constants $C,c\geq 1$ such that for each $\phi_{m}\in\Psi$ there is a neural network $\Phi_{m}$ with $d$ -dimensional input and $d$ -dimensional output, with at most three layers, $Cd$ nonzero weights and neurons, that satisfies

and $\lVert[R_{\varrho}(\Phi_{m})]_{l}\rVert_{{W^{k,\infty}(\left(0,1\right)^{d})}}\leq(cN)^{k}$ for all $l=1,\ldots,d$ and $k\in\{0,1\}$ .

for $m=(m_{1},\ldots,m_{d})\in\{0,\ldots,N\}^{d}$ . Then, (i),(ii) and (iii) follow easily from the definition.

It follows that $\lvert\phi_{m}\rvert_{W^{1,\infty}}\leq 3N$ .

To show (v), we start by constructing a network $\Phi_{\psi}$ that realizes the function $\psi$ . For this we set

and $\Phi_{\psi}:=((A_{1},b_{1}),(A_{2},b_{2}))$ . Then $\Phi_{\psi}$ is a two-layer network with one-dimensional input and one-dimensional output, with $12$ nonzero weights and $6$ neurons such that

which is a three-layer network with $d$ -dimensional input and $d$ -dimensional output, with at most $d\cdot 2\cdot(12+2)=d\cdot 28$ nonzero weights and at most $9d$ neurons, and

The following lemma uses the partition of unity from Lemma C.3 and the Bramble-Hilbert Lemma B.12 in a classical way to derive an approximation with localized polynomials in the $L^{p}$ norm and the $W^{1,p}$ norm. Using an interpolation argument, we can generalize this result to the case where the approximation is performed with respect to the $W^{s,\infty}$ norm, where $0\leq s\leq 1$ .

Let $0\leq s\leq 1$ and set $f_{N}:=\sum_{m\in\{0,\ldots,N\}^{d}}\phi_{m}p_{f,m}$ , then the operator $T_{s}:{W^{n,p}(\left(0,1\right)^{d})}\to{W^{s,p}(\left(0,1\right)^{d})}$ with $T_{s}f=f-f_{N}$ is linear and bounded with

Furthermore, there is a constant $c=c(d,n)>0$ such that for any $f\in{W^{n,p}(\left(0,1\right)^{d})}$ the coefficients of the polynomials $p_{f,m}$ satisfy

The idea of the proof is similar to the first part of the proof of [61, Theorem 1]. We use approximation properties of averaged Taylor polynomials (see Bramble-Hilbert Lemma B.12) to derive local estimates and then combine them using a partition of unity to obtain a global estimate. In order to use this strategy also near the boundary, we make use of an extension operator.

where $C_{E}=C_{E}(n,p,d)$ is the norm of the extension operator.

Step 1 (Averaged Taylor polynomials): For each $m\in\{0,\ldots,N\}^{d}$ we set

for $m\in\{0,\ldots,N\}^{d}$ , where $c^{\prime\prime}=c^{\prime\prime}(n,d,p)>0$ is a suitable constant. It now suffices to show (i) and (ii).

Step 2 (Local estimates in $\lVert\cdot\rVert_{W^{k,p}},k\in\{0,1\}$ ): To check that the conditions of the Bramble-Hilbert Lemma B.12 are fulfilled, note that $B_{m,N}\subset\subset\Omega_{m,N}$ . Furthermore, $B_{m,N}$ is a ball in $\Omega_{m,N}$ such that $\Omega_{m,N}$ is star-shaped with respect to $B_{m,N}$ . We have $\operatorname{diam}_{\lvert\cdot\rvert}(\Omega_{m,N})=(2\sqrt{d})/N$ and $r^{\star}_{\text{max}}(\Omega_{m,N})=1/N$ and, thus,

Finally, we have for the chunkiness parameter of $\Omega_{m,N}$

Applying the Bramble-Hilbert Lemma B.12 yields for each $m\in\{0,\ldots,N\}^{d}$ the local estimate

Here, $C_{1}=C_{1}(n,d)>0$ is the constant from Lemma B.12 which only depends on $n$ and $d$ , since the chunkiness parameter of $\Omega_{m,N}$ is a constant depending only on $d$ (see (C.15)) and $C_{2}=C_{2}(n,d)>0$ . In the same way, we get

where $C_{3}=C_{3}(n,d)>0$ is a suitable constant.

The first step towards a global estimate is now to combine Equation (C.16) and (C.17) with the cut-off functions from the partition of unity. We have

Furthermore, using the product inequality for weak derivatives from Lemma B.6 we get that there is a constant $C^{\prime}=C^{\prime}(d,p)>0$ such that

for some constant $C_{5}=C_{5}(n,d,p)>0$ .

Step 3 (Global estimate in $\lVert\cdot\rVert_{W^{k,p}},k\in\{0,1\}$ ): To derive the global estimate, we start by noting that with property (ii) from Lemma C.3 we have

where the last step follows from $\left(0,1\right)^{d}\subset\bigcup_{\widetilde{m}\in\{0,\ldots,N\}^{d}}\Omega_{\widetilde{m},N}$ . Now we obtain for each $\widetilde{m}\in\{0,\ldots,N\}^{d}$

where the triangle inequality together with the support property (iii) from Lemma C.3 is used in the first step. The second step follows again from Lemma C.3 (iii) and the third step follows from (C.18) for $k=0$ and from (C.20) for $k=1$ . Here $C_{6}=C_{6}(n,d,p)>0$ can be chosen independent of $k$ (e.g. $C_{6}:=\max\{C_{2},C_{5}\}$ ).

Finally, the boundedness claim in (i) and (ii) follows from using the definition of $f_{N}$ and combining Equation (C.22) with Equation (C.23):

where the last two steps follow from the definition of $\Omega_{\widetilde{m},N}$ . Thus, we have

for $k\in\{0,1\}$ , where Equation (C.14) was used in the first and second step. Here $C_{7}=C_{7}(n,d,p)>0$ and $C_{8}=C_{8}(n,d,p)$ are constants. The linearity of $T_{k}$ , $k\in\{0,1\}$ is a consequence of the linearity of the averaged Taylor polynomial (cf. Remark B.8).

Step 4 (Interpolation): For $0<s<1$ we use a Banach space interpolation argument. Set

Then, we can apply Theorem A.4 to the operator $T$ with $Tf:=f-f_{N}$ and get for a constant $C=C(n,d,p)>0$ that

where we used Lemma A.3 (ii) to see that $(A_{0},A_{1})_{s,p}={W^{n,p}(\left(0,1\right)^{d})}$ . ∎

The following technical lemma lays the foundation for approximating localized monomials with neural networks. Using the notation and statement (v) from Lemma C.3, a localized monomial $\phi_{m}x^{\alpha}$ can be expressed as the product of the output components of a network $\Phi_{(m,\alpha)}$ with a suitable output dimension $n$ , i.e.

Given a network $\Phi$ with $n$ -dimensional output, we construct a network $\Psi_{\Phi}$ that approximates the product of the output components of $\Phi$ .

For any $\varepsilon\in\left(0,\nicefrac{{1}}{{2}}\right)$ , and any neural network $\Phi$ with $d$ -dimensional input and $n$ -dimensional output where $n\leq m$ , and with number of layers, neurons and weights all bounded by $K$ , such that

there exists a neural network $\Psi_{\varepsilon,\Phi}$ with $d$ -dimensional input and one-dimensional output, and with number of layers, neurons and weights all bounded by $KC\log_{2}(\nicefrac{{1}}{{\varepsilon}})$ , such that

for $k\in\{0,1\}$ and some constant $c=c(d,m,k)$ . Moreover,

If $m=1$ , then we can choose $\Psi_{\varepsilon,\Phi}=\Phi$ for any $\varepsilon\in\left(0,\nicefrac{{1}}{{2}}\right)$ and the claim holds.

Case 1: If $n\leq m$ , then we use the induction hypothesis and get that there is a constant $C_{0}=C_{0}(m)>0$ and a neural network $\Psi_{\varepsilon,\Phi}$ with $d$ -dimensional input and one-dimensional output, and at most $KC_{0}\log_{2}(\nicefrac{{1}}{{\varepsilon}})$ layers, neurons and weights such that

for $k\in\{0,1\}$ and $c_{1}=c_{1}(d,m)$ . Moreover,

for any $x\in\left(0,1\right)^{d}$ . Furthermore, we have $\lvert R_{\varrho}(\Psi_{\varepsilon,\Phi})\rvert_{{W^{1,\infty}(\left(0,1\right)^{d})}}\leq C_{1}N$ , for $C_{1}=C_{1}(d,m)$ .

We denote by $\Phi_{m}$ the neural network with $d$ -dimensional input and $m$ -dimensional output which results from $\Phi$ by removing the last output neuron and corresponding weights. In detail, we write

Using the induction hypothesis and the constants $C_{0},c_{1}$ and $C_{1}$ from Case 1, we get that there is a neural network $\Psi_{\varepsilon,\Phi_{m}}=((A^{\prime}_{1},b^{\prime}_{1}),(A^{\prime}_{2},b^{\prime}_{2}),\dots,(A^{\prime}_{L^{\prime}},b^{\prime}_{L^{\prime}}))$ with $d$ -dimensional input and one-dimensional output, and at most $KC_{0}\log_{2}(\nicefrac{{1}}{{\varepsilon}})$ layers, neurons and weights such that

for any $x\in\left(0,1\right)^{d}$ . Furthermore, we can assume that $\lvert R_{\varrho}(\Psi_{\varepsilon,\Phi_{m}})\rvert_{{W^{1,\infty}(\left(0,1\right)^{d})}}\leq C_{1}N$ , and that the first $L(\Phi)-1$ layers of $\Psi_{\varepsilon,\Phi_{m}}$ and $\Phi_{m}$ coincide and, thus, also the first $L(\Phi)-1$ layers of $\Psi_{\varepsilon,\Phi_{m}}$ and $\Phi$ , i.e. $A_{l}=A^{\prime}_{l}$ for $l=1,\ldots,L(\Phi)-1$ .

Now, we add the formerly removed neuron with corresponding weights back to the last layer of $\Psi_{\varepsilon,\Phi_{m}}$ . For the resulting network

it holds that the first $L-1$ layers of $\widetilde{\Psi}_{\varepsilon,\Phi}$ and $\Phi$ coincide, and $\widetilde{\Psi}_{\varepsilon,\Phi}$ is a neural network with two-dimensional output. Note that

where we used Equation (C.26) for $k=0$ . Additionally, we have

Now, we denote by $\widetilde{\times}$ the network from Proposition C.2 with $M=m+1$ and accuracy $\varepsilon$ and define

Consequently, $\Psi_{\varepsilon,\Phi}$ has $d$ -dimensional input, one-dimensional output and, combining the induction hypothesis with statement (iv) of Proposition C.2 and Remark 2.8, at most

layers, number of neurons and weights. Here $c^{\prime}$ and $c^{\prime\prime}=c^{\prime\prime}(m+1)$ are the constants from Proposition C.2 (iv) for the choice $M=m+1$ and $C=C(m+1)>0$ is a suitable constant. Clearly, the first $L-1$ layers of $\Psi_{\varepsilon,\Phi}$ and $\Phi$ coincide and for the approximation properties it holds that

for $k\in\{0,1\}$ . We continue by considering the first term of the Inequality (C.28) for $k=0$ and obtain

where we used Proposition C.2 (i) for the last step. Next, we consider the same term for $k=1$ and apply the chain rule from Corollary B.5. For this, let $\hat{C}=\hat{C}(d)$ be the constant from Corollary B.5 (for $n=d$ and $m=2$ ). We get

where we used the induction hypothesis together with $\lvert[R_{\varrho}(\Phi)]_{m+1}\rvert_{{W^{1,\infty}(\left(0,1\right)^{d})}}\leq N$ in the third step, and $C_{1}^{\prime}=C_{1}^{\prime}(d,m+1)>0$ is a suitable constant.

To estimate the second term of (C.28) for $k=0$ we use the induction hypothesis (for $k=0$ ) and get

For $k=1$ we apply the product rule from Lemma B.6 together with $\lVert[R_{\varrho}(\Phi)]_{m+1}\rVert_{{L^{\infty}}}\leq 1$ and get

where we used the induction hypothesis for $k=1$ , and $c_{1}^{\prime}=c_{1}^{\prime}(d,m+1)>0$ .

Combining (C.28) with (C.29) and (C.31) we have

and in the same way a combination of (C.28) with (C.30) and (C.32) yields

where $c_{1}^{\prime\prime}=c_{1}^{\prime\prime}(d,m+1)>0$ . Putting together the two previous estimates yields

for a suitable constant $c_{1}^{\prime\prime\prime}=c_{1}^{\prime\prime\prime}(d,m+1)>0$ .

We now show Equation (C.25) for $m+1$ . To this end, assume that $[R_{\varrho}(\Phi)]_{l}(x)=0$ for some $l\in\{1,\ldots,m+1\}$ and $x\in\left(0,1\right)^{d}$ . If $l\leq m$ , then Equation (C.27) implies that

Hence, by application of Proposition C.2, we have

Finally, we need to show that there is a constant $C_{1}^{\prime\prime}=C_{1}^{\prime\prime}(d,m+1)>0$ such that

Similarly as in (C.30) we have for a constant $C_{1}^{\prime\prime}=C_{1}^{\prime\prime}(d,m+1)>0$ that

where Corollary B.5 was used for the second step and $\hat{c}$ is the constant from Proposition C.2 (iii) which together with an argument as in (C.30) implies the third step.

Taking the maximum of the each pair of constants derived in Case 1 and Case 2 concludes the proof. ∎

In the next lemma, we approximate a sum of localized polynomials with a neural network. One of the difficulties is to control the derivative of the localizing functions from the partition of unity.

For any $\varepsilon\in\left(0,\nicefrac{{1}}{{2}}\right)$ there is a neural network architecture ${\mathcal{A}}_{\varepsilon}={\mathcal{A}}_{\varepsilon}(d,n,N,\varepsilon)$ with $d$ -dimensional input and one-dimensional output, with at most $C_{2}\cdot\log_{2}(\nicefrac{{1}}{{\varepsilon}})$ layers and $C_{3}(N+1)^{d}\log_{2}(\nicefrac{{1}}{{\varepsilon}})$ weights and neurons, such that the following holds: Let $f\in{W^{n,p}(\left(0,1\right)^{d})}$ and $p_{m}(x):=p_{f,m}(x)=\sum_{\lvert\alpha\rvert\leq n-1}c_{m,\alpha}x^{\alpha}$ for $m\in\{0,\ldots,N\}^{d}$ be the polynomials from Lemma C.4, then there is a neural network $\Phi_{P,\varepsilon}$ that has architecture ${\mathcal{A}}_{\varepsilon}$ such that

Step 1 (Approximating localized monomials $\phi_{m}x^{\alpha}$ ): Let $|\alpha|\leq n-1$ and $m\in\{0,\dots,N\}^{d}$ . It is easy to see that there is a neural network $\Phi_{\alpha}$ with $d$ -dimensional input and $\lvert\alpha\rvert$ -dimensional output, with one layer and at most $n-1$ weights and $n-1+d$ neurons such that

Let now $\Phi_{m}$ be the neural network and $C,c\geq 1$ the constants from Lemma C.3 (v) and define the network

for a constant $c^{\prime}=c^{\prime}(n,d,k)>0$ and $k\in\{0,1\}$ , and

Step 2 (Constructing an architecture capable of approximating sums of localized polynomials): We set

Then, there are constants $C_{2}=C_{2}(n,d),C_{3}=C_{3}(n,d)>0$ such that $\Phi_{P,\varepsilon}$ is a neural network with $d$ -dimensional input and one-dimensional output, with at most $1+C_{1}\log_{2}(\nicefrac{{1}}{{\varepsilon}})\leq C_{2}\log_{2}(\nicefrac{{1}}{{\varepsilon}})$ layers,

Note that the network $\Phi_{P,\varepsilon}$ only depends on $p_{f,m}$ (and thus on $f$ ) via the coefficients $c_{m,\alpha}$ . Now, it is easy to see that there exists a neural network architecture ${\mathcal{A}}_{\varepsilon}={\mathcal{A}}_{\varepsilon}(d,n,N,\varepsilon)$ with $L({\mathcal{A}}_{\varepsilon})\leq C_{2}\log_{2}(\nicefrac{{1}}{{\varepsilon}})$ layers and number of neurons and weights bounded by $C_{3}(N+1)^{d}\log_{2}(\nicefrac{{1}}{{\varepsilon}})$ such that $\Phi_{P,\varepsilon}$ has architecture ${\mathcal{A}}_{\varepsilon}$ for every of choice of coefficients $c_{m,\alpha}$ , and hence for every choice of $f$ .

Step 3 (Estimating the approximation error in $\lVert\cdot\rVert_{W^{k,p}},k\in\{0,1\}$ ): For each $m\in\{0,\ldots,N\}^{d}$ we set

where the last step is a consequence of $\left(0,1\right)^{d}\subset\bigcup_{\widetilde{m}\in\{0,\ldots,N\}^{d}}\Omega_{\widetilde{m},N}$ . For $\widetilde{m}\in\{0,\ldots,N\}^{d}$ we have

where we used the triangle inequality in the first step and Lemma C.4 in the second step. Here $c_{1}=c_{1}(n,d)>0$ is a constant. Next, note that

where $\lambda$ denotes the Lebesgue measure and $c_{3}=c_{3}(d,p)>0$ is a constant. Combining Equation (C.39) with the last estimate yields

Combining Equation (C.40) with Equation (C.41) and plugging the result in Equation (C.38) finally yields

where the last step is the same as Step 3 of the proof of Lemma C.4 and $c_{5}=c_{5}(n,d,p,k)>0$ is a constant. Hence the case $s=0$ and $s=1$ is proven.

Step 4 (Interpolation): To show the general statement for $0\leq s\leq 1$ we use the interpolation inequality from Corollary A.5 together with Equation (C.42) and directly get

for a constant $c_{6}=c_{6}(n,p,d,s)>0$ . This concludes the proof of the lemma. ∎

Finally, we are ready to proof the upper complexity bounds.

The proof can be divided into two steps: First, we approximate the function $f$ by a sum of localized polynomials and then approximate this sum by a network.

where $C=C(n,d,p)>0$ is the constant from Corollary C.4. Without loss of generality we may assume that $CB\geq 1$ . The same corollary yields that if $\Psi=\Psi(d,N)=\left\{\phi_{m}:m\in\{0,\ldots,N\}^{d}\right\}$ is the partition of unity from Lemma C.3, then there exist polynomials $p_{m}(x)=\sum_{\lvert\alpha\rvert\leq n-1}c_{m,\alpha}x^{\alpha}$ for $m\in\{0,\ldots,N\}^{d}$ such that

For the second step, let $C_{1}=C_{1}(n,d,p,s)>0$ , $C_{2}=C_{2}(n,d)>0$ and $C_{3}=C_{3}(n,d)>0$ be the constants from Lemma C.6 and $\Phi_{P,\varepsilon}$ be the neural network provided by Lemma C.6 with $\varepsilon^{n/(n-s)}/(8CB^{2}C_{1})$ instead of $\varepsilon$ . Then $\Phi_{P,\varepsilon}$ has at most

layers for a constant $C^{\prime}=C^{\prime}(n,d,p,B,s)>0$ and at most

nonzero weights and neurons. Here $C^{\prime\prime}=C^{\prime\prime}(n,d,p,B,s)$ is a suitable constant and we used $(2CB)/\varepsilon\geq 1$ in the first step. It holds that the architecture of $\Phi_{P,\varepsilon}$ is independent of the function $f$ . Furthermore, we have

where we used the inequality $(x+y)^{\sigma}\leq x^{\sigma}+y^{\sigma}$ for $x,y\geq 0$ and $0\leq\sigma\leq 1$ in the last step. We now continue the above computation using that $s/(n-s)\leq 1$ and $2CB\geq 1$ and therefore $(2CB)^{s/(n-s)}\leq 2CB$ and get

Combining the previous computations we get

Using the triangle inequality and Equations (C.44) and (C.45) we finally obtain

Appendix D Lower bounds for approximations

Lower complexity bounds for approximations in $L^{\infty}$ norm, which correspond to the case $k=0$ in Theorem 4.3, have already been shown in Theorem 4 a) in . For lower complexity bounds of $W^{1,\infty}$ approximations we modify the proof strategy outlined in [61, Theorem 4 a)].

We start by showing an auxiliary result, that is used in the proof of Proposition D.2.

Now, set $x_{n}:=x+(1/n)\nu$ . By the pigeonhole principle, there exists $q\in\{1,\ldots,k\}$ , such that $\overline{V_{q}}$ contains infinitely many $x_{n}$ . It is not hard to see that if a closed polyhedron contains a converging sequence on a line, then it also contains a small section of the line including the limit point of the sequence. Thus, there exists $\delta>0$ such that $\{x+\lambda\delta\nu:0\leq\lambda\leq 1\}\subset\overline{V_{q}}\cap\left(0,1\right)^{d}\subset\overline{V_{q}\cap\left(0,1\right)^{d}}$ . Then, setting $T:=V_{q}\cap\left(0,1\right)^{d}$ shows the claim. ∎

As in , we make use of a combinatorial quantity measuring the expressiveness of a set of binary valued functions $H$ defined on some set $X$ , called VC-dimension (see e.g. [3, Chapter 3.3]). We define

On the other hand, [3, Theorem 8.4] yields an upper bound of the VC-dimension of $H$ in terms of the number of computations and the dimension of the parameterization of $H$ which can be expressed as a function of $M({\mathcal{A}}_{\varepsilon})$ (Claim 2). Together this gives the desired relation.

Step 1 (Construction of $h$ ): Let now $0<\varepsilon<c_{1}N^{-(n-1)}/(3\sqrt{d})$ for some constant $c_{1}=c_{1}(n)>0$ to be chosen later, and ${\mathcal{A}}_{\varepsilon}={\mathcal{A}}_{\varepsilon}(d,n,\varepsilon)$ be a neural network architecture as in the claim of the proposition.

To this end, we first define a function $f_{y}\in\mathcal{F}_{n,d,\infty,B}$ with $f_{y}(x_{m})=y_{m}\cdot a$ for some constant $a>0$ , and then make use of a neural network $\Phi^{f_{y}}$ that approximates $f_{y}$ .

and, in particular, if $\lvert x\rvert=1/(4N)$ , then

Here, we defined the constant $c_{1}=c_{1}(n,B)>0$ which was left unspecified in the beginning of the proof by $c_{1}:=\frac{B\phi(1/4)}{4\lVert\psi\rVert_{{W^{n,\infty}(\left(0,1\right)^{d})}}}$ .

where we used Equation (D.5) together with Equation (D.4) and $\lvert\nu(x_{m})\rvert=1$ for the first step and Equation (D.2) together with the upper bound for $\varepsilon$ for the second step.

In a similar way, we get for the case $y_{m}=0$ that

Finally, combining Equation (D.6) and Equation (D.7) reads as

Claim 2 ( $\operatorname{VCdim}(H)\leq C\cdot M({\mathcal{A}}_{\varepsilon})^{2}$ ): We start by showing that there exists a constant $C^{\prime}=C^{\prime}(d)$ such that $h((w,\delta),x)$ can be computed using $C^{\prime}\cdot M({\mathcal{A}}_{\varepsilon})$ operations of the following types

the arithmetic operations $+,-,\times$ , and $/$ on real numbers,

jumps conditioned on $>,\geq,<,\leq,=$ , and $\neq$ comparisons of real numbers

Note that the number of neurons that are needed for the computation of $R_{\varrho}({\mathcal{A}}_{\varepsilon}(w))$ can be bounded by $M({\mathcal{A}}_{\varepsilon})$ . Thus, $R_{\varrho}({\mathcal{A}}_{\varepsilon}(w))$ can be computed using at most $C_{2}\cdot M({\mathcal{A}}_{\varepsilon})$ operations where $C_{2}>0$ is an absolute constant. Hence, there exists a constant $C_{3}=C_{3}(d)$ such that for the number of operations of the specified type $t$ needed for the computation of $h(x)$ where $x\in^{d}$ it holds that $t\leq C_{3}M({\mathcal{A}}_{\varepsilon})$ .

where $C_{4}=C_{4}(d)>0$ is a suitable constant. ∎

The proof of the lower complexity bounds is now a simple consequence of Proposition D.2.

The case $k=0$ corresponds to [61, Theorem 4 a)].

For the case $k=1$ , let $c=c(n,B)$ and $C=C(d)$ be the constants from Proposition D.2 and set

Then, $N\leq(c/\varepsilon)^{1/(n-1)}$ and, thus, $0<\varepsilon\leq cN^{-(n-1)}$ . Now, Proposition D.2 implies that

We also have $(c/\varepsilon)^{1/(n-1)}\leq 2N$ and hence $c_{1}\varepsilon^{-d/(n-1)}\leq N^{d}$ for a suitable constant $c_{1}=c_{1}(n,B)>0$ . Combining this estimate with Equation D.8 yields

for a constant $C^{\prime}=C^{\prime}(d,n,B)>0$ . ∎