Error bounds for approximations with deep ReLU neural networks in $W^{s,p}$ norms
Ingo Gühring, Gitta Kutyniok, Philipp Petersen
Introduction
Powered by modern, highly parallelized hardware and the immense amount of accessible data, deep neural networks substantially outperform both traditional modelling approaches, e.g. those based on differential equations, and classical machine learning methods in a wide range of applications. Prominent areas of application include image classification , speech recognition , and natural language processing .
Despite this overwhelming success in applications, a comprehensive mathematical explanation of this success is has not been found. However, a deep theoretical understanding of these techniques is crucial to design more efficient architectures and essential in safety-critical applications, such as autonomous driving.
Many attempts at unraveling the extreme efficiency of deep neural networks have been made in the context of approximation theory. The universal approximation theorem establishes that every continuous function on a compact domain can be uniformly approximated by neural networks. More refined results that also set into relation the size and the approximation fidelity of neural networks have been reported for smooth activation functions in, for example, .
In most applications, the activation functions are not smooth but taken to be the piecewise linear ReLU function. For these networks, approximation bounds for classes of smooth functions have been established in and for piecewise smooth functions in . Connections to sparse grids and the associated approximation rates were established in and connections to linear finite element approximation were reported in . It was also discovered that the depth of neural networks, i.e., the number of layers, crucially influences the approximation capabilities of these networks in the sense that deeper networks are more efficient approximators .
One of the applications of deep learning where a strong knowledge of the approximation capabilities of neural networks directly translates into quantifiable theoretical results appears when solving partial differential equations using deep learning techniques. Some notable advances in this direction have been made in . In this regard, not only the approximation fidelity with respect to standard Lebesgue spaces is of high interest, but also that with respect to Sobolev-type norms. First results in this direction were reported in .
In this work, we give a comprehensive analysis of the approximation rates of deep neural networks of Sobolev-regular functions with respect to (fractional) Sobolev norms.
In the remaining part of this introduction, we briefly describe neural network architectures relevant to this work. Then, we present the setting that neural networks are typically used in. Afterwards, we give a practical and theoretical motivation for studying approximations of functions and their derivatives with neural networks. Finally, we state our results.
is the affine-linear transformation of the -th layer, then the computation in that layer can be described as
Note that acts componentwise in all but the last layer and in the last layer no activation function is applied. The parameters defining the affine-linear transformations are referred to as weights and is called the number of neurons of the network, since each output coordinate of can be seen as a small computational unit, similar to neurons in the brain. If a network has or more layers, then it is usually called deep and a -layer network is called shallow. The complexity of a neural network is typically measured in the number of layers, weights, and neurons (see ). The term deep learning refers to the subset of machine learning methods associated with deep neural networks.
Recently, more general feedforward architectures which allow connections between non-neighboring layers, so-called skip connections, have been shown to yield state-of-the-art results in object recognition tasks . Here, the input of a layer consists of the output of all preceding layers. Note, that the case where no skip connections are allowed, is a special case of this more general architecture.
One of today’s most widely-used activation functions is the Rectified Linear Unit (ReLU) (see ), defined as
The popularity of the ReLU can be explained by a number of factors. It is cheap to compute, promotes sparsity in data representation , alleviates the problem of vanishing-gradients , and thus yields better optimization properties. Moreover, the ReLU is the only commonly-used activation function such that the associated function spaces are not necessary non-closed . In this work, we will mostly focus on ReLU networks with a feedforward architecture allowing skip connections.
2 Supervised learning with neural networks
Typically, neural networks are applied in supervised learning problems. The starting point is a dataset of input-output pairs , called samples, where is in most cases an unknown functionWe will later see a case where is in fact known. with values only given at sample points . As an example, can be thought of as an image and as a vector of scores, where each score is the probability of a certain category, e.g. ”dog” or ”cat”, being associated with . During training, one then seeks to learn the function by adapting the weights of a neural network such that the empirical loss
is minimized for some loss function . Ultimately, one is interested in how well the learning algorithm performs on formerly unseen data points . The associated error is called the generalization error.
3 Motivation: approximating functions and derivatives
From a learning theory point of view, the task of estimating the generalization error decomposes into a statistical problem depending on the samples and an approximation theoretical problem, independent of the samples. An accessible introduction to learning theory from the perspective of approximation theory can be found in . The aim of this work is to study the simultaneous approximation of a function and its derivative with a neural network. There are multiple scenarios in which this is possible and useful.
In , the -th order derivatives of are incorporated in the empirical loss function (1.1), resulting in an empirical loss
which encourages the network to encode information about the derivatives of in its weights. The authors of call this method Sobolev training and reported reduced generalization errors and better data-efficiency in a network compression task (see ) and in application to synthetic gradients (see ). In case of network compression, the approximated function is a function realized by a possibly very large neural network , that has been trained for some supervised learning task and is learnt by a smaller network . In contrast to usual supervised learning settings, the approximated function is known and the derivatives can be computed.
Motivated by the performance of deep learning-based solutions in classical machine learning tasks and, in particular, by their ability to overcome the curse of dimension, neural networks are now also applied for the approximative solution of partial differential equations (PDEs) (see ).
In the authors present their deep Galerkin method for approximating solutions of high-dimensional quasilinear parabolic PDEs. For this, a functional encoding the differential operator, boundary conditions, and initial conditions is introduced. A neural network with weights is then trained to minimize the functional . This is done by a discretization and randomly sampling spatial points.
The theoretical foundation for approximating a function and higher-order derivatives with a neural network was already given in a less known version of the universal approximation theorem by Hornik in [32, Theorem 3]. In particular, it was shown that if the activation function is -times continuously differentiable, non-constant, and bounded, then any -times continuously differentiable function and its derivatives up to order can be uniformly approximated by a shallow neural network on compact sets. Note though that the conditions on the activation function are very restrictive and that, for example, the ReLU is not included in the above result. However, in , it was shown that the theorem also holds for shallow ReLU networks if . Theorem 3 in was also used in to show the existence of a shallow network approximating solutions of the PDEs considered in this paper.
An important aspect, that is untouched by the previous approximation results is how the complexity of a network and, in particular, its depth relates to its approximation properties.
4 Our contribution
In Theorem 1 in , Yarotsky showed upper complexity bounds for approximations in norm of functions from the Sobolev space with neural networks for continuous piecewise linear activation functions with a finite number of breakpoints. Precisely, it is shown there, that for any there exists a neural network with at most layers and at most weights and neurons such that for any with there is a choice of weights with
where is a constant depending on and .
Furthermore, under the assumption of a possibly discontinuous weight selection the number of weights needed by a neural network to be able to realize an -approximation in norm for any with is shown to be lower bounded by in [61, Theorem 4 a)]. The constant depends on and .
We show for the same set of activation functions that the approximation can also be done with respect to higher-order Sobolev norms with arbitrary and that there is a trade-off between the regularity used in the approximation norm and the regularity used in the complexity bounds. Specifically, we show that for any approximation accuracy and regularity , there is a ReLU neural network with at most layers and weights and neurons such that and any with there is a choice of weights with
where is a constant depending on , and . In the boundary case and our results corresponds to the theorem shown by Yarotsky and for and the function and its weak gradient uniformly approximate and the weak gradient of , respectively. For non-integer the function approximates the function and its fractional derivative of order approximates the fractional derivative of order of . The case and was already shown by one of the authors, I. Gühring, in his Master thesis .
Moreover, analogously to [61, Theorem 4 a)] we establish lower bounds where the same regularity-complexity trade-off can be observed. We show that if a neural network is able to realize an -approximation in norm for any with , then the number of weights of is lower bounded by . Here, is a constant depending on and .
5 Outline
As a preparation, we start by introducing notation and some definitions in Section 1.6. In Section 2, we rigorously define a neural network and its architecture in mathematical terms and develop a network calculus. In Section 3, we briefly introduce (fractional) Sobolev spaces. In Section 4, we present our results which will be discussed in Section 5. To allow a concise presentation of our results most proofs are given in the appendix: Appendix A and B contain the necessary preparation for the proof of the results from Section 4.1 which is given in Appendix C. The results from Section 4.2 are proven in Appendix D.
6 Notation
Of course, the same notation also applies to (block) vectors.
for all . If we want to emphasize the constant , then we say that is -Lipschitz.
If is a linear space and and are two norms on , then we say that and are equivalent if there exist constants such that
For two normed linear spaces we denote by the set of bounded linear operators mapping to and for the induced operator norm of is denoted by
Let , then we say for two functions and that is in if there exists and such that for all .
Neural networks
In this section, we introduce the notion of neural networks used in this paper. As in , we will consider a general type of feedforward architecture that also allows for connections of neurons in non-neighboring layers. It can be seen, though, that any function realized by such a network can also be realized by a network with a more restrictive feedforward architecture where only neurons from neighboring layers can be connected (Lemma 2.11). As in , we draw a distinction between the neural network and the function that the network realizes. This gives us the possibility to develop a network calculus in the spirit of [47, Chapter 2]. As in that paper we will introduce the notion of network concatenation and parallelization.
The following definition is similar to [47, Definition 2.1], where the difference is that we also allow connections between non-neighboring layers.
where results from the following scheme:
where is an matrix for and . Then
We will now define the class of neural networks where only connections between neighboring layers are allowed. Networks without skip connections are a special case of the networks of Definition 2.1. Since they are more frequently used in the literature, we coin such networks standard neural networks.
If is a neural network as above and we have for
that for and , then we call a standard neural network. The computation scheme then reduces to the following:
In practice, before training a neural network, i.e. adjusting the weights of the network, one has to decide which network architecture to use. The following definition will clarify the notion of a network architecture.
We say that a neural network with input dimension and layers has architecture if
for all and
implies for all with and .
In the spirit of Definition 2.2 we say that is a standard neural network architecture if all weights connecting neurons from non-neighboring layers are zero.
Note that in general an architecture of a neural network does not need to be unique.
Moreover, we define one of the in practice commonly used (see ) activation functions.
is called ReLU (Rectified Linear Unit) activation function.
It can be shown, see [61, Proposition 1], that in terms of approximation and upper complexity bounds for networks with continuous and piecewise linear activation functions it suffices to focus on the ReLU.
be two neural networks such that the input layer of has the same dimension as the output layer of . Then, denotes the following layer network:
for . We call the concatenation of and .
Then, .
This can be verified with a direct computation. ∎
Next, we adapt [47, Lemma 2.3] to the case where skip connections are allowed and define a network which realizes an identity function if the activation function is the ReLU.
Then the sparse concatenation of and is defined as
for . Furthermore, it holds that , and and .
2 Network parallelization
be two neural networks with -dimensional input. If , then we define
A similar construction is used for the case . Then is a neural network with -dimensional input and layers, which we call the parallelization of and . We have , and .
with -dimensional input and layers is called the parallelization of . We have
and .
3 Standard neural networks
In this subsection, we will show that if a function is a realization of a neural network, then it can also be realized by a standard neural network with similar complexity. The proof of this result is heavily based on the idea of the construction of an identity function with ReLU networks (Lemma 2.6).
Let be the ReLU. Then there exist absolute constants such that for any neural network with -dimensional input and layers, neurons and nonzero weights, there is a standard neural network with -dimensional input, and layers, at most neurons and nonzero weights such that
If , then there is nothing to show. For we start by defining the matrices
In the next remark we collect some properties of the class of functions that can be realized by a neural network.
If is a neural network with -dimensional input and -dimensional output and is the ReLU, then is a Lipschitz-continuous, piecewise affine-linear function. This follows easily from Definition 2.2 and Lemma 2.11 which show that can be expressed as the composition of Lipschitz-continuous, piecewise affine-linear transformations.
In [4, Theorem 2.1] also the converse was shown, i.e. every piecewise affine-linear function can be realized by a neural network with the ReLU as activation function.
Sobolev spaces
Spaces of functions that admit generalized derivatives fulfilling suitable integrability properties are a crucial concept in modern theory of partial differential equations (cf. e.g. ). In order to study properties of PDEs using functional analytic tools, a differential equation is reformulated via a differential operator mapping one function space to another. For a wide range of differential equations the appropriate spaces in this formulation are Sobolev spaces. A historical background of the development of Sobolev spaces in the context of PDEs can be found in [56, Lecture 1]. For a detailed treatment of the broad theory of Sobolev spaces we refer the reader to .
Furthermore, for and , we define the norm
We write . Moreover, the space equipped with the norm is a Banach space. For we shall use the notation
if the domain is clear from the context, or if we want to emphasize the variable that the function depends on, respectively. If , then the Sobolev space is just a Lebesgue space, i.e. .
Many results about function spaces defined on a domain require to fulfill certain regularity conditions. Different geometrical conditions and the resulting properties have been intensively studied and can for example be found in . For our purposes it is enough to focus on the condition introduced in the next definition which can be found in [20, Appendix C.1].
after possibly relabeling and reorienting the coordinate axes.
In the sequel, we will only work with convex domains. Moreover, every open, bounded, and convex domain is a Lipschitz domain [23, Corollary 1.2.2.3].
Fractional Sobolev spaces play an important role in the analysis of partial differential equations. In particular, they characterize the regularity of functions from Sobolev spaces defined on a domain restricted to the boundary of the domain where the restriction is realized by a so-called trace operator (see e.g. ). Moreover, a detailed description of various areas of further application is listed in . For a more in-depth analysis of these spaces the interested reader is referred to and .
Next, we define fractional-order spaces in terms of an intrinsic norm.
We set for and
The space endowed with the norm is a Banach space called Sobolev-Slobodeckij space.For the space coincides with the space of bounded -Hölder continuous functions. Precisely, each equivalence class contains a bounded -Hölder continuous representative.
Approximations with deep ReLU neural networks in Sobolev type norms
In this section, we derive upper and lower complexity bounds for approximations of functions from certain Sobolev spaces with deep ReLU networks. Our result is a generalization of [61, Theorem 1] (upper bounds) and [61, Theorem 4 a)] (lower bounds) to the case where the approximation error is measured in a Sobolev-type norm.
In [61, Theorem 1] it is shown that for and an arbitrary function a ReLU network can be constructed that realizes an approximation with -error at most using layers and nonzero weights and neurons.
We will show that the approximation can also be performed with respect to a continuous scale of higher order Sobolev-type norms for arbitrary and additionally, that there is a trade-off between the regularity used in the norm in which the approximation error is measured and the regularity used in the bounds. In particular, we will show the following theorem:
For any , there is a neural network architecture with -dimensional input and one-dimensional output such that for any there is a neural network that has architecture such that
;
;
.
Clearly, if and , then Theorem 4.1 coincides with the theorem shown by Yarotsky in . Note that if is a neural network and is the ReLU, then the restriction of to is bounded and Lipschitz continuous (cf. Remark 2.12) and hence in view of [10, Chapter 1.3] an element of the Sobolev space . Therefore, the expressions in the previous theorem are well-defined.
Theorem 4.1 holds for all activation functions that are in some sense similar to the ReLU. In particular, it follows directly from [61, Proposition 1] that the statement holds for any continuous piecewise linear activation function with breakpoints, where .
As a consequence of Theorem 4.1 and Lemma 2.11, we can easily derive a similar statement for standard neural networks.
For any , there is a standard neural network architecture with -dimensional input and one-dimensional output such that for any there is a standard neural network that has architecture such that
;
;
.
2 Lower complexity bounds
In this subsection, we show that the same regularity-complexity trade-off that can be observed for upper complexity bounds can be shown for lower bounds. In Theorem 4 a) in Yarotsky proves that a network architecture capable of approximating any function up to a -error has at least weights. Here we consider approximations in norm. The following theorem combines the result from with our result.
If and is an architecture such that for any there is a neural network that has architecture and
then has at least weights.
In case , Theorem 4.3 coincides with [61, Theorem 4 a)].
Note that the case of standard neural network architectures is included in the theorem above and, thus, the same lower bounds hold true for this case.
Moreover, the proof can easily be adapted to piecewise polynomial activation functions (not necessarily continuous) with a finite number of breakpoints.
Discussion and future work
The motivation behind deriving theoretical results for deep neural networks is to improve our understanding of their empirical success in a broad area of applications. In this section, we briefly describe the practical relevance of our result, but mostly focus on aspects where our theoretical framework differs from empirical observations or fails to incorporate computational limitations met in implementations. Furthermore, we will discuss possible extensions of our results.
Practical relevance. Our results give upper and lower bounds for the complexity of neural network architectures used in Sobolev training (see Section 1.3 and ). Furthermore, we anticipate our results to lead to complexity bounds in the theoretical foundation of deep learning-based approaches for the numerical solution of PDEs. We leave an in-depth analysis of this topic for future research.
Curse of dimension. One of the main reasons for the current interest in deep neural networks is their outstanding performance in high-dimensional problems. As an example, consider the popular ILSVRTypically abbreviated with ILSRC which stands for ImageNet Large Scale Visual Recognition Challenge. challenge (see ), an image recognition task on the ImageNet database (see ) containing variable-resolution images which are typically downsampled before training to yield a roughly -dimensional input space (cf. e.g. ).
Our asymptotic upper complexity bounds for the number of weights and neurons are in , and thus depend strongly on the dimension of the input space. Moreover, the constants in the upper bounds for the number of layers, weights, and neurons also increase exponentially with increasing . Even if there is still a gap between our upper and lower bounds for the weights of a network architecture, Theorem 4.3 shows that one cannot expect to circumvent the curse of dimension in the setting considered in this paper.
The power of depth. For , the approximations obtained in Theorem 4.1 and [61, Theorem 1] coincide. In this case, Yarotsky showed in that the constructed unbounded depth approximations for functions in (with ) are asymptotically more efficient in the number of weights and neurons than approximations with fixed length if
As a consequence, to be more efficient than a shallow network, i.e. a network with depth , one needs regularity. Even if this result does not completely explain the success of deep networks over shallow ones, since is typically very large, it would be interesting to obtain similar results for higher-order Sobolev norms.
Unbounded complexity of weights. When training a neural network on a computer, the weights have to be stored in memory. In practice, storing weights up to an arbitrary precision or even unbounded weights (in absolute value) is infeasible. In the construction of the neural network that approximates a function up to an -error and realizes the upper complexity bound from Theorem 4.1 we used weights with as (see construction of in the proof of Lemma C.3 (v)).
In , neural networks are restricted to possess quantized weights (see [47, Definition 2.9]) which controls the complexity of the weights depending on the approximation error . It would be interesting to extend Theorem 4.1 to networks with quantized weights and to analyze how coarse a quantization is possible to maintain the associated upper bounds.
Acknowledgements
I.G. would like to thank Mones Raslan for fruitful discussions on the topic and acknowledges support from the Research Training Group ”Differential Equation- and Data-driven Models in Life Sciences and Fluid Dynamics: An Interdisciplinary Research Training Group (DAEDALUS)” (GRK 2433) funded by the German Research Foundation (DFG). G. K. acknowledges partial support by the Bundesministerium für Bildung und Forschung (BMBF) through the Berliner Zentrum for Machine Learning (BZML), Project AP4, by the Deutsche Forschungsgemeinschaft (DFG) through grants CRC 1114 ”Scaling Cascades in Complex Systems”, Project B07, CRC/TR 109 ”Discretization in Geometry and Dynamics’, Projects C02 and C03, RTG DAEDALUS (RTG 2433), Projects P1 and P3, RTG BIOQIC (RTG 2260), Projects P4 and P9, and SPP 1798 ”Compressed Sensing in Information Processing”, Coordination Project and Project Massive MIMO-I/II, by the Berlin Mathematics Research Center MATH+ , Projects EF1-1 and EF1-4, and by the Einstein Foundation Berlin. P.P. is supported by a DFG Research Fellowship ”Shearlet-based energy functionals for anisotropic phase-field methods”.
Appendix A Interpolation spaces
In this section, we recall some essential notations and results from interpolation theory in Banach spaces. In particular, we present the -method of real interpolation.
For two Banach spaces and interpolation techniques allow the definition and study of a family of spaces that in some sense bridge the gap between and . This rather abstract concept will be applied to Sobolev spaces in Section 3.1 to derive spaces of fractional regularity. Interpolation theory is a valuable tool in harmonic analysis and the study of partial differential equations. A small summary about the history of the development of the theory of interpolation spaces can be found in [56, Lecture 21]. Interested readers are recommended to read the monographs and for a detailed treatment in the context of partial differential equations.
The basic notions of functional analysis are assumed to be known and we refer to for an introduction.
If are two Banach spaces such that is continuously embedded in , then we write and call an interpolation couple. One can also consider more general pairs of Banach spaces but this setting is sufficiently general for our purposes.
Let be an interpolation couple. For every and we define
The first term measures how well can be approximated by elements from the smaller space in and the second term is a penalty term weighted by .
Let be an interpolation couple. Let and . We set
The norm turns into a Banach space (cf. [38, Proposition 1.5]) which is called a real interpolation space.
The next lemma shows that in the sense of a continuous embedding the spaces form a family of nested Banach spaces.
Let , then the following holds:
If is an interpolation couple and , then
if is a Banach space and , then .
For (i) see [38, p. 2] and for (ii) [38, Proposition 1.4]. ∎
One of the most important theorems in interpolation theory shows how the norm of an operator defined on interpolation couples relates to the operator norm with respect to the corresponding interpolation spaces. The theorem can be found in [38, Theorem 1.6].
Let and be two interpolation couples. If , then for every and . Furthermore,
As a corollary from the previous theorem the following useful result can be obtained (see [38, Corollary 1.7]).
Let be an interpolation couple. Moreover, let and , then there is a constant such that for all we have
For the case we have .
Appendix B Sobolev spaces
As a generalization of Definition 3.1, we now introduce Sobolev spaces of vector-valued functions (see [48, Chapter 1.2.3]).
These spaces are again Banach spaces. To simplify the exposition, we introduce notation for a family of semi-norms.
For we only write . Note that coincides with .
A proof can be found in [29, Theorem 4.1] together with [29, Remark 4.2].
To use the bounds for the gradient obtained in the previous theorem in the Sobolev semi-norm setting from Definition B.2, the following observation turns out to be useful. If , then
The next lemma recalls a well known fact about the composition of Lipschitz functions and is not hard to see.
The following corollary establishes a chain rule estimate for .
It then follows from Theorem B.3 that is -Lipschitz and hence, is -Lipschitz. In the same manner we define
and have that is -Lipschitz. With Lemma B.4 we conclude that also the composition is -Lipschitz. Since is bounded and hence also applying Theorem B.3 again yields that and furthermore
B.2 Product estimate
In the following lemma, we show an estimate for the semi-norm of a product of weakly differentiable functions.
Let and with , then and there exists a constant such that
Clearly, so that the product formula in [21, Chapter 7.3] yields that for the weak derivatives of it holds
where is a constant. Furthermore, it is clear that also . ∎
B.3 Averaged Taylor polynomial
In this subsection, we develop a polynomial approximation in the spirit of Taylor polynomials but appropriate for Sobolev spaces. A polynomial approximation of a function is the first step towards an approximation of realized by a neural network in the proof of Theorem 4.1.
A reference for this entire subsection is [10, Chapter 4.1].
and is an arbitrary cut-off function supported in , i.e.
A cut-off function as used in the previous definition always exists. A possible choice is
From the linearity of the weak derivative we can easily conclude that the averaged Taylor polynomial is linear in .
Recall that the averaged Taylor polynomial is defined via an integral and some cut-off function (cf. (B.3)) that perform an averaging of a polynomial expression (cf. (B.4)). Additionally, the following lemma shows that the averaged Taylor polynomial of order is indeed a polynomial of degree less than in .
Moreover, there exists a constant such that the coefficients are bounded with for all with .
in multi-index notation. Then, combining Equation (B.3) and (B.4) yields
Combining the last estimate with Equation (B.7) yields
where the second step follows from for some constant (see [10, Section 4.1]). To estimate the absolute value of the coefficients (defined in Equation B.6), we have
Here, the second step used Equation (B.8) together with Equation (B.5), and is a constant. ∎
The next step is to derive approximation properties of the averaged Taylor polynomial. To this end, recall that for the (standard) Taylor expansion of some function defined on a domain in to yield an approximation at some point the whole path for has to be contained in (see [40, Theorem 6.8.10]). In case of the averaged Taylor polynomial the expansion point is replaced by a ball and we require that the path between each and each is contained in . This geometrical condition is made precise in the following definition.
The next definition introduces a geometric notion which becomes important when given a family of subdivisions , of a domain which becomes finer for smaller . One typically needs to control the nondegeneracy of which can be done e.g. with a uniformly bounded chunkiness parameter.
If , then we define
To emphasize the dependence on the set , we will occasionally write and .
Finally, the next lemma shows approximation properties of the averaged Taylor polynomial.
where denotes the Taylor polynomial of order of averaged over and .
A proof can be found in [10, Lemma 4.3.8].
B.4 Fractional Sobolev spaces
In this subsection, we derive a generalization of Sobolev spaces characterized by fractional-order derivatives using two different approaches. First, we interpolate integer-valued Sobolev spaces, and secondly, we define an intrinsic norm. Both approaches are then shown to be equivalent under some regularity condition on the domain .
We start by defining fractional-order Sobolev spaces for via Banach space interpolation (see Section A).
Let and , then we set
The next theorem shows that, given suitable regularity conditions of the domain , Definition B.13 and Definition 3.3 yield the same spaces with equivalent norms. The theorem can be found in [10, Theorem 14.2.3] for . The case is not included in [10, Theorem 14.2.3] but can be shown with the same technique.
with equivalence of the respective norms.
We make use of the norm equivalence by only using the norm in the following section and transferring results obtained by interpolation theory for the norm to the norm without mentioning it again.
Appendix C Upper bounds for approximations
The strategy of the proof of Theorem 4.1 follows the idea of the proof of Theorem 1 in : A polynomial approximation of a function is approximated by a function realized by a neural network . As in , we start by constructing a neural network that approximates the square function on the interval up to an approximation error at most . In our result the error is measured in the norm (Proposition C.1). This result is then used to obtain an approximation of a multiplication operator (Proposition C.2). Next, we derive a partition of unity (Lemma C.3) as a product of univariate functions that can be realized by neural networks. Using this construction we then build a localized Taylor approximation for a function in the and norm. An interpolation argument shows that same construction can be used for an approximation in the norm, where (Lemma C.4). In the next step, we lay the foundation for approximating products of the partition of unity and monomials, i.e. localized monomials, with ReLU networks (Lemma C.5) in the norm for . Via an embedding this result this result is then used to approximate the localized polynomials in with ReLU networks (Lemma C.6).
All Sobolev spaces encountered in the following proofs are defined on open, bounded, and convex domains which are, in particular, Lipschitz-domains (see [23, Corollary 1.2.2.3]).
The results presented in the following two propositions can be found in a similar way in [51, Proposition 3.1]. However, since our results contain some minor extensions we decided to give the proof here for the sake of completeness. In detail, Proposition C.2 (ii) and (iii) are not shown in [51, Proposition 3.1].
In [61, Proposition 2], a network is constructed that approximates the square function on the interval in the norm. Interestingly, the same construction can be used when measuring the approximation error in the norm. In particular, the depth and the number of weights and neurons of the network do not grow asymptotically faster to satisfy the approximation accuracy with respect to this stronger norm.
There exist constants , such that for all there is a neural network with at most nonzero weights, at most layers, at most neurons, and with one-dimensional input and output such that
and . Furthermore, it holds that
Thus, is a piecewise linear interpolant of with uniformly distributed breakpoints . In particular, . Furthermore, it is shown in the proof of [61, Proposition 2] that
We will now show that the approximation error of the derivative can be bounded in a similar way. In particular, we show the estimate
From Equation (C.3) we get for all
Combining Equation (C.4) and (C.5) yields
Clearly, the weak derivative of is a piecewise constant function, which assumes its maximum on the last piece. Hence,
Let now and choose . Now, satisfies the approximation bound in Equation (C.1) and . The estimate (C.2) holds because of Equation (C.6). The number of weights can be bounded by
In the same way, the number of neurons and layers can be bounded. This concludes the proof.
As in , we will now use Proposition C.1 and the polarization identity
to define an approximate multiplication where the approximation error is again (and in contrast to ) measured in the norm.
For any , there exist constants and such that for any , there is a neural network with two-dimensional input and one-dimensional output that satisfies the following properties:
;
if or , then ;
;
the depth and the number of weights and neurons in is at most .
Set , where is the constant from Corollary B.5 for and , and let be the approximate squaring network from Proposition C.1 such that
As in the proof of [61, Proposition 3], we use the fact that to see that a network can be constructed with two-dimensional input and one-dimensional output that satisfies
As a consequence of Proposition C.1, there exists a constant such that has at most layers, neurons and nonzero weights. Here and are suitable constants and (iv) is hence satisfied.
Since , (ii) easily follows.
Using the polarization identity (C.7), we can write
To keep the following calculations simple, we introduce some notation. In particular, we define
Note that for the inner functions in the compositions in Equation (C.11), it holds that
Hence, to finally prove (i), we apply Corollary B.5 to (C.11) and get
To show (iii) we use the chain rule estimate from Corollary B.5 and get
where we used Equation (C.12) in the third step and is the absolute constant from Equation (C.2) in Proposition C.1. ∎
Next, we construct a partition of unity (in the same way as in [61, Theorem 1]) that can be defined as a product of piecewise linear functions, such that each factor of the product can be realized by a neural network. We will use this product structure in Lemma C.6 together with a generalized version of the approximate multiplication from Proposition C.2 to approximate localized polynomials with ReLU networks.
for every ;
there exist absolute constants such that for each there is a neural network with -dimensional input and -dimensional output, with at most three layers, nonzero weights and neurons, that satisfies
and for all and .
for . Then, (i),(ii) and (iii) follow easily from the definition.
It follows that .
To show (v), we start by constructing a network that realizes the function . For this we set
and . Then is a two-layer network with one-dimensional input and one-dimensional output, with nonzero weights and neurons such that
which is a three-layer network with -dimensional input and -dimensional output, with at most nonzero weights and at most neurons, and
The following lemma uses the partition of unity from Lemma C.3 and the Bramble-Hilbert Lemma B.12 in a classical way to derive an approximation with localized polynomials in the norm and the norm. Using an interpolation argument, we can generalize this result to the case where the approximation is performed with respect to the norm, where .
Let and set , then the operator with is linear and bounded with
Furthermore, there is a constant such that for any the coefficients of the polynomials satisfy
The idea of the proof is similar to the first part of the proof of [61, Theorem 1]. We use approximation properties of averaged Taylor polynomials (see Bramble-Hilbert Lemma B.12) to derive local estimates and then combine them using a partition of unity to obtain a global estimate. In order to use this strategy also near the boundary, we make use of an extension operator.
where is the norm of the extension operator.
Step 1 (Averaged Taylor polynomials): For each we set
for , where is a suitable constant. It now suffices to show (i) and (ii).
Step 2 (Local estimates in ): To check that the conditions of the Bramble-Hilbert Lemma B.12 are fulfilled, note that . Furthermore, is a ball in such that is star-shaped with respect to . We have and and, thus,
Finally, we have for the chunkiness parameter of
Applying the Bramble-Hilbert Lemma B.12 yields for each the local estimate
Here, is the constant from Lemma B.12 which only depends on and , since the chunkiness parameter of is a constant depending only on (see (C.15)) and . In the same way, we get
where is a suitable constant.
The first step towards a global estimate is now to combine Equation (C.16) and (C.17) with the cut-off functions from the partition of unity. We have
Furthermore, using the product inequality for weak derivatives from Lemma B.6 we get that there is a constant such that
for some constant .
Step 3 (Global estimate in ): To derive the global estimate, we start by noting that with property (ii) from Lemma C.3 we have
where the last step follows from . Now we obtain for each
where the triangle inequality together with the support property (iii) from Lemma C.3 is used in the first step. The second step follows again from Lemma C.3 (iii) and the third step follows from (C.18) for and from (C.20) for . Here can be chosen independent of (e.g. ).
Finally, the boundedness claim in (i) and (ii) follows from using the definition of and combining Equation (C.22) with Equation (C.23):
where the last two steps follow from the definition of . Thus, we have
for , where Equation (C.14) was used in the first and second step. Here and are constants. The linearity of , is a consequence of the linearity of the averaged Taylor polynomial (cf. Remark B.8).
Step 4 (Interpolation): For we use a Banach space interpolation argument. Set
Then, we can apply Theorem A.4 to the operator with and get for a constant that
where we used Lemma A.3 (ii) to see that . ∎
The following technical lemma lays the foundation for approximating localized monomials with neural networks. Using the notation and statement (v) from Lemma C.3, a localized monomial can be expressed as the product of the output components of a network with a suitable output dimension , i.e.
Given a network with -dimensional output, we construct a network that approximates the product of the output components of .
For any , and any neural network with -dimensional input and -dimensional output where , and with number of layers, neurons and weights all bounded by , such that
there exists a neural network with -dimensional input and one-dimensional output, and with number of layers, neurons and weights all bounded by , such that
for and some constant . Moreover,
If , then we can choose for any and the claim holds.
Case 1: If , then we use the induction hypothesis and get that there is a constant and a neural network with -dimensional input and one-dimensional output, and at most layers, neurons and weights such that
for and . Moreover,
for any . Furthermore, we have , for .
We denote by the neural network with -dimensional input and -dimensional output which results from by removing the last output neuron and corresponding weights. In detail, we write
Using the induction hypothesis and the constants and from Case 1, we get that there is a neural network with -dimensional input and one-dimensional output, and at most layers, neurons and weights such that
for any . Furthermore, we can assume that , and that the first layers of and coincide and, thus, also the first layers of and , i.e. for .
Now, we add the formerly removed neuron with corresponding weights back to the last layer of . For the resulting network
it holds that the first layers of and coincide, and is a neural network with two-dimensional output. Note that
where we used Equation (C.26) for . Additionally, we have
Now, we denote by the network from Proposition C.2 with and accuracy and define
Consequently, has -dimensional input, one-dimensional output and, combining the induction hypothesis with statement (iv) of Proposition C.2 and Remark 2.8, at most
layers, number of neurons and weights. Here and are the constants from Proposition C.2 (iv) for the choice and is a suitable constant. Clearly, the first layers of and coincide and for the approximation properties it holds that
for . We continue by considering the first term of the Inequality (C.28) for and obtain
where we used Proposition C.2 (i) for the last step. Next, we consider the same term for and apply the chain rule from Corollary B.5. For this, let be the constant from Corollary B.5 (for and ). We get
where we used the induction hypothesis together with in the third step, and is a suitable constant.
To estimate the second term of (C.28) for we use the induction hypothesis (for ) and get
For we apply the product rule from Lemma B.6 together with and get
where we used the induction hypothesis for , and .
Combining (C.28) with (C.29) and (C.31) we have
and in the same way a combination of (C.28) with (C.30) and (C.32) yields
where . Putting together the two previous estimates yields
for a suitable constant .
We now show Equation (C.25) for . To this end, assume that for some and . If , then Equation (C.27) implies that
Hence, by application of Proposition C.2, we have
Finally, we need to show that there is a constant such that
Similarly as in (C.30) we have for a constant that
where Corollary B.5 was used for the second step and is the constant from Proposition C.2 (iii) which together with an argument as in (C.30) implies the third step.
Taking the maximum of the each pair of constants derived in Case 1 and Case 2 concludes the proof. ∎
In the next lemma, we approximate a sum of localized polynomials with a neural network. One of the difficulties is to control the derivative of the localizing functions from the partition of unity.
For any there is a neural network architecture with -dimensional input and one-dimensional output, with at most layers and weights and neurons, such that the following holds: Let and for be the polynomials from Lemma C.4, then there is a neural network that has architecture such that
Step 1 (Approximating localized monomials ): Let and . It is easy to see that there is a neural network with -dimensional input and -dimensional output, with one layer and at most weights and neurons such that
Let now be the neural network and the constants from Lemma C.3 (v) and define the network
for a constant and , and
Step 2 (Constructing an architecture capable of approximating sums of localized polynomials): We set
Then, there are constants such that is a neural network with -dimensional input and one-dimensional output, with at most layers,
Note that the network only depends on (and thus on ) via the coefficients . Now, it is easy to see that there exists a neural network architecture with layers and number of neurons and weights bounded by such that has architecture for every of choice of coefficients , and hence for every choice of .
Step 3 (Estimating the approximation error in ): For each we set
where the last step is a consequence of . For we have
where we used the triangle inequality in the first step and Lemma C.4 in the second step. Here is a constant. Next, note that
where denotes the Lebesgue measure and is a constant. Combining Equation (C.39) with the last estimate yields
Combining Equation (C.40) with Equation (C.41) and plugging the result in Equation (C.38) finally yields
where the last step is the same as Step 3 of the proof of Lemma C.4 and is a constant. Hence the case and is proven.
Step 4 (Interpolation): To show the general statement for we use the interpolation inequality from Corollary A.5 together with Equation (C.42) and directly get
for a constant . This concludes the proof of the lemma. ∎
Finally, we are ready to proof the upper complexity bounds.
The proof can be divided into two steps: First, we approximate the function by a sum of localized polynomials and then approximate this sum by a network.
where is the constant from Corollary C.4. Without loss of generality we may assume that . The same corollary yields that if is the partition of unity from Lemma C.3, then there exist polynomials for such that
For the second step, let , and be the constants from Lemma C.6 and be the neural network provided by Lemma C.6 with instead of . Then has at most
layers for a constant and at most
nonzero weights and neurons. Here is a suitable constant and we used in the first step. It holds that the architecture of is independent of the function . Furthermore, we have
where we used the inequality for and in the last step. We now continue the above computation using that and and therefore and get
Combining the previous computations we get
Using the triangle inequality and Equations (C.44) and (C.45) we finally obtain
Appendix D Lower bounds for approximations
Lower complexity bounds for approximations in norm, which correspond to the case in Theorem 4.3, have already been shown in Theorem 4 a) in . For lower complexity bounds of approximations we modify the proof strategy outlined in [61, Theorem 4 a)].
We start by showing an auxiliary result, that is used in the proof of Proposition D.2.
Now, set . By the pigeonhole principle, there exists , such that contains infinitely many . It is not hard to see that if a closed polyhedron contains a converging sequence on a line, then it also contains a small section of the line including the limit point of the sequence. Thus, there exists such that . Then, setting shows the claim. ∎
As in , we make use of a combinatorial quantity measuring the expressiveness of a set of binary valued functions defined on some set , called VC-dimension (see e.g. [3, Chapter 3.3]). We define
On the other hand, [3, Theorem 8.4] yields an upper bound of the VC-dimension of in terms of the number of computations and the dimension of the parameterization of which can be expressed as a function of (Claim 2). Together this gives the desired relation.
Step 1 (Construction of ): Let now for some constant to be chosen later, and be a neural network architecture as in the claim of the proposition.
To this end, we first define a function with for some constant , and then make use of a neural network that approximates .
and, in particular, if , then
Here, we defined the constant which was left unspecified in the beginning of the proof by .
where we used Equation (D.5) together with Equation (D.4) and for the first step and Equation (D.2) together with the upper bound for for the second step.
In a similar way, we get for the case that
Finally, combining Equation (D.6) and Equation (D.7) reads as
Claim 2 (): We start by showing that there exists a constant such that can be computed using operations of the following types
the arithmetic operations , and on real numbers,
jumps conditioned on , and comparisons of real numbers
Note that the number of neurons that are needed for the computation of can be bounded by . Thus, can be computed using at most operations where is an absolute constant. Hence, there exists a constant such that for the number of operations of the specified type needed for the computation of where it holds that .
where is a suitable constant. ∎
The proof of the lower complexity bounds is now a simple consequence of Proposition D.2.
The case corresponds to [61, Theorem 4 a)].
For the case , let and be the constants from Proposition D.2 and set
Then, and, thus, . Now, Proposition D.2 implies that
We also have and hence for a suitable constant . Combining this estimate with Equation D.8 yields
for a constant . ∎