Topological properties of the set of functions generated by neural networks of fixed size

Philipp Petersen, Mones Raslan, Felix Voigtlaender

Introduction

Neural networks, introduced in 1943 by McCulloch and Pitts , are the basis of every modern machine learning algorithm based on deep learning . The term deep learning describes a variety of methods that are based on the data-driven manipulation of the weights of a neural network. Since these methods perform spectacularly well in practice, they have become the state-of-the-art technology for a host of applications including image classification , speech recognition , game intelligence , and many more.

This success of deep learning has encouraged many scientists to pick up research in the area of neural networks after the field had gone dormant for decades. In particular, quite a few mathematicians have recently investigated the properties of different neural network architectures, hoping that this can explain the effectiveness of deep learning techniques. In this context, mathematical analysis has mainly been conducted in the context of statistical learning theory , where the overall success of a learning method is determined by the approximation properties of the underlying function class, the feasibility of optimizing over this class, and the generalization capabilities of the class, when only training with finitely many samples.

In the approximation theoretical part of deep learning research, one analyzes the expressiveness of deep neural network architectures. The universal approximation theorem demonstrates that neural networks can approximate any continuous function, as long as one uses networks of increasing complexity for the approximation. If one is interested in approximating more specific function classes than the class of all continuous functions, then one can often quantify more precisely how large the networks have to be to achieve a given approximation accuracy for functions from the restricted class. Examples of such results are . Some articles study in particular in which sense deep networks have a superior expressiveness compared to their shallow counterparts, thereby partially explaining the efficiency of networks with many layers in deep learning.

Another line of research studies the training procedures employed in deep learning. Given a set of training samples, the training process is an optimization problem over the parameters of a neural network, where a loss function is minimized. The loss function is typically a non-linear, non-convex function of the weights of the network, rendering the optimization of this function highly challenging . Nonetheless, in applications, neural networks are often trained successfully through a variation of stochastic gradient descent. In this regard, the energy landscape of the problem was studied and found to allow convergence to a global optimum, if the problem is sufficiently overparametrized; see .

The third large area of mathematical research on deep neural networks is analyzing the so-called generalization error of deep learning. In the framework of statistical learning theory , the discrepancy between the empirical loss and the expected loss of a classifier is called the generalization error. Specific bounds for this error for the class of deep neural networks were analyzed for instance in , and in more specific settings for instance in .

In this work, we study neural networks from a different point of view. Specifically, we study the structure of the set of functions implemented by neural networks of fixed size. These sets are naturally (non-linear) subspaces of classical function spaces like Lp(Ω)L^{p}(\Omega) and C(Ω)C(\Omega) for compact sets Ω\Omega.

Due to the size of the networks being fixed, our analysis is inherently non-asymptotic. Therefore, our viewpoint is fundamentally different from the analysis in the framework of statistical learning theory. Indeed, in approximation theory, the expressive power of networks growing in size is analyzed. In optimization, one studies the convergence properties of iterative algorithms—usually that of some form of stochastic gradient descent. Finally, when considering the generalization capabilities of deep neural networks, one mainly studies how and with which probability the empirical loss of a classifier converges to the expected loss, for increasing numbers of random training samples and depending on the sizes of the underlying networks.

Given this strong delineation to the classical fields, we will see that our point of view yields interpretable results describing phenomena in deep learning that are not directly explained by the classical approaches. We will describe these results and their interpretations in detail in Subsections 1.1–1.3.

We will use standard notation throughout most of the paper without explicitly introducing it. We do, however, collect a list of used symbols and notions in Appendix A. To not interrupt the flow of reading, we have deferred several auxiliary results to Appendix B and all proofs and related statements to Appendices C–E.

Before we continue, we formally introduce the notion of spaces of neural networks of fixed size.

where xLx_{L} results from the following scheme:

Before we continue, let us note that the set NN(S)\mathcal{NN}(S) of all neural networks (that is, the network weights) with a fixed architecture forms a finite-dimensional vector space, which we equip with the norm

In the remainder of this introduction, we discuss our results concerning the topological properties of the sets of realizations of neural networks with fixed architecture and their interpretation in the context of deep learning. Then, we give an overview of related work. We note at this point that it is straightforward to generalize all of the results in this paper to neural networks for which one only prescribes the total number of neurons and layers and not the specific architecture.

1 Non-convexity of the set of realizations

We will show in Section 2 (Theorem 2.1) that, for a given architecture SS, the set RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) is not convex, except possibly when the activation function is a polynomial, which is clearly not the case for any of the activation functions that are commonly used in practice.

In fact, for a large class of activation functions (including the ReLU and the standard sigmoid activation function), the set RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) turns out to be highly non-convex in the sense that for every r[0,)r\in[0,\infty), the set of functions having uniform distance at most rr to any function in RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) is not convex. We prove this result in Theorem 2.2 and Remark 2.3.

This non-convexity is undesirable, since for non-convex sets, there do not necessarily exist well-defined projection operators onto them. In classical statistical learning theory , the property that the so-called regression function can be uniquely projected onto a convex (and compact) hypothesis space greatly simplifies the learning problem; see [20, Section 7]. Furthermore, in applications where the realization of a network—rather than its set of weights—is the quantity of interest (for example when a network is used as an Ansatz for the solution of a PDE, as in ), our results show that the Ansatz space is non-convex. This non-convexity is inconvenient if one aims for a convergence proof of the underlying optimization algorithm, since one cannot apply convexity-based fixed-point theorems. Concretely, if a neural network is optimized by stochastic gradient descent so as to satisfy a certain PDE, then it is interesting to see if there even exists a network so that the iteration stops. In other words, one might ask whether gradient descent on the set of neural networks (potentially with bounded weights) has a fixed point. If the space of neural networks were convex and compact, then the fixed-point theorem of Schauder would guarantee the existence of such a fixed point.

2 (Non-)closedness of the set of realizations

For any fixed architecture SS, we show in Section 3 (Theorem 3.1) that RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) is not a closed subset of Lp(μ)L^{p}(\mu) for 0<p<0<p<\infty, under very mild assumptions on the measure μ\mu and the activation function ϱ\varrho. The assumptions concerning ϱ\varrho are satisfied for all activation functions used in practice.

For the case p=p=\infty, the situation is more involved: For all activation functions that are commonly used in practice—except for the (parametric) ReLU—the associated sets RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) are non-closed also with respect to the uniform norm; see Theorem 3.3. For the (parametric) ReLU, however, the question of closedness of the sets RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) remains mostly open. Nonetheless, in two special cases, we prove in Section 3.4 that the sets RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) are closed. In particular, for neural network architectures with two layers only, Theorem 3.8 establishes the closedness of RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S), where ϱ\varrho is the (parametric) ReLU.

A practical consequence of the observation of non-closedness can be identified with the help of the following argument that is made precise in Subsection 3.3: We show that the set

of realizations of neural networks with a fixed architecture and all affine linear maps bounded in a suitable norm, is always closed. As a consequence, we observe the following phenomenon of exploding weights: If a function ff is such that it does not have a best approximation in RNNϱΩ(S)\mathcal{RNN}^{\Omega}_{\varrho}(S), that is, if there does not exist fRNNϱΩ(S)f^{*}\in\mathcal{RNN}^{\Omega}_{\varrho}(S) such that

Certainly, the presence of large coefficients will make the numerical optimization increasingly unstable. Thus, exploding weights in the sense described above are highly undesirable in practice.

The argument above discusses an approximation problem in an LpL^{p}-norm. In practice, one usually only minimizes “empirical norms”. We will demonstrate in Proposition 3.6 that also in this situation, for increasing numbers of samples, the weights of the neural networks that minimize the empirical norms necessarily explode under certain assumptions. Note that the set-up of having a fixed architecture and a potentially unbounded number of training samples is common in applications where neural networks are trained to solve partial differential equations. There, training samples are generated during the training process .

3 Failure of inverse stability of the realization map

For both of these results—continuity and no inverse stability—we only need to assume that the activation function ϱ\varrho is Lipschitz continuous and not constant.

These properties of the realization map pinpoint a potential problem that can occur when training a neural network: Let us consider a regression problem, where a network is iteratively updated by a (stochastic) gradient descent algorithm trying to minimize a loss function. It is then possible that at some iterate the loss function exhibits a very small error, even though the associated network parameters have a large distance to the optimal parameters. This issue is especially severe since a small error term leads to small steps if gradient descent methods are used in the optimization. Consequently, convergence to the very distant optimal weights will be slow even if the energy landscape of the optimization problem happens to be free of spurious local minima.

4 Related work

The aforementioned properties of non-convexity and non-closedness have, to some extent, been studied before. Classical results analyze the spaces of shallow neural networks, that is, of RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) for S=(d,N0,1)S=(d,N_{0},1), so that L=2L=2. For such sets of shallow networks, a property that has been extensively studied is to what extent RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) has the best approximation property. Here, we say that RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) has the best approximation property, if for every function fLp(Ω)f\in L^{p}(\Omega), 1p1\leq p\leq\infty, there exists a function F(f)RNNϱΩ(S)F(f)\in\mathcal{RNN}_{\varrho}^{\Omega}(S) such that fF(f)Lp=infgRNNϱΩ(S)fgLp\|f-F(f)\|_{L^{p}}=\inf_{g\in\mathcal{RNN}_{\varrho}^{\Omega}(S)}\|f-g\|_{L^{p}}. In it was shown that even if a minimizer always exists, the map fF(f)f\mapsto F(f) is necessarily discontinuous. Furthermore, at least for the Heaviside activation function, there does exist a (non-unique) best approximation; see .

Additionally, [27, Proposition 4.1] demonstrates, for shallow networks as before, that for the logistic activation function ϱ(x)=(1+ex)1\varrho(x)=(1+e^{-x})^{-1}, the set RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) does not have the best approximation property in C(Ω)C(\Omega). In the proof of this statement, it was also shown that RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) is not closed. Furthermore, it is claimed that this result should hold for every non-linear activation function. The previously mentioned result of and Theorem 3.8 below disprove this conjecture for the Heaviside and ReLU activation functions, respectively.

In deep learning, one chooses a loss function L:C(Ω)[0,){\mathcal{L}:C(\Omega)\to[0,\infty)}, which is then minimized over the set of neural networks RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) with fixed architecture SS. A typical loss function is the empirical square loss, that is,

It is important to emphasize that this notion of non-convexity describes properties of the loss function, in contrast to the non-convexity of the sets of functions that we analyze in this work.

Non-convexity of the set of realizations

In this section, we analyze the convexity of the set of all neural network realizations. In particular, we will show that this set is highly non-convex for all practically used activation functions listed in Table 1. First, we examine the convexity of the set RNNϱΩ(S)\mathcal{RNN}^{\Omega}_{\varrho}(S):

If RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) is convex, then ϱ\varrho is a polynomial.

(1) It is easy to see that all of the activation functions in Table 1 are locally Lipschitz continuous, and that none of them is a polynomial. Thus, the associated sets of realizations are never convex.

(2) In the case where ϱ\varrho is a polynomial, the set RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) might or might not be convex. Indeed, if S=(1,N,1)S=(1,N,1) and ϱ(x)=xm\varrho(x)=x^{m}, then it is not hard to see that RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) is convex if and only if NmN\geq m.

The detailed proof of Theorem 2.1 is the subject of Appendix C.1. Let us briefly outline the proof strategy:

We first show in Proposition C.1 that RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) is closed under scalar multiplication, hence star-shaped with respect to the origin, i.e., 0 is a center.A subset AA of some vector space VV is called star-shaped, if there exists some fAf\in A such that for all gAg\in A, also {λf+(1λ)g ⁣:λ}A\{\lambda f+(1-\lambda)g\colon\lambda\in\}\subset A. The vector ff is called a center of AA.

A direct consequence of Step 2 is that if RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) is convex, then it can only contain a finite number of linearly independent functions; see Corollary C.5.

In applications, the non-convexity of RNNϱΩ(S)\mathcal{RNN}^{\Omega}_{\varrho}(S) might not be as problematic as it first seems. If, for instance, the set RNNϱΩ(S)+Bδ(0)\mathcal{RNN}^{\Omega}_{\varrho}(S)+B_{\delta}(0) of functions that can be approximated up to error δ>0\delta>0 by a neural network with architecture SS was convex, then one could argue that the non-convexity of RNNϱΩ(S)\mathcal{RNN}^{\Omega}_{\varrho}(S) was not severe. Indeed, in practice, neural networks are only trained to minimize a certain empirical loss function, with resulting bounds on the generalization error which are typically of size ε=O(m1/2)\varepsilon=\mathcal{O}(m^{-1/2}), with mm denoting the number of training samples. In this setting, one is not really interested in “completely minimizing” the (empirical) loss function, but would be content with finding a function for which the empirical loss is ε\varepsilon-close to the global minimum. Hence, one could argue that one is effectively working with a hypothesis space of the form RNNϱΩ(S)+Bδ(0)\mathcal{RNN}^{\Omega}_{\varrho}(S)+B_{\delta}(0), containing all functions that can be represented up to an error of δ\delta by neural networks of architecture SS.

To quantify this potentially more relevant notion of convexity of neural networks, we define, for a subset AA of a vector space Y\mathcal{Y}, the convex hull of AA as

For ε>0\varepsilon>0, we say that a subset AA of a normed vector space Y\mathcal{Y} is ε\varepsilon-convex in (Y,Y)(\mathcal{Y},\|\cdot\|_{\mathcal{Y}}), if

Hence, the notion of ε\varepsilon-convexity asks whether the convex hull of a set is contained in an enlargement of this set. Note that if RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) is dense in C(Ω)C(\Omega), then its closure is trivially ε\varepsilon-convex for all ε>0\varepsilon>0. Our main result regarding the ε\varepsilon-convexity of neural network sets shows that this is the only case in which RNNϱΩ(S)\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)} is ε\varepsilon-convex for any ε>0\varepsilon>0.

Assume that RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) is not dense in C(Ω)C(\Omega). Then there does not exist any ε>0\varepsilon>0 such that RNNϱΩ(S)\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)} is ε\varepsilon-convex in \big{(}C(\Omega),\|\cdot\|_{\sup}\big{)}.

All closures in the theorem are taken in C(Ω)C(\Omega).

The proof of this theorem is the subject of Appendix C.2. It is based on showing that if RNNϱΩ(S)\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)} is ε\varepsilon-convex for some ε>0\varepsilon>0, then in fact RNNϱΩ(S)\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)} is convex, which we then use to show that RNNϱΩ(S)\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)} contains all realizations of two-layer neural networks with activation function ϱ\varrho. As shown in , this implies that RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) is dense in C(Ω)C(\Omega), since ϱ\varrho is not a polynomial. ∎

While it is certainly natural to expect that RNNϱΩ(S)C(Ω)\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}\neq C(\Omega) should hold for most activation functions ϱ\varrho, giving a reference including large classes of activation functions such that the claim holds is not straightforward. We study this problem more closely in Appendix C.3.

To be more precise, from Proposition C.10 it follows that the ReLU, the parametric ReLU, the exponential linear unit, the softsign, the sigmoid, and the tanh\tanh yield realization sets RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) which are not dense in Lp(Ω)L^{p}(\Omega) and in C(Ω)C(\Omega).

The only activation functions listed in Table 1 for which we do not know whether any of the realization sets RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) is dense in Lp(Ω)L^{p}(\Omega) or in C(Ω)C(\Omega) are: the inverse square root linear unit, the inverse square root unit, the softplus, and the arctan\arctan function. Of course, we expect that also for these activation functions, the resulting sets of realizations are never dense in Lp(Ω)L^{p}(\Omega) or in C(Ω)C(\Omega).

(Non-)Closedness of the set of realizations

We start by examining the closedness with respect to LpL^{p}-norms for p(0,)p\in(0,\infty). In fact, for all B>0B>0 and all widely used activation functions (including all activation functions presented in Table 1), the set RNNϱ[B,B]d(S)\mathcal{RNN}_{\varrho}^{[-B,B]^{d}}(S) is not closed in Lp(μ)L^{p}(\mu), for any p(0,)p\in(0,\infty) and any “sufficiently non-atomic” measure μ\mu on [B,B]d[-B,B]^{d}. To be more precise, the following is true:

There is some r>0r>0 such that ϱ(,r)(r,)\varrho|_{(-\infty,-r)\cup(r,\infty)} is differentiable;

At least one of the following two conditions is satisfied:

There are λ,λ0\lambda,\lambda^{\prime}\geq 0 with λλ\lambda\neq\lambda^{\prime} such that ϱ(x)λ\varrho^{\prime}(x)\to\lambda as xx\to\infty, and ϱ(x)λ\varrho^{\prime}(x)\to\lambda^{\prime} as xx\to-\infty, and we have NL12N_{L-1}\geq 2.

Finally, let B>0B>0 and let μ\mu be a finite Borel measure on [B,B]d[-B,B]^{d} for which the support suppμ\operatorname*{supp}\mu is uncountable. Then the set RNNϱ[B,B]d(S)\mathcal{RNN}_{\varrho}^{[-B,B]^{d}}(S) is not closed in Lp(μ)L^{p}(\mu) for any p(0,)p\in(0,\infty). More precisely, there is a function fL(μ)f\in L^{\infty}(\mu) which satisfies fRNNϱ[B,B]d(S)RNNϱ[B,B]d(S)f\in\overline{\mathcal{RNN}_{\varrho}^{[-B,B]^{d}}(S)}\setminus\mathcal{RNN}_{\varrho}^{[-B,B]^{d}}(S) for all p(0,)p\in(0,\infty), where the closure is taken in Lp(μ)L^{p}(\mu).

If suppμ\operatorname*{supp}\mu is countable, then μ=xsuppμμ({x})δx\mu=\sum_{x\in\operatorname*{supp}\mu}\mu(\{x\})\,\delta_{x} is a countable sum of Dirac measures, meaning that μ\mu is purely atomic. In particular, if μ\mu is non-atomic (meaning that μ({x})=0\mu(\{x\})=0 for all x[B,B]dx\in[-B,B]^{d}), then suppμ\operatorname*{supp}\mu is uncountable and the theorem is applicable.

For the proof of the theorem, we refer to Appendix D.1. The main proof idea consists in the approximation of a (discontinuous) step function which cannot be represented by a neural network with continuous activation function. ∎

The assumptions concerning the activation function ϱ\varrho in Theorem 3.1 are satisfied for all of the activation functions listed in Table 1. In any case where ϱ\varrho is bounded, one can take NL1=1N_{L-1}=1; otherwise, one can take NL1=2N_{L-1}=2.

For a proof of this statement, we refer to Appendix D.2. ∎

We have seen in Theorem 3.1 that under reasonably mild assumptions on the activation function ϱ\varrho—which are satisfied for all commonly used activation functions—the set RNNϱ[B,B]d(S)\mathcal{RNN}_{\varrho}^{[-B,B]^{d}}(S) is not closed in any LpL^{p}-space where p(0,)p\in(0,\infty). However, the argument of the proof of Theorem 3.1 breaks down if one considers closedness with respect to the sup\|\cdot\|_{\sup}-norm. Therefore, we will analyze this setting more closely in this section. More precisely, in Theorem 3.3, we present several criteria regarding the activation function ϱ\varrho which imply that the set RNNϱ[B,B]d(S)\mathcal{RNN}_{\varrho}^{[-B,B]^{d}}(S) is not closed in C([B,B]d)C([-B,B]^{d}). We remark that in all these results, ϱ\varrho will be assumed to be at least C1C^{1}. Developing similar criteria for non-differentiable functions is an interesting topic for future research.

NL12N_{L-1}\geq 2 and ϱ\varrho is bounded, analytic, and not constant.

Then the set RNNϱ[B,B]d(S)\mathcal{RNN}_{\varrho}^{[-B,B]^{d}}(S) is not closed in the space C([B,B]d)C([-B,B]^{d}).

For the proof of the statement, we refer to Appendix D.3. In particular, the proof of the statement under Condition (i) can be found in Appendix D.3.1. Its main idea consists of the uniform approximation of ϱ\varrho^{\prime} (which cannot be represented by neural networks with activation function ϱ\varrho, due to its lack of sufficient regularity) by neural networks. For the proof of the statement under Condition (ii), we refer to Appendix D.3.2. The main proof idea consists in the uniform approximation of an unbounded analytic function which cannot be represented by a neural network with activation function ϱ\varrho, since ϱ\varrho itself is bounded. Finally, the proof of the statement under Condition (iii) can be found in Appendix D.3.3. Its main idea consists in the approximation of the function x(x)+max{r,q}∉Cmax{r,q}.x\mapsto(x)^{\max\{r,q\}}_{+}\not\in C^{\max\{r,q\}}.

Theorem 3.3 applies to all activation functions listed in Table 1 except for the ReLU and the parametric ReLU. To be more precise,

Condition (i) is fulfilled by the function xmax{0,x}kx\mapsto\max\{0,x\}^{k} for k2k\geq 2, and by the exponential linear unit, the softsign function, and the inverse square root linear unit.

Condition (ii) is fulfilled by the inverse square root unit, the sigmoid function, the tanh\tanh function, and the arctan\arctan function.

Condition (iii) (with r=1r=1 and q=0q=0) is fulfilled by the softplus function.

For the proof of this statement, we refer to Appendix D.4. In particular, for the proof of (1), we refer to Appendix D.4.1, the proof of (2) is clear and for the proof of (3), we refer to Appendix D.4.2. ∎

3 The phenomenon of exploding weights

We have just seen that the realization set RNNϱ[B,B]d(S)\mathcal{RNN}_{\varrho}^{[-B,B]^{d}}(S) is not closed in Lp(μ)L^{p}(\mu) for any p(0,)p\in(0,\infty) and every practically relevant activation function. Furthermore, for a variety of activation functions, we have seen that RNNϱ[B,B]d(S)\mathcal{RNN}_{\varrho}^{[-B,B]^{d}}(S) is not closed in C([B,B]d)C([-B,B]^{d}). The situation is substantially different if the weights are taken from a compact subset:

The proof of this statement is based on the continuity of the realization map and can be found in Appendix D.5. ∎

Proposition 3.5 helps to explain the phenomenon of exploding network weights that is sometimes observed during the training of neural networks. Indeed, let us assume that R:=RNNϱ[B,B]d(S)\mathcal{R}:=\mathcal{RNN}_{\varrho}^{[-B,B]^{d}}(S) is not closed in Y\mathcal{Y}, where YLp(μ)\mathcal{Y}\coloneqq L^{p}(\mu) for a Borel measure μ\mu on [B,B]d[-B,B]^{d}, or YC([B,B]d)\mathcal{Y}\coloneqq C([-B,B]^{d}); as seen in Sections 3.1 and 3.2, this is true under mild assumptions on ϱ\varrho. Then, it follows that there exists a function fYf\in\mathcal{Y} which does not have a best approximation in R\mathcal{R}, meaning that there does not exist any gRg\in\mathcal{R} satisfying

in fact, one can take any fRRf\in\overline{\mathcal{R}}\setminus\mathcal{R}. Next, recall from Proposition 3.5 that the subset of R\mathcal{R} that contains only realizations of networks with uniformly bounded weights is compact.

The argument above only deals with the approximation problem in C([B,B]d)C([-B,B]^{d}) or in Lp(μ)L^{p}(\mu) for p(0,)p\in(0,\infty). In practice, one is often not concerned with these norms, but instead wants to minimize an empirical loss function over R\mathcal{R}. For the empirical square loss, this loss function takes the form

Here, σΩ\sigma_{\Omega} is the marginal probability distribution of σ\sigma on Ω\Omega, and Eσ(fσ)\mathcal{E}_{\sigma}(f_{\sigma}) is called the Bayes risk; it is the minimal expected loss achievable by any (measurable) function.

In this context of a statistical learning problem, we have the following result regarding exploding weights:

A compact way of stating Proposition 3.6 is that, if fσf_{\sigma} has no best approximation in RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) with respect to L2(σΩ)\|\cdot\|_{L^{2}(\sigma_{\Omega})}, then the weights of the minimizers (or approximate minimizers) of the empirical square loss explode for increasing numbers of samples.

Since σ\sigma is unknown in applications, it is indeed possible that fσf_{\sigma} has no best approximation in the set of neural networks. As just one example, this is true if σΩ\sigma_{\Omega} is any Borel probability measure on Ω\Omega and if σ\sigma is the distribution of (X,g(X))(X,g(X)), where XσΩX\sim\sigma_{\Omega} and gL2(σΩ)g\in L^{2}(\sigma_{\Omega}) is bounded and satisfies gRNNϱΩ(S)RNNϱΩ(S)g\in\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}\setminus\mathcal{RNN}_{\varrho}^{\Omega}(S), with the closure taken in L2(σΩ)L^{2}(\sigma_{\Omega}). The existence of such a function gg is guaranteed by Theorem 3.1 if suppσΩ\operatorname*{supp}\sigma_{\Omega} is uncountable.

For the proof of Proposition 3.6, we refer to Appendix D.6. The proof is based on classical arguments of statistical learning theory as given in . ∎

In this subsection, we analyze the closedness of sets of realizations of neural networks with respect to the ReLU or the parametric ReLU activation function in C(Ω)C(\Omega), mostly for the case Ω=[B,B]d\Omega=[-B,B]^{d}. We conjecture that the set of (realizations of) ReLU networks of a fixed complexity is closed in C(Ω)C(\Omega), but were not able to prove such a result in full generality. In two special cases, namely when the networks have only two layers, or when at least the scaling weights are bounded, we can show that the associated set of ReLU realizations is closed in C(Ω)C(\Omega); see below.

We begin by analyzing the set of realizations with uniformly bounded scaling weights and possibly unbounded biases, before proceeding with the analysis of two layer ReLU networks.

As our second result in this section, we show that the set of realizations of two-layer neural networks with arbitrary scaling weights and biases is closed in C([B,B]d),C([-B,B]^{d}), if the activation is the parametric ReLU. It is a fascinating question for further research whether this also holds for deeper networks.

Failure of inverse instability of the realization map

Finally, if ϱ\varrho is globally Lipschitz continuous, then there is a constant C=C(ϱ,S)>0C=C(\varrho,S)>0 such that

For the proof of this statement, we refer to Appendix E.1. ∎

then the realizations of Φ,Ψ\Phi,\Psi are identical.

In this section, our main goal is to determine whether, up to the failure of injectivity, the realization map is a homeomorphism onto its range; mathematically, this means that we want to determine whether the realization map is a quotient map. We will see that this is not the case.

We finally rephrase the preceding result in more topological terms:

For the proof of the statement, we refer to Appendix E.3. ∎

Acknowledgements

P.P. and M.R. were supported by the DFG Collaborative Research Center TRR 109 “Discretization in Geometry and Dynamics”. P.P. was supported by a DFG Research Fellowship ”Shearlet-based energy functionals for anisotropic phase-field methods”. M.R. is supported by the Berlin Mathematical School. F.V. acknowledges support from the European Commission through DEDALE (contract no. 665044) within the H2020 Framework Program.

We would like to thank Dave L. Renfro for bringing the paper to our attention.

References

Appendix A Notation

Given functions (fi)in(f_{i})_{i\in\underline{n}} with fi:XiYif_{i}:X_{i}\to Y_{i}, we consider three different types of products between these maps: The cartesian product of f1,,fnf_{1},\dots,f_{n} is

Finally, the direct sum of f1,,fnf_{1},\dots,f_{n} is defined if X1==XnX_{1}=\dots=X_{n}, and given by

Appendix B Auxiliary results: Operations with neural networks

This part of the appendix is devoted to auxiliary results that are connected with basic operations one can perform with neural networks and which we will frequently make use of in the proofs below.

We start by showing that one can “enlarge” a given neural network in such a way that the realizations of the original network and the enlarged network coincide. To be more precise, the following holds:

Setting N0N~0dN_{0}\coloneqq\widetilde{N}_{0}\coloneqq d, and N~LNL\widetilde{N}_{L}\coloneqq N_{L}, we define Φ~((A~1,b~1),,(A~L,b~L))\widetilde{\Phi}\coloneqq\left((\widetilde{A}_{1},\widetilde{b}_{1}),\dots,(\widetilde{A}_{L},\widetilde{b}_{L})\right) by

Another operation that we can perform with networks is concatenation, as given in the following definition.

Then, we call Φ1\newmoonΦ2\Phi^{1}{\raisebox{2.0pt}{\tiny\newmoon}\,}\Phi^{2} the concatenation of Φ1\Phi^{1} and Φ2\Phi^{2}.

We claim that there is some C0>0C_{0}>0 such that ϱC(x)xε|\varrho_{C}(x)-x|\leq\varepsilon^{\prime} for all x[BLε,B+Lε]x\in[-B-L\varepsilon,B+L\varepsilon] and all CC0C\geq C_{0}. To see this, first note by definition of the derivative that there is some δ>0\delta>0 with

Here we implicitly used that s0=ϱ(x0)0s_{0}=\varrho^{\prime}(x_{0})\neq 0 to ensure that the right-hand side is a positive multiple of t|t|. Now, set C0(B+L)/δC_{0}\coloneqq(B+L)/\delta, and let CC0C\geq C_{0} be arbitrary. Note because of εε1\varepsilon^{\prime}\leq\varepsilon\leq 1 that every x[BLε,B+Lε]x\in[-B-L\varepsilon,B+L\varepsilon] satisfies xB+L|x|\leq B+L. Hence, if we set tx/Ct\coloneqq x/C, then tδ|t|\leq\delta. Therefore,

Note that ϱC\varrho_{C} is differentiable at with derivative ϱC(0)=Cs0ϱ(x0)1C=1\varrho_{C}^{\prime}(0)=\frac{C}{s_{0}}\varrho^{\prime}(x_{0})\frac{1}{C}=1, thanks to the chain rule.

Using these preliminary observations, we now construct the neural networks ΦεB\Phi_{\varepsilon}^{B} and Φε,iB\Phi_{\varepsilon,i}^{B}. Define {\Phi_{0}^{C}\coloneqq\big{(}(A_{1},b_{1}),(A_{2},b_{2})\big{)}}, where

where ϱC\varrho_{C} is applied L1L-1 times.

Since ϱC(x)xεε|\varrho_{C}(x)-x|\leq\varepsilon^{\prime}\leq\varepsilon for all x[BLε,B+Lε]x\in[-B-L\varepsilon,B+L\varepsilon], it is not hard to see by induction that

where ϱC\varrho_{C} is applied tLt\leq L times. Therefore, since ε=ε/(dL)\varepsilon^{\prime}=\varepsilon/(dL), we conclude for CC0C\geq C_{0} that

We proceed with the second part of the proposition. We first prove the statement for i=1i=1. Let \widetilde{\Phi}_{1}^{C}\coloneqq\big{(}(A_{1}^{\prime},b_{1}^{\prime}),(A_{2}^{\prime},b_{2}^{\prime})\big{)}, where

We have Φ~1CNN((d,1,1))\widetilde{\Phi}_{1}^{C}\in\mathcal{NN}((d,1,1)). Next, define \widetilde{\Phi}_{2}^{C}\coloneqq\big{(}(A_{1}^{\prime\prime},b_{1}^{\prime\prime}),(A_{2}^{\prime\prime},b_{2}^{\prime\prime})\big{)}, where

We have Φ~2CNN((1,1,1))\widetilde{\Phi}_{2}^{C}\in\mathcal{NN}((1,1,1)). Setting Φ~CΦ~2C\newmoon\newmoonΦ~2C\newmoonΦ~1C\widetilde{\Phi}_{C}\coloneqq\widetilde{\Phi}_{2}^{C}{\raisebox{2.0pt}{\tiny\newmoon}\,}\dots\raisebox{2.0pt}{\tiny\newmoon}\,\widetilde{\Phi}_{2}^{C}{\raisebox{2.0pt}{\tiny\newmoon}\,}\widetilde{\Phi}_{1}^{C}, where we take L2L-2 concatenations (meaning L1L-1 factors), yields a neural network Φ~CNN((d,1,,1))\widetilde{\Phi}_{C}\in\mathcal{NN}((d,1,\dots,1)) (with LL layers) such that

where ϱC\varrho_{C} is applied L1L-1 times. Exactly as in the proof of the first part, this implies for CC0C\geq C_{0} that

Setting Φ~ε,1BΦ~C\widetilde{\Phi}_{\varepsilon,1}^{B}\coloneqq\widetilde{\Phi}_{C} and repeating the previous arguments yields the claim for i=1i=1. Permuting the columns of A1A_{1}^{\prime} yields the result for arbitrary i{1,,d}i\in\{1,\dots,d\}.

Appendix C Proofs and results connected to Section 2

We first establish the star-shapedness of the set of all realizations of neural networks, which is a direct consequence of the fact that the set is invariant under scalar multiplication. The following proposition provides the details.

We can choose λ=0\lambda=0 in the argument above and obtain 0RNNϱΩ(S){0\in\mathcal{RNN}^{\Omega}_{\varrho}(S)}. For every fRNNϱΩ(S)f\in\mathcal{RNN}^{\Omega}_{\varrho}(S) the line {λf ⁣:λ}\{\lambda f\colon\lambda\in\} between and ff is contained in RNNϱΩ(S)\mathcal{RNN}^{\Omega}_{\varrho}(S), since RNNϱΩ(S)\mathcal{RNN}^{\Omega}_{\varrho}(S) is closed under scalar multiplication. We conclude that RNNϱΩ(S)\mathcal{RNN}^{\Omega}_{\varrho}(S) is star-shaped with respect to the origin. ∎

Our next goal is to show that RNNϱΩ(S)\mathcal{RNN}^{\Omega}_{\varrho}(S) cannot contain infinitely many linearly independent centers.

As a preparation, we prove two related results which show that the class RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) is “small”. The main assumption for guaranteeing this is that the activation function should be locally Lipschitz continuous.

Since ϱ\varrho is locally Lipschitz continuous, Proposition 4.1 (which will be proved completely independently) shows that the realization map

As a composition of locally Lipschitz continuous functions, the map

As a corollary, we can now show that the class of neural network realizations cannot contain a subspace of large dimension.

Now, the announced estimate for the number of linearly independent centers of the set of all network realizations of a fixed size is a direct consequence.

Next, we analyze the convexity of RNNϱΩ(S)\mathcal{RNN}^{\Omega}_{\varrho}(S). As a direct consequence of Proposition C.4, we see that RNNϱΩ(S)\mathcal{RNN}^{\Omega}_{\varrho}(S) is never convex if RNNϱΩ(S)\mathcal{RNN}^{\Omega}_{\varrho}(S) contains more than a certain number of linearly independent functions.

Every element of a convex set is a center. Thus the result follows directly from Proposition C.4. ∎

Step 1: Set S(d,N1,,NL1,1)S^{\prime}\coloneqq(d,N_{1},\dots,N_{L-1},1). We first show that RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S^{\prime}) does not contain infinitely many linearly independent functions. To see this, first note that the map

Step 5: In this step, we complete the proof, by first showing for arbitrary B>0B>0 that ϱ[B,B]\varrho|_{[-B,B]} is a polynomial of degree at most NN.

Let ε,B>0\varepsilon,B>0 be arbitrary. Since ϱ\varrho is continuous, it is uniformly continuous on [B1,B+1][-B-1,B+1], that is, there is some δ(0,1)\delta\in(0,1) such that ϱ(x)ϱ(y)ε|\varrho(x)-\varrho(y)|\leq\varepsilon for all x,y[B1,B+1]x,y\in[-B-1,B+1] with xyδ|x-y|\leq\delta. Since ϱ(x0)0\varrho^{\prime}(x_{0})\neq 0 and L2L\geq 2, Proposition B.3 and Lemma B.1 imply existence of a neural network Φ~ε,BNN((d,N1,,NL1))\widetilde{\Phi}_{\varepsilon,B}\in\mathcal{NN}((d,N_{1},\dots,N_{L-1})) such that

with ϱ\varrho acting componentwise. By (C.3), it follows that there is a neural network Φε,BNN(S){\Phi_{\varepsilon,B}\in\mathcal{NN}(S^{\prime})} satisfying

where the closure is taken with respect to the sup norm, and where we implicitly used that the space on the right-hand side is a closed subspace of C([B,B])C([-B,B]), since it is a finite dimensional subspace.

Since ϱ[B,B]\varrho|_{[-B,B]} is a polynomial of degree at most NN; we see that the N+1N+1-th derivative of ϱ\varrho satisfies ϱ(N+1)0\varrho^{(N+1)}\equiv 0 on (B,B)(-B,B), for arbitrary B>0B>0. Thus, ϱ(N+1)0\varrho^{(N+1)}\equiv 0, meaning that ϱ\varrho is a polynomial. ∎

In the above proof, we used the following elementary lemma, whose proof we provide for completeness.

Let us assume towards a contradiction that

As our final ingredient for the proof of Theorem 2.1, we show that every non-constant locally Lipschitz function ϱ\varrho satisfies the technical assumptions of Proposition C.6.

Since ϱ\varrho is not constant, there is some B>0B>0 such that ϱ[B,B]\varrho|_{[-B,B]} is not constant. By assumption, ϱ\varrho is Lipschitz continuous on [B,B][-B,B]. Thus, ϱ[B,B]\varrho|_{[-B,B]} is absolutely continuous; see for instance [60, Definition 7.17]. Thanks to the fundamental theorem of calculus for the Lebesgue integral (see [60, Theorem 7.20]), this implies that ϱ[B,B]\varrho|_{[-B,B]} is differentiable almost everywhere on (B,B)(-B,B) and satisfies ϱ(y)ϱ(x)=xyϱ(t)dt\varrho(y)-\varrho(x)=\int_{x}^{y}\varrho^{\prime}(t)\,dt for Bx<yB-B\leq x<y\leq B, where ϱ(t):=0\varrho^{\prime}(t):=0 if ϱ\varrho is not differentiable at tt.

Since ϱ[B,B]\varrho|_{[-B,B]} is not constant, the preceding formula shows that there has to be some x0(B,B)x_{0}\in(-B,B) such that ϱ(x0)0\varrho^{\prime}(x_{0})\neq 0; in particular, this means that ϱ\varrho is differentiable at x0x_{0}. ∎

Now, a combination of Corollary C.5, Proposition C.6, and Lemma C.8 proves Theorem 2.1. For the application of Lemma C.8, note that if ϱ\varrho is constant, then ϱ\varrho is a polynomial, so that the conclusion of Theorem 2.1 also holds in this case.

C.2 Proof of Theorem 2.2

We first show in the following lemma that if RNNϱΩ(S)\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)} is convex, then RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) is dense in C(Ω)C(\Omega). The proof of Theorem 2.2 is given thereafter.

If RNNϱΩ(S)\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)} is convex, then RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) is dense in C(Ω)C(\Omega).

Now we are ready to prove Theorem 2.2. By assumption, RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) is not dense in C(Ω)C(\Omega). We start by proving that there exists at least one ε>0\varepsilon>0 such that the set RNNϱΩ(S)\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)} is not ε\varepsilon-convex. Suppose towards a contradiction that this is not true, so that RNNϱΩ(S)\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)} is ε\varepsilon-convex for all ε>0\varepsilon>0. This implies

where the last identity holds true, since if f~∉RNNϱΩ(S)\widetilde{f}\not\in\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}, there exists ε>0\varepsilon^{\prime}>0 such that f~fsup>ε\|\widetilde{f}-f\|_{\sup}>\varepsilon^{\prime} for all fRNNϱΩ(S)f\in\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}. Equation (C.6) shows that RNNϱΩ(S)\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)} is convex, which by Lemma C.9 implies that RNNϱΩ(S)C(Ω)\mathcal{RNN}_{\varrho}^{\Omega}(S)\subset C(\Omega) is dense, in contradiction to the assumptions of Theorem 2.2. This is the desired contradiction, showing that RNNϱΩ(S)\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)} is ε\varepsilon-convex for some ε>0\varepsilon>0.

Thus, there exists g\in\text{co}\big{(}\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}\big{)} such that gfsupε0\|g-f\|_{\sup}\geq\varepsilon_{0} for all fRNNϱΩ(S)f\in\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}. Now, let ε>0\varepsilon>0 be arbitrary. Then, εε0gco(RNNϱΩ(S))\frac{\varepsilon}{\varepsilon_{0}}g\in\text{co}(\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}), since RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) is closed under scalar multiplication. Moreover,

again due to the closedness under scalar multiplication of RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S). This shows that RNNϱΩ(S)\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)} is not ε\varepsilon-convex for any ε>0\varepsilon>0. \square

C.3 Non-dense network sets

applying one of the arithmetic operations +,,×+,-,\times, and // on real numbers;

jumps conditioned on comparisons of real numbers using the following operators: <,>,,,=,<,>,\leq,\geq,=,\neq.

Using this result, we can now show that the realization sets of networks with activation functions that are computable by elementary operations are never dense in Lp(Ω)L^{p}(\Omega) or C(Ω)C(\Omega).

Then. we have YRNNϱΩ(S)Y\overline{\mathcal{Y}\cap\mathcal{RNN}_{\varrho}^{\Omega}(S)}\subsetneq\mathcal{Y}.

The considerations from before the statement of the proposition show that

Since Lp(Ω)FLp(Ω)L^{p}(\Omega)\cap\mathcal{F}\subset L^{p}(\Omega) is dense, there is for each b{0,1}mb\in\{0,1\}^{m} some fbFLp(Ω)f_{b}\in\mathcal{F}\cap L^{p}(\Omega) such that fbgbLpprd/(21+pm2m)\|f_{b}-g_{b}\|_{L^{p}}^{p}\leq r^{d}/(2^{1+p}\cdot m\cdot 2^{m}). If we set Ωb:={xΩ ⁣:fb(x)gb(x)1/2}\Omega_{b}:=\{x\in\Omega\colon|f_{b}(x)-g_{b}(x)|\geq 1/2\}, then \mathds1Ωb2pfbgbp{\mathds{1}}_{\Omega_{b}}\leq 2^{p}\cdot|f_{b}-g_{b}|^{p}, and hence

and thus λ(b{0,1}mΩb)rd2m\lambda(\bigcup_{b\in\{0,1\}^{m}}\Omega_{b})\leq\frac{r^{d}}{2m}, where λ\lambda is the Lebesgue measure. Hence, λ(Mjb{0,1}mΩb)rd2m>0\lambda(M_{j}\setminus\bigcup_{b\in\{0,1\}^{m}}\Omega_{b})\geq\frac{r^{d}}{2m}>0, so that we can choose for each jmj\in\underline{m} some xjMjb{0,1}mΩbx_{j}\in M_{j}\setminus\bigcup_{b\in\{0,1\}^{m}}\Omega_{b}. We then have

and hence fb(xj)>1/2f_{b}(x_{j})>1/2 if bj=1b_{j}=1 and fb(xj)<1/2f_{b}(x_{j})<1/2 otherwise. Thus, if we set r1rm12r_{1}\coloneqq\dots\coloneqq r_{m}\coloneqq\frac{1}{2}, then we have as above that {\mathds{1}}_{[0,\infty)}\big{(}f_{b}(x_{j})-r_{j}\big{)}=b_{j} for all jmj\in\underline{m} and b{0,1}mb\in\{0,1\}^{m}. The remainder of the proof is as for Y=C(Ω)\mathcal{Y}=C(\Omega). ∎

Note that the following activation functions are computable by elementary operations: any piecewise polynomial function (in particular, the ReLU and the parametric ReLU), the exponential linear unit, the softsign (since the absolute value can be computed using a case distinction), the sigmoid, and the tanh\tanh. Thus, the preceding proposition applies to each of these activation functions.

Appendix D Proofs of the results in Section 3

The proof of Theorem 3.1 is crucially based on the following lemma:

Now, if η<1\eta<1, then any rξC(v;δ,η)r\xi\in C(v;\delta,\eta) with ξv<η|\xi-v|<\eta satisfies v,ξ=v,v+v,ξv1ξv>0,\langle v,\xi\rangle=\langle v,v\rangle+\langle v,\xi-v\rangle\geq 1-|\xi-v|>0, so that C(x,v;δ,η)Bδ(x)H+(x,v)C(x^{\ast},v;\delta,\eta)\subset B_{\delta}(x^{\ast})\cap H_{+}(x^{\ast},v) and C(x,v;δ,η)Bδ(x)H(x,v)C(x^{\ast},-v;\delta,\eta)\subset B_{\delta}(x^{\ast})\cap H_{-}(x^{\ast},v). From this it is easy to see that if x,vx^{\ast},v are as provided by the result in (for E=KE=K), then indeed xKH+(x,v)KH(x,v)x^{\ast}\in\overline{K\cap H_{+}(x^{\ast},v)}\cap\overline{K\cap H_{-}(x^{\ast},v)}.

Assume towards a contradiction that such a continuous function gg exists. Recall (see for instance [19, Section 7.4]) that the support of μ\mu is defined as

In particular, if U[B,B]dU\subset[-B,B]^{d} is open with UsuppμU\cap\operatorname*{supp}\mu\neq\varnothing, then μ(U)>0\mu(U)>0.

We now prove Theorem 3.1. Set Ω[B,B]d\Omega\coloneqq[-B,B]^{d} and define N~i:=1\widetilde{N}_{i}:=1 for i=1,,L2{i=1,\dots,L-2} and N~L1:=2\widetilde{N}_{L-1}:=2 if ϱ\varrho is unbounded, while N~L1:=1\widetilde{N}_{L-1}:=1 otherwise. We show that for ϱ\varrho as in the statement of the theorem there exists a sequence of functions in RNNϱΩ((d,N~1,,N~L1,1))RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}((d,\widetilde{N}_{1},\dots,\widetilde{N}_{L-1},1))\subset\mathcal{RNN}_{\varrho}^{\Omega}(S) such that the sequence converges (in Lp(μ)L^{p}(\mu)) to a bounded, discontinuous limit fL(μ)f\in L^{\infty}(\mu), meaning that ff does not have a continuous representative, even after possibly changing it on a μ\mu-null-set. Since RNNϱΩ(S)C(Ω),\mathcal{RNN}_{\varrho}^{\Omega}(S)\subset C(\Omega), this will show that fRNNϱΩ(S)RNNϱΩ(S)f\in\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}\setminus\mathcal{RNN}_{\varrho}^{\Omega}(S).

Next, using Proposition B.3, choose a neural network ΨNN((d,1,,1))\Psi\in\mathcal{NN}((d,1,\dots,1)) with L1L-1 layers such that

We now distinguish the cases given in Assumption (iv)(a) and (b) of Theorem 3.1.

Then, Φn\newmoonΦNN((d,N~1,,N~L1,1))\Phi_{n}{\raisebox{2.0pt}{\tiny\newmoon}\,}\Phi\in\mathcal{NN}((d,\widetilde{N}_{1},\dots,\widetilde{N}_{L-1},1)). Now, let us define

since ξnx\xi_{n}^{x}\to\infty as n,nNxn\to\infty,n\geq N_{x}. Analogously, it follows for xΩH(x,v)x\in\Omega\cap H_{-}(x^{\ast},v) that limnhn(x)=λ\lim_{n\to\infty}h_{n}(x)=\lambda^{\prime}. Hence, setting γ:=ϱ(0)ϱ(1)\gamma:=\varrho(0)-\varrho(-1), we see for each xΩx\in\Omega that

By the monotonicity and since ϱ\varrho is not constant (because of ϱ(x0)0\varrho^{\prime}(x_{0})\neq 0), we have c>cjc>c^{\prime}\vphantom{\sum_{j}}.

Then, Φ~n\newmoonΦNN((d,N~1,,N~L1,1))\widetilde{\Phi}_{n}{\raisebox{2.0pt}{\tiny\newmoon}\,}\Phi\in\mathcal{NN}((d,\widetilde{N}_{1},\dots,\widetilde{N}_{L-1},1)). Now, let us define

Since each of the h~n\widetilde{h}_{n} is continuous and Ω\Omega is compact, we have h~nLp(μ)\widetilde{h}_{n}\in L^{p}(\mu) for all p(0,]p\in(0,\infty]. Equation (D.3) implies that J(x)>0J(x)>0 for all xΩH+(x,v)x\in\Omega\cap H_{+}(x^{\ast},v). This in turn yields that

Similarly, the fact that J(x)<0J(x)<0 for all xΩH(x,v)x\in\Omega\cap H_{-}(x^{\ast},v) yields

Combining (D.4) with (D.5) yields for all xΩx\in\Omega that

D.2 Proof of Corollary 3.2

Next, the softsign, the inverse square root unit, the sigmoid, the tanh\tanh, and the arctan\arctan function are all bounded, and thus satisfy condition (iv)(b) of Theorem 3.1. Thus, all that remains is to verify condition (iv)(a) of Theorem 3.1 for the remaining activation functions:

For the ReLU ϱ(x)=max{0,x}\varrho(x)=\max\{0,x\}, condition (iv)(a) is satisfied with λ=1\lambda=1 and λ=0λ\lambda^{\prime}=0\neq\lambda.

For the parametric ReLU ϱ(x)=max{ax,x}\varrho(x)=\max\{ax,x\} (with a0a\geq 0, a1a\neq 1), Condition (iv)(a) is satisfied with λ=max{1,a}\lambda=\max\{1,a\} and λ=min{1,a}\lambda^{\prime}=\min\{1,a\}, where λλ\lambda\neq\lambda^{\prime} since a1a\neq 1.

For the exponential linear unit ϱ(x)=x\mathds1[0,)(x)+(ex1)\mathds1(,0)(x)\varrho(x)=x{\mathds{1}}_{[0,\infty)}(x)+(e^{x}-1){\mathds{1}}_{(-\infty,0)}(x), Condition (iv)(a) is satisfied for λ=1\lambda=1 and λ=limxex=0λ\lambda^{\prime}=\lim_{x\to-\infty}e^{x}=0\neq\lambda.

For the inverse square root linear unit ϱ(x)=x\mathds1[0,)(x)+x1+ax2\mathds1(,0)(x)\varrho(x)=x{\mathds{1}}_{[0,\infty)}(x)+\frac{x}{\sqrt{1+ax^{2}}}{\mathds{1}}_{(-\infty,0)}(x), the quotient rule shows that for x<0x<0 we have

Therefore, Condition (iv)(a) is satisfied for λ=1\lambda=1 and λ=limxϱ(x)=0λ\lambda^{\prime}=\lim_{x\to-\infty}\varrho^{\prime}(x)=0\neq\lambda.

For the softplus function ϱ(x)=ln(1+ex)\varrho(x)=\ln(1+e^{x}), Condition (iv)(a) is satisfied for

D.3 Proof of Theorem 3.3

then εn0\varepsilon_{n}\to 0 as nn\to\infty.

Now, by the Lipschitz continuity of ϱ(λ)\varrho(\lambda\cdot) and Equation (D.7), we conclude that

Here, the last step used that ξnxx1n11|\xi_{n}^{x}-x_{1}|\leq n^{-1}\leq 1, so that x1,ξnx[(B+1),B+1]x_{1},\xi_{n}^{x}\in[-(B+1),B+1].

Since fnϱλf_{n}\to\varrho_{\lambda} uniformly, where ϱλCm(Ω)\varrho_{\lambda}\notin C^{m}(\Omega), and hence ϱλRNNϱΩ(S)\varrho_{\lambda}\notin\mathcal{RNN}_{\varrho}^{\Omega}(S), we thus see that RNNϱΩ(S)\mathcal{RNN}_{\varrho}^{\Omega}(S) is not closed in C(Ω)C(\Omega). \square

D.3.2 Proof of Theorem 3.3 under Condition (ii)

Since ϱ\varrho^{\prime} is continuous, we conclude that

We set ΦnΦn1\newmoonΦn2NN(S)\Phi_{n}\coloneqq\Phi_{n}^{1}{\raisebox{2.0pt}{\tiny\newmoon}\,}\Phi_{n}^{2}\in\mathcal{NN}(S^{\prime}) and note for all xΩx\in\Omega that

D.3.3 Proof of Theorem 3.3 under Condition (iii)

for a constant c0=c0(B,s,r,q)>0c_{0}=c_{0}(B,s,r,q)>0. Moreover, for x[0,B]x\in[0,B], we have

D.4 Proof of Corollary 3.4

D.4.2 Proof of Corollary 3.4.(3)

and for x0x\leq 0, we have ln(1+ex)x01+ln(2)|\ln(1+e^{x})-x^{0}|\leq 1+\ln(2). \square

D.5 Proof of Proposition 3.5

The set ΘC\Theta_{C} is closed and bounded in the normed space \big{(}\mathcal{NN}(S),\|\cdot\|_{\mathcal{NN}(S)}\big{)}. Thus, the Heine-Borel Theorem implies the compactness of ΘC\Theta_{C}. By Proposition 4.1 (which will be proved independently), the map

D.6 Proof of Proposition 3.6

By compactness of RC\mathcal{R}^{C}, we can choose gRCg\in\mathcal{R}^{C} satisfying fσgL2(σΩ)2=infhRCfσhL2(σΩ)2.\|f_{\sigma}-g\|_{L^{2}(\sigma_{\Omega})}^{2}=\inf_{h\in\mathcal{R}^{C}}\|f_{\sigma}-h\|_{L^{2}(\sigma_{\Omega})}^{2}. Define M:=infhRNNϱΩ(S)fσhL2(σΩ)2M:=\inf_{h\in\mathcal{RNN}_{\varrho}^{\Omega}(S)}\|f_{\sigma}-h\|_{L^{2}(\sigma_{\Omega})}^{2}. Since by assumption the infimum defining MM is not attained, we have fσgL2(σΩ)2>M\|f_{\sigma}-g\|_{L^{2}(\sigma_{\Omega})}^{2}>M, so that there are hRNNϱΩ(S)h\in\mathcal{RNN}_{\varrho}^{\Omega}(S) and δ>0\delta>0 with fσgL2(σΩ)22δ+fσhL2(σΩ)2\|f_{\sigma}-g\|_{L^{2}(\sigma_{\Omega})}^{2}\geq 2\delta+\|f_{\sigma}-h\|_{L^{2}(\sigma_{\Omega})}^{2}. Let C>0C^{\prime}>0 with hRCh\in\mathcal{R}^{C^{\prime}}.

We now claim that ΩNcΩN,δ/3(1)ΩN,δ/3(2)\Omega_{N}^{c}\subset\Omega_{N,\delta/3}^{(1)}\cup\Omega_{N,\delta/3}^{(2)}. Once we prove this, we get

By rearranging and recalling the choice of hh and δ\delta, we finally see

which is the desired contradiction. \square

D.7 Proof of Proposition 3.7

The main ingredient of the proof will be to show that one can replace a given sequence of networks with CC-bounded scaling weights by another sequence with CC-bounded scaling weights that also has bounded biases. Then one can apply Proposition 3.5.

each network Ψn(m)\Psi_{n}^{(m)}, nImn\in I_{m}, has CC-bounded scaling weights;

For arbitrary i{1,,NL}i\in\{1,\dots,N_{L}\} and xΩx\in\Omega, this implies

Thus, it remains to construct the networks Ψn(m)\Psi_{n}^{(m)} for nImn\in I_{m} (and the sets ImI_{m}) for m{0,,L1}m\in\{0,\dots,L-1\} with the properties (A)–(C) from above.

Thus, it remains to construct d(n),e(n),C(n)d^{(n)},e^{(n)},C^{(n)} for nIm+1n\in I_{m+1} (and the set Im+1I_{m+1} itself) as described around Equation (D.12). To this end, for nIm(0)n\in I_{m}^{(0)} and k{1,,Nm+1}k\in\{1,\dots,N_{m+1}\}, define

To see that these choices indeed fulfil the conditions outlined around Equation (D.12) for a suitable choice of Im+1Im(0)I_{m+1}\subset I_{m}^{\left(0\right)}, first note that (d(n))nIm(0)\left(d^{(n)}\right)_{n\in I_{m}^{\left(0\right)}} is indeed a bounded family. Furthermore, \big{|}C_{k,i}^{(n)}\big{|}\leq\big{|}(B_{m+1}^{(n,m)})_{k,i}\big{|} for all k{1,,Nm+1}k\in\{1,\dots,N_{m+1}\} and i{1,,Nm}i\in\{1,\dots,N_{m}\}, which easily implies C(n)maxBm+1(n,m)maxC\|C^{(n)}\|_{\max}\leq\|B_{m+1}^{(n,m)}\|_{\max}\leq C for all nIm(0)n\in I_{m}^{(0)}. Thus, it remains to verify equation (D.12) itself. But the estimate Bm+1(n,m)maxC\|B_{m+1}^{(n,m)}\|_{\max}\leq C also implies

As a final preparation, note that ϱm+1=ϱ××ϱ\varrho_{m+1}=\varrho\times\cdots\times\varrho is a cartesian product of ReLU functions, since mL2m\leq L-2. Now, for k{1,,Nm+1}k\in\{1,\dots,N_{m+1}\} there are three cases:

where the last step used our choice of d(n),e(n),C(n)d^{(n)},e^{(n)},C^{(n)}, and the fact that (C(n)x+d(n))k0\left(C^{(n)}x+d^{(n)}\right)_{k}\geq 0 by Equation (D.13).

where the last step used our choice of d(n),e(n),C(n)d^{(n)},e^{(n)},C^{(n)}.

Overall, we have thus shown that Equation (D.12) is satisfied for all nIm+1n\in I_{m+1}, where

is clearly an infinite set, since Im(0)I_{m}^{\left(0\right)} is. ∎

Hence, (Ψn)nI(\Psi_{n})_{n\in I} is a bounded, infinite family in the finite dimensional vector space NN(S)\mathcal{NN}(S). Thus, there is a further infinite set I1II_{1}\subset I such that ΨnΨNN(S)\Psi_{n}\to\Psi\in\mathcal{NN}(S) as nn\to\infty in I1I_{1}.

But since Ω\Omega is bounded, say Ω[R,R]d\Omega\subset[-R,R]^{d}, the realization map

D.8 Proof of Theorem 3.8

For the proof of Theorem 3.8, we will use a careful analysis of the singularity hyperplanes of functions of the form xϱa(α,x+β)x\mapsto\varrho_{a}(\langle\alpha,x\rangle+\beta), that is, the hyperplane on which this function is not differentiable. To simplify this analysis, we first introduce a convenient terminology and collect quite a few auxiliary results.

By discarding those (αj,βj)(\alpha_{j},\beta_{j}) for which x0Sαj,βjx_{0}\notin S_{\alpha_{j},\beta_{j}}, we can assume that x0Sαj,βjx_{0}\in S_{\alpha_{j},\beta_{j}} for all jNj\in\underline{N}.

Assume towards a contradiction that the claim of the lemma is false; that is,

But since VV is a vector space, this easily implies V=αV=\alpha^{\perp}, that is, z,αj=0\langle z,\alpha_{j}\rangle=0 for all zαz\in\alpha^{\perp}. In other words, ααj\alpha^{\perp}\subset\alpha_{j}^{\perp}, and then α=αj\alpha^{\perp}=\alpha_{j}^{\perp} by a dimension argument, since α,αj0\alpha,\alpha_{j}\neq 0.

Hence, spanα=(α)=(αj)=spanαj\operatorname{span}\alpha=(\alpha^{\perp})^{\perp}=(\alpha_{j}^{\perp})^{\perp}=\operatorname{span}\alpha_{j}. Because of α=αj=1|\alpha|=|\alpha_{j}|=1, we thus see α=εαj\alpha=\varepsilon\,\alpha_{j} for some ε{±1}\varepsilon\in\{\pm 1\}. Finally, since x0Sα,βSαj,βjx_{0}\in S_{\alpha,\beta}\cap S_{\alpha_{j},\beta_{j}}, we see β=α,x0=εαj,x0=εβj,\beta=-\langle\alpha,x_{0}\rangle=-\varepsilon\langle\alpha_{j},x_{0}\rangle=\varepsilon\beta_{j}, and thus (α,β)=ε(αj,βj)(\alpha,\beta)=\varepsilon(\alpha_{j},\beta_{j}), in contradiction to (α,β)≁(αj,βj)(\alpha,\beta)\not\sim(\alpha_{j},\beta_{j}). ∎

Then, there is ε>0\varepsilon>0 satisfying

We claim that there is some ε>0\varepsilon>0 such that xx0+tzUSα,βj=1NUαj,βj(ε)x\coloneqq x_{0}+tz\in U\cap S_{\alpha,\beta}\cap\bigcap_{j=1}^{N}U_{\alpha_{j},\beta_{j}}^{(\varepsilon)}. To see this, note for jNJj\in\underline{N}\setminus J that x0Sαj,βjx_{0}\in S_{\alpha_{j},\beta_{j}}, and hence

since z,αj0\langle z,\alpha_{j}\rangle\neq 0 for all jNJj\in\underline{N}\setminus J, by choice of zz. By combining all our observations, we see that x0+tzUSα,βj=1NUαj,βj(ε){x_{0}+tz\in U\cap S_{\alpha,\beta}\cap\bigcap_{j=1}^{N}U_{\alpha_{j},\beta_{j}}^{(\varepsilon)}} for εmin{ε1,ε2}>0\varepsilon\coloneqq\min\{\varepsilon_{1},\varepsilon_{2}\}>0. ∎

Then, the family (hi)i=1,,N+1(h_{i})_{i=1,\dots,N+1} is linearly independent.

Because of x0USαj,βjx_{0}\in U\cap S_{\alpha_{j},\beta_{j}}, Lemma D.6 shows that hαj,βj(a)Uh_{\alpha_{j},\beta_{j}}^{(a)}|_{U} is not differentiable at x0x_{0}. On the other hand, we have

where the right-hand side is differentiable at x0x_{0}, since each summand is easily seen to be differentiable on the open set VV, with x0VUx_{0}\in V\cap U. ∎

The continuous function Ω(0,),xα,x+β\Omega\to(0,\infty),x\mapsto|\langle\alpha,x\rangle+\beta|, which is well-defined by assumption, attains a minimum ε=minxΩα,x+β>0\varepsilon=\min_{x\in\Omega}|\langle\alpha,x\rangle+\beta|>0. ∎

Step 2: We claim that ξ1ξ2spanα\xi_{1}-\xi_{2}\in\operatorname{span}\alpha. To see this, consider the modified function

which is continuous and satisfies f~0\widetilde{f}\equiv 0 on UWα,βU\cap W_{\alpha,\beta}^{-} and f~(x)=θ,x+ω\widetilde{f}(x)=\langle\theta,x\rangle+\omega on UWα,β+U\cap W_{\alpha,\beta}^{+}, where we defined θξ1ξ2\theta\coloneqq\xi_{1}-\xi_{2} and ωω1ω2\omega\coloneqq\omega_{1}-\omega_{2}.

Since we saw in Step 1 that USα,βUWα,β±U\cap S_{\alpha,\beta}\subset\overline{U\cap W_{\alpha,\beta}^{\pm}}, we thus get by continuity of f~\widetilde{f} that

But by assumption on UU, there is some x0USα,βx_{0}\in U\cap S_{\alpha,\beta}. For arbitrary vαv\in\alpha^{\perp}, we then have x0+tvUSα,βx_{0}+tv\in U\cap S_{\alpha,\beta} for all t(ε,ε)t\in(-\varepsilon,\varepsilon) and a suitable ε=ε(v)>0\varepsilon=\varepsilon(v)>0, since UU is open. Hence, 0=θ,x0+tv+ω=tθ,v0=\langle\theta,x_{0}+tv\rangle+\omega=t\cdot\langle\theta,v\rangle for all t(ε,ε)t\in(-\varepsilon,\varepsilon), and thus vθv\in\theta^{\perp}. In other words, αθ\alpha^{\perp}\subset\theta^{\perp}, and thus spanα=(α)(θ)θ=ξ1ξ2,\operatorname{span}\alpha=(\alpha^{\perp})^{\perp}\supset(\theta^{\perp})^{\perp}\ni\theta=\xi_{1}-\xi_{2}, as claimed in this step.

Because of x0Sα,βx_{0}\in S_{\alpha,\beta}, we then have g(x0)=ζ,x0+κ=f(x0)g(x_{0})=\langle\zeta,x_{0}\rangle+\kappa=f(x_{0}). Furthermore, since ϱa(x)=x\varrho_{a}(x)=x for x0x\geq 0, we see for all xUWα,β+x\in U\cap W_{\alpha,\beta}^{+} that

Here, the last step used that f(x)=ξ1,x+ω1f(x)=\langle\xi_{1},x\rangle+\omega_{1} for xUWα,β+x\in U\cap W_{\alpha,\beta}^{+}, and that x0USα,βUWα,β+x_{0}\in U\cap S_{\alpha,\beta}\subset\overline{U\cap W_{\alpha,\beta}^{+}} by Step 1, so that we get f(x0)=ξ1,x0+ω1f(x_{0})=\langle\xi_{1},x_{0}\rangle+\omega_{1} as well.

Likewise, since ϱa(t)=at\varrho_{a}(t)=a\,t for t<0t<0, we see for xUWα,βx\in U\cap W_{\alpha,\beta}^{-} that

In combination, Equations (D.15) and (D.16) show f(x)=g(x)f(x)=g(x) for all xU(Wα,β+Wα,β)x\in U\cap(W_{\alpha,\beta}^{+}\cup W_{\alpha,\beta}^{-}). Since this set is dense in UU by Step 1, we are done. ∎

With all of these preparations, we can finally prove Theorem 3.8.

Step 4 (Showing that the jj-th neuron is eventually affine-linear, for jJj\in J): Since Step 3 shows that the claim holds in case of r=N0r=N_{0}, we will from now on consider only the case where r<N0r<N_{0}.

For jJj\in J, there are two cases: In case of (b1)j[0,](b_{1})_{j}\in[0,\infty], define

If otherwise (b1)j[,0)(b_{1})_{j}\in[-\infty,0), define

Next, for arbitrary 0<δ<B0<\delta<B, we define Ωδ[(Bδ),Bδ]d\Omega_{\delta}\coloneqq[-(B-\delta),B-\delta]^{d}. Note that since Sα(i),β(i)ΩS_{\alpha^{(i)},\beta^{(i)}}\cap\Omega^{\circ}\neq\varnothing for all iri\in\underline{r}, there is some δ0>0\delta_{0}>0 such that Sα(i),β(i)((Bδ),Bδ)dS_{\alpha^{(i)},\beta^{(i)}}\cap(-(B-\delta),B-\delta)^{d}\neq\varnothing for all iri\in\underline{r} and all 0<δδ00<\delta\leq\delta_{0}. For the remainder of this step, we will consider a fixed δ(0,δ0]\delta\in(0,\delta_{0}], and we claim that there is some N2=N2(δ)>0N_{2}=N_{2}(\delta)>0 such that

simply because (b1nk)j(b1)j(b_{1}^{n_{k}})_{j}\to(b_{1})_{j} and ϱa(x)=x\varrho_{a}(x)=x if x0x\geq 0, and ϱa(x)=ax\varrho_{a}(x)=ax if x<0x<0. Therefore, the affine-linear function

To prove Equation (D.17), we distinguish two cases for each jJj\in J; by definition of JJ, these are the only two possible cases:

Now, since the function xak,j,x+(b1nk)jx\mapsto\langle a_{k,j},x\rangle+(b_{1}^{n_{k}})_{j} is continuous, since Ω\Omega is connected (in fact convex), and since 0Ω0\in\Omega, this implies sign(ak,j,x+(b1nk)j)=sign(b1nk)j\operatorname{sign}(\langle a_{k,j},x\rangle+(b_{1}^{n_{k}})_{j})=\operatorname{sign}(b_{1}^{n_{k}})_{j} for all xΩx\in\Omega and kkjk\geq k_{j}.

With the same argument as at the end of Case 1, we thus see sign(ak,j,x+(b1nk)j)=sign(b1nk)j\operatorname{sign}(\langle a_{k,j},x\rangle+(b_{1}^{n_{k}})_{j})=\operatorname{sign}(b_{1}^{n_{k}})_{j} for all xΩδx\in\Omega_{\delta} and kkj(δ)k\geq k_{j}(\delta).

Together, the two cases prove that Equation (D.17) holds if we set N2(δ)maxjJkj(δ)N_{2}(\delta)\coloneqq\max_{j\in J}k_{j}(\delta).

To see this, let ε>0\varepsilon>0 be arbitrary, and recall Jc=i=1rJiJ^{c}=\bigcup_{i=1}^{r}J_{i}. By definition of JiJ_{i}, and by choice of α(i)\alpha^{(i)} and β(i)\beta^{(i)}, there is for each iri\in\underline{r} and jJij\in J_{i} some σj{±1}\sigma_{j}\in\{\pm 1\} satisfying

Define N4(ε)maxjJck(j)(ε)N_{4}(\varepsilon)\coloneqq\max_{j\in J^{c}}k^{(j)}(\varepsilon). Then, for kN4(ε)k\geq N_{4}(\varepsilon), iri\in\underline{r}, jJij\in J_{i}, and arbitrary xΩUα(i),β(i)(ε,±)x\in\Omega\cap U_{\alpha^{(i)},\beta^{(i)}}^{(\varepsilon,\pm)}, we have on the one hand σj(α(i),x+β(i))ε|\sigma_{j}\cdot(\langle\alpha^{(i)},x\rangle+\beta^{(i)})|\geq\varepsilon, and on the other hand

Step 6 (Proving the “almost convergence” of the sum of all jj-th neurons for jJij\in J_{i}): For iri\in\underline{r} define

In combination with Equation (D.18), we see

Recall from Step 4 that Ωδ0Sα(i),β(i)\Omega_{\delta_{0}}^{\circ}\cap S_{\alpha^{(i)},\beta^{(i)}}\neq\varnothing for all iri\in\underline{r}, by choice of δ0\delta_{0}. Therefore, Lemma D.5 shows (because of Uα,β(σ)(Uα,β(ε))U_{\alpha,\beta}^{(\sigma)}\subset(U_{\alpha,\beta}^{(\varepsilon)})^{\circ} for ε<σ\varepsilon<\sigma) for each iri\in\underline{r} that

To use this observation, note that Equations (D.20) and (D.21) show that gi(k)+qi(k)g_{i}^{(k)}+q_{i}^{(k)} converges pointwise to GiG_{i} on Bri(xi)B_{r_{i}}(x_{i}). Furthermore, since xiSα(i),β(i)x_{i}\in S_{\alpha^{(i)},\beta^{(i)}}, it is not hard to see that there is some ε0>0\varepsilon_{0}>0 with \big{(}U_{\alpha^{(i)},\beta^{(i)}}^{(\varepsilon,\pm)}\big{)}^{\circ}\cap B_{r_{i}}(x_{i})\neq\varnothing for all ε(0,ε0)\varepsilon\in(0,\varepsilon_{0}); for the details, we refer to Step 1 in the proof of Lemma D.9. Finally, as a consequence of Step 5, we see for arbitrary ε(0,ε0)\varepsilon\in(0,\varepsilon_{0}) that gi(k)+qi(k)g_{i}^{(k)}+q_{i}^{(k)} and GiG_{i} are both affine-linear on Uα(i),β(i)(ε,±)U_{\alpha^{(i)},\beta^{(i)}}^{(\varepsilon,\pm)}, at least for kk large enough (depending on ε\varepsilon). Thus, the observation from above (with Θ=Uα(i),β(i)(ε,±)\Theta=U_{\alpha^{(i)},\beta^{(i)}}^{(\varepsilon,\pm)} and U=ΘBri(xi)U=\Theta^{\circ}\cap B_{r_{i}}(x_{i})) implies that gi(k)+qi(k)Gig_{i}^{(k)}+q_{i}^{(k)}\to G_{i} pointwise on Uα(i),β(i)(ε,±)U_{\alpha^{(i)},\beta^{(i)}}^{(\varepsilon,\pm)}, for arbitrary ε(0,ε0)\varepsilon\in(0,\varepsilon_{0}).

Step 7 (Finishing the proof): For arbitrary δ(0,δ0)\delta\in(0,\delta_{0}), let us set

Then, Equations (D.19) and (D.22) imply for kN3(δ)k\geq N_{3}(\delta) that

But the latter set is dense in Ω\Omega (since its complement is a null-set), and ff and ψ+i=1rGi\psi+\sum_{i=1}^{r}G_{i} are continuous on Ω\Omega. Hence,

Recalling from Steps 3 and 4 that r<N0r<N_{0}, this implies fRNNϱaΩ((d,r+1,1))RNNϱaΩ((d,N0,1))f\in\mathcal{RNN}_{\varrho_{a}}^{\Omega}((d,r+1,1))\subset\mathcal{RNN}_{\varrho_{a}}^{\Omega}((d,N_{0},1)), as claimed. Here, we implicitly used that

Appendix E Proofs of the results in Section 4

with locally uniform convergence. Since Ω\Omega is compact, this implies uniform convergence on Ω\Omega, and thus completes the proof of the first claim.

For the induction step, let us write \Psi=\big{(}(B_{1},c_{1}),\dots,(B_{L},c_{L})\big{)}, and note that

E.2 Proof of Theorem 4.2

Next, setting z(ξ)e2πiaξ0z(\xi)\coloneqq e^{2\pi ia\xi}\neq 0, we observe that

We first note that each function faf_{a} from Step 1 is bounded. In fact, if ϱ\varrho is MM-Lipschitz, then

as long as nn1n\geq n_{1} is so large that xn,ynΩx_{n},y_{n}\in\Omega. But this implies

E.3 Proof of Corollary 4.3

Clearly, by switching to complements, we can equivalently replace “open” by “closed” everywhere.