Topological properties of the set of functions generated by neural networks of fixed size

Philipp Petersen, Mones Raslan, Felix Voigtlaender

Introduction

Neural networks, introduced in 1943 by McCulloch and Pitts , are the basis of every modern machine learning algorithm based on deep learning . The term deep learning describes a variety of methods that are based on the data-driven manipulation of the weights of a neural network. Since these methods perform spectacularly well in practice, they have become the state-of-the-art technology for a host of applications including image classification , speech recognition , game intelligence , and many more.

This success of deep learning has encouraged many scientists to pick up research in the area of neural networks after the field had gone dormant for decades. In particular, quite a few mathematicians have recently investigated the properties of different neural network architectures, hoping that this can explain the effectiveness of deep learning techniques. In this context, mathematical analysis has mainly been conducted in the context of statistical learning theory , where the overall success of a learning method is determined by the approximation properties of the underlying function class, the feasibility of optimizing over this class, and the generalization capabilities of the class, when only training with finitely many samples.

In the approximation theoretical part of deep learning research, one analyzes the expressiveness of deep neural network architectures. The universal approximation theorem demonstrates that neural networks can approximate any continuous function, as long as one uses networks of increasing complexity for the approximation. If one is interested in approximating more specific function classes than the class of all continuous functions, then one can often quantify more precisely how large the networks have to be to achieve a given approximation accuracy for functions from the restricted class. Examples of such results are . Some articles study in particular in which sense deep networks have a superior expressiveness compared to their shallow counterparts, thereby partially explaining the efficiency of networks with many layers in deep learning.

Another line of research studies the training procedures employed in deep learning. Given a set of training samples, the training process is an optimization problem over the parameters of a neural network, where a loss function is minimized. The loss function is typically a non-linear, non-convex function of the weights of the network, rendering the optimization of this function highly challenging . Nonetheless, in applications, neural networks are often trained successfully through a variation of stochastic gradient descent. In this regard, the energy landscape of the problem was studied and found to allow convergence to a global optimum, if the problem is sufficiently overparametrized; see .

The third large area of mathematical research on deep neural networks is analyzing the so-called generalization error of deep learning. In the framework of statistical learning theory , the discrepancy between the empirical loss and the expected loss of a classifier is called the generalization error. Specific bounds for this error for the class of deep neural networks were analyzed for instance in , and in more specific settings for instance in .

In this work, we study neural networks from a different point of view. Specifically, we study the structure of the set of functions implemented by neural networks of fixed size. These sets are naturally (non-linear) subspaces of classical function spaces like $L^{p}(\Omega)$ and $C(\Omega)$ for compact sets $\Omega$ .

Due to the size of the networks being fixed, our analysis is inherently non-asymptotic. Therefore, our viewpoint is fundamentally different from the analysis in the framework of statistical learning theory. Indeed, in approximation theory, the expressive power of networks growing in size is analyzed. In optimization, one studies the convergence properties of iterative algorithms—usually that of some form of stochastic gradient descent. Finally, when considering the generalization capabilities of deep neural networks, one mainly studies how and with which probability the empirical loss of a classifier converges to the expected loss, for increasing numbers of random training samples and depending on the sizes of the underlying networks.

Given this strong delineation to the classical fields, we will see that our point of view yields interpretable results describing phenomena in deep learning that are not directly explained by the classical approaches. We will describe these results and their interpretations in detail in Subsections 1.1–1.3.

We will use standard notation throughout most of the paper without explicitly introducing it. We do, however, collect a list of used symbols and notions in Appendix A. To not interrupt the flow of reading, we have deferred several auxiliary results to Appendix B and all proofs and related statements to Appendices C–E.

Before we continue, we formally introduce the notion of spaces of neural networks of fixed size.

where $x_{L}$ results from the following scheme:

Before we continue, let us note that the set $\mathcal{NN}(S)$ of all neural networks (that is, the network weights) with a fixed architecture forms a finite-dimensional vector space, which we equip with the norm

In the remainder of this introduction, we discuss our results concerning the topological properties of the sets of realizations of neural networks with fixed architecture and their interpretation in the context of deep learning. Then, we give an overview of related work. We note at this point that it is straightforward to generalize all of the results in this paper to neural networks for which one only prescribes the total number of neurons and layers and not the specific architecture.

1 Non-convexity of the set of realizations

We will show in Section 2 (Theorem 2.1) that, for a given architecture $S$ , the set $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ is not convex, except possibly when the activation function is a polynomial, which is clearly not the case for any of the activation functions that are commonly used in practice.

In fact, for a large class of activation functions (including the ReLU and the standard sigmoid activation function), the set $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ turns out to be highly non-convex in the sense that for every $r\in[0,\infty)$ , the set of functions having uniform distance at most $r$ to any function in $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ is not convex. We prove this result in Theorem 2.2 and Remark 2.3.

This non-convexity is undesirable, since for non-convex sets, there do not necessarily exist well-defined projection operators onto them. In classical statistical learning theory , the property that the so-called regression function can be uniquely projected onto a convex (and compact) hypothesis space greatly simplifies the learning problem; see [20, Section 7]. Furthermore, in applications where the realization of a network—rather than its set of weights—is the quantity of interest (for example when a network is used as an Ansatz for the solution of a PDE, as in ), our results show that the Ansatz space is non-convex. This non-convexity is inconvenient if one aims for a convergence proof of the underlying optimization algorithm, since one cannot apply convexity-based fixed-point theorems. Concretely, if a neural network is optimized by stochastic gradient descent so as to satisfy a certain PDE, then it is interesting to see if there even exists a network so that the iteration stops. In other words, one might ask whether gradient descent on the set of neural networks (potentially with bounded weights) has a fixed point. If the space of neural networks were convex and compact, then the fixed-point theorem of Schauder would guarantee the existence of such a fixed point.

2 (Non-)closedness of the set of realizations

For any fixed architecture $S$ , we show in Section 3 (Theorem 3.1) that $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ is not a closed subset of $L^{p}(\mu)$ for $0<p<\infty$ , under very mild assumptions on the measure $\mu$ and the activation function $\varrho$ . The assumptions concerning $\varrho$ are satisfied for all activation functions used in practice.

For the case $p=\infty$ , the situation is more involved: For all activation functions that are commonly used in practice—except for the (parametric) ReLU—the associated sets $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ are non-closed also with respect to the uniform norm; see Theorem 3.3. For the (parametric) ReLU, however, the question of closedness of the sets $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ remains mostly open. Nonetheless, in two special cases, we prove in Section 3.4 that the sets $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ are closed. In particular, for neural network architectures with two layers only, Theorem 3.8 establishes the closedness of $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ , where $\varrho$ is the (parametric) ReLU.

A practical consequence of the observation of non-closedness can be identified with the help of the following argument that is made precise in Subsection 3.3: We show that the set

of realizations of neural networks with a fixed architecture and all affine linear maps bounded in a suitable norm, is always closed. As a consequence, we observe the following phenomenon of exploding weights: If a function $f$ is such that it does not have a best approximation in $\mathcal{RNN}^{\Omega}_{\varrho}(S)$ , that is, if there does not exist $f^{*}\in\mathcal{RNN}^{\Omega}_{\varrho}(S)$ such that

Certainly, the presence of large coefficients will make the numerical optimization increasingly unstable. Thus, exploding weights in the sense described above are highly undesirable in practice.

The argument above discusses an approximation problem in an $L^{p}$ -norm. In practice, one usually only minimizes “empirical norms”. We will demonstrate in Proposition 3.6 that also in this situation, for increasing numbers of samples, the weights of the neural networks that minimize the empirical norms necessarily explode under certain assumptions. Note that the set-up of having a fixed architecture and a potentially unbounded number of training samples is common in applications where neural networks are trained to solve partial differential equations. There, training samples are generated during the training process .

3 Failure of inverse stability of the realization map

For both of these results—continuity and no inverse stability—we only need to assume that the activation function $\varrho$ is Lipschitz continuous and not constant.

These properties of the realization map pinpoint a potential problem that can occur when training a neural network: Let us consider a regression problem, where a network is iteratively updated by a (stochastic) gradient descent algorithm trying to minimize a loss function. It is then possible that at some iterate the loss function exhibits a very small error, even though the associated network parameters have a large distance to the optimal parameters. This issue is especially severe since a small error term leads to small steps if gradient descent methods are used in the optimization. Consequently, convergence to the very distant optimal weights will be slow even if the energy landscape of the optimization problem happens to be free of spurious local minima.

4 Related work

The aforementioned properties of non-convexity and non-closedness have, to some extent, been studied before. Classical results analyze the spaces of shallow neural networks, that is, of $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ for $S=(d,N_{0},1)$ , so that $L=2$ . For such sets of shallow networks, a property that has been extensively studied is to what extent $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ has the best approximation property. Here, we say that $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ has the best approximation property, if for every function $f\in L^{p}(\Omega)$ , $1\leq p\leq\infty$ , there exists a function $F(f)\in\mathcal{RNN}_{\varrho}^{\Omega}(S)$ such that $\|f-F(f)\|_{L^{p}}=\inf_{g\in\mathcal{RNN}_{\varrho}^{\Omega}(S)}\|f-g\|_{L^{p}}$ . In it was shown that even if a minimizer always exists, the map $f\mapsto F(f)$ is necessarily discontinuous. Furthermore, at least for the Heaviside activation function, there does exist a (non-unique) best approximation; see .

Additionally, [27, Proposition 4.1] demonstrates, for shallow networks as before, that for the logistic activation function $\varrho(x)=(1+e^{-x})^{-1}$ , the set $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ does not have the best approximation property in $C(\Omega)$ . In the proof of this statement, it was also shown that $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ is not closed. Furthermore, it is claimed that this result should hold for every non-linear activation function. The previously mentioned result of and Theorem 3.8 below disprove this conjecture for the Heaviside and ReLU activation functions, respectively.

In deep learning, one chooses a loss function ${\mathcal{L}:C(\Omega)\to[0,\infty)}$ , which is then minimized over the set of neural networks $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ with fixed architecture $S$ . A typical loss function is the empirical square loss, that is,

It is important to emphasize that this notion of non-convexity describes properties of the loss function, in contrast to the non-convexity of the sets of functions that we analyze in this work.

Non-convexity of the set of realizations

In this section, we analyze the convexity of the set of all neural network realizations. In particular, we will show that this set is highly non-convex for all practically used activation functions listed in Table 1. First, we examine the convexity of the set $\mathcal{RNN}^{\Omega}_{\varrho}(S)$ :

If $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ is convex, then $\varrho$ is a polynomial.

(1) It is easy to see that all of the activation functions in Table 1 are locally Lipschitz continuous, and that none of them is a polynomial. Thus, the associated sets of realizations are never convex.

(2) In the case where $\varrho$ is a polynomial, the set $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ might or might not be convex. Indeed, if $S=(1,N,1)$ and $\varrho(x)=x^{m}$ , then it is not hard to see that $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ is convex if and only if $N\geq m$ .

The detailed proof of Theorem 2.1 is the subject of Appendix C.1. Let us briefly outline the proof strategy:

We first show in Proposition C.1 that $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ is closed under scalar multiplication, hence star-shaped with respect to the origin, i.e., 0 is a center.A subset $A$ of some vector space $V$ is called star-shaped, if there exists some $f\in A$ such that for all $g\in A$ , also $\{\lambda f+(1-\lambda)g\colon\lambda\in\}\subset A$ . The vector $f$ is called a center of $A$ .

A direct consequence of Step 2 is that if $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ is convex, then it can only contain a finite number of linearly independent functions; see Corollary C.5.

In applications, the non-convexity of $\mathcal{RNN}^{\Omega}_{\varrho}(S)$ might not be as problematic as it first seems. If, for instance, the set $\mathcal{RNN}^{\Omega}_{\varrho}(S)+B_{\delta}(0)$ of functions that can be approximated up to error $\delta>0$ by a neural network with architecture $S$ was convex, then one could argue that the non-convexity of $\mathcal{RNN}^{\Omega}_{\varrho}(S)$ was not severe. Indeed, in practice, neural networks are only trained to minimize a certain empirical loss function, with resulting bounds on the generalization error which are typically of size $\varepsilon=\mathcal{O}(m^{-1/2})$ , with $m$ denoting the number of training samples. In this setting, one is not really interested in “completely minimizing” the (empirical) loss function, but would be content with finding a function for which the empirical loss is $\varepsilon$ -close to the global minimum. Hence, one could argue that one is effectively working with a hypothesis space of the form $\mathcal{RNN}^{\Omega}_{\varrho}(S)+B_{\delta}(0)$ , containing all functions that can be represented up to an error of $\delta$ by neural networks of architecture $S$ .

To quantify this potentially more relevant notion of convexity of neural networks, we define, for a subset $A$ of a vector space $\mathcal{Y}$ , the convex hull of $A$ as

For $\varepsilon>0$ , we say that a subset $A$ of a normed vector space $\mathcal{Y}$ is $\varepsilon$ -convex in $(\mathcal{Y},\|\cdot\|_{\mathcal{Y}})$ , if

Hence, the notion of $\varepsilon$ -convexity asks whether the convex hull of a set is contained in an enlargement of this set. Note that if $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ is dense in $C(\Omega)$ , then its closure is trivially $\varepsilon$ -convex for all $\varepsilon>0$ . Our main result regarding the $\varepsilon$ -convexity of neural network sets shows that this is the only case in which $\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}$ is $\varepsilon$ -convex for any $\varepsilon>0$ .

Assume that $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ is not dense in $C(\Omega)$ . Then there does not exist any $\varepsilon>0$ such that $\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}$ is $\varepsilon$ -convex in $\big{(}C(\Omega),\|\cdot\|_{\sup}\big{)}$ .

All closures in the theorem are taken in $C(\Omega)$ .

The proof of this theorem is the subject of Appendix C.2. It is based on showing that if $\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}$ is $\varepsilon$ -convex for some $\varepsilon>0$ , then in fact $\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}$ is convex, which we then use to show that $\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}$ contains all realizations of two-layer neural networks with activation function $\varrho$ . As shown in , this implies that $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ is dense in $C(\Omega)$ , since $\varrho$ is not a polynomial. ∎

While it is certainly natural to expect that $\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}\neq C(\Omega)$ should hold for most activation functions $\varrho$ , giving a reference including large classes of activation functions such that the claim holds is not straightforward. We study this problem more closely in Appendix C.3.

To be more precise, from Proposition C.10 it follows that the ReLU, the parametric ReLU, the exponential linear unit, the softsign, the sigmoid, and the $\tanh$ yield realization sets $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ which are not dense in $L^{p}(\Omega)$ and in $C(\Omega)$ .

The only activation functions listed in Table 1 for which we do not know whether any of the realization sets $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ is dense in $L^{p}(\Omega)$ or in $C(\Omega)$ are: the inverse square root linear unit, the inverse square root unit, the softplus, and the $\arctan$ function. Of course, we expect that also for these activation functions, the resulting sets of realizations are never dense in $L^{p}(\Omega)$ or in $C(\Omega)$ .

(Non-)Closedness of the set of realizations

We start by examining the closedness with respect to $L^{p}$ -norms for $p\in(0,\infty)$ . In fact, for all $B>0$ and all widely used activation functions (including all activation functions presented in Table 1), the set $\mathcal{RNN}_{\varrho}^{[-B,B]^{d}}(S)$ is not closed in $L^{p}(\mu)$ , for any $p\in(0,\infty)$ and any “sufficiently non-atomic” measure $\mu$ on $[-B,B]^{d}$ . To be more precise, the following is true:

There is some $r>0$ such that $\varrho|_{(-\infty,-r)\cup(r,\infty)}$ is differentiable;

At least one of the following two conditions is satisfied:

There are $\lambda,\lambda^{\prime}\geq 0$ with $\lambda\neq\lambda^{\prime}$ such that $\varrho^{\prime}(x)\to\lambda$ as $x\to\infty$ , and $\varrho^{\prime}(x)\to\lambda^{\prime}$ as $x\to-\infty$ , and we have $N_{L-1}\geq 2$ .

Finally, let $B>0$ and let $\mu$ be a finite Borel measure on $[-B,B]^{d}$ for which the support $\operatorname*{supp}\mu$ is uncountable. Then the set $\mathcal{RNN}_{\varrho}^{[-B,B]^{d}}(S)$ is not closed in $L^{p}(\mu)$ for any $p\in(0,\infty)$ . More precisely, there is a function $f\in L^{\infty}(\mu)$ which satisfies $f\in\overline{\mathcal{RNN}_{\varrho}^{[-B,B]^{d}}(S)}\setminus\mathcal{RNN}_{\varrho}^{[-B,B]^{d}}(S)$ for all $p\in(0,\infty)$ , where the closure is taken in $L^{p}(\mu)$ .

If $\operatorname*{supp}\mu$ is countable, then $\mu=\sum_{x\in\operatorname*{supp}\mu}\mu(\{x\})\,\delta_{x}$ is a countable sum of Dirac measures, meaning that $\mu$ is purely atomic. In particular, if $\mu$ is non-atomic (meaning that $\mu(\{x\})=0$ for all $x\in[-B,B]^{d}$ ), then $\operatorname*{supp}\mu$ is uncountable and the theorem is applicable.

For the proof of the theorem, we refer to Appendix D.1. The main proof idea consists in the approximation of a (discontinuous) step function which cannot be represented by a neural network with continuous activation function. ∎

The assumptions concerning the activation function $\varrho$ in Theorem 3.1 are satisfied for all of the activation functions listed in Table 1. In any case where $\varrho$ is bounded, one can take $N_{L-1}=1$ ; otherwise, one can take $N_{L-1}=2$ .

For a proof of this statement, we refer to Appendix D.2. ∎

We have seen in Theorem 3.1 that under reasonably mild assumptions on the activation function $\varrho$ —which are satisfied for all commonly used activation functions—the set $\mathcal{RNN}_{\varrho}^{[-B,B]^{d}}(S)$ is not closed in any $L^{p}$ -space where $p\in(0,\infty)$ . However, the argument of the proof of Theorem 3.1 breaks down if one considers closedness with respect to the $\|\cdot\|_{\sup}$ -norm. Therefore, we will analyze this setting more closely in this section. More precisely, in Theorem 3.3, we present several criteria regarding the activation function $\varrho$ which imply that the set $\mathcal{RNN}_{\varrho}^{[-B,B]^{d}}(S)$ is not closed in $C([-B,B]^{d})$ . We remark that in all these results, $\varrho$ will be assumed to be at least $C^{1}$ . Developing similar criteria for non-differentiable functions is an interesting topic for future research.

$N_{L-1}\geq 2$ and $\varrho$ is bounded, analytic, and not constant.

Then the set $\mathcal{RNN}_{\varrho}^{[-B,B]^{d}}(S)$ is not closed in the space $C([-B,B]^{d})$ .

For the proof of the statement, we refer to Appendix D.3. In particular, the proof of the statement under Condition (i) can be found in Appendix D.3.1. Its main idea consists of the uniform approximation of $\varrho^{\prime}$ (which cannot be represented by neural networks with activation function $\varrho$ , due to its lack of sufficient regularity) by neural networks. For the proof of the statement under Condition (ii), we refer to Appendix D.3.2. The main proof idea consists in the uniform approximation of an unbounded analytic function which cannot be represented by a neural network with activation function $\varrho$ , since $\varrho$ itself is bounded. Finally, the proof of the statement under Condition (iii) can be found in Appendix D.3.3. Its main idea consists in the approximation of the function $x\mapsto(x)^{\max\{r,q\}}_{+}\not\in C^{\max\{r,q\}}.$ ∎

Theorem 3.3 applies to all activation functions listed in Table 1 except for the ReLU and the parametric ReLU. To be more precise,

Condition (i) is fulfilled by the function $x\mapsto\max\{0,x\}^{k}$ for $k\geq 2$ , and by the exponential linear unit, the softsign function, and the inverse square root linear unit.

Condition (ii) is fulfilled by the inverse square root unit, the sigmoid function, the $\tanh$ function, and the $\arctan$ function.

Condition (iii) (with $r=1$ and $q=0$ ) is fulfilled by the softplus function.

For the proof of this statement, we refer to Appendix D.4. In particular, for the proof of (1), we refer to Appendix D.4.1, the proof of (2) is clear and for the proof of (3), we refer to Appendix D.4.2. ∎

3 The phenomenon of exploding weights

We have just seen that the realization set $\mathcal{RNN}_{\varrho}^{[-B,B]^{d}}(S)$ is not closed in $L^{p}(\mu)$ for any $p\in(0,\infty)$ and every practically relevant activation function. Furthermore, for a variety of activation functions, we have seen that $\mathcal{RNN}_{\varrho}^{[-B,B]^{d}}(S)$ is not closed in $C([-B,B]^{d})$ . The situation is substantially different if the weights are taken from a compact subset:

The proof of this statement is based on the continuity of the realization map and can be found in Appendix D.5. ∎

Proposition 3.5 helps to explain the phenomenon of exploding network weights that is sometimes observed during the training of neural networks. Indeed, let us assume that $\mathcal{R}:=\mathcal{RNN}_{\varrho}^{[-B,B]^{d}}(S)$ is not closed in $\mathcal{Y}$ , where $\mathcal{Y}\coloneqq L^{p}(\mu)$ for a Borel measure $\mu$ on $[-B,B]^{d}$ , or $\mathcal{Y}\coloneqq C([-B,B]^{d})$ ; as seen in Sections 3.1 and 3.2, this is true under mild assumptions on $\varrho$ . Then, it follows that there exists a function $f\in\mathcal{Y}$ which does not have a best approximation in $\mathcal{R}$ , meaning that there does not exist any $g\in\mathcal{R}$ satisfying

in fact, one can take any $f\in\overline{\mathcal{R}}\setminus\mathcal{R}$ . Next, recall from Proposition 3.5 that the subset of $\mathcal{R}$ that contains only realizations of networks with uniformly bounded weights is compact.

The argument above only deals with the approximation problem in $C([-B,B]^{d})$ or in $L^{p}(\mu)$ for $p\in(0,\infty)$ . In practice, one is often not concerned with these norms, but instead wants to minimize an empirical loss function over $\mathcal{R}$ . For the empirical square loss, this loss function takes the form

Here, $\sigma_{\Omega}$ is the marginal probability distribution of $\sigma$ on $\Omega$ , and $\mathcal{E}_{\sigma}(f_{\sigma})$ is called the Bayes risk; it is the minimal expected loss achievable by any (measurable) function.

In this context of a statistical learning problem, we have the following result regarding exploding weights:

A compact way of stating Proposition 3.6 is that, if $f_{\sigma}$ has no best approximation in $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ with respect to $\|\cdot\|_{L^{2}(\sigma_{\Omega})}$ , then the weights of the minimizers (or approximate minimizers) of the empirical square loss explode for increasing numbers of samples.

Since $\sigma$ is unknown in applications, it is indeed possible that $f_{\sigma}$ has no best approximation in the set of neural networks. As just one example, this is true if $\sigma_{\Omega}$ is any Borel probability measure on $\Omega$ and if $\sigma$ is the distribution of $(X,g(X))$ , where $X\sim\sigma_{\Omega}$ and $g\in L^{2}(\sigma_{\Omega})$ is bounded and satisfies $g\in\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}\setminus\mathcal{RNN}_{\varrho}^{\Omega}(S)$ , with the closure taken in $L^{2}(\sigma_{\Omega})$ . The existence of such a function $g$ is guaranteed by Theorem 3.1 if $\operatorname*{supp}\sigma_{\Omega}$ is uncountable.

For the proof of Proposition 3.6, we refer to Appendix D.6. The proof is based on classical arguments of statistical learning theory as given in . ∎

In this subsection, we analyze the closedness of sets of realizations of neural networks with respect to the ReLU or the parametric ReLU activation function in $C(\Omega)$ , mostly for the case $\Omega=[-B,B]^{d}$ . We conjecture that the set of (realizations of) ReLU networks of a fixed complexity is closed in $C(\Omega)$ , but were not able to prove such a result in full generality. In two special cases, namely when the networks have only two layers, or when at least the scaling weights are bounded, we can show that the associated set of ReLU realizations is closed in $C(\Omega)$ ; see below.

We begin by analyzing the set of realizations with uniformly bounded scaling weights and possibly unbounded biases, before proceeding with the analysis of two layer ReLU networks.

As our second result in this section, we show that the set of realizations of two-layer neural networks with arbitrary scaling weights and biases is closed in $C([-B,B]^{d}),$ if the activation is the parametric ReLU. It is a fascinating question for further research whether this also holds for deeper networks.

Failure of inverse instability of the realization map

Finally, if $\varrho$ is globally Lipschitz continuous, then there is a constant $C=C(\varrho,S)>0$ such that

For the proof of this statement, we refer to Appendix E.1. ∎

then the realizations of $\Phi,\Psi$ are identical.

In this section, our main goal is to determine whether, up to the failure of injectivity, the realization map is a homeomorphism onto its range; mathematically, this means that we want to determine whether the realization map is a quotient map. We will see that this is not the case.

We finally rephrase the preceding result in more topological terms:

For the proof of the statement, we refer to Appendix E.3. ∎

Acknowledgements

P.P. and M.R. were supported by the DFG Collaborative Research Center TRR 109 “Discretization in Geometry and Dynamics”. P.P. was supported by a DFG Research Fellowship ”Shearlet-based energy functionals for anisotropic phase-field methods”. M.R. is supported by the Berlin Mathematical School. F.V. acknowledges support from the European Commission through DEDALE (contract no. 665044) within the H2020 Framework Program.

We would like to thank Dave L. Renfro for bringing the paper to our attention.

References

Appendix A Notation

Given functions $(f_{i})_{i\in\underline{n}}$ with $f_{i}:X_{i}\to Y_{i}$ , we consider three different types of products between these maps: The cartesian product of $f_{1},\dots,f_{n}$ is

Finally, the direct sum of $f_{1},\dots,f_{n}$ is defined if $X_{1}=\dots=X_{n}$ , and given by

Appendix B Auxiliary results: Operations with neural networks

This part of the appendix is devoted to auxiliary results that are connected with basic operations one can perform with neural networks and which we will frequently make use of in the proofs below.

We start by showing that one can “enlarge” a given neural network in such a way that the realizations of the original network and the enlarged network coincide. To be more precise, the following holds:

Setting $N_{0}\coloneqq\widetilde{N}_{0}\coloneqq d$ , and $\widetilde{N}_{L}\coloneqq N_{L}$ , we define $\widetilde{\Phi}\coloneqq\left((\widetilde{A}_{1},\widetilde{b}_{1}),\dots,(\widetilde{A}_{L},\widetilde{b}_{L})\right)$ by

Another operation that we can perform with networks is concatenation, as given in the following definition.

Then, we call $\Phi^{1}{\raisebox{2.0pt}{\tiny\newmoon}\,}\Phi^{2}$ the concatenation of $\Phi^{1}$ and $\Phi^{2}$ .

We claim that there is some $C_{0}>0$ such that $|\varrho_{C}(x)-x|\leq\varepsilon^{\prime}$ for all $x\in[-B-L\varepsilon,B+L\varepsilon]$ and all $C\geq C_{0}$ . To see this, first note by definition of the derivative that there is some $\delta>0$ with

Here we implicitly used that $s_{0}=\varrho^{\prime}(x_{0})\neq 0$ to ensure that the right-hand side is a positive multiple of $|t|$ . Now, set $C_{0}\coloneqq(B+L)/\delta$ , and let $C\geq C_{0}$ be arbitrary. Note because of $\varepsilon^{\prime}\leq\varepsilon\leq 1$ that every $x\in[-B-L\varepsilon,B+L\varepsilon]$ satisfies $|x|\leq B+L$ . Hence, if we set $t\coloneqq x/C$ , then $|t|\leq\delta$ . Therefore,

Note that $\varrho_{C}$ is differentiable at with derivative $\varrho_{C}^{\prime}(0)=\frac{C}{s_{0}}\varrho^{\prime}(x_{0})\frac{1}{C}=1$ , thanks to the chain rule.

Using these preliminary observations, we now construct the neural networks $\Phi_{\varepsilon}^{B}$ and $\Phi_{\varepsilon,i}^{B}$ . Define ${\Phi_{0}^{C}\coloneqq\big{(}(A_{1},b_{1}),(A_{2},b_{2})\big{)}}$ , where

where $\varrho_{C}$ is applied $L-1$ times.

Since $|\varrho_{C}(x)-x|\leq\varepsilon^{\prime}\leq\varepsilon$ for all $x\in[-B-L\varepsilon,B+L\varepsilon]$ , it is not hard to see by induction that

where $\varrho_{C}$ is applied $t\leq L$ times. Therefore, since $\varepsilon^{\prime}=\varepsilon/(dL)$ , we conclude for $C\geq C_{0}$ that

We proceed with the second part of the proposition. We first prove the statement for $i=1$ . Let $\widetilde{\Phi}_{1}^{C}\coloneqq\big{(}(A_{1}^{\prime},b_{1}^{\prime}),(A_{2}^{\prime},b_{2}^{\prime})\big{)}$ , where

We have $\widetilde{\Phi}_{1}^{C}\in\mathcal{NN}((d,1,1))$ . Next, define $\widetilde{\Phi}_{2}^{C}\coloneqq\big{(}(A_{1}^{\prime\prime},b_{1}^{\prime\prime}),(A_{2}^{\prime\prime},b_{2}^{\prime\prime})\big{)}$ , where

We have $\widetilde{\Phi}_{2}^{C}\in\mathcal{NN}((1,1,1))$ . Setting $\widetilde{\Phi}_{C}\coloneqq\widetilde{\Phi}_{2}^{C}{\raisebox{2.0pt}{\tiny\newmoon}\,}\dots\raisebox{2.0pt}{\tiny\newmoon}\,\widetilde{\Phi}_{2}^{C}{\raisebox{2.0pt}{\tiny\newmoon}\,}\widetilde{\Phi}_{1}^{C}$ , where we take $L-2$ concatenations (meaning $L-1$ factors), yields a neural network $\widetilde{\Phi}_{C}\in\mathcal{NN}((d,1,\dots,1))$ (with $L$ layers) such that

where $\varrho_{C}$ is applied $L-1$ times. Exactly as in the proof of the first part, this implies for $C\geq C_{0}$ that

Setting $\widetilde{\Phi}_{\varepsilon,1}^{B}\coloneqq\widetilde{\Phi}_{C}$ and repeating the previous arguments yields the claim for $i=1$ . Permuting the columns of $A_{1}^{\prime}$ yields the result for arbitrary $i\in\{1,\dots,d\}$ .

Appendix C Proofs and results connected to Section 2

We first establish the star-shapedness of the set of all realizations of neural networks, which is a direct consequence of the fact that the set is invariant under scalar multiplication. The following proposition provides the details.

We can choose $\lambda=0$ in the argument above and obtain ${0\in\mathcal{RNN}^{\Omega}_{\varrho}(S)}$ . For every $f\in\mathcal{RNN}^{\Omega}_{\varrho}(S)$ the line $\{\lambda f\colon\lambda\in\}$ between and $f$ is contained in $\mathcal{RNN}^{\Omega}_{\varrho}(S)$ , since $\mathcal{RNN}^{\Omega}_{\varrho}(S)$ is closed under scalar multiplication. We conclude that $\mathcal{RNN}^{\Omega}_{\varrho}(S)$ is star-shaped with respect to the origin. ∎

Our next goal is to show that $\mathcal{RNN}^{\Omega}_{\varrho}(S)$ cannot contain infinitely many linearly independent centers.

As a preparation, we prove two related results which show that the class $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ is “small”. The main assumption for guaranteeing this is that the activation function should be locally Lipschitz continuous.

Since $\varrho$ is locally Lipschitz continuous, Proposition 4.1 (which will be proved completely independently) shows that the realization map

As a composition of locally Lipschitz continuous functions, the map

As a corollary, we can now show that the class of neural network realizations cannot contain a subspace of large dimension.

Now, the announced estimate for the number of linearly independent centers of the set of all network realizations of a fixed size is a direct consequence.

Next, we analyze the convexity of $\mathcal{RNN}^{\Omega}_{\varrho}(S)$ . As a direct consequence of Proposition C.4, we see that $\mathcal{RNN}^{\Omega}_{\varrho}(S)$ is never convex if $\mathcal{RNN}^{\Omega}_{\varrho}(S)$ contains more than a certain number of linearly independent functions.

Every element of a convex set is a center. Thus the result follows directly from Proposition C.4. ∎

Step 1: Set $S^{\prime}\coloneqq(d,N_{1},\dots,N_{L-1},1)$ . We first show that $\mathcal{RNN}_{\varrho}^{\Omega}(S^{\prime})$ does not contain infinitely many linearly independent functions. To see this, first note that the map

Step 5: In this step, we complete the proof, by first showing for arbitrary $B>0$ that $\varrho|_{[-B,B]}$ is a polynomial of degree at most $N$ .

Let $\varepsilon,B>0$ be arbitrary. Since $\varrho$ is continuous, it is uniformly continuous on $[-B-1,B+1]$ , that is, there is some $\delta\in(0,1)$ such that $|\varrho(x)-\varrho(y)|\leq\varepsilon$ for all $x,y\in[-B-1,B+1]$ with $|x-y|\leq\delta$ . Since $\varrho^{\prime}(x_{0})\neq 0$ and $L\geq 2$ , Proposition B.3 and Lemma B.1 imply existence of a neural network $\widetilde{\Phi}_{\varepsilon,B}\in\mathcal{NN}((d,N_{1},\dots,N_{L-1}))$ such that

with $\varrho$ acting componentwise. By (C.3), it follows that there is a neural network ${\Phi_{\varepsilon,B}\in\mathcal{NN}(S^{\prime})}$ satisfying

where the closure is taken with respect to the sup norm, and where we implicitly used that the space on the right-hand side is a closed subspace of $C([-B,B])$ , since it is a finite dimensional subspace.

Since $\varrho|_{[-B,B]}$ is a polynomial of degree at most $N$ ; we see that the $N+1$ -th derivative of $\varrho$ satisfies $\varrho^{(N+1)}\equiv 0$ on $(-B,B)$ , for arbitrary $B>0$ . Thus, $\varrho^{(N+1)}\equiv 0$ , meaning that $\varrho$ is a polynomial. ∎

In the above proof, we used the following elementary lemma, whose proof we provide for completeness.

Let us assume towards a contradiction that

As our final ingredient for the proof of Theorem 2.1, we show that every non-constant locally Lipschitz function $\varrho$ satisfies the technical assumptions of Proposition C.6.

Since $\varrho$ is not constant, there is some $B>0$ such that $\varrho|_{[-B,B]}$ is not constant. By assumption, $\varrho$ is Lipschitz continuous on $[-B,B]$ . Thus, $\varrho|_{[-B,B]}$ is absolutely continuous; see for instance [60, Definition 7.17]. Thanks to the fundamental theorem of calculus for the Lebesgue integral (see [60, Theorem 7.20]), this implies that $\varrho|_{[-B,B]}$ is differentiable almost everywhere on $(-B,B)$ and satisfies $\varrho(y)-\varrho(x)=\int_{x}^{y}\varrho^{\prime}(t)\,dt$ for $-B\leq x<y\leq B$ , where $\varrho^{\prime}(t):=0$ if $\varrho$ is not differentiable at $t$ .

Since $\varrho|_{[-B,B]}$ is not constant, the preceding formula shows that there has to be some $x_{0}\in(-B,B)$ such that $\varrho^{\prime}(x_{0})\neq 0$ ; in particular, this means that $\varrho$ is differentiable at $x_{0}$ . ∎

Now, a combination of Corollary C.5, Proposition C.6, and Lemma C.8 proves Theorem 2.1. For the application of Lemma C.8, note that if $\varrho$ is constant, then $\varrho$ is a polynomial, so that the conclusion of Theorem 2.1 also holds in this case.

C.2 Proof of Theorem 2.2

We first show in the following lemma that if $\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}$ is convex, then $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ is dense in $C(\Omega)$ . The proof of Theorem 2.2 is given thereafter.

If $\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}$ is convex, then $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ is dense in $C(\Omega)$ .

Now we are ready to prove Theorem 2.2. By assumption, $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ is not dense in $C(\Omega)$ . We start by proving that there exists at least one $\varepsilon>0$ such that the set $\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}$ is not $\varepsilon$ -convex. Suppose towards a contradiction that this is not true, so that $\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}$ is $\varepsilon$ -convex for all $\varepsilon>0$ . This implies

where the last identity holds true, since if $\widetilde{f}\not\in\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}$ , there exists $\varepsilon^{\prime}>0$ such that $\|\widetilde{f}-f\|_{\sup}>\varepsilon^{\prime}$ for all $f\in\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}$ . Equation (C.6) shows that $\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}$ is convex, which by Lemma C.9 implies that $\mathcal{RNN}_{\varrho}^{\Omega}(S)\subset C(\Omega)$ is dense, in contradiction to the assumptions of Theorem 2.2. This is the desired contradiction, showing that $\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}$ is $\varepsilon$ -convex for some $\varepsilon>0$ .

Thus, there exists $g\in\text{co}\big{(}\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}\big{)}$ such that $\|g-f\|_{\sup}\geq\varepsilon_{0}$ for all $f\in\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}$ . Now, let $\varepsilon>0$ be arbitrary. Then, $\frac{\varepsilon}{\varepsilon_{0}}g\in\text{co}(\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)})$ , since $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ is closed under scalar multiplication. Moreover,

again due to the closedness under scalar multiplication of $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ . This shows that $\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}$ is not $\varepsilon$ -convex for any $\varepsilon>0$ . $\square$

C.3 Non-dense network sets

applying one of the arithmetic operations $+,-,\times$ , and $/$ on real numbers;

jumps conditioned on comparisons of real numbers using the following operators: $<,>,\leq,\geq,=,\neq$ .

Using this result, we can now show that the realization sets of networks with activation functions that are computable by elementary operations are never dense in $L^{p}(\Omega)$ or $C(\Omega)$ .

Then. we have $\overline{\mathcal{Y}\cap\mathcal{RNN}_{\varrho}^{\Omega}(S)}\subsetneq\mathcal{Y}$ .

The considerations from before the statement of the proposition show that

Since $L^{p}(\Omega)\cap\mathcal{F}\subset L^{p}(\Omega)$ is dense, there is for each $b\in\{0,1\}^{m}$ some $f_{b}\in\mathcal{F}\cap L^{p}(\Omega)$ such that $\|f_{b}-g_{b}\|_{L^{p}}^{p}\leq r^{d}/(2^{1+p}\cdot m\cdot 2^{m})$ . If we set $\Omega_{b}:=\{x\in\Omega\colon|f_{b}(x)-g_{b}(x)|\geq 1/2\}$ , then ${\mathds{1}}_{\Omega_{b}}\leq 2^{p}\cdot|f_{b}-g_{b}|^{p}$ , and hence

and thus $\lambda(\bigcup_{b\in\{0,1\}^{m}}\Omega_{b})\leq\frac{r^{d}}{2m}$ , where $\lambda$ is the Lebesgue measure. Hence, $\lambda(M_{j}\setminus\bigcup_{b\in\{0,1\}^{m}}\Omega_{b})\geq\frac{r^{d}}{2m}>0$ , so that we can choose for each $j\in\underline{m}$ some $x_{j}\in M_{j}\setminus\bigcup_{b\in\{0,1\}^{m}}\Omega_{b}$ . We then have

and hence $f_{b}(x_{j})>1/2$ if $b_{j}=1$ and $f_{b}(x_{j})<1/2$ otherwise. Thus, if we set $r_{1}\coloneqq\dots\coloneqq r_{m}\coloneqq\frac{1}{2}$ , then we have as above that ${\mathds{1}}_{[0,\infty)}\big{(}f_{b}(x_{j})-r_{j}\big{)}=b_{j}$ for all $j\in\underline{m}$ and $b\in\{0,1\}^{m}$ . The remainder of the proof is as for $\mathcal{Y}=C(\Omega)$ . ∎

Note that the following activation functions are computable by elementary operations: any piecewise polynomial function (in particular, the ReLU and the parametric ReLU), the exponential linear unit, the softsign (since the absolute value can be computed using a case distinction), the sigmoid, and the $\tanh$ . Thus, the preceding proposition applies to each of these activation functions.

Appendix D Proofs of the results in Section 3

The proof of Theorem 3.1 is crucially based on the following lemma:

Now, if $\eta<1$ , then any $r\xi\in C(v;\delta,\eta)$ with $|\xi-v|<\eta$ satisfies $\langle v,\xi\rangle=\langle v,v\rangle+\langle v,\xi-v\rangle\geq 1-|\xi-v|>0,$ so that $C(x^{\ast},v;\delta,\eta)\subset B_{\delta}(x^{\ast})\cap H_{+}(x^{\ast},v)$ and $C(x^{\ast},-v;\delta,\eta)\subset B_{\delta}(x^{\ast})\cap H_{-}(x^{\ast},v)$ . From this it is easy to see that if $x^{\ast},v$ are as provided by the result in (for $E=K$ ), then indeed $x^{\ast}\in\overline{K\cap H_{+}(x^{\ast},v)}\cap\overline{K\cap H_{-}(x^{\ast},v)}$ .

Assume towards a contradiction that such a continuous function $g$ exists. Recall (see for instance [19, Section 7.4]) that the support of $\mu$ is defined as

In particular, if $U\subset[-B,B]^{d}$ is open with $U\cap\operatorname*{supp}\mu\neq\varnothing$ , then $\mu(U)>0$ .

We now prove Theorem 3.1. Set $\Omega\coloneqq[-B,B]^{d}$ and define $\widetilde{N}_{i}:=1$ for ${i=1,\dots,L-2}$ and $\widetilde{N}_{L-1}:=2$ if $\varrho$ is unbounded, while $\widetilde{N}_{L-1}:=1$ otherwise. We show that for $\varrho$ as in the statement of the theorem there exists a sequence of functions in $\mathcal{RNN}_{\varrho}^{\Omega}((d,\widetilde{N}_{1},\dots,\widetilde{N}_{L-1},1))\subset\mathcal{RNN}_{\varrho}^{\Omega}(S)$ such that the sequence converges (in $L^{p}(\mu)$ ) to a bounded, discontinuous limit $f\in L^{\infty}(\mu)$ , meaning that $f$ does not have a continuous representative, even after possibly changing it on a $\mu$ -null-set. Since $\mathcal{RNN}_{\varrho}^{\Omega}(S)\subset C(\Omega),$ this will show that $f\in\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}\setminus\mathcal{RNN}_{\varrho}^{\Omega}(S)$ .

Next, using Proposition B.3, choose a neural network $\Psi\in\mathcal{NN}((d,1,\dots,1))$ with $L-1$ layers such that

We now distinguish the cases given in Assumption (iv)(a) and (b) of Theorem 3.1.

Then, $\Phi_{n}{\raisebox{2.0pt}{\tiny\newmoon}\,}\Phi\in\mathcal{NN}((d,\widetilde{N}_{1},\dots,\widetilde{N}_{L-1},1))$ . Now, let us define

since $\xi_{n}^{x}\to\infty$ as $n\to\infty,n\geq N_{x}$ . Analogously, it follows for $x\in\Omega\cap H_{-}(x^{\ast},v)$ that $\lim_{n\to\infty}h_{n}(x)=\lambda^{\prime}$ . Hence, setting $\gamma:=\varrho(0)-\varrho(-1)$ , we see for each $x\in\Omega$ that

By the monotonicity and since $\varrho$ is not constant (because of $\varrho^{\prime}(x_{0})\neq 0$ ), we have $c>c^{\prime}\vphantom{\sum_{j}}$ .

Then, $\widetilde{\Phi}_{n}{\raisebox{2.0pt}{\tiny\newmoon}\,}\Phi\in\mathcal{NN}((d,\widetilde{N}_{1},\dots,\widetilde{N}_{L-1},1))$ . Now, let us define

Since each of the $\widetilde{h}_{n}$ is continuous and $\Omega$ is compact, we have $\widetilde{h}_{n}\in L^{p}(\mu)$ for all $p\in(0,\infty]$ . Equation (D.3) implies that $J(x)>0$ for all $x\in\Omega\cap H_{+}(x^{\ast},v)$ . This in turn yields that

Similarly, the fact that $J(x)<0$ for all $x\in\Omega\cap H_{-}(x^{\ast},v)$ yields

Combining (D.4) with (D.5) yields for all $x\in\Omega$ that

D.2 Proof of Corollary 3.2

Next, the softsign, the inverse square root unit, the sigmoid, the $\tanh$ , and the $\arctan$ function are all bounded, and thus satisfy condition (iv)(b) of Theorem 3.1. Thus, all that remains is to verify condition (iv)(a) of Theorem 3.1 for the remaining activation functions:

For the ReLU $\varrho(x)=\max\{0,x\}$ , condition (iv)(a) is satisfied with $\lambda=1$ and $\lambda^{\prime}=0\neq\lambda$ .

For the parametric ReLU $\varrho(x)=\max\{ax,x\}$ (with $a\geq 0$ , $a\neq 1$ ), Condition (iv)(a) is satisfied with $\lambda=\max\{1,a\}$ and $\lambda^{\prime}=\min\{1,a\}$ , where $\lambda\neq\lambda^{\prime}$ since $a\neq 1$ .

For the exponential linear unit $\varrho(x)=x{\mathds{1}}_{[0,\infty)}(x)+(e^{x}-1){\mathds{1}}_{(-\infty,0)}(x)$ , Condition (iv)(a) is satisfied for $\lambda=1$ and $\lambda^{\prime}=\lim_{x\to-\infty}e^{x}=0\neq\lambda$ .

For the inverse square root linear unit $\varrho(x)=x{\mathds{1}}_{[0,\infty)}(x)+\frac{x}{\sqrt{1+ax^{2}}}{\mathds{1}}_{(-\infty,0)}(x)$ , the quotient rule shows that for $x<0$ we have

Therefore, Condition (iv)(a) is satisfied for $\lambda=1$ and $\lambda^{\prime}=\lim_{x\to-\infty}\varrho^{\prime}(x)=0\neq\lambda$ .

For the softplus function $\varrho(x)=\ln(1+e^{x})$ , Condition (iv)(a) is satisfied for

D.3 Proof of Theorem 3.3

then $\varepsilon_{n}\to 0$ as $n\to\infty$ .

Now, by the Lipschitz continuity of $\varrho(\lambda\cdot)$ and Equation (D.7), we conclude that

Here, the last step used that $|\xi_{n}^{x}-x_{1}|\leq n^{-1}\leq 1$ , so that $x_{1},\xi_{n}^{x}\in[-(B+1),B+1]$ .

Since $f_{n}\to\varrho_{\lambda}$ uniformly, where $\varrho_{\lambda}\notin C^{m}(\Omega)$ , and hence $\varrho_{\lambda}\notin\mathcal{RNN}_{\varrho}^{\Omega}(S)$ , we thus see that $\mathcal{RNN}_{\varrho}^{\Omega}(S)$ is not closed in $C(\Omega)$ . $\square$

D.3.2 Proof of Theorem 3.3 under Condition (ii)

Since $\varrho^{\prime}$ is continuous, we conclude that

We set $\Phi_{n}\coloneqq\Phi_{n}^{1}{\raisebox{2.0pt}{\tiny\newmoon}\,}\Phi_{n}^{2}\in\mathcal{NN}(S^{\prime})$ and note for all $x\in\Omega$ that

D.3.3 Proof of Theorem 3.3 under Condition (iii)

for a constant $c_{0}=c_{0}(B,s,r,q)>0$ . Moreover, for $x\in[0,B]$ , we have

D.4 Proof of Corollary 3.4

D.4.2 Proof of Corollary 3.4.(3)

and for $x\leq 0$ , we have $|\ln(1+e^{x})-x^{0}|\leq 1+\ln(2)$ . $\square$

D.5 Proof of Proposition 3.5

The set $\Theta_{C}$ is closed and bounded in the normed space $\big{(}\mathcal{NN}(S),\|\cdot\|_{\mathcal{NN}(S)}\big{)}.$ Thus, the Heine-Borel Theorem implies the compactness of $\Theta_{C}$ . By Proposition 4.1 (which will be proved independently), the map

D.6 Proof of Proposition 3.6

By compactness of $\mathcal{R}^{C}$ , we can choose $g\in\mathcal{R}^{C}$ satisfying $\|f_{\sigma}-g\|_{L^{2}(\sigma_{\Omega})}^{2}=\inf_{h\in\mathcal{R}^{C}}\|f_{\sigma}-h\|_{L^{2}(\sigma_{\Omega})}^{2}.$ Define $M:=\inf_{h\in\mathcal{RNN}_{\varrho}^{\Omega}(S)}\|f_{\sigma}-h\|_{L^{2}(\sigma_{\Omega})}^{2}$ . Since by assumption the infimum defining $M$ is not attained, we have $\|f_{\sigma}-g\|_{L^{2}(\sigma_{\Omega})}^{2}>M$ , so that there are $h\in\mathcal{RNN}_{\varrho}^{\Omega}(S)$ and $\delta>0$ with $\|f_{\sigma}-g\|_{L^{2}(\sigma_{\Omega})}^{2}\geq 2\delta+\|f_{\sigma}-h\|_{L^{2}(\sigma_{\Omega})}^{2}$ . Let $C^{\prime}>0$ with $h\in\mathcal{R}^{C^{\prime}}$ .

We now claim that $\Omega_{N}^{c}\subset\Omega_{N,\delta/3}^{(1)}\cup\Omega_{N,\delta/3}^{(2)}$ . Once we prove this, we get

By rearranging and recalling the choice of $h$ and $\delta$ , we finally see

which is the desired contradiction. $\square$

D.7 Proof of Proposition 3.7

The main ingredient of the proof will be to show that one can replace a given sequence of networks with $C$ -bounded scaling weights by another sequence with $C$ -bounded scaling weights that also has bounded biases. Then one can apply Proposition 3.5.

each network $\Psi_{n}^{(m)}$ , $n\in I_{m}$ , has $C$ -bounded scaling weights;

For arbitrary $i\in\{1,\dots,N_{L}\}$ and $x\in\Omega$ , this implies

Thus, it remains to construct the networks $\Psi_{n}^{(m)}$ for $n\in I_{m}$ (and the sets $I_{m}$ ) for $m\in\{0,\dots,L-1\}$ with the properties (A)–(C) from above.

Thus, it remains to construct $d^{(n)},e^{(n)},C^{(n)}$ for $n\in I_{m+1}$ (and the set $I_{m+1}$ itself) as described around Equation (D.12). To this end, for $n\in I_{m}^{(0)}$ and $k\in\{1,\dots,N_{m+1}\}$ , define

To see that these choices indeed fulfil the conditions outlined around Equation (D.12) for a suitable choice of $I_{m+1}\subset I_{m}^{\left(0\right)}$ , first note that $\left(d^{(n)}\right)_{n\in I_{m}^{\left(0\right)}}$ is indeed a bounded family. Furthermore, $\big{|}C_{k,i}^{(n)}\big{|}\leq\big{|}(B_{m+1}^{(n,m)})_{k,i}\big{|}$ for all $k\in\{1,\dots,N_{m+1}\}$ and $i\in\{1,\dots,N_{m}\}$ , which easily implies $\|C^{(n)}\|_{\max}\leq\|B_{m+1}^{(n,m)}\|_{\max}\leq C$ for all $n\in I_{m}^{(0)}$ . Thus, it remains to verify equation (D.12) itself. But the estimate $\|B_{m+1}^{(n,m)}\|_{\max}\leq C$ also implies

As a final preparation, note that $\varrho_{m+1}=\varrho\times\cdots\times\varrho$ is a cartesian product of ReLU functions, since $m\leq L-2$ . Now, for $k\in\{1,\dots,N_{m+1}\}$ there are three cases:

where the last step used our choice of $d^{(n)},e^{(n)},C^{(n)}$ , and the fact that $\left(C^{(n)}x+d^{(n)}\right)_{k}\geq 0$ by Equation (D.13).

where the last step used our choice of $d^{(n)},e^{(n)},C^{(n)}$ .

Overall, we have thus shown that Equation (D.12) is satisfied for all $n\in I_{m+1}$ , where

is clearly an infinite set, since $I_{m}^{\left(0\right)}$ is. ∎

Hence, $(\Psi_{n})_{n\in I}$ is a bounded, infinite family in the finite dimensional vector space $\mathcal{NN}(S)$ . Thus, there is a further infinite set $I_{1}\subset I$ such that $\Psi_{n}\to\Psi\in\mathcal{NN}(S)$ as $n\to\infty$ in $I_{1}$ .

But since $\Omega$ is bounded, say $\Omega\subset[-R,R]^{d}$ , the realization map

D.8 Proof of Theorem 3.8

For the proof of Theorem 3.8, we will use a careful analysis of the singularity hyperplanes of functions of the form $x\mapsto\varrho_{a}(\langle\alpha,x\rangle+\beta)$ , that is, the hyperplane on which this function is not differentiable. To simplify this analysis, we first introduce a convenient terminology and collect quite a few auxiliary results.

By discarding those $(\alpha_{j},\beta_{j})$ for which $x_{0}\notin S_{\alpha_{j},\beta_{j}}$ , we can assume that $x_{0}\in S_{\alpha_{j},\beta_{j}}$ for all $j\in\underline{N}$ .

Assume towards a contradiction that the claim of the lemma is false; that is,

But since $V$ is a vector space, this easily implies $V=\alpha^{\perp}$ , that is, $\langle z,\alpha_{j}\rangle=0$ for all $z\in\alpha^{\perp}$ . In other words, $\alpha^{\perp}\subset\alpha_{j}^{\perp}$ , and then $\alpha^{\perp}=\alpha_{j}^{\perp}$ by a dimension argument, since $\alpha,\alpha_{j}\neq 0$ .

Hence, $\operatorname{span}\alpha=(\alpha^{\perp})^{\perp}=(\alpha_{j}^{\perp})^{\perp}=\operatorname{span}\alpha_{j}$ . Because of $|\alpha|=|\alpha_{j}|=1$ , we thus see $\alpha=\varepsilon\,\alpha_{j}$ for some $\varepsilon\in\{\pm 1\}$ . Finally, since $x_{0}\in S_{\alpha,\beta}\cap S_{\alpha_{j},\beta_{j}}$ , we see $\beta=-\langle\alpha,x_{0}\rangle=-\varepsilon\langle\alpha_{j},x_{0}\rangle=\varepsilon\beta_{j},$ and thus $(\alpha,\beta)=\varepsilon(\alpha_{j},\beta_{j})$ , in contradiction to $(\alpha,\beta)\not\sim(\alpha_{j},\beta_{j})$ . ∎

Then, there is $\varepsilon>0$ satisfying

We claim that there is some $\varepsilon>0$ such that $x\coloneqq x_{0}+tz\in U\cap S_{\alpha,\beta}\cap\bigcap_{j=1}^{N}U_{\alpha_{j},\beta_{j}}^{(\varepsilon)}$ . To see this, note for $j\in\underline{N}\setminus J$ that $x_{0}\in S_{\alpha_{j},\beta_{j}}$ , and hence

since $\langle z,\alpha_{j}\rangle\neq 0$ for all $j\in\underline{N}\setminus J$ , by choice of $z$ . By combining all our observations, we see that ${x_{0}+tz\in U\cap S_{\alpha,\beta}\cap\bigcap_{j=1}^{N}U_{\alpha_{j},\beta_{j}}^{(\varepsilon)}}$ for $\varepsilon\coloneqq\min\{\varepsilon_{1},\varepsilon_{2}\}>0$ . ∎

Then, the family $(h_{i})_{i=1,\dots,N+1}$ is linearly independent.

Because of $x_{0}\in U\cap S_{\alpha_{j},\beta_{j}}$ , Lemma D.6 shows that $h_{\alpha_{j},\beta_{j}}^{(a)}|_{U}$ is not differentiable at $x_{0}$ . On the other hand, we have

where the right-hand side is differentiable at $x_{0}$ , since each summand is easily seen to be differentiable on the open set $V$ , with $x_{0}\in V\cap U$ . ∎

The continuous function $\Omega\to(0,\infty),x\mapsto|\langle\alpha,x\rangle+\beta|$ , which is well-defined by assumption, attains a minimum $\varepsilon=\min_{x\in\Omega}|\langle\alpha,x\rangle+\beta|>0$ . ∎

Step 2: We claim that $\xi_{1}-\xi_{2}\in\operatorname{span}\alpha$ . To see this, consider the modified function

which is continuous and satisfies $\widetilde{f}\equiv 0$ on $U\cap W_{\alpha,\beta}^{-}$ and $\widetilde{f}(x)=\langle\theta,x\rangle+\omega$ on $U\cap W_{\alpha,\beta}^{+}$ , where we defined $\theta\coloneqq\xi_{1}-\xi_{2}$ and $\omega\coloneqq\omega_{1}-\omega_{2}$ .

Since we saw in Step 1 that $U\cap S_{\alpha,\beta}\subset\overline{U\cap W_{\alpha,\beta}^{\pm}}$ , we thus get by continuity of $\widetilde{f}$ that

But by assumption on $U$ , there is some $x_{0}\in U\cap S_{\alpha,\beta}$ . For arbitrary $v\in\alpha^{\perp}$ , we then have $x_{0}+tv\in U\cap S_{\alpha,\beta}$ for all $t\in(-\varepsilon,\varepsilon)$ and a suitable $\varepsilon=\varepsilon(v)>0$ , since $U$ is open. Hence, $0=\langle\theta,x_{0}+tv\rangle+\omega=t\cdot\langle\theta,v\rangle$ for all $t\in(-\varepsilon,\varepsilon)$ , and thus $v\in\theta^{\perp}$ . In other words, $\alpha^{\perp}\subset\theta^{\perp}$ , and thus $\operatorname{span}\alpha=(\alpha^{\perp})^{\perp}\supset(\theta^{\perp})^{\perp}\ni\theta=\xi_{1}-\xi_{2},$ as claimed in this step.

Because of $x_{0}\in S_{\alpha,\beta}$ , we then have $g(x_{0})=\langle\zeta,x_{0}\rangle+\kappa=f(x_{0})$ . Furthermore, since $\varrho_{a}(x)=x$ for $x\geq 0$ , we see for all $x\in U\cap W_{\alpha,\beta}^{+}$ that

Here, the last step used that $f(x)=\langle\xi_{1},x\rangle+\omega_{1}$ for $x\in U\cap W_{\alpha,\beta}^{+}$ , and that $x_{0}\in U\cap S_{\alpha,\beta}\subset\overline{U\cap W_{\alpha,\beta}^{+}}$ by Step 1, so that we get $f(x_{0})=\langle\xi_{1},x_{0}\rangle+\omega_{1}$ as well.

Likewise, since $\varrho_{a}(t)=a\,t$ for $t<0$ , we see for $x\in U\cap W_{\alpha,\beta}^{-}$ that

In combination, Equations (D.15) and (D.16) show $f(x)=g(x)$ for all $x\in U\cap(W_{\alpha,\beta}^{+}\cup W_{\alpha,\beta}^{-})$ . Since this set is dense in $U$ by Step 1, we are done. ∎

With all of these preparations, we can finally prove Theorem 3.8.

Step 4 (Showing that the $j$ -th neuron is eventually affine-linear, for $j\in J$ ): Since Step 3 shows that the claim holds in case of $r=N_{0}$ , we will from now on consider only the case where $r<N_{0}$ .

For $j\in J$ , there are two cases: In case of $(b_{1})_{j}\in[0,\infty]$ , define

If otherwise $(b_{1})_{j}\in[-\infty,0)$ , define

Next, for arbitrary $0<\delta<B$ , we define $\Omega_{\delta}\coloneqq[-(B-\delta),B-\delta]^{d}$ . Note that since $S_{\alpha^{(i)},\beta^{(i)}}\cap\Omega^{\circ}\neq\varnothing$ for all $i\in\underline{r}$ , there is some $\delta_{0}>0$ such that $S_{\alpha^{(i)},\beta^{(i)}}\cap(-(B-\delta),B-\delta)^{d}\neq\varnothing$ for all $i\in\underline{r}$ and all $0<\delta\leq\delta_{0}$ . For the remainder of this step, we will consider a fixed $\delta\in(0,\delta_{0}]$ , and we claim that there is some $N_{2}=N_{2}(\delta)>0$ such that

simply because $(b_{1}^{n_{k}})_{j}\to(b_{1})_{j}$ and $\varrho_{a}(x)=x$ if $x\geq 0$ , and $\varrho_{a}(x)=ax$ if $x<0$ . Therefore, the affine-linear function

To prove Equation (D.17), we distinguish two cases for each $j\in J$ ; by definition of $J$ , these are the only two possible cases:

Now, since the function $x\mapsto\langle a_{k,j},x\rangle+(b_{1}^{n_{k}})_{j}$ is continuous, since $\Omega$ is connected (in fact convex), and since $0\in\Omega$ , this implies $\operatorname{sign}(\langle a_{k,j},x\rangle+(b_{1}^{n_{k}})_{j})=\operatorname{sign}(b_{1}^{n_{k}})_{j}$ for all $x\in\Omega$ and $k\geq k_{j}$ .

With the same argument as at the end of Case 1, we thus see $\operatorname{sign}(\langle a_{k,j},x\rangle+(b_{1}^{n_{k}})_{j})=\operatorname{sign}(b_{1}^{n_{k}})_{j}$ for all $x\in\Omega_{\delta}$ and $k\geq k_{j}(\delta)$ .

Together, the two cases prove that Equation (D.17) holds if we set $N_{2}(\delta)\coloneqq\max_{j\in J}k_{j}(\delta)$ .

To see this, let $\varepsilon>0$ be arbitrary, and recall $J^{c}=\bigcup_{i=1}^{r}J_{i}$ . By definition of $J_{i}$ , and by choice of $\alpha^{(i)}$ and $\beta^{(i)}$ , there is for each $i\in\underline{r}$ and $j\in J_{i}$ some $\sigma_{j}\in\{\pm 1\}$ satisfying

Define $N_{4}(\varepsilon)\coloneqq\max_{j\in J^{c}}k^{(j)}(\varepsilon)$ . Then, for $k\geq N_{4}(\varepsilon)$ , $i\in\underline{r}$ , $j\in J_{i}$ , and arbitrary $x\in\Omega\cap U_{\alpha^{(i)},\beta^{(i)}}^{(\varepsilon,\pm)}$ , we have on the one hand $|\sigma_{j}\cdot(\langle\alpha^{(i)},x\rangle+\beta^{(i)})|\geq\varepsilon$ , and on the other hand

Step 6 (Proving the “almost convergence” of the sum of all $j$ -th neurons for $j\in J_{i}$ ): For $i\in\underline{r}$ define

In combination with Equation (D.18), we see

Recall from Step 4 that $\Omega_{\delta_{0}}^{\circ}\cap S_{\alpha^{(i)},\beta^{(i)}}\neq\varnothing$ for all $i\in\underline{r}$ , by choice of $\delta_{0}$ . Therefore, Lemma D.5 shows (because of $U_{\alpha,\beta}^{(\sigma)}\subset(U_{\alpha,\beta}^{(\varepsilon)})^{\circ}$ for $\varepsilon<\sigma$ ) for each $i\in\underline{r}$ that

To use this observation, note that Equations (D.20) and (D.21) show that $g_{i}^{(k)}+q_{i}^{(k)}$ converges pointwise to $G_{i}$ on $B_{r_{i}}(x_{i})$ . Furthermore, since $x_{i}\in S_{\alpha^{(i)},\beta^{(i)}}$ , it is not hard to see that there is some $\varepsilon_{0}>0$ with $\big{(}U_{\alpha^{(i)},\beta^{(i)}}^{(\varepsilon,\pm)}\big{)}^{\circ}\cap B_{r_{i}}(x_{i})\neq\varnothing$ for all $\varepsilon\in(0,\varepsilon_{0})$ ; for the details, we refer to Step 1 in the proof of Lemma D.9. Finally, as a consequence of Step 5, we see for arbitrary $\varepsilon\in(0,\varepsilon_{0})$ that $g_{i}^{(k)}+q_{i}^{(k)}$ and $G_{i}$ are both affine-linear on $U_{\alpha^{(i)},\beta^{(i)}}^{(\varepsilon,\pm)}$ , at least for $k$ large enough (depending on $\varepsilon$ ). Thus, the observation from above (with $\Theta=U_{\alpha^{(i)},\beta^{(i)}}^{(\varepsilon,\pm)}$ and $U=\Theta^{\circ}\cap B_{r_{i}}(x_{i})$ ) implies that $g_{i}^{(k)}+q_{i}^{(k)}\to G_{i}$ pointwise on $U_{\alpha^{(i)},\beta^{(i)}}^{(\varepsilon,\pm)}$ , for arbitrary $\varepsilon\in(0,\varepsilon_{0})$ .

Step 7 (Finishing the proof): For arbitrary $\delta\in(0,\delta_{0})$ , let us set

Then, Equations (D.19) and (D.22) imply for $k\geq N_{3}(\delta)$ that

But the latter set is dense in $\Omega$ (since its complement is a null-set), and $f$ and $\psi+\sum_{i=1}^{r}G_{i}$ are continuous on $\Omega$ . Hence,

Recalling from Steps 3 and 4 that $r<N_{0}$ , this implies $f\in\mathcal{RNN}_{\varrho_{a}}^{\Omega}((d,r+1,1))\subset\mathcal{RNN}_{\varrho_{a}}^{\Omega}((d,N_{0},1))$ , as claimed. Here, we implicitly used that

Appendix E Proofs of the results in Section 4

with locally uniform convergence. Since $\Omega$ is compact, this implies uniform convergence on $\Omega$ , and thus completes the proof of the first claim.

For the induction step, let us write $\Psi=\big{(}(B_{1},c_{1}),\dots,(B_{L},c_{L})\big{)}$ , and note that

E.2 Proof of Theorem 4.2

Next, setting $z(\xi)\coloneqq e^{2\pi ia\xi}\neq 0$ , we observe that

We first note that each function $f_{a}$ from Step 1 is bounded. In fact, if $\varrho$ is $M$ -Lipschitz, then

as long as $n\geq n_{1}$ is so large that $x_{n},y_{n}\in\Omega$ . But this implies

E.3 Proof of Corollary 4.3

Clearly, by switching to complements, we can equivalently replace “open” by “closed” everywhere.