Topological properties of the set of functions generated by neural networks of fixed size
Philipp Petersen, Mones Raslan, Felix Voigtlaender
Introduction
Neural networks, introduced in 1943 by McCulloch and Pitts , are the basis of every modern machine learning algorithm based on deep learning . The term deep learning describes a variety of methods that are based on the data-driven manipulation of the weights of a neural network. Since these methods perform spectacularly well in practice, they have become the state-of-the-art technology for a host of applications including image classification , speech recognition , game intelligence , and many more.
This success of deep learning has encouraged many scientists to pick up research in the area of neural networks after the field had gone dormant for decades. In particular, quite a few mathematicians have recently investigated the properties of different neural network architectures, hoping that this can explain the effectiveness of deep learning techniques. In this context, mathematical analysis has mainly been conducted in the context of statistical learning theory , where the overall success of a learning method is determined by the approximation properties of the underlying function class, the feasibility of optimizing over this class, and the generalization capabilities of the class, when only training with finitely many samples.
In the approximation theoretical part of deep learning research, one analyzes the expressiveness of deep neural network architectures. The universal approximation theorem demonstrates that neural networks can approximate any continuous function, as long as one uses networks of increasing complexity for the approximation. If one is interested in approximating more specific function classes than the class of all continuous functions, then one can often quantify more precisely how large the networks have to be to achieve a given approximation accuracy for functions from the restricted class. Examples of such results are . Some articles study in particular in which sense deep networks have a superior expressiveness compared to their shallow counterparts, thereby partially explaining the efficiency of networks with many layers in deep learning.
Another line of research studies the training procedures employed in deep learning. Given a set of training samples, the training process is an optimization problem over the parameters of a neural network, where a loss function is minimized. The loss function is typically a non-linear, non-convex function of the weights of the network, rendering the optimization of this function highly challenging . Nonetheless, in applications, neural networks are often trained successfully through a variation of stochastic gradient descent. In this regard, the energy landscape of the problem was studied and found to allow convergence to a global optimum, if the problem is sufficiently overparametrized; see .
The third large area of mathematical research on deep neural networks is analyzing the so-called generalization error of deep learning. In the framework of statistical learning theory , the discrepancy between the empirical loss and the expected loss of a classifier is called the generalization error. Specific bounds for this error for the class of deep neural networks were analyzed for instance in , and in more specific settings for instance in .
In this work, we study neural networks from a different point of view. Specifically, we study the structure of the set of functions implemented by neural networks of fixed size. These sets are naturally (non-linear) subspaces of classical function spaces like and for compact sets .
Due to the size of the networks being fixed, our analysis is inherently non-asymptotic. Therefore, our viewpoint is fundamentally different from the analysis in the framework of statistical learning theory. Indeed, in approximation theory, the expressive power of networks growing in size is analyzed. In optimization, one studies the convergence properties of iterative algorithms—usually that of some form of stochastic gradient descent. Finally, when considering the generalization capabilities of deep neural networks, one mainly studies how and with which probability the empirical loss of a classifier converges to the expected loss, for increasing numbers of random training samples and depending on the sizes of the underlying networks.
Given this strong delineation to the classical fields, we will see that our point of view yields interpretable results describing phenomena in deep learning that are not directly explained by the classical approaches. We will describe these results and their interpretations in detail in Subsections 1.1–1.3.
We will use standard notation throughout most of the paper without explicitly introducing it. We do, however, collect a list of used symbols and notions in Appendix A. To not interrupt the flow of reading, we have deferred several auxiliary results to Appendix B and all proofs and related statements to Appendices C–E.
Before we continue, we formally introduce the notion of spaces of neural networks of fixed size.
where results from the following scheme:
Before we continue, let us note that the set of all neural networks (that is, the network weights) with a fixed architecture forms a finite-dimensional vector space, which we equip with the norm
In the remainder of this introduction, we discuss our results concerning the topological properties of the sets of realizations of neural networks with fixed architecture and their interpretation in the context of deep learning. Then, we give an overview of related work. We note at this point that it is straightforward to generalize all of the results in this paper to neural networks for which one only prescribes the total number of neurons and layers and not the specific architecture.
1 Non-convexity of the set of realizations
We will show in Section 2 (Theorem 2.1) that, for a given architecture , the set is not convex, except possibly when the activation function is a polynomial, which is clearly not the case for any of the activation functions that are commonly used in practice.
In fact, for a large class of activation functions (including the ReLU and the standard sigmoid activation function), the set turns out to be highly non-convex in the sense that for every , the set of functions having uniform distance at most to any function in is not convex. We prove this result in Theorem 2.2 and Remark 2.3.
This non-convexity is undesirable, since for non-convex sets, there do not necessarily exist well-defined projection operators onto them. In classical statistical learning theory , the property that the so-called regression function can be uniquely projected onto a convex (and compact) hypothesis space greatly simplifies the learning problem; see [20, Section 7]. Furthermore, in applications where the realization of a network—rather than its set of weights—is the quantity of interest (for example when a network is used as an Ansatz for the solution of a PDE, as in ), our results show that the Ansatz space is non-convex. This non-convexity is inconvenient if one aims for a convergence proof of the underlying optimization algorithm, since one cannot apply convexity-based fixed-point theorems. Concretely, if a neural network is optimized by stochastic gradient descent so as to satisfy a certain PDE, then it is interesting to see if there even exists a network so that the iteration stops. In other words, one might ask whether gradient descent on the set of neural networks (potentially with bounded weights) has a fixed point. If the space of neural networks were convex and compact, then the fixed-point theorem of Schauder would guarantee the existence of such a fixed point.
2 (Non-)closedness of the set of realizations
For any fixed architecture , we show in Section 3 (Theorem 3.1) that is not a closed subset of for , under very mild assumptions on the measure and the activation function . The assumptions concerning are satisfied for all activation functions used in practice.
For the case , the situation is more involved: For all activation functions that are commonly used in practice—except for the (parametric) ReLU—the associated sets are non-closed also with respect to the uniform norm; see Theorem 3.3. For the (parametric) ReLU, however, the question of closedness of the sets remains mostly open. Nonetheless, in two special cases, we prove in Section 3.4 that the sets are closed. In particular, for neural network architectures with two layers only, Theorem 3.8 establishes the closedness of , where is the (parametric) ReLU.
A practical consequence of the observation of non-closedness can be identified with the help of the following argument that is made precise in Subsection 3.3: We show that the set
of realizations of neural networks with a fixed architecture and all affine linear maps bounded in a suitable norm, is always closed. As a consequence, we observe the following phenomenon of exploding weights: If a function is such that it does not have a best approximation in , that is, if there does not exist such that
Certainly, the presence of large coefficients will make the numerical optimization increasingly unstable. Thus, exploding weights in the sense described above are highly undesirable in practice.
The argument above discusses an approximation problem in an -norm. In practice, one usually only minimizes “empirical norms”. We will demonstrate in Proposition 3.6 that also in this situation, for increasing numbers of samples, the weights of the neural networks that minimize the empirical norms necessarily explode under certain assumptions. Note that the set-up of having a fixed architecture and a potentially unbounded number of training samples is common in applications where neural networks are trained to solve partial differential equations. There, training samples are generated during the training process .
3 Failure of inverse stability of the realization map
For both of these results—continuity and no inverse stability—we only need to assume that the activation function is Lipschitz continuous and not constant.
These properties of the realization map pinpoint a potential problem that can occur when training a neural network: Let us consider a regression problem, where a network is iteratively updated by a (stochastic) gradient descent algorithm trying to minimize a loss function. It is then possible that at some iterate the loss function exhibits a very small error, even though the associated network parameters have a large distance to the optimal parameters. This issue is especially severe since a small error term leads to small steps if gradient descent methods are used in the optimization. Consequently, convergence to the very distant optimal weights will be slow even if the energy landscape of the optimization problem happens to be free of spurious local minima.
4 Related work
The aforementioned properties of non-convexity and non-closedness have, to some extent, been studied before. Classical results analyze the spaces of shallow neural networks, that is, of for , so that . For such sets of shallow networks, a property that has been extensively studied is to what extent has the best approximation property. Here, we say that has the best approximation property, if for every function , , there exists a function such that . In it was shown that even if a minimizer always exists, the map is necessarily discontinuous. Furthermore, at least for the Heaviside activation function, there does exist a (non-unique) best approximation; see .
Additionally, [27, Proposition 4.1] demonstrates, for shallow networks as before, that for the logistic activation function , the set does not have the best approximation property in . In the proof of this statement, it was also shown that is not closed. Furthermore, it is claimed that this result should hold for every non-linear activation function. The previously mentioned result of and Theorem 3.8 below disprove this conjecture for the Heaviside and ReLU activation functions, respectively.
In deep learning, one chooses a loss function , which is then minimized over the set of neural networks with fixed architecture . A typical loss function is the empirical square loss, that is,
It is important to emphasize that this notion of non-convexity describes properties of the loss function, in contrast to the non-convexity of the sets of functions that we analyze in this work.
Non-convexity of the set of realizations
In this section, we analyze the convexity of the set of all neural network realizations. In particular, we will show that this set is highly non-convex for all practically used activation functions listed in Table 1. First, we examine the convexity of the set :
If is convex, then is a polynomial.
(1) It is easy to see that all of the activation functions in Table 1 are locally Lipschitz continuous, and that none of them is a polynomial. Thus, the associated sets of realizations are never convex.
(2) In the case where is a polynomial, the set might or might not be convex. Indeed, if and , then it is not hard to see that is convex if and only if .
The detailed proof of Theorem 2.1 is the subject of Appendix C.1. Let us briefly outline the proof strategy:
We first show in Proposition C.1 that is closed under scalar multiplication, hence star-shaped with respect to the origin, i.e., 0 is a center.A subset of some vector space is called star-shaped, if there exists some such that for all , also . The vector is called a center of .
A direct consequence of Step 2 is that if is convex, then it can only contain a finite number of linearly independent functions; see Corollary C.5.
In applications, the non-convexity of might not be as problematic as it first seems. If, for instance, the set of functions that can be approximated up to error by a neural network with architecture was convex, then one could argue that the non-convexity of was not severe. Indeed, in practice, neural networks are only trained to minimize a certain empirical loss function, with resulting bounds on the generalization error which are typically of size , with denoting the number of training samples. In this setting, one is not really interested in “completely minimizing” the (empirical) loss function, but would be content with finding a function for which the empirical loss is -close to the global minimum. Hence, one could argue that one is effectively working with a hypothesis space of the form , containing all functions that can be represented up to an error of by neural networks of architecture .
To quantify this potentially more relevant notion of convexity of neural networks, we define, for a subset of a vector space , the convex hull of as
For , we say that a subset of a normed vector space is -convex in , if
Hence, the notion of -convexity asks whether the convex hull of a set is contained in an enlargement of this set. Note that if is dense in , then its closure is trivially -convex for all . Our main result regarding the -convexity of neural network sets shows that this is the only case in which is -convex for any .
Assume that is not dense in . Then there does not exist any such that is -convex in \big{(}C(\Omega),\|\cdot\|_{\sup}\big{)}.
All closures in the theorem are taken in .
The proof of this theorem is the subject of Appendix C.2. It is based on showing that if is -convex for some , then in fact is convex, which we then use to show that contains all realizations of two-layer neural networks with activation function . As shown in , this implies that is dense in , since is not a polynomial. ∎
While it is certainly natural to expect that should hold for most activation functions , giving a reference including large classes of activation functions such that the claim holds is not straightforward. We study this problem more closely in Appendix C.3.
To be more precise, from Proposition C.10 it follows that the ReLU, the parametric ReLU, the exponential linear unit, the softsign, the sigmoid, and the yield realization sets which are not dense in and in .
The only activation functions listed in Table 1 for which we do not know whether any of the realization sets is dense in or in are: the inverse square root linear unit, the inverse square root unit, the softplus, and the function. Of course, we expect that also for these activation functions, the resulting sets of realizations are never dense in or in .
(Non-)Closedness of the set of realizations
We start by examining the closedness with respect to -norms for . In fact, for all and all widely used activation functions (including all activation functions presented in Table 1), the set is not closed in , for any and any “sufficiently non-atomic” measure on . To be more precise, the following is true:
There is some such that is differentiable;
At least one of the following two conditions is satisfied:
There are with such that as , and as , and we have .
Finally, let and let be a finite Borel measure on for which the support is uncountable. Then the set is not closed in for any . More precisely, there is a function which satisfies for all , where the closure is taken in .
If is countable, then is a countable sum of Dirac measures, meaning that is purely atomic. In particular, if is non-atomic (meaning that for all ), then is uncountable and the theorem is applicable.
For the proof of the theorem, we refer to Appendix D.1. The main proof idea consists in the approximation of a (discontinuous) step function which cannot be represented by a neural network with continuous activation function. ∎
The assumptions concerning the activation function in Theorem 3.1 are satisfied for all of the activation functions listed in Table 1. In any case where is bounded, one can take ; otherwise, one can take .
For a proof of this statement, we refer to Appendix D.2. ∎
We have seen in Theorem 3.1 that under reasonably mild assumptions on the activation function —which are satisfied for all commonly used activation functions—the set is not closed in any -space where . However, the argument of the proof of Theorem 3.1 breaks down if one considers closedness with respect to the -norm. Therefore, we will analyze this setting more closely in this section. More precisely, in Theorem 3.3, we present several criteria regarding the activation function which imply that the set is not closed in . We remark that in all these results, will be assumed to be at least . Developing similar criteria for non-differentiable functions is an interesting topic for future research.
and is bounded, analytic, and not constant.
Then the set is not closed in the space .
For the proof of the statement, we refer to Appendix D.3. In particular, the proof of the statement under Condition (i) can be found in Appendix D.3.1. Its main idea consists of the uniform approximation of (which cannot be represented by neural networks with activation function , due to its lack of sufficient regularity) by neural networks. For the proof of the statement under Condition (ii), we refer to Appendix D.3.2. The main proof idea consists in the uniform approximation of an unbounded analytic function which cannot be represented by a neural network with activation function , since itself is bounded. Finally, the proof of the statement under Condition (iii) can be found in Appendix D.3.3. Its main idea consists in the approximation of the function ∎
Theorem 3.3 applies to all activation functions listed in Table 1 except for the ReLU and the parametric ReLU. To be more precise,
Condition (i) is fulfilled by the function for , and by the exponential linear unit, the softsign function, and the inverse square root linear unit.
Condition (ii) is fulfilled by the inverse square root unit, the sigmoid function, the function, and the function.
Condition (iii) (with and ) is fulfilled by the softplus function.
For the proof of this statement, we refer to Appendix D.4. In particular, for the proof of (1), we refer to Appendix D.4.1, the proof of (2) is clear and for the proof of (3), we refer to Appendix D.4.2. ∎
3 The phenomenon of exploding weights
We have just seen that the realization set is not closed in for any and every practically relevant activation function. Furthermore, for a variety of activation functions, we have seen that is not closed in . The situation is substantially different if the weights are taken from a compact subset:
The proof of this statement is based on the continuity of the realization map and can be found in Appendix D.5. ∎
Proposition 3.5 helps to explain the phenomenon of exploding network weights that is sometimes observed during the training of neural networks. Indeed, let us assume that is not closed in , where for a Borel measure on , or ; as seen in Sections 3.1 and 3.2, this is true under mild assumptions on . Then, it follows that there exists a function which does not have a best approximation in , meaning that there does not exist any satisfying
in fact, one can take any . Next, recall from Proposition 3.5 that the subset of that contains only realizations of networks with uniformly bounded weights is compact.
The argument above only deals with the approximation problem in or in for . In practice, one is often not concerned with these norms, but instead wants to minimize an empirical loss function over . For the empirical square loss, this loss function takes the form
Here, is the marginal probability distribution of on , and is called the Bayes risk; it is the minimal expected loss achievable by any (measurable) function.
In this context of a statistical learning problem, we have the following result regarding exploding weights:
A compact way of stating Proposition 3.6 is that, if has no best approximation in with respect to , then the weights of the minimizers (or approximate minimizers) of the empirical square loss explode for increasing numbers of samples.
Since is unknown in applications, it is indeed possible that has no best approximation in the set of neural networks. As just one example, this is true if is any Borel probability measure on and if is the distribution of , where and is bounded and satisfies , with the closure taken in . The existence of such a function is guaranteed by Theorem 3.1 if is uncountable.
For the proof of Proposition 3.6, we refer to Appendix D.6. The proof is based on classical arguments of statistical learning theory as given in . ∎
In this subsection, we analyze the closedness of sets of realizations of neural networks with respect to the ReLU or the parametric ReLU activation function in , mostly for the case . We conjecture that the set of (realizations of) ReLU networks of a fixed complexity is closed in , but were not able to prove such a result in full generality. In two special cases, namely when the networks have only two layers, or when at least the scaling weights are bounded, we can show that the associated set of ReLU realizations is closed in ; see below.
We begin by analyzing the set of realizations with uniformly bounded scaling weights and possibly unbounded biases, before proceeding with the analysis of two layer ReLU networks.
As our second result in this section, we show that the set of realizations of two-layer neural networks with arbitrary scaling weights and biases is closed in if the activation is the parametric ReLU. It is a fascinating question for further research whether this also holds for deeper networks.
Failure of inverse instability of the realization map
Finally, if is globally Lipschitz continuous, then there is a constant such that
For the proof of this statement, we refer to Appendix E.1. ∎
then the realizations of are identical.
In this section, our main goal is to determine whether, up to the failure of injectivity, the realization map is a homeomorphism onto its range; mathematically, this means that we want to determine whether the realization map is a quotient map. We will see that this is not the case.
We finally rephrase the preceding result in more topological terms:
For the proof of the statement, we refer to Appendix E.3. ∎
Acknowledgements
P.P. and M.R. were supported by the DFG Collaborative Research Center TRR 109 “Discretization in Geometry and Dynamics”. P.P. was supported by a DFG Research Fellowship ”Shearlet-based energy functionals for anisotropic phase-field methods”. M.R. is supported by the Berlin Mathematical School. F.V. acknowledges support from the European Commission through DEDALE (contract no. 665044) within the H2020 Framework Program.
We would like to thank Dave L. Renfro for bringing the paper to our attention.
References
Appendix A Notation
Given functions with , we consider three different types of products between these maps: The cartesian product of is
Finally, the direct sum of is defined if , and given by
Appendix B Auxiliary results: Operations with neural networks
This part of the appendix is devoted to auxiliary results that are connected with basic operations one can perform with neural networks and which we will frequently make use of in the proofs below.
We start by showing that one can “enlarge” a given neural network in such a way that the realizations of the original network and the enlarged network coincide. To be more precise, the following holds:
Setting , and , we define by
Another operation that we can perform with networks is concatenation, as given in the following definition.
Then, we call the concatenation of and .
We claim that there is some such that for all and all . To see this, first note by definition of the derivative that there is some with
Here we implicitly used that to ensure that the right-hand side is a positive multiple of . Now, set , and let be arbitrary. Note because of that every satisfies . Hence, if we set , then . Therefore,
Note that is differentiable at with derivative , thanks to the chain rule.
Using these preliminary observations, we now construct the neural networks and . Define {\Phi_{0}^{C}\coloneqq\big{(}(A_{1},b_{1}),(A_{2},b_{2})\big{)}}, where
where is applied times.
Since for all , it is not hard to see by induction that
where is applied times. Therefore, since , we conclude for that
We proceed with the second part of the proposition. We first prove the statement for . Let \widetilde{\Phi}_{1}^{C}\coloneqq\big{(}(A_{1}^{\prime},b_{1}^{\prime}),(A_{2}^{\prime},b_{2}^{\prime})\big{)}, where
We have . Next, define \widetilde{\Phi}_{2}^{C}\coloneqq\big{(}(A_{1}^{\prime\prime},b_{1}^{\prime\prime}),(A_{2}^{\prime\prime},b_{2}^{\prime\prime})\big{)}, where
We have . Setting , where we take concatenations (meaning factors), yields a neural network (with layers) such that
where is applied times. Exactly as in the proof of the first part, this implies for that
Setting and repeating the previous arguments yields the claim for . Permuting the columns of yields the result for arbitrary .
Appendix C Proofs and results connected to Section 2
We first establish the star-shapedness of the set of all realizations of neural networks, which is a direct consequence of the fact that the set is invariant under scalar multiplication. The following proposition provides the details.
We can choose in the argument above and obtain . For every the line between and is contained in , since is closed under scalar multiplication. We conclude that is star-shaped with respect to the origin. ∎
Our next goal is to show that cannot contain infinitely many linearly independent centers.
As a preparation, we prove two related results which show that the class is “small”. The main assumption for guaranteeing this is that the activation function should be locally Lipschitz continuous.
Since is locally Lipschitz continuous, Proposition 4.1 (which will be proved completely independently) shows that the realization map
As a composition of locally Lipschitz continuous functions, the map
As a corollary, we can now show that the class of neural network realizations cannot contain a subspace of large dimension.
Now, the announced estimate for the number of linearly independent centers of the set of all network realizations of a fixed size is a direct consequence.
Next, we analyze the convexity of . As a direct consequence of Proposition C.4, we see that is never convex if contains more than a certain number of linearly independent functions.
Every element of a convex set is a center. Thus the result follows directly from Proposition C.4. ∎
Step 1: Set . We first show that does not contain infinitely many linearly independent functions. To see this, first note that the map
Step 5: In this step, we complete the proof, by first showing for arbitrary that is a polynomial of degree at most .
Let be arbitrary. Since is continuous, it is uniformly continuous on , that is, there is some such that for all with . Since and , Proposition B.3 and Lemma B.1 imply existence of a neural network such that
with acting componentwise. By (C.3), it follows that there is a neural network satisfying
where the closure is taken with respect to the sup norm, and where we implicitly used that the space on the right-hand side is a closed subspace of , since it is a finite dimensional subspace.
Since is a polynomial of degree at most ; we see that the -th derivative of satisfies on , for arbitrary . Thus, , meaning that is a polynomial. ∎
In the above proof, we used the following elementary lemma, whose proof we provide for completeness.
Let us assume towards a contradiction that
As our final ingredient for the proof of Theorem 2.1, we show that every non-constant locally Lipschitz function satisfies the technical assumptions of Proposition C.6.
Since is not constant, there is some such that is not constant. By assumption, is Lipschitz continuous on . Thus, is absolutely continuous; see for instance [60, Definition 7.17]. Thanks to the fundamental theorem of calculus for the Lebesgue integral (see [60, Theorem 7.20]), this implies that is differentiable almost everywhere on and satisfies for , where if is not differentiable at .
Since is not constant, the preceding formula shows that there has to be some such that ; in particular, this means that is differentiable at . ∎
Now, a combination of Corollary C.5, Proposition C.6, and Lemma C.8 proves Theorem 2.1. For the application of Lemma C.8, note that if is constant, then is a polynomial, so that the conclusion of Theorem 2.1 also holds in this case.
C.2 Proof of Theorem 2.2
We first show in the following lemma that if is convex, then is dense in . The proof of Theorem 2.2 is given thereafter.
If is convex, then is dense in .
Now we are ready to prove Theorem 2.2. By assumption, is not dense in . We start by proving that there exists at least one such that the set is not -convex. Suppose towards a contradiction that this is not true, so that is -convex for all . This implies
where the last identity holds true, since if , there exists such that for all . Equation (C.6) shows that is convex, which by Lemma C.9 implies that is dense, in contradiction to the assumptions of Theorem 2.2. This is the desired contradiction, showing that is -convex for some .
Thus, there exists g\in\text{co}\big{(}\overline{\mathcal{RNN}_{\varrho}^{\Omega}(S)}\big{)} such that for all . Now, let be arbitrary. Then, , since is closed under scalar multiplication. Moreover,
again due to the closedness under scalar multiplication of . This shows that is not -convex for any .
C.3 Non-dense network sets
applying one of the arithmetic operations , and on real numbers;
jumps conditioned on comparisons of real numbers using the following operators: .
Using this result, we can now show that the realization sets of networks with activation functions that are computable by elementary operations are never dense in or .
Then. we have .
The considerations from before the statement of the proposition show that
Since is dense, there is for each some such that . If we set , then , and hence
and thus , where is the Lebesgue measure. Hence, , so that we can choose for each some . We then have
and hence if and otherwise. Thus, if we set , then we have as above that {\mathds{1}}_{[0,\infty)}\big{(}f_{b}(x_{j})-r_{j}\big{)}=b_{j} for all and . The remainder of the proof is as for . ∎
Note that the following activation functions are computable by elementary operations: any piecewise polynomial function (in particular, the ReLU and the parametric ReLU), the exponential linear unit, the softsign (since the absolute value can be computed using a case distinction), the sigmoid, and the . Thus, the preceding proposition applies to each of these activation functions.
Appendix D Proofs of the results in Section 3
The proof of Theorem 3.1 is crucially based on the following lemma:
Now, if , then any with satisfies so that and . From this it is easy to see that if are as provided by the result in (for ), then indeed .
Assume towards a contradiction that such a continuous function exists. Recall (see for instance [19, Section 7.4]) that the support of is defined as
In particular, if is open with , then .
We now prove Theorem 3.1. Set and define for and if is unbounded, while otherwise. We show that for as in the statement of the theorem there exists a sequence of functions in such that the sequence converges (in ) to a bounded, discontinuous limit , meaning that does not have a continuous representative, even after possibly changing it on a -null-set. Since this will show that .
Next, using Proposition B.3, choose a neural network with layers such that
We now distinguish the cases given in Assumption (iv)(a) and (b) of Theorem 3.1.
Then, . Now, let us define
since as . Analogously, it follows for that . Hence, setting , we see for each that
By the monotonicity and since is not constant (because of ), we have .
Then, . Now, let us define
Since each of the is continuous and is compact, we have for all . Equation (D.3) implies that for all . This in turn yields that
Similarly, the fact that for all yields
Combining (D.4) with (D.5) yields for all that
D.2 Proof of Corollary 3.2
Next, the softsign, the inverse square root unit, the sigmoid, the , and the function are all bounded, and thus satisfy condition (iv)(b) of Theorem 3.1. Thus, all that remains is to verify condition (iv)(a) of Theorem 3.1 for the remaining activation functions:
For the ReLU , condition (iv)(a) is satisfied with and .
For the parametric ReLU (with , ), Condition (iv)(a) is satisfied with and , where since .
For the exponential linear unit , Condition (iv)(a) is satisfied for and .
For the inverse square root linear unit , the quotient rule shows that for we have
Therefore, Condition (iv)(a) is satisfied for and .
For the softplus function , Condition (iv)(a) is satisfied for
D.3 Proof of Theorem 3.3
then as .
Now, by the Lipschitz continuity of and Equation (D.7), we conclude that
Here, the last step used that , so that .
Since uniformly, where , and hence , we thus see that is not closed in .
D.3.2 Proof of Theorem 3.3 under Condition (ii)
Since is continuous, we conclude that
We set and note for all that
D.3.3 Proof of Theorem 3.3 under Condition (iii)
for a constant . Moreover, for , we have
D.4 Proof of Corollary 3.4
D.4.2 Proof of Corollary 3.4.(3)
and for , we have .
D.5 Proof of Proposition 3.5
The set is closed and bounded in the normed space \big{(}\mathcal{NN}(S),\|\cdot\|_{\mathcal{NN}(S)}\big{)}. Thus, the Heine-Borel Theorem implies the compactness of . By Proposition 4.1 (which will be proved independently), the map
D.6 Proof of Proposition 3.6
By compactness of , we can choose satisfying Define . Since by assumption the infimum defining is not attained, we have , so that there are and with . Let with .
We now claim that . Once we prove this, we get
By rearranging and recalling the choice of and , we finally see
which is the desired contradiction.
D.7 Proof of Proposition 3.7
The main ingredient of the proof will be to show that one can replace a given sequence of networks with -bounded scaling weights by another sequence with -bounded scaling weights that also has bounded biases. Then one can apply Proposition 3.5.
each network , , has -bounded scaling weights;
For arbitrary and , this implies
Thus, it remains to construct the networks for (and the sets ) for with the properties (A)–(C) from above.
Thus, it remains to construct for (and the set itself) as described around Equation (D.12). To this end, for and , define
To see that these choices indeed fulfil the conditions outlined around Equation (D.12) for a suitable choice of , first note that is indeed a bounded family. Furthermore, \big{|}C_{k,i}^{(n)}\big{|}\leq\big{|}(B_{m+1}^{(n,m)})_{k,i}\big{|} for all and , which easily implies for all . Thus, it remains to verify equation (D.12) itself. But the estimate also implies
As a final preparation, note that is a cartesian product of ReLU functions, since . Now, for there are three cases:
where the last step used our choice of , and the fact that by Equation (D.13).
where the last step used our choice of .
Overall, we have thus shown that Equation (D.12) is satisfied for all , where
is clearly an infinite set, since is. ∎
Hence, is a bounded, infinite family in the finite dimensional vector space . Thus, there is a further infinite set such that as in .
But since is bounded, say , the realization map
D.8 Proof of Theorem 3.8
For the proof of Theorem 3.8, we will use a careful analysis of the singularity hyperplanes of functions of the form , that is, the hyperplane on which this function is not differentiable. To simplify this analysis, we first introduce a convenient terminology and collect quite a few auxiliary results.
By discarding those for which , we can assume that for all .
Assume towards a contradiction that the claim of the lemma is false; that is,
But since is a vector space, this easily implies , that is, for all . In other words, , and then by a dimension argument, since .
Hence, . Because of , we thus see for some . Finally, since , we see and thus , in contradiction to . ∎
Then, there is satisfying
We claim that there is some such that . To see this, note for that , and hence
since for all , by choice of . By combining all our observations, we see that for . ∎
Then, the family is linearly independent.
Because of , Lemma D.6 shows that is not differentiable at . On the other hand, we have
where the right-hand side is differentiable at , since each summand is easily seen to be differentiable on the open set , with . ∎
The continuous function , which is well-defined by assumption, attains a minimum . ∎
Step 2: We claim that . To see this, consider the modified function
which is continuous and satisfies on and on , where we defined and .
Since we saw in Step 1 that , we thus get by continuity of that
But by assumption on , there is some . For arbitrary , we then have for all and a suitable , since is open. Hence, for all , and thus . In other words, , and thus as claimed in this step.
Because of , we then have . Furthermore, since for , we see for all that
Here, the last step used that for , and that by Step 1, so that we get as well.
Likewise, since for , we see for that
In combination, Equations (D.15) and (D.16) show for all . Since this set is dense in by Step 1, we are done. ∎
With all of these preparations, we can finally prove Theorem 3.8.
Step 4 (Showing that the -th neuron is eventually affine-linear, for ): Since Step 3 shows that the claim holds in case of , we will from now on consider only the case where .
For , there are two cases: In case of , define
If otherwise , define
Next, for arbitrary , we define . Note that since for all , there is some such that for all and all . For the remainder of this step, we will consider a fixed , and we claim that there is some such that
simply because and if , and if . Therefore, the affine-linear function
To prove Equation (D.17), we distinguish two cases for each ; by definition of , these are the only two possible cases:
Now, since the function is continuous, since is connected (in fact convex), and since , this implies for all and .
With the same argument as at the end of Case 1, we thus see for all and .
Together, the two cases prove that Equation (D.17) holds if we set .
To see this, let be arbitrary, and recall . By definition of , and by choice of and , there is for each and some satisfying
Define . Then, for , , , and arbitrary , we have on the one hand , and on the other hand
Step 6 (Proving the “almost convergence” of the sum of all -th neurons for ): For define
In combination with Equation (D.18), we see
Recall from Step 4 that for all , by choice of . Therefore, Lemma D.5 shows (because of for ) for each that
To use this observation, note that Equations (D.20) and (D.21) show that converges pointwise to on . Furthermore, since , it is not hard to see that there is some with \big{(}U_{\alpha^{(i)},\beta^{(i)}}^{(\varepsilon,\pm)}\big{)}^{\circ}\cap B_{r_{i}}(x_{i})\neq\varnothing for all ; for the details, we refer to Step 1 in the proof of Lemma D.9. Finally, as a consequence of Step 5, we see for arbitrary that and are both affine-linear on , at least for large enough (depending on ). Thus, the observation from above (with and ) implies that pointwise on , for arbitrary .
Step 7 (Finishing the proof): For arbitrary , let us set
Then, Equations (D.19) and (D.22) imply for that
But the latter set is dense in (since its complement is a null-set), and and are continuous on . Hence,
Recalling from Steps 3 and 4 that , this implies , as claimed. Here, we implicitly used that
Appendix E Proofs of the results in Section 4
with locally uniform convergence. Since is compact, this implies uniform convergence on , and thus completes the proof of the first claim.
For the induction step, let us write \Psi=\big{(}(B_{1},c_{1}),\dots,(B_{L},c_{L})\big{)}, and note that
E.2 Proof of Theorem 4.2
Next, setting , we observe that
We first note that each function from Step 1 is bounded. In fact, if is -Lipschitz, then
as long as is so large that . But this implies
E.3 Proof of Corollary 4.3
Clearly, by switching to complements, we can equivalently replace “open” by “closed” everywhere.