How degenerate is the parametrization of neural networks with the ReLU activation function?

Julius Berner, Dennis Elbrächter, Philipp Grohs

Introduction and Motivation

In recent years much effort has been invested into explaining and understanding the overwhelming success of deep learning based methods. On the theoretical side, impressive approximation capabilities of neural networks have been established . No less important are recent results on the generalization of neural networks, which deal with the question of how well networks, trained on limited samples, perform on unseen data . Last but not least, the optimization error, which quantifies how well a neural network can be trained by applying stochastic gradient descent to an optimization problem, has been analyzed in different scenarios . While there are many interesting approaches to the latter question, they tend to require very strong assumptions (e.g. (almost) linearity, convexity, or extreme over-parametrization). Thus a satisfying explanation for the success of stochastic gradient descent for a non-smooth, non-convex problem remains elusive. In the present paper we intend to pave the way for a functional perspective on the optimization problem. This allows for new mathematical approaches towards understanding the training of neural networks, some of which are demonstrated in Section 1.2. To this end we examine degenerate parametrizations with undesirable properties in Section 2. These can be roughly classified as

weight vectors with directly opposite directions.

Under conditions designed to avoid these degeneracies, Theorem 3.1 establishes inverse stability for shallow networks with ReLU activation function. This is accomplished by a refined analysis of the behavior of ReLU networks near a discontinuity of their derivative. Proposition 1.2 shows how inverse stability connects the loss surface of the parametrized minimization problem to the loss surface of the realization space problem. In Theorem 1.3 we showcase a novel result on almost optimality of local minima of the parametrized problem obtained by analyzing the realization space problem. Note that this approach of analyzing the loss surface is conceptually different from previous approaches as in .

The architecture NALN\in\mathcal{A}_{L} simply specifies the number of neurons NlN_{l} in each of the LL layers. We can then define the space PN\mathcal{P}_{N} of parametrizations with architecture NALN\in\mathcal{A}_{L} as

the set P=NALPN\mathcal{P}=\bigcup_{N\in\mathcal{A}_{L}}\mathcal{P}_{N} of all parametrizations with architecture in AL\mathcal{A}_{L}, and the realization map

Given R(Γ)\mathcal{R}(\Gamma) and R(Θ)\mathcal{R}(\Theta) that are close, does there exist a parametrization Φ\Phi with R(Φ)=R(Θ)\mathcal{R}(\Phi)=\mathcal{R}(\Theta) such that Γ\Gamma and Φ\Phi are close?

As we will see in Section 2, this question is fundamentally connected to understanding the redundancies and degeneracies of the way that neural networks are parametrized. By suitable regularization, i.e. considering a subspace ΩPN\Omega\subseteq\mathcal{P}_{N} of parametrizations, we can avoid these pathologies and establish a positive answer to the question above. For such a property the term inverse stability was introduced in , which constitutes the only other research conducted in this area, as far as we are aware.

Let s,α>0s,\alpha>0, NALN\in\mathcal{A}_{L}, and ΩPN\Omega\subseteq\mathcal{P}_{N}. We say that the realization map is (s,α)(s,\alpha) inverse stable on Ω\Omega w.r.t. \|\cdot\|, if for all ΓΩ\Gamma\in\Omega and gR(Ω)g\in\mathcal{R}(\Omega) there exists ΦΩ\Phi\in\Omega with

See for further information on Sobolev norms, and for further information on the derivative of ReLU networks.

2 Implications of inverse stability for neural network optimization

Typically, the optimization problem is solved over some subspace of parametrizations ΩPN\Omega\subseteq\mathcal{P}_{N}, i.e.

From an abstract point of view, by writing g=R(Γ)R(Ω)g=\mathcal{R}(\Gamma)\in\mathcal{R}(\Omega), this is equivalent to the corresponding optimization problem over the space of realizations R(Ω)\mathcal{R}(\Omega), i.e.

However, the loss landscape of the optimization problem (8) is only properly connected to the loss landscape of the optimization problem (9) if the realization map is inverse stable on Ω\Omega. Otherwise a realization gR(PN)g\in\mathcal{R}(\mathcal{P}_{N}) can be arbitrarily close to a global minimum in the realization space but every parametrization Φ\Phi with R(Φ)=g\mathcal{R}(\Phi)=g is far away from the corresponding global minimum in the parametrization space. Moreover, local minima of (8) in the parametrization space must correspond to local minima of (9) in the realization space if and only if we have inverse stability.

Let NALN\in\mathcal{A}_{L}, ΩPN\Omega\subseteq\mathcal{P}_{N} and let the realization map be (s,α)(s,\alpha) inverse stable on Ω\Omega w.r.t. \|\cdot\|. Let ΓΩ\Gamma_{*}\in\Omega be a local minimum of LR\mathcal{L}\circ\mathcal{R} on Ω\Omega with radius r>0r>0, i.e. for all ΦΩ\Phi\in\Omega with ΦΓr\|\Phi-\Gamma_{*}\|_{\infty}\leq r it holds that

Then R(Γ)\mathcal{R}(\Gamma_{*}) is a local minimum of L\mathcal{L} on R(Ω)\mathcal{R}(\Omega) with radius (rs)1/α(\frac{r}{s})^{1/\alpha}, i.e. for all gR(Ω)g\in\mathcal{R}(\Omega) with gR(Γ)(rs)1/α\|g-\mathcal{R}(\Gamma_{*})\|\leq(\frac{r}{s})^{1/\alpha} it holds that

See Appendix A.1.2 for a proof and Example A.1 for a counterexample in the case that inverse stability is not given. Note that in (9) we consider a problem with convex loss function but non-convex feasible set, see [34, Section 3.2]. This opens up new avenues of investigation using tools from functional analysis and allows utilizing recent results exploring the topological properties of neural network realization spaces. As a concrete demonstration we provide with Theorem A.2 a strong result obtained on the realization space, which estimates the quality of a local minimum based on its radius and the approximation capabilities of the chosen architecture for a class of functions SS. Specifically let C>0C>0, let Λ ⁣:B[0,){\Lambda\colon\mathcal{B}\to[0,\infty)} be a quasi-convex regularizer, and define

We denote the sets of regularized parametrizations by

Assume that SS is compact in the \|\cdot\|-closure of R(P)\mathcal{R}(\mathcal{P}) and that for every NALN\in\mathcal{A}_{L} the realization map is (s,α)(s,\alpha) inverse stable on ΩN\Omega_{N} w.r.t. \|\cdot\| . Then for all ε,r>0\varepsilon,r>0 there exists n(ε,r)ALn(\varepsilon,r)\in\mathcal{A}_{L} such that for every NALN\in\mathcal{A}_{L} with N1n1(ε,r),,NL1nL1(ε,r)N_{1}\geq n_{1}(\varepsilon,r),\dots,N_{L-1}\geq n_{L-1}(\varepsilon,r) the following holds: Every local minimum Γ\Gamma_{*} with radius at least rr of minΓΩNL(R(Γ))\min_{\Gamma\in\Omega_{N}}\mathcal{L}(\mathcal{R}(\Gamma)) satisfies

See Appendix A.1.2 for a proof and note that here it is important to have an inverse stability result, where the parameters (s,α)(s,\alpha) do not depend on the size of the architecture, which we achieve for L=2L=2 and B=W1,\mathcal{B}=W^{1,\infty}. Suitable Λ\Lambda would be Besov norms which constitute a common regularizer in image and signal processing. Moreover, note that the required size of the architecture in Theorem 1.3 can be quantified, if one has approximation rates for SS. In particular, this approach allows the use of approximation results in order to explain the success of neural network optimization and enables a combined study of these two aspects, which, to the best of our knowledge, has not been done before. Unlike in recent literature, our result needs no assumptions on the sample set (incorporated in the loss function, see (7)), in particular we do not require “overparametrization” with respect to the sample size. Here the required size of the architecture only depends on the complexity of SS, i.e. the class of functions one wants to approximate, the radius of the local minima of interest, the Lipschitz constant of the loss function, and the parameters of the inverse stability. In the following we restrict ourselves to two-layer ReLU networks without biases, where we present a proof for (4,1/2)(4,1/2) inverse stability w.r.t. the Sobolev semi-norm on a suitably regularized space of parametrizations. Both the regularizations as well as the stronger norm (compared to the uniform norm) will shown to be necessary in Section 2. We now present, in an informal way, a collection of our main results. A short proof making the connection to the formal results can be found in Appendix A.1.2.

the network is balanced, i.e. ai=ci\|a_{i}\|_{\infty}=\|c_{i}\|_{\infty},

no non-zero weight vectors in the first layer are redundant, i.e. ai∦aja_{i}\not\parallel a_{j},

the last two coordinates of each weight vector aia_{i} are strictly positive.

The global minimum of (17) is at least as good as the global minimum of (15), i.e.

By further regularizing (17) in the sense of Theorem 1.3, we can estimate the quality of its local minima.

This argument is not limited to the MSE loss function but works for any loss function based on evaluating the realization. The omission of bias weights is standard in neural network optimization literature . While this severely limits the functions that can be realized with a given architecture, it is sufficient to augment the problem by one dimension in order to recover the full range of functions that can be learned . Here we augment by two dimensions, so that the third regularization condition C.3 can be fulfilled without loosing range. Moreover, note that, for simplicity of presentation, the regularization assumptions stated above are stricter than necessary and possible relaxations are discussed in Section 3.

Obstacles to inverse stability - degeneracies of ReLU parametrizations

In (21) one can see that in our setting W1,(U)|\cdot|_{W^{1,\infty}(U)} is independent of UU (as long as UU contains a neighbourhood of the origin) and will thus be abbreviated by W1,|\cdot|_{W^{1,\infty}}.

All proofs for this section can be found in Appendix A.2.2. We start by showing that inverse stability fails w.r.t. the uniform norm. This example is adapted from [34, Theorem 5.2] and represents, to the best of our knowledge, the only degeneracy which has already been observed before.

Let Γ:=(0,0)N(2,2,1)\Gamma:=(0,0)\in\mathcal{N}_{(2,2,1)} and gkR(N(2,2,1))g_{k}\in\mathcal{R}(\mathcal{N}_{(2,2,1)}) be given by (see Figure 1)

In particular, note that inverse stability fails here even for a non-degenerate parametrization of the zero function Γ=(0,0)\Gamma=(0,0). However, for this type of counterexample the magnitude of the gradient of R(Φk)\mathcal{R}(\Phi_{k}) needs to go to infinity, which is our motivation for looking at inverse stability w.r.t. W1,|\cdot|_{W^{1,\infty}}.

2 Failure of inverse stability w.r.t. Sobolev norm

In this section we present four degenerate cases where inverse stability fails w.r.t. W1,|\cdot|_{W^{1,\infty}}. This collection of counterexamples is complete in the sense that we can establish inverse stability under assumptions which are designed to exclude these four pathologies.

Let r>0r>0, \Gamma:=\big{(}(r,0),0\big{)}\in\mathcal{N}_{(2,1,1)} and gkR(N(2,1,1))g_{k}\in\mathcal{R}(\mathcal{N}_{(2,1,1)}) be given by (see Figure 3)

This is a very simple example of a degenerate parametrization of the zero function, since R(Γ)=0\mathcal{R}(\Gamma)=0 regardless of choice of rr. The issue here is that we can have a weight pair, i.e. ((r,0),0)((r,0),0), where the product is independent of the value of one of the parameters. Note that in Example A.4 one can see a slightly more subtle version of this pathology by considering \Gamma_{k}:=\big{(}(k,0),\tfrac{1}{k^{2}}\big{)}\in\mathcal{N}_{(2,1,1)} instead. In that case one could still get an inverse stability estimate for each fixed kk; the parameters of inverse stability (s,α)(s,\alpha) would however deteriorate with increasing kk. In particular this demonstrates the need for some sort of balancedness of the parametrization, i.e. control over ci\|c_{i}\|_{\infty} and ai\|a_{i}\|_{\infty} individually relative to ciai\|c_{i}\|_{\infty}\|a_{i}\|_{\infty}. Inverse stability is also prevented by redundant directions as the following example illustrates.

and gkR(N(2,2,1))g_{k}\in\mathcal{R}(\mathcal{N}_{(2,2,1)}) be given by (see Figure 3)

The next example shows that not only redundant weight vectors can cause issues, but also weight vectors of opposite direction, as they would allow for a (balanced) degenerate parametrization of the zero function.

Thus we will need an assumption which prevents each individual Γ\Gamma in our restricted set from having pairwise linearly dependent weight vectors, i.e. coinciding hyperplanes of non-differentiability. This, however, does not suffice as is demonstrated by the next example, which shows that the relation between the hyperplanes of the two realizations matters.

and consider the parametrizations (see Figure 5)

Note that Γ\Gamma and Θ\Theta need to have multiple exactly opposite weight vectors which add to something small (compared to the size of the individual vectors), but not zero, since otherwise reparametrization would be possible (see Lemma A.5).

Inverse stability for two-layer ReLU Networks

We now establish an inverse stability result using assumptions designed to exclude the pathologies from the previous section. First we present a rather technical theorem for output dimension one which considers a parametrization Γ\Gamma in the unrestricted parametrization space NN\mathcal{N}_{N} and a function gg in the the corresponding function space R(NN)\mathcal{R}(\mathcal{N}_{N}). The aim is to use assumptions which are as weak as possible, while allowing us to find a parametrization Φ\Phi of gg, whose distance to Γ\Gamma can be bounded relative to gR(Γ)W1,|g-\mathcal{R}(\Gamma)|_{W^{1,\infty}}. We then continue by defining a restricted parametrization space NN\mathcal{N}_{N}^{*}, for which we get uniform inverse stability (meaning that we get the same estimate for every ΓNN\Gamma\in\mathcal{N}_{N}^{*}).

It holds for all i[m]i\in[{m}] with ciΓaiΓ2gR(Γ)W1,\|c_{i}^{\Gamma}a_{i}^{\Gamma}\|_{\infty}\leq 2|g-\mathcal{R}(\Gamma)|_{W^{1,\infty}} that ciΓ,aiΓβ|c_{i}^{\Gamma}|,\|a_{i}^{\Gamma}\|_{\infty}\leq\beta.

It holds for all i,jIΓi,j\in I^{\Gamma} with iji\neq j that ajΓajΓaiΓaiΓ\frac{a_{j}^{\Gamma}}{\|a_{j}^{\Gamma}\|_{\infty}}\neq\frac{a_{i}^{\Gamma}}{\|a_{i}^{\Gamma}\|_{\infty}}.

There exists a parametrization \Theta=\Big{(}\big{[}a_{1}^{\Theta}\big{|}\dots\big{|}a_{m}^{\Theta}\big{]}^{T},c^{\Theta}\Big{)}\in\mathcal{N}_{N} such that R(Θ)=g\mathcal{R}(\Theta)=g and

it holds for all i,jIΓi,j\in I^{\Gamma} with iji\neq j that ajΓajΓaiΓaiΓ\frac{a_{j}^{\Gamma}}{\|a_{j}^{\Gamma}\|_{\infty}}\neq-\frac{a_{i}^{\Gamma}}{\|a_{i}^{\Gamma}\|_{\infty}} and for all i,jIΘi,j\in I^{\Theta} with iji\neq j that ajΘajΘaiΘaiΘ\frac{a_{j}^{\Theta}}{\|a_{j}^{\Theta}\|_{\infty}}\neq-\frac{a_{i}^{\Theta}}{\|a_{i}^{\Theta}\|_{\infty}},

it holds for all iIΓi\in I^{\Gamma}, jIΘj\in I^{\Theta} that aiΓaiΓajΘajΘ\frac{a_{i}^{\Gamma}}{\|a_{i}^{\Gamma}\|_{\infty}}\neq-\frac{a_{j}^{\Theta}}{\|a_{j}^{\Theta}\|_{\infty}}

where IΘ:={i[m] ⁣:aiΘ0}I^{\Theta}:=\{i\in[{m}]\colon a^{\Theta}_{i}\neq 0\}.

Then there exists a parametrization ΦNN\Phi\in\mathcal{N}_{N} with

for which we have R(NN)=R(NN)\mathcal{R}(\mathcal{N}^{\prime}_{N})=\mathcal{R}(\mathcal{N}_{N}). Note that the above definition as well as the following definition and theorem are for networks with arbitrary output dimensions, as the balancedness condition makes this extension rather straightforward. In order to satisfy Conditions C.3a and C.3b we need to restrict the parametrization space in a way which also restricts the corresponding space of realizations. One possibility to do so is the following approach, which also incorporates the previous restrictions as well as the transition to networks without biases.

In particular, this means that for any optimization problem over an unrestricted parametrization space P(d,m,D)\mathcal{P}_{(d,m,D)}, there is a corresponding optimization problem over the parametrization space N(d+2,m+1,D)\mathcal{N}^{*}_{(d+2,m+1,D)} whose solution is at least as good (see Corollary 1.4). Our main result now states that for such a restricted parametrization space we have uniform (4,1/2)(4,1/2) inverse stability w.r.t. W1,|\cdot|_{W^{1,\infty}}, a proof of which can be found in Appendix A.3.2.

Outlook

This contribution investigates the potential insights which may be gained from studying the optimization problem over the space of realizations, as well as the difficulties encountered when trying to connect it to the parametrized problem. While Theorem 1.3 and Theorem 3.3 offer some compelling preliminary answers, there are multiple ways in which they can be extended. To obtain our inverse stability result for shallow ReLU networks we studied sums of ridge functions. Extending this result to deep ReLU networks requires understanding their behaviour under composition. In particular, we have ridge functions which vanish on some half space, i.e. colloquially speaking each neuron may “discard half the information” it receives from the previous layer. This introduces a new type of degeneracy, which one will have to deal with. Another interesting direction is an extension to inverse stability w.r.t. some weaker norm like L\|\cdot\|_{L^{\infty}} or a fractional Sobolev norm under stronger restrictions on the space of parametrizations (see Lemma A.7 for a simple approach using very strong restrictions). Lastly, note that Theorem 1.3 is not specific to the ReLU activation function and thus also incentivizes the study of inverse stability for any other activation function. From an applied point of view, Conditions C.1-C.3 motivate the implementation of corresponding regularization (i.e. penalizing unbalancedness and redundancy in the sense of parallel weight vectors) in state-of-the-art networks, in order to explore whether preventing inverse stability leads to improved performance in practice. Note that there already are results using, e.g. cosine similarity, as regularizer to prevent parallel weight vectors as well as approaches, called Sobolev Training, reporting better generalization and data-efficiency by employing a Sobolev norm based loss .

Acknowledgment

The research of JB and DE was supported by the Austrian Science Fund (FWF) under grants I3403-N32 and P 30148. The authors would like to thank Pavol Harár for helpful comments.

References

Appendix A Appendix - Proofs and Additional Material

For simplicity of presentation, assume we are given two samples x1D1x^{1}\in D_{1}, x2D2x^{2}\in D_{2} with labels y1=0y^{1}=0, y2=1y^{2}=1. The corresponding MSE is

with loss L(R(Γ))=12\mathcal{L}(\mathcal{R}(\Gamma_{*}))=\tfrac{1}{2}. Note that changing each weight by less than 12\tfrac{1}{2} does not decrease the loss, as this rotates the vector (1,0)(-1,0) by at most 4545^{\circ}. Thus Γ\Gamma_{*} is a local minimum in the parametrization space. However, the sequence of realizations given by

see Figure 6. Accordingly, R(Γ)\mathcal{R}(\Gamma_{*}) is not a local minimum in the realization space even w.r.t. the Sobolev norm. The problem occurs, since inverse stability fails due to unbalancedness of Γ\Gamma_{*}.

Let gg_{*} be a local minimum with radius r2ηr^{\prime}\geq 2\eta of the optimization problem mingR(ΩN)L(g)\min_{g\in\mathcal{R}(\Omega_{N})}\mathcal{L}(g). Then it holds for every gR(ΩN)g\in\mathcal{R}(\Omega_{N}) (in particular for every global minimizer) that

Define λ:=r2gg\lambda:=\frac{r^{\prime}}{2\|g-g_{*}\|} and f:=(1λ)g+λgSf:=(1-\lambda)g_{*}+\lambda g\in S. Due to (46) there is ΦΩN\Phi\in\Omega_{N} such that R(Φ)fη\|\mathcal{R}(\Phi)-f\|\leq\eta and by the assumptions on gg_{*} and L\mathcal{L} it holds that

This completes the proof. See Figure 7 for illustration.

A.1.2 Proofs

By Definition 1.1 we know that for every gR(Ω)g\in\mathcal{R}(\Omega) with gR(Γ)(rs)1/α\|g-\mathcal{R}(\Gamma_{*})\|\leq(\frac{r}{s})^{1/\alpha} there exists ΦΩ\Phi\in\Omega with

Let ε,r>0\varepsilon,r>0, define r:=(rs)1/αr^{\prime}:=(\tfrac{r}{s})^{1/\alpha} and η:=min{(2crdiam(S))1ε,r2}\eta:=\min\{(\tfrac{2c}{r^{\prime}}\operatorname{diam}(S))^{-1}\varepsilon,\tfrac{r^{\prime}}{2}\}. Then compactness of SS implies the existence of an architecture n(ε,r)ALn(\varepsilon,r)\in\mathcal{A}_{L} such that for every NALN\in\mathcal{A}_{L} with N1n1(ε,r),,NL1nL1(ε,r)N_{1}\geq n_{1}(\varepsilon,r),\dots,N_{L-1}\geq n_{L-1}(\varepsilon,r) the approximability assumption (46) is satisfied. Let now Γ\Gamma_{*} be a local minimum with radius at least rr of minΓΩNL(R(Γ))\min_{\Gamma\in\Omega_{N}}\mathcal{L}(\mathcal{R}(\Gamma)). As we assume uniform (s,α)(s,\alpha) inverse stability, Proposition 1.2 implies that R(Γ)\mathcal{R}(\Gamma_{*}) is a local minimum of the optimization problem mingR(ΩN)L(g)\min_{g\in\mathcal{R}(\Omega_{N})}\mathcal{L}(g) with radius at least r=(rs)1/α2ηr^{\prime}=(\tfrac{r}{s})^{1/\alpha}\geq 2\eta. Theorem A.2 establishes the claim. ∎

We simply combine the main observations from our paper. First, note that the assumptions imply that the restricted parametrization space Ω\Omega, which we are optimizing over, is the space N(d+2,N1+1,D)\mathcal{N}_{(d+2,N_{1}+1,D)}^{*} from Definition 3.2. Secondly, Theorem 3.3 implies that the realization map is (4,1/2)(4,1/2) inverse stable on Ω\Omega. Thus, Proposition 1.2 directly proves Claim 1. For the proof of Claim 2 we make use of Lemma A.6. It implies that for every ΘP(d,N1,D)\Theta\in\mathcal{P}_{(d,N_{1},D)} there exists ΓΩ\Gamma\in\Omega such that it holds that

A.2 Section 2

with linearly independent weight vectors (aiΘ)i=1m(a_{i}^{\Theta})_{i=1}^{m} and mini[m]ciΘ>0\min_{i\in[m]}\|c_{i}^{\Theta}\|_{\infty}>0 and let

with R(Φ)=R(Θ)\mathcal{R}(\Phi)=\mathcal{R}(\Theta). Then there exists a permutation π ⁣:[m][m]\pi\colon[m]\to[m] such that for every i[m]i\in[m] there exist λi(0,)\lambda_{i}\in(0,\infty) with

This means that, up to reordering and rebalancing, Θ\Theta is the unique parametrization of R(Θ)\mathcal{R}(\Theta).

First we define for every s{0,1}ms\in\{0,1\}^{m} the corresponding open orthant

By assumption AΘA^{\Theta} has rank mm, i.e. is surjective, and therefore the preimages of the orthants

are disjoint, non-empty open sets. Note that on each HsH^{s} the realization R(Θ)\mathcal{R}(\Theta) is linear with

Since AΘA^{\Theta} has full row rank, it has a right inverse. Thus we have for s,t{0,1}ms,t\in\{0,1\}^{m} that

Note that CΘdiag(s)=CΘdiag(t)C^{\Theta}\operatorname{diag}(s)=C^{\Theta}\operatorname{diag}(t) can only hold if s=ts=t due to the assumptions that ciΘ0\|c^{\Theta}_{i}\|_{\infty}\neq 0 for all i[m]i\in[m]. Thus the above establishes that for s,t{0,1}ms,t\in\{0,1\}^{m} it holds that

i.e. R(Θ)\mathcal{R}(\Theta) has different derivatives on its 2m2^{m} linear regions. In order for R(Φ)\mathcal{R}(\Phi) to have matching linear regions and matching derivatives on each one of them, there must exist a permutation matrix P{0,1}m×mP\in\{0,1\}^{m\times m} such that for every s{0,1}ms\in\{0,1\}^{m}

Thus, there exist (λi)i=1m(0,)m(\lambda_{i})_{i=1}^{m}\in(0,\infty)^{m} such that

The assumption that DR(Θ)=DR(Ψ)D\mathcal{R}(\Theta)=D\mathcal{R}(\Psi), together with (56) for s=(1,,1)s=(1,\dots,1), implies that

and gkR(N(2,1,1))g_{k}\in\mathcal{R}(\mathcal{N}_{(2,1,1)}) be given by

The only way to parametrize gkg_{k} is gk(x)=R(Φk)(x)=cρ((0,a),x)g_{k}(x)=\mathcal{R}(\Phi_{k})(x)=c\rho(\langle(0,a),x\rangle) with a,c>0a,c>0 (see Lemma A.3), and we have

A.2.2 Proofs

(see [34, Prop. 5.1]). It follows that R(Φk)W1,Lip(R(Φk))|\mathcal{R}(\Phi_{k})|_{W^{1,\infty}}\leq\operatorname{Lip}(\mathcal{R}(\Phi_{k})) is uniformly bounded which contradicts (67). ∎

The only way to parametrize gkg_{k} is gk(x)=R(Φk)(x)=cρ((0,a),x)g_{k}(x)=\mathcal{R}(\Phi_{k})(x)=c\rho(\langle(0,a),x\rangle) with a,c>0a,c>0 (see also Lemma A.3), which proves the claim. ∎

(see Lemma A.3). Thus it holds that ΦkΓ(1,0)(0,a2)1\|\Phi_{k}-\Gamma\|_{\infty}\geq\|(1,0)-(0,a_{2})\|_{\infty}\geq 1 and the proof is completed by direct calculation. ∎

Let Φk\Phi_{k} be an arbitrary parametrization of gkg_{k} given by

The Cauchy-Schwarz inequality and the linear independence of vv to each aia_{i}, i[m]i\in[m], establishes that C:=1d2mini[m][ai22v22ai,v2]>0C:=\tfrac{1}{d^{2}}\min_{i\in[m]}\left[\|a_{i}\|^{2}_{2}\|v\|^{2}_{2}-\langle a_{i},v\rangle^{2}\right]>0. Together with the fact that R(Γ)=0\mathcal{R}(\Gamma)=0, this completes the proof. ∎

and thus the difference of the gradients is constant, i.e.

However, regardless of the balancing and reordering of the weight vectors aika_{i}^{k}, ii\in, we have that

By Lemma A.3, up to balancing and reordering, there does not exist any other parametrization of Θk\Theta_{k} with the same realization. ∎

A.3 Section 3

Since ΘP(d,m,D)\Theta\in\mathcal{P}_{(d,m,D)} it can be written as

and observe that bi+>0b_{i}^{+}>0, bi>0b_{i}^{-}>0, and bi+bi=bib_{i}^{+}-b_{i}^{-}=b_{i}. For i[m]i\in[{m}] let

In order to balance the network, let aiΓ=ai(ciai)1/2a_{i}^{\Gamma}=a_{i}^{*}(\tfrac{\|c^{*}_{i}\|_{\infty}}{\|a_{i}^{*}\|_{\infty}})^{1/2} and ciΓ=ci(aici)1/2c_{i}^{\Gamma}=c_{i}^{*}(\tfrac{\|a^{*}_{i}\|_{\infty}}{\|c_{i}^{*}\|_{\infty}})^{1/2} for every i[m+1]i\in[m+1]. Then the claim follows by direct computation. ∎

A.3.2 Proofs

By conditions C.2 and C.3a we have for all i,jIΓi,j\in I^{\Gamma} that

Further note that we can reparametrize R(Θ)\mathcal{R}(\Theta) such that the same holds there. To this end observe that

given that aa^{\prime} is a positive multiple of aa. Specifically, let (Jk)k=1K(J_{k})_{k=1}^{K} be a partition of IΘI^{\Theta} (i.e. JkJ_{k}\neq\emptyset, k=1KJk=IΘ\cup_{k=1}^{K}J_{k}=I^{\Theta} and JkJk=J_{k}\cap J_{k^{\prime}}=\emptyset if kkk\neq k^{\prime}), such that for all k[K]k\in[{K}] it holds that

We denote by jkj_{k} the smallest element in JkJ_{k} and make the following replacements, for all iIΘi\in I^{\Theta}, without changing the realization of Θ\Theta:

Note that we also update the set IΘ:={i[m] ⁣:aiΘ0}I^{\Theta}:=\{i\in[{m}]\colon a^{\Theta}_{i}\neq 0\} accordingly. Let now

By construction and condition C.3a, we have for all i,jIΘi,j\in I^{\Theta} that

Note that we now have a parametrization Θ\Theta of gg, where all weight vectors aiΘa_{i}^{\Theta} are either zero (in which case the corresponding ciΘc_{i}^{\Theta} are also zero) or pairwise linearly independent to each other nonzero weight vector. Next, for s{0,1}ms\in\{0,1\}^{m}, let

The HΓsH^{s}_{\Gamma}, sSΓs\in S^{\Gamma}, and HΘsH^{s}_{\Theta}, sSΘs\in S^{\Theta}, are the interiors of the different linear regions of R(Γ)\mathcal{R}(\Gamma) and R(Θ)\mathcal{R}(\Theta) respectively. Next observe that the derivatives of fiΓ,fiΘf^{\Gamma}_{i},f^{\Theta}_{i} are (a.e.) given by

Note that for every xHΓsx\in H^{s}_{\Gamma}, yHΘsy\in H^{s}_{\Theta} we have

Next we use that for sSΓs\in S^{\Gamma}, tSΘt\in S^{\Theta} we have ΣsΓΣtΘr|\Sigma^{\Gamma}_{s}-\Sigma^{\Theta}_{t}|\leq r if HsΓHtΘH^{\Gamma}_{s}\cap H^{\Theta}_{t}\neq\emptyset, and compare adjacent linear regions of R(Γ)R(Θ)\mathcal{R}(\Gamma)-\mathcal{R}(\Theta). Let now iIΓi\in I^{\Gamma} and consider the following cases: Case 1: We have HΓ,i0HΘ,j0H^{0}_{\Gamma,i}\neq H^{0}_{\Theta,j} for all jIΘj\in I^{\Theta}. This means that the DfkΘDf_{k}^{\Theta}, k[m]k\in[{m}], and the DfkΓDf_{k}^{\Gamma}, k[m]\{i}k\in[{m}]\backslash\{i\}, are the same on both sides near the hyperplane HΓ,i0H^{0}_{\Gamma,i}, while the value of DfiΓDf^{\Gamma}_{i} is on one side and ciΓaiΓc_{i}^{\Gamma}a_{i}^{\Gamma} on the other. Specifically, there exist s+,sSΓs^{+},s^{-}\in S^{\Gamma} and sSΘs^{*}\in S^{\Theta} such that si+=1s^{+}_{i}=1, si=0s^{-}_{i}=0, sj+=sjs^{+}_{j}=s^{-}_{j} for all j[m]\{i}j\in[{m}]\backslash\{i\}, and HΓs+HΘsH_{\Gamma}^{s^{+}}\cap H_{\Theta}^{s^{*}}\neq\emptyset, HΓsHΘsH_{\Gamma}^{s^{-}}\cap H_{\Theta}^{s^{*}}\neq\emptyset, which implies

Case 2: There exists jIΘj\in I^{\Theta} such that HΓ,i0=HΘ,j0H^{0}_{\Gamma,i}=H^{0}_{\Theta,j}. Note that (86) ensures that HΓ,i0HΓ,k0H^{0}_{\Gamma,i}\neq H^{0}_{\Gamma,k} for k[m]{i}k\in[{m}]\setminus\{i\} and (92) ensures that HΘ,j0HΓ,k0H^{0}_{\Theta,j}\neq H^{0}_{\Gamma,k} for k[m]{j}k\in[{m}]\setminus\{j\}. Moreover, Condition C.3b implies HΓ,i+=HΘ,j+H^{+}_{\Gamma,i}=H^{+}_{\Theta,j}. This means that the DfkΘDf_{k}^{\Theta}, k[m]\{j}k\in[{m}]\backslash\{j\}, and the DfkΓDf_{k}^{\Gamma}, k[m]\{i}k\in[{m}]\backslash\{i\}, are the same on both sides near the hyperplane HΓ,i0=HΘ,j0H^{0}_{\Gamma,i}=H^{0}_{\Theta,j}, while the values of DfiΓDf^{\Gamma}_{i} and DfjΘDf^{\Theta}_{j} change. Specifically there exist s+,sSΓs^{+},s^{-}\in S^{\Gamma} and t+,tSΘt^{+},t^{-}\in S^{\Theta} such that si+=1s^{+}_{i}=1, si=0s^{-}_{i}=0, sk+=sks^{+}_{k}=s^{-}_{k} for all k[m]\{i}k\in[{m}]\backslash\{i\}, tj+=1t^{+}_{j}=1, tj=0t^{-}_{j}=0, tk+=tkt^{+}_{k}=t^{-}_{k} for all k[m]\{j}k\in[{m}]\backslash\{j\} and Hs+ΓHt+ΘH^{\Gamma}_{s^{+}}\cap H^{\Theta}_{t^{+}}\neq\emptyset, HsΓHtΘH^{\Gamma}_{s^{-}}\cap H^{\Theta}_{t^{-}}\neq\emptyset, which implies

Analogously we get for iIΘi\in I^{\Theta} that HΘ,i0HΓ,j0H^{0}_{\Theta,i}\neq H^{0}_{\Gamma,j} for all jIΓj\in I^{\Gamma} implies ciΘaiΘ2r\|c^{\Theta}_{i}a^{\Theta}_{i}\|_{\infty}\leq 2r. Next let

Colloquially speaking, this shows that for every fiΓf^{\Gamma}_{i} with iI2i\in I_{2} there is a fjΘf^{\Theta}_{j} with exactly matching half-spaces, i.e. HΓ,i+=HΘ,j+H^{+}_{\Gamma,i}=H^{+}_{\Theta,j}, and approximately matching gradients (Case 2). Moreover, all unmatched fiΓf^{\Gamma}_{i} and fjΘf^{\Theta}_{j} must have a small gradient (Case 1). Specifically, the above establishes that there exists a permutation π ⁣:[m][m]\pi\colon[{m}]\to[{m}] such that for every iI1i\in I_{1} it holds that

We make the following replacements, for all i[m]i\in[m], without changing the realization of Θ\Theta:

In order to balance the weights of Θ\Theta for I1I_{1}, we further make the following replacements, for all iI1i\in I_{1} with aiΘ0a^{\Theta}_{i}\neq 0, without changing the realization of Θ\Theta:

Moreover, due to Condition C.1, we get for every iI1i\in I_{1} that

Next we (approximately) match the balancing of (ciΘ,aiΘ)(c_{i}^{\Theta},a_{i}^{\Theta}) to the balancing of (ciΓ,aiΓ)(c_{i}^{\Gamma},a_{i}^{\Gamma}) for iI2i\in I_{2}, in order to derive estimates on ciΘciΓ|c^{\Theta}_{i}-c^{\Gamma}_{i}| and aiΘaiΓ\|a^{\Theta}_{i}-a^{\Gamma}_{i}\|_{\infty} from (102). Specifically, we make the following replacements, for all iI2i\in I_{2}, without changing the realization of Θ\Theta:

Let now iI2i\in I_{2} and consider the following cases: Case A: We have ciΓaiΓ2r\|c_{i}^{\Gamma}a_{i}^{\Gamma}\|_{\infty}\leq 2r which, together with (102), implies ciΘaiΘ4r\|c_{i}^{\Theta}a^{\Theta}_{i}\|_{\infty}\leq 4r. Due to (108) and Condition C.1 it follows that

Case B.1: We have ciΓaiΓ>2r\|c_{i}^{\Gamma}a_{i}^{\Gamma}\|_{\infty}>2r and ciΓ>aiΓ|c^{\Gamma}_{i}|>\|a_{i}^{\Gamma}\|_{\infty} which ensures ciΓ>ciΓaiΓ1/2|c_{i}^{\Gamma}|>\|c_{i}^{\Gamma}a_{i}^{\Gamma}\|_{\infty}^{1/2}. Due to (109) we get ciΘ=ciΓc_{i}^{\Theta}=c_{i}^{\Gamma} and it follows that

Case B.2: We have ciΓaiΓ>2r\|c_{i}^{\Gamma}a_{i}^{\Gamma}\|_{\infty}>2r and ciΓ<aiΓ|c^{\Gamma}_{i}|<\|a_{i}^{\Gamma}\|_{\infty} which ensures aiΓ>ciΓaiΓ1/2\|a_{i}^{\Gamma}\|>\|c_{i}^{\Gamma}a_{i}^{\Gamma}\|_{\infty}^{1/2}. Due to (110) we get aiΘ=aiΓa_{i}^{\Theta}=a_{i}^{\Gamma} and it follows that

Case B.3: We have ciΓaiΓ>2r\|c_{i}^{\Gamma}a_{i}^{\Gamma}\|_{\infty}>2r and ciΓ=aiΓ|c^{\Gamma}_{i}|=\|a_{i}^{\Gamma}\|_{\infty}. Note that ciΓaiΓ>2r\|c_{i}^{\Gamma}a_{i}^{\Gamma}\|_{\infty}>2r and (102) ensure that sgn(ciΘ)=sgn(ciΓ)\operatorname{sgn}(c_{i}^{\Theta})=\operatorname{sgn}(c_{i}^{\Gamma}), and that for x,y>0x,y>0 it holds that xyx2y21/2|x-y|\leq|x^{2}-y^{2}|^{1/2}. Combining this with the definition of I2I_{2}, the reverse triangle inequality, and (111) implies that

Combining (107), (112), (113), (114), and (115) establishes that

Let ΘNN\Theta\in\mathcal{N}_{N}^{*} be a parametrization of gg, i.e. R(Θ)=g\mathcal{R}(\Theta)=g. We write

As in (103), we make the following replacements, for all i[m]i\in[m], without changing the realization of Θ\Theta:

Note that the weights of Θ\Theta are already balanced, i.e. we have for every i[m]i\in[{m}] that

Thus, we can skip the reparametrization step in (104) and get directly for every iI1i\in I_{1} that

and analogously aiΘaiΓ2(2r)1/2\|a^{\Theta}_{i}-a^{\Gamma}_{i}\|_{\infty}\leq 2(2r)^{1/2}. For iI2i\in I_{2} we need to slightly deviate from the proof of Theorem 3.1. We can skip the reparametrization step in (108)-(111) due to balancedness and need to distinguish three cases: Case A.1: We have ciΓaiΓ2r\|c_{i}^{\Gamma}a_{i}^{\Gamma}\|_{\infty}\leq 2r which, together with (125), implies ciΘaiΘ4r\|c_{i}^{\Theta}a^{\Theta}_{i}\|_{\infty}\leq 4r. Due to balancedness it follows that

Case A.2: We have ciΘaiΘ2r\|c_{i}^{\Theta}a_{i}^{\Theta}\|_{\infty}\leq 2r which, together with (125), implies ciΓaiΓ4r\|c_{i}^{\Gamma}a^{\Gamma}_{i}\|_{\infty}\leq 4r. Again it follows that

Let now w.l.o.g. aiΓaiΘ\|a^{\Gamma}_{i}\|_{\infty}\geq\|a^{\Theta}_{i}\|_{\infty} (otherwise we switch their roles in the following) which implies that λiΓ=Δi+λiΘ\lambda^{\Gamma}_{i}=\Delta_{i}+\lambda^{\Theta}_{i} with Δi=λiΓλiΘ0\Delta_{i}=\lambda^{\Gamma}_{i}-\lambda^{\Theta}_{i}\geq 0. Then it holds that

The last step holds due to (131) and the balancedness of Θ\Theta which ensure that

A.4 Section 4

for all i[m],j[m]{i}i\in[{m}],j\in[{m}]\setminus\{i\}, and define

Then for every B(0,)B\in(0,\infty) there is CB(0,)C_{B}\in(0,\infty) such that we have uniform (CB,1/2)(C_{B},1/2) inverse stability w.r.t. L((B,B)d)\|\cdot\|_{L^{\infty}((-B,B)^{d})}. That is, for all ΓNNA\Gamma\in\mathcal{N}_{N}^{A} and gR(NNA)g\in\mathcal{R}(\mathcal{N}_{N}^{A}) there exists a parametrization ΦNNA\Phi\in\mathcal{N}_{N}^{A} with

Note that the non-zero angle between the hyperplanes given by the weight vectors (ai)i=1m(a_{i})_{i=1}^{m} establishes that the minimal perimeter inside each linear region intersected with (B,B)d(-B,B)^{d} is lower bounded. As the realization is linear on each region, this implies the existence of a constant CB(0,)C^{\prime}_{B}\in(0,\infty), such that for every ΘNNA\Theta\in\mathcal{N}^{A}_{N} it holds that

Now note that for NNA\mathcal{N}^{A}_{N} we can get the same uniform (4,1/2)(4,1/2) inverse stability result w.r.t. W1,|\cdot|_{W^{1,\infty}} as in Theorem 3.3 by choosing π\pi to be the identity in (124). Together with (137) this implies the claim. ∎