The Power of Depth for Feedforward Neural Networks

Ronen Eldan, Ohad Shamir

Introduction and Main Result

Learning via multi-layered artificial neural networks, a.k.a. deep learning, has seen a dramatic resurgence of popularity over the past few years, leading to impressive performance gains on difficult learning problems, in fields such as computer vision and speech recognition. Despite their practical success, our theoretical understanding of their properties is still partial at best.

In this paper, we consider the question of the expressive power of neural networks of bounded size. The boundedness assumption is important here: It is well-known that sufficiently large depth-22 neural networks, using reasonable activation functions, can approximate any continuous function on a bounded domain (Cybenko (1989); Hornik et al. (1989); Funahashi (1989); Barron (1994)). However, the required size of such networks can be exponential in the dimension, which renders them impractical as well as highly prone to overfitting. From a learning perspective, both theoretically and in practice, our main interest is in neural networks whose size is bounded.

For a network of bounded size, a basic architectural question is how to trade off between its width and depth: Should we use networks that are narrow and deep (many layers, with a small number of neurons per layer), or shallow and wide? Is the “deep” in “deep learning” really important? Or perhaps we can always content ourselves with shallow (e.g. depth-22) neural networks?

Overwhelming empirical evidence as well as intuition indicates that having depth in the neural network is indeed important: Such networks tend to result in complex predictors which seem hard to capture using shallow architectures, and often lead to better practical performance. However, for the types of networks used in practice, there are surprisingly few formal results (see related work below for more details).

Clearly, to prove something on the separation between 22-layer and 33-layer networks, we need to make some assumption on the activation function σ()\sigma(\cdot) (for example, if σ()\sigma(\cdot) is the identity, then both 22-layer and 33-layer networks compute linear functions, hence there is no difference in their expressive power). All we will essentially require is that σ()\sigma(\cdot) is universal, in the sense that a sufficiently large 22-layer network can approximate any univariate Lipschitz function which is non-constant on a bounded domain. More formally, we use the following assumption:

In addition, for technical reasons, we will require the following mild growth and measurability conditions, which are satisfied by virtually all activation functions in the literature, including the examples discussed earlier:

The activation function σ\sigma is (Lebesgue) measurable and satisfies

Our main result is the following theorem, which implies that there are 33-layer networks of width polynomial in the dimension dd, which cannot be arbitrarily well approximated by 22-layer networks, unless their width is exponential in dd:

gg is bounded in [2,+2][-2,+2], supported on {x:xCd}\{\mathbf{x}:\|\mathbf{x}\|\leq C\sqrt{d}\}, and expressible by a 33-layer network of width Ccσd19/4Cc_{\sigma}d^{19/4}.

Every function ff, expressed by a 22-layer network of width at most cecdce^{cd}, satisfies

We make the following additional remarks about the theorem:

At least for specific activation functions such as the ReLU, sigmoid, and threshold, the proof construction implies that gg is poly(d)\text{poly}(d)-Lipschitz, and the 33-layer network expressing it has parameters bounded by poly(d)\text{poly}(d).

On a qualitative level, the question we are considering is similar to the question of Boolean circuit lower bounds in computational complexity: In both cases, we consider functions which can be represented as a combination of simple computational units (Boolean gates in computational complexity; neurons in neural networks), and ask how large or how deep this representation needs to be, in order to compute or approximate some given function. For Boolean circuits, there is a relatively rich literature and some strong lower bounds. A recent example is the paper Rossman et al. (2015), which shows for any d2d\geq 2 an explicit depth dd, linear-sized circuit on {0,1}n\{0,1\}^{n}, which cannot be non-trivially approximated by depth d1d-1 circuits of size polynomial in nn. That being said, it is well-known that the type of computation performed by each unit in the circuit can crucially affect the hardness results, and lower bounds for Boolean circuits do not readily translate to neural networks of the type used in practice, which are real-valued and express continuous functions. For example, a classical result on Boolean circuits states that the parity function over {0,1}d\{0,1\}^{d} cannot be computed by constant-depth Boolean circuits whose size is polynomial in dd (see for instance Håstad (1986)). Nevertheless, the parity function can in fact be easily computed by a simple 22-layer, O(d)\mathcal{O}(d)-width real-valued neural network with most reasonable activation functionsSee Rumelhart et al. (1986), Figure 6, where reportedly the structure was even found automatically by back-propagation. For a threshold activation function σ(z)=1{z0}\sigma(z)=\mathbf{1}\left\{z\geq 0\right\} and input x=(x1,,xd){0,1}d\mathbf{x}=(x_{1},\ldots,x_{d})\in\{0,1\}^{d}, the network is given by x  i=1d+1(1)i+1σ(j=1dxji+12)\mathbf{x}~{}\mapsto~{}\sum_{i=1}^{d+1}(-1)^{i+1}\sigma\left(\sum_{j=1}^{d}x_{j}-i+\frac{1}{2}\right). In fact, we only need σ\sigma to satisfy σ(z)=1\sigma(z)=1 for z12z\geq\frac{1}{2} and σ(z)=0\sigma(z)=0 for z12z\leq-\frac{1}{2}, so the construction easily generalizes to other activation functions (such as a ReLU or a sigmoid), possibly by using a small linear combination of them to represent such a σ\sigma..

Moving to networks with real-valued outputs, one related field is arithmetic circuit complexity (see Shpilka and Yehudayoff (2010) for a survey), but the focus there is on computing polynomials, which can be thought of as neural networks where each neuron computes a linear combination or a product of its inputs. Again, this is different than most standard neural networks used in practice, and the results and techniques do not readily translate.

Recently, several works in the machine learning community attempted to address questions similar to the one we consider here. Pascanu et al. (2013); Montufar et al. (2014) consider the number of linear regions which can be expressed by ReLU networks of a given width and size, and Bianchini and Scarselli (2014) consider the topological complexity (via Betti numbers) of networks with certain activation functions, as a function of the depth. Although these can be seen as measures of the function’s complexity, such results do not translate directly to a lower bound on the approximation error, as in Thm. 1. Delalleau and Bengio (2011); Martens and Medabalimi (2014) and Cohen et al. (2015) show strong approximation hardness results for certain neural network architectures (such as polynomials or representing a certain tensor structure), which are however fundamentally different than the standard neural networks considered here.

Proof Sketch

In a nutshell, the 33-layer network we construct approximates a radial function with bounded support (i.e. one which depends on the input x\mathbf{x} only via its Euclidean norm x\|\mathbf{x}\|, and is for any x\mathbf{x} whose norm is larger than some threshold). With 33 layers, approximating radial functions is rather straightforward: First, using assumption 1, we can construct a linear combination of neurons expressing the univariate mapping zz2z\mapsto z^{2} arbitrarily well in any bounded domain. Therefore, by adding these combinations together, one for each coordinate, we can have our network first compute (approximately) the mapping xx2=ixi2\mathbf{x}\mapsto\|\mathbf{x}\|^{2}=\sum_{i}x_{i}^{2} inside any bounded domain, and then use the next layer to compute some univariate function of x2\|\mathbf{x}\|^{2}, resulting in an approximately radial function. With only 22 layers, it is less clear how to approximate such radial functions. Indeed, our proof essentially indicates that approximating radial functions with 22 layers can require exponentially large width.

In particular, we will consider a density function which equals φ2(x)\varphi^{2}(\mathbf{x}), where φ\varphi is the inverse Fourier transform of the indicator 1{xB}\mathbf{1}\left\{\mathbf{x}\in B\right\}, BB being the origin-centered unit-volume Euclidean ball (the reason for this choice will become evident later). Before continuing, we note that a formula for φ\varphi can be given explicitly (see Lemma 2), and an illustration of it in d=2d=2 dimensions is provided in Figure 1. Also, it is easily verified that φ2(x)\varphi^{2}(\mathbf{x}) is indeed a density function: It is clearly non-negative, and by isometry of the Fourier transform, φ2(x)dx=φ^2(x)dx=1{xB}2dx\int\varphi^{2}(\mathbf{x})d\mathbf{x}=\int\hat{\varphi}^{2}(\mathbf{x})d\mathbf{x}=\int\mathbf{1}\left\{\mathbf{x}\in B\right\}^{2}d\mathbf{x}, which equals 11 since BB is a unit-volume ball.

Our goal now is to lower bound the right hand side of Eq. (3). To continue, we find it convenient to consider the Fourier transforms fφ^,gφ^\widehat{f\varphi},\widehat{g\varphi} of the functions fφ,gφf\varphi,g\varphi, rather than the functions themselves. Since the Fourier transform is isometric, the above equals

Luckily, the Fourier transform of functions expressible by a 22-layer network has a very particular form. Specifically, consider any function of the form

In words, the support of fφ^\widehat{f\varphi} is contained in a union of tubes of bounded radius passing through the origin. This is the key property of 22-layer networks we will use to derive our main theorem. Note that it holds regardless of the exact shape of the fif_{i} functions, and hence our proof will also hold if the activations in the network are different across the first layer neurons, or even if they are chosen in some adaptive manner.

To establish our theorem, we will find a function gg expressible by a 33-layer network, such that gφ^\widehat{g\varphi} has a constant distance (in L2L_{2} space) from any function supported on TT (a union of kk tubes as above). Here is where high dimensionality plays a crucial role: Unless kk is exponentially large in the dimension, the domain TT is very sparse when one considers large distances from the origin, in the sense that

The construction and analysis of this function constitutes the technical bulk of the proof. The main difficulty in this step is that even if the Fourier transform g^\hat{g} of gg has some of its L2L_{2} mass on high frequencies, it is not clear that this will also be true for gφ^=g1{B}\widehat{g\varphi}=g\star\mathbf{1}\left\{B\right\} (note that while convolving with a Euclidean ball increases the average distance from the origin in the L1L_{1} sense, it doesn’t necessarily do the same in the L2L_{2} sense).

We overcome this difficulty by considering a random superposition of indicators of thin shells: Specifically, we consider the function

Preliminaries

Rd=1π(Γ(d2+1))1/dR_{d}=\sqrt{\frac{1}{\pi}}\left(\Gamma\left(\frac{d}{2}+1\right)\right)^{1/d}, which is always between 15d\frac{1}{5}\sqrt{d} and 12d\frac{1}{2}\sqrt{d}.

Proof of Thm. 1

In this section, we provide the proof of Thm. 1. Note that some technical proofs, as well as some important technical lemmas on the structure of Bessel functions, are deferred to the appendix.

As discussed in Sec. 2, our theorem rests on constructing a distribution μ\mu and an appropriate function gg, which is easy to approximate (w.r.t. μ\mu) by small 33-layer networks, but difficult to approximate using 22-layer networks. Thus, we begin by formally defining g,μg,\mu that we will use.

Let φ(x)\varphi(\mathbf{x}) be the Fourier transform of 1{wRdBd}\mathbf{1}\left\{\mathbf{w}\in R_{d}B_{d}\right\}. Then

To define our hard-to-approximate function, we introduce some notation. Let α1\alpha\geq 1 and γ\gamma be some large numerical constants to be determined later, and set N=γd2N=\gamma d^{2}, assumed to be an integer (essentially, we need α,γ\alpha,\gamma to be sufficiently large so that all the lemmas we construct below would hold). Consider the intervals

We split the intervals to “good” and “bad” intervals using the following definition:

Δi\Delta_{i} is a good interval (or equivalently, ii is good) if for any xΔix\in\Delta_{i}

Otherwise, we say that Δi\Delta_{i} is a bad interval.

By definition of a “good” interval and Lemma 2, we see that gig_{i} is defined to be non-zero, when the value of φ\varphi on the corresponding interval Δi\Delta_{i} is sufficiently bounded away from , a fact which will be convenient for us later on.

Our proof will revolve around the L2L_{2} function

which as explained in Sec. 2, will be shown to be easy to approximate arbitrarily well with a 33-layer network, but hard to approximate with a 22-layer network.

2 Key Lemmas

In this subsection, we collect several key technical lemmas on gig_{i} and φ\varphi, which are crucial for the main proof. The proofs of all the lemmas can be found in Appendix B.

The following lemma ensures that φ(x)\varphi(\mathbf{x}) is sufficiently close to being a constant on any good interval:

If d2d\geq 2, αc\alpha\geq c and Ncα3/2d2N\geq c\alpha^{3/2}d^{2} (for some sufficiently large universal constant cc), then inside any good interval Δi\Delta_{i}, φ(x)\varphi(x) has the same sign, and

The following lemma ensures that the Fourier transform g^i\hat{g}_{i} of gig_{i} has a sufficiently large part of its L2L_{2} mass far away from the origin:

Suppose N100αd3/2N\geq 100\alpha d^{3/2}. Then for any ii,

where gi^\hat{g_{i}} is the Fourier transform of gig_{i}.

The following lemma ensures that giφ^\widehat{g_{i}\varphi} also has sufficiently large L2L_{2} mass far away from the origin:

Suppose that αC\alpha\geq C, NCα3/2d2N\geq C\alpha^{3/2}d^{2} and d>Cd>C, where C>0C>0 is a universal constant. Then for any ii,

The following lemma ensures that a linear combination of the gig_{i}’s has at least a constant L2(φ2)L_{2}(\varphi^{2}) probability mass.

Suppose that αc\alpha\geq c and Nc(αd)3/2N\geq c(\alpha d)^{3/2} for some sufficiently large universal constant cc, then for every choice of ϵi{1,+1}\epsilon_{i}\in\{-1,+1\}, i=1,,Ni=1,\ldots,N, one has

Finally, the following lemma guarantees that the non-Lipschitz function i=1Nϵigi(x)\sum_{i=1}^{N}\epsilon_{i}g_{i}(\mathbf{x}) can be approximated by a Lipschitz function (w.r.t. the density φ2\varphi^{2}). This will be used to show that i=1Nϵigi(x)\sum_{i=1}^{N}\epsilon_{i}g_{i}(\mathbf{x}) can indeed be approximated by a 33-layer network.

Suppose that d2d\geq 2. For any choice of ϵi{1,+1}\epsilon_{i}\in\{-1,+1\}, i=1,,Ni=1,\ldots,N, there exists an NN-Lipschitz function ff, supported on [αd,2αd][\alpha\sqrt{d},2\alpha\sqrt{d}] and with range in [1,+1][-1,+1], which satisfies

The goal of this section is to prove the following proposition.

Fix a dimension dd, suppose that d>Cd>C, α>C\alpha>C and NCα3/2d2N\geq C\alpha^{3/2}d^{2} and let kk be an integer satisfying

for constants C,κ>0C^{\prime},\kappa>0. Then one has

where δ>0\delta>0 is a universal constant.

The proof of this proposition requires a few intermediate steps. In the remainder of the section, we will assume that N,d,αN,d,\alpha are chosen to be large enough to satisfy the assumptions of Lemma 6 and Lemma 5. In other words we assume that d>Cd>C and NCα3/2d2N\geq C\alpha^{3/2}d^{2} for a suitable universal constant C>0C>0. We begin with the following:

Suppose that d,Nd,N are as above. There exists a choice of ϵi{1,1}\epsilon_{i}\in\{-1,1\}, i=1,,Ni=1,\ldots,N such that

Suppose that each ϵi\epsilon_{i} is chosen independently and uniformly at random from {1,+1}\{-1,+1\}. It suffices to show that

for some universal constant c>0c>0, since that would ensure there exist some choice of ϵ1,,ϵN\epsilon_{1},\ldots,\epsilon_{N} satisfying the lemma statement. Define h(w)=1{w(2RdBd)C}h(\mathbf{w})=\mathbf{1}\left\{\mathbf{w}\in(2R_{d}B_{d})^{C}\right\} and consider the operator

This is equivalent to removing low-frequency components from gg (in the Fourier domain), and therefore is an orthogonal projection. According to Lemma 5 and isometry of the Fourier transform, we have

for every good ii. Moreover, an application of Lemma 6, and the fact that gi,gjL2=0\langle g_{i},g_{j}\rangle_{L_{2}}=0 for any iji\neq j (as gi,gjg_{i},g_{j} have disjoint supports) tells us that

for a universal constant c>0c>0. We finally get,

Let ff be a function such that fφL2f\varphi\in L_{2}, and is of the form in Eq. (8). Suppose that the functions fif_{i} are measurable functions satisfying

where ψS\psi\in\mathcal{S}. Note that the growth condition ensures that the integral above is well-defined. The Fourier transform h^\hat{h} of a tempered distribution hh is also a tempered distribution, and defined as

which is indeed an element of S\mathcal{S}^{*} by the linearity of the Fourier transform, by the continuity of ψψi(x)\psi\to\psi_{i}(x) with respect to the topology of S\mathcal{S} and by the dominated convergence theorem. Finally, define

Using the fact thatThis is because g^(x)dx=g(x)exp(2πix,w)dwdx=g(x)(exp(2πix,w)1dw)dx=g(x)δ(x)dx=g(0)\int\hat{g}(\mathbf{x})d\mathbf{x}=\int\int g(\mathbf{x})\exp(-2\pi i\langle\mathbf{x},\mathbf{w}\rangle)d\mathbf{w}d\mathbf{x}=\int g(\mathbf{x})\left(\int\exp(-2\pi i\langle\mathbf{x},\mathbf{w}\rangle)\cdot 1d\mathbf{w}\right)d\mathbf{x}=\int g(\mathbf{x})\delta(\mathbf{x})d\mathbf{x}=g(0), where δ()\delta(\cdot) is the Dirac delta function, which is the Fourier transform of the constant 11 function. See also (Hunter and Nachtergaele, 2001, Chapter 11, Example 11.31).

for any gSg\in\mathcal{S}, recalling that vi\mathbf{v}_{i} has unit norm, and letting vi\mathbf{v}_{i}^{\perp} denote the subspace of vectors orthogonal to vi\mathbf{v}_{i}, we have the following for any ψS\psi\in\mathcal{S}:

where the use of Fubini’s theorem is justified by the fact that ψS\psi\in\mathcal{S}.

We now use the convolution-multiplication theorem (see e.g., (Hunter and Nachtergaele, 2001, Theorem 11.35)) according to which if f,gL1f,g\in L_{1} then

Using this, we have the following for every ψS\psi\in\mathcal{S}:

Let q,wq,w be two functions of unit norm in L2L_{2}. Suppose that qq satisfies

so that TT contains the support of q()q(\cdot). For each r>0r>0, define

Using a union bound and the definition of hh, equation Eq. (17) follows.

to be the averaging of q()q(\cdot) with respect to rotations (in the above formula Hd1\mathcal{H}_{d-1} denotes the d1d-1 dimensional Hausdorff measure, i.e. the standard measure in d1d-1 dimensions). We have the following: Since w()w(\cdot) is radial and has unit L2L_{2} norm, and we assume q()q(\cdot) is supported on TT, we have

As a result of these calculations, we have

where we used the assumption that q()q(\cdot) is unit norm and that ACw2(x)dx=(2RdBd)Cw2(x)dx1δ\int_{A^{C}}w^{2}(\mathbf{x})d\mathbf{x}=\int_{(2R_{d}B_{d})^{C}}w^{2}(\mathbf{x})d\mathbf{x}\leq 1-\delta. Since 1δ112δ\sqrt{1-\delta}\leq 1-\frac{1}{2}\delta for any δ\delta\in, the result follows. ∎

where (ϵi)(\epsilon_{i}) are the signs provided by Lemma 8. According to Lemma 6, we have

for a universal constant c3>0c_{3}>0. Next, we claim that since qL2=wL2=1\left\|q\right\|_{L_{2}}=\left\|w\right\|_{L_{2}}=1, we have for every scalars β1,β2>0\beta_{1},\beta_{2}>0 that

Indeed, we may clearly multiply both β1\beta_{1} and β2\beta_{2} by the same constant affecting the correctness of the formula, thus we may assume that β2=1\beta_{2}=1. It thus amounts to show that for two unit vectors v,uv,u in a Hilbert space, one has that minβ>0βvu214vu2\min_{\beta>0}\|\beta v-u\|^{2}\geq\frac{1}{4}\|v-u\|^{2}. We have

which in particular implies formula Eq. (20).

Combining the above, and using the fact that q,wq,w have unit L2L_{2} norm, we finally get

where in the last inequality, we use the assumption in Eq. (7), choosing c=min{c2/4,c3}c=\min\{c_{2}/4,c_{3}\}. The proof is complete. ∎

There is a universal constant C>0C>0 such that the following holds. Let δ(0,1)\delta\in(0,1). Suppose that dCd\geq C and that the functions gig_{i} are constructed as in Eq. (6). For any choice of ϵi{1,+1}\epsilon_{i}\in\{-1,+1\}, i=1,,Ni=1,\ldots,N, there exists a function gg expressible by a 33-layer network of width at most 8cσδα3/2Nd11/4+1\frac{8c_{\sigma}}{\delta}\alpha^{3/2}Nd^{11/4}+1, and with range in [2,+2][-2,+2], such that

The proof of this proposition relies on assumption 1, which ensures that we can approximate univariate functions using our activation function. As discussed before Thm. 1, one can also plug in weaker versions of the assumption (i.e. worse polynomial dependence of the width ww on R,L,1/δR,L,1/\delta), and get versions of proposition 2 where the width guarantee has worse polynomial dependence on the parameters N,α,d,δN,\alpha,d,\delta. This would lead to versions of the Thm. 1 with somewhat worse constants and polynomial dependence on the dimension dd.

Suppose the activation function σ\sigma satisfies assumption 1. Let ff be an LL-Lipschitz function supported on [r,R][r,R], where r1r\geq 1. Then for any δ>0\delta>0, there exists a function gg expressible by a 33-layer network of width at most 2cσd2R2Lrδ+1\frac{2c_{\sigma}d^{2}R^{2}L}{\sqrt{r}\delta}+1, such that

which is constant outside [R,R][-R,R], as well as the function

and where the width parameter ww is at most 2cσdR2Lrδ\frac{2c_{\sigma}dR^{2}L}{\sqrt{r}\delta}. Consequently, the function

can be expressed in the form a+i=1wαiσ(βixγi)a+\sum_{i=1}^{w}\alpha_{i}\sigma(\beta_{i}x-\gamma_{i}) where w2cσd2R2Lrδw\leq\frac{2c_{\sigma}d^{2}R^{2}L}{\sqrt{r}\delta}, and it holds that

where wcσR2Lrδw\leq\frac{c_{\sigma}R^{2}L}{\sqrt{r}\delta}.

for appropriate scalars a,ui,ci,vi,j,bi,ja,u_{i},c_{i},v_{i,j},b_{i,j} and vectors wi,j\mathbf{w}_{i,j}, and where ww is at most

Eq. (23) is exactly a 33-layer network (compare to Eq. (2)), except that there is an additional constant term aa. However, by increasing ww by 11, we can simulate aa by an additional neuron xaσ(σ(0)+z)σ(σ(0,x)+z)\mathbf{x}\mapsto\frac{a}{\sigma(\sigma(0)+z)}\cdot\sigma(\sigma(\langle\mathbf{0},\mathbf{x}\rangle)+z), where zz is some scalar such that σ(σ(0)+z)0\sigma(\sigma(0)+z)\neq 0 (note that if there is no such zz, then σ\sigma is the zero function, and therefore cannot satisfy assumption 1). So, we can write the function gg as a 33-layer network (as defined in Eq. (2)), of width at most

Let us consider each of the three absolute values:

The first absolute value term is at most δ/2\delta/2 by Eq. (22).

Summing the above, we get that g(x)f(x)δ2+δ2+0=δ|g(\mathbf{x})-f(\|\mathbf{x}\|)|\leq\frac{\delta}{2}+\frac{\delta}{2}+0=\delta as required. ∎

We are now ready to prove Proposition 2, which is essentially a combination of Lemmas 7 and 10.

First, invoke Lemma 7 to obtain an NN-Lipschitz function hh with range in [1,+1][-1,+1] which satisfies

5 Finishing the Proof

We are finally ready to prove our main theorem.

The proof is a straightforward combination of propositions 1 and 2 (whose conditions can be verified to follow immediately from the assumptions used in the theorem). We first choose α=C\alpha=C and N=Cα3/2d2N=\lceil C\alpha^{3/2}d^{2}\rceil with the constant CC taken from the statement of Proposition 1. By invoking this proposition we obtain signs ϵi{1,1}\epsilon_{i}\in\{-1,1\} and a universal constant δ1>0\delta_{1}>0 for which any function ff expressed by a bounded-size 22-layer network satisfies

(where CC^{\prime} is a universal constant depending on the universal constants C,δ1C,\delta_{1}), so that

Combining Eq. (25) and Eq. (26) with the triangle inequality, we have that fgL2(μ)δ1/2\left\|f-g\right\|_{L_{2}(\mu)}\geq\delta_{1}/2 for any 22-layer function ff. The proof is complete. ∎

OS is supported in part by an FP7 Marie Curie CIG grant, the Intel ICRI-CI Institute, and Israel Science Foundation grant 425/13. We thank James Martens and the anonymous COLT 2016 reviewers for several helpful comments.

References

Appendix A Approximation Properties of the ReLU Activation Function

In this appendix, we prove that the ReLU activation function satisfies assumption 1, and also prove bounds on the Lipschitz parameter of the approximation and the size of the required parameters. Specifically, we have the following lemma:

Moreover, one has αi2L|\alpha_{i}|\leq 2L and w3RLδw\leq 3\frac{RL}{\delta}.

If one has 2RL<δ2RL<\delta, then the results holds trivially because we may take the function hh to be the function (with width parameter w=0w=0). Otherwise, we must have Rδ/2LR\geq\delta/2L, so by increasing the value of RR by a factor of at most 22, we may assume without loss of generality that there exists an integer mm such that R=mδ/LR=m\delta/L.

Then clearly equation Eq. (27) holds true. Moreover, we have αi2L|\alpha_{i}|\leq 2L, which completes the proof.

Appendix B Technical Proofs

Since φ(x)\varphi(\mathbf{x}) is radial (hence rotationally invariant), let us assume without loss of generality that it equals re1r\mathbf{e}_{1}, where r=xr=\|\mathbf{x}\| and e1\mathbf{e}_{1} is the first standard basis vector. This means that the integral becomes

Performing the variable change z=arccos(w1/Rd)z=\arccos(w_{1}/R_{d}) (which implies that as w1w_{1} goes from Rd-R_{d} to RdR_{d}, zz goes from π\pi to , and also Rdcos(z)=w1R_{d}\cos(z)=w_{1} and Rdsin(z)dz=dw1-R_{d}\sin(z)dz=dw_{1}), we can rewrite the integral above as

Since we know that this integral must be real-valued (since we’re computing the Fourier transform φ(x)\varphi(\mathbf{x}), which is real-valued and even), we can ignore the imaginary components, so the above reduces to

By a standard formula for Bessel functions (see Equation 10.9.4. in DLMF ), we have

which by substituting x=2πrRdx=2\pi rR_{d} and changing sides, implies that

Plugging this back into Eq. (29), we get the expression

Plugging in the explicit formula Vd1=π(d1)/2Γ(d+12)V_{d-1}=\frac{\pi^{(d-1)/2}}{\Gamma\left(\frac{d+1}{2}\right)}, this simplifies to

Recalling that this equals φ(x)\varphi(x) where x=r\|\mathbf{x}\|=r, the result follows.

B.2 Proof of Lemma 3

Moreover, using the definition of a good interval, and the fact that the maximal value in any interval is at most 2αd2\alpha\sqrt{d}, we have

Since xx (in any interval) is at least αd\alpha\sqrt{d}, then Jd/2(2πRdx)J_{d/2}(2\pi R_{d}x) is 2πRd2\pi R_{d}-Lipschitz in xx by Lemma 12. Since the width of each interval only αdN\frac{\alpha\sqrt{d}}{N}, Eq. (30) implies that Jd/2(2πRdx)J_{d/2}(2\pi R_{d}x) (and hence φ(x)\varphi(x)) does not change signs in the interval, provided that N>2160(παRdd)3/2N>2\sqrt{160}\left(\pi\alpha R_{d}\sqrt{d}\right)^{3/2}. Recalling that Rd12dR_{d}\leq\frac{1}{2}\sqrt{d}, this is indeed satisfied by the lemma’s conditions.

Turning to the second part of the lemma, assuming φ(x)\varphi(x) is positive without loss of generality, and using the Lipschitz property of Jd/2()J_{d/2}(\cdot) and Eq. (30), we have

which is less than 1+d1/21+d^{-1/2} provided that Ncα3/2d2N\geq c\alpha^{3/2}d^{2} for some universal constant cc.

B.3 Proof of Lemma 4

The result is trivially true for a bad interval ii (where gig_{i} is the function, hence both sides of the inequality in the lemma statement are ), so we will focus on the case that ii is a good interval.

Since, gg is a radial function, its Fourier transform is also radial, and is given by

By Lemma 12, Jd/21(x)1|J_{d/2-1}(x)|\leq 1, hence Eq. (31) can be upper bounded by

Combining this with Eq. (32), we get that

B.4 Proof of Lemma 5

The result is trivially true for a bad interval ii (where gig_{i} is the function, hence both sides of the inequality in the lemma statement are ), so we will focus on the case that ii is a good interval.

Define a=supxΔiφ(x)a=\sup_{x\in\Delta_{i}}\varphi(x). Using Lemma 3, we have that φ(x)\varphi(x) does not change signs in the interval Δi\Delta_{i}. Suppose without loss of generality that it is positive. Moreover, by the same lemma we have that

Next, by choosing the constant CC to be large enough, we may apply Lemma 4, which yields that

By the triangle inequality, we have that for two vectors u,vu,v in a normed space, one has v2u22vvu\|v\|^{2}\geq\|u\|^{2}-2\|v\|\|v-u\|. This teaches us that

B.5 Proof of Lemma 6

Since the gig_{i} for different ii have disjoint supports (up to measure-zero sets), the integral in the lemma equals

Recalling that Ad=dπd/2Γ(d2+1)A_{d}=\frac{d\pi^{d/2}}{\Gamma\left(\frac{d}{2}+1\right)} and that Rdd=πd/2Γ(d2+1)R_{d}^{d}=\pi^{-d/2}\Gamma\left(\frac{d}{2}+1\right) by Lemma 1, this equals

We now claim that for any r[αd,2αd]r\in[\alpha\sqrt{d},2\alpha\sqrt{d}] (that is, in any interval),

which would imply that we can lower bound Eq. (35) by

To see why Eq. (36) holds, consider an rr which satisfies the left hand side. The width of its interval is at most αdN\frac{\alpha\sqrt{d}}{N}, and by Lemma 12, Jd/2(2πRdr)J_{d/2}(2\pi R_{d}r) is at most 2πRd2\pi R_{d}-Lipschitz in rr. Therefore, for any other rr^{\prime} in the same interval as rr, it holds that

which can be verified to be at least 180πRdr\sqrt{\frac{1}{80\pi R_{d}r}} by the condition on NN in the lemma statement, and the facts that r2αd,Rd12dr\leq 2\alpha\sqrt{d},R_{d}\leq\frac{1}{2}\sqrt{d}. As a result, Jd/22(2πRdr)180πRdrJ_{d/2}^{2}(2\pi R_{d}r^{\prime})\geq\frac{1}{80\pi R_{d}r} for any rr^{\prime} in the same interval as rr, which implies that rr is in a good interval.

We now continue by taking Eq. (37), and performing the variable change x=2πRdrx=2\pi R_{d}r, leading to

Applying Lemma 15 with β=2πRdα/d\beta=2\pi R_{d}\alpha/\sqrt{d} (which by Lemma 1, is between 2πα/52\pi\alpha/5 and πα\pi\alpha, hence satisfies the conditions of Lemma 15 if α\alpha is large enough), this is at least

B.6 Proof of Lemma 7

where dist(x,ΔiC)\text{dist}(x,\Delta_{i}^{C}) is the distance of xx from the boundaries of Δi\Delta_{i}. Note that for bad ii, this is the same as gi(x)g_{i}(x), whereas for good ii, it is an NN-Lipschitz approximation of gi(x)g_{i}(x).

Let f(x)=i=1Nϵgˇ(x)f(\mathbf{x})=\sum_{i=1}^{N}\epsilon\check{g}(\mathbf{x}), and note that since the support of gˇi\check{g}_{i} are disjoint, ff is also NN Lipschitz. With this definition, the integral in the lemma becomes

Since the support of gˇi(x)gi(x)\check{g}_{i}(\mathbf{x})-g_{i}(\mathbf{x}) is disjoint for different ii, this equals

Using the definition of RdR_{d} from Lemma 1, and the fact that Ad=dπd/2Γ(d2+1)A_{d}=\frac{d\pi^{d/2}}{\Gamma\left(\frac{d}{2}+1\right)}, this equals

Now, note that by definition of giˇ,gi\check{g_{i}},g_{i}, their difference gˇi(r)gi(r)|\check{g}_{i}(r)-g_{i}(r)| can be non-zero (and at most 11) only for rr belonging to two sub-intervals of width 1N\frac{1}{N} within the interval Δi\Delta_{i} (which itself lies in [αd,2αd][\alpha\sqrt{d},2\alpha\sqrt{d}]). Moreover, for such rr (which is certainly at least αd\alpha\sqrt{d}), we can use Lemma 14 to upper bound Jd/22(2πRdr)J_{d/2}^{2}(2\pi R_{d}r) by 1.3αd\frac{1.3}{\alpha d}. Overall, we can upper bound the sum of integrals above by

Appendix C Technical Results On Bessel functions

For any ν0\nu\geq 0 and xx, Jν(x)1|J_{\nu}(x)|\leq 1. Moreover, for any ν1\nu\geq 1 and x3νx\geq 3\nu, Jν(x)J_{\nu}(x) is 11-Lipschitz in xx.

The bound on the magnitude follows from equation 10.14.1 in DLMF .

The derivative of Jν(x)J_{\nu}(x) w.r.t. xx is given by Jν+1(x)+(ν/x)Jν(x)-J_{\nu+1}(x)+(\nu/x)J_{\nu}(x) (see equation 10.6.1 in DLMF ). Since Jν+1(x)|J_{\nu+1}(x)| and Jν(x)|J_{\nu}(x)|, for ν1\nu\geq 1, are at most 12\frac{1}{\sqrt{2}} (see equation 10.14.1 in DLMF ), we have that the magnitude of the derivative is at most 121+νx12(1+13)<1\frac{1}{\sqrt{2}}\left|1+\frac{\nu}{x}\right|\leq\frac{1}{\sqrt{2}}\left(1+\frac{1}{3}\right)<1. ∎

To prove the lemmas below, we will need the following explicit approximation result for the Bessel function Jd/2(x)J_{d/2}(x), which is an immediate corollary of Theorem 5 in Krasikov , plus some straightforward approximations (using the facts that for any z(0,0.5]z\in(0,0.5], we have 1z210.3z\sqrt{1-z^{2}}\geq 1-0.3z and 0zarcsin(z)0.6z0\leq z\arcsin(z)\leq 0.6z):

Using Lemma 13 (which is justified since rdr\geq\sqrt{d} and Rd15dR_{d}\geq\frac{1}{5}\sqrt{d} by Lemma 1), the fact that cos\cos is at most 11, and the assumption d2d\geq 2,

For any β1,d2\beta\geq 1,d\geq 2 such that βd127\beta d\geq 127, it holds that

For any a,b0a,b\geq 0, we have a1{ab}  aba\cdot\mathbf{1}\left\{a\geq b\right\}~{}\geq~{}a-b. Therefore,

We now wish to use Lemma 13 and plug in the approximation for Jd/2(x)J_{d/2}(x). To do so, let a=Jd/2(x)a=J_{d/2}(x), let bb be its approximation from Lemma 13, and let ϵ=x3/2\epsilon=x^{-3/2} the bound on the approximation from the lemma. Therefore, we have abϵ|a-b|\leq\epsilon. This implies

Eq. (38) can be further simplified, since by definition of bb and Lemma 13,

Plugging this back into Eq. (38), plugging in the definition of a,ba,b, and recalling that cd,x1c_{d,x}\leq 1 and xd2x\geq d\geq 2, we get that

To compute the integral above, we will perform a variable change, but first lower bound the integral in a more convenient form. A straightforward calculation (manually or using a symbolic computation toolbox) reveals that

which according to Lemma 13, equals cd,xc_{d,x}, which is at most 11. Using this and the fact that fd,x0.85f_{d,x}\geq 0.85 by the same lemma ,

Using the variable change z=fd,xxz=f_{d,x}x, and the fact that 1.3fd,x0.851.3\geq f_{d,x}\geq 0.85, the above equals

We now perform integration by parts. Note that cos2((d+1)π4+z)=z(z2+14sin((d+1)π2+2z))\cos^{2}\left(-\frac{(d+1)\pi}{4}+z\right)=\frac{\partial}{\partial z}\left(\frac{z}{2}+\frac{1}{4}\sin\left(-\frac{(d+1)\pi}{2}+2z\right)\right), and sin\sin is always bounded by 11, hence

Concatenating all the lower bounds we attained so far, we showed that

If βd127\beta d\geq 127, this is at least 0.005βd\frac{0.005}{\beta d}, from which the lemma follows. ∎