Benefits of depth in neural networks

Matus Telgarsky

Setting and main results

A neural network is a model of real-valued computation defined by a connected directed graph as follows. Nodes await real numbers on their incoming edges, thereafter computing a function of these reals and transmitting it along their outgoing edges. Root nodes apply their computation to a vector provided as input to the network, whereas internal nodes apply their computation to the output of other nodes. Different nodes may compute different functions, two common choices being the maximization gate vmaxiviv\mapsto\max_{i}v_{i} (where vv is the vector of values on incoming edges), and the standard ReLU gate vσ\textscr(a,v+b)v\mapsto\sigma_{\textsc{r}}(\left\langle a,v\right\rangle+b) where σ\textscr(z):=max{0,z}\sigma_{\textsc{r}}(z):=\max\{0,z\} is called the ReLU (rectified linear unit), and the parameters aa and bb may vary from node to node. Graphs in the present work are acyclic, and there is exactly one node with no outgoing edges whose computation is the output of the network.

Neural networks distinguish themselves from many other function classes used in machine learning by possessing multiple layers, meaning the output is the result of composing together an arbitrary number of (potentially complicated) nonlinear operations; by contrast, the functions computed by boosted decision stumps and SVMs can be written as neural networks with a constant number of layers.

The purpose of the present work is to show that standard types of networks always gain in representation power with the addition of layers. Concretely: it is shown that for every positive integer kk, there exist neural networks with Θ(k3)\Theta(k^{3}) layers, Θ(1)\Theta(1) nodes per layer, and Θ(1)\Theta(1) distinct parameters which can not be approximated by networks with O(k)\mathcal{O}(k) layers and o(2k)o(2^{k}) nodes.

Before stating the main result, a few choices and pieces of notation deserve explanation. First, the target many-layered function uses standard ReLU gates; this is by no means necessary, and a more general statement can be found in Theorem 3.19. Secondly, the notion of approximation is the L1L^{1} distance: given two functions ff and gg, their pointwise disagreement f(x)g(x)|f(x)-g(x)| is averaged over the cube d^{d}. Here as well, the same proofs allow flexibility (cf. Theorem 3.19). Lastly, the shallower networks used for approximation use semi-algebraic gates, which generalize the earlier maximization and standard ReLU gates, and allow for analysis of not just standard networks with ReLU gates, but convolutional networks with ReLU and maximization gates (Krizhevsky et al., 2012), sum-product networks (where nodes compute polynomials) (Poon and Domingos, 2011), and boosted decision trees; the full definition of semi-algebraic gates appears in Section 2.

where C\mathcal{C} is the union of the following two sets of functions.

Functions computed by networks of (t,α,β)(t,\alpha,\beta)-semi-algebraic gates in k\leq k layers and 2k/(tαβ)\leq 2^{k}/(t\alpha\beta) nodes. (E.g., as with standard ReLU networks or with convolutional neural networks with standard ReLU and maximization gates; cf. Section 2.)

Functions computed by linear combinations of t\leq t decision trees each with 2k3/t\leq 2^{k^{3}}/t nodes. (E.g., the function class used by boosted decision trees; cf. Section 2.)

Analogs to Theorem 1.1 for boolean circuits — which have boolean inputs routed through {and,or,not}\{\textup{and},\textup{or},\textup{not}\} gates — have been studied extensively by the circuit complexity community, where they are called depth hierarchy theorems. The seminal result, due to Håstad (1986), establishes the inapproximability of the parity function by shallow circuits (unless their size is exponential). Standard neural networks appear to have received less study; closest to the present work is an investigation by Eldan and Shamir (2015) analyzing the case k=2k=2 when the dimension dd is large, showing an exponential separation between 2- and 3-layer networks, a regime not handled by Theorem 1.1. Further bibliographic notes and open problems may be found in Section 5.

The proof of Theorem 1.1 (and of the more general Theorem 3.19) occupies Section 3. The key idea is that just a few function compositions (layers) suffice to construct a highly oscillatory function, whereas function addition (adding nodes but keeping depth fixed) gives a function with few oscillations. Thereafter, an elementary counting argument suffices to show that low-oscillation functions can not approximate high-oscillation functions.

2 Companion results

Theorem 1.1 only provides the existence of one network (for each kk) which can not be approximated by a network with many fewer layers. It is natural to wonder if there are many such special functions. The following bound indicates their population is in fact quite modest.

Specifically, the construction behind Theorem 1.1, as elaborated in Theorem 3.19, can be seen as exhibiting O(2k3)\mathcal{O}(2^{k^{3}}) points, and a fixed labeling of these points, upon which a shallow network hardly improves upon random guessing. The forthcoming Theorem 1.2 similarly shows that even on the more simpler task of fitting O(k9)\mathcal{O}(k^{9}) points, the earlier class of networks is useless on most random labellings.

Let any neural net graph G\mathfrak{G} be given with p\leq p parameters in l\leq l layers and m\leq m total (t,α,β)(t,\alpha,\beta)-semi-algebraic nodes. Then for any δ>0\delta>0 and any n8pl2ln(8emtαβp(l+1))+4ln(1/δ)n\geq 8pl^{2}\ln(8emt\alpha\beta p(l+1))+4\ln(1/\delta) points (xi)i=1n(x_{i})_{i=1}^{n}, with probability 1δ\geq 1-\delta over uniform random labels (yi)i=1n(y_{i})_{i=1}^{n},

This proof is a direct corollary of the VC dimension of semi-algebraic networks, which in turn can be proved by a small modification of the VC dimension proof for piecewise polynomial networks (Anthony and Bartlett, 1999, Theorem 8.8). Moreover, the core methodology for VC dimension bounds of neural networks is due to Warren, whose goal was an analog of Theorem 1.2 for polynomials (Warren, 1968, Theorem 7).

Let any neural net graph G\mathfrak{G} be given with p\leq p parameters in l\leq l layers and m\leq m total nodes, each of which is (t,α,β)(t,\alpha,\beta)-semi-algebraic. Then

The proof of Theorem 1.2 and Lemma 1.3 may be found in Section 4. The argument for the VC dimension is very close to the argument for Theorem 1.1 that a network with few layers has few oscillations; see Section 4 for further discussion of this relationship.

Semi-algebraic gates and assorted network notation

The definition of a semi-algebraic gate is unfortunately complicated; it is designed to capture a few standard nodes in a single abstraction without degrading the bounds. Note that the name semi-algebraic set is standard (Bochnak et al., 1998, Definition 2.1.4), and refers to a set defined by unions and intersections of polynomial inequalities (and thus the name is somewhat abused here).

A notable trait of the definition is that the number of terms mm does not need to enter the name as it does not affect any of the complexity estimates herein (e.g., Theorem 1.1 or Theorem 1.2).

Given polynomials (pi)i=1r(p_{i})_{i=1}^{r} of degree α\leq\alpha, the standard (r,α)(r,\alpha)-min and -max gates ϕmin(v):=mini[r]pi(v)\phi_{\min}(v):=\min_{i\in[r]}p_{i}(v) and ϕmax(v):=maxi[r]qi(v)\phi_{\max}(v):=\max_{i\in[r]}q_{i}(v) are (r(r1),α,α)(r(r-1),\alpha,\alpha)-sa.

Every kk-dt is (k,1,0)(k,1,0)-sa, and every (t,k)(t,k)-bdt is (tk,1,0)(tk,1,0).

The proof of Lemma 2.3 is mostly a matter of unwrapping definitions, and is deferred to Appendix A. Perhaps the only interesting encoding is for the maximization gate (and similarly the minimization gate), which uses maxivi=ivi(j<i1[vi>vj])(j>i1[vivj])\max_{i}v_{i}=\sum_{i}v_{i}(\prod_{j<i}\mathbf{1}[v_{i}>v_{j}])(\prod_{j>i}\mathbf{1}[v_{i}\geq v_{j}]).

Some of the results, for instance Theorem 1.1 and its generalization Theorem 3.19, will let not only the parameters but also network graph G\mathfrak{G} vary. Let Nd((mi,ti,αi,βi)i=1l)\mathcal{N}_{d}((m_{i},t_{i},\alpha_{i},\beta_{i})_{i=1}^{l}) denote a network where layer ii has mi\leq m_{i} nodes where each is (ti,αi,βi)(t_{i},\alpha_{i},\beta_{i})-sa and the input has dimension dd. As a simplification, let Nd(m,l,t,α,β)\mathcal{N}_{d}(m,l,t,\alpha,\beta) denote networks of (t,α,β)(t,\alpha,\beta)-sa gates in l\leq l layers (not including layer 0) each with m\leq m nodes. There are various empirical prescriptions on how to vary the number of nodes per layer; for instance, convolutional networks typically have an increase between layer 0 and layer 1, followed by exponential decrease for a few layers, and finally a few layers with the same number of nodes (Fukushima, 1980; LeCun et al., 1998; Krizhevsky et al., 2012).

Benefits of depth

The purpose of this section is to prove Theorem 1.1 and its generalization Theorem 3.19 in the following three steps.

Functions with few oscillations poorly approximate functions with many oscillations.

Functions computed by networks with few layers must have few oscillations.

Functions computed by networks with many layers can have many oscillations.

To control this expression, note that every XJX_{J} is disjoint, however X:=JIjXjX:=\cup_{J\in\mathcal{I}_{j}}X_{j} can be smaller than If\mathcal{I}_{f}: in particular, it misses intervals UIfU\in\mathcal{I}_{f} whose interior intersects with the boundary of an interval in Ig\mathcal{I}_{g}. Since there are at most sg1s_{g}-1 such boundaries,

which rearranges to gives JIgXJsfsg\sum_{J\in\mathcal{I}_{g}}|X_{J}|\geq s_{f}-s_{g}. Combining this with eq. 3.1,

2 Few layers, few oscillations

Before giving the central upper bounds and sketching their proofs, notice by analogy to polynomials how compositions and additions vary in their impact upon oscillations. By adding together two polynomials, the resulting polynomial has at most twice as many terms and does not exceed the maximum degree of either polynomial. On the other hand, composing polynomials, the result has the product of the degrees and can have more than the product of the terms. As both of these can impact the number of roots or crossings (e.g., by the Bezout Theorem or Descartes’ Rule of Signs), composition wins the race to higher oscillations.

Suppose fNd((mi,ti,αi,βi)i=1l)f\in\mathcal{N}_{d}((m_{i},t_{i},\alpha_{i},\beta_{i})_{i=1}^{l}) with minimin{αi,βi}1\min_{i}\min\{\alpha_{i},\beta_{i}\}\geq 1. Setting α:=maxiαi,β:=maxiβi\alpha:=\max_{i}\alpha_{i},\beta:=\max_{i}\beta_{i}, t:=maxitit:=\max_{i}t_{i}, m:=imim:=\sum_{i}m_{i}, then Cr(fh)2(2tmα/l)lβl2\textup{Cr}(f\circ h)\leq 2(2tm\alpha/l)^{l}\beta^{l^{2}}.

Lemma 3.3 shows the key tradeoff: the number of layers is in the exponent, while the number of nodes is in the base.

Rather than directly controlling Cr(fh)\textup{Cr}(f\circ h), the proofs will first show fhf\circ h is (t,α)(t,\alpha)-poly, which immediately bounds Cr(fh)\textup{Cr}(f\circ h) as follows.

A second technical lemma is needed to reason about combinations of partitions defined by (t,α,β)(t,\alpha,\beta)-sa and (t,α)(t,\alpha)-poly functions.

The proof is somewhat painful owing to the fact that there is no convention on the structure of the intervals in the partitions, namely which ends are closed and which are open, and is thus deferred to Appendix A. The principle of the proof is elementary, and is depicted at right: given a collection of partitions, an intersection of constituent intervals must share endpoints with intervals in in the intersection, thus the total number of intervals bounds the total number of possible intersections. Arguably, this failure to increase complexity in the face of arbitrary intersections is why semi-algebraic gates do not care about the number of terms in their definition.

Recall that (t,α,β)(t,\alpha,\beta)-sa means there is a set of tt polynomials of degree at most α\alpha which form the regions defining the function by intersecting simpler regions x1[q(x)0]x\mapsto\mathbf{1}[q(x)\geq 0] and x1[q(x)<0]x\mapsto\mathbf{1}[q(x)<0]. As such, in order to analyze semi-algebraic gates composed with piecewise polynomial gates, consider first the behavior of these predicate polynomials.

This gives the following complexity bound for composing (s,α,β)(s,\alpha,\beta)-sa and (t,γ)(t,\gamma)-poly gates.

The proof of Lemma 3.3 now follows by Lemma 3.9. In particular, for semi-algebraic networks, the proof is an induction over layers, establishing node jj is (tj,αj)(t_{j},\alpha_{j})-poly (for appropriate (tj,αj)(t_{j},\alpha_{j})).

3 Many layers, many oscillations

The idea behind this construction is as follows. Consider any continuous function f:f:\to which is a generalization of a triangle wave with a single peak: f(0)=f(1)=0f(0)=f(1)=0, and there is some a(0,1)a\in(0,1) with f(a)=1f(a)=1, and additionally ff strictly increases along [0,a][0,a] and strictly decreases along [a,1][a,1].

Now consider the effect of the composition ff=f2f\circ f=f^{2}. Along [0,a][0,a], this is a stretched copy of ff, since f(f(a))=f(1)=0=f(0)=f(f(0))f(f(a))=f(1)=0=f(0)=f(f(0)) and moreover ff is a bijection between [0,a][0,a] and $(whenrestrictedto(when restricted to[0,a]).Thesamereasoningappliesto). The same reasoning applies tof^{2}alongalong[a,1],meaning, meaningf^{2}isafunctionwithtwopeaks.Iteratingthisargumentimpliesis a function with two peaks. Iterating this argument impliesf^{k}isafunctionwithis a function with2^{k-1}$ peaks; the following definition and lemmas formalize this reasoning.

ff is (t,[a,b])(t,[a,b])-triangle when it is continuous along [a,b][a,b], and [a,b][a,b] may be divided into 2t2t intervals [ai,ai+1][a_{i},a_{i+1}] with a1=aa_{1}=a and a2t+1=ba_{2t+1}=b, f(ai)=f(ai+2)f(a_{i})=f(a_{i+2}) whenever 1i2t11\leq i\leq 2t-1, f(a1)=0f(a_{1})=0, f(a2)=1f(a_{2})=1, ff is strictly increasing along odd-numbered intervals (those starting from aia_{i} with ii odd), and strictly decreasing along even-numbered intervals.

If ff is (s,)(s,)-triangle and gg is (t,)(t,)-triangle, then fgf\circ g is (2st,)(2st,)-triangle.

Since g()=g()= and ff and gg are continuous along $,then, thenf\circ giscontinuousalongis continuous along.Intheremaininganalysis,let. In the remaining analysis, let(a_{1},\ldots,a_{2s+1})andand(c_{1},\ldots,c_{2t+1})respectivelydenotetheintervalboundariesforrespectively denote the interval boundaries forfandandg$.

Now consider any interval [cj,cj+1][c_{j},c_{j+1}] where jj is odd, meaning the restriction gj:[cj,cj+1]g_{j}:[c_{j},c_{j+1}]\to of gg to [cj,cj+1][c_{j},c_{j+1}] is strictly increasing. It will be shown that fgjf\circ g_{j} is (s,[cj,cj+1])(s,[c_{j},c_{j+1}])-triangle, and an analogous proof holds for the strictly decreasing restriction gj+1:[cj+1,cj+2]g_{j+1}:[c_{j+1},c_{j+2}]\to, whereby it follows that fgf\circ g is (2st,)(2st,) by considering all choices of jj.

To this end, note for any i{1,,2s+1}i\in\{1,\ldots,2s+1\} that gj1(ai)g_{j}^{-1}(a_{i}) exists and is unique, thus set ai:=gj1(ai)a^{\prime}_{i}:=g_{j}^{-1}(a_{i}). By this choice, for odd ii it holds that f(gj(ai))=f(gj(gj1(ai)))=f(ai)=f(a1)=0f(g_{j}(a^{\prime}_{i}))=f(g_{j}(g_{j}^{-1}(a_{i})))=f(a_{i})=f(a_{1})=0 and fgjf\circ g_{j} is strictly increasing along [ai,ai+1][a^{\prime}_{i},a^{\prime}_{i+1}] (since gjg_{j} is strictly increasing everywhere and ff is strictly increasing along [gj(ai),gj(ai+1)]=[ai,ai+1][g_{j}(a^{\prime}_{i}),g_{j}(a^{\prime}_{i+1})]=[a_{i},a_{i+1}]), and similarly even ii has f(gj(ai))=f(a2)=1f(g_{j}(a^{\prime}_{i}))=f(a_{2})=1 and fgjf\circ g_{j} is strictly decreasing along [ai,ai+1][a^{\prime}_{i},a^{\prime}_{i+1}].

If fN1(m,l,t,α,β)f\in\mathcal{N}_{1}(m,l,t,\alpha,\beta) is (t,)(t,)-triangle with pp distinct parameters, then fkN1(m,kl,t,α,β)f^{k}\in\mathcal{N}_{1}(m,kl,t,\alpha,\beta) is (2k1tk,)(2^{k-1}t^{k},)-triangle with pp distinct parameters and Cr(fk)=(2t)k+1\textup{Cr}(f^{k})=(2t)^{k}+1.

It suffices to perform k1k-1 applications of Lemma 3.12.

Next, note the following examples of triangle functions.

The following functions are (1,)(1,)-triangle.

f(z):=σ\textscr(2σ\textscr(z)4σ\textscr(z1/2))N1(2,1,1,1,1)f(z):=\sigma_{\textsc{r}}(2\sigma_{\textsc{r}}(z)-4\sigma_{\textsc{r}}(z-1/2))\in\mathcal{N}_{1}(2,1,1,1,1).

g(z):=min{σ\textscr(2z),σ\textscr(22z)}N1(2,1,2,1,1)g(z):=\min\{\sigma_{\textsc{r}}(2z),\sigma_{\textsc{r}}(2-2z)\}\in\mathcal{N}_{1}(2,1,2,1,1).

h(z):=4z(1z)N1(1,1,0,2,0)h(z):=4z(1-z)\in\mathcal{N}_{1}(1,1,0,2,0). Cf. Schmitt (2000).

Set f(z):=σ\textscr(2σ\textscr(z)4σ\textscr(z1/2))N1(2,1,1,1,1)f(z):=\sigma_{\textsc{r}}(2\sigma_{\textsc{r}}(z)-4\sigma_{\textsc{r}}(z-1/2))\in\mathcal{N}_{1}(2,1,1,1,1) (cf. Lemma 3.16). Let real zz\in and positive integer kk be given, and choose the unique nonnegative integer ik{0,,2k1}i_{k}\in\{0,\ldots,2^{k-1}\} and real zk[0,1)z_{k}\in[0,1) so that z=(ik+zk)21kz=(i_{k}+z_{k})2^{1-k}. Then

4 Proof of Theorem 1.1

The proof of Theorem 1.1 now follows: Lemma 3.17 shows that a many-layered ReLU network can give rise to a highly oscillatory and regular function fkf^{k}, Lemma 3.3 shows that few-layered networks and (boosted) decision trees give rise to functions with few oscillations, and lastly Lemma 3.1 shows how to combine these into an inapproximability result.

Using nearly the same proof, but giving up on continuous uniform measure, it is possible to handle other distances and more flexible target functions.

Limitations of depth

Theorem 3.19 can be taken to say: there exists a labeling of Θ(2k3)\Theta(2^{k^{3}}) points which is realizable by a network of depth and size Θ(k3)\Theta(k^{3}), but can not be approximated by networks with depth kk and size o(2k)o(2^{k}). On the other hand, this section will sketch the proof of Theorem 1.2, which implies that these Θ(k3)\Theta(k^{3}) depth networks realize relatively few different labellings. The proof is a quick consequence of the VC dimension of semi-algebraic networks (cf. Lemma 1.3) and the following fact, where Sh()\textup{Sh}(\cdot) is used to denote the growth function (Anthony and Bartlett, 1999, Chapter 3).

Let any function class F\mathcal{F} and any distinct points (xi)i=1n(x_{i})_{i=1}^{n} be given. Then with probability at least 1δ1-\delta over a uniform random draw of labels (yi)i=1n(y_{i})_{i=1}^{n} (with yi{1,+1}y_{i}\in\{-1,+1\}),

The proof of the preceding result is similar to proofs of the Gilbert-Varshamov packing bound via Hoeffding’s inequality (Duchi, 2016, Lemma 13.5). Note that a similar result was used by Warren to prove rates of approximation of continuous functions by polynomials, but without invoking Hoeffding’s inequality (Warren, 1968, Theorem 7).

The remaining task is to control the VC dimension of semi-algebraic networks. To this end, note the following generalization of Lemma 1.3, which further provides that semi-algebraic networks compute functions which are polynomial when restricted to certain polynomial regions.

Let neural network graph G\mathfrak{G} be given with p\leq p parameters, l\leq l layers, and m\leq m total nodes, and suppose every gate is (t,α,β)(t,\alpha,\beta)-sa. Then

The proof follows the same basic structure of the VC bound for networks with piecewise polynomial activation functions (Anthony and Bartlett, 1999, Theorem 8.8). The slightly modified proof here is also very similar to the proof of Lemma 3.3, performing an induction up through the layers of the network, arguing that each node computes a polynomial after restricting attention to some range of parameters. The proof of Lemma 4.2 manages to be multivariate (unlike Lemma 3.3), though this requires arguments due to Warren (1968) which are significantly more complicated than those of Lemma 3.3 (without leading to a strengthening of Theorem 1.1).

One minor departure from the VC dimension proof of piecewise polynomial networks (cf. (Anthony and Bartlett, 1999, Theorem 8.8)) is the following lemma, which is used to track the number of regions with the more complicated semi-algebraic networks. Despite this generalization, the VC dimension bound is basically the same as for piecewise polynomial networks.

Bibliographic notes and open problems

Arguably the first approximation theorem of a big class by a smaller one is the Weierstrass Approximation Theorem, which states that polynomials uniformly approximate continuous functions over compact sets (Weierstrass, 1885). Refining this, Kolmogorov (1936) gave a bound on how well subspaces of functions can approximate continuous functions, and Vitushkin (1955, 1959) showed a similar bound for approximation by polynomials in terms of the polynomial degrees, dimension, and modulus of continuity of the target function. Warren (1968) then gave an alternate proof and generalization of this result, in the process effectively proving the VC dimension of polynomials (developing tools still used to prove the VC dimension of neural networks (Anthony and Bartlett, 1999, Chapters 7-8)), and producing an analog to Theorem 1.2 for polynomials.

The preceding results, however, focused on separating large classes (e.g., continuous functions of bounded modulus) from small classes (polynomials of bounded degree). Aiming to refine this, depth hierarchy theorems in circuit complexity separated circuits of a certain depth from circuits of a slightly smaller depth. As mentioned in Section 1, the seminal result here is due to Håstad (1986). For architectures closer to neural networks, sum-product networks (summation and product nodes) have been analyzed by Bengio and Delalleau (2011) and more recently Martens and Medabalimi (2015), and networks of linear threshold functions in 2 and 3 layers by Kane and Williams (2015); note that both polynomial gates (as in sum-product networks) and linear threshold gates are semi-algebraic gates. Most closely to the present work (excluding (Telgarsky, 2015), which is a vastly simplified account), Eldan and Shamir (2015) analyze 2- and 3-layer networks with general activation functions composed with affine mappings, showing separations which are exponential in the input dimension. Due to this result and also recent advances in circuit complexity (Rossman et al., 2015), it is natural to suppose Theorem 1.1 can be strengthened to separating kk and k+1k+1 layer networks when dimension dd is large; however, none of the earlier works give a tight sense of the behavior as d1d\downarrow 1.

The triangle wave target functions considered here (e.g., cf. Lemma 3.16) have appeared in various forms throughout the literature. General properties of piecewise affine highly oscillating functions were investigated by Szymanski and McCane (2014) and Montúfar et al. (2014). Also, Schmitt (2000) investigated the map z4z(1z)z\mapsto 4z(1-z) (as in Lemma 3.16) to show that sigmoidal networks can not approximate high degree polynomials via an analysis similar to the one here, however looseness in the VC bounds for sigmoidal networks prevented exponential separations and depth hierarchies.

A tantalizing direction for future work is to characterize not just one difficult function (e.g., triangle functions as in Lemma 3.16), but many, or even all functions which are not well-approximated by smaller depths. Arguably, this direction could have value in machine learning, as discovery of such underlying structure could lead to algorithms to recover it. As a trivial example of the sort of structure which could arise, considering the following proposition, stating that any symmetric signal may be repeated by pre-composing it with the ReLU triangle function.

Set f(z):=σ\textscr(2σ\textscr(z)4σ\textscr(z1/2))f(z):=\sigma_{\textsc{r}}(2\sigma_{\textsc{r}}(z)-4\sigma_{\textsc{r}}(z-1/2)) (cf. Lemma 3.16), and let any g:g:\to be given with g(z)=g(1z)g(z)=g(1-z). Then h:=gfkh:=g\circ f^{k} satisfies h(x)=h(x+i2k)=g(x2k)h(x)=h(x+i2^{k})=g(x2^{k}) for every real x[0,2k]x\in[0,2^{-k}] and integer i{0,,2k1}i\in\{0,\ldots,2^{-k}-1\}; in other words, hh is 2k2^{k} repetitions of gg with graph scaled horizontally and uniformly to fit within 2^{2}.

The author is indebted to Joshua Zahl for help navigating semi-algebraic geometry and for a simplification of the multivariate case in Theorem 1.1, and to Rastislav Telgársky for an introduction to this general topic via Kolmogorov’s Superposition Theorem (Kolmogorov, 1957). The author further thanks Jacob Abernethy, Peter Bartlett, Sébastien Bubeck, and Alex Kulesza for valuable discussions.

References

Appendix A Deferred proofs

This appendix collects various proofs omitted from the main text.

The following mechanical proof shows that standard piecewise polynomial gates, maximization/minimization gates, and decision trees are all semi-algebraic gates.

Since mini[r]xi=maxi[r]xi\min_{i\in[r]}x_{i}=-\max_{i\in[r]}-x_{i}, it suffices to handle the maximum case, which has the form

Constructing polynomials qi,j=pjpiq_{i,j}=p_{j}-p_{i} when j<ij<i and qi,j=pipjq_{i,j}=p_{i}-p_{j} when j>ij>i, it follows that ϕmax\phi_{\max} is (r(r1),α,α)(r(r-1),\alpha,\alpha)-sa.

First consider a kk-dt ff, wherein the proof follows by induction on tree size. In the base case k=1k=1, ff is constant. Otherwise, there exist functions flf_{l} and frf_{r} which are respectively ll- and rr-dt with l+r<kl+r<k, and additionally an affine function qfq_{f} so that

where the last step expanded the semi-algebraic forms of flf_{l} and frf_{r}. As such, by combining the sets of predicate polynomials for flf_{l} and frf_{r} together with {qf}\{q_{f}\} (where the former two have cardinalities l\leq l and r\leq r by the inductive hypothesis), and unioning together the triples for flf_{l} and frf_{r} but extending the triples to include 1[qf<0]\mathbf{1}[q_{f}<0] for triples in flf_{l} and 1[qf0]\mathbf{1}[q_{f}\geq 0] for triples in frf_{r}, it follows by construction that ff is (k,1,0)(k,1,0)-semi-algebraic.

Now consider a (t,k)(t,k)-bdt gg. By the preceding expansion, each individual tree fif_{i} is (k,1,0)(k,1,0)-sa, thus their sum is (tk,1,0)(tk,1,0) by unioning together the sets of polynomials, triples, and adding together the expansions.

A.2 Deferred proofs from Section 3

The first proof shows that a collection of partitions may be refined into a single partition whose size is at most the total number of intervals across all partitions. As discussed in the text, while the proof has a simple idea (one need only consider boundaries of intervals across all partitions), it is somewhat painful since there is not consistent rule for whether specific endpoints endpoints of intervals are open or closed.

Consider the case that each partition AlA_{l} which contains the boundary point aia_{i} has exactly two intervals meeting at aia_{i} and moreover the closedness properties are the same, meaning either aia_{i} is contained in the interval which ends at aia_{i}, or it is contained in the interval which starts at aia_{i}. In this case, partition UU into two intervals so that the treatment of the boundary is the same as those AlA_{l}’s with a boundary at aia_{i}.

Otherwise, it is either the case that some AlA_{l} have aia_{i} contained in the interval ending at aia_{i} whereas others have it contained in the interval starting at aia_{i}, or simply some AlA_{l} have three intervals meeting at aia_{i}: namely, the singleton interval [al,al][a_{l},a_{l}] as well as two intervals not containing ala_{l}. In this case, partition UU into three intervals: one ending at aia_{i} (but not containing it), the singleton interval [ai,ai][a_{i},a_{i}], and an interval starting at aia_{i} (but not containing it).

(These cases may also be described in a unified way: consider all intervals of AA which have aia_{i} as an endpoint, extend such intervals of positive length to have infinite length while keeping endpoint aia_{i} and the side it falls on, and then refine UU by intersecting it with all of these intervals, which as above results in either 2 or 3 intervals.)

Note that the construction never introduces more intervals at a boundary point than exist in AA, thus BA=kt|B|\leq|A|=kt.

It remains to be shown that a union of intersections of elements of AA is a union of elements of BB. Note that it suffices to show that intersections of elements of AA are unions of elements of BB, since thereafter these encodings can be used to express unions of intersections of AA as unions of BB. As such, consider any intersection UU of elements of AA; there is nothing to show if UU is empty, thus suppose it is nonempty. In this case, it must also be an interval (e.g., since intersections of convex sets are convex), and its endpoints must coincide with endpoints of AA. Moreover, if the left endpoint of UU is open, then UU must be formed from an intersection which includes an interval with the same open left endpoint, thus there exists such an interval in AA, and by the above construction of BB, there also exists an interval with such an open left endpoint in BB; the same argument similarly handles the case of closed left endpoints, as well as open and closed right endpoints, namely giving elements in BB which match these traits. Let ara_{r} and asa_{s} denote these endpoints. By the above construction of BB, intervals with endpoints {aj,aj+1}\{a_{j},a_{j+1}\} for j{r,,s1}j\in\{r,\ldots,s-1\} will be included in BB, and since BB is a partition, the union of these elements will be exactly UU. Since UU was an arbitrary intersection of elements of AA, the proof is complete.

Next, the tools of Section 3.2 (culminating in the composition rule for semi-algebraic gates (Lemma 3.9)) are used to show crossing number bounds on semi-algebraic networks and boosted decision trees.

This proof first shows fhf\circ h is (2itiαiji1tjαjβjij+1kj,jiβj)(2^{i}t_{i}\alpha_{i}\prod_{j\leq i-1}t_{j}\alpha_{j}\beta_{j}^{i-j+1}k_{j},\prod_{j\leq i}\beta_{j})-poly, and then relaxes this expression and applies Lemma 3.4 to obtain the desired bound.

First consider the case d=1d=1 and hh is the identity map, thus fh=ff\circ h=f. For convenience, set

The proof proceeds by induction on the layers of ff, showing that each node in layer ii is (2iTiAiCi1Mi1,Bi)(2^{i}T_{i}A_{i}C_{i-1}M_{i-1},B_{i})-poly.

For convenience, first consider layer i=0i=0 of the inputs themselves: here, node ii outputs the ithi^{\textup{th}} coordinate of the input, and is thus affine and (1,1)(1,1)-poly. Next consider layer i>0i>0, where the inductive hypothesis grants that each node in layer i1i-1 is (2i1Ti1Ai1Ci2Mi2,Bi1)(2^{i-1}T_{i-1}A_{i-1}C_{i-2}M_{i-2},B_{i-1})-poly. Consequently, since any node in layer ii is (ti,αi,βi)(t_{i},\alpha_{i},\beta_{i})-sa, Lemma 3.9 grants it is also (2i1tiTi1Ai1Ci2Mi2mi1(1+αiBi1),βiBi1)(2^{i-1}t_{i}T_{i-1}A_{i-1}C_{i-2}M_{i-2}m_{i-1}(1+\alpha_{i}B_{i-1}),\beta_{i}B_{i-1})-poly as desired (since 1+αiBi12αiBi11+\alpha_{i}B_{i-1}\leq 2\alpha_{i}B_{i-1}).

Lastly, the simplified terms give fhf\circ h is ((2tα)lβl(l1)/2jl1mj,βl(l+1)/2)((2t\alpha)^{l}\beta^{l(l-1)/2}\prod_{j\leq l-1}m_{j},\beta^{l(l+1)/2})-poly. Since ln()\ln(\cdot) is strictly increasing and concave and ml=1m_{l}=1,

It follows that fhf\circ h is ((2tmα/l)lβl(l1)/2,βl(l+1)/2)((2tm\alpha/l)^{l}\beta^{l(l-1)/2},\beta^{l(l+1)/2})-poly, whereby the crossing number bound follows by Lemma 3.4.

Next, elementary computations verify that the three functions listed in Lemma 3.16 are indeed (1,)(1,)-triangle.

By inspection, f(0)=f(1)=0f(0)=f(1)=0 and f(1/2)=1f(1/2)=1. Moreover, for x[0,1/2]x\in[0,1/2], f(x)=2xf(x)=2x meaning ff is increasing, and x[1/2,1]x\in[1/2,1] means f(x)=2(1x)f(x)=2(1-x), meaning ff is decreasing. Lastly, the properties of gg follow since f=gf=g.

By inspection, h(0)=h(1)=0h(0)=h(1)=0 and h(1/2)=1h(1/2)=1. Moreover hh is a quadratic, thus can cross 0 at most twice, and moreover 1/21/2 is the unique critical point (since gg^{\prime} has degree 1), thus gg is increasing on [0,1/2][0,1/2] and decreasing on [1/2,1][1/2,1].

In the case of the ReLU (1,)(1,)-triangle function ff given in Lemma 3.16, the exact form of fkf^{k} may be established as follows. (Recall that this refined form allows for the use of Lebesgue measure in Theorem 1.1, and also the repetition statement in Proposition 5.1.)

(of Lemma 3.17) The proof proceeds by induction on the number of compositions ll. For the base case l=1l=1,

For the inductive step, first note for any z[0,1/2]z\in[0,1/2], by symmetry of flf^{l} around 1/2 (i.e., fl(z)=fl(1z)f^{l}(z)=f^{l}(1-z) by the inductive hypothesis), and by the above explicit form of f1f^{1},

meaning the case z(1/2,1]z\in(1/2,1] is implied by the case z[0,1/2]z\in[0,1/2]. Since the unique nonnegative integer il+1i_{l+1} and real zl+1[0,1)z_{l+1}\in[0,1) satisfy 2z=2(il+1+zl+1)2l1=(il+1+zl+1)2l2z=2(i_{l+1}+z_{l+1})2^{-l-1}=(i_{l+1}+z_{l+1})2^{-l}, the inductive hypothesis grants

The proof of the slightly more general form of Theorem 1.1 is as follows; it does not quite imply Theorem 1.1, since the constructed measure is not the Lebesgue measure even for the ReLU-based (1,)(1,)-triangle function from Lemma 3.16.

Since zi<zi+1z_{i}<z_{i+1} and fk~(zi)fk~(zi+1)\widetilde{f^{k}}(z_{i})\neq\widetilde{f^{k}}(z_{i+1}), then taking (Ui)i=1s(U_{i})_{i=1}^{s} to denote the intervals of I\mathcal{I} sorted by their left endpoint, ziUiz_{i}\in U_{i} for i[s]i\in[s]. By Lemma 3.1,

Construct the continuous measure μ\mu as follows, starting with the construction of a univariate measure μ0\mu_{0}. Since fkf^{k} is continuous, there exists a δ(0,mini[s1]zizi+1/2)\delta\in(0,\min_{i\in[s-1]}|z_{i}-z_{i+1}|/2) so that fk(z)fk(zi)1/4|f^{k}(z)-f^{k}(z_{i})|\leq 1/4 for any i[s]i\in[s] and zz with zziδ|z-z_{i}|\leq\delta. As such, let μ0\mu_{0} denote the probability measure which places half of its mass uniformly on these ss balls of radius δ\delta (which must be disjoint since fkf^{k} alternates between 0 and 1 along (zi)i=1s(z_{i})_{i=1}^{s}), and half of its mass uniformly on the remaining subset of $.Finally,extendthistoaprobabilitymeasure. Finally, extend this to a probability measure\muonon^{d}uniformly,meaninguniformly, meaning\muistheproductofis the product of\mu_{0}andthemeasureand the measure\mu_{1}whichisuniformoverwhich is uniform over^{d-1}$. Now consider the two types of distances.

As a closing curiosity, Theorem 3.19 implies the following statement regarding polynomials.

A.3 Deferred proofs from Section 4

First, the proof of a certain VC lower bound which mimics the Gilbert-Varshamov bound; the proof is little more than a consequence of Hoeffding’s inequality.

(of Lemma 4.1) For convenience, set m:=Sh(F;n)m:=\textup{Sh}(\mathcal{F};n), and let (a1,,am)(a_{1},\ldots,a_{m}) denote these dichotomies (meaning aj{0,1}na_{j}\in\{0,1\}^{n}), and with foresight set ϵ:=ln(m/δ)/(2n)\epsilon:=\sqrt{\ln(m/\delta)/(2n)}. Let (Yi)i=1n(Y_{i})_{i=1}^{n} denote fair Bernoulli random labellings for each point, and note by symmetry of the fair coin that for any fixed dichotomy aja_{j},

Consequently, by a union bound over all dichotomies and lastly by Hoeffding’s inequality,

where the last step used the choice of ϵ\epsilon.

The remaining deferred proofs do not exactly follow the order of Section 4, but instead the order of dependencies in the proofs. In particular, to control the VC dimension, first it is useful to prove Lemma 4.3, which is used to control the growth of numbers of regions as semi-algebraic gates are combined.

Additionally consider the set of sign patterns

Distinct elements of S\mathcal{S} correspond to distinct sign patterns in VV: namely, for any CSC\in\mathcal{S}, using the ordering of Q\mathcal{Q} to encode AA and BB as binary vectors of length Q|\mathcal{Q}|, the corresponding interleaved binary vector of length 2Q2|\mathcal{Q}| is distinct for distinct choices of (A,B)(A,B). (For each ii that appears in neither AA nor BB, there two possible encodings in VV: having both coordinates corresponding to ii set to 1, and having them set to 0. On the other hand, a more succinct encoding based just on (li)i=1Q(l_{i})_{i=1}^{|\mathcal{Q}|} fails to capture those sets arising from intersections of proper subsets of Q\mathcal{Q}.) As such, making use of growth function bounds for sets of polynomials (Anthony and Bartlett, 1999, Theorem 8.3),

Thanks to Lemma 4.3, the proof of the VC dimension bound Lemma 4.2 follows by induction over layers, effectively keeping track of a piecewise (regionwise?) polynomial function as with the proof of Lemma 3.3 (but now in the multivariate case).

(of Lemma 4.2) First note that this proof follows the scheme of a VC dimension proof for networks with piecewise polynomial activation functions (Anthony and Bartlett, 1999, Theorem 8.8), but with Lemma 4.3 allowing for the more complicated semi-algebraic gates, and some additional bookkeeping for the (semi-algebraic) shapes of the regions of the partition S\mathcal{S}.

For the inductive step, consider some layer i+1i+1. Restricted to any SSiS\in\mathcal{S}_{i}, the nodes of the previous layer ii compute fixed polynomials of degree βi\beta^{i}. Each node in layer i+1i+1 is (t,α,β)(t,\alpha,\beta)-sa, meaning there are tt predicates, defined by polynomials of degree α\leq\alpha, which define regions wherein this node is a fixed polynomial. Let QSQ_{S} denote this set of predicates, where QStnmi+1|Q_{S}|\leq tnm_{i+1} by considering the nn possible input examples and the tt possible predicates encountered in each of the mi+1m_{i+1} nodes in layer i+1i+1, and set Qi+1:=Qi(SSiQS).Q_{i+1}:=Q_{i}\bigcup\left(\cup_{S\in\mathcal{S}_{i}}Q_{S}\right). By the definition of semi-algebraic gate, each node in layer i+1i+1 computes a fixed polynomial when restricted to a region defined by an intersection of predicates which moreover are defined by Qi+1Q_{i+1}. As such, defining Si+1\mathcal{S}_{i+1} as the refinement of Si+1\mathcal{S}_{i+1} which partitions each SSiS\in\mathcal{S}_{i} according to the intersections of predicates encountered in each node, then Lemma 4.3 on each QSQ_{S} grants

which completes the inductive construction.

The upper bound on KK may now be estimated. First, S|\mathcal{S}| may be upper bounded by applying eq. A.2 recursively:

To compute VC(N(G))\textup{VC}(\mathcal{N}(\mathfrak{G})), it suffices to find NN such that Sh(N(G);N)<2N\textup{Sh}(\mathcal{N}(\mathfrak{G});N)<2^{N}, which in turn is implied by p(l+1)ln(N)+p(l+1)ln(8emtαβl)<Nln(2)p(l+1)\ln(N)+p(l+1)\ln(8emt\alpha\beta^{l})<N\ln(2). Since ln(N)=ln(N/(2p(l+1))+ln(2p(l+1))N/(2p(l+1))1+ln(2p(l+1))\ln(N)=\ln(N/(2p(l+1))+\ln(2p(l+1))\leq N/(2p(l+1))-1+\ln(2p(l+1)) and ln(2)1/2>1/6\ln(2)-1/2>1/6, it suffices to show

As such, the left hand side of this expression is an upper bound on VC(N(G))\textup{VC}(\mathcal{N}(\mathfrak{G})).

The proofs of Lemma 1.3 and Theorem 1.2 from Section 1 are now direct from Lemma 4.2 and Lemma 4.1.

(of Lemma 1.3) This statement is the same as Lemma 4.2 with some details removed.

(of Theorem 1.2) By the bound on Sh(N(G);n)\textup{Sh}(\mathcal{N}(\mathfrak{G});n) from Lemma 4.2,

The result follows by plugging this into Lemma 4.1.

A.4 Deferred proofs from Section 5

(of Proposition 5.1) Immediate from Lemma 3.17.