Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks

Peter L. Bartlett, Nick Harvey, Chris Liaw, Abbas Mehrabian

Introduction

Deep neural networks underlie many of the recent breakthroughs in applied machine learning, particularly in image and speech recognition. These successes motivate a renewed study of these networks’ theoretical properties.

Classification is one of the learning tasks in which deep neural networks have been particularly successful, e.g., for image recognition. A natural foundational question that arises is: what are the generalization guarantees of these networks in a statistical learning framework? An established way to address this question is by considering VC-dimension, which characterizes uniform convergence of misclassification frequencies to probabilities (see Vapnik and Chervonenkis, 1971), and asymptotically determines the sample complexity of PAC learning with such classifiers (see Blumer, Ehrenfeucht, Haussler, and Warmuth, 1989).

Let HH denote a class of functions from X{\mathcal{X}} to {0,1}{\{0,1\}} (the hypotheses, or the classification rules). For any non-negative integer mm, we define the growth function of HH as

If {(h(x1),,h(xm)):hH}=2m\left|\left\{(h(x_{1}),\ldots,h(x_{m})):h\in H\right\}\right|=2^{m}, we say HH shatters the set {x1,,xm}\{x_{1},\dots,x_{m}\}. The Vapnik-Chervonenkis dimension of HH, denoted VCdim(H){\operatorname{VCdim}}(H), is the size of the largest shattered set, i.e. the largest mm such that ΠH(m)=2m\Pi_{H}(m)=2^{m}. If there is no largest mm, we define VCdim(H)={\operatorname{VCdim}}(H)=\infty.

For a class of real-valued functions, such as those generated by neural networks, a natural measure of complexity that implies similar uniform convergence properties is the pseudodimension (see Pollard, 1990).

Let F\mathcal{F} be a class of functions from X{\mathcal{X}} to \Re. The pseudodimension of F\mathcal{F}, written Pdim(F){\operatorname{Pdim}}(\mathcal{F}), is the largest integer mm for which there exists (x1,,xm,y1,,ym)Xm×m(x_{1},\dots,x_{m},y_{1},\dots,y_{m})\in{\mathcal{X}}^{m}\times\Re^{m} such that for any (b1,,bm){0,1}m(b_{1},\dots,b_{m})\in\{0,1\}^{m} there exists fFf\in\mathcal{F} such that

For a class F\mathcal{F} of real-valued functions, we may define VCdim(F)VCdim(sgn(F)){\operatorname{VCdim}}(\mathcal{F})\coloneqq{\operatorname{VCdim}}(\operatorname{sgn}(\mathcal{F})), where

and sgn(x)=1[x>0]\operatorname{sgn}(x)=\mathbf{1}[x>0]. For any class F\mathcal{F}, clearly VCdim(F)Pdim(F){\operatorname{VCdim}}(\mathcal{F})\leq{\operatorname{Pdim}}(\mathcal{F}). If F\mathcal{F} is the class of functions generated by a neural network with a fixed architecture and fixed activation functions (see Section 1.3 for definitions), then it is not hard to see that indeed Pdim(F)VCdim(F){\operatorname{Pdim}}(\mathcal{F})\leq{\operatorname{VCdim}}(\mathcal{F}) (see (Anthony and Bartlett, 1999, Theorem 14.1) for a proof), and hence Pdim(F)=VCdim(F){\operatorname{Pdim}}(\mathcal{F})={\operatorname{VCdim}}(\mathcal{F}). Therefore, all the results of this paper automatically apply to the pseudodimensions of neural networks as well.

The main contribution of this paper is to prove nearly-tight bounds on the VC-dimension of deep neural networks in which the non-linear activation function is a piecewise linear function with a constant number of pieces. For simplicity we will henceforth refer to such networks as “piecewise linear networks”. The activation function that is the most commonly used in practice is the rectified linear unit, also known as ReLU (see LeCun, Bengio, and Hinton, 2015; Goodfellow, Bengio, and Courville, 2016). The ReLU function is defined as σ(x)=max{0,x}\sigma(x)=\max\{0,x\}, so it is clearly piecewise linear.

It is particularly interesting to consider how the VC-dimension is affected by the various attributes of the network: the number WW of parameters (i.e., weights and biases), the number UU of non-linear units (i.e., nodes), and the number LL of layers. Among all networks with the same size (number of weights), is it true that those with more layers have larger VC-dimension?

Such a statement is indeed true, and previously known; however, a tight characterization of how depth affects VC-dimension was unknown prior to this work.

Our first main result is a new VC-dimension lower bound that holds even for the restricted family of ReLU networks.

There exists a universal constant CC such that the following holds. Given any W,LW,L with W>CL>C2W>CL>C^{2}, there exists a ReLU network with L\leq L layers and W\leq W parameters with VC-dimension WLlog(W/L)/C\geq WL\log(W/L)/C.

Our construction can be augmented slightly to give a neural network with linear threshold and identity activation functions with the same guarantees.

The proof appears in Section 2. Prior to our work, the best known lower bounds were Ω(WL)\Omega(WL) (see Bartlett, Maiorov, and Meir, 1998, Theorem 2) and Ω(WlogW)\Omega(W\log W) (see Maass, 1994, Theorem 1). We strictly improve both bounds to Ω(WLlog(W/L))\Omega(WL\log(W/L)).

Our proof of Theorem 3 uses the “bit extraction” technique, which was also used in (Bartlett et al., 1998) to give an Ω(WL)\Omega(WL) lower bound. We refine this technique to gain the additional logarithmic factor that appears in Theorem 3.

Unfortunately there is a barrier to refining this technique any further. Our next theorem shows the hardness of computing the mod function, implying that the bit extraction technique cannot yield a stronger lower bound than Theorem 3. Further discussion of this connection may be found in Remark 12.

The proof of this theorem appears in Section 3. One interesting aspect of the proof is that it does not use Warren’s lemma (Warren, 1968), which is a mainstay of VC-dimension upper bounds (see Goldberg and Jerrum, 1995; Bartlett et al., 1998; Anthony and Bartlett, 1999).

Our next main result is an upper bound on the VC-dimension of neural networks with piecewise polynomial activation functions.

Consider a neural network architecture with WW parameters and UU computation units arranged in LL layers, so that each unit has connections only from units in earlier layers. Let kik_{i} denote the number of units at the iith layer. Suppose that all non-output units have piecewise-polynomial activation functions with p+1p+1 pieces and degree no more than dd, and the output unit has the identity function as its activation function.

If d=0d=0, let WiW_{i} denote the number of parameters (weights and biases) at the inputs to units in layer ii; if d>0d>0, let WiW_{i} denote the total number of parameters (weights and biases) at the inputs to units in all the layers up to layer ii (i.e., in layers 1,2,,i1,2,\dots,i). Define the effective depth as

For the class F\mathcal{F} of all (real-valued) functions computed by this network and mLˉWm\geq\bar{L}W, we have

The average depth Lˉ\bar{L} is always between 1 and LL, and captures how the parameters are distributed in the network: it is close to 11 if they are concentrated near the output (or if the activation functions are piecewise-constant), while it is of order LL if the parameters are concentrated near the input, or are spread out throughout the network. Hence, this suggests that edges and vertices closer to the input have a larger effect in increasing the VC-dimension, a phenomenon not observed before; and indeed our lower bound construction in Theorem 3 (as well as the lower bound construction from (Bartlett et al., 1998)) considers a network with most of the parameters near the input.

The proof of this result appears in Section 4. Prior to our work, the best known upper bounds were O(W2)O(W^{2}) (see Goldberg and Jerrum, 1995, Section 3.1) and O(WLlogW+WL2)O(WL\log W+WL^{2}) (see Bartlett et al., 1998, Theorem 1), both of which hold for piecewise polynomial activation functions with a bounded number of pieces (for the remainder of this section, assume that p=O(1)p=O(1) throughout); we strictly improve both bounds to O(WLlogW)O(WL\log W) for the special case of piecewise linear functions (d=1d=1). Recall that ReLU is an example of a piecewise linear activation function. For the case d=0d=0, an O(WlogU)O(W\log U) bound for the VC-dimension was already proved using different techniques by Cover (1968) and by Baum and Haussler (1989, Corollary 2). Our Theorem 6 implies all of these upper bounds (except the O(W2)O(W^{2}) upper bound of Goldberg and Jerrum) using a unified technique, and gives a slightly more refined picture of the dependence of the VC-dimension on the distribution of parameters in a deep network.

To compare our upper and lower bounds, let d(W,L)d(W,L) denote the largest VC-dimension of a piecewise linear network with WW parameters and LL layers. Theorems 6 and 3 imply there exist constants c,Cc,C such that

For neural networks arising in practice it would certainly be the case that LL is significantly smaller than W0.99W^{0.99}, in which case our results determine the asymptotic bound d(W,L)=Θ(WLlogW)d(W,L)=\Theta(WL\log W). On the other hand, in the regime L=Θ(W)L=\Theta(W), which is merely of theoretical interest, we also now have a tight bound d(W,L)=Θ(WL)d(W,L)=\Theta(WL), obtained by combining Theorem 3 with results of Goldberg and Jerrum (1995). There is now only a very narrow regime, say W0.99LWW^{0.99}\ll L\ll W, in which the bounds of (2) are not asymptotically tight, and they differ only in the logarithmic factor.

Our final result is an upper bound for VC-dimension in terms of WW and UU (the number of non-linear units, or nodes). This bound is tight in the case d=1d=1 and p=2p=2, as discussed in Remark 10.

Consider a neural network with WW parameters and UU units with activation functions that are piecewise polynomials with at most pp pieces and of degree at most dd. Let F\mathcal{F} be the set of (real-valued) functions computed by this network. Then VCDim(sgn(F))=O(WUlog((d+1)p))\operatorname{VCDim}(\operatorname{sgn}(\mathcal{F}))=O(WU\log((d+1)p)).

The proof of this result appears in Section 5. The best known upper bound before our work was O(W2)O(W^{2}), implicitly proven for bounded dd and pp by Goldberg and Jerrum (1995, Section 3.1). Our theorem improves this to the tight result O(WU)O(WU).

We can summarize the tightest known results on the VC-dimension of neural networks with piecewise polynomial activation functions as follows: for classes F\mathcal{F} of functions computed by the class of networks with LL layers, WW parameters, and UU units with the following non-linearities, we have the following bounds on VC-dimension:

VCdim(F)=Θ(WlogW){\operatorname{VCdim}}(\mathcal{F})=\Theta(W\log W) (Cover (1968) and Baum and Haussler (1989) showed the upper bound and Maass (1994) showed the lower bound).

cWLlog(W/L)  VCdim(F)  CWLlogWc\cdot WL\log(W/L)~{}\leq~{}{\operatorname{VCdim}}(\mathcal{F})~{}\leq~{}C\cdot WL\log W (this paper).

VCdim(F)=O(WL2+WLlogW){\operatorname{VCdim}}(\mathcal{F})=O(WL^{2}+WL\log W) (Bartlett et al., 1998), and VCdim(F)=O(WU){\operatorname{VCdim}}(\mathcal{F})=O(WU) (this paper), and VCdim(F)=Ω(WLlog(W/L)){\operatorname{VCdim}}(\mathcal{F})=\Omega(WL\log(W/L)) (this paper).

2 Related Work

For other theoretical properties of neural networks, we refer the reader to the monograph (Anthony and Bartlett, 1999). In this section, we summarize previous work that studies the impact of depth on the representational power of neural networks. It has long been known that two-layer networks with a variety of activation functions can approximate arbitrary continuous functions on compact sets (Hornik, 1991). Sontag (1992) showed that three-layer networks of linear threshold units can approximate inverses of continuous functions, whereas two-layer networks cannot. There are several recent papers that aim to understand which functions can be expressed using a neural network of a given depth and size. There are technical similarities between our work and these. Two striking papers considered the problem of approximating a deep neural network with a shallower network. Telgarsky (2016) shows that there is a ReLU network with LL layers and U=Θ(L)U=\Theta(L) units such that any network approximating it with only O(L1/3)O(L^{1/3}) layers must have Ω(2L1/3)\Omega(2^{L^{1/3}}) units; this phenomenon holds even for real-valued functions. Eldan and Shamir (2016) show an analogous result for a high-dimensional 3-layer network that cannot be approximated by a 2-layer network except with an exponential blow-up in the number of nodes.

Very recently, several authors have shown that deep neural networks are capable of approximating broad classes of functions. Safran and Shamir (2017) show that a sufficiently non-linear C2C^{2} function on d^{d} can be approximated with ϵ\epsilon error in L2L_{2} by a ReLU network with O(polylog(1/ϵ))O(\operatorname{polylog}(1/\epsilon)) layers and weights, but any such approximation with O(1)O(1) layers requires Ω(1/ϵ)\Omega(1/\epsilon) weights. Yarotsky (2017) shows that any CnC^{n}-function on d^{d} can be approximated with ϵ\epsilon error in LL_{\infty} by a ReLU network with O(log(1/ϵ))O(\log(1/\epsilon)) layers and O((1ϵ)d/nlog(1/ϵ))O((\frac{1}{\epsilon})^{d/n}\log(1/\epsilon)) weights. Liang and Srikant (2017) show that a sufficiently smooth univariate function can be approximated with ϵ\epsilon error in LL_{\infty} by a network with ReLU and threshold gates with Θ(log(1/ϵ))\Theta(\log(1/\epsilon)) layers and O(polylog(1/ϵ))O(\operatorname{polylog}(1/\epsilon)) weights, but that Ω(poly(1/ϵ))\Omega(\operatorname{poly}(1/\epsilon)) weights would be required if there were only o(log(1/ϵ))o(\log(1/\epsilon)) layers; they also prove analogous results for multivariate functions. Lastly, Cohen, Sharir, and Shashua (2016) draw a connection to tensor factorizations to show that, for a certain family of arithmetic circuits (in particular, without ReLU non-linearities), the set of functions computable by a shallow network have measure zero among those computable by a deep networks.

3 Notation

A neural network is defined by an activation function ψ:\psi:\Re\rightarrow\Re, a directed acyclic graph, and a set of parameters: a weight for each edge of the graph, and a bias for each node of the graph. Let WW denote the number of parameters (weights and biases) of the network, UU denote the number of computation units (nodes), and LL denote the length of the longest path in the graph. We will say that the neural network has LL layers.

The computation of a neural network proceeds as follows. For i=1,,Li=1,\ldots,L, the input into a computation unit uu at layer ii is wx+bw^{\top}x+b, where xx is the (real) vector corresponding to the outputs of the computational units with a directed edge to uu, ww is the corresponding vector of edge weights, and bb is the bias parameter associated with uu. For layers 1,,L11,\ldots,L-1, the output of uu is ψ(wx+b)\psi(w^{\top}x+b). For the output layer, we replace ψ\psi with the identity, so the output is simply wx+bw^{\top}x+b. Since we consider VC-dimension, we will always take the sign of the output of the network, to make the output lie in {0,1}\{0,1\} for binary classification.

A piecewise polynomial function with pp pieces is a function ff for which there exists a partition of \Re into disjoint intervals (pieces) I1,,IpI_{1},\ldots,I_{p} and corresponding polynomials f1,,fpf_{1},\ldots,f_{p} such that if xIix\in I_{i} then f(x)=fi(x)f(x)=f_{i}(x). A piecewise linear function is a piecewise polynomial function in which each fif_{i} is linear. The most common activation function used in practice is the rectified linear unit (ReLU) where I1=(,0]I_{1}=(-\infty,0], I2=(0,)I_{2}=(0,\infty) and f1(x)=0,f2(x)=xf_{1}(x)=0,f_{2}(x)=x. We denote this function by σ(x)max{0,x}\sigma(x)\coloneqq\max\{0,x\}. The set {1,2,,n}\{1,2,\dots,n\} is denoted [n][n].

Proof of Theorem 3

The proof of our main lower bound uses the “bit extraction” technique from (Bartlett et al., 1998) to prove an Ω(WL)\Omega(WL) lower bound. We refine this technique in a key way — we partition the input bits into blocks and extract multiple bits at a time instead of a single bit at a time. This yields a more efficient bit extraction network, and hence a stronger VC-dimension lower bound.

We show the following result, which immediately implies Theorem 3.

Let r,m,nr,m,n be positive integers, and let k=m/rk=\lceil m/r\rceil. There exists a ReLU network with 3+5k3+5k layers, 2+n+4m+k((11+r)2r+2r+2)2+n+4m+k((11+r)2^{r}+2r+2) parameters, m+nm+n input nodes and m+2+k(5×2r+r+1)m+2+k(5\times 2^{r}+r+1) computational nodes with VC-dimension mn\geq mn.

Choosing r=1r=1 gives a network with W=O(m+n)W=O(m+n), U=O(m)U=O(m) and VC-dimension Ω(mn)=Ω(WU)\Omega(mn)=\Omega(WU). This implies that the upper bound O(WU)O(WU) given in Theorem 8 is tight.

To prove Theorem 3, assume WW, LL, and W/LW/L are sufficiently large, and set r=log2(W/L)/2r=\log_{2}(W/L)/2, m=rL/8m=rL/8, and n=W5m2rn=W-5m2^{r} in Theorem 9. The rest of this section is devoted to proving Theorem 9.

Let SnnS_{n}\subseteq\Re^{n} denote the standard basis. We shatter the set Sn×SmS_{n}\times S_{m}. Given an arbitrary function f ⁣:Sn×Sm{0,1}f\colon S_{n}\times S_{m}\to\{0,1\}, we build a ReLU neural network that takes as input (x1,x2)Sn×Sm(x_{1},x_{2})\in S_{n}\times S_{m} and outputs f(x1,x2)f(x_{1},x_{2}). Define nn numbers a1,a2,,an{02m,12m,,2m12m}a_{1},a_{2},\dots,a_{n}\in\{\frac{0}{2^{m}},\frac{1}{2^{m}},\dots,\frac{2^{m}-1}{2^{m}}\} so that the iith digit of the binary representation of aja_{j} equals f(ej,ei)f(e_{j},e_{i}). These numbers will be used as the parameters of the network, as described below.

Given input (x1,x2)Sn×Sm(x_{1},x_{2})\in S_{n}\times S_{m}, assume that x1=eix_{1}=e_{i} and x2=ejx_{2}=e_{j}. The network must output the iith bit of aja_{j}. This “bit extraction approach” was used in (Bartlett et al., 1998, Theorem 2) to give an Ω(WL)\Omega(WL) lower bound for the VC-dimension. We use a similar approach but we introduce a novel idea: we split the bit extraction into blocks and extract rr bits at a time instead of a single bit at a time. This allows us to prove a lower bound of Ω(WLlog(W/L))\Omega(WL\log(W/L)). One can ask, naturally, whether this approach can be pushed further. Our Theorem 5 implies that the bit extraction approach cannot give a lower bound better than Ω(WLlog(W/L))\Omega(WL\log(W/L)) (see Remark 12).

The first layer of the network “selects” aja_{j}, and the remaining layers “extract” the iith bit of aja_{j}. In the first layer we have a single computational unit that calculates

This part uses 1 layer, 1 computation unit, and 1+n1+n parameters.

The rest of the network extracts all bits of aja_{j} and outputs the iith bit. The extraction is done in kk steps, where in each step we extract the rr most significant bits and zero them out. We will use the following building block for extracting rr bits.

Suppose positive integers rr and mm are given. There exists a ReLU network with 5 layers, 5×2r+r+15\times 2^{r}+r+1 units and 11×2r+r2r+2r+211\times 2^{r}+r2^{r}+2r+2 parameters that given the real number b=0.b1b2bmb=0.b_{1}b_{2}\dots b_{m} (in binary representation) as input, outputs the (r+1)(r+1)-dimensional vector (b1,b2,,br,0.br+1br+2bm)(b_{1},b_{2},\dots,b_{r},0.b_{r+1}b_{r+2}\dots b_{m}).

Figure 1 shows a schematic of the ReLU network in the above lemma.

Proof Partition [0,1)[0,1) into 2r2^{r} even subintervals. Observe that the values of b1,,brb_{1},\dots,b_{r} are determined by knowing which such subinterval bb lies in. We first show how to design a two-layer ReLU network that computes the indicator function for an interval to any arbitrary precision. Using 2r2^{r} of these networks in parallel allows us to determine which subinterval bb lies in and hence, determine the bits b1,,brb_{1},\dots,b_{r}.

For any aba\leq b and ε>0\varepsilon>0, observe that the function f(x)σ(1σ(a/εx/ε))+σ(1σ(x/εb/ε))1f(x)\coloneqq\sigma(1-\sigma(a/\varepsilon-x/\varepsilon))+\sigma(1-\sigma(x/\varepsilon-b/\varepsilon))-1 has the property that, f(x)=1f(x)=1 for x[a,b]x\in[a,b], and f(x)=0f(x)=0 for x(aε,b+ε)x\notin(a-\varepsilon,b+\varepsilon), and f(x)f(x)\in for all xx. Thus we can use ff to approximate the indicator function for [a,b][a,b], to any desired precision. Moreover, this function can be computed with 3 layers, 5 units, and 11 parameters as follows. First, computing σ(a/εx/ε)\sigma(a/\varepsilon-x/\varepsilon) can be done with 1 unit, 1 layer, and 2 parameters. Computing σ(1σ(a/εx/ε))\sigma(1-\sigma(a/\varepsilon-x/\varepsilon)) can be done with 1 additional unit, 1 additional layer, and 2 additional parameters. Similarly, σ(1σ(x/εb/ε))\sigma(1-\sigma(x/\varepsilon-b/\varepsilon)) can be computed with 2 units, the same 2 layers, and 4 parameters. Computing the sum can be done with 1 additional layer, 1 additional unit, and 3 additional parameters. In total, computing ff can be done with 3 layers, 5 units, and 11 parameters. We will choose ε=2m2\varepsilon=2^{-m-2} because we are working with mm-digit numbers.

Thus, the values b1,,brb_{1},\dots,b_{r} can be generated by adding the corresponding indicator variables. (For instance, b1=k=2r12r11[b[k2r,(k+1)2r)]b_{1}=\sum_{k=2^{r-1}}^{2^{r}-1}\mathbf{1}[b\in[k\cdot 2^{-r},(k+1)\cdot 2^{-r})].) Finally, the remainder 0.br+1br+2bm0.b_{r+1}b_{r+2}\dots b_{m} can be computed as 0.br+1br+2bm=2rbk=1r2rkbk0.b_{r+1}b_{r+2}\ldots b_{m}=2^{r}b-\sum_{k=1}^{r}2^{r-k}b_{k}.

Now we count the number of layers and parameters: we use 2r2^{r} small networks that work in parallel for producing the indicators, each has 3 layers, 5 units and 11 parameters. To produce b1,,brb_{1},\dots,b_{r} we need an additional layer, r×(2r+1)r\times(2^{r}+1) additional parameters, and rr additional units. For producing the remainder we need 1 more layer, 1 more unit, and r+2r+2 more parameters.

We use m/r\lceil m/r\rceil of these blocks to extract the bits of aja_{j}, denoted by aj,1,,aj,ma_{j,1},\dots,a_{j,m}. Extracting aj,ia_{j,i} is now easy, noting that if x,y{0,1}x,y\in\{0,1\} then xy=σ(x+y1)x\wedge y=\sigma(x+y-1). So, since x2=eix_{2}=e_{i}, we have

This calculation needs 2 layers, 1+m1+m units, and 1+4m1+4m parameters.

Theorem 5 implies an inherent barrier to proving lower bounds using the “bit extraction” approach from (Bartlett et al., 1998). Recall that this technique uses nn binary numbers with mm bits to encode a function f ⁣:Sn×Sm{0,1}f\colon S_{n}\times S_{m}\to\{0,1\} to show an Ω(mn)\Omega(mn) lower bound for VC-dimension, where SkS_{k} denotes the set of standard basis vectors in k\Re^{k}. The network begins by selecting one of the nn binary numbers, and then extracting a particular bit of that number. (Bartlett et al., 1998) shows that it is possible to take m=Ω(L)m=\Omega(L) and n=Ω(W)n=\Omega(W), thus proving a lower bound of Ω(WL)\Omega(WL) for the VC-dimension. In Theorem 3 we showed we can increase mm to Ω(Llog(W/L))\Omega(L\log(W/L)), improving the lower bound to Ω(WLlog(W/L))\Omega(WL\log(W/L)). Theorem 5 implies that to extract just the least significant bit, one is forced to have m=O(Llog(W/L))m=O(L\log(W/L)); on the other hand, we always have nWn\leq W. Hence there is no way to improve the VC-dimension lower bound by more than a constant via the bit extraction technique. In particular, for general piecewise polynomial networks, closing the gap between the O(WL2+WLlogW)O(WL^{2}+WL\log W) of (Bartlett et al., 1998) and Ω(WLlogW/L)\Omega(WL\log W/L) of this paper will require a different technique.

Proof of Theorem 5

For a piecewise polynomial function \Re\to\Re, breakpoints are the boundaries between the pieces. So if a function has pp pieces, it has p1p-1 breakpoints.

Let f1,,fk:f_{1},\dots,f_{k}:\Re\to\Re be piecewise polynomial of degree DD, and suppose the union of their breakpoints has size BB. Let ψ:\psi:\Re\to\Re be piecewise polynomial of degree dd with bb breakpoints. Let w1,,wkw_{1},\dots,w_{k}\in\Re be arbitrary. The function g(x)ψ(iwifi(x))g(x)\coloneqq\psi(\sum_{i}w_{i}f_{i}(x)) is piecewise polynomial of degree DdDd with at most (B+1)(2+bD)1(B+1)(2+bD)-1 breakpoints.

Proof Without loss of generality, assume that w1==wk=1w_{1}=\dots=w_{k}=1. The function ifi\sum_{i}f_{i} has B+1B+1 pieces. Consider one such interval I\mathcal{I}. We will prove that it will create at most 2+bD2+bD pieces in gg. In fact, if ifi\sum_{i}f_{i} is constant on I\mathcal{I}, gg will have 1 piece on I\mathcal{I}. Otherwise, for any point yy, the equation ifi(x)=y\sum_{i}f_{i}(x)=y has at most DD solutions on I\mathcal{I}. Let y1,,yby_{1},\dots,y_{b} be the breakpoints of ψ\psi. Suppose we move along the curve (x,ifi(x))(x,\sum_{i}f_{i}(x)) on I\mathcal{I}. Whenever we hit a point (t,yi)(t,y_{i}) for some tt, one new piece is created in gg. So at most bDbD new pieces are created. In addition, we may have two pieces for the beginning and ending of I\mathcal{I}. This gives a total of 2+bD2+bD pieces per interval, as required. Finally, note that the number of breakpoints is one fewer than the number of pieces.

Theorem 5 follows immediately from the following theorem.

In the special case of piecewise linear functions, this gives m=O(Llog(W/L)).m=O(L\log(W/L)).

Let us now relate γ(o)\gamma(o) with WW and LL. Suppose that, for i[L]i\in[L], there are WiW_{i} edges between layer ii and previous layers. By the AM-GM inequality,

Combining Eqs. 3 and 4 gives the theorem.

Our theorem above implies a qualitatively similar statement. In particular, if we choose m=k1+εm=k^{1+\varepsilon} then for any function gg computable by a neural network with Θ(k)\Theta(k) layers and O(2kε)O(2^{k^{\varepsilon}}) parameters, there must exist x{0,1,,2m1}x\in\{0,1,\ldots,2^{m}-1\} such that f(x)g(x)>1/2|f(x)-g(x)|>1/2.

Proof of Theorem 6

The proof of this theorem is very similar to the proof of the upper bound for piecewise polynomial networks from (Bartlett et al., 1998, Theorem 1) but optimized in a few places. The main technical tool in the proof is a bound on the growth function of a polynomially parametrized function class, due to Goldberg and Jerrum (1995). It uses an argument involving counting the number of connected components of semi-algebraic sets. The form stated here is (Bartlett et al., 1998, Lemma 1), which is a slight improvement of a result of Warren (1968) (the proof can be found in (Anthony and Bartlett, 1999, Theorem 8.3)).

Let p1,,pmp_{1},\ldots,p_{m} be polynomials of degree at most dd in nmn\leq m variables. Define

i.e. KK is the number of possible sign vectors given by the polynomials. Then K2(2emd/n)nK\leq 2(2emd/n)^{n}.

Proof [of Theorem 6]. For input xXx\in{\mathcal{X}} and parameter vector aWa\in\Re^{W}, let f(x,a)f(x,a) denote the output of the network. The F\mathcal{F} is simply the class of functions {xf(x,a):aW}\{x\mapsto f(x,a):a\in\Re^{W}\}.

Fix x1,x2,,xmx_{1},x_{2},\ldots,x_{m} in X{\mathcal{X}}. We view the parameters of the network, denoted aa, as a collection of WW real variables. We wish to bound

In other words, KK is the number of sign patterns that the neural network can output for the sequence of inputs (x1,,xm)(x_{1},\ldots,x_{m}). We will prove geometric upper bounds for KK, which will imply upper bounds for Πsgn(F)(m)\Pi_{\operatorname{sgn}(\mathcal{F})}(m)

For any partition S={P1,P2,,PN}{\mathcal{S}}=\{P_{1},P_{2},\ldots,P_{N}\} of the parameter domain W\Re^{W}, clearly we have

We choose the partition in such a way that within each region PiP_{i}, the functions f(xj,)f(x_{j},\cdot) are all fixed polynomials of bounded degree, so that each term in this sum can be bounded via Lemma 15.

The partition is constructed iteratively layer by layer, through a sequence S0,S1,S2,,SL1{\mathcal{S}}_{0},{\mathcal{S}}_{1},{\mathcal{S}}_{2},\ldots,{\mathcal{S}}_{L-1} of successive refinements, with the following properties:

We have S0=1|{\mathcal{S}}_{0}|=1 and, for each n[L1]n\in[L-1],

For each n{0,,L1}n\in\{0,\dots,L-1\}, each element SS of Sn1{\mathcal{S}}_{n-1}, each j[m]j\in[m], and each unit uu in the nnth layer, when aa varies in SS, the net input to uu is a fixed polynomial function in WnW_{n} variables of aa, of total degree no more than 1+(n1)dn11+(n-1)d^{n-1} (this polynomial may depend on SS, jj and uu).

We may define S0=W{\mathcal{S}}_{0}=\Re^{W}, which satisfies property 2 above, since the input to any node in layer 1 is of the form wTxj+bw^{T}x_{j}+b, which is an affine function of w,bw,b.

Now suppose that S0,,Sn1{\mathcal{S}}_{0},\dots,{\mathcal{S}}_{n-1} have been defined, and we want to define Sn{\mathcal{S}}_{n}. For any h[kn],j[m],h\in[k_{n}],j\in[m], and SSn1S\in{\mathcal{S}}_{n-1}, let ph,xj,S(a)p_{h,x_{j},S}(a) denote the function describing the net input of the hh-th unit in the nn-th layer, in response to xjx_{j}, when aSa\in S. By the induction hypothesis this is a polynomial with total degree no more than 1+(n1)dn11+(n-1)d^{n-1}, and depends on at most WnW_{n} many variables.

Let {t1,,tp}\{t_{1},\dots,t_{p}\} denote the set of breakpoints of the activation function. For any fixed SSn1S\in{\mathcal{S}}_{n-1}, by Lemma 15, the collection of polynomials

distinct sign patterns when aWa\in\Re^{W}. Thus, one can partition W\Re^{W} into this many regions, such that all these polynomials have the same signs within each region. We intersect all these regions with SS to obtain a partition of SS into at most Π\Pi subregions. Performing this for all SSn1S\in{\mathcal{S}}_{n-1} gives our desired partition Sn{\mathcal{S}}_{n}. Thus, the required property 1 (inequality (6)) is clearly satisfied.

Fix some SSnS^{\prime}\in{\mathcal{S}}_{n}. Notice that, when aa varies in SS^{\prime}, all the polynomials

have the same sign, hence the input of each nnth layer unit lies between two breakpoints of the activation function, hence the output of each nnth layer unit in response to an xjx_{j} is a fixed polynomial in WnW_{n} variables of degree no more than d(1+(n1)dn1)ndnd(1+(n-1)d^{n-1})\leq nd^{n}. This implies that the input of every (n+1)(n+1)th layer unit in response to an xjx_{j} is a fixed polynomial function of Wn+1W_{n+1} variables of degree no more than 1+ndn1+nd^{n}. (When d=0d=0, this affine function depends only on the Wn+1W_{n+1} parameters in layer n+1n+1; for d>0d>0, it is a polynomial function of all parameters up to layer n+1n+1.)

Proceeding in this way we obtain a partition SL1{\mathcal{S}}_{L-1} of W\Re^{W} such that for SSL1S\in{\mathcal{S}}_{L-1} the network output in response to any xjx_{j} is a fixed polynomial of aSa\in S of degree no more than 1+(L1)dL11+(L-1)d^{L-1} (recall that the last node just outputs its input), and hence by Lemma 15 again,

On the other hand, applying (6) iteratively gives

and thus using (5), and since the points x1,,xmx_{1},\ldots,x_{m} were chosen arbitrarily, we obtain

For the bound on the VC-dimension, from the third line in the formula above, and the definition of VC-dimension, we find

Notice that U>2U>2 implies 2eR162eR\geq 16, hence Lemma 16 below gives

Suppose that 2m2t(mr/w)w2^{m}\leq 2^{t}(mr/w)^{w} for some r16r\geq 16 and mwt0m\geq w\geq t\geq 0. Then, mt+wlog2(2rlog2r)m\leq t+w\log_{2}(2r\log_{2}r).

Proof We would like to show that 2x>2t(xr/w)w2^{x}>2^{t}(xr/w)^{w} for all x>t+wlog2(2rlog2r)mx>t+w\log_{2}(2r\log_{2}r)\eqqcolon m. Let f(x)xtwlog2(xr/w)f(x)\coloneqq x-t-w\log_{2}(xr/w). To show that f(x)>0f(x)>0 for all x>mx>m, we need only show that f(m)0f(m)\geq 0 and f(x)>0f^{\prime}(x)>0 for all xmx\geq m. First, f(m)0f(m)\geq 0 if and only if

which holds since r16r\geq 16 and t/w1t/w\leq 1. Finally, for xmx\geq m, we have f(x)0f^{\prime}(x)\geq 0 if and only if

which holds since r16r\geq 16 implies xmwlog2(2rlog2r)>w/ln2x\geq m\geq w\log_{2}(2r\log_{2}r)>w/\ln 2.

Proof of Theorem 8

The idea of the proof is that the sign of the output of a neural network can be expressed as a Boolean formula where each predicate is a polynomial inequality. For example, consider the following toy network, where the activation function of the hidden units is a ReLU.

The sign of the output of the network is sgn(y)=sgn(w3σ(w1x)+w4σ(w2x))\operatorname{sgn}(y)=\operatorname{sgn}(w_{3}\sigma(w_{1}x)+w_{4}\sigma(w_{2}x)). Define the following Boolean predicates: p1=(w1x>0)p_{1}=(w_{1}x>0), p2=(w2x>0)p_{2}=(w_{2}x>0), q1=(w3w1x>0)q_{1}=(w_{3}w_{1}x>0), q2=(w4w2x>0)q_{2}=(w_{4}w_{2}x>0), and q3=(w3w1x+w4w2x>0)q_{3}=(w_{3}w_{1}x+w_{4}w_{2}x>0). Then, we can write

A theorem of Goldberg and Jerrum states that any class of functions that can be expressed using a relatively small number of distinct polynomial inequalities has small VC-dimension.

Let k,nk,n be positive integers and f ⁣:n×k{0,1}f\colon\Re^{n}\times\Re^{k}\to\{0,1\} be a function that can be expressed as a Boolean formula containing ss distinct atomic predicates where each atomic predicate is a polynomial inequality or equality in k+nk+n variables of degree at most dd. Let F={f(,w):wk}\mathcal{F}=\{f(\cdot,w):w\in\Re^{k}\}. Then VCDim(F)2klog2(8eds)\operatorname{VCDim}(\mathcal{F})\leq 2k\log_{2}(8eds).

Proof [of Theorem 8]. Consider a neural network with WW weights and UU computation units, and assume that the activation function ψ\psi is piecewise polynomial of degree at most dd with pp pieces. To apply Theorem 17, we will express the sign of the output of the network as a Boolean function consisting of less than 2(1+p)U2(1+p)^{U} atomic predicates, each being a polynomial inequality of degree at most max{U+1,2dU}\max\{U+1,2d^{U}\}.

Since the neural network graph is acyclic, it can be topologically sorted. For i[U]i\in[U], let uiu_{i} denote the iith computation unit in the topological ordering. The input to each computation unit uu lies in one of the pp pieces of ψ\psi. For i[U]i\in[U] and j[p]j\in[p], we say “uiu_{i} is in state jj” if the input to uiu_{i} lies in the jjth piece.

For u1u_{1} and any jj, the predicate “u1u_{1} is in state jj” is a single atomic predicate which is the quadratic inequality indicating whether its input lies in the corresponding interval. So, the state of u1u_{1} can be expressed as a function of pp atomic predicates. Conditioned on u1u_{1} being in a certain state, the state of u2u_{2} can be determined using pp atomic predicates, which are polynomial inequalities of degree at most 2d+12d+1. Consequently, the state of u2u_{2} can be determined using p+p2p+p^{2} atomic predicates, each of which is a polynomial of degree at most 2d+12d+1. Continuing similarly, we obtain that for each ii, the state of uiu_{i} can be determined using p(1+p)i1p(1+p)^{i-1} atomic predicates, each of which is a polynomial of degree at most di1+j=0i1djd^{i-1}+\sum_{j=0}^{i-1}d^{j}. Consequently, the state of all nodes can be determined using less than (1+p)U(1+p)^{U} atomic predicates, each of which is a polynomial of degree at most dU1+j=0U1djmax{U+1,2dU}d^{U-1}+\sum_{j=0}^{U-1}d^{j}\leq\max\{U+1,2d^{U}\} (the output unit is linear). Conditioned on all nodes being in certain states, the sign of the output can be determined using one more atomic predicate, which is a polynomial inequality of degree at most max{U+1,2dU}\max\{U+1,2d^{U}\}.

In total, we have less than 2(1+p)U2(1+p)^{U} atomic polynomial-inequality predicates and each polynomial has degree at most max{U+1,2dU}\max\{U+1,2d^{U}\}. Thus, by Theorem 17, we get an upper bound of 2Wlog(16emax{U+1,2dU}(1+p)U)=O(WUlog((1+d)p))2W\log(16e\cdot\max\{U+1,2d^{U}\}\cdot(1+p)^{U})=O(WU\log((1+d)p)) for the VC-dimension.

Christopher Liaw is supported by an NSERC graduate scholarship. Abbas Mehrabian is supported by an NSERC Postdoctoral Fellowship and a Simons-Berkeley Research Fellowship. Peter Bartlett gratefully acknowledges the support of the NSF through grant IIS-1619362 and of the Australian Research Council through an Australian Laureate Fellowship (FL110100281) and through the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS). Part of this work was done while Peter Bartlett and Abbas Mehrabian were visiting the Simons Institute for the Theory of Computing at UC Berkeley.

References