Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks

Peter L. Bartlett, Nick Harvey, Chris Liaw, Abbas Mehrabian

Introduction

Deep neural networks underlie many of the recent breakthroughs in applied machine learning, particularly in image and speech recognition. These successes motivate a renewed study of these networks’ theoretical properties.

Classification is one of the learning tasks in which deep neural networks have been particularly successful, e.g., for image recognition. A natural foundational question that arises is: what are the generalization guarantees of these networks in a statistical learning framework? An established way to address this question is by considering VC-dimension, which characterizes uniform convergence of misclassification frequencies to probabilities (see Vapnik and Chervonenkis, 1971), and asymptotically determines the sample complexity of PAC learning with such classifiers (see Blumer, Ehrenfeucht, Haussler, and Warmuth, 1989).

Let $H$ denote a class of functions from ${\mathcal{X}}$ to ${\{0,1\}}$ (the hypotheses, or the classification rules). For any non-negative integer $m$ , we define the growth function of $H$ as

If $\left|\left\{(h(x_{1}),\ldots,h(x_{m})):h\in H\right\}\right|=2^{m}$ , we say $H$ shatters the set $\{x_{1},\dots,x_{m}\}$ . The Vapnik-Chervonenkis dimension of $H$ , denoted ${\operatorname{VCdim}}(H)$ , is the size of the largest shattered set, i.e. the largest $m$ such that $\Pi_{H}(m)=2^{m}$ . If there is no largest $m$ , we define ${\operatorname{VCdim}}(H)=\infty$ .

For a class of real-valued functions, such as those generated by neural networks, a natural measure of complexity that implies similar uniform convergence properties is the pseudodimension (see Pollard, 1990).

Let $\mathcal{F}$ be a class of functions from ${\mathcal{X}}$ to $\Re$ . The pseudodimension of $\mathcal{F}$ , written ${\operatorname{Pdim}}(\mathcal{F})$ , is the largest integer $m$ for which there exists $(x_{1},\dots,x_{m},y_{1},\dots,y_{m})\in{\mathcal{X}}^{m}\times\Re^{m}$ such that for any $(b_{1},\dots,b_{m})\in\{0,1\}^{m}$ there exists $f\in\mathcal{F}$ such that

For a class $\mathcal{F}$ of real-valued functions, we may define ${\operatorname{VCdim}}(\mathcal{F})\coloneqq{\operatorname{VCdim}}(\operatorname{sgn}(\mathcal{F}))$ , where

and $\operatorname{sgn}(x)=\mathbf{1}[x>0]$ . For any class $\mathcal{F}$ , clearly ${\operatorname{VCdim}}(\mathcal{F})\leq{\operatorname{Pdim}}(\mathcal{F})$ . If $\mathcal{F}$ is the class of functions generated by a neural network with a fixed architecture and fixed activation functions (see Section 1.3 for definitions), then it is not hard to see that indeed ${\operatorname{Pdim}}(\mathcal{F})\leq{\operatorname{VCdim}}(\mathcal{F})$ (see (Anthony and Bartlett, 1999, Theorem 14.1) for a proof), and hence ${\operatorname{Pdim}}(\mathcal{F})={\operatorname{VCdim}}(\mathcal{F})$ . Therefore, all the results of this paper automatically apply to the pseudodimensions of neural networks as well.

The main contribution of this paper is to prove nearly-tight bounds on the VC-dimension of deep neural networks in which the non-linear activation function is a piecewise linear function with a constant number of pieces. For simplicity we will henceforth refer to such networks as “piecewise linear networks”. The activation function that is the most commonly used in practice is the rectified linear unit, also known as ReLU (see LeCun, Bengio, and Hinton, 2015; Goodfellow, Bengio, and Courville, 2016). The ReLU function is defined as $\sigma(x)=\max\{0,x\}$ , so it is clearly piecewise linear.

It is particularly interesting to consider how the VC-dimension is affected by the various attributes of the network: the number $W$ of parameters (i.e., weights and biases), the number $U$ of non-linear units (i.e., nodes), and the number $L$ of layers. Among all networks with the same size (number of weights), is it true that those with more layers have larger VC-dimension?

Such a statement is indeed true, and previously known; however, a tight characterization of how depth affects VC-dimension was unknown prior to this work.

Our first main result is a new VC-dimension lower bound that holds even for the restricted family of ReLU networks.

There exists a universal constant $C$ such that the following holds. Given any $W,L$ with $W>CL>C^{2}$ , there exists a ReLU network with $\leq L$ layers and $\leq W$ parameters with VC-dimension $\geq WL\log(W/L)/C$ .

Our construction can be augmented slightly to give a neural network with linear threshold and identity activation functions with the same guarantees.

The proof appears in Section 2. Prior to our work, the best known lower bounds were $\Omega(WL)$ (see Bartlett, Maiorov, and Meir, 1998, Theorem 2) and $\Omega(W\log W)$ (see Maass, 1994, Theorem 1). We strictly improve both bounds to $\Omega(WL\log(W/L))$ .

Our proof of Theorem 3 uses the “bit extraction” technique, which was also used in (Bartlett et al., 1998) to give an $\Omega(WL)$ lower bound. We refine this technique to gain the additional logarithmic factor that appears in Theorem 3.

Unfortunately there is a barrier to refining this technique any further. Our next theorem shows the hardness of computing the mod function, implying that the bit extraction technique cannot yield a stronger lower bound than Theorem 3. Further discussion of this connection may be found in Remark 12.

The proof of this theorem appears in Section 3. One interesting aspect of the proof is that it does not use Warren’s lemma (Warren, 1968), which is a mainstay of VC-dimension upper bounds (see Goldberg and Jerrum, 1995; Bartlett et al., 1998; Anthony and Bartlett, 1999).

Our next main result is an upper bound on the VC-dimension of neural networks with piecewise polynomial activation functions.

Consider a neural network architecture with $W$ parameters and $U$ computation units arranged in $L$ layers, so that each unit has connections only from units in earlier layers. Let $k_{i}$ denote the number of units at the $i$ th layer. Suppose that all non-output units have piecewise-polynomial activation functions with $p+1$ pieces and degree no more than $d$ , and the output unit has the identity function as its activation function.

If $d=0$ , let $W_{i}$ denote the number of parameters (weights and biases) at the inputs to units in layer $i$ ; if $d>0$ , let $W_{i}$ denote the total number of parameters (weights and biases) at the inputs to units in all the layers up to layer $i$ (i.e., in layers $1,2,\dots,i$ ). Define the effective depth as

For the class $\mathcal{F}$ of all (real-valued) functions computed by this network and $m\geq\bar{L}W$ , we have

The average depth $\bar{L}$ is always between 1 and $L$ , and captures how the parameters are distributed in the network: it is close to $1$ if they are concentrated near the output (or if the activation functions are piecewise-constant), while it is of order $L$ if the parameters are concentrated near the input, or are spread out throughout the network. Hence, this suggests that edges and vertices closer to the input have a larger effect in increasing the VC-dimension, a phenomenon not observed before; and indeed our lower bound construction in Theorem 3 (as well as the lower bound construction from (Bartlett et al., 1998)) considers a network with most of the parameters near the input.

The proof of this result appears in Section 4. Prior to our work, the best known upper bounds were $O(W^{2})$ (see Goldberg and Jerrum, 1995, Section 3.1) and $O(WL\log W+WL^{2})$ (see Bartlett et al., 1998, Theorem 1), both of which hold for piecewise polynomial activation functions with a bounded number of pieces (for the remainder of this section, assume that $p=O(1)$ throughout); we strictly improve both bounds to $O(WL\log W)$ for the special case of piecewise linear functions ( $d=1$ ). Recall that ReLU is an example of a piecewise linear activation function. For the case $d=0$ , an $O(W\log U)$ bound for the VC-dimension was already proved using different techniques by Cover (1968) and by Baum and Haussler (1989, Corollary 2). Our Theorem 6 implies all of these upper bounds (except the $O(W^{2})$ upper bound of Goldberg and Jerrum) using a unified technique, and gives a slightly more refined picture of the dependence of the VC-dimension on the distribution of parameters in a deep network.

To compare our upper and lower bounds, let $d(W,L)$ denote the largest VC-dimension of a piecewise linear network with $W$ parameters and $L$ layers. Theorems 6 and 3 imply there exist constants $c,C$ such that

For neural networks arising in practice it would certainly be the case that $L$ is significantly smaller than $W^{0.99}$ , in which case our results determine the asymptotic bound $d(W,L)=\Theta(WL\log W)$ . On the other hand, in the regime $L=\Theta(W)$ , which is merely of theoretical interest, we also now have a tight bound $d(W,L)=\Theta(WL)$ , obtained by combining Theorem 3 with results of Goldberg and Jerrum (1995). There is now only a very narrow regime, say $W^{0.99}\ll L\ll W$ , in which the bounds of (2) are not asymptotically tight, and they differ only in the logarithmic factor.

Our final result is an upper bound for VC-dimension in terms of $W$ and $U$ (the number of non-linear units, or nodes). This bound is tight in the case $d=1$ and $p=2$ , as discussed in Remark 10.

Consider a neural network with $W$ parameters and $U$ units with activation functions that are piecewise polynomials with at most $p$ pieces and of degree at most $d$ . Let $\mathcal{F}$ be the set of (real-valued) functions computed by this network. Then $\operatorname{VCDim}(\operatorname{sgn}(\mathcal{F}))=O(WU\log((d+1)p))$ .

The proof of this result appears in Section 5. The best known upper bound before our work was $O(W^{2})$ , implicitly proven for bounded $d$ and $p$ by Goldberg and Jerrum (1995, Section 3.1). Our theorem improves this to the tight result $O(WU)$ .

We can summarize the tightest known results on the VC-dimension of neural networks with piecewise polynomial activation functions as follows: for classes $\mathcal{F}$ of functions computed by the class of networks with $L$ layers, $W$ parameters, and $U$ units with the following non-linearities, we have the following bounds on VC-dimension:

${\operatorname{VCdim}}(\mathcal{F})=\Theta(W\log W)$ (Cover (1968) and Baum and Haussler (1989) showed the upper bound and Maass (1994) showed the lower bound).

$c\cdot WL\log(W/L)~{}\leq~{}{\operatorname{VCdim}}(\mathcal{F})~{}\leq~{}C\cdot WL\log W$ (this paper).

${\operatorname{VCdim}}(\mathcal{F})=O(WL^{2}+WL\log W)$ (Bartlett et al., 1998), and ${\operatorname{VCdim}}(\mathcal{F})=O(WU)$ (this paper), and ${\operatorname{VCdim}}(\mathcal{F})=\Omega(WL\log(W/L))$ (this paper).

2 Related Work

For other theoretical properties of neural networks, we refer the reader to the monograph (Anthony and Bartlett, 1999). In this section, we summarize previous work that studies the impact of depth on the representational power of neural networks. It has long been known that two-layer networks with a variety of activation functions can approximate arbitrary continuous functions on compact sets (Hornik, 1991). Sontag (1992) showed that three-layer networks of linear threshold units can approximate inverses of continuous functions, whereas two-layer networks cannot. There are several recent papers that aim to understand which functions can be expressed using a neural network of a given depth and size. There are technical similarities between our work and these. Two striking papers considered the problem of approximating a deep neural network with a shallower network. Telgarsky (2016) shows that there is a ReLU network with $L$ layers and $U=\Theta(L)$ units such that any network approximating it with only $O(L^{1/3})$ layers must have $\Omega(2^{L^{1/3}})$ units; this phenomenon holds even for real-valued functions. Eldan and Shamir (2016) show an analogous result for a high-dimensional 3-layer network that cannot be approximated by a 2-layer network except with an exponential blow-up in the number of nodes.

Very recently, several authors have shown that deep neural networks are capable of approximating broad classes of functions. Safran and Shamir (2017) show that a sufficiently non-linear $C^{2}$ function on $^{d}$ can be approximated with $\epsilon$ error in $L_{2}$ by a ReLU network with $O(\operatorname{polylog}(1/\epsilon))$ layers and weights, but any such approximation with $O(1)$ layers requires $\Omega(1/\epsilon)$ weights. Yarotsky (2017) shows that any $C^{n}$ -function on $^{d}$ can be approximated with $\epsilon$ error in $L_{\infty}$ by a ReLU network with $O(\log(1/\epsilon))$ layers and $O((\frac{1}{\epsilon})^{d/n}\log(1/\epsilon))$ weights. Liang and Srikant (2017) show that a sufficiently smooth univariate function can be approximated with $\epsilon$ error in $L_{\infty}$ by a network with ReLU and threshold gates with $\Theta(\log(1/\epsilon))$ layers and $O(\operatorname{polylog}(1/\epsilon))$ weights, but that $\Omega(\operatorname{poly}(1/\epsilon))$ weights would be required if there were only $o(\log(1/\epsilon))$ layers; they also prove analogous results for multivariate functions. Lastly, Cohen, Sharir, and Shashua (2016) draw a connection to tensor factorizations to show that, for a certain family of arithmetic circuits (in particular, without ReLU non-linearities), the set of functions computable by a shallow network have measure zero among those computable by a deep networks.

3 Notation

A neural network is defined by an activation function $\psi:\Re\rightarrow\Re$ , a directed acyclic graph, and a set of parameters: a weight for each edge of the graph, and a bias for each node of the graph. Let $W$ denote the number of parameters (weights and biases) of the network, $U$ denote the number of computation units (nodes), and $L$ denote the length of the longest path in the graph. We will say that the neural network has $L$ layers.

The computation of a neural network proceeds as follows. For $i=1,\ldots,L$ , the input into a computation unit $u$ at layer $i$ is $w^{\top}x+b$ , where $x$ is the (real) vector corresponding to the outputs of the computational units with a directed edge to $u$ , $w$ is the corresponding vector of edge weights, and $b$ is the bias parameter associated with $u$ . For layers $1,\ldots,L-1$ , the output of $u$ is $\psi(w^{\top}x+b)$ . For the output layer, we replace $\psi$ with the identity, so the output is simply $w^{\top}x+b$ . Since we consider VC-dimension, we will always take the sign of the output of the network, to make the output lie in $\{0,1\}$ for binary classification.

A piecewise polynomial function with $p$ pieces is a function $f$ for which there exists a partition of $\Re$ into disjoint intervals (pieces) $I_{1},\ldots,I_{p}$ and corresponding polynomials $f_{1},\ldots,f_{p}$ such that if $x\in I_{i}$ then $f(x)=f_{i}(x)$ . A piecewise linear function is a piecewise polynomial function in which each $f_{i}$ is linear. The most common activation function used in practice is the rectified linear unit (ReLU) where $I_{1}=(-\infty,0]$ , $I_{2}=(0,\infty)$ and $f_{1}(x)=0,f_{2}(x)=x$ . We denote this function by $\sigma(x)\coloneqq\max\{0,x\}$ . The set $\{1,2,\dots,n\}$ is denoted $[n]$ .

Proof of Theorem 3

The proof of our main lower bound uses the “bit extraction” technique from (Bartlett et al., 1998) to prove an $\Omega(WL)$ lower bound. We refine this technique in a key way — we partition the input bits into blocks and extract multiple bits at a time instead of a single bit at a time. This yields a more efficient bit extraction network, and hence a stronger VC-dimension lower bound.

We show the following result, which immediately implies Theorem 3.

Let $r,m,n$ be positive integers, and let $k=\lceil m/r\rceil$ . There exists a ReLU network with $3+5k$ layers, $2+n+4m+k((11+r)2^{r}+2r+2)$ parameters, $m+n$ input nodes and $m+2+k(5\times 2^{r}+r+1)$ computational nodes with VC-dimension $\geq mn$ .

Choosing $r=1$ gives a network with $W=O(m+n)$ , $U=O(m)$ and VC-dimension $\Omega(mn)=\Omega(WU)$ . This implies that the upper bound $O(WU)$ given in Theorem 8 is tight.

To prove Theorem 3, assume $W$ , $L$ , and $W/L$ are sufficiently large, and set $r=\log_{2}(W/L)/2$ , $m=rL/8$ , and $n=W-5m2^{r}$ in Theorem 9. The rest of this section is devoted to proving Theorem 9.

Let $S_{n}\subseteq\Re^{n}$ denote the standard basis. We shatter the set $S_{n}\times S_{m}$ . Given an arbitrary function $f\colon S_{n}\times S_{m}\to\{0,1\}$ , we build a ReLU neural network that takes as input $(x_{1},x_{2})\in S_{n}\times S_{m}$ and outputs $f(x_{1},x_{2})$ . Define $n$ numbers $a_{1},a_{2},\dots,a_{n}\in\{\frac{0}{2^{m}},\frac{1}{2^{m}},\dots,\frac{2^{m}-1}{2^{m}}\}$ so that the $i$ th digit of the binary representation of $a_{j}$ equals $f(e_{j},e_{i})$ . These numbers will be used as the parameters of the network, as described below.

Given input $(x_{1},x_{2})\in S_{n}\times S_{m}$ , assume that $x_{1}=e_{i}$ and $x_{2}=e_{j}$ . The network must output the $i$ th bit of $a_{j}$ . This “bit extraction approach” was used in (Bartlett et al., 1998, Theorem 2) to give an $\Omega(WL)$ lower bound for the VC-dimension. We use a similar approach but we introduce a novel idea: we split the bit extraction into blocks and extract $r$ bits at a time instead of a single bit at a time. This allows us to prove a lower bound of $\Omega(WL\log(W/L))$ . One can ask, naturally, whether this approach can be pushed further. Our Theorem 5 implies that the bit extraction approach cannot give a lower bound better than $\Omega(WL\log(W/L))$ (see Remark 12).

The first layer of the network “selects” $a_{j}$ , and the remaining layers “extract” the $i$ th bit of $a_{j}$ . In the first layer we have a single computational unit that calculates

This part uses 1 layer, 1 computation unit, and $1+n$ parameters.

The rest of the network extracts all bits of $a_{j}$ and outputs the $i$ th bit. The extraction is done in $k$ steps, where in each step we extract the $r$ most significant bits and zero them out. We will use the following building block for extracting $r$ bits.

Suppose positive integers $r$ and $m$ are given. There exists a ReLU network with 5 layers, $5\times 2^{r}+r+1$ units and $11\times 2^{r}+r2^{r}+2r+2$ parameters that given the real number $b=0.b_{1}b_{2}\dots b_{m}$ (in binary representation) as input, outputs the $(r+1)$ -dimensional vector $(b_{1},b_{2},\dots,b_{r},0.b_{r+1}b_{r+2}\dots b_{m})$ .

Figure 1 shows a schematic of the ReLU network in the above lemma.

Proof Partition $[0,1)$ into $2^{r}$ even subintervals. Observe that the values of $b_{1},\dots,b_{r}$ are determined by knowing which such subinterval $b$ lies in. We first show how to design a two-layer ReLU network that computes the indicator function for an interval to any arbitrary precision. Using $2^{r}$ of these networks in parallel allows us to determine which subinterval $b$ lies in and hence, determine the bits $b_{1},\dots,b_{r}$ .

For any $a\leq b$ and $\varepsilon>0$ , observe that the function $f(x)\coloneqq\sigma(1-\sigma(a/\varepsilon-x/\varepsilon))+\sigma(1-\sigma(x/\varepsilon-b/\varepsilon))-1$ has the property that, $f(x)=1$ for $x\in[a,b]$ , and $f(x)=0$ for $x\notin(a-\varepsilon,b+\varepsilon)$ , and $f(x)\in$ for all $x$ . Thus we can use $f$ to approximate the indicator function for $[a,b]$ , to any desired precision. Moreover, this function can be computed with 3 layers, 5 units, and 11 parameters as follows. First, computing $\sigma(a/\varepsilon-x/\varepsilon)$ can be done with 1 unit, 1 layer, and 2 parameters. Computing $\sigma(1-\sigma(a/\varepsilon-x/\varepsilon))$ can be done with 1 additional unit, 1 additional layer, and 2 additional parameters. Similarly, $\sigma(1-\sigma(x/\varepsilon-b/\varepsilon))$ can be computed with 2 units, the same 2 layers, and 4 parameters. Computing the sum can be done with 1 additional layer, 1 additional unit, and 3 additional parameters. In total, computing $f$ can be done with 3 layers, 5 units, and 11 parameters. We will choose $\varepsilon=2^{-m-2}$ because we are working with $m$ -digit numbers.

Thus, the values $b_{1},\dots,b_{r}$ can be generated by adding the corresponding indicator variables. (For instance, $b_{1}=\sum_{k=2^{r-1}}^{2^{r}-1}\mathbf{1}[b\in[k\cdot 2^{-r},(k+1)\cdot 2^{-r})]$ .) Finally, the remainder $0.b_{r+1}b_{r+2}\dots b_{m}$ can be computed as $0.b_{r+1}b_{r+2}\ldots b_{m}=2^{r}b-\sum_{k=1}^{r}2^{r-k}b_{k}$ .

Now we count the number of layers and parameters: we use $2^{r}$ small networks that work in parallel for producing the indicators, each has 3 layers, 5 units and 11 parameters. To produce $b_{1},\dots,b_{r}$ we need an additional layer, $r\times(2^{r}+1)$ additional parameters, and $r$ additional units. For producing the remainder we need 1 more layer, 1 more unit, and $r+2$ more parameters.

We use $\lceil m/r\rceil$ of these blocks to extract the bits of $a_{j}$ , denoted by $a_{j,1},\dots,a_{j,m}$ . Extracting $a_{j,i}$ is now easy, noting that if $x,y\in\{0,1\}$ then $x\wedge y=\sigma(x+y-1)$ . So, since $x_{2}=e_{i}$ , we have

This calculation needs 2 layers, $1+m$ units, and $1+4m$ parameters.

Theorem 5 implies an inherent barrier to proving lower bounds using the “bit extraction” approach from (Bartlett et al., 1998). Recall that this technique uses $n$ binary numbers with $m$ bits to encode a function $f\colon S_{n}\times S_{m}\to\{0,1\}$ to show an $\Omega(mn)$ lower bound for VC-dimension, where $S_{k}$ denotes the set of standard basis vectors in $\Re^{k}$ . The network begins by selecting one of the $n$ binary numbers, and then extracting a particular bit of that number. (Bartlett et al., 1998) shows that it is possible to take $m=\Omega(L)$ and $n=\Omega(W)$ , thus proving a lower bound of $\Omega(WL)$ for the VC-dimension. In Theorem 3 we showed we can increase $m$ to $\Omega(L\log(W/L))$ , improving the lower bound to $\Omega(WL\log(W/L))$ . Theorem 5 implies that to extract just the least significant bit, one is forced to have $m=O(L\log(W/L))$ ; on the other hand, we always have $n\leq W$ . Hence there is no way to improve the VC-dimension lower bound by more than a constant via the bit extraction technique. In particular, for general piecewise polynomial networks, closing the gap between the $O(WL^{2}+WL\log W)$ of (Bartlett et al., 1998) and $\Omega(WL\log W/L)$ of this paper will require a different technique.

Proof of Theorem 5

For a piecewise polynomial function $\Re\to\Re$ , breakpoints are the boundaries between the pieces. So if a function has $p$ pieces, it has $p-1$ breakpoints.

Let $f_{1},\dots,f_{k}:\Re\to\Re$ be piecewise polynomial of degree $D$ , and suppose the union of their breakpoints has size $B$ . Let $\psi:\Re\to\Re$ be piecewise polynomial of degree $d$ with $b$ breakpoints. Let $w_{1},\dots,w_{k}\in\Re$ be arbitrary. The function $g(x)\coloneqq\psi(\sum_{i}w_{i}f_{i}(x))$ is piecewise polynomial of degree $Dd$ with at most $(B+1)(2+bD)-1$ breakpoints.

Proof Without loss of generality, assume that $w_{1}=\dots=w_{k}=1$ . The function $\sum_{i}f_{i}$ has $B+1$ pieces. Consider one such interval $\mathcal{I}$ . We will prove that it will create at most $2+bD$ pieces in $g$ . In fact, if $\sum_{i}f_{i}$ is constant on $\mathcal{I}$ , $g$ will have 1 piece on $\mathcal{I}$ . Otherwise, for any point $y$ , the equation $\sum_{i}f_{i}(x)=y$ has at most $D$ solutions on $\mathcal{I}$ . Let $y_{1},\dots,y_{b}$ be the breakpoints of $\psi$ . Suppose we move along the curve $(x,\sum_{i}f_{i}(x))$ on $\mathcal{I}$ . Whenever we hit a point $(t,y_{i})$ for some $t$ , one new piece is created in $g$ . So at most $bD$ new pieces are created. In addition, we may have two pieces for the beginning and ending of $\mathcal{I}$ . This gives a total of $2+bD$ pieces per interval, as required. Finally, note that the number of breakpoints is one fewer than the number of pieces.

Theorem 5 follows immediately from the following theorem.

In the special case of piecewise linear functions, this gives $m=O(L\log(W/L)).$

Let us now relate $\gamma(o)$ with $W$ and $L$ . Suppose that, for $i\in[L]$ , there are $W_{i}$ edges between layer $i$ and previous layers. By the AM-GM inequality,

Combining Eqs. 3 and 4 gives the theorem.

Our theorem above implies a qualitatively similar statement. In particular, if we choose $m=k^{1+\varepsilon}$ then for any function $g$ computable by a neural network with $\Theta(k)$ layers and $O(2^{k^{\varepsilon}})$ parameters, there must exist $x\in\{0,1,\ldots,2^{m}-1\}$ such that $|f(x)-g(x)|>1/2$ .

Proof of Theorem 6

The proof of this theorem is very similar to the proof of the upper bound for piecewise polynomial networks from (Bartlett et al., 1998, Theorem 1) but optimized in a few places. The main technical tool in the proof is a bound on the growth function of a polynomially parametrized function class, due to Goldberg and Jerrum (1995). It uses an argument involving counting the number of connected components of semi-algebraic sets. The form stated here is (Bartlett et al., 1998, Lemma 1), which is a slight improvement of a result of Warren (1968) (the proof can be found in (Anthony and Bartlett, 1999, Theorem 8.3)).

Let $p_{1},\ldots,p_{m}$ be polynomials of degree at most $d$ in $n\leq m$ variables. Define

i.e. $K$ is the number of possible sign vectors given by the polynomials. Then $K\leq 2(2emd/n)^{n}$ .

Proof [of Theorem 6]. For input $x\in{\mathcal{X}}$ and parameter vector $a\in\Re^{W}$ , let $f(x,a)$ denote the output of the network. The $\mathcal{F}$ is simply the class of functions $\{x\mapsto f(x,a):a\in\Re^{W}\}$ .

Fix $x_{1},x_{2},\ldots,x_{m}$ in ${\mathcal{X}}$ . We view the parameters of the network, denoted $a$ , as a collection of $W$ real variables. We wish to bound

In other words, $K$ is the number of sign patterns that the neural network can output for the sequence of inputs $(x_{1},\ldots,x_{m})$ . We will prove geometric upper bounds for $K$ , which will imply upper bounds for $\Pi_{\operatorname{sgn}(\mathcal{F})}(m)$

For any partition ${\mathcal{S}}=\{P_{1},P_{2},\ldots,P_{N}\}$ of the parameter domain $\Re^{W}$ , clearly we have

We choose the partition in such a way that within each region $P_{i}$ , the functions $f(x_{j},\cdot)$ are all fixed polynomials of bounded degree, so that each term in this sum can be bounded via Lemma 15.

The partition is constructed iteratively layer by layer, through a sequence ${\mathcal{S}}_{0},{\mathcal{S}}_{1},{\mathcal{S}}_{2},\ldots,{\mathcal{S}}_{L-1}$ of successive refinements, with the following properties:

We have $|{\mathcal{S}}_{0}|=1$ and, for each $n\in[L-1]$ ,

For each $n\in\{0,\dots,L-1\}$ , each element $S$ of ${\mathcal{S}}_{n-1}$ , each $j\in[m]$ , and each unit $u$ in the $n$ th layer, when $a$ varies in $S$ , the net input to $u$ is a fixed polynomial function in $W_{n}$ variables of $a$ , of total degree no more than $1+(n-1)d^{n-1}$ (this polynomial may depend on $S$ , $j$ and $u$ ).

We may define ${\mathcal{S}}_{0}=\Re^{W}$ , which satisfies property 2 above, since the input to any node in layer 1 is of the form $w^{T}x_{j}+b$ , which is an affine function of $w,b$ .

Now suppose that ${\mathcal{S}}_{0},\dots,{\mathcal{S}}_{n-1}$ have been defined, and we want to define ${\mathcal{S}}_{n}$ . For any $h\in[k_{n}],j\in[m],$ and $S\in{\mathcal{S}}_{n-1}$ , let $p_{h,x_{j},S}(a)$ denote the function describing the net input of the $h$ -th unit in the $n$ -th layer, in response to $x_{j}$ , when $a\in S$ . By the induction hypothesis this is a polynomial with total degree no more than $1+(n-1)d^{n-1}$ , and depends on at most $W_{n}$ many variables.

Let $\{t_{1},\dots,t_{p}\}$ denote the set of breakpoints of the activation function. For any fixed $S\in{\mathcal{S}}_{n-1}$ , by Lemma 15, the collection of polynomials

distinct sign patterns when $a\in\Re^{W}$ . Thus, one can partition $\Re^{W}$ into this many regions, such that all these polynomials have the same signs within each region. We intersect all these regions with $S$ to obtain a partition of $S$ into at most $\Pi$ subregions. Performing this for all $S\in{\mathcal{S}}_{n-1}$ gives our desired partition ${\mathcal{S}}_{n}$ . Thus, the required property 1 (inequality (6)) is clearly satisfied.

Fix some $S^{\prime}\in{\mathcal{S}}_{n}$ . Notice that, when $a$ varies in $S^{\prime}$ , all the polynomials

have the same sign, hence the input of each $n$ th layer unit lies between two breakpoints of the activation function, hence the output of each $n$ th layer unit in response to an $x_{j}$ is a fixed polynomial in $W_{n}$ variables of degree no more than $d(1+(n-1)d^{n-1})\leq nd^{n}$ . This implies that the input of every $(n+1)$ th layer unit in response to an $x_{j}$ is a fixed polynomial function of $W_{n+1}$ variables of degree no more than $1+nd^{n}$ . (When $d=0$ , this affine function depends only on the $W_{n+1}$ parameters in layer $n+1$ ; for $d>0$ , it is a polynomial function of all parameters up to layer $n+1$ .)

Proceeding in this way we obtain a partition ${\mathcal{S}}_{L-1}$ of $\Re^{W}$ such that for $S\in{\mathcal{S}}_{L-1}$ the network output in response to any $x_{j}$ is a fixed polynomial of $a\in S$ of degree no more than $1+(L-1)d^{L-1}$ (recall that the last node just outputs its input), and hence by Lemma 15 again,

On the other hand, applying (6) iteratively gives

and thus using (5), and since the points $x_{1},\ldots,x_{m}$ were chosen arbitrarily, we obtain

For the bound on the VC-dimension, from the third line in the formula above, and the definition of VC-dimension, we find

Notice that $U>2$ implies $2eR\geq 16$ , hence Lemma 16 below gives

Suppose that $2^{m}\leq 2^{t}(mr/w)^{w}$ for some $r\geq 16$ and $m\geq w\geq t\geq 0$ . Then, $m\leq t+w\log_{2}(2r\log_{2}r)$ .

Proof We would like to show that $2^{x}>2^{t}(xr/w)^{w}$ for all $x>t+w\log_{2}(2r\log_{2}r)\eqqcolon m$ . Let $f(x)\coloneqq x-t-w\log_{2}(xr/w)$ . To show that $f(x)>0$ for all $x>m$ , we need only show that $f(m)\geq 0$ and $f^{\prime}(x)>0$ for all $x\geq m$ . First, $f(m)\geq 0$ if and only if

which holds since $r\geq 16$ and $t/w\leq 1$ . Finally, for $x\geq m$ , we have $f^{\prime}(x)\geq 0$ if and only if

which holds since $r\geq 16$ implies $x\geq m\geq w\log_{2}(2r\log_{2}r)>w/\ln 2$ .

Proof of Theorem 8

The idea of the proof is that the sign of the output of a neural network can be expressed as a Boolean formula where each predicate is a polynomial inequality. For example, consider the following toy network, where the activation function of the hidden units is a ReLU.

The sign of the output of the network is $\operatorname{sgn}(y)=\operatorname{sgn}(w_{3}\sigma(w_{1}x)+w_{4}\sigma(w_{2}x))$ . Define the following Boolean predicates: $p_{1}=(w_{1}x>0)$ , $p_{2}=(w_{2}x>0)$ , $q_{1}=(w_{3}w_{1}x>0)$ , $q_{2}=(w_{4}w_{2}x>0)$ , and $q_{3}=(w_{3}w_{1}x+w_{4}w_{2}x>0)$ . Then, we can write

A theorem of Goldberg and Jerrum states that any class of functions that can be expressed using a relatively small number of distinct polynomial inequalities has small VC-dimension.

Let $k,n$ be positive integers and $f\colon\Re^{n}\times\Re^{k}\to\{0,1\}$ be a function that can be expressed as a Boolean formula containing $s$ distinct atomic predicates where each atomic predicate is a polynomial inequality or equality in $k+n$ variables of degree at most $d$ . Let $\mathcal{F}=\{f(\cdot,w):w\in\Re^{k}\}$ . Then $\operatorname{VCDim}(\mathcal{F})\leq 2k\log_{2}(8eds)$ .

Proof [of Theorem 8]. Consider a neural network with $W$ weights and $U$ computation units, and assume that the activation function $\psi$ is piecewise polynomial of degree at most $d$ with $p$ pieces. To apply Theorem 17, we will express the sign of the output of the network as a Boolean function consisting of less than $2(1+p)^{U}$ atomic predicates, each being a polynomial inequality of degree at most $\max\{U+1,2d^{U}\}$ .

Since the neural network graph is acyclic, it can be topologically sorted. For $i\in[U]$ , let $u_{i}$ denote the $i$ th computation unit in the topological ordering. The input to each computation unit $u$ lies in one of the $p$ pieces of $\psi$ . For $i\in[U]$ and $j\in[p]$ , we say “ $u_{i}$ is in state $j$ ” if the input to $u_{i}$ lies in the $j$ th piece.

For $u_{1}$ and any $j$ , the predicate “ $u_{1}$ is in state $j$ ” is a single atomic predicate which is the quadratic inequality indicating whether its input lies in the corresponding interval. So, the state of $u_{1}$ can be expressed as a function of $p$ atomic predicates. Conditioned on $u_{1}$ being in a certain state, the state of $u_{2}$ can be determined using $p$ atomic predicates, which are polynomial inequalities of degree at most $2d+1$ . Consequently, the state of $u_{2}$ can be determined using $p+p^{2}$ atomic predicates, each of which is a polynomial of degree at most $2d+1$ . Continuing similarly, we obtain that for each $i$ , the state of $u_{i}$ can be determined using $p(1+p)^{i-1}$ atomic predicates, each of which is a polynomial of degree at most $d^{i-1}+\sum_{j=0}^{i-1}d^{j}$ . Consequently, the state of all nodes can be determined using less than $(1+p)^{U}$ atomic predicates, each of which is a polynomial of degree at most $d^{U-1}+\sum_{j=0}^{U-1}d^{j}\leq\max\{U+1,2d^{U}\}$ (the output unit is linear). Conditioned on all nodes being in certain states, the sign of the output can be determined using one more atomic predicate, which is a polynomial inequality of degree at most $\max\{U+1,2d^{U}\}$ .

In total, we have less than $2(1+p)^{U}$ atomic polynomial-inequality predicates and each polynomial has degree at most $\max\{U+1,2d^{U}\}$ . Thus, by Theorem 17, we get an upper bound of $2W\log(16e\cdot\max\{U+1,2d^{U}\}\cdot(1+p)^{U})=O(WU\log((1+d)p))$ for the VC-dimension.

Christopher Liaw is supported by an NSERC graduate scholarship. Abbas Mehrabian is supported by an NSERC Postdoctoral Fellowship and a Simons-Berkeley Research Fellowship. Peter Bartlett gratefully acknowledges the support of the NSF through grant IIS-1619362 and of the Australian Research Council through an Australian Laureate Fellowship (FL110100281) and through the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS). Part of this work was done while Peter Bartlett and Abbas Mehrabian were visiting the Simons Institute for the Theory of Computing at UC Berkeley.