Norm-Based Capacity Control in Neural Networks

Behnam Neyshabur, Ryota Tomioka, Nathan Srebro

Introduction

The statistical complexity, or capacity, of unregularized feed-forward neural networks, as a function of the network size and depth, is fairly well understood. With hard-threshold activations, the VC-dimension, and hence sample complexity, of the class of functions realizable with a feed-forward network is equal, up to logarithmic factors, to the number of edges in the network (Anthony and Bartlett, 2009; Shalev-Shwartz and Ben-David, 2014), corresponding to the number of parameters. With continuous activation functions the VC-dimension could be higher, but is fairly well understood and is still controlled by the size and depth of the network.Using weights with very high precision and vastly different magnitudes it is possible to shatter a number of points quadratic in the number of edges when activations such as the sigmoid, ramp or hinge are used (Shalev-Shwartz and Ben-David, 2014, Chapter 20.4). But even with such activations, the VC dimension can still be bounded by the size and depth (Bartlett, 1998; Anthony and Bartlett, 2009; Shalev-Shwartz and Ben-David, 2014).

But feedforward networks are often trained with some kind of explicit or implicit regularization, such as weight decay, early stopping, “max regularization”, or more exotic regularization such as drop-outs. What is the effect of such regularization on the induced hypothesis class and its capacity?

A central question we ask is: can we bound the capacity of feed-forward network in terms of norm-based regularization alone, without relying on network size and even if the network size (number of nodes or edges) is unbounded or infinite? What type of regularizers admit such capacity control? And how does the capacity behave as a function of the norm, and perhaps other network parameters such as depth?

Beyond the central question of capacity control, we also analyze the convexity of the resulting hypothesis class—unlike unregularized size-controlled feed-forward networks, infinite magnitude-controlled networks have the potential of yielding convex hypothesis classes (this is the case, e.g., when we move from rank-based control on matrices, which limits the number of parameters to magnitude based control with the trace-norm or max-norm). A convex class might be easier to optimize over and might be convenient in other ways.

Preliminaries: Feedforward Neural Networks

We will refer to the size of the network, which is the overall number of edges $\left\lvert{E}\right\rvert$ , the depth $d$ of the network, which is the length of the longest directed path in $G$ , and the in-degree (or width) $H$ of a network, which is the maximum in-degree of a vertex in $G$ .

A special case of feedforward neural networks are layered fully connected networks where vertices are partitioned into layers and there is a directed edge from every vertex in layer $i$ to every vertex in layer $i+1$ . We index the layers from the first layer, $i=1$ whose inputs are the input nodes, up to the last layer $i=d$ which contains the single output node—the number of layers is thus equal to the depth and the in-degree is the maximal layer size. We denote by $\text{layer}(d,H)$ the layered fully connected network with $d$ layers and $H$ nodes per layer (except the output layer that has a single node), and also allow $H=\infty$ . We will also use the shorthand $\mathcal{N}^{d,H,\sigma}=\mathcal{N}^{\text{layer}(d,H),\sigma}$ and $\mathcal{N}^{d,\sigma}=\mathcal{N}^{\text{layer}(d,\infty),\sigma}$ .

We will focus mostly on the hinge, or RELU (REctified Linear Unit) activation, which is currently in popular use (Nair and Hinton, 2010; Bordes and Bengio, 2011; Zeiler et al., 2013), $\sigma_{\textsc{relu}}(z)=[z]_{+}=\max(z,0)$ . When the activation will not be specified, we will implicitly be referring to the RELU. The RELU has several convenient properties which we will exploit, some of them shared with other activation functions:

The hinge is Lipschitz continuous with Lipshitz constant one. This property is also shared by the sigmoid and the ramp activation $\sigma(z)=\min(\max(0,z),1)$ .

The hinge is idempotent, i.e. $\sigma_{\textsc{relu}}(\sigma_{\textsc{relu}}(z))=\sigma_{\textsc{relu}}(z)$ . This property is also shared by the ramp and hard threshold activations.

We will consider various measures $\alpha(w)$ of the magnitude of the weights $w(\cdot)$ . Such a measure induces a complexity measure on functions $f\in\mathcal{N}^{G,\sigma}$ defined by $\alpha^{G,\sigma}(f)=\inf_{f_{G,w,\sigma}=f}\alpha(w)$ . The sublevel sets of the complexity measure $\alpha^{G,\sigma}$ form a family of hypothesis classes $\mathcal{N}^{G,\sigma}_{\alpha\leq a}=\{f\in\mathcal{N}^{G,\sigma}\;|\;\alpha^{G,\sigma}(f)\leq a\}$ . Again we will use the shorthand $\alpha^{d,H,\sigma}$ and $\alpha^{d,\sigma}$ when referring to layered graphs $\text{layer}(d,H)$ and $\text{layer}(d,\infty)$ respectively, and frequently drop $\sigma$ when RELU is implicitly meant.

For binary function $g:\{\pm 1\}^{D}\rightarrow{\pm 1}$ we say that $g$ is realized by $f$ with unit margin if $\forall_{x}f(x)g(x)\geq 1$ . A set of points $S$ is shattered with unit margin by a hypothesis class $\mathcal{N}$ if all $g:S\rightarrow{\pm 1}$ can be realized with unit margin by some $f\in\mathcal{N}$ .

Group Norm Regularization

Considering the grouping of weights going into each edge of the network, we will consider the following generic group-norm type regularizer, parametrized by $1\leq p,q\leq\infty$ :

Here and elsewhere we allow $q=\infty$ with the usual conventions that $(\sum z_{i}^{q})^{1/q}=\sup z_{i}$ and $1/q=0$ when it appears in other contexts. When $q=\infty$ the group regularizer (3) imposes a per-unit regularization, where we constrain the norm of the incoming weights of each unit separately, and when $q=p$ the regularizer (3) is an “overall” weight regularizer, constraining the overall norm of all weights in the system. E.g., when $q=p=1$ we are paying for the sum of all magnitudes of weights in the network, and $q=p=2$ corresponds to overall weight-decay where we pay for the sum of square magnitudes of all weights (i.e. the overall Euclidean norm of the weights).

where $\displaystyle\gamma_{p,q}(W)=\prod_{k=1}^{d}\left\lVert{W_{k}}\right\rVert_{p,q}$ aggregates the layers by multiplication instead of summation. The inequality (4) holds regardless of the activation function, and so for any $\sigma$ we have:

But due to the homogeneity of the RELU activation, when this activation is used we can always balance the norm between the different layers without changing the computed function so as to achieve equality in (4):

For any $f\in\mathcal{N}^{d,H,\sigma_{\textsc{relu}}}$ , $\displaystyle\mu^{d,H,\sigma_{\textsc{relu}}}_{p,q}(f)=d^{1/q}\sqrt[d]{\gamma_{p,q}^{d,H,\sigma_{\textsc{relu}}}(f)}$ .

Let $W$ be weights that realizes $f$ and are optimal with respect to $\gamma_{p,g}$ ; i.e. $\gamma_{p,q}(W)=\gamma^{d,H}_{p,q}(f)$ . Let $\widetilde{W}_{k}=\sqrt[d]{\gamma_{p,q}(W)}W_{k}/\left\lVert{W_{k}}\right\rVert_{p,q}$ , and observe that they also realize $f$ . We now have:

which together with (4) completes the proof.

The two measures are therefore equivalent when we use RELUs, and define the same level sets, or family of hypothesis classes, which we refer to simply as $\mathcal{N}^{d,H}_{p,q}$ . In the remainder of this Section, we investigate convexity and generalization properties of these hypothesis classes.

In order to understand the effect of the norm on the sample complexity, we bound the Rademacher complexity of the classes $\mathcal{N}^{d,H}_{p,q}$ . Recall that the Rademacher Complexity is a measure of the capacity of a hypothesis class on a specific sample, which can be used to bound the difference between empirical and expected error, and thus the excess generalization error of empirical risk minimization (see, e.g., Bartlett and Mendelson (2003) for a complete treatment, and Appendix A for the exact definitions we use). In particular, the Rademacher complexity typically scales as $\sqrt{C/m}$ , which corresponds to a sample complexity of $O(C/\epsilon^{2})$ , where $m$ is the sample size and $C$ is the effective measure of capacity of the hypothesis class.

We prove the bound by induction, showing that for any $q,d>1$ and $1\leq p<\infty$ ,

The intuition is that when $p^{*}<q$ , the Rademacher complexity increases by simply distributing the weights among neurons and if $p^{*}\geq q$ then the supremum is attained when the output neuron is connected to a neuron with highest Rademacher complexity in the lower layer and all other weights in the top layer are set to zero. For a complete proof, see Appendix A. \BlackBox

Note that for $2\leq p<\infty$ , the bound on the Rademacher complexity scales with $m^{\frac{1}{p}}$ (see section A.1 in appendix) because:

2 Tightness

We next investigate the tightness of the complexity bound in Theorem 3.2, and show that when $1/p+1/q<1$ the dependence on the width $H$ is indeed unavoidable. We show not only that the bound on the Rademacher complexity is tight, but that the implied bound on the sample complexity is tight, even for binary classification with a margin over binary inputs. To do this, we show how we can shatter the $m=2^{D}$ points $\{\pm 1\}^{D}$ using a network with small group-norm:

For any $p,q\geq 1$ (and $1/p^{*}+1/p=1$ ) and any depth $d\geq 2$ , the $m=2^{D}$ points $\{\pm 1\}^{D}$ can be shattered with unit margin by $\mathcal{N}^{d,H}_{\gamma_{p,q}\leq\gamma}$ with:

Consider a size $m$ subset $S_{m}$ of $2^{D}$ vertices of the $D$ dimensional hypercube $\{-1,+1\}^{D}$ . We construct the first layer using $m$ units. Each unit has a unique weight vector consisting of $+1$ and $-1$ ’s and will output a positive value if and only if the sign pattern of the input $x\in S_{m}$ matches that of the weight vector. The second layer has a single unit and connects to all $m$ units in the first layer. For any $m$ dimensional sign pattern $b\in\{-1,+1\}^{m}$ , we can choose the weights of the second layer to be $b$ , and the network will output the desired sign for each $x\in S_{m}$ with unit margin. The norm of the network is at most $(m\cdot D^{q/p})^{1/q}\cdot m^{1/p}=D^{1/p}\cdot m^{(1/p+1/q)}.$ This establishes the claim for $d=2$ . For $d>2$ and $1/p+1/q\geq 1$ , we obtain the same norm and unit margin by adding $d-2$ layers with one unit in each layer connected to the previous layer by a unit weight. For $d>2$ and $1/p+1/q<1$ , we show the dependence on $H$ by recursively replacing the top unit with $H$ copies of it and adding an averaging unit on top of that. More specifically, given the above $d=2$ layer network, we make $H$ copies of the output unit with rectified linear activation and add a 3rd layer with one output unit with uniform weight $1/H$ to all the copies in the 2nd layer. Since this operation does not change the output of the network, we have the same margin and now the norm of the network is $(m\cdot D^{q/p})^{1/q}\cdot(Hm^{q/p})^{1/q}\cdot(H(1/H^{p}))^{1/p}=D^{1/p}\cdot m^{(1/p+1/q)}\cdot H^{1/q-1/p^{*}}.$ That is, we have reduced the norm by factor $H^{1/q-1/p^{*}}$ . By repeating this process, we get the geometric reduction in the norm $H^{(d-2)(1/q-1/p^{*})}$ , which concludes the proof.

Next we consider the dependence on the width $H$ when $1/p+1/q<1$ . Here we have to use depth $d\geq 3$ , and we see that indeed as the width $H$ and depth $d$ increase, the magnitude control $\gamma$ can decrease as $H^{(1/p^{*}-1/q)(d-2)}$ without decreasing the capacity, matching Theorem 1 up to an offset of 2 on the depth. In particular, we see that in this regime we can shatter an arbitrarily large number of points with arbitrarily low $\gamma$ by using enough hidden units, and so the capacity of $\mathcal{N}^{d}_{p,q}$ is indeed infinite and it cannot ensure any generalization.

3 Convexity

Finally we establish a sufficient condition for the hypothesis classes $\mathcal{N}^{d}_{p,q}$ to be convex. We are referring to convexity of the functions in the $\mathcal{N}^{d}_{p,q}$ independent of a specific representation. If we consider a, possibly regularized, empirical risk minimization problem on the weights, the objective (the empirical risk) would never be a convex function of the weights (for depth $d\geq 2$ ), even if the regularizer is convex in $w$ (which it always is for $p,q\geq 1$ ). But if we do not bound the width of the network, and instead rely on magnitude-control alone, we will see that the resulting hypothesis class, and indeed the complexity measure, may be convex (with respect to taking convex combinations of functions, not of weights).

For any $d,p,q\geq 1$ such that $\frac{1}{q}\leq\frac{1}{d-1}\big{(}1-\frac{1}{p}\big{)}$ , $\gamma^{d}_{p,q}(f)$ is a semi-norm in $\mathcal{N}^{d}$ .

In particular, under the condition of the Theorem, $\gamma^{d}_{p,q}$ is convex, and hence its sublevel sets $\mathcal{N}^{d}_{p,q}$ are convex, and so $\mu^{d}_{p,q}$ is quasi-convex (but not convex).

To show convexity, consider two functions $f,g\in\mathcal{N}^{d}_{\gamma_{p,q}\leq\gamma}$ and $0<\alpha<1$ , and let $U$ and $V$ be the weights realizing $f$ and $g$ respectively with $\gamma_{p,q}(U)\leq\gamma$ and $\gamma_{p,q}(V)\leq\gamma$ . We will construct weights $W$ realizing $\alpha f+(1-\alpha)g$ with $\gamma_{p,q}(W)\leq\gamma$ . This is done by first balancing $U$ and $V$ s.t. at each layer $\left\lVert{U_{i}}\right\rVert_{p,q}=\sqrt[d]{\gamma_{p,q}(U)}$ and $\left\lVert{V_{i}}\right\rVert_{p,q}=\sqrt[d]{\gamma_{p,q,}(V)}$ and then placing $U$ and $V$ side by side, with no interaction between the units calculating $f$ and $g$ until the output layer. The output unit has weights $\alpha U_{d}$ coming in from the $f$ -side and weights $(1-\alpha)V_{d}$ coming in from the $g$ -side. In Appendix B we show that under the condition in the theorem, $\gamma_{p,q}(W)\leq\gamma$ . To complete the proof, we also show $\gamma^{d}_{p,q}$ is homogeneous and that this is sufficient for convexity. \BlackBox

Per-Unit and Path Regularization

In this Section we will focus on the special case of $q=\infty$ , i.e. when we constrain the norm of the incoming weights of each unit separately.

Consider a regularizer which looks at the sum over all paths from input nodes to the output node, of the product of the weights along the path:

where $p\geq 1$ controls the norm used to aggregate the paths. We can motivate this regularizer as follows: if a node does not have any high-weight paths going out of it, we really don’t care much about what comes into it, as it won’t have much effect on the output. The path-regularizer thus looks at the aggregated influence of all the weights.

Referring to the induced regularizer $\phi^{G}_{p}(f)=\min_{f_{G,w}=f}\phi_{p}(w)$ (with the usual shorthands for layered graphs), we now observe that for layered graphs, path regularization and per-unit regularization are equivalent:

For $p\geq 1$ , any $d$ and (finite or infinite) $H$ , for any $f\in\mathcal{N}^{d,H}$ : $\phi_{p}^{d,H}(f)=\gamma^{d,H}_{p,\infty}$

It is important to emphasize that even for layered graphs, it is not the case that for all weights $\phi_{p}(w)=\gamma_{p,\infty}(w)$ . E.g., a high-magnitude edge going into a unit with no non-zero outgoing edges will affect $\gamma_{p,\infty}(w)$ but not $\phi_{p}(w)$ , as will having high-magnitude edges on different layers in different paths. In a sense path regularization is as more careful regularizer less fooled by imbalance. Nevertheless, in the proof of Theorem 4.1 in Appendix C.1, we show we can always balance the weights such that the two measures are equal.

The equivalence does not extend to non-layered graphs, since the lengths of different paths might be different. Again, we can think of path regularizer as more refined regularizer taking into account the local structure. However, if we consider all DAGs of depth at most $d$ (i.e. with paths of length at most $d$ ), the notions are again equivalent (see proof in Appendix C.2):

For any $p\geq 1$ and any $d$ : $\displaystyle\gamma^{d}_{p,\infty}(f)=\min_{\textrm{$ G\in\text{DAG}(d) $}}\phi^{G}_{p}(f)$ .

In particular, for any graph $G$ of depth $d$ , we have that $\phi^{G}_{p}(f)\geq\gamma^{d}_{p,\infty}(f)$ . Combining this observation with Corollary 3.3 allows us to immediately obtain a generalization bound for path regularization on any, even non-layered, graph:

Note that in order to apply Corollary 3.3 and obtain a width-independent bound, we had to limit ourselves to $p=1$ . We further explore this issue next.

Capacity

Convexity

An immediately consequence of Theorem 3.6 is that per-unit regularization, if we do not constrain the network width, is convex for any $p\geq 1$ . In fact, $\gamma^{d}_{p,\infty}$ is a (semi)norm. However, as discussed above, for depth $d>2$ this is meaningful only for $p=1$ , as $\gamma^{d}_{p,\infty}$ collapses for $p>1$ .

Hardness

Since the classes $\mathcal{N}^{d}_{1,\infty}$ are convex, we might hope that this might make learning computationally easier. Indeed, one can consider functional-gradient or boosting-type strategies for learning a predictor in the class (Lee et al., 1996). However, as Bach (2014) points out, this is not so easy as it requires finding the best fit for a target with a RELU unit, which is not easy. Indeed, applying results on hardness of learning intersections of halfspaces, which can be represented with small per-unit norm using two-layer networks, we can conclude that, subject to certain complexity assumptions, it is not possible to efficiently PAC learn $\mathcal{N}^{d}_{1,\infty}$ , even for depth $d=2$ when $\gamma_{1,\infty}$ increases superlinearly:

This is a corollary of Theorem D.1 in the Appendix D. Either versions of corollary 4.4 precludes the possibility of learning in time polynomial in $\gamma_{1,\infty}$ , though it still might be possible to learn in $\textrm{poly}(D)$ time when $\gamma_{1,\infty}$ is sublinear.

Sharing

We conclude this Section with an observation on the type of networks obtained by per-unit, or equivalently path, regularization.

For any $p\geq 1$ and $d>1$ and any $f\in\mathcal{N}^{d}$ , there exists a layered graph $G(V,E)$ of depth $d$ , such that $f\in\mathcal{N}^{G}$ and $\gamma^{G}_{p,\infty}(f)=\phi^{G}_{p}(f)=\gamma^{d}_{p,\infty}(f)$ , and the out-degree of every internal (non-input) node in $G$ is one. That is, the subgraph of $G$ induced by the non-input vertices is a tree directed toward the output vertex.

What the Theorem tells us is that we can realize every function as a tree with optimal per-unit norm. If we think of learning with an infinite fully-connected layered network, we can always restrict ourselves to models in which the non-zero-weight edges form a tree. This means that when using per-unit regularization we have no incentive to “share” lower-level units—each unit will only have a single outgoing edge and will only be used by a single down-stream unit. This seems to defy much of the intuition and power of using deep networks, where we expect lower layers to represent generic feature useful in many higher-level features. In effect, we are not encouraging any transfer between learning different aspects of the function (or between different tasks or classes, if we do have multiple output units). Per-unit regularization therefore misses out on much of the inductive bias that we might like to impose when using deep learning (namely, promoting sharing).

For any $f_{G,w}\in\mathcal{N}^{\text{DAG}(d)}$ , we show how to construct such $\widetilde{G}$ and $\widetilde{w}$ . We first sort the vertices of $G$ based on topological ordering such that the out-degree of the first vertex is zero. Let $G_{0}=G$ and $w_{0}=w$ . At each step $i$ , we first set $G_{i}=G_{i-1}$ and $w_{i}=w_{i-1}$ and then pick the vertex $u$ that is the $i$ th vector in the topological ordering. If the out-degree of $u$ is at most 1. Otherwise, for any edge $(u\rightarrow v)$ we create a copy of vertex $u$ that we call it $u_{v}$ , add the edge $(u_{v}\rightarrow v)$ to $G_{i}$ and connect all incoming edges of $u$ with the same weights to every such $u_{v}$ and finally we delete the vertex $u$ from $G_{i}$ together with all incoming and outgoing edges of $u$ . It is easy to indicate that $f_{G_{i},w_{i}}=f_{G_{i-1},w_{i-1}}$ . After at most $|V|$ such steps, all internal nodes have out-degree one and hence the subgraph induced by non-input vertices will be a tree.

Overall Regularization

Convexity

This is similar to per-unit regularization, except we impose different norms at different layers (if $p\not=1$ ). We can see that $\mathcal{N}^{2}_{\nu_{p}\leq\nu}=\nu\cdot\overline{\operatorname{conv}}(\sigma(\mathcal{G}))$ , and is thus convex for any $p$ . Focusing on RELU activation we have the equivalence:

$\displaystyle\mu^{2}_{2,2}(f)=2\nu_{2}(f).$

Here (9) is the arithmetic-geometric mean inequality for which we can achieve equality by balancing the weights (as in Claim 1) and (10) again follows from the homogeneity of the RELU which allows us to rebalance the weights.

Hardness

As with $\mathcal{N}^{d}_{1,\infty}$ , we might hope that the convexity of $\mathcal{N}^{2}_{2,2}$ might make it computationally easy to learn. However, by the same reduction from learning intersection of halfspaces (Theorem D.1 in Appendix D) we can again conclude that we cannot learn in time polynomial in $\mu^{2}_{2,2}$ :

Depth Independent Regularization

Up until now we discussed relying on magnitude-based regularization instead of directly controlling network size, thus allowing unbounded and even infinite width. But we still relied on a finite bound on the depth in all our derivations. Can the explicit dependence on the depth be avoided, and replaced with only a measure of scale of the weights?

We already know we cannot rely only on a bound on the group-norm $\mu_{p,q}$ when the depth is unbounded, as we know from Theorem 3.4 that in terms of $\mu_{p,q}$ the sample complexity necessarily increases exponentially with the depth: if we allow arbitrarily deep graphs we can shrink $\mu_{p,q}$ toward zero without changing the scale of the computed function. However, controlling the $\gamma$ -measure, or equivalently the path-regularizer $\phi$ , in arbitrarily-deep graphs is sensible, and we can define:

where the minimization is over any DAG. From Theorem 4.2 we can conclude that $\phi_{p}(f)=\gamma_{p,\infty}(f)$ . In any case, $\gamma_{p,q}(f)$ is a sensible complexity measure, that does not collapse despite the unbounded depth. Can we obtain generalization guarantees for the class $\mathcal{N}_{\gamma_{p,q}\leq\gamma}$ ?

Unfortunately, even when $1/p+1/q\geq 1$ and we can obtain width-independent bounds, the bound in Corollary 3.3 still has a dependence on $4^{d}$ , even if $\gamma_{p,q}$ is bounded. Can such a dependence be avoided?

The proof is again based on an inductive argument similar to Theorem 3.2 and you can find it in appendix A.4.

However, the ramp is not homogeneous and so the equivalent between $\mu$ , $\gamma$ and $\phi$ breaks down. Can we obtain such a bound also for the RELU? At the very least, what we can say is that an inductive argument such that used in the proofs of Theorems 3.2 and 6.1 cannot be used to avoid an exponential dependence on the depth. To see this, consider $\gamma_{1,\infty}\leq 1$ (this choice is arbitrary if we are considering the Rademacher complexity), for which we have

where $\overline{\operatorname{conv}}(\cdot)$ is the symmetric convex hull, and $[\cdot]_{+}=\max(z,0)$ is applied to each function in the class. In order to apply the inductive argument without increasing the complexity exponentially with the depth, we would need the operation $[\overline{\operatorname{conv}}(\mathcal{H})]_{+}$ to preserve the Rademacher complexity, at least for non-negative convex cones $\mathcal{H}$ . However we show a simple example of a non-negative convex cone $\mathcal{H}$ for which $\mathcal{R}_{m}\left([\overline{\operatorname{conv}}(\mathcal{H})]_{+}\right)>\mathcal{R}_{m}\left(\mathcal{H}\right)$ .

Summary and Open Issues

Although the additive $\mu$ -measure and multiplication $\gamma$ -measure are equivalent at the optimum, they behave rather differently in terms of optimization dynamics (based on anecdotal empirical experience) and understanding the relationship between them, as well as the novel path-based regularizer can be helpful in practical regularization of neural networks.

Beyond the open issue regarding depth-independent $\gamma$ -based capacity control, another interesting open question is understanding the expressive power of $\mathcal{N}^{d}_{\gamma_{p,q}\leq\gamma}$ , particularly as a function of the depth $d$ . Clearly going from depth $d=1$ to depth $d=2$ provides additional expressive power, but it is not clear how much additional depth helps. The class $\mathcal{N}^{2}$ already includes all binary functions over $\{\pm 1\}^{D}$ and is dense among continuous real-valued functions. But can the $\gamma$ -measure be reduced by increasing the depth? Viewed differently: $\gamma^{d}_{p,q}(f)$ is monotonically non-increasing in $d$ , but are there functions for it continues decreasing? Although it seems obvious there are functions that require high depth for efficient representation, these questions are related to decade-old problems in circuit complexity and might not be easy to resolve.

This research was partially supported by NSF grant IIS-1302662 and an Intel ICRI-CI award. We thank the COLT anonymous reviewers for pointing out an error in the statement of Lemma A.3 and suggesting other corrections.

References

Appendix A Rademacher Complexities

In this section, we prove an upper bound for the Rademacher complexity of the class $\mathcal{N}_{\gamma_{p,q}\leq\gamma}^{d,H,\sigma_{\text{RELU}}}$ , i.e., the class of functions that can be represented as depth $d$ , width $H$ network with rectified linear activations, and the layer-wise group norm complexity $\gamma_{p,q}$ bounded by $\gamma$ . As mentioned in the main text, our proof is an induction with respect to the depth $d$ . We start with $d=1$ layer neural networks, which is essentially the class of linear separators.

(Khintchine-Kahane Inequality) For any $0<p<\infty$ and $S=\{z_{1},\dots,z_{m}\}$ , if the random variable $\xi$ is uniform over $\{\pm 1\}^{m}$ , then

where $C_{p}$ is a constant depending only on $p$ .

The sharp value of the constant $C_{p}$ was found by Haagerup (1981) but for our analysis, it is enough to note that if $p\geq 1$ we have $C_{p}\leq\sqrt{p}$ .

(Massart Lemma) Let $A$ be a finite set of $m$ dimensional vectors. Then

where $p^{*}$ is such that $\frac{1}{p^{*}}+\frac{1}{p}=1$ .

First, note that $\mathcal{N}^{1}$ is the class of linear functions and hence for any function $f_{w}\in\mathcal{N}^{1}$ , we have that $\gamma_{p,q}(w)=\left\lVert{w}\right\rVert_{p}$ . Therefore, we can write the Rademacher complexity for a set $S=\{x_{1},\dots,x_{m}\}$ as:

For $1\leq p\leq\min\left\{2,\frac{2\log(2D)}{2\log(2D)-1}\right\}$ (and therefore $2\log(2D)\leq p^{*}$ ), we have

We now use the Massart Lemma viewing each feature $(x_{i}[j])_{i=1}^{m}$ for $j=1,\ldots,D$ as a member of a finite hypothesis class and obtain

If $\min\left\{2,\frac{2\log(2D)}{2\log(2D)-1}\right\}<p<\infty$ , by Khintchine-Kahane inequality we have

If $p^{*}\geq 2$ , by Minskowski inequality we have that $\left\lVert{X}\right\rVert_{2,p^{*}}\leq m^{1/2}\max_{i}\left\lVert{x_{i}}\right\rVert_{p^{*}}$ . Otherwise, by subadditivity of the function $f(z)=z^{\frac{p^{*}}{2}}$ , we get $\left\lVert{X}\right\rVert_{2,p^{*}}\leq m^{1/{p^{*}}}\max_{i}\left\lVert{x_{i}}\right\rVert_{p^{*}}$ .

A.2 Theorem 3.2

For the proof of theorem 3.2, we need the following two technical lemmas. The first is the well-known contraction lemma:

For any $p,q\geq 1$ , $d\geq 2$ , $\xi\in\{\pm 1\}^{m}$ and $f\in\mathcal{N}^{d,H,H}$ we have

where $p^{*}$ is such that $\frac{1}{p^{*}}+\frac{1}{p}=1$ .

By the definition of Rademacher complexity if $\xi$ is uniform over $\{\pm 1\}^{m}$ , we have:

where the equality (13) is obtained by lemma A.6 and inequality (14) is by Contraction Lemma. This will give us the bound on Rademacher complexity of $\mathcal{N}^{d,H}_{\gamma_{p,q}\leq\gamma}$ based on the Rademacher complexity of $\mathcal{N}^{d-1,H}_{\gamma_{p,q}\leq\gamma}$ . Applying the same argument on all layers and using lemma A.3 to bound the complexity of the first layer completes the proof.

A.3 Proof of Lemma A.6

It is immediate that the right hand side of the equality in the statement is always less than or equal to the left hand side because given any vector $w$ in the right hand side, by setting each row of matrix $W$ in the left hand side we get the equality. Therefore, it is enough to prove that the left hand side is less than or equal to the right hand side. For the convenience of notations, let $g(w)\triangleq|\sum_{i=1}^{m}\xi_{i}w^{\top}[f(x_{i})]_{+}|$ . Define $\widetilde{w}$ to be:

If $q\leq p^{*}$ , then the right hand side of equality in the lemma statement will reduce to $g(\widetilde{w})/\left\lVert{\widetilde{w}}\right\rVert_{p}$ and therefore we need to show that for any matrix $V$ ,

Since $q\leq p^{*}$ , we have $\left\lVert{V}\right\rVert_{p,{p^{*}}}\leq\left\lVert{V}\right\rVert_{p,q}$ and hence it is enough to prove the following inequality:

On the other hand, if $q>{p^{*}}$ , then we need to prove the following inequality holds:

Since $q>{p^{*}}$ , we have that $\left\lVert{V}\right\rVert_{p,{p^{*}}}\leq H^{\frac{1}{{p^{*}}}-\frac{1}{q}}\left\lVert{V}\right\rVert_{p,q}$ . Therefore, it is again enough to show that:

We can rewrite the above inequality in the following form:

By the definition of $\widetilde{w}$ , we know that the above inequality holds for each term in the sum and hence the inequality is true.

A.4 Theorem 6.1

Assuming $\xi$ is uniform over $\{\pm 1\}^{m}$ , we have:

where the equality (15) is by anti-symmetric property of $\sigma$ and inequality (16) is by the version of Contraction Lemma without the absolute value. This will give us the bound on Rademacher complexity of $\mathcal{N}^{d,H}_{\mu_{1,\infty}\leq\mu}$ based on the Rademacher complexity of $\mathcal{N}^{d-1,H}_{\mu_{1,\infty}\leq\mu}$ . Applying the same argument on all layers and using lemma A.3 to bound the complexity of the first layer completes the proof.

We repeat the statement here for convenience.

For any $d,p,q\geq 1$ such that $\frac{1}{q}\leq\frac{1}{d-1}\big{(}1-\frac{1}{p}\big{)}$ , $\gamma^{d}_{p,q}(f)$ is a semi-norm in $\mathcal{N}^{d}$ .

First we show that for any two functions $f_{1},f_{2}\in\mathcal{N}^{d}_{\gamma_{p,q}\leq\gamma}$ and $0\leq\alpha\leq 1$ , the function $g=\alpha f_{1}+(1-\alpha)f_{2}$ is in the hypothesis class $\mathcal{N}^{d}_{\gamma_{p,q}\leq\gamma}$ . We prove this by constructing weights $W$ that realizes $g$ . Let $U$ and $V$ be the weights of two neural networks such that $\gamma_{p,q}(U)=\gamma_{p,q}^{d}(f_{1})\leq\gamma$ and $\gamma_{p,q}(V)=\gamma_{p,q}^{d}(f_{2})\leq\gamma$ . For every layer $i=1,\ldots,d$ let

Then for the defined $W$ , we have $f_{W}=\alpha f_{1}+(1-\alpha)f_{2}$ for rectified linear and any other non-negative homogeneous activation function. Moreover, for any $i<d$ , the norm of each layer is

Combining inequalities (17) and (18), we get $\gamma^{d}_{p,q}(f_{W})\leq 2^{\frac{d-1}{q}+\frac{1}{p}}\gamma\leq\gamma,$ where the last inequality holds because we assume that $\frac{1}{q}\leq\frac{1}{d-1}\big{(}1-\frac{1}{p}\big{)}$ . Thus for every $\gamma\geq 0$ , $\mathcal{N}^{d}_{\gamma_{p,q}\leq\gamma}$ is a convex set.

Non-negative homogeneity

For any function $f\in\mathcal{N}^{d}$ and any $\alpha\geq 0$ , let $U$ be the weights realizing $f$ with $\gamma^{d}_{p,q}(f)=\gamma_{p,q}(U)$ . Then $\sqrt[d]{\alpha}U$ realizes $\alpha f$ establishing $\gamma^{d}_{p,q}(\alpha f)\leq\gamma_{p,q}(\sqrt[d]{\alpha}U)=\alpha\gamma_{p,q}(U)=\alpha\gamma^{d}_{p,q}(U)=\alpha\gamma_{p,q}^{d}(f)$ . This establishes the non-negative homogeneity of $\gamma_{p,q}^{d}$ .

Convex sublevel sets and homogeneity imply triangular inequality

Appendix C Path Regularization

For any function $f\in\mathcal{N}^{d,H}_{\gamma_{p,\infty}\leq\gamma}$ there is a layered network with weights $w$ such that $\gamma_{p,\infty}(w)=\gamma^{d,H}_{p,\infty}(f)$ and for any internal unit $v$ , $\sum_{(u\rightarrow v)\in E}|w(u\rightarrow v)|^{p}=1$ .

Let $w$ be the weights of a network such that $\gamma_{p,\infty}(w)=\gamma^{d,H}_{p,\infty}(f)$ . We now construct a network with weights $\widetilde{w}$ such that $\gamma_{p,\infty}(w)=\gamma^{d,H}_{p,\infty}(f)$ and for any internal unit $v$ , $\sum_{(u\rightarrow v)\in E}|\widetilde{w}(u\rightarrow v)|^{p}=1$ . We do this by an incremental algorithm. Let $w_{0}=w$ . At each step $i$ , we do the following.

For $p\geq 1$ , any $d$ and (finite or infinite) $H$ , for any $f\in\mathcal{N}^{d,H}$ : $\phi_{p}^{d,H}(f)=\gamma^{d,H}_{p,\infty}$ .

By the Lemma C.1, there is a layered network with weights $\widetilde{w}$ such that $\gamma_{p,\infty}(\widetilde{w})=\gamma^{d,H}_{p,\infty}(f)$ and for any internal unit $v$ , $\sum_{(u\rightarrow v)\in E}|\widetilde{w}(u\rightarrow v)|^{p}=1$ . Let $W$ be the weights of the layered network that corresponds to the function $\widetilde{w}$ . Then we have:

C.2 Proof of Theorem 4.2

In this section, without loss of generality, we assume that all the internal nodes in a DAG have incoming edges and outgoing edges because otherwise we can just discard them. Let $d_{\text{out}}(v)$ be the longest directed path from vertex $v$ to $v_{\textrm{out}}$ and $d_{\text{in}}(v)$ be the longest directed path from any input vertex $v_{\textrm{in}}[i]$ to $v$ . We say graph $G$ is a sublayered graph if $G$ is a subgraph of a layered graph.

We first show the necessary and sufficient conditions under which a DAG is a sublayered graph.

The graph $G(E,V)$ is a sublayered graph if and only if any path from input nodes to the output nodes has length $d$ where $d$ is the length of the longest path in $G$

Since the internal nodes have incoming edges and outgoing edges; hence if $G$ is a sublayered graph it is straightforward by induction on the layers that for every vertex $v$ in layer $i$ , there is a vertex $u$ in layer $i+1$ such that $(v\rightarrow u)\in E$ and this proves the necessary condition for being sublayered graph.

To show the sufficient condition, for any internal node $u$ , $u$ has $d_{\text{in}}(v)$ distance from the input node in every path that includes $u$ (otherwise we can build a path that is longer than $d$ ). Therefore, for each vertex $v\in V$ , we can place vertex $v$ in layer $d_{\text{in}}(v)$ and all the outgoing edges from $v$ will be to layer $d_{\text{in}}(v)+1$ .

If the graph $G(E,V)$ is not a sublayered graph then there exists a directed edge $(u\rightarrow v)$ such that $d_{\text{in}}(u)+d_{\text{out}}(v)<d-1$ where $d$ the length of the longest path in $G$ .

We prove the lemma by an inductive argument. If $G$ is not sublayered, by lemma C.4, we know that there exists a path $v_{0}\rightarrow\dots v_{i}\dots\rightarrow v_{d^{\prime}}$ where $v_{0}$ is an input node ( $d_{\text{in}}(v_{0})=0$ ), $v_{d^{\prime}}=v_{\textrm{out}}$ ( $d_{\text{out}}(v_{d^{\prime}}=0$ ) and $d^{\prime}<d$ . Now consider the vertex $v_{1}$ . We need to have $d_{\text{out}}(v_{1})=d-1$ otherwise if $d_{\text{out}}(v_{1})<d-1$ we get $d_{\text{in}}(u)+d_{\text{out}}(v)<d-1$ and if $d_{\text{out}}(v_{1})>d-1$ there will be path in $G$ that is longer than $d$ . Also, since $d_{\text{out}}(v_{1})=d-1$ and the longest path in $G$ has length $d$ , we have $d_{\text{in}}(v_{1})=1$ .

By applying the same inductive argument on each vertex $v_{i}$ in the path we get $d_{\text{in}}(v_{i})=i$ and $d_{\text{out}}(v_{i})=d-i$ . Note that if the condition $d_{\text{in}}(u)+d_{\text{out}}(v)<d-1$ is not satisfied in one of the steps of the inductive argument, the lemma is proved. Otherwise, we have $d_{\text{in}}(v_{d^{\prime}-1})=d^{\prime}-1$ and $d_{\text{out}}(v_{d^{\prime}-1})=d-d^{\prime}+1$ and therefore $d_{\text{in}}(v_{d^{\prime}-1})+d_{\text{out}}(v_{\textrm{out}})=d^{\prime}-1<d-1$ that proves the lemma.

For any $p\geq 1$ and any $d$ : $\displaystyle\gamma^{d}_{p,\infty}(f)=\min_{\textrm{$ G\in\text{DAG}(d) $}}\phi^{G}_{p}(f)$ .

Appendix D Hardness of Learning Neural Networks

If it is not possible to efficiently PAC learn intersection of halfspaces (even improperly), we can conclude it is also not possible to efficiently PAC learn any hypothesis class which can represent such intersection. In Theorem D.1 we show that intersection of homogeneous half spaces can be realized with unit margin by neural networks with bound norm.

For any $k>0$ , the intersection of $k$ homogeneous half spaces is realizable with unit margin by $\mathcal{N}^{2}_{\gamma_{p,q}\leq\gamma}$ where $\gamma=4D^{\frac{1}{p}}k^{2}$ .

The proof is by a construction that is similar to the one in Livni et al. (2014). For each hyperplane $\left\langle w_{i},x\right\rangle>0$ , where $w_{i}\in\{\pm 1\}^{D}$ , we include two units in the first layer: $g^{+}_{i}(x)=[\left\langle w_{i},x\right\rangle]_{+}$ and $g^{-}_{i}(x)=[\left\langle w_{i},x\right\rangle-1]_{+}$ . We set all incoming weights of the output node to be $1$ . Therefore, this network is realizing the following function:

Since all inputs and all weights are integer, the outputs of the first layer will be integer, $\left([\left\langle w_{i},x\right\rangle]_{+}-[\left\langle w_{i},x\right\rangle-1]_{+}\right)$ will be zero or one, and $f$ realizes the intersection of the $k$ halfspaces with unit margin. Now, we just need to make sure that $\gamma^{2}_{p,q}(f)$ is bounded by $\gamma=4D^{\frac{1}{p}}k^{2}$ :