Norm-Based Capacity Control in Neural Networks

Behnam Neyshabur, Ryota Tomioka, Nathan Srebro

Introduction

The statistical complexity, or capacity, of unregularized feed-forward neural networks, as a function of the network size and depth, is fairly well understood. With hard-threshold activations, the VC-dimension, and hence sample complexity, of the class of functions realizable with a feed-forward network is equal, up to logarithmic factors, to the number of edges in the network (Anthony and Bartlett, 2009; Shalev-Shwartz and Ben-David, 2014), corresponding to the number of parameters. With continuous activation functions the VC-dimension could be higher, but is fairly well understood and is still controlled by the size and depth of the network.Using weights with very high precision and vastly different magnitudes it is possible to shatter a number of points quadratic in the number of edges when activations such as the sigmoid, ramp or hinge are used (Shalev-Shwartz and Ben-David, 2014, Chapter 20.4). But even with such activations, the VC dimension can still be bounded by the size and depth (Bartlett, 1998; Anthony and Bartlett, 2009; Shalev-Shwartz and Ben-David, 2014).

But feedforward networks are often trained with some kind of explicit or implicit regularization, such as weight decay, early stopping, “max regularization”, or more exotic regularization such as drop-outs. What is the effect of such regularization on the induced hypothesis class and its capacity?

A central question we ask is: can we bound the capacity of feed-forward network in terms of norm-based regularization alone, without relying on network size and even if the network size (number of nodes or edges) is unbounded or infinite? What type of regularizers admit such capacity control? And how does the capacity behave as a function of the norm, and perhaps other network parameters such as depth?

Beyond the central question of capacity control, we also analyze the convexity of the resulting hypothesis class—unlike unregularized size-controlled feed-forward networks, infinite magnitude-controlled networks have the potential of yielding convex hypothesis classes (this is the case, e.g., when we move from rank-based control on matrices, which limits the number of parameters to magnitude based control with the trace-norm or max-norm). A convex class might be easier to optimize over and might be convenient in other ways.

Preliminaries: Feedforward Neural Networks

We will refer to the size of the network, which is the overall number of edges E\left\lvert{E}\right\rvert, the depth dd of the network, which is the length of the longest directed path in GG, and the in-degree (or width) HH of a network, which is the maximum in-degree of a vertex in GG.

A special case of feedforward neural networks are layered fully connected networks where vertices are partitioned into layers and there is a directed edge from every vertex in layer ii to every vertex in layer i+1i+1. We index the layers from the first layer, i=1i=1 whose inputs are the input nodes, up to the last layer i=di=d which contains the single output node—the number of layers is thus equal to the depth and the in-degree is the maximal layer size. We denote by layer(d,H)\text{layer}(d,H) the layered fully connected network with dd layers and HH nodes per layer (except the output layer that has a single node), and also allow H=H=\infty. We will also use the shorthand Nd,H,σ=Nlayer(d,H),σ\mathcal{N}^{d,H,\sigma}=\mathcal{N}^{\text{layer}(d,H),\sigma} and Nd,σ=Nlayer(d,),σ\mathcal{N}^{d,\sigma}=\mathcal{N}^{\text{layer}(d,\infty),\sigma}.

We will focus mostly on the hinge, or RELU (REctified Linear Unit) activation, which is currently in popular use (Nair and Hinton, 2010; Bordes and Bengio, 2011; Zeiler et al., 2013), σ\textscrelu(z)=[z]+=max(z,0)\sigma_{\textsc{relu}}(z)=[z]_{+}=\max(z,0). When the activation will not be specified, we will implicitly be referring to the RELU. The RELU has several convenient properties which we will exploit, some of them shared with other activation functions:

The hinge is Lipschitz continuous with Lipshitz constant one. This property is also shared by the sigmoid and the ramp activation σ(z)=min(max(0,z),1)\sigma(z)=\min(\max(0,z),1).

The hinge is idempotent, i.e. σ\textscrelu(σ\textscrelu(z))=σ\textscrelu(z)\sigma_{\textsc{relu}}(\sigma_{\textsc{relu}}(z))=\sigma_{\textsc{relu}}(z). This property is also shared by the ramp and hard threshold activations.

We will consider various measures α(w)\alpha(w) of the magnitude of the weights w()w(\cdot). Such a measure induces a complexity measure on functions fNG,σf\in\mathcal{N}^{G,\sigma} defined by αG,σ(f)=inffG,w,σ=fα(w)\alpha^{G,\sigma}(f)=\inf_{f_{G,w,\sigma}=f}\alpha(w). The sublevel sets of the complexity measure αG,σ\alpha^{G,\sigma} form a family of hypothesis classes NαaG,σ={fNG,σ    αG,σ(f)a}\mathcal{N}^{G,\sigma}_{\alpha\leq a}=\{f\in\mathcal{N}^{G,\sigma}\;|\;\alpha^{G,\sigma}(f)\leq a\}. Again we will use the shorthand αd,H,σ\alpha^{d,H,\sigma} and αd,σ\alpha^{d,\sigma} when referring to layered graphs layer(d,H)\text{layer}(d,H) and layer(d,)\text{layer}(d,\infty) respectively, and frequently drop σ\sigma when RELU is implicitly meant.

For binary function g:{±1}D±1g:\{\pm 1\}^{D}\rightarrow{\pm 1} we say that gg is realized by ff with unit margin if xf(x)g(x)1\forall_{x}f(x)g(x)\geq 1. A set of points SS is shattered with unit margin by a hypothesis class N\mathcal{N} if all g:S±1g:S\rightarrow{\pm 1} can be realized with unit margin by some fNf\in\mathcal{N}.

Group Norm Regularization

Considering the grouping of weights going into each edge of the network, we will consider the following generic group-norm type regularizer, parametrized by 1p,q1\leq p,q\leq\infty:

Here and elsewhere we allow q=q=\infty with the usual conventions that (ziq)1/q=supzi(\sum z_{i}^{q})^{1/q}=\sup z_{i} and 1/q=01/q=0 when it appears in other contexts. When q=q=\infty the group regularizer (3) imposes a per-unit regularization, where we constrain the norm of the incoming weights of each unit separately, and when q=pq=p the regularizer (3) is an “overall” weight regularizer, constraining the overall norm of all weights in the system. E.g., when q=p=1q=p=1 we are paying for the sum of all magnitudes of weights in the network, and q=p=2q=p=2 corresponds to overall weight-decay where we pay for the sum of square magnitudes of all weights (i.e. the overall Euclidean norm of the weights).

where γp,q(W)=k=1dWkp,q\displaystyle\gamma_{p,q}(W)=\prod_{k=1}^{d}\left\lVert{W_{k}}\right\rVert_{p,q} aggregates the layers by multiplication instead of summation. The inequality (4) holds regardless of the activation function, and so for any σ\sigma we have:

But due to the homogeneity of the RELU activation, when this activation is used we can always balance the norm between the different layers without changing the computed function so as to achieve equality in (4):

For any fNd,H,σ\textscreluf\in\mathcal{N}^{d,H,\sigma_{\textsc{relu}}}, μp,qd,H,σ\textscrelu(f)=d1/qγp,qd,H,σ\textscrelu(f)d\displaystyle\mu^{d,H,\sigma_{\textsc{relu}}}_{p,q}(f)=d^{1/q}\sqrt[d]{\gamma_{p,q}^{d,H,\sigma_{\textsc{relu}}}(f)}.

Let WW be weights that realizes ff and are optimal with respect to γp,g\gamma_{p,g}; i.e. γp,q(W)=γp,qd,H(f)\gamma_{p,q}(W)=\gamma^{d,H}_{p,q}(f). Let W~k=γp,q(W)dWk/Wkp,q\widetilde{W}_{k}=\sqrt[d]{\gamma_{p,q}(W)}W_{k}/\left\lVert{W_{k}}\right\rVert_{p,q}, and observe that they also realize ff. We now have:

which together with (4) completes the proof.

The two measures are therefore equivalent when we use RELUs, and define the same level sets, or family of hypothesis classes, which we refer to simply as Np,qd,H\mathcal{N}^{d,H}_{p,q}. In the remainder of this Section, we investigate convexity and generalization properties of these hypothesis classes.

In order to understand the effect of the norm on the sample complexity, we bound the Rademacher complexity of the classes Np,qd,H\mathcal{N}^{d,H}_{p,q}. Recall that the Rademacher Complexity is a measure of the capacity of a hypothesis class on a specific sample, which can be used to bound the difference between empirical and expected error, and thus the excess generalization error of empirical risk minimization (see, e.g., Bartlett and Mendelson (2003) for a complete treatment, and Appendix A for the exact definitions we use). In particular, the Rademacher complexity typically scales as C/m\sqrt{C/m}, which corresponds to a sample complexity of O(C/ϵ2)O(C/\epsilon^{2}), where mm is the sample size and CC is the effective measure of capacity of the hypothesis class.

We prove the bound by induction, showing that for any q,d>1q,d>1 and 1p<1\leq p<\infty,

The intuition is that when p<qp^{*}<q, the Rademacher complexity increases by simply distributing the weights among neurons and if pqp^{*}\geq q then the supremum is attained when the output neuron is connected to a neuron with highest Rademacher complexity in the lower layer and all other weights in the top layer are set to zero. For a complete proof, see Appendix A. \BlackBox

Note that for 2p<2\leq p<\infty, the bound on the Rademacher complexity scales with m1pm^{\frac{1}{p}} (see section A.1 in appendix) because:

2 Tightness

We next investigate the tightness of the complexity bound in Theorem 3.2, and show that when 1/p+1/q<11/p+1/q<1 the dependence on the width HH is indeed unavoidable. We show not only that the bound on the Rademacher complexity is tight, but that the implied bound on the sample complexity is tight, even for binary classification with a margin over binary inputs. To do this, we show how we can shatter the m=2Dm=2^{D} points {±1}D\{\pm 1\}^{D} using a network with small group-norm:

For any p,q1p,q\geq 1 (and 1/p+1/p=11/p^{*}+1/p=1) and any depth d2d\geq 2, the m=2Dm=2^{D} points {±1}D\{\pm 1\}^{D} can be shattered with unit margin by Nγp,qγd,H\mathcal{N}^{d,H}_{\gamma_{p,q}\leq\gamma} with:

Consider a size mm subset SmS_{m} of 2D2^{D} vertices of the DD dimensional hypercube {1,+1}D\{-1,+1\}^{D}. We construct the first layer using mm units. Each unit has a unique weight vector consisting of +1+1 and 1-1’s and will output a positive value if and only if the sign pattern of the input xSmx\in S_{m} matches that of the weight vector. The second layer has a single unit and connects to all mm units in the first layer. For any mm dimensional sign pattern b{1,+1}mb\in\{-1,+1\}^{m}, we can choose the weights of the second layer to be bb, and the network will output the desired sign for each xSmx\in S_{m} with unit margin. The norm of the network is at most (mDq/p)1/qm1/p=D1/pm(1/p+1/q).(m\cdot D^{q/p})^{1/q}\cdot m^{1/p}=D^{1/p}\cdot m^{(1/p+1/q)}. This establishes the claim for d=2d=2. For d>2d>2 and 1/p+1/q11/p+1/q\geq 1, we obtain the same norm and unit margin by adding d2d-2 layers with one unit in each layer connected to the previous layer by a unit weight. For d>2d>2 and 1/p+1/q<11/p+1/q<1, we show the dependence on HH by recursively replacing the top unit with HH copies of it and adding an averaging unit on top of that. More specifically, given the above d=2d=2 layer network, we make HH copies of the output unit with rectified linear activation and add a 3rd layer with one output unit with uniform weight 1/H1/H to all the copies in the 2nd layer. Since this operation does not change the output of the network, we have the same margin and now the norm of the network is (mDq/p)1/q(Hmq/p)1/q(H(1/Hp))1/p=D1/pm(1/p+1/q)H1/q1/p.(m\cdot D^{q/p})^{1/q}\cdot(Hm^{q/p})^{1/q}\cdot(H(1/H^{p}))^{1/p}=D^{1/p}\cdot m^{(1/p+1/q)}\cdot H^{1/q-1/p^{*}}. That is, we have reduced the norm by factor H1/q1/pH^{1/q-1/p^{*}}. By repeating this process, we get the geometric reduction in the norm H(d2)(1/q1/p)H^{(d-2)(1/q-1/p^{*})}, which concludes the proof.

Next we consider the dependence on the width HH when 1/p+1/q<11/p+1/q<1. Here we have to use depth d3d\geq 3, and we see that indeed as the width HH and depth dd increase, the magnitude control γ\gamma can decrease as H(1/p1/q)(d2)H^{(1/p^{*}-1/q)(d-2)} without decreasing the capacity, matching Theorem 1 up to an offset of 2 on the depth. In particular, we see that in this regime we can shatter an arbitrarily large number of points with arbitrarily low γ\gamma by using enough hidden units, and so the capacity of Np,qd\mathcal{N}^{d}_{p,q} is indeed infinite and it cannot ensure any generalization.

3 Convexity

Finally we establish a sufficient condition for the hypothesis classes Np,qd\mathcal{N}^{d}_{p,q} to be convex. We are referring to convexity of the functions in the Np,qd\mathcal{N}^{d}_{p,q} independent of a specific representation. If we consider a, possibly regularized, empirical risk minimization problem on the weights, the objective (the empirical risk) would never be a convex function of the weights (for depth d2d\geq 2), even if the regularizer is convex in ww (which it always is for p,q1p,q\geq 1). But if we do not bound the width of the network, and instead rely on magnitude-control alone, we will see that the resulting hypothesis class, and indeed the complexity measure, may be convex (with respect to taking convex combinations of functions, not of weights).

For any d,p,q1d,p,q\geq 1 such that \frac{1}{q}\leq\frac{1}{d-1}\big{(}1-\frac{1}{p}\big{)}, γp,qd(f)\gamma^{d}_{p,q}(f) is a semi-norm in Nd\mathcal{N}^{d}.

In particular, under the condition of the Theorem, γp,qd\gamma^{d}_{p,q} is convex, and hence its sublevel sets Np,qd\mathcal{N}^{d}_{p,q} are convex, and so μp,qd\mu^{d}_{p,q} is quasi-convex (but not convex).

To show convexity, consider two functions f,gNγp,qγdf,g\in\mathcal{N}^{d}_{\gamma_{p,q}\leq\gamma} and 0<α<10<\alpha<1, and let UU and VV be the weights realizing ff and gg respectively with γp,q(U)γ\gamma_{p,q}(U)\leq\gamma and γp,q(V)γ\gamma_{p,q}(V)\leq\gamma. We will construct weights WW realizing αf+(1α)g\alpha f+(1-\alpha)g with γp,q(W)γ\gamma_{p,q}(W)\leq\gamma. This is done by first balancing UU and VV s.t. at each layer Uip,q=γp,q(U)d\left\lVert{U_{i}}\right\rVert_{p,q}=\sqrt[d]{\gamma_{p,q}(U)} and Vip,q=γp,q,(V)d\left\lVert{V_{i}}\right\rVert_{p,q}=\sqrt[d]{\gamma_{p,q,}(V)} and then placing UU and VV side by side, with no interaction between the units calculating ff and gg until the output layer. The output unit has weights αUd\alpha U_{d} coming in from the ff-side and weights (1α)Vd(1-\alpha)V_{d} coming in from the gg-side. In Appendix B we show that under the condition in the theorem, γp,q(W)γ\gamma_{p,q}(W)\leq\gamma. To complete the proof, we also show γp,qd\gamma^{d}_{p,q} is homogeneous and that this is sufficient for convexity. \BlackBox

Per-Unit and Path Regularization

In this Section we will focus on the special case of q=q=\infty, i.e. when we constrain the norm of the incoming weights of each unit separately.

Consider a regularizer which looks at the sum over all paths from input nodes to the output node, of the product of the weights along the path:

where p1p\geq 1 controls the norm used to aggregate the paths. We can motivate this regularizer as follows: if a node does not have any high-weight paths going out of it, we really don’t care much about what comes into it, as it won’t have much effect on the output. The path-regularizer thus looks at the aggregated influence of all the weights.

Referring to the induced regularizer ϕpG(f)=minfG,w=fϕp(w)\phi^{G}_{p}(f)=\min_{f_{G,w}=f}\phi_{p}(w) (with the usual shorthands for layered graphs), we now observe that for layered graphs, path regularization and per-unit regularization are equivalent:

For p1p\geq 1, any dd and (finite or infinite) HH, for any fNd,Hf\in\mathcal{N}^{d,H}: ϕpd,H(f)=γp,d,H\phi_{p}^{d,H}(f)=\gamma^{d,H}_{p,\infty}

It is important to emphasize that even for layered graphs, it is not the case that for all weights ϕp(w)=γp,(w)\phi_{p}(w)=\gamma_{p,\infty}(w). E.g., a high-magnitude edge going into a unit with no non-zero outgoing edges will affect γp,(w)\gamma_{p,\infty}(w) but not ϕp(w)\phi_{p}(w), as will having high-magnitude edges on different layers in different paths. In a sense path regularization is as more careful regularizer less fooled by imbalance. Nevertheless, in the proof of Theorem 4.1 in Appendix C.1, we show we can always balance the weights such that the two measures are equal.

The equivalence does not extend to non-layered graphs, since the lengths of different paths might be different. Again, we can think of path regularizer as more refined regularizer taking into account the local structure. However, if we consider all DAGs of depth at most dd (i.e. with paths of length at most dd), the notions are again equivalent (see proof in Appendix C.2):

For any p1p\geq 1 and any dd: \displaystyle\gamma^{d}_{p,\infty}(f)=\min_{\textrm{G\in\text{DAG}(d)}}\phi^{G}_{p}(f).

In particular, for any graph GG of depth dd, we have that ϕpG(f)γp,d(f)\phi^{G}_{p}(f)\geq\gamma^{d}_{p,\infty}(f). Combining this observation with Corollary 3.3 allows us to immediately obtain a generalization bound for path regularization on any, even non-layered, graph:

Note that in order to apply Corollary 3.3 and obtain a width-independent bound, we had to limit ourselves to p=1p=1. We further explore this issue next.

Capacity

Convexity

An immediately consequence of Theorem 3.6 is that per-unit regularization, if we do not constrain the network width, is convex for any p1p\geq 1. In fact, γp,d\gamma^{d}_{p,\infty} is a (semi)norm. However, as discussed above, for depth d>2d>2 this is meaningful only for p=1p=1, as γp,d\gamma^{d}_{p,\infty} collapses for p>1p>1.

Hardness

Since the classes N1,d\mathcal{N}^{d}_{1,\infty} are convex, we might hope that this might make learning computationally easier. Indeed, one can consider functional-gradient or boosting-type strategies for learning a predictor in the class (Lee et al., 1996). However, as Bach (2014) points out, this is not so easy as it requires finding the best fit for a target with a RELU unit, which is not easy. Indeed, applying results on hardness of learning intersections of halfspaces, which can be represented with small per-unit norm using two-layer networks, we can conclude that, subject to certain complexity assumptions, it is not possible to efficiently PAC learn N1,d\mathcal{N}^{d}_{1,\infty}, even for depth d=2d=2 when γ1,\gamma_{1,\infty} increases superlinearly:

This is a corollary of Theorem D.1 in the Appendix D. Either versions of corollary 4.4 precludes the possibility of learning in time polynomial in γ1,\gamma_{1,\infty}, though it still might be possible to learn in poly(D)\textrm{poly}(D) time when γ1,\gamma_{1,\infty} is sublinear.

Sharing

We conclude this Section with an observation on the type of networks obtained by per-unit, or equivalently path, regularization.

For any p1p\geq 1 and d>1d>1 and any fNdf\in\mathcal{N}^{d}, there exists a layered graph G(V,E)G(V,E) of depth dd, such that fNGf\in\mathcal{N}^{G} and γp,G(f)=ϕpG(f)=γp,d(f)\gamma^{G}_{p,\infty}(f)=\phi^{G}_{p}(f)=\gamma^{d}_{p,\infty}(f), and the out-degree of every internal (non-input) node in GG is one. That is, the subgraph of GG induced by the non-input vertices is a tree directed toward the output vertex.

What the Theorem tells us is that we can realize every function as a tree with optimal per-unit norm. If we think of learning with an infinite fully-connected layered network, we can always restrict ourselves to models in which the non-zero-weight edges form a tree. This means that when using per-unit regularization we have no incentive to “share” lower-level units—each unit will only have a single outgoing edge and will only be used by a single down-stream unit. This seems to defy much of the intuition and power of using deep networks, where we expect lower layers to represent generic feature useful in many higher-level features. In effect, we are not encouraging any transfer between learning different aspects of the function (or between different tasks or classes, if we do have multiple output units). Per-unit regularization therefore misses out on much of the inductive bias that we might like to impose when using deep learning (namely, promoting sharing).

For any fG,wNDAG(d)f_{G,w}\in\mathcal{N}^{\text{DAG}(d)}, we show how to construct such G~\widetilde{G} and w~\widetilde{w}. We first sort the vertices of GG based on topological ordering such that the out-degree of the first vertex is zero. Let G0=GG_{0}=G and w0=ww_{0}=w. At each step ii, we first set Gi=Gi1G_{i}=G_{i-1} and wi=wi1w_{i}=w_{i-1} and then pick the vertex uu that is the iith vector in the topological ordering. If the out-degree of uu is at most 1. Otherwise, for any edge (uv)(u\rightarrow v) we create a copy of vertex uu that we call it uvu_{v}, add the edge (uvv)(u_{v}\rightarrow v) to GiG_{i} and connect all incoming edges of uu with the same weights to every such uvu_{v} and finally we delete the vertex uu from GiG_{i} together with all incoming and outgoing edges of uu. It is easy to indicate that fGi,wi=fGi1,wi1f_{G_{i},w_{i}}=f_{G_{i-1},w_{i-1}}. After at most V|V| such steps, all internal nodes have out-degree one and hence the subgraph induced by non-input vertices will be a tree.

Overall Regularization

Convexity

This is similar to per-unit regularization, except we impose different norms at different layers (if p1p\not=1). We can see that Nνpν2=νconv(σ(G))\mathcal{N}^{2}_{\nu_{p}\leq\nu}=\nu\cdot\overline{\operatorname{conv}}(\sigma(\mathcal{G})), and is thus convex for any pp. Focusing on RELU activation we have the equivalence:

μ2,22(f)=2ν2(f).\displaystyle\mu^{2}_{2,2}(f)=2\nu_{2}(f).

Here (9) is the arithmetic-geometric mean inequality for which we can achieve equality by balancing the weights (as in Claim 1) and (10) again follows from the homogeneity of the RELU which allows us to rebalance the weights.

Hardness

As with N1,d\mathcal{N}^{d}_{1,\infty}, we might hope that the convexity of N2,22\mathcal{N}^{2}_{2,2} might make it computationally easy to learn. However, by the same reduction from learning intersection of halfspaces (Theorem D.1 in Appendix D) we can again conclude that we cannot learn in time polynomial in μ2,22\mu^{2}_{2,2}:

Depth Independent Regularization

Up until now we discussed relying on magnitude-based regularization instead of directly controlling network size, thus allowing unbounded and even infinite width. But we still relied on a finite bound on the depth in all our derivations. Can the explicit dependence on the depth be avoided, and replaced with only a measure of scale of the weights?

We already know we cannot rely only on a bound on the group-norm μp,q\mu_{p,q} when the depth is unbounded, as we know from Theorem 3.4 that in terms of μp,q\mu_{p,q} the sample complexity necessarily increases exponentially with the depth: if we allow arbitrarily deep graphs we can shrink μp,q\mu_{p,q} toward zero without changing the scale of the computed function. However, controlling the γ\gamma-measure, or equivalently the path-regularizer ϕ\phi, in arbitrarily-deep graphs is sensible, and we can define:

where the minimization is over any DAG. From Theorem 4.2 we can conclude that ϕp(f)=γp,(f)\phi_{p}(f)=\gamma_{p,\infty}(f). In any case, γp,q(f)\gamma_{p,q}(f) is a sensible complexity measure, that does not collapse despite the unbounded depth. Can we obtain generalization guarantees for the class Nγp,qγ\mathcal{N}_{\gamma_{p,q}\leq\gamma} ?

Unfortunately, even when 1/p+1/q11/p+1/q\geq 1 and we can obtain width-independent bounds, the bound in Corollary 3.3 still has a dependence on 4d4^{d}, even if γp,q\gamma_{p,q} is bounded. Can such a dependence be avoided?

The proof is again based on an inductive argument similar to Theorem 3.2 and you can find it in appendix A.4.

However, the ramp is not homogeneous and so the equivalent between μ\mu, γ\gamma and ϕ\phi breaks down. Can we obtain such a bound also for the RELU? At the very least, what we can say is that an inductive argument such that used in the proofs of Theorems 3.2 and 6.1 cannot be used to avoid an exponential dependence on the depth. To see this, consider γ1,1\gamma_{1,\infty}\leq 1 (this choice is arbitrary if we are considering the Rademacher complexity), for which we have

where conv()\overline{\operatorname{conv}}(\cdot) is the symmetric convex hull, and []+=max(z,0)[\cdot]_{+}=\max(z,0) is applied to each function in the class. In order to apply the inductive argument without increasing the complexity exponentially with the depth, we would need the operation [conv(H)]+[\overline{\operatorname{conv}}(\mathcal{H})]_{+} to preserve the Rademacher complexity, at least for non-negative convex cones H\mathcal{H}. However we show a simple example of a non-negative convex cone H\mathcal{H} for which Rm([conv(H)]+)>Rm(H)\mathcal{R}_{m}\left([\overline{\operatorname{conv}}(\mathcal{H})]_{+}\right)>\mathcal{R}_{m}\left(\mathcal{H}\right).

Summary and Open Issues

Although the additive μ\mu-measure and multiplication γ\gamma-measure are equivalent at the optimum, they behave rather differently in terms of optimization dynamics (based on anecdotal empirical experience) and understanding the relationship between them, as well as the novel path-based regularizer can be helpful in practical regularization of neural networks.

Beyond the open issue regarding depth-independent γ\gamma-based capacity control, another interesting open question is understanding the expressive power of Nγp,qγd\mathcal{N}^{d}_{\gamma_{p,q}\leq\gamma}, particularly as a function of the depth dd. Clearly going from depth d=1d=1 to depth d=2d=2 provides additional expressive power, but it is not clear how much additional depth helps. The class N2\mathcal{N}^{2} already includes all binary functions over {±1}D\{\pm 1\}^{D} and is dense among continuous real-valued functions. But can the γ\gamma-measure be reduced by increasing the depth? Viewed differently: γp,qd(f)\gamma^{d}_{p,q}(f) is monotonically non-increasing in dd, but are there functions for it continues decreasing? Although it seems obvious there are functions that require high depth for efficient representation, these questions are related to decade-old problems in circuit complexity and might not be easy to resolve.

This research was partially supported by NSF grant IIS-1302662 and an Intel ICRI-CI award. We thank the COLT anonymous reviewers for pointing out an error in the statement of Lemma A.3 and suggesting other corrections.

References

Appendix A Rademacher Complexities

In this section, we prove an upper bound for the Rademacher complexity of the class Nγp,qγd,H,σRELU\mathcal{N}_{\gamma_{p,q}\leq\gamma}^{d,H,\sigma_{\text{RELU}}}, i.e., the class of functions that can be represented as depth dd, width HH network with rectified linear activations, and the layer-wise group norm complexity γp,q\gamma_{p,q} bounded by γ\gamma. As mentioned in the main text, our proof is an induction with respect to the depth dd. We start with d=1d=1 layer neural networks, which is essentially the class of linear separators.

(Khintchine-Kahane Inequality) For any 0<p<0<p<\infty and S={z1,,zm}S=\{z_{1},\dots,z_{m}\}, if the random variable ξ\xi is uniform over {±1}m\{\pm 1\}^{m}, then

where CpC_{p} is a constant depending only on pp.

The sharp value of the constant CpC_{p} was found by Haagerup (1981) but for our analysis, it is enough to note that if p1p\geq 1 we have CppC_{p}\leq\sqrt{p}.

(Massart Lemma) Let AA be a finite set of mm dimensional vectors. Then

where pp^{*} is such that 1p+1p=1\frac{1}{p^{*}}+\frac{1}{p}=1.

First, note that N1\mathcal{N}^{1} is the class of linear functions and hence for any function fwN1f_{w}\in\mathcal{N}^{1}, we have that γp,q(w)=wp\gamma_{p,q}(w)=\left\lVert{w}\right\rVert_{p}. Therefore, we can write the Rademacher complexity for a set S={x1,,xm}S=\{x_{1},\dots,x_{m}\} as:

For 1pmin{2,2log(2D)2log(2D)1}1\leq p\leq\min\left\{2,\frac{2\log(2D)}{2\log(2D)-1}\right\} (and therefore 2log(2D)p2\log(2D)\leq p^{*}), we have

We now use the Massart Lemma viewing each feature (xi[j])i=1m(x_{i}[j])_{i=1}^{m} for j=1,,Dj=1,\ldots,D as a member of a finite hypothesis class and obtain

If min{2,2log(2D)2log(2D)1}<p<\min\left\{2,\frac{2\log(2D)}{2\log(2D)-1}\right\}<p<\infty, by Khintchine-Kahane inequality we have

If p2p^{*}\geq 2, by Minskowski inequality we have that X2,pm1/2maxixip\left\lVert{X}\right\rVert_{2,p^{*}}\leq m^{1/2}\max_{i}\left\lVert{x_{i}}\right\rVert_{p^{*}}. Otherwise, by subadditivity of the function f(z)=zp2f(z)=z^{\frac{p^{*}}{2}}, we get X2,pm1/pmaxixip\left\lVert{X}\right\rVert_{2,p^{*}}\leq m^{1/{p^{*}}}\max_{i}\left\lVert{x_{i}}\right\rVert_{p^{*}}.

A.2 Theorem 3.2

For the proof of theorem 3.2, we need the following two technical lemmas. The first is the well-known contraction lemma:

For any p,q1p,q\geq 1, d2d\geq 2, ξ{±1}m\xi\in\{\pm 1\}^{m} and fNd,H,Hf\in\mathcal{N}^{d,H,H} we have

where pp^{*} is such that 1p+1p=1\frac{1}{p^{*}}+\frac{1}{p}=1.

where pp^{*} is such that 1p+1p=1\frac{1}{p^{*}}+\frac{1}{p}=1.

By the definition of Rademacher complexity if ξ\xi is uniform over {±1}m\{\pm 1\}^{m}, we have:

where the equality (13) is obtained by lemma A.6 and inequality (14) is by Contraction Lemma. This will give us the bound on Rademacher complexity of Nγp,qγd,H\mathcal{N}^{d,H}_{\gamma_{p,q}\leq\gamma} based on the Rademacher complexity of Nγp,qγd1,H\mathcal{N}^{d-1,H}_{\gamma_{p,q}\leq\gamma}. Applying the same argument on all layers and using lemma A.3 to bound the complexity of the first layer completes the proof.

A.3 Proof of Lemma A.6

It is immediate that the right hand side of the equality in the statement is always less than or equal to the left hand side because given any vector ww in the right hand side, by setting each row of matrix WW in the left hand side we get the equality. Therefore, it is enough to prove that the left hand side is less than or equal to the right hand side. For the convenience of notations, let g(w)i=1mξiw[f(xi)]+g(w)\triangleq|\sum_{i=1}^{m}\xi_{i}w^{\top}[f(x_{i})]_{+}|. Define w~\widetilde{w} to be:

If qpq\leq p^{*}, then the right hand side of equality in the lemma statement will reduce to g(w~)/w~pg(\widetilde{w})/\left\lVert{\widetilde{w}}\right\rVert_{p} and therefore we need to show that for any matrix VV,

Since qpq\leq p^{*}, we have Vp,pVp,q\left\lVert{V}\right\rVert_{p,{p^{*}}}\leq\left\lVert{V}\right\rVert_{p,q} and hence it is enough to prove the following inequality:

On the other hand, if q>pq>{p^{*}}, then we need to prove the following inequality holds:

Since q>pq>{p^{*}}, we have that Vp,pH1p1qVp,q\left\lVert{V}\right\rVert_{p,{p^{*}}}\leq H^{\frac{1}{{p^{*}}}-\frac{1}{q}}\left\lVert{V}\right\rVert_{p,q}. Therefore, it is again enough to show that:

We can rewrite the above inequality in the following form:

By the definition of w~\widetilde{w}, we know that the above inequality holds for each term in the sum and hence the inequality is true.

A.4 Theorem 6.1

Assuming ξ\xi is uniform over {±1}m\{\pm 1\}^{m}, we have:

where the equality (15) is by anti-symmetric property of σ\sigma and inequality (16) is by the version of Contraction Lemma without the absolute value. This will give us the bound on Rademacher complexity of Nμ1,μd,H\mathcal{N}^{d,H}_{\mu_{1,\infty}\leq\mu} based on the Rademacher complexity of Nμ1,μd1,H\mathcal{N}^{d-1,H}_{\mu_{1,\infty}\leq\mu}. Applying the same argument on all layers and using lemma A.3 to bound the complexity of the first layer completes the proof.

We repeat the statement here for convenience.

For any d,p,q1d,p,q\geq 1 such that \frac{1}{q}\leq\frac{1}{d-1}\big{(}1-\frac{1}{p}\big{)}, γp,qd(f)\gamma^{d}_{p,q}(f) is a semi-norm in Nd\mathcal{N}^{d}.

First we show that for any two functions f1,f2Nγp,qγdf_{1},f_{2}\in\mathcal{N}^{d}_{\gamma_{p,q}\leq\gamma} and 0α10\leq\alpha\leq 1, the function g=αf1+(1α)f2g=\alpha f_{1}+(1-\alpha)f_{2} is in the hypothesis class Nγp,qγd\mathcal{N}^{d}_{\gamma_{p,q}\leq\gamma}. We prove this by constructing weights WW that realizes gg. Let UU and VV be the weights of two neural networks such that γp,q(U)=γp,qd(f1)γ\gamma_{p,q}(U)=\gamma_{p,q}^{d}(f_{1})\leq\gamma and γp,q(V)=γp,qd(f2)γ\gamma_{p,q}(V)=\gamma_{p,q}^{d}(f_{2})\leq\gamma. For every layer i=1,,di=1,\ldots,d let

Then for the defined WW, we have fW=αf1+(1α)f2f_{W}=\alpha f_{1}+(1-\alpha)f_{2} for rectified linear and any other non-negative homogeneous activation function. Moreover, for any i<di<d, the norm of each layer is

Combining inequalities (17) and (18), we get γp,qd(fW)2d1q+1pγγ,\gamma^{d}_{p,q}(f_{W})\leq 2^{\frac{d-1}{q}+\frac{1}{p}}\gamma\leq\gamma, where the last inequality holds because we assume that \frac{1}{q}\leq\frac{1}{d-1}\big{(}1-\frac{1}{p}\big{)}. Thus for every γ0\gamma\geq 0, Nγp,qγd\mathcal{N}^{d}_{\gamma_{p,q}\leq\gamma} is a convex set.

Non-negative homogeneity

For any function fNdf\in\mathcal{N}^{d} and any α0\alpha\geq 0, let UU be the weights realizing ff with γp,qd(f)=γp,q(U)\gamma^{d}_{p,q}(f)=\gamma_{p,q}(U). Then αdU\sqrt[d]{\alpha}U realizes αf\alpha f establishing γp,qd(αf)γp,q(αdU)=αγp,q(U)=αγp,qd(U)=αγp,qd(f)\gamma^{d}_{p,q}(\alpha f)\leq\gamma_{p,q}(\sqrt[d]{\alpha}U)=\alpha\gamma_{p,q}(U)=\alpha\gamma^{d}_{p,q}(U)=\alpha\gamma_{p,q}^{d}(f). This establishes the non-negative homogeneity of γp,qd\gamma_{p,q}^{d}.

Convex sublevel sets and homogeneity imply triangular inequality

Appendix C Path Regularization

For any function fNγp,γd,Hf\in\mathcal{N}^{d,H}_{\gamma_{p,\infty}\leq\gamma} there is a layered network with weights ww such that γp,(w)=γp,d,H(f)\gamma_{p,\infty}(w)=\gamma^{d,H}_{p,\infty}(f) and for any internal unit vv, (uv)Ew(uv)p=1\sum_{(u\rightarrow v)\in E}|w(u\rightarrow v)|^{p}=1.

Let ww be the weights of a network such that γp,(w)=γp,d,H(f)\gamma_{p,\infty}(w)=\gamma^{d,H}_{p,\infty}(f). We now construct a network with weights w~\widetilde{w} such that γp,(w)=γp,d,H(f)\gamma_{p,\infty}(w)=\gamma^{d,H}_{p,\infty}(f) and for any internal unit vv, (uv)Ew~(uv)p=1\sum_{(u\rightarrow v)\in E}|\widetilde{w}(u\rightarrow v)|^{p}=1. We do this by an incremental algorithm. Let w0=ww_{0}=w. At each step ii, we do the following.

For p1p\geq 1, any dd and (finite or infinite) HH, for any fNd,Hf\in\mathcal{N}^{d,H}: ϕpd,H(f)=γp,d,H\phi_{p}^{d,H}(f)=\gamma^{d,H}_{p,\infty}.

By the Lemma C.1, there is a layered network with weights w~\widetilde{w} such that γp,(w~)=γp,d,H(f)\gamma_{p,\infty}(\widetilde{w})=\gamma^{d,H}_{p,\infty}(f) and for any internal unit vv, (uv)Ew~(uv)p=1\sum_{(u\rightarrow v)\in E}|\widetilde{w}(u\rightarrow v)|^{p}=1. Let WW be the weights of the layered network that corresponds to the function w~\widetilde{w}. Then we have:

C.2 Proof of Theorem 4.2

In this section, without loss of generality, we assume that all the internal nodes in a DAG have incoming edges and outgoing edges because otherwise we can just discard them. Let dout(v)d_{\text{out}}(v) be the longest directed path from vertex vv to voutv_{\textrm{out}} and din(v)d_{\text{in}}(v) be the longest directed path from any input vertex vin[i]v_{\textrm{in}}[i] to vv. We say graph GG is a sublayered graph if GG is a subgraph of a layered graph.

We first show the necessary and sufficient conditions under which a DAG is a sublayered graph.

The graph G(E,V)G(E,V) is a sublayered graph if and only if any path from input nodes to the output nodes has length dd where dd is the length of the longest path in GG

Since the internal nodes have incoming edges and outgoing edges; hence if GG is a sublayered graph it is straightforward by induction on the layers that for every vertex vv in layer ii, there is a vertex uu in layer i+1i+1 such that (vu)E(v\rightarrow u)\in E and this proves the necessary condition for being sublayered graph.

To show the sufficient condition, for any internal node uu, uu has din(v)d_{\text{in}}(v) distance from the input node in every path that includes uu (otherwise we can build a path that is longer than dd). Therefore, for each vertex vVv\in V, we can place vertex vv in layer din(v)d_{\text{in}}(v) and all the outgoing edges from vv will be to layer din(v)+1d_{\text{in}}(v)+1.

If the graph G(E,V)G(E,V) is not a sublayered graph then there exists a directed edge (uv)(u\rightarrow v) such that din(u)+dout(v)<d1d_{\text{in}}(u)+d_{\text{out}}(v)<d-1 where dd the length of the longest path in GG.

We prove the lemma by an inductive argument. If GG is not sublayered, by lemma C.4, we know that there exists a path v0vivdv_{0}\rightarrow\dots v_{i}\dots\rightarrow v_{d^{\prime}} where v0v_{0} is an input node (din(v0)=0d_{\text{in}}(v_{0})=0), vd=voutv_{d^{\prime}}=v_{\textrm{out}} (dout(vd=0d_{\text{out}}(v_{d^{\prime}}=0) and d<dd^{\prime}<d. Now consider the vertex v1v_{1}. We need to have dout(v1)=d1d_{\text{out}}(v_{1})=d-1 otherwise if dout(v1)<d1d_{\text{out}}(v_{1})<d-1 we get din(u)+dout(v)<d1d_{\text{in}}(u)+d_{\text{out}}(v)<d-1 and if dout(v1)>d1d_{\text{out}}(v_{1})>d-1 there will be path in GG that is longer than dd. Also, since dout(v1)=d1d_{\text{out}}(v_{1})=d-1 and the longest path in GG has length dd, we have din(v1)=1d_{\text{in}}(v_{1})=1.

By applying the same inductive argument on each vertex viv_{i} in the path we get din(vi)=id_{\text{in}}(v_{i})=i and dout(vi)=did_{\text{out}}(v_{i})=d-i. Note that if the condition din(u)+dout(v)<d1d_{\text{in}}(u)+d_{\text{out}}(v)<d-1 is not satisfied in one of the steps of the inductive argument, the lemma is proved. Otherwise, we have din(vd1)=d1d_{\text{in}}(v_{d^{\prime}-1})=d^{\prime}-1 and dout(vd1)=dd+1d_{\text{out}}(v_{d^{\prime}-1})=d-d^{\prime}+1 and therefore din(vd1)+dout(vout)=d1<d1d_{\text{in}}(v_{d^{\prime}-1})+d_{\text{out}}(v_{\textrm{out}})=d^{\prime}-1<d-1 that proves the lemma.

For any p1p\geq 1 and any dd: \displaystyle\gamma^{d}_{p,\infty}(f)=\min_{\textrm{G\in\text{DAG}(d)}}\phi^{G}_{p}(f).

Appendix D Hardness of Learning Neural Networks

If it is not possible to efficiently PAC learn intersection of halfspaces (even improperly), we can conclude it is also not possible to efficiently PAC learn any hypothesis class which can represent such intersection. In Theorem D.1 we show that intersection of homogeneous half spaces can be realized with unit margin by neural networks with bound norm.

For any k>0k>0, the intersection of kk homogeneous half spaces is realizable with unit margin by Nγp,qγ2\mathcal{N}^{2}_{\gamma_{p,q}\leq\gamma} where γ=4D1pk2\gamma=4D^{\frac{1}{p}}k^{2}.

The proof is by a construction that is similar to the one in Livni et al. (2014). For each hyperplane wi,x>0\left\langle w_{i},x\right\rangle>0, where wi{±1}Dw_{i}\in\{\pm 1\}^{D}, we include two units in the first layer: gi+(x)=[wi,x]+g^{+}_{i}(x)=[\left\langle w_{i},x\right\rangle]_{+} and gi(x)=[wi,x1]+g^{-}_{i}(x)=[\left\langle w_{i},x\right\rangle-1]_{+}. We set all incoming weights of the output node to be 11. Therefore, this network is realizing the following function:

Since all inputs and all weights are integer, the outputs of the first layer will be integer, ([wi,x]+[wi,x1]+)\left([\left\langle w_{i},x\right\rangle]_{+}-[\left\langle w_{i},x\right\rangle-1]_{+}\right) will be zero or one, and ff realizes the intersection of the kk halfspaces with unit margin. Now, we just need to make sure that γp,q2(f)\gamma^{2}_{p,q}(f) is bounded by γ=4D1pk2\gamma=4D^{\frac{1}{p}}k^{2}: