Representation Benefits of Deep Feedforward Networks

Matus Telgarsky

Overview

Let positive integer kk, number of layers ll, and number of nodes per layer mm be given with m2(k3)/l1m\leq 2^{(k-3)/l-1}. Then there exists a collection of n:=2kn:=2^{k} points ((xi,yi))i=1n((x_{i},y_{i}))_{i=1}^{n} with xix_{i}\in and y{0,1}y\in\{0,1\} such that

For example, approaching the error of the 2k2k-layer network (which has O(k)\mathcal{O}(k) nodes and weights) with 22 layers requires at least 2(k3)/212^{(k-3)/2-1} nodes, and with k3\sqrt{k-3} layers needs at least 2k312^{\sqrt{k-3}-1} nodes.

The purpose of this note is to provide an elementary proof of Theorem 1.1 and its refinement Theorem 1.2, which amongst other improvements will use a recurrent neural network in the upper bound. Section 2 will present the proof, and Section 3 will tie these results to the literature on neural network expressive power and circuit complexity, which by contrast makes use of product nodes rather than standard feedforward networks when showing the benefits of depth.

There are three refinements to make: the classification problem will be specified, the perfect network will be an even simpler recurrent network, and σ\sigma need not be σ\textscr\sigma_{\textsc{r}}.

Let nn-ap (the nn-alternating-point problem) denote the set of nn uniformly spaced points within [0,12n][0,1-2^{-n}] with alternating labels, as depicted in Figure 1; that is, the points ((xi,yi))i=1n((x_{i},y_{i}))_{i=1}^{n} with xi=i2nx_{i}=i2^{-n}, and yi=0y_{i}=0 when ii is even, and otherwise yi=1y_{i}=1. As the xx values pass from left to right, the labels change as often as possible; the key is that adding a constant number of nodes in a flat network only corrects predictions on a constant number of points, whereas adding a constant number of nodes in a deep network can correct predictions on a constant fraction of the points.

Let R(σ;m,l;k)\mathfrak{R}(\sigma;m,l;k) denote kk iterations of a recurrent network with ll layers of at most mm nodes each, defined as follows. Every fR(σ;m,l;k)f\in\mathfrak{R}(\sigma;m,l;k) consists of some fixed network gN(σ;m,l)g\in\mathfrak{N}(\sigma;m,l) applied kk times:

Consequently, R(σ;m,l;k)N(σ;m,lk)\mathfrak{R}(\sigma;m,l;k)\subseteq\mathfrak{N}(\sigma;m,lk), but the former has O(ml)\mathcal{O}(ml) parameters whereas the latter has O(mlk)\mathcal{O}(mlk) parameters.

This more refined result can thus say, for example, that on the 2k2^{k}-ap one needs exponentially (in kk) many parameters when boosting decision stumps, linearly many parameters with a deep network, and constantly many parameters with a recurrent network.

Analysis

This section will first prove the lower bound via a counting argument, simply tracking the number of times a function within N(σ;m,l)\mathfrak{N}(\sigma;m,l) can cross 1/2. The upper bound will exhibit a network in N(σ\textscr;2,2)\mathfrak{N}(\sigma_{\textsc{r}};2,2) which can be composed with itself kk times to exactly fit the nn-ap. These bounds together prove Theorem 1.2, which in turn implies Theorem 1.1.

The lower bound is proved in two stages. First, composing and summing sawtooth functions must also yield a sawtooth function, thus elements of N(σ;m,l)\mathfrak{N}(\sigma;m,l) are sawtooth whenever σ\sigma is. Secondly, a sawtooth function can not cross 1/21/2 very often, meaning it can’t hope to match the quickly changing labels of the nn-ap.

To start, N(σ;m,l)\mathfrak{N}(\sigma;m,l) is sawtooth as follows.

The proof is straightforward and deferred momentarily. The key observation is that adding together sawtooth functions grows the number of regions very slowly, whereas composition grows the number very quickly, an early sign of the benefits of depth.

Given a sawtooth function, its classification error on the nn-ap may be lower bounded as follows.

To close, the proof of Section 2.1 proceeds as follows. First note how adding and composing sawtooths grows their complexity.

First consider f+gf+g, and moreover any intervals UfIfU_{f}\in\mathcal{I}_{f} and UgIgU_{g}\in\mathcal{I}_{g}. Necessarily, f+gf+g has a single slope along UfUgU_{f}\cap U_{g}. Consequently, f+gf+g is I|\mathcal{I}|-sawtooth, where I\mathcal{I} is the set of all intersections of intervals from If\mathcal{I}_{f} and Ig\mathcal{I}_{g}, meaning I:={UfUg:UfIf,UgIg}\mathcal{I}:=\{U_{f}\cap U_{g}:U_{f}\in\mathcal{I}_{f},U_{g}\in\mathcal{I}_{g}\}. By sorting the left endpoints of elements of If\mathcal{I}_{f} and Ig\mathcal{I}_{g}, it follows that Ik+l|\mathcal{I}|\leq k+l (the other intersections are empty).

Now consider fgf\circ g, and in particular consider the image f(g(Ug))f(g(U_{g})) for some interval UgIgU_{g}\in\mathcal{I}_{g}. gg is affine with a single slope along UgU_{g}, therefore ff is being considered along a single unbroken interval g(Ug)g(U_{g}). However, nothing prevents g(Ug)g(U_{g}) from hitting all the elements of If\mathcal{I}_{f}; since UgU_{g} was arbitrary, it holds that fgf\circ g is (IfIg)(|\mathcal{I}_{f}|\cdot|\mathcal{I}_{g}|)-sawtooth. ∎

The proof of Section 2.1 follows by induction over layers of N(σ;m,l)\mathfrak{N}(\sigma;m,l).

The proof proceeds by induction over layers, showing the output of each node in layer ii is (tm)i(tm)^{i}-sawtooth as a function of the neural network input. For the first layer, each node starts by computing xw0+w,xx\mapsto w_{0}+\left\langle w,x\right\rangle, which is itself affine and thus 1-sawtooth, so the full node computation xσ(w0+w,x)x\mapsto\sigma(w_{0}+\left\langle w,x\right\rangle) is tt-sawtooth by Section 2.1. Thereafter, the input to layer ii with i>1i>1 is a collection of functions (g1,,gm)(g_{1},\ldots,g_{m^{\prime}}) with mmm^{\prime}\leq m and gjg_{j} being (tm)i1(tm)^{i-1}-sawtooth by the inductive hypothesis; consequently, xw0+jwjgj(x)x\mapsto w_{0}+\sum_{j}w_{j}g_{j}(x) is m(tm)i1m(tm)^{i-1}-sawtooth by Section 2.1, whereby applying σ\sigma yields a (tm)i(tm)^{i}-sawtooth function (once again by Section 2.1). ∎

2 Upper bound

Note that fmN(σ\textscr;2,2)f_{\textup{m}}\in\mathfrak{N}(\sigma_{\textsc{r}};2,2); for instance, fm(x)=σ\textscr(2σ\textscr(x)4σ\textscr(x1/2))f_{\textup{m}}(x)=\sigma_{\textsc{r}}(2\sigma_{\textsc{r}}(x)-4\sigma_{\textsc{r}}(x-1/2)). The upper bounds will use fmkR(σ\textscr;2,2;k)N(σ\textscr;2,2k)f_{\textup{m}}^{k}\in\mathfrak{R}(\sigma_{\textsc{r}};2,2;k)\subseteq\mathfrak{N}(\sigma_{\textsc{r}};2,2k).

These compositions may be written as follows.

Let real xx\in and positive integer kk be given, and choose the unique nonnegative integer ik{0,,2k1}i_{k}\in\{0,\ldots,2^{k-1}\} and real xk[0,1)x_{k}\in[0,1) so that x=(ik+xk)21kx=(i_{k}+x_{k})2^{1-k}. Then

The proof proceeds by induction on the number of compositions ll. When l=1l=1, there is nothing to show. For the inductive step, the mirroring property of pre-composition with fmf_{\textup{m}} combined with the symmetry of fmlf_{\textup{m}}^{l} (by the inductive hypothesis) implies that every x[0,1/2]x\in[0,1/2] satisfies

Consequently, it suffices to consider x[0,1/2]x\in[0,1/2], which by the mirroring property means (fmlfm)(x)=fml(2x)(f_{\textup{m}}^{l}\circ f_{\textup{m}})(x)=f_{\textup{m}}^{l}(2x). Since the unique nonnegative integer il+1i_{l+1} and real xl+1[0,1)x_{l+1}\in[0,1) satisfy 2x=2(il+1+xl+1)2l1=(il+1+xl+1)2l2x=2(i_{l+1}+x_{l+1})2^{-l-1}=(i_{l+1}+x_{l+1})2^{-l}, the inductive hypothesis applied to 2x2x grants

Before closing this subsection, it is interesting to view fmkf_{\textup{m}}^{k} in one more way, namely its effect on ((xi,yi))i=1n((x_{i},y_{i}))_{i=1}^{n} provided by the nn-ap with n:=2kn:=2^{k}. Observe that ((fm(xi),yi))i=1n((f_{\textup{m}}(x_{i}),y_{i}))_{i=1}^{n} is an (n/2)(n/2)-ap with all points duplicated except x1=0x_{1}=0, and an additional point with xx-coordinate 11.

3 Proof of Theorems 1.2 and 1.1

It suffices to prove Theorem 1.2, which yields Theorem 1.1 since σ\textscr\sigma_{\textsc{r}} is 2-sawtooth, whereby the condition m2(k3)/l1m\leq 2^{(k-3)/l-1} implies

and the upper bound transfers since R(σ\textscr;2,2;k)N(σ\textscr;2,2k)\mathfrak{R}(\sigma_{\textsc{r}};2,2;k)\subseteq\mathfrak{N}(\sigma_{\textsc{r}};2,2k).

Related work

The standard classical result on the representation power of neural networks is due to Cybenko (1989), who proved that neural networks can approximate continuous functions over d^{d} arbitrarily well. This result, however, is for flat networks.

An early result showing the benefits of depth is due to Håstad (1986), who established, via an incredible proof, that boolean circuits consisting only of and gates and or gates require exponential size in order to approximate the parity function well. These gates correspond to multiplication and addition over the boolean domain, and moreover the parity function is the Fourier basis over the boolean domain; as mentioned above, fmkf_{\textup{m}}^{k} as used here is a piecewise affine approximation of a Fourier basis, and it was suggested previously by Bengio and LeCun (2007) that Fourier transforms admit efficient representations with deep networks. Lastly, note that Håstad (1986)’s work has one of the same weaknesses as the present result, namely of only controlling a countable family of functions which is in no sense dense.

More generally, networks consisting of sum and product nodes, but now over the reals, have been studied in the machine learning literature, where it was showed by Bengio and Delalleau (2011) that again there is an exponential benefit to depth. While this result was again for a countable class of functions, more recent work by Cohen et al. (2015) aims to give a broader characterization.

Lastly, while this note was only concerned with finite sets of points, it is worthwhile to mention the relevance of representation power to statistical questions. Namely, by the seminal result of Anthony and Bartlett (1999, Theorem 8.14), the VC dimension of N(σ\textscr;m,l)\mathfrak{N}(\sigma_{\textsc{r}};m,l) is at most O(m8l2)\mathcal{O}(m^{8}l^{2}), indicating that these exponential representation benefits directly translate into statistical savings. Interestingly, note that fmkf_{\textup{m}}^{k} has an exponentially large Lipschitz constant (exactly 2k2^{k}), and thus an elementary statistical analysis via Lipschitz constants and Rademacher complexity (Bartlett and Mendelson, 2002) can inadvertently erase the benefits of depth as presented here.

References