Deep vs. shallow networks : An approximation theory perspective

Hrushikesh Mhaskar, Tomaso Poggio

Introduction

Deep Neural Networks especially of the convolutional type (DCNNs) have started a revolution in the field of artificial intelligence and machine learning, triggering a large number of commercial ventures and practical applications. Most deep learning references these days start with Hinton’s backpropagation and with Lecun’s convolutional networks (see for a nice review ). Of course, multilayer convolutional networks have been around at least as far back as the optical processing era of the 70s. Fukushima’s Neocognitron was a convolutional neural network that was trained to recognize characters. The HMAX model of visual cortex was described as a series of AND and OR layers to represent hierarchies of disjunctions of conjunctions. A version of the questions about the importance of hierarchies was asked in as follows: “A comparison with real brains offers another, and probably related, challenge to learning theory. The “learning algorithms” we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory? It seems that the learning theory of the type we have outlined does not offer any general argument in favor of hierarchical learning machines for regression or classification. This is somewhat of a puzzle since the organization of cortex – for instance visual cortex – is strongly hierarchical. At the same time, hierarchical learning systems show superior performance in several engineering applications.”

Ironically a mathematical theory characterizing the properties of DCNN’s and even simply why they work so well is still missing. Two of the basic theoretical questions about Deep Convolutional Neural Networks (DCNNs) are:

which classes of functions can they approximate well?

why is stochastic gradient descent (SGD) so unreasonably efficient?

In this paper we review and extend a theoretical framework that we have introduced very recently to address the first question . The theoretical results include answers to why and when deep networks are better than shallow by using the idealized model of a deep network as a directed acyclic graph (DAG), which we have shown to capture the properties a range of convolutional architectures recently used, such as the very deep convolutional networks of the ResNet type . For compositional functions conforming to a DAG structure with a small maximal indegree of the nodes, such as a binary tree structure, one can bypass the curse of dimensionality with the help of the blessings of compositionality (cf. for a motivation for this terminology). We demonstrate this fact using three examples : traditional sigmoidal networks, the ReLU networks commonly used in DCNN’s, and Gaussian networks. The results announced for the ReLU and Gaussian networks are new. We then give examples of different notions of sparsity for which we expect better performance of DCNN’s over shallow networks, and propose a quantitative measurement, called relative dimension, encapsulating each of these notions, independently of the different roles the various parameters play in each case.

In Section 2, we explain the motivation for considering compositional functions, and demonstrate how some older results on sigmoidal networks apply for approximation of these functions. In Section 3, we announce our new results in the case of shallow networks implementing the ReLU and Gaussian activation functions. The notion of a compositional function conforming to a DAG structure is explained in Section 4, in which we also demonstrate how the results in Section 3 lead to better approximation bounds for such functions. The ideas behind the proofs of these new theorems are sketched in Section 5. Finally, we make some concluding remarks in Section 6, pointing out a quantitative measurement for three notions of sparsity which we feel may be underlying the superior performance of deep networks.

Compositional functions

The purpose of this section is to introduce the concept of compositional functions, and illustrate by an example how this leads to a better approximation power for deep networks. In Sub-section 2.1, we explain how such functions arise in image processing and vision. In Sub-section 2.2, we review some older results for approximation by shallow networks implementing a sigmoidal activation function, and explain how a “good error propagation” helps to generalize these results for deep networks.

Many of the computations performed on images should reflect the symmetries in the physical world that manifest themselves through the image statistics. Assume for instance that a computational hierarchy such as

is given. Then shift invariance of the image statistics is reflected in the following property: the local node “processors” satisfy $h_{21}=h_{22}$ and $h_{11}=h_{12}=h_{13}=h_{14}$ since there is no reason for them to be different across an image. Similar invariances of image statistics – for instance to scale rotation – can be similarly used to constrain visual algorithms and their parts such as the local processes $h$ .

It is natural to ask whether the hierarchy itself – for simplicity the idealized binary tree of the Figure 3 – follows from a specific symmetry in the world and which one. A possible answer to this question follows from the fact that in natural images the target object is usually among several other objects at a range of scales and position. From the physical point of view, this is equivalent to the observation that there are several localized clusters of surfaces with similar properties (object parts, objects, scenes, etc). These basic aspects of the physical world are reflected in properties of the statistics of images: locality, shift invariance and scale invariance. In particular, locality reflects clustering of similar surfaces in the world – the closer to each other pixels are in the image, the more likely they are to be correlated. Thus nearby patches are likely to be correlated (because of locality), at all scales. Ruderman’s pioneering work concludes that this set of properties is equivalent to the statement that natural images consist of many object patches that may partly occlude each other (object patches are image patches which have similar properties because they are induced by local groups of surfaces with similar properties). We argue that Ruderman’s conclusion reflects the compositionality of objects and parts: parts are themselves objects, that is self-similar clusters of similar surfaces in the physical world. The property of compositionality was in fact a main motivation for hierarchical architectures such as Fukushima’s and later imitations of it such as HMAX which was described as a pyramid of AND and OR layers , that is a sequence of conjunctions and disjunctions. According to these arguments, compositional functions should be important for vision tasks because they reflect constraints on visual algorithms.

The following argument shows that compositionality of visual computations is a basic property that follows from the simple requirement of scalability of visual algorithms: an algorithm should not change if the size of the image (in pixels) changes. In other words, it should be possible to add or subtract simple reusable parts to the algorithm to adapt it to increased or decreased size of the image without changing its basic core.

The final step in the argument uses the universal approximation property to claim that a nonlinear node with two inputs and enough units (that is, channels) can approximate arbitrarily well each of the $H_{2}$ blocks. This leads to conclude that deep convolutional neural networks are natural approximators of scalable, shift-invariant functions.

2 An example

In this section, we illustrate the advantage of approximating a compositional function using deep networks corresponding to the compositional structure rather than a shallow network that does not take into account this structure.

For explaining our ideas for the deep network, we consider compositional functions conforming to a binary tree. For example, we consider functions of the form (cf. Figure 3)

For the hierarchical binary tree network, the spaces analogous to $W_{r,q}^{\mbox{NN}}$ are $W_{H,r,2}^{\mbox{NN}}$ , defined to be the class of all functions $f$ which have the same structure (e.g., (2.2)), where each of the constituent functions $h$ is in $W_{r,2}^{\mbox{NN}}$ (applied with only $2$ variables). We define the corresponding class of deep networks $\mathcal{D}_{n}$ to be set of all functions with the same structure, where each of the constituent functions is in $\mathcal{S}_{n}$ . We note that in the case when $q$ is an integer power of $2$ , the number of parameters involved in an element of $\mathcal{D}_{n}$ – that is, weights and biases, in a node of the binary tree is $(q-1)(q+2)n$ .

The following theorem (cf. ) estimates the degree of approximation for shallow and deep networks. We remark that the assumptions on $\sigma$ in the theorem below are not satisfied by the ReLU function $x\mapsto|x|$ , but they are satisfied by smoothing the function in an arbitrarily small interval around the origin.

Proof. Theorem 2.1(a) was proved by . To prove Theorem 2.1(b), we observe that each of the constituent functions being in $W_{r,2}^{\mbox{NN}}$ , (2.3) applied with $q=2$ implies that each of these functions can be approximated from $\mathcal{S}_{n}$ up to accuracy $n^{-r/2}$ . Our assumption that $f\in W_{H,r,2}^{\mbox{NN}}$ implies that each of these constituent functions is Lipschitz continuous. Hence, it is easy to deduce that, for example, if $P$ , $P_{1}$ , $P_{2}$ are approximations to the constituent functions $h$ , $h_{1}$ , $h_{2}$ , respectively within an accuracy of $\epsilon$ , then

for some constant $c>0$ independent of $\epsilon$ . This leads to (2.4). $\square$

The constants involved in ${\cal O}$ in (2.3) will depend upon the norms of the derivatives of $f$ as well as $\sigma$ . Thus, when the only a priori assumption on the target function is about the number of derivatives, then to guarantee an accuracy of $\epsilon$ , we need a shallow network with ${\cal O}(\epsilon^{-q/r})$ trainable parameters. If we assume a hierarchical structure on the target function as in Theorem 2.1, then the corresponding deep network yields a guaranteed accuracy of $\epsilon$ only with ${\cal O}(\epsilon^{-2/r})$ trainable parameters.

Shallow networks

In this section, we announce our results in the context of shallow networks in two settings. One is the setting of neural networks using the ReLU function $x\mapsto|x|=x_{+}+(-x)_{+}$ (Sub-section 3.1), and the other is the setting of Gaussian networks using an activation function of the form ${\bf x}\mapsto\exp(-|{\bf x}-{\bf w}|^{2})$ (Sub-section 3.2). It is our objective to generalize these results to the case of deep networks in Section 4.

Before starting with the mathematical details, we would like to make some remarks regarding the results in this section and in Section 4.

It seems unnatural to restrict the range of the constituent functions. Therefore, we are interested in approximating functions on the entire Euclidean space.

If one is interested only in error estimates analogous to those in Theorem 2.1, then our results need to be applied to functions supported on the unit cube. One way to ensure that that the smoothness is preserved is to consider a smooth extension of the function on the unit cube to the Euclidean space [25, Chapter VI], and then multiply this extension by a $C^{\infty}$ function supported on $^{q}$ , equal to $1$ on the unit cube. However, this destroys the constructive nature of our theorems.

A problem of central importance in approximation theory is to determine what constitutes the right smoothness and the right measurement of complexity. The number of parameters or the number of non–linear units is not necessarily the right measurement for complexity. Likewise, the number of derivatives is not necessarily the right measure for smoothness for every approximation process. In this paper, we illustrate this by showing that different smoothness classes and notions of complexity lead to satisfactory approximation theorems.

In the sequel, we will adopt the following convention. The notation $A\lesssim B$ means $A\leq cB$ for some generic positive constant $c$ that may depend upon fixed parameters in the discussion, such as $\gamma$ , $q$ , but independent of the target function and the number of parameters in the approximating network. By $A\sim B$ , we mean $A\lesssim B$ and $B\lesssim A$ .

Our first main theorem is the following Theorem 3.1. We note two technical novelties here. One is that the activation function $|\cdot|$ does not satisfy the conditions of Theorem 2.1. Second is that the approximation is taking place on the whole Euclidean space rather than on a cube as in Theorem 2.1.

Let $\gamma>0$ , $n\geq 1$ be an integer, $f\in W_{w,\gamma,q}$ . Then there exists $P\in\mathcal{R}_{n,q}$ such that

2 Gaussian networks

We wish to consider shallow networks where each channel evaluates a Gaussian non–linearity; i.e., Gaussian networks of the form

Since one of our goals is to show that our results on the upper bounds for the accuracy of approximation are the best possible for individual functions, the class $W_{r,q}$ needs to be refined somewhat. Toward that goal, we define next a regularization expression, known in approximation theory parlance as a $K$ –functional, by

Let $\{{\mathcal{C}}_{m}\}$ be a sequence of finite subsets with ${\mathcal{C}}_{m}\subset[-cm,cm]^{q}$ , with

Moreover, the coefficients of $G$ can be chosen as linear combinations of the data $\{f({\bf x}):{\bf x}\in{\mathcal{C}}_{m}\}$ .

We note that the set of centers ${\mathcal{C}}_{m}$ can be chosen arbitrarily subject to the conditions stated in the theorem; there is no training necessary to determine these parameters. Therefore, there are only ${\cal O}(m^{2q})$ coefficients to be found by training. This means that if we assume a priori that $f\in\mathcal{W}_{\gamma,q}$ , then the number of trainable parameters to theoretically guarantee an accuracy of $\epsilon>0$ is ${\cal O}(\epsilon^{-2q/\gamma})$ . For the unit ball $\mathcal{B}_{\gamma,q}$ of the class $\mathcal{W}_{\gamma,q}$ as defined in Section 3.2, the Bernstein inequality proved in leads to $d_{n}(\mathcal{B}_{\gamma,q})\sim n^{-\gamma/(2q)}$ . Thus, the estimate (3.5) is the best possible in terms of widths. This implies in particular that when the networks are computed using samples of $f$ to obtain an accuracy of $\epsilon$ in the approximation, one needs $\sim\epsilon^{-2q/\gamma}$ samples. When $f$ is compactly supported, $\|f\|_{\gamma,q}$ is of the same order of magnitude as the norm of $f$ corresponding to the $K$ -functional based on the smoothness class $W_{r}^{\mbox{NN}}$ in Section 2.2. However, the number of parameters is then not commesurate with the results in that section.

We observe that the width estimate holds for the approximation of the entire class, and hence, an agreement with such width estimate implies only that there exists a possibly pathological function for which the approximation estimate cannot be improved. How good is the estimate in Theorem 3.2 for individual functions? If we know that some oracle can give us Gaussian networks that achieve a given accuracy with a given complexity, does it necessarily imply that the target function is smooth as indicated by the above theorems? The following is a converse to Theorem 3.2 demonstrating that the accuracy asserted by these theorems is possible if and only if the target function is in the smoothness class required in these theorems. It demonstrates also that rather than the number of nonlinearities in the Gaussian network, it is the minimal separation among the centers that is the “right” measurement for the complexity of the networks. Theorem 3.3 below is a refinement of the corresponding result in .

We observe that Theorem 3.2 can be interpreted to give estimates on the degree of approximation by Gaussian networks either in terms of the number of non–linear units, or the number of trainable parameters, or the minimal separation among the centers, or the number of samples of the target function. Theorem 3.3 shows that the right model of complexity among these is the minimal separation among the centers. Using this measurement for complexity yields “matching” direct and converse theorems. Based on the results in , we expect that a similar theorem should be true also for ReLU networks.

Deep networks

The purpose of this section is to generalize the results in Section 3 to the case of deep networks. In Sub-section 4.1, we will formulate the concept of compositional functions in terms of a DAG, and introduce the related mathematical concepts for measuring the degree of approximation and smoothness. The approximation theory results in this context will be described in Sub-section 4.2.

Let $\mathcal{G}$ be a directed acyclic graph (DAG), with the set of nodes $V$ . A $\mathcal{G}$ –function is defined as follows. The in-edges to each node of $\mathcal{G}$ represents an input real variable. The node itself represents the evaluation of a real valued function of the inputs. The out-edges fan out the result of this evaluation. Each of the source node obtains an input from some Euclidean space. Other nodes can also obtain such an input. We assume that there is only one sink node, whose output is the $\mathcal{G}$ -function. For example, the DAG in Figure 4 represents the $\mathcal{G}$ –function

2 Approximation using deep networks

First, we discuss the analogue of Theorem 3.1 in Section 3.1 for deep networks conforming to the DAG $\mathcal{G}$ . We define the classes $\mathcal{G}X_{w}$ and $\mathcal{G}W_{w,\gamma}$ in accordance with the notation introduced in Section 4.1, and denote the norm on $\mathcal{G}X_{w}$ (respectively, $\mathcal{G}W_{w,\gamma}$ ) by $\|\cdot\|_{\mathcal{G},w}$ (respectively, $\|\cdot\|_{\mathcal{G},w,\gamma}$ ). The symbol $\mathcal{G}\mathcal{R}_{n}$ denotes the family of networks $\{P_{v}\in\mathcal{R}_{n,d(v)}\}_{v\in V}$ . The analogue of Theorem 3.1 is the following.

Let $1\leq\gamma\leq 2$ , $n\geq 1$ be an integer, $f\in\mathcal{G}W_{w,\gamma}$ , $d=\max_{v\in V}d(v)$ . Then there exists $P\in\mathcal{G}\mathcal{R}_{n}$ such that

We observe that if $P=\{P_{v}\}\in\mathcal{G}\mathcal{R}_{n}$ the number of trainable parameters in each constituent network $P_{v}$ is ${\cal O}(n)$ . Therefore, the total number of trainable parameters in $P$ is ${\cal O}(|V|n)$ . Equivalently, when the target function is in $\mathcal{G}W_{w,\gamma}$ , one needs ${\cal O}((\epsilon/|V|)^{-d/\gamma})$ units in a deep network to achieve an accuracy of at most $\epsilon$ . If one ignores the compositional structure of the target function, (3.1) shows that one needs $O(\epsilon^{-q/\gamma})$ units in a shallow network. Thus, a deep network conforming to the structure of the target function yields a substantial improvement over a shallow network if $d\ll q$ .

The analogue of Theorem 3.2 and Theorem 3.3 are parts (a) and (b) respectively of the following Theorem 4.2.

(a) For each $v\in V$ , let $\{{\mathcal{C}}_{m,v}\}$ be a sequence of finite subsets as described in Theorem 3.2. Let $\gamma\geq 1$ and $f\in\mathcal{G}\mathcal{W}_{\gamma}$ . Then for integer $m\geq 1$ , there exists $G\in\mathcal{G}\mathcal{N}_{\max|{\mathcal{C}}_{m,v}|,m}$ with centers of the constituent network $G_{v}$ at vertex $v$ at points in ${\mathcal{C}}_{m,v}$ such that

Moreover, the coefficients of each constituent $G_{v}$ can be chosen as linear combinations of the data $\{f_{v}({\bf x}):{\bf x}\in{\mathcal{C}}_{m,v}\}$

Then $f\in\mathcal{G}\mathcal{W}_{\gamma}$ .

Ideas behind the proofs

A parametrization of the upper hemisphere $\SS^{q}_{+}=\{\mathbf{u}\in\SS^{q}:u_{q+1}>0\}$ of $\SS^{q}$ is given by

Next, we define an operator $\mathcal{S}$ on $X_{w,q}$ by

We note that if $f\in X_{w,q}$ , then $(|{\bf x}|^{2}+1)^{-1/2}f({\bf x})\to 0$ as $|{\bf x}|\to\infty$ . Therefore, $\mathcal{S}(f)$ is well defined, and defines an even, continuous function on $\SS^{q}$ , equal to on the “equator” $u_{q+1}=0$ .

The function $t\to|t|$ can be expressed in an expansion

with the series converging on compact subsets of $(-1,1)$ .

We define the $\phi$ –derivative of $F$ formally by

The estimate (5.12) now leads easily for all $f\in X_{w,q}$ for which $\mathcal{D}(f)\in X_{w,q}$ to

where ${\bf x}_{k}$ is defined by $({\bf x}_{k})_{j}=(\mathbf{v}_{k})_{j}/(\mathbf{v}_{k})_{q+1}$ , $j=1,\cdots,q$ .

In order to define the smoothness class $W_{w,\gamma,q}$ , we first define the $K$ –functional

where the infimum is taken over all $g$ for which $\mathcal{D}g\in X_{w,q}$ . Finally, the smoothness class $W_{w,\gamma,q}$ is defined to be the set of all $f\in X_{w,q}$ such that

The estimate (5.14) then leads to (3.1) in a standard manner.

We remark here that the unit cube $^{q}$ is mapped to some compact subset of $\SS^{q}_{+}$ . However, the operator $\mathcal{D}$ does not have an obvious interpretation in terms of ordinary derivatives on the cube.

2 Theorems 3.2 and 3.3.

In this section, let $\{\psi_{j}\}$ denote the sequence of orthonormalized Hermite functions; i.e., [26, Formulas (5.5.3), (5.5.1)]

The multivariate Hermite functions are defined by

Using the Mehler formula [1, Formula (6.1.13)], it can be shown that

We combine the results on function approximation and quadrature formulas developed in to complete the proof of Theorem 3.2.

To prove Theorem 3.3, we modify the ideas in to obtain a Berstein–type inequality for Gaussian networks of the form

The proof of Theorem 3.3 then follows standard arguments in approximation theory.

3 Results in Section 4.2.

Theorems 4.1 and 4.2(a) follow from Theorems 3.1 and 3.2 respectively by the “good error propagation property” as in the proof of Theorem 2.1(b) from Theorem 2.1(a). Our definitions of the norms for function spaces associated with deep networks ensure that a bound of the form (4.5) implies a bound of the form (3.6) for each of the constituent functions. Therefore, Theorem 3.3 leads to Theorem 4.2(b).

Blessed representations

As pointed out in Sections 2.2 and 4, there are deep networks – for instance of the convolutional type – that can bypass the curse of dimensionality when learning functions blessed with compositionality. In this section, we explore possible definitions of blessed function representations that can be exploited by deep but not by shallow networks to reduce the complexity of learning. We list three examples, each of a different type.

The main example consists of the compositional functions defined in this paper in terms of DAGs (Figure 4). The simplest DAG is a binary tree (see Figure 3 ) corresponding to compositional functions of the type

As explained in previous sections, such compositional functions can be approximated well by deep networks. In particular, if the function form above has shift symmetry, it takes the form

that can be approximated well by a Deep Convolutional Network (that is with “weight sharing”) but not by a shallow one. This first example is important because compositionality seems a common feature of algorithms applied to signals originating from our physical world, such as images. Not surprisingly, binary-like tree structures (the term binary-like covers obvious extensions to two-dimensional inputs such as images) represemt well the architecture of the most successful DCNN.

Consider that the proof of Theorem 2.1 relies upon the fact that when $\sigma$ satisfies the conditions of that theorem, the algebraic polynomials in $q$ variables of (total or coordinatewise) degree $<n$ are in the uniform closure of the span of ${\cal O}(n^{q})$ functions of the form ${\bf x}\mapsto\sigma({\bf w}\cdot{\bf x}+b)$ . The advantage of deep nets is due to the fact that polynomials of smaller number of variables lead to a nominally high degree polynomial through repeated composition. As a simple example, we consider the polynomial

where $Q_{1}$ , $Q_{2}$ , $Q_{3}$ are bivariate polynomials of total degree $\leq 2$ . Nominally, $Q$ is a polynomial of total degree $4096$ in $4$ variables, and hence, requires $\small{\left(\!\!\begin{array}[]{c}{4100}\\ {4}\end{array}\!\!\right)}\approx(1.17)*10^{13}$ parameters without any prior knowledge of the compositional structure. However, the compositional structure implies that each of these coefficients is a function of only $18$ parameters. In this case, the representation which makes deep networks approximate the function with a smaller number of parameters than shallow networks is based on polynomial approximation of functions of the type $g(g(g()))$ .

As a different example, we consider a function which is a linear combination of $n$ tensor product Chui–Wang spline wavelets , where each wavelet is a tensor product cubic spline. It is shown in that is impossible to implement such a function using a shallow neural network with a sigmoidal activation function using ${\cal O}(n)$ neurons, but a deep network with the activation function $(x_{+})^{2}$ can do so. This case is even less general than the previous one but it is interesting because shallow networks are provably unable to implement these splines using a fixed number of units. In general, this does not avoid the curse of dimensionality, but it shows that deep networks provide, unlike shallow networks, local and multi–scale approximation since the spline wavelets are compactly supported with shrinking supports.

Examples of functions that cannot be represented efficiently by shallow networks have been given very recently by . The results in illustrate the power of deep networks compared to shallow ones, similar in spirit to .

The previous examples show three different kinds of “sparsity” that allow a blessed representation by deep networks with a much smaller number of parameters than by shallow networks. This state of affairs motivates the following general definition of relative dimension. Let $d_{n}(W)$ be the non–linear $n$ -width of a function class $W$ . For the unit ball $\mathcal{B}_{\gamma,q}$ of the class $\mathcal{W}_{\gamma,q}$ as defined in Section 3.2, the Bernstein inequality proved in leads to $d_{n}(\mathcal{B}_{\gamma,q})\sim n^{-\gamma/(2q)}$ . In contrast, for the unit ball $\mathcal{G}\mathcal{B}_{\gamma}$ of the class we have shown that $d_{n}(\mathcal{G}\mathcal{B}_{\gamma})\leq cn^{-\gamma/(2d)}$ , where $d=\max_{v\in V}d(v)$ .

As we mentioned in previous papers this definition, and in fact most of the previous results, can be specialized to the class of Boolean functions which map the Boolean cube into reals, yielding a number of known and new results. This application will be described in a forthcoming paper.

Conclusion

A central problem of approximation theory is to determine the correct notions of smoothness classes of target functions and the correct measurement of complexity for the approximation spaces. This definition is dictated by having “matching” direct and converse theorems. In this paper, we have demonstrated how different smoothness classes lead to satisfactory results for approximation by ReLU networks and Gaussian networks on the entire Euclidean space. Converse theorem is proved for Gaussian networks, and results in suggest that a similar statement ought to be true for ReLU networks as well. These results indicate that the correct measurement of network complexity is not necessarily the number of parameters. We have initiated a discussion of notions of sparsity which we hope would add deeper insights into this area.