On the Expressive Power of Deep Learning: A Tensor Analysis

Nadav Cohen, Or Sharir, Amnon Shashua

Introduction

The expressive power of neural networks is achieved through depth. There is mounting empirical evidence that for a given budget of resources (e.g. neurons), the deeper one goes, the better the eventual performance will be. However, existing theoretical arguments that support this empirical finding are limited. There have been many attempts to theoretically analyze function spaces generated by network architectures, and their dependency on network depth and size. The prominent approach for justifying the power of depth is to show that deep networks can efficiently express functions that would require shallow networks to have super-polynomial size. We refer to such scenarios as instances of depth efficiency. Unfortunately, existing results dealing with depth efficiency (e.g. [Hastad(1986), Håstad and Goldmann(1991), Delalleau and Bengio(2011), Martens and Medabalimi(2014)]) typically apply to specific network architectures that do not resemble ones commonly used in practice. In particular, none of these results apply to convolutional networks ([LeCun and Bengio(1995)]), which represent the most empirically successful and widely used deep learning architecture to date. A further limitation of current results is that they merely show existence of depth efficiency (i.e. of functions that are efficiently realizable with a certain depth but cannot be efficiently realized with shallower depths), without providing any information as to how frequent this property is. These shortcomings of current theory are the ones that motivated our work.

The architectural features that specialize convolutional networks compared to classic feed-forward fully-connected networks are threefold. The first feature, locality, refers to the connection of a neuron only to neighboring neurons in the preceding layer, as opposed to having the entire layer drive it. In the context of image processing (the most common application of convolutional networks), locality is believed to reflect the inherent compositional structure of data – the closer pixels are in an image, the more likely they are to be correlated. The second architectural feature of convolutional networks is sharing, which means that different neurons in the same layer, connected to different neighborhoods in the preceding layer, share the same weights. Sharing, which together with locality gives rise to convolution, is motivated by the fact that in natural images, the semantic meaning of a pattern often does not depend on its location (i.e. two identical patterns appearing in different locations of an image often convey the same semantic content). Finally, the third architectural idea of convolutional networks is pooling, which is essentially an operator that decimates layers, replacing neural activations in a spatial window by a single value (e.g. their maximum or average). In the context of images, pooling induces invariance to translations (which often do not affect semantic content), and in addition is believed to create a hierarchy of abstraction in the patterns neurons respond to. The three architectural elements of locality, sharing and pooling, which have facilitated the great success of convolutional networks, are all lacking in existing theoretical studies of depth efficiency.

In this paper we introduce a convolutional arithmetic circuit architecture that incorporates locality, sharing and pooling. Arithmetic circuits (also known as Sum-Product Networks, [Poon and Domingos(2011)]) are networks with two types of nodes: sum nodes, which compute a weighted sum of their inputs, and product nodes, computing the product of their inputs. We use sum nodes to implement convolutions (locality with sharing), and product nodes to realize pooling. The models we arrive at may be viewed as convolutional networks with product pooling and linear point-wise activation. They are attractive on three accounts. First, as discussed in app. E, convolutional arithmetic circuits are equivalent to SimNets, a new deep learning architecture that has recently demonstrated promising empirical results on various image recognition benchmarks ([Cohen et al.(2016)Cohen, Sharir, and Shashua]). Second, as we show in sec. 3, convolutional arithmetic circuits are realizations of hierarchical tensor decompositions (see [Hackbusch(2012)]), opening the door to various mathematical and algorithmic tools for their analysis and implementation. Third, the depth efficiency of convolutional arithmetic circuits, which we analyze in sec. 4, was shown in the subsequent work of [Cohen and Shashua(2016)] to be superior to the depth efficiency of the popular convolutional rectifier networks, namely convolutional networks with rectified linear (ReLU) activation and max or average pooling.

Employing machinery from measure theory and matrix algebra, made available through their connection to hierarchical tensor decompositions, we prove a number of fundamental results concerning the depth efficiency of our convolutional arithmetic circuits. Our main theoretical result (thm. 4.1 and corollary 4.2) states that besides a negligible (zero measure) set, all functions that can be realized by a deep network of polynomial size, require exponential size in order to be realized, or even approximated, by a shallow network. When translated to the viewpoint of tensor decompositions, this implies that almost all tensors realized by Hierarchical Tucker (HT) decomposition ([Hackbusch and Kühn(2009)]) cannot be efficiently realized by the classic CP (rank-1) decomposition. To the best of our knowledge, this result is unknown to the tensor analysis community, in which the advantage of HT over CP is typically demonstrated through specific examples of tensors that can be efficiently realized by the former and not by the latter. Following our main result, we present a generalization (thm. A.1 and corollary A.2) that compares networks of arbitrary depths, showing that the amount of resources one has to pay in order to maintain representational power while trimming down layers of a network grows double exponentially w.r.t. the number of layers cut off. We also characterize cases in which dropping a single layer bears an exponential price.

The remainder of the paper is organized as follows. In sec. 2 we briefly review notations and mathematical background required in order to follow our work. This is followed by sec. 3, which presents our convolutional arithmetic circuits and establishes their equivalence with tensor decompositions. Our theoretical analysis is covered in sec. 4. Finally, sec. 5 concludes. In order to keep the manuscript at a reasonable length, we defer our detailed survey of related work to app. D, covering works on the depth efficiency of boolean circuits, arithmetic circuits and neural networks, as well as different applications of tensor analysis in the field of deep learning.

Preliminaries

Tensors of the form $\mathop{\otimes}_{i=1}^{N}{\mathbf{v}}^{(i)}$ are called pure or elementary, and are regarded as having rank-1 (assuming ${\mathbf{v}}^{(i)}\neq 0~{}~{}\forall i$ ). It is not difficult to see that any tensor can be expressed as a sum of rank-1 tensors:

A representation as above is called a CANDECOMP/PARAFAC decomposition of ${\mathcal{A}}$ , or in short, a CP decomposition CP decomposition is regarded as the classic and most basic tensor decomposition, dating back to the beginning of the 20’th century (see [Kolda and Bader(2009)] for a historic survey). . The CP-rank of ${\mathcal{A}}$ is defined as the minimum number of terms in a CP decomposition, i.e. as the minimal $Z$ for which eq. 1 can hold. Notice that for a tensor of order 2, i.e. a matrix, this definition of CP-rank coincides with that of standard matrix rank.

A repeating concept in this paper is that of measure zero. More broadly, our analysis is framed in measure theoretical terms. While an introduction to the field is beyond the scope of the paper (the interested reader is referred to [Jones(2001)]), it is possible to intuitively grasp the ideas that form the basis to our claims. When dealing with subsets of a Euclidean space, the standard and most natural measure in a sense is called the Lebesgue measure. This is the only measure we consider in our analysis. A set of (Lebesgue) measure zero can be thought of as having zero “volume” in the space of interest. For example, the interval between $(0,0)$ and $(1,0)$ has zero measure as a subset of the 2D plane, but has positive measure as a subset of the 1D $x$ -axis. An alternative way to view a zero measure set $S$ follows the property that if one draws a random point in space by some continuous distribution, the probability of that point hitting $S$ is necessarily zero. A related term that will be used throughout the paper is almost everywhere, which refers to an entire space excluding, at most, a set of zero measure.

Convolutional Arithmetic Circuits

Our eventual aim is to realize score functions $h_{y}$ with a layered network architecture. As a first step along this path, we notice that $h_{y}({\mathbf{x}}_{1},\ldots,{\mathbf{x}}_{N})$ is fully determined by the activations of the $M$ representation functions $f_{\theta_{1}}{\ldots}f_{\theta_{M}}$ on the $N$ input vectors ${\mathbf{x}}_{1}{\ldots}{\mathbf{x}}_{N}$ . In other words, given $\{f_{\theta_{d}}({\mathbf{x}}_{i})\}_{d\in[M],i\in[N]}$ , the score $h_{y}({\mathbf{x}}_{1},\ldots,{\mathbf{x}}_{N})$ is independent of the input. It is thus natural to consider the computation of these $M{\cdot}N$ numbers as the first layer of our networks. This layer, referred to as the representation layer, may be conceived as a convolutional operator with $M$ channels, each corresponding to a different function applied to all input vectors (see fig. 1).

Once we have constrained our score functions to have the structure depicted in eq. 2, learning a classifier reduces to estimation of the parameters $\theta_{1}{\ldots}\theta_{M}$ , and the coefficient tensors ${\mathcal{A}}^{1}{\ldots}{\mathcal{A}}^{Y}$ . The computational challenge is that the latter tensors are of order $N$ (and dimension $M$ in each mode), having an exponential number of entries ( $M^{N}$ each). In the next subsections we utilize tensor decompositions (factorizations) to address this computational challenge, and show how they are naturally realized by convolutional arithmetic circuits.

The most straightforward way to factorize a tensor is through a CP (rank-1) decomposition (see sec. 2). Consider a joint CP decomposition for the coefficient tensors $\{{\mathcal{A}}^{y}\}_{y\in{\mathcal{Y}}}$ :

Substituting our CP decomposition (eq. 3) into the expression for the score functions in eq. 2, we obtain:

From this we conclude that the network illustrated in fig. 1 implements a classifier (score functions) under the CP decomposition in eq. 3. We refer to this network as CP model. The network consists of a representation layer followed by a single hidden layer, which in turn is followed by the output. The hidden layer begins with a $1\times 1$ conv operator, which is simply a 3D convolution with $Z$ channels and receptive field $1\times 1$ . The convolution may operate without coefficient sharing, i.e. the filters that generate feature maps by sliding across the previous layer may have different coefficients at different spatial locations. This is often referred to in the deep learning community as a locally-connected operator (see [Taigman et al.(2014)Taigman, Yang, Ranzato, and Wolf]). To obtain a standard convolutional operator, simply enforce coefficient sharing by constraining the vectors ${\mathbf{a}}^{z,i}$ in the CP decomposition (eq. 3) to be equal to each other for different values of $i$ (this setting is discussed in sec. 3.3). Following conv operator, the hidden layer includes global product pooling. Feature maps generated by conv are reduced to singletons through multiplication of their entries, creating a vector of dimension $Z$ . This vector is then mapped into the $Y$ network outputs through a final dense linear layer.

To recap, CP model (fig. 1) is a shallow (single hidden layer) convolutional arithmetic circuit that realizes the CP decomposition (eq. 3). It is universal, i.e. it can realize any coefficient tensors with large enough size ( $Z$ ). Unfortunately, since the CP-rank of a generic tensor is exponential in its order (see [Hackbusch(2012)]), the size required for CP model to be universal is exponential ( $Z$ exponential in $N$ ).

In this subsection we present a deep network that corresponds to the recently introduced Hierarchical Tucker tensor decomposition ([Hackbusch and Kühn(2009)]), which we refer to in short as HT decomposition. The network, dubbed HT model, is universal. Specifically, any set of tensors ${\mathcal{A}}^{y}$ represented by CP model can be represented by HT model with only a polynomial penalty in terms of resources. The advantage of HT model, as we show in sec. 4, is that in almost all cases it generates tensors that require an exponential size in order to be realized, or even approximated, by CP model. Put differently, if one draws the weights of HT model by some continuous distribution, with probability one, the resulting tensors cannot be approximated by a polynomial CP model. Informally, this implies that HT model is exponentially more expressive than CP model.

HT model is based on the hierarchical tensor decomposition in eq. 4, which is a special case of the HT decomposition as presented in [Hackbusch and Kühn(2009)] (in the latter’s terminology, we restrict the matrices $A^{l,j,\gamma}$ to be diagonal). Our construction and theoretical results apply to the general HT decomposition as well, with the specialization done merely to bring forth a network that resembles current convolutional networks If we had not constrained $A^{l,j,\gamma}$ to be diagonal, pooling operations would involve entries from different channels. .

The decomposition in eq. 4 recursively constructs the coefficient tensors $\{{\mathcal{A}}^{y}\}_{y\in[Y]}$ by assembling vectors $\{{\mathbf{a}}^{0,j,\gamma}\}_{j\in[N],\gamma\in[r_{0}]}$ into tensors $\{\phi^{l,j,\gamma}\}_{l\in[L-1],j\in[N/2^{l}],\gamma\in[r_{l}]}$ in an incremental fashion. The index $l$ stands for the level in the decomposition, $j$ represents the “location” within level $l$ , and $\gamma$ corresponds to the individual tensor in level $l$ and location $j$ . $r_{l}$ is referred to as level- $l$ rank, and is defined to be the number of tensors in each location of level $l$ (we denote for completeness $r_{L}:=Y$ ). The tensor $\phi^{l,j,\gamma}$ has order $2^{l}$ , and we assume for simplicity that $N$ – the order of ${\mathcal{A}}^{y}$ , is a power of $2$ (this is merely a technical assumption also made in [Hackbusch and Kühn(2009)], it does not limit the generality of our analysis).

The hierarchical decomposition (eq. 4) is universal, i.e. with large enough ranks $r_{l}$ it can represent any tensors. Moreover, it is a super-set of the CP decomposition (eq. 3). That is to say, all tensors representable by a CP decomposition having $Z$ components are also representable by a hierarchical decomposition with ranks $r_{0}=r_{1}=\cdots=r_{L-1}=Z$ To see this, simply assign the first level vectors ${\mathbf{a}}^{0,j,\gamma}$ with CP’s basis vectors, the last level weights with CP’s per-class weights, and the intermediate levels’ weights with indicator vectors. . Note that this comes with a polynomial penalty – the number of parameters increases from $N\cdot M\cdot Z+Z\cdot Y$ in the CP decomposition, to $N\cdot M\cdot Z+Z\cdot Y+N\cdot Z^{2}$ in the hierarchical decomposition. However, as we show in sec. 4, the gain in expressive power is exponential.

Plugging the expression for ${\mathcal{A}}^{y}$ in our hierarchical decomposition (eq. 4) into the score function $h_{y}$ given in eq. 2, we obtain the network displayed in fig. 2 – HT model. This network includes a representation layer followed by $L=\log_{2}N$ hidden layers which in turn are followed by the output. As in the shallow CP model (fig. 1), the hidden layers consist of $1\times 1$ conv operators followed by product pooling. The difference is that instead of a single hidden layer collapsing the entire spatial structure through global pooling, hidden layers now pool over size- $2$ windows, decimating feature maps by a factor of two (no overlaps). After $L=\log_{2}N$ such layers feature maps are reduced to singletons, and we arrive at a 1D structure with $r_{L-1}$ nodes. This is then mapped into $Y$ network outputs through a final dense linear layer. We note that the network’s size- $2$ pooling windows (and the resulting number of hidden layers $L=\log_{2}N$ ) correspond to the fact that our hierarchical decomposition (eq. 4) is based on a full binary tree over modes, i.e. it combines (through tensor product) two tensors at a time. We focus on this setting solely for simplicity of presentation, and since it is the one presented in [Hackbusch and Kühn(2009)]. Our analysis (sec. 4) could easily be adapted to hierarchical decompositions based on other trees (taking tensor products between more than two tensors at a time), and that would correspond to networks with different pooling window sizes and resulting depths.

HT model (fig. 2) is conceptually divided into two parts. The first is the representation layer, transforming input vectors ${\mathbf{x}}_{1}{\ldots}{\mathbf{x}}_{N}$ into $N{\cdot}M$ real-valued scalars $\{f_{\theta_{d}}({\mathbf{x}}_{i})\}_{i\in[N],d\in[M]}$ . The second and main part of the network, which we view as an “inference” engine, is the convolutional arithmetic circuit that takes the $N{\cdot}M$ measurements produced by the representation layer, and accordingly computes $Y$ class scores at the output layer.

To recap, we have now a deep network (fig. 2), which we refer to as HT model, that computes the score functions $h_{y}$ (eq. 2) with coefficient tensors ${\mathcal{A}}^{y}$ hierarchically decomposed as in eq. 4. The network is universal in the sense that with enough channels $r_{l}$ , any tensors may be represented. Moreover, the model is a super-set of the shallow CP model presented in sec. 3.1. The question of depth efficiency now naturally arises. In particular, we would like to know if there are functions that may be represented by a polynomially sized deep HT model, yet require exponential size from the shallow CP model. The answer, as described in sec. 4, is that almost all functions realizable by HT model meet this property. In other words, the set of functions realizable by a polynomial CP model has measure zero in the space of functions realizable by a given polynomial HT model.

3 Shared Coefficients for Convolution

The $1\times 1$ conv operator in our networks (see fig. 1 and 2) implements a local linear transformation with coefficients generally being location-dependent. In the special case where coefficients do not depend on location, i.e. remain fixed across space, the local linear transformation becomes a standard convolution. We refer to this setting as coefficient sharing. Sharing is a widely used structural constraint, one of the pillars behind the successful convolutional network architecture. In the context of image processing (prominent application of convolutional networks), sharing is motivated by the observation that in natural images, the semantic content of a pattern often does not depend on its location. In this subsection we explore the effect of sharing on the expressiveness of our networks, or more specifically, on the coefficient tensors ${\mathcal{A}}^{y}$ they can represent.

For CP model, coefficient sharing amounts to setting ${\mathbf{a}}^{z}:={\mathbf{a}}^{z,1}=\cdots={\mathbf{a}}^{z,N}$ in the CP decomposition (eq. 3), transforming the latter to a symmetric CP decomposition:

CP model with sharing is not universal (not all tensors ${\mathcal{A}}^{y}$ are representable, no matter how large $Z$ is allowed to be) – it can only represent symmetric tensors.

In the case of HT model, sharing amounts to applying the following constraints on the hierarchical decomposition in eq. 4: ${\mathbf{a}}^{l,\gamma}:={\mathbf{a}}^{l,1,\gamma}=\cdots={\mathbf{a}}^{l,\nicefrac{{N}}{{2^{l}}},\gamma}$ for every $l=0{\ldots}L-1$ and $\gamma=1{\ldots}r_{l}$ . Note that in this case universality is lost as well, but nonetheless generated tensors are not limited to be symmetric, already demonstrating an expressive advantage of deep models over shallow ones. In sec. 4 we take this further by showing that the shared HT model is exponentially more expressive than CP model, even if the latter is not constrained by sharing.

Theorems of Network Capacity

The first contribution of this paper, presented in sec. 3, is the equivalence between deep learning architectures successfully employed in practice, and tensor decompositions. Namely, we showed that convolutional arithmetic circuits as in fig. 2, which are in fact SimNets that have demonstrated promising empirical performance (see app. E), may be formulated as hierarchical tensor decompositions. As a second contribution, we make use of the established link between arithmetic circuits and tensor decompositions, combining theoretical tools from these two worlds, to prove results that are of interest to both deep learning and tensor analysis communities. This is the focus of the current section.

The fundamental theoretical result proven in this paper is the following:

Let ${\mathcal{A}}^{y}$ be a tensor of order $N$ and dimension $M$ in each mode, generated by the recursive formulas in eq. 4. Define $r:=\min\{r_{0},M\}$ , and consider the space of all possible configurations for the parameters of the composition – $\{{\mathbf{a}}^{l,j,\gamma}\}_{l,j,\gamma}$ . In this space, the generated tensor ${\mathcal{A}}^{y}$ will have CP-rank of at least $r^{\nicefrac{{N}}{{2}}}$ almost everywhere (w.r.t. Lebesgue measure). Put differently, the configurations for which the CP-rank of ${\mathcal{A}}^{y}$ is less than $r^{\nicefrac{{N}}{{2}}}$ form a set of measure zero. The exact same result holds if we constrain the composition to be “shared”, i.e. set ${\mathbf{a}}^{l,j,\gamma}\equiv{\mathbf{a}}^{l,\gamma}$ and consider the space of $\{{\mathbf{a}}^{l,\gamma}\}_{l,\gamma}$ configurations.

From the perspective of deep learning, thm. 4.1 leads to the following corollary:

Given linearly independent representation functions $\{f_{\theta_{d}}\}_{d\in[M]}$ , randomizing the weights of HT model (sec. 3.2) by a continuous distribution induces score functions $h_{y}$ that with probability one, cannot be approximated arbitrarily well (in $L^{2}$ sense) by a CP model (sec. 3.1) with less than $\min\{r_{0},M\}^{\nicefrac{{N}}{{2}}}$ hidden channels. This result holds even if we constrain HT model with weight sharing (sec. 3.3) while leaving CP model in its general form.

That is to say, besides a negligible set, all functions that can be realized by a polynomially sized HT model (with or without weight sharing), require exponential size in order to be realized, or even approximated, by CP model. Most of the previous works relating to depth efficiency (see app. D) merely show existence of functions that separate depths (i.e. that are efficiently realizable by a deep network yet require super-polynomial size from shallow networks). Corollary 4.2 on the other hand establishes depth efficiency for almost all functions that a deep network can implement. Equally importantly, it applies to deep learning architectures that are being successfully employed in practice (SimNets – see app. E).

Adopting the viewpoint of tensor analysis, thm. 4.1 states that besides a negligible set, all tensors realized by HT (Hierarchical Tucker) decomposition cannot be represented by the classic CP (rank-1) decomposition if the latter has less than an exponential number of terms As stated in sec. 3.2, the decomposition in eq. 4 to which thm. 4.1 applies is actually a special case of HT decomposition as introduced in [Hackbusch and Kühn(2009)]. However, the theorem and its proof can easily be adapted to account for the general case. We focus on the special case merely because it corresponds to convolutional arithmetic circuit architectures used in practice. . To the best of our knowledge, this result has never been proved in the tensor analysis community. In the original paper introducing HT decomposition ([Hackbusch and Kühn(2009)]), as a motivating example, the authors present a specific tensor that is efficiently realizable by HT decomposition while requiring an exponential number of terms from CP decomposition The same motivating example is given in a more recent textbook introducing tensor analysis ([Hackbusch(2012)]). . Our result strengthens this motivation considerably, showing that it is not just one specific tensor that favors HT over CP, but rather, almost all tensors realizable by HT exhibit this preference. Taking into account that any tensor realized by CP can also be realized by HT with only a polynomial penalty in the number of parameters (see sec. 3.2), this implies that in an asymptotic sense, HT decomposition is exponentially more efficient than CP decomposition.

The complete proofs of thm. 4.1 and corollary 4.2 are given in app. B. We provide here an outline of the main tools employed and arguments made along these proofs.

To prove thm. 4.1 we combine approaches from the worlds of circuit complexity and tensor decompositions. The first class of machinery we employ is matrix algebra, which has proven to be a powerful source of tools for analyzing the complexity of circuits. For example, arithmetic circuits have been analyzed through what is called the partial derivative matrix (see [Raz and Yehudayoff(2009)]), and for boolean circuits a widely used tool is the communication matrix (see [Karchmer(1989)]). We gain access to matrix algebra by arranging tensors that take part in the CP and HT decompositions as matrices, a process often referred to as matricization. With matricization, the tensor product translates to the Kronecker product, and the properties of the latter become readily available. The second tool-set we make use of is measure theory, which prevails in the study of tensor decompositions, but is much less frequent in analyses of circuit complexity. In order to frame a problem in measure theoretical terms, one obviously needs to define a measure space of interest. For tensor decompositions, the straightforward space to focus on is that of the decomposition variables. For general circuits on the other hand, it is often unclear if defining a measure space is at all appropriate. However, when circuits are considered in the context of machine learning they are usually parameterized, and defining a measure space on top of these parameters is an effective approach for studying the prevalence of various properties in hypotheses spaces.

Our proof of thm. 4.1 traverses through the following path. We begin by showing that matricizing a rank-1 tensor produces a rank-1 matrix. This implies that the matricization of a tensor generated by a CP decomposition with $Z$ terms has rank at most $Z$ . We then turn to show that the matricization of a tensor generated by the HT decomposition in eq. 4 has rank at least $\min\{r_{0},M\}^{N/2}$ almost everywhere. This is done through induction over the levels of the decomposition ( $l=1{\ldots}L$ ). For the first level ( $l=1$ ), we use a combination of measure theoretical and linear algebraic arguments to show that the generated matrices have maximal rank ( $\min\{r_{0},M\}$ ) almost everywhere. For the induction step, the facts that under matricization tensor product translates into Kronecker product, and that the latter increases ranks multiplicatively If $\odot$ denotes the Kronecker product, then for any matrices $A$ and $B$ : $rank(A{\odot}B)=rank(A){\cdot}rank(B)$ . , imply that matricization ranks in the current level are generally equal to those in the previous level squared. Measure theoretical claims are then made to ensure that this indeed takes place almost everywhere.

To prove corollary 4.2 based on thm. 4.1, we need to show that the inability of CP model to realize a tensor generated by HT model, implies that the former cannot approximate score functions produced by the latter. In general, the set of tensors expressible by a CP decomposition is not topologically closed Hence the definition of border rank, see [Hackbusch(2012)]. , which implies that a-priori, it may be that CP model can approximate tensors generated by HT model even though it cannot realize them. However, since the proof of thm. 4.1 was achieved through separation of matrix rank, distances are indeed positive and CP model cannot approximate HT model’s tensors almost always. To translate from tensors to score functions, we simply note that in a finite-dimensional Hilbert space convergence in norm implies convergence in coefficients under any basis. Therefore, in the space of score functions (eq. 2) convergence in norm implies convergence in coefficients under the basis $\{({\mathbf{x}}_{1},\ldots,{\mathbf{x}}_{N}){\mapsto}\prod_{i=1}^{N}f_{\theta_{d_{i}}}({\mathbf{x}}_{i})\}_{d_{1}{\ldots}d_{N}\in[M]}$ . That is to say, it implies convergence in coefficient tensors.

2 Generalization

Thm. 4.1 and corollary 4.2 compare the expressive power of the deep HT model (sec. 3.2) to that of the shallow CP model (sec. 3.1). One may argue that such an analysis is lacking, as it does not convey information regarding the importance of each individual layer. In particular, it does not shed light on the advantage of very deep networks, which at present provide state of the art recognition accuracy, compared to networks of more moderate depth. For this purpose we present a generalization, specifying the amount of resources one has to pay in order to maintain representational power while layers are incrementally cut off from a deep network. For conciseness we defer this analysis to app. A, and merely state here our final conclusions. We find that the representational penalty is double exponential w.r.t. the number of layers removed. In addition, there are certain cases where the removal of even a single layer leads to an exponential inflation, falling in line with the suggestion of [Bengio(2009)].

Discussion

In this work we address a fundamental issue in deep learning – the expressive efficiency of depth. There have been many attempts to theoretically analyze this question, but from a practical machine learning perspective, existing results are limited. Most of the results apply to very specific types of networks that do not resemble ones used in practice, and none of the results account for the locality-sharing-pooling paradigm which forms the basis for convolutional networks – the most successful deep learning architecture to date. In addition, current analyses merely show existence of depth efficiency, i.e. of functions that are efficiently realizable by deep networks but not by shallow ones. The practical implications of such findings are arguably slight, as a-priori, it may be that only a small fraction of the functions realizable by deep networks enjoy depth efficiency, and for all the rest shallow networks suffice.

Our aim in this paper was to develop a theory that facilitates an analysis of depth efficiency for networks that incorporate the widely used structural ingredients of locality, sharing and pooling. We consider the task of classification into one of a finite set of categories ${\mathcal{Y}}=\{1{\ldots}Y\}$ . Our instance space is defined to be the Cartesian product of $N$ vector spaces, in compliance with the common practice of representing natural data through ordered local structures (e.g. images through patches). Each of the $N$ vectors that compose an instance is represented by a descriptor of length $M$ , generated by running the vector through $M$ “representation” functions. As customary, classification is achieved through maximization of score functions $h_{y}$ , one for every category $y\in{\mathcal{Y}}$ . Each score function is a linear combination over the $M^{N}$ possible products that may be formed by taking one descriptor entry from every input vector. The coefficients for these linear combinations conveniently reside in tensors ${\mathcal{A}}^{y}$ of order $N$ and dimension $M$ along each axis. We construct networks that compute score functions $h_{y}$ by decomposing (factorizing) the coefficient tensors ${\mathcal{A}}^{y}$ . The resulting networks are convolutional arithmetic circuits that incorporate locality, sharing and pooling, and operate on the $N{\cdot}M$ descriptor entries generated from the input.

We show that a shallow (single hidden layer) network realizes the classic CP (rank-1) tensor decomposition, whereas a deep network with $\log_{2}N$ hidden layers realizes the recently introduced Hierarchical Tucker (HT) decomposition ([Hackbusch and Kühn(2009)]). Our fundamental result, presented in thm. 4.1 and corollary 4.2, states that randomizing the weights of a deep network by some continuous distribution will lead, with probability one, to score functions that cannot be approximated by a shallow network if the latter’s size is not exponential (in $N$ ). We extend this result (thm. A.1 and corollary A.2) by deriving analogous claims that compare two networks of any depths, not just deep vs. shallow.

To further highlight the connection between our networks and ones used in practice, we show (app. E) that translating convolution and product pooling computations to log-space (for numerical stability) gives rise to SimNets – a recently proposed deep learning architecture which has been shown to produce state of the art accuracy in computationally limited settings ([Cohen et al.(2016)Cohen, Sharir, and Shashua]).

Besides the central line of our work discussed above, the construction and theory presented in this paper shed light on various conjectures and practices employed by the deep learning community. First, with respect to the pooling operation, our analysis points to the possibility that perhaps it has more to do with factorization of computed functions than it does with translation invariance. This may serve as an explanation for the fact that pooling windows in state of the art convolutional networks are typically very small (see for example [Simonyan and Zisserman(2014)]), often much smaller than the radius of translation one would like to be invariant to. Indeed, in our framework, as we show in app. A, pooling over large windows and trimming down a network’s depth may bring to an exponential decrease in expressive efficiency.

The second point our theory sheds light on is sharing. As discussed in sec. 3.3, introducing weight sharing to a shallow network (CP model) considerably limits its expressive power. The network can only represent symmetric tensors, which in turn means that it is location invariant w.r.t. input vectors (patches). In the case of a deep network (HT model) the limitation posed by sharing is not as strict. Generated tensors need not be symmetric, implying that the network is capable of modeling location – a crucial ability in almost any real-world task. The above findings suggest that the sharing constraint is increasingly limiting as a network gets shallower, to the point where it causes complete ignorance to location. This could serve as an argument supporting the empirical success of deep convolutional networks – they bind together the statistical and computational advantages of sharing with many layers that mitigate its expressive limitations.

Lastly, our construction advocates locality, or more specifically, $1\times 1$ receptive fields. Recent convolutional networks providing state of the art recognition performance (e.g. [Lin et al.(2014)Lin, Chen, and Yan, Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich]) make extensive use of $1\times 1$ linear transformations, proving them to be very successful in practice. In view of our model, such $1\times 1$ operators factorize tensors while providing universality with a minimal number of parameters. It seems reasonable to conjecture that for this task of factorizing coefficient tensors, larger receptive fields are not significantly helpful, as they lead to redundancy which may deteriorate performance in presence of limited training data. Investigation of this conjecture is left for future work.

References

Appendix A Generalized Theorem of Network Capacity

In sec. 4 we presented our fundamental theorem of network capacity (thm. 4.1 and corollary 4.2), showing that besides a negligible set, all functions that can be realized by a polynomially sized HT model (with or without weight sharing), require exponential size in order to be realized, or even approximated, by CP model. In terms of network depth, CP and HT models represent the extremes – the former has only a single hidden layer achieved through global pooling, whereas the latter has $L=\log_{2}N$ hidden layers achieved through minimal (size- $2$ ) pooling windows. It is of interest to generalize the fundamental result by establishing a comparison between networks of intermediate depths. This is the focus of the current appendix.

We begin by defining a truncated version of the hierarchical tensor decomposition presented in eq. 4:

The only difference between this decomposition and the original is that instead of completing the full process with $L:=\log_{2}N$ levels, we stop after $L_{c}{\leq}L$ . At this point remaining tensors are binded together to form the final order- $N$ tensor. The corresponding network will simply include a premature global pooling stage that shrinks feature maps to $1\times 1$ , and then a final linear layer that performs classification. As before, we consider a shared version of the decomposition in which ${\mathbf{a}}^{l,j,\gamma}\equiv{\mathbf{a}}^{l,\gamma}$ . Notice that this construction realizes a continuum between CP and HT models, which correspond to the extreme cases $L_{c}=1$ and $L_{c}=L$ respectively.

The following theorem, a generalization of thm. 4.1, compares a truncated decomposition having $L_{1}$ levels, to one with $L_{2}<L_{1}$ levels that implements the same tensor, quantifying the penalty in terms of parameters:

Let ${\mathcal{A}}^{(1)}$ and ${\mathcal{A}}^{(2)}$ be tensors of order $N$ and dimension $M$ in each mode, generated by the truncated recursive formulas in eq. 5, with $L_{1}$ and $L_{2}$ levels respectively. Denote by $\{r^{(1)}_{l}\}_{l=0}^{L_{1}-1}$ and $\{r^{(2)}_{l}\}_{l=0}^{L_{2}-1}$ the composition ranks of ${\mathcal{A}}^{(1)}$ and ${\mathcal{A}}^{(2)}$ respectively. Assuming w.l.o.g. that $L_{1}>L_{2}$ , we define $r:=\min\{r^{(1)}_{0},...,r^{(1)}_{L_{2}-1},M\}$ , and consider the space of all possible configurations for the parameters of ${\mathcal{A}}^{(1)}$ ’s composition – $\{{\mathbf{a}}^{(1),l,j,\gamma}\}_{l,j,\gamma}$ . In this space, almost everywhere (w.r.t. Lebesgue measure), the generated tensor ${\mathcal{A}}^{(1)}$ requires that $r^{(2)}_{L_{2}-1}\geq(r)^{2^{L-L_{2}}}$ if one wishes that ${\mathcal{A}}^{(2)}$ be equal to ${\mathcal{A}}^{(1)}$ . Put differently, the configurations for which ${\mathcal{A}}^{(1)}$ can be realized by ${\mathcal{A}}^{(2)}$ with $r^{(2)}_{L_{2}-1}<(r)^{2^{L-L_{2}}}$ form a set of measure zero. The exact same result holds if we constrain the composition of ${\mathcal{A}}^{(1)}$ to be “shared”, i.e. set ${\mathbf{a}}^{(1),l,j,\gamma}\equiv{\mathbf{a}}^{(1),l,\gamma}$ and consider the space of $\{{\mathbf{a}}^{(1),l,\gamma}\}_{l,\gamma}$ configurations.

In analogy with corollary 4.2, we obtain the following generalization:

Suppose we are given linearly independent representation functions $f_{\theta_{1}}{\ldots}f_{\theta_{M}}$ , and consider two networks that correspond to the truncated hierarchical tensor decomposition in eq. 5, with $L_{1}$ and $L_{2}$ hidden layers respectively. Assume w.l.o.g. that $L_{1}>L_{2}$ , i.e. that network 1 is deeper than network 2, and define $r$ to be the minimal number of channels across the representation layer and the first $L_{2}$ hidden layers of network 1. Then, if we randomize the weights of network 1 by a continuous distribution, we obtain, with probability one, score functions $h_{y}$ that cannot be approximated arbitrarily well (in $L^{2}$ sense) by network 2 if the latter has less than $(r)^{2^{L-L_{2}}}$ channels in its last hidden layer. The result holds even if we constrain network 1 with weight sharing while leaving network 2 in its general form.

Proofs of thm. A.1 and corollary A.2 are given in app. B. Hereafter, we briefly discuss some of their implications. First, notice that we indeed obtain a generalization of the fundamental theorem of network capacity (thm. 4.1 and corollary 4.2), which corresponds to the extreme case $L_{1}=L$ and $L_{2}=1$ . Second, note that for the baseline case of $L_{1}=L$ , i.e. a full-depth network has generated the target score function, approximating this with a truncated network draws a price that grows double exponentially w.r.t. the number of missing layers. Third, and most intriguingly, we see that when $L_{1}$ is considerably smaller than $L$ , i.e. when a significantly truncated network is sufficient to model our problem, cutting off even a single layer leads to an exponential price, and this price is independent of $L_{1}$ . Such scenarios of exponential penalty for trimming down a single layer were discussed in [Bengio(2009)], but only in the context of specific functions realized by networks that do not resemble ones used in practice (see [Håstad and Goldmann(1991)] for an example of such result). We prove this in a much broader, more practical setting, showing that for convolutional arithmetic circuit (SimNet – see app. E) architectures, almost any function realized by a significantly truncated network will exhibit this behavior. The issue relates to empirical practice, supporting the common methodology of designing networks that go as deep as possible. Specifically, it encourages extending network depth by pooling over small regions, avoiding significant spatial decimation that brings network termination closer.

We conclude this appendix by stressing once more that our construction and theoretical approach are not limited to the models covered by our theorems (CP model, HT model, truncated HT model). These are merely exemplars deemed most appropriate for initial analysis. The fundamental and generalized theorems of network capacity are similar in spirit, and analogous theorems for networks with different pooling window sizes and depths (corresponding to different tensor decompositions) may easily be derived.

Appendix B Proofs

Our proof of thm. 4.1 and A.1 relies on basic knowledge in measure theory, or more specifically, Lebesgue measure spaces. We do not provide here a comprehensive background on this field (the interested reader is referred to [Jones(2001)]), but rather supplement the brief discussion given in sec. 2, with a list of facts we will be using which are not necessarily intuitive:

A union of countably (or finitely) many sets of zero measure is itself a set of zero measure.

In the above, and in the entirety of this paper, the only measure spaces we consider are Euclidean spaces equipped with Lebesgue measure. Thus when we say that a set of $d$ -dimensional points has zero measure, we mean that its Lebesgue measure in the $d$ -dimensional Euclidean space is zero.

In words, an order- $2^{L}$ tensor given by a CP-decomposition (see sec. 2) with $Z$ terms, has matricization with rank at most $Z$ . Thus, to prove that a certain order- $2^{L}$ tensor has CP-rank of at least $R$ , it suffices to show that its matricization has rank of at least $R$ .

We now state and prove two lemmas that will be needed for our proofs of thm. 4.1 and A.1.

Obviously $rank(U({\mathbf{x}}))\leq r$ for all ${\mathbf{x}}$ , so it remains to show that $rank(U({\mathbf{x}}))\geq r$ for all ${\mathbf{x}}$ but a set of zero measure. Let $U_{r}({\mathbf{x}})$ be the top-left $r\times r$ sub-matrix of $U({\mathbf{x}})$ . If $U_{r}({\mathbf{x}})$ is non-singular then of course $rank(U({\mathbf{x}}))\geq r$ as required. It thus suffices to show that the set of points ${\mathbf{x}}$ for which $\det U_{r}({\mathbf{x}})=0$ has zero measure. Now, $\det U_{r}({\mathbf{x}})$ is a polynomial in the entries of ${\mathbf{x}}$ , and so it either vanishes on a set of zero measure, or it is the zero polynomial (see [Caron and Traynor(2005)]). All that is left is to disqualify the latter option, and that can be done by finding a specific point ${\mathbf{x}}_{0}$ for which $\det U_{r}({\mathbf{x}}_{0})\neq 0$ . Indeed, we may choose ${\mathbf{x}}_{0}$ such that $D({\mathbf{x}}_{0})$ is the identity matrix and $A({\mathbf{x}}_{0}),B({\mathbf{x}}_{0})$ hold $1$ on their main diagonal and otherwise. This selection implies that $U_{r}({\mathbf{x}}_{0})$ is the identity matrix, and in particular $\det U_{r}({\mathbf{x}}_{0})\neq 0$ .

This shows that indeed our set of interest $S$ has zero measure.

With all preliminaries and lemmas in place, we turn to prove thm. 4.1, establishing an exponential efficiency of HT decomposition (eq. 4) over CP decomposition (eq. 3).

We begin with the case of an “unshared” composition, i.e. the one given in eq. 4 (as opposed to the “shared” setting of ${\mathbf{a}}^{l,j,\gamma}\equiv{\mathbf{a}}^{l,\gamma}$ ). Denoting for convenience $\phi^{L,1,1}:={\mathcal{A}}^{y}$ and $r_{L}=1$ , we will show by induction over $l=1,...,L$ that almost everywhere (at all points but a set of zero measure) w.r.t. $\{{\mathbf{a}}^{l,j,\gamma}\}_{l,j,\gamma}$ , all CP-ranks of the tensors $\{\phi^{l,j,\gamma}\}_{j\in[\nicefrac{{N}}{{2^{l}}}],\gamma\in[r_{l}]}$ are at least $r^{\nicefrac{{2^{l}}}{{2}}}$ . In accordance with our discussion in the beginning of this subsection, it suffices to consider the matricizations $[\phi^{l,j,\gamma}]$ , and show that these all have ranks greater or equal to $r^{\nicefrac{{2^{l}}}{{2}}}$ almost everywhere.

Assume now that almost everywhere $rank[\phi^{l-1,j^{\prime},\gamma^{\prime}}]\geq r^{\nicefrac{{2^{l-1}}}{{2}}}$ for all $j^{\prime}\in[\nicefrac{{N}}{{2^{l-1}}}]$ and $\gamma^{\prime}\in[r_{l-1}]$ . For some specific choice of $j\in[\nicefrac{{N}}{{2^{l}}}]$ and $\gamma\in[r_{l}]$ we have:

Denote $M_{\alpha}:=[\phi^{l-1,2j-1,\alpha}]\odot[\phi^{l-1,2j,\alpha}]$ for $\alpha=1{\ldots}r_{l-1}$ . By our inductive assumption, and by the general property $rank(A\odot B)=rank(A){\cdot}rank(B)$ , we have that almost everywhere the ranks of all matrices $M_{\alpha}$ are at least $r^{\nicefrac{{2^{l-1}}}{{2}}}\cdot r^{\nicefrac{{2^{l-1}}}{{2}}}=r^{\nicefrac{{2^{l}}}{{2}}}$ . Writing $[\phi^{l,j,\gamma}]=\sum_{\alpha=1}^{r_{l-1}}a_{\alpha}^{l,j,\gamma}\cdot M_{\alpha}$ , and noticing that $\{M_{\alpha}\}$ do not depend on ${\mathbf{a}}^{l,j,\gamma}$ , we turn our attention to lemma B.3. The lemma tells us that $rank[\phi^{l,j,\gamma}]\geq r^{\nicefrac{{2^{l}}}{{2}}}$ almost everywhere. Since a finite union of zero measure sets has zero measure, we conclude that almost everywhere $rank[\phi^{l,j,\gamma}]\geq r^{\nicefrac{{2^{l}}}{{2}}}$ holds jointly for all $j\in[\nicefrac{{N}}{{2^{l}}}]$ and $\gamma\in[r_{l}]$ . This completes the proof of the theorem in the unshared case.

Proving the theorem in the shared case may be done in the exact same way, except that for $l=1$ one needs the version of lemma B.1 for which $A({\mathbf{x}})$ and $B({\mathbf{x}})$ are equal.

We now head on to prove thm. A.1, which is a generalization of thm. 4.1. The proof will be similar in nature to that of thm. 4.1, yet slightly more technical. In short, the idea is to show that in the generic case, expressing ${\mathcal{A}}^{(1)}$ as a sum of tensor products between tensors of order $2^{L_{2}-1}$ requires at least $r^{\nicefrac{{N}}{{2^{L_{2}}}}}$ terms. Since ${\mathcal{A}}^{(2)}$ is expressed as a sum of $r_{L_{2}-1}$ such terms, demanding ${\mathcal{A}}^{(2)}={\mathcal{A}}^{(1)}$ implies $r_{L_{2}-1}\geq r^{\nicefrac{{N}}{{2^{L_{2}}}}}$ .

$\varphi_{q}({\mathcal{A}}\otimes{\mathcal{B}})=\varphi_{q}({\mathcal{A}})\otimes\varphi_{q}({\mathcal{B}})$

$\varphi_{q}(\lambda{\mathcal{A}}+{\mathcal{B}})=\lambda\varphi_{q}({\mathcal{A}})+\varphi_{q}({\mathcal{B}})$

For the sake of our proof we are interested in the case $q=2^{L_{2}-1}$ , and denote for brevity $\varphi:=\varphi_{2^{L_{2}-1}}$ .

As stated above, we would like to show that in the generic case, expressing ${\mathcal{A}}^{(1)}$ as $\sum_{z=1}^{Z}\phi^{(z)}_{1}\otimes\cdots\otimes\phi^{(z)}_{\nicefrac{{N}}{{2^{L_{2}-1}}}}$ , where $\phi^{(z)}_{i}$ are tensors of order $2^{L_{2}-1}$ , implies $Z\geq r^{\nicefrac{{N}}{{2^{L_{2}}}}}$ . Applying $\varphi$ to both sides of such a decomposition gives: $\varphi({\mathcal{A}}^{(1)})=\sum_{z=1}^{Z}\varphi(\phi^{(z)}_{1})\otimes\cdots\otimes\varphi(\phi^{(z)}_{\nicefrac{{N}}{{2^{L_{2}-1}}}})$ , where $\varphi(\phi^{(z)}_{i})$ are now vectors. Thus, to prove thm. A.1 it suffices to show that in the generic case, the CP-rank of $\varphi({\mathcal{A}}^{(1)})$ is at least $r^{\nicefrac{{N}}{{2^{L_{2}}}}}$ , or alternatively, that the rank of the matricization $[\varphi({\mathcal{A}}^{(1)})]$ is at least $r^{\nicefrac{{N}}{{2^{L_{2}}}}}$ . This will be our strategy in the following proof:

In accordance with the above discussion, it suffices to show that in the generic case $rank[\varphi({\mathcal{A}}^{(1)})]\geq r^{\nicefrac{{N}}{{2^{L_{2}}}}}$ . To ease the path for the reader, we reformulate the problem using slightly simpler notations. We have an order- $N$ tensor ${\mathcal{A}}$ with dimension $M$ in each mode, generated as follows:

Let $L_{2}$ be a positive integer smaller than $L_{1}$ , and let $\varphi$ be the tensor squeezing operator that merges groups of $2^{L_{2}-1}$ modes. Define $r:=\min\{r_{0},...,r_{L_{2}-1},M\}$ . With $[\cdot]$ being the matricization operator defined in the beginning of the appendix, our task is to prove that $rank[\varphi({\mathcal{A}})]\geq r^{\nicefrac{{N}}{{2^{L_{2}}}}}$ almost everywhere w.r.t. $\{{\mathbf{a}}^{l,j,\gamma}\}_{l,j,\gamma}$ . We also consider the case of shared parameters – ${\mathbf{a}}^{l,j,\gamma}\equiv{\mathbf{a}}^{l,\gamma}$ , where we would like to show that the same condition holds almost everywhere w.r.t. $\{{\mathbf{a}}^{l,\gamma}\}_{l,\gamma}$ .

Our strategy for proving the claim is inductive. We show that for $l=L_{2}{\ldots}L_{1}-1$ , almost everywhere it holds that for all $j$ and all $\gamma$ : $rank[\varphi(\phi^{l,j,\gamma})]\geq r^{2^{l-L_{2}}}$ . We then treat the special case of $l=L_{1}$ , showing that indeed $rank[\varphi({\mathcal{A}})]\geq r^{\nicefrac{{N}}{{2^{L_{2}}}}}$ . We begin with the setting of unshared parameters ( ${\mathbf{a}}^{l,j,\gamma}$ ), and afterwards attend the scenario of shared parameters ( ${\mathbf{a}}^{l,\gamma}$ ) as well.

Our first task is to treat the case $l=L_{2}$ , i.e. show that $rank[\varphi(\phi^{L_{2},j,\gamma})]\geq r$ almost everywhere jointly for all $j$ and all $\gamma$ (there is actually no need for the matricization $[\cdot]$ here, as $\varphi(\phi^{L_{2},j,\gamma})$ are already matrices). Since a union of finitely many zero measure sets has zero measure, it suffices to show that this condition holds almost everywhere when specific $j$ and $\gamma$ are chosen. Denote by ${\mathbf{e}}_{i}$ a vector holding $1$ in entry $i$ and elsewhere, by ${\mathbf{0}}$ a vector of zeros, and by ${\mathbf{1}}$ a vector of ones. Suppose that for every $j$ we assign ${\mathbf{a}}^{0,j,\alpha}$ to be ${\mathbf{e}}_{\alpha}$ when $\alpha\leq r$ and ${\mathbf{0}}$ otherwise. Suppose also that for all $1\leq l\leq L_{2}-1$ and all $j$ we set ${\mathbf{a}}^{l,j,\gamma}$ to be ${\mathbf{e}}_{\gamma}$ when $\gamma\leq r$ and ${\mathbf{0}}$ otherwise. Finally, assume we set ${\mathbf{a}}^{L_{2},j,\gamma}={\mathbf{1}}$ for all $j$ and all $\gamma$ . These settings imply that for every $j$ , when $\gamma\leq r$ we have $\phi^{L_{2}-1,j,\gamma}=\mathop{\otimes}_{j=1}^{2^{L_{2}-2}}({\mathbf{e}}_{\gamma}\otimes{\mathbf{e}}_{\gamma})$ , i.e. the tensor $\phi^{L_{2}-1,j,\gamma}$ holds $1$ in location $(\gamma,...,\gamma)$ and elsewhere. If $\gamma>r$ then $\phi^{L_{2}-1,j,\gamma}$ is the zero tensor. We conclude from this that there are indices $1\leq i_{1}<...<i_{r}\leq M^{L_{2}-1}$ such that $\varphi(\phi^{L_{2}-1,j,\gamma})={\mathbf{e}}_{i_{\gamma}}$ for $\gamma\leq r$ , and that for $\gamma>r$ we have $\varphi(\phi^{L_{2}-1,j,\gamma})={\mathbf{0}}$ . We may thus write:

Now, since $i_{1}{\ldots}i_{r}$ are different from each other, the matrix $\varphi(\phi^{L_{2},j,\gamma})$ has rank $r$ . This however does not prove our inductive hypothesis for $l=L_{2}$ . We merely showed a specific parameter assignment for which it holds, and we need to show that it is met almost everywhere. To do so, we consider an $r\times r$ sub-matrix of $\varphi(\phi^{L_{2},j,\gamma})$ which is non-singular under the specific parameter assignment we defined. The determinant of this sub-matrix is a polynomial in the elements of $\{{\mathbf{a}}^{l,j,\gamma}\}_{l,j,\gamma}$ which we know does not vanish with the specific assignments defined. Thus, this polynomial vanishes at subset of $\{{\mathbf{a}}^{l,j,\gamma}\}_{l,j,\gamma}$ having zero measure (see [Caron and Traynor(2005)]). That is to say, the sub-matrix of $\varphi(\phi^{L_{2},j,\gamma})$ has rank $r$ almost everywhere, and thus $\varphi(\phi^{L_{2},j,\gamma})$ has rank at least $r$ almost everywhere. This completes our treatment of the case $l=L_{2}$ .

We now turn to prove the propagation of our inductive hypothesis. Let $l\in\{L_{2}+1,...,L_{1}-1\}$ , and assume that our inductive hypothesis holds for $l-1$ . Specifically, assume that almost everywhere w.r.t. $\{{\mathbf{a}}^{l,j,\gamma}\}_{l,j,\gamma}$ , we have that $rank[\varphi(\phi^{l-1,j,\gamma})]\geq r^{2^{l-1-L_{2}}}$ jointly for all $j\in[\nicefrac{{N}}{{2^{l-1}}}]$ and all $\gamma\in[r_{l-1}]$ . We would like to show that almost everywhere, $rank[\varphi(\phi^{l,j,\gamma})]\geq r^{2^{l-L_{2}}}$ jointly for all $j\in[\nicefrac{{N}}{{2^{l}}}]$ and all $\gamma\in[r_{l}]$ . Again, the fact that a finite union of zero measure sets has zero measure implies that we may prove the condition for specific $j\in[\nicefrac{{N}}{{2^{l}}}]$ and $\gamma\in[r_{l}]$ . Applying the squeezing operator $\varphi$ followed by matricization $[\cdot]$ to the recursive expression for $\phi^{l,j,\gamma}$ , we get:

For $\alpha=1{\ldots}r_{l-1}$ , denote the matrix $[\varphi(\phi^{l-1,2j-1,\alpha})]\odot[\varphi(\phi^{l-1,2j,\alpha})]$ by $M_{\alpha}$ . The fact that the Kronecker product multiplies ranks, along with our inductive assumption, imply that almost everywhere $rank(M_{\alpha})\geq r^{2^{l-1-L_{2}}}\cdot r^{2^{l-1-L_{2}}}=r^{2^{l-L_{2}}}$ . Noting that the matrices $M_{\alpha}$ do not depend on ${\mathbf{a}}^{l,j,\gamma}$ , we apply lemma B.3 and conclude that almost everywhere $rank[\varphi(\phi^{l,j,\gamma})]\geq r^{2^{l-L_{2}}}$ , which completes the prove of the inductive propagation.

Next, we treat the special case $l=L_{1}$ . We assume now that almost everywhere $rank[\varphi(\phi^{L_{1}-1,j,\gamma})]\geq r^{2^{L_{1}-1-L_{2}}}$ jointly for all $j$ and all $\gamma$ . Again, we apply the squeezing operator $\varphi$ followed by matricization $[\cdot]$ , this time to both sides of the expression for ${\mathcal{A}}$ :

As before, denote $M_{\alpha}:=\mathop{\odot}_{j=1}^{2^{L-L_{1}+1}}[\varphi(\phi^{L_{1}-1,j,\alpha})]$ for $\alpha=1{\ldots}r_{L_{1}-1}$ . Using again the multiplicative rank property of the Kronecker product along with our inductive assumption, we get that almost everywhere $rank(M_{\alpha})\geq\prod_{j=1}^{2^{L-L_{1}+1}}r^{2^{L_{1}-1-L_{2}}}=r^{L-L_{2}}$ . Noticing that $\{M_{\alpha}\}_{\alpha\in[r_{L_{1}-1}]}$ do not depend on ${\mathbf{a}}^{L_{1},1,1}$ , we apply lemma B.3 for the last time and get that almost everywhere (w.r.t. $\{{\mathbf{a}}^{l,j,\gamma}\}_{l,j,\gamma}$ ), the rank of $[\varphi({\mathcal{A}})]$ is at least $r^{L-L_{2}}$ . This completes our proof in the case of unshared parameters.

Proving the theorem in the case of shared parameters ( ${\mathbf{a}}^{l,j,\gamma}\equiv{\mathbf{a}}^{l,\gamma}$ ) can be done in the exact same way as above. In fact, all one has to do is omit the references to $j$ and the proof will apply. Notice in particular that the specific parameter assignment we defined to handle $l=L_{2}$ was completely symmetric, i.e. it did not include any dependence on $j$ .

B.2 Proof of Corollaries 4.2 and A.2

In a finite-dimensional Hilbert space, convergence in norm implies convergence in representation coefficients under any preselected basis. We thus have:

Appendix C Derivation of Hypotheses Space

In order to keep the body of the paper at a reasonable length, the presentation of our hypotheses space (eq. 2) in sec. 3 did not provide the grounds for its definition. In this appendix we derive the hypotheses space step by step. After establishing basic preliminaries on the topic of $L^{2}$ spaces, we utilize the notion of tensor products between such spaces to reach a universal representation as in eq. 2 but with $M\to\infty$ . We then make use of empirical studies characterizing the statistics of natural images, to argue that in practice a moderate value of $M$ ( $M\in\Omega(100)$ ) suffices.

When dealing with functions over scalars, vectors or collections of vectors, we consider $L^{2}$ spaces, or more formally, the Hilbert spaces of Lebesgue measurable square-integrable real functions equipped with standard (point-wise) addition and scalar multiplication, as well as the inner-product defined by integral over point-wise multiplication. The topic of $L^{2}$ function spaces lies at the heart of functional analysis, and requires basic knowledge in measure theory. We present here the bare necessities required to follow this appendix, referring the interested reader to [Rudin(1991)] for a more comprehensive introduction.

C.2 Construction

In the case of Gaussians and neurons, we argue that a finite set of functions suffices, i.e. that it is possible to choose $f_{\theta_{1}}{\ldots}f_{\theta_{M}}\in{\mathcal{F}}$ that will suffice in order to represent score functions required for natural tasks. Moreover, we claim that $M$ need not be large (e.g. on the order of 100). Our argument relies on statistical properties of natural images, and is fully detailed in app. C.3. It implies that under proper choice of $\{f_{\theta_{d}}({\mathbf{x}})\}_{d\in[M]}$ , the finite set of point-wise product functions $\{({\mathbf{x}}_{1},\ldots,{\mathbf{x}}_{N})\mapsto\prod_{i=1}^{N}f_{\theta_{d_{i}}}({\mathbf{x}}_{i})\}_{d_{1},\ldots,d_{N}\in[M]}$ spans the score functions of interest, and we may define for each label $y$ a tensor ${\mathcal{A}}^{y}$ of order $N$ and dimension $M$ in each mode, such that:

C.3 Finite Function Bases for Classification of Natural Data

where $h_{y}({\mathbf{x}}_{1},\ldots,{\mathbf{x}}_{N})$ is of the form:

In this subsection we formalize our argument that a finite value for $M$ is sufficient when ${\mathcal{X}}$ represents natural data, and in particular, natural images. Based on empirical studies characterizing the statistical properties of natural images, and in compliance with the number of channels in a typical convolutional network layer, we find that $M$ on the order of 100 typically suffices.

Let ${\mathcal{D}}$ be a distribution of labeled instances $(X,\bar{y})$ over ${\mathcal{X}}\times{\mathcal{Y}}$ (we use bar notation to distinguish the label $\bar{y}$ from the running index $y$ ), and ${\mathcal{D}}_{\mathcal{X}}$ be the induced marginal distribution of instances $X$ over ${\mathcal{X}}$ . We would like to show, given particular assumptions on ${\mathcal{D}}$ , that there exist functions $f_{\theta_{1}},\ldots,f_{\theta_{M}}\in{\mathcal{F}}$ and tensors ${\mathcal{A}}^{1},\ldots,{\mathcal{A}}^{Y}$ of order $N$ and dimension $M$ in each mode, such that the score functions $h_{y}$ defined in eq. 2 achieve low classification error:

Let $\{h_{y}^{*}\}_{y\in{\mathcal{Y}}}$ be a set of “ground truth” score functions for which optimal prediction is achieved, or more specifically, for which the expected hinge-loss (upper bounds the 0-1 loss) is minimal:

Our strategy will be to select score functions $h_{y}$ of the format given in eq. 2, that approximate $h_{y}^{*}$ in the sense of low expected maximal absolute difference:

We refer to ${\mathcal{E}}$ as the score approximation error obtained by $h_{y}$ . The 0-1 loss of $h_{y}$ with respect to the labeled example $(X,\bar{y})\in{\mathcal{X}}\times{\mathcal{Y}}$ is bounded as follows:

Taking expectation of the first and last terms above with respect to $(X,\bar{y})\sim{\mathcal{D}}$ , and recalling the definitions given in eq. 9, 10 and 11, we get:

In words, the classification error of the score functions $h_{y}$ is bounded by the optimal expected hinge-loss plus a term equal to twice their score approximation error. Recall that we did not constrain the optimal score functions $h_{y}^{*}$ in any way. Thus, assuming a label is deterministic given an instance, the optimal expected hinge-loss is essentially zero, and the classification error of $h_{y}$ is dominated by their score approximation error ${\mathcal{E}}$ (eq. 11). Our problem thus translates to showing that $h_{y}$ can be selected such that ${\mathcal{E}}$ is small.

Under our idealized assumptions on ${\mathcal{D}}_{\mathcal{X}}$ , the expectation in the score approximation error ${\mathcal{E}}$ can be discretized as follows:

where ${\mathcal{M}}_{c_{1},\ldots,c_{N}}:=({\boldsymbol{\mu}}_{c_{1}},\ldots,{\boldsymbol{\mu}}_{c_{N}})$ and $P_{c_{1},\ldots,c_{N}}$ stands for the probability that ${\mathbf{x}}_{i}$ lies near ${\boldsymbol{\mu}}_{c_{i}}$ for every $i\in[N]$ ( $P_{c_{1},\ldots,c_{N}}{\geq}0,~{}\sum_{c_{1},\ldots,c_{N}}P_{c_{1},\ldots,c_{N}}=1$ ).

Assuming we have chosen $f_{\theta_{1}}{\ldots}f_{\theta_{M}}$ to separate GMM components, and plugging-in the format of $h_{y}$ given in eq. 2, we get the following convenient form for $h_{y}({\mathcal{M}}_{c_{1},\ldots,c_{N}})$ :

Assigning the coefficient tensors through the following rule:

for every $y\in{\mathcal{Y}}$ and $c_{1}{\ldots}c_{N}\in[M]$ . Plugging this into eq. 12, we get a score approximation error of zero.

To recap, we have shown that when the parametric functions $f_{\theta}$ are Gaussians (eq. 7) or neurons (eq. 8), not only are the score functions $h_{y}$ given in eq. 2 universal when $M\to\infty$ (see app. C.2), but they can also achieve zero classification error (eq. 9) with a moderate value of $M$ (on the order of 100) if the underlying data distribution ${\mathcal{D}}$ is “natural”. In this context, ${\mathcal{D}}$ is regarded as natural if it satisfies two conditions. The first, which is rather mild, requires that a label be completely determined by the instance. For example, an image will belong to one category with probability one, and to the rest of the categories with probability zero. The second condition, which is far more restrictive, states that input vectors composing an instance can be quantized into a moderate number ( $M$ ) of templates. The assumption that natural images exhibit this property is based on various empirical studies where it is shown to hold approximately. Since it does not hold exactly, our analysis is approximate, and its implication in practice is that the classification error introduced by constraining score functions to have the format given in eq. 2, is negligible compared to other sources of error (factorization of the coefficient tensors, finiteness of training data and difficulty in optimization).

Appendix D Related Work

The classic approach for theoretically analyzing the power of depth focused on investigation of the computational complexity of boolean circuits. An early result, known as the “exponential efficiency of depth”, may be summarized as follows: for every integer $k$ , there are boolean functions that can be computed by a circuit comprising alternating layers of AND and OR gates which has depth $k$ and polynomial size, yet if one limits the depth to $k-1$ or less, an exponentially large circuit is required. See [Sipser(1983)] for a formal statement of this classic result. Recently, [Rossman et al.(2015)Rossman, Servedio, and Tan] have established a somewhat stronger result, showing cases where not only are polynomially wide shallow boolean circuits incapable of exact realization, but also of approximation (i.e. of agreeing with the target function on more than a specified fraction of input combinations). Other classical results are related to threshold circuits, a class of models more similar to contemporary neural networks than boolean circuits. Namely, they can be viewed as neural networks where each neuron computes a weighted sum of its inputs (possibly including bias), followed by threshold activation ( $\sigma(z)={\mathbf{1}}[z\geq 0]$ ). For threshold circuits, the main known result in our context is the existence of functions that separate depth 3 from depth 2 (see [Hajnal et al.(1987)Hajnal, Maass, Pudlák, Szegedy, and Turán] for a statement relating to exact realization, and the techniques in [Maass et al.(1994)Maass, Schnitger, and Sontag, Martens et al.(2013)Martens, Chattopadhya, Pitassi, and Zemel] for extension to approximation).

More recent studies focus on arithmetic circuits ([Shpilka and Yehudayoff(2010)]), whose nodes typically compute either a weighted sum or a product of their inputs There are different definitions for arithmetic circuits in the literature. We adopt the definition given in [Martens and Medabalimi(2014)], under which an arithmetic circuit is a directed acyclic graph, where nodes with no incoming edges correspond to inputs, nodes with no outgoing edges correspond to outputs, and the remaining nodes are either labeled as “sum” or as “product”. A product node computes the product of its child nodes. A sum node computes a weighted sum of its child nodes, where the weights are parameters linked to its incoming edges. (besides their role in studying expressiveness, deep networks of this class have been shown to support provably optimal training [Livni et al.(2014)Livni, Shalev-Shwartz, and Shamir]). A special case of this are the Sum-Product Networks (SPNs) presented in [Poon and Domingos(2011)]. SPNs are a class of deep generative models designed to efficiently compute probability density functions. Their summation weights are typically constrained to be non-negative (such an arithmetic circuit is called monotone), and in addition, in order for them to be valid (i.e. to be able to compute probability density functions), additional architectural constraints are needed (e.g. decomposability and completeness). The most widely known theoretical arguments regarding the efficiency of depth in SPNs were given in [Delalleau and Bengio(2011)]. In this work, two specific families of SPNs were considered, both comprising alternating sum and product layers – a family ${\mathcal{F}}$ whose nodes form a full binary tree, and a family ${\mathcal{G}}$ with $n$ nodes per layer (excluding the output), each connected to $n-1$ nodes in the preceding layer. The authors show that functions implemented by these networks require an exponential number of nodes in order to be realized by shallow (single hidden-layer networks). The limitations of this work are twofold. First, as the authors note themselves, it only analyzes the ability of shallow networks to realize exactly functions generated by deep networks, and does not provide any result relating to approximation. Second, the specific SPN families considered in this work are not universal hypothesis classes and do not resemble networks used in practice. Recently, [Martens and Medabalimi(2014)] proved that there exist functions which can be efficiently computed by decomposable and complete (D&C) SPNs of depth $d+1$ , yet require a D&C SPN of depth $d$ or less to have super-polynomial size for exact realization. This analysis only treats approximation in the limited case of separating depth 4 from depth 3 (D&C) SPNs. Additionally, it only deals with specific separating functions, and does not convey information regarding how frequent these are. In other words, according to this analysis, it may be that almost all functions generated by deep networks can be efficiently realized by shallow networks, and there are only few pathological functions for which this does not hold. A further limitation of this analysis is that for general $d$ , the separation between depths $d+1$ and $d$ is based on a multilinear circuit result by [Raz and Yehudayoff(2009)], that translates into a network that once again does not follow the common practices of deep learning.

There have been recent attempts to analyze the efficiency of network depth in other settings as well. The most commonly used type of neural networks these days includes neurons that compute a weighted sum of their inputs (with bias) followed by Rectified Linear Unit (ReLU) activation ( $\sigma(z)=\max\{0,z\}$ ). [Pascanu et al.(2013)Pascanu, Montufar, and Bengio] and [Montufar et al.(2014)Montufar, Pascanu, Cho, and Bengio] study the number of linear regions that may be expressed by such networks as a function of their depth and width, thereby showing existence of functions separating deep from shallow (depth 2) networks. [Telgarsky(2015)] shows a simple construction of a depth $d$ width 2 ReLU network that operates on one-dimensional inputs, realizing a function that cannot be approximated by ReLU networks of depth $o(d/\log{d})$ and width polynomial in $d$ . [Eldan and Shamir(2015)] provides functions expressible by ReLU networks of depth 3 and polynomial width, which can only be approximated by a depth 2 network if the latter’s width is exponential. The result in this paper applies not only to ReLU activation, but also to the standard sigmoid ( $\sigma(z)=1/(1+e^{-z})$ ), and more generally, to any universal activation (see assumption 1 in [Eldan and Shamir(2015)]). [Bianchini and Scarselli(2014)] also considers different types of activations, studying the topological complexity (through Betti numbers) of decision regions as a function of network depth, width and activation type. The results in this paper establish the existence of deep vs. shallow separating functions only for the case of polynomial activation. While the above works do address more conventional neural networks, they do not account for the structure of convolutional networks – the most successful deep learning architectures to date, and more importantly, they too prove only existence of some separating functions, without providing any insight as to how frequent these are.

We are not the first to incorporate ideas from the field of tensor analysis into deep learning. [Socher et al.(2013)Socher, Chen, Manning, and Ng], [Yu et al.(2012)Yu, Deng, and Seide], [Setiawan et al.(2015)Setiawan, Huang, Devlin, Lamar, Zbib, Schwartz, and Makhoul], and [Hutchinson et al.(2013)Hutchinson, Deng, and Yu] all proposed different neural network architectures that include tensor-based elements, and exhibit various advantages in terms of expressiveness and/or ease of training. In [Janzamin et al.(2015)Janzamin, Sedghi, and Anandkumar], an alternative algorithm for training neural networks is proposed, based on tensor decomposition and Fourier analysis, with proven generalization bounds. In [Novikov et al.(2014)Novikov, Rodomanov, Osokin, and Vetrov], [Anandkumar et al.(2014)Anandkumar, Ge, Hsu, Kakade, and Telgarsky], [Yang and Dunson(2015)] and [Song et al.(2013)Song, Ishteva, Parikh, Xing, and Park], algorithms for tensor decompositions are used to estimate parameters of different graphical models. Notably, [Song et al.(2013)Song, Ishteva, Parikh, Xing, and Park] uses the relatively new Hierarchical Tucker decomposition ([Hackbusch and Kühn(2009)]) that we employ in our work, with certain similarities in the formulations. The works differ considerably in their objectives though: while [Song et al.(2013)Song, Ishteva, Parikh, Xing, and Park] focuses on the proposal of a new training algorithm, our purpose in this work is to analyze the expressive efficiency of networks and how that depends on depth. Recently, [Lebedev et al.(2014)Lebedev, Ganin, Rakhuba, Oseledets, and Lempitsky] modeled the filters in a convolutional network as four dimensional tensors, and used the CP decomposition to construct an efficient and accurate approximation. Another work that draws a connection between tensor analysis and deep learning is the recent study presented in [Haeffele and Vidal(2015)]. This work shows that with sufficiently large neural networks, no matter how training is initialized, there exists a local optimum that is accessible with gradient descent, and this local optimum is approximately equivalent to the global optimum in terms of objective value.

Appendix E Computation in Log-Space with SimNets

A practical issue one faces when implementing arithmetic circuits is the numerical instability of the product operation – a product node with a large number of inputs is easily susceptible to numerical overflow or underflow. A common solution to this is to perform the computations in log-space, i.e. instead of computing activations we compute their $\log$ . This requires the activations to be non-negative to begin with, and alters the sum and product operations as follows. A product simply turns into a sum, as $\log\prod_{i}\alpha_{i}=\sum_{i}\log\alpha_{i}$ . A sum becomes what is known as log-sum-exp or softmax: $\log\sum_{i}\alpha_{i}=\log\sum_{i}\exp(\log\alpha_{i})$ .

Turning to our networks, the requirement that all activations be non-negative does not limit their universality. The reason for this is that the functions $f_{\theta}$ are non-negative in both cases of interest – Gaussians (eq. 7) and neurons (eq. 8). In addition, one can always add a common offset to all coefficient tensors ${\mathcal{A}}^{y}$ , ensuring they are positive without affecting classification. Non-negative decompositions (i.e. decompositions with all weights holding non-negative values) can then be found, leading all network activations to be non-negative. In general, non-negative tensor decompositions may be less efficient than unconstrained decompositions, as there are cases where a non-negative tensor supports an unconstrained decomposition that is smaller than its minimal non-negative decomposition. Nevertheless, as we shall soon see, these non-negative decompositions translate into a proven architecture, which was demonstrated to achieve comparable performance to state of the art convolutional networks, thus in practice the deterioration in efficiency does not seem to be significant.

Naïvely implementing CP or HT model (fig. 1 or 2 respectively) in log-space translates to $\log$ activation following the locally connected linear transformations (convolutions if coefficients are shared, see sec. 3.3), to product pooling turning into sum pooling, and to $\exp$ activation following the pooling. However, applying $\exp$ and $\log$ activations as just described, without proper handling of the inputs to each computational layer, would not result in a numerically stable computation Naïve implementation of softmax is not numerically stable, as it involves storing $\alpha_{i}=\exp(\log\alpha_{i})$ directly. This however can be easily corrected by defining $c:=\max_{i}\log\alpha_{i}$ , and computing $\log\sum_{i}\exp(\log\alpha_{i}-c)+c$ . The result is identical, but now we only exponentiate negative numbers (no overflow), with at least one of these numbers equal to zero (no underflow). .

The SimNet architecture ([Cohen and Shashua(2014), Cohen et al.(2016)Cohen, Sharir, and Shashua]) naturally brings forth a numerically stable implementation of our networks. The architecture is based on two ingredients – a flexible similarity measure and the MEX operator:

The similarity layer, capable of computing both the common convolutional operator as well as weighted $l_{p}$ norm, may realize the representation by computing $\log f_{\theta}({\mathbf{x}}_{i})$ , whereas MEX can naturally implement both log-sum-exp and sum-pooling ( $\lim_{\beta\to 0}\textrm{MEX}_{\beta}({\mathbf{x}},\mathbf{0})=\textrm{mean}_{j}\{x_{j}\}$ ) in a numerically stable manner.

Not only are SimNets capable of correctly and efficiently implementing our networks, but they have already been demonstrated ([Cohen et al.(2016)Cohen, Sharir, and Shashua]) to perform as well as state of the art convolutional networks on several image recognition benchmarks, and outperform them when computational resources are limited.