Convolutional Rectifier Networks as Generalized Tensor Decompositions

Nadav Cohen, Amnon Shashua

Introduction

Deep neural networks are repeatedly proving themselves to be extremely effective machine learning models, providing state of the art accuracies on a wide range of tasks (see ). Arguably, the most successful deep learning architecture to date is that of convolutional neural networks (ConvNets, ), which prevails in the field of computer vision, and is recently being harnessed for many other application domains as well (e.g. ). Modern ConvNets are formed by stacking layers one after the other, where each layer consists of a linear convolutional operator followed by Rectified Linear Unit (ReLU ) activation ( $\sigma(z)=\max\{0,z\}$ ), which in turn is followed by max or average pooling ( $P\{c_{j}\}=\max\{c_{j}\}$ or $P\{c_{j}\}=\operatorname*{mean}\{c_{j}\}$ respectively). Such models, which we refer to as convolutional rectifier networks, have driven the resurgence of deep learning (), and represent the cutting edge of the ConvNet architecture ().

Despite their empirical success, and the vast attention they are receiving, our theoretical understanding of convolutional rectifier networks is partial at best. It is believed that they enjoy depth efficiency, i.e. that when allowed to go deep, such networks can implement with polynomial size computations that would require super-polynomial size if the networks were shallow. However, formal arguments that support this are scarce. It is unclear to what extent convolutional rectifier networks leverage depth efficiency, or more formally, what is the proportion of weight settings that would lead a deep network to implement a computation that cannot be efficiently realized by a shallow network. We refer to the most optimistic situation, where this takes place for all weight settings but a negligible (zero measure) set, as complete depth efficiency.

Compared to convolutional rectifier networks, our theoretical understanding of depth efficiency for arithmetic circuits, and in particular for convolutional arithmetic circuits, is much more developed. Arithmetic circuits (also known as Sum-Product Networks, ) are networks with two types of nodes: sum nodes, which compute a weighted sum of their inputs, and product nodes, computing the product of their inputs. The depth efficiency of arithmetic circuits has been studied by the theoretical computer science community for the last five decades, long before the resurgence of deep learning. Although many problems in the area remain open, significant progress has been made over the years, making use of various mathematical tools. Convolutional arithmetic circuits form a specific sub-class of arithmetic circuits. Namely, these are ConvNets with linear activation ( $\sigma(z)=z$ ) and product pooling ( $P\{c_{j}\}=\prod{c_{j}}$ ). Recently, analyzed convolutional arithmetic circuits through tensor decompositions, essentially proving, for the type of networks considered, that depth efficiency holds completely. Although convolutional arithmetic circuits are known to be equivalent to SimNets (), a new deep learning architecture that has recently demonstrated promising empirical performance (), they are fundamentally different from convolutional rectifier networks. Accordingly, the result established in does not apply to the models most commonly used in practice.

In this paper we present a construction, based on the notion of generalized tensor decompositions, that transforms convolutional arithmetic circuits of the type described in into convolutional rectifier networks. We then use the available mathematical tools from the world of arithmetic circuits to prove new results concerning the expressive power and depth efficiency of convolutional rectifier networks. Namely, we show that with ReLU activation, average pooling leads to loss of universality, whereas max pooling is universal but enjoys depth efficiency to a lesser extent than product pooling with linear activation (convolutional arithmetic circuits). These results indicate that from the point of view of expressive power and depth efficiency, convolutional arithmetic circuits (SimNets) have an advantage over the prevalent convolutional rectifier networks (ConvNets with ReLU activation and max or average pooling). This leads us to believe that developing effective methods for training convolutional arithmetic circuits, thereby fulfilling their expressive potential, may give rise to a deep learning architecture that is provably superior to convolutional rectifier networks but has so far been overlooked by practitioners.

The remainder of the paper is organized as follows. In sec. 2 we review existing works relating to depth efficiency of arithmetic circuits and networks with ReLU activation. Sec. 3 presents our definition of generalized tensor decompositions, followed by sec. 4 which employs this concept to frame convolutional rectifier networks. In sec. 5 we make use of this framework for an analysis of the expressive power and depth efficiency of such networks. Finally, sec. 6 concludes.

Related Work

None of the analyses above account for convolutional networks By this we mean that in all analyses, the deep networks shown to benefit from depth (i.e. to realize functions that require super-polynomial size from shallow networks) are not ConvNets. , thus they do not apply to the deep learning architecture most commonly used in practice. Recently, introduced convolutional arithmetic circuits, which may be viewed as ConvNets with linear activation and product pooling. These networks were shown to correspond to hierarchical tensor decompositions (see ). Tools from linear algebra, functional analysis and measure theory were then employed to prove that the networks are universal, and exhibit complete depth efficiency. Although similar in structure, convolutional arithmetic circuits are inherently different from convolutional rectifier networks (ConvNets with ReLU activation and max or average pooling). Accordingly, the analysis carried out in does not apply to the networks at the forefront of deep learning.

Closing the gap between the networks analyzed in and convolutional rectifier networks is the topic of this paper. We achieve this by generalizing tensor decompositions, thereby opening the door to mathematical machinery as used in , harnessing it to analyze, for the first time, the depth efficiency of convolutional rectifier networks.

Generalized Tensor Decompositions

We begin by establishing basic tensor-related terminology and notations. The definitions we give are actually concrete special cases of more abstract algebraic definitions as given in . We limit the discussion to these special cases since they suffice for our needs and are easier to grasp. For our purposes, a tensor is simply a multi-dimensional array:

Generalized tensor decompositions are simply obtained by plugging in the generalized tensor product $\otimes_{g}$ in place of the standard tensor product $\otimes$ .

From Networks to Tensors

Following the representation, a network includes $L$ hidden layers indexed by $l=0{\ldots}L-1$ . Each hidden layer $l$ begins with a $1\times 1$ conv operator, which is simply a 3D convolution with $r_{l}$ channels and receptive field $1\times 1$ followed by point-wise activation $\sigma(\cdot)$ . We allow the convolution to operate without weight sharing, in which case the filters that generate feature maps by sliding across the previous layer may have different coefficients at different spatial locations. This is often referred to in the deep learning community as a locally-connected layer (see ). We refer to it as the unshared case, in contrast to the shared case that gives rise to a standard $1\times 1$ convolution. The second (last) operator in a hidden layer is spatial pooling. Feature maps generated by $1\times 1$ conv are decimated, by applying the pooling operator $P(\cdot)$ (e.g. max or average) to non-overlapping 2D windows that cover the spatial extent. The last of the $L$ hidden layers ( $l=L-1$ ) reduces feature maps to singletons (its pooling operator is global), creating a vector of dimension $r_{L-1}$ . This vector is mapped into $Y$ network outputs through a final dense linear layer.

Altogether, the architectural parameters of a ConvNet are the type of representation functions ( $f_{\theta_{d}}$ ), the pooling window sizes (which in turn determine the number of hidden layers $L$ ), the setting of conv weights as shared or unshared, the number of channels in each layer ( $M$ for representation, $r_{0}{\ldots}r_{L-1}$ for hidden layers, $Y$ for output), and the choice of activation and pooling operators ( $\sigma(\cdot)$ and $P(\cdot)$ respectively). Given these architectural parameters, the learnable parameters of a network are the representation weights ( $\theta_{d}$ ), the conv weights ( ${\mathbf{a}}^{l,j,\gamma}$ for hidden layer $l$ , location $j$ and channel $\gamma$ in the unshared case; ${\mathbf{a}}^{l,\gamma}$ for hidden layer $l$ and channel $\gamma$ in the shared case), and the output weights ( ${\mathbf{a}}^{L,1,y}$ ).

The choice of activation and pooling operators determines the type of network we arrive at. For linear activation ( $\sigma(z)=z$ ) and product pooling ( $P\{c_{j}\}=\prod{c_{j}}$ ) we get a convolutional arithmetic circuit as analyzed in . For ReLU activation ( $\sigma(z)=\max\{0,z\}$ ) and max or average pooling ( $P\{c_{j}\}=\max\{c_{j}\}$ or $P\{c_{j}\}=\operatorname*{mean}\{c_{j}\}$ respectively) we get the commonly used convolutional rectifier networks, on which we focus in this paper.

In terms of pooling window sizes and network depth, we direct our attention to two special cases representing the extremes. The first is a shallow network that includes global pooling in its single hidden layer – see illustration in fig. 2. The second is the deepest possible network, in which all pooling windows cover only two entries, resulting in $L=\log_{2}N$ hidden layers. These ConvNets, which we refer to as shallow and deep respectively, will be shown to correspond to canonical tensor decompositions. It is for this reason, and for simplicity of presentation, that we focus on these special cases. One may just as well consider networks of intermediate depths with different pooling window sizes, and that would correspond to other, non-standard, tensor decompositions. The analysis carried out in sec. 5 can easily be adapted to such networks.

where $\sigma(\cdot)$ and $P(\cdot)$ are the network’s activation and pooling functions, respectively. Notice that the activation-pooling operator meets the associativity and commutativity requirements under product pooling with linear activation ( $g(a,b)=a{\cdot}b$ ), and under max pooling with ReLU activation ( $g(a,b)=\max\{a,b,0\}$ ). To account for the case of average pooling with ReLU activation, which a-priori leads to a non-associative activation-pooling operator, we simply replace average by sum, i.e. we analyze sum pooling with ReLU activation ( $g(a,b)=\max\{a,0\}+\max\{b,0\}$ ), which from the point of view of expressiveness is completely equivalent to average pooling with ReLU activation (scaling factors can always blend in to linear weights that follow pooling).

With the activation-pooling operator $g$ in place, it is straightforward to see that the grid tensor of $h_{y}^{S}$ – a score function generated by the shallow ConvNet (fig. 2), is given by the following generalized tensor decomposition:

Turning to the deep ConvNet (fig. 1 with size- $2$ pooling windows and $L=\log_{2}N$ hidden layers), the grid tensor of its score function $h_{y}^{D}$ is given by the hierarchical generalized tensor decomposition below:

To conclude this section, we presented a ConvNet architecture (fig. 1) whose activation and pooling operators may be chosen to realize convolutional arithmetic circuits (linear activation, product pooling) or convolutional rectifier networks (ReLU activation, max/average pooling). We then defined the grid tensor of a network’s score function as a tensor holding function values on an exponentially large grid whose points are sequences with elements chosen from a finite set of templates. Then, we saw that the grid tensor of a shallow ConvNet (fig. 2) is given by the generalized CP decomposition (eq. 6), and a grid tensor of a deep ConvNet (fig. 1 with $L=\log_{2}N$ ) is given by the generalized HT decomposition (eq. 7). In the next section we utilize the connection between ConvNets and generalized tensor decompositions for an analysis of the expressive power and depth efficiency of convolutional rectifier networks.

Capacity Analysis

In this section we analyze score functions expressible by the shallow and deep ConvNets (fig. 2, and fig. 1 with $L=\log_{2}N$ , respectively) under ReLU activation with max or average pooling (convolutional rectifier networks), comparing these settings against linear activation with product pooling (convolutional arithmetic circuits). Score functions are analyzed through grid tensors (eq. 3), represented by the generalized tensor decompositions established in the previous section: the generalized CP decomposition (eq. 6) corresponding to the shallow network, and the generalized HT decomposition (eq. 7) corresponding to the deep network. The analysis is organized as follows. In sec. 5.1 we present preliminary material required in order to follow our proofs. Sec. 5.2 discusses templates and representation functions, which form the bridge between score functions and generalized tensor decompositions. Sec. 5.3 presents matricization – a technical tool that facilitates the use of matrix theory for analyzing generalized tensor decompositions. The actual analysis begins in sec. 5.4, where we address the question of universality, i.e. of the ability of networks to realize any score function when their size is unlimited. This is followed by sec. 5.5 which studies depth efficiency, namely, situations where functions efficiently computable by deep networks require shallow networks to have super-polynomial size. Finally, sec. 5.6 analyzes the case of coefficient sharing, in which the conv operators of our networks are standard convolutions (as opposed to the more general locally-connected layers).

For evaluating the completeness of depth efficiency, and for other purposes as well, we are often interested in the “volume” of sets in a Euclidean space, or more formally, in their Lebesgue measure. While an introduction to Lebesgue measure theory is beyond the scope of this paper (the interested reader is referred to ), we restate here several concepts and results our proofs will rely upon. A zero measure set can intuitively be thought of as having zero volume. A union of countably many zero measure sets is itself a zero measure set. If we randomize a point in space by some continuous distribution, the probability of hitting a zero measure set is always zero. A useful fact (proven in for example) is that the zero set of a polynomial, i.e. the set of points on which a polynomial vanishes, is either the entire space (when the polynomial in question is the zero polynomial), or it must have measure zero. An open set always has positive measure, and when a point in space is drawn by a continuous distribution with non-vanishing continuous probability density function, the probability of hitting such a set is positive.

2 Templates and Representation Functions

Continuity: $f_{\theta}({\mathbf{x}})$ is continuous w.r.t. both $\theta$ and ${\mathbf{x}}$ .

Both of the assumptions above are met for most reasonable choices of ${\mathcal{F}}$ . In particular, non-degeneracy holds when representation functions are standard neurons:

Non-degeneracy means that given distinct templates, one may choose representation functions for which $F$ is non-singular. We may as well consider the opposite situation, where we are given representation functions, and would like to choose templates leading to non-singular $F$ . Apparently, so long as the representation functions are linearly independent, this is always possible:

We may view the determinant of $F$ (eq. 4) as a function of $({\mathbf{x}}^{(1)},\ldots,{\mathbf{x}}^{(M)})$ :

3 Matricization

Equipped with the matricization operator $[\cdot]$ and the generalized Kronecker product $\odot_{g}$ , we are now in a position to translate the generalized HT decomposition (eq. 7) to an expression for the matricization of a grid tensor generated by the deep ConvNet:

We refer to this factorization as the matricized generalized HT decomposition. Notice that the expression above for $\phi^{1,j,\gamma}$ is the same as in the original generalized HT decomposition, as order-2 tensors need not be matricized.

For the matricization of a grid tensor generated by the shallow ConvNet, we translate the generalized CP decomposition (eq. 6) into the matricized generalized CP decomposition:

The matricized generalized CP and HT decompositions (eq. 10 and 5.3 respectively) will be used throughout our proofs to establish depth efficiency. This is generally done by providing a lower bound on $rank[{\mathcal{A}}(h_{y}^{D})]$ – the rank of the deep ConvNet’s matricized grid tensor, and an upper bound on $rank[{\mathcal{A}}(h_{y}^{S})]$ – the rank of the shallow ConvNet’s matricized grid tensor. The upper bound on $rank[{\mathcal{A}}(h_{y}^{S})]$ will be linear in $Z$ , and so requiring ${\mathcal{A}}(h_{y}^{S})={\mathcal{A}}(h_{y}^{D})$ , and in particular $rank[{\mathcal{A}}(h_{y}^{S})]=rank[{\mathcal{A}}(h_{y}^{D})]$ , will give us a lower bound on $Z$ . That is to say, we obtain a lower bound on the number of hidden channels in the shallow ConvNet, that must be met in order for this network to replicate a grid tensor generated by the deep ConvNet. Our analysis of depth efficiency is given in sec. 5.5. As a prerequisite, we first head on to sec. 5.4 to analyze universality.

4 Universality

Universality refers to the ability of a network to realize (or approximate) any function of choice when no restrictions are imposed on its size. It is well-known that fully-connected neural networks are universal under all types of non-linear activations typically used in practice, even if the number of hidden layers is restricted to one (). To the best of our knowledge universality has never been studied in the context of convolutional rectifier networks. This is the purpose of the current subsection. Specifically, we analyze the universality of our shallow and deep ConvNets (fig. 2, and fig. 1 with $L=\log_{2}N$ , respectively) under ReLU activation and max or average pooling.

We begin by stating a result similar to that given in , according to which convolutional arithmetic circuits are universal:

Assuming covering templates exist, with linear activation and product pooling the shallow ConvNet is universal (hence so is the deep).

Heading on to convolutional rectifier networks, the following claim tells us that max pooling leads to universality:

Assuming covering templates exist, with ReLU activation and max pooling the shallow ConvNet is universal (hence so is the deep).

The proof follows the same line as that of claim 3, except we cannot rely on the ability of the standard CP decomposition to realize any tensor of choice. Instead, we need to show that the generalized CP decomposition (eq. 6) with $g(a,b)=\max\{a,b,0\}$ can realize any tensor, so long as $Z$ is large enough. We will show that $Z{\geq}2{\cdot}M^{N}$ suffices. For that, it is enough to consider an arbitrary indicator tensor, i.e. a tensor holding $1$ in some entry and in all other entries, and show that it can be expressed with $Z=2$ .

${\mathbf{a}}^{1,1}=\cdots={\mathbf{a}}^{1,N}={\mathbf{1}}$

$\forall{i\in[N]}:{\mathbf{a}}^{2,i}=\bar{{\mathbf{e}}}_{d_{i}}$

At this point we encounter the first somewhat surprising result, according to which convolutional rectifier networks are not universal with average pooling:

With ReLU activation and average pooling, both the shallow and deep ConvNets are not universal.

In accordance with the above, we complete our proof by showing that with $g(a,b)=\max\{a,0\}+\max\{b,0\}$ , the matricized generalized CP and HT decompositions (eq. 10 and 5.3 respectively) give rise to low-rank matrices. For the matricized generalized CP decomposition (eq. 10), corresponding to the shallow ConvNet, we have with $g(a,b)=\max\{a,0\}+\max\{b,0\}$ :

Turning to the matricized generalized HT decomposition (eq. 5.3), which corresponds to the deep ConvNet, we have with $g(a,b)=\max\{a,0\}+\max\{b,0\}$ :

One may wonder if perhaps the non-universality of ReLU activation and average pooling is merely an artifact of the conv operator in our ConvNets having $1\times 1$ receptive field. Apparently, as the following claim shows, expanding the receptive field does not remedy the situation, and indeed non-universality is an inherent property of convolutional rectifier networks with average pooling:

Consider the network illustrated in fig. 3, obtained by expanding the conv receptive field in the shallow ConvNet from $1\times 1$ to $w{\times}h$ , where $w{\cdot}h<\nicefrac{{N}}{{2}}+1-\log_{M}{N}$ (conv windows cover less than half the feature maps that precede them). Such a network, when equipped with ReLU activation and average pooling, is not universal.

where for every $i\in[N]$ , ${\mathcal{B}}^{i}$ is a tensor of order $w{\cdot}h$ and dimension $M$ in each mode, defined by:

Let ${\mathcal{O}}$ be a tensor of order $N-w{\cdot}h$ and dimension $M$ in each mode, holding $1$ in all entries. We may write:

where for every $i\in[N]$ , $p_{i}(\cdot)$ is an appropriately chosen operator that permutes the modes of an order- $N$ tensor.

We now make use of some known facts related to tensor rank (see sec. 5.1), in order to show that eq. 11 is not universal, i.e. that there are many tensors which cannot be realized by ${\mathcal{A}}(h_{y}^{S(w{\times}h)})$ . Being tensors of order $w{\cdot}h$ and dimension $M$ in each mode, the ranks of ${\mathcal{B}}^{1}\ldots{\mathcal{B}}^{N}$ are bounded above by $M^{w{\cdot}h-1}$ . Since ${\mathcal{O}}$ is an all- $1$ tensor, and since permuting modes does not alter rank, we have: $rank(p_{i}({\mathcal{B}}^{i}\otimes{\mathcal{O}})){\leq}M^{w{\cdot}h-1}~{}\forall{i\in[N]}$ . Finally, from sub-additivity of the rank we get: $rank({\mathcal{A}}(h_{y}^{S(w{\times}h)})){\leq}N{\cdot}M^{w{\cdot}h-1}$ . Now, we know by assumption that $w{\cdot}h<\nicefrac{{N}}{{2}}+1-\log_{M}{N}$ , and this implies: $rank({\mathcal{A}}(h_{y}^{S(w{\times}h)}))<M^{\nicefrac{{N}}{{2}}}$ . Since there exist tensors of order $N$ and dimension $M$ in each mode having rank at least $M^{\nicefrac{{N}}{{2}}}$ (actually only a negligible set of tensors do not meet this), eq. 11 is indeed not universal. That is to say, the shallow ConvNet with conv receptive field expanded to $w{\times}h$ (fig. 3) cannot realize all grid tensors on the templates ${\mathbf{x}}^{(1)}\ldots{\mathbf{x}}^{(M)}$ . ∎

We conclude this subsection by noting that the non-universality result in claim 6 does not contradict the known universality of shallow (single hidden layer) fully-connected neural networks. Indeed, a shallow fully-connected network corresponds to the ConvNet illustrated in fig. 3, with conv receptive field covering the entire spatial extent ( $w{\cdot}h=N$ ), thereby effectively removing the pooling operator (assuming the latter realizes the identity on singletons). In claim 7 below we show that such a network, when equipped with ReLU activation, is universal. On the other hand, in claim 6 we assumed that the ConvNet’s receptive field covers less than half the spatial extent ( $w{\cdot}h<\nicefrac{{N}}{{2}}+1-\log_{M}{N}$ ), and have shown that with ReLU activation and average pooling, this leads to non-universality. Loosely speaking, our findings imply that for networks with ReLU activation, which are known to be universal when fully-connected, introducing locality disrupts universality with average pooling (and maintains it with max pooling).

Assume that there exist covering templates ${\mathbf{x}}^{(1)}\ldots{\mathbf{x}}^{(M)}$ , and corresponding representation functions $f_{\theta_{1}}{\ldots}f_{\theta_{M}}$ leading to a matrix $F$ (eq. 4) that has non-recurring rows and a constant non-zero column The assumption that such representation functions exist differs from our usual non-degeneracy assumption. The latter requires $F$ to be non-singular, whereas here we pose the weaker requirement of $F$ having non-recurring rows. On the other hand, here we also demand that $F$ have a constant non-zero column, i.e. that there be a representation function $f_{\theta_{d}}$ such that $f_{\theta_{d}}({\mathbf{x}}^{(1)})=\cdots=f_{\theta_{d}}({\mathbf{x}}^{(M)})=c\neq 0$ . In claim 1 we showed that standard neurons meet the non-degeneracy assumption. A slight modification to its proof shows that they also meet the assumption made here. Namely, if we modify the constructions for the cases of ReLU activation and sigmoidal activation by setting $f_{\theta_{1}}({\mathbf{x}})=\sigma({\mathbf{0}}^{\top}{\mathbf{x}}+1)$ and $f_{\theta_{1}}({\mathbf{x}})=\sigma({\mathbf{0}}^{\top}{\mathbf{x}}+\alpha)$ respectively, we get matrices $F$ that are not only non-singular, but also have a constant non-zero column. . Consider the fully-connected network illustrated in fig. 4, obtained by expanding the conv receptive field in the shallow ConvNet to cover the entire spatial extent. Such a network, when equipped with ReLU activation, is universal.

Let $h_{y}^{S(fc)}$ be the $y$ ’th score function of our shallow fully-connected network (fig. 4) when equipped with ReLU activation ( $\sigma(z)=\max\{0,z\}$ ). We would like to show that ${\mathcal{A}}(h_{y}^{S(fc)})$ – the grid tensor of $h_{y}^{S(fc)}$ w.r.t. the covering templates ${\mathbf{x}}^{(1)}\ldots{\mathbf{x}}^{(M)}$ , may take on any value when hidden and output weights ( $\{A^{z}\}_{z\in[Z]}$ and ${\mathbf{a}}^{y}$ respectively) are chosen appropriately.

For any $d_{1}{\ldots}d_{N}\in[M]$ , define the following matrix:

In words, $F^{(d_{1}{\ldots}d_{N})}$ is the matrix obtained by taking rows $d_{1}{\ldots}d_{N}$ from $F$ (recurrence allowed), and stacking them one on top of the other. It holds that:

where $\left\langle{\cdot},{\cdot}\right\rangle$ stands for the inner-product operator, i.e. $\left\langle{F^{(d_{1}{\ldots}d_{N})}},{A^{z}}\right\rangle:=\sum_{i=1}^{N}\sum_{d=1}^{M}F_{i,d}^{(d_{1}{\ldots}d_{N})}A_{i,d}^{z}$ .

${\mathbf{w}}_{1}=\cdots={\mathbf{w}}_{k}={\mathbf{u}}$

$b_{1}=-{\mathbf{u}}^{\top}{\mathbf{v}}_{1}+1$

$b_{j}=-{\mathbf{u}}^{\top}{\mathbf{v}}_{j-1}$ for $j=2{\ldots}k$

$a_{j}=\frac{c_{j}-c_{j-1}}{{\mathbf{u}}^{\top}{\mathbf{v}}_{j}-{\mathbf{u}}^{\top}{\mathbf{v}}_{j-1}}-\sum_{t=1}^{j-1}a_{t}$ for $j=2{\ldots}k$

To complete the proof, we show below that this assignment meets the condition in eq. 12 for $i=1{\ldots}k$ .

implies that the condition in eq. 12 indeed holds for $i=1$ :

Comparing this to the same expression with $i$ replaced by $i-1$ we obtain:

Now, if we follow an inductive argument and assume that the condition in eq. 12 holds for $i-1$ , i.e. that $\sum_{j=1}^{k}a_{j}\max\{0,{\mathbf{w}}_{j}^{\top}{\mathbf{v}}_{i-1}+b_{j}\}=c_{i-1}$ , we get:

Plugging in the definition $a_{i}=\frac{c_{i}-c_{i-1}}{{\mathbf{u}}^{\top}{\mathbf{v}}_{i}-{\mathbf{u}}^{\top}{\mathbf{v}}_{i-1}}-\sum_{j=1}^{i-1}a_{j}$ gives:

Thus the condition in eq. 12 holds for $i$ as well. We have therefore shown by induction that our assignment of ${\mathbf{w}}_{1}{\ldots}{\mathbf{w}}_{k}$ , ${\mathbf{b}}$ and ${\mathbf{a}}$ meets the lemma’s requirement. ∎

5 Depth Efficiency

The driving force behind deep learning is the expressive power that comes with depth. It is generally believed that deep networks with non-linear layers efficiently express functions that cannot be efficiently expressed by shallow networks, i.e. that would require the latter to have super-polynomial size. We refer to such scenario as depth efficiency. Being concerned with the minimal size required by a shallow network in order to realize (or approximate) a given function, the question of depth efficiency implicitly assumes universality, i.e. that there exists some (possibly exponential) size with which the shallow network is capable of expressing the target function. While technically it is possible to consider depth efficiency with a non-universal shallow network, in the majority of the cases, particularly in our framework, the shallow network would simply not be able to express a function generated by a deep network, no matter how large we allow it to be. Arguably, this provides little insight into the complexity of functions brought forth by depth.

To the best of our knowledge, at the time of this writing the only work to formally analyze depth efficiency in the context of ConvNets is . This work focused on convolutional arithmetic circuits, showing that with such networks depth efficiency is complete, i.e. besides a negligible set, all functions realizable by a deep network enjoy depth efficiency. We frame this result in our setup:

Let $f_{\theta_{1}}{\ldots}f_{\theta_{M}}$ be any set of linearly independent representation functions for a deep ConvNet (fig. 1 with $L=\log_{2}N$ ) with linear activation and product pooling. Suppose we randomize the linear weights ( ${\mathbf{a}}^{l,j,\gamma}$ ) of the network by some continuous distribution. Then, with probability 1, we obtain score functions that cannot be realized by a shallow ConvNet (fig. 2) with linear activation and product pooling if the number of hidden channels in the latter ( $Z$ ) is less than $\min\{r_{0},M\}^{\nicefrac{{N}}{{2}}}$ .

We now turn to convolutional rectifier networks, for which depth efficiency has yet to be analyzed. In sec. 5.4 we saw that convolutional rectifier networks are universal with max pooling, and non-universal with average pooling. Since depth efficiency is only applicable to universal architectures, we focus on the former setting. The following claim establishes existence of depth efficiency for ConvNets with ReLU activation and max pooling:

There exist weight settings for a deep ConvNet with ReLU activation and max pooling, giving rise to score functions that cannot be realized by a shallow ConvNet with ReLU activation and max pooling if the number of hidden channels in the latter ( $Z$ ) is less than $\min\{r_{0},M\}^{\nicefrac{{N}}{{2}}}\cdot\frac{2}{M{\cdot}N}$ .

In light of the above, the proof boils down to showing that with $g(a,b)=\max\{a,b,0\}$ :

The matricized generalized CP decomposition (eq. 10) produces matrices with rank at most $Z{\cdot}\frac{M{\cdot}N}{2}$ .

For an invertible $F$ , there exists a weight ( ${\mathbf{a}}^{l,j,\gamma}$ ) setting under which the matricized generalized HT decomposition (eq. 5.3) produces a matrix with rank at least $\min\{r_{0},M\}^{\nicefrac{{N}}{{2}}}$ .

This would imply that every summand in the matricized generalized CP decomposition (eq. 10) has rank at most $\frac{M{\cdot}N}{2}$ , and the desired result readily follows. To prove eq. 13, note that each of the vectors $\bar{{\mathbf{v}}}:={\mathbf{v}}_{1}\odot_{g}\cdots\odot_{g}{\mathbf{v}}_{\frac{N}{2}}$ and $\bar{{\mathbf{u}}}:={\mathbf{u}}_{1}\odot_{g}\cdots\odot_{g}{\mathbf{u}}_{\frac{N}{2}}$ are of dimension $M^{\nicefrac{{N}}{{2}}}$ , but have only up to $\frac{M{\cdot}N}{2}$ unique values. Let $\delta_{{\mathbf{v}}},\delta_{{\mathbf{u}}}:[M^{N/2}]\to[M^{N/2}]$ be permutations that arrange the entries of $\bar{{\mathbf{v}}}$ and $\bar{{\mathbf{u}}}$ in descending order. Permuting the rows of the matrix $\bar{{\mathbf{v}}}\odot_{g}\bar{{\mathbf{u}}}^{\top}$ via $\delta_{{\mathbf{v}}}$ , and the columns via $\delta_{{\mathbf{u}}}$ , obviously does not change its rank. On the other hand, we get a $M^{N/2}{\times}M^{N/2}$ matrix with a $\frac{M{\cdot}N}{2}\times\frac{M{\cdot}N}{2}$ block structure, each block being constant (i.e. all entries of a block hold the same value). This implies that the rank of $\bar{{\mathbf{v}}}\odot_{g}\bar{{\mathbf{u}}}^{\top}$ is at most $\frac{M{\cdot}N}{2}$ , which is what we set out to prove.

Moving on to the matricized generalized HT decomposition (eq. 5.3), for an invertible $F$ we define the following weight setting ( ${\mathbf{0}}$ and ${\mathbf{1}}$ here denote the all- and all- $1$ vectors, respectively):

${\mathbf{a}}^{l,j,\gamma}=\left\{\begin{array}[]{ll}{\mathbf{1}}&,\gamma=1~{},~{}l\in[L-1]\\ {\mathbf{0}}&,\gamma>1~{},~{}l\in[L-1]\end{array}\right.$

Under this setting, the produced matrix $\left[{\mathcal{A}}\left(h_{y}^{D}\right)\right]$ holds $\min\{r_{0},M\}$ everywhere besides $\min\{r_{0},M\}^{\nicefrac{{N}}{{2}}}$ entries on its diagonal, where it holds $\min\{r_{0},M\}-1$ . The rank of this matrix is at least $\min\{r_{0},M\}^{\nicefrac{{N}}{{2}}}$ . ∎

Nearly all results in the literature that relate to depth efficiency merely show its existence, and claim 9 is no different in that respect. From a practical perspective, the implications of such results are slight, as a-priori, it may be that only a small fraction of the functions realizable by a deep network enjoy depth efficiency, and for all the rest shallow networks suffice. In sec. 5.5.2 we extend claim 9, arguing that with ReLU activation and max pooling, depth efficiency becomes more and more prevalent as the number of hidden channels in the deep ConvNet grows. However, no matter how large the deep ConvNet is, with ReLU activation and max pooling depth efficiency is never complete – there is always positive measure to the set of weight configurations that lead the deep ConvNet to generate score functions efficiently realizable by the shallow ConvNet:

Suppose we randomize the weights of a deep ConvNet with ReLU activation and max pooling by some continuous distribution with non-vanishing continuous probability density function. Then, assuming covering templates exist, with positive probability, we obtain score functions that can be realized by a shallow ConvNet with ReLU activation and max pooling having only a single hidden channel ( $Z=1$ ).

In light of the above, the proof boils down to the following claim, framed in terms of our generalized tensor decompositions. Fixing $g(a,b)=\max\{a,b,0\}$ , per arbitrary invertible $F$ there exists a weight ( ${\mathbf{a}}^{l,j,\gamma}$ ) setting for the generalized HT decomposition (eq. 7), such that the produced tensor may be realized by the generalized CP decomposition (eq. 6) with $Z=1$ , and this holds even if the weights ${\mathbf{a}}^{l,j,\gamma}$ and matrix $F$ are subject to small perturbations Recall that by assumption representation functions are continuous w.r.t. their parameters ( $f_{\theta}({\mathbf{x}})$ is continuous w.r.t. $\theta$ ), and so small perturbations on representation parameters ( $\theta_{d}$ ) translate into small perturbations on the matrix $F$ (eq. 4). .

We will now show that the following weight setting meets our requirement ( ${\mathbf{0}}$ and ${\mathbf{1}}$ here denote the all- and all- $1$ vectors, respectively):

${\mathbf{a}}^{0,j,\gamma}=\left\{\begin{array}[]{ll}F^{-1}{\mathbf{1}}&,j~{}\text{odd}\\ {\mathbf{0}}&,j~{}\text{even}\end{array}\right.$

${\mathbf{a}}^{l,j,\gamma}=\left\{\begin{array}[]{ll}{\mathbf{1}}&,j~{}\text{odd}~{}~{},~{}l\in[L-1]\\ {\mathbf{0}}&,j~{}\text{even}~{},~{}l\in[L-1]\end{array}\right.$

Let ${\mathcal{E}}^{F}$ be an additive noise matrix applied to $F$ , and $\{{\boldsymbol{\epsilon}}^{l,j,\gamma}\}_{l,j,\gamma}$ be additive noise vectors applied to $\{{\mathbf{a}}^{l,j,\gamma}\}_{l,j,\gamma}$ . We use the notation ${\mathbf{o}}(\epsilon)$ to refer to vectors that tend to ${\mathbf{0}}$ as ${\mathcal{E}}^{F}\to 0$ and ${\boldsymbol{\epsilon}}^{l,j,\gamma}\to{\mathbf{0}}$ , with the dimension of a vector to be understood by context. Plugging in the noisy variables into the generalized HT decomposition (eq. 7), we get for every $j\in[N/2]$ and $\alpha\in[r_{0}]$ :

If the applied noise ( ${\mathcal{E}}^{F},{\boldsymbol{\epsilon}}^{l,j,\gamma}$ ) is small enough this is equal to $({\mathbf{1}}+{\mathbf{o}}(\epsilon))\otimes{\mathbf{1}}$ (recall that $\otimes$ stands for the standard tensor product), and we in turn get for every $j\in[N/4]$ and $\gamma\in[r_{1}]$ :

With the applied noise ( ${\mathcal{E}}^{F},{\boldsymbol{\epsilon}}^{l,j,\gamma}$ ) small enough this becomes $(r_{0}{\mathbf{1}}+{\mathbf{o}}(\epsilon)\otimes{\mathbf{1}}\otimes{\mathbf{1}}\otimes{\mathbf{1}}$ . Continuing in this fashion over the levels of the decomposition, we get that with small enough noise, for every $l\in[L-1]$ , $j\in[N/2^{l+1}]$ and $\gamma\in[r_{l}]$ :

where $\otimes_{i=1}^{2^{l+1}-1}{\mathbf{1}}$ stands for the tensor product of the vector ${\mathbf{1}}$ with itself $2^{l+1}-1$ times. We readily conclude from this that with small enough noise, the tensor produced by the decomposition may be written as follows:

leads to ${\mathcal{A}}\left(h_{y}^{S}\right)={\mathcal{A}}\left(h_{y}^{D}\right)$ , as required. ∎

Comparing claims 8 and 10, we see that depth efficiency is complete under linear activation with product pooling, and incomplete under ReLU activation with max pooling. We interpret this as indicating that convolutional arithmetic circuits benefit from the expressive power of depth more than convolutional rectifier networks do. This result is rather surprising, especially given the fact that convolutional rectifier networks are much more commonly used in practice. We attribute the discrepancy primarily to historical reasons, and conjecture that developing effective methods for training convolutional arithmetic circuits, thereby fulfilling their expressive potential, may give rise to a deep learning architecture that is provably superior to convolutional rectifier networks but has so far been overlooked by practitioners.

Loosely speaking, we have shown that the gap in expressive power between the shallow and deep ConvNets is greater with linear activation and product pooling than it is with ReLU activation and max pooling. One may wonder at this point if it is plausible to deduce from this which architectural setting is more expressive, as a-priori, altering the shallow vs. deep ConvNet comparisons such that one network has linear activation with product pooling and the other has ReLU activation with max pooling, may change the expressive gaps in favor of the latter. Claims 11 and 12 below show that this is not the case. Specifically, they show that the depth efficiency of the deep ConvNet with linear activation and product pooling remains complete when the shallow ConvNet has ReLU activation and max pooling (claim 11), and on the other hand, the depth efficiency of the deep ConvNet with ReLU activation and max pooling remains incomplete when the shallow ConvNet has linear activation and product pooling (claim 12). This affirms our stand regarding the expressive advantage of convolutional arithmetic circuits over convolutional rectifier networks.

Let $f_{\theta_{1}}{\ldots}f_{\theta_{M}}$ be any set of linearly independent representation functions for a deep ConvNet with linear activation and product pooling. Suppose we randomize the weights of the network by some continuous distribution. Then, with probability 1, we obtain score functions that cannot be realized by a shallow ConvNet with ReLU activation and max pooling if the number of hidden channels in the latter ( $Z$ ) is less than $\min\{r_{0},M\}^{\nicefrac{{N}}{{2}}}\cdot\frac{2}{M{\cdot}N}$ .

Suppose we randomize the weights of a deep ConvNet with ReLU activation and max pooling by some continuous distribution with non-vanishing continuous probability density function. Then, assuming covering templates exist, with positive probability, we obtain score functions that can be realized by a shallow ConvNet with linear activation and product pooling having only a single hidden channel ( $Z=1$ ).

This leads to ${\mathcal{A}}\left(h_{y}^{S}\right)={\mathcal{A}}\left(h_{y}^{D}\right)$ , as required. ∎

In their current form, the results in our analysis establishing depth efficiency (claims 8, 9, 11 and the analogous ones in sec. 5.6) relate to exact realization. Specifically, they provide a lower bound on the size of a shallow ConvNet required in order for it to realize exactly a grid tensor generated by a deep ConvNet. From a practical perspective, a more interesting question would be the size required by a shallow ConvNet in order to approximate the computation of a deep ConvNet. A-priori, it may be that although the size required for exact realization is exponential, the one required for approximation is only polynomial. As we briefly discuss below, this is not the case, and in fact all of the lower bounds we have provided apply not only to exact realization, but also to arbitrarily-well approximation.

5.2 On the Incidence of Depth Efficiency

In claim 8 we saw that depth efficiency is complete with linear activation and product pooling. That is to say, with linear activation and product pooling, besides a negligible set, all weight settings for the deep ConvNet (fig. 1 with size- $2$ pooling windows and $L=\log_{2}N$ hidden layers) lead to score functions that cannot be realized by the shallow ConvNet (fig. 2) unless the latter has super-polynomial size. We have also seen (claims 9 and 10) that replacing the activation and pooling operators by ReLU and max respectively, makes depth efficiency incomplete. There are still weight settings leading the deep ConvNet to generate score functions that require the shallow ConvNet to have super-polynomial size, but these do not occupy the entire space. In other words, there is now positive measure to the set of deep ConvNet weight configurations leading to score functions efficiently realizable by the shallow ConvNet. A natural question would then be just how frequent depth efficiency is under ReLU activation and max pooling. More formally, we may consider a uniform distribution over a compact domain in the deep ConvNet’s weight space, and ask the following. Assuming weights for the deep ConvNet are drawn from this distribution, what is the probability that generated score functions exhibit depth efficiency, i.e. require super-polynomial size from the shallow ConvNet? In the following we address this question, arguing that the probability tends to $1$ as the number of channels in the hidden layers of the deep ConvNet grows. We do not prove this formally, but nonetheless provide a framework we believe may serve as a basis for establishing formal results concerning the incidence of depth efficiency. The framework is not limited to ReLU activation and max pooling – it may be used under different choices of activation and pooling operators as well.

The central tool used in the paper for proving depth efficiency is the rank of grid tensors when these are arranged as matrices. We establish upper bounds on the rank of matricized grid tensors produced by the shallow ConvNet through the matricized generalized CP decomposition (eq. 10). These upper bounds are typically linear in the size of the input ( $N$ ) and the number of hidden channels in the network ( $Z$ ). The challenge is then to derive a super-polynomial (in $N$ ) lower bound on the rank of matricized grid tensors produced by the deep ConvNet through the matricized generalized HT decomposition (eq. 5.3). In the case of linear activation and product pooling ( $g(a,b)=a{\cdot}b$ ), the generalized Kronecker product $\odot_{g}$ reduces to the standard Kronecker product $\odot$ , and the rank-multiplicative property of the latter ( $rank(A{\odot}B)=rank(A){\cdot}rank(B)$ ) can be used to show (see ) that besides in negligible (zero measure) cases, rank grows rapidly through the levels of the matricized generalized HT decomposition (eq. 5.3), to the point where the final produced matrix has exponential rank. This situation does not persist when the activation and pooling operators are replaced by ReLU and max (respectively). Indeed, in the proof of claim 10 we explicitly presented a non-negligible (positive measure) case where the matricized generalized HT decomposition (eq. 5.3) produces a matrix of rank $1$ . To study the incidence of depth efficiency under ReLU activation and max pooling, we assume the weights ( ${\mathbf{a}}^{l,j,\gamma}$ ) of the matricized generalized HT decomposition (eq. 5.3) are drawn independently and uniformly from a bounded interval (e.g. $ $), and question the probability of the produced matrix$ [{\mathcal{A}}\left(h_{y}^{D}\right)] $having rank super-polynomial in$ N$.

The main piece that is missing in order to complete the sketch we have outlined above into a formal proof, is the behavior of rank under the generalized Kronecker product $\odot_{g}$ . This obviously depends on the choice of underlying operator $g$ . In the case of linear activation and product pooling $g(a,b)=a{\cdot}b$ , the generalized Kronecker product $\odot_{g}$ reduces to the standard Kronecker product $\odot$ , and ranks always increase multiplicatively, i.e. $rank(A{\odot}B)=rank(A){\cdot}rank(B)$ for any matrices $A$ and $B$ . The fact that there is a simple law governing the behavior of ranks makes this case relatively simple to analyze, and we indeed have a full characterization (claim 8). In the case of linear activation and max pooling the underlying operator is given by $g(a,b)=\max\{a,b\}$ , and it is not difficult to see that $\odot_{g}$ does not decrease rank, i.e. $rank(A{\odot_{g}}B){\geq}\min\{rank(A),rank(B)\}$ for any matrices $A$ and $B$ To see this, simply note that under the choice $g(a,b)=\max\{a,b\}$ there is either a sub-matrix of $A{\odot_{g}}B$ that is equal to $A$ , or one that is equal to $B$ . . For ReLU activation and max pooling, corresponding to the choice $g(a,b)=\max\{a,b,0\}$ , there is no simple rule depicting the behavior of ranks under $\odot_{g}$ , and in fact, for matrices $A$ and $B$ holding negative values, the rank of $rank(A{\odot_{g}}B)$ necessarily drops to zero. Nonetheless, it seems reasonable to assume that at least in some cases, a non-linear operation such as $\odot_{g}$ does increase rank, and as we have seen, benefiting from these cases is more probable when the hidden layers of the deep ConvNet include many channels. To this end, we provide in fig. 5 simulation results for the case of ReLU activation and max pooling ( $g(a,b)=\max\{a,b,0\}$ ), demonstrating that indeed ranks produced by the matricized generalized HT decomposition (eq. 5.3) tend to be higher as $r_{0}{\ldots}r_{L-1}$ grow. We leave a complete formal analysis of this phenomenon to future work.

Our proof relies on concepts and results from Lebesgue measure theory (see sec. 5.1 for a brief discussion). The result to prove is equivalent to stating that there is measure zero to the set of weight vectors ${\boldsymbol{\alpha}}$ for which $rank(A({\boldsymbol{\alpha}}))<\max_{i\in[m]}r_{i}$ .

6 Shared Coefficients for Convolution

To this end, our analysis has focused on the unshared setting, where the coefficients of the $1\times 1$ conv filters in our networks (fig. 1) may vary across spatial locations. In practice, ConvNets typically enforce sharing, which in our framework implies that the coefficients of the $1\times 1$ conv filter in channel $\gamma$ of hidden layer $l$ , are the same for all locations $j$ . In this subsection we analyze the shared setting, following a line similar to that of our analysis for the unshared setting given in sec. 5.4 and 5.5. For brevity, we assume the reader is familiar with the latter, and do not repeat discussions given there.

In the shared setting, the shallow ConvNet (fig. 2) would have a single weight vector ${\mathbf{a}}^{z}$ for every hidden channel $z$ , as opposed to the unshared setting where it had a weight vector ${\mathbf{a}}^{z,i}$ for every location $i$ in every hidden channel $z$ . Grid tensors produced by the shallow ConvNet in the shared setting are given by what we call the shared generalized CP decomposition:

As for the deep ConvNet (fig. 1 with size- $2$ pooling windows and $L=\log_{2}N$ hidden layers), in the shared setting, instead of having a weight vector ${\mathbf{a}}^{l,j,\gamma}$ for every hidden layer $l$ , channel $\gamma$ and location $j$ , there is a single weight vector ${\mathbf{a}}^{l,\gamma}$ for all locations of channel $\gamma$ in hidden layer $l$ . Produced grid tensors are then given by the shared generalized HT decomposition:

We now turn to analyze universality and depth efficiency in the shared setting.

In the unshared setting we saw (sec. 5.4) that linear activation with product pooling and ReLU activation with max pooling both lead to universality, whereas ReLU activation with average pooling does not. We will now see that in the shared setting, no matter how the activation and pooling operators are chosen, universality is never met.

A shallow ConvNet with shared weights produces grid tensors through the shared generalized CP decomposition (eq. 15). A tensor ${\mathcal{A}}$ generated by this decomposition is necessarily symmetric, i.e. for any permutation $\delta:[N]\to[N]$ and indexes $d_{1}{\ldots}d_{N}$ it meets: ${\mathcal{A}}_{d_{1}{\ldots}d_{N}}={\mathcal{A}}_{\delta(d_{1}){\ldots}\delta(d_{N})}$ . Obviously not all tensors share this property, so indeed a shallow ConvNet with weight sharing is not universal. A deep ConvNet with shared weights produces grid tensors through the shared generalized HT decomposition (eq. 16). For this decomposition, a generated tensor ${\mathcal{A}}$ is invariant to replacing the first and second halves of its modes, i.e. for any indexes $d_{1}{\ldots}d_{N}$ it meets: ${\mathcal{A}}_{d_{1},{\ldots},d_{N}}={\mathcal{A}}_{d_{\nicefrac{{N}}{{2}}+1},\ldots,d_{N},d_{1},\ldots,d_{\nicefrac{{N}}{{2}}}}$ . Although this property is much less stringent than symmetry, it is still not met by most tensors, and so a deep ConvNet with weight sharing is not universal either.

6.2 Depth Efficiency

Depth efficiency deals with the computational complexity of replicating a deep network’s function using a shallow network. In order for this question to be applicable, we require that the shallow network be a universal machine. If this is not the case, then it is generally likely that the deep network’s function simply lies outside the reach of the shallow network, and we do not obtain a quantitative insight into the true power of depth. Since our shallow ConvNets are not universal with shared weights (sec. 5.6.1), we evaluate depth efficiency of deep ConvNets with shared weights against shallow ConvNets with unshared weights. Specifically, we do this for the activation-pooling choices leading shallow ConvNets with unshared weights to be universal: linear activation with product pooling, and ReLU activation with max pooling (see sec. 5.4).

For linear activation with product pooling, the following claim, which is essentially a derivative of theorem 1 in , tells us that in the shared setting, as in the unshared setting, depth efficiency holds completely:

Let $f_{\theta_{1}}{\ldots}f_{\theta_{M}}$ be any set of linearly independent representation functions for a deep ConvNet with linear activation, product pooling and weight sharing. Suppose we randomize the weights of the network by some continuous distribution. Then, with probability 1, we obtain score functions that cannot be realized by a shallow ConvNet with linear activation and product pooling (not limited by weight sharing), if the number of hidden channels in the latter ( $Z$ ) is less than $\min\{r_{0},M\}^{\nicefrac{{N}}{{2}}}$ .

The proof here is almost identical to that of claim 8. The only difference is that in the latter, we used the fact that the generalized HT decomposition (eq. 7), when equipped with $g(a,b)=a{\cdot}b$ , almost always produces tensors whose matrix arrangements have rank at least $\min\{r_{0},M\}^{\nicefrac{{N}}{{2}}}$ , whereas here, we require an analogous result for the shared generalized HT decomposition (eq. 16). Such result is provided by the proof of theorem 1 in . ∎

Heading on to ReLU activation and max pooling, we will show that here too, the situation in the shared setting is the same as in the unshared setting. Specifically, depth efficiency holds, but not completely. We prove this via two claims, analogous to claims 9 and 10 in sec. 5.5:

There exist weight settings for a deep ConvNet with ReLU activation, max pooling and weight sharing, giving rise to score functions that cannot be realized by a shallow ConvNet with ReLU activation and max pooling (not limited by weight sharing), if the number of hidden channels in the latter ( $Z$ ) is less than $\min\{r_{0},M\}^{\nicefrac{{N}}{{2}}}\cdot\frac{2}{M{\cdot}N}$ .

Suppose we randomize the weights of a deep ConvNet with ReLU activation, max pooling and weight sharing by some continuous distribution with non-vanishing continuous probability density function. Then, assuming covering templates exist, with positive probability, we obtain score functions that can be realized by a shallow ConvNet with ReLU activation and max pooling having only a single hidden channel ( $Z=1$ ).

The proof is similar in spirit to that of claim 10, which dealt with incompleteness of depth efficiency under ReLU activation and max pooling in the unshared setting. Our focus here is on the shared setting, or more specifically, on the case where the deep ConvNet is limited by weight sharing while the shallow ConvNet is not. Accordingly, we would like to show the following. Fixing $g(a,b)=\max\{a,b,0\}$ , per arbitrary invertible $F$ there exists a weight ( ${\mathbf{a}}^{l,\gamma}$ ) setting for the shared generalized HT decomposition (eq. 16), such that the produced tensor may be realized by the generalized CP decomposition (eq. 6) with $Z=1$ , and this holds even if the weights ${\mathbf{a}}^{l,\gamma}$ and matrix $F$ are subject to small perturbations.

Turning to the main part of the proof, we now show that the following weight setting meets our requirement:

${\mathbf{a}}^{0,\gamma}=F^{-1}{\mathbf{v}}$

${\mathbf{a}}^{l,\gamma}={\mathbf{1}}$ , $l\in[L-1]$

For convenience, we adopt the notation ${\mathbf{o}}(\epsilon)$ as referring to vectors that tend to ${\mathbf{0}}$ as ${\mathcal{E}}^{F}\to 0$ and ${\boldsymbol{\epsilon}}^{l,\gamma}\to{\mathbf{0}}$ , with the dimension of a vector to be understood by context. Plugging in the noisy variables into the shared generalized HT decomposition (eq. 16), we get for every $\alpha\in[r_{0}]$ :

To recapitulate this subsection, we have shown that introducing weight sharing into the $1\times 1$ conv operators of our networks, thereby limiting the general locally-connected linear mappings to be standard convolutions, disrupts universality, but leaves depth efficiency intact – it remains to hold completely under linear activation with product pooling, and incompletely under ReLU activation with max pooling.

Discussion

The contribution of this paper is twofold. First, we introduce a construction in the form of generalized tensor decompositions, that enables transforming convolutional arithmetic circuits into convolutional rectifier networks (ConvNets with ReLU activation and max or average pooling). This opens the door to various mathematical tools from the world of arithmetic circuits, now available for analyzing convolutional rectifier networks. As a second contribution, we make use of such tools to prove new results on the expressive properties that drive this important class of networks.

Our analysis shows that convolutional rectifier networks are universal with max pooling, but not with average pooling. This implies that if non-linearity originates solely from ReLU activation, increasing network size alone is not sufficient for expressing arbitrary functions. More interestingly, we analyze the behavior of convolutional rectifier networks in terms of depth efficiency, i.e. of cases where a function generated by a deep network of polynomial size requires shallow networks to have super-polynomial size. It is known that convolutional arithmetic circuits exhibit complete depth efficiency, meaning that besides a negligible (zero measure) set, all functions generated by deep networks of this type are depth efficient. We show that this is not the case with convolutional rectifier networks, for which depth efficiency exists, but is weaker in the sense that it is not complete (there is positive measure to the set of functions generated by a deep network that may be efficiently realized by shallow networks).

Depth efficiency is believed to be the key factor behind the success of deep learning. Our analysis indicates that from this perspective, the widely used convolutional rectifier networks are inferior to convolutional arithmetic circuits. This leads us to believe that convolutional arithmetic circuits bear the potential to improve the performance of deep learning beyond what is witnessed today. Of course, a practical machine learning model is measured not only by its expressive power, but also by our ability to train it. Over the years, massive amounts of research have been devoted to training convolutional rectifier networks. Convolutional arithmetic circuits on the other hand received far less attention, although they have been successfully trained in recent works on the SimNet architecture (), demonstrating how the enhanced expressive power can lead to state of the art performance in computationally limited settings.

We believe that developing effective methods for training convolutional arithmetic circuits, thereby fulfilling their expressive potential, may give rise to a deep learning architecture that is provably superior to convolutional rectifier networks but has so far been overlooked by practitioners.

This work is partly funded by Intel grant ICRI-CI no. 9-2012-6133 and by ISF Center grant 1790/12. Nadav Cohen is supported by a Google Fellowship in Machine Learning.

References

Appendix A Existence of Covering Templates

If the dimension of image patches is small then it seems reasonable to believe that relatively few templates can indeed cover the possible appearances of a patch. For example, in the extreme case where each patch is simply a gray-scale pixel ( $s=1$ ), having $M=256$ templates may provide the standard $8$ -bit resolution, leading grid tensors to fully define score functions by accounting for all possible images. However, since in our construction input patches correspond to the receptive field in the first layer of a ConvNet (see fig. 1), we would like to establish an argument for image patch sizes that more closely correlate to typical receptive fields, e.g. $5{\times}5$ . For this we rely on various studies (e.g. ) characterizing the statistics of natural images, which have shown that for large ensembles of images, randomly cropped patches of size up to $16{\times}16$ may be relatively well captured by Gaussian Mixture Models with as few as $64$ components. This complies with the common belief that there is a moderate number of appearances taken by the vast majority of local image patches (edges, Gabor filters etc.). That is to say, it complies with our assumption that covering templates exist with a moderate value of $M$ . We refer the reader to for a more formal argument on this line.