Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity

Amit Daniely, Roy Frostig, Yoram Singer

Introduction

Neural network (NN) learning has underpinned state of the art empirical results in numerous applied machine learning tasks (see for instance ). Nonetheless, neural network learning remains rather poorly understood in several regards. Notably, it remains unclear why training algorithms find good weights, how learning is impacted by the network architecture and activations, what is the role of random weight initialization, and how to choose a concrete optimization procedure for a given architecture.

We start by analyzing the expressive power of NNs subsequent to the random weight initialization. The motivation is the empirical success of training algorithms despite inherent computational intractability, and the fact that they optimize highly non-convex objectives with potentially many local minima. Our key result shows that random initialization already positions learning algorithms at a good starting point. We define an object termed a computation skeleton that describes a distilled structure of feed-forward networks. A skeleton induces a family of network architectures along with a hypothesis class H{\cal H} of functions obtained by certain non-linear compositions according to the skeleton’s structure. We show that the representation generated by random initialization is sufficiently rich to approximately express the functions in H{\cal H}. Concretely, all functions in H{\cal H} can be approximated by tuning the weights of the last layer, which is a convex optimization task.

In addition to explaining in part the success in finding good weights, our study provides an appealing perspective on neural network learning. We establish a tight connection between network architectures and their dual kernel spaces. This connection generalizes several previous constructions (see Sec 2). As we demonstrate, our dual view gives rise to design principles for NNs, supporting current practice and suggesting new ideas. We outline below a few points.

Duals of convolutional networks appear a more suitable fit for vision and acoustic tasks than those of fully connected networks.

Our framework surfaces a principled initialization scheme. It is very similar to common practice, but incorporates a small correction.

By modifying the activation functions, two consecutive fully connected layers can be replaced with one while preserving the network’s dual kernel.

The ReLU activation, i.e. xmax(x,0)x\mapsto\max(x,0), possesses favorable properties. Its dual kernel is expressive, and it can be well approximated by random initialization, even when the initialization’s scale is moderately changed.

As the number of layers in a fully connected network becomes very large, its dual kernel converges to a degenerate form for any non-linear activation.

Our result suggests that optimizing the weights of the last layer can serve as a convex proxy for choosing among different architectures prior to training. This idea was advocated and tested empirically in .

Related work

Understanding neural network learning, particularly its recent successes, commonly decomposes into the following research questions.

What functions can be efficiently expressed by neural networks?

When does a low empirical loss result in a low population loss?

Why and when do efficient algorithms, such as stochastic gradient, find good weights?

Though still far from being complete, previous work provides some understanding of questions (i) and (ii). Standard results from complexity theory imply that essentially all functions of interest (that is, any efficiently computable function) can be expressed by a network of moderate size. Biological phenomena show that many relevant functions can be expressed by even simpler networks, similar to convolutional neural networks that are dominant in ML tasks today. Barron’s theorem states that even two-layer networks can express a very rich set of functions. As for question (ii), both classical and more recent results from statistical learning theory show that as the number of examples grows in comparison to the size of the network the empirical loss must be close to the population loss. In contrast to the first two, question (iii) is rather poorly understood. While learning algorithms succeed in practice, theoretical analysis is overly pessimistic. Direct interpretation of theoretical results suggests that when going slightly deeper beyond single layer networks, e.g. to depth two networks with very few hidden units, it is hard to predict even marginally better than random . Finally, we note that the recent empirical successes of NNs have prompted a surge of theoretical work around NN learning .

Compositional kernels and connections to networks.

The idea of composing kernels has repeatedly appeared throughout the machine learning literature, for instance in early work by Schölkopf et al. , Grauman and Darrell . Inspired by deep networks’ success, researchers considered deep composition of kernels . For fully connected two-layer networks, the correspondence between kernels and neural networks with random weights has been examined in . Notably, Rahimi and Recht proved a formal connection (similar to ours) for the RBF kernel. Their work was extended to include polynomial kernels as well as other kernels . Several authors have further explored ways to extend this line of research to deeper, either fully-connected networks or convolutional networks . Our work sets a common foundation for and expands on these ideas. We extend the analysis from fully-connected and convolutional networks to a rather broad family of architectures. In addition, we prove approximation guarantees between a network and its corresponding kernel in our more general setting. We thus extend previous analyses that only applies to fully connected two-layer networks. Finally, we use the connection as an analytical tool to reason about architectural design choices.

Setting

Input space.

Supervised learning.

Neural network learning.

Kernel learning.

a convex objective that often can be efficiently minimized.

Computation skeletons

In this section we define a simple structure which we term a computation skeleton. The purpose of a computational skeleton is to compactly describe a feed-forward computation from an input to an output. A single skeleton encompasses a family of neural networks that share the same skeletal structure. Likewise, it defines a corresponding kernel space.

A computation skeleton S{\cal S} is a DAG whose non-input nodes are labeled by activations.

Figure 1 shows four example skeletons, omitting the designation of the activation functions. The skeleton S1{\cal S}_{1} is rather basic as it aggregates all the inputs in a single step. Such topology can be useful in the absence of any prior knowledge of how the output label may be computed from an input example, and it is commonly used in natural language processing where the input is represented as a bag-of-words . The only structure in S1{\cal S}_{1} is a single fully connected layer:

An induced subgraph of a skeleton with r+1r+1 nodes, u1,,ur,vu_{1},\ldots,u_{r},v, is called a fully connected layer if its edges are u1v,,urvu_{1}v,\ldots,u_{r}v.

The skeleton S2{\cal S}_{2} is slightly more involved: it first processes consecutive (overlapping) parts of the input, and the next layer aggregates the partial results. Altogether, it corresponds to networks with a single one-dimensional convolutional layer, followed by a fully connected layer. The two-dimensional (and deeper) counterparts of such skeletons correspond to networks that are common in visual object recognition.

Let s,w,qs,w,q be positive integers and denote n=s(q1)+wn=s(q-1)+w. A subgraph of a skeleton is a one dimensional convolution layer of width ww and stride ss if it has n+qn+q nodes, u1,,un,v1,,vqu_{1},\ldots,u_{n},v_{1},\ldots,v_{q}, and qwqw edges, us(i1)+jviu_{s(i-1)+j}\,v_{i}, for 1iq,1jw1\leq i\leq q,1\leq j\leq w.

The skeleton S3{\cal S}_{3} is a somewhat more sophisticated version of S2{\cal S}_{2}: the local computations are first aggregated, then reconsidered with the aggregate, and finally aggregated again. The last skeleton, S4{\cal S}_{4}, corresponds to the networks that arise in learning sequence-to-sequence mappings as used in translation, speech recognition, and OCR tasks (see for example Sutskever et al. ).

Note that the notion of the replication parameter rr corresponds, in the terminology of convolutional networks, to the number of channels taken in a convolutional layer and to the number of hidden units taken in a fully-connected layer.

Figure 2 illustrates a (5,4)(5,4)- and 55-realizations of a skeleton with coordinate dimension d=2d=2. The (5,4)(5,4)-realization is a network with a single (one dimensional) convolutional layer having 55 channels, stride of 22, and width of 44, followed by three fully-connected layers. The global replication parameter rr in a realization is used for brevity; it is straightforward to extend results when the different nodes in S{\cal S} are each replicated to a different extent.

We next define a scheme for random initialization of the weights of a neural network, that is similar to what is often done in practice. We employ the definition throughout the paper whenever we refer to random weights.

Architectures such as convolutional nets have weights that are shared across different edges. Again, it is straightforward to extend our results to these cases and for simplicity we assume no explicit weight sharing.

2 From computation skeletons to reproducing kernels

The following definition gives the kernel corresponding to a skeleton having normalized activations.For a skeleton S{\cal S} with unnormalized activations, the corresponding kernel is the kernel of the skeleton S{\cal S}^{\prime} obtained by normalizing the activations of S{\cal S}.

The final kernel κS\kappa_{\cal S} is κo\kappa_{o}, the kernel associated with the output node oo. The resulting Hilbert space and norm are denoted HS{\cal H}_{\cal S} and S\|\cdot\|_{\cal S} respectively, and Hv{\cal H}_{v} and v\|\cdot\|_{v} denote the space and norm when formed at node vv.

As we show later, κS\kappa_{\cal S} is indeed a (normalized) kernel for every skeleton S{\cal S}. To understand the kernel in the context of learning, we need to examine which functions can be expressed as moderate norm functions in HS{\cal H}_{\cal S}. As we show in section 7, these are the functions obtained by certain simple compositions according to the feed-forward structure of S{\cal S}. For intuition, the following example contrasts two commonly used skeletons.

Consider a network whose activations are all ReLU, σ(z)=[z]+\sigma(z)=[z]_{+}, and an input space Xn,1={±1}n{\cal X}_{n,1}=\{\pm 1\}^{n}. Say that S1{\cal S}_{1} is a skeleton comprising a single fully connected layer, and that S2{\cal S}_{2} is one comprising a convolutional layer of stride 11 and width q=log0.999(n)q=\log^{0.999}(n), followed by a single fully-connected layer. (The skeleton S2{\cal S}_{2} from Figure 1 is a concrete example of the convolutional skeleton with q=2q=2 and n=4n=4.) The kernel κS1\kappa_{{\cal S}_{1}} takes the form κS1(x,y)=σ^(x,y/n)\kappa_{{\cal S}_{1}}({\mathbf{x}},{\mathbf{y}})=\hat{\sigma}\left({\langle{\mathbf{x}},{\mathbf{y}}\rangle}/{n}\right). It is a symmetric kernel and therefore functions with small norm in HS1{\cal H}_{{\cal S}_{1}} are essentially low-degree polynomials. For instance, fix a bound R=n1.001R=n^{1.001} on the norm of the functions. In this case, the space HS1R{\cal H}^{R}_{{\cal S}_{1}} contains multiplication of one or two input coordinates. However, multiplication of 33 or more coordinates are no-longer in HS1R{\cal H}^{R}_{{\cal S}_{1}}. Moreover, this property holds true regardless of the choice of activation function. On the other hand, HS2R{\cal H}^{R}_{{\cal S}_{2}} contains functions whose dependence on adjacent input coordinates is far more complex. It includes, for instance, any function f:X{±1}f:{\cal X}\to\{\pm 1\} that is symmetric (i.e. f(x)=f(x)f(x)=f(-x)) and that depends on qq adjacent coordinates xi,,xi+q{\mathbf{x}}_{i},\ldots,{\mathbf{x}}_{i+q}. Furthermore, any sum of nn such functions is also in HS2R{\cal H}^{R}_{{\cal S}_{2}}.

Main results

We review our main results. Let us fix a compositional kernel S{\cal S}. There are a few upshots to underscore upfront. First, our analysis implies that a representation generated by a random initialization of N=N(S,r,k){\cal N}={\cal N}({\cal S},r,k) approximates the kernel κS\kappa_{\cal S}. The sense in which the result holds is twofold. First, with the proper rescaling we show that RN,w(x),RN,w(x)κS(x,x)\langle\mathcal{R}_{{\cal N},{\mathbf{w}}}({\mathbf{x}}),\mathcal{R}_{{\cal N},{\mathbf{w}}}({\mathbf{x}}^{\prime})\rangle\approx\kappa_{{\cal S}}({\mathbf{x}},{\mathbf{x}}^{\prime}). Then, we also show that the functions obtained by composing bounded linear functions with RN,w\mathcal{R}_{{\cal N},{\mathbf{w}}} are approximately the bounded-norm functions in HS{\cal H}_{\cal S}. In other words, the functions expressed by N{\cal N} under varying the weights of the last layer are approximately bounded-norm functions in HS{\cal H}_{\cal S}. For simplicity, we restrict the analysis to the case k=1k=1. We also confine the analysis to either bounded activations, with bounded first and second derivatives, or the ReLU activation. Extending the results to a broader family of activations is left for future work. Through this and remaining sections we use \gtrsim to hide universal constants.

Note that many activations are CC-bounded for some constant C>0C>0. In particular, most of the popular sigmoid-like functions such as 1/(1+ex){1}/({1+e^{-x}}), erf(x)\operatorname*{erf}(x), x/1+x2{x}/{\sqrt{1+x^{2}}}, tanh(x)\tanh(x), and tan1(x)\tan^{-1}(x) satisfy the boundedness requirements. We next introduce terminology that parallels the representation layer of N{\cal N} with a kernel space. Concretely, let N{\cal N} be a network whose representation part has qq output neurons. Given weights w{\mathbf{w}}, the normalized representation Ψw\Psi_{\mathbf{w}} is obtained from the representation RN,wR_{{\cal N},{\mathbf{w}}} by dividing each output neuron vv by σvq\|\sigma_{v}\|\sqrt{q}. The empirical kernel corresponding to w{\mathbf{w}} is defined as κw(x,x)=Ψw(x),Ψw(x)\kappa_{\mathbf{w}}({\mathbf{x}},{\mathbf{x}}^{\prime})=\langle\Psi_{{\mathbf{w}}}({\mathbf{x}}),\Psi_{{\mathbf{w}}}({\mathbf{x}}^{\prime})\rangle. We also define the empirical kernel space corresponding to w{\mathbf{w}} as Hw=Hκw{\cal H}_{\mathbf{w}}={\cal H}_{\kappa_{\mathbf{w}}}. Concretely,

and the norm of Hw{\cal H}_{\mathbf{w}} is defined as hw=inf{vh=hv}\|h\|_{\mathbf{w}}=\inf\{\|{\mathbf{v}}\|\mid{}h=h_{\mathbf{v}}\}. Our first result shows that the empirical kernel approximates the kernel kSk_{\cal S}.

Let S{\cal S} be a skeleton with CC-bounded activations. Let w{\mathbf{w}} be a random initialization of N=N(S,r){\cal N}={\cal N}({\cal S},r) with

Then, for all x,x{\mathbf{x}},{\mathbf{x}}^{\prime}, with probability of at least 1δ1-\delta,

We note that if we fix the activation and assume that the depth of S{\cal S} is logarithmic, then the required bound on rr is polynomial. For the ReLU activation we get a stronger bound with only quadratic dependence on the depth. However, it requires that ϵ1/0pt(S)\epsilon\leq{1}/{0pt({\cal S})}.

Let S{\cal S} be a skeleton with ReLU activations. Let w{\mathbf{w}} be a random initialization of N(S,r){\cal N}({\cal S},r) with

Then, for all x,x{\mathbf{x}},{\mathbf{x}}^{\prime} and ϵ1/0pt(S)\epsilon\lesssim{}1/{0pt({\cal S})}, with probability of at least 1δ1-\delta,

Let S{\cal S} be a skeleton with CC-bounded activations. Let w{\mathbf{w}} be a random initialization of N(S,r){\cal N}({\cal S},r) with

Then, with probability of at least 1δ1-\delta over the choices of w{\mathbf{w}} we have that Hw2R{\cal H}_{\mathbf{w}}^{\sqrt{2}R} ϵ\epsilon-approximates HSR{\cal H}^{R}_{\cal S} and HS2R{\cal H}_{\cal S}^{\sqrt{2}R} ϵ\epsilon-approximates HwR{\cal H}^{R}_{\mathbf{w}}.

Let S{\cal S} be a skeleton with ReLU activations and ϵ1/0pt(C)\epsilon\lesssim{1}/{0pt({\cal C})}. Let w{\mathbf{w}} be a random initialization of N(S,r){\cal N}({\cal S},r) with

Then, with probability of at least 1δ1-\delta over the choices of w{\mathbf{w}} we have that Hw2R{\cal H}_{\mathbf{w}}^{\sqrt{2}R} ϵ\epsilon-approximates HSR{\cal H}^{R}_{\cal S} and HS2R{\cal H}_{\cal S}^{\sqrt{2}R} ϵ\epsilon-approximates HwR{\cal H}^{R}_{\mathbf{w}}.

As in Theorems 2 and 3, for a fixed CC-bounded activation and logarithmically deep S{\cal S}, the required bounds on rr are polynomial. Analogously, for the ReLU activation the bound is polynomial even without restricting the depth. However, the polynomial growth in Theorems 4 and 5 is rather large. Improving the bounds, or proving their optimality, is left to future work.

Our results can be extended to incorporate bias terms. Namely, in addition to the weights we can add a bias vector b={bvvV}{\mathbf{b}}=\{b_{v}\mid v\in V\} and let each neuron compute the function

To do so, we extend the definition of random initialization and compositional kernel as follows:

The final kernel κSβ\kappa^{\beta}_{\cal S} is κoβ\kappa^{\beta}_{o}, the kernel associated with the output node oo.

Note that our original definitions correspond to β=0\beta=0. With the above definitions, Theorems 2, 3, 4 and 5 extend to the case when there exist bias terms. To simplify the notation, we focus on the case when there are no biases.

Mathematical background

The following theorem describes a tight connection between embeddings of X{\cal X} into a Hilbert space and kernel spaces.

Hκ={fv:vH}{\cal H}_{\kappa}=\{f_{\mathbf{v}}:{\mathbf{v}}\in{\cal H}\}, where fv(x)=v,Φ(x)Hf_{\mathbf{v}}({\mathbf{x}})=\langle{\mathbf{v}},\Phi({\mathbf{x}})\rangle_{{\cal H}}.

For every fHκf\in{\cal H}_{\kappa}, fHκ=inf{vHf=fv}\|f\|_{{\cal H}_{\kappa}}=\inf\{\|{\mathbf{v}}\|_{{\cal H}}\mid f=f_{\mathbf{v}}\}.

Positive definite functions.

The norm of μ\mu is defined as μ:=ibi=μ(1)\|\mu\|:=\sqrt{\sum_{i}b_{i}}=\sqrt{\mu(1)}. We say that μ\mu is normalized if μ=1\|\mu\|=1

The restriction to the unit sphere of many of the kernels used in machine learning applications corresponds to positive definite functions. An example is the Gaussian kernel,

Indeed, note that for unit vectors x,x{\mathbf{x}},{\mathbf{x}}^{\prime} we have

Another example is the Polynomial kernel κ(x,x)=x,xd\kappa({\mathbf{x}},{\mathbf{x}}^{\prime})=\langle{\mathbf{x}},{\mathbf{x}}^{\prime}\rangle^{d}.

Hermite polynomials.

The normalized Hermite polynomials is the sequence h0,h1,h_{0},h_{1},\ldots of orthonormal polynomials obtained by applying the Gram-Schmidt process to the sequence 1,x,x2,1,x,x^{2},\ldots w.r.t. the inner-product f,g=12πf(x)g(x)ex22dx\langle f,g\rangle=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}f(x)g(x)e^{-\frac{x^{2}}{2}}dx. Recall that we define activations as square integrable functions w.r.t. the Gaussian measure. Thus, Hermite polynomials form an orthonormal basis to the space of activations. In particular, each activation σ\sigma can be uniquely described in the basis of Hermite polynomials,

Compositional kernel spaces

We now describe the details of compositional kernel spaces. Let S{\cal S} be a skeleton with normalized activations and nn input nodes associated with the input’s coordinates. Throughout the rest of the section we study the functions in HS{\cal H}_{\cal S} and their norm. In particular, we show that κS\kappa_{\cal S} is indeed a normalized kernel. Recall that κS\kappa_{\cal S} is defined inductively by the equation,

The recursion (7) describes a means for generating a kernel form another kernel. Since kernels correspond to kernel spaces, it also prescribes an operator that produces a kernel space from other kernel spaces. If Hv{\cal H}_{v} is the space corresponding to vv, we denote this operator by

The function κ\kappa is indeed a kernel. Furthermore, the following properties hold.

If H1,,Hn{\cal H}_{1},\ldots,{\cal H}_{n} are normalized then so is H{\cal H}.

H={f1++fnnfiHi}{\cal H}=\left\{\frac{f_{1}+\ldots+f_{n}}{n}\mid f_{i}\in{\cal H}_{i}\right\}

fH2=inf{f1H12++fnHn2n\mboxs.t.f=f1++fnn,  fiHi}\|f\|^{2}_{{\cal H}}=\inf\left\{\frac{\|f_{1}\|^{2}_{{\cal H}_{1}}+\ldots+\|f_{n}\|^{2}_{{\cal H}_{n}}}{n}\mbox{ s.t. }f=\frac{f_{1}+\ldots+f_{n}}{n},\;f_{i}\in{\cal H}_{i}\right\}

(outline) The fact that κ\kappa is a kernel follows directly from the definition of a kernel and the fact that an average of PSD matrices is PSD. Also, it is straight forward to verify item 1. We now proceed to items 2 and 3. By Theorem 7 there are Hilbert spaces G1,,Gn{\cal G}_{1},\ldots,{\cal G}_{n} and mappings Φi:XGi\Phi_{i}:{\cal X}\to{\cal G}_{i} such that κi(x,x)=Φi(x),Φi(x)Gi\kappa_{i}({\mathbf{x}},{\mathbf{x}}^{\prime})=\langle\Phi_{i}({\mathbf{x}}),\Phi_{i}({\mathbf{x}}^{\prime})\rangle_{{\cal G}_{i}}. Consider now the mapping

It holds that κ(x,x)=Ψ(x),Ψ(x)\kappa({\mathbf{x}},{\mathbf{x}}^{\prime})=\langle\Psi({\mathbf{x}}),\Psi({\mathbf{x}}^{\prime})\rangle. Properties 2 and 3 now follow directly form Thm. 7 applied to Ψ\Psi. ∎

The extension of a kernel space.

Let H{\cal H} be a normalized kernel space with a kernel κ\kappa. Let μ(x)=ibixi\mu(x)=\sum_{i}b_{i}x^{i} be a PSD function. As we will see shortly, a function is PSD if and only if it is a dual of an activation function. The extension of H{\cal H} w.r.t. μ\mu, denoted μ(H)\mu\left({\cal H}\right), is the kernel space corresponding to the kernel κ(x,x)=μ(κ(x,x))\kappa^{\prime}({\mathbf{x}},{\mathbf{x}}^{\prime})=\mu(\kappa({\mathbf{x}},{\mathbf{x}}^{\prime})).

The function κ\kappa^{\prime} is indeed a kernel. Furthermore, the following properties hold.

μ(H)\mu({\cal H}) is normalized if and only if μ\mu is.

fμ(H)inf{AgAgHbA\mboxs.t.f=AgAg,  AH}\|f\|_{\mu({\cal H})}\leq\inf\left\{\displaystyle\sum_{A}\frac{\prod_{g\in A}\|g\|_{{\cal H}}}{\sqrt{b_{|A|}}}\mbox{ s.t. }f=\sum_{A}\prod_{g\in A}g,\;A\subset{\cal H}\right\}

(outline) Let Φ:XG\Phi:{\cal X}\to{\cal G} be a mapping from X{\cal X} to the unit ball of a Hilbert space G{\cal G} such that κ(x,x)=Φ(x),Φ(x)\kappa({\mathbf{x}},{\mathbf{x}}^{\prime})=\langle\Phi({\mathbf{x}}),\Phi({\mathbf{x}}^{\prime})\rangle. Define

It is not difficult to verify that Ψ(x),Ψ(x)=μ(κ(x,x))\langle\Psi({\mathbf{x}}),\Psi({\mathbf{x}}^{\prime})\rangle=\mu(\kappa({\mathbf{x}},{\mathbf{x}}^{\prime})). Hence, by Thm. 7, κ\kappa^{\prime} is indeed a kernel. Verifying property 1 is a straightforward task. Properties 2 and 3 follow by applying Thm. 7 on the mapping Ψ\Psi. ∎

The dual activation function

The following lemma describes a few basic properties of the dual activation. These properties follow easily from the definition of the dual activation and equations (2), (4), and (5).

The following properties of the mapping σσ^\sigma\mapsto\hat{\sigma} hold:

If σ=iaihi\sigma=\sum_{i}a_{i}h_{i} is the Hermite expansion of σ\sigma, then σ^(ρ)=iai2ρi\hat{\sigma}(\rho)=\sum_{i}a_{i}^{2}\rho^{i}.

For every σ\sigma, σ^\hat{\sigma} is positive definite.

Every positive definite function is a dual of some activation.

The mapping σσ^\sigma\mapsto\hat{\sigma} preserves norms.

The mapping σσ^\sigma\mapsto\hat{\sigma} commutes with differentiation.

For every σ\sigma, σ^\hat{\sigma} is continuous in $andsmoothinand smooth in(-1,1)$.

For every σ\sigma, σ^\hat{\sigma} is non-decreasing and convex in $$.

For every σ\sigma, the range of σ^\hat{\sigma} is [σ2,σ2]\left[-\|\sigma\|^{2},\|\sigma\|^{2}\right].

We next discuss a few examples for activations and calculate their dual activation and kernel. Note that the dual of the exponential activation was calculated in and the duals of the step and the ReLU activations were calculated in . Here, our derivations are different and may prove useful for future calculations of duals for other activations.

Consider the activation function σ(x)=Ceax\sigma(x)=Ce^{ax} where C>0C>0 is a normalization constant such that σ=1\|\sigma\|=1. The actual value of CC is e2a2e^{-2a^{2}} but it will not be needed for the derivation below. From properties (e) and (f) of Lemma 11 we have that,

The the solution of ordinary differential equation (σ^)=a2σ^\left(\hat{\sigma}\right)^{\prime}=a^{2}\hat{\sigma} is of the form σ^(ρ)=bexp(a2ρ)\hat{\sigma}(\rho)=b\exp\left(a^{2}\rho\right). Since σ^(1)=1\hat{\sigma}(1)=1 we have b=ea2b=e^{-a^{2}}. We therefore obtain that the dual activation function is

Note that the kernel induced by σ\sigma is the RBF kernel, restricted to the dd-dimensional sphere,

The Sine activation and the Sinh kernel.

Consider the activation σ(x)=sin(ax)\sigma(x)=\sin(ax). We can write sin(ax)=eiaxeiax2i\sin(ax)=\frac{e^{iax}-e^{-iax}}{2i}. We have

Similarly, since the variance of XYX-Y and YXY-X is 22ρ2-2\rho, we get

Hermite activations and polynomial kernels.

From Lemma 11 it follows that the dual activation of the Hermite polynomial hnh_{n} is h^n(ρ)=ρn\hat{h}_{n}(\rho)=\rho^{n}. Hence, the corresponding kernel is the polynomial kernel.

The normalized step activation.

To calculate σ^\hat{\sigma} we compute the Hermite expansion of σ\sigma. For n0n\geq 0 we let

Since h0(x)=1h_{0}(x)=1, h1(x)=xh_{1}(x)=x, and h2(x)=x212h_{2}(x)=\frac{x^{2}-1}{\sqrt{2}}, we get the corresponding coefficients,

For n3n\geq 3 we write gn(x)=hn(x)ex22g_{n}(x)=h_{n}(x)e^{-\frac{x^{2}}{2}} and note that

Here, the second equality follows from (4) and the third form (3). We therefore get

The second equality follows from (3) and the last equality follows from (6). Finally, from Lemma 11 we have that σ^(ρ)=n=0bnρn\hat{\sigma}(\rho)=\sum_{n=0}^{\infty}b_{n}\rho^{n} where

In particular, (b0,b1,b2,b3,b4,b5,b6)=(12,1π,0,16π,0,340π,0)(b_{0},b_{1},b_{2},b_{3},b_{4},b_{5},b_{6})=\left(\frac{1}{2},\frac{1}{\pi},0,\frac{1}{6\pi},0,\frac{3}{40\pi},0\right). Note that from the Taylor expansion of cos1\cos^{-1} it follows that σ^(ρ)=1cos1(ρ)π\hat{\sigma}(\rho)=1-\frac{\cos^{-1}(\rho)}{\pi}.

The normalized ReLU activation.

Consider the activation σ(x)=2max(0,x)\sigma(x)=\sqrt{2}\max(0,x). We now write σ^(ρ)=ibiρi\hat{\sigma}(\rho)=\sum_{i}b_{i}\rho^{i}. The first coefficient is

To calculate the remaining coefficients we simply note that the derivative of the ReLU activation is the step activation and the mapping σσ^\sigma\mapsto\hat{\sigma} commutes with differentiation. Hence, from the calculation of the step activation we get,

In particular, (b0,b1,b2,b3,b4,b5,b6)=(1π,12,12π,0,124π,0,180π)(b_{0},b_{1},b_{2},b_{3},b_{4},b_{5},b_{6})=\left(\frac{1}{\pi},\frac{1}{2},\frac{1}{2\pi},0,\frac{1}{24\pi},0,\frac{1}{80\pi}\right). We see that the coefficients corresponding to the degrees , 11, and 22 sum to 0.97740.9774. The sums up to degrees 44 or 66 are 0.99070.9907 and 0.99470.9947 respectively. That is, we get an excellent approximation of less than 1%1\% error with a dual activation of degree 44.

The collapsing tower of fully connected layers.

To conclude this section we discuss the case of very deep networks. The setting is taken for illustrative purposes but it might surface when building networks with numerous fully connected layers. Indeed, most deep architectures that we are aware of do not employ more than five consecutive fully connected layers.

Consider a skeleton Sm{\cal S}_{m} consisting of mm fully connected layers, each layer associated with the same (normalized) activation σ\sigma. We would like to examine the form of the compositional kernel as the number of layers becomes very large. Due to the repeated structure and activation we have

Hence, the limiting properties of κSm\kappa_{{\cal S}_{m}} can be understood from the limit of αm\alpha_{m}. In the case that σ(x)=x\sigma(x)=x or σ(x)=x\sigma(x)=-x, σ^\hat{\sigma} is the identity function. Therefore αm(ρ)=σ^(ρ)=ρ\alpha_{m}(\rho)=\hat{\sigma}(\rho)=\rho for all mm and κSm\kappa_{{\cal S}_{m}} is simply the linear kernel. Assume now that σ\sigma is neither the identity nor its negation. The following claim shows that αm\alpha_{m} has a point-wise limit corresponding to a degenerate kernel.

There exists a constant 0ασ10\leq\alpha_{\sigma}\leq 1 such that for all 1<ρ<1-1<\rho<1,

Before proving the claim, we note that for ρ=1\rho=1, αm(1)=1\alpha_{m}(1)=1 for all mm, and therefore limmαm(1)=1\lim_{m\to\infty}\alpha_{m}(1)=1. For ρ=1\rho=-1, if σ\sigma is anti-symmetric then αm(1)=1\alpha_{m}(-1)=-1 for all mm, and in particular limmαm(1)=1\lim_{m\to\infty}\alpha_{m}(-1)=-1. In any other case, our argument can show that limmαm(1)=ασ\lim_{m\to\infty}\alpha_{m}(-1)=\alpha_{\sigma}.

Recall that σ^(ρ)=i=0biρi\hat{\sigma}(\rho)=\sum_{i=0}^{\infty}b_{i}\rho^{i} where the bib_{i}’s are non-negative numbers that sum to 1. By the assumption that σ\sigma is not the identity or its negation, b1<1b_{1}<1. We first claim that there is a unique ασ\alpha_{\sigma}\in such that

To prove (9) it suffices to prove the following properties.

σ^(ρ)>ρ\hat{\sigma}(\rho)>\rho for ρ(1,0)\rho\in(-1,0)

σ^\hat{\sigma} is non-decreasing and convex in $$

the graph of σ^\hat{\sigma} has at most a single intersection in [0,1)[0,1) with the graph of f(ρ)=ρf(\rho)=\rho

If the above properties hold we can take ασ\alpha_{\sigma} to be the intersection point or 11 if such a point does not exist. We first show (a). For ρ(1,0)\rho\in(-1,0) we have that

Here, the third inequality follows form the fact that b00b_{0}\geq 0 and for all ii, biρibiρ-b_{i}|\rho|^{i}\geq-b_{i}|\rho|. Moreover since b1<1b_{1}<1, one of these inequalities must be strict. Properties (b) and (c) follows from Lemma 11. Finally, to show (d), we note that the second derivative of σ^(ρ)ρ\hat{\sigma}(\rho)-\rho is i2i(i1)biρi2\sum_{i\geq 2}i(i-1)b_{i}\rho^{i-2} which is non-negative in [0,1)[0,1). Hence, σ^(ρ)ρ\hat{\sigma}(\rho)-\rho is convex in $andinparticularintersectswiththeand in particular intersects with thexaxisateither,-axis at either ,1,,2orinfinitelymanytimesinor infinitely many times in.Asweassumethat. As we assume that\hat{\sigma}isnottheidentity,wecanruleouttheoptionofinfinitelymanyintersections.Also,sinceis not the identity, we can rule out the option of infinitely many intersections. Also, since\hat{\sigma}(1)=1,weknowthatthereisatleastoneintersectionin, we know that there is at least one intersection in.Hence,thereare. Hence, there are1oror2intersectionsinintersections inandbecauseoneofthemisinand because one of them is in\rho=1,weconcludethatthereisatmostoneintersectionin, we conclude that there is at most one intersection in[0,1)$.

Lastly, we derive the conclusion of the claim from equation (9). Fix ρ(1,1)\rho\in(-1,1). Assume first that ρασ\rho\geq\alpha_{\sigma}. By (9), αm(ρ)\alpha_{m}(\rho) is a monotonically non-increasing sequence that is lower bounded by ασ\alpha_{\sigma}. Hence, it has a limit αστρ<1\alpha_{\sigma}\leq\tau\leq\rho<1. Now, by the continuity of σ^\hat{\sigma} we have

Since the only solution to σ^(ρ)=ρ\hat{\sigma}(\rho)=\rho in (1,1)(-1,1) is ασ\alpha_{\sigma}, we must have τ=ασ\tau=\alpha_{\sigma}. We next deal with the case that 1<ρ<ασ-1<\rho<\alpha_{\sigma}. If for some mm, αm(ρ)[ασ,1)\alpha_{m}(\rho)\in[\alpha_{\sigma},1), the argument for ασρ\alpha_{\sigma}\leq\rho shows that ασ=limmαm(ρ)\alpha_{\sigma}=\lim_{m\to\infty}\alpha_{m}(\rho). If this is not the case, we have that for all mm, αm(ρ)αm+1(ρ)ασ\alpha_{m}(\rho)\leq\alpha_{m+1}(\rho)\leq\alpha_{\sigma}. As in the case of ρασ\rho\geq\alpha_{\sigma}, this can be used to show that αm(ρ)\alpha_{m}(\rho) converges to ασ\alpha_{\sigma}. ∎

Proofs

Let σ\sigma be an activation. Define the following,

We underscore the following properties of the extension of a dual activation.

The restriction of the extended kσk_{\sigma} to the sphere agrees with the restricted definition.

The extended dual activation and kernel are defined for every activation σ\sigma such that for all a0a\geq 0, xσ(ax)x\mapsto\sigma(ax) is square integrable with respect to the Gaussian measure.

A normalized activation σ\sigma is (α,β,γ)(\alpha,\beta,\gamma)-decent for α,β,γ0\alpha,\beta,\gamma\geq 0 if the following conditions hold.

The dual activation σˉ\bar{\sigma} is β\beta-Lipschitz in M+γ{\cal M}_{+}^{\gamma} with respect to the \infty-norm.

If (X1,Y1),,(Xr,Yr)(X_{1},Y_{1}),\ldots,(X_{r},Y_{r}) are independent samples from N(0,Σ)\text{N}\left(0,\Sigma\right) for ΣM+γ\Sigma\in{\cal M}_{+}^{\gamma} then

It is enough to show that the following properties hold.

The (extended) dual activation σˉ\bar{\sigma} is 2C22C^{2}-Lipschitz in M++{\cal M}_{++} w.r.t. the \infty-norm.

If (X1,Y1),,(Xr,Yr)(X_{1},Y_{1}),\ldots,(X_{r},Y_{r}) are independent samples from N(0,Σ)\text{N}\left(0,\Sigma\right) then

From the boundedness of σ\sigma it holds that σ(X)σ(Y)C2|\sigma(X)\sigma(Y)|\leq C^{2}. Hence, the second property follows directly from Hoeffding’s bound. We next prove the first part. Let z=(x,y){\mathbf{z}}=(x,y) and ϕ(z)=σ(x)σ(y)\phi({\mathbf{z}})=\sigma(x)\sigma(y). Note that for ΣM++\Sigma\in{\cal M}_{++} we have

Let g(z)=ezΣ1z2g({\mathbf{z}})=e^{-\frac{{\mathbf{z}}^{\top}\Sigma^{-1}{\mathbf{z}}}{2}}. Then, the first and second order partial derivatives of gg are

We conclude that σˉ\bar{\sigma} is differentiable in M++{\cal M}_{++} with partial derivatives that are point-wise bounded by C22\frac{C^{2}}{2}. Thus, σˉ\bar{\sigma} is 2C22C^{2}-Lipschitz in M+{\cal M}_{+} w.r.t. the \infty-norm. ∎

We next show that the ReLU activation is decent.

The measure concentration property follows from standard concentration bounds for sub-exponential random variables (e.g. ). It remains to show that σˉ\bar{\sigma} is (1+o(γ))(1+o(\gamma))-Lipschitz in M+γ{\cal M}^{\gamma}_{+}. We first calculate an exact expression for σˉ\bar{\sigma}. The expression was already calculated in , yet we give here a derivation for completeness.

The following equality holds for all ΣM+2\Sigma\in{\cal M}_{+}^{2},

By the positive homogeneity of the ReLU activation we have

For brevity, we henceforth drop the argument from σˉ(Σ)\bar{\sigma}(\Sigma) and use the abbreviation σˉ\bar{\sigma}. In order to show that σˉ\bar{\sigma} is (1+o(γ))(1+o(\gamma))-Lipschitz w.r.t. the \infty-norm it is enough to show that for every ΣM+γ\Sigma\in{\cal M}_{+}^{\gamma} we have,

First, Note that σˉ/Σ11{\partial\bar{\sigma}}/{\partial\Sigma_{11}} and σˉ/Σ22{\partial\bar{\sigma}}/{\partial\Sigma_{22}} have the same sign, hence,

We therefore get that the 11-norm of σˉ\nabla\bar{\sigma} is,

The gradient of 12Σ11+Σ22Σ11Σ22\frac{1}{2}\frac{\Sigma_{11}+\Sigma_{22}}{\sqrt{\Sigma_{11}\Sigma_{22}}} at (Σ11,Σ22)=(1,1)(\Sigma_{11},\Sigma_{22})=(1,1) is (0,0)(0,0). Therefore, from the mean value theorem we get, 12Σ11+Σ22Σ11Σ22=1+o(γ)\frac{1}{2}\frac{\Sigma_{11}+\Sigma_{22}}{\sqrt{\Sigma_{11}\Sigma_{22}}}=1+o(\gamma). Furthermore, σ^\hat{\sigma}, σ^\hat{\sigma}^{\prime} and Σ12Σ11Σ22\frac{\Sigma_{12}}{\sqrt{\Sigma_{11}\Sigma_{22}}} are bounded by 11 in absolute value. Hence, we can write,

Finally, if we let t=Σ12Σ11Σ22t=\frac{\Sigma_{12}}{\sqrt{\Sigma_{11}\Sigma_{22}}}, we can further simply the expression for σˉ\nabla\bar{\sigma},

Finally, the proof is obtained from the fact that the function f(t)=1t2π+1cos1(t)πf(t)=\frac{\sqrt{1-t^{2}}}{\pi}+1-\frac{\cos^{-1}(t)}{\pi} satisfies 0f(t)10\leq f(t)\leq 1 for every tt\in. Indeed, it is simple to verify that f(1)=0f(-1)=0 and f(1)=1f(1)=1. Hence, it suffices to show that ff^{\prime} is non-negative in $$ which is indeed the case since,

2 Proofs of Thms. 2 and 3

We start by an additional theorem which serves as a simple stepping stone for proving the aforementioned main theorems.

Let S{\cal S} be a skeleton with (α,β,γ)(\alpha,\beta,\gamma)-decent activations, 0<ϵγ0<\epsilon\leq\gamma, and Bd=i=0d1βiB_{d}=\sum_{i=0}^{d-1}\beta^{i}. Let w{\mathbf{w}} be a random initialization of the network N=N(S,r){\cal N}={\cal N}({\cal S},r) with

Then, for every x,y{\mathbf{x}},{\mathbf{y}} with probability of at least 1δ1-\delta, it holds that

Before proving the theorem we show that together with Lemmas 12 and 13, Theorems 2 and 3 follow from Theorem 14. We restate them as corollaries, prove them, and then proceed to the proof of Theorem 14.

Let S{\cal S} be a skeleton with CC-bounded activations. Let w{\mathbf{w}} be a random initialization of N=N(S,r){\cal N}={\cal N}({\cal S},r) with

Then, for every x,y{\mathbf{x}},{\mathbf{y}}, w.p. 1δ\geq 1-\delta,

From Lemma 12, for all γ>0\gamma>0, each activation is (C2,2C2,γ)(C^{2},2C^{2},\gamma)-decent. By Theorem 14, it suffices to show that

Let S{\cal S} be a skeleton with ReLU activations, and w{\mathbf{w}} a random initialization of N(S,r){\cal N}({\cal S},r) with rc10pt2(S)log(8Sδ)ϵ2r\geq c_{1}\frac{0pt^{2}({\cal S})\log\left(\frac{8|{\cal S}|}{\delta}\right)}{\epsilon^{2}}. For all x,y{\mathbf{x}},{\mathbf{y}} and ϵmin(c2,10pt(S))\epsilon\leq\min(c_{2},\frac{1}{0pt({\cal S})}), w.p. 1δ\geq 1-\delta,

Here, c1,c2>0c_{1},c_{2}>0 are universal constants.

This claim follows from the fact that (1+o(ϵ))ieo(ϵ)0pt(S)(1+o(\epsilon))^{i}\leq e^{o(\epsilon)0pt({\cal S})} as long as i0pt(S)i\leq 0pt({\cal S}). Since we assume that ϵ1/0pt(S)\epsilon\leq{1}/{0pt({\cal S})}, the expression is bounded by ee for sufficiently small ϵ\epsilon. ∎

Note that Ku=σˉup(Ku)\mathcal{K}^{u}=\bar{\sigma}_{u}^{p}(\mathcal{K}^{\leftarrow u}). We say that a node uSu\in{\cal S}, is well-initialized if

Here, we use the convention that B0=0B_{0}=0. It is enough to show that with probability of at least 1δ\geq 1-\delta all nodes are well-initialized. We first note that input nodes are well-initialized by construction since Kwu=Ku\mathcal{K}^{u}_{\mathbf{w}}=\mathcal{K}^{u}. Next, we show that given that all incoming nodes for a certain node are well-initialized, then w.h.p. the node is well-initialized as well.

It is easy to verify that Kwu\mathcal{K}_{\mathbf{w}}^{u} is the empirical covariance matrix of rr independent variables distributed according to (σ(X),σ(Y))\left(\sigma(X),\sigma(Y)\right) where (X,Y)N(0,Kwu)(X,Y)\sim\text{N}\left(0,\mathcal{K}_{\mathbf{w}}^{\leftarrow u}\right). Given the assumption that all nodes incoming to uu are well-initialized, we have,

Further, since ϵγ\epsilon\leq\gamma then KwuM+γ\mathcal{K}^{\leftarrow u}_{\mathbf{w}}\in{\cal M}_{+}^{\gamma}. Using the fact that σu\sigma_{u} is (α,β,γ)(\alpha,\beta,\gamma)-decent and that r2α2B0pt(S)2log(8Sδ)ϵ2r\geq\frac{2\alpha^{2}B^{2}_{0pt({\cal S})}\log\left(\frac{8|{\cal S}|}{\delta}\right)}{\epsilon^{2}}, we get that w.p. of at least 1δS1-\frac{\delta}{|{\cal S}|},

Finally, using (9.2) and (13) along with the fact that σˉ\bar{\sigma} is β\beta-Lipschitz, we have

We are now ready to conclude the proof. Let u1,,uSu_{1},\ldots,u_{|{\cal S}|} be an ordered list of the nodes in S{\cal S} in accordance to their depth, starting with the shallowest nodes, and ending with the output node. Denote by AqA_{q} the event that u1,,uqu_{1},\ldots,u_{q} are well-initialized. We need to show that Pr(AS)1δ\Pr(A_{|{\cal S}|})\geq 1-\delta. We do so using an induction on qq for the inequality Pr(Aq)1qδS\Pr(A_{q})\geq 1-\frac{q\delta}{|{\cal S}|}. Indeed, for q=1,,nq=1,\ldots,n, uqu_{q} is an input node and Pr(Aq)=1\Pr(A_{q})=1. Thus, the base of the induction hypothesis holds. Assume that q>nq>n. By Claim (3) we have that Pr(AqAq1)1δS\Pr(A_{q}|A_{q-1})\geq 1-\frac{\delta}{|{\cal S}|}. Finally, from the induction hypothesis we have,

3 Proofs of Thms. 4 and 5

Theorems 4 and 5 follow from using the following lemma combined with Theorems 2 and 3. When we apply the lemma, we always focus on the special case where one of the kernels is constant w.p. 11.

For some C>0C>0, xX,  κ1(x,x),κ2(x,x)C\forall{\mathbf{x}}\in{\cal X},\;\kappa_{1}({\mathbf{x}},{\mathbf{x}}),\kappa_{2}({\mathbf{x}},{\mathbf{x}})\leq C.

Then, w.p. 1δ\geq 1-\delta over the choices of κ1,κ2\kappa_{1},\kappa_{2}, for every f1Hκ1Mf_{1}\in{\cal H}^{M}_{\kappa_{1}} there is f2Hκ22Mf_{2}\in{\cal H}^{\sqrt{2}M}_{\kappa_{2}} such that LD(f2)LD(f1)+ϵ4LM{\cal L}_{{\cal D}}(f_{2})\leq{\cal L}_{{\cal D}}(f_{1})+\sqrt{\epsilon}4LM.

To prove the above lemma, we state another lemma below followed by a basic measure concentration result.

L(w):=1mi=1mw,xiw,xiϵ{\cal L}({\mathbf{w}}):=\frac{1}{m}\sum_{i=1}^{m}|\langle{\mathbf{w}},{\mathbf{x}}_{i}\rangle-\langle{\mathbf{w}}^{*},{\mathbf{x}}_{i}\rangle|\leq\epsilon

iαiw2ϵ\sum_{i}|\alpha_{i}|\leq\frac{\|{\mathbf{w}}^{*}\|^{2}}{\epsilon}

ww\|{\mathbf{w}}\|\leq\|{\mathbf{w}}^{*}\|

Denote M=wM=\|{\mathbf{w}}^{*}\|, C=maxixiC=\max_{i}\|{\mathbf{x}}_{i}\|, and yi=w,xiy_{i}=\langle{\mathbf{w}}^{*},{\mathbf{x}}_{i}\rangle. Suppose that we run stochastic gradient decent on the sample {(x1,y1),,(xm,ym)}\{({\mathbf{x}}_{1},y_{1}),\ldots,({\mathbf{x}}_{m},y_{m})\} w.r.t. the loss L(w){\cal L}({\mathbf{w}}), with learning rate η=ϵC2\eta=\frac{\epsilon}{C^{2}}, and with projections onto the ball of radius MM. Namely, we start with w0=0{\mathbf{w}}_{0}=0 and at each iteration t1t\geq 1, we choose at random it[m]i_{t}\in[m] and perform the update,

After T=M2C2ϵ2T=\frac{M^{2}C^{2}}{\epsilon^{2}} iterations the loss in expectation would be at most ϵ\epsilon (see for instance Chapter 14 in ). In particular, there exists a sequence of at most M2C2ϵ2\frac{M^{2}C^{2}}{\epsilon^{2}} gradient steps that attains a solution w{\mathbf{w}} with L(w)ϵ{\cal L}({\mathbf{w}})\leq\epsilon. Each update adds or subtracts ϵC2xi\frac{\epsilon}{C^{2}}{\mathbf{x}}_{i} from the current solution. Hence w{\mathbf{w}} can be written as a weighted sum of xi{\mathbf{x}}_{i}’s where the sum of each coefficient is at most TϵC2=M2ϵT\frac{\epsilon}{C^{2}}=\frac{M^{2}}{\epsilon}. ∎

In particular, w.p. 1δ\geq 1-\delta (14) and (15) hold and therefore it suffices to prove the conclusion of the theorem under these conditions. Indeed, let Ψ1,Ψ2:XH\Psi_{1},\Psi_{2}:{\cal X}\to{\cal H} be two mapping from X{\cal X} to a Hilbert space H{\cal H} so that κi(x,y)=Ψi(x),Ψi(y)\kappa_{i}({\mathbf{x}},{\mathbf{y}})=\langle\Psi_{i}({\mathbf{x}}),\Psi_{i}({\mathbf{y}})\rangle. Let f1Hκ1Mf_{1}\in{\cal H}^{M}_{\kappa_{1}}. By lemma 18 there are α1,,αm\alpha_{1},\ldots,\alpha_{m} so that for the vector w=i=1mα1Ψ1(xi){\mathbf{w}}=\sum_{i=1}^{m}\alpha_{1}\Psi_{1}({\mathbf{x}}_{i}) we have

Consider the function f2H2f_{2}\in{\cal H}_{2} defined by f2(x)=i=1mα1Ψ2(xi),Ψ2(x)f_{2}({\mathbf{x}})=\sum_{i=1}^{m}\alpha_{1}\langle\Psi_{2}({\mathbf{x}}_{i}),\Psi_{2}({\mathbf{x}})\rangle. We note that

Discussion

Our results surface the question of the extent to which random initialization accounts for the success of neural networks. While we mostly leave this question for future research, we would like to point to empirical evidence supporting the important role of initialization. First, numerous researchers and practitioners demonstrated that random initialization, similar to the scheme we analyze, is crucial to the success of neural network learning (see for instance ). This suggests that starting from arbitrary weights is unlikely to lead to a good solution. Second, several studies show that the contribution of optimizing the representation layers is relatively small . For example, competitive accuracy on CIFAR-10, STL-10, MNIST and MONO datasets can be achieved by optimizing merely the last layer . Furthermore, Saxe et al. show that the performance of training the last layer is quite correlated with training the entire network. The effectiveness of optimizing solely the last layer is also manifested by the popularity of the random features paradigm . Finally, other studies show that the metrics induced by the initial and fully trained representations are not substantially different. Indeed, Giryes et al. demonstrated that for the MNIST and CIFAR-10 datasets the distances’ histogram of different examples barely changes when moving from the initial to the trained representation. For the ImageNet dataset the difference is more pronounced yet still moderate.

The role of architecture.

By using skeletons and compositional kernel spaces, we can reason about functions that the network can actually learn rather than merely express. This may explain in retrospect past architectural choices and potentially guide future choices. Let us consider for example the task of object recognition. It appears intuitive, and is supported by visual processing mechanisms in mammals, that in order to perform object recognition, the first processing stages are confined to local receptive fields. Then, the result of the local computations are applied to detect more complex shapes which are further combined towards a prediction. This processing scheme is naturally expressed by convolutional skeletons. A two dimensional version of Example 1 demonstrates the usefulness of convolutional networks for vision and speech applications.

The rationale we described above was pioneered by LeCun and colleagues . Alas, the mere fact that a network can express desired functions does not guarantee that it can actually learn them. Using for example Barron’s theorem , one may claim that vision-related functions are expressed by fully connected two layer networks, but such networks are inferior to convolutional networks in machine vision applications. Our result mitigates this gap. First, it enables use of the original intuition behind convolutional networks in order to design function spaces that are provably learnable. Second, as detailed in Example 1, it also explains why convolutional networks perform better than fully connected networks.

The role of other architectural choices.

In addition to the general topology of the network, our theory can be useful for understanding and guiding other architectural choices. We give two examples. First, suppose that a skeleton S{\cal S} has a fully connected layer with the dual activation σ^1\hat{\sigma}_{1}, followed by an additional fully connected layer with dual activation σ^2\hat{\sigma}_{2}. It is straightforward to verify that if these two layers are replaced by a single layer with dual activation σ^2σ^1\hat{\sigma}_{2}\circ\hat{\sigma}_{1}, the corresponding compositional kernel space remains the same. This simple observation can be useful in potentially saving a whole layer in the corresponding networks.

The second example is concerned with the ReLU activation, which is one of the most common activations used in practice. Our theory suggests a somewhat surprising explanation for its usefulness. First, the dual kernel of the ReLU activation enables expression of non-linear functions. However, this property holds true for many activations. Second, Theorem 3 shows that even for quite deep networks with ReLU activations, random initialization approximates the corresponding kernel. While we lack a proof at the time of writing, we conjecture that this property holds true for many other activations. What is then so special about the ReLU? Well, an additional property of the ReLU is being positive homogeneous, i.e. satisfying σ(ax)=aσ(x)\sigma(ax)=a\sigma(x) for all a0a\geq 0. This fact makes the ReLU activation robust to small perturbations in the distribution used for initialization. Concretely, if we multiply the variance of the random weights by a constant, the distribution of the generated representation and the space Hw{\cal H}_{\mathbf{w}} remain the same up to a scaling. Note moreover that training algorithms are sensitive to the initialization. Our initialization is very similar to approaches used in practice, but encompasses a small “correction”, in the form of a multiplication by a small constant which depends on the activation. For most activations, ignoring this correction, especially in deep networks, results in a large change in the generated representation. The ReLU activation is more robust to such changes. We note that similar reasoning applies to the max-pooling operation.

Future work.

Though our formalism is fairly general, we mostly analyzed fully connected and convolutional layers. Intriguing questions remain, such as the analysis of max-pooling and recursive neural network components from the dual perspective. On the algorithmic side, it is yet to be seen whether our framework can help in understanding procedures such as dropout and batch-normalization . Beside studying existing elements of neural network learning, it would be interesting to devise new architectural components inspired by duality. More concrete questions are concerned with quantitative improvements of the main results. In particular, it remains open whether the dependence on 2O(0pt(S))2^{O\left(0pt({\cal S})\right)} can be made polynomial and the quartic dependence on 1/ϵ1/\epsilon, RR, and LL can be improved. In addition to being interesting in their own right, improving the bounds may further underscore the effectiveness of random initialization as a way of generating low dimensional embeddings of compositional kernel spaces. Randomly generating such embeddings can be also considered on its own, and we are currently working on design and analysis of random features a la Rahimi and Recht .

Acknowledgments

We would like to thank Yossi Arjevani, Elad Eban, Moritz Hardt, Elad Hazan, Percy Liang, Nati Linial, Ben Recht, and Shai Shalev-Shwartz for fruitful discussions, comments, and suggestions.

References