Avoiding pathologies in very deep networks

David Duvenaud, Oren Rippel, Ryan P. Adams, Zoubin Ghahramani

Relating deep neural networks to deep GPs

This section gives a precise definition of deep GP s, reviews the precise relationship between neural networks and Gaussian processes, and gives two equivalent ways of constructing neural networks which give rise to deep GP s.

We define a deep GP as a distribution on functions constructed by composing functions drawn from GP priors. An example of a deep GP is a composition of vector-valued functions, with each function drawn independently from GP priors:

Multilayer neural networks also implement compositions of vector-valued functions, one per layer. Therefore, understanding general properties of function compositions might helps us to gain insight into deep neural networks.

2 Single-hidden-layer models

First, we relate neural networks to standard “shallow” Gaussian processes, using the standard neural network architecture known as the multi-layer perceptron ( MLP ) (Rosenblatt, 1962). In the typical definition of an MLP with one hidden layer, the hidden unit activations are defined as:

where ${\bf{h}}$ are the hidden unit activations, ${\bf{b}}$ is a bias vector, $\mathbf{V}$ is a weight matrix and $\sigma$ is a one-dimensional nonlinear function, usually a sigmoid, applied element-wise. The output vector $\boldsymbol{f}({\bf{x}})$ is simply a weighted sum of these hidden unit activations:

where $\mathbf{W}$ is another weight matrix.

Neal (1995, chapter 2) showed that some neural networks with infinitely many hidden units, one hidden layer, and unknown weights correspond to Gaussian processes. More precisely, for any model of the form

with fixedfeatures $\left[h_{1}({\bf{x}}),\dots,h_{K}({\bf{x}})\right]^{\mathsf{T}}={\bf{h}}({\bf{x}})$ and i.i.d. $w$ ’s with zero mean and finite variance $\sigma^{2}$ , the central limit theorem implies that as the number of features $K$ grows, any two function values $f({\bf{x}})$ and $f({\bf{x}}^{\prime})$ have a joint distribution approaching a Gaussian:

A joint Gaussian distribution between any set of function values is the definition of a Gaussian process.

The result is surprisingly general: it puts no constraints on the features (other than having uniformly bounded activation), nor does it require that the feature weights ${\bf{w}}$ be Gaussian distributed. An MLP with a finite number of nodes also gives rise to a GP , but only if the distribution on ${\bf{w}}$ is Gaussian.

One can also work backwards to derive a one-layer MLP from any GP : Mercer’s theorem, discussed in LABEL:sec:mercer, implies that any positive-definite kernel function corresponds to an inner product of features: $k({\bf{x}},{\bf{x}}^{\prime})={\bf{h}}({\bf{x}})^{\mathsf{T}}{\bf{h}}({\bf{x}}^{\prime})$ .

Thus, in the one-hidden-layer case, the correspondence between MLP s and GP s is straightforward: the implicit features ${\bf{h}}({\bf{x}})$ of the kernel correspond to hidden units of an MLP .

3 Multiple hidden layers

Next, we examine infinitely-wide MLP s having multiple hidden layers. There are several ways to construct such networks, giving rise to different priors on functions.

This architecture is shown on the right of figure 1. For example, if we extend the model given by equation 3 to have two layers of feature mappings, the resulting model becomes

If the features ${\bf{h}}^{(1)}({\bf{x}})$ and ${\bf{h}}^{(2)}({\bf{x}})$ are fixed with only the last-layer weights ${{\bf{w}}}$ unknown, this model corresponds to a shallow GP with a deep kernel, given by

Deep kernels, explored in section 5, imply a fixed representation as opposed to a prior over representations. Thus, unless we richly parameterize these kernels, their capacity to learn an appropriate representation will be limited in comparison to more flexible models such as deep neural networks or deep GP s.

4 Two network architectures equivalent to deep GPs

There are two equivalent neural network architectures that correspond to deep GP s: one having fixed nonlinearities, and another having GP -distributed nonlinearities.

This alternating-layer architecture has an interpretation as a series of linear information bottlenecks. To see this, substitute equation 3 into equation 12 to get

Characterizing deep Gaussian process priors

This section develops several theoretical results characterizing the behavior of deep GP s as a function of their depth. Specifically, we show that the size of the derivative of a one-dimensional deep GP becomes log-normal distributed as the network becomes deeper. We also show that the Jacobian of a multivariate deep GP is a product of independent Gaussian matrices having independent entries. These results will allow us to identify a pathology that emerges in very deep networks in section 3.

In this section, we derive the limiting distribution of the derivative of an arbitrarily deep, one-dimensional GP having a squared-exp kernel:

The parameter $\sigma^{2}$ controls the variance of functions drawn from the prior, and the lengthscale parameter $w$ controls the smoothness. The derivative of a GP with a squared-exp kernel is point-wise distributed as $\mathcal{N}\!\left(0,\nicefrac{{\sigma^{2}}}{{w^{2}}}\right)$ . Intuitively, a draw from a GP is likely to have large derivatives if the kernel has high variance and small lengthscales.

By the chain rule, the derivative of a one-dimensional deep GP is simply a product of the derivatives of each layer, which are drawn independently by construction. The distribution of the absolute value of this derivative is a product of half-normals, each with mean $\sqrt{\nicefrac{{2\sigma^{2}}}{{\pi w^{2}}}}$ . If one chooses kernel parameters such that ${\nicefrac{{\sigma^{2}}}{{w^{2}}}=\nicefrac{{\pi}}{{2}}}$ , then the expected magnitude of the derivative remains constant regardless of the depth.

The distribution of the log of the magnitude of the derivatives has finite moments:

where $\gamma\approxeq 0.5772$ is Euler’s constant. Since the second moment is finite, by the central limit theorem, the limiting distribution of the size of the gradient approaches a log-normal as L grows:

Even if the expected magnitude of the derivative remains constant, the variance of the log-normal distribution grows without bound as the depth increases.

Because the log-normal distribution is heavy-tailed and its domain is bounded below by zero, the derivative will become very small almost everywhere, with rare but very large jumps. Figure 3 shows this behavior in a draw from a 1D deep GP prior. This figure also shows that once the derivative in one region of the input space becomes very large or very small, it is likely to remain that way in subsequent layers.

2 Distribution of the Jacobian

Next, we characterize the distribution on Jacobians of multivariate functions drawn from deep GP priors, finding them to be products of independent Gaussian matrices with independent entries.

Because differentiation is a linear operator, the derivatives of a function drawn from a GP prior are also jointly Gaussian distributed. The covariance between partial derivatives with respect to input dimensions $d_{1}$ and $d_{2}$ of vector ${\bf{x}}$ are given by Solak et al. (2003):

If our kernel is a product over individual dimensions $k({\bf{x}},{\bf{x}}^{\prime})=\prod_{d}^{D}k_{d}(x_{d},x_{d}^{\prime})$ , then the off-diagonal entries are zero, implying that all elements are independent. ∎

For example, in the case of the multivariate squared-exp kernel, the covariance between derivatives has the form:

The Jacobian of the vector-valued function $\boldsymbol{f}({\bf{x}})$ is a matrix $J$ with elements ${J_{ij}({\bf{x}})=\frac{\partial f_{i}({\bf{x}})}{\partial x_{j}}}$ . Because the GP s on each output dimension $f_{1}({\bf{x}}),f_{2}({\bf{x}}),\dots,f_{D}({\bf{x}})$ are independent by construction, it follows that each row of $J$ is independent. Lemma 2.1 shows that the elements of each row are independent Gaussian. Thus all entries in the Jacobian of a GP -distributed transform are independent Gaussian R.V.’s. ∎

The Jacobian of a deep GP with a product kernel is a product of independent Gaussian matrices, with each entry in each matrix being drawn independently.

This result allows us to analyze the representational properties of a deep Gaussian process by examining the properties of products of independent Gaussian matrices.

Formalizing a pathology

A common use of deep neural networks is building useful representations of data manifolds. What properties make a representation useful? Rifai et al. (2011a) argued that good representations of data manifolds are invariant in directions orthogonal to the data manifold. They also argued that, conversely, a good representation must also change in directions tangent to the data manifold, in order to preserve relevant information. Figure 4 visualizes a representation having these two properties.

As in Rifai et al. (2011b), we characterize the representational properties of a function by the singular value spectrum of the Jacobian. The number of relatively large singular values of the Jacobian indicate the number of directions in data-space in which the representation varies significantly.

Figure 5 shows the distribution of the singular value spectrum of draws from 5-dimensional deep GP s of different depths.Rifai et al. (2011b) analyzed the Jacobian at location of the training points, but because the priors we are examining are stationary, the distribution of the Jacobian is identical everywhere. As the nets gets deeper, the largest singular value tends to dominate, implying there is usually only one effective degree of freedom in the representations being computed.

Figure 6 demonstrates a related pathology that arises when composing functions to produce a deep density model. The density in the observed space eventually becomes locally concentrated onto one-dimensional manifolds, or filaments. This again suggests that, when the width of the network is relatively small, deep compositions of independent functions are unsuitable for modeling manifolds whose underlying dimensionality is greater than one.

To visualize this pathology in another way, figure 7 illustrates a color-coding of the representation computed by a deep GP , evaluated at each point in the input space. After 10 layers, we can see that locally, there is usually only one direction that one can move in ${\bf{x}}$ -space in order to change the value of the computed representation, or to cross a decision boundary. This means that such representations are likely to be unsuitable for decision tasks that depend on more than one property of the input.

To what extent are these pathologies present in the types of neural networks commonly used in practice? In simulations, we found that for deep functions with a fixed hidden dimension $D$ , the singular value spectrum remained relatively flat for hundreds of layers as long as $D>100$ . Thus, these pathologies are unlikely to severely effect the relatively shallow, wide networks most commonly used in practice.

Fixing the pathology

Figure 8 shows a graphical representation of the two connectivity architectures.

Similar connections between non-adjacent layers can also be found the primate visual cortex (Maunsell and van Essen, 1983). Visualizations of the resulting prior on functions are shown in figures 9, 10 and 12.

The Jacobian of an input-connected deep function is defined by the recurrence

where $I_{D}$ is a $D$ -dimensional identity matrix. Thus the Jacobian of an input-connected deep GP is a product of independent Gaussian matrices each with an identity matrix appended. Figure 11 shows that with this architecture, even 50-layer deep GP s have well-behaved singular value spectra.

The pathology examined in this section is an example of the sort of analysis made possible by a well-defined prior on functions. The figures and analysis done in this section could be done using Bayesian neural networks with finite numbers of nodes, but would be more difficult. In particular, care would need to be taken to ensure that the networks do not produce degenerate mappings due to saturation of the hidden units.

Deep kernels

Bengio et al. (2006) showed that kernel machines have limited generalization ability when they use “local” kernels such as the squared-exp. However, many interesting non-local kernels can be constructed which allow some forms of extrapolation. One way to build non-local kernels is by composing fixed feature maps, creating deep kernels. For example, periodic kernels can be viewed as a 2-layer-deep kernel, in which the first layer maps $x\rightarrow[\sin(x),\cos(x)]$ , and the second layer maps through basis functions corresponding to the implicitly feature map giving rise to an SE kernel.

This section builds on the work of Cho and Saul (2009), who derived several kinds of deep kernels by composing multiple layers of feature mappings.

In principle, one can compose the implicit feature maps of any two kernels $k_{a}$ and $k_{b}$ to get a new kernel, which we denote by $\left(k_{b}\circ k_{a}\right)$ :

However, this composition might not always have a closed form if the number of hidden features of either kernel is infinite.

Fortunately, composing the squared-exp ( SE ) kernel with the implicit mapping given by any other kernel has a simple closed form. If $k({\bf{x}},{\bf{x}}^{\prime})={\bf{h}}({\bf{x}})^{\mathsf{T}}{\bf{h}}({\bf{x}}^{\prime})$ , then

Starting with the squared-exp kernel, this repeated mapping satisfies

One can also consider two related connectivity architectures: one in which each layer is connected to the output layer, and another in which every layer is connected to all subsequent layers. It is easy to show that in the limit of infinite depth of composing SE kernels, both these architectures converge to $k({\bf{x}},{\bf{x}}^{\prime})=\delta({\bf{x}},{\bf{x}}^{\prime})$ , the white noise kernel.

2 When are deep kernels useful models?

Kernels correspond to fixed feature maps, and so kernel learning is an example of implicit representation learning. Kernels can capture complex structure (Duvenaud et al., 2013), and can enable many types of generalization, such as translation and rotation invariance in images (Kondor, 2008). More generally, Salakhutdinov and Hinton (2008) used a deep neural network to learn feature transforms for kernels, to learn invariances in an unsupervised manner. We believe that the relatively uninteresting properties of the deep kernels derived in this section simply reflect the fact that an arbitrary computation, even if it is “deep”, is not likely to give rise to a useful representation unless combined with learning. To put it another way, any fixed representation is unlikely to be useful unless it has been chosen specifically for the problem at hand.

Dropout in Gaussian processes

Dropout is a recently-introduced method for regularizing neural networks (Hinton et al., 2012; Srivastava, 2013). Training with dropout entails independently setting to zero (“dropping”) some proportion $p$ of features or inputs, in order to improve the robustness of the resulting network, by reducing co-dependence between neurons. To maintain similar overall activation levels, the remaining weights are divided by $p$ . Predictions are made by approximately averaging over all possible ways of dropping out neurons.

Baldi and Sadowski (2013) and Wang and Manning (2013) analyzed dropout in terms of the effective prior induced by this procedure in several models, such as linear and logistic regression. In this section, we perform a similar analysis for GP s, examining the priors on functions that result from performing dropout in the one-hidden-layer neural network implicitly defined by a GP .

Recall from section 1 that some GP s can be derived as infinitely-wide one-hidden-layer neural networks, with fixed activation functions ${\bf{h}}({\bf{x}})$ and independent random weights ${\bf{w}}$ having zero mean and finite variance $\sigma^{2}_{{\bf{w}}}$ :

Because equation 39 is a result of the central limit theorem, it does not depend on the exact form of the distribution on ${\bf{w}}$ , but only on its mean and variance. Thus the central limit theorem still applies. Performing dropout on the features of an infinitely-wide MLP does not change the resulting model at all, except to rescale the output variance. Indeed, dividing all weights by $\sqrt{p}$ restores the initial variance:

in which case dropout on the hidden units has no effect at all. Intuitively, this is because no individual feature can have more than an infinitesimal contribution to the network output.

This result does not hold in neural networks having a finite number of hidden features with Gaussian-distributed weights, another model class that also gives rise to GP s.

2 Dropout on inputs gives additive covariance

One can also perform dropout on the $D$ inputs to the GP . For simplicity, consider a stationary product kernel ${k({\bf{x}},{\bf{x}}^{\prime})=\prod_{d=1}^{D}k_{d}(x_{d},x_{d}^{\prime})}$ which has been normalized such that $k({\bf{x}},{\bf{x}})=1$ , and a dropout probability of $p=\nicefrac{{1}}{{2}}$ . In this case, the generative model can be written as:

This is a mixture of $2^{D}$ GP s, each depending on a different subset of the inputs:

We present two results which might give intuition about this model.

Another way to understand the resulting prior is to note that the dropout mixture given by equation 43 has the following covariance:

For dropout rates $p\neq\nicefrac{{1}}{{2}}$ , the $d$ th order terms will be weighted by $p^{(D-d)}(1-p)^{d}$ .

Therefore, performing dropout on the inputs of a GP gives a non-Gaussian distribution that has the same first two moments as a GP having a covariance given by equation 46. This model class is called additive GP s, and have the property that they can sometimes allow non-local extrapolation (Duvenaud et al., 2011). To see why, note that a GP whose covariance is a sum of kernels corresponds to a sum of functions, each distributed according to a GP having the corresponding kernel. Therefore, equation 46 describes a prior on a sum of $2^{D}$ functions, each depending on a different subset of input variables. Functions which depend only on a small number of input dimensions will be invariant to noise in all dimensions on which they don’t depend. Isocontours of the resulting kernels are shown in Figure 14.

Thus intepreting dropout as a mixture of models, in which each model only depends on a subset of the input variables, helps to explain why dropout sometimes leads to better generalization.

Related work

Neal (1995, chapter 2) explored properties of arbitrarily deep Bayesian neural networks, including those that would give rise to deep GP s. He noted that infinitely deep random neural networks without extra connections to the input would be equivalent to a Markov chain, and therefore would lead to degenerate priors on functions. He also suggested connecting the input to each layer in order to fix this problem. Much of the analysis in this paper can be seen as a more detailed investigation, and vindication, of these claims.

The first instance of deep GP s being used in practice was (Lawrence and Moore, 2007), who presented a model called “hierarchical GP-LVM s”, in which time was mapped through a composition of multiple GP s to produce observations.

The term “deep Gaussian processes” was first used by Damianou and Lawrence (2013), who developed a variational inference method, analyzed the effect of automatic relevance determination, and showed that deep GP S could learn with relatively little data. They used the term “deep GP ” to refer both to supervised models (compositions of GP s) and to unsupervised models (compositions of GP-LVM s). This conflation may be reasonable, since the activations of the hidden layers are themselves latent variables, even in supervised settings: Depending on kernel parameters, each latent variable may or may not depend on the layer below.

In general, supervised models can also be latent-variable models. For example, Wang and Neal (2012) investigated single-layer GP regression models that had additional latent inputs.

0.2 Nonparametric neural networks

Adams et al. (2010) proposed a prior on arbitrarily deep Bayesian networks having an unknown and unbounded number of parametric hidden units in each layer. Their architecture has connections only between adjacent layers, and so may have similar pathologies to the one discussed in this paper as the number of layers increases.

Wilson et al. (2012) introduced Gaussian process regression networks, which are defined as a matrix product of draws from GP s priors, rather than a composition. These networks have the form:

We can easily define a “deep” Gaussian process regression network:

which repeatedly adds and multiplies functions drawn from GP s, in contrast to deep GP s, which repeatedly compose functions. This prior on functions has a similar form to the Jacobian of a deep GP (equation 21), and so might be amenable to a similar analysis to that of section 2.

0.3 Information-preserving architectures

Deep density networks (Rippel and Adams, 2013) are constructed through a series of parametric warpings of fixed dimension, with penalty terms encouraging the preservation of information about lower layers. This is another promising approach to fixing the pathology discussed in section 3.

0.4 Recurrent networks

Bengio et al. (1994) and Pascanu et al. (2012) analyzed a related problem with gradient-based learning in recurrent networks, the “exploding-gradients” problem. They noted that in recurrent neural networks, the size of the training gradient can grow or shrink exponentially as it is back-propagated, making gradient-based training difficult.

Hochreiter and Schmidhuber (1997) addressed the exploding-gradients problem by introducing hidden units designed to have stable gradients. This architecture is known as long short-term memory.

0.5 Deep kernels

The first systematic examination of deep kernels was done by Cho and Saul (2009), who derived closed-form composition rules for SE , polynomial, and arc-cosine kernels, and showed that deep arc-cosine kernels performed competitively in machine-vision applications when used in a SVM .

Hermans and Schrauwen (2012) constructed deep kernels in a time-series setting, constructing kernels corresponding to infinite-width recurrent neural networks. They also proposed concatenating the implicit feature vectors from previous time-steps with the current inputs, resulting in an architecture analogous to the input-connected architecture proposed by Neal (1995, chapter 2).

0.6 Analyses of deep learning

Montavon et al. (2010) performed a layer-wise analysis of deep networks, and noted that the performance of MLP s degrades as the number of layers with random weights increases, a result consistent with the analysis of this paper.

The experiments of Saxe et al. (2011) suggested that most of the good performance of convolutional neural networks could be attributed to the architecture alone. Later, Saxe et al. (2013) looked at the dynamics of gradient-based training methods in deep linear networks as a tractable approximation to standard deep (nonlinear) neural networks.

0.7 Source code

Source code to produce all figures is available at http://www.github.com/duvenaud/deep-limits. This code is also capable of producing visualizations of mappings such as figures 7 and 12 using neural nets instead of GP s at each layer.

Conclusions

This paper used well-defined priors to explicitly examine the assumptions made by neural network models. We used deep Gaussian processes as an easy-to-analyze model corresponding to multi-layer preceptrons having nonparametric activation functions. We showed that representations based on repeated composition of independent functions exhibit a pathology where the representations becomes invariant to all but one direction of variation. We then showed that this problem could be alleviated by connecting the input to each layer.

We also examined properties of deep kernels, corresponding to arbitrarily many compositions of fixed feature maps. Finally, we derived models obtained by performing dropout on Gaussian processes, finding a tractable approximation to exact dropout in GP s.

Much recent work on deep networks has focused on weight initialization (Martens, 2010), regularization (Lee et al., 2007) and network architecture (Gens and Domingos, 2013). If we can identify priors that give our models desirable properties, these might in turn suggest regularization, initialization, and architecture choices that also provide such properties.

We thank Carl Rasmussen, Andrew McHutchon, Neil Lawrence, Andreas Damianou, James Robert Lloyd, Creighton Heaukulani, Dan Roy, Mark van der Wilk, Miguel Hernández-Lobato and Andrew Tulloch for helpful discussions.