Deep Convolutional Networks as shallow Gaussian Processes

Adrià Garriga-Alonso, Carl Edward Rasmussen, Laurence Aitchison

Introduction

Convolutional Neural Networks (\acpCNN) have powerful pattern-recognition capabilities that have recently given dramatic improvements in important tasks such as image classification (Krizhevsky et al., 2012). However, as \acpCNN are increasingly being applied in real-world, safety-critical domains, their vulnerability to adversarial examples (Szegedy et al., 2013; Kurakin et al., 2016), and their poor uncertainty estimates are becoming increasingly problematic. Bayesian inference is a theoretically principled and demonstrably successful (Snoek et al., 2012; Deisenroth & Rasmussen, 2011) framework for learning in the face of uncertainty, which may also help to address the problems of adversarial examples (Gal & Smith, 2018). Unfortunately, Bayesian inference in \acpCNN is extremely difficult due to the very large number of parameters, requiring highly approximate factorised variational approximations (Blundell et al., 2015; Gal & Ghahramani, 2015), or requiring the storage (Lakshminarayanan et al., 2017) of large numbers of posterior samples (Welling & Teh, 2011; Mandt et al., 2017).

Other methods such as those based on \acfpGP are more amenable to Bayesian inference, allowing us to compute the posterior uncertainty exactly (Rasmussen & Williams, 2006). This raises the question of whether it might be possible to combine the pattern-recognition capabilities of \acpCNN with the exact probabilistic computations in \acpGP. Two such approaches exist in the literature. First, deep convolutional kernels (Wilson et al., 2016) parameterise a \acGP kernel using the weights and biases of a \acCNN, which is used to embed the input images into some latent space before computing their similarity. The \acCNN parameters of the resulting kernel then are optimised by gradient descent. However, the large number of kernel parameters in the \acCNN reintroduces the risk of overconfidence and overfitting. To avoid this risk, we need to infer a posterior over the \acCNN kernel parameters, which is as difficult as directly inferring a posterior over the parameters of the original \acCNN. Second, it is possible to define a convolutional \acGP (van der Wilk et al., 2017) or a deep convolutional \acGP (Kumar et al., 2018) by defining a \acGP that takes an image patch as input, and using that \acGP as a component in a larger \acCNN-like system. However, inference in such systems is very computationally expensive, at least without the use of potentially severe variational approximations (van der Wilk et al., 2017).

An alternative approach is suggested by the underlying connection between Bayesian Neural Networks (\acspNN) and \acpGP. In particular, Neal (1996) showed that the function defined by a single-layer fully-connected \acNN with infinitely many hidden units, and random independent zero-mean weights and biases is equivalent to a \acGP, implying that we can do exact Bayesian inference in such a \acNN by working with the equivalent \acGP. Recently, this result was extended to arbitrarily deep fully-connected \acpNN with infinitely many hidden units in each layer (Lee et al., 2017; Matthews et al., 2018a). However, these fully-connected networks are rarely used in practice, as they are unable to exploit important properties of images such as translational invariance, raising the question of whether state-of-the-art architectures such as \acpCNN (LeCun et al., 1990) and ResNets (He et al., 2016a) have equivalent \acGP representations. Here, we answer in the affirmative, giving the \acGP kernel corresponding to arbitrarily deep \acpCNN and to (convolutional) residual neural networks (He et al., 2016a). In this case, if each hidden layer has an infinite number of convolutional filters, the network prior is equivalent to a \acGP.

Furthermore, we show that two properties of the \acGP kernel induced by a \acCNN allow it to be computed very efficiently. First, in previous work it was necessary to compute the covariance matrix for the output of a single convolutional filter applied at all possible locations within a single image (van der Wilk et al., 2017), which was prohibitively computationally expensive. In contrast, under our prior, the downstream weights are independent with zero-mean, which decorrelates the contribution from each location, and implies that it is necessary only to track the patch variances, and not their covariances. Second, while it is still necessary to compute the variance of the output of a convolutional filter applied at all locations within the image, the specific structure of the kernel induced by the \acCNN means that the variance at every location can be computed simultaneously and efficiently as a convolution.

Finally, we empirically demonstrate the performance increase coming from adding translation-invariant structure to the \acGP prior. Without computing any gradients, and without augmenting the training set (e.g. using translations), we obtain 0.84% error rate on the MNIST classification benchmark, setting a new record for nonparametric GP-based methods.

\acGP behaviour in a \acCNN

For clarity of exposition, we will treat the case of a 2D convolutional \acNN. The result applies straightforwardly to $n$ D convolutions, dilated convolutions and upconvolutions (“deconvolutions”), since they can be represented as linear transformations with tied coefficients (see Fig. 1).

The network takes an arbitrary input image ${\bf{X}}$ of height $H^{(0)}$ and width $D^{(0)}$ , as a $C^{(0)}\times(H^{(0)}D^{(0)})$ real matrix. Each row, which we denote ${\bf{x}}_{1},{\bf{x}}_{2},\dotsc,{\bf{x}}_{C^{(0)}}$ , corresponds to a channel of the image (e.g. $C^{(0)}=3$ for RGB), flattened to form a vector. The first activations ${\bf{A}}^{(1)}({\bf{X}})$ are a linear transformation of the inputs. For $i\in\{1,\dotsc,C^{(1)}\}$ :

We consider a network with $L$ hidden layers. The other activations of the network, from ${\bf{A}}^{(2)}({\bf{X}})$ up to ${\bf{A}}^{(L+1)}({\bf{X}})$ , are defined recursively:

The outputs of the network are the last activations, ${\bf{A}}^{(L+1)}({\bf{X}})$ . In the classification or regression setting, the outputs are not spatially extended, so we have $H^{(L+1)}=D^{(L+1)}=1$ , which is equivalent to a fully-connected output layer. In this case, the pseudo-weights ${\bf{W}}^{(L+1)}_{i,j}$ only have one row, and the activations ${\bf{a}}_{i}^{(L+1)}$ are single-element vectors.

Note that, to keep the activation variance constant, the weight variance is divided by the number of input channels. The weight variance can also be divided by the number of elements of the filter, which makes it equivalent to the \acNN weight initialisation scheme introduced by He et al. (2016a).

2 Argument for \acGP behaviour

This quantity (and the following arguments) can all be extended to the case of finitely many input points.

For any pair of data points, ${\bf{X}}$ and ${\bf{X}}^{\prime}$ the feature-maps corresponding to the $j$ th channel, ${\bf{a}}_{j}^{(1)}({\bf{X}},{\bf{X}}^{\prime})$ have a multivariate Gaussian joint distribution. This is because each element is a linear combination of shared Gaussian random variables: the biases, ${\bf{b}}_{j}^{(0)}$ and the filters, ${\bf{U}}^{(0)}_{j,:}$ . Following Eq. (1),

where ${\bf 1}$ is a vector of all-ones. While the elements within a feature map display strong correlations, different feature maps are \aciid conditioned on the data (i.e. ${\bf{a}}_{i}^{(1)}({\bf{X}},{\bf{X}}^{\prime})$ and ${\bf{a}}_{i^{\prime}}^{(1)}({\bf{X}},{\bf{X}}^{\prime})$ are \aciid for $i\neq i^{\prime}$ ), because the parameters for different feature-maps (i.e. the biases, $b_{i}^{(1)}$ and the filters, ${\bf{W}}^{(1)}_{i,:}$ ) are themselves \aciid.

Induction step.

The ConvNet and ResNet kernels

A \acGP is completely specified by its mean and covariance (or kernel) functions. These give the parameters of the joint Gaussian distribution of the \acpRV indexed by any two inputs, ${\bf{X}}$ and ${\bf{X}}^{\prime}$ . For the purposes of computing the mean and covariance, it is easiest to consider the network as being written entirely in index notation,

The mean function is thus easy to compute,

This is true at the output layer $(L+1)$ : in order to achieve an output suitable for classification or regression, we use only a single output location $H^{(L+1)}=D^{(L+1)}=1$ , with a number of “channels” equal to the number of of outputs/classes, so it is only possible to compute the covariance at that single location. We now show that, if we only need the covariance at corresponding locations in the outputs, we only need the covariance at corresponding locations in the inputs, and this requirement propagates backwards through the network.

Formally, as the activations are composed of a sum of terms, their covariance is the sum of the covariances of all those underlying terms,

As the terms in the covariance have mean zero, and as the weights and activations from the previous layer are independent,

is the covariance of the activations, which is again independent of the channel.

2 Covariance of the activities

For example, for the ReLU nonlinearity ( $\phi(x)=\max(0,x)$ ), one can adapt Cho & Saul (2009) in the same way as Matthews et al. (2018a, section 3) to obtain

3 Efficiency of the ConvNet kernel

We now have all the pieces for computing the kernel, as written in Algorithm 1.

Furthermore, the particular form for the kernel (Eq. 1 and Eq. 2) implies that the required variances and covariances at all required locations can be computed efficiently as a convolution.

4 Kernel for a residual \acCNN

then the kernel recursion (Eq. 11) becomes

This way of adding skip connections is equivalent to the “pre-activation” shortcuts described by He et al. (2016b). Remarkably, the natural way of adding residual connections to \acpNN is the one that performed best in their empirical evaluations.

Experiments

We evaluate our kernel on the MNIST handwritten digit classification task. Classification likelihoods are not conjugate for \acpGP, so we must make an approximation, and we follow Lee et al. (2017), in re-framing classification as multi-output regression.

The training set is split into $N=50000$ training and $10000$ validation examples. The regression targets ${\bf{Y}}\in\{-1,1\}^{N\times 10}$ are a one-hot encoding of the example’s class: $y_{n,c}=1$ if the $n$ th example belongs to class $c$ , and $-1$ otherwise.

Training is exact conjugate likelihood \acGP regression with noiseless targets ${\bf{Y}}$ (Rasmussen & Williams, 2006). First we compute the $N\times N$ kernel matrix ${\bf{K}}_{xx}$ , which contains the kernel between every pair of examples. Then we compute ${\bf{K}}_{xx}^{-1}{\bf{Y}}$ using a linear system solver.

The test set has $N_{T}=10000$ examples. We compute the $N_{T}\times N$ matrix ${\bf{K}}_{x^{*}x}$ , the kernel between each test example and all the training examples. The predictions are given by the row-wise maximum of ${\bf{K}}_{x^{*}x}{\bf{K}}_{xx}^{-1}{\bf{Y}}$ .

For the “ConvNet \acGP” and “Residual \acCNN \acGP”, (Table 1) we optimise the kernel hyperparameters by random search. We draw $M$ random hyperparameter samples, compute the resulting kernel’s performance in the validation set, and pick the highest performing run. The kernel hyperparameters are: $\sigma_{\text{b}}^{2}$ , $\sigma_{\text{w}}^{2}$ ; the number of layers; the convolution stride, filter sizes and edge behaviour; the nonlinearity (we consider the error function and ReLU); and the frequency of residual skip connections (for Residual CNN GPs). We do not retrain the model on the validation set after choosing hyperparameters.

The “ResNet \acGP” (Table 1) is the kernel equivalent to a 32-layer version of the basic residual architecture by He et al. (2016a). The differences are: an initial $3\times 3$ convolutional layer and a final dense layer instead of average pooling. We chose to remove the pooling because computing its output variance requires the off-diagonal elements of the filter covariance, in which case we could not exploit the efficiency gains described in Sec. 3.3.

We found that the despite it not being optimised, the 32-layer ResNet \acGP outperformed all other comparable architectures (Table 1), including the NNGP in Lee et al. (2017), which is state-of-the-art for non-convolutional networks, and convolutional \acpGP (van der Wilk et al., 2017; Kumar et al., 2018). That said, our results have not reached state-of-the-art for methods that incorporate a parametric neural network, such as a standard ResNet (Chen et al., 2018) and a Gaussian process with a deep neural network kernel (Bradshaw et al., 2017).

To check whether the GP limit is applicable to relatively small networks used practically (with of the order of $100$ channels in the first layers), we randomly sampled $10,000$ 32-layer ResNets, with 3, 10, 30 and 100 channels in the first layers. Following the usual practice for ResNets we increase the number the number of hidden units when we downsample the feature maps. Then, we compare the sampled and limiting theoretical distributions of ${\bf{A}}^{(32)}({\bf{X}})$ for a given input ${\bf{X}}$ .

The probability density plots show a good match around 100 channels (Fig. 2A), which matches a more sensitive graphical procedure based on quantile-quantile plots (Fig. 2B). Notably, even for only 30 channels, the empirical moments (computed over many input images) match closely the limiting ones (Fig. 2C). For comparison, typical ResNets use from 64 (He et al., 2016a) to 192 (Zagoruyko & Komodakis, 2016) channels in their first layers. We believe that this is because the moment propagation equations only require the Gaussianity assumption for propagation through the ReLU, and presumably this is robust to non-Gaussian input activations.

Asymptotically, computing the kernel matrix takes $O(N^{2}LD)$ time, where $L$ is the number of layers in the network and $D$ is the dimensionality of the input, and inverting the kernel matrix takes $O(N^{3})$ . As such, we expect that for very large datasets, inverting the kernel matrix will dominate the computation time. However, on MNIST, $N^{3}$ is only around a factor of $10$ larger than $N^{2}LD$ . In practice, we found that it was more expensive to compute the kernel matrix than to invert it. For the ResNet kernel, the most expensive, computing ${\bf{K}}_{xx}$ , and ${\bf{K}}_{xx*}$ for validation and test took $3$ h $40$ min on two Tesla P100 GPUs. In contrast, inverting ${\bf{K}}_{xx}$ and computing validation and test performance took $43.25\pm 8.8$ seconds on a single Tesla P100 GPU.

Related work

Van der Wilk et al. (van der Wilk et al., 2017) also adapted \acpGP to image classification. They defined a prior on functions $f$ that takes an image and outputs a scalar. First, draw a function $g\sim\mathcal{GP}(0,k_{p}({\bf{X}},{\bf{X}}^{\prime}))$ . Then, $f$ is the sum of the output of $g$ applied to each of the convolutional patches. Their approach is also inspired by convolutional \acpNN, but their kernel $k_{p}$ is applied to all pairs of patches of ${\bf{X}}$ and ${\bf{X}}^{\prime}$ . This makes their convolutional kernel expensive to evaluate, requiring inter-domain inducing point approximations to remain tractable. The kernels in this work, directly motivated by the infinite-filter limit of a \acCNN, only apply something like $k_{p}$ to the corresponding pairs of patches within ${\bf{X}}$ and ${\bf{X}}^{\prime}$ (Eq. 10). As such, the \acCNN kernels are cheaper to compute and exhibit superior performance (Table 1), despite the use of an approximate likelihood function.

Kumar et al. (2018) define a prior over functions by stacking several \acpGP with van der Wilk’s convolutional kernel, forming a “Deep \acGP” (Damianou & Lawrence, 2013). In contrast, the kernel in this paper confines all hierarchy to the definition of the kernel, and the resulting \acpGP is shallow.

Wilson et al. (2016) introduced and Bradshaw et al. (2017) improved deep kernel learning. The inputs to a classic \acGP kernel $k$ (e.g. RBF) are preprocessed by applying a feature extractor $g$ (a deep \acNN) prior to computing the kernel: $k_{\text{deep}}({\bf{X}},{\bf{X}}^{\prime}):=k(g({\bf{X}};\theta),g({\bf{X}}^{\prime},\theta))$ . The \acNN parameters are optimised by gradient ascent using the likelihood as the objective, as in standard \acGP kernel learning (Rasmussen & Williams, 2006, Chapter 5). Since deep kernel learning incorporates a state-of-the-art \acNN with over $10^{6}$ parameters, we expect it to perform similarly to a \acNN applied directly to the task of image classification. At present both \acpCNN and deep kernel learning display superior performance to the \acGP kernels in this work. However, the kernels defined here have far fewer parameters (around $10$ , compared to their $10^{6}$ ).

Borovykh (2018) also suggests that a \acCNN exhibits \acGP behaviour. However, they take the infinite limit with respect to the filter size, not the number of filters. Thus, their infinite network is inapplicable to real data which is always of finite dimension.

Finally, there is a series of papers analysing the mean-field behaviour of deep \acpNN and \acpCNN which aims to find good random initializations, i.e. those that do not exhibit vanishing or exploding gradients or activations (Schoenholz et al., 2016; Yang & Schoenholz, 2017). Apart from their very different focus, the key difference to our work is that they compute the variance for a single training-example, whereas to obtain the \acpGP kernel, we additionally need to compute the output covariances for different training/test examples (Xiao et al., 2018).

Conclusions and future work

We have shown that deep Bayesian \acpCNN with infinitely many filters are equivalent to a \acGP with a recursive kernel. We also derived the kernel for the \acGP equivalent to a \acCNN, and showed that, in handwritten digit classification, it outperforms all previous \acGP approaches that do not incorporate a parametric \acNN into the kernel. Given that most state-of-the-art neural networks incorporate structure (convolutional or otherwise) into their architecture, the equivalence between \acpCNN and \acpGP is potentially of considerable practical relevance. In particular, we hope to apply \acGP \acpCNN in domains as widespread as adversarial examples, lifelong learning and k-shot learning, and we hope to improve them by developing efficient multi-layered inducing point approximation schemes.

References

Appendix

The key technical issues in the proof (and the key differences between Lee et al. 2017 and Matthews et al. 2018b) arise from exactly how and where we take limits. In particular, consider the activations as being functions of the activities at the previous layer,

Now, there are two approaches to taking limits. First, both our argument in the main text, and the argument in Lee et al. (2017) is valid if we are able to take limits “inside” the network,

However, Matthews et al. (2018a; b) argue that is preferable to take limits “outside” the network. In particular, Matthews et al. (2018b) take the limit with all layers simultaneously,

2 Extending the derivations of Matthews et al. (2018b) to the convolutional case

In the main text, we follow Lee et al. (2017) in sequentially taking the limit of each layer to infinity (i.e. $C^{(1)}\rightarrow\infty$ , then $C^{(2)}\rightarrow\infty$ etc.). This dramatically simplified the argument, because taking the number of units in the previous layer to infinity means that the inputs from that layer are exactly Gaussian distributed. However, Matthews et al. (2018b) argue that the more practically relevant limit is where we take all layers to infinity simultaneously. This raises considerable additional difficulties, because we must reason about convergence in the case where the previous layer is finite. Note that this section is not intended to stand independently: it is intended to be read alongside Matthews et al. (2018b), and we use several of their results without proof.

To extend their proofs to the convolutional case, we begin by defining our networks in a form that is easier to manipulate and as close as possible to Eq. (21-23) in Matthews et al. (2018b),

Now, we must prove that these projections converge in distribution a Gaussian. We begin by defining summands, as in Eq. 26 in Matthews et al. (2018b),

such that the projections can be written as a sum of the summands, exactly as in Eq. 27 in Matthews et al. (2018b),

Condition a) follows immediately as the summands are uncorrelated and zero-mean. Conditions b) and c) are more involved as convergence in distribution in the previous layers does not imply convergence in moments for our activation functions.

Following Eq. 48 in Matthews et al. (2018b), which uses Lemma 19 in Matthews et al. (2018b), we have,

by the linear envelope property, $\lvert\phi(u)\rvert\leq c+m\lvert u\rvert$ . Following Matthews et al. (2018b), we substitute this bound back into Eq. (28) and suppress a multiplicative constant independent of $x$ and $n$ ,

This can be multiplied out, yielding a weighted sum of expectations of the form,

Using this result, we can obtain a straightforward adaptation of Lemmas 15, 16 and 21 in Matthews et al. (2018b). Lemma 15 gives condition b), Lemma 16 gives condition c); Lemma 15 requires uniform integrability, which is established by Lemma 21.

3 Calibration of Gaussian process uncertainty

It is important to check that the estimates of uncertainty produced by our Gaussian process are reasonable. However, to make this assessment, we needed to use a proper likelihood, and not the squared-error loss in the main text. We therefore used our kernel to perform the full, multi-class classification problem in GPflow (Matthews et al., 2017), with a RobustMax likelihood (Hernández-lobato et al., 2011). The more difficult non-conjugate inference problem forced us to use 1000 inducing points, randomly chosen from the training inputs. Both our kernel and an RBF kernel have similar calibration curves, that closely track the diagonal, indicating accurate uncertainty estimation. However, even in the inducing point setting, our convolutional kernel gave considerably better performance than the RBF kernel (2.4% error vs 3.4% error). See Fig. 3.

4 Closed-form expectation for the error function nonlinearity

The error function (erf) is given by the integral $\phi(x)=\frac{2}{\sqrt{\pi}}\int_{0}^{x}e^{-t^{2}}dt$ , and is related to the Gaussian CDF. It is very similar to the hyperbolic tangent (tanh), with the difference that erf’s tails approach 0 and 1 faster than the tanh’s tails.

Williams (1997) gives us the relevant closed-form integral: