Deep convolutional Gaussian processes

Kenneth Blomqvist, Samuel Kaski, Markus Heinonen

Introduction

Gaussian processes (GPs) are a family of flexible function distributions defined by a kernel function (Rasmussen and Williams, 2006). The modeling capacity is determined by the chosen kernel, and standard stationary kernels lead to models that underperform in practice. Shallow (single-layer) Gaussian processes are often sub-optimal, since flexible kernels that would account for non-stationary patterns and long-range interactions in the data are difficult to design and infer (Wilson et al., 2013; Remes et al., 2017). Deep Gaussian processes boost performance by modelling networks of GP nodes (Duvenaud et al., 2011; Sun et al., 2018) or by mapping inputs through multiple Gaussian process 'layers' (Damianou and Lawrence, 2013; Salimbeni and Deisenroth, 2017). While more flexible and powerful than shallow GPs, deep Gaussian processes result in degenerate models if the individual GP layers are not invertible, which limits their potential (Duvenaud et al., 2014).

Convolutional neural networks (CNNs) are a celebrated approach for image recognition tasks with superior performance (Mallat, 2016). These models encode a hierarchical translation-invariance assumption into the structure of the model by applying convolutions to extract increasingly complex patterns through the layers.

While neural networks have achieved unparalleled results on many tasks, they have their shortcomings. Effective neural networks require a large number of parameters, which in turn require careful optimisation to prevent overfitting. Neural networks can often leverage a large amount of training data to counteract this problem. Developing methods that are better regularized and can incorporate prior knowledge would allow us to deploy machine learning methods in domains where massive amounts of data are not available. Furthermore, conventional neural networks do not provide reliable uncertainty estimates on predictions, which are important in many real-world applications.

Deterministic CNNs have been extended into the probabilistic domain with weight uncertainties (Blundell et al., 2015). Gal and Ghahramani (2016) explored the Bayesian connections of the dropout technique. Neural networks are known to converge to Gaussian processes in the limit of infinite layer width (MacKay, 1992; Williams, 1997; Lee et al., 2017). Garriga-Alonso et al. (2018) derive a kernel which is equivalent to residual CNNs with a certain prior over the weights. Wilson et al. (2016b) proposed a hybrid deep kernel learning approach, where a feature-extractor deep neural network is stacked with a Gaussian process predictor layer, learning the neural network weights by variational inference (Wilson et al., 2016a).

Recently, Van der Wilk et al. (2017) proposed the first convolution-based Gaussian process for images with promising performance. They proposed a weighted additive model where Gaussian process responses over image subpatches are aggregated for image classification. The convolutional Gaussian process is unable to model pattern combinations due to its restriction to a single layer. Very recently, Kumar et al. (2018) applied convolutional kernels in a deep Gaussian process; however, they were unable to significantly improve upon the shallow convolutional GP model.

In this paper we propose a deep convolutional Gaussian process, which iteratively convolves several GP functions over the image. We learn multimodal probabilistic representations that encode increasingly complex pattern combinations as a function of depth. Our model is a fully Bayesian kernel method with no neural network component. On the CIFAR-10 dataset, deep convolutions increase the current state-of-the-art GP predictive accuracy from 65% to 76%. Our model demonstrates how a purely GP-based approach can reach the performance of hybrid neural network GP models.

Background

A convolution as used in convolutional neural networks takes a signal, two-dimensional in the case of an image, and a tensor-valued filter to produce a new signal (Goodfellow et al., 2016). The filter is moved across the signal, and at each step a dot product is taken with the corresponding section of the signal. The resulting signal has a high value where the signal is similar to the filter, zero where it is orthogonal to the filter, and a low value where it is very different from the filter. A convolution of a two-dimensional image $\mathbf{x}$ and a convolutional filter $\mathbf{g}$ of height $h$ and width $w$ is defined as

$$(\mathbf{x}*\mathbf{g})[i,j]=\sum_{a=0}^{h-1}\sum_{b=0}^{w-1}\mathbf{x}[i+a,j+b]\,\mathbf{g}[a,b].$$

By default, the convolution is evaluated at every location of the image. Alternatively, one may evaluate it only at regularly spaced locations; the spacing is referred to as the stride. A stride of 2 means that only every other location $i,j$ is taken in the output.
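As a minimal sketch of the operation described above (an illustration in NumPy, not the authors' implementation), the following function slides a filter over a single-channel image with a configurable stride and takes a dot product at each location. The names `conv2d_valid`, `image` and `filt` are assumptions introduced for this example, and no padding is applied.

```python
import numpy as np

def conv2d_valid(image, filt, stride=1):
    """Slide `filt` over `image` and take a dot product at each location.

    Only locations where the filter fits entirely inside the image are used
    (no padding). As is common in CNNs, the filter is not flipped.
    """
    H, W = image.shape
    h, w = filt.shape
    out_h = (H - h) // stride + 1
    out_w = (W - w) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            # Dot product between the filter and the image section under it.
            out[i, j] = np.sum(image[r:r + h, c:c + w] * filt)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
filt = np.ones((3, 3))
full = conv2d_valid(image, filt, stride=1)      # every location, shape (4, 4)
strided = conv2d_valid(image, filt, stride=2)   # every other location, shape (2, 2)
```

With a stride of 2 the output coincides with every other row and column of the stride-1 output.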

2 Primer on Gaussian processes

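A Gaussian process prior on a function $f(\mathbf{x})$,

$$f(\mathbf{x})\sim\mathcal{GP}(m(\mathbf{x}),K(\mathbf{x},\mathbf{x}^{\prime})),$$

defines a prior distribution over function values $f(\mathbf{x})$ with mean and covariance:

$$\mathbb{E}[f(\mathbf{x})]=m(\mathbf{x}),\qquad\mathrm{cov}[f(\mathbf{x}),f(\mathbf{x}^{\prime})]=K(\mathbf{x},\mathbf{x}^{\prime}).$$

A GP prior defines that for any collection of $n$ inputs $X=(\mathbf{x}_{1},\ldots,\mathbf{x}_{n})^{T}$, the corresponding function values

$$\mathbf{f}=(f(\mathbf{x}_{1}),\ldots,f(\mathbf{x}_{n}))^{T}$$

follow a multivariate Normal distribution

$$\mathbf{f}\sim\mathcal{N}(\mathbf{m},K_{XX}),$$

where $\mathbf{m}=(m(\mathbf{x}_{1}),\ldots,m(\mathbf{x}_{n}))^{T}$ and $K_{XX}$ is the $n\times n$ kernel matrix with entries $K(\mathbf{x}_{i},\mathbf{x}_{j})$.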
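To make the finite-dimensional view concrete, the sketch below draws function values at $n$ inputs from a multivariate Normal whose covariance is a kernel matrix. The zero mean, the squared exponential kernel, its hyperparameters and the jitter value are assumptions chosen for illustration only.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared exponential kernel matrix between the rows of X1 and X2."""
    sq_dist = (
        np.sum(X1 ** 2, axis=1)[:, None]
        + np.sum(X2 ** 2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale ** 2)

rng = np.random.default_rng(0)
X = np.linspace(-3.0, 3.0, 50)[:, None]           # n = 50 one-dimensional inputs
K = rbf_kernel(X, X) + 1e-6 * np.eye(len(X))      # jitter for numerical stability
L = np.linalg.cholesky(K)
f_samples = L @ rng.standard_normal((len(X), 3))  # three draws of f ~ N(0, K)
```

Each column of `f_samples` is one joint draw of the function values at the 50 inputs.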

3 Variational inference

The variational expected likelihood in the evidence lower bound $\mathcal{L}$ can be computed using numerical quadrature approaches (Hensman et al., 2015b).
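One common quadrature choice is Gauss-Hermite quadrature, sketched below for a single data point with a Gaussian variational marginal $\mathcal{N}(f\,|\,\mu,\sigma^{2})$. The function names and the Gaussian likelihood used for the sanity check are assumptions for illustration; the text does not state which quadrature rule is used.

```python
import numpy as np

def expected_log_lik(log_lik, y, mu, var, num_points=20):
    """Approximate E_{N(f | mu, var)}[log p(y | f)] by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(num_points)
    f = mu + np.sqrt(2.0 * var) * nodes           # change of variables
    return np.sum(weights * log_lik(y, f)) / np.sqrt(np.pi)

def gaussian_log_lik(y, f, noise_var=0.1):
    """Log density of y under N(f, noise_var); used here only as a test case."""
    return -0.5 * np.log(2.0 * np.pi * noise_var) - 0.5 * (y - f) ** 2 / noise_var

y, mu, var = 0.3, 0.1, 0.5
approx = expected_log_lik(gaussian_log_lik, y, mu, var)
# Closed form available for the Gaussian likelihood, for comparison.
exact = -0.5 * np.log(2.0 * np.pi * 0.1) - 0.5 * ((y - mu) ** 2 + var) / 0.1
```

For non-Gaussian likelihoods, such as those used in classification, no closed form is generally available and only the quadrature estimate is computed.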

Deep convolutional Gaussian process

In this section we introduce the deep convolutional Gaussian process. We stack multiple convolutional GP layers followed by a GP classifier with a convolutional kernel.

We model the $C$ patch responses at each of the first $L-1$ layers as independent GPs with shared prior

For example, on MNIST where images have size $28\times 28\times 1$ using patches of size $5\times 5\times 1$ , a stride of 1 and $C=10$ patch response functions, we obtain a representation of size $24\times 24\times 10$ after the first layer (height and width $W_{1}=H_{1}=(28-5)/1+1$ ). This is passed on to the next layer which produces an output of size $20\times 20\times 10$ .
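The size arithmetic in this example can be checked with a few lines of code. The helper names below are hypothetical and assume square patches, no padding, and the same patch size and stride at every layer.

```python
def conv_output_size(size, patch, stride=1):
    """Number of patch locations along one dimension without padding."""
    return (size - patch) // stride + 1

def layer_shapes(height, width, patch, stride, channels, num_layers):
    """Representation shape after each of `num_layers` convolutional layers."""
    shapes = []
    for _ in range(num_layers):
        height = conv_output_size(height, patch, stride)
        width = conv_output_size(width, patch, stride)
        shapes.append((height, width, channels))
    return shapes

mnist_shapes = layer_shapes(28, 28, patch=5, stride=1, channels=10, num_layers=2)
# mnist_shapes == [(24, 24, 10), (20, 20, 10)]
```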

where the covariances between the input and the inducing variables are

2 Final classification layer

As the last layer of our model we aggregate the output of the convolutional layers using a GP with a weighted convolutional kernel as presented by Van der Wilk et al. (2017). We set a GP prior on the last layer patch response function

with weights for each patch response. We get an additive GP

As with the convolutional layers, the inducing points live in the patch space instead of the image space. The inter-domain kernel is
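The structure of this layer can be illustrated with a sketch based on the weighted convolutional kernel of Van der Wilk et al. (2017), in which the additive GP has covariance $\sum_{p}\sum_{p^{\prime}}w_{p}w_{p^{\prime}}k_{g}(\mathbf{x}^{[p]},\mathbf{x}^{\prime[p^{\prime}]})$ and the covariance between an input and an inducing patch $\mathbf{z}$ is $\sum_{p}w_{p}k_{g}(\mathbf{x}^{[p]},\mathbf{z})$. The notation, the squared exponential base kernel, the single-channel inputs and all function names are assumptions made for this example, not the exact formulation of this paper.

```python
import numpy as np

def extract_patches(image, patch, stride=1):
    """Return all patch x patch sub-windows of a 2-D image as flat rows."""
    H, W = image.shape
    rows = [
        image[i:i + patch, j:j + patch].ravel()
        for i in range(0, H - patch + 1, stride)
        for j in range(0, W - patch + 1, stride)
    ]
    return np.stack(rows)

def rbf(A, B, lengthscale=1.0):
    """Squared exponential base kernel k_g between the rows of A and B."""
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale ** 2)

def weighted_conv_kernel(x1, x2, weights, patch):
    """k(x, x') = sum_p sum_p' w_p w_p' k_g(x[p], x'[p'])."""
    P1, P2 = extract_patches(x1, patch), extract_patches(x2, patch)
    return weights @ rbf(P1, P2) @ weights

def interdomain_kernel(x, Z, weights, patch):
    """k(x, z) = sum_p w_p k_g(x[p], z) for each inducing patch z (row of Z)."""
    return weights @ rbf(extract_patches(x, patch), Z)

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal((2, 6, 6))
w = rng.standard_normal(16)        # one weight per 3x3 patch location (4 * 4)
Z = rng.standard_normal((5, 9))    # five inducing patches of size 3 * 3
k12 = weighted_conv_kernel(x1, x2, w, patch=3)
k21 = weighted_conv_kernel(x2, x1, w, patch=3)
k_xz = interdomain_kernel(x1, Z, w, patch=3)
```

Because the inducing points are patches, `k_xz` only requires kernel evaluations between patches and never between whole images.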

3 Doubly stochastic variational inference

The deep convolutional Gaussian process is an instance of a deep Gaussian process with convolutional kernels and patch filter inducing points. We follow the doubly stochastic variational inference approach of Salimbeni and Deisenroth (2017) for model learning. The key idea of doubly stochastic inference is to draw samples from the Gaussian

through the deep system for a single input image $\mathbf{x}_{i}$ .

The inducing points of each layer are independent. We assume a factorised likelihood

The evidence framework (MacKay, 1992) considers optimizing the evidence,

Following the variational approach we assume a variational joint model

The variational expected likelihood is computed using a Monte Carlo approximation yielding the first source of stochasticity. The whole lower bound is optimized using stochastic gradient descent yielding the second source of stochasticity.
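The two sources of stochasticity can be sketched as follows. Each layer is represented by a hypothetical function returning the marginal mean and variance of its output given its input; in the actual model these would be the variational GP posterior marginals. The sketch covers only the expected likelihood term and omits the KL terms of the bound.

```python
import numpy as np

def sample_layer(mean, var, rng):
    """Reparameterised draw f = mean + sqrt(var) * eps with eps ~ N(0, 1)."""
    return mean + np.sqrt(var) * rng.standard_normal(mean.shape)

def propagate(x, layers, rng):
    """Push one sample through the layers; each layer maps input -> (mean, var)."""
    h = x
    for layer in layers:
        mean, var = layer(h)
        h = sample_layer(mean, var, rng)
    return h

def elbo_likelihood_term(X, y, layers, log_lik, batch_size, num_samples, rng):
    """Minibatch Monte Carlo estimate of sum_i E_q[log p(y_i | f_i^L)]."""
    N = len(X)
    # Second source of stochasticity: a random minibatch for the gradient step.
    idx = rng.choice(N, size=batch_size, replace=False)
    total = 0.0
    for _ in range(num_samples):
        # First source of stochasticity: Monte Carlo samples through the layers.
        f_last = propagate(X[idx], layers, rng)
        total += np.sum(log_lik(y[idx], f_last))
    return (N / batch_size) * total / num_samples

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 1))
y = np.sin(X)
# Toy stand-ins for the layer marginals (assumptions, not GP posteriors).
toy_layers = [lambda h: (np.tanh(h), 0.1 * np.ones_like(h))] * 2
gauss = lambda y, f: -0.5 * np.log(2 * np.pi * 0.1) - 0.5 * (y - f) ** 2 / 0.1
estimate = elbo_likelihood_term(X, y, toy_layers, gauss, batch_size=32,
                                num_samples=5, rng=rng)
```

Because each draw is a deterministic function of the layer parameters and standard Normal noise, gradients of the estimate can be propagated through the samples.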

Figure 2 visualises representations of CIFAR-10 images over the deep convolutional GP model. Figure 3 visualises the patch and filter spaces of the three layers, indicating high overlap. Finally, Figure 4 shows example filters $\mathbf{z}$ learned on the CIFAR-10 dataset, which extract image features.

Experiments

We evaluate our approach on the standard image classification benchmarks MNIST and CIFAR-10 (Krizhevsky and Hinton, 2009), which have standard training and test folds to facilitate direct performance comparisons. MNIST contains 60,000 training examples of $28\times 28$ grayscale images of 10 hand-drawn digits, with a separate validation set of 10,000 images. CIFAR-10 contains 50,000 training examples of RGB colour images of size $32\times 32$ from 10 classes, with 5,000 images per class. The images represent objects such as airplanes, cats or horses. There is a separate validation set of 10,000 images. We preprocess the images to have zero mean and unit variance along the colour channel.
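One plausible reading of this preprocessing step is a per-channel standardisation, sketched below. The function name, the `(N, H, W, C)` array layout, the small constant `eps` and the choice to compute the statistics over the whole array are assumptions; the text does not specify these details.

```python
import numpy as np

def standardise_per_channel(images, eps=1e-8):
    """Zero mean, unit variance per colour channel for images of shape (N, H, W, C)."""
    mean = images.mean(axis=(0, 1, 2), keepdims=True)
    std = images.std(axis=(0, 1, 2), keepdims=True)
    return (images - mean) / (std + eps)

rng = np.random.default_rng(0)
batch = rng.uniform(0.0, 255.0, size=(16, 32, 32, 3))   # synthetic stand-in for CIFAR-10
normalised = standardise_per_channel(batch)
```

In practice the mean and standard deviation would typically be computed on the training set and reused for the held-out images.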

We compare our model primarily against the original shallow convolutional Gaussian process (Van der Wilk et al., 2017), which is currently the only image classifier based on a convolutional Gaussian process. We also consider the performance of the hybrid neural network GP approach of Wilson et al. (2016a). For completeness, we report the performance of a state-of-the-art CNN method, DenseNet (Huang et al., 2017).

Our TensorFlow (Abadi et al., 2016) implementation is compatible with the GPflow framework (Matthews et al., 2017) and freely available online at https://github.com/kekeblom/DeepCGP. We leverage GPU-accelerated computation and 64-bit floating point precision, and employ a minibatch size of 32. We start the Adam learning rate at $0.01$ and multiply it by $0.1$ every 100,000 optimization steps until the learning rate reaches $10^{-5}$. We use $M=384$ inducing points at each layer. We set a stride of $2$ for the first layer and $1$ for all other layers. The convolutional filter size is $5\times 5$ on all layers except for the first layer on CIFAR-10, where it is $4\times 4$ so that a stride of 2 makes use of all the image pixels.
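The learning rate schedule described above can be written as a small function of the optimization step. This is a framework-independent sketch with a hypothetical function name, not the code of the released implementation.

```python
def learning_rate(step, initial=0.01, decay=0.1, decay_every=100_000, floor=1e-5):
    """Step-wise exponential decay, clipped from below at `floor`."""
    return max(initial * decay ** (step // decay_every), floor)

# Learning rates at the start of each 100,000-step stage.
schedule = [learning_rate(s) for s in (0, 100_000, 200_000, 300_000, 400_000)]
```

The rate is held at $0.01$ for the first 100,000 steps, drops by a factor of 10 at each subsequent stage, and stays at $10^{-5}$ from step 300,000 onwards.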

Inducing points $\mathbf{Z}$ are initialized by running $k$-means with $M$ clusters on image patches from the training set. The variational means $\mathbf{m}$ are initialised to zero. The variational covariances $\mathbf{S}$ are initialised to a tiny-variance kernel prior $10^{-5}\cdot K_{\mathbf{Z}\mathbf{Z}}$ following Salimbeni and Deisenroth (2017), except for the last layer where we use $K_{\mathbf{Z}\mathbf{Z}}$. For models deeper than two layers, we employ iterative optimisation where the first $L-2$ layers and layer $L$ are initialised to the learned values of an $L-1$ layer model, while the one additional layer added before the classification layer is initialised to default values.
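The patch-based initialisation of the inducing points can be sketched as follows. The plain Lloyd's algorithm, the synthetic images and the small value of $M$ are assumptions made to keep the example self-contained; any $k$-means implementation could be substituted.

```python
import numpy as np

def extract_patches(images, patch, stride=1):
    """Collect flattened patch x patch x C windows from images of shape (N, H, W, C)."""
    N, H, W, C = images.shape
    rows = [
        images[n, i:i + patch, j:j + patch, :].ravel()
        for n in range(N)
        for i in range(0, H - patch + 1, stride)
        for j in range(0, W - patch + 1, stride)
    ]
    return np.stack(rows)

def kmeans(data, k, num_iters=20, seed=0):
    """Plain Lloyd's algorithm; returns the k cluster centres."""
    rng = np.random.default_rng(seed)
    centres = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(num_iters):
        sq = ((data[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        assign = sq.argmin(axis=1)
        for c in range(k):
            members = data[assign == c]
            if len(members) > 0:      # keep the old centre if a cluster empties
                centres[c] = members.mean(axis=0)
    return centres

rng = np.random.default_rng(1)
images = rng.standard_normal((8, 12, 12, 3))   # synthetic stand-in for training images
patches = extract_patches(images, patch=5, stride=2)
M = 16                                          # the experiments use M = 384
Z = kmeans(patches, M)                          # inducing patches, shape (M, 5 * 5 * 3)
m = np.zeros(M)                                 # variational means initialised to zero
```

Initialising $\mathbf{Z}$ at cluster centres places the inducing patches in regions of patch space that the training images actually occupy.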

1 MNIST and CIFAR-10 results

Table 1 shows the classification accuracy on MNIST and CIFAR-10. Adding a convolutional layer to the weighted convolutional kernel GP improves performance on CIFAR-10 from 58.65% to 73.85%. Adding another convolutional layer further improves the accuracy to 75.9%. On MNIST the performance increases from $1.42\%$ error to $0.56\%$ error with the three-layer deep convolutional GP.

The deep kernel learning method uses a fully connected five-layer DNN instead of a CNN, and performs similarly to our model, but with many more parameters.

Figure 5 shows a single sample for 10 image class examples (rows) over the 10 patch response channels (columns) for the first layer (panel a) and second layer (panel b). The first layer indicates various edge detectors, while the second layer samples show the complexity of pattern extraction. The row object classes map to different kinds of representations, as expected.

Figure 6 shows the effect of different channel numbers on a two-layer model. The ELBO increases up to $C=16$ response channels, but starts to decrease with $C=32$ channels. A model with approximately $C=10$ channels shows the best performance.

Conclusions

We presented a new type of deep Gaussian process with convolutional structure. The convolutional GP layers gradually linearize the data using multiple filters with nonlinear kernel functions. Our model greatly improves test results on the considered classification benchmarks compared to other GP-based approaches, and approaches the performance of hybrid neural-GP methods. The performance of our model seems to improve as more layers are added.

We did not experiment with using a stride of 1 at the first layer, nor did we try models with 4 or more layers. The added complexity comes with an increased computational cost, which prevented us from experimenting with these improvements. We believe that both of these enhancements would increase performance.

There are several avenues for improved efficiency and modelling capacity. The Stochastic Gradient Hamiltonian Monte Carlo approach (Ma et al., 2015) has proven efficient in deep GPs (Havasi et al., 2018) and in GANs (Saatci and Wilson, 2017). Another avenue for improvement lies in kernel interpolation techniques (Wilson and Nickisch, 2015; Evans and Nair, 2018) which would make inference and prediction faster. We leave these directions for future work.

We thank Michael Riis Andersen for his invaluable comments and helpful suggestions.