Variational Auto-encoded Deep Gaussian Processes

Zhenwen Dai, Andreas Damianou, Javier González, Neil Lawrence

Introduction

Probabilistic directed generative models are flexible tools that have recently captured the attention of the the Deep Learning community (Uria et al., 2014; Mnih & Gregor, 2014; Kingma & Welling, 2013; Rezende et al., 2014). These models have the ability to produce samples able to mimic the learned data and they allow principled assessment of the uncertainty in the predictions. These properties are crucial to successfully addressing challenges such as uncertainty quantification or data imputation and allow the ideas of deep learning to be extended to related machine learning fields such as probabilistic numerics.

The main challenge is that exact inference on directed nonlinear probabilistic models is typically intractable due to the required marginalisation of the latent components. This has lead to the development of probabilistic generative models based on neural networks (Kingma & Welling, 2013; Mnih & Gregor, 2014; Rezende et al., 2014), in which probabilistic distributions are defined for the input and output of individual layers. Efficient approximated inference methods have been developed in this context based on stochastic variational inference or stochastic back-propagation. However, a question that remains open is how to properly regularize the model parameters. Techniques such as dropout have been used to avoid over-fitting (Hinton et al., 2012). Alternatively, Bayesian inference offers a mathematically grounded framework for regularization. Blundell et al. (2015) show that Bayesian (variational) inference outperforms dropout. Kingma et al. (2015); Gal & Ghahramani (2015) have shown that dropout itself can be reformulated in the variational inference context.

In this work, we develop a new scalable Bayesian non-parametric generative model. We focus on a deep Gaussian processes (DGP) that we augment by means of a recognition model, a multi-layer perceptron (MLP) between the latent representation of layers of the DGP. This allows us to simplify the inference and to avoid the challenge of initializing variational parameters. In addition, although DGP have been used only in small scale data so far, we show how it is possible to scale these models by means of new formulation of the lower bound that allows to distribute most of the computation.

The main contributions of this work are: i) a novel extension to DGPs by means of a recognition model that we call Variational Auto-Encoded deep Gaussian process (VAE-DGP), ii) a derivation of the distributed variational lower bound of the model and iii) a demonstration of the utility of the model on several mainstream deep learning datasets.

Deep Gaussian Processes

Gaussian processes provide flexible, non-parametric, probabilistic approaches to function estimation. However, their tractability comes at a price: they can only represent a restricted class of functions. Indeed, even though sophisticated definitions and combinations of covariance functions can lead to powerful models (Durrande et al., 2011; Gönen & Alpaydin, 2011; Hensman et al., 2013; Duvenaud et al., 2013; Wilson & Adams, 2013), the assumption about joint normal distribution of instantiations of the latent function remains; this limits the applicability of the models. One line of recent research to address this limitation focused on function composition (Snelson et al., 2004; Calandra et al., 2014). Inspired by deep neural networks, a deep Gaussian process instead employs process composition (Lawrence & Moore, 2007; Damianou et al., 2011; Lázaro-Gredilla, 2012; Damianou & Lawrence, 2013; Hensman & Lawrence, 2014).

where the functions flf_{l} are drawn from Gaussian processes with covariance functions klk_{l}, i.e. fl(x)GP(0,kl(x,x))f_{l}(x)\sim\mathcal{GP}(0,k_{l}(x,x^{\prime})). In the unsupervised case, the top hidden layer is assigned a unit Gaussian as a fairly uninformative prior which also provides soft regularization, i.e. XLN(0,I){\bf{X}}_{L}\sim\mathcal{N}(0,\mathbf{I}). In the supervised learning scenario, the inputs of the top hidden layer is observed and govern its hidden outputs.

The expressive power of a deep GP is significantly greater than that of a standard GP, because the successive warping of latent variables through the hierarchy allows for modeling non-stationarities and sophisticated, non-parametric functional “features” (see Figure 2). Similarly to how a GP is the limit of an infinitely wide neural network, a deep GP is the limit where the parametric function composition of a deep neural network turns into a process composition. Specifically, a deep neural network can be written as:

where W,U\mathbf{W},\mathbf{U} and V\mathbf{V} are parameter matrices and ϕ()\boldsymbol{\phi}(\cdot) denotes an activation function. By non-parametrically treating the stacked function composition g(h)=Vϕ(Uh)g(\mathbf{h})=\mathbf{V}^{\top}\boldsymbol{\phi}(\mathbf{U}\mathbf{h}) as process composition we obtain the deep GP definition of Equation 2.

In a standard GP model, inference is performed by analytically integrating out the latent function ff. In the DGP case, the latent variables have to additionally be integrated out, to obtain the marginal likelihood of DGPs over the observed data:

The above marginal likelihood and the following derivation aims at unsupervised learning problems, however, it is straight-forward to extend the formulation to supervised scenario by assuming observed XL{\bf{X}}_{L}. Bayesian inference in DGPs involves optimizing the model hyper-parameters with respect to the marginal likelihood and inferring the posterior distributions of latent variables for training/testing data. The exact inference of DGPs is intractable due to the intractable integral in (4). Approximated inference techniques such as variational inference and EP have been developed (Damianou & Lawrence, 2013; Bui et al., 2015). By taking a variational approach, i.e. assuming a variational posterior distribution of latent variables, q({Xl}l=1L)=l=1Lq(Xl)q(\{{\bf{X}}_{l}\}_{l=1}^{L})=\prod_{l=1}^{L}q({\bf{X}}_{l}), a lower bound of the log marginal distribution can be derived as

where F1=logp(YX1)q(X1)\mathcal{F}_{1}=\left\langle\log p(\mathbf{{Y}}|{\bf{X}}_{1})\right\rangle_{q({\bf{X}}_{1})} and Fl=logp(Xl1Xl)q(Xl1)q(Xl),l=2L\mathcal{F}_{l}=\left\langle\log p({\bf{X}}_{l-1}|{\bf{X}}_{l})\right\rangle_{q({\bf{X}}_{l-1})q({\bf{X}}_{l})},l=2\ldots L, are known as free energy for individual layers. H(q(Xl))H(q({\bf{X}}_{l})) denotes the entropy of the variational distribution q(Xl)q({\bf{X}}_{l}) and KL(q(XL)p(XL))\text{KL}\left(q({\bf{X}}_{L})\,\|\,p({\bf{X}}_{L})\right) denotes the Kullback-Leibler divergence between q(XL)q({\bf{X}}_{L}) and p(XL)p({\bf{X}}_{L}). According to the model definition, both p(YX1)p(\mathbf{{Y}}|{\bf{X}}_{1}) and p(Xl1Xl)p({\bf{X}}_{l-1}|{\bf{X}}_{l}) are Gaussian processes. The variational distribution of Xl{\bf{X}}_{l} is typically parameterized as a Gaussian distribution q(Xl)=n=1NN(xl(n)μl(n),Σl(n))q({\bf{X}}_{l})=\prod_{n=1}^{N}\mathcal{N}({\bf x}^{(n)}_{l}|\boldsymbol{\mu}^{(n)}_{l},\mathbf{{\Sigma}}^{(n)}_{l}).

Variational Auto-Encoded Model

Damianou & Lawrence (2013) provides a tractable variational inference method for DGP by deriving a closed-form lower bound of the marginal likelihood. While successfully demonstrating strengths of DGP, the experiments that they show are limited to very small scales (hundreds of datapoints). The limitation on scalability is mostly due to the computational expensive covariance matrix inversion and the large number of variational parameters (growing linearly with the size of data).

To scale up DGP to handle large datasets, we propose a new deep generative model, by augmenting DGP with a variationally auto-encoded inference mechanism. We refer to this inference mechanism as a recognition model (see Figure 3). A recognition model provides us with a mechanism for constraining the variational posterior distributions of latent variables. Instead of representing variational posteriors as individual variational parameters, which become a big burden to optimization, we define them as a transformation of observed data. This allows us to reduce the number of parameters for optimization (which no longer grow linearly with the size of data) and to perform fast inference at test time. A similar constraint mechanism has been referred to as a “back-constraint” in the GP literature. Lawrence & Quiñonero Candela (2006) constrained the latent inputs of a GP with a parametric model to enforce local distance preservation in the inputs; Ek et al. (2008) followed the same approach for constraining the latent space with information from additional views of the data. Our formulation differs from the above in that we rather constrain a whole latent posterior distribution through the variational parameters. Damianou & Lawrence (2015) also constrained the posterior, but this was achieved using a direct specific parameterization for that distribution, making this back-constraint grow with the number of inputs. Another difference to the previous approaches is that we consider deep hierarchies of latent spaces and, consequently, of recognition models. Our constraint mechanism is more similar to that of other variationally auto-encoded models, such as (Salakhutdinov & Hinton, 2008; Snoek et al., 2012a; Kingma & Welling, 2013; Mnih & Gregor, 2014; Rezende et al., 2014). The main differences with our work is that are that we have a Bayesian non-parametric generative model and a closed-form variational lower bound. This enables us to be Bayesian when inferring the generative distribution and avoids sampling from variational posterior distributions.

Specifically, for the observed layer, the posterior mean of the variational distribution is defined as a transformation of the observed data:

where the transformation function g1g_{1} is parameterized by a multi-layer perceptron (MLP). Similarly, for the hidden layers, the posterior mean is defined as a transformation of the posterior mean from the lower layer:

Note that all the transformation functions are deterministic, therefore, the posterior mean of all the hidden layers can be viewed as direct transformations of the observed data, i.e. μl(n)=gl(g1(y(n)))\boldsymbol{\mu}^{(n)}_{l}=g_{l}(\ldots g_{1}(\mathbf{y}^{(n)})). We use the hyperbolic tangent activation function for all the MLPs. The posterior variances Σl(n)\mathbf{{\Sigma}}^{(n)}_{l} are assumed to be diagonal and the same across all the datapoints.

The closed-form variational lower bound allows us to apply sophisticated gradient optimization methods such as L-BFGS. It avoids the problem of initializing and optimizing a large number of variational parameters. The initialization of variational parameters are converted into the initialization of neural network parameters, which has been well studied in deep learning literature. Furthermore, with the reparameterization, the variational parameters are moved coherently during optimization through the changes of neural network mapping. This helps the model avoid local optima and approach better solutions. Figure 4(a) shows an example of the learned 2D latent space of one layer (shallow) DGP and VAE-DGP from the same initialization. Clearly, the recognition model in VAE-DGP helps move the datapoints to a better solution. Note that the recognition model serves as a (deterministic) reparameterization of variational parameters. Therefore, the parameters of MLP are the variational parameters of our model. As automatically “regularized” by Bayesian inference, a overly complicated cognition model will not cause the generative model to overfit. This allows us to freely choose a powerful enough recognition model (see Fig. 4(b) for an exampleNote that the shown training and test log-likelihood are not directly comparable. The shown train log-likelihood is the lower bound in Equation 5 divided by the size of data. The shown test log-likelihood is an approximation: 1N(L(Y,Y)L(Y))\frac{1}{N_{*}}(\mathcal{L}(\mathbf{{Y}}^{*},\mathbf{{Y}})-\mathcal{L}(\mathbf{{Y}})).).

Computationally, the recognition model re-parameterization resolves the linear growing of the number of variational parameters with respect to the size of data. Based on this formulation, we develop a distributed variational inference approach, which is described in detail in the following section.

Distributed Variational Inference

where each row of U1\mathbf{{U}}_{1} represents an inducing variable which is associated with the inducing input at the same row of Z1\mathbf{{Z}}_{1}. Assuming a particular form of the variational distribution of F1\mathbf{{F}}_{1} and U1\mathbf{{U}}_{1}: q(F1,U1X1)=p(F1U1,X1)q(U1)q(\mathbf{{F}}_{1},\mathbf{{U}}_{1}|{\bf{X}}_{1})=p(\mathbf{{F}}_{1}|\mathbf{{U}}_{1},{\bf{X}}_{1})q(\mathbf{{U}}_{1}), the free energy of the observed layer can be lower bounded by

As shown by Titsias & Lawrence (2010), this lower bound can be formulated in closed-form for kernels like linear, exponentiated quadratic. For other kernels, it can be computed approximately by using the techniques such as Gaussian quadrature. Note that the optimal value of q(U1)q(\mathbf{{U}}_{1}) can be derived in closed-form by setting its gradient L/q(U1)\partial\mathcal{L}/\partial q(\mathbf{{U}}_{1}) to zero, therefore, the only variational parameters that we need to optimize for the observed layer are q(X1)q({\bf{X}}_{1}) and Z1\mathbf{{Z}}_{1}.

For the hidden layers, the variational posterior distributions are slightly different, because the posterior of inducing variables depend on the output variable of that layer. For the ll-th hidden layer, the variational posterior distribution is, therefore, defined as q(Fl,UlXl1,Xl)=p(FlUl,Xl)q(UlXl1)q(\mathbf{{F}}_{l},\mathbf{{U}}_{l}|{\bf{X}}_{l-1},{\bf{X}}_{l})=p(\mathbf{{F}}_{l}|\mathbf{{U}}_{l},{\bf{X}}_{l})q(\mathbf{{U}}_{l}|{\bf{X}}_{l-1}). Similar to the observed layer, a lower bound of the free energy can be derived as:

With Equation (5) and (9-10), a closed-form variational lower bound of the log marginal likelihood is defined.

The computation of the lower bounds of free energy terms is expensive. This limits the scalability of the original DGP. Fortunately, with the introduced auxiliary variables and the recognition model, most of the computation is distributable in a data-parallelism fashion. We exploit this fact and derive a distributed formulation of the lower bound. This allows us to scale up our inference method to large data. Specifically, the lower bound of the free energy consists of a few terms (explained below) that depend on the size of data: \mboxTr(YY)\mbox{Tr}(\mathbf{{Y}}^{\top}\mathbf{{Y}}), \mboxTr(Λ11Ψ1YYΨ1)\mbox{Tr}(\boldsymbol{\Lambda}_{1}^{-1}\boldsymbol{\Psi}_{1}^{\top}\mathbf{{Y}}\mathbf{{Y}}^{\top}\boldsymbol{\Psi}_{1}), ψ1\psi_{1} and Φ1\boldsymbol{\Phi}_{1}. All of them can be formulated as a sum of intermediate results from individual datapoints:

where KF1F1\mathbf{{K}}_{\mathbf{{F}}_{1}\mathbf{{F}}_{1}}, KU1U1\mathbf{{K}}_{\mathbf{{U}}_{1}\mathbf{{U}}_{1}} are the covariance matrices of F1\mathbf{{F}}_{1} and U1\mathbf{{U}}_{1} respectively, KF1U1\mathbf{{K}}_{\mathbf{{F}}_{1}\mathbf{{U}}_{1}} is the cross-covariance matrix between F1\mathbf{{F}}_{1} and U1\mathbf{{U}}_{1}, and ψ1=\mboxTr(KF1F1q(X1))\psi_{1}=\mbox{Tr}(\left\langle\mathbf{{K}}_{\mathbf{{F}}_{1}\mathbf{{F}}_{1}}\right\rangle_{q({\bf{X}}_{1})}), Ψ1=KF1U1q(X1)\boldsymbol{\Psi}_{1}=\left\langle\mathbf{{K}}_{\mathbf{{F}}_{1}\mathbf{{U}}_{1}}\right\rangle_{q({\bf{X}}_{1})} and Φ1=KF1U1KF1U1q(X1)\boldsymbol{\Phi}_{1}=\left\langle\mathbf{{K}}_{\mathbf{{F}}_{1}\mathbf{{U}}_{1}}^{\top}\mathbf{{K}}_{\mathbf{{F}}_{1}\mathbf{{U}}_{1}}\right\rangle_{q({\bf{X}}_{1})}, and Λ1=KU1U1+Φ1\boldsymbol{\Lambda}_{1}=\mathbf{{K}}_{\mathbf{{U}}_{1}\mathbf{{U}}_{1}}+\boldsymbol{\Phi}_{1}. This enables data-parallelism by distributing the computation that depends on individual datapoints and only collecting the intermediate results that do not scale with the size of data. Gal et al. (2014) and Dai et al. (2014) exploit a similar formulation for distributing the computation of BGPLVM, however, in their formulations, the gradients of variational parameters that depend on individual datapoints have to be collected centrally. Such collection severely limits the scalability of the model.

For hidden layers, the free energy terms are slightly different. Their data-dependent terms additionally involve the expectation with respect to the variational distribution of output variables: \mboxTr(Xl1Xl1q(Xl1))\mbox{Tr}(\left\langle{\bf{X}}_{l-1}^{\top}{\bf{X}}_{l-1}\right\rangle_{q({\bf{X}}_{l-1})}), \mboxTr(Λl1ΨlXl1Xl1q(Xl1)Ψl)\mbox{Tr}(\boldsymbol{\Lambda}_{l}^{-1}\boldsymbol{\Psi}_{l}^{\top}\left\langle{\bf{X}}_{l-1}{\bf{X}}_{l-1}^{\top}\right\rangle_{q({\bf{X}}_{l-1})}\boldsymbol{\Psi}_{l}), ψl\psi_{l} and Φl\boldsymbol{\Phi}_{l}. The first term can be naturally reformulated as a sum across datapoints:

For the second term, we can rewrite Xl1Xl1q(Xl1)=Rl1Rl1+Al1Al1\left\langle{\bf{X}}_{l-1}{\bf{X}}_{l-1}^{\top}\right\rangle_{q({\bf{X}}_{l-1})}=\mathbf{R}_{l-1}^{\top}\mathbf{R}_{l-1}+\mathbf{A}_{l-1}\mathbf{A}_{l-1}, where Rl1=[(μ(l1)(1)),,(μ(l1)(N))]\mathbf{R}_{l-1}=[(\boldsymbol{\mu}^{(1)}_{(l-1)})^{\top},\ldots,(\boldsymbol{\mu}^{(N)}_{(l-1)})^{\top}], Al1=diag(αl1(1),,αl1(N))\mathbf{A}_{l-1}=\text{diag}(\alpha^{(1)}_{l-1},\ldots,\alpha^{(N)}_{l-1}) and αl1(n)=\mboxTr(Σl1(n))12\alpha^{(n)}_{l-1}=\mbox{Tr}(\mathbf{{\Sigma}}^{(n)}_{l-1})^{\frac{1}{2}}. This enables us to formulate it into a distributable form:

With the above formulations, we obtain distributable a variational lower bound. For optimization, the gradients of all the model and variational parameters can be derived with respect to the lower bound. As the variational distributions q({Xl}l=1L)q(\{{\bf{X}}_{l}\}_{l=1}^{L}) are computed according to the recognition model, the gradients of q({Xl}l=1L)q(\{{\bf{X}}_{l}\}_{l=1}^{L}) are back-propagated (through the recognition model), which allows to compute the gradients of its the parameters.

Experiments

As a probabilistic generative model, VAE-DGP is applicable to a range of different tasks such as data generation, data imputation, etc. In this section we evaluate our model in a variety of problems and compare it with the alternatives in the in the literature.

We first apply to our model to the combination of Frey faces and Yale faces (Frey-Yale). The Frey faces contains 1956 20×2820\times 28 frames taken from a video clip. The Yale faces contains 2414 images, which are resized to 20×2820\times 28. We take the last 200 frames from the Frey faces and 300 images randomly from Yale faces as the test set and use the rest for training. The intensity of the original gray-scale images are normalized to $$. The applied VAE-DGP has two hidden layers (a 2D top hidden layer and a 20D middle hidden layer). The exponentiated quadratic kernel is used for all the layers with 100 inducing points. All the MLPs in the recognition model have two hidden layers with widths (500-300). As a generative model, we can draw samples from the learned model by sampling first from the prior distribution of the top hidden layer (a 2D unit Gaussian distribution in this case) and layer-wise downwards. The generated images are shown in Figure 5(a).

To evaluate the ability of our model learning the data distribution, we train the VAE-DGP on MNIST (LeCun et al., 1998). We use the whole training set for learning, which consists of 60,000 28×2828\times 28 images. The intensity of the original gray-scale images are normalized to $.Wetrainourmodelwiththreedifferentmodelsettings(one,twoandthreehiddenlayers).ThetrainedmodelsareevaluatedbytheloglikelihoodofthetestsetAsanonparametricmodel,thetestloglikelihoodofVAEDGPisformulatedas. We train our model with three different model settings (one, two and three hidden layers). The trained models are evaluated by the log-likelihood of the test setAs a non-parametric model, the test log-likelihood of VAE-DGP is formulated as\frac{1}{N_{*}}\log p(\mathbf{{Y}}^{*}|\mathbf{{Y}}),where, where\mathbf{{Y}}^{*}isthetestdataandis the test data and\mathbf{{Y}}isthetrainingdata.Asthetruetestloglikelihoodisintractable,weapproximateitasis the training data. As the true test log-likelihood is intractable, we approximate it as\frac{1}{N_{*}}(\mathcal{L}(\mathbf{{Y}}^{*},\mathbf{{Y}})-\mathcal{L}(\mathbf{{Y}}))$., which consists of 10,000 images. The results are shown in Table 1 along with some baseline performances taken from the literature. The numbers in the parenthesis indicate the dimensionality of hidden layers from top to bottom. The exponentiated quadratic kernel are used for all the layers with 300 inducing points. All the MLPs in the recognition model has two hidden layers with width (500-300). All our models are trained as a whole from randomly initialized recognition model.

2 Data Imputation

We demonstrate the model’s ability to impute missing data by showing half of images on the test set. We use the learned VAE-DGP to impute the other half of the images. this is challenging problem because there might be ambiguities in the answers. For instance, by showing the right half of a digit “8”, the answers “3” and “8” are both reasonable. We show the imputation performance for the test images in Frey-Yale and MNIST in Fig. 5(b) and Fig. 6 respectively. We also apply VAE-DGP to the street view house number dataset (SVHN) (Netzer et al., 2011). We use three hidden layers with the dimensionality of latent space from top to bottom (5-30-500). The top two hidden layers use the exponentiated quadratic kernel and the observed layer uses the linear kernel with 500 inducing points. The learned model is used for imputing the images in the test set (see Fig. 5(c)).

3 Supervised Learning and Bayesian Optimization

In this section we consider two supervised learning problem instances: regression and Bayesian optimization (BO) (Osborne, 2010; Snoek et al., 2012b). We demonstrate the utility of VEA-DGP in these settings by evaluating its performance in terms of predictive accuracy and predictive uncertainty quantification. For these experiments we use a VEA-DGP with one hidden layer (and one observed inputs layer) and exponentiated quadratic covariance functions. Furthermore, we incorporate the deep GP modification of Duvenaud et al. (2014) so that the observed input layer has an additional connection to the output layer. Duvenaud et al. (2014) showed that this modification increases the general stability of the method. Since the sample size of the data considered for supervised learning is relatively small, we do not use the recognition model to back-constrain the variational distributions.

In the regression experiments we use the Abalone dataset (41774177 11-dimensional outputs and 88-dimensional inputs) from UCI and the Creep dataset (20662066 11-dimensional outputs and 3030-dimensional inputs) from (Cole et al., 2000). A typical split for this data is to use 10001000 (Abalone) and 800800 (Creep) instaces for training. We used 100100 inducing inputs for each layer and performed 4 runs with different random splits. We summarize the results in Table 2.

Conclusion

We have proposed a new deep non-parametric generative model. Although general enough to be used in supervised and unsupervised problems, we especially highlighted its usefulness in the latter scenario, a case which is known to be a major challenge for current deep machine learning approaches. Our model is based on a deep Gaussian process, which we extended with a layer-wise parameterization through multilayer perceptrons, significantly simplifying optimization. Additionally, we developed a new formulation of the lower bound that allows for distributed computations. Overall, our approach is able to perform Bayesian inference using large datasets and compete with current alternatives. Future developments include the regularization of the perceptron weights, to reformulate the current setup for the context of multi-view problems and to incorporate convolutional structures into the objective function.

acknowledgement. The authors thank the financial support of RADIANT (EU FP7-HEALTH Project Ref 305626), BBSRC Project No BB/K011197/1 and WYSIWYD (EU FP7-ICT Project Ref 612139).

References