Black box variational inference for state space models

Evan Archer, Il Memming Park, Lars Buesing, John Cunningham, Liam Paninski

Introduction

Markov chain Monte Carlo sampling and particle filtering for general time-series models are well-developed but typically do not scale well to large-scale problems. Even when a model $p_{\theta}(\vx,\vz)$ , with parameters $\theta$ , has been trained, we can only access an analytically-intractable posterior through further sampling. Here, we take a variational approach to time-series modeling: rather than attempting to compute the posterior $p_{\theta}(\vz|\vx)$ of our generative model, we approximate it with a distribution $q_{\phi}(\vz|\vx)$ with variational parameters $\phi$ . Inference proceeds by simultaneously optimizing $p_{\theta}$ (through its model parameters $\theta$ ) and $q_{\phi}$ (through its variational parameters $\phi$ ) such that $q_{\phi}$ approximates the true posterior.

Our main contributions are (1) a structured approximate posterior that can express temporal dependencies, and (2) a fast and scalable inference algorithm. We propose a multivariate Gaussian approximate posterior with block tri-diagonal inverse covariance, and formulate an algorithm that scales (in both time and space complexity) only linearly in the length of the time-series. For inference, we make use of recent advances in stochastic gradient variational Bayes (SGVB) (Rezende et al., 2014; Kingma & Welling, 2013; Kingma et al., 2014) to learn an approximate posterior with a complex functional dependence upon the observations $\vx$ . Using this approach we are able to learn a neural network (NN) that maps $\vx$ into the smoothed posterior $q(\vz|\vx)$ (sometimes called the recognition model). This approach is also ‘black-box’ in the sense that the inference algorithm does not depend explicitly upon the functional form of the generative model $p_{\theta}$ .

Our motivations lie in the study of high-dimensional time-series, such as neural spike-train recordings (Kao et al., 2015). We seek to infer trajectories $\vz$ that provide insight into the latent, low-dimensional structure in the dynamics of such data. Recent, related approaches to variational inference in time-series models focus upon the design and learning of rich generative models capable of capturing the statistical structure of large, complex datasets (Gan et al., 2015; Chung et al., 2015). In contrast, our focus is upon computationally efficient inference in structured, interpretable parameterizations that build upon methods fundamental in scientific applications.

We apply our smoothing approach for approximate posterior inference of a well-studied generative model: the Poisson linear dynamical system (PLDS) model. We find that our general, black-box approach outperforms a specialized variational Bayes expectation maximization (VBEM) approach (Emtiyaz Khan et al., 2013) to inference in PLDS, reaching comparable solutions to VBEM before it can complete a single EM iteration. Additionally, we apply our method to inference in a one-dimensional, nonlinear dynamical system, showing that we are able to accurately recover nonlinear relationships in the posterior mean.

Stochastic gradient variational Bayes

In variational inference we approximate an intractable posterior distribution $p_{\theta}(\vz|\vx)=p_{\theta}(\vx,\vz)/p_{\theta}(\vx)$ with $q_{\phi}(\vz|\vx)$ In many approaches to variational inference the dependence of upon $\vx$ is dropped; in our case, the parameterization of $q_{\phi}(\vz|\vx)$ may depend explicitly upon the observations $\vx$ . that comes from a tractable class (e.g., the Gaussian family) and is parameterized by variational parameters $\phi$ . We learn $\phi$ and $\theta$ together by optimizing the evidence lower bound (ELBO) of the marginal likelihood (Jordan et al., 1999), given by,

The quantity $\mathcal{L}(\theta,\phi;\vx)$ is the ELBO, and $H(q_{\phi}(\vz|\vx))$ is the entropy of the approximating posterior. Our goal is to differentiate $\mathcal{L}(\theta,\phi)$ with respect to $\phi$ and $\theta$ so as to maximize $\mathcal{L}$ ,

For the remainder of this section, we use the notation $f_{\left\{\theta,\phi\right\}}(\vz)=-\log q_{\phi}(\vz|\vx)+\log p_{\theta}(\vx,\vz)$ . Typically at least some terms of eq. 3 cannot be integrated in closed form. While it is often possible to estimate the gradient by sampling directly from $q(\vz|\vx)$ , in general the approximate gradient exhibits high variance (Paisley et al., 2012). One approach to addressing this difficulty, independently proposed by Kingma & Welling (2013), Rezende et al. (2014) and Titsias & Lázaro-Gredilla (2014), is to compute the integral using the “reparameterization trick”: choose an easy-to-sample random variable $\epsilon$ with distribution $p(\epsilon)$ and parameterize $\vz$ through a function $g$ of observations $\vx$ and parameters $\phi$ ,

The point of this notation is to make clear that $g_{\phi}(\vx,\cdot)$ is a deterministic function: all randomness in $q$ comes from the random variable $\epsilon$ . This allows us to approximate the gradient using the simple estimator,

where $\epsilon^{l}$ are iid samples from $p(\epsilon)$ . In Kingma & Welling (2013) this estimator is referred to as the Stochastic Gradient Variational Bayes (SGVB) estimator. Empirically, eq. 5 has much lower variance than previous sampling-based approaches to the estimation of eq. 3 (Kingma & Welling, 2013; Titsias & Lázaro-Gredilla, 2014).

An important property of eq. 5 is that it does not depend upon the particular form of $f_{\left\{\theta,\phi\right\}}(\cdot)$ : we need only be able to evaluate it at the samples $\epsilon^{i}$ . It is in this sense that our approach is “black-box”: in principle, inference works the same way regardless of our choice of generative model $p_{\theta}$ . In practice, of course, different modeling choices will affect the computation time and the convergence rate of the method.

The estimator also permits significant freedom in our parameterization of the transformation $g_{\phi}(\vx,\cdot)$ : for inference, we just need to be able to differentiate $g$ with respect to $\phi$ . While it is possible to use the SGVB approach with a separate set of parameters $\phi_{t}$ for each observation (as in Hoffman et al. (2013), for instance), much recent work has used deep neural networks (DNNs) to train a function that maps directly into the posterior (Rezende et al., 2014; Kingma & Welling, 2013; Kingma et al., 2014). Under this approach, with a trained $g_{\phi}(\vx,\cdot)$ , no additional gradient steps are needed to obtain $q(\vz|\vx)$ for new observations $\vx$ .

Variational approach to state-space modeling

One common, convenient choice of approximate posterior is the multivariate normal. In the notation of Section 2, the multivariate normal comes about if we choose $\epsilon\sim\mathcal{N}(0,I)$ and take $g_{\phi}(\vx,\cdot)$ to be an affine function (Titsias & Lázaro-Gredilla, 2014). We can then express a sample $\vz\sim q(\vz|\vx)$ as,

It is easy to sample from a Gaussian approximate posterior using eq. 6, and the entropy term within eq. 2 has a closed form:

2 Smoothing Gaussian approximate posterior

For modeling time-series data, we seek an approximate posterior capable of expressing our strong expectation that the latent variables change smoothly over time. While the Gaussian approximate posterior of eq. 8 can represent arbitrary correlation structure, we propose a Gaussian approximate posterior whose parameterization scales only linearly in $T$ . To do so, we borrow from the toolkit of the standard Kalman filter. In an LDS model with Gaussian observations, the posterior is a multivariate Gaussian with a block tri-diagonal inverse covariance. This block-tridiagonal structure results from (and expresses) the conditional independence properties of the LDS prior.

To enable our approximate posterior to express the same correlation structure we parameterize the inverse covariance of eq. 8, ${\Sigma}^{-1}$ , to be block tri-diagonal. Our final posterior takes the form:

where $\mu_{\phi}(\vx)$ is the posterior mean and $R_{\phi}(\vx)$ is a lower block bi-diagonal matrix with $n\times n$ blocks A lower block bi-diagonal matrix has only non-zero diagonal and (first) lower-diagonal blocks..

Parameterization of the smoothing posterior

We can naturally parameterize $\mu_{\phi}(\vx)$ and $\Sigma_{\phi}(\vx)$ of eq. 10 using 3 neural networks. We use one neural network to represent a map $\vx_{t}\to\mu_{t}$ ,

where $\mu_{t}$ is a $n\times 1$ segment of $\mu$ , and $\mu=\left(\mu_{1},\mu_{2},\dots,\mu_{T}\right)$ . We can parameterize the block tri-diagonal covariance ${\Sigma_{\phi}(\vx)}^{-1}$ ,

by parmeterizing each of the blocks separately:

In practice, we found it necessary to enforce the positive-definiteness of the covariance by adding a diagonal matrix $\alpha I$ to ${\Sigma_{\phi}(\vx)}^{-1}$ , where $\alpha>0$ is a fixed constant. In the experiments, we refer to this parameterization as VILDSblk.

2 Product-of-Gaussians approximate posterior

We can also define the approximate posterior through a product of Gaussian factors, $q(\vz|\vx)\propto r_{1}(\vz|\vx)r_{0}(\vz)$ , where:

$\mD$ and $\mC$ are $nT\times nT$ matrices and $\mM$ is a $nT$ -dimensional vector. In this set-up, we can view $r_{0}$ as a prior. In terms of eq. 10, the final posterior is then given by:

In order to be a parameterization of the smoothing posterior, eq. 10, ${\mD}^{-1}$ and ${\mC}^{-1}$ must be block tri-diagonal. The multiplicative interaction between the posterior mean and covariance leads to different performance from the parameterization described in Section 4.1. Further, we can choose $\mD$ to initialize the means with a given degree of smoothness, which is not possible in the formulation of Section 4.1. In the experiments, we refer to this parameterization as VILDSmult; in Appendix A we describe the specific parameteriation we used for ${\mC}^{-1}$ and ${\mD}^{-1}$ .

Experiments

In the experiments, we refer to the SGVB with the smoothing approximate posterior as VILDS. We refer to SGVB with an approximate posterior independent across time as mean field (MF). The mean field posterior is given by,

where $V_{t}$ is a full $n\times n$ covariance matrix. We optimize all parameters by gradient ascent, using the SGVB approach with $L=1$ to estimate the gradient with eq. 5.

For training VILDS and MF, we performed gradient descent on all parameters $\left\{\theta,\phi\right\}$ of the generative model and approximate posterior. We tried several adaptive gradient stochastic optimization methods, including: ADAM (Kingma & Ba, 2014), Adadelta (Zeiler, 2012), Adagrad (Duchi et al., 2011) and RMSprop (Tieleman & Hinton, 2012). In the experiments we show here, we used Adadelta to learn all parameters. We gradually decreased the base learning rate by a factor of $10$ after a period of $20$ “epochs” without an increase of the objective function.

First, we illustrate the efficacy of our approach by showing that we can recover the analytic posterior of a Kalman filter model. Under a Kalman filter model, the latents are governed by an LDS,

with Gaussian innovation noise with covariance matrix $\mQ$ , $\bm{\epsilon}_{t}\sim\mathcal{N}(0,\mQ)$ . Observations are coupled to the latents through a loading matrix $\mC$ ,

and $\eta_{t}$ are Gaussian noise with diagonal covariance.

2 Poisson LDS (PLDS)

The Kalman filter/smoother is exact for an LDS with linear-Gaussian observations. A common generalization in the literature is an LDS with non-Gaussian observations. One well-studied example is the Poisson LDS (PLDS). Under this model, the latents are again governed by an LDS,

with Gaussian innovation noise with covariance matrix $\mQ$ , $\bm{\epsilon}_{t}\sim\mathcal{N}(0,\mQ)$ . Observations are modulated by a log-rate $\vr_{t}$ , which is coupled to the latent state $\vz_{t}$ via a loading matrix $\mC$ ,

With Poisson observations, the posterior does not have a closed form. Several methods have been proposed for approximate learning and inference in the special case of the PLDS (Buesing et al., 2014; 2012; Macke et al., 2011); Laplace approximation is also frequently used (Paninski et al., 2010; Fahrmeir & Kaufmann, 1991). We compare VILDS to the variational Bayes expectation-maximization approach proposed by Emtiyaz Khan et al. (2013). This VBEM approach uses a full, unconstrained Gaussian as a variational approximate posterior $q_{\phi}$ , and performs EM iterations through a dual-space parameterization. We refer to it by the abbreviation VBDual, to emphasize this dual-space parameterization. We parameterize both the MF and VILDS approximate posteriors using a 5-layer, dense NN for each of $\mu_{\phi}(\vx)$ and $R_{\phi}(\vx)$ . For VILDS, we use the parameterization of Section 4.2. We use a rectified-linear nonlinearity between each layer, followed by a linear output layer mapping into the parameterization. We simulated $T=5000$ samples from a PLDS model with $n=2$ latent states and $m=100$ observation dimensions. We initialized all three methods (VILDS, MF and VBDual) using the nuclear-norm minimization methods outlined in Pfau et al. (2013).

To better illustrate the timecourse of learning, each epoch consisted of only $100$ minibatches, where each minibatch was of size $100$ . A single gradient step was taken for each minibatch. We ran both MF and VILDS for a fixed 500 iterations. We find that VILDS reaches a higher ELBO value than either MF or VBDual, and does so before VBDual can complete a single expectation-maximization iteration (see Fig. 2). Further, the posterior means learned by VILDS are smoother than those learned using the MF approximate posterior (see Fig. 3).

3 Nonlinear dynamics simulation

VILDS can perform approximate posterior inference for nonlinear, non-Gaussian generative models. To illustrate, we simulated $5000$ samples from a toy one-dimensional nonlinear dynamical model given by:

where $\epsilon_{t}$ and $\eta_{t}$ are each iid $\mathcal{N}(0,1)$ random variables. For the approximate posterior, we parameterized both the $\mu$ and $R$ using 8-layer networks where each layer has only a single unit, and rectified-linear nonlinearity. We use the LDS-inspired parameterization of Section 4.2. As shown in Fig. 4, VILDS is capable of recovering the nonlinear relationship in the state space. For these experiments, we held the generative model parameters $\theta$ fixed and learned only $\phi$ .

Conclusion

We proposed a Gaussian variational approximate posterior with block tri-diagonal covariance structure capable of expressing “smoothed” trajectories of a time-series posterior. Exploiting the block tri-diagonal covariance structure, inference scales only linearly (in both time and space complexity) in the length $T$ of a time-series. Using the SGVB approach to variational inference, we can perform approximate inference for a wide class of latent variable generative models.

Despite the generality of the inference algorithm, the approach is limited by the Gaussian approximate posterior: most latent variable time-series generative models have non-Gaussian posteriors. One possible route forward are the methods of Rezende & Mohamed (2015) and Dinh et al. (2014), which permit learning and inference using a non-Gaussian approximate posterior within the SGVB framework.

We implemented all methods in Python using Theano with the Lasagne library (Bastien et al., 2012; Bergstra et al., 2010). We plan to release the source code on Github shortly.

As we were preparing this manuscript we became aware of Krishnan et al. (2015), which studies a closely-related (but distinct) method. In future work, we will plan to perform detailed comparisons between the methods.

Funding for this research was provided by DARPA N66001-15-C-4032, Google Faculty Research award, and ONR N00014-14-1-0243 (LP); Simons Global Brain Research Award 325171 (LP and JC); Sloan Research Fellowship (JPC). We thank David Carlson and Megan McKinney, Esq. for useful comments on the manuscript, and Gabriel Synnaeve for making his Python code publicly available.

References

Appendix A Smoothing approximate posterior with explicit forward/backward

The posterior mean and covariances may be computed by the standard Kalman forward-backward algorithm. To see this, we can write the posterior in matrix form as the product of two Gaussians. We have

where the positive-definite matrix $Q$ is a covariance matrix, analogous to the “innovation noise” in the standard Kalman filter, $n\times n$ matrix $A$ is a linear dynamics matrixFor stable dynamics, we assume that the eigenvalues of $A$ have magnitude less than one; in the examples we considered, we did not need to enforce this constraint..

We can then re-write the approximate posterior $q$ as the product $q(\vz|\vx)\propto r_{0}(\vz)r_{1}(\vz)$ . By the standard product-of-normal-densities identity, $q(\vz|\vx)$ also has a multivariate normal distribution $q=\mathcal{N}(\mu_{\phi}(\vx),\Sigma_{\phi}(\vx))$ , where $\mu$ and $\Sigma$ are given by eq. 19 (which we repeat here):

Computation proceeds just as in Section 3.2.1, except that now the computation of eq. 19 takes the form,

which may be computed efficiently by exploiting the block bi-diagonal structure of $R$ .