Auto-Encoding Sequential Monte Carlo

Tuan Anh Le, Maximilian Igl, Tom Rainforth, Tom Jin, Frank Wood

Introduction

We build upon aesmc (Le et al., 2017), a method for model learning that itself builds on variational auto-encoders (vaes) (Kingma & Welling, 2014; Rezende et al., 2014) and importance weighted auto-encoders (iwaes) (Burda et al., 2016). Aesmc is similarly based on maximizing a lower bound to the log marginal likelihood, but uses smc (Doucet & Johansen, 2009) as the underlying marginal likelihood estimator instead of importance sampling (is). For a very wide array of models, particularly those with sequential structure, smc forms a substantially more powerful inference method than is, typically returning lower variance estimates for the marginal likelihood. Consequently, by using smc for its marginal likelihood estimation, aesmc often leads to improvements in model learning compared with vaes and iwaes. We provide experiments on structured time-series data that show that aesmc based learning was able to learn useful representations of the latent space for both reconstruction and prediction more effectively than the iwae counterpart.

Aesmc was introduced in an earlier preprint (Le et al., 2017) concurrently with the closely related methods of Maddison et al. (2017); Naesseth et al. (2017). In this work we take these ideas further by providing new theoretical insights for the resulting evidence lower bounds (elbos), extending these to explore the relative efficiency of different approaches to proposal learning, and using our results to develop a new and improved training procedure. In particular, we introduce a method for expressing the gap between an elbo and the log marginal likelihood as a Kullback-Leibler (kl) divergence between two distributions on an extended sampling space. Doing so allows us to investigate the behavior of this family of algorithms when the objective is maximized perfectly, which occurs only if the kl divergence becomes zero. In the iwae case, this implies that the proposal distributions are equal to the posterior distributions under the learned model. In the aesmc case, it has implications for both the proposal distributions and the intermediate set of targets that are learned. We demonstrate that, somewhat counter-intuitively, using lower variance estimates for the marginal likelihood can actually be harmful to proposal learning. Using these insights, we experiment with an adaptation to the aesmc algorithm, which we call alternating elbos, that uses different lower bounds for updating the model parameters and proposal parameters. We observe that this adaptation can, in some cases, improve model learning and proposal adaptation.

Background

State-space models (ssms) are probabilistic models over a set of latent variables $x_{1:T}$ and observed variables $y_{1:T}$ . Given parameters $\theta$ , a ssm is characterized by an initial density $\mu_{\theta}(x_{1})$ , a series of transition densities $f_{t,\theta}(x_{t}\lvert x_{1:t-1})$ , and a series of emission densities $g_{t,\theta}(y_{t}\lvert x_{1:t})$ with the joint density being $p_{\theta}(x_{1:T},y_{1:T})=\mu_{\theta}(x_{1})\prod_{t=2}^{T}f_{t,\theta}(x_{t}\lvert x_{1:t-1})\prod_{t=1}^{T}g_{t,\theta}(y_{t}\lvert x_{1:t})$ .

2 Sequential Monte Carlo

smc performs approximate inference on a sequence of target distributions $(\pi_{t}(x_{1:t}))_{t=1}^{T}$ . In the context of ssms, the target distributions are often taken to be $(p_{\theta}(x_{1:t}\lvert y_{1:t}))_{t=1}^{T}$ . Given a parameter $\phi$ and proposal distributions $q_{1,\phi}(x_{1}\lvert y_{1})$ and $(q_{t,\phi}(x_{t}\lvert y_{1:t},x_{1:t-1}))_{t=2}^{T}$ from which we can sample and whose densities we can evaluate, smc is described in Algorithm 1.

The sequential nature of smc and the resampling step are crucial in making smc scalable to large $T$ . The former makes it easier to design efficient proposal distributions as each step need only target the next set of variables $x_{t}$ . The resampling step allows the algorithm to focus on promising particles in light of new observations, avoiding the exponential divergence between the weights of different samples that occurs for importance sampling as $T$ increases. This can be demonstrated both empirically and theoretically (Del Moral, 2004, Chapter 9). We refer the reader to (Doucet & Johansen, 2009) for an in-depth treatment of smc.

3 Importance Weighted Auto-Encoders

Note that for $K=1$ particle, this objective reduces to a vae (Kingma & Welling, 2014; Rezende et al., 2014) objective we will refer to as

The iwae optimization is performed using stochastic gradient ascent (sga) where a sample from $\left(\prod_{k=1}^{K}q_{\phi}(x^{k}\lvert y^{(n)})\right)$ is obtained using the reparameterization trick (Kingma & Welling, 2014) and the gradient $\frac{1}{N}\sum_{n=1}^{N}\nabla_{\theta,\phi}\log\left(\sum_{k=1}^{K}\frac{p_{\theta}(x^{k},y^{(n)})}{q_{\phi}(x^{k}\lvert y^{(n)})}\right)$ is used to perform an optimization step.

Auto-Encoding Sequential Monte Carlo

aesmc implements model learning, proposal adaptation, and inference amortization in a similar manner to the vae and the iwae: it uses sga on an empirical average of the elbo over observations. However, it varies in the form of this elbo. In this section, we will introduce the aesmc elbo, explain how gradients of it can be estimated, and discuss the implications of these changes.

Consider a family of ssms $\{p_{\theta}(x_{1:T},y_{1:T}):\theta\in\Theta\}$ and a family of proposal distributions $\{q_{\phi}(x_{1:T}\lvert y_{1:T})=q_{1,\phi}(x_{1}\lvert y_{1})\prod_{t=2}^{T}q_{t,\phi}(x_{t}\lvert x_{1:t-1},y_{1:t}):\phi\in\Phi\}$ . Aesmc uses an elbo objective based on the smc marginal likelihood estimator (1). In particular, for a given $y_{1:T}$ , the objective is defined as

where $\hat{Z}_{\text{SMC}}(x_{1:T}^{1:K},a_{1:T-1}^{1:K})$ is defined in (1) and $Q_{\text{SMC}}$ is the sampling distribution of smc,

For notational convenience, we will talk about optimizing elbos in the rest of this section. However, we note that the main intended use of aesmc is to amortize over datasets, for which the elbo is replaced by the dataset average $\mathcal{J}(\theta,\phi)$ in the optimization target. Nonetheless, rather than using the full dataset for each gradient update, will we instead use minibatches, noting that this forms unbiased estimator.

2 Gradient Estimation

To account for the discrete choices of ancestor indices $a_{t}^{k}$ one could additionally use the reinforce (Williams, 1992) trick, however in practice, we found that the additional term in the estimator has problematically high variance. We explore various other possible gradient estimators and empirical assessments of their variances in Appendix A. This exploration confirms that including the additional reinforce terms leads to problematically high variance, justifying our decision to omit them, despite introducing a small bias into the gradient estimates.

3 Bias & Implications on the Proposals

In this section, we express the gap between elbos and the log marginal likelihood as a kl divergence and study implications on the proposal distributions. We present a set of claims and propositions whose full proofs are in Appendix B. These give insight into the behavior of aesmc and show the advantages, and disadvantages, of using our different elbo. This insight motivates Section 4 which proposes an algorithm for improving proposal learning.

is a lower bound on $\log Z_{P}$ and satisfies

Given a non-negative unbiased estimator $\hat{Z}_{P}(x)\geq 0$ of the normalizing constant $Z_{P}$ where $x$ is distributed according to the proposal distribution $Q(x)$ , the following holds:

is the implied normalized target density.

In the case of iwaes, we can apply Claim 1 with $Q$ and $\hat{Z}_{P}$ being $Q_{\text{IS}}$ and $\hat{Z}_{\text{IS}}$ respectively as defined in (3) and $Z_{P}$ being $p_{\theta}(y)$ . This yields

Similarly, in the case of aesmc, we obtain

$Q_{\text{IS}}(x^{1:K})=P_{\text{IS}}(x^{1:K})$ for all $x^{1:K}$ if and only if $q(x\lvert y)=p(x\lvert y)$ for all $x$ .

If $K>1$ , then $P_{\text{SMC}}(x_{1:T}^{1:K},a_{1:T-1}^{1:K})=Q_{\text{SMC}}(x_{1:T}^{1:K},a_{1:T-1}^{1:K})$ for all $(x_{1:T}^{1:K},a_{1:T-1}^{1:K})$ if and only if

$q_{1}(x_{1}\lvert y_{1})=p(x_{1}\lvert y_{1:T})$ for all $x_{1}$ and $q_{t}(x_{t}\lvert x_{1:t-1},y_{1:t})=p(x_{1:t}\lvert y_{1:T})/p(x_{1:t-1}\lvert y_{1:T})$ for $t=2,\dotsc,T$ for all $x_{1:t}$ ,

where $\pi_{t}(x_{1:t})$ are the intermediate targets used by smc.

Improving Proposal Learning

We thus see that increasing $K$ reduces the snr and so the gradient updates for the proposal will degrade towards pure noise if $K$ is set too high.

Experiments

We now present a series of experiments designed to answer the following questions: 1) Does tightening the bound by using either more particles or a better inference procedure lead to an adverse effect on proposal learning? 2) Can aesmc, despite this effect, outperform iwae? 3) Can we further improve the learned model and proposal by using alt?

First we investigate a linear Gaussian state space model (lgssm) for model learning and a latent variable model for proposal adaptation. This allows us to compare the learned parameters to the optimal ones. Doing so, we confirm our conclusions for this simple problem.

We then extend those results to more complex, high dimensional observation spaces that require models and proposals parameterized by neural networks. We do so by investigating the Moving Agents dataset, a set of partially occluded video sequences.

We generate a sequence $y_{1:T}$ for $T=200$ by sampling from the model with $\theta=(\theta_{1},\theta_{2})=(0.9,1.0)$ . We then optimize the different elbos w.r.t. $\theta$ using the bootstrap proposal $q_{1}(x_{1}\lvert y_{1})=\mu_{\theta}(x_{1})$ and $q_{t}(x_{t}\lvert x_{1:t-1},y_{1:t})=f_{t,\theta}(x_{t}\lvert x_{1:t-1})$ . Because we use the bootstrap proposal, gradients w.r.t. to $\theta$ are not backpropagated through $q$ .

2 Proposal Learning

We now investigate how learning $\phi$ , i.e. the proposal, is affected by the the choice of elbo and the number of particles.

3 Moving Agents

To show that our results are applicable to complex, high dimensional data we compare aesmc and iwae on stochastic, partially observable video sequences. Figure 7 in Appendix C.2 shows an example of such a sequence.

The dataset consists of $N=5000$ sequences of images $(y_{1:T}^{(n)})_{n=1}^{N}$ of which 1000 are randomly held out as test set. Each sequence contains $T=40$ images represented as a 2 dimensional array of size $32\times 32$ . In each sequence there is one agent, represented as circle, whose starting position is sampled randomly along the top and bottom of the image. The dataset is inspired by (Ondrúška & Posner, 2016), however with the crucial difference that the movement of the agent is stochastic. The agent performs a directed random walk through the image. At each timestep, it moves according to

where $(x_{t},y_{t})$ are the coordinates in frame $t$ in a unit square that is then projected onto $32\times 32$ pixels. In addition to the stochasticity of the movement, half of the image is occluded, preventing the agent from being observed.

For the generative model and proposal distribution we use a Variational Recurrent Neural Network (vrnn) (Chung et al., 2015). It extends recurrent neural networks (rnns) by introducing a stochastic latent state $x_{t}$ at each timestep $t$ . Together with the observation $y_{t}$ , this state conditions the deterministic transition of the rnn. By introducing this unobserved stochastic state, the vrnn is able to better model complex long range variability in stochastic sequences. Architecture and hyperparameter details are given in Appendix C.1.

Conclusions

We have developed aesmc—a method for performing model learning using a new elbo objective which is based on the smc marginal likelihood estimator. This elbo objective is optimized using sga and the reparameterization trick. Our approach utilizes the efficiency of smc in models with intermediate observations and hence is suitable for highly structured models. We experimentally demonstrated that this objective leads to better generative model training than the iwae objective for structured problems, due to the superior inference and tighter bound provided by using smc instead of importance sampling.

Additionally, in Claim 1, we provide a simple way to express the bias of objectives induced by log of marginal likelihood estimators as a kl divergence on an extended space. In Propositions 1 and 2, we investigate the implications of these kls being zero in the case of iwae and aesmc. In the latter case, we find that we can achieve zero kl only if we are able to learn smc intermediate target distributions corresponding to marginals of the target distribution. Using our assertion that tighter variational bounds are not necessarily better, we then introduce and test a new method, alternating elbos, that addresses some of these issues and observe that, in some cases, this improves both model and proposal learning.

TAL is supported by EPSRC DTA and Google (project code DF6700) studentships. MI is supported by the UK EPSRC CDT in Autonomous Intelligent Machines and Systems. TR is supported by the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071; majority of TR’s work was undertaken while he was in the Department of Engineering Science, University of Oxford, and was supported by a BP industrial grant. TJ is supported by the UK EPSRC and MRC CDT in Statistical Science. FW is supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1; DARPA PPAML through the U.S. AFRL under Cooperative Agreement FA8750-14-2-0006; Intel and DARPA D3M, under Cooperative Agreement FA8750-17-2-0093.

References

Appendix A Gradients

The goal is to obtain an unbiased estimator for the gradient

which we can estimate by sampling $(x_{1:T}^{1:K},a_{1:T-1}^{1:K})$ directly from $Q_{\text{SMC}}$ and evaluating $\left[\nabla_{\theta,\phi}\log Q_{\text{SMC}}(x_{1:T}^{1:K},a_{1:T-1}^{1:K})\log\hat{Z}_{\text{SMC}}(x_{1:T}^{1:K},a_{1:T-1}^{1:K})+\nabla_{\theta,\phi}\log\hat{Z}_{\text{SMC}}(x_{1:T}^{1:K},a_{1:T-1}^{1:K})\right]$ .

A.2 Reinforce & Reparameterization

In Figure 5, we demonstrate that the estimator in (31) has much higher variance if we include the first term.

Appendix B Proofs for Bias & Implications on the Proposals

( $\implies$ ) Substituting for $Q_{\text{IS}}(x^{1:K})=P_{\text{IS}}(x^{1:K})$ , we obtain

Integrating both sides with respect to $(x^{2},\dotsc,x^{K})$ over the whole support (i.e. marginalizing out everything except $x^{1}$ ), we obtain:

Rearranging gives us $q(x^{1}\lvert y)=p(x^{1}\lvert y)$ for all $x^{1}$ .

( $\impliedby$ ) Substituting $p(x^{k}\lvert y)=q(x^{k}\lvert y)$ , we obtain

( $\implies$ ) It suffices to show that $\hat{Z}_{\text{SMC}}(x_{1:T}^{1:K},a_{1:T-1}^{1:K})=Z$ for all $(x_{1:T}^{1:K},a_{1:T-1}^{1:K})$ implies 1 and 2 in Proposal 2 due to equation (11).

We first prove that $\hat{Z}_{\text{SMC}}(x_{1:T}^{1:K},a_{1:T-1}^{1:K})=Z$ for all $(x_{1:T}^{1:K},a_{1:T-1}^{1:K})$ implies that the weights

Now, for $x_{1:t}$ , consider the implied proposal by rearranging (41) and (42)

where $w_{t}:=w_{t}(x_{1:t})$ is constant from our previous results. For this to be a normalized density with respect to $x_{t}$ , we must have

such that each $\pi_{t}(x_{1:t})$ must be the corresponding marginal of the final target. We also have

( $\impliedby$ ) To complete the proof, we now simply substitute identities in 1 and 2 of Proposal 2 back to the expression of $\hat{Z}(x_{1:T}^{1:K},a_{1:T-1}^{1:K})$ to obtain $\hat{Z}(x_{1:T}^{1:K},a_{1:T-1}^{1:K})=Z$ . ∎

Appendix C Experiments

In the following we give the details of our vrnn architecture. The generative model is given by:

and the proposal distribution is given by

The functions $\mu^{x}_{\theta}$ and $\sigma^{x}_{\theta}$ are computed by networks with two fully connected layers of size 128 whose first layer is shared. $\varphi^{x}_{\theta}$ is one fully connected layer of size 128.

For visual input, the encoding $\varphi^{y}_{\theta}$ is a convolutional network with conv-4x4-2-1-32, conv-4x4-2-1-64, conv-4x4-2-1-128 where conv-wxh-s-p-n denotes a convolutional network with $n$ filters of size $w\times h$ , stride $s$ , padding $p$ . Between convolutions we use leaky ReLUs with slope 0.2 as nonlinearity and batch norms. The decoding $\mu^{y}_{\theta}$ uses transposed convolutions of the same dimensions but in reversed order, however with stride $s=1$ and padding $p=0$ for the first layer.

A Gated Recurrent Unit (GRU) is used as RNN and if not stated otherwise ReLUs are used in between fully connected layers.

For the moving agents dataset we use adam with a learning rate of $10^{-3}$ .

A specific feature of the vrnn architecture is that the proposal and the generative model share the component $\varphi^{y}_{\phi,\theta}$ . Consequently, we set $\phi=\theta$ for the parameters belonging to this module and train it using gradients for both $\theta$ and $\phi$ .

C.2 Moving Agents

In Figure 7 we investigate the quality of the generative model by comparing visual predictions. We do so for models learned by iwae (top) and aesmc (bottom). The models were learned using ten particles but for easier visualization we only predict using five particles.

The first row in each graphic shows the ground truth. The second row shows the averaged predictions of all five particles. The next five rows show the predictions made by each particle individually.

The observations (i.e. the top row) up to $t=19$ are shown to the model. Up to this timestep the latent values $x_{0:19}$ are drawn from the proposal distribution $q(x_{t}|y_{t},h_{t-1})$ . From $t=20$ onwards the latent values $x_{20:37}$ are drawn from the generative model $p(x_{t}|x_{t-1})$ . Consequently, the model predicts the partially occluded, stochastic movement over 17 timesteps into the future.

We note that most particles predict a viable future trajectory. However, the model learned by iwae is not as consistent in the quality of its predictions, often ’forgetting’ the particle. This does not happen in every predicted sequence but the behavior shown here is very typical. Models learned by aesmc are much more consistent in the quality of their predictions.

C.3 Optimizing Only Proposal Parameters

Increasing $K_{\text{test}}$ (moving up in Figure 8 (Right)) improves inference.

Increasing $K_{\text{train}}$ (moving to the right in Figure 8 (Right)) worsens inference.

Among different possible combinations of (training algorithm, testing algorithm), (is, smc) $\succ$ (smc, smc) $\succ$ (is, is) $\succ$ (smc, is), where we use “ $a\succ b$ ” to denote that the combination $a$ results in better inference than combination $b$ .