Generalizing Hamiltonian Monte Carlo with Neural Networks

Daniel Levy, Matthew D. Hoffman, Jascha Sohl-Dickstein

Introduction

High-dimensional distributions that are only analytically tractable up to a normalizing constant are ubiquitous in many fields. For instance, they arise in protein folding (Schütte et al., 1999), physics simulations (Olsson, 1995), and machine learning (Andrieu et al., 2003). Sampling from such distributions is a critical task for learning and inference (MacKay, 2003), however it is an extremely hard problem in general.

Markov Chain Monte Carlo (MCMC) methods promise a solution to this problem. They operate by generating a sequence of correlated samples that converge in distribution to the target. This convergence is most often guaranteed through detailed balance, a sufficient condition for the chain to have the target equilibrium distribution. In practice, for any proposal distribution, one can ensure detailed balance through a Metropolis-Hastings (Hastings, 1970) accept/reject step.

Despite theoretical guarantees of eventual convergence, in practice convergence and mixing speed depend strongly on choosing a proposal that works well for the task at hand. What’s more, it is often more art than science to know when an MCMC chain has converged (“burned-in”), and when the chain has produced a new uncorrelated sample (“mixed”). Additionally, the reliance on detailed balance, which assigns equal probability to the forward and reverse transitions, often encourages random-walk behavior and thus slows exploration of the space (Ichiki & Ohzeki, 2013).

For densities over continuous spaces, Hamiltonian Monte Carlo (HMC; Duane et al., 1987; Neal, 2011) introduces independent, auxiliary momentum variables, and computes a new state by integrating Hamiltonian dynamics. This method can traverse long distances in state space with a single Metropolis-Hastings test. This is the state-of-the-art method for sampling in many domains. However, HMC can perform poorly in a number of settings. While HMC mixes quickly spatially, it struggles at mixing across energy levels due to its volume-preserving dynamics. HMC also does not work well with multi-modal distributions, as the probability of sampling a large enough momentum to traverse a very low-density region is negligibly small. Furthermore, HMC struggles with ill-conditioned energy landscapes (Girolami & Calderhead, 2011) and deals poorly with rapidly changing gradients (Sohl-Dickstein et al., 2014).

Recently, probabilistic models parameterized by deep neural networks have achieved great success at approximately sampling from highly complex, multi-modal empirical distributions (Kingma & Welling, 2013; Rezende et al., 2014; Goodfellow et al., 2014; Bengio et al., 2014; Sohl-Dickstein et al., 2015). Building on these successes, we present a method that, given an analytically described distribution, automatically returns an exact sampler with good convergence and mixing properties, from a class of highly expressive parametric models. The proposed family of samplers is a generalization of HMC; it transforms the HMC trajectory using parametric functions (deep networks in our experiments), while retaining theoretical guarantees with a tractable Metropolis-Hastings accept/reject step. The sampler is trained to minimize a variation on expected squared jumped distance (similar in spirit to Pasarica & Gelman (2010)). Our parameterization reduces easily to standard HMC. It is further capable of emulating several common extensions of HMC such as within-trajectory tempering (Neal, 1996) and diagonal mass matrices (Bennett, 1975).

We evaluate our method on distributions where HMC usually struggles, as well as on a the real-world task of training latent-variable generative models.

We introduce a generic training procedure which takes as input a distribution defined by an energy function, and returns a fast-mixing MCMC kernel.

We show significant empirical gains on various distributions where HMC performs poorly.

We finally evaluate our method on the real-world task of training and sampling from a latent variable generative model, where we show improvement in the model’s log-likelihood, and greater complexity in the distribution of posterior samples.

Related Work

Adaptively modifying proposal distributions to improve convergence and mixing has been explored in the past (Andrieu & Thoms, 2008). In the case of HMC, prior work has reduced the need to choose step size (Neal, 2011) or number of leapfrog steps (Hoffman & Gelman, 2014) by adaptively tuning those parameters. Salimans et al. (2015) proposed an alternate scheme based on variational inference. We adopt the much simpler approach of Pasarica & Gelman (2010), who show that choosing the hyperparameters of a proposal distribution to maximize expected squared jumped distance is both principled and effective in practice.

Previous work has also explored applying models from machine learning to MCMC tasks. Kernel methods have been used both for learning a proposal distribution (Sejdinovic et al., 2014) and for approximating the gradient of the energy (Strathmann et al., 2015). In physics, Restricted and semi-Restricted Boltzmann machines have been used both to build approximations of the energy function which allow more rapid sampling (Liu et al., 2017; Huang & Wang, 2017), and to motivate new hand-designed proposals (Wang, 2017).

Most similar to our approach is recent work from Song et al. (2017), which uses adversarial training of a volume-preserving transformation, which is subsequently used as an MCMC proposal distribution. While promising, this technique has several limitations. It does not use gradient information, which is often crucial to maintaining high acceptance rates, especially in high dimensions. It also can only indirectly measure the quality of the generated sample using adversarial training, which is notoriously unstable, suffers from “mode collapse” (where only a portion of a target distribution is covered), and often requires objective modification to train in practice (Arjovsky et al., 2017). Finally, since the proposal transformation preserves volume, it can suffer from the same difficulties in mixing across energy levels as HMC, as we illustrate in Section 5.

To compute the Metropolis-Hastings acceptance probability for a deterministic transition, the operator must be invertible and have a tractable Jacobian. Recent work (Dinh et al., 2016), introduces RNVP, an invertible transformation that operates by, at each layer, modifying only a subset of the variables by a function that depends solely on the remaining variables. This is exactly invertible with an efficiently computable Jacobian. Furthermore, by chaining enough of these layers, the model can be made arbitrarily expressive. This parameterization will directly motivate our extension of the leapfrog integrator in HMC.

Background

Let $p$ be a target distribution, analytically known up to a constant, over a space $\mathcal{X}$ . Markov chain Monte Carlo (MCMC) methods (Neal, 1993) aim to provide samples from $p$ . To that end, MCMC methods construct a Markov Chain whose stationary distribution is the target distribution $p$ . Obtaining samples then corresponds to simulating a Markov Chain, i.e., given an initial distribution $\pi_{0}$ and a transition kernel $K$ , constructing the following sequence of random variables:

Given any proposal distribution $q$ , satisfying mild conditions, we can easily construct a transition kernel that respects detailed balance using Metropolis-Hastings (Hastings, 1970) accept/reject rules. More formally, starting from $x_{0}\sim\pi_{0}$ , at each step $t$ , we sample $x^{\prime}\sim q(\cdot|X_{t})$ , and with probability $A(x^{\prime}|x_{t})=\min\left(1,\frac{p(x^{\prime})q(x_{t}|x^{\prime})}{p(x_{t})q(x^{\prime}|x_{t})}\right)$ , accept $x^{\prime}$ as the next sample $x_{t+1}$ in the chain. If we reject $x^{\prime}$ , then we retain the previous state and $x_{t+1}=x_{t}$ . For typical proposals this algorithm has strong asymptotic guarantees. But in practice one must often choose between very low acceptance probabilities and very cautious proposals, both of which lead to slow mixing. For continuous state spaces, Hamiltonian Monte Carlo (HMC; Neal, 2011) tackles this problem by proposing updates that move far in state space while staying roughly on iso-probability contours of $p$ .

2 Hamiltonian Monte Carlo

The dynamics are typically simulated using the leapfrog integrator (Hairer et al., 2003; Leimkuhler & Reich, 2004), which for a single time step consists of:

Following Sohl-Dickstein et al. (2014), we write the action of the leapfrog integrator in terms of an operator $\mathbf{L}$ : $\mathbf{L}\xi\triangleq\mathbf{L}(x,v)\triangleq(x^{\prime},v^{\prime})$ , and introduce a momentum flip operator $\mathbf{F}$ : $\mathbf{F}(x,v)\triangleq(x,-v)$ . It is important to note two properties of these operators. First, the transformation $\mathbf{F}\mathbf{L}$ is an involution, i.e. $\mathbf{F}\mathbf{L}\mathbf{F}\mathbf{L}(x,v)=\mathbf{F}\mathbf{L}(x^{\prime},-v^{\prime})=(x,v)$ . Second, the transformations from $(x,v)$ to $(x,v^{\frac{1}{2}})$ , from $(x,v^{\frac{1}{2}})$ to $(x^{\prime},v^{\frac{1}{2}})$ , and from $(x^{\prime},v^{\frac{1}{2}})$ to $(x^{\prime},v^{\prime})$ are all volume-preserving shear transformations i.e., only one of the variables ( $x$ or $v$ ) changes, by an amount determined by the other one. The determinant of the Jacobian, $\left|\frac{\partial[\mathbf{F}\mathbf{L}\xi]}{\partial\xi^{T}}\right|$ , is thus easy to compute. For vanilla HMC $\left|\frac{\partial[\mathbf{F}\mathbf{L}\xi]}{\partial\xi^{T}}\right|=1$ , but we will leave it in symbolic form for use in Section 4. The Metropolis-Hastings-Green (Hastings, 1970; Green, 1995) acceptance probability for the HMC proposal is made simple by these two properties, and is

L2HMC: Training MCMC Samplers

In this section, we describe our proposed method L2HMC (for ‘Learning To Hamiltonian Monte Carlo’). Given access to only an energy function $U$ (and not samples), L2HMC learns a parametric leapfrog operator $\mathbf{L}_{\theta}$ over an augmented state space. We begin by describing what desiderata we have for $\mathbf{L}_{\theta}$ , then go into detail on how we parameterize our sampler. Finally, we conclude this section by describing our training procedure.

HMC is a powerful algorithm, but it can still struggle even on very simple problems. For example, a two-dimensional multivariate Gaussian with an ill-conditioned covariance matrix can take arbitrarily long to traverse (even if the covariance is diagonal), whereas it is trivial to sample directly from it. Another problem is that HMC can only move between energy levels via a random walk (Neal, 2011), which leads to slow mixing in some models. Finally, HMC cannot easily traverse low-density zones. For example, given a simple Gaussian mixture model, HMC cannot mix between modes without recourse to additional tricks, as illustrated in Figure 1(b). These observations determine the list of desiderata for our learned MCMC kernel: fast mixing, fast burn-in, mixing across energy levels, and mixing between modes.

While pursuing these goals, we must take care to ensure that our proposal operator retains two key features of the leapfrog operator used in HMC: it must be invertible, and the determinant of its Jacobian must be tractable. The leapfrog operator satisfies these properties by ensuring that each sub-update only affects a subset of the variables, and that no sub-update depends nonlinearly on any of the variables being updated. We are free to generalize the leapfrog operator in any way that preserves these properties. In particular, we are free to translate and rescale each sub-update of the leapfrog operator, so long as we are careful to ensure that these translation and scale terms do not depend on the variables being updated.

We now describe the details of our augmented leapfrog integrator $\mathbf{L}_{\theta}$ , for a single time-step $t$ , and for direction $d=1$ .

We first update the momenta $v$ . This update can only depend on a subset $\zeta_{1}\triangleq(x,\partial_{x}U(x),t)$ of the full state, which excludes $v$ . It takes the form

We have introduced three new functions of $\zeta_{1}$ : $T_{v}$ , $Q_{v}$ , and $S_{v}$ . $T_{v}$ is a translation, $\exp(Q_{v})$ rescales the gradient, and $\exp(\frac{\epsilon}{2}S_{v})$ rescales the momentum. The determinant of the Jacobian of this transformation is $\exp\left(\frac{\epsilon}{2}\mathds{1}\cdot S_{v}(\zeta_{1})\right)$ . Note that if $T_{v}$ , $Q_{v}$ , and $S_{v}$ are all zero, then we recover the standard leapfrog momentum update.

We now update $x$ . As hinted above, to make our transformation more expressive, we first update a subset of the coordinates of $x$ , followed by the complementary subset. The first update, which yields $x^{\prime}$ and affects only $x_{m^{t}}$ , depends on the state subset $\zeta_{2}\triangleq(x_{\bar{m}^{t}},v,t)$ . Conversely, with $x^{\prime}$ defined below, the second update only affects $x^{\prime}_{\bar{m}^{t}}$ and depends only on $\zeta_{3}\triangleq(x^{\prime}_{m^{t}},v,t)$ :

Again, $T_{x}$ is a translation, $\exp(Q_{x})$ rescales the effect of the momenta, $\exp(\epsilon S_{x})$ rescales the positions $x$ , and we recover the original leapfrog position update if $T_{x}=Q_{x}=S_{x}=0$ . The determinant of the Jacobian of the first transformation is $\exp\left(\epsilon m^{t}\cdot S_{x}(\zeta_{2})\right)$ , and the determinant of the Jacobian of the second transformation is $\exp\left(\epsilon\bar{m}^{t}\cdot S_{x}(\zeta_{3})\right)$ .

Finally, we update $v$ again, based on the subset $\zeta_{4}\triangleq(x^{\prime\prime},\partial_{x}U(x^{\prime\prime}),t)$ :

This update has the same form as the momentum update in equation 4.

To give intuition into these terms, the scaling applied to the momentum can enable, among other things, acceleration in low-density zones, to facilitate mixing between modes. The scaling term applied to the gradient of the energy may allow better conditioning of the energy landscape (e.g., by learning a diagonal inertia tensor), or partial ignoring of the energy gradient for rapidly oscillating energies.

The corresponding integrator for $d=-1$ is given in Appendix A; it essentially just inverts the updates in equations 4, 5 and 6. For all experiments, the functions $Q,S,T$ are implemented using multi-layer perceptrons, with shared weights. We encode the current time step in the MLP input.

Our leapfrog operator $\mathbf{L}_{\theta}$ corresponds to running $M$ steps of this modified leapfrog, $\mathbf{L}_{\theta}\xi=\mathbf{L}_{\theta}(x,v,d)=(x^{\prime\prime\times M},v^{\prime\prime\times M},d)$ , and our flip operator $\mathbf{F}$ reverses the direction variable $d$ , $\mathbf{F}\xi=(x,v,-d)$ . Written in terms of these modified operators, our proposal and acceptance probability are identical to those for standard HMC. Note, however, that this parameterization enables learning non-volume-preserving transformations, as the determinant of the Jacobian is a function of $S_{x}$ and $S_{v}$ that does not necessarily evaluate to $1$ . This quantity is derived in Appendix B.

For convenience, we denote by $\mathbf{R}$ an operator that re-samples the momentum and direction. I.e., given $\xi=(x,v,d)$ , $\mathbf{R}\xi=(x,v^{\prime},d^{\prime})$ where $v^{\prime}\sim\mathcal{N}(0,I),d^{\prime}\sim\mathcal{U}\left(\{-1,1\}\right)$ . Sampling thus consists of alternating application of the $\mathbf{F}\mathbf{L}_{\theta}$ and $\mathbf{R}$ , in the following two steps each of which is a Markov transition that satisfies detailed balance with respect to $p$ :

$\xi^{\prime}=\mathbf{F}\mathbf{L}_{\theta}\xi$ with probability $A(\mathbf{F}\mathbf{L}_{\theta}\xi|\xi)$ (Equation 3), otherwise $\xi^{\prime}=\xi$ .

This parameterization is effectively a generalization of standard HMC as it is non-volume preserving, with learnable parameters, and easily reduces to standard HMC for $Q,S,T=0$ .

2 Loss and Training Procedure

where $\lambda$ is a scale parameter, capturing the characteristic length scale of the problem. The second term encourages typical moves to be large, while the first term strongly penalizes the sampler if it is ever in a state where it cannot move effectively – $\delta(\xi,\xi^{\prime})$ being small resulting in a large loss value. We train our sampler by minimizing this loss over both the target distribution and initialization distribution. Formally, given an initial distribution $\pi_{0}$ over $\mathcal{X}$ , we define $q(\xi)=\pi_{0}(x)\mathcal{N}(v;0,I)p(d)$ , and minimize

The first term of this loss encourages mixing as it considers our operator applied on draws from the distribution; the second term rewards fast burn-in; $\lambda_{b}$ controls the strength of the ‘burn-in’ regularization. Given this loss, we exactly describe our training procedure in Algorithm 1. It is important to note that each training iteration can be done with only one pass through the network and can be efficiently batched. We further emphasize that this training procedure can be applied to any learnable operator whose Jacobian’s determinant is tractable, making it a general framework for training MCMC proposals.

Experiments

We present an empirical evaluation of our trained sampler on a diverse set of energy functions. We first present results on a collection of toy distributions capturing common pathologies of energy landscapes, followed by results on a task from machine learning: maximum-likelihood training of deep generative models. For each, we compare against HMC with well-tuned step length and show significant gains in mixing time. Code implementing our algorithm is available onlinehttps://github.com/brain-research/l2hmc..

We evaluate our L2HMC sampler on a diverse collection of energy functions, each posing different challenges for standard HMC.

Ill-Conditioned Gaussian (ICG): Gaussian distribution with diagonal covariance spaced log-linearly between $10^{-2}$ and $10^{2}$ . This demonstrates that L2HMC can learn a diagonal inertia tensor.

Strongly correlated Gaussian (SCG): We rotate a diagonal Gaussian with variances $[10^{2},10^{-2}]$ by $\frac{\pi}{4}$ . This is an extreme version of an example from Neal (2011). This problem shows that, although our parametric sampler only applies element-wise transformations, it can adapt to structure which is not axis-aligned.

Mixture of Gaussians (MoG): Mixture of two isotropic Gaussians with $\sigma^{2}=0.1$ , and centroids separated by distance $4$ . The means are thus about $12$ standard deviations apart, making it almost impossible for HMC to mix between modes.

Rough Well: Similar to an example from Sohl-Dickstein et al. (2014), for a given $\eta>0,U(x)=\frac{1}{2}x^{T}x+\eta\sum_{i}\cos(\frac{x_{i}}{\eta})$ . For small $\eta$ the energy itself is altered negligibly, but its gradient is perturbed by a high frequency noise oscillating between $-1$ and $1$ . In our experiments, we choose $\eta=10^{-2}$ .

For each of these distributions, we compare against HMC with the same number of leapfrog steps and a well-tuned step-size. To compare mixing time, we plot auto-correlation for each method and report effective sample size (ESS). We compute those quantities in the same way as Sohl-Dickstein et al. (2014). We observe that samplers trained with L2HMC show greatly improved autocorrelation and ESS on the presented tasks, providing more than $106\times$ improved ESS on the SCG task. In addition, for the MoG, we show that L2HMC can easily mix between modes while standard HMC gets stuck in a mode, unable to traverse the low density zone. Experimental details, as well as a comparison with LAHMC (Sohl-Dickstein et al., 2014), are shown in Appendix C.

In addition to the well known challenges associated with adversarial training (Arjovsky et al., 2017), we note that parameterization using a volume-preserving operator can dramatically fail on simple energy landscapes. We build off of the mog2 experiment presented in (Song et al., 2017), which is a $2$ -d mixture of isotropic Gaussians separated by a distance of $10$ with variances $0.5$ . We consider that setup but increase the ratio of variances: $\sigma_{1}^{2}=3,\sigma_{2}^{2}=0.05$ . We show in Figure 1(d) sample chains trained with L2HMC and A-NICE-MC; A-NICE-MC cannot effectively mix between the two modes as only a fraction of the volume of the large mode can be mapped to the small one, making it highly improbable to traverse. This is also an issue for HMC. On the other hand, L2HMC can both traverse the low-density region between modes, and map a larger volume in the left mode to a smaller volume in the right mode. It is important to note that the distance between both clusters is less than in the mog2 case, and it is thus a good diagnostic of the shortcomings of volume-preserving transformations.

2 Latent-Variable Generative Model

where $q_{\psi}(z|x)$ is a tractable conditional distribution with parameters $\psi$ , typically parameterized by a neural network. Recently, to improve upon well-known pitfalls like over-pruning (Burda et al., 2015) of the VAE, Hoffman (2017) proposed HMC-DLGM. For a data sample $x^{(i)}$ , after obtaining a sample from the approximate posterior $q_{\psi}(\cdot|x^{(i)})$ , Hoffman (2017) runs a MCMC algorithm with energy function $U(z,x^{(i)})=-\log p(z)-\log p(x^{(i)}|z;\phi)$ to obtain a more exact posterior sample from $p(z|x^{(i)};\phi)$ . Given that better posterior sample $z^{\prime}$ , the algorithm maximizes $\log p(x^{(i)}|z^{\prime};\phi)$ .

To show the benefits of L2HMC, we borrow the method from Hoffman (2017), but replace HMC by jointly training an L2HMC sampler to improve the efficiency of the posterior sampling. We call this model L2HMC-DLGM. A diagram of our model and a formal description of our training procedure are presented in Appendix D. We define, for $\xi=\{z,v,d\},r(\xi|x;\psi)\triangleq q_{\psi}(z|x)\mathcal{N}\left(v;0,I\right)\mathcal{U}\left(d;\{-1,1\}\right)$ .

In the subsequent sections, we compare our method to the standard VAE model from Kingma & Welling (2013) and HMC-DGLM from Hoffman (2017). It is important to note that, since our sampler is trained jointly with $p_{\phi}$ and $q_{\psi}$ , it performs exactly the same number of gradient computations of the energy function as HMC. We first show that training a latent variable generative model with L2HMC results in better generative models both qualitatively and quantitatively. We then show that our improved sampler enables a more expressive, non-Gaussian, posterior.

Implementation details: Our decoder ( $p_{\phi}$ ) is a neural network with $2$ fully connected layers, with $1024$ units each and softplus non-linearities, and outputs Bernoulli activation probabilities for each pixel. The encoder ( $q_{\psi}$ ) has the same architecture, returning mean and variance for the approximate posterior. Our model was trained for $300$ epochs with Adam (Kingma & Ba, 2014) and a learning rate $\alpha=10^{-3}$ . All experiments were done on the dynamically binarized MNIST dataset (LeCun, ).

We first present samples from decoders trained with L2HMC, HMC and the ELBO (i.e. vanilla VAE). Although higher log likelihood does not necessarily correspond to better samples (Theis et al., 2015), we can see in Figure 5, shown in the Appendix, that the decoder trained with L2HMC generates sharper samples than the compared methods.

We now compare our method to HMC in terms of log-likelihood of the data. As we previously stated, the marginal likelihood of a data point $x\in\mathcal{X}$ is not tractable as it requires integrating $p(x,z)$ over a high-dimensional space. However, we can estimate it using annealed importance sampling (AIS; Neal (2001)). Following Wu et al. (2016), we evaluate our generative models on both training and held-out data. In Figure 2, we plot the data’s log-likelihood against the number of gradient computation steps for both HMC-DGLM and L2HMC-DGLM. We can see that for a similar number of gradient computations, L2HMC-DGLM achieves higher likelihood for both training and held-out data. This is a strong indication that L2HMC provides significantly better posterior samples.

In the standard VAE framework, approximate posteriors are often parametrized by a Gaussian, thus making a strong assumption of uni-modality. In this section, we show that using L2HMC to sample from the posterior enables learning of a richer posterior landscape.

Visualization of the posterior After training a decoder with L2HMC, we randomly choose an element $x_{0}\in\mathcal{D}$ and run $512$ parallel L2HMC chains for $20,000$ Metropolis-Hastings steps. We then find the direction of highest variance, project the samples along that direction and show a histogram in Figure 3(b). This plot shows non-Gaussianity in the latent space for the posterior. Using our improved sampler enables the decoder to make use of a more expressive posterior, and enables the encoder to sample from this non-Gaussian posterior.

Future Work

The loss in Section 4.2 targets lag-one autocorrelation. It should be possible to extend this to also target lag-two and higher autocorrelations. It should also be possible to extend this loss to reward fast decay in the autocorrelation of other statistics of the samples, for instance the sample energy as well as the sample position. These additional statistics could also include learned statistics of the samples, combining benefits of the adversarial approach of (Song et al., 2017) with the current work.

Our learned generalization of HMC should prove complementary to several other research directions related to HMC. It would be interesting to explore combining our work with the use of HMC in a minibatch setting (Chen et al., 2014); with shadow Hamiltonians (Izaguirre & Hampton, 2004); with gradient pre-conditioning approaches similar to those used in Riemannian HMC (Girolami et al., 2009; Betancourt, 2013); with the use of alternative HMC accept-reject rules (Sohl-Dickstein et al., 2014; Berger et al., 2015); with the use of non-canonical Hamiltonian dynamics (Tripuraneni et al., 2016); with variants of AIS adapted to HMC proposals (Sohl-Dickstein & Culpepper, 2012); with the extension of HMC to discrete state spaces (Zhang et al., 2012); and with the use of alternative Hamiltonian integrators (Creutz & Gocksch, 1989; Chao et al., 2015).

Finally, our work is also complementary to other methods not utilizing gradient information. For example, we could incorporate the intuition behind Multiple Try Metropolis schemes (Martino & Read, 2013) by having several parametric operators and training each one when used. In addition, one could draw inspiration from the adaptive literature (Haario et al., 2001; Andrieu & Thoms, 2008) or component-wise strategies (Gilks & Wild, 1992).

Conclusion

In this work, we presented a general method to train expressive MCMC kernels parameterized with deep neural networks. Given a target distribution $p$ , analytically known up to a constant, our method provides a fast-mixing sampler, able to efficiently explore the state space. Our hope is that our method can be utilized in a “black-box” manner, in domains where sampling constitutes a huge bottleneck such as protein foldings (Schütte et al., 1999) or physics simulations (Olsson, 1995).

We would like to thank Ben Poole, Aditya Grover, David Belanger, and Colin Raffel for insightful comments on the draft, Mohammad Norouzi for support and encouragement launching the project, and Jiaming Song for discussions and help running A-NICE-MC.

References

Appendix

Let $\xi=\{x,v,d\}$ in the extended state space with $d=-1$ . Here, we describe the leapfrog updates for a single time step $t$ , this consists of inverting the equations presented in the corresponding section.

With the notation from Section 4, let $\zeta_{2}\triangleq\{x_{m^{t}},v,t\}$

Let us denote $\zeta_{3}\triangleq\{x^{\prime}_{\bar{m}^{t}},v,t\}$ :

Finally, the last update, with $\zeta_{4}\triangleq\{x^{\prime\prime},\partial_{x}U(x^{\prime\prime}),t\}$ :

It is important to note that to invert $\mathbf{L}_{\theta}$ , these steps should be ran for $t$ from $M$ to $1$ .

Appendix B Determinant of the Jacobian

Given the derivations (and notations) from Section 4, for the forward operator $\mathbf{L}_{\theta}$ , we can immediately compute the Jacobian:

Where $\zeta^{t}_{i}$ denotes the intermediary variable $\zeta_{i}$ at time step $t$ and $d$ is the direction of $\xi$ i.e. $\xi=\{x,v,d\}$ .

Appendix C Experimental details of Section 5

First of all, we keep separate parameters for the network responsible for updating $v$ and those updating $x$ . The architectures are the same. Let us take the example of $Q_{v},S_{v},T_{v}$ . The time step $t$ is given as input to the MLP, encoded as $\tau(t)=\left(\cos(\frac{2\pi t}{M}),\sin(\frac{2\pi t}{M})\right)$ . $\sigma(\cdot)$ denotes the ReLU non-linearity.

$S_{v}=\lambda_{s}\mathtt{tanh}(W_{s}h_{2}+b_{s}),Q_{v}=\lambda_{q}\mathtt{tanh}(W_{q}h_{2}+b_{q}),T_{v}=W_{t}h_{2}+b_{t}$ .

In Section 5.1, the $Q,S,T$ are neural networks with $2$ hidden layers with $10$ ( $100$ for the $50$ -d ICG) units and ReLU non-linearities. We train with Adam (Kingma & Ba, 2014) and a learning rate $\alpha=10^{-3}$ . We train for $5,000$ iterations with a batch size of $200$ .

$\lambda_{b}$ was set to for ICG and SCG and to $1$ for MoG and Rough Well. For the MoG tasks, we train our sampler with a temperature parameter that we continuously anneal; we evaluate the trained sampler without using temperature.

C.2 Auto-correlation and ESS

Let $(x_{\tau})_{\tau\leq T}$ be a set of correlated samples converging to the distribution $p$ with mean $\mu$ and covariance $\Sigma$ . We define auto-correlation at time $t$ as:

We can now define effective sample size (ESS) as:

Similar to Hoffman & Gelman (2014), we truncate the sum when the auto-correlation goes below $0.05$ .

C.3 Comparison with LAHMC

We compare our trained sampler with LAHMC (Sohl-Dickstein et al., 2014). Results are reported in Table 1. L2HMC largely outperforms LAHMC on all task. LAHMC is also unable to mix between modes for the MoG task. We also note that L2HMC could be easily combined with LAHMC, by replacing the leapfrog integrator of LAHMC with the learned one of L2HMC.

Appendix D L2HMC-DGLM

In this section, we present our training algorithm as well as a diagram explaining L2HMC-DGLM. For conciseness, given our operator $\mathbf{L}_{\theta}$ , we denote by $\mathbf{K}_{\theta}(\cdot|x)$ the distribution over next state given sampling of a momentum and direction and the Metropolis-Hastings step.

D.2 Implementation details of L2HMC-DGLM

Similar to our L2HMC training on unconditional sampling, we share weights across $Q,S$ and $T$ . In addition, the auxiliary variable $x$ (here the image from MNIST) is first passed through a $2$ -layer neural network, with softplus non-linearities and $512$ hidden units. This input is given to both networks $\{\cdot\}_{x}$ and $\{\cdot\}_{v}$ . The architecture then consists of $2$ hidden layers of $200$ units and ReLU non-linearities. For $\lambda$ (scale parameter of the loss), we use the standard deviation of the approximate posterior.

For each data point, we run $20$ Markov Chains in parallel, $10,000$ annealing steps with $10$ leapfrogs per step and choose the step size for an acceptance rate of $0.65$ .

D.3 MNIST Samples

We show in Figure 5 samples from the three evaluated models: VAE (Kingma & Welling, 2013), HMC-DGLM (Hoffman, 2017) and L2HMC-DGLM.