Gradient-free Hamiltonian Monte Carlo with Efficient Kernel Exponential Families

Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltan Szabo, Arthur Gretton

Introduction

Estimating expectations using Markov Chain Monte Carlo (MCMC) is a fundamental approximate inference technique in Bayesian statistics. MCMC itself can be computationally demanding, and the expected estimation error depends directly on the correlation between successive points in the Markov chain. Therefore, efficiency can be achieved by taking large steps with high probability.

Hamiltonian Monte Carlo is an MCMC algorithm that improves efficiency by exploiting gradient information. It simulates particle movement along the contour lines of a dynamical system constructed from the target density. Projections of these trajectories cover wide parts of the target’s support, and the probability of accepting a move along a trajectory is often close to one. Remarkably, this property is mostly invariant to growing dimensionality, and HMC here often is superior to random walk methods, which need to decrease their step size at a much faster rate [1, Sec. 4.4].

Unfortunately, for a large class of problems, gradient information is not available. For example, in Pseudo-Marginal MCMC (PM-MCMC) , the posterior does not have an analytic expression, but can only be estimated at any given point, e.g. in Bayesian Gaussian Process classification . A related setting is MCMC for Approximate Bayesian Computation (ABC-MCMC), where the posterior is approximated through repeated simulation from a likelihood model . In both cases, HMC cannot be applied, leaving random walk methods as the only mature alternative. There have been efforts to mimic HMC’s behaviour using stochastic gradients from mini-batches in Big Data , or stochastic finite differences in ABC . Stochastic gradient based HMC methods, however, often suffer from low acceptance rates or additional bias that is hard to quantify .

Random walk methods can be tuned by matching scaling of steps and target. For example, Adaptive Metropolis-Hastings (AMH) is based on learning the global scaling of the target from the history of the Markov chain. Yet, for densities with nonlinear support, this approach does not work very well. Recently, introduced a Kernel Adaptive Metropolis-Hastings (KAMH) algorithm whose proposals are locally aligned to the target. By adaptively learning target covariance in a Reproducing Kernel Hilbert Space (RKHS), KAMH achieves improved sampling efficiency.

In this paper, we extend the idea of using kernel methods to learn efficient proposal distributions . Rather than locally smoothing the target density, however, we estimate its gradients globally. More precisely, we fit an infinite dimensional exponential family model in an RKHS via score matching . This is a non-parametric method of modelling the log unnormalised target density as an RKHS function, and has been shown to approximate a rich class of density functions arbitrarily well. More importantly, the method has been empirically observed to be relatively robust to increasing dimensionality – in sharp contrast to classical kernel density estimation [15, Sec. 6.5]. Gaussian Processes (GP) were also used in as an emulator of the target density in order to speed up HMC, however, this requires access to the target in closed form, to provide training points for the GP.

We require our adaptive KMC algorithm to be computationally efficient, as it deals with high-dimensional MCMC chains of growing length. We develop two novel approximations to the infinite dimensional exponential family model. The first approximation, score matching lite, is based on computing the solution in terms of a lower dimensional, yet growing, subspace in the RKHS. KMC with score matching lite (KMC lite) is geometrically ergodic on the same class of targets as standard random walks. The second approximation uses a finite dimensional feature space (KMC finite), combined with random Fourier features . KMC finite is an efficient online estimator that allows to use all of the Markov chain history, at the cost of decreased efficiency in unexplored regions. A choice between KMC lite and KMC finite ultimately depends on the ability to initialise the sampler within high-density regions of the target; alternatively, the two approaches could be combined.

Experiments show that KMC inherits the efficiency of HMC, and therefore mixes significantly better than state-of-the-art gradient-free adaptive samplers on a number of target densities, including on synthetic examples, and when used in PM-MCMC and ABC-MCMC. All code can be found at https://github.com/karlnapf/kernel_hmc

Background and Previous Work

In the absence of logπ\nabla\log\pi, the usual choice of QQ is a random walk, i.e. Q(xt)=N(xt,Σt).Q(\cdot|x_{t})={\cal N}(\cdot|x_{t},\Sigma_{t}). A popular choice of the scaling is ΣtI\Sigma_{t}\propto I. When the scale of the target density is not uniform across dimensions, or if there are strong correlations, the AMH algorithm improves mixing by adaptively learning global covariance structure of π\pi from the history of the Markov chain. For cases where the local scaling does not match the global covariance of π\pi, i.e. the support of the target is nonlinear, KAMH improves mixing by learning the target covariance in a RKHS. KAMH proposals are Gaussian with a covariance that matches the local covariance of π\pi around the current state xtx_{t}, without requiring access to logπ\nabla\log\pi.

In practice, (1) is usually unavailable and we need to resort to approximations. Here, we limit ourselves to the leap-frog integrator; see for details. To correct for discretisation error, a Metropolis acceptance procedure can be applied: starting from (p,q)(p^{\prime},q), the end-point of the approximate trajectory is accepted with probability min[1,exp(H(p,q)+H(p,q))]\min\left[1,\exp\left(-H(p^{*},q^{*})+H(p^{\prime},q)\right)\right]. HMC is often able to propose distant, uncorrelated moves with a high acceptance probability.

In many cases the gradient of logπ(q)=U(q)\log\pi(q)=-U(q) cannot be written in closed form, leaving random-walk based methods as the state-of-the-art . We aim to overcome random-walk behaviour, so as to obtain significantly more efficient sampling .

Kernel Induced Hamiltonian Dynamics

KMC replaces the potential energy in (1) by a kernel induced surrogate computed from the history of the Markov chain. This surrogate does not require gradients of the log-target density. The surrogate induces a kernel Hamiltonian flow, which can be numerically simulated using standard leap-frog integration. As with the discretisation error in HMC, any deviation of the kernel induced flow from the true flow is corrected via a Metropolis acceptance procedure. This here also contains the estimation noise from π^\hat{\pi} and re-uses previous values of π^\hat{\pi}, c.f. [3, Table 1]. Consequently, the stationary distribution of the chain remains correct, given that we take care when adapting the surrogate.

We construct a kernel induced potential energy surrogate whose gradients approximate the gradients of the true potential energy UU in (1), without accessing π\pi or π\nabla\pi directly, but only using the history of the Markov chain. To that end, we model the (unnormalised) target density π(x)\pi(x) with an infinite dimensional exponential family model of the form

We define a kernel induced Hamiltonian operator by replacing UU in the potential energy part Upq\frac{\partial U}{\partial p}\frac{\partial}{\partial q} in (1) by our kernel surrogate Uk=fU_{k}=f. It is clear that, depending on UkU_{k}, the resulting kernel induced Hamiltonian flow differs from the original one. That said, any bias on the resulting Markov chain, in addition to discretisation error from the leap-frog integrator, is naturally corrected for in the Pseudo-Marginal Metropolis step. We accept an end-point ϕtHk(p,q)\phi_{t}^{H_{k}}(p^{\prime},q) of a trajectory starting at (p,q)(p^{\prime},q) along the kernel induced flow with probability

where H(ϕtHk(p,q))H\left(\phi_{t}^{H_{k}}(p^{\prime},q)\right) corresponds to the true Hamiltonian at ϕtHk(p,q)\phi_{t}^{H_{k}}(p^{\prime},q). Here, in the Pseudo-Marginal context, we replace both terms in the ratio in (3) by unbiased estimates, i.e., we replace π(q)\pi(q) within HH with an unbiased estimator π^(q)\hat{\pi}(q). Note that this also involves ‘recycling’ the estimates of HH from previous iterations to ensure anyymptotic correctness, c.f. [3, Table 1]. Any deviations of the kernel induced flow from the true flow result in a decreased acceptance probability (3). We therefore need to control the approximation quality of the kernel induced potential energy to maintain high acceptance probability in practice. See Figure 1 for an illustrative example.

Two Efficient Estimators for Exponential Families in RKHS

We now address estimating the infinite dimensional exponential family model (2) from data. The original estimator in has a large computational cost. This is problematic in the adaptive MCMC context, where the model has to be updated on a regular basis. We propose two efficient approximations, each with its strengths and weaknesses. Both are based on score matching.

Following , we model an unnormalised log probability density logπ(x)\log\pi(x) with a parametric model

where we note that the normalising constants vanish from taking the gradient \nabla. As shown in [14, Theorem 1], it is possible to compute an empirical version without accessing π(x)\pi(x) or logπ(x)\nabla\log\pi(x) other than through observed samples,

Our approximations of the original model (2) are based on minimising (5) using approximate scores.

2 Infinite Dimensional Exponential Families Lite

The original estimator of ff in (2) takes a dual form in a RKHS sub-space spanned by nd+1nd+1 kernel derivatives, [13, Thm. 4]. The update of the proposal at the iteration tt of MCMC requires inversion of a (td+1)×(td+1)(td+1)\times(td+1) matrix. This is clearly prohibitive if we are to run even a moderate number of iterations of a Markov chain. Following , we take a simple approach to avoid prohibitive computational costs in tt: we form a proposal using a random sub-sample of fixed size nn from the Markov chain history, z:={zi}i=1n{xi}i=1t\mathbf{z}:=\{z_{i}\}_{i=1}^{n}\subseteq\{x_{i}\}_{i=1}^{t}. In order to avoid excessive computation when dd is large, we replace the full dual solution with a solution in terms of span({k(zi,)}i=1n)\text{span}\left(\left\{k(z_{i},\cdot)\right\}_{i=1}^{n}\right), which covers the support of the true density by construction, and grows with increasing nn. That is, we assume that the model (4) takes the ‘light’ form

Given a set of samples z={zi}i=1n\mathbf{z}=\{z_{i}\}_{i=1}^{n} and assuming f(x)=i=1nαik(zi,x)f(x)=\sum_{i=1}^{n}\alpha_{i}k(z_{i},x) for the Gaussian kernel of the form k(x,y)=exp(σ1xy22)k(x,y)=\exp\left(-\sigma^{-1}\|x-y\|_{2}^{2}\right), and λ>0,\lambda>0, the unique minimiser of the λfH2\lambda\|f\|_{{\cal H}}^{2}-regularised empirical score matching objective (5) is given by

The estimator costs O(n3+dn2){\cal O}(n^{3}+dn^{2}) computation (for computing C,bC,b, and for inverting CC) and O(n2){\cal O}(n^{2}) storage, for a fixed random chain history sub-sample size nn. This can be further reduced via low-rank approximations to the kernel matrix and conjugate gradient methods, which are derived in Appendix A.

Gradients of the model are given as f(x)=i=1nαik(x,xi)\nabla f(x)=\sum_{i=1}^{n}\alpha_{i}\nabla k(x,x_{i}), i.e. they simply require to evaluate gradients of the kernel function. Evaluation and storage of f()\nabla f(\cdot) both cost O(dn){\cal O}(dn).

3 Exponential Families in Finite Feature Spaces

Instead of fitting an infinite-dimensional model on a subset of the available data, the second estimator is based on fitting a finite dimensional approximation using all available data {xi}i=1t\{x_{i}\}_{i=1}^{t}, in primal form. As we will see, updating the estimator when a new data point arrives can be done online.

An example feature embedding based on random Fourier features and a standard Gaussian kernel is ϕx=2m[cos(ω1Tx+u1),,cos(ωmTx+um)]\phi_{x}=\sqrt{\frac{2}{m}}\left[\cos(\omega_{1}^{T}x+u_{1}),\dots,\cos(\omega_{m}^{T}x+u_{m})\right], with ωiN(ω)\omega_{i}\sim{\cal N}(\omega) and uiUniform[0,2π]u_{i}\sim\texttt{Uniform}[0,2\pi]. The estimator has a one-off cost of O(tdm2+m3){\cal O}(tdm^{2}+m^{3}) computation and O(m2){\cal O}(m^{2}) storage. Given that we have computed a solution based on the Markov chain history {xi}i=1t\{x_{i}\}_{i=1}^{t}, however, it is straightforward to update C,bC,b, and the solution θ^λ\hat{\theta}_{\lambda} online, after a new point xt+1x_{t+1} arrives. This is achieved by storing running averages and performing low-rank updates of matrix inversions, and costs O(dm2){\cal O}(dm^{2}) computation and O(m2){\cal O}(m^{2}) storage, independent of tt. Further details are given in Appendix B.

Gradients of the model are f(x)=[ϕx]θ^\nabla f(x)=\left[\nabla\phi_{x}\right]^{\top}\hat{\theta} , i.e., they require the evaluation of the gradient of the feature space embedding, costing O(md){\cal O}(md) computation and and O(m){\cal O}(m) storage.

Kernel Hamiltonian Monte Carlo

Constructing a kernel induced Hamiltonian flow as in Section 3 from the gradients of the infinite dimensional exponential family model (2), and approximate estimators (6),(8), we arrive at a gradient free, adaptive MCMC algorithm: Kernel Hamiltonian Monte Carlo (Algorithm 1).

KMC finite using (8) allows for online updates using the full Markov chain history, and therefore is a more elegant solution than KMC lite, which has greater computational cost and requires sub-sampling the chain history. Due to the parametric nature of KMC finite, however, the tails of the estimator are not guaranteed to decay. For example, the random Fourier feature embedding described below Proposition 2 contains periodic cosine functions, and therefore oscillates in the tails of (8), resulting in a reduced acceptance probability. As we will demonstrate in the experiments, this problem does not appear when KMC finite is initialised in high-density regions, nor after burn-in. In situations where information about the target density support is unknown, and during burn-in, we suggest to use the lite estimator (7), whose gradients decay outside of the training data. As a result, KMC lite is guaranteed to fall back to a Random Walk Metropolis in unexplored regions, inheriting its convergence properties, and smoothly transitions to HMC-like proposals as the MCMC chain grows. A proof of the proposition below can be found in Appendix C.

Assume d=1d=1, π(x)\pi(x) has log-concave tails, the regularity conditions of [22, Thm 2.2] (implying π\pi-irreducibility and smallness of compact sets), that MCMC adaptation stops after a fixed time, and a fixed number LL of ϵ\epsilon-leapfrog steps. If lim supx2f(x)2=0\limsup_{\|x\|_{2}\to\infty}\|\nabla f(x)\|_{2}=0, and M:x:f(x)2M\exists M:\forall x:\|\nabla f(x)\|_{2}\leq M, then KMC lite is geometrically ergodic from π\pi-almost any starting point.

MCMC algorithms that use the history of the Markov chain for constructing proposals might not be asymptotically correct. We follow [12, Sec. 4.2] and the idea of ‘vanishing adaptation’ , to avoid such biases. Let {at}i=0\left\{a_{t}\right\}_{i=0}^{\infty} be a schedule of decaying probabilities such that limtat=0\lim_{t\to\infty}a_{t}=0 and t=0at=\sum_{t=0}^{\infty}a_{t}=\infty. We update the density gradient estimate according to this schedule in Algorithm 1. Intuitively, adaptation becomes less likely as the MCMC chain progresses, but never fully stops, while sharing asymptotic convergence with adaptation that stops at a fixed point [23, Theorem 1]. Note that Proposition 3 is a stronger statement about the convergence rate.

KMC has two free parameters: the Gaussian kernel bandwidth σ\sigma, and the regularisation parameter λ\lambda. As KMC’s performance depends on the quality of the approximate infinite dimensional exponential family model in (6) or (8), a principled approach is to use the score matching objective function in (5) to choose σ,λ\sigma,\lambda pairs via cross-validation (using e.g. ‘hot-started’ black-box optimisation). Earlier adaptive kernel-based MCMC methods did not address parameter choice.

Experiments

We start by quantifying performance of KMC finite on synthetic targets. We emphasise that these results can be reproduced with the lite version.

In order to quantify efficiency in growing dimensions, we study hypothetical acceptance rates along trajectories on the kernel induced Hamiltonian flow (no MCMC yet) on a challenging Gaussian target: We sample the diagonal entries of the covariance matrix from a Gamma(1,1) distribution and rotate with a uniformly sampled random orthogonal matrix. The resulting target is challenging to estimate due to its ‘non-singular smoothness’, i.e., substantially differing length-scales across its principal components. As a single Gaussian kernel is not able to effeciently represent such scaling families, we use a rational quadratic kernel for the gradient estimation, whose random features are straightforward to compute. Figure 2 shows the average acceptance over 100100 independent trials as a function of the number of (ground truth) samples and basis functions, which are set to be equal n=mn=m, and of dimension dd. In low to moderate dimensions, gradients of the finite estimator lead to acceptance rates comparable to plain HMC. On targets with more ‘regular’ smoothness, the estimator performs well in up to d100d\approx 100, with less variance. See Appendix D.1 for details.

We next apply KMC to sample from the marginal posterior over hyper-parameters of a Gaussian Process Classification (GPC) model on the UCI Glass dataset . Classical HMC cannot be used for this problem, due to the intractability of the marginal data likelihood. Our experimental protocol mostly follows [12, Section 5.1], see Appendix D.3, but uses only 6000 MCMC iterations without discarding a burn-in, i.e., we study how fast KMC initially explores the target. We compare convergence in terms of all mixed moments of order up to 3 to a set of benchmark samples (MMD , lower is better). KMC randomly uses between 1 and 10 leapfrog steps of a size chosen uniformly in [0.01,0.1][0.01,0.1], a standard Gaussian momentum, and a kernel tuned by cross-validation, see Appendix D.3. We did not extensively tune the HMC parameters of KMC as the described settings were sufficient. Both KMC and KAMH used 1000 samples from the chain history. Figure 4 (left) shows that KMC’s burn-in contains a short ‘exploration phase’ where produced estimates are bad, due to it falling back to a random walk in unexplored regions, c.f. Proposition 3. From around 500 iterations, however, KMC clearly outperforms both RW and the earlier state-of-the-art KAMH. These results are backed by the minimum ESS (not plotted), which is around 415 for KMC and is around 35 and 25 for KAMH and RW, respectively. Note that all samplers effectively stop improving from 3000 iterations – indicating a burn-in bias. All samplers took 1h time, with most time spent estimating the marginal likelihood.

We now apply KMC in the context of Approximate Bayesian Computation (ABC), which often is employed when the data likelihood is intractable but can be obtained by simulation, see e.g. . ABC-MCMC targets an approximate posterior by constructing an unbiased Monte Carlo estimator of the approximate likelihood. As each such evaluation requires expensive simulations from the likelihood, the goal of all ABC methods is to reduce the number of such simulations. Accordingly, Hamiltonian ABC was recently proposed , combining the synthetic likelihood approach with gradients based on stochastic finite differences. We remark that this requires to simulate from the likelihood in every leapfrog step, and that the additional bias from the Gaussian likelihood approximation can be problematic. In contrast, KMC does not require simulations to construct a proposal, but rather ‘invests’ simulations into an accept/reject step (3) that ensures convergence to the original ABC target. Figure 4 (right) compares performance of RW, HABC (sticky random numbers and SPAS, [8, Sec. 4.3, 4.4]), and KMC on a 1010-dimensional skew-normal distribution p(yθ)=2N(θ,I)Φ(α,y)p(y|\theta)=2\mathcal{N}\left(\theta,I\right)\Phi\left(\left\langle\alpha,y\right\rangle\right) with θ=α=110\theta=\alpha=\mathbf{1}\cdot 10. KMC mixes as well as HABC, but HABC suffers from a severe bias. KMC also reduces the number of simulations per proposal by a factor 2L=1002L=100. See Appendix D.4 for details.

Discussion

We have introduced KMC, a kernel-based gradient free adaptive MCMC algorithm that mimics HMC’s behaviour by estimating target gradients in an RKHS. In experiments, KMC outperforms random walk based sampling methods in up to d=50d=50 dimensions, including the recent kernel-based KAMH . KMC is particularly useful when gradients of the target density are unavailable, as in PM-MCMC or ABC-MCMC, where classical HMC cannot be used. We have proposed two efficient empirical estimators for the target gradients, each with different strengths and weaknesses, and have given experimental evidence for the robustness of both.

Future work includes establishing theoretical consistency and uniform convergence rates for the empirical estimators, for example via using recent analysis of random Fourier Features with tight bounds , and a thorough experimental study in the ABC-MCMC context where we see a lot of potential for KMC. It might also be possible to use KMC as a precomputing strategy to speed up classical HMC as in . For code, see https://github.com/karlnapf/kernel_hmc

References

Appendix

The Appendix contains proofs for Propositions 1 and 2, as well as additional computational details for both KMC lite in Section A and KMC finite in Section B. Section C covers the proof of geometric ergodicity of KMC lite from Proposition 3. Section D describes further experimental details.

The proof below extends the model in [20, Section 4.1]. We assume that the model log-density (4) takes the form in Proposition 1, then directly implement score functions (5), from which we derive an empirical score matching objective as a system of linear equations.

As assumed the log unnormalised density takes the form

The score functions for (5) are then given by

We now rewrite J(α)J(\alpha) in matrix form. The expression for the term J(α)J(\alpha) being optimised is the sum of two terms.

The final term may be computed with the right ordering of operations,

In matrix notation, the inner sum is a column vector,

We take the entry-wise square and sum the resulting vector. Denote by Dx:=diag(x)D_{x}:=\text{diag}(x), then the following two relations hold

This means that J(α)J(\alpha) as defined previously,

Assuming CC is invertible, this is minimised by

Reduced Computational Costs via Low-rank Approximations and Conjugate Gradient

Solving the linear system in (7) requires O(n3){\cal O}(n^{3}) computation and O(n2){\cal O}(n^{2}) storage for a fixed random sub-sample of the chain history z.\mathbf{z}. In order to allow for large nn, and to exploit potential manifold structure in the RKHS, we apply a low-rank approximation to the kernel matrix via incomplete Cholesky [28, Alg. 5.12], that is a standard way to achieve linear computational costs for kernel methods. We rewrite the kernel matrix

Appendix B Finite Feature Space Estimator

We assume the model log-density (4) takes the primal form in a finite dimensional feature space as in Proposition 2, then again directly implement score functions in (5) and minimise it via a linear solve.

As assumed the log unnormalised density takes the form

Assuming that CC is invertible (trivial for nmn\geq m), the objective is uniquely minimised by differentiating (11) wrt. θ\theta, setting to zero, and solving for θ\theta. This gives

Next, we give an example for the approximate feature space Hm{\cal H}_{m}. Note that the above approach can be combined with any set of finite dimensional approximate feature mappings ϕx\phi_{x}.

Example: Random Fourier Features for the Gaussian Kernel

where Γ(ω)\Gamma(\omega) is the Fourier transform of the kernel. An approximate feature mapping for such kernels can be obtained via dropping imaginary terms and approximating the integral with Monte Carlo integration. This gives

with fixed random basis vector realisations that depend on the kernel via Γ(ω)\Gamma(\omega),

for i=1mi=1\dots m. It is easy to see that this approximation is consistent for mm\to\infty, i.e.

See for details and a uniform convergence bound and for a more detailed analysis with tighter bounds. Note that it is possible to achieve logarithmic computational costs in dd exploiting properties of Hadamard matrices .

The feature map derivatives (10) are given by

where \odot is the element-wise product. Consequently the gradient is given by

As an example, the translation invariant Gaussian kernel and its Fourier transform are

Constant Cost Updates

using cheap triangular back-substitution from Lˉt\bar{L}_{t}, and never storing Cˉt1\bar{C}_{t}^{-1} or Lˉt1\bar{L}_{t}^{-1} explicitly.

Using such updates, the computational costs for updating the approximate infinite dimensional exponential family model in every iteration of the Markov chain are O(dm2){\cal O}(dm^{2}), which constant in tt. We can therefore use all points in the history for constructing a proposal. See our implementation for further details.

Perform rank-dd update to obtain updated Cholesky factorisation Lˉt+1Lˉt+1T=Cˉt+1\bar{L}_{t+1}\bar{L}_{t+1}^{T}=\bar{C}_{t+1}.

Update approximate infinite dimensional exponential family parameters

Appendix C Ergodicity of KMC lite

Denote by α(xt,x(p))\alpha(x_{t},x^{*}(p^{\prime})) the probability of accepting a (p,x)(p^{\prime},x^{*}) proposal at state xtx_{t}. Let ab=min(a,b)a\wedge b=\min(a,b). Define c(x(0)):=Lϵ2logπ(x(0))/2+ϵ2i=1L1(Li)logπ(x(iϵ))c(x^{(0)}):=L\epsilon^{2}\nabla\log\pi(x^{(0)})/2+\epsilon^{2}\sum_{i=1}^{L-1}(L-i)\nabla\log\pi(x^{(i\epsilon)}) and d(x(0)):=ϵ(f(x(0))+f(x(Lϵ)))/2+ϵi=1L1f(x(iϵ))d(x^{(0)}):=\epsilon(\nabla f(x^{(0)})+\nabla f(x^{(L\epsilon)}))/2+\epsilon\sum_{i=1}^{L-1}\nabla f(x^{(i\epsilon)}), where x(iϵ)x^{(i\epsilon)} is the ii-th point of the leapfrog integration from x=x(0)x=x^{(0)}.

We assumed π(x)\pi(x) is log-concave in the tails, meaning xU>0\exists x_{U}>0 s.t. for x>xt>xUx^{*}>x_{t}>x_{U}, we have π(x)/π(xt)eα1(x2xt2)\pi(x^{*})/\pi(x_{t})\leq e^{-\alpha_{1}(\|x^{*}\|_{2}-\|x_{t}\|_{2})} and for xt>x>xUx_{t}>x^{*}>x_{U}, we have π(x)/π(xt)eα1(x2xt2)\pi(x^{*})/\pi(x_{t})\geq e^{-\alpha_{1}(\|x^{*}\|_{2}-\|x_{t}\|_{2})}, and a similar condition holds in the negative tail. Furthermore, we assumed fixed HMC parameters: LL leapfrog steps of size ϵ\epsilon, and wlog the identity mass matrix II. Following , it is sufficient to show

for some s>0s>0, where μ()\mu(\cdot) is a standard Gaussian measure. Denoting the integral II_{-\infty}^{\infty}, we split it into

for some δ(0,1)\delta\in(0,1). We show that the first and third terms decay to zero whilst the second remains strictly negative as xtx_{t}\to\infty (a similar argument holds as xtx_{t}\to-\infty). We detail the case f(x)0\nabla f(x)\uparrow 0 as xx\to\infty here, the other is analogous. Taking IxtδxtδI_{-x_{t}^{\delta}}^{x_{t}^{\delta}}, we can choose an xtx_{t} large enough that xtCLϵxtδ>xUx_{t}-C-L\epsilon x_{t}^{\delta}>x_{U}, γ1<c(xtxtδ)<0-\gamma_{1}<c(x_{t}-x_{t}^{\delta})<0 and γ2<d(xtxtδ)<0-\gamma_{2}<d(x_{t}-x_{t}^{\delta})<0. So for p(0,xtδ)p^{\prime}\in(0,x_{t}^{\delta}) we have

where the last inequality comes from the log-concave tails assumption. For p(γ22/2,xtδ)p^{\prime}\in(\gamma_{2}^{2}/2,x_{t}^{\delta})

where xtx_{t} is large enough that α2=α1Lϵγ2/2>0\alpha_{2}=\alpha_{1}L\epsilon-\gamma_{2}/2>0. Similarly for p(γ1/Lϵ,xtδ)p^{\prime}\in(\gamma_{1}/L\epsilon,x_{t}^{\delta})

Because γ1\gamma_{1} and γ2\gamma_{2} can be chosen to be arbitrarily small, then for large enough xtx_{t} we will have

where c1=α1γ1γ22/2>0c_{1}=\alpha_{1}\gamma_{1}-\gamma_{2}^{2}/2>0 for large enough xtx_{t}, as γ1\gamma_{1} and γ2\gamma_{2} are of the same order. Now turning to p(xtδ,0)p^{\prime}\in(-x_{t}^{\delta},0), we can use an exact rearrangement of the same argument (noting that c1c_{1} can be made arbitrarily small) to get

Combining (14) and (15) and rearranging as in [30, Theorem 3.2] shows that IxtδxtδI_{-x_{t}^{\delta}}^{x_{t}^{\delta}} is strictly negative in the limit if s2=sLϵs_{2}=sL\epsilon is chosen small enough, as I0γ2/LϵI_{0}^{\gamma_{2}/L\epsilon} can also be made arbitrarily small.

For IxtδI_{-\infty}^{-x_{t}^{\delta}} it suffices to note that the Gaussian tails of μ()\mu(\cdot) will dominate the exponential growth of es(x(p)2xt2)e^{s(\|x^{*}(p^{\prime})\|_{2}-\|x_{t}\|_{2})} meaning the integral can be made arbitrarily small by choosing large enough xtx_{t}, and the same argument holds for IxtδI_{x_{t}^{\delta}}^{\infty}. ∎

Appendix D Additional Experimental Details

This section contains additional details for the experiments in Section 6.

We reproduce the experiment in Figure 2 on an isotropic Gaussian in increasing dimension. As length-scales across all principal components are equal, this is a significantly less challenging target to estimate gradients for; though still useful as a benchmark representing very smooth targets. We use a standard Gaussian kernel and the same experimental protocol as for Figure 2. The estimator works slightly better than on the target considered in Figure 2, and performs well up to d100d\approx 100, see Figure 5.

D.2 Banana target

We choose d=8d=8, V=100V=100 and b=0.03b=0.03, which corresponds to the ‘strongly twisted’ 8-dimensional Banana in . The target is challenging due to the nonlinear dependence of the first two dimensions and the highly position dependent scaling within these dimensions.

D.3 Pseudo-Marginal MCMC for GP Classification

Closely following , we consider a joint distribution of GP-latent variables f\mathbf{f}, labels y\mathbf{y} (with covariate matrix XX), and hyper-parameters θ\theta, given by

where yi{1,1}y_{i}\in\{-1,1\}. Aiming for a fully Bayesian treatment, we wish to estimate the marginal posterior of the hyper-parameters θ\theta, motivated in . The marginal likelihood p(yθ)p(\mathbf{y}|\theta) is intractable for non-Gaussian likelihoods p(yf)p(\mathbf{y}|\mathbf{f}), but can be replaced with an unbiased estimate

where {f(i)}i=1nimpq(fθ)\left\{\mathbf{f}^{(i)}\right\}_{i=1}^{n_{\textrm{imp}}}\sim q(\mathbf{f}|\theta) are nimpn_{imp} importance samples. In , the importance distribution q(fθ)q(\mathbf{f}|\theta) is chosen as the Laplacian or as the Expectation Propagation (EP) approximation of p(fy,θ)p(yf)p(fθ)p(\mathbf{f}|\mathbf{y},\theta)\propto p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\theta), leading to state-of-the-art results.

We here use a Laplace approximation and nimp=100n_{\text{imp}}=100. We consider classification of window against non-window glass in the UCI Glass dataset, which induces a posterior that has a nonlinear shape [12, Figure 3]. Since the ground truth for the hyperparameter posterior is not available, we initially run multiple hand-tuned standard Metropolis-Hastings chains for 500,000 iterations (with a 100,000 burn-in), keep every 1000-th sample in each of the chains, and combine them. The resulting samples are used as a benchmark, to evaluate the performance all algorithms. We use the MMD between each sampler output and the benchmark sample is computed, using the polynomial kernel (1+θ,θ)3\left(1+\left\langle\theta,\theta^{\prime}\right\rangle\right)^{3}. This corresponds to the estimation error of all mixed moments of order up to 3.

Kernel parameters are tuned using a black box Bayesian optimisation packageWe use the open-source package pybo, available under https://github.com/mwhoffman/pybo and the median heuristic for KMC and KAMH repsectively. The Bayesian optimisation uses standard parameters and is stopped after 15 iterations, where each trial is done via a 5-fold cross-validation of the score matching objective (5). We learn parameters after MCMC 500 iterations, and then re-learn after 2000. We tried re-learning parameters after more iterations, but this did not lead to significant changes. The costs for this are neglectable in the context of PM-MCMC as estimating the marginal likelihood takes significantly more time than generating the KMC proposal.

D.4 ABC MCMC

In this section, we give a brief background on Approximate Bayesian Computation, and how KMC can be used within the framework. We then give details of the competing approach in the final experiment in Section 6, including experimental details and an analytic counterexample.

Approximate Bayesian Computation is a method for inference in the scenario where conditional on some parameter of interest θ\theta, we can easily simulate data xf(θ)x\sim f(\cdot|\theta), but for which the likelihood function ff is unavailable . We however have data yy which assume to be from the model, and we have a prior π0(θ)\pi_{0}(\theta). A simple ABC algorithm is to sample θiπ0()\theta_{i}\sim\pi_{0}(\cdot) (or any other suitable distribution), simulate data xif(θi)x_{i}\sim f(\cdot|\theta_{i}), and ‘accept’ xix_{i} as a sample from the approximate posterior πϵ(θy)\pi_{\epsilon}(\theta|y) if d(y,x)ϵd(y,x)\leq\epsilon. This procedure can be formalised by defining the approximate likelihood as

where gϵ(yx,θ)g_{\epsilon}(y|x,\theta) is an appropriate kernel that gives more importance to points for which d(y,x)d(y,x) is smaller. In the simple case above gϵ(yx,θ)=1{d(y,x)ϵ}g_{\epsilon}(y|x,\theta)=\mathbf{1}_{\{d(y,x)\leq\epsilon\}}. The ABC posterior is then found using πϵ(θy)fϵ(yθ)π0(θ)\pi_{\epsilon}(\theta|y)\propto f_{\epsilon}(y|\theta)\pi_{0}(\theta). Often gϵg_{\epsilon} is based on some low-dimensional summary statistics, which can have both advantages and disadvantages.

There are many different way to do ABC, and clearly not all involve Markov chain Monte Carlo. If the posterior however is not similar to the prior, and if θ\theta is more than three or four dimensional, MCMC is a sensible option. Since the likelihood (17) is intractable, typically algorithms are considered for which an approximation to either the likelihood, or the ABC posterior are used either in constructing proposals, defining Metropolis-Hastings acceptance rates, or both. We focus here on samplers which target πϵ(θy)\pi_{\epsilon}(\theta|y) directly, c.f. .

Similar to the approach taken in Section D.3, we here accept proposals θQ(θ,)\theta^{\prime}\sim Q(\theta,\cdot) where QQ is some proposal mechanism (i.e. KMC), via replacing the likelihood with an unbiased estimate. We accept according to the ratio

Following , one idea to approximate the intractable likelihood is to draw nlikn_{\text{lik}} samples xif(θi)x_{i}\sim f(\cdot|\theta_{i}), and fit a Gaussian approximation to ff, producing estimates μ^\hat{\mu} and Σ^\hat{\Sigma} for the mean and covariance using {xi}i=1nlik\{x_{i}\}_{i=1}^{n_{\text{lik}}}. If the error functon gϵg_{\epsilon} is also chosen to be a Gaussian (with mean yy and variance ϵ\epsilon), then the marginal likelihood fϵ(yθ)f_{\epsilon}(y|\theta) can be approximated as

The likelihood is essentially approximated by a Gaussian fGf_{G}, producing a synthetic posterior πs()\pi_{s}(\cdot), which is then used in the accept-reject step. Clearly some approximation error is introduced by the Gaussian likelihood approximation step, but as shown in , it can be a reasonable choice for some models.

Introduced in , the synthetic likelihood formulation is used to construct a proposal, with the accept-reject step removed altogether. Hamiltonian dynamics use the gradient logπ(θ)\nabla\log\pi(\theta) to suggest candidate values for the next state of a Markov chain which are far from the current point, thus increasing the chances that the chain mixes quickly. Here the gradient of the log-likelihood is unavailable, so is approximated with that of a Gaussian (since the map θ(μ,Σ)\theta\to(\mu,\Sigma) is not always clear this is done numerically, using a stochastic finite differences estimate of the gradient, SPAS [8, Sec. 4.3, 4.4]), giving

Since there is no accept-reject step, the synthetic posterior is also the target of this scheme (although there is also further bias introduced by discretisation error), but the introduction of gradient-based dynamics is hoped to improve mixing and hence efficiency of inferences compared to random-walk type schemes.

We give a very simple toy model to highlight the bias introduced by the Hamiltonian ABC sampler. Consider posterior inference for the mean parameter in a log-Normal model. Specifically, the true model is

where the precision τ\tau and hyper-parameters μ0,τ0\mu_{0},\tau_{0} are known. The model is in fact conjugate, giving a Gaussian posterior

If we introduce a Gaussian approximation to the likelihood, then the mean and precision of this approximation fGf_{G} are (empirical estimates for)

which depend on the current value for μ\mu in the chain. The resulting synthetic posterior is no longer tractable, but since it is one dimensional we can approximate it numerically. Using μ0=0\mu_{0}=0, τ0=1/100\tau_{0}=1/100, ϵ=0.1\epsilon=0.1 and τ=1\tau=1 then the true and approximate posteriors for 100 data points generated using the truth μ=2\mu=2 are shown in Figure 6. This is a proof of concept that a likelihood with a positive skew being approximated by a Gaussian introduces an upwards bias to the posterior.

The simulation study in Section 6 uses a slightly more complex and multi-dimensional simulation example: a 10-dimensional multivariate skew-Normal distribution, given by

with θ=α=110\theta=\alpha=\mathbf{1}\cdot 10. In each iteration of KMC, the likelihood is estimated via simulating nlik=10n_{\text{lik}}=10 samples from the above likelihood. We use the mean of all samples as summary statistic, and a Gaussian similarity kernel gϵ(yx,θ)g_{\epsilon}(y|x,\theta) with a fixed ϵ=0.55\epsilon=0.55. Both KMC and HABC use a standard Gaussian momentum, a uniformly random stepsize in [0.01,0.1][0.01,0.1] and L=50L=50 leapfrog steps. HABC is used with the suggested ‘sticky random numbers’ [8, Section 4.4], i.e. we use the same seed for all simulations along a single proposal trajectory. Both algorithms are run for 200+5000200+5000 MCMC iterations. KMC then attempts to re-learn smoothness parameters, and stops adaptation. Burn-in samples are discarded when quantifying performance of all algorithms.

HABC is used in its ‘stochastic gradient’ and has a ‘friction’ parameter that we estimate using a running average of the global covariance of all SPAS gradient evaluations, [8, Equation 21]. Note that we ran HABC with both the friction term included and removed, where we found that adding friction has severely negative impact on mixing, where not adding friction results in a wider posterior (with the same bias). Figure 4 (middle, right) show the results without friction, Figure 7 shows the same plots with friction. We refer to our implementation for further details.

Due to the gradient estimation in every of the L=50L=50 leap-frog steps, every MCMC proposal for HABC requires 2L=1002L=100 simulations to be generated. In contrast, KMC only requires a single simulation, for evaluating the accept/reject probability (3). We leave studying the exact trade-offs of KMC’s learning phase and its ability to mix well as compared to HABC to future work.