Gradient-free Hamiltonian Monte Carlo with Efficient Kernel Exponential Families
Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltan Szabo, Arthur Gretton
Introduction
Estimating expectations using Markov Chain Monte Carlo (MCMC) is a fundamental approximate inference technique in Bayesian statistics. MCMC itself can be computationally demanding, and the expected estimation error depends directly on the correlation between successive points in the Markov chain. Therefore, efficiency can be achieved by taking large steps with high probability.
Hamiltonian Monte Carlo is an MCMC algorithm that improves efficiency by exploiting gradient information. It simulates particle movement along the contour lines of a dynamical system constructed from the target density. Projections of these trajectories cover wide parts of the target’s support, and the probability of accepting a move along a trajectory is often close to one. Remarkably, this property is mostly invariant to growing dimensionality, and HMC here often is superior to random walk methods, which need to decrease their step size at a much faster rate [1, Sec. 4.4].
Unfortunately, for a large class of problems, gradient information is not available. For example, in Pseudo-Marginal MCMC (PM-MCMC) , the posterior does not have an analytic expression, but can only be estimated at any given point, e.g. in Bayesian Gaussian Process classification . A related setting is MCMC for Approximate Bayesian Computation (ABC-MCMC), where the posterior is approximated through repeated simulation from a likelihood model . In both cases, HMC cannot be applied, leaving random walk methods as the only mature alternative. There have been efforts to mimic HMC’s behaviour using stochastic gradients from mini-batches in Big Data , or stochastic finite differences in ABC . Stochastic gradient based HMC methods, however, often suffer from low acceptance rates or additional bias that is hard to quantify .
Random walk methods can be tuned by matching scaling of steps and target. For example, Adaptive Metropolis-Hastings (AMH) is based on learning the global scaling of the target from the history of the Markov chain. Yet, for densities with nonlinear support, this approach does not work very well. Recently, introduced a Kernel Adaptive Metropolis-Hastings (KAMH) algorithm whose proposals are locally aligned to the target. By adaptively learning target covariance in a Reproducing Kernel Hilbert Space (RKHS), KAMH achieves improved sampling efficiency.
In this paper, we extend the idea of using kernel methods to learn efficient proposal distributions . Rather than locally smoothing the target density, however, we estimate its gradients globally. More precisely, we fit an infinite dimensional exponential family model in an RKHS via score matching . This is a non-parametric method of modelling the log unnormalised target density as an RKHS function, and has been shown to approximate a rich class of density functions arbitrarily well. More importantly, the method has been empirically observed to be relatively robust to increasing dimensionality – in sharp contrast to classical kernel density estimation [15, Sec. 6.5]. Gaussian Processes (GP) were also used in as an emulator of the target density in order to speed up HMC, however, this requires access to the target in closed form, to provide training points for the GP.
We require our adaptive KMC algorithm to be computationally efficient, as it deals with high-dimensional MCMC chains of growing length. We develop two novel approximations to the infinite dimensional exponential family model. The first approximation, score matching lite, is based on computing the solution in terms of a lower dimensional, yet growing, subspace in the RKHS. KMC with score matching lite (KMC lite) is geometrically ergodic on the same class of targets as standard random walks. The second approximation uses a finite dimensional feature space (KMC finite), combined with random Fourier features . KMC finite is an efficient online estimator that allows to use all of the Markov chain history, at the cost of decreased efficiency in unexplored regions. A choice between KMC lite and KMC finite ultimately depends on the ability to initialise the sampler within high-density regions of the target; alternatively, the two approaches could be combined.
Experiments show that KMC inherits the efficiency of HMC, and therefore mixes significantly better than state-of-the-art gradient-free adaptive samplers on a number of target densities, including on synthetic examples, and when used in PM-MCMC and ABC-MCMC. All code can be found at https://github.com/karlnapf/kernel_hmc
Background and Previous Work
In the absence of , the usual choice of is a random walk, i.e. A popular choice of the scaling is . When the scale of the target density is not uniform across dimensions, or if there are strong correlations, the AMH algorithm improves mixing by adaptively learning global covariance structure of from the history of the Markov chain. For cases where the local scaling does not match the global covariance of , i.e. the support of the target is nonlinear, KAMH improves mixing by learning the target covariance in a RKHS. KAMH proposals are Gaussian with a covariance that matches the local covariance of around the current state , without requiring access to .
In practice, (1) is usually unavailable and we need to resort to approximations. Here, we limit ourselves to the leap-frog integrator; see for details. To correct for discretisation error, a Metropolis acceptance procedure can be applied: starting from , the end-point of the approximate trajectory is accepted with probability . HMC is often able to propose distant, uncorrelated moves with a high acceptance probability.
In many cases the gradient of cannot be written in closed form, leaving random-walk based methods as the state-of-the-art . We aim to overcome random-walk behaviour, so as to obtain significantly more efficient sampling .
Kernel Induced Hamiltonian Dynamics
KMC replaces the potential energy in (1) by a kernel induced surrogate computed from the history of the Markov chain. This surrogate does not require gradients of the log-target density. The surrogate induces a kernel Hamiltonian flow, which can be numerically simulated using standard leap-frog integration. As with the discretisation error in HMC, any deviation of the kernel induced flow from the true flow is corrected via a Metropolis acceptance procedure. This here also contains the estimation noise from and re-uses previous values of , c.f. [3, Table 1]. Consequently, the stationary distribution of the chain remains correct, given that we take care when adapting the surrogate.
We construct a kernel induced potential energy surrogate whose gradients approximate the gradients of the true potential energy in (1), without accessing or directly, but only using the history of the Markov chain. To that end, we model the (unnormalised) target density with an infinite dimensional exponential family model of the form
We define a kernel induced Hamiltonian operator by replacing in the potential energy part in (1) by our kernel surrogate . It is clear that, depending on , the resulting kernel induced Hamiltonian flow differs from the original one. That said, any bias on the resulting Markov chain, in addition to discretisation error from the leap-frog integrator, is naturally corrected for in the Pseudo-Marginal Metropolis step. We accept an end-point of a trajectory starting at along the kernel induced flow with probability
where corresponds to the true Hamiltonian at . Here, in the Pseudo-Marginal context, we replace both terms in the ratio in (3) by unbiased estimates, i.e., we replace within with an unbiased estimator . Note that this also involves ‘recycling’ the estimates of from previous iterations to ensure anyymptotic correctness, c.f. [3, Table 1]. Any deviations of the kernel induced flow from the true flow result in a decreased acceptance probability (3). We therefore need to control the approximation quality of the kernel induced potential energy to maintain high acceptance probability in practice. See Figure 1 for an illustrative example.
Two Efficient Estimators for Exponential Families in RKHS
We now address estimating the infinite dimensional exponential family model (2) from data. The original estimator in has a large computational cost. This is problematic in the adaptive MCMC context, where the model has to be updated on a regular basis. We propose two efficient approximations, each with its strengths and weaknesses. Both are based on score matching.
Following , we model an unnormalised log probability density with a parametric model
where we note that the normalising constants vanish from taking the gradient . As shown in [14, Theorem 1], it is possible to compute an empirical version without accessing or other than through observed samples,
Our approximations of the original model (2) are based on minimising (5) using approximate scores.
2 Infinite Dimensional Exponential Families Lite
The original estimator of in (2) takes a dual form in a RKHS sub-space spanned by kernel derivatives, [13, Thm. 4]. The update of the proposal at the iteration of MCMC requires inversion of a matrix. This is clearly prohibitive if we are to run even a moderate number of iterations of a Markov chain. Following , we take a simple approach to avoid prohibitive computational costs in : we form a proposal using a random sub-sample of fixed size from the Markov chain history, . In order to avoid excessive computation when is large, we replace the full dual solution with a solution in terms of , which covers the support of the true density by construction, and grows with increasing . That is, we assume that the model (4) takes the ‘light’ form
Given a set of samples and assuming for the Gaussian kernel of the form , and the unique minimiser of the -regularised empirical score matching objective (5) is given by
The estimator costs computation (for computing , and for inverting ) and storage, for a fixed random chain history sub-sample size . This can be further reduced via low-rank approximations to the kernel matrix and conjugate gradient methods, which are derived in Appendix A.
Gradients of the model are given as , i.e. they simply require to evaluate gradients of the kernel function. Evaluation and storage of both cost .
3 Exponential Families in Finite Feature Spaces
Instead of fitting an infinite-dimensional model on a subset of the available data, the second estimator is based on fitting a finite dimensional approximation using all available data , in primal form. As we will see, updating the estimator when a new data point arrives can be done online.
An example feature embedding based on random Fourier features and a standard Gaussian kernel is , with and . The estimator has a one-off cost of computation and storage. Given that we have computed a solution based on the Markov chain history , however, it is straightforward to update , and the solution online, after a new point arrives. This is achieved by storing running averages and performing low-rank updates of matrix inversions, and costs computation and storage, independent of . Further details are given in Appendix B.
Gradients of the model are , i.e., they require the evaluation of the gradient of the feature space embedding, costing computation and and storage.
Kernel Hamiltonian Monte Carlo
Constructing a kernel induced Hamiltonian flow as in Section 3 from the gradients of the infinite dimensional exponential family model (2), and approximate estimators (6),(8), we arrive at a gradient free, adaptive MCMC algorithm: Kernel Hamiltonian Monte Carlo (Algorithm 1).
KMC finite using (8) allows for online updates using the full Markov chain history, and therefore is a more elegant solution than KMC lite, which has greater computational cost and requires sub-sampling the chain history. Due to the parametric nature of KMC finite, however, the tails of the estimator are not guaranteed to decay. For example, the random Fourier feature embedding described below Proposition 2 contains periodic cosine functions, and therefore oscillates in the tails of (8), resulting in a reduced acceptance probability. As we will demonstrate in the experiments, this problem does not appear when KMC finite is initialised in high-density regions, nor after burn-in. In situations where information about the target density support is unknown, and during burn-in, we suggest to use the lite estimator (7), whose gradients decay outside of the training data. As a result, KMC lite is guaranteed to fall back to a Random Walk Metropolis in unexplored regions, inheriting its convergence properties, and smoothly transitions to HMC-like proposals as the MCMC chain grows. A proof of the proposition below can be found in Appendix C.
Assume , has log-concave tails, the regularity conditions of [22, Thm 2.2] (implying -irreducibility and smallness of compact sets), that MCMC adaptation stops after a fixed time, and a fixed number of -leapfrog steps. If , and , then KMC lite is geometrically ergodic from -almost any starting point.
MCMC algorithms that use the history of the Markov chain for constructing proposals might not be asymptotically correct. We follow [12, Sec. 4.2] and the idea of ‘vanishing adaptation’ , to avoid such biases. Let be a schedule of decaying probabilities such that and . We update the density gradient estimate according to this schedule in Algorithm 1. Intuitively, adaptation becomes less likely as the MCMC chain progresses, but never fully stops, while sharing asymptotic convergence with adaptation that stops at a fixed point [23, Theorem 1]. Note that Proposition 3 is a stronger statement about the convergence rate.
KMC has two free parameters: the Gaussian kernel bandwidth , and the regularisation parameter . As KMC’s performance depends on the quality of the approximate infinite dimensional exponential family model in (6) or (8), a principled approach is to use the score matching objective function in (5) to choose pairs via cross-validation (using e.g. ‘hot-started’ black-box optimisation). Earlier adaptive kernel-based MCMC methods did not address parameter choice.
Experiments
We start by quantifying performance of KMC finite on synthetic targets. We emphasise that these results can be reproduced with the lite version.
In order to quantify efficiency in growing dimensions, we study hypothetical acceptance rates along trajectories on the kernel induced Hamiltonian flow (no MCMC yet) on a challenging Gaussian target: We sample the diagonal entries of the covariance matrix from a Gamma(1,1) distribution and rotate with a uniformly sampled random orthogonal matrix. The resulting target is challenging to estimate due to its ‘non-singular smoothness’, i.e., substantially differing length-scales across its principal components. As a single Gaussian kernel is not able to effeciently represent such scaling families, we use a rational quadratic kernel for the gradient estimation, whose random features are straightforward to compute. Figure 2 shows the average acceptance over independent trials as a function of the number of (ground truth) samples and basis functions, which are set to be equal , and of dimension . In low to moderate dimensions, gradients of the finite estimator lead to acceptance rates comparable to plain HMC. On targets with more ‘regular’ smoothness, the estimator performs well in up to , with less variance. See Appendix D.1 for details.
We next apply KMC to sample from the marginal posterior over hyper-parameters of a Gaussian Process Classification (GPC) model on the UCI Glass dataset . Classical HMC cannot be used for this problem, due to the intractability of the marginal data likelihood. Our experimental protocol mostly follows [12, Section 5.1], see Appendix D.3, but uses only 6000 MCMC iterations without discarding a burn-in, i.e., we study how fast KMC initially explores the target. We compare convergence in terms of all mixed moments of order up to 3 to a set of benchmark samples (MMD , lower is better). KMC randomly uses between 1 and 10 leapfrog steps of a size chosen uniformly in , a standard Gaussian momentum, and a kernel tuned by cross-validation, see Appendix D.3. We did not extensively tune the HMC parameters of KMC as the described settings were sufficient. Both KMC and KAMH used 1000 samples from the chain history. Figure 4 (left) shows that KMC’s burn-in contains a short ‘exploration phase’ where produced estimates are bad, due to it falling back to a random walk in unexplored regions, c.f. Proposition 3. From around 500 iterations, however, KMC clearly outperforms both RW and the earlier state-of-the-art KAMH. These results are backed by the minimum ESS (not plotted), which is around 415 for KMC and is around 35 and 25 for KAMH and RW, respectively. Note that all samplers effectively stop improving from 3000 iterations – indicating a burn-in bias. All samplers took 1h time, with most time spent estimating the marginal likelihood.
We now apply KMC in the context of Approximate Bayesian Computation (ABC), which often is employed when the data likelihood is intractable but can be obtained by simulation, see e.g. . ABC-MCMC targets an approximate posterior by constructing an unbiased Monte Carlo estimator of the approximate likelihood. As each such evaluation requires expensive simulations from the likelihood, the goal of all ABC methods is to reduce the number of such simulations. Accordingly, Hamiltonian ABC was recently proposed , combining the synthetic likelihood approach with gradients based on stochastic finite differences. We remark that this requires to simulate from the likelihood in every leapfrog step, and that the additional bias from the Gaussian likelihood approximation can be problematic. In contrast, KMC does not require simulations to construct a proposal, but rather ‘invests’ simulations into an accept/reject step (3) that ensures convergence to the original ABC target. Figure 4 (right) compares performance of RW, HABC (sticky random numbers and SPAS, [8, Sec. 4.3, 4.4]), and KMC on a -dimensional skew-normal distribution with . KMC mixes as well as HABC, but HABC suffers from a severe bias. KMC also reduces the number of simulations per proposal by a factor . See Appendix D.4 for details.
Discussion
We have introduced KMC, a kernel-based gradient free adaptive MCMC algorithm that mimics HMC’s behaviour by estimating target gradients in an RKHS. In experiments, KMC outperforms random walk based sampling methods in up to dimensions, including the recent kernel-based KAMH . KMC is particularly useful when gradients of the target density are unavailable, as in PM-MCMC or ABC-MCMC, where classical HMC cannot be used. We have proposed two efficient empirical estimators for the target gradients, each with different strengths and weaknesses, and have given experimental evidence for the robustness of both.
Future work includes establishing theoretical consistency and uniform convergence rates for the empirical estimators, for example via using recent analysis of random Fourier Features with tight bounds , and a thorough experimental study in the ABC-MCMC context where we see a lot of potential for KMC. It might also be possible to use KMC as a precomputing strategy to speed up classical HMC as in . For code, see https://github.com/karlnapf/kernel_hmc
References
Appendix
The Appendix contains proofs for Propositions 1 and 2, as well as additional computational details for both KMC lite in Section A and KMC finite in Section B. Section C covers the proof of geometric ergodicity of KMC lite from Proposition 3. Section D describes further experimental details.
The proof below extends the model in [20, Section 4.1]. We assume that the model log-density (4) takes the form in Proposition 1, then directly implement score functions (5), from which we derive an empirical score matching objective as a system of linear equations.
As assumed the log unnormalised density takes the form
The score functions for (5) are then given by
We now rewrite in matrix form. The expression for the term being optimised is the sum of two terms.
The final term may be computed with the right ordering of operations,
In matrix notation, the inner sum is a column vector,
We take the entry-wise square and sum the resulting vector. Denote by , then the following two relations hold
This means that as defined previously,
Assuming is invertible, this is minimised by
Reduced Computational Costs via Low-rank Approximations and Conjugate Gradient
Solving the linear system in (7) requires computation and storage for a fixed random sub-sample of the chain history In order to allow for large , and to exploit potential manifold structure in the RKHS, we apply a low-rank approximation to the kernel matrix via incomplete Cholesky [28, Alg. 5.12], that is a standard way to achieve linear computational costs for kernel methods. We rewrite the kernel matrix
Appendix B Finite Feature Space Estimator
We assume the model log-density (4) takes the primal form in a finite dimensional feature space as in Proposition 2, then again directly implement score functions in (5) and minimise it via a linear solve.
As assumed the log unnormalised density takes the form
Assuming that is invertible (trivial for ), the objective is uniquely minimised by differentiating (11) wrt. , setting to zero, and solving for . This gives
Next, we give an example for the approximate feature space . Note that the above approach can be combined with any set of finite dimensional approximate feature mappings .
Example: Random Fourier Features for the Gaussian Kernel
where is the Fourier transform of the kernel. An approximate feature mapping for such kernels can be obtained via dropping imaginary terms and approximating the integral with Monte Carlo integration. This gives
with fixed random basis vector realisations that depend on the kernel via ,
for . It is easy to see that this approximation is consistent for , i.e.
See for details and a uniform convergence bound and for a more detailed analysis with tighter bounds. Note that it is possible to achieve logarithmic computational costs in exploiting properties of Hadamard matrices .
The feature map derivatives (10) are given by
where is the element-wise product. Consequently the gradient is given by
As an example, the translation invariant Gaussian kernel and its Fourier transform are
Constant Cost Updates
using cheap triangular back-substitution from , and never storing or explicitly.
Using such updates, the computational costs for updating the approximate infinite dimensional exponential family model in every iteration of the Markov chain are , which constant in . We can therefore use all points in the history for constructing a proposal. See our implementation for further details.
Perform rank- update to obtain updated Cholesky factorisation .
Update approximate infinite dimensional exponential family parameters
Appendix C Ergodicity of KMC lite
Denote by the probability of accepting a proposal at state . Let . Define and , where is the -th point of the leapfrog integration from .
We assumed is log-concave in the tails, meaning s.t. for , we have and for , we have , and a similar condition holds in the negative tail. Furthermore, we assumed fixed HMC parameters: leapfrog steps of size , and wlog the identity mass matrix . Following , it is sufficient to show
for some , where is a standard Gaussian measure. Denoting the integral , we split it into
for some . We show that the first and third terms decay to zero whilst the second remains strictly negative as (a similar argument holds as ). We detail the case as here, the other is analogous. Taking , we can choose an large enough that , and . So for we have
where the last inequality comes from the log-concave tails assumption. For
where is large enough that . Similarly for
Because and can be chosen to be arbitrarily small, then for large enough we will have
where for large enough , as and are of the same order. Now turning to , we can use an exact rearrangement of the same argument (noting that can be made arbitrarily small) to get
Combining (14) and (15) and rearranging as in [30, Theorem 3.2] shows that is strictly negative in the limit if is chosen small enough, as can also be made arbitrarily small.
For it suffices to note that the Gaussian tails of will dominate the exponential growth of meaning the integral can be made arbitrarily small by choosing large enough , and the same argument holds for . ∎
Appendix D Additional Experimental Details
This section contains additional details for the experiments in Section 6.
We reproduce the experiment in Figure 2 on an isotropic Gaussian in increasing dimension. As length-scales across all principal components are equal, this is a significantly less challenging target to estimate gradients for; though still useful as a benchmark representing very smooth targets. We use a standard Gaussian kernel and the same experimental protocol as for Figure 2. The estimator works slightly better than on the target considered in Figure 2, and performs well up to , see Figure 5.
D.2 Banana target
We choose , and , which corresponds to the ‘strongly twisted’ 8-dimensional Banana in . The target is challenging due to the nonlinear dependence of the first two dimensions and the highly position dependent scaling within these dimensions.
D.3 Pseudo-Marginal MCMC for GP Classification
Closely following , we consider a joint distribution of GP-latent variables , labels (with covariate matrix ), and hyper-parameters , given by
where . Aiming for a fully Bayesian treatment, we wish to estimate the marginal posterior of the hyper-parameters , motivated in . The marginal likelihood is intractable for non-Gaussian likelihoods , but can be replaced with an unbiased estimate
where are importance samples. In , the importance distribution is chosen as the Laplacian or as the Expectation Propagation (EP) approximation of , leading to state-of-the-art results.
We here use a Laplace approximation and . We consider classification of window against non-window glass in the UCI Glass dataset, which induces a posterior that has a nonlinear shape [12, Figure 3]. Since the ground truth for the hyperparameter posterior is not available, we initially run multiple hand-tuned standard Metropolis-Hastings chains for 500,000 iterations (with a 100,000 burn-in), keep every 1000-th sample in each of the chains, and combine them. The resulting samples are used as a benchmark, to evaluate the performance all algorithms. We use the MMD between each sampler output and the benchmark sample is computed, using the polynomial kernel . This corresponds to the estimation error of all mixed moments of order up to 3.
Kernel parameters are tuned using a black box Bayesian optimisation packageWe use the open-source package pybo, available under https://github.com/mwhoffman/pybo and the median heuristic for KMC and KAMH repsectively. The Bayesian optimisation uses standard parameters and is stopped after 15 iterations, where each trial is done via a 5-fold cross-validation of the score matching objective (5). We learn parameters after MCMC 500 iterations, and then re-learn after 2000. We tried re-learning parameters after more iterations, but this did not lead to significant changes. The costs for this are neglectable in the context of PM-MCMC as estimating the marginal likelihood takes significantly more time than generating the KMC proposal.
D.4 ABC MCMC
In this section, we give a brief background on Approximate Bayesian Computation, and how KMC can be used within the framework. We then give details of the competing approach in the final experiment in Section 6, including experimental details and an analytic counterexample.
Approximate Bayesian Computation is a method for inference in the scenario where conditional on some parameter of interest , we can easily simulate data , but for which the likelihood function is unavailable . We however have data which assume to be from the model, and we have a prior . A simple ABC algorithm is to sample (or any other suitable distribution), simulate data , and ‘accept’ as a sample from the approximate posterior if . This procedure can be formalised by defining the approximate likelihood as
where is an appropriate kernel that gives more importance to points for which is smaller. In the simple case above . The ABC posterior is then found using . Often is based on some low-dimensional summary statistics, which can have both advantages and disadvantages.
There are many different way to do ABC, and clearly not all involve Markov chain Monte Carlo. If the posterior however is not similar to the prior, and if is more than three or four dimensional, MCMC is a sensible option. Since the likelihood (17) is intractable, typically algorithms are considered for which an approximation to either the likelihood, or the ABC posterior are used either in constructing proposals, defining Metropolis-Hastings acceptance rates, or both. We focus here on samplers which target directly, c.f. .
Similar to the approach taken in Section D.3, we here accept proposals where is some proposal mechanism (i.e. KMC), via replacing the likelihood with an unbiased estimate. We accept according to the ratio
Following , one idea to approximate the intractable likelihood is to draw samples , and fit a Gaussian approximation to , producing estimates and for the mean and covariance using . If the error functon is also chosen to be a Gaussian (with mean and variance ), then the marginal likelihood can be approximated as
The likelihood is essentially approximated by a Gaussian , producing a synthetic posterior , which is then used in the accept-reject step. Clearly some approximation error is introduced by the Gaussian likelihood approximation step, but as shown in , it can be a reasonable choice for some models.
Introduced in , the synthetic likelihood formulation is used to construct a proposal, with the accept-reject step removed altogether. Hamiltonian dynamics use the gradient to suggest candidate values for the next state of a Markov chain which are far from the current point, thus increasing the chances that the chain mixes quickly. Here the gradient of the log-likelihood is unavailable, so is approximated with that of a Gaussian (since the map is not always clear this is done numerically, using a stochastic finite differences estimate of the gradient, SPAS [8, Sec. 4.3, 4.4]), giving
Since there is no accept-reject step, the synthetic posterior is also the target of this scheme (although there is also further bias introduced by discretisation error), but the introduction of gradient-based dynamics is hoped to improve mixing and hence efficiency of inferences compared to random-walk type schemes.
We give a very simple toy model to highlight the bias introduced by the Hamiltonian ABC sampler. Consider posterior inference for the mean parameter in a log-Normal model. Specifically, the true model is
where the precision and hyper-parameters are known. The model is in fact conjugate, giving a Gaussian posterior
If we introduce a Gaussian approximation to the likelihood, then the mean and precision of this approximation are (empirical estimates for)
which depend on the current value for in the chain. The resulting synthetic posterior is no longer tractable, but since it is one dimensional we can approximate it numerically. Using , , and then the true and approximate posteriors for 100 data points generated using the truth are shown in Figure 6. This is a proof of concept that a likelihood with a positive skew being approximated by a Gaussian introduces an upwards bias to the posterior.
The simulation study in Section 6 uses a slightly more complex and multi-dimensional simulation example: a 10-dimensional multivariate skew-Normal distribution, given by
with . In each iteration of KMC, the likelihood is estimated via simulating samples from the above likelihood. We use the mean of all samples as summary statistic, and a Gaussian similarity kernel with a fixed . Both KMC and HABC use a standard Gaussian momentum, a uniformly random stepsize in and leapfrog steps. HABC is used with the suggested ‘sticky random numbers’ [8, Section 4.4], i.e. we use the same seed for all simulations along a single proposal trajectory. Both algorithms are run for MCMC iterations. KMC then attempts to re-learn smoothness parameters, and stops adaptation. Burn-in samples are discarded when quantifying performance of all algorithms.
HABC is used in its ‘stochastic gradient’ and has a ‘friction’ parameter that we estimate using a running average of the global covariance of all SPAS gradient evaluations, [8, Equation 21]. Note that we ran HABC with both the friction term included and removed, where we found that adding friction has severely negative impact on mixing, where not adding friction results in a wider posterior (with the same bias). Figure 4 (middle, right) show the results without friction, Figure 7 shows the same plots with friction. We refer to our implementation for further details.
Due to the gradient estimation in every of the leap-frog steps, every MCMC proposal for HABC requires simulations to be generated. In contrast, KMC only requires a single simulation, for evaluating the accept/reject probability (3). We leave studying the exact trade-offs of KMC’s learning phase and its ability to mix well as compared to HABC to future work.