Fast $ε$-free Inference of Simulation Models with Bayesian Conditional Density Estimation

George Papamakarios, Iain Murray

Introduction

A simulator-based model is a data-generating process described by a computer program, usually with some free parameters we need to learn from data. Simulator-based modelling lends itself naturally to scientific domains such as evolutionary biology , ecology , disease epidemics , economics and cosmology , where observations are best understood as products of underlying physical processes. Inference in these models amounts to discovering plausible parameter settings that could have generated our observed data. The application domains mentioned can require properly calibrated distributions that express uncertainty over plausible parameters, rather than just point estimates, in order to reach scientific conclusions or make decisions.

As an analytical expression for the likelihood of parameters given observations is typically not available for simulator-based models, conventional likelihood-based Bayesian inference is not applicable. An alternative family of algorithms for likelihood-free inference has been developed, referred to as Approximate Bayesian Computation (ABC). These algorithms simulate the model repeatedly and only accept parameter settings which generate synthetic data similar to the observed data, typically gathered in a real-world experiment.

Rejection ABC , the most basic ABC algorithm, simulates the model for each setting of proposed parameters, and rejects parameters if the generated data is not within a certain distance from the observations. The accepted parameters form a set of independent samples from an approximate posterior. Markov Chain Monte Carlo ABC (MCMC-ABC) is an improvement over rejection ABC which, instead of independently proposing parameters, explores the parameter space by perturbing the most recently accepted parameters. Sequential Monte Carlo ABC (SMC-ABC) uses importance sampling to simulate a sequence of slowly-changing distributions, the last of which is an approximation to the parameter posterior.

Conventional ABC algorithms such as the above suffer from three drawbacks. First, they only represent the parameter posterior as a set of (possibly weighted or correlated) samples. A sample-based representation easily gives estimates and error bars of individual parameters, and model predictions. However these computations are noisy, and it is not obvious how to perform some other computations using samples, such as combining posteriors from two separate analyses. Second, the parameter samples do not come from the correct Bayesian posterior, but from an approximation based on assuming a pseudo-observation that the data is within an $\epsilon$ -ball centred on the data actually observed. Third, as the $\epsilon$ -tolerance is reduced, it can become impractical to simulate the model enough times to match the observed data even once. When simulations are expensive to perform, good quality inference becomes impractical.

We propose a parametric approach to likelihood-free inference, which unlike conventional ABC does not suffer from the above three issues. Instead of returning samples from an $\epsilon$ -approximation to the posterior, our approach learns a parametric approximation to the exact posterior, which can be made as accurate as required. Preliminary fits to the posterior are used to guide future simulations, which can reduce the number of simulations required to learn an accurate approximation by orders of magnitude. Our approach uses conditional density estimation with Bayesian neural networks, and draws upon advances in parametric density estimation, stochastic variational inference, and recognition networks, as discussed in the related work section.

Bayesian conditional density estimation for likelihood-free inference

Let $\bm{\theta}$ be a vector of parameters controlling a simulator-based model, and let $\mathbf{x}$ be a data vector generated by the model. The model may be provided as a probabilistic program that can be easily simulated, and implicitly defines a likelihood $p\mathopen{}\left(\mathbf{x}\,|\,\bm{\theta}\right)\mathclose{}$ , which we assume we cannot evaluate. Let $p\mathopen{}\left(\bm{\theta}\right)\mathclose{}$ encode our prior beliefs about the parameters. Given an observation $\mathbf{x}_{o}$ , we are interested in the parameter posterior $p\mathopen{}\left(\bm{\theta}\,|\,\mathbf{x}=\mathbf{x}_{o}\right)\mathclose{}\propto p\mathopen{}\left(\mathbf{x}=\mathbf{x}_{o}\,|\,\bm{\theta}\right)\mathclose{}\,p\mathopen{}\left(\bm{\theta}\right)\mathclose{}$ .

As the likelihood $p\mathopen{}\left(\mathbf{x}=\mathbf{x}_{o}\,|\,\bm{\theta}\right)\mathclose{}$ is unavailable, conventional Bayesian inference cannot be carried out. The principle behind ABC is to approximate $p\mathopen{}\left(\mathbf{x}=\mathbf{x}_{o}\,|\,\bm{\theta}\right)\mathclose{}$ by $p\mathopen{}\left(\left\|\mathbf{x}-\mathbf{x}_{o}\right\|<\epsilon\,|\,\bm{\theta}\right)\mathclose{}$ for a sufficiently small value of $\epsilon$ , and then estimate the latter—e.g. by Monte Carlo—using simulations from the model. Hence, ABC approximates the posterior by $p\mathopen{}\left(\bm{\theta}\,|\,\left\|\mathbf{x}-\mathbf{x}_{o}\right\|<\epsilon\right)\mathclose{}$ , which is typically broader and more uncertain. ABC can trade off computation for accuracy by decreasing $\epsilon$ , which improves the approximation to the posterior but requires more simulations from the model. However, the approximation becomes exact only when $\epsilon\rightarrow 0$ , in which case simulations never match the observations, $p\mathopen{}\left(\left\|\mathbf{x}-\mathbf{x}_{o}\right\|<\epsilon\,|\,\bm{\theta}\right)\mathclose{}\rightarrow 0$ , and existing methods break down. In this paper, we refer to $p\mathopen{}\left(\bm{\theta}\,|\,\mathbf{x}=\mathbf{x}_{o}\right)\mathclose{}$ as the exact posterior, as it corresponds to setting $\epsilon=0$ in $p\mathopen{}\left(\bm{\theta}\,|\,\left\|\mathbf{x}-\mathbf{x}_{o}\right\|<\epsilon\right)\mathclose{}$ .

In most practical applications of ABC, $\mathbf{x}$ is taken to be a fixed-length vector of summary statistics that is calculated from data generated by the simulator, rather than the raw data itself. Extracting statistics is often necessary in practice, to reduce the dimensionality of the data and maintain $p\mathopen{}\left(\left\|\mathbf{x}-\mathbf{x}_{o}\right\|<\epsilon\,|\,\bm{\theta}\right)\mathclose{}$ to practically acceptable levels. For the purposes of this paper, we will make no distinction between raw data and summary statistics, and we will regard the calculation of summary statistics as part of the data generating process.

2 Learning the posterior

We assume that each of a set of $N$ pairs $\mathopen{}\left(\bm{\theta}_{n},\mathbf{x}_{n}\right)\mathclose{}$ was independently generated by

In the limit $N\rightarrow\infty$ , the probability of the parameter vectors $\prod_{n}q_{\bm{\phi}}(\bm{\theta}_{n}\,|\,\mathbf{x}_{n})$ is maximized w.r.t. $\bm{\phi}$ if and only if

Intuition: if we simulated enough parameters from the prior, the density estimator $q_{\bm{\phi}}$ would learn a conditional of the joint prior model over parameters and data, which is the posterior $p\mathopen{}\left(\bm{\theta}\,|\,\mathbf{x}\right)\mathclose{}$ . If we simulate parameters drawn from another distribution, we need to “importance reweight” the result. A more detailed proof can be found in Appendix A.

The proposition above suggests the following procedure for learning the posterior: (a) propose a set of parameter vectors $\left\{\bm{\theta}_{n}\right\}$ from the proposal prior; (b) for each $\bm{\theta}_{n}$ run the simulator to obtain a corresponding data vector $\mathbf{x}_{n}$ ; (c) train $q_{\bm{\phi}}$ with maximum likelihood on $\left\{\bm{\theta}_{n},\mathbf{x}_{n}\right\}$ ; and (d) estimate the posterior by

This procedure is summarized in Algorithm 2.

3 Choice of conditional density estimator and proposal prior

We draw upon work on conditional neural density estimation and take $q_{\bm{\phi}}$ to be a Mixture Density Network (MDN) with fully parameterized covariance matrices. That is, $q_{\bm{\phi}}$ takes the form of a mixture of $K$ Gaussian components $q_{\bm{\phi}}\mathopen{}\left(\bm{\theta}\,|\,\mathbf{x}\right)\mathclose{}=\sum_{k}{\alpha_{k}\,\mathcal{N}\mathopen{}\left(\bm{\theta}\,|\,\mathbf{m}_{k},\mathbf{S}_{k}\right)\mathclose{}}$ , whose mixing coefficients $\left\{\alpha_{k}\right\}$ , means $\left\{\mathbf{m}_{k}\right\}$ and covariance matrices $\left\{\mathbf{S}_{k}\right\}$ are computed by a feedforward neural network parameterized by $\bm{\phi}$ , taking $\mathbf{x}$ as input. Such an architecture is capable of representing any conditional distribution arbitrarily accurately—provided the number of components $K$ and number of hidden units in the neural network are sufficiently large—while remaining trainable by backpropagation. The parameterization of the MDN is detailed in Appendix B.

4 Learning the proposal prior

Simple rejection ABC is inefficient because the posterior $p\mathopen{}\left(\bm{\theta}\,|\,\mathbf{x}=\mathbf{x}_{o}\right)\mathclose{}$ is typically much narrower than the prior $p\mathopen{}\left(\bm{\theta}\right)\mathclose{}$ . A parameter vector $\bm{\theta}$ sampled from $p\mathopen{}\left(\bm{\theta}\right)\mathclose{}$ will rarely be plausible under $p\mathopen{}\left(\bm{\theta}\,|\,\mathbf{x}=\mathbf{x}_{o}\right)\mathclose{}$ and will most likely be rejected. Practical ABC algorithms attempt to reduce the number of rejections by modifying the way they propose parameters; for instance, MCMC-ABC and SMC-ABC propose new parameters by perturbing parameters they already consider plausible, in the hope that nearby parameters remain plausible.

As we shall demonstrate in Section 3, the procedure above learns Gaussian approximations to the true posterior fast: in our experiments typically $4$ – $6$ iterations of $200$ – $500$ samples each were sufficient. This Gaussian approximation can be used as a rough but cheap approximation to the true posterior, or it can serve as a good proposal prior in Algorithm 2 for efficiently fine-tuning a non-Gaussian multi-component posterior. If the second strategy is adopted, then we can reuse the single-component neural density estimator learnt in Algorithm 1 to initialize $q_{\bm{\phi}}$ in Algorithm 2. The weights in the final layer of the MDN are replicated $K$ times, with small random perturbations to break symmetry.

5 Use of Bayesian neural density estimators

To make Algorithm 1 as efficient as possible, the number of simulations per iteration $N$ should be kept small, while at the same time it should provide a sufficient training signal for $q_{\bm{\phi}}$ . With a conventional MDN, if $N$ is made too small, there is a danger of overfitting, especially in early iterations, leading to over-confident proposal priors and an unstable procedure. Early stopping could be used to avoid overfitting; however a significant fraction of the $N$ samples would have to be used as a validation set, leading to inefficient use of simulations.

As a better alternative, we developed a Bayesian version of the MDN using Stochastic Variational Inference (SVI) for neural networks . We shall refer to this Bayesian version of the MDN as MDN-SVI. An MDN-SVI has two sets of adjustable parameters of the same size, the means $\bm{\phi}_{m}$ and the log variances $\bm{\phi}_{s}$ . The means correspond to the parameters $\bm{\phi}$ of a conventional MDN. During training, Gaussian noise of variance $\exp{\bm{\phi}_{s}}$ is added to the means independently for each training example $\mathopen{}\left(\bm{\theta}_{n},\mathbf{x}_{n}\right)\mathclose{}$ . The Bayesian interpretation of this procedure is that it optimizes a variational Gaussian posterior with a diagonal covariance matrix over parameters $\bm{\phi}$ . At prediction time, the noise is switched off and the MDN-SVI behaves like a conventional MDN with $\bm{\phi}=\bm{\phi}_{m}$ . Appendix D details the implementation and training of MDN-SVI. We found that using an MDN-SVI instead of an MDN improves the robustness and efficiency of Algorithm 1 because (a) MDN-SVI is resistant to overfitting, allowing us to use a smaller number of simulations $N$ ; (b) no validation set is needed, so all samples can be used for training; and (c) since overfitting is not an issue, no careful tuning of training time is necessary.

Experiments

We compare to three ABC baselines: (a) rejection ABC , where parameters are proposed from the prior and are accepted if $\left\|\mathbf{x}-\mathbf{x}_{o}\right\|<\epsilon$ ; (b) MCMC-ABC with a spherical Gaussian proposal, whose variance we manually tuned separately in each case for best performance; and (c) SMC-ABC , where the sequence of $\epsilon$ ’s was exponentially decayed, with a decay rate manually tuned separately in each case for best performance. MCMC-ABC was given the unrealistic advantage of being initialized with a sample from rejection ABC, removing the need for an otherwise necessary burn-in period. Code for reproducing the experiments is provided at https://github.com/gpapamak/epsilon_free_inference.

The first experiment is a toy problem where the goal is to infer the common mean $\theta$ of a mixture of two 1D Gaussians, given a single datapoint $x_{o}$ . The setup is

where $\theta_{\alpha}=-10$ , $\theta_{\beta}=10$ , $\alpha=0.5$ , $\sigma_{1}=1$ , $\sigma_{2}=0.1$ and $x_{o}=0$ . The posterior can be calculated analytically, and is proportional to an equal mixture of two Gaussians centred at $x_{o}$ with variances $\sigma_{1}^{2}$ and $\sigma_{2}^{2}$ , restricted to $\left[\theta_{\alpha},\theta_{\beta}\right]$ . This problem is often used in the SMC-ABC literature to illustrate the difficulty of MCMC-ABC in representing long tails. Here we use it to demonstrate the correctness of our approach and its ability to accurately represent non-Gaussian long-tailed posteriors.

Figure 1 shows the results of neural density estimation using each strategy. All MDNs have one hidden layer with $20$ tanh units and $2$ Gaussian components, except for the proposal prior MDN which has a single component. Both MDN with prior and MDN with proposal learn good parametric approximations to the true posterior, and the proposal prior is a good Gaussian approximation to it. We used $10$ K simulations to train the MDN with prior, whereas the prior proposal took $4$ iterations of $200$ simulations each to train, and the MDN with proposal took $1000$ simulations on top of the previous $800$ . The MDN with prior learns the posterior distributions for a large range of possible observations $x$ (middle plot of Figure 1), whereas the MDN with proposal gives accurate posterior probabilities only near the value actually observed (right plot of Figure 1).

2 Bayesian linear regression

In Bayesian linear regression, the goal is to infer the parameters $\bm{\theta}$ of a linear map from noisy observations of outputs at known inputs. The setup is

where we took $\mathbf{m}=\mathbf{0}$ , $\mathbf{S}=\mathbf{I}$ , $\sigma=0.1$ , randomly generated inputs $\left\{\mathbf{u}_{i}\right\}$ from a standard Gaussian, and randomly generated observations $\mathbf{x}_{o}$ from the model. In our setup, $\bm{\theta}$ and $\mathbf{x}$ have $6$ and $10$ dimensions respectively. The posterior is analytically tractable, and is a single Gaussian.

All MDNs have one hidden layer of $50$ tanh units and one Gaussian component. ABC methods were run for a sequence of decreasing $\epsilon$ ’s, up to their failing points. To measure the approximation quality to the posterior, we analytically calculated the KL divergence from the true posterior to the learnt posterior (which for ABC was taken to be a Gaussian fit to the set of returned posterior samples). The left of Figure 2 shows the approximation quality vs $\epsilon$ ; MDN methods are shown as horizontal lines. As $\epsilon$ is decreased, ABC methods sample from an increasingly better approximation to the true posterior, however they eventually reach their failing point, or take prohibitively long. The best approximations are achieved by MDN with proposal and a very long run of SMC-ABC.

The middle of Figure 2 shows the increase in number of simulations needed to improve approximation quality (as $\epsilon$ decreases). We quote the total number of simulations for MDN training, and the number of simulations per effective sample for ABC. Appendix E describes how the number of effective samples is calculated. The number of simulations per effective sample should be multiplied by the number of effective samples needed in practice. Moreover, SMC-ABC will not work well with only one particle, so many times the quoted cost will always be needed. Here, MDNs make more efficient use of simulations than Monte Carlo ABC methods. Sequentially fitting a prior proposal was more than ten times cheaper than training with prior samples, and more accurate.

3 Lotka–Volterra predator-prey population model

The Lotka–Volterra model is a stochastic Markov jump process that describes the continuous time evolution of a population of predators interacting with a population of prey. There are four possible reactions: (a) a predator being born, (b) a predator dying, (c) a prey being born, and (d) a prey being eaten by a predator. Positive parameters $\bm{\theta}=\left(\theta_{1},\theta_{2},\theta_{3},\theta_{4}\right)$ control the rate of each reaction. Given a set of statistics $\mathbf{x}_{o}$ calculated from an observed population time series, the objective is to infer $\bm{\theta}$ . We used a flat prior over $\log{\bm{\theta}}$ , and calculated a set of $9$ statistics $\mathbf{x}$ . The full setup is detailed in Appendix F. The Lotka–Volterra model is commonly used in the ABC literature as a realistic model which can be simulated, but whose likelihood is intractable. One of the properties of Lotka–Volterra is that typical nature-like observations only occur for very specific parameter settings, resulting in narrow, Gaussian-like posteriors that are hard to recover.

The MDN trained with prior has two hidden layers of $50$ tanh units each, whereas the MDN-SVI used to train the proposal prior and the MDN-SVI trained with proposal have one hidden layer of $50$ tanh units. All three have one Gaussian component. We found that using more than one components made no difference to the results; in all cases the MDNs chose to use only one component and switch the rest off, which is consistent with our observation about the near-Gaussianity of the posterior.

We measure how well each method retrieves the true parameter values that were used to generate $\mathbf{x}_{o}$ by calculating their log probability under each learnt posterior; for ABC a Gaussian fit to the posterior samples was used. The left panel of Figure 3 shows how this log probability varies with $\epsilon$ , demonstrating the superiority of MDN methods over ABC. In the middle panel we can see that MDN training with proposal makes efficient use of simulations compared to training with prior and ABC; note that for ABC the number of simulations is only for one effective sample. In the right panel, we can see that the estimates returned by MDN methods are more confident around the true parameters compared to ABC, because the MDNs learn the exact posterior rather than an inflated version of it like ABC does (plots for the other three parameters look similar).

We found that when training an MDN with a well-tuned proposal that focuses on the plausible region, an MDN with fewer parameters is needed compared to training with the prior. This is because the MDN trained with proposal needs to learn only the local relationship between $\mathbf{x}$ and $\bm{\theta}$ near $\mathbf{x}_{o}$ , as opposed to in the entire domain of the prior. Hence, not only are savings achieved in number of simulations, but also training the MDN itself becomes more efficient.

4 M/G/1 queue model

The M/G/1 queue model describes the processing of a queue of continuously arriving jobs by a single server. In this model, the time the server takes to process each job is independently and uniformly distributed in the interval $\left[\theta_{1},\theta_{2}\right]$ . The time interval between arrival of two consecutive jobs is independently and exponentially distributed with rate $\theta_{3}$ . The server observes only the time intervals between departure of two consecutive jobs. Given a set of equally-spaced percentiles $\mathbf{x}_{o}$ of inter-departure times, the task is to infer parameters $\bm{\theta}=\left(\theta_{1},\theta_{2},\theta_{3}\right)$ . This model is easy to simulate but its likelihood is intractable, and it has often been used as an ABC benchmark . Unlike Lotka–Volterra, data $\mathbf{x}$ is weakly informative about $\bm{\theta}$ , and hence the posterior over $\bm{\theta}$ tends to be broad and non-Gaussian. In our setup, we placed flat independent priors over $\theta_{1}$ , $\theta_{2}-\theta_{1}$ and $\theta_{3}$ , and we took $\mathbf{x}$ to be $5$ equally spaced percentiles, as detailed in Appendix G.

The MDN trained with prior has two hidden layers of $50$ tanh units each, whereas the MDN-SVI used to train the proposal prior and the one trained with proposal have one hidden layer of $50$ tanh units. As observed in the Lotka–Volterra demo, less capacity is required when training with proposal, as the relationship to be learned is local and hence simpler, which saves compute time and gives a more accurate final posterior. All MDNs have $8$ Gaussian components (except the MDN-SVI used to train the proposal prior, which always has one), which, after experimentation, we determined are enough for the MDNs to represent the non-Gaussian nature of the posterior.

Figure 4 reports the log probability of the true parameters under each posterior learnt—for ABC, the log probability was calculated by fitting a mixture of $8$ Gaussians to posterior samples using Expectation-Maximization—and the number of simulations needed to achieve it. As before, MDN methods are more confident compared to ABC around the true parameters, which is due to ABC computing a broader posterior than the true one. MDN methods make more efficient use of simulations, since they use all of them for training and, unlike ABC, do not throw a proportion of them away.

Related work

Regression adjustment. An early parametric approach to ABC is regression adjustment, where a parametric regressor is trained on simulation data in order to learn a mapping from $\mathbf{x}$ to $\bm{\theta}$ . The learnt mapping is then used to correct for using a large $\epsilon$ , by adjusting the location of posterior samples gathered by e.g. rejection ABC. Beaumont et al. used linear regressors, and later Blum and François used neural networks with one hidden layer that separately predicted the mean and variance of $\bm{\theta}$ . Both can be viewed as rudimentary density estimators and as such they are a predecessor to our work. However, they were not flexible enough to accurately estimate the posterior, and they were only used within some other ABC method to allow for a larger $\epsilon$ . In our work, we make conditional density estimation flexible enough to approximate the posterior accurately.

Synthetic likelihood. Another parametric approach is synthetic likelihood, where parametric models are used to estimate the likelihood $p\mathopen{}\left(\mathbf{x}\,|\,\bm{\theta}\right)\mathclose{}$ . Wood used a single Gaussian, and later Fan et al. used a mixture Gaussian model. Both of them learnt a separate density model of $\mathbf{x}$ for each $\bm{\theta}$ by repeatedly simulating the model for fixed $\bm{\theta}$ . More recently, Meeds and Welling used a Gaussian process model to interpolate Gaussian likelihood approximations between different $\bm{\theta}$ ’s. Compared to learning the posterior, synthetic likelihood has the advantage of not depending on the choice of proposal prior. Its main disadvantage is the need of further approximate inference on top of it in order to obtain the posterior. In our work we directly learn the posterior, eliminating the need for further inference, and we address the problem of correcting for the proposal prior.

Efficient Monte Carlo ABC. Recent work on ABC has focused on reducing the simulation cost of sample-based ABC methods. Hamiltonian ABC improves upon MCMC-ABC by using stochastically estimated gradients in order to explore the parameter space more efficiently. Optimization Monte Carlo ABC explicitly optimizes the location of ABC samples, which greatly reduces rejection rate. Bayesian optimization ABC models $p\mathopen{}\left(\left\|\mathbf{x}-\mathbf{x}_{o}\right\|\,|\,\bm{\theta}\right)\mathclose{}$ as a Gaussian process and then uses Bayesian optimization to guide simulations towards the region of small distances $\left\|\mathbf{x}-\mathbf{x}_{o}\right\|$ . In our work we show how a significant reduction in simulation cost can also be achieved with parametric methods, which target the posterior directly.

Recognition networks. Our use of neural density estimators for learning posteriors is reminiscent of recognition networks in machine learning. A recognition network is a neural network that is trained to invert a generative model. The Helmholtz machine , the variational auto-encoder and stochastic backpropagation are examples where a recognition network is trained jointly with the generative network it is designed to invert. Feedforward neural networks have been used to invert black-box generative models and binary-valued Bayesian networks , and convolutional neural networks have been used to invert a physics engine . Our work illustrates the potential of recognition networks in the field of likelihood-free inference, where the generative model is fixed, and inference of its parameters is the goal.

Learning proposals. Neural density estimators have been employed in learning proposal distributions for importance sampling and Sequential Monte Carlo . Although not our focus here, our fit to the posterior could also be used within Monte Carlo inference methods. In this work we see how far we can get purely by fitting a series of conditional density estimators.

Conclusions

Bayesian conditional density estimation improves likelihood-free inference in three main ways: (a) it represents the posterior parametrically, as opposed to as a set of samples, allowing for probabilistic evaluations later on in the pipeline; (b) it targets the exact posterior, rather than an $\epsilon$ -approximation of it; and (c) it makes efficient use of simulations by not rejecting samples, by interpolating between samples, and by gradually focusing on the plausible parameter region. Our belief is that neural density estimation is a tool with great potential in likelihood-free inference, and our hope is that this work helps in establishing its usefulness in the field.

We thank Amos Storkey for useful comments. George Papamakarios is supported by the Centre for Doctoral Training in Data Science, funded by EPSRC (grant EP/L016427/1) and the University of Edinburgh, and by Microsoft Research through its PhD Scholarship Programme.

Appendix A Proof of Proposition 1

Maximizing $\prod_{n}{q_{\bm{\phi}}\mathopen{}\left(\bm{\theta}_{n}\,|\,\mathbf{x}_{n}\right)\mathclose{}}$ w.r.t. $\bm{\phi}$ is equivalent to maximizing the average log probability

The above KL divergence is minimized (and becomes ) if and only if

Appendix B Parameterization and training of Mixture Density Networks

A Mixture Density Network (MDN) is a conditional density estimator $q_{\bm{\phi}}\mathopen{}\left(\bm{\theta}\,|\,\mathbf{x}\right)\mathclose{}$ , which takes the form of a mixture of $K$ Gaussian components, as follows

The mixing coefficients $\bm{\alpha}=\mathopen{}\left(\alpha_{1},\ldots,\alpha_{K}\right)\mathclose{}$ , means $\left\{\mathbf{m}_{k}\right\}$ and covariance matrices $\left\{\mathbf{S}_{k}\right\}$ are computed by a feedforward neural network $f_{\mathbf{W},\mathbf{b}}\mathopen{}\left(\mathbf{x}\right)\mathclose{}$ , which has input $\mathbf{x}$ , weights $\mathbf{W}$ and biases $\mathbf{b}$ . In particular, let the output of the neural network be

Then, the mixing coefficients are given by

The softmax ensures that the mixing coefficients are strictly positive and sum to one. Similarly, the means are given by

As for the covariance matrices, we need to ensure that they are symmetric and positive definite. For this reason, instead of parameterizing the covariance matrices directly, we parameterize the Cholesky factorization of their inverses. That is, we rewrite

where $\mathbf{U}_{k}$ is parameterized to be an upper triangular matrix with strictly positive elements in the diagonal, as follows

The above parameterization of the covariance matrix was introduced by Williams for learning conditional Gaussians.

Given a set of training data $\left\{\bm{\theta}_{n},\mathbf{x}_{n}\right\}$ , training the MDN with maximum likelihood amounts to maximizing the average log probability

Because the reparameterization $\bm{\phi}$ described above is unconstrained, any off-the-shelf gradient-based stochastic optimizer can be used. Gradients of the average log probability can be easily computed with backpropagation. In our experiments, we implemented MDNs using Theano , which automatically backpropagates gradients, and we maximized the average log likelihood using Adam , which is currently the state of the art in minibatch-based stochastic optimization.

Appendix C Analytical calculation of parameter posterior

According to Proposition 1, after training $q_{\bm{\phi}}\mathopen{}\left(\bm{\theta}\,|\,\mathbf{x}\right)\mathclose{}$ , the posterior at $\mathbf{x}=\mathbf{x}_{o}$ is approximated by

Typically, the prior $p\mathopen{}\left(\bm{\theta}\right)\mathclose{}$ is a simple distribution like a uniform or a Gaussian. Here we will consider the uniform case, while the Gaussian case is treated analogously. Let $p\mathopen{}\left(\bm{\theta}\right)\mathclose{}$ be uniform everywhere (improper). Then the posterior estimate becomes

their ratio can be calculated and normalized analytically. In particular, after some algebra it can be shown that the posterior estimate $\hat{p}\mathopen{}\left(\bm{\theta}\,|\,\mathbf{x}=\mathbf{x}_{o}\right)\mathclose{}$ is also a mixture of $K$ Gaussians

where quantities $\left\{c_{k}\right\}$ are given by

Appendix D Stochastic Variational Inference for Mixture Density Networks

In this section we describe our adaptation of Stochastic Variational Inference (SVI) for neural networks , in order to develop a Bayesian version of MDN. The first step is to express beliefs about the MDN parameters $\bm{\phi}$ as independent Gaussian random variables with means $\bm{\phi}_{m}$ and log variances $\bm{\phi}_{s}$ . Under this interpretation we can rewrite the parameters as

where the symbol $\odot$ denotes elementwise multiplication and $\mathbf{u}$ is an unknown vector drawn from a standard normal,

The above parameterization induces the following variational distribution over $\bm{\phi}$

Under this prior, before seeing any data we set the parameter means $\bm{\phi}_{m}$ all to zero, and the parameter log variances $\bm{\phi}_{s}$ all equal to $\log{\lambda^{-1}}$ . In our experiments, we used a default value of $\lambda=0.01$ .

Given training data $\left\{\bm{\theta}_{n},\mathbf{x}_{n}\right\}$ , the objective of SVI is to optimize $\bm{\phi}_{m}$ and $\bm{\phi}_{s}$ so as to make the variational distribution $q\mathopen{}\left(\bm{\phi}\right)\mathclose{}$ be as close as possible (in KL) to the true Bayesian posterior over $\bm{\phi}$ . This objective is equivalent to maximizing a variational lower bound,

with respect to $\bm{\phi}_{m}$ and $\bm{\phi}_{s}$ . The expectations in the first term of the above can be stochastically approximated by randomly drawing $\mathbf{u}$ ’s from a standard normal. The KL term can be calculated analytically, which yields

The above optimization problem has been parameterized in such a way that $\bm{\phi}_{m}$ and $\bm{\phi}_{s}$ are unconstrained. Moreover, the derivatives of the variational lower bound with respect to $\bm{\phi}_{m}$ and $\bm{\phi}_{s}$ can be easily calculated with backpropagation after stochastic approximations to the expectations have been made. This allows the use of any off-the-shelf gradient-based stochastic optimizer. In our experiments, we implemented MDN-SVI in Theano , which automatically calculates derivatives with backpropagation, and used Adam for stochastic maximization of the variational lower bound.

An important practical detail for stochastically approximating the expectation terms is the local reparameterization trick , which leverages the layered feedforward structure of the MDN. Consider any hidden or output unit in the MDN; if $a$ is its activation and $\mathbf{z}$ is the vector of its inputs, then the relationship between $a$ and $\mathbf{z}$ is always of the form

where $\mathbf{w}$ and $b$ are the weights and bias respectively associated with this unit. As we have seen, in the SVI framework these weights and biases are Gaussian random variables with means $\mathbf{w}_{m}$ and $b_{m}$ , and log variances $\mathbf{w}_{s}$ and $b_{s}$ . It is easy to see that this induces a Gaussian distribution over activation $a$ , whose mean $a_{m}$ and variance $\exp{a_{s}}$ is given by

where $\odot$ denotes elementwise multiplication. Therefore, randomly sampling $\mathbf{w}$ and $b$ in order to estimate the expectations and their gradients in the variational lower bound is equivalent to directly sampling $a$ from a Gaussian with the above mean and variance. This trick saves computation by reducing calls to the random number generator, but more importantly it reduces the variance of the stochastic approximation of the expectations (intuitively this is because less randomness is involved) and hence it makes stochastic optimization more stable and faster to converge.

Appendix E Effective sample size of ABC methods

Rejection ABC returns a set of independent samples, MCMC-ABC returns a set of correlated samples, and SMC-ABC returns a set of independent but weighted samples. To make a fair comparison between them in terms of simulation cost, we quote the number of simulations per effective sample, that is, the total number of simulations divided by the effective sample size of the returned set of samples.

Suppose that each sample is a vector of $D$ components. For MCMC-ABC, where samples come in the form of $D$ autocorrelated sequences, we estimated the effective sample size for component $d$ as

For SMC-ABC, each sample is independent but comes with a corresponding non-negative weight $w_{n}$ . The weights have to sum to one, that is $\sum_{n}{w_{n}}=1$ . We estimated the effective sample size by

Appendix F Setup for the Lotka–Volterra experiment

The Lotka–Volterra model is a stochastic model that was developed to describe the time evolution of a population of predators interacting with a population of prey. Let $X$ be the number of predators and $Y$ be the number of prey. The model asserts that the following four reactions can take place, with corresponding rates:

A predator may be born, with rate $\theta_{1}XY$ , increasing $X$ by one.

A predator may die, with rate $\theta_{2}X$ , decreasing $X$ by one.

A prey may be born, with rate $\theta_{3}Y$ , increasing $Y$ by one.

A prey may be eaten by a predator, with rate $\theta_{4}XY$ , decreasing $Y$ by one.

Given initial populations $X$ and $Y$ , the above model can be simulated using Gillespie’s algorithm , as follows:

Draw the time to next reaction from an exponential distribution with rate equal to the total rate $\theta_{1}XY+\theta_{2}X+\theta_{3}Y+\theta_{4}XY$ .

Select a reaction at random, with probability proportional to its rate.

Simulate the reaction, and go to step (i).

In our experiments, each simulation started with initial populations $X=50$ and $Y=100$ , and took place for a total of $30$ time units. We recorded the values of $X$ and $Y$ after every $0.2$ time units, resulting in two time series of $151$ values each.

Data $\mathbf{x}$ was taken to be the following set of $9$ statistics calculated from the time series:

The autocorrelation coefficient of each time series at lag $1$ and lag $2$ .

The cross-correlation coefficient between the two time series.

Since the above statistics have potentially very different scales, we normalized them on the basis of a pilot run. That is, we performed a pilot run of $1000$ simulations, calculated and stored the mean and standard deviation of each statistic across pilot simulations, and used them in all subsequent simulations to normalize each statistic by subtracting the pilot mean and dividing by the pilot standard deviation. This choice of statistics and normalization process was taken from Wilkinson .

From our experience with the model we observed that typical evolutions of the predator/prey populations for randomly selected parameters $\bm{\theta}=\mathopen{}\left(\theta_{1},\theta_{2},\theta_{3},\theta_{4}\right)\mathclose{}$ include (a) the predators quickly eating all the prey and then slowly decaying exponentially, or (b) the predators quickly dying out and then the prey growing exponentially. However, for certain carefully tuned values of $\bm{\theta}$ , the two populations exhibit an oscillatory behaviour, typical of natural ecological systems. In order to generate observations $\mathbf{x}_{o}$ for our experimental setup, we set the parameters to

and simulated the model to generate $\mathbf{x}_{o}$ . We carefully chose parameter values that give rise to oscillatory behaviour (see Figure 5 for typical examples of population evolution corresponding to the above parameters). Since only a small subset of parameters give rise to such oscillatory behaviour, the posterior $p\mathopen{}\left(\bm{\theta}\,|\,\mathbf{x}=\mathbf{x}_{o}\right)\mathclose{}$ is expected to be tightly peaked around the true parameter values. We tested our algorithms by evaluating how well (in terms of assigned log probability) each algorithm retrieves the true parameters.

Finally, we took the prior over $\bm{\theta}$ to be uniform in the log domain. That is, the prior was taken to be

where $\log{\theta}_{\alpha}=-5$ and $\log{\theta}_{\beta}=2$ , which of course includes the true parameters. All our inferences where done in the log domain.

Appendix G Setup for the M/G/1 experiment

The M/G/1 queue model is a statistical model that describes how a single server processes a queue formed by a set of continuously arriving jobs. Let $I$ be the total number of jobs to be processed, $s_{i}$ be the time the server takes to process job $i$ , $v_{i}$ be the time that job $i$ entered the queue, and $d_{i}$ be the time that job $i$ left the queue (i.e. the time when the server finished processing it). The M/G/1 queue model asserts that for each job $i$ we have

The goal is to infer parameters $\bm{\theta}=\mathopen{}\left(\theta_{1},\theta_{2},\theta_{3}\right)\mathclose{}$ if the only knowledge is a set of percentiles of the empirical distribution of the interdeparture times $d_{i}-d_{i-1}$ for $i=1,\ldots,I$ . In our experiments we used $5$ equally spaced percentiles. That is, given a set of $I$ interdeparture times $d_{i}-d_{i-1}$ , we took $\mathbf{x}$ to be the th, $25$ th, $50$ th, $75$ th and $100$ th percentiles of the set of interdeparture times. Note that the th and $100$ th percentiles correspond to the minimum and maximum element in the set.

Since different percentiles can have different scales and strong correlations between them, we whitened the data on the basis of a pilot run. That is, we performed $100$ K pilot simulations, and recorded the mean vector and covariance matrix of the resulting percentiles. For each subsequent simulation, we calculated $\mathbf{x}$ from resulting percentiles by subtracting the mean vector and decorrelating and normalizing with the covariance matrix.

To generate observed data $\mathbf{x}_{o}$ , we set the parameters to the following values

and simulated the model to get $\mathbf{x}_{o}$ . We evaluated inference algorithms by how well the true parameter values were retrieved, as measured by log probability under computed posteriors. Finally, the prior probability of the parameters was taken to be

which is uniform, albeit not axis-aligned, and of course includes the true parameters.