Control functionals for Monte Carlo integration

Chris J. Oates, Mark Girolami, Nicolas Chopin

Introduction

Statistical methods are increasingly being employed to analyse complex models of physical phenomena (e.g. in climate forecasting or simulations of molecular dynamics; Slingo et al.,, 2009; Angelikopoulos et al.,, 2012). Analytic intractability of complex models has inspired the development of sophisticated Monte Carlo methodologies to facilitate computation (Robert and Casella,, 2004). In their most basic form, Monte Carlo estimators converge as the reciprocal of root- $n$ where $n$ is the number of random samples. For complex models it may only be feasible to obtain a limited number of samples (e.g. a recent Met Office model for future climate simulations required the order of $10^{6}$ core-hours per simulation; Mizielinski et al.,, 2014). In these situations, root- $n$ convergence is too slow and leads in practice to high-variance estimation. Our contribution is motivated by resolving this issue and provides novel methodology that is both formal and general.

based on $n$ independent and identically distributed (IID) samples $\{\bm{x}_{i}\}_{i=1}^{n}$ of the random variable, satisfies the central limit theorem and converges to $\mu(f)$ at the rate $O_{P}(n^{-1/2})$ , or simply at “root- $n$ ”. When working with complex models, root- $n$ convergence can be problematic, as highlighted in e.g. Ba and Joseph, (2012). A model is considered complex when either (i) $\bm{X}$ is expensive to simulate, or (ii) $f$ is expensive to evaluate, in each case relative to the required estimator precision. Both situations are prevalent in scientific and engineering applications (e.g. Kohlhoff et al.,, 2014; Higdon et al.,, 2015). This paper introduces a class of estimators that converge more quickly than root- $n$ . The significance of our contribution is made clear in the comparative overview below.

Generic approaches to reduction of variance are well-known in both statistics and numerical analysis. These include (i) importance sampling and its extensions (Cornuet et al.,, 2012; Li et al.,, 2013), (ii) stratified sampling and related techniques (Rubinstein and Kroese,, 2011), (iii) antithetic variables (Green and Han,, 1992) and more generally (randomised) quasi-Monte Carlo (QMC/RQMC; Dick and Pillichshammer,, 2010), (iv) Rao-Blackwellisation (Robert and Casella,, 2004; Douc and Robert,, 2011; Ghosh and Clyde,, 2011; Olsson and Ryden,, 2011), (v) Riemann sums (Philippe,, 1997), (vi) control variates (Glasserman,, 2004; Mira et al.,, 2013; Li et al.,, 2016), (vii) multi-level Monte Carlo and related techniques (e.g. Heinrich,, 1995; Giles,, 2013; Giles and Szpruch,, 2014), (viii) Bayesian Monte Carlo (BMC; O’Hagan,, 1991; Rasmussen and Ghahramani,, 2003; Briol et al.,, 2015), and (ix) a plethora of sophisticated Markov chain Monte Carlo sampling schemes (MCMC; Łatuszyński et al.,, 2015). Classical introductions to many of the above techniques include Robert and Casella, (2004, Chap. 4) and Rubinstein and Kroese, (2011, Chap. 5).

Motivated by contemporary statistical applications, we state four desiderata for a variance reduction technique: (I) Unbiased estimation: Monte Carlo (MC) methods based on IID samples produce unbiased estimators, whilst techniques such as MCMC generally produce biased estimators. (II) Compatibility with an un-normalised density $\pi$ : An “un-normalised” density is known only up to proportionality so that, for example, MCMC techniques are required for sampling. (III) Super-root- $n$ convergence (for sufficiently regular $f$ ): The convergence rates of (R)QMC are well studied and can be super-root- $n$ . Riemann sums can also achieve super-root- $n$ rates and Briol et al., (2016) showed the same holds for BMC. (IV) Post-hoc schemes: Rao-Blackwellisation, Riemann sums, BMC and control variates can all be conceived as post-hoc schemes; i.e. schemes that can be applied retrospectively after samples have been obtained. In contrast, the remaining methods require modification to computer code for the sampling process itself. The former are appealing from both a theoretical and a practical perspective since they separate the challenge of sampling from the challenge of variance reduction.

Table 1 summarises existing techniques in relation to these desiderata; note that no technique fulfils all four criteria. In contrast, the method proposed here, called “control functionals”, is able to satisfy all four desiderata. Control functionals appear to be similar, in this sense, to Riemann sums i.e. they are a super-root- $n$ , post-hoc approach that applies to un-normalised sampling densities. However, Riemann sums are rarely used in practice due to (i) the fact that estimators are biased at finite sample sizes, and (ii) there is a prohibitive increase in methodological complexity for multi-dimensional state spaces. Control functionals do not posses either of these drawbacks.

The control variates described above are solving a misspecified regression problem, since in general $f$ will not be a linear combination of the $s_{i}$ basis functions. As such they achieve at most a constant factor reduction in estimator variance. Intuitively, one would like to increase the number $m$ of basis functions to increase in line with the number $n$ . Mijatović and Vogrinc, (2015) explored this approach within the Metropolis-Hastings method. However, their solution requires the user to partition of the state space, which limits its wider appeal. This paper introduces a powerful new perspective on variance reduction that fully resolves these issues, satisfying all the desiderata described above. To realise our method we developed a gradient-based function space that leads to closed-form estimators whose convergence can be guaranteed. The functional analysis perspective works “out of the box”, without requiring the user to partition the state space. Extensive empirical support is provided in favour of the proposed method, including applications to hierarchical models and models based on non-linear differential equations. In each case state-of-the-art estimation is achieved.

All results can be reproduced using MATLAB R2015a code that is available to download from http://warwick.ac.uk/control_functionals.

Methodology

Denote by $\mathcal{D}=\{\bm{x}_{i}\}_{i=1}^{n}$ a collection of states $\bm{x}_{i}\in\Omega$ . At each state $\bm{x}_{i}$ the corresponding function values $f(\bm{x}_{i})$ and gradients $\nabla_{\bm{x}}\log\pi(\bm{x}_{i})$ are assumed to have been pre-computed and cached. The method that we develop does not then require any further recourse to the statistical model $\pi$ , nor any further evaluations of the function $f$ , and is in this sense a widely-applicable post-hoc scheme.

2 From control variates to control functionals

Our starting point is establish a trade-off between random sampling and deterministic approximation, as suggested on several separate occasions by authors including Bakhvalov, (1959); Heinrich, (1995); Speight, (2009); Giles, (2013).

Consider a dichotomy of available states $\mathcal{D}=\{\bm{x}_{i}\}_{i=1}^{n}$ into two disjoint subsets $\mathcal{D}_{0}=\{\bm{x}_{i}\}_{i=1}^{m}$ and $\mathcal{D}_{1}=\{\bm{x}_{i}\}_{i=m+1}^{n}$ , where $1\leq m<n$ . Although $m$ , $n$ are fixed, we will be interested in the asymptotic regime where $m=O(n^{\gamma})$ for some $\gamma\in$ . Consider surrogate functions of the form

where $s_{f,\mathcal{D}_{0}}\in\mathcal{L}^{2}(\pi)$ is an approximation to $f$ , based on $\mathcal{D}_{0}$ , whose expectation $\mu(s_{f,\mathcal{D}_{0}})$ is analytically tractable. By construction $f_{\mathcal{D}_{0}}\in\mathcal{L}^{2}(\pi)$ , $\mu(f_{\mathcal{D}_{0}})=\mu(f)$ and $\sigma^{2}(f_{\mathcal{D}_{0}})=\sigma^{2}(f-s_{f,\mathcal{D}_{0}})$ . We study estimators of the form

Assume $m=O(n^{\gamma})$ for some $\gamma\in$ and that the expected functional approximation error (EFAE) vanishes as

Taking $\gamma=1$ optimises the rate in Prop. 1 and we therefore assume in the sequel that $m/n\rightarrow r\in(0,1)$ .

2.2 Control variates based on Stein’s identity

To construct approximations $s_{f,\mathcal{D}_{0}}$ whose integrals $\mu(s_{f,\mathcal{D}_{0}})$ are analytically tractable, we make the assumption

Denote the gradient function by $\bm{u}(\bm{x}):=\nabla_{\bm{x}}\log\pi(\bm{x})$ where $\nabla_{\bm{x}}:=[\partial/\partial x_{1},\dots,\partial/\partial x_{d}]^{T}$ , well-defined by (A1). We study approximations of the form

Let $\bm{n}(\bm{x})$ be the unit normal to the boundary $\partial\Omega$ of the state space $\Omega$ . Then

Assume (A1,2). Then $\mu(\psi)=0$ and so $\mu(s_{f,\mathcal{D}_{0}})=c$ .

The statistic $\psi$ is recognised as a control variate. These control variates were explored in the case where $\bm{\phi}$ is a (gradient of a low-degree) polynomial by Assaraf and Caffarel, (1999), Assaraf and Caffarel, (2003) and Mira et al., (2013). This paper takes the innovative step of setting $\bm{\phi}$ within a function space to enable fully non-parametric approximation. The functional approximation perspective differs fundamentally from the control variate approach, in which the estimation problem is formally mis-specified (i.e. $\bm{\phi}$ is restricted to a low dimensional parametric family that does not contain the “true” function). We emphasise this key conceptual distinction by referring to $\psi$ as a control functional (CF; reflecting the use of terminology from functional analysis).

3 Theory

This section establishes $\psi$ as belonging to a Hilbert space $\mathcal{H}_{0}\subset\mathcal{L}^{2}(\pi)$ . This allows us to formulate and solve a functional approximation problem that targets the EFAE.

We make an assumption on $k$ that will be enforced by construction:

Now we can analyse the class of CFs induced by $k$ :

Assume $\bm{\phi}\in\mathcal{H}^{d}$ and (A1,3). Then $\psi$ belongs to $\mathcal{H}_{0}$ , the reproducing kernel Hilbert space with kernel $k_{0}(\bm{x},\bm{x}^{\prime}):=\nabla_{\bm{x}}\cdot\nabla_{\bm{x}^{\prime}}k(\bm{x},\bm{x}^{\prime})+\bm{u}(\bm{x})\cdot\nabla_{\bm{x}^{\prime}}k(\bm{x},\bm{x}^{\prime})+\bm{u}(\bm{x}^{\prime})\cdot\nabla_{\bm{x}}k(\bm{x},\bm{x}^{\prime})+\bm{u}(\bm{x})\cdot\bm{u}(\bm{x}^{\prime})k(\bm{x},\bm{x}^{\prime})$ .

To gain some intuition for $\mathcal{H}_{0}$ we strengthen (A2) as follows:

For $\pi$ -almost all $\bm{x}\in\Omega$ the kernel $k$ satisfies

While (A2’) must be verified on a case-by-case basis, it can in principle always be enforced with a suitable choice of $k$ .

Under (A1,2’,3), the gradient-based kernel $k_{0}$ satisfies

Lemma 1 generalises Eqn. 1 of Mira et al., (2013) and implies that $\mathcal{H}_{0}$ consists of only valid CFs, i.e. $\psi\in\mathcal{H}_{0}\implies\mu(\psi)=0$ . These ideas are illustrated in Fig. 1.

The gradient-based kernel $k_{0}$ satisfies

Under (A1,2’,3,4) we have $\mathcal{H}_{0}\subset\mathcal{L}^{2}(\pi)$ .

In general (A4) must be verified on a case-by-case basis. (A4) is easily verified for all examples in this paper.

3.2 Consistent approximation and asymptotics

Now we establish theoretical results for consistent approximation of $f$ by $s_{f,\mathcal{D}_{0}}$ . Write $\mathcal{C}$ for the reproducing kernel Hilbert space of constant functions with kernel $k_{\mathcal{C}}(\bm{x},\bm{x}^{\prime})=1$ for all $\bm{x},\bm{x}^{\prime}\in\Omega$ . Denote the norms associated to $\mathcal{C}$ and $\mathcal{H}_{0}$ respectively by $\|\cdot\|_{\mathcal{C}}$ and $\|\cdot\|_{\mathcal{H}_{0}}$ . Write $\mathcal{H}_{+}=\mathcal{C}+\mathcal{H}_{0}$ for the set $\{c+\psi:c\in\mathcal{C},\;\psi\in\mathcal{H}_{0}\}$ . Equip $\mathcal{H}_{+}$ with the structure of a vector space, with addition operator $(c+\psi)+(c^{\prime}+\psi^{\prime})=(c+c^{\prime})+(\psi+\psi^{\prime})$ and multiplication operator $\lambda(c+\psi)=(\lambda c)+(\lambda\psi)$ , each well-defined due to uniqueness of the representation $f=c+\psi$ , $f^{\prime}=c^{\prime}+\psi^{\prime}$ with $c,c^{\prime}\in\mathcal{C}$ and $\psi,\psi^{\prime}\in\mathcal{H}_{0}$ . In addition, equip $\mathcal{H}_{+}$ with the norm $\|f\|_{\mathcal{H}_{+}}^{2}:=\|c\|_{\mathcal{C}}^{2}+\|\psi\|_{\mathcal{H}_{0}}^{2}$ , again, well-defined by uniqueness of representation. It can be shown that $\mathcal{H}_{+}$ is a reproducing kernel Hilbert space with kernel $k_{+}(\bm{x},\bm{x}^{\prime}):=k_{\mathcal{C}}(\bm{x},\bm{x}^{\prime})+k_{0}(\bm{x},\bm{x}^{\prime})$ (Berlinet and Thomas-Agnan,, 2004, Thm. 5, p24).

For the analysis we assume a basic well-posedness condition:

$f\in\mathcal{H}_{+}$ . i.e. $f=c+\psi$ for some $c\in\mathcal{C}$ and $\psi\in\mathcal{H}_{0}$ .

(A5) is equivalent to the existence of a solution $\bm{\phi}\in\mathcal{H}^{d}$ to the partial differential equation

called the “fundamental equation” in Assaraf and Caffarel, (1999, Eqn. 5; see also Eqn. 4 in ()). With no initial or boundary conditions to satisfy, it is easy to show that there exist infinitely many solutions to the fundamental equation, so (A5) is automatically satisfied by choosing $k$ such that $\mathcal{H}$ is big enough to contain at least one solution.

To realise the CF method we consider the regularised least-squares (RLS) functional approximation given by

where $\lambda>0$ . For the special case where $m=n$ , the CF estimator can be interpreted as kernel quadrature (Sommariva and Vianello,, 2006) and also as empirical interpolation (Kristoffersen,, 2013). The distinguishing feature of CFs from these methods is that the Stein construction is compatible with un-normalised $\pi$ .

Below we will establish that the RLS estimate produces vanishing EFAE under a strengthening of (A4):

$\sup_{\bm{x}\in\Omega}k_{0}(\bm{x},\bm{x})<\infty$ .

(A4’) would follow from (A3) and compactness of the state space $\Omega$ , but we do not assume compactness here. All experiments in this paper have at worst $\bm{u}(\bm{x})=O(\|\bm{x}\|_{2})$ , so that (A4’) is automatically satisfied, for example, when we choose $k(\bm{x},\bm{x}^{\prime})=(1+\alpha_{1}\|\bm{x}\|_{2}^{2}+\alpha_{1}\|\bm{x}^{\prime}\|_{2}^{2})^{-1}\exp(-(2\alpha_{2}^{2})^{-1}\|\bm{x}-\bm{x}^{\prime}\|_{2}^{2})$ for some $\alpha_{1},\alpha_{2}>0$ .

CFs based on RLS therefore improve upon the Monte Carlo rate. The hypotheses on $\pi$ are weak, only requiring that $\pi$ be continuously differentiable. Empirical evidence (below) indicates stronger rates hold in more regular examples. Indeed we can prove sharper results under stronger conditions that include boundedness of $\Omega$ . Details are reserved for a future publication (Oates et al.,, 2016).

Importantly, the RLS estimate leads to a convenient closed-form expression for the CF estimator:

Assume (A1,3). The CF estimator based on RLS is

where $\bm{f}_{0}=[f(\bm{x}_{1}),\dots,f(\bm{x}_{m})]^{T}$ , $\bm{f}_{1}=[f(\bm{x}_{m+1}),\dots,f(\bm{x}_{n})]^{T}$ , $\bm{1}=[1,\dots,1]^{T}$ , $(\bm{K}_{0})_{i,j}=k_{0}(\bm{x}_{i},\bm{x}_{j})$ and the vector

contains predictions for $\bm{f}_{1}$ based only on $\mathcal{D}_{0}$ , with $(\bm{K}_{1,0})_{i,j}=k_{0}(\bm{x}_{m+i},\bm{x}_{j})$ .

The estimator is a weighted combination of function values $\bm{f}=[\bm{f}_{0}^{T},\bm{f}_{1}^{T}]^{T}$ with weights summing to one. Estimates are readily obtained using standard matrix algebra. Moreover the weights are independent of the test function $f$ and can be re-used to estimate multiple expectations $\mu(f_{j})$ for a collection $\{f_{j}\}$ .

The samples $\mathcal{D}_{1}$ enter only through the term $(*)$ in Eqn. 3, which vanishes in probability as $m\rightarrow\infty$ . Thus any randomness due to $\mathcal{D}_{1}$ vanishes and this gives another perspective on the source of super-root- $n$ convergence of the estimator.

The term $(**)$ in Eqn. 3 is algebraically equivalent to BMC based on $\mathcal{H}_{+}$ . i.e. $(**)$ is the posterior mean for $\mu(f)$ based on a Gaussian process (GP) prior $f\sim\mathcal{GP}(0,k_{+})$ and data $\mathcal{D}_{0}$ (Rasmussen and Ghahramani,, 2003, Eqn. 9). Our general construction in Lemma 3 therefore “heals” BMC in the sense of (i) de-biasing the BMC estimator, (ii) generalising BMC to un-normalised densities and (iii) remaining agnostic to statistical paradigm (e.g. frequentist vs. Bayesian).

3.3 Non-asymptotic bounds

The naive computational complexity associated with the RLS estimate is $O(m^{3})$ due to the solution of an $m\times m$ linear system. In situations where $\bm{X}$ is expensive to simulate or $f$ is expensive to evaluate, $m$ is necessarily small and this additional computational cost will be negligible relative to model-based computation. In such scenarios we are more interested in non-asymptotic behaviour:

Assume (A1,2’,3,5). Let $\lambda\searrow 0$ to simplify presentation. Then

Here $(\bm{K}_{1})_{i,j}=k_{0}(\bm{x}_{m+i},\bm{x}_{m+j})$ and $\bm{K}_{0,1}=\bm{K}_{1,0}^{T}$ .

In the extreme case where $\bm{x}\neq\bm{x}^{\prime}\implies k_{0}(\bm{x},\bm{x}^{\prime})=0$ , the discrepancy $D(\mathcal{D}_{0},\mathcal{D}_{1})$ reduces to $(n-m)^{-1}$ and we recover the usual root- $n$ rate. A similar bound forms the basis for recent work on two-sample testing by Chwialkowski et al., (2016); Liu et al., (2016).

3.4 Implementation

Several randomly chosen splits of the samples $\mathcal{D}$ into subsets $\mathcal{D}_{0}$ and $\mathcal{D}_{1}$ may be averaged over to reduce estimator variance. We note that a multi-splitting estimator remains unbiased. As an alternative to multi-splitting, for applications where consistency suffices and unbiased estimation is not essential, we also propose the simplified estimator $(**)$ in Eqn. 3 with $m=n$ . Empirical results below show that bias is negligible for practical purposes and, to pre-empt our conclusions, we recommend this simplified estimator for use in applications due to its reduced variance compared to the multi-splitting estimator. In all cases the regularisation parameter $\lambda$ was taken to be the smallest power of 10 such that the kernel matrix $\bm{K}_{0}+\lambda\bm{I}$ has condition number lower than $10^{10}$ .

A kernel $k(\bm{x},\bm{x}^{\prime};\bm{\alpha})$ typically involves hyper-parameters $\bm{\alpha}$ that must be specified. Selection of $\bm{\alpha}$ can proceed via cross-validation, under the assumption that $\mathcal{D}_{0}$ are independent samples from $\pi$ . Specifically, we randomly split the samples $\mathcal{D}_{0}$ into $m^{\prime}$ training samples $\mathcal{D}_{0,0}$ and $m-m^{\prime}$ test samples $\mathcal{D}_{0,1}$ . Then we propose to select $\bm{\alpha}$ to minimise $\|\bm{f}_{(0,1)}-\hat{\bm{f}}_{(0,1)}\|_{2}$ where $\bm{f}_{(0,1)}$ is a vector of values $f_{i}$ for $\bm{x}_{i}\in\mathcal{D}_{0,1}$ , and $\hat{\bm{f}}_{(0,1)}$ are the corresponding predicted values. In this way we are targeting the EFAE that reflects the variance of the CF estimator. We emphasise that the cross-validated estimator does not require additional sampling and will remain unbiased provided that preliminary cross-validation is performed only using $\mathcal{D}_{0}$ . In this paper we employed the kernel defined in Remark 4, with hyper-parameters $\alpha_{1},\alpha_{2}$ . Full pseudocode is provided in the supplement.

4 Illustration

To illustrate the method we begin with simple, tractable examples. Consider the synthetic problem of estimating the expectation of $f(\bm{X})=\sin(\frac{\pi}{d}\sum_{i=1}^{d}X_{i})$ where $\bm{X}$ is a $d$ -dimensional standard Gaussian random variable. By symmetry the true expectation is $\mu(f)=0$ . Initially we take $n=50$ IID samples and consider the scalar case $d=1$ . Cross-validation was used to select tuning parameters. Specifically: (i) We selected the hyper-parameters $\alpha_{1}=0.1$ , $\alpha_{2}=1$ on the basis that this approximately minimised the cross-validation error (Fig. S1a). (ii) We found that estimator variance due to sample-splitting was minimised when at least half of the samples were allocated to $\mathcal{D}_{0}$ (Fig. S1b). We therefore set a conservative default $m=\lceil n/2\rceil$ . (iii) Empirical results showed that little additional variance reduction occurs from employing multiple splits (Fig. S1c), so we chose to just use a single split. (iv) Finally, we found that the bias of the simplified estimator was negligible ( $<\sim 10^{-3}$ ) compared to Monte Carlo error ( $\sim 10^{-2}$ ) (Fig. S1d). This is in line with an analogous result for classical control variates, where estimator bias vanishes asymptotically with respect to Monte Carlo error (Glasserman,, 2004, p.200).

In Fig. 2 we summarise the sampling distribution of both the sample-splitting and simplified CF estimators as the number of samples $n$ is varied. The alternative approaches of the arithmetic mean, Riemann sums and “zero variance” (ZV) control variates are also shown, the latter being based on quadratic polynomials (Mira et al.,, 2013). It is visually apparent that CFs enjoy the lowest variance at all samples sizes considered. We note that in this synthetic example, where there are essentially no computational restrictions, the CF framework is unnecessary and gains in precision come with comparable increases in computational cost. However we emphasise that, in the serious applications that follow, the CF calculations requires negligible computational resources in comparison to simulation from the model. Super-root- $n$ convergence could perhaps be achieved by employing polynomials of increasing degree in the ZV method, but our implementation of this approach did not provide stable estimates in this example (full details in the supplement).

Since the performance of CF is so pronounced, in order to more clearly visualise the results for all sample sizes, in Fig. 3 we plot the estimator mean square error (MSE) scaled by $n$ , so that root- $n$ convergence corresponds to a horizontal line. Empirical results here are consistent with theory, showing that the arithmetic mean and control variates all achieve a constant factor variance reduction, whereas Riemann sums and CFs achieve super-root- $n$ convergence. In this example CFs significantly outperformed Riemann sums, the latter being based on piecewise linear approximations. We plot results for both the sample-splitting CF estimator and the simplified CF estimator, observing that the latter has lower variance.

To assess the generality of our conclusions we considered going beyond the scalar case to examples with dimensions $d=3$ and $d=5$ . The analogous results in Figs. S2a, S2b show that, whilst increasing dimensionality presents fundamental challenges for all the variance reduction methods, CF continues to out-perform alternatives. Going further we considered a variety of alternative problems, varying both the test function $f$ and the density $\pi$ . These include several pathological cases, with results summarised in Table S1. The results marked (b) echo the conclusions of Mira et al., (2013), that ZV control variates are effective in many cases where $f$ is well-approximated by a low-degree polynomial and $\pi$ is a Gaussian or gamma density. However, when $f$ is not well-approximated by a low-degree polynomial, or when $\pi$ takes a more complex form, as in cases marked (c), ZV control variates can be outperformed by CFs, which have the potential to decrease variance dramatically. We then investigated how CFs can fail when theoretical assumptions are violated (see examples marked “CF $\times$ ”). As expected, violation of (A2) and (A5) in (e), (g) respectively led to poor performance of the CF estimator. Interestingly, violation of differentiability in example (f) did not lead to poor estimation, though this may be because $\pi$ was only non-differentiable at a single point.

We have not reported computational times for these experiments. Our work is motivated by settings in which either simulation from $\pi$ or evaluation of $f$ (or both) are computationally prohibitive, so that additional effort required to implement CFs is negligible by comparison; we illustrate this with two realistic applications the next section.

Applications

Two applications are considered that together present many of the challenges associated with complex models. Firstly we consider marginalisation of hyper-parameters in hierarchical models, focussing on a non-parametric prediction problem. Here evaluation of $f$ forms a computational bottleneck due to the required inversion of a large matrix. For this problem, CFs are shown to offer significant computational savings. Secondly we consider computation of normalising constants for models based on non-linear ordinary differential equations (ODEs). Evaluation of the likelihood function requires numerical integration of a system of ODEs and dominates computational expenditure in both sampling from $\pi$ and evaluation of $f$ . Here CFs combine with gradient-based population MCMC and thermodynamic integration in order to deliver a state-of-the-art technique for low-variance estimation of normalising constants.

A fully Bayesian treatment of hierarchical models aims to marginalise over hyper-parameters, but this often entails a prohibitive level of computation. Here we explore the efficacy of CFs in such situations.

1.2 Marginalising the GP hyper-parameters

We are interested in predicting the value of the response $Y_{*}$ corresponding to an unseen state vector $\bm{z}_{*}$ . Our estimator will be the Bayesian posterior mean given by

where we implicitly condition on the covariates $\bm{z}_{1},\dots,\bm{z}_{N},\bm{z}_{*}$ . Eqn. 4 is unavailable in closed form and we therefore naive a Monte Carlo estimate by sampling $\bm{\theta}_{1},\dots,\bm{\theta}_{n}$ independently from the prior $\pi(\bm{\theta})$ (more efficient QMC estimates are considered later). Phrasing in terms of our previous notation, the function of interest is

where $(\bm{C}_{N})_{i,j}=c(\bm{z}_{i},\bm{z}_{j};\bm{\theta})$ and $(\bm{C}_{*,N})_{1,j}=c(\bm{z}_{*},\bm{z}_{j};\bm{\theta})$ and the underlying distribution is $\pi(\bm{\theta})$ . Each evaluation of the integrand $f(\bm{\theta})$ requires $O(N^{3})$ operations due to the matrix inversion; this can be reduced by employing a “subset of regressors” approximation

where $N^{\prime}<N$ denotes a subset of the full data (see Sec. 8.3.1 of Rasmussen and Williams,, 2006, for full details). To facilitate the illustration below, which investigates the sampling distribution of estimators, we take a random subset of $N=1,000$ training points and a subset of regressors approximation with $N^{\prime}=100$ . However we emphasise that evaluation of Eqn. 5 will typically be based on much larger $N$ and $N^{\prime}$ and will be extremely expensive in general. In applications we would therefore have to proceed with Monte Carlo estimation based on only a small number $n$ of these function evaluations.

1.3 SARCOS robot arm

We used the hierarchical GP model in Sec. 3.1.2 to estimate the inverse dynamics of a seven degrees-of-freedom SARCOS anthropomorphic robot arm. The task, as described in Rasmussen and Williams, (2006, Sec. 8.3.1), is to map from a 21-dimensional input space (7 positions, 7 velocities, 7 accelerations) to the corresponding 7 joint torques using the hierarchical GP model described in Sec. 3.1.1. Following Rasmussen and Williams, (2006) we present results below on just one of the mappings, from the 21 input variables to the first of the seven torques. The dataset consists of 48,933 input-output pairs, of which 44,484 were used as a training set and the remaining 4,449 were used as a test set. The inputs were translated and scaled to have mean zero and unit variance on the training set. The outputs were centred so as to have mean zero on the training set. Here $\sigma=0.1$ , $\alpha=\gamma=25$ , $\beta=\delta=0.04$ , so that each hyper-parameter $\theta_{i}$ has a prior mean of $1$ and a prior standard deviation of $0.2$ .

For each test point $\bm{z}_{*}$ we estimated the sampling standard deviation of $\hat{Y}_{*}$ over 10 independent realisations of the Monte Carlo sampling procedure. For CF we took default hyper-parameters $\alpha_{1}=0.1$ , $\alpha_{2}=1$ , the latter reflecting the fact that the training data were standardised. The estimator standard deviations were estimated in this way for all 4,449 test samples and the full results are shown in Fig. 4. Note that each test sample corresponds to a different function $f$ and thus these results are quite objective, encompassing thousands of different Monte Carlo integration problems. Results show that, for the vast majority of integration problems, CF achieves a lower estimator variance compared with both the arithmetic mean estimator and ZV control variates. Here the cost of post-processing the Monte Carlo samples (using either ZV control variates or CF) is negligible in comparison to the cost of evaluating the function $f$ , even once. Indeed, CF requires that we invert a $n\times n$ matrix once, where $n$ is no larger than $N^{\prime}$ in this example.

In the supplement we investigate an extension that draws design points $\mathcal{D}_{0}$ using RQMC. Results show that CFs+RQMC outperforms RQMC alone.

2 Normalising constants for non-linear ODE models

Our second application concerns the estimation of normalising constants for non-linear ODE models (e.g. Calderhead and Girolami,, 2009). Recent empirical investigations recommend thermodynamic integration (TI) for this task (e.g. Friel and Wyse,, 2012). The control variate method of Mira et al., (2003) was recently applied to TI by Oates et al., (2016), who found that this “controlled thermodynamic integral” (CTI) was extremely effective for standard regression models, but only moderately effective in complex models including non-linear ODEs due to poor approximation by low-degree polynomials. Below we study the application of CFs to TI in this setting where CTI is less effective.

Conditional on an inverse temperature parameter $t$ , the “power posterior” for parameters $\bm{\theta}$ given data $\bm{y}$ is defined as $p(\bm{\theta}|\bm{y},t)\propto p(\bm{y}|\bm{\theta})^{t}p(\bm{\theta})$ (Friel and Pettitt,, 2008). Varying $t\in$ produces a continuous path between the prior $p(\bm{\theta})$ and the posterior $p(\bm{\theta}|\bm{y})$ and it is assumed here that all intermediate distributions exist and are well-defined. The standard thermodynamic identity is

where the expectation in the integrand is with respect to the power posterior. In TI, the one-dimensional integral in Eqn. 6 is evaluated numerically using a quadrature approximation over a discrete temperature ladder $0=t_{0}<t_{1}<\dots<t_{m}=1$ . Here we use the second-order quadrature recommended by Friel et al., (2014):

where $\hat{\mu}_{i}$ , $\hat{\nu}_{i}$ are Monte Carlo estimates of the posterior mean and variance respectively of $\log p(\bm{y}|\bm{\theta})$ when $\bm{\theta}$ arises from $p(\bm{\theta}|\bm{y},t_{i})$ . CTI uses ZV control variates to reduce the variance of these estimates. However, in complex models $\log p(\bm{y}|\bm{\theta})$ will be poorly approximated by a low-degree polynomial and $p(\bm{\theta}|\bm{y},t)$ will be non-Gaussian; this explains the mediocre performance of CTI in these cases. In contrast, CFs should still be able to deliver gains in estimation.

2.2 Non-linear ODE models

The approach is illustrated by computing the marginal likelihood for a non-linear ODE model (the van der Pol oscillator), described in full in the supplement. For TI, a temperature schedule $t_{i}=(i/30)^{5}$ was used, following the recommendation by Calderhead and Girolami, (2009). The power posterior is not available in closed form, precluding the straight-forward generation of IID samples. Instead, samples from each of the power posteriors $p(\bm{\theta}|\bm{y},t_{i})$ were obtained using population MCMC, involving both (i) “within-temperature” proposals produced by the (simplified) m-MALA algorithm of Girolami and Calderhead, (2011), and (ii) “between-temperature” proposals, as described previously by Calderhead and Girolami, (2009). Gradient information is thus pre-computed in the sampling scheme and can be leveraged “for free”, as noted by Papamarkou et al., (2014). We denote the number of samples by $n$ , such that for each of the 31 temperatures we obtained $n$ samples (a total of $31\times n$ occasions where the system of ODEs was integrated numerically). Both sampling and evaluation of the integrand are computationally expensive, requiring the numerical solution of a system of ODEs.

Results in Fig. 5 show that the CTI estimator improves upon the standard TI estimator, but a more substantial reduction in estimator variance results from using the CF method. For the CF computation we have used the simplified but biased CF estimator, since TI in any case produces a biased estimate for the normalising constant due to numerical quadrature. The hyper-parameters $\alpha_{1}=0.1$ , $\alpha_{2}=3$ were selected on the basis of cross-validation. The additional cost of using CF is essentially zero relative to running the population MCMC sampler, the latter requiring repeated solution of the ODE system.

Discussion

This paper developed a novel and general approach to integration that achieves super-root- $n$ convergence. An important feature of CFs is that variance reduction is formulated as a post-hoc step. This has several advantages: (i) No modification is required to existing computer code associated with either the sampling process or the model itself. (ii) Specific implementational choices, e.g. for the kernel, can be made after performing expensive simulations. Through exploitation of recent results in functional analysis we were able to realise our general framework and construct estimators with an analytic form. Empirical results evidenced the practical utility of CF estimators in settings where gradient information is available and the dimensionality of the problem is not too large (e.g. $\leq 10$ ). The paper concludes below by suggesting directions for further research.

In terms of methodology: (i) The estimates we presented here are not parameterisation-invariant. Likewise the specification of $f$ and $\pi$ is not unique, as we can employ an importance sampling transformation $f\mapsto(f\pi)/\pi^{\prime}$ , $\pi\mapsto\pi^{\prime}$ . It would therefore be interesting to elicit effective parametrisations as an additional post-hoc step. (ii) The version of CFs presented here is limited in terms of the dimension of the problems for which it is effective. Techniques for high-dimensional functional approximation should be applicable in the context of CFs (e.g. Dick et al.,, 2013) and this forms part of our ongoing research.

In terms of theory: (i) For bounded $\Omega$ , sharper asymptotics are provided in a sequel, Oates et al., (2016). These account for various levels of smoothness of both $f$ and $\pi$ and help to explain the strong empirical results presented here. However the case of unbounded $\Omega$ seems considerably more challenging to characterise. (ii) For problems involving un-normalised densities $\pi$ , sampling is naturally facilitated by MCMC. The analysis of CFs is carried out in Oates et al., (2016) under a uniform ergodicity assumption. For unbounded $\Omega$ this condition is too strong and future work will aim to relax this constraint.

In terms of application: (i) Our methods were motivated by the un-normalised densities arising in Bayesian computation. An extension should be possible to models with unknown, parameter-dependent normalising constants, which include e.g. Markov random fields (Everitt,, 2012) and random network models (Friel et al.,, 2015). (ii) An interesting direction would be to use the discrepancy $D(\mathcal{D}_{0},\mathcal{D}_{1})$ as a tool for assessment of MCMC convergence, providing a reproducing kernel Hilbert space alternative to Gorham and Mackey, (2015).

Finally we note that Oates and Girolami, (2016) provide a complementary study of CF strategies in the QMC setting.

The authors are grateful to the editor and referees, whose valuable feedback helped to improve the paper. The authors benefited from discussions with Sergios Agapiou, Michel Caffarel, Adam Johansen, Christian Robert, Daniel Simpson and Tim Sullivan. CJO was supported by EPSRC [EP/D002060/1] and the ARC Centre for Excellence in Mathematical and Statistical Frontiers. MG was supported by EPSRC [EP/J016934/1], EU [EU/259348], an EPSRC Established Career Fellowship and a Royal Society Wolfson Research Merit Award. NC was supported by the ANR (Agence Nationale de la Recherche) grant Labex ECODEC ANR [11-LABEX-0047].

Appendix A Proofs

(A1) ensures $\psi$ is well-defined. Since $\partial\Omega$ is piecewise smooth we can apply the divergence theorem (e.g. Bourne and Kendall,, 1977, p.159) to obtain

which is zero by (A2). The use of this identity in statistical applications is often attributed to Stein, (1970). Thus $\mu(s_{f,\mathcal{D}_{0}})=\mu(c)+\mu(\psi)=c$ , as required. ∎

Clearly $\mathcal{H}_{0}$ is a vector space with addition and multiplication defined pointwise; $(\lambda\psi+\lambda^{\prime}\psi^{\prime})(\bm{x})=\lambda\psi(\bm{x})+\lambda^{\prime}\psi^{\prime}(\bm{x})$ .

Stage 2: We now show that $\mathcal{H}_{0}$ can be endowed with the structure of a RKHS. To this end, define a norm on $\mathcal{H}_{0}$ by

Theorem 4.21 (p121) of Steinwart and Christmann, (2008) immediately gives that the normed space $(\mathcal{H}_{0},\|\cdot\|_{\mathcal{H}_{0}})$ is a RKHS whose kernel $k_{0}$ satisfies $k_{0}(\bm{x},\bm{x}^{\prime})=\langle\Phi^{*}(\bm{x}),\Phi^{*}(\bm{x}^{\prime})\rangle_{\mathcal{H}^{d}}$ . Thus we can directly calculate

where the interchange of derivative and inner product is justified by (A3) and Lemma 4.34 (p130) in Steinwart and Christmann, (2008). This completes the proof. ∎

(A1,3) ensure the kernel $k_{0}$ is well-defined. Then

Now using the divergence theorem (Bourne and Kendall,, 1977, p.159) we obtain

From Theorem 1, (A1,3) ensure $\mathcal{H}_{0}$ is well-defined. Moreover, from Lemma 1, (A1,2’,3) imply that $\mu(\psi)=0$ and thus

Now, given $\psi\in\mathcal{H}_{0}$ , we need to show $\sigma^{2}(\psi)<\infty$ . By the reproducing property followed by the Cauchy-Schwarz inequality, we have

Using the reproducing property again, we have $\|k_{0}(\cdot,\bm{x})\|_{\mathcal{H}_{0}}^{2}=k_{0}(\bm{x},\bm{x})$ and it follows from (A4) that

From Theorem 1, (A1,3) ensure $\mathcal{H}_{+}$ is well-defined. Unbiasedness follows from (A1,2’,3) and Lemma 2. Below we employ the standard notation

and denote the standard norm on this space by $\|\cdot\|_{L^{2}(\pi)}$ .

It therefore remains to prove requirements (i) and (ii) above are satisfied. For (i) we have that $\sup_{\bm{x}\in\Omega}k_{+}(\bm{x},\bm{x}^{\prime})=1+\sup_{\bm{x}\in\Omega}k_{0}(\bm{x},\bm{x})$ , where the second term is finite by (A4’). For (ii), Prop. 3.3 of Sun and Wu, (2009) (which does not depend on Theorem 1.1 of the same paper) shows that, when (i) holds, we have $\|T^{-1/2}h\|_{L^{2}(\pi)}=\|h\|_{\mathcal{H}_{+}}$ for all $h\in\mathcal{H}_{+}$ . Since $f\in\mathcal{H}_{+}$ by (A5) we thus have $\|T^{-1/2}f\|_{L^{2}(\pi)}=\|f\|_{\mathcal{H}_{+}}<\infty$ and so $T^{-1/2}f\in L^{2}(\pi)$ , as required. ∎

From Theorem 1, (A1,3) ensure $\mathcal{H}_{0}$ is well-defined. The interpolation problem is equivalently expressed as $s_{f,\mathcal{D}_{0}}=\hat{c}+\hat{\psi}$ where

For fixed $c\in\mathcal{C}$ , the representer theorem (Steinwart and Christmann,, 2008, Thm. 5.5, p168) tells us that the solution

takes the form $\psi(\bm{x})=\sum_{i=1}^{m}\beta_{i}k_{0}(\bm{x}_{i},\bm{x})$ where, due to the reproducing property, $\|\psi\|_{\mathcal{H}_{0}}^{2}=\bm{\beta}^{T}\bm{K}_{0}\bm{\beta}$ . Thus writing $\bm{\beta}=[\beta_{1},\dots,\beta_{m}]^{T}$ reduces the problem to

Differentiating with respect to $c$ and $\bm{\beta}$ leads, via the Woodbury matrix inversion identity, to the solution

and associated fitted values $\hat{\bm{f}}_{1}=\hat{c}\bm{1}+\bm{K}_{1,0}\hat{\bm{\beta}}$ at the points $\mathcal{D}_{1}$ . Putting this together, we have

From Theorem 1, (A1,3) ensure $\mathcal{H}_{0}$ is well-defined. The CF estimator takes the form

where, by Lemma 3, the vector of weights $\bm{w}=[w_{1},\dots,w_{n}]^{T}$ is given by

and satisfies $\bm{1}^{T}\bm{w}=1$ . Using the reproducing property, the estimation error is

It follows from the Cauchy-Schwarz inequality that

The first term satisfies $\|\psi\|_{\mathcal{H}_{0}}^{2}\leq\hat{c}^{2}+\|\psi\|_{\mathcal{H}_{0}}^{2}=\|f\|_{\mathcal{H}_{+}}^{2}$ and, from the reproducing property, the second term satisfies

Finally, upon substituting Eqn. 10 into Eqn. 13 we obtain the required result with $D(\mathcal{D}_{0},\mathcal{D}_{1})=\bm{w}^{T}\bm{K}\bm{w}$ . The special case $\lambda=0$ is reported in the statement of the theorem. ∎