Frank-Wolfe Bayesian Quadrature: Probabilistic Integration with Theoretical Guarantees

François-Xavier Briol, Chris J. Oates, Mark Girolami, Michael A. Osborne

Introduction

Computing integrals is a core challenge in machine learning and numerical methods play a central role in this area. This can be problematic when a numerical integration routine is repeatedly called, maybe millions of times, within a larger computational pipeline. In such situations, the cumulative impact of numerical errors can be unclear, especially in cases where the error has a non-trivial structural component. One solution is to model the numerical error statistically and to propagate this source of uncertainty through subsequent computations. Conversely, an understanding of how errors arise and propagate can enable the efficient focusing of computational resources upon the most challenging numerical integrals in a pipeline.

Classical numerical integration schemes do not account for prior information on the integrand and, as a consequence, can require an excessive number of function evaluations to obtain a prescribed level of accuracy . Alternatives such as Quasi-Monte Carlo (QMC) can exploit knowledge on the smoothness of the integrand to obtain optimal convergence rates . However these optimal rates can only hold on sub-sequences of sample sizes nn, a consequence of the fact that all function evaluations are weighted equally in the estimator . A modern approach that avoids this problem is to consider arbitrarily weighted combinations of function values; the so-called quadrature rules (also called cubature rules). Whilst quadrature rules with non-equal weights have received comparatively little theoretical attention, it is known that the extra flexibility given by arbitrary weights can lead to extremely accurate approximations in many settings (see applications to image de-noising and mental simulation in psychology ).

Probabilistic numerics, introduced in the seminal paper of , aims at re-interpreting numerical tasks as inference tasks that are amenable to statistical analysis.A detailed discussion on probabilistic numerics and an extensive up-to-date bibliography can be found at http://www.probabilistic-numerics.org. Recent developments include probabilistic solvers for linear systems and differential equations . For the task of computing integrals, Bayesian Quadrature (BQ) and more recent work by provide probabilistic numerics methods that produce a full posterior distribution on the output of numerical schemes. One advantage of this approach is that we can propagate uncertainty through all subsequent computations to explicitly model the impact of numerical error . Contrast this with chaining together classical error bounds; the result in such cases will typically be a weak bound that provides no insight into the error structure. At present, a significant shortcoming of these methods is the absence of theoretical results relating to rates of posterior contraction. This is unsatisfying and has likely hindered the adoption of probabilistic approaches to integration, since it is not clear that the induced posteriors represent a sensible quantification of the numerical error (by classical, frequentist standards).

This paper establishes convergence rates for a new probabilistic approach to integration. Our results thus overcome a key perceived weakness associated with probabilistic numerics in the quadrature setting. Our starting point is recent work by , who cast the design of quadrature rules as a problem in convex optimisation that can be solved using the Frank-Wolfe (FW) algorithm. We propose a hybrid approach of with BQ, taking the form of a quadrature rule, that (i) carries a full probabilistic interpretation, (ii) is amenable to rigorous theoretical analysis, and (iii) converges orders-of-magnitude faster, empirically, compared with the original approaches in . In particular, we prove that super-exponential rates hold for posterior contraction (concentration of the posterior probability mass on the true value of the integral), showing that the posterior distribution provides a sensible and effective quantification of the uncertainty arising from numerical error. The methodology is explored in simulations and also applied to a challenging model selection problem from cellular biology, where numerical error could lead to mis-allocation of expensive resources.

Background

There are many ways of choosing combinations {xi,wi}i=1n\{x_{i},w_{i}\}_{i=1}^{n} in the literature. For example, taking weights to be wi=1/nw_{i}=1/n with points {xi}i=1n\{x_{i}\}_{i=1}^{n} drawn independently from the probability distribution p(x)p(x) recovers basic Monte Carlo integration. The case with weights wi=1/nw_{i}=1/n, but with points chosen with respect to some specific (possibly deterministic) schemes includes kernel herding and Quasi-Monte Carlo (QMC) . In Bayesian Quadrature, the points {xi}i=1n\{x_{i}\}_{i=1}^{n} are chosen to minimise a posterior variance, with weights {wi}i=1n\{w_{i}\}_{i=1}^{n} arising from a posterior probability distribution.

This shows that to obtain low integration error in the RKHS H\mathcal{H}, one only needs to obtain a good approximation of its mean element μp\mu_{p} (as fH\forall f\in\mathcal{H}: p[f]=f,μpHp[f]=\langle f,\mu_{p}\rangle_{\mathcal{H}}). Establishing theoretical results for such quadrature rules is an active area of research .

2 Bayesian Quadrature

Thus, in the RKHS setting, minimising the posterior variance corresponds to minimising the worst case error of the quadrature rule. Below we refer to Optimal BQ (OBQ) as BQ coupled with design points {xiOBQ}i=1n\{x_{i}^{\text{OBQ}}\}_{i=1}^{n} chosen to globally minimise (5). We also call Sequential BQ (SBQ) the algorithm that greedily selects design points to give the greatest decrease in posterior variance at each iteration. OBQ will give improved results over SBQ, but cannot be implemented in general, whereas SBQ is comparatively straight-forward to implement. There are currently no theoretical results establishing the convergence of either BQ, OBQ or SBQ.

3 Deriving Quadrature Rules via the Frank-Wolfe Algorithm

Despite the elegance of BQ, its convergence rates have not yet been rigorously established. In brief, this is because p^BQ[f]\hat{p}_{\text{BQ}}[f] is an orthogonal projection of ff onto the affine hull of {Φ(xi)}i=1n\{\Phi(x_{i})\}_{i=1}^{n}, rather than e.g. the convex hull. Standard results from the optimisation literature apply to bounded domains, but the affine hull is not bounded (i.e. the BQ weights can be arbitrarily large and possibly negative). Below we describe a solution to the problem of computing integrals recently proposed by , based on the FW algorithm, that restricts attention to the (bounded) convex hull of {Φ(xi)}i=1n\{\Phi(x_{i})\}_{i=1}^{n}.

At each iteration ii, the FW algorithm computes a linearisation of the objective function JJ at the previous state gi1Gg_{i-1}\in\mathcal{G} along its gradient (DJ)(gi1)(DJ)(g_{i-1}) and selects an ‘atom’ gˉiG\bar{g}_{i}\in\mathcal{G} that minimises the inner product a state gg and (DJ)(gi1)(DJ)(g_{i-1}). The new state giGg_{i}\in\mathcal{G} is then a convex combination of the previous state gi1g_{i-1} and of the atom gˉi\bar{g}_{i}. This convex combination depends on a step-size ρi\rho_{i} which is pre-determined and different versions of the algorithm may have different step-size sequences.

Our goal in quadrature is to approximate the mean element μp\mu_{p}. Recently proposed to frame integration as a FW optimisation problem. Here, the domain GH\mathcal{G}\subseteq\mathcal{H} is a space of functions and taking the objective function to be:

This gives an approximation of the mean element and JJ takes the form of half the posterior variance (or the MMD2). In this functional approximation setting, minimisation of JJ is carried out over G=M\mathcal{G}=\mathcal{M}, the marginal polytope of the RKHS H\mathcal{H}. The marginal polytope M\mathcal{M} is defined as the closure of the convex hull of Φ(X)\Phi(\mathcal{X}), so that in particular μpM\mu_{p}\in\mathcal{M}. Assuming as in that Φ(x)\Phi(x) is uniformly bounded in feature space (i.e. R>0:xX\exists R>0:\forall x\in\mathcal{X}, Φ(x)HR\|\Phi(x)\|_{\mathcal{H}}\leq R), then M\mathcal{M} is a closed and bounded set and can be optimised over.

A particular advantage of this method is that it leads to ‘sparse’ solutions which are linear combinations of the atoms {gˉi}i=1n\{\bar{g}_{i}\}_{i=1}^{n} . In particular this provides a weighted estimate for the mean element:

where by default ρ0=1\rho_{0}=1 which leads to all wiFWw_{i}^{\text{FW}}\in when ρi=1/(i+1)\rho_{i}=1/(i+1). A typical sequence of approximations to the mean element is shown in Fig. 1 (left), demonstrating that the approximation quickly converges to the ground truth (in black). Since minimisation of a linear function can be restricted to extreme points of the domain, the atoms will be of the form gˉi=Φ(xiFW)=k(,xiFW)\bar{g}_{i}=\Phi(x_{i}^{\text{FW}})=k(\cdot,x_{i}^{\text{FW}}) for some xiFWXx_{i}^{\text{FW}}\in\mathcal{X}. The minimisation in gg over G\mathcal{G} from step 2 in Algorithm 1 therefore becomes a minimisation in xx over X\mathcal{X} and this algorithm therefore provides us design points. In practice, at each iteration ii, the FW algorithm hence selects a design point xiFWXx_{i}^{\text{FW}}\in\mathcal{X} which induces an atom gˉi\bar{g}_{i} and gives us an approximation of the mean element μp\mu_{p}. We denote by μ^FW\hat{\mu}_{\text{FW}} this approximation after nn iterations. Using the reproducing property, we can show that the FW estimate is a quadrature rule:

The total computational cost for FW is O(n2)\mathcal{O}(n^{2}). An extension known as FW with Line Search (FWLS) uses a line-search method to find the optimal step size ρi\rho_{i} at each iteration (see Alg. 1). Once again, the approximation obtained by FWLS has a sparse expression as a convex combination of all the previously visited states and we obtain an associated quadrature rule. FWLS has theoretical convergence rates that can be stronger than standard versions of FW but has computational cost in O(n3)\mathcal{O}(n^{3}). The authors in provide a survey of FW-based algorithms and their convergence rates under different regularity conditions on the objective function and domain of optimisation.

Remark: The FW design points {xiFW}i=1n\{x_{i}^{\text{FW}}\}_{i=1}^{n} are generally not available in closed-form. We follow mainstream literature by selecting, at each iteration, the point that minimises the MMD over a finite collection of MM points, drawn i.i.d from p(x)p(x). The authors in proved that this approximation adds a O(M1/4)\mathcal{O}(M^{-1/4}) term to the MMD, so that theoretical results on FW convergence continue to apply provided that M(n)M(n)\rightarrow\infty sufficiently quickly. Appendix A provides full details. In practice, one may also make use of a numerical optimisation scheme in order to select the points.

A Hybrid Approach: Frank-Wolfe Bayesian Quadrature

To combine the advantages of a probabilistic integrator with a formal convergence theory, we propose Frank-Wolfe Bayesian Quadrature (FWBQ). In FWBQ, we first select design points {xiFW}i=1n\{x_{i}^{\text{FW}}\}_{i=1}^{n} using the FW algorithm. However, when computing the quadrature approximation, instead of using the usual FW weights {wiFW}i=1n\{w_{i}^{\text{FW}}\}_{i=1}^{n} we use instead the weights {wiBQ}i=1n\{w_{i}^{\text{BQ}}\}_{i=1}^{n} provided by BQ. We denote this quadrature rule by p^FWBQ\hat{p}_{\text{FWBQ}} and also consider p^FWLSBQ\hat{p}_{\text{FWLSBQ}}, which uses FWLS in place of FW. As we show below, these hybrid estimators (i) carry the Bayesian interpretation of Sec. 2.2, (ii) permit a rigorous theoretical analysis, and (iii) out-perform existing FW quadrature rules by orders of magnitude in simulations. FWBQ is hence ideally suited to probabilistic numerics applications.

The posterior mean p^FWBQ[f]\hat{p}_{\emph{FWBQ}}[f] converges to the true integral p[f]p[f] at the following rates:

where the FWBQ uses step-size ρi=1/(i+1)\rho_{i}=1/(i+1), D(0,)D\in(0,\infty) is the diameter of the marginal polytope M\mathcal{M} and R(0,)R\in(0,\infty) gives the radius of the smallest ball of center μp\mu_{p} included in M\mathcal{M}.

Note that all the proofs of this paper can be found in Appendix B. An immediate corollary of Theorem 1 is that FWLSBQ has an asymptotic error which is exponential in nn and is therefore superior to that of any QMC estimator . This is not a contradiction - recall that QMC restricts attention to uniform weights, while FWLSBQ is able to propose arbitrary weightings. In addition we highlight a robustness property: Even when the assumptions of this section do not hold, one still obtains atleast a rate OP(n1/2)\mathcal{O}_{P}(n^{-1/2}) for the posterior mean using either FWBQ or FWLSBQ .

Remark: The choice of kernel affects the convergence of the FWBQ method . Clearly, we expect faster convergence if the function we are integrating is ‘close’ to the space of functions induced by our kernel. Indeed, the kernel specifies the geometry of the marginal polytope M\mathcal{M}, that in turn directly influences the rate constant RR and DD associated with FW convex optimisation.

Consistency is only a stepping stone towards our main contribution which establishes posterior contraction rates for FWBQ. Posterior contraction is important as these results justify, for the first time, the probabilistic numerics approach to integration; that is, we show that the full posterior distribution is a sensible quantification (at least asymptotically) of numerical error in the integration routine:

where the FWBQ uses step-size ρi=1/(i+1)\rho_{i}=1/(i+1), D(0,)D\in(0,\infty) is the diameter of the marginal polytope M\mathcal{M} and R(0,)R\in(0,\infty) gives the radius of the smallest ball of center μp\mu_{p} included in M\mathcal{M}.

The contraction rates are exponential for FWBQ and super-exponential for FWLBQ, and thus the two algorithms enjoy both a probabilistic interpretation and rigorous theoretical guarantees. A notable corollary is that OBQ enjoys the same rates as FWLSBQ, resolving a conjecture by Tony O’Hagan that OBQ converges exponentially [personal communication]:

The consistency and contraction rates obtained for FWLSBQ apply also to OBQ.

Experimental Results

To facilitate the experiments in this paper we followed and employed an exponentiated-quadratic (EQ) kernel k(x,x)λ2exp(\nicefrac12σ2xx22)k(x,x^{\prime})\coloneqq\lambda^{2}\exp(-\nicefrac{{1}}{{2\sigma^{2}}}\|x-x^{\prime}\|^{2}_{2}). This corresponds to an infinite-dimensional RKHS, not covered by our theory; nevertheless, we note that all simulations are practically finite-dimensional due to rounding at machine precision. See Appendix E for a finite-dimensional approximation using random Fourier features. EQ kernels are popular in the BQ literature as, when pp is a mixture of Gaussians, the mean element μp\mu_{p} is analytically tractable (see Appendix C). Some other (p,k)(p,k) pairs that produce analytic mean elements are discussed in .

For this simulation study, we took p(x)p(x) to be a 20-component mixture of 2D-Gaussian distributions. Monte Carlo (MC) is often used for such distributions but has a slow convergence rate in OP(n1/2)\mathcal{O}_{P}(n^{-1/2}). FW and FWLS are known to converge more quickly and are in this sense preferable to MC . In our simulations (Fig. 2, left), both our novel methods FWBQ and FWLSBQ decreased the MMD much faster than the FW/FWLS methods of . Here, the same kernel hyper-parameters (λ,σ)=(1,0.8)(\lambda,\sigma)=(1,0.8) were employed for all methods to have a fair comparison. This suggests that the best quadrature rules correspond to elements outside the convex hull of {Φ(xi)}i=1n\{\Phi(x_{i})\}_{i=1}^{n}. Examples of those, including BQ, often assign negative weights to features (Fig. S1 right, Appendix D).

The principle advantage of our proposed methods is that they reconcile theoretical tractability with a fully probabilistic interpretation. For illustration, Fig. 2 (right) plots the posterior uncertainty due to numerical error for a typical integration problem based on this p(x)p(x). In-depth empirical studies of such posteriors exist already in the literature and the reader is referred to for details.

Beyond these theoretically tractable integrators, SBQ seems to give even better performance as nn increases. An intuitive explanation is that SBQ picks {xi}i=1n\{x_{i}\}_{i=1}^{n} to minimise the MMD whereas FWBQ and FWLSBQ only minimise an approximation of the MMD (its linearisation along DJDJ). In addition, the SBQ weights are optimal at each iteration, which is not true for FWBQ and FWLSBQ. We conjecture that Theorem 1 and 2 provide upper bounds on the rates of SBQ. This conjecture is partly supported by Fig. 1 (right), which shows that SBQ selects similar design points to FW/FWLS (but weights them optimally). Note also that both FWBQ and FWLSBQ give very similar result. This is not surprising as FWLS has no guarantees over FW in infinite-dimensional RKHS .

2 Quantifying Numerical Error in a Proteomic Model Selection Problem

The problem is quickly exaggerated when the number mm of models increases, as there are more opportunities for one of the L(Mi)L(M_{i}) terms to be ‘too large’ due to numerical error. In , the number mm of models was combinatorial in the number of protein kinases measured in a high-throughput assay (currently 102\sim 10^{2} but in principle up to 104\sim 10^{4}). This led to deploy substantial computing resources to ensure that numerical error in each estimate of L(Mi)L(M_{i}) was individually controlled. Probabilistic numerics provides a more elegant and efficient solution: At any given stage, we have a fully probabilistic quantification of our uncertainty in each of the integrals L(Mi)L(M_{i}), shown to be sensible both theoretically and empirically. This induces a full posterior distribution over numerical uncertainty in the location of the MAP estimate (i.e. ‘Bayes all the way down’). As such we can determine, on-line, the precise point in the computational pipeline when numerical uncertainty near the MAP estimate becomes acceptably small, and cease further computation.

The FWBQ methodology was applied to one of the model selection tasks in . In Fig. 3 (left) we display posterior model probabilities for each of m=352m=352 candidates models, where a low number (n=10n=10) of samples were used for each integral. (For display clarity only the first 50 models are shown.) In this low-nn regime, numerical error introduces a second level of uncertainty that we quantify by combining the FWBQ error models for all integrals in the computational pipeline; this is summarised by a box plot (rather than a single point) for each of the models (obtained by sampling - details in Appendix D). These box plots reveal that our estimated posterior model probabilities are completely dominated by numerical error. In contrast, when nn is increased through 50, 100 and 200 (Fig. 3, right and Fig. S2), the uncertainty due to numerical error becomes negligible. At n=200n=200 we can conclude that model 2626 is the true MAP estimate and further computations can be halted. Correctness of this result was confirmed using the more computationally intensive methods in .

In Appendix D we compared the relative performance of FWBQ, FWLSBQ and SBQ on this problem. Fig. S1 shows that the BQ weights reduced the MMD by orders of magnitude relative to FW and FWLS and that SBQ converged more quickly than both FWBQ and FWLSBQ.

Conclusions

This paper provides the first theoretical results for probabilistic integration, in the form of posterior contraction rates for FWBQ and FWLSBQ. This is an important step in the probabilistic numerics research programme as it establishes a theoretical justification for using the posterior distribution as a model for the numerical integration error (which was previously assumed [11, 12, 20, 23, 27, e.g.]). The practical advantages conferred by a fully probabilistic error model were demonstrated on a model selection problem from proteomics, where sensitivity of an evaluation of the MAP estimate was modelled in terms of the error arising from repeated numerical integration.

The strengths and weaknesses of BQ (notably, including scalability in the dimension dd of X\mathcal{X}) are well-known and are inherited by our FWBQ methodology. We do not review these here but refer the reader to for an extended discussion. Convergence, in the classical sense, was proven here to occur exponentially quickly for FWLSBQ, which partially explains the excellent performance of BQ and related methods seen in applications , as well as resolving an open conjecture. As a bonus, the hybrid quadrature rules that we developed turned out to converge much faster in simulations than those in , which originally motivated our work.

A key open problem for kernel methods in probabilistic numerics is to establish protocols for the practical elicitation of kernel hyper-parameters. This is important as hyper-parameters directly affect the scale of the posterior over numerical error that we ultimately aim to interpret. Note that this problem applies equally to BQ, as well as related quadrature methods and more generally in probabilistic numerics . Previous work, such as , optimised hyper-parameters on a per-application basis. Our ongoing research seeks automatic and general methods for hyper-parameter elicitation that provide good frequentist coverage properties for posterior credible intervals, but we reserve the details for a future publication.

The authors are grateful for discussions with Simon Lacoste-Julien, Simo Särkkä, Arno Solin, Dino Sejdinovic, Tom Gunter and Mathias Cronjäger. FXB was supported by EPSRC [EP/L016710/1]. CJO was supported by EPSRC [EP/D002060/1]. MG was supported by EPSRC [EP/J016934/1], an EPSRC Established Career Fellowship, the EU grant [EU/259348] and a Royal Society Wolfson Research Merit Award.

References

Supplementary Material

A high-level pseudo-code description for the Frank-Wolfe Bayesian Quadrature (FWBQ) algorithm is provided below.

Frank-Wolfe Line-Search Bayesian Quadrature (FWLSBQ) is simply obtained by substituting the Frank-Wolfe algorithm with the Frank-Wolfe Line-Search algorithm. In this appendix, we derive all of the expressions necessary to implement both the FW and FWLS algorithms (for quadrature) in practice. All of the other steps can be derived from the relevant equations as highlighted in Algorithm 2 above.

The FW/FWLS are both initialised by the user choosing a design point x1FWx_{1}^{\text{FW}}. This can be done either at random or by choosing a location which is known to have high probability mass under p(x)p(x). The first approximation to μp\mu_{p} is therefore given by g1=k(,x1FW)g_{1}=k(\cdot,x_{1}^{\text{FW}}). The algorithm then loops over the next three steps to obtain new design points {xiFW}i=2n\{x_{i}^{\text{FW}}\}_{i=2}^{n}:

Step 1) Obtaining the new Frank-Wolfe design points xi+1FWx_{i+1}^{\text{FW}}.

At iteration ii, the step consists of choosing the point xˉiFW\bar{x}_{i}^{\text{FW}}. Let {wl(i)}l=1i1\{w_{l}^{(i)}\}_{l=1}^{i-1} denote the Frank-Wolfe weights assigned to each of the previous design points {xlFW}l=1i1\{x_{l}^{\text{FW}}\}_{l=1}^{i-1} at this new iteration, given that we choose xx as our new design point. The choice of new design point is done by computing the derivative of the objective function J(gi1)J(g_{i-1}) and finding the point xx^{*} which minimises the inner product:

To do so, we need to obtain an equivalent expression of the minimisation of the linearisation of JJ (denoted DJDJ) in terms of kernel values and evaluations of the mean element μp\mu_{p}. Since minimisation of a linear function can be restricted to extreme points of the domain, we have that

Then using the definition of JJ we have:

Our new design point xiFWx_{i}^{\text{FW}} is therefore the point xx^{*} which minimises this expression. Note that this equation may not be convex and may require us to make use of approximate methods to find the minimum xx^{*}. To do so, we sample MM points (where MM is large) independently from the distribution pp and pick the sample which minimises the expression above. From this introduces an additive error term of size O(M1/4)\mathcal{O}(M^{-1/4}), which does not impact our convergence analysis provided that M(n) vanishes sufficiently quickly. In all experiments we took MM between 10,00010,000 and 50,00050,000 so that this error will be negligible.

It is important to note that sampling from p(x)p(x) is likely to not be the best solution to optimising this expression. One may, for example, be better off using any other optimisation method which does not require convexity (for example, Bayesian Optimization). However, we have used sampling as the result from discussed above allows us to have a theoretical upper bound on the error introduced.

Step 2) Computing the Step-Sizes and Weights for the Frank-Wolfe and Frank-Wolfe Line-Search Algorithms.

Computing the weights {wl(i)}l=1n\{w_{l}^{(i)}\}_{l=1}^{n} assigned by the FW/FWLS algorithms to each of the design points is obtained using the equation:

Clearly, this expression depends on the choice of step-sizes {ρl}l=1i\{\rho_{l}\}_{l=1}^{i}. In the case of the standard Frank-Wolfe algorithm, this step-size sequence is a an input from the algorithm and so computing the weights is straightforward. However, in the case of the Frank-Wolfe Line-Search algorithm, the choice of step-size is optimized at each iteration so that gig_{i} minimises JJ the most.

In the case of computing integrals, this optimization step can actually be obtained analytically. This analytic expression will be given in terms of values of the kernel values and evaluations of the mean element.

Taking the derivative of this expression with respect to ρ\rho, we get:

Setting this derivative to zero gives us the following optimum:

Clearly, differentiating a second time with respect to ρ\rho gives gi1Φ(xi)H2\|g_{i-1}-\Phi(x_{i})\|_{\mathcal{H}}^{2}, which is non-negative and so ρ\rho^{*} is a minimum. One can show using geometrical arguments about the marginal polytope M\mathcal{M} that ρ\rho^{*} will be in $$ .

The numerator of this line-search expression is

Clearly all expressions provided here can be vectorised for efficient computational implementation.

Step 3) Computing a new approximation of the mean element.

The final step consists of updating the approximation of the mean element, which can be done directly by setting:

Appendix B: Proofs of Theorems and Corollaries

The posterior mean p^FWBQ[f]\hat{p}_{\emph{FWBQ}}[f] converges to the true integral p[f]p[f] at the following rates:

where the FWBQ uses step-size ρi=1/(i+1)\rho_{i}=1/(i+1), D(0,)D\in(0,\infty) is the diameter of the marginal polytope M\mathcal{M} and R(0,)R\in(0,\infty) gives the radius of the smallest ball of center μp\mu_{p} included in M\mathcal{M}.

The posterior mean in BQ is a Bayes estimator and so the MMD takes a minimax form . In particular, the BQ weights perform no worse than the FW weights:

Now, the values attained by the objective function JJ along the path {gi}i=1n\{g_{i}\}_{i=1}^{n} determined by the FW(/FWLS) algorithm can be expressed in terms of the MMD as follows:

since fH1\|f\|_{\mathcal{H}}\leq 1. To complete the proof we leverage recent analysis of the FW algorithm with steps ρi=1/(n+1)\rho_{i}=1/(n+1) and the FWLS algorithm. Specifically, from [2, Prop. 1] we have that:

where DD is the diameter of the marginal polytope M\mathcal{M} and RR is the radius of the smallest ball centered at μp\mu_{p} included in M\mathcal{M}. ∎

where D(0,)D\in(0,\infty) is the diameter of the marginal polytope M\mathcal{M} and R(0,)R\in(0,\infty) gives the radius of the smallest ball of center μp\mu_{p} included in M\mathcal{M}.

Now the posterior probability mass on ScS^{c} is given by

where ϕ(rmn,σn)\phi(r|m_{n},\sigma_{n}) is the p.d.f. of the posterior normal distribution. By the definition of γ\gamma we get the upper bound:

From (26) we have that the terms ()(*) are bounded by fH1<\|f\|_{\mathcal{H}}\leq 1<\infty as σn0\sigma_{n}\rightarrow 0, so that asymptotically we have:

Finally we may substitute the asymptotic results derived in the proof of Theorem 1 for the MMD σn\sigma_{n} into (31) to complete the proof. ∎

The consistency and contraction rates obtained for FWLSBQ apply also to OBQ.

By definition, OBQ chooses samples that globally minimise the MMD and we can hence bound this quantity from above by the MMD of FWLSBQ:

Consistency and contraction follow from inserting this inequality into the above proofs. ∎

Appendix C: Computing the Mean Element for the Simulation Study

where Σσ\Sigma_{\sigma} is a d-dimensional diagonal matrix with entries σ2\sigma^{2}, and where p(x)p(x) is a mixture of d-dimensional Gaussian distributions:

(Note that, in this section only, xix_{i} denotes the iith component of the vector xx.) Using properties of Gaussian distributions (see Appendix A.2 of ) we obtain

This last expression is in fact itself a Gaussian distribution with probability density function ϕ(xμl,Σl+Σσ)\phi(x|\mu_{l},\Sigma_{l}+\Sigma_{\sigma}) and we hence obtain:

Finally, we once again use properties of Gaussians to obtain

Other combinations of kernel kk and density pp that give rise to an analytic mean element can be found in the references of .

Appendix D: Details of the Application to Proteomics Data

The ‘CheMA’ methodology described in contains several elements that we do not attempt to reproduce in full here; in particular we do not attempt to provide a detailed motivation for the mathematical forms presented below, as this requires elements from molecular chemistry. For our present purposes it will be sufficient to define the statistical models {Mi}i=1m\{M_{i}\}_{i=1}^{m} and to clearly specify the integration problems that are to be solved. We refer the reader to and the accompanying supplementary materials for a full biological background.

Denote by D\mathcal{D} the dataset containing normalised measured expression levels yS(tj)y_{S}(t_{j}) and yS(tj)y_{S}^{*}(t_{j}) for, respectively, the unphosphorylated and phosphorylated forms of a protein of interest (‘substrate’) in a longitudinal experiment at time tjt_{j}. In addition D\mathcal{D} contains normalised measured expression levels yEi(tj)y_{E_{i}}^{*}(t_{j}) for a set of possible regulator kinases (‘enzymes’, here phosphorylated proteins) that we denote by {Ei}\{E_{i}\}.

An important scientific goal is to identify the roles of enzymes (or ‘kinases’) in protein signaling; in this case the problem takes the form of variable selection and we are interested to discover which enzymes must be included in a model for regulation of the substrate SS. Specifically, a candidate model MiM_{i} specifies which enzymes in the set {Ei}\{E_{i}\} are regulators of the substrate SS, for example M3={E2,E4}M_{3}=\{E_{2},E_{4}\}. Following we consider models containing at most two enzymes, as well as the model containing no enzymes.

Given a dataset D\mathcal{D} and model MiM_{i}, we can write down a likelihood function as follows:

The prior specification proposed in and followed here is

Due to its careful design, the likelihood in Eqn. 39 is partially conjugate, so the following integral can be evaluated in closed form:

The numerical challenge is then to compute the integral

for each candidate model MiM_{i}. Depending on the number of enzymes in model MiM_{i}, this will either be a 1-, 2- or 3-dimensional numerical integral. Whilst such integrals are not challenging to compute on a per-individual basis, the nature of the application means that the values L(Mi)L(M_{i}) will be similar for many candidate models and, when the number of models is large, this demands either a very precise calculation per model or a careful quantification of the impact of numerical error on the subsequent inferences (i.e. determining the MAP estimate). It is this particular issue that motivates the use of probabilistic numerical methods.

For simplicity we focussed on the single hyper-parameter pair λ=σ=1\lambda=\sigma=1, which produces:

where ϕT\phi_{T} is the p.d.f. of the truncated Gaussian distribution introduced above and erf is the error function. To compute the posterior variance of the numerical error we also require the quantity:

which we have simply evaluated numerically. We emphasise that principled approaches to hyper-parameter elicitation are an important open research problem that we aim to address in a future publication (see discussion in the main text). The values used here are scientifically reasonable and serve to illustrate key aspects of our methodology.

FWBQ provides posterior distributions over the numerical uncertainty in each of our estimates for the marginal likelihoods L(Mi)L(M_{i}). In order to propagate this uncertainty forward into a posterior distribution over posterior model probabilities (see Figs. 3 in the main text and S2 below), we simply sampled values L^(Mi)\hat{L}(M_{i}) from each of the posterior distributions for L(Mi)L(M_{i}) and used these samples values to construct posterior model probabilities L^(Mi)/jL^(Mj)\hat{L}(M_{i})/\sum_{j}\hat{L}(M_{j}). Repeating this procedure many times enables us to sample from the posterior distribution over the posterior model probabilities (i.e. two levels of Bayes’ theorem). This provides a principled quantification of the uncertainty due to numerical error in the output of our primary Bayesian analysis.

The proteomic dataset D\mathcal{D} that we considered here was a subset of the larger dataset provided in . Specifically, the substrate SS was the well-studied 4E-binding protein 1 (4EBP1) and the enzymes EjE_{j} consisted of a collection of key proteins that are thought to be connected with 4EBP1 regulation, or at least involved in similar regulatory processes within cellular signalling. Full details, including experimental protocols, data normalisation and the specific choice of measurement time points are provided in the supplementary materials associated with .

For this particular problem, biological interest arises because the data-generating system was provided by breast cancer cell lines. As such, the textbook description of 4EBP1 regulation may not be valid and indeed it is thought that 4EBP1 dis-regulation is a major contributing factor to these complex diseases (see ). We do not elaborate further on the scientific rationale for model-based proteomics in this work.

Appendix E: FWBQ algorithms with Random Fourier Features

In this section, we will investigate the use of random Fourier features (introduced in ) for the FWLS and FWLSBQ algorithms. An advantage of using this type of approximation is that the cost of manipulating the Gram matrix, and in particular of inverting it, goes down from O(n3)\mathcal{O}(n^{3}) to O(nD2)\mathcal{O}(nD^{2}) for some user-defined constant DD which controls the quality of approximation. This could make Bayesian Quadrature more competitive against other integration methods such as MCMC or QMC. Furthermore, the kernels obtained using this method lead to finite-dimensional RKHS, which therefore satisfy the assumptions required for the theory in this paper to hold. This will be the aspect that we will focus on. In particular, we will show empirically that exponential convergence may be possible even when the RKHS is infinite-dimensional.

We will re-use the 2020-component mixture of Gaussians example with d=2d=2 from our simulation studies, but using instead a random Fourier approximation of the exponentiated-quadratic (EQ) kernel k(x,x):=λ2exp(1/2σ2xx22)k(x,x^{\prime}):=\lambda^{2}\exp(-1/2\sigma^{2}\|x-x^{\prime}\|_{2}^{2}) with (λ,σ)=(1,0.8)(\lambda,\sigma)=(1,0.8) and M=10000M=10000.

Following Bochner’s theorem, we can always express translation invariant kernels in Fourier space:

where wg(w)w\sim g(w) for g(w)g(w) being the Fourier transform of the kernel. One can then use a Monte Carlo approximation of the kernel’s Fourier expression with DD samples whenever gg is a p.d.f.. Our approximated kernel will then lead to a DD-dimensional RKHS and will be given by:

where zwj,bj(x)=2cos(wjTx+bj)z_{w_{j},b_{j}}(x)=\sqrt{2}\cos(w_{j}^{T}x+b_{j}) and bj[0,2π]b_{j}\sim[0,2\pi] uniformly. Random Fourier features approximations are unbiased and, in the specific case of a dd-dimensional EQ kernel with λ=1\lambda=1, we have to samples from the following Fourier transform:

which is a dd-dimensional Gaussian distribution with zero mean and covariance matrix with all diagonal elements equal to (1/σ2)(1/\sigma^{2}).

The impact on the MMD from the use of random Fourier features to approximate the kernel for both the FWLS and FWLSBQ algorithms is demonstrated in Figure S3. In this example, the quadrature rule uses the kernel with random features but the MMD is calculated using the original H\mathcal{H}-norm. The reason for using this H\mathcal{H}-norm is to have a unique measure of distance between points which can be compared.

Clearly, we once again have that the rate of convergence of the FWLSBQ is much faster than FWLS when using the exact kernel. The same phenomena is observed for the method with high number of random features (D=5000D=5000). This suggests that both the choice of design points and the calculation of the BQ weights is not strongly influenced by the approximation. It is also interesting to notice that the rates of convergence is very close for the exact and D=5000D=5000 methods (atleast when nn is small), potentially suggesting that exponential convergence is possible for the exact method. This is not so surprising in itself since using a Gaussian kernel represents a prior belief that the integrand of interest is very smooth, and we can therefore expect fast convergence of the method.

However, in the case with a smaller number of random features is used (D=1000D=1000), we actually observe a very poor performance of the method, which is mainly due to the fact that the weights are not well approximated anymore.

In summary, the experiments in this section suggest that the use of random features is a potential alternative for scaling up Bayesian Quadrature, but that one needs to be careful to use a high enough number of features. The experiments also give hope of having very similar convergence for infinite-dimensional and finite-dimensional spaces.