Measuring Sample Quality with Diffusions

Jackson Gorham, Andrew B. Duncan, Sebastian J. Vollmer, Lester Mackey

Introduction

When H\mathcal{H} is convergence determining, the measure (1) is an integral probability metric (IPM) , and dH(Qn,P)d_{\mathcal{H}}(Q_{n},P) converges to zero only if the sample sequence (Qn)n1(Q_{n})_{n\geq 1} converges in distribution to PP.

While a variety of standard probability metrics are representable as IPMs , the intractability of integration under PP precludes us from computing most of these candidate quality measures. Recently, Gorham and Mackey sidestepped this issue by constructing a class of test functions hh known a priori to have zero mean under PP. Their resulting quality measure – the Langevin graph Stein discrepancy – satisfied our computability and convergence detection requirements (Desiderata (i) and (iii)) and detected sample sequence non-convergence (Desideratum (ii)) for strongly log concave targets with bounded third and fourth derivatives .

Our first contribution is to show that the Langevin Stein discrepancy in fact determines convergence for all smooth, distantly dissipative target distributions by explicitly lower and upper bounding the Langevin Stein discrepancy by standard Wasserstein distances. Distant dissipativity is a substantial relaxation of log concavity that covers a variety of common non-log concave targets like Gaussian mixtures and robust Student’s t regression posteriors. This contribution greatly extends the range of applicability of the Langevin Stein discrepancy.

Because heavy-tailed distributions are never distantly dissipative, as a second contribution, we extend the computable Stein discrepancy framework of to accommodate heavy-tailed target distributions by introducing a new class of multivariate Stein operators based on general Itô diffusions. These operators can be used as drop-in replacements for the commonly used Langevin operator in applications.

for constants C1,C2>0C_{1},C_{2}>0 determined by Theorem 7 and Proposition 8. This improves upon prior analyses even in the case of strongly log concave targets.

Our primary contribution underlies these three advances. By relating Stein’s method to Markov process coupling rates in Section 2, we prove that every sufficiently fast coupling Itô diffusion gives rise to explicit, uniform multivariate Stein factor bounds on the derivatives of Stein equation solutions. Stein factor bounds are central to Stein’s method of measuring distributional convergence, and while a wealth of bounds are available for univariate targets (see, e.g., for explicit bounds or for a recent review), Stein factors for continuous multivariate distributions have largely been relegated to Gaussian , Dirichlet , and strongly log-concave target distributions. Our approach, which exposes a general relationship between Stein factors and Markov process coupling times, extends the reach of Stein’s method to the stationary distributions of all fast coupling Itô diffusions.

In Section 3, we provide examples of practically checkable sufficient conditions for fast coupling and illustrate the process of verifying these conditions for canonical log-concave, heavy-tailed, and multimodal targets. Section 4 describes a practical algorithm for computing diffusion Stein discrepancies using a geometric spanner and linear programming. In Section 5, we complement the principal theoretical contributions of this work with several simple numerical examples illustrating how diffusion Stein discrepancies can be deployed in practice. In particular, we use our discrepancies to select the hyperparameters of biased samplers, compare random and deterministic quadrature rules, and quantify bias-variance tradeoffs in approximate Markov chain Monte Carlo. A discussion of related and future work follows in Section 6, and all proofs are deferred to the appendices.

Stein’s method

In the early 1970s, Charles Stein introduced a powerful three-step approach to upper-bounding a reference IPM dHd_{\mathcal{H}}:

The operator T\mathcal{T}{} and its domain G\mathcal{G} define the Stein discrepancy ,

a quality measure which takes the form of an integral probability metric while avoiding explicit integration under PP.

Next, prove that, for each test function hh in the reference class H\mathcal{H}, the Stein equation

admits a solution ghGg_{h}\in\mathcal{G}. This step ensures that the reference metric dHd_{\mathcal{H}} lower bounds the Stein discrepancy (Desideratum (ii)) and, in practice, can be carried out simultaneously for large classes of target distributions.

Finally, use whatever means necessary to upper bound the Stein discrepancy and thereby establish convergence to zero under appropriate conditions (Desideratum (i)). Our general result, Proposition 8, suffices for this purpose.

While Stein’s method is traditionally used as analytical tool to establish rates of distributional convergence, we aim, following , to develop the method into a practical computational tool for measuring the quality of a sample. We begin by assessing the convergence properties of a broad class of Stein operators derived from Itô diffusions. Our efforts will culminate in Section 4, where we show how to explicitly compute the Stein discrepancy (2) given any sample measure QnQ_{n} and appropriate choices of T\mathcal{T}{} and G\mathcal{G}.

To identify an operator T\mathcal{T}{} that generates mean-zero functions under PP, we will appeal to the elegant and widely applicable generator method construction of Barbour and Götze . These authors note that if (Zt)t0(Z_{t})_{t\geq 0} is a Feller process with invariant measure PP, then the infinitesimal generator A\mathcal{A}{} of the process, defined pointwise by

where (Wt)t0(W_{t})_{t\geq 0} is an mm-dimensional Wiener process.

As the next theorem, distilled from [62, Thm. 2] and [74, Sec. 4.6], shows, it is straightforward to construct Itô diffusions with a given invariant measure PP (see also ).

the (overdamped) Langevin diffusion (also known as the Brownian or Smoluchowski dynamics) [74, Secs. 6.5 and 4.5], where aIa\equiv I and c0c\equiv 0;

the Riemannian Langevin diffusion , where c0c\equiv 0 and aa is not constant;

We will present detailed examples making use of these diffusion classes in Sections 3 and 5.

Theorem 2 forms the basis for our Stein operator of choice, the diffusion Stein operator T\mathcal{T}{}, defined by substituting gg for 12u\frac{1}{2}\nabla u in the generator (7):

T\mathcal{T}{} is an appropriate choice for our setting as it depends on PP only through logp\nabla\log p and is therefore computable even when the normalizing constant of pp is unavailable. One suitable domain for T\mathcal{T}{} is the classical Stein set of 1-bounded functions with 1-bounded, 1-Lipschitz derivatives:

Indeed, our next proposition, proved in Section A, shows that, on this domain, the diffusion Stein operator generates mean-zero functions under PP.

Together, T\mathcal{T}{} and G\mathcal{G}_{\|{\cdot}\|} give rise to the classical diffusion Stein discrepancy S(Qn,T,G)\mathcal{S}({Q_{n}},{\mathcal{T}{}},{\mathcal{G}_{\|{\cdot}\|}}), our primary object of study in Sections 2.2 and 2.3.

2 Lower bounding the diffusion Stein discrepancy

To establish that the classical diffusion Stein discrepancy detects non-convergence (Desideratum (ii)), we will lower bound the discrepancy in terms of the L1L^{1} Wasserstein distance, dW2=W1,2d_{\mathcal{W}_{\|{\cdot}\|_{2}}}=W_{1,\|{\cdot}\|_{2}}, a standard reference IPM generated by

The first step is to show that, for each hW2h\in\mathcal{W}_{\|{\cdot}\|_{2}}, the solution ghg_{h} to the Stein equation (3) with diffusion Stein operator (8) has low-order derivatives uniformly bounded by target-specific constants called Stein factors.

Explicit Langevin diffusion (D1) Stein factor bounds are readily available for a wide variety of univariate targetsThe Langevin operator recovers Stein’s density method operator when d=1d=1. (see, e.g., for explicit bounds or for a recent review). In contrast, in the multivariate setting, efforts to establish Stein factors have focused on Gaussian , Dirichlet , and strongly log-concave targets with preconditioned Langevin (D2) operators. To extend the reach of the literature, we will derive multivariate Stein factors for targets with fast-coupling Itô diffusions. Our measure of coupling speed is the Wasserstein decay rate.

Let (Pt)t0(P_{t}{})_{t\geq 0}{} be the transition semigroup of an Itô diffusion (Zt,x)t0({Z}_{t,x})_{t\geq 0} defined via

where δxPt\delta_{x}P_{t}{} denotes the distribution of Zt,x{Z}_{t,x}.

Our next result, proved in Section B, shows that the smoothness of a solution ghg_{h} to a Stein equation is controlled by the rate of Wasserstein decay and hence by how quickly two diffusions with distinct starting points couple. The Stein factor bounds on the derivatives of uhu_{h} and ghg_{h} may be of independent interest for establishing rates of distributional convergence.

Fix any Lipschitz hh. If an Itô diffusion has invariant measure PP1P\in\mathcal{P}_{1}, transition semigroup (Pt)t0(P_{t}{})_{t\geq 0}, Wasserstein decay rate rr, and infinitesimal generator A\mathcal{A}{} (4), then

is twice continuously differentiable and satisfies

for γρρM1(b)+ρ22ρ2M1(σ)2+ρ2F1(σ)2\gamma_{\rho}\triangleq\rho M_{1}(b)+\frac{\rho^{2}-2\rho}{2}M_{1}(\sigma)^{2}+\frac{\rho}{2}F_{1}(\sigma)^{2}, αM2(b)22M1(b)+4M1(σ)2+2F2(σ)2\alpha\triangleq\frac{M_{2}(b)^{2}}{2M_{1}(b)+4M_{1}(\sigma)^{2}}+2F_{2}(\sigma)^{2}. If, additionally, 3b\nabla^{3}b and 3σ\nabla^{3}\sigma are locally Lipschitz and hC3h\in C^{3} with bounded third derivatives, then, for all ι(0,1)\iota\in(0,1),

for K>0K>0 a constant depending only on M1:3(σ),M1:3(b),M_{1:3}(\sigma),M_{1:3}(b), M0(σ1),M_{0}(\sigma^{-1}), and rr.

Thms. 1 and 2 of Pardoux and Veretennikov also bound the solutions of the Stein equation (3). However, for generic Lipschitz hh, [72, Thms. 1 and 2] provide inexplicit constants; only guarantee the polynomial growth of ghg_{h} and its derivatives, not uniform boundedness; and require bounded σ\sigma, a strong assumption which rules out the heavy-tailed examples of Section 3.

A first consequence of Theorem 5, proved in Section D, concerns Stein operators (8) with constant covariance and stream matrices aa and cc. In this setting, fast Wasserstein decay implies that the diffusion Stein discrepancy converges to zero only if the Wasserstein distance does (Desideratum (ii)).

Consider an Itô diffusion with diffusion Stein operator T\mathcal{T}{} (8) for PP1P\in\mathcal{P}_{1}, Wasserstein decay rate rr, constant covariance and stream matrices aa and cc, and Lipschitz drift b(x)=12(a+c)logp(x)b(x)=\frac{1}{2}(a+c)\nabla\log p(x). If sr0r(t)dts_{r}\triangleq\int_{0}^{\infty}r(t)\,dt, then

Theorem 6 in fact provides an explicit upper bound on the Wasserstein distance in terms of the Stein discrepancy and the Wasserstein decay rate. Under additional smoothness assumptions on the coefficients, the explicit relationship between Stein discrepancy and Wasserstein distance can be improved and extended to diffusions with non-constant diffusion coefficient, as our next result, proved in Section E, shows.

Consider an Itô diffusion for PP1P\in\mathcal{P}_{1} with diffusion Stein operator T\mathcal{T}{} (8), Wasserstein decay rate rr, and Lipschitz drift and diffusion coefficients bb (6) and σ\sigma with locally Lipschitz second derivatives. If sr0r(t)dts_{r}\triangleq\int_{0}^{\infty}r(t)\,dt, then

for β1\beta_{1}, β2\beta_{2} defined in Theorem 5 and

If, additionally, 3b\nabla^{3}b and 3σ\nabla^{3}\sigma are locally Lipschitz, then

for a constant K>0K>0 depending only on M1:3(σ),M1:3(b),M_{1:3}(\sigma),M_{1:3}(b), M0(σ1),M_{0}(\sigma^{-1}), and rr.

The log(\nicefrac1S(Qn,T,G))\log(\nicefrac{{1}}{{\mathcal{S}({Q_{n}},{\mathcal{T}{}},{\mathcal{G}_{\|{\cdot}\|}})}}) term in (14) reflects the potential non-smoothness of the Stein equation solution ghg_{h} studied in Theorem 5. Indeed, for d2d\geq 2 and standard multivariate Gaussian PP, there exist Lipschitz hh with infinite M2(gh)M_{2}(g_{h}) [77, Remark 2].

In Section 3, we will present practically checkable conditions implying fast Wasserstein decay and discuss both broad families and specific diffusion-target pairings covered by this theory.

3 Upper bounding the diffusion Stein discrepancy

In upper bounding the Stein discrepancy, one classically aims to establish rates of convergence to PP for specific sequences (Qn)n=1(Q_{n})_{n=1}^{\infty}. Since our interest is in explicitly computing Stein discrepancies for arbitrary sample sequences, our general upper bound in Proposition 8 serves principally to provide sufficient conditions under which the classical diffusion Stein discrepancy converges to zero.

Let T\mathcal{T}{} be the diffusion Stein operator (8) for PP1P\in\mathcal{P}_{1}. If ma+cm\triangleq a+c and bb (6) are PP-integrable,

This result, proved in Section F, complements the Wasserstein distance lower bounds of Section 2.2 and implies that, for Lipschitz and sufficiently integrable mm and bb, the diffusion Stein discrepancy converges to zero whenever QnQ_{n} converges to PP in Wasserstein distance.

4 Extension to non-uniform Stein sets

For any c1,c2,c3>0c_{1},c_{2},c_{3}>0, our analyses and algorithms readily accommodate the non-uniform Stein set

This added flexibility can be valuable when tight upper bounds on a reference IPM, like the Wasserstein distance, are available for a particular choice of Stein factors (c1,c2,c3)(c_{1},c_{2},c_{3}). When such Stein factors are unknown or difficult to compute, we recommend the parameter-free classical Stein set and graph Stein set of the sequel as practical defaults, since the classical Stein discrepancy is strongly equivalent to any non-uniform Stein discrepancy:

The short proof follows exactly as in [35, Prop. 4].

Sufficient conditions for Wasserstein decay

Since the Stein discrepancy lower bounds of Section 2 depend on the Wasserstein decay (9) of the chosen diffusion, we next provide examples of practically checkable sufficient conditions for Wasserstein decay and illustrate the process of verifying these conditions for a selection of diffusion-target pairings. These pedagogical examples serve to succinctly illustrate the process of verifying our assumptions and do not represent the full scope of applicability.

It is well known [see, e.g., 7, Eq. 7] that the Langevin diffusion (D1) enjoys exponential Wasserstein decay whenever logp\log p is kk-strongly log concave, i.e., when the drift b=12logpb=\frac{1}{2}\nabla\log p satisfies b(x)b(y),xyk2xy22\langle{b(x)-b(y)},{x-y}\rangle\leq-{\textstyle\frac{k}{2}}\|{x-y}\|_{2}^{2} for k>0k>0. An analogous uniform dissipativity condition gives explicit exponential decay for a generic Itô diffusion:

The proof of Theorem 10 in Section G holds even when the drift bb is not Lipschitz, yields the same decay rate for W2,2\mathcal{W}_{2,\|{\cdot}\|_{2}}, and relies on a synchronous coupling of Itô diffusions, mimicking [7, Sec. 1].

Hence, if the drift bb of an Itô diffusion is k/2-k/2-one-sided Lipschitz, i.e.,

and some G0G\succ 0, and the diffusion coefficient σ\sigma is k\sqrt{k^{\prime}}-Lipschitz, that is,

the PP-targeted preconditioned Langevin diffusion (D2) drift b(β)=12Σlogp(β)b(\beta)={\textstyle\frac{1}{2}}\Sigma\nabla\log p(\beta) satisfies (15) with k=1k=1 and G=Σ1G=\Sigma^{-1} and M1(b)12I+14Σl=1LvlvlopM_{1}(b)\leq{\textstyle\frac{1}{2}}\|{I+\frac{1}{4}\Sigma\sum_{l=1}^{L}v_{l}v_{l}^{\top}}\|_{op}. Hence, the diffusion enjoys geometric Wasserstein decay (Theorem 10) and a Wasserstein lower bound on the Stein discrepancy (Theorem 6).

2 Distant dissipativity, constant σ𝜎\sigma

When the diffusion coefficient σ\sigma is constant with a12σσa\triangleq\frac{1}{2}\sigma\sigma^{\top} invertible, Eberle showed that a distant dissipativity condition is sufficient for exponential Wasserstein decay. Specifically, if we define a one-sided Lipschitz constant conditioned on a distance r>0r>0,

then [22, Cor. 2] establishes exponential Wasserstein decay whenever κ\kappa is continuous with lim infrκ(r)>0\liminf_{r\to\infty}\kappa(r)>0 and 01rκ(r)dr<\int_{0}^{1}r\kappa(r)^{-}dr<\infty. For a Lipschitz drift, this holds whenever bb is dissipative at large distances, that is, whenever, for some k>0k>0, we have κ(r)k\kappa(r)\geq k for all rr sufficiently large [22, Lem. 1].

3 Distant dissipativity, non-constant σ𝜎\sigma

Using a combination of synchronous and reflection couplings, Wang [93, Thm. 2.6] showed that diffusions satisfying a distant dissipativity condition exhibit exponential Wasserstein decay, even when the diffusion coefficient σ\sigma is non-constant. In Section H, we combine the coupling strategy of [93, Thm. 2.6] with the approach of for diffusions with constant σ\sigma to obtain the following explicit Wasserstein decay rate for distantly dissipative diffusions with bounded σ1\sigma^{-1}.

Let (Pt)t0(P_{t}{})_{t\geq 0} be the transition semigroup of an Itô diffusion with drift and diffusion coefficients bb and σ\sigma. Define the truncated diffusion coefficient

and the distance-conditional dissipativity function

for any m0infxy(σ0(x)σ0(y))(xy)2xy2andα1/(λ02+m02/4).m_{0}\leq\inf_{x\neq y}{\textstyle\frac{\|{(\sigma_{0}(x)-\sigma_{0}(y))^{\top}(x-y)}\|_{2}}{\|{x-y}\|_{2}}}\quad\text{and}\quad\alpha\triangleq 1/(\lambda_{0}^{2}+m_{0}^{2}/4).

If the constants R0=inf{R0:κ(r)0, rR}R_{0}=\inf\{R\geq 0\>:\>\kappa(r)\geq 0,\ \forall\,r\geq R\} and R1=inf{RR0:κ(r)R(RR0)8, rR}R_{1}=\inf\{R\geq R_{0}\>:\>\kappa(r)R(R-R_{0})\geq 8,\ \forall\,r\geq R\} satisfy R0R1<R_{0}\leq R_{1}<\infty, then

Theorem 11 holds even when the drift bb is not Lipschitz.

The Wasserstein decay rate (17) in Theorem 11 has a simple form when the diffusion is dissipative at large distances and κ\kappa is bounded below. These rates follow exactly as in [22, Lem. 1].

Under the conditions of Theorem 11, suppose that, for R,L0R,L\geq 0 and K>0K>0, κ(r)L for rR and κ(r)K for r>R\kappa(r)\geq-L\text{ for }r\leq R\text{ and }\kappa(r)\geq K\text{ for }r>R. Then

for fixed δ,ν>0\delta,\nu>0 and Σ0\Sigma\succ 0. Introduce the shorthand ψλ(r)21+r2/δ2λ2\psi_{\lambda}(r)\triangleq 2\sqrt{1+r^{2}/\delta^{2}}-\lambda^{2} for each λ[0,2)\lambda\in[0,\sqrt{2}) and ξ(β)1+1ν(yVβ)Σ1(yVβ)\xi(\beta)\triangleq 1+\frac{1}{\nu}(y-V\beta)^{\top}\Sigma^{-1}(y-V\beta). Since

Indeed, fix any λ0(0,1/M0(σ1))=(0,2)\lambda_{0}\in(0,1/M_{0}(\sigma^{-1}))=(0,\sqrt{2}). Since M1(ψλ)1δ2λ2M_{1}(\sqrt{\psi_{\lambda}})\leq\frac{1}{\delta\sqrt{2-\lambda^{2}}}, M1(ψλ)2δM_{1}(\psi_{\lambda})\leq\frac{2}{\delta}, and M2(ψλ)2δ2M_{2}(\psi_{\lambda})\leq\frac{2}{\delta^{2}}, σ0\sigma_{0}, σ\sigma, aa, and a\nabla a are all Lipschitz. The drift bb is also Lipschitz, since logp\nabla\log p and the product of a(β)a(\beta) and

are bounded. Hence, κ\kappa (16) is bounded below. Moreover, the the Hölder continuity of xxx\mapsto\sqrt{x}, Cauchy-Schwarz, and the triangle inequality imply

Letting ζ\zeta represent the supremum in the final inequality, we see that κ(r)α=1/λ02\kappa(r)\geq\alpha=1/\lambda_{0}^{2} whenever r2(d+1δ+ζ).r\geq 2({\textstyle\frac{d+1}{\delta}}+\zeta). Hence, Corollary 12 delivers exponential Wasserstein decay. A Wasserstein lower bound on the Stein discrepancy now follows from Theorem 7, since M2(ψ0)12δ2M_{2}(\sqrt{\psi_{0}})\leq\frac{1}{\sqrt{2}\delta^{2}}, M3(ψ0)96255δ3M_{3}(\psi_{0})\leq\frac{96}{25\sqrt{5}\delta^{3}}, and a(β)2logp(β)a(\beta)\nabla^{2}\log p(\beta) is Lipschitz, and hence M2(σ)M_{2}(\sigma) and M2(b)M_{2}(b) are bounded.

Computing Stein discrepancies

For any sample QnQ_{n}, Stein operator T\mathcal{T}{}, and Stein set G\mathcal{G}, the Stein discrepancy S(Qn,T,G)\mathcal{S}({Q_{n}},{\mathcal{T}{}},{\mathcal{G}}) is recovered by solving an optimization problem over functions gGg\in\mathcal{G}. For example, if we write ma+cm\triangleq a+c and b(x)121p(x),p(x)m(x)b(x)\triangleq\frac{1}{2}{\textstyle\frac{1}{p(x)}}\langle{\nabla},{p(x)m(x)}\rangle, the classical diffusion Stein discrepancy is the value

For all Stein sets, the diffusion Stein discrepancy objective is linear in gg and only queries gg and g\nabla g at the nn sample points underlying QnQ_{n}. However, the classical Stein set G\mathcal{G}_{\|{\cdot}\|} constrains gg at all points in its domain, resulting in an infinite-dimensional optimization problem.When d=1d=1, the problem reduces to a finite-dimensional convex quadratically constrained quadratic program with linear objective as in [35, Thm. 9].

imposes boundedness constraints only at sample points and smoothness constraints only at pairs of sample points enumerated in the edge set EE. The graph is termed a tt-spanner if each edge (x,y)E(x,y)\in E is assigned the weight xy\|{x-y}\|, and, for all xyVx^{\prime}\neq y^{\prime}\in V, there exists a path between xx^{\prime} and yy^{\prime} in the graph with total path weight no greater than txyt\|{x^{\prime}-y^{\prime}}\|. Remarkably, for any linear Stein operator T\mathcal{T}{}, a spanner Stein discrepancy S(Qn,T,G,Qn,Gt)\mathcal{S}({Q_{n}},{\mathcal{T}{}},{\mathcal{G}_{\|{\cdot}\|,Q_{n},{G_{t}}}}) based on a tt-spanner GtG_{t} is equivalent to the classical Stein discrepancy in the following strong sense, implying Desiderata (i) and (ii).

where κd\kappa_{d} is independent of (Qn,P,T,Gt)(Q_{n},P,\mathcal{T}{},G_{t}) and depends only on dd and \|{\cdot}\|.

The proof relies on the Whitney-Glaeser extension theorem [85, Thm. 1.4] of Glaeser and follows exactly as in [35, Prop. 5 and 6].

2 Decoupled linear programs

where ψji\psi_{ji} and Ψjki\Psi_{jki} represent the values gj(xi)g_{j}(x_{i}) and kgj(xi)\nabla_{k}g_{j}(x_{i}) respectively. Therefore, our recommended quality measure is the 22-spanner diffusion Stein discrepancy with =1\|{\cdot}\|=\|{\cdot}\|_{1}. Its computation is summarized in Algorithm 1. An efficient implementation of Algorithm 1, integrated with 11 linear program solver options, is publicly available via our Julia package.https://jgorham.github.io/SteinDiscrepancy.jl/

Numerical illustrations

In this section, we complement the principal theoretical contributions of this work with several simple numerical illustrations demonstrating how diffusion Stein discrepancies can be deployed in practice. We will use our proposed quality measures to select hyperparameters for biased samplers, to quantify a bias-variance trade-off for approximate MCMC, and to compare deterministic and random quadrature rules. In each case, we choose experimental settings in which a notion of surrogate ground truth is available for external validation. We solve all linear programs using Julia for Mathematical Programming with the Gurobi 6.0.4 solver and use the C++ greedy spanner implementation of Bouts et al. to compute our 22-spanners. Our timings were obtained on a single core of an Intel Xeon CPU E5-2650 v2 @ 2.60GHz. Code reconstructing all experiments is available on the Julia package site.footnote 5

We first present a simple example to illustrate several Stein discrepancy properties. For a Gaussian mixture target PP (Example 3) with p(x)e12(xΔ2)2+e12(x+Δ2)2p(x)\propto e^{-\frac{1}{2}(x-\frac{\Delta}{2})^{2}}+e^{-\frac{1}{2}(x+\frac{\Delta}{2})^{2}} and Δ>0\Delta>0, we simulate one i.i.d. sequence of sample points from PP and a second i.i.d. sequence from N(Δ2,1)\mathcal{N}(-\frac{\Delta}{2},1), which represents only one component of PP. For various mode separations Δ\Delta, Figure 1 shows that the Langevin spanner Stein discrepancy (D1) applied to the first nn Gaussian mixture sample points decreases to zero at a n1/2n^{-1/2} rate, while the discrepancy applied to the single mode sequence stays bounded away from zero. However, Figure 1 also indicates that larger sample sizes are needed to distinguish between the mixture and single mode sample sequences when Δ\Delta is large. This accords with our theory (see Example 3, Corollary 12, and Theorem 6), which implies that both the Langevin diffusion Wasserstein decay rate and the bound relating Stein to Wasserstein degrade as the mixture mode separation Δ\Delta increases.

2 Selecting sampler hyperparameters

Stochastic Gradient Riemannian Langevin Dynamics (SGRLD) with a constant step size ϵ\epsilon is an approximate MCMC procedure designed to accelerate posterior inference. Unlike asymptotically correct MCMC algorithms, SGRLD has a stationary distribution that deviates increasingly from its target PP as its step size ϵ\epsilon grows. On the other hand, if ϵ\epsilon is too small, SGRLD fails to explore the sample space sufficiently quickly. Hence, an appropriate setting of ϵ\epsilon is paramount for accurate inference.

After standardizing the output variable and non-constant regressors and initializing each chain with an approximate posterior mode found by L-BFGS started at the origin, we ran SGRLD with minibatch size 3030, metric G(β)=1/(21+β/δ22)IG(\beta)=1/(2\sqrt{1+\|{\beta/\delta}\|_{2}^{2}})I, and a variety of step sizes ϵ\epsilon to produce sample sequences of length 200,000200,000 thinned to length 2,0002,000. We then selected the step size that delivered the highest quality sample – either the maximum effective sample size (ESS, a popular MCMC mixing diagnostic based on asymptotic variance ) or the minimum Riemannian Langevin spanner Stein discrepancy with a(β)=G1(β)a(\beta)=G^{-1}(\beta). The longest discrepancy computation consumed 66s for spanner construction and 6565s to solve a coordinate optimization problem. As a surrogate measure of ground truth, we also generated a sample QQ^{*} of size 2×1082\times 10^{8} from the Metropolis-adjusted Riemannian Langevin Algorithm (MARLA) with metric GG and compute the median bivariate marginal Wasserstein distance dW1d_{\mathcal{W}_{\|{\cdot}\|_{1}}} between each SGRLD sample and QQ^{*} thinned to 5,0005,000 points .

Figure 2(a) shows that ESS, which does not account for stationary distribution bias, selects the largest step size available, ϵ=102\epsilon=10^{-2}. As seen in Figure 2(b), this choice results in samples that are greatly overdispersed when compared with the ground truth MARLA sample QQ^{*}. At the other extreme, the selection ϵ=107\epsilon=10^{-7} produces greatly underdispersed samples due to slow mixing. The Stein discrepancy chooses an intermediate value, ϵ=104\epsilon=10^{-4}. The same value minimizes the surrogate ground truth Wasserstein measure and produces samples that most closely resemble the QQ^{*} in Figure 2(b).

3 Quantifying a bias-variance trade-off

Approximate random walk Metropolis-Hastings (ARWMH) with tolerance parameter ϵ\epsilon is a biased MCMC procedure that accelerates posterior inference by approximating the standard MH correction. Qualitatively, a smaller setting of ϵ\epsilon produces a more faithful approximation of the MH correction and less bias between the chain’s stationary distribution and the target distribution of interest. A larger setting of ϵ\epsilon leads to faster sampling and a more rapid reduction of Monte Carlo variance, as fewer datapoint likelihoods are computed per sampling step. We will quantify this bias-variance trade-off as a function of sampling time using the Langevin spanner Stein discrepancy.

In the notation of Example 2, we conduct a Bayesian Huber regression analysis (c=1c=1) of the log radon levels in 1,1901,190 Minnesota households as a function of the log amount of uranium in the county, an indicator of whether the radon reading was performed in a basement, and an intercept term. A N(0,I)\mathcal{N}(0,I) prior is placed on the coefficient vector β\beta. We run ARWMH with minibatch size 55 and two settings of the tolerance threshold ϵ\epsilon (0.10.1 and 0.20.2) for 10710^{7} likelihood evaluations, discard the sample points from the first 10510^{5} evaluations, and thin the remaining points to sequences of length 1,0001,000. Figure 3 displays the Langevin spanner Stein discrepancy applied to the first nn points in each sequence as a function of the likelihood evaluation count, which serves as a proxy for sampling time. As expected, the higher tolerance sample (ϵ=0.2\epsilon=0.2) is of higher Stein quality for a small computational budget but is eventually overtaken by the ϵ=0.1\epsilon=0.1 sample with smaller asymptotic bias. The longest discrepancy computation consumed 0.80.8s for the spanner and 20.120.1s for a coordinate LP.

4 Comparing quadrature rules

Stein discrepancies can also measure the quality of deterministic sample sequences designed to improve upon Monte Carlo sampling. For the Gaussian mixture target of Section 5.1, Figure 4 compares the median quality of 50 sample sequences generated from four quadrature rules recently studied in [53, Sec. 4.1]: i.i.d. sampling from PP, Quasi-Monte Carlo (QMC) sampling using a deterministic quasirandom number generator, Frank-Wolfe (FW) kernel herding , and fully-corrective Frank-Wolfe (FCFW) kernel herding . The quality judgments of the Langevin spanner Stein discrepancy (D1) closely mimic those of the L1L^{1} Wasserstein distance dWd_{\mathcal{W}_{\|{\cdot}\|}}{}, which is computable for simple univariate targets . Each Stein discrepancy was computed in under 0.030.03s.

Under both diagnostics and as previously observed in other metrics , the i.i.d. samples are typically of lower median quality than their deterministic counterparts. More suprisingly and in contrast to past work focused on very smooth function classes , FCFW underperforms FW and QMC in our diagnostics for larger sample sizes. Apparently FCFW, which is heavily optimized for smooth function integration, has sacrificed approximation quality for less smooth test functions. For example, Figure 4 shows that QMC offers a better quadrature estimate than FCFW for h1(x)=max{0,1minj{1,2}xμj}h_{1}(x)=\max\{0,1-\min_{j\in\{1,2\}}|x-\mu_{j}|\}, a 11-Lipschitz approximation to the indicator of being within one standard deviation of a mode.

In addition to providing a sample quality score, the Stein discrepancy optimization problem produces an optimal Stein function gg^{*} and an associated test function h=Tgh^{*}=\mathcal{T}{g^{*}} that is mean zero under PP and best distinguishes the sample QnQ_{n} from the target PP. Figure 4 gives examples of these maximally discriminatve functions hh^{*} for a target mode separation of Δ=5\Delta=5 and length 200200 sequences from each quadrature rule. We also display the associated sample histograms with overlaid target density. The optimal FCFW function reflects the jagged nature of the FCFW histogram.

Connections and conclusions

We developed quality measures suitable for comparing the fidelity of arbitrary “off-target” sample sequences by generating infinite collections of known target expectations.

The score statistic of Fan et al. and the Gibbs sampler convergence criteria of Zellner and Min account for some sample biases but sacrifice differentiating power by exploiting only a finite number of known target expectations. For example, when P=N(0,1)P=\mathcal{N}(0,1), the score statistic cannot differentiate two samples with the same means and variances. Maximum mean discrepancies (MMDs) over characteristic reproducing kernel Hilbert spaces do detect arbitrary distributional biases but are only computable when the chosen kernel functions can be integrated under the target. In practice, one often approximates MMD using a sample from the target, but this requires a separate trustworthy sample from PP.

While we have focused on the graph and classical Stein sets of , our diffusion Stein operators can also be paired with the reproducing kernel Hilbert space unit balls advocated in to form tractable kernel diffusion Stein discrepancies or with the random feature functions advocated in to form random feature diffusion Stein discrepancies. We have also restricted our attention to Stein operators arising from diffusion generators. These take the form (Tg)(x)=1p(x),p(x)m(x)g(x)(\mathcal{T}{g})({x})={\textstyle\frac{1}{p(x)}}\langle{\nabla},{p(x)m(x)g(x)}\rangle with m=a+cm=a+c for a(x)a(x) positive semidefinite and c(x)c(x) skew-symmetric. More generally, if the matrix mm possesses eigenvalues having a negative real part, then the resulting operator need not correspond to a diffusion process. Such operators fall into the class of pseudo-Fokker Planck operators which have been studied in the context of quantum optics . As noted in it is possible to obtain corresponding stochastic dynamics in an extended state space by introducing complex-valued noise terms; these operators may merit further study in future work.

Alternative inferential tasks

Alternative targets

Our exposition has focused on the Wasserstein distance dWd_{\mathcal{W}_{\|{\cdot}\|}}, which is only defined for distributions with finite means. A parallel development could be made for the Dudley metric to target distributions with undefined mean. The work of Cerrai also suggests that the Lipschitz condition on our drift and diffusion coefficients can be relaxed.

Appendix A Proof of Proposition 3

Let f(r)=M0(g)Bra(z)+c(z)opp(z)dzf(r)=M_{0}(g)\int_{\partial\mathcal{B}_{r}}\|{a(z)+c(z)}\|_{op}\,p(z)\,dz. Since gg and nrn_{r} are bounded,

The coarea formula and the integrability of aa and cc further imply that

Appendix B Proof of Theorem 5

with infinitesimal generator A\mathcal{A}{}, that uhu_{h} is Lipschitz, that uhu_{h} has a continuous Hessian, that uhu_{h} has a bounded and Hölder continuous Hessian under additional smoothness assumptions.

The transition semigroup (Pt)t0(P_{t}{})_{t\geq 0}{} of an Itô diffusion with Lipschitz drift and diffusion coefficients is strongly continuous on LL.

for some C>0C>0 depending only on M1(b)M_{1}(b) and M1(σ)M_{1}(\sigma). The dominated convergence theorem now yields the desired pointwise convergence.

To prove the strong continuity of (Pt)t0(P_{t}{})_{t\geq 0}{}, it suffices, by [23, Thm. I.5.8, p. 40], to verify that (Pt)t0(P_{t}{})_{t\geq 0}{} is weakly continuous, i.e., that l(Ptf)l(f)l(P_{t}{f})\rightarrow l(f), as t0+t\rightarrow 0^{+}, for all elements ll of the dual space LL^{*}. To this end, fix any lLl\in L^{*}. By the Riesz-Markov theorem for LL [17, Theorem 2.4], there exists a finite signed Radon measure μ\mu such that

for L\lVert\cdot\rVert_{L^{*}} the dual norm. By Jensen’s inequality and [28, Sec. 5, Cor. 1.2],

Since 1+x221+\|{x}\|_{2}^{2} is μ|\mu|-integrable by (20), dominated convergence gives

Consider the infinitesimal generator A\mathcal{A}{} of the semigroup (Pt)t0(P_{t}{})_{t\geq 0}{} on LL with

The stationarity of PP and the definitions of dW2d_{\mathcal{W}_{\|{\cdot}\|_{2}}} and rr imply that

and hence PthL0\lVert P_{t}{h}\rVert_{L}\rightarrow 0 as tt\rightarrow\infty, since PP has a finite mean, and r(t)0r(t)\to 0 as tt\to\infty as rr is integrable and monotonic. Arguing similarly,

Instantiate the additional preconditions of (11), and assume that M0(σ1),F2(σ),M2(b)<M_{0}(\sigma^{-1}),F_{2}(\sigma),M_{2}(b)<\infty, or else (11) is vacuous. Lemma 15, established in Section C, shows that the semigroup PthP_{t}{h} admits a bounded continuous Hessian, which is integrable in tt.

for γρρM1(b)+ρ22ρ2M1(σ)2+ρ2F1(σ)2\gamma_{\rho}\triangleq\rho M_{1}(b)+\frac{\rho^{2}-2\rho}{2}M_{1}(\sigma)^{2}+\frac{\rho}{2}F_{1}(\sigma)^{2}, αM2(b)22M1(b)+4M1(σ)2+2F2(σ)2\alpha\triangleq\frac{M_{2}(b)^{2}}{2M_{1}(b)+4M_{1}(\sigma)^{2}}+2F_{2}(\sigma)^{2}.

The dominated convergence theorem now implies that the Hessian of uhu_{h} is obtained by differentiating twice under the integral sign. The advertised bound (11) on 2uh\nabla^{2}u_{h} follows by replacing the infimum on the right-hand side of the semigroup bound (22) with the selection t0=min(t,1)t_{0}=\min(t,1), applying the bound emin(t,1)γρeγρe^{\min(t,1)\gamma_{\rho}}\leq e^{\gamma_{\rho}} for each γρ\gamma_{\rho} and tt, and integrating the result over tt.

Finally, instantiate the additional preconditions of (12), and fix any ι(0,1)\iota\in(0,1). The integral representation (10) of uhu_{h}, the dominated convergence theorem, and Jensen’s inequality imply

When t1t\leq 1, a seminorm interpolation lemma (Lemma 19 in the supplement), a semigroup third derivative estimate (Lemma 20 in the supplement) with t0=min(t,1)t_{0}=\min(t,1), and the semigroup second derivative estimate of Lemma 15 with t0=min(t,1)t_{0}=\min(t,1) imply

for some constant K1>0K_{1}>0 depending only on M1:3(b),M1:3(σ),M0(σ1),M_{1:3}(b),M_{1:3}(\sigma),M_{0}(\sigma^{-1}), and rr. Thus 01M1ι(2Pth)dt2M1(h)K1ι\int_{0}^{1}M_{1-\iota}(\nabla^{2}P_{t}{h})\,dt\leq\frac{2M_{1}(h)}{K_{1}\iota}. For t>1t>1, Lemmas 19, 20, and 15 and the integrability of rr yield

for a constant K2>0K_{2}>0 again depending only on M1:3(b),M1:3(σ),M0(σ1),M_{1:3}(b),M_{1:3}(\sigma),M_{0}(\sigma^{-1}), and rr. Combining these bounds and choosing K=min(K1,K2)/2K=\min(K_{1},K_{2})/2 completes the proof. An explicit constant KK can be obtained by tracing constants through the proof of Lemma 20.

Appendix C Proof of Lemma 15

obtained by formally differentiating the equation (5) defining (Zt,x)t0({Z}_{t,x})_{t\geq 0} with respect to xx in the direction vv. The second directional derivative flow (Ut,v,v)t0(U_{t,v,v^{\prime}})_{t\geq 0} solves the second variation equation,

obtained by differentiating (23) with respect to xx in the direction vv^{\prime}.

Since ff has bounded first and second derivatives, the dominated convergence theorem implies that, for each t0t\geq 0, PtfP_{t}{f} is twice differentiable with

obtained by differentiating under the integral sign. Lemma 16, proved in Section C.1, justifies the exchanges of derivative and expectation by ensuring that the derivative flows have moments bounded uniformly in xx.

for γρρM1(b)+ρ22ρ2M1(σ)2+ρ2F1(σ)2\gamma_{\rho}\triangleq\rho M_{1}(b)+\frac{\rho^{2}-2\rho}{2}M_{1}(\sigma)^{2}+\frac{\rho}{2}F_{1}(\sigma)^{2} and αM2(b)22M1(b)+4M1(σ)2+2F2(σ)2\alpha\triangleq\frac{M_{2}(b)^{2}}{2M_{1}(b)+4M_{1}(\sigma)^{2}}+2F_{2}(\sigma)^{2}.

Since f\nabla f and 2f\nabla^{2}f are bounded, and (Vt,v)t0,(Vt,v)t0,(V_{t,v})_{t\geq 0},(V_{t,v^{\prime}})_{t\geq 0}, and (Ut,v,v)t0(U_{t,v,v^{\prime}})_{t\geq 0} have second moments bounded uniformly in xx by Lemma 16, the Hessian formula (25) implies that 2Ptf\nabla^{2}P_{t}{f} is bounded and hence that Ptf\nabla P_{t}{f} is Lipschitz continuous for each t0t\geq 0.

Hereafter we assume that M0(σ1)<M_{0}(\sigma^{-1})<\infty, as the semigroup Hessian bound (22) is otherwise vacuous.

The Lipschitz continuity of ff and the Itô diffusion moment bound of [51, Thm. 3.4, part 4] together imply that

for all t0t\geq 0. Since σ1\sigma^{-1} is bounded, and b\nabla b and σ\nabla\sigma are bounded and Lipschitz, [27, Prop. 3.2] gives the following Bismut-Elworthy-Li-type formula for the directional derivative of PtfP_{t}{f} for each t>0t>0:

By interchanging derivative and integral, the dominated convergence theorem now delivers the Hessian expression

for each t>0t>0, provided that J1,x,J2,xJ_{1,x},J_{2,x}, and J3,xJ_{3,x} are continuous in xx. The requisite continuity follows from the Lipschitz continuity of f\nabla f and ff, the boundedness of σ1\sigma^{-1}, σ\nabla\sigma, and 2σ\nabla^{2}\sigma, and the controlled moment growth and Hölder continuity of (Zt,x)t0({Z}_{t,x})_{t\geq 0}, (Vt,v)t0,(Vt,v)t0,(V_{t,v})_{t\geq 0},(V_{t,v^{\prime}})_{t\geq 0}, and (Ut,v,v)t0(U_{t,v,v^{\prime}})_{t\geq 0} as functions of xx [76, Theorem V.40]. The dominated convergence theorem further implies that 2Ptf\nabla^{2}P_{t}{f} is continuous for each t>0t>0.

Now, we fix any t>0t>0 and turn to bounding 2Ptf\nabla^{2}P_{t}{f} in terms of M1(f)M_{1}(f), by bounding the expectations of J1,x,J2,xJ_{1,x},J_{2,x}, and J3,xJ_{3,x} of (28) in turn.

where we have adopted the definition of γρ\gamma_{\rho} given in Lemma 16.

where we have used Dynkin’s formula [28, Eq. 7.11], the Itô isometry, and the chain rule,

and we bound this expression using Cauchy-Schwarz, Jensen’s inequality, the semigroup gradient bound (21), the second derivative flow bound (27), and the fact that sr(ts)esγ4s\mapsto r(t-s)e^{s\gamma_{4}} is increasing:

C.1 Proof of Lemma 16: Derivative flow bounds

the advertised result (26) follows from Grönwall’s inequality.

for any ϵ>0\epsilon>0. Letting γρ=ρM1(b)+ρ22ρ2M1(σ)2+ρ2F1(σ)2\gamma_{\rho}=\rho M_{1}(b)+\frac{\rho^{2}-2\rho}{2}M_{1}(\sigma)^{2}+\frac{\rho}{2}F_{1}(\sigma)^{2}, we see that, by Cauchy-Schwarz and our derivative flow bound (26),

Hence, if we choose ϵ=γ4(2M1(b)+2F1(σ)2)\epsilon=\gamma_{4}-(2M_{1}(b)+2F_{1}(\sigma)^{2}) and define α=M2(b)2/ϵ+2F2(σ)2\alpha=M_{2}(b)^{2}/\epsilon+2F_{2}(\sigma)^{2} we may write

Gronwall’s inequality now yields the result (27) via

Appendix D Proof of Theorem 6

Hence, since our choice of hh was arbitrary, and

we may take expectation under QnQ_{n} and supremum over hh in (31) to reach

as v1,G/v12{\langle{v_{1}},{G}\rangle}/\|{v_{1}}\|_{2} has a standard normal distribution. Leibniz’s rule also gives

where the last equality follows by Isserlis’ theorem. Finally, when fC1f\in C^{1}, Leibniz’s rule gives fs=fϕs\nabla f_{s}=\nabla f\star\phi_{s}.

Appendix E Proof of Theorem 7

We will derive each inequality for =2\|{\cdot}\|=\|{\cdot}\|_{2}; the generic norm results will then follow from the property 2\|{\cdot}\|\geq\|{\cdot}\|_{2}, which implies G2G\mathcal{G}_{\|{\cdot}\|_{2}}\subseteq\mathcal{G}_{\|{\cdot}\|}.

Let ηsζ\eta\triangleq s^{*}\zeta for s=S(Qn,T,G2)2/πβ/ζ.s^{*}=\sqrt{\mathcal{S}({Q_{n}},{\mathcal{T}{}},{\mathcal{G}_{\|{\cdot}\|_{2}}})\sqrt{{2}{/\pi}}\beta/\zeta}. Since H\mathcal{H} is dense in W2\mathcal{W}_{\|{\cdot}\|_{2}}, we may take expectation under QnQ_{n} and supremum over hh in (32) to reach

E.2 Proof of the second inequality

Assume now that 3b\nabla^{3}b and 3σ\nabla^{3}\sigma are bounded and locally Lipschitz. Fix any ι(0,1)\iota\in(0,1). Lemma 17 and an auxiliary smoothing lemma (Lemma 18 in the supplement) imply that M2(gh,s)=M1(gh,s)dM1ι(gh)sιM_{2}(g_{h,s})=M_{1}(\nabla g_{h,s})\leq\sqrt{d}\frac{M_{1-\iota}(\nabla g_{h})}{s^{\iota}}. This improved dependence on ss will allow us to establish a near-linear relationship between the Stein discrepancy and the Wasserstein distance. By Theorem 5, M1ι(gh)1K(1ι+sr)M_{1-\iota}(\nabla g_{h})\leq\frac{1}{K}(\frac{1}{\iota}+s_{r}) for KK depending only on M1:3(σ),M1:3(b),M_{1:3}(\sigma),M_{1:3}(b), M0(σ1),M_{0}(\sigma^{-1}), and rr. Hence, M2(gh,s)Cι/sιM_{2}(g_{h,s})\leq C_{\iota}/s^{\iota} for CιdK(1ι+sr)C_{\iota}\triangleq\frac{\sqrt{d}}{K}(\frac{1}{\iota}+s_{r}). Following the derivation in Section E.1 and choosing s^{*}=\mathopen{}\mathclose{{}\left(\frac{\iota C_{\iota}\mathcal{S}({Q_{n}},{\mathcal{T}{}},{\mathcal{G}_{\|{\cdot}\|_{2}}})}{\zeta}}\right){}^{\frac{1}{\iota+1}} and ηζιs\eta\triangleq\frac{\zeta}{\iota}s^{*}, we obtain

Now consider the case in which S(Qn,T,G)<e1\mathcal{S}({Q_{n}},{\mathcal{T}{}},{\mathcal{G}_{\|{\cdot}\|}})<e^{-1} and the choice ι=1/log(\nicefrac1S(Qn,T,G))(0,1)\iota=1/\log(\nicefrac{{1}}{{\mathcal{S}({Q_{n}},{\mathcal{T}{}},{\mathcal{G}_{\|{\cdot}\|}})}})\in(0,1). Since x1/(logx1)ex^{1/(\log x-1)}\leq e for all x(0,e1)x\in(0,e^{-1}),

Introduce the shorthand c0=dKζc_{0}=\frac{\sqrt{d}}{K\zeta}. Since \nicefrac11+ι(\nicefrac12,1)\nicefrac{{1}}{{1+\iota}}\in(\nicefrac{{1}}{{2}},1), we have c011+ιmax(c0,c0)c_{0}^{\frac{1}{1+\iota}}\leq\max(\sqrt{c_{0}},c_{0}). Similarly, 1+srι>11+s_{r}\iota>1, so (1+ιsr)11+ι1+ιsr(1+\iota s_{r})^{\frac{1}{1+\iota}}\leq 1+\iota s_{r}. Therefore,

Next, fix any ι(0,1)\iota\in(0,1) and consider the case in which S(Qn,T,G)e1\mathcal{S}({Q_{n}},{\mathcal{T}{}},{\mathcal{G}_{\|{\cdot}\|}})\geq e^{-1} so that S(Qn,T,G)11+ιS(Qn,T,G)eιι+1\mathcal{S}({Q_{n}},{\mathcal{T}{}},{\mathcal{G}_{\|{\cdot}\|}})^{{\textstyle\frac{1}{1+\iota}}}\leq\mathcal{S}({Q_{n}},{\mathcal{T}{}},{\mathcal{G}_{\|{\cdot}\|}})e^{{\textstyle\frac{\iota}{\iota+1}}}. Because 1ιeιι+112e1/2<e{\textstyle\frac{1}{\iota}}e^{{\textstyle\frac{\iota}{\iota+1}}}\leq\frac{1}{2}e^{1/2}<e and (1+ιsr)\nicefrac11+ι1+sr(1+\iota s_{r})^{\nicefrac{{1}}{{1+\iota}}}\leq 1+s_{r}, we conclude that

The result follows from estimates of these two cases and the bound (34).

Appendix F Proof of Proposition 8

for any coupling of XX and ZZ. We obtain the first advertised inequality by repeatedly applying the Fenchel-Young inequality for dual norms, invoking the boundedness and Lipschitz constraints on gg and g\nabla g, and taking a supremum over gGg\in\mathcal{G}_{\|{\cdot}\|}. The second inequality follows from the firstby invoking Jensen’s inequality, the fact min(x,y)xty1t\min(x,y)\leq x^{t}y^{1-t} for all x,y0x,y\geq 0, Hölder’s inequality, and finally the definition of Ws,W_{s,\|{\cdot}\|}.

Appendix G Proof of Theorem 10

By the uniform dissipativity assumption, the right-hand side is at most xyG2=dWG(δx,δy)2\|{x-y}\|_{G}^{2}=d_{\mathcal{W}_{\|{\cdot}\|_{G}}}(\delta_{x},\,\delta_{y})^{2}. For the transition semigroup (Pt)t0(P_{t}{})_{t\geq 0}{},

Appendix H Proof of Theorem 11

where (Wt)t0(W_{t}^{\prime})_{t\geq 0} is an mm-dimensional Wiener process and (Wt)t0(W_{t}^{\prime\prime})_{t\geq 0} is an independent dd-dimensional Wiener process.

Following the argument of Eberle [22, Sec. 4], we define the difference process Yt=Zt,xZt,yY_{t}={Z}_{t,x}-{Z}_{t,y}, its norm rt=Yt2r_{t}=\|{Y_{t}}\|_{2}, and the one-dimensional Wiener process Wt=0tYs/rs,dWsW_{t}=\int_{0}^{t}\langle{Y_{s}/r_{s}},{\,dW_{s}^{\prime\prime}}\rangle, and apply the generalized Itô formula [49, Thm. 22.5] to obtain the stochastic differential equations

for any concave increasing f:[0,)[0,)f:[0,\infty)\mapsto[0,\infty) with absolutely continuous derivative, f(0)=0f(0)=0, and f(0)=1f^{\prime}(0)=1. Since the drift term in the latter equation is bounded above by βt(2/α)(f(rt)(1/4)f(rt)κ(rt)rt),\beta_{t}\triangleq(2/\alpha)(f^{\prime\prime}(r_{t})-(1/4)f^{\prime}(r_{t})\kappa(r_{t})r_{t}), the argument of [22, p. 15] shows that the results of [22, Thm. 1 and Cor. 2] hold for our choice of α\alpha and κ\kappa.

Acknowledgments

We thank Simon Lacoste-Julien for sharing his quadrature code, Martin Hairer for discussing interpolation inequalities, Andreas Eberle for reading an earlier version of this manuscript, and Murat Erdogdu for identifying an important typographical error in an earlier version of this manuscript. This material is based upon work supported by the grant EPSRC EP/N000188/1, the National Science Foundation DMS RTG Grant No. 1501767, the National Science Foundation Graduate Research Fellowship under Grant No. DGE-114747, the Frederick E. Terman Fellowship, and the Lloyd’s Register Foundation programme on Data Centric engineering at the Alan Turing Institute, UK.

References

Supplementary Appendix I Smoothing and interpolation

The first result bounds the Lipschitz constant of fsf_{s} in terms of the Hölder continuity of ff.

An application of the triangle inequality gives rise to

Supplementary Appendix J Semigroup third derivative estimate

for constants c,Cc,C depending only on M1:3(σ),M1:3(b),M0(σ1),M_{1:3}(\sigma),M_{1:3}(b),M_{0}(\sigma^{-1}), and rr.

Our proof closely follows that of Lemma 15 in Section C, and we will only highlight the important differences. Throughout, cc and CC will represent arbitrary constants depending only on M1:3(σ),M1:3(b),M0(σ1),M_{1:3}(\sigma),M_{1:3}(b),M_{0}(\sigma^{-1}), and rr that may change from expression to expression.

obtained by differentiating (24) with respect to xx in the direction vv^{\prime\prime}.

In a manner analogous to the derivation of (28) in proof of Lemma 15, we can derive an expression for the third derivative of the semi-group,

We will bound each term Ji,j,xJ_{i,j,x} in (38) in turn.

We will provide a step-by-step calculation for the first term. By Cauchy-Schwarz,

We use the derivative flow bounds of Lemma 16 to realize

Cauchy-Schwarz, the Itô isometry [28, Eqs. 7.1 and 7.2], and Lemma 16 now yield

to rewrite σ(Zs,x)2σ1(Zs,x)\sigma({Z}_{s,x})\nabla^{2}\sigma^{-1}({Z}_{s,x}). The next term satisfies

Cauchy-Schwarz and the Itô isometry [28, Eqs. 7.1 and 7.2] yield

Using (29), Lemma 15, and the non-increasing property of rr, we find that

J.4 Semigroup third derivative bound

By combining the bounds for each Ji,j,xJ_{i,j,x} term, adapting the argument of [8, Prop. 1.5.1], and invoking the semigroup gradient bound and Hessian bound M2(Psf)M1(f)r(ss0)cs0eCs0M_{2}(P_{s}{f})\leq M_{1}(f)r(s-s_{0})\frac{c^{\prime}}{\sqrt{s_{0}}}e^{C^{\prime}s_{0}} of Lemma 15, we obtain, for any t0(0,t]t_{0}\in(0,t] and s0=t0/2s_{0}=t_{0}/2

J.5 Third derivative flow bound

We have by Cauchy-Schwarz and Young’s inequality

We can conclude using Gronwall’s inequality that

where we use that, by Young’s inequality,

Following the arguments of Section C.1, Grönwall’s inequality gives