Measuring Sample Quality with Diffusions
Jackson Gorham, Andrew B. Duncan, Sebastian J. Vollmer, Lester Mackey
Introduction
When is convergence determining, the measure (1) is an integral probability metric (IPM), and converges to zero only if the sample sequence converges in distribution to .
While a variety of standard probability metrics are representable as IPMs, the intractability of integration under precludes us from computing most of these candidate quality measures. Recently, Gorham and Mackey sidestepped this issue by constructing a class of test functions known a priori to have zero mean under . Their resulting quality measure – the Langevin graph Stein discrepancy – satisfied our computability and convergence detection requirements (Desiderata (i) and (iii)) and detected sample sequence non-convergence (Desideratum (ii)) for strongly log concave targets with bounded third and fourth derivatives.
Our first contribution is to show that the Langevin Stein discrepancy in fact determines convergence for all smooth, distantly dissipative target distributions by explicitly lower and upper bounding the Langevin Stein discrepancy by standard Wasserstein distances. Distant dissipativity is a substantial relaxation of log concavity that covers a variety of common non-log concave targets like Gaussian mixtures and robust Student’s t regression posteriors. This contribution greatly extends the range of applicability of the Langevin Stein discrepancy.
Because heavy-tailed distributions are never distantly dissipative, as a second contribution, we extend the computable Stein discrepancy framework of to accommodate heavy-tailed target distributions by introducing a new class of multivariate Stein operators based on general Itô diffusions. These operators can be used as drop-in replacements for the commonly used Langevin operator in applications.
for constants determined by Theorem 7 and Proposition 8. This improves upon prior analyses even in the case of strongly log concave targets.
Our primary contribution underlies these three advances. By relating Stein’s method to Markov process coupling rates in Section 2, we prove that every sufficiently fast coupling Itô diffusion gives rise to explicit, uniform multivariate Stein factor bounds on the derivatives of Stein equation solutions. Stein factor bounds are central to Stein’s method of measuring distributional convergence, and while a wealth of bounds are available for univariate targets (see, e.g., for explicit bounds or for a recent review), Stein factors for continuous multivariate distributions have largely been relegated to Gaussian , Dirichlet , and strongly log-concave target distributions. Our approach, which exposes a general relationship between Stein factors and Markov process coupling times, extends the reach of Stein’s method to the stationary distributions of all fast coupling Itô diffusions.
In Section 3, we provide examples of practically checkable sufficient conditions for fast coupling and illustrate the process of verifying these conditions for canonical log-concave, heavy-tailed, and multimodal targets. Section 4 describes a practical algorithm for computing diffusion Stein discrepancies using a geometric spanner and linear programming. In Section 5, we complement the principal theoretical contributions of this work with several simple numerical examples illustrating how diffusion Stein discrepancies can be deployed in practice. In particular, we use our discrepancies to select the hyperparameters of biased samplers, compare random and deterministic quadrature rules, and quantify bias-variance tradeoffs in approximate Markov chain Monte Carlo. A discussion of related and future work follows in Section 6, and all proofs are deferred to the appendices.
Stein’s method
In the early 1970s, Charles Stein introduced a powerful three-step approach to upper-bounding a reference IPM:
The operator and its domain define the Stein discrepancy ,
a quality measure which takes the form of an integral probability metric while avoiding explicit integration under .
Next, prove that, for each test function in the reference class , the Stein equation
admits a solution . This step ensures that the reference metric lower bounds the Stein discrepancy (Desideratum (ii)) and, in practice, can be carried out simultaneously for large classes of target distributions.
Finally, use whatever means necessary to upper bound the Stein discrepancy and thereby establish convergence to zero under appropriate conditions (Desideratum (i)). Our general result, Proposition 8, suffices for this purpose.
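The logic of the three steps can be summarized in a short derivation. The display below is a sketch in notation assumed for this illustration (sample measure $Q_n$, target $P$, reference test class $\mathcal{H}$, Stein operator $\mathcal{T}$, Stein set $\mathcal{G}$, and Stein equation solution $g_h$). It restates the standard argument and is not a reproduction of any numbered equation in the text.

```latex
\begin{align*}
d_{\mathcal{H}}(Q_n, P)
  &= \sup_{h \in \mathcal{H}} \left| \mathbb{E}_{Q_n}[h(X)] - \mathbb{E}_{P}[h(Z)] \right|, \\
\mathcal{S}(Q_n, \mathcal{T}, \mathcal{G})
  &= \sup_{g \in \mathcal{G}} \left| \mathbb{E}_{Q_n}[(\mathcal{T} g)(X)] \right|, \\
h(x) - \mathbb{E}_{P}[h(Z)] = (\mathcal{T} g_h)(x) \ \text{for all } x
  &\;\Longrightarrow\;
  \mathbb{E}_{Q_n}[h(X)] - \mathbb{E}_{P}[h(Z)] = \mathbb{E}_{Q_n}[(\mathcal{T} g_h)(X)].
\end{align*}
```

Consequently, whenever every solution $g_h$ belongs to $\mathcal{G}$, taking a supremum over $h \in \mathcal{H}$ gives $d_{\mathcal{H}}(Q_n, P) \le \mathcal{S}(Q_n, \mathcal{T}, \mathcal{G})$, which is the lower-bounding property targeted by the second step.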
While Stein’s method is traditionally used as an analytical tool to establish rates of distributional convergence, we aim, following , to develop the method into a practical computational tool for measuring the quality of a sample. We begin by assessing the convergence properties of a broad class of Stein operators derived from Itô diffusions. Our efforts will culminate in Section 4, where we show how to explicitly compute the Stein discrepancy (2) given any sample measure and appropriate choices of and .
To identify an operator that generates mean-zero functions under , we will appeal to the elegant and widely applicable generator method construction of Barbour and Götze . These authors note that if is a Feller process with invariant measure , then the infinitesimal generator of the process, defined pointwise by
where is an -dimensional Wiener process.
As the next theorem, distilled from [62, Thm. 2] and [74, Sec. 4.6], shows, it is straightforward to construct Itô diffusions with a given invariant measure (see also ).
the (overdamped) Langevin diffusion (also known as the Brownian or Smoluchowski dynamics) [74, Secs. 6.5 and 4.5], where and ;
the Riemannian Langevin diffusion , where and is not constant;
We will present detailed examples making use of these diffusion classes in Sections 3 and 5.
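To make the Langevin diffusion (D1) concrete, the following minimal sketch simulates a univariate overdamped Langevin diffusion with an Euler-Maruyama discretization. It assumes a standard normal target, the drift convention of one half times the gradient of the log density with unit diffusion coefficient, and hypothetical names (`score`, `simulate_langevin`, step size `h`). These choices are illustrative assumptions, and the discretization itself introduces a small bias.

```python
import math
import random


def score(x):
    """Gradient of the log density of the assumed N(0, 1) target."""
    return -x


def simulate_langevin(n_steps, h, x0=0.0, seed=0):
    """Euler-Maruyama discretization of dZ = 0.5 * score(Z) dt + dW."""
    rng = random.Random(seed)
    x = x0
    path = []
    for _ in range(n_steps):
        x = x + 0.5 * h * score(x) + math.sqrt(h) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path


path = simulate_langevin(n_steps=50_000, h=0.1)
mean = sum(path) / len(path)
var = sum((x - mean) ** 2 for x in path) / len(path)
print(f"sample mean = {mean:.3f}, sample variance = {var:.3f}")
```

The empirical mean and variance should land near 0 and 1, the moments of the assumed invariant measure, up to Monte Carlo error and discretization bias.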
Theorem 2 forms the basis for our Stein operator of choice, the diffusion Stein operator , defined by substituting for in the generator (7):
is an appropriate choice for our setting as it depends on only through and is therefore computable even when the normalizing constant of is unavailable. One suitable domain for is the classical Stein set of 1-bounded functions with 1-bounded, 1-Lipschitz derivatives:
Indeed, our next proposition, proved in Section A, shows that, on this domain, the diffusion Stein operator generates mean-zero functions under .
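The mean-zero property can be checked numerically in a simple case. The sketch below assumes a univariate standard normal target and the Langevin form of the operator, (T g)(x) = g(x) d/dx log p(x) + g'(x), applied to the test function g = sin. The helper names and the trapezoidal quadrature are illustrative choices, not part of the method described in the text.

```python
import math


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def langevin_stein(g, dg, x):
    """(T g)(x) = g(x) * d/dx log p(x) + g'(x) for the N(0, 1) target."""
    return -x * g(x) + dg(x)


def expect_under_normal(f, lo=-12.0, hi=12.0, n=4801):
    """Trapezoidal approximation of E[f(Z)] for Z ~ N(0, 1)."""
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * step
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * f(x) * normal_pdf(x)
    return total * step


stein_mean = expect_under_normal(lambda x: langevin_stein(math.sin, math.cos, x))
plain_mean = expect_under_normal(math.cos)
print(f"E[(T sin)(Z)] = {stein_mean:.2e}")
print(f"E[cos(Z)]     = {plain_mean:.4f}")
```

The first expectation vanishes up to quadrature error, while an arbitrary function such as cos has a nonzero mean under the target. This is what allows the operator's output to be averaged over a sample without any integration under the target.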
Together, and give rise to the classical diffusion Stein discrepancy , our primary object of study in Sections 2.2 and 2.3.
2.2 Lower bounding the diffusion Stein discrepancy
To establish that the classical diffusion Stein discrepancy detects non-convergence (Desideratum (ii)), we will lower bound the discrepancy in terms of the Wasserstein distance, , a standard reference IPM generated by
The first step is to show that, for each , the solution to the Stein equation (3) with diffusion Stein operator (8) has low-order derivatives uniformly bounded by target-specific constants called Stein factors.
Explicit Langevin diffusion (D1) Stein factor bounds are readily available for a wide variety of univariate targets (see, e.g., for explicit bounds or for a recent review). (Footnote: The Langevin operator recovers Stein’s density method operator when .) In contrast, in the multivariate setting, efforts to establish Stein factors have focused on Gaussian, Dirichlet, and strongly log-concave targets with preconditioned Langevin (D2) operators. To extend the reach of the literature, we will derive multivariate Stein factors for targets with fast-coupling Itô diffusions. Our measure of coupling speed is the Wasserstein decay rate.
Let be the transition semigroup of an Itô diffusion defined via
where denotes the distribution of .
Our next result, proved in Section B, shows that the smoothness of a solution to a Stein equation is controlled by the rate of Wasserstein decay and hence by how quickly two diffusions with distinct starting points couple. The Stein factor bounds on the derivatives of and may be of independent interest for establishing rates of distributional convergence.
Fix any Lipschitz . If an Itô diffusion has invariant measure , transition semigroup , Wasserstein decay rate , and infinitesimal generator (4), then
is twice continuously differentiable and satisfies
for , . If, additionally, and are locally Lipschitz and with bounded third derivatives, then, for all ,
for a constant depending only on and .
Thms. 1 and 2 of Pardoux and Veretennikov also bound the solutions of the Stein equation (3). However, for generic Lipschitz , [72, Thms. 1 and 2] provide inexplicit constants; only guarantee the polynomial growth of and its derivatives, not uniform boundedness; and require bounded , a strong assumption which rules out the heavy-tailed examples of Section 3.
A first consequence of Theorem 5, proved in Section D, concerns Stein operators (8) with constant covariance and stream matrices and . In this setting, fast Wasserstein decay implies that the diffusion Stein discrepancy converges to zero only if the Wasserstein distance does (Desideratum (ii)).
Consider an Itô diffusion with diffusion Stein operator (8) for , Wasserstein decay rate , constant covariance and stream matrices and , and Lipschitz drift . If , then
Theorem 6 in fact provides an explicit upper bound on the Wasserstein distance in terms of the Stein discrepancy and the Wasserstein decay rate. Under additional smoothness assumptions on the coefficients, the explicit relationship between Stein discrepancy and Wasserstein distance can be improved and extended to diffusions with non-constant diffusion coefficient, as our next result, proved in Section E, shows.
Consider an Itô diffusion for with diffusion Stein operator (8), Wasserstein decay rate , and Lipschitz drift and diffusion coefficients (6) and with locally Lipschitz second derivatives. If , then
for , defined in Theorem 5 and
If, additionally, and are locally Lipschitz, then
for a constant depending only on and .
The term in (14) reflects the potential non-smoothness of the Stein equation solution studied in Theorem 5. Indeed, for and standard multivariate Gaussian , there exist Lipschitz with infinite [77, Remark 2].
In Section 3, we will present practically checkable conditions implying fast Wasserstein decay and discuss both broad families and specific diffusion-target pairings covered by this theory.
2.3 Upper bounding the diffusion Stein discrepancy
In upper bounding the Stein discrepancy, one classically aims to establish rates of convergence to for specific sequences . Since our interest is in explicitly computing Stein discrepancies for arbitrary sample sequences, our general upper bound in Proposition 8 serves principally to provide sufficient conditions under which the classical diffusion Stein discrepancy converges to zero.
Let be the diffusion Stein operator (8) for . If and (6) are -integrable,
This result, proved in Section F, complements the Wasserstein distance lower bounds of Section 2.2 and implies that, for Lipschitz and sufficiently integrable and , the diffusion Stein discrepancy converges to zero whenever converges to in Wasserstein distance.
2.4 Extension to non-uniform Stein sets
For any , our analyses and algorithms readily accommodate the non-uniform Stein set
This added flexibility can be valuable when tight upper bounds on a reference IPM, like the Wasserstein distance, are available for a particular choice of Stein factors . When such Stein factors are unknown or difficult to compute, we recommend the parameter-free classical Stein set and graph Stein set of the sequel as practical defaults, since the classical Stein discrepancy is strongly equivalent to any non-uniform Stein discrepancy:
The short proof follows exactly as in [35, Prop. 4].
Sufficient conditions for Wasserstein decay
Since the Stein discrepancy lower bounds of Section 2 depend on the Wasserstein decay (9) of the chosen diffusion, we next provide examples of practically checkable sufficient conditions for Wasserstein decay and illustrate the process of verifying these conditions for a selection of diffusion-target pairings. These pedagogical examples serve to succinctly illustrate the process of verifying our assumptions and do not represent the full scope of applicability.
It is well known [see, e.g., 7, Eq. 7] that the Langevin diffusion (D1) enjoys exponential Wasserstein decay whenever is -strongly log concave, i.e., when the drift satisfies for . An analogous uniform dissipativity condition gives explicit exponential decay for a generic Itô diffusion:
The proof of Theorem 10 in Section G holds even when the drift is not Lipschitz, yields the same decay rate for , and relies on a synchronous coupling of Itô diffusions, mimicking [7, Sec. 1].
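A synchronous coupling is easy to visualize in simulation. The sketch below is an illustration under assumptions of our own choosing: a univariate standard normal target, the Langevin drift -x/2, an Euler-Maruyama discretization, and the hypothetical function name `coupled_distance`. Two chains started at different points are driven by the same Gaussian increments, so the noise cancels in their difference and only the contractive drift remains.

```python
import math
import random


def coupled_distance(x0, y0, h, n_steps, seed=0):
    """Run two Euler-Maruyama chains for dZ = -0.5 * Z dt + dW that share
    the same Gaussian increments (a synchronous coupling) and return the
    distance |X_n - Y_n| after every step."""
    rng = random.Random(seed)
    x, y = x0, y0
    distances = []
    for _ in range(n_steps):
        noise = math.sqrt(h) * rng.gauss(0.0, 1.0)  # shared by both chains
        x = x - 0.5 * h * x + noise
        y = y - 0.5 * h * y + noise
        distances.append(abs(x - y))
    return distances


h, n_steps = 0.01, 400
distances = coupled_distance(x0=3.0, y0=-2.0, h=h, n_steps=n_steps)
t = h * n_steps
print(f"|X_t - Y_t| at t = {t:.1f}: {distances[-1]:.4f}")
print(f"5 * exp(-t / 2)        : {5.0 * math.exp(-0.5 * t):.4f}")
```

In this discretized example the distance shrinks by the factor 1 - h/2 at every step, which approaches the exponential rate exp(-t/2) as the step size decreases. Since the coupled distance bounds the Wasserstein distance between the two transition distributions, this is the mechanism behind exponential Wasserstein decay for a uniformly dissipative drift.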
Hence, if the drift of an Itô diffusion is -one-sided Lipschitz, i.e.,
and some , and the diffusion coefficient is -Lipschitz, that is,
the -targeted preconditioned Langevin diffusion (D2) drift satisfies (15) with and and . Hence, the diffusion enjoys geometric Wasserstein decay (Theorem 10) and a Wasserstein lower bound on the Stein discrepancy (Theorem 6).
3.2 Distant dissipativity, constant σ
When the diffusion coefficient is constant with invertible, Eberle showed that a distant dissipativity condition is sufficient for exponential Wasserstein decay. Specifically, if we define a one-sided Lipschitz constant conditioned on a distance ,
then [22, Cor. 2] establishes exponential Wasserstein decay whenever is continuous with and . For a Lipschitz drift, this holds whenever is dissipative at large distances, that is, whenever, for some , we have for all sufficiently large [22, Lem. 1].
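Dissipativity at large distances can be probed numerically for a bimodal target. The sketch below assumes an equally weighted univariate mixture of N(-m, 1) and N(m, 1) with Langevin drift equal to half the score. The parametrization, the grid, and the helper names are illustrative assumptions. For this mixture the score is -x + m tanh(m x), so the ratio (b(x) - b(y))(x - y) / (x - y)^2 is at most -1/2 + m / |x - y|, which is at most -1/4 once |x - y| >= 4m.

```python
import math


def drift(x, m):
    """Langevin drift 0.5 * d/dx log p(x) for p = 0.5 N(-m, 1) + 0.5 N(m, 1)."""
    return 0.5 * (-x + m * math.tanh(m * x))


def dissipativity_ratio(x, y, m):
    """(b(x) - b(y)) * (x - y) / (x - y)^2; negative values mean contraction."""
    return (drift(x, m) - drift(y, m)) / (x - y)


m = 3.0  # half of the mode separation (an illustrative choice)
grid = [-30.0 + 0.25 * i for i in range(241)]
pairs = [(x, y) for x in grid for y in grid if x != y]

worst_overall = max(dissipativity_ratio(x, y, m) for x, y in pairs)
worst_distant = max(
    dissipativity_ratio(x, y, m) for x, y in pairs if abs(x - y) >= 4.0 * m
)
print(f"largest ratio over all pairs:        {worst_overall:.3f}")
print(f"largest ratio with |x - y| >= 4 * m: {worst_distant:.3f}")
```

The positive value over all pairs shows that the drift is not uniformly dissipative (the target is not log concave between the modes), while the negative value for distant pairs is the kind of behavior a distant dissipativity condition asks for.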
3.3 Distant dissipativity, non-constant σ
Using a combination of synchronous and reflection couplings, Wang [93, Thm. 2.6] showed that diffusions satisfying a distant dissipativity condition exhibit exponential Wasserstein decay, even when the diffusion coefficient is non-constant. In Section H, we combine the coupling strategy of [93, Thm. 2.6] with the approach of for diffusions with constant to obtain the following explicit Wasserstein decay rate for distantly dissipative diffusions with bounded .
Let be the transition semigroup of an Itô diffusion with drift and diffusion coefficients and . Define the truncated diffusion coefficient
and the distance-conditional dissipativity function
for any
If the constants and satisfy , then
Theorem 11 holds even when the drift is not Lipschitz.
The Wasserstein decay rate (17) in Theorem 11 has a simple form when the diffusion is dissipative at large distances and is bounded below. These rates follow exactly as in [22, Lem. 1].
Under the conditions of Theorem 11, suppose that, for and , . Then
for fixed and . Introduce the shorthand for each and . Since
Indeed, fix any . Since , , and , , , , and are all Lipschitz. The drift is also Lipschitz, since and the product of and
are bounded. Hence, (16) is bounded below. Moreover, the Hölder continuity of , Cauchy-Schwarz, and the triangle inequality imply
Letting represent the supremum in the final inequality, we see that whenever Hence, Corollary 12 delivers exponential Wasserstein decay. A Wasserstein lower bound on the Stein discrepancy now follows from Theorem 7, since , , and is Lipschitz, and hence and are bounded.
Computing Stein discrepancies
For any sample , Stein operator , and Stein set , the Stein discrepancy is recovered by solving an optimization problem over functions . For example, if we write and , the classical diffusion Stein discrepancy is the value
For all Stein sets, the diffusion Stein discrepancy objective is linear in and only queries and at the sample points underlying . However, the classical Stein set constrains at all points in its domain, resulting in an infinite-dimensional optimization problem. (Footnote: When , the problem reduces to a finite-dimensional convex quadratically constrained quadratic program with linear objective as in [35, Thm. 9].)
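Because the objective only queries the test function and its derivative at the sample points, any single feasible function yields a computable lower bound on the discrepancy. The sketch below is a univariate illustration under assumptions: a standard normal target, the Langevin operator (T g)(x) = g(x) d/dx log p(x) + g'(x), and the test function g = sin, which is 1-bounded with a 1-bounded, 1-Lipschitz derivative and therefore belongs to the classical Stein set in one dimension. The function names are hypothetical.

```python
import math


def langevin_stein_sin(x):
    """(T g)(x) = g(x) * d/dx log p(x) + g'(x) with g = sin and p = N(0, 1)."""
    return -x * math.sin(x) + math.cos(x)


def stein_objective(points, weights):
    """Weighted sample average of (T g)(x_i); linear in the values of g and g'."""
    return sum(w * langevin_stein_sin(x) for x, w in zip(points, weights))


point_mass = abs(stein_objective([0.0], [1.0]))
two_points = abs(stein_objective([-1.0, 1.0], [0.5, 0.5]))
print(f"lower bound for a point mass at 0:      {point_mass:.4f}")
print(f"lower bound for equal mass at -1 and 1: {two_points:.4f}")
```

The full discrepancy maximizes this objective over the whole Stein set instead of fixing one function, which is where the optimization problem discussed in this section enters.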
imposes boundedness constraints only at sample points and smoothness constraints only at pairs of sample points enumerated in the edge set . The graph is termed a -spanner if each edge is assigned the weight , and, for all , there exists a path between and in the graph with total path weight no greater than . Remarkably, for any linear Stein operator , a spanner Stein discrepancy based on a -spanner is equivalent to the classical Stein discrepancy in the following strong sense, implying Desiderata (i) and (ii).
where is independent of and depends only on and .
The proof relies on the Whitney-Glaeser extension theorem [85, Thm. 1.4] of Glaeser and follows exactly as in [35, Prop. 5 and 6].
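The spanner property itself is straightforward to verify for a small graph. The sketch below is a brute-force check that assumes Euclidean edge weights and uses the hypothetical function name `is_t_spanner`. It runs Floyd-Warshall shortest paths and compares every path length with t times the straight-line distance. It is meant only to illustrate the definition and is not the greedy spanner construction used in the experiments.

```python
import math
from itertools import combinations


def is_t_spanner(points, edges, t):
    """Return True if the graph on `points` with Euclidean edge weights has a
    path of length at most t * ||x - y|| between every pair of points."""
    n = len(points)
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
    for i, j in edges:
        w = math.dist(points[i], points[j])
        dist[i][j] = dist[j][i] = min(dist[i][j], w)
    # Floyd-Warshall all-pairs shortest paths.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return all(
        dist[i][j] <= t * math.dist(points[i], points[j]) + 1e-12
        for i, j in combinations(range(n), 2)
    )


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(is_t_spanner(square, cycle, t=2.0))
print(is_t_spanner(square, cycle, t=1.2))
```

For the unit square connected as a cycle, opposite corners are at distance sqrt(2) but joined by a path of length 2, so the graph is a t-spanner exactly when t is at least sqrt(2). A sparser edge set reduces the number of smoothness constraints at the cost of a larger t.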
4.2 Decoupled linear programs
where and represent the values and respectively. Therefore, our recommended quality measure is the -spanner diffusion Stein discrepancy with . Its computation is summarized in Algorithm 1. An efficient implementation of Algorithm 1, integrated with 11 linear program solver options, is publicly available via our Julia package (https://jgorham.github.io/SteinDiscrepancy.jl/).
Numerical illustrations
In this section, we complement the principal theoretical contributions of this work with several simple numerical illustrations demonstrating how diffusion Stein discrepancies can be deployed in practice. We will use our proposed quality measures to select hyperparameters for biased samplers, to quantify a bias-variance trade-off for approximate MCMC, and to compare deterministic and random quadrature rules. In each case, we choose experimental settings in which a notion of surrogate ground truth is available for external validation. We solve all linear programs using Julia for Mathematical Programming with the Gurobi 6.0.4 solver and use the C++ greedy spanner implementation of Bouts et al. to compute our -spanners. Our timings were obtained on a single core of an Intel Xeon CPU E5-2650 v2 @ 2.60GHz. Code reconstructing all experiments is available on the Julia package site.
We first present a simple example to illustrate several Stein discrepancy properties. For a Gaussian mixture target (Example 3) with and , we simulate one i.i.d. sequence of sample points from and a second i.i.d. sequence from , which represents only one component of . For various mode separations , Figure 1 shows that the Langevin spanner Stein discrepancy (D1) applied to the first Gaussian mixture sample points decreases to zero at a rate, while the discrepancy applied to the single mode sequence stays bounded away from zero. However, Figure 1 also indicates that larger sample sizes are needed to distinguish between the mixture and single mode sample sequences when is large. This accords with our theory (see Example 3, Corollary 12, and Theorem 6), which implies that both the Langevin diffusion Wasserstein decay rate and the bound relating Stein to Wasserstein degrade as the mixture mode separation increases.
5.2 Selecting sampler hyperparameters
Stochastic Gradient Riemannian Langevin Dynamics (SGRLD) with a constant step size is an approximate MCMC procedure designed to accelerate posterior inference. Unlike asymptotically correct MCMC algorithms, SGRLD has a stationary distribution that deviates increasingly from its target as its step size grows. On the other hand, if is too small, SGRLD fails to explore the sample space sufficiently quickly. Hence, an appropriate setting of is paramount for accurate inference.
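The step size trade-off can be seen exactly in a toy setting. The sketch below does not implement SGRLD. As a stand-in, it assumes the simpler unadjusted Langevin update x' = (1 - eps/2) x + sqrt(eps) xi for a standard normal target, with the hypothetical function name `stationary_variance`. For this update the stationary variance solves v = (1 - eps/2)^2 v + eps, giving v = 1 / (1 - eps/4).

```python
def stationary_variance(eps, n_iter=10_000):
    """Iterate the variance recursion of the unadjusted Langevin update
    x' = (1 - eps / 2) * x + sqrt(eps) * xi,  xi ~ N(0, 1),
    for the N(0, 1) target until it settles at its fixed point."""
    v = 1.0
    for _ in range(n_iter):
        v = (1.0 - 0.5 * eps) ** 2 * v + eps
    return v


for eps in (0.01, 0.1, 0.5, 1.0):
    closed_form = 1.0 / (1.0 - 0.25 * eps)
    print(f"eps = {eps:4.2f}: variance = {stationary_variance(eps):.4f} "
          f"(closed form {closed_form:.4f})")
```

In this toy model, larger step sizes inflate the stationary variance above the target value of 1, while smaller step sizes reduce the bias but make the per-step contraction factor 1 - eps/2 approach 1, which corresponds to slower mixing.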
After standardizing the output variable and non-constant regressors and initializing each chain with an approximate posterior mode found by L-BFGS started at the origin, we ran SGRLD with minibatch size , metric , and a variety of step sizes to produce sample sequences of length thinned to length . We then selected the step size that delivered the highest quality sample – either the maximum effective sample size (ESS, a popular MCMC mixing diagnostic based on asymptotic variance) or the minimum Riemannian Langevin spanner Stein discrepancy with . The longest discrepancy computation consumed s for spanner construction and s to solve a coordinate optimization problem. As a surrogate measure of ground truth, we also generated a sample of size from the Metropolis-adjusted Riemannian Langevin Algorithm (MARLA) with metric and computed the median bivariate marginal Wasserstein distance between each SGRLD sample and thinned to points.
Figure 2(a) shows that ESS, which does not account for stationary distribution bias, selects the largest step size available, . As seen in Figure 2(b), this choice results in samples that are greatly overdispersed when compared with the ground truth MARLA sample . At the other extreme, the selection produces greatly underdispersed samples due to slow mixing. The Stein discrepancy chooses an intermediate value, . The same value minimizes the surrogate ground truth Wasserstein measure and produces samples that most closely resemble the in Figure 2(b).
5.3 Quantifying a bias-variance trade-off
Approximate random walk Metropolis-Hastings (ARWMH) with tolerance parameter is a biased MCMC procedure that accelerates posterior inference by approximating the standard MH correction. Qualitatively, a smaller setting of produces a more faithful approximation of the MH correction and less bias between the chain’s stationary distribution and the target distribution of interest. A larger setting of leads to faster sampling and a more rapid reduction of Monte Carlo variance, as fewer datapoint likelihoods are computed per sampling step. We will quantify this bias-variance trade-off as a function of sampling time using the Langevin spanner Stein discrepancy.
In the notation of Example 2, we conduct a Bayesian Huber regression analysis () of the log radon levels in Minnesota households as a function of the log amount of uranium in the county, an indicator of whether the radon reading was performed in a basement, and an intercept term. A prior is placed on the coefficient vector . We run ARWMH with minibatch size and two settings of the tolerance threshold ( and ) for likelihood evaluations, discard the sample points from the first evaluations, and thin the remaining points to sequences of length . Figure 3 displays the Langevin spanner Stein discrepancy applied to the first points in each sequence as a function of the likelihood evaluation count, which serves as a proxy for sampling time. As expected, the higher tolerance sample () is of higher Stein quality for a small computational budget but is eventually overtaken by the sample with smaller asymptotic bias. The longest discrepancy computation consumed s for the spanner and s for a coordinate LP.
5.4 Comparing quadrature rules
Stein discrepancies can also measure the quality of deterministic sample sequences designed to improve upon Monte Carlo sampling. For the Gaussian mixture target of Section 5.1, Figure 4 compares the median quality of 50 sample sequences generated from four quadrature rules recently studied in [53, Sec. 4.1]: i.i.d. sampling from , Quasi-Monte Carlo (QMC) sampling using a deterministic quasirandom number generator, Frank-Wolfe (FW) kernel herding , and fully-corrective Frank-Wolfe (FCFW) kernel herding . The quality judgments of the Langevin spanner Stein discrepancy (D1) closely mimic those of the Wasserstein distance , which is computable for simple univariate targets . Each Stein discrepancy was computed in under s.
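For univariate distributions the Wasserstein distance has a particularly simple form. The sketch below covers only the special case of two equally weighted samples of equal size, where the 1-Wasserstein distance is the average gap between matched order statistics. The function name `wasserstein_1d` is hypothetical, and comparing a sample against a large reference sample in this way is an illustrative substitute, not necessarily the computation used for Figure 4.

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equally weighted univariate samples
    of the same size, computed by matching order statistics."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


print(wasserstein_1d([0.0, 1.0, 2.0], [3.0, 1.0, 2.0]))  # 1.0
```

No comparably simple formula is available in higher dimensions, which is one reason a computable alternative such as the Stein discrepancy is attractive there.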
Under both diagnostics and as previously observed in other metrics, the i.i.d. samples are typically of lower median quality than their deterministic counterparts. More surprisingly and in contrast to past work focused on very smooth function classes, FCFW underperforms FW and QMC in our diagnostics for larger sample sizes. Apparently FCFW, which is heavily optimized for smooth function integration, has sacrificed approximation quality for less smooth test functions. For example, Figure 4 shows that QMC offers a better quadrature estimate than FCFW for , a -Lipschitz approximation to the indicator of being within one standard deviation of a mode.
In addition to providing a sample quality score, the Stein discrepancy optimization problem produces an optimal Stein function and an associated test function that is mean zero under and best distinguishes the sample from the target . Figure 4 gives examples of these maximally discriminative functions for a target mode separation of and length sequences from each quadrature rule. We also display the associated sample histograms with overlaid target density. The optimal FCFW function reflects the jagged nature of the FCFW histogram.
Connections and conclusions
We developed quality measures suitable for comparing the fidelity of arbitrary “off-target” sample sequences by generating infinite collections of known target expectations.
The score statistic of Fan et al. and the Gibbs sampler convergence criteria of Zellner and Min account for some sample biases but sacrifice differentiating power by exploiting only a finite number of known target expectations. For example, when , the score statistic cannot differentiate two samples with the same means and variances. Maximum mean discrepancies (MMDs) over characteristic reproducing kernel Hilbert spaces do detect arbitrary distributional biases but are only computable when the chosen kernel functions can be integrated under the target. In practice, one often approximates MMD using a sample from the target, but this requires a separate trustworthy sample from .
While we have focused on the graph and classical Stein sets of , our diffusion Stein operators can also be paired with the reproducing kernel Hilbert space unit balls advocated in to form tractable kernel diffusion Stein discrepancies or with the random feature functions advocated in to form random feature diffusion Stein discrepancies. We have also restricted our attention to Stein operators arising from diffusion generators. These take the form with for positive semidefinite and skew-symmetric. More generally, if the matrix possesses eigenvalues having a negative real part, then the resulting operator need not correspond to a diffusion process. Such operators fall into the class of pseudo-Fokker Planck operators which have been studied in the context of quantum optics . As noted in it is possible to obtain corresponding stochastic dynamics in an extended state space by introducing complex-valued noise terms; these operators may merit further study in future work.
Alternative inferential tasks
Alternative targets
Our exposition has focused on the Wasserstein distance , which is only defined for distributions with finite means. A parallel development could be made for the Dudley metric to target distributions with undefined mean. The work of Cerrai also suggests that the Lipschitz condition on our drift and diffusion coefficients can be relaxed.
Appendix A Proof of Proposition 3
Let . Since and are bounded,
The coarea formula and the integrability of and further imply that
Appendix B Proof of Theorem 5
with infinitesimal generator , that is Lipschitz, that has a continuous Hessian, that has a bounded and Hölder continuous Hessian under additional smoothness assumptions.
The transition semigroup of an Itô diffusion with Lipschitz drift and diffusion coefficients is strongly continuous on .
for some depending only on and . The dominated convergence theorem now yields the desired pointwise convergence.
To prove the strong continuity of , it suffices, by [23, Thm. I.5.8, p. 40], to verify that is weakly continuous, i.e., that , as , for all elements of the dual space . To this end, fix any . By the Riesz-Markov theorem for [17, Theorem 2.4], there exists a finite signed Radon measure such that
for the dual norm. By Jensen’s inequality and [28, Sec. 5, Cor. 1.2],
Since is -integrable by (20), dominated convergence gives
Consider the infinitesimal generator of the semigroup on with
The stationarity of and the definitions of and imply that
and hence as , since has a finite mean, and as as is integrable and monotonic. Arguing similarly,
Instantiate the additional preconditions of (11), and assume that , or else (11) is vacuous. Lemma 15, established in Section C, shows that the semigroup admits a bounded continuous Hessian, which is integrable in .
for , .
The dominated convergence theorem now implies that the Hessian of is obtained by differentiating twice under the integral sign. The advertised bound (11) on follows by replacing the infimum on the right-hand side of the semigroup bound (22) with the selection , applying the bound for each and , and integrating the result over .
Finally, instantiate the additional preconditions of (12), and fix any . The integral representation (10) of , the dominated convergence theorem, and Jensen’s inequality imply
When , a seminorm interpolation lemma (Lemma 19 in the supplement), a semigroup third derivative estimate (Lemma 20 in the supplement) with , and the semigroup second derivative estimate of Lemma 15 with imply
for some constant depending only on and . Thus . For , Lemmas 19, 20, and 15 and the integrability of yield
for a constant again depending only on and . Combining these bounds and choosing completes the proof. An explicit constant can be obtained by tracing constants through the proof of Lemma 20.
Appendix C Proof of Lemma 15
obtained by formally differentiating the equation (5) defining with respect to in the direction . The second directional derivative flow solves the second variation equation,
obtained by differentiating (23) with respect to in the direction .
Since has bounded first and second derivatives, the dominated convergence theorem implies that, for each , is twice differentiable with
obtained by differentiating under the integral sign. Lemma 16, proved in Section C.1, justifies the exchanges of derivative and expectation by ensuring that the derivative flows have moments bounded uniformly in .
for and .
Since and are bounded, and and have second moments bounded uniformly in by Lemma 16, the Hessian formula (25) implies that is bounded and hence that is Lipschitz continuous for each .
Hereafter we assume that , as the semigroup Hessian bound (22) is otherwise vacuous.
The Lipschitz continuity of and the Itô diffusion moment bound of [51, Thm. 3.4, part 4] together imply that
for all . Since is bounded, and and are bounded and Lipschitz, [27, Prop. 3.2] gives the following Bismut-Elworthy-Li-type formula for the directional derivative of for each :
By interchanging derivative and integral, the dominated convergence theorem now delivers the Hessian expression
for each , provided that , and are continuous in . The requisite continuity follows from the Lipschitz continuity of and , the boundedness of , , and , and the controlled moment growth and Hölder continuity of , and as functions of [76, Theorem V.40]. The dominated convergence theorem further implies that is continuous for each .
Now, we fix any and turn to bounding in terms of , by bounding the expectations of , and of (28) in turn.
where we have adopted the definition of given in Lemma 16.
where we have used Dynkin’s formula [28, Eq. 7.11], the Itô isometry, and the chain rule,
and we bound this expression using Cauchy-Schwarz, Jensen’s inequality, the semigroup gradient bound (21), the second derivative flow bound (27), and the fact that is increasing:
C.1 Proof of Lemma 16: Derivative flow bounds
the advertised result (26) follows from Grönwall’s inequality.
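For reference, the differential form of Grönwall's inequality invoked here can be stated as follows. The symbols $u$, $\beta$, and $T$ are generic placeholders introduced for this reminder and are not tied to the quantities in (26).

```latex
u'(t) \le \beta(t)\, u(t) \ \text{ for all } t \in [0, T]
\quad\Longrightarrow\quad
u(t) \le u(0)\, \exp\!\left( \int_0^t \beta(s)\, ds \right) \ \text{ for all } t \in [0, T].
```

Here $u$ is assumed differentiable and $\beta$ continuous on $[0, T]$. A differential inequality for a moment of the derivative flow therefore converts directly into an exponential-in-time bound on that moment.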
for any . Letting , we see that, by Cauchy-Schwarz and our derivative flow bound (26),
Hence, if we choose and define we may write
Grönwall’s inequality now yields the result (27) via
Appendix D Proof of Theorem 6
Hence, since our choice of was arbitrary, and
we may take expectation under and supremum over in (31) to reach
as has a standard normal distribution. Leibniz’s rule also gives
where the last equality follows by Isserlis’ theorem. Finally, when , Leibniz’s rule gives .
Appendix E Proof of Theorem 7
We will derive each inequality for ; the generic norm results will then follow from the property , which implies .
Let for . Since is dense in , we may take expectation under and supremum over in (32) to reach
E.2 Proof of the second inequality
Assume now that and are bounded and locally Lipschitz. Fix any . Lemma 17 and an auxiliary smoothing lemma (Lemma 18 in the supplement) imply that . This improved dependence on will allow us to establish a near-linear relationship between the Stein discrepancy and the Wasserstein distance. By Theorem 5, for depending only on and . Hence, for . Following the derivation in Section E.1 and choosing $s^{*} = \left(\frac{\iota C_{\iota}\,\mathcal{S}(Q_{n}, \mathcal{T}, \mathcal{G}_{\|\cdot\|_{2}})}{\zeta}\right)^{\frac{1}{\iota+1}}$ and , we obtain
Now consider the case in which and the choice . Since for all ,
Introduce the shorthand . Since , we have . Similarly, , so . Therefore,
Next, fix any and consider the case in which so that . Because and , we conclude that
The result follows from estimates of these two cases and the bound (34).
Appendix F Proof of Proposition 8
for any coupling of and . We obtain the first advertised inequality by repeatedly applying the Fenchel-Young inequality for dual norms, invoking the boundedness and Lipschitz constraints on and , and taking a supremum over . The second inequality follows from the first by invoking Jensen’s inequality, the fact for all , Hölder’s inequality, and finally the definition of .
Appendix G Proof of Theorem 10
By the uniform dissipativity assumption, the right-hand side is at most . For the transition semigroup ,
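One standard way to see how dissipativity drives contraction is synchronous coupling: two copies of the diffusion are run from different starting points with the same Brownian increments, so the noise cancels in their difference. The sketch below is an illustration under simplifying assumptions that are not the general setting of the theorem: the drift is $b(x) = -x$, the diffusion coefficient is the constant $\sqrt{2}$, and the dynamics are discretized with Euler-Maruyama. The function name `synchronous_coupling_distances` is hypothetical.

```python
import numpy as np


def synchronous_coupling_distances(x0, y0, step, n_steps, seed=0):
    """Euler-Maruyama paths of dX = -X dt + sqrt(2) dW from x0 and y0,
    driven by the SAME noise; returns ||X_k - Y_k||_2 for k = 0..n_steps."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    y = np.array(y0, dtype=float)
    distances = [np.linalg.norm(x - y)]
    for _ in range(n_steps):
        # Shared Brownian increment: this is what makes the coupling synchronous.
        noise = np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
        x = x - step * x + noise
        y = y - step * y + noise
        distances.append(np.linalg.norm(x - y))
    return np.array(distances)


step, n_steps = 0.01, 200
dists = synchronous_coupling_distances([1.0, -2.0], [-1.0, 0.5], step, n_steps)

# With a linear drift the difference obeys D_{k+1} = (1 - step) * D_k exactly,
# a discrete analogue of exponential decay at rate 1.
predicted = dists[0] * (1.0 - step) ** np.arange(n_steps + 1)
```

Because the shared noise cancels, the distance between the coupled paths is deterministic here; for non-constant diffusion coefficients the difference retains a stochastic term and the contraction holds only in expectation.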
Appendix H Proof of Theorem 11
where is an -dimensional Wiener process and is an independent -dimensional Wiener process.
Following the argument of Eberle [22, Sec. 4], we define the difference process , its norm , and the one-dimensional Wiener process , and apply the generalized Itô formula [49, Thm. 22.5] to obtain the stochastic differential equations
for any concave increasing with absolutely continuous derivative, , and . Since the drift term in the latter equation is bounded above by the argument of [22, p. 15] shows that the results of [22, Thm. 1 and Cor. 2] hold for our choice of and .
Acknowledgments
We thank Simon Lacoste-Julien for sharing his quadrature code, Martin Hairer for discussing interpolation inequalities, Andreas Eberle for reading an earlier version of this manuscript, and Murat Erdogdu for identifying an important typographical error in an earlier version of this manuscript. This material is based upon work supported by the grant EPSRC EP/N000188/1, the National Science Foundation DMS RTG Grant No. 1501767, the National Science Foundation Graduate Research Fellowship under Grant No. DGE-114747, the Frederick E. Terman Fellowship, and the Lloyd’s Register Foundation programme on Data-Centric Engineering at the Alan Turing Institute, UK.
References
Supplementary Appendix I Smoothing and interpolation
The first result bounds the Lipschitz constant of in terms of the Hölder continuity of .
An application of the triangle inequality gives rise to
Supplementary Appendix J Semigroup third derivative estimate
for constants depending only on and .
Our proof closely follows that of Lemma 15 in Section C, and we will only highlight the important differences. Throughout, and will represent arbitrary constants depending only on and that may change from expression to expression.
obtained by differentiating (24) with respect to in the direction .
In a manner analogous to the derivation of (28) in the proof of Lemma 15, we can derive an expression for the third derivative of the semigroup,
We will bound each term in (38) in turn.
We will provide a step-by-step calculation for the first term. By Cauchy-Schwarz,
We use the derivative flow bounds of Lemma 16 to realize
Cauchy-Schwarz, the Itô isometry [28, Eqs. 7.1 and 7.2], and Lemma 16 now yield
to rewrite . The next term satisfies
Cauchy-Schwarz and the Itô isometry [28, Eqs. 7.1 and 7.2] yield
Using (29), Lemma 15, and the non-increasing property of , we find that
J.4 Semigroup third derivative bound
By combining the bounds for each term, adapting the argument of [8, Prop. 1.5.1], and invoking the semigroup gradient bound and Hessian bound of Lemma 15, we obtain, for any and
J.5 Third derivative flow bound
We have by Cauchy-Schwarz and Young’s inequality
We can conclude using Grönwall’s inequality that
where we use that, by Young’s inequality,
Following the arguments of Section C.1, Grönwall’s inequality gives