Measuring Sample Quality with Diffusions

Jackson Gorham, Andrew B. Duncan, Sebastian J. Vollmer, Lester Mackey

Introduction

When $\mathcal{H}$ is convergence determining, the measure (1) is an integral probability metric (IPM) , and $d_{\mathcal{H}}(Q_{n},P)$ converges to zero only if the sample sequence $(Q_{n})_{n\geq 1}$ converges in distribution to $P$ .

While a variety of standard probability metrics are representable as IPMs , the intractability of integration under $P$ precludes us from computing most of these candidate quality measures. Recently, Gorham and Mackey sidestepped this issue by constructing a class of test functions $h$ known a priori to have zero mean under $P$ . Their resulting quality measure – the Langevin graph Stein discrepancy – satisfied our computability and convergence detection requirements (Desiderata (i) and (iii)) and detected sample sequence non-convergence (Desideratum (ii)) for strongly log concave targets with bounded third and fourth derivatives .

Our first contribution is to show that the Langevin Stein discrepancy in fact determines convergence for all smooth, distantly dissipative target distributions by explicitly lower and upper bounding the Langevin Stein discrepancy by standard Wasserstein distances. Distant dissipativity is a substantial relaxation of log concavity that covers a variety of common non-log concave targets like Gaussian mixtures and robust Student’s t regression posteriors. This contribution greatly extends the range of applicability of the Langevin Stein discrepancy.

Because heavy-tailed distributions are never distantly dissipative, as a second contribution, we extend the computable Stein discrepancy framework of to accommodate heavy-tailed target distributions by introducing a new class of multivariate Stein operators based on general Itô diffusions. These operators can be used as drop-in replacements for the commonly used Langevin operator in applications.

for constants $C_{1},C_{2}>0$ determined by Theorem 7 and Proposition 8. This improves upon prior analyses even in the case of strongly log concave targets.

Our primary contribution underlies these three advances. By relating Stein’s method to Markov process coupling rates in Section 2, we prove that every sufficiently fast coupling Itô diffusion gives rise to explicit, uniform multivariate Stein factor bounds on the derivatives of Stein equation solutions. Stein factor bounds are central to Stein’s method of measuring distributional convergence, and while a wealth of bounds are available for univariate targets (see, e.g., for explicit bounds or for a recent review), Stein factors for continuous multivariate distributions have largely been relegated to Gaussian , Dirichlet , and strongly log-concave target distributions. Our approach, which exposes a general relationship between Stein factors and Markov process coupling times, extends the reach of Stein’s method to the stationary distributions of all fast coupling Itô diffusions.

In Section 3, we provide examples of practically checkable sufficient conditions for fast coupling and illustrate the process of verifying these conditions for canonical log-concave, heavy-tailed, and multimodal targets. Section 4 describes a practical algorithm for computing diffusion Stein discrepancies using a geometric spanner and linear programming. In Section 5, we complement the principal theoretical contributions of this work with several simple numerical examples illustrating how diffusion Stein discrepancies can be deployed in practice. In particular, we use our discrepancies to select the hyperparameters of biased samplers, compare random and deterministic quadrature rules, and quantify bias-variance tradeoffs in approximate Markov chain Monte Carlo. A discussion of related and future work follows in Section 6, and all proofs are deferred to the appendices.

Stein’s method

In the early 1970s, Charles Stein introduced a powerful three-step approach to upper-bounding a reference IPM $d_{\mathcal{H}}$ :

The operator $\mathcal{T}{}$ and its domain $\mathcal{G}$ define the Stein discrepancy ,

a quality measure which takes the form of an integral probability metric while avoiding explicit integration under $P$ .

Next, prove that, for each test function $h$ in the reference class $\mathcal{H}$ , the Stein equation

admits a solution $g_{h}\in\mathcal{G}$ . This step ensures that the reference metric $d_{\mathcal{H}}$ lower bounds the Stein discrepancy (Desideratum (ii)) and, in practice, can be carried out simultaneously for large classes of target distributions.

Finally, use whatever means necessary to upper bound the Stein discrepancy and thereby establish convergence to zero under appropriate conditions (Desideratum (i)). Our general result, Proposition 8, suffices for this purpose.

While Stein’s method is traditionally used as analytical tool to establish rates of distributional convergence, we aim, following , to develop the method into a practical computational tool for measuring the quality of a sample. We begin by assessing the convergence properties of a broad class of Stein operators derived from Itô diffusions. Our efforts will culminate in Section 4, where we show how to explicitly compute the Stein discrepancy (2) given any sample measure $Q_{n}$ and appropriate choices of $\mathcal{T}{}$ and $\mathcal{G}$ .

To identify an operator $\mathcal{T}{}$ that generates mean-zero functions under $P$ , we will appeal to the elegant and widely applicable generator method construction of Barbour and Götze . These authors note that if $(Z_{t})_{t\geq 0}$ is a Feller process with invariant measure $P$ , then the infinitesimal generator $\mathcal{A}{}$ of the process, defined pointwise by

where $(W_{t})_{t\geq 0}$ is an $m$ -dimensional Wiener process.

As the next theorem, distilled from [62, Thm. 2] and [74, Sec. 4.6], shows, it is straightforward to construct Itô diffusions with a given invariant measure $P$ (see also ).

the (overdamped) Langevin diffusion (also known as the Brownian or Smoluchowski dynamics) [74, Secs. 6.5 and 4.5], where $a\equiv I$ and $c\equiv 0$ ;

the Riemannian Langevin diffusion , where $c\equiv 0$ and $a$ is not constant;

We will present detailed examples making use of these diffusion classes in Sections 3 and 5.

Theorem 2 forms the basis for our Stein operator of choice, the diffusion Stein operator $\mathcal{T}{}$ , defined by substituting $g$ for $\frac{1}{2}\nabla u$ in the generator (7):

$\mathcal{T}{}$ is an appropriate choice for our setting as it depends on $P$ only through $\nabla\log p$ and is therefore computable even when the normalizing constant of $p$ is unavailable. One suitable domain for $\mathcal{T}{}$ is the classical Stein set of 1-bounded functions with 1-bounded, 1-Lipschitz derivatives:

Indeed, our next proposition, proved in Section A, shows that, on this domain, the diffusion Stein operator generates mean-zero functions under $P$ .

Together, $\mathcal{T}{}$ and $\mathcal{G}_{\|{\cdot}\|}$ give rise to the classical diffusion Stein discrepancy $\mathcal{S}({Q_{n}},{\mathcal{T}{}},{\mathcal{G}_{\|{\cdot}\|}})$ , our primary object of study in Sections 2.2 and 2.3.

2 Lower bounding the diffusion Stein discrepancy

To establish that the classical diffusion Stein discrepancy detects non-convergence (Desideratum (ii)), we will lower bound the discrepancy in terms of the $L^{1}$ Wasserstein distance, $d_{\mathcal{W}_{\|{\cdot}\|_{2}}}=W_{1,\|{\cdot}\|_{2}}$ , a standard reference IPM generated by

The first step is to show that, for each $h\in\mathcal{W}_{\|{\cdot}\|_{2}}$ , the solution $g_{h}$ to the Stein equation (3) with diffusion Stein operator (8) has low-order derivatives uniformly bounded by target-specific constants called Stein factors.

Explicit Langevin diffusion (D1) Stein factor bounds are readily available for a wide variety of univariate targetsThe Langevin operator recovers Stein’s density method operator when $d=1$ . (see, e.g., for explicit bounds or for a recent review). In contrast, in the multivariate setting, efforts to establish Stein factors have focused on Gaussian , Dirichlet , and strongly log-concave targets with preconditioned Langevin (D2) operators. To extend the reach of the literature, we will derive multivariate Stein factors for targets with fast-coupling Itô diffusions. Our measure of coupling speed is the Wasserstein decay rate.

Let $(P_{t}{})_{t\geq 0}{}$ be the transition semigroup of an Itô diffusion $({Z}_{t,x})_{t\geq 0}$ defined via

where $\delta_{x}P_{t}{}$ denotes the distribution of ${Z}_{t,x}$ .

Our next result, proved in Section B, shows that the smoothness of a solution $g_{h}$ to a Stein equation is controlled by the rate of Wasserstein decay and hence by how quickly two diffusions with distinct starting points couple. The Stein factor bounds on the derivatives of $u_{h}$ and $g_{h}$ may be of independent interest for establishing rates of distributional convergence.

Fix any Lipschitz $h$ . If an Itô diffusion has invariant measure $P\in\mathcal{P}_{1}$ , transition semigroup $(P_{t}{})_{t\geq 0}$ , Wasserstein decay rate $r$ , and infinitesimal generator $\mathcal{A}{}$ (4), then

is twice continuously differentiable and satisfies

for $\gamma_{\rho}\triangleq\rho M_{1}(b)+\frac{\rho^{2}-2\rho}{2}M_{1}(\sigma)^{2}+\frac{\rho}{2}F_{1}(\sigma)^{2}$ , $\alpha\triangleq\frac{M_{2}(b)^{2}}{2M_{1}(b)+4M_{1}(\sigma)^{2}}+2F_{2}(\sigma)^{2}$ . If, additionally, $\nabla^{3}b$ and $\nabla^{3}\sigma$ are locally Lipschitz and $h\in C^{3}$ with bounded third derivatives, then, for all $\iota\in(0,1)$ ,

for $K>0$ a constant depending only on $M_{1:3}(\sigma),M_{1:3}(b),$ $M_{0}(\sigma^{-1}),$ and $r$ .

Thms. 1 and 2 of Pardoux and Veretennikov also bound the solutions of the Stein equation (3). However, for generic Lipschitz $h$ , [72, Thms. 1 and 2] provide inexplicit constants; only guarantee the polynomial growth of $g_{h}$ and its derivatives, not uniform boundedness; and require bounded $\sigma$ , a strong assumption which rules out the heavy-tailed examples of Section 3.

A first consequence of Theorem 5, proved in Section D, concerns Stein operators (8) with constant covariance and stream matrices $a$ and $c$ . In this setting, fast Wasserstein decay implies that the diffusion Stein discrepancy converges to zero only if the Wasserstein distance does (Desideratum (ii)).

Consider an Itô diffusion with diffusion Stein operator $\mathcal{T}{}$ (8) for $P\in\mathcal{P}_{1}$ , Wasserstein decay rate $r$ , constant covariance and stream matrices $a$ and $c$ , and Lipschitz drift $b(x)=\frac{1}{2}(a+c)\nabla\log p(x)$ . If $s_{r}\triangleq\int_{0}^{\infty}r(t)\,dt$ , then

Theorem 6 in fact provides an explicit upper bound on the Wasserstein distance in terms of the Stein discrepancy and the Wasserstein decay rate. Under additional smoothness assumptions on the coefficients, the explicit relationship between Stein discrepancy and Wasserstein distance can be improved and extended to diffusions with non-constant diffusion coefficient, as our next result, proved in Section E, shows.

Consider an Itô diffusion for $P\in\mathcal{P}_{1}$ with diffusion Stein operator $\mathcal{T}{}$ (8), Wasserstein decay rate $r$ , and Lipschitz drift and diffusion coefficients $b$ (6) and $\sigma$ with locally Lipschitz second derivatives. If $s_{r}\triangleq\int_{0}^{\infty}r(t)\,dt$ , then

for $\beta_{1}$ , $\beta_{2}$ defined in Theorem 5 and

If, additionally, $\nabla^{3}b$ and $\nabla^{3}\sigma$ are locally Lipschitz, then

for a constant $K>0$ depending only on $M_{1:3}(\sigma),M_{1:3}(b),$ $M_{0}(\sigma^{-1}),$ and $r$ .

The $\log(\nicefrac{{1}}{{\mathcal{S}({Q_{n}},{\mathcal{T}{}},{\mathcal{G}_{\|{\cdot}\|}})}})$ term in (14) reflects the potential non-smoothness of the Stein equation solution $g_{h}$ studied in Theorem 5. Indeed, for $d\geq 2$ and standard multivariate Gaussian $P$ , there exist Lipschitz $h$ with infinite $M_{2}(g_{h})$ [77, Remark 2].

In Section 3, we will present practically checkable conditions implying fast Wasserstein decay and discuss both broad families and specific diffusion-target pairings covered by this theory.

3 Upper bounding the diffusion Stein discrepancy

In upper bounding the Stein discrepancy, one classically aims to establish rates of convergence to $P$ for specific sequences $(Q_{n})_{n=1}^{\infty}$ . Since our interest is in explicitly computing Stein discrepancies for arbitrary sample sequences, our general upper bound in Proposition 8 serves principally to provide sufficient conditions under which the classical diffusion Stein discrepancy converges to zero.

Let $\mathcal{T}{}$ be the diffusion Stein operator (8) for $P\in\mathcal{P}_{1}$ . If $m\triangleq a+c$ and $b$ (6) are $P$ -integrable,

This result, proved in Section F, complements the Wasserstein distance lower bounds of Section 2.2 and implies that, for Lipschitz and sufficiently integrable $m$ and $b$ , the diffusion Stein discrepancy converges to zero whenever $Q_{n}$ converges to $P$ in Wasserstein distance.

4 Extension to non-uniform Stein sets

For any $c_{1},c_{2},c_{3}>0$ , our analyses and algorithms readily accommodate the non-uniform Stein set

This added flexibility can be valuable when tight upper bounds on a reference IPM, like the Wasserstein distance, are available for a particular choice of Stein factors $(c_{1},c_{2},c_{3})$ . When such Stein factors are unknown or difficult to compute, we recommend the parameter-free classical Stein set and graph Stein set of the sequel as practical defaults, since the classical Stein discrepancy is strongly equivalent to any non-uniform Stein discrepancy:

The short proof follows exactly as in [35, Prop. 4].

Sufficient conditions for Wasserstein decay

Since the Stein discrepancy lower bounds of Section 2 depend on the Wasserstein decay (9) of the chosen diffusion, we next provide examples of practically checkable sufficient conditions for Wasserstein decay and illustrate the process of verifying these conditions for a selection of diffusion-target pairings. These pedagogical examples serve to succinctly illustrate the process of verifying our assumptions and do not represent the full scope of applicability.

It is well known [see, e.g., 7, Eq. 7] that the Langevin diffusion (D1) enjoys exponential Wasserstein decay whenever $\log p$ is $k$ -strongly log concave, i.e., when the drift $b=\frac{1}{2}\nabla\log p$ satisfies $\langle{b(x)-b(y)},{x-y}\rangle\leq-{\textstyle\frac{k}{2}}\|{x-y}\|_{2}^{2}$ for $k>0$ . An analogous uniform dissipativity condition gives explicit exponential decay for a generic Itô diffusion:

The proof of Theorem 10 in Section G holds even when the drift $b$ is not Lipschitz, yields the same decay rate for $\mathcal{W}_{2,\|{\cdot}\|_{2}}$ , and relies on a synchronous coupling of Itô diffusions, mimicking [7, Sec. 1].

Hence, if the drift $b$ of an Itô diffusion is $-k/2$ -one-sided Lipschitz, i.e.,

and some $G\succ 0$ , and the diffusion coefficient $\sigma$ is $\sqrt{k^{\prime}}$ -Lipschitz, that is,

the $P$ -targeted preconditioned Langevin diffusion (D2) drift $b(\beta)={\textstyle\frac{1}{2}}\Sigma\nabla\log p(\beta)$ satisfies (15) with $k=1$ and $G=\Sigma^{-1}$ and $M_{1}(b)\leq{\textstyle\frac{1}{2}}\|{I+\frac{1}{4}\Sigma\sum_{l=1}^{L}v_{l}v_{l}^{\top}}\|_{op}$ . Hence, the diffusion enjoys geometric Wasserstein decay (Theorem 10) and a Wasserstein lower bound on the Stein discrepancy (Theorem 6).

2 Distant dissipativity, constant σ𝜎\sigma

When the diffusion coefficient $\sigma$ is constant with $a\triangleq\frac{1}{2}\sigma\sigma^{\top}$ invertible, Eberle showed that a distant dissipativity condition is sufficient for exponential Wasserstein decay. Specifically, if we define a one-sided Lipschitz constant conditioned on a distance $r>0$ ,

then [22, Cor. 2] establishes exponential Wasserstein decay whenever $\kappa$ is continuous with $\liminf_{r\to\infty}\kappa(r)>0$ and $\int_{0}^{1}r\kappa(r)^{-}dr<\infty$ . For a Lipschitz drift, this holds whenever $b$ is dissipative at large distances, that is, whenever, for some $k>0$ , we have $\kappa(r)\geq k$ for all $r$ sufficiently large [22, Lem. 1].

3 Distant dissipativity, non-constant σ𝜎\sigma

Using a combination of synchronous and reflection couplings, Wang [93, Thm. 2.6] showed that diffusions satisfying a distant dissipativity condition exhibit exponential Wasserstein decay, even when the diffusion coefficient $\sigma$ is non-constant. In Section H, we combine the coupling strategy of [93, Thm. 2.6] with the approach of for diffusions with constant $\sigma$ to obtain the following explicit Wasserstein decay rate for distantly dissipative diffusions with bounded $\sigma^{-1}$ .

Let $(P_{t}{})_{t\geq 0}$ be the transition semigroup of an Itô diffusion with drift and diffusion coefficients $b$ and $\sigma$ . Define the truncated diffusion coefficient

and the distance-conditional dissipativity function

for any $m_{0}\leq\inf_{x\neq y}{\textstyle\frac{\|{(\sigma_{0}(x)-\sigma_{0}(y))^{\top}(x-y)}\|_{2}}{\|{x-y}\|_{2}}}\quad\text{and}\quad\alpha\triangleq 1/(\lambda_{0}^{2}+m_{0}^{2}/4).$

If the constants $R_{0}=\inf\{R\geq 0\>:\>\kappa(r)\geq 0,\ \forall\,r\geq R\}$ and $R_{1}=\inf\{R\geq R_{0}\>:\>\kappa(r)R(R-R_{0})\geq 8,\ \forall\,r\geq R\}$ satisfy $R_{0}\leq R_{1}<\infty$ , then

Theorem 11 holds even when the drift $b$ is not Lipschitz.

The Wasserstein decay rate (17) in Theorem 11 has a simple form when the diffusion is dissipative at large distances and $\kappa$ is bounded below. These rates follow exactly as in [22, Lem. 1].

Under the conditions of Theorem 11, suppose that, for $R,L\geq 0$ and $K>0$ , $\kappa(r)\geq-L\text{ for }r\leq R\text{ and }\kappa(r)\geq K\text{ for }r>R$ . Then

for fixed $\delta,\nu>0$ and $\Sigma\succ 0$ . Introduce the shorthand $\psi_{\lambda}(r)\triangleq 2\sqrt{1+r^{2}/\delta^{2}}-\lambda^{2}$ for each $\lambda\in[0,\sqrt{2})$ and $\xi(\beta)\triangleq 1+\frac{1}{\nu}(y-V\beta)^{\top}\Sigma^{-1}(y-V\beta)$ . Since

Indeed, fix any $\lambda_{0}\in(0,1/M_{0}(\sigma^{-1}))=(0,\sqrt{2})$ . Since $M_{1}(\sqrt{\psi_{\lambda}})\leq\frac{1}{\delta\sqrt{2-\lambda^{2}}}$ , $M_{1}(\psi_{\lambda})\leq\frac{2}{\delta}$ , and $M_{2}(\psi_{\lambda})\leq\frac{2}{\delta^{2}}$ , $\sigma_{0}$ , $\sigma$ , $a$ , and $\nabla a$ are all Lipschitz. The drift $b$ is also Lipschitz, since $\nabla\log p$ and the product of $a(\beta)$ and

are bounded. Hence, $\kappa$ (16) is bounded below. Moreover, the the Hölder continuity of $x\mapsto\sqrt{x}$ , Cauchy-Schwarz, and the triangle inequality imply

Letting $\zeta$ represent the supremum in the final inequality, we see that $\kappa(r)\geq\alpha=1/\lambda_{0}^{2}$ whenever $r\geq 2({\textstyle\frac{d+1}{\delta}}+\zeta).$ Hence, Corollary 12 delivers exponential Wasserstein decay. A Wasserstein lower bound on the Stein discrepancy now follows from Theorem 7, since $M_{2}(\sqrt{\psi_{0}})\leq\frac{1}{\sqrt{2}\delta^{2}}$ , $M_{3}(\psi_{0})\leq\frac{96}{25\sqrt{5}\delta^{3}}$ , and $a(\beta)\nabla^{2}\log p(\beta)$ is Lipschitz, and hence $M_{2}(\sigma)$ and $M_{2}(b)$ are bounded.

Computing Stein discrepancies

For any sample $Q_{n}$ , Stein operator $\mathcal{T}{}$ , and Stein set $\mathcal{G}$ , the Stein discrepancy $\mathcal{S}({Q_{n}},{\mathcal{T}{}},{\mathcal{G}})$ is recovered by solving an optimization problem over functions $g\in\mathcal{G}$ . For example, if we write $m\triangleq a+c$ and $b(x)\triangleq\frac{1}{2}{\textstyle\frac{1}{p(x)}}\langle{\nabla},{p(x)m(x)}\rangle$ , the classical diffusion Stein discrepancy is the value

For all Stein sets, the diffusion Stein discrepancy objective is linear in $g$ and only queries $g$ and $\nabla g$ at the $n$ sample points underlying $Q_{n}$ . However, the classical Stein set $\mathcal{G}_{\|{\cdot}\|}$ constrains $g$ at all points in its domain, resulting in an infinite-dimensional optimization problem.When $d=1$ , the problem reduces to a finite-dimensional convex quadratically constrained quadratic program with linear objective as in [35, Thm. 9].

imposes boundedness constraints only at sample points and smoothness constraints only at pairs of sample points enumerated in the edge set $E$ . The graph is termed a $t$ -spanner if each edge $(x,y)\in E$ is assigned the weight $\|{x-y}\|$ , and, for all $x^{\prime}\neq y^{\prime}\in V$ , there exists a path between $x^{\prime}$ and $y^{\prime}$ in the graph with total path weight no greater than $t\|{x^{\prime}-y^{\prime}}\|$ . Remarkably, for any linear Stein operator $\mathcal{T}{}$ , a spanner Stein discrepancy $\mathcal{S}({Q_{n}},{\mathcal{T}{}},{\mathcal{G}_{\|{\cdot}\|,Q_{n},{G_{t}}}})$ based on a $t$ -spanner $G_{t}$ is equivalent to the classical Stein discrepancy in the following strong sense, implying Desiderata (i) and (ii).

where $\kappa_{d}$ is independent of $(Q_{n},P,\mathcal{T}{},G_{t})$ and depends only on $d$ and $\|{\cdot}\|$ .

The proof relies on the Whitney-Glaeser extension theorem [85, Thm. 1.4] of Glaeser and follows exactly as in [35, Prop. 5 and 6].

2 Decoupled linear programs

where $\psi_{ji}$ and $\Psi_{jki}$ represent the values $g_{j}(x_{i})$ and $\nabla_{k}g_{j}(x_{i})$ respectively. Therefore, our recommended quality measure is the $2$ -spanner diffusion Stein discrepancy with $\|{\cdot}\|=\|{\cdot}\|_{1}$ . Its computation is summarized in Algorithm 1. An efficient implementation of Algorithm 1, integrated with 11 linear program solver options, is publicly available via our Julia package.https://jgorham.github.io/SteinDiscrepancy.jl/

Numerical illustrations

In this section, we complement the principal theoretical contributions of this work with several simple numerical illustrations demonstrating how diffusion Stein discrepancies can be deployed in practice. We will use our proposed quality measures to select hyperparameters for biased samplers, to quantify a bias-variance trade-off for approximate MCMC, and to compare deterministic and random quadrature rules. In each case, we choose experimental settings in which a notion of surrogate ground truth is available for external validation. We solve all linear programs using Julia for Mathematical Programming with the Gurobi 6.0.4 solver and use the C++ greedy spanner implementation of Bouts et al. to compute our $2$ -spanners. Our timings were obtained on a single core of an Intel Xeon CPU E5-2650 v2 @ 2.60GHz. Code reconstructing all experiments is available on the Julia package site.footnote 5

We first present a simple example to illustrate several Stein discrepancy properties. For a Gaussian mixture target $P$ (Example 3) with $p(x)\propto e^{-\frac{1}{2}(x-\frac{\Delta}{2})^{2}}+e^{-\frac{1}{2}(x+\frac{\Delta}{2})^{2}}$ and $\Delta>0$ , we simulate one i.i.d. sequence of sample points from $P$ and a second i.i.d. sequence from $\mathcal{N}(-\frac{\Delta}{2},1)$ , which represents only one component of $P$ . For various mode separations $\Delta$ , Figure 1 shows that the Langevin spanner Stein discrepancy (D1) applied to the first $n$ Gaussian mixture sample points decreases to zero at a $n^{-1/2}$ rate, while the discrepancy applied to the single mode sequence stays bounded away from zero. However, Figure 1 also indicates that larger sample sizes are needed to distinguish between the mixture and single mode sample sequences when $\Delta$ is large. This accords with our theory (see Example 3, Corollary 12, and Theorem 6), which implies that both the Langevin diffusion Wasserstein decay rate and the bound relating Stein to Wasserstein degrade as the mixture mode separation $\Delta$ increases.

2 Selecting sampler hyperparameters

Stochastic Gradient Riemannian Langevin Dynamics (SGRLD) with a constant step size $\epsilon$ is an approximate MCMC procedure designed to accelerate posterior inference. Unlike asymptotically correct MCMC algorithms, SGRLD has a stationary distribution that deviates increasingly from its target $P$ as its step size $\epsilon$ grows. On the other hand, if $\epsilon$ is too small, SGRLD fails to explore the sample space sufficiently quickly. Hence, an appropriate setting of $\epsilon$ is paramount for accurate inference.

After standardizing the output variable and non-constant regressors and initializing each chain with an approximate posterior mode found by L-BFGS started at the origin, we ran SGRLD with minibatch size $30$ , metric $G(\beta)=1/(2\sqrt{1+\|{\beta/\delta}\|_{2}^{2}})I$ , and a variety of step sizes $\epsilon$ to produce sample sequences of length $200,000$ thinned to length $2,000$ . We then selected the step size that delivered the highest quality sample – either the maximum effective sample size (ESS, a popular MCMC mixing diagnostic based on asymptotic variance ) or the minimum Riemannian Langevin spanner Stein discrepancy with $a(\beta)=G^{-1}(\beta)$ . The longest discrepancy computation consumed $6$ s for spanner construction and $65$ s to solve a coordinate optimization problem. As a surrogate measure of ground truth, we also generated a sample $Q^{*}$ of size $2\times 10^{8}$ from the Metropolis-adjusted Riemannian Langevin Algorithm (MARLA) with metric $G$ and compute the median bivariate marginal Wasserstein distance $d_{\mathcal{W}_{\|{\cdot}\|_{1}}}$ between each SGRLD sample and $Q^{*}$ thinned to $5,000$ points .

Figure 2(a) shows that ESS, which does not account for stationary distribution bias, selects the largest step size available, $\epsilon=10^{-2}$ . As seen in Figure 2(b), this choice results in samples that are greatly overdispersed when compared with the ground truth MARLA sample $Q^{*}$ . At the other extreme, the selection $\epsilon=10^{-7}$ produces greatly underdispersed samples due to slow mixing. The Stein discrepancy chooses an intermediate value, $\epsilon=10^{-4}$ . The same value minimizes the surrogate ground truth Wasserstein measure and produces samples that most closely resemble the $Q^{*}$ in Figure 2(b).

3 Quantifying a bias-variance trade-off

Approximate random walk Metropolis-Hastings (ARWMH) with tolerance parameter $\epsilon$ is a biased MCMC procedure that accelerates posterior inference by approximating the standard MH correction. Qualitatively, a smaller setting of $\epsilon$ produces a more faithful approximation of the MH correction and less bias between the chain’s stationary distribution and the target distribution of interest. A larger setting of $\epsilon$ leads to faster sampling and a more rapid reduction of Monte Carlo variance, as fewer datapoint likelihoods are computed per sampling step. We will quantify this bias-variance trade-off as a function of sampling time using the Langevin spanner Stein discrepancy.

In the notation of Example 2, we conduct a Bayesian Huber regression analysis ( $c=1$ ) of the log radon levels in $1,190$ Minnesota households as a function of the log amount of uranium in the county, an indicator of whether the radon reading was performed in a basement, and an intercept term. A $\mathcal{N}(0,I)$ prior is placed on the coefficient vector $\beta$ . We run ARWMH with minibatch size $5$ and two settings of the tolerance threshold $\epsilon$ ( $0.1$ and $0.2$ ) for $10^{7}$ likelihood evaluations, discard the sample points from the first $10^{5}$ evaluations, and thin the remaining points to sequences of length $1,000$ . Figure 3 displays the Langevin spanner Stein discrepancy applied to the first $n$ points in each sequence as a function of the likelihood evaluation count, which serves as a proxy for sampling time. As expected, the higher tolerance sample ( $\epsilon=0.2$ ) is of higher Stein quality for a small computational budget but is eventually overtaken by the $\epsilon=0.1$ sample with smaller asymptotic bias. The longest discrepancy computation consumed $0.8$ s for the spanner and $20.1$ s for a coordinate LP.

4 Comparing quadrature rules

Stein discrepancies can also measure the quality of deterministic sample sequences designed to improve upon Monte Carlo sampling. For the Gaussian mixture target of Section 5.1, Figure 4 compares the median quality of 50 sample sequences generated from four quadrature rules recently studied in [53, Sec. 4.1]: i.i.d. sampling from $P$ , Quasi-Monte Carlo (QMC) sampling using a deterministic quasirandom number generator, Frank-Wolfe (FW) kernel herding , and fully-corrective Frank-Wolfe (FCFW) kernel herding . The quality judgments of the Langevin spanner Stein discrepancy (D1) closely mimic those of the $L^{1}$ Wasserstein distance $d_{\mathcal{W}_{\|{\cdot}\|}}{}$ , which is computable for simple univariate targets . Each Stein discrepancy was computed in under $0.03$ s.

Under both diagnostics and as previously observed in other metrics , the i.i.d. samples are typically of lower median quality than their deterministic counterparts. More suprisingly and in contrast to past work focused on very smooth function classes , FCFW underperforms FW and QMC in our diagnostics for larger sample sizes. Apparently FCFW, which is heavily optimized for smooth function integration, has sacrificed approximation quality for less smooth test functions. For example, Figure 4 shows that QMC offers a better quadrature estimate than FCFW for $h_{1}(x)=\max\{0,1-\min_{j\in\{1,2\}}|x-\mu_{j}|\}$ , a $1$ -Lipschitz approximation to the indicator of being within one standard deviation of a mode.

In addition to providing a sample quality score, the Stein discrepancy optimization problem produces an optimal Stein function $g^{*}$ and an associated test function $h^{*}=\mathcal{T}{g^{*}}$ that is mean zero under $P$ and best distinguishes the sample $Q_{n}$ from the target $P$ . Figure 4 gives examples of these maximally discriminatve functions $h^{*}$ for a target mode separation of $\Delta=5$ and length $200$ sequences from each quadrature rule. We also display the associated sample histograms with overlaid target density. The optimal FCFW function reflects the jagged nature of the FCFW histogram.

Connections and conclusions

We developed quality measures suitable for comparing the fidelity of arbitrary “off-target” sample sequences by generating infinite collections of known target expectations.

The score statistic of Fan et al. and the Gibbs sampler convergence criteria of Zellner and Min account for some sample biases but sacrifice differentiating power by exploiting only a finite number of known target expectations. For example, when $P=\mathcal{N}(0,1)$ , the score statistic cannot differentiate two samples with the same means and variances. Maximum mean discrepancies (MMDs) over characteristic reproducing kernel Hilbert spaces do detect arbitrary distributional biases but are only computable when the chosen kernel functions can be integrated under the target. In practice, one often approximates MMD using a sample from the target, but this requires a separate trustworthy sample from $P$ .

While we have focused on the graph and classical Stein sets of , our diffusion Stein operators can also be paired with the reproducing kernel Hilbert space unit balls advocated in to form tractable kernel diffusion Stein discrepancies or with the random feature functions advocated in to form random feature diffusion Stein discrepancies. We have also restricted our attention to Stein operators arising from diffusion generators. These take the form $(\mathcal{T}{g})({x})={\textstyle\frac{1}{p(x)}}\langle{\nabla},{p(x)m(x)g(x)}\rangle$ with $m=a+c$ for $a(x)$ positive semidefinite and $c(x)$ skew-symmetric. More generally, if the matrix $m$ possesses eigenvalues having a negative real part, then the resulting operator need not correspond to a diffusion process. Such operators fall into the class of pseudo-Fokker Planck operators which have been studied in the context of quantum optics . As noted in it is possible to obtain corresponding stochastic dynamics in an extended state space by introducing complex-valued noise terms; these operators may merit further study in future work.

Alternative inferential tasks

Alternative targets

Our exposition has focused on the Wasserstein distance $d_{\mathcal{W}_{\|{\cdot}\|}}$ , which is only defined for distributions with finite means. A parallel development could be made for the Dudley metric to target distributions with undefined mean. The work of Cerrai also suggests that the Lipschitz condition on our drift and diffusion coefficients can be relaxed.

Appendix A Proof of Proposition 3

Let $f(r)=M_{0}(g)\int_{\partial\mathcal{B}_{r}}\|{a(z)+c(z)}\|_{op}\,p(z)\,dz$ . Since $g$ and $n_{r}$ are bounded,

The coarea formula and the integrability of $a$ and $c$ further imply that

Appendix B Proof of Theorem 5

with infinitesimal generator $\mathcal{A}{}$ , that $u_{h}$ is Lipschitz, that $u_{h}$ has a continuous Hessian, that $u_{h}$ has a bounded and Hölder continuous Hessian under additional smoothness assumptions.

The transition semigroup $(P_{t}{})_{t\geq 0}{}$ of an Itô diffusion with Lipschitz drift and diffusion coefficients is strongly continuous on $L$ .

for some $C>0$ depending only on $M_{1}(b)$ and $M_{1}(\sigma)$ . The dominated convergence theorem now yields the desired pointwise convergence.

To prove the strong continuity of $(P_{t}{})_{t\geq 0}{}$ , it suffices, by [23, Thm. I.5.8, p. 40], to verify that $(P_{t}{})_{t\geq 0}{}$ is weakly continuous, i.e., that $l(P_{t}{f})\rightarrow l(f)$ , as $t\rightarrow 0^{+}$ , for all elements $l$ of the dual space $L^{*}$ . To this end, fix any $l\in L^{*}$ . By the Riesz-Markov theorem for $L$ [17, Theorem 2.4], there exists a finite signed Radon measure $\mu$ such that

for $\lVert\cdot\rVert_{L^{*}}$ the dual norm. By Jensen’s inequality and [28, Sec. 5, Cor. 1.2],

Since $1+\|{x}\|_{2}^{2}$ is $|\mu|$ -integrable by (20), dominated convergence gives

Consider the infinitesimal generator $\mathcal{A}{}$ of the semigroup $(P_{t}{})_{t\geq 0}{}$ on $L$ with

The stationarity of $P$ and the definitions of $d_{\mathcal{W}_{\|{\cdot}\|_{2}}}$ and $r$ imply that

and hence $\lVert P_{t}{h}\rVert_{L}\rightarrow 0$ as $t\rightarrow\infty$ , since $P$ has a finite mean, and $r(t)\to 0$ as $t\to\infty$ as $r$ is integrable and monotonic. Arguing similarly,

Instantiate the additional preconditions of (11), and assume that $M_{0}(\sigma^{-1}),F_{2}(\sigma),M_{2}(b)<\infty$ , or else (11) is vacuous. Lemma 15, established in Section C, shows that the semigroup $P_{t}{h}$ admits a bounded continuous Hessian, which is integrable in $t$ .

The dominated convergence theorem now implies that the Hessian of $u_{h}$ is obtained by differentiating twice under the integral sign. The advertised bound (11) on $\nabla^{2}u_{h}$ follows by replacing the infimum on the right-hand side of the semigroup bound (22) with the selection $t_{0}=\min(t,1)$ , applying the bound $e^{\min(t,1)\gamma_{\rho}}\leq e^{\gamma_{\rho}}$ for each $\gamma_{\rho}$ and $t$ , and integrating the result over $t$ .

Finally, instantiate the additional preconditions of (12), and fix any $\iota\in(0,1)$ . The integral representation (10) of $u_{h}$ , the dominated convergence theorem, and Jensen’s inequality imply

When $t\leq 1$ , a seminorm interpolation lemma (Lemma 19 in the supplement), a semigroup third derivative estimate (Lemma 20 in the supplement) with $t_{0}=\min(t,1)$ , and the semigroup second derivative estimate of Lemma 15 with $t_{0}=\min(t,1)$ imply

for some constant $K_{1}>0$ depending only on $M_{1:3}(b),M_{1:3}(\sigma),M_{0}(\sigma^{-1}),$ and $r$ . Thus $\int_{0}^{1}M_{1-\iota}(\nabla^{2}P_{t}{h})\,dt\leq\frac{2M_{1}(h)}{K_{1}\iota}$ . For $t>1$ , Lemmas 19, 20, and 15 and the integrability of $r$ yield

for a constant $K_{2}>0$ again depending only on $M_{1:3}(b),M_{1:3}(\sigma),M_{0}(\sigma^{-1}),$ and $r$ . Combining these bounds and choosing $K=\min(K_{1},K_{2})/2$ completes the proof. An explicit constant $K$ can be obtained by tracing constants through the proof of Lemma 20.

Appendix C Proof of Lemma 15

obtained by formally differentiating the equation (5) defining $({Z}_{t,x})_{t\geq 0}$ with respect to $x$ in the direction $v$ . The second directional derivative flow $(U_{t,v,v^{\prime}})_{t\geq 0}$ solves the second variation equation,

obtained by differentiating (23) with respect to $x$ in the direction $v^{\prime}$ .

Since $f$ has bounded first and second derivatives, the dominated convergence theorem implies that, for each $t\geq 0$ , $P_{t}{f}$ is twice differentiable with

obtained by differentiating under the integral sign. Lemma 16, proved in Section C.1, justifies the exchanges of derivative and expectation by ensuring that the derivative flows have moments bounded uniformly in $x$ .

for $\gamma_{\rho}\triangleq\rho M_{1}(b)+\frac{\rho^{2}-2\rho}{2}M_{1}(\sigma)^{2}+\frac{\rho}{2}F_{1}(\sigma)^{2}$ and $\alpha\triangleq\frac{M_{2}(b)^{2}}{2M_{1}(b)+4M_{1}(\sigma)^{2}}+2F_{2}(\sigma)^{2}$ .

Since $\nabla f$ and $\nabla^{2}f$ are bounded, and $(V_{t,v})_{t\geq 0},(V_{t,v^{\prime}})_{t\geq 0},$ and $(U_{t,v,v^{\prime}})_{t\geq 0}$ have second moments bounded uniformly in $x$ by Lemma 16, the Hessian formula (25) implies that $\nabla^{2}P_{t}{f}$ is bounded and hence that $\nabla P_{t}{f}$ is Lipschitz continuous for each $t\geq 0$ .

Hereafter we assume that $M_{0}(\sigma^{-1})<\infty$ , as the semigroup Hessian bound (22) is otherwise vacuous.

The Lipschitz continuity of $f$ and the Itô diffusion moment bound of [51, Thm. 3.4, part 4] together imply that

for all $t\geq 0$ . Since $\sigma^{-1}$ is bounded, and $\nabla b$ and $\nabla\sigma$ are bounded and Lipschitz, [27, Prop. 3.2] gives the following Bismut-Elworthy-Li-type formula for the directional derivative of $P_{t}{f}$ for each $t>0$ :

By interchanging derivative and integral, the dominated convergence theorem now delivers the Hessian expression

for each $t>0$ , provided that $J_{1,x},J_{2,x}$ , and $J_{3,x}$ are continuous in $x$ . The requisite continuity follows from the Lipschitz continuity of $\nabla f$ and $f$ , the boundedness of $\sigma^{-1}$ , $\nabla\sigma$ , and $\nabla^{2}\sigma$ , and the controlled moment growth and Hölder continuity of $({Z}_{t,x})_{t\geq 0}$ , $(V_{t,v})_{t\geq 0},(V_{t,v^{\prime}})_{t\geq 0},$ and $(U_{t,v,v^{\prime}})_{t\geq 0}$ as functions of $x$ [76, Theorem V.40]. The dominated convergence theorem further implies that $\nabla^{2}P_{t}{f}$ is continuous for each $t>0$ .

Now, we fix any $t>0$ and turn to bounding $\nabla^{2}P_{t}{f}$ in terms of $M_{1}(f)$ , by bounding the expectations of $J_{1,x},J_{2,x}$ , and $J_{3,x}$ of (28) in turn.

where we have adopted the definition of $\gamma_{\rho}$ given in Lemma 16.

where we have used Dynkin’s formula [28, Eq. 7.11], the Itô isometry, and the chain rule,

and we bound this expression using Cauchy-Schwarz, Jensen’s inequality, the semigroup gradient bound (21), the second derivative flow bound (27), and the fact that $s\mapsto r(t-s)e^{s\gamma_{4}}$ is increasing:

C.1 Proof of Lemma 16: Derivative flow bounds

the advertised result (26) follows from Grönwall’s inequality.

for any $\epsilon>0$ . Letting $\gamma_{\rho}=\rho M_{1}(b)+\frac{\rho^{2}-2\rho}{2}M_{1}(\sigma)^{2}+\frac{\rho}{2}F_{1}(\sigma)^{2}$ , we see that, by Cauchy-Schwarz and our derivative flow bound (26),

Hence, if we choose $\epsilon=\gamma_{4}-(2M_{1}(b)+2F_{1}(\sigma)^{2})$ and define $\alpha=M_{2}(b)^{2}/\epsilon+2F_{2}(\sigma)^{2}$ we may write

Gronwall’s inequality now yields the result (27) via

Appendix D Proof of Theorem 6

Hence, since our choice of $h$ was arbitrary, and

we may take expectation under $Q_{n}$ and supremum over $h$ in (31) to reach

as ${\langle{v_{1}},{G}\rangle}/\|{v_{1}}\|_{2}$ has a standard normal distribution. Leibniz’s rule also gives

where the last equality follows by Isserlis’ theorem. Finally, when $f\in C^{1}$ , Leibniz’s rule gives $\nabla f_{s}=\nabla f\star\phi_{s}$ .

Appendix E Proof of Theorem 7

We will derive each inequality for $\|{\cdot}\|=\|{\cdot}\|_{2}$ ; the generic norm results will then follow from the property $\|{\cdot}\|\geq\|{\cdot}\|_{2}$ , which implies $\mathcal{G}_{\|{\cdot}\|_{2}}\subseteq\mathcal{G}_{\|{\cdot}\|}$ .

Let $\eta\triangleq s^{*}\zeta$ for $s^{*}=\sqrt{\mathcal{S}({Q_{n}},{\mathcal{T}{}},{\mathcal{G}_{\|{\cdot}\|_{2}}})\sqrt{{2}{/\pi}}\beta/\zeta}.$ Since $\mathcal{H}$ is dense in $\mathcal{W}_{\|{\cdot}\|_{2}}$ , we may take expectation under $Q_{n}$ and supremum over $h$ in (32) to reach

E.2 Proof of the second inequality

Assume now that $\nabla^{3}b$ and $\nabla^{3}\sigma$ are bounded and locally Lipschitz. Fix any $\iota\in(0,1)$ . Lemma 17 and an auxiliary smoothing lemma (Lemma 18 in the supplement) imply that $M_{2}(g_{h,s})=M_{1}(\nabla g_{h,s})\leq\sqrt{d}\frac{M_{1-\iota}(\nabla g_{h})}{s^{\iota}}$ . This improved dependence on $s$ will allow us to establish a near-linear relationship between the Stein discrepancy and the Wasserstein distance. By Theorem 5, $M_{1-\iota}(\nabla g_{h})\leq\frac{1}{K}(\frac{1}{\iota}+s_{r})$ for $K$ depending only on $M_{1:3}(\sigma),M_{1:3}(b),$ $M_{0}(\sigma^{-1}),$ and $r$ . Hence, $M_{2}(g_{h,s})\leq C_{\iota}/s^{\iota}$ for $C_{\iota}\triangleq\frac{\sqrt{d}}{K}(\frac{1}{\iota}+s_{r})$ . Following the derivation in Section E.1 and choosing $s^{*}=\mathopen{}\mathclose{{}\left(\frac{\iota C_{\iota}\mathcal{S}({Q_{n}},{\mathcal{T}{}},{\mathcal{G}_{\|{\cdot}\|_{2}}})}{\zeta}}\right){}^{\frac{1}{\iota+1}}$ and $\eta\triangleq\frac{\zeta}{\iota}s^{*}$ , we obtain

Now consider the case in which $\mathcal{S}({Q_{n}},{\mathcal{T}{}},{\mathcal{G}_{\|{\cdot}\|}})<e^{-1}$ and the choice $\iota=1/\log(\nicefrac{{1}}{{\mathcal{S}({Q_{n}},{\mathcal{T}{}},{\mathcal{G}_{\|{\cdot}\|}})}})\in(0,1)$ . Since $x^{1/(\log x-1)}\leq e$ for all $x\in(0,e^{-1})$ ,

Introduce the shorthand $c_{0}=\frac{\sqrt{d}}{K\zeta}$ . Since $\nicefrac{{1}}{{1+\iota}}\in(\nicefrac{{1}}{{2}},1)$ , we have $c_{0}^{\frac{1}{1+\iota}}\leq\max(\sqrt{c_{0}},c_{0})$ . Similarly, $1+s_{r}\iota>1$ , so $(1+\iota s_{r})^{\frac{1}{1+\iota}}\leq 1+\iota s_{r}$ . Therefore,

Next, fix any $\iota\in(0,1)$ and consider the case in which $\mathcal{S}({Q_{n}},{\mathcal{T}{}},{\mathcal{G}_{\|{\cdot}\|}})\geq e^{-1}$ so that $\mathcal{S}({Q_{n}},{\mathcal{T}{}},{\mathcal{G}_{\|{\cdot}\|}})^{{\textstyle\frac{1}{1+\iota}}}\leq\mathcal{S}({Q_{n}},{\mathcal{T}{}},{\mathcal{G}_{\|{\cdot}\|}})e^{{\textstyle\frac{\iota}{\iota+1}}}$ . Because ${\textstyle\frac{1}{\iota}}e^{{\textstyle\frac{\iota}{\iota+1}}}\leq\frac{1}{2}e^{1/2}<e$ and $(1+\iota s_{r})^{\nicefrac{{1}}{{1+\iota}}}\leq 1+s_{r}$ , we conclude that

The result follows from estimates of these two cases and the bound (34).

Appendix F Proof of Proposition 8

for any coupling of $X$ and $Z$ . We obtain the first advertised inequality by repeatedly applying the Fenchel-Young inequality for dual norms, invoking the boundedness and Lipschitz constraints on $g$ and $\nabla g$ , and taking a supremum over $g\in\mathcal{G}_{\|{\cdot}\|}$ . The second inequality follows from the firstby invoking Jensen’s inequality, the fact $\min(x,y)\leq x^{t}y^{1-t}$ for all $x,y\geq 0$ , Hölder’s inequality, and finally the definition of $W_{s,\|{\cdot}\|}$ .

Appendix G Proof of Theorem 10

By the uniform dissipativity assumption, the right-hand side is at most $\|{x-y}\|_{G}^{2}=d_{\mathcal{W}_{\|{\cdot}\|_{G}}}(\delta_{x},\,\delta_{y})^{2}$ . For the transition semigroup $(P_{t}{})_{t\geq 0}{}$ ,

Appendix H Proof of Theorem 11

where $(W_{t}^{\prime})_{t\geq 0}$ is an $m$ -dimensional Wiener process and $(W_{t}^{\prime\prime})_{t\geq 0}$ is an independent $d$ -dimensional Wiener process.

Following the argument of Eberle [22, Sec. 4], we define the difference process $Y_{t}={Z}_{t,x}-{Z}_{t,y}$ , its norm $r_{t}=\|{Y_{t}}\|_{2}$ , and the one-dimensional Wiener process $W_{t}=\int_{0}^{t}\langle{Y_{s}/r_{s}},{\,dW_{s}^{\prime\prime}}\rangle$ , and apply the generalized Itô formula [49, Thm. 22.5] to obtain the stochastic differential equations

for any concave increasing $f:[0,\infty)\mapsto[0,\infty)$ with absolutely continuous derivative, $f(0)=0$ , and $f^{\prime}(0)=1$ . Since the drift term in the latter equation is bounded above by $\beta_{t}\triangleq(2/\alpha)(f^{\prime\prime}(r_{t})-(1/4)f^{\prime}(r_{t})\kappa(r_{t})r_{t}),$ the argument of [22, p. 15] shows that the results of [22, Thm. 1 and Cor. 2] hold for our choice of $\alpha$ and $\kappa$ .

Acknowledgments

We thank Simon Lacoste-Julien for sharing his quadrature code, Martin Hairer for discussing interpolation inequalities, Andreas Eberle for reading an earlier version of this manuscript, and Murat Erdogdu for identifying an important typographical error in an earlier version of this manuscript. This material is based upon work supported by the grant EPSRC EP/N000188/1, the National Science Foundation DMS RTG Grant No. 1501767, the National Science Foundation Graduate Research Fellowship under Grant No. DGE-114747, the Frederick E. Terman Fellowship, and the Lloyd’s Register Foundation programme on Data Centric engineering at the Alan Turing Institute, UK.

References

Supplementary Appendix I Smoothing and interpolation

The first result bounds the Lipschitz constant of $f_{s}$ in terms of the Hölder continuity of $f$ .

An application of the triangle inequality gives rise to

Supplementary Appendix J Semigroup third derivative estimate

for constants $c,C$ depending only on $M_{1:3}(\sigma),M_{1:3}(b),M_{0}(\sigma^{-1}),$ and $r$ .

Our proof closely follows that of Lemma 15 in Section C, and we will only highlight the important differences. Throughout, $c$ and $C$ will represent arbitrary constants depending only on $M_{1:3}(\sigma),M_{1:3}(b),M_{0}(\sigma^{-1}),$ and $r$ that may change from expression to expression.

obtained by differentiating (24) with respect to $x$ in the direction $v^{\prime\prime}$ .

In a manner analogous to the derivation of (28) in proof of Lemma 15, we can derive an expression for the third derivative of the semi-group,

We will bound each term $J_{i,j,x}$ in (38) in turn.

We will provide a step-by-step calculation for the first term. By Cauchy-Schwarz,

We use the derivative flow bounds of Lemma 16 to realize

Cauchy-Schwarz, the Itô isometry [28, Eqs. 7.1 and 7.2], and Lemma 16 now yield

to rewrite $\sigma({Z}_{s,x})\nabla^{2}\sigma^{-1}({Z}_{s,x})$ . The next term satisfies

Cauchy-Schwarz and the Itô isometry [28, Eqs. 7.1 and 7.2] yield

Using (29), Lemma 15, and the non-increasing property of $r$ , we find that

J.4 Semigroup third derivative bound

By combining the bounds for each $J_{i,j,x}$ term, adapting the argument of [8, Prop. 1.5.1], and invoking the semigroup gradient bound and Hessian bound $M_{2}(P_{s}{f})\leq M_{1}(f)r(s-s_{0})\frac{c^{\prime}}{\sqrt{s_{0}}}e^{C^{\prime}s_{0}}$ of Lemma 15, we obtain, for any $t_{0}\in(0,t]$ and $s_{0}=t_{0}/2$

J.5 Third derivative flow bound

We have by Cauchy-Schwarz and Young’s inequality

We can conclude using Gronwall’s inequality that

where we use that, by Young’s inequality,

Following the arguments of Section C.1, Grönwall’s inequality gives