Measuring Sample Quality with Stein's Method

Jackson Gorham, Lester Mackey

Introduction

To address this shortcoming, we develop a new measure of sample quality suitable for comparing asymptotically exact, asymptotically biased, and even deterministic sample sequences. The quality measure is based on Stein’s method and is attainable by solving a linear program. After outlining our design criteria in Section 2, we relate the convergence of the quality measure to that of standard probability metrics in Section 3, develop a streamlined implementation based on geometric spanners in Section 4, and illustrate applications to hyperparameter selection, convergence rate assessment, and the quantification of bias-variance tradeoffs in posterior inference in Section 5. We discuss related work in Section 6 and defer all proofs to the appendix.

Quality Measures for Samples

Stein’s Method

Stein’s method for characterizing convergence in distribution classically proceeds in three steps:

Together, T\mathcal{T}{} and G\mathcal{G} define the Stein discrepancy,

an IPM-type quality measure with no explicit integration under PP.

Lower bound the Stein discrepancy by a familiar convergence-determining IPM dHd_{\mathcal{H}}. This step can be performed once, in advance, for large classes of target distributions and ensures that, for any sequence of probability measures (μm)m1(\mu_{m})_{m\geq 1}, S(μm,T,G)\mathcal{S}({\mu_{m}},{\mathcal{T}{}},{\mathcal{G}}) converges to zero only if dH(μm,P)d_{\mathcal{H}}(\mu_{m},P) does (Desideratum (ii)).

Upper bound the Stein discrepancy by any means necessary to demonstrate convergence to zero under suitable conditions (Desideratum (i)). In our case, the universal bound established in Section 3.3 will suffice.

While Stein’s method is typically employed as an analytical tool, we view the Stein discrepancy as a promising candidate for a practical sample quality measure. Indeed, in Section 4, we will adopt an optimization perspective and develop efficient procedures to compute the Stein discrepancy for any sample measure QQ and appropriate choices of T\mathcal{T}{} and G\mathcal{G}. First, we assess the convergence properties of an equivalent Stein discrepancy in the subsections to follow.

The generator method of Barbour provides a convenient and general means of constructing operators T\mathcal{T}{} which produce mean-zero functions under PP (2) . Let (Zt)t0({Z}_{t})_{t\geq 0} represent a Markov process with unique stationary distribution PP. Then the infinitesimal generator A\mathcal{A}{} of (Zt)t0({Z}_{t})_{t\geq 0}, defined by

For example, the overdamped Langevin diffusion, defined by the stochastic differential equation dZt=12logp(Zt)dt+dWtd{Z}_{t}=\frac{1}{2}\nabla\log p({Z}_{t})dt+dW_{t} for (Wt)t0(W_{t})_{t\geq 0} a Wiener process, gives rise to the generator

After substituting gg for 12u\frac{1}{2}\nabla u, we obtain the associated Stein operatorThe operator TP\mathcal{T}_{P}{} has also found fruitful application in the design of Monte Carlo control variates .

The Stein operator TP\mathcal{T}_{P}{} is particularly well-suited to our setting as it depends on PP only through the derivative of its log density and hence is computable even when the normalizing constant of pp is not.

of sufficiently smooth functions satisfying a Neumann-type boundary condition. The following proposition – a consequence of integration by parts – shows that \mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|} is a suitable domain for TP\mathcal{T}_{P}{}.

Together, TP\mathcal{T}_{P}{} and \mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|} form the classical Stein discrepancy \mathcal{S}({Q},{\mathcal{T}_{P}{}},{\mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}), our chief object of study.

2 Lower Bounding the Classical Stein Discrepancy

In the univariate setting (d=1d=1), it is known for a wide variety of targets PP that the classical Stein discrepancy \mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}) converges to zero only if the Wasserstein distance d_{\mathcal{W}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}(\mu_{m},P) does . In the multivariate setting, analogous statements are available for multivariate Gaussian targets , but few other target distributions have been analyzed. To extend the reach of the multivariate literature, we show in Theorem 2 that the classical Stein discrepancy also determines Wasserstein convergence for a large class of strongly log-concave densities, including the Bayesian logistic regression posterior under Gaussian priors.

We emphasize that the sufficient conditions in Theorem 2 are certainly not necessary for lower bounding the classical Stein discrepancy. We hope that the theorem and its proof will provide a template for lower bounding \mathcal{S}({Q},{\mathcal{T}_{P}{}},{\mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}) for other large classes of multivariate target distributions.

3 Upper Bounding the Classical Stein Discrepancy

We next establish sufficient conditions for the convergence of the classical Stein discrepancy to zero.

If XQX\sim Q and ZP{Z}\sim P with logp(Z)\nabla\log p({Z}) integrable,

One implication of Proposition 3 is that \mathcal{S}({Q_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}) converges to zero whenever XmQmX_{m}\sim Q_{m} converges in mean-square to ZPZ\sim P and logp(Xm)\nabla\log p(X_{m}) converges in mean to logp(Z)\nabla\log p(Z).

4 Extension to Non-uniform Stein Sets

The analyses and algorithms in this paper readily accommodate non-uniform Stein sets of the form

for constants c1,c2,c3>0c_{1},c_{2},c_{3}>0 known as Stein factors in the literature. We will exploit this additional flexibility in Section 5.2 to establish tight lower-bounding relations between the Stein discrepancy and Wasserstein distance for well-studied target distributions. For general use, however, we advocate the parameter-free classical Stein set and graph Stein sets to be introduced in the sequel. Indeed, any non-uniform Stein discrepancy is equivalent to the classical Stein discrepancy in a strong sense:

Computing Stein Discrepancies

Evaluating a Stein discrepancy S(Q,TP,G)\mathcal{S}({Q},{\mathcal{T}_{P}{}},{\mathcal{G}}) for a fixed (Q,P)(Q,P) pair reduces to solving an optimization program over functions gGg\in\mathcal{G}. For example, the classical Stein discrepancy is the optimum

Note that the objective associated with any Stein discrepancy S(Q,TP,G)\mathcal{S}({Q},{\mathcal{T}_{P}{}},{\mathcal{G}}) is linear in gg and, since QQ is discrete, only depends on gg and g\nabla g through their values at each of the nn sample points xix_{i}. The primary difficulty in solving the classical Stein program (8) stems from the infinitude of constraints imposed by the classical Stein set \mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}. One way to avoid this difficulty is to impose the classical smoothness constraints at only a finite collection of points. To this end, for each finite graph G=(V,E)G=(V,E) with vertices VXV\subset\mathcal{X} and edges EV2E\subset V^{2}, we define the graph Stein set,

the family of functions which satisfy the classical constraints and certain implied Taylor compatibility constraints at pairs of points in EE. Remarkably, if the graph G1G_{1} consists of edges between all distinct sample points xix_{i}, then the associated complete graph Stein discrepancy \mathcal{S}({Q},{\mathcal{T}_{P}{}},{\mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|,Q,{G_{1}}}}) is equivalent to the classical Stein discrepancy in the following strong sense.

where κd\kappa_{d} is a constant, independent of (Q,P)(Q,P), depending only on the dimension dd and norm \mathopen{}\mathclose{{}\left\|{\cdot}}\right\|.

Proposition 5 follows from the Whitney-Glaeser extension theorem for smooth functions and implies that the complete graph Stein discrepancy inherits all of the desirable convergence properties of the classical discrepancy. However, the complete graph also introduces order n2n^{2} constraints, rendering computation infeasible for large samples. To achieve the same form of equivalence while enforcing only O(n)O(n) constraints, we will make use of sparse geometric spanner subgraphs.

2 Geometric Spanners

For a given dilation factor t1t\geq 1, a tt-spanner is a graph G=(V,E)G=(V,E) with weight \mathopen{}\mathclose{{}\left\|{x-y}}\right\| on each edge (x,y)E(x,y)\in E and a path between each pair xyVx^{\prime}\neq y^{\prime}\in V with total weight no larger than t\mathopen{}\mathclose{{}\left\|{x^{\prime}-y^{\prime}}}\right\|. The next proposition shows that spanner Stein discrepancies enjoy the same convergence properties as the complete graph Stein discrepancy.

3 Decoupled Linear Programs

We have arbitrarily numbered the elements viv_{i} of the vertex set VV so that γji\gamma_{ji} represents the function value gj(vi)g_{j}(v_{i}), and Γjki\Gamma_{jki} represents the gradient value kgj(vi)\nabla_{k}g_{j}(v_{i}).

4 Constrained Domains

A small modification to the unconstrained formulation (9) extends our tractable Stein discrepancy computation to any domain defined by coordinate boundary constraints, that is, to X=(α1,β1)××(αd,βd)\mathcal{X}=(\alpha_{1},\beta_{1})\times\dots\times(\alpha_{d},\beta_{d}) with αj<βj-\infty\leq\alpha_{j}<\beta_{j}\leq\infty for all jj. Specifically, for each dimension jj, we augment the jj-th coordinate linear program of (9) with the boundary compatibility constraints

These additional constraints ensure that our candidate function and gradient values can be extended to a smooth function satisfying the boundary conditions g(z),n(z)=0\langle{g(z)},{n(z)}\rangle=0 on X\partial\mathcal{X}. Proposition 10 in the appendix shows that the spanner Stein discrepancy so computed is strongly equivalent to the classical Stein discrepancy on X\mathcal{X}.

Algorithm 1 summarizes the complete solution for computing our recommended, parameter-free spanner Stein discrepancy in the multivariate setting. Notably, the spanner step is unnecessary in the univariate setting, as the complete graph Stein discrepancy \mathcal{S}({Q},{\mathcal{T}_{P}{}},{\mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{1},Q,{G_{1}}}}) can be computed directly by sorting the sample and boundary points and only enforcing constraints between consecutive points in this ordering. Thus, the complete graph Stein discrepancy is our recommended quality measure when d=1d=1, and a recipe for its computation is given in Algorithm 2.

Experiments

We now turn to an empirical evaluation of our proposed quality measures. We compute all spanners using the efficient C++ greedy spanner implementation of Bouts et al. and solve all optimization programs using Julia for Mathematical Programming with the default Gurobi 6.0.4 solver . All reported timings are obtained using a single core of an Intel Xeon CPU E5-2650 v2 @ 2.60GHz.

We begin with a simple example to illuminate a few properties of the Stein diagnostic. For the target P=N(0,1)P=\mathcal{N}(0,1), we generate a sequence of sample points i.i.d. from the target and a second sequence i.i.d. from a scaled Student’s t distribution with matching variance and 10 degrees of freedom. The left panel of Figure 1 shows that the complete graph Stein discrepancy applied to the first nn Gaussian sample points decays to zero at an n0.52n^{-0.52} rate, while the discrepancy applied to the scaled Student’s t sample remains bounded away from zero. The middle panel displays optimal Stein functions gg recovered by the Stein program for different sample sizes. Each gg yields a test function hTPgh\triangleq\mathcal{T}_{P}{g}, featured in the right panel, that best discriminates the sample QQ from the target PP. Notably, the Student’s t test functions exhibit relatively large magnitude values in the tails of the support.

2 Comparing Discrepancies

We show in Theorem 9 in the appendix that, when d=1d=1, the classical Stein discrepancy is the optimum of a convex quadratically constrained quadratic program with a linear objective, O(n)O(n) variables, and O(n)O(n) constraints. This offers the opportunity to directly compare the behavior of the graph and classical Stein discrepancies. We will also compare to the Wasserstein distance d_{\mathcal{W}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}, which is computable for simple univariate target distributions and provably lower bounds the non-uniform Stein discrepancies (7) with c1:3=(0.5,0.5,1)c_{1:3}=(0.5,0.5,1) for P=Unif(0,1)P=\textnormal{Unif}(0,1) and c1:3=(1,4,2)c_{1:3}=(1,4,2) for P=N(0,1)P=\mathcal{N}(0,1) . For N(0,1)\mathcal{N}(0,1) and Unif(0,1)\textnormal{Unif}(0,1) targets and several random number generator seeds, we generate a sequence of sample points i.i.d. from the target distribution and plot the non-uniform classical and complete graph Stein discrepancies and the Wasserstein distance as functions of the first nn sample points in Figure 2. Two apparent trends are that the graph Stein discrepancy very closely approximates the classical and that both Stein discrepancies track the fluctuations in Wasserstein distance even when a magnitude separation exists. In the Unif(0,1)\textnormal{Unif}(0,1) case, the Wasserstein distance in fact equals the classical Stein discrepancy because TPg=g\mathcal{T}_{P}{g}=g^{\prime} is a Lipschitz function.

3 Selecting Sampler Hyperparameters

Stochastic Gradient Langevin Dynamics (SGLD) with constant step size ϵ\epsilon is a biased MCMC procedure designed for scalable inference. It approximates the overdamped Langevin diffusion, but, because no Metropolis-Hastings (MH) correction is used, the stationary distribution of SGLD deviates increasingly from its target as ϵ\epsilon grows. If ϵ\epsilon is too small, however, SGLD explores the sample space too slowly. Hence, an appropriate choice of ϵ\epsilon is critical for accurate posterior inference. To illustrate the value of the Stein diagnostic for this task, we adopt the bimodal Gaussian mixture model (GMM) posterior of as our target. For a range of step sizes ϵ\epsilon, we use SGLD with minibatch size 5 to draw 50 independent sequences of length n=1000n=1000, and we select the value of ϵ\epsilon with the highest median quality – either the maximum effective sample size (ESS, a standard diagnostic based on autocorrelation ) or the minimum spanner Stein discrepancy – across these sequences. The average discrepancy computation consumes 0.40.4s for spanner construction and 1.41.4s per coordinate linear program. As seen in Figure 3(a), ESS, which does not detect distributional bias, selects the largest step size presented to it, while the Stein discrepancy prefers an intermediate value. The rightmost plot of Figure 3(b) shows that a representative SGLD sample of size nn using the ϵ\epsilon selected by ESS is greatly overdispersed; the leftmost is greatly underdispersed due to slow mixing. The middle sample, with ϵ\epsilon selected by the Stein diagnostic, most closely resembles the true posterior.

4 Quantifying a Bias-Variance Trade-off

The approximate random walk MH (ARWMH) sampler is a second biased MCMC procedure designed for scalable posterior inference. Its tolerance parameter ϵ\epsilon controls the number of datapoint likelihood evaluations used to approximate the standard MH correction step. Qualitatively, a larger ϵ\epsilon implies fewer likelihood computations, more rapid sampling, and a more rapid reduction of variance. A smaller ϵ\epsilon yields a closer approximation to the MH correction and less bias in the sampler stationary distribution. We will use the Stein discrepancy to explicitly quantify this bias-variance trade-off.

We analyze a dataset of 5353 prostate cancer patients with six binary predictors and a binary outcome indicating whether cancer has spread to surrounding lymph nodes . Our target is the Bayesian logistic regression posterior under a N(0,I)\mathcal{N}(0,I) prior on the parameters. We run RWMH (ϵ=0\epsilon=0) and ARWMH (ϵ=0.1\epsilon=0.1 and batch size =2=2) for 10510^{5} likelihood evaluations, discard the points from the first 10310^{3} evaluations, and thin the remaining points to sequences of length 10001000. The discrepancy computation time for 10001000 points averages 1.31.3s for the spanner and 1212s for a coordinate LP. Figure 4 displays the spanner Stein discrepancy applied to the first nn points in each sequence as a function of the likelihood evaluation count. We see that the approximate sample is of higher Stein quality for smaller computational budgets but is eventually overtaken by the asymptotically exact sequence.

5 Assessing Convergence Rates

The Stein discrepancy can also be used to assess the quality of deterministic sample sequences. In Figure 5 in the appendix, for P=Unif(0,1)P=\textnormal{Unif}(0,1), we plot the complete graph Stein discrepancies of the first nn points of an i.i.d. Unif(0,1)\textnormal{Unif}(0,1) sample, a deterministic Sobol sequence , and a deterministic kernel herding sequence defined by the norm \mathopen{}\mathclose{{}\left\|{h}}\right\|_{\mathcal{H}}=\int_{0}^{1}(h^{\prime}(x))^{2}dx. We use the median value over 50 sequences in the i.i.d. case and estimate the convergence rate for each sampler using the slope of the best least squares affine fit to each log-log plot. The discrepancy computation time averages 0.080.08s for n=200n=200 points, and the recovered rates of n0.49n^{-0.49} and n1n^{-1} for the i.i.d. and Sobol sequences accord with expected O(1/n)O(1/\sqrt{n}) and O(1/n)O(1/n) bounds from the literature . As witnessed also in other metrics , the herding rate of n0.96n^{-0.96} outpaces its best known bound of dH(Qn,P)=O(1/n)d_{\mathcal{H}}(Q_{n},P)=O(1/\sqrt{n}), suggesting an opportunity for sharper analysis.

Discussion of Related Work

We have developed a quality measure suitable for comparing biased, exact, and deterministic sample sequences by exploiting an infinite class of known target functionals. The diagnostics of also account for asymptotic bias but lose discriminating power by considering only a finite collection of functionals. For example, for a N(0,1)\mathcal{N}(0,1) target, the score statistic of cannot distinguish two samples with equal first and second moments. Maximum mean discrepancy (MMD) on a characteristic Hilbert space takes full distributional bias into account but is only viable when the expected kernel evaluations are easily computed under the target. One can approximate MMD, but this requires access to a separate trustworthy ground-truth sample from the target.

Appendix A Proof of Proposition 1

for nrn_{r} the outward unit normal vector to (XBr)\partial(\mathcal{X}\cap\mathcal{B}_{r}). The final quantity in this expression equates to zero, as g(x),n(x)=0\langle{g(x)},{n(x)}\rangle=0 for all xx on the boundary X\partial\mathcal{X}, gg is bounded, and limmp(xm)=0\lim_{m\to\infty}p(x_{m})=0 for any (xm)m=1(x_{m})_{m=1}^{\infty} with xmXx_{m}\in\mathcal{X} for all mm and \mathopen{}\mathclose{{}\left\|{x_{m}}}\right\|_{\infty}\to\infty.

Appendix B Proof of Theorem 2: Stein Discrepancy Lower Bound for Strongly Log-concave Densities

We let Ck(X)C^{k}(\mathcal{X}) denote the set of real-valued functions on X\mathcal{X} with kk continuous derivatives and d_{\mathcal{M}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}} denote the smooth function distance, the IPM generated by

The following result, proved in the companion paper , establishes the existence of explicit constants (Stein factors) c1,c2,c3>0c_{1},c_{2},c_{3}>0, such that, for any test function h\in\mathcal{M}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}, the Stein equation

has a solution gh=12uhg_{h}=\frac{1}{2}\nabla u_{h} belonging to the non-uniform Stein set \mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}^{c_{1:3}}.

For each xXx\in\mathcal{X}, let (Zt,x)t0({Z}_{t,x})_{t\geq 0} represent the overdamped Langevin diffusion with infinitestimal generator

and initial state Z0,x=x{Z}_{0,x}=x. Then, for each hC3(X)h\in C^{3}(\mathcal{X}) with bounded first, second, and third derivatives, the function

Hence, by the equivalence of non-uniform Stein discrepancies (Proposition 4), d_{\mathcal{M}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}(\mu,P)\leq\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}^{c_{1:3}}})\leq\max(c_{1},c_{2},c_{3})\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}) for any probability measure μ\mu.

The desired result now follows from Lemma 8, which implies that the Wasserstein distance d_{\mathcal{W}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}(\mu_{m},P)\to 0 whenever d_{\mathcal{M}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}(\mu_{m},P)\to 0 for a sequence of probability measures (μm)m1(\mu_{m})_{m\geq 1}.

Lemma 2.2 of the companion paper establishes this result for the case \mathopen{}\mathclose{{}\left\|{\cdot}}\right\|=\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}; we omit the proof of the generalization which closely mirrors that of the Euclidean norm case.

Appendix C Proof of Proposition 3: Stein Discrepancy Upper Bound

To derive the second advertised inequality, we use the definition of the matrix norm, the Fenchel-Young inequality for dual norms, the definition of the matrix dual norm, and the Cauchy-Schwarz inequality in turn:

Since our bounds hold uniformly for all gg in \mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}, the proof is complete.

Appendix D Proof of Proposition 4: Equivalence of Non-uniform Stein Discrepancies

Fix any c1,c2,c3>0c_{1},c_{2},c_{3}>0, and let cmax=max(c1,c2,c3)c_{\max}=\max(c_{1},c_{2},c_{3}) and cmin=min(c1,c2,c3)c_{\min}=\min(c_{1},c_{2},c_{3}). Since the Stein discrepancy objective is linear in gg, we have a\,\mathcal{S}({Q},{\mathcal{T}_{P}{}},{\mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}})=\mathcal{S}({Q},{\mathcal{T}_{P}{}},{a\mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}) for any a>0a>0. The result now follows from the observation that c_{\min}\mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}\subseteq\mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}^{c_{1:3}}\subseteq c_{\max}\mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}.

Appendix E Proof of Proposition 5: Equivalence of Classical and Complete Graph Stein Discrepancies

Appendix F Proof of Proposition 6: Equivalence of Spanner and Complete Graph Stein Discrepancies

Identical reasoning establishes that \mathopen{}\mathclose{{}\left\|{g(z_{0})-g(z_{L})}}\right\|^{*}\leq t\mathopen{}\mathclose{{}\left\|{z_{0}-z_{L}}}\right\|.

Furthermore, since \mathopen{}\mathclose{{}\left\|{g(z_{l-1})-g(z_{l})-{\nabla g(z_{l})}{(z_{l-1}-z_{l})}}}\right\|^{*}\leq\frac{1}{2}\mathopen{}\mathclose{{}\left\|{z_{l-1}-z_{l}}}\right\|^{2} for each ll, the triangle inequality and the definition of the tensor norm \mathopen{}\mathclose{{}\left\|{\cdot}}\right\|^{*} imply that

Since z,zz,z^{\prime} were arbitrary, and the Stein discrepancy objective is linear in gg, we conclude that \mathcal{S}({Q},{\mathcal{T}_{P}{}},{\mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|,Q,{G_{t}}}})\leq\mathcal{S}({Q},{\mathcal{T}_{P}{}},{2t^{2}\mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|,Q,{G_{1}}}})=2t^{2}\,\mathcal{S}({Q},{\mathcal{T}_{P}{}},{\mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|,Q,{G_{1}}}}).

Appendix G Finite-dimensional Classical Stein Program

where (r)_{+}\triangleq\max\mathopen{}\mathclose{{}\left({r,0}}\right),

Fix any g\in\mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}^{c_{1:3}}. Also, since gg^{\prime} is c2c_{2}-bounded and c3c_{3}-Lipschitz, the constraints (13b) must be satisfied. Consider now the c2c_{2}-bounded and c3c_{3}-Lipschitz extensions of gg^{\prime}

We know that B(t)g(t)U(t)B(t)\leq g^{\prime}(t)\leq U(t) for all tt, for, if not, there would be a point t0t_{0} and a point x(i)x_{(i)} such that g(x(i))g(t0)>c3x(i)t0|g^{\prime}(x_{(i)})-g^{\prime}(t_{0})|>c_{3}|x_{(i)}-t_{0}|, which combined with the c3c_{3}-Lipschitz property would be a contradiction. Thus, for each sample x(i)x_{(i)}, the fundamental theorem of calculus gives

The right-hand side of this inequality evaluates precisely to the right-hand side of the constraint (13c). An analogous upper bound using U(t)U(t) yields (13d).

Finally, consider any point x(i)x_{(i)}. If x(i){α,β}x_{(i)}\in\{\alpha,\beta\}, then (13e) is satisfied as g(z)=0g(z)=0 for any point zz on the boundary. Suppose instead that α<x(i)<β\alpha<x_{(i)}<\beta. Without loss of generality, we may assume that g(x(i))0g^{\prime}(x_{(i)})\geq 0. Since gg^{\prime} is c3c_{3}-Lipschitz, we have g(t)g(x(i))c3tx(i)g^{\prime}(t)\geq g^{\prime}(x_{(i)})-c_{3}|t-x_{(i)}| for all tt. Integrating both sides of this inequality from x(i)x_{(i)} to xu=x(i)+g(x(i))/c3x_{u}=x_{(i)}+g^{\prime}(x_{(i)})/c_{3}, we obtain

Since g(xu)c1g(x_{u})\leq c_{1}, we have 12c3g(x(i))2+g(x(i))c1\frac{1}{2c_{3}}g^{\prime}(x_{(i)})^{2}+g(x_{(i)})\leq c_{1}. Similarly, by integrating the inequality from xb=x(i)g(x(i))/c3x_{b}=x_{(i)}-g^{\prime}(x_{(i)})/c_{3} to x(i)x_{(i)}, we have g(xb)g(x(i))g(x(i))2/(2c3)g(x_{b})-g(x_{(i)})\geq g^{\prime}(x_{(i)})^{2}/(2c_{3}), which combined with g(xb)c1g(x_{b})\leq c_{1} yields (13e).

G.2 Extending Feasible Solutions

We will use the c2c_{2}-bounded, c3c_{3}-Lipschitz functions BB and UU as building blocks for our extension, since they satisfy B(t)=U(t)=g(t)B(t)=U(t)=g^{\prime}(t) for t{x(i),x(i+1)}t\in\{x_{(i)},x_{(i+1)}\} and B(t)g(t)U(t)B(t)\leq g^{\prime}(t)\leq U(t),

for t[x(i),x(i+1)]t\in[x_{(i)},x_{(i+1)}]. We need only consider three cases.

𝑖1[x_{(i)},x_{(i+1)}]. Without loss of generality, we may assume that g(x(i)),g(x(i+1))0g^{\prime}(x_{(i)}),g^{\prime}(x_{(i+1)})\geq 0 and that BB changes sign. Consider the quantity ϕx(i)x(i+1)max{B(t),0}dt\phi\triangleq\int_{x_{(i)}}^{x_{(i+1)}}\max\{B(t),0\}dt. If g(x(i+1))g(x(i))ϕg(x_{(i+1)})-g(x_{(i)})\leq\phi, we let mi=Bm_{i}=B and Mi=max{B,0}M_{i}=\max\{B,0\}.

𝑖1[x_{(i)},x_{(i+1)}]. Without loss of generality, we may assume that g(x(i))0,g(x(i+1))<0g^{\prime}(x_{(i)})\geq 0,g^{\prime}(x_{(i+1)})<0. Since BB continuously interpolates between g(x(i))g^{\prime}(x_{(i)}) and g(x(i+1))g^{\prime}(x_{(i+1)}) on [x(i),x(i+1)][x_{(i)},x_{(i+1)}], it must have a root rr. Let wi[x(i),x(i+1)]w_{i}\in[x_{(i)},x_{(i+1)}] be the point where BB changes from one linear portion to another. Then because BB is monotonic on each linear portion, the fact that B(wi)B(x(i+1))<0B(w_{i})\leq B(x_{(i+1)})<0 means that BB cannot have a root between [wi,x(i+1)][w_{i},x_{(i+1)}] and hence has at most one root on [x(i),x(i+1)][x_{(i)},x_{(i+1)}]. Hence rr is the unique root of BB.

In a similar fashion, let us define ss as the root of UU, and since B(x)U(x)B(x)\leq U(x) for all xx, we have srs\geq r. Define

For the other case, in which g(x(i+1))g(x(i))>ψg(x_{(i+1)})-g(x_{(i)})>\psi, we choose mi=Wm_{i}=W and Mi=UM_{i}=U. Then (13e) imply that mi(s)=Mi(s)=U(s)[c1,c1]m_{i}(s)=M_{i}(s)=U(s)\in[-c_{1},c_{1}], and, since this is the only critical point, the extension is well-defined on (x(i),x(i+1))(x_{(i)},x_{(i+1)}).

Appendix H Equivalence of Constrained Classical and Spanner Stein Discrepancies

For PP with support X=(α1,β1)××(αd,βd)\mathcal{X}=(\alpha_{1},\beta_{1})\times\cdots\times(\alpha_{d},\beta_{d}) for αj<βj-\infty\leq\alpha_{j}<\beta_{j}\leq\infty, Algorithm 1 computes a Stein discrepancy based on the graph Stein set

Our next result shows that the graph Stein discrepancy based on a tt-spanner is strongly equivalent to the classical Stein discrepancy.

where κd\kappa_{d} is a constant, independent of (Q,P,Gt,t)(Q,P,G_{t},t), depending only on the dimension dd.

The smoothness constraints of the classical Stein set \mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{1}} now imply that

so that all graph Stein set boundary compatibility constraints are satisfied. Hence, we have the containment \mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{1}}\subseteq\mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{1},Q,{G_{t}}}, which implies the first advertised inequality.

for all kjk\neq j, we have (16). Moreover, the boundary compatibility constraints of \mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{1},Q,{G_{t}}} imply

establishing (15). We next invoke the triangle inequality, the boundary compatibility conditions of \mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{1},Q,{G_{t}}}, Hölder’s inequality, the Lipschitz derivative property (16), and the fact \mathopen{}\mathclose{{}\left\|{z-b}}\right\|_{1}=\mathopen{}\mathclose{{}\left\|{b^{*}-z}}\right\|_{1}+\mathopen{}\mathclose{{}\left\|{b^{*}-b}}\right\|_{1} in turn to establish (17):

A parallel argument yields (19). Finally, we may deduce (18), as

by the triangle inequality, the definition of \mathcal{G}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{1},Q,{G_{t}}}, and the Lipschitz property (16). ∎

Acknowledgments

The authors would like to thank Madeleine Udell for her generous advice concerning optimization in Julia, Quirijn Bouts and Kevin Buchin for sharing their wise counsel and greedy spanner implementation, Francis Bach for sharing his pseudosampling code, Andreas Eberle for his triple coupling pointers, and Jessica Hwang for her feedback on various versions of this manuscript.

This work was supported by the Frederick E. Terman Fellowship and the National Science Foundation Graduate Research Fellowship under Grant No. DGE-114747.

References