Measuring Sample Quality with Kernels

Jackson Gorham, Lester Mackey

Introduction

This closed form represents a significant practical advantage, as no linear program solvers are necessary, and the computation of the discrepancy can be easily parallelized. However, as we will see in Section 3.2, not all kernel Stein discrepancies are suitable for our setting. In particular, in dimension $d\geq 3$ , the kernel Stein discrepancies previously recommended in the literature fail to detect when a sample is not converging to the target. To address this shortcoming, we develop a theory of weak convergence for the kernel Stein discrepancies analogous to that of (Gorham & Mackey, 2015; Mackey & Gorham, 2016; Gorham et al., 2016) and design a class of kernel Stein discrepancies that provably control weak convergence for a large class of target distributions.

After formally describing our goals for measuring sample quality in Section 2, we outline our strategy, based on Stein’s method, for constructing and analyzing practical quality measures at the start of Section 3. In Section 3.1, we define our family of closed-form quality measures – the kernel Stein discrepancies (KSDs) – and establish several appealing practical properties of these measures. We analyze the convergence properties of KSDs in Sections 3.2 and 3.3, showing that previously proposed KSDs fail to detect non-convergence and proposing practical convergence-determining alternatives. Section 4 illustrates the value of convergence-determining kernel Stein discrepancies in a variety of applications, including hyperparameter selection, sampler selection, one-sample hypothesis testing, and sample quality improvement. Finally, in Section 5, we conclude with a discussion of related and future work.

Quality measures for samples

Stein’s method with kernels

Stein’s method (Stein, 1972) provides a three-step recipe for assessing convergence in distribution:

For any such Stein operator and Stein set $\mathcal{G}$ , Gorham & Mackey (2015) defined the Stein discrepancy as

which, crucially, avoids explicit integration under $P$ .

Lower bound the Stein discrepancy by an IPM $d_{\mathcal{H}}$ known to dominate weak convergence. This can be done once for a broad class of target distributions to ensure that $\mu_{m}\Rightarrow P$ whenever $\mathcal{S}({\mu_{m}},{\mathcal{T}{}},{\mathcal{G}})\to 0$ for a sequence of probability measures $(\mu_{m})_{m\geq 1}$ (Desideratum (ii)).

Provide an upper bound on the Stein discrepancy ensuring that $\mathcal{S}({\mu_{m}},{\mathcal{T}{}},{\mathcal{G}})\to 0$ under suitable convergence of $\mu_{m}$ to $P$ (Desideratum (i)).

While Stein’s method is principally used as a mathematical tool to prove convergence in distribution, we seek, in the spirit of (Gorham & Mackey, 2015; Gorham et al., 2016), to harness the Stein discrepancy as a practical tool for measuring sample quality. The subsections to follow develop a specific, practical instantiation of the abstract Stein’s method recipe based on reproducing kernel Hilbert spaces. An empirical analysis of the Stein discrepancies recommended by our theory follows in Section 4.

A standard, widely applicable univariate Stein operator is the density method operator (see Stein et al., 2004; Chatterjee & Shao, 2011; Chen et al., 2011; Ley et al., 2017),

Inspired by the generator method of Barbour (1988, 1990) and Götze (1991), Gorham & Mackey (2015) generalized this operator to multiple dimensions. The resulting Langevin Stein operator

With this definition, we define our kernel Stein set $\mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}$ as the set of vector-valued functions $g=(g_{1},\dots,g_{d})$ such that each component function $g_{j}$ belongs to $\mathcal{K}_{k}$ and the vector of their norms $\mathopen{}\mathclose{{}\left\|{g_{j}}}\right\|_{\mathcal{K}_{k}}$ belongs to the $\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|^{*}$ unit ball:Our analyses and algorithms support each $g_{j}$ belonging to a different RKHS $\mathcal{K}_{k_{j}}$ , but we will not need that flexibility here.

The following result, proved in Section B, establishes that this is an acceptable domain for $\mathcal{T}_{P}{}$ .

The Langevin Stein operator and kernel Stein set together define our quality measure of interest, the kernel Stein discrepancy (KSD) $\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}})$ . When $\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|=\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}$ , this definition recovers the KSD proposed by Chwialkowski et al. (2016) and Liu et al. (2016). Our next result shows that, for any $\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|$ , the KSD admits a closed-form solution.

Suppose $k\in C^{(1,1)}$ , and, for each $j\in\{1,\dots d\}$ , define the Stein kernel

The proof is found in Section C. Notably, when $\mu$ is the discrete measure $Q_{n}=\sum_{i=1}^{n}q_{n}(x_{i})\delta_{x_{i}}$ , the KSD reduces to evaluating each $k_{0}^{j}$ at pairs of support points as $w_{j}=\sqrt{\sum_{i,i^{\prime}=1}^{n}q_{n}(x_{i})k_{0}^{j}(x_{i},x_{i^{\prime}})q_{n}(x_{i^{\prime}})},$ a computation which is easily parallelized over sample pairs and coordinates $j$ .

Our Stein set choice was motivated by the work of Oates et al. (2016b) who used the sum of Stein kernels $k_{0}=\sum_{j=1}^{d}k_{0}^{j}$ to develop nonparametric control variates. Each term $w_{j}$ in Proposition 2 can also be viewed as an instance of the maximum mean discrepancy (MMD) (Gretton et al., 2012) between $\mu$ and $P$ measured with respect to the Stein kernel $k_{0}^{j}$ . In standard uses of MMD, an arbitrary kernel function is selected, and one must be able to compute expectations of the kernel function under $P$ . Here, this requirement is satisfied automatically, since our induced kernels are chosen to have mean zero under $P$ .

For clarity we will focus on the specific kernel Stein set choice $\mathcal{G}_{k}\triangleq\mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}}$ for the remainder of the paper, but our results extend directly to KSDs based on any $\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|$ , since all KSDs are equivalent in a strong sense:

Under the assumptions of Proposition 2, there are constants $c_{d},c_{d}^{\prime}>0$ depending only on $d$ and $\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|$ such that $c_{d}\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}})\leq\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}}})\leq c_{d}^{\prime}\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}})$ .

2 Lower bounding the kernel Stein discrepancy

We next aim to establish conditions under which the KSD $\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0$ only if $\mu_{m}\Rightarrow P$ (Desideratum (ii)). Recently, Gorham et al. (2016) showed that the Langevin graph Stein discrepancy dominates convergence in distribution whenever $P$ belongs to the class $\mathcal{P}$ of distantly dissipative distributions with Lipschitz score function $b$ :

A distribution $P$ is distantly dissipative if $\kappa_{0}\triangleq\lim\inf_{r\to\infty}\kappa(r)>0$ for

Examples of distributions in $\mathcal{P}$ include finite Gaussian mixtures with common covariance and all distributions strongly log-concave outside of a compact set, including Bayesian linear, logistic, and Huber regression posteriors with Gaussian priors (see Gorham et al., 2016, Section 4). Moreover, when $d=1$ , membership in $\mathcal{P}$ is sufficient to provide a lower bound on the KSD for most common kernels including the Gaussian, Matérn, and inverse multiquadric kernels.

Suppose that $P\in\mathcal{P}$ and $k(x,y)=\Phi(x-y)$ for $\Phi\in C^{2}$ with a non-vanishing generalized Fourier transform. If $d=1$ , then $\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0$ only if $\mu_{m}\Rightarrow P$ .

The proof in Section E provides a lower bound on the KSD in terms of an IPM known to dominate weak convergence. However, our next theorem shows that in higher dimensions $\mathcal{S}({Q_{n}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})$ can converge to without the sequence $(Q_{n})_{n\geq 1}$ converging to any probability measure. This deficiency occurs even when the target is Gaussian.

Suppose $k\in C_{b}^{(1,1)}$ and define the kernel decay rate

If $d\geq 3$ , $P=\mathcal{N}(0,I_{d})$ , and $\gamma(r)=o(r^{-\alpha})$ for $\alpha\triangleq(\frac{1}{2}-\frac{1}{d})^{-1}$ , then $\mathcal{S}({Q_{n}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0$ does not imply $Q_{n}\Rightarrow P$ .

Theorem 6 implies that KSDs based on the commonly used Gaussian kernel, Matérn kernel, and compactly supported kernels of Wendland (2004, Theorem 9.13) all fail to detect non-convergence when $d\geq 3$ . In addition, KSDs based on the inverse multiquadric kernel ( $k(x,y)=(c^{2}+\mathopen{}\mathclose{{}\left\|{x-y}}\right\|_{2}^{2})^{\beta}$ ) for $\beta<-1$ fail to detect non-convergence for any $d>2\beta/(\beta+1)$ . The proof in Section F shows that the violating sample sequences $(Q_{n})_{n\geq 1}$ are simple to construct, and we provide an empirical demonstration of this failure to detect non-convergence in Section 4.

The failure of the KSDs in Theorem 6 can be traced to their inability to enforce uniform tightness. A sequence of probability measures $(\mu_{m})_{m\geq 1}$ is uniformly tight if for every $\epsilon>0$ , there is a finite number $R(\epsilon)$ such that $\lim\sup_{m}\mu_{m}(\mathopen{}\mathclose{{}\left\|{{X}}}\right\|_{2}>R(\epsilon))\leq\epsilon$ . Uniform tightness implies that no mass in the sequence of probability measures escapes to infinity. When the kernel $k$ decays more rapidly than the score function grows, the KSD ignores excess mass in the tails and hence can be driven to zero by a non-tight sequence of increasingly diffuse probability measures. The following theorem demonstrates uniform tightness is the missing piece to ensure weak convergence.

Suppose that $P\in\mathcal{P}$ and $k(x,y)=\Phi(x-y)$ for $\Phi\in C^{2}$ with a non-vanishing generalized Fourier transform. If $(\mu_{m})_{m\geq 1}$ is uniformly tight, then $\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0$ only if $\mu_{m}\Rightarrow P$ .

Our proof in Section G explicitly lower bounds the KSD $\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})$ in terms of the bounded Lipschitz metric $d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}(\mu,P)$ , which exactly metrizes weak convergence.

Ideally, when a sequence of probability measures is not uniformly tight, the KSD would reflect this divergence in its reported value. To achieve this, we consider the inverse multiquadric (IMQ) kernel $k(x,y)=(c^{2}+\mathopen{}\mathclose{{}\left\|{x-y}}\right\|_{2}^{2})^{\beta}\,$ for some $\beta<0$ and $c>0$ . While KSDs based on IMQ kernels fail to determine convergence when $\beta<-1$ (by Theorem 6), our next theorem shows that they automatically enforce tightness and detect non-convergence whenever $\beta\in(-1,0)$ .

Suppose $P\in\mathcal{P}$ and $k(x,y)=(c^{2}+\mathopen{}\mathclose{{}\left\|{x-y}}\right\|_{2}^{2})^{\beta}$ for $c>0$ and $\beta\in(-1,0)$ . If $\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0$ , then $\mu_{m}\Rightarrow P$ .

The proof in Section H provides a lower bound on the KSD in terms of the bounded Lipschitz metric $d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}(\mu,P)$ . The success of the IMQ kernel over other common characteristic kernels can be attributed to its slow decay rate. When $P\in\mathcal{P}$ and the IMQ exponent $\beta>-1$ , the function class $\mathcal{T}_{P}{\mathcal{G}_{k}}$ contains unbounded (coercive) functions. These functions ensure that the IMQ KSD $\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})$ goes to only if $(\mu_{m})_{m\geq 1}$ is uniformly tight.

3 Upper bounding the kernel Stein discrepancy

The usual goal in upper bounding the Stein discrepancy is to provide a rate of convergence to $P$ for particular approximating sequences $(\mu_{m})_{m=1}^{\infty}$ . Because we aim to directly compute the KSD for arbitrary samples $Q_{n}$ , our chief purpose in this section is to ensure that the KSD $\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})$ will converge to zero when $\mu_{m}$ is converging to $P$ (Desideratum (i)).

Experiments

We next conduct an empirical evaluation of the KSD quality measures recommended by our theory, recording all timings on an Intel Xeon CPU E5-2650 v2 @ 2.60GHz. Throughout, we will refer to the KSD with IMQ base kernel $k(x,y)=(c^{2}+\mathopen{}\mathclose{{}\left\|{x-y}}\right\|_{2}^{2})^{\beta}$ , exponent $\beta=-\frac{1}{2}$ , and $c=1$ as the IMQ KSD. Code reproducing all experiments can be found on the Julia (Bezanson et al., 2014) package site https://jgorham.github.io/SteinDiscrepancy.jl/.

Our first, simple experiment is designed to illustrate several properties of the IMQ KSD and to compare its behavior with that of two preexisting discrepancy measures, the Wasserstein distance $d_{\mathcal{W}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}}}$ , which can be computed for simple univariate targets (Vallender, 1974), and the spanner graph Stein discrepancy of Gorham & Mackey (2015). We adopt a bimodal Gaussian mixture with $p(x)\propto e^{-\frac{1}{2}\mathopen{}\mathclose{{}\left\|{x+\Delta e_{1}}}\right\|_{2}^{2}}+e^{-\frac{1}{2}\mathopen{}\mathclose{{}\left\|{x-\Delta e_{1}}}\right\|_{2}^{2}}$ and $\Delta=1.5$ as our target $P$ and generate a first sample point sequence i.i.d. from the target and a second sequence i.i.d. from one component of the mixture, $\mathcal{N}(-\Delta e_{1},I_{d})$ . As seen in the left panel of Figure 1 where $d=1$ , the IMQ KSD decays at an $n^{-0.51}$ rate when applied to the first $n$ points in the target sample and remains bounded away from zero when applied to the to the single component sample. This desirable behavior is closely mirrored by the Wasserstein distance and the graph Stein discrepancy.

The middle panel of Figure 1 records the time consumed by the graph and kernel Stein discrepancies applied to the i.i.d. sample points from $P$ . Each method is given access to $d$ cores when working in $d$ dimensions, and we use the released code of Gorham & Mackey (2015) with the default Gurobi 6.0.4 linear program solver for the graph Stein discrepancy. We find that the two methods have nearly identical runtimes when $d=1$ but that the KSD is $10$ to $1000$ times faster when $d=4$ . In addition, the KSD is straightforwardly parallelized and does not require access to a linear program solver, making it an appealing practical choice for a quality measure.

2 The importance of kernel choice

Theorem 6 established that kernels with rapidly decaying tails yield KSDs that can be driven to zero by off-target sample sequences. Our next experiment provides an empirical demonstration of this issue for a multivariate Gaussian target $P=\mathcal{N}(0,I_{d})$ and KSDs based on the popular Gaussian ( $k(x,y)=e^{-\mathopen{}\mathclose{{}\left\|{x-y}}\right\|_{2}^{2}/2}$ ) and Matérn ( $k(x,y)=(1+\sqrt{3}\mathopen{}\mathclose{{}\left\|{x-y}}\right\|_{2})e^{-\sqrt{3}\mathopen{}\mathclose{{}\left\|{x-y}}\right\|_{2}}$ ) radial kernels.

Following the proof Theorem 6 in Section F, we construct an off-target sequence $(Q_{n})_{n\geq 1}$ that sends $\mathcal{S}({Q_{n}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})$ to for these kernel choices whenever $d\geq 3$ . Specifically, for each $n$ , we let $Q_{n}=\frac{1}{n}\sum_{i=1}^{n}\delta_{x_{i}}$ where, for all $i$ and $j$ , $\mathopen{}\mathclose{{}\left\|{x_{i}}}\right\|_{2}\leq 2n^{1/d}\log n$ and $\mathopen{}\mathclose{{}\left\|{x_{i}-x_{j}}}\right\|_{2}\geq 2\log n$ . To select these sample points, we independently sample candidate points uniformly from the ball $\{x:\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}\leq 2n^{1/d}\log n\}$ , accept any points not within $2\log n$ Euclidean distance of any previously accepted point, and terminate when $n$ points have been accepted.

For various dimensions, Figure 2 displays the result of applying each KSD to the off-target sequence $(Q_{n})_{n\geq 1}$ and an “on-target” sequence of points sampled i.i.d. from $P$ . For comparison, we also display the behavior of the IMQ KSD which provably controls tightness and dominates weak convergence for this target by Theorem 8. As predicted, the Gaussian and Matérn KSDs decay to under the off-target sequence and decay more rapidly as the dimension $d$ increases; the IMQ KSD remains bounded away from .

3 Selecting sampler hyperparameters

To assess the suitability of the KSD for tolerance parameter selection, we take as our target $P$ the bimodal Gaussian mixture model posterior of (Welling & Teh, 2011). For an array of $\epsilon$ values, we generated $50$ independent approximate slice sampling chains with batch size $5$ , each with a budget of $148000$ likelihood evaluations, and plotted the median IMQ KSD and effective sample size (ESS, a standard sample quality measure based on asymptotic variance (Brooks et al., 2011)) in Figure 3. ESS, which does not detect Markov chain bias, is maximized at the largest hyperparameter evaluated ( $\epsilon=10^{-1}$ ), while the KSD is minimized at an intermediate value ( $\epsilon=10^{-2}$ ). The right panel of Figure 3 shows representative samples produced by several settings of $\epsilon$ . The sample produced by the ESS-selected chain is significantly overdispersed, while the sample from $\epsilon=0$ has minimal coverage of the second mode due to its small sample size. The sample produced by the KSD-selected chain best resembles the posterior target. Using $4$ cores, the longest KSD computation with $n=10^{3}$ sample points took $0.16s$ .

4 Selecting samplers

Ahn et al. (2012) developed two biased MCMC samplers for accelerated posterior inference, both called Stochastic Gradient Fisher Scoring (SGFS). In the full version of SGFS (termed SGFS-f), a $d\times d$ matrix must be inverted to draw each new sample point. Since this can be costly for large $d$ , the authors developed a second sampler (termed SGFS-d) in which only a diagonal matrix must be inverted to draw each new sample point. Both samplers can be viewed as discrete-time approximations to a continuous-time Markov process that has the target $P$ as its stationary distribution; however, because no Metropolis-Hastings correction is employed, neither sampler has the target as its stationary distribution. Hence we will use the KSD – a quality measure that accounts for asymptotic bias – to evaluate and choose between these samplers.

Specifically, we evaluate the SGFS-f and SGFS-d samples produced in (Ahn et al., 2012, Sec. 5.1). The target $P$ is a Bayesian logistic regression with a flat prior, conditioned on a dataset of $10^{4}$ MNIST handwritten digit images. From each image, the authors extracted $50$ random projections of the raw pixel values as covariates and a label indicating whether the image was a $7$ or a $9$ . After discarding the first half of sample points as burn-in, we obtained regression coefficient samples with $5\times 10^{4}$ points and $d=51$ dimensions (including the intercept term). Figure 4 displays the IMQ KSD applied to the first $n$ points in each sample. As external validation, we follow the protocol of Ahn et al. (2012) to find the bivariate marginal means and 95% confidence ellipses of each sample that align best and worst with those of a surrogate ground truth sample obtained from a Hamiltonian Monte Carlo chain with $10^{5}$ iterates. Both the KSD and the surrogate ground truth suggest that the moderate speed-up provided by SGFS-d ( $0.0017s$ per sample vs. $0.0019s$ for SGFS-f) is outweighed by the significant loss in inferential accuracy. However, the KSD assessment does not require access to an external trustworthy ground truth sample. The longest KSD computation took $400s$ using $16$ cores.

5 Beyond sample quality comparison

While our investigation of the KSD was motivated by the desire to develop practical, trustworthy tools for sample quality comparison, the kernels recommended by our theory can serve as drop-in replacements in other inferential tasks that make use of kernel Stein discrepancies.

5.2 Improving sample quality

Related and future work

The score statistic of Fan et al. (2006) and the Gibbs sampler convergence criteria of Zellner & Min (1995) detect certain forms of non-convergence but fail to detect others due to the finite number of test functions tested. For example, when $P=\mathcal{N}(0,1)$ , the score statistic (Fan et al., 2006) only monitors sample means and variances.

While assessing sample quality was our chief objective, our results may hold benefits for other applications that make use of Stein discrepancies or Stein operators. In particular, our kernel recommendations could be incorporated into the Monte Carlo control functionals framework of Oates et al. (2016b); Oates & Girolami (2015), the variational inference approaches of Liu & Wang (2016); Liu & Feng (2016); Ranganath et al. (2016), and the Stein generative adversarial network approach of Wang & Liu (2016).

In the future, we aim to leverage stochastic, low-rank, and sparse approximations of the kernel matrix and score function to produce KSDs that scale better with the number of sample and data points while still guaranteeing control over weak convergence. A reader may also wonder for which distributions outside of $\mathcal{P}$ the KSD dominates weak convergence. The following theorem, proved in Section J, shows that no KSD with a $C_{0}$ kernel dominates weak convergence when the target has a bounded score function.

If $\nabla\log p$ is bounded and $k\in C^{(1,1)}_{0}$ , then $\mathcal{S}({Q_{n}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0$ does not imply $Q_{n}\Rightarrow P$ .

Acknowledgments

We thank Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton for sharing their hypothesis testing code, Qiang Liu for sharing his black-box importance sampling code, and Sebastian Vollmer and Andrew Duncan for many helpful conversations regarding this work. This material is based upon work supported by the National Science Foundation DMS RTG Grant No. 1501767, the National Science Foundation Graduate Research Fellowship under Grant No. DGE-114747, and the Frederick E. Terman Fellowship.

References

Appendix A Additional appendix notation

Appendix B Proof of Proposition 1: Zero mean test functions

Appendix C Proof of Proposition 2: KSD closed form

Hence, we may apply the representation (C) and exchange expectation and RKHS inner product to discover

Appendix D Proof of Proposition 3: Stein set equivalence

By Proposition 2, $\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}})=\mathopen{}\mathclose{{}\left\|{w}}\right\|$ and $\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}}})=\mathopen{}\mathclose{{}\left\|{w}}\right\|_{2}$ for some vector $w$ , and by (Bachman & Narici, 1966, Thm. 8.7), there exist constants $c_{d},c_{d}^{\prime}>0$ depending only on $d$ and $\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|$ such that $c_{d}\mathopen{}\mathclose{{}\left\|{w}}\right\|\leq\mathopen{}\mathclose{{}\left\|{w}}\right\|_{2}\leq c_{d}^{\prime}\mathopen{}\mathclose{{}\left\|{w}}\right\|$ .

Appendix E Proof of Theorem 5: Univariate KSD detects non-convergence

Remarks An explicit value for the Stein factor $\mathcal{M}_{P}$ can be derived from the proof in Section E.1 and the results of Gorham et al. (2016). After optimizing the bound $d_{\mathcal{H}}(\mu,P)$ over $\epsilon>0$ , the Gaussian, inverse multiquadric, and Matérn ( $v>1$ ) kernels achieve rates of $O(1/\sqrt{\log(\frac{1}{\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})})})$ , $O(1/\log(\frac{1}{\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})}))$ , and $O(\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})^{1/(v+1/2)})$ respectively as $\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0$ .

In particular, since $\hat{\Phi}$ is non-vanishing, $F(t)$ is finite for all $t$ . If $\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0$ , then, for any fixed $\epsilon>0$ , we have $\lim\sup_{m\to\infty}d_{\mathcal{H}}(\mu_{m},P)\leq\epsilon$ . Taking $\epsilon\to 0$ shows that $\lim_{m\to\infty}d_{\mathcal{H}}(\mu_{m},P)\to 0$ , which implies that $\mu_{m}\Rightarrow P$ .

Fix any probability measure $\mu$ and $h\in\mathcal{H}$ , and define the tilting function $\Xi(x)\triangleq(1+\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{2})^{1/2}$ . The proof will proceed in three steps.

Moreover, we have $\nabla_{x}k_{b}(x,y)=(y-x)k_{b}(x,y)$ and $\nabla r(x)=-xr(x)$ . Thus for any $x$ , by (Steinwart & Christmann, 2008, Corollary 4.36) we have

We next show that there is a solution to the $P$ Stein equation

In our final step, we will use the following lemma, proved in Section E.2, to show that we can approximate $\mathcal{T}_{P}{g_{h}}$ arbitrarily well by a function in a scaled copy of $\mathcal{T}_{P}{\mathcal{G}_{k}}$ .

where $F(t)\triangleq\sup_{\mathopen{}\mathclose{{}\left\|{\omega}}\right\|_{\infty}\leq t}\hat{\Phi}(\omega)^{-1}$ .

Taking a supremum over $h\in\mathcal{H}$ yields the advertised result.

E.2 Proof of Lemma 12: Stein approximations with finite RKHS norm

Thus for any $\delta>0$ , by the triangle inequality, we have

Thus it remains to bound the RKHS norm of $g_{\delta}$ . By the Convolution Theorem (Wendland, 2004, Thm. 5.16), we have $\hat{g_{\delta}}(\omega)=(2\pi)^{d/2}\hat{g}(\omega)\hat{\rho_{\delta}}(\omega)$ , and so the squared norm of $g_{\delta}$ in $\mathcal{K}_{k}^{d}$ is equal to (Wendland, 2004, Thm. 10.21)

Appendix F Proof of Theorem 6: KSD fails with light kernel tails

Since the target distribution $P$ is $\mathcal{N}(0,I_{d})$ , the associated gradient of the log density is $b(x)=-x$ . Thus

Since $k\in C^{(2,2)}_{b}$ , $\gamma(0)<\infty$ . Thus by Cauchy-Schwarz, the first term of (10) is upper bounded by

To handle the second term of (10), we will use the assumed bound on $k$ and its derivatives from $\gamma$ . For any fixed $i\neq i^{\prime}$ , by the triangle inequality, Cauchy-Schwarz, and fact $\gamma$ is monotonically decreasing we have

Our upper bounds on the Stein discrepancy (10) and our choice of $\Delta_{n}$ now imply that

Moreover, since $\gamma(r)=o(r^{-\alpha})$ , we have $\gamma^{-1}(1/n)=o(n^{1/\alpha})=o(n^{1/2-1/d})$ , and hence $\mathcal{S}({Q_{n}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0$ as $n\to\infty$ .

However, the sequence $(Q_{n})_{n\geq 1}$ is not uniformly tight and hence converges to no probability measure. This follows as, for each $r>0$ ,

for $m=\lceil{5^{d+1}r^{d}\rceil}$ , since at most $(r+4r/{\Delta_{m}})^{d}$ points with minimum pairwise Euclidean distance greater than $\Delta_{m}$ can fit into a ball of radius $r$ (Wainwright, 2017, Lems. 5.1 and 5.2).

Appendix G Proof of Theorem 7: KSD detects tight non-convergence

Theorem 7 will follow from the following result which upper bounds the bounded Lipschitz metric $d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}}}(\mu,P)$ in terms of the tightness rate $R(\mu,\epsilon)$ , the rate of decay of the generalized Fourier transform $\hat{\Phi}$ , and the KSD $\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})$ .

Suppose $P\in\mathcal{P}$ and let $\mu$ be a probability measure with tightness rate $R(\mu,\epsilon)$ defined in (11). Moreover, suppose the kernel $k(x,y)=\Phi(x-y)$ with $\Phi\in C^{2}$ and $F(t)\triangleq\sup_{\mathopen{}\mathclose{{}\left\|{\omega}}\right\|_{\infty}\leq t}\hat{\Phi}(\omega)^{-1}$ finite for all $t>0$ . Then there exists a constant $\mathcal{M}_{P}$ such that, for all $\rho,\epsilon,\delta>0$ ,

where $\theta_{d}\triangleq d\int_{0}^{1}\operatorname{exp}\mathopen{}\mathclose{{}\left(-1/(1-r^{2})}\right)r^{d-1}\,dr$ for $d>0$ (and $\theta_{0}\triangleq e^{-1}$ ), $V_{d}$ is the volume of the unit Euclidean ball in dimension $d$ , and

Remarks An explicit value for the Stein factor $\mathcal{M}_{P}$ can be derived from the proof in Section G.1 and the results of Gorham et al. (2016). When bounds on $R$ and $F$ are known, the final expression can be optimized over $\epsilon,\rho$ and $\delta$ to produce rates of convergence in $d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}}}$ .

Fix any $\delta>0$ , and consider a sequence of probability measures $(\mu_{m})_{m\geq 1}$ that is uniformly tight. We must have $\lim\sup_{m}R(\mu_{m},\epsilon)<\infty$ for all $\epsilon>0$ . Moreover, since $\hat{\Phi}$ is non-vanishing, $F(t)$ is finite for all $t$ . Thus if $\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0$ , then for any fixed $\epsilon<1$ and $\rho>0$ ,

Taking $\rho,\epsilon\to 0$ yields $d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}}}(\mu_{m},P)\to 0$ .

Moreover, by the argument (9) employed in the proof of Lemma 12, we have

showing that $\mathcal{T}_{P}{g_{\rho}}$ closely approximates $\mathcal{T}_{P}{g}$ .

We will next truncate $g_{\rho}$ using the following lemma proved in Section G.2.

where $\theta_{d}\triangleq d\int_{0}^{1}\operatorname{exp}\mathopen{}\mathclose{{}\left(-1/(1-r^{2})}\right)r^{d-1}\,dr$ for $d>0$ and $\theta_{0}\triangleq e^{-1}$ .

Fix any $\epsilon,\delta>0$ , and let $K=\mathcal{B}(0,R(\mu,\epsilon))$ with $R(\mu,\epsilon)$ defined in (11). This set is compact since our sequence is uniformly tight. Hence, we may define $g_{K,\delta}(x)\triangleq g_{\rho}(x)\,v_{K,\delta}(x)$ as a smooth, truncated version of $g_{\rho}$ based on Lemma 14. Since

properties (12) and (13) imply that $(\mathcal{T}_{P}{g_{\rho}})({x})=(\mathcal{T}_{P}{g_{K,\delta}})({x})$ for all $x\in K$ , $(\mathcal{T}_{P}{g_{K,\delta}})({x})=0$ when $x\notin K^{2\delta}$ , and

for $x\in K^{2\delta}\setminus K$ by Cauchy-Schwarz. In addition

The advertised result follows by substituting $\textnormal{Vol}(\mathcal{B}(0,r))=V_{d}r^{d}$ and taking the supremum over all $h\in BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}$ .

G.2 Proof of Lemma 14: Smoothed indicator function

where the normalizing constant is given by

for $V_{d}$ being the volume of the unit Euclidean ball in $d$ dimensions (Baker, 1999).

where we used the substitution $z\triangleq(x-y)/\delta$ . By differentiating $\psi$ , using (Baker, 1999) with the substitution $r=\mathopen{}\mathclose{{}\left\|{z}}\right\|_{2}$ , and employing integration by parts we have

Appendix H Proof of Theorem 8: IMQ KSD detects non-convergence

We first use the following theorem to upper bound the bounded Lipschitz metric $d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}(\mu,P)$ in terms of the KSD $\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})$ .

Suppose $P\in\mathcal{P}$ and $k(x,y)=(c^{2}+\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{2})^{\beta}$ for $c>0$ , and $\beta\in(-1,0)$ . Choose any $\alpha\in(0,\frac{1}{2}(\beta+1))$ and $a>\frac{1}{2}c$ . Then there exist an $\epsilon_{0}>0$ and a constant $\mathcal{M}_{P}$ such that, for all $\mu$ ,

for $\theta_{d},\theta_{d-1},V_{d},$ and $c_{\rho,\delta}$ defined in Theorem 13, the function $\mathcal{D}$ defined in (H.2), the function $\zeta$ defined in (19), and

where $K_{v}$ is the modified Bessel function of the third kind. Moreover, if $\lim\sup_{m}\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})<\infty$ then $(\mu_{m})_{m\geq 1}$ is uniformly tight.

Remark The Stein factor $\mathcal{M}_{P}$ can be determined explicitly based on the proof of Theorem 15 in Section H.1 and the results of Gorham et al. (2016).

Note that $F_{IMQ}(t)$ is finite for all $t>0$ , so fix any $\epsilon\in[0,\epsilon_{0})$ and $\delta,\rho>0$ . If $\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0$ , then $\lim\sup_{m}d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}(\mu_{m},P)\leq\rho\sqrt{d}(1+M_{1}(b)\mathcal{M}_{P})+(3+\epsilon+(M_{1}(b)\rho\sqrt{d}+{\textstyle\frac{\delta^{-1}d\theta_{d-1}}{\theta_{d}}})\mathcal{M}_{P})\epsilon$ . Thus taking $\epsilon,\rho\to 0$ yields $d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}(\mu_{m},P)\to 0$ . Since $d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}(\mu_{m},P)\to 0$ only if $\mu_{m}\Rightarrow P$ , the statement of Theorem 8 follows.

Fix any $\alpha\in(0,\frac{1}{2}(\beta+1))$ and $a>\frac{1}{2}c$ . Then there is some $\mathring{g}\in\mathcal{G}_{k}$ such that $\mathcal{T}_{P}{\mathring{g}}$ is bounded below by a constant $\zeta(a,c,\alpha,\beta)$ and has a growth rate of $\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{2\alpha}$ as $\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}\to\infty$ . Such a function exists by the following lemma, proved in Section H.2.

where the function $\mathcal{D}$ is defined in (H.2) and $R_{0}\triangleq\inf\{r>0\,|\,\kappa(r^{\prime})\geq 0,\forall r^{\prime}\geq r\}$ . Moreover, $\lim\inf\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{-2\alpha}(\mathcal{T}_{P}{\mathring{g}})({x})\geq\frac{\alpha}{\mathcal{D}(a,c,\alpha,\beta)^{1/2}}\kappa_{0}$ as $\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}\to\infty$ .

Our next lemma connects the growth rate of $\mathcal{T}_{P}{\mathring{g}}$ to the tightness rate of a probability measure evaluated with the Stein discrepancy. Its proof is found in Section H.3.

In particular, if $\lim\sup_{m}\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})$ is finite, $(\mu_{m})_{m\geq 1}$ is uniformly tight.

We can thus plug the tightness rate estimate of Lemma 17 applied to the function $\mathring{g}$ into Theorem 13. Since $\mathopen{}\mathclose{{}\left\|{w}}\right\|_{\infty}\leq t$ implies $\mathopen{}\mathclose{{}\left\|{w}}\right\|_{2}\leq\sqrt{d}t$ , we can use the formula for the generalized Fourier transform of the IMQ kernel in (20) to see $\hat{\Phi}(\omega)$ is monotonically decreasing in $\mathopen{}\mathclose{{}\left\|{w}}\right\|_{2}$ to establish (18). By taking $\eta\to\frac{\alpha}{\mathcal{D}(a,c,\alpha,\beta)^{1/2}}\kappa_{0}$ we obtain (15).

To prove (17), notice that $F_{IMQ}(t)=O(e^{(c\sqrt{d}+\lambda)t})$ as $t\to\infty$ for any $\lambda>0$ by (21). Hence, by choosing $\epsilon=\rho=((c\sqrt{d}+1)/\log({\textstyle\frac{1}{\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})}}))^{1/2}$ and fixing any $\delta>0$ we obtain the advertised decay rate as $\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0$ . The uniform tightness conclusion follows from Lemma 17.

H.2 Proof of Lemma 16: Generalized multiquadric Stein sets yield coercive functions

By (Wendland, 2004, Thm. 8.15), $\Phi_{c,\beta}$ has a generalized Fourier transform of order $\max(0,\lceil\beta\rceil)$ given by

Now fix any $a>c/2$ and $\alpha\in(0,\frac{1}{2}(\beta+1))$ , and consider the functions $g_{j}(x)=\nabla_{x_{j}}\Phi_{a,\alpha}(x)=2\alpha x_{j}(a^{2}+\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{2})^{\alpha-1}$ . We will show that $g=(g_{1},\dots,g_{d})\in\mathcal{K}_{k}^{d}$ . Note that $\hat{g_{j}}(\omega)=(i\omega_{j})\widehat{\Phi_{a,\alpha}}(\omega)$ . Using (Wendland, 2004, Thm. 10.21), we know $\mathopen{}\mathclose{{}\left\|{g_{j}}}\right\|_{\mathcal{K}_{k}}=\mathopen{}\mathclose{{}\left\|{\hat{g_{j}}/\sqrt{\widehat{\Phi_{c,\beta}}}}}\right\|_{L^{2}}$ , and thus $\mathopen{}\mathclose{{}\left\|{g}}\right\|_{\mathcal{K}_{k}^{d}}=\mathopen{}\mathclose{{}\left\|{\hat{g}/\sqrt{\widehat{\Phi_{c,\beta}}}}}\right\|_{L^{2}}$ . Hence

where $V_{d}$ is the volume of the unit ball in $d$ -dimensions and in the last step we used the substitution $r=\mathopen{}\mathclose{{}\left\|{\omega}}\right\|_{2}$ (Baker, 1999). Since $\alpha<\frac{1}{2}(\beta+1)$ and the function $r\mapsto r^{t}$ is integrable around the origin when $t>-1$ , we can bound the integral above by

We can apply the technique to the other integral, yielding

Since $c-2a<0$ , we can upper bound the last integral above by the quantity

where $\Gamma(s,x)\triangleq\int_{x}^{\infty}t^{s-1}e^{-t}\,dt$ is the upper incomplete gamma function. Hence, the function $g$ belongs to $\mathcal{K}_{k}^{d}$ with norm upper bounded by $\mathcal{D}(a,b,\alpha,\beta)^{1/2}$ where

Now define $\mathring{g}=-\mathcal{D}(a,c,\alpha,\beta)^{-1/2}g$ so that $\mathring{g}\in\mathcal{G}_{k}$ . We will lower bound the growth rate of $\mathcal{T}_{P}{\mathring{g}}$ and also construct a uniform lower bound. Note

The latter two terms are both uniformly bounded in $x$ . By the distant dissipativity assumption, there is some $\kappa>0$ such that $\lim\sup_{\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}\to\infty}\frac{1}{\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{2}}\langle{b(x)},{x}\rangle\leq-\frac{1}{2}\kappa$ . Thus the first term of (23) grows at least at the rate $\frac{1}{2}\kappa\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{2\alpha}$ . This assures $\lim\inf\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{-2\alpha}(\mathcal{T}_{P}{\mathring{g}})({x})\geq\frac{\alpha}{\mathcal{D}(a,c,\alpha,\beta)^{1/2}}\kappa$ as $\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}\to\infty$ .

Moreover, because $b$ is Lipschitz, we have

Hence for any $x\in\mathcal{B}(0,R_{0})$ , we must have $-\langle{b(x)},{x}\rangle\geq-M_{1}(b)R_{0}^{2}-\mathopen{}\mathclose{{}\left\|{b(0)}}\right\|_{2}R_{0}$ . By choice of $R_{0}$ , for all $x\notin\mathcal{B}(0,R_{0})$ , the distant dissipativity assumption implies $-\langle{b(x)},{x}\rangle\geq 0$ . Hence applying this to (23) shows that $\mathcal{T}_{P}{\mathring{g}}$ is uniformly lower bounded by $\zeta(a,c,\alpha,\beta)$ .

H.3 Proof of Lemma 17: Coercive functions yield tightness

Thus we see that $\mu(\mathopen{}\mathclose{{}\left\|{{X}}}\right\|_{2}\geq r_{\epsilon})\leq\epsilon$ whenever $\epsilon\geq(\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})-\zeta)/\gamma(r_{\epsilon})$ . This implies that for sufficiently small $\epsilon$ , if

we must have $\mu(\mathopen{}\mathclose{{}\left\|{{X}}}\right\|_{2}\geq r_{\epsilon})\leq\epsilon$ . Hence whenever $\lim\sup_{m}\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})$ is bounded, we must have $(\mu_{m})_{m\geq 1}$ is uniformly tight as $\lim\sup_{m}R(\mu_{m},\epsilon)$ is finite.

Appendix I Proof of Proposition 9: KSD detects convergence

We will first state and prove a useful lemma.

To handle the other term above, notice that by Cauchy-Schwarz and the fact that $\min(a,b)\leq\sqrt{ab}$ for $a,b\geq 0$ ,

The stated inequality now follows by taking the infimum of these bounds over all joint distributions $(X,Z)$ with $X\sim\mu$ and $Z\sim P$ . ∎

Appendix J Proof of Theorem 10: KSD fails for bounded scores

We can express $k_{0}(x,y)\triangleq\sum_{j=1}^{d}k_{0}^{j}(x,y)$ as

Let $\gamma$ be the kernel decay rate defined in the statement of Theorem 6. Then as $k\in C^{(1,1)}_{0}$ , we must have $\gamma(0)<\infty$ and $\lim_{r\to\infty}\gamma(r)=0$ . By the triangle inequality

We now handle the second term of (24). By repeated use of Cauchy-Schwarz we have

By assumption, $\gamma(r)\to 0$ as $r\to\infty$ . Furthermore, since the second term of (24) is upper bounded by the average of the terms $k_{0}(x_{i},x_{i}^{\prime})$ for $i\neq i^{\prime}$ , we have $\mathcal{S}({Q_{n}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0$ as $n\to\infty$ . However, $(Q_{n})_{n\geq 1}$ is not uniformly tight and hence does not converge to the probability measure $P$ .