Measuring Sample Quality with Kernels
Jackson Gorham, Lester Mackey
Introduction
This closed form represents a significant practical advantage, as no linear program solvers are necessary, and the computation of the discrepancy can be easily parallelized. However, as we will see in Section 3.2, not all kernel Stein discrepancies are suitable for our setting. In particular, in dimension , the kernel Stein discrepancies previously recommended in the literature fail to detect when a sample is not converging to the target. To address this shortcoming, we develop a theory of weak convergence for the kernel Stein discrepancies analogous to that of (Gorham & Mackey, 2015; Mackey & Gorham, 2016; Gorham et al., 2016) and design a class of kernel Stein discrepancies that provably control weak convergence for a large class of target distributions.
After formally describing our goals for measuring sample quality in Section 2, we outline our strategy, based on Stein’s method, for constructing and analyzing practical quality measures at the start of Section 3. In Section 3.1, we define our family of closed-form quality measures – the kernel Stein discrepancies (KSDs) – and establish several appealing practical properties of these measures. We analyze the convergence properties of KSDs in Sections 3.2 and 3.3, showing that previously proposed KSDs fail to detect non-convergence and proposing practical convergence-determining alternatives. Section 4 illustrates the value of convergence-determining kernel Stein discrepancies in a variety of applications, including hyperparameter selection, sampler selection, one-sample hypothesis testing, and sample quality improvement. Finally, in Section 5, we conclude with a discussion of related and future work.
Quality measures for samples
Stein’s method with kernels
Stein’s method (Stein, 1972) provides a three-step recipe for assessing convergence in distribution:
For any such Stein operator and Stein set , Gorham & Mackey (2015) defined the Stein discrepancy as
which, crucially, avoids explicit integration under .
Lower bound the Stein discrepancy by an IPM known to dominate weak convergence. This can be done once for a broad class of target distributions to ensure that whenever for a sequence of probability measures (Desideratum (ii)).
Provide an upper bound on the Stein discrepancy ensuring that under suitable convergence of to (Desideratum (i)).
While Stein’s method is principally used as a mathematical tool to prove convergence in distribution, we seek, in the spirit of (Gorham & Mackey, 2015; Gorham et al., 2016), to harness the Stein discrepancy as a practical tool for measuring sample quality. The subsections to follow develop a specific, practical instantiation of the abstract Stein’s method recipe based on reproducing kernel Hilbert spaces. An empirical analysis of the Stein discrepancies recommended by our theory follows in Section 4.
A standard, widely applicable univariate Stein operator is the density method operator (see Stein et al., 2004; Chatterjee & Shao, 2011; Chen et al., 2011; Ley et al., 2017),
Inspired by the generator method of Barbour (1988, 1990) and Götze (1991), Gorham & Mackey (2015) generalized this operator to multiple dimensions. The resulting Langevin Stein operator
With this definition, we define our kernel Stein set \mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|} as the set of vector-valued functions such that each component function belongs to and the vector of their norms \mathopen{}\mathclose{{}\left\|{g_{j}}}\right\|_{\mathcal{K}_{k}} belongs to the \mathopen{}\mathclose{{}\left\|{\cdot}}\right\|^{*} unit ball:Our analyses and algorithms support each belonging to a different RKHS , but we will not need that flexibility here.
The following result, proved in Section B, establishes that this is an acceptable domain for .
The Langevin Stein operator and kernel Stein set together define our quality measure of interest, the kernel Stein discrepancy (KSD) \mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}). When \mathopen{}\mathclose{{}\left\|{\cdot}}\right\|=\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}, this definition recovers the KSD proposed by Chwialkowski et al. (2016) and Liu et al. (2016). Our next result shows that, for any \mathopen{}\mathclose{{}\left\|{\cdot}}\right\|, the KSD admits a closed-form solution.
Suppose , and, for each , define the Stein kernel
The proof is found in Section C. Notably, when is the discrete measure , the KSD reduces to evaluating each at pairs of support points as a computation which is easily parallelized over sample pairs and coordinates .
Our Stein set choice was motivated by the work of Oates et al. (2016b) who used the sum of Stein kernels to develop nonparametric control variates. Each term in Proposition 2 can also be viewed as an instance of the maximum mean discrepancy (MMD) (Gretton et al., 2012) between and measured with respect to the Stein kernel . In standard uses of MMD, an arbitrary kernel function is selected, and one must be able to compute expectations of the kernel function under . Here, this requirement is satisfied automatically, since our induced kernels are chosen to have mean zero under .
For clarity we will focus on the specific kernel Stein set choice \mathcal{G}_{k}\triangleq\mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}} for the remainder of the paper, but our results extend directly to KSDs based on any \mathopen{}\mathclose{{}\left\|{\cdot}}\right\|, since all KSDs are equivalent in a strong sense:
Under the assumptions of Proposition 2, there are constants depending only on and \mathopen{}\mathclose{{}\left\|{\cdot}}\right\| such that c_{d}\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}})\leq\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}}})\leq c_{d}^{\prime}\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}).
2 Lower bounding the kernel Stein discrepancy
We next aim to establish conditions under which the KSD only if (Desideratum (ii)). Recently, Gorham et al. (2016) showed that the Langevin graph Stein discrepancy dominates convergence in distribution whenever belongs to the class of distantly dissipative distributions with Lipschitz score function :
A distribution is distantly dissipative if for
Examples of distributions in include finite Gaussian mixtures with common covariance and all distributions strongly log-concave outside of a compact set, including Bayesian linear, logistic, and Huber regression posteriors with Gaussian priors (see Gorham et al., 2016, Section 4). Moreover, when , membership in is sufficient to provide a lower bound on the KSD for most common kernels including the Gaussian, Matérn, and inverse multiquadric kernels.
Suppose that and for with a non-vanishing generalized Fourier transform. If , then only if .
The proof in Section E provides a lower bound on the KSD in terms of an IPM known to dominate weak convergence. However, our next theorem shows that in higher dimensions can converge to without the sequence converging to any probability measure. This deficiency occurs even when the target is Gaussian.
Suppose and define the kernel decay rate
If , , and for , then does not imply .
Theorem 6 implies that KSDs based on the commonly used Gaussian kernel, Matérn kernel, and compactly supported kernels of Wendland (2004, Theorem 9.13) all fail to detect non-convergence when . In addition, KSDs based on the inverse multiquadric kernel (k(x,y)=(c^{2}+\mathopen{}\mathclose{{}\left\|{x-y}}\right\|_{2}^{2})^{\beta}) for fail to detect non-convergence for any . The proof in Section F shows that the violating sample sequences are simple to construct, and we provide an empirical demonstration of this failure to detect non-convergence in Section 4.
The failure of the KSDs in Theorem 6 can be traced to their inability to enforce uniform tightness. A sequence of probability measures is uniformly tight if for every , there is a finite number such that \lim\sup_{m}\mu_{m}(\mathopen{}\mathclose{{}\left\|{{X}}}\right\|_{2}>R(\epsilon))\leq\epsilon. Uniform tightness implies that no mass in the sequence of probability measures escapes to infinity. When the kernel decays more rapidly than the score function grows, the KSD ignores excess mass in the tails and hence can be driven to zero by a non-tight sequence of increasingly diffuse probability measures. The following theorem demonstrates uniform tightness is the missing piece to ensure weak convergence.
Suppose that and for with a non-vanishing generalized Fourier transform. If is uniformly tight, then only if .
Our proof in Section G explicitly lower bounds the KSD in terms of the bounded Lipschitz metric d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}(\mu,P), which exactly metrizes weak convergence.
Ideally, when a sequence of probability measures is not uniformly tight, the KSD would reflect this divergence in its reported value. To achieve this, we consider the inverse multiquadric (IMQ) kernel k(x,y)=(c^{2}+\mathopen{}\mathclose{{}\left\|{x-y}}\right\|_{2}^{2})^{\beta}\, for some and . While KSDs based on IMQ kernels fail to determine convergence when (by Theorem 6), our next theorem shows that they automatically enforce tightness and detect non-convergence whenever .
Suppose and k(x,y)=(c^{2}+\mathopen{}\mathclose{{}\left\|{x-y}}\right\|_{2}^{2})^{\beta} for and . If , then .
The proof in Section H provides a lower bound on the KSD in terms of the bounded Lipschitz metric d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}(\mu,P). The success of the IMQ kernel over other common characteristic kernels can be attributed to its slow decay rate. When and the IMQ exponent , the function class contains unbounded (coercive) functions. These functions ensure that the IMQ KSD goes to only if is uniformly tight.
3 Upper bounding the kernel Stein discrepancy
The usual goal in upper bounding the Stein discrepancy is to provide a rate of convergence to for particular approximating sequences . Because we aim to directly compute the KSD for arbitrary samples , our chief purpose in this section is to ensure that the KSD will converge to zero when is converging to (Desideratum (i)).
Experiments
We next conduct an empirical evaluation of the KSD quality measures recommended by our theory, recording all timings on an Intel Xeon CPU E5-2650 v2 @ 2.60GHz. Throughout, we will refer to the KSD with IMQ base kernel k(x,y)=(c^{2}+\mathopen{}\mathclose{{}\left\|{x-y}}\right\|_{2}^{2})^{\beta}, exponent , and as the IMQ KSD. Code reproducing all experiments can be found on the Julia (Bezanson et al., 2014) package site https://jgorham.github.io/SteinDiscrepancy.jl/.
Our first, simple experiment is designed to illustrate several properties of the IMQ KSD and to compare its behavior with that of two preexisting discrepancy measures, the Wasserstein distance d_{\mathcal{W}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}}}, which can be computed for simple univariate targets (Vallender, 1974), and the spanner graph Stein discrepancy of Gorham & Mackey (2015). We adopt a bimodal Gaussian mixture with p(x)\propto e^{-\frac{1}{2}\mathopen{}\mathclose{{}\left\|{x+\Delta e_{1}}}\right\|_{2}^{2}}+e^{-\frac{1}{2}\mathopen{}\mathclose{{}\left\|{x-\Delta e_{1}}}\right\|_{2}^{2}} and as our target and generate a first sample point sequence i.i.d. from the target and a second sequence i.i.d. from one component of the mixture, . As seen in the left panel of Figure 1 where , the IMQ KSD decays at an rate when applied to the first points in the target sample and remains bounded away from zero when applied to the to the single component sample. This desirable behavior is closely mirrored by the Wasserstein distance and the graph Stein discrepancy.
The middle panel of Figure 1 records the time consumed by the graph and kernel Stein discrepancies applied to the i.i.d. sample points from . Each method is given access to cores when working in dimensions, and we use the released code of Gorham & Mackey (2015) with the default Gurobi 6.0.4 linear program solver for the graph Stein discrepancy. We find that the two methods have nearly identical runtimes when but that the KSD is to times faster when . In addition, the KSD is straightforwardly parallelized and does not require access to a linear program solver, making it an appealing practical choice for a quality measure.
2 The importance of kernel choice
Theorem 6 established that kernels with rapidly decaying tails yield KSDs that can be driven to zero by off-target sample sequences. Our next experiment provides an empirical demonstration of this issue for a multivariate Gaussian target and KSDs based on the popular Gaussian (k(x,y)=e^{-\mathopen{}\mathclose{{}\left\|{x-y}}\right\|_{2}^{2}/2}) and Matérn (k(x,y)=(1+\sqrt{3}\mathopen{}\mathclose{{}\left\|{x-y}}\right\|_{2})e^{-\sqrt{3}\mathopen{}\mathclose{{}\left\|{x-y}}\right\|_{2}}) radial kernels.
Following the proof Theorem 6 in Section F, we construct an off-target sequence that sends to for these kernel choices whenever . Specifically, for each , we let where, for all and , \mathopen{}\mathclose{{}\left\|{x_{i}}}\right\|_{2}\leq 2n^{1/d}\log n and \mathopen{}\mathclose{{}\left\|{x_{i}-x_{j}}}\right\|_{2}\geq 2\log n. To select these sample points, we independently sample candidate points uniformly from the ball \{x:\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}\leq 2n^{1/d}\log n\}, accept any points not within Euclidean distance of any previously accepted point, and terminate when points have been accepted.
For various dimensions, Figure 2 displays the result of applying each KSD to the off-target sequence and an “on-target” sequence of points sampled i.i.d. from . For comparison, we also display the behavior of the IMQ KSD which provably controls tightness and dominates weak convergence for this target by Theorem 8. As predicted, the Gaussian and Matérn KSDs decay to under the off-target sequence and decay more rapidly as the dimension increases; the IMQ KSD remains bounded away from .
3 Selecting sampler hyperparameters
To assess the suitability of the KSD for tolerance parameter selection, we take as our target the bimodal Gaussian mixture model posterior of (Welling & Teh, 2011). For an array of values, we generated independent approximate slice sampling chains with batch size , each with a budget of likelihood evaluations, and plotted the median IMQ KSD and effective sample size (ESS, a standard sample quality measure based on asymptotic variance (Brooks et al., 2011)) in Figure 3. ESS, which does not detect Markov chain bias, is maximized at the largest hyperparameter evaluated (), while the KSD is minimized at an intermediate value (). The right panel of Figure 3 shows representative samples produced by several settings of . The sample produced by the ESS-selected chain is significantly overdispersed, while the sample from has minimal coverage of the second mode due to its small sample size. The sample produced by the KSD-selected chain best resembles the posterior target. Using cores, the longest KSD computation with sample points took .
4 Selecting samplers
Ahn et al. (2012) developed two biased MCMC samplers for accelerated posterior inference, both called Stochastic Gradient Fisher Scoring (SGFS). In the full version of SGFS (termed SGFS-f), a matrix must be inverted to draw each new sample point. Since this can be costly for large , the authors developed a second sampler (termed SGFS-d) in which only a diagonal matrix must be inverted to draw each new sample point. Both samplers can be viewed as discrete-time approximations to a continuous-time Markov process that has the target as its stationary distribution; however, because no Metropolis-Hastings correction is employed, neither sampler has the target as its stationary distribution. Hence we will use the KSD – a quality measure that accounts for asymptotic bias – to evaluate and choose between these samplers.
Specifically, we evaluate the SGFS-f and SGFS-d samples produced in (Ahn et al., 2012, Sec. 5.1). The target is a Bayesian logistic regression with a flat prior, conditioned on a dataset of MNIST handwritten digit images. From each image, the authors extracted random projections of the raw pixel values as covariates and a label indicating whether the image was a or a . After discarding the first half of sample points as burn-in, we obtained regression coefficient samples with points and dimensions (including the intercept term). Figure 4 displays the IMQ KSD applied to the first points in each sample. As external validation, we follow the protocol of Ahn et al. (2012) to find the bivariate marginal means and 95% confidence ellipses of each sample that align best and worst with those of a surrogate ground truth sample obtained from a Hamiltonian Monte Carlo chain with iterates. Both the KSD and the surrogate ground truth suggest that the moderate speed-up provided by SGFS-d ( per sample vs. for SGFS-f) is outweighed by the significant loss in inferential accuracy. However, the KSD assessment does not require access to an external trustworthy ground truth sample. The longest KSD computation took using cores.
5 Beyond sample quality comparison
While our investigation of the KSD was motivated by the desire to develop practical, trustworthy tools for sample quality comparison, the kernels recommended by our theory can serve as drop-in replacements in other inferential tasks that make use of kernel Stein discrepancies.
5.2 Improving sample quality
Related and future work
The score statistic of Fan et al. (2006) and the Gibbs sampler convergence criteria of Zellner & Min (1995) detect certain forms of non-convergence but fail to detect others due to the finite number of test functions tested. For example, when , the score statistic (Fan et al., 2006) only monitors sample means and variances.
While assessing sample quality was our chief objective, our results may hold benefits for other applications that make use of Stein discrepancies or Stein operators. In particular, our kernel recommendations could be incorporated into the Monte Carlo control functionals framework of Oates et al. (2016b); Oates & Girolami (2015), the variational inference approaches of Liu & Wang (2016); Liu & Feng (2016); Ranganath et al. (2016), and the Stein generative adversarial network approach of Wang & Liu (2016).
In the future, we aim to leverage stochastic, low-rank, and sparse approximations of the kernel matrix and score function to produce KSDs that scale better with the number of sample and data points while still guaranteeing control over weak convergence. A reader may also wonder for which distributions outside of the KSD dominates weak convergence. The following theorem, proved in Section J, shows that no KSD with a kernel dominates weak convergence when the target has a bounded score function.
If is bounded and , then does not imply .
Acknowledgments
We thank Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton for sharing their hypothesis testing code, Qiang Liu for sharing his black-box importance sampling code, and Sebastian Vollmer and Andrew Duncan for many helpful conversations regarding this work. This material is based upon work supported by the National Science Foundation DMS RTG Grant No. 1501767, the National Science Foundation Graduate Research Fellowship under Grant No. DGE-114747, and the Frederick E. Terman Fellowship.
References
Appendix A Additional appendix notation
Appendix B Proof of Proposition 1: Zero mean test functions
Appendix C Proof of Proposition 2: KSD closed form
Hence, we may apply the representation (C) and exchange expectation and RKHS inner product to discover
Appendix D Proof of Proposition 3: Stein set equivalence
By Proposition 2, \mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}})=\mathopen{}\mathclose{{}\left\|{w}}\right\| and \mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}}})=\mathopen{}\mathclose{{}\left\|{w}}\right\|_{2} for some vector , and by (Bachman & Narici, 1966, Thm. 8.7), there exist constants depending only on and \mathopen{}\mathclose{{}\left\|{\cdot}}\right\| such that c_{d}\mathopen{}\mathclose{{}\left\|{w}}\right\|\leq\mathopen{}\mathclose{{}\left\|{w}}\right\|_{2}\leq c_{d}^{\prime}\mathopen{}\mathclose{{}\left\|{w}}\right\|.
Appendix E Proof of Theorem 5: Univariate KSD detects non-convergence
Remarks An explicit value for the Stein factor can be derived from the proof in Section E.1 and the results of Gorham et al. (2016). After optimizing the bound over , the Gaussian, inverse multiquadric, and Matérn () kernels achieve rates of , , and respectively as .
In particular, since is non-vanishing, is finite for all . If , then, for any fixed , we have . Taking shows that , which implies that .
Fix any probability measure and , and define the tilting function \Xi(x)\triangleq(1+\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{2})^{1/2}. The proof will proceed in three steps.
Moreover, we have and . Thus for any , by (Steinwart & Christmann, 2008, Corollary 4.36) we have
We next show that there is a solution to the Stein equation
In our final step, we will use the following lemma, proved in Section E.2, to show that we can approximate arbitrarily well by a function in a scaled copy of .
where F(t)\triangleq\sup_{\mathopen{}\mathclose{{}\left\|{\omega}}\right\|_{\infty}\leq t}\hat{\Phi}(\omega)^{-1}.
Taking a supremum over yields the advertised result.
E.2 Proof of Lemma 12: Stein approximations with finite RKHS norm
Thus for any , by the triangle inequality, we have
Thus it remains to bound the RKHS norm of . By the Convolution Theorem (Wendland, 2004, Thm. 5.16), we have , and so the squared norm of in is equal to (Wendland, 2004, Thm. 10.21)
Appendix F Proof of Theorem 6: KSD fails with light kernel tails
Since the target distribution is , the associated gradient of the log density is . Thus
Since , . Thus by Cauchy-Schwarz, the first term of (10) is upper bounded by
To handle the second term of (10), we will use the assumed bound on and its derivatives from . For any fixed , by the triangle inequality, Cauchy-Schwarz, and fact is monotonically decreasing we have
Our upper bounds on the Stein discrepancy (10) and our choice of now imply that
Moreover, since , we have , and hence as .
However, the sequence is not uniformly tight and hence converges to no probability measure. This follows as, for each ,
for , since at most points with minimum pairwise Euclidean distance greater than can fit into a ball of radius (Wainwright, 2017, Lems. 5.1 and 5.2).
Appendix G Proof of Theorem 7: KSD detects tight non-convergence
Theorem 7 will follow from the following result which upper bounds the bounded Lipschitz metric d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}}}(\mu,P) in terms of the tightness rate , the rate of decay of the generalized Fourier transform , and the KSD .
Suppose and let be a probability measure with tightness rate defined in (11). Moreover, suppose the kernel with and F(t)\triangleq\sup_{\mathopen{}\mathclose{{}\left\|{\omega}}\right\|_{\infty}\leq t}\hat{\Phi}(\omega)^{-1} finite for all . Then there exists a constant such that, for all ,
where \theta_{d}\triangleq d\int_{0}^{1}\operatorname{exp}\mathopen{}\mathclose{{}\left(-1/(1-r^{2})}\right)r^{d-1}\,dr for (and ), is the volume of the unit Euclidean ball in dimension , and
Remarks An explicit value for the Stein factor can be derived from the proof in Section G.1 and the results of Gorham et al. (2016). When bounds on and are known, the final expression can be optimized over and to produce rates of convergence in d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}}}.
Fix any , and consider a sequence of probability measures that is uniformly tight. We must have for all . Moreover, since is non-vanishing, is finite for all . Thus if , then for any fixed and ,
Taking yields d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}}}(\mu_{m},P)\to 0.
Moreover, by the argument (9) employed in the proof of Lemma 12, we have
showing that closely approximates .
We will next truncate using the following lemma proved in Section G.2.
where \theta_{d}\triangleq d\int_{0}^{1}\operatorname{exp}\mathopen{}\mathclose{{}\left(-1/(1-r^{2})}\right)r^{d-1}\,dr for and .
Fix any , and let with defined in (11). This set is compact since our sequence is uniformly tight. Hence, we may define as a smooth, truncated version of based on Lemma 14. Since
properties (12) and (13) imply that for all , when , and
for by Cauchy-Schwarz. In addition
The advertised result follows by substituting and taking the supremum over all h\in BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}.
G.2 Proof of Lemma 14: Smoothed indicator function
where the normalizing constant is given by
for being the volume of the unit Euclidean ball in dimensions (Baker, 1999).
where we used the substitution . By differentiating , using (Baker, 1999) with the substitution r=\mathopen{}\mathclose{{}\left\|{z}}\right\|_{2}, and employing integration by parts we have
Appendix H Proof of Theorem 8: IMQ KSD detects non-convergence
We first use the following theorem to upper bound the bounded Lipschitz metric d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}(\mu,P) in terms of the KSD .
Suppose and k(x,y)=(c^{2}+\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{2})^{\beta} for , and . Choose any and . Then there exist an and a constant such that, for all ,
for and defined in Theorem 13, the function defined in (H.2), the function defined in (19), and
where is the modified Bessel function of the third kind. Moreover, if then is uniformly tight.
Remark The Stein factor can be determined explicitly based on the proof of Theorem 15 in Section H.1 and the results of Gorham et al. (2016).
Note that is finite for all , so fix any and . If , then \lim\sup_{m}d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}(\mu_{m},P)\leq\rho\sqrt{d}(1+M_{1}(b)\mathcal{M}_{P})+(3+\epsilon+(M_{1}(b)\rho\sqrt{d}+{\textstyle\frac{\delta^{-1}d\theta_{d-1}}{\theta_{d}}})\mathcal{M}_{P})\epsilon. Thus taking yields d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}(\mu_{m},P)\to 0. Since d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}(\mu_{m},P)\to 0 only if , the statement of Theorem 8 follows.
Fix any and . Then there is some such that is bounded below by a constant and has a growth rate of \mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{2\alpha} as \mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}\to\infty. Such a function exists by the following lemma, proved in Section H.2.
where the function is defined in (H.2) and . Moreover, \lim\inf\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{-2\alpha}(\mathcal{T}_{P}{\mathring{g}})({x})\geq\frac{\alpha}{\mathcal{D}(a,c,\alpha,\beta)^{1/2}}\kappa_{0} as \mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}\to\infty.
Our next lemma connects the growth rate of to the tightness rate of a probability measure evaluated with the Stein discrepancy. Its proof is found in Section H.3.
In particular, if is finite, is uniformly tight.
We can thus plug the tightness rate estimate of Lemma 17 applied to the function into Theorem 13. Since \mathopen{}\mathclose{{}\left\|{w}}\right\|_{\infty}\leq t implies \mathopen{}\mathclose{{}\left\|{w}}\right\|_{2}\leq\sqrt{d}t, we can use the formula for the generalized Fourier transform of the IMQ kernel in (20) to see is monotonically decreasing in \mathopen{}\mathclose{{}\left\|{w}}\right\|_{2} to establish (18). By taking we obtain (15).
To prove (17), notice that as for any by (21). Hence, by choosing and fixing any we obtain the advertised decay rate as . The uniform tightness conclusion follows from Lemma 17.
H.2 Proof of Lemma 16: Generalized multiquadric Stein sets yield coercive functions
By (Wendland, 2004, Thm. 8.15), has a generalized Fourier transform of order given by
Now fix any and , and consider the functions g_{j}(x)=\nabla_{x_{j}}\Phi_{a,\alpha}(x)=2\alpha x_{j}(a^{2}+\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{2})^{\alpha-1}. We will show that . Note that . Using (Wendland, 2004, Thm. 10.21), we know \mathopen{}\mathclose{{}\left\|{g_{j}}}\right\|_{\mathcal{K}_{k}}=\mathopen{}\mathclose{{}\left\|{\hat{g_{j}}/\sqrt{\widehat{\Phi_{c,\beta}}}}}\right\|_{L^{2}}, and thus \mathopen{}\mathclose{{}\left\|{g}}\right\|_{\mathcal{K}_{k}^{d}}=\mathopen{}\mathclose{{}\left\|{\hat{g}/\sqrt{\widehat{\Phi_{c,\beta}}}}}\right\|_{L^{2}}. Hence
where is the volume of the unit ball in -dimensions and in the last step we used the substitution r=\mathopen{}\mathclose{{}\left\|{\omega}}\right\|_{2} (Baker, 1999). Since and the function is integrable around the origin when , we can bound the integral above by
We can apply the technique to the other integral, yielding
Since , we can upper bound the last integral above by the quantity
where is the upper incomplete gamma function. Hence, the function belongs to with norm upper bounded by where
Now define so that . We will lower bound the growth rate of and also construct a uniform lower bound. Note
The latter two terms are both uniformly bounded in . By the distant dissipativity assumption, there is some such that \lim\sup_{\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}\to\infty}\frac{1}{\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{2}}\langle{b(x)},{x}\rangle\leq-\frac{1}{2}\kappa. Thus the first term of (23) grows at least at the rate \frac{1}{2}\kappa\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{2\alpha}. This assures \lim\inf\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{-2\alpha}(\mathcal{T}_{P}{\mathring{g}})({x})\geq\frac{\alpha}{\mathcal{D}(a,c,\alpha,\beta)^{1/2}}\kappa as \mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}\to\infty.
Moreover, because is Lipschitz, we have
Hence for any , we must have -\langle{b(x)},{x}\rangle\geq-M_{1}(b)R_{0}^{2}-\mathopen{}\mathclose{{}\left\|{b(0)}}\right\|_{2}R_{0}. By choice of , for all , the distant dissipativity assumption implies . Hence applying this to (23) shows that is uniformly lower bounded by .
H.3 Proof of Lemma 17: Coercive functions yield tightness
Thus we see that \mu(\mathopen{}\mathclose{{}\left\|{{X}}}\right\|_{2}\geq r_{\epsilon})\leq\epsilon whenever . This implies that for sufficiently small , if
we must have \mu(\mathopen{}\mathclose{{}\left\|{{X}}}\right\|_{2}\geq r_{\epsilon})\leq\epsilon. Hence whenever is bounded, we must have is uniformly tight as is finite.
Appendix I Proof of Proposition 9: KSD detects convergence
We will first state and prove a useful lemma.
To handle the other term above, notice that by Cauchy-Schwarz and the fact that for ,
The stated inequality now follows by taking the infimum of these bounds over all joint distributions with and . ∎
Appendix J Proof of Theorem 10: KSD fails for bounded scores
We can express as
Let be the kernel decay rate defined in the statement of Theorem 6. Then as , we must have and . By the triangle inequality
We now handle the second term of (24). By repeated use of Cauchy-Schwarz we have
By assumption, as . Furthermore, since the second term of (24) is upper bounded by the average of the terms for , we have as . However, is not uniformly tight and hence does not converge to the probability measure .