Measuring Sample Quality with Kernels

Jackson Gorham, Lester Mackey

Introduction

This closed form represents a significant practical advantage, as no linear program solvers are necessary, and the computation of the discrepancy can be easily parallelized. However, as we will see in Section 3.2, not all kernel Stein discrepancies are suitable for our setting. In particular, in dimension d3d\geq 3, the kernel Stein discrepancies previously recommended in the literature fail to detect when a sample is not converging to the target. To address this shortcoming, we develop a theory of weak convergence for the kernel Stein discrepancies analogous to that of (Gorham & Mackey, 2015; Mackey & Gorham, 2016; Gorham et al., 2016) and design a class of kernel Stein discrepancies that provably control weak convergence for a large class of target distributions.

After formally describing our goals for measuring sample quality in Section 2, we outline our strategy, based on Stein’s method, for constructing and analyzing practical quality measures at the start of Section 3. In Section 3.1, we define our family of closed-form quality measures – the kernel Stein discrepancies (KSDs) – and establish several appealing practical properties of these measures. We analyze the convergence properties of KSDs in Sections 3.2 and 3.3, showing that previously proposed KSDs fail to detect non-convergence and proposing practical convergence-determining alternatives. Section 4 illustrates the value of convergence-determining kernel Stein discrepancies in a variety of applications, including hyperparameter selection, sampler selection, one-sample hypothesis testing, and sample quality improvement. Finally, in Section 5, we conclude with a discussion of related and future work.

Quality measures for samples

Stein’s method with kernels

Stein’s method (Stein, 1972) provides a three-step recipe for assessing convergence in distribution:

For any such Stein operator and Stein set G\mathcal{G}, Gorham & Mackey (2015) defined the Stein discrepancy as

which, crucially, avoids explicit integration under PP.

Lower bound the Stein discrepancy by an IPM dHd_{\mathcal{H}} known to dominate weak convergence. This can be done once for a broad class of target distributions to ensure that μmP\mu_{m}\Rightarrow P whenever S(μm,T,G)0\mathcal{S}({\mu_{m}},{\mathcal{T}{}},{\mathcal{G}})\to 0 for a sequence of probability measures (μm)m1(\mu_{m})_{m\geq 1} (Desideratum (ii)).

Provide an upper bound on the Stein discrepancy ensuring that S(μm,T,G)0\mathcal{S}({\mu_{m}},{\mathcal{T}{}},{\mathcal{G}})\to 0 under suitable convergence of μm\mu_{m} to PP (Desideratum (i)).

While Stein’s method is principally used as a mathematical tool to prove convergence in distribution, we seek, in the spirit of (Gorham & Mackey, 2015; Gorham et al., 2016), to harness the Stein discrepancy as a practical tool for measuring sample quality. The subsections to follow develop a specific, practical instantiation of the abstract Stein’s method recipe based on reproducing kernel Hilbert spaces. An empirical analysis of the Stein discrepancies recommended by our theory follows in Section 4.

A standard, widely applicable univariate Stein operator is the density method operator (see Stein et al., 2004; Chatterjee & Shao, 2011; Chen et al., 2011; Ley et al., 2017),

Inspired by the generator method of Barbour (1988, 1990) and Götze (1991), Gorham & Mackey (2015) generalized this operator to multiple dimensions. The resulting Langevin Stein operator

With this definition, we define our kernel Stein set \mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|} as the set of vector-valued functions g=(g1,,gd)g=(g_{1},\dots,g_{d}) such that each component function gjg_{j} belongs to Kk\mathcal{K}_{k} and the vector of their norms \mathopen{}\mathclose{{}\left\|{g_{j}}}\right\|_{\mathcal{K}_{k}} belongs to the \mathopen{}\mathclose{{}\left\|{\cdot}}\right\|^{*} unit ball:Our analyses and algorithms support each gjg_{j} belonging to a different RKHS Kkj\mathcal{K}_{k_{j}}, but we will not need that flexibility here.

The following result, proved in Section B, establishes that this is an acceptable domain for TP\mathcal{T}_{P}{}.

The Langevin Stein operator and kernel Stein set together define our quality measure of interest, the kernel Stein discrepancy (KSD) \mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}). When \mathopen{}\mathclose{{}\left\|{\cdot}}\right\|=\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}, this definition recovers the KSD proposed by Chwialkowski et al. (2016) and Liu et al. (2016). Our next result shows that, for any \mathopen{}\mathclose{{}\left\|{\cdot}}\right\|, the KSD admits a closed-form solution.

Suppose kC(1,1)k\in C^{(1,1)}, and, for each j{1,d}j\in\{1,\dots d\}, define the Stein kernel

The proof is found in Section C. Notably, when μ\mu is the discrete measure Qn=i=1nqn(xi)δxiQ_{n}=\sum_{i=1}^{n}q_{n}(x_{i})\delta_{x_{i}}, the KSD reduces to evaluating each k0jk_{0}^{j} at pairs of support points as wj=i,i=1nqn(xi)k0j(xi,xi)qn(xi),w_{j}=\sqrt{\sum_{i,i^{\prime}=1}^{n}q_{n}(x_{i})k_{0}^{j}(x_{i},x_{i^{\prime}})q_{n}(x_{i^{\prime}})}, a computation which is easily parallelized over sample pairs and coordinates jj.

Our Stein set choice was motivated by the work of Oates et al. (2016b) who used the sum of Stein kernels k0=j=1dk0jk_{0}=\sum_{j=1}^{d}k_{0}^{j} to develop nonparametric control variates. Each term wjw_{j} in Proposition 2 can also be viewed as an instance of the maximum mean discrepancy (MMD) (Gretton et al., 2012) between μ\mu and PP measured with respect to the Stein kernel k0jk_{0}^{j}. In standard uses of MMD, an arbitrary kernel function is selected, and one must be able to compute expectations of the kernel function under PP. Here, this requirement is satisfied automatically, since our induced kernels are chosen to have mean zero under PP.

For clarity we will focus on the specific kernel Stein set choice \mathcal{G}_{k}\triangleq\mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}} for the remainder of the paper, but our results extend directly to KSDs based on any \mathopen{}\mathclose{{}\left\|{\cdot}}\right\|, since all KSDs are equivalent in a strong sense:

Under the assumptions of Proposition 2, there are constants cd,cd>0c_{d},c_{d}^{\prime}>0 depending only on dd and \mathopen{}\mathclose{{}\left\|{\cdot}}\right\| such that c_{d}\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}})\leq\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}}})\leq c_{d}^{\prime}\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}).

2 Lower bounding the kernel Stein discrepancy

We next aim to establish conditions under which the KSD S(μm,TP,Gk)0\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0 only if μmP\mu_{m}\Rightarrow P (Desideratum (ii)). Recently, Gorham et al. (2016) showed that the Langevin graph Stein discrepancy dominates convergence in distribution whenever PP belongs to the class P\mathcal{P} of distantly dissipative distributions with Lipschitz score function bb:

A distribution PP is distantly dissipative if κ0liminfrκ(r)>0\kappa_{0}\triangleq\lim\inf_{r\to\infty}\kappa(r)>0 for

Examples of distributions in P\mathcal{P} include finite Gaussian mixtures with common covariance and all distributions strongly log-concave outside of a compact set, including Bayesian linear, logistic, and Huber regression posteriors with Gaussian priors (see Gorham et al., 2016, Section 4). Moreover, when d=1d=1, membership in P\mathcal{P} is sufficient to provide a lower bound on the KSD for most common kernels including the Gaussian, Matérn, and inverse multiquadric kernels.

Suppose that PPP\in\mathcal{P} and k(x,y)=Φ(xy)k(x,y)=\Phi(x-y) for ΦC2\Phi\in C^{2} with a non-vanishing generalized Fourier transform. If d=1d=1, then S(μm,TP,Gk)0\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0 only if μmP\mu_{m}\Rightarrow P.

The proof in Section E provides a lower bound on the KSD in terms of an IPM known to dominate weak convergence. However, our next theorem shows that in higher dimensions S(Qn,TP,Gk)\mathcal{S}({Q_{n}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}}) can converge to without the sequence (Qn)n1(Q_{n})_{n\geq 1} converging to any probability measure. This deficiency occurs even when the target is Gaussian.

Suppose kCb(1,1)k\in C_{b}^{(1,1)} and define the kernel decay rate

If d3d\geq 3, P=N(0,Id)P=\mathcal{N}(0,I_{d}), and γ(r)=o(rα)\gamma(r)=o(r^{-\alpha}) for α(121d)1\alpha\triangleq(\frac{1}{2}-\frac{1}{d})^{-1}, then S(Qn,TP,Gk)0\mathcal{S}({Q_{n}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0 does not imply QnPQ_{n}\Rightarrow P.

Theorem 6 implies that KSDs based on the commonly used Gaussian kernel, Matérn kernel, and compactly supported kernels of Wendland (2004, Theorem 9.13) all fail to detect non-convergence when d3d\geq 3. In addition, KSDs based on the inverse multiquadric kernel (k(x,y)=(c^{2}+\mathopen{}\mathclose{{}\left\|{x-y}}\right\|_{2}^{2})^{\beta}) for β<1\beta<-1 fail to detect non-convergence for any d>2β/(β+1)d>2\beta/(\beta+1). The proof in Section F shows that the violating sample sequences (Qn)n1(Q_{n})_{n\geq 1} are simple to construct, and we provide an empirical demonstration of this failure to detect non-convergence in Section 4.

The failure of the KSDs in Theorem 6 can be traced to their inability to enforce uniform tightness. A sequence of probability measures (μm)m1(\mu_{m})_{m\geq 1} is uniformly tight if for every ϵ>0\epsilon>0, there is a finite number R(ϵ)R(\epsilon) such that \lim\sup_{m}\mu_{m}(\mathopen{}\mathclose{{}\left\|{{X}}}\right\|_{2}>R(\epsilon))\leq\epsilon. Uniform tightness implies that no mass in the sequence of probability measures escapes to infinity. When the kernel kk decays more rapidly than the score function grows, the KSD ignores excess mass in the tails and hence can be driven to zero by a non-tight sequence of increasingly diffuse probability measures. The following theorem demonstrates uniform tightness is the missing piece to ensure weak convergence.

Suppose that PPP\in\mathcal{P} and k(x,y)=Φ(xy)k(x,y)=\Phi(x-y) for ΦC2\Phi\in C^{2} with a non-vanishing generalized Fourier transform. If (μm)m1(\mu_{m})_{m\geq 1} is uniformly tight, then S(μm,TP,Gk)0\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0 only if μmP\mu_{m}\Rightarrow P.

Our proof in Section G explicitly lower bounds the KSD S(μ,TP,Gk)\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}}) in terms of the bounded Lipschitz metric d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}(\mu,P), which exactly metrizes weak convergence.

Ideally, when a sequence of probability measures is not uniformly tight, the KSD would reflect this divergence in its reported value. To achieve this, we consider the inverse multiquadric (IMQ) kernel k(x,y)=(c^{2}+\mathopen{}\mathclose{{}\left\|{x-y}}\right\|_{2}^{2})^{\beta}\, for some β<0\beta<0 and c>0c>0. While KSDs based on IMQ kernels fail to determine convergence when β<1\beta<-1 (by Theorem 6), our next theorem shows that they automatically enforce tightness and detect non-convergence whenever β(1,0)\beta\in(-1,0).

Suppose PPP\in\mathcal{P} and k(x,y)=(c^{2}+\mathopen{}\mathclose{{}\left\|{x-y}}\right\|_{2}^{2})^{\beta} for c>0c>0 and β(1,0)\beta\in(-1,0). If S(μm,TP,Gk)0\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0, then μmP\mu_{m}\Rightarrow P.

The proof in Section H provides a lower bound on the KSD in terms of the bounded Lipschitz metric d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}(\mu,P). The success of the IMQ kernel over other common characteristic kernels can be attributed to its slow decay rate. When PPP\in\mathcal{P} and the IMQ exponent β>1\beta>-1, the function class TPGk\mathcal{T}_{P}{\mathcal{G}_{k}} contains unbounded (coercive) functions. These functions ensure that the IMQ KSD S(μm,TP,Gk)\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}}) goes to only if (μm)m1(\mu_{m})_{m\geq 1} is uniformly tight.

3 Upper bounding the kernel Stein discrepancy

The usual goal in upper bounding the Stein discrepancy is to provide a rate of convergence to PP for particular approximating sequences (μm)m=1(\mu_{m})_{m=1}^{\infty}. Because we aim to directly compute the KSD for arbitrary samples QnQ_{n}, our chief purpose in this section is to ensure that the KSD S(μm,TP,Gk)\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}}) will converge to zero when μm\mu_{m} is converging to PP (Desideratum (i)).

Experiments

We next conduct an empirical evaluation of the KSD quality measures recommended by our theory, recording all timings on an Intel Xeon CPU E5-2650 v2 @ 2.60GHz. Throughout, we will refer to the KSD with IMQ base kernel k(x,y)=(c^{2}+\mathopen{}\mathclose{{}\left\|{x-y}}\right\|_{2}^{2})^{\beta}, exponent β=12\beta=-\frac{1}{2}, and c=1c=1 as the IMQ KSD. Code reproducing all experiments can be found on the Julia (Bezanson et al., 2014) package site https://jgorham.github.io/SteinDiscrepancy.jl/.

Our first, simple experiment is designed to illustrate several properties of the IMQ KSD and to compare its behavior with that of two preexisting discrepancy measures, the Wasserstein distance d_{\mathcal{W}_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}}}, which can be computed for simple univariate targets (Vallender, 1974), and the spanner graph Stein discrepancy of Gorham & Mackey (2015). We adopt a bimodal Gaussian mixture with p(x)\propto e^{-\frac{1}{2}\mathopen{}\mathclose{{}\left\|{x+\Delta e_{1}}}\right\|_{2}^{2}}+e^{-\frac{1}{2}\mathopen{}\mathclose{{}\left\|{x-\Delta e_{1}}}\right\|_{2}^{2}} and Δ=1.5\Delta=1.5 as our target PP and generate a first sample point sequence i.i.d. from the target and a second sequence i.i.d. from one component of the mixture, N(Δe1,Id)\mathcal{N}(-\Delta e_{1},I_{d}). As seen in the left panel of Figure 1 where d=1d=1, the IMQ KSD decays at an n0.51n^{-0.51} rate when applied to the first nn points in the target sample and remains bounded away from zero when applied to the to the single component sample. This desirable behavior is closely mirrored by the Wasserstein distance and the graph Stein discrepancy.

The middle panel of Figure 1 records the time consumed by the graph and kernel Stein discrepancies applied to the i.i.d. sample points from PP. Each method is given access to dd cores when working in dd dimensions, and we use the released code of Gorham & Mackey (2015) with the default Gurobi 6.0.4 linear program solver for the graph Stein discrepancy. We find that the two methods have nearly identical runtimes when d=1d=1 but that the KSD is 1010 to 10001000 times faster when d=4d=4. In addition, the KSD is straightforwardly parallelized and does not require access to a linear program solver, making it an appealing practical choice for a quality measure.

2 The importance of kernel choice

Theorem 6 established that kernels with rapidly decaying tails yield KSDs that can be driven to zero by off-target sample sequences. Our next experiment provides an empirical demonstration of this issue for a multivariate Gaussian target P=N(0,Id)P=\mathcal{N}(0,I_{d}) and KSDs based on the popular Gaussian (k(x,y)=e^{-\mathopen{}\mathclose{{}\left\|{x-y}}\right\|_{2}^{2}/2}) and Matérn (k(x,y)=(1+\sqrt{3}\mathopen{}\mathclose{{}\left\|{x-y}}\right\|_{2})e^{-\sqrt{3}\mathopen{}\mathclose{{}\left\|{x-y}}\right\|_{2}}) radial kernels.

Following the proof Theorem 6 in Section F, we construct an off-target sequence (Qn)n1(Q_{n})_{n\geq 1} that sends S(Qn,TP,Gk)\mathcal{S}({Q_{n}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}}) to for these kernel choices whenever d3d\geq 3. Specifically, for each nn, we let Qn=1ni=1nδxiQ_{n}=\frac{1}{n}\sum_{i=1}^{n}\delta_{x_{i}} where, for all ii and jj, \mathopen{}\mathclose{{}\left\|{x_{i}}}\right\|_{2}\leq 2n^{1/d}\log n and \mathopen{}\mathclose{{}\left\|{x_{i}-x_{j}}}\right\|_{2}\geq 2\log n. To select these sample points, we independently sample candidate points uniformly from the ball \{x:\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}\leq 2n^{1/d}\log n\}, accept any points not within 2logn2\log n Euclidean distance of any previously accepted point, and terminate when nn points have been accepted.

For various dimensions, Figure 2 displays the result of applying each KSD to the off-target sequence (Qn)n1(Q_{n})_{n\geq 1} and an “on-target” sequence of points sampled i.i.d. from PP. For comparison, we also display the behavior of the IMQ KSD which provably controls tightness and dominates weak convergence for this target by Theorem 8. As predicted, the Gaussian and Matérn KSDs decay to under the off-target sequence and decay more rapidly as the dimension dd increases; the IMQ KSD remains bounded away from .

3 Selecting sampler hyperparameters

To assess the suitability of the KSD for tolerance parameter selection, we take as our target PP the bimodal Gaussian mixture model posterior of (Welling & Teh, 2011). For an array of ϵ\epsilon values, we generated 5050 independent approximate slice sampling chains with batch size 55, each with a budget of 148000148000 likelihood evaluations, and plotted the median IMQ KSD and effective sample size (ESS, a standard sample quality measure based on asymptotic variance (Brooks et al., 2011)) in Figure 3. ESS, which does not detect Markov chain bias, is maximized at the largest hyperparameter evaluated (ϵ=101\epsilon=10^{-1}), while the KSD is minimized at an intermediate value (ϵ=102\epsilon=10^{-2}). The right panel of Figure 3 shows representative samples produced by several settings of ϵ\epsilon. The sample produced by the ESS-selected chain is significantly overdispersed, while the sample from ϵ=0\epsilon=0 has minimal coverage of the second mode due to its small sample size. The sample produced by the KSD-selected chain best resembles the posterior target. Using 44 cores, the longest KSD computation with n=103n=10^{3} sample points took 0.16s0.16s.

4 Selecting samplers

Ahn et al. (2012) developed two biased MCMC samplers for accelerated posterior inference, both called Stochastic Gradient Fisher Scoring (SGFS). In the full version of SGFS (termed SGFS-f), a d×dd\times d matrix must be inverted to draw each new sample point. Since this can be costly for large dd, the authors developed a second sampler (termed SGFS-d) in which only a diagonal matrix must be inverted to draw each new sample point. Both samplers can be viewed as discrete-time approximations to a continuous-time Markov process that has the target PP as its stationary distribution; however, because no Metropolis-Hastings correction is employed, neither sampler has the target as its stationary distribution. Hence we will use the KSD – a quality measure that accounts for asymptotic bias – to evaluate and choose between these samplers.

Specifically, we evaluate the SGFS-f and SGFS-d samples produced in (Ahn et al., 2012, Sec. 5.1). The target PP is a Bayesian logistic regression with a flat prior, conditioned on a dataset of 10410^{4} MNIST handwritten digit images. From each image, the authors extracted 5050 random projections of the raw pixel values as covariates and a label indicating whether the image was a 77 or a 99. After discarding the first half of sample points as burn-in, we obtained regression coefficient samples with 5×1045\times 10^{4} points and d=51d=51 dimensions (including the intercept term). Figure 4 displays the IMQ KSD applied to the first nn points in each sample. As external validation, we follow the protocol of Ahn et al. (2012) to find the bivariate marginal means and 95% confidence ellipses of each sample that align best and worst with those of a surrogate ground truth sample obtained from a Hamiltonian Monte Carlo chain with 10510^{5} iterates. Both the KSD and the surrogate ground truth suggest that the moderate speed-up provided by SGFS-d (0.0017s0.0017s per sample vs. 0.0019s0.0019s for SGFS-f) is outweighed by the significant loss in inferential accuracy. However, the KSD assessment does not require access to an external trustworthy ground truth sample. The longest KSD computation took 400s400s using 1616 cores.

5 Beyond sample quality comparison

While our investigation of the KSD was motivated by the desire to develop practical, trustworthy tools for sample quality comparison, the kernels recommended by our theory can serve as drop-in replacements in other inferential tasks that make use of kernel Stein discrepancies.

5.2 Improving sample quality

Related and future work

The score statistic of Fan et al. (2006) and the Gibbs sampler convergence criteria of Zellner & Min (1995) detect certain forms of non-convergence but fail to detect others due to the finite number of test functions tested. For example, when P=N(0,1)P=\mathcal{N}(0,1), the score statistic (Fan et al., 2006) only monitors sample means and variances.

While assessing sample quality was our chief objective, our results may hold benefits for other applications that make use of Stein discrepancies or Stein operators. In particular, our kernel recommendations could be incorporated into the Monte Carlo control functionals framework of Oates et al. (2016b); Oates & Girolami (2015), the variational inference approaches of Liu & Wang (2016); Liu & Feng (2016); Ranganath et al. (2016), and the Stein generative adversarial network approach of Wang & Liu (2016).

In the future, we aim to leverage stochastic, low-rank, and sparse approximations of the kernel matrix and score function to produce KSDs that scale better with the number of sample and data points while still guaranteeing control over weak convergence. A reader may also wonder for which distributions outside of P\mathcal{P} the KSD dominates weak convergence. The following theorem, proved in Section J, shows that no KSD with a C0C_{0} kernel dominates weak convergence when the target has a bounded score function.

If logp\nabla\log p is bounded and kC0(1,1)k\in C^{(1,1)}_{0}, then S(Qn,TP,Gk)0\mathcal{S}({Q_{n}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0 does not imply QnPQ_{n}\Rightarrow P.

Acknowledgments

We thank Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton for sharing their hypothesis testing code, Qiang Liu for sharing his black-box importance sampling code, and Sebastian Vollmer and Andrew Duncan for many helpful conversations regarding this work. This material is based upon work supported by the National Science Foundation DMS RTG Grant No. 1501767, the National Science Foundation Graduate Research Fellowship under Grant No. DGE-114747, and the Frederick E. Terman Fellowship.

References

Appendix A Additional appendix notation

Appendix B Proof of Proposition 1: Zero mean test functions

Appendix C Proof of Proposition 2: KSD closed form

Hence, we may apply the representation (C) and exchange expectation and RKHS inner product to discover

Appendix D Proof of Proposition 3: Stein set equivalence

By Proposition 2, \mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}})=\mathopen{}\mathclose{{}\left\|{w}}\right\| and \mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k,\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}}})=\mathopen{}\mathclose{{}\left\|{w}}\right\|_{2} for some vector ww, and by (Bachman & Narici, 1966, Thm. 8.7), there exist constants cd,cd>0c_{d},c_{d}^{\prime}>0 depending only on dd and \mathopen{}\mathclose{{}\left\|{\cdot}}\right\| such that c_{d}\mathopen{}\mathclose{{}\left\|{w}}\right\|\leq\mathopen{}\mathclose{{}\left\|{w}}\right\|_{2}\leq c_{d}^{\prime}\mathopen{}\mathclose{{}\left\|{w}}\right\|.

Appendix E Proof of Theorem 5: Univariate KSD detects non-convergence

Remarks An explicit value for the Stein factor MP\mathcal{M}_{P} can be derived from the proof in Section E.1 and the results of Gorham et al. (2016). After optimizing the bound dH(μ,P)d_{\mathcal{H}}(\mu,P) over ϵ>0\epsilon>0, the Gaussian, inverse multiquadric, and Matérn (v>1v>1) kernels achieve rates of O(1/log(1S(μ,TP,Gk)))O(1/\sqrt{\log(\frac{1}{\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})})}), O(1/log(1S(μ,TP,Gk)))O(1/\log(\frac{1}{\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})})), and O(S(μ,TP,Gk)1/(v+1/2))O(\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})^{1/(v+1/2)}) respectively as S(μ,TP,Gk)0\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0.

In particular, since Φ^\hat{\Phi} is non-vanishing, F(t)F(t) is finite for all tt. If S(μm,TP,Gk)0\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0, then, for any fixed ϵ>0\epsilon>0, we have limsupmdH(μm,P)ϵ\lim\sup_{m\to\infty}d_{\mathcal{H}}(\mu_{m},P)\leq\epsilon. Taking ϵ0\epsilon\to 0 shows that limmdH(μm,P)0\lim_{m\to\infty}d_{\mathcal{H}}(\mu_{m},P)\to 0, which implies that μmP\mu_{m}\Rightarrow P.

Fix any probability measure μ\mu and hHh\in\mathcal{H}, and define the tilting function \Xi(x)\triangleq(1+\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{2})^{1/2}. The proof will proceed in three steps.

Moreover, we have xkb(x,y)=(yx)kb(x,y)\nabla_{x}k_{b}(x,y)=(y-x)k_{b}(x,y) and r(x)=xr(x)\nabla r(x)=-xr(x). Thus for any xx, by (Steinwart & Christmann, 2008, Corollary 4.36) we have

We next show that there is a solution to the PP Stein equation

In our final step, we will use the following lemma, proved in Section E.2, to show that we can approximate TPgh\mathcal{T}_{P}{g_{h}} arbitrarily well by a function in a scaled copy of TPGk\mathcal{T}_{P}{\mathcal{G}_{k}}.

where F(t)\triangleq\sup_{\mathopen{}\mathclose{{}\left\|{\omega}}\right\|_{\infty}\leq t}\hat{\Phi}(\omega)^{-1}.

Taking a supremum over hHh\in\mathcal{H} yields the advertised result.

E.2 Proof of Lemma 12: Stein approximations with finite RKHS norm

Thus for any δ>0\delta>0, by the triangle inequality, we have

Thus it remains to bound the RKHS norm of gδg_{\delta}. By the Convolution Theorem (Wendland, 2004, Thm. 5.16), we have gδ^(ω)=(2π)d/2g^(ω)ρδ^(ω)\hat{g_{\delta}}(\omega)=(2\pi)^{d/2}\hat{g}(\omega)\hat{\rho_{\delta}}(\omega), and so the squared norm of gδg_{\delta} in Kkd\mathcal{K}_{k}^{d} is equal to (Wendland, 2004, Thm. 10.21)

Appendix F Proof of Theorem 6: KSD fails with light kernel tails

Since the target distribution PP is N(0,Id)\mathcal{N}(0,I_{d}), the associated gradient of the log density is b(x)=xb(x)=-x. Thus

Since kCb(2,2)k\in C^{(2,2)}_{b}, γ(0)<\gamma(0)<\infty. Thus by Cauchy-Schwarz, the first term of (10) is upper bounded by

To handle the second term of (10), we will use the assumed bound on kk and its derivatives from γ\gamma. For any fixed iii\neq i^{\prime}, by the triangle inequality, Cauchy-Schwarz, and fact γ\gamma is monotonically decreasing we have

Our upper bounds on the Stein discrepancy (10) and our choice of Δn\Delta_{n} now imply that

Moreover, since γ(r)=o(rα)\gamma(r)=o(r^{-\alpha}), we have γ1(1/n)=o(n1/α)=o(n1/21/d)\gamma^{-1}(1/n)=o(n^{1/\alpha})=o(n^{1/2-1/d}), and hence S(Qn,TP,Gk)0\mathcal{S}({Q_{n}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0 as nn\to\infty.

However, the sequence (Qn)n1(Q_{n})_{n\geq 1} is not uniformly tight and hence converges to no probability measure. This follows as, for each r>0r>0,

for m=5d+1rdm=\lceil{5^{d+1}r^{d}\rceil}, since at most (r+4r/Δm)d(r+4r/{\Delta_{m}})^{d} points with minimum pairwise Euclidean distance greater than Δm\Delta_{m} can fit into a ball of radius rr (Wainwright, 2017, Lems. 5.1 and 5.2).

Appendix G Proof of Theorem 7: KSD detects tight non-convergence

Theorem 7 will follow from the following result which upper bounds the bounded Lipschitz metric d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}}}(\mu,P) in terms of the tightness rate R(μ,ϵ)R(\mu,\epsilon), the rate of decay of the generalized Fourier transform Φ^\hat{\Phi}, and the KSD S(μ,TP,Gk)\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}}).

Suppose PPP\in\mathcal{P} and let μ\mu be a probability measure with tightness rate R(μ,ϵ)R(\mu,\epsilon) defined in (11). Moreover, suppose the kernel k(x,y)=Φ(xy)k(x,y)=\Phi(x-y) with ΦC2\Phi\in C^{2} and F(t)\triangleq\sup_{\mathopen{}\mathclose{{}\left\|{\omega}}\right\|_{\infty}\leq t}\hat{\Phi}(\omega)^{-1} finite for all t>0t>0. Then there exists a constant MP\mathcal{M}_{P} such that, for all ρ,ϵ,δ>0\rho,\epsilon,\delta>0,

where \theta_{d}\triangleq d\int_{0}^{1}\operatorname{exp}\mathopen{}\mathclose{{}\left(-1/(1-r^{2})}\right)r^{d-1}\,dr for d>0d>0 (and θ0e1\theta_{0}\triangleq e^{-1}), VdV_{d} is the volume of the unit Euclidean ball in dimension dd, and

Remarks An explicit value for the Stein factor MP\mathcal{M}_{P} can be derived from the proof in Section G.1 and the results of Gorham et al. (2016). When bounds on RR and FF are known, the final expression can be optimized over ϵ,ρ\epsilon,\rho and δ\delta to produce rates of convergence in d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}}}.

Fix any δ>0\delta>0, and consider a sequence of probability measures (μm)m1(\mu_{m})_{m\geq 1} that is uniformly tight. We must have limsupmR(μm,ϵ)<\lim\sup_{m}R(\mu_{m},\epsilon)<\infty for all ϵ>0\epsilon>0. Moreover, since Φ^\hat{\Phi} is non-vanishing, F(t)F(t) is finite for all tt. Thus if S(μm,TP,Gk)0\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0, then for any fixed ϵ<1\epsilon<1 and ρ>0\rho>0,

Taking ρ,ϵ0\rho,\epsilon\to 0 yields d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|_{2}}}(\mu_{m},P)\to 0.

Moreover, by the argument (9) employed in the proof of Lemma 12, we have

showing that TPgρ\mathcal{T}_{P}{g_{\rho}} closely approximates TPg\mathcal{T}_{P}{g}.

We will next truncate gρg_{\rho} using the following lemma proved in Section G.2.

where \theta_{d}\triangleq d\int_{0}^{1}\operatorname{exp}\mathopen{}\mathclose{{}\left(-1/(1-r^{2})}\right)r^{d-1}\,dr for d>0d>0 and θ0e1\theta_{0}\triangleq e^{-1}.

Fix any ϵ,δ>0\epsilon,\delta>0, and let K=B(0,R(μ,ϵ))K=\mathcal{B}(0,R(\mu,\epsilon)) with R(μ,ϵ)R(\mu,\epsilon) defined in (11). This set is compact since our sequence is uniformly tight. Hence, we may define gK,δ(x)gρ(x)vK,δ(x)g_{K,\delta}(x)\triangleq g_{\rho}(x)\,v_{K,\delta}(x) as a smooth, truncated version of gρg_{\rho} based on Lemma 14. Since

properties (12) and (13) imply that (TPgρ)(x)=(TPgK,δ)(x)(\mathcal{T}_{P}{g_{\rho}})({x})=(\mathcal{T}_{P}{g_{K,\delta}})({x}) for all xKx\in K, (TPgK,δ)(x)=0(\mathcal{T}_{P}{g_{K,\delta}})({x})=0 when xK2δx\notin K^{2\delta}, and

for xK2δKx\in K^{2\delta}\setminus K by Cauchy-Schwarz. In addition

The advertised result follows by substituting Vol(B(0,r))=Vdrd\textnormal{Vol}(\mathcal{B}(0,r))=V_{d}r^{d} and taking the supremum over all h\in BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}.

G.2 Proof of Lemma 14: Smoothed indicator function

where the normalizing constant is given by

for VdV_{d} being the volume of the unit Euclidean ball in dd dimensions (Baker, 1999).

where we used the substitution z(xy)/δz\triangleq(x-y)/\delta. By differentiating ψ\psi, using (Baker, 1999) with the substitution r=\mathopen{}\mathclose{{}\left\|{z}}\right\|_{2}, and employing integration by parts we have

Appendix H Proof of Theorem 8: IMQ KSD detects non-convergence

We first use the following theorem to upper bound the bounded Lipschitz metric d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}(\mu,P) in terms of the KSD S(μ,TP,Gk)\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}}).

Suppose PPP\in\mathcal{P} and k(x,y)=(c^{2}+\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{2})^{\beta} for c>0c>0, and β(1,0)\beta\in(-1,0). Choose any α(0,12(β+1))\alpha\in(0,\frac{1}{2}(\beta+1)) and a>12ca>\frac{1}{2}c. Then there exist an ϵ0>0\epsilon_{0}>0 and a constant MP\mathcal{M}_{P} such that, for all μ\mu,

for θd,θd1,Vd,\theta_{d},\theta_{d-1},V_{d}, and cρ,δc_{\rho,\delta} defined in Theorem 13, the function D\mathcal{D} defined in (H.2), the function ζ\zeta defined in (19), and

where KvK_{v} is the modified Bessel function of the third kind. Moreover, if limsupmS(μm,TP,Gk)<\lim\sup_{m}\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})<\infty then (μm)m1(\mu_{m})_{m\geq 1} is uniformly tight.

Remark The Stein factor MP\mathcal{M}_{P} can be determined explicitly based on the proof of Theorem 15 in Section H.1 and the results of Gorham et al. (2016).

Note that FIMQ(t)F_{IMQ}(t) is finite for all t>0t>0, so fix any ϵ[0,ϵ0)\epsilon\in[0,\epsilon_{0}) and δ,ρ>0\delta,\rho>0. If S(μm,TP,Gk)0\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0, then \lim\sup_{m}d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}(\mu_{m},P)\leq\rho\sqrt{d}(1+M_{1}(b)\mathcal{M}_{P})+(3+\epsilon+(M_{1}(b)\rho\sqrt{d}+{\textstyle\frac{\delta^{-1}d\theta_{d-1}}{\theta_{d}}})\mathcal{M}_{P})\epsilon. Thus taking ϵ,ρ0\epsilon,\rho\to 0 yields d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}(\mu_{m},P)\to 0. Since d_{BL_{\mathopen{}\mathclose{{}\left\|{\cdot}}\right\|}}(\mu_{m},P)\to 0 only if μmP\mu_{m}\Rightarrow P, the statement of Theorem 8 follows.

Fix any α(0,12(β+1))\alpha\in(0,\frac{1}{2}(\beta+1)) and a>12ca>\frac{1}{2}c. Then there is some g˚Gk\mathring{g}\in\mathcal{G}_{k} such that TPg˚\mathcal{T}_{P}{\mathring{g}} is bounded below by a constant ζ(a,c,α,β)\zeta(a,c,\alpha,\beta) and has a growth rate of \mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{2\alpha} as \mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}\to\infty. Such a function exists by the following lemma, proved in Section H.2.

where the function D\mathcal{D} is defined in (H.2) and R0inf{r>0κ(r)0,rr}R_{0}\triangleq\inf\{r>0\,|\,\kappa(r^{\prime})\geq 0,\forall r^{\prime}\geq r\}. Moreover, \lim\inf\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{-2\alpha}(\mathcal{T}_{P}{\mathring{g}})({x})\geq\frac{\alpha}{\mathcal{D}(a,c,\alpha,\beta)^{1/2}}\kappa_{0} as \mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}\to\infty.

Our next lemma connects the growth rate of TPg˚\mathcal{T}_{P}{\mathring{g}} to the tightness rate of a probability measure evaluated with the Stein discrepancy. Its proof is found in Section H.3.

In particular, if limsupmS(μm,TP,Gk)\lim\sup_{m}\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}}) is finite, (μm)m1(\mu_{m})_{m\geq 1} is uniformly tight.

We can thus plug the tightness rate estimate of Lemma 17 applied to the function g˚\mathring{g} into Theorem 13. Since \mathopen{}\mathclose{{}\left\|{w}}\right\|_{\infty}\leq t implies \mathopen{}\mathclose{{}\left\|{w}}\right\|_{2}\leq\sqrt{d}t, we can use the formula for the generalized Fourier transform of the IMQ kernel in (20) to see Φ^(ω)\hat{\Phi}(\omega) is monotonically decreasing in \mathopen{}\mathclose{{}\left\|{w}}\right\|_{2} to establish (18). By taking ηαD(a,c,α,β)1/2κ0\eta\to\frac{\alpha}{\mathcal{D}(a,c,\alpha,\beta)^{1/2}}\kappa_{0} we obtain (15).

To prove (17), notice that FIMQ(t)=O(e(cd+λ)t)F_{IMQ}(t)=O(e^{(c\sqrt{d}+\lambda)t}) as tt\to\infty for any λ>0\lambda>0 by (21). Hence, by choosing ϵ=ρ=((cd+1)/log(1S(μ,TP,Gk)))1/2\epsilon=\rho=((c\sqrt{d}+1)/\log({\textstyle\frac{1}{\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})}}))^{1/2} and fixing any δ>0\delta>0 we obtain the advertised decay rate as S(μ,TP,Gk)0\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0. The uniform tightness conclusion follows from Lemma 17.

H.2 Proof of Lemma 16: Generalized multiquadric Stein sets yield coercive functions

By (Wendland, 2004, Thm. 8.15), Φc,β\Phi_{c,\beta} has a generalized Fourier transform of order max(0,β)\max(0,\lceil\beta\rceil) given by

Now fix any a>c/2a>c/2 and α(0,12(β+1))\alpha\in(0,\frac{1}{2}(\beta+1)), and consider the functions g_{j}(x)=\nabla_{x_{j}}\Phi_{a,\alpha}(x)=2\alpha x_{j}(a^{2}+\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{2})^{\alpha-1}. We will show that g=(g1,,gd)Kkdg=(g_{1},\dots,g_{d})\in\mathcal{K}_{k}^{d}. Note that gj^(ω)=(iωj)Φa,α^(ω)\hat{g_{j}}(\omega)=(i\omega_{j})\widehat{\Phi_{a,\alpha}}(\omega). Using (Wendland, 2004, Thm. 10.21), we know \mathopen{}\mathclose{{}\left\|{g_{j}}}\right\|_{\mathcal{K}_{k}}=\mathopen{}\mathclose{{}\left\|{\hat{g_{j}}/\sqrt{\widehat{\Phi_{c,\beta}}}}}\right\|_{L^{2}}, and thus \mathopen{}\mathclose{{}\left\|{g}}\right\|_{\mathcal{K}_{k}^{d}}=\mathopen{}\mathclose{{}\left\|{\hat{g}/\sqrt{\widehat{\Phi_{c,\beta}}}}}\right\|_{L^{2}}. Hence

where VdV_{d} is the volume of the unit ball in dd-dimensions and in the last step we used the substitution r=\mathopen{}\mathclose{{}\left\|{\omega}}\right\|_{2} (Baker, 1999). Since α<12(β+1)\alpha<\frac{1}{2}(\beta+1) and the function rrtr\mapsto r^{t} is integrable around the origin when t>1t>-1, we can bound the integral above by

We can apply the technique to the other integral, yielding

Since c2a<0c-2a<0, we can upper bound the last integral above by the quantity

where Γ(s,x)xts1etdt\Gamma(s,x)\triangleq\int_{x}^{\infty}t^{s-1}e^{-t}\,dt is the upper incomplete gamma function. Hence, the function gg belongs to Kkd\mathcal{K}_{k}^{d} with norm upper bounded by D(a,b,α,β)1/2\mathcal{D}(a,b,\alpha,\beta)^{1/2} where

Now define g˚=D(a,c,α,β)1/2g\mathring{g}=-\mathcal{D}(a,c,\alpha,\beta)^{-1/2}g so that g˚Gk\mathring{g}\in\mathcal{G}_{k}. We will lower bound the growth rate of TPg˚\mathcal{T}_{P}{\mathring{g}} and also construct a uniform lower bound. Note

The latter two terms are both uniformly bounded in xx. By the distant dissipativity assumption, there is some κ>0\kappa>0 such that \lim\sup_{\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}\to\infty}\frac{1}{\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{2}}\langle{b(x)},{x}\rangle\leq-\frac{1}{2}\kappa. Thus the first term of (23) grows at least at the rate \frac{1}{2}\kappa\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{2\alpha}. This assures \lim\inf\mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}^{-2\alpha}(\mathcal{T}_{P}{\mathring{g}})({x})\geq\frac{\alpha}{\mathcal{D}(a,c,\alpha,\beta)^{1/2}}\kappa as \mathopen{}\mathclose{{}\left\|{x}}\right\|_{2}\to\infty.

Moreover, because bb is Lipschitz, we have

Hence for any xB(0,R0)x\in\mathcal{B}(0,R_{0}), we must have -\langle{b(x)},{x}\rangle\geq-M_{1}(b)R_{0}^{2}-\mathopen{}\mathclose{{}\left\|{b(0)}}\right\|_{2}R_{0}. By choice of R0R_{0}, for all xB(0,R0)x\notin\mathcal{B}(0,R_{0}), the distant dissipativity assumption implies b(x),x0-\langle{b(x)},{x}\rangle\geq 0. Hence applying this to (23) shows that TPg˚\mathcal{T}_{P}{\mathring{g}} is uniformly lower bounded by ζ(a,c,α,β)\zeta(a,c,\alpha,\beta).

H.3 Proof of Lemma 17: Coercive functions yield tightness

Thus we see that \mu(\mathopen{}\mathclose{{}\left\|{{X}}}\right\|_{2}\geq r_{\epsilon})\leq\epsilon whenever ϵ(S(μ,TP,Gk)ζ)/γ(rϵ)\epsilon\geq(\mathcal{S}({\mu},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})-\zeta)/\gamma(r_{\epsilon}). This implies that for sufficiently small ϵ\epsilon, if

we must have \mu(\mathopen{}\mathclose{{}\left\|{{X}}}\right\|_{2}\geq r_{\epsilon})\leq\epsilon. Hence whenever limsupmS(μm,TP,Gk)\lim\sup_{m}\mathcal{S}({\mu_{m}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}}) is bounded, we must have (μm)m1(\mu_{m})_{m\geq 1} is uniformly tight as limsupmR(μm,ϵ)\lim\sup_{m}R(\mu_{m},\epsilon) is finite.

Appendix I Proof of Proposition 9: KSD detects convergence

We will first state and prove a useful lemma.

To handle the other term above, notice that by Cauchy-Schwarz and the fact that min(a,b)ab\min(a,b)\leq\sqrt{ab} for a,b0a,b\geq 0,

The stated inequality now follows by taking the infimum of these bounds over all joint distributions (X,Z)(X,Z) with XμX\sim\mu and ZPZ\sim P. ∎

Appendix J Proof of Theorem 10: KSD fails for bounded scores

We can express k0(x,y)j=1dk0j(x,y)k_{0}(x,y)\triangleq\sum_{j=1}^{d}k_{0}^{j}(x,y) as

Let γ\gamma be the kernel decay rate defined in the statement of Theorem 6. Then as kC0(1,1)k\in C^{(1,1)}_{0}, we must have γ(0)<\gamma(0)<\infty and limrγ(r)=0\lim_{r\to\infty}\gamma(r)=0. By the triangle inequality

We now handle the second term of (24). By repeated use of Cauchy-Schwarz we have

By assumption, γ(r)0\gamma(r)\to 0 as rr\to\infty. Furthermore, since the second term of (24) is upper bounded by the average of the terms k0(xi,xi)k_{0}(x_{i},x_{i}^{\prime}) for iii\neq i^{\prime}, we have S(Qn,TP,Gk)0\mathcal{S}({Q_{n}},{\mathcal{T}_{P}{}},{\mathcal{G}_{k}})\to 0 as nn\to\infty. However, (Qn)n1(Q_{n})_{n\geq 1} is not uniformly tight and hence does not converge to the probability measure PP.