A Kernelized Stein Discrepancy for Goodness-of-fit Tests and Model Evaluation

Qiang Liu, Jason D. Lee, Michael I. Jordan

Introduction

Evaluating the goodness-of-fit of models over observed data is a fundamental task in machine learning and statistics. Traditional approaches often involve calculating or comparing the likelihoods or cumulative distribution functions (CDF) of the models. Unfortunately, modern learning techniques increasingly involve complex probabilistic models with computationally intractable likelihoods or CDFs, such as large graphical models, hidden variables models and deep generative models (Koller & Friedman, 2009; Salakhutdinov, 2015). Although Markov chain Monte Carlo (MCMC) or variational methods can be used to approximate the likelihood, their approximation errors are often large and hard to estimate, making it difficult to give results with calibrated statistical significance. In fact, it is often a #P-complete problem to calculate or even approximate likelihoods for graphical models (e.g, Chandrasekaran et al., 2008), making likelihood-based approaches fundamentally infeasible.

We propose a likelihood-free approach for model evaluation with guaranteed statistical significance. In particular, we consider the setting of goodness-of-fit testing, where we test whether a given sample $\{x_{i}\}\sim p(x)$ is drawn from a given distribution $q(x)$ , meaning $H_{0}:p=q$ . Our method is based on a new discrepancy measure between distributions that can be empirically estimated using $U$ -statistics, and depends on $q$ only through its score function ${\boldsymbol{s}}_{q}=\nabla_{x}\log q(x)$ ; this score function does not depend on the normalization constant in $q(x)$ , and can often be calculated efficiently even when the likelihood is intractable. This allows us to apply our methods to complex and high dimensional models on which the likelihood-based methods, or other traditional goodness-of-fit tests, such as $\chi^{2}$ -test and Kolmogorov-Smirnov test, can not be applied.

for smooth functions $f(x)$ with proper zero-boundary conditions, where ${\boldsymbol{s}}_{q}(x)=\nabla_{x}\log q(x)={\nabla_{x}q(x)}/{q(x)}$ is called the (Stein) score function of $q(x)$ ; when $p=q$ , (1) is known as Stein’s identity (e.g., Stein et al., 2004), and can be proved using integration by parts. As a result, one can define a Stein discrepancy measureOur definition is the square of the typical definition of Stein discrepancy, as such that in Gorham & Mackey (2015). between $p$ and $q$ via

where $x,x^{\prime}$ are i.i.d. random variables drawn from $p$ and $u_{q}(x,x^{\prime})$ is a function (defined in Theorem 3.6) that depends on $q$ only through the score function $\nabla_{x}\log q(x)$ which can be calculated efficiently even when $q$ has an intractable normalization constant. Specifically, assuming $q(x)=f(x)/Z$ with $Z=\int f(x)dx$ being the normalization constant, we have ${\boldsymbol{s}}_{q}=\nabla_{x}\log f(x)$ , independent of $Z$ ; calculating $Z$ involves a high dimension integration, and has been the major challenge for likelihood-based and Bayesian methods for model evaluation.

Related Work

Outline

Section 2 introduces RKHS and Stein’s identity. Section 3 defines our KSD and studies its main properties, and Section 4 discusses the empirical estimation of KSD and its application in goodness-of-fit tests. We discuss related methods in Section 5, present experiments in Section 6 and conclude the paper in Section 7.

Notations

Backgrounds

We first introduce positive definite kernels and reproducing kernel Hilbert spaces (RKHS) in Section 2.1, and then Stein’s identity and operator in Section 2.2.

Let $k(x,x^{\prime})$ be a positive definite kernel. The spectral decomposition of $k(x,x^{\prime})$ , as implied by Mercer’s theorem, is defined as

For a positive definite kernel $k(x,x^{\prime})$ , its related RKHS $\mathcal{H}$ comprises of linear combinations of its eigenfunctions, i.e., $f(x)=\sum_{j}f_{j}e_{j}(x)$ with $\sum_{j}f_{j}^{2}/\lambda_{j}<\infty$ , endowed with an inner product $\langle f,g\rangle_{\mathcal{H}}=\sum_{j}f_{j}g_{j}/\lambda_{j}$ between $f(x)$ and $g(x)=\sum_{j}g_{j}e_{j}(x)$ . Thus this Hilbert space is equipped with a norm $||f||_{\mathcal{H}}$ where $||f||_{\mathcal{H}}^{2}=\langle f,f\rangle_{\mathcal{H}}=\sum_{j}f_{j}^{2}/\lambda_{j}$ . One can verify that $k(x,\cdot)$ is in $\mathcal{H}$ and satisfies the important “reproducing” property,

Every positive definite kernel $k$ defines a unique RKHS for which $k$ is a reproducing kernel.

2 Stein’s Identity and Operator

The Stein’s operator of $p$ is a linear operator acting on the Stein class of $p$ , defined as

Assume $p(x)$ is a smooth density supported on $\mathcal{X}$ , then

for any $\boldsymbol{f}$ that is in the Stein class of $p$ .

By the definition of the Stein class, simply note that ${\boldsymbol{s}}_{p}(x)\boldsymbol{f}(x)^{\top}+\nabla\boldsymbol{f}(x)=\nabla_{x}(\boldsymbol{f}(x)p(x))/p(x).$ ∎

The following result gives a convenient tool for our derivation; it relates the expectation under $p$ of Stein’s operator ${\mathcal{A}}_{q}f$ with the difference of the score functions of $p$ and $q$ .

Assume $p(x)$ and $q(x)$ are smooth densities supported on $\mathcal{X}$ and $\boldsymbol{f}(x)$ is in the Stein class of $p$ , we have

which was first derived in Gorham & Mackey (2015) using Langevin diffusion. It is an interesting direction to consider the possibility of using determinant or other matrix norms instead of the trace.

Kernelized Stein Discrepancy

We introduce our kernelized Stein discrepancy (KSD) with an elementary definition motivated by Lemma 2.3, and then establish its connection with Stein’s method and RKHS.

A kernel $k(x,x^{\prime})$ is said to be integrally strictly positive definition, if for any function $g$ that satisfies $0<||g||_{2}^{2}<\infty$ ,

where $\boldsymbol{\delta}_{q,p}(x)={\boldsymbol{s}}_{q}(x)-{\boldsymbol{s}}_{p}(x)$ is the score difference between $p$ and $q$ , and $x$ , $x^{\prime}$ are i.i.d. draws from $p(x)$ .

Result directly follows the definition in (7). ∎

A kernel $k(x,x^{\prime})$ is said to be in the Stein class of $p$ if $k(x,x^{\prime})$ has continuous second order partial derivatives, and both $k(x,\cdot)$ and $k(\cdot,x)$ are in the Stein class of $p$ for any fixed $x$ .

If $k(x,x^{\prime})$ is in the Stein class of $p$ , so is any $f\in\mathcal{H}$ .

Assume $p$ and $q$ are smooth densities and $k(x,x^{\prime})$ is in the Stein class of $p$ . Define

Apply Lemma 2.3 twice, first on $k(\cdot,x^{\prime})$ for fixed $x^{\prime}$ , and then with fixed $x$ . See the Appendix. ∎

Assume $k(x,x^{\prime})$ is a positive definite kernel in the Stein class of $p$ , with positive eigenvalues $\{\lambda_{j}\}$ and eigenfunctions $\{e_{j}(x)\}$ , then $u_{q}(x,x^{\prime})$ is also a positive definite kernel, and can be rewritten into

where ${\mathcal{A}}_{q}e_{j}(x)={\boldsymbol{s}}_{q}(x)e_{j}(x)+\nabla_{x}e_{j}(x)$ is the Stein’s operator acted on $e_{j}$ . In addition,

Note that although $\{e_{j}\}$ are orthonormal, the $\{{\mathcal{A}}_{q}e_{j}(x)\}$ are no longer orthonormal in general.

where the maximum is achieved when $\boldsymbol{f}=\boldsymbol{\beta}/||\boldsymbol{\beta}||_{\mathcal{H}^{d}}$ .

Goodness-of-fit Testing Based on KSD

2) If $p=q$ , then we have $\sigma_{u}^{2}=0$ (the $U$ -statistics is degenerate) and

where $\{Z_{j}\}$ are i.i.d. standard Gaussian random variables, and $\{c_{j}\}$ are the eigenvalues of kernel $u_{q}(x,x^{\prime})$ under $p(x)$ , that is, they are the solutions of $c_{j}\phi_{j}(x)=\int_{x^{\prime}}u_{q}(x,x^{\prime})\phi_{j}(x^{\prime})p(x^{\prime})dx^{\prime}$ for non-zero $\phi_{j}$ .

Using the standard asymptotic results of $U$ -statistics in Serfling (2009, Section 5.5), we just need to check that $\sigma_{u}^{2}\neq 0$ when $p\neq q$ and $\sigma_{u}^{2}=0$ when $p=q$ . See Appendix for details. ∎

One difficulty in implementing this test is that the limit distribution in (15) and its $\alpha$ -quantile does not have analytic form unless $c_{j}=0,$ or $1$ . Fortunately, the same type of asymptotics appears in many other classical goodness-of-fit tests, such as Cramer-von Mises test, Anderson-Darling test, as well as two-sample tests (Gretton et al., 2012). As a consequence, a line of work has been devoted to approximating the critical values of (15), including bootstrap methods (Arcones & Gine, 1992; Huskova & Janssen, 1993; Chwialkowski et al., 2014) and eigenvalue approximation (Gretton et al., 2009).

Assume the conditions in Theorem 4.1. If $p=q$ , then as the bootstrap sample size $m\to\infty$ ,

that is, the bootstrap test attains the correct significance level asymptotically (consistent in level).

It is important to note, on the other hand, that the more usual bootstrap, such as $\sum_{i\neq j}w_{i}w_{j}u_{q}(x_{i},x_{j})$ , may not work for degenerate $U$ -statistics as discussed in Arcones & Gine (1992).

This bootstrap test is summarized in Algorithm 1; its cost is $O(mn^{2})$ where $n$ is the size of the sample $\{x_{i}\}$ and $m$ the bootstrap sample size. A more computationally efficient, but less statistically powerful, method can be constructed based on the following linear estimator:

Related Methods

We discuss the connection with Fisher divergence and maximum discrepancy measure (MMD).

Fisher divergence, also known as Fisher information distance (Johnson, 2004), is defined as

1) Following Definition (8) and (18), we have

2) In addition, if $k(x,x^{\prime})$ is positive definite and in the Stein class of $p$ , and ${\boldsymbol{s}}_{q}-{\boldsymbol{s}}_{p}\in\mathcal{H}^{d}$ , we have, for $p\neq q$ ,

(19) is a simple result of Cauchy-Schwarz inequality, and (20) can be obtained by taking $f=({\boldsymbol{s}}_{q}-{\boldsymbol{s}}_{p})/||{\boldsymbol{s}}_{q}(x)-{\boldsymbol{s}}_{p}(x)||_{\mathcal{H}^{d}}$ in (13). See Appendix. ∎

(19) suggests that the convergence in Fisher divergence is stronger than that in KSD. In fact, using Stein’s method, Ley & Swan (2013) showed that Fisher divergence is stronger than most other divergences, including KL, total variation and Hellinger distances.

2 Maximum Mean Discrepancy & Two-sample Tests

Closely related to goodness-of-fit tests are two sample tests, which test whether two i.i.d. samples $\{x_{i}\}$ and $\{y_{i}\}$ are drawn from the same distribution. In principle, one can turn a goodness-of-fit test into a two sample test by drawing $\{y_{i}\}$ from $q(x)$ . However, it is often difficult to draw exact i.i.d. samples for practical models, and furthermore MCMC sampling may be computationally expensive, suffer from the convergence problems, and introduce undesired correlations. When the MCMC approximation is poor, the two sample test would reject the null even when $p=q$ (inconsistent in level).

Maximum Mean Discrepancy (Gretton et al., 2012) is a nonparametric distance measure widely used for two sample tests, defined as

Experiments

We present empirical results in this section. We start with a toy case of 1D Gaussian mixture on which we can compare with the classical goodness-of-fit tests that only work for univariate distributions, and then proceed to Gaussian-Bernoulli restricted Boltzmann machine (RBM), a graphical model widely used in deep learning (Welling et al., 2004; Hinton & Salakhutdinov, 2006). The following methods are evaluated, all with a significance level of $0.05$ :

1) KSD-U. The KSD-based bootstrap test using $U$ -statistic in Algorithm 1 (bootstrap size is $1000$ ), using RBF kernel with bandwidth chosen to be median of the data distances.

2) KSD-Linear. The KSD test based on the linear estimator in (17) with asymptotically normal null distribution.

3) Classical goodness-of-fit tests, including $\chi^{2}$ test, Kolmogorov-Smirnov test and Cramer-von Mises test (Lehmann & Romano, 2006); they are evaluated on only the 1D Gaussian mixture.

4) MMD-MC( ${n^{\prime}}$ ). Draw exact sample $\{y_{i}\}$ of size $n^{\prime}$ from $q(x)$ and perform two sample MMD test of Gretton et al. (2012) over $\{x_{i}\}$ and $\{y_{i}\}$ using bootstrapWe use the mmdTestBoot.m under http://www.gatsby.ucl.ac.uk/%7Egretton/mmd/mmd.htm, with 1000 bootstrap replicates.

5) MMD-MCMC( ${n^{\prime}}$ ). Draw approximate sample $\{y_{i}\}$ of size $n^{\prime}$ from $q(x)$ using Gibbs sampler and perform MMD test on $\{x_{i}\}$ and $\{y_{i}\}$ ; we use $1000$ burn-in steps.

6) LR (simple vs. simple). We evaluate the exact log-likelihood ratio $2\log(q(x)/p(x))$ and use it to test whether $\{x_{i}\}$ is drawn from $p(x)$ or $q(x)$ . This approach is an oracle test in that it knows it exactly calculates the likelihood, and assumes we know $p(x)$ and tests a much easier null hypothesis of simple vs. simple.

7) Likelihood Ratio (AIS). We approximately evaluate the likelihood ratio using annealed importance sampling (AIS), which is one of the most widely used algorithm for approximating likelihood (Neal, 2001; Salakhutdinov & Murray, 2008). Our AIS implementation uses a Gibbs sampler transition with a linear temperature grid of size $1000$ . We do not perform a test based on the AIS result because it is hard to know the approximation error.

We find from Figure 1 that the oracle LR (simple vs. simple) performs the best as expected. Otherwise, our KSD-U performs comparably with, or better than, the classical tests ( $\chi^{2}$ , Kolmogorov-Smirnov and Cramer-Von Mises) as well as MMD-MC(1000). KSD-Linear tends to perform the worst, suggesting it is not useful in this simple setting. However, it can serve as a computationally efficient alternative of KSD-U for more complex models on which the other tests are not practical. Note that because both the cases of $p=q$ and $p\neq q$ happen with $0.5$ probability in our simulation, the error rate in the hardest case when $p$ is close $q$ is $0.5$ .

Gaussian-Bernoulli Restricted Boltzmann Machine (RBM)

where $Z$ is the normalization constant. The probability of the observable variable $x$ is $p(x)=\sum_{h\in\{\pm 1\}^{d^{\prime}}}p(x,h),$ which is intractable to calculate due to the difficult constant term $Z$ . Nevertheless, one can show that its score function ${\boldsymbol{s}}_{p}$ can be easily calculated in a closed form,

In our experiment, we simulate a true model $p(x)$ by drawing $b$ and $c$ from standard Gaussian and select $B$ uniformly randomly from $\{\pm 1\}$ ; we use $d=50$ observable variables and $d^{\prime}=10$ hidden variables, so that it remains possible to exactly calculate $p(x)$ and draw exact samples using the brute-force algorithm. Similar to the case of 1D Gaussian mixture, we set $q(x)$ randomly with equal probability to be equal to either $p(x)$ or a perturbed version by adding Gaussian noise to $B$ with variance $\sigma_{per}^{2}$ . We report the the error rates of different tests in Figure 2; the results are averaged on $1000$ random trials.

Figure 2(a) shows that the oracle LR (simple vs. simple) performs the best again as expected, followed by our KSD-U method. The MMD-MCMC breaks down because the MCMC sample is not representative of $q$ , while the performance of MMD-MC depends on the size of the exact sample: it performs worse than KSD-U with MMD-MC(100), and is almost as good with MMD-MC(1000); see also Figure 2(b). Again, we find that KSD-linear generally performs much worse than KSD-U, but it provides a computationally efficient $O(n)$ alternative to KSD-U which has a $O(mn^{2})$ complexity and MMD which costs $O(mnn^{\prime})$ . A trade-off between linear and quadratic complexity can be achieved using block averaging; see Zaremba et al. (2013).

Figure 2(c)-(h) shows the different discrepancy measures under the case $p=q$ and $p\neq q$ , respectively. Again, we can find that the exact likelihood ratio provides the best discrimination, while MMD-MCMC fails to distinguish the two cases at all. The AIS approximation performs reasonably well, but is worse than KSD-U and MMD-MC(1000) in this particular case.

Conclusion and Future Directions

We propose a new computationally tractable discrepancy measure between complex probability models, and use it to derive a novel class of goodness-of-fit tests. We believe our discrepancy measure provides a new fundamental tool for analyzing and using complex probability models in statistics and machine learning. Future directions include extending our method to composite goodness-of-fit tests, in which we want to test if the observed data follows a given class of distributions, as well as understanding the theoretical discrimination power of KSD compared to the other classical goodness-of-fit tests, two sample tests (e.g., MMD with infinite exact Monte Carlo sample), and the method in Gorham & Mackey (2015).

This work is supported in part by NSF CRII 1565796. We thank Arthur Gretton and the anonymous reviewers for their valuable comments.

References

Appendix A Proofs

1) Denote by $\boldsymbol{v}(x,x^{\prime})=k(x,x^{\prime}){\boldsymbol{s}}_{q}(x^{\prime})+\nabla_{x^{\prime}}k(x,x^{\prime})={\mathcal{A}}_{q}k_{x}(x^{\prime})$ ; applying Lemma 2.3 on $k(x,\cdot)$ with fixed $x$ ,

Because $k(\cdot,x^{\prime})$ is in the Stein class of $p$ for any $x^{\prime}$ , we can show that $\nabla_{x^{\prime}}k(\cdot,x^{\prime})$ is also in the Stein class, since

and hence $\boldsymbol{v}(\cdot,x^{\prime})$ is also in the Stein class; apply Lemma 2.3 on $\boldsymbol{v}(\cdot,x^{\prime})$ with fixed $x^{\prime}$ gives

The result then follows by noting that $\nabla_{x}\boldsymbol{v}(x,x^{\prime})=\nabla_{x}k(x,x^{\prime}){\boldsymbol{s}}_{q}(x^{\prime})^{\top}+\nabla_{x^{\prime}x^{\prime}}k(x,x^{\prime}).$ ∎

Therefore, $u_{q}(x,x^{\prime})$ is positive definite because $\lambda_{j}>0$ . In addition,

We first prove (12) by applying the reproducing property $k(x,x^{\prime})=\langle k(x,\cdot),k(x^{\prime},\cdot)\rangle_{\mathcal{H}}$ on (8):

where we used the fact that $\nabla_{x}f(x)=\langle f(\cdot),~{}\nabla_{x}k(x,\cdot)\rangle_{\mathcal{H}}$ ; see (Zhou, 2008; Steinwart & Christmann, 2008). The variational form (13) then follows the fact that $||\boldsymbol{\beta}||_{\mathcal{H}^{d}}=\max_{\boldsymbol{f}\in\mathcal{H}^{d}}\big{\{}\langle\boldsymbol{f},\boldsymbol{\beta}\rangle_{\mathcal{H}^{d}},~{}~{}~{}.s.t.~{}~{}||\boldsymbol{f}||_{\mathcal{H}^{d}}\leq 1\big{\}}$ .

For any $f\in\mathcal{H}$ with kernel $k(x,x^{\prime})$ , we have $f=\langle f,~{}k(\cdot,x)\rangle_{\mathcal{H}}$ and $\nabla_{x}f=\langle f,~{}\nabla_{x}k(x,\cdot)\rangle_{\mathcal{H}}$ . Therefore,

Applying the standard asymptotic results of $U$ -statistics in Serfling (2009, Section 5.5), we just need to check that $\sigma_{u}^{2}\neq 0$ when $p\neq q$ and $\sigma_{u}^{2}=0$ when $p=q$ .

(19) is obtained by applying Cauchy-Schwarz inequality on (8),

To prove (20), we simply note that (13) is equivalent to

Taking $\boldsymbol{f}=({\boldsymbol{s}}_{q}-{\boldsymbol{s}}_{p})/||{\boldsymbol{s}}_{q}(x)-{\boldsymbol{s}}_{p}(x)||_{\mathcal{H}^{d}}$ then gives (20). ∎

Let $\mathcal{F}(p)=\mathcal{L}^{2}(p)\cap\mathcal{S}(p)$ , where $\mathcal{S}(p)$ represents the Stein class of $p$ , then we have

and the equality holds when ${\boldsymbol{s}}_{q}-{\boldsymbol{s}}_{p}\in\mathcal{F}(p)^{d}$ .

Note that $\mathcal{L}^{2}(p)$ is larger than the Stein class and RKHS, and includes discontinuous, non-smooth functions, and hence we need to ensure $\boldsymbol{f}$ is in the Stein class explicitly.

Restricting the maximizing to $\mathcal{F}(p)^{d}$ and applying Lemma 2.3 would give the result. ∎