A Kernelized Stein Discrepancy for Goodness-of-fit Tests and Model Evaluation

Qiang Liu, Jason D. Lee, Michael I. Jordan

Introduction

Evaluating the goodness-of-fit of models over observed data is a fundamental task in machine learning and statistics. Traditional approaches often involve calculating or comparing the likelihoods or cumulative distribution functions (CDF) of the models. Unfortunately, modern learning techniques increasingly involve complex probabilistic models with computationally intractable likelihoods or CDFs, such as large graphical models, hidden variables models and deep generative models (Koller & Friedman, 2009; Salakhutdinov, 2015). Although Markov chain Monte Carlo (MCMC) or variational methods can be used to approximate the likelihood, their approximation errors are often large and hard to estimate, making it difficult to give results with calibrated statistical significance. In fact, it is often a #P-complete problem to calculate or even approximate likelihoods for graphical models (e.g, Chandrasekaran et al., 2008), making likelihood-based approaches fundamentally infeasible.

We propose a likelihood-free approach for model evaluation with guaranteed statistical significance. In particular, we consider the setting of goodness-of-fit testing, where we test whether a given sample {xi}p(x)\{x_{i}\}\sim p(x) is drawn from a given distribution q(x)q(x), meaning H0:p=qH_{0}:p=q. Our method is based on a new discrepancy measure between distributions that can be empirically estimated using UU-statistics, and depends on qq only through its score function sq=xlogq(x){\boldsymbol{s}}_{q}=\nabla_{x}\log q(x); this score function does not depend on the normalization constant in q(x)q(x), and can often be calculated efficiently even when the likelihood is intractable. This allows us to apply our methods to complex and high dimensional models on which the likelihood-based methods, or other traditional goodness-of-fit tests, such as χ2\chi^{2}-test and Kolmogorov-Smirnov test, can not be applied.

for smooth functions f(x)f(x) with proper zero-boundary conditions, where sq(x)=xlogq(x)=xq(x)/q(x){\boldsymbol{s}}_{q}(x)=\nabla_{x}\log q(x)={\nabla_{x}q(x)}/{q(x)} is called the (Stein) score function of q(x)q(x); when p=qp=q, (1) is known as Stein’s identity (e.g., Stein et al., 2004), and can be proved using integration by parts. As a result, one can define a Stein discrepancy measureOur definition is the square of the typical definition of Stein discrepancy, as such that in Gorham & Mackey (2015). between pp and qq via

where x,xx,x^{\prime} are i.i.d. random variables drawn from pp and uq(x,x)u_{q}(x,x^{\prime}) is a function (defined in Theorem 3.6) that depends on qq only through the score function xlogq(x)\nabla_{x}\log q(x) which can be calculated efficiently even when qq has an intractable normalization constant. Specifically, assuming q(x)=f(x)/Zq(x)=f(x)/Z with Z=f(x)dxZ=\int f(x)dx being the normalization constant, we have sq=xlogf(x){\boldsymbol{s}}_{q}=\nabla_{x}\log f(x), independent of ZZ; calculating ZZ involves a high dimension integration, and has been the major challenge for likelihood-based and Bayesian methods for model evaluation.

Related Work

Outline

Section 2 introduces RKHS and Stein’s identity. Section 3 defines our KSD and studies its main properties, and Section 4 discusses the empirical estimation of KSD and its application in goodness-of-fit tests. We discuss related methods in Section 5, present experiments in Section 6 and conclude the paper in Section 7.

Notations

Backgrounds

We first introduce positive definite kernels and reproducing kernel Hilbert spaces (RKHS) in Section 2.1, and then Stein’s identity and operator in Section 2.2.

Let k(x,x)k(x,x^{\prime}) be a positive definite kernel. The spectral decomposition of k(x,x)k(x,x^{\prime}), as implied by Mercer’s theorem, is defined as

For a positive definite kernel k(x,x)k(x,x^{\prime}), its related RKHS H\mathcal{H} comprises of linear combinations of its eigenfunctions, i.e., f(x)=jfjej(x)f(x)=\sum_{j}f_{j}e_{j}(x) with jfj2/λj<\sum_{j}f_{j}^{2}/\lambda_{j}<\infty, endowed with an inner product f,gH=jfjgj/λj\langle f,g\rangle_{\mathcal{H}}=\sum_{j}f_{j}g_{j}/\lambda_{j} between f(x)f(x) and g(x)=jgjej(x)g(x)=\sum_{j}g_{j}e_{j}(x). Thus this Hilbert space is equipped with a norm fH||f||_{\mathcal{H}} where fH2=f,fH=jfj2/λj||f||_{\mathcal{H}}^{2}=\langle f,f\rangle_{\mathcal{H}}=\sum_{j}f_{j}^{2}/\lambda_{j}. One can verify that k(x,)k(x,\cdot) is in H\mathcal{H} and satisfies the important “reproducing” property,

Every positive definite kernel kk defines a unique RKHS for which kk is a reproducing kernel.

2 Stein’s Identity and Operator

The Stein’s operator of pp is a linear operator acting on the Stein class of pp, defined as

Assume p(x)p(x) is a smooth density supported on X\mathcal{X}, then

for any f\boldsymbol{f} that is in the Stein class of pp.

By the definition of the Stein class, simply note that sp(x)f(x)+f(x)=x(f(x)p(x))/p(x).{\boldsymbol{s}}_{p}(x)\boldsymbol{f}(x)^{\top}+\nabla\boldsymbol{f}(x)=\nabla_{x}(\boldsymbol{f}(x)p(x))/p(x).

The following result gives a convenient tool for our derivation; it relates the expectation under pp of Stein’s operator Aqf{\mathcal{A}}_{q}f with the difference of the score functions of pp and qq.

Assume p(x)p(x) and q(x)q(x) are smooth densities supported on X\mathcal{X} and f(x)\boldsymbol{f}(x) is in the Stein class of pp, we have

which was first derived in Gorham & Mackey (2015) using Langevin diffusion. It is an interesting direction to consider the possibility of using determinant or other matrix norms instead of the trace.

Kernelized Stein Discrepancy

We introduce our kernelized Stein discrepancy (KSD) with an elementary definition motivated by Lemma 2.3, and then establish its connection with Stein’s method and RKHS.

A kernel k(x,x)k(x,x^{\prime}) is said to be integrally strictly positive definition, if for any function gg that satisfies 0<g22<0<||g||_{2}^{2}<\infty,

where δq,p(x)=sq(x)sp(x)\boldsymbol{\delta}_{q,p}(x)={\boldsymbol{s}}_{q}(x)-{\boldsymbol{s}}_{p}(x) is the score difference between pp and qq, and xx, xx^{\prime} are i.i.d. draws from p(x)p(x).

Result directly follows the definition in (7). ∎

A kernel k(x,x)k(x,x^{\prime}) is said to be in the Stein class of pp if k(x,x)k(x,x^{\prime}) has continuous second order partial derivatives, and both k(x,)k(x,\cdot) and k(,x)k(\cdot,x) are in the Stein class of pp for any fixed xx.

If k(x,x)k(x,x^{\prime}) is in the Stein class of pp, so is any fHf\in\mathcal{H}.

Assume pp and qq are smooth densities and k(x,x)k(x,x^{\prime}) is in the Stein class of pp. Define

Apply Lemma 2.3 twice, first on k(,x)k(\cdot,x^{\prime}) for fixed xx^{\prime}, and then with fixed xx. See the Appendix. ∎

Assume k(x,x)k(x,x^{\prime}) is a positive definite kernel in the Stein class of pp, with positive eigenvalues {λj}\{\lambda_{j}\} and eigenfunctions {ej(x)}\{e_{j}(x)\}, then uq(x,x)u_{q}(x,x^{\prime}) is also a positive definite kernel, and can be rewritten into

where Aqej(x)=sq(x)ej(x)+xej(x){\mathcal{A}}_{q}e_{j}(x)={\boldsymbol{s}}_{q}(x)e_{j}(x)+\nabla_{x}e_{j}(x) is the Stein’s operator acted on eje_{j}. In addition,

Note that although {ej}\{e_{j}\} are orthonormal, the {Aqej(x)}\{{\mathcal{A}}_{q}e_{j}(x)\} are no longer orthonormal in general.

where the maximum is achieved when f=β/βHd\boldsymbol{f}=\boldsymbol{\beta}/||\boldsymbol{\beta}||_{\mathcal{H}^{d}}.

Goodness-of-fit Testing Based on KSD

2) If p=qp=q, then we have σu2=0\sigma_{u}^{2}=0 (the UU-statistics is degenerate) and

where {Zj}\{Z_{j}\} are i.i.d. standard Gaussian random variables, and {cj}\{c_{j}\} are the eigenvalues of kernel uq(x,x)u_{q}(x,x^{\prime}) under p(x)p(x), that is, they are the solutions of cjϕj(x)=xuq(x,x)ϕj(x)p(x)dxc_{j}\phi_{j}(x)=\int_{x^{\prime}}u_{q}(x,x^{\prime})\phi_{j}(x^{\prime})p(x^{\prime})dx^{\prime} for non-zero ϕj\phi_{j}.

Using the standard asymptotic results of UU-statistics in Serfling (2009, Section 5.5), we just need to check that σu20\sigma_{u}^{2}\neq 0 when pqp\neq q and σu2=0\sigma_{u}^{2}=0 when p=qp=q. See Appendix for details. ∎

One difficulty in implementing this test is that the limit distribution in (15) and its α\alpha-quantile does not have analytic form unless cj=0,c_{j}=0, or 11. Fortunately, the same type of asymptotics appears in many other classical goodness-of-fit tests, such as Cramer-von Mises test, Anderson-Darling test, as well as two-sample tests (Gretton et al., 2012). As a consequence, a line of work has been devoted to approximating the critical values of (15), including bootstrap methods (Arcones & Gine, 1992; Huskova & Janssen, 1993; Chwialkowski et al., 2014) and eigenvalue approximation (Gretton et al., 2009).

Assume the conditions in Theorem 4.1. If p=qp=q, then as the bootstrap sample size mm\to\infty,

that is, the bootstrap test attains the correct significance level asymptotically (consistent in level).

It is important to note, on the other hand, that the more usual bootstrap, such as ijwiwjuq(xi,xj)\sum_{i\neq j}w_{i}w_{j}u_{q}(x_{i},x_{j}), may not work for degenerate UU-statistics as discussed in Arcones & Gine (1992).

This bootstrap test is summarized in Algorithm 1; its cost is O(mn2)O(mn^{2}) where nn is the size of the sample {xi}\{x_{i}\} and mm the bootstrap sample size. A more computationally efficient, but less statistically powerful, method can be constructed based on the following linear estimator:

Related Methods

We discuss the connection with Fisher divergence and maximum discrepancy measure (MMD).

Fisher divergence, also known as Fisher information distance (Johnson, 2004), is defined as

1) Following Definition (8) and (18), we have

2) In addition, if k(x,x)k(x,x^{\prime}) is positive definite and in the Stein class of pp, and sqspHd{\boldsymbol{s}}_{q}-{\boldsymbol{s}}_{p}\in\mathcal{H}^{d}, we have, for pqp\neq q,

(19) is a simple result of Cauchy-Schwarz inequality, and (20) can be obtained by taking f=(sqsp)/sq(x)sp(x)Hdf=({\boldsymbol{s}}_{q}-{\boldsymbol{s}}_{p})/||{\boldsymbol{s}}_{q}(x)-{\boldsymbol{s}}_{p}(x)||_{\mathcal{H}^{d}} in (13). See Appendix. ∎

(19) suggests that the convergence in Fisher divergence is stronger than that in KSD. In fact, using Stein’s method, Ley & Swan (2013) showed that Fisher divergence is stronger than most other divergences, including KL, total variation and Hellinger distances.

2 Maximum Mean Discrepancy & Two-sample Tests

Closely related to goodness-of-fit tests are two sample tests, which test whether two i.i.d. samples {xi}\{x_{i}\} and {yi}\{y_{i}\} are drawn from the same distribution. In principle, one can turn a goodness-of-fit test into a two sample test by drawing {yi}\{y_{i}\} from q(x)q(x). However, it is often difficult to draw exact i.i.d. samples for practical models, and furthermore MCMC sampling may be computationally expensive, suffer from the convergence problems, and introduce undesired correlations. When the MCMC approximation is poor, the two sample test would reject the null even when p=qp=q (inconsistent in level).

Maximum Mean Discrepancy (Gretton et al., 2012) is a nonparametric distance measure widely used for two sample tests, defined as

Experiments

We present empirical results in this section. We start with a toy case of 1D Gaussian mixture on which we can compare with the classical goodness-of-fit tests that only work for univariate distributions, and then proceed to Gaussian-Bernoulli restricted Boltzmann machine (RBM), a graphical model widely used in deep learning (Welling et al., 2004; Hinton & Salakhutdinov, 2006). The following methods are evaluated, all with a significance level of 0.050.05:

1) KSD-U. The KSD-based bootstrap test using UU-statistic in Algorithm 1 (bootstrap size is 10001000), using RBF kernel with bandwidth chosen to be median of the data distances.

2) KSD-Linear. The KSD test based on the linear estimator in (17) with asymptotically normal null distribution.

3) Classical goodness-of-fit tests, including χ2\chi^{2} test, Kolmogorov-Smirnov test and Cramer-von Mises test (Lehmann & Romano, 2006); they are evaluated on only the 1D Gaussian mixture.

4) MMD-MC(n{n^{\prime}}). Draw exact sample {yi}\{y_{i}\} of size nn^{\prime} from q(x)q(x) and perform two sample MMD test of Gretton et al. (2012) over {xi}\{x_{i}\} and {yi}\{y_{i}\} using bootstrapWe use the mmdTestBoot.m under http://www.gatsby.ucl.ac.uk/%7Egretton/mmd/mmd.htm, with 1000 bootstrap replicates.

5) MMD-MCMC(n{n^{\prime}}). Draw approximate sample {yi}\{y_{i}\} of size nn^{\prime} from q(x)q(x) using Gibbs sampler and perform MMD test on {xi}\{x_{i}\} and {yi}\{y_{i}\}; we use 10001000 burn-in steps.

6) LR (simple vs. simple). We evaluate the exact log-likelihood ratio 2log(q(x)/p(x))2\log(q(x)/p(x)) and use it to test whether {xi}\{x_{i}\} is drawn from p(x)p(x) or q(x)q(x). This approach is an oracle test in that it knows it exactly calculates the likelihood, and assumes we know p(x)p(x) and tests a much easier null hypothesis of simple vs. simple.

7) Likelihood Ratio (AIS). We approximately evaluate the likelihood ratio using annealed importance sampling (AIS), which is one of the most widely used algorithm for approximating likelihood (Neal, 2001; Salakhutdinov & Murray, 2008). Our AIS implementation uses a Gibbs sampler transition with a linear temperature grid of size 10001000. We do not perform a test based on the AIS result because it is hard to know the approximation error.

We find from Figure 1 that the oracle LR (simple vs. simple) performs the best as expected. Otherwise, our KSD-U performs comparably with, or better than, the classical tests (χ2\chi^{2}, Kolmogorov-Smirnov and Cramer-Von Mises) as well as MMD-MC(1000). KSD-Linear tends to perform the worst, suggesting it is not useful in this simple setting. However, it can serve as a computationally efficient alternative of KSD-U for more complex models on which the other tests are not practical. Note that because both the cases of p=qp=q and pqp\neq q happen with 0.50.5 probability in our simulation, the error rate in the hardest case when pp is close qq is 0.50.5.

Gaussian-Bernoulli Restricted Boltzmann Machine (RBM)

where ZZ is the normalization constant. The probability of the observable variable xx is p(x)=h{±1}dp(x,h),p(x)=\sum_{h\in\{\pm 1\}^{d^{\prime}}}p(x,h), which is intractable to calculate due to the difficult constant term ZZ. Nevertheless, one can show that its score function sp{\boldsymbol{s}}_{p} can be easily calculated in a closed form,

In our experiment, we simulate a true model p(x)p(x) by drawing bb and cc from standard Gaussian and select BB uniformly randomly from {±1}\{\pm 1\}; we use d=50d=50 observable variables and d=10d^{\prime}=10 hidden variables, so that it remains possible to exactly calculate p(x)p(x) and draw exact samples using the brute-force algorithm. Similar to the case of 1D Gaussian mixture, we set q(x)q(x) randomly with equal probability to be equal to either p(x)p(x) or a perturbed version by adding Gaussian noise to BB with variance σper2\sigma_{per}^{2}. We report the the error rates of different tests in Figure 2; the results are averaged on 10001000 random trials.

Figure 2(a) shows that the oracle LR (simple vs. simple) performs the best again as expected, followed by our KSD-U method. The MMD-MCMC breaks down because the MCMC sample is not representative of qq, while the performance of MMD-MC depends on the size of the exact sample: it performs worse than KSD-U with MMD-MC(100), and is almost as good with MMD-MC(1000); see also Figure 2(b). Again, we find that KSD-linear generally performs much worse than KSD-U, but it provides a computationally efficient O(n)O(n) alternative to KSD-U which has a O(mn2)O(mn^{2}) complexity and MMD which costs O(mnn)O(mnn^{\prime}). A trade-off between linear and quadratic complexity can be achieved using block averaging; see Zaremba et al. (2013).

Figure 2(c)-(h) shows the different discrepancy measures under the case p=qp=q and pqp\neq q, respectively. Again, we can find that the exact likelihood ratio provides the best discrimination, while MMD-MCMC fails to distinguish the two cases at all. The AIS approximation performs reasonably well, but is worse than KSD-U and MMD-MC(1000) in this particular case.

Conclusion and Future Directions

We propose a new computationally tractable discrepancy measure between complex probability models, and use it to derive a novel class of goodness-of-fit tests. We believe our discrepancy measure provides a new fundamental tool for analyzing and using complex probability models in statistics and machine learning. Future directions include extending our method to composite goodness-of-fit tests, in which we want to test if the observed data follows a given class of distributions, as well as understanding the theoretical discrimination power of KSD compared to the other classical goodness-of-fit tests, two sample tests (e.g., MMD with infinite exact Monte Carlo sample), and the method in Gorham & Mackey (2015).

This work is supported in part by NSF CRII 1565796. We thank Arthur Gretton and the anonymous reviewers for their valuable comments.

References

Appendix A Proofs

1) Denote by v(x,x)=k(x,x)sq(x)+xk(x,x)=Aqkx(x)\boldsymbol{v}(x,x^{\prime})=k(x,x^{\prime}){\boldsymbol{s}}_{q}(x^{\prime})+\nabla_{x^{\prime}}k(x,x^{\prime})={\mathcal{A}}_{q}k_{x}(x^{\prime}); applying Lemma 2.3 on k(x,)k(x,\cdot) with fixed xx,

Because k(,x)k(\cdot,x^{\prime}) is in the Stein class of pp for any xx^{\prime}, we can show that xk(,x)\nabla_{x^{\prime}}k(\cdot,x^{\prime}) is also in the Stein class, since

and hence v(,x)\boldsymbol{v}(\cdot,x^{\prime}) is also in the Stein class; apply Lemma 2.3 on v(,x)\boldsymbol{v}(\cdot,x^{\prime}) with fixed xx^{\prime} gives

The result then follows by noting that xv(x,x)=xk(x,x)sq(x)+xxk(x,x).\nabla_{x}\boldsymbol{v}(x,x^{\prime})=\nabla_{x}k(x,x^{\prime}){\boldsymbol{s}}_{q}(x^{\prime})^{\top}+\nabla_{x^{\prime}x^{\prime}}k(x,x^{\prime}).

Therefore, uq(x,x)u_{q}(x,x^{\prime}) is positive definite because λj>0\lambda_{j}>0. In addition,

We first prove (12) by applying the reproducing property k(x,x)=k(x,),k(x,)Hk(x,x^{\prime})=\langle k(x,\cdot),k(x^{\prime},\cdot)\rangle_{\mathcal{H}} on (8):

where we used the fact that xf(x)=f(), xk(x,)H\nabla_{x}f(x)=\langle f(\cdot),~{}\nabla_{x}k(x,\cdot)\rangle_{\mathcal{H}}; see (Zhou, 2008; Steinwart & Christmann, 2008). The variational form (13) then follows the fact that ||\boldsymbol{\beta}||_{\mathcal{H}^{d}}=\max_{\boldsymbol{f}\in\mathcal{H}^{d}}\big{\{}\langle\boldsymbol{f},\boldsymbol{\beta}\rangle_{\mathcal{H}^{d}},~{}~{}~{}.s.t.~{}~{}||\boldsymbol{f}||_{\mathcal{H}^{d}}\leq 1\big{\}}.

For any fHf\in\mathcal{H} with kernel k(x,x)k(x,x^{\prime}), we have f=f, k(,x)Hf=\langle f,~{}k(\cdot,x)\rangle_{\mathcal{H}} and xf=f, xk(x,)H\nabla_{x}f=\langle f,~{}\nabla_{x}k(x,\cdot)\rangle_{\mathcal{H}}. Therefore,

Applying the standard asymptotic results of UU-statistics in Serfling (2009, Section 5.5), we just need to check that σu20\sigma_{u}^{2}\neq 0 when pqp\neq q and σu2=0\sigma_{u}^{2}=0 when p=qp=q.

(19) is obtained by applying Cauchy-Schwarz inequality on (8),

To prove (20), we simply note that (13) is equivalent to

Taking f=(sqsp)/sq(x)sp(x)Hd\boldsymbol{f}=({\boldsymbol{s}}_{q}-{\boldsymbol{s}}_{p})/||{\boldsymbol{s}}_{q}(x)-{\boldsymbol{s}}_{p}(x)||_{\mathcal{H}^{d}} then gives (20). ∎

Let F(p)=L2(p)S(p)\mathcal{F}(p)=\mathcal{L}^{2}(p)\cap\mathcal{S}(p), where S(p)\mathcal{S}(p) represents the Stein class of pp, then we have

and the equality holds when sqspF(p)d{\boldsymbol{s}}_{q}-{\boldsymbol{s}}_{p}\in\mathcal{F}(p)^{d}.

Note that L2(p)\mathcal{L}^{2}(p) is larger than the Stein class and RKHS, and includes discontinuous, non-smooth functions, and hence we need to ensure f\boldsymbol{f} is in the Stein class explicitly.

Restricting the maximizing to F(p)d\mathcal{F}(p)^{d} and applying Lemma 2.3 would give the result. ∎