A Kernelized Stein Discrepancy for Goodness-of-fit Tests and Model Evaluation
Qiang Liu, Jason D. Lee, Michael I. Jordan
Introduction
Evaluating the goodness-of-fit of models over observed data is a fundamental task in machine learning and statistics. Traditional approaches often involve calculating or comparing the likelihoods or cumulative distribution functions (CDF) of the models. Unfortunately, modern learning techniques increasingly involve complex probabilistic models with computationally intractable likelihoods or CDFs, such as large graphical models, hidden variables models and deep generative models (Koller & Friedman, 2009; Salakhutdinov, 2015). Although Markov chain Monte Carlo (MCMC) or variational methods can be used to approximate the likelihood, their approximation errors are often large and hard to estimate, making it difficult to give results with calibrated statistical significance. In fact, it is often a #P-complete problem to calculate or even approximate likelihoods for graphical models (e.g, Chandrasekaran et al., 2008), making likelihood-based approaches fundamentally infeasible.
We propose a likelihood-free approach for model evaluation with guaranteed statistical significance. In particular, we consider the setting of goodness-of-fit testing, where we test whether a given sample is drawn from a given distribution , meaning . Our method is based on a new discrepancy measure between distributions that can be empirically estimated using -statistics, and depends on only through its score function ; this score function does not depend on the normalization constant in , and can often be calculated efficiently even when the likelihood is intractable. This allows us to apply our methods to complex and high dimensional models on which the likelihood-based methods, or other traditional goodness-of-fit tests, such as -test and Kolmogorov-Smirnov test, can not be applied.
for smooth functions with proper zero-boundary conditions, where is called the (Stein) score function of ; when , (1) is known as Stein’s identity (e.g., Stein et al., 2004), and can be proved using integration by parts. As a result, one can define a Stein discrepancy measureOur definition is the square of the typical definition of Stein discrepancy, as such that in Gorham & Mackey (2015). between and via
where are i.i.d. random variables drawn from and is a function (defined in Theorem 3.6) that depends on only through the score function which can be calculated efficiently even when has an intractable normalization constant. Specifically, assuming with being the normalization constant, we have , independent of ; calculating involves a high dimension integration, and has been the major challenge for likelihood-based and Bayesian methods for model evaluation.
Related Work
Outline
Section 2 introduces RKHS and Stein’s identity. Section 3 defines our KSD and studies its main properties, and Section 4 discusses the empirical estimation of KSD and its application in goodness-of-fit tests. We discuss related methods in Section 5, present experiments in Section 6 and conclude the paper in Section 7.
Notations
Backgrounds
We first introduce positive definite kernels and reproducing kernel Hilbert spaces (RKHS) in Section 2.1, and then Stein’s identity and operator in Section 2.2.
Let be a positive definite kernel. The spectral decomposition of , as implied by Mercer’s theorem, is defined as
For a positive definite kernel , its related RKHS comprises of linear combinations of its eigenfunctions, i.e., with , endowed with an inner product between and . Thus this Hilbert space is equipped with a norm where . One can verify that is in and satisfies the important “reproducing” property,
Every positive definite kernel defines a unique RKHS for which is a reproducing kernel.
2 Stein’s Identity and Operator
The Stein’s operator of is a linear operator acting on the Stein class of , defined as
Assume is a smooth density supported on , then
for any that is in the Stein class of .
By the definition of the Stein class, simply note that ∎
The following result gives a convenient tool for our derivation; it relates the expectation under of Stein’s operator with the difference of the score functions of and .
Assume and are smooth densities supported on and is in the Stein class of , we have
which was first derived in Gorham & Mackey (2015) using Langevin diffusion. It is an interesting direction to consider the possibility of using determinant or other matrix norms instead of the trace.
Kernelized Stein Discrepancy
We introduce our kernelized Stein discrepancy (KSD) with an elementary definition motivated by Lemma 2.3, and then establish its connection with Stein’s method and RKHS.
A kernel is said to be integrally strictly positive definition, if for any function that satisfies ,
where is the score difference between and , and , are i.i.d. draws from .
Result directly follows the definition in (7). ∎
A kernel is said to be in the Stein class of if has continuous second order partial derivatives, and both and are in the Stein class of for any fixed .
If is in the Stein class of , so is any .
Assume and are smooth densities and is in the Stein class of . Define
Apply Lemma 2.3 twice, first on for fixed , and then with fixed . See the Appendix. ∎
Assume is a positive definite kernel in the Stein class of , with positive eigenvalues and eigenfunctions , then is also a positive definite kernel, and can be rewritten into
where is the Stein’s operator acted on . In addition,
Note that although are orthonormal, the are no longer orthonormal in general.
where the maximum is achieved when .
Goodness-of-fit Testing Based on KSD
2) If , then we have (the -statistics is degenerate) and
where are i.i.d. standard Gaussian random variables, and are the eigenvalues of kernel under , that is, they are the solutions of for non-zero .
Using the standard asymptotic results of -statistics in Serfling (2009, Section 5.5), we just need to check that when and when . See Appendix for details. ∎
One difficulty in implementing this test is that the limit distribution in (15) and its -quantile does not have analytic form unless or . Fortunately, the same type of asymptotics appears in many other classical goodness-of-fit tests, such as Cramer-von Mises test, Anderson-Darling test, as well as two-sample tests (Gretton et al., 2012). As a consequence, a line of work has been devoted to approximating the critical values of (15), including bootstrap methods (Arcones & Gine, 1992; Huskova & Janssen, 1993; Chwialkowski et al., 2014) and eigenvalue approximation (Gretton et al., 2009).
Assume the conditions in Theorem 4.1. If , then as the bootstrap sample size ,
that is, the bootstrap test attains the correct significance level asymptotically (consistent in level).
It is important to note, on the other hand, that the more usual bootstrap, such as , may not work for degenerate -statistics as discussed in Arcones & Gine (1992).
This bootstrap test is summarized in Algorithm 1; its cost is where is the size of the sample and the bootstrap sample size. A more computationally efficient, but less statistically powerful, method can be constructed based on the following linear estimator:
Related Methods
We discuss the connection with Fisher divergence and maximum discrepancy measure (MMD).
Fisher divergence, also known as Fisher information distance (Johnson, 2004), is defined as
1) Following Definition (8) and (18), we have
2) In addition, if is positive definite and in the Stein class of , and , we have, for ,
(19) is a simple result of Cauchy-Schwarz inequality, and (20) can be obtained by taking in (13). See Appendix. ∎
(19) suggests that the convergence in Fisher divergence is stronger than that in KSD. In fact, using Stein’s method, Ley & Swan (2013) showed that Fisher divergence is stronger than most other divergences, including KL, total variation and Hellinger distances.
2 Maximum Mean Discrepancy & Two-sample Tests
Closely related to goodness-of-fit tests are two sample tests, which test whether two i.i.d. samples and are drawn from the same distribution. In principle, one can turn a goodness-of-fit test into a two sample test by drawing from . However, it is often difficult to draw exact i.i.d. samples for practical models, and furthermore MCMC sampling may be computationally expensive, suffer from the convergence problems, and introduce undesired correlations. When the MCMC approximation is poor, the two sample test would reject the null even when (inconsistent in level).
Maximum Mean Discrepancy (Gretton et al., 2012) is a nonparametric distance measure widely used for two sample tests, defined as
Experiments
We present empirical results in this section. We start with a toy case of 1D Gaussian mixture on which we can compare with the classical goodness-of-fit tests that only work for univariate distributions, and then proceed to Gaussian-Bernoulli restricted Boltzmann machine (RBM), a graphical model widely used in deep learning (Welling et al., 2004; Hinton & Salakhutdinov, 2006). The following methods are evaluated, all with a significance level of :
1) KSD-U. The KSD-based bootstrap test using -statistic in Algorithm 1 (bootstrap size is ), using RBF kernel with bandwidth chosen to be median of the data distances.
2) KSD-Linear. The KSD test based on the linear estimator in (17) with asymptotically normal null distribution.
3) Classical goodness-of-fit tests, including test, Kolmogorov-Smirnov test and Cramer-von Mises test (Lehmann & Romano, 2006); they are evaluated on only the 1D Gaussian mixture.
4) MMD-MC(). Draw exact sample of size from and perform two sample MMD test of Gretton et al. (2012) over and using bootstrapWe use the mmdTestBoot.m under http://www.gatsby.ucl.ac.uk/%7Egretton/mmd/mmd.htm, with 1000 bootstrap replicates.
5) MMD-MCMC(). Draw approximate sample of size from using Gibbs sampler and perform MMD test on and ; we use burn-in steps.
6) LR (simple vs. simple). We evaluate the exact log-likelihood ratio and use it to test whether is drawn from or . This approach is an oracle test in that it knows it exactly calculates the likelihood, and assumes we know and tests a much easier null hypothesis of simple vs. simple.
7) Likelihood Ratio (AIS). We approximately evaluate the likelihood ratio using annealed importance sampling (AIS), which is one of the most widely used algorithm for approximating likelihood (Neal, 2001; Salakhutdinov & Murray, 2008). Our AIS implementation uses a Gibbs sampler transition with a linear temperature grid of size . We do not perform a test based on the AIS result because it is hard to know the approximation error.
We find from Figure 1 that the oracle LR (simple vs. simple) performs the best as expected. Otherwise, our KSD-U performs comparably with, or better than, the classical tests (, Kolmogorov-Smirnov and Cramer-Von Mises) as well as MMD-MC(1000). KSD-Linear tends to perform the worst, suggesting it is not useful in this simple setting. However, it can serve as a computationally efficient alternative of KSD-U for more complex models on which the other tests are not practical. Note that because both the cases of and happen with probability in our simulation, the error rate in the hardest case when is close is .
Gaussian-Bernoulli Restricted Boltzmann Machine (RBM)
where is the normalization constant. The probability of the observable variable is which is intractable to calculate due to the difficult constant term . Nevertheless, one can show that its score function can be easily calculated in a closed form,
In our experiment, we simulate a true model by drawing and from standard Gaussian and select uniformly randomly from ; we use observable variables and hidden variables, so that it remains possible to exactly calculate and draw exact samples using the brute-force algorithm. Similar to the case of 1D Gaussian mixture, we set randomly with equal probability to be equal to either or a perturbed version by adding Gaussian noise to with variance . We report the the error rates of different tests in Figure 2; the results are averaged on random trials.
Figure 2(a) shows that the oracle LR (simple vs. simple) performs the best again as expected, followed by our KSD-U method. The MMD-MCMC breaks down because the MCMC sample is not representative of , while the performance of MMD-MC depends on the size of the exact sample: it performs worse than KSD-U with MMD-MC(100), and is almost as good with MMD-MC(1000); see also Figure 2(b). Again, we find that KSD-linear generally performs much worse than KSD-U, but it provides a computationally efficient alternative to KSD-U which has a complexity and MMD which costs . A trade-off between linear and quadratic complexity can be achieved using block averaging; see Zaremba et al. (2013).
Figure 2(c)-(h) shows the different discrepancy measures under the case and , respectively. Again, we can find that the exact likelihood ratio provides the best discrimination, while MMD-MCMC fails to distinguish the two cases at all. The AIS approximation performs reasonably well, but is worse than KSD-U and MMD-MC(1000) in this particular case.
Conclusion and Future Directions
We propose a new computationally tractable discrepancy measure between complex probability models, and use it to derive a novel class of goodness-of-fit tests. We believe our discrepancy measure provides a new fundamental tool for analyzing and using complex probability models in statistics and machine learning. Future directions include extending our method to composite goodness-of-fit tests, in which we want to test if the observed data follows a given class of distributions, as well as understanding the theoretical discrimination power of KSD compared to the other classical goodness-of-fit tests, two sample tests (e.g., MMD with infinite exact Monte Carlo sample), and the method in Gorham & Mackey (2015).
This work is supported in part by NSF CRII 1565796. We thank Arthur Gretton and the anonymous reviewers for their valuable comments.
References
Appendix A Proofs
1) Denote by ; applying Lemma 2.3 on with fixed ,
Because is in the Stein class of for any , we can show that is also in the Stein class, since
and hence is also in the Stein class; apply Lemma 2.3 on with fixed gives
The result then follows by noting that ∎
Therefore, is positive definite because . In addition,
We first prove (12) by applying the reproducing property on (8):
where we used the fact that ; see (Zhou, 2008; Steinwart & Christmann, 2008). The variational form (13) then follows the fact that ||\boldsymbol{\beta}||_{\mathcal{H}^{d}}=\max_{\boldsymbol{f}\in\mathcal{H}^{d}}\big{\{}\langle\boldsymbol{f},\boldsymbol{\beta}\rangle_{\mathcal{H}^{d}},~{}~{}~{}.s.t.~{}~{}||\boldsymbol{f}||_{\mathcal{H}^{d}}\leq 1\big{\}}.
For any with kernel , we have and . Therefore,
Applying the standard asymptotic results of -statistics in Serfling (2009, Section 5.5), we just need to check that when and when .
(19) is obtained by applying Cauchy-Schwarz inequality on (8),
To prove (20), we simply note that (13) is equivalent to
Taking then gives (20). ∎
Let , where represents the Stein class of , then we have
and the equality holds when .
Note that is larger than the Stein class and RKHS, and includes discontinuous, non-smooth functions, and hence we need to ensure is in the Stein class explicitly.
Restricting the maximizing to and applying Lemma 2.3 would give the result. ∎