A Kernel Test of Goodness of Fit

Kacper Chwialkowski, Heiko Strathmann, Arthur Gretton

Introduction

Statistical tests of goodness-of-fit are a fundamental tool in statistical analysis, dating back to the test of Kolmogorov and Smirnov . Given a set of samples {Zi}i=1n\{Z_{i}\}_{i=1}^{n} with distribution ZiqZ_{i}\sim q, our interest is in whether qq matches some reference or target distribution pp, which we assume to be only known up to the normalisation constant. Recently, in the multivariate setting, Gorham & Mackey proposed an elegant measure of sample quality with respect to a target. This measure is a maximum discrepancy between empirical sample expectations and target expectations over a large class of test functions, constructed so as to have zero expectation over the target distribution by use of a Stein operator. This operator depends only on the derivative of the logq\log q: thus, the approach can be applied very generally, as it does not require closed-form integrals over the target distribution (or numerical approximations of such integrals). By contrast, many earlier discrepancy measures require integrals with respect to the target (see below for a review). This is problematic e.g. if the intention is to perform benchmarks for assessing Markov Chain Monte Carlo, since these integrals are certainly not known to the practitioner.

A challenge in applying the approach of Gorham & Mackey is the complexity of the function class used, which results from applying the Stein operator to the W2,W^{2,\infty} Sobolev space. Thus, their sample quality measure requires solving a linear program that arises from a complicated construction of graph Stein discrepancies and geometric spanners. Their metric furthermore requires access to nontrivial lower bounds that, despite being provided for log-concave densities, are a largely open problem otherwise, in particular for multivariate cases.

An important application of a goodness-of-fit measure is in statistical testing, where it is desired to determine whether the empirical discrepancy measure is large enough to reject the null hypothesis (that the sample arises from the target distribution). One approach is to establish the asymptotic behaviour of the test statistic, and to set a test threshold at a large quantile of the asymptotic distribution. The asymptotic behaviour of the W2,W^{2,\infty}-Sobolev Stein discrepancies remains a challenging open problem, due to the complexity of the function class used. It is not clear how one would compute p-values for this statistic, or determine when the goodness of fit test allows to accept the null hypothesis at a user-specified test level.

The key contribution of this work is to define a statistical test of goodness-of-fit, based on a Stein discrepancy computed in a reproducing kernel Hilbert space (RKHS). To construct our test statistic, we use a function class defined by applying the Stein operator to a chosen space of RKHS functions, as proposed by .Oates et al. addressed the problem of variance reduction in Monte Carlo integration, using the Stein operator to avoid bias. Our measure of goodness of fit is the largest discrepancy over this space of functions between empirical sample expectations and target expectations (the latter being zero, due to the effect of the Stein operator). The approach is a natural extension to goodness-of-fit testing of the earlier kernel two-sample tests and independence tests , which are based on the maximum mean discrepancy, an integral probability metric. As with these earlier tests, our statistic is a simple V-statistic, and can be computed in closed form and in quadratic time. Moreover, it is an unbiased estimate of the corresponding population discrepancy. As with all Stein-based discrepancies, only the gradient of the log target density is needed; we do not require integrals with respect to the target density – including the normalisation constant. Given that our test statistic is a V-statistic, we make use of the extensive literature on asymptotics of V-statistics to formulate a hypothesis test .An alternative linear-time test, based on differences in analytic functions of the sample and following the recent work of , is provided in [10, Appendix 5.1] We provide statistical tests for both uncorrelated and correlated samples, where the latter is essential if the test is used in assessing the quality of output of an MCMC procedure. An identical test was obtained simultaneously in independent work by Liu et al. , for uncorrelated samples.

Several alternative approaches exist in the statistics literature to goodness-of-fit testing. A first strategy is to partition the space, and to conduct the test on a histogram estimate of the distribution . Such space partitioning approaches can have attractive theoretical properties (e.g. distribution-free test thresholds) and work well in low dimensions, however they are much less powerful than alternatives once the dimensionality increases . A second popular approach has been to use the smoothed L2L_{2} distance between the empirical characteristic function of the sample, and the characteristic function of the target density. This dates back to the test of Gaussianity of Baringhaus & Henze [3, Eq. 2.1], who used an exponentiated quadratic smoothing function. For this choice of smoothing function, their statistic is identical to the maximum mean discrepancy (MMD) with the exponentiated quadratic kernel, which can be shown using the Bochner representation of the kernel [36, Corollary 4]. It is essential in this case that the target distribution be Gaussian, since the convolution with the kernel (or in the Fourier domain, the smoothing function) must be available in closed form. An L2L_{2} distance between Parzen window estimates can also be used , giving the same expression again, although the optimal choice of bandwidth for consistent Parzen window estimates may not be a good choice for testing . A different smoothing scheme in the frequency domain results in an energy distance statistic [this likewise being an MMD with a particular choice of kernel; see 32], which can be used in a test of normality . The key point is that the required integrals are again computable in closed form for the Gaussian, although the reasoning may be extended to certain other families of interest, e.g. . The requirement of computing closed-form integrals with respect to the test distribution severely restricts this testing strategy. Finally, a problem related to goodness-of-fit testing is that of model criticism . In this setting, samples generated from a fitted model are compared via the maximum mean discrepancy with samples used to train the model, such that a small MMD indicates a good fit. There are two limitation to the method: first, it requires samples from the model (which might not be easy if this requires a complex MCMC sampler); second, the choice of number of samples from the model is not obvious, since too few samples cause a loss in test power, and too many are computationally wasteful. Neither issue arises in our test, as we do not require model samples.

In our experiments, a particular focus is on applying our goodness-of-fit test to certify the output of approximate Markov Chain Monte Carlo (MCMC) samplers . These methods use modifications to Markov transition kernels that improve mixing speed at the cost of introducing asymptotic bias. The resulting bias-variance trade-off can usually be tuned with parameters of the sampling algorithms. It is therefore important to test whether for a particular parameter setting and run-time, the samples are of the desired quality. This question cannot be answered with classical MCMC convergence statistics, such as the widely used potential scale reduction factor (R-factor) or the effective sample size, since these assume that the Markov chain reaches the true equilibrium distribution i.e. absence of asymptotic bias. By contrast, our test exactly quantifies the asymptotic bias of approximate MCMC.

Code can be found at https://github.com/karlnapf/kernel_goodness_of_fit.

We begin in section 2 with a high-level construction of the RKHS-based Stein discrepancy and associated statistical test. In Section 3, we provide additional details and prove the main results. Section 4 contains experimental illustrations on synthetic examples, statistical model criticism, bias-variance trade-offs in approximate MCMC, and convergence in non-parametric density estimation.

Test Definition: Statistic and Threshold

We begin with a high-level construction of our divergence measure and the associated statistical test. While this section aims to outline the main ideas, we provide details and proofs in Section 3.

Suppose a random variable ZZ is distributed according to a measureThroughout the article, all occurrences of ZZ, e.g. Z,Zi,ZZ^{\prime},Z_{i},Z_{\heartsuit}, are understood to be distributed according to qq. qq and XX is distributed according to the target measure pp. As we will see, the operator can be expressed by defining a function that depends on gradient of the log-density and the kernel,

whose expected inner product with ff gives exactly the expected value of the Stein operator,

The second main result states that the discrepancy Sp(Z)S_{p}(Z) can be used to distinguish two distributions.

Section 3.1 contains all necessary proofs. We now proceed to construct an estimator for S(Z)2S(Z)^{2}, and outline its asymptotic properties.

2 Wild Bootstrap Testing

It is straightforward to estimate the squared Stein discrepancy S(Z)2S(Z)^{2} from samples {Zi}i=1n\{Z_{i}\}_{i=1}^{n}: a quadratic time estimator is a V-Statistic, and takes the form

The asymptotic null distribution of the normalised V-Statistic nVnnV_{n}, however, has no computable closed form. Furthermore, care has to be taken when the ZiZ_{i} exhibit correlation structure, as the null distribution might significantly change, impacting test significance. The wild bootstrap technique addresses both problems. First, it allows us to estimate quantiles of the null distribution in order to compute test thresholds. Second, it accounts for correlation structure in the ZiZ_{i} by mimicking it with an auxiliary random process: a simple Markov chain taking values in {1,1}\{-1,1\}, starting from W1,n=1W_{1,n}=1,

where the UtU_{t} are uniform (0,1)(0,1) i.i.d. random variables and ana_{n} is the probability of Wt,nW_{t,n} changing sign (for i.i.d. data we set an=0.5a_{n}=0.5). This leads to a bootstrapped V-statistic

Proposition 3.2 establishes that, under the null hypothesis, nBnnB_{n} is a good approximation of nVnnV_{n}, so it is possible to approximate quantiles of the null distribution by sampling from it. Under the alternative, VnV_{n} dominates BnB_{n} – resulting in almost sure rejection of the null hypothesis.

We propose the following test procedure for testing the null hypothesis that the ZiZ_{i} are distributed according to the target distribution pp.

Obtain wild bootstrap samples {Bn}i=1D\{B_{n}\}_{i=1}^{D} and estimate the 1α1-\alpha empirical quantile of these samples.

Proofs of the Main Results

We now prove the claims made in the previous section.

Lemma 5.1 (in the Appendix) shows that the expected value of the Stein operator is zero on the target measure.

ξp(x,)\xi_{p}(x,\cdot) is an element of the reproducing kernel Hilbert space Fd\mathcal{F}^{d} – by Steinwart & Christmann [39, Lemma 4.34] k(x,)F\nabla k(x,\cdot)\in\mathcal{F}, and logp(x)xi\frac{\partial\log p(x)}{\partial x_{i}} is just a scalar. We first show that hp(x,y)=ξp(x,),ξp(y,).h_{p}(x,y)=\langle\xi_{p}(x,\cdot),\xi_{p}(y,\cdot)\rangle. Using notations

Next we show that ξp(x,)\xi_{p}(x,\cdot) is Bochner integrable [see 39, Definition A.5.20],

This allows us to take the expectation inside the RKHS inner product. We next relate the expected value of the Stein operator to the inner product of ff and the expected value of ξq(Z)\xi_{q}(Z),

The second equality follows from the fact that a linear operator fi,F\langle f_{i},\cdot\rangle_{\mathcal{F}} can be interchanged with the Bochner integral, and the fact that ξp\xi_{p} is Bochner integrable. Using definition of S(Z)S(Z), Lemma (5.1), and Equation (2), we have

We now calculate closed form expression for Sp2(Z)S_{p}^{2}(Z),

where ZZ^{\prime} is an independent copy of ZZ. ∎

Next, we prove that the discrepancy SS discriminates different probability measures.

We recognise that the expected value of (logp(Z)logq(Z))k(Z,)\nabla(\log p(Z)-\log q(Z))k(Z,\cdot) is the mean embedding of a function g(y)=(logp(y)q(y))g(y)=\nabla\left(\log\frac{p(y)}{q(y)}\right) with respect to the measure qq. By the assumptions the function gg is square integrable; therefore, since the kernel kk is CoC_{o}-universal, by Carmeli et al. [8, Theorem 4.2 b] its embedding is zero if and only if g=0g=0. This implies that

A constant vector field of derivatives can only be generated by a constant function, so logp(y)q(y)=C\log\frac{p(y)}{q(y)}=C, for some CC, which implies that p(y)=eCq(y)p(y)=e^{C}q(y). Since pp and qq both integrate to one, C=0C=0 and thus p=qp=q, which is a contradiction. ∎

2 Wild Bootstrap Testing

The two concepts required to derive the distribution of the test statistic are: τ\tau-mixing , and V-statistics .

We assume τ\tau-mixing as our notion of dependence within the observations, since this is weak enough for most practical applications. Trivially, i.i.d. observations are τ\tau-mixing. As for Markov chains, whose convergence we study in the experiments, the property of geometric ergodicity implies τ\tau-mixing (given that the stationary distribution has a finite moment of some order – see the Appendix for further discussion). For further details on τ\tau-mixing, see . For this work, we assume a technical condition t=1t2τ(t)\sum_{t=1}^{\infty}t^{2}\sqrt{\tau(t)}\leq\infty. A direct application of [25, Theorem 2.1] characterises the limiting behavior of nVnnV_{n} for τ\tau-mixing processes.

The proof, which is a simple verification of the relevant assumptions, can be found in the Appendix. Although a formula for a limit distribution of VnV_{n} can be derived explicitly [Theorem 2.1 25], we do not provide it here. To our knowledge there are no methods of obtaining quantiles of a limit of VnV_{n} in closed form. The common solution is to estimate quantiles by a resampling method, as described in Section 2. The validity of this resampling method is guaranteed by the following proposition (which follows from Theorem 2.1 of Leucht and a modification of the Lemma 5 of Chwialkowski et al. ), proved in the supplement.

As a consequence, if the null hypothesis is true, we can approximate any quantile; while under the alternative hypothesis, all quantiles of BnB_{n} collapse to zero while P(Vn>0)1P(V_{n}>0)\to 1. We discuss specific case of testing MCMC convergence in the Appendix.

Experiments

We provide a number of experimental applications for our test. We begin with a simple check to establish correct test calibration on non-i.i.d. data, followed by a demonstration of statistical model criticism for Gaussian process (GP) regression. We then apply the proposed test to quantify bias-variance trade-offs in MCMC, and demonstrate how to use the test to verify whether MCMC samples are drawn from the desired stationary distribution. In the final experiment, we move away from the MCMC setting, and use the test to evaluate the convergence of a nonparametric density estimator. Code can be found at https://github.com/karlnapf/kernel_goodness_of_fit.

In our first task, we modify Experiment 4.1 from Gorham & Mackey . The null hypothesis is that the observed samples come from a standard normal distribution. We study the power of the test against samples from a Student’s t distribution. We expect to observe low p-values when testing against a Student’s t distribution with few degrees of freedom. We consider 1, 5, 10 or \infty degrees of freedom, where \infty is equivalent to sampling from a standard normal distribution. For a fixed number of degrees of freedom we draw 1400 samples and calculate the p-value. This procedure is repeated 100 times, and the bar plots of p-values are shown in Figures 1,2,3.

Our twist on the original experiment 4.1 by Gorham & Mackey is that the draws from the Student’s t distribution exhibit temporal correlation. We generate samples using a Metropolis–Hastings algorithm, with a Gaussian random walk with variance 1/2. We emphasise the need for an appropriate choice of the wild bootstrap process parameter ana_{n}. In Figure 1 we plot p-values for ana_{n} being set to 0.50.5. Such a high value of ana_{n} is suitable for i.i.d. observations, but results in p-values that are too conservative for temporally correlated observations. In Figure 2, we set an=0.02a_{n}=0.02, which gives a well calibrated distribution of the p-values under the null hypothesis, however the test power is reduced. Indeed, p-values for five degrees of freedom are already large. The solution that we recommend is a mixture of thinning and adjusting an,a_{n}, as presented in the Figure 3. We thin the observations by a factor of 20 and set an=0.1a_{n}=0.1, thus preserving both good statistical power and correct calibration of p-values under the null hypothesis. In a general, we recommend to thin a chain so that Cor(Xt,Xt1)<0.5\text{Cor}(X_{t},X_{t-1})<0.5, set an=0.1/ka_{n}=0.1/k, and run test with at least max(500k,d100)\max(500k,d100) data points, where k<10k<10.

In this experiment, we compare with the test proposed by Baringhaus & Henze , which is essentially an MMD test for normality, i.e. the null hypothesis is that ZZ is a dd-dimensional standard normal random variable. We set the sample size to n=500,1000n=500,1000 and an=0.5a_{n}=0.5, generate

and modify Z0Z0+YZ_{0}\leftarrow Z_{0}+Y. Table 1 shows the power as a function of the sample size. We observe that for higher dimensions, and where the expectation of the kernel exists in closed form, an MMD-type test like is a better choice.

We next apply our test to the problem of statistical model criticism for GP regression. Our presentation and approach are similar to the non i.i.d. case in Section 6 of Lloyd & Ghahramani . We use the solar dataset, consisting of a d=1d=1 regression problem with N=402N=402 pairs (X,y)(X,y). We fit Ntrain=361N_{\text{train}}=361 data using a GP with an exponentiated quadratic kernel and a Gaussian noise model, and perform standard maximum likelihood II on the hyperparameters (length-scale, overall scale, noise-variance). We then apply our test to the remaining Ntest=41N_{\text{test}}=41 data. The test attempts to falsify the null hypothesis that the solar dataset was generated from the plug-in predictive distribution (conditioned on training data and predicted position) of the GP. Lloyd & Ghahramani refer to this setup as non-i.i.d., since the predictive distribution is a different univariate Gaussian for every predicted point. Our particular Ntrain,NtestN_{\text{train}},N_{\text{test}} were chosen to make sure the GP fit has stabilised, i.e. adding more data did not cause further model refinement.

Figure 4 shows training and testing data, and the fitted GP. Clearly, the Gaussian noise model is a poor fit for this particular dataset, e.g. around X=1X=-1. Figure 5 shows the distribution over D=10000D=10000 bootstrapped V-statistics BnB_{n} with n=Ntestn=N_{\text{test}}. The test statistic lies in an upper quantile of the bootstrapped null distribution, correctly indicating that it is unlikely the test points were generated by the fitted GP model, even for the low number of test data observed, n=41n=41.

We now illustrate how to quantify bias-variance trade-offs in an approximate MCMC algorithm – austerity MCMC . For the purpose of illustration we use a simple generative model from Gorham & Mackey , Welling & Teh ,

Austerity MCMC is a Monte Carlo procedure designed to reduce the number of likelihood evaluation in the acceptance step of the Metropolis-Hastings algorithm. The crux of method is to look at only a subset of the data, and make an acceptance/rejection decision based on this subset. The probability of making a wrong decision is proportional to a parameter ϵ\epsilon\in . This parameter influences the time complexity of austerity MCMC: when ϵ\epsilon is larger, i.e., when there is a greater tolerance for error, the expected computational cost is lower. We simulate {Xi}1i400\{X_{i}\}_{1\leq i\leq 400} points from the model with θ1=0\theta_{1}=0 and θ2=1\theta_{2}=1. In our experiment, there are two modes in the posterior distribution: one at (0,1)(0,1) and the other at (1,1)(1,-1). We run the algorithm with ϵ\epsilon varying over the range [0.001,0.2][0.001,0.2]. For each ϵ\epsilon we calculate an individual thinning factor, such that correlation between consecutive samples from the chains is smaller than 0.50.5 (greater ϵ\epsilon generally required more thinning). For each ϵ\epsilon we test the hypothesis that {θi}1i500\{\theta_{i}\}_{1\leq i\leq 500} is drawn from the true stationary posterior, using our goodness of fit test. We generate 100 p-values for each ϵ\epsilon , as shown in Figure 6. A good approximation of the true stationary distribution is obtained at ϵ=0.4\epsilon=0.4, which is still parsimonious in terms of likelihood evaluations, as shown in Figure 7.

In our final experiment, we apply our goodness of fit test to measuring quality-of-fit in nonparametric density estimation. We evaluate two density models: the infinite dimensional exponential family , and a recent approximation to this model using random Fourier features . Our implementation of the model assumes the log density to take the form f(x)f(x), where ff lies in an RKHS induced by a Gaussian kernel with bandwidth 11. We fit the model using NN observations drawn from a standard Gaussian, and perform our quadratic time test on a separate evaluation dataset of fixed size n=500n=500. Our goal is to identify NN sufficiently large that the goodness of fit test does not reject the null hypothesis (i.e., the model has learned the density sufficiently well, bearing in mind that it is guaranteed to converge for sufficiently large NN). Figure 8 shows how the distribution of p-values evolves as a function of NN; this distribution is uniform for N=5000N=5000, but at N=500N=500, the null hypothesis would very rarely be rejected.

We next consider the random Fourier feature approximation to this model, where the log pdf, ff, is approximated using a finite dictionary of random Fourier features . The natural question when using this approximation is: “How many random features are needed?” Using the same test set size n=500n=500 as above, and a large number of samples, N=5104N=5\cdot 10^{4}, Figure 9 shows the distributions of p-values for an increasing number of random features mm. From m=50m=50, the null hypothesis would rarely be rejected. Note, however, that the p-values do not have a uniform distribution, even for a large number of random features. This subtle effect is caused by over-smoothing due to the regularisation approach taken by Strathmann et al. [40, KMC finite], which would not otherwise have been detected.

Arthur Gretton has ORCID 0000-0003-3169-7624.

References

Appendix

If a random variable XX is distributed according to pp, under conditions on the kernel

We check assumptions of Theorem 2.1 from . Condition A1, t=1τ(t)\sum_{t=1}^{\infty}\sqrt{\tau(t)}\leq\infty, is implied by assumption t=1t2τ(t)\sum_{t=1}^{\infty}t^{2}\sqrt{\tau(t)}\leq\infty in Section 3. Condition A2 (iv), Lipschitz continuity of hh, is assumed. Conditions A2 i), ii) positive definiteness, symmetry and degeneracy of hh follow from the proof of Theorem (2.2). Indeed

MCMC convergence testing

In the stationary phase there are number of results which might be used to show that the chain is τ\tau-mixing.

Strong mixing is historically the most studied type of temporal dependence – a lot of models, including Markov Chains, are proved to be strongly mixing, therefore it’s useful to relate weak mixing to strong mixing. For a random variable XX on a probability space (Ω,F,PX)(\Omega,\mathcal{F},P_{X}) and MF\mathcal{M}\subset\mathcal{F} we define

A process is called β\beta-mixing or absolutely regular if

Dedecker & Prieur [Equation 7.6] relates τ\tau-mixing and β\beta-mixing , as follows: if QxQ_{x} is the generalized inverse of the tail function

While this definition can be hard to interpret, it can be simplified in the case EXp=ME|X|^{p}=M for some p>1p>1, since via Markov’s inequality P(X>t)MtpP(|X|>t)\leq\frac{M}{t^{p}}, and thus Mtpu\frac{M}{t^{p}}\leq u implies P(X>t)uP(|X|>t)\leq u. Therefore Q(u)=MupQx(u)Q^{\prime}(u)=\frac{M}{\sqrt[p]{u}}\geq Q_{x}(u). As a result, we have the inequality

Dedecker & Prieur provide examples of systems that are τ\tau-mixing. In particular, given that certain assumptions are satisfied, then causal functions of stationary sequences, iterated random functions, Markov chains, and expanding maps are all τ\tau-mixing.

Of particular interest to this work are Markov chains. The assumptions provided by , under which Markov chains are τ\tau-mixing, are difficult to check. We can, however, use classical theorems about the absolute regularity (β\beta-mixing). In particular [7, Corollary 3.6] states that a Harris recurrent and aperiodic Markov chain satisfies absolute regularity, and [7, Theorem 3.7] states that geometric ergodicity implies geometric decay of the β\beta coefficient. Interestingly [7, Theorem 3.2] describes situations in which a non-stationary chain β\beta-mixes exponentially.

Using inequalities between τ\tau-mixing coefficient and strong mixing coefficients, one can use these classical theorems show that e.g for p=2p=2 we have