A Kernel Test of Goodness of Fit

Kacper Chwialkowski, Heiko Strathmann, Arthur Gretton

Introduction

Statistical tests of goodness-of-fit are a fundamental tool in statistical analysis, dating back to the test of Kolmogorov and Smirnov . Given a set of samples $\{Z_{i}\}_{i=1}^{n}$ with distribution $Z_{i}\sim q$ , our interest is in whether $q$ matches some reference or target distribution $p$ , which we assume to be only known up to the normalisation constant. Recently, in the multivariate setting, Gorham & Mackey proposed an elegant measure of sample quality with respect to a target. This measure is a maximum discrepancy between empirical sample expectations and target expectations over a large class of test functions, constructed so as to have zero expectation over the target distribution by use of a Stein operator. This operator depends only on the derivative of the $\log q$ : thus, the approach can be applied very generally, as it does not require closed-form integrals over the target distribution (or numerical approximations of such integrals). By contrast, many earlier discrepancy measures require integrals with respect to the target (see below for a review). This is problematic e.g. if the intention is to perform benchmarks for assessing Markov Chain Monte Carlo, since these integrals are certainly not known to the practitioner.

A challenge in applying the approach of Gorham & Mackey is the complexity of the function class used, which results from applying the Stein operator to the $W^{2,\infty}$ Sobolev space. Thus, their sample quality measure requires solving a linear program that arises from a complicated construction of graph Stein discrepancies and geometric spanners. Their metric furthermore requires access to nontrivial lower bounds that, despite being provided for log-concave densities, are a largely open problem otherwise, in particular for multivariate cases.

An important application of a goodness-of-fit measure is in statistical testing, where it is desired to determine whether the empirical discrepancy measure is large enough to reject the null hypothesis (that the sample arises from the target distribution). One approach is to establish the asymptotic behaviour of the test statistic, and to set a test threshold at a large quantile of the asymptotic distribution. The asymptotic behaviour of the $W^{2,\infty}$ -Sobolev Stein discrepancies remains a challenging open problem, due to the complexity of the function class used. It is not clear how one would compute p-values for this statistic, or determine when the goodness of fit test allows to accept the null hypothesis at a user-specified test level.

The key contribution of this work is to define a statistical test of goodness-of-fit, based on a Stein discrepancy computed in a reproducing kernel Hilbert space (RKHS). To construct our test statistic, we use a function class defined by applying the Stein operator to a chosen space of RKHS functions, as proposed by .Oates et al. addressed the problem of variance reduction in Monte Carlo integration, using the Stein operator to avoid bias. Our measure of goodness of fit is the largest discrepancy over this space of functions between empirical sample expectations and target expectations (the latter being zero, due to the effect of the Stein operator). The approach is a natural extension to goodness-of-fit testing of the earlier kernel two-sample tests and independence tests , which are based on the maximum mean discrepancy, an integral probability metric. As with these earlier tests, our statistic is a simple V-statistic, and can be computed in closed form and in quadratic time. Moreover, it is an unbiased estimate of the corresponding population discrepancy. As with all Stein-based discrepancies, only the gradient of the log target density is needed; we do not require integrals with respect to the target density – including the normalisation constant. Given that our test statistic is a V-statistic, we make use of the extensive literature on asymptotics of V-statistics to formulate a hypothesis test .An alternative linear-time test, based on differences in analytic functions of the sample and following the recent work of , is provided in [10, Appendix 5.1] We provide statistical tests for both uncorrelated and correlated samples, where the latter is essential if the test is used in assessing the quality of output of an MCMC procedure. An identical test was obtained simultaneously in independent work by Liu et al. , for uncorrelated samples.

Several alternative approaches exist in the statistics literature to goodness-of-fit testing. A first strategy is to partition the space, and to conduct the test on a histogram estimate of the distribution . Such space partitioning approaches can have attractive theoretical properties (e.g. distribution-free test thresholds) and work well in low dimensions, however they are much less powerful than alternatives once the dimensionality increases . A second popular approach has been to use the smoothed $L_{2}$ distance between the empirical characteristic function of the sample, and the characteristic function of the target density. This dates back to the test of Gaussianity of Baringhaus & Henze [3, Eq. 2.1], who used an exponentiated quadratic smoothing function. For this choice of smoothing function, their statistic is identical to the maximum mean discrepancy (MMD) with the exponentiated quadratic kernel, which can be shown using the Bochner representation of the kernel [36, Corollary 4]. It is essential in this case that the target distribution be Gaussian, since the convolution with the kernel (or in the Fourier domain, the smoothing function) must be available in closed form. An $L_{2}$ distance between Parzen window estimates can also be used , giving the same expression again, although the optimal choice of bandwidth for consistent Parzen window estimates may not be a good choice for testing . A different smoothing scheme in the frequency domain results in an energy distance statistic [this likewise being an MMD with a particular choice of kernel; see 32], which can be used in a test of normality . The key point is that the required integrals are again computable in closed form for the Gaussian, although the reasoning may be extended to certain other families of interest, e.g. . The requirement of computing closed-form integrals with respect to the test distribution severely restricts this testing strategy. Finally, a problem related to goodness-of-fit testing is that of model criticism . In this setting, samples generated from a fitted model are compared via the maximum mean discrepancy with samples used to train the model, such that a small MMD indicates a good fit. There are two limitation to the method: first, it requires samples from the model (which might not be easy if this requires a complex MCMC sampler); second, the choice of number of samples from the model is not obvious, since too few samples cause a loss in test power, and too many are computationally wasteful. Neither issue arises in our test, as we do not require model samples.

In our experiments, a particular focus is on applying our goodness-of-fit test to certify the output of approximate Markov Chain Monte Carlo (MCMC) samplers . These methods use modifications to Markov transition kernels that improve mixing speed at the cost of introducing asymptotic bias. The resulting bias-variance trade-off can usually be tuned with parameters of the sampling algorithms. It is therefore important to test whether for a particular parameter setting and run-time, the samples are of the desired quality. This question cannot be answered with classical MCMC convergence statistics, such as the widely used potential scale reduction factor (R-factor) or the effective sample size, since these assume that the Markov chain reaches the true equilibrium distribution i.e. absence of asymptotic bias. By contrast, our test exactly quantifies the asymptotic bias of approximate MCMC.

Code can be found at https://github.com/karlnapf/kernel_goodness_of_fit.

We begin in section 2 with a high-level construction of the RKHS-based Stein discrepancy and associated statistical test. In Section 3, we provide additional details and prove the main results. Section 4 contains experimental illustrations on synthetic examples, statistical model criticism, bias-variance trade-offs in approximate MCMC, and convergence in non-parametric density estimation.

Test Definition: Statistic and Threshold

We begin with a high-level construction of our divergence measure and the associated statistical test. While this section aims to outline the main ideas, we provide details and proofs in Section 3.

Suppose a random variable $Z$ is distributed according to a measureThroughout the article, all occurrences of $Z$ , e.g. $Z^{\prime},Z_{i},Z_{\heartsuit}$ , are understood to be distributed according to $q$ . $q$ and $X$ is distributed according to the target measure $p$ . As we will see, the operator can be expressed by defining a function that depends on gradient of the log-density and the kernel,

whose expected inner product with $f$ gives exactly the expected value of the Stein operator,

The second main result states that the discrepancy $S_{p}(Z)$ can be used to distinguish two distributions.

Section 3.1 contains all necessary proofs. We now proceed to construct an estimator for $S(Z)^{2}$ , and outline its asymptotic properties.

2 Wild Bootstrap Testing

It is straightforward to estimate the squared Stein discrepancy $S(Z)^{2}$ from samples $\{Z_{i}\}_{i=1}^{n}$ : a quadratic time estimator is a V-Statistic, and takes the form

The asymptotic null distribution of the normalised V-Statistic $nV_{n}$ , however, has no computable closed form. Furthermore, care has to be taken when the $Z_{i}$ exhibit correlation structure, as the null distribution might significantly change, impacting test significance. The wild bootstrap technique addresses both problems. First, it allows us to estimate quantiles of the null distribution in order to compute test thresholds. Second, it accounts for correlation structure in the $Z_{i}$ by mimicking it with an auxiliary random process: a simple Markov chain taking values in $\{-1,1\}$ , starting from $W_{1,n}=1$ ,

where the $U_{t}$ are uniform $(0,1)$ i.i.d. random variables and $a_{n}$ is the probability of $W_{t,n}$ changing sign (for i.i.d. data we set $a_{n}=0.5$ ). This leads to a bootstrapped V-statistic

Proposition 3.2 establishes that, under the null hypothesis, $nB_{n}$ is a good approximation of $nV_{n}$ , so it is possible to approximate quantiles of the null distribution by sampling from it. Under the alternative, $V_{n}$ dominates $B_{n}$ – resulting in almost sure rejection of the null hypothesis.

We propose the following test procedure for testing the null hypothesis that the $Z_{i}$ are distributed according to the target distribution $p$ .

Obtain wild bootstrap samples $\{B_{n}\}_{i=1}^{D}$ and estimate the $1-\alpha$ empirical quantile of these samples.

Proofs of the Main Results

We now prove the claims made in the previous section.

Lemma 5.1 (in the Appendix) shows that the expected value of the Stein operator is zero on the target measure.

$\xi_{p}(x,\cdot)$ is an element of the reproducing kernel Hilbert space $\mathcal{F}^{d}$ – by Steinwart & Christmann [39, Lemma 4.34] $\nabla k(x,\cdot)\in\mathcal{F}$ , and $\frac{\partial\log p(x)}{\partial x_{i}}$ is just a scalar. We first show that $h_{p}(x,y)=\langle\xi_{p}(x,\cdot),\xi_{p}(y,\cdot)\rangle.$ Using notations

Next we show that $\xi_{p}(x,\cdot)$ is Bochner integrable [see 39, Definition A.5.20],

This allows us to take the expectation inside the RKHS inner product. We next relate the expected value of the Stein operator to the inner product of $f$ and the expected value of $\xi_{q}(Z)$ ,

The second equality follows from the fact that a linear operator $\langle f_{i},\cdot\rangle_{\mathcal{F}}$ can be interchanged with the Bochner integral, and the fact that $\xi_{p}$ is Bochner integrable. Using definition of $S(Z)$ , Lemma (5.1), and Equation (2), we have

We now calculate closed form expression for $S_{p}^{2}(Z)$ ,

where $Z^{\prime}$ is an independent copy of $Z$ . ∎

Next, we prove that the discrepancy $S$ discriminates different probability measures.

We recognise that the expected value of $\nabla(\log p(Z)-\log q(Z))k(Z,\cdot)$ is the mean embedding of a function $g(y)=\nabla\left(\log\frac{p(y)}{q(y)}\right)$ with respect to the measure $q$ . By the assumptions the function $g$ is square integrable; therefore, since the kernel $k$ is $C_{o}$ -universal, by Carmeli et al. [8, Theorem 4.2 b] its embedding is zero if and only if $g=0$ . This implies that

A constant vector field of derivatives can only be generated by a constant function, so $\log\frac{p(y)}{q(y)}=C$ , for some $C$ , which implies that $p(y)=e^{C}q(y)$ . Since $p$ and $q$ both integrate to one, $C=0$ and thus $p=q$ , which is a contradiction. ∎

2 Wild Bootstrap Testing

The two concepts required to derive the distribution of the test statistic are: $\tau$ -mixing , and V-statistics .

We assume $\tau$ -mixing as our notion of dependence within the observations, since this is weak enough for most practical applications. Trivially, i.i.d. observations are $\tau$ -mixing. As for Markov chains, whose convergence we study in the experiments, the property of geometric ergodicity implies $\tau$ -mixing (given that the stationary distribution has a finite moment of some order – see the Appendix for further discussion). For further details on $\tau$ -mixing, see . For this work, we assume a technical condition $\sum_{t=1}^{\infty}t^{2}\sqrt{\tau(t)}\leq\infty$ . A direct application of [25, Theorem 2.1] characterises the limiting behavior of $nV_{n}$ for $\tau$ -mixing processes.

The proof, which is a simple verification of the relevant assumptions, can be found in the Appendix. Although a formula for a limit distribution of $V_{n}$ can be derived explicitly [Theorem 2.1 25], we do not provide it here. To our knowledge there are no methods of obtaining quantiles of a limit of $V_{n}$ in closed form. The common solution is to estimate quantiles by a resampling method, as described in Section 2. The validity of this resampling method is guaranteed by the following proposition (which follows from Theorem 2.1 of Leucht and a modification of the Lemma 5 of Chwialkowski et al. ), proved in the supplement.

As a consequence, if the null hypothesis is true, we can approximate any quantile; while under the alternative hypothesis, all quantiles of $B_{n}$ collapse to zero while $P(V_{n}>0)\to 1$ . We discuss specific case of testing MCMC convergence in the Appendix.

Experiments

We provide a number of experimental applications for our test. We begin with a simple check to establish correct test calibration on non-i.i.d. data, followed by a demonstration of statistical model criticism for Gaussian process (GP) regression. We then apply the proposed test to quantify bias-variance trade-offs in MCMC, and demonstrate how to use the test to verify whether MCMC samples are drawn from the desired stationary distribution. In the final experiment, we move away from the MCMC setting, and use the test to evaluate the convergence of a nonparametric density estimator. Code can be found at https://github.com/karlnapf/kernel_goodness_of_fit.

In our first task, we modify Experiment 4.1 from Gorham & Mackey . The null hypothesis is that the observed samples come from a standard normal distribution. We study the power of the test against samples from a Student’s t distribution. We expect to observe low p-values when testing against a Student’s t distribution with few degrees of freedom. We consider 1, 5, 10 or $\infty$ degrees of freedom, where $\infty$ is equivalent to sampling from a standard normal distribution. For a fixed number of degrees of freedom we draw 1400 samples and calculate the p-value. This procedure is repeated 100 times, and the bar plots of p-values are shown in Figures 1,2,3.

Our twist on the original experiment 4.1 by Gorham & Mackey is that the draws from the Student’s t distribution exhibit temporal correlation. We generate samples using a Metropolis–Hastings algorithm, with a Gaussian random walk with variance 1/2. We emphasise the need for an appropriate choice of the wild bootstrap process parameter $a_{n}$ . In Figure 1 we plot p-values for $a_{n}$ being set to $0.5$ . Such a high value of $a_{n}$ is suitable for i.i.d. observations, but results in p-values that are too conservative for temporally correlated observations. In Figure 2, we set $a_{n}=0.02$ , which gives a well calibrated distribution of the p-values under the null hypothesis, however the test power is reduced. Indeed, p-values for five degrees of freedom are already large. The solution that we recommend is a mixture of thinning and adjusting $a_{n},$ as presented in the Figure 3. We thin the observations by a factor of 20 and set $a_{n}=0.1$ , thus preserving both good statistical power and correct calibration of p-values under the null hypothesis. In a general, we recommend to thin a chain so that $\text{Cor}(X_{t},X_{t-1})<0.5$ , set $a_{n}=0.1/k$ , and run test with at least $\max(500k,d100)$ data points, where $k<10$ .

In this experiment, we compare with the test proposed by Baringhaus & Henze , which is essentially an MMD test for normality, i.e. the null hypothesis is that $Z$ is a $d$ -dimensional standard normal random variable. We set the sample size to $n=500,1000$ and $a_{n}=0.5$ , generate

and modify $Z_{0}\leftarrow Z_{0}+Y$ . Table 1 shows the power as a function of the sample size. We observe that for higher dimensions, and where the expectation of the kernel exists in closed form, an MMD-type test like is a better choice.

We next apply our test to the problem of statistical model criticism for GP regression. Our presentation and approach are similar to the non i.i.d. case in Section 6 of Lloyd & Ghahramani . We use the solar dataset, consisting of a $d=1$ regression problem with $N=402$ pairs $(X,y)$ . We fit $N_{\text{train}}=361$ data using a GP with an exponentiated quadratic kernel and a Gaussian noise model, and perform standard maximum likelihood II on the hyperparameters (length-scale, overall scale, noise-variance). We then apply our test to the remaining $N_{\text{test}}=41$ data. The test attempts to falsify the null hypothesis that the solar dataset was generated from the plug-in predictive distribution (conditioned on training data and predicted position) of the GP. Lloyd & Ghahramani refer to this setup as non-i.i.d., since the predictive distribution is a different univariate Gaussian for every predicted point. Our particular $N_{\text{train}},N_{\text{test}}$ were chosen to make sure the GP fit has stabilised, i.e. adding more data did not cause further model refinement.

Figure 4 shows training and testing data, and the fitted GP. Clearly, the Gaussian noise model is a poor fit for this particular dataset, e.g. around $X=-1$ . Figure 5 shows the distribution over $D=10000$ bootstrapped V-statistics $B_{n}$ with $n=N_{\text{test}}$ . The test statistic lies in an upper quantile of the bootstrapped null distribution, correctly indicating that it is unlikely the test points were generated by the fitted GP model, even for the low number of test data observed, $n=41$ .

We now illustrate how to quantify bias-variance trade-offs in an approximate MCMC algorithm – austerity MCMC . For the purpose of illustration we use a simple generative model from Gorham & Mackey , Welling & Teh ,

Austerity MCMC is a Monte Carlo procedure designed to reduce the number of likelihood evaluation in the acceptance step of the Metropolis-Hastings algorithm. The crux of method is to look at only a subset of the data, and make an acceptance/rejection decision based on this subset. The probability of making a wrong decision is proportional to a parameter $\epsilon\in$ . This parameter influences the time complexity of austerity MCMC: when $\epsilon$ is larger, i.e., when there is a greater tolerance for error, the expected computational cost is lower. We simulate $\{X_{i}\}_{1\leq i\leq 400}$ points from the model with $\theta_{1}=0$ and $\theta_{2}=1$ . In our experiment, there are two modes in the posterior distribution: one at $(0,1)$ and the other at $(1,-1)$ . We run the algorithm with $\epsilon$ varying over the range $[0.001,0.2]$ . For each $\epsilon$ we calculate an individual thinning factor, such that correlation between consecutive samples from the chains is smaller than $0.5$ (greater $\epsilon$ generally required more thinning). For each $\epsilon$ we test the hypothesis that $\{\theta_{i}\}_{1\leq i\leq 500}$ is drawn from the true stationary posterior, using our goodness of fit test. We generate 100 p-values for each $\epsilon$ , as shown in Figure 6. A good approximation of the true stationary distribution is obtained at $\epsilon=0.4$ , which is still parsimonious in terms of likelihood evaluations, as shown in Figure 7.

In our final experiment, we apply our goodness of fit test to measuring quality-of-fit in nonparametric density estimation. We evaluate two density models: the infinite dimensional exponential family , and a recent approximation to this model using random Fourier features . Our implementation of the model assumes the log density to take the form $f(x)$ , where $f$ lies in an RKHS induced by a Gaussian kernel with bandwidth $1$ . We fit the model using $N$ observations drawn from a standard Gaussian, and perform our quadratic time test on a separate evaluation dataset of fixed size $n=500$ . Our goal is to identify $N$ sufficiently large that the goodness of fit test does not reject the null hypothesis (i.e., the model has learned the density sufficiently well, bearing in mind that it is guaranteed to converge for sufficiently large $N$ ). Figure 8 shows how the distribution of p-values evolves as a function of $N$ ; this distribution is uniform for $N=5000$ , but at $N=500$ , the null hypothesis would very rarely be rejected.

We next consider the random Fourier feature approximation to this model, where the log pdf, $f$ , is approximated using a finite dictionary of random Fourier features . The natural question when using this approximation is: “How many random features are needed?” Using the same test set size $n=500$ as above, and a large number of samples, $N=5\cdot 10^{4}$ , Figure 9 shows the distributions of p-values for an increasing number of random features $m$ . From $m=50$ , the null hypothesis would rarely be rejected. Note, however, that the p-values do not have a uniform distribution, even for a large number of random features. This subtle effect is caused by over-smoothing due to the regularisation approach taken by Strathmann et al. [40, KMC finite], which would not otherwise have been detected.

Arthur Gretton has ORCID 0000-0003-3169-7624.

References

Appendix

If a random variable $X$ is distributed according to $p$ , under conditions on the kernel

We check assumptions of Theorem 2.1 from . Condition A1, $\sum_{t=1}^{\infty}\sqrt{\tau(t)}\leq\infty$ , is implied by assumption $\sum_{t=1}^{\infty}t^{2}\sqrt{\tau(t)}\leq\infty$ in Section 3. Condition A2 (iv), Lipschitz continuity of $h$ , is assumed. Conditions A2 i), ii) positive definiteness, symmetry and degeneracy of $h$ follow from the proof of Theorem (2.2). Indeed

MCMC convergence testing

In the stationary phase there are number of results which might be used to show that the chain is $\tau$ -mixing.

Strong mixing is historically the most studied type of temporal dependence – a lot of models, including Markov Chains, are proved to be strongly mixing, therefore it’s useful to relate weak mixing to strong mixing. For a random variable $X$ on a probability space $(\Omega,\mathcal{F},P_{X})$ and $\mathcal{M}\subset\mathcal{F}$ we define

A process is called $\beta$ -mixing or absolutely regular if

Dedecker & Prieur [Equation 7.6] relates $\tau$ -mixing and $\beta$ -mixing , as follows: if $Q_{x}$ is the generalized inverse of the tail function

While this definition can be hard to interpret, it can be simplified in the case $E|X|^{p}=M$ for some $p>1$ , since via Markov’s inequality $P(|X|>t)\leq\frac{M}{t^{p}}$ , and thus $\frac{M}{t^{p}}\leq u$ implies $P(|X|>t)\leq u$ . Therefore $Q^{\prime}(u)=\frac{M}{\sqrt[p]{u}}\geq Q_{x}(u)$ . As a result, we have the inequality

Dedecker & Prieur provide examples of systems that are $\tau$ -mixing. In particular, given that certain assumptions are satisfied, then causal functions of stationary sequences, iterated random functions, Markov chains, and expanding maps are all $\tau$ -mixing.

Of particular interest to this work are Markov chains. The assumptions provided by , under which Markov chains are $\tau$ -mixing, are difficult to check. We can, however, use classical theorems about the absolute regularity ( $\beta$ -mixing). In particular [7, Corollary 3.6] states that a Harris recurrent and aperiodic Markov chain satisfies absolute regularity, and [7, Theorem 3.7] states that geometric ergodicity implies geometric decay of the $\beta$ coefficient. Interestingly [7, Theorem 3.2] describes situations in which a non-stationary chain $\beta$ -mixes exponentially.

Using inequalities between $\tau$ -mixing coefficient and strong mixing coefficients, one can use these classical theorems show that e.g for $p=2$ we have