Fast Two-Sample Testing with Analytic Representations of Probability Measures

Kacper Chwialkowski, Aaditya Ramdas, Dino Sejdinovic, Arthur Gretton

Introduction

Testing whether two random variables are identically distributed without imposing any parametric assumptions on their distributions is important in a variety of scientific applications. These include data integration in bioinformatics , benchmarking for steganography and automated model checking . Such problems are addressed in the statistics literature via two-sample tests (also known as homogeneity tests).

In contrast to consistent two-sample tests, heuristics based on pseudo-distances, such as the difference between characteristic functions evaluated at a single frequency, have been studied in the context of goodness-of-fit tests . It was shown that the power of such tests can be maximized against fully specified alternative hypotheses, where test power is the probability of correctly rejecting the null hypothesis that the distributions are the same. In other words, if the class of distributions being distinguished is known in advance, then the tests can focus only at those particular frequencies where the characteristic functions differ most. This approach was generalized to evaluating the empirical characteristic functions at multiple distinct frequencies by , thus improving on tests that need to know the single “best” frequency in advance (the cost remains linear in the sample size, albeit with a larger constant). This approach still fails to solve the consistency problem, however: two distinct characteristic functions can agree on an interval, and if the tested frequencies fall in that interval, the distributions will be indistinguishable.

In Section 2 of the present work, we introduce two novel distances between distributions, which both use a parsimonious representation of the probability measures. The first distance builds on the notion of differences in characteristic functions with the introduction of smooth characteristic functions, which can be though as the analytic analogues of the characteristics functions. A distance between smooth characteristic functions evaluated at a single random frequency is almost surely a distance (Definition 1 formalizes this concept) between these two distributions. In other words, there is no need to calculate the whole infinite dimensional representation - it is almost surely sufficient to evaluate it at a single random frequency (although checking more frequencies will generally result in more powerful tests). The second distance is based on analytic mean embeddings of two distributions in a characteristic RKHS; again, it is sufficient to evaluate the distance between mean embeddings at a single randomly chosen point to obtain almost surely a distance. To our knowledge, this representation is the first mapping of the space of probability measures into a finite dimensional Euclidean space (in the simplest case, the real line) that is almost surely an injection, and as a result almost surely a metrization. This metrization is very appealing from a computational viewpoint, since the statistics based on it have linear time complexity (in the number of samples) and constant memory requirements.

We construct statistical tests in Section 3, based on empirical estimates of differences in the analytic representations of the two distributions. Our tests have a number of theoretical and computational advantages over previous approaches. The test based on differences between analytic mean embeddings is a.s. consistent for all distributions, and the test based on differences between smoothed characteristic functions is a.s. consistent for all distributions with integrable characteristic functions (contrast with , which is only consistent under much more onerous conditions, as discussed above). This same weakness was used by in justifying a test that integrates over the entire frequency domain (albeit at cost quadratic in the sample size), for which the quadratic-time MMD is a generalization . Compared with such quadratic time tests, our tests can be conducted in linear time – hence, we expect their power/computation tradeoff to be superior.

We provide several experimental benchmarks (Section 4) for our tests. First, we compare test power as a function of computation time for two real-life testing settings: amplitude modulated audio samples, and the Higgs dataset, which are both challenging multivariate testing problems. Our tests give a better power/computation tradeoff than the characteristic function-based tests of , the previous sub-quadratic-time MMD tests , and the quadratic-time MMD test. In terms of power when unlimited computation time is available, we might expect worse performance for the new tests, in line with findings for linear- and sub-quadratic-time MMD-based tests . Remarkably, such a loss of power is not the rule: for instance, when distinguishing signatures of the Higgs boson from background noise (’Higgs dataset’), we observe that a test based on differences in smoothed empirical characteristic functions outperforms the quadratic-time MMD. This is in contrast to linear- and sub-quadratic-time MMD-based tests, which by construction are less powerful than quadratic-time MMD. Next, for challenging artificial data (both high-dimensional distributions, and distributions for which the difference is very subtle), our tests again give a better power/computation tradeoff than competing methods.

Analytic embeddings and distances

In this section we consider mappings from the space of probability measures into a sub-space of real valued analytic functions. We will show that evaluating these maps at JJ randomly selected points is almost surely injective for any J>0J>0. Using this result, we obtain a simple (randomized) metrization of the space of probability measures. This metrization is used in the next section to construct linear-time nonparametric two-sample tests.

To motivate our approach, we begin by recalling an integral family of distances between distributions, denoted Maximum Mean Discrepancies (MMD) . The MMD is defined as

where PP and QQ are probability measures on EE, and BkB_{k} is the unit ball in the RKHS HkH_{k} associated with a positive definite kernel k:E×ERk:E\times E\to\mathbf{R}. A popular choice of kk is the Gaussian kernel k(x,y)=exp(xy2/γ2)k(x,y)=\exp(-\|x-y\|^{2}/\gamma^{2}) with bandwidth parameter γ\gamma. It can be shown that the MMD is equal to the RKHS distance between so called mean embeddings,

where μP\mu_{P} is an embedding of the probability measure PP to HkH_{k},

and Hk\|\cdot\|_{H_{k}} denotes the norm in the RKHS HkH_{k}. When kk is translation invariant, i.e., k(x,y)=κ(xy)k\left(x,y\right)=\kappa(x-y), the squared MMD can be written [26, Corollary 4] as

where FF denotes the Fourier transform, F1F^{-1} is the inverse Fourier transform, and φP\varphi_{P}, φQ\varphi_{Q} are the characteristic functions of PP, QQ, respectively. From [26, Theorem 9], a kernel is called characteristic when

Any bounded, continuous, translation-invariant kernel whose inverse Fourier transform is almost everywhere non-zero is characteristic . By representation (2), it is clear that MMD with a characteristic kernel is a metric.

A practical limitation when using the MMD in testing is that empirical estimates are expensive to compute, these being the sum of two U-statistics and an empirical average, with cost quadratic in the sample size. We might instead consider a finite dimensional approximation to the MMD, achieved by estimating the integral (4), with the random variable

where {Tj}j=1J\left\{T_{j}\right\}_{j=1}^{J} are sampled independently from the distribution with a density function F1κF^{-1}\kappa. This type of approximation is applied to various kernel algorithms under the name of random Fourier features . In the statistical testing literature, the quantity dφ,J(P,Q)d_{\varphi,J}(P,Q) predates the MMD by a considerable time, and was studied in , and more recently revisited in . Our first proposition is that dφ,J2(P,Q)d_{\varphi,J}^{2}(P,Q) can be a poor choice of distance between probability measures, as it fails to distinguish a large class of measures. The following result is proved in the Appendix.

We are therefore motivated to find distances of the form (6) that can distinguish larger classes of distributions, yet remain efficient to compute. These distances are characterized as follows:

A random process dd with the values in R\mathbf{R}, indexed with pairs from the set of probability measures M\mathcal{M}

is said to be a random metric if it satisfies all the conditions for a metric with qualification ‘almost surely’. Formally, for all P,Q,RMP,Q,R\in\mathcal{M}, random variables d(P,Q),d(P,R),d(R,Q)d(P,Q),d(P,R),d(R,Q) must satisfy

if P=QP=Q, then d(P,Q)=0d(P,Q)=0 a.s, if PQP\neq Q then d(P,Q)=0d(P,Q)=0 a.s.

d(P,Q)d(P,R)+d(R,Q)d(P,Q)\leq d(P,R)+d(R,Q) a.s. Note that this does not imply that realizations of dd are distances on M\mathcal{M}, but it does imply that they are almost surely distances for all arbitrary finite subsets of M\mathcal{M}.

From the statistical testing point of view, the coincidence axiom of a metric dd, d(P,Q)=0d(P,Q)=0 if and only if P=QP=Q, is key, as it ensures consistency against all alternatives. The quantity dφ,J(P,Q)d_{\varphi,J}(P,Q) in (6) violates the coincidence axiom, so it is only a random pseudometric (other axioms are trivially satisfied). We remedy this problem by replacing the characteristic functions by smooth characteristic functions:

A smooth characteristic function ϕP(t)\phi_{P}(t) of a measure PP is a characteristic function of PP convolved with an analytic smoothing kernel ll, i.e.

The analogue of dφ,J(P,Q)d_{\varphi,J}(P,Q) for smooth characteristic functions is simply

where {Tj}j=1J\left\{T_{j}\right\}_{j=1}^{J} are sampled independently from the absolutely continuous distribution (returning to our earlier example, this might be F1κ(t)F^{-1}\kappa(t) if we believe this to be an informative choice). The following theorem, proved in the Appendix, demonstrates that the smoothing greatly increases the class of distributions we can distinguish.

Let ll be an analytic, integrable kernel with an inverse Fourier transform strictly greater than zero. Then, for any J>0J>0, dϕ,Jd_{\phi,J} is a random metric on the space of probability measures with integrable characteristic functions, and ϕP\phi_{P} is an analytic function.

This result is primarily a consequence of analyticity of smooth characteristic functions and the fact that analytic functions are ’well behaved’. There is an additional, practical advantage to smoothing: when the variability in the difference of the characteristic functions is high, and these differences are local, smoothing distributes the difference in CFs more evenly in the frequency domain (a simple illustration is in Fig. A.1, Appendix), making them easier to find by measurement at a small number of randomly chosen points. This accounts for the observed improvements in test power in Section 4, over differences in unsmoothed CFs.

Metrics based on mean embeddings.

The key step which led us to the construction of a random metric dϕ,Jd_{\phi,J} is the convolution of the original characteristic functions with an analytic smoothing kernel. This idea need not be restricted to the representations of probability measures in the frequency domain. We may instead directly convolve the probability measure with a positive definite kernel kk (that need not be translation invariant), yielding its mean embedding into the associated RKHS,

We say that a positive definite kernel k:RD×RDRk:\mathbf{R}^{D}\times\mathbf{R}^{D}\to\mathbf{R} is analytic on its domain if for all xRDx\in\mathbf{R}^{D}, the feature map k(x,)k(x,\cdot) is an analytic function on RD\mathbf{R}^{D}. By using embeddings with characteristic and analytic kernels, we obtain particularly useful representations of distributions. As for the smoothed CF case, we define

The following theorem ensures that dμ,J(P,Q)d_{\mu,J}(P,Q) is also a random metric.

Let kk be an analytic, integrable and characteristic kernel. Then for any J>0J>0, dμ,Jd_{\mu,J} is a random metric on the space of probability measures (and μP\mu_{P} is an analytic function).

Note that this result is stronger than the one presented in Theorem 1, since is is not restricted to the class of probability measures with integrable characteristic functions. Indeed, the assumption that the characteristic function is integrable implies the existence and boundedness of a density. Recalling the representation of MMD in (2), we have proved that it is almost always sufficient to measure difference between μP\mu_{P} and μQ\mu_{Q} at a finite number of points, provided our kernel is characteristic and analytic. In the next section, we will see that metrization of the space of probability measures using random metrics dμ,Jd_{\mu,J}, dϕ,Jd_{\phi,J} is very appealing from the computational point of view. It turns out that the statistical tests that arise from those metrics have linear time complexity (in the number of samples) and constant memory requirements.

Hypothesis Tests Based on Distances Between Analytic Functions

In this section, we provide two linear-time two-sample tests: first, a test based on analytic mean embeddings, and then a test based on smooth characteristic functions. We further describe the relation with competing alternatives. Proofs of this chapter’s propositions are in the Appendix B.

Difference in analytic functions In the previous section we described the random metric based on a difference in analytic mean embeddings, dμ,J2(P,Q)=1Jj=1J(μP(Tj)μQ(Tj))2.d_{\mu,J}^{2}(P,Q)=\frac{1}{J}\sum_{j=1}^{J}(\mu_{P}(T_{j})-\mu_{Q}(T_{j}))^{2}. If we replace μP\mu_{P} with the empirical mean embedding μ^P=1ni=1nk(Xi,)\hat{\mu}_{P}=\frac{1}{n}\sum_{i=1}^{n}k(X_{i},\cdot) it can be shown that for any sequence of unique {tj}j=1J\{t_{j}\}_{j=1}^{J}, under the null hypothesis, as nn\to\infty,

converges in distribution to a sum of correlated chi-squared variables. Even for fixed {tj}j=1J\{t_{j}\}_{j=1}^{J}, it is very computationally costly to obtain quantiles of this distribution, since this requires a bootstrap or permutation procedure. We will follow a different approach based on Hotelling’s T2T^{2}-statistic . The Hotelling’s T2T^{2}-squared statistic of a normally distributed, zero mean, Gaussian vector W=(W1,,WJ)W=(W_{1},\cdots,W_{J}), with a covariance matrix Σ\Sigma, is T2=WΣ1WT^{2}=W\Sigma^{-1}W. The compelling property of the statistic is that it is distributed as a χ2\chi^{2}-random variable with JJ degrees of freedom. To see a link between T2T^{2} and equation (11), consider a random variable i=jJWj2\sum_{i=j}^{J}W_{j}^{2}: this is also distributed as a sum of correlated chi-squared variables. In our case WW is replaced with a difference of normalized empirical mean embeddings, and Σ\Sigma is replaced with the empirical covariance of the difference of mean embeddings. Formally, let ZiZ_{i} denote the vector of differences between kernels at tests points TjT_{j},

We define the vector of mean empirical differences Wn=1ni=1nZi,W_{n}=\frac{1}{n}\sum_{i=1}^{n}Z_{i}, and its covariance matrix Σn=1nZZT\Sigma_{n}=\frac{1}{n}ZZ^{T}. The test statistic is

The computation of SnS_{n} requires inversion of a J×JJ\times J matrix Σn\Sigma_{n}, but this is fast and numerically stable: JJ will typically be small and is in our experiments less than 10. The next proposition demonstrates the use of SnS_{n} as a two-sample test statistic.

We now apply the above proposition in obtaining a statistical test.

Calculate SnS_{n}. Choose a threshold rαr_{\alpha} corresponding to the 1α1-\alpha quantile of a χ2\chi^{2} distribution with JJ degrees of freedom, and reject the null hypothesis whenever SnS_{n} is larger than rαr_{\alpha}.

There are a number of valid sampling schemes for the test points {Tj}j=1J\left\{T_{j}\right\}_{j=1}^{J} to evaluate the differences in mean embeddings: see Section 4 for a discussion.

Difference in smooth characteristic functions From the convolution definition of a smooth characteristic function (7) it is not clear how to calculate its estimator in linear time. However, we show in the next proposition that a smooth characteristic function can be written as an expected value of some function with respect to the given measure, which can be estimated in a linear time.

Let kk be an integrable translation-invariant kernel and ff its inverse Fourier transform. Then the smooth characteristic function of PP can be written as ϕP(t)=Rdeitxf(x)dP(x).\phi_{P}(t)=\int_{\mathbf{R}^{d}}e^{it^{\top}x}f(x)dP(x).

It is now clear that a test based on the smooth characteristic functions is similar to the test based on mean embeddings. The main difference is in the definition of the vector of differences ZiZ_{i}:

The imaginary and real part of the e1TjXif(Xi)e1TjYif(Yi)e^{\sqrt{-1}T_{j}^{\top}X_{i}}f(X_{i})-e^{\sqrt{-1}T_{j}^{\top}Y_{i}}f(Y_{i}) are stacked together, in order to ensure that WnW_{n}, Σn\Sigma_{n} and SnS_{n} as all real-valued quantities.

Let dμ,J2(P,Q)=0d_{\mu,J}^{2}(P,Q)=0 and let {Xi}i=1n\{X_{i}\}_{i=1}^{n} and {Yi}i=1n\{Y_{i}\}_{i=1}^{n} be i.i.d. samples from PP and QQ respectively. Then the statistic SnS_{n} is almost surely asymptotically distributed as a χ2\chi^{2}-random variable with 2J2J degrees of freedom (as nn\to\infty with JJ fixed). If dϕ,J2(P,Q)>0d_{\phi,J}^{2}(P,Q)>0 , then almost surely for any fixed rr, P(Sn>r)1P(S_{n}>r)\to 1 as nn\to\infty.

Other tests. The test based on empirical characteristic functions was constructed originally for one test point and then generalized to many points - it is quite similar to our second test, but does not perform smoothing (it is also based on a T2T^{2}-Hotelling statistic). The block MMD is a sub-quadratic test, which can be trivially linearized by fixing the block size, as presented in the Appendix. Finally, another alternative is the MMD, an inherently quadratic time test. We scale MMD to linear time by sub-sampling our data set, and choosing only n\sqrt{n} points, so that the MMD complexity becomes O(n)O(n). Note, however, that the true complexity of MMD involves a permutation calculation of the null distribution at cost O(bnn)O(b_{n}n), where the number of permutations bnb_{n} grows with nn. See Appendix C for a detailed description of alternative tests.

Experiments

In this section we compare two-sample tests on both artificial benchmark data and on real-world data. We denote the smooth characteristic function test as ‘Smooth CF’, and the test based on the analytic mean embeddings as ‘Mean Embedding’. We compare against several alternative testing approaches: block MMD (‘Block MMD’), a characteristic functions based test (‘CF’), a sub-sampling MMD test (‘MMD(n\sqrt{n})’), and the quadratic-time MMD test (‘MMD(n)’).

Experimental setup. For all the experiments, DD is the dimensionality of samples in a dataset, nn is a number of samples in the dataset (sample size) and JJ is number of test frequencies. Parameter selection is required for all the tests. The table summarizes the main choices of the parameters made for the experiments. The first parameter is the test function, used to calculate the particular statistic. The scalar γ\gamma represents the length-scale of the observed data. Notice that for the kernel tests we recover the standard parameterization exp(xγyγ2)=exp(xy2γ2)\exp(-\|\frac{x}{\gamma}-\frac{y}{\gamma}\|^{2})=\exp(-\frac{\|x-y\|^{2}}{\gamma^{2}}). The original CF test was proposed without any parameters, hence we added γ\gamma to ensure a fair comparison - for this test varying γ\gamma is equivalent to adjusting the variance of the distribution of frequencies TjT_{j}. For all tests, the value of the scaling parameter γ\gamma was chosen so as to maximize test power on a held-out training set: details are described in Appendix D. We chose not to optimize the sampling scheme for the Mean Embedding and Smooth CF tests, since this would give them an unfair advantage over the Block MMD, MMD(n\sqrt{n}) and CF tests. The block size in the Block MMD test and the number of test frequencies in the Mean Embedding, Smooth CF, and CF tests, were always set to the same value (not greater than 10) to maintain exactly the same time complexity. Note that we did not use the popular median heuristic for kernel bandwidth choice (MMD and B-test), since it gives poor results for the Blobs and AM Audio datasets . We do not run MMD(n) test in the ’Simulation 1’ and on the ’Amplitude Modulated Music’ since the sample size is 1000010000, i.e., too large for a quadratic-time test with permutation sampling for the test critical value.

It is important to verify that Type I error is indeed at the design level, set at α=0.05\alpha=0.05 in this paper. This is verified in the Appendix figure A.2. Also shown in the plots is the 95%95\% percent confidence intervals for the results, as averaged over 4000 runs.

D=4D=4, nn varies, J=10J=10. The first experiment we consider is on the UCI Higgs dataset described in - the task is to distinguish signatures of processes which produce Higgs bosons from background processes which do not. We consider a two-sample test on certain extremely low-level features in the dataset - kinematic properties measured by the particle detectors, i.e., the joint distributions of the azimuthal angular momenta φ\varphi for four particle jets. We denote by PP the jet φ\varphi-momenta distribution of the background process (no Higgs bosons), and by QQ the corresponding distribution for the process that produces Higgs bosons (both are distributions on R4\mathbf{R}^{4}). As discussed in [2, Fig. 2], φ\varphi-momenta, unlike transverse momenta pTp_{T}, carry very little discriminating information for recognizing whether Higgs bosons were produced or not. Therefore, we would like to test the null hypothesis that the distributions of angular momenta PP (no Higgs boson observed) and QQ (Higgs boson observed) might yet be rejected. The results for different algorithms are presented in the Figure 1. We observe that the joint distribution of the angular momenta is in fact a discriminative feature. Sample size varies from 1000 to 12000. The Smooth CF test has significantly higher power than the other tests, including the quadratic-time MMD, which we could only run on up to 51005100 samples due to computational limitations. The leading performance of the Smooth CF test is especially remarkable given it is several orders of magnitude faster then the quadratic-time MMD(n), which is both expensive to compute, and requires a costly permutation approach to determine the significance threshold.

Real Data 2: Amplitude Modulated Music,

D=1000D=1000, n=10000n=10000, J=10J=10. Amplitude modulation is the earliest technique used to transmit voice over the radio. In the following experiment observations were one thousand dimensional samples of carrier signals that were modulated with two different input audio signals from the same album, song PP and song QQ (further details of these data are described in [10, Section 5]). To increase the difficulty of the testing problem, independent Gaussian noise of increasing variance (in the range 11 to 4.04.0) was added to the signals. The results are presented in the Figure 2. Compared to the other tests, the Mean Embedding and Smooth CF tests are more robust to the moderate noise contamination.

Simulation 1: High Dimensions,

DD varies, n=10000n=10000, J=3J=3. It has been recently shown, in theory and in practice, that the two-sample problem gets more difficult as the number of the dimensions increases on which the distributions do not differ . In the following experiment, we study the power of the two-sample tests as a function of dimension of the samples. We run two-sample test on two datasets of Gaussian random vectors which differ only in the first dimension,

where 0d0_{d} is a DD-dimensional vector of zeros, IDI_{D} is a DD-dimensional identity matrix, and diag(v)\text{diag}(v) is a diagonal matrix with vv on the diagonal. The number of dimensions (D) varies from 50 to 1000 (Dataset I) and from 50 to 2500 (Dataset II). The power of the different two-sample tests is presented in Figure 3. The Mean Embedding test yields best performance for both datasets, where the advantage is especially large for differences in variance.

Simulation 2: Blobs,

D=2D=2, nn varies, J=5J=5. The Blobs dataset is a grid of two dimensional Gaussian distributions (see Figure 4), which is known to be a challenging two-sample testing task. The difficulty arises from the fact that the difference in distributions is encoded at a much smaller lengthscale than the overall data. In this experiment both PP and QQ are a four by four grid of Gaussians, where PP has unit covariance matrix in each mixture component, while each component of QQ has a non unit covariance matrix. It was demonstrated by that a good choice of kernel is crucial for this task. Figure 4 presents the results of two-sample tests on the Blobs dataset. The number of samples varies from 50 to 14000 ( MMD(n) reached test power one with n=1400n=1400). We found that the MMD(n) test has the best power as function of the sample size, but the worst power/computation tradeoff. By contrast, random distance based tests have the best power/computation tradeoff.

References

Appendix A Figures

Appendix B Proofs

For some I=I(ϵ)I=I(\epsilon), there exists an interval [I,I][-I,I] with measure 1(1ϵ)1J1-(1-\epsilon)^{\frac{1}{J}}. Define fw(t)=1wtf_{w}(t)=1-w|t| for w>1Iw>\frac{1}{I} and zero elsewhere. By Polya’s theorem, A={fw}w>1I\mathcal{A}=\{f_{w}\}_{w>\frac{1}{I}} is an uncountable family of characteristic functions that are the same on the complement of [I,I][-I,I], which has measure (1ϵ)1J(1-\epsilon)^{\frac{1}{J}}. For w1>w2>1Iw_{1}>w_{2}>\frac{1}{I}, fw1fw2f_{w_{1}}\neq f_{w_{2}} in some neighborhood of 1/w11/{w_{1}}, hence the measures associated with those characteristic functions are different. The probability that all TiT_{i} sit in the complement of interval [I,I][-I,I] is ((1ϵ)1J)J=(1ϵ)\left((1-\epsilon)^{\frac{1}{J}}\right)^{J}=(1-\epsilon) and such an event implies that Sφ,J2=0S_{\varphi,J}^{2}=0.

Proof of Theorem 2

First we give a proposition that characterizes limits of analytic functions.

If {fn}\{f_{n}\} is a sequence of real valued, uniformly bounded analytic functions on Rd\mathbf{R}^{d} converging pointwise to ff, then ff is analytic.

Now we characterize the RKHS of an analytic kernel. Similar results were proved for specific classes of kernels in [29, Theorem 1], [28, Corollary 3.5].

If kk is a bounded, analytic kernel on Rd×Rd\mathbf{R}^{d}\times\mathbf{R}^{d}, then all functions in the RKHS Hk\mathcal{H}_{k} associated with this kernel are analytic.

Since Rd\mathbf{R}^{d} is separable then by [27, Lemma 4.33] Hilbert Space Hk\mathcal{H}_{k} is separable. By Moore-Aronszajn Theorem there exist a set H0H_{0} of linear combinations of functions k(,x),xRdk(\cdot,x),x\in\mathbf{R}^{d}, which is dense in Hk\mathcal{H}_{k} and Hk\mathcal{H}_{k} is a set of functions which are pointwise limits of Cauchy sequences in H0H_{0}. For each fHkf\in\mathcal{H}_{k} let {fn}H0\{f_{n}\}\in\mathcal{H}_{0} be a sequence of functions converging in the Hilbert Space norm to ff. Since {fn}\{f_{n}\} is convergent there exists NN such that n>N\forall n>N fnf1\left\|f_{n}-f\right\|\leq 1. For all nn there exist a uniform bound on fnf_{n} norm

Since kk is bounded, by the [27, Lemma 4.33], there exists CC such that for any fHkf\in\mathcal{H}_{k}, fCf\|f\|_{\infty}\leq C\|f\|. Therefore for all nn

Finally, using Proposition 5 we conclude that ff is analytic. This concludes the proof of Lemma 1.

Next, we show that analytic functions are ’well behaved’.

Let μ\mu be absolutely continuous measure on Rd\mathbf{R}^{d} (wrt. the Lebesgue measure). Non-zero, analytic function ff can be zero at most at the set of measure 0, with respect to the measure μ\mu.

If ff is zero at the set with a limit point then it is zero everywhere. Therefore ff can be zero at most at a set AA without a limit point, which by definition is a discrete set (distance between any two points in AA is greater then some ϵ>0\epsilon>0). Discrete sets have zero Lebesgue measure (as a countable union of points with zero measure). Since PP is absolutely continuous then μ(A)\mu(A) is zero as well. ∎

Next, we show how to construct random distances.

Let Λ\Lambda be an injective mapping from the space of the probability measures into a space of analytic functions on Rd\mathbf{R}^{d}. Define

where {Tj}j=1J\left\{T_{j}\right\}_{j=1}^{J} are real valued i.i.d. random variables from a distribution which is absolutely continuous with respect to the Lebesgue measure. Then, dΛ,J2(P,Q)d^{2}_{\Lambda,J}(P,Q) is a random metric.

Let ΛP\Lambda P and ΛQ\Lambda Q be images of measures PP and QQ respectively. We want to apply Lemma 2 to the analytic function f=ΛPΛQf=\Lambda P-\Lambda Q, with the measure μ=μTi\mu=\mu_{T_{i}}, to see that if PQP\neq Q then f0f\neq 0 a.s. To do so, we need to show that PQP\neq Q implies that ff is non-zero. Since mapping to Λ\Lambda is injective, there must exists at least one point oo where ff is non-zero. By continuity of ff, there exists a ball around oo in which ff is non-zero.

We have shown that PQP\neq Q implies f0f\neq 0 a.s. which in turn implies that dΛ,J(P,Q)>0d_{\Lambda,J}(P,Q)>0 a.s. If P=QP=Q then f=0f=0 and dΛ,J(P,Q)=0d_{\Lambda,J}(P,Q)=0.

By the construction dΛ,J(P,Q)=dΛ,J(Q,P)d_{\Lambda,J}(P,Q)=d_{\Lambda,J}(Q,P) and for any measure UU, dΛ,J(P,Q)dΛ,J(P,U)+dΛ,J(U,Q)d_{\Lambda,J}(P,Q)\leq d_{\Lambda,J}(P,U)+d_{\Lambda,J}(U,Q) a.s. since the triangle inequality holds for any vectors in RJ\mathbf{R}^{J}. ∎

Since kk is characteristic the mapping Λ:PμP\Lambda:P\to\mu_{P} is injective. Since kk is a bounded, analytic kernel on Rd×Rd\mathbf{R}^{d}\times\mathbf{R}^{d}, the Lemma 1 guarantees that μP\mu_{P} is analytic, hence the image of Λ\Lambda is a subset of analytic functions. Therefore, we can use Lemma 3 to see that dΛ,J(P,Q)2=dμ,J(P,Q)2d_{\Lambda,J}(P,Q)^{2}=d_{\mu,J}(P,Q)^{2} is a random metric. This concludes the proof of Theorem 2. ∎

Proof of Theorem 1

We first show that smooth characteristic functions are unique to distributions.

If ll is an analytic, integrable, translation invariant kernel with an inverse Fourier transform strictly greater then zero and PP has integrable characteristic function, then the mapping

is injective and ϕP\phi_{P} is element of the RKHS Hl\mathcal{H}_{l} associated with ll.

For the integrable characteristic function φ\varphi we define a functional L:HlRL:\mathcal{H}_{l}\to R given by formula

Since L(f+g)=L(f)+L(g)L(f+g)=L(f)+L(g), LL is linear. We check that LL is bounded; let B={fHl:f1}B=\{f\in\mathcal{H}_{l}:\parallel f\parallel\leq 1\} be a unit ball in the Hilbert Space.

By Riesz representation Theorem there exist ϕH\phi\in H such that ϕ,f=Rdφ(x)f(x)dx\langle\phi,f\rangle=\int_{\mathbf{R}^{d}}\varphi(x)f(x)dx. By reproducing property ϕ\phi is given by equation ϕ(x)=ϕ,l(t,)=Rdl(x,t)φ(x)dx\phi(x)=\langle\phi,l(t,)\rangle=\int_{\mathbf{R}^{d}}l(x,t)\varphi(x)dx. With each probability measure PP with an integrable characteristic function φP\varphi_{P} we associate the smooth characteristic function with

We will prove that PϕPP\to\phi_{P} is injective. We will show that , xϕQ(x)=ϕP(x)\forall_{x}\phi_{Q}(x)=\phi_{P}(x) implies P=QP=Q.

We apply inverse Fourier transform to this convolution and get

Where g=T1lg=T^{-1}l, fY=T1φQf_{Y}=T^{-1}\varphi_{Q} and fX=T1φPf_{X}=T^{-1}\varphi_{P}. Since inverse Fourier transform is injective on the space of the integrable characteristic functions, and all l,φP,φQl,\varphi_{P},\varphi_{Q} are integrable CFs, then application of the inverse Fourier transform does not enlarge the null space of Eq. (20). Since g(x)>0g(x)>0, fX(x)=fY(x)f_{X}(x)=f_{Y}(x) everywhere, implying that the mapping PϕPP\to\phi_{P} is injective. This concludes the proof of Lemma 4.

Next, we show that smooth characteristic function is analytic.

If ll is an analytic, integrable kernel with an inverse Fourier transform strictly greater then zero and PP has an integrable characteristic function then the smooth characteristic function ϕP\phi_{P} is analytic.

By lemma 3, all functions in the RKHS associated with ll are analytic and by 4 ϕP\phi_{P} is an element of this RKHS. ∎

Since ll is an analytic, integrable kernel with an inverse Fourier transform strictly greater then zero then by the Lemma 4 the mapping Λ:PϕP\Lambda:P\to\phi_{P} is injective and Λ(P)\Lambda(P) is an element of the RKHS associated with ll. The Lemma 5 shows that μP\mu_{P} is analytic. Therefore we can use Lemma 3 to see that dΛ,J(P,Q)2=dϕ,J(P,Q)2d_{\Lambda,J}(P,Q)^{2}=d_{\phi,J}(P,Q)^{2} is a random metric. This concludes the proof of Theorem 1 ∎

Use of Fubini’s theorem is justified, since the iterated integral is finite [Theorem 8.8 b] i.e.

Proof of Proposition 2

The probability space of random variables {Tj}1jJ\{T_{j}\}_{1\leq j\leq J} and {Xi}1in\{X_{i}\}_{1\leq i\leq n} is a product space i.e sequence of TjT_{j}’s is defined on the space (Ω1,F1,P1)(\Omega_{1},\mathcal{F}_{1},P_{1}) and the sequence of XiX_{i}’s is defined on the space (Ω2,F2,P2)(\Omega_{2},\mathcal{F}_{2},P_{2}). We will show that for almost all ωΩ1\omega\in\Omega_{1}, SnS_{n} converges to χ2\chi^{2} distribution with JJ degrees of freedom. We define

We have showed that the proposition hold for almost all ω\omega. Indeed it does not hold if it happens that for some aba\neq b, Ta(ω)=Tb(ω)T_{a}(\omega)=T_{b}(\omega) or dμ,Jω(P,Q)=0d_{\mu,J}^{\omega}(P,Q)=0 for PQP\neq Q. But both those events have zero measure.

Proof of Proposition 4

The poof is analogue to the proof of the Proposition 2.

Appendix C Other tests

For two measures PP, QQ the population MMDMMD can be written as

An MMD-based test uses as its statistic an empirical estimator of the squared population MMD, and rejects the null if this is larger than a threshold rαr_{\alpha} corresponding to the 1α1-\alpha quantile of the null distribution. The minimum variance unbiased estimator of MMD is

The test threshold rαr_{\alpha} is costly to compute. The null distribution of MMDn2MMD^{2}_{n} is an infinite weighted sum of chi-squared random variables, where the weights are eigenvalues of the kernel with respect to the (unknown) distribution PP. A bootstrap or permutation procedure may be used in obtaining consistent quantiles of the null distribution, however the cost is O(bnn2)O(b_{n}n^{2}) if we have bnb_{n} permutations and nn data points (bnb_{n} is usually in the hundreds, at minimum). As an alternative consistent procedure, the eigenvalues of the joint gram matrix over samples from PP and QQ may be used in place of the population eigenvalues; the fastest quadratic-time MMD test uses a gamma approximation to the null distribution, which works well most of the times, but has no consistency guarantees .

C.2 Sub-quadratic time MMD test

An alternative to the quadratic-time MMD test is a B-test (block-based test): the idea is to break the data into blocks, compute a quadratic-time statistic on each block, and average these quantities to obtain the test statistic. More specifically, for an individual block, laying on the main diagonal and starting at position (i1)B+1(i-1)B+1, the statistic η(i)\eta(i) is calculated as

The choice of BB determines computation time - at one extreme is the linear-time MMD suggested by where we have n/2n/2 blocks of size B=2B=2, and at the other extreme is the usual full MMD with 11 block of size nn, which requires calculating the test statistic on the whole kernel matrix in quadratic time. In our case, the size of the block remains constant as nn increases, and is greater than 2. This is very similar to the case proposed by , and the consistency of the test is not affected.

B-test of assumes that BB\to\infty together with nn, which implies that the statistic η^\hat{\eta} defined in (26) under the null distribution satisfies

Therefore, a slight change to p-value needs to be applied when σ02\sigma_{0}^{2} is estimated directly: Φ(n(B1)η^2σ^0)\Phi\left(-\frac{\sqrt{n(B-1)}\hat{\eta}}{2\hat{\sigma}_{0}}\right). If, however, one simply uses the empirical variance of the individual statistics computed within each block, the procedure is unaffected.

Appendix D Parameters Choice

We split our data set into two disjoint sets, training and testing set, and optimize parameters on the training set. We didn’t come up with an automated testing procedure, instead we plotted the p-values of tests for different scales. The figure D.3 presents such a plot for three different tests. The p-values were obtained by running the test several times (20 to 50) for each data scaling λ\lambda. Note that in the case of simulations we just generated new training dataset for each repetition for a given data scaling. For the music dataset we generated new noises for each scaling and for the Higgs dataset we have used bootstrap. The last method is applicable to real life problems i.e. we split our data into training and test part and then bootstrap from the training part.