FastMMD: Ensemble of Circular Discrepancy for Efficient Two-Sample Test
Ji Zhao, Deyu Meng
Introduction
The two-sample test is one of the most fundamental tests in statistics and has a wide range of applications. It uses samples drawn from two distributions to test whether to accept or reject the null hypothesis that they are the same or different. This task, however, is very difficult and challenging in practice since the underneath distribution information are generally unknown apriori (Bickel, 1969; Friedman & Rafsky, 1979; Hall & Tajvidi, 2002; Biau & Gyofi, 2005). The maximum mean discrepancy (MMD) is the latest test statistic designed for this task by measuring the discrepancy of two distributions by embedding them in a reproducing kernel Hilbert space (Gretton et al., 2012a). The MMD has been attracting much attention in recent two-sample test research due to its solid theoretical fundament (Smola et al., 2008; Sriperumbudur et al., 2010, 2011; Sejdinovic et al., 2013) and successful applications including biological data test, data integration and attribute matching (Gretton et al., 2009), outlier detection, data classifiability (Sriperumbudur et al., 2009), domain adaption, etc. By generalizing the MMD to kernel families as the supremum of MMDs on a class of kernels, it has also been effectively used for some basic machine learning problems such as kernel selection (Sriperumbudur et al., 2009).
Albeit its various applications, the exact MMD needs computational cost, where and denote the size and dimension of samples, respectively, to calculate the kernel values between all pairs from the assessed two-sample sets. This quadratic computational complexity greatly hampers its further application to large-scale practical problems. How to speedup the computation of MMD has thus become a hot issue in statistics and machine learning in recent years.
There are mainly two approaches proposed for this problem by approximating MMD on a subsampling set of all sample pairs. The first is MMD-linear, which is the extremely simplified MMD calculation by only using possibly fewest interactions of sample pairs (Gretton et al., 2012a). While this strategy significantly accelerates the MMD calculation to , it also brings very high variance due to its evident loss of sample pair information. To better leverage the computation cost and calculation accuracy of MMD, B-test is recently proposed (Zaremba et al., 2013). The main idea is to split two-sample sets into corresponding subsets, construct block correspondence between them, and then compute the exact MMD inner each block while omit the inter-block pair information. By changing the block size, it can vary smoothly from MMD-linear with linear complexity to exact MMD with quadratic complexity. In practice, the block size is generally set as a modest value by experience. Thus the time complexity of B-test is , correspondingly.
Actually, as the coming of the big data era, it has become a hot trend to enhancing the efficiency of kernel-based learning methods, such as support vector machines and Gaussian process, throughout machine learning, computer vision and data mining. Many efforts have been made to speedup the establishment of the kernel information and accelerate the implementation of kernel techniques (Smola & Schölkopf, 2000; Williams & Seeger, 2000; Fine & K.Scheinberg, 2001). Two of the representative developments are Random Kitchen Sinks (Rahimi & Recht, 2007, 2008) and Fastfood (Le et al., 2013), which can significantly speed up the computation for a large range of kernel functions by mapping data into a relatively low-dimensional randomized feature space. These developments inspire us for this MMD-acceleration research topic, which constitutes an important branch along this line of research.
The main difficulty and challenge of MMD calculation lie in the fact that it needs to compute the kernel values between all sample pairs of two sets. MMD-linear and B-test attain this task by only utilizing a subsampling pair subset from all. Such simplification, however, also decreases the accuracy of MMD calculation due to their neglectness of the entire sample pair information. To this aim, this paper proposes a new efficient MMD calculation strategy, which can implement this task in a more efficient and accurate way. In summary, this paper mainly contains the following four-fold contributions:
1. Through employing Bochner’s theorem and Fastfood technique (Le et al., 2013) for kernels that are spherically invariant, we reduce the MMD computation cost to , which has lower time complexity than current MMD approximation methods and facilitates MMD’s application in large-scale data. Moreover, our method is easy to be sequentially computed and parallelized.
2. The proposed method utilizes the interacted kernel values between all pairs from two sample sets to calculate MMD, which naturally leads to very accurate MMD result. Our experimental results substantiate that our method is with similar accuracy as exact MMD, and with significantly smaller variance than MMD-linear and B-test.
3. We have theoretically proved the uniform convergence of our method in both unbiased and biased cases. Comparatively, both MMD-linear and B-test are only feasible in unbiased cases.
4. We provide a geometrical explanation of our method in calculating MMD with shift-invariant kernels. Under this viewpoint, it is potentially useful for arousing more extensive metrics for two-sample test.
The code of our FastMMD method is available at http://gr.xjtu.edu.cn/web/dymeng/2.
Efficient MMD for Shift-Invariant Kernels
Firstly we give a brief review of MMD (Gretton et al., 2012a) and introduce some important properties of shift-invariant kernel (Rahimi & Recht, 2007). Then we propose an efficient MMD approximation method.
Substituting the empirical estimates and of the feature space means based on respective samples, an empirical biased estimate of MMD can then be obtained as:
where if , and if . We can see that the time complexity of such MMD estimate is . We will investigate how to accelerate its computation to , especially for shift-invariant kernels.
2 Efficient Approximation of MMD
The following classical theorem from harmonic analysis provides the main fundament underlying our approximation method (Genton, 2001).
Theorem 1 (Bochner) Every bounded continuous positive definite function is Fourier transform of a non-negative finite Borel measure. This means that for any bounded shift-invariant kernel , there exists a non-negative finite Borel measure satisfying
We assume that the discussed positive definite kernel is real valued. According to Bochner’s theorem, we have that if a shift-invariant kernel is positive definite, there exists a proper scaled probability measure satisfying
Since both the probability measure and the kernel are real, the integrand can be replaced by in the above equation. Taking the Gaussian kernel as an example, we can rewrite it as , where . Its Fourier transform is . By proper scaling, can be viewed as a multivariate Gaussian distribution , where is the identity matrix.
Claim 2 For a shift-invariant kernel , suppose is its corresponding normalized measure in Bochner’s theorem, then
Claim 2 can be easily proved by substituting Eqn. (2) into kernel . A very interesting thing is that we can fortunately calculate in linear time by applying the Harmonic Addition Theorem (Nahin, 1995), as shown in Fig. 1. First, this expression can be viewed as a squared amplitude of combined sinusoids. Suppose a linear combination of sinusoids is , then its amplitude has a closed form , see Fig. 1(a). Second, the amplitude of sinusoids with the same frequency can be calculated in a sequential way in linear time, see Fig. 1(b). By combining the above two observations, we can calculate the expression with linear time complexity. If we set in Eqn. (3) as that in empirical MMD, it turns out to be the biased estimate of MMD. As a result, Claim 2 finely implies a novel methodology to efficiently approximate MMD.
By the same spirit of Random Fourier features (Rahimi & Recht, 2007), we draw i.i.d samples from distribution , and then take the average of to approximate the empirical estimate of squared MMD. The Quasi-Monte Carlo sampling techniques can also be used to generate sample sequence which has lower discrepancy than i.i.d sampling (Yang et al., 2014). Formally, we sample samples from the distribution , and then use them to approximate :
where is the amplitude of linear combination of sinusoids . We call the number of basis functions (Neal, 1994; Le et al., 2013). The efficient calculation of for unbiased MMD is similar since we can also rewrite unbiased estimate of MMD as the form in Eqn. (3). The proof is provided in Appendix A.1. The aforementioned procedure for approximating MMD is described in Algorithm 1. We also provide an equivalent implementation of Algorithm 1 in Appendix A.2.
For kernels that are spherically invariant, the Fastfood technique can be employed to further speedup sampling in step 2 and calculation in steps 4 - 5 of Algorithm 1 (Le et al., 2013). This can bring further efficiency gain for MMD calculation from to . In the rest of this paper, we call our original algorithm as FastMMD-Fourier and its variant using Fastfood as FastMMD-Fastfood, respectively.
3 Computational Complexity
As aforementioned, is the number of samples, is the dimension number of samples, and is the number of basis functions for approximating . Given a sampling of , the time complexity for calculating is . The overall computational complexity of the entire FastMMD-Fourier is thus . For FastMMD-Fastfood, the computation speed is further enhanced to . Usually, the basis number can preset as a fixed number and are thus independent of the sample scale. As compared with the complexities of the previous MMD methods, such as for exact MMD (Gretton et al., 2012a), and for B-test (Zaremba et al., 2013), the proposed FastMMD methods evidently get a speed gain. Furthermore, instead of only utilizing a subsampling pair subset from all by the current MMD approximation methods, FastMMD takes into consideration all the interacted information between sample pairs. Our methods are thus expected to be more accurate.
Another interesting thing is that the calculation of can be computed in a sequential way, and thus our method can be naturally implemented in stream computations. Also our method is easy to be parallelized. This further implies the potential usefulness of the proposed FastMMD methods in real large-scaled applications.
4 Approximation Guarantees
In this section, we will prove the approximation ability of the proposed FastMMD methods.
Theorem 3 (Uniform Convergence of FastMMD-Fourier) Let be a compact subset of with diameter . Then, for the biased estimate of MMD in Algorithm 1, we have:
Theorem 4 (Uniform Convergence of FastMMD-Fastfood) If we use Fastfood method (Le et al., 2013) to calculate in Algorithm 1, suppose the kernel is Gaussian kernel with bandwidth and is the biased estimate of MMD that arises from a block of Fastfood In Fastfood method, the kernel expansions are calculated by constructing several blocks. For the asymptotic analysis of error bound, we can treat as by padding the data with zeros if . This allow us to consider only one block., then we have:
This bound also holds for the approximation of unbiased MMD.
From Theorems 3 and 4, we can see that the approximation of FastMMD is unbiased. The proof is provided in Appendix A.2, A.3 and A.4.
5 Tests Based on the Asymptotic Distribution of the Unbiased Statistic
The principle for constructing our FastMMD approximation complies with the spirit of random kitchen and sinks (Rahimi & Recht, 2007) and Fastfood (Le et al., 2013), both using the kernel expansion to approximate shift-invariant kernels. Specifically, we use the linear combination of Dirac-delta functions to approximate continuous function , which is uniquely determined by kernel . It means that we implicitly introduce a new kernel which is an approximation for the original kernel in MMD calculation.
Based on the aforementioned analysis, the uniform convergence bounds and the asymptotic distribution for general kernels in Gretton et al. (2012a), including the Theorem and Corollary 7 - 13, still hold for FastMMD. Given the asymptotic distribution of the unbiased statistic , the goal of the two-sample test is to determine whether the empirical test statistic is so large as to be outside the quantile of the null distribution. In Gretton et al. (2012a), two approaches based on asymptotic distributions are proposed. One method is using the bootstrap (Arcones & Giné, 1992) on aggregated data, and the other is approximating the null distribution by fitting Pearson curves to its first three moments. Both the two methods can be incorporated into FastMMD.
Ensemble of Circular Discrepancy
In the previous section, we proposed an efficient approximation for MMD. In this section, we give a geometric explanation for our methods by using random projection (Blum, 2006) on a circle and circular discrepancy. This explanation is expected to help us more insightfully understand such approximation and inspire more extensive metrics for the two-sample test other than MMD.
If in Eqn. (2) is fixed, we can see that the positions of points projected on a unit circle sufficiently determine the kernel. In other words, the random variables and can be wrapped around the circumference of a unit circle without changing the value of kernel function. We first investigate the circular distribution under fixed in the following, and later will discuss the cases when is sampled from a multivariate distribution.
Given a fixed , we wrap two classes of samples on a unit circle separately. The probability density functions (PDFs) of the wrapped two random variables and can be mathematically expressed as:
where is the modular arithmetic, and is the Dirac delta function. Distributions and are zero when . Such distributions are called circular distributions or polar distributions (Fisher, 1993).
2 Circular Discrepancy
We now define a metric for measuring the discrepancy between and . Later we will show that this definition is closely related to MMD.
Definition 2 Given two independent circular distributions and , we define the circular discrepancy as:
where is also a circular distribution.
In this definition, we choose sine function as the measure for assessing the distance between two circular distributions.
Claim 5 The circular discrepancy as defined in (7) is equal to:
If and are probability mass functions (linear combination of Dirac delta functions), let , and then the circular discrepancy is equal to
From Eqn. (8) in Claim 5, it can be seen that the optimal distribution of random variable in the definition of circular discrepancy is a Dirac delta function. For the integral in Eqn. (8), if we change to , the value of this expression will change its sign. Based on this observation, it is clear that there are two distributions of that can maximize and minimize the object function, respectively. The difference of their non-zero positions is . These two distributions construct a “optimal decision diameter” for the projected samples on a unit circle.
We then give a geometric explanation for problem (7). Note that the sine function is a distance measure with sign for two points on a circle (suppose positive angles are counterclockwise), so the definition of circular discrepancy aims to find a diameter that possibly largely separates the projected samples of two different classes. An example is shown in Fig. 2, where the orientation of the diameter corresponds to the non-zero elements of distribution in the definition of circular discrepancy. We can see that this diameter maximizes the mean margin of two sample classes.
To have a large ensemble of circular discrepancy, the spread parameter in Gaussian kernel should be neither too small nor two large. If , its Fourier transform tend to be the uniform distribution. Intuition suggests that if a continuous distribution is spread out over a large region of the line then the corresponding wrapped distribution will be almost uniform on the circle. If , its Fourier transform tend to be Dirac delta function, and all the projected points on the circle would be near zero. In both of the two cases, the ensemble of circular discrepancy is small. This observation is consistent with the asymptotic behavior of Gaussian kernel SVM (Keerthi & Lin, 2003).
3 Ensemble of Circular Discrepancy
We have discussed the circular discrepancy for a given random projection on a unit circle. For shift-invariant kernels, is not a fixed value, but randomly sampled from a distribution, see Eqn. (2). We use sampling method for ensemble of circular discrepancy under different random projections. This ensemble turns out to be an efficient approximation of empirical estimate of MMD, so Algorithm 1 can be explained as ensemble of circular discrepancy.
Definition 3 Suppose is the normalized distribution in Eqn. (2) for kernel ; and are two circular distributions depending on according to Eqn. (5)(6). The ensemble of circular discrepancy for and with shift-invariant kernel is defined as:
Claim 6 For a shift-invariant kernel , ; denote as the associated Hilbert space with kernel ; as the normalized distribution in Eqn. (2); and as two circular distributions depending on defined by Eqn. (5)(6). Then the ensemble of circular discrepancy is
Proof: By substituting Eqn. (9) into Eqn. (10) and utilizing Bochner’s theorem, we can obtain
where if , and if . ∎
Fig. 3 demonstrates the relationship between MMD, circular discrepancy and our approximation. The blue and red contours are two distributions, and is the distribution determined by Fourier transform of the kernel. We generate i.i.d. samples from distribution . For each generated , we project samples on a unit circle and calculate the circular discrepancy. The ensemble of discrepancy then corresponds to the MMD. We can see that the circular discrepancy constructs classifiers implicitly.
It is interesting that if we use other similarity measurements such as Mallows Distance (earth mover’s distance) (Werman et al., 1986), other than the sine function utilized in Definition 2, more extensive metrics for two-sample test can be naturally obtained. Furthermore, it should be noted that in Definition 3, we use -norm average for ensemble. If we use other norms, we can also generalize more measures other than MMD for this task. All these extensions are hopeful directions for future investigation.
Experimental Results
We begin by investigating how well our methods can approximate the exact MMD as the sampling number increases. Following previous work on kernel hypothesis testing (Gretton et al., 2012b; Zaremba et al., 2013), our synthetic distribution is designed as grids of 2D Gaussian blobs. We specify two distributions, and . For distribution each Gaussian has identity covariance matrix, while for distribution the covariance is non-spherical with a ratio of large to small covariance eigenvalues. Samples drawn from and are presented in Fig. 4. The ratio is set as , and the sample number for each distribution is set as .
We used a Gaussian kernel with , which approximately matches the scale of the variance of each Gaussian in mixture . The MMD approximation results are shown in Fig. 5. We use the relative difference between exact (biased and unbiased) MMD and the approximation to quantify the error. The absolute difference also exhibits similar behavior and is thus not shown due to space limit. The results are presented as averages from trials. As can be seen, as increases, both FastMMD-Fourier and FastMMD-Fastfood converge quickly to exact MMD in both biased and unbiased cases. Their performance are indistinguishable. It can be observed that for both methods, the good approximation can be obtained even from a modest number of basis.
2 Accuracy Test for MMD with Kernel Family
In some situations, MMD with a kernel family is preferred (Sriperumbudur et al., 2009). Here we present an experiment that illuminates the superiority of our FastMMD methods on accuracy of MMD with kernel family. The synthesis data are generated as follows. All samples are constrained into a two-dimensional rectangle: . The points, which are located within a circular ring in between by and are labeled as , while other points are labeled as . We generate samples for each distribution randomly as the test set. Then we use these samples to calculate MMD with a kernel family. Here the kernel family is composed of multivariate isotropic Gaussians with bandwidth varying between and , with a multiplicative step-size of .
We compare our method with exact MMD, MMD-linear (Gretton et al., 2012a), and B-test (Zaremba et al., 2013). Note that the latter two methods are only valid for unbiased MMD. In our methods, the number of basis function is set as . The block size in B-test is set to the default choice, i.e, the square of sample size . For our method, we repeat times and use the curves and error bars to represent means and standard deviations of MMD, respectively. Since both MMD-linear and B-test depend on the permutation of data samples, we make permutations of the samples. From Fig. 6, it can be seen that the means of all these methods are consistent with the true values. Also it can be seen that our FastMMD-Fourier and FastMMD-Fastfood have similar accuracy, and their deviations are much smaller than those of MMD-linear and B-test.
For some applications, we need to find the kernel that has the maximal MMD. Since our methods have lower variance, they incline to find the correct with higher probability than MMD-linear and B-test.
3 Efficiency Test on Synthetic Data
In order to evaluate the efficiency of our methods, we generate samples uniformly from and . The efficiency of different methods is shown in Fig. 7. In Fig. 7(a), the number of samples is varied from to , and the data dimension is set as . In Fig. 7(b), the number of samples is set as , and the dimension is varied from to .
All competing methods are implemented in Matlab except that we use Spiral WHT Packagehttp://www.spiral.net/ to perform the Fast Walsh-Hadamard transform in Fastfood. The comparison methods include exact MMD, MMD-linear and B-testhttps://github.com/wojzaremba/btest. We run all the codes on a PC with AMD Athlon X2 250 ( MHz) CPU and GB RAM memory.
From Fig. 7, we can see that when or varies from small to large, our methods gradually become more efficient than exact MMD and B-test. When and , FastMMD-Fourier and FastMMD-Fastfood are x and x faster than exact MMD / B-test, respectively. As for MMD-linear, since it is the extreme simplified subsampling version of MMD calculation, it always runs very fast. However, as the sample size or dimension increases, the computation times of our methods still depict a more slowly increase trend than that of MMD-linear. This empirically confirms the efficiency of the proposed FaseMMD methods.
4 Efficiency Test on Large-Scale Real Data
In order to validate the efficiency of FastMMD on real world data, we also perform experiments on PASCAL VOC 2007 http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2007/, which is popular in computer vision community. The dataset consists of images, and there are 20 object categories in this dataset with some images containing multiple objects. In our experiments, we choose images which contain persons as one sample set, and the remaining images as another sample set. We use the VLFeat toolboxhttp://www.vlfeat.org/ to extract features. For each image, the feature is constructed by bag-of-words representation and spatial pyramid. The codebook size for bag-of-words is , and the spatial pyramid contains , and levels. Thus the feature dimension for each image is . Finally, the feature vectors are normalized by norm.
We use MMD to measure the discrepancy between these two image set. In this MMD calculation, the sample number is , and data dimension is . The number of basis function is fixed as . The bandwidth is set as . We perform this experiment on a PC with Intel Core i7-3770 GHz CPU and GB RAM memory.
The results of efficiency and accuracy are shown in Tab. 1. We compare our method with exact MMD, MMD-linear and B-test. Except for exact MMD, we repeat times for each method, and report the mean and standard deviation of MMD, execution time, and relative error. We can see that our FastMMD methods have smaller approximation error and smaller deviation, and meanwhile it has three orders of speedup compared to exact MMD.
5 Type II Error
Given MMD, several strategies can be employed to calculate the test threshold (Gretton et al., 2009, 2012a; Chwialkowski et al., 2014). The bootstrap strategy (Arcones & Giné, 1992) is utilized in our experiments since it can be easily integrated into our FastMMD method. Also the bootstrap is preferred for large-scale datasets since it costs , faster than most other methods for this task such as Person, with cost (Gretton et al., 2012a).
The data used for this experiment is generated from Gaussian blob distributions as described in Section 4.1. The sample size is set as for two distributions. The bandwidth is selected by maximizing MMD. The selected bandwidth is , and it approximately matches the scale of variance of each Gaussian blob in distribution . We find that all of biased/unbiased FastMMD methods have similar good performance, and we thus only demonstrate the result of FastMMD-Fourier for biased MMD.
The level for Type I error is set as , and the number of bootstrap shuffles is . The Type II error is shown in Fig. 8(a). We can see that the Type II error drops quickly when increasing number of basis. It demonstrates empirically that increasing number of basis can decrease the Type II error.
We also compare our method with Bootstrap (Bootstrap approach with exact MMD), Pearson (moment matching to Pearson curves), spectrum (Gram matrix eigenspectrum) and Gamma (two-parameter Gamma approximation), where the former two approaches are presented in Gretton et al. (2012a) and the latter two are introduced in Gretton et al. (2012b). The basis number utilized in FastMMD is fixed as . The type II errors of all competing methods with respect to are shown in Fig. 8(b). The execution times for different approaches are s (Bootstrap), s (Pearson), s (spectrum), s (Gamma), and (our FastMMD+Bootstrap). We can see that the proposed FastMMD method can achieve comparable results with other methods, while it is evidently more efficient than Bootstrap, Pearson and spectrum. This actually substantiates the efficiency of FastMMD since our method and Bootstrap are the same except with different strategies for calculating MMD. Our method is less efficient than Gamma method because bootstrap methods need to calculate MMD for many times (it is determined by the number of bootstrap shuffles which is in this experiment). How to design effective statistic based on FastMMD and optimal threshold remains a challenging issue for future investigation.
Conclusions
In this paper, we propose a method, called FastMMD, for efficiently calculating the maximum mean discrepancy (MMD) for shift-invariant kernels. Taking advantage of Fourier transform of shift-invariant kernels, we get a linear time complexity approximation method. We prove the theoretical convergence of the proposed method in both unbiased and biased cases, and further present a geometric explanation for it. This explanation on one side delivers new insight for intrinsic MMD mechanism, and on the other side is hopeful to inspire more extensive new metrics for two-sample test, which will be investigated in our future research. Future work also includes finding the significance threshold using more efficient and effective strategies other than bootstrap.
Acknowledgement
We thank Alex Smola for valuable suggestions. We also thank anonymous reviewers for their constructive comments, which helped us to improve the manuscript. This research was supported by the National Basic Research Program of China (973 Program) under Grant No. 2013CB329404, the China NSFC project under contract 61373114, 11131006, and the Civil Aviation Administration of China jointly funded project No. U1233110.
Appendix
The unbiased estimate of MMD is (See Lemma 6 in Gretton et al. (2012a))
Denote , . If the kernel is shift-invariant, let for any , and then we have
Denote and . The biased and unbiased estimate of MMD can be reformulated as:
A.2 Preliminary for the Proof of Theorem 3 and Theorem 4
The following two claims and an equivalent form of Algorithm 1 provide basic tools to analyze the MMD approximation.
Claim 7 (Uniform Convergence for Linear Combination of Approximation) Let be a compact subset of , be a function with parameter . If is drawn from a distribution, and the difference of two functions is bounded as
where is function about . Then
Proof: For any fixed , we have
Based on Eqn. (14), we know that with probability more than , it hold that . So with probability more than , it holds that . The proof is then completed. ∎
and then make a variable substitution .
Define a function . Next we prove three properties of .
(i) , because dominates the other when . Also it is obvious that .
(ii) . Let , and then we obtain the only solution . Since is continuous and differentiable, , and , we have that is the minimum of this function.
Denote . We can verify that and . So . ∎
We provide an equivalent form of FastMMD. It is easy to verify that
where .
The time complexity of this method is also linear. Note that the expression on the right hand is the square length for weighted addition of random Fourier features (Rahimi & Recht, 2007), which means this method uses the random Fourier features to calculate MMD. The procedure for approximating MMD is described in Algorithm 2. Note that this algorithm is equivalent to Algorithm 1 as described in the maintext. Algorithm 1 provides us a geometric explanation of MMD, and Algorithm 2 is convenient for the following uniform convergence analysis.
A.3 Proof of Theorem 3
From Algorithm 2, it is easy to see that:
where if and if , . and are the same as that in Algorithm 2.
According to Eqn. (5) in Gretton et al. (2012a):
Based on the above Eqn. (18)(19), we have:
According to Claim 1 in Random Features (Rahimi & Recht, 2007):
Denote the right hand of the inequality (21) as . Based on Claim 7, the Eqn. (20) is bounded by , then we can obtain the uniform convergence for biased estimate of MMD.
Next we prove the case for unbiased estimate of MMD. According to Algorithm 2 and by virtue of certain algebraically equivalent transformation, we obtain:
According to Eqn. (3) in Gretton et al. (2012a), we know that
Based on Eqn. (22)(23) and Claim 7, is bounded by , and then we obtain the uniform convergence for unbiased estimate of MMD. ∎
A.4 Proof of Theorem 4
is compact and with diameter at most twice . And then we can find an -net that covers using at most balls of radius (Cucker & Smale, 2001). Let denote the centers of these balls. We have for all if the following two conditions hold for all : (i) ; (ii) , if belongs to the ball of the -net. Next we bound the probability of these two events:
(i) The union bound followed by Hoeffding’s inequality applied to the anchors in the -net gives
(ii) If we use Fastfood for FastMMD in Algorithm 2, and suppose the estimate of kernel arises from a block of Fastfood, then according to Theorem 6 in Fastfood literature (Le et al., 2013) we have:
Since usually is large in Fastfood method, we suppose that . Considering Claim 8 and the fact , this inequality can be further reformulated as
Combining (24) and (25) gives a bound in terms of the free variable :
This has the form . Setting turns this to , and assuming that and , we obtain that
Denote the right hand of the inequality (26) as . Similar to the proof in Theorem 3, we have
Then we complete the proof of the uniform convergence for biased estimate of MMD. The convergence for unbiased estimate of MMD can be proved in the similar way. ∎
A.5 Proof of Claim 5
Let , and be the PDFs for random variables , and , respectively. We then have
If , then
The second equation holds because the sum of sinusoids with the same frequency is also a sinusoid with that frequency. According to the trigonometric identity, it holds that
where is the four-quadrant arctangent function. The supremum of this problem is when . ∎