Universality, Characteristic Kernels and RKHS Embedding of Measures
Bharath K. Sriperumbudur, Kenji Fukumizu, Gert R. G. Lanckriet
Introduction
Such an embedding has been found to be useful in many statistical applications like homogeneity testing (Gretton et al., 2007), independence testing (Gretton et al., 2008; Fukumizu et al., 2008), dimensionality reduction (Fukumizu et al., 2004, 2009a), etc., as it provides a powerful and straightforward method of dealing with higher-order statistics of random variables. However, in these applications, it is critical that the embedding in (1) is injective so that probability measures can be distinguished by their images in . To this end, Fukumizu et al. (2008) introduced the notion of characteristic kernel — a bounded, measurable is said to be characteristic if (1) is injective — for which many characterizations have recently been provided (Gretton et al., 2007; Fukumizu et al., 2008, 2009b; Sriperumbudur et al., 2008, 2009a, 2009b).
A natural extension to the above idea of embedding probability measures into an RKHS, is to embed finite signed Borel measures, into as
and study the conditions on the kernel, for which such an embedding is injective. Although the embedding in (2) can be proposed and investigated for mathematical pleasure, we show as one of the main contributions of this paper that under certain conditions on and , the embedding in (2) is closely related to the concept of universal kernels (see Section 1.1 for the formal introduction to universal kernels), which was first proposed by Steinwart (2001) — in the context of achieving the Bayes risk in kernel-based classification/regression algorithms — and later extended by Micchelli et al. (2006), Carmeli et al. (2009) and Sriperumbudur et al. (2010).The present paper is an extended version of Sriperumbudur et al. (2010). This connection shows that the embedding in (2) is not just an abstract mathematical object, but has applications in kernel-based classification/regression algorithms. Using the connection between (2) and universal kernels, we then show how the various notions of universality mentioned above are related to each other. In addition, since the embedding in (2) is a generalization of the embedding in (1), we also demonstrate the relation between characteristic kernels and universal kernels, which extends the preliminary study carried out in Sriperumbudur et al. (2009b, Section 3.4).
In the remainder of this introduction, we provide a comprehensive overview of our contributions which are presented in detail in later sections. First, in Section 1.1, we introduce universality, briefly discuss various notions of universality that are proposed in literature, and outline our contribution: a measure embedding view point of universality, which is novel and different from the existing view point of approximating functions in some target space by functions in an RKHS. We show that a kernel is universal if and only if the embedding in (2) is injective. Second, in Section 1.2, we discuss our second contribution of relating universal and characteristic kernels.
In the regularization approach to learning (Evgeniou et al., 2000), it is well known that kernel-based algorithms (for classification/regression) generally invoke the representer theorem (Kimeldorf and Wahba, 1970; Schölkopf et al., 2001) and learn a function in that has the representation,
is dense in (Aronszajn, 1950), and assuming that the kernel-based algorithm makes “converge to an appropriate function” in as , the above question of approximating arbitrarily well by in (3) as goes to infinity is equivalent to the question of whether is rich enough to approximate any arbitrarily well, i.e., whether is universal. We show that characterizing universal RKHSs (or equivalently, the characterization of corresponding reproducing kernels (r.k.) as any RKHS is uniquely determined by its reproducing kernel) leads to the embedding in (2).
As mentioned above, the goal is to characterize that allow to approximate any in some target space, usually assumed to be some subset of the space of real-valued continuous functions on . Therefore, depending on the choice of , the choice of target space and the type of approximation, various notions of universality have been proposed (Steinwart, 2001; Micchelli et al., 2006; Carmeli et al., 2009; Sriperumbudur et al., 2010), which are briefly discussed in the following paragraphs. The eventual goal is to have a notion of universality that allows comprehensive (and general) necessary and/or sufficient conditions on the reproducing kernel for approximating, as strong as possible, a class of target functions, as general as possible.
As our contribution, in Section 3.1, we completely characterize c-universal kernels by showing that is c-universal if and only if the embedding in (2) is injective for , the space of finite signed Radon measures defined on a compact Hausdorff space, (see Section 2 for a formal definition of ). It has to be noted that this result is different from and more general — as both necessary and sufficient conditions are provided — than the one by Steinwart (2001, Theorem 9), where only a sufficient condition is provided. Using this characterization, as a special case, we also obtain necessary and sufficient conditions for a Fourier kernel (see Section 3.3) to be c-universal, while Steinwart (2001) provided only a sufficient condition.
-universality: Although cc-universality solves the limitation of c-universality by handling non-compact , the topology of compact convergence considered in cc-universality is weaker than the topology of uniform convergence, i.e., a sequence of functions, converging to in the topology of uniform convergence ensures that they converge in the topology of compact convergence but not vice-versa. So, the natural question to ask is whether we can characterize that are rich enough to approximate any on non-compact in a stronger sense, i.e., uniformly, by some . Recently, this has been answered by Carmeli et al. (2009, Definition 2, Theorem 1) and Sriperumbudur et al. (2010), wherein they defined to be -universal if is bounded, and its corresponding RKHS, is dense in w.r.t. the uniform norm, where is a locally compact Hausdorff (LCH) space and is the Banach space of bounded continuous functions vanishing at infinity, endowed with the uniform norm (see Section 2 for the definition of ).
-universality: The definition of -universality deals with being dense in w.r.t. the uniform norm, where is an LCH space. Although the notion of -universality addresses limitations associated with both c- and cc-universality, it only approximates a subset of , i.e., it cannot deal with functions in . This limitation can be addressed by considering a larger class of functions to be approximated.
To this end, we propose a notion of universality that is stronger than -universality: is said to be -universal if its corresponding RKHS, is dense in , the space of bounded continuous functions on a topological space, (note that ). This notion of -universality is more applicable in learning theory than -universality as the target function, can belong to (which is a more natural assumption) instead of it being restrained to (note that only contains functions that vanish at infinity). We show in Section 3.1 that is -universal if and only if the embedding in (2) is injective for belonging to a certain class of set functions (see Section 2 for the definition of set functions) defined on a normal topological space, (see Theorem 6 for details). Because of the technicalities involved in dealing with set functions, in this paper, we do not fully analyze this notion of universality unlike the other aforementioned notions, although it is an interesting problem to be resolved because of its applicability in learning theory.
To summarize our first contribution, we show that, by appropriately choosing and in (2), the injectivity of the embedding in (2) completely characterizes various notions of universality that are proposed in literature. Using this connection between universality and the injectivity of the embedding in (2), we relate all these notions of universality, which is summarized in Figure 1.
2 Contribution 2: Relation between characteristic and universal kernels
Using the embedding in (1), Gretton et al. (2007) proposed a metric, called the maximum mean discrepancy (MMD), on the space of all Borel probability measures, when is characteristic. One important theoretical question that is usually considered for metrics on probability measures is (Dudley, 2002, Chapter 11): “What is the nature of the topology induced by the probability metric in relation to the usual weak topology?” In probability theory, this question is important in understanding and proving central limit theorems. Although being characteristic is sufficient for MMD to be a metric, we show in Section 4.2 that a notion stronger than the characteristic property is required to answer the above question. In particular, we show in Proposition 24 that if is an LCH space and is -universal, then the topology induced by MMD coincides with the usual weak topology on the space of Radon probability measures defined on .Sriperumbudur et al. (2009b) showed that if is a compact metric space and is c-universal, then the topology induced by MMD coincides with the usual weak topology. The result for non-compact was left as an open question and is addressed in this paper, by applying the notion of -universality. This result can be used to compare MMD to other probability metrics, such as the Dudley metric, total variation distance, Wasserstein distance, etc. We refer to Sriperumbudur et al. (2009b) for a detailed study on the comparison of MMD to other probability metrics.
To summarize, our main contributions in this paper are:
To establish the relationship between various notions of universality and the RKHS embedding, shown in (2), of finite signed Radon measures, and in turn present a novel measure embedding view point of universality compared to the classical function approximation view point.
To clarify the relationship between universal and characteristic kernels.
A summary of the results in this paper is shown in Figure 1. In the following section, we introduce the notation and some definitions that are used throughout the paper. Supplementary results used in proofs are collected in Appendix A.
Definitions & Notation
Let be a topological space. denotes the space of all continuous functions on . is the space of all bounded, continuous functions on . For a locally compact Hausdorff space, , is said to vanish at infinity if for every the set is compact. The class of all continuous on which vanish at infinity is denoted as . The spaces and are endowed with the uniform norm, defined as for .
If denotes a topological vector space, we denote by the vector space of continuous linear functionals on , and is called the topological dual space (in this paper, we simply refer to it as the dual).
For a set , we denote its interior as .
Radon measure: A signed Radon measure on a Hausdorff space is a Borel measure on satisfying
for each compact subset ,
for each in the Borel -algebra of .
is said to be finite if , where is the total-variation of . denotes the space of all finite Radon measures on while denotes the space of all finite signed Radon measures on . The space of all Radon probability measures is denoted as . For , the support of is defined as
denotes the space of all compactly supported finite signed Radon measures on . We refer the reader to Berg et al. (1984, Chapter 2) for a general reference on the theory of Radon measures.
Finitely additive, regular set function: A set function is a function defined on a family of sets, and has values in .
A set function defined on a family of sets is said to be finitely additive if , and , for every finite family of disjoint subsets of such that .
A field of subsets of a set is a non-empty family, , of subsets of such that , , and for all , we have and .
An additive set function defined on a field of subsets of a topological space is said to be regular if for each and , there exists whose closure is contained in and there exists whose interior contains such that for every with .
Furthermore, is said to be strictly pd (resp. conditionally strictly pd) if, for mutually distinct , equality in (5) only holds for .
Characterization of Universal Kernels
Before characterizing various notions of universality, let us revisit their formal definitions.
A continuous kernel on a compact Hausdorff space is called c-universal if the RKHS, induced by is dense in w.r.t. the uniform norm, i.e., for every function and all , there exists an such that .
A continuous kernel on a Hausdorff space is said to be cc-universal if the RKHS, induced by is dense in endowed with the topology of compact convergence, i.e., for any compact set , for any and all , there exists an such that .
A bounded kernel, with on a locally compact Hausdorff space, is said to be -universal if the RKHS, induced by is dense in w.r.t. the uniform norm, i.e., for every function and all , there exists an such that .
A bounded continuous kernel, on a topological space, , is said to be -universal if the RKHS, induced by is dense in w.r.t. the uniform norm, i.e., for any and all , there exists an such that .
First note that the above definitions are valid only if is included in the appropriate target space, i.e., for c- and cc-universality, for -universality, and for -universality. By Steinwart and Christmann (2008, Lemma 4.28, Theorem 4.61), the assumptions made on the kernel in the above definitions ensure that the definitions are valid. Also note that all these definitions are equivalent when is compact as for compact . When is not compact, it is easy to see that -universality is stronger than -universality, i.e., if is -universal, then it is also -universal, but not vice-versa. On the other hand, it is not straightforward to see how the notions of cc-universal and -universal are related when is non-compact. By characterizing -universality and cc-universality, Theorem 6 in the following section, shows that the notion of -universality is stronger than cc-universality, i.e., if a kernel is -universal, then it is cc-universal, but not vice-versa. Based on these results, it follows that -universality is stronger than cc-universality (but not vice-versa), when is non-compact.
Before we state our main result, i.e., Theorem 6, we need the following result, usually referred to as the Hahn-Banach theorem, which we quote from Rudin (1991, Theorem 3.5) (also see the remark following Theorem 3.5 in Rudin (1991)).
Suppose be a subspace of a locally convex topological vector space . Then is dense in if and only if , where
The following main result of this paper, which presents a necessary and sufficient condition for to be c-, cc-, - or -universal. hinges on the above theorem, where we choose to be the RKHS, and to be , or for which is known through the Riesz representation theorem.
Let be a compact Hausdorff space with being continuous. Then is c-universal if and only if the embedding,
Let be an LCH space and . Then is cc-universal if and only if the embedding,
Let be an LCH space with the kernel, being bounded and . Then is -universal if and only if the embedding,
Let be a normal topological space and let be the space of all finitely additive, regular, bounded set functions defined on the field generated by the closed sets of . Then, a bounded continuous kernel, is -universal if and only if the embedding,
Proof First, we prove , from which follows. By Definition 3, is -universal if is dense in . We now invoke Theorem 5 to characterize the denseness of in , which means we need to consider the dual of . By the Riesz representation theorem (Folland, 1999, Theorem 7.17), in the sense that there is a bijective linear isometry from onto , given by the natural mapping,
Therefore, by Theorem 5, is dense in if and only if
() Suppose (12) is injective, i.e., for , . Then by Lemma 26 (see Appendix A), we have
which by (15) means is dense in and therefore is -universal. () We need to prove that if is dense in then holds. This is equivalent to showing that if does not hold, then is not dense in . Suppose does not hold, i.e., such that , which means such that for every , then, by (15), is not dense in . When is compact, coincides with , which means c-universality and -universality are equivalent. Therefore, is c-universal if and only if the embedding in (10) is injective. The proof is similar to that of except that we need to consider the dual of endowed with the topology of compact convergence (a locally convex topological vector space) to characterize the denseness of in . It is known (Hewitt, 1950) that in the sense that there is a bijective linear isometry from onto , given by the natural mapping, . The rest of the proof is verbatim with replaced by . The proof is very similar to that of , wherein we identify such that and satisfy (Dunford and Schwartz, 1958, p. 262). Here, represents the isometric isomorphism. The rest of the proof is verbatim with replaced by . Theorem 6 can also be interpreted as: for appropriate assumptions on and , the embedding in (2) is injective if and only if the kernel is universal, therefore relating universality and injective RKHS embedding of finite signed Radon measures. In other words, Theorem 6 provides a novel measure embedding view point of universality compared to its well-known function approximation view point. Based on Theorem 6, the following remarks can be made.
(a) Theorem 6 provides a necessary and sufficient condition for c-universality — is c-universal if and only if the embedding in (10) is injective — while Steinwart (2001) provided only a sufficient condition (in terms of the feature maps being an algebra; see Steinwart and Christmann (2008, Theorem 4.56) for details) using the Stone-Weierstraß theorem. Therefore, Theorem 6 differs from and generalizes the result by Steinwart (2001). (b) Note that the embedding in (11) is injective if and only if for any compact set , the embedding
Based on Theorem 6, the following result provides an alternate and equivalent characterization of universality or injectivity of the embedding in (2), which is easier to interpret, as it resembles the condition of being strictly pd (though not quite exactly the same). This alternate characterization is then used in Sections 3.2–3.4 to obtain easily checkable conditions for the universality of specific classes of kernels. We also show that strictly pd is a necessary condition for universality.
Suppose the assumptions in Theorem 6 hold. Then,
If is c-, cc- or -universal, then it is strictly pd.
Proof We only prove . The proof of is exactly the same as that of with replaced by , while the proof of is trivial. () Suppose is not -universal. By Theorem 6(c), there exists such that , which implies . This means
Define , where represents the Dirac measures at . Clearly and . From (20), it is clear that . Therefore, by Proposition 8(b), is not cc-universal. The result for -universality follows from Remark 7(c), while the result for c-universality is trivial. See Carmeli et al. (2009, Corollary 5), Steinwart and Christmann (2008, Proposition 4.54, Example 4.11) and Sriperumbudur et al. (2009b, Footnote 4).
A summary of results based on Theorem 6, Remarks 7, 9 and Proposition 8 is shown in Figures 1(a) and 1(b).
Although the conditions in (17)-(19) are easy to interpret, they are not always easy to check. To this end, in the remainder of this section, we present easily checkable characterizations for the following classes of kernels. These classes of kernels are both mathematically and practically interesting as many of the popular kernels used in machine learning, e.g., Gaussian, Laplacian, exponential, etc., fall in these classes (see Examples 1–3 for more examples).
These kernels are also called Schoenberg kernels (Wendland, 2005, Corollary 7.12, Theorem 7.13).Note that is a scale mixture of Gaussian kernels.
is an LCH space with bounded . Let , where we assume the series converges uniformly on . is a set of continuous real-valued functions on where is a countable index set.
If is compact, then is -universal.
If , then is cc-universal.
Gaussian, with .
Laplacian, with , where .
-spline, with , where and .
The following remarks can be made about Proposition 11.
A summary of results, based on Proposition 11 and Remark 12, for the case of kernels satisfying (), is shown in Figure 1(c).
where Fubini’s theorem is invoked in and
with and .
with and .
-universal kernels vs. Strictly pd kernels: We have shown in Proposition 8(d) that strictly pd is a necessary condition for to be c-, cc- or -universal. However, the converse is not true (see Remark 9(a)), which is based on Proposition 14 and the following result in Theorem 15. Before we state the result, we need some definitions.
A summary of results for kernels of the type () is shown in Figure 1(b).
Suppose () holds. Then the following conditions are equivalent.
Proof by Remark 7(c), by Proposition 8(d) and by Wendland (2005, Theorem 7.14). Now, we show .
Gaussian, . Note that in (21), where represents a Dirac measure at . Clearly .
Inverse multiquadratic, , obtained by choosing in (21). It is easy to verify that .
A summary of results for kernels of the type () is shown in Figure 1(d).
We now consider the characterization of c-, cc- and -universality for ().
is c-universal (resp. cc-universal) if and only if for any (resp. ), there exists some for which .
Let . Then is -universal if and only if for any , there exists some for which .
Proof We first prove . The proof for c-universality in is trivial as it follows from , while the proof for cc-universality in is exactly the same as that of with replaced by . Let us consider
where we have invoked Fubini’s theorem in . () Suppose for any , there exists some for which . Then, from (28), it is clear that and therefore is -universal, which follows from Proposition 8(c). () Suppose there exists a non-zero measure, for which for any . By (28), this means there exists a for which , i.e., is not -universal (by Proposition 8(c)). The conditions in Proposition 17 are not always easy to check. However, for the case of Taylor kernels (Steinwart and Christmann, 2008, Lemma 4.8), which include the exponential kernel, simple, easy to check sufficient conditions can be obtained as shown in Corollary 18. Although this result is exactly the same as Corollary 4.57 in Steinwart and Christmann (2008), we present a different proof (we would like to remind the reader that our characterization of c-universality is different from the one provided by Steinwart (2001) and therefore the proof is different; see Remark 7(a)).
Proof From the proof of Lemma 4.8 in Steinwart and Christmann (2008), we have
To summarize, in this section, by showing the relation between various notions of universality and the injective RKHS embedding of finite signed Radon measures, we have presented a novel measure embedding point of view of universality compared to its well-known function approximation view point. Since the RKHS embedding of finite signed Radon measures generalizes the concept of RKHS embedding of Radon probability measures, the latter being related to characteristic kernels (Fukumizu et al., 2004, 2008; Sriperumbudur et al., 2008), in the following section, we relate the notion of universality to characteristic kernels.
Characteristic Kernels and Universality
Recent studies in machine learning have considered the mapping of random variables into a suitable RKHS and showed that this provides a powerful and straightforward method of dealing with higher-order statistics of the variables. Using their RKHS mappings, for sufficiently rich RKHSs, it becomes possible to test for homogeneity (Gretton et al., 2007), independence (Gretton et al., 2008), conditional independence (Fukumizu et al., 2008), to find the most predictive subspace in regression (Fukumizu et al., 2004), etc. Key to the above applications is the notion of a characteristic kernel — defined below — which gives rise to an RKHS that is sufficiently rich in the sense required above.
Since the embedding in (30) is a special case of the embedding in (2), and the injectivity of the embedding in (2) is related to universality (see Section 3), we now relate universal and characteristic kernels.
Gretton et al. (2007) have shown that a c-universal kernel is characteristic. Besides this result, not much is known or understood about the relation between characteristic and universal kernels. The following result not only provides the same result obtained by Gretton et al. (2007), but also generalizes it for non-compact .
Suppose the assumptions in Theorem 6 hold. If is -, - or -universal, then it is characteristic to the set of probability measures contained in , or , respectively.
Proof The proof is trivial and follows from Theorem 6 and Definition 19. Now, one can ask when the converse to Proposition 20 is true. The following result answers this question for some special classes of kernels.
A summary of the relation between characteristic and universal kernels is shown in Figure 1. Characteristic kernels vs. Strictly pd kernels: In Section 3, we have shown the relation between universal kernels and strictly pd kernels, while in Propositions 20 and 21, we have related universal and characteristic kernels. We now investigate the relation between characteristic and strictly pd kernels.
If is characteristic, then it is conditionally strictly pd.
So far, we presented the relation between characteristic kernels and universal kernels and showed that for any LCH space, , the characteristic property is a weaker notion than -universality. Although such a weaker notion is sufficient to make the embedding in (30) injective, in the following section, we show that the stronger notion of -universality is required to study an important property of the “probability metric” associated with the embedding in (30).
on , called the maximum mean discrepancy (MMD). Note that when is characteristic, is a metric on . One immediate question that naturally arises is “how is MMD related to other metrics on , such as the Prohorov metric, Dudley metric, Wasserstein-Kantorovich metric, total variation metric, etc?” This is a question of both theoretical and practical importance.
Let be an LCH space and be -universal. Then, the topology induced by coincides with the weak topology on .
The following result in Sriperumbudur et al. (2009b, Theorem 23) can be obtained as a simple corollary to Proposition 24, wherein the question of metrization of weak topology by is addressed only for compact Hausdorff . The general non-compact case was left as an open problem, which we addressed in Proposition 24.
Suppose is compact Hausdorff and is c-universal. Then, metrizes the weak topology on .
Conclusions & Discussion
The discussion in this paper has been related to the characterization of various notions of universality wherein the RKHS, is dense in some subset of (the space of real-valued continuous functions on ) w.r.t. the uniform norm (here, is a some arbitrary topological space). This means any target function, in the appropriate subset of can be approximated arbitrarily well by some w.r.t. the uniform norm. There is a notion of universality, which we have not considered, called -universality (Steinwart and Christmann, 2008, Chapter 5): a measurable and bounded kernel, defined on a Hausdorff space, , is said to be -universal if the RKHS, induced by is dense in w.r.t. the -norm, defined as , for all and some . Here is the Banach space of -integrable -measurable functions on . This notion of universality is more applicable in learning theory, where the target function, is usually assumed to lie in for some and for some Borel probability measure, . By considering this notion of universality, any can be approximated arbitrarily well by some w.r.t. the p-norm for all Borel probability measures and some . In particular, Steinwart and Christmann (2008, Theorems 5.31, 5.36 and Corollary 5.37) have shown that -universality is necessary and sufficient to achieve consistency in kernel-based learning algorithms. In this paper, we did not consider this notion of universality because unlike the other notions of universality, it is not straightforward to relate -universality and the RKHS embedding of measures by using the Hahn-Banach theorem (see Theorem 5). However, recently, Carmeli et al. (2009, Theorem 1) have shown that is -universal if and only if it is -universal, which therefore establishes the relation between -universality and the RKHS embedding of measures. Using this result, -universality can be related to all other notions considered in this paper, through Figure 1.
B. K. S. and G. R. G. L. wish to acknowledge support from the Institute of Statistical Mathematics (ISM), Tokyo, the National Science Foundation (grant DMS-MSPA 0625409), the Fair Isaac Corporation and the University of California MICRO program. Part of this work was done while B. K. S. was visiting ISM. K.F. was supported by JSPS KAKENHI 19500249.
Appendix A. Supplementary Results
For completeness, we present the following supplementary result, which is a simple generalization of the technique used in the proof of Sriperumbudur et al. (2008, Theorem 3).
Let be a measurable and bounded kernel on a measurable space, and let be its associated RKHS. Then, for any and for any finite signed Borel measure, ,
Therefore, is a bounded linear functional on . By the Riesz representation theorem (Folland, 1999, Theorem 5.25), there exists a unique such that for all . Set for some , which implies and the result follows.