Universality, Characteristic Kernels and RKHS Embedding of Measures

Bharath K. Sriperumbudur, Kenji Fukumizu, Gert R. G. Lanckriet

Introduction

Such an embedding has been found to be useful in many statistical applications like homogeneity testing (Gretton et al., 2007), independence testing (Gretton et al., 2008; Fukumizu et al., 2008), dimensionality reduction (Fukumizu et al., 2004, 2009a), etc., as it provides a powerful and straightforward method of dealing with higher-order statistics of random variables. However, in these applications, it is critical that the embedding in (1) is injective so that probability measures can be distinguished by their images in H\mathcal{H}. To this end, Fukumizu et al. (2008) introduced the notion of characteristic kernel — a bounded, measurable kk is said to be characteristic if (1) is injective — for which many characterizations have recently been provided (Gretton et al., 2007; Fukumizu et al., 2008, 2009b; Sriperumbudur et al., 2008, 2009a, 2009b).

A natural extension to the above idea of embedding probability measures into an RKHS, H\mathcal{H} is to embed finite signed Borel measures, μ\mu into H\mathcal{H} as

and study the conditions on the kernel, kk for which such an embedding is injective. Although the embedding in (2) can be proposed and investigated for mathematical pleasure, we show as one of the main contributions of this paper that under certain conditions on μ\mu and XX, the embedding in (2) is closely related to the concept of universal kernels (see Section 1.1 for the formal introduction to universal kernels), which was first proposed by Steinwart (2001) — in the context of achieving the Bayes risk in kernel-based classification/regression algorithms — and later extended by Micchelli et al. (2006), Carmeli et al. (2009) and Sriperumbudur et al. (2010).The present paper is an extended version of Sriperumbudur et al. (2010). This connection shows that the embedding in (2) is not just an abstract mathematical object, but has applications in kernel-based classification/regression algorithms. Using the connection between (2) and universal kernels, we then show how the various notions of universality mentioned above are related to each other. In addition, since the embedding in (2) is a generalization of the embedding in (1), we also demonstrate the relation between characteristic kernels and universal kernels, which extends the preliminary study carried out in Sriperumbudur et al. (2009b, Section 3.4).

In the remainder of this introduction, we provide a comprehensive overview of our contributions which are presented in detail in later sections. First, in Section 1.1, we introduce universality, briefly discuss various notions of universality that are proposed in literature, and outline our contribution: a measure embedding view point of universality, which is novel and different from the existing view point of approximating functions in some target space by functions in an RKHS. We show that a kernel is universal if and only if the embedding in (2) is injective. Second, in Section 1.2, we discuss our second contribution of relating universal and characteristic kernels.

In the regularization approach to learning (Evgeniou et al., 2000), it is well known that kernel-based algorithms (for classification/regression) generally invoke the representer theorem (Kimeldorf and Wahba, 1970; Schölkopf et al., 2001) and learn a function in H\mathcal{H} that has the representation,

is dense in H\mathcal{H} (Aronszajn, 1950), and assuming that the kernel-based algorithm makes ff “converge to an appropriate function” in H\mathcal{H} as nn\rightarrow\infty, the above question of approximating ff^{\star} arbitrarily well by ff in (3) as nn goes to infinity is equivalent to the question of whether H\mathcal{H} is rich enough to approximate any ff^{\star} arbitrarily well, i.e., whether H\mathcal{H} is universal. We show that characterizing universal RKHSs (or equivalently, the characterization of corresponding reproducing kernels (r.k.) as any RKHS is uniquely determined by its reproducing kernel) leads to the embedding in (2).

As mentioned above, the goal is to characterize H\mathcal{H} that allow to approximate any ff^{\star} in some target space, usually assumed to be some subset of the space of real-valued continuous functions on XX. Therefore, depending on the choice of XX, the choice of target space and the type of approximation, various notions of universality have been proposed (Steinwart, 2001; Micchelli et al., 2006; Carmeli et al., 2009; Sriperumbudur et al., 2010), which are briefly discussed in the following paragraphs. The eventual goal is to have a notion of universality that allows comprehensive (and general) necessary and/or sufficient conditions on the reproducing kernel for approximating, as strong as possible, a class of target functions, as general as possible.

As our contribution, in Section 3.1, we completely characterize c-universal kernels by showing that kk is c-universal if and only if the embedding in (2) is injective for μMb(X)\mu\in M_{b}(X), the space of finite signed Radon measures defined on a compact Hausdorff space, XX (see Section 2 for a formal definition of Mb(X)M_{b}(X)). It has to be noted that this result is different from and more general — as both necessary and sufficient conditions are provided — than the one by Steinwart (2001, Theorem 9), where only a sufficient condition is provided. Using this characterization, as a special case, we also obtain necessary and sufficient conditions for a Fourier kernel (see Section 3.3) to be c-universal, while Steinwart (2001) provided only a sufficient condition.

c0c_{0}-universality: Although cc-universality solves the limitation of c-universality by handling non-compact XX, the topology of compact convergence considered in cc-universality is weaker than the topology of uniform convergence, i.e., a sequence of functions, {fn}C(X)\{f_{n}\}\subset C(X) converging to fC(X)f\in C(X) in the topology of uniform convergence ensures that they converge in the topology of compact convergence but not vice-versa. So, the natural question to ask is whether we can characterize H\mathcal{H} that are rich enough to approximate any ff^{\star} on non-compact XX in a stronger sense, i.e., uniformly, by some gHg\in\mathcal{H}. Recently, this has been answered by Carmeli et al. (2009, Definition 2, Theorem 1) and Sriperumbudur et al. (2010), wherein they defined kk to be c0c_{0}-universal if kk is bounded, k(,x)C0(X),xXk(\cdot,x)\in C_{0}(X),\,\forall\,x\in X and its corresponding RKHS, H\mathcal{H} is dense in C0(X)C_{0}(X) w.r.t. the uniform norm, where XX is a locally compact Hausdorff (LCH) space and C0(X)C_{0}(X) is the Banach space of bounded continuous functions vanishing at infinity, endowed with the uniform norm (see Section 2 for the definition of C0(X)C_{0}(X)).

cbc_{b}-universality: The definition of c0c_{0}-universality deals with H\mathcal{H} being dense in C0(X)C_{0}(X) w.r.t. the uniform norm, where XX is an LCH space. Although the notion of c0c_{0}-universality addresses limitations associated with both c- and cc-universality, it only approximates a subset of C(X)C(X), i.e., it cannot deal with functions in C(X)\C0(X)C(X)\backslash C_{0}(X). This limitation can be addressed by considering a larger class of functions to be approximated.

To this end, we propose a notion of universality that is stronger than c0c_{0}-universality: kk is said to be cbc_{b}-universal if its corresponding RKHS, H\mathcal{H} is dense in Cb(X)C_{b}(X), the space of bounded continuous functions on a topological space, XX (note that C0(X)Cb(X)C_{0}(X)\subset C_{b}(X)). This notion of cbc_{b}-universality is more applicable in learning theory than c0c_{0}-universality as the target function, ff^{\star} can belong to Cb(X)C_{b}(X) (which is a more natural assumption) instead of it being restrained to C0(X)C_{0}(X) (note that C0(X)C_{0}(X) only contains functions that vanish at infinity). We show in Section 3.1 that kk is cbc_{b}-universal if and only if the embedding in (2) is injective for μ\mu belonging to a certain class of set functions (see Section 2 for the definition of set functions) defined on a normal topological space, XX (see Theorem 6 for details). Because of the technicalities involved in dealing with set functions, in this paper, we do not fully analyze this notion of universality unlike the other aforementioned notions, although it is an interesting problem to be resolved because of its applicability in learning theory.

To summarize our first contribution, we show that, by appropriately choosing XX and μ\mu in (2), the injectivity of the embedding in (2) completely characterizes various notions of universality that are proposed in literature. Using this connection between universality and the injectivity of the embedding in (2), we relate all these notions of universality, which is summarized in Figure 1.

2 Contribution 2: Relation between characteristic and universal kernels

Using the embedding in (1), Gretton et al. (2007) proposed a metric, called the maximum mean discrepancy (MMD), on the space of all Borel probability measures, when kk is characteristic. One important theoretical question that is usually considered for metrics on probability measures is (Dudley, 2002, Chapter 11): “What is the nature of the topology induced by the probability metric in relation to the usual weak topology?” In probability theory, this question is important in understanding and proving central limit theorems. Although kk being characteristic is sufficient for MMD to be a metric, we show in Section 4.2 that a notion stronger than the characteristic property is required to answer the above question. In particular, we show in Proposition 24 that if XX is an LCH space and kk is c0c_{0}-universal, then the topology induced by MMD coincides with the usual weak topology on the space of Radon probability measures defined on XX.Sriperumbudur et al. (2009b) showed that if XX is a compact metric space and kk is c-universal, then the topology induced by MMD coincides with the usual weak topology. The result for non-compact XX was left as an open question and is addressed in this paper, by applying the notion of c0c_{0}-universality. This result can be used to compare MMD to other probability metrics, such as the Dudley metric, total variation distance, Wasserstein distance, etc. We refer to Sriperumbudur et al. (2009b) for a detailed study on the comparison of MMD to other probability metrics.

To summarize, our main contributions in this paper are:

To establish the relationship between various notions of universality and the RKHS embedding, shown in (2), of finite signed Radon measures, and in turn present a novel measure embedding view point of universality compared to the classical function approximation view point.

To clarify the relationship between universal and characteristic kernels.

A summary of the results in this paper is shown in Figure 1. In the following section, we introduce the notation and some definitions that are used throughout the paper. Supplementary results used in proofs are collected in Appendix A.

Definitions & Notation

Let XX be a topological space. C(X)C(X) denotes the space of all continuous functions on XX. Cb(X)C_{b}(X) is the space of all bounded, continuous functions on XX. For a locally compact Hausdorff space, XX, fC(X)f\in C(X) is said to vanish at infinity if for every ϵ>0\epsilon>0 the set {x:f(x)ϵ}\{x:|f(x)|\geq\epsilon\} is compact. The class of all continuous ff on XX which vanish at infinity is denoted as C0(X)C_{0}(X). The spaces Cb(X)C_{b}(X) and C0(X)C_{0}(X) are endowed with the uniform norm, u\|\cdot\|_{u} defined as fu:=supxXf(x)\|f\|_{u}:=\sup_{x\in X}|f(x)| for fC0(X)Cb(X)f\in C_{0}(X)\subset C_{b}(X).

If YY denotes a topological vector space, we denote by YY^{\prime} the vector space of continuous linear functionals on YY, and YY^{\prime} is called the topological dual space (in this paper, we simply refer to it as the dual).

For a set AA, we denote its interior as AA^{\circ}.

Radon measure: A signed Radon measure μ\mu on a Hausdorff space XX is a Borel measure on XX satisfying

μ(C)<\mu(C)<\infty for each compact subset CXC\subset X,

μ(B)=sup{μ(C)CB,Ccompact}\mu(B)=\sup\{\mu(C)\,|\,C\subset B,\,C\,\text{compact}\} for each BB in the Borel σ\sigma-algebra of XX.

μ\mu is said to be finite if μ:=μ(X)<\|\mu\|:=|\mu|(X)<\infty, where μ|\mu| is the total-variation of μ\mu. M+b(X)M^{b}_{+}(X) denotes the space of all finite Radon measures on XX while Mb(X)M_{b}(X) denotes the space of all finite signed Radon measures on XX. The space of all Radon probability measures is denoted as M+1(X):={μM+b(X):μ(X)=1}M^{1}_{+}(X):=\{\mu\in M^{b}_{+}(X):\mu(X)=1\}. For μMb(X)\mu\in M_{b}(X), the support of μ\mu is defined as

Mbc(X)M_{bc}(X) denotes the space of all compactly supported finite signed Radon measures on XX. We refer the reader to Berg et al. (1984, Chapter 2) for a general reference on the theory of Radon measures.

Finitely additive, regular set function: A set function is a function defined on a family of sets, and has values in [,+][-\infty,+\infty].

A set function μ\mu defined on a family τ\tau of sets is said to be finitely additive if τ\emptyset\in\tau, μ()=0\mu(\emptyset)=0 and μ(l=1nAl)=l=1nμ(Al)\mu(\cup^{n}_{l=1}A_{l})=\sum^{n}_{l=1}\mu(A_{l}), for every finite family {A1,,An}\{A_{1},\ldots,A_{n}\} of disjoint subsets of τ\tau such that l=1nAlτ\cup^{n}_{l=1}A_{l}\in\tau.

A field of subsets of a set XX is a non-empty family, Σ\Sigma, of subsets of XX such that Σ\emptyset\in\Sigma, XΣX\in\Sigma, and for all A,BΣA,B\in\Sigma, we have ABΣA\cup B\in\Sigma and B\AΣB\backslash A\in\Sigma.

An additive set function μ\mu defined on a field Σ\Sigma of subsets of a topological space XX is said to be regular if for each AΣA\in\Sigma and ϵ>0\epsilon>0, there exists BΣB\in\Sigma whose closure is contained in AA and there exists CΣC\in\Sigma whose interior contains AA such that μ(D)<ϵ|\mu(D)|<\epsilon for every DΣD\in\Sigma with D:=C\BD:=C\backslash B.

Furthermore, kk is said to be strictly pd (resp. conditionally strictly pd) if, for mutually distinct x1,,xnXx_{1},\ldots,x_{n}\in X, equality in (5) only holds for α1==αn=0\alpha_{1}=\cdots=\alpha_{n}=0.

Characterization of Universal Kernels

Before characterizing various notions of universality, let us revisit their formal definitions.

A continuous kernel kk on a compact Hausdorff space XX is called c-universal if the RKHS, H\mathcal{H} induced by kk is dense in C(X)C(X) w.r.t. the uniform norm, i.e., for every function gC(X)g\in C(X) and all ϵ>0\epsilon>0, there exists an fHf\in\mathcal{H} such that fguϵ\|f-g\|_{u}\leq\epsilon.

A continuous kernel kk on a Hausdorff space XX is said to be cc-universal if the RKHS, H\mathcal{H} induced by kk is dense in C(X)C(X) endowed with the topology of compact convergence, i.e., for any compact set ZXZ\subset X, for any gC(Z)g\in C(Z) and all ϵ>0\epsilon>0, there exists an fHZf\in\mathcal{H}_{|Z} such that fguϵ\|f-g\|_{u}\leq\epsilon.

A bounded kernel, kk with k(,x)C0(X),xXk(\cdot,x)\in C_{0}(X),\,\forall\,x\in X on a locally compact Hausdorff space, XX is said to be c0c_{0}-universal if the RKHS, H\mathcal{H} induced by kk is dense in C0(X)C_{0}(X) w.r.t. the uniform norm, i.e., for every function gC0(X)g\in C_{0}(X) and all ϵ>0\epsilon>0, there exists an fHf\in\mathcal{H} such that fguϵ\|f-g\|_{u}\leq\epsilon.

A bounded continuous kernel, kk on a topological space, XX, is said to be cbc_{b}-universal if the RKHS, H\mathcal{H} induced by kk is dense in Cb(X)C_{b}(X) w.r.t. the uniform norm, i.e., for any gCb(X)g\in C_{b}(X) and all ϵ>0\epsilon>0, there exists an fHf\in\mathcal{H} such that fguϵ\|f-g\|_{u}\leq\epsilon.

First note that the above definitions are valid only if H\mathcal{H} is included in the appropriate target space, i.e., C(X)C(X) for c- and cc-universality, C0(X)C_{0}(X) for c0c_{0}-universality, and Cb(X)C_{b}(X) for cbc_{b}-universality. By Steinwart and Christmann (2008, Lemma 4.28, Theorem 4.61), the assumptions made on the kernel in the above definitions ensure that the definitions are valid. Also note that all these definitions are equivalent when XX is compact as C0(X)=Cb(X)=C(X)C_{0}(X)=C_{b}(X)=C(X) for compact XX. When XX is not compact, it is easy to see that cbc_{b}-universality is stronger than c0c_{0}-universality, i.e., if kk is cbc_{b}-universal, then it is also c0c_{0}-universal, but not vice-versa. On the other hand, it is not straightforward to see how the notions of cc-universal and c0c_{0}-universal are related when XX is non-compact. By characterizing c0c_{0}-universality and cc-universality, Theorem 6 in the following section, shows that the notion of c0c_{0}-universality is stronger than cc-universality, i.e., if a kernel is c0c_{0}-universal, then it is cc-universal, but not vice-versa. Based on these results, it follows that cbc_{b}-universality is stronger than cc-universality (but not vice-versa), when XX is non-compact.

Before we state our main result, i.e., Theorem 6, we need the following result, usually referred to as the Hahn-Banach theorem, which we quote from Rudin (1991, Theorem 3.5) (also see the remark following Theorem 3.5 in Rudin (1991)).

Suppose AA be a subspace of a locally convex topological vector space YY. Then AA is dense in YY if and only if A={0}A^{\perp}=\{0\}, where

The following main result of this paper, which presents a necessary and sufficient condition for kk to be c-, cc-, c0c_{0}- or cbc_{b}-universal. hinges on the above theorem, where we choose AA to be the RKHS, H\mathcal{H} and YY to be C(X)C(X), C0(X)C_{0}(X) or Cb(X)C_{b}(X) for which YY^{\prime} is known through the Riesz representation theorem.

Let XX be a compact Hausdorff space with kk being continuous. Then kk is c-universal if and only if the embedding,

Let XX be an LCH space and kCb(X×X)k\in C_{b}(X\times X). Then kk is cc-universal if and only if the embedding,

Let XX be an LCH space with the kernel, kk being bounded and k(,x)C0(X),xXk(\cdot,x)\in C_{0}(X),\,\forall\,x\in X. Then kk is c0c_{0}-universal if and only if the embedding,

Let XX be a normal topological space and let Mrba(X)M_{rba}(X) be the space of all finitely additive, regular, bounded set functions defined on the field generated by the closed sets of XX. Then, a bounded continuous kernel, kk is cbc_{b}-universal if and only if the embedding,

Proof First, we prove (c)(c), from which (a)(a) follows. (c)(c) By Definition 3, kk is c0c_{0}-universal if H\mathcal{H} is dense in C0(X)C_{0}(X). We now invoke Theorem 5 to characterize the denseness of H\mathcal{H} in C0(X)C_{0}(X), which means we need to consider the dual C0(X):=(C0(X))C^{\prime}_{0}(X):=(C_{0}(X))^{\prime} of C0(X)C_{0}(X). By the Riesz representation theorem (Folland, 1999, Theorem 7.17), C0(X)=Mb(X)C^{\prime}_{0}(X)=M_{b}(X) in the sense that there is a bijective linear isometry μTμ\mu\mapsto T_{\mu} from Mb(X)M_{b}(X) onto C0(X)C^{\prime}_{0}(X), given by the natural mapping,

Therefore, by Theorem 5, H\mathcal{H} is dense in C0(X)C_{0}(X) if and only if

(\,\Leftarrow\,) Suppose (12) is injective, i.e., for μMb(X)\mu\in M_{b}(X), Xk(,x)dμ(x)=0μ=0\int_{X}k(\cdot,x)\,d\mu(x)=0\Rightarrow\mu=0. Then by Lemma 26 (see Appendix A), we have

which by (15) means H\mathcal{H} is dense in C0(X)C_{0}(X) and therefore kk is c0c_{0}-universal. (\,\Rightarrow\,) We need to prove that if H\mathcal{H} is dense in C0(X)C_{0}(X) then (Xk(,x)dμ(x)=0μ=0)(\int_{X}k(\cdot,x)\,d\mu(x)=0\Rightarrow\mu=0) holds. This is equivalent to showing that if (Xk(,x)dμ(x)=0μ=0)(\int_{X}k(\cdot,x)\,d\mu(x)=0\Rightarrow\mu=0) does not hold, then H\mathcal{H} is not dense in C0(X)C_{0}(X). Suppose (Xk(,x)dμ(x)=0μ=0)(\int_{X}k(\cdot,x)\,d\mu(x)=0\Rightarrow\mu=0) does not hold, i.e., 0μMb(X)\exists\,0\neq\mu\in M_{b}(X) such that Xk(,x)dμ(x)=0\int_{X}k(\cdot,x)\,d\mu(x)=0, which means 0μMb(X)\exists\,0\neq\mu\in M_{b}(X) such that Xfdμ=0\int_{X}f\,d\mu=0 for every fHf\in\mathcal{H}, then, by (15), H\mathcal{H} is not dense in C0(X)C_{0}(X). (a)(a) When XX is compact, C0(X)C_{0}(X) coincides with C(X)C(X), which means c-universality and c0c_{0}-universality are equivalent. Therefore, kk is c-universal if and only if the embedding in (10) is injective. (b)(b) The proof is similar to that of (a)(a) except that we need to consider the dual of C(X)C(X) endowed with the topology of compact convergence (a locally convex topological vector space) to characterize the denseness of H\mathcal{H} in C(X)C(X). It is known (Hewitt, 1950) that C(X)=Mbc(X)C^{\prime}(X)=M_{bc}(X) in the sense that there is a bijective linear isometry μTμ\mu\mapsto T_{\mu} from Mbc(X)M_{bc}(X) onto C(X)C^{\prime}(X), given by the natural mapping, Tμ(f)=Xfdμ,fC(X)T_{\mu}(f)=\int_{X}f\,d\mu,\,\,f\in C(X). The rest of the proof is verbatim with Mb(X)M_{b}(X) replaced by Mbc(X)M_{bc}(X). (d)(d) The proof is very similar to that of (a)(a) , wherein we identify (Cb(X))Mrba(X)(C_{b}(X))^{\prime}\cong M_{rba}(X) such that T(Cb(X))T\in(C_{b}(X))^{\prime} and μMrba(X)\mu\in M_{rba}(X) satisfy T(f)=Xfdμ,fCb(X)T(f)=\int_{X}f\,d\mu,\,f\in C_{b}(X) (Dunford and Schwartz, 1958, p. 262). Here, \cong represents the isometric isomorphism. The rest of the proof is verbatim with Mb(X)M_{b}(X) replaced by Mrba(X)M_{rba}(X). Theorem 6 can also be interpreted as: for appropriate assumptions on XX and μ\mu, the embedding in (2) is injective if and only if the kernel is universal, therefore relating universality and injective RKHS embedding of finite signed Radon measures. In other words, Theorem 6 provides a novel measure embedding view point of universality compared to its well-known function approximation view point. Based on Theorem 6, the following remarks can be made.

(a) Theorem 6 provides a necessary and sufficient condition for c-universality — kk is c-universal if and only if the embedding in (10) is injective — while Steinwart (2001) provided only a sufficient condition (in terms of the feature maps being an algebra; see Steinwart and Christmann (2008, Theorem 4.56) for details) using the Stone-Weierstraß theorem. Therefore, Theorem 6 differs from and generalizes the result by Steinwart (2001). (b) Note that the embedding in (11) is injective if and only if for any compact set ZXZ\subset X, the embedding

Based on Theorem 6, the following result provides an alternate and equivalent characterization of universality or injectivity of the embedding in (2), which is easier to interpret, as it resembles the condition of kk being strictly pd (though not quite exactly the same). This alternate characterization is then used in Sections 3.2–3.4 to obtain easily checkable conditions for the universality of specific classes of kernels. We also show that strictly pd is a necessary condition for universality.

Suppose the assumptions in Theorem 6 hold. Then,

If kk is c-, cc- or c0c_{0}-universal, then it is strictly pd.

Proof We only prove (c)(c). The proof of (b)(b) is exactly the same as that of (c)(c) with Mb(X)M_{b}(X) replaced by Mbc(X)M_{bc}(X), while the proof of (a)(a) is trivial. (c)(c) (\,\Leftarrow\,) Suppose kk is not c0c_{0}-universal. By Theorem 6(c), there exists 0μMb(X)0\neq\mu\in M_{b}(X) such that Xk(,x)dμ(x)=0\int_{X}k(\cdot,x)\,d\mu(x)=0, which implies Xk(,x)dμ(x)H=0\|\int_{X}k(\cdot,x)\,d\mu(x)\|_{\mathcal{H}}=0. This means

Define μ:=j=1nαjδxj\mu:=\sum^{n}_{j=1}\alpha_{j}\delta_{x_{j}}, where δx\delta_{x} represents the Dirac measures at xx. Clearly μ0\mu\neq 0 and μMbc(X)\mu\in M_{bc}(X). From (20), it is clear that  ⁣ ⁣Xk(x,y)dμ(x)dμ(y)=0\int\!\!\int_{X}k(x,y)\,d\mu(x)\,d\mu(y)=0. Therefore, by Proposition 8(b), kk is not cc-universal. The result for c0c_{0}-universality follows from Remark 7(c), while the result for c-universality is trivial. See Carmeli et al. (2009, Corollary 5), Steinwart and Christmann (2008, Proposition 4.54, Example 4.11) and Sriperumbudur et al. (2009b, Footnote 4).

A summary of results based on Theorem 6, Remarks 7, 9 and Proposition 8 is shown in Figures 1(a) and 1(b).

Although the conditions in (17)-(19) are easy to interpret, they are not always easy to check. To this end, in the remainder of this section, we present easily checkable characterizations for the following classes of kernels. These classes of kernels are both mathematically and practically interesting as many of the popular kernels used in machine learning, e.g., Gaussian, Laplacian, exponential, etc., fall in these classes (see Examples 1–3 for more examples).

These kernels are also called Schoenberg kernels (Wendland, 2005, Corollary 7.12, Theorem 7.13).Note that kk is a scale mixture of Gaussian kernels.

XX is an LCH space with bounded kk. Let k(x,y)=jIϕj(x)ϕj(y),(x,y)X×Xk(x,y)=\sum_{j\in I}\phi_{j}(x)\phi_{j}(y),\,\,(x,y)\in X\times X, where we assume the series converges uniformly on X×XX\times X. {ϕj:jI}\{\phi_{j}:j\in I\} is a set of continuous real-valued functions on XX where II is a countable index set.

If supp(ψ)\emph{supp}(\psi) is compact, then kk is c0c_{0}-universal.

If (supp(Λ))(\emph{supp}(\Lambda))^{\circ}\neq\emptyset, then kk is cc-universal.

Gaussian, ψ(x)=exp(x222σ2),σ>0\psi(x)=\exp\left(-\frac{\|x\|^{2}_{2}}{2\sigma^{2}}\right),\,\sigma>0 with ψ^(ω)=σdexp(σ2ω222)\hat{\psi}(\omega)=\sigma^{d}\exp\left(-\frac{\sigma^{2}\|\omega\|^{2}_{2}}{2}\right).

Laplacian, ψ(x)=exp(σx1),σ>0\psi(x)=\exp\left(-\sigma\|x\|_{1}\right),\,\sigma>0 with ψ^(ω)=(2π)d/2j=1dσσ2+ωj2\hat{\psi}(\omega)=\left(\frac{2}{\pi}\right)^{d/2}\prod^{d}_{j=1}\frac{\sigma}{\sigma^{2}+\omega^{2}_{j}}, where ω=(ω1,,ωd)\omega=(\omega_{1},\ldots,\omega_{d}).

B1B_{1}-spline, ψ(x)=j=1d(1xj)\mathds1(xj)\psi(x)=\prod^{d}_{j=1}(1-|x_{j}|)\mathds{1}_{}(x_{j}) with ψ^(ω)=j=1d42πsin2(ωj/2)ωj2\hat{\psi}(\omega)=\prod^{d}_{j=1}\frac{4}{\sqrt{2\pi}}\frac{\sin^{2}(\omega_{j}/2)}{\omega^{2}_{j}}, where x=(x1,,xd)x=(x_{1},\ldots,x_{d}) and ω=(ω1,,ωd)\omega=(\omega_{1},\ldots,\omega_{d}).

The following remarks can be made about Proposition 11.

A summary of results, based on Proposition 11 and Remark 12, for the case of kernels satisfying (A1A_{1}), is shown in Figure 1(c).

where Fubini’s theorem is invoked in (a)(a) and

ψ(x)=eαcosxcos(αsinx),0<α1\psi(x)=e^{\alpha\cos x}\cos(\alpha\sin x),\,0<\alpha\leq 1 with Aψ(0)=1A_{\psi}(0)=1 and Aψ(n)=αn2n!,n0A_{\psi}(n)=\frac{\alpha^{|n|}}{2|n|!},\,\forall\,n\neq 0.

ψ(x)=(π(x)mod2π)2\psi(x)=(\pi-(x)_{mod\,\,2\pi})^{2} with Aψ(0)=π23A_{\psi}(0)=\frac{\pi^{2}}{3} and Aψ(n)=2n2,n0A_{\psi}(n)=\frac{2}{n^{2}},\,\forall\,n\neq 0.

cc-universal kernels vs. Strictly pd kernels: We have shown in Proposition 8(d) that strictly pd is a necessary condition for kk to be c-, cc- or c0c_{0}-universal. However, the converse is not true (see Remark 9(a)), which is based on Proposition 14 and the following result in Theorem 15. Before we state the result, we need some definitions.

A summary of results for kernels of the type (A2A_{2}) is shown in Figure 1(b).

Suppose (A3A_{3}) holds. Then the following conditions are equivalent.

Proof (a)(d)(a)\Rightarrow(d) by Remark 7(c), (d)(c)(d)\Rightarrow(c) by Proposition 8(d) and (c)(b)(c)\Leftrightarrow(b) by Wendland (2005, Theorem 7.14). Now, we show (b)(a)(b)\Rightarrow(a).

Gaussian, k(x,y)=eσxy22,σ>0k(x,y)=e^{-\sigma\|x-y\|^{2}_{2}},\,\sigma>0. Note that ν=δσ\nu=\delta_{\sigma} in (21), where δσ\delta_{\sigma} represents a Dirac measure at σ\sigma. Clearly supp(ν)={σ}{0}\emph{supp}(\nu)=\{\sigma\}\neq\{0\}.

Inverse multiquadratic, k(x,y)=(c2+xy22)β,β>0,c>0k(x,y)=(c^{2}+\|x-y\|^{2}_{2})^{-\beta},\,\beta>0,\,c>0, obtained by choosing dν(t)=1Γ(β)tβ1ec2tdtd\nu(t)=\frac{1}{\Gamma(\beta)}t^{\beta-1}e^{-c^{2}t}\,dt in (21). It is easy to verify that supp(ν){0}\emph{supp}(\nu)\neq\{0\}.

A summary of results for kernels of the type (A3A_{3}) is shown in Figure 1(d).

We now consider the characterization of c-, cc- and c0c_{0}-universality for (A4A_{4}).

kk is c-universal (resp. cc-universal) if and only if for any 0μMb(X)0\neq\mu\in M_{b}(X) (resp. 0μMbc(X)0\neq\mu\in M_{bc}(X)), there exists some jIj\in I for which Xϕjdμ0\int_{X}\phi_{j}\,d\mu\neq 0.

Let k(,x)C0(X),xXk(\cdot,x)\in C_{0}(X),\,\forall\,x\in X. Then kk is c0c_{0}-universal if and only if for any 0μMb(X)0\neq\mu\in M_{b}(X), there exists some jIj\in I for which Xϕjdμ0\int_{X}\phi_{j}\,d\mu\neq 0.

Proof We first prove (b)(b). The proof for c-universality in (a)(a) is trivial as it follows from (b)(b), while the proof for cc-universality in (a)(a) is exactly the same as that of (b)(b) with Mb(X)M_{b}(X) replaced by Mbc(X)M_{bc}(X). Let us consider

where we have invoked Fubini’s theorem in (c)(c). (b)(b) (\,\Leftarrow\,) Suppose for any 0μMb(X)0\neq\mu\in M_{b}(X), there exists some jIj\in I for which Xϕjdμ0\int_{X}\phi_{j}\,d\mu\neq 0. Then, from (28), it is clear that  ⁣ ⁣Xk(x,y)dμ(x)dμ(y)>0,0μMb(X)\int\!\!\int_{X}k(x,y)\,d\mu(x)\,d\mu(y)>0,\,\forall\,0\neq\mu\in M_{b}(X) and therefore kk is c0c_{0}-universal, which follows from Proposition 8(c). (\,\Rightarrow\,) Suppose there exists a non-zero measure, μMb(X)\mu\in M_{b}(X) for which Xϕjdμ=0\int_{X}\phi_{j}\,d\mu=0 for any jIj\in I. By (28), this means there exists a 0μMb(X)0\neq\mu\in M_{b}(X) for which  ⁣ ⁣Xk(x,y)dμ(x)dμ(y)=0\int\!\!\int_{X}k(x,y)\,d\mu(x)\,d\mu(y)=0, i.e., kk is not c0c_{0}-universal (by Proposition 8(c)). The conditions in Proposition 17 are not always easy to check. However, for the case of Taylor kernels (Steinwart and Christmann, 2008, Lemma 4.8), which include the exponential kernel, simple, easy to check sufficient conditions can be obtained as shown in Corollary 18. Although this result is exactly the same as Corollary 4.57 in Steinwart and Christmann (2008), we present a different proof (we would like to remind the reader that our characterization of c-universality is different from the one provided by Steinwart (2001) and therefore the proof is different; see Remark 7(a)).

Proof From the proof of Lemma 4.8 in Steinwart and Christmann (2008), we have

To summarize, in this section, by showing the relation between various notions of universality and the injective RKHS embedding of finite signed Radon measures, we have presented a novel measure embedding point of view of universality compared to its well-known function approximation view point. Since the RKHS embedding of finite signed Radon measures generalizes the concept of RKHS embedding of Radon probability measures, the latter being related to characteristic kernels (Fukumizu et al., 2004, 2008; Sriperumbudur et al., 2008), in the following section, we relate the notion of universality to characteristic kernels.

Characteristic Kernels and Universality

Recent studies in machine learning have considered the mapping of random variables into a suitable RKHS and showed that this provides a powerful and straightforward method of dealing with higher-order statistics of the variables. Using their RKHS mappings, for sufficiently rich RKHSs, it becomes possible to test for homogeneity (Gretton et al., 2007), independence (Gretton et al., 2008), conditional independence (Fukumizu et al., 2008), to find the most predictive subspace in regression (Fukumizu et al., 2004), etc. Key to the above applications is the notion of a characteristic kernel — defined below — which gives rise to an RKHS that is sufficiently rich in the sense required above.

Since the embedding in (30) is a special case of the embedding in (2), and the injectivity of the embedding in (2) is related to universality (see Section 3), we now relate universal and characteristic kernels.

Gretton et al. (2007) have shown that a c-universal kernel is characteristic. Besides this result, not much is known or understood about the relation between characteristic and universal kernels. The following result not only provides the same result obtained by Gretton et al. (2007), but also generalizes it for non-compact XX.

Suppose the assumptions in Theorem 6 hold. If kk is cc-, cccc- or c0c_{0}-universal, then it is characteristic to the set of probability measures contained in Mb(X)M_{b}(X), Mbc(X)M_{bc}(X) or Mb(X)M_{b}(X), respectively.

Proof The proof is trivial and follows from Theorem 6 and Definition 19. Now, one can ask when the converse to Proposition 20 is true. The following result answers this question for some special classes of kernels.

A summary of the relation between characteristic and universal kernels is shown in Figure 1. Characteristic kernels vs. Strictly pd kernels: In Section 3, we have shown the relation between universal kernels and strictly pd kernels, while in Propositions 20 and 21, we have related universal and characteristic kernels. We now investigate the relation between characteristic and strictly pd kernels.

If kk is characteristic, then it is conditionally strictly pd.

So far, we presented the relation between characteristic kernels and universal kernels and showed that for any LCH space, XX, the characteristic property is a weaker notion than c0c_{0}-universality. Although such a weaker notion is sufficient to make the embedding in (30) injective, in the following section, we show that the stronger notion of c0c_{0}-universality is required to study an important property of the “probability metric” associated with the embedding in (30).

on M+1(X)M^{1}_{+}(X), called the maximum mean discrepancy (MMD). Note that when kk is characteristic, γk\gamma_{k} is a metric on M+1(X)M^{1}_{+}(X). One immediate question that naturally arises is “how is MMD related to other metrics on M+1(X)M^{1}_{+}(X), such as the Prohorov metric, Dudley metric, Wasserstein-Kantorovich metric, total variation metric, etc?” This is a question of both theoretical and practical importance.

Let XX be an LCH space and kk be c0c_{0}-universal. Then, the topology induced by γk\gamma_{k} coincides with the weak topology on M+1(X)M^{1}_{+}(X).

The following result in Sriperumbudur et al. (2009b, Theorem 23) can be obtained as a simple corollary to Proposition 24, wherein the question of metrization of weak topology by γk\gamma_{k} is addressed only for compact Hausdorff XX. The general non-compact case was left as an open problem, which we addressed in Proposition 24.

Suppose XX is compact Hausdorff and kk is c-universal. Then, γk\gamma_{k} metrizes the weak topology on M+1(X)M^{1}_{+}(X).

Conclusions & Discussion

The discussion in this paper has been related to the characterization of various notions of universality wherein the RKHS, H\mathcal{H} is dense in some subset of C(X)C(X) (the space of real-valued continuous functions on XX) w.r.t. the uniform norm (here, XX is a some arbitrary topological space). This means any target function, ff^{\star} in the appropriate subset of C(X)C(X) can be approximated arbitrarily well by some gHg\in\mathcal{H} w.r.t. the uniform norm. There is a notion of universality, which we have not considered, called LpL_{p}-universality (Steinwart and Christmann, 2008, Chapter 5): a measurable and bounded kernel, kk defined on a Hausdorff space, XX, is said to be LpL_{p}-universal if the RKHS, H\mathcal{H} induced by kk is dense in Lp(X,μ)L^{p}(X,\mu) w.r.t. the pp-norm, defined as fp:=(Xf(x)pdμ(x))1/p\|f\|_{p}:=(\int_{X}|f(x)|^{p}\,d\mu(x))^{1/p}, for all μM+1(X)\mu\in M^{1}_{+}(X) and some p[1,)p\in[1,\infty). Here Lp(X,μ)L^{p}(X,\mu) is the Banach space of pp-integrable μ\mu-measurable functions on XX. This notion of universality is more applicable in learning theory, where the target function, ff^{\star} is usually assumed to lie in Lp(X,μ)L^{p}(X,\mu) for some p[1,)p\in[1,\infty) and for some Borel probability measure, μ\mu. By considering this notion of universality, any fLp(X,μ)f^{\star}\in L^{p}(X,\mu) can be approximated arbitrarily well by some gHg\in\mathcal{H} w.r.t. the p-norm for all Borel probability measures μ\mu and some p[1,)p\in[1,\infty). In particular, Steinwart and Christmann (2008, Theorems 5.31, 5.36 and Corollary 5.37) have shown that LpL_{p}-universality is necessary and sufficient to achieve consistency in kernel-based learning algorithms. In this paper, we did not consider this notion of universality because unlike the other notions of universality, it is not straightforward to relate LpL_{p}-universality and the RKHS embedding of measures by using the Hahn-Banach theorem (see Theorem 5). However, recently, Carmeli et al. (2009, Theorem 1) have shown that kk is LpL_{p}-universal if and only if it is c0c_{0}-universal, which therefore establishes the relation between LpL_{p}-universality and the RKHS embedding of measures. Using this result, LpL_{p}-universality can be related to all other notions considered in this paper, through Figure 1.

B. K. S. and G. R. G. L. wish to acknowledge support from the Institute of Statistical Mathematics (ISM), Tokyo, the National Science Foundation (grant DMS-MSPA 0625409), the Fair Isaac Corporation and the University of California MICRO program. Part of this work was done while B. K. S. was visiting ISM. K.F. was supported by JSPS KAKENHI 19500249.

Appendix A. Supplementary Results

For completeness, we present the following supplementary result, which is a simple generalization of the technique used in the proof of Sriperumbudur et al. (2008, Theorem 3).

Let kk be a measurable and bounded kernel on a measurable space, XX and let H\mathcal{H} be its associated RKHS. Then, for any fHf\in\mathcal{H} and for any finite signed Borel measure, μ\mu,

Therefore, TμT_{\mu} is a bounded linear functional on H\mathcal{H}. By the Riesz representation theorem (Folland, 1999, Theorem 5.25), there exists a unique λμH\lambda_{\mu}\in\mathcal{H} such that Tμ[f]=f,λμHT_{\mu}[f]=\langle f,\lambda_{\mu}\rangle_{\mathcal{H}} for all fHf\in\mathcal{H}. Set f=k(,u)f=k(\cdot,u) for some uXu\in X, which implies λμ=Xk(,x)dμ(x)\lambda_{\mu}=\int_{X}k(\cdot,x)\,d\mu(x) and the result follows.

References