Universality, Characteristic Kernels and RKHS Embedding of Measures

Bharath K. Sriperumbudur, Kenji Fukumizu, Gert R. G. Lanckriet

Introduction

Such an embedding has been found to be useful in many statistical applications like homogeneity testing (Gretton et al., 2007), independence testing (Gretton et al., 2008; Fukumizu et al., 2008), dimensionality reduction (Fukumizu et al., 2004, 2009a), etc., as it provides a powerful and straightforward method of dealing with higher-order statistics of random variables. However, in these applications, it is critical that the embedding in (1) is injective so that probability measures can be distinguished by their images in $\mathcal{H}$ . To this end, Fukumizu et al. (2008) introduced the notion of characteristic kernel — a bounded, measurable $k$ is said to be characteristic if (1) is injective — for which many characterizations have recently been provided (Gretton et al., 2007; Fukumizu et al., 2008, 2009b; Sriperumbudur et al., 2008, 2009a, 2009b).

A natural extension to the above idea of embedding probability measures into an RKHS, $\mathcal{H}$ is to embed finite signed Borel measures, $\mu$ into $\mathcal{H}$ as

and study the conditions on the kernel, $k$ for which such an embedding is injective. Although the embedding in (2) can be proposed and investigated for mathematical pleasure, we show as one of the main contributions of this paper that under certain conditions on $\mu$ and $X$ , the embedding in (2) is closely related to the concept of universal kernels (see Section 1.1 for the formal introduction to universal kernels), which was first proposed by Steinwart (2001) — in the context of achieving the Bayes risk in kernel-based classification/regression algorithms — and later extended by Micchelli et al. (2006), Carmeli et al. (2009) and Sriperumbudur et al. (2010).The present paper is an extended version of Sriperumbudur et al. (2010). This connection shows that the embedding in (2) is not just an abstract mathematical object, but has applications in kernel-based classification/regression algorithms. Using the connection between (2) and universal kernels, we then show how the various notions of universality mentioned above are related to each other. In addition, since the embedding in (2) is a generalization of the embedding in (1), we also demonstrate the relation between characteristic kernels and universal kernels, which extends the preliminary study carried out in Sriperumbudur et al. (2009b, Section 3.4).

In the remainder of this introduction, we provide a comprehensive overview of our contributions which are presented in detail in later sections. First, in Section 1.1, we introduce universality, briefly discuss various notions of universality that are proposed in literature, and outline our contribution: a measure embedding view point of universality, which is novel and different from the existing view point of approximating functions in some target space by functions in an RKHS. We show that a kernel is universal if and only if the embedding in (2) is injective. Second, in Section 1.2, we discuss our second contribution of relating universal and characteristic kernels.

In the regularization approach to learning (Evgeniou et al., 2000), it is well known that kernel-based algorithms (for classification/regression) generally invoke the representer theorem (Kimeldorf and Wahba, 1970; Schölkopf et al., 2001) and learn a function in $\mathcal{H}$ that has the representation,

is dense in $\mathcal{H}$ (Aronszajn, 1950), and assuming that the kernel-based algorithm makes $f$ “converge to an appropriate function” in $\mathcal{H}$ as $n\rightarrow\infty$ , the above question of approximating $f^{\star}$ arbitrarily well by $f$ in (3) as $n$ goes to infinity is equivalent to the question of whether $\mathcal{H}$ is rich enough to approximate any $f^{\star}$ arbitrarily well, i.e., whether $\mathcal{H}$ is universal. We show that characterizing universal RKHSs (or equivalently, the characterization of corresponding reproducing kernels (r.k.) as any RKHS is uniquely determined by its reproducing kernel) leads to the embedding in (2).

As mentioned above, the goal is to characterize $\mathcal{H}$ that allow to approximate any $f^{\star}$ in some target space, usually assumed to be some subset of the space of real-valued continuous functions on $X$ . Therefore, depending on the choice of $X$ , the choice of target space and the type of approximation, various notions of universality have been proposed (Steinwart, 2001; Micchelli et al., 2006; Carmeli et al., 2009; Sriperumbudur et al., 2010), which are briefly discussed in the following paragraphs. The eventual goal is to have a notion of universality that allows comprehensive (and general) necessary and/or sufficient conditions on the reproducing kernel for approximating, as strong as possible, a class of target functions, as general as possible.

As our contribution, in Section 3.1, we completely characterize c-universal kernels by showing that $k$ is c-universal if and only if the embedding in (2) is injective for $\mu\in M_{b}(X)$ , the space of finite signed Radon measures defined on a compact Hausdorff space, $X$ (see Section 2 for a formal definition of $M_{b}(X)$ ). It has to be noted that this result is different from and more general — as both necessary and sufficient conditions are provided — than the one by Steinwart (2001, Theorem 9), where only a sufficient condition is provided. Using this characterization, as a special case, we also obtain necessary and sufficient conditions for a Fourier kernel (see Section 3.3) to be c-universal, while Steinwart (2001) provided only a sufficient condition.

$c_{0}$ -universality: Although cc-universality solves the limitation of c-universality by handling non-compact $X$ , the topology of compact convergence considered in cc-universality is weaker than the topology of uniform convergence, i.e., a sequence of functions, $\{f_{n}\}\subset C(X)$ converging to $f\in C(X)$ in the topology of uniform convergence ensures that they converge in the topology of compact convergence but not vice-versa. So, the natural question to ask is whether we can characterize $\mathcal{H}$ that are rich enough to approximate any $f^{\star}$ on non-compact $X$ in a stronger sense, i.e., uniformly, by some $g\in\mathcal{H}$ . Recently, this has been answered by Carmeli et al. (2009, Definition 2, Theorem 1) and Sriperumbudur et al. (2010), wherein they defined $k$ to be $c_{0}$ -universal if $k$ is bounded, $k(\cdot,x)\in C_{0}(X),\,\forall\,x\in X$ and its corresponding RKHS, $\mathcal{H}$ is dense in $C_{0}(X)$ w.r.t. the uniform norm, where $X$ is a locally compact Hausdorff (LCH) space and $C_{0}(X)$ is the Banach space of bounded continuous functions vanishing at infinity, endowed with the uniform norm (see Section 2 for the definition of $C_{0}(X)$ ).

$c_{b}$ -universality: The definition of $c_{0}$ -universality deals with $\mathcal{H}$ being dense in $C_{0}(X)$ w.r.t. the uniform norm, where $X$ is an LCH space. Although the notion of $c_{0}$ -universality addresses limitations associated with both c- and cc-universality, it only approximates a subset of $C(X)$ , i.e., it cannot deal with functions in $C(X)\backslash C_{0}(X)$ . This limitation can be addressed by considering a larger class of functions to be approximated.

To this end, we propose a notion of universality that is stronger than $c_{0}$ -universality: $k$ is said to be $c_{b}$ -universal if its corresponding RKHS, $\mathcal{H}$ is dense in $C_{b}(X)$ , the space of bounded continuous functions on a topological space, $X$ (note that $C_{0}(X)\subset C_{b}(X)$ ). This notion of $c_{b}$ -universality is more applicable in learning theory than $c_{0}$ -universality as the target function, $f^{\star}$ can belong to $C_{b}(X)$ (which is a more natural assumption) instead of it being restrained to $C_{0}(X)$ (note that $C_{0}(X)$ only contains functions that vanish at infinity). We show in Section 3.1 that $k$ is $c_{b}$ -universal if and only if the embedding in (2) is injective for $\mu$ belonging to a certain class of set functions (see Section 2 for the definition of set functions) defined on a normal topological space, $X$ (see Theorem 6 for details). Because of the technicalities involved in dealing with set functions, in this paper, we do not fully analyze this notion of universality unlike the other aforementioned notions, although it is an interesting problem to be resolved because of its applicability in learning theory.

To summarize our first contribution, we show that, by appropriately choosing $X$ and $\mu$ in (2), the injectivity of the embedding in (2) completely characterizes various notions of universality that are proposed in literature. Using this connection between universality and the injectivity of the embedding in (2), we relate all these notions of universality, which is summarized in Figure 1.

2 Contribution 2: Relation between characteristic and universal kernels

Using the embedding in (1), Gretton et al. (2007) proposed a metric, called the maximum mean discrepancy (MMD), on the space of all Borel probability measures, when $k$ is characteristic. One important theoretical question that is usually considered for metrics on probability measures is (Dudley, 2002, Chapter 11): “What is the nature of the topology induced by the probability metric in relation to the usual weak topology?” In probability theory, this question is important in understanding and proving central limit theorems. Although $k$ being characteristic is sufficient for MMD to be a metric, we show in Section 4.2 that a notion stronger than the characteristic property is required to answer the above question. In particular, we show in Proposition 24 that if $X$ is an LCH space and $k$ is $c_{0}$ -universal, then the topology induced by MMD coincides with the usual weak topology on the space of Radon probability measures defined on $X$ .Sriperumbudur et al. (2009b) showed that if $X$ is a compact metric space and $k$ is c-universal, then the topology induced by MMD coincides with the usual weak topology. The result for non-compact $X$ was left as an open question and is addressed in this paper, by applying the notion of $c_{0}$ -universality. This result can be used to compare MMD to other probability metrics, such as the Dudley metric, total variation distance, Wasserstein distance, etc. We refer to Sriperumbudur et al. (2009b) for a detailed study on the comparison of MMD to other probability metrics.

To summarize, our main contributions in this paper are:

To establish the relationship between various notions of universality and the RKHS embedding, shown in (2), of finite signed Radon measures, and in turn present a novel measure embedding view point of universality compared to the classical function approximation view point.

To clarify the relationship between universal and characteristic kernels.

A summary of the results in this paper is shown in Figure 1. In the following section, we introduce the notation and some definitions that are used throughout the paper. Supplementary results used in proofs are collected in Appendix A.

Definitions & Notation

Let $X$ be a topological space. $C(X)$ denotes the space of all continuous functions on $X$ . $C_{b}(X)$ is the space of all bounded, continuous functions on $X$ . For a locally compact Hausdorff space, $X$ , $f\in C(X)$ is said to vanish at infinity if for every $\epsilon>0$ the set $\{x:|f(x)|\geq\epsilon\}$ is compact. The class of all continuous $f$ on $X$ which vanish at infinity is denoted as $C_{0}(X)$ . The spaces $C_{b}(X)$ and $C_{0}(X)$ are endowed with the uniform norm, $\|\cdot\|_{u}$ defined as $\|f\|_{u}:=\sup_{x\in X}|f(x)|$ for $f\in C_{0}(X)\subset C_{b}(X)$ .

If $Y$ denotes a topological vector space, we denote by $Y^{\prime}$ the vector space of continuous linear functionals on $Y$ , and $Y^{\prime}$ is called the topological dual space (in this paper, we simply refer to it as the dual).

For a set $A$ , we denote its interior as $A^{\circ}$ .

Radon measure: A signed Radon measure $\mu$ on a Hausdorff space $X$ is a Borel measure on $X$ satisfying

$\mu(C)<\infty$ for each compact subset $C\subset X$ ,

$\mu(B)=\sup\{\mu(C)\,|\,C\subset B,\,C\,\text{compact}\}$ for each $B$ in the Borel $\sigma$ -algebra of $X$ .

$\mu$ is said to be finite if $\|\mu\|:=|\mu|(X)<\infty$ , where $|\mu|$ is the total-variation of $\mu$ . $M^{b}_{+}(X)$ denotes the space of all finite Radon measures on $X$ while $M_{b}(X)$ denotes the space of all finite signed Radon measures on $X$ . The space of all Radon probability measures is denoted as $M^{1}_{+}(X):=\{\mu\in M^{b}_{+}(X):\mu(X)=1\}$ . For $\mu\in M_{b}(X)$ , the support of $\mu$ is defined as

$M_{bc}(X)$ denotes the space of all compactly supported finite signed Radon measures on $X$ . We refer the reader to Berg et al. (1984, Chapter 2) for a general reference on the theory of Radon measures.

Finitely additive, regular set function: A set function is a function defined on a family of sets, and has values in $[-\infty,+\infty]$ .

A set function $\mu$ defined on a family $\tau$ of sets is said to be finitely additive if $\emptyset\in\tau$ , $\mu(\emptyset)=0$ and $\mu(\cup^{n}_{l=1}A_{l})=\sum^{n}_{l=1}\mu(A_{l})$ , for every finite family $\{A_{1},\ldots,A_{n}\}$ of disjoint subsets of $\tau$ such that $\cup^{n}_{l=1}A_{l}\in\tau$ .

A field of subsets of a set $X$ is a non-empty family, $\Sigma$ , of subsets of $X$ such that $\emptyset\in\Sigma$ , $X\in\Sigma$ , and for all $A,B\in\Sigma$ , we have $A\cup B\in\Sigma$ and $B\backslash A\in\Sigma$ .

An additive set function $\mu$ defined on a field $\Sigma$ of subsets of a topological space $X$ is said to be regular if for each $A\in\Sigma$ and $\epsilon>0$ , there exists $B\in\Sigma$ whose closure is contained in $A$ and there exists $C\in\Sigma$ whose interior contains $A$ such that $|\mu(D)|<\epsilon$ for every $D\in\Sigma$ with $D:=C\backslash B$ .

Furthermore, $k$ is said to be strictly pd (resp. conditionally strictly pd) if, for mutually distinct $x_{1},\ldots,x_{n}\in X$ , equality in (5) only holds for $\alpha_{1}=\cdots=\alpha_{n}=0$ .

Characterization of Universal Kernels

Before characterizing various notions of universality, let us revisit their formal definitions.

A continuous kernel $k$ on a compact Hausdorff space $X$ is called c-universal if the RKHS, $\mathcal{H}$ induced by $k$ is dense in $C(X)$ w.r.t. the uniform norm, i.e., for every function $g\in C(X)$ and all $\epsilon>0$ , there exists an $f\in\mathcal{H}$ such that $\|f-g\|_{u}\leq\epsilon$ .

A continuous kernel $k$ on a Hausdorff space $X$ is said to be cc-universal if the RKHS, $\mathcal{H}$ induced by $k$ is dense in $C(X)$ endowed with the topology of compact convergence, i.e., for any compact set $Z\subset X$ , for any $g\in C(Z)$ and all $\epsilon>0$ , there exists an $f\in\mathcal{H}_{|Z}$ such that $\|f-g\|_{u}\leq\epsilon$ .

A bounded kernel, $k$ with $k(\cdot,x)\in C_{0}(X),\,\forall\,x\in X$ on a locally compact Hausdorff space, $X$ is said to be $c_{0}$ -universal if the RKHS, $\mathcal{H}$ induced by $k$ is dense in $C_{0}(X)$ w.r.t. the uniform norm, i.e., for every function $g\in C_{0}(X)$ and all $\epsilon>0$ , there exists an $f\in\mathcal{H}$ such that $\|f-g\|_{u}\leq\epsilon$ .

A bounded continuous kernel, $k$ on a topological space, $X$ , is said to be $c_{b}$ -universal if the RKHS, $\mathcal{H}$ induced by $k$ is dense in $C_{b}(X)$ w.r.t. the uniform norm, i.e., for any $g\in C_{b}(X)$ and all $\epsilon>0$ , there exists an $f\in\mathcal{H}$ such that $\|f-g\|_{u}\leq\epsilon$ .

First note that the above definitions are valid only if $\mathcal{H}$ is included in the appropriate target space, i.e., $C(X)$ for c- and cc-universality, $C_{0}(X)$ for $c_{0}$ -universality, and $C_{b}(X)$ for $c_{b}$ -universality. By Steinwart and Christmann (2008, Lemma 4.28, Theorem 4.61), the assumptions made on the kernel in the above definitions ensure that the definitions are valid. Also note that all these definitions are equivalent when $X$ is compact as $C_{0}(X)=C_{b}(X)=C(X)$ for compact $X$ . When $X$ is not compact, it is easy to see that $c_{b}$ -universality is stronger than $c_{0}$ -universality, i.e., if $k$ is $c_{b}$ -universal, then it is also $c_{0}$ -universal, but not vice-versa. On the other hand, it is not straightforward to see how the notions of cc-universal and $c_{0}$ -universal are related when $X$ is non-compact. By characterizing $c_{0}$ -universality and cc-universality, Theorem 6 in the following section, shows that the notion of $c_{0}$ -universality is stronger than cc-universality, i.e., if a kernel is $c_{0}$ -universal, then it is cc-universal, but not vice-versa. Based on these results, it follows that $c_{b}$ -universality is stronger than cc-universality (but not vice-versa), when $X$ is non-compact.

Before we state our main result, i.e., Theorem 6, we need the following result, usually referred to as the Hahn-Banach theorem, which we quote from Rudin (1991, Theorem 3.5) (also see the remark following Theorem 3.5 in Rudin (1991)).

Suppose $A$ be a subspace of a locally convex topological vector space $Y$ . Then $A$ is dense in $Y$ if and only if $A^{\perp}=\{0\}$ , where

The following main result of this paper, which presents a necessary and sufficient condition for $k$ to be c-, cc-, $c_{0}$ - or $c_{b}$ -universal. hinges on the above theorem, where we choose $A$ to be the RKHS, $\mathcal{H}$ and $Y$ to be $C(X)$ , $C_{0}(X)$ or $C_{b}(X)$ for which $Y^{\prime}$ is known through the Riesz representation theorem.

Let $X$ be a compact Hausdorff space with $k$ being continuous. Then $k$ is c-universal if and only if the embedding,

Let $X$ be an LCH space and $k\in C_{b}(X\times X)$ . Then $k$ is cc-universal if and only if the embedding,

Let $X$ be an LCH space with the kernel, $k$ being bounded and $k(\cdot,x)\in C_{0}(X),\,\forall\,x\in X$ . Then $k$ is $c_{0}$ -universal if and only if the embedding,

Let $X$ be a normal topological space and let $M_{rba}(X)$ be the space of all finitely additive, regular, bounded set functions defined on the field generated by the closed sets of $X$ . Then, a bounded continuous kernel, $k$ is $c_{b}$ -universal if and only if the embedding,

Proof First, we prove $(c)$ , from which $(a)$ follows. $(c)$ By Definition 3, $k$ is $c_{0}$ -universal if $\mathcal{H}$ is dense in $C_{0}(X)$ . We now invoke Theorem 5 to characterize the denseness of $\mathcal{H}$ in $C_{0}(X)$ , which means we need to consider the dual $C^{\prime}_{0}(X):=(C_{0}(X))^{\prime}$ of $C_{0}(X)$ . By the Riesz representation theorem (Folland, 1999, Theorem 7.17), $C^{\prime}_{0}(X)=M_{b}(X)$ in the sense that there is a bijective linear isometry $\mu\mapsto T_{\mu}$ from $M_{b}(X)$ onto $C^{\prime}_{0}(X)$ , given by the natural mapping,

Therefore, by Theorem 5, $\mathcal{H}$ is dense in $C_{0}(X)$ if and only if

( $\,\Leftarrow\,$ ) Suppose (12) is injective, i.e., for $\mu\in M_{b}(X)$ , $\int_{X}k(\cdot,x)\,d\mu(x)=0\Rightarrow\mu=0$ . Then by Lemma 26 (see Appendix A), we have

which by (15) means $\mathcal{H}$ is dense in $C_{0}(X)$ and therefore $k$ is $c_{0}$ -universal. ( $\,\Rightarrow\,$ ) We need to prove that if $\mathcal{H}$ is dense in $C_{0}(X)$ then $(\int_{X}k(\cdot,x)\,d\mu(x)=0\Rightarrow\mu=0)$ holds. This is equivalent to showing that if $(\int_{X}k(\cdot,x)\,d\mu(x)=0\Rightarrow\mu=0)$ does not hold, then $\mathcal{H}$ is not dense in $C_{0}(X)$ . Suppose $(\int_{X}k(\cdot,x)\,d\mu(x)=0\Rightarrow\mu=0)$ does not hold, i.e., $\exists\,0\neq\mu\in M_{b}(X)$ such that $\int_{X}k(\cdot,x)\,d\mu(x)=0$ , which means $\exists\,0\neq\mu\in M_{b}(X)$ such that $\int_{X}f\,d\mu=0$ for every $f\in\mathcal{H}$ , then, by (15), $\mathcal{H}$ is not dense in $C_{0}(X)$ . $(a)$ When $X$ is compact, $C_{0}(X)$ coincides with $C(X)$ , which means c-universality and $c_{0}$ -universality are equivalent. Therefore, $k$ is c-universal if and only if the embedding in (10) is injective. $(b)$ The proof is similar to that of $(a)$ except that we need to consider the dual of $C(X)$ endowed with the topology of compact convergence (a locally convex topological vector space) to characterize the denseness of $\mathcal{H}$ in $C(X)$ . It is known (Hewitt, 1950) that $C^{\prime}(X)=M_{bc}(X)$ in the sense that there is a bijective linear isometry $\mu\mapsto T_{\mu}$ from $M_{bc}(X)$ onto $C^{\prime}(X)$ , given by the natural mapping, $T_{\mu}(f)=\int_{X}f\,d\mu,\,\,f\in C(X)$ . The rest of the proof is verbatim with $M_{b}(X)$ replaced by $M_{bc}(X)$ . $(d)$ The proof is very similar to that of $(a)$ , wherein we identify $(C_{b}(X))^{\prime}\cong M_{rba}(X)$ such that $T\in(C_{b}(X))^{\prime}$ and $\mu\in M_{rba}(X)$ satisfy $T(f)=\int_{X}f\,d\mu,\,f\in C_{b}(X)$ (Dunford and Schwartz, 1958, p. 262). Here, $\cong$ represents the isometric isomorphism. The rest of the proof is verbatim with $M_{b}(X)$ replaced by $M_{rba}(X)$ . Theorem 6 can also be interpreted as: for appropriate assumptions on $X$ and $\mu$ , the embedding in (2) is injective if and only if the kernel is universal, therefore relating universality and injective RKHS embedding of finite signed Radon measures. In other words, Theorem 6 provides a novel measure embedding view point of universality compared to its well-known function approximation view point. Based on Theorem 6, the following remarks can be made.

(a) Theorem 6 provides a necessary and sufficient condition for c-universality — $k$ is c-universal if and only if the embedding in (10) is injective — while Steinwart (2001) provided only a sufficient condition (in terms of the feature maps being an algebra; see Steinwart and Christmann (2008, Theorem 4.56) for details) using the Stone-Weierstraß theorem. Therefore, Theorem 6 differs from and generalizes the result by Steinwart (2001). (b) Note that the embedding in (11) is injective if and only if for any compact set $Z\subset X$ , the embedding

Based on Theorem 6, the following result provides an alternate and equivalent characterization of universality or injectivity of the embedding in (2), which is easier to interpret, as it resembles the condition of $k$ being strictly pd (though not quite exactly the same). This alternate characterization is then used in Sections 3.2–3.4 to obtain easily checkable conditions for the universality of specific classes of kernels. We also show that strictly pd is a necessary condition for universality.

Suppose the assumptions in Theorem 6 hold. Then,

If $k$ is c-, cc- or $c_{0}$ -universal, then it is strictly pd.

Proof We only prove $(c)$ . The proof of $(b)$ is exactly the same as that of $(c)$ with $M_{b}(X)$ replaced by $M_{bc}(X)$ , while the proof of $(a)$ is trivial. $(c)$ ( $\,\Leftarrow\,$ ) Suppose $k$ is not $c_{0}$ -universal. By Theorem 6(c), there exists $0\neq\mu\in M_{b}(X)$ such that $\int_{X}k(\cdot,x)\,d\mu(x)=0$ , which implies $\|\int_{X}k(\cdot,x)\,d\mu(x)\|_{\mathcal{H}}=0$ . This means

Define $\mu:=\sum^{n}_{j=1}\alpha_{j}\delta_{x_{j}}$ , where $\delta_{x}$ represents the Dirac measures at $x$ . Clearly $\mu\neq 0$ and $\mu\in M_{bc}(X)$ . From (20), it is clear that $\int\!\!\int_{X}k(x,y)\,d\mu(x)\,d\mu(y)=0$ . Therefore, by Proposition 8(b), $k$ is not cc-universal. The result for $c_{0}$ -universality follows from Remark 7(c), while the result for c-universality is trivial. See Carmeli et al. (2009, Corollary 5), Steinwart and Christmann (2008, Proposition 4.54, Example 4.11) and Sriperumbudur et al. (2009b, Footnote 4).

A summary of results based on Theorem 6, Remarks 7, 9 and Proposition 8 is shown in Figures 1(a) and 1(b).

Although the conditions in (17)-(19) are easy to interpret, they are not always easy to check. To this end, in the remainder of this section, we present easily checkable characterizations for the following classes of kernels. These classes of kernels are both mathematically and practically interesting as many of the popular kernels used in machine learning, e.g., Gaussian, Laplacian, exponential, etc., fall in these classes (see Examples 1–3 for more examples).

These kernels are also called Schoenberg kernels (Wendland, 2005, Corollary 7.12, Theorem 7.13).Note that $k$ is a scale mixture of Gaussian kernels.

$X$ is an LCH space with bounded $k$ . Let $k(x,y)=\sum_{j\in I}\phi_{j}(x)\phi_{j}(y),\,\,(x,y)\in X\times X$ , where we assume the series converges uniformly on $X\times X$ . $\{\phi_{j}:j\in I\}$ is a set of continuous real-valued functions on $X$ where $I$ is a countable index set.

If $\emph{supp}(\psi)$ is compact, then $k$ is $c_{0}$ -universal.

If $(\emph{supp}(\Lambda))^{\circ}\neq\emptyset$ , then $k$ is cc-universal.

Gaussian, $\psi(x)=\exp\left(-\frac{\|x\|^{2}_{2}}{2\sigma^{2}}\right),\,\sigma>0$ with $\hat{\psi}(\omega)=\sigma^{d}\exp\left(-\frac{\sigma^{2}\|\omega\|^{2}_{2}}{2}\right)$ .

Laplacian, $\psi(x)=\exp\left(-\sigma\|x\|_{1}\right),\,\sigma>0$ with $\hat{\psi}(\omega)=\left(\frac{2}{\pi}\right)^{d/2}\prod^{d}_{j=1}\frac{\sigma}{\sigma^{2}+\omega^{2}_{j}}$ , where $\omega=(\omega_{1},\ldots,\omega_{d})$ .

$B_{1}$ -spline, $\psi(x)=\prod^{d}_{j=1}(1-|x_{j}|)\mathds{1}_{}(x_{j})$ with $\hat{\psi}(\omega)=\prod^{d}_{j=1}\frac{4}{\sqrt{2\pi}}\frac{\sin^{2}(\omega_{j}/2)}{\omega^{2}_{j}}$ , where $x=(x_{1},\ldots,x_{d})$ and $\omega=(\omega_{1},\ldots,\omega_{d})$ .

The following remarks can be made about Proposition 11.

A summary of results, based on Proposition 11 and Remark 12, for the case of kernels satisfying ( $A_{1}$ ), is shown in Figure 1(c).

where Fubini’s theorem is invoked in $(a)$ and

$\psi(x)=e^{\alpha\cos x}\cos(\alpha\sin x),\,0<\alpha\leq 1$ with $A_{\psi}(0)=1$ and $A_{\psi}(n)=\frac{\alpha^{|n|}}{2|n|!},\,\forall\,n\neq 0$ .

$\psi(x)=(\pi-(x)_{mod\,\,2\pi})^{2}$ with $A_{\psi}(0)=\frac{\pi^{2}}{3}$ and $A_{\psi}(n)=\frac{2}{n^{2}},\,\forall\,n\neq 0$ .

$c$ -universal kernels vs. Strictly pd kernels: We have shown in Proposition 8(d) that strictly pd is a necessary condition for $k$ to be c-, cc- or $c_{0}$ -universal. However, the converse is not true (see Remark 9(a)), which is based on Proposition 14 and the following result in Theorem 15. Before we state the result, we need some definitions.

A summary of results for kernels of the type ( $A_{2}$ ) is shown in Figure 1(b).

Suppose ( $A_{3}$ ) holds. Then the following conditions are equivalent.

Proof $(a)\Rightarrow(d)$ by Remark 7(c), $(d)\Rightarrow(c)$ by Proposition 8(d) and $(c)\Leftrightarrow(b)$ by Wendland (2005, Theorem 7.14). Now, we show $(b)\Rightarrow(a)$ .

Gaussian, $k(x,y)=e^{-\sigma\|x-y\|^{2}_{2}},\,\sigma>0$ . Note that $\nu=\delta_{\sigma}$ in (21), where $\delta_{\sigma}$ represents a Dirac measure at $\sigma$ . Clearly $\emph{supp}(\nu)=\{\sigma\}\neq\{0\}$ .

Inverse multiquadratic, $k(x,y)=(c^{2}+\|x-y\|^{2}_{2})^{-\beta},\,\beta>0,\,c>0$ , obtained by choosing $d\nu(t)=\frac{1}{\Gamma(\beta)}t^{\beta-1}e^{-c^{2}t}\,dt$ in (21). It is easy to verify that $\emph{supp}(\nu)\neq\{0\}$ .

A summary of results for kernels of the type ( $A_{3}$ ) is shown in Figure 1(d).

We now consider the characterization of c-, cc- and $c_{0}$ -universality for ( $A_{4}$ ).

$k$ is c-universal (resp. cc-universal) if and only if for any $0\neq\mu\in M_{b}(X)$ (resp. $0\neq\mu\in M_{bc}(X)$ ), there exists some $j\in I$ for which $\int_{X}\phi_{j}\,d\mu\neq 0$ .

Let $k(\cdot,x)\in C_{0}(X),\,\forall\,x\in X$ . Then $k$ is $c_{0}$ -universal if and only if for any $0\neq\mu\in M_{b}(X)$ , there exists some $j\in I$ for which $\int_{X}\phi_{j}\,d\mu\neq 0$ .

Proof We first prove $(b)$ . The proof for c-universality in $(a)$ is trivial as it follows from $(b)$ , while the proof for cc-universality in $(a)$ is exactly the same as that of $(b)$ with $M_{b}(X)$ replaced by $M_{bc}(X)$ . Let us consider

where we have invoked Fubini’s theorem in $(c)$ . $(b)$ ( $\,\Leftarrow\,$ ) Suppose for any $0\neq\mu\in M_{b}(X)$ , there exists some $j\in I$ for which $\int_{X}\phi_{j}\,d\mu\neq 0$ . Then, from (28), it is clear that $\int\!\!\int_{X}k(x,y)\,d\mu(x)\,d\mu(y)>0,\,\forall\,0\neq\mu\in M_{b}(X)$ and therefore $k$ is $c_{0}$ -universal, which follows from Proposition 8(c). ( $\,\Rightarrow\,$ ) Suppose there exists a non-zero measure, $\mu\in M_{b}(X)$ for which $\int_{X}\phi_{j}\,d\mu=0$ for any $j\in I$ . By (28), this means there exists a $0\neq\mu\in M_{b}(X)$ for which $\int\!\!\int_{X}k(x,y)\,d\mu(x)\,d\mu(y)=0$ , i.e., $k$ is not $c_{0}$ -universal (by Proposition 8(c)). The conditions in Proposition 17 are not always easy to check. However, for the case of Taylor kernels (Steinwart and Christmann, 2008, Lemma 4.8), which include the exponential kernel, simple, easy to check sufficient conditions can be obtained as shown in Corollary 18. Although this result is exactly the same as Corollary 4.57 in Steinwart and Christmann (2008), we present a different proof (we would like to remind the reader that our characterization of c-universality is different from the one provided by Steinwart (2001) and therefore the proof is different; see Remark 7(a)).

Proof From the proof of Lemma 4.8 in Steinwart and Christmann (2008), we have

To summarize, in this section, by showing the relation between various notions of universality and the injective RKHS embedding of finite signed Radon measures, we have presented a novel measure embedding point of view of universality compared to its well-known function approximation view point. Since the RKHS embedding of finite signed Radon measures generalizes the concept of RKHS embedding of Radon probability measures, the latter being related to characteristic kernels (Fukumizu et al., 2004, 2008; Sriperumbudur et al., 2008), in the following section, we relate the notion of universality to characteristic kernels.

Characteristic Kernels and Universality

Recent studies in machine learning have considered the mapping of random variables into a suitable RKHS and showed that this provides a powerful and straightforward method of dealing with higher-order statistics of the variables. Using their RKHS mappings, for sufficiently rich RKHSs, it becomes possible to test for homogeneity (Gretton et al., 2007), independence (Gretton et al., 2008), conditional independence (Fukumizu et al., 2008), to find the most predictive subspace in regression (Fukumizu et al., 2004), etc. Key to the above applications is the notion of a characteristic kernel — defined below — which gives rise to an RKHS that is sufficiently rich in the sense required above.

Since the embedding in (30) is a special case of the embedding in (2), and the injectivity of the embedding in (2) is related to universality (see Section 3), we now relate universal and characteristic kernels.

Gretton et al. (2007) have shown that a c-universal kernel is characteristic. Besides this result, not much is known or understood about the relation between characteristic and universal kernels. The following result not only provides the same result obtained by Gretton et al. (2007), but also generalizes it for non-compact $X$ .

Suppose the assumptions in Theorem 6 hold. If $k$ is $c$ -, $cc$ - or $c_{0}$ -universal, then it is characteristic to the set of probability measures contained in $M_{b}(X)$ , $M_{bc}(X)$ or $M_{b}(X)$ , respectively.

Proof The proof is trivial and follows from Theorem 6 and Definition 19. Now, one can ask when the converse to Proposition 20 is true. The following result answers this question for some special classes of kernels.

A summary of the relation between characteristic and universal kernels is shown in Figure 1. Characteristic kernels vs. Strictly pd kernels: In Section 3, we have shown the relation between universal kernels and strictly pd kernels, while in Propositions 20 and 21, we have related universal and characteristic kernels. We now investigate the relation between characteristic and strictly pd kernels.

If $k$ is characteristic, then it is conditionally strictly pd.

So far, we presented the relation between characteristic kernels and universal kernels and showed that for any LCH space, $X$ , the characteristic property is a weaker notion than $c_{0}$ -universality. Although such a weaker notion is sufficient to make the embedding in (30) injective, in the following section, we show that the stronger notion of $c_{0}$ -universality is required to study an important property of the “probability metric” associated with the embedding in (30).

on $M^{1}_{+}(X)$ , called the maximum mean discrepancy (MMD). Note that when $k$ is characteristic, $\gamma_{k}$ is a metric on $M^{1}_{+}(X)$ . One immediate question that naturally arises is “how is MMD related to other metrics on $M^{1}_{+}(X)$ , such as the Prohorov metric, Dudley metric, Wasserstein-Kantorovich metric, total variation metric, etc?” This is a question of both theoretical and practical importance.

Let $X$ be an LCH space and $k$ be $c_{0}$ -universal. Then, the topology induced by $\gamma_{k}$ coincides with the weak topology on $M^{1}_{+}(X)$ .

The following result in Sriperumbudur et al. (2009b, Theorem 23) can be obtained as a simple corollary to Proposition 24, wherein the question of metrization of weak topology by $\gamma_{k}$ is addressed only for compact Hausdorff $X$ . The general non-compact case was left as an open problem, which we addressed in Proposition 24.

Suppose $X$ is compact Hausdorff and $k$ is c-universal. Then, $\gamma_{k}$ metrizes the weak topology on $M^{1}_{+}(X)$ .

Conclusions & Discussion

The discussion in this paper has been related to the characterization of various notions of universality wherein the RKHS, $\mathcal{H}$ is dense in some subset of $C(X)$ (the space of real-valued continuous functions on $X$ ) w.r.t. the uniform norm (here, $X$ is a some arbitrary topological space). This means any target function, $f^{\star}$ in the appropriate subset of $C(X)$ can be approximated arbitrarily well by some $g\in\mathcal{H}$ w.r.t. the uniform norm. There is a notion of universality, which we have not considered, called $L_{p}$ -universality (Steinwart and Christmann, 2008, Chapter 5): a measurable and bounded kernel, $k$ defined on a Hausdorff space, $X$ , is said to be $L_{p}$ -universal if the RKHS, $\mathcal{H}$ induced by $k$ is dense in $L^{p}(X,\mu)$ w.r.t. the $p$ -norm, defined as $\|f\|_{p}:=(\int_{X}|f(x)|^{p}\,d\mu(x))^{1/p}$ , for all $\mu\in M^{1}_{+}(X)$ and some $p\in[1,\infty)$ . Here $L^{p}(X,\mu)$ is the Banach space of $p$ -integrable $\mu$ -measurable functions on $X$ . This notion of universality is more applicable in learning theory, where the target function, $f^{\star}$ is usually assumed to lie in $L^{p}(X,\mu)$ for some $p\in[1,\infty)$ and for some Borel probability measure, $\mu$ . By considering this notion of universality, any $f^{\star}\in L^{p}(X,\mu)$ can be approximated arbitrarily well by some $g\in\mathcal{H}$ w.r.t. the p-norm for all Borel probability measures $\mu$ and some $p\in[1,\infty)$ . In particular, Steinwart and Christmann (2008, Theorems 5.31, 5.36 and Corollary 5.37) have shown that $L_{p}$ -universality is necessary and sufficient to achieve consistency in kernel-based learning algorithms. In this paper, we did not consider this notion of universality because unlike the other notions of universality, it is not straightforward to relate $L_{p}$ -universality and the RKHS embedding of measures by using the Hahn-Banach theorem (see Theorem 5). However, recently, Carmeli et al. (2009, Theorem 1) have shown that $k$ is $L_{p}$ -universal if and only if it is $c_{0}$ -universal, which therefore establishes the relation between $L_{p}$ -universality and the RKHS embedding of measures. Using this result, $L_{p}$ -universality can be related to all other notions considered in this paper, through Figure 1.

B. K. S. and G. R. G. L. wish to acknowledge support from the Institute of Statistical Mathematics (ISM), Tokyo, the National Science Foundation (grant DMS-MSPA 0625409), the Fair Isaac Corporation and the University of California MICRO program. Part of this work was done while B. K. S. was visiting ISM. K.F. was supported by JSPS KAKENHI 19500249.

Appendix A. Supplementary Results

For completeness, we present the following supplementary result, which is a simple generalization of the technique used in the proof of Sriperumbudur et al. (2008, Theorem 3).

Let $k$ be a measurable and bounded kernel on a measurable space, $X$ and let $\mathcal{H}$ be its associated RKHS. Then, for any $f\in\mathcal{H}$ and for any finite signed Borel measure, $\mu$ ,

Therefore, $T_{\mu}$ is a bounded linear functional on $\mathcal{H}$ . By the Riesz representation theorem (Folland, 1999, Theorem 5.25), there exists a unique $\lambda_{\mu}\in\mathcal{H}$ such that $T_{\mu}[f]=\langle f,\lambda_{\mu}\rangle_{\mathcal{H}}$ for all $f\in\mathcal{H}$ . Set $f=k(\cdot,u)$ for some $u\in X$ , which implies $\lambda_{\mu}=\int_{X}k(\cdot,x)\,d\mu(x)$ and the result follows.