Kernel Mean Embedding of Distributions: A Review and Beyond

Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Bernhard Schölkopf

Chapter 1 Introduction

This work aims to provide a comprehensive review of kernel mean embeddings of distributions and, in the course of doing so, discusses some challenging issues that could potentially lead to new research directions. To the best of our knowledge, there is no comparable review in this area so far; however, the short review paper of Song et al. (2013) on Hilbert space embedding of conditional distributions and its applications in nonparametric inference in graphical models may be of interest to some readers.

The kernel mean embedding owes its success to a positive definite function commonly known as the kernel function. The kernel function has become popular in the machine learning community for more than 20 years. Initially, it arises as an effortless way to perform an inner product $\langle\mathbf{x},\mathbf{y}\rangle$ in a high-dimensional feature space $\mathscr{H}$ for some data points $\mathbf{x},\mathbf{y}\in\mathcal{X}$ . The positive definiteness of the kernel function guarantees the existence of a dot product space $\mathscr{H}$ and a mapping $\phi:\mathcal{X}\rightarrow\mathscr{H}$ such that $k(\mathbf{x},\mathbf{y})=\langle\phi(\mathbf{x}),\phi(\mathbf{y})\rangle_{\mathscr{H}}$ (Aronszajn 1950) without needing to compute $\phi$ explicitly (Boser et al. 1992, Cortes and Vapnik 1995, Vapnik 2000, Schölkopf and Smola 2002). The kernel function can be applied to any learning algorithm as long as the latter can be expressed entirely in terms of a dot product $\langle\mathbf{x},\mathbf{y}\rangle$ (Schölkopf et al. 1998). This trick is commonly known as the kernel trick (see Section 2 for a more detailed account). Many kernel functions have been proposed for various kinds of data structures including non-vectorial data such as graphs, text documents, semi-groups, and probability distributions (Schölkopf and Smola 2002, Gärtner 2003). Many well-known learning algorithms have already been kernelized and have proven successful in scientific disciplines such as bioinformatics, natural language processing, computer vision, robotics, and causal inference.

Secondly, several elementary operations on distributions (and associated random variables) can be performed directly by means of this representation. For example, by the reproducing property of $\mathscr{H}$ , we have

As a result of the aforementioned advantages, the kernel mean embedding has made widespread contributions in various directions. Firstly, most tasks in machine learning and statistics involve estimation of the data-generating process whose success depends critically on the accuracy and the reliability of this estimation. It is known that estimating the kernel mean embedding is easier than estimating the distribution itself, which helps improve many statistical inference methods. These include, for example, two-sample testing (Gretton et al. 2012a), independence and conditional independence tests (Fukumizu et al. 2008, Zhang et al. 2011, Doran et al. 2014), causal inference (Sgouritsa et al. 2013, Chen et al. 2014), adaptive MCMC (Sejdinovic et al. 2014), and approximate Bayesian computation (Fukumizu et al. 2013).

Secondly, several attempts have been made in using kernel mean embedding as a representation in the predictive learning on distributions (Muandet et al. 2012, Szabó et al. 2016, Muandet and Schölkopf 2013, Guevara et al. 2015, Lopez-Paz et al. 2015). As opposed to the classical setting where training and test examples are data points, many applications call for a learning framework in which training and test examples are probability distributions. This is ubiquitous in, for example, multiple-instance learning (Doran 2013), learning with noisy and uncertain input, learning from missing data, group anomaly detection (Muandet and Schölkopf 2013, Guevara et al. 2015), dataset squishing, and bag-of-words data (Yoshikawa et al. 2014; 2015). The kernel mean representation equipped with the RKHS methods enables classification, regression, and anomaly detection to be performed on such distributions.

Finally, the kernel mean embedding also allows one to perform complex approximate inference without making strong parametric assumption on the form of underlying distribution. The idea is to represent all relevant probabilistic quantities as a kernel mean embedding. Then, basic operations such as sum rule and product rule can be formulated in terms of the expectation and inner product in feature space. Examples of algorithms in this class include kernel belief propagation (KBP), kernel embedding of latent tree model, kernel Bayes rule, and predictive-state representation (Song et al. 2010b; 2009; 2011a; 2013, Fukumizu et al. 2013). Recently, the kernel mean representation has become one of the prominent tools in causal inference and discovery (Lopez-Paz et al. 2015, Sgouritsa et al. 2013, Chen et al. 2014, Schölkopf et al. 2015).

The aforementioned examples represent only a handful of successful applications of kernel mean embedding. More examples and details will be provided throughout the survey.

1 Purpose and Scope

The purpose of this survey is to give a comprehensive review of kernel mean embedding of distributions, to present important theoretical results and practical applications, and to draw connections to related areas. We restrict the scope of this survey to key theoretical results and new applications of kernel mean embedding with references to related work. We focus primarily on basic intuition and sketches for proofs, leaving the full proofs to the papers cited.

All materials presented in this paper should be accessible to a wide audience. In particular, we hope that this survey will be most useful to readers who are not at all familiar with the idea of kernel mean embedding, but already have some background knowledge in machine learning. To ease the reading, we suggest non-expert readers to also consult elementary machine learning textbooks such as Bishop (2006), Schölkopf and Smola (2002), Mohri et al. (2012), and Murphy (2012). Experienced machine learners who are interested in applying the idea of kernel mean embedding to their work are also encouraged to read this survey. Lastly, we will also provide some practical considerations that could be useful to practitioners who are interested in implementing the idea in real-world applications.

2 Outline of the Survey

The schematic outline of this survey is depicted in Figure 1.3 and can be summarized as follows.

introduces notation and the basic idea of a positive definite kernel and reproducing kernel Hilbert space (RKHS) (§2.1 and §2.2). It also presents general theoretical results such as the reproducing property (Prop 2.2.4), the Riesz representation theorem (Thm 2.2.3), Mercer’s theorem (Thm 2.1.2), Bochner’s theorem (Thm 2.1.3), and Schoenberg’s characterization (Thm 2.1.5). In addition, it contains a brief discussion about Hilbert-Schmidt operators on RKHS (§2.3).

conveys the idea of Hilbert space embedding of marginal distributions (§3.1) as well as covariance operators (§3.2), presents essential properties of mean embedding (§3.3), discusses its estimation and approximation procedures (§3.4), and reviews important applications, notably maximum mean discrepancy (MMD) (§3.5), kernel dependence measure (§3.6), learning on distributional data (§3.7), and how to recover information from the embedding of distributions (§3.8).

generalizes the idea of kernel mean embedding to the space of conditional distributions, called conditional mean embedding (§4.1), presents regression perspective (§4.2), and describes basic operations—namely sum rule, product rule, and Bayes’ rule—in terms of marginal and conditional mean embeddings (§4.3). We review applications in graphical models, probabilistic inference (§4.4), reinforcement learning (§4.6), conditional dependence measures (§4.7), and causal discovery (§4.8). Estimating the conditional mean embedding is challenging both theoretically and empirically. We discuss some of the key challenges as well as some applications.

draws connections between the kernel mean embedding framework and other methods including kernel density estimation, empirical characteristic function, divergence methods and probabilistic modeling.

provides suggestions for future research.

3 Notation

Table LABEL:tab:notations summarizes a collection of the commonly used notation and symbols used throughout the survey.

Chapter 2 Background

This section introduces the kernel methods and the concept of reproducing kernel Hilbert space (RKHS) which form the backbone of the survey. Readers who are familiar with probability theory and the concept of RKHS may skip this section entirely or return to it later. More detailed accounts on this topic can be found, in Schölkopf and Smola (2002), Berlinet and Thomas-Agnan (2004), and Hofmann et al. (2008), for example. Additional materials can also be found in Gretton (2016). Readers who are interested particularly in the typical applications of kernels in machine learning, e.g., support vector machines (SVMs) and Gaussian processes (GPs), are encouraged to read Cucker and Smale (2002), Burges (1998), Rasmussen and Williams (2005) and references therein.

Many classical learning algorithms—such as the perceptron (Rosenblatt 1958), support vector machine (SVM) (Cortes and Vapnik 1995), and principal component analysis (PCA) (Pearson 1901, Hotelling 1933)—employ data instances, e.g., $\mathbf{x},\mathbf{x}^{\prime}\in\mathcal{X}$ , only through an inner product $\langle\mathbf{x},\mathbf{x}^{\prime}\rangle$ , which basically is a similarity measure between $\mathbf{x}$ and $\mathbf{x}^{\prime}$ . However, the class of linear functions induced by this inner product may be too restrictive for many real-world problems. Kernel methods aim to build more flexible and powerful learning algorithms by replacing $\langle\mathbf{x},\mathbf{x}^{\prime}\rangle$ with some other, possibly non-linear, similarity measures.

The most natural extension of $\langle\mathbf{x},\mathbf{x}^{\prime}\rangle$ is to explicitly apply a non-linear transformation:

into a possibly high-dimensional feature space $\mathcal{F}$ and subsequently evaluate the inner product in $\mathcal{F}$ , i.e.,

where $\langle\cdot,\cdot\rangle_{\mathcal{F}}$ denotes the inner product of $\mathcal{F}$ . We will refer to $\phi$ and $k$ as a feature map and a kernel function, respectively. Likewise, we can interpret $k(\mathbf{x},\mathbf{x}^{\prime})$ as a non-linear similarity measure between $\mathbf{x}$ and $\mathbf{x}^{\prime}$ . Since most algorithms depend on the data set only through the inner product $\langle\mathbf{x},\mathbf{x}^{\prime}\rangle$ , we can obtain a non-linear extension of these algorithms by simply substituting $\langle\mathbf{x},\mathbf{x}^{\prime}\rangle$ with $\langle\phi(\mathbf{x}),\phi(\mathbf{x}^{\prime})\rangle_{\mathcal{F}}$ . Note that the learning algorithm remains the same: we only change the space in which these algorithms operate. As (2.1) is non-linear, a linear algorithm in $\mathcal{F}$ corresponds to the non-linear counterpart in the original space $\mathcal{X}$ .

Unfortunately, evaluating $k(\mathbf{x},\mathbf{x}^{\prime})$ as above requires a two-step procedure: i) construct the feature maps $\phi(\mathbf{x})$ and $\phi(\mathbf{x}^{\prime})$ explicitly, and ii) then evaluate $\langle\phi(\mathbf{x}),\phi(\mathbf{x}^{\prime})\rangle_{\mathcal{F}}$ . These two steps can be computationally expensive if $\phi(\mathbf{x})$ lives in a high-dimensional feature space, e.g., when the degree $d$ of the polynomial is large. Fortunately, (2.3) implies that there is an alternative way to evaluate $\langle\phi(\mathbf{x}),\phi(\mathbf{x}^{\prime})\rangle_{\mathcal{F}}$ without resorting to constructing $\phi(\mathbf{x})$ explicitly if all we need is an inner product $\langle\phi(\mathbf{x}),\phi(\mathbf{x}^{\prime})\rangle_{\mathcal{F}}$ . That is, it is sufficient to consider $k(\mathbf{x},\mathbf{x}^{\prime})=\langle\mathbf{x},\mathbf{x}^{\prime}\rangle^{2}$ directly. This is an essential aspect of kernel methods, often referred to as the kernel trick in the machine learning community.

When is such trick possible? It turns out that if $k$ is positive definite (cf. Definition 2.1.1) there always exists $\phi:\mathcal{X}\rightarrow\mathcal{F}$ for which $k(\mathbf{x},\mathbf{x}^{\prime})=\langle\phi(\mathbf{x}),\phi(\mathbf{x}^{\prime})\rangle_{\mathcal{F}}$ . Since $\langle\cdot,\cdot\rangle$ is positive definite, it follows that $k$ defined as in (2.2) is positive definite for any choice of explicit feature map $\phi$ .

If $f(\mathbf{x}^{\prime})=k(\mathbf{x}^{\prime},\cdot)$ for some $\mathbf{x}^{\prime}\in\mathcal{X}$ , we obtain the reproducing property, i.e., $\langle k(\mathbf{x},\cdot),k(\mathbf{x}^{\prime},\cdot)\rangle_{\mathscr{H}}=k(\mathbf{x},\mathbf{x}^{\prime})$ , as a special case. Although we do not need to know $\phi(\mathbf{x})=k(\mathbf{x},\cdot)$ explicitly, it can be derived directly from the kernel $k$ (see, e.g., Schölkopf and Smola 2002 for concrete examples). Further details of RKHS will be given in Section 2.2.

The capability of the kernel trick not only results in powerful learning algorithms, but also allows domain experts to easily invent domain-specific kernel functions which are suitable for particular applications. This leads to a number of kernel functions in various application domains (Genton 2002). In machine learning, commonly used kernels include the Gaussian and Laplace kernels, i.e.,

Another essential property of the kernel trick is that it can be applied not only to Euclidean data, but also to non-Euclidean structured data, functional data, and other domains on which positive definite kernels may be defined (Schölkopf and Smola 2002, Gärtner 2003). A review of several classes of kernel functions can be found in Genton (2002). A review of kernels for vector-valued functions can be found in Álvarez et al. (2012). Hofmann et al. (2008) also provides a general review of kernel methods in machine learning.

Next, we give an important characterization of a continuous positive definite kernel $k$ on a compact set, known as Mercer’s theorem (Mercer 1909). For the proof, see, e.g., Cucker and Zhou (2007; pp. 59). {shadowbox}

Let $\mathcal{X}$ be a compact Hausdorff space and $\mu$ be a finite Borel measure with support $\mathcal{X}$ .see, e.g., Kreyszig (1978) and Rudin (1991) for the details of a compact Hausdorff space and Borel measures. Suppose $k$ is a continuous positive definite kernel on $\mathcal{X}$ , and define the integral operator $\mathcal{T}_{k}:L_{2}(\mathcal{X},\mu)\rightarrow L_{2}(\mathcal{X},\mu)$ by

which is positive definite, i.e., $\forall f\in L_{2}(\mathcal{X},\mu)$ ,

Then, there is an orthonormal basis $\{\psi_{i}\}$ of $L_{2}(\mathcal{X},\mu)$ consisting of eigenfunctions of $\mathcal{T}_{k}$ such that the corresponding sequence of eigenvalues $\{\lambda_{i}\}$ are non-negative. The eigenfunctions corresponding to non-zero eigenvalues can be taken as continuous functions on $\mathcal{X}$ and $k(\mathbf{u},\mathbf{v})$ has the representation

where the convergence is absolute and uniform.

The condition (2.8) is known as the Mercer’s condition and the kernel functions that satisfy this condition are often referred to as Mercer kernels. An important consequence of Mercer’s theorem is that it is possible to derive the explicit form of the feature map $\phi(\mathbf{x})$ from (2.9), i.e., $\phi(\mathbf{x})=[\sqrt{\lambda_{1}}\psi_{1}(\mathbf{x}),\sqrt{\lambda_{2}}\psi_{2}(\mathbf{x}),\ldots]$ . Steinwart and Scovel (2012) studied Mercer’s theorem in general domains by relaxing the compactness assumption on $\mathcal{X}$ . Moreover, there is an intrinsic connection between the integral operator $\mathcal{T}_{k}$ , covariance operator (§3.2), and Gram matrix $\mathbf{K}$ , see, e.g., Rosasco et al. (2010) for further details.

By virtue of Theorem 2.1.3, we may interpret the similarity measure $k(\mathbf{x},\mathbf{x}^{\prime})=\psi(\mathbf{x}-\mathbf{x}^{\prime})$ in the Fourier domain. That is, the measure $\Lambda$ determines which frequency component occurs in the kernel by putting a non-negative power on each frequency $\bm{\omega}$ . Note that we may normalize $k$ such that $\psi(\mathbf{0})=1$ , in which case $\Lambda$ will be a probability measure and $k$ corresponds to its characteristic function. For example, the measure $\Lambda$ that corresponds to the Gaussian kernel $k(\mathbf{x}-\mathbf{x}^{\prime})=e^{-\|\mathbf{x}-\mathbf{x}^{\prime}\|^{2}_{2}/(2\sigma^{2})}$ is a Gaussian distribution of the form $(2\pi/\sigma^{2})^{-d/2}e^{-\sigma^{2}\|\bm{\omega}\|^{2}_{2}/2}$ . For the Laplacian kernel $k(\mathbf{x},\mathbf{x}^{\prime})=e^{-\|\mathbf{x}-\mathbf{x}^{\prime}\|_{1}/\sigma}$ , the corresponding measure is a Cauchy distribution, i.e., $\Lambda(\bm{\omega})=\prod_{i=1}^{d}\frac{\sigma}{\pi(1+\omega_{i}^{2})}$ . Table 2.1 summarizes the Fourier transform of some well-known kernel functions.

Another promising application of Bochner’s theorem is the approximation of the kernel function, which is useful in speeding up kernel methods. The feature map $\phi$ of many kernel functions such as the Gaussian kernel is infinite dimensional. To avoid the need to construct the Gram matrix $\mathbf{K}_{ij}=k(\mathbf{x}_{i},\mathbf{x}_{j})$ —which scales at least $O(n^{2})$ in terms of both computation and memory usage—Rahimi and Recht (2007) proposes to approximate the integral in (2.10) based on a Monte Carlo sample $\bm{\omega}\sim\Lambda$ . The resulting approximation is known as random Fourier features (RFFs). See also Kar and Karnick (2012), Vedaldi and Zisserman (2012), Le et al. (2013), Pham and Pagh (2013) and references therein for a generalization of this idea. Theoretical results on the preservation of kernel evaluation can be found in, e.g., Sutherland and Schneider (2015) and Sriperumbudur and Szabó (2015). Other approaches to approximating the Gram matrix $\mathbf{K}$ are low-rank approximation via incomplete Cholesky decomposition (Fine and Scheinberg 2001) and the Nyström method (Williams and Seeger 2001, Bach 2013).

Finally, we briefly mention Schoenberg’s characterization (Schoenberg 1938) for a class of radial kernels.

That is, a radial kernel is a translation-invariant kernel that depends only on the distance between two instances. This class of kernel functions is characterized by Schoenberg’s theorem.

where $\mu$ is a finite non-negative Borel measure on $[0,\infty)$ .

In other words, this class of kernels can be expressed as a scaled Gaussian mixture. Examples of well-known radial kernels include the Gaussian RBF kernel $k(\mathbf{x},\mathbf{y})=\exp(-\|\mathbf{x}-\mathbf{y}\|^{2}/2\sigma^{2})$ , the mixture-of-Gaussians kernel $k(\mathbf{x},\mathbf{y})=\sum_{j=1}^{K}\beta_{j}\exp(-\|\mathbf{x}-\mathbf{y}\|^{2}/2\sigma^{2}_{j})$ where $\bm{\beta}\geq\mathbf{0}$ and $\sum_{j=1}^{K}\beta_{j}=1$ , the inverse multiquadratic kernel $k(\mathbf{x},\mathbf{y})=(c^{2}+\|\mathbf{x}-\mathbf{y}\|^{2})^{-\gamma}$ where $c,\gamma>0$ , and the Matérn kernel

where $\sigma>0$ , $\nu>0$ , $K_{\nu}$ is the modified Bessel function of the second kind of order $\nu$ , and $\Gamma$ is the Gamma function. Pennington et al. (2015) apply Schoenberg’s theorem to build spherical random Fourier (SRF) features of polynomial kernels, which are generally unbounded, for data on the unit sphere. For a detailed exposition on radial kernels and Schoenberg’s theorem, see, e.g., Bavaud (2011), Wendland (2005; Ch. 7), and Rasmussen and Williams (2005).

2 Reproducing Kernel Hilbert Spaces

A reproducing kernel Hilbert space (RKHS) $\mathscr{H}$ is a Hilbert space where all evaluation functionals in $\mathscr{H}$ are bounded. First, we give a formal definition of a Hilbert space (Kreyszig 1978; Chapter 3).

We are now in a position to give the definition of a reproducing kernel Hilbert space. {shadowbox}

A Hilbert space $\mathscr{H}$ of functions is a reproducing kernel Hilbert space (RKHS) if the evaluation functionals $\mathcal{F}_{\mathbf{x}}[f]$ defined as $\mathcal{F}_{\mathbf{x}}[f]=f(\mathbf{x})$ are bounded, i.e., for all $\mathbf{x}\in\mathcal{X}$ there exists some $C>0$ such that

Intuitively speaking, functions in the RKHS are smooth in the sense of (2.14). This smoothness property ensures that the solution in the RKHS obtained from learning algorithms will be well-behaved, since regularization on the RKHS norm leads to regularization on function values. For example, in classification and regression problems, it is ensured that by minimizing the empirical risk on the training data w.r.t. the functions in the RKHS, we obtain a solution $\hat{f}$ that is close to the true solution $f$ and also generalizes well to unseen test data. This does not necessarily hold for functions in arbitrary Hilbert spaces. The space of square-integrable functions $L_{2}[a,b]$ does not have this property. It is very easy to find a function in $L_{2}[a,b]$ that attains zero risk on the training data, but performs poorly on unseen data, i.e., overfitting.

The next theorem known as Riesz representation theorem provides a characterization of a bounded linear operator in $\mathscr{H}$ (see, e.g., Akhiezer and Glazman 1993 for the proof). {shadowbox}

The Riesz representation theorem will be used to prove a sufficient condition for the existence of the kernel mean embedding in an RKHS (see Lemma 3.1.2). By the definition of a RKHS, the evaluation functional $\mathcal{F}_{\mathbf{x}}[f]=f(\mathbf{x})$ is a bounded linear operator in $\mathscr{H}$ . Therefore, the Riesz representation theorem ensures that for any $\mathbf{x}\in\mathcal{X}$ we can find an element in $\mathscr{H}$ that is a representer of the evaluation $f(\mathbf{x})$ . Proposition 2.2.4 states this result, which is called the point evaluation property.

For each $\mathbf{x}\in\mathcal{X}$ , there exists a function $k_{\mathbf{x}}\in\mathscr{H}$ such that

where $\phi(\mathbf{x}):=k_{\mathbf{x}}$ is the feature map $\phi:\mathcal{X}\rightarrow\mathcal{\mathscr{H}}$ . As mentioned earlier, we call $\phi$ a canonical feature map associated with $\mathscr{H}$ essentially because when we apply the function $k(\mathbf{x},\mathbf{y})$ in the learning algorithms, the data points are implicitly represented by a function $k_{\mathbf{x}}$ in the feature space. As we will see in Section 3, the kernel mean embedding is defined by means of $k_{\mathbf{x}}$ and can itself be viewed as a canonical feature map of the probability distribution.

The RKHS $\mathscr{H}$ is fully characterized by the reproducing kernel $k$ . In fact, the RKHS uniquely determines $k$ , and vice versa, as stated in the following theorem which is due to Aronszajn (1950): {shadowbox}

For every positive definite function $k(\cdot,\cdot)$ on $\mathcal{X}\times\mathcal{X}$ there exists a unique (up to isometric isomorphism) RKHS with $k$ as its reproducing kernel. Conversely, the reproducing kernel of an RKHS is unique and positive definite.

3 Hilbert-Schmidt Operators

We conclude this part by describing the notion of the Hilbert-Schmidt operator which is ubiquitous in many applications of RKHS. Before proceeding to the detail, we first define a separable Hilbert space. {shadowbox}

A Hilbert space $\mathscr{H}$ is said to be separable if it has a countable basis.

Let $\mathscr{H}$ and $\mathscr{F}$ be separable Hilbert spaces and $(h_{i})_{i\in I}$ and $(f_{j})_{j\in J}$ are orthonormal basis for $\mathscr{H}$ and $\mathscr{F}$ , respectively. A Hilbert-Schmidt operator is a bounded operator $\mathcal{A}:\mathscr{F}\rightarrow\mathscr{H}$ whose Hilbert-Schmidt norm

Recently, the Hilbert-Schmidt operators have received much attention in the machine learning community and form a backbone of many modern applications of kernel methods. For instance, Gretton et al. (2005a) uses a Hilbert-Schmidt norm of the cross-covariance operator as a measure of statistical dependence between two random variables; see also Chwialkowski and Gretton (2014) and Chwialkowski et al. (2014) for an extension to random processes. Likewise, these notions have also been applied in sufficient dimension reduction (Fukumizu et al. 2004), kernel CCA (Fukumizu et al. 2007), kernel PCA (Zwald et al. 2004), and kernel Bayes’ rule (Fukumizu et al. 2013). Quang et al. (2014) proposes a Log-Hilbert-Schmidt metric between positive definite operators on a Hilbert space. It is applied in particular to compute the distance between covariance operators in a RKHS with applications in multi-category image classification.

Chapter 3 Hilbert Space Embedding of Marginal Distributions

This section introduces the idea of Hilbert-space embedding of distributions by generalizing the standard viewpoint of the kernel feature map of random sample to Dirac measures (see (3.2)), and then to more general probability measures. The presentation here is similar to that of Berlinet and Thomas-Agnan (2004; Chapter 4) with some modifications. We summarize this generalization in Figure 3.1.

In other words, $k(\mathbf{x},\cdot)$ is a high-dimensional representer of $\mathbf{x}$ . Moreover, by the reproducing property it also acts as a representer of evaluation of any function in $\mathscr{H}$ on the data point $\mathbf{x}$ . These properties of kernels are imperative in practical applications because if an algorithm can be formulated in terms of an inner product $\langle\mathbf{x},\mathbf{y}\rangle$ , one can construct an alternative algorithm by replacing the inner product by a positive definite kernel $k(\mathbf{x},\mathbf{y})$ without building the mapping of $\mathbf{x}$ and $\mathbf{y}$ explicitly (a.k.a. the kernel trick). Well-known examples of kernelizable learning algorithms include the support vector machine (SVM) (Cortes and Vapnik 1995) and principle component analysis (PCA) (Hotelling 1933).

We can generalize the concept of a high-dimensional feature map of data points $\mathbf{x}\in\mathcal{X}$ to measures on a measurable space $(\mathcal{X},\Sigma)$ where $\Sigma$ is a $\sigma$ -algebra of subsets of $\mathcal{X}$ . The simplest example of a measure is the Dirac measure $\delta_{\mathbf{x}}$ defined for $\mathbf{x}$ in $\mathcal{X}$ by

where $A\in\Sigma$ . Since any measurable function $f$ on $\mathcal{X}$ is integrable w.r.t. $\delta_{\mathbf{x}}$ , we have

When $f$ belongs to the Hilbert space $\mathscr{H}$ of functions on $\mathcal{X}$ with reproducing kernel $k$ , we can rewrite (3.3) using the reproducing property of $\mathscr{H}$ as

namely, the expectation of $f$ w.r.t. the Dirac measure $\delta_{\mathbf{x}}$ . Although integrating $f$ w.r.t. $\delta_{\mathbf{x}}$ or evaluating $\langle f,k(\mathbf{x},\cdot)\rangle_{\mathscr{H}}$ gives the same result $f(\mathbf{x})$ , i.e., the value of $f$ at point $\mathbf{x}$ , the former gives a measure-theoretic point of view of the latter (see also Figure 3.1). Consequently, we can define a feature map from the space of Dirac measures to $\mathscr{H}$ as

More generally, if $\mathbf{x}_{1},\ldots,\mathbf{x}_{n}$ are $n$ distinct points in $\mathcal{X}$ and $a_{1},\ldots,a_{n}$ are $n$ non-zero real numbers, we consider a linear combination

of Dirac measures putting the mass $a_{i}$ at the point $\mathbf{x}_{i}$ . This is a signed measure which constitutes a class of measures with finite support. A measure of the form (3.7) is ubiquitous in the machine learning community, especially in Bayesian probabilistic inference (Adams 2009). For example, if $a_{i}=1/n$ for all $i$ , we obtain an empirical measure associated with a sample $\mathbf{x}_{1},\ldots,\mathbf{x}_{n}$ . A Donsker measure is obtained when $a_{i}$ is also a random variable (Berlinet and Thomas-Agnan 2004). Lastly, if $a_{i}=1$ for all $i$ , the measure of the form (3.7) represents an instance of a point process on $\mathcal{X}$ which has numerous applications in Bayesian nonparametric inference and neural coding (Dayan and Abbott 2005). For example, a determinantal point process (DPP)—a point process with a repulsive property—has recently gained popularity in the machine learning community (Kulesza and Taskar 2012).

Likewise, for any measurable function $f$ we have

This extends our previous remark on Dirac measures to measures with finite support, and if $f$ belongs to $\mathscr{H}$ , we obtain similar results to the case of Dirac measure. That is, the mapping

gives a representer in $\mathscr{H}$ of a measure with finite support. Furthermore, it is a representer of expectation w.r.t. the measure, i.e., if $\mu:=\sum_{i=1}^{n}a_{i}\delta_{\mathbf{x}_{i}}$ , we have for any $f$ in $\mathscr{H}$

In particular, for any Hilbert space $\mathscr{H}$ of functions on $\mathcal{X}$ with reproducing kernel $k$ , a linear combination $\sum_{i=1}^{n}a_{i}k(\mathbf{x}_{i},\cdot)$ forms a dense subset of $\mathscr{H}$ . Here some readers may have concerns regarding the measurability of $f$ . This is easily seen, however, since it is known that point-wise convergence of measurable functions gives a measurable function.

where $\mathbf{x}^{(i)}:=\bigotimes_{k=1}^{i}\mathbf{x}$ denotes the $i$ th-order tensor product (Schölkopf and Smola 2002; Proposition 2.1), the kernel mean embedding can be written explicitly as

Consider $k(\mathbf{x},\mathbf{x}^{\prime})=\exp(\langle\mathbf{x},\mathbf{x}^{\prime}\rangle)$ . Hence, we can write the kernel mean embedding as

1.2 Empirical Estimation of Mean Embeddings

2 Covariance Operators

In addition to the mean element, the covariance and cross-covariance operators on RKHSs are important concepts for modern applications of Hilbert space embedding of distributions. In principle, they are generalizations of covariance and cross-covariance matrices in Euclidean space to the infinite-dimensional elements in RKHSs. We give a brief review here; see Baker (1973), Fukumizu et al. (2004), Gretton (2016) for further details.

Cross-covariance operators were introduced in Baker (1970) and then treated more extensively in Baker (1973). Let $(X,Y)$ be a random variable taking values on $\mathcal{X}\times\mathcal{Y}$ and $(\mathscr{H},k)$ and $(\mathscr{G},l)$ be RKHSs with measurable kernels on $\mathcal{X}$ and $\mathcal{Y}$ , respectively. Throughout we assume

In the above definition, one needs to note that an operator can be identified with an element in the product space. More precisely, the space of Hilbert-Schmidt operators ${\rm HS}(\mathscr{H},\mathscr{G})$ (Section 2.3) forms Hilbert spaces which are isomorphic to the product space $\mathscr{H}\otimes\mathscr{G}$ given by the product kernel. The isomorphism is defined by

Note that, given an orthonormal basis $\{\phi_{a}\}_{a}$ of $\mathscr{H}$ and $\{\varphi_{b}\}_{b}$ of $\mathscr{G}$ , we have $\|\sum_{i}\langle\cdot,f_{i}\rangle_{\mathscr{H}}g_{i}\|_{{\rm HS}}^{2}=\sum_{a}\sum_{b}\{\sum_{i}\langle\phi_{a},f_{i}\rangle_{\mathscr{H}}\langle\varphi_{b},g_{i}\rangle_{\mathscr{G}}\}^{2}=\sum_{a}\sum_{b}\langle\sum_{i}f_{i}\otimes g_{i},\phi_{a}\otimes\varphi_{b}\rangle_{\mathscr{H}\otimes\mathscr{G}}^{2}=\|\sum_{i}f_{i}\otimes g_{i}\|_{\mathscr{H}\otimes\mathscr{G}}^{2}$ , where the last equality is based on the fact that $\{\phi_{a}\otimes\varphi_{b}\}_{a,b}$ is an orthonormal basis of $\mathscr{H}\otimes\mathscr{G}$ . This implies the above map is an isometry.

Alternatively, we may define an operator $\mathcal{C}_{\mathit{YX}}$ as a unique bounded operator that satisfies

If $X=Y$ , we call $\mathcal{C}_{\mathit{XX}}$ the covariance operator, which is positive and self-adjoint, i.e., for all $f,h\in\mathscr{H}$ , $\langle\mathcal{C}_{\mathit{XX}}f,h\rangle_{\mathscr{H}}=\langle f,\mathcal{C}_{\mathit{XX}}h\rangle_{\mathscr{H}}$ .

For any $f\in\mathscr{H}$ , we can also write the integral expressions for $\mathcal{C}_{\mathit{YX}}$ and $\mathcal{C}_{\mathit{XX}}$ , respectively, as

Below we outline a basic result that will become crucial for the definition of conditional mean embedding presented in Section 4.

Let $\mathcal{N}(\mathcal{C})$ and $\mathcal{R}(\mathcal{C})$ denote the null space and the range of an operator $\mathcal{C}$ . The following result due to Baker (1973) states that the cross-covariance operator can be decomposed into the covariance of the marginals and the correlation. {shadowbox}

There exists a unique bounded operator $\mathcal{V}_{\mathit{YX}}:\mathscr{H}\rightarrow\mathscr{F}$ , $\|\mathcal{V}\|\leq 1$ , such that

where $\mathcal{R}(\mathcal{V}_{\mathit{YX}})\subset\overline{\mathcal{R}(\mathcal{C}_{\mathit{YY}})}$ and $\mathcal{N}(\mathcal{V}_{\mathit{YX}})^{\perp}\subset\overline{\mathcal{R}(\mathcal{C}_{\mathit{XX}})}$ .

The operator $\mathcal{V}_{\mathit{YX}}$ is often referred to as the normalized cross-covariance operator and has been used as a basis for the conditional dependence measure. It is also related to the canonical correlation. See, e.g., Fukumizu et al. (2007; 2008). Compared to $\mathcal{C}_{\mathit{YX}}$ , $\mathcal{V}_{\mathit{YX}}$ captures the same information about the dependence of $X$ and $Y$ , but with less influence of the marginal distributions.

The covariance operator serves as a basic building block in classical kernel-based methods such as kernel PCA (Schölkopf et al. 1998, Zwald et al. 2004), kernel Fisher discriminant, kernel CCA (Fukumizu et al. 2007), and kernel ICA (Bach and Jordan 2003). More recent applications of the covariance operator include independence and conditional independence measures (Gretton et al. 2005b, Zhang et al. 2008; 2011, Doran et al. 2014). See §3.6 and §4.7 for more details.

3 Properties of the Mean Embedding

In this section we discuss the notion of the characteristic kernel which can be formally defined as follows.

A closely related notion is that of a universal kernel which we discuss its connection to a characteristic kernel below.

A continuous positive definite kernel $k$ on a compact metric space $\mathcal{X}$ is said to be universal if the corresponding RKHS $\mathscr{H}$ is dense in $C_{b}(\mathcal{X})$ , i.e., for any $f\in C_{b}(\mathcal{X})$ and $\varepsilon>0$ , there exists a function $h\in\mathscr{H}$ such that $\|f-h\|_{\infty}<\varepsilon$ .

The richness of RKHS was previously studied through the so-called universal kernels (Steinwart 2002). Recall that a continuous positive definite kernel $k$ on a compact metric space $\mathcal{X}$ is said to be universal if the corresponding \etb@resrvdarkhs $\mathscr{H}$ is dense in the space of bounded continuous functions on $\mathcal{X}$ (Definition 3.3.3). It implies that any kernel-based learning algorithms with universal kernels can in principle approximate any bounded continuous function $f$ arbitrarily well. Note that approximation in the RKHS norm implies the approximation in the sup norm by the continuity of the evaluation functional. It follows from Definition 3.3.2 and Gretton et al. (2012a; Theorem 8) that all universal kernels are characteristic. Examples of universal kernels on a compact domain are Gaussian and Laplace kernels. On a discrete domain $\mathcal{X}=\{\mathbf{x}_{1},\ldots,\mathbf{x}_{n}\}$ , any strictly positive definite kernel, e.g., $k(\mathbf{x},\mathbf{x}^{\prime})=\mathds{1}_{\{\mathbf{x}=\mathbf{x}^{\prime}\}}$ , is universal (Borgwardt et al. 2006; Section 2.3). Overall, universality is stronger than the characteristic property, i.e., all universal kernels are characteristic, but not vice versa. Nevertheless, they match if the kernel is translation invariant, continuous, and decays to zero at infinity. Sriperumbudur et al. (2011a) provides a comprehensive insight into this connection between universal and characteristic kernels.

For applications involving statistical hypothesis testing, a characteristic kernel is essential because it ensures that we obtain the right test statistics asymptotically, although in practice we only have access to a finite sample. However, there are other application domains—for instance, predictive learning on distributional data (Muandet et al. 2012, Muandet and Schölkopf 2013, Oliva et al. 2014, Szabó et al. 2016)—in which the uniqueness of the embeddings is not strictly required, and thereby the choice of kernel functions can be less restrictive. In such applications, it is better to interpret kernel $k$ as a filter function which determines which frequency component of the corresponding characteristic functions appears in the embeddings, see, e.g., Example 3.1.6. As a result, the shape of the kernel function in the Fourier domain can sometimes be more informative when choosing kernel functions.

4 Kernel Mean Estimation and Approximation

By the weak law of large numbers, the empirical estimator (3.21) is guaranteed to converge to the true mean embedding. Berlinet and Thomas-Agnan (2004) shows that the convergence happens at a rate $O_{p}(n^{-1/2})$ . This result has also been reported in Smola et al. (2007), Shawe-Taylor and Cristianini (2004; Chapter 4), Song (2008; Theorem 27), Gretton et al. (2012a), Lopez-Paz et al. (2015; Theorem 1) and Tolstikhin et al. (2016; Proposition A.1), in slightly different forms.

Inspired by the James-Stein estimator of the mean of multivariate Gaussian distribution (Stein 1955, James and Stein 1961), Muandet et al. (2014a; 2016) proposed a shrinkage estimator of the kernel mean which has the form

To understand the effect of the shrinkage estimator in (3.22), consider the bias-variance decomposition

It is known that the empirical estimator (3.22) can be considered as an M-estimator (Shawe-Taylor and Cristianini 2004, Kim and Scott 2012), i.e., it can be obtained as

It was shown that the estimators of Muandet et al. (2014b) performs shrinkage by first projecting data onto the KPCA basis, and then shrinking each component independently according to a pre-defined shrinkage rule, which is specified by the filter function $g_{\lambda}$ (Muandet et al. 2014b; Proposition 3). The shrinkage estimate is then reconstructed as a superposition of the resulting components. Unlike (3.22), the spectral shrinkage estimators (3.25) also take the eigenspectrum of the kernel $\mathbf{K}$ into account. The idea has been extended to estimate the covariance operator (Muandet et al. 2016, Ramdas and Wehbe 2015) which is ubiquitous in kernel independence and conditional independence tests (cf. §3.6 and §4.7). Flaxman et al. (2016) proposes a Bayesian model for kernel mean embedding using a GP prior over the RKHS containing the mean embedding. While the posterior mean of the model resembles the frequentist spectral KMSE (3.25) when $g_{\lambda}(\mathbf{K})=(\mathbf{K}+n\lambda\mathbf{I})^{-1}$ , a closed-form uncertainty estimate leads to a principled method for learning kernel parameters with the marginal pseudo-likelihood.

Several ways to improve the use of kernel mean embedding have been investigated from the regularization perspective. In an attempt to improve the overall performance of the hypothesis testing, Harchaoui et al. (2007) proposed the two-sample test based on kernel FDA with Tikhonov-type regularization of the covariance operator, like the spectral KMSE. It was subsequently applied to change-point detection (Harchaoui et al. 2009a) and audio segmentation (Harchaoui et al. 2009b). Harchaoui et al. (2009b) also considers a spectral filtering approach. Furthermore, Danafar et al. (2014) used the RKHS norm of the mean embeddings to directly regularize the distance between them, resulting in estimates which are similar to (3.22). As we can see, there is a fundamental link between Stein estimation in statistics and Tikhonov regularization in inverse problems.

4.2 Approximating the Kernel Mean

In many applications of kernel methods such as computational biology (Schölkopf et al. 2004), the computational cost may be a critical issue, especially in the era of “big data”. Traditional kernel-based algorithms have become computationally prohibitive as the volume of data has exploded because most existing algorithms scale at least quadratically with sample size. Likewise, the use of kernel mean embedding has also suffered from this limitation due to two fundamental issues. First, any estimators of the kernel mean involve the (weighted) sum of the feature map of the sample. Second, for certain kernel functions such as the Gaussian RBF kernel, the kernel mean lives in an infinite dimensional space. We can categorize previous attempts in approximating the kernel mean into two basic approaches: 1) find a smaller subset of samples whose estimate approximates well the original estimate of the kernel mean, 2) find a finite approximation of the kernel mean directly.

The former has been studied extensively in the literature. For example, Cruz Cortés and Scott (2014) considers the problem of approximating the kernel mean as a sparse linear combination of the sample. The proposed algorithm relies on a subset selection problem using a novel incoherence measure. The algorithm can be solved efficiently as an instance of the k-center problem and has linear complexity in the sample size. Similarly, Grünewälder et al. (2012) proposes a sparse approximation of the conditional mean embedding by relying on an interpretation of the conditional mean as a regressor. Note that the same idea can be adopted to find a sparse approximation of the standard kernel mean by imposing the sparsity-inducing norm on the coefficient $\bm{\beta}$ , e.g., $\|\bm{\beta}\|_{1}$ (Muandet et al. 2014a). An advantage of sparse representation is in applications where the kernel mean is evaluated repeatedly, e.g., the Kalman filter (Kanagawa et al. 2013, McCalman et al. 2013). One of the drawbacks is that it needs to solve a non-trivial optimization problem in order to find an optimal subsample.

where $\mathbf{z}(\mathbf{x}):=\mathbf{W}^{\top}\mathbf{x}$ and $w_{ij}\sim p(\mathbf{w})$ . If elements of $\mathbf{W}$ are drawn from an appropriate distribution $p(\mathbf{w})$ , the Johnson-Lindenstrauss Lemma (Dasgupta and Gupta 2003, Blum 2005) ensures that this transformation will preserve similarity between data points. In Rahimi and Recht (2007), $p(\mathbf{w})$ is chosen to be the Fourier transform of translation-invariant kernels $k(\mathbf{x},\mathbf{y})=\psi(\mathbf{x}-\mathbf{y})$ . Given a feature map $\mathbf{z}$ , a finite approximation of the kernel mean can be obtained directly as

5 Maximum Mean Discrepancy

For example, if $\mathcal{F}$ is chosen to be a space of all bounded continuous functions on $\mathcal{X}$ , the IPM is a metric over a space of probability distributions, as stated in the following theorem (Müller 1997). {shadowbox}

A mapping $\gamma:M_{+}^{1}\times M_{+}^{1}\rightarrow[0,\infty]$ given by

where we use the reproducing property of $\mathscr{H}$ and the linearity of the inner product, respectively. Thus, we can express the MMD in terms of the associated kernel function $k$ as

Before describing an empirical estimate of (3.29), we give basic definitions of $U$ -statistics and $V$ -statistics which are the cores of the empirical estimate of MMD, see, e.g., Serfling (1981; Chapter 5).

for some real-valued measurable function $h=h(\mathbf{x}_{1},\ldots,\mathbf{x}_{m})$ called a kernel. The corresponding U-statistic of order $m$ for estimation of $\theta$ on the basis of a sample $\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{n}$ of size $n\geq m$ is obtained by averaging the kernel $h$ symmetrically over the observations:

where $\sum_{c}$ denotes summation over the ${n\choose m}$ combinations of $m$ distinct elements $\{i_{1},i_{2},\ldots,i_{m}\}$ from $\{1,2,\ldots,n\}$ .

The empirical MMD can be expressed in terms of empirical mean embeddings as $\widehat{\text{MMD}}_{b}^{2}[\mathscr{H},\mathbf{X},\mathbf{Y}]=\|\hat{\mu}_{\mathbf{X}}-\hat{\mu}_{\mathbf{Y}}\|_{\mathscr{H}}^{2}$ . Moreover, we can write an unbiased estimate of the MMD entirely in terms of $k$ as (Borgwardt et al. 2006; Corollary 2.3)

Note that (3.32) is an unbiased estimate which is a sum of two $U$ -statistics and a sample average (Serfling 1981; Chapter 5). That is, assuming that $m=n$ ,

The MMD test has several advantages over existing methods proposed in the literature (Anderson et al. 1994, Biau and Györfi 2005, Nguyen et al. 2007). First, the MMD test is distribution free.Note that even if a test is consistent, it is not possible to distinguish distributions with high probability at a given, fixed sample size. That is, the assumption on the parametric form of the underlying distribution is not needed. Furthermore, like most of the kernel-based tests, e.g., Harchaoui et al. (2007), the test can be applied in structured domains like graphs and documents as soon as the positive definite kernel is well-defined. Lastly, an availability of the asymptotic distribution of the test statistic allows for better approximation of the empirical estimates of the test statistic.

A goodness-of-fit testing is an equally important problem. That is, we test whether the given sample $\{\mathbf{x}_{i}\}$ is drawn from a particular distribution $q$ or from a family of distributions $\mathcal{Q}$ . Traditional goodness-of-fit tests are the $\chi^{2}$ -test and the Kolmogorov-Smirnov test. In principle, we can turn it into a two-sample test by drawing a sample $\{\mathbf{y}_{i}\}$ from $q$ and apply the MMD tests. However, it is often difficult in practice to draw samples from $q$ , especially when $q$ corresponds to complex and high dimensional distributions. Liu et al. (2016) and Chwialkowski et al. (2016) independently and simultaneously develop a goodness-of-fit test based on Stein’s identity applied to functions in RKHS (Oates et al. 2016).

The MMD can be computed in quadratic time $O(n^{2}d)$ , which might prohibit its applications in large-scale problems. In Gretton et al. (2012a), the authors also propose the linear time statistics and test by using the subsampling of the term in (3.32), i.e., drawing pairs from $\mathbf{X}$ and $\mathbf{Y}$ without replacement. This method reduces the time complexity of MMD from $O(n^{2}d)$ to $O(nd)$ . However, the test has high variance due to loss of information. The $B$ -test of Zaremba et al. (2013) trades off the computation and variance of the test by splitting two-sample sets into corresponding subsets and then computes the exact MMD in each block while ignoring between-block interactions with $O(n^{3/2}d)$ time complexity.The $B$ -test can be understood as a specific case of the tests using incomplete $U$ -statistic (Blom 1976). While this kind of statistic can be obtained in numerous ways, it has not been explored much in the context of MMD.

The MMD has been applied extensively in many applications, namely, clustering (Jegelka et al. 2009), density estimation (Song et al. 2007a; 2008, Sriperumbudur 2011), (conditional) independence tests (Fukumizu et al. 2008, Doran et al. 2014, Chwialkowski and Gretton 2014), causal discovery (Sgouritsa et al. 2013, Chen et al. 2014, Schölkopf et al. 2015), covariate shift (Gretton et al. 2009, Pan et al. 2011) and domain adaptation (Blanchard et al. 2011, Muandet et al. 2013), selection bias correction (Huang et al. 2007), herding (Chen et al. 2010, Huszar and Duvenaud 2012), Markov chain Monte Carlo (Sejdinovic et al. 2014), moment matching for training deep generative models (Li et al. 2015, Dziugaite et al. 2015), statistical model criticism (Lloyd and Ghahramani 2015, Kim et al. 2016), approximate Bayesian computation (Park et al. 2016), distribution regression (Szabó et al. 2016) and model selection in generative models (Bounliphone et al. 2016), for example.

5.2 Application: Training Deep Generative Models via MMD

In this section, we highlight the promising application of MMD in training deep generative models which was recently assayed by Li et al. (2015) and Dziugaite et al. (2015). Deep generative models are neural networks that provide a distribution from which one can generate samples by feeding the models with random noise $Z$ . Generative Adversarial Networks (GAN) proposed by Goodfellow et al. (2014) is one of the most popular deep generative models used in several computer vision applications, see, e.g., Bouchacourt et al. (2016) and references therein. A GAN model consists of two neural networks—namely a generative network and a discriminative network—trained in an adversarial manner. The generative network, also referred as the generator $G_{\bm{\theta}}$ , aims to replicate the training data from input noise, whereas the discriminative network, also referred as the discriminator $D_{\bm{\phi}}$ , acts as an adversary that aims to distinguish the synthetic data generated by the generator from the training data. The learning is considered successful if the synthetic data is indistinguishable from the real data.

Training the GAN amounts to solving the minimax optimization problem (Goodfellow et al. 2014)

where $X$ is distributed according to the data distribution and $Z$ is distributed according to a prior over noise variable. As mentioned in Goodfellow et al. (2014), the algorithm can be interpreted as approximately minimizing the Jensen-Shannon divergence. On the other hand, the idea proposed in Li et al. (2015) and Dziugaite et al. (2015) is to replace the discriminators $\{D_{\bm{\phi}}\}$ by the MMD test. Specifically, rather than training a neural network to distinguish $X$ from $G_{\bm{\theta}}(Z)$ , they view the problem as a two-sample testing using the MMD test statistic. As a result, the objective function $V(G_{\bm{\theta}},D_{\bm{\phi}})$ in (3.35) is substituted by

for some RKHS $\mathscr{H}$ . In this case, we can think of $\{D_{\bm{\phi}}\}$ as the unit ball of functions in RKHS $\mathscr{H}$ . The MMD test then plays the role of an adversary (Dziugaite et al. 2015). Figure 3.3 gives an illustrative comparison.

The benefits of replacing the discriminative network with the MMD test are two-fold. Firstly, the MMD offers a closed-form solution for the discriminator in GAN. It is also more theoretically appealing because, if $\mathscr{H}$ corresponds to the RKHS of a characteristic kernel, $V_{\mathscr{H}}(G_{\bm{\theta}},D_{\bm{\phi}})$ vanishes if and only if $X$ and $G_{\bm{\theta}}(Z)$ come from the same distribution. Secondly, it avoids the hard minimax objective function used in traditional GAN. Hence, training the MMD nets is more computationally efficient.

Combining MMD with deep neural networks has recently become an active research area. In domain adaptation, for instance, the MMD has also been used to reduce the dataset bias and thus improve transferability of the features across different domains (Long et al. 2015; 2016). As briefly remarked by Dziugaite et al. (2015), one of the promising research directions is to extend this idea to a broader class of (generative) models.A similar remark was made by Sebastian Nowozin at the Dagstuhl seminar on New Directions for Learning with Kernels and Gaussian Processes that took place during November 27 – December 2, 2016.

6 Kernel Dependency Measures

A dependence measure is one of the most fundamental tools in statistical analysis. Rényi (1959) outlines a list of seven desirable properties—known as Rényi’s axioms—which should be satisfied by any measure of dependence between two random variables. The classical criteria such as Spearman’s rho and Kendall’s tau can detect only linear dependences. Some other non-linear criteria such as mutual information based measures and density ratio methods, may require certain assumptions regarding the parametric form of the underlying distribution which is too restrictive in several applications.

When the dependence is non-linear, one of the most successful non-parametric measures is the Hilbert Schmidt Independence Criterion (HSIC) (Gretton et al. 2005a). Let $\mathscr{H}$ and $\mathscr{G}$ be separable RKHSs on $\mathcal{X}$ and $\mathcal{Y}$ with reproducing kernels $k$ and $l$ , respectively. The HSIC is defined as the square HS-norm of the associated cross-covariance operator:

An empirical unbiased estimate of the HSIC statistic from an i.i.d. sample $(\mathbf{x}_{1},\mathbf{y}_{1}),\ldots,(\mathbf{x}_{n},\mathbf{y}_{n})$ on $\mathcal{X}\times\mathcal{Y}$ is given by

Due to its generality, HSIC has proven successful in both statistics and machine learning. Many learning problems can be interpreted as a maximization or minimization of the dependence between $X$ and $Y$ . For instance, Song et al. (2007b) considers supervised feature selection as maximizing the dependence between the subsets of covariate $X^{\prime}$ , obtained from backward elimination algorithm (Guyon and Elisseeff 2003), and the target $Y$ , e.g., outputs of binary, multi-class, and regression settings. Likewise, when $Y$ denotes the cluster labels, Song et al. (2007a) proposes a HSIC-based clustering algorithm called CLUHSIC that clusters the data by maximizing the HSIC between $X$ and $Y$ . Interestingly, depending on the structure of the output kernel $l$ , many classical clustering algorithms, e.g., k-means, weighted k-mean, and hierarchical clustering, can be considered as a special case of CLUHSIC. Moreover, the MaxMMD of Jegelka et al. (2009) is in fact equivalent to the dependence maximization framework. This concept has been extended to several applications including, for example, kernelized sorting (Quadrianto et al. 2009).

Several extensions of HSIC have been proposed for settings in which there are three random variables $(X,Y,Z)$ —e.g., conditional dependence measure, three-variable interaction, and relative dependency measure. For instance, Sejdinovic et al. (2013a) proposes a kernel nonparametric test to detect Lancaster three-variable interaction, e.g., V-structure. Bounliphone et al. (2015) constructs a consistent test for relative dependency which—unlike the Lancaster test of Sejdinovic et al. (2013a)—measures dependency between a source variable and two candidate target variables. Taking into account the correlation between two HSIC statistics when deriving the corresponding joint asymptotic distribution leads to a more powerful, consistent test than the test based on two independent HSIC statistics (Bounliphone et al. 2015; Theorem 4)). Lastly, a nonparametric dependency test for an arbitrary number of random variables has also been considered in the literature and is required in many applications. However, the prominent issue is that, as the number of random variables grows, the convergence of the estimators may be arbitrarily slow. We will postpone the discussion of conditional dependence measures to §4.7.

Kernel independence tests for time series—e.g., financial data and brain activity data—were recently proposed in a series of works including Besserve et al. (2013), Chwialkowski and Gretton (2014), Chwialkowski et al. (2014). If one is interested in a serial dependency within a single time series, the test reduces to the i.i.d. case under null hypothesis. In order to test dependence between one time series and another, Besserve et al. (2013) characterizes dependences between time-series as the Hilbert-Schmidt norm of the kernel cross-spectral density (KCSD) operator which is the Fourier transform of the covariance operator at each time lag. In contrast, Chwialkowski and Gretton (2014) uses the standard HSIC test statistic whose null estimate is obtained by making a shift of one signal relative to the other, rather than the ordinary bootstrapping. The authors impose mixing conditions on random processes under investigation (see Chwialkowski and Gretton 2014 for details).

7 Learning on Distributional Data

The benefits of kernel mean representation for distributional data are three-fold. Firstly, the representation can be estimated consistently without any parametric assumption on the underlying distributions, unlike those based on generative probability models (Jaakkola and Haussler 1998, Jebara et al. 2004). Secondly, compared to density estimation approaches (Póczos et al. 2011, Oliva et al. 2014; 2013), the kernel mean representation is less prone to the curse of dimensionality and admits fast convergence in terms of sample size (Shawe-Taylor and Cristianini 2004, Smola et al. 2007, Grünewälder et al. 2012). Lastly, being an element of the kernel feature space enables one to extend the whole arsenal of kernel methods to distributional data and to adopt existing tools for theoretical analysis, e.g., generalization bounds (Szabó et al. 2016, Lopez-Paz et al. 2015, Muandet 2015).

7.2 Properties of Distributional Kernels

Assuming that $\mathcal{X}$ is compact, the RKHS $\mathscr{H}$ induced by the kernel $k$ is dense in $\mathcal{F}$ if $k$ is universal in the sense of Steinwart (2002).

A more challenging question to answer is whether there exists a non-linear kernel $\kappa$ on $M_{+}^{1}(\mathcal{X})$ which is dense in $C_{b}(M_{+}^{1}(\mathcal{X}))$ . For example, Christmann and Steinwart (2010) considers a Gaussian-like kernel

Recently, the empirical kernel between distributions was employed in an unsupervised way for multi-task learning to generalize to a previously unseen task (Blanchard et al. 2011). Figure 3.4 summarizes the kernel-based framework for distributional data.

7.3 Distributional Risk Minimization

On the one hand, it is instructive to compare the DRM in (3.43) to the vicinal risk minimization (VRM) of Chapelle et al. (2001). Specifically, they consider a slightly different regularized functional

The next theorem provides a generalization bound for a classification problem on distributional data using the kernel mean embedding.

where $R_{n}(\mathcal{F}_{k})$ denotes the Rademacher complexity of $\mathcal{F}_{k}$ .

which is a generalized inner product between two input objects. That is, the probabilistic models $p$ and $q$ are learned for each example and used as a surrogate to construct the kernel between those examples. The kernel (3.48) can be evaluated for all exponential families such as multinomials and Gaussians. Moreover, the PPK (with a certain value of $\rho$ ) can be computed analytically for some distributions such as mixture models, HMMs, and linear dynamical systems. For intractable models, Jebara et al. (2004) suggested approximating (3.48) by using structured mean-field approximations. As a result, the PPK can be applied for a broader class of generative models.

The PPK is in fact closely related to well-known kernels such as the Bhattacharyya kernel (Bhattacharyya 1943) and the exponential symmetrized Kullback-Leibler (KL) divergence (Moreno et al. 2004). In Hein and Bousquet (2005), an extension of a two-parameter family of Hilbertian metrics of Topsøe was used to define Hilbertian kernels on probability measures. In Cuturi et al. (2005), the semi-group kernels were designed for objects with additive semi-group structure such as positive measures. Recently, Martins et al. (2009) introduced non-extensive information theoretic kernels on probability measures based on new Jensen-Shannon-type divergences. Although these kernels have proven successful in many applications, they are designed specifically for certain properties of distributions and application domains. Moreover, there has been no attempt in making a connection to the kernels on corresponding input spaces.

and a generative mean map kernel (GMMK) proposed by Mehta and Gray (2010). In the GMMK, the kernel between two objects $\mathbf{x}$ and $\mathbf{y}$ is defined via $\hat{p}_{\mathbf{x}}$ and $\hat{p}_{\mathbf{y}}$ , which are estimated probabilistic models of $\mathbf{x}$ and $\mathbf{y}$ , respectively. Moreover, the empirical version of (3.49) and (3.40) coincides with that of classical set kernel of Haussler (1999) and Gärtner et al. (2002) (see also Kondor and Jebara 2003). This connection reveals an interesting fact that—although originally presented as a similarity measure between sets of vectors—this class of kernels has a natural interpretation as the similarity between the underlying probability distributions.

Distributions can be used to capture input uncertainty. The use of the kernel (3.40) in dealing with the uncertainty in input data has a connection to robust SVMs. For example, a generalized form of the SVM in Shivaswamy et al. (2006) incorporates the probabilistic uncertainty into the maximization of the margin. This results in a second-order cone programming (SOCP) that generalizes the standard SVM. In SOCP, one needs to specify the parameter $\tau_{i}$ that reflects the probability of correctly classifying the $i$ th training example. In the context of this section, we may represent data point $\mathbf{x}_{i}$ by a distribution $\mathcal{N}(\mathbf{x}_{i},\sigma_{i}^{2}\mathbf{I})$ . Therefore, the parameter $\tau_{i}$ is closely related to the parameter $\sigma_{i}$ , which specifies the variance of the distribution centered at the $i$ th example. Anderson and Gupta (2011) showed the equivalence between SVMs using expected kernels and SOCP when $\tau_{i}=0$ . When $\tau_{i}>0$ , the mean and covariance of missing kernel entries have to be estimated explicitly, making the SOCP more involved for nonlinear kernels. Although achieving comparable performance to the standard SVM with expected kernels, the SOCP requires a more computationally extensive SOCP solver, as opposed to simple quadratic programming (QP). From a Bayesian perspective, the problem of regression with uncertain inputs has been studied in the Gaussian process community, e.g., Girard et al. (2002).

A major drawback of previous studies is that they usually impose a strong parametric assumption on the form of probability distribution. The kernel mean representation, on the other hand, allows one to learn directly from distributions without such an assumption. For example, Szabó et al. (2016) has recently studied the nonparametric distribution regression problem based on kernel mean embedding and the kernel ridge regression algorithm. They establish the consistency and convergence rate of the resulting algorithm whose challenge arises from the two-stage sampling: a meta distribution generates an i.i.d. sample of distributions from which i.i.d observations have been generated (see also Theorem 3.7.3). As a result, in practice we only observe samples from the distributions rather than the distributions themselves. The theoretical analysis uses the results of Caponnetto and De Vito (2007) who provides error bounds for the regularized least-squares algorithm in the standard setting.

8 Recovering Information from Mean Embeddings

The classical pre-image problem in kernel methods involves finding patterns in input space that map to specific feature vectors in the feature space (Schölkopf and Smola 2002; Chapter 18). Recovering a pre-image is considered necessary in some applications such as image denoising using kernel PCA (Kwok and Tsang 2004, Kim et al. 2005) and visualizing the clustering solutions of a kernel-based clustering algorithm (Dhillon et al. 2004, Jegelka et al. 2009). Moreover, it can be used as a reduced set method to compress a kernel expansion (Schölkopf and Smola 2002; Chapter 18). Schölkopf and Smola (2002; Proposition 18.1) shows that if the pre-image exists and the kernel is an invertible function of $\langle\mathbf{x},\mathbf{x}^{\prime}\rangle$ , the pre-image will be easy to compute. Unfortunately, the exact pre-image typically does not exist, and the best one can do is to approximate it. There is a fair amount of work on this topic and the interested readers should consult Schölkopf and Smola (2002; Chapter 18) for further details.

Note that (3.51) is quadratic in $\bm{\pi}$ and is also convex in $\bm{\pi}$ as $\mathbf{K}$ , $\mathbf{Q}$ , and $\mathbf{R}$ are positive definite. The integrals $\mathbf{Q}_{ij}$ and $\mathbf{R}_{ij}$ can be evaluated in closed form for some kernels (see Song et al. 2008; Table 1 and Muandet et al. 2012; Table 1). Unfortunately, the problem is often non-convex in both $\mathbf{m}_{i}$ and $\sigma_{i}$ , $i=1,\ldots,m$ . A derivative-free optimization is often used to find these parameters. In practice, $\pi_{i}$ and $\{\mathbf{m}_{i},\sigma_{i}\}$ are solved alternately until convergence (see, e.g., Song et al. 2008, Chen et al. 2010).

8.2 Kernel Herding

Instead of finding a distributional pre-image of the mean embedding, another common application is obtaining a sample from the distribution or sampling. To avoid an intractable learning of a joint probabilistic model and a direct sampling from that model, Welling (2009a) proposes a novel approach called the Herding algorithm. Herding directly generates pseudo-samples in a fully deterministic fashion and in a way that asymptotically matches the empirical moments of the data. Afterwards, Chen et al. (2010) proposes a kernel herding algorithm that extends the herding algorithm (Welling 2009a; b, Welling and Chen 2010) to continuous spaces by using the kernel trick. Herding can be understood concisely as a weakly chaotic non-linear dynamical system $\mathbf{w}_{t+1}=F(\mathbf{w}_{t})$ . They re-interpret herding as an infinite memory process in the state space $\mathbf{x}$ by marginalizing out the parameter $\mathbf{w}$ , resulting in a mapping $\mathbf{x}_{t+1}=G(\mathbf{x}_{1},\ldots,\mathbf{x}_{t};\mathbf{w}_{0})$ . Under some technical assumptions, herding can be seen to greedily minimize the squared error

where $\hat{\mu}_{T}$ denotes the empirical mean embedding obtained from herding. Following the result of Welling (2009a), kernel herding is shown to decrease the error of expectations of functions in the RKHS at a rate of $O(1/T)$ as opposed to the random samples whose rate is $O(1/\sqrt{T})$ . The fast rate is guaranteed even when herding is carried out with some error. This condition is reminiscent of the Boosting algorithm and perceptron cycling theorem (Chen et al. 2010; Corollary 2). The fast convergence is due to the negative autocorrelation, i.e., herding tends to find samples in an unexplored high-density region. This kind of behaviour can also be observed in Quasi Monte Carlo integration and Bayesian quadrature methods (Rasmussen and Ghahramani 2002).

Herding can be problematic in a high dimensional setting when optimizing over the new sample. Bach et al. (2012) also pointed out that the fast convergence rate is not guaranteed in an infinite dimensional Hilbert space. To alleviate this issue, Bach et al. (2012) shows that the herding procedure of Welling (2009a) takes the form of a convex optimization algorithm in which convergence results can be invoked. Lacoste-Julien et al. (2015) takes this interpretation and proposes the Frank-Wolfe optimization algorithm for particle filtering. Lastly, it is instructive to mention that kernel herding in RKHSs has an implicit connection to Quasi-Monte Carlo (QMC) theory, i.e., one can show by the reproducing property and the Cauchy-Schwartz inequality that the integration error

where $D(\cdot)$ is a discrepancy measure:

This connection has been noted by several authors including Bach et al. (2012) and Yang et al. (2014).

Chapter 4 Hilbert Space Embedding of Conditional Distributions

To better understand the distinction between the kernel mean embedding of marginal and conditional distributions, and the problems that we may encounter, we briefly summarize the concept of marginal, joint, and conditional distributions. Further details are available in most statistics textbooks, see, e.g., Wasserman (2010). Readers already familiar with these concepts may wish to proceed directly to the definition of conditional mean embedding.

The basic relationships between marginal, joint, and conditional distributions can be illustrated via the following equations:

As we can see in the first equation of (4.1), the conditional probability of $Y$ given $X$ is equal to the joint probability of $X$ and $Y$ divided by the marginal probability of $X$ . Marginal, joint, and conditional distributions equipped with the formulation (4.1) provide a powerful language for statistical inference in statistics and machine learning.

The following definition provides explicit forms of $\mathcal{U}_{Y|X}$ and $\mathcal{U}_{Y|\mathbf{x}}$ .

Let $\mathcal{C}_{\mathit{XX}}:\mathscr{H}\rightarrow\mathscr{H}$ and $\mathcal{C}_{\mathit{YX}}:\mathscr{H}\rightarrow\mathscr{G}$ be the covariance operator of $X$ and cross-covariance operator between $X$ to $Y$ , given in (3.14), respectively. Then, the conditional mean embedding $\mathcal{U}_{Y|X}$ and $\mathcal{U}_{Y|\mathbf{x}}$ are defined as

where $\mathcal{C}_{\mathit{XX}}^{-1}\mu_{\Pi}$ denotes the function mapped to $\mu_{\Pi}$ by $\mathcal{C}_{\mathit{XX}}$ .

1.2 Empirical Estimation of Conditional Mean Embeddings

Theorem 4.1.3 gives a formal characterization of the empirical estimator of the conditional mean embedding.

The conditional mean embedding $\mu_{Y|\mathbf{x}}$ can be estimated using

2 Regression Interpretation

As illustrated in Figure 4.1, the conditional mean embedding has a natural interpretation as a solution to the vector-valued regression problem. This observation has been made in Zhang et al. (2011) and later thoroughly in Grünewälder et al. (2012), which we review below.

where $\mathscr{G}$ is a Hilbert space and $\mathscr{H}_{\Gamma}$ denotes a \etb@resrvdarkhs of vector-valued functions from $\mathcal{X}$ to $\mathscr{G}$ (see Micchelli and Pontil 2005 for more details). Grünewälder et al. (2012) shows that $\hat{\mu}_{Y|X}$ can be obtained as a minimizer of the optimization of the form (4.9).In fact, a regression view of conditional mean embedding has already been noted very briefly in Song et al. (2009; Section 6) with connections to the solutions of Gaussian process regression (Rasmussen and Williams 2005) and kernel dependency estimation (Cortes et al. 2005). Nevertheless, Grünewälder et al. (2012) gives a more rigorous account of this perspective.

Following the analysis of Grünewälder et al. (2012), a natural optimization problem for the conditional mean embedding is to find a function $\mu:\mathcal{X}\rightarrow\mathscr{G}$ that minimizes the following objective:

which can then be replaced by its empirical counterpart

The regularization term is added to provide a well-posed problem and prevent overfitting.

The advantages of vector-valued regression interpretation of conditional mean embedding are two-fold. First, since we have a well-defined loss function, we can use a cross-validation procedure for parameter or model selection, e.g., $\lambda$ . Second, it improves the performance analysis of conditional mean embedding as one has access to a rich theory of vector-valued regression (Micchelli and Pontil 2005, Carmeli et al. 2006, Caponnetto and De Vito 2007, Caponnetto et al. 2008). In particular, by applying the convergence results of Caponnetto and De Vito (2007), Grünewälder et al. (2012) derives minimax convergence rates of the estimator $\hat{\mu}$ which are $O(\log(n)/n)$ . Since the analysis is done under the assumption that $\mathscr{G}$ is finite dimensional, the conditional mean embedding is simply the ridge regression of feature vectors, and thus the better rate—compared to the rate of $O(n^{-1/4})$ of Song et al. (2009)—is quite natural.

Based on the new interpretation, Grünewälder et al. (2012) derives a sparse formulation of the conditional mean embedding $\hat{\mu}=\sum_{i=1}^{n}\Gamma_{\mathbf{x}_{i}}\mathbf{M}l(\mathbf{y}_{j},\cdot)$ using a sparse matrix $\mathbf{M}$ obtained from the optimization problem

where $\gamma$ is a non-negative regularization parameter and $\|\mathbf{M}\|_{1,1}:=\sum_{i,j=1}^{n}|\mathbf{M}_{ij}|$ . This is a Lasso problem with $n^{2}$ variables. Moreover, one can construct different estimators of conditional mean embedding by introducing a new regularizer in (4.12) (see, e.g., Muandet et al. 2014b; Table 1). It may be of interest to investigate theoretical properties of these new estimators. Lastly, it is instructive to point out that the regression interpretation of the conditional mean embedding can be considered as an instance of a smooth operator framework proposed later in Grünewälder et al. (2013).

3 Basic Operations: Sum, Product, and Bayes’ Rules

In this section we review basic operations in probabilistic inference and show how they can be carried out in terms of kernel mean embeddings. Sum and product rules are elementary rules of probability. Unlike the traditional recipe, the idea is to perform these operations directly on the marginal and conditional embeddings to obtain a new element in the \etb@resrvdarkhs which corresponds to the embedding of the resulting distribution. One of the advantages of this idea is that the product and sum rules can be performed without making any parametric assumptions on the respective distributions.

Next, we will show how these operations can be achieved as an algebraic manipulation of the (conditional) mean embedding in the \etb@resrvdarkhs. These results are due to Song et al. (2009; 2013), Fukumizu et al. (2013), and Schölkopf et al. (2015).

Theorem 4.1.2 provides sufficient conditions for (4.15) to be well-defined. Alternatively, we can also use a tensor product $\phi(\mathbf{x})\otimes\phi(\mathbf{x})$ to define the kernel sum rule as

3.2 Kernel Product Rule

Although both (4.15) and (4.18) do not require any parametric assumptions on the underlying distributions, these operations can be both statistically difficult and computationally costly in some applications (see below).

3.3 Kernel Bayes’ Rule

where the operators $\mathcal{C}^{\Pi}_{\mathit{YX}}$ and $\mathcal{C}^{\Pi}_{\mathit{XX}}$ are given by

where the former uses the kernel product rule (4.19), and the latter uses the kernel sum rule (4.16). The embeddings $\mu_{Y}^{\Pi}$ and $\mathcal{C}^{\Pi}_{\mathit{YY}}$ correspond to the embeddings of $\Pi(Y)$ using features $\varphi(\mathbf{y})$ and $\varphi(\mathbf{y})\otimes\varphi(\mathbf{y})$ , respectively.

Fukumizu et al. (2013) demonstrates consistency of the estimator of the posterior mean $\hat{\mu}_{Y|\mathbf{x}}$ (and the posterior expectation $\langle g,\hat{\mu}_{Y|\mathbf{x}}\rangle$ ), i.e., $\|\hat{\mu}_{Y|\mathbf{x}}-\mu_{Y|\mathbf{x}}\|_{\mathscr{G}}\rightarrow 0$ in probability as $n\rightarrow\infty$ (Fukumizu et al. 2013; Theorem 5), and shows that if $\|\hat{\mu}_{Y}^{\Pi}-\mu_{Y}^{\Pi}\|_{\mathscr{G}}=O_{p}(n^{-\alpha})$ as $n\rightarrow\infty$ for some $0\leq\alpha\leq 1/2$ , then—under certain specific assumptions—we have for any $\mathbf{x}\in\mathcal{X}$ ,

A common application of KBR is in a situation where computing the likelihood of observed data is intractable, while generating samples from the corresponding density is relatively easy. Kanagawa et al. (2016a), for example, proposes a filtering method with KBR in the situation where the observation model must be estimated nonlinearly and nonparametrically with data.

As an aside, Song et al. (2013) also provides the formulations of sum, product, and Bayes’ rules as a multi-linear algebraic operation using a tensor product feature $\phi(\mathbf{x})\otimes\phi(\mathbf{x})$ —instead of the standard feature $\phi(\mathbf{x})$ .

3.4 Functional Operations on Mean Embeddings

and showed that (4.23) is a consistent estimate of $\mu[f(X,Y)]$ and the convergence happens at a rate $O_{p}(\sqrt{\sum_{i}\alpha_{i}^{2}}+\sqrt{\sum_{j}\beta_{j}^{2}})$ , i.e., see Schölkopf et al. (2015; Theorem 3). Note that the feature maps $\phi_{x}$ , $\phi_{y}$ , and $\phi_{z}$ may correspond to different kernels. The estimators and theoretical analysis can be extended to functional operations on a larger set of random variables. Simon-Gabriel et al. (2016) provides consistency results and finite sample guarantees for the mean embedding of $f(Z)$ , where $f$ is a continuous function and $Z$ may consist of multiple random variables. However, if the variables are dependent, a consistent estimator of the joint embedding $\mu_{Z}$ is also required. Furthermore, their results cover estimators of $\mu[f(Z)]$ that are expressed in terms of “reduced set” expansions (Schölkopf and Smola 2002; Chapter 18).

As we can see from (4.23), the gain here is that the embedding of $f(X,Y)$ can be obtained directly from the embedding of $X$ and $Y$ without resorting to density estimation of $f(X,Y)$ . Many well-established frameworks for arithmetic operations on independent random variables involve integral transform methods such as Fourier and Mellin transforms (Springer 1979) which only allow for some simple cases, e.g., the sum, difference, product, or quotient of independent random variables. Moreover, the functional $f$ is applicable on any domain—no matter how complicated—as long as useful positive definite kernels are introduced on such domains.

We provide crucial remarks on the conditional mean embedding.

To alleviate such a strong assumption, a regularized version $\mathcal{C}_{\mathit{YX}}(\mathcal{C}_{\mathit{XX}}+\lambda_{n}\mathcal{I})^{-1}k(\mathbf{x},\cdot)$ is used instead. For a separable Hilbert space $\mathscr{H}$ , it has been shown that $\widehat{\mathcal{C}}_{\mathit{YX}}(\widehat{\mathcal{C}}_{\mathit{XX}}+\lambda_{n}\mathcal{I})^{-1}k(\mathbf{x},\cdot)$ converges to $\mathcal{C}_{\mathit{YX}}\mathcal{C}_{\mathit{XX}}^{-1}k(\mathbf{x},\cdot)$ if $\lambda_{n}$ decays to zero sufficiently slowly as $n\rightarrow\infty$ . In practice, a cross-validation procedure is commonly used to find the best $\lambda_{n}$ for the problems at hand.

Below we review some applications that employ the above operations on kernel mean embedding.

4 Graphical Models and Probabilistic Inference

Conditional mean embedding has enjoyed successful applications in graphical models and probabilistic inference (Song et al. 2009; 2010a; 2011a; 2011b; 2010b). Probabilistic graphical models are ubiquitous in many fields including natural language processing, computational biology, computer vision, and social science. Most of the traditional algorithms for inference often specify explicitly the parametric distributions underlying the observations and then apply basic operations such as sum, product, and Bayes rules on these distributions to obtain the posterior distribution over desired quantities, e.g., parameters of the model. On the other hand, the philosophy behind embedding-based algorithms is to represent distributions by their mean embedding counterparts, and then to apply the operations given in Section 4.3 on these embeddings instead. This method leads to several advantages over the classical approach. First, an inference can be performed in a non-parametric fashion; one does not need a parametric assumption about the underlying distribution as well as prior-posterior conjugacy. Second, most algorithms do not require density estimation which is difficult in high-dimensional spaces (Wasserman 2006; Section 6.5). Lastly, many models are only restricted to deal with discrete latent variables, e.g., a hidden Markov model (HMM) (Baum and Petrie 1966). The embedding approach allows for (possibly structured) non-Gaussian continuous variables, which makes these models applicable for a wider class of applications. Nevertheless, there are some disadvantages as well. First, relying on the kernel function, the resulting algorithms are usually sensitive to the choice of kernel and its parameters which need to be chosen carefully. Second, the algorithms only have access to the embedding of the posterior distribution rather than the distribution itself. Hence, to recover certain information such as the shape of the distribution, one needs to resort to a pre-image problem to obtain an estimate of the full posterior distribution (cf. Section 3.8 and Song et al. 2008, Kanagawa and Fukumizu 2014, McCalman et al. 2013). Lastly, the algorithms can become computationally costly. Approximation techniques such as low-rank approximation are often used to reduce the computation time and memory storage. See also Song et al. (2013) for a unified view of nonparametric inference in graphical models with conditional mean embedding.

Then, we condition the distribution on a new observation $\mathbf{o}_{t+1}$ using Bayes’ rule to obtain

Song et al. (2009) propose the exact updates for prediction (Song et al. 2009; Theorem 7) and conditioning (Song et al. 2009; Theorem 8) which can be formulated entirely in terms of kernel mean embeddings. Despite the exact updates, one still needs to estimate the conditional cross-covariance operator in each conditioning step, which is both statistically difficult and computationally costly. This problem is alleviated by using approximate inference under some simplifying assumptions. See Song et al. (2009; Theorem 9) for technical details. Empirically, although requiring labeled sequence of observations to perform filtering, for strongly nonlinear dynamics it has been shown to outperform standard Kalman filter which requires the exact knowledge of the dynamics. McCalman et al. (2013) also considers the filtering algorithm based on kernel mean embedding, i.e., the kernel Bayes’ rule (Fukumizu et al. 2011), to address the multi-modal nature of the posterior distribution in robotics.

A nonparametric representation of tree-structured graphical models was introduced in Song et al. (2010b). Inference in this kind of graphical model relies mostly on message passing algorithms. In case of a discrete variable, or Gaussian distribution, the message passing can be carried out efficiently using the sum-product algorithm. Minka (2001) proposes the expectation-propagation (EP) algorithm which requires an estimation of only certain moments of the messages. Sudderth et al. (2010) considers messages as mixture of Gaussians. The drawback of this method is that the number of mixture components grows exponentially as the message is propagated. Ihler and McAllester (2009) considers a particle belief propagation (BP) where the messages are expressed as a function of a distribution of particles at each node. Unlike these algorithms, the embedding-based algorithm proposed in Song et al. (2010b) expresses the message $\mathbf{m}_{\mathit{ts}}(s)$ between pairs of nodes as \etb@resrvdarkhs functions on which sum and product steps can be performed using the linear operation in the \etb@resrvdarkhs to obtain a new message. In addition, Song et al. (2010b) also proves the consistency of the conditional mean embedding estimator, i.e., $\|\hat{\mathcal{U}}_{Y|X}-\mathcal{U}_{Y|X}\|_{\text{HS}}$ converges in probability under some reasonable assumptions (Song et al. 2010b; Theorem 1). The algorithm was applied in cross-lingual document retrieval and camera orientation recovery from images. The idea has been used later for latent tree graphical models (Song et al. 2011b), which are often used for expressing hierarchical dependencies among many variables in computer vision and natural language processing, and for belief propagation algorithm (Pearl 1988, Song et al. 2011a) for pairwise Markov random fields.

Assuming that the latent structure underlying the data-generating process has a low-rank structure, e.g., latent tree, Song and Dai (2013) constructs an improved estimator of the kernel mean embedding for multivariate distribution using the truncated SVD (TSVD) algorithm.

5 Markov Chain Monte Carlo Methods

Adaptive MCMC is a class of MCMC methods that try to learn an optimal proposal distribution from previously accepted samples (Haario et al. 2001, Andrieu and Thoms 2008). Sejdinovic et al. (2014) introduces a kernel adaptive Metropolis-Hasting (KAMH) algorithm for sampling from a target distribution with strongly non-linear support. The idea is similar to that proposed in Haario et al. (2001), but when the target distributions are strongly non-linear, the samplers of Haario et al. (2001) may suffer from low acceptance probability and slow mixing. The idea of Sejdinovic et al. (2014) is to embed the trajectory of the Markov chain into the RKHS and then to construct the proposal using the information encoded in the empirical covariance operator. Intuitively, the empirical covariance operator contains the information about the non-linear support of the distribution. Hence, kernel PCA (Schölkopf et al. 1998) direction may be used to construct the proposal distribution. The proposed algorithm first obtains the RKHS sample

from the Gaussian measure in the RKHS and then finds a point $\mathbf{x}^{*}\in\mathcal{X}$ whose canonical feature map $k(\mathbf{x}^{*},\cdot)$ is close to $f$ in an RKHS norm. Consequently, the resulting sampler is adaptive to the local structure of the target at the current chain state, is oriented towards a nearby region of high density, and does not suffer from a wrongly scaled proposal distribution. Strathmann et al. (2015) introduces a kernel Hamiltonian Monte Carlo (KHMC) to address the case where one cannot run a traditional Hamiltonian Monte Carlo (HMC) since the target is doubly intractable, i.e., gradients of the energy functions are not available. They use an infinite dimensional exponential family model in an RKHS as a surrogate model and learn it via score matching (Hyvärinen 2005, Sriperumbudur et al. 2013). Recently, Schuster et al. (2016) looks into the ways of using RKHS representation for sequential Monte Carlo (SMC), which is more convenient than the MCMC methods as it does not require vanishing adaptation to ensure convergence to the correct invariant distribution.

6 Markov Decision Processes and Reinforcement Learning

A Markov decision process (MDP) is a discrete time stochastic control which is widely studied and applied in robotics, automated control, economics, and manufacturing (Bellman 1957). See Figure 4.2 for a graphical illustration of a MDP. Formally, an MDP is a 5-tuple $(\mathcal{S},\mathcal{A},P(\cdot,\cdot),R(\cdot,\cdot),\gamma)$ where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $P_{\mathbf{a}}(\mathbf{s},\mathbf{s}^{\prime})$ is a state-transition probability $P(\mathbf{s}_{t+1}=\mathbf{s}^{\prime}\,|\,\mathbf{s}_{t}=\mathbf{s},\mathbf{a}_{t}=\mathbf{a})$ , $R_{\mathbf{a}}(\mathbf{s},\mathbf{s}^{\prime})$ is the immediate reward received after the transition from state $\mathbf{s}$ to state $\mathbf{s}^{\prime}$ , and $\gamma\in$ is the discount factor. At each time step, the process is in some state $\mathbf{s}$ , and we may choose any action $\mathbf{a}$ that is available in state $\mathbf{s}$ . The process moves into the new state $\mathbf{s}^{\prime}$ according to the state-transition function $P_{\mathbf{a}}(\mathbf{s},\mathbf{s}^{\prime})$ and gives a corresponding reward $R_{\mathbf{a}}(\mathbf{s},\mathbf{s}^{\prime})$ . The discount factor $\gamma$ represents the difference in importance between future rewards and present rewards. The goal of MDP is to obtain an optimal policy that maximizes a value function $V(\cdot)$ defined on beliefs (distributions over states), and determined by the reward function $R$ , see, e.g., Sutton and Barto (1998) for further details.

As in a MDP, any problems that involve control optimal theory can be solved by analyzing the appropriate Bellman equation. According to Bellman’s principle of optimality, an optimal policy has the property that whatever the intial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision (Bellman 2003). In MDP, this principle is reflected in the Bellman optimality equation:

The operator $\mathcal{B}$ is known as the Bellman operator. If the image of $\mathcal{B}$ is always a measurable function, $V_{t+1}=\mathcal{B}V_{t}$ converges in sup-norm to the optimal solution.

Nishiyama et al. (2012) extends the kernel-based algorithm of Grünewälder et al. (2012) to a partially observable MDP (POMDP). In POMDP, we do not assume that the state $\mathbf{s}$ is known when action $\mathbf{a}$ is to be taken. This approach also considers the distributions over states, observations, and actions, and employs the kernel Bayes’ rule (Fukumizu et al. 2013) to update the embeddings. The major contribution of Nishiyama et al. (2012) is to formulate POMDPs in feature space (using the kernel Bellman equation) and to propose a kernelized value iteration algorithm. However, the algorithm requires the labeled training data for the true latent states.

Boots et al. (2013) proposes to model predictive state representations (PSRs) using conditional mean embeddings. The PSR represents a state as a set of predictions of future observable events which differ fundamentally from the latent variable model as it depends only on the observable quantities. For that reason, learning PSRs should be easier. The key idea of this work is to represent the state as nonparametric conditional mean embeddings in an RKHS. The benefit of mean embeddings here is that it allows for learning algorithms which work well for continuous actions and observations, unlike traditional algorithms which often run into trouble due to lack of data. Moreover, previous approaches often use kernel density estimation, which suffer from slow convergence and difficult numerical integration. The algorithm also updates the states using the kernel Bayes’ rule (Fukumizu et al. 2013). Lastly, it is claimed that although all prediction functions in the proposed algorithm are linear PSRs—i.e., there is a linear relationship between conditional probabilities of tests—it can still represent non-linear dynamics (Boots et al. 2013).

Finally, it should be emphasized that the traditional approach for MDP and POMDP involves a conditional density estimation which is difficult for high dimensional problems. In addition, computation of high dimensional integrals to obtain expectation can be very costly. On the other hand, the kernel mean embedding turns the expectation operator into an RKHS inner product, which has linear complexity in the number of training points. Moreover, the kernel mean estimates do not scale poorly with the dimension of the underlying space.

7 Conditional Dependency Measures

There exist numerous tests for conditional dependency such as partial correlation tests (Baba et al. 2004), conditional densities based tests (Su and White 2008), and permutation-based tests (Tsamardinos and Borboudakis 2010). These classical tests require either the parametric assumption of the underlying distribution, e.g., Gaussianity, or a linear relationship between random variables, or the discretization of the conditioning variable $Z$ or all. Thus, they still suffer from the curse of dimensionality which makes their application domains quite limited. In what follows, we will focus on the kernel-based conditional dependency tests that have been proposed for non-linear and non-Gaussian data to remedy some of the aforementioned issues.

Similar to the idea of HSIC described in §3.6, Fukumizu et al. (2008) proposes a nonparametric conditional dependence measure based on the normalized conditional cross-covariance operator

where $\mathcal{V}_{\mathit{YX}}$ is a unique bounded operator such that $\mathcal{C}_{YX}=\mathcal{C}_{YY}^{1/2}\mathcal{V}_{YX}\mathcal{C}_{XX}^{1/2}$ (Baker 1973).See also Theorem 3.2.2 for a brief discussion on the operator $\mathcal{V}_{\mathit{YX}}$ . The operators $\mathcal{V}_{YZ}$ and $\mathcal{V}_{ZX}$ can be defined in a similar way. Substituting $\mathcal{V}_{YX}=\mathcal{C}_{YY}^{-1/2}\mathcal{V}_{XY}\mathcal{C}_{XX}^{-1/2}$ (as well as $\mathcal{V}_{YZ}$ and $\mathcal{V}_{ZX}$ ) in (4.26) yields

Intuitively, the operator $\mathcal{C}_{\mathit{YX}|Z}$ can be viewed as a non-linear operator reminiscent of the conditional covariane matrix $\mathcal{C}_{\mathit{YX}|Z}=\mathcal{C}_{\mathit{YX}}-\mathcal{C}_{\mathit{YZ}}\mathcal{C}_{\mathit{ZZ}}^{-1}\mathcal{C}_{\mathit{ZX}}$ of Gaussian random variables. Given extended variables $\ddot{X}=(X,Z)$ and $\ddot{Y}=(Y,Z)$ , they show that $\mathcal{C}_{\mathit{\ddot{Y}\ddot{X}}|Z}=\mathbf{0}$ if and only if $X$ and $Y$ are conditionally independent given $Z$ , and propose the squared Hilbert-Schmidt norm $\|\mathcal{V}_{\mathit{\ddot{Y}\ddot{X}|Z}}\|^{2}_{\text{HS}}$ as a measure of conditional dependency and $\|\mathcal{V}_{\mathit{YX}}\|^{2}_{\text{HS}}$ as an unconditional counterpart. Note that $\mathcal{V}_{\mathit{\ddot{Y}\ddot{X}|Z}}=\mathbf{0}$ and $\mathcal{V}_{\mathit{YX}}=\mathbf{0}$ imply $\mathcal{C}_{\mathit{YX}}=\mathbf{0}$ and $\mathcal{C}_{\mathit{\ddot{Y}\ddot{X}}|Z}=\mathbf{0}$ , respectively. Although mathematically rigorous, it is not known how to analytically compute the null distribution of the test statistics. In Fukumizu et al. (2008), they resort to a bootstrapping approach which is computationally expensive. See Fukumizu et al. (2008; Section 2.2 and 2.3) for a rigorous treatment of this measure.

8 Causal Discovery

Causal inference from observational data concerns how the distribution of effect variable would have changed if we were to intervene on one of the cause variables while keeping all others fixed (Pearl 2000). There has recently been a great deal of research and interest on causality in the machine learning community (Schölkopf et al. 2012, Guyon 2013, Mooij et al. 2016, Guyon 2014). The kernel mean embedding has proven to be a successful mathematical tool in this research area.

Given a set of random variables $X=(X_{1},X_{2},\ldots,X_{d})$ , the goal of causal discovery is to uncover the underlying causal directed acyclic graph (DAG) denoted by $G(V,E)$ . Each vertex $V_{i}\in V$ corresponds to the random variable $X_{i}$ and an edge $E_{ij}$ from $V_{i}$ to $V_{j}$ indicates a direct causal relationship from $X_{i}$ to $X_{j}$ (denoted by $X_{i}\rightarrow X_{j}$ ). The causal relationships in the DAG are usually parametrized by a structural equation model (SEM)

where $f_{i}$ is an arbitrary function and $\text{Pa}(X_{i})$ is a parental set of $V_{i}$ . It is often assumed that $E_{i}$ are mutually independent, i.e., there is no hidden confounder.

Under Markov and faithfulness assumptions (Pearl 2000), the PC algorithm (Spirtes et al. 2000) exploits conditional dependencies to recover the Markov equivalence class of $G$ . Owing to this approach, several modern kernel-based conditional independence tests have been developed with this particular application in mind (Fukumizu et al. 2008, Zhang et al. 2011, Doran et al. 2014). Nevertheless, this approach is not suitable for the bivariate case in which we observe only two random variables $X$ and $Y$ . In this case, Schölkopf et al. (2015) recently proposed a kernel probabilistic programming (KPP) which provides an expression of a functional of random variables, e.g., $f(X,Y)=X\times Y$ , by means of kernel mean embedding. In particular, they apply it to causal inference by embedding the SEM associated with the additive noise model (ANM) and constructing a test based on such embeddings (Schölkopf et al. 2015; Theorem 4). For a comprehensive review of research along this direction, see, e.g., Mooij et al. (2016).

is exactly $m$ (Sgouritsa et al. 2013; Theorem 1). This allows them to use clustering to identify confounders under various scenarios.

Recently, Schölkopf et al. (2012) postulates that under a specific causal assumption there is an asymmetry in the decomposition

Chapter 5 Relationships between KME and Other Methods

In this section, we discuss the relationships between kernel mean embedding and other approaches ranging from kernel density estimation (KDE) to probabilistic models for Bayesian inference.

As shown in Gretton et al. (2012a), the $L_{2}$ distance between kernel density estimates $\hat{f}_{p}$ and $\hat{f}_{q}$ is a special case of the biased MMD in (3.31) between $\hat{\mu}_{p}$ and $\hat{\mu}_{q}$ . That is, let $\hat{f}_{p}$ and $\hat{f}_{q}$ be given by $\hat{f}_{p}(\mathbf{x})=\frac{1}{n}\sum_{i=1}^{n}\kappa_{\sigma}(\mathbf{x}_{i}-\mathbf{x})$ and $\hat{f}_{q}(\mathbf{x})=\frac{1}{m}\sum_{i=1}^{m}\kappa_{\sigma}(\mathbf{y}_{i}-\mathbf{x})$ , respectively. Then, the $L_{2}$ distance between $\hat{f}_{p}$ and $\hat{f}_{q}$ can be written as

Another popular approach for representing a probability distribution is through the empirical characteristic function (ECF) (Feuerverger and Mureika 1977, Alba Fernàndez et al. 2008). There is a one-to-one correspondence between characteristic functions and distributions and Feuerverger and Mureika (1977) has shown that the ECF converges uniformly over compact sets almost surely to the population characteristic function. As discussed in §3.3.1, we may view both the kernel mean embedding and the characteristic function as integral transforms of the distribution. Kankainen and Ushakov (1998), for instance, proposes a consistent and asymptotically distribution-free test for independence based on the empirical characteristic function; see also Kankainen (1995). Unfortunately, ECF can be difficult to obtain in many cases, especially for conditional distributions.

2 Classical Kernel Machines

The kernel mean has appeared in the literature since the beginning of the field of kernel methods itself (Schölkopf et al. 1998, Schölkopf and Smola 2002). Classical algorithms for classification and anomaly detection employ a mean function in the \etb@resrvdarkhs as their building block. Given a data set $(\mathbf{x}_{1},y_{1}),(\mathbf{x}_{2},y_{2}),\ldots,(\mathbf{x}_{n},y_{n})$ where $y_{i}\in\{-1,+1\}$ , Shawe-Taylor and Cristianini (2004; Chapter 4), for example, considers a simple classifier that classifies a data point $\mathbf{x}_{*}$ by measuring the \etb@resrvdarkhs distance between $\phi(\mathbf{x}_{*})$ and the class-conditional means in the feature space

3 Distance-based Two-Sample Test

4 Fourier Optics

Harmeling et al. (2013) establishes a link between Fourier optics and kernel mean embedding from the computer vision viewpoint. A simple imaging system can be described by the so-called incoherent imaging equation

where both $q(\mathbf{u})$ and $p(\bm{\xi})$ describe image intensities. The function $f$ represents the impulse response function, i.e., the point spread function (PSF), of the imaging system. In this case, the image $p(\bm{\xi})$ induces, up to normalization, a probability measure which represents the light distribution of the object being imaged. The kernel $f(\mathbf{u}-\bm{\xi})$ in (5.3), which is shift-invariant, can be interpreted physically as the point response of an optical system. Based on this interpretation, Harmeling et al. (2013) asserts that the Fraunhofer diffraction is in fact a special case of kernel mean embedding and that in theory an object $p(\bm{\xi})$ with bounded support can be recovered completely from its diffraction-limited image, using an argument from the injectivity of the mean embedding (Fukumizu et al. 2004, Sriperumbudur et al. 2008). In other words, Fraunhofer diffraction does not destroy any information. A simple approach to compute the inversion in practice is also given in Harmeling et al. (2013).

5 Probabilistic Perspective

Chapter 6 Future Directions

We have seen that kernel mean embedding of distributions has made a considerable impact in machine learning and statistics. Nevertheless, some questions remain open and there are numerous potential research directions. In this section, we give suggestions on some of the directions we believe to be important.

How to choose the “right” kernel function remains an ultimate problem in kernel methods and there is no exception to the kernel mean embedding. For some applications, one may argue that characteristic kernels should be sufficient as they have been shown theoretically to preserve all necessary information about the distributions. However, this property may not hold empirically because we only have access to finite samples. In which case, prior knowledge about the problem becomes more relevant in choosing the “right” kernel. It is also not trivial how to choose parameter values of such a kernel. These issues have been addressed in Sriperumbudur et al. (2009), Gretton et al. (2012b), for example. Finally, understanding how to interpret the associated representation of distributions is essential for several applications in statistics.

2 Stochastic Processes and Bayesian Interpretation

A probability distribution is often defined over finite dimensional objects. Highly structured data can be taken into account using a kernel, but it is not clear how to employ kernel mean embedding for stochastic processes—i.e., distributions over infinite dimensional objects—in a similar manner to the distribution. See Chwialkowski and Gretton (2014) and Chwialkowski et al. (2014) for some preliminary results. This line of research has been quite fruitful in Bayesian nonparametrics. Hence, extending the kernel mean embedding to a Bayesian nonparametric framework provides potential applications of probabilistic inference. It is also interesting to obtain a Bayesian interpretation of the kernel mean embedding. Having an elegant interpretation could potentially lead to several extensions of the previous studies along the line of Bayesian inference, e.g., Gaussian processes.

3 Scalability

In the era of “big data”, it is imperative that modern learning algorithms are able to deal with increasingly complex and large-scale data. Recently, there has been a growing interest in developing large-scale kernel learning, which is probably inspired by the lack of theoretical insight of deep neural networks despite their success in various application domains. The advances along this direction will benefit the development of algorithms using kernel mean embedding. In the context of MMD, many recent studies have addressed this issue (Zaremba et al. 2013, Cruz Cortés and Scott 2014, Zhao and Meng 2015, Chwialkowski et al. 2015).

4 High-dimensional Inference

It has been observed that the improvement of a shrinkage estimator tends to increase as the ambient dimensionality increases. Kernel-based methods are known to be less prone to the curse of dimensionality compared to classical approaches such as kernel density estimation, but little is known about the underlying theory. Better understanding of the role of ambient dimension in kernel methods may shed light in novel applications of kernel mean embedding in high-dimensional regime. There are some recent studies on high-dimensional analysis of MMD and related tests: see Reddi et al. (2015), Ramdas et al. (2015a; b) for example.

5 Causality

Causal inference involves the investigation of how the distribution of outcome changes as we change some other variable. Several frameworks for causal inference using kernel mean embedding have been proposed. There have been some recent studies in this direction (Zhang et al. 2011, Sgouritsa et al. 2013, Chen et al. 2014, Lopez-Paz et al. 2015), but it remains a challenging problem. For example, Lopez-Paz et al. (2015) considers bivariate causal inference as a classification task on the joint distributions of cause and effect variables. In the potential outcome framework, the causal effect is defined as the difference between the distributions of outcomes under control and treatment populations. Due to the fundamental problem of causal inference, either one of them would never be observed in practice. In this case, the question is whether we can use the kernel mean embedding to represent the counterfactual distribution.

6 Privacy-Preserving Embedding

Privacy is an essential requirement in medicine and finance—especially in the age of big data—because we now have access to an unprecedented amount of data of individuals that could compromise their privacy. Unlike most parametric models, most applications of kernel mean embedding such as two sample testing and dependency measures often require direct access to individual data points which renders these algorithms vulnerable to adversarial attack. As a result, it is imperative to ask to what extent we can still achieve the same level of performance under an additional requirement that the privacy must be preserved. Another challenge in the context of kernel mean embedding is to preserve the privacy at the distributional level.

7 Statistics, Machine Learning, and Invariant Distribution Embeddings

Many statistical properties of a probability distribution are independent of the input space on which it is defined. For example, independence implies $p(\mathbf{x},\mathbf{y})=p(\mathbf{x})p(\mathbf{y})$ regardless of $\mathcal{X}$ and $\mathcal{Y}$ . Therefore, there is a need to develop an invariant representation for distributions which will allow us to deal with such distributions simultaneously across different domains. This kind of knowledge is known as domain-general knowledge in cognitive science (Goodman et al. 2011).

Chapter 7 Conclusions

A kernel mean embedding of distributions has emerged as a powerful tool for probabilistic modeling, statistical inference, machine learning, and causal discovery. The basic idea is to map distributions into a reproducing kernel Hilbert space (RKHS) where the whole arsenal of kernel methods can be extended to probability measures. It has given rise to a great deal of research and novel applications of positive definite kernels in both statistics and machine learning. In this survey, we gave a comprehensive review of existing work and recent advances in this research area. In the course of the survey, we also discussed some of the most challenging issues and open problems that could potentially lead to new research directions.

The survey began with a brief introduction to the reproducing kernel Hilbert space (RKHS) which forms the backbone of this survey (Section 2). As the name suggests, the kernel mean embedding owes its success to the powerful concept of a positive definite function commonly known as the kernel function. We provided a short review of its theoretical properties and classical applications. Then, we reviewed the Hilbert-Schmidt operators which are vital ingredients in modern applications of kernel mean embedding.

Next, we provided a thorough discussion of the Hilbert space embedding of marginal distributions, theoretical guarantees, and a review of its applications (Section 3). To summarize, the Hilbert space embedding of distributions allows us to apply RKHS methods to probability measures. This extension prompts a wide range of applications such as kernel two-sample testing, independence testing, group anomaly detection, MCMC methods, predictive learning on distributions, and causal inference.

The survey then generalized the idea of embedding to conditional distributions, gave theoretical insights, and reviewed some applications (Section 4). The conditional mean embedding allows us to perform sum, product, and Bayes’ rules which are ubiquitous in graphical models, probabilistic inference, and reinforcement learning, for example, in a non-parametric way. Furthermore, the conditional mean embedding admits a natural interpretation as a solution to a vector-valued regression problem. Then, we mentioned some relationships between the kernel mean embedding and other related areas (Section 5). Lastly, we gave some suggestions for future research directions (Section 6).

We hope that this survey will become a useful reference for graduate students and researchers in machine learning and statistics who are interested in the theory and applications of the kernel mean embedding of distributions.

KM would like to thank Anant Raj for reading the first draft of this survey and providing several helpful comments. We thank Wittawat Jitkrittum, Sebastian Nowozin, and Dino Sejdinovic for detailed and thoughtful comments about the final draft. We also thank Motonobu Kanagawa for pointing out the mistake in one of his paper, and gave detailed explanation of the mistake, which we have included in the survey. Last but not least, we are in debt to the editor and anonymous reviewers who provided insightful reviews that helped to improve the survey significantly.