Learning Sets with Separating Kernels

Ernesto De Vito, Lorenzo Rosasco, Alessandro Toigo

Introduction

In this paper we study the problem of learning from data the set where the data probability distribution is concentrated. Our study is more broadly motivated by questions in unsupervised learning, such as the problem of inferring geometric properties of probability distributions from random samples.

In recent years, there has been great progress in the theory and algorithms for supervised learning, i.e. function approximation problems from random noisy data . On the other hand, while there are a number of methods and studies in unsupervised learning, e.g. algorithms for clustering, dimensionality reduction, dictionary learning (see Chapter 14 of ), many interesting problems remain largely unexplored.

Our analysis starts with the observation that many studies in unsupervised learning hinge on at least one of the following two assumptions. The first is that the data are distributed according to a probability distribution which is absolutely continuous with respect to a reference measure, such as the Lebesgue measure. In this case it is possible to define a density and the corresponding density level sets. Studies in this scenario include to name a few. Such an assumption prevents considering the case where the data are represented in a high dimensional Euclidean space but are concentrated on a Lebesgue negligible subset, as a lower dimensional submanifold. This motivates the second assumption – sometimes called manifold assumption – postulating that the data lie on a low dimensional Riemannian manifold embedded in an Euclidean space. This latter idea has triggered a large number of different algorithmic and theoretical studies (see for example ). Though the manifold assumption has proved useful in some applications, there are many practical scenarios where it might not be satisfied. This observation has motivated considering more general situations such as manifold plus noise models , and models where the data are described by combinations of more than one manifold .

Here we consider a different point of view and work in a setting where the data are described by an abstract probability space and a similarity function induced by a reproducing kernel . In this framework, we consider the basic problem of estimating the set where the data distribution is concentrated (see Section 1.2 for a detailed discussion of related works). A special class of reproducing kernels, that we call separating kernels, plays a special role in our study. First, it allows to define a suitable metric on the probability space and makes the support of the distribution well defined; second, it leads to a new analytical characterization of the support in terms of the null space of the integral operator associated to the reproducing kernel.

This last result is the key towards a new computational approach to learn the support from data, since the integral operator can be approximated with high probability from random samples . Estimation of the null space of the integral operator can be unstable, and regularization techniques can be used to obtain stable estimators. In this paper we study a class of regularization techniques proposed to solve ill-posed problems and already studied in the context of supervised learning . Regularization is achieved by filtering out the small eigenvalues of the sample empirical matrix defined by the kernel. Different algorithms are defined by different filter functions and have different computational properties. Consistency and stability properties for a large class of spectral filters and of the corresponding algorithms are established in a unified framework. Numerical experiments show that the proposed algorithms are competitive, and often better, than other state of the art techniques.

The paper is divided into two parts. The first part includes Section 2, where we establish several mathematical results relating reproducing kernel Hilbert spaces of functions on a set XX and the geometry of the set XX itself. In particular, in this section we introduce the concept of separating kernel, which we further explore in Section 3. These results are of interest in their own right, and are at the heart of our approach. In the second part of the paper we discuss the problem of learning the support from data. More precisely, in Section 4 we illustrate some algorithms for learning the support of a distribution from random samples. In Section 5 we establish universal consistency for the proposed methods and discuss stability to random sampling. We conclude in Section 6 and 7 with some further discussions and some numerical experiments, respectively. A conference version of this paper appeared in . We now start by describing in some more detail our results and discussing some related works.

In this section we briefly describe the main ideas and results in the paper.

The setting we consider is described by a probability space (X,ρ)(X,\rho) and a measurable reproducing kernel KK on the set XX . The data are independent and identically distributed (i.i.d.) samples x1,,xnx_{1},\dots,x_{n}, each one drawn from XX with probability ρ\rho. The reproducing kernel KK reflects some prior information on the problem and, as we discuss in the following, will also define the geometry of XX. The goal is to use the sample points x1,,xnx_{1},\dots,x_{n} to estimate the region where the probability measure ρ\rho is concentrated.

To fix some ideas, the space XX can be thought of as a high-dimensional Euclidean space and the distribution ρ\rho as being concentrated on a region XρX_{\rho}, which is a smaller – and potentially lower dimensional – subset of XX (e.g. a linear subspace or a manifold). In this example, the goal is to build from data an estimator XnX_{n} which is, with high probability, close to XρX_{\rho} with respect to a suitable metric.

We first note that a precise definition of XρX_{\rho} requires some care. If ρ\rho is assumed to have a continuous density with respect to some fixed reference measure (for example, the Lebesgue measure in the Euclidean space), then the region XρX_{\rho} can be easily defined to be the closure of the set of points where the density function is non-zero. Nevertheless, this assumption would prevent considering the situation where the data are concentrated on a “small”, possibly lower dimensional, subset of XX. Note that, if the set XX were endowed with a topological structure and ρ\rho were defined on the corresponding Borel σ\sigma-algebra, it would be natural to define XρX_{\rho} as the support of the measure ρ\rho, i.e. the smallest closed subset of XX having measure one. However, since the set XX is only assumed to be a measurable space, no a priori given topology is available. Here we also remark that the definition of XρX_{\rho} is not the only point where some further structure on XX would be useful. Indeed, when defining a learning error, a notion of distance between the set XρX_{\rho} and its estimator XnX_{n} is also needed and hence some metric structure on XX is required.

The idea is to use the properties of the reproducing kernel KK to induce a metric structure – and consequently a topology – on XX. Indeed, under some mild technical assumptions on KK, the function

defines a metric on XX, thus making XX a topological space. Then, it is natural to define XρX_{\rho} to be the support of ρ\rho with respect to such metric topology. Moreover, the Hausdorff distance dHd_{H} induced by the metric dKd_{K} provides a notion of distance between closed sets.

The problem we consider can now be restated as follows: we want to learn from data an estimator XnX_{n} of XρX_{\rho}, such that limndH(Xn,Xρ)=0\lim_{n\to\infty}d_{H}(X_{n},X_{\rho})=0 almost surely. While XρX_{\rho} is now well defined, it is not clear how to build an estimator from data. A main result in the paper, given in Theorem 3, provides a new analytic characterization of XρX_{\rho}, which immediately suggests a new computational solution for the corresponding learning problem. To derive and state this result, we introduce a new notion of reproducing kernels, called separating kernels, that, roughly speaking, captures the sense in which the reproducing kernel and the probability distribution need to be related. We say that a reproducing kernel Hilbert space H\mathcal{H} (or equivalently its kernel) separates a subset CXC\subset X, if, for any x∉Cx\not\in C, there exists fHf\in\mathcal{H} such that

If KK separates all possible closed subsets in XX, we say that it is completely separating. Figure 1 illustrates the notion of separating kernel in the simple example of the linear kernel in a Euclidean space.

Now, Theorem 3 states that, if either KK is completely separating, or at least separates XρX_{\rho}, then XρX_{\rho} is the level set of a suitable distribution dependent continuous function FρF_{\rho}. More precisely, let H{\mathcal{H}} be the reproducing kernel Hilbert space associated to KK , T:HHT:\mathcal{H}\to\mathcal{H} the integral operator with kernel KK, and denote by TT^{\dagger} its pseudo-inverse. If we consider the function FρF_{\rho} on XX, defined by

and KK separates XρX_{\rho}, then we prove that

(where for simplicity we are assuming K(x,x)=1K(x,x)=1 for all xXx\in X).

The above result is crucial since the integral operator TT can be approximated with high probability from data (see and references therein). However, since the definition of FρF_{\rho} involves the pseudo-inverse of TT, the support estimation problem can be unstable and regularization techniques are needed to ensure stability. With this in mind, we propose and study a family of spectral regularization techniques which are classical in inverse problems and have been considered in supervised learning in . We define an estimator by

where Fn(x)=(1/n)Kxgλn(Kn/n)KxF_{n}(x)=(1/n){\mathbf{K}}_{{\mathbf{x}}}^{\top}\,g_{\lambda_{n}}({\mathbf{K}_{n}}/n){\mathbf{K}}_{{\mathbf{x}}}, with (Kn)i,j=K(xi,xj){\mathbf{(}{\mathbf{K}}_{n})}_{i,j}=K(x_{i},x_{j}), Kx{\mathbf{K}}_{{\mathbf{x}}} is the column vector whose ii-th entry is K(xi,x)K(x_{i},x), and Kx{\mathbf{K}}_{{\mathbf{x}}}^{\top} is its transpose. Here gλn(Kn/n)g_{\lambda_{n}}({\mathbf{K}_{n}}/n) is a matrix defined via spectral calculus by a spectral filter function gλng_{\lambda_{n}} that suppresses the contribution of the eigenvalues smaller than λn\lambda_{n}. Examples of spectral filters include Tikhonov regularization and truncated singular values decomposition , to name a few.

This class of methods can be studied within a unified framework, and the error analysis in the paper establishes strong universal consistency if XρX_{\rho} is separated by KK. More precisely, under the latter assumption, we show in Theorem 6 that,

provided that XX is compact and the sequences (τn)n1,(λn)n1(\tau_{n})_{n\geq 1},(\lambda_{n})_{n\geq 1} are chosen so that,

where LλnL_{\lambda_{n}} is the Lipschitz constant of the function rλn(σ)=σgλn(σ)r_{\lambda_{n}}(\sigma)=\sigma g_{\lambda_{n}}(\sigma). The above result is universal in the sense that consistency can be shown without assuming regularity condition on ρ\rho or XρX_{\rho}.

The proof of the above result crucially depends on estimating the deviation between FnF_{n} and FρF_{\rho}. Indeed, for the above choice of the sequence λn)n1\lambda_{n})_{n\geq 1} we show that

Under suitable distribution dependent assumptions, the above result can be further developed to obtain finite sample bounds quantifying stability to random sampling. Indeed, if the couple (ρ,K)(\rho,K) is such that supxXTs/2TTKx<+\sup_{x\in X}\left\lVert{T^{-s/2}T^{\dagger}TK_{x}}\right\rVert<+\infty, with 0<s10<s\leq 1, and the eigenvalues of the (compact and positive) operator TT satisfy σjj1/b\sigma_{j}\sim j^{-1/b} for some 0<b10<b\leq 1, then we prove in Theorem 7 that, for n1n\geq 1 and δ>0\delta>0, we have

with probability at least 12eδ1-2e^{-\delta}, for λn=n1/(2s+b+1)\lambda_{n}=n^{-1/(2s+b+1)} and a suitable constant Cs,b,δC_{s,b,\delta} which does not depend on nn.

2 State of the art

where B(x,ϵ)B(x,\epsilon) is the ball of center xx and radius ϵ\epsilon, and ϵn\epsilon_{n} slowly goes to zero when nn tends to infinity. Consistency and minimax converges rates are studied in with respect to the distance

where C1C2=(C1C2)(C2C1)C_{1}\triangle C_{2}=(C_{1}\setminus C_{2})\cup(C_{2}\setminus C_{1}) and μ\mu is a suitable known measure. If ρ\rho has a density ff with respect to some known measure μ\mu, a traditional approach is based on a non-parametric estimator fnf_{n} of ff, a so called plug-in estimator. A kernel based class of plug-in estimators is proposed in , namely

where hnh_{n} is a regularization parameter and cnc_{n} is a suitable threshold. Convergence rates with respect to dμd_{\mu} are provided in . A related problem is level set estimation, where the goal is to detect the high density regions {xXf(x)c}\left\{x\in X\mid f(x)\geq c\right\}. Consistency and optimal convergence rates for different plug-in estimators

Kernels, Integral Operators and Geometry in Spaces of Probabilities

In this section we establish the results that provide the foundations of our approach. The basic framework in this paper is described by a triple (X,ρ,K)(X,\rho,K), where

XX is a set (endowed with a σ\sigma-algebra AX\mathcal{A}_{X});

ρ\rho is a probability measure defined on XX;

KK is a (real) reproducing kernel on XX, i.e. a real function on X×XX\times X of positive type.

We interpret XX as the data space and ρ\rho as the probability distribution generating the data. Roughly speaking, the kernel KK provides a natural similarity measure on XX and it defines its geometry.

We denote by H\mathcal{H} the reproducing kernel Hilbert space associated with the reproducing kernel KK (we refer to for an exhaustive review on the theory of reproducing kernel Hilbert spaces). The scalar product and norm in H\mathcal{H} are denoted by ,\left\langle{\cdot},{\cdot}\right\rangle and \left\lVert{\cdot}\right\rVert, respectively. We recall that the elements of H\mathcal{H} are real functions on XX, and the reproducing property f(x)=f,Kxf(x)=\left\langle{f},{K_{x}}\right\rangle holds true for all xXx\in X and fHf\in\mathcal{H}, where KxHK_{x}\in\mathcal{H} is defined by Kx(y)=K(y,x)K_{x}(y)=K(y,x).

In order to prove our results, we need some technical conditions on KK.

The kernel KK has the following properties:

for all x,yXx,y\in X with xyx\neq y we have KxKyK_{x}\neq K_{y};

the associated reproducing kernel Hilbert space H\mathcal{H} is separable;

the real function KK is measurable with respect to the product σ\sigma-algebra AXAX\mathcal{A}_{X}\otimes\mathcal{A}_{X};

Assumptions 1.a), 1.b) and 1.c) are minimal requirements. In particular, Assumptions 1.a) and 1.b) are needed in order to define a separable metric structure on XX, while Assumption 1.c) ensures that such metric topology is compatible with the σ\sigma-algebra AX\mathcal{A}_{X} (see Proposition 1 below). In Proposition 2, the combination of 1.a), 1.b) and 1.c) will allow us to define the support XρX_{\rho} of the probability measure ρ\rho, as anticipated in Section 1.1. Assumption 1.d), instead, is a normalization requirement, and could be replaced by a suitable boundedness condition (in fact, even weaker integrability conditions could also be considered). We choose the normalization K(x,x)=1K(x,x)=1 xX\forall x\in X since it makes equations more readable, and it is not restrictive in view of Proposition 13 in A.1.

We now show how the above assumptions allow us to define a metric on XX and to characterize the corresponding support of ρ\rho in terms of the integral operator with kernel KK.

Our first result makes XX a separable metric space isometrically embedded in H\mathcal{H}. This point of view is developed in . The relation between metric spaces isometrically embedded in Hilbert spaces and kernels of positive type was studied by Schoenberg around 1940. A recent discussion on this topic can be found in Chapter 2 § 3 of .

Under Assumption 1.a), the map dK:X×X[0,+[d_{K}:X\times X\to[0,+\infty[ defined by

the map xKxx\mapsto K_{x} is an isometry from XX into H\mathcal{H};

the kernel KK is a continuous function on X×XX\times X, and each fHf\in\mathcal{H} is a continuous function.

If also Assumption 1.b) is satisfied, then

the metric space (X,dK)(X,d_{K}) is separable.

Finally, if also Assumption 1.c) holds true, then

the closed subsets of XX are measurable (with respect to AX\mathcal{A}_{X});

if YY is a topological space endowed with its Borel σ\sigma-algebra and f:XYf:X\to Y is continuous, then ff is measurable; in particular, the functions in H\mathcal{H} are measurable.

Many of these properties are known in the literature, see for example and references therein. For the reader’s convenience, we give a self-contained short proof. Assumption 1.a) states that the map xKxx\mapsto K_{x} is injective. Since dK(x,y)=KxKyd_{K}(x,y)=\left\lVert{K_{x}-K_{y}}\right\rVert by definition, dKd_{K} is the metric on XX making xKxx\mapsto K_{x} an isometry, as claimed in item i). About ii), the kernel KK is continuous since K(x,y)=Ky,KxK(x,y)=\left\langle{K_{y}},{K_{x}}\right\rangle and the map xKxx\mapsto K_{x} is continuous by item i); furthermore, the elements of H\mathcal{H} are continuous by the reproducing property f(x)=f,Kxf(x)=\left\langle{f},{K_{x}}\right\rangle. If also Assumption 1.b) holds true, then the set {KxxX}{\left\{K_{x}\mid x\in X\right\}} is separable in H\mathcal{H}, and so is XX as the map xKxx\mapsto K_{x} is isometric from XX onto {KxxX}\left\{K_{x}\mid x\in X\right\}. Item iii) then follows. Suppose now that also Assumption 1.c) holds true. Then the map dKd_{K} is a measurable map, so that the open balls of XX are measurable. Since XX is separable, any open set is a countable union of open balls, hence it is measurable. It follows that the closed subsets are measurable, too, hence item iv). Let YY and ff be as in item v). If AYA\subset Y is closed, then f1(A)f^{-1}(A) is closed in XX, hence measurable by item iv). It follows that f1(A)f^{-1}(A) is measurable for all Borel sets AYA\subset Y, i.e. ff is measurable. Since the elements of H\mathcal{H} are continuous by ii), they are measurable, and item v) is proved. ∎

In the rest of the paper we will always consider XX as a topological metric space with metric dKd_{K}. Note that dKd_{K} is the metric induced on XX by the norm of H\mathcal{H} through the embedding xKx{x\mapsto K_{x}}. The next result shows that under our assumptions we can define the set XρX_{\rho} as the smallest closed subset of XX having measure one.

Under Assumptions 1.a), 1.b) and 1.c), there exists a unique closed subset XρXX_{\rho}\subset X with ρ(Xρ)=1\rho(X_{\rho})=1 satisfying the following property: if CC is a closed subset of XX and ρ(C)=1\rho(C)=1, then CXρC\supset X_{\rho}.

Clearly, XρX_{\rho} is closed and measurable by Proposition 1. Since XX is separable, there exists a sequence of closed subsets (Cj)j1(C_{j})_{j\geq 1} such that every closed subset C=CjkC=\cap C_{j_{k}}, for some suitable subsequence. Hence, Xρ=jρ(Cj)=1CjX_{\rho}=\bigcap\limits_{j\mid\rho(C_{j})=1}C_{j} and, as a consequence, ρ(Xρ)=1\rho(X_{\rho})=1. ∎

We add one remark. The set XρX_{\rho} is called the support of the measure ρ\rho and clearly depends both on the probability distribution and on the topology induced by the kernel KK through the metric dKd_{K} on XX.

2 Separating Kernels

The following definition of separating kernel plays a central role in our approach.

We say that the reproducing kernel Hilbert space H\mathcal{H} separates a subset CXC\subset X, if, for all x∉Cx\not\in C, there exists fHf\in\mathcal{H} such that

In this case we also say that the corresponding reproducing kernel separates CC.

We add some comments. First, in (2) the function ff depends on xx and CC. Second, the reproducing property and (2) imply that Kx0K_{x}\neq 0 and KxKyK_{x}\neq K_{y} for all x∉Cx\not\in C and yCy\in C (compare with Assumption 1.a)). Finally, we stress that a different notion of separating property is given in .

Next, we provide an equivalent characterization of the separating property, which will be the key to a computational approach to support estimation. For any set CC, let PC:HHP_{C}:\mathcal{H}\to\mathcal{H} be the orthogonal projection onto the closed subspace

i.e. the closure of the linear space generated by the family {KxxC}\left\{K_{x}\mid x\in C\right\}. Note that PC2=PCP_{C}^{2}=P_{C}, PC=PCP_{C}^{\top}=P_{C} and

The Hilbert space HC\mathcal{H}_{C} is a closed subspace of the reproducing kernel Hilbert space H\mathcal{H}, and it is itself a reproducing kernel Hilbert space of functions on XX with reproducing kernel KC(x,y)=PCKy,PCKx=PCKy,KxK_{C}(x,y)=\left\langle{P_{C}K_{y}},{P_{C}K_{x}}\right\rangle=\left\langle{P_{C}K_{y}},{K_{x}}\right\rangle. Note that KC(x,y)=K(x,y)K_{C}(x,y)=K(x,y) for all x,yCx,y\in C by definition of PCP_{C}. Clearly, the function FCF_{C} corresponds to the value of KCK_{C} on the diagonal.

For any subset CXC\subset X, the following facts are equivalent:

for all x∉Cx\not\in C, KxranPCK_{x}\notin\operatorname{ran}\,P_{C};

C={xXFC(x)=K(x,x)}\displaystyle{C=\left\{x\in X\mid F_{C}(x)=K(x,x)\right\}};

{KxxC}={KxxX}ranPC\displaystyle{{\left\{K_{x}\mid x\in C\right\}=\left\{K_{x}\mid x\in X\right\}}\cap\operatorname{ran}\,P_{C}}.

Under Assumption \refA.\refA3)\ref{A}.\ref{A3}), if CC is separated by H\mathcal{H}, then CC is closed with respect to the metric dKd_{K}.

We first prove that i) \Rightarrow ii). Given xCx\notin C, by assumption there is fHf\in\mathcal{H} such that f,Kx=f(x)0\left\langle{f},{K_{x}}\right\rangle=f(x)\neq 0, i.e. Kx∉{f}K_{x}\not\in\left\{f\right\}^{\perp}, and f,Ky=f(y)=0\left\langle{f},{K_{y}}\right\rangle=f(y)=0 for all yCy\in C, i.e. fkerPC=ranPCf\in\ker{P_{C}}=\operatorname{ran}\,P_{C}^{\perp}. It follows that ranPC{f}\operatorname{ran}\,P_{C}\subset\left\{f\right\}^{\perp}, and then KxranPCK_{x}\notin\operatorname{ran}\,P_{C}. We prove ii) \Rightarrow iii). If xCx\in C, then KxranPCK_{x}\in\operatorname{ran}\,P_{C} by definition of PCP_{C}, so that FC(x)=K(x,x)F_{C}(x)=K(x,x). Hence C{xXFC(x)=K(x,x)}C\subset\left\{x\in X\mid F_{C}(x)=K(x,x)\right\}. If x∉Cx\not\in C, then by assumption PCKxKxP_{C}K_{x}\neq K_{x}, i.e. (IPC)Kx0(I-P_{C})K_{x}\neq 0. By the equality

this implies FC(x)K(x,x)F_{C}(x)\neq K(x,x). Hence C{xXFC(x)=K(x,x)}C\supset\left\{x\in X\mid F_{C}(x)=K(x,x)\right\}. We prove iii) \Rightarrow i). If x∉Cx\not\in C, define f=(IPC)KxkerPCf=(I-P_{C})K_{x}\in\ker{P_{C}}, so that f(y)=0f(y)=0 for all yCy\in C. Furthermore, f(x)=K(x,x)FC(x)0f(x)=K(x,x)-F_{C}(x)\neq 0. Thus, ff separates the set CC. Finally, iv) is a restatement of ii) taking into account that KxranPCK_{x}\in\operatorname{ran}\,P_{C} for all xCx\in C by construction. Under Assumption \refA.\refA3)\ref{A}.\ref{A3}), the map xFC(x)K(x,x)=PCKx,KxK(x,x)x\mapsto F_{C}(x)-K(x,x)=\left\langle{P_{C}K_{x}},{K_{x}}\right\rangle-K(x,x) is continuous by Proposition 1. By item iii), CC is the -level set of this function, hence CC is closed. ∎

Proposition 13 in A.1 shows that the reproducing kernel KK can be normalized under the mild assumption that K(x,x)0K(x,x)\neq 0 for all xXx\in X, so that Assumption 1.d) can be satisfied up to a rescaling of KK.

It may be the case that the set XX has its own metric dXd_{X}, and the σ\sigma-algebra AX\mathcal{A}_{X} is the Borel σ\sigma-algebra associated with the topology induced by dXd_{X}. The following proposition shows that the metrics dKd_{K} and dXd_{X} induce the same topology on XX, provided that H\mathcal{H} separates all the dXd_{X}-closed subsets and the corresponding kernel is continuous.

Let XX be a separable metric space with respect to a metric dXd_{X}, and AX\mathcal{A}_{X} the corresponding Borel σ\sigma-algebra. Let H\mathcal{H} be a reproducing kernel Hilbert space on XX with kernel KK. Assume that the kernel KK is a continuous function with respect to dXd_{X} and that the space H\mathcal{H} separates every subset of XX which is closed with respect to dXd_{X}. Then

Assumptions \refA.\refA3)\ref{A}.\ref{A3}), \refA.\refA1)\ref{A}.\ref{A1}) and \refA.\refA2)\ref{A}.\ref{A2}) hold true, and K(x,x)>0K(x,x)>0 for all xXx\in X;

a set is closed with respect to dKd_{K} if and only if it is closed with respect to dXd_{X}.

The kernel is measurable and the space H\mathcal{H} is separable by Proposition 5.15.1 and Corollary 5.25.2 in . Since the points are closed sets for dXd_{X} and the dXd_{X}-closed sets are separated by H\mathcal{H}, then Kx0K_{x}\neq 0 (i.e. K(x,x)>0K(x,x)>0) for all xXx\in X and KxKyK_{x}\neq K_{y} if xyx\neq y by the discussion following Definition 1. We show that dXd_{X} and dKd_{K} are equivalent metrics. Take a sequence (xj)j1(x_{j})_{j\geq 1} such that for some xXx\in X it holds that limjdX(xj,x)=0\lim_{j\to\infty}d_{X}(x_{j},x)=0. Since KK is continuous with respect to dXd_{X}, we have limjdK(xj,x)=0\lim_{j\to\infty}d_{K}(x_{j},x)=0. Hence, the dKd_{K}-closed sets are dXd_{X}-closed, too. Conversely, if the set CC is dXd_{X}-closed, since H\mathcal{H} separates CC, Theorem 1 implies that C={xXK(x,x)FC(x)=0}C=\left\{x\in X\mid K(x,x)-F_{C}(x)=0\right\}, which is a dKd_{K}-closed set by dKd_{K}-continuity of the map xK(x,x)FC(x)x\mapsto K(x,x)-F_{C}(x). ∎

Item ii) of the above proposition states that the metrics dKd_{K} and dXd_{X} are equivalent and implies that the set XρX_{\rho} defined in Proposition 2 coincides with the support of ρ\rho with respect to the topology induced by dXd_{X}.

3 The Integral Operator Defined by the Kernel

We denote by S1\mathcal{S}_{1} the Banach space of the trace class operators on H\mathcal{H}, with trace class norm

where {ei}iI\left\{e_{i}\right\}_{i\in I} is any orthonormal basis of H\mathcal{H}. Furthermore, we let S2\mathcal{S}_{2} be the separable Hilbert space of the Hilbert-Schmidt operators on H\mathcal{H}, with Hilbert-Schmidt norm

Finally, if AA is any bounded operator on H\mathcal{H}, we denote by A\left\lVert{A}\right\rVert_{\infty} its uniform operator norm. It is standard that AAS2AS1\left\lVert{A}\right\rVert_{\infty}\leq\left\lVert{A}\right\rVert_{\mathcal{S}_{2}}\leq\left\lVert{A}\right\rVert_{\mathcal{S}_{1}}. Moreover, for all functions f1,f2Hf_{1},f_{2}\in\mathcal{H}, the rank-one operator f1f2f_{1}\otimes f_{2} on H\mathcal{H} defined by

is trace class, and f1f2S1=f1f2S2=f1f2\left\lVert{f_{1}\otimes f_{2}}\right\rVert_{\mathcal{S}_{1}}=\left\lVert{f_{1}\otimes f_{2}}\right\rVert_{\mathcal{S}_{2}}=\left\lVert{f_{1}}\right\rVert\left\lVert{f_{2}}\right\rVert.

We recall a few facts on integral operators with kernel KK (see for proofs and further discussions). Under Assumption 1, the S1\mathcal{S}_{1}-valued map xKxKxx\mapsto K_{x}\otimes K_{x} is Bochner-integrable with respect to ρ\rho, and its integral

defines a positive trace class operator TT with TS1=tr[T]=1\left\lVert{T}\right\rVert_{\mathcal{S}_{1}}=\operatorname{tr}\left[T\right]=1 (a short proof is given in Proposition 14 of the Appendix). Using the reproducing property of H\mathcal{H}, it is straightforward to see that TT is simply the integral operator with kernel KK acting on H\mathcal{H}, i.e.

The following is a key result in our approach.

Under Assumption 1, the null space of TT is

where XρX_{\rho} is the support of ρ\rho as defined in Proposition 2.

Note that , for all fHf\in\mathcal{H}, the set

is closed since ff is continuous. We now prove Equation (5). Since TT is a positive operator, spectral theorem gives that Tf=0Tf=0 if and only if Tf,f=0\left\langle{Tf},{f}\right\rangle=0. The definition of TT and the reproducing property gives that

hence the condition Tf,f=0\left\langle{Tf},{f}\right\rangle=0 is equivalent to the fact that f(x)=0f(x)=0 for ρ\rho-almost every xXx\in X. Hence fkerTf\in\ker{T} if and only if ρ(Cf)=1\rho(C_{f})=1, i.e. CfXρC_{f}\supset X_{\rho}, or equivalently f,Kx=0\left\langle{f},{K_{x}}\right\rangle=0 xXρ\forall x\in X_{\rho}. Equation (5) then follows. ∎

In the following, we will use the abbreviated notation Pρ=PXρP_{\rho}=P_{X_{\rho}}. Note that the space H\mathcal{H} splits into the direct sum H=HρHρ\mathcal{H}=\mathcal{H}_{\rho}\oplus\mathcal{H}_{\rho}^{\perp}, where

The reproducing kernel Hilbert space Hρ\mathcal{H}_{\rho} (see Remark 2) has been considered before , and in particular in the context of semi-supervised manifold regularization , where XρX_{\rho} is assumed to be an embedded manifold. The corresponding reproducing kernel is Kρ(x,y)=PρKy,KxK_{\rho}(x,y)=\left\langle{P_{\rho}K_{y}},{K_{x}}\right\rangle and FXρ(x)=Kρ(x,x)F_{X_{\rho}}(x)=K_{\rho}(x,x). See also the discussion in Section 6.

Under Assumption 1, we also introduce the integral operator LK:L2(X,ρ)L2(X,ρ)L_{K}:L^{2}(X,\rho)\to L^{2}(X,\rho),

which is a positive trace class operator, too. Note the difference between the operators TT and LKL_{K}: although their definitions are formally the same, the respective domains and images change.

Since TT and LKL_{K} are positive trace class operators, by the Hilbert-Schmidt theorem each of them admits an orthonormal family of eigenvectors in H\mathcal{H} and L2(X,ρ)L^{2}(X,\rho), respectively, with a corresponding family of positive eigenvalues. The two spectral decompositions are strongly related, as we now briefly recall (see also Proposition 8 of and Theorem 2.11 of ).

Denote by (σj)jJ(\sigma_{j})_{j\in J} the (finite or countable) family of strictly positive eigenvalues of LKL_{K}, where each eigenvalue is repeated according to its (finite) multiplicity. For each jJj\in J select a corresponding eigenvector ϕjL2(X,ρ)\phi_{j}\in L^{2}(X,\rho) in such a way that the sequence (ϕj)jJ(\phi_{j})_{j\in J} is orthonormal in L2(X,ρ)L^{2}(X,\rho). Hilbert-Schmidt theorem provides that

where the series converges in trace norm. In general, each element ϕj\phi_{j} is an equivalence class of functions defined ρ\rho-almost everywhere. In particular, the value of ϕj\phi_{j} is not defined outside XρX_{\rho}. However, in each equivalence class we can choose a unique continuous function, denoted again by ϕj\phi_{j}, which is defined at every point of XX by means of the extension equation

With this choice, which will be implicitly assumed in the following, the family (σj)jJ(\sigma_{j})_{j\in J} coincides with the family of strictly positive eigenvalues of TT (with the same multiplicities), (σjϕj)jJ(\sqrt{\sigma_{j}}\phi_{j})_{j\in J} is a orthonormal family in H\mathcal{H} of eigenfunctions of TT, and

where the series converges in the Banach space S1\mathcal{S}_{1} (hence in S2\mathcal{S}_{2}), see e.g. . As TS1=1\left\lVert{T}\right\rVert_{\mathcal{S}_{1}}=1, the positive sequence (σj)jJ(\sigma_{j})_{j\in J} is summable and sums up to 11. It is clear that the family (σjϕj)jJ(\sqrt{\sigma_{j}}\phi_{j})_{j\in J} is an orthonormal basis of the Hilbert space Hρ\mathcal{H}_{\rho}. Conversely, let (fj)jJ(f_{j})_{j\in J} be an orthonormal basis of Hρ\mathcal{H}_{\rho} of eigenvectors of TT with corresponding eigenvalues (σj)jJ(\sigma_{j})_{j\in J}. Define

Then, it is not difficult to show that (6), (7) and (8) hold true.

4 An Analytic Characterization of the Support

Let Assumption 1 hold true. Collecting the previous results, if H\mathcal{H} separates XρX_{\rho}, then Theorem 1 gives that

The function Fρ=FXρF_{\rho}=F_{X_{\rho}} is defined by (3) in terms of the projection PρP_{\rho}, which, in light of Theorem 2, can be characterized using the operator TT. Indeed, from the definition of FρF_{\rho} and (5) we have

where TT^{\dagger} is the pseudo-inverse of TT and θ\theta is the Heaviside function θ(σ)=1I]0,+[(σ)\theta(\sigma)=1{\hskip-2.5pt}\hbox{I}_{]0,+\infty[}(\sigma) (note that with our definition θ(0)=0)\theta(0)=0). The above discussion is summarized in the following theorem.

If H\mathcal{H} satisfies Assumption 1 and separates the support XρX_{\rho} of the measure ρ\rho, then

As we discussed before, a natural question is whether there exist kernels capable to separate all possible closed subsets of XX. In a learning scenario, this can be translated into a universality property, in the sense that it allows to describe any probability distribution and learn consistently its support . Note that in a supervised learning framework a similar role is played by the so called universal kernels . The following section answers positively to the previous question, introducing and studying the concept of completely separating kernels. Interestingly, there are universal kernels in the sense of which do not separate all closed subsets of XX, as for example the Gaussian kernel.

Completely separating reproducing kernel Hilbert spaces

The property defining the class of kernels we are interested in is captured by the following definition.

A reproducing kernel Hilbert space H\mathcal{H} satisfying Assumption \refA.\refA3)\ref{A}.\ref{A3}) is called completely separating if H\mathcal{H} separates all the subsets CXC\subset X which are closed with respect to the metric dKd_{K} defined by (1). In this case, we also say that the corresponding reproducing kernel is completely separating.

We end this section with Proposition 6, which gives a simple way to build completely separating kernels in high dimensional spaces from completely separating kernels in one dimension, the latter usually being easier to characterize.

We now state a sufficient condition on KK ensuring that H\mathcal{H} is completely separating.

for some a,b,γ1,γ2>0a,b,\gamma_{1},\gamma_{2}>0. Then,

the translation invariant kernel K(x,y)=K(xy)K(x,y)=K(x-y) is positive definite and continuous;

As an application, we show that the Abel kernel is completely separating.

with σ>0\sigma>0. Then KK is a positive definite kernel and the corresponding reproducing kernel Hilbert space H\mathcal{H} is completely separating for all d1d\geq 1.

A standard Fourier transform computation gives

where Γ\Gamma is Euler gamma function (Theorem 1.14 in ). The claim then follows from Theorem 4. ∎

2 Building Separating Kernels

The following result gives a way to construct completely separating reproducing kernel Hilbert spaces on high dimensional spaces.

If XiX_{i}, i=1,2di=1,2\ldots d, are sets and K(i)K^{(i)} are completely separating reproducing kernels on XiX_{i} for all i=1,2di=1,2\ldots d, then the product kernel

is completely separating on the set X=X1×X2××XdX=X_{1}\times X_{2}\times\ldots\times X_{d}.

Each set XiX_{i} and XX are endowed with the metric dK(i)d_{K^{(i)}} and dKd_{K} induced by the corresponding kernels, and Hi\mathcal{H}_{i} and H\mathcal{H} denote the reproducing kernel Hilbert spaces with kernels K(i)K^{(i)} and KK, respectively. A standard result gives that H=H1Hd\mathcal{H}=\mathcal{H}_{1}\otimes\ldots\otimes\mathcal{H}_{d} and Kx=Kx1(1)Kxd(d)K_{x}=K^{(1)}_{x_{1}}\otimes\ldots\otimes K^{(d)}_{x_{d}} for all x=(x1,,xd)Xx=(x_{1},\ldots,x_{d})\in X . We claim that the dKd_{K}-topology on XX is contained in the product topology of the dK(i)d_{K^{(i)}}-topologies on XiX_{i} (actually, it is not difficult to show that the two topologies coincide). Indeed, if (xi,k)k1(x_{i,k})_{k\geq 1} are sequences in XiX_{i} such that limkdK(i)(xi,k,xi)=0\lim_{k\to\infty}d_{K^{(i)}}(x_{i,k},x_{i})=0 for all i=1,,di=1,\ldots,d, then

since limkK(i)(xi,k,xi,k)=limkK(i)(xi,k,xi)=K(i)(xi,xi)\lim_{k\to\infty}K^{(i)}(x_{i,k},x_{i,k})=\lim_{k\to\infty}K^{(i)}(x_{i,k},x_{i})=K^{(i)}(x_{i},x_{i}). We now prove that H\mathcal{H} is completely separating. If CXC\subset X is dKd_{K}-closed and x=(x1,,xd)XCx=(x_{1},\ldots,x_{d})\in X\setminus C, since CC is also closed in the product topology, for all i=1,,di=1,\ldots,d there exists an open neighborhood UiU_{i} of xix_{i} in XiX_{i} such that U=U1××UdXCU=U_{1}\times\ldots\times U_{d}\subset X\setminus C. Since each Hi\mathcal{H}_{i} is completely separating, for all i=1,,di=1,\ldots,d there exists fiHif_{i}\in\mathcal{H}_{i} such that fi(xi)0f_{i}(x_{i})\neq 0 and fi(yi)=0f_{i}(y_{i})=0 for all yiXiUiy_{i}\in X_{i}\setminus U_{i}. Then the product function f=f1fdf=f_{1}\otimes\ldots\otimes f_{d} is in H\mathcal{H}, and satisfies f(x)0f(x)\neq 0 and f(y)=0f(y)=0 for all yCy\in C. ∎

is completely separating since each kernel in the product is positive definite and completely separating by Proposition 5.

A Spectral Approach to Learning the Support

In this section we study the set estimation problem in the context of learning theory. We fix a triple (X,ρ,K)(X,\rho,K) as in Section 2, and assume throughout that the reproducing kernel KK satisfies Assumption 1. We regard XX as a metric space with respect to dKd_{K}, and continue to denote by XρX_{\rho} the support of ρ\rho defined in Proposition 2.

If H\mathcal{H} separates XρX_{\rho}, Theorem 3 shows that the support XρX_{\rho} is the 11-level set of a suitable function FρF_{\rho} defined by the integral operator TT, and therefore depending on KK and ρ\rho. However, the probability distribution ρ\rho is unknown, as we only have a set of i.i.d. points x1,,xnx_{1},\dots,x_{n} sampled from ρ\rho at our disposal. Our task is now to use our sample in order to estimate the set XρX_{\rho}.

The definition of TT given by (4) suggests that it can be estimated by the data dependent operator

The operator TnT_{n} is positive and with finite rank; in particular, TnS1T_{n}\in\mathcal{S}_{1} and TnS1=tr[Tn]=1\left\lVert{T_{n}}\right\rVert_{\mathcal{S}_{1}}=\operatorname{tr}\left[T_{n}\right]=1. We denote by (σj(n))jJn(\sigma^{(n)}_{j})_{j\in J_{n}} the strictly positive eigenvalues of TnT_{n} (each one repeated according to its multiplicity) and by \big{(}\sqrt{\sigma^{(n)}_{j}}\phi^{(n)}_{j}\big{)}_{j\in J_{n}} the corresponding eigenvectors; note that in the present case the index set JnJ_{n} is finite. However, though TnT_{n} converges to TT in all relevant topologies (see Lemma 1 and Remark 6 below), in general TnTnT^{\dagger}_{n}T_{n} does not converge to TTT^{\dagger}T since TT^{\dagger} may be unbounded, or, equivalently, since may be an accumulation point of the spectrum of TT when dimH=\dim\mathcal{H}=\infty. Hence, the problem of support estimation is ill-posed, and regularization techniques are needed to restore well-posedness and ensure a stable solution. In the following sections, we will show that spectral regularization can be used to learn the support efficiently from the data.

Note that, since the spectra of TnT_{n} and TT are both contained in the interval $,wecanassumethatthefunctions, we can assume that the functionsg_{\lambda}andandr_{\lambda}aredefinedonare defined on.Moreover,astheoperators. Moreover, as the operatorsr_{\lambda}(T_{n})andandr_{\lambda}(T)approximateorthogonalprojections,itisusefultohavetheboundapproximate orthogonal projections, it is useful to have the bound0\leq r_{\lambda}(T_{n}),r_{\lambda}(T)\leq{I}satisfiedforallsatisfied for allT_{n}andandTs,andthiscanbeachievedbychoosingthefunction’s, and this can be achieved by choosing the functionr_{\lambda}suchthatsuch that0\leq r_{\lambda}(\sigma)\leq 1forallfor all\sigma$.

As a consequence of the above discussion, the characterization of filter functions giving rise to stable algorithms is captured by the following assumption.

The family of functions (rλ)λ>0(r_{\lambda})_{\lambda>0}, with rλ:r_{\lambda}:\to for all λ>0\lambda>0, has the following properties:

for all σ>0\sigma>0, we have limλ0+rλ(σ)=1\lim_{\lambda\to 0^{+}}r_{\lambda}(\sigma)=1;

for all λ>0\lambda>0, there exists a positive constant LλL_{\lambda} such that

By Assumption 2.a, there exists a function gλ:[0,+[g_{\lambda}:\to[0,+\infty[ such that rλ(σ)=gλ(σ)σr_{\lambda}(\sigma)=g_{\lambda}(\sigma)\sigma. On the other hand, by Assumption 2.b) we have limλ0+rλ(σ)=θ(σ)\lim_{\lambda\to 0^{+}}r_{\lambda}(\sigma)=\theta(\sigma) for all σ\sigma\in. Assumption 2.c) is of technical nature, and will become clear in Section 5.2; here we note that in particular it implies that rλr_{\lambda} is a continuous function for all λ>0\lambda>0.

A few examples of filter functions rλr_{\lambda} satisfying Assumption 2 and of corresponding functions gλg_{\lambda} are given in Table 1. It is easy to check that for each of them Lλ=1/λL_{\lambda}=1/\lambda. See for further examples.

For a chosen filter, the corresponding regularized empirical estimator of FρF_{\rho} is defined by

where we allow the regularization parameter λn\lambda_{n} to depend on the number of samples nn. Note that the functions FnF_{n} and FρF_{\rho} are continuous on XX by continuity of the mapping xKxx\mapsto K_{x} (see i) of Proposition 1). In Section 5 we will show that, for an appropriate choice of the sequence (λn)n1(\lambda_{n})_{n\geq 1}, the estimator FnF_{n} converges almost surely to FρF_{\rho} uniformly on compact subsets of XX. Unfortunately, this does not imply convergence of the 11-level sets of FnF_{n} to the 11-level set of FρF_{\rho} in any sense (as, for example, with respect to the Hausdorff distance). However, an estimator of XρX_{\rho} can be obtained by setting

where τn>0\tau_{n}>0 is an off-set parameter that depends on the sample size nn (recall that FnF_{n} takes values in $).InSection5weshowthat,forasuitablechoiceofthesequence). In Section 5 we show that, for a suitable choice of the sequence(\tau_{n})_{n\geq 1},theclosedset, the closed setX_{n}$ is indeed a consistent estimator of the support with respect to the Hausdorff distance.

In the following section we discuss some remarks about the computation of FnF_{n}.

2 Algorithmic and Computational Aspects

We show that the computation of FnF_{n} (hence of XnX_{n}) reduces to a finite dimensional problem involving the empirical kernel matrix defined by the data. To this purpose, it is useful to introduce the sampling operator

which can be interpreted as the restriction operator which evaluates functions in H\mathcal{H} on the points of the training set. The transpose of SnS_{n} is

and SnS_{n}^{\top} can be interpreted as the out-of-sample extension operator . A simple computation shows that

Hence, considering the filter given in the form rλ(Tn)=gλ(Tn)Tnr_{\lambda}(T_{n})=g_{\lambda}(T_{n})T_{n}, we have

where the second equality follows from spectral calculus. Using the definition of the sampling operator, we can consider the nn-dimensional vector Kx{\mathbf{K}}_{x} defined by

where Kx{\mathbf{K}}_{x}^{\ast} is the conjugate transpose of Kx{\mathbf{K}}_{x}. More explicitly we have

The above equation shows that, while H\mathcal{H} could be infinite dimensional, the computation of the estimator reduces to a finite dimensional problem. Further, though the mathematical definition of the filter is done through spectral calculus, the computations might not require performing an eigen-decomposition. As an example, for Tikhonov regularization gλn(σ)=1σ+λng_{\lambda_{n}}(\sigma)=\frac{1}{\sigma+\lambda_{n}}, so that gλn(Knn)=(Knn+λn)1g_{\lambda_{n}}\left(\frac{\mathbf{K}_{n}}{n}\right)=(\frac{{\mathbf{K}}_{n}}{n}+\lambda_{n})^{-1} and the coefficient vector α(x)\alpha(x) in (20) is given by

In the case of the Landweber filter, it is possible to prove that the coefficient vector can be evaluated iteratively by setting α0(x)=0\alpha^{0}(x)=0, and

for t=1,,mλnt=1,\dots,m_{\lambda_{n}}. We refer to for the corresponding algorithm in a supervised framework; see also the discussion in Section 6.3.

We thus see that the estimator corresponding to Tikhonov regularization can be computed via Cholesky decomposition and has complexity of order O(n3)O(n^{3}). For Landweber iteration the complexity is O(n2m)O(n^{2}m), where mm is the number of iterations. Finally, the spectral cut-off, or truncated SVD, requires O(n3)O(n^{3}) operations to compute the eigen-decompostion of the kernel matrix. Further discussions can be found in and references therein. We end remarking that, in order to test whether NN points belong or not to the support, we simply have to repeat the above computation replacing Kx{\mathbf{K}}_{x} by a n×Nn\times N matrix Kx,N{\mathbf{K}}_{x,N}, in which each column is a vector Kx{\mathbf{K}}_{x} corresponding to a point xx in the test set. Note that in this case the coefficients α(x)\alpha(x) will also form a n×Nn\times N matrix.

Error Analysis: Convergence and Stability

In this section we develop an errror analysis for the proposed class of estimators. First, we discuss convergence (consistency) and then stability with respect to random sampling in terms of finite sample bounds. We continue to suppose throughout this section that Assumption 1 holds true, and consider XX as a metric space with metric dKd_{K}.

We recall that the empirical data are a set of i.i.d. points x1,,xnx_{1},\dots,x_{n}, each one drawn from XX with probability ρ\rho. Since we need to study asymptotic properties when the sample size nn goes to infinity, we introduce the following probability space

for some measurable map ξn:XnM\xi_{n}:X^{n}\to M. The number nn is the cardinality of the sampled data. We then have the following facts.

TnT_{n} is a Sk\mathcal{S}_{k}-valued estimator for k=1,2k=1,2;

if XX is locally compact, then FnF_{n} is a C(X)C(X)-valued estimator, where C(X)C(X) is the space of continuous functions on XX with the topology of uniform convergence on compact subsets.

The proof of the above proposition is rather technical, and we defer the interested reader to A.2 for more details.

In item ii) of Proposition 7, the assumption that XX is locally compact is needed to ensure that the topology of uniform convergence on compact subsets is a separable metric topology on C(X)C(X), which in turn is essential to prove measurability of the random variable FnF_{n} (see the proof of Proposition 16 in A.2). In many examples, the set XX has its own locally compact separable metric dXd_{X}. In this case, in order for XX to be locally compact metric space also for the metric dKd_{K}, it is enough that the kernel KK is a dXd_{X}-continuous function separating every subset of XX which is closed with respect to dXd_{X}, as the two topologies induced by dXd_{X} and dKd_{K} then coincide by item ii) of Proposition 3.

Concentration of measure results for random variables in Hilbert spaces can be used to prove that TnT_{n} is an unbiased estimator of TT, as stated in the following lemma.

with probability at least 12eδ1-2e^{-\delta}. Furthermore

The result is known, but we report its short proof. For all i1i\geq 1 define the random variables Zi:ΩS2Z_{i}:\Omega\to\mathcal{S}_{2} as

Note that (23) and Theorem 2.19 in imply that

2 Consistency

We now choose a family of filter functions (rλ)λ>0(r_{\lambda})_{\lambda>0} and study the convergence of the associated estimators FnF_{n} and XnX_{n} introduced in Section 4.

which can be seen as the infinite sample analogue of FnF_{n}. Clearly, GλG_{\lambda} is a continuous function. For all sets CXC\subset X, we then have the following splitting of the error into two parts, the sample error and the approximation error

In order to prove consistency, we need to show that the left hand side goes to as the sequence of regularization parameters (λn)n1(\lambda_{n})_{n\geq 1} tends to . This will be done separately for the approximation and the sample errors in the next two propositions.

Under Assumption 2.b), if the sequence (λn)n1(\lambda_{n})_{n\geq 1} is such that limnλn=0\lim_{n\to\infty}\lambda_{n}=0, then, for any compact subset CXC\subset X,

Assumption 2.b) and limnλn=0\lim_{n\to\infty}\lambda_{n}=0 imply that the sequence of non-negative functions (rλn)n1(r_{\lambda_{n}})_{n\geq 1} is bounded by 11 and converges pointwisely to the Heaviside function θ\theta on the interval $.Spectraltheoremensuresthat,forall. Spectral theorem ensures that, for allx\in C$,

Given ϵ>0\epsilon>0, by compactness of CC there exists a finite covering of CC by balls of radius ϵ\epsilon, namely Ci=1mB(xi,ϵ)C\subset\cup_{i=1}^{m}B(x_{i},\epsilon). By (25) there exists n0n_{0} such that

where KxKxi<ϵ\left\lVert{K_{x}-K_{x_{i}}}\right\rVert<\epsilon for all xB(xi,ϵ)x\in B(x_{i},\epsilon) since KxKxi=dK(x,xi)\left\lVert{K_{x}-K_{x_{i}}}\right\rVert=d_{K}(x,x_{i}), an, because rλn(σ)1\left\lvert{r_{\lambda_{n}}(\sigma)}\right\rvert\leq 1, θ(σ)1\left\lvert{\theta(\sigma)}\right\rvert\leq 1, supσrλn(σ)θ(σ)2\sup_{\sigma\in}\left\lvert{r_{\lambda_{n}}(\sigma)-\theta(\sigma)}\right\rvert\leq 2. ∎

Convergence to zero of the sample error follows from (23) and the next proposition.

In particular, if Assumption 2.c) holds, then

which proves (26). Assumption 2.c) and Theorem 8.1 in (see also Lemma 7 in A.3 for a simple unpublished proof due to A. Maurer) imply that

The above results can be combined in the following theorem, showing that, if the sequence λn\lambda_{n} is suitably chosen, then FnF_{n} converges almost surely to FρF_{\rho} with respect to the topology of uniform convergence on compact subsets of XX.

Under Assumption 2, if the sequence (λn)n1(\lambda_{n})_{n\geq 1} is such that

then, for every compact subset CXC\subset X,

We show convergence to zero of both the two terms in the right hand side of inequality (24), thus implying (29). By (27), we have

where M=supn1(Lλnlogn)/nM=\sup_{n\geq 1}(L_{\lambda_{n}}\log n)/\sqrt{n} is finite by (28). Then (23) implies that the first term in the right hand side of inequality (24) converges to zero almost surely. Since the second term goes to zero by Proposition 8, the claim follows. ∎

As already remarked above, uniform convergence of FnF_{n} to FρF_{\rho} on compact subsets does not imply convergence of the level sets of FnF_{n} to the corresponding level sets of FρF_{\rho} in any sense (as, for example, with respect to the Hausdorff distance among compact subsets). For this reason, we introduce a family of threshold parameters (τn)n1(\tau_{n})_{n\geq 1} and define the estimator XnX_{n} of the set XρX_{\rho} as in (17).

We define a data dependent parameter τn\tau_{n} as the function on Ω\Omega

where we wrote explicitely the dependence of FnF_{n} on the training set ωΩ\omega\in\Omega. Since FnF_{n} takes values in $,clearly, clearly\tau_{n}(\omega)\in$.

The following is the central result of this section. It shows that, assuming XX is compact and for the above choice of the sequence (τn)n1(\tau_{n})_{n\geq 1}, the Hausdorff distance between XnX_{n} and XρX_{\rho} goes to zero with probability 11. Here we recall that the Hausdorff distance between two subsets A,BXA,B\subset X is

where dK(x,Y)=infyYdK(x,y)d_{K}(x,Y)=\inf_{y\in Y}d_{K}(x,y).

Suppose the metric space XX is compact. Under Assumption 2, if H\mathcal{H} separates the set XρX_{\rho} and the sequence (λn)n1(\lambda_{n})_{n\geq 1} satisfies (28), for the choice of the threshold parameters (τn)n1(\tau_{n})_{n\geq 1} given in (30) we have

We devote the rest of this section to proof of the above theorem. For simplicity, we split it into a few lemmas.

Under the hypotheses of Theorem 6, we have

Since XX is compact, possibly passing to a subsequence we can assume that the sequence (zk)k1(z_{k})_{k\geq 1} converges to a limit zXz\in X. We claim that zXρz\in X_{\rho}. Indeed, if kk is sufficiently large, then we have

where Fnk(zk)1τnk\left\lvert{F_{n_{k}}(z_{k})-1}\right\rvert\leq\tau_{n_{k}} is due to the fact that zkXnkz_{k}\in X_{n_{k}}, so that

As nkn_{k} goes to \infty, we have supxXFρ(x)Fnk(x)0\sup_{x\in X}\left\lvert{F_{\rho}(x)-F_{n_{k}}(x)}\right\rvert\to 0 by Theorem 5; moreover, since FρF_{\rho} is continuous in zz and τnk\tau_{n_{k}} goes to zero, the above inequality for Fρ(z)1\left\lvert{F_{\rho}(z)-1}\right\rvert gives Fρ(z)=1F_{\rho}(z)=1. Since H\mathcal{H} separates XρX_{\rho}, this implies zXρz\in X_{\rho}. However, (32) implies that dK(z,x)ϵd_{K}(z,x)\geq\epsilon for all xXρx\in X_{\rho}, which is the desired contradiction. ∎

The proof that supxXndK(x,Xρ)\sup_{x\in X_{n}}d_{K}(x,X_{\rho}) goes to zero as nn\to\infty requires a further technical lemma, see [36, Lemma 6.1]. In its statement, for all n1n\geq 1 and xXx\in X, we denote by ξ1,n(x)\xi_{1,n}(x) the nearest neighbour of xx in the training set {x1,,xn}\left\{x_{1},\ldots,x_{n}\right\}, i.e.

Given xXρx\in X_{\rho}, fix ϵ>0\epsilon>0 and, denoted by B(x,ϵ)B(x,\epsilon) the closed ball with center xx and radius ϵ\epsilon, set p=ρ(B(x,ϵ))p=\rho(B(x,\epsilon)). By definition of the support and the fact that ρ\rho is a probability measure, 0<p10<p\leq 1. Furthermore

Since 01p<10\leq 1-p<1, the series n(1p)n\sum_{n}(1-p)^{n} converges, so that Borel-Cantelli lemma yields

Since this holds for all ϵ>0\epsilon>0, we have

Under the hypotheses of Theorem 5, if the metric space XX is compact, then

Choose a denumerable dense family {zj}jJ\left\{z_{j}\right\}_{j\in J} in XρX_{\rho}. By the Lemma 3 there exists an event EE with probability 11 such that

on EE. We claim that the limit (33) holds on EE. Observe that, by definition of τn\tau_{n}, xiXnx_{i}\in X_{n} for all 1in1\leq i\leq n, and

so that it is enough to show that limn+supxXρdK(ξ1,n(x),x)=0\lim_{n\to+\infty}\sup_{x\in X_{\rho}}d_{K}(\xi_{1,n}(x),x)=0. Fix ϵ>0\epsilon>0. Since XρX_{\rho} is compact, there is a finite subset JϵJJ_{\epsilon}\subset J such that {B(zj,ϵ)}jJϵ\left\{B(z_{j},\epsilon)\right\}_{j\in J_{\epsilon}} is a finite covering of XρX_{\rho}. We claim that

Indeed, fixed xXρx\in X_{\rho}, there exists an index jJϵj\in J_{\epsilon} such that xB(zj,ϵ)x\in B(z_{j},\epsilon). By definition of ξ1,n\xi_{1,n}, clearly

so that by the triangular inequality we get

Taking the sup\sup over XρX_{\rho} we get the claim. Since JϵJ_{\epsilon} is finite, by (34)

Since ϵ\epsilon is arbitrary, we get limn+supxXρdK(ξ1,n(x),x)=0\lim_{n\to+\infty}\sup_{x\in X_{\rho}}d_{K}(\xi_{1,n}(x),x)=0, and this concludes the proof ∎

The proof of Theorem 6 follows easily combining the previous lemmas.

As dH(Xn,Xρ)=max{supxXndK(x,Xρ), supxXρdK(x,Xn)}d_{H}(X_{n},X_{\rho})=\max\{\sup_{x\in X_{n}}d_{K}(x,X_{\rho}),\ \sup_{x\in X_{\rho}}d_{K}(x,X_{n})\}, the theorem follows combining Lemmas 2 and 4. ∎

We conclude this section with some comments. First, if H\mathcal{H} does not separate XρX_{\rho}, then the statement of Theorem 6 continues to be true provided that the support XρX_{\rho} is replaced by the level set {xXFρ(x)=1}\left\{x\in X\mid F_{\rho}(x)=1\right\}. Note that, the Hausdorff distance dHd_{H} has been defined with respect to the metric dKd_{K} induced by the kernel, however, if the set XX has its own metric dXd_{X} making it compact and the hypotheses of Proposition 3 are satisfied, then Theorem 6 implies convergence of XnX_{n} to XρX_{\rho} also with respect to the Hausdorff distance associated to dXd_{X}. Finally, we remark that in Theorem 6 convergence of XnX_{n} to XρX_{\rho} does not depend on any a priori assumption on the probability ρ\rho.

3 Finite Sample Bounds and Stability of Random Sampling

In order to prove stability of our algorithms under random sampling and determine their convergence rates, we need to specify suitable a priori assumptions on the class of problems to be considered. In the present section, a detailed analysis of the convergence rates of FnF_{n} to FρF_{\rho} will be carried out for the case of the Tikhonov filter rλ(σ)=σ/(σ+λ)r_{\lambda}(\sigma)=\sigma/(\sigma+\lambda). The techniques in should allow to derive similar results for filters other than Tikhonov.

which is finite since TT is a trace class operator. The above quantity is related to the degrees of freedom of the estimator . Here, we recall that N{\mathcal{N}} is a decreasing function of λ\lambda and limλ0+N(λ)=N\lim_{\lambda\to 0^{+}}{\mathcal{N}}(\lambda)=N, where NN is the dimension of the range of TT.

The a priori conditions we consider in the present paper are given by the following two assumptions, which involve both the reproducing kernel KK and the probability measure ρ\rho (compare with ).

there exist bb\in and Db1D_{b}\geq 1 such that

there exist 0<s10<s\leq 1 and a constant Cs>0C_{s}>0 such that PρKxranTs/2P_{\rho}K_{x}\in\operatorname{ran}\,T^{s/2} for all xXx\in X, and

The above conditions are classical in the theory of inverse problems and have been recently considered in supervised learning. Before showing how they allow to derive a finite sample bound on the error supxXFn(x)Fρ(x)\sup_{x\in X}\left\lvert{F_{n}(x)-F_{\rho}(x)}\right\rvert, we add some comments. First, Assumption 3.a) is related to the level of ill-posedness of the problem and can be interpreted as a condition specifying the aspect ratio of the range of TT. Since 0<λN(λ)<tr[T]=10<\lambda{\mathcal{N}}(\lambda)<\operatorname{tr}\left[T\right]=1, inequality (36) is always satisfied with the choice b=1b=1 and D1=1D_{1}=1, so that in this case we are not imposing any a priori assumption. If dimranT=N<\dim\operatorname{ran}\,T=N<\infty, the best choice is b=0b=0 and D0=ND_{0}=\sqrt{N}; otherwise, if dimranT=\dim\operatorname{ran}\,T=\infty, then necessarily b>0b>0. In the latter case, a sufficient condition to have b<1b<1 is to assume a decay rate σjj1/b\sigma_{j}\sim j^{-1/b} on the eigenvalues of TT (see Proposition 3 of ).

Coming to Assumption 3.b), first of all we remark that it is always satisfied when dimranT\dim\operatorname{ran}\,T is finite with the choice s=1s=1 and C1=maxjJ1/σjC_{1}=\max_{j\in J}1/\sigma_{j}. In the general case, Assumption 3.b) can be expressed by the following equivalent condition

where (ϕj,σj)jJ(\phi_{j},\sigma_{j})_{j\in J} are the eigenvectors and eigenvalues of LKL_{K}, which were defined in Section 2.3 (see in particular (7) for the definition of the functions ϕj\phi_{j} outside the set XρX_{\rho}). Clearly, the higher is ss, the stronger is the assumption.

Note that in particular inequality (38) holds true if there exists a constantAs as it happens for example for reproducing kernels on X=[0,2π]dX=[0,2\pi]^{d} which are invariant under translations, when ρ\rho is the Lebesgue measure on [0,2π]d[0,2\pi]^{d}. κ>0\kappa>0 such that supxXϕj(x)κ\sup_{x\in X}\left\lvert{\phi_{j}(x)}\right\rvert\leq\kappa for all jJj\in J, and s]0,1]s\in]0,1] is chosen to make the series jJσj1s\sum_{j\in J}\sigma_{j}^{1-s} finite. In this case, it is quite easy to give conditions on the eigenvalues (σj)jJ(\sigma_{j})_{j\in J} assuring that both Assumptions 3.a) and 3.b) are satisfied. For example, if σjj1/b\sigma_{j}\sim j^{-1/b} for some 0<b<10<b<1, then (36) holds true with this choice of bb, and (37) is satisfied for any 0<s<1b0<s<1-b.

Setting β=1s[0,1[\beta=1-s\in[0,1[, condition (38) is equivalent to the fact that for all x,yXx,y\in X the series

converges absolutely to a bounded reproducing kernel KρβK^{\beta}_{\rho}. Convergence of the series (39) was studied e.g. in , where it is proved that, if the sequence of powers (σjβ)jJ(\sigma^{\beta}_{j})_{j\in J} is summable, there exists a ρ\rho-null set NN such that (39) converges absolutely on (XN)×(XN)(X\setminus N)\times(X\setminus N) (see [70, Proposition 4.4]). We remark that this weaker fact is not sufficient in our setting: indeed, on the one hand it does not imply that the series (39) (or, equivalently, (38)) converges on all of XX, and on the other it does not guarantee that such series is uniformly bounded, two conditions which however are both needed in the proof of Theorem 7 below to get uniform estimates on the whole set XX. A direction of future work is to study the geometric nature of the above conditions when XX is a metric space or a Euclidean space and XρX_{\rho} a Riemannian submanifold.

The following theorem provides the finite sample bound on the error supxXFn(x)Fρ(x)\sup_{x\in X}\left\lvert{F_{n}(x)-F_{\rho}(x)}\right\rvert.

Suppose rλ(σ)=σ/(σ+λ)r_{\lambda}(\sigma)=\sigma/(\sigma+\lambda). If Assumption 3 holds and we choose

then, for n1n\geq 1 and δ>0\delta>0, we have

with probability at least 12eδ1-2e^{-\delta}.

We postpone the proof to the end of the current section and add here some comments. The above finite sample bound quantifies the stability of the estimator with respect to random sampling. Equivalently, if we set the right hand term of the inequality to ϵ\epsilon and solve for n=n(ϵ,δ)n=n(\epsilon,\delta), we obtain the sample complexity of the problem, i.e. how many samples are needed in order to achieve the maximum error ϵ\epsilon with confidence 12eδ1-2e^{-\delta}. As remarked before, Assumption 3.a) is verified for b=1b=1 by any reproducing kernel. In this limit case our result gives a rate ns/(2s+2)n^{-s/(2s+2)}, comparable with the one that can be obtained inserting (27) and (41) below into inequality (24), with TnT\left\lVert{T_{n}-T}\right\rVert bounded by (22).

Note that, if dimranT=N<\dim\operatorname{ran}\,T=N<\infty, choosing b=0b=0, D0=ND_{0}=\sqrt{N}, s=1s=1 and C1=maxjJ1/σjC_{1}=\max_{j\in J}1/\sigma_{j}, the rate in (40) becomes n1/3n^{-1/3}.

The proof of Theorem 7 follows the ideas in and is based on refined estimates of the sample and approximation errors. The techniques in should allow to derive similar results for filters beyond the Tikhonov one.

If Assumption 3.a) holds true, then, for n1n\geq 1 and δ>0\delta>0, we have

with probability at least 12eδ1-2e^{-\delta}.

It is easy to see that (Tn+λn)1λn1\left\lVert{(T_{n}+\lambda_{n})^{-1}}\right\rVert_{\infty}\leq\lambda_{n}^{-1}, hence

Then, from Lemma 10 in the Appendix we have that

with probability at least 12eδ1-2e^{-\delta}, so that the result follows by (26). ∎

Since θ(σ)rλ(σ)=λ/(σ+λ)\theta(\sigma)-r_{\lambda}(\sigma)=\lambda/(\sigma+\lambda) for all σ>0\sigma>0, we have

as PρKxkerTP_{\rho}K_{x}\in\ker{T}^{\perp}. Since by assumption PρKxranTs/2P_{\rho}K_{x}\in\operatorname{ran}\,T^{s/2} for some 0<s10<s\leq 1, spectral calculus and the bound σs/(σ+λ)λs1\sigma^{s}/(\sigma+\lambda)\leq\lambda^{s-1} give the inequality

We are now ready to prove the main result.

The choiche λn=n1/(2s+b+1)\lambda_{n}=n^{-1/(2s+b+1)} is the one that set the contributions of the sample and approximation errors in (24) to be equal. Indeed, we begin by simplifying the bound on the sample error. If λn1\lambda\geq n^{-1}, then nλnλb+1n\lambda\geq\sqrt{n\lambda^{b+1}} for all 0<b10<b\leq 1, so that

where we used the definition of DbD_{b} (and the fact that Db1D_{b}\geq 1). Then, by the above inequality and Propositions 11 and 12, inequality (24) gives

If we set the contributions of the sample and approximation errors to be equal, the choice for λ\lambda is

It is easy to see that λn1\lambda\geq n^{-1} for all values of s,bs,b, so that from (42) we have

4 The kernel PCA filter

A natural choice for the spectral filter rλr_{\lambda} would be the regularization defined by kernel PCA , that corresponds to truncating the generalized inverse of the kernel matrix at some cutoff parameter λ\lambda. The corresponding filter function is

The above filter does not satisfy the Lipschitz condition 2.c) in Assumption 2, so that the bound (27) for the sample error supxXFn(x)Gλn(x)\sup_{x\in X}\left\lvert{F_{n}(x)-G_{\lambda_{n}}(x)}\right\rvert does not hold in this caseNote that, by Proposition 16 in A.2, if XX is locally compact, then FnF_{n} defined in (16) still is a C(X)C(X)-valued estimator.. However, we can still achieve an estimate by employing inequality (44) in A.3. To this aim, with a slight abuse of the notation, here we count the eigenvalues of TT and TnT_{n} without their multiplicities and we list them in decreasing order. Furthermore, for any λ>0\lambda>0 we set σj(λ)\sigma_{j(\lambda)} and σk(λ)(n)\sigma_{k(\lambda)}^{(n)} as the smallest eigenvalues of TT and TnT_{n} which are greater or equal to λ\lambda, i.e.

and inequality (26) for the sample error then reads

By Lemma 1, in order to have convergence to of the right hand side of this expression we need to choose the sequence (λn)n1(\lambda_{n})_{n\geq 1} such that

Since the gap σj(λ)σj(λ)+1\sigma_{j(\lambda)}-\sigma_{j(\lambda)+1} can have any arbitrary rate of convergence to zero as λ0+\lambda\to 0^{+}, we thus see that there exists no distribution independent choice of (λn)n1(\lambda_{n})_{n\geq 1} ensuring the convergence to zero of the above bound.

which recovers a known result about kernel PCA (see for example ). Furthermore, if the bound TnTS2<(σMnσMn+1)/2\left\lVert{T_{n}-T}\right\rVert_{\mathcal{S}_{2}}<(\sigma_{M_{n}}-\sigma_{M_{n}+1})/2 holds, then we obtain PMn(n)PMnS2<1\left\lVert{P^{(n)}_{M_{n}}-P_{M_{n}}}\right\rVert_{\mathcal{S}_{2}}<1, hence we have the equality dimranPMn(n)=dimranPMn\dim\operatorname{ran}\,P^{(n)}_{M_{n}}=\dim\operatorname{ran}\,P_{M_{n}}.

The following result extends Theorem 5 to the case of kernel PCA, at the price of having a distribution dependent choice of the cut-off sequence (Mn)n1(M_{n})_{n\geq 1}.

If the sequence of natural numbers (Mn)n1(M_{n})_{n\geq 1} is strictly increasing and such that

and we define the sequence (λn)n1(\lambda_{n})_{n\geq 1} as

then, for every compact subset CXC\subset X,

By the above discussion and inequality (26),

Convergence to of the sample error then follows from (23). Combining this fact and Proposition 8 into inequality (24), the claim then follows. ∎

Some Perspectives

In this section we discuss some different perspectives to our approach and suggest some possible extensions.

We start discussing some connections between our analytical characterization of the support of ρ\rho and Mercer theorem . With the notations of Section 2.3, the fact that the family (σjϕj)jJ(\sqrt{\sigma_{j}}\phi_{j})_{j\in J} is an orthonormal basis of PρHP_{\rho}\mathcal{H} and the reproducing property give the relation

where the series converges absolutely. Note that in this expression the eigenfunctions ϕj\phi_{j} of LKL_{K} are defined outside XρX_{\rho} through the extension equation (7). Restricting (43) to x,yXρx,y\in X_{\rho}, we obtain

which is nothing else than Mercer theorem . In particular, taking x=yx=y, this formula implies that jJσjϕj(x)2=K(x,x)\sum_{j\in J}\sigma_{j}\left\lvert{\phi_{j}(x)}\right\rvert^{2}=K(x,x) for all xXρx\in X_{\rho}. On the other hand, the assumption that the reproducing kernel separates XρX_{\rho} precisely ensures that

(Recall that, if KK separates XρX_{\rho}, then XρX_{\rho} is the 11-level set of the function Fρ=jJσjϕj2F_{\rho}=\sum_{j\in J}\sigma_{j}\left\lvert{\phi_{j}}\right\rvert^{2}.)

2 A Feature Space Point of View

In machine learning, kernel methods are often described in terms of a corresponding feature map . This point of view highlights the linear structure of the Hilbert space and often provides a more geometric interpretation.

We recall that a feature map associated to a reproducing kernel is a map Ψ:XF\Psi:X\to{\mathcal{F}}, where F{\mathcal{F}} is a Hilbert space with inner product ,F\left\langle{\cdot},{\cdot}\right\rangle_{\mathcal{F}}, satisfying K(x,y)=Ψ(y),Ψ(x)F.K(x,y)=\left\langle{\Psi(y)},{\Psi(x)}\right\rangle_{\mathcal{F}}. While every map Ψ\Psi from XX into a Hilbert space F\cal F defines a reproducing kernel, it is also possible to prove that each kernel has an associated feature map (and in fact many). Indeed, given KK, the natural assignment is FH\mathcal{F}\equiv\mathcal{H} and Ψ(x)Kx\Psi(x)\equiv{K_{x}}. Such a choice is also minimal, in the sense that, if we make a different choice of F\mathcal{F} and Ψ\Psi, then there exists an isometry W:HFW:\mathcal{H}\to\mathcal{F} such that Ψ(x)=WKx\Psi(x)=W{K_{x}} xX\forall x\in X – see for example Proposition 2.4 of or Theorem 4.21 of , noticing that both papers deal with the transpose W:FHW^{\top}:\mathcal{F}\to\mathcal{H}.

We next review some of the concepts introduced in Section 2 in terms of feature maps. For the sake of comparison we assume that Ψ(x)F=1\left\lVert{\Psi(x)}\right\rVert_{\cal F}=1 for all xXx\in X (this corresponds to the normalization assumption 1.d)), we let FC{\cal F}_{C} be the closure of the linear span of the set {Ψ(x)xC}\left\{\Psi(x)\mid x\in C\right\}, and define

It is easy to see that the definition of separating kernel has the following equivalent and natural analogue in the context of feature maps.

We say that a feature map Ψ\Psi separates a subset CXC\subset X if

The above definition is equivalent to Definition 1 since dF(Ψ(x),FC)=Ψ(x)QCΨ(x)Fd_{\cal F}(\Psi(x),{\cal F}_{C})=\left\lVert{\Psi(x)-Q_{C}\Psi(x)}\right\rVert_{\cal F}, where QCQ_{C} is the orthogonal projection onto FC{\cal F}_{C}. Then, according to Definition 3, a point xCx\in C if and only if Ψ(x)QCΨ(x)F2=0\left\lVert{\Psi(x)-Q_{C}\Psi(x)}\right\rVert_{\cal F}^{2}=0. Since Ψ(x)=WKx\Psi(x)=WK_{x} xX\forall x\in X and QCW=WPCQ_{C}W=WP_{C}, this is equivalent to

Theorem 1 then implies that Definition 1 and 3 are equivalent. We thus see that the separating property has a clear geometric interpretation in the feature space: the set Ψ(C)\Psi(C) is the intersection of the closed subspace FC\mathcal{F}_{C}, i.e. a linear manifold in F\cal F, and Ψ(X)\Psi(X) – see Figure 2.

In the above interpretation, the estimator we propose for the support then stems from the following observation: given a training set x1,,xnx_{1},\ldots,x_{n}, we classify a new point xx as belonging to the estimator XnX_{n} of XρX_{\rho} if the distance of Ψ(x)\Psi(x) to the linear span of {Ψ(x1),Ψ(xn)}\{\Psi(x_{1}),\ldots\Psi(x_{n})\} is sufficiently small.

Given a training set {x1,,xn}\left\{x_{1},\ldots,x_{n}\right\}, our estimator FnF_{n} classifies a new point xx as belonging to the support if the distance of Ψ(x)\Psi(x) to the linear span of Ψ(x1),,Ψ(xn)\Psi(x_{1}),\ldots,\Psi(x_{n}) is sufficiently small.

3 Inverse Problems and Empirical Risk Minimization

Here we suggest a simple interpretation of the estimator FnF_{n} and stress the connection with the supervised setting. We regard the sampled data x1,,xnx_{1},\ldots,x_{n} as a training set of positive examples, so that each point xiXρx_{i}\in X_{\rho} almost surely; the new datum is the point xXx\in X, and we evaluate the estimator FnF_{n} at xx. We label the examples according to the similarity function KK by setting

If KK satisfies Assumption 1, then, since K(x,x)=1K(x,x)=1 and KK is dKd_{K}-continuous, the function yiy_{i} is close to 11 whenever xix_{i} is close to xx. The interpolation problem

(where SnS_{n} is defined in (18)) is ill-posed. To restore well-posedeness we can consider the corresponding least square problem (empirical risk minimization problem)

where λ>0\lambda>0 is the regularization parameter (Tikhonov regularization). It is known that the minimum of the above expression is achieved by ffnλf\equiv f_{n}^{\lambda}, with

where gλg_{\lambda} is the function gλ(σ)=1/(σ+λ)g_{\lambda}(\sigma)=1/(\sigma+\lambda). More generally, Tikhonov regularization can be replaced by spectral regularization induced by a different choice of the filter gλg_{\lambda}; the corresponding regularized solution fnλf_{n}^{\lambda} is still given by the previous equation, but the function gλg_{\lambda} appearing in it is now completely arbitrary. Comparing with (19), we see that fnλn(x)=Fn(x)f^{\lambda_{n}}_{n}(x)=F_{n}(x). Equation (17) has then the following interpretation: a new point xx is estimated to be a positive example (that is, to belong to the support XρX_{\rho}) if and only if fnλn(x)1τf^{\lambda_{n}}_{n}(x)\geq 1-\tau, where τ\tau is a threshold parameter.

The above discussion suggests several extensions and variations of our method, obtained considering more general penalized empirical risk minimization functionals of the form

VV is a (regression) loss function measuring the approximation property of ff, for example the logistic loss or a robust loss such as the one used in support vector machine regression. Our theoretical analysis does not carry on to other loss functions and different mathematical concepts from empirical process theory are probably needed;

Empirical Analysis

In this section we describe some preliminary experiments aimed at testing the properties and the performances of the proposed methods both on simulated and real data. We only discuss spectral algorithms induced by Tikhonov regularization to contrast the general method to some current state of the art algorithms. Note that while computations can be made more efficient in several ways, we consider a simple algorithmic protocol and leave a more refined computational study for future work. Recall that Tikhonov regularization defines an estimator Fn(x)=Kx(Kn+nλ)1KxF_{n}(x)={\mathbf{K}_{x}}^{\ast}({\mathbf{K}}_{n}+n\lambda)^{-1}{\mathbf{K}}_{x}, and a point xx is labeled as belonging to the support XρX_{\rho} if Fn(x)1τF_{n}(x)\geq 1-\tau. The computational cost for the algorithm is, in the worst case, of order n3n^{3} – like standard regularized least squares – for training, and order Nn2Nn^{2} if we have to predict the value of FnF_{n} at NN test points. In practice, one has to choose a good value for the regularization parameter λ\lambda and this requires computing multiple solutions, a so called regularization path. As noted in , if we form the inverse using the eigendecomposition of the kernel matrix the price of computing the full regularization path is essentially the same as that of computing a single solution (note that the cost of the eigen-decomposition of Kn{\mathbf{K}}_{n} is also of order n3n^{3}, though the constant is worse). This is the strategy that we consider in the following. In our experiments we considered two datasets: the MNISThttp://yann.lecun.com/exdb/mnist/ dataset and the CBCLhttp://cbcl.mit.edu/ face database. For the digits we considered a reduced set consisting of a training set of 5000 images and a test set of 1000 images. In the first experiment we trained on 500500 images for the digit 33 and tested on 200200 images of digits 33 and 88. Each experiment consists of training on one class and testing on two different classes and was repeated for 20 trials over different training set choices. For all our experiments we considered the Abel kernel. Note that in this case the algorithm requires to choose 33 parameters: the regularization parameter λ\lambda, the kernel width σ\sigma and the threshold τ\tau. In supervised learning cross validation is typically used for parameter tuning, but cannot be used in our setting since support estimation is an unsupervised problem. Then, we considered the following heuristics. The kernel width is chosen as the median of the distribution of distances of the kk-th nearest neighbor of each training set point for k=10k=10. Fixed the kernel width, we choose the regularization parameter in correspondence of the maximum curvature in the eigenvalue behavior – see Figure 3 – the rationale being that after this value the eigenvalues are relatively small.

For comparison we considered a Parzen window density estimator and one-class SVM (1CSVM) as implemented by . For the Parzen window estimator we used the same kernel of the spectral algorithm, that is the Laplacian kernel, and also the same width. Given a kernel width, an estimate of the probability distribution is computed and can be used to estimate the support by fixing a threshold τ\tau^{\prime}. For the one-class SVM we considered the Gaussian kernel, so that we have to fix the kernel width and a regularization parameter ν\nu. We fixed the kernel width to be the same used by our estimator and set ν=0.9\nu=0.9. For the sake of comparison, also for one-class SVM we considered a varying offset τ\tau^{{}^{\prime\prime}}. The performance is evaluated computing ROC curve (and the corresponding AUC value) for varying values of the thresholds τ,τ,τ\tau,\tau^{\prime},\tau^{{}^{\prime\prime}}. The ROC curves on the different tasks are reported (for one of the trials) in Figure 4, Left. The mean and standard deviation of the AUC for the three methods is reported in Table 2. Similar experiments were repeated considering other pairs of digits, see Table 2. Also in the case of the CBCL datasets we considered a reduced dataset consisting of 472472 images for training and other 472472 for test. On the different test performed on the MNIST data the spectral algorithm always achieves results which are better – and often substantially better – than those of the other methods. On the CBCL dataset SVM provides the best result, but spectral algorithm still provides a competitive performance.

We remark that, although binary classification data sets are used in the experiments, the considered set-up is that of a one-class classification problem. Indeed, the training and tuning of the algorithms are performed using only examples of one class and the other class is only considered for testing. Accordingly, the proposed methods are compared to state of the art algorithms for one-class classification.

Appendix A Auxiliary Proofs

In this section we give the proofs of a few technical results needed in the paper.

The next result shows that, if KK is a reproducing kernel which is nonzero on the diagonal, then it can be normalized, and its normalized version separates the same sets. When K(x,x)=0K(x,x)=0 for some xXx\in X, then clearly this result still holds replacing the set XX with XX0X\setminus X_{0} and considering the restriction of KK to (XX0)×(XX0)(X\setminus X_{0})\times(X\setminus X_{0}), where X0={xXK(x,x)=0}X_{0}=\left\{x\in X\mid K(x,x)=0\right\}.

Assume that K(x,x)>0K(x,x)>0 for all xXx\in X. Then, the reproducing kernel KK^{\prime} on XX, given by

is normalized and separates the same sets as KK.

Clearly KK is a kernel of positive type. Denote by H\mathcal{H}^{\prime} the reproducing kernel Hilbert space with kernel KK^{\prime}, and define the feature map Ψ:XH\Psi:X\to\mathcal{H}, Ψ(x)=Kx/Kx\Psi(x)=K_{x}/\left\lVert{K_{x}}\right\rVert. It is simple to check that Ψ(y),Ψ(x)=K(x,y)\left\langle{\Psi(y)},{\Psi(x)}\right\rangle=K^{\prime}(x,y) and Ψ(X)={0}\Psi(X)^{\perp}=\{0\}, so that the map Ψ:HH\Psi_{*}:\mathcal{H}\to\mathcal{H}^{\prime}

is a unitary operator with Kx=Ψ(Ψ(x))K^{\prime}_{x}=\Psi_{*}(\Psi(x)) . Clearly, for any fHf\in\mathcal{H} and xXx\in X

The above equality shows that H\mathcal{H} and H\mathcal{H}^{\prime} separate the same sets. ∎

A.2 Analytic Results

In this section, we suppose that the kernel KK satisfies Assumption 1, and endow the set XX with the metric dKd_{K} induced by KK. Measurability of a map taking values in a topological space will be always understood with respect to the Borel σ\sigma-algebra of such space. The next simple lemma will be used frequently.

is continuous and measurable. Moreover, if Zi:ΩSkZ_{i}:\Omega\to\mathcal{S}_{k} is given by

then ZiZ_{i} is measurable for all i1i\geq 1.

The map xKx{x\mapsto K_{x}}, is continuous from XX into H\mathcal{H} by item i) in Proposition 1. Since ξ(x)=KxKx\xi(x)=K_{x}\otimes K_{x}, continuity of ξ\xi follows at once. By item v) in Proposition 1, ξ\xi is then a measurable map, hence ZiZ_{i} is such. ∎

We recall some basic properties of the operator TT defined by the kernel. The next result is known (see for example ), but we report a short proof for completeness.

The S1\mathcal{S}_{1}-valued map ξ\xi defined in Lemma 5 is Bochner-integrable with respect to ρ\rho, and its integral

is a positive trace class operator on H\mathcal{H}, with TS1=tr[T]=1\left\lVert{T}\right\rVert_{\mathcal{S}_{1}}=\operatorname{tr}\left[T\right]=1.

The map ξ\xi isbounded because KxKxS1=tr[KxKx]=K(x,x)=1\left\lVert{K_{x}\otimes K_{x}}\right\rVert_{\mathcal{S}_{1}}=\operatorname{tr}\left[K_{x}\otimes K_{x}\right]=K(x,x)=1 and measurable by Lemma 5 . Therefore, ξ\xi is a Bochner-integrable S1\mathcal{S}_{1}-valued map, and its integral TT is a trace class operator. As ξ(x)\xi(x) is a positive operator for all xx, so is TT. In particular, TS1=tr[T]\left\lVert{T}\right\rVert_{\mathcal{S}_{1}}=\operatorname{tr}\left[T\right], and tr[T]=Xtr[KxKx]dρ(x)=1\operatorname{tr}\left[T\right]=\int_{X}\operatorname{tr}\left[K_{x}\otimes K_{x}\right]d\rho(x)=1. ∎

Now, we come to the proof of Proposition 7. We will split it into the proofs of Propositions 15 and 16 below.

For all n1n\geq 1, the map TnT_{n} defined in (15) is a Sk\mathcal{S}_{k}-valued estimator for k=1,2k=1,2.

hence TnT_{n} is measurable by Lemma 6. ∎

For the next proposition we recall that the topology of uniform convergence on compact subsets of XX is generated by the following basis of open sets Uf,ϵ,CC(X)U_{f,\epsilon,C}\subset C(X)

Suppose XX is locally compact. Let (rλ)λ>0(r_{\lambda})_{\lambda>0} be a family of functions rλ:r_{\lambda}:\to such that each rλr_{\lambda} is upper semicontinuous. Then, for any sequence of positive numbers (λn)n1(\lambda_{n})_{n\geq 1} and all n1n\geq 1, the map FnF_{n} defined in (16) is a C(X)C(X)-valued estimator, where C(X)C(X) is the space of continuous functions on XX with the topology of uniform convergence on compact subsets.

Throughout the proof, n1n\geq 1 will be fixed. Let (φk)k1(\varphi_{k})_{k\geq 1} be a decreasing sequence of continuous functions φk:\varphi_{k}:\to such that φk(σ)rλn(σ)\varphi_{k}(\sigma)\downarrow r_{\lambda_{n}}(\sigma) for all σ\sigma\in (such sequence exists by (12.7.8) of ). Then, by Lemma 6 and continuity of the functional calculus (see e.g. Problem 126 in ), for all k1k\geq 1 the map

is continuos from XnX^{n} into the Banach space S0{\mathcal{S}}_{0} of the bounded operators on H\mathcal{H} with the uniform operator norm. Thus, for all xXx\in X, the real function (x1,,xn)[φk(Tˇn)](x1,,xn)Kx,Kx(x_{1},\ldots,x_{n})\mapsto\left\langle{[\varphi_{k}(\check{T}_{n})](x_{1},\ldots,x_{n})\,K_{x}},{K_{x}}\right\rangle is continuous on XnX^{n}, hence is measurable by item v) of Proposition 1. By spectral calculus and dominated convergence theorem, for all ω=(xi)i1\omega=(x_{i})_{i\geq 1}

It then follows that, for each xXx\in X, the real function ωrλn(Tn(ω))Kx,Kx\omega\mapsto\left\langle{r_{\lambda_{n}}(T_{n}(\omega))K_{x}},{K_{x}}\right\rangle is measurable on Ω\Omega, being the pointwise limit of measurable functions. We now prove that the map Fn:ω(xrλn(Tn(ω))Kx,Kx)F_{n}:\omega\mapsto(x\mapsto\left\langle{r_{\lambda_{n}}(T_{n}(\omega))K_{x}},{K_{x}}\right\rangle) is measurable from Ω\Omega into the space C(X)C(X). By M2, p. 115 in , this is equivalent to the measurability of the subsets Fn1(U)ΩF_{n}^{-1}(U)\subset\Omega for all open sets UC(X)U\subset C(X). Since XX is a locally compact separable metric space, the topology of uniform convergence on compact subsets is a separable metric topology on C(X)C(X) by (12.14.6.2) in . By separability of C(X)C(X), each open set UC(X)U\subset C(X) then is the denumerable union of sets of the neighborhood basis {Uf,ϵ,CfC(X),ϵ>0,CX\mboxcompact}\{U_{f,\epsilon,C}\mid f\in C(X),\,\epsilon>0,\,C\subset X\mbox{ compact}\}. Hence, it is enough to show that Fn1(Uf,ϵ,C)F_{n}^{-1}(U_{f,\epsilon,C}) is measurable for all ff, ϵ\epsilon and CC. We have

By separability of XX, there exists a countable set C0CC_{0}\subset C such that C0=C\overline{C_{0}}=C. A continuity argument then shows that

Since each set {ωΩf(x)rλn(Tn(ω))Kx,Kxϵ1/k}\left\{{\omega\in\Omega}\mid\left\lvert{f(x)-\left\langle{r_{\lambda_{n}}(T_{n}({\omega}))\,K_{x}},{K_{x}}\right\rangle}\right\rvert\leq\epsilon-1/k\right\} is measurable in Ω\Omega, measurability of the countable intersection Fn1(Uf,ϵ,C)F_{n}^{-1}(U_{f,\epsilon,C}) then follows. ∎

We conclude this section with the proof of measurability of the threshold parameters (τn)n1(\tau_{n})_{n\geq 1} defined in (30).

A.3 A Useful Inequality

The following proof of inequality (45) below is due to A. Maurerhttp://www.andreas-maurer.eu.

In particular, if rr is a Lipshitz function with Lipshitz constant LrL_{r}, then

Let (fj)jJ(f_{j})_{j\in J} and (gk)kK(g_{k})_{k\in K} be the orthonormal bases of eigenvectors of SS and TT corresponding to the eigenvalues (σj)jJ(\sigma_{j})_{j\in J} and (τk)kK(\tau_{k})_{k\in K}, respectively, which here we list repeated accordingly to their multiplicity. We have

A.4 Concentration of Measure Results

We will need in particular the next two straightforward consequences of this inequality.

with probability at least 12eδ1-2e^{-\delta}.

where g(t)=t2/(1+t+1+2t)g(t)=t^{2}/(1+t+\sqrt{1+2t}). Since g1(t)=t+2tg^{-1}(t)=t+\sqrt{2t}, by solving the equation (σ2n/M2)g(Mϵ/σ2)=δ(\sigma^{2}n/M^{2})g(M\epsilon/\sigma^{2})=\delta we have

The above result and Borel-Cantelli lemma imply that

almost surely. In the paper we actually need a slightly stronger result which is given in the following lemma.

If Z1,Z2,Z_{1},Z_{2},\ldots is a sequence of i.i.d. V\mathcal{V}-valued random variables, such that ZiVM\left\lVert{Z_{i}}\right\rVert_{\mathcal{V}}\leq M almost surely, then we have

We continue with the notations in the proof of Lemma 8. By (46), for all ϵ>0\epsilon>0 we have

For all ϵ>0\epsilon>0, limnA(n,ϵ)/logn=+\lim_{n\to\infty}A(n,\epsilon)/\log n=+\infty, so that the series n1nA(n,ϵ)/logn\sum_{n\geq 1}n^{-A(n,\epsilon)/\log n} is convergent, and Borel-Cantelli lemma gives the result. ∎

The following inequality is given in and we report its proof for completeness.

If Assumption 1 holds true, then for all δ>0\delta>0 we have

with probability at least 12eδ1-2e^{-\delta}.

where we have bounded the operator norm (T+λ)1\left\lVert{(T+\lambda)^{-1}}\right\rVert_{\infty} by 1/λ1/\lambda. The result follows applying Lemma 8. ∎

Acknowledgement

A. T. acknowledges the financial support of the Italian Ministry of Education, University and Research (FIRB project RBFR10COAQ). L. R. acknowledges the financial support of the Italian Ministry of Education, University and Research (FIRB project RBFR12M3AC).

References