Density Estimation in Infinite Dimensional Exponential Families

Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Aapo Hyvärinen, Revant Kumar

Introduction

In this paper, we consider an infinite dimensional generalization (Canu and Smola, 2005; Fukumizu, 2009) of (1),

where the function space F\mathcal{F} is defined as

being the cumulant generating function, and (H,,H)(\mathcal{H},\langle\cdot,\cdot\rangle_{\mathcal{H}}) a reproducing kernel Hilbert space (RKHS) (Aronszajn, 1950) with kk as its reproducing kernel. While various generalizations are possible for different choices of F\mathcal{F} (e.g., an Orlicz space as in Pistone and Sempi, 1995), the connection of P\mathcal{P} to the natural exponential family in (1) is particularly enlightening when H\mathcal{H} is an RKHS. This is due to the reproducing property of the kernel, f(x)=f,k(x,)Hf(x)=\langle f,k(x,\cdot)\rangle_{\mathcal{H}}, through which k(x,)k(x,\cdot) takes the role of the sufficient statistic. In fact, it can be shown (see Section 3 and Example 1 for more details) that every Pfin\mathscr{P}_{\text{fin}} is generated by P\mathcal{P} induced by a finite dimensional RKHS H\mathcal{H}, and therefore the family P\mathcal{P} with H\mathcal{H} being an infinite dimensional RKHS is a natural infinite dimensional generalization of Pfin\mathscr{P}_{\text{fin}}. Furthermore, this generalization is particularly interesting as in contrast to Pfin\mathscr{P}_{\text{fin}}, it can be shown that P\mathcal{P} is a rich class of densities (depending on the choice of kk and therefore H\mathcal{H}) that can approximate a broad class of probability densities arbitrarily well (see Propositions 1, 13 and Corollary 2). This generalization is not only of theoretical interest, but also has implications for statistical and machine learning applications. For example, in Bayesian non-parametric density estimation, the densities in P\mathcal{P} are chosen as prior distributions on a collection of probability densities (e.g., see van der Vaart and van Zanten, 2008). P\mathcal{P} has also found applications in nonparametric hypothesis testing (Gretton et al., 2012; Fukumizu et al., 2008) and dimensionality reduction (Fukumizu et al., 2004, 2009) through the mean and covariance operators, which are obtained as the first and second Fréchet derivatives of A(f)A(f) (see Fukumizu, 2009, Section 1.2.3). Recently, the infinite dimensional exponential family, P\mathcal{P} has been used to develop a gradient-free adaptive MCMC algorithm based on Hamiltonian Monte Carlo (Strathmann et al., 2015) and also has been used in the context of learning the structure of graphical models (Sun et al., 2015).

Motivated by the richness of the infinite dimensional generalization and its statistical applications, it is of interest to model densities by P\mathcal{P}, and therefore the goal of this paper is to estimate unknown densities by elements in P\mathcal{P} when H\mathcal{H} is an infinite dimensional RKHS. Formally, given i.i.d. random samples (Xa)a=1n(X_{a})^{n}_{a=1} drawn from an unknown density p0p_{0}, the goal is to estimate p0p_{0} through P\mathcal{P}. Throughout the paper, we refer to case of p0Pp_{0}\in\mathcal{P} as well-specified, in contrast to the misspecified case where p0Pp_{0}\notin\mathcal{P}. The setting is useful because P\mathcal{P} is a rich class of densities that can approximate a broad class of probability densities arbitrarily well, hence it may be widely used in place of non-parametric density estimation methods (e.g., kernel density estimation (KDE)). In fact, through numerical simulations, we show in Section 6 that estimating p0p_{0} through P\mathcal{P} performs better than KDE, and that the advantage of the proposed estimator grows with increasing dimensionality.

A related work was carried out by Barron and Sheu (1991)—also see references therein—where the goal is to estimate a density, p0p_{0} by approximating its logarithm as an expansion in terms of basis functions, such as polynomials, splines or trigonometric series. Similar to Fukumizu (2009), Barron and Sheu proposed the ML estimator pf^mp_{\hat{f}_{m}}, where

and Fm\mathcal{F}_{m} is the linear space of dimension mm spanned by the chosen basis functions. Under the assumption that logp0\log p_{0} has square-integrable derivatives up to order rr, they showed that KL(p0pf^m)=Op0(n2r/(2r+1))KL(p_{0}\|p_{\hat{f}_{m}})=O_{p_{0}}(n^{-2r/(2r+1)}) with m=n1/(2r+1)m=n^{1/(2r+1)} for each of the approximating families, where KL(pq)=p(x)log(p(x)/q(x))dxKL(p\|q)=\int p(x)\log(p(x)/q(x))\,dx is the KL divergence between pp and qq. Similar work was carried out by Gu and Qiu (1993), who assumed that logp0\log p_{0} lies in an RKHS, and proposed an estimator based on penalized MLE, with consistency and rates established in Jensen-Shannon divergence. Though these results are theoretically interesting, these estimators are obtained via a procedure similar to that in Fukumizu (2009), and therefore suffers from the practical drawbacks discussed above.

The discussion so far shows that the MLE approach to learning p0Pp_{0}\in\mathcal{P} results in estimators that are of limited practical interest. To alleviate this, one can treat the problem of estimating p0Pp_{0}\in\mathcal{P} in a completely non-parametric fashion by using KDE, which is well-studied (Tsybakov, 2009, Chapter 1) and easy to implement. This approach ignores the structure of P\mathcal{P}, however, and is known to perform poorly for moderate to large dd (Wasserman, 2006, Section 6.5) (see also Section 6 of this paper).

To understand the advantages associated with the score matching method, let us consider the problem of density estimation where the data generating distribution (say p0p_{0}) belongs to Pfin\mathscr{P}_{\text{fin}} in (1). In other words, given random samples (Xa)a=1n(X_{a})^{n}_{a=1} drawn i.i.d. from p0:=pθ0p_{0}:=p_{\theta_{0}}, the goal is to estimate θ0\theta_{0} as θ^n\hat{\theta}_{n}, and use pθ^np_{\hat{\theta}_{n}} as an estimator of p0p_{0}. While the MLE approach is well-studied and enjoys nice statistical properties in asymptopia (i.e., asymptotically unbiased, efficient, and normally distributed), the computation of θ^n\hat{\theta}_{n} can be intractable in many situations as discussed above. In particular, this is the case for pθ(x)=rθ(x)A(θ)p_{\theta}(x)=\frac{r_{\theta}(x)}{A(\theta)} where rθ0r_{\theta}\geq 0 for all θΘ\theta\in\Theta, A(θ)=Ωrθ(x)dxA(\theta)=\int_{\Omega}r_{\theta}(x)\,dx, and the functional form of rr is known (as a function of θ\theta and xx); yet we do not know how to easily compute AA, which is often analytically intractable. In this setting (which is exactly the setting of this paper), assuming pθp_{\theta} to be differentiable (w.r.t. xx), and Ωp0(x)logpθ(x)22dx<,θΘ\int_{\Omega}p_{0}(x)\|\nabla\log p_{\theta}(x)\|^{2}_{2}\,dx<\infty,\,\forall\,\theta\in\Theta, J(p0pθ)=:J(θ)J(p_{0}\|p_{\theta})=:J(\theta) in (3) reduces to

through integration by parts (see Hyvärinen, 2005, Theorem 1), under appropriate regularity conditions on p0p_{0} and pθp_{\theta} for all θΘ\theta\in\Theta. Here i2logpθ(x):=2xi2logpθ(x)\partial^{2}_{i}\log p_{\theta}(x):=\frac{\partial^{2}}{\partial x^{2}_{i}}\log p_{\theta}(x). The main advantage of the objective in (3) (and also (4)) is that when it is applied to the situation discussed above where pθ(x)=rθ(x)A(θ)p_{\theta}(x)=\frac{r_{\theta}(x)}{A(\theta)}, J(θ)J(\theta) is independent of A(θ)A(\theta), and an estimate of θ0\theta_{0} can be obtained by simply minimizing the empirical counterpart of J(θ)J(\theta), given by

Since Jn(θ)J_{n}(\theta) is also independent of A(θ)A(\theta), θ^n=argminθΘJn(θ)\hat{\theta}_{n}=\arg\min_{\theta\in\Theta}J_{n}(\theta) may be easily computable, unlike the MLE. We would like to highlight that while the score matching approach may have computational advantages over MLE, it only estimates pθp_{\theta} up to the scaling factor A(θ)A(\theta), and therefore requires the approximation or computation of A(θ)A(\theta) through numerical integration to estimate pθp_{\theta}. Note that this issue (of computing A(θ)A(\theta) through numerical integration) exists even with MLE, but not with KDE. In score matching, however, numerical integration is needed only once, while MLE would typically require a functional form of the log-partition function which is approximated through numerical integration at every step of an iterative optimization algorithm (for example, see (2)), thus leading to major computational savings. An important application that does not require the computation of A(θ)A(\theta) is in finding modes of the distribution, which has recently become very popular in image processing (Comaniciu and Meer, 2002), and has already been investigated in the score matching framework (Sasaki et al., 2014). Similarly, in sampling methods such as sequential Monte Carlo (Doucet et al., 2001), it is often the case that the evaluation of unnormalized densities is sufficient to calculate required importance weights.

2 Contributions

(i) We present an estimate of p0Pp_{0}\in\mathcal{P} in the well-specified case through the minimization of Fisher divergence, in Section 4. First, we show that estimating p0:=pf0p_{0}:=p_{f_{0}} using the score matching method reduces to estimating f0f_{0} by solving a simple finite-dimensional linear system (Theorems 4 and 5). Hyvärinen (2007) obtained a similar result for Pfin\mathscr{P}_{\text{fin}} where the estimator is obtained by solving a linear system, which in the case of Gaussian family matches the MLE (Hyvärinen, 2005). The estimator obtained in the infinite dimensional case is not a simple extension of its finite-dimensional counterpart, however, as the former requires an appropriate regularizer (we use H2\|\cdot\|^{2}_{\mathcal{H}}) to make the problem well-posed. We would like to highlight that to the best of our knowledge, the proposed estimator is the first practically computable estimator of p0p_{0} with consistency guarantees (see below). (ii) In contrast to Hyvärinen (2007) where no guarantees on consistency or convergence rates are provided for the density estimator in Pfin\mathscr{P}_{\text{fin}}, we establish in Theorem 6 the consistency and rates of convergence for the proposed estimator of f0f_{0}, and use these to prove consistency and rates of convergence for the corresponding plug-in estimator of p0p_{0} (Theorems 7 and B.2), even when H\mathcal{H} is infinite dimensional. Furthermore, while the estimator of f0f_{0} (and therefore p0p_{0}) is obtained by minimizing the Fisher divergence, the resultant density estimator is also shown to be consistent in KL divergence (and therefore in Hellinger and total-variation distances) and we provide convergence rates in all these distances.

Formally, we show that the proposed estimator f^n\hat{f}_{n} is converges as

if f0R(Cβ)f_{0}\in\mathcal{R}(C^{\beta}) for some β>0\beta>0, where R(A)\mathcal{R}(A) denotes the range or image of an operator AA, α=min{14,β2β+2}\alpha=\min\{\frac{1}{4},\frac{\beta}{2\beta+2}\}, and C:=i=1dΩik(x,)ik(x,)p0(x)dxC:=\sum^{d}_{i=1}\int_{\Omega}\partial_{i}k(x,\cdot)\otimes\partial_{i}k(x,\cdot)\,p_{0}(x)\,dx is a Hilbert-Schmidt operator on H\mathcal{H} (see Theorem 4) with kk being the reproducing kernel and \otimes denoting the tensor product. When H\mathcal{H} is a finite-dimensional RKHS, we show that the estimator enjoys parametric rates of convergence, i.e.,

Note that the convergence rates are obtained under a non-classical smoothness assumption on f0f_{0}, namely that it lies in the image of certain fractional power of CC, which reduces to a more classical assumption if we choose kk to be a Matérn kernel (see Section 2 for its definition), as it induces a Sobolev space. In Section 4.2, we discuss in detail the smoothness assumption on f0f_{0} for the Gaussian (Example 2) and Matérn (Example 3) kernels. Another interesting point to observe is that unlike in the classical function estimation methods (e.g., kernel density estimation and regression), the rates presented above for the proposed estimator tend to saturate for β>1\beta>1 (β>12\beta>\frac{1}{2} w.r.t. JJ), with the best rate attained at β=1\beta=1 (β=12\beta=\frac{1}{2} w.r.t. JJ), which means the smoothness of f0f_{0} is not fully captured by the estimator. Such a saturation behavior is well-studied in the inverse problem literature (Engl et al., 1996) where it has been attributed to the choice of regularizer. In Section 4.3, we discuss alternative regularization strategies using ideas from Bauer et al. (2007), which covers non-parametric least squares regression: we show that for appropriately chosen regularizers, the above mentioned rates hold for any β>0\beta>0, and do not saturate for the aforementioned ranges of β\beta (see Theorem 9). (iii) In Section 5, we study the problem of density estimation in the misspecified setting, i.e., p0Pp_{0}\notin\mathcal{P}, which is not addressed in Hyvärinen (2007) and Fukumizu (2009). Using a more sophisticated analysis than in the well-specified case, we show in Theorem 12 that J(p0pf^n)infpPJ(p0p)J(p_{0}\|p_{\hat{f}_{n}})\rightarrow\inf_{p\in\mathcal{P}}J(p_{0}\|p) as nn\rightarrow\infty. Under an appropriate smoothness assumption on logp0q0\log\frac{p_{0}}{q_{0}} (see the statement of Theorem 12 for details), we show that J(p0pf^n)0J(p_{0}\|p_{\hat{f}_{n}})\rightarrow 0 as nn\rightarrow\infty along with a rate for this convergence, even though p0Pp_{0}\notin\mathcal{P}. However, unlike in the well-specified case, where the consistency is obtained not only in JJ but also in other distances, we obtain convergence only in JJ for the misspecified case. Note that while Barron and Sheu (1991) considered the estimation of p0p_{0} in the misspecified setting, the results are restricted to the approximating families consisting of polynomials, splines, or trigonometric series. Our results are more general, as they hold for abstract RKHSs. (iv) In Section 6, we present preliminary numerical results comparing the proposed estimator with KDE in estimating a Gaussian and mixture of Gaussians, with the goal of empirically evaluating performance as dd gets large for a fixed sample size. In these two estimation problems, we show that the proposed estimator outperforms KDE, and the advantage grows as dd increases. Inspired by this preliminary empirical investigation, our proposed estimator (or computationally efficient approximations) has been used by Strathmann et al. (2015) in a gradient-free adaptive MCMC sampler, and by Sun et al. (2015) for graphical model structure learning. These applications demonstrate the practicality and performance of the proposed estimator.

Finally, we would like to make clear that our principal goal is not to construct density estimators that improve uniformly upon KDE, but to provide a novel flexible modeling technique for approximating an unknown density by a rich parametric family of densities, with the parameter being infinite dimensional, in contrast to the classical approach of finite dimensional approximation.

Various notations and definitions that are used throughout the paper are collected in Section 2. The proofs of the results are provided in Section 8, along with some supplementary results in an appendix.

Definitions & Notation

where ii denotes the imaginary unit 1\sqrt{-1}.

where Γ\Gamma is the Gamma function, and Kv\mathfrak{K}_{v} is the modified Bessel function of the third kind of order vv (vv controls the smoothness of kk).

Approximation of Densities by 𝒫𝒫\mathcal{P}

In this section, we first show that every finite dimensional exponential family, Pfin\mathscr{P}_{\text{fin}} is generated by the family P\mathcal{P} induced by a finite dimensional RKHS, which naturally leads to the infinite dimensional generalization of Pfin\mathscr{P}_{\text{fin}} when H\mathcal{H} is an infinite dimensional RKHS. Next, we investigate the approximation properties of P\mathcal{P} in Proposition 1 and Corollary 2 when H\mathcal{H} is an infinite dimensional RKHS.

The following are some popular examples of probability distributions that belong to Pfin\mathscr{P}_{\emph{fin}}. Here we show the corresponding RKHSs (H,k)(\mathcal{H},k) that generate these distributions. In some of these examples, we choose q0(x)=1q_{0}(x)=1 and ignore the fact that q0q_{0} is a probability distribution as assumed in the definition of P\mathcal{P}.

Beta: Ω=(0,1)\Omega=(0,1), k(x,y)=logxlogy+log(1x)log(1y)k(x,y)=\log x\log y+\log(1-x)\log(1-y).

Binomial: Ω={0,,m}\Omega=\{0,\ldots,m\}, k(x,y)=xyk(x,y)=xy, q0(x)=2m(mc)q_{0}(x)=2^{-m}{m\choose c}.

While Example 1 shows that all popular probability distributions are contained in P\mathcal{P} for an appropriate choice of finite-dimensional H\mathcal{H}, it is of interest to understand the richness of P\mathcal{P} (i.e., what class of distributions can be approximated arbitrarily well by P\mathcal{P}?) when H\mathcal{H} is an infinite dimensional RKHS. This is addressed by the following result, which is proved in Section 8.1.

Then P\mathcal{P} is dense in P0\mathcal{P}_{0} w.r.t. Kullback-Leibler divergence, total variation (L1L^{1} norm) and Hellinger distances. In addition, if q0L1(Ω)Lr(Ω)q_{0}\in L^{1}(\Omega)\cap L^{r}(\Omega) for some 1<r1<r\leq\infty, then P\mathcal{P} is also dense in P0\mathcal{P}_{0} w.r.t. LrL^{r} norm.

Suppose k(x,)C0(Ω),xΩk(x,\cdot)\in C_{0}(\Omega),\,\forall\,x\in\Omega and (5) holds. Then P\mathcal{P} is dense in Pc\mathcal{P}_{c} w.r.t. KL divergence, TV and Hellinger distances. Moreover, if q0L1(Ω)Lr(Ω)q_{0}\in L^{1}(\Omega)\cap L^{r}(\Omega) for some 1<r1<r\leq\infty, then P\mathcal{P} is also dense in Pc\mathcal{P}_{c} w.r.t. LrL^{r} norm.

By choosing Ω\Omega to be compact and q0q_{0} to be a uniform distribution on Ω\Omega, Corollary 2 reduces to an easily interpretable result that any continuous density p0p_{0} on Ω\Omega can be approximated arbitrarily well by densities in P\mathcal{P} in KL, Hellinger and LrL^{r} (1r1\leq r\leq\infty) distances.

Similar to the results so far, an approximation result for P\mathcal{P} can also be obtained w.r.t. Fisher divergence (see Proposition 13). Since this result is heavily based on the notions and results developed in Section 5, we defer its presentation until that section. Briefly, this result states that if H\mathcal{H} is sufficiently rich (i.e., dense in an appropriate class of functions), then any pC1(Ω)p\in C^{1}(\Omega) with J(pq0)<J(p\|q_{0})<\infty can be approximated arbitrarily well by elements in P\mathcal{P} w.r.t. Fisher divergence, where q0C1(Ω)q_{0}\in C^{1}(\Omega).

Density Estimation in 𝒫𝒫\mathcal{P}: Well-specified Case

In this section, we present our score matching estimator for an unknown density p0:=pf0Pp_{0}:=p_{f_{0}}\in\mathcal{P} (well-specified case) from i.i.d. random samples (Xa)a=1n(X_{a})^{n}_{a=1} drawn from it. This involves choosing the minimizer of the (empirical) Fisher divergence between p0p_{0} and pfPp_{f}\in\mathcal{P} as the estimator, f^\hat{f} which we show in Theorem 5 to be obtained by solving a simple finite-dimensional linear system. In contrast, we would like to remind the reader that the MLE is infeasible in practice due to the difficulty in handling A(f)A(f). The consistency and convergence rates of f^F\hat{f}\in\mathcal{F} and the plug-in estimator pf^p_{\hat{f}} are provided in Section 4.1 (see Theorems 6 and 7). Before we proceed, we list the assumptions on p0p_{0}, q0q_{0} and H\mathcal{H} that we need in our analysis.

p0p_{0} is continuously extendible to Ω\overline{\Omega}. kk is twice continuously differentiable on Ω×Ω\Omega\times\Omega with continuous extension of α,αk\partial^{\alpha,\alpha}k to Ω×Ω\overline{\Omega}\times\overline{\Omega} for α2|\alpha|\leq 2.

ii+dk(x,x)p0(x)=0\partial_{i}\partial_{i+d}k(x,x)p_{0}(x)=0 for xΩx\in\partial\Omega and ii+dk(x,x)p0(x)=o(x21d)\sqrt{\partial_{i}\partial_{i+d}k(x,x)}p_{0}(x)=o(\|x\|^{1-d}_{2}) as xΩx\in\Omega, x2\|x\|_{2}\rightarrow\infty for all i[d]i\in[d].

(ε\varepsilon-Integrability) For some ε1\varepsilon\geq 1 and i[d]\forall\,i\in[d], ii+dk(x,x),i2i+d2k(x,x)\partial_{i}\partial_{i+d}k(x,x),\sqrt{\partial^{2}_{i}\partial^{2}_{i+d}k(x,x)} and ii+dk(x,x)ilogq0(x)Lε(Ω,p0),\sqrt{\partial_{i}\partial_{i+d}k(x,x)}\partial_{i}\log q_{0}(x)\in L^{\varepsilon}(\Omega,p_{0}), where q0C1(Ω).q_{0}\in C^{1}(\Omega).

Under these assumptions, the following result—proved in Section 8.3—shows that the problem of estimating p0p_{0} through the minimization of Fisher divergence reduces to the problem of estimating f0f_{0} through a weighted least squares minimization in H\mathcal{H} (see parts (i) and (ii)). This motivates the minimization of the regularized empirical weighted least squares (see part (iv)) to obtain an estimator fλ,nf_{\lambda,n} of f0f_{0}, which is then used to construct the plug-in estimate pfλ,np_{f_{\lambda,n}} of p0p_{0}.

Suppose (A)–(D) hold with ε=1\varepsilon=1. Then J(p0pf)<J(p_{0}\|p_{f})<\infty for all fFf\in\mathcal{F}. In addition, the following hold. (i) For all fFf\in\mathcal{F},

where C:HHC:\mathcal{H}\rightarrow\mathcal{H}, C:=Ωp0(x)i=1dik(x,)ik(x,)dxC:=\int_{\Omega}p_{0}(x)\sum^{d}_{i=1}\partial_{i}k(x,\cdot)\otimes\partial_{i}k(x,\cdot)\,dx is a trace-class positive operator with

and f0f_{0} satisfies Cf0=ξCf_{0}=-\xi. (iii) For any λ>0\lambda>0, a unique minimizer fλf_{\lambda} of Jλ(f):=J(f)+λ2fH2J_{\lambda}(f):=J(f)+\frac{\lambda}{2}\|f\|^{2}_{\mathcal{H}} over H\mathcal{H} exists and is given by

(iv) (Estimator of f0f_{0}) Given samples (Xa)a=1n(X_{a})^{n}_{a=1} drawn i.i.d. from p0p_{0}, for any λ>0\lambda>0, the unique minimizer fλ,nf_{\lambda,n} of J^λ(f):=J^(f)+λ2fH2\hat{J}_{\lambda}(f):=\hat{J}(f)+\frac{\lambda}{2}\|f\|^{2}_{\mathcal{H}} over H\mathcal{H} exists and is given by

where J^(f):=12f,C^fH+f,ξ^H+J(p0q0)\hat{J}(f):=\frac{1}{2}\langle f,\hat{C}f\rangle_{\mathcal{H}}+\langle f,\hat{\xi}\rangle_{\mathcal{H}}+J(p_{0}\|q_{0}), C^:=1na=1ni=1dik(Xa,)ik(Xa,)\hat{C}:=\frac{1}{n}\sum^{n}_{a=1}\sum^{d}_{i=1}\partial_{i}k(X_{a},\cdot)\otimes\partial_{i}k(X_{a},\cdot) and

An advantage of the alternate formulation of J(f)J(f) in Theorem 4(ii) over (6) is that it provides a simple way to obtain an empirical estimate of J(f)J(f)—by replacing CC and ξ\xi by their empirical estimators, C^\hat{C} and ξ^\hat{\xi} respectively—from finite samples drawn i.i.d. from p0p_{0}, which is then used to obtain an estimator of f0f_{0}. Note that the empirical estimate of J(f)J(f), i.e., J^(f)\hat{J}(f) depends only on C^\hat{C} and ξ^\hat{\xi} which in turn depend on the known quantities, kk and q0q_{0}, and therefore fλ,nf_{\lambda,n} in Theorem 4(iv) should in principle be computable. In practice, however, it is not easy to compute the expression for fλ,n=(C^+λI)1ξ^f_{\lambda,n}=-(\hat{C}+\lambda I)^{-1}\hat{\xi} as it involves solving an infinite dimensional linear system. In Theorem 5 (proved in Section 8.4), we provide an alternative expression for fλ,nf_{\lambda,n} as a solution of a simple finite-dimensional linear system (see (7) and (8)), using the general representer theorem (see Theorem A.2). It is interesting to note that while the solution to J(f)J(f) in Theorem 4(ii) is obtained by solving a non-linear system, Cf0=ξCf_{0}=-\xi (the system is non-linear as CC depends on p0p_{0} which in turn depends on f0f_{0}), its estimator fλ,nf_{\lambda,n} proposed in Theorem 4, is obtained by solving a simple linear system. In addition, we would like to highlight the fact that the proposed estimator, fλ,nf_{\lambda,n} is precisely the Tikhonov regularized solution (which is well-studied in the theory of linear inverse problems) to the ill-posed linear system C^f=ξ^\hat{C}f=-\hat{\xi}. We further discuss the choice of regularizer in Section 4.3 using ideas from the inverse problem literature.

An important remark we would like to make about Theorem 4 is that though J(f)J(f) in (6) is valid only for fFf\in\mathcal{F}, as it is obtained from J(p0pf)J(p_{0}\|p_{f}) where p0,pfPp_{0},p_{f}\in\mathcal{P}, the expression ff0,C(ff0)H\langle f-f_{0},C(f-f_{0})\rangle_{\mathcal{H}} is valid for any fHf\in\mathcal{H}, as it is finite under the assumption that (D) holds with ε=1\varepsilon=1. Therefore, in Theorem 4(iii, iv), fλf_{\lambda} and fλ,nf_{\lambda,n} are obtained by minimizing JλJ_{\lambda} and J^λ\hat{J}_{\lambda} over H\mathcal{H} instead of over F\mathcal{F}, as the latter does not yield a nice expression (unlike fλf_{\lambda} and fλ,nf_{\lambda,n}, respectively). However, there is no guarantee that fλ,nFf_{\lambda,n}\in\mathcal{F}, and so the density estimator pfλ,np_{f_{\lambda,n}} may not be valid. While this is not an issue when studying the convergence of fλ,nf0H\|f_{\lambda,n}-f_{0}\|_{\mathcal{H}} (see Theorem 6), the convergence of pfλ,np_{f_{\lambda,n}} to p0p_{0} (in various distances) needs to be handled slightly differently depending on whether the kernel is bounded or not (see Theorems 7 and B.2). Note that when the kernel is bounded, we obtain F=H\mathcal{F}=\mathcal{H}, which implies pfλ,np_{f_{\lambda,n}} is valid.

Let fλ,n=arginffHJ^λ(f)f_{\lambda,n}=\arg\inf_{f\in\mathcal{H}}\hat{J}_{\lambda}(f), where J^λ(f)\hat{J}_{\lambda}(f) is defined in Theorem 4(iv) and λ>0\lambda>0. Then

where ξ^\hat{\xi} is defined in Theorem 4(iv) and β=(β(a1)d+i)a,i\bm{\beta}=(\beta_{(a-1)d+i})_{a,i} is obtained by solving

with (G)(a1)d+i,(b1)d+j=ij+dk(Xa,Xb)and(\bm{G})_{(a-1)d+i,(b-1)d+j}=\partial_{i}\partial_{j+d}k(X_{a},X_{b})\,\,\,\text{and}

We would like to highlight that though fλ,nf_{\lambda,n} requires solving a simple linear system in (8), it can still be computationally intensive when dd and nn are large as G\bm{G} is a nd×ndnd\times nd matrix. This is still a better scenario than that of MLE, however, since computationally efficient methods exist to solve large linear systems such as (8), whereas MLE can be intractable due to the difficulty in handling the log-partition function (though it can be approximated). On the other hand, MLE is statistically well-understood, with consistency and convergence rates established in general for the problem of density estimation (van de Geer, 2000) and in particular for the problem at hand (Fukumizu, 2009). In order to ensure that fλ,nf_{\lambda,n} and pfλ,np_{f_{\lambda,n}} are statistically useful, in the following section, we investigate their consistency and convergence rates under some smoothness conditions on f0f_{0}.

Suppose (A)–(D) with ε=2\varepsilon=2 hold. (i) If f0R(C)f_{0}\in\overline{\mathcal{R}(C)}, then fλ,nf0Hp00asλ0,λnandn.\left\|f_{\lambda,n}-f_{0}\right\|_{\mathcal{H}}\stackrel{{\scriptstyle p_{0}}}{{\rightarrow}}0\,\,\text{as}\,\,\lambda\to 0,\,\lambda\sqrt{n}\to\infty\,\,\text{and}\,\,n\to\infty. (ii) If f0R(Cβ)f_{0}\in\mathcal{R}(C^{\beta}) for some β>0\beta>0, then for λ=nmax{14,12(β+1)}\lambda=n^{-\max\left\{\frac{1}{4},\frac{1}{2(\beta+1)}\right\}},

(iii) If C1<\|C^{-1}\|<\infty, then for λ=n12\lambda=n^{-\frac{1}{2}}, fλ,nf0H=Op0(n1/2)\|f_{\lambda,n}-f_{0}\|_{\mathcal{H}}=O_{p_{0}}(n^{-1/2}) as nn\rightarrow\infty.

(i) While Theorem 6 (proved in Section 8.5) provides an asymptotic behavior for fλ,nf0H\|f_{\lambda,n}-f_{0}\|_{\mathcal{H}} under conditions that depend on p0p_{0} (and are therefore not easy to check in practice), a non-asymptotic bound on fλ,nf0H\|f_{\lambda,n}-f_{0}\|_{\mathcal{H}} that holds for all n1n\geq 1 can be obtained under stronger assumptions through an application of Bernstein’s inequality in separable Hilbert spaces. For the sake of simplicity, we provided asymptotic results which are obtained through an application of Chebyshev’s inequality. (ii) The proof of Theorem 6(i) involves decomposing fλ,nf0H\|f_{\lambda,n}-f_{0}\|_{\mathcal{H}} into an estimation error part, E(λ,n):=fλ,nfλH\mathcal{E}(\lambda,n):=\|f_{\lambda,n}-f_{\lambda}\|_{\mathcal{H}}, and an approximation error part, A0(λ):=fλf0H\mathcal{A}_{0}(\lambda):=\|f_{\lambda}-f_{0}\|_{\mathcal{H}}, where fλ=(C+λI)1Cf0f_{\lambda}=(C+\lambda I)^{-1}Cf_{0}. While E(λ,n)0\mathcal{E}(\lambda,n)\rightarrow 0 as λ0\lambda\rightarrow 0, λn\lambda\sqrt{n}\rightarrow\infty and nn\rightarrow\infty without any assumptions on f0f_{0} (see the proof in Section 8.5 for details), it is not reasonable to expect A0(λ)0\mathcal{A}_{0}(\lambda)\rightarrow 0 as λ0\lambda\rightarrow 0 without assuming f0R(C)f_{0}\in\overline{\mathcal{R}(C)}. This is because, if f0f_{0} lies in the null space of CC, then fλf_{\lambda} is zero irrespective of λ\lambda and therefore cannot approximate f0f_{0}. (iii) The condition f0R(C)f_{0}\in\overline{\mathcal{R}(C)} is difficult to check in practice as it depends on p0p_{0} (which in turn depends on f0f_{0}). However, since the null space of CC is just constant functions if the kernel is bounded and supp(q0)=Ω\emph{supp}(q_{0})=\Omega (see Lemma 14 in Section 8.6 for details), assuming 1H1\notin\mathcal{H} yields that R(C)=H\overline{\mathcal{R}(C)}=\mathcal{H} and therefore consistency can be attained under conditions that are easy to impose in practice. As mentioned in Remark 3(iii), the condition 1H1\notin\mathcal{H} ensures identifiability and a sufficient condition for it to hold is kC0(Ω×Ω)k\in C_{0}(\Omega\times\Omega), which is satisfied by Gaussian, Matérn and inverse multiquadric kernels. (iv) It is well known that convergence rates are possible only if the quantity of interest (here f0f_{0}) satisfies some additional conditions. In function estimation, this additional condition is classically imposed by assuming f0f_{0} to be sufficiently smooth, e.g., f0f_{0} lies in a Sobolev space of certain smoothness. By contrast, the smoothness condition in Theorem 6(ii) is imposed in an indirect manner by assuming f0R(Cβ)f_{0}\in\mathcal{R}(C^{\beta}) for some β>0\beta>0—so that the results hold for abstract RKHSs and not just Sobolev spaces—which then provides a rate, with the best rate being n1/4n^{-1/4} that is attained when β1\beta\geq 1. While such a condition has already been used in various works (Caponnetto and Vito, 2007; Smale and Zhou, 2007; Fukumizu et al., 2013) in the context of non-parametric least squares regression, we explore it in more detail in Proposition 8, and Examples 2 and 3. Note that this condition is common in the inverse problem theory (see Engl, Hanke, and Neubauer, 1996), and it naturally arises here through the connection of fλ,nf_{\lambda,n} being a Tikhonov regularized solution to the ill-posed linear system C^f=ξ^\hat{C}f=-\hat{\xi}. An interesting observation about the rate is that it does not improve with increasing β\beta (for β>1\beta>1), in contrast to the classical results in function estimation (e.g., kernel density estimation and kernel regression) where the rate improves with increasing smoothness. This issue is discussed in detail in Section 4.3. (v) Since C1<\|C^{-1}\|<\infty only if H\mathcal{H} is finite-dimensional, we recover the parametric rate of n1/2n^{-1/2} in a finite-dimensional situation with an automatic choice for λ\lambda as n1/2n^{-1/2}.

While Theorem 6 provides statistical guarantees for parameter convergence, the question of primary interest is the convergence of pfλ,np_{f_{\lambda,n}} to p0p_{0}. This is guaranteed by the following result, which is proved in Section 8.6.

Suppose (A)–(D) with ε=2\varepsilon=2 hold and k:=supxΩk(x,x)<\|k\|_{\infty}:=\sup_{x\in\Omega}k(x,x)<\infty. Assume supp(q0)=Ω\emph{supp}(q_{0})=\Omega. Then the following hold: (i) For any 1<r1<r\leq\infty with q0L1(Ω)Lr(Ω)q_{0}\in L^{1}(\Omega)\cap L^{r}(\Omega),

In addition, if f0R(Cβ)f_{0}\in\mathcal{R}(C^{\beta}) for some β>0\beta>0, then for λ=nmax{14,12(β+1)}\lambda=n^{-\max\left\{\frac{1}{4},\frac{1}{2(\beta+1)}\right\}},

as nn\rightarrow\infty where θn:=nmin{14,β2(β+1)}\theta_{n}:=n^{-\min\left\{\frac{1}{4},\frac{\beta}{2(\beta+1)}\right\}}. (ii) J(p0pfλ,n)0asλn,λ0andn.J(p_{0}\|p_{f_{\lambda,n}})\rightarrow 0\,\,\text{as}\,\,\lambda n\rightarrow\infty,\,\lambda\rightarrow 0\,\,\text{and}\,\,n\rightarrow\infty. In addition, if f0R(Cβ)f_{0}\in\mathcal{R}(C^{\beta}) for some β0\beta\geq 0, then for λ=nmax{13,12(β+1)}\lambda=n^{-\max\left\{\frac{1}{3},\frac{1}{2(\beta+1)}\right\}},

(iii) If C1<\|C^{-1}\|<\infty, then θn=n12\theta_{n}=n^{-\frac{1}{2}} and J(p0pfλ,n)=Op0(n1)J(p_{0}\|p_{f_{\lambda,n}})=O_{p_{0}}(n^{-1}) with λ=n12\lambda=n^{-\frac{1}{2}}.

While Theorem 7 addresses the case of bounded kernels, the case of unbounded kernels requires a technical modification. The reason for this modification, as alluded to in the discussion following Theorem 4, is due to the fact that fλ,nf_{\lambda,n} may not be in F\mathcal{F} when kk is unbounded, and therefore the corresponding density estimator, pfλ,np_{f_{\lambda,n}} may not be well-defined. In order to keep the main ideas intact, we discuss the unbounded case in detail in Section B.2 in Appendix B.

2 Range Space Assumption

While Theorems 6 and 7 are satisfactory from the point of view of consistency, we believe the presented rates are possibly not minimax optimal since these rates are valid for any RKHS that satisfies the conditions (A)–(D) and does not capture the smoothness of kk (and therefore the corresponding H\mathcal{H}). In other words, the rates presented in Theorems 6 and 7 should depend on the decay rate of the eigenvalues of CC which in turn effectively captures the smoothness of H\mathcal{H}. However, we are not able to obtain such a result—see the remark following the proof of Theorem 6 for a discussion. While these rates do not reflect the intrinsic smoothness of H\mathcal{H}, they are obtained under the smoothness assumption, i.e., range space condition that f0R(Cβ)f_{0}\in\mathcal{R}(C^{\beta}) for some β>0\beta>0. This condition is quite different from the classical smoothness conditions that appear in non-parametric function estimation. While the range space assumption has been made in various earlier works (e.g., Caponnetto and Vito (2007); Smale and Zhou (2007); Fukumizu et al. (2013) in the context of non-parametric least square regression), in the following, we investigate the implicit smoothness assumptions that it makes on f0f_{0} in our context. To this end, first it is easy to show (see the proof of Proposition B.3 in Section B.3) that

Then f0R(C)f_{0}\in\mathcal{R}(C) implies f0GHf_{0}\in\mathcal{G}\subset\mathcal{H}.

In the following, we apply the above result in two examples involving Gaussian and Matérn kernels to get insights into the range space assumption.

Let ψ(x)=eσx2\psi(x)=e^{-\sigma\|x\|^{2}} with Hσ\mathcal{H}_{\sigma} as its corresponding RKHS (see Section 2 for its definition). By Proposition 8, it is easy to verify that f0R(C)f_{0}\in\mathcal{R}(C) implies f0HαHσf_{0}\in\mathcal{H}_{\alpha}\subset\mathcal{H}_{\sigma} for σ2<ασ\frac{\sigma}{2}<\alpha\leq\sigma. Since HβHγ\mathcal{H}_{\beta}\subset\mathcal{H}_{\gamma} for β<γ\beta<\gamma (i.e., Gaussian RKHSs are nested), f0R(C)f_{0}\in\mathcal{R}(C) ensures that f0f_{0} lies in Hσ2+ϵ\mathcal{H}_{\frac{\sigma}{2}+\epsilon} for arbitrary small ϵ>0\epsilon>0.

While Example 3 provides some understanding about the minimax optimality of fλ,nf_{\lambda,n} under additional assumptions on f0f_{0}, the problem is not completely resolved. In the following section, however, we show that the rate in Theorem 6 is not optimal for β>1\beta>1, and that improved rates can be obtained by choosing the regularizer appropriately.

3 Choice of Regularizer

We understand from the characterization of R(Cβ)\mathcal{R}(C^{\beta}) in (9) that larger β\beta values yield smoother functions in H\mathcal{H}. However, the smoothness of f0R(Cβ)f_{0}\in\mathcal{R}(C^{\beta}) for β>1\beta>1 is not captured in the rates in Theorem 6(ii), where the rate saturates at β=1\beta=1 providing the best possible rate of n1/4n^{-1/4} (irrespective of the size of β\beta). This is unsatisfactory on the part of the estimator, as it does not effectively capture the smoothness of f0f_{0}, i.e., the estimator is not adaptive to the smoothness of f0f_{0}. We remind the reader that the estimator fλ,nf_{\lambda,n} is obtained by minimizing the regularized empirical Fisher divergence (see Theorem 4(iv)) yielding fλ,n=(C^+λI)1ξ^f_{\lambda,n}=-(\hat{C}+\lambda I)^{-1}\hat{\xi}, which can be seen as a heuristic to solve the (non-linear) inverse problem Cf0=ξCf_{0}=-\xi (see Theorem 4(ii)) from finite samples, by replacing CC and ξ\xi with their empirical counterparts. This heuristic, which ensures that the finite sample inverse problem is well-posed, is popular in inverse problem literature under the name of Tikhonov regularization (Engl et al., 1996, Chapter 5). Note that Tikhonov regularization helps to make the ill-posed inverse problem a well-posed one by approximating α1\alpha^{-1} by (α+λ)1(\alpha+\lambda)^{-1}, λ>0\lambda>0, where α1\alpha^{-1} appears as the inverse of the eigenvalues of CC while computing C1C^{-1}. In other words, if C^\hat{C} is invertible, then an estimate of f0f_{0} can be obtained as f^n=C^1ξ^\hat{f}_{n}=-\hat{C}^{-1}\hat{\xi}, i.e., f^n=iIξ^,ϕ^iHα^iϕ^i,\hat{f}_{n}=-\sum_{i\in I}\frac{\langle\hat{\xi},\hat{\phi}_{i}\rangle_{\mathcal{H}}}{\hat{\alpha}_{i}}\hat{\phi}_{i}, where (α^i)iI(\hat{\alpha}_{i})_{i\in I} and (ϕ^i)iI(\hat{\phi}_{i})_{i\in I} are the eigenvalues and eigenvectors of C^\hat{C} respectively. However, C^\hat{C} being a rank nn operator defined on H\mathcal{H} (which can be infinite dimensional) is not invertible and therefore the regularized estimator is constructed as fλ,n=gλ(C^)ξ^f_{\lambda,n}=-g_{\lambda}(\hat{C})\hat{\xi} where gλ(C^)g_{\lambda}(\hat{C}) is defined through functional calculus (see Engl, Hanke, and Neubauer, 1996, Section 2.3) as

The constant η0\eta_{0} is called the qualification of gλg_{\lambda} which is what determines the point of saturation of gλg_{\lambda}. We show in Theorem 9 that if gλg_{\lambda} has a finite qualification, then the resultant estimator cannot fully exploit the smoothness of f0f_{0} and therefore the rate of convergence will suffer for β>η0\beta>\eta_{0}. Given gλg_{\lambda} that satisfies (E), we construct our estimator of f0f_{0} as

Note that the above estimator can be obtained by using the data dependent regularizer, 12f,((gλ(C^))1C^)fH\frac{1}{2}\langle f,((g_{\lambda}(\hat{C}))^{-1}-\hat{C})f\rangle_{\mathcal{H}} in the minimization of J^(f)\hat{J}(f) defined in Theorem 4(iv), i.e.,

However, unlike fλ,nf_{\lambda,n} for which a simple form is available in Theorem 5 by solving a linear system, we are not able to obtain such a nice expression for fg,λ,nf_{g,\lambda,n}. The following result (proved in Section 8.8) presents an analog of Theorems 6 and 7 for the new estimators, fg,λ,nf_{g,\lambda,n} and pfg,λ,np_{f_{g,\lambda,n}}.

Suppose (A)–(E) hold with ε=2\varepsilon=2. (i) If f0R(Cβ)f_{0}\in\mathcal{R}(C^{\beta}) for some β>0\beta>0, then for any λn1/2\lambda\geq n^{-1/2},

where θn:=nmin{β2(β+1),η02(η0+1)}\theta_{n}:=n^{-\min\left\{\frac{\beta}{2(\beta+1)},\frac{\eta_{0}}{2(\eta_{0}+1)}\right\}} with λ=nmax{12(β+1),12(η0+1)}\lambda=n^{-\max\left\{\frac{1}{2(\beta+1)},\frac{1}{2(\eta_{0}+1)}\right\}}. In addition, if k<\|k\|_{\infty}<\infty, then for any 1<r1<r\leq\infty with q0L1(Ω)Lr(Ω)q_{0}\in L^{1}(\Omega)\cap L^{r}(\Omega),

(ii) If f0R(Cβ)f_{0}\in\mathcal{R}(C^{\beta}) for some β0\beta\geq 0, then for any λn1/2\lambda\geq n^{-1/2},

with λ=n1min{2β+2,2η0+1}\lambda=n^{-\frac{1}{\min\{2\beta+2,2\eta_{0}+1\}}}. (iii) If C1<\|C^{-1}\|<\infty, then for any λn1/2\lambda\geq n^{-1/2},

with θn=n12\theta_{n}=n^{-\frac{1}{2}} and λ=n1min{2,2η0}\lambda=n^{-\frac{1}{\min\{2,2\eta_{0}\}}}.

Theorem 9 shows that if gλg_{\lambda} has infinite qualification, then smoothness of f0f_{0} is fully captured in the rates and as β\beta\rightarrow\infty, we attain Op0(n1/2)O_{p_{0}}(n^{-1/2}) rate for fg,λ,nf0H\|f_{g,\lambda,n}-f_{0}\|_{\mathcal{H}} in contrast to n1/4n^{-1/4} (similar improved rates are also obtained for pfg,λ,np_{f_{g,\lambda,n}} in various distances) in Theorem 6. In the following example, we present two choices of gλg_{\lambda} that improve on Tikhonov regularization. We refer the reader to Rosasco et al. (2005, Section 3.1) for more examples of gλg_{\lambda}.

(i) Tikhonov regularization involves gλ(α)=(α+λ)1g_{\lambda}(\alpha)=(\alpha+\lambda)^{-1} for which it is easy to verify that η0=1\eta_{0}=1 and therefore the rates saturate at β=1\beta=1, leading to the results in Theorems 6 and 7. (ii) Showalter’s method and spectral cut-off use

respectively for which it is easy to verify that η0=+\eta_{0}=+\infty (see Engl, Hanke, and Neubauer, 1996, Examples 4.7 & 4.8 for details) and therefore improved rates are obtained for β>1\beta>1 in Theorem 9 compared to that of Tikhonov regularization.

Density Estimation in 𝒫𝒫\mathcal{P}: Misspecified Case

In this section, we analyze the misspecified case where p0Pp_{0}\notin\mathcal{P}, which is a more reasonable case than the well-specified one, as in practice it is not easy to check whether p0Pp_{0}\in\mathcal{P}. To this end, we consider the same estimator pfλ,np_{f_{\lambda,n}} as considered in the well-specified case where fλ,nf_{\lambda,n} is obtained from Theorem 5. The following result shows that J(p0pfλ,n)infpPJ(p0p)J(p_{0}\|p_{f_{\lambda,n}})\rightarrow\inf_{p\in\mathcal{P}}J(p_{0}\|p) as λ0\lambda\rightarrow 0, λn\lambda n\rightarrow\infty and nn\rightarrow\infty under the assumption that there exists fFf^{*}\in\mathcal{F} such that J(p0pf)=infpPJ(p0p)J(p_{0}\|p_{f^{*}})=\inf_{p\in\mathcal{P}}J(p_{0}\|p). We present the result for bounded kernels although it can be easily extended to unbounded kernels as in Theorem B.2. Also, the presented result for Tikhonov regularization extends easily to pfg,λ,np_{f_{g,\lambda,n}} using the ideas in the proof of Theorem 9. Note that unlike in the well-specified case where convergence in other distances can be shown even though the estimator is constructed from JJ, it is difficult to show such a result in the misspecified case.

Let p0,q0C1(Ω)p_{0},\,q_{0}\in C^{1}(\Omega) be probability densities such that J(p0q0)<J(p_{0}\|q_{0})<\infty where Ω\Omega satisfies (A). Assume that (B), (C) and (D) with ε=2\varepsilon=2 hold. Suppose k<\|k\|_{\infty}<\infty, supp(q0)=Ω\emph{supp}(q_{0})=\Omega and there exists fFf^{\ast}\in\mathcal{F} such that

Then for an estimator pfλ,np_{f_{\lambda,n}} constructed from random samples (Xa)a=1n(X_{a})^{n}_{a=1} drawn i.i.d. from p0p_{0}, where fλ,nf_{\lambda,n} is defined in (7)—also see Theorem 4(iv)—with λ>0\lambda>0, we have

In addition, if fR(Cβ)f^{*}\in\mathcal{R}(C^{\beta}) for some β0\beta\geq 0, then

with λ=nmax{13,12(β+1)}\lambda=n^{-\max\left\{\frac{1}{3},\frac{1}{2(\beta+1)}\right\}}. If C1<\|C^{-1}\|<\infty, then for λ=n12\lambda=n^{-\frac{1}{2}},

While the above result is useful and interesting, the assumption about the existence of ff^{*} is quite restrictive. This is because if p0p_{0} (which is not in P\mathcal{P}) belongs to a family Q\mathcal{Q} where P\mathcal{P} is dense in Q\mathcal{Q} w.r.t. JJ, then there is no fHf\in\mathcal{H} that attains the infimum, i.e., ff^{*} does not exist and therefore the proof technique employed in Theorem 10 will fail. In the following, we present a result (Theorem 12) that does not require the existence of ff^{\ast} but attains the same result as in Theorem 10, but requiring a more complicated proof. Before we present Theorem 12, we need to introduce some notation.

To this end, let us return to the objective function under consideration,

where f=logp0q0f_{\star}=\log\frac{p_{0}}{q_{0}} and p0Pp_{0}\notin\mathcal{P}. Define

This is a reasonable class of functions to consider as under the condition J(p0q0)<J(p_{0}\|q_{0})<\infty, it is clear that fW2(Ω,p0)f_{\star}\in\mathcal{W}_{2}(\Omega,p_{0}). Endowed with a semi-norm,

W2(Ω,p0)\mathcal{W}_{2}(\Omega,p_{0}) is a vector space of functions, from which a normed space can be constructed as follows. Let us define f,fW2(Ω,p0)f,f^{\prime}\in\mathcal{W}_{2}(\Omega,p_{0}) to be equivalent, i.e., fff\sim f^{\prime}, if ffW2=0\|f-f^{\prime}\|_{\mathcal{W}_{2}}=0. In other words, fff\sim f^{\prime} if and only if ff and ff^{\prime} differ by a constant p0p_{0}-almost everywhere. Now define the quotient space W2(Ω,p0):={[f]:fW2(Ω,p0)}\mathcal{W}^{\sim}_{2}(\Omega,p_{0}):=\left\{[f]_{\sim}:f\in\mathcal{W}_{2}(\Omega,p_{0})\right\} where [f]:={fW2(Ω,p0):ff}[f]_{\sim}:=\{f^{\prime}\in\mathcal{W}_{2}(\Omega,p_{0}):f\sim f^{\prime}\} denotes the equivalence class of ff. Defining [f]W2:=fW2\|[f]_{\sim}\|_{\mathcal{W}^{\sim}_{2}}:=\|f\|_{\mathcal{W}_{2}}, it is easy to verify that W2\|\cdot\|_{\mathcal{W}^{\sim}_{2}} defines a norm on W2(p0)\mathcal{W}^{\sim}_{2}(p_{0}). In addition, endowing the following bilinear form on W2(Ω,p0)\mathcal{W}^{\sim}_{2}(\Omega,p_{0})

makes it a pre-Hilbert space. Let W2(Ω,p0)W_{2}(\Omega,p_{0}) be the Hilbert space obtained by completion of W2(Ω,p0)\mathcal{W}^{\sim}_{2}(\Omega,p_{0}). As shown in Proposition 11 below, under some assumptions, a continuous mapping Ik:HW2(Ω,p0),f[f]I_{k}:\mathcal{H}\to W_{2}(\Omega,p_{0}),f\mapsto[f]_{\sim} can be defined, which is injective modulo constant functions. Since addition of a constant does not contribute to pfp_{f}, the space W2(Ω,p0)W_{2}(\Omega,p_{0}) can be regarded as a parameter space extended from H\mathcal{H}. In addition to IkI_{k}, Proposition 11 (proved in Section 8.10) describes the adjoint of IkI_{k} and relevant self-adjoint operators, which will be useful in analyzing pfλ,np_{f_{\lambda,n}} in Theorem 12.

In addition, IkI_{k} and SkS_{k} are Hilbert-Schmidt and therefore compact. Also, Ek:=SkIkE_{k}:=S_{k}I_{k} and Tk:=IkSkT_{k}:=I_{k}S_{k} are compact, positive and self-adjoint operators on H\mathcal{H} and W2(Ω,p0)W_{2}(\Omega,p_{0}) respectively where

and the restriction of TkT_{k} to W2(Ω,p0)\mathcal{W}^{\sim}_{2}(\Omega,p_{0}) is given by

Note that for [h]W2(Ω,p0)[h]_{\sim}\in\mathcal{W}^{\sim}_{2}(\Omega,p_{0}), the derivatives ih\partial_{i}h do not depend on the choice of a representative element almost surely w.r.t. p0p_{0}, and thus the above integrals are well defined. Having constructed W2(Ω,p0)W_{2}(\Omega,p_{0}), it is clear that J(p0pf)=12[f]IkfW22J(p_{0}\|p_{f})=\frac{1}{2}\|[f_{\star}]_{\sim}-I_{k}f\|^{2}_{W_{2}}, which means estimating p0p_{0} is equivalent to estimating fW2(Ω,p0)f_{\star}\in W_{2}(\Omega,p_{0}) by fFf\in\mathcal{F}. With all these preparations, we are now ready to present a result (proved in Section 8.11) on consistency and convergence rate for pfλ,np_{f_{\lambda,n}} without assuming the existence of ff^{\ast}.

Let p0,q0C1(Ω)p_{0},\,q_{0}\in C^{1}(\Omega) be probability densities such that J(p0q0)<J(p_{0}\|q_{0})<\infty. Assume that (A)–(D) hold with ε=2\varepsilon=2 and χ:=dsupxΩ,i[d]ii+dk(x,x)<\chi:=d\sup_{x\in\Omega,i\in[d]}\partial_{i}\partial_{i+d}k(x,x)<\infty. Then the following hold. (i) As λ0,λnandn,\lambda\rightarrow 0,\,\lambda n\rightarrow\infty\,\,\text{and}\,\,n\rightarrow\infty, J(p0pfλ,n)infpPJ(p0p).J(p_{0}\|p_{f_{\lambda,n}})\rightarrow\inf_{p\in\mathcal{P}}J(p_{0}\|p). (ii) Define f:=logp0q0f_{\star}:=\log\frac{p_{0}}{q_{0}}. If [f]R(Tk)[f_{\star}]_{\sim}\in\overline{\mathcal{R}(T_{k})}, then

In addition, if [f]R(Tkβ)[f_{\star}]_{\sim}\in\mathcal{R}(T^{\beta}_{k}) for some β>0\beta>0, then for λ=nmax{13,12β+1}\lambda=n^{-\max\left\{\frac{1}{3},\frac{1}{2\beta+1}\right\}}

. (iii) If Ek1<\|E^{-1}_{k}\|<\infty and Tk1<\|T^{-1}_{k}\|<\infty, then J(p0pfλ,n)=Op0(n1)J(p_{0}\|p_{f_{\lambda,n}})=O_{p_{0}}\left(n^{-1}\right) with λ=n12\lambda=n^{-\frac{1}{2}}.

Based on the observation (i) in the above remark that infpPJ(p0p)=0\inf_{p\in\mathcal{P}}J(p_{0}\|p)=0 if Ik(H)I_{k}(\mathcal{H}) is dense in W2(Ω,p0)W_{2}(\Omega,p_{0}) w.r.t. W2\|\cdot\|_{W_{2}}, it is possible to obtain an approximation result for P\mathcal{P} (similar to those discussed in Section 3) w.r.t. Fisher divergence as shown below, whose proof is provided in Section 8.12.

Numerical Simulations

We have proposed an estimator of p0p_{0} that is obtained by minimizing the regularized empirical Fisher divergence and presented its consistency along with convergence rates. As discussed in Section 1, however one can simply ignore the structure of P\mathcal{P} and estimate p0p_{0} in a completely non-parametric fashion, for example using the kernel density estimator (KDE). In fact, consistency and convergence rates of KDE are also well-studied (Tsybakov, 2009, Chapter 1) and the kernel density estimator is very simple to compute—requiring only O(n)O(n) computations—compared to the proposed estimator, which is obtained by solving a linear system of size nd×ndnd\times nd. This raises questions about the applicability of the proposed estimator in practice, though it is very well known that KDE performs poorly for moderate to large dd (Wasserman, 2006, Section 6.5). In this section, we numerically demonstrate that the proposed score matching estimator performs significantly better than the KDE, and in particular, that the advantage with the proposed estimator grows as dd gets large. Note further that the maximum likelihood approach of Barron and Sheu (1991) and Fukumizu (2009) does not yield estimators that are practically feasible, and therefore to the best of our knowledge, the proposed estimator is the only viable estimator for estimating densities through P\mathcal{P}.

In the following, we consider two simple scenarios of estimating a multivariate normal and mixture of normals using the proposed estimator and demonstrate the superior performance of the proposed estimator over KDE. Inspired by this preliminary empirical investigation, recently, the proposed estimator has been explored in two concrete applications of gradient-free adaptive MCMC sampler (Strathmann et al., 2015) and graphical model structure learning (Sun et al., 2015) where the superiority of working with the infinite dimensional family is demonstrated. We would like to again highlight that the goal of this work is not to construct density estimators that improve upon KDE but to provide a novel modeling technique of approximating an unknown density by a rich parametric family of densities with the parameter being infinite dimensional in contrast to the classical approach of finite dimensional approximation.

through the score matching approach and KDE, and compare their estimation accuracies. Here ϕd(x;μ,Σ)\phi_{d}(x;\mu,\Sigma) is the p.d.f. of N(μ,ΣId)N(\mu,\Sigma I_{d}). By choosing the kernel, k(x,y)=exp(xy222σ2)+r(xTy+c)2,k(x,y)=\exp(-\frac{\|x-y\|^{2}_{2}}{2\sigma^{2}})+r(x^{T}y+c)^{2}, which is a Gaussian plus polynomial of degree 2, it is easy to verify that Gaussian distributions lie in P\mathcal{P}, and therefore the first problem considers the well-specified case while the second problem deals with the misspecified case. In our simulations, we chose r=0.1r=0.1, c=0.5c=0.5, α=4\alpha=4 and β=4\beta=-4. The base measure of the exponential family is N(0,102Id)N(0,10^{2}I_{d}). The bandwidth parameter σ\sigma is chosen by cross-validation (CV) of the objective function J^λ\hat{J}_{\lambda} (see Theorem 4(iv)) within the parameter set {0.1,0.2,0.4,0.6,0.8,1,1.2,1.4,1.6}×σ\{0.1,0.2,0.4,0.6,0.8,1,1.2,1.4,1.6\}\times\sigma_{*}, where σ\sigma_{*} is the median of pairwise distances of data, and the regularization parameter λ\lambda is set as λ=0.1×n1/3\lambda=0.1\times n^{-1/3} with sample size nn. For KDE, the Gaussian kernel is used for the smoothing kernel, and the bandwidth parameter is chosen by CV from {0.02,0.04,0.06,0.08,0.1,0.2,0.4,0.6,0.8,1.0}×σ\{0.02,0.04,0.06,0.08,0.1,0.2,0.4,0.6,0.8,1.0\}\times\sigma_{*}; where for both the methods, 5-fold CV is applied.

Since it is difficult to accurately estimate the normalization constant in the proposed method, we use two methods to evaluate the accuracy of estimation. One is the objective function for the score matching method,

and the other is correlation of the estimator with the true density function,

where RR is a probability distribution. For RR, we use the empirical distribution based on 10000 random samples drawn i.i.d. from p0(x)p_{0}(x).

Summary & Discussion

We have considered an infinite dimensional generalization, P\mathcal{P}, of the finite-dimensional exponential family, where the densities are indexed by functions in a reproducing kernel Hilbert space (RKHS), H\mathcal{H}. We showed that P\mathcal{P} is a rich object that can approximate a large class of probability densities arbitrarily well in Kullback-Leibler divergence, and addressed the main question of estimating an unknown density, p0p_{0} from finite samples drawn i.i.d. from it, in well-specified (p0Pp_{0}\in\mathcal{P}) and misspecified (p0Pp_{0}\notin\mathcal{P}) settings. We proposed a density estimator based on minimizing the regularized version of the empirical Fisher divergence, which results in solving a simple finite-dimensional linear system. Our estimator provides a computationally efficient alternative to maximum likelihood based estimators, which suffer from the computational intractability of the log-partition function. The proposed estimator is also shown to empirically outperform the classical kernel density estimator, with advantage increasing as the dimension of the space increases. In addition to these computational and empirical results, we have established the consistency and convergence rates under certain smoothness assumptions (e.g., logp0R(Cβ)\log p_{0}\in\mathcal{R}(C^{\beta})) for both well-specified and misspecified scenarios.

Proofs

We provide proofs of the results presented in Sections 3–5.

Sriperumbudur et al. (2011, Proposition 5) showed that H\mathcal{H} is dense in C0(Ω)C_{0}(\Omega) w.r.t. uniform norm if and only if kk satisfies (5). Therefore, the denseness in L1L^{1}, KL and Hellinger distances follow trivially from Lemma A.1. For LrL^{r} norm (r>1r>1), the denseness follows by using the bound pfpgLr(Ω)2e2fge2ffgq0Lr(Ω)\|p_{f}-p_{g}\|_{L^{r}(\Omega)}\leq 2e^{2\|f-g\|_{\infty}}e^{2\|f\|_{\infty}}\|f-g\|_{\infty}\|q_{0}\|_{L^{r}(\Omega)} obtained from Lemma A.1(i) with fC0(Ω)f\in C_{0}(\Omega) and gHg\in\mathcal{H}. \blacksquare

2 Proof of Corollary 2

For any pPcp\in\mathcal{P}_{c}, define pδ:=p+δq01+δp_{\delta}:=\frac{p+\delta q_{0}}{1+\delta}. Note that pδ(x)>0p_{\delta}(x)>0 for all xΩx\in\Omega and ppδLr(Ω)=δpq0Lr(Ω)1+δ\|p-p_{\delta}\|_{L^{r}(\Omega)}=\frac{\delta\|p-q_{0}\|_{L^{r}(\Omega)}}{1+\delta}, implying that limδ0ppδLr(Ω)=0\lim_{\delta\rightarrow 0}\|p-p_{\delta}\|_{L^{r}(\Omega)}=0 for any 1r1\leq r\leq\infty. This means, for any ϵ>0\epsilon>0, δϵ>0\exists\delta_{\epsilon}>0 such that for any 0<θ<δϵ0<\theta<\delta_{\epsilon}, we have ppθLr(Ω)ϵ\|p-p_{\theta}\|_{L^{r}(\Omega)}\leq\epsilon, where pθ(x)>0p_{\theta}(x)>0 for all xΩx\in\Omega.

We now prove the denseness in KL divergence by noting that

which implies limδ0KL(ppδ)=0\lim_{\delta\rightarrow 0}KL(p\|p_{\delta})=0. This implies, for any ϵ>0\epsilon>0, δϵ>0\exists\delta_{\epsilon}>0 such that for any 0<θ<δϵ0<\theta<\delta_{\epsilon}, KL(ppθ)ϵKL(p\|p_{\theta})\leq\epsilon. Arguing as above, we have pθP0p_{\theta}\in\mathcal{P}_{0}, i.e., there exists fC0(Ω)f\in C_{0}(\Omega) such that pθ=efq0efq0dxp_{\theta}=\frac{e^{f}q_{0}}{\int e^{f}q_{0}\,dx}. Since H\mathcal{H} is dense in C0(Ω)C_{0}(\Omega), for any fC0(Ω)f\in C_{0}(\Omega) and any ϵ>0\epsilon>0, there exists gHg\in\mathcal{H} such that fgϵ\|f-g\|_{\infty}\leq\epsilon. For pgPp_{g}\in\mathcal{P}, since plogpθpgdxlogpθpg2fg2ϵ,\int p\,\log\frac{p_{\theta}}{p_{g}}\,dx\leq\left\|\log\frac{p_{\theta}}{p_{g}}\right\|_{\infty}\leq 2\|f-g\|_{\infty}\leq 2\epsilon, we have

3 Proof of Theorem 4

(i) By the reproducing property of H\mathcal{H}, since if(x)=f,ik(x,)H\partial_{i}f(x)=\left\langle f,\partial_{i}k(x,\cdot)\right\rangle_{\mathcal{H}} for all i[d]i\in[d], it is easy to verify that

where in the second line, we used a,bH2=a,bHa,bH=a,(bb)aH\langle a,b\rangle^{2}_{H}=\langle a,b\rangle_{H}\langle a,b\rangle_{H}=\langle a,(b\otimes b)a\rangle_{H} for a,bHa,b\in H with HH being a Hilbert space and

Observe that for all xΩx\in\Omega, CxC_{x} is a Hilbert-Schmidt operator as CxHSi=1dik(x,)H2\|C_{x}\|_{HS}\leq\sum^{d}_{i=1}\left\|\partial_{i}k(x,\cdot)\right\|^{2}_{\mathcal{H}} =i=1dii+dk(x,x)<=\sum^{d}_{i=1}\partial_{i}\partial_{i+d}k(x,x)<\infty and (ff0)(ff0)(f-f_{0})\otimes(f-f_{0}) is also Hilbert-Schmidt as (ff0)(ff0)HS=ff0H2<\|(f-f_{0})\otimes(f-f_{0})\|_{HS}=\|f-f_{0}\|^{2}_{\mathcal{H}}<\infty. Therefore, (11) is equivalent to

Since the first condition in (D) implies ΩCxHSp0(x)dx<\int_{\Omega}\|C_{x}\|_{HS}p_{0}(x)\,dx<\infty, CxC_{x} is p0p_{0}-integrable in the Bochner sense (see Diestel and Uhl, 1977, Definition 1 and Theorem 2), and therefore it follows from Diestel and Uhl (1977, Theorem 6) that

where C:=ΩCx  p0(x)dxC:=\int_{\Omega}C_{x}\;p_{0}(x)\,dx is the Bochner integral of CxC_{x}, thereby yielding (6).

which means CC is trace-class and therefore compact. Here, we used monotone convergence theorem in ()(*) and Parseval’s identity in ()(**). Note that CC is positive since f,CfH=Ωp0(x)f22dx0,fH.\langle f,Cf\rangle_{\mathcal{H}}=\int_{\Omega}p_{0}(x)\left\|\nabla f\right\|^{2}_{2}\,dx\geq 0,\,\forall\,f\in\mathcal{H}. (ii) From (6), we have J(f)=12f,CfHf,Cf0H+12f0,Cf0HJ(f)=\frac{1}{2}\langle f,Cf\rangle_{\mathcal{H}}-\langle f,Cf_{0}\rangle_{\mathcal{H}}+\frac{1}{2}\langle f_{0},Cf_{0}\rangle_{\mathcal{H}}. Using if0(x)=ilogp0(x)ilogq0(x)\partial_{i}f_{0}(x)=\partial_{i}\log p_{0}(x)-\partial_{i}\log q_{0}(x) for all i[d]i\in[d], we obtain that for any fHf\in\mathcal{H},

where (b)(b) follows from integration by parts under (C) and the equality in (c)(c) is valid as ξx\xi_{x} is Bochner p0p_{0}-integrable under (D) with ε=1\varepsilon=1. Therefore Cf0=ξCf_{0}=-\xi. For the third term, f0,Cf0H=Ωp0(x)i=1d(if0(x))2dx\langle f_{0},Cf_{0}\rangle_{\mathcal{H}}=\int_{\Omega}p_{0}(x)\sum^{d}_{i=1}\left(\partial_{i}f_{0}(x)\right)^{2}\,dx and the result follows. (iii) Define c0:=J(p0q0)c_{0}:=J(p_{0}\|q_{0}). For any λ>0\lambda>0, it is easy to verify that

Clearly, Jλ(f)J_{\lambda}(f) is minimized if and only if (C+λI)1/2f=(C+λI)1/2ξ(C+\lambda I)^{1/2}f=-(C+\lambda I)^{-1/2}\xi and therefore fλ=(C+λI)1ξf_{\lambda}=-(C+\lambda I)^{-1}\xi is the unique minimizer of Jλ(f)J_{\lambda}(f). (iv) Since (iv)(iv) is similar to (iii)(iii) with CC replaced by C^\hat{C} and ξ\xi replaced by ξ^\hat{\xi}, we obtain fλ,n=(C^+λI)1ξ^f_{\lambda,n}=(\hat{C}+\lambda I)^{-1}\hat{\xi}. \blacksquare

4 Proof of Theorem 5

We prove the result based on the general representer theorem (Theorem A.2). From Theorem 4(iv), we have

where V(θ1,,θnd,θnd+1):=12na=1ni=1dθ(a1)d+i2+θnd+1V(\theta_{1},\ldots,\theta_{nd},\theta_{nd+1}):=\frac{1}{2n}\sum^{n}_{a=1}\sum^{d}_{i=1}\theta^{2}_{(a-1)d+i}+\theta_{nd+1}, ϕ(a1)d+i:=ik(Xa,),a[n],i[d]\phi_{(a-1)d+i}:=\partial_{i}k(X_{a},\cdot),\,a\in[n],\,i\in[d] and ϕnd+1:=ξ^\phi_{nd+1}:=\hat{\xi}. Therefore, it follows from Theorem A.2 that

with K=(GhhTξ^H2).\bm{K}=\begin{pmatrix}\bm{G}\,&\,\bm{h}\\ \bm{h}^{T}\,&\,\|\hat{\xi}\|^{2}_{\mathcal{H}}\end{pmatrix}. Since V(zt)=(1nz1)\nabla V\begin{pmatrix}\bm{z}\\ t\end{pmatrix}=\begin{pmatrix}\frac{1}{n}\bm{z}\\ 1\end{pmatrix}, (15) reduces to λδ+1=0\lambda\delta+1=0 and λβ+1nGβ+δnh=0\lambda\bm{\beta}+\frac{1}{n}\bm{G\beta}+\frac{\delta}{n}\bm{h}=0 yielding δ=1λ\delta=-\frac{1}{\lambda} and (1nG+λI)β=1nλh(\frac{1}{n}\bm{G}+\lambda I)\bm{\beta}=\frac{1}{n\lambda}\bm{h}. \blacksquare

Instead of using the general representer theorem (Theorem A.2), it is possible to see that the standard representer theorem (Kimeldorf and Wahba, 1971; Schölkopf et al., 2001) gives a similar, but slightly different linear system, and the solutions are the same if K\bm{K} is non-singular. The general representer theorem yields that β\bm{\beta} and δ\delta are solution to F(βδ)=(01)\bm{F}\begin{pmatrix}\bm{\beta}\\ \delta\end{pmatrix}=\begin{pmatrix}\bm{0}\\ 1\end{pmatrix}, where F=(1nG+λI1nh0Tλ)\bm{F}=\begin{pmatrix}\frac{1}{n}\bm{G}+\lambda I\,&\,\frac{1}{n}\bm{h}\\ \bm{0}^{T}\,&\,\lambda\end{pmatrix}. On the other hand, by using the standard representer theorem, it is easy to show that fλ,nf_{\lambda,n} has the form in (14) with δ\delta and β\bm{\beta} being solution to KF(βδ)=K(01)\bm{K}\bm{F}\begin{pmatrix}\bm{\beta}\\ \delta\end{pmatrix}=\bm{K}\begin{pmatrix}\bm{0}\\ 1\end{pmatrix}. Clearly, both the solutions match if K\bm{K} is invertible while the latter has many solutions if K\bm{K} is not invertible.

5 Proof of Theorem 6

where we used λfλ=C(f0fλ)\lambda f_{\lambda}=C(f_{0}-f_{\lambda}) in (\ast). Define S1:=(C^+λI)1(CC^)(fλf0)HS_{1}:=\|(\hat{C}+\lambda I)^{-1}(C-\hat{C})(f_{\lambda}-f_{0})\|_{\mathcal{H}}, S2:=(C^+λI)1(ξ^ξ)HS_{2}:=\|(\hat{C}+\lambda I)^{-1}(\hat{\xi}-\xi)\|_{\mathcal{H}} and S3:=(C^+λI)1(CC^)f0HS_{3}:=\|(\hat{C}+\lambda I)^{-1}(C-\hat{C})f_{0}\|_{\mathcal{H}} so that

where A0(λ):=fλf0H\mathcal{A}_{0}(\lambda):=\|f_{\lambda}-f_{0}\|_{\mathcal{H}}. We now bound S1S_{1}, S2S_{2} and S3S_{3} using Proposition A.4. Note that C=ΩCxp0(x)dxC=\int_{\Omega}C_{x}\,p_{0}(x)\,dx where CxC_{x} is defined in (12) is a positive, self-adjoint, trace-class operator and (D) (with ε=2\varepsilon=2) implies that

Using the bounds in S1S_{1}, S2S_{2} and S3S_{3} in (16), we obtain

(i) By Proposition A.3(i), we have that A0(λ)0\mathcal{A}_{0}(\lambda)\rightarrow 0 as λ0\lambda\rightarrow 0 if f0R(C)f_{0}\in\overline{\mathcal{R}(C)}. Therefore, it follows from (20) that fλ,nf0H0\|f_{\lambda,n}-f_{0}\|_{\mathcal{H}}\rightarrow 0 as λ0\lambda\rightarrow 0, λn\lambda\sqrt{n}\rightarrow\infty and nn\rightarrow\infty. (ii) If f0R(Cβ)f_{0}\in\mathcal{R}(C^{\beta}) for β>0\beta>0, it follows from Proposition A.3(ii) that

and therefore the result follows by choosing λ=nmax{14,12(β+1)}\lambda=n^{-\max\left\{\frac{1}{4},\frac{1}{2(\beta+1)}\right\}}. (iii) Note that

It follows from Proposition A.4(v) that C(C^+λI)11\|C(\hat{C}+\lambda I)^{-1}\|\lesssim 1 for ncλ2n\geq\frac{c}{\lambda^{2}} where cc is a sufficiently large constant that depends on i=1dΩ(ii+dk(x,x))2p0(x)dx\sum^{d}_{i=1}\int_{\Omega}(\partial_{i}\partial_{i+d}k(x,x))^{2}p_{0}(x)\,dx but not on nn and λ\lambda. Using the bounds on (CC^)(fλf0)H\|(C-\hat{C})(f_{\lambda}-f_{0})\|_{\mathcal{H}}, ξ^ξH\|\hat{\xi}-\xi\|_{\mathcal{H}} and (CC^)f0H\|(C-\hat{C})f_{0}\|_{\mathcal{H}} from part (i) and the bound on C(fλf0)H\|C(f_{\lambda}-f_{0})\|_{\mathcal{H}} from Proposition A.3(ii), we therefore obtain

as nn\rightarrow\infty and the result follows. \blacksquare

Under slightly strong assumptions on the kernel, the bound on S1S_{1} in (17) can be improved to obtain S1=Op0(n1/2)S_{1}=O_{p_{0}}(n^{-1/2}) while the one on S3S_{3} in (19) can be refined to obtain S3=Op0(N(λ)λn)S_{3}=O_{p_{0}}\left(\sqrt{\frac{\mathcal{N}(\lambda)}{\lambda n}}\right) where N(λ):=Tr((C+λI)1C)\mathcal{N}(\lambda):=\emph{Tr}((C+\lambda I)^{-1}C) is the intrinsic dimension of H\mathcal{H}. Using the fact that N(λ)1λ\mathcal{N}(\lambda)\leq\frac{1}{\lambda}, it is easy to verify that the latter is an improved bound than the one in (19). In addition S3S_{3} dominates S1S_{1}. However, if S2S_{2} in (18) is not improved, then S2S_{2} dominates S3S_{3}, thereby resulting in a bound that does not capture the smoothness of kk (or the corresponding H\mathcal{H}). Unfortunately, even with a refined analysis (not reported here), we are not able to improve the bound on S2S_{2} wherein the difficulty lies with handling ξ\xi.

6 Proof of Theorem 7

Before we prove the result, we present a lemma.

Since supxΩk(x,x)<\sup_{x\in\Omega}k(x,x)<\infty, it implies that, for every fHf\in\mathcal{H}, Ωef(x)q0(x)dx<\int_{\Omega}e^{f(x)}q_{0}(x)\,dx<\infty and hence F=H\mathcal{F}=\mathcal{H}. Also, under the assumptions on kk and q0q_{0}, it is easy to verify that supp(p0)=Ω\text{supp}(p_{0})=\Omega, which implies

In the following, we obtain a bound on J(p0pfλ,n)=12C(fλ,nf0)H2J(p_{0}\|p_{f_{\lambda,n}})=\frac{1}{2}\|\sqrt{C}(f_{\lambda,n}-f_{0})\|^{2}_{\mathcal{H}}. While one can trivially use the bound C(fλ,nf0)H2C2fλ,nf0H2\|\sqrt{C}(f_{\lambda,n}-f_{0})\|^{2}_{\mathcal{H}}\leq\|\sqrt{C}\|^{2}\|f_{\lambda,n}-f_{0}\|^{2}_{\mathcal{H}} to obtain a rate on J(p0pfλ,n)J(p_{0}\|p_{f_{\lambda,n}}) through the result in Theorem 6(ii), a better rate can be obtained by carefully bounding C(fλ,nf0)H2\|\sqrt{C}(f_{\lambda,n}-f_{0})\|^{2}_{\mathcal{H}} as shown below. Consider

7 Proof of Proposition 8

Observation 1: By (Wendland, 2005, Theorem 10.12), we have

where ff^{\wedge} is defined in L2L^{2} sense. Since

Observation 3: For gHg\in\mathcal{H}, we have for all j[d]j\in[d],

Observation 4: For any gGg\in\mathcal{G}, we have

which implies gHg\in\mathcal{H}, i.e., GH\mathcal{G}\subset\mathcal{H}.

We now use these observations to prove the result. Since f0R(C)f_{0}\in\mathcal{R}(C), there exists gHg\in\mathcal{H} such that f0=Cgf_{0}=Cg, which means

where we have invoked generalized Young’s inequality (Folland, 1999, Proposition 8.9) in (\ast), Hausdorff-Young inequality (Folland, 1999, p. 253) in (\ast\ast), and observation 3 combined with (iv)(iv) in (\ddagger). This shows that f0R(C)f0Gf_{0}\in\mathcal{R}(C)\Rightarrow f_{0}\in\mathcal{G}, i.e., R(C)G\mathcal{R}(C)\subset\mathcal{G}. \blacksquare

8 Proof of Theorem 9

To prove Theorem 9, we need the following lemma (De Vito et al., 2012, Lemma 5), which is due to Andreas Maurer.

Proof of Theorem 9. (i) The proof follows the ideas in the proof of Theorem 10 in Bauer et al. (2007), which is a more general result dealing with the smoothness condition, f0R(Θ(C))f_{0}\in\mathcal{R}(\Theta(C)) where Θ\Theta is operator monotone. Recall that Θ\Theta is operator monotone on [0,b][0,b] if for any pair of self-adjoint operators UU, VV with spectra in [0,b][0,b] such that UVU\leq V, we have Θ(U)Θ(V)\Theta(U)\leq\Theta(V), where “\leq” is the partial ordering for self-adjoint operators on some Hilbert space HH, which means for any fHf\in H, f,UfHf,VfH\langle f,Uf\rangle_{H}\leq\langle f,Vf\rangle_{H}. In our case, we adapt the proof for Θ(C)=Cβ\Theta(C)=C^{\beta}. Define rλ(α):=gλ(α)α1r_{\lambda}(\alpha):=g_{\lambda}(\alpha)\alpha-1. Since f0R(Cβ)f_{0}\in\mathcal{R}(C^{\beta}), there exists hHh\in\mathcal{H} such that f0=Cβhf_{0}=C^{\beta}h, which yields

We now bound (A)(A)(D)(D). Since (A)gλ(C^)ξ^ξH(A)\leq\|g_{\lambda}(\hat{C})\|\|\hat{\xi}-\xi\|_{\mathcal{H}}, we have (A)=Op0(1λn)(A)=O_{p_{0}}\left(\frac{1}{\lambda\sqrt{n}}\right) where we used (b)(b) in (E) and the bound on ξ^ξH\|\hat{\xi}-\xi\|_{\mathcal{H}} from the proof of Theorem 6(i). Similarly, (B)gλ(C^)(C^C)f0H(B)\leq\|g_{\lambda}(\hat{C})\|\|(\hat{C}-C)f_{0}\|_{\mathcal{H}} implies (B)=Op0(1λn)(B)=O_{p_{0}}\left(\frac{1}{\lambda\sqrt{n}}\right) where (b)(b) in (E) and Proposition A.4(i) are invoked. Also, (d)(d) in (E) implies that

We now consider two cases: β1\beta\leq 1: Since ααθ\alpha\mapsto\alpha^{\theta} is operator monotone on [0,χ][0,\chi] for 0θ10\leq\theta\leq 1, by Theorem 1 in Bauer et al. (2007), there exists a constant cθc_{\theta} such that C^θCθcθC^CθcθC^CHSθ\|\hat{C}^{\theta}-C^{\theta}\|\leq c_{\theta}\|\hat{C}-C\|^{\theta}\leq c_{\theta}\|\hat{C}-C\|^{\theta}_{HS}. We now obtain a bound on C^CHS\|\hat{C}-C\|_{HS}. To this end, consider

which by Chebyshev’s inequality implies that

and therefore (D)=Op0(nβ/2)(D)=O_{p_{0}}(n^{-\beta/2}). Since λn1/2\lambda\geq n^{-1/2}, we have (D)=Op0(λβ)(D)=O_{p_{0}}(\lambda^{\beta}). β>1\beta>1: Since ααθ\alpha\mapsto\alpha^{\theta} is Lipschitz on [0,χ][0,\chi] for θ1\theta\geq 1, by Lemma 15, CβC^βCβC^βHSβχβ1CC^HS\|C^{\beta}-\hat{C}^{\beta}\|\leq\|C^{\beta}-\hat{C}^{\beta}\|_{HS}\leq\beta\chi^{\beta-1}\|C-\hat{C}\|_{HS} and therefore (C)=Op0(n1/2)(C)=O_{p_{0}}(n^{-1/2}).

Collecting all the above bounds, we obtain

and the result follows. The proofs of the claims involving LrL^{r}, hh and KLKL follow exactly the same ideas as in the proof of Theorem 7 by using the above bound on fg,λ,nf0H\|f_{g,\lambda,n}-f_{0}\|_{\mathcal{H}} in Lemma A.1. (ii) We now bound J(p0pfg,λ,n)=C(fg,λ,nf0)H2J(p_{0}\|p_{f_{g,\lambda,n}})=\|\sqrt{C}(f_{g,\lambda,n}-f_{0})\|^{2}_{\mathcal{H}} as follows. Note that

We bound (I)H\|(I^{\prime})\|_{\mathcal{H}} as

where we used the fact that αα\alpha\mapsto\sqrt{\alpha} is operator monotone along with λn1/2\lambda\geq n^{-1/2}. Using (26), (II)H\|(II^{\prime})\|_{\mathcal{H}} can be bounded as

with C^f0+ξ^\|\hat{C}f_{0}+\hat{\xi}\| and CβC^β\|C^{\beta}-\hat{C}^{\beta}\| bounded as in part (i) above. Here (ab):=max{a,b}(a\vee b):=\max\{a,b\}. Combining (I)H\|(I^{\prime})\|_{\mathcal{H}} and (II)H\|(II^{\prime})\|_{\mathcal{H}}, we obtain the required result. (iii) The proof follows the ideas in the proof of Theorems 6 and 7. Consider fg,λ,nf0=gλ(C^)(C^f0+ξ^)+rλ(C^)f0f_{g,\lambda,n}-f_{0}=-g_{\lambda}(\hat{C})(\hat{C}f_{0}+\hat{\xi})+r_{\lambda}(\hat{C})f_{0} so that

Therefore fg,λ,nf0H=Op0(n1/2)+O(λmin{1,η0})\|f_{g,\lambda,n}-f_{0}\|_{\mathcal{H}}=O_{p_{0}}(n^{-1/2})+O\left(\lambda^{\min\{1,\eta_{0}\}}\right) where we used the fact that λn1/2\lambda\geq n^{-1/2} and the result follows. \blacksquare

9 Proof of Theorem 10

Before we analyze J(p0pfλ,n)J(p_{0}\|p_{f_{\lambda,n}}), we need a small calculation for notational convenience. For any probability densities p,qC1p,q\in C^{1}, it is clear that 2J(pq)=logplogq2L2(p)\sqrt{2J(p\|q)}=\left\|\left\|\nabla\log p-\nabla\log q\right\|_{2}\right\|_{L^{2}(p)}. We generalize this by defining

Clearly, if μ=p\mu=p, then J(pqμ)J(p\|q\|\mu) matches with J(pq)J(p\|q). Therefore, for probability densities p,q,rC1p,q,r\in C^{1},

where A(λ)=C(fλf)H\mathcal{A}^{*}(\lambda)=\|\sqrt{C}(f_{\lambda}-f^{*})\|_{\mathcal{H}}. The result simply follows from the proof of Theorem 7, where we showed that C(fλ,nfλ)H=Op0(1λn)\|\sqrt{C}(f_{\lambda,n}-f_{\lambda})\|_{\mathcal{H}}=O_{p_{0}}\left(\frac{1}{\sqrt{\lambda n}}\right) and A(λ)=O(λmin{1,β+12})\mathcal{A}^{*}(\lambda)=O(\lambda^{\min\{1,\beta+\frac{1}{2}\}}) if fR(Cβ)f^{\ast}\in\mathcal{R}(C^{\beta}) for β0\beta\geq 0 as λ0\lambda\rightarrow 0, nn\rightarrow\infty. When C1<\|C^{-1}\|<\infty, we bound C(fλ,nf)H\|\sqrt{C}(f_{\lambda,n}-f^{*})\|_{\mathcal{H}} in (8.9) as Cfλ,nfH\|\sqrt{C}\|\|f_{\lambda,n}-f^{*}\|_{\mathcal{H}} where fλ,nfH\|f_{\lambda,n}-f^{*}\|_{\mathcal{H}} is in turn bounded as in (21). \blacksquare

10 Proof of Proposition 11

which means fW2(Ω,p0)f\in\mathcal{W}_{2}(\Omega,p_{0}) and therefore [f]W2(Ω,p0)[f]_{\sim}\in W_{2}(\Omega,p_{0}). Since IkfW2=[f]W2=fW2cfH<\|I_{k}f\|_{W_{2}}=\|[f]_{\sim}\|_{\mathcal{W}^{\sim}_{2}}=\|f\|_{\mathcal{W}_{2}}\leq c\|f\|_{\mathcal{H}}<\infty where cc is some constant, it is clear that IkI_{k} is a continuous map from H\mathcal{H} to W2(Ω,p0)W_{2}(\Omega,p_{0}). The adjoint Sk:W2(Ω,p0)HS_{k}:W_{2}(\Omega,p_{0})\rightarrow\mathcal{H} of Ik:HW2(Ω,p0)I_{k}:\mathcal{H}\rightarrow W_{2}(\Omega,p_{0}) is defined by the relation Skf,gH=f,IkgW2,fW2(Ω,p0),gH\langle S_{k}f,g\rangle_{\mathcal{H}}=\langle f,I_{k}g\rangle_{W_{2}},\,\,f\in W_{2}(\Omega,p_{0}),\,g\in\mathcal{H}. If f:=[h]W2(Ω,p0)f:=[h]_{\sim}\in\mathcal{W}^{\sim}_{2}(\Omega,p_{0}), then

For yΩy\in\Omega and g=k(,y)g=k(\cdot,y), this yields

We now show that IkI_{k} is Hilbert-Schmidt. Since H\mathcal{H} is separable, let (el)l1(e_{l})_{l\geq 1} be an ONB of H\mathcal{H}. Then we have

which proves that IkI_{k} is Hilbert-Schmidt (hence compact) and therefore SkS_{k} is also Hilbert-Schmidt and compact. The other assertions about SkIkS_{k}I_{k} and IkSkI_{k}S_{k} are straightforward. \blacksquare

11 Proof of Theorem 12

By slight abuse of notation, ff_{\star} is used to denote [f][f_{\star}]_{\sim} in the proof for simplicity. For fFf\in\mathcal{F}, we have

Since kk satisfies (C) it is easy to verify that Skf,fH=f,ξH,fH\langle S_{k}f_{\star},f\rangle_{\mathcal{H}}=\langle f,-\xi\rangle_{\mathcal{H}},\,\forall\,f\in\mathcal{H} (see proof of Theorem 4(ii)). This implies Skf=ξS_{k}f_{\star}=-\xi and

where ξ\xi is defined in Theorem 4(ii), and EkE_{k} is precisely the operator CC defined in Theorem 4(ii). Following the proof of Theorem 4(ii), for λ>0\lambda>0, it is easy to show that the unique minimizer of the regularized objective, J(p0pf)+λ2fH2J(p_{0}\|p_{f})+\frac{\lambda}{2}\|f\|^{2}_{\mathcal{H}} exists and is given by

We would like to reiterate that (29) and (30) also match with their counterparts in Theorem 4 and therefore as in Theorem 4(iv), an estimator of ff_{\star} is given by fλ,n=(E^k+λI)1ξ^f_{\lambda,n}=-(\hat{E}_{k}+\lambda I)^{-1}\hat{\xi}. In other words, this is the same as in Theorem 4(iv) since E^k=C^\hat{E}_{k}=\hat{C}, and can be solved by a simple linear system provided in Theorem 5. Here E^k\hat{E}_{k} is the empirical estimator of EkE_{k}. Now consider

where B(λ):=IkfλfW2\mathcal{B}(\lambda):=\|I_{k}f_{\lambda}-f_{\star}\|_{W_{2}}. The proof now proceeds using the following decomposition, equivalent to the one used in the proof of Theorem 6(i), i.e.,

where we used (30) in ()(\dagger). S^kf\hat{S}_{k}f_{\star} is well-defined as it is the empirical version of the restriction of SkS_{k} to W2(p0)\mathcal{W}^{\sim}_{2}(p_{0}). Since SkfEkfλ=Sk(fIkfλ)S_{k}f_{\star}-E_{k}f_{\lambda}=S_{k}(f_{\star}-I_{k}f_{\lambda}) and S^kfE^kfλ=S^k(fIkfλ)\hat{S}_{k}f_{\star}-\hat{E}_{k}f_{\lambda}=\hat{S}_{k}(f_{\star}-I_{k}f_{\lambda}), we have

for ncλ2n\geq\frac{c}{\lambda^{2}} where cc is a sufficiently large constant that does not depend on nn and λ\lambda. Following the proof of Proposition A.4(i), we have

wherein the first term is zero as Skf+ξ=0S_{k}f_{\star}+\xi=0 and since

the integral in the second term is finite because of (D) and fW2(Ω,p0)f^{*}\in W_{2}(\Omega,p_{0}). Therefore, an application of Chebyshev’s inequality yields

We now show that (S^kSk)(fIkfλ)H=Op0(B(λ)n1/2)\|(\hat{S}_{k}-S_{k})(f_{\star}-I_{k}f_{\lambda})\|_{\mathcal{H}}=O_{p_{0}}(\mathcal{B}(\lambda)n^{-1/2}). To this end, define g:=fIkfλg:=f_{\star}-I_{k}f_{\lambda} and consider

which therefore yields the claim through an application of Chebyshev’s inequality. Using this along with (33) and (34) in (32), and using the resulting bound in (31) yields

(i) We bound B(λ)\mathcal{B}(\lambda) as follows. First note that

and so for any hHh\in\mathcal{H}, we have

From (Tk+λI)1Tk=Ik(Ek+λI)1Sk(T_{k}+\lambda I)^{-1}T_{k}=I_{k}(E_{k}+\lambda I)^{-1}S_{k} and SkIkh=EkhS_{k}I_{k}h=E_{k}h, we have

where the inequality follows from Proposition A.3(ii). Using (37) and (38) in (36), we obtain B(λ)fIkhW2+hHλ\mathcal{B}(\lambda)\leq\|f_{\star}-I_{k}h\|_{W_{2}}+\|h\|_{\mathcal{H}}\sqrt{\lambda}, using which in (35) yields

Since the above inequality holds for any hHh\in\mathcal{H}, we therefore have

Since J(p0pfλ,n)infpPJ(p0p)J(p_{0}\|p_{f_{\lambda,n}})\geq\inf_{p\in\mathcal{P}}J(p_{0}\|p), we have that J(p0pfλ,n)infpPJ(p0p)J(p_{0}\|p_{f_{\lambda,n}})\rightarrow\inf_{p\in\mathcal{P}}J(p_{0}\|p) as λ0\lambda\rightarrow 0, λn\lambda n\rightarrow\infty and nn\rightarrow\infty. (ii) Recall B(λ)\mathcal{B}(\lambda) from (i). From Proposition A.3(i) it follows that B(λ)0\mathcal{B}(\lambda)\rightarrow 0 as λ0\lambda\rightarrow 0 if fR(Tk)f_{\star}\in\overline{\mathcal{R}(T_{k})}. Therefore, (35) reduces to 2J(p0pfλ,n)Op0(1λn)+B(λ)\sqrt{2\,J(p_{0}\|p_{f_{\lambda,n}})}\leq O_{p_{0}}\left(\frac{1}{\sqrt{\lambda n}}\right)+\mathcal{B}(\lambda) and the consistency result follows. If fR(Tkβ)f_{\star}\in\mathcal{R}(T^{\beta}_{k}) for some β>0\beta>0, then the rates follow from Proposition A.3 by noting that B(λ)max{1,Tkβ1}λmin{1,β}TkβfW2\mathcal{B}(\lambda)\leq\max\{1,\|T_{k}\|^{\beta-1}\}\lambda^{\min\{1,\beta\}}\|T^{-\beta}_{k}f_{\star}\|_{W_{2}} and choosing λ=nmax{13,12β+1}\lambda=n^{-\max\left\{\frac{1}{3},\frac{1}{2\beta+1}\right\}}. (iii) This simply follows from an analysis similar to the one used in the proof of Theorem 6(iii). \blacksquare

12 Proof of Proposition 13

For any pPFDp\in\mathcal{P}_{\text{FD}}, define f:=logpq0f:=\log\frac{p}{q_{0}}, which implies that [f]W2(p)[f]_{\sim}\in W_{2}(p). Since Ik(H)I_{k}(\mathcal{H}) is dense in W2(p)W_{2}(p), we have for any ϵ>0\epsilon>0, there exists gHg\in\mathcal{H} such that [f]IkgW22ϵ\|[f]_{\sim}-I_{k}g\|_{W_{2}}\leq\sqrt{2\epsilon}. For a given gHg\in\mathcal{H}, pick pgPp_{g}\in\mathcal{P}. Therefore,

Appendix A Appendix: Technical Results

In this appendix, we present some technical results that are used in the proofs.

In the following result, claims (iii) and (iv) are quoted from Lemma 3.1 of van der Vaart and van Zanten (2008).

pfpgLr(Ω)2e2fge2min{f,g}fgq0Lr(Ω)\|p_{f}-p_{g}\|_{L^{r}(\Omega)}\leq 2e^{2\|f-g\|_{\infty}}e^{2\min\{\|f\|_{\infty},\|g\|_{\infty}\}}\|f-g\|_{\infty}\|q_{0}\|_{L^{r}(\Omega)} for any 1r1\leq r\leq\infty;

pfpgL1(Ω)2efgfg\|p_{f}-p_{g}\|_{L^{1}(\Omega)}\leq 2e^{\|f-g\|_{\infty}}\|f-g\|_{\infty};

KL(pfpg)cfg2efg(1+fg)KL(p_{f}\|p_{g})\leq c\,\|f-g\|^{2}_{\infty}e^{\|f-g\|_{\infty}}\left(1+\|f-g\|_{\infty}\right) where cc is a universal constant;

h(pf,pg)efg/2fg.h(p_{f},p_{g})\leq e^{\|f-g\|_{\infty}/2}\|f-g\|_{\infty}.

(i) Define B(f):=efq0dxB(f):=\int e^{f}q_{0}\,dx. Consider

Since efq0Lr(Ω)efq0Lr(Ω)\|e^{f}q_{0}\|_{L^{r}(\Omega)}\leq e^{\|f\|_{\infty}}\|q_{0}\|_{L^{r}(\Omega)} and B(f)efB(f)\geq e^{-\|f\|_{\infty}}, from (A.2) we obtain

where we used max{a,b}min{a,b}+ab\max\{a,b\}\leq\min\{a,b\}+|a-b| for a,b0a,b\geq 0 in the last line above. (ii) This simply follows from (A.2) by using r=1r=1.

A.2 General Representer Theorem

The following is the general representer theorem for abstract Hilbert spaces.

and thus Aα=i=1mαiϕiA^{*}\bm{\alpha}=\sum^{m}_{i=1}\alpha_{i}\phi_{i}. Therefore AAα=j=1mαjAϕj=j=1mαj(ϕj,ϕiH)i=1mAA^{*}\bm{\alpha}=\sum^{m}_{j=1}\alpha_{j}A\phi_{j}=\sum^{m}_{j=1}\alpha_{j}(\langle\phi_{j},\phi_{i}\rangle_{H})^{m}_{i=1}, and so for every i[m]i\in[m], (AAα)i=j=1mϕj,ϕiHαj(AA^{*}\bm{\alpha})_{i}=\sum^{m}_{j=1}\langle\phi_{j},\phi_{i}\rangle_{H}\alpha_{j} and hence AA=KAA^{*}=\bm{K}.

The following result is quite well-known in the linear inverse problem theory (Engl et al., 1996).

Let CC be a bounded, self-adjoint compact operator on a separable Hilbert space HH. For λ>0\lambda>0 and fHf\in H, define fλ:=(C+λI)1Cff_{\lambda}:=(C+\lambda I)^{-1}Cf and Aθ(λ):=Cθ(fλf)H\mathcal{A}_{\theta}(\lambda):=\|C^{\theta}(f_{\lambda}-f)\|_{H} for θ0\theta\geq 0. Then the following hold.

For any θ>0\theta>0, Aθ(λ)0\mathcal{A}_{\theta}(\lambda)\rightarrow 0 as λ0\lambda\rightarrow 0 and if fR(C)f\in\overline{\mathcal{R}(C)}, then A0(λ)0\mathcal{A}_{0}(\lambda)\rightarrow 0 as λ0\lambda\rightarrow 0.

If fR(Cβ)f\in\mathcal{R}(C^{\beta}) for β0\beta\geq 0 and β+θ>0\beta+\theta>0, then

by the dominated convergence theorem. For any θ>0\theta>0, we have

Let f=fR+fNf=f_{R}+f_{N} where fRR(Cθ)f_{R}\in\overline{\mathcal{R}(C^{\theta})}, fNR(Cθ)f_{N}\in\overline{\mathcal{R}(C^{\theta})}^{\perp} if 0<θ10<\theta\leq 1 and fRR(C)f_{R}\in\overline{\mathcal{R}(C)}, fNR(C)f_{N}\in\overline{\mathcal{R}(C)}^{\perp} if θ1\theta\geq 1. Then

as λ0\lambda\rightarrow 0. (ii) If fR(Cβ)f\in\mathcal{R}(C^{\beta}), then there exists gHg\in H such that f=Cβgf=C^{\beta}g. This yields

On the other hand, for β+θ1\beta+\theta\geq 1, we have

Using the above in (A.3) yields the result.

A.4 Bound on the Norm of Certain Operators and Functions

The following result is used in many places throughout the paper. We would like to highlight that special cases of this result are known, e.g., see the proof of Theorem 4 in Caponnetto and Vito (2007) where concentration inequalites are obtained for the quantities in Proposition A.4 using Bernstein’s inequality. Here, we provide asymptotic statements using Chebyshev’s inequality.

Rα(R+λI)θλαθ\|R^{\alpha}(R+\lambda I)^{-\theta}\|\leq\lambda^{\alpha-\theta}.

R^α(R^+λI)θλαθ\|\hat{R}^{\alpha}(\hat{R}+\lambda I)^{-\theta}\|\leq\lambda^{\alpha-\theta}.

where the last inequality follows from (iii). Using (A.5) in (A.4), we obtain

The result therefore follows by an application of Chebyshev’s inequality. (v) We use the idea in Step 2.1 of the proof of Theorem 4 in Caponnetto and Vito (2007), where Rα(R^+λI)1R^{\alpha}(\hat{R}+\lambda I)^{-1} is written equivalently as follows: Note that R^+λI=(R^R)+(R+λI)\hat{R}+\lambda I=(\hat{R}-R)+(R+\lambda I), which implies

and so Rα(R^+λI)1=Rα(R+λI)1(I(RR^)(R+λI)1)1R^{\alpha}(\hat{R}+\lambda I)^{-1}=R^{\alpha}(R+\lambda I)^{-1}\left(I-(R-\hat{R})(R+\lambda I)^{-1}\right)^{-1}. Using the Von Neumann series representation, we have

A.5 Interpolation Space

In this section, we briefly recall the definition of interpolation spaces of the real method. To this end, let E0E_{0} and E1E_{1} be two arbitrary Banach spaces that are continuously embedded in some topological (Hausdorff) vector space E\mathcal{E}. Then, for xE0+E1:={x0+x1:x0E0,x1E1}x\in E_{0}+E_{1}:=\{x_{0}+x_{1}:x_{0}\in E_{0},\,x_{1}\in E_{1}\} and t>0t>0, the KK-functional of the real interpolation method (see Bennett and Sharpley, 1988, Definition 1.1, p. 293) is defined by

Suppose EE and FF are two Banach spaces that satisfy FEF\hookrightarrow E (i.e., FEF\subset E and the inclusion operator id:FE\text{id}:F\rightarrow E is continuous), then the KK-functional reduces to

The KK-functional can be used to define interpolation norms, for 0<θ<10<\theta<1, 1s1\leq s\leq\infty and xE0+E1x\in E_{0}+E_{1}, as

Moreover, the corresponding interpolation spaces (Bennett and Sharpley, 1988, Definition 1.7, p. 299) are defined as

Appendix B Appendix: Miscellaneous Results

In this appendix, we present the proofs of some claims that we made in Sections 1, 4 and 5.

The following result provides a relationship between Fisher and Kullback-Leibler divergences.

as x2\|x\|_{2}\rightarrow\infty for all i[d]i\in[d] where α=1d\alpha=1-d. Then

Under the conditions mentioned on ptp_{t} and qtq_{t}, it can be shown that

See Theorem 1 in Lyu (2009) for a proof. The above identity is a simple generalization of de Bruijn’s identity that relates the Fisher information to the derivative of the Shannon entropy (see Cover and Thomas, 1991, Theorem 16.6.2). Integrating w.r.t. tt on both sides of (B.2), we obtain KL(p_{t}\|q_{t})\Big{|}^{\infty}_{t=0}=-\int^{\infty}_{0}J(p_{t}\|q_{t})\,dt which yields the equality in (B.1) as KL(ptqt)0KL(p_{t}\|q_{t})\rightarrow 0 as tt\rightarrow\infty and KL(ptqt)KL(pq)KL(p_{t}\|q_{t})\rightarrow KL(p\|q) as t0t\rightarrow 0.

To handle the case of unbounded kk, in the following, we assume that there exists a positive constant MM such that f0HM\|f_{0}\|_{\mathcal{H}}\leq M, so that an estimator of f0f_{0} can be constructed as

where J^λ\hat{J}_{\lambda} is defined in Theorem 4(iv). This modification yields a valid estimator pf˘λ,np_{\breve{f}_{\lambda,n}} as long as kk satisfies ΩeMk(x,x)q0(x)dx<,\int_{\Omega}e^{M\sqrt{k(x,x)}}q_{0}(x)\,dx<\infty, since this implies f˘λ,nF\breve{f}_{\lambda,n}\in\mathcal{F}. The construction of f˘λ,n\breve{f}_{\lambda,n} requires the knowledge of MM, however, which we assume is known a priori. Using the representer theorem in RKHS, it can be shown (see Section B.2.1) that

where δ˘\breve{\delta} and β˘\breve{\bm{\beta}} are obtained by solving the following quadratically constrained quadratic program (QCQP),

with Δ:=(h,ξ^H2)\Delta:=(\bm{h},\|\hat{\xi}\|^{2}_{\mathcal{H}}), Θ:=(β,δ)\Theta:=(\bm{\beta},\delta) and K\bm{K}, H\bm{H} being defined in the proof of Theorem 5 and the remark following it. The following result investigates the consistency and convergence rates for pf˘λ,np_{\breve{f}_{\lambda,n}}.

Let Mf0HM\geq\|f_{0}\|_{\mathcal{H}} be a fixed constant, and f˘n,λ\breve{f}_{n,\lambda} be a clipped estimator given by (B.3). Suppose (A)–(D) with ε=2\varepsilon=2 hold. Let supp(q0)=Ω\emph{supp}(q_{0})=\Omega and ΩeMk(x,x)q0(x)dx<\int_{\Omega}e^{M\sqrt{k(x,x)}}q_{0}(x)\,dx<\infty. Define η(x)=k(x,x)eMk(x,x)\eta(x)=\sqrt{k(x,x)}e^{M\sqrt{k(x,x)}}. Then, as λn,λ0andn\lambda\sqrt{n}\rightarrow\infty,\,\lambda\rightarrow 0\,\,\text{and}\,\,n\rightarrow\infty,

pf˘λ,np0L1(Ω)0\|p_{\breve{f}_{\lambda,n}}-p_{0}\|_{L^{1}(\Omega)}\rightarrow 0, KL(p0pf˘λ,n)0KL(p_{0}\|p_{\breve{f}_{\lambda,n}})\rightarrow 0 if ηL1(Ω,q0)\eta\in L^{1}(\Omega,q_{0});

for 1<r1<r\leq\infty, pf˘λ,np0Lr(Ω)0\|p_{\breve{f}_{\lambda,n}}-p_{0}\|_{L^{r}(\Omega)}\rightarrow 0 if ηq0L1(Ω)Lr(Ω)\eta q_{0}\in L^{1}(\Omega)\cap L^{r}(\Omega) and eMk(,)q0Lr(Ω)e^{M\sqrt{k(\cdot,\cdot)}}q_{0}\in L^{r}(\Omega);

h(pf˘λ,n,p0)0h(p_{\breve{f}_{\lambda,n}},p_{0})\rightarrow 0 if k(,)ηL1(Ω,q0)\sqrt{k(\cdot,\cdot)}\eta\in L^{1}(\Omega,q_{0});

J(p0pf˘λ,n)0J(p_{0}\|p_{\breve{f}_{\lambda,n}})\rightarrow 0.

In addition, if f0R(Cβ)f_{0}\in\mathcal{R}(C^{\beta}) for some β>0\beta>0, then pf˘λ,np0Lr(Ω)=Op0(θn),h(p0,pf˘λ,n)=Op0(θn),KL(p0pf˘λ,n)=Op0(θn)andJ(p0pf˘λ,n)=Op0(θn2)\|p_{\breve{f}_{\lambda,n}}-p_{0}\|_{L^{r}(\Omega)}=O_{p_{0}}(\theta_{n}),\,h(p_{0},p_{\breve{f}_{\lambda,n}})=O_{p_{0}}(\theta_{n}),\,KL(p_{0}\|p_{\breve{f}_{\lambda,n}})=O_{p_{0}}(\theta_{n})\,\,\text{and}\,\,J(p_{0}\|p_{\breve{f}_{\lambda,n}})=O_{p_{0}}(\theta^{2}_{n}) where θn:=nmin{14,β2(β+1)}\theta_{n}:=n^{-\min\left\{\frac{1}{4},\frac{\beta}{2(\beta+1)}\right\}} with λ=nmax{14,12(β+1)}\lambda=n^{-\max\left\{\frac{1}{4},\frac{1}{2(\beta+1)}\right\}} assuming the respective conditions in (i)-(iii) above hold.

For any xΩx\in\Omega, since f0(x)f0Hk(x,x)Mk(x,x)|f_{0}(x)|\leq\|f_{0}\|_{\mathcal{H}}\sqrt{k(x,x)}\leq M\sqrt{k(x,x)} and f˘λ,n(x)Mk(x,x)|\breve{f}_{\lambda,n}(x)|\leq M\sqrt{k(x,x)}, we have

where we used the fact that exeyeaxy|e^{x}-e^{y}|\leq e^{a}|x-y| for x,y[a,a]x,y\in[-a,a] and η(x):=k(x,x)eMk(x,x)\eta(x):=\sqrt{k(x,x)}e^{M\sqrt{k(x,x)}}. In the following, we obtain bounds for \bigl{\|}p_{\breve{f}_{\lambda,n}}-p_{0}\bigr{\|}_{L^{r}(\Omega)} for any 1r1\leq r\leq\infty, h(pf˘λ,n,p0)h(p_{\breve{f}_{\lambda,n}},p_{0}) and KL(p0pf˘λ,n)KL(p_{0}\|p_{\breve{f}_{\lambda,n}}) in terms of \|\breve{f}_{\lambda,n}-f_{0}\bigr{\|}_{\mathcal{H}}. Define B(f):=Ωefq0dxB(f):=\int_{\Omega}e^{f}q_{0}\,dx. Since kk satisfies ΩeMk(x,x)q0(x)dx<\int_{\Omega}e^{M\sqrt{k(x,x)}}q_{0}(x)\,dx<\infty, then it is clear that f˘λ,nF\breve{f}_{\lambda,n}\in\mathcal{F} as B(f˘λ,n)<B(\breve{f}_{\lambda,n})<\infty since

Similarly, it is easy to verify that B(f0)<B(f_{0})<\infty. (i) Recalling (A.1), we have

Using (B.4), we bound B(f˘λ,n)B(f0)|B(\breve{f}_{\lambda,n})-B(f_{0})| as

Also for any fHf\in\mathcal{H} with fHM\|f\|_{\mathcal{H}}\leq M, we have B(f)ΩeMk(x,x)q0(x)dx=:θB(f)\geq\int_{\Omega}e^{-M\sqrt{k(x,x)}}q_{0}(x)\,dx=:\theta, where θ>0\theta>0. Again using (B.4), we have

and ef0q0Lr(Ω)eMk(x,x)q0Lr(Ω)\|e^{f_{0}}q_{0}\|_{L^{r}(\Omega)}\leq\|e^{M\sqrt{k(x,x)}}q_{0}\|_{L^{r}(\Omega)}. Therefore,

where the above inequality is obtained by carrying out and simplifying the decomposition as in (A.1). Using (B.4), we therefore have

(iv) As f0,f˘λ,nFf_{0},\,\breve{f}_{\lambda,n}\in\mathcal{F}, by Theorem 4, we obtain J(p0pf˘λ,n)=12C(f˘λ,nf0)H212C2f˘λ,nf0H2.J(p_{0}\|p_{\breve{f}_{\lambda,n}})=\frac{1}{2}\|\sqrt{C}(\breve{f}_{\lambda,n}-f_{0})\|^{2}_{\mathcal{H}}\leq\frac{1}{2}\|\sqrt{C}\|^{2}\|\breve{f}_{\lambda,n}-f_{0}\|^{2}_{\mathcal{H}}. Note that we have bounded the various distances between pf˘λ,np_{\breve{f}_{\lambda,n}} and p0p_{0} in terms of f˘λ,nf0H\|\breve{f}_{\lambda,n}-f_{0}\|_{\mathcal{H}}. Since f˘λ,n=fλ,n\breve{f}_{\lambda,n}=f_{\lambda,n} with probability converging to 1, the assertions on consistency are proved by Theorem 6(i) in combination with Lemma 14—as we did not explicitly assume f0R(C)f_{0}\in\overline{\mathcal{R}(C)}—and the rates follow from Theorem 6(iii).

The following observations can be made while comparing the scenarios of using bounded vs. unbounded kernels in the problem of estimating p0p_{0} through Theorems 7 and B.2. First, the consistency results in LrL^{r}, Hellinger and KL distances are the same but for additional integrability conditions on kk and q0q_{0}. The additional integrability conditions are not too difficult to hold in practice as they involve kk and q0q_{0} which can be chosen appropriately. However, the unbounded situation in Theorem B.2 requires the knowledge of MM which is usually not known. On the other hand, the consistency result in JJ in Theorem B.2 is slightly weaker than in Theorem 7. This may be an artifact of our analysis as we are not able to adapt the bounding technique used in the proof of Theorem 7 to bound J(p0pf˘λ,n)=12C(f˘λ,nf0)H2J(p_{0}\|p_{\breve{f}_{\lambda,n}})=\frac{1}{2}\|\sqrt{C}(\breve{f}_{\lambda,n}-f_{0})\|^{2}_{\mathcal{H}} as it critically depends on the boundedness of kk. Therefore, we used a trivial bound of J(p0pf˘λ,n)=12C(f˘λ,nf0)H212C2f˘λ,nf0H2J(p_{0}\|p_{\breve{f}_{\lambda,n}})=\frac{1}{2}\|\sqrt{C}(\breve{f}_{\lambda,n}-f_{0})\|^{2}_{\mathcal{H}}\leq\frac{1}{2}\|\sqrt{C}\|^{2}\|\breve{f}_{\lambda,n}-f_{0}\|^{2}_{\mathcal{H}}, which yields the result through Theorem 6(i). Due to the same reason, we also obtain a slower rate of convergence in JJ. Second, the rate of convergence in KL is slower than in Theorem B.2, which again may be an artifact of our analysis. The convergence rate for KL in Theorem 7 is based on the application of Theorem 6(ii) in Lemma A.1, where the bound on KL in Lemma A.1 critically uses the boundedness to upper bound KL in terms of squared Hellinger distance.

Any fHf\in\mathcal{H} can be decomposed as f=f+ff=f_{\|}+f_{\perp} where

which is a closed subset of H\mathcal{H} and fH:={gH:g,hH=0,hH}f_{\perp}\in\mathcal{H}^{\perp}_{\|}:=\left\{g\in\mathcal{H}:\langle g,h\rangle_{\mathcal{H}}=0,\,\forall\,h\in\mathcal{H}_{\|}\right\} so that H=HH\mathcal{H}=\mathcal{H}_{\|}\oplus\mathcal{H}^{\perp}_{\|}. Since the objective function in (B.3) matches with the one in Theorem 5, using the above decomposition in (B.3), it is easy to verify that J^\hat{J} depends only on fHf_{\|}\in\mathcal{H}_{\|} so that (B.3) reduces to

and f˘λ,n=f˘λ,n+f˘λ,n\breve{f}_{\lambda,n}=\breve{f}^{\|}_{\lambda,n}+\breve{f}^{\perp}_{\lambda,n}. Since ff_{\|} is of the form in (14), using it in (B.5), it is easy to show that J^λ(f)+λ2fH2=12ΘTHΘ+ΘTΔ\hat{J}_{\lambda}(f_{\|})+\frac{\lambda}{2}\|f_{\|}\|^{2}_{\mathcal{H}}=\frac{1}{2}\Theta^{T}\bm{H}\Theta+\Theta^{T}\Delta. Similarly, it can be shown that fH2=ΘTKΘ\|f_{\|}\|^{2}_{\mathcal{H}}=\Theta^{T}\bm{K}\Theta. Since ff_{\perp} appears in (B.5) only through fH2\|f_{\perp}\|^{2}_{\mathcal{H}}, (B.5) reduces to

where f˘λ,n\breve{f}^{\|}_{\lambda,n} is constructed as in (14) using Θ\Theta_{\|} and f˘λ,n\breve{f}^{\perp}_{\lambda,n} is such that f˘λ.nH2=c\|\breve{f}^{\perp}_{\lambda.n}\|^{2}_{\mathcal{H}}=c_{\perp}. The necessary and sufficient conditions for the optimality of (Θ,c)(\Theta_{\|},c_{\perp}) is given by the following Karush-Kuhn-Tucker conditions,

Combining the dual feasibility and stationary conditions, we have η=τλ20\eta=\tau-\frac{\lambda}{2}\geq 0, i.e., τλ2\tau\geq\frac{\lambda}{2}. Using this in the complementary slackness involving τ\tau and cc_{\perp}, it follows that c=0c_{\perp}=0. Since f˘λ,n2=c\|\breve{f}^{\perp}_{\lambda,n}\|^{2}=c_{\perp}, we have f˘λ,n=0\breve{f}^{\perp}_{\lambda,n}=0, i.e., f˘λ,n\breve{f}_{\lambda,n} is completely determined by f˘λ,n\breve{f}^{\|}_{\lambda,n}. Therefore f˘λ,n\breve{f}^{\|}_{\lambda,n} is of the form in (14) and (B.6) reduces to a quadratically constrained quadratic program.

where R(C0):=H\mathcal{R}(C^{0}):=\mathcal{H}, and the spaces R(Cβ)\mathcal{R}(C^{\beta}) and [R(Cβ),R(Cβ)]ββ,2\left[\mathcal{R}(C^{\lfloor\beta\rfloor}),\mathcal{R}(C^{\lceil\beta\rceil})\right]_{\beta-\lfloor\beta\rfloor,2} have equivalent norms.

To prove Proposition B.3, we need the following result which we quote from Steinwart and Scovel (2012, Lemma 6.3) (also see Tartar, 2007, Lemma 23.1) that interpolates L2L^{2}-spaces whose underlying measures are absolutely continuous with respect to a measure ν\nu.

Let ν\nu be a measure on a measurable space Θ\Theta and w0:Θ[0,)w_{0}:\Theta\rightarrow[0,\infty) and w1:Θ[0,)w_{1}:\Theta\rightarrow[0,\infty) be measurable functions. For 0<β<10<\beta<1, define wβ:=w01βw1βw_{\beta}:=w^{1-\beta}_{0}w^{\beta}_{1}. Then we have

and the norms on these two spaces are equivalent. Moreover, this result still holds for weights w0:Θ(0,)w_{0}:\Theta\rightarrow(0,\infty) and w1:Θ[0,]w_{1}:\Theta\rightarrow[0,\infty], if one uses the convention 0:=00\cdot\infty:=0 in the definition of the weighted spaces.

Proof of Proposition B.3. The proof is based on the ideas used in the proof of Theorem 4.6 in Steinwart and Scovel (2012). Recall that by the Hilbert-Schmidt theorem, CC has the following representation,

where θi:=ϕi\theta_{i}:=\phi_{i} if iIi\in I and θi:=ψi\theta_{i}:=\psi_{i} if iJi\in J with ai:=f,θiHa_{i}:=\langle f,\theta_{i}\rangle_{\mathcal{H}}. Let β>0\beta>0. By definition, gR(Cβ)g\in\mathcal{R}(C^{\beta}) is equivalent to hH\exists h\in\mathcal{H} such that g=Cβhg=C^{\beta}h, i.e.,

where α:=(αi)iI\alpha:=(\alpha_{i})_{i\in I}. Let us equip this space with the bilinear form

It is easy to verify that (αiβϕi)iI(\alpha^{\beta}_{i}\phi_{i})_{i\in I} is an ONB of R(Cβ)\mathcal{R}(C^{\beta}). Also since R(Cβ1)R(Cβ2)\mathcal{R}(C^{\beta_{1}})\subset\mathcal{R}(C^{\beta_{2}}) for 0<β2<β1<0<\beta_{2}<\beta_{1}<\infty and id:R(Cβ1)R(Cβ2)\text{id}:\mathcal{R}(C^{\beta_{1}})\rightarrow\mathcal{R}(C^{\beta_{2}}) is continuous, i.e., for any gR(Cβ1)g\in\mathcal{R}(C^{\beta_{1}}),

and so R(Cβ1)R(Cβ2)\mathcal{R}(C^{\beta_{1}})\hookrightarrow\mathcal{R}(C^{\beta_{2}}). Similarly, we can show that R(C)H\mathcal{R}(C)\hookrightarrow\mathcal{H}. In the following, we first prove the result for 0<β<10<\beta<1 and then for β>1\beta>1.

(a) 0<β<10<\beta<1: For any fHf\in\mathcal{H} and gR(C)g\in\mathcal{R}(C), we have

where we define ci:=0c_{i}:=0 for iJi\in J. For t>0t>0, we find

From this we immediately obtain the equivalence

where 0<β<10<\beta<1. Applying the second part of Lemma B.4 to the counting measure on IJI\cup J yields

from which we obtain the following equivalence

For d=1d=1, this implies (jf)p=0(\partial_{j}f)p=0 a.e. and so fW2=0\|f\|_{W_{2}}=0.

Acknowledgments

Theorem A.2 and the proof are due to an anonymous reviewer. We thank Dougal Sutherland for a careful reading of the paper which helped to fix minor errors. A part of the work was carried out while BKS was a Research Fellow in the Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics, University of Cambridge. BKS thanks Richard Nickl for many valuable comments and insightful discussions. KF is supported in part by MEXT Grant-in-Aid for Scientific Research on Innovative Areas 25120012.

References