Non-parametric Stochastic Approximation with Large Step sizes

Aymeric Dieuleveut, Francis Bach

Introduction

Positive-definite-kernel-based methods such as the support vector machine or kernel ridge regression are now widely used in many areas of science of engineering. They were first developed within the statistics community for non-parametric regression using splines, Sobolev spaces, and more generally reproducing kernel Hilbert spaces (see, e.g., ). Within the machine learning community, they were extended in several interesting ways (see, e.g., ): (a) other problems were tackled using positive-definite kernels beyond regression problems, through the “kernelization” of classical unsupervised learning methods such as principal component analysis, canonical correlation analysis, or K-means, (b) efficient algorithms based on convex optimization have emerged, in particular for large sample sizes, and (c) kernels for non-vectorial data have been designed for objects like strings, graphs, measures, etc. A key feature is that they allow the separation of the representation problem (designing good kernels for non-vectorial data) and the algorithmic/theoretical problems (given a kernel, how to design, run efficiently and analyse estimation algorithms).

In practice, estimation algorithms based on regularized empirical risk minimization (e.g., penalized least-squares) face two challenges: (a) using the correct regularization parameter and (b) finding an approximate solution of the convex optimization problems. In this paper, we consider these two problems jointly by following a stochastic approximation framework formulated directly in the RKHS, in which each observation is used only once and overfitting is avoided by making only a single pass through the data–a form of early stopping, which has been considered in other statistical frameworks such as boosting . While this framework has been considered before , the algorithms that are considered either (a) require two sequences of hyperparameters (the step-size in stochastic gradient descent and a regularization parameter) or (b) do not always attain the optimal rates of convergence for estimating the regression function. In this paper, we aim to remove simultaneously these two limitations.

Traditional online stochastic approximation algorithms, as introduced by Robbins and Monro , lead in finite-dimensional learning problems (e.g., parametric least-squares regression) to stochastic gradient descent methods with step-sizes decreasing with the number of observations nn, which are typically proportional to nζn^{-\zeta}, with ζ\zeta between 1/21/2 and 1. Short step-sizes (ζ=1\zeta=1) are adapted to well-conditioned problems (low dimension, low correlations between covariates), while longer step-sizes (ζ=1/2\zeta=1/2) are adapted to ill-conditioned problems (high dimension, high correlations) but with a worse convergence rate—see, e.g., and references therein. More recently showed that constant step-sizes with averaging could lead to the best possible convergence rate in Euclidean spaces (i.e., in finite dimensions). In this paper, we show that using longer step-sizes with averaging also brings benefits to Hilbert space settings needed for non parametric regression.

With our analysis, based on positive definite kernels, under assumptions on both the objective function and the covariance operator of the RKHS, we derive improved rates of convergence , in both the finite horizon setting where the number of observations is known in advance and our bounds hold for the last iterate (with exact constants), and the online setting where our bounds hold for each iterate (asymptotic results only). It leads to an explicit choice of the step-sizes (which play the role of the regularization parameters) which may be used in stochastic gradient descent, depending on the number of training examples we want to use and on the assumptions we make.

In this paper, we make the following contributions:

We review in Section 2 a general though simple algebraic framework for least-squares regression in RKHS, which encompasses all commonly encountered situations. This framework however makes unnecessary topological assumptions, which we relax in Section 2.5 (with details in App. A).

We characterize in Section 3 the convergence rate of averaged least-mean-squares (LMS) and show how the proper set-up of the step-size leads to optimal convergence rates (as they were proved in ), extending results from finite-dimensional to infinite-dimensional settings. The problem we solve here was stated as an open problem in . Moreover, our results apply as well in the usual finite-dimensional setting of parametric least-squares regression, showing adaptivity of our estimator to the spectral decay of the covariance matrix of the covariates (see Section 4.1).

We compare our new results with existing work, both in terms of rates of convergence in Section 4, and with simulations on synthetic spline smoothing in Section 5.

Sketches of the proofs are given in Appendix B. Complete proofs are given in Appendices I, II.

Learning with positive-definite kernels

Throughout this section, we make the following assumption:

X\mathcal{X} is a compact topological space and H\mathcal{H} is an RKHS associated with a continuous kernel KK on the set X\mathcal{X}.

RKHSs are well-studied Hilbert spaces which are particularly adapted to regression problems (see, e.g., ). They satisfy the following properties:

H\mathcal{H} contains all functions Kx:tK(x,t)K_{x}:t\mapsto K(x,t), for all xx in X\mathcal{X}.

For any xXx\in\mathcal{X} and fHf\in\mathcal{H}, the reproducing property holds:

2 Random variables

The space LρX2{L}^{2}_{\rho_{{X}}} is then a Hilbert space with norm LρX2\|\cdot\|_{{L}^{2}_{\rho_{{X}}}}, which we will always assume separable (that is, with a countable orthonormal system).

Throughout this section, we make the following simple assumption regarding finiteness of moments:

Note that under these assumptions, any function in H\mathcal{H} in in LρX2{L}^{2}_{\rho_{{X}}}; however this inclusion is strict in most interesting situations.

3 Minimization problem

We are interested in minimizing the following quantity, which is the prediction error (or mean squared error) of a function ff, defined for any function in LρX2{L}^{2}_{\rho_{{X}}} as:

We are looking for a function with a low prediction error in the particular function space H\mathcal{H}, that is we aim to minimize ε(f)\varepsilon(f) over fHf\in\mathcal{H}. We have for fLρX2f\in{L}^{2}_{\rho_{{X}}}:

A key feature of our analysis is that we only considered fgρLρX22\|f-g_{\rho}\|^{2}_{{L}^{2}_{\rho_{{X}}}} as a measure of performance and do not consider convergences in stricter norms (which are not true in general). This allows us to neither assume that gρg_{\rho} is in H\mathcal{H} nor that H\mathcal{H} is dense in LρX2{L}^{2}_{\rho_{{X}}}. We thus need to define a notion of the best estimator in H\mathcal{H}. We first define the closure F\overline{{F}} (with respect to LρX2\|\cdot\|_{{L}^{2}_{\rho_{{X}}}}) of any set FLρX2F\subset{L}^{2}_{\rho_{{X}}} as the set of limits in LρX2{L}^{2}_{\rho_{{X}}} of sequences in F{F}. The space H\overline{\mathcal{H}} is a closed and convex subset in LρX2{L}^{2}_{\rho_{{X}}}. We can thus define gH=argminfHε(g)g_{\mathcal{H}}=\arg\min_{f\in\,\overline{\mathcal{H}}}\varepsilon(g), as the orthogonal projection of gρg_{\rho} on H\overline{\mathcal{H}}, using the existence of the projection on any closed convex set in a Hilbert space. See Proposition 8 in Appendix A for details. Of course we do not have gHHg_{\mathcal{H}}\in\mathcal{H}, that is the minimum in H\mathcal{H} is in general not attained.

Seen as a function of fHf\in\mathcal{H}, our loss function ε\varepsilon is not coercive (i.e., not strongly convex), as our covariance operator (see definition below) Σ\Sigma has no minimal strictly positive eigenvalue (the sequence of eigenvalues decreases to zero). As a consequence, even if gHHg_{\mathcal{H}}\in\mathcal{H}, gng_{n} may not converge to gHg_{\mathcal{H}} in H\mathcal{H}, and when gHHg_{\mathcal{H}}\notin\mathcal{H}, we shall even have gnH\|g_{n}\|_{\mathcal{H}}\rightarrow\infty.

4 Covariance operator

We now define the covariance operator for the space H\mathcal{H} and probability distribution ρX\rho_{X}. The spectral properties of such an operator have appeared to be a key point to characterize the convergence rates of estimators .

We implicitly define (via Riesz’ representation theorem) a linear operator Σ:HH\Sigma:\mathcal{H}\rightarrow\mathcal{H} through

This operator is the covariance operator (defined on the Hilbert space H\mathcal{H}). Using the reproducing property, we have:

where for any elements g,hHg,h\in\mathcal{H}, we denote by ghg\otimes h the operator from H\mathcal{H} to H\mathcal{H} defined as:

Note that this expectation is formally defined as a Bochner expectation (an extension of Lebesgue integration theory to Banach spaces, see ) in L(H)\mathcal{L}(\mathcal{H}) the set of endomorphisms of H\mathcal{H}.

We have defined the covariance operator on the Hilbert space H\mathcal{H}. If fHf\in\mathcal{H}, we have for all zXz\in\mathcal{X}, using the reproducing property:

which shows that the operator Σ\Sigma may be extended to any square-integrable function fLρX2f\in{L}^{2}_{\rho_{{X}}}. In the following, we extend such an operator as an endomorphism TT from LρX2{L}^{2}_{\rho_{{X}}} to LρX2{L}^{2}_{\rho_{{X}}}.

Assume (A1-2). We define the operator TT as follows:

From the discussion above, if fHLρX2f\in\mathcal{H}\subset{L}^{2}_{\rho_{{X}}}, then Tf=ΣfTf=\Sigma f. We give here some of the most important properties of TT. The operator TT (which is an endomorphism of the separable Hilbert space LρX2{L}^{2}_{\rho_{{X}}}) may be reduced in some Hilbertian eigenbasis of LρX2{L}^{2}_{\rho_{{X}}}. It allows us to define the power of such an operator TrT^{r}, which will be used to quantify the regularity of the function gHg_{\mathcal{H}}. See proof in Appendix I.2, Proposition 19.

Assume (A1-2). TT is a bounded self-adjoint semi-definite positive operator on LρX2{L}^{2}_{\rho_{{X}}}, which is trace-class. There exists a Hilbertian eigenbasis (ϕi)iI(\phi_{i})_{i\in I} of the orthogonal supplement SS of the null space Ker(T){\rm Ker}(T), with summable strictly positive eigenvalues (μi)iI(\mu_{i})_{i\in I}. That is:

iI, Tϕi=μiϕi\forall i\in I,\ T\phi_{i}=\mu_{i}\phi_{i}, (μi)iI(\mu_{i})_{i\in I} strictly positive such that iIμi<\sum_{i\in I}\mu_{i}<\infty.

LρX2=Ker(T)S{L}^{2}_{\rho_{{X}}}=\operatorname{Ker}(T)\overset{\perp}{\oplus}S, that is, LρX2{L}^{2}_{\rho_{{X}}} is the orthogonal direct sum of Ker(T)\operatorname{Ker}(T) and SS.

When the space SS has finite dimension, then II has finite cardinality, while in general II is countable. Moreover, the null space Ker(T)\operatorname{Ker}(T) may be either reduced to {0}\{0\} (this is the more classical setting and such an assumption is often made), finite-dimensional (for example when the kernel has zero mean, thus constant functions are in SS) or infinite-dimensional (e.g., when the kernel space only consists in even functions, the whole space of odd functions is in SS).

Assume (A1-2). Σ:HH\Sigma:\mathcal{H}\to\mathcal{H} is injective. The image of TT is included in H\mathcal{H}: Im(T)H\text{Im}(T)\subset\mathcal{H}. Moreover, for any iIi\in I, ϕi=1μiTϕiH\phi_{i}=\frac{1}{\mu_{i}}T\phi_{i}\in\mathcal{H} , thus (μi1/2ϕi)iI\left(\mu_{i}^{1/2}\phi_{i}\right)_{i\in I} is an orthonormal eigen-system of Σ\Sigma and an Hilbertian basis of H\mathcal{H}, i.e., for any ii in I, Σϕi=μiϕiI,\ \Sigma\phi_{i}=\mu_{i}\phi_{i}.

This proposition will be generalized under relaxed assumptions (in particular as Σ\Sigma will no more be injective, see Section 2.5 and Appendix A).

We may now define all powers TrT^{r} (they are always well defined because the sequence of eigenvalues is upper-bounded):

We define, for any r0r\geqslant 0, Tr:LρX2LρX2T^{r}:{L}^{2}_{\rho_{{X}}}\rightarrow{L}^{2}_{\rho_{{X}}}, for any hKer(T)h\in\operatorname{Ker}(T) and (ai)iI(a_{i})_{i\in I} such that iIai2<\sum_{i\in I}a_{i}^{2}<\infty, through: Tr(h+iIaiϕi)=iIaiμirϕi.T^{r}\left(h+\sum_{i\in I}a_{i}\phi_{i}\right)=\sum_{i\in I}a_{i}\mu_{i}^{r}\phi_{i}. Moreover, for any r>0r>0, TrT^{r} may be defined as a bijection from SS into Im(Tr)\text{Im}(T^{r}). We may thus define its unique inverse Tr:  Im(Tr)S.T^{-r}:\ \ \text{Im}(T^{r})\rightarrow S.

The following proposition is a consequence of Mercer’s theorem . It describes how the space H\mathcal{H} is related to the image of operator T1/2T^{1/2}.

Under assumptions (A1,2), H=T1/2(LρX2)\mathcal{H}=T^{1/2}\left({L}^{2}_{\rho_{{X}}}\right) and T1/2:SHT^{1/2}:S\rightarrow\mathcal{H} is an isometrical isomorphism.

The proposition has the following consequences:

For any r1/2, Tr(S)Hr\geqslant 1/2,\ T^{r}(S)\subset\mathcal{H}, because Tr(S)T1/2(S)T^{r}(S)\subset T^{1/2}(S), that is, with large enough powers rr, the image of TrT^{r} is in the Hilbert space.

r>0, Tr(LρX2)=S=T1/2(LρX2)=H\forall r>0,\ \overline{T^{r}({L}^{2}_{\rho_{{X}}})}=S=\overline{T^{1/2}({L}^{2}_{\rho_{{X}}})}=\overline{\mathcal{H}}, because (a) T1/2(LρX2)=HT^{1/2}({L}^{2}_{\rho_{{X}}})={\mathcal{H}} and (b) for any r>0r>0, Tr(LρX2)=S\overline{T^{r}({L}^{2}_{\rho_{{X}}})}=S. In other words, elements of H\overline{\mathcal{H}} (on which our minimization problem attains its minimum), may seen as limits (in LρX2{L}^{2}_{\rho_{{X}}}) of elements of Tr(LρX2)T^{r}({L}^{2}_{\rho_{{X}}}), for any r>0r>0.

H\mathcal{H} is dense in LρX2{L}^{2}_{\rho_{{X}}} if and only if TT is injective (which is equivalent to ker(T)={0}\ker(T)=\{0\})

The sequence of spaces {Tr(LρX2)}r>0\{T^{r}({L}^{2}_{\rho_{{X}}})\}_{r>0} is thus a decreasing (when rr is increasing) sequence of subspaces of LρX2{L}^{2}_{\rho_{{X}}} such that any of them is dense in H\overline{\mathcal{H}}, and Tr(LρX2)HT^{r}({L}^{2}_{\rho_{{X}}})\subset\mathcal{H} if and only if r1/2.r\geqslant 1/2.

In the following, the regularity of the function gHg_{\mathcal{H}} will be characterized by the fact that gHg_{\mathcal{H}} belongs to the space Tr(LρX2)T^{r}({L}^{2}_{\rho_{{X}}}) (and not only to its closure), for a specific r>0r>0 (see Section 2.7). This space may be described depending on the eigenvalues and eigenvectors as

We may thus see the spaces Tr(LρX2)T^{r}({L}^{2}_{\rho_{{X}}}) as spaces of sequences with various decay conditions.

5 Minimal assumptions

H\mathcal{H} is a separable RKHS associated with kernel KK on the set X\mathcal{X}.

In this section, we have to distinguish the set of square ρX\rho_{X}-integrable functions LρX2\mathcal{L}^{2}_{\rho_{{X}}} and its quotient LρX2{L}^{2}_{\rho_{{X}}} that makes it a separable Hilbert space. We define pp the projection from LρX2\mathcal{L}^{2}_{\rho_{{X}}} into LρX2{L}^{2}_{\rho_{{X}}} (precise definitions are given in Appendix A). Indeed it is no more possible to identify the space H\mathcal{H}, which is a subset of LρX2\mathcal{L}^{2}_{\rho_{{X}}}, and its canonical projection p(H)p(\mathcal{H}) in LρX2{L}^{2}_{\rho_{{X}}}.

Minimality: The separability assumption is necessary to be able to expand any element as an infinite sum, using a countable orthonormal family (this assumption is satisfied in almost all cases, for instance it is simple as soon as X\mathcal{X} admits a topology for which it is separable and functions in H\mathcal{H} are continuous, see for more details). Note that we do not make any topological assumptions regarding the set X\mathcal{X}. We only assume that it is equipped with a probability measure.

Our assumptions are sufficient to analyze the minimization of ε(f)\varepsilon(f) with respect to fHf\in\mathcal{H} and seem to allow the widest generality.

Comparison: These assumptions will include the previous setting, but also recover measures without full support (e.g., when the data lives in a small subspace of the whole space) and kernels on discrete objects (with non-finite cardinality).

Moreover, (A1’), (A2’) are stricly weeker than (A1), (A2). In previous work, (A2’) was sometimes replaced by the stronger assumptions supxXK(x,x)<\sup_{x\in\mathcal{X}}K(x,x)<\infty and Y|Y| bounded . Note that in functional analysis, the weaker hypothesis X ⁣× ⁣Xk(x,x)2dρX(x)dρX(x)<\int_{\mathcal{X}\!\times\!\mathcal{X}}k(x,x^{\prime})^{2}d{\rho_{X}}(x)d\rho_{X}(x^{\prime})<\infty is often used , but it is not adapted to the statistical setting.

Main differences: The main difference here is that we cannot identify H\mathcal{H} and p(H)p(\mathcal{H}): there may exist functions fH{0}f\in\mathcal{H}\setminus\{0\} such that fLρX2=0\|f\|_{\mathcal{L}^{2}_{\rho_{{X}}}}=0. This may for example occur if the support of ρX\rho_{X} is strictly included in X\mathcal{X}, and ff is zero on this support, but not identically zero. See the Appendix I.5 for more details.

As a consequence, Σ\Sigma is no more injective and we do not have Im(T1/2)=H\operatorname{Im}(T^{1/2})=\mathcal{H} any more. We thus denote S\mathscr{S} an orthogonal supplement of the null space Ker(Σ)\text{Ker}(\Sigma). As we also need to be careful not to confuse LρX2\mathcal{L}^{2}_{\rho_{{X}}} and LρX2{L}^{2}_{\rho_{{X}}}, we define an extension T\mathcal{T} of Σ\Sigma from LρX2\mathcal{L}^{2}_{\rho_{{X}}} into H\mathcal{H}, then T=pTT=p\circ\mathcal{T}. We can define for r1/2r\geqslant 1/2 the power operator Tr\mathcal{T}^{r} of T\mathcal{T} (from LρX2{L}^{2}_{\rho_{{X}}} into H\mathcal{H}), see App. A for details.

Conclusion: Our problem has the same behaviour under such assumptions. Proposition 1 remains unchanged. Decompositions in Prop. 2 and Corollary 1 must be slightly adapted (see Proposition 9 and Corollary 7 in Appendix A for details). Finally, Proposition 3 is generalized by the next proposition, which states that p(S)=p(H)p(\mathscr{S})=p(\mathcal{H}) and thus SS and p(H)p(\mathcal{H}) are isomorphic (see proof in Appendix I.2, Proposition 19):

T1/2:SS\mathcal{T}^{1/2}:S\rightarrow\mathscr{S} is an isometry. Moreover, Im(T1/2)=p(H)\operatorname{Im}(T^{1/2})=p(\mathcal{H}) and T1/2:Sp(H)T^{1/2}:S\rightarrow p(\mathcal{H}) is an isomorphism.

We can also derive a version of Mercer’s theorem, which does not make any more assumptions that are required for defining RKHSs. As we will not use it in this article, this proposition is only given in Appendix A.

Convergence results: In all convergence results stated below, assumptions (A1, A2) may be replaced by assumptions (A1’, A2’).

6 Examples

The property H=S\overline{\mathcal{H}}=S, stated after Proposition 3, is important to understand what the space H\overline{\mathcal{H}} is, as we are minimizing over this closed and convex set. As a consequence the space H\mathcal{H} is dense in LρX2{L}^{2}_{\rho_{{X}}} if and only if TT is injective (or equivalently, Ker(T)={0}S=LρX2{\rm Ker}(T)=\{0\}\Leftrightarrow S={L}^{2}_{\rho_{{X}}}). We detail below a few classical situations in which different configurations for the “inclusion” HHLρX2\mathcal{H}\subset\overline{\mathcal{H}}\subset{L}^{2}_{\rho_{{X}}} appear:

Splines over the circle: When XU[0;1]X\sim\mathcal{U}[0;1] and H\mathcal{H} is the set of mm-times periodic weakly differentiable functions (see Section 5), we have in general HHLρX2\mathcal{H}\varsubsetneq\overline{\mathcal{H}}\varsubsetneq{L}^{2}_{\rho_{{X}}}. In such a case, ker(T)=span(x1)\ker(T)=\operatorname{span}(x\mapsto 1), and Hspan(x1)=LρX2\overline{\mathcal{H}}\oplus\operatorname{span}(x\mapsto 1)={L}^{2}_{\rho_{{X}}}, that is we can approximate any zero-mean function.

Many examples and more details may be found in . In particular, kernels on non-vectorial objects may be defined (e.g., sequences, graphs or measures).

7 Convergence rates

gHTr(LρX2)g_{\mathcal{H}}\in T^{r}\left({L}^{2}_{\rho_{{X}}}\right) with r0r\geqslant 0, and as a consequence Tr(gH)LρX2<\|T^{-r}(g_{\mathcal{H}})\|_{{L}^{2}_{\rho_{{X}}}}<\infty.

We chose such assumptions in order to make the comparison with the existing literature as easy as possible, for example . However, some other assumptions may be found as in .

The two parameters rr and α\alpha intuitively parametrize the strengths of our assumptions:

In assumption (A3) a bigger α\alpha makes the assumption stronger: it means the reproducing kernel Hilbert space is smaller, that is if (A3) holds with some constant α\alpha, then it also holds for any α<α\alpha^{\prime}<\alpha. Moreover, if TT is reduced in the Hilbertian basis (ϕi)i(\phi_{i})_{i} of LρX2{L}^{2}_{\rho_{{X}}}, we have an effective search space S=\big{\{}\sum_{i=1}^{\infty}b_{i}\phi_{i}/\sum_{i=1}^{\infty}\frac{b_{i}^{2}}{\mu_{i}}<\infty\big{\}} : the smaller the eigenvalues, the smaller the space. Note that since tr(T)\operatorname{tr}(T) is finite, (A3) is always true for α=1\alpha=1.

In assumption (A4), for a fixed α\alpha, a bigger rr makes the assumption stronger, that is the function gHg_{\mathcal{H}} is actually smoother. Indeed, considering that (A4) may be rewritten g_{\mathcal{H}}\in T^{r}\big{(}{L}^{2}_{\rho_{{X}}}\big{)} and for any r<r^{\prime},\ T^{r^{\prime}}\big{(}{L}^{2}_{\rho_{{X}}}\big{)}\subset T^{r}\big{(}{L}^{2}_{\rho_{{X}}}\big{)}. In other words, \big{\{}T^{r}\left({L}^{2}_{\rho_{{X}}}\right)\big{\}}_{r\geq 0} are decreasing (rr growing) subspaces of LρX2{L}^{2}_{\rho_{{X}}}.

For r=1/2r=1/2, T^{1/2}\big{(}{L}^{2}_{\rho_{{X}}}\big{)}=\mathcal{H}; moreover, for r1/2r\geqslant 1/2, our best approximation function gHHg_{\mathcal{H}}\in\overline{\mathcal{H}} is in fact in H\mathcal{H}, that is the optimization problem in the RKHS H\mathcal{H} is attained by a function of finite norm. However for r<1/2r<1/2 it is not attained.

Furthermore, it is worth pointing the stronger assumption which is often used in the finite dimensional context, namely tr(Σ1/α)=iIμi1/α\operatorname{tr}\left(\Sigma^{1/\alpha}\right)=\sum_{i\in I}\mu_{i}^{1/\alpha} finite. It turns out that this is a stronger assumption, indeed, since we have assumed that the eigenvalues (μi)(\mu_{i}) are arranged in non-increasing order, if tr(Σ1/α)\operatorname{tr}\left(\Sigma^{1/\alpha}\right) is finite, then (A3) is satisfied for s^{2}=\big{[}2\operatorname{tr}\left(\Sigma^{1/\alpha}\right)\big{]}^{\alpha}. Such an assumption appears for example in Corollary 5.

The assumptions (A3) and (A4) are adapted to our theoretical results, but some stricter assumptions are often used, that make comparison with existing work more direct. For comparison purposes, we will also use:

Assumption (a3) directly imposes that the eigenvalues of TT decay at rate iαi^{-\alpha} (which imposes that there are infinitely many), and thus implies (A3). Together, assumptions (a3) and (a4), imply assumptions (A3) and (A4), with any δ>1+2αr\delta>1+2\alpha r. Indeed, we have

which is finite for 2αrδ<12\alpha r-\delta<-1. Thus, the supremum element of the set of rr such that (A4) holds is such that δ=1+2αr\delta=1+2\alpha r. Thus, when comparing assumptions (A3-4) and (a3-4), we will often make the identification above, that is, δ=1+2αr\delta=1+2\alpha r.

The main advantage of the new assumptions is their interpretation when the basis (ϕi)iI(\phi_{i})_{i\in I} is common for several RKHSs (such as the Fourier basis for splines, see Section 5): (a4) describes the decrease of the coordinates of the best function gHLρX2g_{\mathcal{H}}\in{L}^{2}_{\rho_{{X}}} independently of the chosen RKHS. Thus, the parameter δ\delta characterizes the prediction function, while the parameter α\alpha characterizes the RKHS.

Stochastic approximation in Hilbert spaces

In this section, we consider estimating a prediction function gHg\in\mathcal{H} from observed data, and we make the following assumption:

Given nn observations, regularized empirical risk minimization corresponds to minimizing with respect to gHg\in\mathcal{H} the following objective function:

In terms of convergence rates, assumptions (a3-4) allow to obtain convergence rates that decompose ε(g^)ε(gH)=g^gHLρX22\varepsilon(\hat{g})-\varepsilon(g_{\mathcal{H}})=\|\hat{g}-g_{\mathcal{H}}\|_{{L}^{2}_{\rho_{{X}}}}^{2} as the sum of two asymptotic terms :

Variance term: O\big{(}{\sigma^{2}}{n^{-1}\lambda^{-1/\alpha}}\big{)}, which is decreasing with λ\lambda, where σ2\sigma^{2} characterizes the noise variance, for example, in the homoscedastic case (i.i.d. additive noise), the marginal variance of the noise; see assumption (A6) for the detailed assumption that we need in our stochastic approximation context.

Bias term: O\big{(}\lambda^{\min\{(\delta-1)/\alpha,2\}}\big{)}, which is increasing with λ\lambda. Note that the corresponding rr from assumptions (A3-4) is r=(δ1)/2αr=(\delta-1)/2\alpha, and the bias term becomes proportional to λmin{2r,2}\lambda^{\min\{2r,2\}}.

Optimal predictions: If r<1r<1, then the optimal value of λ\lambda (that minimizes the sum of two terms and makes them asymptotically equivalent) is proportional to nα/(2rα+1)=nα/δn^{-\alpha/(2r\alpha+1)}=n^{-\alpha/\delta} and the excess prediction error \|\hat{g}-g_{\mathcal{H}}\|_{{L}^{2}_{\rho_{{X}}}}^{2}=O\big{(}n^{-2\alpha r/(2\alpha r+1)}\big{)}=O\big{(}n^{-1+1/\delta}\big{)}, and the resulting procedure is then “optimal” in terms of estimation of gHg_{\mathcal{H}} in LρX2{L}^{2}_{\rho_{{X}}} (see Section 4 for details).

Saturation: If r1r\geqslant 1, where the optimal value of λ\lambda (that minimizes the sum of two terms and makes them equivalent) is proportional to nα/(2α+1)n^{-\alpha/(2\alpha+1)}, and the excess prediction error is less than O\big{(}n^{-2\alpha/(2\alpha+1)}\big{)}, which is suboptimal. Although assumption (A4) is valid for a larger rr, the rate is the same than if r=1r=1.

In this paper, we consider a stochastic approximation framework with improved running-time complexity and similar theoretical behavior than regularized empirical risk minimization, with the advantage of (a) needing a single pass through the data and (b) simple assumptions.

2 Stochastic approximation

We may also apply the recursion using representants. Indeed, if g0=0g_{0}=0, which we now assume, then for any n1n\geqslant 1,

with the following recursion on the sequence (an)n1(a_{n})_{n\geqslant 1}:

We also output the averaged iterate defined as

The running time complexity is O(i)O(i) for iteration ii—if we assume that kernel evaluations are O(1)O(1), and thus O(n2)O(n^{2}) after nn steps. This is a serious limitation for practical applications. Several authors have considered expanding gng_{n} on a subset of all (Kxi)(K_{x_{i}}), which allows to bring down the complexity of each iteration and obtain an overall linear complexity is nn , but this comes at the expense of not obtaining the sharp generalization errors that we obtain in this paper. Note that when studying regularized least-squares problem (i.e., adding a penalisation term), one has to update every coefficient (ai)1in(a_{i})_{1\leqslant i\leqslant n} at step nn, while in our situation, only ana_{n} is computed at step nn.

Similar algorithms have been studied before , under various forms. Especially, in a regularization term is added to the loss function (thus considering the following problem: argminfHε(f)+λfK2\arg\min_{f\in\mathcal{H}}\varepsilon(f)+\lambda||f||_{K}^{2}). In , neither regularization nor averaging procedure are considered, but in the second case, multiple pass through the data are considered. In , a non-regularized averaged procedure equivalent to ours is considered. However, the step-sizes γn\gamma_{n} which are proposed, as well as the corresponding analysis, are different. Our step-sizes are larger and our analysis uses more directly the underlying linear algebra to obtain better rates (while the proof of is applicable to all smooth losses).

We are mainly interested in two different types of step-sizes (a.k.a. learning rates): the sequence (γi)1in(\gamma_{i})_{1\leqslant i\leqslant n} may be either:

a sequence of the type γi=Γ(n)\gamma_{i}=\Gamma(n) for ini\leqslant n, which will be referred to as the “finite horizon setting”: in this situation the number of samples is assumed to be known and fixed and we chose a constant step-size which may depend on this number. Our bound then hold only for the last iterate.

In practice it is important to have an online procedure, to be able to deal with huge amounts of data (potentially infinite). However, the analysis is easier in the “finite horizon” setting. Some doubling tricks allow to pass to varying steps , but it is not fully satisfactory in practice as it creates jumps at every nn which is a power of two.

3 Extra regularity assumptions

We first present the results in the finite horizon setting in Section 3.4 before turning to the online setting in Section 3.5.

4 Main results (finite horizon)

We can first get some guarantee on the consistency of our estimator, for any small enough constant step-size:

Assume (A1-6), then for any constant choice γn=γ0<12R2\gamma_{n}=\gamma_{0}<\frac{1}{2R^{2}}, the prediction error of gˉn\bar{g}_{n} converges to the one of gHg_{\mathcal{H}}, that is:

The expectation is considered with respect to the distribution of the sample (xi,yi)1in(x_{i},y_{i})_{1\leqslant i\leqslant n}, as in all the following theorems (note that gˉngHLρX22\|\bar{g}_{n}-g_{\mathcal{H}}\|_{{L}^{2}_{\rho_{{X}}}}^{2} is itself a different expectation with respect to the law ρX\rho_{X}).

Theorem 1 means that for the simplest choice of the learning rate as a constant, our estimator tends to the perform as well as the best estimator in the class H\mathcal{H}. Note that in general, the convergence in H\mathcal{H} is meaningless if r<1/2r<1/2. The following results will state some assertions on the speed of such a convergence; our main result, in terms of generality is the following:

Assume (A1-6) and γi=γ=Γ(n)\gamma_{i}=\gamma=\Gamma(n), for 1in1\leqslant i\leqslant n. If γR21/4\gamma R^{2}\leqslant 1/4:

Where qn,γ,s,r:=(R2αγ1+αns2)2r1αq_{n,\gamma,s,r}:=(R^{2\alpha}\gamma^{1+\alpha}ns^{2})^{\frac{2r-1}{\alpha}} if r12r\geqslant\frac{1}{2} and qn,γ,s,r:=0q_{n,\gamma,s,r}:=0 otherwise is a residual quantity.

Proof: Theorem 1 is directly derived from Theorem 2, which is proved in Appendix II.3: we derive for our algorithm a new error decomposition and bound the different sources of error via algebraic calculations. More precisely, following the proof in Euclidean space , we first analyze (in Appendix II.2) a closely related recursion (we replace KxnKxnK_{x_{n}}\otimes K_{x_{n}} by its expectation Σ\Sigma, and we thus refer to it as a semi-stochastic version of our algorithm):

It (a) leads to an easy computation of the main bias/variance terms of our result, (b) will be used to derive our main result by bounding the drifts between our algorithm and its semi-stochastic version. A more detailed sketch of the proof is given in Appendix B.

Bias/variance interpretation: The two main terms have a simple interpretation. The first one is a variance term, which shows the effect of the noise σ2\sigma^{2} on the error. It is bigger when σ\sigma gets bigger, and moreover it also gets bigger when γ\gamma is growing (bigger steps mean more variance). As for the second term, it is a bias term, which accounts for the distance of the initial choice (the null function in general) to the objective function. As a consequence, it is smaller when we make bigger steps.

Assumption (A4): Our assumption (A4) for r>1r>1 is stronger than for r=1r=1 but we do not improve the bound. Indeed the bias term (see comments below) cannot decrease faster than O(n2)O(n^{-2}): this phenomenon in known as saturation . To improve our results with r>1r>1 it may be interesting to consider another type of averaging. In the following, r<1r<1 shall be considered as the main and most interesting case.

Relationship to regularized empirical risk minimization: Our bound ends up being very similar to bounds for regularized empirical risk minimization, with the identification λ=1γn\lambda=\frac{1}{\gamma n}. It is thus no surprise that once we optimize for the value of γ\gamma, we recover the same rates of convergence. Note that in order to obtain convergence, we require that the step-size γ\gamma is bounded, which corresponds to an equivalent λ\lambda which has to be lower-bounded by 1/n1/n.

Finite horizon: Once again, this theorem holds in the finite horizon setting. That is we first choose the number of samples we are going to use, then the learning rate as a constant. It allows us to chose γ\gamma as a function of nn, in order to balance the main terms in the error bound. The trade-off must be understood as follows: a bigger γ\gamma increases the effect of the noise, but a smaller one makes it harder to forget the initial condition.

We may now deduce the following corollaries, with specific optimized values of γ\gamma:

Assume (A1-6) and a constant step-size γi=γ=Γ(n)\gamma_{i}=\gamma=\Gamma(n), for 1in1\leqslant i\leqslant n:

If α12α<r\frac{\alpha-1}{2\alpha}<r and Γ(n)=γ0 n2αmin{r,1}1+α2αmin{r,1}+1\Gamma(n)=\gamma_{0}\ n^{\frac{-2\alpha\min\{r,1\}-1+\alpha}{2\alpha\min\{r,1\}+1}}, γ0R21/4\gamma_{0}R^{2}\leqslant 1/4, we have:

with A=4(1+(γ0s2)1α)σ2+4(1+o(1))γ02rLKrgHLρX22A=4\left(1+{(\gamma_{0}s^{2})^{\frac{1}{\alpha}}}\right)\sigma^{2}+\frac{4(1+o(1))}{\gamma_{0}^{2r}}||L_{K}^{-r}g_{\mathcal{H}}||_{{L}^{2}_{\rho_{{X}}}}^{2}.

If 0<r<α12α0<r<\frac{\alpha-1}{2\alpha}, with Γ(n)=γ0\Gamma(n)=\gamma_{0} is constant, γ0R21/4\gamma_{0}R^{2}\leqslant 1/4, we have:

Limit conditions: Assumption (A4), gives us some kind of “position” of the objective function with respect to our reproducing kernel Hilbert space. If r1/2r\geqslant 1/2 then gHHg_{\mathcal{H}}\in\mathcal{H}. That means the regression function truly lies in the space in which we are looking for an approximation. However, it is not necessary neither to get the convergence result, which stands for any r>0r>0, nor to get the optimal rate (see definition in Section 4.2), which is also true for α12α<r<1\frac{\alpha-1}{2\alpha}<r<1 .

Evolution with rr and α\alpha: As it has been noticed above, a bigger α\alpha or rr would be a stronger assumption. It is thus natural to get a rate which improves with a bigger α\alpha or rr: the function (α,r)2αr2αr+1(\alpha,r)\mapsto\frac{2\alpha r}{2\alpha r+1} is increasing in both parameters.

The quantity o(1)o(1) in Equation (3.3) stands for (γ0s2n2α2r+1)2r1α(\gamma_{0}s^{2}n^{-2\alpha^{2}r+1})^{\frac{2r-1}{\alpha}} if r1/2r\geqslant 1/2 (0 otherwise) and is a quantity which decays to 0.

Different regions: in Figure 1(a), we plot in the plan of coordinates α,δ\alpha,\delta (with δ=2αr+1\delta=2\alpha r+1) our limit conditions concerning our assumptions, that is, r=1δ=2α+1r=1\Leftrightarrow\delta=2\alpha+1 and α12α=rα=δ\frac{\alpha-1}{2\alpha}=r\Leftrightarrow\alpha=\delta. The region between the two green lines is the region for which the optimal rate of estimation is reached. The magenta dashed lines stands for r=1/2r=1/2, which has appeared to be meaningless in our context.

The region αδα12α>r\alpha\geqslant\delta\Leftrightarrow\frac{\alpha-1}{2\alpha}>r corresponds to a situation where regularized empirical risk minimization would still be optimal, but with a regularization parameter λ\lambda that decays faster than 1/n1/n, and thus, our corresponding step-size γ=1/(nλ)\gamma=1/(n\lambda) would not be bounded as a function of nn. We thus saturate our step-size to a constant and the generalization error is dominated by the bias term.

The region α(δ1)/2r>1\alpha\leqslant(\delta-1)/2\Leftrightarrow r>1 corresponds to a situation where regularized empirical risk minimization reaches a saturating behaviour. In our stochastic approximation context, the variance term dominates.

5 Online setting

We now consider the second case when the sequence of step-sizes does not depend on the number of samples we want to use (online setting).

The computation are more tedious in such a situation so that we will only state asymptotic theorems in order to understand the similarities and differences between the finite horizon setting and the online setting, especially in terms of limit conditions.

Assume (A1-6), assume for any ii, γi=γ0iζ\gamma_{i}=\frac{\gamma_{0}}{i^{\zeta}}, γ0R21/2\gamma_{0}R^{2}\leqslant 1/2 :

If 0<r(1ζ)<10<r(1-\zeta)<1, if 0<ζ<120<\zeta<\frac{1}{2} then

The constant in the O()O(\cdot) notations only depend on γ0\gamma_{0} and α\alpha.

Theorem 3 is proved in Appendix II.4. In the first case, the main bias and variance terms are the same as in the finite horizon setting, and so is the optimal choice of ζ\zeta. However in the second case, the variance term behaviour changes: it does not decrease any more when ζ\zeta increases beyond 1/21/2. Indeed, in such a case our constant averaging procedure puts to much weight on the first iterates, thus we do not improve the variance bound by making the learning rate decrease faster. Other type of averaging, as proposed for example in , could help to improve the bound.

Moreover, this extra condition thus changes a bit the regions where we get the optimal rate (see Figure 1(b)), and we have the following corollary:

Assume (A1-6) (in this corollary, O()O(\cdot) stands for a constant depending on α,LKrgHLρX2,s,σ2,γ0\alpha,||L_{K}^{-r}g_{\mathcal{H}}||_{{L}^{2}_{\rho_{{X}}}},s,\sigma^{2},\gamma_{0} and universal constants):

If α12α<r<2α12α\frac{\alpha-1}{2\alpha}<r<\frac{2\alpha-1}{2\alpha}, with γn=γ0n2αr1+α2αr+1\gamma_{n}=\gamma_{0}n^{\frac{-2\alpha r-1+\alpha}{2\alpha r+1}} for any n1n\geqslant 1 we get the rate:

If 2α12α<r\frac{2\alpha-1}{2\alpha}<r, with γn=γ0n1/2\gamma_{n}=\gamma_{0}n^{-1/2} for any n1n\geqslant 1, we get the rate:

If 0<r<α12α0<r<\frac{\alpha-1}{2\alpha}, with γn=γ0\gamma_{n}=\gamma_{0} for any n1n\geqslant 1, we get the rate given in (3.4). Indeed the choice of a constant learning rate naturally results in an online procedure.

This corollary is directly derived from Theorem 3, balancing the two main terms. The only difference with the finite horizon setting is the shrinkage of the optimality region as the condition r<1r<1 is replaced by r<2α12α<1r<\frac{2\alpha-1}{2\alpha}<1 (see Figure 1(b)). In the next section, we relate our results to existing work.

Links with existing results

In this section, we relate our results from the previous section to existing results.

Recently, Bach and Moulines showed in that for least squares regression, averaged stochastic gradient descent achieved a rate of O(1/n)O(1/n), in a finite-dimensional Hilbert space (Euclidean space), under the same assumptions as above (except the first one of course), which is replaced by:

H\mathcal{H} is a dd-dimensional Euclidean space.

Assume (A1-f), (A2-6). Then for γ=14R2\gamma=\frac{1}{4R^{2}},

We show that we can deduce such a result from Theorem 2 (and even with comparable constants). Indeed under (A1-f) we have:

As we are in a finite-dimensional space (A4) is true for r=1/2r=1/2 as T1/2gHLρX22=gHH2||T^{-1/2}g_{\mathcal{H}}||^{2}_{\mathcal{L}^{2}_{\rho_{{X}}}}=||g_{\mathcal{H}}||^{2}_{\mathcal{H}}.

Under such remarks, the following corollary may be deduced from Theorem 2:

Assume (A1-f), (A2-6), then for any α>1\alpha>1, with γR21/4\gamma R^{2}\leqslant 1/4:

This bound is easily comparable to (4.1) and shows that our more general analysis has not lost too much. Moreover our learning rate is proportional to n12α+1n^{\frac{-1}{2\alpha+1}} with r=1/2r=1/2, so tends to behave like a constant when α\alpha\rightarrow\infty, which recovers the constant step set-up from .

Moreover, N. Flammarion proved (Personnal communication, 05/2014), using the same tpye of techniques, that their bound could be extended to:

a result that may be deduced of the following more general corollaries of our Theorem 2:

Assume (A1-f), (A2-6), and, for some q1/2q\geqslant-1/2, ΣqgHH2=Σ(q+1/2)gHLρX22<||\Sigma^{-q}{g}_{\mathcal{H}}||^{2}_{\mathcal{H}}=||\Sigma^{-(q+1/2)}{g}_{\mathcal{H}}||^{2}_{{L}^{2}_{\rho_{{X}}}}<\infty, then:

Such a result is derived from Theorem 2 and with the stronger assumption tr(Σ1/α)<\operatorname{tr}(\Sigma^{1/\alpha})<\infty clearly satisfied in finite dimension, and with r=q+1/2r=q+1/2. Note that the result above is true for all values of α1\alpha\geqslant 1 and all q1/2q\geqslant-1/2 (for the ones with infinite Σ(q+1/2)gHLρX22||\Sigma^{-(q+1/2)}{g}_{\mathcal{H}}||^{2}_{{L}^{2}_{\rho_{{X}}}}, the statement is trivial). This shows that we may take the infimum over all possible α1\alpha\leqslant 1 and q0q\geqslant 0, showing adaptivity of the estimator to the spectral decay of Σ\Sigma and the smoothness of the optimal prediction function gHg_{\mathcal{H}}.

Thus with α\alpha\rightarrow\infty, we obtain :

Assume (A1-f), (A2-6), and, for some q1/2q\geqslant-1/2, ΣqgHH2=Σ(q+1/2)gHLρX22<||\Sigma^{-q}{g}_{\mathcal{H}}||^{2}_{\mathcal{H}}=||\Sigma^{-(q+1/2)}{g}_{\mathcal{H}}||^{2}_{{L}^{2}_{\rho_{{X}}}}<\infty, then:

This final result bridges the gap between Proposition 5 (q=0q=0), and its extension (4.2) (q=1/2q=1/2). The constants 16 and 8 come from the upper bounds (a+b)2a2+b2(a+b)^{2}\leqslant a^{2}+b^{2} and 1+1/d21+1/\sqrt{d}\leqslant 2 and are thus non optimal.

Moreover, we can also derive from Corollary 5, with α=1\alpha=1, q=0q=0, and γn1/2\gamma\varpropto n^{-1/2}, we recover the rate O(n1/2)O(n^{-1/2}) (where the constant does not depend on the dimension dd of the Euclidean space). Such a rate was described, e.g., in .

Note that linking our work to the finite-dimensional setting is made using the fact that our assumption (A3) is true for any α>1\alpha>1.

2 Optimal rates of estimation

In some situations, our stochastic approximation framework leads to “optimal” rates of prediction in the following sense. In [9, Theorem 2] a minimax lower bound was given: let P(α,r)  (α>1,r[1/2,1])\mathcal{P}(\alpha,r)\ \ (\alpha>1,r\in[1/2,1]) be the set of all probability measures ρ\rho on X×Y\mathcal{X}\times\mathcal{Y}, such that:

Then the following minimax lower rate stands:

for some constant C>0C>0 where the infimum in the middle is taken over all algorithms as a map ((xi,yi)1in)gnH((x_{i},y_{i})_{1\leqslant i\leqslant n})\mapsto g_{n}\in\mathcal{H}.

When making assumptions (a3-4), the assumptions regarding the prediction problem (i.e., the optimal function gρg_{\rho}) are summarized in the decay of the components of gρg_{\rho} in an orthonormal basis, characterized by the constant δ\delta. Here, the minimax rate of estimation (see, e.g., ) is O(n1+1/δ)O(n^{-1+1/\delta}) which is the same as O\big{(}n^{-2r\alpha/(2r\alpha+1)}\big{)} with the identification δ=2αr+1\delta=2\alpha r+1.

That means the rate we get is optimal for α12α<r<1\frac{\alpha-1}{2\alpha}<r<1 in the finite horizon setting, and for α12α<r<2α12α\frac{\alpha-1}{2\alpha}<r<\frac{2\alpha-1}{2\alpha} in the online setting. This is the region between the two green lines on Figure 1.

3 Regularized stochastic approximation

It is interesting to link our results to what has been done in and in the case of regularized least-mean-squares, so that the recursion is written:

Where OκO_{\kappa} stands for a constant which depends on κ\kappa.

No assumption is made on the covariance operator beyond being trace class, but only on TrgρLρX2\|T^{-r}g_{\rho}\|_{{L}^{2}_{\rho_{{X}}}} (thus no assumption (A3)). A few remarks may be made:

They get almost-sure convergence, when we only get convergence in expectation. We could perhaps derive a.s. convergence by considering moment bounds in order to be able to derive convergence in high probability and to use Borel-Cantelli lemma.

They only assume 12r1\frac{1}{2}\leqslant r\leqslant 1, which means that they assume the regression function to lie in the RKHS.

4 Unregularized stochastic approximation

In , Ying and Pontil studied the same unregularized problem as we consider, under assumption (A4). They obtain the same rates as above (n2r/(2r+1)log(n)n^{-2r/(2r+1)}\log(n)) in both online case (with 0r120\leqslant r\leqslant\frac{1}{2}) and finite horizon setting (0<r0<r).

They led as an open problem to improve bounds with some additional information on some decay of the eigenvalues of TT, a question which is answered here.

Moreover, Zhang also studies stochastic gradient descent algorithms in an unregularized setting, also with averaging. As described in , his result is stated in the linear kernel setting but may be extended to kernels satisfying supxXK(x,x)R2\sup_{x\in\mathcal{X}}K(x,x)\leqslant R^{2}. Ying and Pontil derive from Theorem 5.2 in the following proposition:

Assume we consider the algorithm defined in Section 3.2 and output gn\overline{g}_{n} defined by equation (3.1). Assume the kernel KK satisfies supxXK(x,x)R2\sup_{x\in\mathcal{X}}K(x,x)\leqslant R^{2}. Finally assume gρg_{\rho} satisfies assumption (A4) with 0<r<1/20<r<1/2. Then in the finite horizon setting, with Γ(n)=14R2n2r2r+1\Gamma(n)=\frac{1}{4R^{2}}n^{-\frac{2r}{2r+1}}, we have:

Moreover, note that we may derive their result from Corollary 2. Indeed, using Γ(n)=γ0n2r2r+1\Gamma(n)=\gamma_{0}n^{\frac{-2r}{2r+1}}, we get a bias term which is of order n2r2r+1n^{\frac{-2r}{2r+1}} and a variance term of order n1+12rα+αn^{-1+\frac{1}{2r\alpha+\alpha}} which is smaller. Our analysis thus recovers their convergence rate with their step-size. Note that this step-size is significantly smaller than ours, and that the resulting bound is worse (but their result holds in more general settings than least-squares). See more details in Section 4.5.

5 Summary of results

All three algorithms are variants of the following:

But they are studied under different settings, concerning regularization, averaging, assumptions: we sum up in Table 1 the settings of each of these studies. For each of them, we consider the finite horizon settings, where results are generally better.

Worst-case result in rr: in the setting of assumptions (a3,4), given δ\delta, the optimal rate of convergence is known to be O(n1+1/δ)O(n^{-1+1/\delta}), where δ=2αr+1\delta=2\alpha r+1. We thus get the optimal rate, as soon as α<δ<2α+1\alpha<\delta<2\alpha+1, while the other algorithms get the suboptimal rate nδ1δ+α1n^{\frac{\delta-1}{\delta+\alpha-1}} under various conditions. Note that this sub-optimal rate becomes close to the optimal rate when α\alpha is close to one, that is, in the worst-case situation. Thus, in the worst-case (α\alpha arbitrarily close to one), all methods behave similarly, but for any particular instance where α>1\alpha>1, our rates are better.

Choice of kernel: in the setting of assumptions (a3,4), given δ\delta, in order to get the optimal rate, we may choose the kernel (i.e., α\alpha) such that α<δ<2α+1\alpha<\delta<2\alpha+1 (that is neither too big, nor too small), while other methods need to choose a kernel for which α\alpha is as close to one as possible, which may not be possible in practice.

Improved bounds: Ying and Pontil only give asymptotic bounds, while we have exact constants for the finite horizon case. Moreover there are some logarithmic terms in which disappear in our analysis.

Saturation: our method does saturate for r>1r>1, while the non-averaged framework of does not (but does not depend on the value of α\alpha). We conjecture that a proper non-uniform averaging scheme (that puts more weight on the latest iterates), we should get the best of both worlds.

Experiments on artificial data

Following , we consider synthetic examples with smoothing splines on the circle, where our assumptions (A3-4) are easily satisfied.

The simplest example to match our assumptions may be found in . We consider Y=gρ(X)+εY=g_{\rho}(X)+\varepsilon, with XU[0;1]X\sim\mathcal{U}[\,0;1] is a uniform random variable in $,and, andg_{\rho}$ in a particular RKHS (which is actually a Sobolev space).

Let H\mathcal{H} be the collection of all zero-mean periodic functions on [0;1][0;1] of the form

This means that the mm-th derivative of ff, f(m)f^{(m)} is in L2([0;1])\mathcal{L}^{2}([0\,;1]). We consider the inner product:

It is known that H\mathcal{H} is an RKHS and that the reproducing kernel Rm(s,t)R_{m}(s,t) for H\mathcal{H} is

Moreover the study of Bernoulli polynomials gives a close formula for R(s,t)R(s,t), that is:

with BmB_{m} denoting the m-th Bernoulli polynomial and {st}\{s-t\} the fractional part of sts-t .

We can derive the following proposition for the covariance operator which means that our assumption (A3) is satisfied for our algorithm in H\mathcal{H} when XU[0;1]X\sim\mathcal{U}[0;\,1], with α=2m\alpha=2m, and s=2(1/2π)ms=2(1/2\pi)^{m}.

If XU[0;1]X\sim\mathcal{U}[0;\,1], then in H\mathcal{H}:

the eigenvalues of Σ\Sigma are all of multiplicity 2 and are λi=(2πi)2m\lambda_{i}=(2\pi i)^{-2m},

the eigenfunctions are ϕic:t2cos(2πit)\phi_{i}^{c}:t\mapsto\sqrt{2}\cos(2\pi it) and ϕis:t2sin(2πit)\phi_{i}^{s}:t\mapsto\sqrt{2}\sin(2\pi it).

For ϕic\phi_{i}^{c} we have (a similar argument holds for ϕis\phi_{i}^{s}):

It is well known that (ϕic,ϕis)i0(\phi_{i}^{c},\phi_{i}^{s})_{i\geqslant 0} is an orthonormal system (the Fourier basis) of the functions in L2([0;1]){L}^{2}([\,0;1]) with zero mean, and it is easy to check that ((2iπ)mϕic,(2iπ)mϕis)i1((2i\pi)^{-m}\phi_{i}^{c},(2i\pi)^{-m}\phi_{i}^{s})_{i\geqslant 1} is an orthonormal basis of our RKHS H\mathcal{H} (this may also be seen as a consequence of the fact that T1/2T^{1/2} is an isometry). ∎

Here the eigenvectors do not depend on the kernel choice, only the re-normalisation constant depends on the choice of the kernel. Especially the eigenbasis of TT in LρX2{L}^{2}_{\rho_{{X}}} does not depend on mm. That can be linked with the previous remarks made in Section 4.

Assumption (A3) defines here the size of the RKHS: the smaller α=2m\alpha=2m is, the bigger the space is, the harder it is to learn a function.

In the next section, we illustrate on such a toy model our main results and compare our learning algorithm to Ying and Pontil’s , Tarrès and Yao’s and Zhang’s algorithms.

2 Experimental set-up

We use gρ(x)=Bδ/2(x)g_{\rho}(x)=B_{\delta/2}(x) with δ=2αr+1\delta=2\alpha r+1, as proposed above, with B1(x)=x12B_{1}(x)=x-\frac{1}{2}, B2(x)=x2x+16B_{2}(x)=x^{2}-x+\frac{1}{6} and B3(x)=x332x2+12xB_{3}(x)=x^{3}-\frac{3}{2}x^{2}+\frac{1}{2}x.

We give in Figure 2 the functions used for simulations in a few cases that span our three regions. We also remind the choice of γ\gamma proposed for the 4 algorithms. We always use the finite horizon setting.

3 Optimal learning rate for our algorithm

4 Comparison to competing algorithms

In this section, we compare the convergence rates of the four algorithms described in Section 4.5. We consider the different choices of (r,α)(r,\alpha) as described in Table 2 in order to go all over the different optimality situations. The main properties of each algorithm are described in Table 1. However we may note:

For our algorithm, Γ(n)\Gamma(n) is chosen accordingly with Corollary 2, with γ0=1R2\gamma_{0}=\frac{1}{R^{2}}.

For Ying and Pontil’s algorithm, accordingly to Theorem 6 in , we consider Γ(n)=γ0n2r2r+1\Gamma(n)=\gamma_{0}n^{-\frac{2r}{2r+1}}. We choose γ0=1R2\gamma_{0}=\frac{1}{R^{2}} which behaves better than the proposed r64(1+R4)(2r+1)\frac{r}{64(1+R^{4})(2r+1)}.

For Tarrès and Yao’s algorithm, we refer to Theorem C in , and consider Γ(n)=a(n0+n)2r2r+1\Gamma(n)=a\left({n_{0}+n}\right)^{-\frac{2r}{2r+1}} and Λ(n)=1a(n0+n)12r+1\Lambda(n)=\frac{1}{a}\left({n_{0}+n}\right)^{-\frac{1}{2r+1}}. The theorem is stated for all a4a\geqslant 4: we choose a=4a=4.

For Zhangl’s algorithm, we refer to Part 2.2 in , and choose Γ(n)=γ0n2r2r+1\Gamma(n)=\gamma_{0}n^{-\frac{2r}{2r+1}} with γ0=1R2\gamma_{0}=\frac{1}{R^{2}} which behaves better than the proposed choice 14(1+R2)\frac{1}{4(1+R^{2})}.

Finally, we sum up the rates that were both predicted and derived for the four algorithms in the four cases for (α,δ)(\alpha,\delta) in Table 3. It appears that (a) we approximatively match the predicted rates in most cases (they would if nn was larger), (b) our rates improve on existing work.

Conclusion

In this paper, we have provided an analysis of averaged unregularized stochastic gradient methods for kernel-based least-squares regression. Our novel analysis allowed us to consider larger step-sizes, which in turn lead to optimal estimation rates for many settings of eigenvalue decay of the covariance operators and smoothness of the optimal prediction function. Moreover, we have worked on a more general setting than previous work, that includes most interesting cases of positive definite kernels.

Our work can be extended in a number of interesting ways: First, (a) we have considered results in expectation; following the higher-order moment bounds from in the Euclidean case, we could consider higher-order moments, which in turn could lead to high-probability results or almost-sure convergence. Moreover, (b) while we obtain optimal convergence rates for a particular regime of kernels/objective functions, using different types of averaging (i.e., non uniform) may lead to optimal rates in other regimes. Besides, (c) following , we could extend our results for infinite-dimensional least-squares regression to other smooth loss functions, such as for logistic regression, where an online Newton algorithm with the same running-time complexity would also lead to optimal convergence rates. Also, (d) the running-time complexity of our stochastic approximation procedures is still quadratic in the number of samples nn, which is unsatisfactory when nn is large; by considering reduced set-methods , we hope to able to obtain a complexity of O(dnn)O(d_{n}n), where dnd_{n} is such that the convergence rate is O(dn/n)O(d_{n}/n), which would extend the Euclidean space result, where dnd_{n} is constant equal to the dimension. Finally, (e) in order to obtain the optimal rates when the bias term dominates our generalization bounds, it would be interesting to combine our spectral analysis with recent accelerated versions of stochastic gradient descent which have been analyzed in the finite-dimensional setting .

Acknowledgements

This work was partially supported by the European Research Council (SIERRA Project). We thank Nicolas Flammarion for helpful discussions.

Appendix A Minimal assumptions

We first define the set of square ρX\rho_{X}-integrable functions LρX2\mathcal{L}^{2}_{\rho_{{X}}}:

we will always make the assumptions that this space is separable (this is the case in most interesting situations. See for more details.) LρX2{L}^{2}_{\rho_{{X}}} is its quotient under the equivalence relation given by

which makes it a separable Hilbert space (see, e.g., ).

A.2 Isomorphism

As it has been explained in the main text, the minimization problem will appear to be an approximation problem in LρX2\mathcal{L}^{2}_{\rho_{{X}}}, for which we will build estimates in H\mathcal{H}. However, to derive theoretical results, it is easier to consider it as an approximation problem in the Hilbert space LρX2{L}^{2}_{\rho_{{X}}}, building estimates in p(H)p(\mathcal{H}).

We thus need to define a notion of the best estimation in p(H)p(\mathcal{H}). We first define the closure F\overline{{F}} (with respect to LρX2\|\cdot\|_{{L}^{2}_{\rho_{{X}}}}) of any set FLρX2F\subset{L}^{2}_{\rho_{{X}}} as the set of limits of sequences in F{F}. The space p(H)\overline{p(\mathcal{H})} is a closed and convex subset in LρX2{L}^{2}_{\rho_{{X}}}. We can thus define gH=argminfp(H)ε(g)g_{\mathcal{H}}=\arg\min_{f\in\,\overline{p(\mathcal{H})}}\varepsilon(g), as the orthogonal projection of gρg_{\rho} on p(H)\overline{p(\mathcal{H})}, using the existence of the projection on any closed convex set in a Hilbert space. See Proposition 8 in Appendix A for details.

Assume (A1-2). The minimum of ε(f)\varepsilon(f) in p(H)\overline{p(\mathcal{H})} is attained at a certain gHg_{\mathcal{H}} (which is unique and well defined in LρX2{L}^{2}_{\rho_{{X}}}).

Where \overline{{p(\mathcal{H})}}=\Big{\{}f\in{L}^{2}_{\rho_{{X}}}\ /\ \exists(f_{n})\subset{{p(\mathcal{H})}},\|f_{n}-f\|_{{L}^{2}_{\rho_{{X}}}}\rightarrow 0\Big{\}} is the set of functions ff for which we can hope for consistency, i.e., having a sequence (fn)n(f_{n})_{n} of estimators in H\mathcal{H} such that ε(fn)ε(f)\varepsilon(f_{n})\rightarrow\varepsilon(f).

The properties of our estimator, especially its rate of convergence will strongly depend on some properties of both the kernel, the objective function and the distributions, which may be seen through the properties of the covariance operator which is defined in the main text. We have defined the covariance operator, Σ:HH\Sigma:\mathcal{H}\rightarrow\mathcal{H}. In the following, we extend such an operator as an endomorphism T\mathcal{T} from LρX2{L}^{2}_{\rho_{{X}}} to LρX2\mathcal{L}^{2}_{\rho_{{X}}} and by projection as an endomorphism T=pTT=p\circ\mathcal{T} from LρX2{L}^{2}_{\rho_{{X}}} to LρX2{L}^{2}_{\rho_{{X}}}. Note that T\mathcal{T} is well defined as Xg(t) Kt dρX(t)\int_{\mathcal{X}}g(t)\ K_{t}\ d\rho_{\mathcal{X}}(t) does not depend on the function gg chosen in the class of equivalence of gg.

Assume (A1-2). We define the operator T\mathcal{T} as follows (this expectation is formally defined as a Bochner expectation in H\mathcal{H}.):

A first important remark is that Σf=0\Sigma f=0 implies f,Σf=fLρX22=0\langle f,\Sigma f\rangle=\|f\|_{{L}^{2}_{\rho_{{X}}}}^{2}=0, that is p(Ker(Σ))={0}p(\operatorname{Ker}(\Sigma))=\{0\}. However, Σ\Sigma may not be injective (unless fLρX22f=0\|f\|_{{L}^{2}_{\rho_{{X}}}}^{2}\Rightarrow f=0, which is true when ff is continuous and ρX\rho_{X} has full support). Σ\Sigma and T\mathcal{T} may independently be injective or not.

The operator TT (which is an endomorphism of the separable Hilbert space LρX2{L}^{2}_{\rho_{{X}}}) can be reduced in some Hilbertian eigenbasis of LρX2{L}^{2}_{\rho_{{X}}}. The linear operator T\mathcal{T} happens to have an image included in H\mathcal{H}, and the eigenbasis of TT in LρX2{L}^{2}_{\rho_{{X}}} may also be seen as eigenbasis of Σ\Sigma in H\mathcal{H} (See proof in Appendix I.2, Proposition 18):

Assume (A1-2). The image of T\mathcal{T} is included in H\mathcal{H}: Im(T)H\text{Im}(\mathcal{T})\subset\mathcal{H}, that is, for any fLρX2f\in{L}^{2}_{\rho_{{X}}}, TfH\mathcal{T}f\in\mathcal{H}. Moreover, for any iIi\in I, ϕiH=1μiTϕiHLρX2\phi_{i}^{H}=\frac{1}{\mu_{i}}\mathcal{T}\phi_{i}\in\mathcal{H}\subset\mathcal{L}^{2}_{\rho_{{X}}} is a representant for the equivalence class ϕi\phi_{i}, that is p(ϕiH)=ϕip(\phi_{i}^{H})=\phi_{i}. Moreover μi1/2ϕiH\mu_{i}^{1/2}\phi_{i}^{H} is an orthonormal eigen-system of the orthogonal supplement S\mathscr{S} of the null space Ker(Σ)\text{Ker}(\Sigma). That is:

iI, ΣϕiH=μiϕiH\forall i\in I,\ \Sigma\phi_{i}^{H}=\mu_{i}\phi_{i}^{H}.

H=Ker(Σ)S\mathcal{H}=\operatorname{Ker}(\Sigma)\overset{\perp}{\oplus}\mathscr{S}.

Such decompositions allow to define Tr:LρX2H\mathcal{T}^{r}:{L}^{2}_{\rho_{{X}}}\rightarrow\mathcal{H} for r1/2r\geqslant 1/2. Indeed , completeness allows to define infinite sums which satisfy a Cauchy criterion. See proof in Appendix I.2, Proposition 19. Note the different condition concerning rr in the definitions. For r1/2r\geqslant 1/2, Tr=pTrT^{r}=p\circ\mathcal{T}^{r}. We need r1/2r\geqslant 1/2, because (μi1/2ϕH)(\mu_{i}^{1/2}\phi^{H}) is an orthonormal system of S\mathscr{S}.

We define, for any r1/2r\geqslant 1/2, Tr:LρX2H\mathcal{T}^{r}:{L}^{2}_{\rho_{{X}}}\rightarrow\mathcal{H}, for any hKer(T)h\in\operatorname{Ker}(T) and (ai)iI(a_{i})_{i\in I} such that iIai2<\sum_{i\in I}a_{i}^{2}<\infty, through:

We have two decompositions of LρX2=Ker(T)S{L}^{2}_{\rho_{{X}}}=\operatorname{Ker}(T)\overset{\perp}{\oplus}S and H=Ker(Σ)S\mathcal{H}=\operatorname{Ker}(\Sigma)\overset{\perp}{\oplus}\mathscr{S}. The two orthogonal supplements SS and S\mathscr{S} happen to be related through the mapping T1/2\mathcal{T}^{1/2}, as stated in Proposition 4: T1/2\mathcal{T}^{1/2} is an isomorphism from SS into S\mathscr{S}. It also has he following consequences, which generalizes Corollary 1:

T1/2(S)=p(H)T^{1/2}(S)=p(\mathcal{H}), that is any element of p(H)p(\mathcal{H}) may be expressed as T1/2gT^{1/2}g for some gLρX2g\in{L}^{2}_{\rho_{{X}}}.

For any r1/2, Tr(S)Hr\geqslant 1/2,\ T^{r}(S)\subset\mathcal{H}, because Tr(S)T1/2(S)T^{r}(S)\subset T^{1/2}(S), that is, with large powers rr, the image of TrT^{r} is in the projection of the Hilbert space.

r>0, Tr(LρX2)=S=T1/2(LρX2)=H\forall r>0,\ \overline{T^{r}({L}^{2}_{\rho_{{X}}})}=S=\overline{T^{1/2}({L}^{2}_{\rho_{{X}}})}=\overline{\mathcal{H}}, because (a) T1/2(LρX2)=p(H)T^{1/2}({L}^{2}_{\rho_{{X}}})={p(\mathcal{H})} and (b) for any r>0r>0, Tr(LρX2)=S\overline{T^{r}({L}^{2}_{\rho_{{X}}})}=S. In other words, elements of p(H)\overline{p(\mathcal{H})} (on which our minimization problem attains its minimum), may seen as limits (in LρX2{L}^{2}_{\rho_{{X}}}) of elements of Tr(LρX2)T^{r}({L}^{2}_{\rho_{{X}}}), for any r>0r>0.

p(H)p(\mathcal{H}) is dense in LρX2{L}^{2}_{\rho_{{X}}} if and only if TT is injective.

A.3 Mercer theorem generalized

Finally, although we will not use it in the rest of the paper, we can state a version of Mercer’s theorem, which does not make any more assumptions that are required for defining RKHSs.

Assume (A1-2). We have for all x,yXx,y\in\mathcal{X},

and we have for all xXx\in\mathcal{X}, Xg(x,y)2dρX(y)=0\int_{\mathcal{X}}g(x,y)^{2}d\rho_{X}(y)=0. Moreover, the convergence of the series is absolute.

We thus obtain a version of Mercer’s theorem (see Appendix I.5.3) without any topological assumptions. Moreover, note that (a) S\mathscr{S} is also an RKHS, with kernel (x,y)iIμiϕiH(x)ϕiH(y)(x,y)\mapsto\sum_{i\in I}\mu_{i}\phi_{i}^{H}(x)\phi_{i}^{H}(y) and (b) that given the decomposition above, the optimization problem in S\mathscr{S} and H\mathcal{H} have equivalent solutions. Moreover, considering the algorithm below, the estimators we consider will almost surely build equivalent functions (see Appendix I.4). Thus, we could assume without loss of generality that the kernel KK is exactly equal to its expansion iIμiϕiH(x)ϕiH(y)\sum_{i\in I}\mu_{i}\phi_{i}^{H}(x)\phi_{i}^{H}(y).

A.4 Complementary (A6) assumption

Under minimal assumptions, we also have to make a complementary moment assumption :

Appendix B Sketch of the proofs

Our main theorems are Theorem 2 and Theorem 3, respectively in the finite horizon and in the online setting. Corollaries can be easily derived by optimizing over γ\gamma the upper bound given in the theorem.

The complete proof is given in Appendix II. The proof is nearly the same for finite horizon and online setting. It relies on a refined analysis of strongly related recursions in the RKHS and on a comparison between iterates of the recursions (controlling the deviations).

We first present the sketch of the proof for the finite-horizon setting : We want to analyze the error of our sequence of estimators (gn)(g_{n}) such that g0=0g_{0}=0 and

Where we have denoted Ξn=(yngH(xn))Kxn\Xi_{n}=(y_{n}-g_{\mathcal{H}}(x_{n}))K_{x_{n}} the residual, which has 0 mean, and KxnKxn~:LρX2H\widetilde{K_{x_{n}}\otimes K_{x_{n}}}:{L}^{2}_{\rho_{{X}}}\rightarrow\mathcal{H} an a.s. defined extension of KxnKxn:HHK_{x_{n}}\otimes K_{x_{n}}:\mathcal{H}\rightarrow\mathcal{H}, such that KxnKxn~(f)=f(xn)Kxn\widetilde{K_{x_{n}}\otimes K_{x_{n}}}(f)=f(x_{n})K_{x_{n}}, that will be denoted for simplicity KxnKxnK_{x_{n}}\otimes K_{x_{n}} in this section.

Finally, we are studying a sequence (ηn)n=(gngH)n({\eta}_{n})_{n}=(g_{n}-g_{\mathcal{H}})_{n} defined by:

We first consider splitting this recursion in two simpler recursions ηninit{\eta}_{n}^{init} and ηnnoise{\eta}_{n}^{noise} such that ηn=ηninit+ηnnoise{\eta}_{n}={\eta}_{n}^{init}+{\eta}_{n}^{noise}:

ηninit{\eta}_{n}^{init} is the part of (ηn)n({\eta}_{n})_{n} which is due to the initial conditions ( it is equivalent to assuming Ξn0\Xi_{n}\equiv 0).

Respectively, let (ηnnoise)n({\eta}_{n}^{noise})_{n} be defined by :

ηnnoise{\eta}_{n}^{noise} is the part of (ηn)n({\eta}_{n})_{n} which is due to the noise.

We will bound ηn\|\eta_{n}\| by ηninit+ηnnoise\|{\eta}_{n}^{init}\|+\|{\eta}_{n}^{noise}\| using Minkowski’s inequality. That is how the bias-variance trade-off originally appears.

An algebraic calculation gives an estimate of the norm of ηn0,init\eta_{n}^{0,init}, and we can also bound the residual term ηninitηn0,init\eta_{n}^{init}-\eta_{n}^{0,init}, then conclude by Minkowski.

For the variance term: We follow the exact same idea, but have to define a sequence of “semi-stochastic recursion”, to be able to bound the residual term.

This decomposition is summed up in Table 4.

For the online setting, we follow comparable ideas and end in a similar decomposition.

In Appendix I, we provide proofs of the propositions from Section 2 that provide the Hilbert space set-up for kernel-based learning, while in Appendix II, we prove convergence rates for the least-mean-squares algorithm.

Appendix I Reproducing kernel Hilbert spaces

In this appendix, we provide proofs of the results from Section 2 that provide the RHKS space set-up for kernel-based learning. See for further properties of RKHSs.

We consider a reproducing kernel Hilbert space H\mathcal{H} with kernel KK on space X\mathcal{X} as defined in Section 2.1. Unless explicitly mentioned, we do not make any topological assumption on X\mathcal{X}.

H\mathcal{H} is a separable RKHS associated with kernel KK on a space X\mathcal{X}.

If supxXK(x,x)<\sup_{x\in\mathcal{X}}K(x,x)<\infty, then any function in H\mathcal{H} is bounded.

Under such condition, by Cauchy-Schwartz inequality, any function fHf\in\mathcal{H} is either bounded or integrable:

We are interested in minimizing the following quantity, which is the prediction error of a function ff, which may be rewritten as follows with dot-products in LρX2{L}^{2}_{\rho_{{X}}}:

Notice that the problem may be re-written, if ff is in H\mathcal{H}, with dot-products in H\mathcal{H}:

Interpretation: Under the form (I.1), it appears to be a minimisation problem in a Hilbert space of the sum of a continuous coercive function and a linear one. Using Lax-Milgramm and Stampachia theorems we can conclude with the following proposition, which implies Prop. 8 in Section 2:

Assume (A1-2). We have the following points:

There exists a unique minimizer over the space LρX2{L}^{2}_{\rho_{{X}}}. This minimizer is the regression function gρ:xYydρYX=x(y)g_{\rho}:x\mapsto\int_{\mathcal{Y}}yd\rho_{Y|X=x}(y) (Lax-Milgramm).

For any non empty closed convex set, there exists a unique minimizer (Stampachia). As a consequence, there exists a unique minimizer:

over p(H)\overline{p(\mathcal{H})}. gHg_{\mathcal{H}} is the orthogonal projection over gρg_{\rho} over p(H)\overline{p(\mathcal{H})}, thus satisfies the following equality: for any εH\varepsilon\in\overline{H}:

I.2 Covariance Operator

We defined operators Σ,T,T\Sigma,\mathcal{T},T in Section 2.4. We here state the main properties of these operators, then prove the two main decompositions stated in Propositions 1 and 9.

Ker(Σ)={fH s.t. fLρX2=0}\operatorname{Ker}(\Sigma)=\{f\in\mathcal{H}\text{ s.t. }\|f\|_{{L}^{2}_{\rho_{{X}}}}=0\}. Actually for any fHf\in\mathcal{H}, f,ΣfK=fLρX2\langle f,\Sigma f\rangle_{K}=\|f\|_{{L}^{2}_{\rho_{{X}}}}.

for any xXx\in\mathcal{X}, f(x)Kxf(x)K_{x} is in H\mathcal{H}. To show that the integral xXf(x)Kx\int_{x\in\mathcal{X}}f(x)K_{x} is converging, it is sufficient to show the is is absolutely converging in H\mathcal{H}, as absolute convergence implies convergence in any Banach spaceA Banch space is a linear normed space which is complete for the distance derived from the norm. (thus any Hilbert space). Moreover:

which proves the continuity under assumption (A2).

It is clear that Σf,g=f,Σg\langle\Sigma f,g\rangle=\langle f,\Sigma g\rangle.

Assume (A1-2). T\mathcal{T} satisfies the following properties:

T\mathcal{T} is a well defined, continuous operator.

The image of TT is a subspace of H\mathcal{H}.

We now state here a simple lemma that will be useful later:

Assume (A1-2). TT satisfies the following properties:

TT is a well defined, continuous operator.

The image of TT is a subspace of p(H)p(\mathcal{H}).

TT is a self-adjoint semi definite positive operator in the Hilbert space LρX2{L}^{2}_{\rho_{{X}}}.

T=pTT=p\circ\mathcal{T} is clearly well defined, using the arguments given above. Moreover:

and f,TfLρX20\langle f,Tf\rangle_{\mathcal{L}^{2}_{\rho_{{X}}}}\geq 0 as a generalisation of the positive definite property of KK. ∎

In order to show the existence of an eigenbasis for TT, we now show that TT is trace-class.

Under (A2), TT is a trace class operatorMimicking the definition for matrices, a bounded linear operator AA over a separable Hilbert space HH is said to be in the trace class if for some (and hence all) orthonormal bases (ek)k(e_{k})_{k} of HH the sum of positive terms trA:=k(AA)1/2ek,ek{\operatorname{tr}}|A|:=\sum_{k}\langle(A^{*}A)^{1/2}\,e_{k},e_{k}\rangle is finite.. As a consequence, it is also a Hilbert-Schmidt operatorA Hilbert-Schmidt operator is a bounded operator AA on a Hilbert space HH with finite Hilbert–Schmidt norm: AHS2=tr(AA):=iIAei2\|A\|^{2}_{\text{HS}}={\operatorname{tr}}|(A^{{}^{*}}A)|:=\sum_{i\in I}\|Ae_{i}\|^{2}..

If KL2(ρX×ρX)K\in L^{2}(\rho_{X}\times\rho_{X}) then TT is a Hilbert-Schmidt operator.

Any Hilbert-Schmidt operator is a compact operator.

Proofs of such facts may be found in . Formally, with  (ϕi)i\ (\phi_{i})_{i} an Hilbertian basis in LρX2{L}^{2}_{\rho_{{X}}}:

We have thus proved that under (A1) and (A2), the operator TT may be reduced in some Hilbertian eigenbasis: the fact that TT is self-adjoint and compact implies the existence of an orthonormal eigensystem (which is an Hilbertian basis of LρX2{L}^{2}_{\rho_{X}}).

This is a consequence of a very classical result, see for example .

The null space Ker(T):={fLρX2 s.t. Tf=0}\text{Ker}(T):=\left\{f\in{L}^{2}_{\rho_{{X}}}\text{ s.t. }Tf=0\right\} may not be {0}\{0\}. We denote by SS an orthogonal supplementary of Ker(T)\text{Ker}(T).

Proposition 1 is directly derived from a slightly more complete Proposition 17 below:

iI, Tϕi=μiϕi\forall i\in I,\ T\phi_{i}=\mu_{i}\phi_{i}, (μi)i(\mu_{i})_{i} strictly positive non increasing (or finite) sequence such that iIμi<\sum_{i\in I}\mu_{i}<\infty.

LρX2=Ker(T)S.{L}^{2}_{\rho_{{X}}}=\operatorname{Ker}(T)\overset{\perp}{\oplus}S.

We haveWe denote by  span(A)\text{ span}(A) the smallest linear space which contains AA, which is in such a case the set of all finite linear combinations of (ϕi)iI(\phi_{i})_{i\in I}.: S=span{ϕi}={i=1aiϕi s.t. i=1ai2<}.\displaystyle S=\overline{\text{span}\{\phi_{i}\}}=\left\{\sum_{i=1}^{\infty}a_{i}\phi_{i}\text{ s.t. }\sum_{i=1}^{\infty}a_{i}^{2}<\infty\right\}. Moreover:

For any iIi\in I, ϕi=1μiLKϕip(H)\phi_{i}=\frac{1}{\mu_{i}}L_{K}\phi_{i}\in p(\mathcal{H}). Thus  span{ϕi}p(H)\text{ span}\left\{{\phi_{i}}\right\}\subset p(\mathcal{H}), thus S= span{ϕi}p(H)S=\overline{\text{ span}\left\{{\phi_{i}}\right\}}\subset\overline{p(\mathcal{H})}. Moreover, using the following Lemma, p(H)Ker(T)=Sp(\mathcal{H})\subset\operatorname{Ker}(T)^{\perp}=S, which concludes the proof, by taking the closures. ∎

if T1/2f=0T^{1/2}f=0 in LρX2{L}^{2}_{\rho_{{X}}}, then Tf=0Tf=0 in H\mathcal{H}.

p(H)Ker(T)p(\mathcal{H})\subset\operatorname{Ker}(T)^{\perp}.

We first notice that if T1/2f=0T^{1/2}f=0 in LρX2{L}^{2}_{\rho_{{X}}}, then Tf=0\mathcal{T}f=0 in H\mathcal{H}: indeedIn other words, we the operator defined below T1/2T^{1/2} T1/2f\displaystyle T^{1/2}f =LρX2\displaystyle=_{{L}^{2}_{\rho_{{X}}}} 0\displaystyle 0  Tf\displaystyle\ \mathcal{T}f =H\displaystyle=_{\mathcal{H}} Σ1/2( T1/2f)\displaystyle\Sigma^{1/2}(\ \mathcal{T}^{1/2}f)  TfK2\displaystyle\|\ \mathcal{T}f\|^{2}_{K} =\displaystyle= Σ1/2( T1/2f)K2=( T1/2f)LρX22=0\displaystyle\|\Sigma^{1/2}(\ \mathcal{T}^{1/2}f)\|_{K}^{2}=\|(\ \mathcal{T}^{1/2}f)\|_{{L}^{2}_{\rho_{{X}}}}^{2}=0  H ⁣Tf\ {}^{H}\!Tf =H\displaystyle=_{\mathcal{H}} 0.\displaystyle 0.

Moreover H\mathcal{H} is the completed space of  span {Kx,xX}\text{ span }\{K_{x},x\in\mathcal{X}\}, with respect to K\|\cdot\|_{K} and for all xXx\in\mathcal{X}, for all ψkKer(T)\psi_{k}\in\operatorname{Ker}(T):

Similarly, Proposition 9 is derived from Proposition 18 below:

Under (A1) and (A2), Im(T)H\operatorname{Im}(\mathcal{T})\subset\mathcal{H}, that is, for any fLρX2f\in{L}^{2}_{\rho_{{X}}}, TfH\mathcal{T}f\in\mathcal{H}. Moreover, for any iIi\in I, ϕiH=1μiTϕiH\phi_{i}^{H}=\frac{1}{\mu_{i}}\mathcal{T}\phi_{i}\in H is a representant for the equivalence class ϕi\phi_{i}. Moreover (μi1/2ϕiH)iI\left(\mu_{i}^{1/2}\phi_{i}^{H}\right)_{i\in I} is an orthonormal eigein-system of S\mathscr{S} That is:

iI, ΣϕiH=μiϕiH\forall i\in I,\ \Sigma\phi_{i}^{H}=\mu_{i}\phi_{i}^{H}.

(μi1/2ϕiH)iI\left(\mu_{i}^{1/2}\phi_{i}^{H}\right)_{i\in I} is an orthonormal family in S\mathscr{S}.

Moreover S\mathscr{S} is the orthogonal supplement of the null space Ker(Σ)\text{Ker}(\Sigma):

The family ϕiH=1μiTϕi\phi^{H}_{i}=\frac{1}{\mu_{i}}T\phi_{i} satisfies:

ϕiH~=ϕi\widetilde{\phi^{H}_{i}}=\phi_{i} (in LρX2{L}^{2}_{\rho_{{X}}}),

TϕiH=μiϕiT\phi_{i}^{H}=\mu_{i}\phi_{i} in LρX2{L}^{2}_{\rho_{{X}}},

TϕiH=ΣϕiH=μiϕiH\mathcal{T}\phi_{i}^{H}=\Sigma\phi_{i}^{H}=\mu_{i}\phi_{i}^{H} in H\mathcal{H}.

All the points are clear: indeed for example ΣϕiH=Tϕi=μiϕiH\Sigma\phi_{i}^{H}=T\phi_{i}=\mu_{i}\phi_{i}^{H}. Moreover, we have that:

That means that (μiϕiH)i(\sqrt{\mu_{i}}\phi_{i}^{H})_{i} is an orthonormal family in H\mathcal{H}.

Moreover, S\mathscr{S} is defined as the completion for K\|\cdot\|_{K} of this orthonormal family, which gives S={iIaiϕiH s.t. iIai2μi<}.\mathscr{S}=\left\{\sum_{i\in I}a_{i}\phi_{i}^{H}\text{ s.t. }\sum_{i\in I}\frac{a_{i}^{2}}{\mu_{i}}<\infty\right\}.

To show that H=Ker(Σ)S,\mathcal{H}=\operatorname{Ker}(\Sigma)\overset{\perp}{\oplus}\mathscr{S}, we use the following sequence of arguments:

First, as Σ\Sigma is a continuous operator, Ker(Σ)\operatorname{Ker}(\Sigma) is a closed space in H\mathcal{H}, thus H=Ker(Σ)(Ker(Σ))\mathcal{H}=\operatorname{Ker}(\Sigma)\overset{\perp}{\oplus}\left(\operatorname{Ker}(\Sigma)\right)^{\perp}.

Ker(Σ)(T1/2(S))\operatorname{Ker}(\Sigma)\subset(\mathcal{T}^{1/2}(S))^{\perp}: indeed for all fKer(Σ)f\in\operatorname{Ker}(\Sigma), f,ϕiH=1μif,ΣϕiH=1μiΣf,ϕiH=0\langle f,\phi_{i}^{\mathcal{H}}\rangle=\frac{1}{\mu_{i}}\langle f,\Sigma\phi_{i}^{\mathcal{H}}\rangle=\frac{1}{\mu_{i}}\Sigma\langle f,\phi_{i}^{\mathcal{H}}\rangle=0, and as a consequence for any fKer(Σ),gT1/2(S)f\in\operatorname{Ker}(\Sigma),g\in\mathcal{T}^{1/2}(S), there exists (gn) span(ϕiH) s.t. gnHg(g_{n})\subset\text{ span}(\phi_{i}^{H})\text{ s.t. }g_{n}\overset{\mathcal{H}}{\rightarrow}g, thus 0=gn,fHf,g0=\langle g_{n},f\rangle_{\mathcal{H}}\rightarrow\langle f,g\rangle and finally f(T1/2(S))f\in(\mathcal{T}^{1/2}(S))^{\perp}. Equivalently T1/2(S)(Ker(Σ))\mathcal{T}^{1/2}(S)\subset\left(\operatorname{Ker}(\Sigma)\right)^{\perp}.

(T1/2(S))Ker(Σ)(\mathcal{T}^{1/2}(S))^{\perp}\subset\operatorname{Ker}(\Sigma). For any ii, ϕiHT1/2(S)\phi_{i}^{H}\in\mathcal{T}^{1/2}(S). If f(T1/2(S))f\in(\mathcal{T}^{1/2}(S))^{\perp}, then p(f),ϕiLρX2=f,TϕiH=0.\langle p(f),\phi_{i}\rangle_{{L}^{2}_{\rho_{{X}}}}=\langle f,\mathcal{T}\phi_{i}\rangle_{\mathcal{H}}=0. As a consequence p(f)p(H)Ker(T)={0}p(f)\in p(\mathcal{H})\cap\operatorname{Ker}(T)=\left\{0\right\}, thus fKer(Σ)f\in\operatorname{Ker}(\Sigma). That is (T1/2(S))Ker(Σ)(\mathcal{T}^{1/2}(S))^{\perp}\subset\operatorname{Ker}(\Sigma). Equivalently Ker(Σ)(T1/2(S))\operatorname{Ker}(\Sigma)^{\perp}\subset(\mathcal{T}^{1/2}(S)).

Combining these points: H=Ker(Σ)S.\mathcal{H}=\operatorname{Ker}(\Sigma)\overset{\perp}{\oplus}\mathscr{S}.

We have two decompositions of LρX2=Ker(T)S\mathcal{L}^{2}_{\rho_{{X}}}=\operatorname{Ker}(T)\overset{\perp}{\oplus}S and H=Ker(Σ)S\mathcal{H}=\operatorname{Ker}(\Sigma)\overset{\perp}{\oplus}\mathscr{S}. They happen to be related through the mapping T1/2\mathcal{T}^{1/2}, which we now define.

We defined operators TrT^{r}, r>0r>0 and Tr\mathcal{T}^{r}, r1/2r\geqslant 1/2 in Section 2.4 in Definitions 2,4.

Tr\mathcal{T}^{r} is well defined for any r12r\geqslant\frac{1}{2}.

T1/2:SS\mathcal{T}^{1/2}:S\rightarrow\mathscr{S} is an isometry.

Moreover Im(T1/2)=p(H)\operatorname{Im}(T^{1/2})=p(\mathcal{H}). That means T1/2:Sp(H)T^{1/2}:S\rightarrow p(\mathcal{H}) is an isomorphism.

S={i=1aiϕi s.t. i=1ai2<}S=\left\{\sum_{i=1}^{\infty}a_{i}\phi_{i}\text{ s.t. }\sum_{i=1}^{\infty}a_{i}^{2}<\infty\right\}. For any sequence (ai)iI(a_{i})_{i\in I} such that i=1ai2<\sum_{i=1}^{\infty}a_{i}^{2}<\infty, Tr(aiϕi)=iμiraiϕiT^{r}(\sum a_{i}\phi_{i})=\sum_{i}\mu_{i}^{r}a_{i}\phi_{i} is a converging sum in the Hilbert space LρX2{L}^{2}_{\rho_{{X}}} (as (μi)iI(\mu_{i})_{i\in I} is bounded thus iμiraiϕi\sum_{i}\mu_{i}^{r}a_{i}\phi_{i} satisfies Cauchy is criterion: i=npμiraiϕi2μ0r(i=npai2)1/2\|\sum_{i=n}^{p}\mu_{i}^{r}a_{i}\phi_{i}\|^{2}\leqslant\mu_{0}^{r}\left(\sum_{i=n}^{p}a_{i}^{2}\right)^{1/2}). And Cauchy is criterion implies convergence in Hilbert spaces.

Tr\mathcal{T}^{r} is well defined for any r12r\geqslant\frac{1}{2}. We have shown that (μiϕiH)i(\sqrt{\mu_{i}}\phi_{i}^{H})_{i} is an orthonormal family in H\mathcal{H}. As a consequence (using the fact that (μi)(\mu_{i}) is a bounded sequence), for any sequence (ai)i(a_{i})_{i} such that ai2<\sum a_{i}^{2}<\infty, iμiraiϕiH\sum_{i}\mu_{i}^{r}a_{i}\phi^{H}_{i} satisfies Cauchy is criterion thus is converging in H\mathcal{H} as iIμiraiϕiHK=iIμir1/2ai2μ0r1/2iIai2<\|\sum_{i\in I^{\prime}}\mu_{i}^{r}a_{i}\phi^{H}_{i}\|_{K}=\sum_{i\in I^{\prime}}\mu_{i}^{r-1/2}a_{i}^{2}\leqslant\mu_{0}^{r-1/2}\sum_{i\in I^{\prime}}a_{i}^{2}<\infty. (We need r1/2r\geqslant 1/2 of course).

T1/2:SS\mathcal{T}^{1/2}:S\rightarrow\mathscr{S} is an isometry. Definition has been proved. Surjectivity in S\mathscr{S} is by definition, as T1/2(S)={iIaiϕiH s.t. iIai2μi<}.\mathcal{T}^{1/2}(S)=\left\{\sum_{i\in I}a_{i}\phi_{i}^{H}\text{ s.t. }\sum_{i\in I}\frac{a_{i}^{2}}{\mu_{i}}<\infty\right\}. Moreover, the operator is clearly injective as for any fSf\in S, Tf0Tf\neq 0 in LρX2{L}^{2}_{\rho_{{X}}} thus Tf0Tf\neq 0 in H\mathcal{H}. Moreover for any f=i=1aiϕiSf=\sum_{i=1}^{\infty}a_{i}\phi_{i}\in S, TfK2=i=1aiμiϕiK2=i=1ai2=fLρX22\|Tf\|_{K}^{2}=\|\sum_{i=1}^{\infty}a_{i}\sqrt{\mu_{i}}\phi_{i}\|_{K}^{2}=\sum_{i=1}^{\infty}a_{i}^{2}=\|f\|_{\mathcal{L}^{2}_{\rho_{{X}}}}^{2}, which is the isometrical property.

It must be noticed that we cannot prove surjectivity in H\mathcal{H}It is actually easy to build a counter example, f.e. with a measure of “small” support (let is say $),aHilbertspaceoffunctionson), a Hilbert space of functions on\mathcal{X}=[-5;5],andakernellike, and a kernel like\min(0,1-|x-y|)::\operatorname{Im}\left(\!\ \mathcal{T}^{1/2}\right)\subset\{f\in\mathcal{H}\text{ s. t. }\operatorname{supp}(f)\subset[-2;2]\}\varsubsetneq\mathcal{H}.,thatiswithoutourstrongassumptions.Howeverwewillshowthatoperator. , that is without our “strong assumptions”. However we will show that operatorT^{1/2}issurjectiveinis surjective inp(\mathcal{H})$.

Im(T1/2)=p(H)\operatorname{Im}(T^{1/2})=p(\mathcal{H}). That means T1/2:Sp(H)T^{1/2}:S\rightarrow p(\mathcal{H}) is an isomorphism. Im(T1/2)=p(Im(T1/2))=p(S)\operatorname{Im}(T^{1/2})=p(\operatorname{Im}(\mathcal{T}^{1/2}))=p(\mathscr{S}). Moreover p(H)=p(Ker(Σ)S)=p(S).p(\mathcal{H})=p(\operatorname{Ker}(\Sigma)\oplus\mathscr{S})=p(\mathscr{S}). Consequently Im(T1/2)=p(H)\operatorname{Im}(T^{1/2})=p(\mathcal{H}). Moreover T1/2:SLρX2T^{1/2}:S\rightarrow{L}^{2}_{\rho_{{X}}} is also injective, which give the isomorphical character.

Note that it is clear that T1/2(S)p(H)T^{1/2}(S)\subset p(\mathcal{H}) and that for any xXx\in\mathcal{X}, p(Kx)T1/2(S)p(K_{x})\in T^{1/2}(S) indeed p(Kx)=i=1Kx,ϕiLρX2ϕi=i=1μiϕiH(x)ϕip(K_{x})=\sum_{i=1}^{\infty}\langle K_{x},\phi_{i}\rangle_{{L}^{2}_{\rho_{{X}}}}\phi_{i}=\sum_{i=1}^{\infty}\mu_{i}\phi_{i}^{H}(x)\phi_{i}, with i=1(μiϕiH(x))2μi=i=1μiϕiH(x)2<,\sum_{i=1}^{\infty}\frac{\left(\mu_{i}\phi_{i}^{H}(x)\right)^{2}}{\mu_{i}}=\sum_{i=1}^{\infty}\mu_{i}\phi_{i}^{H}(x)^{2}<\infty, as K(x,x)=i=1μiϕiH(x)2K(x,x)=\sum_{i=1}^{\infty}\mu_{i}\phi_{i}^{H}(x)^{2}

Finally, it has appeared that SS and S\mathscr{S} may be identified via the isometry T1/2\mathcal{T}^{1/2}. We conclude by a proposition which sums up the properties of the spaces Tr(LρX2)\mathcal{T}^{r}({L}^{2}_{\rho_{{X}}}).

The spaces Tr(LρX2),r>0T^{r}({L}^{2}_{\rho_{{X}}}),r>0 satisfy:

I.4 Kernel decomposition

Considering our decomposition of H=Sker(Σ)\mathcal{H}=\mathscr{S}\overset{\perp}{\oplus}\ker(\Sigma), an the fact the (μiϕiH)(\sqrt{\mu_{i}}\phi_{i}^{\mathcal{H}}) is a Hilbertian eigenbasis of S\mathscr{S}, we have for any xXx\in\mathcal{X},

And as it has been noticed above this sum is converging in S\mathscr{S} (as in H\mathcal{H}) because i=1(μiϕiH(x))2μi=i=1μi(ϕiH(x))2=K(x,x)<\sum_{i=1}^{\infty}\frac{(\mu_{i}\phi_{i}^{\mathcal{H}}(x))^{2}}{\mu_{i}}=\sum_{i=1}^{\infty}\mu_{i}(\phi_{i}^{\mathcal{H}}(x))^{2}=K(x,x)<\infty. However, the convergence may not be absolute in H\mathcal{H}. Our function gxg_{x} is in Ker(Σ)\operatorname{Ker}(\Sigma), which means yXgx(y)2dρX(y)=0\int_{y\in\mathcal{X}}g_{x}(y)^{2}d\rho_{X}(y)=0.

And as a consequence, we have for all x,yXx,y\in\mathcal{X},

With g(x,y):=gx(y)g(x,y):=g_{x}(y). Changing roles of x,yx,y, it appears that g(x,y)=g(y,x)g(x,y)=g(y,x). And we have for all xXx\in\mathcal{X}, Xg(x,y)2dρX(y)=0\int_{\mathcal{X}}g(x,y)^{2}d\rho_{X}(y)=0. Moreover, the convergence of the series is absolute

(S,H)(\mathscr{S},\|\cdot\|_{\mathcal{H}}) is also an RKHS, with kernel KS:(x,y)iIμiϕiH(x)ϕiH(y)K^{\mathscr{S}}:(x,y)\mapsto\sum_{i\in I}\mu_{i}\phi_{i}^{H}(x)\phi_{i}^{H}(y)

given the decomposition above, almost surely the optimization problem in S\mathcal{S} and H\mathcal{H} have equivalent solutions.

(a) (S,H)(\mathscr{S},\|\cdot\|_{\mathcal{H}}) is a Hilbert space as a closed subspace of a Hilbert space. Then for any xXx\in\mathcal{X} : KxS:=(yKS(x,y))=i=1μiϕiH(x)ϕiHSK^{\mathscr{S}}_{x}:=(y\mapsto K^{\mathscr{S}}(x,y))=\sum_{i=1}^{\infty}\mu_{i}\phi_{i}^{\mathcal{H}}(x)\phi_{i}^{\mathcal{H}}\in\mathscr{S}. Finally, for any fSf\in\mathscr{S}

because gxKer(Σ)=Sfg_{x}\in\operatorname{Ker}(\Sigma)=\mathscr{S}^{\perp}\ni f. Thus stands the reproducing property.

(b) We have that p(S)=p(H)p(\mathscr{S})=p(\mathcal{H}) and our best approximating function is a minimizer over this set. Moreover if KxSK^{\mathscr{S}}_{x} was used instead of KxK_{x} in our algorithm, both estimators are almost surely almost surely equal (i.e., almost surely in the same equivalence class). Indeed, at any step nn, if we denote gnSg^{\mathscr{S}}_{n} the sequence built in S\mathscr{S} with KSK^{\mathscr{S}}, if we have gnS=a.s.gng_{n}^{\mathscr{S}}\overset{a.s.}{=}g_{n}, then almost surely gnS(xn)=gn(xn)g_{n}^{\mathscr{S}}(x_{n})=g_{n}(x_{n}) and moreover Kxn=a.s.KxnSK_{x_{n}}\overset{a.s.}{=}K^{S}_{x_{n}}. Thus almost surely, gn+1=a.s.gn+1Sg_{n+1}\overset{a.s.}{=}g_{n+1}^{\mathscr{S}}.

I.5 Alternative assumptions

As it has been noticed in the paper, we have tried to minimize assumptions made on X\mathcal{X} and KK. In this section, we review some of the consequences of such assumptions.

The following have been considered previously:

The assumption that KK is a Mercer kernel (X\mathcal{X} compact, KK continuous) has generally been made before , but does not seem to be necessary here.

(A2) was replaced by the stronger assumption supxXK(x,x)<\sup_{x\in\mathcal{X}}K(x,x)<\infty and Y|Y| bounded .

I.5.2 Identification ℋℋ\mathcal{H} and p​(ℋ)𝑝ℋp(\mathcal{H})

Working with mild assumptions has made it necessary to work with sub spaces of LρX2{L}^{2}_{\rho_{{X}}}, thus projecting H\mathcal{H} in p(H)p(\mathcal{H}). With stronger assumptions given above, the space H\mathcal{H} may be identified with p(H)p(\mathcal{H}).

Our problems are linked with the fact that a function ff in H\mathcal{H} may satisfy both fH0\|f\|_{\mathcal{H}}\neq 0 and fLρX2=0\|f\|_{{L}^{2}_{\rho_{{X}}}}=0.

the “support” of ρ\rho may not be X\mathcal{X}.

even if the support is X\mathcal{X}, a function may be ρ\rho-a.s. 0 but not null in H\mathcal{H}.

Both these “problems” are solved considering the further assumptions above. We have the following Proposition:

If we consider a Mercer kernel KK (or even any continuous kernel), on a space X\mathcal{X} compact and a measure ρX\rho_{X} on X\mathcal{X} such that supp(ρ)=X\operatorname{supp}(\rho)=\mathcal{X} then the map:

I.5.3 Mercer kernel properties

We review here some of the properties of Mercer kernels, especially Mercer’s theorem which may be compared to Proposition 10.

CK:=supx,tX2(K(x,t))<C_{K}:=\sup_{x,t\in\mathcal{X}^{2}}(K(x,t))<\infty.

fH\forall f\in\mathcal{H}, ff is C0C^{0}.

The sum λk\sum\lambda_{k} is convergent and k=1λk=XK(x,x)ρ(X)CK\sum_{k=1}^{\infty}\lambda_{k}=\int_{X}K(x,x)\leqslant\rho(\mathcal{X})C_{K}.

is well defined, continuous, and satisfies K(x,t)=Φk(x),Φk(t)K(x,t)=\langle\Phi_{k}(x),\Phi_{k}(t)\rangle.

The space H\mathcal{H} is independent of the measure considered on X\mathcal{X}.

We can characterize H\mathcal{H} via the eigenvalues-eigenvectors:

Which is equivalent to saying that T1/2T^{1/2} is an isomorphism between LρX2{L}^{2}_{\rho_{{X}}} and H\mathcal{H}. Where we have only considered λk>0\lambda_{k}>0. It has no importance to consider the linear subspace SS of LρX2{L}^{2}_{\rho_{{X}}} spanned by the eigenvectors with non zero eigenvalues. However it changes the space H\overline{\mathcal{H}} which is in any case SS, and is of some importance regarding the estimation problem.

Appendix II Proofs

To get our results, we are going to derive from our recursion a new error decomposition and bound the different sources of error via algebraic calculations. We first make a few remarks on short notations that we will use in this part and difficulties that arise from the Hilbert space setting in Section II.1, then provide intuition via the analysis of a closely related recursion in Section II.2. We give in Sections II.3, II.4 the complete proof of our bound respectively in the finite horizon case (Theorem 2) and the online case (Theorem 3). We finally provide technical calculations of the main bias and variance terms in Section II.6.

With a sequence (an)n1(a_{n})_{n\geqslant 1} such that for all nn greater than 1 :

We consider a representer gHLρX2g_{\mathcal{H}}\in\mathcal{L}^{2}_{\rho_{{X}}} of gHg_{\mathcal{H}} defined by Proposition 8. We accept to confuse notations as far as our calculations are made on LρX2{L}^{2}_{\rho_{{X}}}-norms, thus does not depend on our choice of the representer.

In order to simplify reading, we will use some shorter notations :

For the covariance operator, we will only use Σ\Sigma instead of Σ,T,T\Sigma,T,\mathcal{T},

All the functions may be split up the orthonormal eigenbasis of the operator T\mathcal{T}. We can thus see any function as an infinite-dimensional vector, and operators as matrices. This is of course some (mild) abuse of notations if we are not in finite dimensions. For example, our operator Σ\Sigma may be seen as  Diag(μi)1i\text{ Diag}(\mu_{i})_{1\leqslant i}. Carrying on the analogy with the finite dimensional setting, a self adjoint operator, may be seen as a symmetric matrix.

We will have to deal with several “matrix products” (which are actually operator compositions). We denote :

As our operators may not commute, we use a somehow unusual convention by defining the products for any k,nk,n, even with k>nk>n , with M(k,n,γ)=(IγKxkKxk)(IγKxk1Kxk1)(IγKxnKxn)M(k,n,\gamma)=(I-\gamma K_{x_{k}}\otimes K_{x_{k}})(I-\gamma K_{x_{k-1}}\otimes K_{x_{k-1}})\cdots(I-\gamma K_{x_{n}}\otimes K_{x_{n}}).

We may denote D(k,n,γ)=i=kn(IγΣ)D(k,n,\gamma)=\prod_{i=k}^{n}(I-\gamma\Sigma) even if its clearly (IγΣ)nk+1(I-\gamma\Sigma)^{n-k+1} just in order to make the comparison between equations easier.

II.1.2 On norms

In the following, we will use constantly the following observation :

Assume A2-4 , let ηn=gngH{\eta}_{n}={g}_{n}-{g_{\mathcal{H}}}, ηˉn=gngH\bar{\eta}_{n}={\overline{g}}_{n}-{g_{\mathcal{H}}} :

II.1.3 On symmetric matrices

One has to be careful when using auto adjoint operators, especially when using the order ABA\preccurlyeq B which means that BAB-A is non-negative.

Some problems may arise when some self adjoint A,BA,B do not commute, because then ABAB is not even in auto adjoint. It is also hopeless to compose such relations : for example ABA\preccurlyeq B does not imply A2B2A^{2}\preccurlyeq B^{2} (while the opposite is true).

II.1.4 Notation

In the proof, we may use, for any xHx\in\mathcal{H}:

We only consider functions LρX2\mathcal{L}^{2}_{\rho_{X}}, which are well defined at any point. The regression function is only almost surely defined but we will consider a version of the function in LρX2\mathcal{L}^{2}_{\rho_{X}}.

KxKx~H=KxKx\widetilde{K_{x}\otimes K_{x}}_{|\mathcal{H}}={K_{x}\otimes K_{x}}

II.2 Semi-stochastic recursion - intuition

with g0=0g_{0}=0. We have denoted Ξn=(yngH(xn))Kxn\Xi_{n}=(y_{n}-g_{\mathcal{H}}(x_{n}))K_{x_{n}}. Thus ynKxn=gH(xn)Kxn+Ξn=defKxnKxn~gH+Ξny_{n}K_{x_{n}}={g_{\mathcal{H}}}(x_{n})K_{x_{n}}+\Xi_{n}\stackrel{{\scriptstyle\text{def}}}{{=}}\widetilde{K_{x_{n}}\otimes K_{x_{n}}}{g_{\mathcal{H}}}+\Xi_{n}, and our recursion may be rewritten :

Finally, we are studying a sequence (ηn)n({\eta}_{n})_{n} defined by :

Behaviour : It appears that to understand how this will behave, we may compare it to the following recursion, which may be described as a “semi-stochastic” version of (II.4) : we keep the randomness due to the noise Ξn\Xi_{n} but forget the randomness due to sampling by replacing KxnKxn~\widetilde{K_{x_{n}}\otimes K_{x_{n}}} by its expectation Σ\Sigma (TT, more precisely) :

Complete proof : This comparison will give an interesting insight and the main terms of bias and variance will appear if we study (II.5). However this is not the true recursion : to get Theorem 2, we will have to do a bit of further work : we will first separate the error due to the noise from the error due to the initial condition, then link the true recursions to their “semi-stochastic” counterparts to make the variance and bias terms appear. That will be done in Section II.3.

Semi-stochastic recursion : In order to get such intuition, in both the finite horizon and on-line case, we will begin by studying the semi-stochastic equation (II.5).

In the following, all calculations may be driven either with Σ1/2K\|\Sigma^{1/2}\cdot\|_{K} or in LρX2\|\cdot\|_{{L}^{2}_{\rho_{{X}}}} using the isometrical character of Σ1/2\Sigma^{1/2}. In order to simplify comparison with existing work and especially , we will mainly use the former as all calculations are only algebraic sums, we may sometimes use the notation x,ΣxH\langle x,\Sigma x\rangle_{H} instead of Σ1/2xH2\|\Sigma^{1/2}x\|_{\mathcal{H}}^{2}. It is an abuse if xHx\notin\mathcal{H}, but however does not induce any confusion or mistake. In the following, if not explicitely specified, \|\cdot\| will denote K\|\cdot\|_{K}.

In section II.6 we will prove the following Lemmas which upper bound these bias and variance terms under different assumptions :

\operatorname{Bias}\Big{(}n,\gamma,\Sigma,g_{\mathcal{H}}\Big{)} if we assume A3,4, γ\gamma constant,

\operatorname{Var}\Big{(}n,\gamma,\Sigma,(\Xi_{i})_{i}\Big{)} if we assume A3,6, γ\gamma constant,

\operatorname{Bias}\Big{(}n,(\gamma_{i})_{i},\Sigma,{g_{\mathcal{H}}}\Big{)} if we assume A3,4 and γi=1nζ,  0ζ1\gamma_{i}=\frac{1}{n^{\zeta}},\ \ 0\leqslant\zeta\leqslant 1,

\operatorname{Var}\Big{(}n,(\gamma_{i})_{i},\Sigma,(\Xi_{i})_{i}\Big{)} if we assume A3,6 and γi=1nζ,  0ζ1\gamma_{i}=\frac{1}{n^{\zeta}},\ \ 0\leqslant\zeta\leqslant 1.

The two terms show respectively the impact :

of the initial setting and the hardness to forget the initial condition,

Thus the first one tends to decrease when γ\gamma is increasing, whereas the second one increases when γ\gamma increases. We understand we may have to choose our step γ\gamma in order to optimize the trade-off between these two factors.

Assume A3-4 and let α\alpha (resp. rr) be the constant in A3 (resp. A4) :

with C(α)=2α2(α+1)(2α1)C(\alpha)=\frac{2\alpha^{2}}{(\alpha+1)(2\alpha-1)}.

Assume A3-4 and let α\alpha (resp. rr) be the constant in A3 (resp. A4). Assume we consider a sequence γi=γ0iζ\gamma_{i}=\frac{\gamma_{0}}{i^{\zeta}} with 0<ζ<10<\zeta<1 then :

Assume A3,6, let α,s\alpha,s be the constants in A3, and σ\sigma the constant in A6 . If we consider a sequence γi=γ0iζ\gamma_{i}=\frac{\gamma_{0}}{i^{\zeta}} with 0<ζ<10<\zeta<1 then :

Considering decomposition (II.6) and our Lemmas above, we can state a first Proposition.

Assume A1-6. Let’s consider the semi-stochastic recursion (that is the sequence : ηn=(IγnΣ)ηn1+γnΞn\eta_{n}=(I-\gamma_{n}\Sigma)\eta_{n-1}+\gamma_{n}\Xi_{n}) instead of our recursion initially defined. In the finite horizon setting, thus with γi=γ\gamma_{i}=\gamma for ini\leqslant n, we have :

Theorem 2 must be compared to Proposition 24 : Theorem 2 is just an extension but with the true stochastic recursion instead of the semi-stochastic one.

We finish this first part by a very simple Lemma which states that what we have done above is true for any semi stochastic recursion under few assumptions. Indeed, to get the complete bound, we will always come back to semi-stochastic type recursions, either without noise, or with a null initial condition.

αn=(IγΣ)αn1+γΞnα\alpha_{n}=(I-\gamma\Sigma)\alpha_{n-1}+\gamma\Xi^{\alpha}_{n}, with γΣI\gamma\Sigma\preccurlyeq I.

(Ξnα)H(\Xi^{\alpha}_{n})\in\mathcal{H} is Fn\mathcal{F}_{n} measurable for a sequence of increasing σ\sigma-fields (Fn)\left(\mathcal{F}_{n}\right).

And we may apply Lemmas 4 and 5 if we have good assumptions on Σ,α0\Sigma,\alpha_{0}.

II.3 Complete proof, Theorem 2 (finite horizon)

In the following, we will focus on the finite horizon setting, i.e., we assume the step size is constant, but may depend on the total number of observations nn : for all 1in1\leqslant i\leqslant n, γi=γ=Γ(n)\gamma_{i}=\gamma=\Gamma(n). The main idea of the proof is to be able to :

separate the different sources of error (noise & initial conditions),

then bound the difference between the stochastic recursions and their semi-stochastic versions, a case in which we are able to compute bias and variance as it is done above.

We remind that (ηn)n({\eta}_{n})_{n} is defined by :

Before studying the main decomposition in Section II.3.2 we must give a classical Lemma on stochastic recursions which will be useful below :

Its proof may be found in : it is a direct consequence of the classical recursion to upper bound αn2\|\alpha_{n}\|^{2}.

II.3.2 Main decomposition

ηninit{\eta}_{n}^{init} is the part of (ηn)n({\eta}_{n})_{n} which is due to the initial conditions ( it is equivalent to assuming Ξn0\Xi_{n}\equiv 0).

Respectively, let (ηnnoise)n({\eta}_{n}^{noise})_{n} be defined by :

ηnnoise{\eta}_{n}^{noise} is the part of (ηn)n({\eta}_{n})_{n} which is due to the noise.

That means we can always consider separately the effect of the noise and the effect of the initial conditions. We’ll first study ηnnoise{\eta}_{n}^{noise} and then ηninit{\eta}_{n}^{init}.

II.3.3 Noise process

We remind that (ηnnoise)n({\eta}_{n}^{noise})_{n} is defined by :

We are going to define some other sequences, which are defined by the following “semi-stochastic” recursion, in which KxnKxnK_{x_{n}}\otimes K_{x_{n}} has been replaced be its expectancy Σ\Sigma : first we define (ηnnoise,0)n\left({\eta}_{n}^{noise,0}\right)_{n} so that

So that we’re interested in the sequence (ηnnoiseηnnoise,0)n\left({\eta}_{n}^{noise}-{\eta}_{n}^{noise,0}\right)_{n} : we have

which is the same type of Equation as (II.9). We have denoted Ξn1=(ΣKxnKxn)ηn10\Xi_{n}^{1}=(\Sigma-K_{x_{n}}\otimes K_{x_{n}}){\eta}_{n-1}^{0}.

Thus we may consider the following sequence, satisfying the “semi-stochastic” version of recursion (II.11), changing KxnKxnK_{x_{n}}\otimes K_{x_{n}} into its expectation Σ\Sigma : we define (ηnnoise,1)n\left({\eta}_{n}^{noise,1}\right)_{n} so that:

Thanks to the triangular inequality, we’re interested in (ηnnoiseηnnoise,0ηnnoise,1)n\left({\eta}_{n}^{noise}-{\eta}_{n}^{noise,0}-{\eta}_{n}^{noise,1}\right)_{n}, which satisfies the (II.9)-type recursion :

With Ξn(2):=(ΣKxnKxn)ηn1noise,1\Xi^{(2)}_{n}:=(\Sigma-K_{x_{n}}\otimes K_{x_{n}}){\eta}_{n-1}^{noise,1}.

And so on… For any r0r\geqslant 0 we define a sequence (ηnnoise,r)n({\eta}_{n}^{noise,r})_{n} by :

So that (ηnnoise,r+1)\left({\eta}_{n}^{noise,r+1}\right) follows the “semi-stochastic” version of (II.13)…

Considering this decomposition, we have, for any rr, using triangular inequality :

For any i0i\geqslant 0, we find that we may apply Lemma 8 to the sequence (ηnnoise,i)({\eta}_{n}^{noise,i}). Indeed :

For any r0r\geqslant 0, (ηnnoise,r)({\eta}_{n}^{noise,r}) is defined by :

Once again we have (ηnr+1)=γ2k=1n1(IγΣ)n1kΞnr+1({\eta}_{n}^{r+1})=\gamma^{2}\sum_{k=1}^{n-1}(I-\gamma\Sigma)^{n-1-k}\Xi_{n}^{r+1}, for any nn:

Moreover, using the Lemma on stochastic recursions (Lemma 9) for (ηˉnnoisei=0rηˉni)n(\bar{\eta}_{n}^{noise}-\sum_{i=0}^{r}\bar{\eta}_{n}^{i})_{n} (all conditions are satisfied) we have :

Thus using (II.14), (II.15) and (II.16) :

And using the fact that γR<1\gamma R<1, when rr\rightarrow\infty we get:

II.3.4 Initial conditions

First for ηn0\overline{\eta}^{0}_{n} we have a semi-stochastic recursion, with Ξn0\Xi_{n}\equiv 0 so that we have

Then , for the residual term we use Lemma 9. Using that :

we may apply Lemma 9 to the recursion above with αn=ηn0ηninit\alpha_{n}=\eta^{0}_{n}-\eta_{n}^{init} and Ξn=(KxnKxnΣ)ηn10\Xi_{n}=(K_{x_{n}}\otimes K_{x_{n}}-\Sigma)\eta_{n-1}^{0}. That is (as α0=0\alpha_{0}=0):

A1/22=A\||A^{1/2}\||^{2}=\||A\||. Moreover, as Σ\Sigma is self adjoint, we have:

Where we have used inequality (II.50), if r1/2r\leqslant 1/2. However, this result does not stand anymore if r1/2r\geqslant 1/2. To deal with this particular case, we use the fact that,

This result’s proof is postponed to Lemma 13.

So that we would get, replacing our result in (II.20) :

Summing both bounds we get from (II.19) :

II.3.5 Conclusion

These two parts allow us to show Theorem 2 : using (II.22) and (II.18) in (II.8), and Lemmas 4 and 5 we have the final result.

Assume (A1-6) and γi=γ=Γ(n)\gamma_{i}=\gamma=\Gamma(n), for 1in1\leqslant i\leqslant n. We have, with C(α)=2α2(α+1)(2α1)C(\alpha)=\frac{2\alpha^{2}}{(\alpha+1)(2\alpha-1)}:

Then bounding C(α)C(\alpha) by 1 and simplifying under assumption γR21/4\gamma R^{2}\leqslant 1/4, we exactly get Theorem 2 in the main text. In order to derive corollaries, one just has to chose γ=Γ(n)\gamma=\Gamma(n) in order to balance the main terms.

II.4 Complete proof, Theorem 3 (on-line setting)

The sketch of the proof is exactly the same. We just have to check that changing a constant step into a decreasing sequence of step-size does not change to much. However as most calculations make appear some weird constants, we will only look for asymptotics. The sketch of the decomposition is given in Table 6.

So that, if we assume that (γn)(\gamma_{n}) is non increasing:

II.4.2 Noise process

We remind that (ηnnoise)n({\eta}_{n}^{noise})_{n} is defined by :

As before, for any r0r\geqslant 0 we define a sequence (ηnnoise,r)n({\eta}_{n}^{noise,r})_{n} by :

And we want to use the following upper bound

So that we had to upper bound the noise :

We make an induction on nn. We note that :

Where we have used that : D(n,k+1,(γk)k)D(n,k,(γk)k)=D(n,k+1,(γk)k)γkΣD(n,k+1,(\gamma_{k})_{k})-D(n,k,(\gamma_{k})_{k})=D(n,k+1,(\gamma_{k})_{k})\gamma_{k}\Sigma.

Once again we have ηnr+1=k=1nD(n,k+1,(γk)k)γkΞkr+1{\eta}_{n}^{r+1}=\sum_{k=1}^{n}D(n,k+1,(\gamma_{k})_{k})\gamma_{k}\Xi^{r+1}_{k}, for any nn:

Moreover, using the Lemma on stochastic recursions (Lemma 11) for (αnr)n=(ηnnoisei=0rηni)n(\alpha^{r}_{n})_{n}=({\eta}_{n}^{noise}-\sum_{i=0}^{r}{\eta}_{n}^{i})_{n} (all conditions are satisfied) we have :

We are going to show that both these terms goes to 0 when rr goes to infinity. Indeed :

Moreover, if we assume γi=1iζ\gamma_{i}=\frac{1}{i^{\zeta}} :

II.4.3 Initial conditions

First for ηn0\overline{\eta}^{0}_{n} we have a semi-stochastic recursion, with Ξn0\Xi_{n}\equiv 0 so that we have

Then , for the residual term we use Lemma 11 for the recursion above with αn=ηn0ηninit\alpha_{n}=\eta^{0}_{n}-\eta_{n}^{init}. Using that :

By the induction we make to get Lemma 11, we have :

So that we would get, replacing our result in (II.37) :

And finally, with (II.36) and (II.41) in (II.35),

II.4.4 Conclusion

We conclude with both (II.34) and (II.42) in (II.8) :

Which gives Theorem 3 using Lemmas 6 and 7. Once again, deriving corollaries is simple.

II.5 A lemma on stochastic recursion, r⩾1/2𝑟12r\geqslant 1/2

If we consider the recursion ηn+1=(Iγxnxn)ηn\eta_{n+1}=(I-\gamma x_{n}\otimes x_{n})\eta_{n}, with η0=gH\eta_{0}=g_{\mathcal{H}}, we have

Let us first state a few properties of symmetric matrices that are useful here. First we recall that the order \preccurlyeq is defined by MNM\preccurlyeq N if NMN-M is sdp. This is an order on SnS_{n}, which is not a total order. We say that a function ff is matrix increasing if MNM\preccurlyeq N implies f(M)f(N).f(M)\preccurlyeq f(N). We have the following special cases :

The function MM2M\mapsto M^{2} is not matrix increasing on Sn+S_{n}^{+}.

For any q[0;1]q\in[0;1], the function MMqM\mapsto M^{q} is matrix increasing on Sn+S_{n}^{+}.

For any NN, the function MNMNM\mapsto N^{\top}MN is matrix increasing.

For some NN, the function MNM+MNM\mapsto N^{\top}M+MN is not matrix increasing.

exp\exp is not matrix increasing, log\log is matrix increasing.

if ABA\preccurlyeq B, ACA\preccurlyeq C and BC=CBBC=CB, then for any q[0;1]q\in[0;1], ABqC1q.A\preccurlyeq B^{q}C^{1-q}.

It is important to notice that it often occurs that ff is matrix increasing and its inverse f1f^{-1} is not (square/square root, exp/log, left and right multiplication).

Notation \wedge is somehow ill defined as ABA\wedge B may be neither AA nor BB. However, we use this notation as a shortcut for a matrix CC such that CAC\preccurlyeq A and CBC\preccurlyeq B.

To prove Lemma 13, we consider the stochastic recursion, in the case r1/2r\geqslant 1/2.

We consider a full expansion of the function value Σ1/2(ηˉn)2\|\Sigma^{1/2}(\bar{\eta}_{n})\|^{2}. This corresponds to

Where the matrix M(n,k)M(n,k) is (Iγxnxn)(Iγxkxk)(I-\gamma x_{n}\otimes x_{n})\cdots(I-\gamma x_{k}\otimes x_{k}), for knk\leqslant n We consider the recursion without noise and rely on an explicit decomposition.

as ηi=M(i,1)(η0)\eta_{i}=M(i,1)(\eta_{0}), with E0=(η0)(η0).E_{0}=(\eta_{0})(\eta_{0})^{\ast}. and ,\langle\langle\cdot,\cdot\rangle\rangle denoting the Froebenius scalar product between matrices.

We consider the operator TT from symmetric matrices to symmetric matrices defined as

of the form TA=ΣA+AΣγSATA=\Sigma A+A\Sigma-\gamma SA.

The operator SS is self-adjoint and positive.

Moreover we can upper bound tr(SM):\operatorname{tr}(SM): as

And as An1/αγ1/αΣ1/αA\preccurlyeq n^{1/\alpha}\gamma^{-1/\alpha}\Sigma^{1/\alpha}, we finally have :

Combining equation (II.44) and (II.45) we get, for any q[0;1]q\in[0;1] :

Thus with q=1+2rq=-1+2r, for r[1/2;1]r\in[1/2;1], we get

And the quantity R2γ1+1/αn1/αR^{2}\gamma^{1+1/\alpha}n^{1/\alpha} is always going to 0 for the optimal choice of γ\gamma. Moreover, the exponent 2r12r-1 is logically vanishing when r1/2.r\rightarrow 1/2.

II.6 Some quantities

In this section, we bound the main quantities which are involved above.

if nx1nx\leqslant 1 then (1(1x)n)nx(nx)1r(1-(1-x)^{n})\leqslant nx\leqslant(nx)^{1-r} (the first inequality can be proved by deriving the difference).

if nx1nx\geqslant 1 then (1(1x)n)1(nx)1r(1-(1-x)^{n})\leqslant 1\leqslant(nx)^{1-r} .

If r1r\geqslant 1, x(1(1x)n)x\mapsto(1-(1-x)^{n}) is increasing on [0;1][\,0;1] so sup0x1(k=0n1(1x)kxr)=1\sup_{0\leqslant x\leqslant 1}\left(\sum_{k=0}^{n-1}(1-x)^{k}x^{r}\right)=1 : there is no improvement in comparison to r=1r=1 :

II.6.2 Lemma 5

Note that the first integral may be empty if γj1\gamma j\leqslant 1. We also have:

Considering that gj:u(1(1γuα)j)2g_{j}:u\mapsto\left(1-\left(1-\frac{\gamma}{u^{\alpha}}\right)^{j}\right)^{2} is a decreasing function of uu we get :

Where we have used the fact that (11j)je1\left(1-\frac{1}{j}\right)^{j}\leqslant e^{-1} for the left hand side inequality. Thus we have proved :

For the other part of the sum, we consider hj:u(1(1γuα)jγuα)2h_{j}:u\mapsto\left(\frac{1-\left(1-\frac{\gamma}{u^{\alpha}}\right)^{j}}{\frac{\gamma}{u^{\alpha}}}\right)^{2} which is an increasing function of u. So :

And we could get, by a similar calculation :

Where C1=(1e1)2 (1+1(2α1))C_{1}=(1-e^{-1})^{2}\ (1+\frac{1}{(2\alpha-1)}) and C2=(1+1(2α1))C_{2}=(1+\frac{1}{(2\alpha-1)}) are real constants. To get the complete variance term we have to calculate : σ2n2j=1n1tr(I(IγΣ))j.\frac{\sigma^{2}}{n^{2}}\sum_{j=1}^{n-1}\operatorname{tr}\left(I-(I-\gamma\Sigma)\right)^{j}. We have :

with C(α)=2α2(α+1)(2α1)C(\alpha)=\frac{2\alpha^{2}}{(\alpha+1)(2\alpha-1)}.

II.6.3 Lemma 6

Else if r11ζ>0{r-\frac{1}{1-\zeta}}>0, then sup0x1(nxrxr11ζ)=1\sup_{0\leqslant x\leqslant 1}\left(nx^{r}\wedge x^{r-\frac{1}{1-\zeta}}\right)=1, so that

II.6.4 Lemma 7

To get corollary 7, we will just replace in the following calculations γ\gamma by s2γs^{2}\gamma We remind that :

For shorter notation, in the following proof, we note \operatorname{Var}(n)=\operatorname{Var}\Big{(}n,(\gamma_{i})_{i},\Sigma,(\xi_{i})_{i}\Big{)}.

(this upper bound is good when t>>n1ζt>>n^{1-\zeta}), but we also have:

With ρ=1ζ,Kζ:=1(1ζ)1/ρtα/ρ\rho=1-\zeta,K_{\zeta}:=\frac{1}{(1-\zeta)^{1/\rho}t^{\alpha/\rho}} and

As we have a Riemann sum which converges.

Finally we get : if 0<ζ<120<\zeta<\frac{1}{2} then

where we have re-used the constants ss by formaly replacing in the proof γ\gamma by γs2\gamma s^{2}.

References