Challenging the empirical mean and empirical variance: a deviation study

Olivier Catoni

\thetitle. Introduction

This paper is devoted to the estimation of the mean and possibly also of the variance of a real random variable from an independent identically distributed sample. While the most traditional way to deal with this question is to focus on the mean square error of estimators, we will instead focus on their deviations. Deviations are related to the estimation of confidence intervals which are of importance in many situations. While the empirical mean has an optimal minimax mean square error among all mean estimators in all models including Gaussian distributions, its deviations tell a different story. Indeed, as far as the mean square error is concerned, Gaussian distributions represent already the worst case, so that in the framework of a minimax mean least square analysis, no need is felt to improve estimators for non-Gaussian sample distributions. On the contrary, the deviations of estimators, and especially of the empirical mean, are worse for non-Gaussian samples than for Gaussian ones. Thus a deviation analysis will point out possible improvements of the empirical mean estimator more successfully. It was nonetheless quite unexpected for us, and will undoubtedly also be for some of our readers, that the empirical mean could be improved, under such a weak hypothesis as the existence of a finite variance, and that this has remained unnoticed until now. One of the reasons may be that the weaknesses of the empirical mean disappear if we let the sample distribution be fixed and the sample size go to infinity. This does not mean however that a substantial improvement is not possible, nor that it is only concerned with specific sample sizes or weird worst case distributions : in the end of this paper, we will present experiments made on quite simple sample distributions, consisting in the mixture of two to three Gaussian measures, showing that more than twenty five percent can be gained on the widths of confidence intervals, for realistic sample sizes ranging from 100 to 2000. We think that, beyond the technicalities involved here, this exemplifies more broadly the pitfalls of asymptotic studies in statistics and should be quite thought provoking about the notions of optimality commonly used to assess the performance of estimators.

Our deviation study will use two kinds of tools: M-estimators to truncate observations and PAC-Bayesian theorems to combine estimates on the same sample without using a split scheme .

Its general conclusion is that, whereas the deviations of the empirical mean estimate may increase a lot when the sample distribution is far from being Gaussian, those of some new M-estimators will not. The improvement is the best for heavy-tailed distributions, as the worst case analysis performed to prove lower bounds will show. The improvement also increases as the confidence level at which deviations are computed increases.

Similar conclusions can be drawn in the case of least square regression with random design . Discovering that using truncated estimators permits to get rid of sub Gaussian tail assumptions was the spur to study the simpler case of mean estimation for its own sake. Restricting the subject in this way (which is of course a huge restriction compared with least square regression) makes it possible to propose simpler dedicated estimators and to push their analysis further. It will indeed be possible here to obtain mathematical proofs for numerically significant non-asymptotic bounds.

The weakest hypothesis we will consider is the existence of a finite but unknown variance. In our M-estimators, adapting the truncation level depends on the value of the variance. However, this adaptation can be done without actually knowing the variance, through Lepski’s approach.

Computing an observable confidence interval, on the other hand, requires more information. The simplest case is when the variance is known, or at least lower than a known bound. If it is not so, another possibility is to assume that the kurtosis is known, or lower than a known bound. Introducing the kurtosis is natural to our approach: in order to calibrate the truncation level for the estimation of the mean, we need to know the variance, and in the same way, in order to calibrate the truncation level for the estimation of the variance, we need to use the variance of the variance, which is provided by the kurtosis as a function of the variance itself.

In order to assess the quality of the results, we prove corresponding lower bounds for the best estimator confronted to the worst possible sample distribution, following the minimax approach. We also compute lower bounds for the deviations of the empirical mean estimator when the sample distribution is the worst possible. These latter bounds show the improvement that can be brought by M-estimators over the more traditional empirical mean. We plot the numerical values of these upper and lower bounds against the confidence level for typical finite sample sizes to show the gap between them.

The reader may wonder why we only consider the following extreme models, the narrow Gaussian model and the broad models

where \mathdsVar\mathdsP\operatorname{\mathds{V}ar}_{\mathds{P}} is the variance of \mathdsP\mathds{P}, κ\mathdsP\kappa_{\mathds{P}} its kurtosis, and M+1(\mathdsR)\mathcal{M}_{+}^{1}(\mathds{R}) is the set of probability measures (positive measures of mass 11) on the real line equipped with the Borel sigma-algebra.

The reason is that, the minimax bounds obtained in these broad models being close to the ones obtained in the Gaussian model, introducing intermediate models would not change the order of magnitude of the bounds.

Let us end this introduction by advocating the value of confidence bounds, stressing more particularly the case of high confidence levels, since this is the situation where truncation brings the most decisive improvements.

One situation of interest which we will not comment further is when the estimated parameter is critical and making a big mistake on its value, even with a small probability, is unacceptable.

Another scenario to be encountered in statistical learning is the case when lots of estimates are to be computed and compared in the course of some decision making. Let us imagine, for instance, that some parameter θΘ\theta\in\Theta is to be tuned in order to optimize the expected loss \mathds{E}\bigl{[}f_{\theta}(X)\bigr{]} of some family of loss functions \bigl{\{}f_{\theta}:\theta\in\Theta\bigr{\}} computed on some random input XX. Let us consider a split sample scheme where two i.i.d. samples (X1,,Xs)=defX1s(X_{1},\dots,X_{s})\overset{\text{def}}{=}X_{1}^{s} and (Xs+1,,Xs+n)=defXs+1s+n(X_{s+1},\dots,X_{s+n})\overset{\text{def}}{=}X_{s+1}^{s+n} are used, one to build some estimators θ^k(X1s)\widehat{\theta}_{k}(X_{1}^{s}) of \textnormal{argmin}_{\theta\in\Theta_{k}}\mathds{E}\bigl{[}f_{\theta}(X)\bigr{]} in subsets Θk\Theta_{k}, k=1,,Kk=1,\dots,K of Θ\Theta, and the other to test those estimators and keep hopefully the best. This is a very common model selection situation. One can think for instance of the choice of a basis to expand some regression function. If KK is large, estimates of \mathds{E}\bigl{[}f_{\widehat{\theta}_{k}(X_{1}^{s})}(X_{s+1})\bigr{]} will be required for a lot of values of kk. In order to keep safe from over-fitting, very high confidence levels will be required if the resulting confidence level is to be computed through a union bound (because no special structure of the problem can be used to do better). Namely, a confidence level of 1ϵ1-\epsilon on the final result of the optimization on the test sample will require a confidence level of 1ϵ/K1-\epsilon/K for each mean estimate on the test sample. Even if ϵ\epsilon is not very small (like, say, 5/1005/100), ϵ/K\epsilon/K may be very small. For instance, if 10 parameters are to be selected among a set of 100, this gives K=(10010)1.71013K={100\choose 10}\simeq 1.7\cdot 10^{13}. In practice, except in some special situations where fast algorithms exist, a heuristic scheme will be used to compute only a limited number of estimators θ^k\widehat{\theta}_{k}. An example of heuristics is to add greedily parameters one at a time, choosing at each step the one with the best estimated performance increase (in our example, this requires to compute 10001000 estimators instead of (10010){100\choose 10}). Nonetheless, asserting the quality of the resulting choice requires a union bound on the whole set of possible outcomes of the data driven heuristic, and therefore calls for very high confidence levels for each estimate of the mean performance \mathds{E}\bigl{[}f_{\widehat{\theta}_{k}(X_{1}^{s})}(X_{s+1})\bigr{]} on the test set.

The question we are studying in this paper should not be confused with robust statistics . The most fundamental difference is that we are interested in estimating the mean of the sample distribution. In robust statistics, it is assumed that the sample distribution is in the neighbourhood of some known parametric model. This gives the possibility to replace the mean by some other location parameter, which as a rule will not be equal to the mean when the shape of the distribution is not constrained (and in particular is not assumed to be symmetric).

Other differences are that our point of view is non-asymptotic and that we study the deviations of estimators whereas robust statistics is focussed on their asymptotic mean square error.

Although we end up defining M-estimators with the help of influence functions, like in robust statistics, we use a truncation level depending on the sample size, whereas in robust statistics, the truncation level depends on the amount of contamination. Also, we truncate at much higher levels (that is we eliminate less outliers) that what would be advisable for instance in the case of a contaminated Gaussian statistical sample. Thus, although we have some tools in common with robust statistics, we use them differently to achieve a different purpose.

Adaptive estimation of a location parameter is another setting where the empirical mean can be replaced by more efficient estimators. However, the setting studied by these authors is quite different from ours. The main difference, here again, is that the location parameter is assumed to be the center of symmetry of the sample distribution, a fact that is used to tailor location estimators based on a symmetrized density estimator. Another difference is that in these papers, the estimators are built with asymptotic properties in view, such as asymptotic normality with optimal variance and asymptotic robustness. These properties, although desirable, give no information on the non-asymptotic deviations we are studying here, and therefore do not provide as we do non-asymptotic confidence intervals.

\thetitle. Some new M-estimators

Let (Yi)i=1n(Y_{i})_{i=1}^{n} be an i.i.d. sample drawn from some unknown probability distribution \mathdsP\mathds{P} on the real line \mathdsR\mathds{R} equipped with the Borel σ\sigma-algebra B\mathcal{B}. Let YY be independent from (Yi)i=1n(Y_{i})_{i=1}^{n} with the same marginal distribution \mathdsP\mathds{P}. Assuming that Y\mathdsL2(\mathdsP)Y\in\mathds{L}_{2}(\mathds{P}), let mm be the mean of YY and let vv be its variance:

Let us consider some non-decreasingWe would like to thank one of the anonymous referees of an early version of this paper for pointing out the benefits that could be drawn from the use of a non-decreasing influence function in this section. influence function ψ:\mathdsR\mathdsR\psi:\mathds{R}\rightarrow\mathds{R} such that

The widest possible choice of ψ\psi compatible with these inequalities is

Although ψ\psi is not the derivative of some explicit error function, we will use it in the same way, so that it can be considered as an influence function.

Indeed, α\alpha being some positive real parameter to be chosen later, we will build our estimator θ^α\widehat{\theta}_{\alpha} of the mean mm as the solution of the equation

(When the narrow choice of ψ\psi defined by equation (2.3) is made, the above equation may have more than one solution, in which case any of them can be used to define θ^α\widehat{\theta}_{\alpha}.)

The widest choice of ψ\psi is the one making θ^α\widehat{\theta}_{\alpha} the closest to the empirical mean. For this reason it may be preferred if our aim is to stabilize the empirical mean by making the smallest possible change, which could be justified by the fact that the empirical mean is optimal in the case when the sample (Yi)i=1n(Y_{i})_{i=1}^{n} is Gaussian.

Our analysis of θ^α\widehat{\theta}_{\alpha} will rely on the following exponential moment inequalities, from which deviation bounds will follow. Let us introduce the quantity

The proof of this proposition is an obvious consequence of inequalities (2.1) and of the fact that the sample is assumed to be i.i.d. It justifies the special choice of influence function we made. If we had taken for ψ\psi the identity function, and thus for θ^α\widehat{\theta}_{\alpha} the empirical mean, the exponential moments of r(θ)r(\theta) would have existed only in the case when the random variable YY itself has exponential moments. In order to bound θ^α\widehat{\theta}_{\alpha}, we will find two non-random values θ\theta_{-} and θ+\theta_{+} of the parameter such that with large probability r(θ)>0>r(θ+)r(\theta_{-})>0>r(\theta_{+}), which will imply that θ<θ^α<θ+\theta_{-}<\widehat{\theta}_{\alpha}<\theta_{+}, since r(θ^α)=0r(\widehat{\theta}_{\alpha})=0 by construction and θr(θ)\theta\mapsto r(\theta) is non-increasing.

The values of the parameters α\mathdsR+\alpha\in\mathds{R}_{+} and ϵ)0,1(\epsilon\in)0,1( being set, let us define for any θ\mathdsR\theta\in\mathds{R} the bounds

The proof of this proposition is also straightforward: it is a mere consequence of Chebyshev’s inequality and of the previous proposition. Let us assume that

Let θ+\theta_{+} be the smallest solution of the quadratic equation B+(θ+)=0B_{+}(\theta_{+})=0 and let θ\theta_{-} be the largest solution of the equation B(θ)=0B_{-}(\theta_{-})=0.

Moreover, since the map θr(θ)\theta\mapsto r(\theta) is non-increasing, with probability at least 12ϵ1-2\epsilon,

The proof of this lemma is also an obvious consequence of the previous proposition and of the definitions of θ+\theta_{+} and θ\theta_{-}. Optimizing the choice of α\alpha provides

Let us assume that n>2log(ϵ1)n>2\log(\epsilon^{-1}) and let us consider

In this case θ+=m+η\theta_{+}=m+\eta and θ=mη\theta_{-}=m-\eta, so that with probability at least 12ϵ1-2\epsilon,

In the same way, if we want to make a choice of α\alpha independent from ϵ\epsilon, we can choose

In this latter case, with probability at least 12ϵ1-2\epsilon,

In the following plots, we compare the bounds on the deviations of our M-estimator θ^α\widehat{\theta}_{\alpha} with the deviations of the empirical mean M=1ni=1nYi\displaystyle M=\frac{1}{n}\sum_{i=1}^{n}Y_{i} when the sample distribution is Gaussian and when it belongs to the model A1\mathcal{A}_{1} defined in the introduction by equation (1.1, page 1.1). (The bounds for the empirical mean will be explained and proved in subsequent sections.)

More precisely, the deviation upper bounds for our estimator θ^α\widehat{\theta}_{\alpha} for the worst sample distribution in A1\mathcal{A}_{1}, the model defined by (1.1, page 1.1), are compared with the exact deviations of the empirical mean MM of a Gaussian sample. This is the minimax bound at all confidence levels in the Gaussian model, as will be proved later on. Consequently, the deviations of our estimator cannot be smaller for the worst sample distribution in A1\mathcal{A}_{1}, which contains Gaussian distributions. We see on the plots that starting from ϵ=0.1\epsilon=0.1 (corresponding to a 80%80\% confidence level), our upper bound is close to being minimax, not only in A1\mathcal{A}_{1}, but also in the small Gaussian sub-model. This shows that the deviations of our estimator are close to reach the minimax bound in any intermediate model containing Gaussian distributions and contained in A1\mathcal{A}_{1}.

Our estimator is also compared with the deviations of the empirical mean for the worst distribution in A1\mathcal{A}_{1}, (to be established later). In particular the lower bound proves that there are sample distributions in A1\mathcal{A}_{1} for which the deviations of the empirical mean are far from being optimal, showing the need to introduce a new estimator to correct this.

In the first plot, we chose a sample size n=100n=100 and plotted the deviations against the confidence level (or rather against ϵ\epsilon, the confidence level being 12ϵ1-2\epsilon).

As shown on the second plot, showing a wider range of ϵ\epsilon values, our bound stays close to the Gaussian bound up to very high confidence levels (up to ϵ=109\epsilon=10^{-9} and more). On the other hand, it already outperforms the empirical mean by a factor larger than two at confidence level 98%98\% (that is for ϵ=102\epsilon=10^{-2}).

When we increase the sample size to n=500n=500, the performance of our M-estimator is even closer to optimal.

\thetitle. Adapting to an unknown variance

In this section, we will use Lepski’s renowned adaptation method when nothing is known, except that the variance is finite. Under so uncertain, (but unfortunately so frequent) circumstances, it is impossible to provide any observable confidence intervals, but it is still possible to define an adaptive estimator and to bound its deviations by unobservable bounds (depending on the unknown variance). To understand the subject of this section, one should keep in mind that adapting to the variance is a weaker requirement than estimating the variance : estimating the variance at any predictable rate would require more prior information (bearing for instance on higher moments of the sample distribution).

The idea of Lepski’s method is powerful and simple : consider a sequence of confidence intervals obtained by assuming that the variance is bounded by a sequence of bounds vkv_{k} and pick up as an estimator the middle of the smallest interval intersecting all the larger ones. For this to be legitimate, we need all the confidence regions for which the variance bound is valid to hold together, which is performed using a union bound.

Let us describe this idea more precisely. Let θ^(vmax)\widehat{\theta}(v_{\max}) be some estimator of the mean depending on some assumed variance bound vmaxv_{\max}, as the ones described in the beginning of this paper. Let δ(vmax,ϵ)\mathdsR+{+}\delta(v_{\max},\epsilon)\in\mathds{R}_{+}\cup\{+\infty\} be some deviation boundAvmax\mathcal{A}_{v_{\max}} is defined by (1.1, page 1.1) proved in Avmax\mathcal{A}_{v_{\max}}: namely let us assume that for any sample distribution in Avmax\mathcal{A}_{v_{\max}}, with probability at least 12ϵ1-2\epsilon,

Let us also decide by convention that δ(vmax,0)=+\delta(v_{\max},0)=+\infty.

Let νM+1(\mathdsR+)\nu\in\mathcal{M}_{+}^{1}(\mathds{R}_{+}) be some coding atomic probability measure on the positive real line, which will serve to take a union bound on a (countable) set of possible values of vmaxv_{\max}.

We can choose for instance for ν\nu the following coding distribution : expressing vmaxv_{\max} by comparison with some reference value VV,

and otherwise we set ν(vmax)=0\nu(v_{\max})=0. It is easy to see that this defines a probability distribution on \mathdsR+\mathds{R}_{+} (supported by dyadic numbers scaled by the factor V). It is clear that, as far as possible, the reference value VV should be chosen as close as possible to the true variance vv.

Another possibility is to set for some parameters V\mathdsRV\in\mathds{R}, ρ>1\rho>1 and s\mathdsNs\in\mathds{N},

Let us consider for any vmaxv_{\max} such that δ(vmax,ϵν(vmax))<+\delta(v_{\max},\epsilon\nu(v_{\max}))<+\infty the confidence interval

Let us put I(vmax)=\mathdsRI(v_{\max})=\mathds{R} when δ(vmax,ϵν(vmax))=+\delta(v_{\max},\epsilon\nu(v_{\max}))=+\infty.

Let us consider the non-decreasing family of closed intervals

(In this definition, we can restrict the intersection to the support of ν\nu, since otherwise I(vmax)=\mathdsRI(v_{\max})=\mathds{R}.) A union bound shows immediately that with probability at least 12ϵ1-2\epsilon, mJ(v)m\in J(v), implying as a consequence that J(v)J(v)\neq\varnothing.

Since v1J(v1)v_{1}\mapsto J(v_{1}) is a non-decreasing family of closed intervals, the intersection

is a non-empty closed interval, and we can therefore pick up an adaptive estimator θ~\widetilde{\theta} belonging to it, choosing for instance the middle of this interval.

With probability at least 12ϵ1-2\epsilon, mJ(v)m\in J(v), which implies that J(v)J(v)\neq\varnothing, and therefore that θ~J(v)\widetilde{\theta}\in J(v).

Thus with probability at least 12ϵ1-2\epsilon

If the confidence bound δ(vmax,ϵ)\delta(v_{\max},\epsilon) is homogeneous, in the sense that

as it is the case in Proposition 2.4 (page 2.4) with

then with probability at least 12ϵ1-2\epsilon,

Thus in the case when ν\nu is defined by equation (3.1, page 3.1) and log(v/V)2slog(ρ)\lvert\log(v/V)\rvert\leq 2s\log(\rho), with probability at least 12ϵ1-2\epsilon

Let us see what happens for a sample size of n=500n=500, when we assume that log(v/V)2log(100)\lvert\log(v/V)\rvert\leq 2\log(100) and we take ρ=1.05\rho=1.05. This plot shows that, for a sample of size n=500n=500, there are sample distributions with a finite variance for which the deviations of the empirical mean blow up for confidence levels higher than 99%99\%, where as the deviations of our adaptive estimator remain under control, even at confidence levels as high as 11091-10^{-9}.

The conclusion is that if our aim is to minimize the worst estimation error over 100 statistical experiments or more and we have no information on the standard deviation except that it is in some range of the kind (1/100,100)(1/100,100) (which is pretty huge and could be increased even more if desired), then the performance of the empirical mean estimator for the worst sample distribution breaks down but thresholding very large outliers as θ~\widetilde{\theta} does can cure this problem.

\thetitle. Mean and variance estimates depending on the kurtosis

Situations where the variance is unknown are likely to happen. We have seen in the previous section how to adapt to an unknown variance. The price to pay is a loss of a factor two in the deviation bound, and the fact that it is no longer observable.

Here we will make hypotheses under which it is possible to estimate both the variance and the mean, and to obtain an observable confidence interval, without loosing a factor two as in the previous section. Making use of the kurtosis parameter is the most natural way to achieve these goals in the framework of our approach. This is what we are going to do here.

In this section, we are going to consider an alternative to the unbiased usual variance estimate

We will assume that the fourth moment \mathdsE(Y4)\mathds{E}(Y^{4}) is finite and that some upper bound is known for the kurtosis

Our aim will be, as before with the mean, to define an estimate with better deviations than V^\widehat{V}. We will use κ\kappa in the following computations, but when only an upper bound is known, κ\kappa can be replaced with this upper bound in the definition of the estimator and the estimates of its performance.

We will develop some kind of block threshold scheme, introducing

where ψ\psi is a non-decreasing influence function satisfying (2.1, page 2.1).

If ψ\psi were replaced with the identity, we would have \mathds{E}\bigl{[}Q_{\delta}(\beta)\bigr{]}=\beta v-\delta. The idea is to solve Qδ(β^)=0Q_{\delta}(\widehat{\beta})=0 in β^\widehat{\beta} and to estimate vv by δ/β^\delta/\widehat{\beta}. Anyhow, for technical reasons, we will adopt a slightly different definition for β^\widehat{\beta} as well as for the estimate of vv, as we will show now. Let us first get some deviation inequalities for Qδ(β)Q_{\delta}(\beta), derived as usual from exponential bounds. It is straightforward to see that

and for any distinct values of ii, jj and ss,

Let χ=κ1+2p1pκ1\displaystyle\chi=\kappa-1+\frac{2}{p-1}\underset{p\rightarrow\infty}{\simeq}\kappa-1. For any given β1,β2\mathdsR+\beta_{1},\beta_{2}\in\mathds{R}_{+}, with probability at least 12ϵ11-2\epsilon_{1},

Let us define, for some parameter y\mathdsRy\in\mathds{R}, β^\widehat{\beta} as

so that Q(β2)>yQ(\beta_{2})>-y. Let us put ξ=δβ1v\xi=\delta-\beta_{1}v and let us choose ξ\xi such that Q(β1)<yQ(\beta_{1})<-y. This implies that ξ\xi is solution of

Provided that (1+ζδ)24(1+ζ)y(1+\zeta\delta)^{2}\geq 4(1+\zeta)y, the smallest solution of this equation is

With these parameters, with probability at least 12ϵ11-2\epsilon_{1}, Q\bigl{[}(\delta-\xi)/v\bigr{]}<-y<Q(\delta/v), implying that

In order to minimize ξδ2yδ\displaystyle\frac{\xi}{\delta}\simeq\frac{2y}{\delta}, we are led to take δ=2plog(ϵ11)χq\displaystyle\delta=\sqrt{\frac{2p\log(\epsilon_{1}^{-1})}{\chi q}}. We get

Under condition (4.2), with probability at least 12ϵ11-2\epsilon_{1},

A simpler result is obtained by choosing ξ=2y(1+2y)\displaystyle\xi=2y(1+2y), (the values of yy and δ\delta being kept the same, so that we modify only the choice of β1\beta_{1} through a different choice of ξ\xi). In this case, Q(β1)<yQ(\beta_{1})<-y as soon as

which is true as soon as (1+2y)22\displaystyle(1+2y)^{2}\leq 2 and δy2\displaystyle\frac{\delta}{y}\geq 2, yielding the simplified condition

In this case, we get with probability at least 12ϵ11-2\epsilon_{1} that

Under condition (4.3), with probability at least 12ϵ11-2\epsilon_{1},

Recalling that χ=κ1+2p1\displaystyle\chi=\kappa-1+\frac{2}{p-1}, we can choose in the previous proposition the approximately optimal block size

For this choice of parameter, as soon as the kurtosis (or its upper bound) κ3\kappa\geq 3, under the condition

with probability at least 12ϵ111-2\epsilon_{1}^{-1},

This is the asymptotics we hoped for, since the variance of (Yim)2(Y_{i}-m)^{2} is equal to (κ1)v2(\kappa-1)v^{2}. The proof is page 7.1.

Let us plot some curves, showing the tighter bounds of Proposition 4.1 (page 4.1), with optimal choice of pp. We compare our deviation bounds with the exact deviation quantiles of the variance estimate of equation (4.1, page 4.1) applied to a Gaussian sample, (given by a χ2\chi^{2} distribution). This demonstrates that we can stay of the same order under much weaker assumptions.

\thetitle. Mean estimate under a kurtosis assumption

Here, we are going to plug a variance estimate v^\widehat{v} into a mean estimate. Let us therefore assume that v^\widehat{v} is a variance estimate such that with probability at least 12ϵ11-2\epsilon_{1},

This estimate may for example be the one defined in the previous section. Let α^\widehat{\alpha} be some estimate of the desired value of the parameter α\alpha, to be defined later as a function of v^\widehat{v}. Let us define θ^=θ^α^\widehat{\theta}=\widehat{\theta}_{\widehat{\alpha}} by

where ψ\psi is the narrow influence function defined by equation (2.3, page 2.3). As usual, we are looking for non-random values θ\theta_{-} and θ+\theta_{+} such that with large probability r(θ+)<0<r(θ)r(\theta_{+})<0<r(\theta_{-}), implying that θ<θ^<θ+\theta_{-}<\widehat{\theta}<\theta_{+}. But there is a new difficulty, caused by the fact that α^\widehat{\alpha} will be an estimate depending on the value of the sample. This problem will be solved with the help of PAC-Bayes inequalities.

To take advantage of PAC-Bayes theorems, we are going to compare α^\widehat{\alpha} with a perturbation α~\widetilde{\alpha} built with the help of some supplementary random variable. Let indeed UU be some uniform real random variable on the interval (1,+1)(-1,+1), independent from everything else. Let us consider

Let ρ\rho be the distribution of α~\widetilde{\alpha} given the sample value. We are going to compare this posterior distribution (meaning that it depends on the sample), with a prior distribution that is independent from the sample. Let π\pi be the uniform probability distribution on the interval \Bigl{(}\alpha\bigl{[}\exp\bigl{(}-\zeta/2)-x\sinh\bigl{(}\zeta/2\bigr{)}\bigr{]},\alpha\bigl{[}\exp\bigl{(}\zeta/2\bigr{)}+x\sinh\bigl{(}\zeta/2\bigr{)}\bigr{]}\Bigr{)}. Let us assume that α^\widehat{\alpha} and α\alpha are defined with the help of some positive constant cc as α^=cv^\displaystyle\widehat{\alpha}=\sqrt{\frac{c}{\widehat{v}}} and α=cv\displaystyle\alpha=\sqrt{\frac{c}{v}}. In this case, with probability at least 12ϵ11-2\epsilon_{1}

As a result, with probability at least 12ϵ11-2\epsilon_{1},

Indeed, whenever equation (4.6) holds, the (random) support of ρ\rho is included in the (fixed) support of π\pi, so that the relative entropy of ρ\rho with respect to π\pi is given by the logarithm of the ratio of the lengths of their supports.

Let us now upper-bound \psi\bigl{[}\widehat{\alpha}(Y_{i}-\theta)\bigr{]} as a suitable function of ρ\rho.

For any posterior distribution ρ\rho and any fL2(ρ)f\in L_{2}(\rho),

where a4.43a\leq 4.43 is the numerical constant of equation (7.1, page 7.1).

Applying this inequality to f(β)=β(Yiθ)f(\beta)=\beta(Y_{i}-\theta) in the first place and f(β)=β(θYi)f(\beta)=\beta(\theta-Y_{i}) to get the reversed inequality, we obtain

Let us now recall the fundamental PAC-Bayes inequality, concerned with any function (β,y)f(β,y)\mathdsL1(π\mathdsP)(\beta,y)\mapsto f(\beta,y)\in\mathds{L}_{1}(\pi\otimes\mathds{P}), where \mathdsP\mathds{P} is the distribution of YY, such that inff>1\inf f>-1.

according to [7, page 159] and Fubini’s lemma for the last equality. Thus, from Chebyshev’s inequality, with probability at least 1ϵ21-\epsilon_{2},

Applying this inequality to the case we are interested in, that is to equations (4.7) and (4.8), we obtain with probability at least 12ϵ12ϵ21-2\epsilon_{1}-2\epsilon_{2} that

Let us put θ+m=mθ=γv\theta_{+}-m=m-\theta_{-}=\sqrt{\gamma v} and let us look for some value of γ\gamma ensuring that r(θ+)<0<r(θ)r(\theta_{+})<0<r(\theta_{-}), implying that θθ^θ+\theta_{-}\leq\widehat{\theta}\leq\theta_{+}. Let us choose

assuming that xx will be chosen later on such that

we obtain with probability at least 12ϵ12ϵ21-2\epsilon_{1}-2\epsilon_{2}

Therefore, if we choose γ\gamma such that γv=αv(1+γ)cosh(ζ/2)\sqrt{\gamma v}=\alpha v(1+\gamma)\cosh(\zeta/2), we obtain with probability at least 12ϵ12ϵ21-2\epsilon_{1}-2\epsilon_{2} that r(θ+)<0<r(θ)r(\theta_{+})<0<r(\theta_{-}), and therefore that θ<θ^<θ+\theta_{-}<\widehat{\theta}<\theta_{+}. The corresponding value of γ\gamma is γ=η/(1η)\gamma=\eta/(1-\eta), where

With probability at least 12ϵ12ϵ21-2\epsilon_{1}-2\epsilon_{2}, the estimator θ^α^\widehat{\theta}_{\widehat{\alpha}} defined by equation (4.5, page 4.5), where α^\widehat{\alpha} is set as in (4.9, page 4.9), satisfies

The optimal value of xx is the one minimizing

Assuming ζ\zeta to be small, the optimal xx will be large, so that log(1+x1)x1\log(1+x^{-1})\simeq x^{-1}, and we can choose the approximately optimal value

Let us discuss now the question of balancing ϵ1\epsilon_{1} and ϵ2\epsilon_{2}. Let us put ϵ=ϵ1+ϵ2\epsilon=\epsilon_{1}+\epsilon_{2} and let y=ϵ1/ϵy=\epsilon_{1}/\epsilon. Optimizing yy for a fixed value of ϵ\epsilon could be done numerically, although it seems difficult to obtain a closed formula. However, the entropy term in η\eta can be written as log(1+x1)+log(ϵ1)log(1y)\log(1+x^{-1})+\log(\epsilon^{-1})-\log(1-y). Since ζ\zeta decreases, and therefore the almost optimal xx above increases, when yy increases, we will get an optimal order of magnitude (up to some constant less than 22) for the bound if we balance log(1y)-\log(1-y) and log(1+x1)\log(1+x^{-1}), resulting in the choice y=(1+x)1y=(1+x)^{-1}, where xx is approximately optimized as stated above (this choice of xx depends on yy, so we end up with an equation for xx, which can be solved using an iterative approach). This results, with probability at least 12ϵ1-2\epsilon, in an upper bound for θ^m\lvert\widehat{\theta}-m\rvert equivalent for large values of the sample size nn to

Thus we recover, as desired, the same asymptotics as when the variance is known.

Let us illustrate what we get when n=500n=500 or n=1000n=1000 and it is known that κ3\kappa\leq 3.

On these figures, we have plotted upper and lower bounds for the deviations of the empirical mean when the sample distribution B1,κ\mathcal{B}_{1,\kappa} is defined by (1.2, page 1.2) is the least favourable one in B1,κ\mathcal{B}_{1,\kappa}. (These bounds will be proved later on. The upper bound is computed by taking the minimum of three bounds, explaining the discontinuities of its derivative). What we see on the n=500n=500 example is that our bound remains of the same order as the Gaussian bound up to confidence levels of order 11081-10^{-8}, whereas this is not the case with the empirical mean.

In the case when n=1000n=1000, we see that our estimator possibly improves on the empirical mean in the range of confidence levels going from 11021-10^{-2} to 11061-10^{-6} and is a proved winner in the range going from 11061-10^{-6} to 110141-10^{-14}.

Let us see now the influence of κ\kappa and plot the curves corresponding to increasing values of nn and κ\kappa.

When we double the kurtosis letting n=1000n=1000, we follow the Gaussian curve up to confidence levels of 110101-10^{-10} instead of 110141-10^{-14}. This is somehow the maximum kurtosis for which the bounds are satisfactory for this sample size. Looking at Proposition 4.2 (page 4.2), we see that the bound in first approximation depends on the ratio χ/n=κ/n\chi/n=\kappa/n (when p=3p=3), suggesting that to obtain similar performances, we have to take nn proportional to κ\kappa, which gives a minimum sample size of n=1000κ/6n=1000\kappa/6, if we want to follow the Gaussian curve up to confidence levels of order at least 110101-10^{-10}.

These curves suggest another approach to choose the kurtosis parameter κ\kappa. It is to use the largest value of the kurtosis with a low impact on the bound of Proposition 4.5 (page 4.5), given the sample size. This leads, when in doubt about the true kurtosis value, for sample sizes n1000n\geq 1000, to set, according to the previous rule of thumb, the kurtosis in the definition of the estimator to the value κmax=6n/1000\displaystyle\kappa_{\max}=6n/1000. Doing so, we get almost the same deviations as if the sample distribution were Gaussian, at levels of confidence up to 110101-10^{-10}, for the largest range of (possibly non-Gaussian) sample distributions.

\thetitle. Upper bounds for the deviations of the empirical mean

In the previous sections, we compared new mean estimators with the empirical mean. We will devote the end of this paper to prove the bounds on the empirical mean used in these comparisons.

This section deals with upper bounds, whereas the next one will study corresponding lower bounds.

Let us start with the case when the sample distribution may be any probability measure with a finite variance. It is natural in this situation to bound the deviations of the empirical mean

applying Chebyshev’s inequality to its second moment, to conclude that

This is in general close to optimal, as will be shown later when we will compute corresponding lower bounds.

When the sample distribution has a finite kurtosis, it is possible to take this into account to refine the bound. The analysis becomes less straightforward, and will be carried out in this section. The following bound uses a truncation argument, allowing to study separately the behaviour of small and large values. It is to our knowledge a new result. We will show later in this paper that its leading term is essentially tight — up to a factor \displaystyle\biggl{(}\frac{\kappa}{\kappa-1}\biggr{)}^{1/4}, — when the proper asymptotic is considered.

For any probability distribution whose kurtosis is not greater than κ\kappa, the empirical mean MM is such that with probability at least 12ϵ1-2\epsilon,

Instead of minimizing the bound in λ\lambda, one can also take for simplicity

We see that there are two regimes in the behaviour of the deviations of MM. A Gaussian regime for levels of confidence less than 11/n1-1/n and long tail regime for higher confidence levels, depending on the value of the kurtosis κ\kappa.

In addition to this, let us also put forward the fact that, even in the simple case when the mean mm is known, estimating the variance under a kurtosis hypothesis at high confidence levels cannot be done using the empirical estimator

Indeed, assuming without loss of generality that m=0m=0 and computing the quadratic mean

we can only conclude, using Chebyshev’s inequality, that with probability at least 12ϵ1-2\epsilon

a bound which blows up at level of confidence ϵ=κ12n\displaystyle\epsilon=\frac{\kappa-1}{2n}, and which we do not suspect to be substantially improvable in the worst case. In contrast to this, Propositions 4.1 and 4.2 (page 4.2) provide variance estimators with high confidence levels.

\thetitle. Lower bounds

This lower bound is well known. We recall it here for the sake of completeness.

The empirical mean has optimal deviations when the sample distribution is Gaussian in the following precise sense.

For any estimator of the mean θ^:\mathdsRn\mathdsR\widehat{\theta}:\mathds{R}^{n}\rightarrow\mathds{R}, any variance value v>0v>0, and any deviation level η>0\eta>0, there is some Gaussian measure N(m,v)\mathcal{N}(m,v) (with variance vv and mean mm) such that the i.i.d. sample of length nn drawn from this distribution is such that

where M=1ni=1nYi\displaystyle M=\frac{1}{n}\sum_{i=1}^{n}Y_{i} is the empirical mean.

This means that any distribution free symmetric confidence interval based on the (supposedly known) value of the variance has to include the confidence interval for the empirical mean of a Gaussian distribution, whose length is exactly known and equal to the properly scaled quantile of the Gaussian measure.

Let us state this more precisely. With the notations of the previous proposition

where GG is the standard normal measure and FF its distribution function.

The upper bounds proved in this paper can be decomposed into

although we preferred for simplicity to state them in the slightly weaker form \mathdsP(θmη)2ϵ\mathds{P}(\lvert\theta-m\rvert\geq\eta)\leq 2\epsilon.

As the Gaussian shift model made of Gaussian sample distributions with a given variance and varying means, is included in all the models we are considering in this paper, we necessarily should have according to the previous proposition

Therefore some visualisation of the quality of our bounds can be obtained by plotting ϵη\displaystyle\epsilon\mapsto\eta against ϵvnF1(1ϵ)\displaystyle\epsilon\mapsto\sqrt{\frac{v}{n}}F^{-1}(1-\epsilon), as we did in the previous sections.

Let us remark eventually that the assumed symmetry of the confidence region is not a real limitation. Indeed, if we can prove for any given estimator θ^\widehat{\theta} that for any Gaussian sample distribution with a given variance vv,

then we may consider for any value of ϵ\epsilon the estimator with symmetric confidence levels defined as

This symmetric estimator is such that for any Gaussian sample distribution with variance vv,

Thus, applying the previous proposition, we obtain that

\thetitle. Lower bound for the deviations of the empirical mean depending on the variance

In the following proposition, we state a lower bound for the deviations of the empirical mean when the sample distributionAvmax\mathcal{A}_{v_{\max}} is defined by (1.1, page 1.1) is the least favourable in Avmax\mathcal{A}_{v_{\max}} (meaning the distribution for which the deviations of the empirical mean are the largest).

For any value of the variance vv, any deviation level η>0\eta>0, there is some distribution with variance vv and mean such that the i.i.d. sample of size nn drawn from it satisfies

Thus, as soon as ϵ(2e)1\epsilon\leq(2e)^{-1}, with probability at least 2ϵ2\epsilon,

Let us remark that this bound is pretty tight, since, according to equation (5.1, page 5.1) with probability at least 12ϵ1-2\epsilon,

This can also be observed on the plots following Proposition 2.4 (page 2.4).

\thetitle. Lower bound for the deviations of empirical mean depending on the variance and the kurtosis

Let us now refine the previous lower bound by taking into account the kurtosis κ\kappa of the sample distribution, assuming of course that it is finite.

As soon as ϵ1n16\epsilon^{-1}\geq n\geq 16, there exists a sample distribution with mean mm, finite variance vv and finite kurtosis κ\kappa, such that with probability at least 2ϵ2\epsilon,

Let us remark that the asymptotic behaviour of this lower bound when nϵn\epsilon and log(ϵ1)/n\log(\epsilon^{-1})/n both tend to zero matches the upper bound of Proposition 5.1 (page 5.1) up to a multiplicative factor (κκ1)1/41.11\left(\frac{\kappa}{\kappa-1}\right)^{1/4}\leq 1.11 when the kurtosis is κ3\kappa\geq 3, which is the kurtosis of the Gaussian distribution.

The plots following Proposition 4.5 (page 4.5) show that this lower bound is not too far from the upper bound obtained by combining Proposition 5.1 (page 5.1), Proposition 7.3 (page 7.3) and equation (5.1, page 5.1).

\thetitle. Proofs

Let us remark first that condition (4.4, page 4.4) implies condition (4.3, page 4.3), as can be easily checked. Putting \displaystyle x=\sqrt{\frac{n}{(\kappa-1)\bigl{[}4\log(\epsilon_{1}^{-1})+1/2\bigr{]}}}, so that p=xp=\lfloor x\rfloor, we can also check that p1x/2p-1\geq x/2 and np+1n/2n-p+1\geq n/2. We can then write

\thetitle. Proof of Lemma 4.4 (page 4.4)

Let us introduce some modification of ψ\psi in order to improve the compromise between infψ\inf\psi^{\prime\prime} and supψ\sup\psi. Let us put ψ(x)=log(1+x+x2/2)\overline{\psi}(x)=\log(1+x+x^{2}/2). We would like to squeeze some function χ\chi between ψ\psi and ψ\overline{\psi}, in such a way that infχ=infψ\inf\chi^{\prime\prime}=\inf\overline{\psi}^{\prime\prime}. This will be better than using ψ\psi itself since

Indeed these two values can be computed in the following way. Let us put \varphi(x)=\exp\bigl{[}\,\overline{\psi}(x)\bigr{]}=1+x+x^{2}/2. It is easy to check that

implying that ψ(x)1/4\overline{\psi}^{\prime\prime}(x)\geq-1/4. This inequality becomes an equality when φ(x)=2\varphi(x)=2, that is when x=310.73x=\sqrt{3}-1\simeq 0.73. In the same way ψ(x)2\psi^{\prime\prime}(x)\geq-2 and equality is reached when φ(x)=1/2\varphi(-x)=1/2, that is when x=1x=1. We are going to build a function χ\chi which follows ψ\psi when xx1x\leq x_{1}, where x1x_{1} satisfies ψ(x1)=1/4\psi^{\prime\prime}(x_{1})=-1/4. The value of x1x_{1} is computed from the equation φ(x1)1=(1+2)/2\varphi(-x_{1})^{-1}=(1+\sqrt{2})/2. Let y1=ψ(x1)y_{1}=\psi(x_{1}) and p1=ψ(x1)p_{1}=\psi^{\prime}(x_{1}). They have the following values

After x1x_{1}, we continue χ\chi with a quadratic function, until its derivatives cancels. Thus, the second derivative of χ\chi being less than the second derivative of ψ\overline{\psi} at each point of the positive real line, we are sure that χ(x)ψ(x)\chi(x)\leq\overline{\psi}(x) for any x\mathdsRx\in\mathds{R}. The function χ\chi built in this way satisfies the equation

As we have proved, and as can be seen on the next plot, the function χ\chi is such that

Let us now compare χ\chi with a suitable convex function (in order to apply Jensen’s inequality). Let us introduce to this purpose the function χx=χ(x)+18(xx)2\overline{\chi}_{x_{*}}=\chi(x)+\frac{1}{8}(x-x_{*})^{2}, which is convex for any choice of the parameter x\mathdsRx_{*}\in\mathds{R}.

Let us consider as in the lemma we have to prove some function fL2(ρ)f\in L_{2}(\rho) and let us choose x=ρ(dβ)f(β)x_{*}=\int\rho(d\beta)f(\beta) and put {\textstyle\int}\rho(d\beta)\bigl{[}f(\beta)-{\textstyle\int}\rho f\bigr{]}^{2}=\operatorname{\mathds{V}ar}_{\rho}(f). We obtain by Jensen’s inequality

For any posterior distribution ρ\rho and any fL2(ρ)f\in L_{2}(\rho),

To end the proof of Lemma 4.4 (page 4.4), it remains to establish that for any x\mathdsRx\in\mathds{R} and y\mathdsR+y\in\mathds{R}_{+},

Since \displaystyle\exp\bigl{[}\chi(x)\bigr{]}\leq 1+x+\frac{x^{2}}{2}, the right-hand side of this inequality is less than

As \displaystyle y\mapsto y^{-1}\Bigl{[}\exp\Bigl{(}\frac{y}{8}\Bigr{)}-1\Bigr{]} is increasing on \mathdsR+\mathds{R}_{+}, this last expression reaches its maximum when xargmaxχx\in\arg\max\chi and \displaystyle\exp\Bigl{(}\frac{y}{8}\Bigr{)}=4, and is then equal to \displaystyle\frac{3\exp\bigl{[}\sup(\chi)\bigr{]}}{4\log(4)}, which is the value stated for aa in equation (7.1) above.

\thetitle. Proof of Proposition 5.1 (page 5.1)

This function could also be defined as g(x)=xψ(x)g(x)=x-\psi(x), where ψ\psi is the wide version of the influence function defined by equation (2.2, page 2.2). It is such that

As it is also obvious that g(x)x\lvert g(x)\rvert\leq\lvert x\rvert, we get

where \displaystyle G_{i}=g\bigl{[}\alpha(Y_{i}-m)\bigr{]} and ψ\psi is the wide influence function of equation (2.2, page 2.2). As we have already seen in Proposition 2.2 (page 2.2), with probability at least 12ϵ11-2\epsilon_{1},

On the other hand with probability at least 1ϵ21-\epsilon_{2}

Let us put Hi=GiH_{i}=\lvert G_{i}\rvert and let us compute

Thus with probability at least 12ϵ1ϵ21-2\epsilon_{1}-\epsilon_{2},

and let us put ϵ1=λϵ\epsilon_{1}=\lambda\epsilon and ϵ2=(1λ)2ϵ\epsilon_{2}=(1-\lambda)2\epsilon.

The bound can either be optimized in λ\lambda or we can for simplicity choose λ\lambda to balance the following factors

As stated in Proposition 5.1 (page 5.1), with probability at least 12ϵ1-2\epsilon,

Another bound can be obtained applying Chebyshev’s inequality directly to the fourth moment of the empirical mean, which however does not reach the right speed when ϵ\epsilon is small and nn large.

For any probability distribution whose kurtosis is not greater than κ\kappa, the empirical mean MM is such that with probability at least 12ϵ1-2\epsilon,

Let us assume to simplify notations and without loss of generality that \mathdsE(Y)=0\mathds{E}(Y)=0.

and the result is proved by considering \displaystyle 2\epsilon=\frac{\bigl{[}3(n-1)+\kappa\bigr{]}v^{2}}{n^{3}\eta^{4}}. ∎

In our comparisons with new estimators, we took the minimum over the three bounds given by Proposition 5.1 (page 5.1), Proposition 7.3 (page 7.3) and equation (5.1, page 5.1).

\thetitle. Proof of Proposition 6.1 (page 6.1)

Let us consider the distributions \mathdsP1\mathds{P}_{1} and \mathdsP2\mathds{P}_{2} of the sample (Yi)i=1n(Y_{i})_{i=1}^{n} obtained when the marginal distributions are respectively the Gaussian measure with variance vv and mean m1=ηm_{1}=-\eta and the Gaussian measure with variance vv and mean m2=ηm_{2}=\eta. We see that, whatever the estimator θ^\widehat{\theta},

where \mathdsP1\mathdsP2\mathds{P}_{1}\wedge\mathds{P}_{2} is the measure whose density with respect to the Lebesgue measure (or equivalently with respect to any dominating measure, such as \mathdsP1+\mathdsP2\mathds{P}_{1}+\mathds{P}_{2}) is the minimum of the densities of \mathdsP1\mathds{P}_{1} and \mathdsP2\mathds{P}_{2} and whose total variation is \mathdsP1\mathdsP2\lvert\mathds{P}_{1}\wedge\mathds{P}_{2}\rvert.

Now, using the fact that the empirical mean is a sufficient statistics of the Gaussian shift model, it is easy to realize that

\thetitle. Proof of Proposition 6.2 (page 6.2)

Let us consider the distribution with support {nη,0,nη}\{-n\eta,0,n\eta\} defined by

It satisfies \mathdsE(Y)=0\mathds{E}(Y)=0, \mathdsE(Y2)=v\mathds{E}(Y^{2})=v and

\thetitle. Proof of Proposition 6.3 (page 6.3)

Let us consider for YY the following distribution, with support {nη,ξ,ξ,nη}\{-n\eta,-\xi,\xi,n\eta\}, where ξ\xi and η\eta are two positive real parameters, to be adjusted to obtain the desired variance and kurtosis.

It is easy to see that fqf_{q} is an increasing one to one mapping of (1,+((1,+\infty( into itself. We obtain

and let us assume that ϵ116\displaystyle\epsilon\leq\frac{1}{16}. Let us put

the last inequality being a consequence of the convexity of x(1x)n11(n1)x1nx,0x1x\mapsto(1-x)^{n-1}\geq 1-(n-1)x\geq 1-nx,0\leq x\leq 1. Let us remark then that

Thus, with probability at least 2ϵ2\epsilon,

In order to obtain Proposition 6.3 (page 6.3), it is enough to restrict optimization with respect to χ\chi to the two values

\thetitle. Generalization to non-identically distributed samples

The assumption that the sample is identically distributed can be dropped in Proposition 2.4 (page 2.4). Indeed, assuming only that the random variables (Yi)i=1n(Y_{i})_{i=1}^{n} are independent, meaning that their joint distribution is of the product form i=1n\mathdsPi\bigotimes_{i=1}^{n}\mathds{P}_{i}, we can still write, for Wi=±α(Yiθ)W_{i}=\pm\alpha(Y_{i}-\theta),

Starting from these exponential inequalities, we can reach the same conclusions as in Proposition 2.4 (page 2.4), as long as we set

We see that the mean marginal sample distribution 1ni=1n\mathdsPi\displaystyle\frac{1}{n}\sum_{i=1}^{n}\mathds{P}_{i} is playing here the same role as the marginal sample distribution in the i.i.d. case.

\thetitle. Experiments

Theoretical bounds can explore high confidence levels better than experiments, and have also the advantage to hold true in the worst case. They have led us to introduce new M-estimators, and in particular the one described in Proposition 4.5 (page 4.5). Nonetheless, they may be expected to be rather pessimistic and are clearly insufficient to explore the moderate deviations of the proposed estimators. In particular it would be interesting to know whether the improvement in high confidence deviations has been obtained as a trade-off between large and moderate deviations (by which we mean a trade-off between the left part and the tail of the quantile function of θ^m\lvert\widehat{\theta}-m\rvert).

This is a good reason to launch into some experiments. We are going to test sample distributions of the form

where d{1,2,3}d\in\{1,2,3\}, (pi)i=1,,d(p_{i})_{i=1,\dots,d} is a vector of probabilities and N(m,σ2)\mathcal{N}(m,\sigma^{2}) is as usual the Gaussian measure with mean mm and variance σ2\sigma^{2}. To visualize results, we have chosen to plot the quantile function of the deviations from the true parameter, that is the quantile function of θ^m\lvert\widehat{\theta}-m\rvert, where θ^\widehat{\theta} is one of the mean estimators studied in this paper.

Let us start with an asymmetric distribution with a so to speak intermittent high variance component. Let us take accordingly

In this case, m=1m=1, κ=27.86\kappa=27.86 and v=93.5v=93.5, so that, when the sample size nn is in the range 100n1000100\leq n\leq 1000, the variance estimates we are proposing in this paper are not proved to be valid. For this reason we will challenge the empirical mean with the two following estimates : the estimate θ^α\widehat{\theta}_{\alpha} of Proposition 2.4 (page 2.4) (where α\alpha is chosen using the true value of vv) and a naive plug-in estimate, θ^α^\widehat{\theta}_{\widehat{\alpha}}, where α^\widehat{\alpha} is set as was α\alpha, replacing the true variance vv with its unbiased estimate V^\widehat{V} given by equation (4.1, page 4.1). The parameter ϵ\epsilon is set to ϵ=0.05\epsilon=0.05 for both estimators, targeting the probability level 12ϵ=0.91-2\epsilon=0.9.

We will plot also the sample median estimator, in this case where the distribution median is different from its mean, to show that robust location estimators for symmetric distributions do not apply here.

When the sample size is n=100n=100, we obtain the following results (computed from 10001000 experiments). In this first example, the new M-estimators have uniformly better quantiles than the empirical mean, at any probability level. Moreover the variance can be harmlessly estimated from the data when it is unknown. Thus, in this case, the empirical mean is outperformed from any conceivable point of view.

Let us now increase the sample size to n=1000n=1000. As should be expected, the values of the three estimators get close for this larger sample size (whereas it becomes more obvious that the empirical median is estimating something different from the mean).

The empirical mean can still be challenged for this larger sample size, but for a different sample distribution. To illustrate this, let us consider a situation as simple as the mixture of two centered Gaussian measures. Let d=2d=2 and

Here κ=243.5\kappa=243.5, v=9.99v=9.99 and m=0m=0. We take ϵ=0.005\epsilon=0.005, targeting the probability level 12ϵ=110/n=0.991-2\epsilon=1-10/n=0.99.

Let us show some heavily asymmetric situation where the left-hand side of the quantile function of the new estimators does not improve on the empirical mean. In what follows κ=33.4\kappa=33.4, v=72.25v=72.25, m=1.3m=-1.3, and the mixture parameters are

We plot below two estimators using the value of the variance, optimized for the confidence level 12ϵ1-2\epsilon with ϵ=0.05\epsilon=0.05 and ϵ=0.0005\epsilon=0.0005 respectively (the estimators with unknown variance show the same behaviour).

Here, choosing a moderate value of the estimator parameter ϵ\epsilon is required to improve uniformly on the empirical mean performance, whereas higher values of ϵ\epsilon produce a trade-off between low and high probability levels. Whether this remains true in general would require to be confirmed by more extensive experiments.

Let us end this section with a Gaussian sample.

When the sample is Gaussian, as could be expected, our new M-estimators coincide with the empirical mean. What we obtained for a sample size n=1000n=1000 could also be observed for other sample sizes. The deviations of the empirical median are higher in the Gaussian case, as proved in Proposition 6.1 (page 6.1) (stating that the deviations of the empirical mean of a Gaussian sample are optimal).

\thetitle. Variance estimators

Let us now test some variance estimates. This is an example where the usual unbiased estimate V^\widehat{V} defined by equation (4.1, page 4.1) shows its weakness. To demonstrate things on simple sample distributions, we choose again a mixture of two Gaussian measures

Here κ=10.357\kappa=10.357, v=1.125v=1.125 and m=0.005m=0.005. To be in a situation where the variance estimates of Proposition 4.1 (page 4.1) work at high confidence levels, we choose a sample size n=2000n=2000, and use in the estimator the parameters κmax=6n/1000=12\kappa_{\max}=6*n/1000=12, p=2p=2 and ϵ=0.0025\epsilon=0.0025 (targeting the probability level 12ϵ=110/n=0.09951-2\epsilon=1-10/n=0.0995). So, for the variance as well as for the mean, there are simple situations in which our new estimators perform better in practice than the more traditional ones.

\thetitle. Computation details

To compute the estimators in these experiments, we used the two following iterative schemes (performing two iterations was enough for all the examples shown).

They are based on two principles: they have the desired fixed point and their right-hand side would be independent respectively of θk\theta_{k} and βk\beta_{k} if ψ\psi was replaced with the identity. The fact that ψ\psi is close to the identity explains why the convergence is fast and only a few iterations are required.

These numerical schemes involve only a reasonable amount of computations, opening the possibility to use the new estimators in improved filtering algorithms in signal and image processing (a subject for future research that will not be pushed further in this paper).

\thetitle. Conclusion

Theoretical results show that, for some sample distributions, the deviations of the empirical mean at confidence levels higher than 90%90\% are larger than the deviations of some well chosen M-estimator. Moreover, in our experiments, based on non-Gaussian sample distributions, the deviation quantile function of this M-estimator is uniformly below the quantile function of the empirical mean. The improvement of the confidence interval at level 90%90\% can be more than 25%25\%.

Using Lepski’s adapting approach offers a response with proved properties to the case when the variance is unknown. For sample sizes starting from 10001000, an alternative is to use an M-estimator of the variance depending on some assumption on the value of the kurtosis. However, it seems that the variance can in practice be estimated by the usual unbiased estimator V^\widehat{V}, defined by equation (4.1, page 4.1), and plugged in the estimator of Proposition 2.4 (page 2.4), although there is no mathematical warrant for this simplified scheme.

References