Sub-Gaussian estimators of the mean of a random vector

Gábor Lugosi, Shahar Mendelson

Introduction

The situation is quite different when one is interested in minimizing the value rr that satisfies

where λmax\lambda_{\text{max}} denotes the largest eigenvalue of Σ\Sigma (see Hanson and Wright ). Similar bounds may be proven for the performance of the sample mean if XX has a sub-Gaussian distribution in the sense that for all unit vectors vSd1v\in S^{d-1},

However, when the distribution is not necessarily sub-Gaussian and is possibly heavy-tailed, one cannot expect such a sub-Gaussian behavior of the sample mean. Thus, when is it not reasonable to assume a sub-Gaussian distribution and heavy tails may be a concern, the sample mean is a risky choice. Indeed, alternative estimators have been constructed to achieve better performance.

The one-dimensional case (i.e., d=1d=1) is quite well understood, see Catoni and Devroye, Lerasle, Lugosi, and Oliveira for recent accounts. The so-called median-of-means estimator is a simple and powerful univariate estimator with essentially optimal performance. This estimate was introduced independently in various papers, see Nemirovsky and Yudin , Jerrum, Valiant, and Vazirani , Alon, Matias, and Szegedy . The median-of-means estimator partitions the data into k<Nk<N blocks of size mN/km\approx N/k each, computes the sample mean within each block, and outputs their median. One may easily show (see, e.g., Hsu ) that, for any δ(0,1)\delta\in(0,1) if k=8log(1/δ)k=\lceil 8\log(1/\delta)\rceil, then the resulting estimator μ^N(δ)\widehat{\mu}_{N}^{(\delta)} satisfies that, with probability at least 1δ1-\delta,

where σ2\sigma^{2} denotes the variance of XX. In other words, in the one-dimensional case, the median-of-means estimator achieves a sub-Gaussian performance under the only condition that the variance of XX exists.

The median-of-means estimator has been extended to the multivariate case by replacing the median by its natural multivariate extension, the so-called “geometric (or spatial) median” (i.e., the point that minimizes the sum of the Euclidean distances to the sample means within each block) see Lerasle and Oliveira , Hsu and Sabato , Minsker . In particular, Minsker proves that for each δ(0,1)\delta\in(0,1) this generalization of the median-of-means estimator μ~N(δ)\widetilde{\mu}_{N}^{(\delta)} is such that, with probability at least 1δ1-\delta,

where CC is a universal constant. This bound holds under the only assumption that the covariance matrix exists. However, it does not quite achieve a sub-Gaussian performance bound that resembles (1.1).

Joly, Lugosi, and Oliveira made an attempt to construct a mean estimator with a sub-Gaussian behavior for a large class of distributions. They prove that there exists a mean estimator μ^n(δ)\widehat{\mu}_{n}^{(\delta)} such that, if the distribution satisfies that for all vSd1v\in S^{d-1}

for some constant KK, then for all NCKlogd(d+log(1/δ))N\geq CK\log d\left(d+\log(1/\delta)\right), with probability at least 1δ1-\delta,

where again CC is a universal constant. This bound resembles the sub-Gaussian inequality (1.1). However, there are various caveats: the additional fourth-moment assumption, the requirement that N=Ω(dlogd)N=\Omega(d\log d), and, to a lesser extent, the extra loglogd\log\log d term in the bound seem sub-optimal.

The main result of this paper is that there exists a mean estimator that achieves purely sub-Gaussian performance under the minimal condition that the covariance matrix exists. More precisely, we prove the existence of a mean estimator μ^N(δ)\widehat{\mu}_{N}^{(\delta)} such that, for all distributions with a finite second moment, for all NN, with probability at least 1δ1-\delta,

The proposed estimator may be interpreted as a multivariate median-of-means estimate but with a new notion of a multivariate median which may be interesting in its own right. The construction of the new estimator is inspired by the technique of “median-of-means tournament”, put forward by the authors in .

In the next section we present the proposed estimator and the performance bound. In Section 3 we present the proofs. We finish the paper by remarks about the computation of the estimator.

The estimator

Define the sample mean within each block by

Note that the minimum is always achieved. This follows from the fact that diam(Sac){\rm diam}(S_{a}^{c}) is a continuous function of aa (since, for each aa, SacS_{a}^{c} is the intersection of a finite union of closed balls, and the centers and radii of the closed balls are continuous in aa).

The main result of this paper is the following performance bound:

Thus, the proposed estimator achieves a purely sub-Gaussian performance under minimal conditions. Just like in the case of the median-of-means estimator for the univariate case, the estimator depends on the desired level of confidence δ\delta. As it is shown in , such a dependence cannot be avoided without imposing additional conditions on the distribution. However, following the route laid down in , one may construct sub-Gaussian estimators that work for a wide range of confidence levels simultaneously under more assumptions on the distribution. Since this issue is beyond the scope of this paper and will not be pursued further here.

Theorem 1 is an outcome of the following observation which is of interest in its own right on the geometry of a typical collection {X1,...,XN}\{X_{1},...,X_{N}\}.

Using the same notation as above and setting

The fact that Theorem 2 implies Theorem 1 is straightforward. Indeed, Theorem 2 implies that diam(Sμc)2r{\rm diam}(S_{\mu}^{c})\leq 2r and that if aμr\|a-\mu\|\geq r, then μSac\mu\in S_{a}^{c}. By the definition of SaS_{a}, one always has aSaca\in S_{a}^{c}, and thus if aμ>2r\|a-\mu\|>2r then diam(Sac)>2r{\rm diam}(S_{a}^{c})>2r. Therefore, the minimizer μ^\widehat{\mu} must satisfy that μ^μ2r\|\widehat{\mu}-\mu\|\leq 2r, as required.

We do not claim that the values of the constants appearing in Theorem 1 are optimal. They were obtained with the goal of making the proof transparent, nothing more, and it is likely that they may be improved by more careful calculations.

The proof of Theorem 2 is based on the idea of “median-of-means tournaments” which was introduced by Lugosi and Mendelson is the context of regression function estimation.

Proof

on more than k/2k/2 blocks BjB_{j}. The main technical lemma is the following.

Let δ(0,1)\delta\in(0,1), k=360log(2/δ)k=\lceil 360\log(2/\delta)\rceil, and define

set X=Xμ\overline{X}=X-\mu and put v=bμv=b-\mu. Thus, for a fixed bb that satisfies bμr\|b-\mu\|\geq r, μ\mu defeats bb if

for more than k/2k/2 blocks BjB_{j}. Clearly, it suffices to show that (3.1) holds when v=r\|v\|=r.

where recall that λmax\lambda_{\text{max}} is the largest eigenvalue of the covariance matrix of XX. Thus, if

Applying a standard binomial tail estimate, we see that (3.3) holds for a single vv with probability at least 1exp(k/180)1-\exp(-k/180) on at least 8/108/10 of the blocks BjB_{j}.

Now we need to extend the above from a fixed vector vv to all vectors with norm rr. In order to show that (3.3) holds simultaneously for all vrSd1v\in r\cdot S^{d-1} on at least 7/107/10 of the blocks BjB_{j}, we first consider a maximal ε\varepsilon-separated set V1rSd1V_{1}\subset r\cdot S^{d-1} with respect to the L2(X)L_{2}(X) norm. In other words, V1V_{1} is a subset of rSd1r\cdot S^{d-1} of maximal cardinality such that for all v1,v2V1v_{1},v_{2}\in V_{1}, v1v2L2(X)=v1v2,Σ(v1v2)1/2ε\|v_{1}-v_{2}\|_{L_{2}(X)}=\left\langle v_{1}-v_{2},\Sigma(v_{1}-v_{2})\right\rangle^{1/2}\geq\varepsilon. We may estimate this cardinality by the “dual Sudakov” inequality (see and also for a version with the specified constant), which implies that the cardinality of V1V_{1} is bounded by

we have V1ek/360|V_{1}|\leq e^{k/360} and thus, by the union bound, with probability at least 1ek/3601δ/21-e^{-k/360}\geq 1-\delta/2, (3.3) holds for all vV1v\in V_{1} on at least 8/108/10 of the blocks BjB_{j}.

Next we check that property (3.1) holds simultaneously for all xx with x=r\|x\|=r on at least 7/107/10 of the blocks BjB_{j}.

For every xrSd1x\in r\cdot S^{d-1}, let vxv_{x} be the nearest element to xx in V1V_{1} with respect to the L2(X)L_{2}(X) norm. It suffices to show that, with probability at least 1exp(k/200)1δ/21-\exp(-k/200)\geq 1-\delta/2,

Indeed, on that event it follows that for every xrSd1x\in r\cdot S^{d-1}, on at least 7/107/10 of the coordinate blocks BjB_{j}, both

hold and hence, on those blocks, 2miBjXi,x+r2>0-\frac{2}{m}\sum_{i\in B_{j}}\left\langle\overline{X}_{i},x\right\rangle+r^{2}>0 as required.

Turning to (A)(A), by symmetrization, contraction for Bernoulli processes and de-symmetrization (see, e.g., ), and noting that xvx2r\|x-v_{x}\|\leq 2r, we have

Observe that 2miBjXib,ab=21miBjXib,ab=2Zjb,ab-\frac{2}{m}\sum_{i\in B_{j}}\left\langle X_{i}-b,a-b\right\rangle=-2\left\langle\frac{1}{m}\sum_{i\in B_{j}}X_{i}-b,a-b\right\rangle=-2\left\langle Z_{j}-b,a-b\right\rangle, and thus

Therefore, ()>0(*)>0 (i.e., bb defeats aa on block BjB_{j}) if and only if Zja>Zjb\|Z_{j}-a\|>\|Z_{j}-b\|.

Recall that Lemma 1 states that, with probability at least 1δ1-\delta, if aμr\|a-\mu\|\geq r then on more than k/2k/2 blocks BjB_{j}, 1miBj(Xia2Xiμ2)>0\frac{1}{m}\sum_{i\in B_{j}}\left(\|X_{i}-a\|^{2}-\|X_{i}-\mu\|^{2}\right)>0, which, by the above argument, is the same as saying that for at least k/2k/2 indices jj, Zja>Zjμ\|Z_{j}-a\|>\|Z_{j}-\mu\|.

Computational considerations

The problem of computing various notions of multivariate medians has been thoroughly studied in computational geometry and we refer to Aloupis for a survey on this topic. For example, computing the geometric median—and therefore the multivariate median-of-means estimator proposed by Hsu and Sabato and Minsker —involves solving a convex optimization problem. Thus, the geometric median may be approximated efficiently, see for the most recent result and for the rich history of the problem.

In contrast, efficiently computing, or even approximating, the multivariate median proposed in this paper appears to be a nontrivial challenge.

Another possibility is to start with computing the geometric median μ~(δ)\widetilde{\mu}^{(\delta)} of the ZjZ_{j}. By (1.3), one may now restrict search to a ball of radius at most rlog(1/δ)r\sqrt{\log(1/\delta)}. By exhaustively searching through this ball (after appropriately discretizing), one finds an estimate with the desired properties in additional time of order logd(1/δ)\log^{d}(1/\delta). However, this is surely unrealistic in most interesting cases.

We leave the question of efficiently computing the proposed mean estimate (or another one with sub-Gaussian performance guarantees) as an interesting research problem.

References