Sub-Gaussian estimators of the mean of a random vector

Gábor Lugosi, Shahar Mendelson

Introduction

The situation is quite different when one is interested in minimizing the value $r$ that satisfies

where $\lambda_{\text{max}}$ denotes the largest eigenvalue of $\Sigma$ (see Hanson and Wright ). Similar bounds may be proven for the performance of the sample mean if $X$ has a sub-Gaussian distribution in the sense that for all unit vectors $v\in S^{d-1}$ ,

However, when the distribution is not necessarily sub-Gaussian and is possibly heavy-tailed, one cannot expect such a sub-Gaussian behavior of the sample mean. Thus, when is it not reasonable to assume a sub-Gaussian distribution and heavy tails may be a concern, the sample mean is a risky choice. Indeed, alternative estimators have been constructed to achieve better performance.

The one-dimensional case (i.e., $d=1$ ) is quite well understood, see Catoni and Devroye, Lerasle, Lugosi, and Oliveira for recent accounts. The so-called median-of-means estimator is a simple and powerful univariate estimator with essentially optimal performance. This estimate was introduced independently in various papers, see Nemirovsky and Yudin , Jerrum, Valiant, and Vazirani , Alon, Matias, and Szegedy . The median-of-means estimator partitions the data into $k<N$ blocks of size $m\approx N/k$ each, computes the sample mean within each block, and outputs their median. One may easily show (see, e.g., Hsu ) that, for any $\delta\in(0,1)$ if $k=\lceil 8\log(1/\delta)\rceil$ , then the resulting estimator $\widehat{\mu}_{N}^{(\delta)}$ satisfies that, with probability at least $1-\delta$ ,

where $\sigma^{2}$ denotes the variance of $X$ . In other words, in the one-dimensional case, the median-of-means estimator achieves a sub-Gaussian performance under the only condition that the variance of $X$ exists.

The median-of-means estimator has been extended to the multivariate case by replacing the median by its natural multivariate extension, the so-called “geometric (or spatial) median” (i.e., the point that minimizes the sum of the Euclidean distances to the sample means within each block) see Lerasle and Oliveira , Hsu and Sabato , Minsker . In particular, Minsker proves that for each $\delta\in(0,1)$ this generalization of the median-of-means estimator $\widetilde{\mu}_{N}^{(\delta)}$ is such that, with probability at least $1-\delta$ ,

where $C$ is a universal constant. This bound holds under the only assumption that the covariance matrix exists. However, it does not quite achieve a sub-Gaussian performance bound that resembles (1.1).

Joly, Lugosi, and Oliveira made an attempt to construct a mean estimator with a sub-Gaussian behavior for a large class of distributions. They prove that there exists a mean estimator $\widehat{\mu}_{n}^{(\delta)}$ such that, if the distribution satisfies that for all $v\in S^{d-1}$

for some constant $K$ , then for all $N\geq CK\log d\left(d+\log(1/\delta)\right)$ , with probability at least $1-\delta$ ,

where again $C$ is a universal constant. This bound resembles the sub-Gaussian inequality (1.1). However, there are various caveats: the additional fourth-moment assumption, the requirement that $N=\Omega(d\log d)$ , and, to a lesser extent, the extra $\log\log d$ term in the bound seem sub-optimal.

The main result of this paper is that there exists a mean estimator that achieves purely sub-Gaussian performance under the minimal condition that the covariance matrix exists. More precisely, we prove the existence of a mean estimator $\widehat{\mu}_{N}^{(\delta)}$ such that, for all distributions with a finite second moment, for all $N$ , with probability at least $1-\delta$ ,

The proposed estimator may be interpreted as a multivariate median-of-means estimate but with a new notion of a multivariate median which may be interesting in its own right. The construction of the new estimator is inspired by the technique of “median-of-means tournament”, put forward by the authors in .

In the next section we present the proposed estimator and the performance bound. In Section 3 we present the proofs. We finish the paper by remarks about the computation of the estimator.

The estimator

Define the sample mean within each block by

Note that the minimum is always achieved. This follows from the fact that ${\rm diam}(S_{a}^{c})$ is a continuous function of $a$ (since, for each $a$ , $S_{a}^{c}$ is the intersection of a finite union of closed balls, and the centers and radii of the closed balls are continuous in $a$ ).

The main result of this paper is the following performance bound:

Thus, the proposed estimator achieves a purely sub-Gaussian performance under minimal conditions. Just like in the case of the median-of-means estimator for the univariate case, the estimator depends on the desired level of confidence $\delta$ . As it is shown in , such a dependence cannot be avoided without imposing additional conditions on the distribution. However, following the route laid down in , one may construct sub-Gaussian estimators that work for a wide range of confidence levels simultaneously under more assumptions on the distribution. Since this issue is beyond the scope of this paper and will not be pursued further here.

Theorem 1 is an outcome of the following observation which is of interest in its own right on the geometry of a typical collection $\{X_{1},...,X_{N}\}$ .

Using the same notation as above and setting

The fact that Theorem 2 implies Theorem 1 is straightforward. Indeed, Theorem 2 implies that ${\rm diam}(S_{\mu}^{c})\leq 2r$ and that if $\|a-\mu\|\geq r$ , then $\mu\in S_{a}^{c}$ . By the definition of $S_{a}$ , one always has $a\in S_{a}^{c}$ , and thus if $\|a-\mu\|>2r$ then ${\rm diam}(S_{a}^{c})>2r$ . Therefore, the minimizer $\widehat{\mu}$ must satisfy that $\|\widehat{\mu}-\mu\|\leq 2r$ , as required.

We do not claim that the values of the constants appearing in Theorem 1 are optimal. They were obtained with the goal of making the proof transparent, nothing more, and it is likely that they may be improved by more careful calculations.

The proof of Theorem 2 is based on the idea of “median-of-means tournaments” which was introduced by Lugosi and Mendelson is the context of regression function estimation.

Proof

on more than $k/2$ blocks $B_{j}$ . The main technical lemma is the following.

Let $\delta\in(0,1)$ , $k=\lceil 360\log(2/\delta)\rceil$ , and define

set $\overline{X}=X-\mu$ and put $v=b-\mu$ . Thus, for a fixed $b$ that satisfies $\|b-\mu\|\geq r$ , $\mu$ defeats $b$ if

for more than $k/2$ blocks $B_{j}$ . Clearly, it suffices to show that (3.1) holds when $\|v\|=r$ .

where recall that $\lambda_{\text{max}}$ is the largest eigenvalue of the covariance matrix of $X$ . Thus, if

Applying a standard binomial tail estimate, we see that (3.3) holds for a single $v$ with probability at least $1-\exp(-k/180)$ on at least $8/10$ of the blocks $B_{j}$ .

Now we need to extend the above from a fixed vector $v$ to all vectors with norm $r$ . In order to show that (3.3) holds simultaneously for all $v\in r\cdot S^{d-1}$ on at least $7/10$ of the blocks $B_{j}$ , we first consider a maximal $\varepsilon$ -separated set $V_{1}\subset r\cdot S^{d-1}$ with respect to the $L_{2}(X)$ norm. In other words, $V_{1}$ is a subset of $r\cdot S^{d-1}$ of maximal cardinality such that for all $v_{1},v_{2}\in V_{1}$ , $\|v_{1}-v_{2}\|_{L_{2}(X)}=\left\langle v_{1}-v_{2},\Sigma(v_{1}-v_{2})\right\rangle^{1/2}\geq\varepsilon$ . We may estimate this cardinality by the “dual Sudakov” inequality (see and also for a version with the specified constant), which implies that the cardinality of $V_{1}$ is bounded by

we have $|V_{1}|\leq e^{k/360}$ and thus, by the union bound, with probability at least $1-e^{-k/360}\geq 1-\delta/2$ , (3.3) holds for all $v\in V_{1}$ on at least $8/10$ of the blocks $B_{j}$ .

Next we check that property (3.1) holds simultaneously for all $x$ with $\|x\|=r$ on at least $7/10$ of the blocks $B_{j}$ .

For every $x\in r\cdot S^{d-1}$ , let $v_{x}$ be the nearest element to $x$ in $V_{1}$ with respect to the $L_{2}(X)$ norm. It suffices to show that, with probability at least $1-\exp(-k/200)\geq 1-\delta/2$ ,

Indeed, on that event it follows that for every $x\in r\cdot S^{d-1}$ , on at least $7/10$ of the coordinate blocks $B_{j}$ , both

hold and hence, on those blocks, $-\frac{2}{m}\sum_{i\in B_{j}}\left\langle\overline{X}_{i},x\right\rangle+r^{2}>0$ as required.

Turning to $(A)$ , by symmetrization, contraction for Bernoulli processes and de-symmetrization (see, e.g., ), and noting that $\|x-v_{x}\|\leq 2r$ , we have

Observe that $-\frac{2}{m}\sum_{i\in B_{j}}\left\langle X_{i}-b,a-b\right\rangle=-2\left\langle\frac{1}{m}\sum_{i\in B_{j}}X_{i}-b,a-b\right\rangle=-2\left\langle Z_{j}-b,a-b\right\rangle$ , and thus

Therefore, $(*)>0$ (i.e., $b$ defeats $a$ on block $B_{j}$ ) if and only if $\|Z_{j}-a\|>\|Z_{j}-b\|$ .

Recall that Lemma 1 states that, with probability at least $1-\delta$ , if $\|a-\mu\|\geq r$ then on more than $k/2$ blocks $B_{j}$ , $\frac{1}{m}\sum_{i\in B_{j}}\left(\|X_{i}-a\|^{2}-\|X_{i}-\mu\|^{2}\right)>0$ , which, by the above argument, is the same as saying that for at least $k/2$ indices $j$ , $\|Z_{j}-a\|>\|Z_{j}-\mu\|$ .

Computational considerations

The problem of computing various notions of multivariate medians has been thoroughly studied in computational geometry and we refer to Aloupis for a survey on this topic. For example, computing the geometric median—and therefore the multivariate median-of-means estimator proposed by Hsu and Sabato and Minsker —involves solving a convex optimization problem. Thus, the geometric median may be approximated efficiently, see for the most recent result and for the rich history of the problem.

In contrast, efficiently computing, or even approximating, the multivariate median proposed in this paper appears to be a nontrivial challenge.

Another possibility is to start with computing the geometric median $\widetilde{\mu}^{(\delta)}$ of the $Z_{j}$ . By (1.3), one may now restrict search to a ball of radius at most $r\sqrt{\log(1/\delta)}$ . By exhaustively searching through this ball (after appropriately discretizing), one finds an estimate with the desired properties in additional time of order $\log^{d}(1/\delta)$ . However, this is surely unrealistic in most interesting cases.

We leave the question of efficiently computing the proposed mean estimate (or another one with sub-Gaussian performance guarantees) as an interesting research problem.

Introduction

The estimator

Proof

Computational considerations

References