Sub-Gaussian estimators of the mean of a random vector
Gábor Lugosi, Shahar Mendelson
Introduction
The situation is quite different when one is interested in minimizing the value that satisfies
where denotes the largest eigenvalue of (see Hanson and Wright ). Similar bounds may be proven for the performance of the sample mean if has a sub-Gaussian distribution in the sense that for all unit vectors ,
However, when the distribution is not necessarily sub-Gaussian and is possibly heavy-tailed, one cannot expect such a sub-Gaussian behavior of the sample mean. Thus, when is it not reasonable to assume a sub-Gaussian distribution and heavy tails may be a concern, the sample mean is a risky choice. Indeed, alternative estimators have been constructed to achieve better performance.
The one-dimensional case (i.e., ) is quite well understood, see Catoni and Devroye, Lerasle, Lugosi, and Oliveira for recent accounts. The so-called median-of-means estimator is a simple and powerful univariate estimator with essentially optimal performance. This estimate was introduced independently in various papers, see Nemirovsky and Yudin , Jerrum, Valiant, and Vazirani , Alon, Matias, and Szegedy . The median-of-means estimator partitions the data into blocks of size each, computes the sample mean within each block, and outputs their median. One may easily show (see, e.g., Hsu ) that, for any if , then the resulting estimator satisfies that, with probability at least ,
where denotes the variance of . In other words, in the one-dimensional case, the median-of-means estimator achieves a sub-Gaussian performance under the only condition that the variance of exists.
The median-of-means estimator has been extended to the multivariate case by replacing the median by its natural multivariate extension, the so-called “geometric (or spatial) median” (i.e., the point that minimizes the sum of the Euclidean distances to the sample means within each block) see Lerasle and Oliveira , Hsu and Sabato , Minsker . In particular, Minsker proves that for each this generalization of the median-of-means estimator is such that, with probability at least ,
where is a universal constant. This bound holds under the only assumption that the covariance matrix exists. However, it does not quite achieve a sub-Gaussian performance bound that resembles (1.1).
Joly, Lugosi, and Oliveira made an attempt to construct a mean estimator with a sub-Gaussian behavior for a large class of distributions. They prove that there exists a mean estimator such that, if the distribution satisfies that for all
for some constant , then for all , with probability at least ,
where again is a universal constant. This bound resembles the sub-Gaussian inequality (1.1). However, there are various caveats: the additional fourth-moment assumption, the requirement that , and, to a lesser extent, the extra term in the bound seem sub-optimal.
The main result of this paper is that there exists a mean estimator that achieves purely sub-Gaussian performance under the minimal condition that the covariance matrix exists. More precisely, we prove the existence of a mean estimator such that, for all distributions with a finite second moment, for all , with probability at least ,
The proposed estimator may be interpreted as a multivariate median-of-means estimate but with a new notion of a multivariate median which may be interesting in its own right. The construction of the new estimator is inspired by the technique of “median-of-means tournament”, put forward by the authors in .
In the next section we present the proposed estimator and the performance bound. In Section 3 we present the proofs. We finish the paper by remarks about the computation of the estimator.
The estimator
Define the sample mean within each block by
Note that the minimum is always achieved. This follows from the fact that is a continuous function of (since, for each , is the intersection of a finite union of closed balls, and the centers and radii of the closed balls are continuous in ).
The main result of this paper is the following performance bound:
Thus, the proposed estimator achieves a purely sub-Gaussian performance under minimal conditions. Just like in the case of the median-of-means estimator for the univariate case, the estimator depends on the desired level of confidence . As it is shown in , such a dependence cannot be avoided without imposing additional conditions on the distribution. However, following the route laid down in , one may construct sub-Gaussian estimators that work for a wide range of confidence levels simultaneously under more assumptions on the distribution. Since this issue is beyond the scope of this paper and will not be pursued further here.
Theorem 1 is an outcome of the following observation which is of interest in its own right on the geometry of a typical collection .
Using the same notation as above and setting
The fact that Theorem 2 implies Theorem 1 is straightforward. Indeed, Theorem 2 implies that and that if , then . By the definition of , one always has , and thus if then . Therefore, the minimizer must satisfy that , as required.
We do not claim that the values of the constants appearing in Theorem 1 are optimal. They were obtained with the goal of making the proof transparent, nothing more, and it is likely that they may be improved by more careful calculations.
The proof of Theorem 2 is based on the idea of “median-of-means tournaments” which was introduced by Lugosi and Mendelson is the context of regression function estimation.
Proof
on more than blocks . The main technical lemma is the following.
Let , , and define
set and put . Thus, for a fixed that satisfies , defeats if
for more than blocks . Clearly, it suffices to show that (3.1) holds when .
where recall that is the largest eigenvalue of the covariance matrix of . Thus, if
Applying a standard binomial tail estimate, we see that (3.3) holds for a single with probability at least on at least of the blocks .
Now we need to extend the above from a fixed vector to all vectors with norm . In order to show that (3.3) holds simultaneously for all on at least of the blocks , we first consider a maximal -separated set with respect to the norm. In other words, is a subset of of maximal cardinality such that for all , . We may estimate this cardinality by the “dual Sudakov” inequality (see and also for a version with the specified constant), which implies that the cardinality of is bounded by
we have and thus, by the union bound, with probability at least , (3.3) holds for all on at least of the blocks .
Next we check that property (3.1) holds simultaneously for all with on at least of the blocks .
For every , let be the nearest element to in with respect to the norm. It suffices to show that, with probability at least ,
Indeed, on that event it follows that for every , on at least of the coordinate blocks , both
hold and hence, on those blocks, as required.
Turning to , by symmetrization, contraction for Bernoulli processes and de-symmetrization (see, e.g., ), and noting that , we have
Observe that , and thus
Therefore, (i.e., defeats on block ) if and only if .
Recall that Lemma 1 states that, with probability at least , if then on more than blocks , , which, by the above argument, is the same as saying that for at least indices , .
Computational considerations
The problem of computing various notions of multivariate medians has been thoroughly studied in computational geometry and we refer to Aloupis for a survey on this topic. For example, computing the geometric median—and therefore the multivariate median-of-means estimator proposed by Hsu and Sabato and Minsker —involves solving a convex optimization problem. Thus, the geometric median may be approximated efficiently, see for the most recent result and for the rich history of the problem.
In contrast, efficiently computing, or even approximating, the multivariate median proposed in this paper appears to be a nontrivial challenge.
Another possibility is to start with computing the geometric median of the . By (1.3), one may now restrict search to a ball of radius at most . By exhaustively searching through this ball (after appropriately discretizing), one finds an estimate with the desired properties in additional time of order . However, this is surely unrealistic in most interesting cases.
We leave the question of efficiently computing the proposed mean estimate (or another one with sub-Gaussian performance guarantees) as an interesting research problem.