Agnostic Estimation of Mean and Covariance

Kevin A. Lai, Anup B. Rao, Santosh Vempala

Introduction

The mean and covariance of a probability distribution are its most basic parameters (if they are bounded). Many families of distributions are defined using only these parameters. Estimating the mean and covariance from iid samples is thus a fundamental and classical problem in statistics. The sample mean and sample covariance are generally the best possible estimators (under mild conditions on the distribution such as their existence). However, they are highly sensitive to noise. The main goal of this paper is to estimate the mean, covariance and related functions in spite of arbitrary (adversarial) noise.

The Achilles heel of algorithms for generative models is the assumption that data is exactly from the model. This is crucial for known guarantees, and relaxations of it are few and specialized, e.g., in ICA, data could be noisy, but the noise itself is assumed to be Gaussian. Assumptions about rank and sparsity are made in a technique that is now called Robust PCA [CSPW11, CLMW11, XCM10]. There have been attempts [Kwa08, MT+11] at achieving robustness by L1 minimization, but they don’t give any error bounds on the output produced. A natural, important and wide open problem is estimating the parameters of generative models in the presence of arbitrary, i.e., malicious noise, a setting usually referred to as agnostic learning. The simplest version of this problem is to estimate a single Gaussian in the presence of malicious noise. Alternatively, this can be posed as the problem of finding a best-fit Gaussian to data or agnostically learning a single Gaussian. We consider the following generalization:

There is a large literature on robust statistics (see e.g., [Hub11, HRRS11, MMY06]), with the goal of finding estimators that are stable under perturbations of the data. The classic example for points on a line is that the sample median is a robust estimator while the sample mean is not (a single data point can change the mean arbitrarily). One measure of the robustness of an estimator is its breakdown point, which is the minimum fraction of noise that can make the estimator arbitrarily bad. Robust statistics have been proposed and studied for mean and covariance estimation in high dimension as well (see [Hub64, Tuk74, Mar76, SJD81, Don82, Dav87, HPL91, DG92, MSY92, MZ12, CGR15] and the references therein). Most commonly used methods (including M-estimators) to estimate the covariance matrix were shown to have very low breakdown points [Don82]. The notion of robustness we consider quantifies how far the estimated value is from the true value. To the best of our knowledge, all of these papers either give algorithms that are computationally very expensive, namely exponential time in the dimension, or have poor or no guarantees for the output. Tukey’s median [Tuk74] is an example of the former. It is defined as the deepest point with respect to a given set of points $\{\boldsymbol{\mathit{x}}_{i}\}_{i}.$ As proven in [CGR15], this is an optimal estimate of the mean. But there is no known polynomial time algorithm to compute this. Another well-known proposal (see [Sma90]) is the geometric median:

This has the advantage that it can be computed via a convex program. Unfortunately, as we observe here (see Proposition 2.1), the error of the mean estimate produced by this method grows polynomially with the dimension (also see [Bru11]).
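To make the dimension dependence concrete, the following sketch computes the geometric median, taken here in its standard form as the minimizer of the sum of Euclidean distances to the sample points, and measures its error under a point-mass contamination. It uses Weiszfeld's iteration instead of a general convex-programming solver; the function names, the sample size, the noise fraction and the noise location are all illustrative assumptions and are not taken from the implementation in [KRV].

```python
import numpy as np

def geometric_median(points, iters=200, tol=1e-9):
    """Weiszfeld's iteration for argmin_y sum_i ||x_i - y||_2."""
    y = points.mean(axis=0)
    for _ in range(iters):
        # Guard against division by zero if y coincides with a sample point.
        dist = np.maximum(np.linalg.norm(points - y, axis=1), 1e-12)
        weights = 1.0 / dist
        y_next = weights @ points / weights.sum()
        if np.linalg.norm(y_next - y) < tol:
            return y_next
        y = y_next
    return y

def contaminated_error(n, eta=0.1, m=2000, seed=0):
    """Error of the geometric median for N(0, I_n) plus a far point mass."""
    rng = np.random.default_rng(seed)
    n_noise = int(eta * m)
    good = rng.standard_normal((m - n_noise, n))   # true mean is 0
    noise = np.zeros((n_noise, n))
    noise[:, 0] = 10.0 * n                         # illustrative noise location
    return np.linalg.norm(geometric_median(np.vstack([good, noise])))

# Error of the geometric median for a few dimensions at a fixed noise fraction.
errors = {n: contaminated_error(n) for n in (4, 25, 100)}
```

Under these assumptions the error should increase with the dimension even though the noise fraction is fixed, which is the behaviour that the recursive algorithm of this paper is designed to avoid.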

In this paper, we give polynomial time algorithms to estimate the mean with error that is close to the information-theoretically optimal estimator. The dependence on the dimension, of the error in the estimated mean, is only $\sqrt{\log n}$ . To the best of our knowledge, this is the first polynomial-time algorithm with an error dependence on dimension that is less than $\sqrt{n}$ , the bound achieved by the geometric median. Moreover, as we state precisely later, our techniques extend to very general input distributions and to estimating higher moments.

Our algorithm is practical. A MATLAB implementation for mean estimation can be found in [KRV]. It takes less than a couple of seconds to run on a $500$ -dimensional problem with $5000$ samples on a personal laptop.

$\mathcal{D}=N(\boldsymbol{\mathit{\mu}},\boldsymbol{{\Sigma}})$ is the Gaussian with mean $\boldsymbol{\mathit{\mu}}$ and covariance $\boldsymbol{{\Sigma}}$ .

Let $\mathcal{D}$ be a distribution with mean $\boldsymbol{\mathit{\mu}}$ and covariance $\boldsymbol{{\Sigma}}$ . We say it has bounded $2k$ ’th moments if there exists a constant $C_{2k}$ such that for every unit vector $\boldsymbol{\mathit{v}}$ ,

Here $\mbox{\bf Var}\left[\boldsymbol{\mathit{x}}^{T}\boldsymbol{\mathit{v}}\right]=\boldsymbol{\mathit{v}}^{T}\boldsymbol{{\Sigma}}\boldsymbol{\mathit{v}}$ is the variance of $\boldsymbol{\mathit{x}}$ along $\boldsymbol{\mathit{v}}.$ For mean estimation, $C_{4}$ will be used, and for covariance estimation, $C_{8}$ will be needed.

1 Main results

All the results we state hold with probability $1-1/\operatorname*{poly}(n)$ unless otherwise mentioned. We will also assume $\eta$ is less than a universal constant. We begin with agnostic mean estimation.

We note that the sample complexity is nearly linear, and almost matches the complexity for mean estimation with no noise.

If we take $m=O\left(\frac{n^{2}(\log n+\log 1/\eta)\log n}{\eta^{2}}\right)$ samples, and assume that $\eta<c/\log n$ for a small enough constant $c>0$ , then by combining Theorems 1.5 and 1.1, we can improve the $\eta$ dependence for the non-spherical Gaussian case in Theorem 1.1 to $\|\boldsymbol{\mathit{\mu}}-\boldsymbol{\widehat{\mathit{\mu}}}\|_{2}=O\left(\eta^{3/4}\right)\|\boldsymbol{{\Sigma}}\|^{1/2}_{2}\log^{1/2}n.$

Our next theorem is a similar result for much more general distributions.

The bounds above are nearly the best possible (up to a factor of $O(\sqrt{\log n})$ ) when the covariance is a multiple of the identity.

Let $\mathcal{D}$ be a distribution with mean $\boldsymbol{\mathit{\mu}}$ and covariance $\boldsymbol{{\Sigma}}$ such that (a) for $\boldsymbol{\mathit{x}}\sim\mathcal{D}$ , $\boldsymbol{\mathit{x}}$ and $(\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{\mu}})(\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{\mu}})^{T}$ have bounded fourth moments with constants $C_{4}$ and $C_{4,2}$ (see Equation 1) respectively, and (b) $\mathcal{D}$ is an (unknown) affine transformation of a $4$ -wise independent distribution. Then, there is an algorithm that takes as input $m=O\left(\frac{n^{2}(\log n+\log 1/\epsilon)\log n}{\epsilon^{2}}\right)$ samples $\boldsymbol{\mathit{x}}_{1},...,\boldsymbol{\mathit{x}}_{m}\sim\mathcal{D}_{\eta}$ and $\eta$ and computes in $\operatorname*{poly}(n,1/\epsilon)$ -time a covariance estimate $\boldsymbol{\widehat{\Sigma}}$ such that

where $\|\cdot\|_{F}$ denotes the Frobenius norm.

If $\mathcal{D}=N(\boldsymbol{\mathit{\mu}},\boldsymbol{{\Sigma}}),$ then it satisfies the hypothesis of the above theorem. More generally, it holds for any $8$ -wise independent distribution with bounded eighth moments and whose fourth moment along any direction is at least $(1+c)$ times the square of the second moment for some $c>0$ . We also note that if the distribution is isotropic, then covariance estimation is essentially a $1$ -d problem and we get a better bound.

Suppose $\mathcal{D}$ is a distribution which satisfies the following concentration inequality: there exists a constant $\gamma$ such that for every unit vector $\boldsymbol{\mathit{v}}$

Then, there is an algorithm that runs in $\operatorname*{poly}(n,\log\frac{1}{\eta})$ time that takes as input $\eta$ and $m=O\left(\frac{n^{3}(\log n/\eta)^{2}\log n}{\eta^{2}}\right)$ independent samples $\boldsymbol{\mathit{x}}_{1},...,\boldsymbol{\mathit{x}}_{m}\sim\mathcal{D}_{\eta}$ , and computes $\widehat{\lambda}_{\max}$ such that

In independent work, [DKK+16] gave a similar algorithm, which they call a Gaussian filtering method, for agnostic mean estimation assuming a spherical covariance matrix; while their guarantees are specifically for Gaussians, the error term in their guarantee grows only with $\log(1/\eta)$ rather than $\log n$ . They also give a completely different algorithm based on the Ellipsoid method, for a simple family of distributions including Gaussian and Bernoulli.

As a corollary of Theorem 1.5, we get a guarantee for agnostic SVD.

Let $\mathcal{D}$ be a distribution that satisfies the hypothesis of Theorem 1.5. Let $\boldsymbol{{\Sigma}}_{k}$ be the best rank $k$ approximation to $\boldsymbol{{\Sigma}}$ in $\|\cdot\|_{F}$ norm. There exists a polynomial time algorithm that takes as input $\eta$ and $m=\operatorname*{poly}(n)$ samples from $\mathcal{D}_{\eta}.$ It produces a rank $k$ matrix $\boldsymbol{\widehat{\Sigma}}_{k}$ such that

Our results can also be used to estimate the mean and covariance of noisy Bernoulli product distributions, i.e. distributions in which each coordinate $i$ is 1 with probability $p_{i}$ and 0 with probability $1-p_{i}$ . In one dimension, $C_{4}$ for a Bernoulli distribution is $\frac{(1-p)^{2}}{p}+\frac{p^{2}}{1-p}$ . For a Bernoulli product distribution, $C_{4}$ will be within a constant of $\max_{i}\left\{\frac{(1-p_{i})^{2}}{p_{i}}+\frac{p_{i}^{2}}{1-p_{i}}\right\}$ . Then Theorem 1.3 can be applied to get an estimate $\boldsymbol{\widehat{\mathit{\mu}}}$ for the mean. For instance, if $\forall i,p_{i}=p$ and $p\geq\frac{1}{2}$ , then $\|\boldsymbol{\mathit{\mu}}-\boldsymbol{\widehat{\mathit{\mu}}}\|_{2}=O\left(\sqrt{\eta(1+\sqrt{\eta p})p\log n}\right)$ . If $C_{4}$ is constant, then by Theorem 1.5, we can get an estimate for the covariance.
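The one-dimensional expression for $C_{4}$ can be checked directly. The sketch below assumes that, in one dimension, $C_{4}$ is the ratio of the fourth central moment to the squared variance (the smallest constant for which a bound of that form holds); the function names are illustrative. Exact rational arithmetic is used so that the comparison is not affected by rounding.

```python
from fractions import Fraction

def bernoulli_c4(p):
    """E[(x - p)^4] / Var[x]^2 for x ~ Bernoulli(p), from the definition."""
    p = Fraction(p)
    q = 1 - p
    # x = 1 with probability p contributes (1 - p)^4; x = 0 contributes p^4.
    fourth_central = p * q**4 + q * p**4
    variance = p * q
    return fourth_central / variance**2

def closed_form(p):
    """The expression (1 - p)^2 / p + p^2 / (1 - p) quoted in the text."""
    p = Fraction(p)
    return (1 - p) ** 2 / p + p**2 / (1 - p)

# The two expressions agree exactly for a few sample values of p.
checks = [bernoulli_c4(p) == closed_form(p)
          for p in (Fraction(1, 2), Fraction(1, 3), Fraction(9, 10))]
```

The ratio grows as $p$ approaches $0$ or $1$, which is why the bound for product distributions is governed by the most extreme coordinate.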

Main Ideas

Here we discuss the key ideas of the algorithms. The algorithm AgnosticMean (Algorithm 2.2.2) alternates between an outlier removal step and projection onto the top $n/2$ principal components; these steps are repeated. It is inspired by the work of Brubaker [Bru09] who gave an agnostic algorithm for learning a mixture of well-separated spherical Gaussians.

For illustration, let us assume for now that the underlying distribution is $\mathcal{D}=N(\boldsymbol{\mathit{\mu}},\sigma^{2}\boldsymbol{\mathit{I}}).$ We are given a set $S$ of $m=\operatorname*{poly}(n)$ points from $\mathcal{D}_{\eta}$ . Let $S=S_{G}\cup S_{N}$ , where $S_{G}$ and $S_{N}$ are the points sampled from the Gaussian and the adversary respectively. Let us also assume that $|S_{N}|=\eta|S|$ . We will use the notation $\boldsymbol{\mathit{\mu}}_{T}$ for the mean of the points in a set $T$ , and $\boldsymbol{{\Sigma}}_{T}$ for the covariance of the points in $T$ . We then have

If the dimension is $n=1$ , then we can show that the median of $S$ is an estimate for $\boldsymbol{\mathit{\mu}}$ correct up to an additive error of $O(\eta\sigma).$ If we just knew the direction of the mean shift $\boldsymbol{\mathit{\mu}}_{S}-\boldsymbol{\mathit{\mu}}=\eta(\boldsymbol{\mathit{\mu}}_{N}-\boldsymbol{\mathit{\mu}}_{G})$ , we could estimate $\boldsymbol{\mathit{\mu}}$ by first projecting the sample $S$ on the line along $\boldsymbol{\mathit{\mu}}-\boldsymbol{\mathit{\mu}}_{S}$ and then finding the median. This would give an estimator $\boldsymbol{\widehat{\mathit{\mu}}}$ satisfying $\|\boldsymbol{\widehat{\mathit{\mu}}}-\boldsymbol{\mathit{\mu}}\|_{2}=O(\eta\sigma).$ So we can focus on finding the direction of $\boldsymbol{\mathit{\mu}}_{S}-\boldsymbol{\mathit{\mu}}$ . One would guess that the top principal component of the covariance matrix of $S$ would be a good candidate. But it is easy for the adversary to choose $S_{N}$ to make this completely useless. Since the noise points $S_{N}$ can be anything, just two points from $S_{N}$ placed far away on either side of the mean $\boldsymbol{\mathit{\mu}}$ along a particular line passing through $\boldsymbol{\mathit{\mu}}$ are sufficient to make the variance in that direction blow up arbitrarily. But we can limit this effect to some extent by an outlier removal step. By a standard concentration inequality for Gaussians, we know that the points in $S_{G}$ lie in a ball of radius $O(\sigma\sqrt{n})$ around the mean. So, if we can just find a point inside or close to the convex hull of the Gaussian and throw away all the points that lie outside a ball of radius $C\sigma\sqrt{n}$ around this point, we preserve all the points in $S_{G}$ . This will also contain the effect of noise points on the variance since now they are restricted to be within $O(\sigma\sqrt{n})$ distance of $\boldsymbol{\mathit{\mu}}.$ We will see later that we can use the coordinate-wise median as the center of the ball.
By computing the variance by projecting onto any direction, we can figure out $\sigma^{2}$ up to a $1\pm O(\eta)$ factor. From now on, we assume that all points in $S$ lie within a ball of radius $O(\sigma\sqrt{n})$ centered at $\boldsymbol{\mathit{\mu}}.$

But even after this restriction, the top principal component may not contain any information about the mean shift direction. By just placing (say) $\eta/10$ noise points along the $e_{1}$ direction at $\pm\sigma\sqrt{n}$ , and all the remaining noise points perpendicular to this at a single point at a smaller distance, we can make $e_{1}$ the top principal component. But $e_{1}$ is perpendicular to the mean shift direction.

The idea to get around this is that even if the top principal component of $\boldsymbol{{\Sigma}}_{S}$ may not be along the mean-shift direction, the span (call it $V$ ) of the top $n/2$ principal components of $\boldsymbol{{\Sigma}}_{S}$ will contain a big projection of the mean-shift vector. This is because, if a big component of the mean-shift vector was in the span (say $W$ ) of the bottom $n/2$ principal components of $\boldsymbol{{\Sigma}}_{S}$ , by Equation 2 this would mean that there is a vector in $W$ with a large Rayleigh quotient. This implies that the top $n/2$ eigenvalues of $\boldsymbol{{\Sigma}}_{S}$ are all big. Since $\boldsymbol{{\Sigma}}_{S}=(1-\eta)\sigma^{2}\boldsymbol{\mathit{I}}+\boldsymbol{\mathit{A}}$ , where $\boldsymbol{\mathit{A}}=\eta\boldsymbol{{\Sigma}}_{S_{N}}+\eta(1-\eta)(\boldsymbol{\mathit{\mu}}_{S_{G}}-\boldsymbol{\mathit{\mu}}_{S_{N}})(\boldsymbol{\mathit{\mu}}_{S_{G}}-\boldsymbol{\mathit{\mu}}_{S_{N}})^{T}$ , this is possible only if $\operatorname*{Tr}(\boldsymbol{\mathit{A}})$ is large. But since the distance of each point in $S$ from $\boldsymbol{\mathit{\mu}}$ is $O(\sigma\sqrt{n})$ , the trace of $\boldsymbol{\mathit{A}}$ cannot be too large. Therefore, in the space $W$ , we can just compute the sample mean $\boldsymbol{\mathit{P}}_{W}\boldsymbol{\mathit{\mu}}_{S}$ and it will be close to $\boldsymbol{\mathit{P}}_{W}\boldsymbol{\mathit{\mu}}$ . We still have to find the mean in the space $V$ . But we do this by recursing the above procedure in $V$ . At the end we will be left with a one-dimensional space, and then we can just find the median. This recursive projection onto the top $n/2$ principal components is done in Algorithm 2.2.2 .
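The argument rests on splitting the mean and covariance of $S$ into contributions from $S_{G}$ and $S_{N}$. The standard identity for a two-component mixture, $\boldsymbol{\mathit{\mu}}_{S}=(1-\eta)\boldsymbol{\mathit{\mu}}_{S_{G}}+\eta\boldsymbol{\mathit{\mu}}_{S_{N}}$ and $\boldsymbol{{\Sigma}}_{S}=(1-\eta)\boldsymbol{{\Sigma}}_{S_{G}}+\eta\boldsymbol{{\Sigma}}_{S_{N}}+\eta(1-\eta)(\boldsymbol{\mathit{\mu}}_{S_{G}}-\boldsymbol{\mathit{\mu}}_{S_{N}})(\boldsymbol{\mathit{\mu}}_{S_{G}}-\boldsymbol{\mathit{\mu}}_{S_{N}})^{T}$, holds exactly for empirical means and covariances normalized by the number of points. The sketch below checks it numerically; the data, the noise distribution and the helper name are illustrative assumptions.

```python
import numpy as np

def mean_and_cov(points):
    """Empirical mean and covariance, normalized by the number of points."""
    mu = points.mean(axis=0)
    centered = points - mu
    return mu, centered.T @ centered / len(points)

rng = np.random.default_rng(1)
m, m_noise, n = 1000, 100, 5
eta = m_noise / m                                    # exact noise fraction
S_G = rng.standard_normal((m - m_noise, n))          # "Gaussian" points
S_N = rng.uniform(-5, 5, size=(m_noise, n)) + 3.0    # arbitrary "noise" points
S = np.vstack([S_G, S_N])

mu_S, cov_S = mean_and_cov(S)
mu_G, cov_G = mean_and_cov(S_G)
mu_N, cov_N = mean_and_cov(S_N)

shift = mu_G - mu_N
mu_mix = (1 - eta) * mu_G + eta * mu_N
cov_mix = ((1 - eta) * cov_G + eta * cov_N
           + eta * (1 - eta) * np.outer(shift, shift))
```

The rank-one term is the only place where the mean shift enters the covariance, which is why a large shift forces a direction with a large Rayleigh quotient.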

This generalizes to the non-spherical Gaussians with a few modifications. We use a different outlier removal step. In the non-spherical case, it is not trivial to compute $\|\boldsymbol{{\Sigma}}\|_{2}$ to be used as the radius of the ball. We give an algorithm for this later on. To limit the effect of noise, we use a damping function. Instead of discarding points outside a certain radius, we damp every point by a weight so that further away points get lower weights. This is done in OutlierDamping (Algorithm 2.2.1). We get the guarantees of Theorem 1.1 by running AgnosticMean (Algorithm 2.2.2) with the outlier removal routine being OutlierDamping. A detailed proof of the whole algorithm is given in Section 3.1.

We then turn to more general distributions which have bounded fourth moments. We need bounded fourth moments to ensure that the mean and covariance matrix of the distribution $\mathcal{D}$ do not change much even after conditioning by an event that occurs with probability $1-\eta$ . One difficulty for general distributions is that the outlier damping doesn’t work. So for distributions $\mathcal{D}$ with bounded fourth moments, we have another outlier removal routine called $\textsc{OutlierTruncation}(\cdot,\eta).$ In this routine, we first find a point analogous to the coordinate-wise median for the Gaussians, and then consider a ball big enough to contain $1-\eta$ fraction of $S$ . We throw away all the points outside this ball. We get the guarantees of Theorem 1.3 by running AgnosticMean (Algorithm 2.2.2) with the outlier removal routine being OutlierTruncation (Algorithm 2.2.1). The complete proof of this appears in Section 3.3.

We now have an algorithm to estimate the mean of very general (with bounded fourth moments) distributions. To estimate the covariance matrix, we observe that the covariance matrix of a distribution $\mathcal{D}$ is given by $\mbox{{\bf E}}_{\mathcal{D}}(\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{\mu}})(\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{\mu}})^{T}.$ If we knew what $\boldsymbol{\mathit{\mu}}$ was, then the covariance could be computed by estimating the mean of the second moments. To compute the mean of the second moments, we can treat $(\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{\mu}})(\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{\mu}})^{T}$ as a vector in $n^{2}$ dimensions and run the algorithm for mean estimation. Also, we can estimate $\boldsymbol{\mathit{\mu}}$ by the same algorithm. Therefore, we get Theorem 1.5 by running CovarianceEstimation (Algorithm 2.2.3). Its proof appears in Section 4.2.

Algorithm AgnosticOperatorNorm (Algorithm 2.2.3) estimates the $2$ -norm $\|\boldsymbol{{\Sigma}}\|_{2}$ for general distributions. For illustration, suppose $\mathcal{D}=N(\boldsymbol{\mathit{\mu}},\boldsymbol{{\Sigma}}),$ and we are given $m=\operatorname*{poly}(n)$ samples $\boldsymbol{\mathit{x}}_{1},...,\boldsymbol{\mathit{x}}_{m}\sim\mathcal{D}_{\eta},$ and the mean $\boldsymbol{\mathit{\mu}}$ . We consider the covariance-like matrix

Since $1-\eta$ fraction of the points in $S$ are from the Gaussian, we have $\boldsymbol{{\Sigma}}(S,\boldsymbol{\mathit{\mu}})\succeq(1-\eta)\boldsymbol{{\Sigma}}.$ Therefore, the top eigenvalue $\sigma^{2}$ of $\boldsymbol{{\Sigma}}(S,\boldsymbol{\mathit{\mu}})$ is at least $(1-\eta)\|\boldsymbol{{\Sigma}}\|_{2}.$ Let $\boldsymbol{\mathit{v}}$ be the top eigenvector of $\boldsymbol{{\Sigma}}(S,\boldsymbol{\mathit{\mu}}).$ If the Gaussian variance along $\boldsymbol{\mathit{v}}$ (which can be computed up to $1\pm\eta$ factor) is much less than $\sigma^{2}$ , this should be because there are a lot of noise points in $S$ whose projections onto $\boldsymbol{\mathit{v}}$ are big compared to the projection of Gaussian points in $S$ . We remove points in $S$ that have big projection and then iterate the entire procedure. We later show that this procedure terminates in $\operatorname*{poly}(n)$ steps and when it terminates the top eigenvalue of $\boldsymbol{{\Sigma}}(S,\boldsymbol{\mathit{\mu}})$ is close to that of $\boldsymbol{{\Sigma}}.$ A proof of this appears in Section 5.

Theorem 1.7 follows easily from Theorem 1.5. Let $\boldsymbol{\widehat{\Sigma}}_{k}$ be the top- $k$ eigenspace of $\boldsymbol{\widehat{\Sigma}}$ from Theorem 1.5. We then have

$(a),(c)$ follow from triangle inequality, $(b)$ follows from the fact that $\boldsymbol{\widehat{\Sigma}}_{k}$ is the best rank- $k$ approximation and $(d)$ from the guarantees of Theorem 1.5.

Next the algorithm estimates a weighted covariance matrix $\boldsymbol{\mathit{W}}$ with the weight of a point $\boldsymbol{\mathit{x}}$ proportional to $\cos(\boldsymbol{\mathit{u}}^{T}\boldsymbol{\mathit{x}})$ for $\boldsymbol{\mathit{u}}$ chosen from a Gaussian distribution; it computes the SVD of $\boldsymbol{\mathit{W}}$ . For this we use our algorithm again (the weights are applied individually to each sample). The main guarantee is that the eigenvectors of this weighted covariance approximate the columns of $A$ . This relies on the maximum eigenvalue gap of $\boldsymbol{\mathit{W}}$ being large, and it has to be approximated to within additive error $\epsilon=O(1/(\log n)^{3})$ . Theorem 1.7 implies that the additional error in eigenvalues is bounded by $O(\sqrt{\eta\log n})\|\boldsymbol{{\Sigma}}\|_{2}$ , and therefore it suffices to have $\sqrt{\eta\log n}<c/(\log n)^{3}$ for a sufficiently small constant $c$ that depends only on the cumulant and moment bound assumptions (i.e., $\Delta,M$ ). Thus, it suffices to have $\eta<\epsilon/2\leq c(\log n)^{-7}$ .

In this section we will show the lower bounds stated in Observation 1.4. For Gaussian distributions, this is a special case of a theorem proved in [CGR15]. We reproduce the relevant part here for completeness. We will show that there are distributions $\mathcal{D}_{1}=N(\boldsymbol{\mathit{\mu}}_{1},\sigma^{2}\boldsymbol{\mathit{I}}),\mathcal{D}_{2}=N(\boldsymbol{\mathit{\mu}}_{2},\sigma^{2}\boldsymbol{\mathit{I}})$ and distributions $Q_{1},Q_{2}$ such that $\|\boldsymbol{\mathit{\mu}}_{1}-\boldsymbol{\mathit{\mu}}_{2}\|_{2}=\Omega(\eta\sigma)$ and

So, given $\mathcal{D}_{\eta}$ , no algorithm can distinguish between $\mathcal{D}_{1},\mathcal{D}_{2}$ . Let $\phi_{1}$ be the p.d.f. of $\mathcal{D}_{1}$ and $\phi_{2}$ be the p.d.f. of $\mathcal{D}_{2}.$ Let $\boldsymbol{\mathit{\mu}}_{1},\boldsymbol{\mathit{\mu}}_{2}$ be such that the total variation distance between $\mathcal{D}_{1},\mathcal{D}_{2}$ is

By a standard inequality for the total variation distance of Gaussian distributions, this implies that $\|\boldsymbol{\mathit{\mu}}_{1}-\boldsymbol{\mathit{\mu}}_{2}\|_{2}\geq\frac{2\eta\sigma}{1-\eta}.$ Let $Q_{1}$ be the distribution with p.d.f $\frac{1-\eta}{\eta}(\phi_{2}-\phi_{1})\mathbf{1}_{\phi_{2}\geq\phi_{1}}$ and $Q_{2}$ be the distribution with p.d.f $\frac{1-\eta}{\eta}(\phi_{1}-\phi_{2})\mathbf{1}_{\phi_{1}\geq\phi_{2}}$ . It is now easy to verify that Equation 3 is satisfied. This proves item one of Observation 1.4.
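The construction can be checked numerically in one dimension. The sketch below takes $\sigma=1$ and assumes that the total variation distance is set to $\eta/(1-\eta)$, which is the value needed for $Q_{1}$ and $Q_{2}$ as defined above to integrate to one; the grid, the step size and the variable names are illustrative.

```python
from statistics import NormalDist

eta = 0.1
tv = eta / (1 - eta)                    # assumed total variation distance
# For N(0, 1) and N(delta, 1), TV = 2 * Phi(delta / 2) - 1; solve for delta.
delta = 2 * NormalDist().inv_cdf((1 + tv) / 2)
d1, d2 = NormalDist(0.0, 1.0), NormalDist(delta, 1.0)

def q1(x):
    """Density of Q_1: rescaled positive part of phi_2 - phi_1."""
    return (1 - eta) / eta * max(d2.pdf(x) - d1.pdf(x), 0.0)

def q2(x):
    """Density of Q_2: rescaled positive part of phi_1 - phi_2."""
    return (1 - eta) / eta * max(d1.pdf(x) - d2.pdf(x), 0.0)

def mix1(x):
    return (1 - eta) * d1.pdf(x) + eta * q1(x)

def mix2(x):
    return (1 - eta) * d2.pdf(x) + eta * q2(x)

step = 0.01
grid = [-8 + step * i for i in range(1601 + int(delta / step))]
max_gap = max(abs(mix1(x) - mix2(x)) for x in grid)   # pointwise difference
q1_mass = sum(q1(x) for x in grid) * step             # Riemann sum of Q_1
```

Both noisy mixtures equal $(1-\eta)\max(\phi_{1},\phi_{2})$ pointwise, so samples from them carry no information about which of the two means is the true one.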

For the distributions with bounded fourth moments, consider the following two one-dimensional distributions. $\mathcal{D}_{1}$ is supported on two points $\{-\sigma,\sigma\}$ with the corresponding probabilities $\{1/2,1/2\}$ . $\mathcal{D}_{2}$ is supported on three points $\{-\sigma,\sigma,\sigma/\eta^{1/4}\}$ with probabilities $\{(1-\eta)/2,(1-\eta)/2,\eta\}$ respectively. Let $\eta\leq 1/4$ . It is easy to check that both $\mathcal{D}_{1}$ and $\mathcal{D}_{2}$ have bounded fourth moments with the constant $C_{4}=8$ . Furthermore, $\mathcal{D}_{2}$ can be obtained from $\mathcal{D}_{1}$ by adding $\eta$ fraction of noise points. So no algorithm can distinguish between the two distributions. Since their means differ by $\eta^{3/4}\sigma$ , no algorithm can get an estimate better than this.
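The claims about this pair of distributions are easy to verify by direct computation. The sketch below assumes that the fourth-moment constant is the ratio of the fourth central moment to the squared variance; the helper names are illustrative.

```python
def moments(points, probs):
    """Mean, variance and fourth central moment of a finite distribution."""
    mean = sum(p * x for x, p in zip(points, probs))
    var = sum(p * (x - mean) ** 2 for x, p in zip(points, probs))
    fourth = sum(p * (x - mean) ** 4 for x, p in zip(points, probs))
    return mean, var, fourth

def example(eta, sigma=1.0):
    """Moments of D_1 (two points) and D_2 (two points plus a far atom)."""
    d1 = moments([-sigma, sigma], [0.5, 0.5])
    far = sigma / eta ** 0.25
    d2 = moments([-sigma, sigma, far],
                 [(1 - eta) / 2, (1 - eta) / 2, eta])
    return d1, d2

# For each eta: (mean gap, fourth-moment ratio of D_1, ratio of D_2).
results = {}
for eta in (0.25, 0.1, 0.01, 1e-4):
    (m1, v1, f1), (m2, v2, f2) = example(eta)
    results[eta] = (m2 - m1, f1 / v1**2, f2 / v2**2)
```

For every $\eta$ in the loop the mean gap equals $\eta^{3/4}\sigma$ and both ratios stay well below $8$.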

We will now show that the geometric median:

has a $\sqrt{n}$ dependence on the dimension. We show this in the Gaussian case even if we have access to the whole distribution, but with $\eta$ fraction of noise points placed all at a single point far away from most of the Gaussian points.

Let $\mathcal{D}=N(\boldsymbol{0},\boldsymbol{{\Sigma}})$ be a distribution with diagonal covariance matrix $\boldsymbol{{\Sigma}}$ whose variance along the coordinate direction $\boldsymbol{\mathit{e}}_{1}$ is zero, and equal to $1$ in all the other coordinate directions. Assume there is an $\eta$ fraction of noise at a distance $a=n$ along $\boldsymbol{\mathit{e}}_{1}$ . Let

We have that at the minimizer $t_{0}$ , the derivative with respect to $t$ is zero. Therefore, we should have

Consider $f(t)=\mbox{{\bf E}}_{\boldsymbol{\mathit{x}}\sim\mathcal{D}}\frac{t}{\sqrt{t^{2}+x_{2}^{2}+...+x_{n}^{2}}}.$ It is clear from Equation 4 that $t_{0}>0.$ We claim that if $t=\alpha\eta\sqrt{n}$ for a small enough constant $\alpha$ , then $f(t)\leq\frac{\eta}{1-\eta}.$ Suppose $t_{1}=\alpha\eta\sqrt{n}$ . Since $\boldsymbol{\mathit{x}}\sim\mathcal{D}$ , we have $\|\boldsymbol{\mathit{x}}\|_{2}^{2}\geq n/2$ with probability $1-e^{-\Omega(n)}$ . Therefore,

2 Algorithms

Our algorithms are based on outlier removal and SVD. To simplify the proofs, we use new samples for each step of the algorithm. The total sample complexity is given in the theorems.

For outlier removal, we use one of the following two simple routines. The first, which we call OutlierDamping, returns a vector of positive weights, one for each sample point.

Let $\boldsymbol{\mathit{a}}$ be the coordinate-wise median of $S$ . Let $s^{2}=C\operatorname*{Tr}(\boldsymbol{{\Sigma}}).$ Estimate $\operatorname*{Tr}(\boldsymbol{{\Sigma}})$ by estimating the one-dimensional variance along $n$ orthogonal directions, see Section 4.1.

Set $w_{i}=\exp\left(-\frac{\|\boldsymbol{\mathit{x}}_{i}-\boldsymbol{\mathit{a}}\|_{2}^{2}}{s^{2}}\right)$ for every $\boldsymbol{\mathit{x}}_{i}\in S.$
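A minimal sketch of the damping step follows. The scale $s^{2}$ is passed in as an argument because the one-dimensional variance estimator of Section 4.1 is not reproduced here; in the usage example the exact trace and $C=1$ are assumed. The function names and the test data are illustrative.

```python
import numpy as np

def outlier_damping(X, s2):
    """Weights w_i = exp(-||x_i - a||^2 / s^2), a = coordinate-wise median."""
    a = np.median(X, axis=0)
    return np.exp(-np.sum((X - a) ** 2, axis=1) / s2)

def weighted_mean(X, w):
    """Mean of the rows of X under normalized weights w."""
    return (w / w.sum()) @ X

rng = np.random.default_rng(0)
n = 10
good = rng.standard_normal((900, n))             # N(0, I), so Tr(Sigma) = n
noise = np.tile(20.0 * np.eye(n)[0], (100, 1))   # adversarial point mass
X = np.vstack([good, noise])

w = outlier_damping(X, s2=1.0 * n)               # C = 1 and exact trace assumed
damped = weighted_mean(X, w)
plain = X.mean(axis=0)
```

Every point keeps a positive weight, but a point at distance $20$ from the center receives a weight that is many orders of magnitude smaller than that of a typical Gaussian point, so it barely affects weighted averages.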

The second procedure for outlier removal returns a subset of points. It will be convenient to view this as a $0/1$ weighting of the point set. We call this procedure OutlierTruncation.

if $n=1$ : Let $[a,b]$ be the smallest interval containing $(1-\eta-\epsilon)(1-\eta)$ fraction of the points, $\widetilde{S}\leftarrow S\cap[a,b]$ . Return $(\widetilde{S},1).$

Let $\boldsymbol{\mathit{a}}$ be as in Lemma 3.15.

Let $B(r,\boldsymbol{\mathit{a}})=$ ball of minimum radius $r$ centered at $\boldsymbol{\mathit{a}}$ that contains $(1-\eta-\epsilon)(1-\eta)$ fraction of $S$ .

$\widetilde{S}\leftarrow S\cap B(r,\boldsymbol{\mathit{a}}).$ Return $(\widetilde{S},\mathbf{1}).$
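The truncation routine can be sketched as follows. The center is passed in as an argument, since the point of Lemma 3.15 is not reproduced here; the usage example substitutes the coordinate-wise median, which is an assumption. The function names and the handling of ties are also illustrative.

```python
import numpy as np

def smallest_interval(xs, frac):
    """Shortest interval [a, b] containing at least a frac fraction of xs."""
    xs = np.sort(np.asarray(xs, dtype=float))
    k = int(np.ceil(frac * len(xs)))             # number of points to keep
    widths = xs[k - 1:] - xs[: len(xs) - k + 1]  # width of each k-point window
    start = int(np.argmin(widths))
    return xs[start], xs[start + k - 1]

def outlier_truncation(X, eta, eps, center):
    """Return the retained points and an all-ones weight vector."""
    X = np.asarray(X, dtype=float)
    frac = (1 - eta - eps) * (1 - eta)
    if X.shape[1] == 1:
        lo, hi = smallest_interval(X[:, 0], frac)
        keep = (X[:, 0] >= lo) & (X[:, 0] <= hi)
    else:
        dist = np.linalg.norm(X - center, axis=1)
        k = int(np.ceil(frac * len(X)))
        radius = np.sort(dist)[k - 1]            # smallest ball holding k points
        keep = dist <= radius
    return X[keep], np.ones(int(keep.sum()))

rng = np.random.default_rng(0)
pts = np.vstack([rng.standard_normal((90, 2)), np.full((10, 2), 50.0)])
kept, weights = outlier_truncation(pts, eta=0.1, eps=0.0,
                                   center=np.median(pts, axis=0))
```

In this example the ten far points are discarded together with a few of the outermost Gaussian points, which is the price paid for not knowing which points are noise.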

2.2 Main Algorithm

We are now ready to state the main algorithm for agnostic mean estimation. It uses one of the above outlier removal procedures and assumes that the output of the procedure is a weighting.

Let $(\widetilde{S},\boldsymbol{\mathit{w}})=\textsc{OutlierRemoval}(S)$ .

if $n=1$ :

if $\boldsymbol{\mathit{w}}=-1$ , Return $\operatorname*{median}(\widetilde{S})$ . //Gaussian case

else Return $\operatorname*{mean}(\widetilde{S})$ . //General case

Let $\boldsymbol{{\Sigma}}_{\widetilde{S},\boldsymbol{\mathit{w}}}$ be the weighted covariance matrix of $\widetilde{S}$ with weights $\boldsymbol{\mathit{w}}$ , and $V$ be the span of the top $n/2$ principal components of $\boldsymbol{{\Sigma}}_{\widetilde{S},\boldsymbol{\mathit{w}}}$ , and $W$ be its complement.

Set $S_{1}:=\boldsymbol{\mathit{P}}_{V}(S)$ where $\boldsymbol{\mathit{P}}_{V}$ is the projection operation onto $V$ .

Let $\boldsymbol{\widehat{\mathit{\mu}}}_{V}:=\textsc{AgnosticMean}(S_{1})$ and $\boldsymbol{\widehat{\mathit{\mu}}}_{W}:=\text{mean}(\boldsymbol{\mathit{P}}_{W}\widetilde{S}).$

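Let $\boldsymbol{\widehat{\mathit{\mu}}}\in\mathbb{R}^{n}$ be such that $\boldsymbol{\mathit{P}}_{V}\boldsymbol{\widehat{\mathit{\mu}}}=\boldsymbol{\widehat{\mathit{\mu}}}_{V}$ and $\boldsymbol{\mathit{P}}_{W}\boldsymbol{\widehat{\mathit{\mu}}}=\boldsymbol{\widehat{\mathit{\mu}}}_{W}$ . Return $\boldsymbol{\widehat{\mathit{\mu}}}.$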
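The recursion can be sketched in a few lines. This is a simplified illustration and not the implementation in [KRV]: it reuses the same sample at every level instead of drawing fresh samples, uses a plain truncation around the coordinate-wise median as the outlier removal step, and returns the median in the one-dimensional base case. All function names and the test data are assumptions.

```python
import numpy as np

def outlier_truncation(X, eta):
    """Keep the (1 - eta) fraction of points closest to the coordinate-wise median."""
    dist = np.linalg.norm(X - np.median(X, axis=0), axis=1)
    return X[dist <= np.quantile(dist, 1.0 - eta)]

def agnostic_mean(X, eta):
    """Recursive sketch: truncate, split by PCA, average in W, recurse in V."""
    n = X.shape[1]
    kept = outlier_truncation(X, eta)
    if n == 1:
        return np.array([np.median(kept[:, 0])])   # base case (Gaussian variant)
    cov = np.cov(kept, rowvar=False, bias=True)
    _, eigvecs = np.linalg.eigh(cov)               # eigenvalues in ascending order
    k = n // 2
    W, V = eigvecs[:, : n - k], eigvecs[:, n - k:]  # bottom and top spans
    mu_W = kept.mean(axis=0) @ W                   # sample mean, in W coordinates
    mu_V = agnostic_mean(X @ V, eta)               # recurse on S projected onto V
    return V @ mu_V + W @ mu_W                     # combine the two components

rng = np.random.default_rng(0)
n, eta = 16, 0.1
good = rng.standard_normal((3600, n))              # true mean is 0
noise = 10.0 * np.ones(n) / np.sqrt(n) + 0.1 * rng.standard_normal((400, n))
X = np.vstack([good, noise])

estimate = agnostic_mean(X, eta)
err_agnostic = np.linalg.norm(estimate)
err_sample_mean = np.linalg.norm(X.mean(axis=0))
```

The subspaces are handled in coordinates: `X @ V` expresses the points in an orthonormal basis of $V$, and the final line maps the two partial estimates back to $\mathbb{R}^{n}$. On this example the sample mean is pulled toward the noise by roughly $\eta$ times the noise distance, while the recursive estimate should stay much closer to the true mean.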

2.3 Estimation of the Covariance Matrix and Operator Norm

For both the tasks in this section, we will assume that the mean of the distribution is $\boldsymbol{\mathit{\mu}}=\boldsymbol{0}$ . We can do this without loss of generality by a standard trick described in Section 4.2. The algorithm for estimating the covariance matrix calls AgnosticMean on $\boldsymbol{\mathit{x}}\boldsymbol{\mathit{x}}^{T}.$ The analysis is given in Section 4.2.

Output: $n\times n$ matrix $\boldsymbol{\widehat{\Sigma}}$

Let $S^{(2)}=\{\boldsymbol{\mathit{x}}_{i}^{\prime}\boldsymbol{\mathit{x}}_{i}^{\prime T}|\,i=1,...,m/2\}$ (see Equation 15)
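The reduction from covariance estimation to mean estimation can be sketched as follows. The centering step is assumed to be the usual one of differencing pairs of samples, $\boldsymbol{\mathit{x}}_{i}^{\prime}=(\boldsymbol{\mathit{x}}_{i}-\boldsymbol{\mathit{x}}_{i+m/2})/\sqrt{2}$, which has mean zero and the same covariance; this is an assumption about the trick referred to above. The `mean_estimator` argument is a hypothetical hook for a robust estimator such as AgnosticMean, and the usage example passes the plain sample mean on clean data only to check the reduction itself.

```python
import numpy as np

def covariance_via_mean(X, mean_estimator):
    """Estimate Sigma as the mean of flattened outer products of paired differences."""
    half = len(X) // 2
    paired = (X[:half] - X[half:2 * half]) / np.sqrt(2.0)  # zero mean, covariance Sigma
    # One n x n outer product per pair, flattened to a vector in n^2 dimensions.
    flat = np.einsum("ij,ik->ijk", paired, paired).reshape(half, -1)
    n = X.shape[1]
    sigma_hat = np.asarray(mean_estimator(flat)).reshape(n, n)
    return (sigma_hat + sigma_hat.T) / 2.0                 # enforce symmetry

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.0, 0.0, 3.0]])
X = rng.multivariate_normal([1.0, -2.0, 0.5], Sigma, size=20000)
Sigma_hat = covariance_via_mean(X, lambda Y: Y.mean(axis=0))
```

Pairing removes the unknown mean without estimating it, at the cost of halving the number of samples.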

The algorithm for estimating $\|\boldsymbol{{\Sigma}}\|_{2}$ is based on iteratively truncating the samples along the direction of top variance. The analysis is given in Section 5.

Let $\widetilde{S}=\textsc{SafeOutlierTruncation}(S,\eta,\gamma)$ .

Do the following $O(n\log^{2/\gamma}\frac{n}{\eta})$ times

Let $\boldsymbol{{\Sigma}}_{\boldsymbol{0}}(\widetilde{S}):=\frac{1}{|\widetilde{S}|}\sum_{\boldsymbol{\mathit{x}}\in\widetilde{S}}\boldsymbol{\mathit{x}}\boldsymbol{\mathit{x}}^{T}$ .

Find $\boldsymbol{\mathit{v}}$ , the top eigenvector of $\boldsymbol{{\Sigma}}_{\boldsymbol{0}}(\widetilde{S})$ , and its corresponding eigenvalue $\sigma^{2}$ .

Estimate (up to $1\pm c\eta$ factor, see Section 4.1) the variance of $\mathcal{D}$ along $\boldsymbol{\mathit{v}}$ and denote it by ${\widehat{\sigma}}_{\boldsymbol{\mathit{v}}}^{2}$ .

if $\sigma^{2}\leq(1+c_{3}\eta\log^{2/\gamma}\frac{n}{\eta}){\widehat{\sigma}}_{\boldsymbol{\mathit{v}}}^{2}$ Return $\sigma^{2}$ .

Remove all points $\boldsymbol{\mathit{x}}\in\widetilde{S}$ such that $|\boldsymbol{\mathit{x}}^{T}\boldsymbol{\mathit{v}}|>\frac{c_{2}{\widehat{\sigma}}_{\boldsymbol{\mathit{v}}}\log^{1/\gamma}\frac{n}{\eta}}{2}$ .

Let $t=\sum_{i=1}^{n}\widehat{\sigma}^{2}_{e_{i}}$ be the sum of estimated variances of $\mathcal{D}$ in $n$ orthogonal directions.

Let $B(c\sqrt{t}\log^{1/\gamma}\frac{n}{\eta},\boldsymbol{0})$ be the ball of radius $c\sqrt{t}\log^{1/\gamma}\frac{n}{\eta}$ centered at $\boldsymbol{0}$ .

$\widetilde{S}\leftarrow S\cap B(c\sqrt{t}\log^{1/\gamma}\frac{n}{\eta},\boldsymbol{0}).$ Return $\widetilde{S}.$
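The two routines above can be sketched together. Several ingredients are assumptions: the one-dimensional variance estimate of Section 4.1 is replaced by a Gaussian-calibrated median absolute deviation, the constants `c`, `c2` and `c3` are arbitrary illustrative values, $\gamma=2$ is used as for Gaussian tails, and the same sample is reused in every iteration.

```python
import numpy as np

def robust_variance_1d(values):
    """Assumed stand-in for the 1-d variance estimate of Section 4.1:
    squared median absolute deviation, calibrated for Gaussians."""
    med = np.median(values)
    return (1.4826 * np.median(np.abs(values - med))) ** 2

def safe_outlier_truncation(X, eta, gamma, c=3.0):
    """Keep points in the ball of radius c * sqrt(t) * log^{1/gamma}(n/eta) around 0."""
    n = X.shape[1]
    t = sum(robust_variance_1d(X[:, i]) for i in range(n))
    radius = c * np.sqrt(t) * np.log(n / eta) ** (1.0 / gamma)
    return X[np.linalg.norm(X, axis=1) <= radius]

def agnostic_operator_norm(X, eta, gamma=2.0, c2=4.0, c3=2.0):
    """Estimate ||Sigma||_2 from zero-mean samples with an eta fraction of noise."""
    n = X.shape[1]
    log_term = np.log(n / eta)
    S = safe_outlier_truncation(X, eta, gamma)
    top = 0.0
    for _ in range(int(np.ceil(n * log_term ** (2.0 / gamma)))):
        eigvals, eigvecs = np.linalg.eigh(S.T @ S / len(S))
        top, v = eigvals[-1], eigvecs[:, -1]       # top eigenpair
        var_v = robust_variance_1d(X @ v)          # variance of D along v
        if top <= (1 + c3 * eta * log_term ** (2.0 / gamma)) * var_v:
            return top
        cutoff = c2 * np.sqrt(var_v) * log_term ** (1.0 / gamma) / 2.0
        S = S[np.abs(S @ v) <= cutoff]             # drop large projections
    return top

rng = np.random.default_rng(0)
good = rng.standard_normal((4500, 5)) * np.array([2.0, 1.0, 1.0, 1.0, 1.0])
noise = 0.1 * rng.standard_normal((500, 5))
noise[:, 1] += 20.0                                # noise far along e_2
X = np.vstack([good, noise])

estimate = agnostic_operator_norm(X, eta=0.1)      # true value is 4
naive = np.linalg.eigvalsh(X.T @ X / len(X))[-1]   # top eigenvalue without removal
```

In this example the true operator norm is $4$; the untruncated second-moment matrix has a top eigenvalue dominated by the noise, while the truncated estimate should be close to the true value.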

Mean Estimation: Theorem 1.1 and Theorem 1.3

In this section, we will first prove Theorem 1.1, which is for Gaussian distributions, and then Theorem 1.3, which is for distributions with bounded fourth moments. All our algorithms will be translationally invariant, so we will assume w.l.o.g. that the mean of the distribution $\mathcal{D}$ is $\boldsymbol{\mathit{\mu}}=0$ and prove bounds on $\|\boldsymbol{\widehat{\mathit{\mu}}}\|_{2}.$ Algorithm 2.2.2 has $\log n$ levels; we will assume that at each level it uses $O(\frac{n\log n}{\epsilon^{2}})$ samples, resulting in a total of $m=O(\frac{n\log^{2}n}{\epsilon^{2}}).$

At various points in the analysis, to bound the sample complexity we will have to show that the estimates computed from samples are close to their expectations. We will use the following two results. Firstly, as an immediate corollary of matrix Bernstein for rectangular matrices (see Theorem $1.6$ in [Tro12]), we get the following concentration result for the sample mean and sample covariance.

Here $\widehat{\mu}$ and $\widehat{\Sigma}$ are sample mean and sample covariance matrix.

Secondly, the functions we estimate will be integrals of low-degree polynomials (degree $d$ at most $4$ ) restricted to intervals and/or balls. These functions viewed as binary concepts have small VC-dimension, $O(n^{d})$ where $n$ is the dimension of space and $d$ is the degree of the polynomial. We use this to bound the error of estimating integrals via samples, and we can make the error smaller than any inverse polynomial using a $\operatorname*{poly}(n)$ size sample.

By the VC theorem, for any concept in $C_{F}$ , the bound on the size of the sample ensures that with probability at least $1-\delta$ and any $t$ ,

Noting that $\mbox{{\bf E}}_{x\sim\mathcal{D}}(f(x))=\int_{-R}^{R}\Pr(f(x)\geq t)\,dt$ , we get the claimed bound.

We use the above notation for $T=S_{G}$ and $T=S_{N}$ . By an abuse of notation, when $T=G$ , we mean the population version of the above quantities:

We consider the matrix $\boldsymbol{{\Sigma}}_{S,\boldsymbol{\mathit{w}}}$

Let $\mathcal{D}=N(0,\sigma^{2})$ be a one dimensional Gaussian distribution. If $m=O\left(\frac{\log n}{\epsilon^{2}}\right)$ , and we are given $x_{1},...,x_{m}\sim\mathcal{D}_{\eta}$ , then the median $x_{\text{med}}=\operatorname*{median}_{i}\{x_{i}\}$ satisfies $|x_{\text{med}}|=O((\eta+\epsilon)\sigma)$ with high probability.

Let $S_{G}\subset S$ be made up of the samples in $S$ that come from the Gaussian, and let $c=\Phi^{-1}(1/2+\eta+\epsilon).$ Let us bound the probability that the median $x_{\text{med}}\geq c.$ We first note that if $x_{\text{med}}\geq c,$ then $\Pr\left(x>c|x\in_{u}S_{G}\right)\geq\epsilon.$ By Hoeffding’s inequality, we can bound the probability of this event by $1/\operatorname*{poly}(n)$ if $|S_{G}|=O\left(\frac{\log n}{\epsilon^{2}}\right).$ ∎
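The one-dimensional statement is easy to observe empirically. In the sketch below an $\eta$ fraction of the sample is replaced by a single far-away value; the sample size, the outlier location and the function name are illustrative choices.

```python
import random
import statistics

def contaminated_gaussian_sample(m, eta, sigma=1.0, outlier=1e6, seed=0):
    """m points: a (1 - eta) fraction from N(0, sigma^2), the rest at `outlier`."""
    rng = random.Random(seed)
    n_noise = int(eta * m)
    good = [rng.gauss(0.0, sigma) for _ in range(m - n_noise)]
    return good + [outlier] * n_noise

sample = contaminated_gaussian_sample(m=5000, eta=0.1)
median_error = abs(statistics.median(sample))   # true mean is 0
mean_error = abs(statistics.mean(sample))
```

The median moves by an amount on the order of $\eta\sigma$ regardless of where the noise is placed, while the sample mean moves by $\eta$ times the outlier location.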

We will next consider the multidimensional case. The proof follows by a series of lemmas. We state the lemmas first, conclude the proof of Theorem 1.1 and then prove the lemmas. First, we observe that by applying Lemma 3.3 in $n$ orthogonal directions and union bound, we get

By a simple calculation, $\max_{\boldsymbol{\mathit{x}}}\|\boldsymbol{\mathit{x}}\|^{2}e^{-\|\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{a}}\|^{2}/s^{2}}\leq O(s^{2}).$ This immediately gives the following bound on the trace.

Suppose $\boldsymbol{\mathit{A}}:=\eta\boldsymbol{{\Sigma}}_{S_{N},\boldsymbol{\mathit{w}}}+\eta(1-\eta)(\boldsymbol{\mathit{\mu}}_{S_{N},\boldsymbol{\mathit{w}}}-\boldsymbol{\mathit{\mu}}_{S_{G},\boldsymbol{\mathit{w}}})(\boldsymbol{\mathit{\mu}}_{S_{N},\boldsymbol{\mathit{w}}}-\boldsymbol{\mathit{\mu}}_{S_{G},\boldsymbol{\mathit{w}}})^{T}.$ Then there exists a constant $C$ such that,

As will be clear from the proof of Theorem 3.6, when $\boldsymbol{{\Sigma}}=\sigma^{2}\boldsymbol{\mathit{I}}$ is a multiple of identity, then $\boldsymbol{{\Sigma}}_{G,\boldsymbol{\mathit{w}}}$ will also be a multiple of $\boldsymbol{\mathit{I}}$ . By Lemma 3.1, if we take $m=O(\frac{n\log n}{\epsilon^{2}})$ samples, we will have

in the Loewner ordering, for some $\alpha,\beta>0$ . By an argument similar to the one sketched in Section 2, we can prove

We will use the notation as defined above. Let $W$ be the bottom $n/2$ principal components of the covariance matrix $\boldsymbol{{\Sigma}}_{S,\boldsymbol{\mathit{w}}}$ . We have

where $\|\boldsymbol{{\Sigma}}\|_{\min}$ denotes the least eigenvalue of $\boldsymbol{{\Sigma}}$ and $\boldsymbol{\mathit{\delta}}_{\boldsymbol{\mathit{\mu}}}:=\boldsymbol{\mathit{\mu}}_{S_{N},\boldsymbol{\mathit{w}}}-\boldsymbol{\mathit{\mu}}_{S_{G},\boldsymbol{\mathit{w}}}.$

By an inductive application of Lemma 3.7, we get the following theorem giving a bound on $\|\boldsymbol{\widehat{\mathit{\mu}}}\|.$

On input $S$ and the routine $\textsc{OutlierDamping}(\cdot)$ , AgnosticMean outputs $\boldsymbol{\widehat{\mathit{\mu}}}$ satisfying

Theorem 3.6 combined with Theorem 3.8 proves Theorem 1.1. We get a better dependence on $\eta$ when $\boldsymbol{{\Sigma}}=\sigma^{2}\boldsymbol{\mathit{I}}$ because we can take $\alpha=\beta$ in this case. This would lead to the cancellation of the leading term in the bound in Theorem 3.8 as $\|\boldsymbol{{\Sigma}}\|_{2}=\|\boldsymbol{{\Sigma}}\|_{\min}.$

Proof of Lemma 3.7: Recall that $\boldsymbol{{\Sigma}}$ denotes the covariance matrix of the Gaussian part. We have

where $\boldsymbol{\mathit{A}}=\eta\boldsymbol{{\Sigma}}_{S_{N},\boldsymbol{\mathit{w}}}+\eta(1-\eta)\boldsymbol{\mathit{\delta}}_{\boldsymbol{\mathit{\mu}}}\boldsymbol{\mathit{\delta}}_{\boldsymbol{\mathit{\mu}}}^{T}.$ Therefore, we have

For a symmetric matrix $\boldsymbol{\mathit{B}}$ , let $\lambda_{k}(\boldsymbol{\mathit{B}})$ denote the $k$ ’th largest eigenvalue. By Weyl’s inequality, we have

Recall that $W$ is the space spanned by the bottom $n/2$ eigenvectors of $\boldsymbol{{\Sigma}}_{S,\boldsymbol{\mathit{w}}}$ , and $\boldsymbol{\mathit{P}}_{W}$ is the matrix corresponding to the projection operator on to $W$ . We therefore have

Multiplying by the vector $\frac{\boldsymbol{\mathit{P}}_{W}\boldsymbol{\mathit{\delta}}_{\boldsymbol{\mathit{\mu}}}}{\|\boldsymbol{\mathit{P}}_{W}\boldsymbol{\mathit{\delta}}_{\boldsymbol{\mathit{\mu}}}\|}$ and its transpose on either side, we get

Assuming $\eta\leq 1/2$ , we therefore have

Proof of Theorem 3.8: By Equation 6 and Lemma 3.1, since we take $O\left(\frac{n\log n}{\epsilon^{2}}\right)$ samples we have

So it is enough to prove $\|\boldsymbol{\widehat{\mathit{\mu}}}-\boldsymbol{\mathit{\mu}}_{S_{G},\boldsymbol{\mathit{w}}}\|^{2}\leq O\left((\beta\eta+\eta^{2}+\epsilon^{2})\|\boldsymbol{{\Sigma}}\|_{2}-\alpha\eta\|\boldsymbol{{\Sigma}}\|_{\min}\right)(1+\log n).$ The proof is by induction. If $n=1$ , then the conclusion follows from the guarantees of the one dimensional median. Now, assume that it holds for all $n\leq k$ for some $k\geq 1$ . Let $n=k+1.$ We have by Lemma 3.7

By induction hypothesis, since $\text{dim}(V)=n/2$ , we have

where $\boldsymbol{\mathit{b}}=\frac{1}{s^{2}}\left(\boldsymbol{{\Sigma}}^{-1}+\frac{1}{s^{2}}\boldsymbol{\mathit{I}}\right)^{-1}\boldsymbol{\mathit{a}}.$ Therefore, we have

Now we will look at the scalar term $|\boldsymbol{{\Sigma}}|\left|\boldsymbol{{\Sigma}}^{-1}+\frac{1}{s^{2}}\boldsymbol{\mathit{I}}\right|.$ Let $\lambda_{i}$ be the eigenvalues of $\boldsymbol{{\Sigma}}.$
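A short worked computation of this scalar term, given as a sketch: it assumes $\boldsymbol{{\Sigma}}$ is positive definite, uses only the elementary inequality $1+x\leq e^{x}$ , and writes $\epsilon_{1}=\frac{\sum_{i}\lambda_{i}}{s^{2}}$ as elsewhere in this section.

```latex
\begin{align*}
|\boldsymbol{\Sigma}|\left|\boldsymbol{\Sigma}^{-1}+\tfrac{1}{s^{2}}\boldsymbol{I}\right|
 &=\prod_{i=1}^{n}\lambda_{i}\left(\frac{1}{\lambda_{i}}+\frac{1}{s^{2}}\right)
  =\prod_{i=1}^{n}\left(1+\frac{\lambda_{i}}{s^{2}}\right)\\
 &\leq\prod_{i=1}^{n}e^{\lambda_{i}/s^{2}}
  =\exp\left(\frac{\sum_{i}\lambda_{i}}{s^{2}}\right)
  =e^{\epsilon_{1}}.
\end{align*}
```

The first equality holds because $\boldsymbol{{\Sigma}}$ and $\boldsymbol{{\Sigma}}^{-1}+\frac{1}{s^{2}}\boldsymbol{\mathit{I}}$ share eigenvectors, so both determinants are products over the same index set.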

We next bound $\exp\left(-\frac{\|\boldsymbol{\mathit{a}}\|^{2}}{s^{2}}+\frac{1}{s^{4}}\boldsymbol{\mathit{a}}^{T}\left(\boldsymbol{{\Sigma}}^{-1}+\frac{1}{s^{2}}\boldsymbol{\mathit{I}}\right)^{-1}\boldsymbol{\mathit{a}}\right).$ We have

Note that if $\frac{1}{\lambda_{1}},...,\frac{1}{\lambda_{n}}$ and $\boldsymbol{\mathit{v}}_{1},...,\boldsymbol{\mathit{v}}_{n}$ are the eigenvalues and the corresponding eigenvectors of $\boldsymbol{{\Sigma}}^{-1}$ , then $\frac{1}{\lambda_{1}}+\frac{1}{s^{2}},...,\frac{1}{\lambda_{n}}+\frac{1}{s^{2}}$ and $\boldsymbol{\mathit{v}}_{1},...,\boldsymbol{\mathit{v}}_{n}$ are the eigenvalues and the corresponding eigenvectors of $\boldsymbol{{\Sigma}}^{-1}+\frac{1}{s^{2}}\boldsymbol{\mathit{I}}.$ Since,

where $\boldsymbol{\mathit{b}}=\frac{1}{s^{2}}\left(\boldsymbol{{\Sigma}}^{-1}+\frac{1}{s^{2}}\boldsymbol{\mathit{I}}\right)^{-1}\boldsymbol{\mathit{a}}.$ Recall that $\epsilon_{1}=\frac{\sum_{i}\lambda_{i}}{s^{2}}$ . We can, as before, bound the product of the two scalars by $e^{\epsilon_{1}}.$ Therefore, we have

Combining Equation (6) and Equation 5, we get Theorem 3.6.

2 Improving the dependence on $\eta$

Now we will show how we can obtain the second part of Theorem 1.1 to get a better dependence on $\eta$ by using $\boldsymbol{\widehat{\Sigma}}$ from Theorem 1.5. Let $\mathcal{D}=N(\boldsymbol{\mathit{\mu}},\boldsymbol{{\Sigma}})$ be a Gaussian with covariance $\boldsymbol{{\Sigma}},$ and $\eta\leq c/\log n$ for a small enough constant $c>0.$ We first use Theorem 1.5 (with $\epsilon=\eta$ ) to estimate $\sigma^{2}=\|\boldsymbol{{\Sigma}}\|_{2}$ . We get a ${\widehat{\sigma}}^{2}$ satisfying

Let $S=\{\boldsymbol{\mathit{x}}_{1},...,\boldsymbol{\mathit{x}}_{m}\}$ be the given sample, and let $\boldsymbol{\mathit{y}}_{i}\sim N(0,{\widehat{\sigma}}^{2}\boldsymbol{\mathit{I}}),i=1,...,m$ be i.i.d. samples. Define $\boldsymbol{\mathit{x}}_{i}^{\prime}=\boldsymbol{\mathit{x}}_{i}+\boldsymbol{\mathit{y}}_{i}.$ The key thing to note is that if $\boldsymbol{\mathit{x}}\sim N(\boldsymbol{\mathit{\mu}},\boldsymbol{{\Sigma}})$ and $\boldsymbol{\mathit{y}}\sim N(0,{\widehat{\sigma}}^{2}\boldsymbol{\mathit{I}})$ are independent, then $\boldsymbol{\mathit{x}}+\boldsymbol{\mathit{y}}\sim N(\boldsymbol{\mathit{\mu}},\boldsymbol{{\Sigma}}+{\widehat{\sigma}}^{2}\boldsymbol{\mathit{I}})$ . Let $\mathcal{D}^{\prime}=N(\boldsymbol{\mathit{\mu}},\boldsymbol{{\Sigma}}+{\widehat{\sigma}}^{2}\boldsymbol{\mathit{I}})$ . Note that the mean $\boldsymbol{\mathit{\mu}}^{\prime}$ of $\mathcal{D}^{\prime}$ is the same as that of $\mathcal{D}$ , and the covariance $\boldsymbol{{\Sigma}}^{\prime}=\boldsymbol{{\Sigma}}+{\widehat{\sigma}}^{2}\boldsymbol{\mathit{I}}$ has

We can view $\boldsymbol{\mathit{x}}_{i}^{\prime}\sim\mathcal{D}^{\prime}_{\eta},$ and we assume $\eta\log n\leq c.$ By Theorem 1.5 and Equation 7, we can compute a $\boldsymbol{\widehat{\Sigma}}^{\prime}$ such that

Let $\alpha=O\left(\sqrt{\eta\log n}\right).$ Therefore,

by Equation 8. Now, if we let $\boldsymbol{\mathit{x}}_{i}^{\prime\prime}=\boldsymbol{\widehat{\Sigma}}^{\prime-1/2}\boldsymbol{\mathit{x}}_{i}^{\prime}$ and $\mathcal{D}^{\prime\prime}=N(\boldsymbol{\mathit{\mu}}^{\prime\prime},\boldsymbol{{\Sigma}}^{\prime\prime})=N\left(\boldsymbol{\widehat{\Sigma}}^{\prime-1/2}\boldsymbol{\mathit{\mu}},\boldsymbol{\widehat{\Sigma}}^{\prime-1/2}\boldsymbol{{\Sigma}}^{\prime}\boldsymbol{\widehat{\Sigma}}^{\prime-1/2}\right),$ then we can think of $\boldsymbol{\mathit{x}}_{i}^{\prime\prime}\sim\mathcal{D}^{\prime\prime}_{\eta}.$ If we now use Theorem 3.8 with $\beta=\left(1+O\left(\sqrt{\eta\log n}\right)\right)$ and $\alpha=\left(1-O\left(\sqrt{\eta\log n}\right)\right)$ on the samples $S^{\prime\prime}=\{\boldsymbol{\mathit{x}}_{i}^{\prime\prime}\}$ , we get a $\boldsymbol{\widehat{\mathit{\mu}}}^{\prime\prime}$ such that

This implies that $\boldsymbol{\widehat{\mathit{\mu}}}=\boldsymbol{\widehat{\Sigma}}^{\prime 1/2}\boldsymbol{\widehat{\mathit{\mu}}}^{\prime\prime}$ satisfies

We can use this technique to give a polynomial time algorithm to compute $\boldsymbol{\widehat{\mathit{\mu}}}$ with a guarantee $\|\boldsymbol{\widehat{\mathit{\mu}}}-\boldsymbol{\mathit{\mu}}\|^{2}=O\left(\|\boldsymbol{{\Sigma}}\|_{2}\eta^{2-\epsilon}\log^{2-\epsilon}n\right)$ for any fixed $\epsilon>0.$ This would require estimating higher order moments by the mean estimation algorithm and then using the above trick to improve the $\eta$ dependence for each of them in sequence. We don’t give a proof of this in this paper.
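The noise-addition step of this subsection can be illustrated in isolation. The sketch below is not the full procedure (it omits the robust covariance estimate and the whitening by $\boldsymbol{\widehat{\Sigma}}^{\prime-1/2}$ ), and it assumes an axis-aligned $\boldsymbol{{\Sigma}}$ so that coordinate variances coincide with eigenvalues. The helper names and the use of the exact $\|\boldsymbol{{\Sigma}}\|_{2}$ in place of its estimate ${\widehat{\sigma}}^{2}$ are assumptions made for the example.

```python
import random

def add_isotropic_noise(points, sigma_hat, rng):
    """Return x_i' = x_i + y_i with y_i ~ N(0, sigma_hat^2 I) drawn independently."""
    return [[c + rng.gauss(0.0, sigma_hat) for c in x] for x in points]

def coordinate_variances(points):
    """Empirical variance of each coordinate."""
    m, dim = len(points), len(points[0])
    means = [sum(x[j] for x in points) / m for j in range(dim)]
    return [sum((x[j] - means[j]) ** 2 for x in points) / m for j in range(dim)]

rng = random.Random(2)
eigenvalues = [4.0, 0.01]             # Sigma = diag(4, 0.01), badly conditioned
points = [[rng.gauss(0.0, lam ** 0.5) for lam in eigenvalues]
          for _ in range(20000)]
sigma_hat = max(eigenvalues) ** 0.5   # stand-in for the estimated sqrt(||Sigma||_2)
noisy = add_isotropic_noise(points, sigma_hat, rng)

before = coordinate_variances(points)
after = coordinate_variances(noisy)
ratio_before = max(before) / min(before)  # close to 4 / 0.01
ratio_after = max(after) / min(after)     # close to (4 + 4) / (0.01 + 4)
```

Adding the noise leaves the mean unchanged but makes the largest and smallest eigenvalues of the covariance comparable, which is what allows $\alpha$ and $\beta$ to be taken close to $1$ after whitening.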

3 Distributions with Bounded Fourth Moments

In this section, we will prove some useful properties that distributions with bounded fourth moments satisfy. We will assume that $\boldsymbol{\mathit{x}}\sim\mathcal{D}$ for a distribution with mean $\boldsymbol{\mathit{\mu}}$ that has bounded fourth moments, i.e., for every unit vector $\boldsymbol{\mathit{v}}$

Let $X$ be a random variable with $\mbox{{\bf E}}(X-\mbox{{\bf E}}X)^{2}=\sigma^{2}$ and

for some $C_{4}$ . Let $\epsilon\leq 0.5$ and $A$ be any event with probability $\Pr(A)=1-\epsilon.$ Then

The fourth moment of such an $X$ is minimum when its support is just the two-point set $\{a,\frac{1-\epsilon}{\epsilon}(\mbox{{\bf E}}X-a)+\mbox{{\bf E}}X\}$ . Therefore,

Let $X$ be a random variable with $\mbox{{\bf E}}X=\mu$ and $\mbox{{\bf E}}((X-\mu)^{2})=\sigma^{2}$ and let

for some $C_{4}$ . Then, for every event $A$ that occurs with probability at least $1-\epsilon$ , we have

where $\mathbf{1}_{A}$ is the indicator function of the event $A.$ As an immediate corollary, for $\epsilon\leq 0.5$ we get the following bound on the conditional probability

Let $d\Omega$ be the probability measure. We can write $\mbox{{\bf E}}(X-\mu)^{4}\leq C_{4}(\mbox{{\bf E}}(X-\mu)^{2})^{2}$ in the following way

Using $\mbox{{\bf E}}(Y-\mbox{{\bf E}}Y)^{4}\geq(\mbox{{\bf E}}(Y-\mbox{{\bf E}}Y)^{2})^{2}$ for any random variable $Y,$ and $\Pr(A^{c})=\epsilon$ we have

Therefore, for $\epsilon\leq 0.5$ we get that

As an immediate corollary of Lemma 3.11 and Lemma 3.12, we get for a random variable $\boldsymbol{\mathit{x}}$ having bounded fourth moments

Let $A$ be an event that happens with probability $1-\eta$ . Then,

where $\boldsymbol{{\Sigma}}|_{A}$ is the conditional covariance matrix $\boldsymbol{{\Sigma}}|_{A}:=\mbox{{\bf E}}(\boldsymbol{\mathit{x}}\boldsymbol{\mathit{x}}^{T}|A)-(\mbox{{\bf E}}(\boldsymbol{\mathit{x}}|A))(\mbox{{\bf E}}(\boldsymbol{\mathit{x}}|A))^{T}.$

Let $\boldsymbol{\mathit{v}}$ be any unit vector. Let $y$ be the random variable that is $\boldsymbol{\mathit{v}}^{T}\boldsymbol{\mathit{x}}$ for $\boldsymbol{\mathit{x}}\sim\mathcal{D}$ . Let $\mu=\mbox{{\bf E}}(y)$ , $\mu_{A}=\mbox{{\bf E}}(y|A)$ , and $d=\mu_{A}-\mu$ . Then

Finally, by a standard argument as in the proof of Chebyshev’s inequality, we have

For every unit vector $\boldsymbol{\mathit{v}}$ , we have

where $\sigma_{\boldsymbol{\mathit{v}}}$ is the standard deviation of $\boldsymbol{\mathit{x}}$ along the direction $\boldsymbol{\mathit{v}}$ , $\sigma_{\boldsymbol{\mathit{v}}}^{2}:=\mbox{{\bf E}}|\boldsymbol{\mathit{x}}^{T}\boldsymbol{\mathit{v}}|^{2}-|\mbox{{\bf E}}\boldsymbol{\mathit{x}}^{T}\boldsymbol{\mathit{v}}|^{2}.$

4 Proof of Theorem 1.3:

First we will consider the case when $X$ is a random variable with mean $\mu$ and variance $\sigma^{2}$ satisfying

In this case, the median need not be a good estimator. Instead, we will consider the interval of minimum length that contains a $(1-\eta-\epsilon)(1-\eta)$ fraction of the sample points. Let $S$ be the given sample, and let $\widetilde{S}$ be the points lying in this interval. Let $\widehat{\mu}=\text{mean}(\widetilde{S})$ be our estimator. We will show below that $|\widehat{\mu}-\mu|\leq O\left(C_{4}^{1/4}(\eta+\epsilon)^{3/4}\sigma\right).$

By the concentration inequality stated in Lemma 3.14, we get that for the distribution, the length $r_{1-\frac{\eta+\epsilon}{2}}$ of the interval around $\mu$ consisting of probability mass $1-\frac{\eta+\epsilon}{2}$ is bounded by

The length of the smallest interval that contains $(1-\eta-\epsilon)(1-\eta)$ fraction of $S$ is at most the length of the smallest interval that contains $1-\eta-\epsilon$ fraction of $S_{\mathcal{D}}$ . This latter quantity is bounded by $r_{1-\eta}$ , since the interval $I_{1-\frac{\eta+\epsilon}{2}}$ contains with probability $1-1/\operatorname*{poly}(n)$ a $(1-\eta-\epsilon)$ fraction of $S_{\mathcal{D}}$ .

This implies that when we look at the minimum interval containing $1-\eta-\epsilon$ fraction of the non-noise points, the extreme points of the interval can be at most at a distance $r_{1-\frac{\eta+\epsilon}{2}}$ from $\mu$ . Thus, all noise points retained in $\widetilde{S}$ lie within distance $O\left(\frac{C_{4}^{1/4}}{(\eta+\epsilon)^{1/4}}\sigma\right)$ of $\mu$ . Furthermore, the interval of minimum length with $(1-\eta-\epsilon)(1-\eta)$ fraction of $S$ will contain at least $1-3\eta-\epsilon$ fraction of $S_{\mathcal{D}}$ . Therefore, by Lemma 3.11 the mean of $\widetilde{S}$ will be within $\eta\cdot r_{1-\eta}+O\left(\sqrt{C_{4}(\eta+\epsilon)^{3}}\sigma\right)=O\left(C_{4}^{1/4}(\eta+\epsilon)^{3/4}\sigma\right)$ from the true mean.
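The one-dimensional estimator analyzed above can be sketched as follows. This is a minimal illustration under the assumption that the sample is a plain list of reals. The function name `shortest_interval_mean` and the sliding-window search over the sorted sample are choices made for this example, not taken from the paper.

```python
import math
import random

def shortest_interval_mean(sample, eta, eps):
    """Mean of the points in the shortest interval that contains a
    (1 - eta - eps)(1 - eta) fraction of the sample."""
    xs = sorted(sample)
    m = len(xs)
    # Number of points the interval must contain.
    k = min(m, max(1, math.ceil((1 - eta - eps) * (1 - eta) * m)))
    # Slide a window of k consecutive sorted points and keep the narrowest one.
    start = min(range(m - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    kept = xs[start:start + k]
    return sum(kept) / k

rng = random.Random(0)
# 10% of the points are adversarial (here: a fixed far-away value).
sample = [1000.0 if rng.random() < 0.1 else rng.gauss(5.0, 1.0)
          for _ in range(5000)]
estimate = shortest_interval_mean(sample, eta=0.1, eps=0.05)
naive = sum(sample) / len(sample)
```

Sorting makes the search run in $O(m\log m)$ time, since the shortest interval containing $k$ sample points always has sample points as its endpoints.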

4.2 Multi-dimensional Case

For any direction $\boldsymbol{\mathit{v}}$ , let $\mu_{v}=\boldsymbol{\mathit{\mu}}^{T}v$ . From the previous section, we know that we can find a $\widehat{\mu}_{\boldsymbol{\mathit{v}}}$ such that

Therefore, by picking $n$ orthogonal directions $\boldsymbol{\mathit{v}}_{1},...,\boldsymbol{\mathit{v}}_{n}$ , we get

We will now bound the radius of the ball in the outlier removal step (Algorithm 2.2.1). We claim the radius of the ball is $O\left(\frac{C_{4}^{1/4}}{(\eta+\epsilon)^{1/4}}\sqrt{n||\boldsymbol{{\Sigma}}||_{2}}\right)$ . Suppose we have some $\boldsymbol{\mathit{x}}\sim\mathcal{D}$ . Let $\boldsymbol{\mathit{z}}=\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{\mu}}$ . Using the $n$ orthogonal directions as picked above, let $z_{i}=\boldsymbol{\mathit{z}}^{T}\boldsymbol{\mathit{v}}_{i}$ and let $Z^{2}=\sum z_{i}^{2}=\left\|\boldsymbol{\mathit{z}}\right\|_{2}^{2}$ . Consider the following:

It suffices to bound the right-hand side of the inequality above by $O(\eta+\epsilon)$ , in which case the ball will contain $1-\eta-\epsilon$ fraction of the probability mass of $\mathcal{D}$ . We have

due to the fourth moment condition and the fact that $\mbox{{\bf E}}((\boldsymbol{\mathit{z}}^{T}\boldsymbol{\mathit{v}}_{i})^{2})\leq\|\boldsymbol{{\Sigma}}\|_{2}$ . Therefore, a ball of radius at most $O\left(\frac{C_{4}^{1/4}}{(\eta+\epsilon)^{1/4}}\sqrt{n||\boldsymbol{{\Sigma}}||_{2}}\right)$ contains $1-\eta-\epsilon$ fraction of the points. Since $\|\boldsymbol{\mathit{a}}-\boldsymbol{\mathit{\mu}}\|_{2}=O\left(C_{4}^{1/4}(\eta+\epsilon)^{3/4}\sqrt{\operatorname*{Tr}(\boldsymbol{{\Sigma}})}\right),$ we get that the radius of the ball computed in the outlier removal step is $O\left(\frac{C_{4}^{1/4}}{(\eta+\epsilon)^{1/4}}\sqrt{n\|\boldsymbol{{\Sigma}}\|_{2}}\right).$ We have proved

After the outlier removal step, every remaining point $\boldsymbol{\mathit{x}}$ satisfies

Consider the covariance matrix $\boldsymbol{{\Sigma}}_{\widetilde{S}}$ of $\widetilde{S}$ (recall that $\widetilde{S}$ is the sample after outlier removal). Let $\widetilde{S}_{\mathcal{D}}\subset\widetilde{S}$ be the set of points in $\widetilde{S}$ that were sampled from the distribution $\mathcal{D}$ and $\widetilde{S}_{N}\subset\widetilde{S}$ be the points sampled by the adversary. Let $\boldsymbol{\mathit{\mu}}_{\widetilde{S}}:=\text{mean}(\widetilde{S})$ , $\boldsymbol{\mathit{\mu}}_{\widetilde{S}_{N}}:=\text{mean}(\widetilde{S}_{N})$ and $\boldsymbol{\mathit{\mu}}_{\widetilde{S}_{\mathcal{D}}}:=\text{mean}(\widetilde{S}_{\mathcal{D}}).$ Note that

where $\widetilde{\eta}=\frac{|\widetilde{S}_{N}|}{|\widetilde{S}|}$ is the fraction of noise points after the outlier truncation step. Note that $\widetilde{\eta}\leq\frac{\eta}{1-2\eta-\epsilon}=O(\eta).$ We will therefore pretend that the fraction of noise points is still $\eta$ after the outlier truncation step. We again assume that the mean of the distribution $\mathcal{D}$ is $\boldsymbol{\mathit{\mu}}=0.$ By Lemma 3.11 applied with $X=\boldsymbol{\mathit{x}}^{T}\frac{\boldsymbol{\mathit{\mu}}_{\widetilde{\mathcal{D}}}}{\|\boldsymbol{\mathit{\mu}}_{\widetilde{\mathcal{D}}}\|}$ for $\boldsymbol{\mathit{x}}\sim\mathcal{D}$ and where $A$ is the event that $\boldsymbol{\mathit{x}}$ is not removed by outlier removal, we have that

Suppose, after the outlier removal step, we had the guarantee that the covariance matrix of the remaining points from the distribution $\mathcal{D}$ , say $\boldsymbol{{\Sigma}}_{{\widetilde{\mathcal{D}}}},$ is between

in the Loewner ordering. Corollary 3.13 gives $\alpha=1-O(\sqrt{C_{4}(\eta+\epsilon)})$ and $\beta=1+O(\eta+\epsilon)$ . Also, by Lemma 3.1 and Lemma 3.16 we have that if $|S_{\widetilde{\mathcal{D}}}|=\Omega\left(\frac{n\log n}{\epsilon^{2}}\right)$ , then

We will use the notation as defined above.

Let $W$ be the bottom $n/2$ principal components of the covariance matrix $\boldsymbol{{\Sigma}}_{S}$ . For some constant $C$ , we have

where $\boldsymbol{\mathit{\delta}}_{\boldsymbol{\mathit{\mu}}}=\boldsymbol{\mathit{\mu}}_{\widetilde{S}_{N}}-\boldsymbol{\mathit{\mu}}_{\widetilde{S}_{\mathcal{D}}}.$

By an inductive application of the above lemma, we can prove

On input $(S,n)$ , AgnosticMean outputs $\boldsymbol{\widehat{\mathit{\mu}}}$ satisfying

Theorem 3.18 with Corollary 3.13 proves Theorem 1.3.

Proof of Lemma 3.17: Recall that $\boldsymbol{{\Sigma}}$ denotes the covariance matrix of the points from $\mathcal{D}$ . We have

where $\boldsymbol{\mathit{A}}:=\eta\boldsymbol{{\Sigma}}_{{\widetilde{S}_{N}}}+(\eta-\eta^{2})\boldsymbol{\mathit{\delta}}_{\boldsymbol{\mathit{\mu}}}\boldsymbol{\mathit{\delta}}_{\boldsymbol{\mathit{\mu}}}^{T}.$ Therefore, we have

By Lemma 3.16 each $\boldsymbol{\mathit{x}}_{i}$ satisfies $\|\boldsymbol{\mathit{x}}_{i}\|=O\left(\frac{C_{4}^{1/4}}{(\eta+\epsilon)^{1/4}}\sqrt{n\|\boldsymbol{{\Sigma}}\|_{2}}\right),$ so we have

For a symmetric matrix $B$ , let $\lambda_{k}(B)$ denote the $k$ ’th largest eigenvalue. By Weyl’s inequality, we have that

By Equation (14), there exists a constant $\widetilde{C}$ such that

Recall that $W$ is the space spanned by the bottom $n/2$ eigenvectors of $\boldsymbol{{\Sigma}}_{\widetilde{S}}$ , and $\boldsymbol{\mathit{P}}_{W}$ is the matrix corresponding to the projection operator on to $W$ . We therefore have

where $C=\frac{\widetilde{C}}{1-\eta}$ . We therefore have

Proof of Theorem 3.18: By Equation 13, it is enough to bound $\|\boldsymbol{\widehat{\mathit{\mu}}}-\boldsymbol{\mathit{\mu}}_{\widetilde{S}_{\mathcal{D}}}\|^{2}.$ The proof is by induction on the dimension. If $n=1$ , then the conclusion follows from the guarantees for the one dimensional case proven in Section 3.4.1. Now, assume that it holds for all $n\leq k$ for some $k\geq 1$ . Let $n=k+1.$ We have by Lemma 3.17

Recall that we defined $V$ to be the span of the top $n/2$ principal components of $\boldsymbol{{\Sigma}}_{\widetilde{S}}$ . By induction hypothesis, since $\text{dim}(V)=n/2$ , we have

Covariance Estimation

Let $\mathcal{D}$ be a distribution with mean $\boldsymbol{\mathit{\mu}}$ and variance $\sigma^{2}.$ If $\mathcal{D}=N(\mu,\sigma^{2})$ , then there is an algorithm that takes as input $m=O\left(\frac{\log n}{\epsilon^{2}}\right)$ samples $\boldsymbol{\mathit{x}}_{1},...,\boldsymbol{\mathit{x}}_{m}\sim\mathcal{D}_{\eta}$ and computes in polynomial time $\widehat{\sigma}^{2}$ such that $\left|\widehat{\sigma}^{2}-\sigma^{2}\right|=O(\eta+\epsilon)\sigma^{2}.$

If $\boldsymbol{\mathit{x}}\sim\mathcal{D}$ has bounded fourth moments with constant $C_{4}$ , and $(x-\mu)^{2}$ has bounded fourth moments with constant $C_{4,2}$ , then there is an algorithm that takes as input $\eta$ and $m=O\left(\frac{\log n+\log 1/\epsilon}{\epsilon^{2}}\right)$ samples $\boldsymbol{\mathit{x}}_{1},...,\boldsymbol{\mathit{x}}_{m}\sim\mathcal{D}_{\eta}$ and computes in polynomial time $\widehat{\sigma}^{2}$ such that $\left|\widehat{\sigma}^{2}-\sigma^{2}\right|=O\left(C_{4,2}^{1/4}(\eta+\epsilon)^{3/4}C_{4}^{1/2}\sigma\right).$

When $\mathcal{D}$ is a distribution that has bounded eighth moments, the result follows from the 1d mean estimation in Section 3.4 applied to $(x-\mu)^{2}$ . Note that $\mbox{{\bf E}}(x-\mu)^{2}=\sigma^{2}$ and

From Section 3.4, we therefore have that if $m=O\left(\frac{\log n+\log 1/\epsilon}{\epsilon^{2}}\right)$ , there is a $\operatorname*{poly}(n)$ algorithm with $|\widehat{\sigma}^{2}-\sigma^{2}|\leq O\left(C_{4,2}^{1/4}(\eta+\epsilon)^{3/4}C_{4}^{1/2}\sigma\right).$
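The reduction from variance estimation to one-dimensional mean estimation can be sketched as below. It is an illustration only: the shortest-interval routine is a simple stand-in for the 1d mean estimator of Section 3.4, the unknown $\mu$ is replaced by a robust estimate of it (an assumption of this example), and the names `shortest_interval_mean` and `robust_variance` are hypothetical.

```python
import math
import random

def shortest_interval_mean(values, eta, eps):
    """Mean of the shortest interval holding a (1-eta-eps)(1-eta) fraction."""
    xs = sorted(values)
    m = len(xs)
    k = min(m, max(1, math.ceil((1 - eta - eps) * (1 - eta) * m)))
    start = min(range(m - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return sum(xs[start:start + k]) / k

def robust_variance(sample, eta, eps):
    """Estimate the variance as a robust mean of squared deviations."""
    mu_hat = shortest_interval_mean(sample, eta, eps)
    squared_dev = [(x - mu_hat) ** 2 for x in sample]
    return shortest_interval_mean(squared_dev, eta, eps)

rng = random.Random(3)
# True variance is 4; about 1% of the points are replaced by a far outlier.
sample = [1e3 if rng.random() < 0.01 else rng.gauss(0.0, 2.0)
          for _ in range(20000)]
sigma2_hat = robust_variance(sample, eta=0.01, eps=0.01)
mean = sum(sample) / len(sample)
naive_var = sum((x - mean) ** 2 for x in sample) / len(sample)
```

Trimming the largest squared deviations biases the estimate somewhat downward, which is consistent with an error term that grows with $\eta+\epsilon$ . The naive sample variance, by contrast, is dominated by the outliers.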

2 Multi-Dimensional Case: Theorem 1.5

In this section we will prove that CovarianceEstimation (Algorithm 2.2.3) gives Theorem 1.5. Throughout this section, we will assume that $\mathcal{D}$ is a distribution with mean $\boldsymbol{\mathit{\mu}}$ and covariance $\boldsymbol{{\Sigma}}$ and has bounded fourth moments with parameter $C_{4}$ . We use the following symmetrization trick to assume that $\mathcal{D}$ has mean $\boldsymbol{0}$ . Given samples $S=\{\boldsymbol{\mathit{x}}_{1},...,\boldsymbol{\mathit{x}}_{m}\}$ , let

Since an $\eta$ fraction of the original samples were corrupted on average, only a $2\eta$ fraction of the new samples will be corrupted on average. Moreover, if $\boldsymbol{\mathit{x}},\boldsymbol{\mathit{y}}\sim\mathcal{D}$ are independent random variables, then we can show that the distribution of $\boldsymbol{\mathit{x}}^{\prime}=(\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{y}})/\sqrt{2}$ has bounded fourth moments with parameter $\leq C_{4}+3/2$ . We will denote by $\mathcal{D}^{\prime}$ the distribution of $\boldsymbol{\mathit{x}}^{\prime}$ . Since CovarianceEstimation is just the mean estimation algorithm on $S^{(2)}=\{\boldsymbol{\mathit{x}}^{\prime}\boldsymbol{\mathit{x}}^{\prime T}|\boldsymbol{\mathit{x}}\in S\}$ , we can appeal to Theorem 1.3. Furthermore, let $\mathcal{D}^{\prime}$ be an affine transformation of a $4$ -wise independent distribution.
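A one-dimensional sketch of the symmetrization trick, for illustration: it assumes the sample is split into two halves and the $i$ -th point of the first half is paired with the $i$ -th point of the second. This pairing and the function name `symmetrize` are assumptions of the example.

```python
import math
import random

def symmetrize(sample):
    """Pair up the sample and return (x - y) / sqrt(2) for each pair.
    The result has mean 0 and the same variance as the input distribution."""
    half = len(sample) // 2
    return [(sample[i] - sample[i + half]) / math.sqrt(2) for i in range(half)]

rng = random.Random(4)
# A skewed distribution with mean 3 and variance 1 (shifted exponential).
sample = [2.0 + rng.expovariate(1.0) for _ in range(40000)]
sym = symmetrize(sample)

sym_mean = sum(sym) / len(sym)
sym_var = sum((z - sym_mean) ** 2 for z in sym) / len(sym)
```

Each new point depends on two original points, so a corrupted original point corrupts at most one new point out of half as many, which is where the factor of $2$ in the noise fraction comes from.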

where $\boldsymbol{{\Sigma^{(2)}}}$ is the covariance matrix of $\boldsymbol{\mathit{x}}\boldsymbol{\mathit{x}}^{T},\boldsymbol{\mathit{x}}\sim\mathcal{D}^{\prime}.$

We will now derive a bound for $\|\boldsymbol{{\Sigma^{(2)}}}\|_{2}$ when the distribution has bounded fourth moments and is $4$ -wise independent. In particular, we will prove

If $\boldsymbol{{\Sigma^{(2)}}}$ is the covariance matrix of $\boldsymbol{\mathit{x}}\boldsymbol{\mathit{x}}^{T},\boldsymbol{\mathit{x}}\sim\mathcal{D}^{\prime}$ , it holds that

Proof of Proposition 4.2: Note that $\mbox{{\bf E}}(\boldsymbol{\mathit{Y}})=\boldsymbol{{\Sigma}}$ .

As in Section 4.2, we assume that the true distribution has mean $\boldsymbol{\mathit{\mu}}=\boldsymbol{0}$ .

In this section, we will prove AgnosticOperatorNorm (Algorithm 2.2.3) gives Theorem 1.6. Let $S=S_{\mathcal{D}}\cup S_{N}$ be the given sample, where $S_{\mathcal{D}}$ consists of points from some distribution $\mathcal{D}$ with mean $\boldsymbol{\mathit{\mu}}$ and covariance $\boldsymbol{{\Sigma}}$ and $S_{N}$ consists of points picked by the adversary. Let $\boldsymbol{{\Sigma}}_{S_{\mathcal{D}}}$ be the sample covariance of $S_{\mathcal{D}}$ . We assume that $\mathcal{D}$ has 1D concentration, i.e., there exists a constant $\gamma$ such that for every unit vector $\boldsymbol{\mathit{v}}$

Let $\widetilde{S}$ be the remaining sample at the end of the algorithm and let $\widetilde{S}_{\mathcal{D}}$ be points in $\widetilde{S}$ sampled from $\mathcal{D}$ .

First, we will argue that the covariance of the true distribution is well-approximated by $\boldsymbol{{\Sigma}}_{\boldsymbol{\mathit{\mu}}}(\widetilde{S}_{\mathcal{D}})$ .

With probability $1-1/\operatorname*{poly}(n)$ ,

First, note that the $t$ computed in SafeOutlierTruncation is at most $O(\operatorname*{Tr}(\boldsymbol{{\Sigma}}))$ because by an analogous argument as in Section 4.1, we have $\widehat{\sigma}_{\boldsymbol{\mathit{v}}}^{2}\leq(1+O(\eta))\sigma_{\boldsymbol{\mathit{v}}}^{2}$ (namely that the estimated variance $\widehat{\sigma}_{\boldsymbol{\mathit{v}}}$ in a direction $\boldsymbol{\mathit{v}}$ is close to the true variance $\sigma_{\boldsymbol{\mathit{v}}}$ in that direction). Then the ball in SafeOutlierTruncation has radius $R=c_{1}\sqrt{\operatorname*{Tr}(\boldsymbol{{\Sigma}})}\log^{1/\gamma}\frac{n}{\eta}$ for some constant $c_{1}$ . We have that in any direction $\boldsymbol{\mathit{v}}$ , the probability that $\boldsymbol{\mathit{x}}\sim\mathcal{D}$ deviates from the mean by more than $c_{1}\sigma_{\boldsymbol{\mathit{v}}}\log^{1/\gamma}\frac{n}{\eta}$ is $1/\operatorname*{poly}(\frac{n}{\eta})$ . Then if we take $n$ orthogonal directions, the probability that any given point is more than distance $R$ from $\boldsymbol{\mathit{\mu}}$ is still $1/\operatorname*{poly}(\frac{n}{\eta})$ . Thus, step (1) of the algorithm will remove only $1/\operatorname*{poly}(\frac{n}{\eta})$ fraction of the points sampled from $\mathcal{D}$ .

In every direction $\boldsymbol{\mathit{v}}$ , the probability mass of points from $\mathcal{D}$ outside an interval of size $c_{2}\sigma_{\boldsymbol{\mathit{v}}}\log^{1/\gamma}\frac{n}{\eta}$ around the mean is at most $1/\operatorname*{poly}(\frac{n}{\eta})$ , where $\sigma_{\boldsymbol{\mathit{v}}}$ is the standard deviation in the direction $\boldsymbol{\mathit{v}}$ . Let $C_{i}$ be the region between the two hyperplanes used for truncation in iteration $i$ . Therefore, if the number of iterations is $O(n\log^{2/\gamma}\frac{n}{\eta})$ , we will have that $\Pr\left(\boldsymbol{\mathit{x}}\in\cap_{i}C_{i}\left|x\sim\mathcal{D}\right.\right)=1-1/\operatorname*{poly}(\frac{n}{\eta}).$

Note that 1D concentration implies that the distribution has bounded $2k$ ’th moment for all finite $k.$ By Lemma 3.12, we have that the covariance matrix $\boldsymbol{{\Sigma}}_{\boldsymbol{0}}\left(\mathcal{D}\cap_{i}C_{i}\right)$ of $\mathcal{D}\cap_{i}C_{i}$ is close to $\boldsymbol{{\Sigma}}$ :

Finally, to relate $\boldsymbol{{\Sigma}}_{\boldsymbol{0}}\left(\mathcal{D}\cap_{i}C_{i}\right)$ to $\boldsymbol{{\Sigma}}_{\boldsymbol{0}}\left(\widetilde{S}_{\mathcal{D}}\right)$ , we use Proposition 3.2. The concept class we use is all degree two polynomials restricted to convex polytopes with at most $O(n)$ facets, defined by the hyperplanes used for truncation at each iteration of the algorithm. The VC dimension of this concept class is $O(n^{2}\log n)$ . Therefore, by Proposition 3.2 applied with $R=c_{1}\sqrt{\operatorname*{Tr}(\boldsymbol{{\Sigma}})}\log^{1/\gamma}\frac{n}{\eta}\leq c_{1}\|\boldsymbol{{\Sigma}}\|^{1/2}n^{1/2}\log^{1/\gamma}\frac{n}{\eta}$ , we get that if we take $m=O\left(\frac{n^{3}(\log^{1/\gamma}\frac{n}{\eta})^{2}\log n}{\eta^{2}}\right)$ then

Combining equations 16 and 17 we get the desired result. ∎

First, note that since only an $\eta$ fraction of $\widetilde{S}$ is noise, we have

Therefore, we have that $\|\boldsymbol{{\Sigma}}_{\boldsymbol{0}}(\widetilde{S})\|_{2}\geq(1-\eta)\|\boldsymbol{{\Sigma}}_{\boldsymbol{0}}(\widetilde{S}_{\mathcal{D}})\|_{2}.$ Lemma 5.2 gives the desired lower bound. For the upper bound, let $\boldsymbol{\mathit{v}}$ be the top eigenvector of $\boldsymbol{{\Sigma}}_{\boldsymbol{0}}(\widetilde{S}).$ When the algorithm terminates, we have

where the second line follows because of the termination condition and because we can estimate the variance of $\mathcal{D}$ in any direction to within a $(1\pm c\eta)$ factor. ∎

2 Termination

In this section, we will show that with high probability, Algorithm 2.2.3 terminates in a polynomial number of steps provided that $\eta\leq\frac{1}{C}$ for some constant $C$ that depends only on the estimation in Step (5).

Every time the algorithm goes through another iteration, it must remove a certain number of noise points. Suppose in step (7), we remove $r$ noise points. The noise configuration of maximum variance puts $r$ amount of noise at the outlier removal distance $d_{1}=c_{1}\sqrt{\operatorname*{Tr}(\boldsymbol{{\Sigma}})}\log^{1/\gamma}\frac{n}{\eta}$ , and $\eta-r$ amount of noise at the truncation threshold distance $d_{2}=\frac{c_{2}{\widehat{\sigma}}_{\boldsymbol{\mathit{v}}}\log^{1/\gamma}\frac{n}{\eta}}{2}$ . We can then write an upper bound on $\sigma^{2}$ .

Let us simplify the numerator $Z=\sigma^{2}-\sigma_{\boldsymbol{\mathit{v}}}^{2}-\eta d_{2}^{2}$ . Since we are truncating the sample, we have $(1+c_{3}\eta\log^{2/\gamma}\frac{n}{\eta}){\widehat{\sigma}}_{\boldsymbol{\mathit{v}}}^{2}\leq\sigma^{2}$ . Here we also assume that $\eta\leq\frac{1}{C}$ for a sufficiently large $C$ so that $\frac{1}{1-c\eta}$ is less than some constant.

Recall that $\sigma^{2}\geq(1-\eta)\|\boldsymbol{{\Sigma}}\|_{2}$ by (18). Then as long as $c_{3}$ is a sufficiently large constant, we have

Then combining $Z$ with the denominator from earlier and using the fact that $d_{1}\leq c_{1}\sqrt{n\|\boldsymbol{{\Sigma}}\|_{2}}\log^{1/\gamma}\frac{n}{\eta}$ , we get:

Then $r\geq\Omega\left(\min\left\{\frac{\eta}{n},\frac{1}{n\log^{2/\gamma}\frac{n}{\eta}}\right\}\right)$ , so the algorithm will terminate in a nearly linear number of iterations. ∎

Open Questions

An immediate open question is whether our analysis of the mean estimation algorithm is tight and the $\sqrt{\log n}$ factor is avoidable. For special distributions including Gaussians, [DKK+16] give an algorithm with higher sample complexity and error $\eta\sqrt{\log\frac{1}{\eta}}$ rather than $\eta\sqrt{\log n}$ or $\sqrt{\eta\log n}$ as in Theorem 1.1. An open question is to give an $O(\eta)$ approximation. For the more general distributions considered here, the dependence on $\eta$ must grow as at least $\eta^{3/4}$ ; it is open to find an algorithm that achieves $O(\eta^{3/4})$ error (our guarantee for the general setting has error $O(\sqrt{\eta\log n})$ ). Other open problems include agnostic learning of a mixture of two arbitrary Gaussians and agnostic sparse recovery.

Acknowledgment

We thank Chao Gao and Roman Vershynin for helpful discussions. We would also like to thank the anonymous reviewers for useful suggestions. This research was supported in part by NSF awards CCF-1217793 and EAGER-1555447.