Agnostic Estimation of Mean and Covariance

Kevin A. Lai, Anup B. Rao, Santosh Vempala

Introduction

The mean and covariance of a probability distribution are its most basic parameters (if they are bounded). Many families of distributions are defined using only these parameters. Estimating the mean and covariance from iid samples is thus a fundamental and classical problem in statistics. The sample mean and sample covariance are generally the best possible estimators (under mild conditions on the distribution such as their existence). However, they are highly sensitive to noise. The main goal of this paper is to estimate the mean, covariance and related functions in spite of arbitrary (adversarial) noise.

The Achilles heel of algorithms for generative models is the assumption that data is exactly from the model. This is crucial for known guarantees, and relaxations of it are few and specialized, e.g., in ICA, data could by noisy, but the noise itself is assumed to be Gaussian. Assumptions about rank and sparsity are made in a technique that is now called Robust PCA [CSPW11, CLMW11, XCM10]. There have been attempts [Kwa08, MT+11] at achieving robustness by L1 minimization, but they don’t give any error bounds on the output produced. A natural, important and wide open problem is estimating the parameters of generative models in the presence of arbitrary, i.e., malicious noise, a setting usually referred to as agnostic learning. The simplest version of this problem is to estimate a single Gaussian in the presence of malicious noise. Alternatively, this can be posed as the problem of finding a best-fit Gaussian to data or agnostically learning a single Gaussian. We consider the following generalization:

There is a large literature on robust statistics (see e.g., [Hub11, HRRS11, MMY06]), with the goal of finding estimators that are stable under perturbations of the data. The classic example for points on a line is that the sample median is a robust estimator while the sample mean is not (a single data point can change the mean arbitrarily). One measure for robustness of an estimator is called breakdown point, which is the minimum fraction of noise that can make the estimator arbitrarily bad. Robust statistics have been proposed and studied for mean and covariance estimation in high dimension as well (see [Hub64, Tuk74, Mar76, SJD81, Don82, Dav87, HPL91, DG92, MSY92, MZ12, CGR15] and the references therein). Most commonly used methods (including M-estimators) to estimate the covariance matrix were shown to have very low break down points [Don82]. The notion of robustness we consider quantifies how far the estimated value is from the true value. To the best of our knowledge, all the papers either suffer from the difficulty that their algorithms are computationally very expensive, namely exponential time in the dimension, or have poor or no guarantees for the output. Tukey’s median [Tuk74]) is an example of the former. It is defined as the deepest point with respect to a given set of points {xi}i.\{\boldsymbol{\mathit{x}}_{i}\}_{i}. As proven in [CGR15], this is an optimal estimate of the mean. But there is no known polynomial time algorithm to compute this. Another well-known proposal (see [Sma90]) is the geometric median:

This has the advantage that it can be computed via a convex program. Unfortunately, as we observe here (see Proposition 2.1), the error of the mean estimate produced by this method grows polynomially with the dimension (also see [Bru11]).

In this paper, we give polynomial time algorithms to estimate the mean with error that is close to the information-theoretically optimal estimator. The dependence on the dimension, of the error in the estimated mean, is only logn\sqrt{\log n}. To the best of our knowledge, this is the first polynomial-time algorithm with an error dependence on dimension that is less than n\sqrt{n}, the bound achieved by the geometric median. Moreover, as we state precisely later, our techniques extend to very general input distributions and to estimating higher moments.

Our algorithm is practical. A matlab implementation for mean estimation can be found in [KRV]. It takes less a couple of seconds to run on a 500500-dimensional problem with 50005000 samples on a personal laptop.

D=N(μ,Σ)\mathcal{D}=N(\boldsymbol{\mathit{\mu}},\boldsymbol{{\Sigma}}) is the Gaussian with mean μ\boldsymbol{\mathit{\mu}} and covariance Σ\boldsymbol{{\Sigma}}.

Let D\mathcal{D} is a distribution with mean μ\boldsymbol{\mathit{\mu}} and covariance Σ\boldsymbol{{\Sigma}}. We say it has bounded 2k2k’th moments if there exists a constant C2kC_{2k} such that for every unit vector v\boldsymbol{\mathit{v}},

Here \mboxVar[xTv]=(vTΣv)2\mbox{\bf Var}\left[\boldsymbol{\mathit{x}}^{T}\boldsymbol{\mathit{v}}\right]=\left(\boldsymbol{\mathit{v}}^{T}\boldsymbol{{\Sigma}}\boldsymbol{\mathit{v}}\right)^{2} is the variance of x\boldsymbol{\mathit{x}} along v.\boldsymbol{\mathit{v}}. For mean estimation, C4C_{4} will be used, and for covariance estimation, C8C_{8} will be needed.

1 Main results

All the results we state hold with probability 11/poly(n)1-1/\operatorname*{poly}(n) unless otherwise mentioned. We will also assume η\eta is a less than a universal constant. We begin with agnostic mean estimation.

We note that the sample complexity is nearly linear, and almost matches the complexity for mean estimation with no noise.

If we take m=O(n2(logn+log1/η)lognη2)m=O\left(\frac{n^{2}(\log n+\log 1/\eta)\log n}{\eta^{2}}\right) samples, and assume that η<c/logn\eta<c/\log n for a small enough constant c>0c>0, then by combining theorems 1.5 and 1.1, we can improve the η\eta dependence for the non-spherical Gaussian case in Theorem 1.1 to μμ^2=O(η3/4)Σ21/2log1/2n.\|\boldsymbol{\mathit{\mu}}-\boldsymbol{\widehat{\mathit{\mu}}}\|_{2}=O\left(\eta^{3/4}\right)\|\boldsymbol{{\Sigma}}\|^{1/2}_{2}\log^{1/2}n.

Our next theorem is a similar result for much more general distributions.

The bounds above are nearly the best possible (up to a factor of O(logn)O(\sqrt{\log n})) when the covariance is a multiple of the identity.

Let D\mathcal{D} be a distribution with mean μ\boldsymbol{\mathit{\mu}} and covariance Σ\boldsymbol{{\Sigma}} and that (a) for xD\boldsymbol{\mathit{x}}\sim\mathcal{D}, x\boldsymbol{\mathit{x}} and (xμ)(xμ)T(\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{\mu}})(\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{\mu}})^{T} have bounded fourth moments with constants C4C_{4} and C4,2C_{4,2}(see Equation 1) respectively. (b) D\mathcal{D} is an (unknown) affine transformation of a 44-wise independent distribution. Then, there is an algorithm that takes as input m=O(n2(logn+log1/ϵ)lognϵ2)m=O\left(\frac{n^{2}(\log n+\log 1/\epsilon)\log n}{\epsilon^{2}}\right) samples x1,...xmDη\boldsymbol{\mathit{x}}_{1},...\boldsymbol{\mathit{x}}_{m}\sim\mathcal{D}_{\eta} and η\eta and computes in poly(n,1/ϵ)\operatorname*{poly}(n,1/\epsilon)-time a covariance estimate Σ^\boldsymbol{\widehat{\Sigma}} such that

where F\|\cdot\|_{F} denotes the Frobenius norm.

If D=N(μ,Σ),\mathcal{D}=N(\boldsymbol{\mathit{\mu}},\boldsymbol{{\Sigma}}), then it satisfies the hypothesis of the above theorem. More generally, it holds for any 88-wise independent distribution with bounded eighth moments and whose fourth moment along any direction is at least (1+c)(1+c) times the square of the second moment for some c>0c>0. We also note that if the distribution is isotropic, then covariance estimation is essentially a 11-d problem and we get a better bound.

Suppose D\mathcal{D} is a distribution which satisfies the following concentration inequality: there exists a constant γ\gamma such that for every unit vector v\boldsymbol{\mathit{v}}

Then, there is an algorithm that runs in poly(n,log1η)\operatorname*{poly}(n,\log\frac{1}{\eta}) time that takes as input η\eta and m=O(n3(logn/η)2lognη2)m=O\left(\frac{n^{3}(\log n/\eta)^{2}\log n}{\eta^{2}}\right) independent samples x1,...,xmDη\boldsymbol{\mathit{x}}_{1},...,\boldsymbol{\mathit{x}}_{m}\sim\mathcal{D}_{\eta}, and computes λ^max\widehat{\lambda}_{\max} such that

In independent work, [DKK+16] gave a similar algorithm, which they call a Gaussian filtering method, for agnostic mean estimation assuming a spherical covariance matrix; while their guarantees are specifically for Gaussians, the error term in their guarantee grows only with log(1/η)\log(1/\eta) rather than logn\log n. They also give a completely different algorithm based on the Ellipsoid method, for a simple family of distributions including Gaussian and Bernoulli.

As a corollary of Theorem 1.5, we get a guarantee for agnostic SVD.

Let D\mathcal{D} is a distribution that satisfies the hypothesis of Theorem 1.5. Let Σk\boldsymbol{{\Sigma}}_{k} be the best rank kk approximation to Σ\boldsymbol{{\Sigma}} in F\|\cdot\|_{F} norm. There exists a polynomial time algorithm that takes as input η\eta and m=poly(n)m=\operatorname*{poly}(n) samples from Dη.\mathcal{D}_{\eta}. It produces a rank kk matrix Σ^k\boldsymbol{\widehat{\Sigma}}_{k} such that

Our results can also be used to estimate the mean and covariance of noisy Bernoulli product distributions, i.e. distributions in which each coordinate ii is 1 with probability pip_{i} and 0 with probability 1pi1-p_{i}. In one dimension, C4C_{4} for a Bernoulli distribution is (1p)2p+p21p\frac{(1-p)^{2}}{p}+\frac{p^{2}}{1-p}. For a Bernoulli product distribution, C4C_{4} will be within a constant of maxi{(1pi)2pi+pi21pi}\max_{i}\left\{\frac{(1-p_{i})^{2}}{p_{i}}+\frac{p_{i}^{2}}{1-p_{i}}\right\}. Then Theorem 1.3 can be applied to get an estimate μ^\boldsymbol{\widehat{\mathit{\mu}}} for the mean. For instance, if i,pi=p\forall i,p_{i}=p and p12p\geq\frac{1}{2}, then μμ^2=O(η(1+ηp)plogn)\|\boldsymbol{\mathit{\mu}}-\boldsymbol{\widehat{\mathit{\mu}}}\|_{2}=O\left(\sqrt{\eta(1+\sqrt{\eta p})p\log n}\right). If C4C_{4} is constant, then by Theorem 1.5, we can get an estimate for the covariance.

Main Ideas

Here we discuss the key ideas of the algorithms. The algorithm AgnosticMean (Algorithm 2.2.2) alternates between an outlier removal step and projection onto the top n/2n/2 principal components; these steps are repeated. It is inspired by the work of Brubaker [Bru09] who gave an agnostic algorithm for learning a mixture of well-separated spherical Gaussians.

For illustration, let us assume for now that the underlying distribution is D=N(μ,σ2I).\mathcal{D}=N(\boldsymbol{\mathit{\mu}},\sigma^{2}\boldsymbol{\mathit{I}}). We are given a set SS of m=poly(n)m=\operatorname*{poly}(n) points from Dη\mathcal{D}_{\eta}, and S=SGSNS=S_{G}\cup S_{N} be the points sampled from the Gaussian and the adversary respectively. Let us also assume that SN=ηS|S_{N}|=\eta|S|. We will use the notation μT\boldsymbol{\mathit{\mu}}_{T} for mean of the points in a set TT, and ΣT\boldsymbol{{\Sigma}}_{T} for covariance of the points in TT. We then have

If the dimension is n=1n=1, then we can show that the median of SS is an estimate for μ\boldsymbol{\mathit{\mu}} correct up to an additive error of O(ησ).O(\eta\sigma). Even if we just knew the direction of the mean shift μSμ=η(μGμN)\boldsymbol{\mathit{\mu}}_{S}-\boldsymbol{\mathit{\mu}}=\eta(\boldsymbol{\mathit{\mu}}_{G}-\boldsymbol{\mathit{\mu}}_{N}), then we can estimate μ\boldsymbol{\mathit{\mu}} by first projecting the sample SS on the line along μμS\boldsymbol{\mathit{\mu}}-\boldsymbol{\mathit{\mu}}_{S} and then finding the median. This would give an estimator μ^\boldsymbol{\widehat{\mathit{\mu}}} satisfying μ^μ2=O(ησ).\|\boldsymbol{\widehat{\mathit{\mu}}}-\boldsymbol{\mathit{\mu}}\|_{2}=O(\eta\sigma). So we can focus on finding the direction of μSμ\boldsymbol{\mathit{\mu}}_{S}-\boldsymbol{\mathit{\mu}}. One would guess that the top principal component of the covariance matrix of SS would be a good candidate. But it is easy for the adversary to choose SNS_{N} to make this completely useless. Since the noise points SNS_{N} can be anything, just two points from SNS_{N} placed far away on either side of the mean μ\boldsymbol{\mathit{\mu}} along a particular line passing through μ\boldsymbol{\mathit{\mu}} are sufficient to make the variance in that direction blow up arbitrarily. But we can limit this effect to some extent by an outlier removal step. By a standard concentration inequality for Gaussians, we know that the points in SGS_{G} lie in a ball of radius O(σn)O(\sigma\sqrt{n}) around the mean. So, if we can just find a point inside or close to the convex hull of the Gaussian and throw away all the points that lie outside a ball of radius CσnC\sigma\sqrt{n} around this point, we preserve all the points in SGS_{G}. This will also contain the effect of noise points on the variance since now they are restricted to be within O(σn)O(\sigma\sqrt{n}) distance of μ.\boldsymbol{\mathit{\mu}}. We will see later that we can use coordinate-wise median as the center of the ball. By computing the variance by projecting onto any direction, we can figure out σ2\sigma^{2} up to a 1±O(η)1\pm O(\eta) factor. From now on, we assume that all points in SS lie within a ball of radius O(σn)O(\sigma\sqrt{n}) centered at μ.\boldsymbol{\mathit{\mu}}.

But even after this restriction, the top principal component may not contain any information about the mean shift direction. By just placing (say) η/10\eta/10 noise points along the e1e_{1} direction at ±σn\pm\sigma\sqrt{n}, and all the remaining noise points perpendicular to this at a single point at a smaller distance, we can make e1e_{1} the top principal component. But e1e_{1} is perpendicular to the mean shift direction.

The idea to get around this is that even if the top principal component of ΣS\boldsymbol{{\Sigma}}_{S} may not be along the mean-shift direction, the span (call it VV) of top n/2n/2 principal components of ΣS\boldsymbol{{\Sigma}}_{S} will contain a big projection of the mean-shift vector. This is because, if a big component of the the mean-shift vector was in the span (say WW) of bottom n/2n/2 principal components of ΣS\boldsymbol{{\Sigma}}_{S}, by Equation 2 this would mean that there is a vector in WW with a large Rayleigh quotient. This implies that the top n/2n/2 eigenvalues of ΣS\boldsymbol{{\Sigma}}_{S} are all big. Since ΣS=(1η)σ2I+A\boldsymbol{{\Sigma}}_{S}=(1-\eta)\sigma^{2}\boldsymbol{\mathit{I}}+\boldsymbol{\mathit{A}}, where A=ηΣSN+η(1η)(μSμN)(μSμN)T\boldsymbol{\mathit{A}}=\eta\boldsymbol{{\Sigma}}_{S_{N}}+\eta(1-\eta)(\boldsymbol{\mathit{\mu}}_{S}-\boldsymbol{\mathit{\mu}}_{N})(\boldsymbol{\mathit{\mu}}_{S}-\boldsymbol{\mathit{\mu}}_{N})^{T}, this is possible only if Tr(A)\operatorname*{Tr}(\boldsymbol{\mathit{A}}) is large. But since the distance of each point in SS from μ\boldsymbol{\mathit{\mu}} is O(σn)O(\sigma\sqrt{n}), the trace of A\boldsymbol{\mathit{A}} cannot be too large. Therefore, in the space WW, we can just compute the sample mean PWμS\boldsymbol{\mathit{P}}_{W}\boldsymbol{\mathit{\mu}}_{S} and it will be close to PWμ\boldsymbol{\mathit{P}}_{W}\boldsymbol{\mathit{\mu}}. We still have to find the mean in the space VV. But we do this by recursing the above procedure in VV. At the end we will be left with a one-dimensional space, and then we can just find the median. This recursive projection onto the top n/2n/2 principal components is done in Algorithm 2.2.2 .

This generalizes to the non-spherical Gaussians with a few modifications. We use a different outlier removal step. In the non-spherical case, it is not trivial to compute Σ2\|\boldsymbol{{\Sigma}}\|_{2} to be used as the radius of the ball. We give an algorithm for this later on. To limit the effect of noise, we use a damping function. Instead of discarding points outside a certain radius, we damp every point by a weight so that further away points get lower weights. This is done in OutlierDamping (Algorithm 2.2.1). We get the guarantees of Theorem 1.1 by running AgnosticMean (Algorithm 2.2.2) with the outlier removal routine being OutlierDamping. A detailed proof of the whole algorithm is given in Section 3.1.

We then turn to more general distributions which have bounded fourth moments. We need bounded fourth moments to ensure that the mean and covariance matrix of the distribution D\mathcal{D} do not change much even after conditioning by an event that occurs with probability 1η1-\eta. One difficulty for general distributions is that the outlier damping doesn’t work. So for distributions D\mathcal{D} with bounded fourth moments, we have another outlier removal routine called \textscOutlierTruncation(,η).\textsc{OutlierTruncation}(\cdot,\eta). In this routine, we first find a point analogous to the coordinate-wise median for the Gaussians, and then consider a ball big enough to contain 1η1-\eta fraction of SS. We throw away all the points outside this ball. We get the guarantees of Theorem 1.3 by running AgnosticMean (Algorithm 2.2.2) with the outlier removal routine being OutlierTruncation (Algorithm 2.2.1). The complete proof of this appears in Section 3.3.

We now have an algorithm to estimate the mean of very general (with bounded fourth moments) distributions. To estimate the covariance matrix, we observed that the covariance matrix of a distribution D\mathcal{D} is given by \mboxED(xμ)(xμ)T.\mbox{{\bf E}}_{\mathcal{D}}(\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{\mu}})(\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{\mu}})^{T}. If we knew what μ\boldsymbol{\mathit{\mu}} was, then covariance can be computed by estimating the mean of the second moments. To compute the mean of the second moments, we can treat (xμ)(xμ)T(\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{\mu}})(\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{\mu}})^{T} as a vector in n2n^{2} dimensions and run the algorithm for mean estimation. Also, we can estimate μ\boldsymbol{\mathit{\mu}} by the same algorithm. Therefore, we get Theorem 1.5 by running CovarianceEstimation (Algorithm 2.2.3). Its proof appears in Section 4.2.

Algorithm AgnosticOperatorNorm (Algorithm 2.2.3) estimates the 22-norm Σ2\|\boldsymbol{{\Sigma}}\|_{2} for general distributions. For illustration, suppose D=N(μ,Σ),\mathcal{D}=N(\boldsymbol{\mathit{\mu}},\boldsymbol{{\Sigma}}), and we are given m=poly(n)m=\operatorname*{poly}(n) samples x1,...,xmDη,\boldsymbol{\mathit{x}}_{1},...,\boldsymbol{\mathit{x}}_{m}\sim\mathcal{D}_{\eta}, and the mean μ\boldsymbol{\mathit{\mu}}. We consider the covariance-like matrix

Since 1η1-\eta fraction of the points in SS are from the Gaussian, we have Σ(S,μ)(1η)Σ.\boldsymbol{{\Sigma}}(S,\boldsymbol{\mathit{\mu}})\succeq(1-\eta)\boldsymbol{{\Sigma}}. Therefore, the top eigenvalue σ2\sigma^{2} of Σ(S,μ)\boldsymbol{{\Sigma}}(S,\boldsymbol{\mathit{\mu}}) is at least (1η)Σ2.(1-\eta)\|\boldsymbol{{\Sigma}}\|_{2}. Let v\boldsymbol{\mathit{v}} be the top eigenvector of Σ(S,μ).\boldsymbol{{\Sigma}}(S,\boldsymbol{\mathit{\mu}}). If the Gaussian variance along v\boldsymbol{\mathit{v}} (which can be computed up to 1±η1\pm\eta factor) is much less than σ2\sigma^{2}, this should be because there are a lot of noise points in SS whose projections onto v\boldsymbol{\mathit{v}} are big compared to the projection of Gaussian points in SS. We remove points in SS that have big projection and then iterate the entire procedure. We later show that this procedure terminates in poly(n)\operatorname*{poly}(n) steps and when it terminates the top eigenvalue of Σ(S,μ)\boldsymbol{{\Sigma}}(S,\boldsymbol{\mathit{\mu}}) is close to that of Σ.\boldsymbol{{\Sigma}}. A proof of this appears in Section 5.

Theorem 1.7 follows easily from Theorem 1.5. Let Σ^k\boldsymbol{\widehat{\Sigma}}_{k} be the top-kk eigenspace of Σ^\boldsymbol{\widehat{\Sigma}} from Theorem 1.5. We then have

(a),(c)(a),(c) follow from triangle inequality, (b)(b) follows from the fact that Σ^k\boldsymbol{\widehat{\Sigma}}_{k} is the best rank-kk approximation and (d)(d) from the guarantees of Theorem 1.5.

Next the algorithm estimates a weighted covariance matrix W\boldsymbol{\mathit{W}} with the weight of a point x\boldsymbol{\mathit{x}} proportional to cos(uTx)\cos(\boldsymbol{\mathit{u}}^{T}\boldsymbol{\mathit{x}}) for u\boldsymbol{\mathit{u}} chosen from a Gaussian distribution; it computes the SVD of W\boldsymbol{\mathit{W}}. For this we use our algorithm again (the weights are applied individually to each sample). The main guarantee is that the eigenvectors of this weighted covariance approximate the columns of AA. This relies on the maximum eigenvalue gap of W\boldsymbol{\mathit{W}} being large, and it has to be approximated to within additive error ϵ=O(1/(logn)3)\epsilon=O(1/(\log n)^{3}). Theorem 1.7 implies that the additional error in eigenvalues is bounded by O(ηlogn)Σ2O(\sqrt{\eta\log n})\|\boldsymbol{{\Sigma}}\|_{2}, and therefore it suffices to have ηlogn<c/(logn)3\sqrt{\eta\log n}<c/(\log n)^{3} for a sufficiently small constant cc that depends only on the cumulant and moment bound assumptions (i.e., Δ,M\Delta,M). Thus, if suffices to have η<ϵ/2c(logn)7\eta<\epsilon/2\leq c(\log n)^{-7}.

In this section we will show the lower bounds stated in Observation 1.4. For Gaussian distributions, this is a special case of a theorem proved in [CGR15]. We reproduce the relevant part here for completeness. We will show that there are distributions D1=N(μ1,σ2I),D2=N(μ2,σ2I)\mathcal{D}_{1}=N(\boldsymbol{\mathit{\mu}}_{1},\sigma^{2}\boldsymbol{\mathit{I}}),\mathcal{D}_{2}=N(\boldsymbol{\mathit{\mu}}_{2},\sigma^{2}\boldsymbol{\mathit{I}}) and distributions Q1,Q2Q_{1},Q_{2} such that μ1μ22=Ω(ησ)\|\boldsymbol{\mathit{\mu}}_{1}-\boldsymbol{\mathit{\mu}}_{2}\|_{2}=\Omega(\eta\sigma) and

So, given Dη\mathcal{D}_{\eta}, no algorithm can distinguish between D1,D2\mathcal{D}_{1},\mathcal{D}_{2}. Let ϕ1\phi_{1} be p.d.f of D1\mathcal{D}_{1} and ϕ2\phi_{2} be the p.d.f of D2.\mathcal{D}_{2}. Let μ1,μ2\boldsymbol{\mathit{\mu}}_{1},\boldsymbol{\mathit{\mu}}_{2} be such that the total variation distance between D1,D2\mathcal{D}_{1},\mathcal{D}_{2} is

By a standard inequality for the total variation distance of Gaussian distributions, this implies that μ1μ222ησ1η.\|\boldsymbol{\mathit{\mu}}_{1}-\boldsymbol{\mathit{\mu}}_{2}\|_{2}\geq\frac{2\eta\sigma}{1-\eta}. Let Q1Q_{1} be the distribution with p.d.f 1ηη(ϕ2ϕ1)1ϕ2ϕ1\frac{1-\eta}{\eta}(\phi_{2}-\phi_{1})\mathbf{1}_{\phi_{2}\geq\phi_{1}} and Q2Q_{2} be the distribution with p.d.f 1ηη(ϕ1ϕ2)1ϕ1ϕ2\frac{1-\eta}{\eta}(\phi_{1}-\phi_{2})\mathbf{1}_{\phi_{1}\geq\phi_{2}}. It is now easy to verify that Equation 3 is satisfied. This proves item one of Observation 1.4.

For the distributions with bounded fourth moments, consider the following two one-dimensional distributions. D1\mathcal{D}_{1} is supported on two points {σ,σ}\{-\sigma,\sigma\} with the corresponding probabilities {1/2,1/2}\{1/2,1/2\}. D2\mathcal{D}_{2} is supported on three points {σ,σ,σ/η1/4}\{-\sigma,\sigma,\sigma/\eta^{1/4}\} with probabilities {(1η)/2,(1η)/2,η}\{(1-\eta)/2,(1-\eta)/2,\eta\} respectively. Let η1/4\eta\leq 1/4. It is easy to check that both D1\mathcal{D}_{1} and D2\mathcal{D}_{2} have bounded fourth moments with the constant C4=8C_{4}=8. Furthermore, D2\mathcal{D}_{2} can be obtained from D1\mathcal{D}_{1} by adding η\eta fraction of noise points. So no algorithm can distinguish between the two distributions. Since their means differ by η3/4σ\eta^{3/4}\sigma, no algorithm can get an estimate better than this.

We will now show that the geometric median:

has a n\sqrt{n} dependence on the dimension. We show this in the Gaussian case even if we have access to the whole distribution, but with η\eta fraction of noise points placed all at a single point far away from most of the Gaussian points.

Let D=N(0,Σ)\mathcal{D}=N(\boldsymbol{0},\boldsymbol{{\Sigma}}) be a distribution with diagonal covariance matrix Σ\boldsymbol{{\Sigma}} whose variance along the coordinate direction e1\boldsymbol{\mathit{e}}_{1} is zero, and equal to 11 in all the other coordinate directions. Assume there is an η\eta fraction of noise at a distance a=na=n along e1\boldsymbol{\mathit{e}}_{1}. Let

We have that at the minimizer t0t_{0}, the derivative with respect to tt is zero. Therefore, we should have

Consider f(t)=\mboxExDtt2+x22+...+xn2.f(t)=\mbox{{\bf E}}_{\boldsymbol{\mathit{x}}\sim\mathcal{D}}\frac{t}{\sqrt{t^{2}+x_{2}^{2}+...+x_{n}^{2}}}. It is clear from Equation 4 that t0>0.t_{0}>0. We claim that if t=αηnt=\alpha\eta\sqrt{n} for a small enough constant α\alpha, then f(t)η1η.f(t)\leq\frac{\eta}{1-\eta}. Suppose t1=αηnt_{1}=\alpha\eta\sqrt{n}. Since xD\boldsymbol{\mathit{x}}\sim\mathcal{D}, x22n/2\|\boldsymbol{\mathit{x}}\|_{2}^{2}\geq n/2 with exponential probability. Therefore,

2 Algorithms

Our algorithms are based on outlier removal and SVD. To simplify the proofs, we use new samples for each step of the algorithm. The total sample complexity is given in the theorems.

For outlier removal, we use one of the following two simple routines. The first, which we call OutlierDamping, returns a vector of positive weights, one for each sample point.

Let a\boldsymbol{\mathit{a}} be the coordinate-wise median of SS. Let s2=CTr(Σ).s^{2}=C\operatorname*{Tr}(\boldsymbol{{\Sigma}}). Estimate Tr(Σ)\operatorname*{Tr}(\boldsymbol{{\Sigma}}) by estimating 11d variance along nn orthogonal directions, see Section 4.1.

Set wi=exp(xia22s2)w_{i}=\exp\left(-\frac{\|\boldsymbol{\mathit{x}}_{i}-\boldsymbol{\mathit{a}}\|_{2}^{2}}{s^{2}}\right) for every xiS.\boldsymbol{\mathit{x}}_{i}\in S.

The second procedure for outlier removal returns a subset of points. It will be convenient to view this as a 0/10/1 weighting of the point set. We call this procedure OutlierTruncation.

if n=1n=1: Let [a,b][a,b] be the smallest interval containing (1ηϵ)(1η)(1-\eta-\epsilon)(1-\eta) fraction of the points, S~S[a,b]\widetilde{S}\leftarrow S\cap[a,b]. Return (S~,1).(\widetilde{S},1).

Let a\boldsymbol{\mathit{a}} be as in Lemma 3.15.

Let B(r,a)=B(r,\boldsymbol{\mathit{a}})= ball of minimum radius rr centered at a\boldsymbol{\mathit{a}} that contains (1ηϵ)(1η)(1-\eta-\epsilon)(1-\eta) fraction of SS.

S~SB(r,a).\widetilde{S}\leftarrow S\cap B(r,\boldsymbol{\mathit{a}}). Return (S~,1).(\widetilde{S},\mathbf{1}).

2.2 Main Algorithm

We are now ready to state the main algorithm for agnostic mean estimation. It uses one of the above outlier removal procedures and assumes that the output of the procedure is a weighting.

Let (S~,w)=\textscOutlierRemoval(S)(\widetilde{S},\boldsymbol{\mathit{w}})=\textsc{OutlierRemoval}(S) .

if w=1\boldsymbol{\mathit{w}}=-1, Return median(S~)\operatorname*{median}(\widetilde{S}). //Gaussian case

else Return mean(S~)\operatorname*{mean}(\widetilde{S}). //General case

Let ΣS~,w\boldsymbol{{\Sigma}}_{\widetilde{S},\boldsymbol{\mathit{w}}} be the weighted covariance matrix of S~\widetilde{S} with weights w\boldsymbol{\mathit{w}}, and VV be the span of the top n/2n/2 principal components of ΣS~,w\boldsymbol{{\Sigma}}_{\widetilde{S},\boldsymbol{\mathit{w}}}, and WW be its complement.

Set S1:=PV(S)S_{1}:=\boldsymbol{\mathit{P}}_{V}(S) where PV\boldsymbol{\mathit{P}}_{V} is the projection operation on to VV.

Let μ^V:=\textscAgnosticMean(S1)\boldsymbol{\widehat{\mathit{\mu}}}_{V}:=\textsc{AgnosticMean}(S_{1}) and μ^W:=mean(PWS~).\boldsymbol{\widehat{\mathit{\mu}}}_{W}:=\text{mean}(\boldsymbol{\mathit{P}}_{W}\widetilde{S}).

Return μ^.\boldsymbol{\widehat{\mathit{\mu}}}.

2.3 Estimation of the Covariance Matrix and Operator Norm

For both the tasks in this section, we will assume that the mean of the distribution μ=0\boldsymbol{\mathit{\mu}}=\boldsymbol{0}. We can do this without loss of generality by a standard trick mentioned described in Section 4.2. The algorithm for estimating the covariance matrix calls AgnosticMean on xxT.\boldsymbol{\mathit{x}}\boldsymbol{\mathit{x}}^{T}. Analysis is given in Section 4.2.

Output: n×nn\times n matrix Σ^\boldsymbol{\widehat{\Sigma}}

Let S(2)={xixii=1,...,m/2}S^{(2)}=\{\boldsymbol{\mathit{x}}_{i}^{\prime}\boldsymbol{\mathit{x}}_{i}^{\prime}|\,i=1,...,m/2\} (see Equation 15)

The algorithm for estimating Σ2\|\boldsymbol{{\Sigma}}\|_{2} is based on iteratively truncating the samples along the direction of top variance. The analysis is given in Section 5.

Let S~=\textscSafeOutlierTruncation(S,η,γ)\widetilde{S}=\textsc{SafeOutlierTruncation}(S,\eta,\gamma).

Do the following O(nlog2/γnη)O(n\log^{2/\gamma}\frac{n}{\eta}) times

Let Σ0(S~):=1S~iS~xxT\boldsymbol{{\Sigma}}_{\boldsymbol{0}}(\widetilde{S}):=\frac{1}{|\widetilde{S}|}\sum_{i\in\widetilde{S}}\boldsymbol{\mathit{x}}\boldsymbol{\mathit{x}}^{T}.

Find v\boldsymbol{\mathit{v}}, the top eigenvector of Σ0(S~)\boldsymbol{{\Sigma}}_{\boldsymbol{0}}(\widetilde{S}), and its corresponding eigenvalue σ2\sigma^{2}.

Estimate (up to 1±cη1\pm c\eta factor, see Section 4.1) the variance of D\mathcal{D} along v\boldsymbol{\mathit{v}} and denote it by σ^v2{\widehat{\sigma}}_{\boldsymbol{\mathit{v}}}^{2}.

if σ2(1+c3ηlog2/γnη)σ^v2\sigma^{2}\leq(1+c_{3}\eta\log^{2/\gamma}\frac{n}{\eta}){\widehat{\sigma}}_{\boldsymbol{\mathit{v}}}^{2} Return σ2\sigma^{2}.

Remove all points xS~\boldsymbol{\mathit{x}}\in\widetilde{S} such that xTv>c2σ^vlog1/γnη2|\boldsymbol{\mathit{x}}^{T}\boldsymbol{\mathit{v}}|>\frac{c_{2}{\widehat{\sigma}}_{\boldsymbol{\mathit{v}}}\log^{1/\gamma}\frac{n}{\eta}}{2}.

Let t=i=1nσ^ei2t=\sum_{i=1}^{n}\widehat{\sigma}^{2}_{e_{i}} be the sum of estimated variances of D\mathcal{D} in nn orthogonal directions.

Let B(ctlog1/γnη,0)B(c\sqrt{t}\log^{1/\gamma}\frac{n}{\eta},\boldsymbol{0}) be the ball of radius ctlog1/γnηc\sqrt{t}\log^{1/\gamma}\frac{n}{\eta} centered at 0\boldsymbol{0}.

S~SB(ctlog1/γnη,0).\widetilde{S}\leftarrow S\cap B(c\sqrt{t}\log^{1/\gamma}\frac{n}{\eta},\boldsymbol{0}). Return S~.\widetilde{S}.

Mean Estimation: Theorem 1.1 and Theorem 1.3

In this section, we will first prove Theorem 1.1, which is for Gaussian distributions, and Theorem 1.3, which is for distributions with bounded fourth moments. All our algorithms will be translationally invariant. We will assume w.l.o.g that the mean of the distribution D\mathcal{D} is μ=0\boldsymbol{\mathit{\mu}}=0. So we will be proving bounds on μ^2.\|\boldsymbol{\widehat{\mathit{\mu}}}\|_{2}. Algorithm 2.2.2 has logn\log n levels, we will assume that at each level it uses O(nlognϵ2)O(\frac{n\log n}{\epsilon^{2}}) samples resulting in a total of m=O(nlog2nϵ2).m=O(\frac{n\log^{2}n}{\epsilon^{2}}).

At various points in the analysis, to bound the sample complexity we will have to show that the estimates computed from samples are close to their expectations. We will use the following two results. Firstly, as an immediate corollary of matrix Bernstien for rectangular matrices (see Theorem 1.61.6 in [Tro12]), we get the following concentration result for the sample mean and sample covariance.

Here μ^\widehat{\mu} and Σ^\widehat{\Sigma} are sample mean and sample covariance matrix.

Secondly, the functions we estimate will be integrals of low-degree polynomials (degree dd at most 44) restricted to intervals and/or balls. These functions viewed as binary concepts have small VC-dimension, O(nd)O(n^{d}) where nn is the dimension of space and dd is the degree of the polynomial. We use this to bound the error of estimating integrals via samples, and we can make the error smaller than any inverse polynomial using a poly(n)\operatorname*{poly}(n) size sample.

By the VC theorem, for any concept in CFC_{F}, the bound on the size of the sample ensures that with probability at least 1δ1-\delta and any tt,

Noting that \mboxExD(f(x))=RRPr(f(x)t)dt\mbox{{\bf E}}_{x\sim\mathcal{D}}(f(x))=\int_{-R}^{R}\Pr(f(x)\geq t)\,dt, we get the claimed bound.

We use the above notation for T=SGT=S_{G} and T=SNT=S_{N}. By an abuse of notation, when T=GT=G, we mean the population version of the above quantities:

We consider the matrix ΣS,w\boldsymbol{{\Sigma}}_{S,\boldsymbol{\mathit{w}}}

Let D=N(0,σ2)\mathcal{D}=N(0,\sigma^{2}) be a one dimensional Gaussian distribution. If m=O(lognϵ2)m=O\left(\frac{\log n}{\epsilon^{2}}\right), and we are given x1,...,xmDηx_{1},...,x_{m}\sim\mathcal{D}_{\eta}, then the median xmed=mediani{xi}x_{\text{med}}=\operatorname*{median}_{i}\{x_{i}\} satisfies xmed=O((η+ϵ)σ)|x_{\text{med}}|=O((\eta+\epsilon)\sigma) with high probability.

Let SGSS_{G}\subset S be made up of samples in SS that come from the Gaussian, also let c=Φ1(1/2+η+ϵ).c=\Phi^{-1}(1/2+\eta+\epsilon). Let us bound the probability that the median xmedc.x_{\text{med}}\geq c. We first note that if xmedc,x_{\text{med}}\geq c, then Pr(x>cxuSG)ϵ.\Pr\left(x>c|x\in_{u}S_{G}\right)\geq\epsilon. By Hoeffding’s inequality, we can bound this by 1poly(n)1-\operatorname*{poly}(n) if SG=O(lognϵ2).|S_{G}|=O\left(\frac{\log n}{\epsilon^{2}}\right).

We will next consider the multidimensional case. The proof follows by a series of lemmas. We state the lemmas first, conclude the proof of Theorem 1.1 and then prove the lemmas. First, we observe that by applying Lemma 3.3 in nn orthogonal directions and union bound, we get

By a simple calculation, maxxx2exa2/s2O(s2).\max_{\boldsymbol{\mathit{x}}}\|\boldsymbol{\mathit{x}}\|^{2}e^{-\|\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{a}}\|^{2}/s^{2}}\leq O(s^{2}). This immediately gives the following bound on the trace.

Suppose A:=ηΣSN,w+η(1η)(μSN,wμG,w)(μSN,wμSG,w)T.\boldsymbol{\mathit{A}}:=\eta\boldsymbol{{\Sigma}}_{S_{N},\boldsymbol{\mathit{w}}}+\eta(1-\eta)(\boldsymbol{\mathit{\mu}}_{S_{N},\boldsymbol{\mathit{w}}}-\boldsymbol{\mathit{\mu}}_{G,\boldsymbol{\mathit{w}}})(\boldsymbol{\mathit{\mu}}_{S_{N},\boldsymbol{\mathit{w}}}-\boldsymbol{\mathit{\mu}}_{S_{G},\boldsymbol{\mathit{w}}})^{T}. Then there exists a constant CC such that,

As will be clear from the proof of Theorem 3.6, when Σ=σ2I\boldsymbol{{\Sigma}}=\sigma^{2}\boldsymbol{\mathit{I}} is a multiple of identity, then ΣG,w\boldsymbol{{\Sigma}}_{G,\boldsymbol{\mathit{w}}} will also be a multiple of I\boldsymbol{\mathit{I}}. By Lemma 3.1, if we take m=O(nlognϵ2)m=O(\frac{n\log n}{\epsilon^{2}}) samples, we will have

in the Lowener ordering, for some α,β>0\alpha,\beta>0. By an argument similar to the one sketched in Section 2, we can prove

We will use the notation as defined above. Let WW be the bottom n/2n/2 principal components of the covariance matrix ΣS,w\boldsymbol{{\Sigma}}_{S,\boldsymbol{\mathit{w}}}. We have

where Σmin\|\boldsymbol{{\Sigma}}\|_{\min} denotes the least eigenvalue of Σ\boldsymbol{{\Sigma}} and δμ:=μSN,wμSG,w.\boldsymbol{\mathit{\delta}}_{\boldsymbol{\mathit{\mu}}}:=\boldsymbol{\mathit{\mu}}_{S_{N},\boldsymbol{\mathit{w}}}-\boldsymbol{\mathit{\mu}}_{S_{G},\boldsymbol{\mathit{w}}}.

By an inductive application of Lemma 3.7, we get the following theorem giving a bound on μ^.\|\boldsymbol{\widehat{\mathit{\mu}}}\|.

On input SS and the routine \textscOutlierDamping()\textsc{OutlierDamping}(\cdot), AgnosticMean outputs μ^\boldsymbol{\widehat{\mathit{\mu}}} satisfying

Theorem 3.6 combined with Theorem 3.8 proves Theorem 1.1. We get a better dependence on η\eta when Σ=σ2I\boldsymbol{{\Sigma}}=\sigma^{2}\boldsymbol{\mathit{I}} because we can take α=β\alpha=\beta in this case. This would lead to the cancellation of the leading term in the bound in Theorem 3.8 as Σ2=Σmin.\|\boldsymbol{{\Sigma}}\|_{2}=\|\boldsymbol{{\Sigma}}\|_{\min}.

Proof of Lemma 3.7: Recall that Σ\boldsymbol{{\Sigma}} denotes the covariance matrix of the Gaussian part. We have

where A=ηΣSN,w+η(1η)δμδμT.\boldsymbol{\mathit{A}}=\eta\boldsymbol{{\Sigma}}_{S_{N},\boldsymbol{\mathit{w}}}+\eta(1-\eta)\boldsymbol{\mathit{\delta}}_{\boldsymbol{\mathit{\mu}}}\boldsymbol{\mathit{\delta}}_{\boldsymbol{\mathit{\mu}}}^{T}. Therefore, we have

For a symmetric matrix B\boldsymbol{\mathit{B}}, let λk(B)\lambda_{k}(\boldsymbol{\mathit{B}}) denote the kk’th largest eigenvalue. By Weyl’s inequality, we have

Recall that WW is the space spanned by the bottom n/2n/2 eigenvectors of ΣS,w\boldsymbol{{\Sigma}}_{S,\boldsymbol{\mathit{w}}}, and PW\boldsymbol{\mathit{P}}_{W} is the matrix corresponding to the projection operator on to WW. We therefore have

Multiplying by the vector PWδμPWδμ\frac{\boldsymbol{\mathit{P}}_{W}\boldsymbol{\mathit{\delta}}_{\boldsymbol{\mathit{\mu}}}}{\|\boldsymbol{\mathit{P}}_{W}\boldsymbol{\mathit{\delta}}_{\boldsymbol{\mathit{\mu}}}\|} and its transpose on either side, we get

Assuming η1/2\eta\leq 1/2, we therefore have

Proof of Theorem 3.8: By Equation 6 and Lemma 3.1, since we take O(nlognϵ2)O\left(\frac{n\log n}{\epsilon^{2}}\right) samples we have

So it is enough to prove μ^μSG,w2O((βη+η2+ϵ2)Σ2αηΣmin)(1+logn)\|\boldsymbol{\widehat{\mathit{\mu}}}-\boldsymbol{\mathit{\mu}}_{S_{G},\boldsymbol{\mathit{w}}}\|^{2}\leq O\left((\beta\eta+\eta^{2}+\epsilon^{2})\|\boldsymbol{{\Sigma}}\|_{2}-\alpha\eta\|\boldsymbol{{\Sigma}}\|_{\min}\right)(1+\log n) The proof is by induction. If n=1n=1, then the conclusion follows from the guarantees of the one dimensional median. Now, assume that it holds for all nkn\leq k for some k1k\geq 1. Let n=k+1.n=k+1. We have by Lemma 3.7

By induction hypothesis, since dim(V)=n/2\text{dim}(V)=n/2, we have

where b=1s2(Σ1+1s2I)1a.\boldsymbol{\mathit{b}}=\frac{1}{s^{2}}\left(\boldsymbol{{\Sigma}}^{-1}+\frac{1}{s^{2}}\boldsymbol{\mathit{I}}\right)^{-1}\boldsymbol{\mathit{a}}. Therefore, we have

Now we will look at the scalar term ΣΣ1+1s2I.|\boldsymbol{{\Sigma}}|\left|\boldsymbol{{\Sigma}}^{-1}+\frac{1}{s^{2}}\boldsymbol{\mathit{I}}\right|. Let λi\lambda_{i} be the eigenvalues of Σ.\boldsymbol{{\Sigma}}.

We next bound exp(a2s2+1s4aT(Σ1+1s2I)1a).\exp\left(-\frac{\|\boldsymbol{\mathit{a}}\|^{2}}{s^{2}}+\frac{1}{s^{4}}\boldsymbol{\mathit{a}}^{T}\left(\boldsymbol{{\Sigma}}^{-1}+\frac{1}{s^{2}}\boldsymbol{\mathit{I}}\right)^{-1}\boldsymbol{\mathit{a}}\right). We have

Note that if 1λ1,...,1λn\frac{1}{\lambda_{1}},...,\frac{1}{\lambda_{n}} and v1,...,vn\boldsymbol{\mathit{v}}_{1},...,\boldsymbol{\mathit{v}}_{n} are the eigenvalues and the corresponding eigenvectors of Σ1\boldsymbol{{\Sigma}}^{-1}, then 1λ1+1s2,...,1λn+1s2\frac{1}{\lambda_{1}}+\frac{1}{s^{2}},...,\frac{1}{\lambda_{n}}+\frac{1}{s^{2}} and v1,...,vn\boldsymbol{\mathit{v}}_{1},...,\boldsymbol{\mathit{v}}_{n} are the eigenvalues and the corresponding eigenvectors of Σ1+1s2I.\boldsymbol{{\Sigma}}^{-1}+\frac{1}{s^{2}}\boldsymbol{\mathit{I}}. Since,

where b=1s2(Σ1+1s2I)1a.\boldsymbol{\mathit{b}}=\frac{1}{s^{2}}\left(\boldsymbol{{\Sigma}}^{-1}+\frac{1}{s^{2}}\boldsymbol{\mathit{I}}\right)^{-1}\boldsymbol{\mathit{a}}. Recall that ϵ1=iλis2\epsilon_{1}=\frac{\sum_{i}\lambda_{i}}{s^{2}}. We can, as before, bound the product of the two scalars by eϵ1.e^{\epsilon_{1}}. Therefore, we have

Combining Equation (6) and Equation 5, we get Theorem 3.6.

2 Improving the dependence on η𝜂\eta

Now we will show how we can obtain the second part of Theorem 1.1 to get a better dependence on η\eta by using Σ^\boldsymbol{\widehat{\Sigma}} from Theorem 1.5. Let D=N(μ,Σ)\mathcal{D}=N(\boldsymbol{\mathit{\mu}},\boldsymbol{{\Sigma}}) be a Gaussian with covariance Σ,\boldsymbol{{\Sigma}}, and ηc/logn\eta\leq c/\log n for a small enough constant c>0.c>0. We first use Theorem 1.5 (with ϵ=η\epsilon=\eta) to estimate σ2=Σ2\sigma^{2}=\|\boldsymbol{{\Sigma}}\|_{2}. We get a σ^2{\widehat{\sigma}}^{2} satisfying

Let S={x1,...,xm}S=\{\boldsymbol{\mathit{x}}_{1},...,\boldsymbol{\mathit{x}}_{m}\} be the given sample, and let yiN(0,σ^2I),i=1,...,m\boldsymbol{\mathit{y}}_{i}\sim N(0,{\widehat{\sigma}}^{2}\boldsymbol{\mathit{I}}),i=1,...,m be i.i.d. samples. Define xi=xi+yi.\boldsymbol{\mathit{x}}_{i}^{\prime}=\boldsymbol{\mathit{x}}_{i}+\boldsymbol{\mathit{y}}_{i}. The key thing to note is that if xN(μ,Σ)\boldsymbol{\mathit{x}}\sim N(\boldsymbol{\mathit{\mu}},\boldsymbol{{\Sigma}}) and yN(0,σ^2I)\boldsymbol{\mathit{y}}\sim N(0,{\widehat{\sigma}}^{2}\boldsymbol{\mathit{I}}), then x+yN(μ,Σ+σ^2I)\boldsymbol{\mathit{x}}+\boldsymbol{\mathit{y}}\sim N(\boldsymbol{\mathit{\mu}},\boldsymbol{{\Sigma}}+{\widehat{\sigma}}^{2}\boldsymbol{\mathit{I}}). Let D=N(μ,Σ+σ^2I)\mathcal{D}^{\prime}=N(\boldsymbol{\mathit{\mu}},\boldsymbol{{\Sigma}}+{\widehat{\sigma}}^{2}\boldsymbol{\mathit{I}}). Note that the mean μ\boldsymbol{\mathit{\mu}}^{\prime} of D\mathcal{D}^{\prime} is same as that of D\mathcal{D}, and the covariance Σ=Σ+σ^2I\boldsymbol{{\Sigma}}^{\prime}=\boldsymbol{{\Sigma}}+{\widehat{\sigma}}^{2}\boldsymbol{\mathit{I}} has

We can view xiDη,\boldsymbol{\mathit{x}}_{i}^{\prime}\sim\mathcal{D}^{\prime}_{\eta}, and we assume ηlognc.\eta\log n\leq c. By Theorem 1.5 and Equation 7, we can compute a Σ^\boldsymbol{\widehat{\Sigma}}^{\prime} such that

Let α=O(ηlogn).\alpha=O\left(\sqrt{\eta\log n}\right). Therefore,

by Equation 8. Now, if we let xi=Σ^1/2xi\boldsymbol{\mathit{x}}_{i}^{\prime\prime}=\boldsymbol{\widehat{\Sigma}}^{\prime-1/2}\boldsymbol{\mathit{x}}_{i}^{\prime} and D=N(μ,Σ)=N(Σ^1/2μ,Σ^1/2ΣΣ^1/2),\mathcal{D}^{\prime\prime}=N(\boldsymbol{\mathit{\mu}}^{\prime\prime},\boldsymbol{{\Sigma}}^{\prime\prime})=N\left(\boldsymbol{\widehat{\Sigma}}^{\prime-1/2}\boldsymbol{\mathit{\mu}},\boldsymbol{\widehat{\Sigma}}^{\prime-1/2}\boldsymbol{{\Sigma}}\boldsymbol{\widehat{\Sigma}}^{\prime-1/2}\right), then we can think of xiDη.\boldsymbol{\mathit{x}}_{i}^{\prime\prime}\sim\mathcal{D}^{\prime\prime}_{\eta}. If we now use Theorem 3.8 with β=(1+O(ηlogn))\beta=\left(1+O\left(\sqrt{\eta\log n}\right)\right) and α=(1O(ηlogn))\alpha=\left(1-O\left(\sqrt{\eta\log n}\right)\right) on the samples S={xi}S^{\prime\prime}=\{\boldsymbol{\mathit{x}}_{i}^{\prime\prime}\}, we get a μ^\boldsymbol{\widehat{\mathit{\mu}}}^{\prime\prime} such that

This implies that μ^=Σ^1/2μ^\boldsymbol{\widehat{\mathit{\mu}}}=\boldsymbol{\widehat{\Sigma}}^{\prime 1/2}\boldsymbol{\widehat{\mathit{\mu}}}^{\prime\prime} satisfies

We can use this technique to give a polynomial time algorithm to compute μ^\boldsymbol{\widehat{\mathit{\mu}}} with a guarantee μ^μ2=O(Σ2η2ϵlog2ϵn)\|\boldsymbol{\widehat{\mathit{\mu}}}-\boldsymbol{\mathit{\mu}}\|^{2}=O\left(\|\boldsymbol{{\Sigma}}\|_{2}\eta^{2-\epsilon}\log^{2-\epsilon}n\right) for any fixed ϵ>0.\epsilon>0. This would require estimating higher order moments by the mean estimation algorithm and then using the above trick to improve the η\eta dependence for each of them in sequence. We don’t give a proof of this in this paper.

3 Distributions with Bounded Fourth Moments

In this section, we will prove some some useful properties that distributions with bounded fourth moments satisfy. We will assume that xD\boldsymbol{\mathit{x}}\sim\mathcal{D} for a distribution with mean μ\boldsymbol{\mathit{\mu}} that has bounded fourth moments, i.e., for every unit vector v\boldsymbol{\mathit{v}}

Let XX be a random variable with \mboxE(X\mboxEX)2=σ2\mbox{{\bf E}}(X-\mbox{{\bf E}}X)^{2}=\sigma^{2} and

for some C4C_{4}. Let ϵ0.5\epsilon\leq 0.5 and AA be any event with probability Pr(A)=1ϵ.\Pr(A)=1-\epsilon. Then

The fourth moment of such an XX is minimum when its support is just the two-point set {a,1ϵϵ(\mboxEXa)+\mboxEX}\{a,\frac{1-\epsilon}{\epsilon}(\mbox{{\bf E}}X-a)+\mbox{{\bf E}}X\}. Therefore,

Let XX be a random variable with \mboxEX=μ\mbox{{\bf E}}X=\mu and \mboxE((Xμ)2)=σ2\mbox{{\bf E}}((X-\mu)^{2})=\sigma^{2} and let

for some C4C_{4}. Then, for every event AA that occurs with probability at least 1ϵ1-\epsilon, we have

where 1A\mathbf{1}_{A} is the indicator function of the event A.A. As an immediate corollary, for ϵ0.5\epsilon\leq 0.5 we get the following bound on the conditional probability

Let dΩd\Omega be the probability measure. We can write \mboxE(Xμ)4C4(\mboxE(Xμ)2)2\mbox{{\bf E}}(X-\mu)^{4}\leq C_{4}(\mbox{{\bf E}}(X-\mu)^{2})^{2} in the following way

Using \mboxE(Y\mboxEY)4(\mboxE(Y\mboxEY)2)2\mbox{{\bf E}}(Y-\mbox{{\bf E}}Y)^{4}\geq(\mbox{{\bf E}}(Y-\mbox{{\bf E}}Y)^{2})^{2} for any random variable Y,Y, and Pr(Ac)=ϵ\Pr(A^{c})=\epsilon we have

Therefore, for ϵ0.5\epsilon\leq 0.5 we get that

As an immediate corollary of Lemma 3.11 and Lemma 3.12, we get for a random variable x\boldsymbol{\mathit{x}} having bounded fourth moments

Let AA be an event that happens with probability 1η1-\eta. Then,

where ΣA\boldsymbol{{\Sigma}}|_{A} is the conditional covariance matrix ΣA:=\mboxE(xxTA)(\mboxE(xA))(\mboxE(xA))T.\boldsymbol{{\Sigma}}|_{A}:=\mbox{{\bf E}}(\boldsymbol{\mathit{x}}\boldsymbol{\mathit{x}}^{T}|A)-(\mbox{{\bf E}}(\boldsymbol{\mathit{x}}|A))(\mbox{{\bf E}}(\boldsymbol{\mathit{x}}|A))^{T}.

Let v\boldsymbol{\mathit{v}} be any unit vector. Let yy be the random variable that is vTx\boldsymbol{\mathit{v}}^{T}\boldsymbol{\mathit{x}} for xD\boldsymbol{\mathit{x}}\sim\mathcal{D}. Let μ=\mboxE(y)\mu=\mbox{{\bf E}}(y), μA=\mboxE(yA)\mu_{A}=\mbox{{\bf E}}(y|A), and d=μAμd=\mu_{A}-\mu. Then

Finally, by a standard argument as in the proof of Chebyshev’s inequality, we have

For every unit vector v\boldsymbol{\mathit{v}}, we have

where σv\sigma_{\boldsymbol{\mathit{v}}} is the standard deviation of x\boldsymbol{\mathit{x}} along the direction v\boldsymbol{\mathit{v}}, σv2:=\mboxExTv2\mboxExTv2.\sigma_{\boldsymbol{\mathit{v}}}^{2}:=\mbox{{\bf E}}|\boldsymbol{\mathit{x}}^{T}\boldsymbol{\mathit{v}}|^{2}-|\mbox{{\bf E}}\boldsymbol{\mathit{x}}^{T}\boldsymbol{\mathit{v}}|^{2}.

4 Proof of Theorem 1.3:

First we will consider the case when XX is a random variable with mean μ\mu and variance σ2\sigma^{2} satisfying

In this case, median need not be a good estimator. Instead, we will consider the interval of minimum length that contains (1ηϵ)(1η)(1-\eta-\epsilon)(1-\eta) fraction of the sample points. Let SS be the given sample, and let S~\widetilde{S} be the points lying in this interval. Let μ^=mean(S~)\widehat{\mu}=\text{mean}(\widetilde{S}) be our estimator. We will show below that μ^μO(C41/4(η+ϵ)3/4σ).|\widehat{\mu}-\mu|\leq O\left(C_{4}^{1/4}(\eta+\epsilon)^{3/4}\sigma\right).

By the concentration inequality stated in Lemma 3.14, we get that for the distribution, the length r1η+ϵ2r_{1-\frac{\eta+\epsilon}{2}} of the interval around μ\mu consisting of probability mass 1η+ϵ21-\frac{\eta+\epsilon}{2} is bounded by

The length of the smallest interval that contains (1ηϵ)(1η)(1-\eta-\epsilon)(1-\eta) fraction of SS is at most the length of the smallest interval that contains 1ηϵ1-\eta-\epsilon fraction of SDS_{\mathcal{D}}. This latter quantity is bounded by r1ηr_{1-\eta}, since the interval I1η+ϵ2I_{1-\frac{\eta+\epsilon}{2}} contains with probability 11/poly(n)1-1/\operatorname*{poly}(n) a (1ηϵ)(1-\eta-\epsilon) fraction of SDS_{\mathcal{D}}.

This implies that when we look at the minimum interval containing 1ηϵ1-\eta-\epsilon fraction of the non-noise points, the extreme points of the interval can be at most at a distance r1η+ϵ2r_{1-\frac{\eta+\epsilon}{2}} from μ\mu. Thus, the distance of all noise points will be within O(C41/4(η+ϵ)1/4σ)O\left(\frac{C_{4}^{1/4}}{(\eta+\epsilon)^{1/4}}\sigma\right). Furthermore, the interval of minimum length with (1ηϵ)(1η)(1-\eta-\epsilon)(1-\eta) fraction of SS will contain at least 13ηϵ1-3\eta-\epsilon fraction of SDS_{\mathcal{D}}. Therefore, by Lemma 3.11 the mean of S~\widetilde{S} will be within ηr1η+O(C4(η+ϵ)3σ)=O(C41/4(η+ϵ)3/4σ)\eta\cdot r_{1-\eta}+O\left(\sqrt{C_{4}(\eta+\epsilon)^{3}}\sigma\right)=O\left(C_{4}^{1/4}(\eta+\epsilon)^{3/4}\sigma\right) from the true mean.

4.2 Multi-dimensional Case

For any direction v\boldsymbol{\mathit{v}}, let μv=μTv\mu_{v}=\boldsymbol{\mathit{\mu}}^{T}v. From the previous section, we know that we can find a μ^v\widehat{\mu}_{\boldsymbol{\mathit{v}}} such that

Therefore, by picking nn orthogonal directions v1,...,vn\boldsymbol{\mathit{v}}_{1},...,\boldsymbol{\mathit{v}}_{n}, we get

We will now bound the radius of the ball in the outlier removal step (Algorithm 2.2.1). We claim the radius of the ball is O(C41/4(η+ϵ)1/4nΣ2)O\left(\frac{C_{4}^{1/4}}{(\eta+\epsilon)^{1/4}}\sqrt{n||\boldsymbol{{\Sigma}}||_{2}}\right). Suppose we have some xD\boldsymbol{\mathit{x}}\sim\mathcal{D}. Let z=xμ\boldsymbol{\mathit{z}}=\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{\mu}}. Using the nn orthogonal directions as picked above, let zi=zTviz_{i}=\boldsymbol{\mathit{z}}^{T}\boldsymbol{\mathit{v}}_{i} and let Z2=zi2=z22Z^{2}=\sum z_{i}^{2}=\left\|\boldsymbol{\mathit{z}}\right\|_{2}^{2}. Consider the following:

It suffices to bound the right-hand side of (\refeq:outlier)(\ref{eq:outlier}) by O(η+ϵ)O(\eta+\epsilon), in which case the ball will contain 1ηϵ1-\eta-\epsilon fraction of the probability mass of D\mathcal{D}. We have

due to the fourth moment condition and the fact that \mboxE((zTvi)2)Σ2\mbox{{\bf E}}((\boldsymbol{\mathit{z}}^{T}\boldsymbol{\mathit{v}}_{i})^{2})\leq\|\boldsymbol{{\Sigma}}\|_{2}. Therefore, a ball of radius at most O(C41/4(η+ϵ)1/4nΣ2)O\left(\frac{C_{4}^{1/4}}{(\eta+\epsilon)^{1/4}}\sqrt{n||\boldsymbol{{\Sigma}}||_{2}}\right) contains 1ηϵ1-\eta-\epsilon fraction of the points. Since aμ2=O(C41/4(η+ϵ)3/4Tr(Σ)),\|\boldsymbol{\mathit{a}}-\boldsymbol{\mathit{\mu}}\|_{2}=O\left(C_{4}^{1/4}(\eta+\epsilon)^{3/4}\sqrt{\operatorname*{Tr}(\boldsymbol{{\Sigma}})}\right), we get that the radius of the ball computed in the outlier removal step is O(C41/4(η+ϵ)1/4nΣ2).O\left(\frac{C_{4}^{1/4}}{(\eta+\epsilon)^{1/4}}\sqrt{n\|\boldsymbol{{\Sigma}}\|_{2}}\right). We have proved

After the outlier removal step, every remaining point x\boldsymbol{\mathit{x}} satisfies

Consider the covariance matrix ΣS~\boldsymbol{{\Sigma}}_{\widetilde{S}} of S~\widetilde{S} (recall that S~\widetilde{S} is the sample after outlier removal). Let S~DS~\widetilde{S}_{\mathcal{D}}\subset\widetilde{S} be the set of points in S~\widetilde{S} that were sampled from the distribution D\mathcal{D} and S~NS~\widetilde{S}_{N}\subset\widetilde{S} be the points sampled by the adversary. Let μS~:=mean(S~)\boldsymbol{\mathit{\mu}}_{\widetilde{S}}:=\text{mean}(\widetilde{S}), μS~N:=mean(S~N)\boldsymbol{\mathit{\mu}}_{\widetilde{S}_{N}}:=\text{mean}(\widetilde{S}_{N}) and μS~D:=mean(S~D).\boldsymbol{\mathit{\mu}}_{\widetilde{S}_{\mathcal{D}}}:=\text{mean}(\widetilde{S}_{\mathcal{D}}). Note that

where η~=S~NS~\widetilde{\eta}=\frac{|\widetilde{S}_{N}|}{|\widetilde{S}|} is the fraction of noise points after the outlier truncation step. Note that η~η12ηϵ=O(η).\widetilde{\eta}\leq\frac{\eta}{1-2\eta-\epsilon}=O(\eta). We will therefore pretend that the fraction of noise points is still η\eta after the outlier truncation step. We again assume that the mean of the distribution D\mathcal{D} is μ=0.\boldsymbol{\mathit{\mu}}=0. By Lemma 3.11 applied with X=xTμD~μD~X=\boldsymbol{\mathit{x}}^{T}\frac{\boldsymbol{\mathit{\mu}}_{\widetilde{\mathcal{D}}}}{\|\boldsymbol{\mathit{\mu}}_{\widetilde{\mathcal{D}}}\|} for xD\boldsymbol{\mathit{x}}\sim\mathcal{D} and where AA is the event that x\boldsymbol{\mathit{x}} is not removed by outlier removal, we have that

Suppose, after the outlier removal step, we had the guarantee that the covariance matrix of the remaining points from the distribution D\mathcal{D}, say ΣD~,\boldsymbol{{\Sigma}}_{{\widetilde{\mathcal{D}}}}, is between

in the Lowener ordering. Corollary 3.13 gives α=1O(C4(η+ϵ))\alpha=1-O(\sqrt{C_{4}(\eta+\epsilon)}) and β=1+O(η+ϵ)\beta=1+O(\eta+\epsilon). Also, by Lemma 3.1 and Lemma 3.16 we have that if SD~=Ω(nlognϵ2)|S_{\widetilde{\mathcal{D}}}|=\Omega\left(\frac{n\log n}{\epsilon^{2}}\right), then

We will use the notation as defined above.

Let WW be the bottom n/2n/2 principal components of the covariance matrix ΣS\boldsymbol{{\Sigma}}_{S}. For some constant CC, we have

where δμ=μS~NμS~D.\boldsymbol{\mathit{\delta}}_{\boldsymbol{\mathit{\mu}}}=\boldsymbol{\mathit{\mu}}_{\widetilde{S}_{N}}-\boldsymbol{\mathit{\mu}}_{\widetilde{S}_{\mathcal{D}}}.

By an inductive application of the above lemma, we can prove

On input (S,n)(S,n), AgnosticMean outputs μ^\boldsymbol{\widehat{\mathit{\mu}}} satisfying

Theorem 3.18 with Corollary 3.13 proves Theorem 1.3.

Proof of Lemma 3.17: Recall that Σ\boldsymbol{{\Sigma}} denotes the covariance matrix of the points from D\mathcal{D}. We have

where A:=ηΣS~N+(ηη2)δμδμT.\boldsymbol{\mathit{A}}:=\eta\boldsymbol{{\Sigma}}_{{\widetilde{S}_{N}}}+(\eta-\eta^{2})\boldsymbol{\mathit{\delta}}_{\boldsymbol{\mathit{\mu}}}\boldsymbol{\mathit{\delta}}_{\boldsymbol{\mathit{\mu}}}^{T}. Therefore, we have

By Lemma 3.16 each xi\boldsymbol{\mathit{x}}_{i} satisfies xi=O(C41/4(η+ϵ)1/4nΣ2),\|\boldsymbol{\mathit{x}}_{i}\|=O\left(\frac{C_{4}^{1/4}}{(\eta+\epsilon)^{1/4}}\sqrt{n\|\boldsymbol{{\Sigma}}\|_{2}}\right), so we have

For a symmetric matrix BB, let λk(B)\lambda_{k}(B) denote the kk’th largest eigenvalue. By Weyl’s inequality, we have that

By Equation (14), there exists a constant C~\widetilde{C} such that

Recall that WW is the space spanned by the bottom n/2n/2 eigenvectors of ΣS~\boldsymbol{{\Sigma}}_{\widetilde{S}}, and PW\boldsymbol{\mathit{P}}_{W} is the matrix corresponding to the projection operator on to WW. We therefore have

Multiplying by the vector PWδμPWδμ\frac{\boldsymbol{\mathit{P}}_{W}\boldsymbol{\mathit{\delta}}_{\boldsymbol{\mathit{\mu}}}}{\|\boldsymbol{\mathit{P}}_{W}\boldsymbol{\mathit{\delta}}_{\boldsymbol{\mathit{\mu}}}\|} and its transpose on either side, we get

where C=C~1ηC=\frac{\widetilde{C}}{1-\eta}. We therefore have

Proof of Theorem 3.18: By Equation 13, it is enough to bound μ^μS~D2.\|\boldsymbol{\widehat{\mathit{\mu}}}-\boldsymbol{\mathit{\mu}}_{\widetilde{S}_{\mathcal{D}}}\|^{2}. The proof is by induction on the dimension. If n=1n=1, then the conclusion follows from the guarantees for the one dimensional case proven in Section 3.4.1. Now, assume that it holds for all nkn\leq k for some k1k\geq 1. Let n=k+1.n=k+1. We have by Lemma 3.17

Recall that we defined VV to be the span of the top n/2n/2 principal components of ΣS~\boldsymbol{{\Sigma}}_{\widetilde{S}}. By induction hypothesis, since dim(V)=n/2\text{dim}(V)=n/2, we have

Covariance Estimation

Let D\mathcal{D} be a distribution with mean μ\boldsymbol{\mathit{\mu}} and covariance σ2.\sigma^{2}. If D=N(μ,σ2)\mathcal{D}=N(\mu,\sigma^{2}), then there is an algorithm that takes as input m=\O(lognϵ2)m=\O\left(\frac{\log n}{\epsilon^{2}}\right) samples x1,...xmDη\boldsymbol{\mathit{x}}_{1},...\boldsymbol{\mathit{x}}_{m}\sim\mathcal{D}_{\eta} and computes in polynomial time σ^2\widehat{\sigma}^{2} such that σ^2σ2=O(η+ϵ)σ2.\left|\widehat{\sigma}^{2}-\sigma^{2}\right|=O(\eta+\epsilon)\sigma^{2}.

If xD\boldsymbol{\mathit{x}}\sim\mathcal{D} has bounded fourth moments with constant C4C_{4}, and (xμ)2(x-\mu)^{2} has bounded fourth moments with constant C4,2C_{4,2}. Then there is an algorithm that takes as input η\eta and m=O(logn+log1/ϵϵ2)m=O\left(\frac{\log n+\log 1/\epsilon}{\epsilon^{2}}\right) samples x1,...xmDη\boldsymbol{\mathit{x}}_{1},...\boldsymbol{\mathit{x}}_{m}\sim\mathcal{D}_{\eta} and computes in polynomial time σ^2\widehat{\sigma}^{2} such that σ^2σ2=O(C4,21/4(η+ϵ)3/4C41/2σ).\left|\widehat{\sigma}^{2}-\sigma^{2}\right|=O\left(C_{4,2}^{1/4}(\eta+\epsilon)^{3/4}C_{4}^{1/2}\sigma\right).

When D\mathcal{D} is a distribution that has bounded eighth moments, the result follows from the 1d mean estimation in Section 3.4 applied (xμ)2(x-\mu)^{2}. Note that \mboxE(xμ)2=σ2\mbox{{\bf E}}(x-\mu)^{2}=\sigma^{2} and

From Section 3.4, we therefore have that if m=O(logn+log1/ϵϵ2)m=O\left(\frac{\log n+\log 1/\epsilon}{\epsilon^{2}}\right), there is a poly(n)\operatorname*{poly}(n) algorithm with σ^2σ2O(C4,21/4(η+ϵ)3/4C41/2σ).|\widehat{\sigma}^{2}-\sigma^{2}|\leq O\left(C_{4,2}^{1/4}(\eta+\epsilon)^{3/4}C_{4}^{1/2}\sigma\right).

2 Multi-Dimensional Case: Theorem 1.5

In this section we will prove that CovarianceEstimation (Algorithm 2.2.3) gives Theorem 1.5. Throughout this section, we will assume that D\mathcal{D} is a distribution with mean μ\boldsymbol{\mathit{\mu}} and covariance Σ\boldsymbol{{\Sigma}} and has bounded fourth moments with parameter C4C_{4}. We use the following symmetrization trick to assume that D\mathcal{D} has mean 0\boldsymbol{0}. Given samples S={x1,...,xm}S=\{\boldsymbol{\mathit{x}}_{1},...,\boldsymbol{\mathit{x}}_{m}\}, let

Since η\eta fraction of the original samples were corrupted on average, only 2η2\eta fraction of the new samples will be corrupted on average. Moreover, if x,yD\boldsymbol{\mathit{x}},\boldsymbol{\mathit{y}}\sim\mathcal{D} are independent random variables, then we can show that the distribution of x=(xy)/2\boldsymbol{\mathit{x}}^{\prime}=(\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{y}})/\sqrt{2} has bounded fourth moments with parameter C4+3/2\leq C_{4}+3/2. We will denote by D\mathcal{D}^{\prime} the distribution of x\boldsymbol{\mathit{x}}^{\prime}. CovarianceEstimation is just the mean estimation algorithm on S(2)={xxTxS}S^{(2)}=\{\boldsymbol{\mathit{x}}^{\prime}\boldsymbol{\mathit{x}}^{\prime T}|\boldsymbol{\mathit{x}}\in S\}, we can appeal to Theorem 1.3. Furthermore, let D\mathcal{D}^{\prime} be an affine transformation of a 44-wise independent distribution.

where Σ(2)\boldsymbol{{\Sigma^{(2)}}} is covariance matrix of xxT,xD.\boldsymbol{\mathit{x}}\boldsymbol{\mathit{x}}^{T},\boldsymbol{\mathit{x}}\sim\mathcal{D}^{\prime}.

We will now derive a bound for Σ(2)2\|\boldsymbol{{\Sigma^{(2)}}}\|_{2} when the distribution has bounded fourth moments and is 44-wise independent. In particular, we will prove

If Σ(2)\boldsymbol{{\Sigma^{(2)}}} is the covariance matrix of xxT,xD\boldsymbol{\mathit{x}}\boldsymbol{\mathit{x}}^{T},\boldsymbol{\mathit{x}}\sim\mathcal{D}^{\prime}, it holds that

Proof of Proposition 4.2: Note that \mboxE(Y)=Σ\mbox{{\bf E}}(\boldsymbol{\mathit{Y}})=\boldsymbol{{\Sigma}}.

As in Section 4.2, we assume that the true distribution has mean μ=0\boldsymbol{\mathit{\mu}}=\boldsymbol{0}.

In this section, we will prove AgnosticOperatorNorm (Algorithm 2.2.3) gives Theorem 1.6. Let S=SDSNS=S_{\mathcal{D}}\cup S_{N} be the given sample, where SDS_{\mathcal{D}} consists of points from some distribution D\mathcal{D} with mean μ\boldsymbol{\mathit{\mu}} and covariance Σ\boldsymbol{{\Sigma}} and SNS_{N} consists of points picked by the adversary. Let ΣSD\boldsymbol{{\Sigma}}_{S_{\mathcal{D}}} be the sample covariance of SDS_{\mathcal{D}}. We assume that D\mathcal{D} has 1D concentration, i.e., there exists a constant γ\gamma such that for every unit vector v\boldsymbol{\mathit{v}}

Let S~\widetilde{S} be the remaining sample at the end of the algorithm and let S~D\widetilde{S}_{\mathcal{D}} be points in S~\widetilde{S} sampled from D\mathcal{D}.

First, we will argue that the covariance of the true distribution is well-approximated by Σμ(S~D)\boldsymbol{{\Sigma}}_{\boldsymbol{\mathit{\mu}}}(\widetilde{S}_{\mathcal{D}}).

With probability 11/poly(n)1-1/\operatorname*{poly}(n),

First, note that the tt computed in SafeOutlierTruncation is at most O(Tr(Σ))O(\operatorname*{Tr}(\boldsymbol{{\Sigma}})) because by an analogous argument as in Section 4.1, we have σ^v2(1+O(η))σv2\widehat{\sigma}_{\boldsymbol{\mathit{v}}}^{2}\leq(1+O(\eta))\sigma_{\boldsymbol{\mathit{v}}}^{2} (namely that the estimated variance σ^v\widehat{\sigma}_{\boldsymbol{\mathit{v}}} in a direction v\boldsymbol{\mathit{v}} is close to the true variance σv\sigma_{\boldsymbol{\mathit{v}}} in that direction). Then the ball in SafeOutlierTruncation has radius R=c1Tr(Σ)log1/γnηR=c_{1}\sqrt{\operatorname*{Tr}(\boldsymbol{{\Sigma}})}\log^{1/\gamma}\frac{n}{\eta} for some constant c1c_{1}. We have that in any direction v\boldsymbol{\mathit{v}}, the probability that xD\boldsymbol{\mathit{x}}\sim\mathcal{D} deviates from the mean by more than c1σvlog1/γnηc_{1}\sigma_{\boldsymbol{\mathit{v}}}\log^{1/\gamma}\frac{n}{\eta} is 1/poly(nη)1/\operatorname*{poly}(\frac{n}{\eta}). Then if we take nn orthogonal directions, the probability that any given point is more than distance RR from μ\boldsymbol{\mathit{\mu}} is still 1/poly(nη)1/\operatorname*{poly}(\frac{n}{\eta}). Thus, step (1) of the algorithm will remove only 1/poly(nη)1/\operatorname*{poly}(\frac{n}{\eta}) fraction of the points sampled from D\mathcal{D}.

In every direction v\boldsymbol{\mathit{v}}, the probability mass of points from D\mathcal{D} outside an interval of size c2σvlog1/γnηc_{2}\sigma_{\boldsymbol{\mathit{v}}}\log^{1/\gamma}\frac{n}{\eta} around the mean is at most 1/poly(nη)1/\operatorname*{poly}(\frac{n}{\eta}), where σv\sigma_{\boldsymbol{\mathit{v}}} is the variance in the direction v\boldsymbol{\mathit{v}}. Let CiC_{i} be the region between the two hyperplanes used for truncation in iteration ii. Therefore, if the number of iterations is O(nlog2/ηnη)O(n\log^{2/\eta}\frac{n}{\eta}), we will have that Pr(xiCixD)=11/poly(nη).\Pr\left(\boldsymbol{\mathit{x}}\in\cap_{i}C_{i}\left|x\sim\mathcal{D}\right.\right)=1-1/\operatorname*{poly}(\frac{n}{\eta}).

Note that 11d concentration implies that the distribution has bounded 2k2k’th moment for all finite k.k. By Lemma 3.12, we have that the covariance matrix Σ0(DiCi)\boldsymbol{{\Sigma}}_{\boldsymbol{0}}\left(\mathcal{D}\cap_{i}C_{i}\right) of DiCi\mathcal{D}\cap_{i}C_{i} is close to that of Σ\boldsymbol{{\Sigma}}:

Finally, to relate Σ0(DiCi)\boldsymbol{{\Sigma}}_{\boldsymbol{0}}\left(\mathcal{D}\cap_{i}C_{i}\right) to Σ0(S~D)\boldsymbol{{\Sigma}}_{\boldsymbol{0}}\left(\widetilde{S}_{\mathcal{D}}\right), we use Proposition 3.2. The concept class we use is all degree two polynomials restricted to convex polytopes with at most O(n)O(n) facets, defined by the hyperplanes used for truncation at each iteration of the algorithm. The VC dimension of this concept class is O(n2logn)O(n^{2}\log n). Therefore, by Proposition 3.2 applied with R=c1Tr(Σ)log1/γnηc1Σ1/2n1/2log1/γnηR=c_{1}\sqrt{\operatorname*{Tr}(\boldsymbol{{\Sigma}})}\log^{1/\gamma}\frac{n}{\eta}\leq c_{1}\|\boldsymbol{{\Sigma}}\|^{1/2}n^{1/2}\log^{1/\gamma}\frac{n}{\eta}, we get that if we take m=O(n3(log1/γnη)2lognη2)m=O\left(\frac{n^{3}(\log^{1/\gamma}\frac{n}{\eta})^{2}\log n}{\eta^{2}}\right) then

Combining equations 16 and 17 we get the desired result. ∎

First, note that since only an η\eta fraction of S~\widetilde{S} is noise, we have

Therefore, we have that Σ0(S~)2(1η)Σ0(S~D)2.\|\boldsymbol{{\Sigma}}_{\boldsymbol{0}}(\widetilde{S})\|_{2}\geq(1-\eta)\|\boldsymbol{{\Sigma}}_{\boldsymbol{0}}(\widetilde{S}_{\mathcal{D}})\|_{2}. Lemma 5.2 gives the desired lower bound. For the upper bound, let v\boldsymbol{\mathit{v}} be the top eigenvector of Σ0(S~).\boldsymbol{{\Sigma}}_{\boldsymbol{0}}(\widetilde{S}). When the algorithm terminates, we have

where the second line follows because of the termination condition and because we can estimate the variance of D\mathcal{D} in any direction to within a (1±cη)(1\pm c\eta) factor. ∎

2 Termination

In this section, we will show that with high probability, Algorithm 2.2.3 terminates in a polynomial number of steps provided that η1C\eta\leq\frac{1}{C} for some constant CC that depends only on the estimation in Step (5).

Every time the algorithm goes through another iteration, it must remove a certain number of noise points. Suppose in step (7), we remove rr noise points. The noise configuration of maximum variance puts rr amount of noise at the outlier removal distance d1=c1Tr(Σ)log1/γnηd_{1}=c_{1}\sqrt{\operatorname*{Tr}(\boldsymbol{{\Sigma}})}\log^{1/\gamma}\frac{n}{\eta}, and ηr\eta-r amount of noise at the truncation threshold distance d2=c2σ^vlog1/γnη2d_{2}=\frac{c_{2}{\widehat{\sigma}}_{\boldsymbol{\mathit{v}}}\log^{1/\gamma}\frac{n}{\eta}}{2}. We can then write an upper bound on σ2\sigma^{2}.

Let us simplify the numerator Z=σ2σv2ηd22Z=\sigma^{2}-\sigma_{\boldsymbol{\mathit{v}}}^{2}-\eta d_{2}^{2}. Since we are truncating the sample, we have (1+c3ηlog2/γnη)σ^v2σ2(1+c_{3}\eta\log^{2/\gamma}\frac{n}{\eta}){\widehat{\sigma}}_{\boldsymbol{\mathit{v}}}^{2}\leq\sigma^{2}. Here we also assume that η1C\eta\leq\frac{1}{C} for a sufficiently large CC so that 11cη\frac{1}{1-c\eta} is less than some constant.

Recall that σ2(1η)Σ2\sigma^{2}\geq(1-\eta)\|\boldsymbol{{\Sigma}}\|_{2} by (18). Then as long as c3c_{3} is a sufficiently large constant, we have

Then combining ZZ with the denominator from earlier and using the fact that d1c1nΣ2log1/γnηd_{1}\leq c_{1}\sqrt{n\|\boldsymbol{{\Sigma}}\|_{2}}\log^{1/\gamma}\frac{n}{\eta}, we get:

Then rO(min{ηn,1nlog2/γnη})r\geq O\left(\min\left\{\frac{\eta}{n},\frac{1}{n\log^{2/\gamma}\frac{n}{\eta}}\right\}\right), so the algorithm will terminate in a nearly linear number of iterations. ∎

Open Questions

An immediate open question is whether the our analysis of the mean estimation algorithm is tight and the logn\sqrt{\log n} is avoidable. For special distributions including Gaussians, [DKK+16] give an algorithm with higher sample complexity and error ηlog1η\eta\sqrt{\log\frac{1}{\eta}} rather than ηlogn\eta\sqrt{\log n} or ηlogn\sqrt{\eta\log n} as in Theorem 1.1. An open question is to give an O(η)O(\eta) approximation. For the more general distributions considered here, the dependence on η\eta must grow as at least η3/4\eta^{3/4}; it is open to find an algorithm that achieves O(η3/4)O(\eta^{3/4}) error (our guarantee for the general setting has error O(ηlogn)O(\sqrt{\eta\log n})). Other open problems include agnostic learning of a mixture of two arbitrary Gaussians and agnostic sparse recovery.

Acknowledgment

We thank Chao Gao and Roman Vershynin for helpful discussions. We would also like to thank the anonymous reviewers for useful suggestions. This research was supported in part by NSF awards CCF-1217793 and EAGER-1555447.

References