Agnostic Estimation of Mean and Covariance
Kevin A. Lai, Anup B. Rao, Santosh Vempala
Introduction
The mean and covariance of a probability distribution are its most basic parameters (if they are bounded). Many families of distributions are defined using only these parameters. Estimating the mean and covariance from iid samples is thus a fundamental and classical problem in statistics. The sample mean and sample covariance are generally the best possible estimators (under mild conditions on the distribution such as their existence). However, they are highly sensitive to noise. The main goal of this paper is to estimate the mean, covariance and related functions in spite of arbitrary (adversarial) noise.
The Achilles heel of algorithms for generative models is the assumption that data is exactly from the model. This is crucial for known guarantees, and relaxations of it are few and specialized, e.g., in ICA, data could by noisy, but the noise itself is assumed to be Gaussian. Assumptions about rank and sparsity are made in a technique that is now called Robust PCA [CSPW11, CLMW11, XCM10]. There have been attempts [Kwa08, MT+11] at achieving robustness by L1 minimization, but they don’t give any error bounds on the output produced. A natural, important and wide open problem is estimating the parameters of generative models in the presence of arbitrary, i.e., malicious noise, a setting usually referred to as agnostic learning. The simplest version of this problem is to estimate a single Gaussian in the presence of malicious noise. Alternatively, this can be posed as the problem of finding a best-fit Gaussian to data or agnostically learning a single Gaussian. We consider the following generalization:
There is a large literature on robust statistics (see e.g., [Hub11, HRRS11, MMY06]), with the goal of finding estimators that are stable under perturbations of the data. The classic example for points on a line is that the sample median is a robust estimator while the sample mean is not (a single data point can change the mean arbitrarily). One measure for robustness of an estimator is called breakdown point, which is the minimum fraction of noise that can make the estimator arbitrarily bad. Robust statistics have been proposed and studied for mean and covariance estimation in high dimension as well (see [Hub64, Tuk74, Mar76, SJD81, Don82, Dav87, HPL91, DG92, MSY92, MZ12, CGR15] and the references therein). Most commonly used methods (including M-estimators) to estimate the covariance matrix were shown to have very low break down points [Don82]. The notion of robustness we consider quantifies how far the estimated value is from the true value. To the best of our knowledge, all the papers either suffer from the difficulty that their algorithms are computationally very expensive, namely exponential time in the dimension, or have poor or no guarantees for the output. Tukey’s median [Tuk74]) is an example of the former. It is defined as the deepest point with respect to a given set of points As proven in [CGR15], this is an optimal estimate of the mean. But there is no known polynomial time algorithm to compute this. Another well-known proposal (see [Sma90]) is the geometric median:
This has the advantage that it can be computed via a convex program. Unfortunately, as we observe here (see Proposition 2.1), the error of the mean estimate produced by this method grows polynomially with the dimension (also see [Bru11]).
In this paper, we give polynomial time algorithms to estimate the mean with error that is close to the information-theoretically optimal estimator. The dependence on the dimension, of the error in the estimated mean, is only . To the best of our knowledge, this is the first polynomial-time algorithm with an error dependence on dimension that is less than , the bound achieved by the geometric median. Moreover, as we state precisely later, our techniques extend to very general input distributions and to estimating higher moments.
Our algorithm is practical. A matlab implementation for mean estimation can be found in [KRV]. It takes less a couple of seconds to run on a -dimensional problem with samples on a personal laptop.
is the Gaussian with mean and covariance .
Let is a distribution with mean and covariance . We say it has bounded ’th moments if there exists a constant such that for every unit vector ,
Here is the variance of along For mean estimation, will be used, and for covariance estimation, will be needed.
1 Main results
All the results we state hold with probability unless otherwise mentioned. We will also assume is a less than a universal constant. We begin with agnostic mean estimation.
We note that the sample complexity is nearly linear, and almost matches the complexity for mean estimation with no noise.
If we take samples, and assume that for a small enough constant , then by combining theorems 1.5 and 1.1, we can improve the dependence for the non-spherical Gaussian case in Theorem 1.1 to
Our next theorem is a similar result for much more general distributions.
The bounds above are nearly the best possible (up to a factor of ) when the covariance is a multiple of the identity.
Let be a distribution with mean and covariance and that (a) for , and have bounded fourth moments with constants and (see Equation 1) respectively. (b) is an (unknown) affine transformation of a -wise independent distribution. Then, there is an algorithm that takes as input samples and and computes in -time a covariance estimate such that
where denotes the Frobenius norm.
If then it satisfies the hypothesis of the above theorem. More generally, it holds for any -wise independent distribution with bounded eighth moments and whose fourth moment along any direction is at least times the square of the second moment for some . We also note that if the distribution is isotropic, then covariance estimation is essentially a -d problem and we get a better bound.
Suppose is a distribution which satisfies the following concentration inequality: there exists a constant such that for every unit vector
Then, there is an algorithm that runs in time that takes as input and independent samples , and computes such that
In independent work, [DKK+16] gave a similar algorithm, which they call a Gaussian filtering method, for agnostic mean estimation assuming a spherical covariance matrix; while their guarantees are specifically for Gaussians, the error term in their guarantee grows only with rather than . They also give a completely different algorithm based on the Ellipsoid method, for a simple family of distributions including Gaussian and Bernoulli.
As a corollary of Theorem 1.5, we get a guarantee for agnostic SVD.
Let is a distribution that satisfies the hypothesis of Theorem 1.5. Let be the best rank approximation to in norm. There exists a polynomial time algorithm that takes as input and samples from It produces a rank matrix such that
Our results can also be used to estimate the mean and covariance of noisy Bernoulli product distributions, i.e. distributions in which each coordinate is 1 with probability and 0 with probability . In one dimension, for a Bernoulli distribution is . For a Bernoulli product distribution, will be within a constant of . Then Theorem 1.3 can be applied to get an estimate for the mean. For instance, if and , then . If is constant, then by Theorem 1.5, we can get an estimate for the covariance.
Main Ideas
Here we discuss the key ideas of the algorithms. The algorithm AgnosticMean (Algorithm 2.2.2) alternates between an outlier removal step and projection onto the top principal components; these steps are repeated. It is inspired by the work of Brubaker [Bru09] who gave an agnostic algorithm for learning a mixture of well-separated spherical Gaussians.
For illustration, let us assume for now that the underlying distribution is We are given a set of points from , and be the points sampled from the Gaussian and the adversary respectively. Let us also assume that . We will use the notation for mean of the points in a set , and for covariance of the points in . We then have
If the dimension is , then we can show that the median of is an estimate for correct up to an additive error of Even if we just knew the direction of the mean shift , then we can estimate by first projecting the sample on the line along and then finding the median. This would give an estimator satisfying So we can focus on finding the direction of . One would guess that the top principal component of the covariance matrix of would be a good candidate. But it is easy for the adversary to choose to make this completely useless. Since the noise points can be anything, just two points from placed far away on either side of the mean along a particular line passing through are sufficient to make the variance in that direction blow up arbitrarily. But we can limit this effect to some extent by an outlier removal step. By a standard concentration inequality for Gaussians, we know that the points in lie in a ball of radius around the mean. So, if we can just find a point inside or close to the convex hull of the Gaussian and throw away all the points that lie outside a ball of radius around this point, we preserve all the points in . This will also contain the effect of noise points on the variance since now they are restricted to be within distance of We will see later that we can use coordinate-wise median as the center of the ball. By computing the variance by projecting onto any direction, we can figure out up to a factor. From now on, we assume that all points in lie within a ball of radius centered at
But even after this restriction, the top principal component may not contain any information about the mean shift direction. By just placing (say) noise points along the direction at , and all the remaining noise points perpendicular to this at a single point at a smaller distance, we can make the top principal component. But is perpendicular to the mean shift direction.
The idea to get around this is that even if the top principal component of may not be along the mean-shift direction, the span (call it ) of top principal components of will contain a big projection of the mean-shift vector. This is because, if a big component of the the mean-shift vector was in the span (say ) of bottom principal components of , by Equation 2 this would mean that there is a vector in with a large Rayleigh quotient. This implies that the top eigenvalues of are all big. Since , where , this is possible only if is large. But since the distance of each point in from is , the trace of cannot be too large. Therefore, in the space , we can just compute the sample mean and it will be close to . We still have to find the mean in the space . But we do this by recursing the above procedure in . At the end we will be left with a one-dimensional space, and then we can just find the median. This recursive projection onto the top principal components is done in Algorithm 2.2.2 .
This generalizes to the non-spherical Gaussians with a few modifications. We use a different outlier removal step. In the non-spherical case, it is not trivial to compute to be used as the radius of the ball. We give an algorithm for this later on. To limit the effect of noise, we use a damping function. Instead of discarding points outside a certain radius, we damp every point by a weight so that further away points get lower weights. This is done in OutlierDamping (Algorithm 2.2.1). We get the guarantees of Theorem 1.1 by running AgnosticMean (Algorithm 2.2.2) with the outlier removal routine being OutlierDamping. A detailed proof of the whole algorithm is given in Section 3.1.
We then turn to more general distributions which have bounded fourth moments. We need bounded fourth moments to ensure that the mean and covariance matrix of the distribution do not change much even after conditioning by an event that occurs with probability . One difficulty for general distributions is that the outlier damping doesn’t work. So for distributions with bounded fourth moments, we have another outlier removal routine called In this routine, we first find a point analogous to the coordinate-wise median for the Gaussians, and then consider a ball big enough to contain fraction of . We throw away all the points outside this ball. We get the guarantees of Theorem 1.3 by running AgnosticMean (Algorithm 2.2.2) with the outlier removal routine being OutlierTruncation (Algorithm 2.2.1). The complete proof of this appears in Section 3.3.
We now have an algorithm to estimate the mean of very general (with bounded fourth moments) distributions. To estimate the covariance matrix, we observed that the covariance matrix of a distribution is given by If we knew what was, then covariance can be computed by estimating the mean of the second moments. To compute the mean of the second moments, we can treat as a vector in dimensions and run the algorithm for mean estimation. Also, we can estimate by the same algorithm. Therefore, we get Theorem 1.5 by running CovarianceEstimation (Algorithm 2.2.3). Its proof appears in Section 4.2.
Algorithm AgnosticOperatorNorm (Algorithm 2.2.3) estimates the -norm for general distributions. For illustration, suppose and we are given samples and the mean . We consider the covariance-like matrix
Since fraction of the points in are from the Gaussian, we have Therefore, the top eigenvalue of is at least Let be the top eigenvector of If the Gaussian variance along (which can be computed up to factor) is much less than , this should be because there are a lot of noise points in whose projections onto are big compared to the projection of Gaussian points in . We remove points in that have big projection and then iterate the entire procedure. We later show that this procedure terminates in steps and when it terminates the top eigenvalue of is close to that of A proof of this appears in Section 5.
Theorem 1.7 follows easily from Theorem 1.5. Let be the top- eigenspace of from Theorem 1.5. We then have
follow from triangle inequality, follows from the fact that is the best rank- approximation and from the guarantees of Theorem 1.5.
Next the algorithm estimates a weighted covariance matrix with the weight of a point proportional to for chosen from a Gaussian distribution; it computes the SVD of . For this we use our algorithm again (the weights are applied individually to each sample). The main guarantee is that the eigenvectors of this weighted covariance approximate the columns of . This relies on the maximum eigenvalue gap of being large, and it has to be approximated to within additive error . Theorem 1.7 implies that the additional error in eigenvalues is bounded by , and therefore it suffices to have for a sufficiently small constant that depends only on the cumulant and moment bound assumptions (i.e., ). Thus, if suffices to have .
In this section we will show the lower bounds stated in Observation 1.4. For Gaussian distributions, this is a special case of a theorem proved in [CGR15]. We reproduce the relevant part here for completeness. We will show that there are distributions and distributions such that and
So, given , no algorithm can distinguish between . Let be p.d.f of and be the p.d.f of Let be such that the total variation distance between is
By a standard inequality for the total variation distance of Gaussian distributions, this implies that Let be the distribution with p.d.f and be the distribution with p.d.f . It is now easy to verify that Equation 3 is satisfied. This proves item one of Observation 1.4.
For the distributions with bounded fourth moments, consider the following two one-dimensional distributions. is supported on two points with the corresponding probabilities . is supported on three points with probabilities respectively. Let . It is easy to check that both and have bounded fourth moments with the constant . Furthermore, can be obtained from by adding fraction of noise points. So no algorithm can distinguish between the two distributions. Since their means differ by , no algorithm can get an estimate better than this.
We will now show that the geometric median:
has a dependence on the dimension. We show this in the Gaussian case even if we have access to the whole distribution, but with fraction of noise points placed all at a single point far away from most of the Gaussian points.
Let be a distribution with diagonal covariance matrix whose variance along the coordinate direction is zero, and equal to in all the other coordinate directions. Assume there is an fraction of noise at a distance along . Let
We have that at the minimizer , the derivative with respect to is zero. Therefore, we should have
Consider It is clear from Equation 4 that We claim that if for a small enough constant , then Suppose . Since , with exponential probability. Therefore,
2 Algorithms
Our algorithms are based on outlier removal and SVD. To simplify the proofs, we use new samples for each step of the algorithm. The total sample complexity is given in the theorems.
For outlier removal, we use one of the following two simple routines. The first, which we call OutlierDamping, returns a vector of positive weights, one for each sample point.
Let be the coordinate-wise median of . Let Estimate by estimating d variance along orthogonal directions, see Section 4.1.
Set for every
The second procedure for outlier removal returns a subset of points. It will be convenient to view this as a weighting of the point set. We call this procedure OutlierTruncation.
if : Let be the smallest interval containing fraction of the points, . Return
Let be as in Lemma 3.15.
Let ball of minimum radius centered at that contains fraction of .
Return
2.2 Main Algorithm
We are now ready to state the main algorithm for agnostic mean estimation. It uses one of the above outlier removal procedures and assumes that the output of the procedure is a weighting.
Let .
if , Return . //Gaussian case
else Return . //General case
Let be the weighted covariance matrix of with weights , and be the span of the top principal components of , and be its complement.
Set where is the projection operation on to .
Let and
Return
2.3 Estimation of the Covariance Matrix and Operator Norm
For both the tasks in this section, we will assume that the mean of the distribution . We can do this without loss of generality by a standard trick mentioned described in Section 4.2. The algorithm for estimating the covariance matrix calls AgnosticMean on Analysis is given in Section 4.2.
Output: matrix
Let (see Equation 15)
The algorithm for estimating is based on iteratively truncating the samples along the direction of top variance. The analysis is given in Section 5.
Let .
Do the following times
Let .
Find , the top eigenvector of , and its corresponding eigenvalue .
Estimate (up to factor, see Section 4.1) the variance of along and denote it by .
if Return .
Remove all points such that .
Let be the sum of estimated variances of in orthogonal directions.
Let be the ball of radius centered at .
Return
Mean Estimation: Theorem 1.1 and Theorem 1.3
In this section, we will first prove Theorem 1.1, which is for Gaussian distributions, and Theorem 1.3, which is for distributions with bounded fourth moments. All our algorithms will be translationally invariant. We will assume w.l.o.g that the mean of the distribution is . So we will be proving bounds on Algorithm 2.2.2 has levels, we will assume that at each level it uses samples resulting in a total of
At various points in the analysis, to bound the sample complexity we will have to show that the estimates computed from samples are close to their expectations. We will use the following two results. Firstly, as an immediate corollary of matrix Bernstien for rectangular matrices (see Theorem in [Tro12]), we get the following concentration result for the sample mean and sample covariance.
Here and are sample mean and sample covariance matrix.
Secondly, the functions we estimate will be integrals of low-degree polynomials (degree at most ) restricted to intervals and/or balls. These functions viewed as binary concepts have small VC-dimension, where is the dimension of space and is the degree of the polynomial. We use this to bound the error of estimating integrals via samples, and we can make the error smaller than any inverse polynomial using a size sample.
By the VC theorem, for any concept in , the bound on the size of the sample ensures that with probability at least and any ,
Noting that , we get the claimed bound.
We use the above notation for and . By an abuse of notation, when , we mean the population version of the above quantities:
We consider the matrix
Let be a one dimensional Gaussian distribution. If , and we are given , then the median satisfies with high probability.
Let be made up of samples in that come from the Gaussian, also let Let us bound the probability that the median We first note that if then By Hoeffding’s inequality, we can bound this by if ∎
We will next consider the multidimensional case. The proof follows by a series of lemmas. We state the lemmas first, conclude the proof of Theorem 1.1 and then prove the lemmas. First, we observe that by applying Lemma 3.3 in orthogonal directions and union bound, we get
By a simple calculation, This immediately gives the following bound on the trace.
Suppose Then there exists a constant such that,
As will be clear from the proof of Theorem 3.6, when is a multiple of identity, then will also be a multiple of . By Lemma 3.1, if we take samples, we will have
in the Lowener ordering, for some . By an argument similar to the one sketched in Section 2, we can prove
We will use the notation as defined above. Let be the bottom principal components of the covariance matrix . We have
where denotes the least eigenvalue of and
By an inductive application of Lemma 3.7, we get the following theorem giving a bound on
On input and the routine , AgnosticMean outputs satisfying
Theorem 3.6 combined with Theorem 3.8 proves Theorem 1.1. We get a better dependence on when because we can take in this case. This would lead to the cancellation of the leading term in the bound in Theorem 3.8 as
Proof of Lemma 3.7: Recall that denotes the covariance matrix of the Gaussian part. We have
where Therefore, we have
For a symmetric matrix , let denote the ’th largest eigenvalue. By Weyl’s inequality, we have
Recall that is the space spanned by the bottom eigenvectors of , and is the matrix corresponding to the projection operator on to . We therefore have
Multiplying by the vector and its transpose on either side, we get
Assuming , we therefore have
Proof of Theorem 3.8: By Equation 6 and Lemma 3.1, since we take samples we have
So it is enough to prove The proof is by induction. If , then the conclusion follows from the guarantees of the one dimensional median. Now, assume that it holds for all for some . Let We have by Lemma 3.7
By induction hypothesis, since , we have
where Therefore, we have
Now we will look at the scalar term Let be the eigenvalues of
We next bound We have
Note that if and are the eigenvalues and the corresponding eigenvectors of , then and are the eigenvalues and the corresponding eigenvectors of Since,
where Recall that . We can, as before, bound the product of the two scalars by Therefore, we have
Combining Equation (6) and Equation 5, we get Theorem 3.6.
2 Improving the dependence on η𝜂\eta
Now we will show how we can obtain the second part of Theorem 1.1 to get a better dependence on by using from Theorem 1.5. Let be a Gaussian with covariance and for a small enough constant We first use Theorem 1.5 (with ) to estimate . We get a satisfying
Let be the given sample, and let be i.i.d. samples. Define The key thing to note is that if and , then . Let . Note that the mean of is same as that of , and the covariance has
We can view and we assume By Theorem 1.5 and Equation 7, we can compute a such that
Let Therefore,
by Equation 8. Now, if we let and then we can think of If we now use Theorem 3.8 with and on the samples , we get a such that
This implies that satisfies
We can use this technique to give a polynomial time algorithm to compute with a guarantee for any fixed This would require estimating higher order moments by the mean estimation algorithm and then using the above trick to improve the dependence for each of them in sequence. We don’t give a proof of this in this paper.
3 Distributions with Bounded Fourth Moments
In this section, we will prove some some useful properties that distributions with bounded fourth moments satisfy. We will assume that for a distribution with mean that has bounded fourth moments, i.e., for every unit vector
Let be a random variable with and
for some . Let and be any event with probability Then
The fourth moment of such an is minimum when its support is just the two-point set . Therefore,
Let be a random variable with and and let
for some . Then, for every event that occurs with probability at least , we have
where is the indicator function of the event As an immediate corollary, for we get the following bound on the conditional probability
Let be the probability measure. We can write in the following way
Using for any random variable and we have
Therefore, for we get that
As an immediate corollary of Lemma 3.11 and Lemma 3.12, we get for a random variable having bounded fourth moments
Let be an event that happens with probability . Then,
where is the conditional covariance matrix
Let be any unit vector. Let be the random variable that is for . Let , , and . Then
Finally, by a standard argument as in the proof of Chebyshev’s inequality, we have
For every unit vector , we have
where is the standard deviation of along the direction ,
4 Proof of Theorem 1.3:
First we will consider the case when is a random variable with mean and variance satisfying
In this case, median need not be a good estimator. Instead, we will consider the interval of minimum length that contains fraction of the sample points. Let be the given sample, and let be the points lying in this interval. Let be our estimator. We will show below that
By the concentration inequality stated in Lemma 3.14, we get that for the distribution, the length of the interval around consisting of probability mass is bounded by
The length of the smallest interval that contains fraction of is at most the length of the smallest interval that contains fraction of . This latter quantity is bounded by , since the interval contains with probability a fraction of .
This implies that when we look at the minimum interval containing fraction of the non-noise points, the extreme points of the interval can be at most at a distance from . Thus, the distance of all noise points will be within . Furthermore, the interval of minimum length with fraction of will contain at least fraction of . Therefore, by Lemma 3.11 the mean of will be within from the true mean.
4.2 Multi-dimensional Case
For any direction , let . From the previous section, we know that we can find a such that
Therefore, by picking orthogonal directions , we get
We will now bound the radius of the ball in the outlier removal step (Algorithm 2.2.1). We claim the radius of the ball is . Suppose we have some . Let . Using the orthogonal directions as picked above, let and let . Consider the following:
It suffices to bound the right-hand side of by , in which case the ball will contain fraction of the probability mass of . We have
due to the fourth moment condition and the fact that . Therefore, a ball of radius at most contains fraction of the points. Since we get that the radius of the ball computed in the outlier removal step is We have proved
After the outlier removal step, every remaining point satisfies
Consider the covariance matrix of (recall that is the sample after outlier removal). Let be the set of points in that were sampled from the distribution and be the points sampled by the adversary. Let , and Note that
where is the fraction of noise points after the outlier truncation step. Note that We will therefore pretend that the fraction of noise points is still after the outlier truncation step. We again assume that the mean of the distribution is By Lemma 3.11 applied with for and where is the event that is not removed by outlier removal, we have that
Suppose, after the outlier removal step, we had the guarantee that the covariance matrix of the remaining points from the distribution , say is between
in the Lowener ordering. Corollary 3.13 gives and . Also, by Lemma 3.1 and Lemma 3.16 we have that if , then
We will use the notation as defined above.
Let be the bottom principal components of the covariance matrix . For some constant , we have
where
By an inductive application of the above lemma, we can prove
On input , AgnosticMean outputs satisfying
Theorem 3.18 with Corollary 3.13 proves Theorem 1.3.
Proof of Lemma 3.17: Recall that denotes the covariance matrix of the points from . We have
where Therefore, we have
By Lemma 3.16 each satisfies so we have
For a symmetric matrix , let denote the ’th largest eigenvalue. By Weyl’s inequality, we have that
By Equation (14), there exists a constant such that
Recall that is the space spanned by the bottom eigenvectors of , and is the matrix corresponding to the projection operator on to . We therefore have
Multiplying by the vector and its transpose on either side, we get
where . We therefore have
Proof of Theorem 3.18: By Equation 13, it is enough to bound The proof is by induction on the dimension. If , then the conclusion follows from the guarantees for the one dimensional case proven in Section 3.4.1. Now, assume that it holds for all for some . Let We have by Lemma 3.17
Recall that we defined to be the span of the top principal components of . By induction hypothesis, since , we have
Covariance Estimation
Let be a distribution with mean and covariance If , then there is an algorithm that takes as input samples and computes in polynomial time such that
If has bounded fourth moments with constant , and has bounded fourth moments with constant . Then there is an algorithm that takes as input and samples and computes in polynomial time such that
When is a distribution that has bounded eighth moments, the result follows from the 1d mean estimation in Section 3.4 applied . Note that and
From Section 3.4, we therefore have that if , there is a algorithm with
2 Multi-Dimensional Case: Theorem 1.5
In this section we will prove that CovarianceEstimation (Algorithm 2.2.3) gives Theorem 1.5. Throughout this section, we will assume that is a distribution with mean and covariance and has bounded fourth moments with parameter . We use the following symmetrization trick to assume that has mean . Given samples , let
Since fraction of the original samples were corrupted on average, only fraction of the new samples will be corrupted on average. Moreover, if are independent random variables, then we can show that the distribution of has bounded fourth moments with parameter . We will denote by the distribution of . CovarianceEstimation is just the mean estimation algorithm on , we can appeal to Theorem 1.3. Furthermore, let be an affine transformation of a -wise independent distribution.
where is covariance matrix of
We will now derive a bound for when the distribution has bounded fourth moments and is -wise independent. In particular, we will prove
If is the covariance matrix of , it holds that
Proof of Proposition 4.2: Note that .
As in Section 4.2, we assume that the true distribution has mean .
In this section, we will prove AgnosticOperatorNorm (Algorithm 2.2.3) gives Theorem 1.6. Let be the given sample, where consists of points from some distribution with mean and covariance and consists of points picked by the adversary. Let be the sample covariance of . We assume that has 1D concentration, i.e., there exists a constant such that for every unit vector
Let be the remaining sample at the end of the algorithm and let be points in sampled from .
First, we will argue that the covariance of the true distribution is well-approximated by .
With probability ,
First, note that the computed in SafeOutlierTruncation is at most because by an analogous argument as in Section 4.1, we have (namely that the estimated variance in a direction is close to the true variance in that direction). Then the ball in SafeOutlierTruncation has radius for some constant . We have that in any direction , the probability that deviates from the mean by more than is . Then if we take orthogonal directions, the probability that any given point is more than distance from is still . Thus, step (1) of the algorithm will remove only fraction of the points sampled from .
In every direction , the probability mass of points from outside an interval of size around the mean is at most , where is the variance in the direction . Let be the region between the two hyperplanes used for truncation in iteration . Therefore, if the number of iterations is , we will have that
Note that d concentration implies that the distribution has bounded ’th moment for all finite By Lemma 3.12, we have that the covariance matrix of is close to that of :
Finally, to relate to , we use Proposition 3.2. The concept class we use is all degree two polynomials restricted to convex polytopes with at most facets, defined by the hyperplanes used for truncation at each iteration of the algorithm. The VC dimension of this concept class is . Therefore, by Proposition 3.2 applied with , we get that if we take then
Combining equations 16 and 17 we get the desired result. ∎
First, note that since only an fraction of is noise, we have
Therefore, we have that Lemma 5.2 gives the desired lower bound. For the upper bound, let be the top eigenvector of When the algorithm terminates, we have
where the second line follows because of the termination condition and because we can estimate the variance of in any direction to within a factor. ∎
2 Termination
In this section, we will show that with high probability, Algorithm 2.2.3 terminates in a polynomial number of steps provided that for some constant that depends only on the estimation in Step (5).
Every time the algorithm goes through another iteration, it must remove a certain number of noise points. Suppose in step (7), we remove noise points. The noise configuration of maximum variance puts amount of noise at the outlier removal distance , and amount of noise at the truncation threshold distance . We can then write an upper bound on .
Let us simplify the numerator . Since we are truncating the sample, we have . Here we also assume that for a sufficiently large so that is less than some constant.
Recall that by (18). Then as long as is a sufficiently large constant, we have
Then combining with the denominator from earlier and using the fact that , we get:
Then , so the algorithm will terminate in a nearly linear number of iterations. ∎
Open Questions
An immediate open question is whether the our analysis of the mean estimation algorithm is tight and the is avoidable. For special distributions including Gaussians, [DKK+16] give an algorithm with higher sample complexity and error rather than or as in Theorem 1.1. An open question is to give an approximation. For the more general distributions considered here, the dependence on must grow as at least ; it is open to find an algorithm that achieves error (our guarantee for the general setting has error ). Other open problems include agnostic learning of a mixture of two arbitrary Gaussians and agnostic sparse recovery.
Acknowledgment
We thank Chao Gao and Roman Vershynin for helpful discussions. We would also like to thank the anonymous reviewers for useful suggestions. This research was supported in part by NSF awards CCF-1217793 and EAGER-1555447.