Smoothed Analysis of Tensor Decompositions
Aditya Bhaskara, Moses Charikar, Ankur Moitra, Aravindan Vijayaraghavan
Introduction
Tensor decompositions play a central role in modern statistics (see e.g. ). To illustrate their usefulness, suppose we are given a matrix When can we uniquely recover the factors and of this decomposition given access to ? In fact, this decomposition is almost never unique (unless we require that the factors and are orthonormal, or that has rank one). But given a tensor there are general conditions under which , and are uniquely determined (up to scaling) given ; perhaps the most famous such condition is due to Kruskal , which we review in the next section.
Tensor methods are commonly used to establish that the parameters of a generative model can be identified given third (or higher) order moments. In contrast, given just second-order moments (e.g. ) we can only hope to recover the factors up to a rotation. This is called the rotation problem and has been an important issue in statistics since the pioneering work of psychologist Charles Spearman (1904) . Tensors offer a path around this obstacle precisely because their decompositions are often unique, and consequently have found applications in phylogenetic reconstruction , , hidden markov models , mixture models , topic modeling , community detection , etc.
However most tensor problems are hard: computing the rank , the best rank one approximation and the spectral norm are all -hard. Also many of the familiar properties of matrices do not generalize to tensors. For example, subtracting the best rank one approximation to a tensor can actually increase its rank and there are rank three tensors that can be approximated arbitrarily well by a sequence of rank two tensors. One of the few algorithmic results for tensors is an algorithm for computing tensor decompositions in a restricted case. Let and be matrices whose columns are , and respectively.
, If and no pair of columns in are multiples of each other, then there is a polynomial time algorithm to compute the minimum rank tensor decomposition of . Moreover the rank one terms in this decomposition are unique (among all decompositions with the same rank).
If is an tensor, then can be at most in order for the conditions of the theorem to be met. This basic algorithm has been used to design efficient algorithms for phylogenetic reconstruction , , topic modeling , community detection and learning hidden markov models and mixtures of spherical Gaussians . However algorithms that make use of tensor decompositions have traditionally been limited to the full-rank case, and our goal is to develop stable algorithms that work for . Recently Goyal et al gave a robustness analysis for this decomposition, and we give an alternative proof in Appendix A.
In fact, this basic tensor decomposition can be bootstrapped to work even when is larger than (if we also increase the order of the tensor). The key parameter that dictates when one can efficiently find a tensor decomposition (or more generally, when it is unique) is the Kruskal rank:
The Kruskal rank (or Krank) of a matrix is the largest for which every set of columns are linearly independent. Also the -robust k-rank is denoted by , and is the largest for which every sub-matrix of has .
Hence we get an order three tensor of size . Alternatively we can define this “flattening” using the following operation:
The Khatri-Rao product of and which are size and respectively is an matrix whose column is .
Our new order three tensor can be written as:
The factors are the columns of , the columns of and the columns of . The crucial point is that the Kruskal rank of the columns of is in fact at least the sum of the Kruskal rank of the columns of and (and similarly for ) , , but this is tight in the worst-case. Consequently this “flattening” operation allows us use the above algorithm unto ; since the rank () is larger than the largest dimension (), this is called the overcomplete case.
Our main technical result is that in a natural smoothed analysis model, the Kruskal rank robustly multiplies and this allows us to give algorithms for computing a tensor decomposition even in the highly overcomplete case, for any (provided that the order of the tensor is large - but still a constant). Moreover our algorithms have immediate applications in learning mixtures of Gaussians and multiview mixture models.
2 Our Results
We introduce the following framework for studying tensor decomposition problems:
In applications in learning, tensors are used to encode low-order moments of the distribution. In particular, each factor in the decomposition represents a “component”. The intuition is that if these “components” are not chosen in a worst-case configuration, then we can obtain vastly improved learning algorithms in various settings. For example, as a direct consequence of our main result, we will give new algorithms for learning mixtures of spherical Gaussians again in the framework of smoothed analysis (without any additional separation conditions). There are no known polynomial time algorithms to learn such mixures if the number of components () is larger than the dimension (). But if their means are perturbed, we give a polynomial time algorithm for any by virtue of our tensor decomposition algorithm.
Our main technical result is the following:
We obtain the following main theorem from the above result and from analyzing the stability of the algorithm of Leurgans et al (see Theorem 2.3):
As we discussed, tensor methods have had numerous applications in learning. However algorithms that make use of tensor decompositions have traditionally been limited to the full-rank case, and hence can only handle cases when the number of “components” is at most the dimension. However by using our main theorem above, we can get new algorithms for some of these problems that work even if there are many more “components” than dimensions.
Mixtures of Axis-Aligned Gaussians (Section 5)
Here we are given samples from a distribution where is a Gaussian with mean and covariance and each is diagonal. These mixtures are ubiquitous throughout machine learning. Feldman et al gave an algorithm for PAC-learning mixtures of axis aligned Gaussians, however the running time is exponential in , the number of components. Hsu and Kakade gave a polynomial time algorithm for learning mixtures of spherical Gaussians provided that their means are full rank (hence ). Again, we turn to the framework of smoothed analysis and suppose that the means are -perturbed. In this framework, we can give a polynomial time algorithm for learning mixtures of axis-aligned Gaussians for any . Suppose that the means of a mixture of axis-aligned Gaussians and suppose the means have been -perturbed to obtain . Then
We believe that our new algorithms for overcomplete tensor decomposition will have further applications in learning. Additionally this framework of studying distribution learning when the parameters of the distribution we would like to learn are not chosen adversarially, seems quite appealing.
Alternatively, algorithms for overcomplete tensor decomposition that assume we know exactly would not have any applications in learning because we would need to take too many samples to have a good enough estimate of (i.e. the low-order moments of the distribution).
3 Our Approach
then we would like to prove that with high probability has a non-negligible projection on the orthogonal complement of . This is the core of our approach. Set to be the orthogonal complement of . In fact, we prove that for any dimension at least subspace , with high probability has a non-negligible projection onto .
Prior Algorithms
Here we review the algorithm of Leurgans et al . It has been discovered many times in different settings. It is sometimes referred to as “simultaneous diagonalization” or as Chang’s lemma .
Suppose we are given a third-order tensor which is . Let and be matrices whose columns are , and respectively. Suppose further that (1) and (2) . Then we can efficiently recover the factors of .
We present the algorithm Decompose and its analysis assuming . Any instance with can be reduced to this case as follows: find the span of the vectors , where is the dimensional vector whose th entry is . This span must be precisely the span of the columns of .It is easy to see that the span is contained in the span of the columns of . To see equality, we observe that if the span is dimensional, then projecting each of the s on to the span gives a different decomposition, and this contradicts Kruskal’s uniqueness theorem, which holds in this case. Thus we can pick some orthonormal basis for this span, and write as an tensor. We can perform this operation again (along the second mode) to move to an tensor.
, Given a tensor there exists an algorithm that runs in polynomial time and recovers the (unique) factors of provided that (1) and (2) .
Proof: The algorithm is to pre-process as above (i.e., obtain ), and then run Decompose stated below. Let us thus analyze Decompose with being .
We can write where and similarly where Moreover we can write and . So we conclude and diagonalize and respectively. Note that almost surely the diagonals entries of are distinct (Claim A.4). Hence the eigendecompositions of and are unique, and we can pair up columns in and columns in based on their eigenvalues (we pair up and if their eigenvalues are equal). We can then solve a linear system to find the remaining factors (columns in ) and since this is a valid decomposition, we can conclude that these are also the true factors of appealing to Kruskal’s uniqueness theorem .
In fact, this algorithm is also stable, as Goyal et al also recently showed. It is intuitive that if and are well-conditioned and each pair of columns in is well-conditioned then this algorithm can tolerate some inverse polynomial amount of noise. For completeness, we give a robustness analysis of Decompose in Appendix A.
The condition numbers ,
The column vectors of are not close to parallel: for all , ,
The decompositions are bounded : for all , .
As before, the algorithm is to preprocess so as to obtain , and then run Decompose. The preprocessing step is slightly different because of the presence of error – instead of considering the span of the as above, we need to look at the span of the top singular vectors of the matrix whose columns are . If is small enough (in terms of ), the span of these top singular vectors suffices to obtain an approximation to the vectors (see Appendix A).
Note that the algorithm is limited by the condition that since this requires that . But as we have seen before, by “flattening” a higher order tensor, we can handle overcomplete tensors. The following is an immediately corollary of Theorem 2.3:
then there exists an efficient algorithm that computes each rank one term in this decomposition up to an additive error of .
\text{Krank}(U\odot V)\geq\min\big{(}\text{Krank}(U)+\text{Krank}(V)-1,R\big{)}
However in learning applications we are not given exactly but rather an approximation to it. Our goal is to show that the Kruskal rank robustly multiplies typically, so that these types of tensor algorithms will not only work in the exact case, but are also necessarily stable when we are given with some noise. In the next section, we show that in the smoothed analysis model, the robust Kruskal rank multiplies on taking Khatri-Rao products. This then establishes our main result Theorem 1.5, assuming Theorem 3.3 which we prove in the next section.
The Khatri-Rao Product Robustly Multiplies
Suppose are matrices and let be -perturbations of respectively. Then for any constant , and , the Khatri-Rao product satisfies with probability at least .
The natural generalization where the vectors and are in different dimensional spaces also holds. We omit the details here.
How can we lower bound the smallest singular value of ? We define a quantity which is can be used as a proxy for the least singular value and is simpler to analyze.
For any matrix with columns , the leave-one-out distance is
The leave-one-out distance is a good proxy for the least singular value, if we are not particular about losing multiplicative factors that are polynomial in size of the matrix.
Apply a union bound over all the choices for .
We now state the main technical theorem about projections of perturbed product vectors onto arbitrary subspaces of large dimension.
Now consider some matrices , and suppose we knew that has singular values of magnitude . Then, an -perturbed vector has at least of its norm in the space spanned by the corresponding right singular vectors, with probability (Fact 3.26). Thus we get
So the key is to prove that the matrix has a large number of “non-negligible” singular values with high probability (over the perturbation in ). For this, let us examine the entries of . For a moment suppose that is a gaussian random vector (instead of a perturbation). Then the th entry of is precisely , which is distributed like a one dimensional gaussian of variance . If the entries for different were independent, standard results from random matrix theory would imply that has many non-negligible singular values.
However, this could be far from the truth. Consider, for instance, two vectors and that are parallel. Then their dot products with are highly correlated. However we note, that as long as has a reasonable component orthogonal to , the distribution of the and th entries are “somewhat” independent. We will prove that we can roughly achieve such a situation. This motivates the following definition.
[Ordered -orthogonality] A sequence of vectors has the ordered -orthogonality property if for all , has a component of length orthogonal to .
Now we define a similar notion for a sequence of matrices , which says that a large enough subset of columns should have a certain -orthogonality property. More formally,
A set of matrices form an ordered -orthogonal system if there exists a permutation on such that the first columns satisfy the followng property: for and every , the th column of has a projection of length orthogonal to the span of all the vectors given by the columns of all the matrices other than itself (i.e. the th column of ).
The following lemma shows the use of an ordered orthogonal system: a matrix constructed as above starting with these has many non-negligible singular values with high probability.
for all .
form an ordered orthogonal system.
In particular, when , they form an ordered orthogonal system.
We remark that while is often a constant in our applications, does not have to be. We will use this in the proof that follows, in which we use these above two lemmas regarding construction and use of an ordered -orthogonal system to prove Proposition 3.12.
Now, applying Lemma 3.15, we have that the matrix , defined as before, (given by linear combinations along ) , has w.p .
Applying Fact 3.26 along with a simple averaging argument, we have that for one of the terms , we have with probability as required.
Please refer to Appendix B.2 for the complete details. ∎
2 Constructing the (θ,δ)𝜃𝛿(\theta,\delta)-Orthogonal System (Proof of Lemma 3.16)
This lemma will also be used in the first step of the proof of Theorem 3.6 to identify a good block of co-ordinates which span a large projection of a given subspace .
The above lemma is easy to prove if the dimension of the column projections used is the usual dimension of a vector space. However, with robust dimension, to carefully avoid spurious or insignificant directions, we identify the robust dimension with the number of large singular values of a certain matrix.
In each stage of this construction we will be looking at subspaces of which are obtained by zero-ing out all the columns (i.e. all the co-ordinates ), that we have fixed so far.
The extension for is the vector obtained by padding with zeros in the coordinates (columns given by ).
The following lemma shows that their dimension remains large as long as is not too large:
Proof of Lemma 3.21: Consider a constraint matrix of size which describes . is described by the constraint matrix of size obtained by removing the columns of corresponding to . Hence we get a subspace of dimension at least .
We now describe the construction more formally.
The Iterative Construction of ordered -orthogonal matrices.
Initially set and for all , and .
Pick such that . If no such exists, report FAIL.
Set for all , the new , where is the matrix padded with zeros in the columns corresponding to . Set .
Let for convenience. We first show that the above process for constructing completes successfully without reporting FAIL.
For such that and , the above process does not FAIL.
Proof: In each stage, we add one column index to . Hence, at all times .
We first show that Step 1 of each iteration does not FAIL. From Lemma 3.21, we have . Let . Now, applying Lemma 3.18 to , we see that there exists such that , as required. Hence, Step 1 does not fail.
shows that there exist with lengths at most such that their th columns are orthonormal. However, we additionally need to impose that the th columns to also be orthogonal to the columns . Fortunately, the number of such orthogonality constraints is at most . Hence, we can pick the orthonormal th columns and their respective extensions , by taking linear combinations of . Since the linear combinations result again in unit vectors in the th column, the length of , as required. Hence, Step 2 does not FAIL as well.
We now show that since the process completes, then have the required ordered -orthogonal property for . We first check that belong to . This is true because in each stage, , and hence for . Further, since we run for stages, and each of the are bounded in length by , . Our final matrices will be scaled to . The columns that satisfy the ordered -orthogonality property are those of , in the order they were chosen (we set this order to be , and select an arbitrary order for the rest).
3 (θ,δ)𝜃𝛿(\theta,\delta)-Orthogonality and ρ𝜌\rho-Perturbed Combinations (Proof of Lemma 3.15)
Suppose be a -orthogonal set of matrices (dimensions ). Without loss of generality, suppose that the permutation in the definition of orthogonality is the identity, and let be the first columns.
Now let us consider an -perturbed vector , and consider the matrix defined in the statement – it has dimensions , and the th entry is , which is distributed as a translated gaussian. Now for any column , the th column in has every entry having an ‘component’ independent of entries in the previous columns, and entries above. This implies that for a unit gaussian vector , we have (by anti-concentration and -orthogonality that
Furthermore, the above inequality holds, even conditioned on the first columns of .
Let be defined above, and fix some . Then for , we have
for any given .
Proof: Let . Then we have
Let us denote the latter vector by for now, so we are interested in . We show that has a non-negligible component orthogonal to the span of . Let be the matrix which projects orthogonal to the span of for all . Thus any vector is also orthogonal to the span of for .
Now by hypothesis, every vector has length . Thus the vector has length with probability (Lemma B.2).
Thus if we consider the distribution of , it is a one-dimensional gaussian with mean and variance . From basic anti-concentration properties of a gaussian (that the mass in any interval is at most ), the conclusion follows.
We can now do this for all , and conclude that the probability that Eq. (7) holds for all is at most .
Now what does this imply about the singular values of ? Suppose it has (which is ) non-negligible singular values, then a gaussian random vector , with probability at least , has a negligible component along all the corresponding singular vectors, and thus the length of is negligible with at least this probability!
Let be a matrix with spectral norm . Suppose has at most singular values of magnitude . Then for , we have
Proof: Let be the singular vectors corresponding to value . Consider the event that has a projection of length onto . This has probability , by anti-concentration properties of the Gaussian (and because is rotationally invariant). For any such , we have
This contradicts the earlier anti-concentration bound, and so we conclude that the matrix has at least non-negligible singular values, as required.
4 Higher Order Products
To handle this issue, we will first restrict our attention to a smaller block of co-ordinates of size (with ) , that has reasonable size in all the three modes (). Additionally, we want ’s projection onto this block spans a large subspace of (robust) dimension at least (using Lemma 3.18).
We now state the main inductive claim. The claim assumes a block of co-ordinates of reasonable size in each mode that span many directions in , and then establishes the anti-concentration bound inductively.
Before we give a proof of the main inductive claim, we first present a standard fact that relates the singular value of matrices and some anti-concentration properties of randomly perturbed vectors. This will also establish the base case of our main inductive claim.
Learning Multi-view Mixture Models
In this section, we will assume that each of the components in the mixture is a discrete distribution with support of size . We first introduce some notation, along the lines of .
The mixture component () is first picked with probability
However, in many practical settings like speech recognition and image classification, the dimension of the feature space is typically much smaller than the number of components or clusters i.e. . To the best of our knowledge, there was no efficient algorithm for learning multi-view mixture models in such over-complete settings. We now show how Theorem 1.5 gives a polynomial time algorithm to learn multi-view mixture models in a smoothed sense, even in the over-complete setting .
The conditional independence property is very useful in obtaining a higher order tensor, in terms of the hidden parameter vectors that we need to recover. This allows us to use our results on tensor decompositions from previous sections.
Our algorithm to learn multi-view models consists of three steps:
Apply Theorem 1.5 to and recover the parameters upto scaling.
Learning Mixtures of Axis-Aligned Gaussians
Let be a mixture of axis-aligned Gaussians in dimensions, and suppose further that the means of the components are perturbed by Gaussian noise of magnitude . We restrict to Gaussian noise not because our results change, but for notational convenience.
The mixture is described by a set of mixing weights , means and covariance matrices . Since the mixture is axis-aligned, each covariance is diagonal and we will denote the diagonal of as . Our main result in this section is the following:
Next we outline the main steps in our learning algorithm:
We then set up a system of linear equations and solve for .
1 Step 222: Recovering the Means and Mixing Weights
Now we show how to recover the means and weights .
Let be any dimensional vector. Then a coordinate-wise -perturbation of has length w.p. .
The proof is by a basic anti-concentration along with the observation that coordinates are independently perturbed and hence the failure probability multiplies.
Let us now define the partition . Suppose we divide and into two roughly equal parts each, and call the parts and (respectively). Now consider a partition with and , and for . Consider the solution we obtain using the decomposition algorithm, and look at the vectors . For the sake of exposition, suppose we did not have any error in computing the decomposition. We can scale such that the sub-vector corresponding to is precisely equal to that in . Now look at the remaining sub-vector of , and suppose it is times the “ portion” of . Then we must have .
To see this formally, let us fix some and write and to denote the sub-vectors of restricted to coordinates in and respectively. Write and to represent sub-vectors of restricted to and respectively. Then is (where denotes concatenation). So also is . Now we scaled such that the portion agrees with , thus we made equal to . Thus by the way is defined, we have , which is what we claimed.
We can now compute the entire vector up to scaling, since we know , , and so on. Thus it remains to find the mixture weights . Note that these are all non-negative. Now from the decomposition, note that for each , we can find the quantity
2 Step 333: Recovering the Variances
Now that we know the values of and all the means , we show how to recover the variances. This can be done in many ways, and we will outline one which ends up solving a linear system of equations. Recall that for each Gaussian, the covariance matrix is diagonal (denoted , with th entry equal to ).
This can be evaluated as before. Write to denote the portion of restricted to , and similarly to denote the portion of the noise vector . This gives
Now using this with being the flattened allows us to recover the values of for . From this, since we know the values of and for each , we can recover the values for all . As mentioned before, we can repeat this process for other dimensions and recover for all .
Acknowledgements
We thank Ryan O’Donnell for suggesting that we extend our techniques for learning mixtures of spherical Gaussians to the more general problem of learning axis-aligned Gaussians.
References
Appendix A Stability of the recovery algorithm
In this section we prove Theorem 2.3, which shows that the algorithm from section 2 is actually robust to errors, under Condition 2.2. This consists of two parts: first proving that the preprocessing step indeed allows us to recover (approximately), and second, that Decompose is robust to noise.
Stability of Decompose
Next, we establish that Decompose is stable (in what follows, we have ). Intuitively, Decompose is stable provided that the matrices and are well-conditioned and the eigenvalues of the matrices that we need to diagonalize are separated.
The main step in Decompose is an eigendecomposition, so first we will establish perturbation bounds. The standard perturbation bounds are known as sin theorems following Davis-Kahan and Wedin. However these bounds hold most generally for the singular value decomposition of an arbitrary (not necessarily symmetric) matrix. We require perturbation bounds for eigen-decompositions of general matrices. There are known bounds due to Eisenstat and Ipsen, however the notion of separation required there is difficult to work with and for our purposes it is easier to prove a direct bound in our setting.
Suppose and and and are matrices. In order to relate the eigendecompositions of and respectively, we will first need to establish that the eigenvalues of are all distinct. We thank Santosh Vempala for pointing out an error in an earlier version. We incorrectly used the Bauer-Fike Theorem to show that is diagonalizable, but this theorem only shows that each eigenvalue of is close to some eigenvalue of , but does not show that there is a one-to-one mapping. Fortunately there is a fix for this that works under the same conditions (but again see for an earlier, alternative proof that uses a “homotopy argument”).
Let .
Our first goal is to prove that is diagonalizable, and we will do this by establishing that its eigenvalues are distinct if the error matrices and are not too large. Consider
where . We can bound each entry in by . Hence if and are not too large, the eigenvalues of are close to the eigenvalues of using Gershgorin’s disk theorem, and the eigenvalues of are the same as the eigenvalues of since these matrices are similar. So we conclude:
If then the eigenvalues of are distinct and it is diagonalizable.
Next we prove that the eigenvectors of are also close to those of (this step will rely on being diagonalizable). This technique is standard in numerical analysis, but it will be more convenient for us to work with relative perturbations (i.e. ) so we include the proof of such a bound for completeness
Consider a right eigenvector of with eigenvalue . We will assume that the conditions of the above corollary are met, so that there is a unique eigenvector of with eigenvalue which it is paired with. Then since the eigenvectors of are full rank, we can write . Then
Now we can left multiply by the row of ; call this vector . Since , we have that . Hence
where we have used the condition that to lower bound the denominator. Furthermore: since is a unit vector.
If , then
Now we are ready to analyze the stability of Decompose: Let be an tensor that satisfies Condition 2.2. In our settings of interest we are not given exactly but rather a good approximation to it, and here let us model this noise as an additive error that is itself an tensor.
With high probability, .
Proof: Fix some . The th entry of is precisely . Note that is , thus the denominators are all at least in magnitude with probability .
Now given for which this happens, we have where have magnitude . Because has at least a component orthogonal to , anti-concentration of Gaussians implies that the difference above is at least with probability at least . Thus we can take a union bound over all pairs.
We will make crucial use of the following matrix identity:
Let and . Then using the above identity we have:
where and
and
as desired. The second bound is obvious.
We can now use Theorem A.3 to bound the error in recovering the factors and by setting e.g. . Additionally, the following claim establishes that the linear system used to solve for is well-conditioned and hence we can also bound the error in recovering .
These bounds establish what we qualitatively asserted: Decompose is stable provided that the matrices and are well-conditioned and the eigenvalues of the matrices that we need to diagonalize are separated.
Appendix B K-rank of the Khatri-Rao product
Recall: we defined the leave-one-out distance in Section 3. Here we establish that is indeed equivalent to the smallest singular value, up to polynomial factors. In our main proof, this quantity will be much easer to work with since it allows us to translate questions about a set of vectors being well-conditioned to reasoning about projection of each vector onto the orthogonal complement of the others.
Proof of Lemma 3.5: Using the variational characterization for singular values: . Then let . Clearly since . Then . Hence
Conversely, let . Then there are coefficients (with ) such that
Clearly since . And we conclude that
B.2 Proof of Proposition 3.12
We now give the complete details of the proof of Proposition 3.12, that shows how the Kruskal rank multiplies in the smoothed setting for two-wise products. The proof follows by just combining Lemma 3.16 and Lemma 3.15.
we obtain matrices having the -orthogonality property. Note that in this setting, .
Thus by applying Lemma 3.15, we have that the matrix , defined as before, satisfies
Since has many non-negligible singular values (Eq.(12)), we have (by Fact 3.26 for details) that an -perturbed vector has a non-negligible norm when multiplied by . More precisely, . Thus for one of the terms , we have with probability .
Now this almost completes the proof, but recall that our aim is to argue about , where is the given matrix. is a vector in the span of the top (right) singular vectors of , and , thus we can write as a combination of the rows of , with each weight in the combination being (Lemma B.1). This implies that for at least one row of the matrix , we must have
(Otherwise we have a contradiction). This completes the proof.
Before we give the complete proofs of the two main lemmas regarding ordered orthogonal systems (Lemma 3.16 and Lemma 3.15), we start with a simple lemma about top singular vectors of matrices, which is very useful to obtain linear combinations of small length.
B.3 Constructing the (θ,δ)𝜃𝛿(\theta,\delta)-Orthogonal System (Proof of Lemma 3.16)
We now give the complete proof of lemma 3.18 that shows that the average robust dimension of column projections is large if the dimension of is large .
Proof of Lemma 3.18: Let . Let be a matrix composed of a orthonormal basis (of vectors) for i.e. the column of is the basis vector () of . Clearly . For , let be the matrix obtained by projecting the columns of on just the rows given by . Hence, is obtained by just concatenating the columns as . Finally, let such that .
B.4 Implications of Ordered (θ,δ)𝜃𝛿(\theta,\delta)-Orthogonality: Details of Proof of Lemma 3.15
Here we show some auxiliary lemmas that are used in the Proof of Lemma B.4.
Suppose are a set of vectors in of length , having the -orthogonal property. Then we have
For , we have with probability ,
For , we have with probability .
Furthermore, part (a) holds even if is drawn from , for any fixed vector and .
Proof: First note that we must have , because otherwise cannot have the -orthogonal property for . For any , we claim that
To see this, write , where is orthogonal to the span of . Since , we have . Now given the vectors , the value is fixed, but is distributed as a Gaussian with variance (since is a Gaussian of unit variance in each direction).
Thus from a standard anti-concentration property for the one-dimensional Gaussian, cannot have a mass in any length interval, in particular, it cannot lie in with probability . This proves Eq. (13). Now since this is true for any conditioning and for all , it follows (see Lemma B.3 for a formal justification) that
This completes the proof of the claim, part (a). Note that even if we had replaced by throughout, the anti-concentration property still holds (we have a shifted one-dimensional Gaussian), thus the proof goes through verbatim.
Let us now prove part (b). First note that if we denote by the matrix whose columns are the , then part (a) deals with the distribution of , where . Part (b) deals with the distribution of , where . But since the eigenvalues of and are precisely the same, due to the rotational invariance of Gaussians, these two quantities are distributed exactly the same way. This completes the proof.
Suppose we have random variables and an event which is defined to occur if its argument lies in a certain interval (e.g. occurs iff ). Further, suppose we have , and for all . Then
Appendix C Applications to Mixture Models
Proof: We first bound the norm of the difference of tensors i.e. we show that
C.2 Error Analysis for Multi-view Models
This implies that as required.
Now, let us assume . This at once implies that . Also
Now, using (15), we see that .
C.3 Sampling Error Estimates for Gaussians
C.4 Recovering Weights in Gaussian Mixtures
Proof: From (C.4) and triangle inequality, we see that
Now, substituting the values for , we see that