Smoothed Analysis of Tensor Decompositions

Aditya Bhaskara, Moses Charikar, Ankur Moitra, Aravindan Vijayaraghavan

Introduction

Tensor decompositions play a central role in modern statistics (see e.g. ). To illustrate their usefulness, suppose we are given a matrix M=i=1RaibiM=\sum_{i=1}^{R}a_{i}\otimes b_{i} When can we uniquely recover the factors {ai}i\{a_{i}\}_{i} and {bi}i\{b_{i}\}_{i} of this decomposition given access to MM? In fact, this decomposition is almost never unique (unless we require that the factors {ai}i\{a_{i}\}_{i} and {bi}i\{b_{i}\}_{i} are orthonormal, or that MM has rank one). But given a tensor T=i=1RaibiciT=\sum_{i=1}^{R}a_{i}\otimes b_{i}\otimes c_{i} there are general conditions under which {ai}i\{a_{i}\}_{i}, {bi}i\{b_{i}\}_{i} and {ci}i\{c_{i}\}_{i} are uniquely determined (up to scaling) given TT; perhaps the most famous such condition is due to Kruskal , which we review in the next section.

Tensor methods are commonly used to establish that the parameters of a generative model can be identified given third (or higher) order moments. In contrast, given just second-order moments (e.g. MM) we can only hope to recover the factors up to a rotation. This is called the rotation problem and has been an important issue in statistics since the pioneering work of psychologist Charles Spearman (1904) . Tensors offer a path around this obstacle precisely because their decompositions are often unique, and consequently have found applications in phylogenetic reconstruction , , hidden markov models , mixture models , topic modeling , community detection , etc.

However most tensor problems are hard: computing the rank , the best rank one approximation and the spectral norm are all NPNP-hard. Also many of the familiar properties of matrices do not generalize to tensors. For example, subtracting the best rank one approximation to a tensor can actually increase its rank and there are rank three tensors that can be approximated arbitrarily well by a sequence of rank two tensors. One of the few algorithmic results for tensors is an algorithm for computing tensor decompositions in a restricted case. Let A,BA,B and CC be matrices whose columns are {ai}i\{a_{i}\}_{i}, {bi}i\{b_{i}\}_{i} and {ci}i\{c_{i}\}_{i} respectively.

, If \mboxrank(A)=\mboxrank(B)=R\mbox{rank}(A)=\mbox{rank}(B)=R and no pair of columns in CC are multiples of each other, then there is a polynomial time algorithm to compute the minimum rank tensor decomposition of TT. Moreover the rank one terms in this decomposition are unique (among all decompositions with the same rank).

If TT is an n×n×nn\times n\times n tensor, then RR can be at most nn in order for the conditions of the theorem to be met. This basic algorithm has been used to design efficient algorithms for phylogenetic reconstruction , , topic modeling , community detection and learning hidden markov models and mixtures of spherical Gaussians . However algorithms that make use of tensor decompositions have traditionally been limited to the full-rank case, and our goal is to develop stable algorithms that work for R=\mboxpoly(n)R=\mbox{poly}(n). Recently Goyal et al gave a robustness analysis for this decomposition, and we give an alternative proof in Appendix A.

In fact, this basic tensor decomposition can be bootstrapped to work even when RR is larger than nn (if we also increase the order of the tensor). The key parameter that dictates when one can efficiently find a tensor decomposition (or more generally, when it is unique) is the Kruskal rank:

The Kruskal rank (or Krank) of a matrix AA is the largest kk for which every set of kk columns are linearly independent. Also the τ\tau-robust k-rank is denoted by Krankτ(A)\text{Krank}_{\tau}(A), and is the largest kk for which every n×kn\times k sub-matrix ASA_{|S} of AA has σk(AS)1/τ\sigma_{k}(A_{|S})\geq 1/\tau.

Hence we get an order three tensor T^\widehat{T} of size n2×n2×nn^{2}\times n^{2}\times n. Alternatively we can define this “flattening” using the following operation:

The Khatri-Rao product of UU and VV which are size m×rm\times r and n×rn\times r respectively is an mn×rmn\times r matrix UVU\odot V whose ithi^{th} column is uiviu_{i}\otimes v_{i}.

Our new order three tensor T^\widehat{T} can be written as:

The factors are the columns of A(1)A(2)A^{(1)}\odot A^{(2)}, the columns of A(3)A(4)A^{(3)}\odot A^{(4)} and the columns of A(5)A^{(5)}. The crucial point is that the Kruskal rank of the columns of A(1)A(2)A^{(1)}\odot A^{(2)} is in fact at least the sum of the Kruskal rank of the columns of A(1)A^{(1)} and A(2)A^{(2)} (and similarly for A(3)A(4)A^{(3)}\odot A^{(4)}) , , but this is tight in the worst-case. Consequently this “flattening” operation allows us use the above algorithm unto R=2nR=2n; since the rank (RR) is larger than the largest dimension (nn), this is called the overcomplete case.

Our main technical result is that in a natural smoothed analysis model, the Kruskal rank robustly multiplies and this allows us to give algorithms for computing a tensor decomposition even in the highly overcomplete case, for any R=\mboxpoly(n)R=\mbox{poly}(n) (provided that the order of the tensor is large - but still a constant). Moreover our algorithms have immediate applications in learning mixtures of Gaussians and multiview mixture models.

2 Our Results

We introduce the following framework for studying tensor decomposition problems:

In applications in learning, tensors are used to encode low-order moments of the distribution. In particular, each factor in the decomposition represents a “component”. The intuition is that if these “components” are not chosen in a worst-case configuration, then we can obtain vastly improved learning algorithms in various settings. For example, as a direct consequence of our main result, we will give new algorithms for learning mixtures of spherical Gaussians again in the framework of smoothed analysis (without any additional separation conditions). There are no known polynomial time algorithms to learn such mixures if the number of components (kk) is larger than the dimension (nn). But if their means are perturbed, we give a polynomial time algorithm for any k=\mboxpoly(n)k=\mbox{poly}(n) by virtue of our tensor decomposition algorithm.

Our main technical result is the following:

We obtain the following main theorem from the above result and from analyzing the stability of the algorithm of Leurgans et al (see Theorem 2.3):

As we discussed, tensor methods have had numerous applications in learning. However algorithms that make use of tensor decompositions have traditionally been limited to the full-rank case, and hence can only handle cases when the number of “components” is at most the dimension. However by using our main theorem above, we can get new algorithms for some of these problems that work even if there are many more “components” than dimensions.

Mixtures of Axis-Aligned Gaussians (Section 5)

Here we are given samples from a distribution F=i=1kwiFi(μi,Σi)F=\sum_{i=1}^{k}w_{i}F_{i}(\mu_{i},\Sigma_{i}) where Fi(μi,Σi)F_{i}(\mu_{i},\Sigma_{i}) is a Gaussian with mean μi\mu_{i} and covariance Σi\Sigma_{i} and each Σi\Sigma_{i} is diagonal. These mixtures are ubiquitous throughout machine learning. Feldman et al gave an algorithm for PAC-learning mixtures of axis aligned Gaussians, however the running time is exponential in kk, the number of components. Hsu and Kakade gave a polynomial time algorithm for learning mixtures of spherical Gaussians provided that their means are full rank (hence knk\leq n). Again, we turn to the framework of smoothed analysis and suppose that the means are ρ\rho-perturbed. In this framework, we can give a polynomial time algorithm for learning mixtures of axis-aligned Gaussians for any k=\mboxpoly(n)k=\mbox{poly}(n). Suppose that the means of a mixture of axis-aligned Gaussians and suppose the means have been ρ\rho-perturbed to obtain μ~i\widetilde{\mu}_{i}. Then

We believe that our new algorithms for overcomplete tensor decomposition will have further applications in learning. Additionally this framework of studying distribution learning when the parameters of the distribution we would like to learn are not chosen adversarially, seems quite appealing.

Alternatively, algorithms for overcomplete tensor decomposition that assume we know TT exactly would not have any applications in learning because we would need to take too many samples to have a good enough estimate of TT (i.e. the low-order moments of the distribution).

3 Our Approach

then we would like to prove that with high probability xyx\otimes y has a non-negligible projection on the orthogonal complement of U\mathcal{U}. This is the core of our approach. Set V\mathcal{V} to be the orthogonal complement of U\mathcal{U}. In fact, we prove that for any dimension at least n22\frac{n^{2}}{2} subspace V\mathcal{V}, with high probability xyx\otimes y has a non-negligible projection onto V\mathcal{V}.

Prior Algorithms

Here we review the algorithm of Leurgans et al . It has been discovered many times in different settings. It is sometimes referred to as “simultaneous diagonalization” or as Chang’s lemma .

Suppose we are given a third-order tensor T=i=1RuiviwiT=\sum_{i=1}^{R}u_{i}\otimes v_{i}\otimes w_{i} which is n×m×pn\times m\times p. Let U,VU,V and WW be matrices whose columns are uiu_{i}, viv_{i} and wiw_{i} respectively. Suppose further that (1) rank(U)=rank(V)=Rrank(U)=rank(V)=R and (2) \mboxkrank(W)2\mbox{k-rank}(W)\geq 2. Then we can efficiently recover the factors of TT.

We present the algorithm Decompose and its analysis assuming n=m=Rn=m=R. Any instance with rank(U)=rank(V)=Rrank(U)=rank(V)=R can be reduced to this case as follows: find the span of the vectors {u~j,k}\{\widetilde{u}_{j,k}\}, where u~j,k\widetilde{u}_{j,k} is the nn dimensional vector whose iith entry is TijkT_{ijk}. This span must be precisely the span of the columns of UU.It is easy to see that the span is contained in the span of the columns of UU. To see equality, we observe that if the span is R1R-1 dimensional, then projecting each of the uiu_{i}s on to the span gives a different decomposition, and this contradicts Kruskal’s uniqueness theorem, which holds in this case. Thus we can pick some orthonormal basis for this span, and write TT as an R×m×pR\times m\times p tensor. We can perform this operation again (along the second mode) to move to an R×R×pR\times R\times p tensor.

, Given a tensor TT there exists an algorithm that runs in polynomial time and recovers the (unique) factors of TT provided that (1) rank(U)=rank(V)=Rrank(U)=rank(V)=R and (2) \mboxkrank(W)2\mbox{k-rank}(W)\geq 2.

Proof: The algorithm is to pre-process as above (i.e., obtain m=n=Rm=n=R), and then run Decompose stated below. Let us thus analyze Decompose with m,nm,n being RR.

We can write Ta=UDaVTT_{a}=UD_{a}V^{T} where Da=\mboxdiag(aTw1,aTw2,...,aTwn)D_{a}=\mbox{diag}(a^{T}w_{1},a^{T}w_{2},...,a^{T}w_{n}) and similarly Tb=UDbVTT_{b}=UD_{b}V^{T} where Db=\mboxdiag(bTw1,bTw2,...,bTwn).D_{b}=\mbox{diag}(b^{T}w_{1},b^{T}w_{2},...,b^{T}w_{n}). Moreover we can write Ta(Tb)1=UDaDb1U1T_{a}(T_{b})^{-1}=UD_{a}D_{b}^{-1}U^{-1} and TaT(TbT)1=VDaDb1V1T_{a}^{T}(T_{b}^{T})^{-1}=VD_{a}D_{b}^{-1}V^{-1}. So we conclude UU and VV diagonalize Ta(Tb)1T_{a}(T_{b})^{-1} and ((Tb)1Ta)T((T_{b})^{-1}T_{a})^{T} respectively. Note that almost surely the diagonals entries of DaDb1D_{a}D_{b}^{-1} are distinct (Claim A.4). Hence the eigendecompositions of Ta(Tb)1T_{a}(T_{b})^{-1} and (Tb)1(Ta)(T_{b})^{-1}(T_{a}) are unique, and we can pair up columns in UU and columns in VV based on their eigenvalues (we pair up uu and vv if their eigenvalues are equal). We can then solve a linear system to find the remaining factors (columns in WW) and since this is a valid decomposition, we can conclude that these are also the true factors of TT appealing to Kruskal’s uniqueness theorem . \blacksquare

In fact, this algorithm is also stable, as Goyal et al also recently showed. It is intuitive that if UU and VV are well-conditioned and each pair of columns in WW is well-conditioned then this algorithm can tolerate some inverse polynomial amount of noise. For completeness, we give a robustness analysis of Decompose in Appendix A.

The condition numbers κ(U),κ(V)κ\kappa(U),\kappa(V)\leq\kappa,

The column vectors of WW are not close to parallel: for all iji\neq j, wiwiwjwj2δ\|\frac{w_{i}}{\|w_{i}\|}-\frac{w_{j}}{\|w_{j}\|}\|_{2}\geq\delta ,

The decompositions are bounded : for all ii, ui2,vi2,wi2C\|u_{i}\|_{2},\|v_{i}\|_{2},\|w_{i}\|_{2}\leq C.

As before, the algorithm is to preprocess so as to obtain m=n=Rm=n=R, and then run Decompose. The preprocessing step is slightly different because of the presence of error – instead of considering the span of the {u~j,k}\{\widetilde{u}_{j,k}\} as above, we need to look at the span of the top RR singular vectors of the matrix whose columns are u~j,k\widetilde{u}_{j,k}. If EF\lVert E\rVert_{F} is small enough (in terms of κ,δ,n\kappa,\delta,n), the span of these top singular vectors suffices to obtain an approximation to the vectors uiu_{i} (see Appendix A).

Note that the algorithm is limited by the condition that rank(U)=rank(V)=Rrank(U)=rank(V)=R since this requires that Rmin(m,n)R\leq\min(m,n). But as we have seen before, by “flattening” a higher order tensor, we can handle overcomplete tensors. The following is an immediately corollary of Theorem 2.3:

then there exists an efficient algorithm that computes each rank one term in this decomposition up to an additive error of ϵ\epsilon.

\text{Krank}(U\odot V)\geq\min\big{(}\text{Krank}(U)+\text{Krank}(V)-1,R\big{)}

However in learning applications we are not given TT exactly but rather an approximation to it. Our goal is to show that the Kruskal rank robustly multiplies typically, so that these types of tensor algorithms will not only work in the exact case, but are also necessarily stable when we are given TT with some noise. In the next section, we show that in the smoothed analysis model, the robust Kruskal rank multiplies on taking Khatri-Rao products. This then establishes our main result Theorem 1.5, assuming Theorem 3.3 which we prove in the next section.

The Khatri-Rao Product Robustly Multiplies

Suppose U,VU,V are n×Rn\times R matrices and let U~,V~\widetilde{U},\widetilde{V} be ρ\rho-perturbations of U,VU,V respectively. Then for any constant δ(0,1)\delta\in(0,1), Rδn2R\leq\delta n^{2} and τ=nO(1)/ρ2\tau=n^{O(1)}/\rho^{2}, the Khatri-Rao product satisfies Krankτ(U~V~)=R\text{Krank}_{\tau}(\widetilde{U}\odot\widetilde{V})=R with probability at least 1exp(n)1-\exp(-\sqrt{n}).

The natural generalization where the vectors uiu_{i} and viv_{i} are in different dimensional spaces also holds. We omit the details here.

How can we lower bound the smallest singular value of AA? We define a quantity which is can be used as a proxy for the least singular value and is simpler to analyze.

For any matrix AA with columns A1,A2,ARA_{1},A_{2},\dots A_{R}, the leave-one-out distance is

The leave-one-out distance is a good proxy for the least singular value, if we are not particular about losing multiplicative factors that are polynomial in size of the matrix.

Apply a union bound over all the RR choices for ii.

We now state the main technical theorem about projections of perturbed product vectors onto arbitrary subspaces of large dimension.

Now consider some matrices MiM_{i}, and suppose we knew that Q(y~)Q(\widetilde{y}) has Ω(r)\Omega(r) singular values of magnitude 1/n2\geq 1/n^{2}. Then, an ρ\rho-perturbed vector x~\widetilde{x} has at least ρ/n\rho/n of its norm in the space spanned by the corresponding right singular vectors, with probability 1exp(r)\geq 1-\exp(-r) (Fact 3.26). Thus we get

So the key is to prove that the matrix Q(y~)Q(\widetilde{y}) has a large number of “non-negligible” singular values with high probability (over the perturbation in y~\widetilde{y}). For this, let us examine the entries of Q(y~)Q(\widetilde{y}). For a moment suppose that y~\widetilde{y} is a gaussian random vector N(0,ρ2I)\sim\mathcal{N}(0,\rho^{2}I) (instead of a perturbation). Then the (i,j)(i,j)th entry of Q(y~)Q(\widetilde{y}) is precisely y~,(Mi)j\langle\widetilde{y},(M_{i})_{j}\rangle, which is distributed like a one dimensional gaussian of variance ρ2(Mi)j2\rho^{2}\lVert(M_{i})_{j}\rVert^{2}. If the entries for different i,ji,j were independent, standard results from random matrix theory would imply that Q(y~)Q(\widetilde{y}) has many non-negligible singular values.

However, this could be far from the truth. Consider, for instance, two vectors (Mi)j(M_{i})_{j} and (Mi)j(M_{i^{\prime}})_{j^{\prime}} that are parallel. Then their dot products with y~\widetilde{y} are highly correlated. However we note, that as long as (Mi)j(M_{i^{\prime}})_{j^{\prime}} has a reasonable component orthogonal to (Mi)j(M_{i})_{j}, the distribution of the (i,j)(i,j) and (i,j)(i^{\prime},j^{\prime})th entries are “somewhat” independent. We will prove that we can roughly achieve such a situation. This motivates the following definition.

[Ordered θ\theta-orthogonality] A sequence of vectors v1,v2,,vnv_{1},v_{2},\dots,v_{n} has the ordered θ\theta-orthogonality property if for all 1in1\leq i\leq n, viv_{i} has a component of length θ\geq\theta orthogonal to span{v1,v2,,vi1}\operatorname{span}\{v_{1},v_{2},\dots,v_{i-1}\}.

Now we define a similar notion for a sequence of matrices M1,M2,,MrM_{1},M_{2},\dots,M_{r}, which says that a large enough subset of columns should have a certain θ\theta-orthogonality property. More formally,

A set of n×mn\times m matrices M1,M2,,MrM_{1},M_{2},\dots,M_{r} form an ordered (θ,δ)(\theta,\delta)-orthogonal system if there exists a permutation π\pi on [m][m] such that the first δm\delta m columns satisfy the followng property: for iδmi\leq\delta m and every j[R]j\in[R], the π(i)\pi(i)th column of MjM_{j} has a projection of length θ\geq\theta orthogonal to the span of all the vectors given by the columns π(1),π(2),,π(i1),π(i)\pi(1),\pi(2),\dots,\pi(i-1),\pi(i) of all the matrices M1,M2,MrM_{1},M_{2},\dots M_{r} other than itself (i.e. the π(i)\pi(i)th column of MjM_{j}).

The following lemma shows the use of an ordered (θ,δ)(\theta,\delta) orthogonal system: a matrix Q(y~)Q(\widetilde{y}) constructed as above starting with these MiM_{i} has many non-negligible singular values with high probability.

vec(Mi)V\text{vec}(M_{i})\in\mathcal{V} for all i[r]i\in[r].

M1,M2,,MrM_{1},M_{2},\dots,M_{r} form an ordered (θ,δ)(\theta,\delta^{\prime}) orthogonal system.

In particular, when mnm\leq\sqrt{n}, they form an ordered (θ,δ/2)(\theta,\delta/2) orthogonal system.

We remark that while δ\delta is often a constant in our applications, δ\delta^{\prime} does not have to be. We will use this in the proof that follows, in which we use these above two lemmas regarding construction and use of an ordered (θ,δ)(\theta,\delta)-orthogonal system to prove Proposition 3.12.

Now, applying Lemma 3.15, we have that the matrix Q(x~)Q(\widetilde{x}), defined as before, (given by linear combinations along x~\widetilde{x}) , has σr/2(Q(x~))ρθn4\sigma_{r/2}\left(Q(\widetilde{x})\right)\geq\frac{\rho\theta}{n^{4}} w.p 1exp(n)1-\exp(-\sqrt{n}).

Applying Fact 3.26 along with a simple averaging argument, we have that for one of the terms MiM_{i}, we have Mi(x~y~)ρθ/n6|M_{i}(\widetilde{x}\otimes\widetilde{y})|\geq\rho\theta/n^{6} with probability 1exp(r/2)\geq 1-\exp(-r/2) as required.

Please refer to Appendix B.2 for the complete details. ∎

2 Constructing the (θ,δ)𝜃𝛿(\theta,\delta)-Orthogonal System (Proof of Lemma 3.16)

This lemma will also be used in the first step of the proof of Theorem 3.6 to identify a good block of co-ordinates which span a large projection of a given subspace V\mathcal{V}.

The above lemma is easy to prove if the dimension of the column projections used is the usual dimension of a vector space. However, with robust dimension, to carefully avoid spurious or insignificant directions, we identify the robust dimension with the number of large singular values of a certain matrix.

In each stage tt of this construction we will be looking at subspaces of V\mathcal{V} which are obtained by zero-ing out all the columns J[m]J\subseteq[m] (i.e. all the co-ordinates [n]×J[n]\times J), that we have fixed so far.

The extension ExtJ(M)\text{Ext}_{*J}\left(M^{\prime}\right) for MVJM^{\prime}\in\mathcal{V}^{*J} is the vector MVM\in\mathcal{V} obtained by padding MM^{\prime} with zeros in the coordinates [n]×J[n]\times J (columns given by JJ).

The following lemma shows that their dimension remains large as long as J|J| is not too large:

Proof of Lemma 3.21: Consider a constraint matrix CC of size (1δ)nm×nm(1-\delta)nm\times nm which describes V\mathcal{V}. VJ\mathcal{V}^{*J} is described by the constraint matrix of size (1δ)nm×n(mJ)(1-\delta)nm\times n(m-|J|) obtained by removing the columns of CC corresponding to [n]×J[n]\times J. Hence we get a subspace of dimension at least n(mJ)(1δ)nmn(m-|J|)-(1-\delta)nm. \blacksquare

We now describe the construction more formally.

The Iterative Construction of ordered θ\theta-orthogonal matrices.

Initially set J0=J_{0}=\emptyset and Mj=0M_{j}=0 for all j[r]j\in[r], τ=m\tau=\sqrt{m} and s=δm/2s=\delta m/2.

Pick i[m]Jt1i\in[m]-J_{t-1} such that dimiτ(VJt1)δn/2\dim_{i}^{\tau}\left(\mathcal{V}^{*J_{t-1}}\right)\geq\delta n/2. If no such ii exists, report FAIL.

Set for all j[r]j\in[r], the new MjMj+ExtJ(Zj)M_{j}\leftarrow M_{j}+\text{Ext}_{*J}\left(Z_{j}\right), where ExtJ(Zj)\text{Ext}_{*J}\left(Z_{j}\right) is the matrix padded with zeros in the columns corresponding to JJ. Set JtJt1{i}J_{t}\leftarrow J_{t-1}\cup\left\{i\right\}.

Let J=JsJ=J_{s} for convenience. We first show that the above process for constructing M1,M2,,MrM_{1},M_{2},\dots,M_{r} completes successfully without reporting FAIL.

For r,sr,s such that sδm/2s\leq\delta m/2 and rsδn/3r\cdot s\leq\delta n/3, the above process does not FAIL.

Proof: In each stage, we add one column index to JJ. Hence, Jts|J_{t}|\leq s at all times t[s]t\in[s].

We first show that Step 1 of each iteration does not FAIL. From Lemma 3.21, we have dim(VJt)δnm/2\dim\left(\mathcal{V}^{*J_{t}}\right)\geq\delta nm/2. Let W=VJt\mathcal{W}=\mathcal{V}^{*J_{t}}. Now, applying Lemma 3.18 to W\mathcal{W}, we see that there exists i[m]Jti\in[m]-J_{t} such that dimiτ(W)δn/2\dim_{i}^{\tau}\left(\mathcal{W}\right)\geq\delta n/2, as required. Hence, Step 1 does not fail.

dimiτ(W)δn/2\dim_{i}^{\tau}\left(\mathcal{W}\right)\geq\delta n/2 shows that there exist Z1,Z2,Zδn/2Z^{\prime}_{1},Z^{\prime}_{2},\dots Z^{\prime}_{\delta n/2} with lengths at most m\sqrt{m} such that their iith columns {Zt(i)}tδn/2\left\{Z^{\prime}_{t}(i)\right\}_{t\leq\delta n/2} are orthonormal. However, we additionally need to impose that the iith columns to also be orthogonal to the columns {Mj(i)}j[r],iJt1\left\{M_{j}(i^{\prime})\right\}_{j\in[r],i^{\prime}\in J_{t-1}}. Fortunately, the number of such orthogonality constraints is at most rJt1δn/3r|J_{t-1}|\leq\delta n/3. Hence, we can pick the r<δn/6r<\delta n/6 orthonormal iith columns {Zj(i)}j[r]\left\{Z_{j}(i)\right\}_{j\in[r]} and their respective extensions ZjZ_{j}, by taking linear combinations of ZtZ^{\prime}_{t}. Since the linear combinations result again in unit vectors in the iith column, the length of ZjmnZ_{j}\leq\sqrt{m}{n}, as required. Hence, Step 2 does not FAIL as well. \blacksquare

We now show that since the process completes, then M1,M2,,MrM_{1},M_{2},\dots,M_{r} have the required ordered (θ,δ)(\theta,\delta^{\prime})-orthogonal property for δ=s/m\delta^{\prime}=s/m. We first check that M1,M2,,MrM_{1},M_{2},\dots,M_{r} belong to V\mathcal{V}. This is true because in each stage, ExtJ(Zj)V\text{Ext}_{*J}\left(Z_{j}\right)\in\mathcal{V}, and hence MjVM_{j}\in\mathcal{V} for j[r]j\in[r]. Further, since we run for ss stages, and each of the ZjZ_{j} are bounded in length by mn\sqrt{mn}, MjFsmnnm3\lVert M_{j}\rVert_{F}\leq s\sqrt{mn}\leq\sqrt{nm^{3}}. Our final matrices MjM_{j} will be scaled to F=1\lVert\cdot\rVert_{F}=1. The ss columns that satisfy the ordered θ\theta-orthogonality property are those of JJ, in the order they were chosen (we set this order to be π\pi, and select an arbitrary order for the rest).

3 (θ,δ)𝜃𝛿(\theta,\delta)-Orthogonality and ρ𝜌\rho-Perturbed Combinations (Proof of Lemma 3.15)

Suppose M1,M2,,MrM_{1},M_{2},\dots,M_{r} be a (θ,δ)(\theta,\delta)-orthogonal set of matrices (dimensions n×mn\times m). Without loss of generality, suppose that the permutation π\pi in the definition of orthogonality is the identity, and let II be the first δm\delta m columns.

Now let us consider an ρ\rho-perturbed vector x~\widetilde{x}, and consider the matrix Q(x~)Q(\widetilde{x}) defined in the statement – it has dimensions r×mr\times m, and the (i,j)(i,j)th entry is x~,(Mi)j\langle\widetilde{x},(M_{i})_{j}\rangle, which is distributed as a translated gaussian. Now for any column iIi\in I, the iith column in Q(x~)Q(\widetilde{x}) has every entry having an (ρθ)(\rho\cdot\theta) ‘component’ independent of entries in the previous columns, and entries above. This implies that for a unit gaussian vector gg, we have (by anti-concentration and θ\theta-orthogonality that

Furthermore, the above inequality holds, even conditioned on the first (i1)(i-1) columns of Q(x~)Q(\widetilde{x}).

Let Q(x~)Q(\widetilde{x}) be defined above, and fix some iIi\in I. Then for gN(0,1)ng\sim\mathcal{N}(0,1)^{n}, we have

for any given Q(x~)1,Q(x~)2,,Q(x~)(i1)Q(\widetilde{x})_{1},Q(\widetilde{x})_{2},\dots,Q(\widetilde{x})_{(i-1)}.

Proof: Let g=(g1,g2,,gr)g=(g_{1},g_{2},\dots,g_{r}). Then we have

Let us denote the latter vector by viv_{i} for now, so we are interested in x~,vi\langle\widetilde{x},v_{i}\rangle. We show that viv_{i} has a non-negligible component orthogonal to the span of v1,v2,v(i1)v_{1},v_{2},\dots v_{(i-1)}. Let Π\Pi be the matrix which projects orthogonal to the span of (Ms)i(M_{s})_{i^{\prime}} for all i<ii^{\prime}<i. Thus any vector Πu\Pi u is also orthogonal to the span of viv_{i^{\prime}} for i<ii^{\prime}<i.

Now by hypothesis, every vector Π(Ms)i\Pi(M_{s})_{i} has length θ\geq\theta. Thus the vector Π(sgs(Ms)i)=Πvi\Pi\left(\sum_{s}g_{s}(M_{s})_{i}\right)=\Pi v_{i} has length θ/2\geq\theta/2 with probability 1exp(r)\geq 1-\exp(-r) (Lemma B.2).

Thus if we consider the distribution of x~,vi=x,vi+e,vi\langle\widetilde{x},v_{i}\rangle=\langle x,v_{i}\rangle+\langle e,v_{i}\rangle, it is a one-dimensional gaussian with mean x,vi\langle x,v_{i}\rangle and variance ρ2\rho^{2}. From basic anti-concentration properties of a gaussian (that the mass in any ρ(variance)1/2\rho\cdot(\text{variance})^{1/2} interval is at most ρ\rho), the conclusion follows. \blacksquare

We can now do this for all iIi\in I, and conclude that the probability that Eq. (7) holds for all iIi\in I is at most 1/(2n)I1/(2n)^{|I|}.

Now what does this imply about the singular values of Q(x~)Q(\widetilde{x})? Suppose it has <r/2<r/2 (which is <I<|I|) non-negligible singular values, then a gaussian random vector gg, with probability at least nrn^{-r}, has a negligible component along all the corresponding singular vectors, and thus the length of gTQ(x~)g^{T}Q(\widetilde{x}) is negligible with at least this probability!

Let MM be a t×tt\times t matrix with spectral norm 1\leq 1. Suppose MM has at most rr singular values of magnitude >τ>\tau. Then for gN(0,1)tg\sim\mathcal{N}(0,1)^{t}, we have

Proof: Let u1,u2,,uru_{1},u_{2},\dots,u_{r} be the singular vectors corresponding to value >τ>\tau. Consider the event that gg has a projection of length <1/nc<1/n^{c} onto u1,u2,,uru_{1},u_{2},\dots,u_{r}. This has probability 1ncr\geq\frac{1}{n^{cr}}, by anti-concentration properties of the Gaussian (and because N(0,1)t\mathcal{N}(0,1)^{t} is rotationally invariant). For any such gg, we have

This contradicts the earlier anti-concentration bound, and so we conclude that the matrix has at least r/2r/2 non-negligible singular values, as required.

4 Higher Order Products

To handle this issue, we will first restrict our attention to a smaller block of co-ordinates of size n1×n2×n3n_{1}\times n_{2}\times n_{3} (with n1n2n3nn_{1}n_{2}n_{3}\ll n) , that has reasonable size in all the three modes (n1,n2,n3=nΩ(1)n_{1},n_{2},n_{3}=n^{\Omega(1)}). Additionally, we want V\mathcal{V}’s projection onto this n1×n2×n3n_{1}\times n_{2}\times n_{3} block spans a large subspace of (robust) dimension at least δn1n2n3\delta n_{1}n_{2}n_{3} (using Lemma 3.18).

We now state the main inductive claim. The claim assumes a block of co-ordinates of reasonable size in each mode that span many directions in V\mathcal{V}, and then establishes the anti-concentration bound inductively.

Before we give a proof of the main inductive claim, we first present a standard fact that relates the singular value of matrices and some anti-concentration properties of randomly perturbed vectors. This will also establish the base case of our main inductive claim.

Learning Multi-view Mixture Models

In this section, we will assume that each of the components in the mixture is a discrete distribution with support of size nn. We first introduce some notation, along the lines of .

The mixture component ii (i[R]i\in[R]) is first picked with probability wiw_{i}

However, in many practical settings like speech recognition and image classification, the dimension of the feature space is typically much smaller than the number of components or clusters i.e. nRn\ll R. To the best of our knowledge, there was no efficient algorithm for learning multi-view mixture models in such over-complete settings. We now show how Theorem 1.5 gives a polynomial time algorithm to learn multi-view mixture models in a smoothed sense, even in the over-complete setting RnR\gg n.

The conditional independence property is very useful in obtaining a higher order tensor, in terms of the hidden parameter vectors that we need to recover. This allows us to use our results on tensor decompositions from previous sections.

Our algorithm to learn multi-view models consists of three steps:

Apply Theorem 1.5 to T^\widehat{T} and recover the parameters μ^i(j){\widehat{\mu}_{i}}^{(j)} upto scaling.

Learning Mixtures of Axis-Aligned Gaussians

Let FF be a mixture of k=\mboxpoly(n)k=\mbox{poly}(n) axis-aligned Gaussians in nn dimensions, and suppose further that the means of the components are perturbed by Gaussian noise of magnitude ρ\rho. We restrict to Gaussian noise not because our results change, but for notational convenience.

The mixture is described by a set of kk mixing weights wiw_{i}, means μi\mu_{i} and covariance matrices Σi\Sigma_{i}. Since the mixture is axis-aligned, each covariance Σi\Sigma_{i} is diagonal and we will denote the jthj^{th} diagonal of Σi\Sigma_{i} as σij2\sigma_{ij}^{2}. Our main result in this section is the following:

Next we outline the main steps in our learning algorithm:

We then set up a system of linear equations and solve for σij2\sigma_{ij}^{2}.

1 Step 222: Recovering the Means and Mixing Weights

Now we show how to recover the means μ~i\widetilde{\mu}_{i} and weights wiw_{i}.

Let μ\mu be any dd dimensional vector. Then a coordinate-wise σ\sigma-perturbation of μ\mu has length dσ2/10\geq d\sigma^{2}/10 w.p. 1exp(d)\geq 1-\exp(-d).

The proof is by a basic anti-concentration along with the observation that coordinates are independently perturbed and hence the failure probability multiplies.

Let us now define the partition StS_{t}^{\prime}. Suppose we divide S1S_{1} and S2S_{2} into two roughly equal parts each, and call the parts A1,B1A_{1},B_{1} and A2,B2A_{2},B_{2} (respectively). Now consider a partition with S1=A1A2S_{1}^{\prime}=A_{1}\cup A_{2} and S2=B1B2S_{2}^{\prime}=B_{1}\cup B_{2}, and St=StS_{t}^{\prime}=S_{t} for t>2t>2. Consider the solution νi\nu_{i}^{\prime} we obtain using the decomposition algorithm, and look at the vectors ν1,ν2,ν1,ν2\nu_{1},\nu_{2},\nu_{1}^{\prime},\nu_{2}^{\prime}. For the sake of exposition, suppose we did not have any error in computing the decomposition. We can scale ν1\nu_{1}^{\prime} such that the sub-vector corresponding to A1A_{1} is precisely equal to that in ν1\nu_{1}. Now look at the remaining sub-vector of ν1\nu_{1}, and suppose it is γ\gamma times the “A2A_{2} portion” of ν2\nu_{2}. Then we must have γ=c2/c1\gamma=c_{2}/c_{1}.

To see this formally, let us fix some ii and write v11v_{11} and v12v_{12} to denote the sub-vectors of μ~i(1)\widetilde{\mu}_{i}^{(1)} restricted to coordinates in A1A_{1} and B1B_{1} respectively. Write v21v_{21} and v22v_{22} to represent sub-vectors of μ~i(2)\widetilde{\mu}_{i}^{(2)} restricted to A2A_{2} and B2B_{2} respectively. Then ν1\nu_{1} is c1v11c1v12c_{1}v_{11}\oplus c_{1}v_{12} (where \oplus denotes concatenation). So also ν2\nu_{2} is c2v21c2v22c_{2}v_{21}\oplus c_{2}v_{22}. Now we scaled ν1\nu_{1}^{\prime} such that the A1A_{1} portion agrees with ν1\nu_{1}, thus we made ν1\nu_{1}^{\prime} equal to c1v11c1v21c_{1}v_{11}\oplus c_{1}v_{21}. Thus by the way γ\gamma is defined, we have c1γ=c2c_{1}\gamma=c_{2}, which is what we claimed.

We can now compute the entire vector μ~i\widetilde{\mu}_{i} up to scaling, since we know c1/c2c_{1}/c_{2}, c1/c3c_{1}/c_{3}, and so on. Thus it remains to find the mixture weights wiw_{i}. Note that these are all non-negative. Now from the decomposition, note that for each ii, we can find the quantity

2 Step 333: Recovering the Variances

Now that we know the values of wiw_{i} and all the means μ~i\widetilde{\mu}_{i}, we show how to recover the variances. This can be done in many ways, and we will outline one which ends up solving a linear system of equations. Recall that for each Gaussian, the covariance matrix is diagonal (denoted Σi\Sigma_{i}, with jjth entry equal to σij2\sigma_{ij}^{2}).

This can be evaluated as before. Write μ~i(t)\widetilde{\mu}_{i}^{(t)} to denote the portion of μ~i\widetilde{\mu}_{i} restricted to StS_{t}, and similarly ηi(t)\eta_{i}^{(t)} to denote the portion of the noise vector ηi\eta_{i}. This gives

Now using this with zz^{\prime} being the flattened N1\mathcal{N}_{1} allows us to recover the values of wi(μ~i(1)+σi12)w_{i}(\widetilde{\mu}_{i}(1)+\sigma_{i1}^{2}) for 1iR1\leq i\leq R. From this, since we know the values of wiw_{i} and μ~i(1)\widetilde{\mu}_{i}(1) for each ii, we can recover the values σi12\sigma_{i1}^{2} for all ii. As mentioned before, we can repeat this process for other dimensions and recover σij2\sigma_{ij}^{2} for all i,ji,j.

Acknowledgements

We thank Ryan O’Donnell for suggesting that we extend our techniques for learning mixtures of spherical Gaussians to the more general problem of learning axis-aligned Gaussians.

References

Appendix A Stability of the recovery algorithm

In this section we prove Theorem 2.3, which shows that the algorithm from section 2 is actually robust to errors, under Condition 2.2. This consists of two parts: first proving that the preprocessing step indeed allows us to recover uiu_{i} (approximately), and second, that Decompose is robust to noise.

Stability of Decompose

Next, we establish that Decompose is stable (in what follows, we have m=n=Rm=n=R). Intuitively, Decompose is stable provided that the matrices UU and VV are well-conditioned and the eigenvalues of the matrices that we need to diagonalize are separated.

The main step in Decompose is an eigendecomposition, so first we will establish perturbation bounds. The standard perturbation bounds are known as sin θ\theta theorems following Davis-Kahan and Wedin. However these bounds hold most generally for the singular value decomposition of an arbitrary (not necessarily symmetric) matrix. We require perturbation bounds for eigen-decompositions of general matrices. There are known bounds due to Eisenstat and Ipsen, however the notion of separation required there is difficult to work with and for our purposes it is easier to prove a direct bound in our setting.

Suppose M=UDU1M=UDU^{-1} and M^=M(I+E)+F\widehat{M}=M(I+E)+F and MM and M^\widehat{M} are n×nn\times n matrices. In order to relate the eigendecompositions of MM and M^\widehat{M} respectively, we will first need to establish that the eigenvalues of MM are all distinct. We thank Santosh Vempala for pointing out an error in an earlier version. We incorrectly used the Bauer-Fike Theorem to show that M^\widehat{M} is diagonalizable, but this theorem only shows that each eigenvalue of M^\widehat{M} is close to some eigenvalue of MM, but does not show that there is a one-to-one mapping. Fortunately there is a fix for this that works under the same conditions (but again see for an earlier, alternative proof that uses a “homotopy argument”).

Let \mboxsep(D)=minijDi,iDj,j\mbox{sep}(D)=\min_{i\neq j}|D_{i,i}-D_{j,j}|.

Our first goal is to prove that M^\widehat{M} is diagonalizable, and we will do this by establishing that its eigenvalues are distinct if the error matrices EE and FF are not too large. Consider

where R=U1(ME+F)UR=U^{-1}(ME+F)U. We can bound each entry in RR by κ(U)(ME2+F2)\kappa(U)(\|ME\|_{2}+\|F\|_{2}). Hence if EE and FF are not too large, the eigenvalues of D+RD+R are close to the eigenvalues of DD using Gershgorin’s disk theorem, and the eigenvalues of D+RD+R are the same as the eigenvalues of M^\widehat{M} since these matrices are similar. So we conclude:

If κ(U)(ME2+F2)<\mboxsep(D)/(2n)\kappa(U)(\|ME\|_{2}+\|F\|_{2})<\mbox{sep}(D)/(2n) then the eigenvalues of M^\widehat{M} are distinct and it is diagonalizable.

Next we prove that the eigenvectors of M^\widehat{M} are also close to those of MM (this step will rely on M^\widehat{M} being diagonalizable). This technique is standard in numerical analysis, but it will be more convenient for us to work with relative perturbations (i.e. M^=M(I+E)+F\widehat{M}=M(I+E)+F) so we include the proof of such a bound for completeness

Consider a right eigenvector u^i\widehat{u}_{i} of M^\widehat{M} with eigenvalue λ^i\widehat{\lambda}_{i}. We will assume that the conditions of the above corollary are met, so that there is a unique eigenvector uiu_{i} of MM with eigenvalue λi\lambda_{i} which it is paired with. Then since the eigenvectors {ui}i\{u_{i}\}_{i} of MM are full rank, we can write u^i=jcjuj\widehat{u}_{i}=\sum_{j}c_{j}u_{j}. Then

Now we can left multiply by the jthj^{th} row of U1U^{-1}; call this vector wjTw_{j}^{T}. Since U1U=IU^{-1}U=I, we have that wjTui=1i=jw_{j}^{T}u_{i}=\mathbf{1}_{i=j}. Hence

where we have used the condition that κ(U)(ME2+F2)<sep(D)/2\kappa(U)(\|ME\|_{2}+\|F\|_{2})<sep(D)/2 to lower bound the denominator. Furthermore: U1MEu^i2=DU1Eu^i2σmax(E)λmax(D)σmin(U)\|U^{-1}ME\widehat{u}_{i}\|_{2}=\|DU^{-1}E\widehat{u}_{i}\|_{2}\leq\frac{\sigma_{max}(E)\lambda_{max}(D)}{\sigma_{min}(U)} since u^i\widehat{u}_{i} is a unit vector.

If κ(U)(ME2+F2)<sep(D)/2\kappa(U)(\|ME\|_{2}+\|F\|_{2})<sep(D)/2, then

Now we are ready to analyze the stability of Decompose: Let T=i=1nuiviwiT=\sum_{i=1}^{n}u_{i}\otimes v_{i}\otimes w_{i} be an n×n×pn\times n\times p tensor that satisfies Condition 2.2. In our settings of interest we are not given TT exactly but rather a good approximation to it, and here let us model this noise as an additive error EE that is itself an n×n×pn\times n\times p tensor.

With high probability, \mboxsep(DaDb1)δp\mbox{sep}(D_{a}D_{b}^{-1})\geq\frac{\delta}{\sqrt{p}}.

Proof: Fix some i,ji,j. The (i,i)(i,i)th entry of DaDb1D_{a}D_{b}^{-1} is precisely wi,awi,b\frac{\langle w_{i},a\rangle}{\langle w_{i},b\rangle}. Note that Pr\/[wi,b>nwi]\mathop{\bf Pr\/}[|\langle w_{i},b\rangle|>n\lVert w_{i}\rVert] is exp(n)\exp(-n), thus the denominators are all at least 1/(Cn)1/(Cn) in magnitude with probability 1exp(n)1-\exp(-n).

Now given bb for which this happens, we have wi,awi,bwj,awj,b=ciwi,acjwj,a\frac{\langle w_{i},a\rangle}{\langle w_{i},b\rangle}-\frac{\langle w_{j},a\rangle}{\langle w_{j},b\rangle}=c_{i}\langle w_{i},a\rangle-c_{j}\langle w_{j},a\rangle where ci,cjc_{i},c_{j} have magnitude >1/(Cn)>1/(Cn). Because wiw_{i} has at least a δ\delta component orthogonal to wjw_{j}, anti-concentration of Gaussians implies that the difference above is at least δ/C2n6\delta/C^{2}n^{6} with probability at least 11/n41-1/n^{4}. Thus we can take a union bound over all pairs. \blacksquare

We will make crucial use of the following matrix identity:

Let Na=Ta+EaN_{a}=T_{a}+E_{a} and Nb=Tb+EbN_{b}=T_{b}+E_{b}. Then using the above identity we have:

where F=Eb(I+(Tb)1Eb)1(Tb)1F=-E_{b}(I+(T_{b})^{-1}E_{b})^{-1}(T_{b})^{-1} and G=Ea(Tb)1G=E_{a}(T_{b})^{-1}

σmax(F)σmax(Eb)σmin(Tb)σmax(Eb)\sigma_{max}(F)\leq\frac{\sigma_{max}(E_{b})}{\sigma_{min}(T_{b})-\sigma_{max}(E_{b})} and σmax(G)σmax(Ea)σmin(Tb)\sigma_{max}(G)\leq\frac{\sigma_{max}(E_{a})}{\sigma_{min}(T_{b})}

as desired. The second bound is obvious. \blacksquare

We can now use Theorem A.3 to bound the error in recovering the factors UU and VV by setting e.g. M=Ta(Tb)1M=T_{a}(T_{b})^{-1}. Additionally, the following claim establishes that the linear system used to solve for WW is well-conditioned and hence we can also bound the error in recovering WW.

κ(UV)min(σmax(U),σmax(V))max(σmin(U),σmin(V))min(κ(U),κ(V))\kappa(U\odot V)\leq\frac{\min(\sigma_{max}(U),\sigma_{max}(V))}{\max(\sigma_{min}(U),\sigma_{min}(V))}\leq\min(\kappa(U),\kappa(V))

These bounds establish what we qualitatively asserted: Decompose is stable provided that the matrices UU and VV are well-conditioned and the eigenvalues of the matrices that we need to diagonalize are separated.

Appendix B K-rank of the Khatri-Rao product

Recall: we defined the leave-one-out distance in Section 3. Here we establish that is indeed equivalent to the smallest singular value, up to polynomial factors. In our main proof, this quantity will be much easer to work with since it allows us to translate questions about a set of vectors being well-conditioned to reasoning about projection of each vector onto the orthogonal complement of the others.

Proof of Lemma 3.5: Using the variational characterization for singular values: σmin(A)=minu,u2=1Au2\sigma_{min}(A)=\min_{u,\|u\|_{2}=1}\|Au\|_{2}. Then let i=\mboxargmaxuii=\mbox{argmax}|u_{i}|. Clearly ui1/m|u_{i}|\geq 1/\sqrt{m} since u2=1\|u\|_{2}=1. Then Ai+jiAjujui2=σmin(A)ui\|A_{i}+\sum_{j\neq i}A_{j}\frac{u_{j}}{u_{i}}\|_{2}=\frac{\sigma_{min}(A)}{u_{i}}. Hence

Conversely, let i=\mboxargminidist(Ai,span{Aj}ji)i=\mbox{argmin}_{i}dist(A_{i},span\{A_{j}\}_{j\neq i}). Then there are coefficients (with ui=1u_{i}=1) such that

Clearly u21\|u\|_{2}\geq 1 since ui=1u_{i}=1. And we conclude that

B.2 Proof of Proposition 3.12

We now give the complete details of the proof of Proposition 3.12, that shows how the Kruskal rank multiplies in the smoothed setting for two-wise products. The proof follows by just combining Lemma 3.16 and Lemma 3.15.

we obtain n×nn\times n matrices M1,M2,,MrM_{1},M_{2},\dots,M_{r} having the (θ,δ)(\theta,\delta^{\prime})-orthogonality property. Note that in this setting, δm=n1/22\delta^{\prime}m=\frac{n^{1/2}}{2}.

Thus by applying Lemma 3.15, we have that the matrix Q(x~)Q(\widetilde{x}), defined as before, satisfies

Since Q(x~)Q(\widetilde{x}) has many non-negligible singular values (Eq.(12)), we have (by Fact 3.26 for details) that an ρ\rho-perturbed vector has a non-negligible norm when multiplied by QQ. More precisely, Pr\/[y~TQ(x~)ρθ/n4]1exp(r/2)\mathop{\bf Pr\/}[\lVert\widetilde{y}^{T}Q(\widetilde{x})\rVert\geq\rho\theta/n^{4}]\geq 1-\exp(-r/2). Thus for one of the terms MsM_{s}, we have Ms(x~y~)ρθ/n5|M_{s}(\widetilde{x}\otimes\widetilde{y})|\geq\rho\theta/n^{5} with probability 1exp(r/2)\geq 1-\exp(-r/2).

Now this almost completes the proof, but recall that our aim is to argue about M(x~y~)M(\widetilde{x}\otimes\widetilde{y}), where MM is the given matrix. vec(Ms)\text{vec}(M_{s}) is a vector in the span of the top δn2\delta n^{2} (right) singular vectors of MM, and σδn2τ\sigma_{\delta n^{2}}\geq\tau, thus we can write MsM_{s} as a combination of the rows of MM, with each weight in the combination being n/τ\leq n/\tau (Lemma B.1). This implies that for at least one row M(j)M^{(j)} of the matrix MM, we must have

(Otherwise we have a contradiction). This completes the proof.

Before we give the complete proofs of the two main lemmas regarding ordered (θ,δ)(\theta,\delta) orthogonal systems (Lemma 3.16 and Lemma 3.15), we start with a simple lemma about top singular vectors of matrices, which is very useful to obtain linear combinations of small length.

B.3 Constructing the (θ,δ)𝜃𝛿(\theta,\delta)-Orthogonal System (Proof of Lemma 3.16)

We now give the complete proof of lemma 3.18 that shows that the average robust dimension of column projections is large if the dimension of V\mathcal{V} is large .

Proof of Lemma 3.18: Let d=dim(V)d=\dim(\mathcal{V}). Let BB be a p1p2×dp_{1}p_{2}\times d matrix composed of a orthonormal basis (of dd vectors) for V\mathcal{V} i.e. the jthj^{th} column of BB is the jthj^{th} basis vector (j[d]j\in[d]) of V\mathcal{V}. Clearly σd(B)=1\sigma_{d}(B)=1. For i[p2]i\in[p_{2}], let BiB_{i} be the p1×dp_{1}\times d matrix obtained by projecting the columns of BB on just the rows given by [p1]×i[p_{1}]\times i. Hence, BB is obtained by just concatenating the columns as B\scT=[B1\scTB2\scTBp\scT]{B}^{\sc T}=\left[{B_{1}}^{\sc T}\|{B_{2}}^{\sc T}\|\dots\|{B_{p}}^{\sc T}\right]. Finally, let di=maxtd_{i}=\max t such that σt(Bi)1p2\sigma_{t}(B_{i})\geq\frac{1}{\sqrt{p_{2}}}.

B.4 Implications of Ordered (θ,δ)𝜃𝛿(\theta,\delta)-Orthogonality: Details of Proof of Lemma 3.15

Here we show some auxiliary lemmas that are used in the Proof of Lemma B.4.

Suppose v1,v2,,vmv_{1},v_{2},\dots,v_{m} are a set of vectors in n\Re^{n} of length 1\leq 1, having the θ\theta-orthogonal property. Then we have

For gN(0,1)ng\sim\mathcal{N}(0,1)^{n}, we have ivi,g2θ2/2\sum_{i}\langle v_{i},g\rangle^{2}\geq\theta^{2}/2 with probability 1exp(Ω(m))\geq 1-\exp(-\Omega(m)),

For gN(0,1)mg\sim\mathcal{N}(0,1)^{m}, we have igivi2θ2/2\lVert\sum_{i}g_{i}v_{i}\rVert^{2}\geq\theta^{2}/2 with probability 1exp(Ω(m))\geq 1-\exp(-\Omega(m)).

Furthermore, part (a) holds even if gg is drawn from u+gu+g^{\prime}, for any fixed vector uu and gN(0,1)ng^{\prime}\sim\mathcal{N}(0,1)^{n}.

Proof: First note that we must have mnm\leq n, because otherwise {v1,v2,,vm}\{v_{1},v_{2},\dots,v_{m}\} cannot have the θ\theta-orthogonal property for θ>0\theta>0. For any j[m]j\in[m], we claim that

To see this, write vj=vj+vjv_{j}=v_{j}^{\prime}+v_{j}^{\perp}, where vjv_{j}^{\perp} is orthogonal to the span of {v1,v2,,vj1}\{v_{1},v_{2},\dots,v_{j-1}\}. Since jIj\in I, we have vjθ\lVert v_{j}^{\perp}\rVert\geq\theta. Now given the vectors v1,v2,,vj1v_{1},v_{2},\dots,v_{j-1}, the value vj,g\langle v_{j}^{\prime},g\rangle is fixed, but vj,g\langle v_{j}^{\perp},g\rangle is distributed as a Gaussian with variance θ2\theta^{2} (since gg is a Gaussian of unit variance in each direction).

Thus from a standard anti-concentration property for the one-dimensional Gaussian, vj,g\langle v_{j},g\rangle cannot have a mass >1/2>1/2 in any θ2\theta^{2} length interval, in particular, it cannot lie in [θ2/2,θ2/2][-\theta^{2}/2,\theta^{2}/2] with probability >1/2>1/2. This proves Eq. (13). Now since this is true for any conditioning v1,v2,,vj1v_{1},v_{2},\dots,v_{j-1} and for all jj, it follows (see Lemma B.3 for a formal justification) that

This completes the proof of the claim, part (a). Note that even if we had gg replaced by u+gu+g throughout, the anti-concentration property still holds (we have a shifted one-dimensional Gaussian), thus the proof goes through verbatim.

Let us now prove part (b). First note that if we denote by MM the n×mn\times m matrix whose columns are the viv_{i}, then part (a) deals with the distribution of gTMMTgg^{T}MM^{T}g, where gN(0,1)ng\sim\mathcal{N}(0,1)^{n}. Part (b) deals with the distribution of gTMTMgg^{T}M^{T}Mg, where gN(0,1)mg\sim\mathcal{N}(0,1)^{m}. But since the eigenvalues of MMTMM^{T} and MTMM^{T}M are precisely the same, due to the rotational invariance of Gaussians, these two quantities are distributed exactly the same way. This completes the proof. \blacksquare

Suppose we have random variables X1,X2,,XrX_{1},X_{2},\dots,X_{r} and an event f()f(\cdot) which is defined to occur if its argument lies in a certain interval (e.g. f(X)f(X) occurs iff 0<X<10<X<1). Further, suppose we have Pr\/[f(X1)]p\mathop{\bf Pr\/}[f(X_{1})]\leq p, and Pr\/[f(Xi)X1,X2,,Xi1]p\mathop{\bf Pr\/}[f(X_{i})|X_{1},X_{2},\dots,X_{i-1}]\leq p for all X1,X2,,Xi1X_{1},X_{2},\dots,X_{i-1}. Then

Appendix C Applications to Mixture Models

Proof: We first bound the \|\cdot\|_{\infty} norm of the difference of tensors i.e. we show that

C.2 Error Analysis for Multi-view Models

This implies that 1α1α2<δ/Lmin2|1-\alpha_{1}\alpha_{2}|<\delta/L_{\min}^{2} as required.

Now, let us assume β1>δ\beta_{1}>\sqrt{\delta}. This at once implies that β2<δ\beta_{2}<\sqrt{\delta}. Also

Now, using (15), we see that β1<δ\beta_{1}<\sqrt{\delta}. \blacksquare

C.3 Sampling Error Estimates for Gaussians

C.4 Recovering Weights in Gaussian Mixtures

Proof: From (C.4) and triangle inequality, we see that

Now, substituting the values for α1,α2\alpha_{1},\alpha_{2}, we see that