Smoothed Analysis of Tensor Decompositions

Aditya Bhaskara, Moses Charikar, Ankur Moitra, Aravindan Vijayaraghavan

Introduction

Tensor decompositions play a central role in modern statistics (see e.g. ). To illustrate their usefulness, suppose we are given a matrix $M=\sum_{i=1}^{R}a_{i}\otimes b_{i}$ When can we uniquely recover the factors $\{a_{i}\}_{i}$ and $\{b_{i}\}_{i}$ of this decomposition given access to $M$ ? In fact, this decomposition is almost never unique (unless we require that the factors $\{a_{i}\}_{i}$ and $\{b_{i}\}_{i}$ are orthonormal, or that $M$ has rank one). But given a tensor $T=\sum_{i=1}^{R}a_{i}\otimes b_{i}\otimes c_{i}$ there are general conditions under which $\{a_{i}\}_{i}$ , $\{b_{i}\}_{i}$ and $\{c_{i}\}_{i}$ are uniquely determined (up to scaling) given $T$ ; perhaps the most famous such condition is due to Kruskal , which we review in the next section.

Tensor methods are commonly used to establish that the parameters of a generative model can be identified given third (or higher) order moments. In contrast, given just second-order moments (e.g. $M$ ) we can only hope to recover the factors up to a rotation. This is called the rotation problem and has been an important issue in statistics since the pioneering work of psychologist Charles Spearman (1904) . Tensors offer a path around this obstacle precisely because their decompositions are often unique, and consequently have found applications in phylogenetic reconstruction , , hidden markov models , mixture models , topic modeling , community detection , etc.

However most tensor problems are hard: computing the rank , the best rank one approximation and the spectral norm are all $NP$ -hard. Also many of the familiar properties of matrices do not generalize to tensors. For example, subtracting the best rank one approximation to a tensor can actually increase its rank and there are rank three tensors that can be approximated arbitrarily well by a sequence of rank two tensors. One of the few algorithmic results for tensors is an algorithm for computing tensor decompositions in a restricted case. Let $A,B$ and $C$ be matrices whose columns are $\{a_{i}\}_{i}$ , $\{b_{i}\}_{i}$ and $\{c_{i}\}_{i}$ respectively.

, If $\mbox{rank}(A)=\mbox{rank}(B)=R$ and no pair of columns in $C$ are multiples of each other, then there is a polynomial time algorithm to compute the minimum rank tensor decomposition of $T$ . Moreover the rank one terms in this decomposition are unique (among all decompositions with the same rank).

If $T$ is an $n\times n\times n$ tensor, then $R$ can be at most $n$ in order for the conditions of the theorem to be met. This basic algorithm has been used to design efficient algorithms for phylogenetic reconstruction , , topic modeling , community detection and learning hidden markov models and mixtures of spherical Gaussians . However algorithms that make use of tensor decompositions have traditionally been limited to the full-rank case, and our goal is to develop stable algorithms that work for $R=\mbox{poly}(n)$ . Recently Goyal et al gave a robustness analysis for this decomposition, and we give an alternative proof in Appendix A.

In fact, this basic tensor decomposition can be bootstrapped to work even when $R$ is larger than $n$ (if we also increase the order of the tensor). The key parameter that dictates when one can efficiently find a tensor decomposition (or more generally, when it is unique) is the Kruskal rank:

The Kruskal rank (or Krank) of a matrix $A$ is the largest $k$ for which every set of $k$ columns are linearly independent. Also the $\tau$ -robust k-rank is denoted by $\text{Krank}_{\tau}(A)$ , and is the largest $k$ for which every $n\times k$ sub-matrix $A_{|S}$ of $A$ has $\sigma_{k}(A_{|S})\geq 1/\tau$ .

Hence we get an order three tensor $\widehat{T}$ of size $n^{2}\times n^{2}\times n$ . Alternatively we can define this “flattening” using the following operation:

The Khatri-Rao product of $U$ and $V$ which are size $m\times r$ and $n\times r$ respectively is an $mn\times r$ matrix $U\odot V$ whose $i^{th}$ column is $u_{i}\otimes v_{i}$ .

Our new order three tensor $\widehat{T}$ can be written as:

The factors are the columns of $A^{(1)}\odot A^{(2)}$ , the columns of $A^{(3)}\odot A^{(4)}$ and the columns of $A^{(5)}$ . The crucial point is that the Kruskal rank of the columns of $A^{(1)}\odot A^{(2)}$ is in fact at least the sum of the Kruskal rank of the columns of $A^{(1)}$ and $A^{(2)}$ (and similarly for $A^{(3)}\odot A^{(4)}$ ) , , but this is tight in the worst-case. Consequently this “flattening” operation allows us use the above algorithm unto $R=2n$ ; since the rank ( $R$ ) is larger than the largest dimension ( $n$ ), this is called the overcomplete case.

Our main technical result is that in a natural smoothed analysis model, the Kruskal rank robustly multiplies and this allows us to give algorithms for computing a tensor decomposition even in the highly overcomplete case, for any $R=\mbox{poly}(n)$ (provided that the order of the tensor is large - but still a constant). Moreover our algorithms have immediate applications in learning mixtures of Gaussians and multiview mixture models.

2 Our Results

We introduce the following framework for studying tensor decomposition problems:

In applications in learning, tensors are used to encode low-order moments of the distribution. In particular, each factor in the decomposition represents a “component”. The intuition is that if these “components” are not chosen in a worst-case configuration, then we can obtain vastly improved learning algorithms in various settings. For example, as a direct consequence of our main result, we will give new algorithms for learning mixtures of spherical Gaussians again in the framework of smoothed analysis (without any additional separation conditions). There are no known polynomial time algorithms to learn such mixures if the number of components ( $k$ ) is larger than the dimension ( $n$ ). But if their means are perturbed, we give a polynomial time algorithm for any $k=\mbox{poly}(n)$ by virtue of our tensor decomposition algorithm.

Our main technical result is the following:

We obtain the following main theorem from the above result and from analyzing the stability of the algorithm of Leurgans et al (see Theorem 2.3):

As we discussed, tensor methods have had numerous applications in learning. However algorithms that make use of tensor decompositions have traditionally been limited to the full-rank case, and hence can only handle cases when the number of “components” is at most the dimension. However by using our main theorem above, we can get new algorithms for some of these problems that work even if there are many more “components” than dimensions.

Mixtures of Axis-Aligned Gaussians (Section 5)

Here we are given samples from a distribution $F=\sum_{i=1}^{k}w_{i}F_{i}(\mu_{i},\Sigma_{i})$ where $F_{i}(\mu_{i},\Sigma_{i})$ is a Gaussian with mean $\mu_{i}$ and covariance $\Sigma_{i}$ and each $\Sigma_{i}$ is diagonal. These mixtures are ubiquitous throughout machine learning. Feldman et al gave an algorithm for PAC-learning mixtures of axis aligned Gaussians, however the running time is exponential in $k$ , the number of components. Hsu and Kakade gave a polynomial time algorithm for learning mixtures of spherical Gaussians provided that their means are full rank (hence $k\leq n$ ). Again, we turn to the framework of smoothed analysis and suppose that the means are $\rho$ -perturbed. In this framework, we can give a polynomial time algorithm for learning mixtures of axis-aligned Gaussians for any $k=\mbox{poly}(n)$ . Suppose that the means of a mixture of axis-aligned Gaussians and suppose the means have been $\rho$ -perturbed to obtain $\widetilde{\mu}_{i}$ . Then

We believe that our new algorithms for overcomplete tensor decomposition will have further applications in learning. Additionally this framework of studying distribution learning when the parameters of the distribution we would like to learn are not chosen adversarially, seems quite appealing.

Alternatively, algorithms for overcomplete tensor decomposition that assume we know $T$ exactly would not have any applications in learning because we would need to take too many samples to have a good enough estimate of $T$ (i.e. the low-order moments of the distribution).

3 Our Approach

then we would like to prove that with high probability $x\otimes y$ has a non-negligible projection on the orthogonal complement of $\mathcal{U}$ . This is the core of our approach. Set $\mathcal{V}$ to be the orthogonal complement of $\mathcal{U}$ . In fact, we prove that for any dimension at least $\frac{n^{2}}{2}$ subspace $\mathcal{V}$ , with high probability $x\otimes y$ has a non-negligible projection onto $\mathcal{V}$ .

Prior Algorithms

Here we review the algorithm of Leurgans et al . It has been discovered many times in different settings. It is sometimes referred to as “simultaneous diagonalization” or as Chang’s lemma .

Suppose we are given a third-order tensor $T=\sum_{i=1}^{R}u_{i}\otimes v_{i}\otimes w_{i}$ which is $n\times m\times p$ . Let $U,V$ and $W$ be matrices whose columns are $u_{i}$ , $v_{i}$ and $w_{i}$ respectively. Suppose further that (1) $rank(U)=rank(V)=R$ and (2) $\mbox{k-rank}(W)\geq 2$ . Then we can efficiently recover the factors of $T$ .

We present the algorithm Decompose and its analysis assuming $n=m=R$ . Any instance with $rank(U)=rank(V)=R$ can be reduced to this case as follows: find the span of the vectors $\{\widetilde{u}_{j,k}\}$ , where $\widetilde{u}_{j,k}$ is the $n$ dimensional vector whose $i$ th entry is $T_{ijk}$ . This span must be precisely the span of the columns of $U$ .It is easy to see that the span is contained in the span of the columns of $U$ . To see equality, we observe that if the span is $R-1$ dimensional, then projecting each of the $u_{i}$ s on to the span gives a different decomposition, and this contradicts Kruskal’s uniqueness theorem, which holds in this case. Thus we can pick some orthonormal basis for this span, and write $T$ as an $R\times m\times p$ tensor. We can perform this operation again (along the second mode) to move to an $R\times R\times p$ tensor.

, Given a tensor $T$ there exists an algorithm that runs in polynomial time and recovers the (unique) factors of $T$ provided that (1) $rank(U)=rank(V)=R$ and (2) $\mbox{k-rank}(W)\geq 2$ .

Proof: The algorithm is to pre-process as above (i.e., obtain $m=n=R$ ), and then run Decompose stated below. Let us thus analyze Decompose with $m,n$ being $R$ .

We can write $T_{a}=UD_{a}V^{T}$ where $D_{a}=\mbox{diag}(a^{T}w_{1},a^{T}w_{2},...,a^{T}w_{n})$ and similarly $T_{b}=UD_{b}V^{T}$ where $D_{b}=\mbox{diag}(b^{T}w_{1},b^{T}w_{2},...,b^{T}w_{n}).$ Moreover we can write $T_{a}(T_{b})^{-1}=UD_{a}D_{b}^{-1}U^{-1}$ and $T_{a}^{T}(T_{b}^{T})^{-1}=VD_{a}D_{b}^{-1}V^{-1}$ . So we conclude $U$ and $V$ diagonalize $T_{a}(T_{b})^{-1}$ and $((T_{b})^{-1}T_{a})^{T}$ respectively. Note that almost surely the diagonals entries of $D_{a}D_{b}^{-1}$ are distinct (Claim A.4). Hence the eigendecompositions of $T_{a}(T_{b})^{-1}$ and $(T_{b})^{-1}(T_{a})$ are unique, and we can pair up columns in $U$ and columns in $V$ based on their eigenvalues (we pair up $u$ and $v$ if their eigenvalues are equal). We can then solve a linear system to find the remaining factors (columns in $W$ ) and since this is a valid decomposition, we can conclude that these are also the true factors of $T$ appealing to Kruskal’s uniqueness theorem . $\blacksquare$

In fact, this algorithm is also stable, as Goyal et al also recently showed. It is intuitive that if $U$ and $V$ are well-conditioned and each pair of columns in $W$ is well-conditioned then this algorithm can tolerate some inverse polynomial amount of noise. For completeness, we give a robustness analysis of Decompose in Appendix A.

The condition numbers $\kappa(U),\kappa(V)\leq\kappa$ ,

The column vectors of $W$ are not close to parallel: for all $i\neq j$ , $\|\frac{w_{i}}{\|w_{i}\|}-\frac{w_{j}}{\|w_{j}\|}\|_{2}\geq\delta$ ,

The decompositions are bounded : for all $i$ , $\|u_{i}\|_{2},\|v_{i}\|_{2},\|w_{i}\|_{2}\leq C$ .

As before, the algorithm is to preprocess so as to obtain $m=n=R$ , and then run Decompose. The preprocessing step is slightly different because of the presence of error – instead of considering the span of the $\{\widetilde{u}_{j,k}\}$ as above, we need to look at the span of the top $R$ singular vectors of the matrix whose columns are $\widetilde{u}_{j,k}$ . If $\lVert E\rVert_{F}$ is small enough (in terms of $\kappa,\delta,n$ ), the span of these top singular vectors suffices to obtain an approximation to the vectors $u_{i}$ (see Appendix A).

Note that the algorithm is limited by the condition that $rank(U)=rank(V)=R$ since this requires that $R\leq\min(m,n)$ . But as we have seen before, by “flattening” a higher order tensor, we can handle overcomplete tensors. The following is an immediately corollary of Theorem 2.3:

then there exists an efficient algorithm that computes each rank one term in this decomposition up to an additive error of $\epsilon$ .

$\text{Krank}(U\odot V)\geq\min\big{(}\text{Krank}(U)+\text{Krank}(V)-1,R\big{)}$

However in learning applications we are not given $T$ exactly but rather an approximation to it. Our goal is to show that the Kruskal rank robustly multiplies typically, so that these types of tensor algorithms will not only work in the exact case, but are also necessarily stable when we are given $T$ with some noise. In the next section, we show that in the smoothed analysis model, the robust Kruskal rank multiplies on taking Khatri-Rao products. This then establishes our main result Theorem 1.5, assuming Theorem 3.3 which we prove in the next section.

The Khatri-Rao Product Robustly Multiplies

Suppose $U,V$ are $n\times R$ matrices and let $\widetilde{U},\widetilde{V}$ be $\rho$ -perturbations of $U,V$ respectively. Then for any constant $\delta\in(0,1)$ , $R\leq\delta n^{2}$ and $\tau=n^{O(1)}/\rho^{2}$ , the Khatri-Rao product satisfies $\text{Krank}_{\tau}(\widetilde{U}\odot\widetilde{V})=R$ with probability at least $1-\exp(-\sqrt{n})$ .

The natural generalization where the vectors $u_{i}$ and $v_{i}$ are in different dimensional spaces also holds. We omit the details here.

How can we lower bound the smallest singular value of $A$ ? We define a quantity which is can be used as a proxy for the least singular value and is simpler to analyze.

For any matrix $A$ with columns $A_{1},A_{2},\dots A_{R}$ , the leave-one-out distance is

The leave-one-out distance is a good proxy for the least singular value, if we are not particular about losing multiplicative factors that are polynomial in size of the matrix.

Apply a union bound over all the $R$ choices for $i$ .

We now state the main technical theorem about projections of perturbed product vectors onto arbitrary subspaces of large dimension.

Now consider some matrices $M_{i}$ , and suppose we knew that $Q(\widetilde{y})$ has $\Omega(r)$ singular values of magnitude $\geq 1/n^{2}$ . Then, an $\rho$ -perturbed vector $\widetilde{x}$ has at least $\rho/n$ of its norm in the space spanned by the corresponding right singular vectors, with probability $\geq 1-\exp(-r)$ (Fact 3.26). Thus we get

So the key is to prove that the matrix $Q(\widetilde{y})$ has a large number of “non-negligible” singular values with high probability (over the perturbation in $\widetilde{y}$ ). For this, let us examine the entries of $Q(\widetilde{y})$ . For a moment suppose that $\widetilde{y}$ is a gaussian random vector $\sim\mathcal{N}(0,\rho^{2}I)$ (instead of a perturbation). Then the $(i,j)$ th entry of $Q(\widetilde{y})$ is precisely $\langle\widetilde{y},(M_{i})_{j}\rangle$ , which is distributed like a one dimensional gaussian of variance $\rho^{2}\lVert(M_{i})_{j}\rVert^{2}$ . If the entries for different $i,j$ were independent, standard results from random matrix theory would imply that $Q(\widetilde{y})$ has many non-negligible singular values.

However, this could be far from the truth. Consider, for instance, two vectors $(M_{i})_{j}$ and $(M_{i^{\prime}})_{j^{\prime}}$ that are parallel. Then their dot products with $\widetilde{y}$ are highly correlated. However we note, that as long as $(M_{i^{\prime}})_{j^{\prime}}$ has a reasonable component orthogonal to $(M_{i})_{j}$ , the distribution of the $(i,j)$ and $(i^{\prime},j^{\prime})$ th entries are “somewhat” independent. We will prove that we can roughly achieve such a situation. This motivates the following definition.

[Ordered $\theta$ -orthogonality] A sequence of vectors $v_{1},v_{2},\dots,v_{n}$ has the ordered $\theta$ -orthogonality property if for all $1\leq i\leq n$ , $v_{i}$ has a component of length $\geq\theta$ orthogonal to $\operatorname{span}\{v_{1},v_{2},\dots,v_{i-1}\}$ .

Now we define a similar notion for a sequence of matrices $M_{1},M_{2},\dots,M_{r}$ , which says that a large enough subset of columns should have a certain $\theta$ -orthogonality property. More formally,

A set of $n\times m$ matrices $M_{1},M_{2},\dots,M_{r}$ form an ordered $(\theta,\delta)$ -orthogonal system if there exists a permutation $\pi$ on $[m]$ such that the first $\delta m$ columns satisfy the followng property: for $i\leq\delta m$ and every $j\in[R]$ , the $\pi(i)$ th column of $M_{j}$ has a projection of length $\geq\theta$ orthogonal to the span of all the vectors given by the columns $\pi(1),\pi(2),\dots,\pi(i-1),\pi(i)$ of all the matrices $M_{1},M_{2},\dots M_{r}$ other than itself (i.e. the $\pi(i)$ th column of $M_{j}$ ).

The following lemma shows the use of an ordered $(\theta,\delta)$ orthogonal system: a matrix $Q(\widetilde{y})$ constructed as above starting with these $M_{i}$ has many non-negligible singular values with high probability.

$\text{vec}(M_{i})\in\mathcal{V}$ for all $i\in[r]$ .

$M_{1},M_{2},\dots,M_{r}$ form an ordered $(\theta,\delta^{\prime})$ orthogonal system.

In particular, when $m\leq\sqrt{n}$ , they form an ordered $(\theta,\delta/2)$ orthogonal system.

We remark that while $\delta$ is often a constant in our applications, $\delta^{\prime}$ does not have to be. We will use this in the proof that follows, in which we use these above two lemmas regarding construction and use of an ordered $(\theta,\delta)$ -orthogonal system to prove Proposition 3.12.

Now, applying Lemma 3.15, we have that the matrix $Q(\widetilde{x})$ , defined as before, (given by linear combinations along $\widetilde{x}$ ) , has $\sigma_{r/2}\left(Q(\widetilde{x})\right)\geq\frac{\rho\theta}{n^{4}}$ w.p $1-\exp(-\sqrt{n})$ .

Applying Fact 3.26 along with a simple averaging argument, we have that for one of the terms $M_{i}$ , we have $|M_{i}(\widetilde{x}\otimes\widetilde{y})|\geq\rho\theta/n^{6}$ with probability $\geq 1-\exp(-r/2)$ as required.

Please refer to Appendix B.2 for the complete details. ∎

2 Constructing the (θ,δ)𝜃𝛿(\theta,\delta)-Orthogonal System (Proof of Lemma 3.16)

This lemma will also be used in the first step of the proof of Theorem 3.6 to identify a good block of co-ordinates which span a large projection of a given subspace $\mathcal{V}$ .

The above lemma is easy to prove if the dimension of the column projections used is the usual dimension of a vector space. However, with robust dimension, to carefully avoid spurious or insignificant directions, we identify the robust dimension with the number of large singular values of a certain matrix.

In each stage $t$ of this construction we will be looking at subspaces of $\mathcal{V}$ which are obtained by zero-ing out all the columns $J\subseteq[m]$ (i.e. all the co-ordinates $[n]\times J$ ), that we have fixed so far.

The extension $\text{Ext}_{*J}\left(M^{\prime}\right)$ for $M^{\prime}\in\mathcal{V}^{*J}$ is the vector $M\in\mathcal{V}$ obtained by padding $M^{\prime}$ with zeros in the coordinates $[n]\times J$ (columns given by $J$ ).

The following lemma shows that their dimension remains large as long as $|J|$ is not too large:

Proof of Lemma 3.21: Consider a constraint matrix $C$ of size $(1-\delta)nm\times nm$ which describes $\mathcal{V}$ . $\mathcal{V}^{*J}$ is described by the constraint matrix of size $(1-\delta)nm\times n(m-|J|)$ obtained by removing the columns of $C$ corresponding to $[n]\times J$ . Hence we get a subspace of dimension at least $n(m-|J|)-(1-\delta)nm$ . $\blacksquare$

We now describe the construction more formally.

The Iterative Construction of ordered $\theta$ -orthogonal matrices.

Initially set $J_{0}=\emptyset$ and $M_{j}=0$ for all $j\in[r]$ , $\tau=\sqrt{m}$ and $s=\delta m/2$ .

Pick $i\in[m]-J_{t-1}$ such that $\dim_{i}^{\tau}\left(\mathcal{V}^{*J_{t-1}}\right)\geq\delta n/2$ . If no such $i$ exists, report FAIL.

Set for all $j\in[r]$ , the new $M_{j}\leftarrow M_{j}+\text{Ext}_{*J}\left(Z_{j}\right)$ , where $\text{Ext}_{*J}\left(Z_{j}\right)$ is the matrix padded with zeros in the columns corresponding to $J$ . Set $J_{t}\leftarrow J_{t-1}\cup\left\{i\right\}$ .

Let $J=J_{s}$ for convenience. We first show that the above process for constructing $M_{1},M_{2},\dots,M_{r}$ completes successfully without reporting FAIL.

For $r,s$ such that $s\leq\delta m/2$ and $r\cdot s\leq\delta n/3$ , the above process does not FAIL.

Proof: In each stage, we add one column index to $J$ . Hence, $|J_{t}|\leq s$ at all times $t\in[s]$ .

We first show that Step 1 of each iteration does not FAIL. From Lemma 3.21, we have $\dim\left(\mathcal{V}^{*J_{t}}\right)\geq\delta nm/2$ . Let $\mathcal{W}=\mathcal{V}^{*J_{t}}$ . Now, applying Lemma 3.18 to $\mathcal{W}$ , we see that there exists $i\in[m]-J_{t}$ such that $\dim_{i}^{\tau}\left(\mathcal{W}\right)\geq\delta n/2$ , as required. Hence, Step 1 does not fail.

$\dim_{i}^{\tau}\left(\mathcal{W}\right)\geq\delta n/2$ shows that there exist $Z^{\prime}_{1},Z^{\prime}_{2},\dots Z^{\prime}_{\delta n/2}$ with lengths at most $\sqrt{m}$ such that their $i$ th columns $\left\{Z^{\prime}_{t}(i)\right\}_{t\leq\delta n/2}$ are orthonormal. However, we additionally need to impose that the $i$ th columns to also be orthogonal to the columns $\left\{M_{j}(i^{\prime})\right\}_{j\in[r],i^{\prime}\in J_{t-1}}$ . Fortunately, the number of such orthogonality constraints is at most $r|J_{t-1}|\leq\delta n/3$ . Hence, we can pick the $r<\delta n/6$ orthonormal $i$ th columns $\left\{Z_{j}(i)\right\}_{j\in[r]}$ and their respective extensions $Z_{j}$ , by taking linear combinations of $Z^{\prime}_{t}$ . Since the linear combinations result again in unit vectors in the $i$ th column, the length of $Z_{j}\leq\sqrt{m}{n}$ , as required. Hence, Step 2 does not FAIL as well. $\blacksquare$

We now show that since the process completes, then $M_{1},M_{2},\dots,M_{r}$ have the required ordered $(\theta,\delta^{\prime})$ -orthogonal property for $\delta^{\prime}=s/m$ . We first check that $M_{1},M_{2},\dots,M_{r}$ belong to $\mathcal{V}$ . This is true because in each stage, $\text{Ext}_{*J}\left(Z_{j}\right)\in\mathcal{V}$ , and hence $M_{j}\in\mathcal{V}$ for $j\in[r]$ . Further, since we run for $s$ stages, and each of the $Z_{j}$ are bounded in length by $\sqrt{mn}$ , $\lVert M_{j}\rVert_{F}\leq s\sqrt{mn}\leq\sqrt{nm^{3}}$ . Our final matrices $M_{j}$ will be scaled to $\lVert\cdot\rVert_{F}=1$ . The $s$ columns that satisfy the ordered $\theta$ -orthogonality property are those of $J$ , in the order they were chosen (we set this order to be $\pi$ , and select an arbitrary order for the rest).

3 (θ,δ)𝜃𝛿(\theta,\delta)-Orthogonality and ρ𝜌\rho-Perturbed Combinations (Proof of Lemma 3.15)

Suppose $M_{1},M_{2},\dots,M_{r}$ be a $(\theta,\delta)$ -orthogonal set of matrices (dimensions $n\times m$ ). Without loss of generality, suppose that the permutation $\pi$ in the definition of orthogonality is the identity, and let $I$ be the first $\delta m$ columns.

Now let us consider an $\rho$ -perturbed vector $\widetilde{x}$ , and consider the matrix $Q(\widetilde{x})$ defined in the statement – it has dimensions $r\times m$ , and the $(i,j)$ th entry is $\langle\widetilde{x},(M_{i})_{j}\rangle$ , which is distributed as a translated gaussian. Now for any column $i\in I$ , the $i$ th column in $Q(\widetilde{x})$ has every entry having an $(\rho\cdot\theta)$ ‘component’ independent of entries in the previous columns, and entries above. This implies that for a unit gaussian vector $g$ , we have (by anti-concentration and $\theta$ -orthogonality that

Furthermore, the above inequality holds, even conditioned on the first $(i-1)$ columns of $Q(\widetilde{x})$ .

Let $Q(\widetilde{x})$ be defined above, and fix some $i\in I$ . Then for $g\sim\mathcal{N}(0,1)^{n}$ , we have

for any given $Q(\widetilde{x})_{1},Q(\widetilde{x})_{2},\dots,Q(\widetilde{x})_{(i-1)}$ .

Proof: Let $g=(g_{1},g_{2},\dots,g_{r})$ . Then we have

Let us denote the latter vector by $v_{i}$ for now, so we are interested in $\langle\widetilde{x},v_{i}\rangle$ . We show that $v_{i}$ has a non-negligible component orthogonal to the span of $v_{1},v_{2},\dots v_{(i-1)}$ . Let $\Pi$ be the matrix which projects orthogonal to the span of $(M_{s})_{i^{\prime}}$ for all $i^{\prime}<i$ . Thus any vector $\Pi u$ is also orthogonal to the span of $v_{i^{\prime}}$ for $i^{\prime}<i$ .

Now by hypothesis, every vector $\Pi(M_{s})_{i}$ has length $\geq\theta$ . Thus the vector $\Pi\left(\sum_{s}g_{s}(M_{s})_{i}\right)=\Pi v_{i}$ has length $\geq\theta/2$ with probability $\geq 1-\exp(-r)$ (Lemma B.2).

Thus if we consider the distribution of $\langle\widetilde{x},v_{i}\rangle=\langle x,v_{i}\rangle+\langle e,v_{i}\rangle$ , it is a one-dimensional gaussian with mean $\langle x,v_{i}\rangle$ and variance $\rho^{2}$ . From basic anti-concentration properties of a gaussian (that the mass in any $\rho\cdot(\text{variance})^{1/2}$ interval is at most $\rho$ ), the conclusion follows. $\blacksquare$

We can now do this for all $i\in I$ , and conclude that the probability that Eq. (7) holds for all $i\in I$ is at most $1/(2n)^{|I|}$ .

Now what does this imply about the singular values of $Q(\widetilde{x})$ ? Suppose it has $<r/2$ (which is $<|I|$ ) non-negligible singular values, then a gaussian random vector $g$ , with probability at least $n^{-r}$ , has a negligible component along all the corresponding singular vectors, and thus the length of $g^{T}Q(\widetilde{x})$ is negligible with at least this probability!

Let $M$ be a $t\times t$ matrix with spectral norm $\leq 1$ . Suppose $M$ has at most $r$ singular values of magnitude $>\tau$ . Then for $g\sim\mathcal{N}(0,1)^{t}$ , we have

Proof: Let $u_{1},u_{2},\dots,u_{r}$ be the singular vectors corresponding to value $>\tau$ . Consider the event that $g$ has a projection of length $<1/n^{c}$ onto $u_{1},u_{2},\dots,u_{r}$ . This has probability $\geq\frac{1}{n^{cr}}$ , by anti-concentration properties of the Gaussian (and because $\mathcal{N}(0,1)^{t}$ is rotationally invariant). For any such $g$ , we have

This contradicts the earlier anti-concentration bound, and so we conclude that the matrix has at least $r/2$ non-negligible singular values, as required.

4 Higher Order Products

To handle this issue, we will first restrict our attention to a smaller block of co-ordinates of size $n_{1}\times n_{2}\times n_{3}$ (with $n_{1}n_{2}n_{3}\ll n$ ) , that has reasonable size in all the three modes ( $n_{1},n_{2},n_{3}=n^{\Omega(1)}$ ). Additionally, we want $\mathcal{V}$ ’s projection onto this $n_{1}\times n_{2}\times n_{3}$ block spans a large subspace of (robust) dimension at least $\delta n_{1}n_{2}n_{3}$ (using Lemma 3.18).

We now state the main inductive claim. The claim assumes a block of co-ordinates of reasonable size in each mode that span many directions in $\mathcal{V}$ , and then establishes the anti-concentration bound inductively.

Before we give a proof of the main inductive claim, we first present a standard fact that relates the singular value of matrices and some anti-concentration properties of randomly perturbed vectors. This will also establish the base case of our main inductive claim.

Learning Multi-view Mixture Models

In this section, we will assume that each of the components in the mixture is a discrete distribution with support of size $n$ . We first introduce some notation, along the lines of .

The mixture component $i$ ( $i\in[R]$ ) is first picked with probability $w_{i}$

However, in many practical settings like speech recognition and image classification, the dimension of the feature space is typically much smaller than the number of components or clusters i.e. $n\ll R$ . To the best of our knowledge, there was no efficient algorithm for learning multi-view mixture models in such over-complete settings. We now show how Theorem 1.5 gives a polynomial time algorithm to learn multi-view mixture models in a smoothed sense, even in the over-complete setting $R\gg n$ .

The conditional independence property is very useful in obtaining a higher order tensor, in terms of the hidden parameter vectors that we need to recover. This allows us to use our results on tensor decompositions from previous sections.

Our algorithm to learn multi-view models consists of three steps:

Apply Theorem 1.5 to $\widehat{T}$ and recover the parameters ${\widehat{\mu}_{i}}^{(j)}$ upto scaling.

Learning Mixtures of Axis-Aligned Gaussians

Let $F$ be a mixture of $k=\mbox{poly}(n)$ axis-aligned Gaussians in $n$ dimensions, and suppose further that the means of the components are perturbed by Gaussian noise of magnitude $\rho$ . We restrict to Gaussian noise not because our results change, but for notational convenience.

The mixture is described by a set of $k$ mixing weights $w_{i}$ , means $\mu_{i}$ and covariance matrices $\Sigma_{i}$ . Since the mixture is axis-aligned, each covariance $\Sigma_{i}$ is diagonal and we will denote the $j^{th}$ diagonal of $\Sigma_{i}$ as $\sigma_{ij}^{2}$ . Our main result in this section is the following:

Next we outline the main steps in our learning algorithm:

We then set up a system of linear equations and solve for $\sigma_{ij}^{2}$ .

1 Step 222: Recovering the Means and Mixing Weights

Now we show how to recover the means $\widetilde{\mu}_{i}$ and weights $w_{i}$ .

Let $\mu$ be any $d$ dimensional vector. Then a coordinate-wise $\sigma$ -perturbation of $\mu$ has length $\geq d\sigma^{2}/10$ w.p. $\geq 1-\exp(-d)$ .

The proof is by a basic anti-concentration along with the observation that coordinates are independently perturbed and hence the failure probability multiplies.

Let us now define the partition $S_{t}^{\prime}$ . Suppose we divide $S_{1}$ and $S_{2}$ into two roughly equal parts each, and call the parts $A_{1},B_{1}$ and $A_{2},B_{2}$ (respectively). Now consider a partition with $S_{1}^{\prime}=A_{1}\cup A_{2}$ and $S_{2}^{\prime}=B_{1}\cup B_{2}$ , and $S_{t}^{\prime}=S_{t}$ for $t>2$ . Consider the solution $\nu_{i}^{\prime}$ we obtain using the decomposition algorithm, and look at the vectors $\nu_{1},\nu_{2},\nu_{1}^{\prime},\nu_{2}^{\prime}$ . For the sake of exposition, suppose we did not have any error in computing the decomposition. We can scale $\nu_{1}^{\prime}$ such that the sub-vector corresponding to $A_{1}$ is precisely equal to that in $\nu_{1}$ . Now look at the remaining sub-vector of $\nu_{1}$ , and suppose it is $\gamma$ times the “ $A_{2}$ portion” of $\nu_{2}$ . Then we must have $\gamma=c_{2}/c_{1}$ .

To see this formally, let us fix some $i$ and write $v_{11}$ and $v_{12}$ to denote the sub-vectors of $\widetilde{\mu}_{i}^{(1)}$ restricted to coordinates in $A_{1}$ and $B_{1}$ respectively. Write $v_{21}$ and $v_{22}$ to represent sub-vectors of $\widetilde{\mu}_{i}^{(2)}$ restricted to $A_{2}$ and $B_{2}$ respectively. Then $\nu_{1}$ is $c_{1}v_{11}\oplus c_{1}v_{12}$ (where $\oplus$ denotes concatenation). So also $\nu_{2}$ is $c_{2}v_{21}\oplus c_{2}v_{22}$ . Now we scaled $\nu_{1}^{\prime}$ such that the $A_{1}$ portion agrees with $\nu_{1}$ , thus we made $\nu_{1}^{\prime}$ equal to $c_{1}v_{11}\oplus c_{1}v_{21}$ . Thus by the way $\gamma$ is defined, we have $c_{1}\gamma=c_{2}$ , which is what we claimed.

We can now compute the entire vector $\widetilde{\mu}_{i}$ up to scaling, since we know $c_{1}/c_{2}$ , $c_{1}/c_{3}$ , and so on. Thus it remains to find the mixture weights $w_{i}$ . Note that these are all non-negative. Now from the decomposition, note that for each $i$ , we can find the quantity

2 Step 333: Recovering the Variances

Now that we know the values of $w_{i}$ and all the means $\widetilde{\mu}_{i}$ , we show how to recover the variances. This can be done in many ways, and we will outline one which ends up solving a linear system of equations. Recall that for each Gaussian, the covariance matrix is diagonal (denoted $\Sigma_{i}$ , with $j$ th entry equal to $\sigma_{ij}^{2}$ ).

This can be evaluated as before. Write $\widetilde{\mu}_{i}^{(t)}$ to denote the portion of $\widetilde{\mu}_{i}$ restricted to $S_{t}$ , and similarly $\eta_{i}^{(t)}$ to denote the portion of the noise vector $\eta_{i}$ . This gives

Now using this with $z^{\prime}$ being the flattened $\mathcal{N}_{1}$ allows us to recover the values of $w_{i}(\widetilde{\mu}_{i}(1)+\sigma_{i1}^{2})$ for $1\leq i\leq R$ . From this, since we know the values of $w_{i}$ and $\widetilde{\mu}_{i}(1)$ for each $i$ , we can recover the values $\sigma_{i1}^{2}$ for all $i$ . As mentioned before, we can repeat this process for other dimensions and recover $\sigma_{ij}^{2}$ for all $i,j$ .

Acknowledgements

We thank Ryan O’Donnell for suggesting that we extend our techniques for learning mixtures of spherical Gaussians to the more general problem of learning axis-aligned Gaussians.

References

Appendix A Stability of the recovery algorithm

In this section we prove Theorem 2.3, which shows that the algorithm from section 2 is actually robust to errors, under Condition 2.2. This consists of two parts: first proving that the preprocessing step indeed allows us to recover $u_{i}$ (approximately), and second, that Decompose is robust to noise.

Stability of Decompose

Next, we establish that Decompose is stable (in what follows, we have $m=n=R$ ). Intuitively, Decompose is stable provided that the matrices $U$ and $V$ are well-conditioned and the eigenvalues of the matrices that we need to diagonalize are separated.

The main step in Decompose is an eigendecomposition, so first we will establish perturbation bounds. The standard perturbation bounds are known as sin $\theta$ theorems following Davis-Kahan and Wedin. However these bounds hold most generally for the singular value decomposition of an arbitrary (not necessarily symmetric) matrix. We require perturbation bounds for eigen-decompositions of general matrices. There are known bounds due to Eisenstat and Ipsen, however the notion of separation required there is difficult to work with and for our purposes it is easier to prove a direct bound in our setting.

Suppose $M=UDU^{-1}$ and $\widehat{M}=M(I+E)+F$ and $M$ and $\widehat{M}$ are $n\times n$ matrices. In order to relate the eigendecompositions of $M$ and $\widehat{M}$ respectively, we will first need to establish that the eigenvalues of $M$ are all distinct. We thank Santosh Vempala for pointing out an error in an earlier version. We incorrectly used the Bauer-Fike Theorem to show that $\widehat{M}$ is diagonalizable, but this theorem only shows that each eigenvalue of $\widehat{M}$ is close to some eigenvalue of $M$ , but does not show that there is a one-to-one mapping. Fortunately there is a fix for this that works under the same conditions (but again see for an earlier, alternative proof that uses a “homotopy argument”).

Let $\mbox{sep}(D)=\min_{i\neq j}|D_{i,i}-D_{j,j}|$ .

Our first goal is to prove that $\widehat{M}$ is diagonalizable, and we will do this by establishing that its eigenvalues are distinct if the error matrices $E$ and $F$ are not too large. Consider

where $R=U^{-1}(ME+F)U$ . We can bound each entry in $R$ by $\kappa(U)(\|ME\|_{2}+\|F\|_{2})$ . Hence if $E$ and $F$ are not too large, the eigenvalues of $D+R$ are close to the eigenvalues of $D$ using Gershgorin’s disk theorem, and the eigenvalues of $D+R$ are the same as the eigenvalues of $\widehat{M}$ since these matrices are similar. So we conclude:

If $\kappa(U)(\|ME\|_{2}+\|F\|_{2})<\mbox{sep}(D)/(2n)$ then the eigenvalues of $\widehat{M}$ are distinct and it is diagonalizable.

Next we prove that the eigenvectors of $\widehat{M}$ are also close to those of $M$ (this step will rely on $\widehat{M}$ being diagonalizable). This technique is standard in numerical analysis, but it will be more convenient for us to work with relative perturbations (i.e. $\widehat{M}=M(I+E)+F$ ) so we include the proof of such a bound for completeness

Consider a right eigenvector $\widehat{u}_{i}$ of $\widehat{M}$ with eigenvalue $\widehat{\lambda}_{i}$ . We will assume that the conditions of the above corollary are met, so that there is a unique eigenvector $u_{i}$ of $M$ with eigenvalue $\lambda_{i}$ which it is paired with. Then since the eigenvectors $\{u_{i}\}_{i}$ of $M$ are full rank, we can write $\widehat{u}_{i}=\sum_{j}c_{j}u_{j}$ . Then

Now we can left multiply by the $j^{th}$ row of $U^{-1}$ ; call this vector $w_{j}^{T}$ . Since $U^{-1}U=I$ , we have that $w_{j}^{T}u_{i}=\mathbf{1}_{i=j}$ . Hence

where we have used the condition that $\kappa(U)(\|ME\|_{2}+\|F\|_{2})<sep(D)/2$ to lower bound the denominator. Furthermore: $\|U^{-1}ME\widehat{u}_{i}\|_{2}=\|DU^{-1}E\widehat{u}_{i}\|_{2}\leq\frac{\sigma_{max}(E)\lambda_{max}(D)}{\sigma_{min}(U)}$ since $\widehat{u}_{i}$ is a unit vector.

If $\kappa(U)(\|ME\|_{2}+\|F\|_{2})<sep(D)/2$ , then

Now we are ready to analyze the stability of Decompose: Let $T=\sum_{i=1}^{n}u_{i}\otimes v_{i}\otimes w_{i}$ be an $n\times n\times p$ tensor that satisfies Condition 2.2. In our settings of interest we are not given $T$ exactly but rather a good approximation to it, and here let us model this noise as an additive error $E$ that is itself an $n\times n\times p$ tensor.

With high probability, $\mbox{sep}(D_{a}D_{b}^{-1})\geq\frac{\delta}{\sqrt{p}}$ .

Proof: Fix some $i,j$ . The $(i,i)$ th entry of $D_{a}D_{b}^{-1}$ is precisely $\frac{\langle w_{i},a\rangle}{\langle w_{i},b\rangle}$ . Note that $\mathop{\bf Pr\/}[|\langle w_{i},b\rangle|>n\lVert w_{i}\rVert]$ is $\exp(-n)$ , thus the denominators are all at least $1/(Cn)$ in magnitude with probability $1-\exp(-n)$ .

Now given $b$ for which this happens, we have $\frac{\langle w_{i},a\rangle}{\langle w_{i},b\rangle}-\frac{\langle w_{j},a\rangle}{\langle w_{j},b\rangle}=c_{i}\langle w_{i},a\rangle-c_{j}\langle w_{j},a\rangle$ where $c_{i},c_{j}$ have magnitude $>1/(Cn)$ . Because $w_{i}$ has at least a $\delta$ component orthogonal to $w_{j}$ , anti-concentration of Gaussians implies that the difference above is at least $\delta/C^{2}n^{6}$ with probability at least $1-1/n^{4}$ . Thus we can take a union bound over all pairs. $\blacksquare$

We will make crucial use of the following matrix identity:

Let $N_{a}=T_{a}+E_{a}$ and $N_{b}=T_{b}+E_{b}$ . Then using the above identity we have:

where $F=-E_{b}(I+(T_{b})^{-1}E_{b})^{-1}(T_{b})^{-1}$ and $G=E_{a}(T_{b})^{-1}$

$\sigma_{max}(F)\leq\frac{\sigma_{max}(E_{b})}{\sigma_{min}(T_{b})-\sigma_{max}(E_{b})}$ and $\sigma_{max}(G)\leq\frac{\sigma_{max}(E_{a})}{\sigma_{min}(T_{b})}$

as desired. The second bound is obvious. $\blacksquare$

We can now use Theorem A.3 to bound the error in recovering the factors $U$ and $V$ by setting e.g. $M=T_{a}(T_{b})^{-1}$ . Additionally, the following claim establishes that the linear system used to solve for $W$ is well-conditioned and hence we can also bound the error in recovering $W$ .

$\kappa(U\odot V)\leq\frac{\min(\sigma_{max}(U),\sigma_{max}(V))}{\max(\sigma_{min}(U),\sigma_{min}(V))}\leq\min(\kappa(U),\kappa(V))$

These bounds establish what we qualitatively asserted: Decompose is stable provided that the matrices $U$ and $V$ are well-conditioned and the eigenvalues of the matrices that we need to diagonalize are separated.

Appendix B K-rank of the Khatri-Rao product

Recall: we defined the leave-one-out distance in Section 3. Here we establish that is indeed equivalent to the smallest singular value, up to polynomial factors. In our main proof, this quantity will be much easer to work with since it allows us to translate questions about a set of vectors being well-conditioned to reasoning about projection of each vector onto the orthogonal complement of the others.

Proof of Lemma 3.5: Using the variational characterization for singular values: $\sigma_{min}(A)=\min_{u,\|u\|_{2}=1}\|Au\|_{2}$ . Then let $i=\mbox{argmax}|u_{i}|$ . Clearly $|u_{i}|\geq 1/\sqrt{m}$ since $\|u\|_{2}=1$ . Then $\|A_{i}+\sum_{j\neq i}A_{j}\frac{u_{j}}{u_{i}}\|_{2}=\frac{\sigma_{min}(A)}{u_{i}}$ . Hence

Conversely, let $i=\mbox{argmin}_{i}dist(A_{i},span\{A_{j}\}_{j\neq i})$ . Then there are coefficients (with $u_{i}=1$ ) such that

Clearly $\|u\|_{2}\geq 1$ since $u_{i}=1$ . And we conclude that

B.2 Proof of Proposition 3.12

We now give the complete details of the proof of Proposition 3.12, that shows how the Kruskal rank multiplies in the smoothed setting for two-wise products. The proof follows by just combining Lemma 3.16 and Lemma 3.15.

we obtain $n\times n$ matrices $M_{1},M_{2},\dots,M_{r}$ having the $(\theta,\delta^{\prime})$ -orthogonality property. Note that in this setting, $\delta^{\prime}m=\frac{n^{1/2}}{2}$ .

Thus by applying Lemma 3.15, we have that the matrix $Q(\widetilde{x})$ , defined as before, satisfies

Since $Q(\widetilde{x})$ has many non-negligible singular values (Eq.(12)), we have (by Fact 3.26 for details) that an $\rho$ -perturbed vector has a non-negligible norm when multiplied by $Q$ . More precisely, $\mathop{\bf Pr\/}[\lVert\widetilde{y}^{T}Q(\widetilde{x})\rVert\geq\rho\theta/n^{4}]\geq 1-\exp(-r/2)$ . Thus for one of the terms $M_{s}$ , we have $|M_{s}(\widetilde{x}\otimes\widetilde{y})|\geq\rho\theta/n^{5}$ with probability $\geq 1-\exp(-r/2)$ .

Now this almost completes the proof, but recall that our aim is to argue about $M(\widetilde{x}\otimes\widetilde{y})$ , where $M$ is the given matrix. $\text{vec}(M_{s})$ is a vector in the span of the top $\delta n^{2}$ (right) singular vectors of $M$ , and $\sigma_{\delta n^{2}}\geq\tau$ , thus we can write $M_{s}$ as a combination of the rows of $M$ , with each weight in the combination being $\leq n/\tau$ (Lemma B.1). This implies that for at least one row $M^{(j)}$ of the matrix $M$ , we must have

(Otherwise we have a contradiction). This completes the proof.

Before we give the complete proofs of the two main lemmas regarding ordered $(\theta,\delta)$ orthogonal systems (Lemma 3.16 and Lemma 3.15), we start with a simple lemma about top singular vectors of matrices, which is very useful to obtain linear combinations of small length.

B.3 Constructing the (θ,δ)𝜃𝛿(\theta,\delta)-Orthogonal System (Proof of Lemma 3.16)

We now give the complete proof of lemma 3.18 that shows that the average robust dimension of column projections is large if the dimension of $\mathcal{V}$ is large .

Proof of Lemma 3.18: Let $d=\dim(\mathcal{V})$ . Let $B$ be a $p_{1}p_{2}\times d$ matrix composed of a orthonormal basis (of $d$ vectors) for $\mathcal{V}$ i.e. the $j^{th}$ column of $B$ is the $j^{th}$ basis vector ( $j\in[d]$ ) of $\mathcal{V}$ . Clearly $\sigma_{d}(B)=1$ . For $i\in[p_{2}]$ , let $B_{i}$ be the $p_{1}\times d$ matrix obtained by projecting the columns of $B$ on just the rows given by $[p_{1}]\times i$ . Hence, $B$ is obtained by just concatenating the columns as ${B}^{\sc T}=\left[{B_{1}}^{\sc T}\|{B_{2}}^{\sc T}\|\dots\|{B_{p}}^{\sc T}\right]$ . Finally, let $d_{i}=\max t$ such that $\sigma_{t}(B_{i})\geq\frac{1}{\sqrt{p_{2}}}$ .

B.4 Implications of Ordered (θ,δ)𝜃𝛿(\theta,\delta)-Orthogonality: Details of Proof of Lemma 3.15

Here we show some auxiliary lemmas that are used in the Proof of Lemma B.4.

Suppose $v_{1},v_{2},\dots,v_{m}$ are a set of vectors in $\Re^{n}$ of length $\leq 1$ , having the $\theta$ -orthogonal property. Then we have

For $g\sim\mathcal{N}(0,1)^{n}$ , we have $\sum_{i}\langle v_{i},g\rangle^{2}\geq\theta^{2}/2$ with probability $\geq 1-\exp(-\Omega(m))$ ,

For $g\sim\mathcal{N}(0,1)^{m}$ , we have $\lVert\sum_{i}g_{i}v_{i}\rVert^{2}\geq\theta^{2}/2$ with probability $\geq 1-\exp(-\Omega(m))$ .

Furthermore, part (a) holds even if $g$ is drawn from $u+g^{\prime}$ , for any fixed vector $u$ and $g^{\prime}\sim\mathcal{N}(0,1)^{n}$ .

Proof: First note that we must have $m\leq n$ , because otherwise $\{v_{1},v_{2},\dots,v_{m}\}$ cannot have the $\theta$ -orthogonal property for $\theta>0$ . For any $j\in[m]$ , we claim that

To see this, write $v_{j}=v_{j}^{\prime}+v_{j}^{\perp}$ , where $v_{j}^{\perp}$ is orthogonal to the span of $\{v_{1},v_{2},\dots,v_{j-1}\}$ . Since $j\in I$ , we have $\lVert v_{j}^{\perp}\rVert\geq\theta$ . Now given the vectors $v_{1},v_{2},\dots,v_{j-1}$ , the value $\langle v_{j}^{\prime},g\rangle$ is fixed, but $\langle v_{j}^{\perp},g\rangle$ is distributed as a Gaussian with variance $\theta^{2}$ (since $g$ is a Gaussian of unit variance in each direction).

Thus from a standard anti-concentration property for the one-dimensional Gaussian, $\langle v_{j},g\rangle$ cannot have a mass $>1/2$ in any $\theta^{2}$ length interval, in particular, it cannot lie in $[-\theta^{2}/2,\theta^{2}/2]$ with probability $>1/2$ . This proves Eq. (13). Now since this is true for any conditioning $v_{1},v_{2},\dots,v_{j-1}$ and for all $j$ , it follows (see Lemma B.3 for a formal justification) that

This completes the proof of the claim, part (a). Note that even if we had $g$ replaced by $u+g$ throughout, the anti-concentration property still holds (we have a shifted one-dimensional Gaussian), thus the proof goes through verbatim.

Let us now prove part (b). First note that if we denote by $M$ the $n\times m$ matrix whose columns are the $v_{i}$ , then part (a) deals with the distribution of $g^{T}MM^{T}g$ , where $g\sim\mathcal{N}(0,1)^{n}$ . Part (b) deals with the distribution of $g^{T}M^{T}Mg$ , where $g\sim\mathcal{N}(0,1)^{m}$ . But since the eigenvalues of $MM^{T}$ and $M^{T}M$ are precisely the same, due to the rotational invariance of Gaussians, these two quantities are distributed exactly the same way. This completes the proof. $\blacksquare$

Suppose we have random variables $X_{1},X_{2},\dots,X_{r}$ and an event $f(\cdot)$ which is defined to occur if its argument lies in a certain interval (e.g. $f(X)$ occurs iff $0<X<1$ ). Further, suppose we have $\mathop{\bf Pr\/}[f(X_{1})]\leq p$ , and $\mathop{\bf Pr\/}[f(X_{i})|X_{1},X_{2},\dots,X_{i-1}]\leq p$ for all $X_{1},X_{2},\dots,X_{i-1}$ . Then

Appendix C Applications to Mixture Models

Proof: We first bound the $\|\cdot\|_{\infty}$ norm of the difference of tensors i.e. we show that

C.2 Error Analysis for Multi-view Models

This implies that $|1-\alpha_{1}\alpha_{2}|<\delta/L_{\min}^{2}$ as required.

Now, let us assume $\beta_{1}>\sqrt{\delta}$ . This at once implies that $\beta_{2}<\sqrt{\delta}$ . Also

Now, using (15), we see that $\beta_{1}<\sqrt{\delta}$ . $\blacksquare$

C.3 Sampling Error Estimates for Gaussians

C.4 Recovering Weights in Gaussian Mixtures

Proof: From (C.4) and triangle inequality, we see that

Now, substituting the values for $\alpha_{1},\alpha_{2}$ , we see that