Provable learning of Noisy-or Networks

Sanjeev Arora, Rong Ge, Tengyu Ma, Andrej Risteski

Introduction

Unsupervised learning is important and potentially very powerful because of the huge amount of unlabeled data available, often several orders of magnitude more than the labeled data in many domains. Latent variable models, a popular approach in unsupervised learning, model the latent structures in data: the “structure” corresponds to some hidden variables, which probabilistically determine the values of the visible coordinates in data. Bayes nets model the dependency structure of latent and observable variables via a directed graph. Learning parameters of a latent variable model given data samples is often seen as a canonical definition of unsupervised learning. Unfortunately, finding parameters with the maximum likelihood is NP-hard even in very simple settings. However, in practice many of these models can be learnt reasonably well using algorithms without polynomial runtime guarantees, such as the expectation-maximization algorithm, Markov chain Monte Carlo, and variational inference. Bridging this gap between theory and practice is an important research goal.

Recently it has become possible to use matrix and tensor decomposition methods to design polynomial-time algorithms to learn some simple latent variable models such as topic models [AGM12, AGH+13], sparse coding models [AGMM15, MSS16], mixtures of Gaussians [HK13, GHK15], hidden Markov models [MR05], etc. These algorithms are guaranteed to work if the model parameters satisfy some conditions, which are reasonably realistic. In fact, matrix and tensor decomposition are a natural tool to turn to since they appear to be a sweet spot for theory whereby non-convex NP-hard problems can be solved provably under relatively clean and interpretable assumptions. But the above-mentioned recent results suggest that such methods apply only to solving latent variable models that are linear: specifically, they need the marginal of the observed variables conditioned on the hidden variables to depend linearly on the hidden variables. But many settings seem to call for nonlinearity in the model. For example, Bayes nets in many domains involve highly nonlinear operations on the latent variables, and could even have multiple layers. The study of neural networks also runs into nonlinear models such as restricted Boltzmann machines (RBM) [Smo86, HS06]. Can matrix factorization (or related tensor factorization) ideas help for learning nonlinear models?

We see that $1-\exp(-W_{ij}d_{j})$ can be thought of as the probability that $d_{j}$ activates symptom $s_{i}$, and $s_{i}$ is activated if at least one of the $d_{j}$’s activates it, which explains the name of the model, noisy-or. It follows that the conditional distribution $s\mid d$ is

One canonical use of this model is to model the relationship between diseases and symptoms, as in the classical human-constructed tool for medical diagnosis called Quick Medical Reference (QMR-DT) by Miller et al. [MPJM82] and Shwe et al. [SC91]. This textbook example ([JGJS99]) of a Bayes net captures relationships between $570$ binary disease variables (latent variables) and $4075$ observed binary symptom variables, with $45,470$ directed edges, and the $W_{ij}$’s are small integers. (We thank Randolph Miller and Vanderbilt University for providing the current version of this network for our research.) The name “noisy-or” derives from the fact that the probability that the OR of $m$ independent binary variables $y_{1},y_{2},\ldots,y_{m}$ is $1$ is exactly $1-\prod_{j}(\Pr[y_{j}=0])$. Noisy-or models are implicitly using this expression; specifically, for the $i$-th symptom we are considering the OR of $m$ events where the $j$-th event is “Disease $j$ causes symptom $i$”, which fails to occur with probability $\exp(-W_{ij}d_{j})$. Treating these events as independent leads to expression (1.1).
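As an illustration of the noisy-or expression described above, the following sketch computes the per-symptom activation probabilities and draws synthetic patients. It assumes a weight matrix `W` of shape $n\times m$ indexed as `W[i, j]` (symptom $i$, disease $j$) and independent Bernoulli($\rho$) diseases; the function names and toy sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

def symptom_probabilities(W, d):
    """P[s_i = 1 | d] = 1 - prod_j exp(-W[i, j] * d[j]) for every symptom i."""
    return 1.0 - np.exp(-W @ d)

def sample_patients(W, rho, num_samples, rng):
    """Draw (d, s) pairs: diseases i.i.d. Bernoulli(rho), symptoms noisy-or given d."""
    n, m = W.shape
    d = (rng.random((num_samples, m)) < rho).astype(float)
    p_on = 1.0 - np.exp(-d @ W.T)                      # shape (num_samples, n)
    s = (rng.random((num_samples, n)) < p_on).astype(int)
    return d, s

rng = np.random.default_rng(0)
# Sparse nonnegative toy weights (hypothetical values, not QMR-DT data).
W = rng.uniform(0.5, 2.0, size=(6, 3)) * (rng.random((6, 3)) < 0.5)
d = np.array([1.0, 0.0, 1.0])
p = symptom_probabilities(W, d)
```

A symptom with no edge to any active disease has activation probability exactly zero in this sketch, since the product over $j$ is then $1$.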

The parameters of the QMR-DT network were hand-estimated by consulting human experts, but it is an interesting research problem whether such networks can be created in an automated way using only samples of patient data (i.e., the $s$ vectors). Previously there were no approaches for this that work even heuristically at the required problem size ($n=4000$). (This learning problem should not be confused with the simpler problem of inferring the latent variables given the visible ones, which is also hard but has seen more work, including reasonable heuristic methods [JGJS99].) Halpern et al. [HS13, JHS13] have designed some algorithms for this problem. However, their first paper [HS13] assumes the graph structure is given. The second paper [JHS13] requires the Bayes network to be quartet-learnable, which is a strong assumption on the structure of the network. Finally, the problem of finding a “best-fit” Bayesian network according to popular metrics has been shown to be NP-complete by [Chi96] even when all of the hidden variables are also observed. (Researchers resort to these metrics because, when the graph structure is unknown, multiple structures may have the same likelihood, so maximum likelihood is not appropriate.)

where $\widetilde{W}_{ij}$ ’s are upper bounded by $\nu_{u}$ for some constant $\nu_{u}$ and are identically distributed according to a distribution $\mathcal{D}$ which satisfies that for some constant $\nu_{l}>0$ ,

The condition (1.2) intuitively requires that $\widetilde{W}_{ij}$ is bounded away from 0. We will assume that $p\leq 1/3$ and $\nu_{u}=O(1),\nu_{l}=\Omega(1)$. (Again, these are realistic for the QMR-DT setting.)

Recall that we mostly thought of the prior of the diseases $\rho$ as being on the order of $O(1/m)$. This means that even if $p$ is on the order of $1$, our relative error bound equals $O(1/\sqrt{m})\ll 1$.

Preliminaries and overview

We denote by ${\bf 0}$ the all-zeroes vector and ${\bf 1}$ the all-ones vector. $A^{+}$ will denote the Moore-Penrose pseudo-inverse of a matrix $A$ , and for symmetric matrices $A$ , we use $A^{-1/2}$ as a shorthand for $(A^{+})^{1/2}$ . The least non-zero singular value of matrix $A$ is denoted $\sigma_{\min}(A)$ .

For matrices $A,B$ we define the Kronecker product $\otimes$ as $(A\otimes B)_{ijkl}=A_{ij}B_{kl}$. A useful identity is that $(A\otimes B)\cdot(C\otimes D)=(AC)\otimes(BD)$ whenever the matrix multiplications are defined. Moreover, $A_{i}$ will denote the $i$-th column of matrix $A$ and $A^{i}$ the $i$-th row of matrix $A$.
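The mixed-product identity can be sanity-checked numerically. The sketch below uses NumPy's `np.kron`, which stores the four-index object $(A\otimes B)_{ijkl}$ as an ordinary matrix with row index $(i,k)$ and column index $(j,l)$; the matrix shapes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
A, C = rng.standard_normal((2, 3)), rng.standard_normal((3, 4))
B, D = rng.standard_normal((5, 2)), rng.standard_normal((2, 3))

lhs = np.kron(A, B) @ np.kron(C, D)   # (10 x 6) times (6 x 12)
rhs = np.kron(A @ C, B @ D)           # (2 x 4) kron (5 x 3), also 10 x 12
mixed_product_holds = np.allclose(lhs, rhs)
```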

We write $A\lesssim B$ if there exists a universal constant $c$ such that $A\leq cB$ and we define $\gtrsim$ similarly.

(We will sometimes shorten PMI3 and PMI2 to PMI when this causes no confusion.)

Our algorithm is given polynomially many samples from the model (each sample describing which symptoms are or are not present in a particular patient). It starts by computing the following $n\times n$ matrix PMI and $n\times n\times n$ tensor PMIT, which tabulate the correlations among all pairs and triples of symptoms (specifically, of the indicator random variables for the symptoms being absent):

Let $F_{k},G_{k}$ denote the $k$ th columns of the above $F,G$ . Then,

The proposition is proved later (with precise statement) in Section A by computing the moments by marginalization and using Taylor expansion to approximate the log of the moments, and ignoring terms $\rho^{3}$ and smaller. (Recall that $\rho$ is the probability that a patient has a particular disease, which should be small, of the order of $O(1/n)$ . The dependence of the final error upon $\rho$ appears in Section 3.) Since the tensor PMIT can be estimated to arbitrary accuracy given enough samples, the natural idea to recover the model parameters $W$ is to use Tensor Decomposition. This is what our algorithm does as well, except the following difficulties have to be overcome.

Difficulty 1: Suppose in equation (2.8) we view the first summand $S$, which is rank $m$ with components $F_{k}$’s, as the signal term. In all previous polynomial-time algorithms for tensor decomposition, the tensor is required to have the form $\sum_{k=1}^{m}F_{k}\otimes F_{k}\otimes F_{k}+\text{noise}$. To make our problem fit this template we could consider the second summand $E$ as the “noise”, especially since it is multiplied by $\rho\ll 1$, which tends to make $E$ have smaller norm than $S$. But this is naive and incorrect, since $E$ is very structured: it is more appropriately viewed as systematic error. (In particular this error doesn’t go down in norm as the number of samples goes to infinity.) In order to do tensor decomposition in the presence of such systematic error, we will need both a delicate error analysis and a very robust tensor decomposition algorithm. These will be outlined in Section 2.3.

Difficulty 2: To get our problem into a form suitable for tensor decomposition requires a whitening step, which uses the robust estimate of the whitening matrix from the second moment matrix. In this case, the whitening matrix has to be extracted out of the PMI matrix, which itself suffers from a systematic error. This also is not handled in previous works, and requires a delicate control of the error. See Section 2.4 for more discussion.

Difficulty 3: There is another source of inexactness in equation (2.8), namely the approximation is only true for those entries with distinct indices: for example, the diagonal entry $\textup{PMI}_{ii}$ has a completely different formula from that for $\textup{PMI}_{ij}$ when $i\neq j$. This will complicate the algorithm, as described in Subsections 2.3 and 2.4.

The next few Subsections sketch how to overcome these difficulties, and the details appear in the rest of the paper.

2.2 Recovering matrices in presence of systematic error

In this Section we recall the classical method of approximately recovering a matrix given noisy estimates of its entries. We discuss how to adapt that method to our setting where the error in the estimates is systematic and does not go down even with many more samples. The next section sketches an extension of this method to tensor decomposition with systematic error.

In the classical setting, there is an unknown $n\times n$ matrix $S$ of rank $m$ and we are given $S+E$ where $E$ is an error matrix. The method to recover $S$ is to compute the best rank-$m$ approximation to $S+E$. The quality of this approximation was studied by Davis and Kahan [DK70] and Wedin [Wed72], and many subsequent authors. The quality of the recovery depends upon the ratio $\|E\|/\sigma_{m}(S)$, where $\sigma_{m}(\cdot)$ denotes the $m$-th largest singular value and $\|\cdot\|$ denotes the spectral norm. To make this familiar lemma fit our setting more exactly, we will phrase the problem as trying to recover a matrix $S$ given the noisy estimate $SS^{\top}+E$. Now one can only recover $S$ up to rotation, and the following lemma describes the error in the Davis-Kahan recovery. It also plays a key role in the error analysis of the usual algorithm for tensor decomposition.

In the above setting, let $K,\widehat{K}$ be the subspaces spanned by the top $m$ eigenvectors of $SS^{\top}$ and $SS^{\top}+E$, respectively. Let $\varepsilon$ be such that $\|E\|\leq\varepsilon\cdot\sigma_{m}(SS^{\top})$. Then $\left\lVert\textup{Id}_{K}-\textup{Id}_{\widehat{K}}\right\rVert\lesssim\varepsilon$, where Id is the identity transformation on the subspace in question.
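The lemma can be checked numerically on a toy instance. The sketch below assumes a synthetic Gaussian $S$ and a symmetric noise matrix rescaled so that $\|E\|=\varepsilon\cdot\sigma_{m}(SS^{\top})$; it then measures $\lVert\textup{Id}_{K}-\textup{Id}_{\widehat{K}}\rVert$ as the spectral norm of the difference of the two orthogonal projectors. All names and sizes are illustrative.

```python
import numpy as np

def top_eigenspace_projector(M, m):
    """Orthogonal projector onto the span of the top-m eigenvectors of symmetric M."""
    _, vecs = np.linalg.eigh(M)            # eigenvalues come back in ascending order
    U = vecs[:, -m:]
    return U @ U.T

rng = np.random.default_rng(0)
n, m = 30, 4
S = rng.standard_normal((n, m))
signal = S @ S.T
sigma_m = np.linalg.eigvalsh(signal)[-m]   # m-th largest eigenvalue of S S^T

noise = rng.standard_normal((n, n))
noise = (noise + noise.T) / 2              # symmetric perturbation
eps = 0.01
E = noise * (eps * sigma_m / np.linalg.norm(noise, 2))   # now ||E|| = eps * sigma_m

gap = np.linalg.norm(
    top_eigenspace_projector(signal, m) - top_eigenspace_projector(signal + E, m), 2
)
```

In this regime `gap` stays within a small constant multiple of `eps`, which is the behaviour the lemma describes.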

The Lemma thus treats $\|E\|/\sigma_{m}(SS^{\top})$ as the definition of the noise/signal ratio. Before we generalize the definition and the algorithm to handle systematic error, it is good to get some intuition by looking at (2.6): $\textup{PMI}\approx\rho(FF^{\top}+\rho GG^{\top})$. Thinking of the first term as signal and the second as error, let’s check how bad the noise/signal ratio defined in Davis-Kahan is. The “signal” is $\sigma_{m}(FF^{\top})$, which is smaller than $n$ since the trace of $FF^{\top}$ is of the order of $mn$ in our probabilistic model for the weight matrix. The “noise” is the norm of $\rho GG^{\top}$, which is large since the $G_{k}$’s are nonnegative vectors with entries of the order of 1, and therefore the quadratic form $\langle\frac{1}{\sqrt{n}}{\bf 1},\rho GG^{\top}\frac{1}{\sqrt{n}}{\bf 1}\rangle$ can be as large as $\rho\sum_{k}\langle G_{k},\frac{1}{\sqrt{n}}{\bf 1}\rangle^{2}\approx\rho mn$. Thus the Davis-Kahan noise/signal ratio is $\rho m$, and so when $\rho m\ll 1$, it allows recovering the subspace of $F$ with error $O(\rho m)$. Note that this is a vacuous bound since $\rho$ needs to be at least $1/m$ so that the hidden variable $d$ contains 1 non-zero entry on average. We’ll argue that this error is too pessimistic and we can in fact drive the estimation error down to close to $\rho$.

The smallest such $\tau$ is the “error/signal ratio” for this recovery problem.

This definition differs from Davis-Kahan’s because of the $\tau SS^{\top}$ term on the right hand side of (2.9). This allows, for any unit vector $x$, the quadratic form value $x^{\top}Ex$ to be as large as $\tau(x^{\top}SS^{\top}x+\sigma_{m}(SS^{\top}))$. Thus for example the ${\bf 1}$ vector no longer causes a large noise/signal ratio since both quadratic forms $FF^{\top}$ and $GG^{\top}$ have large values on it.
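To make the quantity concrete, here is a minimal sketch that computes the smallest admissible $\tau$ numerically. It assumes condition (2.9) has the form suggested by the quadratic-form description above, namely $E\preceq\tau\,(SS^{\top}+\sigma_{m}(SS^{\top})\,\textup{Id})$, and it uses random nonnegative matrices as hypothetical stand-ins for $F$ and $G$; the function name is illustrative.

```python
import numpy as np

def spectral_bound_ratio(E, S):
    """Smallest tau >= 0 with E <= tau * (S S^T + sigma_m(S S^T) * Id) in the PSD order.

    Assumes E is symmetric and S (n x m) has full column rank.
    """
    n, m = S.shape
    SSt = S @ S.T
    sigma_m = np.linalg.eigvalsh(SSt)[-m]
    M = SSt + sigma_m * np.eye(n)                  # positive definite
    w, V = np.linalg.eigh(M)
    M_inv_half = (V / np.sqrt(w)) @ V.T            # M^{-1/2}
    top = np.linalg.eigvalsh(M_inv_half @ E @ M_inv_half)[-1]
    return max(top, 0.0)

rng = np.random.default_rng(0)
n, m = 60, 5
F = rng.random((n, m))                             # nonnegative toy "signal" factors
G = rng.random((n, m))                             # nonnegative toy "error" factors
tau = spectral_bound_ratio(G @ G.T, F)
davis_kahan_ratio = np.linalg.norm(G @ G.T, 2) / np.linalg.eigvalsh(F @ F.T)[-m]
```

Because both `F` and `G` are large in the all-ones direction, `tau` comes out well below `davis_kahan_ratio` on instances like this one.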

This new error/signal ratio is no larger than the Davis-Kahan ratio, but can potentially be much smaller. Now we show how to do a better analysis of the Davis-Kahan recovery in terms of it. The proof of this theorem appears in Section 4.

Empirically, we can compute the $\tau$ value for the weight matrix $W$ in the QMR-DT dataset [SC91], which is a textbook application of the noisy-or network. For the QMR-DT dataset, $\tau$ is under $6$. This implies that the recovery error of the subspace of $F$ guaranteed by Theorem 2.4 is bounded by $O(\tau\rho)\approx\rho$, whereas the error bound by Davis-Kahan is $O(\rho m)$.

2.3 Tensor decomposition with systematic error

Now we extend the insight from the matrix case to tensor recovery under systematic error. It turns out condition (2.9) is also a good measure of error/signal for the tensor recovery problem of (2.8). Specifically, if $G$ is $\tau$-bounded by $F$, then we can recover the components $F_{k}$’s from the PMIT with column-wise error $O(\rho\tau^{3/2}\sqrt{m})$. This requires a non-trivial algorithm (instead of SVD), and the additional gain is that we can recover $F_{k}$’s individually, instead of only obtaining the subspace with the PMI matrix.

First we recall the prior state of the art for the error analysis of tensor decomposition with Davis-Kahan type bounds. The best error bounds involve measuring the magnitude of the noise matrix $Z$ in a new way. For any $n_{1}\times n_{2}\times n_{3}$ tensor $T$ , we define the $\lVert\cdot\rVert_{\{1\}\{2,3\}}$ norm as

There is a polynomial-time algorithm (Algorithm 2 later) which has the following guarantee. Suppose tensor $T$ is of the form

But in our setting the noise tensor has systematic error. An analog of Theorem 2.4 in this setting is complicated because even the whitening step is nontrivial. Recall also the inexactness in Proposition 2.1 due to the diagonal terms, which we earlier called Difficulty 3. We address this difficulty in the algorithm by setting up the problem using a sub-tensor of the PMI tensor. Let $S_{a},S_{b},S_{c}$ be a uniformly random equipartition of the set of indices $[n]$ . Let

where $F_{k,S}$ denotes the restriction of vector $F_{k}$ to subset $S$ . Moreover, let

Then, since the sub-tensor $\textup{PMIT}_{S_{a},S_{b},S_{c}}$ only contains entries with distinct indices, we can use Taylor expansion (see Lemma A.1) to obtain that

Here the second summand on the RHS corresponds to the second order term in the Taylor expansion. It turns out that the higher order terms are multiplied by $\rho^{3}$ and thus have negligible Frobenius norm, and therefore discussion below will focus on the first two summands.

For simplicity, let $T=\textup{PMIT}_{S_{a},S_{b},S_{c}}$ . Our goal is to recover the components $a_{k},b_{k},c_{k}$ from the approximate low-rank tensor $T$ .

The first step is to whiten the components $a_{k}$’s, $b_{k}$’s and $c_{k}$’s. Recall that $a_{k}=F_{k,S_{a}}$ is a non-negative vector. This implies the matrix $A=[a_{1},\dots,a_{m}]$ must have a significant contribution in the direction of the vector ${\bf 1}$, and thus is far away from being well-conditioned. For the purpose of this section, we assume for simplicity that we can access the covariance matrix defined by the vectors $a_{k}$,

Similarly we assume access to $\bar{Q}_{b}$ and $\bar{Q}_{c}$, which are defined analogously. In Section 2.4 we discuss how to obtain these three matrices approximately.

Then, we can compute the whitened tensor by applying transformation $(\bar{Q}_{a}^{+})^{1/2},(\bar{Q}_{b}^{+})^{1/2},(\bar{Q}_{c}^{+})^{1/2}$ along the three modes of the tensor $T$ ,

Now the first summand is a low rank orthogonal tensor, since $(\bar{Q}_{a}^{+})^{1/2}a_{k}$ ’s are orthonormal vectors. However, the term $Z$ is a systematic error and we use the following Lemma to control its $\lVert\cdot\rVert_{\{1\}\{2,3\}}$ norm.
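The whitening step can be sketched on a noiseless toy tensor. The example assumes exact access to $AA^{\top}$, $BB^{\top}$, $CC^{\top}$ in place of $\bar{Q}_{a},\bar{Q}_{b},\bar{Q}_{c}$ (whose exact definition, including any scaling, is not reproduced here), and uses random nonnegative matrices as hypothetical components.

```python
import numpy as np

def pinv_sqrt(Q, tol=1e-10):
    """(Q^+)^{1/2} for a symmetric PSD matrix Q, dropping near-zero eigenvalues."""
    w, V = np.linalg.eigh(Q)
    keep = w > tol * w.max()
    return (V[:, keep] / np.sqrt(w[keep])) @ V[:, keep].T

rng = np.random.default_rng(0)
n, m = 12, 3
A, B, C = (rng.random((n, m)) for _ in range(3))          # nonnegative toy components
T = np.einsum("ik,jk,lk->ijl", A, B, C)                    # sum_k a_k (x) b_k (x) c_k

Wa, Wb, Wc = (pinv_sqrt(X @ X.T) for X in (A, B, C))
T_white = np.einsum("ia,jb,lc,abc->ijl", Wa, Wb, Wc, T)    # apply along the three modes

A_white = Wa @ A                                           # columns (Q_a^+)^{1/2} a_k
```

With exact second moments the whitened columns are exactly orthonormal; the later robust-whitening discussion concerns what happens when only an approximation is available.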

Lemma 2.7 shows that to give an upper bound on the $\left\lVert\cdot\right\rVert_{\{1\}\{2,3\}}$ norm of the error tensor $Z$ , it suffices to show that the square of the components of the error, namely, $\Gamma\Gamma^{\top},\Delta\Delta^{\top},\Theta\Theta^{\top}$ are $\tau$ -spectrally bounded by the components of the signal $A,B,C$ respectively. This will imply that $\lVert Z\rVert_{\{1\}\{2,3\}}\leq(2\tau)^{3/2}\rho^{2}$ .
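For intuition about the norm being bounded here, the following sketch assumes the usual reading of the $\lVert\cdot\rVert_{\{1\}\{2,3\}}$ notation: the spectral norm of the $n_{1}\times(n_{2}n_{3})$ matrix obtained by flattening the last two modes. That reading is an assumption, since the definition itself is not reproduced above.

```python
import numpy as np

def norm_1_23(T):
    """Spectral norm of the n1 x (n2*n3) unfolding of a 3-way tensor T."""
    n1 = T.shape[0]
    return np.linalg.norm(T.reshape(n1, -1), 2)

rng = np.random.default_rng(0)
x, y, z = rng.standard_normal(4), rng.standard_normal(5), rng.standard_normal(6)
rank_one = np.einsum("i,j,k->ijk", x, y, z)

# Orthogonal tensor sum_k e_k (x) e_k (x) e_k on three basis vectors.
orthogonal = np.zeros((3, 3, 3))
for k in range(3):
    orthogonal[k, k, k] = 1.0
```

Under this reading a rank-one tensor has norm $\|x\|\|y\|\|z\|$ and an orthogonal tensor with unit components has norm $1$, which is the scale against which the bound on $Z$ should be compared.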

Recall that $A$ and $\Gamma$ are two sub-matrices of $F$ and $G$ . We have shown that $GG^{\top}$ is $\tau$ -spectrally bounded by $F$ in Proposition 2.5. It follows straightforwardly that the random sub-matrices also have the same property.

In the setting of this section, under the generative model for $W$, w.h.p., we have that $\Gamma\Gamma^{\top}$ is $\tau$-spectrally bounded by $A$ with $\tau=O(\log n)$. The same is true for the other two modes.

Using Proposition 2.8 and Lemma 2.7, we have that

Then using Theorem 2.6 on the tensor $(\bar{Q}_{a}^{+})^{1/2}\otimes(\bar{Q}_{b}^{+})^{1/2}\otimes(\bar{Q}_{c}^{+})^{1/2}\cdot T$ , we can recover the components $(\bar{Q}_{a}^{+})^{1/2}a_{k}$ ’s, $(\bar{Q}_{b}^{+})^{1/2}b_{k}$ ’s, and $(\bar{Q}_{c}^{+})^{1/2}c_{k}$ ’s. This will lead us to recover $a_{k}$ , $b_{k}$ and $c_{k}$ , and finally to recover the weight matrix $W$ .

2.4 Robust whitening

In the previous subsection, we assumed access to $\bar{Q}_{a},\bar{Q}_{b},\bar{Q}_{c}$ (defined in (2.13)), which turns out to be highly non-trivial. A priori, using equation (2.6), noting that $A=[F_{1,S_{a}},\dots,F_{m,S_{a}}]$, we have

However, this approximation can be arbitrarily bad for the diagonal entries of PMI since equation (2.6) only works for entries with distinct indices. (Recall that this is why we divided the index set into $S_{a},S_{b},S_{c}$ and studied the asymmetric tensor in the previous subsection.) Moreover, the diagonal of the matrix $\bar{Q}_{a}$ contributes to its spectrum significantly and therefore we cannot get meaningful bounds (in spectral norm) by ignoring the diagonal entries.

This issue turns out to arise in most of the previous tensor papers and the solution was to compute $AA^{\top}$ by using the asymmetric moments $AB^{\top},BC^{\top},CA^{\top}$ ,

Typically $AB^{\top},BC^{\top},CA^{\top}$ can be estimated with arbitrarily small error (as the number of samples goes to infinity) and therefore the equation above leads to an accurate estimate of $AA^{\top}$. However, in our case the errors in the estimates $\textup{PMI}_{S_{a},S_{b}}\approx AB^{\top}$, $\textup{PMI}_{S_{b},S_{c}}\approx BC^{\top}$, $\textup{PMI}_{S_{c},S_{a}}\approx CA^{\top}$ are systematic. Therefore, we need to use a more delicate analysis to control how the error accumulates in the estimate,

Here again, to get an accurate bound, we need to understand how the error $\textup{PMI}_{S_{a},S_{b}}-AB^{\top}$ behaves relative to $AB^{\top}$ on a direction-by-direction basis. We generalize Definition 2.3 to capture the asymmetric spectral boundedness of the error by the signal.

Let $K$ be the column subspace of $B$ and $H$ be the column subspace of $C$. Then we have $\Delta_{1}=B^{+}E(C^{\top})^{+}$, $\Delta_{2}=B^{+}E\textup{Id}_{H^{\perp}}$, $\Delta_{3}=\textup{Id}_{K^{\perp}}E(C^{\top})^{+}$, $\Delta_{4}=\textup{Id}_{K^{\perp}}E\textup{Id}_{H^{\perp}}$. Intuitively, they measure the relative relationship between $E$ and $B,C$ in different subspaces. For example, $\Delta_{1}$ is the relative perturbation in the column subspace $K$ and row subspace $H$. When $B=C$, this is equivalent to the definition in the symmetric setting (this will be clearer in the proof of Theorem 2.4).

where $E_{ab},E_{bc},E_{ca}$ are $\varepsilon$-spectrally bounded by $(A,B)$, $(B,C)$, $(C,A)$ respectively. Then, the matrix

is a good approximation of $AA^{\top}$ in the sense that $Q_{a}-AA^{\top}=\Sigma_{ab}[\Sigma_{bc}^{\top}]_{m}^{+}\Sigma_{ca}-AA^{\top}$ is $O(\varepsilon)$-spectrally bounded by $A$. Here $[\Sigma]_{m}$ denotes the best rank-$m$ approximation of $\Sigma$.
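The estimator $\Sigma_{ab}[\Sigma_{bc}^{\top}]_{m}^{+}\Sigma_{ca}$ can be sketched directly. In the noiseless case, with $\Sigma_{ab}=AB^{\top}$, $\Sigma_{bc}=BC^{\top}$, $\Sigma_{ca}=CA^{\top}$ and $A,B,C$ of full column rank, it returns $AA^{\top}$ exactly, which the example below verifies on hypothetical random factors. The helper names are illustrative, and the pseudo-inverse of the rank-$m$ truncation is formed from a truncated SVD.

```python
import numpy as np

def rank_m_pinv(M, m):
    """Pseudo-inverse of the best rank-m approximation [M]_m, via truncated SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (Vt[:m].T / s[:m]) @ U[:, :m].T

def whitening_estimate(Sigma_ab, Sigma_bc, Sigma_ca, m):
    """Q_a = Sigma_ab [Sigma_bc^T]_m^+ Sigma_ca, built without any diagonal PMI entries."""
    return Sigma_ab @ rank_m_pinv(Sigma_bc.T, m) @ Sigma_ca

rng = np.random.default_rng(0)
n, m = 15, 4
A, B, C = (rng.random((n, m)) for _ in range(3))   # nonnegative toy factors
Q_exact = whitening_estimate(A @ B.T, B @ C.T, C @ A.T, m)
```

The exact identity follows because $(CB^{\top})^{+}=(B^{\top})^{+}C^{+}$ when $B$ and $C$ have full column rank, so the middle factors cancel.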

The theorem is non-trivial even if we have an absolute error assumption, that is, even if $\lVert E_{bc}\rVert\leq\tau\sigma_{\min}(B)\sigma_{\min}(C)$, which is a stronger condition than requiring that $E_{bc}$ be $\tau$-spectrally bounded by $(B,C)$. Suppose we establish bounds on $\lVert\Sigma_{ab}-AB^{\top}\rVert$, $\|\Sigma_{bc}^{+\top}-(BC^{\top})^{+}\|$ and $\lVert\Sigma_{ca}-CA^{\top}\rVert$ individually, and then put them together in the obvious way to control the error $\Sigma_{ab}[\Sigma_{bc}^{\top}]_{m}^{+}\Sigma_{ca}-AB^{\top}(BC^{\top})^{+}CA^{\top}$. Then the error will be too large for us. This is because standard matrix perturbation theory gives that $\|\Sigma_{bc}^{-\top}-(BC^{\top})^{-1}\|$ can be bounded by $O\left(\|E_{bc}\|\|(BC^{\top})^{-1}\|^{2}\right)\lesssim\varepsilon/[\sigma_{\min}(B)\sigma_{\min}(C)]$, which is tight. If we then multiply this error by the norms of the other two terms, the error will be roughly $\varepsilon\cdot\frac{\sigma_{\max}(B)\sigma_{\max}(C)}{\sigma_{\min}(B)\sigma_{\min}(C)}$. That is, we will lose a factor of the condition number of $B,C$, which can be dimension dependent in our case.

The fix to this problem is to avoid bounding each term in $\Sigma_{ab}[\Sigma_{bc}^{\top}]_{m}^{+}\Sigma_{ca}$ individually. To do this, we will take the cancellation of these terms into account. Technically, we re-decompose the product $\Sigma_{ab}[\Sigma_{bc}^{\top}]_{m}^{+}\Sigma_{ca}$ into a new product of three matrices $(\Sigma_{ab}B^{+})(B[\Sigma_{bc}^{\top}]_{m}^{+}C)(C^{+}\Sigma_{ca})$ , and then bound the error in each of these terms instead. See Section C for details.

As a corollary, we conclude that the whitened vectors $(Q_{a}^{+})^{1/2}a_{i}$ ’s are indeed approximately orthonormal.

In the setting of Theorem 2.10, we have that $(Q_{a}^{+})^{1/2}A$ contains approximately orthonormal vectors as columns, in the sense that

Therefore we have found an approximate whitening matrix for $A$ even though we do not have access to the diagonal entries.

Main Algorithms and Results

As sketched in Section 2, our main algorithm (Algorithm 1) uses tensor decomposition on the PMI tensor. In this section, we describe the different steps and how they fit together. Subsequently, all steps will be analyzed in separate sections.

Suppose the true $W$ is generated from the random model in Section 1 with $\rho pm\leq c$ for some sufficiently small constant $c$. Then given $N=\operatorname{poly}(n,1/p,1/\rho)$ examples, Algorithm 1 returns a weight matrix $\widehat{W}$ in polynomial time that satisfies

We also define the incoherence of a matrix $F$. Roughly speaking, it says that the left singular vectors of $F$ don’t correlate with any of the natural basis vectors much more than the average.

We assume the weight matrix $W$ satisfies the following deterministic assumptions:

1. $GG^{\top},HH^{\top},LL^{\top}$ are $\tau$-spectrally bounded by $F$ for $\tau\geq 1$.

2. $F$ is $\mu$-incoherent with $\mu\leq\widetilde{O}(\sqrt{n/m})$.

3. If $\max_{i}\|F_{i}\|_{0}\leq pn$, with high probability over the choice of a subset $S_{a}$, $|S_{a}|=n/3$, $\sigma_{\min}(F_{S_{a}})\gtrsim\sqrt{np}$ and $\rho pm\leq c$ for some sufficiently small constant $c$.

Suppose the matrix $W$ satisfies the conditions 1-3 above. Given polynomially many samples, Algorithm 1 returns $\widehat{W}$ in polynomial time, s.t.

The proofs of the theorems use the overall strategy of Section 2, and are deferred to Section D. We give a high level outline that demonstrates how the proofs depend on the machinery built in the subsequent sections.

Theorem 3.1 and Theorem 3.3 are proved similarly, the only technical difference being how the third and higher order terms are bounded. (Because of the generative model assumption, for Theorem 3.1 we can get a more precise control on them.) Hence, we will not distinguish between them in the coming overview.

Overall, we will follow the approach outlined in Section 2. Let us step through Algorithm 1 line by line:

The overall goal will be to recover the leading terms of the PMI tensor. Of course, we only get samples, so we can merely obtain an empirical version of it. In Section E, we show that the simple plug-in estimator does the job, and does so with polynomially many samples.
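A plug-in estimator of the pairwise quantity can be sketched as follows. The sketch assumes the standard pointwise-mutual-information form for the "symptom absent" indicators, $\log\frac{\Pr[s_{i}=0\wedge s_{j}=0]}{\Pr[s_{i}=0]\Pr[s_{j}=0]}$, with probabilities replaced by empirical frequencies; this form is an assumption on our part, and the analogous three-way tensor is omitted. The function name is hypothetical.

```python
import numpy as np

def empirical_pmi_matrix(s):
    """Plug-in pairwise PMI of the 'symptom absent' indicators.

    s: (N, n) 0/1 array, one row per patient. Entry (i, j) of the result is
    log( P[s_i=0, s_j=0] / (P[s_i=0] * P[s_j=0]) ) using empirical frequencies.
    Assumes every pair of symptoms is jointly absent in at least one sample.
    """
    absent = 1.0 - np.asarray(s, dtype=float)
    N = absent.shape[0]
    p_single = absent.mean(axis=0)          # empirical P[s_i = 0]
    p_pair = absent.T @ absent / N          # empirical P[s_i = 0 and s_j = 0]
    return np.log(p_pair) - np.log(np.outer(p_single, p_single))

rng = np.random.default_rng(0)
s_indep = (rng.random((50000, 5)) < 0.3).astype(int)   # independent toy symptoms
pmi_indep = empirical_pmi_matrix(s_indep)
```

For independent symptoms the off-diagonal entries concentrate around zero, while the diagonal entries equal $-\log\Pr[s_{i}=0]$, consistent with the remark that diagonal entries follow a different formula.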

Recall Difficulty 3 from Section 2: the PMI tensor and matrix expressions are only accurate on the off-diagonal entries. In order to address this, in Section 2.3 we passed to a sub-tensor of the original tensor by partitioning the symptoms into three disjoint sets, and considering the tensor induced by this partition.
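The partitioning step is simple to sketch. The example below draws a uniformly random split of the index set into three (near-)equal parts and extracts the induced sub-tensor, so that every retained entry has three distinct indices. A random array stands in for the estimated PMI tensor, and the function name is illustrative.

```python
import numpy as np

def random_equipartition(n, rng):
    """Split {0, ..., n-1} uniformly at random into three (near-)equal parts."""
    perm = rng.permutation(n)
    return [np.sort(part) for part in np.array_split(perm, 3)]

rng = np.random.default_rng(0)
n = 9
pmit = rng.standard_normal((n, n, n))     # stand-in for an estimated PMI tensor
S_a, S_b, S_c = random_equipartition(n, rng)
T = pmit[np.ix_(S_a, S_b, S_c)]           # entries (i, j, k) with i in S_a, j in S_b, k in S_c
```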

In order to apply the robust tensor decomposition algorithm from Section 5, we need to first calculate whitening matrices. This is necessarily complicated by the fact that the diagonals of the PMI matrix are not accurate, as discussed in Section 2.4. Section C gives guarantees on the procedure for calculating the whitening matrices.

This is the main component of the algorithm: the robust tensor decomposition machinery. In Section 5, the conditions and guarantees for the success of the algorithm are formalized. There, we deal with the difficulties laid out in Section 2.2: namely that we have a substantial systematic error that we need to handle (both due to higher-order terms, and due to the missing diagonal entries).

This step, along with Step 6, is a post-processing step – which allows us to recover the weight matrix $W$ after we have recovered the leading terms of the PMI tensor.

We also give a short quantitative sense of the guarantee of the algorithm. (The reader can find the full proof in Section D.)

To get quantitative bounds, we will first need a handle on spectral properties of the random model: these are located in Section B. As we mentioned above, the main driver of the algorithm is step 4, which uses our robust tensor decomposition machinery in Section 5. To apply the machinery, we first need to show that the second (and higher) order terms of the PMI tensor are spectrally bounded. This is done by applying Proposition B.4, which roughly shows the higher-order terms are $O(\rho\log n)$-spectrally bounded by $\rho FF^{\top}$. The whitening matrices are calculated using machinery in Section C. We can apply these tools since the random model gives rise to an $O(1)$-incoherent $F$ matrix as shown in Lemma B.3.

To get a final sense of what the guarantee is, the $\ell_{2}$ error which step 4 gives, via Theorem 5.4, roughly behaves like $\sqrt{\sigma_{\max}}\tau^{3/2}$, where $\sigma_{\max}$ is the spectral norm of the whitening matrices and $\tau$ is the spectral boundedness parameter. But, by Lemma D.1, $\sigma_{\max}$ is approximately the spectral norm of $\rho FF^{\top}$, which on the other hand by Lemma B.1 is on the order of $mnp^{2}\rho$. Plugging in these values, we get the theorem statement.

Finding the Subspace under Heavy Perturbations

In this section, we show that even if we perturb a matrix $SS^{\top}$ with an error $E$ whose spectral norm might be much larger than $\sigma_{\min}(SS^{\top})$, as long as $E$ is spectrally bounded the top singular subspace of $S$ is still preserved. We defer the proof of the asymmetric case (Theorem 2.10) to Section C. We note that this type of perturbation bound, often called a relative perturbation bound, has been studied in [Ips98, Li98a, Li98b, Li97]. The results in these papers require either that the signal matrix is full rank or that the perturbation matrix has strong structure. We believe our results are new and the way that we phrase the bound makes the application to our problem convenient. We recall Theorem 2.4, which was originally stated in Section 2.

Therefore, we have that $\lVert B\rVert^{2}\leq\varepsilon\sigma_{\min}(AA^{\top})$ . Moreover, we also have

Let $P=\left(\textup{Id}_{m}+SS^{\top}\right)^{1/2}$ . Then we write $AA^{\top}+E$ as,

Let $\widehat{A}=(AP+BS^{\top}P^{-1})$ . Let $K^{\prime}$ be the column span of $\widehat{A}$ . We first prove that $\widehat{K}$ is close to $K^{\prime}$ . Note that

Moreover, we have $\sigma_{\min}(\widehat{A}\widehat{A}^{\top})=\sigma_{\min}(\widehat{A})^{2}\geq\left(\sigma_{\min}(AP)-\lVert BS^{\top}P^{-1}\rVert\right)^{2}\geq(1-O(\varepsilon))\sigma_{\min}(A)^{2}$. Therefore, using Wedin’s Theorem (Lemma F.2) on equation (4.1), we have that

Next we show $K^{\prime}$ and $K$ are also close. We have

Therefore, by Wedin’s Theorem, $K^{\prime}$ , as the span of top $m$ left singular vectors of $\widehat{A}$ , is close to the span of the top left singular vector of $AP$ , namely, $K$

Therefore, using equations (4.2) and (4.3) and the triangle inequality, we complete the proof. ∎

Robust Tensor Decomposition with Systematic Error

In this section we discuss how to robustly find the tensor decomposition even in presence of systematic error. We first illustrate the main techniques in an easier setting of orthogonal tensor decomposition (Section 5.1), then we describe how it can be generalized to the general setting that we require for our algorithm (Section 5.2).

We start with decomposing an orthogonal tensor with systematic error. The algorithm we use here is a slightly more general version of an algorithm in [MSS16].

Suppose $\{u_{i}\},\{v_{i}\},\{w_{i}\}$ are three collections of $\varepsilon$-approximately orthonormal vectors. Suppose tensor $T$ is of the form

The Theorem is a direct extension of [MSS16, Theorem 10.2] to the asymmetric and approximately orthogonal case. We only provide a proof sketch here. We start by writing

Since $\left\lVert Z\right\rVert_{\{2\}\{1,3\}}\leq\tau$ and $\left\lVert Z\right\rVert_{\{1\}\{2,3\}}\leq\tau$, [MSS16, Theorem 6.5] implies that with probability at least $1-d^{-2}$ over the choice of $g$,

Let $t=2\sqrt{\log d}$. We have that with probability $1/(d^{1+\delta}\log^{O(1)}d)$, $\langle g,w_{1}\rangle\geq(1+\delta/3)t$ and $\langle g,w_{j}\rangle\leq t$ for every $j\neq 1$. We condition on these events. Let $\bar{u}_{i}$ be a set of orthonormal vectors such that $E_{u}=[u_{1},\dots,u_{m}]-[\bar{u}_{1},\dots,\bar{u}_{m}]$ satisfies $\lVert E_{u}\rVert\leq\varepsilon$ (we can take $\bar{u}_{i}$’s to be the whitening of $u_{i}$’s). Similarly define $\bar{v}_{i}$’s. Then we have that the term $M_{S}$ (defined in equation (5.1)) can be written as $\sum_{i}\langle g,w_{i}\rangle\bar{u}_{i}\bar{v}_{i}^{\top}+E^{\prime}$ where $\lVert E^{\prime}\rVert\lesssim\varepsilon$. Let $\bar{M}_{S}=\sum_{i}\langle g,w_{i}\rangle\bar{u}_{i}\bar{v}_{i}^{\top}$. Then $\bar{M}_{S}$ has top singular value $\langle g,w_{1}\rangle\geq(1+\delta/3)t$, and second singular value at most $t$. Moreover, the term $M_{g}+E^{\prime}$ has spectral norm bounded by $O(\tau+\varepsilon)$. Thus by Wedin’s Theorem (Lemma F.2), the top left and right singular vectors $u,v$ of $M_{S}+M_{g}=\bar{M}_{S}+M_{g}+E^{\prime}$ are $O((\tau+\varepsilon)/\delta)$-close to $\bar{u}_{1}$ and $\bar{v}_{1}$ respectively. They are also $O((\tau+\varepsilon)/\delta)$-close to $u_{1},v_{1}$ since $u_{1}$ is close to $\bar{u}_{1}$ (and likewise $v_{1}$ to $\bar{v}_{1}$). Moreover, we have that $(u^{\top}\otimes v^{\top}\otimes\textup{Id})\cdot T$ is $O(\tau/\delta)$-close to $w_{1}$.

Therefore, with probability $1/(d^{1+\delta}\log^{O(1)}d)$, each round of the for loop in Algorithm 2 will find $u_{1},v_{1},w_{1}$. Line 5 is used to verify whether the resulting vectors are indeed good, using the injective norm as a test. It can be shown that if the test is passed then $(u,v,z)$ is close to one of the components. Therefore, after $d^{1+\delta}\log^{O(1)}d$ iterations, with high probability, we can find all of the components.
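The random-contraction step in the proof sketch can be illustrated on a noiseless instance. The following is a minimal sketch, not Algorithm 2 itself: it assumes an exactly orthogonal tensor with no systematic error, omits the verification test of Line 5, and the function name `random_contraction_round`, the dimensions, and the seed are hypothetical choices made for illustration.

```python
import numpy as np

def random_contraction_round(T, rng):
    """One round: contract the third mode with a Gaussian g, then read off
    the top singular pair of the resulting matrix."""
    g = rng.standard_normal(T.shape[2])
    M_g = np.einsum("ijk,k->ij", T, g)        # M_g = sum_k g_k T[:, :, k]
    U, s, Vt = np.linalg.svd(M_g)
    u, v = U[:, 0], Vt[0, :]                  # top left / right singular vectors
    w = np.einsum("ijk,i,j->k", T, u, v)      # (u^T (x) v^T (x) Id) . T
    return u, v, w, g

# Hypothetical noiseless instance: T = sum_r u_r (x) v_r (x) w_r with
# exactly orthonormal factors (columns of U_true, V_true, W_true).
rng = np.random.default_rng(0)
d, m = 8, 5
U_true = np.linalg.qr(rng.standard_normal((d, m)))[0]
V_true = np.linalg.qr(rng.standard_normal((d, m)))[0]
W_true = np.linalg.qr(rng.standard_normal((d, m)))[0]
T = np.einsum("ir,jr,kr->ijk", U_true, V_true, W_true)

u, v, w, g = random_contraction_round(T, rng)
# The recovered component is the one whose w_r correlates most with g.
j = int(np.argmax(np.abs(W_true.T @ g)))
alignment = (abs(u @ U_true[:, j]), abs(v @ V_true[:, j]), abs(w @ W_true[:, j]))
print(j, alignment)
```

In this idealized setting any Gaussian $g$ singles out one component up to sign; the conditioning on $\langle g,w_{1}\rangle\geq(1+\delta/3)t$ in the proof is what keeps the singular-value gap large enough to survive the error terms.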

2 General tensor decomposition

In many previous works, general tensor decomposition is reduced to orthogonal tensor decomposition via a whitening procedure. However, in our setting we cannot estimate the exact whitening matrix because of the systematic error. Therefore we need a more robust notion of an approximate whitening matrix, which we define below:

Let $r\leq d$ . A collection of $r$ vectors $\{a_{1},\dots,a_{r}\}$ is $\varepsilon$ -approximately orthonormal if the matrix $A$ with $a_{i}$ as columns satisfies

With this in mind, we can state the guarantee on the tensor decomposition algorithm (Algorithm 3).

where $\sigma=\min(\sigma_{\min}(Q_{a}),\sigma_{\min}(Q_{b}),\sigma_{\min}(Q_{c}))$ .

Note that in our model, the matrix $E$ has very small spectral norm as it is the third order term in $\rho$ (and $\rho=O(1/n)$). The spectral boundedness of $\Gamma,\Delta,\Theta$ is discussed in Section B. Therefore we can expect the RHS to be small.

In order to prove this theorem, we show that after we apply the whitening operation using the approximate whitening matrices, the tensor is still close to an orthogonal tensor. To do that, we need the following lemma, which is a useful technical consequence of the condition (2.9).

Suppose $F$ is $\tau$ -spectrally bounded by $g$ . Then,

Let $K$ be the column span of $F$ . Let $Q=FF^{\top}$ . Multiplying $(Q^{+})^{1/2}$ on both sides of equation (2.9), we obtain that

It follows that $\|(Q^{+})^{1/2}G\|\leq\sqrt{2\tau}$, which in turn implies that $\|G^{\top}(FF^{\top})^{+}G\|=\lVert G^{\top}(Q^{+})^{1/2}(Q^{+})^{1/2}G\rVert\leq 2\tau$.∎

We also need to bound the $\{1,2\}\{3\}$ norm of the following systematic error tensor. This is important because we want to bound the spectral norm of the perturbation after the whitening operation.

Using the definition of $\lVert\cdot\rVert_{\{1,2\}\{3\}}$ we have that

Next, observe that for any $i$, $\delta_{i}\delta_{i}^{\top}\preceq(\max_{i}\left\lVert\delta_{i}\right\rVert^{2})\textup{Id}$, and therefore,

With this in mind, we prove the main theorem:

Conclusions

We have presented theoretical progress on the longstanding open problem of presenting a polynomial-time algorithm for learning noisy-or networks given sample outputs from the network. In particular, it is encouraging that linear algebraic methods like tensor decomposition can play a role. Earlier there were no good approaches for this problem; even heuristics fail for realistic sizes like $n=1000$.

Can the sample complexity be reduced, say to subcubic? (Cubic implies more than one billion examples for networks with $1000$ outputs.) Possibly this requires exploiting some hierarchical structure (e.g. groupings of diseases and symptoms) in practical noisy-OR networks, but exploring such possibilities using the current version of QMR-DT is difficult because it has been scrubbed of labels for diseases and symptoms.

Various more practical versions of our algorithm are also easy to conceive and will be tested in the near future. This could be somewhat analogous to topic modeling, for which discovery of provable polynomial-time algorithms soon led to very efficient algorithms.

We thank Randolph Miller and Vanderbilt University for providing the current version of this network for our research.

References

Appendix A Formal expression for the PMI tensor

In this section we formally derive the expressions for the PMI tensors and matrices, which were only derived informally in Section 2.

These matrices will appear naturally in the expressions for the higher-order terms in the Taylor expansion for the PMI matrix and tensor.

We first formally compute the moments of the noisy-or model.

We only give the proof for the second equation. The rest can be shown analogously.

With this in mind, we give the expression for the PMI tensor along with all the higher-order terms.

For any equipartition $S_{a},S_{b},S_{c}$ of $[n]$ , the restriction of the PMI tensor $\textup{PMIT}_{S_{a},S_{b},S_{c}}$ satisfies, for any $L\geq 2$ ,

The proof will proceed by Taylor expanding the log terms. Towards that, using Lemma A.1, we have:

By the Taylor expansion of $\log(1-x)$ , we get that

by simple regrouping of the terms. By exchanging $l$ and $t$ , we get

where the last equality holds by noting that

The term corresponding to $t=1$ is easily seen to be

Therefore, to show the statement of the lemma, we only need to bound the contribution of the terms with $l\geq L$.
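As a scalar illustration of why truncating the Taylor series of $\log(1-x)$ at order $L$ leaves only a small remainder, the sketch below compares the truncation error with the elementary geometric bound $\sum_{l\geq L}x^{l}/l\leq x^{L}/(L(1-x))$ for $0\leq x<1$. This is only a one-dimensional analogue of the tensor-norm bound used in the proof; the value of `x` and the helper names are hypothetical.

```python
import math

def log1m_truncated(x, L):
    """Partial sum  -sum_{l=1}^{L-1} x^l / l  of the series for log(1 - x)."""
    return -sum(x**l / l for l in range(1, L))

def tail_bound(x, L):
    """Geometric bound on the discarded tail: sum_{l>=L} x^l / l <= x^L / (L (1 - x))."""
    return x**L / (L * (1.0 - x))

x = 0.3  # hypothetical value in [0, 1)
errors = {}
for L in (2, 5, 10):
    errors[L] = abs(math.log(1.0 - x) - log1m_truncated(x, L))
    print(L, errors[L], tail_bound(x, L))
```

The error decays geometrically in $L$, which is why a choice of $L$ that is polylogarithmic in the problem parameters suffices later in the argument.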

Toward that, note that $\forall l,k\|1-\exp(-lW_{k})\|\leq n$ . Hence, we have by Lemma 5.6,

Therefore, subadditivity of the $\{1,2\}\{3\}$ norm gives

A completely analogous proof gives a similar expression for the PMI matrix:

For any subsets $S_{a},S_{b}$ of $[n]$ , s.t. $S_{a}\cap S_{b}=\emptyset$ , the restriction of the PMI matrix $\textup{PMI}_{S_{a},S_{b}}$ satisfies, for any $L\geq 2$ ,

Appendix B Spectral properties of the random model

The goal of this section is to prove that the random model specified in Section 1 satisfies the incoherence property 3.2 on the weight matrix and the spectral boundedness property of the PMI tensor. (Recall, the former is required for the whitening algorithm, and the latter for the tensor decomposition algorithm.)

Before delving into the proofs, we will need a few simple bounds on the singular values of $P_{l}$ .

Let $S_{a}\subseteq[n]$ , s.t. $|S_{a}|=\Omega(n)$ . With probability $1-\exp(-\log^{2}n)$ over the choice of $W$ , and for all $l=O(\mbox{poly}(n))$ ,

we will proceed to bound the smallest eigenvalue of $L^{\top}L$ .

Since the matrices $(1-\exp(-lW^{k}))(1-\exp(-lW^{k}))^{\top}$ are independent, the bound will follow from a matrix Bernstein bound. Denoting

Note that (1.2) together with the assumption $\nu=\Omega(1)$ gives $\sigma_{\min}(Q)=\Omega(p)$.

Therefore, by the matrix Bernstein inequality, we have that w.h.p.,

which in turn implies that $P_{l,S_{a}}\succeq nQ$ . But this immediately implies $\sigma_{\min}(P_{l,S_{a}})\gtrsim np$ with high probability. Union bounding over all $l$ , we get the first part of the lemma.

The upper bound will be proven by a Chernoff bound. Note that the matrices

are independent. Furthermore, $\|(1-\exp(-lW_{k}))_{S_{a}}(1-\exp(-lW_{k}))_{S_{a}}^{\top}\|^{2}\leq pn$ with high probability, and the variable $\|(1-\exp(-lW_{k}))_{S_{a}}(1-\exp(-lW_{k}))_{S_{a}}^{\top}\|^{2}$ is sub-exponential. Finally,

A union bound over all values of $l$ gives the statement of the Lemma.

First, we proceed to show the incoherence property 3.2 on the weight matrix.

Suppose $n$ is a multiple of 3. Let $F=U\Sigma V$ be the singular value decomposition of $F$. Let $S_{a},S_{b},S_{c}$ be a uniformly random equipartition of the rows $[n]$. Suppose $F$ is $\mu$-incoherent with $\mu\leq n/(m\log n)$. Then, with high probability over the choice of $S_{a},S_{b},S_{c}$, we have for every $i\in\{a,b,c\}$,

Let $S=S_{a}$ . Then, since $U^{\top}U=\textup{Id}_{m}$ , we have

By the incoherence assumption, $\max_{i}\lVert U^{i}\rVert^{2}\leq\mu m/n$, and hence $\|U^{i}(U^{i})^{\top}\|_{2}=\|U^{i}\|^{2}\leq\mu\frac{m}{n}$ for every $i$.

We also note that $U^{i}$ ’s are negatively associated random variables. Therefore by the matrix Chernoff inequality for negatively associated random variables, we have with high probability,

An analogous argument holds for $S_{b},S_{c}$ as well, so a union bound over the three parts completes the proof. ∎
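The concentration phenomenon behind this lemma can be observed numerically. The sketch below is an illustration under assumptions, not a restatement of the lemma: it draws a hypothetical matrix $U$ with orthonormal columns (from a Gaussian matrix, which is incoherent with high probability), takes a uniformly random equipartition, and checks that each restricted Gram matrix $U_{S}^{\top}U_{S}$ has all eigenvalues near $1/3$. The sizes, the seed, and the helper `equipartition` are illustrative choices.

```python
import numpy as np

def equipartition(n, rng):
    """Uniformly random split of {0, ..., n-1} into three equal parts (n divisible by 3)."""
    perm = rng.permutation(n)
    k = n // 3
    return perm[:k], perm[k:2 * k], perm[2 * k:]

rng = np.random.default_rng(1)
n, m = 3000, 10
U = np.linalg.qr(rng.standard_normal((n, m)))[0]   # n x m, orthonormal columns

# Incoherence parameter mu defined by max_i ||U^i||^2 = mu * m / n.
mu = (n / m) * np.max(np.sum(U**2, axis=1))

parts = equipartition(n, rng)
# Eigenvalues of U_S^T U_S for each part; their expectation is (1/3) * Id.
eigs = [np.linalg.eigvalsh(U[S].T @ U[S]) for S in parts]
for e in eigs:
    print(e.min(), e.max())
```

Since the three Gram matrices sum to $U^{\top}U=\textup{Id}_{m}$, a large deviation in one part would have to be compensated by the others, which is one informal way to see the negative association used in the proof.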

Suppose $n\gtrsim m\log n$. Under the generative assumption in Section 1 for $W$, we have that $F=1-\exp(-W)$ is $O(1)$-incoherent.

We have that $FF^{\top}=U\Sigma^{2}U^{\top}$ and therefore, $\|F^{i}\|^{2}=\Sigma^{2}_{i,i}\|U^{i}\|^{2}$. This in turn implies that

Since $\left\lVert F^{i}\right\rVert^{2}\leq pm+\sqrt{pm}\leq 2pm$ with high probability, we only need to bound $\sigma_{\min}(F)$ from below. Note that $\sigma^{2}_{\min}(F)=\sigma_{\min}(F^{\top}F)$ . Therefore it suffices to control $\sigma_{\min}(F^{\top}F)$ . But by Lemma B.1 we have $\sigma^{2}_{\min}(F)\gtrsim np$ .

B.2 Spectral boundedness

The main goal of the section is to show that the bias terms in the PMI tensor are spectrally bounded by the PMI matrix (which we can estimate from samples). Furthermore, we show that we can calculate an approximate whitening matrix for the leading terms of the PMI tensor using the machinery in Section .

The main proposition we will show is the following:

The main element for the proposition is the following Lemma:

Before proving the Lemma, let us see how the proposition follows from it:

Since $AA^{\top}=(\frac{\rho}{1-\rho})^{2/3}P_{1,S_{a}}$ , the claim of the Proposition follows.

It is clear analogous statements hold for $S_{b}$ and $S_{c}$ . ∎

For notational convenience, we will denote by $J_{m\times n}$ the all ones matrix with dimension $m\times n$ . (We will omit the dimensions when clear from the context.)

This statement will immediately follow from the following two lemmas:

Before showing these lemmas, let us see how Lemma B.5 is implied by them:

Let $\kappa$ be the constant in B.7, s.t. $P_{l,S_{a}}+6np\log n\textup{Id}\precsim mp^{2}J$ . Putting the bounds from Lemmas B.6 and B.7 together along with a union bound, we have that with high probability, $\forall l,l^{\prime}=O(\mbox{poly}(n))$

But, note that $\sigma_{\min}(P_{l^{\prime}})=\Omega(np)$ , by Lemma B.1. Hence, $P_{l}-\frac{5}{2}\kappa P_{l^{\prime},S_{a}}\preceq r\log n\sigma_{\min}(P_{l^{\prime}})$ , for some sufficiently large constant $r$ . This implies

from which the statement of the lemma follows.

Let us denote $e=\frac{1}{\sqrt{|S_{a}|}}\mathbf{1}$. Let us furthermore denote $\textup{Id}_{\mathbf{1}}=ee^{\top}$ and $\textup{Id}_{-\mathbf{1}}=\textup{Id}-ee^{\top}$. Note first that trivially, since $\textup{Id}_{\mathbf{1}}+\textup{Id}_{-\mathbf{1}}=\textup{Id}$,

We proceed to upper bound both the terms on the RHS above. More precisely, we will show that

Let us proceed to showing (B.4). The LHS can be rewritten as

We proceed to (B.5), which will be shown by a Bernstein bound. Towards that, note that

Therefore, applying a matrix Bernstein bound, we get

Combining this with (B.2) and (B.3), we get

Let us proceed to the second inequality, which essentially follows the same strategy:

Reusing the notation from Lemma B.7, we have that

for similar reasons as before. From this we get that

Putting (B.6) and (B.7) together, we get that

We will proceed to show an upper bound $\textup{Id}_{-\mathbf{1}}P\textup{Id}_{-\mathbf{1}}\preceq 2np\log n\textup{Id}$ on the second term of the LHS. We will do this by a Bernstein bound as before. Namely, analogously to Lemma B.7,

Therefore, applying a matrix Bernstein bound, we get

Appendix C Robust whitening

In this section, we show that the formula $Q_{a}=\rho^{-1/3}\widehat{\textup{PMI}}_{S_{a},S_{b}}(\widehat{\textup{PMI}}^{+}_{S_{b},S_{c}})^{\top}\widehat{\textup{PMI}}_{S_{c},S_{a}}$ computes an approximation of the true whitening matrix $AA^{\top}$, so that the error is $\varepsilon$-spectrally bounded by $A$. We recall Theorem 2.10.

where $E_{ab},E_{bc},E_{ca}$ are $\varepsilon$-spectrally bounded by $(A,B)$, $(B,C)$, $(C,A)$ respectively. Then, the matrix

Towards proving Theorem 2.10, an intermediate step is to understand how the space of singular vectors of $BC^{\top}$ is aligned with that of the noisy version $\Sigma_{bc}$. The following lemma explicitly represents $BC^{\top}+E$ in the form $B^{\prime}R(C^{\prime})^{\top}+\Delta^{\prime}$. The crucial benefit of doing so is that the resulting $\Delta^{\prime}$ is small in every direction. In other words, we start with a relative error guarantee on $E$, and the lemma below converts it into an absolute error guarantee on $\Delta^{\prime}$ (though the signal term changes slightly).

Suppose $B,C$ are $n\times m$ matrices with $n\geq m$ . Suppose a matrix $E$ is $\varepsilon$ -spectrally bounded by $(B,C)$ , then $BC^{\top}+E$ can be written as

where $\Delta_{B},\Delta_{C},\Delta_{BC}^{\prime}$ are small and $R_{BC}$ is close to identity in the sense that,

The key intuition is that if the perturbation happens within the span of the columns of $B$ and $C$, it cannot change the subspace. By Definition 2.9, we can write $E$ as

Now since $\|\Delta_{1}\|\leq\varepsilon<1$, we know $(\textup{Id}+\Delta_{1})$ is invertible, so we can write

This is already in the desired form as we can let $\Delta_{B}=\Delta_{3}(\textup{Id}+\Delta_{1})^{-1}$, $R_{BC}=(\textup{Id}+\Delta_{1})$, $\Delta_{C}=\Delta_{2}(\textup{Id}+\Delta_{1})^{-\top}$, and $\Delta_{BC}^{\prime}=-\Delta_{3}(\textup{Id}+\Delta_{1})^{-1}\Delta_{2}^{\top}+\Delta_{4}$. By Weyl’s Theorem we know $\sigma_{min}(\textup{Id}+\Delta_{1})\geq 1-\varepsilon$, therefore $\|\Delta_{B}\|\leq\|\Delta_{3}\|\sigma_{min}^{-1}(\textup{Id}+\Delta_{1})\leq\frac{\varepsilon}{1-\varepsilon}\sigma_{min}(B)$. Other terms can be bounded similarly.
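The rewriting above can be checked numerically. The sketch below assumes (this form is not restated in this section) that Definition 2.9 lets us write $E=B\Delta_{1}C^{\top}+B\Delta_{2}^{\top}+\Delta_{3}C^{\top}+\Delta_{4}$ with $\|\Delta_{1}\|\leq\varepsilon$, $\|\Delta_{2}\|\leq\varepsilon\sigma_{min}(C)$, $\|\Delta_{3}\|\leq\varepsilon\sigma_{min}(B)$ and $\|\Delta_{4}\|\leq\varepsilon\sigma_{min}(B)\sigma_{min}(C)$. The sizes, the seed, and the helper `with_norm` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, eps = 30, 4, 0.05
B = rng.standard_normal((n, m))
C = rng.standard_normal((n, m))
sB = np.linalg.svd(B, compute_uv=False)[-1]   # sigma_min(B)
sC = np.linalg.svd(C, compute_uv=False)[-1]   # sigma_min(C)

def with_norm(shape, norm):
    """Random matrix rescaled to have exactly the given spectral norm."""
    X = rng.standard_normal(shape)
    return X * (norm / np.linalg.norm(X, 2))

# Assumed decomposition of a spectrally bounded perturbation E.
D1 = with_norm((m, m), eps)
D2 = with_norm((n, m), eps * sC)
D3 = with_norm((n, m), eps * sB)
D4 = with_norm((n, n), eps * sB * sC)
E = B @ D1 @ C.T + B @ D2.T + D3 @ C.T + D4

# Quantities from the proof.
R = np.eye(m) + D1
R_inv = np.linalg.inv(R)
Delta_B = D3 @ R_inv
Delta_C = D2 @ R_inv.T
Delta_BC = -D3 @ R_inv @ D2.T + D4

lhs = B @ C.T + E
rhs = (B + Delta_B) @ R @ (C + Delta_C).T + Delta_BC
print(np.allclose(lhs, rhs), np.linalg.norm(Delta_B, 2) / sB)
```

The identity holds exactly (up to floating-point error) for any invertible $\textup{Id}+\Delta_{1}$; the smallness of $\varepsilon$ is only needed for the norm bounds on $\Delta_{B}$, $\Delta_{C}$ and $\Delta^{\prime}_{BC}$.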

Now we prove that the top $m$ approximation of $BC^{\top}+E$ has similar column/row spaces as $BC^{\top}$ . Let $U_{B}$ be the column span of $B$ , $U_{B}^{\prime}$ be the column span of $(B+\Delta_{B})$ , and $U_{B}^{\prime\prime}$ be the top $m$ left singular subspace of $(BC^{\top}+E)$ . Similarly we can define $U_{C}$ , $U_{C}^{\prime}$ , $U_{C}^{\prime\prime}$ to be the column spans of $C$ , $C+\Delta_{C}$ and the top $m$ right singular subspace of $(BC^{\top}+E)$ .

For $B+\Delta_{B}$ , we can apply Weyl’s Theorem and Wedin’s Theorem. By Weyl’s Theorem we know $\sigma_{min}(B+\Delta_{B})\geq\sigma_{min}(B)-\|\Delta_{B}\|\geq(1-O(\varepsilon))\sigma_{min}(B)$ . By Wedin’s Theorem we know $U_{B}^{\prime}$ is $O(\varepsilon)$ -close to $U_{B}$ . Similar results apply to $C+\Delta_{C}$ .

Now we know $\sigma_{min}((B+\Delta_{B})R_{BC}(C+\Delta_{C})^{\top})\geq\sigma_{min}(B+\Delta_{B})\sigma_{min}(R_{BC})\sigma_{min}(C+\Delta_{C})\geq\Omega(\sigma_{min}(B)\sigma_{min}(C))$. Therefore we can again apply Wedin’s Theorem, considering $(B+\Delta_{B})R_{BC}(C+\Delta_{C})^{\top}$ as the original matrix and $\Delta^{\prime}_{BC}$ as the perturbation. As a result, we know $U_{B}^{\prime\prime}$ is $O(\varepsilon)$ close to $U_{B}^{\prime}$, and $U_{C}^{\prime\prime}$ is $O(\varepsilon)$ close to $U_{C}^{\prime}$. The distance between $U_{B},U_{B}^{\prime\prime}$ (and $U_{C},U_{C}^{\prime\prime}$) then follows from the triangle inequality.

As a direct corollary of Lemma C.1, we obtain that $BC^{\top}$ and $BC^{\top}+E$ have similar subspaces of singular vectors.

In the setting of Lemma C.1, let $[BC^{\top}+E]_{m}$ be the best rank- $m$ approximation of $BC^{\top}+E$ . Then, the span of columns of $[BC^{\top}+E]_{m}$ is $O(\varepsilon)$ -close to the span of columns of $B$ , span of rows of $[BC^{\top}+E]_{m}$ is $O(\varepsilon)$ -close to the span of columns of $C$ .

Furthermore, we can write $[BC^{\top}+E]_{m}=(B+\Delta_{B})R_{BC}(C+\Delta_{C})^{\top}+\Delta_{BC}.$ Here $\Delta_{B}$, $\Delta_{C}$ and $R_{BC}$ are as defined in Lemma C.1, and $\Delta_{BC}$ satisfies $\|\Delta_{BC}\|\leq O(\varepsilon\sigma_{min}(B)\sigma_{min}(C))$.
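The subspace statement of the corollary can be illustrated numerically. The sketch below treats only the simplest special case, as an assumption: the perturbation $E$ is bounded in absolute spectral norm by $\varepsilon\sigma_{min}(B)\sigma_{min}(C)$ (the $\Delta_{4}$-type part of a spectrally bounded error). It measures the sine of the largest principal angle between the span of $B$ (resp. $C$) and the top-$m$ left (resp. right) singular subspace of $BC^{\top}+E$. The helper `sin_theta`, the sizes, and the seed are hypothetical.

```python
import numpy as np

def sin_theta(X, Y):
    """Sine of the largest principal angle between the column spans of X and Y
    (both assumed to have full column rank and the same number of columns)."""
    Qx = np.linalg.qr(X)[0]
    Qy = np.linalg.qr(Y)[0]
    return np.linalg.norm(Qy - Qx @ (Qx.T @ Qy), 2)

rng = np.random.default_rng(3)
n, m, eps = 40, 5, 0.01
B = rng.standard_normal((n, m))
C = rng.standard_normal((n, m))
sB = np.linalg.svd(B, compute_uv=False)[-1]
sC = np.linalg.svd(C, compute_uv=False)[-1]

# Perturbation with spectral norm exactly eps * sigma_min(B) * sigma_min(C).
E = rng.standard_normal((n, n))
E *= eps * sB * sC / np.linalg.norm(E, 2)

U, s, Vt = np.linalg.svd(B @ C.T + E)
left = sin_theta(B, U[:, :m])     # span of B vs. top-m left singular subspace
right = sin_theta(C, Vt[:m].T)    # span of C vs. top-m right singular subspace
print(left, right)
```

Both quantities come out on the order of $\varepsilon$ or smaller, in line with the $O(\varepsilon)$-closeness claimed by the corollary.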

Since $[BC^{\top}+E]_{m}$ is the best rank-$m$ approximation and $(B+\Delta_{B})R_{BC}(C+\Delta_{C})^{\top}$ is a rank-$m$ matrix, in particular we have

In order to fix this problem, we notice that the matrix $[\Sigma_{bc}^{\top}]_{m}^{+}$ is multiplied by $\Sigma_{ab}$ on the left and $\Sigma_{ca}$ on the right. Assuming $\Sigma_{ab}=AB^{\top}$ , $\Sigma_{ca}=CA^{\top}$ , we should expect $[\Sigma_{bc}^{\top}]_{m}^{+}$ to “cancel” with the $B^{\top}$ factor on the left and the $C$ factor on the right, giving us $AA^{\top}$ . Therefore, we should really measure the error of the middle term $[\Sigma_{bc}^{\top}]_{m}^{+}$ after left multiplying with $B^{\top}$ and right multiplying with $C$ . We formalize this in the following lemma:

Suppose $\Sigma_{bc}$ is as defined in Theorem 2.10, let $\Delta=[\Sigma_{bc}^{\top}]_{m}^{+}-[CB^{\top}]^{+}$ , then we have

We will first prove Theorem 2.10 assuming Lemma C.3.

By Lemma C.1, we know $\Sigma_{ab}$ can be written as

Similarly $\Sigma_{ca}$ can be written as

Here the $\Delta$ terms and $R$ terms are bounded as in Lemma C.1.

Now let us write the matrix $\Sigma_{ab}[\Sigma_{bc}^{\top}]_{m}^{+}\Sigma_{ca}$ as

We can now view $\Sigma_{ab}[\Sigma_{bc}^{\top}]_{m}^{+}\Sigma_{ca}$ as the product of three terms, each of which is the sum of two matrices. Therefore we can expand the product into 8 terms. In each of the three pairs, we will call the first matrix the main matrix, and the second matrix the perturbation.

In the remaining proof, we will do calculations to show the product of the main terms is close to $AA^{\top}$ , and all the other 7 terms are small.

Before doing that, we first prove several claims about PSD matrices.

If $\|\Delta\|\leq\varepsilon$ , then $A\Delta A^{\top}\preceq\varepsilon AA^{\top}$ . If $\|\Gamma\|\leq\varepsilon\sigma_{min}(A)$ , then $\frac{1}{2}(A\Gamma^{\top}+\Gamma A^{\top})\preceq\varepsilon AA^{\top}+\varepsilon\sigma_{min}^{2}(A)\textup{Id}$ .
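Both inequalities of the claim can be sanity-checked numerically by looking at the largest eigenvalue of (left side minus right side), which should be non-positive. In the sketch below the semidefinite order is applied to the symmetric part of each matrix, since only the quadratic form matters; the sizes, the seed, and the helper `max_eig` are hypothetical.

```python
import numpy as np

def max_eig(S):
    """Largest eigenvalue of the symmetric part of S."""
    return np.linalg.eigvalsh((S + S.T) / 2)[-1]

rng = np.random.default_rng(4)
n, m, eps = 20, 4, 0.1
A = rng.standard_normal((n, m))
sA = np.linalg.svd(A, compute_uv=False)[-1]   # sigma_min(A)

Delta = rng.standard_normal((m, m))
Delta *= eps / np.linalg.norm(Delta, 2)        # ||Delta|| = eps
Gamma = rng.standard_normal((n, m))
Gamma *= eps * sA / np.linalg.norm(Gamma, 2)   # ||Gamma|| = eps * sigma_min(A)

# First part: A Delta A^T - eps A A^T should be negative semidefinite.
gap1 = max_eig(A @ Delta @ A.T - eps * A @ A.T)
# Second part: (A Gamma^T + Gamma A^T)/2 - eps A A^T - eps sigma_min(A)^2 Id.
gap2 = max_eig(0.5 * (A @ Gamma.T + Gamma @ A.T)
               - eps * A @ A.T - eps * sA**2 * np.eye(n))
print(gap1, gap2)
```

The first gap is zero up to rounding (the matrix vanishes on the orthogonal complement of the span of $A$), while the second is strictly negative because of the $\varepsilon\sigma_{min}^{2}(A)\textup{Id}$ term.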

Both inequalities can be proved by considering the quadratic form. We know for any $x$, $x^{\top}A\Delta A^{\top}x\leq\|\Delta\|\|A^{\top}x\|^{2}\leq\varepsilon x^{\top}AA^{\top}x$, so the first part is true.

For the second part, for any $x$ we can apply the Cauchy-Schwarz inequality

Now, we will first prove the product of three main matrices is close to $AA^{\top}$ :

We have $\left((A+\Delta^{1}_{A})R_{AB}(B+\Delta^{1}_{B})^{\top}\right)(CB^{\top})^{+}\left((C+\Delta^{3}_{C})R_{CA}(A+\Delta^{3}_{A})^{\top}\right)=AA^{\top}+E_{A}$ , where $E_{A}$ is $O(\varepsilon)$ -spectrally bounded by $AA^{\top}$ .

We will first prove the middle part of the matrix $(B+\Delta^{1}_{B})^{\top}(CB^{\top})^{+}(C+\Delta^{3}_{C})$ is $O(\varepsilon)$ close to identity matrix Id. Here we observe that both $B,C$ have full column rank so $(CB^{\top})^{+}=(B^{\top})^{+}C^{+}$ . Therefore we can rewrite the product as $(\textup{Id}+B^{+}\Delta^{1}_{B})^{\top}(\textup{Id}+C^{+}\Delta^{3}_{C})$ . Since $\|\Delta^{1}_{B}\|\leq O(\varepsilon\sigma_{min}(B))$ by Lemma C.1 (and similarly for $C$ ), we know $\|B^{+}\Delta^{1}_{B}\|\leq O(\varepsilon)$ . Therefore the middle part is $O(\varepsilon)$ close to Id. Now since $\varepsilon\ll 1$ we know $\widehat{R}_{AB}=R_{AB}(B+\Delta^{1}_{B})^{\top}(CB^{\top})^{+}(C+\Delta^{3}_{C})R_{CA}$ is $O(\varepsilon)$ -close to Id.
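The pseudoinverse identity used here, $(CB^{\top})^{+}=(B^{\top})^{+}C^{+}$ for $B,C$ of full column rank, and its consequence $B^{\top}(CB^{\top})^{+}C=\textup{Id}$, can be confirmed on a small random instance. The sizes and seed below are hypothetical, and an explicit `rcond` cutoff is passed so that the numerically zero singular values of the rank-$m$ matrix $CB^{\top}$ are discarded.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 12, 3
B = rng.standard_normal((n, m))   # full column rank with probability 1
C = rng.standard_normal((n, m))

# Pseudoinverse of the rank-m, n x n matrix C B^T.
lhs = np.linalg.pinv(C @ B.T, rcond=1e-10)
# Product of the individual pseudoinverses.
rhs = np.linalg.pinv(B.T, rcond=1e-10) @ np.linalg.pinv(C, rcond=1e-10)

# The "middle part" of the product in the unperturbed case.
middle = B.T @ lhs @ C
print(np.allclose(lhs, rhs), np.allclose(middle, np.eye(m)))
```

The identity relies on $C$ having full column rank and $B^{\top}$ having full row rank; it does not hold for arbitrary matrix products.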

Now we are left with $(A+\Delta^{1}_{A})\widehat{R}_{AB}(A+\Delta^{3}_{A})^{\top}$ , for this matrix we know

The first term $A(\widehat{R}_{AB}-\textup{Id})A^{\top}\preceq O(\varepsilon)AA^{\top}$ (Claim C.4); the fourth term $\Delta^{1}_{A}\widehat{R}_{AB}(\Delta^{3}_{A})^{\top}\preceq O(\varepsilon\sigma_{min}^{2}(A))\textup{Id}$ (by the norm bounds of $\Delta^{1}_{A}$ and $\Delta^{3}_{A}$). For the cross terms, we can bound them using the second part of Claim C.4. ∎

Next we will try to prove the remaining 7 terms are small. We partition them into three types depending on how many $\Delta$ factors they have. We proceed to bound them in each of these cases.

For the terms with only one $\Delta$ , we claim:

The three terms $\Delta_{AB}(CB^{\top})^{+}(C+\Delta^{3}_{C})R_{CA}(A+\Delta^{3}_{A})^{\top}$, $(A+\Delta^{1}_{A})R_{AB}(B+\Delta^{1}_{B})^{\top}\Delta_{BC}(C+\Delta^{3}_{C})R_{CA}(A+\Delta^{3}_{A})^{\top}$, $\left((A+\Delta^{1}_{A})R_{AB}(B+\Delta^{1}_{B})^{\top}\right)(CB^{\top})^{+}\Delta_{CA}$ are all $O(\varepsilon)$ spectrally bounded by $AA^{\top}$.

For the first term, note that both $B,C$ have full column rank, and hence $(CB^{\top})^{+}=(B^{\top})^{+}C^{+}$ . Therefore the first term can be rewritten as

By Lemma C.1, we have spectral norm bounds for $\Delta_{AB},\Delta^{3}_{C},\Delta^{3}_{A},R_{CA}$ . Therefore we know $\|\Delta_{AB}(B^{\top})^{+}\|\leq O(\varepsilon\sigma_{min}(A))$ and $[(\textup{Id}+C^{+}\Delta^{3}_{C})R_{CA}]$ is $O(\varepsilon)$ close to Id. Therefore $\|[\Delta_{AB}(B^{\top})^{+}][(\textup{Id}+C^{+}\Delta^{3}_{C})R_{CA}](\Delta^{3}_{A})^{\top}\|\leq O(\varepsilon\sigma_{min}^{2}(A))$ is trivially $O(\varepsilon)$ spectrally bounded, and $[\Delta_{AB}(B^{\top})^{+}][(\textup{Id}+C^{+}\Delta^{3}_{C})R_{CA}]A^{\top}$ is $O(\varepsilon)$ spectrally bounded by Claim C.4. The third term is exactly symmetric.

For the second term, we will first prove that the middle part of the matrix, $\widehat{\Delta}_{BC}=(B+\Delta^{1}_{B})^{\top}\Delta_{BC}(C+\Delta^{3}_{C})$, has spectral norm $O(\varepsilon)$. This can be done by expanding it into the sum of 4 terms, and using appropriate spectral norm bounds on $\Delta_{BC}$ and its products with $B^{\top}$ and $C$ from Lemma C.3. Now we can show $(A+\Delta^{1}_{A})R_{AB}\widehat{\Delta}_{BC}R_{CA}(A+\Delta^{3}_{A})^{\top}$ is $O(\varepsilon)$ spectrally bounded by the first part of Claim C.4. ∎

Next we try to bound the terms with two $\Delta$ factors.

The three terms $\Delta_{AB}\Delta_{BC}(C+\Delta^{3}_{C})R_{CA}(A+\Delta^{3}_{A})^{\top}$ , $\Delta_{AB}(CB^{\top})^{+}\Delta_{CA}$ , $\left((A+\Delta^{1}_{A})R_{AB}(B+\Delta^{1}_{B})^{\top}\right)\Delta_{BC}\Delta_{CA}$ are all $O(\varepsilon^{2})$ spectrally bounded by $AA^{\top}$ .

For the first term, notice that $\|\Delta_{BC}(C+\Delta^{3}_{C})\|$ is bounded by $O(\varepsilon/\sigma_{min}(B))$ by Lemma C.3, and $\|\Delta_{AB}\|=O(\varepsilon\sigma_{min}(A)\sigma_{min}(B))$. Therefore we know $\|\Delta_{AB}\Delta_{BC}(C+\Delta^{3}_{C})R_{CA}\|\leq O(\varepsilon^{2}\sigma_{min}(A))$, so by Claim C.4 we know this term is $O(\varepsilon^{2})$ spectrally bounded by $AA^{\top}$. The third term is symmetric.

For the second term, by Lemma C.1 we can directly bound its spectral norm by $O(\varepsilon^{2}\sigma_{min}^{2}(A))$ , so it is trivially $O(\varepsilon^{2})$ spectrally bounded by $AA^{\top}$ . ∎

Finally, for the product $\Delta_{AB}\Delta_{BC}\Delta_{CA}$ , we can get the spectral norm for the three factors by Lemma C.1 and Lemma C.3. As a result $\|\Delta_{AB}\Delta_{BC}\Delta_{CA}\|\leq O(\varepsilon^{3}\sigma^{2}_{min}(A))$ which is trivially $O(\varepsilon^{3})$ spectrally bounded by $AA^{\top}$ .

Combining the bound for all of the terms we get the theorem. ∎

We first prove a simpler version where the perturbation is simply bounded in spectral norm

Suppose $B,C$ are $n\times m$ matrices and $n\geq m$. Let $R$ be an $m\times m$ matrix such that $\|R-\textup{Id}\|\leq\varepsilon$, and let $E$ be a perturbation matrix with $\|E\|\leq\varepsilon\sigma_{min}(B)\sigma_{min}(C)$ such that $(CRB^{\top}+E)$ is also of rank $m$.

Now let $\Delta=(CRB^{\top}+E)^{+}-(CRB^{\top})^{+}$ , then when $\varepsilon\ll 1$ we have

We first give the proof for $\|B^{\top}\Delta C\|$ . Other terms are similar.

Let $U_{B}$ be the column span of $B$ , and $U^{\prime}_{B}$ be the row span of $(CRB^{\top}+E)$ . Similarly let $U_{C}$ be the column span of $C$ and $U^{\prime}_{C}$ be the column span of $(CRB^{\top}+E)$ . By Wedin’s theorem, we know $U^{\prime}_{B}$ is $O(\varepsilon)$ close to $U_{B}$ and $U^{\prime}_{C}$ is $O(\varepsilon)$ close to $U_{C}$ . As a result, suppose the SVD of $B$ is $U_{B}D_{B}V_{B}^{\top}$ , we know

The same is true for $C$ : $\sigma_{min}(C^{\top}U^{\prime}_{C})\geq(1-O(\varepsilon))\sigma_{min}(C)$ .

By the property of the pseudoinverse, the column span of $(CRB^{\top}+E)^{+}$ is $U^{\prime}_{B}$, and the row span of $(CRB^{\top}+E)^{+}$ is $U^{\prime}_{C}$. Further, $(CRB^{\top}+E)^{+}=U^{\prime}_{B}[(U^{\prime}_{C})^{\top}(CRB^{\top}+E)U^{\prime}_{B}]^{-1}(U^{\prime}_{C})^{\top}$; therefore we can write

Note that now the three matrices are all $m\times m$ and invertible! We can write $B^{\top}U^{\prime}_{B}=((B^{\top}U^{\prime}_{B})^{-1})^{-1}$ (and do the same thing for $(U^{\prime}_{C})^{\top}C$). Using the fact that $P^{-1}Q^{-1}=(QP)^{-1}$, we have

Here we defined $X=((U^{\prime}_{C})^{\top}C)^{-1}(U^{\prime}_{C})^{\top}EU^{\prime}_{B}(B^{\top}U^{\prime}_{B})^{-1}$ . The spectral norm of $X$ can be bounded by

We can write $B^{\top}\Delta C=B^{\top}(CRB^{\top}+E)^{+}C-\textup{Id}=(\textup{Id}+(R-\textup{Id}+X))^{-1}-\textup{Id}$ , and we now know $\|(R-\textup{Id}+X)\|\leq O(\varepsilon)$ , as a result $\|B^{\top}\Delta C\|\leq O(\varepsilon)$ as desired.

For the term $\|B^{\top}\Delta\|$ , by the same argument we have

On the other hand, we know $B^{\top}(CRB^{\top})^{+}=R^{-1}C^{+}=R^{-1}((U_{C})^{\top}C)^{-1}U_{C}^{\top}$ . We can match the three factors:

Here, the first and third bounds were proven above. The second bound follows if we consider the SVD of $C=U_{C}D_{C}V_{C}^{\top}$ and notice that $\|(U^{\prime}_{C})^{\top}U_{C}-\textup{Id}\|\leq O(\varepsilon)$. We can write $\Delta_{1}=R^{-1}-(R+X)^{-1}$, $\Delta_{2}=((U_{C})^{\top}C)^{-1}-((U^{\prime}_{C})^{\top}C)^{-1}$, $\Delta_{3}=U_{C}-U^{\prime}_{C}$; then we have

Expanding the last equation, we get 7 terms and all of them can be bounded by $O(\varepsilon/\sigma_{min}(C))$ . The bounds on $\|\Delta C\|$ and $\|\Delta\|$ can be proved using similar techniques. ∎

Finally we are ready to prove the main Lemma C.3:

Using Lemma C.1, let $E=E_{bc}^{\top}$ , we can write the matrix before pseudoinverse as

We can then apply Lemma C.8 on $(C+\Delta_{C})R_{BC}(B+\Delta_{B})^{\top}+\Delta_{BC}$ . As a result, we know if we let $\Delta^{\prime}=[CB^{\top}+E]_{m}^{+}-((C+\Delta_{C})R_{BC}(B+\Delta_{B})^{\top})^{+}$ , we have the desired bound if we left multiply with $(B+\Delta_{B})^{\top}$ or right multiply with $(C+\Delta_{C})$ .

We will now show how to prove the first bound; all the other bounds can be proved using the same strategy:

All the four terms on the RHS can be bounded by Lemma C.8 so we know $\|B^{\top}\Delta^{\prime}C\|\leq O(\varepsilon)$ .

On the other hand, let $\Delta^{\prime\prime}=((C+\Delta_{C})R_{BC}(B+\Delta_{B})^{\top})^{+}-(CB^{\top})^{+}=\Delta-\Delta^{\prime}$. We will prove $\|B^{\top}\Delta^{\prime\prime}C\|\leq O(\varepsilon)$, and then the bound on $\|B^{\top}\Delta C\|$ follows from the triangle inequality.

For $B^{\top}\Delta^{\prime\prime}C$ , we know it is equal to

$\|B^{\top}[(B+\Delta_{B})^{\top}]^{+}R_{BC}^{-1}(C+\Delta_{C})^{+}C-\textup{Id}\|\leq O(\varepsilon)$

We will show all three factors in the first term are $O(\varepsilon)$ close to Id. For $R_{BC}^{-1}$ this follows immediately from Lemma C.1. For $(C+\Delta_{C})^{+}C$, we know

Therefore its spectral norm is bounded by $\|\Delta_{C}\|\sigma_{min}^{-1}(C+\Delta_{C})=O(\varepsilon)$ (where the bound on $\|\Delta_{C}\|$ comes from Lemma C.1). ∎

With the claim we have now proven $\|B^{\top}\Delta^{\prime\prime}C\|\leq O(\varepsilon)$ , therefore

Here we will show that under mild incoherence conditions (defined below), if an error matrix $E$ is $\varepsilon$-spectrally bounded by $FF^{\top}$, then the partial matrices satisfy the requirement of Theorem 2.10.

If $F$ is $\mu$ -incoherent for $\mu\leq\sqrt{n/m\log^{2}n}$ , then when $n\geq\Omega(m\log^{2}m)$ , with high probability over the random partition of $F$ into $A,B,C$ , we know $\sigma_{min}(A)\geq\sigma_{min}(F)/3$ (same is true for $B,C$ ).

As a corollary, suppose $E$ is $\varepsilon$-spectrally bounded by $F$. Let $a,b,c$ be the subsets corresponding to $A,B,C$, and let $E_{a,b}$ be the submatrix of $E$ whose rows are in set $a$ and columns are in set $b$. Then $E_{a,b}$ (also $E_{b,c},E_{c,a}$) is $O(\varepsilon)$-spectrally bounded by the corresponding asymmetric matrices $AB^{\top}$ ($BC^{\top}$, $CA^{\top}$).

Consider the singular value decomposition of $F$ : $F=UDV^{\top}$ . Here $U$ is a $n\times m$ matrix whose columns are orthonormal, $V$ is an $m\times m$ orthonormal matrix and $D$ is a diagonal matrix whose smallest diagonal entry is $\sigma_{min}(F)$ .

Consider the following way of partitioning the matrix: for each row of $F$ , we put it into $A,B$ or $C$ with probability $1/3$ independently.

Now, let $X_{i}=1$ if row $i$ is in the matrix $A$, and 0 otherwise. Then the $X_{i}$’s are Bernoulli random variables with probability $1/3$. Suppose $S$ is the set of rows in $A$, and let $U_{A}$ be $U$ restricted to the rows in $S$; then we have $A=U_{A}DV^{\top}$. We will show that with high probability $\sigma_{min}(U_{A})\geq 1/3$.

The key observation here is the expectation of $U_{A}^{\top}U_{A}=\sum_{i=1}^{n}X_{i}U_{i}U_{i}^{\top}$ , where $U_{i}$ is the $i$ -th row of $U$ (represented as a column vector). Since $X_{i}$ ’s are Bernoulli random variables, we know

Therefore we can hope to use matrix concentration to prove that $U_{A}^{\top}U_{A}$ is close to its expectation.

Here the last inequality is because $\sum_{i=1}^{n}X_{i}U_{i}U_{i}^{\top}U_{i}U_{i}^{\top}\preceq\|U_{i}\|^{2}U_{i}U_{i}^{\top}$ .

Therefore by Matrix Bernstein’s inequality we know with high probability $\|\sum_{i=1}^{n}M_{i}\|\leq 1/6$ . When this happens we know

Hence we have $\sigma_{min}(U_{A})\geq\sqrt{1/6}>1/3$, and $\sigma_{min}(A)\geq\sigma_{min}(U_{A})\sigma_{min}(D)\geq\sigma_{min}(F)/3$. Note that the matrices $B$, $C$ have exactly the same distribution as $A$, so the bounds for $B$, $C$ follow from a union bound.

For the corollary, if a matrix $E$ is $\varepsilon$ spectrally bounded, we can write it as $F\Delta_{1}F^{\top}+F\Delta_{2}^{\top}+\Delta_{2}F^{\top}+\Delta_{4}$ , where $\|\Delta_{1}\|\leq\varepsilon$ , $\|\Delta_{2}\|\leq\varepsilon\sigma_{min}(F)$ and $\|\Delta_{4}\|\leq\varepsilon\sigma_{min}^{2}(F)$ . This can be done by considering different projections of $E$ : let $U$ be the span of columns of $F$ , then $F\Delta_{1}F^{\top}$ term corresponds to $\operatorname{Proj}_{U}E\operatorname{Proj}_{U}$ ; $F\Delta_{2}^{\top}$ term corresponds to $\operatorname{Proj}_{U}E\operatorname{Proj}_{U^{\perp}}$ ; $\Delta_{2}F^{\top}$ term corresponds to $\operatorname{Proj}_{U^{\perp}}E\operatorname{Proj}_{U}$ ; $\Delta_{4}$ term corresponds to $\operatorname{Proj}_{U^{\perp}}E\operatorname{Proj}_{U^{\perp}}$ . The spectral bounds are necessary for $E$ to be spectrally bounded.

Now for $E_{a,b}$, we can write it as $A\Delta_{1}B^{\top}+A(\Delta_{2})_{b}^{\top}+(\Delta_{2})_{a}B^{\top}+(\Delta_{4})_{a,b}$, where we also take the corresponding submatrices of the $\Delta$’s. Since the spectral norm of a submatrix can only be smaller, we know $\|\Delta_{1}\|\leq\varepsilon$, $\|(\Delta_{2})_{b}\|\leq\varepsilon\sigma_{min}(F)\leq 3\varepsilon\sigma_{min}(B)$, $\|(\Delta_{2})_{a}\|\leq\varepsilon\sigma_{min}(F)\leq 3\varepsilon\sigma_{min}(A)$ and $\|(\Delta_{4})_{a,b}\|\leq\varepsilon\sigma_{min}^{2}(F)\leq 9\varepsilon\sigma_{min}(A)\sigma_{min}(B)$. Therefore by Definition 2.9 we know $E_{a,b}$ is $9\varepsilon$ spectrally bounded by $AB^{\top}$. ∎
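The fact used in the last step, that restricting a matrix to a subset of rows and columns cannot increase its spectral norm, is easy to confirm on an example. The matrix, index sets, and seed below are hypothetical stand-ins for the $\Delta$ matrices and the partition sets $a,b$.

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.standard_normal((9, 9))      # stands in for one of the Delta matrices
a, b = [0, 3, 4], [1, 2, 8]          # hypothetical row set a and column set b
sub = M[np.ix_(a, b)]                # submatrix with rows in a and columns in b

full_norm = np.linalg.norm(M, 2)     # spectral norm of the full matrix
sub_norm = np.linalg.norm(sub, 2)    # spectral norm of the submatrix
print(sub_norm <= full_norm)
```

This holds because a submatrix is obtained by multiplying on both sides with coordinate projections, each of spectral norm at most one.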

Appendix D Proof of Theorem 3.1 and Theorem 3.3

In this section, we provide the full proof of Theorem 3.1. We start with a simple technical Lemma.

If $Q$ is an $\varepsilon$-approximate whitening matrix for $A$, then $\|Q\|\leq\frac{1}{1-\varepsilon}\|AA^{\top}\|$ and $\sigma_{\min}(Q)\geq\frac{1}{1+\varepsilon}\sigma_{\min}(AA^{\top})$.

By the definition of approximate-whitening, we have

by virtue of the fact that $(Q^{+})^{1/2}AA^{\top}(Q^{+})^{1/2}=\left((Q^{+})^{1/2}A^{\top}A(Q^{+})^{1/2}\right)^{\top}$ . Rewriting in semidefinite-order notation, we get that

Multiplying on the left and right by $Q^{1/2}$ , we get

This directly implies $\frac{1}{1+\varepsilon}AA^{\top}\preceq Q\preceq\frac{1}{1-\varepsilon}AA^{\top}$ which is equivalent to the statement of the lemma. ∎

Towards proving Theorem 3.1, we will first prove the following proposition, which shows that we recover the $\exp(-W)$ matrix correctly:

Under the random generative model defined in Section 1, if the number of samples satisfies

The proof will consist of checking the conditions for Algorithms 4 and 3 to work, so that we can apply Theorems 5.4 and 2.10.

To get a handle on the PMI tensor, by Proposition A.2, for any equipartition $S_{a},S_{b},S_{c}$ of $[n]$ , we can write it as

We can choose $\displaystyle L=\operatorname{poly}(\log(n,\frac{1}{\rho},\frac{1}{p}))$ to ensure

Having an explicit form for the tensor, we proceed to check the spectral boundedness condition for Algorithm 4.

Next, we verify the conditions for calculating approximate whitening matrices (Algorithm 4).

However, for the random model, applying Lemma B.1,

Analogous statements hold for $B$ and $C$ .

Finally, we bound the error due to empirical estimates. Since $\rho pm=o(1)$ ,

Hence, by Corollary E.3, with a number of samples as stated in the theorem,

With that, invoking Theorem 5.4 (with $\|E\|_{\{1,2\},\{3\}}$ taking into account both the $E_{L}$ term above and the above error due to sampling), Algorithm 3 will output vectors $v_{i}$ , $i\in[m]$ , s.t. $v_{i}$ is $O(\eta^{\prime})$ -close to $\left(\frac{\rho}{1-\rho}\right)^{1/3}(1-\exp(-W_{i}))$ , for

where $\sigma=\min(\sigma_{\min}(Q_{a}),\sigma_{\min}(Q_{b}),\sigma_{\min}(Q_{c}))$ .

Plugging in the estimates from (D.1), (D.2), (D.3) as well as $\tau=O(\rho^{2/3}\log n)$ , we get:

which implies the vectors $(\hat{a}_{i},\hat{b}_{i},\hat{c}_{i})$ are $O(\eta^{\prime})$ -close to $\left(\frac{\rho}{1-\rho}\right)^{1/3}(1-\exp(-W_{i}))$ , for all $i\in[m]$ .

Given that, we prove the main theorem. The main issue will be to ensure that taking $\log$ of the values

Then we have that $\lVert Y_{i}^{\prime}-W_{i}\rVert\leq\lVert Y_{i}-W_{i}\rVert$ . By the Lipschitzness of $\log(\cdot)$ in the region $[\nu_{i},\infty)$

Therefore recalling $\lVert Y_{i}^{\prime}-W_{i}\rVert\leq\lVert Y_{i}-W_{i}\rVert\leq O(\eta\sqrt{np})$ we complete the proof. ∎
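Two elementary facts sit behind this last step: clipping an estimate to $[\nu,\infty)$ cannot move it farther from a true value that already lies in that interval, and $\log$ is $\frac{1}{\nu}$ -Lipschitz on $[\nu,\infty)$ . The sketch below illustrates both on synthetic scalars. The arrays `truth` and `noisy` and the value of `nu` are hypothetical and are not the quantities $Y_{i},Y_{i}^{\prime},W_{i}$ from the proof.

```python
import numpy as np

rng = np.random.default_rng(2)
nu = 0.1

truth = rng.uniform(nu, 1.0, size=1000)            # hypothetical true values, all >= nu
noisy = truth + 0.05 * rng.standard_normal(1000)   # hypothetical estimates, may fall below nu
clipped = np.maximum(noisy, nu)                    # truncate estimates to [nu, inf)

# Fact 1: clipping never increases the distance to a point inside [nu, inf).
clip_ok = np.all(np.abs(clipped - truth) <= np.abs(noisy - truth) + 1e-12)

# Fact 2: log is (1/nu)-Lipschitz on [nu, inf), by the mean value theorem.
log_err = np.abs(np.log(clipped) - np.log(truth))
lipschitz_ok = np.all(log_err <= np.abs(clipped - truth) / nu + 1e-12)

print(clip_ok, lipschitz_ok)
```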

The proof will follow the same outline as the proof of Theorem 3.1. The difference is that since we only have a guarantee on the spectral boundedness of the second- and third-order terms, we will need to bound the higher-order terms in a different manner. Given that we have no information on them in this scenario, we will simply bound them in the obvious manner. We proceed to formalize this.

The sample complexity is polynomial for the same reasons as in the proof of Theorem 3.1, so we will not worry about it here.

We only need to check the conditions for Algorithms 4 and 3 to work, so that we can apply Theorems 5.4 and 2.10.

Towards that, first we claim that we can write the PMI tensor for any equipartition $S_{a},S_{b},S_{c}$ of $[n]$ as

where $\|E\|_{\{1,2\},\{3\}}\leq\rho^{4}m(np)^{3/2}$ . Towards achieving this, first we claim that Proposition 5.6 implies that for any subsets $S_{a},S_{b},S_{c}$ ,

Indeed, if we put $\gamma_{k}=\left(1-\exp(-lW_{k})\right)_{S_{a}}$ , $\delta_{k}=\left(1-\exp(-lW_{k})\right)_{S_{b}}$ , $\theta_{k}=\left(1-\exp(-lW_{k})\right)_{S_{c}}$ , then we have $\|\sum_{k}\gamma_{k}\gamma_{k}^{\top}\|\leq\sqrt{mnp}$ , and similarly for $\delta_{k}$ . Since $\max_{k}\|\theta_{k}\|\leq(np)^{1/2}$ , the claim immediately follows. Hence, (D.4) follows.

We claim that $R_{S_{a}}R^{\top}_{S_{a}}$ is $\tau$ spectrally bounded by $\rho FF^{\top}$ .

where the first inequality holds since $HH^{\top},GG^{\top},LL^{\top}$ are $\tau$ -spectrally bounded by $F$ and the second since $\sigma_{\min}(FF^{\top})\gtrsim np$ and $\tau\geq 1$ . Let $\tau^{\prime}=\rho^{2/3}\tau$ . Since we are assuming the matrix $F$ is $O(1)$ -incoherent, we can apply Theorem C.10 and claim that Algorithm 4 outputs matrices $Q_{a},Q_{b},Q_{c}$ which are $\tau$ -approximate whitening matrices for $A,B,C$ respectively.

Then, applying Theorem 5.4, we recover vectors $(\hat{a}_{i},\hat{b}_{i},\hat{c}_{i})$ that are $O(\eta^{\prime})$ -close to $\left(\frac{\rho}{1-\rho}\right)^{1/3}(1-\exp(-W_{i}))$ , for all $i\in[m]$ , for

Recalling that $\tau^{\prime}=\rho^{2/3}\tau$ , $\lVert Q_{a}\rVert\leq\rho^{2/3}\sigma_{\max}(F)\lesssim\rho^{2/3}\sqrt{mn}p$ , $\|E\|_{\{1,2\},\{3\}}\leq\rho^{4}m(np)^{3/2}$ and $\sigma\gtrsim\rho^{2/3}np$ , we obtain that

where the last inequality holds since $\rho^{3}m=o(1)=o(\tau)$ .

The argument for recovering $W_{i}$ from $\exp(-W_{i})$ is then exactly the same as the one in the proof of Theorem 3.1.

Appendix E Sample complexity and bias of the PMI estimator

Finally, we consider the issue of sample complexity. The estimator we will use for the PMI matrix will simply be the plug-in estimator, namely:

Notice that this estimator is biased, but as the number of samples grows, the bias tends to zero. Formally, we can show:

with high probability $|\hat{\textup{PMI}}_{i,j}-\textup{PMI}_{i,j}|\leq\delta,\forall i\neq j$ .
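As an illustration of the plug-in estimator, here is a minimal sketch. It assumes the PMI matrix is defined through the events $s_{i}=0$ , i.e. $\textup{PMI}_{i,j}=\log\frac{\Pr[s_{i}=0\wedge s_{j}=0]}{\Pr[s_{i}=0]\Pr[s_{j}=0]}$ , which matches the quantities $\Delta_{i,j}$ and $\Delta_{i}$ used in the proof; the function name and the toy data are hypothetical.

```python
import numpy as np

def plugin_pmi(samples):
    """Plug-in PMI estimate from an (N, n) array of binary samples.

    Entry (i, j) is log( P^[s_i=0, s_j=0] / (P^[s_i=0] * P^[s_j=0]) ),
    where P^ denotes an empirical frequency. Only i != j is meaningful.
    """
    zeros = (np.asarray(samples) == 0).astype(float)
    num_samples = zeros.shape[0]
    p_single = zeros.mean(axis=0)              # empirical Pr[s_i = 0]
    p_pair = zeros.T @ zeros / num_samples     # empirical Pr[s_i = 0 and s_j = 0]
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.log(p_pair) - np.log(np.outer(p_single, p_single))

# Toy data: 4 samples of 2 binary symptoms.
toy = np.array([[0, 0],
                [0, 1],
                [1, 0],
                [0, 0]])
pmi_hat = plugin_pmi(toy)
print(pmi_hat[0, 1])   # log((2/4) / ((3/4) * (3/4))) = log(8/9)
```

Because the logarithm is applied to empirical frequencies, the estimate is biased for finite $N$ , and entries become infinite or undefined if some event is never observed; the sketch leaves such entries as returned by NumPy rather than smoothing them.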

Denoting $\Delta_{i,j}=\hat{\Pr}[s_{i}=0\wedge s_{j}=0]-\Pr[s_{i}=0\wedge s_{j}=0]$ and $\Delta_{i}=\hat{\Pr}[s_{i}=0]-\Pr[s_{i}=0]$ , we get that

Furthermore, we have that $\frac{2x}{2+x}\leq\log(1+x)\leq\frac{x}{\sqrt{x+1}}$ , for $x\geq 0$ , which implies that when $x\leq 1$ , $\frac{2}{3}x\leq\log(1+x)\leq x$ . From this it follows that if

Note that it suffices to show that if $N>\frac{1}{1-4p_{\max}m\rho_{\max}}\frac{1}{\delta^{2}}\log m$ , we have

with high probability by a simple union bound.
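The elementary logarithm bounds used in this argument, $\frac{2x}{2+x}\leq\log(1+x)\leq\frac{x}{\sqrt{x+1}}$ and their consequence $\frac{2}{3}x\leq\log(1+x)\leq x$ on $[0,1]$ , are easy to confirm on a grid. The following is a small numerical sketch with a hypothetical grid and tolerance; it is a sanity check, not a proof.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 1001)   # hypothetical grid on [0, 1]
log1p_x = np.log1p(x)
tol = 1e-12                       # slack for floating-point rounding

lower_tight = 2 * x / (2 + x)
upper_tight = x / np.sqrt(x + 1)

tight_ok = np.all(lower_tight <= log1p_x + tol) and np.all(log1p_x <= upper_tight + tol)
# On [0, 1] we have 2 + x <= 3 and sqrt(x + 1) >= 1, which gives the looser bounds.
loose_ok = np.all(2 * x / 3 <= log1p_x + tol) and np.all(log1p_x <= x + tol)

print(tight_ok, loose_ok)
```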

Both (E.2) and (E.3) will follow by a Chernoff bound.

Indeed, consider (E.2) first. We have by Chernoff

Hence, if $N>\frac{1}{\Pr[s_{i}=0]}\frac{1}{\delta^{2}}\log m$ , we get that $1-\delta\leq\frac{\hat{\Pr}[s_{i}=0]}{\Pr[s_{i}=0]}\leq 1+\delta$ , i.e. $|\Delta_{i}|\leq\delta\Pr[s_{i}=0]$ , with probability at least $1-\exp(-\log^{2}m)$ .

The proof of (E.3) is analogous, the only difference being that the requirement is that $N>\frac{1}{\Pr[s_{i}=0\wedge s_{j}=0]}\frac{1}{\delta^{2}}\log m$ , which gives the statement of the lemma.

Virtually the same proof as above shows that:

with high probability $|\hat{\textup{PMIT}}_{i,j,k}-\textup{PMIT}_{i,j,k}|\leq\delta,\forall i\neq j\neq k$ .

with high probability for any equipartition $S_{a},S_{b},S_{c}$

Appendix F Matrix Perturbation Toolbox

In this section we discuss standard matrix perturbation inequalities. Many results in this section can be found in Stewart and Sun [Ste77]. Given $\widehat{A}=A+E$ , the perturbation in individual singular values can be bounded by Weyl’s theorem:

Given $\widehat{A}=A+E$ , we know $\sigma_{k}(A)-\|E\|\leq\sigma_{k}(\widehat{A})\leq\sigma_{k}(A)+\|E\|$ .
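Weyl's bound is straightforward to check numerically. The sketch below uses hypothetical random matrices `A` and `E` and compares the largest shift in any singular value with the spectral norm of the perturbation.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))          # hypothetical matrix
E = 0.1 * rng.standard_normal((6, 4))    # hypothetical perturbation

# Singular values are returned in descending order, so index k matches in both.
s = np.linalg.svd(A, compute_uv=False)
s_hat = np.linalg.svd(A + E, compute_uv=False)

E_norm = np.linalg.norm(E, 2)            # spectral norm ||E||
max_shift = np.max(np.abs(s_hat - s))    # max_k |sigma_k(A + E) - sigma_k(A)|

print(max_shift, "<=", E_norm)
```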

For singular vectors, the perturbation is bounded by Wedin’s Theorem:

Let $\widehat{A}=A+E$ , with analogous singular value decomposition. Let $\Phi$ be the matrix of canonical angles between the column span of $U_{1}$ and that of $\widehat{U}_{1}$ , and $\Theta$ be the matrix of canonical angles between the column span of $V_{1}$ and that of $\widehat{V}_{1}$ . Suppose that there exists a $\delta$ such that

When we have a lower bound on $\sigma_{\min}(A)$ , it is easy to get bounds for the perturbation of the pseudoinverse.

Note that this theorem is not strong enough when the perturbation is only known to be $\tau$ -spectrally bounded in our definition.