Provable learning of Noisy-or Networks

Sanjeev Arora, Rong Ge, Tengyu Ma, Andrej Risteski

Introduction

Unsupervised learning is important and potentially very powerful because of the availability of the huge amount of unlabeled data — often several orders of magnitudes more than the labeled data in many domains. Latent variable models, a popular approach in unsupervised learning, model the latent structures in data: the “structure”corresponds to some hidden variables, which probabilistically determine the values of the visible coordinates in data. Bayes nets model the dependency structure of latent and observable variables via a directed graph. Learning parameters of a latent variable model given data samples is often seen as a canonical definition of unsupervised learning. Unfortunately, finding parameters with the maximum likelihood is NP-hard even in very simple settings. However, in practice many of these models can be learnt reasonably well using algorithms without polynomial runtime guarantees, such as expectation-maximization algorithm, Markov chain Monte Carlo, and variational inference. Bridging this gap between theory and practice is an important research goal.

Recently it has become possible to use matrix and tensor decomposition methods to design polynomial-time algorithms to learn some simple latent variable models such as topic models [AGM12, AGH+13], sparse coding models [AGMM15, MSS16], mixtures of Gaussians [HK13, GHK15], hidden Markov models [MR05], etc. These algorithms are guaranteed to work if the model parameters satisfy some conditions, which are reasonably realistic. In fact, matrix and tensor decomposition are a natural tool to turn to since they appear to be a sweet spot for theory whereby non-convex NP-hard problems can be solved provably under relatively clean and interpretable assumptions. But the above-mentioned recent results suggest that such methods apply only to solving latent variable models that are linear: specifically, they need the marginal of the observed variables conditioned on the hidden variables to depend linearly on the hidden variables. But many settings seem to call for nonlinearity in the model. For example, Bayes nets in many domains involve highly nonlinear operations on the latent variables, and could even have multiple layers. The study of neural networks also runs into nonlinear models such as restricted Boltzmann machines (RBM) [Smo86, HS06]. Can matrix factorization (or related tensor factorization) ideas help for learning nonlinear models?

We see that 1exp(Wjidj)1-\exp(-W_{ji}d_{j}) can be thought of as the probability that djd_{j} activates symptom sis_{i}, and sis_{i} is activated if one of djd_{j}’s activates it — which explains the name of the model, noisy-or. It follows that the conditional distribution sds\mid d is

One canonical use of this model is to model the relationship between diseases and symptoms, as in the classical human-constructed tool for medical diagnosis called Quick Medical Reference (QMR-DT) by (Miller et al.[MPJM82], Shwe et al. [SC91]) This textbook example ([JGJS99]) of a Bayes net captures relationships between 570570 binary disease variables (latent variables) and 40754075 observed binary symptom variables, with 45,47045,470 directed edges, and the WijW_{ij}’s are small integers.We thank Randolph Miller and Vanderbilt University for providing the current version of this network for our research. The name “noisy-or ”derives from the fact that the probability that the OR of mm independent binary variables y1,y2,,ymy_{1},y_{2},\ldots,y_{m} is 11 is exactly 1j(Pr[yj=0])1-\prod_{j}(\Pr[y_{j}=0]). Noisy-or models are implicitly using this expression; specifically, for the ii-th symptom we are considering the OR of mm events where the jjth event is “Disease jj does not cause symptom ii” and its probability is exp(Wijdj)\exp(-W_{ij}d_{j}). Treating these events as independent leads to expression (1.1).

The parameters of the QMR-DT network were hand-estimated by consulting human experts, but it is an interesting research problem whether such networks can be created in an automated way using only samples of patient data (i.e., the ss vectors). Previously there were no approaches for this that work even heuristically at the required problem size (n=4000n=4000). (This learning problem should not be confused with the simpler problem of infering the latent variables given the visible ones, which is also hard but has seen more work, including reasonable heuristic methods [JGJS99]). Halpern et al.[HS13, JHS13] have designed some algorithms for this problem. However, their first paper [HS13] assumes the graph structure is given. The second paper [JHS13] requires the Bayes network to be quartet-learnable, which is a strong assumption on the structure of the network. Finally, the problem of finding a “best-fit” Bayesian network according to popular metricsResearchers resort to these metrics as when the graph structure is unknown, multiple structures may have the same likelihood, so maximum likelihood is not appropriate. has been shown to be NP-complete by [Chi96] even when all of the hidden variables are also observed.

where W~ij\widetilde{W}_{ij}’s are upper bounded by νu\nu_{u} for some constant νu\nu_{u} and are identically distributed according to a distribution D\mathcal{D} which satisfies that for some constant νl>0\nu_{l}>0,

The condition (1.2) intuitively requires that W~ij\widetilde{W}_{ij} is bounded away from 0. We will assume that p1/3p\leq 1/3 and νu=O(1),νl=Ω(1)\nu_{u}=O(1),\nu_{l}=\Omega(1). (Again, these are realistic for QMR-DT setting).

Recall that we mostly thought of the prior of the diseases ρ\rho as being on the order O(1/m)O(1/m). This means that even if pp is on the order of 11, our relative error bound equals to O(1/m)1O(1/\sqrt{m})\ll 1.

Preliminaries and overview

We denote by 0{\bf 0} the all-zeroes vector and 1{\bf 1} the all-ones vector. A+A^{+} will denote the Moore-Penrose pseudo-inverse of a matrix AA, and for symmetric matrices AA, we use A1/2A^{-1/2} as a shorthand for (A+)1/2(A^{+})^{1/2}. The least non-zero singular value of matrix AA is denoted σmin(A)\sigma_{\min}(A).

For matrices A,BA,B we define the Kronecker product \otimes as (AB)ijkl=AijBkl.(A\otimes B)_{ijkl}=A_{ij}B_{kl}. A useful identity is that (AB)(CD)=(AC)(BD)(A\otimes B)\cdot(C\otimes D)=(AC)\otimes(BD) whenever the matrix multiplications are defined. Moreover, AiA_{i} will denote the i-th column of matrix AA and AiA^{i} the i-th row of matrix AA.

We write ABA\lesssim B if there exists a universal constant cc such that AcBA\leq cB and we define \gtrsim similarly.

(We will sometimes shorten PMI3 and PMI2 to PMI when this causes no confusion.)

Our algorithm is given polynomially many samples from the model (each sample describing which symptoms are or are not present in a particular patient). It starts by computing the following matrix n×nn\times n PMI and and n×n×nn\times n\times n tensor PMIT, which tabulate the correlations among all pairs and triples of symptoms (specifically, the indicator random variable for the symptom being absent):

Let Fk,GkF_{k},G_{k} denote the kkth columns of the above F,GF,G. Then,

The proposition is proved later (with precise statement) in Section A by computing the moments by marginalization and using Taylor expansion to approximate the log of the moments, and ignoring terms ρ3\rho^{3} and smaller. (Recall that ρ\rho is the probability that a patient has a particular disease, which should be small, of the order of O(1/n)O(1/n). The dependence of the final error upon ρ\rho appears in Section 3.) Since the tensor PMIT can be estimated to arbitrary accuracy given enough samples, the natural idea to recover the model parameters WW is to use Tensor Decomposition. This is what our algorithm does as well, except the following difficulties have to be overcome.

Difficulty 1: Suppose in equation (2.8) we view the first summand SS, which is rank mm with components FkF_{k}’s as the signal term. In all previous polynomial-time algorithms for tensor decomposition, the tensor is required to have the form k=1mFkFkFk+\emnoise\sum_{k=1}^{m}F_{k}\otimes F_{k}\otimes F_{k}+\text{\em noise}. To make our problem fit this template we could consider the second summand EE as the “noise”, especially since it is multiplied by ρ1\rho\ll 1 which tends to make EE have smaller norm than SS. But this is naive and incorrect, since EE is a very structured matrix: it is more appropriate viewed as systematic error. (In particular this error doesn’t go down in norm as the number of samples goes to infinity.) In order to do tensor decomposition in presence of such systematic error, we will need both a delicate error analysis and a very robust tensor decomposition algorithm. These will be outlined in Section 2.3.

Difficulty 2: To get our problem into a form suitable for tensor decomposition requires a whitening step, which uses the robust estimate of the whitening matrix from the second moment matrix. In this case, the whitening matrix has to be extracted out of the PMI matrix, which itself suffers from a systematic error. This also is not handled in previous works, and requires a delicate control of the error. See Section 2.4 for more discussion.

Difficulty 3: There is another source of inexactness in equation (2.8), namely the approximation is only true for those entries with distinct indices — for example, the diagonal entry PMIii\textup{PMI}_{ii} has completely different formula from that for PMIij\textup{PMI}_{ij} when iji\neq j. This will complicate the algorithm, as described in Subsections 2.3 and 2.4.

The next few Subsections sketch how to overcome these difficulties, and the details appear in the rest of the paper.

2 Recovering matrices in presence of systematic error

In this Section we recall the classical method of approximately recovering a matrix given noisy estimates of its entries. We discuss how to adapt that method to our setting where the error in the estimates is systematic and does not go down even with many more samples. The next section sketches an extension of this method to tensor decomposition with systematic error.

In the classical setting, there is an unknown n×nn\times n matrix SS of rank mm and we are given S+ES+E where EE is an error matrix. The method to recover SS is to compute the best rank-mm approximation to S+ES+E. The quality of this approximation was studied by Davis and Kahan [DK70] and Wedin [Wed72], and many subsequent authors. The quality of the recovery depends upon the ratio E/σm(S)||E||/\sigma_{m}(S), where σm()\sigma_{m}(\cdot) denotes mm-th largest singular value and ||\cdot|| denotes the spectral norm. To make this familiar lemma fit our setting more exactly, we will phrase the problem as trying to recover a matrix SS given noisy estimate SS+ESS^{\top}+E. Now one can only recover SS up to rotation, and the following lemma describes the error in the Davis-Kahan recovery. It also plays a key role in the error analysis of the usual algorithm for tensor decomposition.

In the above setting, let K,K^K,\widehat{K} the subspace of the top mm eigenvectors of SSSS^{\top} and SS+ESS^{\top}+E. Let ε\varepsilon be such that Eεσm(SS)\|E\|\leq\varepsilon\cdot\sigma_{m}(SS^{\top}). Then IdKIdK^ε\left\lVert\textup{Id}_{K}-\textup{Id}_{\widehat{K}}\right\rVert\lesssim\varepsilon where Id is the identity transformation on the subspace in question.

The Lemma thus treats E/σm(SS)||E||/\sigma_{m}(SS^{\top}) as the definition of noise/signal ratio. Before we generalize the definition and the algorithm to handle systematic error it is good to get some intuition, from looking at (2.6): PMIρ(FFT+ρGGT)\textup{PMI}\approx\rho(FF^{T}+\rho GG^{T}). Thinking of the first term as signal and the second as error, let’s check how bad is the noise/signal ratio defined in Davis-Kahan. The “signal”is σm(FF)\sigma_{m}(FF^{\top}), which is smaller than nn since the trace of FFFF^{\top} is of the order of mnmn in our probabilistic model for the weight matrix. The “noise” is the norm of ρGG\rho GG^{\top}, which is large since the GkG_{k}’s are nonnegative vectors with entries of the order of 1, and therefore the quadratic form 1n1,ρGG1n1\langle\frac{1}{\sqrt{n}}{\bf 1},\rho GG^{\top}\frac{1}{\sqrt{n}}{\bf 1}\rangle can be as large as ρkGk,1n12ρmn\rho\sum_{k}\langle G_{k},\frac{1}{\sqrt{n}}{\bf 1}\rangle^{2}\approx\rho mn. Thus the Davis-Kahan noise/signal ratio is ρm\rho m, and so when ρm1\rho m\ll 1, it allows recovering the subspace of FF with error O(ρm)O(\rho m). Note that this is a vacuous bound since ρ\rho needs to be at least 1/m1/m so that the hidden variable dd contains 1 non-zero entry in average. We’ll argue that this error is too pessimistic and we can in fact drive the estimation error down to close to ρ\rho.

The smallest such τ\tau is the “error/signal ratio” for this recovery problem.

This definition differs from Davis-Kahan’s because of the τSS\tau SS^{\top} term on the right hand side of (2.9). This allows, for any unit vector xx, the quadratic form value xTExx^{T}Ex to be as large as τ(xTSSx+σm(SS))\tau(x^{T}SS^{\top}x+\sigma_{m}(SS^{\top})). Thus for example the 1{\bf 1} vector no longer causes a large noise/signal ratio since both quadtratic forms FFFF^{\top} and GGGG^{\top} have large values on it.

This new error/signal ratio is no larger than the Davis-Kahan ratio, but can potentially be much smaller. Now we show how to do a better analysis of the Davis-Kahan recovery in terms of it. The proof of this theorem appears in Section 4.

Empirically, we can compute the τ\tau value for the weight matrix WW in the QMR-DT dataset [SC91], which is a textbook application of noisy OR network. For the QMR-DT dataset, τ\tau is under 66. This implies that the recovery error of the subspace of FF guaranteed by Theorem 2.4 is bounded by O(τρ)ρO(\tau\rho)\approx\rho, whereas the error bound by Davis-Kahan is O(ρm)O(\rho m).

3 Tensor decomposition with systematic error

Now we extend the insight from the matrix case to tensor recovery under systematic error. In turns out condition (2.9) is also a good measure of error/signal for the tensor recovery problem of (2.8). Specifically, if GG is τ\tau-bounded by FF, then we can recover the components FkF_{k}’s from the PMIT with column-wise error O(ρτ3/2m)O(\rho\tau^{3/2}\sqrt{m}). This requires a non-trivial algorithm (instead of SVD), and the additional gain is that we can recover FkF_{k}’s individually, instead of only obtaining the subspace with the PMI matrix.

First we recall the prior state of the art for the error analysis of tensor decomposition with Davis-Kahan type bounds. The best error bounds involve measuring the magnitude of the noise matrix ZZ in a new way. For any n1×n2×n3n_{1}\times n_{2}\times n_{3} tensor TT, we define the {1}{2,3}\lVert\cdot\rVert_{\{1\}\{2,3\}} norm as

There is a polynomial-time algorithm (Algorithm 2 later) which has the following guarantee. Suppose tensor TT is of the form

But in our setting the noise tensor has systematic error. An analog of Theorem 2.4 in this setting is complicated because even the whitening step is nontrivial. Recall also the inexactness in Proposition 2.1 due to the diagonal terms, which we earlier called Difficulty 3. We address this difficulty in the algorithm by setting up the problem using a sub-tensor of the PMI tensor. Let Sa,Sb,ScS_{a},S_{b},S_{c} be a uniformly random equipartition of the set of indices [n][n]. Let

where Fk,SF_{k,S} denotes the restriction of vector FkF_{k} to subset SS. Moreover, let

Then, since the sub-tensor PMITSa,Sb,Sc\textup{PMIT}_{S_{a},S_{b},S_{c}} only contains entries with distinct indices, we can use Taylor expansion (see Lemma A.1) to obtain that

Here the second summand on the RHS corresponds to the second order term in the Taylor expansion. It turns out that the higher order terms are multiplied by ρ3\rho^{3} and thus have negligible Frobenius norm, and therefore discussion below will focus on the first two summands.

For simplicity, let T=PMITSa,Sb,ScT=\textup{PMIT}_{S_{a},S_{b},S_{c}}. Our goal is to recover the components ak,bk,cka_{k},b_{k},c_{k} from the approximate low-rank tensor TT.

The first step is to whiten the components aka_{k}’s, bkb_{k}’s and ckc_{k}’s. Recall that ak=Fk,Saa_{k}=F_{k,S_{a}} is a non-negative vector. This implies the matrix A=[a1,,am]A=[a_{1},\dots,a_{m}] must have a significant contribution in the direction of the vector 1{\bf 1}, and thus is far away from being well-conditioned. For the purpose of this section, we assume for simplicity that we can access the covariance matrix defined by the vector aka_{k}’s,

Similarly we assume the access of Qˉb\bar{Q}_{b} and Qˉc\bar{Q}_{c} which are defined analogously. In Section 2.4 we discuss how to obtain approximately these three matrices.

Then, we can compute the whitened tensor by applying transformation (Qˉa+)1/2,(Qˉb+)1/2,(Qˉc+)1/2(\bar{Q}_{a}^{+})^{1/2},(\bar{Q}_{b}^{+})^{1/2},(\bar{Q}_{c}^{+})^{1/2} along the three modes of the tensor TT,

Now the first summand is a low rank orthogonal tensor, since (Qˉa+)1/2ak(\bar{Q}_{a}^{+})^{1/2}a_{k}’s are orthonormal vectors. However, the term ZZ is a systematic error and we use the following Lemma to control its {1}{2,3}\lVert\cdot\rVert_{\{1\}\{2,3\}} norm.

Lemma 2.7 shows that to give an upper bound on the {1}{2,3}\left\lVert\cdot\right\rVert_{\{1\}\{2,3\}} norm of the error tensor ZZ, it suffices to show that the square of the components of the error, namely, ΓΓ,ΔΔ,ΘΘ\Gamma\Gamma^{\top},\Delta\Delta^{\top},\Theta\Theta^{\top} are τ\tau-spectrally bounded by the components of the signal A,B,CA,B,C respectively. This will imply that Z{1}{2,3}(2τ)3/2ρ2\lVert Z\rVert_{\{1\}\{2,3\}}\leq(2\tau)^{3/2}\rho^{2}.

Recall that AA and Γ\Gamma are two sub-matrices of FF and GG. We have shown that GGGG^{\top} is τ\tau-spectrally bounded by FF in Proposition 2.5. It follows straightforwardly that the random sub-matrices also have the same property.

In the setting of this section, under the generative model for WW, w.h.p, we have that ΓΓ\Gamma\Gamma^{\top} is τ\tau-spectrally bounded by AA with τ=O(logn)\tau=O(\log n). The same is true for the other two modes.

Using Proposition 2.8 and Lemma 2.7, we have that

Then using Theorem 2.6 on the tensor (Qˉa+)1/2(Qˉb+)1/2(Qˉc+)1/2T(\bar{Q}_{a}^{+})^{1/2}\otimes(\bar{Q}_{b}^{+})^{1/2}\otimes(\bar{Q}_{c}^{+})^{1/2}\cdot T, we can recover the components (Qˉa+)1/2ak(\bar{Q}_{a}^{+})^{1/2}a_{k}’s, (Qˉb+)1/2bk(\bar{Q}_{b}^{+})^{1/2}b_{k}’s, and (Qˉc+)1/2ck(\bar{Q}_{c}^{+})^{1/2}c_{k}’s. This will lead us to recover aka_{k},bkb_{k} and ckc_{k}, and finally to recover the weight matrix WW.

4 Robust whitening

In the previous subsection, we assumed the access to Qˉa,Qˉb,Qˉc\bar{Q}_{a},\bar{Q}_{b},\bar{Q}_{c} (defined in (2.13)) which turns out to be highly non-trivial. A priori, using equation (2.6), noting that A=[F1,Sa,,Fm,Sa]A=[F_{1,S_{a}},\dots,F_{m,S_{a}}], we have

However, this approximation can be arbitrarily bad for the diagonal entries of PMI since equation (2.6) only works for entries with distinct indices. (Recall that this is why we divided the indices set into Sa,Sb,ScS_{a},S_{b},S_{c} and studied the asymmetric tensor in the previous subsection). Moreover, the diagonal of the matrix Qˉa\bar{Q}_{a} contributes to its spectrum significantly and therefore we cannot get meaningful bounds (in spectral norm) by ignoring the diagonal entries.

This issue turns out to arise in most of the previous tensor papers and the solution was to compute AAAA^{\top} by using the asymmetric moments AB,BC,CAAB^{\top},BC^{\top},CA^{\top},

Typically AB,BC,CAAB^{\top},BC^{\top},CA^{\top} can be estimated with arbitrarily small error (as number of samples go to infinity) and therefore the equation above leads to accurate estimate to AAAA^{\top}. However, in our case the errors in the estimate PMISa,SbAB\textup{PMI}_{S_{a},S_{b}}\approx AB^{\top}, PMISb,ScBC\textup{PMI}_{S_{b},S_{c}}\approx BC^{\top}, PMISc,SaCA\textup{PMI}_{S_{c},S_{a}}\approx CA^{\top} are systematic. Therefore, we need to use a more delicate analysis to control how the error accumulates in the estimate,

Here again, to get an accurate bound, we need to understand how the error in PMISa,SbAB\textup{PMI}_{S_{a},S_{b}}-AB^{\top} behaves relatively compared with ABAB^{\top} in a direction-by-direction basis. We generalized Definition 2.3 to capture the asymmetric spectral boundedness of the error by the signal.

Let KK be the column subspace of BB and HH be the column subspace of CC. Then we have Δ1=B+E(C)+\Delta_{1}=B^{+}E(C^{\top})^{+}, Δ2=B+EIdH\Delta_{2}=B^{+}E\textup{Id}_{H^{\perp}}, Δ3=IdKE(C)+\Delta_{3}=\textup{Id}_{K^{\perp}}E(C^{\top})^{+}, Δ4=IdKEIdH\Delta_{4}=\textup{Id}_{K^{\perp}}E\textup{Id}_{H^{\perp}}. Intuitively, they measure the relative relationship between EE and B,CB,C in different subspaces. For example, Δ1\Delta_{1} is the relative perturbation in the column subspace of KK and row subspace of HH. When B=CB=C, this is equivalent to the definition in the symmetric setting (this will be clearer in the proof of Theorem 2.4).

where Eab,Ebc,EcaE_{ab},E_{bc},E_{ca} are ε\varepsilon-spectrally bounded by (A,B)(A,B), (B,C)(B,C), (C,A)(C,A) respectively. Then, the matrix matrix

is a good approximation of AAAA^{\top} in the sense that Qa=Σab[Σbc]m+ΣcaAAQ_{a}=\Sigma_{ab}[\Sigma_{bc}^{\top}]_{m}^{+}\Sigma_{ca}-AA^{\top} is O(ε)O(\varepsilon)-spectrally bounded by AA. Here [Σ]m[\Sigma]_{m} denotes the best rank-mm approximation of Σ\Sigma.

The theorem is non-trivial even if the we have an absolute error assumption, that is, even if Ebcτσmin(B)σmin(C)\lVert E_{bc}\rVert\leq\tau\sigma_{min}(B)\sigma_{min}(C), which is stronger condition than EbcE_{bc} is τ\tau-spectrally bounded by (B,C)(B,C). Suppose we establish bounds on ΣabAB\lVert\Sigma_{ab}-AB^{\top}\rVert, Σbc+(BC)+\|\Sigma_{bc}^{+\top}-(BC^{\top})^{+}\| and ΣabAB\lVert\Sigma_{ab}-AB^{\top}\rVert individually, and then putting them together in the obvious way to control the error Σab[Σbc]m+ΣcaAB(BC)+CA\Sigma_{ab}[\Sigma_{bc}^{\top}]_{m}^{+}\Sigma_{ca}-AB^{\top}(BC^{\top})^{+}CA^{\top}. Then the error will be too large for us. This is because standard matrix perturbation theory gives that Σbc(BC)1\|\Sigma_{bc}^{-\top}-(BC^{\top})^{-1}\| can be bounded by O(Ebc(BC)12)O\left(\|E_{bc}\|\|(BC^{\top})^{-1}\|^{2}\right)\lesssim ε/[σmin(B)σmin(C)]\varepsilon/[\sigma_{min}(B)\sigma_{min}(C)], which is tight. Then we multiply the error with the norm of the rest of the two terms, the error will be roughly εσmax(B)σmax(C)σmin(B)σmin(C)\varepsilon\cdot\frac{\sigma_{\max}(B)\sigma_{\max}(C)}{\sigma_{\min}(B)\sigma_{\min}(C)}. That is, we will loss a condition number of B,CB,C, which can be dimension dependent for our case.

The fix to this problem is to avoid bounding each term in Σab[Σbc]m+Σca\Sigma_{ab}[\Sigma_{bc}^{\top}]_{m}^{+}\Sigma_{ca} individually. To do this, we will take the cancellation of these terms into account. Technically, we re-decompose the product Σab[Σbc]m+Σca\Sigma_{ab}[\Sigma_{bc}^{\top}]_{m}^{+}\Sigma_{ca} into a new product of three matrices (ΣabB+)(B[Σbc]m+C)(C+Σca)(\Sigma_{ab}B^{+})(B[\Sigma_{bc}^{\top}]_{m}^{+}C)(C^{+}\Sigma_{ca}), and then bound the error in each of these terms instead. See Section C for details.

As a corollary, we conclude that the whitened vectors (Qa+)1/2ai(Q_{a}^{+})^{1/2}a_{i}’s are indeed approximately orthonormal.

In the setting of Theorem 2.10, we have that (Qa+)1/2A(Q_{a}^{+})^{1/2}A contains approximately orthonormal vectors as columns, in the sense that

Therefore we have found an approximate whitening matrix for AA even though we do not have access to the diagonal entries.

Main Algorithms and Results

As sketched in Section 2, our main algorithm (Algorithm 1) uses tensor decomposition on the PMI tensor. In this section, we describe the different steps and how the fit together. Subsequently, all steps will be analyzed in separate sections.

Suppose the true WW is generated from the random model in Section 1 with ρpmc\rho pm\leq c for some sufficiently small constant cc. Then given N=poly(n,1/p/,1/ρ)N=\operatorname{poly}(n,1/p/,1/\rho) number of examples, Algorithm 1 returns a weight matrix W^\widehat{W} in polynomial time that satisfies

We also define the incoherence of a matrix FF. Roughly speaking, it says that the left singular vectors of FF don’t correlate with any of the natural basis vector much more than the average.

We assume the weight matrix WW satisfies the following deterministic assumptions, 1. GG,HH,LLGG^{\top},HH^{\top},LL^{\top} is τ\tau-spectrally bounded by FF for τ1\tau\geq 1. 2. FF is μ\mu-incoherent with μO~(n/m)\mu\leq\widetilde{O}(\sqrt{n/m}). 3. If maxiFi0pn\max_{i}\|F_{i}\|_{0}\leq pn, with high probability over the choice of a subset Sa,Sa=n/3S_{a},|S_{a}|=n/3, σmin(FSa)np\sigma_{\min}(F_{S_{a}})\gtrsim\sqrt{np} and ρpmc\rho pm\leq c for some sufficiently small constant cc.

Suppose the matrix WW satisfies the conditions 1-3 above. Given polynomial number of samples, Algorithm 1 returns W^\widehat{W} in polynomial time, s.t.

The proofs of Theorems uses the overall strategy of Section 2, and is deferred to Section D. We give a high level outline that demonstrates how the proofs depends on the machinery built in the subsequent sections.

Both Theorem 3.1 and Theorem 3.3 are similarly proved – the only technical difference being how the third and higher order terms are bounded. (Because of generative model assumption, for Theorem 3.1 we can get a more precise control on them.) Hence, we will not distinguish between them in the coming overview.

Overall, we will follow the approach outlined in Section 2. Let us step through Algorithm 1 line by line:

The overall goal will be to recover the leading terms of the PMI tensor. Of course, we get samples only, so can merely get an empirical version of it. In Section E, we show that the simple plug-in estimator does the job – and does so with polynomially many samples.

Recall Difficulty 3 from Section 2 : the PMI tensor and matrix expression is only accurate on the off-diagonal entries. In order to address this, in Section 2.3 we passed to a sub-tensor of the original tensor by partitioning the symptoms into three disjoint sets, and considering the induced tensor by this partition.

In order to apply the robust tensor decomposition algorithm from Section 5, we need to first calculate whitening matrices. This is necessarily complicated by the fact that the diagonals of the PMI matrix are not accurate, as discussed in Section 2.4. Section C gives guarantees on the procedure for calculating the whitening matrices.

This is main component of the algorithm: the robust tensor decomposition machinery. In Section 5, the conditions and guarantees for the success of the algorithm are formalized. There, we deal with the difficulties layed out in Section 2.2 : namely that we have a substantial systematic error that we need to handle. (Both due to higher-order terms, and due to the missing diagonal entries)

This step, along with Step 6, is a post-processing step – which allows us to recover the weight matrix WW after we have recovered the leading terms of the PMI tensor.

We also give a short quantitative sense of the guarantee of the algorithm. (The reader can find the full proof in Section D.)

To get quantitative bounds, we will first need a handle on spectral properties of the random model: these are located in Section B. As we mentioned above, the main driver of the algorithm is step 4, which uses our robust tensor decomposition machinery in Section 5. To apply the machinery, we first need to show that the second (and higher) order terms of the PMI tensor are spectrally bounded. This is done by applying Proposition B.4, which roughly shows the higher-order terms are O(ρlogn)O(\rho\log n)-spectrally bounded by ρFF\rho FF^{\top}. The whitening matrices are calculated using machinery in Section C. We can apply these tools since the random model gives rise to a O(1)O(1)-incoherent FF matrix as shown in Lemma B.3.

To get a final sense of what the guarantee is, the l2l_{2} error which step 4 gives, via Theorem 5.4 roughly behaves like σmaxτ3/2\sqrt{\sigma_{\max}}\tau^{3/2}, where σmax\sigma_{\max} is the spectral norm of the whitening matrices and τ\tau is the spectral boundedness parameter. But, by Lemma D.1 σmax\sigma_{\max} is approximately the spectral norm of ρFF\rho FF^{\top} – which on the other hand by Lemma B.1 is on the order of mnp2ρmnp^{2}\rho. Plugging in these values, we get the theorem statement.

Finding the Subspace under Heavy Perturbations

In this section, we show even if we perturb a matrix SSSS^{\top} with an error whose spectral norm might be much larger than σmin(SS)\sigma_{min}(SS^{\top}), as long as EE is spectrally bounded the top singular subspace of SS is still preserved. We defer the proof of the asymmetric case (Theorem 2.10) to Section C. We note that such type of perturbation bounds, often called relatively perturbation bounds, have been studied in [Ips98, Li98a, Li98b, Li97]. The results in these papers either require the that signal matrix is full rank, or the perturbation matrix has strong structure. We believe our results are new and the way that we phrase the bound makes the application to our problem convenient. We recall Theorem 2.4, which was originally stated in Section 2.

Therefore, we have that B2εσmin(AA)\lVert B\rVert^{2}\leq\varepsilon\sigma_{\min}(AA^{\top}). Moreover, we also have

Let P=(Idm+SS)1/2P=\left(\textup{Id}_{m}+SS^{\top}\right)^{1/2}. Then we write AA+EAA^{\top}+E as,

Let A^=(AP+BSP1)\widehat{A}=(AP+BS^{\top}P^{-1}). Let KK^{\prime} be the column span of A^\widehat{A}. We first prove that K^\widehat{K} is close to KK^{\prime}. Note that

Moreover, we have σmin(A^A^)=σmin(A^)2=(σmin(AP)BSP1)2(1O(ε))σmin(A)2\sigma_{\min}(\widehat{A}\widehat{A}^{\top})=\sigma_{\min}(\widehat{A})^{2}=\left(\sigma_{\min}(AP)-\lVert BS^{\top}P^{-1}\rVert\right)^{2}\geq(1-O(\varepsilon))\sigma_{\min}(A)^{2}. Therefore, using Wedin’s Theorem (Lemma F.2) on equation (4.1), we have that

Next we show KK^{\prime} and KK are also close. We have

Therefore, by Wedin’s Theorem, KK^{\prime}, as the span of top mm left singular vectors of A^\widehat{A}, is close to the span of the top left singular vector of APAP, namely, KK

Therefore using equation (4.2) and (4.3) and triangle inequality, we complete the proof. ∎

Robust Tensor Decomposition with Systematic Error

In this section we discuss how to robustly find the tensor decomposition even in presence of systematic error. We first illustrate the main techniques in an easier setting of orthogonal tensor decomposition (Section 5.1), then we describe how it can be generalized to the general setting that we require for our algorithm (Section 5.2).

We start with decomposing an orthogonal tensor with systematic error. The algorithm we use here is a slightly more general version of an algorithm in [MSS16].

Suppose {ui},{vi},{wi}\{u_{i}\},\{v_{i}\},\{w_{i}\} are three collection ε\varepsilon-approximate orthonormal vectors. Suppose tensor TT is of the form

The Theorem is a direct extension of [MSS16, Theorem 10.2] to asymmetric and approximate orthogonal case. We only provide a proof sketch here. We start by writing

Since Z{2}{1,3}τ\left\lVert Z\right\rVert_{\{2\}\{1,3\}}\leq\tau and Z{1}{2,3}τ\left\lVert Z\right\rVert_{\{1\}\{2,3\}}\leq\tau, [MSS16, Theorem 6.5] implies that with probability at least 1d21-d^{2} over the choice of gg,

Let t=2logdt=2\sqrt{\log d}. We have that with probability 1/(d1+δlogO(1)d)1/(d^{1+\delta}\log^{O(1)}d), g,w1(1+δ/3)t\langle g,w_{1}\rangle\geq(1+\delta/3)t and g,wjt\langle g,w_{j}\rangle\leq t for every j1j\neq 1. We condition on these events. Let uˉi\bar{u}_{i} be a set of orthonormal vectors such that Eu=[u1,,um][uˉ1,,uˉm]E_{u}=[u_{1},\dots,u_{m}]-[\bar{u}_{1},\dots,\bar{u}_{m}] satisfies Euε\lVert E_{u}\rVert\leq\varepsilon (we can take uˉi\bar{u}_{i}’s to be the whitening of uiu_{i}’s). Similarly define vˉi\bar{v}_{i}’s. Then we have that the term (defined in equation (5.1)) can be written as ig,w1uˉivˉi+E\sum_{i}\langle g,w_{1}\rangle\bar{u}_{i}\bar{v}_{i}+E^{\prime} where Eε\lVert E\rVert^{\prime}\lesssim\varepsilon. Let MˉS=ig,w1uˉivˉi\bar{M}_{S}=\sum_{i}\langle g,w_{1}\rangle\bar{u}_{i}\bar{v}_{i}. Then MˉS\bar{M}_{S} has top singular value g,w1(1+δ/3)t\langle g,w_{1}\rangle\geq(1+\delta/3)t, and second singular value at most tt. Moreover, the term Mg+EM_{g}+E^{\prime} has spectral norm bounded by O(τ+ε)O(\tau+\varepsilon). Thus by Wedin’s Theorem (Lemma F.2), the top left and right singular vectors u,vu,v of MS+Mg=MˉS+Mg+EM_{S}+M_{g}=\bar{M}_{S}+M_{g}+E^{\prime} are O((τ+ε)/δ)O((\tau+\varepsilon)/\delta)-close to uˉ1\bar{u}_{1} and vˉ1\bar{v}_{1} respectively. They are also O((τ+ε)/δ)O((\tau+\varepsilon)/\delta)-close to u1,v1u_{1},v_{1} since u1u_{1} is close to uˉ1\bar{u}_{1}. Moreover, we have (uvId)T(u^{\top}\otimes v^{\top}\otimes\textup{Id})\cdot T is O(τ/δ)O(\tau/\delta)-close to w1w_{1}.

Therefore, with probability 1/(d1+δlogO(1)d)1/(d^{1+\delta}\log^{O(1)}d), each round of the for loop in Algorithm 2 will find u1,v1,w1u_{1},v_{1},w_{1}. Line 5 is used to verify if the resulting vectors are indeed good using the injective norm as a test. It can be shown that if the test is passed then (u,v,z)(u,v,z) is close to one of the component. Therefore, after d1+δlogO(1)dd^{1+\delta}\log^{O(1)}d iterations, with high probability, we can find all of the components.

2 General tensor decomposition

In many previous works, general tensor decomposition is reduced to orthogonal tensor decomposition via a whitening procedure. However, here in our setting we cannot estimate the exact whitening matrix because of the systematic error. Therefore we need a more robust version of approximate whitening matrix, which we define below:

Let rdr\leq d. A collection of rr vectors {a1,,ar}\{a_{1},\dots,a_{r}\} is ε\varepsilon-approximately orthonormal if the matrix AA with aia_{i} as columns satisfies

With this in mind, we can state the guarantee on the tensor decomposition algorithm (Algorithm 3).

where σ=min(σmin(Qa),σmin(Qb),σmin(Qc))\sigma=\min(\sigma_{\min}(Q_{a}),\sigma_{\min}(Q_{b}),\sigma_{\min}(Q_{c})).

Note that in our model, the matrix EE has very small spectral norm as it is the third order term in ρ\rho (and ρ=O(1/n)\rho=O(1/n)). The spectral boundedness of Γ,Δ,Θ\Gamma,\Delta,\Theta are discussed in Section B. Therefore we can expect the RHS to be small.

In order to prove this theorem, we show after we apply whitening operation using the approximate whitening matrices, the tensor is still close to an orthogonal tensor. To do that, we need the following lemma which is a useful technical consequence of the condition (2.9).

Suppose FF is τ\tau-spectrally bounded by gg. Then,

Let KK be the column span of FF. Let Q=FFQ=FF^{\top}. Multiplying (Q+)1/2(Q^{+})^{1/2} on both sides of equation (2.9), we obtain that

It follows that (Q+)1/2G2τ\|(Q^{+})^{1/2}G\|\leq\sqrt{2\tau}, which in turns implies that G(FF)+G=G(Q+)1/2(Q+)1/2G2τ\|G^{\top}(FF^{\top})^{+}G\|=\lVert G^{\top}(Q^{+})^{1/2}(Q^{+})^{1/2}G\rVert\leq 2\tau.∎

We also need to bound the {1,2}{3}\{1,2\}\{3\} norm of the following systematic error tensor. This is important because we want to bound the spectral norm of the perturbation after the whitening operation.

Using the definition of {1,2}{3}\lVert\cdot\rVert_{\{1,2\}\{3\}} we have that

Next observe that we have that for any ii, δiδi(maxδi2)Id\delta_{i}\delta_{i}^{\top}\preceq(\max\left\lVert\delta_{i}\right\rVert^{2})\textup{Id} and therefore,

With this in mind, we prove the main theorem:

Conclusions

We have presented theoretical progress on the longstanding open problem of presenting a polynomial-time algorithm for learning noisy-or networks given sample outputs from the network. In particular it is enouraging that linear algebraic methods like tensor decomposition can play a role. Earlier there were no good approaches for this problem; even heuristics fail for realistic sizes like n=1000n=1000.

Can sample complexity be reduced, say to subcubic? (Cubic implies more than one billion examples for networks with 10001000 outputs.) Possibly this requires exploiting some hierarchichal structure –e.g. groupings of diseases and symptoms— in practical noisy-OR networks but exploring such possibilities using the current version of QMR-DT is difficult because it has been scrubbed of labels for diseases and symptoms.)

Various more practical versions of our algorithm are also easy to conceive and will be tested in the near future. This could be somewhat analogous to topic modeling, for which discovery of provable polynomial-time algorithms soon led to very efficient algorithms.

We thank Randolph Miller and Vanderbilt University for providing the current version of this network for our research.

References

Appendix A Formal expression for the PMI tensor

In this section we formally derive the expressions for the PMI tensors and matrices, which we only informally did in Section 2.

These matrices will appear naturally in the expressions for the higher-order terms in the Taylor expansion for the PMI matrix and tensor.

We first compute the formally the moments of the noisy-or model.

We only give the proof for the second equation. The rest can be shown analogously.

With this in mind, we give the expression for the PMI tensor along with all the higher-order terms.

For any equipartition Sa,Sb,ScS_{a},S_{b},S_{c} of [n][n], the restriction of the PMI tensor PMITSa,Sb,Sc\textup{PMIT}_{S_{a},S_{b},S_{c}} satisfies, for any L2L\geq 2,

The proof will proceed by Taylor expanding the log terms. Towards that, using Lemma A.1, we have :

By the Taylor expansion of log(1x)\log(1-x), we get that

by simple regrouping of the terms. By exchanging ll and tt, we get

where the last equality holds by noting that

The term corresponding to t=1t=1 is easily seen to be

therefore we to show the statement of the lemma, we only need bound the contribution of the terms with \lL\l\geq L.

Toward that, note that l,k1exp(lWk)n\forall l,k\|1-\exp(-lW_{k})\|\leq n. Hence, we have by Lemma 5.6,

Therefore, subadditivity of the {12},{3}\{12\},\{3\} norm gives

A completely analogous proof gives a similar expression for the PMI matrix:

For any subsets Sa,SbS_{a},S_{b} of [n][n], s.t. SaSb=S_{a}\cap S_{b}=\emptyset, the restriction of the PMI matrix PMISa,Sb\textup{PMI}_{S_{a},S_{b}} satisfies, for any L2L\geq 2,

Appendix B Spectral properties of the random model

The goal of this section is to prove that the random model specified in Section 1 satisfies the incoherence property 3.2 on the weight matrix and the spectral boundedness property of the PMI tensor. (Recall, the former is required for the whitening algorithm, and the later for the tensor decomposition algorithm.)

Before delving into the proofs, we will need a few simple bounds on the singular values of PlP_{l}.

Let Sa[n]S_{a}\subseteq[n], s.t. Sa=Ω(n)|S_{a}|=\Omega(n). With probability 1exp(log2n)1-\exp(-\log^{2}n) over the choice of WW, and for all l=O(\mboxpoly(n))l=O(\mbox{poly}(n)),

we will proceed to bound the smallest eigenvalue of LLL^{\top}L.

Since the matrices (1exp(lWk))(1exp(lWk))(1-\exp(-lW^{k}))(1-\exp(-lW^{k}))^{\top} are independent, the bound will follow from a matrix Bernstein bound. Denoting

Note that (1.2) together with the assumption ν=Ω(1)\nu=\Omega(1) gives σmin(Q)=Ω(p)\sigma_{\min}(Q)=\Omega(p)

Therefore, by Bernstein inequality we have that w.h.p,

which in turn implies that Pl,SanQP_{l,S_{a}}\succeq nQ. But this immediately implies σmin(Pl,Sa)np\sigma_{\min}(P_{l,S_{a}})\gtrsim np with high probability. Union bounding over all ll, we get the first part of the lemma.

The upper bound will be proven by a Chernoff bound. Note that the matrices

are independent. Furthermore, (1exp(lWk))Sa(1exp(lWk))Sa2pn\|(1-\exp(-lW_{k}))_{S_{a}}(1-\exp(-lW_{k}))_{S_{a}}^{\top}\|^{2}\leq pn with high probability, and the variable (1exp(lWk))Sa(1exp(lWk))Sa2\|(1-\exp(-lW_{k}))_{S_{a}}(1-\exp(-lW_{k}))_{S_{a}}^{\top}\|^{2} is sub-exponential. Finally,

A union bound over all values of ll gives the statement of the Lemma.

First, we proceed to show the incoherence property 3.2 on the weight matrix.

Suppose nn is a multiple of 3. Let F=UΣVF=U\Sigma V be the singular value decomposition of FF. Let Sa,Sb,ScS_{a},S_{b},S_{c} be a uniformly random equipartition of the rows of [n][n]. Suppose FF is μ\mu-incoherent with μn/(mlogn)\mu\leq n/(m\log n). Then, with high probability over the choice of and Sa,Sb,ScS_{a},S_{b},S_{c}, we have for every i{a,b,c}i\in\{a,b,c\},

Let S=SaS=S_{a}. Then, since UU=IdmU^{\top}U=\textup{Id}_{m}, we have

By the assumption on the row norms of UU, Ui(Ui)2=Ui2μmn\|U^{i}(U^{i})^{\top}\|_{2}=\|U^{i}\|^{2}\leq\mu\frac{m}{n}. By the incoherence assumption, we have that maxiUi2μm/n\max_{i}\lVert U^{i}\rVert^{2}\leq\mu m/n.

We also note that UiU^{i}’s are negatively associated random variables. Therefore by the matrix Chernoff inequality for negatively associated random variables, we have with high probability,

But, an analogous argument holds for Sb,ScS_{b},S_{c} as well – so by a union bound over kk, we complete the proof. ∎

Suppose nmlognn\gtrsim m\log n. Under the generative assumption in Section 1 for WW, we have that we have that F=1exp(W)F=1-\exp(-W) is O(1)O(1)-incoherent.

We have that FF=UΣ2UTFF^{\top}=U\Sigma^{2}U^{T} and therefore, Fi2=Σi,i2Ui2\|F^{i}\|^{2}=\Sigma^{2}_{i,i}\|U^{i}\|^{2}. This in turn implies that

Since Fi2pm+pm2pm\left\lVert F^{i}\right\rVert^{2}\leq pm+\sqrt{pm}\leq 2pm with high probability, we only need to bound σmin(F)\sigma_{\min}(F) from below. Note that σmin2(F)=σmin(FF)\sigma^{2}_{\min}(F)=\sigma_{\min}(F^{\top}F). Therefore it suffices to control σmin(FF)\sigma_{\min}(F^{\top}F). But by Lemma B.1 we have σmin2(F)np\sigma^{2}_{\min}(F)\gtrsim np.

B.2 Spectral boundedness

The main goal of the section is to show that the bias terms in the PMI tensor are spectrally bounded by the PMI matrix (which we can estimate from samples). Furthermore, we show that we can calculate an approximate whitening matrix for the leading terms of the PMI tensor using the machinery in Section .

The main proposition we will show is the following:

The main element for the proposition is the following Lemma:

Before proving the Lemma, let us see how the proposition follows from it:

Since AA=(ρ1ρ)2/3P1,SaAA^{\top}=(\frac{\rho}{1-\rho})^{2/3}P_{1,S_{a}}, the claim of the Proposition follows.

It is clear analogous statements hold for SbS_{b} and ScS_{c}. ∎

For notational convenience, we will denote by Jm×nJ_{m\times n} the all ones matrix with dimension m×nm\times n. (We will omit the dimensions when clear from the context.)

This statement will immediately follow from the following two lemmas:

Before showing these lemmas, let us see how Lemma B.5 is implied by them:

Let κ\kappa be the constant in B.7, s.t. Pl,Sa+6nplognIdmp2JP_{l,S_{a}}+6np\log n\textup{Id}\precsim mp^{2}J. Putting the bounds from Lemmas B.6 and B.7 together along with a union bound, we have that with high probability, l,l=O(\mboxpoly(n))\forall l,l^{\prime}=O(\mbox{poly}(n))

But, note that σmin(Pl)=Ω(np)\sigma_{\min}(P_{l^{\prime}})=\Omega(np), by Lemma B.1. Hence, Pl52κPl,Sarlognσmin(Pl)P_{l}-\frac{5}{2}\kappa P_{l^{\prime},S_{a}}\preceq r\log n\sigma_{\min}(P_{l^{\prime}}), for some sufficiently large constant rr. This implies

from which the statement of the lemma follows.

Let’s denote by e=1Sa1e=\frac{1}{\sqrt{|S_{a}|}}\mathbf{1} . Let’s furthermore denote Id1=ee\textup{Id}_{\mathbf{1}}=ee^{\top}, and Id1=Idee\textup{Id}_{-\mathbf{1}}=\textup{Id}-ee^{\top}. Note first that trivially, since Id1+Id1=Id\textup{Id}_{\mathbf{1}}+\textup{Id}_{-\mathbf{1}}=\textup{Id},

We proceed to upper bound both the terms on the RHS above. More precisely, we will show that

Let us proceed to showing (B.4). The LHS can be rewritten as

We proceed to (B.5), which will be shown by a Bernstein bound. Towards that, note that

Therefore, applying a matrix Bernstein bound, we get

Combining this with (B.2) and (B.3), we get

Let us proceed to the second inequality, which essentially follows the same strategy:

Reusing the notation from Lemma B.7, we have that

for similar reasons as before. From this we get that

Putting (B.6) and (B.7) together, we get that

We will proceed to show an upper bound Id1PId12nplognId\textup{Id}_{-\mathbf{1}}P\textup{Id}_{-\mathbf{1}}\preceq 2np\log n\textup{Id} on second term of the LHS. We will do this by a Bernstein bound as before. Namely, analogously as in Lemma B.7,

Therefore, applying a matrix Bernstein bound, we get

Appendix C Robust whitening

In this section, we show the formula Qa=ρ1/3PMI^Sa,Sb(PMI^Sb,Sc+)PMI^Sc,SaQ_{a}=\rho^{-1/3}\widehat{\textup{PMI}}_{S_{a},S_{b}}(\widehat{\textup{PMI}}^{+}_{S_{b},S_{c}})^{\top}\widehat{\textup{PMI}}_{S_{c},S_{a}} computes an approximation of the true whitening matrix AAAA^{\top}, so that the error is ε\varepsilon-spectrally bounded by AA. We recall Theorem 2.10.

where Eab,Ebc,EcaE_{ab},E_{bc},E_{ca} are ε\varepsilon-spectrally bounded by (A,B)(A,B), (B,C)(B,C), (C,A)(C,A) respectively. Then, the matrix matrix

is a good approximation of AAAA^{\top} in the sense that Qa=Σab[Σbc]m+ΣcaAAQ_{a}=\Sigma_{ab}[\Sigma_{bc}^{\top}]_{m}^{+}\Sigma_{ca}-AA^{\top} is O(ε)O(\varepsilon)-spectrally bounded by AA. Here [Σ]m[\Sigma]_{m} denotes the best rank-mm approximation of Σ\Sigma.

Towards proving Theorem 2.10, an intermediate step is to understand the how the space of singular vectors of BCBC^{\top} are aligned with the noisy version Σbc\Sigma_{bc}. The following explicitly represent BC+EBC^{\top}+E as the form BR(C)+ΔB^{\prime}R(C^{\prime})^{\top}+\Delta^{\prime}. Here the crucial benefit to do so is that the resulting Δ\Delta^{\prime} is small in every direction. In other words, we started with a relative error guarantees on EE and the Lemma below converts to it an absolute error guarantees on Δ\Delta^{\prime} (though the signal term changes slightly).

Suppose B,CB,C are n×mn\times m matrices with nmn\geq m. Suppose a matrix EE is ε\varepsilon-spectrally bounded by (B,C)(B,C), then BC+EBC^{\top}+E can be written as

where ΔB,ΔC,ΔBC\Delta_{B},\Delta_{C},\Delta_{BC}^{\prime} are small and RBCR_{BC} is close to identity in the sense that,

The key intuition is if the perturbation is happening in the span of columns of BB and CC, they cannot change the subspace. By Definition 2.9, we can write EE as

Now since Δ1ε<1\|\Delta_{1}\|\leq\varepsilon<1, we know (IdΔ1)(\textup{Id}-\Delta_{1}) is invertible, so we can write

This is already in the desired form as we can let ΔB=Δ3(Id+Δ1)1\Delta_{B}=\Delta_{3}(\textup{Id}+\Delta_{1})^{-1}, RBC=(Id+Δ1)R_{BC}=(\textup{Id}+\Delta_{1}), ΔC=Δ2(IdΔ1)\Delta_{C}=\Delta_{2}(\textup{Id}-\Delta_{1})^{-\top}, and ΔBC=Δ3(IdΔ1)1Δ2+Δ4\Delta_{BC}^{\prime}=-\Delta_{3}(\textup{Id}-\Delta_{1})^{-1}\Delta_{2}^{\top}+\Delta_{4}. By Weyl’s Theorem we know σmin(Id+Δ1)1ε\sigma_{min}(\textup{Id}+\Delta_{1})\geq 1-\varepsilon, therefore ΔBΔ3σmin1(Id+Δ)ε1εσmin(B)\|\Delta_{B}\|\leq\|\Delta_{3}\|\sigma_{min}^{-1}(\textup{Id}+\Delta)\leq\frac{\varepsilon}{1-\varepsilon}\sigma_{min}(B). Other terms can be bounded similarly.

Now we prove that the top mm approximation of BC+EBC^{\top}+E has similar column/row spaces as BCBC^{\top}. Let UBU_{B} be the column span of BB, UBU_{B}^{\prime} be the column span of (B+ΔB)(B+\Delta_{B}), and UBU_{B}^{\prime\prime} be the top mm left singular subspace of (BC+E)(BC^{\top}+E). Similarly we can define UCU_{C}, UCU_{C}^{\prime}, UCU_{C}^{\prime\prime} to be the column spans of CC, C+ΔCC+\Delta_{C} and the top mm right singular subspace of (BC+E)(BC^{\top}+E).

For B+ΔBB+\Delta_{B}, we can apply Weyl’s Theorem and Wedin’s Theorem. By Weyl’s Theorem we know σmin(B+ΔB)σmin(B)ΔB(1O(ε))σmin(B)\sigma_{min}(B+\Delta_{B})\geq\sigma_{min}(B)-\|\Delta_{B}\|\geq(1-O(\varepsilon))\sigma_{min}(B). By Wedin’s Theorem we know UBU_{B}^{\prime} is O(ε)O(\varepsilon)-close to UBU_{B}. Similar results apply to C+ΔCC+\Delta_{C}.

Now we know σmin((B+ΔB)RBC(C+ΔC)))σmin(B+ΔB)σmin(RBC)σmin(C+ΔC)Ω(σmin(B)σmin(C)\sigma_{min}((B+\Delta_{B})R_{BC}(C+\Delta_{C})^{\top}))\geq\sigma_{min}(B+\Delta_{B})\sigma_{min}(R_{BC})\sigma_{min}(C+\Delta_{C})\geq\Omega(\sigma_{min}(B)\sigma_{min}(C). Therefore we can again apply Wedin’s Theorem, considering (B+ΔB)RBC(C+ΔC))(B+\Delta_{B})R_{BC}(C+\Delta_{C})^{\top}) as the original matrix and ΔBC\Delta^{\prime}_{BC} as the perturbation. As a result, we know UBU_{B}^{\prime\prime} is O(ε)O(\varepsilon) close to UBU_{B}^{\prime}, UCU_{C}^{\prime\prime} is O(ε)O(\varepsilon) close to UBU_{B}^{\prime}. The distance between UB,UBU_{B},U_{B}^{\prime\prime} (and UC,UCU_{C},U_{C}^{\prime\prime}) then follows from triangle inequality.

As a direct corollary of Lemma C.1, we obtain that the BCBC^{\top} and BC+EBC^{\top}+E have similar subspaces of singular vectors.

In the setting of Lemma C.1, let [BC+E]m[BC^{\top}+E]_{m} be the best rank-mm approximation of BC+EBC^{\top}+E. Then, the span of columns of [BC+E]m[BC^{\top}+E]_{m} is O(ε)O(\varepsilon)-close to the span of columns of BB, span of rows of [BC+E]m[BC^{\top}+E]_{m} is O(ε)O(\varepsilon)-close to the span of columns of CC.

Furthermore, we can write [BC+E]m=(B+ΔB)RBC(C+ΔC)+ΔBC.[BC^{\top}+E]_{m}=(B+\Delta_{B})R_{BC}(C+\Delta_{C})^{\top}+\Delta_{BC}. Here ΔB\Delta_{B}, ΔC\Delta_{C} and RBCR_{BC} as defined in Lemma C.1 and ΔBC\Delta_{BC} satisfies ΔBCO(εσmin(B)σmin(C))\|\Delta_{BC}\|\leq O(\varepsilon\sigma_{min}(B)\sigma_{min}(C)).

Since [BC+E]m[BC^{\top}+E]_{m} is the best rank-mm approximation, because (B+ΔB)RBC(C+ΔC))(B+\Delta_{B})R_{BC}(C+\Delta_{C})^{\top}) is a rank mm matrix, in particular we have

In order to fix this problem, we notice that the matrix [Σbc]m+[\Sigma_{bc}^{\top}]_{m}^{+} is multiplied by Σab\Sigma_{ab} on the left and Σca\Sigma_{ca} on the right. Assuming Σab=AB\Sigma_{ab}=AB^{\top}, Σca=CA\Sigma_{ca}=CA^{\top}, we should expect [Σbc]m+[\Sigma_{bc}^{\top}]_{m}^{+} to “cancel” with the BB^{\top} factor on the left and the CC factor on the right, giving us AAAA^{\top}. Therefore, we should really measure the error of the middle term [Σbc]m+[\Sigma_{bc}^{\top}]_{m}^{+} after left multiplying with BB^{\top} and right multiplying with CC. We formalize this in the following lemma:

Suppose Σbc\Sigma_{bc} is as defined in Theorem 2.10, let Δ=[Σbc]m+[CB]+\Delta=[\Sigma_{bc}^{\top}]_{m}^{+}-[CB^{\top}]^{+}, then we have

We will first prove Theorem 2.10 assuming Lemma C.3.

By Lemma C.1, we know Σab\Sigma_{ab} can be written as

Similarly Σca\Sigma_{ca} can be written as

Here the Δ\Delta terms and RR terms are bounded as in Lemma C.1.

Now let us write the matrix Σab[Σbc]m+Σca\Sigma_{ab}[\Sigma_{bc}^{\top}]_{m}^{+}\Sigma_{ca} as

We can now view Σab[Σbc]m+Σca\Sigma_{ab}[\Sigma_{bc}^{\top}]_{m}^{+}\Sigma_{ca} as the product of three terms, each term is the sum of two matrices. Therefore we can expand the product into 8 terms. In each of the three pairs, we will call the first matrix the main matrix, and the second matrix the perturbation.

In the remaining proof, we will do calculations to show the product of the main terms is close to AAAA^{\top}, and all the other 7 terms are small.

Before doing that, we first prove several Claims about PSD matrices

If Δε\|\Delta\|\leq\varepsilon, then AΔAεAAA\Delta A^{\top}\preceq\varepsilon AA^{\top}. If Γεσmin(A)\|\Gamma\|\leq\varepsilon\sigma_{min}(A), then 12(AΓ+ΓA)εAA+εσmin2(A)Id\frac{1}{2}(A\Gamma^{\top}+\Gamma A^{\top})\preceq\varepsilon AA^{\top}+\varepsilon\sigma_{min}^{2}(A)\textup{Id} .

Both inequalities can be proved by consider the quadratic form. We know for any xx, xAΔAxΔAx2εxAAxx^{\top}A\Delta A^{\top}x\leq\|\Delta\|\|A^{\top}x\|^{2}\leq\varepsilon x^{\top}AA^{\top}x, so the first part is true.

For the second part, for any xx we can apply Cauchy-Schwartz inequality

Now, we will first prove the product of three main matrices is close to AAAA^{\top}:

We have ((A+ΔA1)RAB(B+ΔB1))(CB)+((C+ΔC3)RCA(A+ΔA3))=AA+EA\left((A+\Delta^{1}_{A})R_{AB}(B+\Delta^{1}_{B})^{\top}\right)(CB^{\top})^{+}\left((C+\Delta^{3}_{C})R_{CA}(A+\Delta^{3}_{A})^{\top}\right)=AA^{\top}+E_{A}, where EAE_{A} is O(ε)O(\varepsilon)-spectrally bounded by AAAA^{\top}.

We will first prove the middle part of the matrix (B+ΔB1)(CB)+(C+ΔC3)(B+\Delta^{1}_{B})^{\top}(CB^{\top})^{+}(C+\Delta^{3}_{C}) is O(ε)O(\varepsilon) close to identity matrix Id. Here we observe that both B,CB,C have full column rank so (CB)+=(B)+C+(CB^{\top})^{+}=(B^{\top})^{+}C^{+}. Therefore we can rewrite the product as (Id+B+ΔB1)(Id+C+ΔC3)(\textup{Id}+B^{+}\Delta^{1}_{B})^{\top}(\textup{Id}+C^{+}\Delta^{3}_{C}). Since ΔB1O(εσmin(B))\|\Delta^{1}_{B}\|\leq O(\varepsilon\sigma_{min}(B)) by Lemma C.1 (and similarly for CC), we know B+ΔB1O(ε)\|B^{+}\Delta^{1}_{B}\|\leq O(\varepsilon). Therefore the middle part is O(ε)O(\varepsilon) close to Id. Now since ε1\varepsilon\ll 1 we know R^AB=RAB(B+ΔB1)(CB)+(C+ΔC3)RCA\widehat{R}_{AB}=R_{AB}(B+\Delta^{1}_{B})^{\top}(CB^{\top})^{+}(C+\Delta^{3}_{C})R_{CA} is O(ε)O(\varepsilon)-close to Id.

Now we are left with (A+ΔA1)R^AB(A+ΔA3)(A+\Delta^{1}_{A})\widehat{R}_{AB}(A+\Delta^{3}_{A})^{\top}, for this matrix we know

The first term A(R^ABId)AO(ε)AAA(\widehat{R}_{AB}-\textup{Id})A^{\top}\preceq O(\varepsilon)AA^{\top} (Claim C.4); the fourth term ΔA1R^AB(ΔA3)O(εσmin2(A))Id\Delta^{1}_{A}\widehat{R}_{AB}(\Delta^{3}_{A})^{\top}\preceq O(\varepsilon\sigma_{min}^{2}(A))\textup{Id} (by the norm bounds of ΔA1\Delta^{1}_{A} and ΔA3\Delta^{3}_{A}. For the cross terms, we can bound them using the second part of Claim C.4. ∎

Next we will try to prove the remaining 7 terms are small. We partition them into three types depending on how many Δ\Delta factors they have. We proceed to bound them in each of these cases.

For the terms with only one Δ\Delta, we claim:

The three terms ΔAB(CB)+(C+ΔC3)RCA(A+ΔA3)\Delta_{AB}(CB^{\top})^{+}(C+\Delta^{3}_{C})R_{CA}(A+\Delta^{3}_{A})^{\top}, (A+ΔA1)RAB(B+ΔB1)ΔAB(C+ΔC3)RCA(A+ΔA3)(A+\Delta^{1}_{A})R_{AB}(B+\Delta^{1}_{B})^{\top}\Delta_{AB}(C+\Delta^{3}_{C})R_{CA}(A+\Delta^{3}_{A})^{\top}, ((A+ΔA1)RAB(B+ΔB1))(CB)+ΔCA\left((A+\Delta^{1}_{A})R_{AB}(B+\Delta^{1}_{B})^{\top}\right)(CB^{\top})^{+}\Delta_{CA} are all O(ε)O(\varepsilon) spectrally bounded by AAAA^{\top}.

For the first term, note that both B,CB,C have full column rank, and hence (CB)+=(B)+C+(CB^{\top})^{+}=(B^{\top})^{+}C^{+}. Therefore the first term can be rewritten as

By Lemma C.1, we have spectral norm bounds for ΔAB,ΔC3,ΔA3,RCA\Delta_{AB},\Delta^{3}_{C},\Delta^{3}_{A},R_{CA}. Therefore we know ΔAB(B)+O(εσmin(A))\|\Delta_{AB}(B^{\top})^{+}\|\leq O(\varepsilon\sigma_{min}(A)) and [(Id+C+ΔC3)RCA][(\textup{Id}+C^{+}\Delta^{3}_{C})R_{CA}] is O(ε)O(\varepsilon) close to Id. Therefore [ΔAB(B)+][(Id+C+ΔC3)RCA](ΔA3)O(εσmin2(A))\|[\Delta_{AB}(B^{\top})^{+}][(\textup{Id}+C^{+}\Delta^{3}_{C})R_{CA}](\Delta^{3}_{A})^{\top}\|\leq O(\varepsilon\sigma_{min}^{2}(A)) is trivially O(ε)O(\varepsilon) spectrally bounded, and [ΔAB(B)+][(Id+C+ΔC3)RCA]A[\Delta_{AB}(B^{\top})^{+}][(\textup{Id}+C^{+}\Delta^{3}_{C})R_{CA}]A^{\top} is O(ε)O(\varepsilon) spectrally bounded by Claim C.4. The third term is exactly symmetric.

For the second part, we will first prove the middle part of the matrix Δ^BC=(B+ΔB1)ΔBC(C+ΔC3)\widehat{\Delta}_{BC}=(B+\Delta^{1}_{B})^{\top}\Delta_{BC}(C+\Delta^{3}_{C}) has spectral norm O(ε)O(\varepsilon). This can be done y expanding it to the sum of 4 terms, and use appropriate spectral norm bounds on ΔBC\Delta_{BC} and its products with BB^{\top} and CC from Lemma C.3. Now we can show (A+ΔA1)RABΔ^BCRCA(A+ΔA3)(A+\Delta^{1}_{A})R_{AB}\widehat{\Delta}_{BC}R_{CA}(A+\Delta^{3}_{A})^{\top} is O(ε)O(\varepsilon) spectrally bounded by the first part of Claim C.4. ∎

Next we try to bound the terms with two Δ\Delta factors.

The three terms ΔABΔBC(C+ΔC3)RCA(A+ΔA3)\Delta_{AB}\Delta_{BC}(C+\Delta^{3}_{C})R_{CA}(A+\Delta^{3}_{A})^{\top}, ΔAB(CB)+ΔCA\Delta_{AB}(CB^{\top})^{+}\Delta_{CA}, ((A+ΔA1)RAB(B+ΔB1))ΔBCΔCA\left((A+\Delta^{1}_{A})R_{AB}(B+\Delta^{1}_{B})^{\top}\right)\Delta_{BC}\Delta_{CA} are all O(ε2)O(\varepsilon^{2}) spectrally bounded by AAAA^{\top}.

For the first term, notice that ΔBC(C+ΔC3)\|\Delta_{BC}(C+\Delta^{3}_{C})\| is bounded by O(ε/σmin(B))O(\varepsilon/\sigma_{min}(B)) by Lemma C.3, and ΔAB=O(εσmin(A)σmin(B))\|\Delta_{AB}\|=O(\varepsilon\sigma_{min}(A)\sigma_{min}(B)). Therefore we know ΔABΔBC(C+ΔC3)RCAO(ε2σmin(A))\|\Delta_{AB}\Delta_{BC}(C+\Delta^{3}_{C})R_{CA}\|\leq O(\varepsilon^{2}\sigma_{min}(A)), so by Claim C.4 we know this term is O(ε2)O(\varepsilon^{2}) spectrally bounded by AAAA^{\top}. Third term is symmetric.

For the second term, by Lemma C.1 we can directly bound its spectral norm by O(ε2σmin2(A))O(\varepsilon^{2}\sigma_{min}^{2}(A)), so it is trivially O(ε2)O(\varepsilon^{2}) spectrally bounded by AAAA^{\top}. ∎

Finally, for the product ΔABΔBCΔCA\Delta_{AB}\Delta_{BC}\Delta_{CA}, we can get the spectral norm for the three factors by Lemma C.1 and Lemma C.3. As a result ΔABΔBCΔCAO(ε3σmin2(A))\|\Delta_{AB}\Delta_{BC}\Delta_{CA}\|\leq O(\varepsilon^{3}\sigma^{2}_{min}(A)) which is trivially O(ε3)O(\varepsilon^{3}) spectrally bounded by AAAA^{\top}.

Combining the bound for all of the terms we get the theorem. ∎

We first prove a simpler version where the perturbation is simply bounded in spectral norm

Suppose B,CB,C are n×mn\times m matrices and nmn\geq m. Let RR be an n×nn\times n matrix such that RIdε\|R-\textup{Id}\|\leq\varepsilon, and EE is a perturbation matrix with Eεσmin(B)σmin(C)\|E\|\leq\varepsilon\sigma_{min}(B)\sigma_{min}(C) and (CRB+E)(CRB^{\top}+E) is also of rank mm.

Now let Δ=(CRB+E)+(CRB)+\Delta=(CRB^{\top}+E)^{+}-(CRB^{\top})^{+}, then when ε1\varepsilon\ll 1 we have

We first give the proof for BΔC\|B^{\top}\Delta C\|. Other terms are similar.

Let UBU_{B} be the column span of BB, and UBU^{\prime}_{B} be the row span of (CRB+E)(CRB^{\top}+E). Similarly let UCU_{C} be the column span of CC and UCU^{\prime}_{C} be the column span of (CRB+E)(CRB^{\top}+E). By Wedin’s theorem, we know UBU^{\prime}_{B} is O(ε)O(\varepsilon) close to UBU_{B} and UCU^{\prime}_{C} is O(ε)O(\varepsilon) close to UCU_{C}. As a result, suppose the SVD of BB is UBDBVBU_{B}D_{B}V_{B}^{\top}, we know

The same is true for CC: σmin(CUC)(1O(ε))σmin(C)\sigma_{min}(C^{\top}U^{\prime}_{C})\geq(1-O(\varepsilon))\sigma_{min}(C).

By the property of pseudoinverse, the column span of (CRB+E)+(CRB^{\top}+E)^{+} is UBU^{\prime}_{B}, and the row span of (CRB+E)+(CRB^{\top}+E)^{+} is UCU^{\prime}_{C}, further, (CRB+E)+=UB[(UC)(CRB+E)UB]1UC(CRB^{\top}+E)^{+}=U^{\prime}_{B}[(U^{\prime}_{C})^{\top}(CRB^{\top}+E)U^{\prime}_{B}]^{-1}U^{\prime}_{C}, therefore we can write

Note that now the three matrices are all n×nn\times n and invertible! We can write BUB=((BUB)1)1B^{\top}U^{\prime}_{B}=((B^{\top}U^{\prime}_{B})^{-1})^{-1} (and do the same thing for (UC)C(U^{\prime}_{C})^{\top}C. Using the fact that P1Q1=(QP)1P^{-1}Q^{-1}=(QP)^{-1}, we have

Here we defined X=((UC)C)1(UC)EUB(BUB)1X=((U^{\prime}_{C})^{\top}C)^{-1}(U^{\prime}_{C})^{\top}EU^{\prime}_{B}(B^{\top}U^{\prime}_{B})^{-1}. The spectral norm of XX can be bounded by

We can write BΔC=B(CRB+E)+CId=(Id+(RId+X))1IdB^{\top}\Delta C=B^{\top}(CRB^{\top}+E)^{+}C-\textup{Id}=(\textup{Id}+(R-\textup{Id}+X))^{-1}-\textup{Id}, and we now know (RId+X)O(ε)\|(R-\textup{Id}+X)\|\leq O(\varepsilon), as a result BΔCO(ε)\|B^{\top}\Delta C\|\leq O(\varepsilon) as desired.

For the term BΔ\|B^{\top}\Delta\|, by the same argument we have

On the other hand, we know B(CRB)+=R1C+=R1((UC)C)1UCB^{\top}(CRB^{\top})^{+}=R^{-1}C^{+}=R^{-1}((U_{C})^{\top}C)^{-1}U_{C}^{\top}. We can match the three factors:

Here, first and third bound are proven before. The second bound comes if we consider the SVD of C=UCDCVCC=U_{C}D_{C}V_{C}^{\top} and notice that (UC)UCIdO(ε)\|(U^{\prime}_{C})^{\top}U_{C}-\textup{Id}\|\leq O(\varepsilon). We can write Δ1=R1(R+X)1\Delta_{1}=R^{-1}-(R+X)^{-1}, Δ2=((UC)C)1((UC)C)1\Delta_{2}=((U_{C})^{\top}C)^{-1}-((U^{\prime}_{C})^{\top}C)^{-1}, Δ3=UCUC\Delta_{3}=U_{C}-U^{\prime}_{C}, then we have

Expanding the last equation, we get 7 terms and all of them can be bounded by O(ε/σmin(C))O(\varepsilon/\sigma_{min}(C)). The bounds on ΔC\|\Delta C\| and Δ\|\Delta\| can be proved using similar techniques. ∎

Finally we are ready to prove the main Lemma C.3:

Using Lemma C.1, let E=EbcE=E_{bc}^{\top}, we can write the matrix before pseudoinverse as

We can then apply Lemma C.8 on (C+ΔC)RBC(B+ΔB)+ΔBC(C+\Delta_{C})R_{BC}(B+\Delta_{B})^{\top}+\Delta_{BC}. As a result, we know if we let Δ=[CB+E]m+((C+ΔC)RBC(B+ΔB))+\Delta^{\prime}=[CB^{\top}+E]_{m}^{+}-((C+\Delta_{C})R_{BC}(B+\Delta_{B})^{\top})^{+}, we have the desired bound if we left multiply with (B+ΔB)(B+\Delta_{B})^{\top} or right multiply with (C+ΔC)(C+\Delta_{C}).

We will now show how to prove the first bound, all the other bounds can be proved using the same strategy:

All the four terms on the RHS can be bounded by Lemma C.8 so we know BΔCO(ε)\|B^{\top}\Delta^{\prime}C\|\leq O(\varepsilon).

On the other hand, let Δ=((C+ΔC)RBC(B+ΔB))+(CB)+=ΔΔ\Delta^{\prime\prime}=((C+\Delta_{C})R_{BC}(B+\Delta_{B})^{\top})^{+}-(C^{\top}B)^{+}=\Delta-\Delta^{\prime}. We will prove BΔCO(ε)\|B^{\top}\Delta^{\prime\prime}C\|\leq O(\varepsilon) and then the bound on BΔC\|B^{\top}\Delta C\| follows from triangle inequality.

For BΔCB^{\top}\Delta^{\prime\prime}C, we know it is equal to

B[(B+ΔB)]+RAB1(C+ΔC)+CIdO(ε)\|B^{\top}[(B+\Delta_{B})^{\top}]^{+}R_{AB}^{-1}(C+\Delta_{C})^{+}C-\textup{Id}\|\leq O(\varepsilon)

We will show all three factors in the first term are O(ε)O(\varepsilon) close to Id. For RAB1R_{AB}^{-1} this follows immediately from Lemma C.1. For (C+ΔC)+C(C+\Delta_{C})^{+}C, we know

Therefore its spectral norm bound is bounded by ΔCσmin1(C+ΔC)=O(ε)\|\Delta_{C}\|\sigma_{min}^{-1}(C+\Delta_{C})=O(\varepsilon) (where the bound on ΔC\|\Delta_{C}\| comes from Lemma C.1). ∎

With the claim we have now proven BΔCO(ε)\|B^{\top}\Delta^{\prime\prime}C\|\leq O(\varepsilon), therefore

Here we will show under mild incoherence conditions (defined below), if an error matrix EE is ε\varepsilon-spectrally bounded by FFFF^{\top}, then the partial matrices satisfy the requirement of Theorem 2.10.

If FF is μ\mu-incoherent for μn/mlog2n\mu\leq\sqrt{n/m\log^{2}n}, then when nΩ(mlog2m)n\geq\Omega(m\log^{2}m), with high probability over the random partition of FF into A,B,CA,B,C, we know σmin(A)σmin(F)/3\sigma_{min}(A)\geq\sigma_{min}(F)/3 (same is true for B,CB,C).

As a corollary, if EE is ε\varepsilon-spectrally bounded by FF. Let a,b,ca,b,c be the subsets corresponding to A,B,CA,B,C, and let Ea,bE_{a,b} be the submatrix of EE whose rows are in set aa and columns are in set bb. Then Ea,bE_{a,b} (also Eb,c,Ec,aE_{b,c},E_{c,a}) is O(ε)O(\varepsilon)-spectrally bounded by the corresponding asymmetric matrices ABAB^{\top} (BCBC^{\top}, CACA^{\top}).

Consider the singular value decomposition of FF: F=UDVF=UDV^{\top}. Here UU is a n×mn\times m matrix whose columns are orthonormal, VV is an m×mm\times m orthonormal matrix and DD is a diagonal matrix whose smallest diagonal entry is σmin(F)\sigma_{min}(F).

Consider the following way of partitioning the matrix: for each row of FF, we put it into A,BA,B or CC with probability 1/31/3 independently.

Now, let Xi=1X_{i}=1 if row ii is in the matrix AA, and 0 otherwise. Then XiX_{i}’s are Bernoulli random variables with probability 1/21/2. Suppose SS is the set of rows in AA, let UAU_{A} be UU restricted to rows in AA, then we have A=UADVA=U_{A}DV^{\top}. We will show with high probability σmin(A)1/3\sigma_{min}(A)\geq 1/3.

The key observation here is the expectation of UAUA=i=1nXiUiUiU_{A}^{\top}U_{A}=\sum_{i=1}^{n}X_{i}U_{i}U_{i}^{\top}, where UiU_{i} is the ii-th row of UU (represented as a column vector). Since XiX_{i}’s are Bernoulli random variables, we know

Therefore we can hope to use matrix concentration to prove that UAUAU_{A}^{\top}U_{A} is close to its expectation.

Here the last inequality is because i=1nXiUiUiUiUiUi2UiUi\sum_{i=1}^{n}X_{i}U_{i}U_{i}^{\top}U_{i}U_{i}^{\top}\preceq\|U_{i}\|^{2}U_{i}U_{i}^{\top}.

Therefore by Matrix Bernstein’s inequality we know with high probability i=1nMi1/6\|\sum_{i=1}^{n}M_{i}\|\leq 1/6. When this happens we know

Hence we have σmin(UA)1/6>1/3\sigma_{min}(U_{A})\geq\sqrt{1/6}>1/3, and σmin(A)σmin(UA)σmin(D)σmin(F)/3\sigma_{min}(A)\geq\sigma_{min}(U_{A})\sigma_{min}(D)\geq\sigma_{min}(F)/3. Note that matrices BB, CC have exactly the same distribution as AA so the bounds for BB,CC follows from union bound.

For the corollary, if a matrix EE is ε\varepsilon spectrally bounded, we can write it as FΔ1F+FΔ2+Δ2F+Δ4F\Delta_{1}F^{\top}+F\Delta_{2}^{\top}+\Delta_{2}F^{\top}+\Delta_{4}, where Δ1ε\|\Delta_{1}\|\leq\varepsilon, Δ2εσmin(F)\|\Delta_{2}\|\leq\varepsilon\sigma_{min}(F) and Δ4εσmin2(F)\|\Delta_{4}\|\leq\varepsilon\sigma_{min}^{2}(F). This can be done by considering different projections of EE: let UU be the span of columns of FF, then FΔ1FF\Delta_{1}F^{\top} term corresponds to ProjUEProjU\operatorname{Proj}_{U}E\operatorname{Proj}_{U}; FΔ2F\Delta_{2}^{\top} term corresponds to ProjUEProjU\operatorname{Proj}_{U}E\operatorname{Proj}_{U^{\perp}}; Δ2F\Delta_{2}F^{\top} term corresponds to ProjUEProjU\operatorname{Proj}_{U^{\perp}}E\operatorname{Proj}_{U}; Δ4\Delta_{4} term corresponds to ProjUEProjU\operatorname{Proj}_{U^{\perp}}E\operatorname{Proj}_{U^{\perp}}. The spectral bounds are necessary for EE to be spectrally bounded.

Now for Ea,bE_{a,b}, we can write it as AΔ1B+A(Δ2)b+(Δ2)aB+(Δ4)a,bA\Delta_{1}B^{\top}+A(\Delta_{2})_{b}^{\top}+(\Delta_{2})_{a}B^{\top}+(\Delta_{4})_{a,b}, where we also take the corresponding submatrices of Δ\Delta’s. Since the spectral norm of a submatrix can only be smaller, we know Δ1ε\|\Delta_{1}\|\leq\varepsilon, (Δ2)bεσmin(F)3εσmin(B)\|(\Delta_{2})_{b}\|\leq\varepsilon\sigma_{min}(F)\leq 3\varepsilon\sigma_{min}(B), (Δ2)aεσmin(F)3εσmin(A)\|(\Delta_{2})_{a}\|\leq\varepsilon\sigma_{min}(F)\leq 3\varepsilon\sigma_{min}(A) and (Δ2)a,bεσmin2(F)9εσmin(A)σmin(B)\|(\Delta_{2})_{a,b}\|\leq\varepsilon\sigma_{min}^{2}(F)\leq 9\varepsilon\sigma_{min}(A)\sigma_{min}(B). Therefore by Definition 2.9 we know Ea,bE_{a,b} is 9ε9\varepsilon spectrally bounded by ABAB^{\top}. ∎

Appendix D Proof of Theorem 3.1 and Theorem 3.3

In this section, we provide the full proof of Theorem 3.1. We start with a simple technical Lemma.

If QQ is an ε\varepsilon-approximate whitening matrix for AA, then Q11εAA\|Q\|\leq\frac{1}{1-\varepsilon}\|AA^{\top}\|, σmin(Q)11εAA\sigma_{\min}(Q)\geq\frac{1}{1-\varepsilon}\|AA^{\top}\|

By the definition of approximate-whitening, we have

by virtue of the fact that (Q+)1/2AA(Q+)1/2=((Q+)1/2AA(Q+)1/2)(Q^{+})^{1/2}AA^{\top}(Q^{+})^{1/2}=\left((Q^{+})^{1/2}A^{\top}A(Q^{+})^{1/2}\right)^{\top}. Rewriting in semidefinite-order notation, we get that

Multiplying on the left and right by Q1/2Q^{1/2}, we get

This directly implies 11+εAAQ11εAA\frac{1}{1+\varepsilon}AA^{\top}\preceq Q\preceq\frac{1}{1-\varepsilon}AA^{\top} which is equivalent to the statement of the lemma. ∎

Towards proving Theorem 3.1, we will first prove the following proposition, which shows that we recover the exp(W)\exp(-W) matrix correctly:

Under the random generative model defined in Section 1, if the number of samples satisfies

The proof will consist of checking the conditions for Algorithms 4 and 3 to work, so that we can apply Theorems 5.4 and 2.10.

To get a handle on the PMI tensor, by Proposition A.2, for any equipartition Sa,Sb,ScS_{a},S_{b},S_{c} of [n][n], we can it as

We can choose L=poly(log(n,1ρ,1p))\displaystyle L=\operatorname{poly}(\log(n,\frac{1}{\rho},\frac{1}{p})) to ensure

Having an explicit form for the tensor, we proceed to check the spectral boundedness condition for Algorithm 4.

Next, we verify the conditions for calculating approximate whitening matrices (Algorithm 4).

However, for the random model, applying Lemma B.1,

Analogous statements hold for BB and CC.

Finally, we bound the error due to empirical estimates. Since ρpm=o(1)\rho pm=o(1),

Hence, by Corollary E.3, with a number of samples as stated in the theorem,

With that, invoking Theorem 5.4 (with E{1,2},{3}\|E\|_{\{1,2\},\{3\}} taking into account both the ELE_{L} term above, and the above error due to sampling), the output of Algorithm 3 will produce vectors viv_{i}, i[m]i\in[m], s.t. viv_{i} is O(η)O(\eta^{\prime})-close to (ρ1ρ)1/3(1exp(Wi))\left(\frac{\rho}{1-\rho}\right)^{1/3}(1-\exp(-W_{i})), for

where σ=min(σmin(Qa),σmin(Qb),σmin(Qc))\sigma=\min(\sigma_{\min}(Q_{a}),\sigma_{\min}(Q_{b}),\sigma_{\min}(Q_{c})).

Plugging in the estimates from (D.1), (D.2), (D.3) as well as τ=O(ρ2/3logn)\tau=O(\rho^{2/3}\log n), we get:

which implies the vectors (a^i,b^i,c^i)(\hat{a}_{i},\hat{b}_{i},\hat{c}_{i}) are O(η)O(\eta^{\prime})-close to (ρ1ρ)1/3(1exp(Wi))\left(\frac{\rho}{1-\rho}\right)^{1/3}(1-\exp(W_{i})), for all i[m]i\in[m].

Given that, we prove the main theorem. The main issue will be to ensure that taking log\log of the values

Then we have that YiWiYiWi\lVert Y_{i}^{\prime}-W_{i}\rVert\leq\lVert Y_{i}-W_{i}\rVert. By the Lipschitzness of log()\log(\cdot) in the region [νi,][\nu_{i},\infty]

Therefore recalling YiWiYiWiO(ηnp)\lVert Y_{i}^{\prime}-W_{i}\rVert\leq\lVert Y_{i}-W_{i}\rVert\leq O(\eta\sqrt{np}) we complete the proof. ∎

The proof will follow the same outline as the proof of Theorem 3.1. The difference is that since we only have a guarantee on the spectral boundedness of the second and third-order term, we will need to bound the higher-order terms in a different manner. Given that we have no information on them in this scenario, we will simply bound them in the obvious manner. We proceed to formalize this.

The sample complexity is polynomial for the same reasons as in the proof of Theorem 3.1, so we will not worry about it here.

We only need to check the conditions for Algorithms 4 and 3 to work, so that we can apply Theorems 5.4 and 2.10.

Towards that, first we claim that we can write the PMI tensor for any equipartition Sa,Sb,ScS_{a},S_{b},S_{c} of [n][n] as

where E{1,2},{3}ρ4m(np)3/2\|E\|_{\{1,2\},\{3\}}\leq\rho^{4}m(np)^{3/2}. Towards achieving this, first we claim that Proposition 5.6 implies that for any subsets Sa,Sb,ScS_{a},S_{b},S_{c},

Indeed, if we put γk=(1exp(lWk))Sa\gamma_{k}=\left(1-\exp(-lW_{k})\right)_{S_{a}}, δk=(1exp(lWk))Sb\delta_{k}=\left(1-\exp(-lW_{k})\right)_{S_{b}}, θk=(1exp(lWk))Sc\theta_{k}=\left(1-\exp(-lW_{k})\right)_{S_{c}}, then we have kγkγkmnp\|\sum_{k}\gamma_{k}\gamma_{k}^{\top}\|\leq\sqrt{mnp}, and similarly for δk\delta_{k}. Since maxkθk(np)1/2\max_{k}\|\theta_{k}\|\leq(np)^{1/2}, the claim immediately follows. Hence, (D.4) follows.

We claim that RSaRSaR_{S_{a}}R^{\top}_{S_{a}} is τ\tau spectrally bounded by ρFF\rho FF^{\top}.

where the first inequality holds since HH,GG,LLHH^{\top},GG^{\top},LL^{\top} are τ\tau-spectrally bounded bounded by FF and the second since σmin(FF)np\sigma_{\min}(FF^{\top})\gtrsim np and τ1\tau\geq 1. Let τ=ρ2/3τ\tau^{\prime}=\rho^{2/3}\tau. Since we are assuming the matrix FF is O(1)O(1)-incoherent, we can apply Theorem C.10, and claim the output of Algorithm 4 are matrices Qa,Qb,QcQ_{a},Q_{b},Q_{c} which are τ\tau-approximate whitening matrices for A,B,CA,B,C respectively.

Then, applying Theorem 5.4, we get that we recover vectors (a^i,b^i,c^i)(\hat{a}_{i},\hat{b}_{i},\hat{c}_{i}) are O(η)O(\eta^{\prime})-close to (ρ1ρ)1/3(1exp(Wi))\left(\frac{\rho}{1-\rho}\right)^{1/3}(1-\exp(W_{i})), for all i[m]i\in[m]. for

Recall that τ=ρ2/3τ\tau^{\prime}=\rho^{2/3}\tau, and Qaρ2/3σmax(F)ρ2/3mnp\lVert Q_{a}\rVert\leq\rho^{2/3}\sigma_{\max}(F)\lesssim\rho^{2/3}\sqrt{mn}p and E{1,2},{3}ρ4m(np)3/2\|E\|_{\{1,2\},\{3\}}\leq\rho^{4}m(np)^{3/2} and σρ2/3np\sigma\gtrsim\rho^{2/3}np, we obtain that

where the last inequality holds since ρ3m=o(1)=o(τ)\rho^{3}m=o(1)=o(\tau).

Argument for recovering WiW_{i} from exp(Wi)\exp(-W_{i}) is then exactly the same as the one in Theorem 3.1.

Appendix E Sample complexity and bias of the PMI estimator

Finally, we consider the issue of sample complexity. The estimator we will use for the PMI matrix will simply be the plug-in estimator, namely:

Notice that this estimator is biased, but as the number of samples grows, the bias tends to zero. Formally, we can show:

with high probability PMI^i,jPMIi,jδ,ij|\hat{\textup{PMI}}_{i,j}-\textup{PMI}_{i,j}|\leq\delta,\forall i\neq j.

Denoting Δi,j=Pr^[si=0sj=0]Pr[si=0sj=0]\Delta_{i,j}=\hat{\Pr}[s_{i}=0\wedge s_{j}=0]-\Pr[s_{i}=0\wedge s_{j}=0] and Δi=Pr^[si=0]Pr[si=0]\Delta_{i}=\hat{\Pr}[s_{i}=0]-\Pr[s_{i}=0], we get that

Furthermore, we have that 2x2+xlog(1+x)xx+1\frac{2x}{2+x}\leq\log(1+x)\leq\frac{x}{\sqrt{x+1}}, for x0x\geq 0, which implies that when x1x\leq 1, 23xlog(1+x)x\frac{2}{3}x\leq\log(1+x)\leq x. From this it follows that if

Note that it suffices to show that if N>114pmaxmρmax1δ2logmN>\frac{1}{1-4p_{\max}m\rho_{\max}}\frac{1}{\delta^{2}}\log m, we have

with high probability by a simple union bound.

Both (E.2) and (E.3) will follow by a Chernoff bound.

Indeed, consider (E.2) first. We have by Chernoff

Hence, if N>1Pr[si=0]1δ2logmN>\frac{1}{\Pr[s_{i}=0]}\frac{1}{\delta^{2}}\log m, we get that 1δΔiPr[si=0]1+δ1-\delta\leq\frac{\Delta_{i}}{\Pr[s_{i}=0]}\leq 1+\delta with probability at least 1exp(log2m)1-\exp(\log^{2}m).

The proof of (E.3) is analogous – the only difference being that the requirement is that N>1Pr[si=0]1δ2logmN>\frac{1}{\Pr[s_{i}=0]}\frac{1}{\delta^{2}}\log m which gives the statement of the lemma.

Virtually the same proof as above shows that:

with high probability PMIT^i,j,kPMITi,j,kδ,ijk|\hat{\textup{PMIT}}_{i,j,k}-\textup{PMIT}_{i,j,k}|\leq\delta,\forall i\neq j\neq k.

with high probability for any equipartition Sa,Sb,ScS_{a},S_{b},S_{c}

Appendix F Matrix Perturbation Toolbox

In this section we discuss standard matrix perturbation inequalities. Many results in this section can be found in Stewart and Sun [Ste77]. Given A^=A+E\widehat{A}=A+E, the perturbation in individual singular values can be bounded by Weyl’s theorem:

Given A^=A+E\widehat{A}=A+E, we know σk(A)Eσk(A^)σk(A)+E\sigma_{k}(A)-\|E\|\leq\sigma_{k}(\widehat{A})\leq\sigma_{k}(A)+\|E\|.

For singular vectors, the perturbation is bounded by Wedin’s Theorem:

Let A^=A+E\widehat{A}=A+E, with analogous singular value decomposition. Let Φ\Phi be the matrix of canonical angles between the column span of U1U_{1} and that of U^1\widehat{U}_{1}, and Θ\Theta be the matrix of canonical angles between the column span of V1V_{1} and that of V^1\widehat{V}_{1}. Suppose that there exists a δ\delta such that

When we have a lowerbound on σmin(A)\sigma_{min}(A), it is easy to get bounds for the perturbation of pseudoinverse.

Note that this theorem is not strong enough when the perturbation is only known to be τ\tau-spectrally bounded in our definition.