Provable learning of Noisy-or Networks
Sanjeev Arora, Rong Ge, Tengyu Ma, Andrej Risteski
Introduction
Unsupervised learning is important and potentially very powerful because of the availability of the huge amount of unlabeled data — often several orders of magnitudes more than the labeled data in many domains. Latent variable models, a popular approach in unsupervised learning, model the latent structures in data: the “structure”corresponds to some hidden variables, which probabilistically determine the values of the visible coordinates in data. Bayes nets model the dependency structure of latent and observable variables via a directed graph. Learning parameters of a latent variable model given data samples is often seen as a canonical definition of unsupervised learning. Unfortunately, finding parameters with the maximum likelihood is NP-hard even in very simple settings. However, in practice many of these models can be learnt reasonably well using algorithms without polynomial runtime guarantees, such as expectation-maximization algorithm, Markov chain Monte Carlo, and variational inference. Bridging this gap between theory and practice is an important research goal.
Recently it has become possible to use matrix and tensor decomposition methods to design polynomial-time algorithms to learn some simple latent variable models such as topic models [AGM12, AGH+13], sparse coding models [AGMM15, MSS16], mixtures of Gaussians [HK13, GHK15], hidden Markov models [MR05], etc. These algorithms are guaranteed to work if the model parameters satisfy some conditions, which are reasonably realistic. In fact, matrix and tensor decomposition are a natural tool to turn to since they appear to be a sweet spot for theory whereby non-convex NP-hard problems can be solved provably under relatively clean and interpretable assumptions. But the above-mentioned recent results suggest that such methods apply only to solving latent variable models that are linear: specifically, they need the marginal of the observed variables conditioned on the hidden variables to depend linearly on the hidden variables. But many settings seem to call for nonlinearity in the model. For example, Bayes nets in many domains involve highly nonlinear operations on the latent variables, and could even have multiple layers. The study of neural networks also runs into nonlinear models such as restricted Boltzmann machines (RBM) [Smo86, HS06]. Can matrix factorization (or related tensor factorization) ideas help for learning nonlinear models?
We see that can be thought of as the probability that activates symptom , and is activated if one of ’s activates it — which explains the name of the model, noisy-or. It follows that the conditional distribution is
One canonical use of this model is to model the relationship between diseases and symptoms, as in the classical human-constructed tool for medical diagnosis called Quick Medical Reference (QMR-DT) by (Miller et al.[MPJM82], Shwe et al. [SC91]) This textbook example ([JGJS99]) of a Bayes net captures relationships between binary disease variables (latent variables) and observed binary symptom variables, with directed edges, and the ’s are small integers.We thank Randolph Miller and Vanderbilt University for providing the current version of this network for our research. The name “noisy-or ”derives from the fact that the probability that the OR of independent binary variables is is exactly . Noisy-or models are implicitly using this expression; specifically, for the -th symptom we are considering the OR of events where the th event is “Disease does not cause symptom ” and its probability is . Treating these events as independent leads to expression (1.1).
The parameters of the QMR-DT network were hand-estimated by consulting human experts, but it is an interesting research problem whether such networks can be created in an automated way using only samples of patient data (i.e., the vectors). Previously there were no approaches for this that work even heuristically at the required problem size (). (This learning problem should not be confused with the simpler problem of infering the latent variables given the visible ones, which is also hard but has seen more work, including reasonable heuristic methods [JGJS99]). Halpern et al.[HS13, JHS13] have designed some algorithms for this problem. However, their first paper [HS13] assumes the graph structure is given. The second paper [JHS13] requires the Bayes network to be quartet-learnable, which is a strong assumption on the structure of the network. Finally, the problem of finding a “best-fit” Bayesian network according to popular metricsResearchers resort to these metrics as when the graph structure is unknown, multiple structures may have the same likelihood, so maximum likelihood is not appropriate. has been shown to be NP-complete by [Chi96] even when all of the hidden variables are also observed.
where ’s are upper bounded by for some constant and are identically distributed according to a distribution which satisfies that for some constant ,
The condition (1.2) intuitively requires that is bounded away from 0. We will assume that and . (Again, these are realistic for QMR-DT setting).
Recall that we mostly thought of the prior of the diseases as being on the order . This means that even if is on the order of , our relative error bound equals to .
Preliminaries and overview
We denote by the all-zeroes vector and the all-ones vector. will denote the Moore-Penrose pseudo-inverse of a matrix , and for symmetric matrices , we use as a shorthand for . The least non-zero singular value of matrix is denoted .
For matrices we define the Kronecker product as A useful identity is that whenever the matrix multiplications are defined. Moreover, will denote the i-th column of matrix and the i-th row of matrix .
We write if there exists a universal constant such that and we define similarly.
(We will sometimes shorten PMI3 and PMI2 to PMI when this causes no confusion.)
Our algorithm is given polynomially many samples from the model (each sample describing which symptoms are or are not present in a particular patient). It starts by computing the following matrix PMI and and tensor PMIT, which tabulate the correlations among all pairs and triples of symptoms (specifically, the indicator random variable for the symptom being absent):
Let denote the th columns of the above . Then,
The proposition is proved later (with precise statement) in Section A by computing the moments by marginalization and using Taylor expansion to approximate the log of the moments, and ignoring terms and smaller. (Recall that is the probability that a patient has a particular disease, which should be small, of the order of . The dependence of the final error upon appears in Section 3.) Since the tensor PMIT can be estimated to arbitrary accuracy given enough samples, the natural idea to recover the model parameters is to use Tensor Decomposition. This is what our algorithm does as well, except the following difficulties have to be overcome.
Difficulty 1: Suppose in equation (2.8) we view the first summand , which is rank with components ’s as the signal term. In all previous polynomial-time algorithms for tensor decomposition, the tensor is required to have the form . To make our problem fit this template we could consider the second summand as the “noise”, especially since it is multiplied by which tends to make have smaller norm than . But this is naive and incorrect, since is a very structured matrix: it is more appropriate viewed as systematic error. (In particular this error doesn’t go down in norm as the number of samples goes to infinity.) In order to do tensor decomposition in presence of such systematic error, we will need both a delicate error analysis and a very robust tensor decomposition algorithm. These will be outlined in Section 2.3.
Difficulty 2: To get our problem into a form suitable for tensor decomposition requires a whitening step, which uses the robust estimate of the whitening matrix from the second moment matrix. In this case, the whitening matrix has to be extracted out of the PMI matrix, which itself suffers from a systematic error. This also is not handled in previous works, and requires a delicate control of the error. See Section 2.4 for more discussion.
Difficulty 3: There is another source of inexactness in equation (2.8), namely the approximation is only true for those entries with distinct indices — for example, the diagonal entry has completely different formula from that for when . This will complicate the algorithm, as described in Subsections 2.3 and 2.4.
The next few Subsections sketch how to overcome these difficulties, and the details appear in the rest of the paper.
2 Recovering matrices in presence of systematic error
In this Section we recall the classical method of approximately recovering a matrix given noisy estimates of its entries. We discuss how to adapt that method to our setting where the error in the estimates is systematic and does not go down even with many more samples. The next section sketches an extension of this method to tensor decomposition with systematic error.
In the classical setting, there is an unknown matrix of rank and we are given where is an error matrix. The method to recover is to compute the best rank- approximation to . The quality of this approximation was studied by Davis and Kahan [DK70] and Wedin [Wed72], and many subsequent authors. The quality of the recovery depends upon the ratio , where denotes -th largest singular value and denotes the spectral norm. To make this familiar lemma fit our setting more exactly, we will phrase the problem as trying to recover a matrix given noisy estimate . Now one can only recover up to rotation, and the following lemma describes the error in the Davis-Kahan recovery. It also plays a key role in the error analysis of the usual algorithm for tensor decomposition.
In the above setting, let the subspace of the top eigenvectors of and . Let be such that . Then where Id is the identity transformation on the subspace in question.
The Lemma thus treats as the definition of noise/signal ratio. Before we generalize the definition and the algorithm to handle systematic error it is good to get some intuition, from looking at (2.6): . Thinking of the first term as signal and the second as error, let’s check how bad is the noise/signal ratio defined in Davis-Kahan. The “signal”is , which is smaller than since the trace of is of the order of in our probabilistic model for the weight matrix. The “noise” is the norm of , which is large since the ’s are nonnegative vectors with entries of the order of 1, and therefore the quadratic form can be as large as . Thus the Davis-Kahan noise/signal ratio is , and so when , it allows recovering the subspace of with error . Note that this is a vacuous bound since needs to be at least so that the hidden variable contains 1 non-zero entry in average. We’ll argue that this error is too pessimistic and we can in fact drive the estimation error down to close to .
The smallest such is the “error/signal ratio” for this recovery problem.
This definition differs from Davis-Kahan’s because of the term on the right hand side of (2.9). This allows, for any unit vector , the quadratic form value to be as large as . Thus for example the vector no longer causes a large noise/signal ratio since both quadtratic forms and have large values on it.
This new error/signal ratio is no larger than the Davis-Kahan ratio, but can potentially be much smaller. Now we show how to do a better analysis of the Davis-Kahan recovery in terms of it. The proof of this theorem appears in Section 4.
Empirically, we can compute the value for the weight matrix in the QMR-DT dataset [SC91], which is a textbook application of noisy OR network. For the QMR-DT dataset, is under . This implies that the recovery error of the subspace of guaranteed by Theorem 2.4 is bounded by , whereas the error bound by Davis-Kahan is .
3 Tensor decomposition with systematic error
Now we extend the insight from the matrix case to tensor recovery under systematic error. In turns out condition (2.9) is also a good measure of error/signal for the tensor recovery problem of (2.8). Specifically, if is -bounded by , then we can recover the components ’s from the PMIT with column-wise error . This requires a non-trivial algorithm (instead of SVD), and the additional gain is that we can recover ’s individually, instead of only obtaining the subspace with the PMI matrix.
First we recall the prior state of the art for the error analysis of tensor decomposition with Davis-Kahan type bounds. The best error bounds involve measuring the magnitude of the noise matrix in a new way. For any tensor , we define the norm as
There is a polynomial-time algorithm (Algorithm 2 later) which has the following guarantee. Suppose tensor is of the form
But in our setting the noise tensor has systematic error. An analog of Theorem 2.4 in this setting is complicated because even the whitening step is nontrivial. Recall also the inexactness in Proposition 2.1 due to the diagonal terms, which we earlier called Difficulty 3. We address this difficulty in the algorithm by setting up the problem using a sub-tensor of the PMI tensor. Let be a uniformly random equipartition of the set of indices . Let
where denotes the restriction of vector to subset . Moreover, let
Then, since the sub-tensor only contains entries with distinct indices, we can use Taylor expansion (see Lemma A.1) to obtain that
Here the second summand on the RHS corresponds to the second order term in the Taylor expansion. It turns out that the higher order terms are multiplied by and thus have negligible Frobenius norm, and therefore discussion below will focus on the first two summands.
For simplicity, let . Our goal is to recover the components from the approximate low-rank tensor .
The first step is to whiten the components ’s, ’s and ’s. Recall that is a non-negative vector. This implies the matrix must have a significant contribution in the direction of the vector , and thus is far away from being well-conditioned. For the purpose of this section, we assume for simplicity that we can access the covariance matrix defined by the vector ’s,
Similarly we assume the access of and which are defined analogously. In Section 2.4 we discuss how to obtain approximately these three matrices.
Then, we can compute the whitened tensor by applying transformation along the three modes of the tensor ,
Now the first summand is a low rank orthogonal tensor, since ’s are orthonormal vectors. However, the term is a systematic error and we use the following Lemma to control its norm.
Lemma 2.7 shows that to give an upper bound on the norm of the error tensor , it suffices to show that the square of the components of the error, namely, are -spectrally bounded by the components of the signal respectively. This will imply that .
Recall that and are two sub-matrices of and . We have shown that is -spectrally bounded by in Proposition 2.5. It follows straightforwardly that the random sub-matrices also have the same property.
In the setting of this section, under the generative model for , w.h.p, we have that is -spectrally bounded by with . The same is true for the other two modes.
Using Proposition 2.8 and Lemma 2.7, we have that
Then using Theorem 2.6 on the tensor , we can recover the components ’s, ’s, and ’s. This will lead us to recover , and , and finally to recover the weight matrix .
4 Robust whitening
In the previous subsection, we assumed the access to (defined in (2.13)) which turns out to be highly non-trivial. A priori, using equation (2.6), noting that , we have
However, this approximation can be arbitrarily bad for the diagonal entries of PMI since equation (2.6) only works for entries with distinct indices. (Recall that this is why we divided the indices set into and studied the asymmetric tensor in the previous subsection). Moreover, the diagonal of the matrix contributes to its spectrum significantly and therefore we cannot get meaningful bounds (in spectral norm) by ignoring the diagonal entries.
This issue turns out to arise in most of the previous tensor papers and the solution was to compute by using the asymmetric moments ,
Typically can be estimated with arbitrarily small error (as number of samples go to infinity) and therefore the equation above leads to accurate estimate to . However, in our case the errors in the estimate , , are systematic. Therefore, we need to use a more delicate analysis to control how the error accumulates in the estimate,
Here again, to get an accurate bound, we need to understand how the error in behaves relatively compared with in a direction-by-direction basis. We generalized Definition 2.3 to capture the asymmetric spectral boundedness of the error by the signal.
Let be the column subspace of and be the column subspace of . Then we have , , , . Intuitively, they measure the relative relationship between and in different subspaces. For example, is the relative perturbation in the column subspace of and row subspace of . When , this is equivalent to the definition in the symmetric setting (this will be clearer in the proof of Theorem 2.4).
where are -spectrally bounded by , , respectively. Then, the matrix matrix
is a good approximation of in the sense that is -spectrally bounded by . Here denotes the best rank- approximation of .
The theorem is non-trivial even if the we have an absolute error assumption, that is, even if , which is stronger condition than is -spectrally bounded by . Suppose we establish bounds on , and individually, and then putting them together in the obvious way to control the error . Then the error will be too large for us. This is because standard matrix perturbation theory gives that can be bounded by , which is tight. Then we multiply the error with the norm of the rest of the two terms, the error will be roughly . That is, we will loss a condition number of , which can be dimension dependent for our case.
The fix to this problem is to avoid bounding each term in individually. To do this, we will take the cancellation of these terms into account. Technically, we re-decompose the product into a new product of three matrices , and then bound the error in each of these terms instead. See Section C for details.
As a corollary, we conclude that the whitened vectors ’s are indeed approximately orthonormal.
In the setting of Theorem 2.10, we have that contains approximately orthonormal vectors as columns, in the sense that
Therefore we have found an approximate whitening matrix for even though we do not have access to the diagonal entries.
Main Algorithms and Results
As sketched in Section 2, our main algorithm (Algorithm 1) uses tensor decomposition on the PMI tensor. In this section, we describe the different steps and how the fit together. Subsequently, all steps will be analyzed in separate sections.
Suppose the true is generated from the random model in Section 1 with for some sufficiently small constant . Then given number of examples, Algorithm 1 returns a weight matrix in polynomial time that satisfies
We also define the incoherence of a matrix . Roughly speaking, it says that the left singular vectors of don’t correlate with any of the natural basis vector much more than the average.
We assume the weight matrix satisfies the following deterministic assumptions, 1. is -spectrally bounded by for . 2. is -incoherent with . 3. If , with high probability over the choice of a subset , and for some sufficiently small constant .
Suppose the matrix satisfies the conditions 1-3 above. Given polynomial number of samples, Algorithm 1 returns in polynomial time, s.t.
The proofs of Theorems uses the overall strategy of Section 2, and is deferred to Section D. We give a high level outline that demonstrates how the proofs depends on the machinery built in the subsequent sections.
Both Theorem 3.1 and Theorem 3.3 are similarly proved – the only technical difference being how the third and higher order terms are bounded. (Because of generative model assumption, for Theorem 3.1 we can get a more precise control on them.) Hence, we will not distinguish between them in the coming overview.
Overall, we will follow the approach outlined in Section 2. Let us step through Algorithm 1 line by line:
The overall goal will be to recover the leading terms of the PMI tensor. Of course, we get samples only, so can merely get an empirical version of it. In Section E, we show that the simple plug-in estimator does the job – and does so with polynomially many samples.
Recall Difficulty 3 from Section 2 : the PMI tensor and matrix expression is only accurate on the off-diagonal entries. In order to address this, in Section 2.3 we passed to a sub-tensor of the original tensor by partitioning the symptoms into three disjoint sets, and considering the induced tensor by this partition.
In order to apply the robust tensor decomposition algorithm from Section 5, we need to first calculate whitening matrices. This is necessarily complicated by the fact that the diagonals of the PMI matrix are not accurate, as discussed in Section 2.4. Section C gives guarantees on the procedure for calculating the whitening matrices.
This is main component of the algorithm: the robust tensor decomposition machinery. In Section 5, the conditions and guarantees for the success of the algorithm are formalized. There, we deal with the difficulties layed out in Section 2.2 : namely that we have a substantial systematic error that we need to handle. (Both due to higher-order terms, and due to the missing diagonal entries)
This step, along with Step 6, is a post-processing step – which allows us to recover the weight matrix after we have recovered the leading terms of the PMI tensor.
We also give a short quantitative sense of the guarantee of the algorithm. (The reader can find the full proof in Section D.)
To get quantitative bounds, we will first need a handle on spectral properties of the random model: these are located in Section B. As we mentioned above, the main driver of the algorithm is step 4, which uses our robust tensor decomposition machinery in Section 5. To apply the machinery, we first need to show that the second (and higher) order terms of the PMI tensor are spectrally bounded. This is done by applying Proposition B.4, which roughly shows the higher-order terms are -spectrally bounded by . The whitening matrices are calculated using machinery in Section C. We can apply these tools since the random model gives rise to a -incoherent matrix as shown in Lemma B.3.
To get a final sense of what the guarantee is, the error which step 4 gives, via Theorem 5.4 roughly behaves like , where is the spectral norm of the whitening matrices and is the spectral boundedness parameter. But, by Lemma D.1 is approximately the spectral norm of – which on the other hand by Lemma B.1 is on the order of . Plugging in these values, we get the theorem statement.
Finding the Subspace under Heavy Perturbations
In this section, we show even if we perturb a matrix with an error whose spectral norm might be much larger than , as long as is spectrally bounded the top singular subspace of is still preserved. We defer the proof of the asymmetric case (Theorem 2.10) to Section C. We note that such type of perturbation bounds, often called relatively perturbation bounds, have been studied in [Ips98, Li98a, Li98b, Li97]. The results in these papers either require the that signal matrix is full rank, or the perturbation matrix has strong structure. We believe our results are new and the way that we phrase the bound makes the application to our problem convenient. We recall Theorem 2.4, which was originally stated in Section 2.
Therefore, we have that . Moreover, we also have
Let . Then we write as,
Let . Let be the column span of . We first prove that is close to . Note that
Moreover, we have . Therefore, using Wedin’s Theorem (Lemma F.2) on equation (4.1), we have that
Next we show and are also close. We have
Therefore, by Wedin’s Theorem, , as the span of top left singular vectors of , is close to the span of the top left singular vector of , namely,
Therefore using equation (4.2) and (4.3) and triangle inequality, we complete the proof. ∎
Robust Tensor Decomposition with Systematic Error
In this section we discuss how to robustly find the tensor decomposition even in presence of systematic error. We first illustrate the main techniques in an easier setting of orthogonal tensor decomposition (Section 5.1), then we describe how it can be generalized to the general setting that we require for our algorithm (Section 5.2).
We start with decomposing an orthogonal tensor with systematic error. The algorithm we use here is a slightly more general version of an algorithm in [MSS16].
Suppose are three collection -approximate orthonormal vectors. Suppose tensor is of the form
The Theorem is a direct extension of [MSS16, Theorem 10.2] to asymmetric and approximate orthogonal case. We only provide a proof sketch here. We start by writing
Since and , [MSS16, Theorem 6.5] implies that with probability at least over the choice of ,
Let . We have that with probability , and for every . We condition on these events. Let be a set of orthonormal vectors such that satisfies (we can take ’s to be the whitening of ’s). Similarly define ’s. Then we have that the term (defined in equation (5.1)) can be written as where . Let . Then has top singular value , and second singular value at most . Moreover, the term has spectral norm bounded by . Thus by Wedin’s Theorem (Lemma F.2), the top left and right singular vectors of are -close to and respectively. They are also -close to since is close to . Moreover, we have is -close to .
Therefore, with probability , each round of the for loop in Algorithm 2 will find . Line 5 is used to verify if the resulting vectors are indeed good using the injective norm as a test. It can be shown that if the test is passed then is close to one of the component. Therefore, after iterations, with high probability, we can find all of the components.
2 General tensor decomposition
In many previous works, general tensor decomposition is reduced to orthogonal tensor decomposition via a whitening procedure. However, here in our setting we cannot estimate the exact whitening matrix because of the systematic error. Therefore we need a more robust version of approximate whitening matrix, which we define below:
Let . A collection of vectors is -approximately orthonormal if the matrix with as columns satisfies
With this in mind, we can state the guarantee on the tensor decomposition algorithm (Algorithm 3).
where .
Note that in our model, the matrix has very small spectral norm as it is the third order term in (and ). The spectral boundedness of are discussed in Section B. Therefore we can expect the RHS to be small.
In order to prove this theorem, we show after we apply whitening operation using the approximate whitening matrices, the tensor is still close to an orthogonal tensor. To do that, we need the following lemma which is a useful technical consequence of the condition (2.9).
Suppose is -spectrally bounded by . Then,
Let be the column span of . Let . Multiplying on both sides of equation (2.9), we obtain that
It follows that , which in turns implies that .∎
We also need to bound the norm of the following systematic error tensor. This is important because we want to bound the spectral norm of the perturbation after the whitening operation.
Using the definition of we have that
Next observe that we have that for any , and therefore,
With this in mind, we prove the main theorem:
Conclusions
We have presented theoretical progress on the longstanding open problem of presenting a polynomial-time algorithm for learning noisy-or networks given sample outputs from the network. In particular it is enouraging that linear algebraic methods like tensor decomposition can play a role. Earlier there were no good approaches for this problem; even heuristics fail for realistic sizes like .
Can sample complexity be reduced, say to subcubic? (Cubic implies more than one billion examples for networks with outputs.) Possibly this requires exploiting some hierarchichal structure –e.g. groupings of diseases and symptoms— in practical noisy-OR networks but exploring such possibilities using the current version of QMR-DT is difficult because it has been scrubbed of labels for diseases and symptoms.)
Various more practical versions of our algorithm are also easy to conceive and will be tested in the near future. This could be somewhat analogous to topic modeling, for which discovery of provable polynomial-time algorithms soon led to very efficient algorithms.
We thank Randolph Miller and Vanderbilt University for providing the current version of this network for our research.
References
Appendix A Formal expression for the PMI tensor
In this section we formally derive the expressions for the PMI tensors and matrices, which we only informally did in Section 2.
These matrices will appear naturally in the expressions for the higher-order terms in the Taylor expansion for the PMI matrix and tensor.
We first compute the formally the moments of the noisy-or model.
We only give the proof for the second equation. The rest can be shown analogously.
With this in mind, we give the expression for the PMI tensor along with all the higher-order terms.
For any equipartition of , the restriction of the PMI tensor satisfies, for any ,
The proof will proceed by Taylor expanding the log terms. Towards that, using Lemma A.1, we have :
By the Taylor expansion of , we get that
by simple regrouping of the terms. By exchanging and , we get
where the last equality holds by noting that
The term corresponding to is easily seen to be
therefore we to show the statement of the lemma, we only need bound the contribution of the terms with .
Toward that, note that . Hence, we have by Lemma 5.6,
Therefore, subadditivity of the norm gives
A completely analogous proof gives a similar expression for the PMI matrix:
For any subsets of , s.t. , the restriction of the PMI matrix satisfies, for any ,
Appendix B Spectral properties of the random model
The goal of this section is to prove that the random model specified in Section 1 satisfies the incoherence property 3.2 on the weight matrix and the spectral boundedness property of the PMI tensor. (Recall, the former is required for the whitening algorithm, and the later for the tensor decomposition algorithm.)
Before delving into the proofs, we will need a few simple bounds on the singular values of .
Let , s.t. . With probability over the choice of , and for all ,
we will proceed to bound the smallest eigenvalue of .
Since the matrices are independent, the bound will follow from a matrix Bernstein bound. Denoting
Note that (1.2) together with the assumption gives
Therefore, by Bernstein inequality we have that w.h.p,
which in turn implies that . But this immediately implies with high probability. Union bounding over all , we get the first part of the lemma.
The upper bound will be proven by a Chernoff bound. Note that the matrices
are independent. Furthermore, with high probability, and the variable is sub-exponential. Finally,
A union bound over all values of gives the statement of the Lemma.
First, we proceed to show the incoherence property 3.2 on the weight matrix.
Suppose is a multiple of 3. Let be the singular value decomposition of . Let be a uniformly random equipartition of the rows of . Suppose is -incoherent with . Then, with high probability over the choice of and , we have for every ,
Let . Then, since , we have
By the assumption on the row norms of , . By the incoherence assumption, we have that .
We also note that ’s are negatively associated random variables. Therefore by the matrix Chernoff inequality for negatively associated random variables, we have with high probability,
But, an analogous argument holds for as well – so by a union bound over , we complete the proof. ∎
Suppose . Under the generative assumption in Section 1 for , we have that we have that is -incoherent.
We have that and therefore, . This in turn implies that
Since with high probability, we only need to bound from below. Note that . Therefore it suffices to control . But by Lemma B.1 we have .
B.2 Spectral boundedness
The main goal of the section is to show that the bias terms in the PMI tensor are spectrally bounded by the PMI matrix (which we can estimate from samples). Furthermore, we show that we can calculate an approximate whitening matrix for the leading terms of the PMI tensor using the machinery in Section .
The main proposition we will show is the following:
The main element for the proposition is the following Lemma:
Before proving the Lemma, let us see how the proposition follows from it:
Since , the claim of the Proposition follows.
It is clear analogous statements hold for and . ∎
For notational convenience, we will denote by the all ones matrix with dimension . (We will omit the dimensions when clear from the context.)
This statement will immediately follow from the following two lemmas:
Before showing these lemmas, let us see how Lemma B.5 is implied by them:
Let be the constant in B.7, s.t. . Putting the bounds from Lemmas B.6 and B.7 together along with a union bound, we have that with high probability,
But, note that , by Lemma B.1. Hence, , for some sufficiently large constant . This implies
from which the statement of the lemma follows.
Let’s denote by . Let’s furthermore denote , and . Note first that trivially, since ,
We proceed to upper bound both the terms on the RHS above. More precisely, we will show that
Let us proceed to showing (B.4). The LHS can be rewritten as
We proceed to (B.5), which will be shown by a Bernstein bound. Towards that, note that
Therefore, applying a matrix Bernstein bound, we get
Combining this with (B.2) and (B.3), we get
Let us proceed to the second inequality, which essentially follows the same strategy:
Reusing the notation from Lemma B.7, we have that
for similar reasons as before. From this we get that
Putting (B.6) and (B.7) together, we get that
We will proceed to show an upper bound on second term of the LHS. We will do this by a Bernstein bound as before. Namely, analogously as in Lemma B.7,
Therefore, applying a matrix Bernstein bound, we get
Appendix C Robust whitening
In this section, we show the formula computes an approximation of the true whitening matrix , so that the error is -spectrally bounded by . We recall Theorem 2.10.
where are -spectrally bounded by , , respectively. Then, the matrix matrix
is a good approximation of in the sense that is -spectrally bounded by . Here denotes the best rank- approximation of .
Towards proving Theorem 2.10, an intermediate step is to understand the how the space of singular vectors of are aligned with the noisy version . The following explicitly represent as the form . Here the crucial benefit to do so is that the resulting is small in every direction. In other words, we started with a relative error guarantees on and the Lemma below converts to it an absolute error guarantees on (though the signal term changes slightly).
Suppose are matrices with . Suppose a matrix is -spectrally bounded by , then can be written as
where are small and is close to identity in the sense that,
The key intuition is if the perturbation is happening in the span of columns of and , they cannot change the subspace. By Definition 2.9, we can write as
Now since , we know is invertible, so we can write
This is already in the desired form as we can let , , , and . By Weyl’s Theorem we know , therefore . Other terms can be bounded similarly.
Now we prove that the top approximation of has similar column/row spaces as . Let be the column span of , be the column span of , and be the top left singular subspace of . Similarly we can define , , to be the column spans of , and the top right singular subspace of .
For , we can apply Weyl’s Theorem and Wedin’s Theorem. By Weyl’s Theorem we know . By Wedin’s Theorem we know is -close to . Similar results apply to .
Now we know . Therefore we can again apply Wedin’s Theorem, considering as the original matrix and as the perturbation. As a result, we know is close to , is close to . The distance between (and ) then follows from triangle inequality.
As a direct corollary of Lemma C.1, we obtain that the and have similar subspaces of singular vectors.
In the setting of Lemma C.1, let be the best rank- approximation of . Then, the span of columns of is -close to the span of columns of , span of rows of is -close to the span of columns of .
Furthermore, we can write Here , and as defined in Lemma C.1 and satisfies .
Since is the best rank- approximation, because is a rank matrix, in particular we have
In order to fix this problem, we notice that the matrix is multiplied by on the left and on the right. Assuming , , we should expect to “cancel” with the factor on the left and the factor on the right, giving us . Therefore, we should really measure the error of the middle term after left multiplying with and right multiplying with . We formalize this in the following lemma:
Suppose is as defined in Theorem 2.10, let , then we have
We will first prove Theorem 2.10 assuming Lemma C.3.
By Lemma C.1, we know can be written as
Similarly can be written as
Here the terms and terms are bounded as in Lemma C.1.
Now let us write the matrix as
We can now view as the product of three terms, each term is the sum of two matrices. Therefore we can expand the product into 8 terms. In each of the three pairs, we will call the first matrix the main matrix, and the second matrix the perturbation.
In the remaining proof, we will do calculations to show the product of the main terms is close to , and all the other 7 terms are small.
Before doing that, we first prove several Claims about PSD matrices
If , then . If , then .
Both inequalities can be proved by consider the quadratic form. We know for any , , so the first part is true.
For the second part, for any we can apply Cauchy-Schwartz inequality
Now, we will first prove the product of three main matrices is close to :
We have , where is -spectrally bounded by .
We will first prove the middle part of the matrix is close to identity matrix Id. Here we observe that both have full column rank so . Therefore we can rewrite the product as . Since by Lemma C.1 (and similarly for ), we know . Therefore the middle part is close to Id. Now since we know is -close to Id.
Now we are left with , for this matrix we know
The first term (Claim C.4); the fourth term (by the norm bounds of and . For the cross terms, we can bound them using the second part of Claim C.4. ∎
Next we will try to prove the remaining 7 terms are small. We partition them into three types depending on how many factors they have. We proceed to bound them in each of these cases.
For the terms with only one , we claim:
The three terms , , are all spectrally bounded by .
For the first term, note that both have full column rank, and hence . Therefore the first term can be rewritten as
By Lemma C.1, we have spectral norm bounds for . Therefore we know and is close to Id. Therefore is trivially spectrally bounded, and is spectrally bounded by Claim C.4. The third term is exactly symmetric.
For the second part, we will first prove the middle part of the matrix has spectral norm . This can be done y expanding it to the sum of 4 terms, and use appropriate spectral norm bounds on and its products with and from Lemma C.3. Now we can show is spectrally bounded by the first part of Claim C.4. ∎
Next we try to bound the terms with two factors.
The three terms , , are all spectrally bounded by .
For the first term, notice that is bounded by by Lemma C.3, and . Therefore we know , so by Claim C.4 we know this term is spectrally bounded by . Third term is symmetric.
For the second term, by Lemma C.1 we can directly bound its spectral norm by , so it is trivially spectrally bounded by . ∎
Finally, for the product , we can get the spectral norm for the three factors by Lemma C.1 and Lemma C.3. As a result which is trivially spectrally bounded by .
Combining the bound for all of the terms we get the theorem. ∎
We first prove a simpler version where the perturbation is simply bounded in spectral norm
Suppose are matrices and . Let be an matrix such that , and is a perturbation matrix with and is also of rank .
Now let , then when we have
We first give the proof for . Other terms are similar.
Let be the column span of , and be the row span of . Similarly let be the column span of and be the column span of . By Wedin’s theorem, we know is close to and is close to . As a result, suppose the SVD of is , we know
The same is true for : .
By the property of pseudoinverse, the column span of is , and the row span of is , further, , therefore we can write
Note that now the three matrices are all and invertible! We can write (and do the same thing for . Using the fact that , we have
Here we defined . The spectral norm of can be bounded by
We can write , and we now know , as a result as desired.
For the term , by the same argument we have
On the other hand, we know . We can match the three factors:
Here, first and third bound are proven before. The second bound comes if we consider the SVD of and notice that . We can write , , , then we have
Expanding the last equation, we get 7 terms and all of them can be bounded by . The bounds on and can be proved using similar techniques. ∎
Finally we are ready to prove the main Lemma C.3:
Using Lemma C.1, let , we can write the matrix before pseudoinverse as
We can then apply Lemma C.8 on . As a result, we know if we let , we have the desired bound if we left multiply with or right multiply with .
We will now show how to prove the first bound, all the other bounds can be proved using the same strategy:
All the four terms on the RHS can be bounded by Lemma C.8 so we know .
On the other hand, let . We will prove and then the bound on follows from triangle inequality.
For , we know it is equal to
We will show all three factors in the first term are close to Id. For this follows immediately from Lemma C.1. For , we know
Therefore its spectral norm bound is bounded by (where the bound on comes from Lemma C.1). ∎
With the claim we have now proven , therefore
Here we will show under mild incoherence conditions (defined below), if an error matrix is -spectrally bounded by , then the partial matrices satisfy the requirement of Theorem 2.10.
If is -incoherent for , then when , with high probability over the random partition of into , we know (same is true for ).
As a corollary, if is -spectrally bounded by . Let be the subsets corresponding to , and let be the submatrix of whose rows are in set and columns are in set . Then (also ) is -spectrally bounded by the corresponding asymmetric matrices (, ).
Consider the singular value decomposition of : . Here is a matrix whose columns are orthonormal, is an orthonormal matrix and is a diagonal matrix whose smallest diagonal entry is .
Consider the following way of partitioning the matrix: for each row of , we put it into or with probability independently.
Now, let if row is in the matrix , and 0 otherwise. Then ’s are Bernoulli random variables with probability . Suppose is the set of rows in , let be restricted to rows in , then we have . We will show with high probability .
The key observation here is the expectation of , where is the -th row of (represented as a column vector). Since ’s are Bernoulli random variables, we know
Therefore we can hope to use matrix concentration to prove that is close to its expectation.
Here the last inequality is because .
Therefore by Matrix Bernstein’s inequality we know with high probability . When this happens we know
Hence we have , and . Note that matrices , have exactly the same distribution as so the bounds for , follows from union bound.
For the corollary, if a matrix is spectrally bounded, we can write it as , where , and . This can be done by considering different projections of : let be the span of columns of , then term corresponds to ; term corresponds to ; term corresponds to ; term corresponds to . The spectral bounds are necessary for to be spectrally bounded.
Now for , we can write it as , where we also take the corresponding submatrices of ’s. Since the spectral norm of a submatrix can only be smaller, we know , , and . Therefore by Definition 2.9 we know is spectrally bounded by . ∎
Appendix D Proof of Theorem 3.1 and Theorem 3.3
In this section, we provide the full proof of Theorem 3.1. We start with a simple technical Lemma.
If is an -approximate whitening matrix for , then ,
By the definition of approximate-whitening, we have
by virtue of the fact that . Rewriting in semidefinite-order notation, we get that
Multiplying on the left and right by , we get
This directly implies which is equivalent to the statement of the lemma. ∎
Towards proving Theorem 3.1, we will first prove the following proposition, which shows that we recover the matrix correctly:
Under the random generative model defined in Section 1, if the number of samples satisfies
The proof will consist of checking the conditions for Algorithms 4 and 3 to work, so that we can apply Theorems 5.4 and 2.10.
To get a handle on the PMI tensor, by Proposition A.2, for any equipartition of , we can it as
We can choose to ensure
Having an explicit form for the tensor, we proceed to check the spectral boundedness condition for Algorithm 4.
Next, we verify the conditions for calculating approximate whitening matrices (Algorithm 4).
However, for the random model, applying Lemma B.1,
Analogous statements hold for and .
Finally, we bound the error due to empirical estimates. Since ,
Hence, by Corollary E.3, with a number of samples as stated in the theorem,
With that, invoking Theorem 5.4 (with taking into account both the term above, and the above error due to sampling), the output of Algorithm 3 will produce vectors , , s.t. is -close to , for
where .
Plugging in the estimates from (D.1), (D.2), (D.3) as well as , we get:
which implies the vectors are -close to , for all .
Given that, we prove the main theorem. The main issue will be to ensure that taking of the values
Then we have that . By the Lipschitzness of in the region
Therefore recalling we complete the proof. ∎
The proof will follow the same outline as the proof of Theorem 3.1. The difference is that since we only have a guarantee on the spectral boundedness of the second and third-order term, we will need to bound the higher-order terms in a different manner. Given that we have no information on them in this scenario, we will simply bound them in the obvious manner. We proceed to formalize this.
The sample complexity is polynomial for the same reasons as in the proof of Theorem 3.1, so we will not worry about it here.
We only need to check the conditions for Algorithms 4 and 3 to work, so that we can apply Theorems 5.4 and 2.10.
Towards that, first we claim that we can write the PMI tensor for any equipartition of as
where . Towards achieving this, first we claim that Proposition 5.6 implies that for any subsets ,
Indeed, if we put , , , then we have , and similarly for . Since , the claim immediately follows. Hence, (D.4) follows.
We claim that is spectrally bounded by .
where the first inequality holds since are -spectrally bounded bounded by and the second since and . Let . Since we are assuming the matrix is -incoherent, we can apply Theorem C.10, and claim the output of Algorithm 4 are matrices which are -approximate whitening matrices for respectively.
Then, applying Theorem 5.4, we get that we recover vectors are -close to , for all . for
Recall that , and and and , we obtain that
where the last inequality holds since .
Argument for recovering from is then exactly the same as the one in Theorem 3.1.
Appendix E Sample complexity and bias of the PMI estimator
Finally, we consider the issue of sample complexity. The estimator we will use for the PMI matrix will simply be the plug-in estimator, namely:
Notice that this estimator is biased, but as the number of samples grows, the bias tends to zero. Formally, we can show:
with high probability .
Denoting and , we get that
Furthermore, we have that , for , which implies that when , . From this it follows that if
Note that it suffices to show that if , we have
with high probability by a simple union bound.
Both (E.2) and (E.3) will follow by a Chernoff bound.
Indeed, consider (E.2) first. We have by Chernoff
Hence, if , we get that with probability at least .
The proof of (E.3) is analogous – the only difference being that the requirement is that which gives the statement of the lemma.
Virtually the same proof as above shows that:
with high probability .
with high probability for any equipartition
Appendix F Matrix Perturbation Toolbox
In this section we discuss standard matrix perturbation inequalities. Many results in this section can be found in Stewart and Sun [Ste77]. Given , the perturbation in individual singular values can be bounded by Weyl’s theorem:
Given , we know .
For singular vectors, the perturbation is bounded by Wedin’s Theorem:
Let , with analogous singular value decomposition. Let be the matrix of canonical angles between the column span of and that of , and be the matrix of canonical angles between the column span of and that of . Suppose that there exists a such that
When we have a lowerbound on , it is easy to get bounds for the perturbation of pseudoinverse.
Note that this theorem is not strong enough when the perturbation is only known to be -spectrally bounded in our definition.