The Power of Convex Relaxation: Near-Optimal Matrix Completion

Emmanuel J. Candes, Terence Tao

Introduction

Imagine we have an n1×n2n_{1}\times n_{2} array of realMuch of the discussion below, as well as our main results, apply also to the case of complex matrix completion, with some minor adjustments in the absolute constants; but for simplicity we restrict attention to the real case. numbers and that we are interested in knowing the value of each of the n1n2n_{1}n_{2} entries in this array. Suppose, however, that we only get to see a small number of the entries so that most of the elements about which we wish information are simply missing. Is it possible from the available entries to guess the many entries that we have not seen? This problem is now known as the matrix completion problem , and comes up in a great number of applications, including the famous Netflix Prize and other similar questions in collaborative filtering . In a nutshell, collaborative filtering is the task of making automatic predictions about the interests of a user by collecting taste information from many users. Netflix is a commercial company implementing collaborative filtering, and seeks to predict users’ movie preferences from just a few ratings per user. There are many other such recommendation systems proposed by Amazon, Barnes and Noble, and Apple Inc. to name just a few. In each instance, we have a partial list about a user’s preferences for a few rated items, and would like to predict his/her preferences for all items from this and other information gleaned from many other users.

An increasingly common assumption in the field is to suppose that the unknown matrix MM has low rank or has approximately low rank. In a recommendation system, this makes sense because often times, only a few factors contribute to an individual’s taste. In , the authors showed that this premise radically changes the problem, making the search for solutions meaningful. Before reviewing these results, we would like to emphasize that the problem of recovering a low-rank matrix from a sample of its entries, and by extension from fewer linear functionals about the matrix, comes up in many application areas other than collaborative filtering. For instance, the completion problem also arises in computer vision. There, many pixels may be missing in digital images because of occlusion or tracking failures in a video sequence. Recovering a scene and inferring camera motion from a sequence of images is a matrix completion problem known as the structure-from-motion problem . Other examples include system identification in control , multi-class learning in data analysis , global positioning—e.g. of sensors in a network—from partial distance information , remote sensing applications in signal processing where we would like to infer a full covariance matrix from partially observed correlations , and many statistical problems involving succinct factor models.

2 Minimal sampling

This paper is concerned with the theoretical underpinnings of matrix completion and more specifically in quantifying the minimum number of entries needed to recover a matrix of rank rr exactly. This number generally depends on the matrix we wish to recover. For simplicity, assume that the unknown rank-rr matrix MM is n×nn\times n. Then it is not hard to see that matrix completion is impossible unless the number of samples mm is at least 2nrr22nr-r^{2}, as a matrix of rank rr depends on this many degrees of freedom. The singular value decomposition (SVD)

so that the information about MM is given by PΩ(M)\mathcal{P}_{\Omega}({M}). The matrix MM can be, in principle, recovered from PΩ(M)\mathcal{P}_{\Omega}({M}) if it is the unique matrix of rank less or equal to rr consistent with the data. In other words, if MM is the unique solution to

Knowing when this happens is a delicate question which shall be addressed later. For the moment, note that attempting recovery via (1.2) is not practical as rank minimization is in general an NP-hard problem for which there are no known algorithms capable of solving problems in practical time once, say, n10n\geq 10.

In , it was proved 1) that matrix completion is not as ill-posed as previously thought and 2) that exact matrix completion is possible by convex programming. The authors of proposed recovering the unknown matrix by solving the nuclear norm minimization problem

where the nuclear norm X\|X\|_{*} of a matrix XX is defined as the sum of its singular values,

(The problem (1.3) is a semidefinite program .) They proved that if Ω\Omega is sampled uniformly at random among all subset of cardinality mm and MM obeys a low coherence condition which we will review later, then with large probability, the unique solution to (1.3) is exactly MM, provided that the number of samples obeys

(to be completely exact, there is a restriction on the range of values that rr can take on).

In (1.5), the number of samples per degree of freedom is not logarithmic or polylogarithmic in the dimension, and one would like to know whether better results approaching the nrlognnr\log n limit are possible. This paper provides a positive answer. In details, this work develops many useful matrix models for which nuclear norm minimization is guaranteed to succeed as soon as the number of entries is of the form nrpolylog(n)nr\text{polylog}(n).

3 Main results

A contribution of this paper is to develop simple hypotheses about the matrix MM which makes it recoverable by semidefinite programming from nearly minimally sampled entries. To state our assumptions, we recall the SVD of MM (1.1) and denote by PUP_{U} (resp. PVP_{V}) the orthogonal projections onto the column (resp. row) space of MM; i.e. the span of the left (resp. right) singular vectors. Note that

We observe that EE interacts well with PUP_{U} and PVP_{V}, in particular obeying the identities

One can view EE as a sort of matrix-valued “sign pattern” for MM (compare (1.7) with (1.1)), and is also closely related to the subgradient M\partial\|M\|_{*} of the nuclear norm at MM (see (3.2)).

It is clear that some assumptions on the singular vectors ui,viu_{i},v_{i} (or on the spaces U,VU,V) is needed in order to have a hope of efficient matrix completion. For instance, if u1u_{1} and v1v_{1} are Kronecker delta functions at positions i,ji,j respectively, then the singular value σ1\sigma_{1} can only be recovered if one actually samples the (i,j)(i,j) coordinate, which is only likely if one is sampling a significant fraction of the entire matrix. Thus we need the vectors ui,viu_{i},v_{i} to be “spread out” or “incoherent” in some sense. In our arguments, it will be convenient to phrase this incoherence assumptions using the projection matrices PU,PVP_{U},P_{V} and the sign pattern matrix EE. More precisely, our assumptions are as follows.

There exists μ1>0\mu_{1}>0 such that for all pairs (a,a)[n1]×[n1](a,a^{\prime})\in[n_{1}]\times[n_{1}] and (b,b)[n2]×[n2](b,b^{\prime})\in[n_{2}]\times[n_{2}],

There exists μ2>0\mu_{2}>0 such that for all (a,b)[n1]×[n2](a,b)\in[n_{1}]\times[n_{2}],

We will say that the matrix MM obey the strong incoherence property with parameter μ\mu if one can take μ1\mu_{1} and μ2\mu_{2} both less than equal to μ\mu. (This property is related to, but slightly different from, the incoherence property, which will be discussed in Section 1.6.1.)

Remark. Our assumptions only involve the singular vectors u1,,ur,v1,,vru_{1},\ldots,u_{r},v_{1},\ldots,v_{r} of MM; the singular values σ1,,σr\sigma_{1},\ldots,\sigma_{r} are completely unconstrained. This lack of dependence on the singular values is a consequence of the geometry of the nuclear norm (and in particular, the fact that the subgradient X\partial\|X\|_{*} of this norm is independent of the singular values, see (3.2)).

It is not hard to see that μ\mu must be greater than 1. For instance, (1.9) implies

which forces μ21\mu_{2}\geq 1. The Frobenius norm identities

and (1.8a), (1.8b) also place a similar lower bound on μ1\mu_{1}.

We will show that 1) matrices obeying the strong incoherence property with a small value of the parameter μ\mu can be recovered from fewer entries and that 2) many matrices of interest obey the strong incoherence property with a small μ\mu. We will shortly develop three models, the uniformly bounded orthogonal model, the low-rank low-coherence model, and the random orthogonal model which all illustrate the point that if the singular vectors of MM are “spread out” in the sense that their amplitudes all have about the same size, then the parameter μ\mu is low. In some sense, “most” low-rank matrices obey the strong incoherence property with μ=O(logn)\mu=O(\sqrt{\log n}), where n=max(n1,n2)n=\max(n_{1},n_{2}). Here, O()O(\cdot) is the standard asymptotic notation, which is reviewed in Section 1.8.

Our first matrix completion result is as follows.

then MM is the unique solution to (1.3) with probability at least 1n31-n^{-3}. In other words: with high probability, nuclear-norm minimization recovers all the entries of M{M} with no error.

This result is noteworthy for two reasons. The first is that the matrix model is deterministic and only needs the strong incoherence assumption. The second is more substantial. Consider the class of bounded rank matrices obeying μ=O(1)\mu=O(1). We shall see that no method whatsoever can recover those matrices unless the number of entries obeys mc0nlognm\geq c_{0}\,n\log n for some positive numerical constant c0c_{0}; this is the information theoretic limit. Thus Theorem 1.1 asserts that exact recovery by nuclear-norm minimization occurs nearly as soon as it is information theoretically possible. Indeed, if the number of samples is slightly larger, by a logarithmic factor, than the information theoretic limit, then (1.3) fills in the missing entries with no error.

We stated Theorem 1.1 for bounded ranks, but our proof gives a result for all values of rr. Indeed, the argument will establish that the recovery is exact with high probability provided that

When r=O(1)r=O(1), this is Theorem 1.1. We will prove a stronger and near-optimal result below (Theorem 1.2) in which we replace the quadratic dependence on rr with linear dependence. The reason why we state Theorem 1.1 first is that its proof is somewhat simpler than that of Theorem 1.2, and we hope that it will provide the reader with a useful lead-in to the claims and proof of our main result.

Under the same hypotheses as in Theorem 1.1, there is a numerical constant CC such that if

MM is the unique solution to (1.3) with probability at least 1n31-n^{-3}.

This result is general and nonasymptotic.

The proof of Theorems 1.1, 1.2 will occupy the bulk of the paper, starting at Section 3.

4 A surprise

We find it unexpected that nuclear norm-minimization works so well, for reasons we now pause to discuss. For simplicity, consider matrices with a strong incoherence parameter μ\mu polylogarithmic in the dimension. We know that for the rank minimization program (1.2) to succeed, or equivalently for the problem to be well posed, the number of samples must exceed a constant times nrlognnr\log n. However, Theorem 1.2 proves that the convex relaxation is rigorously exact nearly as soon as our problem has a unique low-rank solution. The surprise here is that admittedly, there is a priori no good reason to suspect that convex relaxation might work so well. There is a priori no good reason to suspect that the gap between what combinatorial and convex optimization can do is this small. In this sense, we find these findings a little unexpected.

5 Model matrices

We now discuss model matrices which obey the conditions (1.8) and (1.9) for small values of the strong incoherence parameter μ\mu. For simplicity we restrict attention to the square matrix case n1=n2=nn_{1}=n_{2}=n.

In this section we shall show, roughly speaking, that almost all n×nn\times n matrices MM with singular vectors obeying the size property

with μB=O(1)\mu_{B}=O(1) also satisfy the assumptions A1 and A2 with μ1,μ2=O(logn)\mu_{1},\mu_{2}=O(\sqrt{\log n}). This justifies our earlier claim that when the singular vectors are spread out, then the strong incoherence property holds for a small value of μ\mu.

We define a random model obeying (1.13) as follows: take two arbitrary families of nn orthonormal vectors [u1,,un][u_{1},\ldots,u_{n}] and [v1,,vn][v_{1},\ldots,v_{n}] obeying (1.13). We allow the uiu_{i} and viv_{i} to be deterministic; for instance one could have ui=viu_{i}=v_{i} for all i[n]i\in[n].

Select rr left singular vectors uα(1),,uα(r)u_{\alpha(1)},\ldots,u_{\alpha(r)} at random with replacement from the first family, and rr right singular vectors vβ(1),,vβ(r)v_{\beta(1)},\ldots,v_{\beta(r)} from the second family, also at random. We do not require that the β\beta are chosen independently from the α\alpha; for instance one could have β(k)=α(k)\beta(k)=\alpha(k) for all k[r]k\in[r].

Set M:=k[r]ϵkσkuα(k)vβ(k)M:=\sum_{k\in[r]}\epsilon_{k}\sigma_{k}u_{\alpha(k)}v_{\beta(k)}^{*}, where the signs ϵ1,,ϵr{1,+1}\epsilon_{1},\ldots,\epsilon_{r}\in\{-1,+1\} are chosen independently at random (with probability 1/21/2 of each choice of sign), and σ1,,σr>0\sigma_{1},\ldots,\sigma_{r}>0 are arbitrary distinct positive numbers (which are allowed to depend on the previous random choices).

We emphasize that the only assumptions about the families [u1,,un][u_{1},\ldots,u_{n}] and [v1,,vn][v_{1},\ldots,v_{n}] is that they have small components. For example, they may be the same. Also note that this model allows for any kind of dependence between the left and right singular selected vectors. For instance, we may select the same columns as to obtain a symmetric matrix as in the case where the two families are the same. Thus, one can think of our model as producing a generic matrix with uniformly bounded singular vectors.

We now show that PUP_{U}, PVP_{V} and EE obey (1.8) and (1.9), with μ1,μ2=O(μBlogn)\mu_{1},\mu_{2}=O(\mu_{B}\sqrt{\log n}), with large probability. For (1.9), observe that

and {ϵk}\{\epsilon_{k}\} is a sequence of i.i.d. ±1\pm 1 symmetric random variables. Then Hoeffding’s inequality shows that μ2=O(μBlogn)\mu_{2}=O(\mu_{B}\sqrt{\log n}); see for details.

For (1.8), we will use a beautiful concentration-of-measure result of McDiarmid.

Let {a1,,an}\{a_{1},\ldots,a_{n}\} be a sequence of scalars obeying aiα|a_{i}|\leq\alpha. Choose a random set SS of size ss without replacement from {1,,n}\{1,\ldots,n\} and let Y=iSaiY=\sum_{i\in S}a_{i}. Then for each t0t\geq 0,

where S:={α(1),,α(r)}S:=\{\alpha(1),\ldots,\alpha(r)\}. For any fixed a,a[n]a,a^{\prime}\in[n], set

Taking λ\lambda proportional to logn\sqrt{\log n} and applying the union bound for a,a[n]a,a^{\prime}\in[n] proves (1.8) with probability at least 1n31-n^{-3} (say) with μ1=O(μBlogn)\mu_{1}=O(\mu_{B}\sqrt{\log n}).

Combining this computation with Theorems 1.1, 1.2, we have established the following corollary:

Let MM be a matrix sampled from a uniformly bounded model. Under the hypotheses of Theorem 1.1, if

MM is the unique solution to (1.3) with probability at least 1n31-n^{-3}. As we shall see below, when r=O(1)r=O(1), it suffices to have

Obviously, this does not scale like r/n\sqrt{r}/n. Similarly, the sign flip (step 2) is also necessary as otherwise, we could have E=PUE=P_{U} as in the case where [u1,,un]=[v1,,vn][u_{1},\ldots,u_{n}]=[v_{1},\ldots,v_{n}] and the same columns are selected. Here,

which does not scale like r/n\sqrt{r}/n either.

5.2 Low-rank low-coherence model

When the rank is small, the assumption that the singular vectors are spread is sufficient to show that the parameter μ\mu is small. To see this, suppose that the singular vectors obey (1.13). Then

The first inequality follows from the Cauchy-Schwarz inequality

for aaa\neq a^{\prime} and from the Frobenius norm bound

This gives μ1μBr\mu_{1}\leq\mu_{B}\sqrt{r}. Also, by another application of Cauchy-Schwarz we have

so that we also have μ2μBr\mu_{2}\leq\mu_{B}\sqrt{r}. In short, μμBr\mu\leq\mu_{B}\sqrt{r}.

Our low-rank low-coherence model assumes that r=O(1)r=O(1) and that the singular vectors obey (1.13). When μB=O(1)\mu_{B}=O(1), this model obeys the strong incoherence property with μ=O(1)\mu=O(1). In this case, Theorem 1.1 specializes as follows:

Let MM be a matrix of bounded rank (r=O(1))(r=O(1)) whose singular vectors obey (1.13). Under the hypotheses of Theorem 1.1, if

then MM is the unique solution to (1.3) with probability at least 1n31-n^{-3}.

5.3 Random orthogonal model

Our last model is borrowed from and assumes that the column matrices [u1,,ur][u_{1},\ldots,u_{r}] and [v1,,vr][v_{1},\ldots,v_{r}] are independent random orthogonal matrices, with no assumptions whatsoever on the singular values σ1,,σr\sigma_{1},\ldots,\sigma_{r}. Note that this is a special case of the uniformly bounded model since this is equivalent to selecting two n×nn\times n random orthonormal bases, and then selecting the singular vectors as in Section 1.5.1. Since we know that the maximum entry of an n×nn\times n random orthogonal matrix is bounded by a constant times lognn\sqrt{\frac{\log n}{n}} with large probability, then Section 1.5.1 shows that this model obeys the strong incoherence property with μ=O(logn)\mu=O(\log n). Theorems 1.1, 1.2 then give

Let MM be a matrix sampled from the random orthogonal model. Under the hypotheses of Theorem 1.1, if

then MM is the unique solution to (1.3) with probability at least 1n31-n^{-3}. The exponent 88 can be lowered to 77 when rlognr\geq\log n and to 66 when r=O(1)r=O(1).

As mentioned earlier, we have a lower bound m2nrr2m\geq 2nr-r^{2} for matrix completion, which can be improved to mCnrlognm\geq Cnr\log n under reasonable hypotheses on the matrix MM. Thus, the hypothesis on mm in Corollary 1.6 cannot be substantially improved. However, it is likely that by specializing the proofs of our general results (Theorems 1.1 and 1.2) to this special case, one may be able to improve the power of the logarithm here, though it seems that a substantial effort would be needed to reach the optimal level of nrlognnr\log n even in the bounded rank case.

Speaking of logarithmic improvements, we have shown that μ=O(logn)\mu=O(\log n), which is sharp since for r=1r=1, one cannot hope for better estimates. For rr much larger than logn\log n, however, one can improve this to μ=O(logn)\mu=O(\sqrt{\log n}). As far as μ1\mu_{1} is concerned, this is essentially a consequence of the Johnson-Lindenstrauss lemma. For aaa\neq a^{\prime}, write

We claim that for each aaa\neq a^{\prime},

with probability at least 1n51-n^{-5}, say. This inequality is indeed well known. Observe that PUx\|P_{U}x\| has the same distribution than the Euclidean norm of the first rr components of a vector uniformly distributed on the n1n-1 dimensional sphere of radius x\|x\|. Then we have :

Choosing x=ea±eax=e_{a}\pm e_{a^{\prime}}, ϵ=C0lognr\epsilon=C_{0}\sqrt{\frac{\log n}{r}}, and applying the union bound proves the claim as long as long as rr is sufficiently larger than logn\log n. Finally, since a bound on the diagonal term PUea2r/n\|P_{U}e_{a}\|^{2}-r/n in (1.8) follows from the same inequality by simply choosing x=eax=e_{a}, we have μ1=O(logn)\mu_{1}=O(\sqrt{\log n}). Similar arguments for μ2\mu_{2} exist but we forgo the details.

6 Comparison with other works

The mathematical study of matrix completion began with , which made slightly different incoherence assumptions than in this paper. Namely, let us say that the matrix MM obeys the incoherence property with a parameter μ0>0\mu_{0}>0 if

for all a[n1]a\in[n_{1}], b[n2]b\in[n_{2}]. Again, this implies μ01\mu_{0}\geq 1.

In it was shown that if a fixed matrix MM obeys the incoherence property with parameter μ0\mu_{0}, then nuclear minimization succeeds with large probability if

Now consider a matrix MM obeying the strong incoherence property with μ=O(1)\mu=O(1). Then since μ01\mu_{0}\geq 1, (1.19) guarantees exact reconstruction only if mCn6/5rlognm\geq C\,n^{6/5}r\log n (and r=O(n1/5)r=O(n^{1/5})) while our results only need nrpolylog(n)nr\text{polylog}(n) samples. Hence, our results provide a substantial improvement over (1.19) at least in the regime which permits minimal sampling.

We would like to note that there are obvious relationships between the best incoherence parameter μ0\mu_{0} and the best strong incoherence parameters μ1\mu_{1}, μ2\mu_{2} for a given matrix MM, which we take to be square for simplicity. On the one hand, (1.8) implies that

so that one can take μ01+μ1/r\mu_{0}\leq 1+\mu_{1}/\sqrt{r}. This shows that one can apply results from the incoherence model (in which we only know (1.18)) to our model (in which we assume strong incoherence). On the other hand,

so that μ1μ0r\mu_{1}\leq\mu_{0}\sqrt{r}. Similarly, μ2μ0r\mu_{2}\leq\mu_{0}\sqrt{r} so that one can transfer results in the other direction as well.

We would like to mention another important paper inspired by compressed sensing, and which also recovers low-rank matrices from partial information. The model in , however, assumes some sort of Gaussian measurements and is completely different from the completion problem discussed in this paper.

6.2 Spectral methods

An interesting new approach to the matrix completion problem has been recently introduced in . This algorithm starts by trimming each row and column with too few entries; i.e. one replaces the entries in those rows and columns by zero. Then one computes the SVD of the trimmed matrix and truncate it as to only keep the top rr singular values (note that one would need to know rr a priori). Then under some conditions (including the incoherence property (1.18) with μ=O(1)\mu=O(1)), this work shows that accurate recovery is possible from a minimal number of samples, namely, on the order of nrlognnr\log n samples. Having said this, this work is not directly comparable to ours because it operates in a different regime. Firstly, the results are asymptotic and are valid in a regime when the dimensions of the matrix tend to infinity in a fixed ratio while ours are not. Secondly, there is a strong assumption about the range of the singular values the unknown matrix can take on while we make no such assumption; they must be clustered so that no singular value can be too large or too small compared to the others. Finally, this work only shows approximate recovery—not exact recovery as we do here—although exact recovery results have been announced. This work is of course very interesting because it may show that methods—other than convex optimization—can also achieve minimal sampling bounds.

7 Lower bounds

We would like to conclude the tour of the results introduced in this paper with a simple lower bound, which highlights the fundamental role played by the coherence in controlling what is information-theoretically possible.

Fix 1m,rn1\leq m,r\leq n and μ01\mu_{0}\geq 1, let 0<δ<1/20<\delta<1/2, and suppose that we do not have the condition

Then there exist infinitely many pairs of distinct n×nn\times n matrices MMM\neq M^{\prime} of rank at most rr and obeying the incoherence property (1.18) with parameter μ0\mu_{0} such that PΩ(M)=PΩ(M)\mathcal{P}_{\Omega}(M)=\mathcal{P}_{\Omega}(M^{\prime}) with probability at least δ\delta. Here, each entry is observed with probability p=m/n2p=m/n^{2} independently from the others.

Clearly, even if one knows the rank and the coherence of a matrix ahead of time, then no algorithm can be guaranteed to succeed based on the knowledge of PΩ(M)\mathcal{P}_{\Omega}(M) only, since they are many candidates which are consistent with these data. We prove this theorem in Section 2. Informally, Theorem 1.7 asserts that (1.20) is a necessary condition for matrix completion to work with high probability if all we know about the matrix MM is that it has rank at most rr and the incoherence property with parameter μ0\mu_{0}. When the right-hand side of (1.20) is less than ε<1\varepsilon<1, this implies

Recall that the number of degrees of freedom of a rank-rr matrix is 2nr(1r/2n)2nr(1-r/2n). Hence, to recover an arbitrary rank-rr matrix with the incoherence property with parameter μ0\mu_{0} with any decent probability by any method whatsoever, the minimum number of samples must be about the number of degrees of freedom times μ0logn\mu_{0}\log n; in other words, the oversampling factor is directly proportional to the coherence. Since μ01\mu_{0}\geq 1, this justifies our earlier assertions that nrlognnr\log n samples are really needed.

In the Bernoulli model used in Theorem 1.7, the number of entries is a binomial random variable sharply concentrating around its mean mm. There is very little difference between this model and the uniform model which assumes that Ω\Omega is sampled uniformly at random among all subsets of cardinality mm. Results holding for one hold for the other with only very minor adjustments. Because we are concerned with essential difficulties, not technical ones, we will often prove our results using the Bernoulli model, and indicate how the results may easily be adapted to the uniform model.

8 Notation

Before continuing, we provide here a brief summary of the notations used throughout the paper. To simplify the notation, we shall work exclusively with square matrices, thus

Throughout, we will always assume that mm is at least as large as 2nr2nr, thus

The Euclidean inner product between two matrices is defined by the formula

and the corresponding Euclidean norm, called the Frobenius norm or Hilbert-Schmidt norm, is denoted

The nuclear norm of a matrix X{X} is denoted

We use the usual asymptotic notation, for instance writing O(M)O(M) to denote a quantity bounded in magnitude by CMCM for some absolute constant C>0C>0. We will sometimes raise such notation to some power, for instance O(M)MO(M)^{M} would denote a quantity bounded in magnitude by (CM)M(CM)^{M} for some absolute constant C>0C>0. We also write XYX\lesssim Y for X=O(Y)X=O(Y), and poly(X)\text{poly}(X) for O(1+X)O(1)O(1+|X|)^{O(1)}.

We use 1E1_{E} to denote the indicator function of an event EE, e.g. 1a=a1_{a=a^{\prime}} equals 11 when a=aa=a^{\prime} and when aaa\neq a^{\prime}.

If AA is a finite set, we use A|A| to denote its cardinality.

We record some (standard) conventions involving empty sets. The set [n]:={1,,n}[n]:=\{1,\ldots,n\} is understood to be the empty set when n=0n=0. We also make the usual conventions that an empty sum xf(x)\sum_{x\in\emptyset}f(x) is zero, and an empty product xf(x)\prod_{x\in\emptyset}f(x) is one. Note however that a kk-fold sum such as a1,,ak[n]f(a1,,ak)\sum_{a_{1},\ldots,a_{k}\in[n]}f(a_{1},\ldots,a_{k}) does not vanish when k=0k=0, but is instead equal to a single summand f()f() with the empty tuple ()[n]0()\in[n]^{0} as the input; thus for instance the identity

is valid both for positive integers kk and for k=0k=0 (and both for non-zero ff and for zero ff, recalling of course that 00=10^{0}=1). We will refer to sums over the empty tuple as trivial sums to distinguish them from empty sums.

Lower bounds

This section proves Theorem 1.7, which asserts that no method can recover an arbitrary n×nn\times n matrix of rank rr and coherence at most μ0\mu_{0} unless the number of random samples obeys (1.20). As stated in the theorem, we establish lower bounds for the Bernoulli model, which then apply to the model where exactly mm entries are selected uniformly at random, see the Appendix for details.

Now under the Bernoulli model, the number of observed entries in the first row—and in any fixed row or column—is a binomial random variable with a number of trials equal to nn and a probability of success equal to pp. Therefore, the probability π0\pi_{0} that any row is unsampled is equal to π0=(1p)n\pi_{0}=(1-p)^{n}. By independence, the probability that all rows are sampled at least once is (1π0)n(1-\pi_{0})^{n}, and any method succeeding with probability greater 1δ1-\delta would need

or nπ0nlog(1π0)log(1δ)-n\pi_{0}\geq n\log(1-\pi_{0})\geq\log(1-\delta). When δ<1/2\delta<1/2, log(1δ)2δ\log(1-\delta)\geq-2\delta and thus, any method would need

This is the desired conclusion when μ0>1\mu_{0}>1, r=1r=1.

where the σk\sigma_{k} are drawn arbitrarily from $(say),andthesingularvectors(say), and the singular vectorsu_{1},\ldots,u_{r}$ are defined as follows:

together with 1ex>xx2/21-e^{-x}>x-x^{2}/2 whenever x0x\geq 0.

Strategy and Novelty

This section outlines the strategy for proving our main results, Theorems 1.1 and 1.2. The proofs of these theorems are the same up to a point where the arguments to estimate the moments of a certain random matrix differ. In this section, we present the common part of the proof, leading to two key moment estimates, while the proofs of these crucial estimates are the object of later sections.

One can of course prove our claims for the Bernoulli model with p=m/n2p=m/n^{2} and transfer the results to the uniform model, by using the arguments in the appendix. For example, the probability that the recovery via (1.3) is not exact is at most twice that under the Bernoulli model.

We recall the projection matrices PU,PVP_{U},P_{V} and the companion matrix EE defined by (1.6), (1.7). It is known that

and note that the complementary projection PT:=IPT\mathcal{P}_{T^{\perp}}:=\mathcal{I}-\mathcal{P}_{T} is given by the formula

In particular, PT\mathcal{P}_{T^{\perp}} is a contraction:

Then ZXZ\in\partial\|{X}\|_{*} if and only if

With these preliminaries in place, establishes the following result.

Let the notation be as above. Suppose that the following two conditions hold:

Then M{M} is the unique solution to the convex program (1.3).

The second sufficient condition, namely, the injectivity of the restriction to PΩ\mathcal{P}_{\Omega} has been studied in . We recall a useful result.

[7, Theorem 4.1] Suppose Ω\Omega is sampled according to the Bernoulli model and put n:=max(n1,n2)n:=\max(n_{1},n_{2}). Assume that MM obeys (1.18). Then there is a numerical constant CRC_{R} such that for all β>1\beta>1, we have the bound

with probability at least 13nβ1-3n^{-\beta} provided that a<1a<1, where aa is the quantity

We will apply this theorem with β:=4\beta:=4 (say). The statement (3.6) is stronger than the injectivity of the restriction of PΩ\mathcal{P}_{\Omega} to TT. Indeed, take mm sufficiently large so that the a<1a<1. Then if XTX\in T, we have

and obviously, PΩ(X)\mathcal{P}_{\Omega}(X) cannot vanish unless X=0X=0.

In order for the condition a<1a<1 to hold, we must have

for a suitably large constant C0C_{0}. But this follows from the hypotheses in either Theorem 1.1 or Theorem 1.2, for reasons that we now pause to explain. In either of these theorems we have

for some large constant C1C_{1}. Recall from Section 1.6.1 that μ01+μ1/r1+μ/r\mu_{0}\leq 1+\mu_{1}/\sqrt{r}\leq 1+\mu/\sqrt{r}, and so (3.9) implies (3.8) whenever μ02\mu_{0}\geq 2 (say). When μ0<2\mu_{0}<2, we can also deduce (3.8) from (3.9) by applying the trivial bound μ1\mu\geq 1 noted in the introduction.

In summary, to prove Theorem 1.1 or Theorem 1.2, it suffices (under the hypotheses of these theorems) to exhibit a dual matrix YY obeying the first sufficient condition of Lemma 3.1, with probability at least 1n3/21-n^{-3}/2 (say). This is the objective of the remaining sections of the paper.

2 The dual certificate

By construction, PΩ(Y)=Y\mathcal{P}_{\Omega}(Y)=Y, PT(Y)=E\mathcal{P}_{T}(Y)=E and, therefore, we will establish that MM is the unique minimizer if one can show that

The dual matrix YY would then certify that MM is the unique solution, and this is the reason why we will refer to YY as a candidate certificate. This certificate was also used in .

Before continuing, we would like to offer a little motivation for the choice of the dual matrix YY. It is not difficult to check that (3.10) is actually the solution to the following problem:

Note that by the Pythagorean identity, YY obeys

The interpretation is now clear: among all matrices obeying PΩ(Z)=Z\mathcal{P}_{\Omega}(Z)=Z and PT(Z)=E\mathcal{P}_{T}(Z)=E, YY is that element which minimizes PT(Z)F\|{\mathcal{P}}_{T^{\perp}}({Z})\|_{F}. By forcing the Frobenius norm of PT(Y){\mathcal{P}}_{T^{\perp}}({Y}) to be small, it is reasonable to expect that its spectral norm will be sufficiently small as well. In that sense, YY defined via (3.10) is a very suitable candidate.

3 The Neumann series

From (3.12) we have PTPΩPT=pPT(I+QΩ)PT\mathcal{P}_{T}\mathcal{P}_{\Omega}\mathcal{P}_{T}=p\mathcal{P}_{T}(\mathcal{I}+\mathcal{Q}_{\Omega})\mathcal{P}_{T}, and owing to Theorem 3.2, one can write (PTPΩPT)1(\mathcal{P}_{T}\mathcal{P}_{\Omega}\mathcal{P}_{T})^{-1} as the convergent Neumann series

From the identity PTPT=0\mathcal{P}_{T^{\perp}}\mathcal{P}_{T}=0 we conclude that PTPΩPT=p(PTQΩPT)\mathcal{P}_{T^{\perp}}\mathcal{P}_{\Omega}\mathcal{P}_{T}=p(\mathcal{P}_{T^{\perp}}\mathcal{Q}_{\Omega}\mathcal{P}_{T}). One can therefore express the candidate certificate YY (3.10) as

where we have used PT2=PT\mathcal{P}_{T}^{2}=\mathcal{P}_{T} and PT(E)=E\mathcal{P}_{T}(E)=E. By the triangle inequality and (3.5), it thus suffices to show that

It is not hard to bound the tail of the series thanks to Theorem 3.2. First, this theorem bounds the spectral norm of PTQΩPT\mathcal{P}_{T}\mathcal{Q}_{\Omega}\mathcal{P}_{T} by the quantity aa in (3.7). This gives that for each k1k\geq 1, (PTQΩPT)k(E)F<akEF=akr\|(\mathcal{P}_{T}\mathcal{Q}_{\Omega}\mathcal{P}_{T})^{k}(E)\|_{F}<a^{k}\|E\|_{F}=a^{k}\sqrt{r} and, therefore,

Second, this theorem also bounds QΩPT\|\mathcal{Q}_{\Omega}\mathcal{P}_{T}\| (recall that this is the spectral norm) since

Expanding the identity PΩ2=PΩ\mathcal{P}_{\Omega}^{2}=\mathcal{P}_{\Omega} in terms of QΩ\mathcal{Q}_{\Omega}, we obtain

Hence QΩPT(a+1)/p\|\mathcal{Q}_{\Omega}\mathcal{P}_{T}\|\leq\sqrt{(a+1)/p}. For each k00k_{0}\geq 0, this gives

provided that a<1/2a<1/2. With p=m/n2p=m/n^{2} and aa defined by (3.7) with β=4\beta=4, we have

with probability at least 1n41-n^{-4}. When k0+1lognk_{0}+1\geq\log n, n1k0+1n1logn=en^{\frac{1}{k_{0}+1}}\leq n^{\frac{1}{\log n}}=e and thus for each such a k0k_{0},

To summarize this section, we conclude that since both our results assume that mc0μ0nrlognm\geq c_{0}\mu_{0}nr\log n for some sufficiently large numerical constant c0c_{0} (see the discussion at the end of Section 3.1), it now suffices to show that

(say) with probability at least 1n3/41-n^{-3}/4 (say).

4 Centering

We have already normalised PΩ\mathcal{P}_{\Omega} to have “mean zero” in some sense by replacing it with QΩ\mathcal{Q}_{\Omega}. Now we perform a similar operation for the projection PT:XPUX+XPVPUXPV\mathcal{P}_{T}:X\mapsto P_{U}X+XP_{V}-P_{U}XP_{V}. The eigenvalues of PT\mathcal{P}_{T} are centered around

as this follows from the fact that PT\mathcal{P}_{T} is a an orthogonal projection onto a space of dimension 2nrr22nr-r^{2}. Therefore, we simply split PT\mathcal{P}_{T} as

so that the eigenvalues of QT\mathcal{Q}_{T} are centered around zero. From now on, ρ\rho and ρ\rho^{\prime} will always be the numbers defined above.

Let 0<σ<10<\sigma<1. Consider the event such that

Then on this event, we have that for all 0k<k00\leq k<k_{0},

From (3.19) and the geometric series formula we obtain the corollary

Let σ0\sigma_{0} be such that the right-hand side is less than 1/4, say. Applying this with σ=σ0\sigma=\sigma_{0}, we conclude that to prove (3.15) with probability at least 1n3/41-n^{-3}/4, it suffices by the union bound to show that (3.18) for this value of σ\sigma. (Note that the hypothesis 8nr/m<σ3/28nr/m<\sigma^{3/2} follows from the hypotheses in either Theorem 1.1 or Theorem 1.2.)

Lemma 3.3, which is proven in the Appendix, is useful because the operator QT\mathcal{Q}_{T} is easier to work with than PT\mathcal{P}_{T} in the sense that it is more homogeneous, and obeys better estimates. If we split the projections PU,PVP_{U},P_{V} as

Let Ua,a,Vb,b{U}_{a,a^{\prime}},{V}_{b,b^{\prime}} denote the matrix elements of QU,QVQ_{U},Q_{V}:

and similarly for Vb,b{V}_{b,b^{\prime}}. The coefficients cab,abc_{ab,a^{\prime}b^{\prime}} of QT\mathcal{Q}_{T} obey

An immediate consequence of this under the assumptions (1.8), is the estimate

When μ=O(1)\mu=O(1), these coefficients are bounded by O(r/n)O(\sqrt{r}/n) when a=aa=a^{\prime} or b=bb=b^{\prime} while in contrast, if we stayed with PT\mathcal{P}_{T} rather than QT\mathcal{Q}_{T}, the diagonal coefficients would be as large as r/nr/n. However, our lemma states that bounding (QΩQT)kQΩ(E)\|(\mathcal{Q}_{\Omega}\mathcal{Q}_{T})^{k}\mathcal{Q}_{\Omega}(E)\| automatically bounds (QΩPT)kQΩ(E)\|(\mathcal{Q}_{\Omega}\mathcal{P}_{T})^{k}\mathcal{Q}_{\Omega}(E)\| by nearly the same quantity. This is the main advantage of replacing the PT\mathcal{P}_{T} by the QT\mathcal{Q}_{T} in our analysis.

5 Key estimates

To summarize the previous discussion, and in particular the bounds (3.20) and (3.14), we see everything reduces to bounding the spectral norm of (QΩQT)kQΩ(E)(\mathcal{Q}_{\Omega}\mathcal{Q}_{T})^{k}\mathcal{Q}_{\Omega}(E) for k=0,1,,lognk=0,1,\ldots,\lfloor\log n\rfloor. Providing good upper bounds on these quantities is the crux of the argument. We use the moment method, controlling a spectral norm a matrix by the trace of a high power of that matrix. We will prove two moment estimates which ultimately imply our two main results (Theorems 1.1 and 1.2) respectively. The first such estimate is as follows:

Set A=(QΩQT)kQΩ(E)A=(\mathcal{Q}_{\Omega}\mathcal{Q}_{T})^{k}\mathcal{Q}_{\Omega}(E) for a fixed k0k\geq 0. Under the assumptions of Theorem 1.1, we have that for each j>0j>0,

provided that mnrμ2m\geq nr_{\mu}^{2} and nc0j(k+1)n\geq c_{0}j(k+1) for some numerical constant c0c_{0}.

By Markov’s inequality, this result automatically estimates the norm of (QΩQT)kQΩ(E)(\mathcal{Q}_{\Omega}\mathcal{Q}_{T})^{k}\mathcal{Q}_{\Omega}(E) and immediately gives the following corollary.

Under the assumptions of Theorem 1.1, the matrix YY (3.10) is a dual certificate, and obeys PT(Y)1/2\|\mathcal{P}_{T^{\perp}}(Y)\|\leq 1/2 with probability at least 1n31-n^{-3} provided that mm obeys (1.10).

Proof Set A=(QΩQT)kQΩ(E)A=(\mathcal{Q}_{\Omega}\mathcal{Q}_{T})^{k}\mathcal{Q}_{\Omega}(E) with klognk\leq\log n, and set σσ0\sigma\leq\sigma_{0}. By Markov’s inequality

Now choose j>0j>0 to be the smallest integer such that j(k+1)lognj(k+1)\geq\log n. Since

where we have used the fact that n1j(k+1)n1logn=en^{\frac{1}{j(k+1)}}\leq n^{\frac{1}{\log n}}=e. Hence, if

for some numerical constant C0C_{0}, we have γ<1/4\gamma<1/4 and

has probability less or equal to n4lognn3/2n^{-4}\log n\leq n^{-3}/2 for n2n\geq 2. Since the corollary assumes r=O(1)r=O(1), then (3.26) together with (3.20) and (3.14) prove the claim thanks to our choice of σ\sigma.

Of course, Theorem 1.1 follows immediately from Corollary 3.5 and Lemma 3.1. In the same way, our second result (Theorem 1.2) follows from a more refined estimate stated below.

Set A=(QΩQT)kQΩ(E)A=(\mathcal{Q}_{\Omega}\mathcal{Q}_{T})^{k}\mathcal{Q}_{\Omega}(E) for a fixed k0k\geq 0. Under the assumptions of Theorem 1.2, we have that for each j>0j>0 (rμr_{\mu} is given in (3.25)),

provided that nc0j(k+1)n\geq c_{0}j(k+1) for some numerical constant c0c_{0}.

Just as before, this theorem immediately implies the following corollary.

Under the assumptions of Theorem 1.2, the matrix YY (3.10) is a dual certificate, and obeys PT(Y)1/2\|\mathcal{P}_{T^{\perp}}(Y)\|\leq 1/2 with probability at least 1n31-n^{-3} provided that mm obeys (1.12).

The proof is identical to that of Corollary 3.5 and is omitted. Again, Corollary 3.7 and Lemma 3.1 immediately imply Theorem 1.2.

6 Novelty

As explained earlier, this paper derives near-optimal sampling results which are stronger than those in . One of the reasons underlying this improvement is that we use completely different techniques. In details, constructs the dual certificate (3.10) and proceeds by showing that PT(Y)<1\|\mathcal{P}_{T^{\perp}}(Y)\|<1 by bounding each term in the series k0(QΩPT)kQΩ(E)<1\sum_{k\geq 0}\|(\mathcal{Q}_{\Omega}\mathcal{P}_{T})^{k}\mathcal{Q}_{\Omega}(E)\|<1. Further, to prove that the early terms (small values of kk) are appropriately small, the authors employ a sophisticated array of tools from asymptotic geometric analysis, including noncommutative Khintchine inequalities , decoupling techniques of Bourgain and Tzafiri and of de la Peña , and large deviations inequalities . They bound each term individually up to k=4k=4 and use the same argument as that in Section 3.3 to bound the rest of the series. Since the tail starts at k0=5k_{0}=5, this gives that a sufficient condition is that the number of samples exceeds a constant times μ0n6/5nrlogn\mu_{0}n^{6/5}nr\log n. Bounding each term (QΩPT)kQΩ(E)k\|(\mathcal{Q}_{\Omega}\mathcal{P}_{T})^{k}\mathcal{Q}_{\Omega}(E)\|^{k} with the tools put forth in for larger values of kk becomes increasingly delicate because of the coupling between the indicator variables defining the random set Ω\Omega. In addition, the noncommutative Khintchine inequality seems less effective in higher dimensions; that is, for large values of kk. Informally speaking, the reason for this seems to be that the types of random sums that appear in the moments (QΩPT)kQΩ(E)(\mathcal{Q}_{\Omega}\mathcal{P}_{T})^{k}\mathcal{Q}_{\Omega}(E) for large kk involve complicated combinations of the coefficients of PT\mathcal{P}_{T} that are not simply components of some product matrix, and which do not simplify substantially after a direct application of the Khintchine inequality.

In this paper, we use a very different strategy to estimate the spectral norm of (QΩQT)kQΩ(E)(\mathcal{Q}_{\Omega}\mathcal{Q}_{T})^{k}\mathcal{Q}_{\Omega}(E), and employ moment methods, which have a long history in random matrix theory, dating back at least to the classical work of Wigner. We raise the matrix A:=(QΩQT)kQΩ(E)A:=(\mathcal{Q}_{\Omega}\mathcal{Q}_{T})^{k}\mathcal{Q}_{\Omega}(E) to a large power jj so that

(the largest element dominates the sum). We then need to compute the expectation of the right-hand side, and reduce matters to a purely combinatorial question involving the statistics of various types of paths in a plane. It is rather remarkable that carrying out these combinatorial calculations nearly give the quantitatively correct answer; the moment method seems to come close to giving the ultimate limit of performance one can expect from nuclear-norm minimization.

As we shall shortly see, the expression trace(AA)j\operatorname{trace}(A^{*}A)^{j} expands as a sum over “paths” of products of various coefficients of the operators QΩ,QT\mathcal{Q}_{\Omega},\mathcal{Q}_{T} and the matrix EE. These paths can be viewed as complicated variants of Dyck paths. However, it does not seem that one can simply invoke standard moment method calculations in the literature to compute this sum, as in order to obtain efficient bounds, we will need to take full advantage of identities such as PTPT=PT\mathcal{P}_{T}\mathcal{P}_{T}=\mathcal{P}_{T} (which capture certain cancellation properties of the coefficients of PT\mathcal{P}_{T} or QT\mathcal{Q}_{T}) to simplify various components of this sum. It is only after performing such simplifications that one can afford to estimate all the coefficients by absolute values and count paths to conclude the argument.

Moments

Let j0j\geq 0 be a fixed integer. The goal of this section is to develop a formula for

This will clearly be of use in the proofs of the moment bounds (Theorems 3.4, 3.6).

We first write the matrix AA in components as

for some scalars AabA_{ab}, where eabe_{ab} is the standard basis for the n×nn\times n matrices and AabA_{ab} is the (a,b)th(a,b)^{\operatorname{th}} entry of AA. Then

where we adopt the cyclic convention aj+1=a1a_{j+1}=a_{1}. Equivalently, we can write

where the sum is over all ai,μ,bi,μ[n]a_{i,\mu},b_{i,\mu}\in[n] for i[j],μ{0,1}i\in[j],\mu\in\{0,1\} obeying the compatibility conditions

with the cyclic convention aj+1,0=a1,0a_{j+1,0}=a_{1,0}.

Example. If j=2j=2, then we can write trace(AA)j{{\operatorname{trace}}}(A^{*}A)^{j} as

where the sum is over all a1,0,a1,1,a2,0,a2,1,b1,0,b1,1,b2,0,b2,1[n]a_{1,0},a_{1,1},a_{2,0},a_{2,1},b_{1,0},b_{1,1},b_{2,0},b_{2,1}\in[n] obeying the compatibility conditions

Remark. The sum in (4.2) can be viewed as over all closed paths of length 2j2j in [n]×[n][n]\times[n], where the edges of the paths alternate between “horizontal rook moves” and “vertical rook moves” respectively; see Figure 1.

Second, write QT\mathcal{Q}_{T} and QΩ\mathcal{Q}_{\Omega} in coefficients as

where cab,abc_{ab,a^{\prime}b^{\prime}} is given by (3.23), and

where ξab\xi_{ab} are the iid, zero-expectation random variables

for any a0,b0[n]a_{0},b_{0}\in[n]. Note that this formula is even valid in the base case k=0k=0, where it simplifies to just Aa0b0=ξa0b0Ea0b0A_{a_{0}b_{0}}=\xi_{a_{0}b_{0}}E_{a_{0}b_{0}} due to our conventions on trivial sums and empty products.

Remark. One can view the right-hand side of (4.3) as the sum over paths of length k+1k+1 in [n]×[n][n]\times[n] starting at the designated point (a0,b0)(a_{0},b_{0}) and ending at some arbitrary point (ak,bk)(a_{k},b_{k}). Each edge (from (ai,bi)(a_{i},b_{i}) to (ai+1,bi+1)(a_{i+1},b_{i+1})) may be a horizontal or vertical “rook move” (in that at least one of the aa or bb coordinates does not changeUnlike the ordinary rules of chess, we will consider the trivial move when ai+1=aia_{i+1}=a_{i} and bi+1=bib_{i+1}=b_{i} to also qualify as a “rook move”, which is simultaneously a horizontal and a vertical rook move.), or a “non-rook move” in which both the aa and bb coordinates change. It will be important later on to keep track of which edges are rook moves and which ones are not, basically because of the presence of the delta functions 1a=a,1b=b1_{a=a^{\prime}},1_{b=b^{\prime}} in (3.23). Each edge in this path is weighted by a cc factor, and each vertex in the path is weighted by a ξ\xi factor, with the final vertex also weighted by an additional EE factor. It is important to note that the path is allowed to cross itself, in which case weights such as ξ2\xi^{2}, ξ3\xi^{3}, etc. may appear, see Figure 2.

Inserting (4.3) into (4.2), we see that XX can thus be expanded as

where the sum \sum_{*} is over all combinations of ai,μ,l,bi,μ,l[n]a_{i,\mu,l},b_{i,\mu,l}\in[n] for i[j]i\in[j], μ{0,1}\mu\in\{0,1\} and 0lk0\leq l\leq k obeying the compatibility conditions

with the cyclic convention aj+1,0,0=a1,0,0a_{j+1,0,0}=a_{1,0,0}.

Example. Continuing our running example j=k=2j=k=2, we have

where ai,μ,la_{i,\mu,l} for i=1,2i=1,2, μ=0,1\mu=0,1, l=0,1,2l=0,1,2 obey the compatibility conditions

Note that despite the small values of jj and kk, this is already a rather complicated sum, ranging over n2j(2k+1)=n20n^{2j(2k+1)}=n^{20} summands, each of which is the product of 4j(k+1)=244j(k+1)=24 terms.

Remark. The expansion (4.4) is the sum over a sort of combinatorial “spider”, whose “body” is a closed path of length 2j2j in [n]×[n][n]\times[n] of alternating horizontal and vertical rook moves, and whose 2j2j “legs” are paths of length kk, emanating out of each vertex of the body. The various “segments” of the legs (which can be either rook or non-rook moves) acquire a weight of cc, and the “joints” of the legs acquire a weight of ξ\xi, with an additional weight of EE at the tip of each leg. To complicate things further, it is certainly possible for a vertex of one leg to overlap with another vertex from either the same leg or a different leg, introducing weights such as ξ2\xi^{2}, ξ3\xi^{3}, etc.; see Figure 3.

As one can see, the set of possible configurations that this “spider” can be in is rather large and complicated.

2 Second step: collecting rows and columns

We now group the terms in the expansion (4.4) into a bounded number of components, depending on how the various horizontal coordinates ai,μ,la_{i,\mu,l} and vertical coordinates bi,μ,lb_{i,\mu,l} overlap.

It is convenient to order the 2j(k+1)2j(k+1) tuples (i,μ,l)[j]×{0,1}×{0,,k}(i,\mu,l)\in[j]\times\{0,1\}\times\{0,\ldots,k\} lexicographically by declaring (i,μ,l)<(i,μ,l)(i,\mu,l)<(i^{\prime},\mu^{\prime},l^{\prime}) if i<ii<i^{\prime}, or if i=ii=i^{\prime} and μ<μ\mu<\mu^{\prime}, or if i=ii=i^{\prime} and μ=μ\mu=\mu^{\prime} and l<ll<l^{\prime}.

We then define the indices si,μ,l,ti,μ,l{1,2,3,}s_{i,\mu,l},t_{i,\mu,l}\in\{1,2,3,\ldots\} recursively for all (i,μ,l)[j]×{0,1}×[k](i,\mu,l)\in[j]\times\{0,1\}\times[k] by setting s1,0,0=1s_{1,0,0}=1 and declaring si,μ,l:=si,μ,ls_{i,\mu,l}:=s_{i^{\prime},\mu^{\prime},l^{\prime}} if there exists (i,μ,l)<(i,μ,l)(i^{\prime},\mu^{\prime},l^{\prime})<(i,\mu,l) with ai,μ,l=ai,μ,la_{i^{\prime},\mu^{\prime},l^{\prime}}=a_{i,\mu,l}, or equal to the first positive integer not equal to any of the si,μ,ls_{i^{\prime},\mu^{\prime},l^{\prime}} for (i,μ,l)<(i,μ,l)(i^{\prime},\mu^{\prime},l^{\prime})<(i,\mu,l) otherwise. Define ti,μ,lt_{i,\mu,l} using bi,μ,lb_{i,\mu,l} similarly. We observe the cyclic condition

with the cyclic convention sj+1,0,0=s1,0,0s_{j+1,0,0}=s_{1,0,0}.

Example. Suppose that j=2j=2, k=1k=1, and n30n\geq 30, with the (ai,μ,l,bi,μ,l)(a_{i,\mu,l},b_{i,\mu,l}) given in lexicographical ordering as

Observe that the conditions (4.5) hold for this example, which then forces (4.6) to hold also.

In addition to the property (4.6), we see from construction of (s,t)(s,t) that for any (i,μ,l)[j]×{0,1}×{0,,k}(i,\mu,l)\in[j]\times\{0,1\}\times\{0,\ldots,k\}, the sets

are initial segments, i.e. of the form [m][m] for some integer mm. Let us call pairs (s,t)(s,t) of sequences with this property, as well as the property (4.6), admissible; thus for instance the sequences in the above example are admissible. Given an admissible pair (s,t)(s,t), if we define the sets JJ, KK by

then we observe that J=[J],K=[K]J=[|J|],K=[|K|]. Also, if (s,t)(s,t) arose from ai,μ,l,bi,μ,la_{i,\mu,l},b_{i,\mu,l} in the above manner, there exist unique injections α:J[n],β:K[n]\alpha:J\to[n],\beta:K\to[n] such that ai,μ,l=α(si,μ,l)a_{i,\mu,l}=\alpha(s_{i,\mu,l}) and bi,μ,l=β(ti,μ,l)b_{i,\mu,l}=\beta(t_{i,\mu,l}).

Example. Continuing the previous example, we have J=J=, K=K=, with the injections α:[n]\alpha:\to[n] and β:[n]\beta:\to[n] defined by

Conversely, any admissible pair (s,t)(s,t) and injections α,β\alpha,\beta determine ai,μ,la_{i,\mu,l} and bi,μ,lb_{i,\mu,l}. Because of this, we can thus expand XX as

where the outer sum is over all admissible pairs (s,t)(s,t), and the inner sum is over all injections.

Remark. As with preceding identities, the above formula is also valid when k=0k=0 (with our conventions on trivial sums and empty products), in which case it simplifies to

Remark. One can think of (s,t)(s,t) as describing the combinatorial “configuration” of the “spider” ((ai,μ,l,bi,μ,l))(i,μ,l)[j]×{0,1}×{0,,k}((a_{i,\mu,l},b_{i,\mu,l}))_{(i,\mu,l)\in[j]\times\{0,1\}\times\{0,\ldots,k\}} - it determines which vertices of the spider are equal to, or on the same row or column as, other vertices of the spider. The injections α,β\alpha,\beta then enumerate the ways in which such a configuration can be “represented” inside the grid [n]×[n][n]\times[n].

3 Third step: computing the expectation

Applying this estimate and the triangle inequality, we can thus bound XX by

where the sum is over those admissible (s,t)(s,t) such that each element of Ω\Omega is visited at least twice by the sequence (si,μ,l,ti,μ,l)(s_{i,\mu,l},t_{i,\mu,l}); we shall call such (s,t)(s,t) strongly admissible. We will use the bound (4.10) as a starting point for proving the moment estimates (3.25) and (3.27).

Example. The pair (s,t)(s,t) in the Example in Section 4.2 is admissible but not strongly admissible, because not every element of the set Ω\Omega (which, in this example, is {(1,1),(2,2),(3,1),\{(1,1),(2,2),(3,1), (2,3),(3,4),(2,3),(3,4), (1,2),(1,4)}(1,2),(1,4)\}) is visited twice by the (s,t)(s,t).

Remark. Once again, the formula (4.10) is valid when k=0k=0, with the usual conventions on empty products (in particular, the factor involving the cc coefficients can be deleted in this case).

Quadratic bound in the rank

This section establishes (3.25) under the assumptions of Theorem 1.1, which is the easier of the two moment estimates. Here we shall just take the absolute values in (4.10) inside the summation and use the estimates on the coefficients given to us by hypothesis. Indeed, starting with (4.10) and the triangle inequality and applying (1.9) together with (3.23) gives

where we recall that rμ=μ2rr_{\mu}=\mu^{2}r, and QQ is the set of all (i,μ,l)[j]×{0,1}×[k](i,\mu,l)\in[j]\times\{0,1\}\times[k] such that si,μ,l1si,μ,ls_{i,\mu,l-1}\neq s_{i,\mu,l} and ti,μ,l1ti,μ,lt_{i,\mu,l-1}\neq t_{i,\mu,l}. Thinking of the sequence {(si,μ,l,ti,μ,l)}\{(s_{i,\mu,l},t_{i,\mu,l})\} as a path in J×KJ\times K, we have that (i,μ,l)Q(i,\mu,l)\in Q if and only if the move from (si,μ,l1,ti,μ,l1)(s_{i,\mu,l-1},t_{i,\mu,l-1}) to (si,μ,l,ti,μ,l)(s_{i,\mu,l},t_{i,\mu,l}) is neither horizontal nor vertical; per our earlier discussion, this is a “non-rook” move.

Example. The example in Section 4.2 is admissible, but not strongly admissible. Nevertheless, the above definitions can still be applied, and we see that Q={(0,0,1),(0,1,1),(1,0,1),(1,1,1)}Q=\{(0,0,1),(0,1,1),(1,0,1),(1,1,1)\} in this case, because all of the four associated moves are non-rook moves.

As the number of injections α,β\alpha,\beta is at most nJ,nKn^{|J|},n^{|K|} respectively, we thus have

Since (s,t)(s,t) is strongly admissible and every point in Ω\Omega needs to be visited at least twice, we see that

Also, since Q[j]×{0,1}×[k]Q\subset[j]\times\{0,1\}\times[k], we have the trivial bound

From the hypotheses of Theorem 1.1 we have nprμ2np\geq r_{\mu}^{2}, and thus

Remark. In the case where k=0k=0 in which Q=Q=\emptyset, one can easily obtain a better estimate, namely, (if nprμnp\geq r_{\mu})

Call a triple (i,μ,l)(i,\mu,l) recycled if we have si,μ,l=si,μ,ls_{i^{\prime},\mu^{\prime},l^{\prime}}=s_{i,\mu,l} or ti,μ,l=ti,μ,lt_{i^{\prime},\mu^{\prime},l^{\prime}}=t_{i,\mu,l} for some (i,μ,l)<(i,μ,l)(i^{\prime},\mu^{\prime},l^{\prime})<(i,\mu,l), and totally recycled if (si,μ,l,ti,μ,l)=(si,μ,l,ti,μ,l)(s_{i^{\prime},\mu^{\prime},l^{\prime}},t_{i^{\prime},\mu^{\prime},l^{\prime}})=(s_{i,\mu,l},t_{i,\mu,l}) for some (i,μ,l)<(i,μ,l)(i^{\prime},\mu^{\prime},l^{\prime})<(i,\mu,l). Let QQ^{\prime} denote the set of all (i,μ,l)Q(i,\mu,l)\in Q which are recycled.

Example. The example in Section 4.2 is admissible, but not strongly admissible. Nevertheless, the above definitions can still be applied, and we see that the triples

are all recycled (because they either reuse an existing value of ss or tt or both), while the triple (1,1,1)(1,1,1) is totally recycled (it visits the same location as the earlier triple (0,0,1)(0,0,1)). Thus in this case, we have Q={(0,1,1),(1,0,1),(1,1,1)}Q^{\prime}=\{(0,1,1),(1,0,1),(1,1,1)\}.

We observe that if (i,μ,l)[j]×{0,1}×[k](i,\mu,l)\in[j]\times\{0,1\}\times[k] is not recycled, then it must have been reached from (i,μ,l1)(i,\mu,l-1) by a non-rook move, and thus (i,μ,l)(i,\mu,l) lies in QQ.

For any admissible tuple, we have J+KQΩQ+1|J|+|K|-|Q|-|\Omega|\leq-|Q^{\prime}|+1.

Proof We let (i,μ,l)(i,\mu,l) increase from (1,0,0)(1,0,0) to (j,1,k)(j,1,k) and see how each (i,μ,l)(i,\mu,l) influences the quantity J+KQ\QΩ|J|+|K|-|Q\backslash Q^{\prime}|-|\Omega|.

Firstly, we see that the triple (1,0,0)(1,0,0) initialises J,K,Ω=1|J|,|K|,|\Omega|=1 and Q\Q=0|Q\backslash Q^{\prime}|=0, so J+KQ\QΩ=1|J|+|K|-|Q\backslash Q^{\prime}|-|\Omega|=1 at this initial stage. Now we see how each subsequent (i,μ,l)(i,\mu,l) adjusts this quantity.

If (i,μ,l)(i,\mu,l) is totally recycled, then J,K,Ω,Q\QJ,K,\Omega,Q\backslash Q^{\prime} are unchanged by the addition of (i,μ,l)(i,\mu,l), and so J+KQ\QΩ|J|+|K|-|Q\backslash Q^{\prime}|-|\Omega| does not change.

If (i,μ,l)(i,\mu,l) is recycled but not totally recycled, then one of J,KJ,K increases in size by at most one, as does Ω\Omega, but the other set of J,KJ,K remains unchanged, as does Q\QQ\backslash Q^{\prime}, and so J+KQ\QΩ|J|+|K|-|Q\backslash Q^{\prime}|-|\Omega| does not increase.

If (i,μ,l)(i,\mu,l) is not recycled at all, then (by (4.6)) we must have l>0l>0, and then (by definition of Q,QQ,Q^{\prime}) we have (i,μ,l)Q\Q(i,\mu,l)\in Q\backslash Q^{\prime}, and so Q\Q|Q\backslash Q^{\prime}| and Ω|\Omega| both increase by one. Meanwhile, J|J| and K|K| increase by 1, and so J+KQ\QΩ|J|+|K|-|Q\backslash Q^{\prime}|-|\Omega| does not change. Putting all this together we obtain the claim.

Remark. When k=0k=0, we have the better bound

To estimate the above sum, we need to count strongly admissible pairs. This is achieved by the following lemma.

For fixed q0q\geq 0, the number of strongly admissible pairs (s,t)(s,t) with Q=q|Q^{\prime}|=q is at most O(j(k+1))2j(k+1)+qO(j(k+1))^{2j(k+1)+q}.

Proof Firstly observe that once one fixes qq, the number of possible choices for QQ^{\prime} is (2jkq)\binom{2jk}{q}, which we can bound crudely by 22j(k+1)=O(1)2j(k+1)+q2^{2j(k+1)}=O(1)^{2j(k+1)+q}. So we may without loss of generality assume that QQ^{\prime} is fixed. For similar reasons we may assume QQ is fixed.

As with the proof of Lemma 5.1, we increment (i,μ,l)(i,\mu,l) from (1,0,0)(1,0,0) to (j,1,k)(j,1,k) and upper bound how many choices we have available for si,μ,l,ti,μ,ls_{i,\mu,l},t_{i,\mu,l} at each stage.

There are no choices available for s1,0,0,t1,0,0s_{1,0,0},t_{1,0,0}, which must both be one. Now suppose that (i,μ,l)>(1,0,0)(i,\mu,l)>(1,0,0). There are several cases.

If l=0l=0, then by (4.6) one of si,μ,l,ti,μ,ls_{i,\mu,l},t_{i,\mu,l} has no choices available to it, while the other has at most O(j(k+1))O(j(k+1)) choices. If l>0l>0 and (i,μ,l)∉Q(i,\mu,l)\not\in Q, then at least one of si,μ,l,ti,μ,ls_{i,\mu,l},t_{i,\mu,l} is necessarily equal to its predecessor; there are at most two choices available for which index is equal in this fashion, and then there are O(j(k+1))O(j(k+1)) choices for the other index.

If l>0l>0 and (i,μ,l)Q\Q(i,\mu,l)\in Q\backslash Q^{\prime}, then both si,μ,ls_{i,\mu,l} and ti,μ,lt_{i,\mu,l} are new, and are thus equal to the first positive integer not already occupied by si,μ,ls_{i^{\prime},\mu^{\prime},l^{\prime}} or ti,μ,lt_{i^{\prime},\mu^{\prime},l^{\prime}} respectively for (i,μ,l)<(i,μ,l)(i^{\prime},\mu^{\prime},l^{\prime})<(i,\mu,l). So there is only one choice available in this case.

Finally, if (i,μ,l)Q(i,\mu,l)\in Q^{\prime}, then there can be O(j(k+1))O(j(k+1)) choices for both si,μ,ls_{i,\mu,l} and ti,μ,lt_{i,\mu,l}.

Multiplying together all these bounds, we obtain that the number of strongly admissible pairs is bounded by

which proves the claim (here we discard the QQ|Q\setminus Q^{\prime}| factor).

Under the assumption nc0j(k+1)n\geq c_{0}j(k+1) for some numerical constant c0c_{0}, we can sum the series and obtain Theorem 3.4.

Remark. When k=0k=0, we have the better bound

Linear bound in the rank

We now prove the more sophisticated moment estimate (3.27) under the hypotheses of Theorem 1.2. Here, we cannot afford to take absolute values immediately, as in the proof of (3.25), but first must exploit some algebraic cancellation properties in the coefficients cab,abc_{ab,a^{\prime}b^{\prime}}, EabE_{ab} appearing in (4.10) to simplify the sum.

Recall from (3.23) that the coefficients cab,abc_{ab,a^{\prime}b^{\prime}} are defined in terms of the coefficients Ua,a{U}_{a,a^{\prime}}, Vb,b{V}_{b,b^{\prime}} introduced in (3.22). We recall the symmetries Ua,a=Ua,a,Vb,b=Vb,b{U}_{a,a^{\prime}}={U}_{a^{\prime},a},{V}_{b,b^{\prime}}={V}_{b^{\prime},b} and the projection identities

the first identity follows from the matrix identity

after one writes the projection identity PU2=PUP_{U}^{2}=P_{U} in terms of QUQ_{U} using (3.21), and similarly for the second identity.

In a similar vein, we also have the identities

which simply come from QUE=PUEρE=(1ρ)EQ_{U}E=P_{U}E-\rho E=(1-\rho)E together with EQV=EPVρE=(1ρ)EEQ_{V}=EP_{V}-\rho E=(1-\rho)E. Finally, we observe the two equalities

The first identity follows from the fact that bEa,bEa,b\sum_{b}E_{a,b}E_{a^{\prime},b} is the (a,a)th(a,a^{\prime})^{th} element of EE=PU=QU+ρIEE^{*}=P_{U}=Q_{U}+\rho I, and the second one similarly follows from the identity EE=PV=QV+ρIE^{*}E=P_{V}=Q_{V}+\rho I.

2 Reduction to a summand bound

We recall the bound (4.10), and expand out each of the cc coefficients using (3.23) into three terms. To describe the resulting expansion of the sum we need more notation. Define an admissible quadruplet (s,t,LU,LV)(s,t,{\mathcal{L}}_{U},{\mathcal{L}}_{V}) to be an admissible pair (s,t)(s,t), together with two sets LU,LV{\mathcal{L}}_{U},{\mathcal{L}}_{V} with LULV=[j]×{0,1}×[k]{\mathcal{L}}_{U}\cup{\mathcal{L}}_{V}=[j]\times\{0,1\}\times[k], such that si,μ,l1=si,μ,ls_{i,\mu,l-1}=s_{i,\mu,l} whenever (i,μ,l)([j]×{0,1}×[k])\LU(i,\mu,l)\in([j]\times\{0,1\}\times[k])\backslash{\mathcal{L}}_{U}, and ti,μ,l1=ti,μ,lt_{i,\mu,l-1}=t_{i,\mu,l} whenever (i,μ,l)([j]×{0,1}×[k])\LV(i,\mu,l)\in([j]\times\{0,1\}\times[k])\backslash{\mathcal{L}}_{V}. If (s,t)(s,t) is also strongly admissible, we say that (s,t,LU,LV)(s,t,{\mathcal{L}}_{U},{\mathcal{L}}_{V}) is a strongly admissible quadruplet.

The sets LU\LV{\mathcal{L}}_{U}\backslash{\mathcal{L}}_{V}, LV\LU{\mathcal{L}}_{V}\backslash{\mathcal{L}}_{U}, LULV{\mathcal{L}}_{U}\cap{\mathcal{L}}_{V} will correspond to the three terms 1b=bUa,a1_{b=b^{\prime}}{U}_{a,a^{\prime}}, 1a=aVb,b1_{a=a^{\prime}}{V}_{b,b^{\prime}}, Ua,aVb,b{U}_{a,a^{\prime}}{V}_{b,b^{\prime}} appearing in (3.23). With this notation, we expand the product

where the sum is over all partitions as above, and which we can rearrange as

From this and the triangle inequality, we observe the bound

where the sum ranges over all strongly admissible quadruplets, and

Remark. A strongly admissible quadruplet can be viewed as the configuration of a “spider” with several additional constraints. Firstly, the spider must visit each of its vertices at least twice (strong admissibility). When (i,μ,l)[j]×{0,1}×[k](i,\mu,l)\in[j]\times\{0,1\}\times[k] lies out of LU{\mathcal{L}}_{U}, then only horizontal rook moves are allowed when reaching (i,μ,l)(i,\mu,l) from (i,μ,l1)(i,\mu,l-1); similarly, when (i,μ,l)(i,\mu,l) lies out of LV{\mathcal{L}}_{V}, then only vertical rook moves are allowed from (i,μ,l1)(i,\mu,l-1) to (i,μ,l)(i,\mu,l). In particular, non-rook moves are only allowed inside LULV{\mathcal{L}}_{U}\cap{\mathcal{L}}_{V}; in the notation of the previous section, we have QLULVQ\subset{\mathcal{L}}_{U}\cap{\mathcal{L}}_{V}. Note though that while one has the right to execute a non-rook move to LULV{\mathcal{L}}_{U}\cap{\mathcal{L}}_{V}, it is not mandatory; it could still be that (si,μ,l1,ti,μ,l1)(s_{i,\mu,l-1},t_{i,\mu,l-1}) shares a common row or column (or even both) with (si,μ,l,ti,μ,l)(s_{i,\mu,l},t_{i,\mu,l}).

We claim the following fundamental bound on the summand Xs,t,LU,LV|X_{s,t,{\mathcal{L}}_{U},{\mathcal{L}}_{V}}|:

Let (s,t,LU,LV)(s,t,{\mathcal{L}}_{U},{\mathcal{L}}_{V}) be a strongly admissible quadruplet. Then we have

and since Ωj(k+1)|\Omega|\leq j(k+1) (by strong admissibility) and rnpr\leq np, and the number of (s,t,LU,LV)(s,t,{\mathcal{L}}_{U},{\mathcal{L}}_{V}) can be crudely bounded by O(j(k+1))4j(k+1)O(j(k+1))^{4j(k+1)},

This gives (3.27) as desired. The bound on the number of quadruplets follows from the fact that there are at most j(k+1)4j(k+1)j(k+1)^{4j(k+1)} strongly admissible pairs and that the number of (LU,LV)({\mathcal{L}}_{U},{\mathcal{L}}_{V}) per pair is at most O(1)j(k+1)O(1)^{j(k+1)}.

Remark. It seems clear that the exponent 66 can be lowered by a finer analysis, for instance by using counting bounds such as Lemma 5.2. However, substantial effort seems to be required in order to obtain the optimal exponent of 11 here.

3 Proof of Proposition 6.1

To prove the proposition, it is convenient to generalise it by allowing kk to depend on i,μi,\mu. More precisely, define a configuration C=(j,k,J,K,s,t,LU,LV){\mathcal{C}}=(j,k,J,K,s,t,{\mathcal{L}}_{U},{\mathcal{L}}_{V}) to be the following set of data:

An integer j1j\geq 1, and a map k:[j]×{0,1}{0,1,2,}k:[j]\times\{0,1\}\to\{0,1,2,\ldots\}, generating a set Γ:={(i,μ,l):i[j],μ{0,1},0lk(i,μ)}\Gamma:=\{(i,\mu,l):i\in[j],\mu\in\{0,1\},0\leq l\leq k(i,\mu)\};

Finite sets J,KJ,K, and surjective maps s:ΓJs:\Gamma\to J and t:ΓKt:\Gamma\to K obeying (4.6);

Sets LU,LV{\mathcal{L}}_{U},{\mathcal{L}}_{V} such that

and such that si,μ,l1=si,μ,ls_{i,\mu,l-1}=s_{i,\mu,l} whenever (i,μ,l)Γ+\LU(i,\mu,l)\in\Gamma_{+}\backslash{\mathcal{L}}_{U}, and ti,μ,l1=ti,μ,lt_{i,\mu,l-1}=t_{i,\mu,l} whenever (i,μ,l)Γ+\LV(i,\mu,l)\in\Gamma_{+}\backslash{\mathcal{L}}_{V}.

Remark. Note we do not require configurations to be strongly admissible, although for our application to Proposition 6.1 strong admissibility is required. Similarly, we no longer require that the segments (4.7) be initial segments. This removal of hypotheses will give us a convenient amount of flexibility in a certain induction argument that we shall perform shortly. One can think of a configuration as describing a “generalized spider” whose legs are allowed to be of unequal length, but for which certain of the segments (indicated by the sets LU{\mathcal{L}}_{U}, LV{\mathcal{L}}_{V}) are required to be horizontal or vertical. The freedom to extend or shorten the legs of the spider separately will be of importance when we use the identities (6.1), (6.3), (6.4) to simplify the expression Xs,t,LU,LVX_{s,t,{\mathcal{L}}_{U},{\mathcal{L}}_{V}}, see Figure 4.

Given a configuration C{\mathcal{C}}, define the quantity XCX_{\mathcal{C}} by the formula

where α:J[n],β:K[n]\alpha:J\to[n],\beta:K\to[n] range over all injections. To prove Proposition 6.1, it then suffices to show that

for some absolute constant C0>0C_{0}>0, where

since Proposition 6.1 then follows from the special case in which k(i,μ)=kk(i,\mu)=k is constant and (s,t)(s,t) is strongly admissible, in which case we have

To prove the claim (6.6) we will perform strong induction on the quantity J+K|J|+|K|; thus we assume that the claim has already been proven for all configurations with a strictly smaller value of J+K|J|+|K|. (This inductive hypothesis can be vacuous for very small values of J+K|J|+|K|.) Then, for fixed J+K|J|+|K|, we perform strong induction on LULV|{\mathcal{L}}_{U}\cap{\mathcal{L}}_{V}|, assuming that the claim has already been proven for all configurations with the same value of J+K|J|+|K| and a strictly smaller value of LULV|{\mathcal{L}}_{U}\cap{\mathcal{L}}_{V}|.

Remark. Roughly speaking, the inductive hypothesis is asserting that the target estimate (6.6) has already been proven for all generalized spider configurations which are “simpler” than the current configuration, either by using fewer rows and columns, or by using the same number of rows and columns but by having fewer opportunities for non-rook moves.

As we shall shortly see, whenever we invoke the inner induction hypothesis (decreasing LULV|{\mathcal{L}}_{U}\cap{\mathcal{L}}_{V}|, keeping J+K|J|+|K| fixed) we are replacing the expression XCX_{\mathcal{C}} with another expression XCX_{{\mathcal{C}}^{\prime}} covered by this hypothesis; this causes no degradation in the constant. But when we invoke the outer induction hypothesis (decreasing J+K|J|+|K|), we will be splitting up XCX_{\mathcal{C}} into about O(1+J+K)O(1+|J|+|K|) terms XCX_{{\mathcal{C}}^{\prime}}, each of which is covered by this hypothesis; this causes a degradation of O(1+J+K)O(1+|J|+|K|) in the constants and is thus responsible for the loss of (C0(1+J+K))J+K(C_{0}(1+|J|+|K|))^{|J|+|K|} in (6.6).

For future reference we observe that we may take rμnr_{\mu}\leq n, as the hypotheses of Theorem 1.1 are vacuous otherwise (mm cannot exceed n2n^{2}).

To prove (6.6) we divide into several cases.

Suppose first that LULV{\mathcal{L}}_{U}\cap{\mathcal{L}}_{V} contains an element (i0,μ0,l0)(i_{0},\mu_{0},l_{0}) with the property that

Note that this forces the edge from (si0,μ0,l01,ti0,μ0,l01)(s_{i_{0},\mu_{0},l_{0}-1},t_{i_{0},\mu_{0},l_{0}-1}) to (si0,μ0,l0,ti0,μ0,l0)(s_{i_{0},\mu_{0},l_{0}},t_{i_{0},\mu_{0},l_{0}}) to be partially “unguarded” in the sense that one of the opposite vertices of the rectangle that this edge is inscribed in is not visited by the (s,t)(s,t) pair.

When we have such an unguarded non-rook move, we can “erase” the element (i0,μ0,l0)(i_{0},\mu_{0},l_{0}) from LULV{\mathcal{L}}_{U}\cap{\mathcal{L}}_{V} by replacing C=(j,k,J,K,s,t,LU,LV){\mathcal{C}}=(j,k,J,K,s,t,{\mathcal{L}}_{U},{\mathcal{L}}_{V}) by the “stretched” variant C=(j,k,J,K,s,{\mathcal{C}}^{\prime}=(j^{\prime},k^{\prime},J^{\prime},K^{\prime},s^{\prime}, t,LU,LV)t^{\prime},{\mathcal{L}}^{\prime}_{U},{\mathcal{L}}^{\prime}_{V}), defined as follows:

j:=jj^{\prime}:=j, J:=JJ^{\prime}:=J, and K:=KK^{\prime}:=K.

k(i,μ):=k(i,μ)k^{\prime}(i,\mu):=k(i,\mu) for (i,μ)(i0,μ0)(i,\mu)\neq(i_{0},\mu_{0}), and k(i0,μ0):=k(i0,μ0)+1k^{\prime}(i_{0},\mu_{0}):=k(i_{0},\mu_{0})+1.

(si,μ,l,ti,μ,l):=(si,μ,l,ti,μ,l)(s^{\prime}_{i,\mu,l},t^{\prime}_{i,\mu,l}):=(s_{i,\mu,l},t_{i,\mu,l}) whenever (i,μ)(i0,μ0)(i,\mu)\neq(i_{0},\mu_{0}), or when (i,μ)=(i0,μ0)(i,\mu)=(i_{0},\mu_{0}) and l<l0l<l_{0}.

(si,μ,l,ti,μ,l):=(si,μ,l1,ti,μ,l1)(s^{\prime}_{i,\mu,l},t^{\prime}_{i,\mu,l}):=(s_{i,\mu,l-1},t_{i,\mu,l-1}) whenever (i,μ)=(i0,μ0)(i,\mu)=(i_{0},\mu_{0}) and l>l0l>l_{0}.

(si0,μ0,l0,ti0,μ0,l0):=(si0,μ0,l01,ti0,μ0,l0)(s^{\prime}_{i_{0},\mu_{0},l_{0}},t^{\prime}_{i_{0},\mu_{0},l_{0}}):=(s_{i_{0},\mu_{0},l_{0}-1},t_{i_{0},\mu_{0},l_{0}}).

One can check that C{\mathcal{C}}^{\prime} is still a configuration, and XCX_{{\mathcal{C}}^{\prime}} is exactly equal to XCX_{\mathcal{C}}; informally what has happened here is that a single “non-rook” move (which contributed both a Ua,a{U}_{a,a^{\prime}} factor and a Vb,b{V}_{b,b^{\prime}} factor to the summand in XCX_{\mathcal{C}}) has been replaced with an equivalent pair of two rook moves (one of which contributes the Ua,a{U}_{a,a^{\prime}} factor, and the other contributes the Vb,b{V}_{b,b^{\prime}} factor).

Observe that, Γ=Γ+1|\Gamma^{\prime}|=|\Gamma|+1 and Ω=Ω+1|\Omega^{\prime}|=|\Omega|+1 (here we use the non-guarded hypothesis (6.7)), while J+K=J+K|J^{\prime}|+|K^{\prime}|=|J|+|K| and LULV=LULV1|{\mathcal{L}}^{\prime}_{U}\cap{\mathcal{L}}^{\prime}_{V}|=|{\mathcal{L}}_{U}\cap{\mathcal{L}}_{V}|-1. Thus in this case we see that the claim follows from the (second) induction hypothesis. We may thus eliminate this case and assume that

3.2 Second case: a low multiplicity row or column, no unguarded non-rook moves

Next, given any xJ{x}\in J, define the row multiplicity τx\tau_{x} to be

and similarly for any yK{y}\in K, define the column multiplicity τy\tau^{y} to be

Remark. Informally, τx\tau_{x} measures the number of times α(x)\alpha({x}) appears in (6.5), and similarly for τy\tau^{y} and β(y)\beta({y}). Alternatively, one can think of τx\tau_{x} as counting the number of times the spider has the opportunity to “enter” and “exit” the row s=xs={x}, and similarly τy\tau^{y} measures the number of opportunities to enter or exit the column t=yt={y}.

By surjectivity we know that τx,τy\tau_{x},\tau^{y} are strictly positive for each xJ{x}\in J, yK{y}\in K. We also observe that τx,τy\tau_{x},\tau^{y} must be even. To see this, write

Now observe that if (i,μ,l)Γ+\LU(i,\mu,l)\in\Gamma_{+}\backslash{\mathcal{L}}_{U}, then 1s(i,μ,l)=x=1s(i,μ,l1)=x1_{s(i,\mu,l)={x}}=1_{s(i,\mu,l-1)={x}}. Thus we have

and the right-hand side vanishes by (4.6), showing that τx\tau_{x} is even, and similarly τy\tau^{y} is even.

In this subsection, we dispose of the case of a low-multiplicity row, or more precisely when τx=2\tau_{x}=2 for some xJ{x}\in J. By symmetry, the argument will also dispose of the case of a low-multiplicity column, when τy=2\tau^{y}=2 for some yK{y}\in K.

Suppose that τx=2\tau_{x}=2 for some xJ{x}\in J. We first remark that this implies that there does not exist (i,μ,l)LU(i,\mu,l)\in{\mathcal{L}}_{U} with s(i,μ,l)=s(i,μ,l1)=xs(i,\mu,l)=s(i,\mu,l-1)={x}. We argue by contradiction and define ll^{\star} to be the first integer larger than ll for which (i,μ,l)LU(i,\mu,l^{\star})\in{\mathcal{L}}_{U}. First, suppose that ll^{\star} does not exist (which, for instance, happens when l=k(i,μ)l=k(i,\mu)). Then in this case it is not hard to see that s(i,μ,k(i,μ))=xs(i,\mu,k(i,\mu))={x} since for (i,μ,l)LU(i,\mu,l^{\prime})\notin{\mathcal{L}}_{U}, we have s(i,μ,l)=s(i,μ,l1)s(i,\mu,l^{\prime})=s(i,\mu,l^{\prime}-1). In this case, τx\tau_{x} exceeds 22. Else, ll^{\star} does exist but then s(i,μ,l1)=xs(i,\mu,l^{\star}-1)={x} since s(i,μ,l)=s(i,μ,l1)s(i,\mu,l^{\prime})=s(i,\mu,l^{\prime}-1) for l<l<ll<l^{\prime}<l^{\star}. Again, τx\tau_{x} exceeds 22 and this is a contradiction. Thus, if (i,μ,l)LU(i,\mu,l)\in{\mathcal{L}}_{U} and s(i,μ,l)=xs(i,\mu,l)={x}, then s(i,μ,l1)xs(i,\mu,l-1)\neq{x}, and similarly if (i,μ,l)LU(i,\mu,l)\in{\mathcal{L}}_{U} and s(i,μ,l1)=xs(i,\mu,l-1)={x}, then s(i,μ,l)xs(i,\mu,l)\neq{x}.

Now let us look at the terms in (6.5) which involve α(x)\alpha({x}). Since τx=2\tau_{x}=2, there are only two such terms, and each of the terms are either of the form Uα(x),α(x){U}_{\alpha({x}),\alpha({x}^{\prime})} or Eα(x),β(y)E_{\alpha({x}),\beta({y})} for some yK{y}\in K or xJ\{x}{x}^{\prime}\in J\backslash\{{x}\}. We now have to divide into three subcases.

Subcase 1: (6.5) contains two terms Uα(x),α(x){U}_{\alpha({x}),\alpha({x}^{\prime})}, Uα(x),α(x){U}_{\alpha({x}),\alpha({x}^{\prime\prime})}. Figure 6(a) for a typical configuration in which this is the case.

We now isolate the two terms Uα(x),α(x){U}_{\alpha({x}),\alpha({x}^{\prime})}, Uα(x),α(x){U}_{\alpha({x}),\alpha({x}^{\prime\prime})} from the rest of (6.5), expressing this sum as

Finally, we consider the contribution of the term ρ1x=x\rho 1_{{x}^{\prime}={x}^{\prime\prime}} of (6.10), which of course is only non-trivial when x=x{x}^{\prime}={x}^{\prime\prime}. This contribution is equal to ρXC\rho X_{{\mathcal{C}}^{\prime\prime\prime}}, where C{\mathcal{C}}^{\prime\prime\prime} is formed from C{\mathcal{C}} by deleting x{x} from JJ, replacing every occurrence of x{x} in the range of α\alpha with x=x{x}^{\prime}={x}^{\prime\prime}, and also deleting the two elements (i0,μ0,l0)(i_{0},\mu_{0},l_{0}), (i1,μ1,l1)(i_{1},\mu_{1},l_{1}) of LU{\mathcal{L}}_{U} from Γ+\Gamma_{+} that gave rise to the factors Uα(x),α(x){U}_{\alpha({x}),\alpha({x}^{\prime})}, Uα(x),α(x){U}_{\alpha({x}),\alpha({x}^{\prime\prime})} in (6.5), unless these elements also lie in LV{\mathcal{L}}_{V}, in which case one deletes them just from LU{\mathcal{L}}_{U} but leaves them in LV{\mathcal{L}}_{V} and Γ+\Gamma_{+}; one also decrements the labels of any subsequent (i0,μ0,l)(i_{0},\mu_{0},l), (i1,μ1,l)(i_{1},\mu_{1},l) accordingly (see Figure 9). One observes that ΓΩΓΩ1|\Gamma^{\prime\prime\prime}|-|\Omega^{\prime\prime\prime}|\geq|\Gamma|-|\Omega|-1, J+K<J+K|J^{\prime\prime\prime}|+|K^{\prime\prime\prime}|<|J|+|K|, and J+K+LULV<J+K+LULV|J^{\prime\prime\prime}|+|K^{\prime\prime\prime}|+|{\mathcal{L}}^{\prime\prime\prime}_{U}\cap{\mathcal{L}}^{\prime\prime\prime}_{V}|<|J|+|K|+|{\mathcal{L}}_{U}\cap{\mathcal{L}}_{V}|, and so this term also is controlled by the induction hypothesis. (Note we need to use the additional ρ\rho factor (which is less than rμ/nr_{\mu}/n) in order to make up for a possible decrease in ΓΩ|\Gamma|-|\Omega| by 11.)

This deals with the case when there are two U{U} terms involving α(x)\alpha({x}).

Subcase 2: (6.5) contains a term Uα(x),α(x){U}_{\alpha({x}),\alpha({x}^{\prime})} and a term Eα(x),β(y)E_{\alpha({x}),\beta({y})}.

A typical case here is depicted in Figure 10.

Subcase 3: (6.5) contains two terms Eα(x),β(y)E_{\alpha({x}),\beta({y})}, Eα(x),β(y)E_{\alpha({x}),\beta({y}^{\prime})}.

A typical case here is depicted in 11. The strategy here is similar to that in the previous two subcases, but now one uses (6.4) rather than (6.1). The combinatorics of the situation are, however, slightly different.

By considering the path from Eα(x),β(y)E_{\alpha({x}),\beta({y})} to Eα(x),β(y)E_{\alpha({x}),\beta({y}^{\prime})} along the spider, we see (from the hypothesis τx=2\tau_{x}=2) that this path must be completely horizontal (with no elements of LU{\mathcal{L}}_{U} present), and the two legs of the spider that give rise to Eα(x),β(y)E_{\alpha({x}),\beta({y})}, Eα(x),β(y)E_{\alpha({x}),\beta({y}^{\prime})} at their tips must be adjacent, with their bases connected by a horizontal line segment. In other words, up to interchange of y{y} and y{y}^{\prime}, and cyclic permutation of the [j][j] indices, we may assume that

for all 0lk(1,1)0\leq l\leq k(1,1) and 0lk(2,0)0\leq l^{\prime}\leq k(2,0), where the index 22 is understood to be identified with 11 in the degenerate case j=1j=1. Also, LU{\mathcal{L}}_{U} cannot contain any triple of the form (1,1,l)(1,1,l) for l[k(1,1)]l\in[k(1,1)] or (2,0,l)(2,0,l^{\prime}) for l[k(2,0)]l^{\prime}\in[k(2,0)] (and so all these triples lie in LV{\mathcal{L}}_{V} instead).

For technical reasons we need to deal with the degenerate case j=1j=1 separately. In this case, ss is identically equal to x{x}, and so (6.5) simplifies to

The number of possible β\beta is at most nKn^{|K|}, so to establish (6.6) in this case it suffices to show that

Observe that in this degenerate case j=1j=1, we have Ω=K|\Omega|=|K| and Γ=k(1,0)+k(1,1)+2|\Gamma|=k(1,0)+k(1,1)+2. One then checks that the claim is true when rμ=1r_{\mu}=1, so it suffices to check that the other extreme case rμ=nr_{\mu}=n, i.e.

But as τy4\tau^{y}\geq 4 for all kk, every element in KK must be visited at least twice, and the claim follows.

The final terms are treated here in exactly the same way as the final terms in (6.10) or (6.11). Now we consider the main term Vβ(y),β(y){V}_{\beta({y}),\beta({y}^{\prime})}. The contribution of this term will be of the form XCX_{{\mathcal{C}}^{\prime}}, where the configuration C{\mathcal{C}}^{\prime} is formed from C{\mathcal{C}} by “detaching” the two legs (i,μ)=(1,1),(2,0)(i,\mu)=(1,1),(2,0) from the spider, “gluing them together” at the tips using the Vβ(y),β(y){V}_{\beta({y}),\beta({y}^{\prime})} term, and then “inserting” those two legs into the base of the (i,μ)=(1,0)(i,\mu)=(1,0) leg. To explain this procedure more formally, observe that the \ldots term in (6.12) can be expanded further (isolating out the terms coming from (i,μ)=(1,1),(2,0)(i,\mu)=(1,1),(2,0)) as

where the \ldots now denote all the terms that do not come from (i,μ)=(1,1)(i,\mu)=(1,1) or (i,μ)=(2,0)(i,\mu)=(2,0), and we have reversed the order of the second product for reasons that will be clearer later. Recalling that y=t(1,1,k(1,1)){y}=t(1,1,k(1,1)) and y=t(2,0,k(2,0)){y}^{\prime}=t(2,0,k(2,0)), we see that the contribution of the first term of (6.13) to (6.12) is now of the form

But this expression is simply XCX_{{\mathcal{C}}^{\prime}}, where the configuration of C{\mathcal{C}}^{\prime} is formed from C{\mathcal{C}} in the following fashion:

k(1,0):=k(2,0)+1+k(1,1)+k(1,0)k^{\prime}(1,0):=k(2,0)+1+k(1,1)+k(1,0), and k(i,μ):=k(i+1,μ)k^{\prime}(i,\mu):=k(i+1,\mu) for (i,μ)(1,0)(i,\mu)\neq(1,0).

The path {(s(1,0,l),t(1,0,l)):l=0,,k(1,0)}\{(s^{\prime}(1,0,l),t^{\prime}(1,0,l)):l=0,\ldots,k^{\prime}(1,0)\} is formed by concatenating the path {(s(1,0,0),t(2,0,l)):l=0,,k(2,0)}\{(s(1,0,0),t(2,0,l)):l=0,\ldots,k(2,0)\}, with an edge from (s(1,0,0),t(2,0,k(2,0)))(s(1,0,0),t(2,0,k(2,0))) to (s(1,0,0),t(1,1,k(1,1)))(s(1,0,0),t(1,1,k(1,1))), with the path {(s(1,0,0),t(1,1,l)):l=k(1,1),,0}\{(s(1,0,0),t(1,1,l)):l=k(1,1),\ldots,0\}, with the path {(s(1,0,l),t(1,0,l)):l=0,,k(1,0)}\{(s(1,0,l),t(1,0,l)):l=0,\ldots,k(1,0)\}.

For any (i,μ)(i,0)(i,\mu)\neq(i,0), the path {(s(i,μ,l),t(i,μ,l)):l=0,k(i,μ)}\{(s^{\prime}(i,\mu,l),t^{\prime}(i,\mu,l)):l=0,\ldots k^{\prime}(i,\mu)\} is equal to the path {(s(i,μ,l),t(i+1,μ,l)):l=0,,k(i+1,μ)}\{(s(i,\mu,l),t(i+1,\mu,l)):l=0,\ldots,k(i+1,\mu)\}.

This construction is represented in Figure 12.

One can check that this is indeed a configuration. One has J+K<J+K|J^{\prime}|+|K^{\prime}|<|J|+|K|, Γ=Γ1|\Gamma^{\prime}|=|\Gamma|-1, and ΩΩ1|\Omega^{\prime}|\leq|\Omega|-1, and so this contribution to (6.6) is acceptable from the (first) induction hypothesis.

This handles the contribution of the Vβ(y),β(y){V}_{\beta({y}),\beta({y}^{\prime})} term. The ρ1y=y\rho 1_{{y}={y}^{\prime}} term is treated similarly, except that there is no edge between the points (s(1,0,0),t(2,0,k(2,0)))(s(1,0,0),t(2,0,k(2,0))) and (s(1,0,0),t(1,1,k(1,1)))(s(1,0,0),t(1,1,k(1,1))) (which are now equal, since y=y{y}={y}^{\prime}). This reduces the analogue of Γ|\Gamma^{\prime}| to Γ2|\Gamma|-2, but the additional factor of ρ\rho (which is at most rμ/nr_{\mu}/n) compensates for this. We omit the details. This concludes the treatment of the third subcase.

3.3 Third case: High multiplicity rows and columns

After eliminating all of the previous cases, we may now may assume (since τx\tau_{x} is even) that

We have now made the maximum use we can of the cancellation identities (6.1), (6.3), (6.4), and have no further use for them. Instead, we shall now place absolute values everywhere and estimate XCX_{\mathcal{C}} using (1.9), (1.8a), (1.8b), obtaining the bound

Comparing this with (6.6), we see that it will suffice (by taking C0C_{0} large enough) to show that

Using the extreme cases rμ=1r_{\mu}=1 and rμ=nr_{\mu}=n as test cases, we see that our task is to show that

The first inequality (6.16) is proven by Lemma 5.1. The second is a consequence of the double counting identity

where the inequality follows from (6.14)–(6.15) (and we don’t even need the +1+1 in this case).

Discussion

Interestingly, there is an emerging literature on the development of efficient algorithms for solving the nuclear-norm minimization problem (1.3) . For instance, in , the authors show that the singular-value thresholding algorithm can solve certain problem instances in which the matrix has close to a billion unknown entries in a matter of minutes on a personal computer. Hence, the near-optimal sampling results introduced in this paper are practical and, therefore, should be of consequence to practitioners interested in recovering low-rank matrices from just a few entries.

To be broadly applicable, however, the matrix completion problem needs to be robust vis a vis noise. That is, if one is given a few entries of a low-rank matrix contaminated with a small amount of noise, one would like to be able to guess the missing entries, perhaps not exactly, but accurately. We actually believe that the methods and results developed in this paper are amenable to the study of “the noisy matrix completion problem” and hope to report on our progress in a later paper.

Appendix

For the sake of completeness, we explain how Theorem 1.7 implies nearly identical results for the uniform model. We have established the lower bound by showing that there are two fixed matrices MMM\neq M^{\prime} for which PΩ(M)=PΩ(M)\mathcal{P}_{\Omega}(M)=\mathcal{P}_{\Omega}(M^{\prime}) with probability greater than δ\delta unless mm obeys the bound (1.20). Suppose that Ω\Omega is sampled according to the Bernoulli model with pm/n2p^{\prime}\geq m/n^{2} and let FF be the event {PΩ(M)=PΩ(M)}\{\mathcal{P}_{\Omega}(M)=\mathcal{P}_{\Omega}(M^{\prime})\}. Then

In short, we get a lower bound for the uniform model by applying the bound for the Bernoulli model with a value of p=2m2/np=2m^{2}/n and a probability of failure equal to 2δ2\delta.

1.2 Upper bounds

We prove the claim stated at the onset of Section 3 which states that the probability of failure under the uniform model is at most twice that under the Bernoulli model. Let FF be the event that the recovery via (1.3) is not exact. With our earlier notations,

2 Proof of Lemma 3.3

In this section, we will make frequent use of (3.13) and of the similar identity

which is obtained by squaring both sides of (3.17) together with PT2=PT\mathcal{P}_{T}^{2}=\mathcal{P}_{T}. We begin with two lemmas.

where starting from α0(0)=1\alpha^{(0)}_{0}=1, the sequences {α(k)}\{\alpha^{(k)}\}, {β(k)}\{\beta^{(k)}\}, {γ(k)}\{\gamma^{(k)}\} and {δ(k)}\{\delta^{(k)}\} are inductively defined via

In the above recurrence relations, we adopt the convention that αj(k)=0\alpha_{j}^{(k)}=0 whenever jj is not in the range specified by (8.2), and similarly for βj(k)\beta_{j}^{(k)}, γj(k)\gamma_{j}^{(k)} and δj(k)\delta_{j}^{(k)}.

Proof The proof operates by induction. The claim for k=0k=0 is straightforward. To compute the coefficient sequences of (QΩPT)k+1QΩ(\mathcal{Q}_{\Omega}\mathcal{P}_{T})^{k+1}\mathcal{Q}_{\Omega} from those of (QΩPT)kQΩ(\mathcal{Q}_{\Omega}\mathcal{P}_{T})^{k}\mathcal{Q}_{\Omega}, use the identity PT=QT+ρI\mathcal{P}_{T}=\mathcal{Q}_{T}+\rho^{\prime}\mathcal{I} to decompose (QΩPT)k+1QΩ(\mathcal{Q}_{\Omega}\mathcal{P}_{T})^{k+1}\mathcal{Q}_{\Omega} as follows:

Then expanding (QΩPT)kQΩ(\mathcal{Q}_{\Omega}\mathcal{P}_{T})^{k}\mathcal{Q}_{\Omega} as in (8.2), and using the two identities

which both follow from (3.13), gives the desired recurrence relation. The calculation is rather straightforward and omitted.

We note that the recurrence relations give αk(k)=1\alpha^{(k)}_{k}=1 for all k0k\geq 0,

for all k2k\geq 2 and k3k\geq 3 respectively.

Put λ=ρ/p\lambda=\rho^{\prime}/p and observe that by assumption (1.22), λ<1\lambda<1. Then for all j,k0j,k\geq 0, we have

Proof We prove the lemma by induction on kk. The claim is true for k=0k=0. Suppose it is true up to kk, we then use the recurrence relations given by Lemma 8.1 to establish the claim up to k+1k+1. In details, since 1ρ<1|1-\rho^{\prime}|<1, ρ<λ\rho^{\prime}<\lambda and 12p<1|1-2p|<1, the recurrence relation for α(k+1)\alpha^{(k+1)} gives

which proves the claim for the sequence {α(k)}\{\alpha^{(k)}\}. We bound βj(k+1)|\beta_{j}^{(k+1)}| in exactly the same way and omit the details. Now the recurrence relation for γ(k+1)\gamma^{(k+1)} gives

which proves the claim for the sequence {γ(k)}\{\gamma^{(k)}\}. The quantity δj(k+1)|\delta_{j}^{(k+1)}| is bounded in exactly the same way, which concludes the proof of the lemma.

We are now well positioned to prove Lemma 3.3 and begin by recording a useful fact. Since for any XX, PT(X)X\|\mathcal{P}_{T^{\perp}}(X)\|\leq\|X\|, and

the triangular inequality gives that for all XX,

For j=0j=0, we have (QΩQT)j(E)=E=1\|(\mathcal{Q}_{\Omega}\mathcal{Q}_{T})^{j}(E)\|=\|E\|=1 while for j>0j>0

since QT(E)=(1ρ)(E)\mathcal{Q}_{T}(E)=(1-\rho^{\prime})(E). By using the size estimates given by Lemma 8.2 on the coefficients, we have

where the last inequality holds provided that 4λσ3/24\lambda\leq\sigma^{3/2}. The conclusion is

Acknowledgements

E. C. is supported by ONR grants N00014-09-1-0469 and N00014-08-1-0749 and by the Waterman Award from NSF. E. C. would like to thank Xiaodong Li and Chiara Sabatti for helpful conversations related to this project. T. T. is supported by a grant from the MacArthur Foundation, by NSF grant DMS-0649473, and by the NSF Waterman award.

References