Nonconvex Low-Rank Tensor Completion from Noisy Data

Changxiao Cai, Gen Li, H. Vincent Poor, Yuxin Chen

Introduction and motivation

Estimation of low-complexity models from highly incomplete observations is a fundamental task that spans a diverse array of science and engineering applications. Arguably one of the most extensively studied problems of this kind is matrix completion, where one wishes to recover a low-rank matrix given only partial entries [DR16, CC18]. Moving beyond matrix-type data, a natural higher-order generalization is low-rank tensor completion, which aims to reconstruct a low-rank tensor when the vast majority of its entries are unseen. There is certainly no shortage of applications that motivate the investigation of tensor completion (e.g. personalized medicine [SN16, Paw19], medical imaging [GRY11, SHKM14, CZA+17], seismic data analysis [KSS13, EAHK13], multi-dimensional harmonic retrieval [CC14, YLW+17]). One concrete example in operations research arises when learning the preference of individual customers for a collection of products on the basis of historical transactions [FL19, MP20]. Given the limited availability of transaction data (e.g. each customer might only have purchased very few products before), it is crucial to exploit multi-way customer-product interactions (e.g. users’ browsing and searching histories) in order to better predict the likelihood of a customer purchasing a new product. Clearly, the presence of missing data and the need of exploiting multi-way structure result in the task of tensor completion. Additionally, tensor completion finds important applications in visual data in-painting [LMWY13, LYX17], where one wishes to reconstruct video data (or a sequence of images) from incomplete measurements. The video data consist of at least two spatial variables and one temporal variable, whose intrinsic connections are often modeled via certain low-complexity tensors.

2 Computational and statistical challenges

Even though tensor completion conceptually resembles matrix completion in various ways, it is considerably more challenging than the matrix counterpart. This is perhaps not surprising, given that a plethora of natural tensor problems (e.g. computing the spectral norm, finding the best low-rank approximation) are all notoriously hard [HL13]. As a notable example, while matrix completion is often efficiently solvable under nearly minimal sample complexity [CR09, Gro11], all polynomial-time algorithms developed so far for tensor completion — even in the noise-free case — require a sample size at least exceeding the order of rd3/2rd^{3/2}, which is substantially larger than the degrees of freedom (i.e. rdrd) underlying the model (2). In fact, it is widely conjectured that there exists a large computational barrier away from the information-theoretic sampling limits [BM16].

With this fundamental gap in mind, the current paper focuses on the regime (in terms of the sample size) that enables reliable tensor completion in polynomial time. A variety of algorithms have been proposed that enjoy some sort of theoretical guarantees in (at least part of) this regime, including but not limited to spectral methods [MS18, CLC+20], sum-of-squares hierarchy [BM16, PS17], nonconvex algorithms [JO14, XY17], and also convex relaxation (based on proper unfolding) [GRY11, HMGW15, RPP13, GQ14]. While these are all polynomial-time algorithms, most of the computational complexities supported by prior theory remain prohibitively high when dealing with large-scale tensor data — a point that we shall elaborate on later. The only exception is the unfolding-based spectral method, which, however, fails to achieve exact recovery as the noise vanishes. This leads to a critical question:

Q1: Is there any linear-time algorithm that is guaranteed to work for low-rank tensor completion?

Going beyond such computational concerns, one might naturally wonder whether it is also possible for a fast algorithm to achieve a nearly un-improvable statistical accuracy in the presence of noise. Towards this end, intriguing stability guarantees have been established for sum-of-squares hierarchy in the noisy settings [BM16], although this paradigm is computationally expensive for large-scale data. The recent work [XYZ17] came up with a two-stage algorithm (i.e. a spectral method followed by tensor power iterations) for noisy tensor completion. Its estimation accuracy, however, falls short of achieving exact recovery in the absence of noise. This gives rise to another question of fundamental importance:

Q2: Can we achieve near-optimal statistical accuracy without compromising computational efficiency?

In this paper, we aim to address the above two questions by developing a nonconvex algorithm that achieves optimal computational efficiency and statistical accuracy all at once.

Algorithm and main results

To address the above-mentioned challenges, a first impulse is to resort to the following least squares problem:

or more concisely (up to proper re-scaling),

Fortunately, not all nonconvex problems are as daunting to solve as they may seem. For example, recent years have seen a flurry of activity in low-rank matrix factorization via nonconvex optimization, which provably achieves optimal statistical accuracy and computational efficiency at once; see [CLC19] for an overview of recent advances. Motivated by this strand of work, we propose to solve (4) via a two-stage nonconvex paradigm, presented below in reverse order. The whole procedure is summarized in Algorithms 1-3.

Arguably one of the simplest optimization algorithms is gradient descent, which adopts a gradient update rule

Despite the simplicity of this algorithm, two critical issues stand out and might significantly affect its efficiency, which we shall bear in mind throughout the algorithmic and theoretical development.

(i) Local stationary points and initialization. As is well known, GD is guaranteed to find an approximate local stationary point, provided that the learning rates do not exceed the inverse Lipschitz constant of the gradient [Bub15]. There exist, however, local stationary points (e.g. saddle points or spurious local minima) that might fall short of the desired statistical properties. This requires us to properly avoid such undesired points, while retaining computational efficiency. To address this issue, one strategy is to first identify a rough initial guess within a local region surrounding the global solution (which often helps rule out bad local minima), in order to guarantee proper convergence of subsequent optimization procedures [LT17, JO14]. As a side remark, while careful initialization might not be crucial for several matrix recovery cases [CCFM19, GBW18, TV19], it does seem to be critical in various tensor problems [RM14]. We shall elucidate this point in Section 3.3.

(ii) Learning rates and regularization. Learning rates play a pivotal role in determining the convergence properties of GD. The challenge, however, is that the loss function (4) is overall not sufficiently smooth (i.e. its gradient often has an exceedingly large Lipschitz constant), and hence generic optimization theory recommends a pessimistically slow update rule (i.e. an extremely small learning rate) so as to guard against over-shooting. This, however, slows down the algorithm significantly, thus destroying the main computational advantage of GD (i.e. low per-iteration cost). With this issue in mind, prior literature suggests carefully designed regularization steps (e.g. proper projection, regularized loss functions) in order to improve the geometry of the optimization landscape [XY17]. In contrast, we argue that one is allowed to take a constant learning rate — which is as aggressive as it can possibly be — even without enforcing any regularization procedures.

Motivated by the above-mentioned issue (i), we develop a procedure that guarantees a reasonable initial estimate. In a nutshell, the proposed procedure consists of two steps:

Estimate the subspace spanned by the rr low-rank tensor factors {ui}1ir\{\bm{u}_{i}^{\star}\}_{1\leq i\leq r} via a spectral method;

Disentangle individual low-rank tensor factors from this subspace estimate.

As we shall see momentarily, the total computational complexity of the proposed initialization is O(pd3)O(pd^{3}) when r=O(1)r=O(1), κ=O(1)\kappa=O(1) and p1/d2p\geq 1/d^{2} (where κ\kappa is a sort of “condition number” defined later), which is a linear-time algorithm. Note, however, that these two steps in the initialization procedure are relatively more complicated to describe. To improve the flow of the current paper, we postpone the details to Section 3. The readers can catch a glimpse of these procedures in Algorithms 2-3.

2 Main results

Encouragingly, the proposed nonconvex algorithm provably achieves the best of both worlds — in terms of statistical accuracy and computational efficiency — for a class of low-rank, well-conditioned, and “incoherent” problem instances. This subsection summarizes our main findings.

Before continuing, we note that one cannot hope to recover an arbitrary tensor from highly sub-sampled and arbitrarily corrupted entries. In order to enable provably valid recovery, the present paper focuses on a tractable model by imposing the following assumptions.

Define the incoherence parameters and the condition number of T\bm{T}^{\star} as follows

Here, μ0\mu_{0}, μ1\mu_{1} and μ2\mu_{2} are termed the incoherence parameters. Definitions (8a)-(8c) can be viewed as some sort of incoherence conditions for the tensor. For instance, when μ0,μ1\mu_{0},\mu_{1} and μ2\mu_{2} are small, these conditions say that (1) the energy of tensor T\bm{T}^{\star} is (nearly) evenly spread across all entries; (2) each factor ui\bm{u}_{i}^{\star} is de-localized; (3) the factors {ui}\{\bm{u}_{i}^{\star}\} are nearly orthogonal to each other. Definition (8d) is concerned with the “well-conditionedness” of the tensor, meaning that each rank-1 component is of roughly the same size. In particular, we note that an assumption on pairwise correlation (i.e. a constraint on μ2\mu_{2}) is often assumed in the literature of tensor decomposition / factorization (e.g. [AGJ14, SLLC17, HZC20]).

Suppose that E\bm{E} is a symmetric random tensor, where {Ej,k,l}1jkld\{E_{j,k,l}\}_{1\leq j\leq k\leq l\leq d} (cf. (1)) are independently generated sub-Gaussian random variables with mean zero and variance Var(Ej,k,l)σ2\mathsf{Var}(E_{j,k,l})\leq\sigma^{2}.

In addition, recognizing that there is a global permutational ambiguity issue (namely, one cannot distinguish u1,,ur\bm{u}_{1}^{\star},\cdots,\bm{u}_{r}^{\star} from an arbitrary permutation of them), we introduce the following loss metrics to account for this ambiguity:

where permr\mathsf{perm}_{r} stands for the set of r×rr\times r permutation matrices. For notational simplicity, we also take

With these notations in place, we are ready to present our main results. For simplicity of presentation, we shall start with the setting where r,μ,κ1r,\mu,\kappa\asymp 1.

Fix an arbitrary small constant δ>0\delta>0. Suppose that r,κ,μ=O(1)r,\kappa,\mu=O(1),

for some sufficiently large constants c0,c2>0c_{0},c_{2}>0 and some sufficiently small constants c1,c3>0c_{1},c_{3}>0. The learning rate ηtη\eta_{t}\equiv\eta is taken to be a constant obeying 0<\eta\leq\lambda_{\min}^{\star 4/3}/\big{(}32\lambda_{\max}^{\star 8/3}\big{)}. Then with probability at least 1δ1-\delta,

hold simultaneously for all 0tt0=d50\leq t\leq t_{0}=d^{5}. Here, 0<C1,C3,ρ<10<C_{1},C_{3},\rho<1 and C2,C4>0C_{2},C_{4}>0 are some absolute constants.

The theorem holds unchanged if d5d^{5} is replaced by dcd^{c} for an arbitrarily large constant c>0c>0.

The upper bound t0t_{0} on the iteration count arises from the leave-one-out analysis when handling noisy observations. In short, the leave-one-out argument can only provide high-probability bounds for each iteration, thus requiring an upper bound on the iteration count if we desire a uniform bound across iterations. Note that in the noiseless case, our results and analysis hold for an arbitrarily large number of iterations.

Fix an arbitrarily small constant δ>0\delta>0. Instate the assumptions of Theorem 2.4. Then with probability at least 1δ1-\delta,

hold simultaneously for all 0tt0=d50\leq t\leq t_{0}=d^{5}. Here, 0<C1,C3,ρ<10<C_{1},C_{3},\rho<1 and C2,C4>0C_{2},C_{4}>0 are some absolute constants.

Several important implications are provided as follows. The discussion below assumes λmaxλmin1\lambda_{\max}^{\star}\asymp\lambda_{\min}^{\star}\asymp 1 for notational simplicity.

Linear convergence. In the absence of noise, the proposed algorithm converges linearly, namely, it provably attains ε\varepsilon accuracy within O(log(1/ε))O(\log(1/\varepsilon)) iterations. Given the inexpensiveness of each gradient iteration, this algorithm can be viewed as a linear-time algorithm, which can almost be implemented as long as we can read the data. In the noisy setting, the algorithm reaches an appealing statistical accuracy within a logarithmic number of iterations.

Near-optimal statistical accuracy. The proposed algorithm converges geometrically fast to a point with Euclidean error O\big{(}\sigma\sqrt{(d\log d)/p}\big{)}. This matches the lower bound established in [XYZ17, Theorem 5] up to some logarithmic factor, thus justifying the statistical optimality of the proposed nonconvex algorithm.

Noise size. The above theory operates in the regime where σpd3/2\sigma\lesssim\sqrt{\frac{p}{d^{3/2}}} (modulo some log factor). Given that we have Td3/2\|\bm{T}^{\star}\|_{\infty}\asymp d^{-3/2} in this case, our noise size constraint can be equivalently written as (up to some log factor)

Since the sampling rate needs to satisfy pd3/2p\gg d^{-3/2}, this condition essentially allows the typical size of each noise component to be considerably larger than the size of the corresponding entry of the truth, which covers a broad range of practical scenarios.

Implicit regularization. One appealing feature of our finding is the simplicity of the iterative refinement stage of the algorithm. All of the above statistical and computational benefits hold for vanilla gradient descent (when properly initialized). This should be contrasted with prior work (e.g. [XY17]) that relies on extra regularization terms to stabilize the optimization landscape. In principle, vanilla gradient descent implicitly constrains itself within a region of well-conditioned landscape, thus enabling fast convergence without explicit regularization.

No need of sample splitting. The theory developed herein does not require fresh samples in each iteration. We note that sample splitting has been frequently adopted in other context primarily to simplify mathematical analysis. Nevertheless, it typically does not exploit the data in an efficient manner (i.e. each data sample is used only once), thus resulting in the need of a much larger sample size in practice.

Thus far, we have concentrated on the low-rank, well-conditioned, and incoherent case. Our main theory can be extended to cover a broader class of scenarios, as stated below.

Fix an arbitrary small constant δ>0\delta>0. Suppose that κ1\kappa\asymp 1,

for some sufficiently large constants c0,c3>0c_{0},c_{3}>0 and some sufficiently small constants c1,c2,c4>0c_{1},c_{2},c_{4}>0. The learning rate ηtη\eta_{t}\equiv\eta is taken to be a constant obeying 0<\eta\leq\lambda_{\min}^{\star 4/3}/\big{(}32\lambda_{\max}^{\star 8/3}\big{)}. Then with probability at least 1δ1-\delta,

hold simultaneously for all 0tt0=d50\leq t\leq t_{0}=d^{5}. Here, 0<C1,C3,ρ<10<C_{1},C_{3},\rho<1 and C2,C4>0C_{2},C_{4}>0 are some absolute constants.

Fix an arbitrarily small constant δ>0\delta>0. Instate the assumptions of Theorem 2.8. Then with probability at least 1δ1-\delta,

hold simultaneously for all 0tt0=d50\leq t\leq t_{0}=d^{5}. Here, 0<C1,C3,ρ<10<C_{1},C_{3},\rho<1 and C2,C4>0C_{2},C_{4}>0 are some absolute constants.

Clearly, Theorem 2.8 and Corollary 2.9 subsume Theorem 2.4 and Corollary 2.7 as a special case respectively.

Our theorems require the rank rr to not exceed o(d1/6)o(d^{1/6}), which, we believe, is an artifact of the current nonconvex analysis (particularly for the initialization stage). For instance, our local convergence analysis is built upon strong convexity and smoothness, which holds only within a sufficiently small neighborhood surrounding the truth; given that the diameter of this neighborhood is no more than o(1/r)o(1/r), our analysis requires an initial guess with higher accuracy than expected, thus leading to our rank constraint. It might be possible to improve the rank dependency via more refined analysis, and we leave it to future investigation.

3 Numerical experiments

We carry out a series of numerical experiments to corroborate our theoretical findings. Before proceeding, recall that Theorem 2.8 only guarantees successful recovery with probability 1δ1-\delta for some small constant δ\delta; this means that we shall not anticipate a very high success rate (e.g. 1O(d5)1-O(d^{-5})) as in the matrix recovery case. As we shall make clear shortly, this happens mainly because the initialization stage works only with probability 1δ1-\delta, where the uncertainty largely depends on the random vectors {gτ}1τL\{\bm{g}^{\tau}\}_{1\leq\tau\leq L}. With this observation in mind, we recommend the following modification to improve the empirical success rate:

Run Algorithm 2 independently for tinit=5t_{\mathsf{init}}=5 times to obtain multiple initial estimates (denoted by U0,,U[tinit]0\bm{U}^{0}_{},\cdots,\bm{U}^{0}_{[t_{\mathsf{init}}]}); select the one achieving the smallest empirical loss, namely

Run Algorithm 1 with the initial point U0\bm{U}^{0} set to be Ubest0\bm{U}^{0}_{\mathsf{best}}.

The final estimates for the low-rank factor and the whole tensor are denoted respectively by

The third series of experiments is concerned with the dependence of the success rate on the rank rr. Let us set p=rd3/2log2dp=rd^{-3/2}\log^{2}d, L=r2L=r^{2} and ϵth=0.4\epsilon_{\mathsf{th}}=0.4, and the success recovery criterion is the same as above. Figure 1(c) depicts the empirical success rates (over 100100 independent Monte Carlo trials) as the rank rr varies. As can be seen from the plots, the proposed algorithm is able to achieve exact reconstruction as long as the rank rr is sufficiently small compared to dd. The plausible range of rr, however, seems and seems to be larger than our theoretic requirement r=o(d1/6)r=o(d^{1/6}). This, once again, suggests the need of future investigation to pin down the best possible dependency on rr.

4 Notation

In addition, the operator norm of T\bm{T} is defined as

Further, f(n)g(n)f(n)\lesssim g(n) or f(n)=O(g(n))f(n)=O(g(n)) means that f(n)/g(n)C1|f(n)/g(n)|\leq C_{1} for some constant C1>0C_{1}>0; f(n)g(n)f(n)\gtrsim g(n) means that f(n)/g(n)C2|f(n)/g(n)|\geq C_{2} for some constant C2>0C_{2}>0; f(n)g(n)f(n)\asymp g(n) means that C1f(n)/g(n)C2C_{1}\leq|f(n)/g(n)|\leq C_{2} for some constants C1,C2>0C_{1},C_{2}>0; f(n)=o(g(n))f(n)=o(g(n)) means that limnf(n)/g(n)=0\lim_{n\rightarrow\infty}f(n)/g(n)=0. In addition, f(n)g(n)f(n)\ll g(n) means that f(n)c1g(n)f(n)\leq c_{1}g(n) for some sufficiently small constant c1>0c_{1}>0, and f(n)g(n)f(n)\gg g(n) means that f(n)c2g(n)f(n)\geq c_{2}g(n) for some sufficiently large constant c2>0c_{2}>0.

Initialization

This section presents formal details of the proposed two-step initialization, accompanied by some intuition. Recall that the proposed initialization procedure consists of two steps, which we discuss separately.

The spectral algorithm is often applied in conjunction with simple “unfolding” (or “matricization”) to estimate the subspace spanned by the rr factors {ui}1ir\{\bm{u}_{i}^{\star}\}_{1\leq i\leq r}. This strategy is partly motivated by prior approaches developed for covariance estimation with missing data [Lou14, MS18, CLC+20]. We provide a brief introduction below.

be the mode-1 matricization of p1Tp^{-1}\bm{T} (namely, 1pTi,j,k=Ai,(j1)d+k\frac{1}{p}T_{i,j,k}=A_{i,(j-1)d+k} for any 1i,j,kd1\leq i,j,k\leq d) [KB09]. The rationale of this step is that: under our model, the unfolded matrix A\bm{A} obeys

2 Step 2: retrieval of low-rank tensor factors from the subspace estimate

As it turns out, it is possible to obtain rough (but reasonable) estimates of all individual low-rank tensor factors {ui}1ir\{\bm{u}_{i}^{\star}\}_{1\leq i\leq r} — up to global permutation — given a reliable subspace estimate U\bm{U}. This is in stark contrast to the low-rank matrix recovery case, where there exists some global rotational ambiguity that prevents us from disentangling the rr factors of interest.

We begin by describing how to retrieve one tensor factor from the subspace estimate — a procedure summarized in \textscRetrieveonetensorfactor(\textsc{Retrieve-one-tensor-factor}(). Let us generate a random vector from the provided subspace U\bm{U} (which has orthonormal columns), that is,

The rescaled tensor data p1Tp^{-1}\bm{T} is then transformed into a matrix via proper “projection” along this random direction θ\bm{\theta}, namely,

Our estimate for a tensor factor is then given by λ1/3ν\lambda^{1/3}\bm{\nu}, where ν\bm{\nu} is the leading singular vector of M\bm{M} obeying T,ν30\langle\bm{T},\bm{\nu}^{\otimes 3}\rangle\geq 0, and λ\lambda is taken as \lambda=\big{\langle}p^{-1}\bm{T},\bm{\nu}^{\otimes 3}\big{\rangle}. Informally, ν\bm{\nu} reflects the direction of the component ui\bm{u}_{i}^{\star} that exhibits the largest correlation with the random direction θ\bm{\theta}, and λ\lambda forms an estimate of the corresponding size ui2\|\bm{u}^{\star}_{i}\|_{2}.

A challenge remains, however, as there are oftentimes more than one tensor factors to estimate. To address this issue, we propose to re-run the aforementioned procedure multiple times, so as to ensure that we get to retrieve each tensor factor of interest at least once. We will then apply a careful pruning procedure (i.e. \textscPrune(\textsc{Prune}()) to remove redundancy.

2.2 Intuition

To develop some intuition about the above procedure, consider the “heuristic” case where θ=U(UU)1Ug\bm{\theta}=\bm{U}^{\star}(\bm{U}^{\star\top}\bm{U}^{\star})^{-1}\bm{U}^{\star\top}\bm{g}, namely, the idealistic scenario where the subspace estimate U\bm{U} is accurate. Averaging out the randomness in the sampling pattern and the noise, we see that the expected projected matrix (27) takes the following form:

In the mean time, armed with (28) and the incoherence assumption (such that ui\bm{u}_{i}^{\star} and uj\bm{u}_{j}^{\star} are nearly orthogonal for iji\neq j), one might have

thus explaining our choice of λ\lambda in the proposed procedure. These arguments hint at the ability of our procedure in retrieving one tensor factor in each round.

The above intuitive argument, however, does not explain why we need to first project a random vector g\bm{g} onto the (approximate) column space of U\bm{U}^{\star}. While we won’t go into detailed calculations here, we remark in passing a crucial high variability issue: without proper projection, the perturbation incurred by both the missing data and the noise might far exceed the strength of the true signal. As a result, it is advised to first project the data onto the desired subspace, in the hope of amplifying the signal-to-noise ratio.

3 Other alternatives?

The careful reader may naturally wonder whether a careful initialization is pivotal in achieving fast convergence. While a thorough answer to this has yet to be developed, we shall point out some alternatives that seem sub-optimal in both theory and practice. To simplify the presentation, the current subsection focuses on the rank-1 noiseless case, where

Since the decision variable is now a dd-dimensional vector, we shall employ the conventional notation ut\bm{u}^{t} to represent Ut\bm{U}^{t}.

We find it instrumental to begin with the population-level analysis, which corresponds to the scenario with no missing data and noise (p=1p=1 and σ=0\sigma=0). A little calculation gives

As an immediate consequence, the expected correlation between the next iterate and the truth obeys

a similar recursion holds for ut\bm{u}^{t}. As a result, the GD iterates are expected to get increasingly more aligned with the truth, at least at the population level. Caution needs to be exercised, however, that this population-level analysis alone fails to capture what is happening in the finite-sample case. In what follows, we point out potential issues with random initialization.

Consider the case where u0\bm{u}^{0} is generated as a vector of i.i.d. Gaussian random variables. Suppose that u0\bm{u}^{0} and u\bm{u}^{\star} are positively correlated and that u02\|\bm{u}^{0}\|_{2} is sufficiently small. It is easily seen that, with high probability, the expected increment is on the order of (cf. (32))

which could be quite small as it depends quadratically on the current correlation u0,u\langle\bm{u}^{0},\bm{u}^{\star}\rangle.

In summary, the main issue stems from the quadratic dependence of the expected increment (33) on the correlation u0,u\langle\bm{u}^{0},\bm{u}^{\star}\rangle, which can be exceedingly small if u0\bm{u}^{0} is randomly initialized.

Another alternative for initialization is the tensor power method, which has recently gained popularity in the context of learning latent-variable models [AGH+14, AGJ17]. Nevertheless, the TPM (with random initialization) suffers from the same high-volatility issue as randomly initialized GD. The argument for this would be nearly identical to the one presented above, and is hence omitted. Instead, we invoke a perturbation analysis result in [AGH+14, Theorem 5.1] to illustrate the insufficiency of the TPM.

Recall that \frac{1}{p}\bm{T}=\bm{T}^{\star}+\big{(}\frac{1}{p}\bm{T}-\bm{T}^{\star}\big{)}. A critical issue is that the perturbation bound in [AGH+14, Theorem 5.1] requires the tensor perturbation to be exceedingly small, namely,

Related work

One of the most natural ideas for solving tensor completion is to first unfold the tensor data into matrices, followed by proper convex relaxation commonly adopted for low-rank matrix completion. Given that there are more than one ways to matricize a tensor, several prior work has explored the design of matrix norms that can exploit the tensor structure more effectively [THK10, GRY11, LMWY13, RPP13, LFC+16, MHWG14]. Such algorithms have been robustified to enable reliable recovery against sparse outliers as well [GQ14]. For the most part, however, such unfolding-based convex relaxation necessarily incur loss of structural information, which is particularly severe when handling odd-order tensors. The sample complexity developed for this paradigm is often sub-optimal vis-a-vis the computational limits (namely, minimal sample complexity achievable by polynomial-time algorithms).

Motivated by the above sub-optimality issue, [YZ16, YZ17] proposed to minimize instead the tensor nuclear norm subject to data constraints, which provably allows for reduced sample complexity. The issue, however, is that computing the tensor nuclear norm itself is already computationally intractable, thus limiting its applicability to even moderate-dimensional problems. Similar findings have also been discovered for tensor atomic norm minimization [DBBG19]. When restricted to polynomial-time algorithms, the best statistical guarantees are often attained via convex relaxation tailored to the sum-of-squares hierarchy [BM16]; the resulting computational cost, however, remains prohibitively high for practical large-scale problems. Another matrix nuclear norm minimization algorithm has been proposed based on promoting certain structures on certain factor matrices [LSC+14]. Developing statistical guarantees is, however, not the focal point of this work.

Moving beyond convex relaxation, a number of prior papers have developed nonconvex algorithms for tensor completion, examples including iterative hard thresholding [RSS17], alternating minimization [JO14, WAA16, XHYS15], tensor SVD [ZA17], optimization on manifold [XY17, KM16, Ste16], proximal average algorithm with nonconvex regularizer [Yao18], and block coordinate decent [JHZ+16, XY13]. When it comes to the model considered herein, these algorithms either lack optimal statistical guarantees, or come with a computational cost that is significantly higher than a linear-time algorithm.

The algorithm and theory that we develop are largely inspired by the recent advances of nonconvex optimization algorithms for low-rank matrix recovery problems [KMO10a, KMO10b, CLS15, CC17, SL16, YPCC16, CW15]. The main theoretical tool — the leave-one-out analysis — is a powerful technique that has proved successful in various other statistical problems [EK15, CFMW19, AFWZ17, MWCC17, ZB18, CCFM19, CCF+19, LZT19, CFMY19, DC18, PW19]. There are several major differences between the analysis of nonconvex tensor completion and that of nonconvex matrix recovery. For instance, our initialization scheme is substantially more complicated than the matrix recovery counterpart, thus requiring much more sophisticated analysis; in addition, the local convergence stage of tensor completion does not suffer from rotational ambiguity (which often appears in nonconvex matrix completion), and hence we only need to handle permutational ambiguity.

In addition, the current paper focuses on non-adaptive uniform random sampling. If there is freedom in designing the sampling mechanism, then one can often expect improved performance; see [KS13, Zha19] as examples. Fundamental criteria that enable perfect low-CP-rank tensor completion have been studied in [AW17].

Tensor completion is simply a special example of the tensor recovery literature. There is a large body of results tackling various other tensor recovery and estimation problems, including but not limited to tensor decomposition [Kol01, KB09, AGH+14, AGJ14, TS15, KOKC13, HSSS16, GHJY15, ZKOM18, SDLF+17, SLLC17, GM17], tensor SVD and factorization [ZX18, KBHH13, ZA17], and tensor regression and sketching [RSS17, HZC20, CRY19, HWW+19]. The algorithmic ideas explored in this paper might have implications for these tensor-related problems as well.

Analysis

In this section, we outline the proof of Theorem 2.8. The proof of Corollary 2.9 is deferred to Appendix C. The analysis is divided into three parts:

In Section 5.1, we show that given an initial estimate sufficiently close to the ground truth, vanilla gradient descent converges linearly. These are formalized in Lemmas 5.3 and 5.6.

Sections 5.2-5.3 provide statistical guarantees for the two steps of the initialization procedure; see Theorems 5.9.

Under the assumptions of Theorem 2.8, one can see that the initialization satisfies the requirement of linear convergence of vanilla gradient descent. Therefore, Theorem 2.8 immediately follows from the results in Sections 5.1-5.3.

In this section, we demonstrate that: if the initialization is reasonably good, then vanilla gradient descent converges linearly to a solution with the desired statistical accuracy. We postpone the analysis for initialization to Sections 5.2-5.3 for convenience of presentation.

First of all, using our notation ×seq\times^{\mathsf{seq}} defined in (21), we can write

The gradient of fcleanf_{\mathsf{clean}} w.r.t. us\bm{u}_{s} (1sr1\leq s\leq r) is thus given by

where vec(V)\mathsf{vec}(\bm{V}) denotes the vectorization of V\bm{V}.

1.2 Local strong convexity and smoothness

At the heart of our analysis is a crucial geometric property of the objective function, that is, the noiseless loss function fcleanf_{\mathsf{clean}} behaves like a locally strongly convex and smooth function. This fact, which is formally stated in the following lemma, is the key enabler of fast local convergence of vanilla GD.

Suppose that the sample complexity and the rank satisfy

for some sufficiently large (resp. small) constant c0>0c_{0}>0 (resp. c1>0c_{1}>0). Then with probability greater than 1O(d10)1-O(d^{-10}),

Here, δc2/(μ3/2r)\delta\leq c_{2}/(\mu^{3/2}r) for some sufficiently small constant c2>0c_{2}>0.

In order to invoke Lemma 5.1, one needs to make sure that the decision matrix U\bm{U} of interest (e.g. Ut\bm{U}^{t} in the GD sequence) satisfies the condition (44). This, however, is a fairly stringent condition, as it requires U\bm{U} to be close to the truth in every single row.

1.3 Leave-one-out gradient descent sequences

Motivated by the analytical framework developed for low-rank matrix recovery [MWCC17, CLL19], we introduce the following leave-one-out sequences, which play a crucial role in guaranteeing that the entire trajectory {Ut}t0\{\bm{U}^{t}\}_{t\geq 0} satisfies the condition (44) as required in Lemma 5.1.

Specifically, we define for each 1md1\leq m\leq d the following auxiliary loss function:

PΩm\mathcal{P}_{\Omega_{m}}: the projection onto the subspace of tensors supported on {(i,j,k)Ω ⁣:i=m or j=m or k=m}\{(i,j,k)\in\Omega\colon i=m\text{ or }j=m\text{ or }k=m\};

PΩm\mathcal{P}_{\Omega_{-m}}: the projection onto the subspace of tensors supported on {(i,j,k)Ω ⁣:im and jm and km}\{(i,j,k)\in\Omega\colon i\neq m\text{ and }j\neq m\text{ and }k\neq m\};

Pm\mathcal{P}_{m}: the projection onto the subspace of tensors supported on {(i,j,k)[d]3 ⁣:i=m or j=m or k=m}\{(i,j,k)\in[d]^{3}\colon i=m\text{ or }j=m\text{ or }k=m\}.

In words, this function is obtained by replacing all data at locations {(i,j,k)[d]3 ⁣:i=m or j=m or k=m}\{(i,j,k)\in[d]^{3}\colon i=m\text{ or }j=m\text{ or }k=m\} by their expected values, thus removing all randomness associated with this location subset. The gradient of f(m)(U)f^{\left(m\right)}(\bm{U}) w.r.t. us\bm{u}_{s} (1sr1\leq s\leq r) can be computed as:

We then denote by \big{\{}\bm{U}^{t,(m)}\big{\}}_{t\geq 0} the iterative sequence obtained by running gradient descent w.r.t. the leave-one-out loss f(m)()f^{\left(m\right)}(\cdot); see Algorithm 4. By construction, as long as U0,(m)\bm{U}^{0,\left(m\right)} is independent of the sampling locations and the noise associated with the locations {(i,j,k)Ω ⁣:i=m or j=m or k=m}\{(i,j,k)\in\Omega\colon i=m\text{ or }j=m\text{ or }k=m\} (which holds true as detailed momentarily), then the entire trajectory \big{\{}\bm{U}^{t,(m)}\big{\}}_{t\geq 0} becomes statistically independent of such randomness. This is a crucial property that allows us to decouple the complicated statistical dependency.

1.4 Key lemmas

Key hypotheses for the gradient update stage:

for some quantity Elocal>0\mathcal{E}_{\mathsf{local}}>0 (depending possibly on μ\mu and rr) and some constants C1,,C8>0C_{1},\cdots,C_{8}>0. These exist a few straightforward consequences of the hypotheses (47), which we record in the following lemma.

Assume that the hypotheses (47) hold, then we have

Our proof for the hypotheses (47) is inductive in nature: we would like to show that if the hypotheses in (47) hold for the tt-th iteration, then they continue to be valid for the (t+1)(t+1)-th iteration. We shall justify each of the above hypotheses inductively through the following lemmas.

for some sufficiently large constant c0>0c_{0}>0 and some sufficiently small constant c1,c2>0c_{1},c_{2}>0. Assume that the hypotheses (47) hold for the tt-th iteration and \mathcal{E}_{\mathsf{local}}\leq c_{3}/\big{(}\mu^{3/2}r\big{)} for some sufficiently small constant c3>0c_{3}>0. Then with probability at least 1O(d10)1-O(d^{-10}),

provided that 0<\eta\leq\lambda_{\min}^{\star 4/3}/\big{(}32\lambda_{\max}^{\star 8/3}\big{)}, 1-\big{(}\lambda_{\min}^{\star 4/3}/5\big{)}\eta\leq\rho<1, and C2C_{2} is sufficiently large.

for some sufficiently large constant c0>0c_{0}>0 and some sufficiently small constant c1,c2>0c_{1},c_{2}>0. Assume that the hypotheses (47) hold for the tt-th iteration and \mathcal{E}_{\mathsf{local}}\leq c_{3}/\big{(}\mu^{3/2}r\big{)} for some sufficiently small constant c3>0c_{3}>0. Then with probability at least 1O(d10)1-O(d^{-10}), one has

provided that 0<\eta\leq\lambda_{\min}^{\star 4/3}/\big{(}32\lambda_{\max}^{\star 8/3}\big{)}, 1-\big{(}\lambda_{\min}^{\star 4/3}/5\big{)}\eta\leq\rho<1 and C6C_{6} is sufficiently large.

for some sufficiently large constant c0>0c_{0}>0 and some sufficiently small constant c1,c2>0c_{1},c_{2}>0. Assume that the hypotheses (47) hold for the tt-th iteration and \mathcal{E}_{\mathsf{local}}\leq c_{3}/\big{(}\mu^{3/2}r\big{)} for some sufficiently small constant c3>0c_{3}>0. Then with probability at least 1O(d10)1-O(d^{-10}), one has

provided that 0<\eta\leq\lambda_{\min}^{\star 4/3}/\big{(}32\lambda_{\max}^{\star 8/3}\big{)}, 1-\big{(}\lambda_{\min}^{\star 4/3}/5\big{)}\eta\leq\rho<1, C7C_{7} and C8C_{8} are sufficiently large.

for some sufficiently large constant c0>0c_{0}>0 and some sufficiently small constant c1,c2>0c_{1},c_{2}>0. Assume that the hypotheses (47) hold for the tt-th iteration and \mathcal{E}_{\mathsf{local}}\leq c_{3}/\big{(}\mu^{3/2}r\big{)} for some sufficiently small constant c3>0c_{3}>0. Then with probability at least 1O(d10)1-O(d^{-10}), one has

provided that 0<\eta\leq\lambda_{\min}^{\star 4/3}/\big{(}32\lambda_{\max}^{\star 8/3}\big{)}, 1-\big{(}\lambda_{\min}^{\star 4/3}/5\big{)}\eta\leq\rho<1, C3/(C5+C7)C_{3}/\left(C_{5}+C_{7}\right) and C4/(C6+C8)C_{4}/\left(C_{6}+C_{8}\right) are both sufficiently large.

The proofs of the above key lemmas are postponed to Appendix A.

2 Analysis for initialization: Part 1 (subspace estimation)

where Or×r\mathcal{O}^{r\times r} stands for the set of r×rr\times r orthonormal matrices. This can be viewed as the global rotation matrix that best aligns the two subspaces represented by U\bm{U} and Uorth\bm{U}^{\star}_{\mathsf{orth}} respectively.

Equipped with the above notation, we can invoke [CLC+20, Corollary 1] to arrive at the following lemma, which upper bounds the distance between our subspace estimate U\bm{U} and the ground truth Uorth\bm{U}^{\star}_{\mathsf{orth}}.

There exist some universal constants c0,c1,c2>0c_{0},c_{1},c_{2}>0 such that if

then with probability 1O(d10)1-O\left(d^{-10}\right), the subspace estimate U\bm{U} computed by Algorithm 2 obeys

where Uorth\bm{U}_{\mathsf{orth}}^{\star} and R\bm{R} are defined respectively in (54) and (55), and

In a nutshell, Lemma 5.7 asserts that: under our sample size, noise and rank conditions, Algorithm 2 produces reliable estimates of the subspace spanned by the low-rank tensor factors {ui}1ir\{\bm{u}_{i}^{\star}\}_{1\leq i\leq r}. The theorem quantifies the subspace distance in terms of both the spectral norm and 2,\|\cdot\|_{2,\infty}, where the latter bound often reflects a considerably stronger sense of proximity compared to the former one.

As it turns out, in order to facilitate analysis for the subsequent stages, we need to introduce certain leave-one-out sequences as well, which we detail in the next subsection.

2.2 Leave-one-out sequences for subspace estimation

The key idea of the leave-one-out analysis is to create auxiliary leave-one-out sequences that are (1) independent of a small fraction of the data; (2) sufficiently close to the true estimates. We introduce the following auxiliary tensor and d×d2d\times d^{2}-dimensional matrix for each 1md1\leq m\leq d:

By construction, T(m)\bm{T}^{\left(m\right)} and A(m)\bm{A}^{\left(m\right)} are independent of PΩm(E)\mathcal{P}_{\Omega_{m}}\left(\bm{E}\right), where we recall that

where Poff-diag()\mathcal{P}_{\mathsf{off}\text{-}\mathsf{diag}}(\cdot) (as already defined in Section 3.1) extracts out off-diagonal entries from a matrix. The rationale is simple: it can be easily verified that

The following lemma plays a crucial role in our analysis, which formalizes the fact that the leave-one-out version U(m)\bm{U}^{\left(m\right)} obtained by Algorithm 5 is extremely close to U\bm{U}.

There exist some universal constants c0,c1,c2>0c_{0},c_{1},c_{2}>0 such that if

then with probability 1O(d10)1-O\left(d^{-10}\right), the subspace estimate U(m)\bm{U}^{(m)} computed by Algorithm 5 obeys

simultaneously for all 1md1\leq m\leq d, where

Lemma 5.8 follows immediately from the analysis of [CLC+20, Lemma 4]. As a remark, the construction of the leave-one-out sequences herein is slightly different from the one in [CLC+20]. However, it is straightforward to adapt the proof of [CLC+20] to the case considered herein. We therefore omit the proof for the sake of brevity.

3 Analysis for initialization: Part 2 (retrieval of individual tensor factors)

This section justifies that the procedure presented in Algorithm 3 allows to disentangle the tensor factors. For notational simplicity, we let

Fix any arbitrary small constant δ>0\delta>0. Assume that

for some sufficiently large universal constant c0,c3>0c_{0},c_{3}>0 and some sufficiently small universal constants c1,c2,c4>0c_{1},c_{2},c_{4}>0. Then with probability exceeding 1δ1-\delta, there exists a permutation π():[d][d]\pi(\cdot):[d]\mapsto[d] such that for all 1ir1\leq i\leq r, the tensor factors {wi}i=1r\{\bm{w}^{i}\}_{i=1}^{r} returned by Algorithm 3 satisfy

In short, this theorem asserts that the estimates returned by Algorithm 3 are — up to global permutation — reasonably close to the ground truth under our sample size and noise conditions. In order to establish this theorem and in order to provide initial guesses for the leave-one-out GD sequences, we need to produce a leave-one-out sequence tailored to this part of the algorithm. Such auxiliary sequences are generated in a similar spirit as the previous ones, and we summarize them in Algorithm 6. As usual, the resulting leave-one-out estimates \big{\{}\lambda^{\left(m\right)}_{i},\bm{w}^{i,\left(m\right)}\big{\}}_{i=1}^{r} are statistically independent of PΩm(E)\mathcal{P}_{\Omega_{m}}\left(\bm{E}\right).

In what follows, we gather a few key properties of the leave-one-out estimates, which play a crucial role in the analysis.

Fix any arbitrarily small constant δ>0\delta>0. Instate the assumptions in Theorem 5.9. With probability exceeding 1δ1-\delta, the permutation function stated in Theorem 5.9 obeys that: for all 1ir1\leq i\leq r and all 1md1\leq m\leq d:

With Theorems 5.9-5.10 in place, we can immediately establish a few desired properties (particularly those specified in Section 5.1) of our initial estimate, as asserted in the following corollary.

Fix any arbitrarily small constant δ>0\delta>0. Instate the assumptions in Theorem 2.8. With probability exceeding 1δ1-\delta, the estimates U0\bm{U}^{0} and U0,(m)\bm{U}^{0,\left(m\right)} returned by Algorithm 3 and Algorithm 6 respectively satisfy the hypotheses (47) for t=0t=0.

3.2 Analysis

Before we start with the proof, we first state the main idea. For the sake of clarify, we define

In addition, let ντ\bm{\nu}^{\tau} be the top singular vector of Mτ\bm{M}^{\tau} obeying T,(ντ)30\langle\bm{T},(\bm{\nu}^{\tau})^{\otimes 3}\rangle\geq 0, and ντ,(m)\bm{\nu}^{\tau,(m)} the top singular vector of Mτ,(m)\bm{M}^{\tau,(m)} obeying T(m),(ντ,(m))30\langle\bm{T}^{(m)},(\bm{\nu}^{\tau,(m)})^{\otimes 3}\rangle\geq 0. Set

These are all computed in the function \textscRetrieveonetensorfactor(\textsc{Retrieve-one-tensor-factor}() in the τ\tau-th round.

We first show that for each 1ir1\leq i\leq r, there exists at least one trial 1τL1\leq\tau\leq L such that the ii-th tensor factor ui\overline{\bm{u}}_{i}^{\star} is the top singular vector of the population version of T×3θτ\bm{T}\times_{3}\bm{\theta}^{\tau} (with respect to the missing data and noise). In addition, the spectral gap is large enough to guarantee accurate estimates.

Next, we prove that given this spectral gap, the top singular vector ντ\bm{\nu}^{\tau} of T×3θτ\bm{T}\times_{3}\bm{\theta}^{\tau} is close to ui\overline{\bm{u}}^{\star}_{i} both in the 2\|\cdot\|_{2} and \|\cdot\|_{\infty} norm. This also enables us to accurately estimate the magnitude of ui\bm{u}^{\star}_{i}.

Finally, we need to show that one can find those reliable estimates among LL random restarts. Combining the spectral gap information with the incoherence condition that tensor components are nearly orthogonal to each other, our selection procedure is guaranteed to recover all tensor factors.

Now we proceed to the proof. Without loss of generality, we prove the case for i=1i=1 in the sequel, i.e. there exists some τ[L]\tau\in\left[L\right] such that ντ\bm{\nu}^{\tau} accurately recovers u1\bm{u}^{\star}_{1}. Together with the union bound, this shows that we can find reliable estimates for all tensor factors. We then conclude the proof by showing that Algorithm 3 is able to find all of them without duplicates.

where ui\overline{\bm{u}}_{i}^{\star} and λi\lambda_{i}^{\star} are both defined in (66). The idea is to let γτ\bm{\gamma}^{\star\tau} approximate the singular values of T×3θτ\bm{T}^{\star}\times_{3}\bm{\theta}^{\tau}; this can be seen, for instance, via the following calculation:

where {ui}i=1r\{\overline{\bm{u}}_{i}^{\star}\}_{i=1}^{r} — which are assumed to be incoherent (or nearly orthogonal to each other) — can be approximately viewed as the singular vectors of T×3θτ\bm{T}^{\star}\times_{3}\bm{\theta}^{\tau}.

If we want our spectral estimate to be accurate, we would need to be assured that the two largest entries of γτ\bm{\gamma}^{\star\tau} (in magnitude) are sufficiently separated.

Instate the assumptions of Theorem 5.9. Define \Delta_{1}^{\tau}:=\gamma_{1}^{\star\tau}-\max_{1<i\leq r}\big{|}\gamma_{i}^{\star\tau}\big{|} for each 1τL1\leq\tau\leq L and let Δ1(1)Δ1(2)Δ1(L)\Delta_{1}^{\left(1\right)}\geq\Delta_{1}^{\left(2\right)}\geq\dots\geq\Delta_{1}^{\left(L\right)} denote the order statistics of \big{\{}\Delta_{1}^{\tau}\big{\}}_{\tau=1}^{L} (in descending order). Fix any arbitrary small constant δ>0\delta>0. With probability greater than 1δ/r1-\delta/r, one has

Lemma 5.12 demonstrates that there exists some τ[L]\tau\in\left[L\right] such that \gamma_{1}^{\star\tau}-\max_{1<i\leq r}\big{|}\gamma_{i}^{\star\tau}\big{|}\gtrsim\lambda_{\min}^{\star}. This means that u1\overline{\bm{u}}_{1}^{\star} exhibits the largest correlation with the random projection θ\theta, which further implies that u1\overline{\bm{u}}_{1}^{\star} is the largest singular vector of T×3θτ\bm{T}^{\star}\times_{3}\bm{\theta}^{\tau} with a considerable spectral gap (as we will show shortly). With the desired spectral gap in place, we are ready to look at the eigenvectors / singular vectors of interest. To this end, we find it convenient to introduce another auxiliary vector uτ\overline{\bm{u}}^{\tau}, defined as the leading singular vector of Mτ\bm{M}^{\tau} (cf. (70c)) obeying

The careful reader would immediately notice the similarity between uτ\overline{\bm{u}}^{\tau} and ντ\bm{\nu}^{\tau} except for their global signs; namely, we determine the global sign of uτ\overline{\bm{u}}^{\tau} based on the ground truth information (76), but pick the global sign for ντ\bm{\nu}^{\tau} solely based on the observed data (cf. Algorithm 3). Fortunately, the vectors uτ\overline{\bm{u}}^{\tau} and ντ\bm{\nu}^{\tau} provably coincide, namely,

as we shall demonstrate momentarily in Lemma 5.16. In a similar way, we also denote by uτ,(m)\overline{\bm{u}}^{\tau,\left(m\right)} the leading singular vector of Mτ,(m)\bm{M}^{\tau,\left(m\right)} defined in (70d) such that

Lemma 5.16 also shows that uτ,(m)=ντ,(m)\overline{\bm{u}}^{\tau,\left(m\right)}=\bm{\nu}^{\tau,\left(m\right)}.

Instate the assumptions of Theorem 5.9. Let uτ\overline{\bm{u}}^{\tau} and u1\overline{\bm{u}}_{1}^{\star} be as defined in (76) and (66), respectively. Define A\mathcal{A} to be the event such that \gamma_{1}^{\star\tau}-\max_{1<i\leq r}\big{|}\gamma_{i}^{\star\tau}\big{|}\gtrsim\lambda_{\min}^{\star} and the condition (75) hold. Then conditional on this event A\mathcal{A}, with probability exceeding 1O(d11)1-O\left(d^{-11}\right) one has

Instate the assumptions of Theorem 5.9. Define A\mathcal{A} to be the event such that \gamma_{1}^{\star\tau}-\max_{1<i\leq r}\big{|}\gamma_{i}^{\star\tau}\big{|}\gtrsim\lambda_{\min}^{\star} and the condition (75) hold. Then conditional on this event A\mathcal{A}, one has, with probability exceeding 1O(d10)1-O\left(d^{-10}\right), that

holds for all m[d]m\in\left[d\right], where Eop\mathcal{E}_{\mathsf{op}} is defined as follows:

Instate the assumptions of Theorem 5.9. Define A\mathcal{A} to be the event such that \gamma_{1}^{\star\tau}-\max_{1<i\leq r}\big{|}\gamma_{i}^{\star\tau}\big{|}\gtrsim\lambda_{\min}^{\star} and the condition (75) hold. Then conditional on this event A\mathcal{A}, one has, with probability at least 1O(d10)1-O\left(d^{-10}\right), for all m[d]m\in\left[d\right]:

where Eop\mathcal{E}_{\mathsf{op}} and Eloo\mathcal{E}_{\mathsf{loo}} are defined in (81) and (65), respectively.

Next, we turn to the estimation accuracy regarding the size of the tensor factors and show that λτ\lambda_{\tau} (produced in Algorithm 3) is close to the truth as well. As it turns out, a byproduct of this step reveals that ντ=uτ\bm{\nu}^{\tau}=\overline{\bm{u}}^{\tau} and ντ,(m)=uτ,(m)\bm{\nu}^{\tau,\left(m\right)}=\overline{\bm{u}}^{\tau,\left(m\right)}, where uτ\overline{\bm{u}}^{\tau} and uτ,(m)\overline{\bm{u}}^{\tau,\left(m\right)} are an auxiliary vectors defined in (76) and (78), respectively.

Instate the assumptions of Theorem 5.9. Assume that the results in Lemma 5.13, Lemma 5.14 and Lemma 5.15 hold. Then with probability at least 1O(d10)1-O\left(d^{-10}\right), one has

Thus far, we have only proved that one can find a reliable estimate for each tensor factor within LL random trials, provided that LL is sufficiently large. To finish up, it remains to show that the pruning procedure \textscPrune(\textsc{Prune}() is capable of returning a rough estimate for each tensor factor without duplication. This is accomplished in the following lemma.

Instate the assumptions of Theorem 5.9. On the event that the results in Lemma 5.13, Lemma 5.14, Lemma 5.15 and Lemma 5.16 hold for all 1ir1\leq i\leq r, there exists a permutation π():[d][d]\pi(\cdot):[d]\mapsto[d] such that: for each 1ir1\leq i\leq r, \big{(}\lambda_{i},\bm{w}^{i}\big{)} and \big{(}\lambda_{\pi(i)}^{\star},\overline{\bm{u}}^{\star}_{\pi(i)}\big{)} satisfy (68a), (68b) and (68c); \big{(}\lambda_{i},\bm{w}^{i}\big{)} and \big{(}\lambda_{i}^{\left(m\right)},\bm{w}^{i,\left(m\right)}\big{)}_{i=1}^{r} obey (69a), (69b) and (69c) for all 1md1\leq m\leq d, where \big{\{}\lambda_{i},\bm{w}^{i}\big{\}}_{i=1}^{r} and \big{\{}\lambda_{i}^{\left(m\right)},\bm{w}^{i,\left(m\right)}\big{\}}_{i=1}^{r} are outputs of Algorithm 3 and Algorithm 6, respectively.

Discussion

The current paper uncovers the possibility of efficiently and stably completing a low-CP-rank tensor from partial and noisy entries. Perhaps somewhat unexpectedly, despite the high degree of nonconvexity, this problem can be solved to optimal statistical accuracy within nearly linear time, provided that the tensor of interest is well-conditioned, incoherent, and of constant rank. To the best of our knowledge, this intriguing message has not been shown in the prior literature.

Moving forward, one pressing issue is to understand how to improve the algorithmic and theoretical dependency upon the tensor rank rr of the proposed method. Ideally one would desire a fast algorithm whose sample complexity scales as rd1.5rd^{1.5}, an order that is provably achievable by the sum-of-squares hierarchy. Additionally, in contrast to the matrix counterpart where the rank is upper bounded by the matrix dimension, the tensor CP rank is allowed to rise above dd, which is commonly referred to as the over-complete case. Unfortunately, our current initialization scheme (i.e. the spectral method) fails to work unless r<dr<d, and our local analysis for GD falls of accommodating the scenario with r>dr>d. It would be of great interest to develop more powerful algorithms — in addition to more refined analysis — to tackle such an important over-complete regime.

Another tantalizing research direction is the exploration of landscape design for tensor completion. As our heuristic discussions as well as other prior work (e.g. [RM14]) suggest, randomly initialized gradient descent tailored to (4) seems unlikely to work, unless the sample size is significantly larger than the computational limit. This might mean either that there exist spurious local minima in the natural nonconvex least squares formulation (4), or that the optimization landscape of (4) is too flat around some saddle points and hence not amenable to fast computation. It would be interesting to investigate what families of loss functions allow us to rule out bad local minima and eliminate the need of careful initialization, which might be better suited for tensor recovery problems.

Finally, in statistical inference and decision making, one might not be simply satisfied with obtaining a reliable estimate for each missing entry, but would also like to report a short confidence interval which is likely to contain the true entry. This boils down to the fundamental task of uncertainty quantification for tensor completion, which we leave to future investigation.

Acknowledgements

Y. Chen is supported in part by the AFOSR YIP award FA9550-19-1-0030, by the ONR grant N00014-19-1-2120, by the ARO grants W911NF-20-1-0097 and W911NF-18-1-0303, by the NSF grants CCF-1907661, IIS-1900140 and DMS-2014279, and by the Princeton SEAS innovation award. H. V. Poor is supported in part by the NSF grant DMS-1736417. C. Cai is supported in part by Gordon Y. S. Wu Fellowships in Engineering. This work was done in part while Y. Chen was visiting the Kavli Institute for Theoretical Physics (supported in part by NSF grant PHY-1748958). We thank Lanqing Yu for many helpful discussions, and thank Yuling Yan for proofreading the paper.

Appendix A Proofs for local convergence of GD

In this section, we establish the key lemmas concerning the convergence properties of GD. As one can easily see, treating {Ei,j,k}1i,j,kd\{E_{i,j,k}\}_{1\leq i,j,k\leq d} (resp. {χi,j,k}1i,j,kd\{\chi_{i,j,k}\}_{1\leq i,j,k\leq d}) as independent random variables — which leads to asymmetric versions of E\bm{E} and Ω\Omega — does not affect the order of our results at all. In light of this, we shall adopt such an independent assumption whenever it simplifies our presentation.

From the Hessian expression (41), one can decompose

In what follows, we shall bound each of the above terms separately.

With regards to α4\alpha_{4}, by symmetry we have

In order to control (88), we first see that

where U~\widetilde{\bm{U}}^{\star} is as defined in (87). Similar to the proof of Lemma D.1, we can use the fact that ui2,uj2=ui,uj2\langle\bm{u}_{i}^{\star\otimes 2},\bm{u}_{j}^{\star\otimes 2}\rangle=\langle\bm{u}_{i}^{\star},\bm{u}_{j}^{\star}\rangle^{2} and (8c) to deduce that

provided that rd/μr\ll d/\mu. This implies that

(1) Speaking of an upper bound on α4\alpha_{4}, we can invoke the Cauchy-Schwarz inequality followed by (91) to reach

(2) When it comes to lower bounding α4\alpha_{4}, the main step boils down to controlling the inner product term in (88). Applying the Cauchy-Schwartz inequality gives that

where (i) comes from Cauchy-Schwarz, and the last line follows from (8c), (8d) as well as the condition that rd/μr\ll\sqrt{d/\mu} and κ1\kappa\asymp 1. Therefore, we can lower bound α4\alpha_{4} by (with the assistance of (91))

When it comes to α1\alpha_{1}, we can expand

We will derive an upper bound on β1\beta_{1} in the sequel; the same method immediately applies to β2\beta_{2}.

For notational convenience, let us define

Apply the Cauchy-Schwartz inequality to yield that

Before we bound the above quantities, we pause to make the following observations. In view of the assumptions of this lemma that δ1/r1\delta\ll 1/\sqrt{r}\leq 1, the following holds for all i[r]i\in\left[r\right]:

Now, we proceed to prove the claim. Let us define Si:={j[d]2χij1j2=1}\mathcal{S}_{i}:=\{\underline{j}\in[d]^{2}\mid\chi_{ij_{1}j_{2}}=1\} for each i[d]i\in[d]. Applying the Chernoff bound and the union bound yields that: with probability at least 1O(d10)1-O(d^{-10}) one has

provided pd2logdp\gg d^{-2}\log d. It then follows from the Cauchy-Schwarz inequality that

where (i) arises from (97) and (98). In a similar manner, we can derive

under the sample size assumptin that pμ2r2d2logdp\gg\mu^{2}r^{2}d^{-2}\log d. Here the last inequality makes use of (90). It is self-evident that the above bounds also hold for quantities that appear in (95). Since 0<δ1/r<10<\delta\ll 1/\sqrt{r}<1, we obtain

The same upper bound holds for any other β2\beta_{2}. Therefore, as long as 0<δ1/(μr)0<\delta\ll 1/(\mu\sqrt{r}) and κ1\kappa\asymp 1, we have

Regarding α2\alpha_{2}, applying [YZ16, Lemma 5] (with slight modification, which we omit here for brevity) implies that: if pμ2r2d2logdp\gg\mu^{2}r^{2}d^{-2}\log d, then with probability exceeding 1O(d10)1-O(d^{-10}),

Here the last inequality arises from (92).

We now move on to bounding α3\alpha_{3}. The triangle inequality gives

Recall the definitions of Δ\bm{\Delta} and Δi\bm{\Delta}_{i} in (94). Fix an arbitrary s[r]s\in\left[r\right]. From the definition of the operator norm and the triangle inequality, we can derive

With this lemma in mind, we are ready to derive that

Given that pd3/2log3dp\gtrsim d^{-3/2}\log^{3}d, applying Lemma D.2 indicates that with probability at least 1O(d10)1-O(d^{-10}),

Moreover, it is straightforward to see that 13=123=d3/2\left\|\bm{1}^{\otimes 3}\right\|=\left\|\bm{1}\right\|_{2}^{3}=d^{3/2}. Therefore, one has

Next, we turn to \left\|\cdot\right\|_{\infty}. We first expand

By symmetry, it suffices to control \big{\|}\sum_{i\in[r]}\bm{u}^{\star\otimes 2}_{i}\otimes\bm{\Delta}_{i}\big{\|}, \big{\|}\sum_{i\in[r]}\bm{\Delta}_{i}^{\otimes 2}\otimes\bm{u}^{\star}_{i}\big{\|} and \big{\|}\sum_{i\in[r]}\bm{\Delta}_{i}^{\otimes 3}\big{\|}. Let us look at the first term. Towards this, for each (i,j,k)[d]3\left(i,j,k\right)\in\left[d\right]^{3}, we can use the Cauchy-Schwartz inequality to control

In a similar manner, we can control the remaining two terms by

Recall that 0<δ1/r10<\delta\ll 1/r\leq 1. Putting these together reveals that

where we use (96c) in the last step. In view of the condition that δ1/(μ3/2r)\delta\ll 1/(\mu^{3/2}r) and the assumption κ1\kappa\asymp 1, one has with probability greater than 1O(d10)1-O(d^{-10}),

A.1.5 Putting all this together

Note that the above bounds hold uniformly for all V\bm{V}. Therefore, combining upper bounds for αi\alpha_{i} and the union bound, we conclude that with probability exceeding 1O(d10)1-O(d^{-10}),

A.2 Proof of Lemma 5.2

From (47a) and (47c), we use the triangle inequality to obtain

Similarly, we can combine (47b) and (47c) to obtain that

A.3 Proof of Lemma 5.3

which motivates us to bound α1\alpha_{1} and α2\alpha_{2} separately.

(1) We start with α1\alpha_{1}, towards which we find it helpful to define

Given that fclean(U)=0\nabla f_{\mathsf{clean}}\left(\bm{U}^{\star}\right)=\bm{0} (since U\bm{U}^{\star} is a global optimizer of fcleanf_{\mathsf{clean}}), we can use the fundamental theorem of calculus to obtain

From the hypothesis (47b) as well as our conditions that σλmindlogdp+Elocal1μ3/2r\frac{\sigma}{\lambda_{\min}^{\star}}\sqrt{\frac{d\log d}{p}}+\mathcal{E}_{\mathsf{local}}\ll\frac{1}{\mu^{3/2}r}, we know that Ut(τ)\bm{U}^{t}(\tau) (0τ10\leq\tau\leq 1) satisfies the conditions required in Lemma 5.1. Therefore, applying Lemma 5.1 gives that

Substitution into (116) indicates that: if 0<\eta\leq\lambda_{\min}^{\star 4/3}/\big{(}32\lambda_{\max}^{\star 8/3}\big{)}, then

which implies that (since 1a/21a1-a/2\geq\sqrt{1-a} for 0<a<10<a<1)

(2) We now turn to α2\alpha_{2}. To simplify presentation, we shall assume that {Ei,j,k}i,j,k[d]\{E_{i,j,k}\}_{i,j,k\in[d]} (resp. {χi,j,k}i,j,k[d]\{\chi_{i,j,k}\}_{i,j,k\in[d]}) are independent random variables. Fix an arbitrary s[r]s\in[r] and m[d]m\in[d]. The mm-th entry of PΩ(E)×1ust×2ust\mathcal{P}_{\Omega}\left(\bm{E}\right)\times_{1}\bm{u}^{t}_{s}\times_{2}\bm{u}^{t}_{s} can be expanded as follows:

Before continuing, we make the following observations: from the hypotheses (47a), (47b), (47c), as well as our assumption that σλmindlogdp+Elocal1μ3/2r\frac{\sigma}{\lambda_{\min}^{\star}}\sqrt{\frac{d\log d}{p}}+\mathcal{E}_{\mathsf{local}}\ll\frac{1}{\mu^{3/2}r}, we see that the following holds for all s[r]s\in\left[r\right]:

With these estimates in place, we can upper bound the above four terms in (118) separately.

For β1\beta_{1}, we note that, by construction, ut,(m)\bm{u}^{t,\left(m\right)} is independent of the mm-th mode-3 slice of PΩ(E)\mathcal{P}_{\Omega}\left(\bm{E}\right). This tells us that

can be viewed as a sum of independent zero-mean random variables (conditional on PΩm(E)\mathcal{P}_{\Omega_{-m}}\left(\bm{E}\right)). It is straightforward to compute that

where ψ1\|\cdot\|_{\psi_{1}} denotes the sub-exponential norm and we use (119) in the last inequality. Applying the matrix Bernstein inequality [Kol11, Corollary 2.1] yields that: with probability 1O(d20)1-O(d^{-20}),

For β2\beta_{2}, we first invoke [CW15, lemma 11] to demonstrate that: with probability at least 1O(d11)1-O(d^{-11}),

where the last inequality follows from the noise condition. Clearly, the above bound holds for β3\beta_{3} as well.

Taking together the above bounds and substituting them into (118), we obtain

Recognizing that this holds for any m[d]m\in[d] and s[r]s\in[r], one can sum over mm and ss to deduce that

(3) Combining (117) and (122) yields that: with probability at least 1O(d10)1-O(d^{-10}),

with the proviso that 0<\eta\leq\lambda_{\min}^{\star 4/3}/\big{(}32\lambda_{\max}^{\star 8/3}\big{)}, 1-\big{(}\lambda_{\min}^{\star 4/3}/5\big{)}\eta\leq\rho<1 and that C2C_{2} is sufficiently large.

A.4 Proof of Lemma 5.4

Fix an arbitrary m[r]m\in[r]. From the definition of f(m)f^{\left(m\right)} in (45) and (46), we can show that

Before proceeding to bounding these terms, we pause to define

From (119) and hypothesis (47d), one can applying a similar argument as in (107) to find that

Now we begin to bound the terms in (123) separately.

(1) We start with α1\alpha_{1}. For any 0τ10\leq\tau\leq 1, define

The fundamental theorem of calculus yields

provided that 0<\eta\leq\lambda_{\min}^{\star 4/3}/\big{(}32\lambda_{\max}^{\star 8/3}\big{)}.

In what follows, we shall assume that {Ei,j,k}i,j,k[d]\{E_{i,j,k}\}_{i,j,k\in[d]} (resp. {χi,j,k}i,j,k[d]\{\chi_{i,j,k}\}_{i,j,k\in[d]}) are independent random variables in order to simplify presentation.

(2) The next step is to bound α2\alpha_{2}. For notational simplicity, define

In order to control the Frobenius norm of Vt,(m)\bm{V}^{t,\left(m\right)}, we shall start by considering the mm-th row of Vt,(m)\bm{V}^{t,\left(m\right)}. In view of the definitions of PΩm\mathcal{P}_{\Omega_{m}} and Pm\mathcal{P}_{m} (cf. Appendix 5.1.3), we can expand

We recognize that \big{\{}\bm{u}_{s}^{t,\left(m\right)}\big{\}}_{s=1}^{r} is independent of Ωm\Omega_{m}, making it convenient for us to upper bound \big{\|}\bm{V}^{t,\left(m\right)}_{m,:}\big{\|}_{2}. Specifically, for any i,j[d]i,j\in[d], from (119) and (125) we have

where the last inequality holds due to (119) and (125). We then apply the matrix Bernstein inequality to yield that: with probability exceeding 1O(d20)1-O(d^{-20}),

where the last step holds as long as pμ2d2logdp\gg\mu^{2}d^{-2}\log d.

Next, we turn to the kk-th row of Vt,(m)\bm{V}^{t,\left(m\right)} for any kmk\neq m. For each s[r]s\in[r], we have

Similar to the proof of Lemma D.9, we can show that with probability at least 1O(d20)1-O(d^{-20}),

where the last inequality follows from (119), and (125). It is easily seen that the bound also holds for the summation over imi\neq m. We can then use Cauchy-Schwartz to arrive at

Combining (128) with (127) and invoking the union bound, we conclude that: with probability exceeding 1O(d20)1-O(d^{-20}),

for some absolute constant C>0C>0, where we use the assumption that κ1\kappa\asymp 1.

(3) For α3\alpha_{3}, following a similar argument for α2\alpha_{2}, we define

for each s[r]s\in[r]. The mm-th row of Wt,(m)\bm{W}^{t,\left(m\right)} is a sum of independent zero-mean random vectors:

With (119) in place, it is easy to verify that

where the last inequality holds as long as pμd2log3dp\gg\mu d^{-2}\log^{3}d.

for each s[r]s\in[r]. Arguing similarly as in the proof of Lemma D.10, we have with probability at least 1O(d20)1-O(d^{-20}),

where the last inequality follows from (119). Additionally, the summation over {i:im}\{i:i\neq m\} can be controlled using the same argument. Therefore, we use Cauchy-Schwarz to find that

Combined with (130) and the assumption that κ1\kappa\asymp 1, we obtain that

for some absolute constant C~>0\widetilde{C}>0.

(4) Regarding α4\alpha_{4}, we use the triangle inequality to show that for each s[r]s\in[r],

where the last line follows from (119). From Corollary D.3, we can further upper bound

As a result, we sum over s[r]s\in[r] and use the Cauchy-Schwartz inequality to derive

where the last step arises from conditions that σλmindlogdp1μ3/2r\frac{\sigma}{\lambda_{\min}^{\star}}\sqrt{\frac{d\log d}{p}}\ll\frac{1}{\mu^{3/2}r} and κ1\kappa\asymp 1.

(5) Taking (126), (129), (132) and (133) together, we can invoke the sample size assumption that pμ3r2d2log3dp\gg\mu^{3}r^{2}d^{-2}\log^{3}d and the union bound to show that: with probability greater than 1O(d10)1-O(d^{-10}) one has

provided 0<\eta\leq\lambda_{\min}^{\star 4/3}/\big{(}32\lambda_{\max}^{\star 8/3}\big{)}, 1-\big{(}\lambda_{\min}^{\star 4/3}/5\big{)}\eta\leq\rho<1 and C6C_{6} is sufficiently large.

A.5 Proof of Lemma 5.5

Fix an arbitrary m[d]m\in[d]. Recall our notation of ΔTt,(m)\bm{\Delta}^{t,\left(m\right)}_{\bm{T}} in (124). To simplify presentation, we further define

leaving us with two terms to deal with. As it turns out, we will show that α1\alpha_{1} is the dominant term and α2\alpha_{2} is negligible. To simplify presentation, we shall assume that {Ei,j,k}i,j,k[d]\{E_{i,j,k}\}_{i,j,k\in[d]} (resp. {χi,j,k}i,j,k[d]\{\chi_{i,j,k}\}_{i,j,k\in[d]}) are independent random variables.

Regarding α1\alpha_{1}, the definition of PΩm\mathcal{P}_{\Omega_{-m}} allows us to derive

We can express \bm{\Delta}^{t,\left(m\right)}_{\bm{T}}=\sum_{s\in[r]}\big{(}\bm{\Delta}_{s}^{t,\left(m\right)}+\bm{u}_{s}^{\star}\big{)}^{\otimes 3}-\bm{u}_{s}^{\star\otimes 3} and compute that

for each s[r]s\in[r]. This further indicates that

In what follows, we will control the four terms separately.

For β1\beta_{1}, by (119), we use Cauchy-Schwarz to show that

Regarding β2\beta_{2}, by (119), we apply the Cauchy-Schwarz inequality again to get that: for each s[r]s\in[r],

Together with the assumption that κ1\kappa\asymp 1, this implies that

With regards to β3\beta_{3}, we show that for each s[r]s\in[r]:

Turning attention to β4\beta_{4}, we observe that for each s[r]s\in[r],

where the last line follows from (119) and the rank assumption rd/μr\ll\sqrt{d/\mu}. As a consequence,

With regards to α2\alpha_{2}, it follows from the definition (46) and (134) that

Recall the definition of PΩm\mathcal{P}_{\Omega_{-m}} and Pm\mathcal{P}_{m}. For the mm-th row, we have

From the triangle inequality, we can further decompose

Let us consider γ1\gamma_{1} first. It is straightforward to calculate that

for each s[r]s\in[r]. From (119), we use the triangle inequality and the Cauchy-Schwarz inequality to upper bound

Moreover, it is easy to see that the upper bound also holds for γ2\gamma_{2}. As for γ3\gamma_{3}, we can express

Similarly, we combine (119) wtih the triangle inequality and the Cauchy-Schwarz inequality to bound

where the last inequality follows from the assumption that κ1\kappa\asymp 1.

Putting (141) and (145) together, we reach the conclusion from (48) and the condition rd/μr\ll\sqrt{d/\mu} that,

provided that 0<\eta\leq\lambda_{\min}^{\star 4/3}/\big{(}32\lambda_{\max}^{\star 8/3}\big{)}, 1λmin4/3η/5ρ<11-\lambda_{\min}^{\star 4/3}\eta/5\leq\rho<1, C7,C8C_{7},C_{8} are sufficiently large.

Recognizing that the above bound holds for any 1md1\leq m\leq d, we conclude the proof.

A.6 Proof of Lemma 5.6

Combining Lemma 5.4 and Lemma 5.5, we conclude that with probability at least 1O(d10)1-O(d^{-10}),

with the proviso that C3/(C5+C7)C_{3}/\left(C_{5}+C_{7}\right) and C4/(C6+C8)C_{4}/\left(C_{6}+C_{8}\right) are both sufficiently large.

Appendix B Proofs for retrieving tensor components

We shall often operate upon the event where the claims in Lemma 5.7 hold, which happens with very high probability (i.e. at least 1O(d10)1-O\left(d^{-10}\right)). Recall the definition of γτ\bm{\gamma}^{\star\tau} in (72). Since θτ=UUgτ\bm{\theta}^{\tau}=\bm{U}\bm{U}^{\top}\bm{g}^{\tau}, this allows us to write that: for each 1ir1\leq i\leq r,

where we recall that λi=ui23\lambda_{i}^{\star}=\|\bm{u}_{i}^{\star}\|_{2}^{3} and ui=ui/ui2\overline{\bm{u}}_{i}^{\star}=\bm{u}_{i}^{\star}/\|\bm{u}_{i}^{\star}\|_{2}. Given that gτ\bm{g}^{\tau} is a Gaussian vector independent of U\bm{U}, we observe that γτ\bm{\gamma}^{\star\tau} is zero-mean Gaussian conditional on Ω\Omega and E\bm{E}. In order to understand the order statistics associated with this vector, we first look at its covariance matrix.

Denote by Στ\bm{\Sigma}^{\tau} the covariance matrix of γτ\bm{\gamma}^{\star\tau} (conditional on U\bm{U}). Then we have

where we denote by PU(z)=UUz\mathcal{P}_{\bm{U}}(\bm{z})=\bm{U}\bm{U}^{\top}\bm{z} and PU(z)=(IUU)z\mathcal{P}_{\bm{U}^{\perp}}(\bm{z})=\left(\bm{I}-\bm{U}\bm{U}^{\top}\right)\bm{z}. In addition, since the unit vector ui\overline{\bm{u}}_{i}^{\star} lies in the span of the columns of Uorth\bm{U}_{\mathsf{orth}}^{\star} (cf. (54)), it follows from Lemma D.6 that

where R\bm{R} is a rotation matrix defined in (55). This together with (147a) gives

where we have also used the fact that \left\|\mathcal{P}_{\bm{U}}\big{(}\overline{\bm{u}}_{i}^{\star}\big{)}\right\|_{2}\leq\left\|\overline{\bm{u}}_{i}^{\star}\right\|_{2}=1. Moreover, taking together (147a), (148b) and the incoherence condition, we see that for any 1ijr1\leq i\neq j\leq r,

which is expected to be small if δ1\delta_{1} is small.

From our assumptions on the sample size, the rank and the condition number, we can invoke Lemma 5.7 to see that κrδ11\kappa r\delta_{1}\ll 1, where δ1\delta_{1} is defined in (150) and κ=λmax/λmin\kappa=\lambda_{\max}^{\star}/\lambda_{\min}^{\star}. Thus, we can decompose Στ\bm{\Sigma}^{\tau} into two components as follows

As it turns out, both Σ^τ\widehat{\bm{\Sigma}}^{\tau} and Σ˘τ\breve{\bm{\Sigma}}^{\tau} are positive definite. Indeed, we first learn from (149) and (150) that: the ii-th digaonal entry of Σ˘τ\breve{\bm{\Sigma}}^{\tau} obeys

where (i) holds since δ1<1\delta_{1}<1 under our assumptions, (ii) follows since κλiκλmin=λmax\kappa\lambda_{i}^{\star}\geq\kappa\lambda_{\min}^{\star}=\lambda_{\max}^{\star}, and the last line makes use of (150). This implies that Σ˘τ\breve{\bm{\Sigma}}^{\tau} is diagonally dominant, and hence Σ˘τ0\breve{\bm{\Sigma}}^{\tau}\succeq\bm{0}. In conclusion, both Σ^τ\widehat{\bm{\Sigma}}^{\tau} and Σ˘τ\breve{\bm{\Sigma}}^{\tau} are positive definite.

Let γ^τ\widehat{\bm{\gamma}}^{\star\tau} and γ˘τ\breve{\bm{\gamma}}^{\star\tau} be independent zero-mean Gaussian random vectors with covariance matrices Σ^τ\widehat{\bm{\Sigma}}^{\tau} and Σ˘τ\breve{\bm{\Sigma}}^{\tau}, respectively. Clearly, the distribution of γτ\bm{\gamma}^{\star\tau} is identical to that of γ^τ+γ˘τ\widehat{\bm{\gamma}}^{\star\tau}+\breve{\bm{\gamma}}^{\star\tau}. Consequently, it allows us to look at the distributions of these two random vectors separately.

In view of (149) and the fact κrδ1<1\kappa r\delta_{1}<1, one has

Thus, with probability at least 1O(d10)1-O\left(d^{-10}\right), we have

We then turn attention to γ^τ\widehat{\bm{\gamma}}^{\star\tau}, which is composed of independent Gaussian random variables. Let us define \widehat{\Delta}^{\tau}:=\widehat{\gamma}_{1}^{\star\tau}-\max_{1<i\leq r}\big{|}\widehat{\gamma}_{i}^{\star\tau}\big{|} for each 1τL1\leq\tau\leq L and let Δ^1(1)Δ^1(2)Δ^1(L)\widehat{\Delta}_{1}^{\left(1\right)}\geq\widehat{\Delta}_{1}^{\left(2\right)}\geq\dots\geq\widehat{\Delta}_{1}^{\left(L\right)} denote the order statistics of \big{\{}\widehat{\Delta}_{1}^{\tau}\big{\}}_{\tau=1}^{L} in descending order. Fix any small constant δ>0\delta>0. Invoke Lemma D.5 to demonstrate that: with probability greater than 1δ/r1-\delta/r,

Putting (152) and (154) together and invoking the triangle inequality immediately establish (74a) and (74b). On the other hand, combining (152) with (155a) proves (75a); (153a) and (155b) taken collectively establish (75b), whereas (153b) and (155c) prove (75c).

B.2 Proof of Lemma 5.13

Recall that the vector of interest uτ\overline{\bm{u}}^{\tau} is the leading singular vector of Mτ\bm{M}^{\tau} (as constructed in (70c)), where Mτ\bm{M}^{\tau} satisfies

Instate the assumptions of Lemma 5.13. With probability at least 1O(d10)1-O\left(d^{-10}\right), one has

As a consequence, recalling the definition of Eop\mathcal{E}_{\mathsf{op}} and Eproj\mathcal{E}_{\mathsf{proj}} in (81) and (79) respectively, one has

It then follows from Weyl’s inequality that

where σi(Z)\sigma_{i}(\bm{Z}) denotes the ii-th largest singular value of a matrix Z\bm{Z} and we use the condition that Eop1\mathcal{E}_{\mathsf{op}}\ll 1. All in all, these arguments justify that the spectrum of Mτ\bm{M}^{\tau} is fairly close to that of Mτ\bm{M}^{\star\tau}.

Next, we look at the gap between the two leading singular values of Mτ\bm{M}^{\star\tau}. To begin with, it is self-evident from the definition of Mτ\bm{M}^{\star\tau} that: u1\overline{\bm{u}}_{1}^{\star} is the singular vector of Mτ\bm{M}^{\star\tau}. In fact, we claim one further result, that is, u1\overline{\bm{u}}_{1}^{\star} is indeed the leading singular vector of Mτ\bm{M}^{\star\tau} whose singular value is given by \sigma_{1}\big{(}\bm{M}^{\star\tau}\big{)}=\gamma_{1}^{\star\tau}. Towards this end, let us define

Let γτ(1)γτ(r)|\gamma^{\star\tau}|_{(1)}\geq\dots\geq|\gamma^{\star\tau}|_{(r)} denote the absolute values of {γiτ}i=1r\{\gamma_{i}^{\star\tau}\}_{i=1}^{r} in descending order. This together with Lemma 5.12 implies that

as long as κr(μlogd)/d1\kappa r\sqrt{(\mu\log d)/d}\ll 1. Given that u1\overline{\bm{u}}_{1}^{\star} is the singular vector of Mτ\bm{M}^{\star\tau} with singular value γ1τ\gamma_{1}^{\star\tau}, we can conclude that \sigma_{1}\big{(}\bm{M}^{\star\tau}\big{)}=\gamma_{1}^{\star\tau}. This also allows us to lower bound the gap between the two largest singular values Mτ\bm{M}^{\star\tau} as follows

provided that κr(μlogd)/d1\kappa r\sqrt{(\mu\log d)/d}\ll 1. We also know from (163) and (164) that

Combined with (162) and Wedin’s theorem, we conclude that

Here, we have made use of the fact that uτ\overline{\bm{u}}^{\tau} is the leading singular vector of Mτ\bm{M}^{\tau} obeying \big{\langle}\overline{\bm{u}}^{\tau},\overline{\bm{u}}_{1}^{\star}\big{\rangle}\geq 0.

B.3 Proof of Lemma B.1

We first consider the spectral norm of Fτ\bm{F}^{\tau}. Recall the definition that θτ=UUgτ\bm{\theta}^{\tau}=\bm{U}\bm{U}^{\top}\bm{g}^{\tau}. Let us define

In the sequel, we shall control these two terms separately.

To bound X\left\|\bm{X}\right\|, observe that θτ\bm{\theta}^{\star\tau} is independent of p1TTp^{-1}\bm{T}-\bm{T}^{\star}. By Lemma D.4, one has

Combining (166) with (167) and (168) reveals that with probability exceeding 1O(d20)1-O\left(d^{-20}\right),

where the last inequality holds as long as pμd2log4dp\gtrsim\mu d^{-2}\log^{4}d.

Turning to Y\bm{Y}, we can simply upper bound

Since rank(UUUorthUorth)2r\mathsf{rank}\left(\bm{U}\bm{U}^{\top}-\bm{U}_{\mathsf{orth}}^{\star}\bm{U}_{\mathsf{orth}}^{\star\top}\right)\leq 2r, Lemma 5.7 and the standard result of Gaussian random vectors yields that: with probability at least 1O(d12)1-O\left(d^{-12}\right),

where we recall the definition of Ese\mathcal{E}_{\mathsf{se}} in (57) and that Ese1/rlogd\mathcal{E}_{\mathsf{se}}\ll 1/\sqrt{r\log d} by our conditions. Moreover, by Lemma D.2, we know that with probability exceeding 1O(d10)1-O\left(d^{-10}\right),

Putting (169) and (172) together shows that

Next, we turn to \big{\|}\bm{F}^{\tau}\overline{\bm{u}}_{1}^{\star}\big{\|}_{2}. By the definition of the operator norm, we know that

Applying Lemma D.4 again reveals that with probability at least 1O(d12)1-O\left(d^{-12}\right),

where the last step arises from the condition that pμd2log4dp\gtrsim\mu d^{-2}\log^{4}d. In addition, realizing that U\bm{U} consists of eigenvectors, standard Gaussian random vectors results give that with probability at least 1O(d12)1-O\left(d^{-12}\right),

Combining (173) and (174) shows that with probability exceeding 1O(d10)1-O\left(d^{-10}\right),

Recall the definition of Cτ\bm{C}^{\tau} in (156). We first consider the spectral norm of Cτ\bm{C}^{\tau}. It is straightforward to compute that

if μr/d1\mu r/d\lesssim 1, where we recall that U=[u1,,ur]\overline{\bm{U}}^{\star}=\left[\overline{\bm{u}}_{1}^{\star},\cdots,\overline{\bm{u}}_{r}^{\star}\right]. Here, the last line holds owing to (8c), Lemma D.1 (which justifies that \big{\|}\overline{\bm{U}}^{\star}\big{\|}\lesssim 1 if rμ/d1r\sqrt{\mu/d}\leq 1) and Lemma 5.12.

The claim (160) arises from the definition of the spectral norm that \big{\|}\bm{C}^{\tau}\overline{\bm{u}}_{1}^{\star}\big{\|}_{2}\leq\big{\|}\bm{C}^{\tau}\big{\|}.

B.4 Proof of Lemma 5.14

Let us fix an arbitrary m[d]m\in\left[d\right]. We remind the readers of several definitions: (1) γ1τ\gamma_{1}^{\star\tau}: see (72); (2) Mτ\bm{M}^{\star\tau}: see (156); and (3) Mτ,(m)\bm{M}^{\tau,\left(m\right)}: see (70d).

Before continuing, we state two immediate facts. First, it has already been observed in Appendix B.2 that u1\overline{\bm{u}}_{1}^{\star} is a singular vector of Mτ\bm{M}^{\star\tau} with singular value γ1τ\gamma_{1}^{\star\tau}, and hence

Here and throughout, Am,:\bm{A}_{m,:} denotes the mm-th row of a matrix A\bm{A}. Second, uτ,(m)\overline{\bm{u}}^{\tau,\left(m\right)} is the top singular vector of Mτ,(m)\bm{M}^{\tau,\left(m\right)} such that uτ,(m),u10\langle\overline{\bm{u}}^{\tau,\left(m\right)},\overline{\bm{u}}_{1}^{\star}\rangle\geq 0, and we denote by γτ(m)\gamma_{\tau}^{(m)} the associated singular value. Recall our definition of ντ,(m)\bm{\nu}^{\tau,\left(m\right)} in Algorithm 6. Similar to the case of ντ\bm{\nu}^{\tau}, we will show shortly in Lemma 5.16 that the global signs of ντ,(m)\bm{\nu}^{\tau,\left(m\right)} and uτ,(m)\overline{\bm{u}}^{\tau,\left(m\right)} coincide, and hence

As a result, the proof of this lemma boils down to showing that uτ,(m)\overline{\bm{u}}^{\tau,\left(m\right)} (and hence ντ,(m)\bm{\nu}^{\tau,\left(m\right)}) is sufficiently close to u1\overline{\bm{u}}_{1}^{\star} in the mm-th entry. Towards this end, observe that

The above two facts (176) and (178) together with the triangle inequality lead to

Therefore, it suffices to upper bound the above three quantities separately.

The first step to bound α3\alpha_{3} (cf. (179)) is to control \big{\|}\bm{M}_{m,:}^{\star\tau}\big{\|}_{2}. Towards this end, we first observe from the incoherence conditions that

When combined with the definition (156), this gives

where (i) arises from (180) and \big{\|}\overline{\bm{U}}^{\star}\big{\|}\lesssim 1 if rμ/d1r\sqrt{\mu/d}\ll 1, and the last step comes from Lemma 5.12.

The second step is to upper bound \big{\|}\overline{\bm{u}}^{\tau,\left(m\right)}-\overline{\bm{u}}_{1}^{\star}\big{\|}_{2}. Towards this, we resort to Wedin’s theorem as follows

where we rely on the fact that \big{\langle}\overline{\bm{u}}^{\tau,\left(m\right)},\overline{\bm{u}}_{1}^{\star}\big{\rangle}\geq 0. To complete this bound, we need to control Mτ,(m)Mτ\bm{M}^{\tau,\left(m\right)}-\bm{M}^{\star\tau}. Before we move on, we find it helpful to introduce

Let u^τ,(m)\widehat{\bm{u}}^{\tau,\left(m\right)} denote the top left singular vector of M^τ,(m)\widehat{\bm{M}}^{\tau,\left(m\right)} such that

Since we have already bounded \big{\|}\bm{M}^{\tau}-\bm{M}^{\star\tau}\big{\|} in Lemma B.1, we can decompose

With these definitions in place, Lemma B.2 below provides the desired bounds.

Instate the assumptions of Lemma 5.14. With probability at least 1O(d10)1-O\left(d^{-10}\right), the following holds simultaneously for all 1md1\leq m\leq d:

where Eloo\mathcal{E}_{\mathsf{loo}} is defined in (65). As a result, one has

We then can further combine (161) and (162) to deduce that

In particular, we have \big{\|}\bm{M}^{\tau,\left(m\right)}-\bm{M}^{\star\tau}\big{\|}\ll\lambda_{\min}^{\star} under our conditions, and it follows from (164) that

To finish up, combine this with (181) and the spectral condition to arrive at

which results from the fact that EprojEop\mathcal{E}_{\mathsf{proj}}\leq\mathcal{E}_{\mathsf{op}} (cf. (81)).

We then turn to α2\alpha_{2} (cf. (179)). Recall the definition of Mτ,(m)\bm{M}^{\tau,\left(m\right)} in (70c). It is straightforward to verify that

where Cτ\bm{C}^{\tau} is defined in (156).

From the incoherence conditions, we can upper bound

where we use the fact that \big{\|}\bm{\gamma}^{\star\tau}\big{\|}_{2}\lesssim\sqrt{r\log d}\,\lambda_{\max}^{\star} from Lemma 5.12 and \big{\|}\overline{\bm{U}}^{\star}\big{\|}\lesssim 1 from Lemma D.1.

is a zero-mean Gaussian random vector conditional on PΩ(E)\mathcal{P}_{\Omega}\left(\bm{E}\right). Using standard results on Gaussian random vectors, one has: with probability at least 1O(d11)1-O\left(d^{-11}\right), for each s[r]s\in\left[r\right] and m[d]m\in\left[d\right],

where we have used Lemma 5.8 as well as the conditions that Eloo1\mathcal{E}_{\mathsf{loo}}\ll 1, \big{\|}\overline{\bm{U}}^{\star}\big{\|}\lesssim 1 and \big{\|}\overline{\bm{U}}^{\star}\big{\|}_{2,\infty}\lesssim\sqrt{\mu r/d}.

Putting the above bounds together, we arrive at

where we remind the reader of the definition of Eop\mathcal{E}_{\mathsf{op}} in (81).

The remaining quantity to control is α1\alpha_{1} (see (179)). Invoke Weyl’s inequality to show that

where the last inequality arises from (188) and Lemma 5.12. Under our sample size, rank and noise conditions, we have

Moreover, we learn from (195) and (181) that

where the last step follows from the fact that μrd\mu r\leq d and κ1\kappa\asymp 1. Hence, we reach the conclusion that

Putting together all of the preceding bounds on α1\alpha_{1}, α2\alpha_{2} and α3\alpha_{3} immediately establishes the lemma.

B.5 Proof of Lemma B.2

First of all, if the claims (185) and (186) can be established, then putting them together yields

where we recall the definition of Eloo\mathcal{E}_{\mathsf{loo}} in (65) and use the sample size, noise and rank conditions. The rest of the proof is thus dedicated to establishing (185) and (186). In what follows, we shall assume {Ei,j,k}i,j,k[d]\{E_{i,j,k}\}_{i,j,k\in[d]} (resp. {χi,j,k}i,j,k[d]\{\chi_{i,j,k}\}_{i,j,k\in[d]}) are independent random variables to simplify presentation.

Recall the definition of Mτ,(m)=p1T(m)×3θτ,(m)\bm{M}^{\tau,\left(m\right)}=p^{-1}\bm{T}^{(m)}\times_{3}\bm{\theta}^{\tau,\left(m\right)} in (70c). Comparing this with the definition of M^τ,(m)\widehat{\bm{M}}^{\tau,\left(m\right)} in (183), we see that

Note that θτ,(m)N(0,U(m)U(m))\bm{\theta}^{\tau,\left(m\right)}\sim\mathcal{N}\left(\bm{0},\bm{U}^{\left(m\right)}\bm{U}^{\left(m\right)\top}\right) conditional on PΩ(E)\mathcal{P}_{\Omega}\left(\bm{E}\right). Standard Gaussian concentration inequalities reveal that with probability exceeding 1O(d10)1-O(d^{-10}),

From Lemmas 5.7-5.8 and the fact that max{Ese,Eloo}1\max\{\mathcal{E}_{\mathsf{se}},\mathcal{E}_{\mathsf{loo}}\}\ll 1, we have

As a consequence, standard concentration results assert that with probability 1O(d10)1-O(d^{-10}),

Regarding the mm-th row of M^τ,(m)Mτ,(m)\widehat{\bm{M}}^{\tau,\left(m\right)}-\bm{M}^{\tau,\left(m\right)}, apply Lemma D.9 to show that with probability 1O(d11)1-O\left(d^{-11}\right),

where the last inequality comes from (200). In addition, Lemma D.10 indicates that with probability exceeding 1O(d11)1-O\left(d^{-11}\right),

where the second line comes from (200) and (198), and the last inequality holds as long as pμd2log5dp\gg\mu d^{-2}\log^{5}d. These together with (197c) allow us to obtain

Clearly, this bound is also valid for \sum_{i:i\neq m}\big{\{}\big{(}\widehat{\bm{M}}^{\tau,\left(m\right)}-\bm{M}^{\tau,\left(m\right)}\big{)}_{i,m}\big{\}}^{2}, namely,

When it comes to the remaining entries of M^τ,(m)Mτ,(m)\widehat{\bm{M}}^{\tau,\left(m\right)}-\bm{M}^{\tau,\left(m\right)}, by the fact that the spectral norm of a submatrix is always less than or equal to that of the whole matrix, applying the matrix Bernstein inequality gives that with probability 1O(d11)1-O\left(d^{-11}\right),

as long as our sample size and rank condition holds.

Putting the preceding bounds together yields

Recall the definitions of Mτ\bm{M}^{\tau} and M^τ,(m)\widehat{\bm{M}}^{\tau,\left(m\right)} in (279b) and (183), respectively. From the definition of the operator norm and the triangle inequality, we have

As shown in (193), with probability at least 1O(d12)1-O\left(d^{-12}\right),

Consequently, we know from Lemma 5.8 that with probability at least 1O(d10)1-O\left(d^{-10}\right),

where we use the fact that \big{\|}\overline{\bm{U}}^{\star}\big{\|}\lesssim 1 if rμ/d1r\sqrt{\mu/d}\ll 1.

When it comes to α2\alpha_{2}, combining (171) and (194) with our sample size, noise and rank conditions, one has

Combining (204), (206) and (207), we conclude that

B.6 Proof of Lemma 5.15

We start with the first claim regarding uτντ,(m)2\|\overline{\bm{u}}^{\tau}-\bm{\nu}^{\tau,\left(m\right)}\|_{2}, or equivalently, uτuτ,(m)2\|\overline{\bm{u}}^{\tau}-\overline{\bm{u}}^{\tau,\left(m\right)}\|_{2} (as argued in the proof of Lemma 5.14). By the triangle inequality, we can upper bound the following two terms separately:

Here, we remind the reader that u^τ,(m)\widehat{\bm{u}}^{\tau,\left(m\right)} is the top left singular vector of M^τ,(m)\widehat{\bm{M}}^{\tau,\left(m\right)} (see (183)) obeying \big{\langle}\widehat{\bm{u}}^{\tau,\left(m\right)},\overline{\bm{u}}_{1}^{\star}\big{\rangle}\geq 0.

The first term β1\beta_{1} shall be bounded via Wedin’s theorem. From (163) and (164), we have

Note that we have already shown in the proof of Lemma B.2 and Lemma B.4 that \big{\|}\overline{\bm{u}}^{\tau}-\overline{\bm{u}}_{1}^{\star}\big{\|}_{2}=o(1) and \big{\|}\widehat{\bm{u}}^{\tau,\left(m\right)}-\overline{\bm{u}}_{1}^{\star}\big{\|}_{2}=o(1), which implies that uτ\bm{u}^{\tau} and u^τ,(m)\widehat{\bm{u}}^{\tau,\left(m\right)} are positively correlated. Thus, one can invoke Wedin’s theorem and use the bound (186) to reach

In addition to this bound on β1\beta_{1}, we also make note of the following simple bound

where the last inequality follows from our sample size, noise and rank condition that Eloorlogd1\mathcal{E}_{\mathsf{loo}}\sqrt{r\log d}\ll 1.

The second term β2\beta_{2} is also controlled via Wedin’s theorem:

The denominator term is easy to handle. With (196) and (209) in mind, we can apply Weyl’s inequality to obtain

From Lemma B.2, one has \big{\|}\bm{M}^{\tau}-\bm{M}^{\tau,\left(m\right)}\big{\|}\ll\lambda_{\min}^{\star}. Therefore, we know that

In addition, Lemma B.3 below develops an upper bound on the numerator term:

Instate the assumptions of Lemma 5.15. With probability at least 1O(d10)1-O\left(d^{-10}\right), one has

Substitution of the above bounds into (212) yields

where the last step holds as long as pμ2r2d2log2dp\gg\mu^{2}r^{2}d^{-2}\log^{2}d and σ/λminp/(rdlog2d)\sigma/\lambda_{\min}^{\star}\ll\sqrt{p/(rd\log^{2}d)}. In addition, from (211), we observe that

As a consequence, one immediately obtains

Combining (208), (210) and (215) and the definition of Eloo\mathcal{E}_{\mathsf{loo}}, we arrive at

Comparing this bound with the first claim of the lemma, we see that the claim can be established as long as we can show that

To justify this bound (217), we make use of Lemma 5.14 to derive that

for each m[d]m\in[d]. Maximizing over m[d]m\in[d] gives that

where we use the condition that (Eloo+Eop)rlogd1\left(\mathcal{E}_{\mathsf{loo}}+\mathcal{E}_{\mathsf{op}}\right)\sqrt{r\log d}\ll 1. Apply the triangle inequality to yield

These allow us to establish the claim (217), which in turn finishes the proof for the first claim of this lemma.

The second claim (83) of this lemma follows immediately from (217) and (218).

It remains to prove the last claim (84). Recall the definition of λτ\lambda_{\tau} and λτ(m)\lambda_{\tau}^{\left(m\right)} in (71). We can decompose

In what follows, we will control β1\beta_{1} and β2\beta_{2} seperately.

For β1\beta_{1}, we note that all non-zero entries of T(m)T\bm{T}^{\left(m\right)}-\bm{T} are located in the mmth slices, and are independent of uτ,(m)\overline{\bm{u}}^{\tau,\left(m\right)}. This type of quantities have appeared many times and we omit the detailed proof for conciseness. By the Bernstein inequality, one can show that with probability at least 1O(d10)1-O\left(d^{-10}\right),

Next, we turn to β2\beta_{2}. From our sample size and noise condition, Lemma D.1 and Corollary D.3 demonstrates that with probability at least 1O(d10)1-O\left(d^{-10}\right),

where we use the fact that the tensor spectral norm is always less than or equal to that of its matricization. By the definition of the operator norm, one has

where we use (216) and (217) in the last step and the fact that Elooμrlogd/d1\mathcal{E}_{\mathsf{loo}}\sqrt{\mu r\log d/d}\ll 1.

Combining (221) and (222) immediately establishes (84).

B.7 Proof of Lemma B.3

Recalling the definitions of M^τ,(m)\widehat{\bm{M}}^{\tau,\left(m\right)} and Mτ,(m)\bm{M}^{\tau,\left(m\right)} in (183) and (70c), respectively, we observe that M^τ,(m)Mτ,(m)\widehat{\bm{M}}^{\tau,\left(m\right)}-\bm{M}^{\tau,\left(m\right)} is independent of uτ,(m)\overline{\bm{u}}^{\tau,\left(m\right)} conditional on PΩm(E)\mathcal{P}_{\Omega_{-m}}\left(\bm{E}\right) and g\bm{g}.

The mm-th entry of \big{(}\widehat{\bm{M}}^{\tau,\left(m\right)}-\bm{M}^{\tau,\left(m\right)}\big{)}\overline{\bm{u}}^{\tau,\left(m\right)} can be written as

For the first term α1\alpha_{1}, it is easily seen from (200) and incoherence conditions that

Apply the Bernstein inequality to yield that with probability at least 1O(d11)1-O\left(d^{-11}\right),

where the last inequality holds as long as pd2log2dp\gg d^{-2}\log^{2}d.

Regarding α2\alpha_{2} (cf. (223)), it is straightforward to compute that

with ψ1\|\cdot\|_{\psi_{1}} denoting the sub-exponential norm, and

Then the Bernstein inequality reveals that with probability at least 1O(d11)1-O\left(d^{-11}\right),

where the last inequality follows from our sample size condition.

Substituting these into (223), we arrive at

For the remaining entries of \big{(}\widehat{\bm{M}}^{\tau,\left(m\right)}-\bm{M}^{\tau,\left(m\right)}\big{)}\overline{\bm{u}}^{\tau,\left(m\right)}, we have

for any imi\neq m. From Lemma D.9 and (200), we have, with probability at least 1O(d11)1-O\left(d^{-11}\right), that

Combined with (198) and (200), Lemma D.10 reveals that with probability at least 1O(d11)1-O\left(d^{-11}\right),

Therefore, combine (224) and (225) to obtain that

B.8 Proof of Lemma 5.16

By definition, the only possible difference between uτ\overline{\bm{u}}^{\tau} and ντ\bm{\nu}^{\tau} lies in how their global signs are chosen. To show that uτ=ντ\overline{\bm{u}}^{\tau}=\bm{\nu}^{\tau}, we first claim for the moment that

where Eproj\mathcal{E}_{\mathsf{proj}} is defined in (79). Given that Eproj1\mathcal{E}_{\mathsf{proj}}\ll 1 under our sample size, noise and rank condition, this immediately implies that \big{\langle}p^{-1}\bm{T},(\overline{\bm{u}}^{\tau})^{\otimes 3}\big{\rangle}>0. Consequently, by construction, the global signs of uτ\overline{\bm{u}}^{\tau} and ντ\bm{\nu}^{\tau} coincide. Moreover, from (84) and the condition that Elooμrlogd/d1\mathcal{E}_{\mathsf{loo}}\sqrt{\mu r\log d/d}\ll 1, one also knows that \big{\langle}p^{-1}\bm{T}^{\left(m\right)},(\overline{\bm{u}}^{\tau,\left(m\right)})^{\otimes 3}\big{\rangle}>0 and hence the global signs of uτ,(m)\overline{\bm{u}}^{\tau,\left(m\right)} and ντ,(m)\bm{\nu}^{\tau,\left(m\right)} also coincide.

In addition, recall that \lambda_{\tau}=\big{\langle}p^{-1}\bm{T},({\bm{\nu}}^{\tau})^{\otimes 3}\big{\rangle}. One thus has

which taken collectively with (226) justifies (85).

The rest of the proof then comes down to establishing the claim (226). Towards this, we first decompose

In what follows, we shall upper bound these three terms separately.

Let us start with β1\beta_{1}, For simplicity of notation, let us define Δ1:=uτu1\bm{\Delta}_{1}:=\overline{\bm{u}}^{\tau}-\overline{\bm{u}}_{1}^{\star}. By construction, T\bm{T} and T\bm{T}^{\star} are symmetric. We then can expand

We first look at the first term of (229) which only consists of u1\overline{\bm{u}}_{1}^{\star}. As shown in (173), with probability at least 1O(d11)1-O\left(d^{-11}\right), one has

where we recall the definition of Eproj\mathcal{E}_{\mathsf{proj}} in (79). As for the term linear in Δ1\bm{\Delta}_{1}, by Lemma 5.13, we know that Δ12Eproj\left\|\bm{\Delta}_{1}\right\|_{2}\lesssim\mathcal{E}_{\mathsf{proj}}. As a result, one has

We then turn to the quadratic terms in Δ1\bm{\Delta}_{1}. Similar to the above arguments, one can deduce that

Finally, we can simply upper bound the last term in (173) by

where the last step is due to the fact that Eop1\mathcal{E}_{\mathsf{op}}\ll 1. By our sample size, noise and rank conditions, one has Eproj1\mathcal{E}_{\mathsf{proj}}\ll 1. Putting these bounds together reveals that

Recall the definition of β2\beta_{2} in (228) and Δ1=uτu1\bm{\Delta}_{1}=\overline{\bm{u}}^{\tau}-\overline{\bm{u}}_{1}^{\star}. We can further decompose

We first consider the first term which is linear in Δ1\bm{\Delta}_{1}. Since T\bm{T}^{\star} is a symmetric tensor, we have

which arises from \max_{s\neq i}\left|\big{\langle}\overline{\bm{u}}_{s}^{\star},\overline{\bm{u}}_{1}^{\star}\big{\rangle}\right|\leq\sqrt{\mu/d} and \big{\|}\overline{\bm{U}}^{\star}\big{\|}\lesssim 1 as long as rμ/d1r\sqrt{\mu/d}\ll 1. As a result, one has

Finally, using the fact that the tensor spectral norm is always less than or equal to that of its matricization, we find that

Combining this with the fact that Eproj1\mathcal{E}_{\mathsf{proj}}\ll 1, we conclude that

It remains to control β3\beta_{3}. Straightforward calculation reveals that

By the incoherence conditions, we can upper bound

Putting (230), (231) and (232) together, we find that

where the last step follows from the condition μd\mu\leq d, rd/μr\ll\sqrt{d/\mu} and the definition Eprojμr/d\mathcal{E}_{\mathsf{proj}}\geq\sqrt{\mu r/d} (cf. (79)).

B.9 Proof of Lemma 5.17

We first show that for each i[r]i\in\left[r\right], \big{(}\bm{w}_{i},\lambda_{i}\big{)} and \big{(}\bm{w}_{i}^{\left(m\right)},\lambda_{i}^{\left(m\right)}\big{)} (returned by \textscPrune(\textsc{Prune}() in Algorithm 3 and Algorithm 6 respectively) satisfy (80), (82) and (84); in other words, we want to show that they correspond to the same index τ[L]\tau\in\left[L\right] (and are hence produced using the same Gaussian random vector gτ\bm{g}^{\tau}). The proof idea is this: given that the proposed algorithms select the pair with the largest spectral gap in each round of \textscPrune(\textsc{Prune}(), it suffices to ensure that there is sufficient separation between the largest and the second largest spectral gaps (so that both algorithms can identify the same τ\tau).

By Lemma 5.12 and union bounds, we know that with probability at least 1δ1-\delta, for each i[r]i\in\left[r\right],

where we recall that \Delta_{i}^{\tau}:=\gamma_{i}^{\star\tau}-\max_{j:j\neq i}\big{|}\gamma_{j}^{\star\tau}\big{|}, and Δi(1)Δi(2)Δi(L)\Delta_{i}^{\left(1\right)}\geq\Delta_{i}^{\left(2\right)}\geq\dots\geq\Delta_{i}^{\left(L\right)} denote the order statistics of \big{\{}\Delta_{i}^{\tau}\big{\}}_{\tau=1}^{L} in descending order. As shown in the proof of Lemma 5.13, the spectral gap of Mτ\bm{M}^{\tau} is well approximated by maxiΔiτ\max_{i}\Delta_{i}^{\tau}, namely,

under our sample size, noise and rank conditions (67). Moreover, from Lemma B.2, we see that Mτ\bm{M}^{\tau} and Mτ,(m)\bm{M}^{\tau,\left(m\right)} are extremely close in terms of the spectral norm, i.e.

This implies that the perturbation incurred by the leave-out-one procedure is relatively small compared to the difference between the largest and the second largest spectral gaps of Mτ\bm{M}^{\tau}. Consequently, the leave-one-out estimates \big{\{}\big{(}\bm{w}_{i}^{\left(m\right)},\lambda_{i}^{\left(m\right)}\big{)}\big{\}}_{i=1}^{r} returned by Algorithm 6 and the true estimates \big{\{}\big{(}\bm{w}_{i},\lambda_{i}\big{)}\big{\}}_{i=1}^{r} should correspond to the same trials and should be generated by the same set of Gaussian random vectors. As a result, they obey (69a), (69b) and (69c) for all 1md1\leq m\leq d.

From the discussion above, we also know that as long as σ1(Mτ)σ2(Mτ)λmin\sigma_{1}\left(\bm{M}^{\tau}\right)-\sigma_{2}\left(\bm{M}^{\tau}\right)\gtrsim\lambda_{\min}^{\star}, one has ντui2Eproj\left\|\bm{\nu}^{\tau}-\overline{\bm{u}}_{i}^{\star}\right\|_{2}\lesssim\mathcal{E}_{\mathsf{proj}} for some i[r]i\in\left[r\right]. This is an immediate consequence of Lemma 5.13 and the fact that the spectral gap of Mτ\bm{M}^{\tau} and maxiΔiτ\max_{i}\Delta_{i}^{\tau} are extremely close.

It remains to show that our pruning procedure can return estimates of tensor factors without duplicates. Suppose that there exist 1τ1τ2L1\leq\tau_{1}\neq\tau_{2}\leq L such that ντ1ui2Eproj\left\|\bm{\nu}^{\tau_{1}}-\overline{\bm{u}}_{i}^{\star}\right\|_{2}\lesssim\mathcal{E}_{\mathsf{proj}} and ντ2ui2Eproj\left\|\bm{\nu}^{\tau_{2}}-\overline{\bm{u}}_{i}^{\star}\right\|_{2}\lesssim\mathcal{E}_{\mathsf{proj}} for some i[r]i\in\left[r\right]. By the triangle inequality, one has

provided that Eproj1\mathcal{E}_{\mathsf{proj}}\ll 1. In addition, for any ji,j[r]j\neq i,j\in\left[r\right], we know that exists some 1τ3L1\leq\tau_{3}\leq L such that

Recall our incoherence condition in (8c). It is easy to see that

with the proviso that μd\mu\ll d and Eproj1\mathcal{E}_{\mathsf{proj}}\ll 1.

The above argument reveals a clear separation between ντ1,ντ3\left|\left\langle\bm{\nu}^{\tau_{1}},\bm{\nu}^{\tau_{3}}\right\rangle\right| and ντ1,ντ2\left|\left\langle\bm{\nu}^{\tau_{1}},\bm{\nu}^{\tau_{2}}\right\rangle\right|. As an immediate consequence, the proposed pruning procedure successfully removes all duplication while securing an estimate for each tensor factor.

B.10 Proof of Corollary 5.11

Fix any arbitrary small constant δ>0\delta>0. From Theorems 5.9-5.10 and the assumptions of Theorem 2.8, one knows that with probability exceeding 1δ1-\delta, there exists a permutation π():[d][d]\pi(\cdot):[d]\mapsto[d] such that for all 1ir1\leq i\leq r,

for some 0<δ1/(μ3/2r)<10<\delta\ll 1/(\mu^{3/2}r)<1. To prove the corollary, we shall just combine the above results.

Without loss of generality, assume that π(i)=i\pi(i)=i for each i[r]i\in\left[r\right]. Given that δ1\delta\ll 1 and κ1\kappa\asymp 1, by the triangle inequality, one has λiλi\lambda_{i}\asymp\lambda_{i}^{\star} for all i[r]i\in\left[r\right], which further implies that

Consequently, we can apply the triangle inequality to demonstrate that: for each 1ir1\leq i\leq r,

hold for all i[r]i\in\left[r\right] and m[d]m\in\left[d\right]. Recall that \bm{U}^{0}=\big{[}\lambda_{i}^{1/3}\bm{w}^{i}\big{]}_{1\leq i\leq r}. One can deduce that

Appendix C Proof of Corollary 2.9

This section establishes Corollary 2.9. First of all, it is easy to see that: given the estimation accuracy established in Theorem 2.8, the permutation matrices that best match Ut\bm{U}^{t} to U\bm{U}^{\star} remain unchanged as tt increases. Therefore, we assume without loss of generality that Ir=argminΠpermrUtΠU\bm{I}_{r}=\arg\min_{\bm{\Pi}\in\mathsf{perm}_{r}}\left\|\bm{U}^{t}\bm{\Pi}-\bm{U}^{\star}\right\| for all t0t\geq 0.

for any 0δ1/(μ3/2r)10\leq\delta\ll 1/(\mu^{3/2}r)\leq 1, then one has

where T:=i=1ruiuiui\bm{T}:=\sum_{i=1}^{r}\bm{u}_{i}\otimes\bm{u}_{i}\otimes\bm{u}_{i}. As already shown in the analysis of Theorem 2.8, one has

from which Corollary 2.9 follows immediately.

It remains to prove the claim (234). For notational convenience, let us define Δ:=UU\bm{\Delta}:=\bm{U}-\bm{U}^{\star} and Δs:=usus\bm{\Delta}_{s}:=\bm{u}_{s}-\bm{u}_{s}^{\star} for each 1sr1\leq s\leq r. Then we can expand

Next, we turn to the \left\|\cdot\right\|_{\infty} loss. Again, it suffices to focus on 1srus2Δs\sum_{1\leq s\leq r}\bm{u}_{s}^{\star\otimes 2}\otimes\bm{\Delta}_{s}, 1srusΔs2\sum_{1\leq s\leq r}\bm{u}_{s}^{\star}\otimes\bm{\Delta}_{s}^{\otimes 2} and 1srΔs3\sum_{1\leq s\leq r}\bm{\Delta}_{s}^{\otimes 3}. From (104), (105) and (106) shown in the proof of Lemma 5.1, one has

Putting (240), (241) and (242) together with the condition that 0<δr110<\delta\ll r^{-1}\leq 1, we arrive at

Appendix D Auxiliary lemmas

This section gathers several auxiliary lemmas that prove useful when establishing our main results.

We begin by stating all auxiliary lemmas formally, with the proofs postponed to subsequent subsections. We shall define

Suppose that Assumption 2.1 holds, and assume that rμ/dc1r\sqrt{\mu/d}\leq c_{1} for some sufficiently small universal constant c2>0c_{2}>0. Then for dd sufficiently large, the matrices A\bm{A}^{\star}, B\bm{B}^{\star} and Uorth\bm{U}_{\mathsf{orth}}^{\star} (defined respectively in (24), (63) and (54)) obey

Here, A2,:=maxiAi,:2\|\bm{A}\|_{2,\infty}:=\max_{i}\|\bm{A}_{i,:}\|_{2}, λ(i)\lambda_{(i)}^{\star} stands for the ii-th largest value in {λi}1ir\{\lambda_{i}^{\star}\}_{1\leq i\leq r} (or equivalently {ui23}1ir\{\|\bm{u}_{i}^{\star}\|_{2}^{3}\}_{1\leq i\leq r}), and λi(B)\lambda_{i}(\bm{B}^{\star}) represents the ii-th largest eigenvalue of B\bm{B}^{\star}.

Then with probability exceeding 1O(d10)1-O\left(d^{-10}\right), one has

An immediate consequence of this lemma is the following:

With probability at least 1O(d10)1-O\left(d^{-10}\right), one has

where ×3\times_{3} is defined in Section 2.4. The results also holds if we replace ×3\times_{3} with ×1\times_{1} or ×2\times_{2}.

Let {Xi,j}1ir,1jL\{X_{i,j}\}_{1\leq i\leq r,1\leq j\leq L} be a sequence of i.i.d. standard Gaussian random variables, where r2r\geq 2 and L1L\geq 1. Consider some quantities κ1,Δ>0,0<δ<1/2\kappa\geq 1,\Delta>0,0<\delta<1/2. There exists some universal constant C>0C>0 such that if

then with probability at least 1δ1-\delta, there exists some 1j0L1\leq j_{0}\leq L such that

In addition, define Δj:=X1,jκmaxi:1<irXi,j\Delta_{j}:=X_{1,j}-\kappa\max_{i:1<i\leq r}\left|X_{i,j}\right| for each 1jL1\leq j\leq L. Then with probability at least 12δ1-2\delta,

where Δ(1)Δ(2)Δ(L)\Delta_{(1)}\geq\Delta_{(2)}\geq\dots\geq\Delta_{(L)} denote the order statistics of {Δj}j=1L\left\{\Delta_{j}\right\}_{j=1}^{L} in descending order.

where we denote by PV(u0):=VVu0\mathcal{P}_{\bm{V}}(\bm{u}_{0}):=\bm{V}\bm{V}^{\top}\bm{u}_{0} and \mathcal{P}_{\bm{V}^{\perp}}(\bm{u}_{0})=\big{(}\bm{I}_{d}-\bm{V}\bm{V}^{\top}\big{)}\bm{u}_{0}.

Additionally, we record several facts concerning the set of Bernoulli random variables {χi,j,k}1i,j,kd\{\chi_{i,j,k}\}_{1\leq i,j,k\leq d}. We recall that

which is a Bernoulli random variable with mean pp.

Suppose that pd2logdp\gtrsim d^{-2}\log d. With probability exceeding 1O(d10)1-O\left(d^{-10}\right), one has

Suppose that pd2log2dp\gtrsim d^{-2}\log^{2}d. With probability exceeding 1O(d10)1-O\left(d^{-10}\right), one has

D.2 Proof of Lemma D.1

To begin with, the incoherence condition (8b) gives

where the first inequality follows from the fact that uu,vv=u,v2\left\langle\bm{u}\otimes\bm{u},\bm{v}\otimes\bm{v}\right\rangle=\left\langle\bm{u},\bm{v}\right\rangle^{2}, the second inequality holds true due to (8b) and (8c), and the last inequality holds as long as rd/μr\leq d/\mu. This immediately yields the advertised bound on \big{\|}\bm{A}^{\star}\big{\|}_{2,\infty}.

where the second inequality is valid due to (8b) and (8c), and the last inequality holds as long as rd/μr\leq\sqrt{d/\mu}. This yields the claimed bound regarding A2,\|\bm{A}^{\star\top}\|_{2,\infty}.

Regarding the spectrum of A\bm{A}^{\star}, B\bm{B}^{\star}, Uorth\bm{U}_{\mathsf{orth}}^{\star} and U\overline{\bm{U}}^{\star}, we refer the reader to the proof of [CLC+20, Corollary 1].

We now move on to B2,\left\|\bm{B}^{\star}\right\|_{2,\infty}. For any i[d]i\in\left[d\right], it is seen that

Here, the last line makes use of the bound Aλmax(1+O(rμ/d))2λmax\|\bm{A}^{\star}\|\leq\lambda_{\max}^{\star}(1+O(r\sqrt{\mu/d}))\leq 2\lambda_{\max}^{\star}, which holds if rμ/dc1r\sqrt{\mu/d}\leq c_{1} for some sufficiently small constant c1>0c_{1}>0. It then follows from (254) that

D.3 Proof of Lemma D.2 and Corollary D.3

We then need to bound the quantity presented in (257).

For some β>0\beta>0 to be specified later, one can upper bound

The Bernstein inequality then tells us that

In particular, this implies that with probability exceeding 1O(d20)1-O\left(d^{-20}\right),

where we have used the AM-GM inequality in the last step. Therefore, by taking

for some sufficiently large constant C>0C>0, we arrive at

Given that βM\beta\gg M, for any tβt\geq\beta one has the following relations about several events

for any tβt\geq\beta (with β\beta obeying (260)). As a result, one can bound

where (i) follows from (261), (ii) comes from (259), (iii) is a consequence of (262), and (iv) holds true when C>0C>0 is sufficiently large. Consequently,

Recognizing that the magnitudes of all entries of R\bm{R} are bounded by BB, we can invoke Talagrand’s concentration inequality [Ver18, Theorem 5.2.16] for convex Lipschitz functions of independent bounded random variables to show that with probability 1O(d10)1-O(d^{-10}),

D.3.2 Proof of Corollary D.3

Now we apply Lemma D.2 to our concrete setting. We first look at p1PΩ(T)Tp^{-1}\mathcal{P}_{\Omega}\left(\bm{T}^{\star}\right)-\bm{T}^{\star} and treat it as R\bm{R} in Lemma D.2. With the help of Lemma D.1, it is straightforward to compute that

We then turn to PΩ(E)\mathcal{P}_{\Omega}\left(\bm{E}\right). Recognizing that the entries of E\bm{E} might be unbounded, we invoke the following truncation trick to cope with this unboundedness issue. Specifically, define E~=[E~i,j,k]1i,j,kd\widetilde{\bm{E}}=[\widetilde{E}_{i,j,k}]_{1\leq i,j,k\leq d} where

for some some sufficiently large constant c1>0c_{1}>0. Moreover, E~i,j,k\widetilde{E}_{i,j,k} is zero-mean because we assume that the distribution of Ei,j,kE_{i,j,k} is symmetric about 0. Standard concentration inequalities reveal that: with probability exceeding 1O(d10)1-O\left(d^{-10}\right), one has E=E~\bm{E}=\widetilde{\bm{E}}. Hence, it suffices to bound \big{\|}\mathcal{P}_{\Omega}(\widetilde{\bm{E}})\big{\|}. Towards this end, simple calculation reveals that

This together with (264) as well as the high-probability event E=E~\bm{E}=\widetilde{\bm{E}} completes the proof.

D.4 Proof of Lemma D.4

We shall apply the truncated matrix Bernstein inequality to control the spectral norm of X\bm{X}.

Second, using the same truncation argument as in the proof of Lemma D.2 in Appendix D.3, we can assume Ei,j,kσlogd|E_{i,j,k}|\lesssim\sigma\sqrt{\log d} for all 1i,j,kd1\leq i,j,k\leq d (which holds with very high probability). The Bernstein inequality reveals that

for each (i,j)[d]2\left(i,j\right)\in\left[d\right]^{2}, where

This implies that with probability exceeding 1O(d20)1-O\left(d^{-20}\right),

for some sufficiently large constant C>0C>0, then one has

In view of our choice of β\beta, we know that min{t2/S2,t/L}t/max{S/logd,L}\min\left\{t^{2}/S^{2},t/L\right\}\geq t/\max\left\{S/\sqrt{\log d},L\right\} for any tβt\geq\beta. As a result, for dd sufficiently large, we have

Invoke the matrix Bernstein inequality to demonstrate that with probability 1O(d10)1-O\left(d^{-10}\right),

where (i) is due to Lemma D.1, and (ii) follows as long as pd2log3dp\gtrsim d^{-2}\log^{3}d and μlog2dd\mu\log^{2}d\lesssim d.

D.5 Proof of Lemma D.5

Recall that for a standard Gaussian random variable ZN(0,1)Z\sim\mathcal{N}\left(0,1\right), one has

for all t>5/4t>\sqrt{5/4}. Observing that κ2logr+Δ5/4\kappa\sqrt{2\log r}+\Delta\geq\sqrt{5/4} since κ1\kappa\geq 1 and r2r\geq 2, we can invoke the above tail bound to deduce that

where we use the elementary inequality \big{(}\kappa\sqrt{2\log r}+\Delta)^{2}\leq 4\kappa^{2}\log r+2\Delta^{2}. In addition, it follows from the union bound that

To prove the claim, it is sufficient to choose LL such that

Note that log(1x)1/(2x)\log(1-x)\leq-1/(2x) for 0<x<1/40<x<1/4. In view of (267) and (268), one can verify that the above inequality (269) as long as

To prove the second claim, recall the definitions that

and that Δ(1)Δ(2),Δ(L)\Delta_{(1)}\geq\Delta_{(2)}\geq\dots,\geq\Delta_{(L)} denote {Δj}j=1L\left\{\Delta_{j}\right\}_{j=1}^{L} in descending order. For any ϵ>0\epsilon>0, one has

where the last line holds because the distribution of Δjmaxk:kjΔk\Delta_{j}-\max\nolimits_{k:k\neq j}\Delta_{k} conditional on Δj=max1krΔk\Delta_{j}=\max\nolimits_{1\leq k\leq r}\Delta_{k} is identical for all 1jL1\leq j\leq L. In addition, it is straightforward to see that

Next, observe that X1,1X_{1,1} is independent of Y1,1Y_{1,1}, and hence we have

In other words, fϵ(x)f_{\epsilon}(x) is monotonically increasing in xx for any given ε>0\varepsilon>0. Therefore, for any 0x<B0\leq x<B for some sufficiently large B>5/4B>\sqrt{5/4} (to be specified later), the above bounds taken together give

where (i) arises from the monotonicity of fϵ()f_{\epsilon}(\cdot), and (ii) relies on (266). By taking ϵ=δ/(5B)\epsilon=\delta/(5B), we obtain

for any 0xB0\leq x\leq B. Recall that Y1,1=max1<irκXi,1+maxk:k1ΔkY_{1,1}=\max_{1<i\leq r}\kappa X_{i,1}+\max_{k:k\neq 1}\Delta_{k}. By standard Gaussian concentration inequalities, with probability at least 1δ1-\delta one has

where the last step arises from the lower bound on LL in (270). If we choose B=C\big{(}\sqrt{\log L}+\sqrt{\log(1/\delta)}\big{)} for some sufficiently large universal constant C>0C>0, then this immediately implies that

D.6 Proof of Lemma D.6

where the last identity follows since u0\bm{u}_{0} is assumed to lie within span(U)\mathsf{span}(\bm{U}). As a result,

The Pythagorean theorem then gives \big{\|}\mathcal{P}_{\bm{V}}(\bm{u}_{0})\big{\|}_{2}=\sqrt{\|\bm{u}_{0}\|_{2}-\big{\|}\mathcal{P}_{\bm{V}^{\perp}}(\bm{u}_{0})\big{\|}_{2}^{2}}\geq\sqrt{1-\delta^{2}}.

D.7 Proof of Lemma D.7

Invoke the Bernstein inequality to show that: with probability exceeding 1O(d20)1-O(d^{-20}),

where the last line holds with the proviso that pd2logdp\gtrsim d^{-2}\log d.

D.8 Proof of Lemma D.8

Since the Ei,j,kE_{i,j,k}’s are independent sub-Gaussian random variables with variance at most σ2\sigma^{2}, one has

Here, ψ1\|\cdot\|_{\psi_{1}} denotes the sub-exponential norm [Ver10]. Taken together with the Bernstein inequality, these yield that with probability exceeding 1O(d20)1-O(d^{-20}),

provided that pd2log2dp\gtrsim d^{-2}\log^{2}d.

D.9 Proof of Lemma D.9

Fix an arbitrary 1id1\leq i\leq d. We first define a sequence of independent zero-mean random variables {Xj}j[d]\{X_{j}\}_{j\in[d]} as follows

The Bernstein inequality indicates that: with probability at least 1O(d20)1-O(d^{-20}),

Moreover, we can also bound the variance of XjX_{j} as follows

Given that XjX_{j} might be overly large in some rare case, we introduce a sequence {Yj}\{Y_{j}\}, where we denote by YjY_{j} the truncated version of XjX_{j} as follows

We have learn from (271) and the union bound that with probability at least 1O(d15)1-O(d^{-15}), one has Yj=XjY_{j}=X_{j} for all j[d]j\in[d].

Using the above bounds on the XjX_{j}’s, one observes that {Yj}j[d]\{Y_{j}\}_{j\in[d]} is a sequence of independent random variables satisfying

We can apply the Bernstein inequality to conclude that: with probability greater than 1O(d15)1-O(d^{-15}),

where (i) is due to the AM-GM inequality, (ii) makes use of Lemma D.1, and the last line follows under the conditions pd2log2dp\gtrsim d^{-2}\log^{2}d and μlog2dd\mu\log^{2}d\lesssim d. This together with the high-probability fact Yj=XjY_{j}=X_{j} (j[d]\forall j\in[d]) concludes the proof.

D.10 Proof of Lemma D.10

Fix any 1id1\leq i\leq d. To begin with, define

which is a zero-mean random variable. In order to bound ZjZ_{j}, one observes that

where ψ1\|\cdot\|_{\psi_{1}} denotes the sub-exponential norm. Apply the Bernstein inequality for the sum of sub-exponential random variables to obtain

with probability exceeding 1O(d20)1-O(d^{-20}). Further, given that ZjZ_{j} is not necessarily bounded, we introduce a sequence of truncated random variables as follows

According to the above bound, one has Yj=ZjY_{j}=Z_{j} (j\forall j) with probability at least 1O(d19)1-O(d^{-19}).

We then turn attention to bounding j[d]Yj2\sum_{j\in[d]}Y_{j}^{2}. To this end, observe that

Invokde the Bernstein inequality to arrive at: with probability at least 1O(d20)1-O(d^{-20}),

This together with the high-probability fact Yj=ZjY_{j}=Z_{j} (j\forall j) completes the proof.

Appendix E Extension to asymmetric tensors

Thus far, we have focused on the case where the tensor of interest is symmetric. In this section, we discuss how to generalize our algorithm and analysis to accommodate asymmetric tensors.

In addition, we assume that each entry (j,k,l)(j,k,l) is included in the sampling set Ω\Omega independently with probability pp, and that each observed entry Tj,k,lT_{j,k,l}^{\star} is corrupted by an independent zero-mean sub-Guassian noise Ej,k,lE_{j,k,l} (cf. Assumption 2.3). Our goal is to (1) estimate {ui,vi,wi}i=1r\{\bm{u}_{i}^{\star},\bm{v}_{i}^{\star},\bm{w}_{i}^{\star}\}_{i=1}^{r} faithfully, modulo global permutation and global signs, and (2) estimate T\bm{T}^{\star} in a reliable manner.

E.2 Algorithms

We now move on to present an extension of our nonconvex algorithm to handle the noisy scenario.

we can define the following regularized squared loss function

where the regularization term \mathsf{reg}\big{(}\bm{U},\bm{V},\bm{W}\big{)} is given by

for some positive regularization parameters {αi}i=1r\{\alpha_{i}\}_{i=1}^{r} to be specified momentarily. In contrast to the symmetric case, the addition term \mathsf{reg}\big{(}\bm{U},\bm{V},\bm{W}\big{)} is included to help ensure that the sizes of U,V\bm{U},\bm{V} and W\bm{W} stay close — an algorithmic trick that has proved useful in other problems like nonconvex rectangular matrix recovery [TBS+16, ZL16, CLL19].

We are now ready to present our nonconvex algorithm that accommodates the case with asymmetric tensors. As before, the proposed algorithm is initialized by a spectral method, followed by gradient descent designed to minimize the regularized loss function (276). The precise procedure is described in Algorithm 7 (which invokes Algorithms 8-9).

Before proceeding, we find it helpful to record closed-form expressions for the gradients, which are a crucial part when implementing the nonconvex gradient descent algorithm. Specifically, the gradients of g(U,V,W)g(\bm{U},\bm{V},\bm{W}) can be computed as follows

for each 1ir1\leq i\leq r, where ×1,×2\times_{1},\times_{2} and ×3\times_{3} have been defined in Section 2.4.

E.3 Numerical experiments

In order to validate the effectiveness of the proposed algorithm, we conduct a series of numerical experiments.

E.4 Analysis ideas

Before describing the proof ideas, we define the following incoherence parameters and condition number that, similar to the symmetric case, play a crucial role in our theoretical development.

Define the incoherence parameters and the condition number of T\bm{T}^{\star} as follows

As has been made clear in the symmetric case, at the heart of our analysis lie two crucial components: (1) the geometric property of a noiseless version of the loss function, and (2) reasonably well initial estimates for tensor factors (in the entrywise sense). Rather than providing a complete analysis for the asymmetric case (which would be very long), we shall only point out the important steps needed to extend these two parts for the asymmetric case.

Similar to the symmetric counterpart, the key step of the local convergence analysis lies in establishing the favorable geometric property (i.e. local strong convexity and smoothness) of the following noiseless regularized loss function

where the regularization term \mathsf{reg}\big{(}\bm{U},\bm{V},\bm{W}\big{)} is defined in (277). In words, this is a simplified version of the original loss function (276) by dropping the influence of the noise. Lemma E.2 below demonstrates that gcleang_{\mathsf{clean}} is locally strongly convex and smooth in the neighborhood of the ground truth.

and that the regularization parameter obeys \big{|}\alpha_{i}-\lambda_{i}^{\star 2/3}\big{|}\leq c_{2}\lambda_{\min}^{\star 2/3} for some sufficiently large (resp. small) constant c0>0c_{0}>0 (resp. c1,c2>0c_{1},c_{2}>0). Then with probability greater than 1-O\big{(}d_{\min}^{-10}\big{)},

In a nutshell, Lemma E.2 confirms local strong convexity and smoothness of the noiseless regularized loss gclean(U,V,W)g_{\mathsf{clean}}(\bm{U},\bm{V},\bm{W}), provided that (282) holds and that the matrices U\bm{U}, V\bm{V} and W\bm{W} are sufficiently close to the ground truth in every single row. This is similar to the property of the symmetric counterpart fclean(U)f_{\mathsf{clean}}(\bm{U}) (cf. Lemma 5.1 in Appendix 5.1), except that we need to deal with asymmetric tensor factors as well as additional regularization terms. In particular, when d1d2d3dd_{1}\asymp d_{2}\asymp d_{3}\asymp d and μ,r1\mu,r\asymp 1, Condition (282) reduces to pd3/2log3dp\gg d^{-3/2}\log^{3}d and rdr\ll\sqrt{d}, which resembles (42) in Lemma 5.1 derived for the symmetric case.

Having established the preceding local geometric properties of gcleang_{\mathsf{clean}}, one can then argue similarly as in Lemmas 5.3 and 5.6 in Appendix 5.1 to prove that gradient descent converges linearly, as long as it is provided with an initial estimate satisfying the condition (285). Here, we emphasize that the regularization term (277), which essentially balances the sizes of the three tensor factors, is crucial for the local strong convexity and smoothness of gcleang_{\mathsf{clean}} to hold.

Another crucial ingredient lies in guaranteeing an initial estimate with sufficiently good accuracy. Recall that our initialization scheme consists of two stages: (1) subspace estimation, and (2) retrieval of individual tensor factors.

Regarding the retrieval of individual tensor factors, the key observation is that: the random vector gτ\bm{g}^{\tau} we generate satisfies

where PU\mathcal{P}_{\bm{U}^{\star}} is the projection onto the subspace spanned by {ui}i=1r\{\bm{u}_{i}^{\star}\}_{i=1}^{r}, and

Following the above strategies, one could adapt the proofs of Theorems 5.9-5.10 in Section 5.3 to show that: our initial estimates {ui,vi,wi}i=1r\{\bm{u}_{i},\bm{v}_{i},\bm{w}_{i}\}_{i=1}^{r} are all exceedingly close to the ground truth in the entrywise sense (up to global permutation and global signs). This in turn confirms that the algorithm will enter a locally strongly convex and smooth region as characterized in Lemma E.2, thus leading to our performance guarantees for the entire algorithm. Once again, while the analysis ideas for the asymmetric case bear much resemblance to the symmetric counterpart, a complete proof has to be fairly long due to more clumsy notation compared to the symmetric case; for the sake of brevity, we do not provide the full proof here.

E.5 Proof of Lemma E.2

For notational convenience, we shall define

With the above notation in place, one can decompose

where βi,1i4\beta_{i},1\leq i\leq 4 are given respectively by

In what follows, we shall demonstrate that β1\beta_{1} is the dominant term, with the remaining terms being negligible compared to β1\beta_{1}. Here, we note that the proof idea is almost identical to that of the symmetric case (cf. Lemma 5.1 in Appendix 5.1). For the sake of conciseness, we will focus only on the part where the symmetric and the asymmetric cases differ, and omit the proof details when their analyses are similar.

Let us begin with γ1\gamma_{1}. Observe that we can express

Arguing similarly as in the proof of Lemma D.1, one can derive

provided that rdmin/μr\ll d_{\min}/\mu. This leads to the following inequalities

We now move on to γ2\gamma_{2}. Recall our assumption that ui2=vi2=wi2=λi1/3\|\bm{u}_{i}^{\star}\|_{2}=\|\bm{v}_{i}^{\star}\|_{2}=\|\bm{w}_{i}^{\star}\|_{2}=\lambda_{i}^{\star 1/3} for all 1ir1\leq i\leq r. It follows from the Cauchy-Schwartz inequality that

Turning to γ3\gamma_{3}, one can straightforwardly bound

Here, we have used the fact that wi2=λi1/3\|\bm{w}_{i}^{\star}\|_{2}=\lambda_{i}^{\star 1/3} in (i); the inequalities (ii) and (iii) arise from the Cauchy-Schwartz inequality; (iv) follows from (280c); (v) holds as long as rdmin/μr\ll\sqrt{d_{\min}/\mu}. In a similar manner, one can easily verify that

It then follows from the AM-GM inequality that

Putting the above bounds together allows us to bound β1\beta_{1} as follows

provided the sampling rate exceeds pμ2r2dmaxlog2dmaxd1d2d3p\gg\frac{\mu^{2}r^{2}d_{\max}\log^{2}d_{\max}}{d_{1}d_{2}d_{3}}.

By the definition of the operator norm, one can bound

One can easily adapt the proof of Lemma D.2 to show that, with probability at with probability at least 1-O\big{(}d_{\min}^{-10}\big{)},

holds as long as p\gg\max\Big{\{}\frac{\log^{3}d_{\max}}{\sqrt{d_{1}d_{2}d_{3}}},\,\frac{d_{\max}\log^{5}d_{\max}}{d_{1}d_{2}d_{3}}\Big{\}}. We can then adapt the proof for bounding α3\alpha_{3} in the proof of Lemma 5.1 in Appendix A to derive that with probability at least 1-O\big{(}d_{\min}^{-10}\big{)},

It remains to control β4\beta_{4}. By symmetry, it suffices to consider the following terms:

In the following, we shall bound these four terms separately. By the assumptions that S(1)=Ir,S(2)=S(3)=Ir\bm{S}^{(1)}=\bm{I}_{r},\bm{S}^{(2)}=\bm{S}^{(3)}=-\bm{I}_{r}, δ1\delta\ll 1 and κ1\kappa\asymp 1, one has

In addition, for each 1ir1\leq i\leq r , one has

by virtue of the condition that \big{|}\alpha_{i}-\lambda_{i}^{\star 2/3}\big{|}\ll\lambda_{\min}^{\star 2/3}.

Let us start with γ1\gamma_{1}. By the condition that \max_{i}\big{|}\alpha_{i}-\lambda_{i}^{\star 2/3}\big{|}\ll\lambda_{\min}^{\star 2/3}, it is easily seen that

Here, we use (288a), (288b), (289) and κ1\kappa\asymp 1 in the last step.

Turning to γ3\gamma_{3}, one can develop a similar bound as follows

With regards to γ4\gamma_{4}, we can expand

where the last step follows from the assumption ui2=vi2=wi2\|\bm{u}_{i}^{\star}\|_{2}=\|\bm{v}_{i}^{\star}\|_{2}=\|\bm{w}_{i}^{\star}\|_{2}. It follows that

where we have used the conditions that maxiαiλmax2/3\max_{i}\alpha_{i}\lesssim\lambda_{\max}^{\star 2/3} and κ1\kappa\asymp 1.

It is not hard to check that similar bounds also hold for the remaining terms in β4\beta_{4}. As a result, we arrive at

Putting the above estimates together, we conclude that with probability at least 1O(dmin10)1-O(d_{\min}^{-10}), one has

References