Nonconvex Low-Rank Tensor Completion from Noisy Data

Changxiao Cai, Gen Li, H. Vincent Poor, Yuxin Chen

Introduction and motivation

Estimation of low-complexity models from highly incomplete observations is a fundamental task that spans a diverse array of science and engineering applications. Arguably one of the most extensively studied problems of this kind is matrix completion, where one wishes to recover a low-rank matrix given only partial entries [DR16, CC18]. Moving beyond matrix-type data, a natural higher-order generalization is low-rank tensor completion, which aims to reconstruct a low-rank tensor when the vast majority of its entries are unseen. There is certainly no shortage of applications that motivate the investigation of tensor completion (e.g. personalized medicine [SN16, Paw19], medical imaging [GRY11, SHKM14, CZA+17], seismic data analysis [KSS13, EAHK13], multi-dimensional harmonic retrieval [CC14, YLW+17]). One concrete example in operations research arises when learning the preference of individual customers for a collection of products on the basis of historical transactions [FL19, MP20]. Given the limited availability of transaction data (e.g. each customer might only have purchased very few products before), it is crucial to exploit multi-way customer-product interactions (e.g. users’ browsing and searching histories) in order to better predict the likelihood of a customer purchasing a new product. Clearly, the presence of missing data and the need of exploiting multi-way structure result in the task of tensor completion. Additionally, tensor completion finds important applications in visual data in-painting [LMWY13, LYX17], where one wishes to reconstruct video data (or a sequence of images) from incomplete measurements. The video data consist of at least two spatial variables and one temporal variable, whose intrinsic connections are often modeled via certain low-complexity tensors.

2 Computational and statistical challenges

Even though tensor completion conceptually resembles matrix completion in various ways, it is considerably more challenging than the matrix counterpart. This is perhaps not surprising, given that a plethora of natural tensor problems (e.g. computing the spectral norm, finding the best low-rank approximation) are all notoriously hard [HL13]. As a notable example, while matrix completion is often efficiently solvable under nearly minimal sample complexity [CR09, Gro11], all polynomial-time algorithms developed so far for tensor completion — even in the noise-free case — require a sample size at least exceeding the order of $rd^{3/2}$ , which is substantially larger than the degrees of freedom (i.e. $rd$ ) underlying the model (2). In fact, it is widely conjectured that there exists a large computational barrier away from the information-theoretic sampling limits [BM16].

With this fundamental gap in mind, the current paper focuses on the regime (in terms of the sample size) that enables reliable tensor completion in polynomial time. A variety of algorithms have been proposed that enjoy some sort of theoretical guarantees in (at least part of) this regime, including but not limited to spectral methods [MS18, CLC+20], sum-of-squares hierarchy [BM16, PS17], nonconvex algorithms [JO14, XY17], and also convex relaxation (based on proper unfolding) [GRY11, HMGW15, RPP13, GQ14]. While these are all polynomial-time algorithms, most of the computational complexities supported by prior theory remain prohibitively high when dealing with large-scale tensor data — a point that we shall elaborate on later. The only exception is the unfolding-based spectral method, which, however, fails to achieve exact recovery as the noise vanishes. This leads to a critical question:

Q1: Is there any linear-time algorithm that is guaranteed to work for low-rank tensor completion?

Going beyond such computational concerns, one might naturally wonder whether it is also possible for a fast algorithm to achieve a nearly un-improvable statistical accuracy in the presence of noise. Towards this end, intriguing stability guarantees have been established for sum-of-squares hierarchy in the noisy settings [BM16], although this paradigm is computationally expensive for large-scale data. The recent work [XYZ17] came up with a two-stage algorithm (i.e. a spectral method followed by tensor power iterations) for noisy tensor completion. Its estimation accuracy, however, falls short of achieving exact recovery in the absence of noise. This gives rise to another question of fundamental importance:

Q2: Can we achieve near-optimal statistical accuracy without compromising computational efficiency?

In this paper, we aim to address the above two questions by developing a nonconvex algorithm that achieves optimal computational efficiency and statistical accuracy all at once.

Algorithm and main results

To address the above-mentioned challenges, a first impulse is to resort to the following least squares problem:

or more concisely (up to proper re-scaling),

Fortunately, not all nonconvex problems are as daunting to solve as they may seem. For example, recent years have seen a flurry of activity in low-rank matrix factorization via nonconvex optimization, which provably achieves optimal statistical accuracy and computational efficiency at once; see [CLC19] for an overview of recent advances. Motivated by this strand of work, we propose to solve (4) via a two-stage nonconvex paradigm, presented below in reverse order. The whole procedure is summarized in Algorithms 1-3.

Arguably one of the simplest optimization algorithms is gradient descent, which adopts a gradient update rule

Despite the simplicity of this algorithm, two critical issues stand out and might significantly affect its efficiency, which we shall bear in mind throughout the algorithmic and theoretical development.

(i) Local stationary points and initialization. As is well known, GD is guaranteed to find an approximate local stationary point, provided that the learning rates do not exceed the inverse Lipschitz constant of the gradient [Bub15]. There exist, however, local stationary points (e.g. saddle points or spurious local minima) that might fall short of the desired statistical properties. This requires us to properly avoid such undesired points, while retaining computational efficiency. To address this issue, one strategy is to first identify a rough initial guess within a local region surrounding the global solution (which often helps rule out bad local minima), in order to guarantee proper convergence of subsequent optimization procedures [LT17, JO14]. As a side remark, while careful initialization might not be crucial for several matrix recovery cases [CCFM19, GBW18, TV19], it does seem to be critical in various tensor problems [RM14]. We shall elucidate this point in Section 3.3.

(ii) Learning rates and regularization. Learning rates play a pivotal role in determining the convergence properties of GD. The challenge, however, is that the loss function (4) is overall not sufficiently smooth (i.e. its gradient often has an exceedingly large Lipschitz constant), and hence generic optimization theory recommends a pessimistically slow update rule (i.e. an extremely small learning rate) so as to guard against over-shooting. This, however, slows down the algorithm significantly, thus destroying the main computational advantage of GD (i.e. low per-iteration cost). With this issue in mind, prior literature suggests carefully designed regularization steps (e.g. proper projection, regularized loss functions) in order to improve the geometry of the optimization landscape [XY17]. In contrast, we argue that one is allowed to take a constant learning rate — which is as aggressive as it can possibly be — even without enforcing any regularization procedures.

Motivated by the above-mentioned issue (i), we develop a procedure that guarantees a reasonable initial estimate. In a nutshell, the proposed procedure consists of two steps:

Estimate the subspace spanned by the $r$ low-rank tensor factors $\{\bm{u}_{i}^{\star}\}_{1\leq i\leq r}$ via a spectral method;

Disentangle individual low-rank tensor factors from this subspace estimate.

As we shall see momentarily, the total computational complexity of the proposed initialization is $O(pd^{3})$ when $r=O(1)$ , $\kappa=O(1)$ and $p\geq 1/d^{2}$ (where $\kappa$ is a sort of “condition number” defined later), which is a linear-time algorithm. Note, however, that these two steps in the initialization procedure are relatively more complicated to describe. To improve the flow of the current paper, we postpone the details to Section 3. The readers can catch a glimpse of these procedures in Algorithms 2-3.

2 Main results

Encouragingly, the proposed nonconvex algorithm provably achieves the best of both worlds — in terms of statistical accuracy and computational efficiency — for a class of low-rank, well-conditioned, and “incoherent” problem instances. This subsection summarizes our main findings.

Before continuing, we note that one cannot hope to recover an arbitrary tensor from highly sub-sampled and arbitrarily corrupted entries. In order to enable provably valid recovery, the present paper focuses on a tractable model by imposing the following assumptions.

Define the incoherence parameters and the condition number of $\bm{T}^{\star}$ as follows

Here, $\mu_{0}$ , $\mu_{1}$ and $\mu_{2}$ are termed the incoherence parameters. Definitions (8a)-(8c) can be viewed as some sort of incoherence conditions for the tensor. For instance, when $\mu_{0},\mu_{1}$ and $\mu_{2}$ are small, these conditions say that (1) the energy of tensor $\bm{T}^{\star}$ is (nearly) evenly spread across all entries; (2) each factor $\bm{u}_{i}^{\star}$ is de-localized; (3) the factors $\{\bm{u}_{i}^{\star}\}$ are nearly orthogonal to each other. Definition (8d) is concerned with the “well-conditionedness” of the tensor, meaning that each rank-1 component is of roughly the same size. In particular, we note that an assumption on pairwise correlation (i.e. a constraint on $\mu_{2}$ ) is often assumed in the literature of tensor decomposition / factorization (e.g. [AGJ14, SLLC17, HZC20]).

Suppose that $\bm{E}$ is a symmetric random tensor, where $\{E_{j,k,l}\}_{1\leq j\leq k\leq l\leq d}$ (cf. (1)) are independently generated sub-Gaussian random variables with mean zero and variance $\mathsf{Var}(E_{j,k,l})\leq\sigma^{2}$ .

In addition, recognizing that there is a global permutational ambiguity issue (namely, one cannot distinguish $\bm{u}_{1}^{\star},\cdots,\bm{u}_{r}^{\star}$ from an arbitrary permutation of them), we introduce the following loss metrics to account for this ambiguity:

where $\mathsf{perm}_{r}$ stands for the set of $r\times r$ permutation matrices. For notational simplicity, we also take

With these notations in place, we are ready to present our main results. For simplicity of presentation, we shall start with the setting where $r,\mu,\kappa\asymp 1$ .

Fix an arbitrary small constant $\delta>0$ . Suppose that $r,\kappa,\mu=O(1)$ ,

for some sufficiently large constants $c_{0},c_{2}>0$ and some sufficiently small constants $c_{1},c_{3}>0$ . The learning rate $\eta_{t}\equiv\eta$ is taken to be a constant obeying $0<\eta\leq\lambda_{\min}^{\star 4/3}/\big{(}32\lambda_{\max}^{\star 8/3}\big{)}$ . Then with probability at least $1-\delta$ ,

hold simultaneously for all $0\leq t\leq t_{0}=d^{5}$ . Here, $0<C_{1},C_{3},\rho<1$ and $C_{2},C_{4}>0$ are some absolute constants.

The theorem holds unchanged if $d^{5}$ is replaced by $d^{c}$ for an arbitrarily large constant $c>0$ .

The upper bound $t_{0}$ on the iteration count arises from the leave-one-out analysis when handling noisy observations. In short, the leave-one-out argument can only provide high-probability bounds for each iteration, thus requiring an upper bound on the iteration count if we desire a uniform bound across iterations. Note that in the noiseless case, our results and analysis hold for an arbitrarily large number of iterations.

Fix an arbitrarily small constant $\delta>0$ . Instate the assumptions of Theorem 2.4. Then with probability at least $1-\delta$ ,

hold simultaneously for all $0\leq t\leq t_{0}=d^{5}$ . Here, $0<C_{1},C_{3},\rho<1$ and $C_{2},C_{4}>0$ are some absolute constants.

Several important implications are provided as follows. The discussion below assumes $\lambda_{\max}^{\star}\asymp\lambda_{\min}^{\star}\asymp 1$ for notational simplicity.

Linear convergence. In the absence of noise, the proposed algorithm converges linearly, namely, it provably attains $\varepsilon$ accuracy within $O(\log(1/\varepsilon))$ iterations. Given the inexpensiveness of each gradient iteration, this algorithm can be viewed as a linear-time algorithm, which can almost be implemented as long as we can read the data. In the noisy setting, the algorithm reaches an appealing statistical accuracy within a logarithmic number of iterations.

Near-optimal statistical accuracy. The proposed algorithm converges geometrically fast to a point with Euclidean error $O\big{(}\sigma\sqrt{(d\log d)/p}\big{)}$ . This matches the lower bound established in [XYZ17, Theorem 5] up to some logarithmic factor, thus justifying the statistical optimality of the proposed nonconvex algorithm.

Noise size. The above theory operates in the regime where $\sigma\lesssim\sqrt{\frac{p}{d^{3/2}}}$ (modulo some log factor). Given that we have $\|\bm{T}^{\star}\|_{\infty}\asymp d^{-3/2}$ in this case, our noise size constraint can be equivalently written as (up to some log factor)

Since the sampling rate needs to satisfy $p\gg d^{-3/2}$ , this condition essentially allows the typical size of each noise component to be considerably larger than the size of the corresponding entry of the truth, which covers a broad range of practical scenarios.

Implicit regularization. One appealing feature of our finding is the simplicity of the iterative refinement stage of the algorithm. All of the above statistical and computational benefits hold for vanilla gradient descent (when properly initialized). This should be contrasted with prior work (e.g. [XY17]) that relies on extra regularization terms to stabilize the optimization landscape. In principle, vanilla gradient descent implicitly constrains itself within a region of well-conditioned landscape, thus enabling fast convergence without explicit regularization.

No need of sample splitting. The theory developed herein does not require fresh samples in each iteration. We note that sample splitting has been frequently adopted in other context primarily to simplify mathematical analysis. Nevertheless, it typically does not exploit the data in an efficient manner (i.e. each data sample is used only once), thus resulting in the need of a much larger sample size in practice.

Thus far, we have concentrated on the low-rank, well-conditioned, and incoherent case. Our main theory can be extended to cover a broader class of scenarios, as stated below.

Fix an arbitrary small constant $\delta>0$ . Suppose that $\kappa\asymp 1$ ,

for some sufficiently large constants $c_{0},c_{3}>0$ and some sufficiently small constants $c_{1},c_{2},c_{4}>0$ . The learning rate $\eta_{t}\equiv\eta$ is taken to be a constant obeying $0<\eta\leq\lambda_{\min}^{\star 4/3}/\big{(}32\lambda_{\max}^{\star 8/3}\big{)}$ . Then with probability at least $1-\delta$ ,

hold simultaneously for all $0\leq t\leq t_{0}=d^{5}$ . Here, $0<C_{1},C_{3},\rho<1$ and $C_{2},C_{4}>0$ are some absolute constants.

Fix an arbitrarily small constant $\delta>0$ . Instate the assumptions of Theorem 2.8. Then with probability at least $1-\delta$ ,

hold simultaneously for all $0\leq t\leq t_{0}=d^{5}$ . Here, $0<C_{1},C_{3},\rho<1$ and $C_{2},C_{4}>0$ are some absolute constants.

Clearly, Theorem 2.8 and Corollary 2.9 subsume Theorem 2.4 and Corollary 2.7 as a special case respectively.

Our theorems require the rank $r$ to not exceed $o(d^{1/6})$ , which, we believe, is an artifact of the current nonconvex analysis (particularly for the initialization stage). For instance, our local convergence analysis is built upon strong convexity and smoothness, which holds only within a sufficiently small neighborhood surrounding the truth; given that the diameter of this neighborhood is no more than $o(1/r)$ , our analysis requires an initial guess with higher accuracy than expected, thus leading to our rank constraint. It might be possible to improve the rank dependency via more refined analysis, and we leave it to future investigation.

3 Numerical experiments

We carry out a series of numerical experiments to corroborate our theoretical findings. Before proceeding, recall that Theorem 2.8 only guarantees successful recovery with probability $1-\delta$ for some small constant $\delta$ ; this means that we shall not anticipate a very high success rate (e.g. $1-O(d^{-5})$ ) as in the matrix recovery case. As we shall make clear shortly, this happens mainly because the initialization stage works only with probability $1-\delta$ , where the uncertainty largely depends on the random vectors $\{\bm{g}^{\tau}\}_{1\leq\tau\leq L}$ . With this observation in mind, we recommend the following modification to improve the empirical success rate:

Run Algorithm 2 independently for $t_{\mathsf{init}}=5$ times to obtain multiple initial estimates (denoted by $\bm{U}^{0}_{},\cdots,\bm{U}^{0}_{[t_{\mathsf{init}}]}$ ); select the one achieving the smallest empirical loss, namely

Run Algorithm 1 with the initial point $\bm{U}^{0}$ set to be $\bm{U}^{0}_{\mathsf{best}}$ .

The final estimates for the low-rank factor and the whole tensor are denoted respectively by

The third series of experiments is concerned with the dependence of the success rate on the rank $r$ . Let us set $p=rd^{-3/2}\log^{2}d$ , $L=r^{2}$ and $\epsilon_{\mathsf{th}}=0.4$ , and the success recovery criterion is the same as above. Figure 1(c) depicts the empirical success rates (over $100$ independent Monte Carlo trials) as the rank $r$ varies. As can be seen from the plots, the proposed algorithm is able to achieve exact reconstruction as long as the rank $r$ is sufficiently small compared to $d$ . The plausible range of $r$ , however, seems and seems to be larger than our theoretic requirement $r=o(d^{1/6})$ . This, once again, suggests the need of future investigation to pin down the best possible dependency on $r$ .

4 Notation

In addition, the operator norm of $\bm{T}$ is defined as

Further, $f(n)\lesssim g(n)$ or $f(n)=O(g(n))$ means that $|f(n)/g(n)|\leq C_{1}$ for some constant $C_{1}>0$ ; $f(n)\gtrsim g(n)$ means that $|f(n)/g(n)|\geq C_{2}$ for some constant $C_{2}>0$ ; $f(n)\asymp g(n)$ means that $C_{1}\leq|f(n)/g(n)|\leq C_{2}$ for some constants $C_{1},C_{2}>0$ ; $f(n)=o(g(n))$ means that $\lim_{n\rightarrow\infty}f(n)/g(n)=0$ . In addition, $f(n)\ll g(n)$ means that $f(n)\leq c_{1}g(n)$ for some sufficiently small constant $c_{1}>0$ , and $f(n)\gg g(n)$ means that $f(n)\geq c_{2}g(n)$ for some sufficiently large constant $c_{2}>0$ .

Initialization

This section presents formal details of the proposed two-step initialization, accompanied by some intuition. Recall that the proposed initialization procedure consists of two steps, which we discuss separately.

The spectral algorithm is often applied in conjunction with simple “unfolding” (or “matricization”) to estimate the subspace spanned by the $r$ factors $\{\bm{u}_{i}^{\star}\}_{1\leq i\leq r}$ . This strategy is partly motivated by prior approaches developed for covariance estimation with missing data [Lou14, MS18, CLC+20]. We provide a brief introduction below.

be the mode-1 matricization of $p^{-1}\bm{T}$ (namely, $\frac{1}{p}T_{i,j,k}=A_{i,(j-1)d+k}$ for any $1\leq i,j,k\leq d$ ) [KB09]. The rationale of this step is that: under our model, the unfolded matrix $\bm{A}$ obeys

2 Step 2: retrieval of low-rank tensor factors from the subspace estimate

As it turns out, it is possible to obtain rough (but reasonable) estimates of all individual low-rank tensor factors $\{\bm{u}_{i}^{\star}\}_{1\leq i\leq r}$ — up to global permutation — given a reliable subspace estimate $\bm{U}$ . This is in stark contrast to the low-rank matrix recovery case, where there exists some global rotational ambiguity that prevents us from disentangling the $r$ factors of interest.

We begin by describing how to retrieve one tensor factor from the subspace estimate — a procedure summarized in $\textsc{Retrieve-one-tensor-factor}($ ). Let us generate a random vector from the provided subspace $\bm{U}$ (which has orthonormal columns), that is,

The rescaled tensor data $p^{-1}\bm{T}$ is then transformed into a matrix via proper “projection” along this random direction $\bm{\theta}$ , namely,

Our estimate for a tensor factor is then given by $\lambda^{1/3}\bm{\nu}$ , where $\bm{\nu}$ is the leading singular vector of $\bm{M}$ obeying $\langle\bm{T},\bm{\nu}^{\otimes 3}\rangle\geq 0$ , and $\lambda$ is taken as $\lambda=\big{\langle}p^{-1}\bm{T},\bm{\nu}^{\otimes 3}\big{\rangle}$ . Informally, $\bm{\nu}$ reflects the direction of the component $\bm{u}_{i}^{\star}$ that exhibits the largest correlation with the random direction $\bm{\theta}$ , and $\lambda$ forms an estimate of the corresponding size $\|\bm{u}^{\star}_{i}\|_{2}$ .

A challenge remains, however, as there are oftentimes more than one tensor factors to estimate. To address this issue, we propose to re-run the aforementioned procedure multiple times, so as to ensure that we get to retrieve each tensor factor of interest at least once. We will then apply a careful pruning procedure (i.e. $\textsc{Prune}($ )) to remove redundancy.

2.2 Intuition

To develop some intuition about the above procedure, consider the “heuristic” case where $\bm{\theta}=\bm{U}^{\star}(\bm{U}^{\star\top}\bm{U}^{\star})^{-1}\bm{U}^{\star\top}\bm{g}$ , namely, the idealistic scenario where the subspace estimate $\bm{U}$ is accurate. Averaging out the randomness in the sampling pattern and the noise, we see that the expected projected matrix (27) takes the following form:

In the mean time, armed with (28) and the incoherence assumption (such that $\bm{u}_{i}^{\star}$ and $\bm{u}_{j}^{\star}$ are nearly orthogonal for $i\neq j$ ), one might have

thus explaining our choice of $\lambda$ in the proposed procedure. These arguments hint at the ability of our procedure in retrieving one tensor factor in each round.

The above intuitive argument, however, does not explain why we need to first project a random vector $\bm{g}$ onto the (approximate) column space of $\bm{U}^{\star}$ . While we won’t go into detailed calculations here, we remark in passing a crucial high variability issue: without proper projection, the perturbation incurred by both the missing data and the noise might far exceed the strength of the true signal. As a result, it is advised to first project the data onto the desired subspace, in the hope of amplifying the signal-to-noise ratio.

3 Other alternatives?

The careful reader may naturally wonder whether a careful initialization is pivotal in achieving fast convergence. While a thorough answer to this has yet to be developed, we shall point out some alternatives that seem sub-optimal in both theory and practice. To simplify the presentation, the current subsection focuses on the rank-1 noiseless case, where

Since the decision variable is now a $d$ -dimensional vector, we shall employ the conventional notation $\bm{u}^{t}$ to represent $\bm{U}^{t}$ .

We find it instrumental to begin with the population-level analysis, which corresponds to the scenario with no missing data and noise ( $p=1$ and $\sigma=0$ ). A little calculation gives

As an immediate consequence, the expected correlation between the next iterate and the truth obeys

a similar recursion holds for $\bm{u}^{t}$ . As a result, the GD iterates are expected to get increasingly more aligned with the truth, at least at the population level. Caution needs to be exercised, however, that this population-level analysis alone fails to capture what is happening in the finite-sample case. In what follows, we point out potential issues with random initialization.

Consider the case where $\bm{u}^{0}$ is generated as a vector of i.i.d. Gaussian random variables. Suppose that $\bm{u}^{0}$ and $\bm{u}^{\star}$ are positively correlated and that $\|\bm{u}^{0}\|_{2}$ is sufficiently small. It is easily seen that, with high probability, the expected increment is on the order of (cf. (32))

which could be quite small as it depends quadratically on the current correlation $\langle\bm{u}^{0},\bm{u}^{\star}\rangle$ .

In summary, the main issue stems from the quadratic dependence of the expected increment (33) on the correlation $\langle\bm{u}^{0},\bm{u}^{\star}\rangle$ , which can be exceedingly small if $\bm{u}^{0}$ is randomly initialized.

Another alternative for initialization is the tensor power method, which has recently gained popularity in the context of learning latent-variable models [AGH+14, AGJ17]. Nevertheless, the TPM (with random initialization) suffers from the same high-volatility issue as randomly initialized GD. The argument for this would be nearly identical to the one presented above, and is hence omitted. Instead, we invoke a perturbation analysis result in [AGH+14, Theorem 5.1] to illustrate the insufficiency of the TPM.

Recall that $\frac{1}{p}\bm{T}=\bm{T}^{\star}+\big{(}\frac{1}{p}\bm{T}-\bm{T}^{\star}\big{)}$ . A critical issue is that the perturbation bound in [AGH+14, Theorem 5.1] requires the tensor perturbation to be exceedingly small, namely,

Related work

One of the most natural ideas for solving tensor completion is to first unfold the tensor data into matrices, followed by proper convex relaxation commonly adopted for low-rank matrix completion. Given that there are more than one ways to matricize a tensor, several prior work has explored the design of matrix norms that can exploit the tensor structure more effectively [THK10, GRY11, LMWY13, RPP13, LFC+16, MHWG14]. Such algorithms have been robustified to enable reliable recovery against sparse outliers as well [GQ14]. For the most part, however, such unfolding-based convex relaxation necessarily incur loss of structural information, which is particularly severe when handling odd-order tensors. The sample complexity developed for this paradigm is often sub-optimal vis-a-vis the computational limits (namely, minimal sample complexity achievable by polynomial-time algorithms).

Motivated by the above sub-optimality issue, [YZ16, YZ17] proposed to minimize instead the tensor nuclear norm subject to data constraints, which provably allows for reduced sample complexity. The issue, however, is that computing the tensor nuclear norm itself is already computationally intractable, thus limiting its applicability to even moderate-dimensional problems. Similar findings have also been discovered for tensor atomic norm minimization [DBBG19]. When restricted to polynomial-time algorithms, the best statistical guarantees are often attained via convex relaxation tailored to the sum-of-squares hierarchy [BM16]; the resulting computational cost, however, remains prohibitively high for practical large-scale problems. Another matrix nuclear norm minimization algorithm has been proposed based on promoting certain structures on certain factor matrices [LSC+14]. Developing statistical guarantees is, however, not the focal point of this work.

Moving beyond convex relaxation, a number of prior papers have developed nonconvex algorithms for tensor completion, examples including iterative hard thresholding [RSS17], alternating minimization [JO14, WAA16, XHYS15], tensor SVD [ZA17], optimization on manifold [XY17, KM16, Ste16], proximal average algorithm with nonconvex regularizer [Yao18], and block coordinate decent [JHZ+16, XY13]. When it comes to the model considered herein, these algorithms either lack optimal statistical guarantees, or come with a computational cost that is significantly higher than a linear-time algorithm.

The algorithm and theory that we develop are largely inspired by the recent advances of nonconvex optimization algorithms for low-rank matrix recovery problems [KMO10a, KMO10b, CLS15, CC17, SL16, YPCC16, CW15]. The main theoretical tool — the leave-one-out analysis — is a powerful technique that has proved successful in various other statistical problems [EK15, CFMW19, AFWZ17, MWCC17, ZB18, CCFM19, CCF+19, LZT19, CFMY19, DC18, PW19]. There are several major differences between the analysis of nonconvex tensor completion and that of nonconvex matrix recovery. For instance, our initialization scheme is substantially more complicated than the matrix recovery counterpart, thus requiring much more sophisticated analysis; in addition, the local convergence stage of tensor completion does not suffer from rotational ambiguity (which often appears in nonconvex matrix completion), and hence we only need to handle permutational ambiguity.

In addition, the current paper focuses on non-adaptive uniform random sampling. If there is freedom in designing the sampling mechanism, then one can often expect improved performance; see [KS13, Zha19] as examples. Fundamental criteria that enable perfect low-CP-rank tensor completion have been studied in [AW17].

Tensor completion is simply a special example of the tensor recovery literature. There is a large body of results tackling various other tensor recovery and estimation problems, including but not limited to tensor decomposition [Kol01, KB09, AGH+14, AGJ14, TS15, KOKC13, HSSS16, GHJY15, ZKOM18, SDLF+17, SLLC17, GM17], tensor SVD and factorization [ZX18, KBHH13, ZA17], and tensor regression and sketching [RSS17, HZC20, CRY19, HWW+19]. The algorithmic ideas explored in this paper might have implications for these tensor-related problems as well.

Analysis

In this section, we outline the proof of Theorem 2.8. The proof of Corollary 2.9 is deferred to Appendix C. The analysis is divided into three parts:

In Section 5.1, we show that given an initial estimate sufficiently close to the ground truth, vanilla gradient descent converges linearly. These are formalized in Lemmas 5.3 and 5.6.

Sections 5.2-5.3 provide statistical guarantees for the two steps of the initialization procedure; see Theorems 5.9.

Under the assumptions of Theorem 2.8, one can see that the initialization satisfies the requirement of linear convergence of vanilla gradient descent. Therefore, Theorem 2.8 immediately follows from the results in Sections 5.1-5.3.

In this section, we demonstrate that: if the initialization is reasonably good, then vanilla gradient descent converges linearly to a solution with the desired statistical accuracy. We postpone the analysis for initialization to Sections 5.2-5.3 for convenience of presentation.

First of all, using our notation $\times^{\mathsf{seq}}$ defined in (21), we can write

The gradient of $f_{\mathsf{clean}}$ w.r.t. $\bm{u}_{s}$ ( $1\leq s\leq r$ ) is thus given by

where $\mathsf{vec}(\bm{V})$ denotes the vectorization of $\bm{V}$ .

1.2 Local strong convexity and smoothness

At the heart of our analysis is a crucial geometric property of the objective function, that is, the noiseless loss function $f_{\mathsf{clean}}$ behaves like a locally strongly convex and smooth function. This fact, which is formally stated in the following lemma, is the key enabler of fast local convergence of vanilla GD.

Suppose that the sample complexity and the rank satisfy

for some sufficiently large (resp. small) constant $c_{0}>0$ (resp. $c_{1}>0$ ). Then with probability greater than $1-O(d^{-10})$ ,

Here, $\delta\leq c_{2}/(\mu^{3/2}r)$ for some sufficiently small constant $c_{2}>0$ .

In order to invoke Lemma 5.1, one needs to make sure that the decision matrix $\bm{U}$ of interest (e.g. $\bm{U}^{t}$ in the GD sequence) satisfies the condition (44). This, however, is a fairly stringent condition, as it requires $\bm{U}$ to be close to the truth in every single row.

1.3 Leave-one-out gradient descent sequences

Motivated by the analytical framework developed for low-rank matrix recovery [MWCC17, CLL19], we introduce the following leave-one-out sequences, which play a crucial role in guaranteeing that the entire trajectory $\{\bm{U}^{t}\}_{t\geq 0}$ satisfies the condition (44) as required in Lemma 5.1.

Specifically, we define for each $1\leq m\leq d$ the following auxiliary loss function:

$\mathcal{P}_{\Omega_{m}}$ : the projection onto the subspace of tensors supported on $\{(i,j,k)\in\Omega\colon i=m\text{ or }j=m\text{ or }k=m\}$ ;

$\mathcal{P}_{\Omega_{-m}}$ : the projection onto the subspace of tensors supported on $\{(i,j,k)\in\Omega\colon i\neq m\text{ and }j\neq m\text{ and }k\neq m\}$ ;

$\mathcal{P}_{m}$ : the projection onto the subspace of tensors supported on $\{(i,j,k)\in[d]^{3}\colon i=m\text{ or }j=m\text{ or }k=m\}$ .

In words, this function is obtained by replacing all data at locations $\{(i,j,k)\in[d]^{3}\colon i=m\text{ or }j=m\text{ or }k=m\}$ by their expected values, thus removing all randomness associated with this location subset. The gradient of $f^{\left(m\right)}(\bm{U})$ w.r.t. $\bm{u}_{s}$ ( $1\leq s\leq r$ ) can be computed as:

We then denote by $\big{\{}\bm{U}^{t,(m)}\big{\}}_{t\geq 0}$ the iterative sequence obtained by running gradient descent w.r.t. the leave-one-out loss $f^{\left(m\right)}(\cdot)$ ; see Algorithm 4. By construction, as long as $\bm{U}^{0,\left(m\right)}$ is independent of the sampling locations and the noise associated with the locations $\{(i,j,k)\in\Omega\colon i=m\text{ or }j=m\text{ or }k=m\}$ (which holds true as detailed momentarily), then the entire trajectory $\big{\{}\bm{U}^{t,(m)}\big{\}}_{t\geq 0}$ becomes statistically independent of such randomness. This is a crucial property that allows us to decouple the complicated statistical dependency.

1.4 Key lemmas

Key hypotheses for the gradient update stage:

for some quantity $\mathcal{E}_{\mathsf{local}}>0$ (depending possibly on $\mu$ and $r$ ) and some constants $C_{1},\cdots,C_{8}>0$ . These exist a few straightforward consequences of the hypotheses (47), which we record in the following lemma.

Assume that the hypotheses (47) hold, then we have

Our proof for the hypotheses (47) is inductive in nature: we would like to show that if the hypotheses in (47) hold for the $t$ -th iteration, then they continue to be valid for the $(t+1)$ -th iteration. We shall justify each of the above hypotheses inductively through the following lemmas.

for some sufficiently large constant $c_{0}>0$ and some sufficiently small constant $c_{1},c_{2}>0$ . Assume that the hypotheses (47) hold for the $t$ -th iteration and $\mathcal{E}_{\mathsf{local}}\leq c_{3}/\big{(}\mu^{3/2}r\big{)}$ for some sufficiently small constant $c_{3}>0$ . Then with probability at least $1-O(d^{-10})$ ,

provided that $0<\eta\leq\lambda_{\min}^{\star 4/3}/\big{(}32\lambda_{\max}^{\star 8/3}\big{)}$ , $1-\big{(}\lambda_{\min}^{\star 4/3}/5\big{)}\eta\leq\rho<1$ , and $C_{2}$ is sufficiently large.

provided that $0<\eta\leq\lambda_{\min}^{\star 4/3}/\big{(}32\lambda_{\max}^{\star 8/3}\big{)}$ , $1-\big{(}\lambda_{\min}^{\star 4/3}/5\big{)}\eta\leq\rho<1$ and $C_{6}$ is sufficiently large.

provided that $0<\eta\leq\lambda_{\min}^{\star 4/3}/\big{(}32\lambda_{\max}^{\star 8/3}\big{)}$ , $1-\big{(}\lambda_{\min}^{\star 4/3}/5\big{)}\eta\leq\rho<1$ , $C_{7}$ and $C_{8}$ are sufficiently large.

The proofs of the above key lemmas are postponed to Appendix A.

2 Analysis for initialization: Part 1 (subspace estimation)

where $\mathcal{O}^{r\times r}$ stands for the set of $r\times r$ orthonormal matrices. This can be viewed as the global rotation matrix that best aligns the two subspaces represented by $\bm{U}$ and $\bm{U}^{\star}_{\mathsf{orth}}$ respectively.

Equipped with the above notation, we can invoke [CLC+20, Corollary 1] to arrive at the following lemma, which upper bounds the distance between our subspace estimate $\bm{U}$ and the ground truth $\bm{U}^{\star}_{\mathsf{orth}}$ .

There exist some universal constants $c_{0},c_{1},c_{2}>0$ such that if

then with probability $1-O\left(d^{-10}\right)$ , the subspace estimate $\bm{U}$ computed by Algorithm 2 obeys

where $\bm{U}_{\mathsf{orth}}^{\star}$ and $\bm{R}$ are defined respectively in (54) and (55), and

In a nutshell, Lemma 5.7 asserts that: under our sample size, noise and rank conditions, Algorithm 2 produces reliable estimates of the subspace spanned by the low-rank tensor factors $\{\bm{u}_{i}^{\star}\}_{1\leq i\leq r}$ . The theorem quantifies the subspace distance in terms of both the spectral norm and $\|\cdot\|_{2,\infty}$ , where the latter bound often reflects a considerably stronger sense of proximity compared to the former one.

As it turns out, in order to facilitate analysis for the subsequent stages, we need to introduce certain leave-one-out sequences as well, which we detail in the next subsection.

2.2 Leave-one-out sequences for subspace estimation

The key idea of the leave-one-out analysis is to create auxiliary leave-one-out sequences that are (1) independent of a small fraction of the data; (2) sufficiently close to the true estimates. We introduce the following auxiliary tensor and $d\times d^{2}$ -dimensional matrix for each $1\leq m\leq d$ :

By construction, $\bm{T}^{\left(m\right)}$ and $\bm{A}^{\left(m\right)}$ are independent of $\mathcal{P}_{\Omega_{m}}\left(\bm{E}\right)$ , where we recall that

where $\mathcal{P}_{\mathsf{off}\text{-}\mathsf{diag}}(\cdot)$ (as already defined in Section 3.1) extracts out off-diagonal entries from a matrix. The rationale is simple: it can be easily verified that

The following lemma plays a crucial role in our analysis, which formalizes the fact that the leave-one-out version $\bm{U}^{\left(m\right)}$ obtained by Algorithm 5 is extremely close to $\bm{U}$ .

There exist some universal constants $c_{0},c_{1},c_{2}>0$ such that if

then with probability $1-O\left(d^{-10}\right)$ , the subspace estimate $\bm{U}^{(m)}$ computed by Algorithm 5 obeys

simultaneously for all $1\leq m\leq d$ , where

Lemma 5.8 follows immediately from the analysis of [CLC+20, Lemma 4]. As a remark, the construction of the leave-one-out sequences herein is slightly different from the one in [CLC+20]. However, it is straightforward to adapt the proof of [CLC+20] to the case considered herein. We therefore omit the proof for the sake of brevity.

3 Analysis for initialization: Part 2 (retrieval of individual tensor factors)

This section justifies that the procedure presented in Algorithm 3 allows to disentangle the tensor factors. For notational simplicity, we let

Fix any arbitrary small constant $\delta>0$ . Assume that

for some sufficiently large universal constant $c_{0},c_{3}>0$ and some sufficiently small universal constants $c_{1},c_{2},c_{4}>0$ . Then with probability exceeding $1-\delta$ , there exists a permutation $\pi(\cdot):[d]\mapsto[d]$ such that for all $1\leq i\leq r$ , the tensor factors $\{\bm{w}^{i}\}_{i=1}^{r}$ returned by Algorithm 3 satisfy

In short, this theorem asserts that the estimates returned by Algorithm 3 are — up to global permutation — reasonably close to the ground truth under our sample size and noise conditions. In order to establish this theorem and in order to provide initial guesses for the leave-one-out GD sequences, we need to produce a leave-one-out sequence tailored to this part of the algorithm. Such auxiliary sequences are generated in a similar spirit as the previous ones, and we summarize them in Algorithm 6. As usual, the resulting leave-one-out estimates $\big{\{}\lambda^{\left(m\right)}_{i},\bm{w}^{i,\left(m\right)}\big{\}}_{i=1}^{r}$ are statistically independent of $\mathcal{P}_{\Omega_{m}}\left(\bm{E}\right)$ .

In what follows, we gather a few key properties of the leave-one-out estimates, which play a crucial role in the analysis.

Fix any arbitrarily small constant $\delta>0$ . Instate the assumptions in Theorem 5.9. With probability exceeding $1-\delta$ , the permutation function stated in Theorem 5.9 obeys that: for all $1\leq i\leq r$ and all $1\leq m\leq d$ :

With Theorems 5.9-5.10 in place, we can immediately establish a few desired properties (particularly those specified in Section 5.1) of our initial estimate, as asserted in the following corollary.

Fix any arbitrarily small constant $\delta>0$ . Instate the assumptions in Theorem 2.8. With probability exceeding $1-\delta$ , the estimates $\bm{U}^{0}$ and $\bm{U}^{0,\left(m\right)}$ returned by Algorithm 3 and Algorithm 6 respectively satisfy the hypotheses (47) for $t=0$ .

3.2 Analysis

Before we start with the proof, we first state the main idea. For the sake of clarify, we define

In addition, let $\bm{\nu}^{\tau}$ be the top singular vector of $\bm{M}^{\tau}$ obeying $\langle\bm{T},(\bm{\nu}^{\tau})^{\otimes 3}\rangle\geq 0$ , and $\bm{\nu}^{\tau,(m)}$ the top singular vector of $\bm{M}^{\tau,(m)}$ obeying $\langle\bm{T}^{(m)},(\bm{\nu}^{\tau,(m)})^{\otimes 3}\rangle\geq 0$ . Set

These are all computed in the function $\textsc{Retrieve-one-tensor-factor}($ ) in the $\tau$ -th round.

We first show that for each $1\leq i\leq r$ , there exists at least one trial $1\leq\tau\leq L$ such that the $i$ -th tensor factor $\overline{\bm{u}}_{i}^{\star}$ is the top singular vector of the population version of $\bm{T}\times_{3}\bm{\theta}^{\tau}$ (with respect to the missing data and noise). In addition, the spectral gap is large enough to guarantee accurate estimates.

Next, we prove that given this spectral gap, the top singular vector $\bm{\nu}^{\tau}$ of $\bm{T}\times_{3}\bm{\theta}^{\tau}$ is close to $\overline{\bm{u}}^{\star}_{i}$ both in the $\|\cdot\|_{2}$ and $\|\cdot\|_{\infty}$ norm. This also enables us to accurately estimate the magnitude of $\bm{u}^{\star}_{i}$ .

Finally, we need to show that one can find those reliable estimates among $L$ random restarts. Combining the spectral gap information with the incoherence condition that tensor components are nearly orthogonal to each other, our selection procedure is guaranteed to recover all tensor factors.

Now we proceed to the proof. Without loss of generality, we prove the case for $i=1$ in the sequel, i.e. there exists some $\tau\in\left[L\right]$ such that $\bm{\nu}^{\tau}$ accurately recovers $\bm{u}^{\star}_{1}$ . Together with the union bound, this shows that we can find reliable estimates for all tensor factors. We then conclude the proof by showing that Algorithm 3 is able to find all of them without duplicates.

where $\overline{\bm{u}}_{i}^{\star}$ and $\lambda_{i}^{\star}$ are both defined in (66). The idea is to let $\bm{\gamma}^{\star\tau}$ approximate the singular values of $\bm{T}^{\star}\times_{3}\bm{\theta}^{\tau}$ ; this can be seen, for instance, via the following calculation:

where $\{\overline{\bm{u}}_{i}^{\star}\}_{i=1}^{r}$ — which are assumed to be incoherent (or nearly orthogonal to each other) — can be approximately viewed as the singular vectors of $\bm{T}^{\star}\times_{3}\bm{\theta}^{\tau}$ .

If we want our spectral estimate to be accurate, we would need to be assured that the two largest entries of $\bm{\gamma}^{\star\tau}$ (in magnitude) are sufficiently separated.

Instate the assumptions of Theorem 5.9. Define $\Delta_{1}^{\tau}:=\gamma_{1}^{\star\tau}-\max_{1<i\leq r}\big{|}\gamma_{i}^{\star\tau}\big{|}$ for each $1\leq\tau\leq L$ and let $\Delta_{1}^{\left(1\right)}\geq\Delta_{1}^{\left(2\right)}\geq\dots\geq\Delta_{1}^{\left(L\right)}$ denote the order statistics of $\big{\{}\Delta_{1}^{\tau}\big{\}}_{\tau=1}^{L}$ (in descending order). Fix any arbitrary small constant $\delta>0$ . With probability greater than $1-\delta/r$ , one has

Lemma 5.12 demonstrates that there exists some $\tau\in\left[L\right]$ such that $\gamma_{1}^{\star\tau}-\max_{1<i\leq r}\big{|}\gamma_{i}^{\star\tau}\big{|}\gtrsim\lambda_{\min}^{\star}$ . This means that $\overline{\bm{u}}_{1}^{\star}$ exhibits the largest correlation with the random projection $\theta$ , which further implies that $\overline{\bm{u}}_{1}^{\star}$ is the largest singular vector of $\bm{T}^{\star}\times_{3}\bm{\theta}^{\tau}$ with a considerable spectral gap (as we will show shortly). With the desired spectral gap in place, we are ready to look at the eigenvectors / singular vectors of interest. To this end, we find it convenient to introduce another auxiliary vector $\overline{\bm{u}}^{\tau}$ , defined as the leading singular vector of $\bm{M}^{\tau}$ (cf. (70c)) obeying

The careful reader would immediately notice the similarity between $\overline{\bm{u}}^{\tau}$ and $\bm{\nu}^{\tau}$ except for their global signs; namely, we determine the global sign of $\overline{\bm{u}}^{\tau}$ based on the ground truth information (76), but pick the global sign for $\bm{\nu}^{\tau}$ solely based on the observed data (cf. Algorithm 3). Fortunately, the vectors $\overline{\bm{u}}^{\tau}$ and $\bm{\nu}^{\tau}$ provably coincide, namely,

as we shall demonstrate momentarily in Lemma 5.16. In a similar way, we also denote by $\overline{\bm{u}}^{\tau,\left(m\right)}$ the leading singular vector of $\bm{M}^{\tau,\left(m\right)}$ defined in (70d) such that

Lemma 5.16 also shows that $\overline{\bm{u}}^{\tau,\left(m\right)}=\bm{\nu}^{\tau,\left(m\right)}$ .

Instate the assumptions of Theorem 5.9. Let $\overline{\bm{u}}^{\tau}$ and $\overline{\bm{u}}_{1}^{\star}$ be as defined in (76) and (66), respectively. Define $\mathcal{A}$ to be the event such that $\gamma_{1}^{\star\tau}-\max_{1<i\leq r}\big{|}\gamma_{i}^{\star\tau}\big{|}\gtrsim\lambda_{\min}^{\star}$ and the condition (75) hold. Then conditional on this event $\mathcal{A}$ , with probability exceeding $1-O\left(d^{-11}\right)$ one has

Instate the assumptions of Theorem 5.9. Define $\mathcal{A}$ to be the event such that $\gamma_{1}^{\star\tau}-\max_{1<i\leq r}\big{|}\gamma_{i}^{\star\tau}\big{|}\gtrsim\lambda_{\min}^{\star}$ and the condition (75) hold. Then conditional on this event $\mathcal{A}$ , one has, with probability exceeding $1-O\left(d^{-10}\right)$ , that

holds for all $m\in\left[d\right]$ , where $\mathcal{E}_{\mathsf{op}}$ is defined as follows:

where $\mathcal{E}_{\mathsf{op}}$ and $\mathcal{E}_{\mathsf{loo}}$ are defined in (81) and (65), respectively.

Next, we turn to the estimation accuracy regarding the size of the tensor factors and show that $\lambda_{\tau}$ (produced in Algorithm 3) is close to the truth as well. As it turns out, a byproduct of this step reveals that $\bm{\nu}^{\tau}=\overline{\bm{u}}^{\tau}$ and $\bm{\nu}^{\tau,\left(m\right)}=\overline{\bm{u}}^{\tau,\left(m\right)}$ , where $\overline{\bm{u}}^{\tau}$ and $\overline{\bm{u}}^{\tau,\left(m\right)}$ are an auxiliary vectors defined in (76) and (78), respectively.

Instate the assumptions of Theorem 5.9. Assume that the results in Lemma 5.13, Lemma 5.14 and Lemma 5.15 hold. Then with probability at least $1-O\left(d^{-10}\right)$ , one has

Thus far, we have only proved that one can find a reliable estimate for each tensor factor within $L$ random trials, provided that $L$ is sufficiently large. To finish up, it remains to show that the pruning procedure $\textsc{Prune}($ ) is capable of returning a rough estimate for each tensor factor without duplication. This is accomplished in the following lemma.

Instate the assumptions of Theorem 5.9. On the event that the results in Lemma 5.13, Lemma 5.14, Lemma 5.15 and Lemma 5.16 hold for all $1\leq i\leq r$ , there exists a permutation $\pi(\cdot):[d]\mapsto[d]$ such that: for each $1\leq i\leq r$ , $\big{(}\lambda_{i},\bm{w}^{i}\big{)}$ and $\big{(}\lambda_{\pi(i)}^{\star},\overline{\bm{u}}^{\star}_{\pi(i)}\big{)}$ satisfy (68a), (68b) and (68c); $\big{(}\lambda_{i},\bm{w}^{i}\big{)}$ and $\big{(}\lambda_{i}^{\left(m\right)},\bm{w}^{i,\left(m\right)}\big{)}_{i=1}^{r}$ obey (69a), (69b) and (69c) for all $1\leq m\leq d$ , where $\big{\{}\lambda_{i},\bm{w}^{i}\big{\}}_{i=1}^{r}$ and $\big{\{}\lambda_{i}^{\left(m\right)},\bm{w}^{i,\left(m\right)}\big{\}}_{i=1}^{r}$ are outputs of Algorithm 3 and Algorithm 6, respectively.

Discussion

The current paper uncovers the possibility of efficiently and stably completing a low-CP-rank tensor from partial and noisy entries. Perhaps somewhat unexpectedly, despite the high degree of nonconvexity, this problem can be solved to optimal statistical accuracy within nearly linear time, provided that the tensor of interest is well-conditioned, incoherent, and of constant rank. To the best of our knowledge, this intriguing message has not been shown in the prior literature.

Moving forward, one pressing issue is to understand how to improve the algorithmic and theoretical dependency upon the tensor rank $r$ of the proposed method. Ideally one would desire a fast algorithm whose sample complexity scales as $rd^{1.5}$ , an order that is provably achievable by the sum-of-squares hierarchy. Additionally, in contrast to the matrix counterpart where the rank is upper bounded by the matrix dimension, the tensor CP rank is allowed to rise above $d$ , which is commonly referred to as the over-complete case. Unfortunately, our current initialization scheme (i.e. the spectral method) fails to work unless $r<d$ , and our local analysis for GD falls of accommodating the scenario with $r>d$ . It would be of great interest to develop more powerful algorithms — in addition to more refined analysis — to tackle such an important over-complete regime.

Another tantalizing research direction is the exploration of landscape design for tensor completion. As our heuristic discussions as well as other prior work (e.g. [RM14]) suggest, randomly initialized gradient descent tailored to (4) seems unlikely to work, unless the sample size is significantly larger than the computational limit. This might mean either that there exist spurious local minima in the natural nonconvex least squares formulation (4), or that the optimization landscape of (4) is too flat around some saddle points and hence not amenable to fast computation. It would be interesting to investigate what families of loss functions allow us to rule out bad local minima and eliminate the need of careful initialization, which might be better suited for tensor recovery problems.

Finally, in statistical inference and decision making, one might not be simply satisfied with obtaining a reliable estimate for each missing entry, but would also like to report a short confidence interval which is likely to contain the true entry. This boils down to the fundamental task of uncertainty quantification for tensor completion, which we leave to future investigation.

Acknowledgements

Y. Chen is supported in part by the AFOSR YIP award FA9550-19-1-0030, by the ONR grant N00014-19-1-2120, by the ARO grants W911NF-20-1-0097 and W911NF-18-1-0303, by the NSF grants CCF-1907661, IIS-1900140 and DMS-2014279, and by the Princeton SEAS innovation award. H. V. Poor is supported in part by the NSF grant DMS-1736417. C. Cai is supported in part by Gordon Y. S. Wu Fellowships in Engineering. This work was done in part while Y. Chen was visiting the Kavli Institute for Theoretical Physics (supported in part by NSF grant PHY-1748958). We thank Lanqing Yu for many helpful discussions, and thank Yuling Yan for proofreading the paper.

Appendix A Proofs for local convergence of GD

In this section, we establish the key lemmas concerning the convergence properties of GD. As one can easily see, treating $\{E_{i,j,k}\}_{1\leq i,j,k\leq d}$ (resp. $\{\chi_{i,j,k}\}_{1\leq i,j,k\leq d}$ ) as independent random variables — which leads to asymmetric versions of $\bm{E}$ and $\Omega$ — does not affect the order of our results at all. In light of this, we shall adopt such an independent assumption whenever it simplifies our presentation.

From the Hessian expression (41), one can decompose

In what follows, we shall bound each of the above terms separately.

With regards to $\alpha_{4}$ , by symmetry we have

In order to control (88), we first see that

where $\widetilde{\bm{U}}^{\star}$ is as defined in (87). Similar to the proof of Lemma D.1, we can use the fact that $\langle\bm{u}_{i}^{\star\otimes 2},\bm{u}_{j}^{\star\otimes 2}\rangle=\langle\bm{u}_{i}^{\star},\bm{u}_{j}^{\star}\rangle^{2}$ and (8c) to deduce that

provided that $r\ll d/\mu$ . This implies that

(1) Speaking of an upper bound on $\alpha_{4}$ , we can invoke the Cauchy-Schwarz inequality followed by (91) to reach

(2) When it comes to lower bounding $\alpha_{4}$ , the main step boils down to controlling the inner product term in (88). Applying the Cauchy-Schwartz inequality gives that

where (i) comes from Cauchy-Schwarz, and the last line follows from (8c), (8d) as well as the condition that $r\ll\sqrt{d/\mu}$ and $\kappa\asymp 1$ . Therefore, we can lower bound $\alpha_{4}$ by (with the assistance of (91))

When it comes to $\alpha_{1}$ , we can expand

We will derive an upper bound on $\beta_{1}$ in the sequel; the same method immediately applies to $\beta_{2}$ .

For notational convenience, let us define

Apply the Cauchy-Schwartz inequality to yield that

Before we bound the above quantities, we pause to make the following observations. In view of the assumptions of this lemma that $\delta\ll 1/\sqrt{r}\leq 1$ , the following holds for all $i\in\left[r\right]$ :

Now, we proceed to prove the claim. Let us define $\mathcal{S}_{i}:=\{\underline{j}\in[d]^{2}\mid\chi_{ij_{1}j_{2}}=1\}$ for each $i\in[d]$ . Applying the Chernoff bound and the union bound yields that: with probability at least $1-O(d^{-10})$ one has

provided $p\gg d^{-2}\log d$ . It then follows from the Cauchy-Schwarz inequality that

where (i) arises from (97) and (98). In a similar manner, we can derive

under the sample size assumptin that $p\gg\mu^{2}r^{2}d^{-2}\log d$ . Here the last inequality makes use of (90). It is self-evident that the above bounds also hold for quantities that appear in (95). Since $0<\delta\ll 1/\sqrt{r}<1$ , we obtain

The same upper bound holds for any other $\beta_{2}$ . Therefore, as long as $0<\delta\ll 1/(\mu\sqrt{r})$ and $\kappa\asymp 1$ , we have

Regarding $\alpha_{2}$ , applying [YZ16, Lemma 5] (with slight modification, which we omit here for brevity) implies that: if $p\gg\mu^{2}r^{2}d^{-2}\log d$ , then with probability exceeding $1-O(d^{-10})$ ,

Here the last inequality arises from (92).

We now move on to bounding $\alpha_{3}$ . The triangle inequality gives

Recall the definitions of $\bm{\Delta}$ and $\bm{\Delta}_{i}$ in (94). Fix an arbitrary $s\in\left[r\right]$ . From the definition of the operator norm and the triangle inequality, we can derive

With this lemma in mind, we are ready to derive that

Given that $p\gtrsim d^{-3/2}\log^{3}d$ , applying Lemma D.2 indicates that with probability at least $1-O(d^{-10})$ ,

Moreover, it is straightforward to see that $\left\|\bm{1}^{\otimes 3}\right\|=\left\|\bm{1}\right\|_{2}^{3}=d^{3/2}$ . Therefore, one has

Next, we turn to $\left\|\cdot\right\|_{\infty}$ . We first expand

By symmetry, it suffices to control $\big{\|}\sum_{i\in[r]}\bm{u}^{\star\otimes 2}_{i}\otimes\bm{\Delta}_{i}\big{\|}$ , $\big{\|}\sum_{i\in[r]}\bm{\Delta}_{i}^{\otimes 2}\otimes\bm{u}^{\star}_{i}\big{\|}$ and $\big{\|}\sum_{i\in[r]}\bm{\Delta}_{i}^{\otimes 3}\big{\|}$ . Let us look at the first term. Towards this, for each $\left(i,j,k\right)\in\left[d\right]^{3}$ , we can use the Cauchy-Schwartz inequality to control

In a similar manner, we can control the remaining two terms by

Recall that $0<\delta\ll 1/r\leq 1$ . Putting these together reveals that

where we use (96c) in the last step. In view of the condition that $\delta\ll 1/(\mu^{3/2}r)$ and the assumption $\kappa\asymp 1$ , one has with probability greater than $1-O(d^{-10})$ ,

A.1.5 Putting all this together

Note that the above bounds hold uniformly for all $\bm{V}$ . Therefore, combining upper bounds for $\alpha_{i}$ and the union bound, we conclude that with probability exceeding $1-O(d^{-10})$ ,

A.2 Proof of Lemma 5.2

From (47a) and (47c), we use the triangle inequality to obtain

Similarly, we can combine (47b) and (47c) to obtain that

A.3 Proof of Lemma 5.3

which motivates us to bound $\alpha_{1}$ and $\alpha_{2}$ separately.

(1) We start with $\alpha_{1}$ , towards which we find it helpful to define

Given that $\nabla f_{\mathsf{clean}}\left(\bm{U}^{\star}\right)=\bm{0}$ (since $\bm{U}^{\star}$ is a global optimizer of $f_{\mathsf{clean}}$ ), we can use the fundamental theorem of calculus to obtain

From the hypothesis (47b) as well as our conditions that $\frac{\sigma}{\lambda_{\min}^{\star}}\sqrt{\frac{d\log d}{p}}+\mathcal{E}_{\mathsf{local}}\ll\frac{1}{\mu^{3/2}r}$ , we know that $\bm{U}^{t}(\tau)$ ( $0\leq\tau\leq 1$ ) satisfies the conditions required in Lemma 5.1. Therefore, applying Lemma 5.1 gives that

Substitution into (116) indicates that: if $0<\eta\leq\lambda_{\min}^{\star 4/3}/\big{(}32\lambda_{\max}^{\star 8/3}\big{)}$ , then

which implies that (since $1-a/2\geq\sqrt{1-a}$ for $0<a<1$ )

(2) We now turn to $\alpha_{2}$ . To simplify presentation, we shall assume that $\{E_{i,j,k}\}_{i,j,k\in[d]}$ (resp. $\{\chi_{i,j,k}\}_{i,j,k\in[d]}$ ) are independent random variables. Fix an arbitrary $s\in[r]$ and $m\in[d]$ . The $m$ -th entry of $\mathcal{P}_{\Omega}\left(\bm{E}\right)\times_{1}\bm{u}^{t}_{s}\times_{2}\bm{u}^{t}_{s}$ can be expanded as follows:

Before continuing, we make the following observations: from the hypotheses (47a), (47b), (47c), as well as our assumption that $\frac{\sigma}{\lambda_{\min}^{\star}}\sqrt{\frac{d\log d}{p}}+\mathcal{E}_{\mathsf{local}}\ll\frac{1}{\mu^{3/2}r}$ , we see that the following holds for all $s\in\left[r\right]$ :

With these estimates in place, we can upper bound the above four terms in (118) separately.

For $\beta_{1}$ , we note that, by construction, $\bm{u}^{t,\left(m\right)}$ is independent of the $m$ -th mode-3 slice of $\mathcal{P}_{\Omega}\left(\bm{E}\right)$ . This tells us that

can be viewed as a sum of independent zero-mean random variables (conditional on $\mathcal{P}_{\Omega_{-m}}\left(\bm{E}\right)$ ). It is straightforward to compute that

where $\|\cdot\|_{\psi_{1}}$ denotes the sub-exponential norm and we use (119) in the last inequality. Applying the matrix Bernstein inequality [Kol11, Corollary 2.1] yields that: with probability $1-O(d^{-20})$ ,

For $\beta_{2}$ , we first invoke [CW15, lemma 11] to demonstrate that: with probability at least $1-O(d^{-11})$ ,

where the last inequality follows from the noise condition. Clearly, the above bound holds for $\beta_{3}$ as well.

Taking together the above bounds and substituting them into (118), we obtain

Recognizing that this holds for any $m\in[d]$ and $s\in[r]$ , one can sum over $m$ and $s$ to deduce that

(3) Combining (117) and (122) yields that: with probability at least $1-O(d^{-10})$ ,

with the proviso that $0<\eta\leq\lambda_{\min}^{\star 4/3}/\big{(}32\lambda_{\max}^{\star 8/3}\big{)}$ , $1-\big{(}\lambda_{\min}^{\star 4/3}/5\big{)}\eta\leq\rho<1$ and that $C_{2}$ is sufficiently large.

A.4 Proof of Lemma 5.4

Fix an arbitrary $m\in[r]$ . From the definition of $f^{\left(m\right)}$ in (45) and (46), we can show that

Before proceeding to bounding these terms, we pause to define

From (119) and hypothesis (47d), one can applying a similar argument as in (107) to find that

Now we begin to bound the terms in (123) separately.

(1) We start with $\alpha_{1}$ . For any $0\leq\tau\leq 1$ , define

The fundamental theorem of calculus yields

provided that $0<\eta\leq\lambda_{\min}^{\star 4/3}/\big{(}32\lambda_{\max}^{\star 8/3}\big{)}$ .

In what follows, we shall assume that $\{E_{i,j,k}\}_{i,j,k\in[d]}$ (resp. $\{\chi_{i,j,k}\}_{i,j,k\in[d]}$ ) are independent random variables in order to simplify presentation.

(2) The next step is to bound $\alpha_{2}$ . For notational simplicity, define

In order to control the Frobenius norm of $\bm{V}^{t,\left(m\right)}$ , we shall start by considering the $m$ -th row of $\bm{V}^{t,\left(m\right)}$ . In view of the definitions of $\mathcal{P}_{\Omega_{m}}$ and $\mathcal{P}_{m}$ (cf. Appendix 5.1.3), we can expand

We recognize that $\big{\{}\bm{u}_{s}^{t,\left(m\right)}\big{\}}_{s=1}^{r}$ is independent of $\Omega_{m}$ , making it convenient for us to upper bound $\big{\|}\bm{V}^{t,\left(m\right)}_{m,:}\big{\|}_{2}$ . Specifically, for any $i,j\in[d]$ , from (119) and (125) we have

where the last inequality holds due to (119) and (125). We then apply the matrix Bernstein inequality to yield that: with probability exceeding $1-O(d^{-20})$ ,

where the last step holds as long as $p\gg\mu^{2}d^{-2}\log d$ .

Next, we turn to the $k$ -th row of $\bm{V}^{t,\left(m\right)}$ for any $k\neq m$ . For each $s\in[r]$ , we have

Similar to the proof of Lemma D.9, we can show that with probability at least $1-O(d^{-20})$ ,

where the last inequality follows from (119), and (125). It is easily seen that the bound also holds for the summation over $i\neq m$ . We can then use Cauchy-Schwartz to arrive at

Combining (128) with (127) and invoking the union bound, we conclude that: with probability exceeding $1-O(d^{-20})$ ,

for some absolute constant $C>0$ , where we use the assumption that $\kappa\asymp 1$ .

(3) For $\alpha_{3}$ , following a similar argument for $\alpha_{2}$ , we define

for each $s\in[r]$ . The $m$ -th row of $\bm{W}^{t,\left(m\right)}$ is a sum of independent zero-mean random vectors:

With (119) in place, it is easy to verify that

where the last inequality holds as long as $p\gg\mu d^{-2}\log^{3}d$ .

for each $s\in[r]$ . Arguing similarly as in the proof of Lemma D.10, we have with probability at least $1-O(d^{-20})$ ,

where the last inequality follows from (119). Additionally, the summation over $\{i:i\neq m\}$ can be controlled using the same argument. Therefore, we use Cauchy-Schwarz to find that

Combined with (130) and the assumption that $\kappa\asymp 1$ , we obtain that

for some absolute constant $\widetilde{C}>0$ .

(4) Regarding $\alpha_{4}$ , we use the triangle inequality to show that for each $s\in[r]$ ,

where the last line follows from (119). From Corollary D.3, we can further upper bound

As a result, we sum over $s\in[r]$ and use the Cauchy-Schwartz inequality to derive

where the last step arises from conditions that $\frac{\sigma}{\lambda_{\min}^{\star}}\sqrt{\frac{d\log d}{p}}\ll\frac{1}{\mu^{3/2}r}$ and $\kappa\asymp 1$ .

(5) Taking (126), (129), (132) and (133) together, we can invoke the sample size assumption that $p\gg\mu^{3}r^{2}d^{-2}\log^{3}d$ and the union bound to show that: with probability greater than $1-O(d^{-10})$ one has

provided $0<\eta\leq\lambda_{\min}^{\star 4/3}/\big{(}32\lambda_{\max}^{\star 8/3}\big{)}$ , $1-\big{(}\lambda_{\min}^{\star 4/3}/5\big{)}\eta\leq\rho<1$ and $C_{6}$ is sufficiently large.

A.5 Proof of Lemma 5.5

Fix an arbitrary $m\in[d]$ . Recall our notation of $\bm{\Delta}^{t,\left(m\right)}_{\bm{T}}$ in (124). To simplify presentation, we further define

leaving us with two terms to deal with. As it turns out, we will show that $\alpha_{1}$ is the dominant term and $\alpha_{2}$ is negligible. To simplify presentation, we shall assume that $\{E_{i,j,k}\}_{i,j,k\in[d]}$ (resp. $\{\chi_{i,j,k}\}_{i,j,k\in[d]}$ ) are independent random variables.

Regarding $\alpha_{1}$ , the definition of $\mathcal{P}_{\Omega_{-m}}$ allows us to derive

We can express $\bm{\Delta}^{t,\left(m\right)}_{\bm{T}}=\sum_{s\in[r]}\big{(}\bm{\Delta}_{s}^{t,\left(m\right)}+\bm{u}_{s}^{\star}\big{)}^{\otimes 3}-\bm{u}_{s}^{\star\otimes 3}$ and compute that

for each $s\in[r]$ . This further indicates that

In what follows, we will control the four terms separately.

For $\beta_{1}$ , by (119), we use Cauchy-Schwarz to show that

Regarding $\beta_{2}$ , by (119), we apply the Cauchy-Schwarz inequality again to get that: for each $s\in[r]$ ,

Together with the assumption that $\kappa\asymp 1$ , this implies that

With regards to $\beta_{3}$ , we show that for each $s\in[r]$ :

Turning attention to $\beta_{4}$ , we observe that for each $s\in[r]$ ,

where the last line follows from (119) and the rank assumption $r\ll\sqrt{d/\mu}$ . As a consequence,

With regards to $\alpha_{2}$ , it follows from the definition (46) and (134) that

Recall the definition of $\mathcal{P}_{\Omega_{-m}}$ and $\mathcal{P}_{m}$ . For the $m$ -th row, we have

From the triangle inequality, we can further decompose

Let us consider $\gamma_{1}$ first. It is straightforward to calculate that

for each $s\in[r]$ . From (119), we use the triangle inequality and the Cauchy-Schwarz inequality to upper bound

Moreover, it is easy to see that the upper bound also holds for $\gamma_{2}$ . As for $\gamma_{3}$ , we can express

Similarly, we combine (119) wtih the triangle inequality and the Cauchy-Schwarz inequality to bound

where the last inequality follows from the assumption that $\kappa\asymp 1$ .

Putting (141) and (145) together, we reach the conclusion from (48) and the condition $r\ll\sqrt{d/\mu}$ that,

provided that $0<\eta\leq\lambda_{\min}^{\star 4/3}/\big{(}32\lambda_{\max}^{\star 8/3}\big{)}$ , $1-\lambda_{\min}^{\star 4/3}\eta/5\leq\rho<1$ , $C_{7},C_{8}$ are sufficiently large.

Recognizing that the above bound holds for any $1\leq m\leq d$ , we conclude the proof.

A.6 Proof of Lemma 5.6

Combining Lemma 5.4 and Lemma 5.5, we conclude that with probability at least $1-O(d^{-10})$ ,

with the proviso that $C_{3}/\left(C_{5}+C_{7}\right)$ and $C_{4}/\left(C_{6}+C_{8}\right)$ are both sufficiently large.

Appendix B Proofs for retrieving tensor components

We shall often operate upon the event where the claims in Lemma 5.7 hold, which happens with very high probability (i.e. at least $1-O\left(d^{-10}\right)$ ). Recall the definition of $\bm{\gamma}^{\star\tau}$ in (72). Since $\bm{\theta}^{\tau}=\bm{U}\bm{U}^{\top}\bm{g}^{\tau}$ , this allows us to write that: for each $1\leq i\leq r$ ,

where we recall that $\lambda_{i}^{\star}=\|\bm{u}_{i}^{\star}\|_{2}^{3}$ and $\overline{\bm{u}}_{i}^{\star}=\bm{u}_{i}^{\star}/\|\bm{u}_{i}^{\star}\|_{2}$ . Given that $\bm{g}^{\tau}$ is a Gaussian vector independent of $\bm{U}$ , we observe that $\bm{\gamma}^{\star\tau}$ is zero-mean Gaussian conditional on $\Omega$ and $\bm{E}$ . In order to understand the order statistics associated with this vector, we first look at its covariance matrix.

Denote by $\bm{\Sigma}^{\tau}$ the covariance matrix of $\bm{\gamma}^{\star\tau}$ (conditional on $\bm{U}$ ). Then we have

where we denote by $\mathcal{P}_{\bm{U}}(\bm{z})=\bm{U}\bm{U}^{\top}\bm{z}$ and $\mathcal{P}_{\bm{U}^{\perp}}(\bm{z})=\left(\bm{I}-\bm{U}\bm{U}^{\top}\right)\bm{z}$ . In addition, since the unit vector $\overline{\bm{u}}_{i}^{\star}$ lies in the span of the columns of $\bm{U}_{\mathsf{orth}}^{\star}$ (cf. (54)), it follows from Lemma D.6 that

where $\bm{R}$ is a rotation matrix defined in (55). This together with (147a) gives

where we have also used the fact that $\left\|\mathcal{P}_{\bm{U}}\big{(}\overline{\bm{u}}_{i}^{\star}\big{)}\right\|_{2}\leq\left\|\overline{\bm{u}}_{i}^{\star}\right\|_{2}=1$ . Moreover, taking together (147a), (148b) and the incoherence condition, we see that for any $1\leq i\neq j\leq r$ ,

which is expected to be small if $\delta_{1}$ is small.

From our assumptions on the sample size, the rank and the condition number, we can invoke Lemma 5.7 to see that $\kappa r\delta_{1}\ll 1$ , where $\delta_{1}$ is defined in (150) and $\kappa=\lambda_{\max}^{\star}/\lambda_{\min}^{\star}$ . Thus, we can decompose $\bm{\Sigma}^{\tau}$ into two components as follows

As it turns out, both $\widehat{\bm{\Sigma}}^{\tau}$ and $\breve{\bm{\Sigma}}^{\tau}$ are positive definite. Indeed, we first learn from (149) and (150) that: the $i$ -th digaonal entry of $\breve{\bm{\Sigma}}^{\tau}$ obeys

where (i) holds since $\delta_{1}<1$ under our assumptions, (ii) follows since $\kappa\lambda_{i}^{\star}\geq\kappa\lambda_{\min}^{\star}=\lambda_{\max}^{\star}$ , and the last line makes use of (150). This implies that $\breve{\bm{\Sigma}}^{\tau}$ is diagonally dominant, and hence $\breve{\bm{\Sigma}}^{\tau}\succeq\bm{0}$ . In conclusion, both $\widehat{\bm{\Sigma}}^{\tau}$ and $\breve{\bm{\Sigma}}^{\tau}$ are positive definite.

Let $\widehat{\bm{\gamma}}^{\star\tau}$ and $\breve{\bm{\gamma}}^{\star\tau}$ be independent zero-mean Gaussian random vectors with covariance matrices $\widehat{\bm{\Sigma}}^{\tau}$ and $\breve{\bm{\Sigma}}^{\tau}$ , respectively. Clearly, the distribution of $\bm{\gamma}^{\star\tau}$ is identical to that of $\widehat{\bm{\gamma}}^{\star\tau}+\breve{\bm{\gamma}}^{\star\tau}$ . Consequently, it allows us to look at the distributions of these two random vectors separately.

In view of (149) and the fact $\kappa r\delta_{1}<1$ , one has

Thus, with probability at least $1-O\left(d^{-10}\right)$ , we have

We then turn attention to $\widehat{\bm{\gamma}}^{\star\tau}$ , which is composed of independent Gaussian random variables. Let us define $\widehat{\Delta}^{\tau}:=\widehat{\gamma}_{1}^{\star\tau}-\max_{1<i\leq r}\big{|}\widehat{\gamma}_{i}^{\star\tau}\big{|}$ for each $1\leq\tau\leq L$ and let $\widehat{\Delta}_{1}^{\left(1\right)}\geq\widehat{\Delta}_{1}^{\left(2\right)}\geq\dots\geq\widehat{\Delta}_{1}^{\left(L\right)}$ denote the order statistics of $\big{\{}\widehat{\Delta}_{1}^{\tau}\big{\}}_{\tau=1}^{L}$ in descending order. Fix any small constant $\delta>0$ . Invoke Lemma D.5 to demonstrate that: with probability greater than $1-\delta/r$ ,

Putting (152) and (154) together and invoking the triangle inequality immediately establish (74a) and (74b). On the other hand, combining (152) with (155a) proves (75a); (153a) and (155b) taken collectively establish (75b), whereas (153b) and (155c) prove (75c).

B.2 Proof of Lemma 5.13

Recall that the vector of interest $\overline{\bm{u}}^{\tau}$ is the leading singular vector of $\bm{M}^{\tau}$ (as constructed in (70c)), where $\bm{M}^{\tau}$ satisfies

Instate the assumptions of Lemma 5.13. With probability at least $1-O\left(d^{-10}\right)$ , one has

As a consequence, recalling the definition of $\mathcal{E}_{\mathsf{op}}$ and $\mathcal{E}_{\mathsf{proj}}$ in (81) and (79) respectively, one has

It then follows from Weyl’s inequality that

where $\sigma_{i}(\bm{Z})$ denotes the $i$ -th largest singular value of a matrix $\bm{Z}$ and we use the condition that $\mathcal{E}_{\mathsf{op}}\ll 1$ . All in all, these arguments justify that the spectrum of $\bm{M}^{\tau}$ is fairly close to that of $\bm{M}^{\star\tau}$ .

Next, we look at the gap between the two leading singular values of $\bm{M}^{\star\tau}$ . To begin with, it is self-evident from the definition of $\bm{M}^{\star\tau}$ that: $\overline{\bm{u}}_{1}^{\star}$ is the singular vector of $\bm{M}^{\star\tau}$ . In fact, we claim one further result, that is, $\overline{\bm{u}}_{1}^{\star}$ is indeed the leading singular vector of $\bm{M}^{\star\tau}$ whose singular value is given by $\sigma_{1}\big{(}\bm{M}^{\star\tau}\big{)}=\gamma_{1}^{\star\tau}$ . Towards this end, let us define

Let $|\gamma^{\star\tau}|_{(1)}\geq\dots\geq|\gamma^{\star\tau}|_{(r)}$ denote the absolute values of $\{\gamma_{i}^{\star\tau}\}_{i=1}^{r}$ in descending order. This together with Lemma 5.12 implies that

as long as $\kappa r\sqrt{(\mu\log d)/d}\ll 1$ . Given that $\overline{\bm{u}}_{1}^{\star}$ is the singular vector of $\bm{M}^{\star\tau}$ with singular value $\gamma_{1}^{\star\tau}$ , we can conclude that $\sigma_{1}\big{(}\bm{M}^{\star\tau}\big{)}=\gamma_{1}^{\star\tau}$ . This also allows us to lower bound the gap between the two largest singular values $\bm{M}^{\star\tau}$ as follows

provided that $\kappa r\sqrt{(\mu\log d)/d}\ll 1$ . We also know from (163) and (164) that

Combined with (162) and Wedin’s theorem, we conclude that

Here, we have made use of the fact that $\overline{\bm{u}}^{\tau}$ is the leading singular vector of $\bm{M}^{\tau}$ obeying $\big{\langle}\overline{\bm{u}}^{\tau},\overline{\bm{u}}_{1}^{\star}\big{\rangle}\geq 0$ .

B.3 Proof of Lemma B.1

We first consider the spectral norm of $\bm{F}^{\tau}$ . Recall the definition that $\bm{\theta}^{\tau}=\bm{U}\bm{U}^{\top}\bm{g}^{\tau}$ . Let us define

In the sequel, we shall control these two terms separately.

To bound $\left\|\bm{X}\right\|$ , observe that $\bm{\theta}^{\star\tau}$ is independent of $p^{-1}\bm{T}-\bm{T}^{\star}$ . By Lemma D.4, one has

Combining (166) with (167) and (168) reveals that with probability exceeding $1-O\left(d^{-20}\right)$ ,

where the last inequality holds as long as $p\gtrsim\mu d^{-2}\log^{4}d$ .

Turning to $\bm{Y}$ , we can simply upper bound

Since $\mathsf{rank}\left(\bm{U}\bm{U}^{\top}-\bm{U}_{\mathsf{orth}}^{\star}\bm{U}_{\mathsf{orth}}^{\star\top}\right)\leq 2r$ , Lemma 5.7 and the standard result of Gaussian random vectors yields that: with probability at least $1-O\left(d^{-12}\right)$ ,

where we recall the definition of $\mathcal{E}_{\mathsf{se}}$ in (57) and that $\mathcal{E}_{\mathsf{se}}\ll 1/\sqrt{r\log d}$ by our conditions. Moreover, by Lemma D.2, we know that with probability exceeding $1-O\left(d^{-10}\right)$ ,

Putting (169) and (172) together shows that

Next, we turn to $\big{\|}\bm{F}^{\tau}\overline{\bm{u}}_{1}^{\star}\big{\|}_{2}$ . By the definition of the operator norm, we know that

Applying Lemma D.4 again reveals that with probability at least $1-O\left(d^{-12}\right)$ ,

where the last step arises from the condition that $p\gtrsim\mu d^{-2}\log^{4}d$ . In addition, realizing that $\bm{U}$ consists of eigenvectors, standard Gaussian random vectors results give that with probability at least $1-O\left(d^{-12}\right)$ ,

Combining (173) and (174) shows that with probability exceeding $1-O\left(d^{-10}\right)$ ,

Recall the definition of $\bm{C}^{\tau}$ in (156). We first consider the spectral norm of $\bm{C}^{\tau}$ . It is straightforward to compute that

if $\mu r/d\lesssim 1$ , where we recall that $\overline{\bm{U}}^{\star}=\left[\overline{\bm{u}}_{1}^{\star},\cdots,\overline{\bm{u}}_{r}^{\star}\right]$ . Here, the last line holds owing to (8c), Lemma D.1 (which justifies that $\big{\|}\overline{\bm{U}}^{\star}\big{\|}\lesssim 1$ if $r\sqrt{\mu/d}\leq 1$ ) and Lemma 5.12.

The claim (160) arises from the definition of the spectral norm that $\big{\|}\bm{C}^{\tau}\overline{\bm{u}}_{1}^{\star}\big{\|}_{2}\leq\big{\|}\bm{C}^{\tau}\big{\|}$ .

B.4 Proof of Lemma 5.14

Let us fix an arbitrary $m\in\left[d\right]$ . We remind the readers of several definitions: (1) $\gamma_{1}^{\star\tau}$ : see (72); (2) $\bm{M}^{\star\tau}$ : see (156); and (3) $\bm{M}^{\tau,\left(m\right)}$ : see (70d).

Before continuing, we state two immediate facts. First, it has already been observed in Appendix B.2 that $\overline{\bm{u}}_{1}^{\star}$ is a singular vector of $\bm{M}^{\star\tau}$ with singular value $\gamma_{1}^{\star\tau}$ , and hence

Here and throughout, $\bm{A}_{m,:}$ denotes the $m$ -th row of a matrix $\bm{A}$ . Second, $\overline{\bm{u}}^{\tau,\left(m\right)}$ is the top singular vector of $\bm{M}^{\tau,\left(m\right)}$ such that $\langle\overline{\bm{u}}^{\tau,\left(m\right)},\overline{\bm{u}}_{1}^{\star}\rangle\geq 0$ , and we denote by $\gamma_{\tau}^{(m)}$ the associated singular value. Recall our definition of $\bm{\nu}^{\tau,\left(m\right)}$ in Algorithm 6. Similar to the case of $\bm{\nu}^{\tau}$ , we will show shortly in Lemma 5.16 that the global signs of $\bm{\nu}^{\tau,\left(m\right)}$ and $\overline{\bm{u}}^{\tau,\left(m\right)}$ coincide, and hence

As a result, the proof of this lemma boils down to showing that $\overline{\bm{u}}^{\tau,\left(m\right)}$ (and hence $\bm{\nu}^{\tau,\left(m\right)}$ ) is sufficiently close to $\overline{\bm{u}}_{1}^{\star}$ in the $m$ -th entry. Towards this end, observe that

The above two facts (176) and (178) together with the triangle inequality lead to

Therefore, it suffices to upper bound the above three quantities separately.

The first step to bound $\alpha_{3}$ (cf. (179)) is to control $\big{\|}\bm{M}_{m,:}^{\star\tau}\big{\|}_{2}$ . Towards this end, we first observe from the incoherence conditions that

When combined with the definition (156), this gives

where (i) arises from (180) and $\big{\|}\overline{\bm{U}}^{\star}\big{\|}\lesssim 1$ if $r\sqrt{\mu/d}\ll 1$ , and the last step comes from Lemma 5.12.

The second step is to upper bound $\big{\|}\overline{\bm{u}}^{\tau,\left(m\right)}-\overline{\bm{u}}_{1}^{\star}\big{\|}_{2}$ . Towards this, we resort to Wedin’s theorem as follows

where we rely on the fact that $\big{\langle}\overline{\bm{u}}^{\tau,\left(m\right)},\overline{\bm{u}}_{1}^{\star}\big{\rangle}\geq 0$ . To complete this bound, we need to control $\bm{M}^{\tau,\left(m\right)}-\bm{M}^{\star\tau}$ . Before we move on, we find it helpful to introduce

Let $\widehat{\bm{u}}^{\tau,\left(m\right)}$ denote the top left singular vector of $\widehat{\bm{M}}^{\tau,\left(m\right)}$ such that

Since we have already bounded $\big{\|}\bm{M}^{\tau}-\bm{M}^{\star\tau}\big{\|}$ in Lemma B.1, we can decompose

With these definitions in place, Lemma B.2 below provides the desired bounds.

Instate the assumptions of Lemma 5.14. With probability at least $1-O\left(d^{-10}\right)$ , the following holds simultaneously for all $1\leq m\leq d$ :

where $\mathcal{E}_{\mathsf{loo}}$ is defined in (65). As a result, one has

We then can further combine (161) and (162) to deduce that

In particular, we have $\big{\|}\bm{M}^{\tau,\left(m\right)}-\bm{M}^{\star\tau}\big{\|}\ll\lambda_{\min}^{\star}$ under our conditions, and it follows from (164) that

To finish up, combine this with (181) and the spectral condition to arrive at

which results from the fact that $\mathcal{E}_{\mathsf{proj}}\leq\mathcal{E}_{\mathsf{op}}$ (cf. (81)).

We then turn to $\alpha_{2}$ (cf. (179)). Recall the definition of $\bm{M}^{\tau,\left(m\right)}$ in (70c). It is straightforward to verify that

where $\bm{C}^{\tau}$ is defined in (156).

From the incoherence conditions, we can upper bound

where we use the fact that $\big{\|}\bm{\gamma}^{\star\tau}\big{\|}_{2}\lesssim\sqrt{r\log d}\,\lambda_{\max}^{\star}$ from Lemma 5.12 and $\big{\|}\overline{\bm{U}}^{\star}\big{\|}\lesssim 1$ from Lemma D.1.

is a zero-mean Gaussian random vector conditional on $\mathcal{P}_{\Omega}\left(\bm{E}\right)$ . Using standard results on Gaussian random vectors, one has: with probability at least $1-O\left(d^{-11}\right)$ , for each $s\in\left[r\right]$ and $m\in\left[d\right]$ ,

where we have used Lemma 5.8 as well as the conditions that $\mathcal{E}_{\mathsf{loo}}\ll 1$ , $\big{\|}\overline{\bm{U}}^{\star}\big{\|}\lesssim 1$ and $\big{\|}\overline{\bm{U}}^{\star}\big{\|}_{2,\infty}\lesssim\sqrt{\mu r/d}$ .

Putting the above bounds together, we arrive at

where we remind the reader of the definition of $\mathcal{E}_{\mathsf{op}}$ in (81).

The remaining quantity to control is $\alpha_{1}$ (see (179)). Invoke Weyl’s inequality to show that

where the last inequality arises from (188) and Lemma 5.12. Under our sample size, rank and noise conditions, we have

Moreover, we learn from (195) and (181) that

where the last step follows from the fact that $\mu r\leq d$ and $\kappa\asymp 1$ . Hence, we reach the conclusion that

Putting together all of the preceding bounds on $\alpha_{1}$ , $\alpha_{2}$ and $\alpha_{3}$ immediately establishes the lemma.

B.5 Proof of Lemma B.2

First of all, if the claims (185) and (186) can be established, then putting them together yields

where we recall the definition of $\mathcal{E}_{\mathsf{loo}}$ in (65) and use the sample size, noise and rank conditions. The rest of the proof is thus dedicated to establishing (185) and (186). In what follows, we shall assume $\{E_{i,j,k}\}_{i,j,k\in[d]}$ (resp. $\{\chi_{i,j,k}\}_{i,j,k\in[d]}$ ) are independent random variables to simplify presentation.

Recall the definition of $\bm{M}^{\tau,\left(m\right)}=p^{-1}\bm{T}^{(m)}\times_{3}\bm{\theta}^{\tau,\left(m\right)}$ in (70c). Comparing this with the definition of $\widehat{\bm{M}}^{\tau,\left(m\right)}$ in (183), we see that

Note that $\bm{\theta}^{\tau,\left(m\right)}\sim\mathcal{N}\left(\bm{0},\bm{U}^{\left(m\right)}\bm{U}^{\left(m\right)\top}\right)$ conditional on $\mathcal{P}_{\Omega}\left(\bm{E}\right)$ . Standard Gaussian concentration inequalities reveal that with probability exceeding $1-O(d^{-10})$ ,

From Lemmas 5.7-5.8 and the fact that $\max\{\mathcal{E}_{\mathsf{se}},\mathcal{E}_{\mathsf{loo}}\}\ll 1$ , we have

As a consequence, standard concentration results assert that with probability $1-O(d^{-10})$ ,

Regarding the $m$ -th row of $\widehat{\bm{M}}^{\tau,\left(m\right)}-\bm{M}^{\tau,\left(m\right)}$ , apply Lemma D.9 to show that with probability $1-O\left(d^{-11}\right)$ ,

where the last inequality comes from (200). In addition, Lemma D.10 indicates that with probability exceeding $1-O\left(d^{-11}\right)$ ,

where the second line comes from (200) and (198), and the last inequality holds as long as $p\gg\mu d^{-2}\log^{5}d$ . These together with (197c) allow us to obtain

Clearly, this bound is also valid for $\sum_{i:i\neq m}\big{\{}\big{(}\widehat{\bm{M}}^{\tau,\left(m\right)}-\bm{M}^{\tau,\left(m\right)}\big{)}_{i,m}\big{\}}^{2}$ , namely,

When it comes to the remaining entries of $\widehat{\bm{M}}^{\tau,\left(m\right)}-\bm{M}^{\tau,\left(m\right)}$ , by the fact that the spectral norm of a submatrix is always less than or equal to that of the whole matrix, applying the matrix Bernstein inequality gives that with probability $1-O\left(d^{-11}\right)$ ,

as long as our sample size and rank condition holds.

Putting the preceding bounds together yields

Recall the definitions of $\bm{M}^{\tau}$ and $\widehat{\bm{M}}^{\tau,\left(m\right)}$ in (279b) and (183), respectively. From the definition of the operator norm and the triangle inequality, we have

As shown in (193), with probability at least $1-O\left(d^{-12}\right)$ ,

Consequently, we know from Lemma 5.8 that with probability at least $1-O\left(d^{-10}\right)$ ,

where we use the fact that $\big{\|}\overline{\bm{U}}^{\star}\big{\|}\lesssim 1$ if $r\sqrt{\mu/d}\ll 1$ .

When it comes to $\alpha_{2}$ , combining (171) and (194) with our sample size, noise and rank conditions, one has

Combining (204), (206) and (207), we conclude that

B.6 Proof of Lemma 5.15

We start with the first claim regarding $\|\overline{\bm{u}}^{\tau}-\bm{\nu}^{\tau,\left(m\right)}\|_{2}$ , or equivalently, $\|\overline{\bm{u}}^{\tau}-\overline{\bm{u}}^{\tau,\left(m\right)}\|_{2}$ (as argued in the proof of Lemma 5.14). By the triangle inequality, we can upper bound the following two terms separately:

Here, we remind the reader that $\widehat{\bm{u}}^{\tau,\left(m\right)}$ is the top left singular vector of $\widehat{\bm{M}}^{\tau,\left(m\right)}$ (see (183)) obeying $\big{\langle}\widehat{\bm{u}}^{\tau,\left(m\right)},\overline{\bm{u}}_{1}^{\star}\big{\rangle}\geq 0$ .

The first term $\beta_{1}$ shall be bounded via Wedin’s theorem. From (163) and (164), we have

Note that we have already shown in the proof of Lemma B.2 and Lemma B.4 that $\big{\|}\overline{\bm{u}}^{\tau}-\overline{\bm{u}}_{1}^{\star}\big{\|}_{2}=o(1)$ and $\big{\|}\widehat{\bm{u}}^{\tau,\left(m\right)}-\overline{\bm{u}}_{1}^{\star}\big{\|}_{2}=o(1)$ , which implies that $\bm{u}^{\tau}$ and $\widehat{\bm{u}}^{\tau,\left(m\right)}$ are positively correlated. Thus, one can invoke Wedin’s theorem and use the bound (186) to reach

In addition to this bound on $\beta_{1}$ , we also make note of the following simple bound

where the last inequality follows from our sample size, noise and rank condition that $\mathcal{E}_{\mathsf{loo}}\sqrt{r\log d}\ll 1$ .

The second term $\beta_{2}$ is also controlled via Wedin’s theorem:

The denominator term is easy to handle. With (196) and (209) in mind, we can apply Weyl’s inequality to obtain

From Lemma B.2, one has $\big{\|}\bm{M}^{\tau}-\bm{M}^{\tau,\left(m\right)}\big{\|}\ll\lambda_{\min}^{\star}$ . Therefore, we know that

In addition, Lemma B.3 below develops an upper bound on the numerator term:

Instate the assumptions of Lemma 5.15. With probability at least $1-O\left(d^{-10}\right)$ , one has

Substitution of the above bounds into (212) yields

where the last step holds as long as $p\gg\mu^{2}r^{2}d^{-2}\log^{2}d$ and $\sigma/\lambda_{\min}^{\star}\ll\sqrt{p/(rd\log^{2}d)}$ . In addition, from (211), we observe that

As a consequence, one immediately obtains

Combining (208), (210) and (215) and the definition of $\mathcal{E}_{\mathsf{loo}}$ , we arrive at

Comparing this bound with the first claim of the lemma, we see that the claim can be established as long as we can show that

To justify this bound (217), we make use of Lemma 5.14 to derive that

for each $m\in[d]$ . Maximizing over $m\in[d]$ gives that

where we use the condition that $\left(\mathcal{E}_{\mathsf{loo}}+\mathcal{E}_{\mathsf{op}}\right)\sqrt{r\log d}\ll 1$ . Apply the triangle inequality to yield

These allow us to establish the claim (217), which in turn finishes the proof for the first claim of this lemma.

The second claim (83) of this lemma follows immediately from (217) and (218).

It remains to prove the last claim (84). Recall the definition of $\lambda_{\tau}$ and $\lambda_{\tau}^{\left(m\right)}$ in (71). We can decompose

In what follows, we will control $\beta_{1}$ and $\beta_{2}$ seperately.

For $\beta_{1}$ , we note that all non-zero entries of $\bm{T}^{\left(m\right)}-\bm{T}$ are located in the $m$ th slices, and are independent of $\overline{\bm{u}}^{\tau,\left(m\right)}$ . This type of quantities have appeared many times and we omit the detailed proof for conciseness. By the Bernstein inequality, one can show that with probability at least $1-O\left(d^{-10}\right)$ ,

Next, we turn to $\beta_{2}$ . From our sample size and noise condition, Lemma D.1 and Corollary D.3 demonstrates that with probability at least $1-O\left(d^{-10}\right)$ ,

where we use the fact that the tensor spectral norm is always less than or equal to that of its matricization. By the definition of the operator norm, one has

where we use (216) and (217) in the last step and the fact that $\mathcal{E}_{\mathsf{loo}}\sqrt{\mu r\log d/d}\ll 1$ .

Combining (221) and (222) immediately establishes (84).

B.7 Proof of Lemma B.3

Recalling the definitions of $\widehat{\bm{M}}^{\tau,\left(m\right)}$ and $\bm{M}^{\tau,\left(m\right)}$ in (183) and (70c), respectively, we observe that $\widehat{\bm{M}}^{\tau,\left(m\right)}-\bm{M}^{\tau,\left(m\right)}$ is independent of $\overline{\bm{u}}^{\tau,\left(m\right)}$ conditional on $\mathcal{P}_{\Omega_{-m}}\left(\bm{E}\right)$ and $\bm{g}$ .

The $m$ -th entry of $\big{(}\widehat{\bm{M}}^{\tau,\left(m\right)}-\bm{M}^{\tau,\left(m\right)}\big{)}\overline{\bm{u}}^{\tau,\left(m\right)}$ can be written as

For the first term $\alpha_{1}$ , it is easily seen from (200) and incoherence conditions that

Apply the Bernstein inequality to yield that with probability at least $1-O\left(d^{-11}\right)$ ,

where the last inequality holds as long as $p\gg d^{-2}\log^{2}d$ .

Regarding $\alpha_{2}$ (cf. (223)), it is straightforward to compute that

with $\|\cdot\|_{\psi_{1}}$ denoting the sub-exponential norm, and

Then the Bernstein inequality reveals that with probability at least $1-O\left(d^{-11}\right)$ ,

where the last inequality follows from our sample size condition.

Substituting these into (223), we arrive at

For the remaining entries of $\big{(}\widehat{\bm{M}}^{\tau,\left(m\right)}-\bm{M}^{\tau,\left(m\right)}\big{)}\overline{\bm{u}}^{\tau,\left(m\right)}$ , we have

for any $i\neq m$ . From Lemma D.9 and (200), we have, with probability at least $1-O\left(d^{-11}\right)$ , that

Combined with (198) and (200), Lemma D.10 reveals that with probability at least $1-O\left(d^{-11}\right)$ ,

Therefore, combine (224) and (225) to obtain that

B.8 Proof of Lemma 5.16

By definition, the only possible difference between $\overline{\bm{u}}^{\tau}$ and $\bm{\nu}^{\tau}$ lies in how their global signs are chosen. To show that $\overline{\bm{u}}^{\tau}=\bm{\nu}^{\tau}$ , we first claim for the moment that

where $\mathcal{E}_{\mathsf{proj}}$ is defined in (79). Given that $\mathcal{E}_{\mathsf{proj}}\ll 1$ under our sample size, noise and rank condition, this immediately implies that $\big{\langle}p^{-1}\bm{T},(\overline{\bm{u}}^{\tau})^{\otimes 3}\big{\rangle}>0$ . Consequently, by construction, the global signs of $\overline{\bm{u}}^{\tau}$ and $\bm{\nu}^{\tau}$ coincide. Moreover, from (84) and the condition that $\mathcal{E}_{\mathsf{loo}}\sqrt{\mu r\log d/d}\ll 1$ , one also knows that $\big{\langle}p^{-1}\bm{T}^{\left(m\right)},(\overline{\bm{u}}^{\tau,\left(m\right)})^{\otimes 3}\big{\rangle}>0$ and hence the global signs of $\overline{\bm{u}}^{\tau,\left(m\right)}$ and $\bm{\nu}^{\tau,\left(m\right)}$ also coincide.

In addition, recall that $\lambda_{\tau}=\big{\langle}p^{-1}\bm{T},({\bm{\nu}}^{\tau})^{\otimes 3}\big{\rangle}$ . One thus has

which taken collectively with (226) justifies (85).

The rest of the proof then comes down to establishing the claim (226). Towards this, we first decompose

In what follows, we shall upper bound these three terms separately.

Let us start with $\beta_{1}$ , For simplicity of notation, let us define $\bm{\Delta}_{1}:=\overline{\bm{u}}^{\tau}-\overline{\bm{u}}_{1}^{\star}$ . By construction, $\bm{T}$ and $\bm{T}^{\star}$ are symmetric. We then can expand

We first look at the first term of (229) which only consists of $\overline{\bm{u}}_{1}^{\star}$ . As shown in (173), with probability at least $1-O\left(d^{-11}\right)$ , one has

where we recall the definition of $\mathcal{E}_{\mathsf{proj}}$ in (79). As for the term linear in $\bm{\Delta}_{1}$ , by Lemma 5.13, we know that $\left\|\bm{\Delta}_{1}\right\|_{2}\lesssim\mathcal{E}_{\mathsf{proj}}$ . As a result, one has

We then turn to the quadratic terms in $\bm{\Delta}_{1}$ . Similar to the above arguments, one can deduce that

Finally, we can simply upper bound the last term in (173) by

where the last step is due to the fact that $\mathcal{E}_{\mathsf{op}}\ll 1$ . By our sample size, noise and rank conditions, one has $\mathcal{E}_{\mathsf{proj}}\ll 1$ . Putting these bounds together reveals that

Recall the definition of $\beta_{2}$ in (228) and $\bm{\Delta}_{1}=\overline{\bm{u}}^{\tau}-\overline{\bm{u}}_{1}^{\star}$ . We can further decompose

We first consider the first term which is linear in $\bm{\Delta}_{1}$ . Since $\bm{T}^{\star}$ is a symmetric tensor, we have

which arises from $\max_{s\neq i}\left|\big{\langle}\overline{\bm{u}}_{s}^{\star},\overline{\bm{u}}_{1}^{\star}\big{\rangle}\right|\leq\sqrt{\mu/d}$ and $\big{\|}\overline{\bm{U}}^{\star}\big{\|}\lesssim 1$ as long as $r\sqrt{\mu/d}\ll 1$ . As a result, one has

Finally, using the fact that the tensor spectral norm is always less than or equal to that of its matricization, we find that

Combining this with the fact that $\mathcal{E}_{\mathsf{proj}}\ll 1$ , we conclude that

It remains to control $\beta_{3}$ . Straightforward calculation reveals that

By the incoherence conditions, we can upper bound

Putting (230), (231) and (232) together, we find that

where the last step follows from the condition $\mu\leq d$ , $r\ll\sqrt{d/\mu}$ and the definition $\mathcal{E}_{\mathsf{proj}}\geq\sqrt{\mu r/d}$ (cf. (79)).

B.9 Proof of Lemma 5.17

We first show that for each $i\in\left[r\right]$ , $\big{(}\bm{w}_{i},\lambda_{i}\big{)}$ and $\big{(}\bm{w}_{i}^{\left(m\right)},\lambda_{i}^{\left(m\right)}\big{)}$ (returned by $\textsc{Prune}($ ) in Algorithm 3 and Algorithm 6 respectively) satisfy (80), (82) and (84); in other words, we want to show that they correspond to the same index $\tau\in\left[L\right]$ (and are hence produced using the same Gaussian random vector $\bm{g}^{\tau}$ ). The proof idea is this: given that the proposed algorithms select the pair with the largest spectral gap in each round of $\textsc{Prune}($ ), it suffices to ensure that there is sufficient separation between the largest and the second largest spectral gaps (so that both algorithms can identify the same $\tau$ ).

By Lemma 5.12 and union bounds, we know that with probability at least $1-\delta$ , for each $i\in\left[r\right]$ ,

where we recall that $\Delta_{i}^{\tau}:=\gamma_{i}^{\star\tau}-\max_{j:j\neq i}\big{|}\gamma_{j}^{\star\tau}\big{|}$ , and $\Delta_{i}^{\left(1\right)}\geq\Delta_{i}^{\left(2\right)}\geq\dots\geq\Delta_{i}^{\left(L\right)}$ denote the order statistics of $\big{\{}\Delta_{i}^{\tau}\big{\}}_{\tau=1}^{L}$ in descending order. As shown in the proof of Lemma 5.13, the spectral gap of $\bm{M}^{\tau}$ is well approximated by $\max_{i}\Delta_{i}^{\tau}$ , namely,

under our sample size, noise and rank conditions (67). Moreover, from Lemma B.2, we see that $\bm{M}^{\tau}$ and $\bm{M}^{\tau,\left(m\right)}$ are extremely close in terms of the spectral norm, i.e.

This implies that the perturbation incurred by the leave-out-one procedure is relatively small compared to the difference between the largest and the second largest spectral gaps of $\bm{M}^{\tau}$ . Consequently, the leave-one-out estimates $\big{\{}\big{(}\bm{w}_{i}^{\left(m\right)},\lambda_{i}^{\left(m\right)}\big{)}\big{\}}_{i=1}^{r}$ returned by Algorithm 6 and the true estimates $\big{\{}\big{(}\bm{w}_{i},\lambda_{i}\big{)}\big{\}}_{i=1}^{r}$ should correspond to the same trials and should be generated by the same set of Gaussian random vectors. As a result, they obey (69a), (69b) and (69c) for all $1\leq m\leq d$ .

From the discussion above, we also know that as long as $\sigma_{1}\left(\bm{M}^{\tau}\right)-\sigma_{2}\left(\bm{M}^{\tau}\right)\gtrsim\lambda_{\min}^{\star}$ , one has $\left\|\bm{\nu}^{\tau}-\overline{\bm{u}}_{i}^{\star}\right\|_{2}\lesssim\mathcal{E}_{\mathsf{proj}}$ for some $i\in\left[r\right]$ . This is an immediate consequence of Lemma 5.13 and the fact that the spectral gap of $\bm{M}^{\tau}$ and $\max_{i}\Delta_{i}^{\tau}$ are extremely close.

It remains to show that our pruning procedure can return estimates of tensor factors without duplicates. Suppose that there exist $1\leq\tau_{1}\neq\tau_{2}\leq L$ such that $\left\|\bm{\nu}^{\tau_{1}}-\overline{\bm{u}}_{i}^{\star}\right\|_{2}\lesssim\mathcal{E}_{\mathsf{proj}}$ and $\left\|\bm{\nu}^{\tau_{2}}-\overline{\bm{u}}_{i}^{\star}\right\|_{2}\lesssim\mathcal{E}_{\mathsf{proj}}$ for some $i\in\left[r\right]$ . By the triangle inequality, one has

provided that $\mathcal{E}_{\mathsf{proj}}\ll 1$ . In addition, for any $j\neq i,j\in\left[r\right]$ , we know that exists some $1\leq\tau_{3}\leq L$ such that

Recall our incoherence condition in (8c). It is easy to see that

with the proviso that $\mu\ll d$ and $\mathcal{E}_{\mathsf{proj}}\ll 1$ .

The above argument reveals a clear separation between $\left|\left\langle\bm{\nu}^{\tau_{1}},\bm{\nu}^{\tau_{3}}\right\rangle\right|$ and $\left|\left\langle\bm{\nu}^{\tau_{1}},\bm{\nu}^{\tau_{2}}\right\rangle\right|$ . As an immediate consequence, the proposed pruning procedure successfully removes all duplication while securing an estimate for each tensor factor.

B.10 Proof of Corollary 5.11

Fix any arbitrary small constant $\delta>0$ . From Theorems 5.9-5.10 and the assumptions of Theorem 2.8, one knows that with probability exceeding $1-\delta$ , there exists a permutation $\pi(\cdot):[d]\mapsto[d]$ such that for all $1\leq i\leq r$ ,

for some $0<\delta\ll 1/(\mu^{3/2}r)<1$ . To prove the corollary, we shall just combine the above results.

Without loss of generality, assume that $\pi(i)=i$ for each $i\in\left[r\right]$ . Given that $\delta\ll 1$ and $\kappa\asymp 1$ , by the triangle inequality, one has $\lambda_{i}\asymp\lambda_{i}^{\star}$ for all $i\in\left[r\right]$ , which further implies that

Consequently, we can apply the triangle inequality to demonstrate that: for each $1\leq i\leq r$ ,

hold for all $i\in\left[r\right]$ and $m\in\left[d\right]$ . Recall that $\bm{U}^{0}=\big{[}\lambda_{i}^{1/3}\bm{w}^{i}\big{]}_{1\leq i\leq r}$ . One can deduce that

Appendix C Proof of Corollary 2.9

This section establishes Corollary 2.9. First of all, it is easy to see that: given the estimation accuracy established in Theorem 2.8, the permutation matrices that best match $\bm{U}^{t}$ to $\bm{U}^{\star}$ remain unchanged as $t$ increases. Therefore, we assume without loss of generality that $\bm{I}_{r}=\arg\min_{\bm{\Pi}\in\mathsf{perm}_{r}}\left\|\bm{U}^{t}\bm{\Pi}-\bm{U}^{\star}\right\|$ for all $t\geq 0$ .

for any $0\leq\delta\ll 1/(\mu^{3/2}r)\leq 1$ , then one has

where $\bm{T}:=\sum_{i=1}^{r}\bm{u}_{i}\otimes\bm{u}_{i}\otimes\bm{u}_{i}$ . As already shown in the analysis of Theorem 2.8, one has

from which Corollary 2.9 follows immediately.

It remains to prove the claim (234). For notational convenience, let us define $\bm{\Delta}:=\bm{U}-\bm{U}^{\star}$ and $\bm{\Delta}_{s}:=\bm{u}_{s}-\bm{u}_{s}^{\star}$ for each $1\leq s\leq r$ . Then we can expand

Next, we turn to the $\left\|\cdot\right\|_{\infty}$ loss. Again, it suffices to focus on $\sum_{1\leq s\leq r}\bm{u}_{s}^{\star\otimes 2}\otimes\bm{\Delta}_{s}$ , $\sum_{1\leq s\leq r}\bm{u}_{s}^{\star}\otimes\bm{\Delta}_{s}^{\otimes 2}$ and $\sum_{1\leq s\leq r}\bm{\Delta}_{s}^{\otimes 3}$ . From (104), (105) and (106) shown in the proof of Lemma 5.1, one has

Putting (240), (241) and (242) together with the condition that $0<\delta\ll r^{-1}\leq 1$ , we arrive at

Appendix D Auxiliary lemmas

This section gathers several auxiliary lemmas that prove useful when establishing our main results.

We begin by stating all auxiliary lemmas formally, with the proofs postponed to subsequent subsections. We shall define

Suppose that Assumption 2.1 holds, and assume that $r\sqrt{\mu/d}\leq c_{1}$ for some sufficiently small universal constant $c_{2}>0$ . Then for $d$ sufficiently large, the matrices $\bm{A}^{\star}$ , $\bm{B}^{\star}$ and $\bm{U}_{\mathsf{orth}}^{\star}$ (defined respectively in (24), (63) and (54)) obey

Here, $\|\bm{A}\|_{2,\infty}:=\max_{i}\|\bm{A}_{i,:}\|_{2}$ , $\lambda_{(i)}^{\star}$ stands for the $i$ -th largest value in $\{\lambda_{i}^{\star}\}_{1\leq i\leq r}$ (or equivalently $\{\|\bm{u}_{i}^{\star}\|_{2}^{3}\}_{1\leq i\leq r}$ ), and $\lambda_{i}(\bm{B}^{\star})$ represents the $i$ -th largest eigenvalue of $\bm{B}^{\star}$ .

Then with probability exceeding $1-O\left(d^{-10}\right)$ , one has

An immediate consequence of this lemma is the following:

With probability at least $1-O\left(d^{-10}\right)$ , one has

where $\times_{3}$ is defined in Section 2.4. The results also holds if we replace $\times_{3}$ with $\times_{1}$ or $\times_{2}$ .

Let $\{X_{i,j}\}_{1\leq i\leq r,1\leq j\leq L}$ be a sequence of i.i.d. standard Gaussian random variables, where $r\geq 2$ and $L\geq 1$ . Consider some quantities $\kappa\geq 1,\Delta>0,0<\delta<1/2$ . There exists some universal constant $C>0$ such that if

then with probability at least $1-\delta$ , there exists some $1\leq j_{0}\leq L$ such that

In addition, define $\Delta_{j}:=X_{1,j}-\kappa\max_{i:1<i\leq r}\left|X_{i,j}\right|$ for each $1\leq j\leq L$ . Then with probability at least $1-2\delta$ ,

where $\Delta_{(1)}\geq\Delta_{(2)}\geq\dots\geq\Delta_{(L)}$ denote the order statistics of $\left\{\Delta_{j}\right\}_{j=1}^{L}$ in descending order.

where we denote by $\mathcal{P}_{\bm{V}}(\bm{u}_{0}):=\bm{V}\bm{V}^{\top}\bm{u}_{0}$ and $\mathcal{P}_{\bm{V}^{\perp}}(\bm{u}_{0})=\big{(}\bm{I}_{d}-\bm{V}\bm{V}^{\top}\big{)}\bm{u}_{0}$ .

Additionally, we record several facts concerning the set of Bernoulli random variables $\{\chi_{i,j,k}\}_{1\leq i,j,k\leq d}$ . We recall that

which is a Bernoulli random variable with mean $p$ .

Suppose that $p\gtrsim d^{-2}\log d$ . With probability exceeding $1-O\left(d^{-10}\right)$ , one has

Suppose that $p\gtrsim d^{-2}\log^{2}d$ . With probability exceeding $1-O\left(d^{-10}\right)$ , one has

D.2 Proof of Lemma D.1

To begin with, the incoherence condition (8b) gives

where the first inequality follows from the fact that $\left\langle\bm{u}\otimes\bm{u},\bm{v}\otimes\bm{v}\right\rangle=\left\langle\bm{u},\bm{v}\right\rangle^{2}$ , the second inequality holds true due to (8b) and (8c), and the last inequality holds as long as $r\leq d/\mu$ . This immediately yields the advertised bound on $\big{\|}\bm{A}^{\star}\big{\|}_{2,\infty}$ .

where the second inequality is valid due to (8b) and (8c), and the last inequality holds as long as $r\leq\sqrt{d/\mu}$ . This yields the claimed bound regarding $\|\bm{A}^{\star\top}\|_{2,\infty}$ .

Regarding the spectrum of $\bm{A}^{\star}$ , $\bm{B}^{\star}$ , $\bm{U}_{\mathsf{orth}}^{\star}$ and $\overline{\bm{U}}^{\star}$ , we refer the reader to the proof of [CLC+20, Corollary 1].

We now move on to $\left\|\bm{B}^{\star}\right\|_{2,\infty}$ . For any $i\in\left[d\right]$ , it is seen that

Here, the last line makes use of the bound $\|\bm{A}^{\star}\|\leq\lambda_{\max}^{\star}(1+O(r\sqrt{\mu/d}))\leq 2\lambda_{\max}^{\star}$ , which holds if $r\sqrt{\mu/d}\leq c_{1}$ for some sufficiently small constant $c_{1}>0$ . It then follows from (254) that

D.3 Proof of Lemma D.2 and Corollary D.3

We then need to bound the quantity presented in (257).

For some $\beta>0$ to be specified later, one can upper bound

The Bernstein inequality then tells us that

In particular, this implies that with probability exceeding $1-O\left(d^{-20}\right)$ ,

where we have used the AM-GM inequality in the last step. Therefore, by taking

for some sufficiently large constant $C>0$ , we arrive at

Given that $\beta\gg M$ , for any $t\geq\beta$ one has the following relations about several events

for any $t\geq\beta$ (with $\beta$ obeying (260)). As a result, one can bound

where (i) follows from (261), (ii) comes from (259), (iii) is a consequence of (262), and (iv) holds true when $C>0$ is sufficiently large. Consequently,

Recognizing that the magnitudes of all entries of $\bm{R}$ are bounded by $B$ , we can invoke Talagrand’s concentration inequality [Ver18, Theorem 5.2.16] for convex Lipschitz functions of independent bounded random variables to show that with probability $1-O(d^{-10})$ ,

D.3.2 Proof of Corollary D.3

Now we apply Lemma D.2 to our concrete setting. We first look at $p^{-1}\mathcal{P}_{\Omega}\left(\bm{T}^{\star}\right)-\bm{T}^{\star}$ and treat it as $\bm{R}$ in Lemma D.2. With the help of Lemma D.1, it is straightforward to compute that

We then turn to $\mathcal{P}_{\Omega}\left(\bm{E}\right)$ . Recognizing that the entries of $\bm{E}$ might be unbounded, we invoke the following truncation trick to cope with this unboundedness issue. Specifically, define $\widetilde{\bm{E}}=[\widetilde{E}_{i,j,k}]_{1\leq i,j,k\leq d}$ where

for some some sufficiently large constant $c_{1}>0$ . Moreover, $\widetilde{E}_{i,j,k}$ is zero-mean because we assume that the distribution of $E_{i,j,k}$ is symmetric about 0. Standard concentration inequalities reveal that: with probability exceeding $1-O\left(d^{-10}\right)$ , one has $\bm{E}=\widetilde{\bm{E}}$ . Hence, it suffices to bound $\big{\|}\mathcal{P}_{\Omega}(\widetilde{\bm{E}})\big{\|}$ . Towards this end, simple calculation reveals that

This together with (264) as well as the high-probability event $\bm{E}=\widetilde{\bm{E}}$ completes the proof.

D.4 Proof of Lemma D.4

We shall apply the truncated matrix Bernstein inequality to control the spectral norm of $\bm{X}$ .

Second, using the same truncation argument as in the proof of Lemma D.2 in Appendix D.3, we can assume $|E_{i,j,k}|\lesssim\sigma\sqrt{\log d}$ for all $1\leq i,j,k\leq d$ (which holds with very high probability). The Bernstein inequality reveals that

for each $\left(i,j\right)\in\left[d\right]^{2}$ , where

This implies that with probability exceeding $1-O\left(d^{-20}\right)$ ,

for some sufficiently large constant $C>0$ , then one has

In view of our choice of $\beta$ , we know that $\min\left\{t^{2}/S^{2},t/L\right\}\geq t/\max\left\{S/\sqrt{\log d},L\right\}$ for any $t\geq\beta$ . As a result, for $d$ sufficiently large, we have

Invoke the matrix Bernstein inequality to demonstrate that with probability $1-O\left(d^{-10}\right)$ ,

where (i) is due to Lemma D.1, and (ii) follows as long as $p\gtrsim d^{-2}\log^{3}d$ and $\mu\log^{2}d\lesssim d$ .

D.5 Proof of Lemma D.5

Recall that for a standard Gaussian random variable $Z\sim\mathcal{N}\left(0,1\right)$ , one has

for all $t>\sqrt{5/4}$ . Observing that $\kappa\sqrt{2\log r}+\Delta\geq\sqrt{5/4}$ since $\kappa\geq 1$ and $r\geq 2$ , we can invoke the above tail bound to deduce that

where we use the elementary inequality $\big{(}\kappa\sqrt{2\log r}+\Delta)^{2}\leq 4\kappa^{2}\log r+2\Delta^{2}$ . In addition, it follows from the union bound that

To prove the claim, it is sufficient to choose $L$ such that

Note that $\log(1-x)\leq-1/(2x)$ for $0<x<1/4$ . In view of (267) and (268), one can verify that the above inequality (269) as long as

To prove the second claim, recall the definitions that

and that $\Delta_{(1)}\geq\Delta_{(2)}\geq\dots,\geq\Delta_{(L)}$ denote $\left\{\Delta_{j}\right\}_{j=1}^{L}$ in descending order. For any $\epsilon>0$ , one has

where the last line holds because the distribution of $\Delta_{j}-\max\nolimits_{k:k\neq j}\Delta_{k}$ conditional on $\Delta_{j}=\max\nolimits_{1\leq k\leq r}\Delta_{k}$ is identical for all $1\leq j\leq L$ . In addition, it is straightforward to see that

Next, observe that $X_{1,1}$ is independent of $Y_{1,1}$ , and hence we have

In other words, $f_{\epsilon}(x)$ is monotonically increasing in $x$ for any given $\varepsilon>0$ . Therefore, for any $0\leq x<B$ for some sufficiently large $B>\sqrt{5/4}$ (to be specified later), the above bounds taken together give

where (i) arises from the monotonicity of $f_{\epsilon}(\cdot)$ , and (ii) relies on (266). By taking $\epsilon=\delta/(5B)$ , we obtain

for any $0\leq x\leq B$ . Recall that $Y_{1,1}=\max_{1<i\leq r}\kappa X_{i,1}+\max_{k:k\neq 1}\Delta_{k}$ . By standard Gaussian concentration inequalities, with probability at least $1-\delta$ one has

where the last step arises from the lower bound on $L$ in (270). If we choose $B=C\big{(}\sqrt{\log L}+\sqrt{\log(1/\delta)}\big{)}$ for some sufficiently large universal constant $C>0$ , then this immediately implies that

D.6 Proof of Lemma D.6

where the last identity follows since $\bm{u}_{0}$ is assumed to lie within $\mathsf{span}(\bm{U})$ . As a result,

The Pythagorean theorem then gives $\big{\|}\mathcal{P}_{\bm{V}}(\bm{u}_{0})\big{\|}_{2}=\sqrt{\|\bm{u}_{0}\|_{2}-\big{\|}\mathcal{P}_{\bm{V}^{\perp}}(\bm{u}_{0})\big{\|}_{2}^{2}}\geq\sqrt{1-\delta^{2}}$ .

D.7 Proof of Lemma D.7

Invoke the Bernstein inequality to show that: with probability exceeding $1-O(d^{-20})$ ,

where the last line holds with the proviso that $p\gtrsim d^{-2}\log d$ .

D.8 Proof of Lemma D.8

Since the $E_{i,j,k}$ ’s are independent sub-Gaussian random variables with variance at most $\sigma^{2}$ , one has

Here, $\|\cdot\|_{\psi_{1}}$ denotes the sub-exponential norm [Ver10]. Taken together with the Bernstein inequality, these yield that with probability exceeding $1-O(d^{-20})$ ,

provided that $p\gtrsim d^{-2}\log^{2}d$ .

D.9 Proof of Lemma D.9

Fix an arbitrary $1\leq i\leq d$ . We first define a sequence of independent zero-mean random variables $\{X_{j}\}_{j\in[d]}$ as follows

The Bernstein inequality indicates that: with probability at least $1-O(d^{-20})$ ,

Moreover, we can also bound the variance of $X_{j}$ as follows

Given that $X_{j}$ might be overly large in some rare case, we introduce a sequence $\{Y_{j}\}$ , where we denote by $Y_{j}$ the truncated version of $X_{j}$ as follows

We have learn from (271) and the union bound that with probability at least $1-O(d^{-15})$ , one has $Y_{j}=X_{j}$ for all $j\in[d]$ .

Using the above bounds on the $X_{j}$ ’s, one observes that $\{Y_{j}\}_{j\in[d]}$ is a sequence of independent random variables satisfying

We can apply the Bernstein inequality to conclude that: with probability greater than $1-O(d^{-15})$ ,

where (i) is due to the AM-GM inequality, (ii) makes use of Lemma D.1, and the last line follows under the conditions $p\gtrsim d^{-2}\log^{2}d$ and $\mu\log^{2}d\lesssim d$ . This together with the high-probability fact $Y_{j}=X_{j}$ ( $\forall j\in[d]$ ) concludes the proof.

D.10 Proof of Lemma D.10

Fix any $1\leq i\leq d$ . To begin with, define

which is a zero-mean random variable. In order to bound $Z_{j}$ , one observes that

where $\|\cdot\|_{\psi_{1}}$ denotes the sub-exponential norm. Apply the Bernstein inequality for the sum of sub-exponential random variables to obtain

with probability exceeding $1-O(d^{-20})$ . Further, given that $Z_{j}$ is not necessarily bounded, we introduce a sequence of truncated random variables as follows

According to the above bound, one has $Y_{j}=Z_{j}$ ( $\forall j$ ) with probability at least $1-O(d^{-19})$ .

We then turn attention to bounding $\sum_{j\in[d]}Y_{j}^{2}$ . To this end, observe that

Invokde the Bernstein inequality to arrive at: with probability at least $1-O(d^{-20})$ ,

This together with the high-probability fact $Y_{j}=Z_{j}$ ( $\forall j$ ) completes the proof.

Appendix E Extension to asymmetric tensors

Thus far, we have focused on the case where the tensor of interest is symmetric. In this section, we discuss how to generalize our algorithm and analysis to accommodate asymmetric tensors.

In addition, we assume that each entry $(j,k,l)$ is included in the sampling set $\Omega$ independently with probability $p$ , and that each observed entry $T_{j,k,l}^{\star}$ is corrupted by an independent zero-mean sub-Guassian noise $E_{j,k,l}$ (cf. Assumption 2.3). Our goal is to (1) estimate $\{\bm{u}_{i}^{\star},\bm{v}_{i}^{\star},\bm{w}_{i}^{\star}\}_{i=1}^{r}$ faithfully, modulo global permutation and global signs, and (2) estimate $\bm{T}^{\star}$ in a reliable manner.

E.2 Algorithms

We now move on to present an extension of our nonconvex algorithm to handle the noisy scenario.

we can define the following regularized squared loss function

where the regularization term $\mathsf{reg}\big{(}\bm{U},\bm{V},\bm{W}\big{)}$ is given by

for some positive regularization parameters $\{\alpha_{i}\}_{i=1}^{r}$ to be specified momentarily. In contrast to the symmetric case, the addition term $\mathsf{reg}\big{(}\bm{U},\bm{V},\bm{W}\big{)}$ is included to help ensure that the sizes of $\bm{U},\bm{V}$ and $\bm{W}$ stay close — an algorithmic trick that has proved useful in other problems like nonconvex rectangular matrix recovery [TBS+16, ZL16, CLL19].

We are now ready to present our nonconvex algorithm that accommodates the case with asymmetric tensors. As before, the proposed algorithm is initialized by a spectral method, followed by gradient descent designed to minimize the regularized loss function (276). The precise procedure is described in Algorithm 7 (which invokes Algorithms 8-9).

Before proceeding, we find it helpful to record closed-form expressions for the gradients, which are a crucial part when implementing the nonconvex gradient descent algorithm. Specifically, the gradients of $g(\bm{U},\bm{V},\bm{W})$ can be computed as follows

for each $1\leq i\leq r$ , where $\times_{1},\times_{2}$ and $\times_{3}$ have been defined in Section 2.4.

E.3 Numerical experiments

In order to validate the effectiveness of the proposed algorithm, we conduct a series of numerical experiments.

E.4 Analysis ideas

Before describing the proof ideas, we define the following incoherence parameters and condition number that, similar to the symmetric case, play a crucial role in our theoretical development.

Define the incoherence parameters and the condition number of $\bm{T}^{\star}$ as follows

As has been made clear in the symmetric case, at the heart of our analysis lie two crucial components: (1) the geometric property of a noiseless version of the loss function, and (2) reasonably well initial estimates for tensor factors (in the entrywise sense). Rather than providing a complete analysis for the asymmetric case (which would be very long), we shall only point out the important steps needed to extend these two parts for the asymmetric case.

Similar to the symmetric counterpart, the key step of the local convergence analysis lies in establishing the favorable geometric property (i.e. local strong convexity and smoothness) of the following noiseless regularized loss function

where the regularization term $\mathsf{reg}\big{(}\bm{U},\bm{V},\bm{W}\big{)}$ is defined in (277). In words, this is a simplified version of the original loss function (276) by dropping the influence of the noise. Lemma E.2 below demonstrates that $g_{\mathsf{clean}}$ is locally strongly convex and smooth in the neighborhood of the ground truth.

and that the regularization parameter obeys $\big{|}\alpha_{i}-\lambda_{i}^{\star 2/3}\big{|}\leq c_{2}\lambda_{\min}^{\star 2/3}$ for some sufficiently large (resp. small) constant $c_{0}>0$ (resp. $c_{1},c_{2}>0$ ). Then with probability greater than $1-O\big{(}d_{\min}^{-10}\big{)}$ ,

In a nutshell, Lemma E.2 confirms local strong convexity and smoothness of the noiseless regularized loss $g_{\mathsf{clean}}(\bm{U},\bm{V},\bm{W})$ , provided that (282) holds and that the matrices $\bm{U}$ , $\bm{V}$ and $\bm{W}$ are sufficiently close to the ground truth in every single row. This is similar to the property of the symmetric counterpart $f_{\mathsf{clean}}(\bm{U})$ (cf. Lemma 5.1 in Appendix 5.1), except that we need to deal with asymmetric tensor factors as well as additional regularization terms. In particular, when $d_{1}\asymp d_{2}\asymp d_{3}\asymp d$ and $\mu,r\asymp 1$ , Condition (282) reduces to $p\gg d^{-3/2}\log^{3}d$ and $r\ll\sqrt{d}$ , which resembles (42) in Lemma 5.1 derived for the symmetric case.

Having established the preceding local geometric properties of $g_{\mathsf{clean}}$ , one can then argue similarly as in Lemmas 5.3 and 5.6 in Appendix 5.1 to prove that gradient descent converges linearly, as long as it is provided with an initial estimate satisfying the condition (285). Here, we emphasize that the regularization term (277), which essentially balances the sizes of the three tensor factors, is crucial for the local strong convexity and smoothness of $g_{\mathsf{clean}}$ to hold.

Another crucial ingredient lies in guaranteeing an initial estimate with sufficiently good accuracy. Recall that our initialization scheme consists of two stages: (1) subspace estimation, and (2) retrieval of individual tensor factors.

Regarding the retrieval of individual tensor factors, the key observation is that: the random vector $\bm{g}^{\tau}$ we generate satisfies

where $\mathcal{P}_{\bm{U}^{\star}}$ is the projection onto the subspace spanned by $\{\bm{u}_{i}^{\star}\}_{i=1}^{r}$ , and

Following the above strategies, one could adapt the proofs of Theorems 5.9-5.10 in Section 5.3 to show that: our initial estimates $\{\bm{u}_{i},\bm{v}_{i},\bm{w}_{i}\}_{i=1}^{r}$ are all exceedingly close to the ground truth in the entrywise sense (up to global permutation and global signs). This in turn confirms that the algorithm will enter a locally strongly convex and smooth region as characterized in Lemma E.2, thus leading to our performance guarantees for the entire algorithm. Once again, while the analysis ideas for the asymmetric case bear much resemblance to the symmetric counterpart, a complete proof has to be fairly long due to more clumsy notation compared to the symmetric case; for the sake of brevity, we do not provide the full proof here.

E.5 Proof of Lemma E.2

For notational convenience, we shall define

With the above notation in place, one can decompose

where $\beta_{i},1\leq i\leq 4$ are given respectively by

In what follows, we shall demonstrate that $\beta_{1}$ is the dominant term, with the remaining terms being negligible compared to $\beta_{1}$ . Here, we note that the proof idea is almost identical to that of the symmetric case (cf. Lemma 5.1 in Appendix 5.1). For the sake of conciseness, we will focus only on the part where the symmetric and the asymmetric cases differ, and omit the proof details when their analyses are similar.

Let us begin with $\gamma_{1}$ . Observe that we can express

Arguing similarly as in the proof of Lemma D.1, one can derive

provided that $r\ll d_{\min}/\mu$ . This leads to the following inequalities

We now move on to $\gamma_{2}$ . Recall our assumption that $\|\bm{u}_{i}^{\star}\|_{2}=\|\bm{v}_{i}^{\star}\|_{2}=\|\bm{w}_{i}^{\star}\|_{2}=\lambda_{i}^{\star 1/3}$ for all $1\leq i\leq r$ . It follows from the Cauchy-Schwartz inequality that

Turning to $\gamma_{3}$ , one can straightforwardly bound

Here, we have used the fact that $\|\bm{w}_{i}^{\star}\|_{2}=\lambda_{i}^{\star 1/3}$ in (i); the inequalities (ii) and (iii) arise from the Cauchy-Schwartz inequality; (iv) follows from (280c); (v) holds as long as $r\ll\sqrt{d_{\min}/\mu}$ . In a similar manner, one can easily verify that

It then follows from the AM-GM inequality that

Putting the above bounds together allows us to bound $\beta_{1}$ as follows

provided the sampling rate exceeds $p\gg\frac{\mu^{2}r^{2}d_{\max}\log^{2}d_{\max}}{d_{1}d_{2}d_{3}}$ .

By the definition of the operator norm, one can bound

One can easily adapt the proof of Lemma D.2 to show that, with probability at with probability at least $1-O\big{(}d_{\min}^{-10}\big{)},$

holds as long as $p\gg\max\Big{\{}\frac{\log^{3}d_{\max}}{\sqrt{d_{1}d_{2}d_{3}}},\,\frac{d_{\max}\log^{5}d_{\max}}{d_{1}d_{2}d_{3}}\Big{\}}$ . We can then adapt the proof for bounding $\alpha_{3}$ in the proof of Lemma 5.1 in Appendix A to derive that with probability at least $1-O\big{(}d_{\min}^{-10}\big{)},$

It remains to control $\beta_{4}$ . By symmetry, it suffices to consider the following terms:

In the following, we shall bound these four terms separately. By the assumptions that $\bm{S}^{(1)}=\bm{I}_{r},\bm{S}^{(2)}=\bm{S}^{(3)}=-\bm{I}_{r}$ , $\delta\ll 1$ and $\kappa\asymp 1$ , one has

In addition, for each $1\leq i\leq r$ , one has

by virtue of the condition that $\big{|}\alpha_{i}-\lambda_{i}^{\star 2/3}\big{|}\ll\lambda_{\min}^{\star 2/3}$ .

Let us start with $\gamma_{1}$ . By the condition that $\max_{i}\big{|}\alpha_{i}-\lambda_{i}^{\star 2/3}\big{|}\ll\lambda_{\min}^{\star 2/3}$ , it is easily seen that

Here, we use (288a), (288b), (289) and $\kappa\asymp 1$ in the last step.

Turning to $\gamma_{3}$ , one can develop a similar bound as follows

With regards to $\gamma_{4}$ , we can expand

where the last step follows from the assumption $\|\bm{u}_{i}^{\star}\|_{2}=\|\bm{v}_{i}^{\star}\|_{2}=\|\bm{w}_{i}^{\star}\|_{2}$ . It follows that

where we have used the conditions that $\max_{i}\alpha_{i}\lesssim\lambda_{\max}^{\star 2/3}$ and $\kappa\asymp 1$ .

It is not hard to check that similar bounds also hold for the remaining terms in $\beta_{4}$ . As a result, we arrive at

Putting the above estimates together, we conclude that with probability at least $1-O(d_{\min}^{-10})$ , one has

Introduction and motivation

2 Computational and statistical challenges

Algorithm and main results

2 Main results

3 Numerical experiments

4 Notation

Initialization

2 Step 2: retrieval of low-rank tensor factors from the subspace estimate

2.2 Intuition

3 Other alternatives?

Related work

Analysis

1.2 Local strong convexity and smoothness

1.3 Leave-one-out gradient descent sequences

1.4 Key lemmas

2 Analysis for initialization: Part 1 (subspace estimation)

2.2 Leave-one-out sequences for subspace estimation

3 Analysis for initialization: Part 2 (retrieval of individual tensor factors)

3.2 Analysis

Discussion

Acknowledgements

Appendix A Proofs for local convergence of GD

A.1.5 Putting all this together

A.2 Proof of Lemma 5.2

A.3 Proof of Lemma 5.3

A.4 Proof of Lemma 5.4

A.5 Proof of Lemma 5.5

A.6 Proof of Lemma 5.6

Appendix B Proofs for retrieving tensor components

B.2 Proof of Lemma 5.13

B.3 Proof of Lemma B.1

B.4 Proof of Lemma 5.14

B.5 Proof of Lemma B.2

B.6 Proof of Lemma 5.15

B.7 Proof of Lemma B.3

B.8 Proof of Lemma 5.16

B.9 Proof of Lemma 5.17

B.10 Proof of Corollary 5.11

Appendix C Proof of Corollary 2.9

Appendix D Auxiliary lemmas

D.2 Proof of Lemma D.1

D.3 Proof of Lemma D.2 and Corollary D.3

D.3.2 Proof of Corollary D.3

D.4 Proof of Lemma D.4

D.5 Proof of Lemma D.5

D.6 Proof of Lemma D.6

D.7 Proof of Lemma D.7

D.8 Proof of Lemma D.8

D.9 Proof of Lemma D.9

D.10 Proof of Lemma D.10

Appendix E Extension to asymmetric tensors

E.2 Algorithms

E.3 Numerical experiments

E.4 Analysis ideas

E.5 Proof of Lemma E.2

References