Harnessing Structures in Big Data via Guaranteed Low-Rank Matrix Estimation

Yudong Chen, Yuejie Chi

Introduction

The ubiquity of advanced sensing and imaging technologies produce vast amounts of data at an unprecedented rate. A fundamental goal of signal processing is to extract, and possibly track the evolution of, the relevant structural information faithfully from such high-dimensional data, ideally with a minimal amount of computation, storage and human intervention. To overcome the curse of dimensionality, it is important to exploit the fact that real-world data often possess some low-dimensional geometric structures. In particular, such structures allow for a succinct description of the data by a number of parameters much smaller than the ambient dimension. One popular postulate of low-dimensional structures is sparsity, that is, a signal can be represented using a few nonzero coefficients in a proper domain. For instance, a natural image often has a sparse representation in the wavelet domain. The field of compressed sensing has made tremendous progress in capitalizing on the sparsity structures, particularly in solving under-determined linear systems arising from sample-starved applications such as medical imaging, spectrum sensing and network monitoring. In these applications, compressed sensing techniques allow for faithful estimation of the signal of interest from a number of measurements that is proportional to the sparsity level — much fewer than is required by traditional techniques. The power of compressed sensing has made it a disruptive technology in many applications such as magnetic resonance imaging (MRI): a Cardiac Cine scan can now be performed within 25 seconds with the patients breathing freely. This is in sharp contrast to the previous status quo, where the scan takes up to six minutes and the patients need to hold their breaths several times .

While the sparsity model is powerful, the original framework of compressed sensing mainly focuses on vector-valued signals that admit sparse representations in an a priori known domain. However, knowledge of such sparsifying domains is not always available, thus limiting its applications. Fortunately, one can resort to a more general notion of sparsity that is more versatile when handling matrix-valued signals — or an ensemble of vector-valued signals — without the need of specifying a sparsifying basis. In this paper, we will review this powerful generalization of sparsity, termed the low-rank model, which captures a much broader class of low-dimensional structures. Roughly speaking, this model postulates that the matrix-valued signal is approximately low-rank. If we view each column of the matrix as a data vector, then this is equivalent to saying that the data approximately lies in a low-dimensional but unknown subspace. Historically, the exploitation of low-rank structures may begin even earlier than that of sparsity. In particular, the low-rank assumption is what underlies classical Principal Component Analysis (PCA) , which builds on the observation that real-world data has most of its variance in the first few top principal components. Such low-rank structures may arise due to various physical reasons and engineering designs. In face recognition, face images are found to trace out a 99-dimensional subspace if they are approximately convex and reflect light according to Lambert’s law . In radar and sonar signal processing, the signals reside approximately in a low-dimensional subspace due to transmitting using a small set of waveforms to construct certain beam patterns. Low-rank structures also arise from modeling interactions between different objects. For example, in clustering or embedding, the pairwise interactions between objects can often be expressed as a low-rank matrix .

Given the collected data, the key problem is to infer the hidden low-dimensional subspace that captures most of the information relevant for subsequent tasks such as detection, clustering, and parameter estimation. Traditional methods such as Singular Value Decomposition (SVD) for finding principal subspaces typically require the data to be fully observed. However, modern data applications often involve estimation problems with a number of measurements that is much smaller than the ambient dimension, a regime similar to the setting of compressed sensing. We refer to this problem as low-rank matrix estimation, emphasizing the fact that one only has under-sampled measurements or partial observations. Examples of such problems are abundant. In recommendation systems, the goal is to estimate the missing ratings given a small number of observed ones. In sensor networks, an important problem is to infer the locations of the sensors from pairwise distance measures, which are available only for sensors within a certain radius of each other. In wideband spectrum sensing, to reduce the sampling rate, a popular approach is to estimate the signal subspace and bearing parameters by randomly sub-sampling the outputs of the array.

In these applications, it is desirable to develop low-rank matrix estimation algorithms that are both statistically efficient — achieving low estimation errors with a minimal amount of (noisy) measurements — and computationally efficient — having low running time and storage cost. A particular focus of this paper is on algorithms that come with provable guarantees for their statistical and computation efficiency. The search for such algorithms is in part motivated by the remarkable success story of compressed sensing, for which many provable methods have been developed for sparse models. Handling the more general low-rank structures poses a new set of challenges as well as opportunities. The study of low-rank matrix estimation has attracted the attention of many researchers from diverse communities including signal processing, machine learning, statistics, mathematical programming and computer science –. As we elaborate below, this enterprise has been much fruitful, resulting in many powerful algorithms, novel analytical techniques, and deep theoretical insights.

This survey article is complementary to the nice overview article on low-rank matrix recovery by Davenport and Romberg , with different focuses. In particular, by focusing on recent algorithmic advancements with computational and statistical guarantees, we highlight the effectiveness of first-order methods in both convex and nonconvex optimization. We also put specific emphasis on a unique set of applications involving structured matrix completion.

The rest of this paper is organized as follows. Section 2 motivates low-rank models from the perspectives of modeling data correlations and lifting vector-valued problems. Section 3 describes the basic mathematical setup of the low-rank estimation problem. Section 4 discusses the theory and algorithms for low-rank matrix estimation via convex optimization. Section 5 discusses the theory and algorithms for low-rank matrix estimation via nonconvex optimization. Section 6 discusses structured matrix completion, where the low-rank matrices have additional structural constraints, using several concrete examples. Numerical examples on a real-world recommendation dataset are showcased in Section 7. The paper is concluded in Section 8.

2 Notations

The Ubiquity of Low-Rank Models

In this section, we elucidate the motivations for studying low-rank modeling and low-rank matrix estimation problems. We start with a classical viewpoint and justify the low-rank priors of a data matrix from the perspective of bias-variance trade-offs for modeling correlations in data observations. We next argue that low-rank structures arise from a powerful reformulation of quadratic optimization problems by lifting them into a matrix space. Last but not least, we provide a list of other sources of low-rank structures in a wide range of science and engineering problems.

Imagine that we receive a noisy copy of the signal x\boldsymbol{x} as

We make the crucial observation that one may decompose the mean squared error of the estimate x^r\hat{{\boldsymbol{x}}}_{r} into two terms:

Here the first term corresponds to the model bias,

which arises due to approximating a full-rank signal by a rank-rr model. The second term corresponds to the model variance,

which arises due to the presence of noise.

From the above decomposition, we see that as one increases the rank rr, the bias of the estimate decreases, whereas the corresponding variance increases. Therefore, the choice of the rank controls the trade-off between the bias and the variance, whose sum constitutes the total estimation error. Importantly, many real-world datasets have a decaying spectrum, which in the above notation means that the eigenvalues of the covariance matrix decrease rapidly. As mentioned, this insight is the foundation of PCA and moreover is observed across a wide range of applications including power systems, recommendation systems, Internet traffic and weather data. Consequently, it is beneficial to employ a small rank, so that the variance is controlled, while the bias remains small as long as the residual eigenvalues decreases quickly. With this in mind, we show in Fig. 1 the mean squared error as function of the rank, as well as the decomposition into the bias and the variance; here it is assumed that the spectrum decays at a rate of λi=1/i\lambda_{i}=1/i, and the signal-to-noise ratio (SNR), defined as i=1nλi2/(nσ2)\sum_{i=1}^{n}\lambda_{i}^{2}/(n\sigma^{2}), is equal to 1515dB with n=100n=100. It is clear from the figure that employing an appropriate low-rank estimator induces a much lower mean squared error than a full-rank one. In particular, the optimal rank may be much smaller than the ambient dimension nn when the spectrum decays fast.

2 Lifting for Quadratic and Bilinear Optimization Problems

Due to the nonlinear nature of these equations, it is difficult to solve them directly, particularly when the problem size is large. A popular approach for solving such equations is called lifting: one rewrites the above equations in terms of the matrix variable M=xx{\boldsymbol{M}}={\boldsymbol{x}}{\boldsymbol{x}}^{*}, and casts this problem as recovering the rank-one matrix M{\boldsymbol{M}} from a set of linear measurements . A similar formulation has been used for the blind deconvolution problem; cf. .

Therefore, the problem of determining the locations X{\boldsymbol{X}} of the sensors is equivalent to recovering the low-rank lifted matrix M{\boldsymbol{M}} from a set of linear measurements in the form of (3); see for a more detailed treatment of this powerful reformulation.

3 Other Sources of Low-rank Structures

There’re many potential sources of low-rank structures. Below we provide a few further examples drawn from different science and engineering domains:

In system identification and time series analysis, finding the minimum-order linear time-invariant system is equivalent to minimizing the rank of Hankel structured matrices (cf. Section 6.1).

In recommendation systems , the matrix of user ratings for a set of items is often approximately low-rank, as user preferences typically depend on a small number of underlying factors and hence their ratings correlate with each other.

The background of a video usually changes slowly from frame to frame, hence stacking the frames as columns lead to an approximately low-rank matrix . Similar low-rank structures arise from the smoothness properties of other visual and physical objects .

In quantum state tomography, the density matrix of a pure or nearly pure quantum state is approximately low-rank, which can be exploited in the problem of state reconstruction from a small number of Pauli measurements .

In a sparse graphical model with latent variables, one can show, using the Schur complement, that the inverse marginal covariance matrix of the observed variables can be approximated by a matrix with rank equal to the number of latent variables .

Matrices with certain monotonicity properties can be well-approximated by a matrix with rank much smaller than the ambient dimension. Such matrices arise, for example, when measuring the pairwise comparison scores of a set of objects that possess an underlying ordering .

The pairwise affinity matrix of a set of objects is often approximately low-rank due to the presence of clustering/community structures (cf. Section 6.2).

The list continues for much longer. The ubiquity of these structures, either as a physical property or as an engineering choice, is what makes low-rank models useful, and motivates the extensive study of the low-rank matrix estimation problem.

Low-Rank Matrix Estimation from Incomplete Observations

where the singular values σ1σ2\sigma_{1}\geq\sigma_{2}\geq\cdots are organized in an non-increasing order. The best rank-rr approximation of X{\boldsymbol{X}} is defined as

By the Eckart-Young theorem, the optimal approximation Xr{\boldsymbol{X}}_{r} is given by

As mentioned, in many modern applications, one does not directly observe X{\boldsymbol{X}}, but rather is given an under-determined set of indirect noisy measurements of it. Here we assume that one has access to a set of linear measurements in the form

As we are primarily concerned with estimating X{\boldsymbol{X}} from mn1n2m\ll n_{1}n_{2} measurements, direct approximation via SVD and the Eckart-Young theorem are impossible. Instead, we need to develop alternative methods to find an (approximate) low-rank solution that best fits the set of noisy under-determined linear equations (6). We further categorize the low-rank matrix estimation problem into two main types based on the structure of the measurement operator:

Low-Rank Matrix Sensing, one observes linear combinations of the entries of X{\boldsymbol{X}}, where each measurement matrix Al{\boldsymbol{A}}_{l} defining the linear combinations is typically dense.

Low-Rank Matrix Completion, one directly observes a subset of the entries of X{\boldsymbol{X}}, and aims to interpolate the missing entries. In this case, each Al{\boldsymbol{A}}_{l} is a sparse matrix with a single entry equal to 11 at the corresponding observed index.

For matrix completion, it is convenient to write the measurements in a matrix form as

Theory and Algorithms for Low-Rank Matrix Estimation via Convex Optimization

We begin by deriving the nuclear norm minimization algorithm as a convex relaxation for rank minimization. Recall our linear measurement model in (6), to which we seek a low-rank solution. A natural approach is to find the matrix with the minimum rank that is consistent with these measurement, which can be formulated as an optimization problem:

Then, instead of solving (8) directly, one solves for a matrix that minimizes the nuclear norm:

In the case where the measurements y{\boldsymbol{y}} are noisy, one seeks a matrix with a small nuclear norm that is approximately consistent with the measurements, which can be formulated either as a regularized optimization problem:

or as a constrained optimization problem:

where τ\tau and γ\gamma are tuning parameters. Note that the nuclear norm can be represented using the solution to a semidefinite program ,

Consequently, the optimization problems (9)–(11) are convex, semidefinite programs.

2 Guarantees for Matrix Sensing via RIP

For there to be any hope of recovering X{\boldsymbol{X}} from the output of the sensing process (6), the sensing operator A\mathcal{A} needs to possess certain desirable properties so that it can distinguish different low-rank matrices. One such property is called the restricted isometry property (RIP). RIP stipulates that A\mathcal{A}, viewed as a mapping to a lower-dimensional space, preserves the Euclidean distances between low-rank matrices, Below we give a general notion of RIP, where the distances after mapping may be measured in different norms:

where \text{\text@underline{\delta}}_{r} and δˉr\bar{\delta}_{r} are some universal constants satisfying 0<1-\text{\text@underline{\delta}}_{r}<1<1+\bar{\delta}_{r}.

This definition is reminiscent of a similar notion with the same name used in the sparse signal recovery literature that is imposed on sparse vectors . Certifying whether RIP holds for a given operator is known to be NP-hard . Nevertheless, it turns out that a “generic” sensing operator, drawn from certain random distributions, satisfies RIP with high probability. For example:

For studying the performance of nuclear norm minimization via other notions such as the null space property, we refer interested readers to .

3 Guarantees for Matrix Completion via Incoherence

In the case of matrix completion, an additional complication arises: it is impossible to recover a low-rank matrix that is also sparse. In particular, when one only samples a small subset of the entries of X{\boldsymbol{X}}, it is very likely that most, if not all, of the nonzero entries of X{\boldsymbol{X}} are missed. This means that the sensing operator A=PΩ\mathcal{A}={\mathcal{P}}_{\Omega} used for matrix completion cannot satisfy the RIP. Therefore, for the problem to be well-posed, we need to restrict attention to low-rank matrices whose mass does not concentrate on a few entries. This property can be formalized by the notion of incoherence, which measures the alignment between the column/row spaces of the low-rank matrix with the standard basis vectors:

For a matrix with the SVD X=UΣVT{\boldsymbol{X}}={\boldsymbol{U}}{\boldsymbol{\Sigma}}{\boldsymbol{V}}^{T}, the incoherence parameter of X{\boldsymbol{X}} is defined as

It is easy to see that the incoherence parameter satisfies the bound 1μ(U)n/r1\leq\mu({\boldsymbol{U}})\leq n/r. With a smaller μ(U)\mu({\boldsymbol{U}}), the column space of U{\boldsymbol{U}} is more spread out over its coordinates. For a matrix X{\boldsymbol{X}}, its incoherence parameters μ0\mu_{0} is determined by its the singular vectors and is independent of its singular values. In the noiseless setting, nuclear norm minimization can perfectly recover an incoherent low-rank matrix as soon as the number of measurements is slightly larger than the degrees of freedom of the matrix. Such recovery guarantees were proved and refined in a series of work in . The theorem below is adapted from , which is state-of-the-art.

Suppose that each entry of X{\boldsymbol{X}} is observed independently with probability p(0,1)p\in(0,1). If pp satisfies

for some constant CC, then with high probability, the nuclear norm minimization algorithm (9) exactly recovers X{\boldsymbol{X}} as the unique optimal solution.

By a coupon-collecting argument , one can in fact show that it is impossible to recover the matrix with less than (n1+n2)rlog(n1+n2)(n_{1}+n_{2})r\log(n_{1}+n_{2}) measurements using any algorithm. Therefore, Theorem 2 shows that nuclear norm minimization is near-optimal in terms of sample complexity — off by only a logarithmic factor — a remarkable fact considering that we are using a convex relaxation of the rank.

In the noisy case, one can study the performance of nuclear norm minimization in terms of its recovery error, which is done for example in . Here we state one such performance guarantee taken from . Let us assume that the entries of the noise w{\boldsymbol{w}} are independent with variance that scaled as ν/n1n2\nu/\sqrt{n_{1}n_{2}}. Then, by solving the regularized nuclear norm minimization problem (10) with parameter τ=4ν(n1+n2)log(n1+n2)m\tau=4\nu\sqrt{\frac{(n_{1}+n_{2})\log(n_{1}+n_{2})}{m}}, we obtain a solution X^\hat{{\boldsymbol{X}}} satisfying

with high probability in the moderate to low SNR regime.

4 First-Order Algorithms for Nuclear Norm Minimization

In principle, it is possible to solve the nuclear norm minimization problems in (9)–(11) to high numerical accuracy using off-the-shelf semidefinite programming solvers (such as SDPT3 ). However, these solvers, typically based on interior-point methods, can be extremely slow when the size of the matrix is large. For example, SDPT3 can only handle matrices with dimensions no larger than a few thousands due to memory requirements. This computational issue motivates the development of fast alternatives that can handle significantly larger problems. First-order algorithms become an appealing candidate due to their low per-iteration cost, as well as the flexibility to incorporate the specific structures of the semidefinite programs that arise in low-rank matrix estimation. There is a long and still growing list of such algorithms, including singular value thresholding , accelerated proximal gradient descent , which is variant of FISTA for matrix completion , Augmented Lagrangian Multiplier methods , Frank-Wolfe , CoGENT , and ADCG , just to name a few. Below, we discuss two representative algorithms: FISTA for solving the regularized problem (10), and Frank-Wolfe for solving the constrained problem (11). These two algorithms provide the stage for understanding many other algorithms.

An important subroutine in many of the aforementioned algorithms is the Singular Value Thresholding (SVT) operator Dτ()\mathcal{D}_{\tau}(\cdot) . Mathematically, Dτ(Y)\mathcal{D}_{\tau}({\boldsymbol{Y}}) is defined as the proximal mapping of Y{\boldsymbol{Y}} with respect to the nuclear norm:

The SVT operator admits a closed-form expression; in particular, if the SVD of Y{\boldsymbol{Y}} is UΣVT{\boldsymbol{U}}{\boldsymbol{\Sigma}}{\boldsymbol{V}}^{T} with Σ=diag[σ1,σ2,]{\boldsymbol{\Sigma}}=\mathop{\rm diag}[\sigma_{1},\sigma_{2},\ldots], then Dτ(Y)=UΣVT\mathcal{D}_{\tau}({\boldsymbol{Y}})={\boldsymbol{U}}{\boldsymbol{\Sigma}}^{\prime}{\boldsymbol{V}}^{T}, where Σ=diag[σ1,σ2,]{\boldsymbol{\Sigma}}^{\prime}=\mathop{\rm diag}[\sigma^{\prime}_{1},\sigma^{\prime}_{2},\ldots] with

The fact that the SVT operator can be efficiently computed via SVD is leveraged in many of the first-order algorithms.

The FISTA algorithm for the regularized problem (10) is given in Algorithm 1, where LL is an upper bound of the Lipschitz constant of f(X)\nabla f({\boldsymbol{X}}), with f(X):=12yA(X)22f({\boldsymbol{X}}):=\frac{1}{2}\|{\boldsymbol{y}}-\mathcal{A}({\boldsymbol{X}})\|_{2}^{2}. FISTA makes use of Nesterov’s momentum acceleration to speed up the convergence. If we denote the objective function of (10) as g(X)g({\boldsymbol{X}}), then to achieve ϵ\epsilon-accuracy, i.e. g(XT)g(X^)ϵg({\boldsymbol{X}}_{T})-g(\hat{{\boldsymbol{X}}})\leq\epsilon, we need T=O(L/ϵ)T=\mathcal{O}\left(\sqrt{L/\epsilon}\right) iterations. In each iteration, only a partial SVD is needed to evaluate the SVT operator. Doing so for large-scale problems may still be too slow and require large memory. In this case, one can make use of modern randomized techniques from numerical linear algebra to further speed up the computation of SVD .

The standard Frank-Wolfe method (also known as conditional gradient descent) for solving the constrained problem (11) is presented in Algorithm 2. Each iteration of algorithm only requires computing a rank-one SVD, which can be done using power iteration or Lanczos methods. Therefore, Frank-Wolfe typically has a much lower computational cost per-iteration than methods based on the SVT operation. However, standard Frank-Wolfe may converge very slowly in practice. To achieve ϵ\epsilon-accuracy, i.e. f(XT)f(X^)ϵf({\boldsymbol{X}}_{T})-f(\hat{{\boldsymbol{X}}})\leq\epsilon, Frank-Wolfe requires T=O(1/ϵ)T=\mathcal{O}\left(1/\epsilon\right) iterations, which can be quite slow. Variants of Frank-Wolfe with faster convergence or lower memory footprint have been actively developed recently by exploiting the problem structures. The list, including CoGENT , In-Face Extended Frank-Wolf , and ADCG , Block Frank-Wolfe , and sketchyCGM , is still growing.

Provable and Fast Low-Rank Matrix Estimation via Nonconvex Factorization

As we have seen, the computational concern of solving rank minimization problems is assuaged to some extent by the use of convex relaxation — the resulting semidefinite programs can be solved in time polynomial in the matrix dimension. However, for large-scale problems where the dimension is on the order of millions, solving these semidefinite programs, even using first-order methods, can still be computationally infeasible due to the fundamental bottleneck of storing and optimizing over a matrix variable. This issue severely limits the applicability of the convex relaxation methods.

To overcome this difficulty, a recent line of work studies more computationally efficient methods that are based on nonconvex optimization. These methods work directly with the original nonconvex, rank-constrained optimization problem, which can be generally written as

Surprisingly, even though the Burer-Monteiro formulation (18) is nonconvex, global optima can sometimes be found (or approximated) efficiently using various iterative procedures; moreover, rigorous guarantees can be derived for the statistical accuracy of the resulting solution. Indeed, several iterative schemes have a computational cost proportional to (n1+n2)poly(r)(n_{1}+n_{2})\operatorname{poly}(r) and the size of the input, at least per iteration, which is typically much lower than n1×n2n_{1}\times n_{2}. These results are developed in a still growing line of recent work –, and we devote the rest of this section to presenting the most representative results therein.

This line of work considers three major classes of iterative schemes for solving the Burer-Monteiro formulation (18):

(Projected) gradient descent : One runs (projected) gradient descent directly on the loss function f(L,R)f({\boldsymbol{L}},{\boldsymbol{R}}) with respect to the factor variables (L,R)({\boldsymbol{L}},{\boldsymbol{R}}):

where ηt\eta^{t} is the step size and PL{\mathcal{P}}_{\mathcal{L}}, PR{\mathcal{P}}_{\mathcal{R}} denote the Euclidean projection onto the sets L\mathcal{L} and R\mathcal{R}, which are constraint sets that encode additional structures of the desired low-rank factors.

Alternating minimization : One optimizes the loss function f(L,R)f({\boldsymbol{L}},{\boldsymbol{R}}) alternatively over one of the factors while fixing the other, which is a convex problem. In particular, each iteration takes the form

Singular value projection (SVP) : One performs a gradient descent step of F(LRT)F({\boldsymbol{L}}{\boldsymbol{R}}^{T}) on the “full” n1×n2n_{1}\times n_{2} matrix space, then projects back to the factor space via SVD:

where ηt\eta^{t} is the step size, and SVDr(Z)\textsf{SVD}_{r}({\boldsymbol{Z}}) returns the top rank-rr factors of Z{\boldsymbol{Z}}, that is, the pair (UΣ1/2,VΣ1/2)({\boldsymbol{U}}{\boldsymbol{\Sigma}}^{1/2},{\boldsymbol{V}}{\boldsymbol{\Sigma}}^{1/2}) assuming that UΣVT=Zr=Pr(Z){\boldsymbol{U}}{\boldsymbol{\Sigma}}{\boldsymbol{V}}^{T}={\boldsymbol{Z}}_{r}={\mathcal{P}}_{r}({\boldsymbol{Z}}) is SVD of the best rank-rr approximation of Z{\boldsymbol{Z}}.

Because neither the function f(L,R)f({\boldsymbol{L}},{\boldsymbol{R}}) nor the set of low-rank matrices are convex, standard global convergence theory for convex optimization does not apply here. The recent breakthrough is based on the realization that convexity is in fact not necessary for the convergence of these iterative schemes; instead, as long as the gradients of function f(L,R)f({\boldsymbol{L}},{\boldsymbol{R}}) always point (approximately) towards the desired solution, the iterates will make progress along the right direction. Note that this property concerns the geometry of f(L,R)f({\boldsymbol{L}},{\boldsymbol{R}}) itself, and is largely independent of the specific choice of the algorithm . Among the above three options, the projected gradient descent approach stands out due to its simple form, cheap per-iteration cost (no SVD or inner optimization is needed) and efficiency with constrained problems. We thus use projected gradient descent as the focal point of our survey.

Playing a key role here is the use of statistical modeling and probabilistic analysis: we will show that f(L,R)f({\boldsymbol{L}},{\boldsymbol{R}}) has the desired geometric properties with high probability under probabilistic generative models of the data, thereby circumventing the worst-case hardness of the low-rank matrix estimation problem and instead focusing on its average-case behavior. Existing results in this direction can be divided into two categories. In the first line of work reviewed in Section 5.1, one shows that iterative algorithms converge to the desired solution rapidly when initialized within a large neighborhood around the ground truth; moreover, a good initial solution can be obtained efficiently by simple procedures (which typically involve computing a partial SVD). The second line of work, reviewed in Section 5.2, concerns the global landscape of the loss function, and aims to show that in spite of the non-convexity of f(L,R)f({\boldsymbol{L}},{\boldsymbol{R}}), all of its local minima are in fact close to the desired solution in an appropriate sense whereas all other stationary points (e.g., saddle points) possess a descent direction; these properties guarantee the convergence of iterative algorithms from any initial solution. Both of these two types of results have their own merits and hence are complementary to each other. Below we review the most representative results in each of these two categories for noiseless matrix sensing and matrix completion. The readers are referred to for extensions to the noisy case.

For matrix sensing, we take the loss function f(L,R)f({\boldsymbol{L}},{\boldsymbol{R}}) as

Similarly, for matrix completion, we use the loss function

Since we can only hope to recover matrices that satisfy the incoherence property, we perform projected gradient descent by projecting to the constraint set:

with R\mathcal{R} defined similarly. Note that L\mathcal{L} is convex, and depends on the initial solution L0{\boldsymbol{L}}^{0}. The projection  PL\ {\mathcal{P}}_{\mathcal{L}} is given by the row-wise “clipping” operation

for i=1,2,,n1i=1,2,\ldots,n_{1}, where Li{\boldsymbol{L}}_{i\cdot} is the ii-th row of L{\boldsymbol{L}}; the projection PR{\mathcal{P}}_{\mathcal{R}} is given by a similar formula. This projection ensures that the iterates of projected gradient descent (19) remain incoherent.

The following theorems guarantee that if the initial solution (L0,R0)({\boldsymbol{L}}^{0},{\boldsymbol{R}}^{0}) is reasonably close to the desired solution, then the iterates converge linearly to the ground truth (in terms of the reconstruction error), under conditions similar to nuclear norm minimization. Moreover, we can find a provably good initial solution using the so-called spectral method, which involves performing a partial SVD. In particular, we compute (L0,R0)=SVDr[A(y)]({\boldsymbol{L}}^{0},{\boldsymbol{R}}^{0})=\textsf{SVD}_{r}[\mathcal{A}^{*}({\boldsymbol{y}})] for matrix sensing, and (L0,R0)=SVDr[p1PΩ(Y)]({\boldsymbol{L}}^{0},{\boldsymbol{R}}^{0})=\textsf{SVD}_{r}[p^{-1}{\mathcal{P}}_{\Omega}({\boldsymbol{Y}})] for matrix completion.

where c0c_{0} is some sufficiently small constant. Furthermore, starting from any (L0,R0)({\boldsymbol{L}}^{0},{\boldsymbol{R}}^{0}) that satisfies (24), the gradient descent iterates {(Lt,Rt)}t=1\{({\boldsymbol{L}}^{t},{\boldsymbol{R}}^{t})\}_{t=1}^{\infty} with an appropriate step size satisfy the bound

where 0<δ<10<\delta<1 is a universal constant.

There exists a positive constant CC such that if

then with probability at least 1c(n1+n2)11-c(n_{1}+n_{2})^{-1} for some positive constant cc, the initial solution satisfies

Furthermore, starting from any (L0,R0)({\boldsymbol{L}}^{0},{\boldsymbol{R}}^{0}) that satisfies (26), the projected gradient descent iterates {(Lt,Rt)}t=0\{({\boldsymbol{L}}^{t},{\boldsymbol{R}}^{t})\}_{t=0}^{\infty} with an appropriate step size satisfy the bound

where 0<δ<10<\delta<1 is a universal constant.

The above theorems guarantees that the gradient descent iterates enjoy geometric convergence to a global optimum when the initial solution is sufficiently close to the ground truth. Comparing with the guarantees for nuclear norm minimization, (projected) gradient descent succeeds under a similar sample complexity condition (up to a polynomial term in rr and logn\log n), but the computational cost is significantly lower. To obtain ε\varepsilon-accuracy, meaning that the final estimate (L^,R^)(\widehat{{\boldsymbol{L}}},\widehat{{\boldsymbol{R}}}) satisfies

we only need to run a total of T=O(log(1/ε))T=\mathcal{O}(\log(1/\varepsilon)) iterations for matrix sensing, and T=O(μ0rlog(1/ε))T=\mathcal{O}(\mu_{0}r\log(1/\varepsilon)) iterations for matrix completion.

We now discuss the overall computational complexity, and for simplicity we shall assume n1=n2=nn_{1}=n_{2}=n. For matrix sensing, let T0T_{\text{0}} be the maximum time of multiplying the matrix Al{\boldsymbol{A}}_{l} with a vector of compatible dimension. Each gradient step (19) can be performed in time O(mrT0+nr)\mathcal{O}(mrT_{\text{0}}+nr). The complexity of computing the initial solution is a bit subtler. To this end, we first note that by standard matrix perturbation bounds, one can show that the matrix A(y)\mathcal{A}^{*}({\boldsymbol{y}}) and its singular values are sufficiently close to those of X{\boldsymbol{X}} under the condition of Theorem 3; in particular, one has A(y)Xc0σr/r\|\mathcal{A}^{*}({\boldsymbol{y}})-{\boldsymbol{X}}\|\leq c_{0}\sigma_{r}/\sqrt{r}, and the rr-th and (r+1)(r+1)-th singular values of A(y)\mathcal{A}^{*}({\boldsymbol{y}}) are at most c0σr/rc_{0}\sigma_{r}/\sqrt{r} away from the corresponding singular values of X{\boldsymbol{X}}, where c0c_{0} is a small constant. With such properties of A(y)\mathcal{A}^{*}({\boldsymbol{y}}), one does not need to compute the exact singular values/vectors of A(y)\mathcal{A}^{*}({\boldsymbol{y}}) in order to meet the initialization condition (24); rather, it suffices to find a rank-rr approximation of A(y)\mathcal{A}^{*}({\boldsymbol{y}}) with the property L0R0TA(y)c0σr/r\|\boldsymbol{L}^{0}\boldsymbol{R}^{0T}-\mathcal{A}^{*}({\boldsymbol{y}})\|\leq c_{0}\sigma_{r}/\sqrt{r}. This can be done using for example the Randomized SVD procedure in , which takes O(mrT0logn+nr2)\mathcal{O}(mrT_{\text{0}}\log n+nr^{2}) time to compute; see [45, Theorem 1.2 and Eq. (1.11)] with q=lognq=\log n. Put together, the overall time complexity is O(mrT0log(n/ε))\mathcal{O}(mrT_{\text{0}}\log(n/\varepsilon)) for achieving ε\varepsilon-accuracy.

For matrix completion, to obtain the initial solution, we can again follow similar arguments as above to show that it suffices to compute a rank-rr approximation of the matrix p1PΩ(Y){p}^{-1}{\mathcal{P}}_{\Omega}({\boldsymbol{Y}}), which is close to X{\boldsymbol{X}} and has a sufficiently large spectral gap. Since p1PΩ(Y){p}^{-1}{\mathcal{P}}_{\Omega}({\boldsymbol{Y}}) is a sparse matrix with support Ω\Omega, computing such an approximation can be done in time O(rΩlogn+nr2)\mathcal{O}(r|\Omega|\log n+nr^{2}) using the Randomize SVD procedure in . Each step of gradient descent requires computing the gradient and the projection onto L\mathcal{L} and R\mathcal{R}. Both of them only involve operations on sparse matrices supported on Ω\Omega and thin matrices, and can be done in time O(rΩ+nr2)\mathcal{O}(r|\Omega|+nr^{2}). Therefore, projected gradient descent achieves ϵ\epsilon-accuracy with running time O(rΩlognlog(1/ε))\mathcal{O}(r|\Omega|\log n\log(1/\varepsilon)).

In the noisy setting, the algorithms can be applied without change, and the same error bounds hold with an additional term that depends on the noise. For matrix completion , this term is rPΩ(W)p\frac{\sqrt{r}\|{\mathcal{P}}_{\Omega}({\boldsymbol{W}})\|}{p}, where PΩ(W){\mathcal{P}}_{\Omega}({\boldsymbol{W}}) is the noise matrix supported on the observed indices. This term can be bounded under various noise models. For example, when W{\boldsymbol{W}} has i.i.d. Gaussian or ±σ\pm\sigma entries with zero mean and variance σ2\sigma^{2}, then PΩ(W)σp(n1+n2)\|{\mathcal{P}}_{\Omega}({\boldsymbol{W}})\|\lesssim\sigma\sqrt{p(n_{1}+n_{2})} with high probability. The resulting error bound is optimal in an information-theoretic sense . See also for the near-optimal error control in the spectral norm and the entry-wise infinity norm.

2 Global Geometry and Saddle-Point Escaping Algorithms

A very recent line of work studies the global geometry of the Burer-Monteiro formulation (18), as well as its computational implications for algorithms starting at an arbitrary initial solution . These results are based on a geometric notion called the strict saddle property .

A function g(x)g({\boldsymbol{x}}) is said to be (ϵ,γ,ζ)(\epsilon,\gamma,\zeta)-strict saddle, if for each x{\boldsymbol{x}} at least one of the following holds:

g(x)2ϵ>0\|\nabla g({\boldsymbol{x}})\|_{2}\geq\epsilon>0;

λmin(2g(x))γ<0\lambda_{\min}\left(\nabla^{2}g({\boldsymbol{x}})\right)\leq-\gamma<0;

there exists a local minimum x{\boldsymbol{x}}^{\star} such that xx2ζ\|{\boldsymbol{x}}-{\boldsymbol{x}}^{\star}\|_{2}\leq\zeta.

In Figure 2 and Figure 3 we provide examples of a one-dimensional function and a two-dimensional function, respectively, that satisfy the strict saddle property in Definition 3. Intuitively, the strict saddle property of g(x)g({\boldsymbol{x}}) ensures that whenever one is not already close to a local minimum, the current solution will have either a large gradient, or a descent direction due to the Hessian having a negative eigenvalue. Therefore, any local search algorithms that are capable of finding such a descent direction, will make progress in decreasing the value of g(x)g({\boldsymbol{x}}) and eventually converge to a local minimum. Many algorithms have been shown to enjoy this property, include cubic regularization , trust-region algorithms , stochastic gradient descent , and other more recent variants . In Algorithm 3, we describe one such algorithm, namely the Perturbed Gradient Descent (PGD) algorithm from . PGD is based on the standard gradient descent algorithm with the following additional steps: (i) when the gradient is small, indicating potential closeness to a saddle point, PGD adds a random perturbation to the current iterate (which is done at most once every tthrest_{\text{thres}} iterations); (ii) if the last perturbation occurs tthrest_{\text{thres}} iterations ago and the function value does not decrease sufficiently since, then PGD terminates and outputs the iterate before the last perturbation.

We do not delve further into the details of these saddle-escaping algorithms, as their parameter choices and run-time guarantees are somewhat technical. Rather, for the purpose of analyzing low-rank matrix estimation problems, we simply rely on the existence of such algorithms and the fact that their running time depends polynomially on the problem parameters. This is summarized in the following theorem, which abstracts out the key results in the work cited above.

Specializing to the low-rank matrix estimation problem, it remains to verify that (i) the loss function defined on (L,R)({\boldsymbol{L}},{\boldsymbol{R}}) is strict saddle and (ii) all its local minima are “good”, in the sense that they correspond to a low-rank matrix equal to (in the noisy case, close to) the true matrix X{\boldsymbol{X}}. To this end, we consider the same loss function as before for matrix sensing:

where fAf_{\mathcal{A}} is defined in (22). For matrix completion, we consider a regularized loss function:

where fΩf_{\Omega} is given in (23), the regularizer is given by

α\alpha and λ\lambda are regularization parameters, and (x)+=max{x,0}(x)_{+}=\max\{x,0\}. The regularization plays a similar role as the projections PL,PR{\mathcal{P}}_{\mathcal{L}},{\mathcal{P}}_{\mathcal{R}} previously used: it encourages incoherence of L{\boldsymbol{L}} and R{\boldsymbol{R}}. Replacing projections with regularization leads to an unconstrained formulation that fits into the strict-saddle framework above. The follow theorems show that these loss functions indeed have the desired strict-saddle property with high probability under sample complexity conditions similar to before.

Suppose that the observation probability satisfies p\geq\Omega\big{(}\frac{\mu^{4}r^{6}\log(n_{1}+n_{2})}{(n_{1}+n_{2})}\big{)}, and one chooses \alpha^{2}=\Theta\big{(}\frac{\mu r\sigma_{1}}{(n_{1}+n_{2})}\big{)} and \lambda=\Theta\big{(}\frac{(n_{1}+n_{2})}{\mu r}\big{)}. Then, with probability at least 1(n1+n2)11-(n_{1}+n_{2})^{-1}, the above loss function gΩg_{\Omega} satisfies the following: (i) it is \big{(}\epsilon,\Omega(\sigma_{r}),\mathcal{O}(\frac{\epsilon}{\sigma_{r}})\big{)}-strict saddle for any ϵpoly(μ4r4σ14(n1+n2)2)\epsilon\leq\text{poly}(\frac{\mu^{4}r^{4}\sigma^{4}_{1}}{(n_{1}+n_{2})^{2}}), and (ii) all its local minima satisfy LRT=X{\boldsymbol{L}}{{\boldsymbol{R}}}^{T}={\boldsymbol{X}}.

Combining Theorem 5 with Theorem 6 and Theorem 7, we conclude that iterative algorithms optimizing over the factor variables (L,R)({\boldsymbol{L}},{\boldsymbol{R}}) converge globally to some pair satisfying LRT=X{\boldsymbol{L}}{\boldsymbol{R}}^{T}={\boldsymbol{X}} in a polynomially number of iterations from any arbitrary initial solutions, as long as they can escape saddle points. We refer the readers to for more discussions.

3 Perspectives

Combining the discussions in the last two subsections, we obtain the following general picture for low-rank matrix estimation reformulated in the factor space of (L,R)({\boldsymbol{L}},{\boldsymbol{R}}):

All the local minima of the loss function are in fact global minima, and correspond to some factorization LRT{\boldsymbol{L}}{{\boldsymbol{R}}}^{T} of the true low-rank matrix X{\boldsymbol{X}}.

In a neighborhood of each global minimum (L,R)({\boldsymbol{L}},{\boldsymbol{R}}), the loss function is essentially strongly convex, and has no saddle point. Within this neighborhood, gradient descent and other iterative algorithms converge geometrically.

Any point outside such neighborhoods either has a strictly positive gradient, or is a saddle point with a descent direction corresponding to a strictly negative eigenvalue of the Hessian.

Iterative algorithms escape all saddle points and enter a neighborhood of the global minima in polynomial time. Alternatively, one can find a solution in this neighborhood by performing one SVD of a matrix appropriately constructed from the observations.

Comparing the two approaches to non-convex matrix estimation discussed in the last two subsections, we also see that each of them has its own strengths. The first approach, taken in Section 5.1, focuses on the convergence of algorithms with a proper initialization procedure. This approach immediately leads to simple, efficient algorithms, with provably geometric convergence and linear time-complexity. It also readily extends to problems that have additional structural constraints (such as sparsity, Hankel and discrete structures), or involve more complicated loss functions (such as robust PCA and matrix completion with quantized observations), some of which may involve a non-smooth loss function whose Hessian is not defined. However, finding a good initialization scheme is non-trivial, and in some settings is actually the harder part of the problem. The second approach, taken in Section 5.2, instead focuses on the global geometric landscape of the problem. This approach is conceptually elegant, nicely decomposing the geometric aspect (properties of the local minima) and the algorithmic aspect (how to find the local minima) of the problem. Computationally, it eliminates the need of careful initialization, but the resulting run-time guarantees are somewhat weaker, which may be super-linear in the dimensions. Of course, we made the distinction between these two approaches mostly for ease of review of state-of-the-art; given the rapid developments in this area, we expect that both approaches will be improved, expanded, and eventually merged.

Before concluding this section, we add that there is a deeper reason for low-rank matrix estimation being such a benign nonconvex problem. The loss function of the low-rank matrix estimation problem can often be viewed, in a certain precise sense, as a perturbed version of the objective function of PCA, i.e., finding the best rank-rr approximation in Frobenius norm in a factorized form:

For example, the matrix completion loss function (23), with the regularization omitted, is exactly equal to the above objective in expectation. The PCA problem (28) is arguably the most well-understood tractable non-convex problem: in addition to having a closed-form solution (4), this problem satisfies all the geometric properties mentioned in the last two subsections ; in particular, its local minima and saddle points can be expressed in terms of the top and non-top eigen components of X{\boldsymbol{X}}, respectively. Under the probabilistic or RIP assumptions on the sensing operators, the geometric and algorithmic properties of the PCA problem (28) are essentially preserved under incomplete observations, with high probability, as long as the issue of incoherence is appropriately accounted for.

Structured Low-Rank Matrix Estimation

In many applications, the low-rank matrix estimation problems possess additional structures that need to be carefully exploited, and we present two such examples in this section: Hankel matrix completion, and the recovery of clustering matrices.

Imagine that one is interested in estimating the spectrum of time series, or the direction-of-arrivals from returns from a sensor array. One can model the signal of interest as a weighted sum of complex exponentials, i.e.,

One can view the atom v(z)\boldsymbol{v}(z) as an eigenvector of a linear time-invariant system, with zz being the corresponding pole.

where n1n_{1} is commonly selected as n/2\lfloor n/2\rfloor to make the matrix H(x)\mathcal{H}({\boldsymbol{x}}) as square as possible. The important observation is that H(x)\mathcal{H}({\boldsymbol{x}}) admits the following low-rank decomposition:

C=diag[c1,c2,,cr]\boldsymbol{C}=\mathop{\rm diag}[c_{1},c_{2},\ldots,c_{r}], and Vnn1+1\boldsymbol{V}_{n-n_{1}+1} is defined in a way similar to (33). This decomposition shows that \mboxrank(H(x))r\mbox{rank}(\mathcal{H}(\boldsymbol{x}))\leq r, and equality holds when all the poles are distinct. This representation of x{\boldsymbol{x}} as a structured low-rank matrix can be leveraged to facilitate recovery of the un-measured entries of x{\boldsymbol{x}}. In particular, one can try to recover the missing measurements by seeking a Hankel matrix with the smallest nuclear norm and consistent with the available measurements. This idea gives rise to the following algorithm, termed Enhanced Matrix Completion (EMaC) :

Figure 4 (a) illustrates the observation pattern in a Hankel matrix recovery problem, which is highly structured.

Under the parametric model (29), we define a new notion of incoherence that bears an interesting physical interpretation. Let the Dirichlet kernel be

whose absolute value decays inverse proportionally with respect to z|z|. Given rr poles, one can construct two r×rr\times r Gram matrices GL\boldsymbol{G}_{\text{L}} and GR\boldsymbol{G}_{\text{R}}, corresponding to the column space and row space of H(x)\mathcal{H}({\boldsymbol{x}}), where the entries of these matrices are specified by

The incoherence parameter is then defined as follows.

The incoherence parameter of a signal x{\boldsymbol{x}} of the form (29) is defined as the smallest number μ\mu satisfying the bounds

If all poles are well-separated by 2/n2/n, the incoherence parameter μ\mu can be bounded by a small constant . As the poles get closer, the Gram matrices become poorly-conditioned, resulting in a large μ\mu. Therefore, the incoherence parameter provides a measure of the hardness of the recovery problem in terms of the relative positions of the poles. The theorem below summarizes the performance guarantees of the EMaC algorithm.

Suppose that each entry of x{\boldsymbol{x}} is observed independently with probability pp. As long as

for some sufficiently large constant CC, the signal x{\boldsymbol{x}} can be exactly recovered with high probability via EMaC.

Theorem 8 suggests that a Hankel-structured low-rank matrix can be faithfully recovered using a number of measurements much smaller than its dimension nn. Recently it has been shown that Hankel matrix completion can also be efficiently solved using the non-convex Burer-Monteiro factorization and projected gradient descent approach described in Section 5, under similar conditions . Similar results can be obtained for block Hankel or Toeplitz low-rank matrix completion (for multi-dimensional data) as well. Interestingly, if the Toeplitz matrix is additionally positive-semidefinite, the incoherence condition can be relaxed by exploring the connection to Carathéodory’s theorem; see .

2 Cluster Matrices

One may represent an ideal partition by a so-called cluster matrix X{0,1}n×n\boldsymbol{X}^{\star}\in\{0,1\}^{n\times n} defined as

With an appropriate ordering of the rows and columns, the matrix X\boldsymbol{X}^{\star} takes the form of a block-diagonal matrix:

The fact that the nodes in the same cluster tend to have high affinity values, can be captured by the model Yij=Xij+Wij,Y_{ij}=X^{\star}_{ij}+W_{ij}, where WijW_{ij} is some form of noise that encapsulates the inherent randomness/uncertainty in the pairwise affinity measure. In many applications, the affinity values between some pairs of nodes are unknown, or are costly to measure, in which case we only observe a subset Ω\Omega of the entries of Y{\boldsymbol{Y}}. Under this setup, the clustering problem can be cast as a noisy low-rank matrix completion problem, and the algorithms and theory in the last two sections can be immediately applied.

Notably, one can take advantage of the additional structures of the cluster matrix X\boldsymbol{X}^{\star} to obtain stronger performance guarantees. In particular, it is sometimes possible to recover X\boldsymbol{X}^{\star} exactly even in the presence of noise. We briefly review one such result from . Consider the setting where the noise term WijW_{ij} is such that YijBernoulli(τin)Y_{ij}\sim\text{Bernoulli}(\tau_{\text{in}}) if Xij=1X^{\star}_{ij}=1 and YijBernoulli(τout)Y_{ij}\sim\text{Bernoulli}(\tau_{\text{out}}) if Xij=0X^{\star}_{ij}=0, where τin>τout\tau_{\text{in}}>\tau_{\text{out}}; in this case, nodes in the same clusters have a higher probability of having a high (non-zero) affinity value. Figure 4 (b) and (c) illustrate the cluster matrix and the affinity matrix where the nodes are ordered according to the cluster structure, and (d) shows the affinity matrix except that the nodes are randomly permuted, as how it is typically observed in practice. As before, we assume that each entry of Y{\boldsymbol{Y}} is observed with some probability pp. This model is sometimes referred to as the (censored) stochastic block model or planted partition model in the literature . For this model, we consider the maximum likelihood estimator of X\boldsymbol{X}^{\star}, and derive a convex relaxation of it by replacing the non-convex constraints on X\boldsymbol{X}^{\star} (low-rank, binary and block-diagonal) with a nuclear norm regularizer and linear inequality constraints XijX^{\star}_{ij}\in. Doing so leads to the semidefinite program

where α=2p(τin+τout)p(τin+τout)\alpha=\sqrt{\frac{2-p(\tau_{\text{in}}+\tau_{\text{out}})}{p(\tau_{\text{in}}+\tau_{\text{out}})}} and J=Jn×n\boldsymbol{J}=\boldsymbol{J}_{n\times n} is the n×nn\times n all-one matrix; see for the details. This approach enjoys the following guarantees.

for some constant CC, the convex relaxation (37) recovers X\boldsymbol{X}^{\star} exactly as the unique minimizer with high probability.

In words, provided that the observation probability and the difference between τin\tau_{\text{in}} and τout\tau_{\text{out}} are large enough, the solution of the convex relaxation formulation is guaranteed to have the structures of a cluster matrix and equal X\boldsymbol{X}^{\star} exactly.

Clustering is a classical problem that has been studied extensively, with a huge body of literature. The above perspective of casting it as a low-rank matrix recovery problem is relatively recent, and proves to be fruitful, leading to a number of new algorithms and theoretical results. A detailed account of these developments is outside the scope of this article, and we refer the readers to the recent surveys and the references therein.

Numerical Examples on MovieLens Data

We demonstrate performance of three matrix completion algorithms: accelerated proximal gradient , singular value projection , and bi-factored gradient descent . For these algorithms we use existing implementations with publicly available codes, and mostly adopt their default settings with only a few adjustments (detailed below) tailored to this specific dataset. For the error metric, we use the normalized mean absolute errors (NMAE) over the training set and test set, defined respectively as

We first employ the accelerated proximal gradient (APG) algorithm proposed in .http://www.math.nus.edu.sg/~mattohkc/NNLS.html We disable the adaptive updating of the regularization parameter in the implementation, and instead use a fixed one. We set the maximum rank to 100100 and the maximum number of iterations to 15001500. In our experiment, we observe that the algorithm usually meets the stop criteria after just a few of iterations, and hence stops early before reaching the maximum number of iterations. When using different values for the regularization parameter, APG outputs an estimate matrix M^\hat{{\boldsymbol{M}}} with different ranks. Fig. 5 (a) shows the relation between the regularization parameter and rank, and Fig. 5 (b) shows NMAEs for the training data and test data against different ranks. The minimum NMAE on the test data is 0.19240.1924, which is achieved when the regularization parameter is set to 2.612.61 with the rank of the estimate being 55.

The minimum NMAE for test data is 0.19290.1929, achieved when the rank is set to 33.

Lastly, we apply the bi-factored gradient descent (BFGD) algorithm proposed in ,http://akyrillidis.github.io/projects/ which is a variant of the projected gradient descent algorithm applied to the non-convex Burer-Monteiro factorization formulation as described in Section 5. We set the maximum number of iterations to 40004000, and the convergence tolerance to 5×1065\times 10^{-6}. Similarly as before, BFGD typically terminates early in our experiment. For the step size we use the default setting of the above implementation.

The NMAEs of BFGD for the training data and test data are shown in Fig. 7. The minimum NMAE for test data is 0.18950.1895, achieved by setting the rank to 22.

We make several observations from the above experiment results. First, we see that consistently across the three algorithms, the training error generally goes down as the rank becomes larger, whereas the test error exhibits a U-shape behavior, decreasing first and then increasing later. This phenomenon is in accordance with the bias-variance tradeoff principle described in Section 2, and in particular shows that using a low-rank model is helpful in reducing the variance and prevents overfitting. Second, all three algorithms achieve a minimum test NMAE around 0.19, using a rank no more than 55. The small optimal values for the rank are likely due to the highly noisy nature of the MovieLens dataset, for which suppressing variance is crucial to good performance on the test set. Finally, while the estimation/prediction performance of these algorithms is similar, their computational costs, such as running times and memory usage, vary. These costs depend heavily on the specific implementations and termination criteria used, so we do not provide a detailed comparison here.

Concluding Remarks

Low-rank matrices represent an important class of signals with low-dimensional intrinsic structures. In this article, we have presented some recent developments on low-rank matrix estimation, focusing on the setting with incomplete measurements and additional structural constraints. We have particularly emphasized the remarkable modeling power of low-rank matrices, which are useful in a range of problems much wider than the name may suggest, including those where the presence of low-rank structures are not obvious at all. In terms of algorithms and theory, attention is paid to the integration of statistical and computational considerations: fast algorithms have been developed that are applicable to large-scale problems, and at the same time enjoy provable performance guarantees under mild assumptions. As we have seen, such recent progress is made possible by combining techniques from diverse fields; in particular, convex and nonconvex optimization, as well as probabilistic analysis, play a key role.

We conclude by mentioning a few topics and future directions that are not covered in this article. We have focused on the matrix sensing and completion problems with linear measurements. There are many other low-rank estimation problems that are amenable to convex and nonconvex optimization-based algorithms, and enjoy similar geometric properties and performance guarantees. A partial list of such problems includes phase retrieval , blind deconvolution , robust PCA , dictionary learning , lifting for mixture problems , low-rank phase retrieval , community detection , and synchronization problems . More broadly, applications of low-rank matrix recovery go well beyond the setting of linear measurements and least-squares objectives. Prime examples include low-rank matrix recovery with quantized, categorical and non-Gaussian data , and ranking from comparison-based observations . These problems involve more general objective functions (such as the log-likelihood) and constraints that depend on the specific observation schemes and noise structures. Another promising line of research aims at exploiting hidden low-rank structures in settings where the problem on the surface has nothing to do with low-rank matrices, yet such structures reveal themselves under suitable transformation and approximation. Problems of this type include latent variable models with certain smoothness/monotonicity properties .

Another topic of much interest is how to select the model rank automatically and robustly, and how to quantify the effect of model mismatch. These are important issues even in standard matrix sensing and completion; we have not discussed these issues in detail in this survey. Finally, we have omitted many other low-rank recovery algorithms that are not directly based on (continuous) optimization, including various spectral methods, kernel and nearest neighbor type methods, and algorithms with a more combinatorial flavor. Some of these algorithms are particularly useful in problems involving complicated discrete and time-evolving structures and active/adaptive sampling procedures. All of these topics are the subject of active research with tremendous potential.

Acknowledgment

The authors thank Mr. Yuanxin Li for preparing the numerical experiments in this paper. The work of Y. Chen is supported in part by NSF under the CRII award 1657420 and grant CCF-1704828. The work of Y. Chi is supported in part by AFOSR under the grant FA9550-15-1-0205, by ONR under the grant N00014-18-1-2142, and by NSF under the grants CAREER ECCS-1818571 and CCF-1806154.

References