Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions

Nathan Halko, Per-Gunnar Martinsson, Joel A. Tropp

Overview

On a well-known list of the “Top 10 Algorithms” that have influenced the practice of science and engineering during the 20th century , we find an entry that is not really an algorithm: the idea of using matrix factorizations to accomplish basic tasks in numerical linear algebra. In the accompanying article , Stewart explains that

The underlying principle of the decompositional approach to matrix computation is that it is not the business of the matrix algorithmicists to solve particular problems but to construct computational platforms from which a variety of problems can be solved.

Stewart goes on to argue that this point of view has had many fruitful consequences, including the development of robust software for performing these factorizations in a highly accurate and provably correct manner.

The decompositional approach to matrix computation remains fundamental, but developments in computer hardware and the emergence of new applications in the information sciences have rendered the classical algorithms for this task inadequate in many situations:

A salient feature of modern applications, especially in data mining, is that the matrices are stupendously big. Classical algorithms are not always well adapted to solving the type of large-scale problems that now arise.

In the information sciences, it is common that data are missing or inaccurate. Classical algorithms are designed to produce highly accurate matrix decompositions, but it seems profligate to spend extra computational resources when the imprecision of the data inherently limits the resolution of the output.

Data transfer now plays a major role in the computational cost of numerical algorithms. Techniques that require few passes over the data may be substantially faster in practice, even if they require as many—or more—floating-point operations.

As the structure of computer processors continues to evolve, it becomes increasingly important for numerical algorithms to adapt to a range of novel architectures, such as graphics processing units.

The purpose of this paper is make the case that randomized algorithms provide a powerful tool for constructing approximate matrix factorizations. These techniques are simple and effective, sometimes impressively so. Compared with standard deterministic algorithms, the randomized methods are often faster and—perhaps surprisingly—more robust. Furthermore, they can produce factorizations that are accurate to any specified tolerance above machine precision, which allows the user to trade accuracy for speed if desired. We present numerical evidence that these algorithms succeed for real computational problems.

In short, our goal is to demonstrate how randomized methods interact with classical techniques to yield effective, modern algorithms supported by detailed theoretical guarantees. We have made a special effort to help practitioners identify situations where randomized techniques may outperform established methods.

Throughout this article, we provide detailed citations to previous work on randomized techniques for computing low-rank approximations. The primary sources that inform our presentation include .

Our experience suggests that many practitioners of scientific computing view randomized algorithms as a desperate and final resort. Let us address this concern immediately. Classical Monte Carlo methods are highly sensitive to the random number generator and typically produce output with low and uncertain accuracy. In contrast, the algorithms discussed herein are relatively insensitive to the quality of randomness and produce highly accurate results. The probability of failure is a user-specified parameter that can be rendered negligible (say, less than $10^{-15}$ ) with a nominal impact on the computational resources required.

The roster of standard matrix decompositions includes the pivoted QR factorization, the eigenvalue decomposition, and the singular value decomposition (SVD), all of which expose the (numerical) range of a matrix. Truncated versions of these factorizations are often used to express a low-rank approximation of a given matrix:

The inner dimension $k$ is sometimes called the numerical rank of the matrix. When the numerical rank is much smaller than either dimension $m$ or $n$ , a factorization such as (1) allows the matrix to be stored inexpensively and to be multiplied rapidly with vectors or other matrices. The factorizations can also be used for data interpretation or to solve computational problems, such as least squares.

Matrices with low numerical rank appear in a wide variety of scientific applications. We list only a few:

A basic method in statistics and data mining is to compute the directions of maximal variance in vector-valued data by performing principal component analysis (PCA) on the data matrix. PCA is nothing other than a low-rank matrix approximation [71, §14.5].

Another standard technique in data analysis is to perform low-dimensional embedding of data under the assumption that there are fewer degrees of freedom than the ambient dimension would suggest. In many cases, the method reduces to computing a partial SVD of a matrix derived from the data. See [71, §§14.8–14.9] or .

The problem of estimating parameters from measured data via least-squares fitting often leads to very large systems of linear equations that are close to linearly dependent. Effective techniques for factoring the coefficient matrix lead to efficient techniques for solving the least-squares problem, .

Many fast numerical algorithms for solving PDEs and for rapidly evaluating potential fields such as the fast multipole method and $\mathcal{H}$ -matrices , rely on low-rank approximations of continuum operators.

Models of multiscale physical phenomena often involve PDEs with rapidly oscillating coefficients. Techniques for model reduction or coarse graining in such environments are often based on the observation that the linear transform that maps the input data to the requested output data can be approximated by an operator of low rank .

2 Matrix approximation framework

The task of computing a low-rank approximation to a given matrix can be split naturally into two computational stages. The first is to construct a low-dimensional subspace that captures the action of the matrix. The second is to restrict the matrix to the subspace and then compute a standard factorization (QR, SVD, etc.) of the reduced matrix. To be slightly more formal, we subdivide the computation as follows.

Stage A: Compute an approximate basis for the range of the input matrix $\bm{A}$ . In other words, we require a matrix $\bm{Q}$ for which

We would like the basis matrix $\bm{Q}$ to contain as few columns as possible, but it is even more important to have an accurate approximation of the input matrix.

Stage B: Given a matrix $\bm{Q}$ that satisfies (2), we use $\bm{Q}$ to help compute a standard factorization (QR, SVD, etc.) of $\bm{A}$ .

The task in Stage A can be executed very efficiently with random sampling methods, and these methods are the primary subject of this work. In the next subsection, we offer an overview of these ideas. The body of the paper provides details of the algorithms (§4) and a theoretical analysis of their performance (§§8–11).

Stage B can be completed with well-established deterministic methods. Section 3.3.3 contains an introduction to these techniques, and §5 shows how we apply them to produce low-rank factorizations.

At this point in the development, it may not be clear why the output from Stage A facilitates our job in Stage B. Let us illustrate by describing how to obtain an approximate SVD of the input matrix $\bm{A}$ given a matrix $\bm{Q}$ that satisfies (2). More precisely, we wish to compute matrices $\bm{U}$ and $\bm{V}$ with orthonormal columns and a nonnegative, diagonal matrix $\bm{\Sigma}$ such that $\bm{A}\approx\bm{U\Sigma V}^{*}$ . This goal is achieved after three simple steps:

Form $\bm{B}=\bm{Q}^{*}\bm{A}$ , which yields the low-rank factorization $\bm{A}\approx\bm{Q}\bm{B}$ .

Compute an SVD of the small matrix: $\bm{B}=\widetilde{\bm{U}}\bm{\Sigma}\bm{V}^{*}$ .

When $\bm{Q}$ has few columns, this procedure is efficient because we can easily construct the reduced matrix $\bm{B}$ and rapidly compute its SVD. In practice, we can often avoid forming $\bm{B}$ explicitly by means of subtler techniques. In some cases, it is not even necessary to revisit the input matrix $\bm{A}$ during Stage B. This observation allows us to develop single-pass algorithms, which look at each entry of $\bm{A}$ only once.

Similar manipulations readily yield other standard factorizations, such as the pivoted QR factorization, the eigenvalue decomposition, etc.

3 Randomized algorithms

This paper describes a class of randomized algorithms for completing Stage A of the matrix approximation framework set forth in §1.2. We begin with some details about the approximation problem these algorithms target (§1.3.1). Afterward, we motivate the random sampling technique with a heuristic explanation (§1.3.2) that leads to a prototype algorithm (§1.3.3).

The basic challenge in producing low-rank matrix approximations is a primitive question that we call the fixed-precision approximation problem. Suppose we are given a matrix $\bm{A}$ and a positive error tolerance $\varepsilon$ . We seek a matrix $\bm{Q}$ with $k=k(\varepsilon)$ orthonormal columns such that

The singular value decomposition furnishes an optimal answer to the fixed-precision problem . Let $\sigma_{j}$ denote the $j$ th largest singular value of $\bm{A}$ . For each $j\geq 0$ ,

One way to construct a minimizer is to choose $\bm{X}=\bm{QQ}^{*}\bm{A}$ , where the columns of $\bm{Q}$ are $k$ dominant left singular vectors of $\bm{A}$ . Consequently, the minimal rank $k$ where (3) holds equals the number of singular values of $\bm{A}$ that exceed the tolerance $\varepsilon$ .

To simplify the development of algorithms, it is convenient to assume that the desired rank $k$ is specified in advance. We call the resulting problem the fixed-rank approximation problem. Given a matrix $\bm{A}$ , a target rank $k$ , and an oversampling parameter $p$ , we seek to construct a matrix $\bm{Q}$ with $k+p$ orthonormal columns such that

Although there exists a minimizer $\bm{Q}$ that solves the fixed rank problem for $p=0$ , the opportunity to use a small number of additional columns provides a flexibility that is crucial for the effectiveness of the computational methods we discuss.

We will demonstrate that algorithms for the fixed-rank problem can be adapted to solve the fixed-precision problem. The connection is based on the observation that we can build the basis matrix $\bm{Q}$ incrementally and, at any point in the computation, we can inexpensively estimate the residual error $\left\|{\bm{A}-\bm{QQ}^{*}\bm{A}}\right\|$ . Refer to §4.4 for the details of this reduction.

3.2 Intuition

To understand how randomness helps us solve the fixed-rank problem, it is helpful to consider some motivating examples.

First, suppose that we seek a basis for the range of a matrix $\bm{A}$ with exact rank $k$ . Draw a random vector $\bm{\omega}$ , and form the product $\bm{y}=\bm{A}\bm{\omega}$ . For now, the precise distribution of the random vector is unimportant; just think of $\bm{y}$ as a random sample from the range of $\bm{A}$ . Let us repeat this sampling process $k$ times:

Owing to the randomness, the set $\{\bm{\omega}^{(i)}:i=1,2,\dots,k\}$ of random vectors is likely to be in general linear position. In particular, the random vectors form a linearly independent set and no linear combination falls in the null space of $\bm{A}$ . As a result, the set $\{\bm{y}^{(i)}:i=1,2,\dots,k\}$ of sample vectors is also linearly independent, so it spans the range of $\bm{A}$ . Therefore, to produce an orthonormal basis for the range of $\bm{A}$ , we just need to orthonormalize the sample vectors.

Now, imagine that $\bm{A}=\bm{B}+\bm{E}$ where $\bm{B}$ is a rank- $k$ matrix containing the information we seek and $\bm{E}$ is a small perturbation. Our priority is to obtain a basis that covers as much of the range of $\bm{B}$ as possible, rather than to minimize the number of basis vectors. Therefore, we fix a small number $p$ , and we generate $k+p$ samples

The perturbation $\bm{E}$ shifts the direction of each sample vector outside the range of $\bm{B}$ , which can prevent the span of $\{\bm{y}^{(i)}:i=1,2,\dots,k\}$ from covering the entire range of $\bm{B}$ . In contrast, the enriched set $\{\bm{y}^{(i)}:i=1,2,\dots,k+p\}$ of samples has a much better chance of spanning the required subspace.

Just how many extra samples do we need? Remarkably, for certain types of random sampling schemes, the failure probability decreases superexponentially with the oversampling parameter $p$ ; see (9). As a practical matter, setting $p=5$ or $p=10$ often gives superb results. This observation is one of the principal facts supporting the randomized approach to numerical linear algebra.

3.3 A prototype algorithm

The intuitive approach of §1.3.2 can be applied to general matrices. Omitting computational details for now, we formalize the procedure in the figure labeled Proto-Algorithm.

This simple algorithm is by no means new. It is essentially the first step of a subspace iteration with a random initial subspace [61, §7.3.2]. The novelty comes from the additional observation that the initial subspace should have a slightly higher dimension than the invariant subspace we are trying to approximate. With this revision, it is often the case that no further iteration is required to obtain a high-quality solution to (5). We believe this idea can be traced to .

In order to invoke the proto-algorithm with confidence, we must address several practical and theoretical issues:

What random matrix $\bm{\Omega}$ should we use? How much oversampling do we need?

The matrix $\bm{Y}$ is likely to be ill-conditioned. How do we orthonormalize its columns to form the matrix $\bm{Q}$ ?

How can we solve the fixed-precision problem (3) when the numerical rank of the matrix is not known in advance?

How can we use the basis $\bm{Q}$ to compute other matrix factorizations?

Does the randomized method work for problems of practical interest? How does its speed/accuracy/robustness compare with standard techniques?

What error bounds can we expect? With what probability?

The next few sections provide a summary of the answers to these questions. We describe several problem regimes where the proto-algorithm can be implemented efficiently, and we present a theorem that describes the performance of the most important instantiation. Finally, we elaborate on how these ideas can be applied to approximate the truncated SVD of a large data matrix. The rest of the paper contains a more exhaustive treatment—including pseudocode, numerical experiments, and a detailed theory.

4 A comparison between randomized and traditional techniques

To select an appropriate computational method for finding a low-rank approximation to a matrix, the practitioner must take into account the properties of the matrix. Is it dense or sparse? Does it fit in fast memory or is it stored out of core? Does the singular spectrum decay quickly or slowly? The behavior of a numerical linear algebra algorithm may depend on all these factors . To facilitate a comparison between classical and randomized techniques, we summarize their relative performance in each of three representative environments. Section 6 contains a more in-depth treatment.

We focus on the task of computing an approximate SVD of an $m\times n$ matrix $\bm{A}$ with numerical rank $k$ . For randomized schemes, Stage A generally dominates the cost of Stage B in our matrix approximation framework (§1.2). Within Stage A, the computational bottleneck is usually the matrix–matrix product $\bm{A\Omega}$ in Step 2 of the proto-algorithm (§1.3.3). The power of randomized algorithms stems from the fact that we can reorganize this matrix multiplication for maximum efficiency in a variety of computational architectures.

4.2 A matrix for which matrix–vector products can be evaluated rapidly

When the matrix $\bm{A}$ is sparse or structured, we may be able to apply it rapidly to a vector. In this case, the classical prescription for computing a partial SVD is to invoke a Krylov subspace method, such as the Lanczos or Arnoldi algorithm. It is difficult to summarize the computational cost of these methods because their performance depends heavily on properties of the input matrix and on the amount of effort spent to stabilize the algorithm. (Inherently, the Lanczos and Arnoldi methods are numerically unstable.) For the same reasons, the error analysis of such schemes is unsatisfactory in many important environments.

With a given budget of floating-point operations, Krylov methods sometimes deliver a more accurate approximation than randomized algorithms. Nevertheless, the methods described in this survey have at least two powerful advantages over Krylov methods. First, the randomized schemes are inherently stable, and they come with very strong performance guarantees that do not depend on subtle spectral properties of the input matrix. Second, the matrix–vector multiplies required to form $\bm{A\Omega}$ can be performed in parallel. This fact allows us to restructure the calculations to take full advantage of the computational platform, which can lead to dramatic accelerations in practice, especially for parallel and distributed machines.

A more detailed comparison or randomized schemes and Krylov subspace methods is given in §6.2.

4.3 A general dense matrix stored in slow memory or streamed

In contrast, the proto-algorithm of §1.3.3 requires only one pass over the data to produce the approximate basis $\bm{Q}$ for Stage A of the approximation framework. This straightforward approach, unfortunately, is not accurate enough for matrices whose singular spectrum decays slowly, but we can address this problem using very few (say, $2$ to $4$ ) additional passes over the data . See §1.6 or §4.5 for more discussion.

Typically, Stage B uses one additional pass over the matrix to construct the approximate SVD. With slight modifications, however, the two-stage randomized scheme can be revised so that it only makes a single pass over the data. Refer to §5.5 for information.

5 Performance analysis

A principal goal of this paper is to provide a detailed analysis of the performance of the proto-algorithm described in §1.3.3. This investigation produces precise error bounds, expressed in terms of the singular values of the input matrix. Furthermore, we determine how several choices of the random matrix $\bm{\Omega}$ impact the behavior of the algorithm.

Let us offer a taste of this theory. The following theorem describes the average-case behavior of the proto-algorithm with a Gaussian test matrix, assuming we perform the computation in exact arithmetic. This result is a simplified version of Theorem 19.

Suppose that $\bm{A}$ is a real $m\times n$ matrix. Select a target rank $k\geq 2$ and an oversampling parameter $p\geq 2$ , where $k+p\leq\min\{m,n\}$ . Execute the proto-algorithm with a standard Gaussian test matrix to obtain an $m\times(k+p)$ matrix $\bm{Q}$ with orthonormal columns. Then

We recall that the term $\sigma_{k+1}$ appearing in (8) is the smallest possible error (4) achievable with any basis matrix $\bm{Q}$ . The theorem asserts that, on average, the algorithm produces a basis whose error lies within a small polynomial factor of the theoretical minimum. Moreover, the error bound (8) in the randomized algorithm is slightly sharper than comparable bounds for deterministic techniques based on rank-revealing QR algorithms .

The reader might be worried about whether the expectation provides a useful account of the approximation error. Fear not: the actual outcome of the algorithm is almost always very close to the typical outcome because of measure concentration effects. As we discuss in §10.3, the probability that the error satisfies

is at least $1-6\cdot p^{-p}$ under very mild assumptions on $p$ . This fact justifies the use of an oversampling term as small as $p=5$ . This simplified estimate is very similar to the major results in .

The theory developed in this paper provides much more detailed information about the performance of the proto-algorithm.

When the singular values of $\bm{A}$ decay slightly, the error $\left\|{\bm{A}-\bm{QQ}^{*}\bm{A}}\right\|$ does not depend on the dimensions of the matrix (§§10.2–10.3).

We can reduce the size of the bracket in the error bound (8) by combining the proto-algorithm with a power iteration (§10.4). For an example, see §1.6 below.

For the structured random matrices we mentioned in §1.4.1, related error bounds are in force (§11).

We can obtain inexpensive a posteriori error estimates to verify the quality of the approximation (§4.3).

6 Example: Randomized SVD

We conclude this introduction with a short discussion of how these ideas allow us to perform an approximate SVD of a large data matrix, which is a compelling application of randomized matrix approximation .

The two-stage randomized method offers a natural approach to SVD computations. Unfortunately, the simplest version of this scheme is inadequate in many applications because the singular spectrum of the input matrix may decay slowly. To address this difficulty, we incorporate $q$ steps of a power iteration, where $q=1$ or $q=2$ usually suffices in practice. The complete scheme appears in the box labeled Prototype for Randomized SVD. For most applications, it is important to incorporate additional refinements, as we discuss in §§4–5.

The Randomized SVD procedure requires only $2(q+1)$ passes over the matrix, so it is efficient even for matrices stored out-of-core. The flop count satisfies

where $T_{\rm mult}$ is the flop count of a matrix–vector multiply with $\bm{A}$ or $\bm{A}^{*}$ . We have the following theorem on the performance of this method in exact arithmetic, which is a consequence of Corollary 23.

Suppose that $\bm{A}$ is a real $m\times n$ matrix. Select an exponent $q$ and a target number $k$ of singular vectors, where $2\leq k\leq 0.5\min\{m,n\}$ . Execute the Randomized SVD algorithm to obtain a rank- $2k$ factorization $\bm{U\Sigma V}^{*}$ . Then

This result is new. Observe that the bracket in (10) is essentially the same as the bracket in the basic error bound (8). We find that the power iteration drives the leading constant to one exponentially fast as the power $q$ increases. The rank- $k$ approximation of $\bm{A}$ can never achieve an error smaller than $\sigma_{k+1}$ , so the randomized procedure computes $2k$ approximate singular vectors that capture as much of the matrix as the first $k$ actual singular vectors.

In practice, we can truncate the approximate SVD, retaining only the first $k$ singular values and vectors. Equivalently, we replace the diagonal factor $\bm{\Sigma}$ by the matrix $\bm{\Sigma}_{(k)}$ formed by zeroing out all but the largest $k$ entries of $\bm{\Sigma}$ . For this truncated SVD, we have the error bound

In words, we pay no more than an additive term $\sigma_{k+1}$ when we perform the truncation step. Our numerical experience suggests that the error bound (11) is pessimistic. See Remark 5.1 and §9.4 for some discussion of truncation.

7 Outline of paper

The paper is organized into three parts: an introduction (§§1–3), a description of the algorithms (§§4–7), and a theoretical performance analysis (§§8–11). The two latter parts commence with a short internal outline. Each part is more or less self-contained, and after a brief review of our notation in §§3.1–3.2, the reader can proceed to either the algorithms or the theory part.

Related work and historical context

Randomness has occasionally surfaced in the numerical linear algebra literature; in particular, it is quite standard to initialize iterative algorithms for constructing invariant subspaces with a randomly chosen point. Nevertheless, we believe that sophisticated ideas from random matrix theory have not been incorporated into classical matrix factorization algorithms until very recently. We can trace this development to earlier work in computer science and—especially—to probabilistic methods in geometric analysis. This section presents an overview of the relevant work. We begin with a survey of randomized methods for matrix approximation; then we attempt to trace some of the ideas backward to their sources.

Matrices of low numerical rank contain little information relative to their apparent dimension owing to the linear dependency in their columns (or rows). As a result, it is reasonable to expect that these matrices can be approximated with far fewer degrees of freedom. A less obvious fact is that randomized schemes can be used to produce these approximations efficiently.

Several types of approximation techniques build on this idea. These methods all follow the same basic pattern:

Preprocess the matrix, usually to calculate sampling probabilities.

Take random samples from the matrix, where the term sample refers generically to a linear function of the matrix.

Postprocess the samples to compute a final approximation, typically with classical techniques from numerical linear algebra. This step may require another look at the matrix.

We continue with a description of the most common approximation schemes.

The simplest approach to matrix approximation is the method of sparsification or the related technique of quantization. The goal of sparsification is to replace the matrix by a surrogate that contains far fewer nonzero entries. Quantization produces an approximation whose components are drawn from a (small) discrete set of values. These methods can be used to limit storage requirements or to accelerate computations by reducing the cost of matrix–vector and matrix–matrix multiplies [94, Ch. 6]. The manuscript describes applications in optimization.

Sparsification typically involves very simple elementwise calculations. Each entry in the approximation is drawn independently at random from a distribution determined from the corresponding entry of the input matrix. The expected value of the random approximation equals the original matrix, but the distribution is designed so that a typical realization is much sparser.

The first method of this form was devised by Achlioptas and McSherry , who built on earlier work on graph sparsification due to Karger . Arora–Hazan–Kale presented a different sampling method in . See for some recent work on sparsification.

1.2 Column selection methods

A second approach to matrix approximation is based on the idea that a small set of columns describes most of the action of a numerically low-rank matrix. Indeed, classical existential results demonstrate that every $m\times n$ matrix $\bm{A}$ contains a $k$ -column submatrix $\bm{C}$ for which

where $k$ is a parameter, the dagger $\dagger$ denotes the pseudoinverse, and $\bm{A}_{(k)}$ is a best rank- $k$ approximation of $\bm{A}$ . It is ${\sf NP}$ -hard to perform column selection by optimizing natural objective functions, such as the condition number of the submatrix . Nevertheless, there are efficient deterministic algorithms, such as the rank-revealing QR method of , that can nearly achieve the error bound (12).

There is a class of randomized algorithms that approach the fixed-rank approximation problem (5) using this intuition. These methods first compute a sampling probability for each column, either using the squared Euclidean norms of the columns or their leverage scores. (Leverage scores reflect the relative importance of the columns to the action of the matrix; they can be calculated easily from the dominant $k$ right singular vectors of the matrix.) Columns are then selected randomly according to this distribution. Afterward, a postprocessing step is invoked to produce a more refined approximation of the matrix.

Rudelson and Vershynin later showed that the same type of column sampling method also yields spectral-norm error bounds . The techniques in their paper have been very influential; their work has found other applications in randomized regression , sparse approximation , and compressive sampling .

Deshpande et al. demonstrated that the error in the column sampling approach can be improved by iteration and adaptive volume sampling. They showed that it is possible to produce a rank- $k$ matrix $\bm{B}$ that satisfies

using a $k$ -pass algorithm. Around the same time, Har-Peled independently developed a recursive algorithm that offers the same approximation guarantees. Very recently, Desphande and Rademacher have improved the running time of volume-based sampling methods .

Drineas et al. and Boutsidis et al. have also developed randomized algorithms for the column subset selection problem, which requests a column submatrix $\bm{C}$ that achieves a bound of the form (12). Via the methods of Rudelson and Vershynin , they showed that sampling columns according to their leverage scores is likely to produce the required submatrix . Subsequent work showed that postprocessing the sampled columns with a rank-revealing QR algorithm can reduce the number of output columns required (12). The argument in explicitly decouples the linear algebraic part of the analysis from the random matrix theory. The theoretical analysis in the present work involves a very similar technique.

1.3 Approximation by dimension reduction

A third approach to matrix approximation is based on the concept of dimension reduction. Since the rows of a low-rank matrix are linearly dependent, they can be embedded into a low-dimensional space without altering their geometric properties substantially. A random linear map provides an efficient, nonadaptive way to perform this embedding. (Column sampling can also be viewed as an adaptive form of dimension reduction.)

The proto-algorithm we set forth in §1.3.3 is simply a dual description of the dimension reduction approach: collecting random samples from the column space of the matrix is equivalent to reducing the dimension of the rows. No precomputation is required to obtain the sampling distribution, but the sample itself takes some work to collect. Afterward, we orthogonalize the samples as preparation for constructing various matrix approximations.

We believe that the idea of using dimension reduction for algorithmic matrix approximation first appeared in a 1998 paper of Papadimitriou et al. , who described an application to latent semantic indexing (LSI). They suggested projecting the input matrix onto a random subspace and compressing the original matrix to (a subspace of) the range of the projected matrix. They established error bounds that echo the result (13) of Frieze et al. . Although the Euclidean column selection method is a more computationally efficient way to obtain this type of error bound, dimension reduction has other advantages, e.g., in terms of accuracy.

Sarlós argued in that the computational costs of dimension reduction can be reduced substantially by means of the structured random maps proposed by Ailon–Chazelle . Sarlós used these ideas to develop efficient randomized algorithms for least-squares problems; he also studied approximate matrix multiplication and low-rank matrix approximation. The recent paper analyzes a very similar matrix approximation algorithm using Rudelson and Vershynin’s methods .

The initial work of Sarlós on structured dimension reduction did not immediately yield algorithms for low-rank matrix approximation that were superior to classical techniques. Woolfe et al. showed how to obtain an improvement in asymptotic computational cost, and they applied these techniques to problems in scientific computing . Related work includes .

Martinsson–Rokhlin–Tygert have studied dimension reduction using a Gaussian transform matrix, and they demonstrated that this approach performs much better than earlier analyses had suggested . Their work highlights the importance of oversampling, and their error bounds are very similar to the estimate (9) we presented in the introduction. They also demonstrated that dimension reduction can be used to compute an interpolative decomposition of the input matrix, which is essentially equivalent to performing column subset selection.

Rokhlin–Szlam–Tygert have shown that combining dimension reduction with a power iteration is an effective way to improve its performance . These ideas lead to very efficient randomized methods for large-scale PCA . An efficient, numerically stable version of the power iteration is discussed in §4.5, as well as . Related ideas appear in a paper of Roweis .

Very recently, Clarkson and Woodruff have developed one-pass algorithms for performing low-rank matrix approximation, and they have established lower bounds which prove that many of their algorithms have optimal or near-optimal resource guarantees, modulo constants.

1.4 Approximation by submatrices

The matrix approximation literature contains a subgenre that discusses methods for building an approximation from a submatrix and computed coefficient matrices. For example, we can construct an approximation using a subcollection of columns (the interpolative decomposition), a subcollection of rows and a subcollection of columns (the CUR decomposition), or a square submatrix (the matrix skeleton). This type of decomposition was developed and studied in several papers, including . For data analysis applications, see the recent paper .

A number of works develop randomized algorithms for this class of matrix approximations. Drineas et al. have developed techniques for computing CUR decompositions, which express $\bm{A}\approx\bm{CUR}$ , where $\bm{C}$ and $\bm{R}$ denote small column and row submatrices of $\bm{A}$ and where $\bm{U}$ is a small linkage matrix. These methods identify columns (rows) that approximate the range (corange) of the matrix; the linkage matrix is then computed by solving a small least-squares problem. A randomized algorithm for CUR approximation with controlled absolute error appears in ; a relative error algorithm appears in . We also mention a paper on computing a closely related factorization called the compact matrix decomposition .

It is also possible to produce interpolative decompositions and matrix skeletons using randomized methods, as discussed in and §5.2 of the present work.

1.5 Other numerical problems

The literature contains a variety of other randomized algorithms for solving standard problems in and around numerical linear algebra. We list some of the basic references.

Randomized column selection methods can be used to produce CUR-type decompositions of higher-order tensors .

Column selection and dimension reduction techniques can be used to accelerate the multiplication of rank-deficient matrices . See also .

The randomized Kaczmarz algorithm is a linearly convergent iterative method that can be used to solve overdetermined linear systems .

Fast dimension-reduction maps can sometimes accelerate the solution of overdetermined least-squares problems .

Fast dimension reduction maps can be used to reduce the size of nonnegative least-squares problems .

Randomized matrix approximations can be used to precondition conjugate gradient to solve least-squares problems .

The Fermat–Weber facility location problem can be viewed as matrix approximation with respect to a different discrepancy measure. Randomized algorithms for this type of problem appear in .

1.6 Compressive sampling

Although randomized matrix approximation and compressive sampling are based on some common intuitions, it is facile to consider either one as a subspecies of the other. We offer a short overview of the field of compressive sampling—especially the part connected with matrices—so we can highlight some of the differences.

The theory of compressive sampling starts with the observation that many types of vector-space data are compressible. That is, the data are approximated well using a short linear combination of basis functions drawn from a fixed collection . For example, natural images are well approximated in a wavelet basis; numerically low-rank matrices are well approximated as a sum of rank-one matrices. The idea behind compressive sampling is that suitably chosen random samples from this type of compressible object carry a large amount of information. Furthermore, it is possible to reconstruct the compressible object from a small set of these random samples, often by solving a convex optimization problem. The initial discovery works of Candès–Romberg–Tao and Donoho were written in 2004.

The earliest work in compressive sampling focused on vector-valued data; soon after, researchers began to study compressive sampling for matrices. In 2007, Recht–Fazel–Parillo demonstrated that it is possible to reconstruct a rank-deficient matrix from Gaussian measurements . More recently, Candès–Recht and Candès–Tao considered the problem of completing a low-rank matrix from a random sample of its entries.

The usual goals of compressive sampling are (i) to design a method for collecting informative, nonadaptive data about a compressible object and (ii) to reconstruct a compressible object given some measured data. In both cases, there is an implicit assumption that we have limited—if any—access to the underlying data.

In the problem of matrix approximation, we typically have a complete representation of the matrix at our disposal. The point is to compute a simpler representation as efficiently as possible under some operational constraints. In particular, we would like to perform as little computation as we can, but we are usually allowed to revisit the input matrix. Because of the different focus, randomized matrix approximation algorithms require fewer random samples from the matrix and use fewer computational resources than compressive sampling reconstruction algorithms.

2 Origins

This section attempts to identify some of the major threads of research that ultimately led to the development of the randomized techniques we discuss in this paper.

These observations suggest that we might be able to solve some computational problems of a geometric nature more efficiently by translating them into a lower-dimensional space and solving them there. This idea was cultivated by the theoretical computer science community beginning in the late 1980s, with research flowering in the late 1990s. In particular, nearest-neighbor search can benefit from dimension-reduction techniques . The papers were apparently the first to apply this approach to linear algebra.

Around the same time, researchers became interested in simplifying the form of dimension reduction maps and improving the computational cost of applying the map. Several researchers developed refined results on the performance of a Gaussian matrix as a linear dimension reduction map . Achlioptas demonstrated that discrete random matrices would serve nearly as well . In 2006, Ailon and Chazelle proposed the fast Johnson–Lindenstrauss transform , which combines the speed of the FFT with the favorable embedding properties of a Gaussian matrix. Subsequent refinements appear in . Sarlós then imported these techniques to study several problems in numerical linear algebra, which has led to some of the fastest algorithms currently available .

2.2 Data streams

Muthukrishnan argues that a distinguishing feature of modern data is the manner in which it is presented to us. The sheer volume of information and the speed at which it must be processed tax our ability to transmit the data elsewhere, to compute complicated functions on the data, or to store a substantial part of the data [100, §3]. As a result, computer scientists have started to develop algorithms that can address familiar computational problems under these novel constraints. The data stream phenomenon is one of the primary justifications cited by for developing pass-efficient methods for numerical linear algebra problems, and it is also the focus of the recent treatment .

One of the methods for dealing with massive data sets is to maintain sketches, which are small summaries that allow functions of interest to be calculated. In the simplest case, a sketch is simply a random projection of the data, but it might be a more sophisticated object [100, §5.1]. The idea of sketching can be traced to the work of Alon et al. .

2.3 Numerical linear algebra

Classically, the field of numerical linear algebra has focused on developing deterministic algorithms that produce highly accurate matrix approximations with provable guarantees. Nevertheless, randomized techniques have appeared in several environments.

One of the original examples is the use of random models for arithmetical errors, which was pioneered by von Neumann and Goldstine. Their papers stand among the first works to study the properties of random matrices. The earliest numerical linear algebra algorithm that depends essentially on randomized techniques is probably Dixon’s method for estimating norms and condition numbers .

Another situation where randomness commonly arises is the initialization of iterative methods for computing invariant subspaces. For example, most numerical linear algebra texts advocate random selection of the starting vector for the power method because it ensures that the vector has a nonzero component in the direction of a dominant eigenvector. Woźniakowski and coauthors have analyzed the performance of the power method and the Lanczos iteration given a random starting vector .

Among other interesting applications of randomness, we mention the work by Parker and Pierce, which applies a randomized FFT to eliminate pivoting in Gaussian elimination , work by Demmel et al. who have studied randomization in connection with the stability of fast methods for linear algebra , and work by Le and Parker utilizing randomized methods for stabilizing fast linear algebraic computations based on recursive algorithms, such as Strassen’s matrix multiplication .

2.4 Scientific computing

One of the first algorithmic applications of randomness is the method of Monte Carlo integration introduced by Von Neumann and Ulam , and its extensions, such as the Metropolis algorithm for simulations in statistical physics. (See for an introduction.) The most basic technique is to estimate an integral by sampling $m$ points from the measure and computing an empirical mean of the integrand evaluated at the sample locations:

where $X_{i}$ are independent and identically distributed according to the probability measure $\mu$ . The law of large numbers (usually) ensures that this approach produces the correct result in the limit as $m\to\infty$ . Unfortunately, the approximation error typically has a standard deviation of $m^{-1/2}$ , and the method provides no certificate of success.

The disappointing computational profile of Monte Carlo integration seems to have inspired a distaste for randomized approaches within the scientific computing community. Fortunately, there are many other types of randomized algorithms—such as the ones in this paper—that do not suffer from the same shortcomings.

2.5 Geometric functional analysis

There is one more character that plays a central role in our story: the probabilistic method in geometric analysis. Many of the algorithms and proof techniques ultimately come from work in this beautiful but recondite corner of mathematics.

Dvoretsky’s theorem states (roughly) that every infinite-dimensional Banach space contains an $n$ -dimensional subspace whose geometry is essentially the same as an $n$ -dimensional Hilbert space, where $n$ is an arbitrary natural number. In 1971, V. D. Milman developed a striking proof of this result by showing that a random $n$ -dimensional subspace of an $N$ -dimensional Banach space has this property with exceedingly high probability, provided that $N$ is large enough . Milman’s article debuted the concentration of measure phenomenon, which is a geometric interpretation of the classical idea that regular functions of independent random variables rarely deviate far from their mean. This work opened a new era in geometric analysis where the probabilistic method became a basic instrument.

We have already described a third class of examples: the randomized embeddings of Johnson–Lindenstrauss and of Bourgain .

Finally, we mention Maurey’s technique of empirical approximation. The original work was unpublished; one of the earliest applications appears in [24, §1]. Although Maurey’s idea has not received as much press as the examples above, it can lead to simple and efficient algorithms for sparse approximation. For some examples in machine learning, consider

The importance of random constructions in the geometric analysis community has led to the development of powerful techniques for studying random matrices. Classical random matrix theory focuses on a detailed asymptotic analysis of the spectral properties of special classes of random matrices. In contrast, geometric analysts know methods for determining the approximate behavior of rather complicated finite-dimensional random matrices. See for a fairly current survey article. We also mention the works of Rudelson and Rudelson–Vershynin , which describe powerful tools for studying random matrices drawn from certain discrete distributions. Their papers are rooted deeply in the field of geometric functional analysis, but they reach out toward computational applications.

Linear algebraic preliminaries

This section summarizes the background we need for the detailed description of randomized algorithms in §§4–6 and the analysis in §§8–11. We introduce notation in §3.1, describe some standard matrix decompositions in §3.2, and briefly review standard techniques for computing matrix factorizations in §3.3.

We usually measure the magnitude of a matrix $\bm{A}$ with the operator norm

which is often referred to as the spectral norm. The Frobenius norm is given by

The conjugate transpose, or adjoint, of a matrix $\bm{A}$ is denoted $\bm{A}^{*}$ . The important identities

We say that a matrix $\bm{U}$ is orthonormal if its columns form an orthonormal set with respect to the Hermitian inner product. An orthonormal matrix $\bm{U}$ preserves geometry in the sense that $\left\|{\bm{U}\bm{x}}\right\|=\left\|{\bm{x}}\right\|$ for every vector $\bm{x}$ . A unitary matrix is a square orthonormal matrix, and an orthogonal matrix is a real unitary matrix. Unitary matrices satisfy the relations $\bm{UU}^{*}=\bm{U}^{*}\bm{U}=\mathbf{I}$ . Both the operator norm and the Frobenius norm are unitarily invariant, which means that

for every matrix $\bm{A}$ and all orthonormal matrices $\bm{U}$ and $\bm{V}$

We use the notation of to denote submatrices. If $\bm{A}$ is a matrix with entries $a_{ij}$ , and if $I=[i_{1},\,i_{2},\,\dots,\,i_{p}]$ and $J=[j_{1},\,j_{2},\,\dots,\,j_{q}]$ are two index vectors, then the associated $p\times q$ submatrix is expressed as

For column- and row-submatrices, we use the standard abbreviations

2 Standard matrix factorizations

This section defines three basic matrix decompositions. Methods for computing them are described in §3.3.

Each $m\times n$ matrix $\bm{A}$ of rank $k$ admits a decomposition

where $\bm{Q}$ is an $m\times k$ orthonormal matrix, and $\bm{R}$ is a $k\times n$ weakly upper-triangular matrix. That is, there exists a permutation $J$ of the numbers $\{1,\,2,\,\dots,\,n\}$ such that $\bm{R}_{(\colon,J)}$ is upper triangular. Moreover, the diagonal entries of $\bm{R}_{(\colon,J)}$ are weakly decreasing. See [61, §5.4.1] for details.

2.2 The singular value decomposition (SVD)

Each $m\times n$ matrix $\bm{A}$ of rank $k$ admits a factorization

where $\bm{U}$ is an $m\times k$ orthonormal matrix, $\bm{V}$ is an $n\times k$ orthonormal matrix, and $\bm{\Sigma}$ is a $k\times k$ nonnegative, diagonal matrix

The numbers $\sigma_{j}$ are called the singular values of $\bm{A}$ . They are arranged in weakly decreasing order:

The columns of $\bm{U}$ and $\bm{V}$ are called left singular vectors and right singular vectors, respectively.

Singular values are connected with the approximability of matrices. For each $j$ , the number $\sigma_{j+1}$ equals the spectral-norm discrepancy between $\bm{A}$ and an optimal rank- $j$ approximation . That is,

In particular, $\sigma_{1}=\left\|{\bm{A}}\right\|$ . See [61, §2.5.3 and §5.4.5] for additional details.

2.3 The interpolative decomposition (ID)

Our final factorization identifies a collection of $k$ columns from a rank- $k$ matrix $\bm{A}$ that span the range of $\bm{A}$ . To be precise, we can compute an index set $J=[j_{1},\,\dots,\,j_{k}]$ such that

where $\bm{X}$ is a $k\times n$ matrix that satisfies $\bm{X}_{(\colon,J)}=\mathbf{I}_{k}$ . Furthermore, no entry of $\bm{X}$ has magnitude larger than two. In other words, this decomposition expresses each column of $\bm{A}$ using a linear combination of $k$ fixed columns with bounded coefficients. Stable and efficient algorithms for computing the ID appear in the papers .

It is also possible to compute a two-sided ID

where $J^{\prime}$ is an index set identifying $k$ of the rows of $\bm{A}$ , and $\bm{W}$ is an $m\times k$ matrix that satisfies $\bm{W}_{(J^{\prime},\colon)}=\mathbf{I}_{k}$ and whose entries are all bounded by two.

There always exists an ID where the entries in the factor $\bm{X}$ have magnitude bounded by one. Known proofs of this fact are constructive, e.g., [103, Lem. 3.3], but they require us to find a collection of $k$ columns that has “maximum volume.” It is NP-hard to identify a subset of columns with this type of extremal property . We find it remarkable that ID computations are possible as soon as the bound on $\bm{X}$ is relaxed.

3 Techniques for computing standard factorizations

This section discusses some established deterministic techniques for computing the factorizations presented in §3.2. The material on pivoted QR and SVD can be located in any major text on numerical linear algebra, such as . References for the ID include .

3.2 Computing partial decompositions

Suppose that an $m\times n$ matrix has numerical rank $k$ , where $k$ is substantially smaller than $m$ and $n$ . In this case, it is possible to produce a structured low-rank decomposition that approximates the matrix well. Sections 4 and 5 describe a set of randomized techniques for obtaining these partial decompositions. This section briefly reviews the classical techniques, which also play a role in developing randomized methods.

Subsequent research has led to strong rank-revealing QR algorithms that succeed for all matrices. For example, the Gu–Eisenstat algorithm (setting their parameter $f=2$ ) produces an QR decomposition of the form (16), where

Recall that $\sigma_{k+1}$ is the minimal error possible in a rank- $k$ approximation . The cost of the Gu–Eisenstat algorithm is typically $O(kmn)$ , but it can be slightly higher in rare cases. The algorithm can also be used to obtain an approximate ID .

Note that all the techniques described in this section require extensive random access to the matrix, and they can be very slow when the matrix is stored out-of-core.

3.3 Converting from one partial factorization to another

Suppose that we have obtained a partial decomposition of a matrix $\bm{A}$ by some means:

where $\bm{B}$ and $\bm{C}$ have rank $k$ . Given this information, we can efficiently compute any of the basic factorizations.

We construct a partial QR factorization using the following three steps:

Compute a QR factorization of $\bm{C}$ so that $\bm{C}=\bm{Q}_{1}\bm{R}_{1}$ .

Form the product $\bm{D}=\bm{R}_{1}\bm{B}$ , and compute a QR factorization: $\bm{D}=\bm{Q}_{2}\bm{R}$ .

Form the product $\bm{Q}=\bm{Q}_{1}\bm{Q}_{2}$ .

The result is an orthonormal matrix $\bm{Q}$ and a weakly upper-triangular matrix $\bm{R}$ such that $\left\|{\bm{A}-\bm{Q}\bm{R}}\right\|\leq\varepsilon$ .

An analogous technique yields a partial SVD:

Compute a QR factorization of $\bm{C}$ so that $\bm{C}=\bm{Q}_{1}\bm{R}_{1}$ .

Form the product $\bm{D}=\bm{R}_{1}\bm{B}$ , and compute an SVD: $\bm{D}=\bm{U}_{2}\bm{\Sigma}\bm{V}^{*}$ .

Form the product $\bm{U}=\bm{Q}_{1}\bm{U}_{2}$ .

The result is a diagonal matrix $\bm{\Sigma}$ and orthonormal matrices $\bm{U}$ and $\bm{V}$ such that $\left\|{\bm{A}-\bm{U}\bm{\Sigma}\bm{V}^{*}}\right\|\leq\varepsilon$ .

Converting $\bm{B}$ and $\bm{C}$ into a partial ID is a one-step process:

Compute $J$ and $\bm{X}$ such that $\bm{B}=\bm{B}_{(\colon,J)}\bm{X}$ .

Then $\bm{A}\approx\bm{A}_{(\colon,J)}\bm{X}$ , but the approximation error may deteriorate from the initial estimate. For example, if we compute the ID using the Gu–Eisenstat algorithm with the parameter $f=2$ , then the error ${\bigl{\|}{\bm{A}-\bm{A}_{(\colon,J)}\bm{X}}\bigr{\|}}\leq(1+\sqrt{1+4k(n-k)})\cdot\varepsilon$ . Compare this bound with Lemma 4 below.

3.4 Krylov-subspace methods

Suppose that the matrix $\bm{A}$ can be applied rapidly to vectors, as happens when $\bm{A}$ is sparse or structured. Then Krylov subspace techniques can very effectively and accurately compute partial spectral decompositions. For concreteness, assume that $\bm{A}$ is Hermitian. The idea of these techniques is to fix a starting vector $\bm{\omega}$ and to seek approximations to the eigenvectors within the corresponding Krylov subspace

Krylov methods also come in blocked versions, in which the starting vector $\bm{\omega}$ is replaced by a starting matrix $\bm{\Omega}$ . A common recommendation is to draw a starting vector $\bm{\omega}$ (or starting matrix $\bm{\Omega}$ ) from a standardized Gaussian distribution, which indicates a significant overlap between Krylov methods and the methods in this paper.

The most basic versions of Krylov methods for computing spectral decompositions are numerically unstable. High-quality implementations require that we incorporate restarting strategies, techniques for maintaining high-quality bases for the Krylov subspaces, etc. The diversity and complexity of such methods make it hard to state a precise computational cost, but in the environment we consider in this paper, a typical cost for a fully stable implementation would be

where $T_{\rm mult}$ is the cost of a matrix–vector multiplication.

This part of the paper, §§4–7, provides detailed descriptions of randomized algorithms for constructing low-rank approximations to matrices. As discussed in §1.2, we split the problem into two stages. In Stage A, we construct a subspace that captures the action of the input matrix. In Stage B, we use this subspace to obtain an approximate factorization of the matrix.

Section 4 develops randomized methods for completing Stage A, and §5 describes deterministic methods for Stage B. Section 6 compares the computational costs of the resulting two-stage algorithm with the classical approaches outlined in §3. Finally, §7 illustrates the performance of the randomized schemes via numerical examples.

Stage A: Randomized schemes for approximating the range

This section outlines techniques for constructing a subspace that captures most of the action of a matrix. We begin with a recapitulation of the proto-algorithm that we introduced in §1.3. We discuss how it can be implemented in practice (§4.1) and then consider the question of how many random samples to acquire (§4.2). Afterward, we present several ways in which the basic scheme can be improved. Sections 4.3 and 4.4 explain how to address the situation where the numerical rank of the input matrix is not known in advance. Section 4.5 shows how to modify the scheme to improve its accuracy when the singular spectrum of the input matrix decays slowly. Finally, §4.6 describes how the scheme can be accelerated by using a structured random matrix.

The most natural way to implement the proto-algorithm from §1.3 is to draw a random test matrix $\bm{\Omega}$ from the standard Gaussian distribution. That is, each entry of $\bm{\Omega}$ is an independent Gaussian random variable with mean zero and variance one. For reference, we formulate the resulting scheme as Algorithm LABEL:alg:basic.

The number $T_{\rm basic}$ of flops required by Algorithm LABEL:alg:basic satisfies

where $T_{\rm rand}$ is the cost of generating a Gaussian random number and $T_{\rm mult}$ is the cost of multiplying $\bm{A}$ by a vector. The three terms in (18) correspond directly with the three steps of Algorithm LABEL:alg:basic.

Empirically, we have found that the performance of Algorithm LABEL:alg:basic depends very little on the quality of the random number generator used in Step 1.

The actual cost of Step 2 depends substantially on the matrix $\bm{A}$ and the computational environment that we are working in. The estimate (18) suggests that Algorithm LABEL:alg:basic is especially efficient when the matrix–vector product $\bm{x}\mapsto\bm{A}\bm{x}$ can be evaluated rapidly. In particular, the scheme is appropriate for approximating sparse or structured matrices. Turn to §6 for more details.

The most important implementation issue arises when performing the basis calculation in Step 3. Typically, the columns of the sample matrix $\bm{Y}$ are almost linearly dependent, so it is imperative to use stable methods for performing the orthonormalization. We have found that the Gram–Schmidt procedure, augmented with the double orthogonalization described in , is both convenient and reliable. Methods based on Householder reflectors or Givens rotations also work very well. Note that very little is gained by pivoting because the columns of the random matrix $\bm{Y}$ are independent samples drawn from the same distribution.

2 The number of samples required

The goal of Algorithm LABEL:alg:basic is to produce an orthonormal matrix $\bm{Q}$ with few columns that achieves

Very large matrices may require more oversampling.

The more rapid the decay of the singular values, the less oversampling is needed. In the extreme case that the matrix has exact rank $k$ , it is not necessary to oversample.

Gaussian matrices succeed with very little oversampling, but are not always the most cost-effective option. The structured random matrices discussed in §4.6 may require substantial oversampling, but they still yield computational gains in certain settings.

The theoretical results in Part III provide detailed information about how the behavior of randomized schemes depends on these factors. For the moment, we limit ourselves to some general remarks on implementation issues.

For Gaussian test matrices, it is adequate to choose the oversampling parameter to be a small constant, such as $p=5$ or $p=10$ . There is rarely any advantage to select $p>k$ . This observation, first presented in , demonstrates that a Gaussian test matrix results in a negligible amount of extra computation.

In practice, the target rank $k$ is rarely known in advance. Randomized algorithms are usually implemented in an adaptive fashion where the number of samples is increased until the error satisfies the desired tolerance. In other words, the user never chooses the oversampling parameter. Theoretical results that bound the amount of oversampling are valuable primarily as aids for designing algorithms. We develop an adaptive approach in §§4.3–4.4.

3 A posteriori error estimation

Algorithm LABEL:alg:basic is designed for solving the fixed-rank problem, where the target rank of the input matrix is specified in advance. To handle the fixed-precision problem, where the parameter is the computational tolerance, we need a scheme for estimating how well a putative basis matrix $\bm{Q}$ captures the action of the matrix $\bm{A}$ . To do so, we develop a probabilistic error estimator. These methods are inspired by work of Dixon ; our treatment follows .

The exact approximation error is $\left\|{(\mathbf{I}-\bm{QQ}^{*})\bm{A}}\right\|$ . It is intuitively plausible that we can obtain some information about this quantity by computing $\left\|{(\mathbf{I}-\bm{QQ}^{*})\bm{A}\bm{\omega}}\right\|$ , where $\bm{\omega}$ is a standard Gaussian vector. This notion leads to the following method. Draw a sequence $\{\bm{\omega}^{(i)}:i=1,2,\dots,r\}$ of standard Gaussian vectors, where $r$ is a small integer that balances computational cost and reliability. Then

with probability at least $1-10^{-r}$ . This statement follows by setting $\bm{B}=(\mathbf{I}-\bm{Q}\bm{Q}^{*})\bm{A}$ and $\alpha=10$ in the following lemma, whose proof appears in [137, §3.4].

Let $\bm{B}$ be a real $m\times n$ matrix. Fix a positive integer $r$ and a real number $\alpha>1$ . Draw an independent family $\{\bm{\omega}^{(i)}:i=1,2,\dots,r\}$ of standard Gaussian vectors. Then

The critical point is that the error estimate (20) is computationally inexpensive because it requires only a small number of matrix–vector products. Therefore, we can make a lowball guess for the numerical rank of $\bm{A}$ and add more samples if the error estimate is too large. The asymptotic cost of Algorithm LABEL:alg:basic is preserved if we double our guess for the rank at each step. For example, we can start with 32 samples, compute another 32, then another 64, etc.

The estimate (20) is actually somewhat crude. We can obtain a better estimate at a similar computational cost by initializing a power iteration with a random vector and repeating the process several times .

4 Error estimation (almost) for free

The error estimate described in §4.3 can be combined with any method for constructing an approximate basis for the range of a matrix. In this section, we explain how the error estimator can be incorporated into Algorithm LABEL:alg:basic at almost no additional cost.

The basic observation behind the adaptive scheme is that we can generate the basis in Step 3 of Algorithm LABEL:alg:basic incrementally. Starting with an empty basis matrix $\bm{Q}^{(0)}$ , the following scheme generates an orthonormal matrix whose range captures the action of $\bm{A}$ :

The CPU time requirements of Algorithms LABEL:alg:adaptive2 and LABEL:alg:basic are essentially identical. Although Algorithm LABEL:alg:adaptive2 computes the last few samples purely to obtain the error estimate, this apparent extra cost is offset by the fact that Algorithm LABEL:alg:basic always includes an oversampling factor. The failure probability stated for Algorithm LABEL:alg:adaptive2 is pessimistic because it is derived from a simple union bound argument. In practice, the error estimator is reliable in a range of circumstances when we take $r=10$ .

The calculations in Algorithm LABEL:alg:adaptive2 can be organized so that each iteration processes a block of samples simultaneously. This revision can lead to dramatic improvements in speed because it allows us to exploit higher-level linear algebra subroutines (e.g., BLAS3) or parallel processors. Although blocking can lead to the generation of unnecessary samples, this outcome is generally harmless, as noted in §4.2.

5 A modified scheme for matrices whose singular values decay slowly

The techniques described in §4.1 and §4.4 work well for matrices whose singular values exhibit some decay, but they may produce a poor basis when the input matrix has a flat singular spectrum or when the input matrix is very large. In this section, we describe techniques, originally proposed in , for improving the accuracy of randomized algorithms in these situations. Related earlier work includes and the literature on classical orthogonal iteration methods [61, p. 332].

The intuition behind these techniques is that the singular vectors associated with small singular values interfere with the calculation, so we reduce their weight relative to the dominant singular vectors by taking powers of the matrix to be analyzed. More precisely, we wish to apply the randomized sampling scheme to the matrix $\bm{B}=(\bm{A}\bm{A}^{*})^{q}\bm{A}$ , where $q$ is a small integer. The matrix $\bm{B}$ has the same singular vectors as the input matrix $\bm{A}$ , but its singular values decay much more quickly:

We modify Algorithm LABEL:alg:basic by replacing the formula $\bm{Y}=\bm{A}\bm{\Omega}$ in Step 2 by the formula $\bm{Y}=\bm{B}\bm{\Omega}=\bigl{(}\bm{A}\bm{A}^{*}\bigr{)}^{q}\bm{A}\bm{\Omega}$ , and we obtain Algorithm LABEL:alg:poweriteration.

Algorithm LABEL:alg:poweriteration requires $2q+1$ times as many matrix–vector multiplies as Algorithm LABEL:alg:basic, but is far more accurate in situations where the singular values of $\bm{A}$ decay slowly. A good heuristic is that when the original scheme produces a basis whose approximation error is within a factor $C$ of the optimum, the power scheme produces an approximation error within $C^{1/(2q+1)}$ of the optimum. In other words, the power iteration drives the approximation gap to one exponentially fast. See Theorem 12 and §10.4 for the details.

Unfortunately, when Algorithm LABEL:alg:poweriteration is executed in floating point arithmetic, rounding errors will extinguish all information pertaining to singular modes associated with singular values that are small compared with $\left\|{\bm{A}}\right\|$ . (Roughly, if machine precision is $\mu$ , then all information associated with singular values smaller than $\mu^{1/(2q+1)}\left\|{\bm{A}}\right\|$ is lost.) This problem can easily be remedied by orthonormalizing the columns of the sample matrix between each application of $\bm{A}$ and $\bm{A}^{*}$ . The resulting scheme, summarized as Algorithm LABEL:alg:subspaceiteration, is algebraically equivalent to Algorithm LABEL:alg:poweriteration when executed in exact arithmetic . We recommend Algorithm LABEL:alg:subspaceiteration because its computational costs are similar to Algorithm LABEL:alg:poweriteration, even though the former is substantially more accurate in floating-point arithmetic.

6 An accelerated technique for general dense matrices

$\bm{D}$ is an $n\times n$ diagonal matrix whose entries are independent random variables uniformly distributed on the complex unit circle,

The test matrix (23) is just one choice among many possibilities. Other suggestions that appear in the literature include subsampled Hadamard transforms, chains of Givens rotations acting on randomly chosen coordinates, and many more. See and its bibliography. Empirically, we have found that the transform summarized in Remark 4.6 below performs very well in a variety of environments .

Among the structured random matrices that we have tried, one of the strongest candidates involves sequences of random Givens rotations . This matrix takes the form

where the prime symbol ′ indicates an independent realization of a random matrix. The matrices $\bm{R}$ , $\bm{F}$ , and $\bm{D}$ are defined after (23). The matrix $\bm{\Theta}$ is a chain of random Givens rotations:

Stage B: Construction of standard factorizations

The algorithms for Stage A described in §4 produce an orthonormal matrix $\bm{Q}$ whose range captures the action of an input matrix $\bm{A}$ :

where $\varepsilon$ is a computational tolerance. This section describes methods for approximating standard factorizations of $\bm{A}$ using the information in the basis $\bm{Q}$ .

To accomplish this task, we pursue the idea from §3.3.3 that any low-rank factorization $\bm{A}\approx\bm{CB}$ can be manipulated to produce a standard decomposition. When the bound (26) holds, the low-rank factors are simply $\bm{C}=\bm{Q}$ and $\bm{B}=\bm{Q}^{*}\bm{A}$ . The simplest scheme (§5.1) computes the factor $\bm{B}$ directly with a matrix–matrix product to ensure a minimal error in the final approximation. An alternative approach (§5.2) constructs factors $\bm{B}$ and $\bm{C}$ without forming any matrix–matrix product. The approach of §5.2 is often faster than the approach of §5.1 but typically results in larger errors. Both schemes can be streamlined for an Hermitian input matrix (§5.3) and a positive semidefinite input matrix (§5.4). Finally, we develop single-pass algorithms that exploit other information generated in Stage A to avoid revisiting the input matrix (§5.5).

Throughout this section, $\bm{A}$ denotes an $m\times n$ matrix, and $\bm{Q}$ is an $m\times k$ orthonormal matrix that verifies (26). For purposes of exposition, we concentrate on methods for constructing the partial SVD.

The relation (26) implies that $\left\|{\bm{A}-\bm{Q}\bm{B}}\right\|\leq\varepsilon$ , where $\bm{B}=\bm{Q}^{*}\bm{A}$ . Once we have computed $\bm{B}$ , we can produce any standard factorization using the methods of §3.3.3. Algorithm LABEL:alg:Atranspose illustrates how to build an approximate SVD.

The factors produced by Algorithm LABEL:alg:Atranspose satisfy

In other words, the approximation error does not degrade.

Algorithm LABEL:alg:Atranspose produces an approximate SVD with the same rank as the basis matrix $\bm{Q}$ . When the size of the basis exceeds the desired rank $k$ of the SVD, it may be preferable to retain only the dominant $k$ singular values and singular vectors. Equivalently, we replace the diagonal matrix $\bm{\Sigma}$ of computed singular values with the matrix $\bm{\Sigma}_{(k)}$ formed by zeroing out all but the largest $k$ entries of $\bm{\Sigma}$ . In the worst case, this truncation step can increase the approximation error by $\sigma_{k+1}$ ; see §9.4 for an analysis. Our numerical experience suggests that this error analysis is pessimistic, and the term $\sigma_{k+1}$ often does not appear in practice.

2 Postprocessing via row extraction

Given a matrix $\bm{Q}$ such that (26) holds, we can obtain a rank- $k$ factorization

where $\bm{B}$ is a $k\times n$ matrix consisting of $k$ rows extracted from $\bm{A}$ . The approximation (28) can be produced without computing any matrix–matrix products, which makes this approach to postprocessing very fast. The drawback comes because the error $\left\|{\bm{A}-\bm{XB}}\right\|$ is usually larger than the initial error $\left\|{\bm{A}-\bm{QQ}^{*}\bm{A}}\right\|$ , especially when the dimensions of $\bm{A}$ are large. See Remark 5.3 for more discussion.

To obtain the factorization (28), we simply construct the interpolative decomposition (§3.2.3) of the matrix $\bm{Q}$ :

The index set $J$ marks $k$ rows of $\bm{Q}$ that span the row space of $\bm{Q}$ , and $\bm{X}$ is an $m\times k$ matrix whose entries are bounded in magnitude by two and contains the $k\times k$ identity as a submatrix: $\bm{X}_{(J,\colon)}=\mathbf{I}_{k}$ . Combining (29) and (26), we reach

Since $\bm{X}_{(J,\colon)}=\mathbf{I}_{k}$ , equation (30) implies that $\bm{A}_{(J,\colon)}\approx\bm{Q}_{(J,\colon)}\bm{Q}^{*}\bm{A}$ . Therefore, (28) follows when we put $\bm{B}=\bm{A}_{(J,\colon)}$ .

Let $\bm{A}$ be an $m\times n$ matrix and let $\bm{Q}$ be an $m\times k$ matrix that satisfy (26). Suppose that $\bm{U}$ , $\bm{\Sigma}$ , and $\bm{V}$ are the matrices constructed by Algorithm LABEL:alg:extractrows. Then

The factors $\bm{U}$ , $\bm{\Sigma}$ , $\bm{V}$ constructed by the algorithm satisfy

Since $\widehat{\bm{A}}=\bm{XQ}_{(J,\colon)}\bm{Q}^{*}\bm{A}$ and since $\bm{X}_{(J,\colon)}=\mathbf{I}_{k}$ , it must be that $\widehat{\bm{A}}_{(J,\colon)}=\bm{Q}_{(J,\colon)}\bm{Q}^{*}\bm{A}$ . Consequently,

Inequality (26) ensures that ${\bigl{\|}{\bm{A}-\widehat{\bm{A}}}\bigr{\|}}\leq\varepsilon$ . Since $\bm{A}_{(J,\colon)}-\widehat{\bm{A}}_{(J,\colon)}$ is a submatrix of $\bm{A}-\widehat{\bm{A}}$ , we must also have ${\bigl{\|}{\bm{A}_{(J,\colon)}-\widehat{\bm{A}}_{(J,\colon)}}\bigr{\|}}\leq\varepsilon$ . Thus, (5.2) reduces to

The bound (31) follows from (34) after we observe that $\bm{X}$ contains a $k\times k$ identity matrix and that the entries of the remaining $(n-k)\times k$ submatrix are bounded in magnitude by two. ∎

To maintain a unified presentation, we have formulated all the postprocessing techniques so they take an orthonormal matrix $\bm{Q}$ as input. Recall that, in Stage A of our framework, we construct the matrix $\bm{Q}$ by orthonormalizing the columns of the sample matrix $\bm{Y}$ . With finite-precision arithmetic, it is preferable to adapt Algorithm LABEL:alg:extractrows to start directly from the sample matrix $\bm{Y}$ . To be precise, we modify Step 1 to compute $\bm{X}$ and $J$ so that $\bm{Y}=\bm{XY}_{(J,:)}$ . This revision is recommended even when $\bm{Q}$ is available from the adaptive rank determination of Algorithm LABEL:alg:adaptive2.

As the inequality (31) suggests, the factorization produced by Algorithm LABEL:alg:extractrows is potentially less accurate than the basis that it uses as input. This loss of accuracy is problematic when $\varepsilon$ is not so small or when $kn$ is large. In such cases, we recommend Algorithm LABEL:alg:Atranspose over Algorithm LABEL:alg:extractrows; the former is more costly, but it does not amplify the error, as shown in (27).

3 Postprocessing an Hermitian matrix

When $\bm{A}$ is Hermitian, the postprocessing becomes particularly elegant. In this case, the columns of $\bm{Q}$ form a good basis for both the column space and the row space of $\bm{A}$ so that we have $\bm{A}\approx\bm{QQ}^{*}\bm{A}\bm{QQ}^{*}$ . More precisely, when (26) is in force, we have

The last inequality relies on the facts that $\left\|{\bm{QQ}^{*}}\right\|=1$ and that

Since $\bm{A}\approx\bm{Q}\bigl{(}\bm{Q}^{*}\bm{A}\bm{Q}\bigr{)}\bm{Q}^{*}$ is a low-rank approximation of $\bm{A}$ , we can form any standard factorization using the techniques from §3.3.3.

4 Postprocessing a positive semidefinite matrix

When the input matrix $\bm{A}$ is positive semidefinite, the Nyström method can be used to improve the quality of standard factorizations at almost no additional cost; see and its bibliography. To describe the main idea, we first recall that the direct method presented in §5.3 manipulates the approximate rank- $k$ factorization

In contrast, the Nyström scheme builds a more sophisticated rank- $k$ approximation, namely

where $\bm{F}$ is an approximate Cholesky factor of $\bm{A}$ with dimension $n\times k$ . To compute the factor $\bm{F}$ numerically, first form the matrices $\bm{B}_{1}=\bm{AQ}$ and $\bm{B}_{2}=\bm{Q}^{*}\bm{B}_{1}$ . Then decompose the psd matrix $\bm{B}_{2}=\bm{C}^{*}\bm{C}$ into its Cholesky factors. Finally compute the factor $\bm{F}=\bm{B}_{1}\bm{C}^{-1}$ by performing a triangular solve. The low-rank factorization (37) can be converted to a standard decomposition using the techniques from §3.3.3.

The literature contains an explicit expression [48, Lem. 4] for the approximation error in (37). This result implies that, in the spectral norm, the Nyström approximation error never exceeds $\left\|{\bm{A}-\bm{QQ}^{*}\bm{A}}\right\|$ , and it is often substantially smaller. We omit a detailed discussion.

For an example of the Nyström technique, consider Algorithm LABEL:alg:nystrom, which computes an approximate eigenvalue decomposition of a positive semidefinite matrix. This method should be compared with the scheme for Hermitian matrices, Algorithm LABEL:alg:hermeig. In both cases, the dominant cost occurs when we form $\bm{AQ}$ , so the two procedures have roughly the same running time. On the other hand, Algorithm LABEL:alg:nystrom is typically much more accurate than Algorithm LABEL:alg:hermeig. In a sense, we are exploiting the fact that $\bm{A}$ is positive semidefinite to take one step of subspace iteration (Algorithm LABEL:alg:subspaceiteration) for free.

5 Single-pass algorithms

The techniques described in §§5.1–5.4 all require us to revisit the input matrix. This may not be feasible in environments where the matrix is too large to be stored. In this section, we develop a method that requires just one pass over the matrix to construct not only an approximate basis but also a complete factorization. Similar techniques appear in and .

For motivation, we begin with the case where $\bm{A}$ is Hermitian. Let us recall the proto-algorithm from §1.3.3: Draw a random test matrix $\bm{\Omega}$ ; form the sample matrix $\bm{Y}=\bm{A\Omega}$ ; then construct a basis $\bm{Q}$ for the range of $\bm{Y}$ . It turns out that the matrices $\bm{\Omega}$ , $\bm{Y}$ , and $\bm{Q}$ contain all the information we need to approximate $\bm{A}$ .

To see why, define the (currently unknown) matrix $\bm{B}$ via $\bm{B}=\bm{Q}^{*}\bm{AQ}$ . Postmultiplying the definition by $\bm{Q}^{*}\bm{\Omega}$ , we obtain the identity $\bm{BQ}^{*}\bm{\Omega}=\bm{Q}^{*}\bm{AQQ}^{*}\bm{\Omega}$ . The relationships $\bm{AQQ}^{*}\approx\bm{A}$ and $\bm{A\Omega}=\bm{Y}$ show that $\bm{B}$ must satisfy

When $\bm{A}$ is not Hermitian, it is still possible to devise single-pass algorithms, but we must modify the initial Stage A of the approximation framework to simultaneously construct bases for the ranges of $\bm{A}$ and $\bm{A}^{*}$ :

Generate random matrices $\bm{\Omega}$ and $\widetilde{\bm{\Omega}}$ .

Compute $\bm{Y}=\bm{A\Omega}$ and $\widetilde{\bm{Y}}=\bm{A}^{*}\widetilde{\bm{\Omega}}$ in a single pass over $\bm{A}$ .

Compute QR factorizations $\bm{Y}=\bm{QR}$ and $\widetilde{\bm{Y}}=\widetilde{\bm{Q}}\widetilde{\bm{R}}$ .

This procedure results in matrices $\bm{Q}$ and $\widetilde{\bm{Q}}$ such that $\bm{A}\approx\bm{QQ}^{*}\bm{A}\widetilde{\bm{Q}}\widetilde{\bm{Q}}^{*}$ . The reduced matrix we must approximate is $\bm{B}=\bm{Q}^{*}\bm{A}\widetilde{\bm{Q}}$ . In analogy with (38), we find that

An analogous calculation shows that $\bm{B}$ should also satisfy

Now, the reduced matrix $\bm{B}_{\rm approx}$ can be determined by finding a minimum-residual solution to the system of relations (39) and (40).

The single-pass approaches described in this section can degrade the approximation error in the final decomposition significantly. To explain the issue, we focus on the Hermitian case. It turns out that the coefficient matrix $\bm{Q}^{*}\bm{\Omega}$ in the linear system (38) is usually ill-conditioned. In a worst-case scenario, the error $\left\|{\bm{A}-\bm{U}\bm{\Lambda}\bm{U}^{*}}\right\|$ in the factorization produced by Algorithm LABEL:alg:postsym could be larger than the error resulting from the two-pass method of Section 5.3 by a factor of $1/\tau_{\rm min}$ , where $\tau_{\rm min}$ is the minimal singular value of the matrix $\bm{Q}^{*}\bm{\Omega}$ .

The situation can be improved by oversampling. Suppose that we seek a rank- $k$ approximate eigenvalue decomposition. Pick a small oversampling parameter $p$ . Draw an $n\times(k+p)$ random matrix $\bm{\Omega}$ , and form the sample matrix $\bm{Y}=\bm{A}\bm{\Omega}$ . Let $\bm{Q}$ denote the $n\times k$ matrix formed by the $k$ leading left singular vectors of $\bm{Y}$ . Now, the linear system (38) has a coefficient matrix $\bm{Q}^{*}\bm{\Omega}$ of size $k\times(k+p)$ , so it is overdetermined. An approximate solution of this system yields a $k\times k$ matrix $\bm{B}$ .

Computational costs

So far, we have postponed a detailed discussion of the computational cost of randomized matrix approximation algorithms because it is necessary to account for both the first stage, where we compute an approximate basis for the range (§4), and the second stage, where we postprocess the basis to complete the factorization (§5). We are now prepared to compare the cost of the two-stage scheme with the cost of traditional techniques.

Choosing an appropriate algorithm, whether classical or randomized, requires us to consider the properties of the input matrix. To draw a nuanced picture, we discuss three representative computational environments in §6.1–6.3. We close with some comments on parallel implementations in §6.4.

For concreteness, we focus on the problem of computing an approximate SVD of an $m\times n$ matrix $\bm{A}$ with numerical rank $k$ . The costs for other factorizations are similar.

As a rule of thumb, the approximation error of this procedure satisfies

where $\sigma_{k+1}$ is the $(k+1)$ th singular value of $\bm{A}$ . The estimate (41), which follows from Theorem 25 and Lemma 4, reflects the worst-case scenario; actual errors are usually smaller.

This algorithm should be compared with modern deterministic techniques, such as rank-revealing QR followed by postprocessing (§3.3.2) which typically require

operations to achieve a comparable error.

In this setting, the randomized algorithm can be several times faster than classical techniques even for problems of moderate size, say $m,n\sim 10^{3}$ and $k\sim 10^{2}$ . See §7.4 for numerical evidence.

2 Matrices for which matrix–vector products can be rapidly evaluated

As a rule of thumb, the approximation error of this procedure satisfies

The estimate (43) follows from Corollary 22 and the discussion in §5.1. Actual errors are usually smaller.

When the singular spectrum of $\bm{A}$ decays slowly, we can incorporate $q$ iterations of the power method (Algorithm LABEL:alg:poweriteration) to obtain superior solutions to the fixed-rank problem. The computational cost increases to, cf. (42),

The estimate (45) takes into account the discussion in §10.4. The power scheme can also be adapted for the fixed-precision problem (§4.5).

In this setting, the classical prescription for obtaining a partial SVD is some variation of a Krylov-subspace method; see §3.3.4. These methods exhibit great diversity, so it is hard to specify a “typical” computational cost. To a first approximation, it is fair to say that in order to obtain an approximate SVD of rank $k$ , the cost of a numerically stable implementation of a Krylov method is no less than the cost (42) with $p$ set to zero. At this price, the Krylov method often obtains better accuracy than the basic randomized method obtained by combining Algorithms LABEL:alg:basic and LABEL:alg:Atranspose, especially for matrices whose singular values decay slowly. On the other hand, the randomized schemes are inherently more robust and allow much more freedom in organizing the computation to suit a particular application or a particular hardware architecture. The latter point is in practice of crucial importance because it is usually much faster to apply a matrix to $k$ vectors simultaneously than it is to execute $k$ matrix–vector multiplications consecutively. In practice, blocking and parallelism can lead to enough gain that a few steps of the power method (Algorithm LABEL:alg:poweriteration) can be performed more quickly than $k$ steps of a Krylov method.

Any comparison between randomized sampling schemes and Krylov variants becomes complicated because of the fact that “basic” Krylov schemes such as Lanczos [61, p. 473] or Arnoldi [61, p. 499] are inherently unstable. To obtain numerical robustness, we must incorporate sophisticated modifications such as restarts, reorthogonalization procedures, etc. Constructing a high-quality implementation is sufficiently hard that the authors of a popular book on “numerical recipes” qualify their treatment of spectral computations as follows [109, p. 567]:

You have probably gathered by now that the solution of eigensystems is a fairly complicated business. It is. It is one of the few subjects covered in this book for which we do not recommend that you avoid canned routines. On the contrary, the purpose of this chapter is precisely to give you some appreciation of what is going on inside such canned routines, so that you can make intelligent choices about using them, and intelligent diagnoses when something goes wrong.

Randomized sampling does not eliminate the difficulties referred to in this quotation; however it reduces the task of computing a partial spectral decomposition of a very large matrix to the task of computing a full decomposition of a small dense matrix. (For example, in Algorithm LABEL:alg:Atranspose, the input matrix $\bm{A}$ is large and $\bm{B}$ is small.) The latter task is much better understood and is eminently suitable for using canned routines. Random sampling schemes interact with the large matrix only through matrix–matrix products, which can easily be implemented by a user in a manner appropriate to the application and to the available hardware.

The comparison is further complicated by the fact that there is significant overlap between the two sets of ideas. Algorithm LABEL:alg:poweriteration is conceptually similar to a “block Lanczos method” [61, p. 485] with a random starting matrix. Indeed, we believe that there are significant opportunities for cross-fertilization in this area. Hybrid schemes that combine the best ideas from both fields may perform very well.

3 General matrices stored in slow memory or streamed

The traditional metric for numerical algorithms is the number of floating-point operations they require. When the data does not fit in fast memory, however, the computational time is often dominated by the cost of memory access. In this setting, a more appropriate measure of algorithmic performance is pass-efficiency, which counts how many times the data needs to be cycled through fast memory. Flop counts become largely irrelevant.

All the classical matrix factorization techniques that we discuss in §3.2—including dense SVD, rank-revealing QR, Krylov methods, and so forth—require at least $k$ passes over the the matrix, which is prohibitively expensive for huge data matrices. A desire to reduce the pass count of matrix approximation algorithms served as one of the early motivations for developing randomized schemes . Detailed recent work appears in .

For many matrices, randomized techniques can produce an accurate approximation using just one pass over the data. For Hermitian matrices, we obtain a single-pass algorithm by combining Algorithm LABEL:alg:basic, which constructs an approximate basis, with Algorithm LABEL:alg:postsym, which produces an eigenvalue decomposition without any additional access to the matrix. Section 5.5 describes the analogous technique for general matrices.

For the huge matrices that arise in applications such as data mining, it is common that the singular spectrum decays slowly. Relevant applications include image processing (see §§7.2–7.3 for numerical examples), statistical data analysis, and network monitoring. To compute approximate factorizations in these environments, it is crucial to enhance the accuracy of the randomized approach using the power scheme, Algorithm LABEL:alg:poweriteration, or some other device. This approach increases the pass count somewhat, but in our experience it is very rare that more than five passes are required.

4 Gains from parallelization

As mentioned in §§6.2–6.3, randomized methods often outperform classical techniques not because they involve fewer floating-point operations but rather because they allow us to reorganize the calculations to exploit the matrix properties and the computer architecture more fully. In addition, these methods are well suited for parallel implementation. For example, in Algorithm LABEL:alg:basic, the computational bottleneck is the evaluation of the matrix product $\bm{A\Omega}$ , which is embarrassingly parallelizable.

Numerical examples

By this time, the reader has surely formulated a pointed question: Do these randomized matrix approximation algorithms actually work in practice? In this section, we attempt to address this concern by illustrating how the algorithms perform on a diverse collection of test cases.

Section 7.1 starts with two examples from the physical sciences involving discrete approximations to operators with exponentially decaying spectra. Sections 7.2 and 7.3 continue with two examples of matrices arising in “data mining.” These are large matrices whose singular spectra decay slowly; one is sparse and fits in RAM, one is dense and is stored out-of-core. Finally, §7.4 investigates the performance of randomized methods based on structured random matrices.

Sections 7.1–7.3 focus on the algorithms for Stage A that we presented in §4 because we wish to isolate the performance of the randomized step.

Computational examples illustrating truly large data matrices have been reported elsewhere, for instance in .

We first illustrate the behavior of the adaptive range approximation method, Algorithm LABEL:alg:adaptive2. We apply it to two matrices associated with the numerical analysis of differential and integral operators. The matrices in question have rapidly decaying singular values and our intent is to demonstrate that in this environment, the approximation error of a bare-bones randomized method such as Algorithm LABEL:alg:adaptive2 is very close to the minimal error achievable by any method. We observe that the approximation error of a randomized method is itself a random variable (it is a function of the random matrix $\bm{\Omega}$ ) so what we need to demonstrate is not only that the error is small in a typical realization, but also that it clusters tightly around the mean value.

We first consider a $200\times 200$ matrix $\bm{A}$ that results from discretizing the following single-layer operator associated with the Laplace equation:

Note that any values less than $10^{-15}$ should be considered numerical artifacts.

We next consider a matrix $\bm{B}$ which is defined implicitly in the sense that we cannot access its elements directly; we can only evaluate the map $\bm{x}\mapsto\bm{B}\bm{x}$ for a given vector $\bm{x}$ . To be precise, $\bm{B}$ represents a transfer matrix for a network of resistors like the one shown in Figure 1(b). The vector $\bm{x}$ represents a set of electric potentials specified on the red nodes in the figure. These potentials induce a unique equilibrium field on the network in which the potential of each black and blue node is the average of the potentials of its three or four neighbors. The vector $\bm{B}\bm{x}$ is then the restriction of the potential to the blue exterior nodes. Given a vector $\bm{x}$ , the vector $\bm{B}\bm{x}$ can be obtained by solving a large sparse linear system whose coefficient matrix is the classical five-point stencil approximating the 2D Laplace operator.

We applied Algorithm LABEL:alg:adaptive2 to the $1596\times 532$ matrix $\bm{B}$ associated with a lattice in which there were $532$ nodes (red) on the “inner ring” and $1596$ nodes on the (blue) “outer ring.” Each application of $\bm{B}$ to a vector requires the solution of a sparse linear system of size roughly $140\,000\times 140\,000$ . We implemented the scheme in Matlab using the “backslash” operator for the linear solve. The results of a typical trial appear in Figure 4. Qualitatively, the performance matches the results in Figure 3.

2 A large, sparse, noisy matrix arising in image processing

Our next example involves a matrix that arises in image processing. A recent line of work uses information about the local geometry of an image to develop promising new algorithms for standard tasks, such as denoising, inpainting, and so forth. These methods are based on approximating a graph Laplacian associated with the image. The dominant eigenvectors of this matrix provide “coordinates” that help us smooth out noisy image patches .

where the parameter $\sigma=50$ controls the level of sensitivity. We obtain a sparse weight matrix $\bm{W}$ by zeroing out all entries in $\widetilde{\bm{W}}$ except the seven largest ones in each row. The object is then to construct the low frequency eigenvectors of the graph Laplacian matrix

where $\bm{D}$ is the diagonal matrix with entries $d_{ii}=\sum_{j}w_{ij}$ . These are the eigenvectors associated with the dominant eigenvalues of the auxiliary matrix $\bm{A}=\bm{D}^{-1/2}\bm{W}\bm{D}^{-1/2}$ .

3 Eigenfaces

Our next example involves a large, dense matrix derived from the FERET databank of face images . A simple method for performing face recognition is to identify the principal directions of the image data, which are called eigenfaces. Each of the original photographs can be summarized by its components along these principal directions. To identify the subject in a new picture, we compute its decomposition in this basis and use a classification technique, such as nearest neighbors, to select the closest image in the database .

We construct a data matrix $\bm{A}$ as follows: The FERET database contains $7254$ images, and each $384\times 256$ image contains $98\,304$ pixels. First, we build a $98\,304\times 7254$ matrix $\widetilde{\bm{A}}$ whose columns are the images. We form $\bm{A}$ by centering each column of $\widetilde{\bm{A}}$ and scaling it to unit norm, so that the images are roughly comparable. The eigenfaces are the dominant left singular vectors of this matrix.

Our goal then is to compute an approximate SVD of the matrix $\bm{A}$ . Represented as an array of double-precision real numbers, $\bm{A}$ would require $5.4$ GB of storage, which does not fit within the fast memory of many machines. It is possible to compress the database down to at $57$ MB or less (in JPEG format), but then the data would have to be uncompressed with each sweep over the matrix. Furthermore, the matrix $\bm{A}$ has slowly decaying singular values, so we need to use the power scheme, Algorithm LABEL:alg:poweriteration, to capture the range of the matrix accurately.

Figure 6 describes the behavior of the power scheme, which is similar to its performance for the graph Laplacian in §7.2. When the exponent $q=0$ , the approximation of the data matrix is very poor, but it improves quickly as $q$ increases. Likewise, the estimate for the spectrum of $\bm{A}$ appears to converge rapidly; the largest singular values are already quite accurate when $q=1$ . We see essentially no improvement in the estimates after the first 3–5 passes over the matrix.

4 Performance of structured random matrices

Our final set of experiments illustrates that the structured random matrices described in §4.6 lead to matrix approximation algorithms that are both fast and accurate.

Second, we investigate how the choice of random test matrix influences the error in approximating an input matrix. For these experiments, we return to the $200\times 200$ matrix $\bm{A}$ defined in Section 7.1. Consider variations of Algorithm LABEL:alg:basic obtained when the random test matrix $\bm{\Omega}$ is drawn from the following four distributions:

Intuitively, we expect that Ortho should provide the best performance.

The running times reported in Table 1 and in Figure 7 depend strongly on both the computer hardware and the coding of the algorithms. The experiments reported here were performed on a standard office desktop with a 3.2 GHz Pentium IV processor and 2 GB of RAM. The algorithms were implemented in Fortran 90 and compiled with the Lahey compiler. The Lahey versions of BLAS and LAPACK were used to accelerate all matrix–matrix multiplications, as well as the SVD computations in Algorithms LABEL:alg:Atranspose and LABEL:alg:extractrows. We used the code for the modified SRFT (25) provided in the publicly available software package id $\underline{\mbox{ }}$ dist .

This part of the paper, §§8–11, provides a detailed analysis of randomized sampling schemes for constructing an approximate basis for the range of a matrix, the task we refer to as Stage A in the framework of §1.2. More precisely, we assess the quality of the basis $\bm{Q}$ that the proto-algorithm of §1.3 produces by establishing rigorous bounds for the approximation error

where $\left|\!\left|\!\left|{\cdot}\right|\!\right|\!\right|$ denotes either the spectral norm or the Frobenius norm. The difficulty in developing these bounds is that the matrix $\bm{Q}$ is random, and its distribution is a complicated nonlinear function of the input matrix $\bm{A}$ and the random test matrix $\bm{\Omega}$ . Naturally, any estimate for the approximation error must depend on the properties of the input matrix and the distribution of the test matrix.

To address these challenges, we split the argument into two pieces. The first part exploits techniques from linear algebra to deliver a generic error bound that depends on the interaction between the test matrix $\bm{\Omega}$ and the right singular vectors of the input matrix $\bm{A}$ , as well as the tail singular values of $\bm{A}$ . In the second part of the argument, we take into account the distribution of the random matrix to estimate the error for specific instantiations of the proto-algorithm. This bipartite proof is common in the literature on randomized linear algebra, but our argument is most similar in spirit to .

Section 8 surveys the basic linear algebraic tools we need. Section 9 uses these methods to derive a generic error bound. Afterward, we specialize these results to the case where the test matrix is Gaussian (§10) and the case where the test matrix is a subsampled random Fourier transform (§11).

Theoretical preliminaries

We proceed with some additional background from linear algebra. Section 8.1 sets out properties of positive-semidefinite matrices, and §8.2 offers some results for orthogonal projectors. Standard references for this material include .

An Hermitian matrix $\bm{M}$ is positive semidefinite (briefly, psd) when $\bm{u}^{*}\bm{M}\bm{u}\geq 0$ for all $\bm{u}\neq\bm{0}$ . If the inequalities are strict, $\bm{M}$ is positive definite (briefly, pd). The psd matrices form a convex cone, which induces a partial ordering on the linear space of Hermitian matrices: $\bm{M}\preccurlyeq\bm{N}$ if and only if $\bm{N}-\bm{M}$ is psd. This ordering allows us to write $\bm{M}\succcurlyeq\bm{0}$ to indicate that the matrix $\bm{M}$ is psd.

Alternatively, we can define a psd (resp., pd) matrix as an Hermitian matrix with nonnegative (resp., positive) eigenvalues. In particular, each psd matrix is diagonalizable, and the inverse of a pd matrix is also pd. The spectral norm of a psd matrix $\bm{M}$ has the variational characterization

according to the Rayleigh–Ritz theorem [72, Thm. 4.2.2]. It follows that

A fundamental fact is that conjugation preserves the psd property.

Suppose that $\bm{M}\succcurlyeq\bm{0}$ . For every $\bm{A}$ , the matrix $\bm{A}^{*}\bm{M}\bm{A}\succcurlyeq\bm{0}$ . In particular,

Our argument invokes the conjugation rule repeatedly. As a first application, we establish a perturbation bound for the matrix inverse near the identity matrix.

Suppose that $\bm{M}\succcurlyeq\bm{0}$ . Then

Define $\bm{R}=\bm{M}^{1/2}$ , the psd square root of $\bm{M}$ promised by [72, Thm. 7.2.6]. We have the chain of relations

The first equality can be verified algebraically. The second holds because rational functions of a diagonalizable matrix, such as $\bm{R}$ , commute. The last relation follows from the conjugation rule because $(\mathbf{I}+\bm{R}^{2})^{-1}\preccurlyeq\mathbf{I}$ . ∎

Next, we present a generalization of the fact that the spectral norm of a psd matrix is controlled by its trace.

We have $\left\|{\bm{M}}\right\|\leq\left\|{\bm{A}}\right\|+\left\|{\bm{C}}\right\|$ for each partitioned psd matrix

The variational characterization (48) of the spectral norm implies that

The block generalization of Hadamard’s psd criterion [72, Thm. 7.7.7] states that $\left\|{\bm{B}}\right\|^{2}\leq\left\|{\bm{A}}\right\|\left\|{\bm{C}}\right\|$ . Thus,

2 Orthogonal projectors

An orthogonal projector is an Hermitian matrix $\bm{P}$ that satisfies the polynomial $\bm{P}^{2}=\bm{P}$ . This identity implies $\bm{0}\preccurlyeq\bm{P}\preccurlyeq\mathbf{I}$ . An orthogonal projector is completely determined by its range. For a given matrix $\bm{M}$ , we write $\bm{P}_{\bm{M}}$ for the unique orthogonal projector with $\operatorname{range}(\bm{P}_{\bm{M}})=\operatorname{range}(\bm{M})$ . When $\bm{M}$ has full column rank, we can express this projector explicitly:

The orthogonal projector onto the complementary subspace, $\operatorname{range}(\bm{P})^{\perp}$ , is the matrix $\mathbf{I}-\bm{P}$ . Our argument hinges on several other facts about orthogonal projectors.

Suppose $\bm{U}$ is unitary. Then $\bm{U}^{*}\bm{P}_{\bm{M}}\bm{U}=\bm{P}_{\bm{U}^{*}\bm{M}}$ .

Abbreviate $\bm{P}=\bm{U}^{*}\bm{P}_{\bm{M}}\bm{U}$ . It is clear that $\bm{P}$ is an orthogonal projector since it is Hermitian and $\bm{P}^{2}=\bm{P}$ . Evidently,

Since the range determines the orthogonal projector, we conclude $\bm{P}=\bm{P}_{\bm{U}^{*}\bm{M}}$ . ∎

Suppose $\operatorname{range}(\bm{N})\subset\operatorname{range}(\bm{M})$ . Then, for each matrix $\bm{A}$ , it holds that $\left\|{\bm{P}_{\bm{N}}\bm{A}}\right\|\leq\left\|{\bm{P}_{\bm{M}}\bm{A}}\right\|$ and that $\left\|{(\mathbf{I}-\bm{P}_{\bm{M}})\bm{A}}\right\|\leq\left\|{(\mathbf{I}-\bm{P}_{\bm{N}})\bm{A}}\right\|$ .

The projector $\bm{P}_{\bm{N}}\preccurlyeq\mathbf{I}$ , so the conjugation rule yields $\bm{P}_{\bm{M}}\bm{P}_{\bm{N}}\bm{P}_{\bm{M}}\preccurlyeq\bm{P}_{\bm{M}}$ . The hypothesis $\operatorname{range}(\bm{N})\subset\operatorname{range}(\bm{M})$ implies that $\bm{P}_{\bm{M}}\bm{P}_{\bm{N}}=\bm{P}_{\bm{N}}$ , which results in

In summary, $\bm{P}_{\bm{N}}\preccurlyeq\bm{P}_{\bm{M}}$ . The conjugation rule shows that $\bm{A}^{*}\bm{P}_{\bm{N}}\bm{A}\preccurlyeq\bm{A}^{*}\bm{P}_{\bm{M}}\bm{A}$ . We conclude from (49) that

The second statement follows from the first by taking orthogonal complements. ∎

Finally, we need a generalization of the scalar inequality $\left|{px}\right|^{q}\leq\left|{p}\right|\left|{x}\right|^{q}$ , which holds when $\left|{p}\right|\leq 1$ and $q\geq 1$ .

Let $\bm{P}$ be an orthogonal projector, and let $\bm{M}$ be a matrix. For each positive number $q$ ,

Suppose that $\bm{R}$ is an orthogonal projector, $\bm{D}$ is a nonnegative diagonal matrix, and $t\geq 1$ . We claim that

Granted this inequality, we quickly complete the proof. Using an SVD $\bm{M}=\bm{U\Sigma V}^{*}$ , we compute

We have used the unitary invariance of the spectral norm in the second and fourth relations. The inequality (52) applies because $\bm{U}^{*}\bm{PU}$ is an orthogonal projector. Take a square root to finish the argument.

Now, we turn to the claim (52). This relation follows immediately from [11, Thm. IX.2.10], but we offer a direct argument based on more elementary considerations. Let $\bm{x}$ be a unit vector at which

We must have $\bm{R}\bm{x}=\bm{x}$ . Otherwise, $\left\|{\bm{R}\bm{x}}\right\|<1$ because $\bm{R}$ is an orthogonal projector, which implies that the unit vector $\bm{y}=\bm{R}\bm{x}/\left\|{\bm{R}\bm{x}}\right\|$ verifies

Writing $x_{j}$ for the entries of $\bm{x}$ and $d_{j}$ for the diagonal entries of $\bm{D}$ , we find that

The inequality is Jensen’s, which applies because $\sum x_{j}^{2}=1$ and the function $z\mapsto\left|{z}\right|^{t}$ is convex for $t\geq 1$ . ∎

Error bounds via linear algebra

We are now prepared to develop a deterministic error analysis for the proto-algorithm described in §1.3. To begin, we must introduce some notation. Afterward, we establish the key error bound, which strengthens a result from the literature [17, Lem. 4.2]. Finally, we explain why the power method can be used to improve the performance of the proto-algorithm.

Let $\bm{A}$ be an $m\times n$ matrix that has a singular value decomposition $\bm{A}=\bm{U\Sigma V}^{*}$ , as described in Section 3.2.2. Roughly speaking, the proto-algorithm tries to approximate the subspace spanned by the first $k$ left singular vectors, where $k$ is now a fixed number. To perform the analysis, it is appropriate to partition the singular value decomposition as follows.

The matrices $\bm{\Sigma}_{1}$ and $\bm{\Sigma}_{2}$ are square. We will see that the left unitary factor $\bm{U}$ does not play a significant role in the analysis.

The error bound for the proto-algorithm depends critically on the properties of the matrices $\bm{\Omega}_{1}$ and $\bm{\Omega}_{2}$ . With this notation, the sample matrix $\bm{Y}$ can be expressed as

It is a useful intuition that the block $\bm{\Sigma}_{1}\bm{\Omega}_{1}$ in (9.1) reflects the gross behavior of $\bm{A}$ , while the block $\bm{\Sigma}_{2}\bm{\Omega}_{2}$ represents a perturbation.

2 A deterministic error bound for the proto-algorithm

The proto-algorithm constructs an orthonormal basis $\bm{Q}$ for the range of the sample matrix $\bm{Y}$ , and our goal is to quantify how well this basis captures the action of the input $\bm{A}$ . Since $\bm{QQ}^{*}=\bm{P}_{\bm{Y}}$ , the challenge is to obtain bounds on the approximation error

The following theorem shows that the behavior of the proto-algorithm depends on the interaction between the test matrix and the right singular vectors of the input matrix, as well as the singular spectrum of the input matrix.

Let $\bm{A}$ be an $m\times n$ matrix with singular value decomposition $\bm{A}=\bm{U\Sigma V}^{*}$ , and fix $k\geq 0$ . Choose a test matrix $\bm{\Omega}$ , and construct the sample matrix $\bm{Y}=\bm{A\Omega}$ . Partition $\bm{\Sigma}$ as specified in (53), and define $\bm{\Omega}_{1}$ and $\bm{\Omega}_{2}$ via (54). Assuming that $\bm{\Omega}_{1}$ has full row rank, the approximation error satisfies

where $\left|\!\left|\!\left|{\cdot}\right|\!\right|\!\right|$ denotes either the spectral norm or the Frobenius norm.

Theorem 11 sharpens the result [17, Lem. 2], which lacks the squares present in (55). This refinement yields slightly better error estimates than the earlier bound, and it has consequences for the probabilistic behavior of the error when the test matrix $\bm{\Omega}$ is random. The proof here is different in spirit from the earlier analysis; our argument is inspired by the perturbation theory of orthogonal projectors .

We establish the bound for the spectral-norm error. The bound for the Frobenius-norm error follows from an analogous argument that is slightly easier.

Let us begin with some preliminary simplifications. First, we argue that the left unitary factor $\bm{U}$ plays no essential role in the argument. In effect, we execute the proof for an auxiliary input matrix $\widetilde{\bm{A}}$ and an associated sample matrix $\widetilde{\bm{Y}}$ defined by

Owing to the unitary invariance of the spectral norm and to Proposition 8, we have the identity

In view of (57), it suffices to prove that

Second, we assume that the number $k$ is chosen so the diagonal entries of $\bm{\Sigma}_{1}$ are strictly positive. Suppose not. Then $\bm{\Sigma}_{2}$ is zero because of the ordering of the singular values. As a consequence,

This calculation uses the decompositions presented in (56), as well as the fact that both $\bm{V}_{1}^{*}$ and $\bm{\Omega}_{1}$ have full row rank. We conclude that

so the error bound (58) holds trivially. (In fact, both sides are zero.)

The main argument is based on ideas from perturbation theory. To illustrate the concept, we start with a matrix related to $\widetilde{\bm{Y}}$ :

The matrix $\bm{W}$ has the same range as a related matrix formed by “flattening out” the spectrum of the top block. Indeed, since $\bm{\Sigma}_{1}\bm{\Omega}_{1}$ has full row rank,

The matrix on the right-hand side has full column rank, so it is legal to apply the formula (50) for an orthogonal projector, which immediately yields

In words, the range of $\bm{W}$ aligns with the first $k$ coordinates, which span the same subspace as the first $k$ left singular vectors of the auxiliary input matrix $\widetilde{\bm{A}}$ . Therefore, $\operatorname{range}(\bm{W})$ captures the action of $\widetilde{\bm{A}}$ , which is what we wanted from $\operatorname{range}(\widetilde{\bm{Y}})$ .

We treat the auxiliary sample matrix $\widetilde{\bm{Y}}$ as a perturbation of $\bm{W}$ , and we hope that their ranges are close to each other. To make the comparison rigorous, let us emulate the arguments outlined in the last paragraph. Referring to the display (56), we flatten out the top block of $\widetilde{\bm{Y}}$ to obtain the matrix

Let us return to the error bound (58). The construction (60) ensures that $\operatorname{range}(\bm{Z})\subset\operatorname{range}(\widetilde{\bm{Y}})$ , so Proposition 9 implies that the error satisfies

The last identity follows from the definition $\widetilde{\bm{A}}=\bm{\Sigma}\bm{V}^{*}$ and the unitary invariance of the spectral norm. Therefore, we can complete the proof of (58) by producing a suitable bound for the right-hand side of (61).

To continue, we need a detailed representation of the projector $\mathbf{I}-\bm{P}_{\bm{Z}}$ . The construction (60) ensures that $\bm{Z}$ has full column rank, so we can apply the formula (50) for an orthogonal projector to see that

Expanding this expression, we determine that the complementary projector satisfies

The partitioning here conforms with the partitioning of $\bm{\Sigma}$ . When we conjugate the matrix by $\bm{\Sigma}$ , copies of $\bm{\Sigma}_{1}^{-1}$ , presently hidden in the top-left block, will cancel to happy effect.

The latter point may not seem obvious, owing to the complicated form of (62). In reality, the block matrix is less fearsome than it looks. Proposition 6, on the perturbation of inverses, shows that the top-left block verifies

because the conjugation rule guarantees that $\bm{F}(\mathbf{I}+\bm{F}^{*}\bm{F})^{-1}\bm{F}^{*}\succcurlyeq\bm{0}$ . We abbreviate the off-diagonal blocks with the symbol $\bm{B}=-(\mathbf{I}+\bm{F}^{*}\bm{F})^{-1}\bm{F}^{*}$ . In summary,

This relation exposes the key structural properties of the projector. Compare this relation with the expression (59) for the “ideal” projector $\mathbf{I}-\bm{P}_{\bm{W}}$ .

Moving toward the estimate required by (61), we conjugate the last relation by $\bm{\Sigma}$ to obtain

The conjugation rule demonstrates that the matrix on the left-hand side is psd, so the matrix on the right-hand side is too. Proposition 7 results in the norm bound

Recall that $\bm{F}=\bm{\Sigma}_{2}\bm{\Omega}_{2}\bm{\Omega}_{1}^{\dagger}\bm{\Sigma}_{1}^{-1}$ , so the factor $\bm{\Sigma}_{1}$ cancels neatly. Therefore,

Finally, introduce the latter inequality into (61) to complete the proof. ∎

3 Analysis of the power scheme

Theorem 11 suggests that the performance of the proto-algorithm depends strongly on the relationship between the large singular values of $\bm{A}$ listed in $\bm{\Sigma}_{1}$ and the small singular values listed in $\bm{\Sigma}_{2}$ . When a substantial proportion of the mass of $\bm{A}$ appears in the small singular values, the constructed basis $\bm{Q}$ may have low accuracy. Conversely, when the large singular values dominate, it is much easier to identify a good low-rank basis.

To improve the performance of the proto-algorithm, we can run it with a closely related input matrix whose singular values decay more rapidly . Fix a positive integer $q$ , and set

We apply the proto-algorithm to $\bm{B}$ , which generates a sample matrix $\bm{Z}=\bm{B\Omega}$ and constructs a basis $\bm{Q}$ for the range of $\bm{Z}$ . Section 4.5 elaborates on the implementation details, and describes a reformulation that sometimes improves the accuracy when the scheme is executed in finite-precision arithmetic. The following result describes how well we can approximate the original matrix $\bm{A}$ within the range of $\bm{Z}$ .

as a direct consequence of Proposition 10. ∎

Let us illustrate how the power scheme interacts with the main error bound (55). Let $\sigma_{k+1}$ denote the $(k+1)$ th singular value of $\bm{A}$ . First, suppose we approximate $\bm{A}$ in the range of the sample matrix $\bm{Y}=\bm{A\Omega}$ . Since $\left\|{\bm{\Sigma}_{2}}\right\|=\sigma_{k+1}$ , Theorem 11 implies that

Now, define $\bm{B}=(\bm{AA}^{*})^{q}\bm{A}$ , and suppose we approximate $\bm{A}$ within the range of the sample matrix $\bm{Z}=\bm{B\Omega}$ . Together, Theorem 12 and Theorem 11 imply that

because $\sigma_{k+1}^{2q+1}$ is the $(k+1)$ th singular value of $\bm{B}$ . In effect, the power scheme drives down the suboptimality of the bound (63) exponentially fast as the power $q$ increases. In principle, we can make the extra factor as close to one as we like, although this increases the cost of the algorithm.

4 Analysis of truncated SVD

Finally, let us study the truncated SVD described in Remark 5.1. Suppose that we approximate the input matrix $\bm{A}$ inside the range of the sample matrix $\bm{Z}$ . In essence, the truncation step computes a best rank- $k$ approximation $\widehat{\bm{A}}_{(k)}$ of the compressed matrix $\bm{P}_{\bm{Z}}\bm{A}$ . The next result provides a simple error bound for this method; this argument was proposed by Ming Gu.

Apply the triangle inequality to split the error into two components.

We have already developed a detailed theory for estimating the first term. To analyze the second term, we introduce a best rank- $k$ approximation $\bm{A}_{(k)}$ of the matrix $\bm{A}$ . Note that

because $\widehat{\bm{A}}_{(k)}$ is a best rank- $k$ approximation to the matrix $\bm{P}_{\bm{Z}}\bm{A}$ , whereas $\bm{P}_{\bm{Z}}\bm{A}_{(k)}$ is an undistinguished rank- $k$ matrix. It follows that

The second inequality holds because the orthogonal projector is a contraction; the last identity follows from Mirsky’s theorem . Combine (64) and (65) to reach the main result. ∎

In the randomized setting, the truncation step appears to be less damaging than the error bound of Theorem 13 suggests, but we currently lack a complete theoretical understanding of its behavior.

Gaussian test matrices

The error bound in Theorem 11 shows that the performance of the proto-algorithm depends on the interaction between the test matrix $\bm{\Omega}$ and the right singular vectors of the input matrix $\bm{A}$ . Algorithm LABEL:alg:basic is a particularly simple version of the proto-algorithm that draws the test matrix according to the standard Gaussian distribution. The literature contains a wealth of information about these matrices, which allows us to perform a very precise error analysis.

We focus on the real case in this section. Analogous results hold in the complex case, where the algorithm even exhibits superior performance.

A standard Gaussian matrix is a random matrix whose entries are independent standard normal variables. The distribution of a standard Gaussian matrix is rotationally invariant: If $\bm{U}$ and $\bm{V}$ are orthonormal matrices, then $\bm{U}^{*}\bm{G}\bm{V}$ also has the standard Gaussian distribution.

Our analysis requires detailed information about the properties of Gaussian matrices. In particular, we must understand how the norm of a Gaussian matrix and its pseudoinverse vary. We summarize the relevant results and citations here, reserving the details for Appendix A.

Fix matrices $\bm{S},\bm{T}$ , and draw a standard Gaussian matrix $\bm{G}$ . Then

The identity (66) follows from a direct calculation. The second bound (67) relies on methods developed by Gordon . See Propositions 26 and 27.

Draw a $k\times(k+p)$ standard Gaussian matrix $\bm{G}$ with $k\geq 2$ and $p\geq 2$ . Then

The first identity is a standard result from multivariate statistics [99, p. 96]. The second follows from work of Chen and Dongarra . See Proposition 29 and 30.

To study the probability that Algorithm LABEL:alg:basic produces a large error, we rely on tail bounds for functions of Gaussian matrices. The next proposition rephrases a well-known result on concentration of measure [14, Thm. 4.5.7]. See also [83, §1.1] and [82, §5.1].

Suppose that $h$ is a Lipschitz function on matrices:

Draw a standard Gaussian matrix $\bm{G}$ . Then

Finally, we state some large deviation bounds for the norm of a pseudo-inverted Gaussian matrix.

Let $\bm{G}$ be a $k\times(k+p)$ Gaussian matrix where $p\geq 4$ . For all $t\geq 1$ ,

Compare these estimates with Proposition 15. It seems that (70) is new; we were unable to find a comparable analysis in the random matrix literature. Although the form of (70) is not optimal, it allows us to produce more transparent results than a fully detailed estimate. The bound (71) essentially appears in the work of Chen and Dongarra . See Propositions 28 and Theorem 31 for more information.

2 Average-case analysis of Algorithm LABEL:alg:basic

We separate our analysis into two pieces. First, we present information about expected values. In the next subsection, we describe bounds on the probability of a large deviation.

We begin with the simplest result, which provides an estimate for the expected approximation error in the Frobenius norm. All proofs are postponed to the end of the section.

Suppose that $\bm{A}$ is a real $m\times n$ matrix with singular values $\sigma_{1}\geq\sigma_{2}\geq\sigma_{3}\geq\dots$ . Choose a target rank $k\geq 2$ and an oversampling parameter $p\geq 2$ , where $k+p\leq\min\{m,n\}$ . Draw an $n\times(k+p)$ standard Gaussian matrix $\bm{\Omega}$ , and construct the sample matrix $\bm{Y}=\bm{A\Omega}$ . Then the expected approximation error

This theorem predicts several intriguing behaviors of Algorithm LABEL:alg:basic. The Eckart–Young theorem shows that $(\sum\nolimits_{j>k}\sigma_{j}^{2})^{1/2}$ is the minimal Frobenius-norm error when approximating $\bm{A}$ with a rank- $k$ matrix. This quantity is the appropriate benchmark for the performance of the algorithm. If the small singular values of $\bm{A}$ are very flat, the series may be as large as $\sigma_{k+1}\sqrt{\min\{m,n\}-k}$ . On the other hand, when the singular values exhibit some decay, the error may be on the same order as $\sigma_{k+1}$ .

The error bound always exceeds this baseline error, but it may be polynomially larger, depending on the ratio between the target rank $k$ and the oversampling parameter $p$ . For $p$ small (say, less than five), the error is somewhat variable because the small singular values of a nearly square Gaussian matrix are very unstable. As the oversampling increases, the performance improves quickly. When $p\sim k$ , the error is already within a constant factor of the baseline.

The error bound for the spectral norm is somewhat more complicated, but it reveals some interesting new features.

Mirsky has shown that the quantity $\sigma_{k+1}$ is the minimum spectral-norm error when approximating $\bm{A}$ with a rank- $k$ matrix, so the first term in Theorem 19 is analogous with the error bound in Theorem 18. The second term represents a new phenomenon: we also pay for the Frobenius-norm error in approximating $\bm{A}$ . Note that, as the amount $p$ of oversampling increases, the polynomial factor in the second term declines much more quickly than the factor in the first term. When $p\sim k$ , the factor on the $\sigma_{k+1}$ term is constant, while the factor on the series has order $k^{-1/2}$

We also note that the bound in Theorem 19 implies

so the average spectral-norm error always lies within a small polynomial factor of the baseline $\sigma_{k+1}$ .

Let us continue with the proofs of these results.

Let $\bm{V}$ be the right unitary factor of $\bm{A}$ . Partition $\bm{V}=[\bm{V}_{1}\ |\ \bm{V}_{2}]$ into blocks containing, respectively, $k$ and $n-k$ columns. Recall that

The Gaussian distribution is rotationally invariant, so $\bm{V}^{*}\bm{\Omega}$ is also a standard Gaussian matrix. Observe that $\bm{\Omega}_{1}$ and $\bm{\Omega}_{2}$ are nonoverlapping submatrices of $\bm{V}^{*}\bm{\Omega}$ , so these two matrices are not only standard Gaussian but also stochastically independent. Furthermore, the rows of a (fat) Gaussian matrix are almost surely in general position, so the $k\times(k+p)$ matrix $\bm{\Omega}_{1}$ has full row rank with probability one.

Hölder’s inequality and Theorem 11 together imply that

We compute this expectation by conditioning on the value of $\bm{\Omega}_{1}$ and applying Proposition 14 to the scaled Gaussian matrix $\bm{\Omega}_{2}$ . Thus,

where the last expectation follows from relation (68) of Proposition 15. In summary,

The argument is similar to the proof of Theorem 18. First, Theorem 11 implies that

We condition on $\bm{\Omega}_{1}$ and apply Proposition 14 to bound the expectation with respect to $\bm{\Omega}_{2}$ . Thus,

where the second relation requires Hölder’s inequality. Applying both parts of Proposition 15, we obtain

Note that $\left\|{\bm{\Sigma}_{2}}\right\|=\sigma_{k+1}$ to wrap up. ∎

3 Probabilistic error bounds for Algorithm LABEL:alg:basic

We can develop tail bounds for the approximation error, which demonstrate that the average performance of the algorithm is representative of the actual performance. We begin with the Frobenius norm because the result is somewhat simpler.

Frame the hypotheses of Theorem 18. Assume further that $p\geq 4$ . For all $u,t\geq 1$ ,

To parse this theorem, observe that the first term in the error bound corresponds with the expected approximation error in Theorem 18. The second term represents a deviation above the mean.

An analogous result holds for the spectral norm.

Frame the hypotheses of Theorem 18. Assume further that $p\geq 4$ . For all $u,t\geq 1$ ,

The bracket corresponds with the expected spectral-norm error while the remaining term represents a deviation above the mean. Neither the numerical constants nor the precise form of the bound are optimal because of the slackness in Proposition 17. Nevertheless, the theorem gives a fairly good picture of what is actually happening.

We acknowledge that the current form of Theorem 21 is complicated. To produce more transparent results, we make appropriate selections for the parameters $u,t$ and bound the numerical constants.

Frame the hypotheses of Theorem 18, and assume further that $p\geq 4$ . Then

with failure probability at most $6p^{-p}$ .

Corollary 22 should be compared with [91, Obs. 4.4–4.5]. Although our result contains sharper error estimates, the failure probabilities are usually worse. The error bound (9) presented in §1.5 follows after further simplification of the second bound from Corollary 22.

We continue with a proof of Theorem 21. The same argument can be used to obtain a bound for the Frobenius-norm error, but we omit a detailed account.

Since $\bm{\Omega}_{1}$ and $\bm{\Omega}_{2}$ are independent from each other, we can study how the error depends on the matrix $\bm{\Omega}_{2}$ by conditioning on the event that $\bm{\Omega}_{1}$ is not too irregular. To that end, we define a (parameterized) event on which the spectral and Frobenius norms of the matrix $\bm{\Omega}_{1}^{\dagger}$ are both controlled. For $t\geq 1$ , let

Invoking both parts of Proposition 17, we find that

Consider the function $h(\bm{X})={\bigl{\|}{\bm{\Sigma}_{2}\bm{X}\bm{\Omega}_{1}^{\dagger}}\bigr{\|}}$ . We quickly compute its Lipschitz constant $L$ with the lower triangle inequality and some standard norm estimates:

Therefore, $L\leq\left\|{\bm{\Sigma}_{2}}\right\|{\bigl{\|}{\bm{\Omega}_{1}^{\dagger}}\bigr{\|}}$ . Relation (67) of Proposition 14 implies that

Applying the concentration of measure inequality, Proposition 16, conditionally to the random variable $h(\bm{\Omega}_{2})={\bigl{\|}{\bm{\Sigma}_{2}\bm{\Omega}_{2}\bm{\Omega}_{1}^{\dagger}}\bigr{\|}}$ results in

Under the event $E_{t}$ , we have explicit bounds on the norms of $\bm{\Omega}_{1}^{\dagger}$ , so

Insert the expressions for the norms of $\bm{\Sigma}_{2}$ into this result to complete the probability bound. Finally, introduce this estimate into the error bound from Theorem 11. ∎

4 Analysis of the power scheme

Theorem 19 makes it clear that the performance of the randomized approximation scheme, Algorithm LABEL:alg:basic, depends heavily on the singular spectrum of the input matrix. The power scheme outlined in Algorithm LABEL:alg:poweriteration addresses this problem by enhancing the decay of spectrum. We can combine our analysis of Algorithm LABEL:alg:basic with Theorem 12 to obtain a detailed report on the behavior of the performance of the power scheme using a Gaussian matrix.

Frame the hypotheses of Theorem 18. Define $\bm{B}={(\bm{A}\bm{A}^{*})}^{q}\bm{A}$ for a nonnegative integer $q$ , and construct the sample matrix $\bm{Z}=\bm{B\Omega}$ . Then

Invoke Theorem 19 to bound the right-hand side, noting that $\sigma_{j}(\bm{B})=\sigma_{j}^{2q+1}$ . ∎

The true message of Corollary 23 emerges if we bound the series using its largest term $\sigma_{k+1}^{4q+2}$ and draw the factor $\sigma_{k+1}$ out of the bracket:

In words, as we increase the exponent $q$ , the power scheme drives the extra factor in the error to one exponentially fast. By the time $q\sim\log\left(\min\{m,n\}\right)$ ,

which is the baseline for the spectral norm.

In most situations, the error bound given by Corollary 23 is substantially better than the estimates discussed in the last paragraph. For example, suppose that the tail singular values exhibit the decay profile

Then the series in Corollary 23 is comparable with its largest term, which allows us to remove the dimensional factor $\min\{m,n\}$ from the error bound.

To obtain large deviation bounds for the performance of the power scheme, simply combine Theorem 12 with Theorem 21. We omit a detailed statement.

We lack an analogous theory for the Frobenius norm because Theorem 12 depends on Proposition 10, which is not true for the Frobenius norm. It is possible to obtain some results by estimating the Frobenius norm in terms of the spectral norm.

SRFT test matrices

Another way to implement the proto-algorithm from §1.3 is to use a structured random matrix so that the matrix product in Step 2 can be performed quickly. One type of structured random matrix that has been proposed in the literature is the subsampled random Fourier transform, or SRFT, which we discussed in §4.6. In this section, we present bounds on the performance of the proto-algorithm when it is implemented with an SRFT test matrix. In contrast with the results for Gaussian test matrices, the results in this section hold for both real and complex input matrices.

$\bm{D}$ is a random $n\times n$ diagonal matrix whose entries are independent and uniformly distributed on the complex unit circle;

$\bm{F}$ is the $n\times n$ unitary discrete Fourier transform; and

Theorem 24 follows from a straightforward variation of the argument in , which establishes equivalent bounds for a real analog of the SRFT, called the subsampled randomized Hadamard transform (SRHT). We omit further details.

For large problems, we can obtain better numerical constants [134, Thm. 3.2]. Fix a small, positive number $\iota$ . If $k\gg\log(n)$ , then sampling

The logarithmic factor in Theorem 24 is necessary when the orthonormal matrix $\bm{V}$ is particularly evil. Let us describe an infinite family of worst-case examples. Fix an integer $k$ , and let $n=k^{2}$ . Form an $n\times k$ orthonormal matrix $\bm{V}$ by regular decimation of the $n\times n$ identity matrix. More precisely, $\bm{V}$ is the matrix whose $j$ th row has a unit entry in column $(j-1)/k$ when $j\equiv 1\pmod{k}$ and is zero otherwise. To see why this type of matrix is nasty, it is helpful to consider the auxiliary matrix $\bm{W}=\bm{V}^{*}\bm{DF}$ . Observe that, up to scaling and modulation of columns, $\bm{W}$ consists of $k$ copies of a $k\times k$ DFT concatenated horizontally.

2 Performance guarantees

We are now prepared to present detailed information on the performance of the proto-algorithm when the test matrix $\bm{\Omega}$ is an SRFT.

Construct the sample matrix $\bm{Y}=\bm{A\Omega}$ . Then

We complete the section with the proof of Theorem 25.

Let $\bm{V}$ be the right unitary factor of matrix $\bm{A}$ , and partition $\bm{V}=[\bm{V}_{1}\ |\ \bm{V}_{2}]$ into blocks containing, respectively, $k$ and $n-k$ columns. Recall that

where $\left|\!\left|\!\left|{\cdot}\right|\!\right|\!\right|$ denotes either the spectral norm or the Frobenius norm. Our application of Theorem 24 also ensures that the spectral norm of $\bm{\Omega}_{1}^{\dagger}$ is under control.

We may bound the spectral norm of $\bm{\Omega}_{2}$ deterministically.

Acknowledgments

The authors have benefited from valuable discussions with many researchers, among them Inderjit Dhillon, Petros Drineas, Ming Gu, Edo Liberty, Michael Mahoney, Vladimir Rokhlin, Yoel Shkolnisky, and Arthur Szlam. In particular, we would like to thank Mark Tygert for his insightful remarks on early drafts of this paper. The example in Section 7.2 was provided by François Meyer of the University of Colorado at Boulder. The example in Section 7.3 comes from the FERET database of facial images collected under the FERET program, sponsored by the DoD Counterdrug Technology Development Program Office. The work reported was initiated during the program Mathematics of Knowledge and Search Engines held at IPAM in the fall of 2007. Finally, we would like to thank the anonymous referees, whose thoughtful remarks have helped us to improve the manuscript dramatically.

Appendix A On Gaussian matrices

This appendix collects some of the properties of Gaussian matrices that we use in our analysis. Most of the results follow quickly from material that is already available in the literature. One fact, however, requires a surprisingly difficult new argument. We focus on the real case here; the complex case is similar but actually yields better results.

We begin with the expected Frobenius norm of a scaled Gaussian matrix, which follows from an easy calculation.

Fix real matrices $\bm{S},\bm{T}$ , and draw a standard Gaussian matrix $\bm{G}$ . Then

The distribution of a Gaussian matrix is invariant under orthogonal transformations, and the Frobenius norm is also unitarily invariant. As a result, it represents no loss of generality to assume that $\bm{S}$ and $\bm{T}$ are diagonal. Therefore,

Since the right-hand side is unitarily invariant, we have also identified the value of the expectation for general matrices $\bm{S}$ and $\bm{T}$ . ∎

The literature contains an excellent bound for the expected spectral norm of a scaled Gaussian matrix. The result is due to Gordon , who established the bound using a sharp version of Slepian’s lemma. See [83, §3.3] and [34, §2.3] for additional discussion.

Fix real matrices $\bm{S},\bm{T}$ , and draw a standard Gaussian matrix $\bm{G}$ . Then

A.2 Spectral norm of pseudoinverse

Now, we turn to the pseudoinverse of a Gaussian matrix. Recently, Chen and Dongarra developed a good bound on the probability that its spectral norm is large. The statement here follows from [25, Lem. 4.1] after an application of Stirling’s approximation. See also [91, Lem. 2.14]

Let $\bm{G}$ be an $m\times n$ standard Gaussian matrix with $n\geq m\geq 2$ . For each $t>0$ ,

We can use Proposition 28 to bound the expected spectral norm of a pseudo-inverted Gaussian matrix.

Let $\bm{G}$ be a $m\times n$ standard Gaussian matrix with $n-m\geq 1$ and $m\geq 2$ . Then

Let us make the abbreviations $p=n-m$ and

We compute the expectation by way of a standard argument. The integral formula for the mean of a nonnegative random variable implies that, for all $E>0$ ,

where the second inequality follows from Proposition 28. The right-hand side is minimized when $E=C^{1/(p+1)}$ . Substitute and simplify. ∎

A.3 Frobenius norm of pseudoinverse

The squared Frobenius norm of a pseudo-inverted Gaussian matrix is closely connected with the trace of an inverted Wishart matrix. This observation leads to an exact expression for the expectation.

Let $\bm{G}$ be an $m\times n$ standard Gaussian matrix with $n-m\geq 2$ . Then

The second identity holds almost surely because the Wishart matrix $\bm{GG}^{*}$ is invertible with probability one. The random matrix $(\bm{GG}^{*})^{-1}$ follows the inverted Wishart distribution, so we can compute its expected trace explicitly using a formula from [99, p. 97]. ∎

On the other hand, very little seems to be known about the tail behavior of the Frobenius norm of a pseudo-inverted Gaussian matrix. The following theorem, which is new, provides an adequate bound on the probability of a large deviation.

Let $\bm{G}$ be an $m\times n$ standard Gaussian matrix with $n-m\geq 4$ . For each $t\geq 1$ ,

Neither the precise form of Theorem 31 nor the constants are ideal; we have focused instead on establishing a useful bound with minimal fuss. The rest of the section is devoted to the rather lengthy proof. Unfortunately, most of the standard methods for producing tail bounds fail for random variables that do not exhibit normal or exponential concentration. Our argument relies on special properties of Gaussian matrices and a dose of brute force.

We begin with a piece of notation. For any number $q\geq 1$ , we define the $L_{q}$ norm of a random variable $Z$ by

In particular, the $L_{q}$ norm satisfies the triangle inequality.

We continue with a collection of technical results. First, we present a striking fact about the structure of Gaussian matrices [55, §3.5].

For $n\geq m$ , an $m\times n$ standard Gaussian matrix is orthogonally equivalent with a random bidiagonal matrix

where, for each $j$ , the random variables $X_{j}^{2}$ and $Y_{j}^{2}$ follow the $\chi^{2}$ distribution with $j$ degrees of freedom. Furthermore, these variates are mutually independent.

We also require the moments of a chi-square variate, which are expressed in terms of special functions.

Let $\Xi$ be a $\chi^{2}$ variate with $k$ degrees of freedom. When $0\leq q<k/2$ ,

Recall that a $\chi^{2}$ variate with $k$ degrees of freedom has the probability density function

where the second equality follows from Euler’s integral expression for the gamma function. The other calculation is similar. ∎

To streamline the proof, we eliminate the gamma functions from Proposition 33. The next result bounds the positive moments of a chi-square variate.

Let $\Xi$ be a $\chi^{2}$ variate with $k$ degrees of freedom. For $q\geq 1$ ,

Write $q=r+\theta$ , where $r=\lfloor q\rfloor$ . Repeated application of the functional equation $z\Gamma(z)=\Gamma(z+1)$ yields

The gamma function is logarithmically convex, so

The second inequality holds because $k$ is smaller than each term in the product, hence is smaller than their geometric mean. As a consequence,

The second relation is the inequality between the geometric and arithmetic mean. ∎

Finally, we develop a bound for the negative moments of a chi-square variate.

Let $\Xi$ be a $\chi^{2}$ variate with $k$ degrees of freedom, where $k\geq 5$ . When $2\leq q\leq(k-1)/2$ ,

We establish the bound for $q=(k-1)/2$ . For smaller values of $q$ , the result follows from Hölder’s inequality. Proposition 33 shows that

where we used the assumption $q\geq 2$ to complete the numerical estimate. ∎

A.3.2 Proof of Theorem 31

Let $\bm{G}$ be an $m\times n$ Gaussian matrix, where we assume that $n-m\geq 4$ . Define the random variable

Our goal is to develop a tail bound for $Z$ . The argument is inspired by work of Szarek [130, §6] for square Gaussian matrices.

The first step is to find an explicit, tractable representation for the random variable. According to Proposition 32, a Gaussian matrix $\bm{G}$ is orthogonally equivalent with a bidiagonal matrix $\bm{L}$ of the form (72). Making an analogy with the inversion formula for a triangular matrix, we realize that the pseudoinverse of $\bm{L}$ is given by

Because $\bm{L}^{\dagger}$ is orthogonally equivalent with $\bm{G}^{\dagger}$ and the Frobenius norm is unitarily invariant, we have the relations

where we have added an extra subdiagonal term (corresponding with $j=0$ ) so that we can avoid exceptional cases later. We abbreviate the summands as

Next, we develop a large deviation bound for each summand by computing a moment and invoking Markov’s inequality. For the exponent $q=(n-m)/2$ , Lemmas 34 and 35 yield

Note that the first two relations require the independence of the variates and the triangle inequality for the $L_{q}$ norm. The maximum value of the bracket evidently occurs when $j=0$ , so

To complete the argument, we combine these estimates by means of the union bound and clean up the resulting mess. Since $Z\leq\sum_{j=0}^{m-1}W_{j}$ ,

To control the sum on the right-hand side, observe that

where the last inequality follows from the hypothesis $n-m\geq 4$ . Together, the estimates in this paragraph produce the advertised bound.

It would be very interesting to find more conceptual and extensible argument that yields accurate concentration results for inverse spectral functions of a Gaussian matrix.