A randomized algorithm for principal component analysis

Vladimir Rokhlin, Arthur Szlam, Mark Tygert

Introduction

Principal component analysis (PCA) is among the most widely used techniques in statistics, data analysis, and data mining. PCA is the basis of many machine learning methods, including the latent semantic analysis of large databases of text and HTML documents described in . Computationally, PCA amounts to the low-rank approximation of a matrix containing the data being analyzed. The present article describes an algorithm for the low-rank approximation of matrices, suitable for PCA. This paper demonstrates both theoretically and via numerical examples that the algorithm efficiently produces low-rank approximations whose accuracies are very close to the best possible.

The canonical construction of the best possible rank-kk approximation to a real m×nm\times n matrix AA uses the singular value decomposition (SVD) of AA,

where UU is a real unitary m×mm\times m matrix, VV is a real unitary n×nn\times n matrix, and Σ\Sigma is a real m×nm\times n matrix whose only nonzero entries appear in nonincreasing order on the diagonal and are nonnegative. The diagonal entries σ1\sigma_{1}, σ2\sigma_{2}, …, σmin(m,n)1\sigma_{\min(m,n)-1}, σmin(m,n)\sigma_{\min(m,n)} of Σ\Sigma are known as the singular values of AA. The best rank-kk approximation to AA, with k<mk<m and k<nk<n, is

where σk+1\sigma_{k+1} is the (k+1)st(k+1)^{\rm st} greatest singular value of AA. For more information about the SVD, see, for example, Chapter 8 in .

where AB\|A-B\| is the spectral norm of ABA-B, and σk+1\sigma_{k+1} is the (k+1)st(k+1)^{\rm st} greatest singular value of AA (see, for example, Chapter 5 in ). Furthermore, the algorithms of require only about O(nmk)\mathcal{O}(nmk) flops to produce a rank-kk approximation that (unlike an approximation produced by a pivoted QRQR-decomposition) has been guaranteed to satisfy a bound nearly as strong as (4).

While the accuracy in (4) is sufficient for many applications of low-rank approximation, PCA often involves mm\geq 10,000, and a “signal-to-noise ratio” σ1/σk+1100\sigma_{1}/\sigma_{k+1}\leq 100, where σ1=A\sigma_{1}=\|A\| is the greatest singular value of AA, and σk+1\sigma_{k+1} is the (k+1)st(k+1)^{\rm st} greatest. Moreover, the singular values σk+1\leq\sigma_{k+1} often arise from noise in the process generating the data in AA, making the singular values of AA decay so slowly that σmσk+1/10\sigma_{m}\geq\sigma_{k+1}/10. When mm\geq 10,000, σ1/σk+1100\sigma_{1}/\sigma_{k+1}\leq 100, and σmσk+1/10\sigma_{m}\geq\sigma_{k+1}/10, the rank-kk approximation BB produced by a pivoted QRQR-decomposition algorithm typically satisfies ABA\|A-B\|\sim\|A\| — the “approximation” BB is effectively unrelated to the matrix AA being approximated! For large matrices whose “signal-to-noise ratio” σ1/σk+1\sigma_{1}/\sigma_{k+1} is less than 10,000, the m\sqrt{m} factor in (4) may be unacceptable. Now, pivoted QRQR-decomposition algorithms are not the only algorithms which can compute a rank-kk approximation using O(nmk)\mathcal{O}(nmk) flops. However, other algorithms, such as those of , , , , , , , , , , , , , , , , , , , , , , , , , , , and , also yield accuracies involving factors of at least m\sqrt{m} when the singular values σk+1\sigma_{k+1}, σk+2\sigma_{k+2}, σk+3\sigma_{k+3}, … of AA decay slowly. (The decay is rather slow if, for example, σk+jjασk+1\sigma_{k+j}\sim j^{\alpha}\,\sigma_{k+1} for j=1j=1, 22, 33, …, with 1/2<α0-1/2<\alpha\leq 0. Many of these other algorithms are designed to produce approximations having special properties not treated in the present paper, and their spectral-norm accuracy is good when the singular values decay sufficiently fast. Fairly recent surveys of algorithms for low-rank approximation are available in , , and .)

The algorithm described in the present paper produces a rank-kk approximation BB to AA such that

The algorithm of the present paper is randomized, but succeeds with very high probability; for example, the bound (46) on its accuracy holds with probability greater than 110151-10^{-15}. The algorithm is similar to many recently discussed randomized algorithms for low-rank approximation, but produces approximations of higher accuracy when the singular values σk+1\sigma_{k+1}, σk+2\sigma_{k+2}, σk+3\sigma_{k+3}, … of the matrix being approximated decay slowly; see, for example, or . The algorithm is a variant of that in , and the analysis of the present paper should extend to the algorithm of ; stimulated the authors’ collaboration. The algorithm may be regarded as a generalization of the randomized power methods of and , and in fact we use the latter to ascertain the approximations’ accuracy rapidly and reliably.

The algorithm admits obvious “out-of-core” and parallel implementations (assuming that the user chooses the parameter ii in (5) to be reasonably small). As with the algorithms of , , , , , , and , the core steps of the algorithm of the present paper involve the application of the matrix AA being approximated and its transpose ATA^{\hbox{\scriptsize{\rm T}}} to random vectors. The algorithm is more efficient when AA and ATA^{\hbox{\scriptsize{\rm T}}} can be applied rapidly to arbitrary vectors, such as when AA is sparse.

Throughout the present paper, we use 1{\bf 1} to denote an identity matrix. We use 0{\bf 0} to denote a matrix whose entries are all zeros. For any matrix AA, we use A\|A\| to denote the spectral norm of AA, that is, A\|A\| is the greatest singular value of AA. Furthermore, the entries of all matrices in the present paper are real valued, though the algorithm and analysis extend trivially to matrices whose entries are complex valued.

The present paper has the following structure: Section 2 collects together various known facts which later sections utilize. Section 3 provides the principal lemmas used in bounding the accuracy of the algorithm in Section 4. Section 4 describes the algorithm of the present paper. Section 5 illustrates the performance of the algorithm via several numerical examples. The appendix, Section 6, proves two lemmas stated earlier in Section 3. We encourage the reader to begin with Sections 4 and 5, referring back to the relevant portions of Sections 2 and 3 as they are referenced.

Preliminaries

In this section, we summarize various facts about matrices and functions. Subsection 2.1 discusses the singular values of arbitrary matrices. Subsection 2.2 discusses the singular values of certain random matrices. Subsection 2.3 observes that a certain function is monotone.

The following trivial technical lemma will be needed in Section 3.

Suppose that mm and nn are positive integers with mnm\geq n. Suppose further that AA is a real m×nm\times n matrix such that the least (that is, the nthn^{\it th} greatest) singular value σn\sigma_{n} of AA is nonzero.

The following lemma states that the greatest singular value of a matrix AA is at least as large as the greatest singular value of any rectangular block of entries in AA; the lemma is a straightforward consequence of the minimax properties of singular values (see, for example, Section 47 of Chapter 2 in ).

Suppose that kk, ll, mm, and nn are positive integers with kmk\leq m and lnl\leq n. Suppose further that AA is a real m×nm\times n matrix, and BB is a k×lk\times l rectangular block of entries in AA.

Then, the greatest singular value of BB is at most the greatest singular value of AA.

The following classical lemma provides an approximation QSQ\,S to an n×ln\times l matrix RR via an n×kn\times k matrix QQ whose columns are orthonormal, and a k×lk\times l matrix SS. As remarked in Observation 4, the proof of this lemma provides a classic algorithm for computing QQ and SS, given RR. We include the proof since we will be using this algorithm.

Suppose that kk, ll, and nn are positive integers with k<lnk<l\leq n, and RR is a real n×ln\times l matrix.

Then, there exist a real n×kn\times k matrix QQ whose columns are orthonormal, and a real k×lk\times l matrix SS, such that

where ρk+1\rho_{k+1} is the (k+1)st(k+1)^{\it st} greatest singular value of RR.

where UU is a real n×ln\times l matrix whose columns are orthonormal, VV is a real l×ll\times l matrix whose columns are orthonormal, and Σ\Sigma is a real diagonal l×ll\times l matrix, such that

for j=1j=1, 22, …, l1l-1, ll, where Σj,j\Sigma_{j,j} is the entry in row jj and column jj of Σ\Sigma, and ρj\rho_{j} is the jthj^{\rm th} greatest singular value of RR. We define QQ to be the leftmost n×kn\times k block of UU, and PP to be the rightmost n×(lk)n\times(l-k) block of UU, so that

We define SS to be the uppermost k×lk\times l block of ΣVT\Sigma\,V^{\hbox{\scriptsize{\rm T}}}, and TT to be the lowermost (lk)×l(l-k)\times l block of ΣVT\Sigma\,V^{\hbox{\scriptsize{\rm T}}}, so that

Combining (8), (9), (10), (11), and the fact that the columns of UU are orthonormal, as are the columns of VV, yields (7). ∎

In order to compute the matrices QQ and SS in (7) from the matrix RR, we can construct (8), and then form QQ and SS according to (10) and (11). (See, for example, Chapter 8 in for details concerning the computation of the SVD.)

2 Singular values of random matrices

The following lemma provides a highly probable upper bound on the greatest singular value of a square matrix whose entries are independent, identically distributed (i.i.d.) Gaussian random variables of zero mean and unit variance; Formula 8.8 in provides an equivalent formulation of the lemma.

Suppose that nn is a positive integer, GG is a real n×nn\times n matrix whose entries are i.i.d. Gaussian random variables of zero mean and unit variance, and γ\gamma is a positive real number, such that γ>1\gamma>1 and

Then, the greatest singular value of GG is at most 2nγ\sqrt{2n}\,\gamma with probability not less than the amount in (12).

Combining Lemmas 2 and 5 yields the following lemma, providing a highly probable upper bound on the greatest singular value of a rectangular matrix whose entries are i.i.d. Gaussian random variables of zero mean and unit variance.

Suppose that ll, mm, and nn are positive integers with nln\geq l and nmn\geq m. Suppose further that GG is a real l×ml\times m matrix whose entries are i.i.d. Gaussian random variables of zero mean and unit variance, and γ\gamma is a positive real number, such that γ>1\gamma>1 and (12) is nonnegative.

Then, the greatest singular value of GG is at most 2nγ\sqrt{2n}\,\gamma with probability not less than the amount in (12).

The following lemma provides a highly probable lower bound on the least singular value of a rectangular matrix whose entries are i.i.d. Gaussian random variables of zero mean and unit variance; Formula 2.5 in and the proof of Lemma 4.1 in together provide an equivalent formulation of Lemma 7.

Suppose that jj and ll are positive integers with jlj\leq l. Suppose further that GG is a real l×jl\times j matrix whose entries are i.i.d. Gaussian random variables of zero mean and unit variance, and β\beta is a positive real number, such that

Then, the least (that is, the jthj^{\it th} greatest) singular value of GG is at least 1/(l  β)1/(\sqrt{l}\;\beta) with probability not less than the amount in (13).

3 A monotone function

The following technical lemma will be needed in Section 4.

Suppose that α\alpha is a nonnegative real number, and ff is the function defined on (0,)(0,\infty) via the formula

Then, ff decreases monotonically for x>αx>\alpha.

for any positive real number xx. The right-hand side of (15) is negative when x>αx>\alpha. ∎

Mathematical apparatus

In this section, we provide lemmas to be used in Section 4 in bounding the accuracy of the algorithm of the present paper.

The following lemma, proven in the appendix (Section 6), shows that the product AQQTA\,Q\,Q^{\hbox{\scriptsize{\rm T}}} of matrices AA, QQ, and QTQ^{\hbox{\scriptsize{\rm T}}} is a good approximation to a matrix AA, provided that there exist matrices GG and SS such that

QSQ\,S is a good approximation to (G(AAT)iA)T(G\,(A\,A^{\hbox{\scriptsize{\rm T}}})^{i}\,A)^{\hbox{\scriptsize{\rm T}}}, and

there exists a matrix FF such that F\|F\| is not too large, and FG(AAT)iAF\,G\,(A\,A^{\hbox{\scriptsize{\rm T}}})^{i}\,A is a good approximation to AA.

Suppose that ii, kk, ll, mm, and nn are positive integers with klmnk\leq l\leq m\leq n. Suppose further that AA is a real m×nm\times n matrix, QQ is a real n×kn\times k matrix whose columns are orthonormal, SS is a real k×lk\times l matrix, FF is a real m×lm\times l matrix, and GG is a real l×ml\times m matrix.

The following lemma, proven in the appendix (Section 6), states that, for any positive integer ii, matrix AA, and matrix GG whose entries are i.i.d. Gaussian random variables of zero mean and unit variance, with very high probability there exists a matrix FF with a reasonably small norm, such that FG(AAT)iAF\,G\,(A\,A^{\hbox{\scriptsize{\rm T}}})^{i}\,A is a good approximation to AA. This lemma is similar to Lemma 19 of .

Suppose that ii, jj, kk, ll, mm, and nn are positive integers with j<k<l<mnj<k<l<m\leq n. Suppose further that AA is a real m×nm\times n matrix, GG is a real l×ml\times m matrix whose entries are i.i.d. Gaussian random variables of zero mean and unit variance, and β\beta and γ\gamma are positive real numbers, such that the jthj^{\it th} greatest singular value σj\sigma_{j} of AA is positive, γ>1\gamma>1, and

Then, there exists a real m×lm\times l matrix FF such that

with probability not less than Φ\Phi defined in (17), where σj\sigma_{j} is the jthj^{\it th} greatest singular value of AA, σj+1\sigma_{j+1} is the (j+1)st(j+1)^{\it st} greatest singular value of AA, and σk+1\sigma_{k+1} is the (k+1)st(k+1)^{\it st} greatest singular value of AA.

Given a matrix AA, and a matrix GG whose entries are i.i.d. Gaussian random variables of zero mean and unit variance, the following lemma provides a highly probable upper bound on the singular values of the product GAG\,A in terms of the singular values of AA. This lemma is reproduced from , where it appears as Lemma 20.

Suppose that jj, kk, ll, mm, and nn are positive integers with k<lk<l, such that k+j<mk+j<m and k+j<nk+j<n. Suppose further that AA is a real m×nm\times n matrix, GG is a real l×ml\times m matrix whose entries are i.i.d. Gaussian random variables of zero mean and unit variance, and γ\gamma is a positive real number, such that γ>1\gamma>1 and

with probability not less than Ξ\Xi defined in (20), where ρk+1\rho_{k+1} is the (k+1)st(k+1)^{\it st} greatest singular value of GAG\,A, σk+1\sigma_{k+1} is the (k+1)st(k+1)^{\it st} greatest singular value of AA, and σk+j+1\sigma_{k+j+1} is the (k+j+1)st(k+j+1)^{\it st} greatest singular value of AA.

The following corollary follows immediately from the preceding lemma, by replacing the matrix AA with (AAT)iA(A\,A^{\hbox{\scriptsize{\rm T}}})^{i}\,A, the integer kk with jj, and the integer jj with kjk-j.

Suppose ii, jj, kk, ll, mm, and nn are positive integers with j<k<l<mnj<k<l<m\leq n. Suppose further that AA is a real m×nm\times n matrix, GG is a real l×ml\times m matrix whose entries are i.i.d. Gaussian random variables of zero mean and unit variance, and γ\gamma is a positive real number, such that γ>1\gamma>1 and

with probability not less than Ψ\Psi defined in (22), where ρj+1\rho_{j+1} is the (j+1)st(j+1)^{\it st} greatest singular value of G(AAT)iAG\,(A\,A^{\hbox{\scriptsize{\rm T}}})^{i}\,A, σj+1\sigma_{j+1} is the (j+1)st(j+1)^{\it st} greatest singular value of AA, and σk+1\sigma_{k+1} is the (k+1)st(k+1)^{\it st} greatest singular value of AA.

Description of the algorithm

In this section, we describe the algorithm of the present paper, providing details about its accuracy and computational costs. Subsection 4.1 describes the basic algorithm. Subsection 4.2 tabulates the computational costs of the algorithm. Subsection 4.3 describes a complementary algorithm. Subsection 4.4 describes a computationally more expensive variant that is somewhat more accurate and tolerant to roundoff.

Suppose that ii, kk, mm, and nn are positive integers with 2k<mn2k<m\leq n, and AA is a real m×nm\times n matrix. In this subsection, we will construct an approximation to an SVD of AA such that

with very high probability, where UU is a real m×km\times k matrix whose columns are orthonormal, VV is a real n×kn\times k matrix whose columns are orthonormal, Σ\Sigma is a real diagonal k×kk\times k matrix whose entries are all nonnegative, σk+1\sigma_{k+1} is the (k+1)st(k+1)^{\rm st} greatest singular value of AA, and CC is a constant independent of AA that depends on the parameters of the algorithm. (Section 5 will give an empirical indication of the size of CC, and (46) will give one of our best theoretical estimates to date.)

Intuitively, we could apply ATA^{\hbox{\scriptsize{\rm T}}} to several random vectors, in order to identify the part of its range corresponding to the larger singular values. To enhance the decay of the singular values, we apply AT(AAT)iA^{\hbox{\scriptsize{\rm T}}}\,(A\,A^{\hbox{\scriptsize{\rm T}}})^{i} instead. Once we have identified most of the range of ATA^{\hbox{\scriptsize{\rm T}}}, we perform several linear-algebraic manipulations in order to recover an approximation to AA. (It is possible to obtain a similar, somewhat less accurate algorithm by substituting our short, fat matrix AA for ATA^{\hbox{\scriptsize{\rm T}}}, and ATA^{\hbox{\scriptsize{\rm T}}} for AA.)

More precisely, we choose an integer l>kl>k such that lmkl\leq m-k (for example, l=k+12l=k+12), and make the following five steps:

Using a random number generator, form a real l×ml\times m matrix GG whose entries are i.i.d. Gaussian random variables of zero mean and unit variance, and compute the l×nl\times n product matrix

Using an SVD, form a real n×kn\times k matrix QQ whose columns are orthonormal, such that there exists a real k×lk\times l matrix SS for which

where ρk+1\rho_{k+1} is the (k+1)st(k+1)^{\rm st} greatest singular value of RR. (See Observation 4 for details concerning the construction of such a matrix QQ.)

where UU is a real m×km\times k matrix whose columns are orthonormal, WW is a real k×kk\times k matrix whose columns are orthonormal, and Σ\Sigma is a real diagonal k×kk\times k matrix whose entries are all nonnegative. (See, for example, Chapter 8 in for details concerning the construction of such an SVD.)

The following theorem states precisely that the matrices UU, Σ\Sigma, and VV satisfy (24). See (46) for a more compact (but less general) formulation.

Suppose that ii, kk, ll, mm, and nn are positive integers with k<lmkk<l\leq m-k and mnm\leq n, and AA is a real m×nm\times n matrix. Suppose further that β\beta and γ\gamma are positive real numbers such that γ>1\gamma>1,

is nonnegative. Suppose in addition that UU, Σ\Sigma, and VV are the matrices produced via the five-step algorithm of the present subsection, given above.

with probability not less than Π\Pi, where Π\Pi is defined in (32), and σk+1\sigma_{k+1} is the (k+1)st(k+1)^{\it st} greatest singular value of AA.

Observing that UΣVT=AQQTU\,\Sigma\,V^{\hbox{\scriptsize{\rm T}}}=A\,Q\,Q^{\hbox{\scriptsize{\rm T}}}, it is sufficient to prove that

with probability Π\Pi, where QQ is the matrix from (26), since combining (34), (27), (28), and (29) yields (33). We now prove (34).

But, it follows from the fact that the columns of QQ are orthonormal that

Combining (36), (37), (38), (35), and (31) yields (34), completing the proof for the case when (35) holds.

For the remainder of the proof, we consider the case when

To prove (34), we will use (16) (which is restated and proven in Lemma 6.19 in the appendix), namely,

for any real m×lm\times l matrix FF, where GG is from (25), and QQ and SS are from (26). We now choose an appropriate matrix FF.

First, we define jj to be the positive integer such that

where σj\sigma_{j} is the jthj^{\rm th} greatest singular value of AA, and σj+1\sigma_{j+1} is the (j+1)st(j+1)^{\rm st} greatest (such an integer jj exists due to (39) and the supposition of the theorem that lmkl\leq m-k). We then use the matrix FF from (18) and (19) associated with this integer jj, so that (as stated in (18) and (19), which are restated and proven in Lemma 6.21 in the appendix)

with probability not less than Φ\Phi defined in (17). Formula (42) bounds the first term in the right-hand side of (40).

To bound the second term in the right-hand side of (40), we observe that jkj\leq k, due to (41) and the supposition of the theorem that lmkl\leq m-k. Combining (26), (25), (23), and the fact that jkj\leq k yields

with probability not less than Ψ\Psi defined in (22). Combining (43) and (44) yields

with probability not less than Π\Pi defined in (32). The combination of Lemma 8, (30), and the fact that jkj\leq k justifies the use of kk (rather than the jj used in (17) for Φ\Phi) in the last term in the right-hand side of (32).

Combining (40), (42), (45), (41), (31), and the supposition of the theorem that lmkl\leq m-k yields (34), completing the proof. ∎

Choosing l=k+12l=k+12, β=2.57\beta=2.57, and γ=2.43\gamma=2.43 in (32) and (33) yields

with probability greater than 110151-10^{-15}, where σk+1\sigma_{k+1} is the (k+1)st(k+1)^{\rm st} greatest singular value of AA. Numerical experiments (some of which are reported in Section 5) indicate that the factor 100l100l in the right-hand side of (46) is much greater than necessary.

Above, we permit ll to be any integer greater than kk. Stronger theoretical bounds on the accuracy are available when l2kl\geq 2k. Indeed, via an analysis similar to the proof of Theorem 13 (using in addition the result stated in the abstract of ), it can be shown that the following six-step algorithm with l2kl\geq 2k produces matrices UU, Σ\Sigma, and VV satisfying the bound (33) with its right-hand side reduced by a factor of l\sqrt{l}:

Using a random number generator, form a real l×ml\times m matrix GG whose entries are i.i.d. Gaussian random variables of zero mean and unit variance, and compute the l×nl\times n product matrix

Using a pivoted QRQR-decomposition algorithm, form a real n×ln\times l matrix QQ whose columns are orthonormal, such that there exists a real l×ll\times l matrix SS for which

(See, for example, Chapter 5 in for details concerning the construction of such a matrix QQ.)

2 Computational costs

In this subsection, we tabulate the number of floating-point operations required by the five-step algorithm described in Subsection 4.1 as applied once to a matrix AA.

The algorithm incurs the following costs in order to compute an approximation to an SVD of AA:

Forming RR in (25) requires applying AA to ilil column vectors, and ATA^{\hbox{\scriptsize{\rm T}}} to (i+1)l(i+1)\,l column vectors.

Computing QQ in (26) costs O(l2n)\mathcal{O}(l^{2}\,n).

Forming TT in (27) requires applying AA to kk column vectors.

Computing the SVD (28) of TT costs O(k2m)\mathcal{O}(k^{2}\,m).

Forming VV in (29) costs O(k2n)\mathcal{O}(k^{2}\,n).

Summing up the costs in Steps 1–5 above, and using the fact that klmnk\leq l\leq m\leq n, we conclude that the algorithm of Subsection 4.1 costs

floating-point operations, where CAC_{A} is the cost of applying AA to a real n×1n\times 1 column vector, and CATC_{A^{\hbox{\tiny{\rm T}}}} is the cost of applying ATA^{\hbox{\scriptsize{\rm T}}} to a real m×1m\times 1 column vector.

We observe that the algorithm only requires applying AA to il+kil+k vectors and ATA^{\hbox{\scriptsize{\rm T}}} to il+lil+l vectors; it does not require explicit access to the individual entries of AA. This consideration can be important when AA and ATA^{\hbox{\scriptsize{\rm T}}} are available solely in the form of procedures for their applications to arbitrary vectors. Often such procedures for applying AA and ATA^{\hbox{\scriptsize{\rm T}}} cost much less than the standard procedure for applying a dense matrix to a vector.

3 A modified algorithm

In this subsection, we describe a simple modification of the algorithm described in Subsection 4.1. Again, suppose that ii, kk, ll, mm, and nn are positive integers with k<lmkk<l\leq m-k and mnm\leq n, and AA is a real m×nm\times n matrix. Then, the following five-step algorithm constructs an approximation to an SVD of ATA^{\hbox{\scriptsize{\rm T}}} such that

with very high probability, where UU is a real n×kn\times k matrix whose columns are orthonormal, VV is a real m×km\times k matrix whose columns are orthonormal, Σ\Sigma is a real diagonal k×kk\times k matrix whose entries are all nonnegative, σk+1\sigma_{k+1} is the (k+1)st(k+1)^{\rm st} greatest singular value of AA, and CC is a constant independent of AA that depends on the parameters of the algorithm:

Using a random number generator, form a real l×ml\times m matrix GG whose entries are i.i.d. Gaussian random variables of zero mean and unit variance, and compute the l×ml\times m product matrix

Using an SVD, form a real m×km\times k matrix QQ whose columns are orthonormal, such that there exists a real k×lk\times l matrix SS for which

where ρk+1\rho_{k+1} is the (k+1)st(k+1)^{\rm st} greatest singular value of RR. (See Observation 4 for details concerning the construction of such a matrix QQ.)

where UU is a real n×kn\times k matrix whose columns are orthonormal, WW is a real k×kk\times k matrix whose columns are orthonormal, and Σ\Sigma is a real diagonal k×kk\times k matrix whose entries are all nonnegative. (See, for example, Chapter 8 in for details concerning the construction of such an SVD.)

Clearly, (53) is similar to (24), as (54) is similar to (25).

The ideas of Remark 4.15 are obviously relevant to the algorithm of the present subsection, too.

4 Blanczos

In this subsection, we describe a modification of the algorithm of Subsection 4.1, enhancing the accuracy at a little extra computational expense. Suppose that ii, kk, ll, mm, and nn are positive integers with k<lk<l and (i+1)lmk(i+1)l\leq m-k, and AA is a real m×nm\times n matrix, such that mnm\leq n. Then, the following five-step algorithm constructs an approximation UΣVTU\,\Sigma\,V^{\hbox{\scriptsize{\rm T}}} to an SVD of AA:

Using a random number generator, form a real l×ml\times m matrix GG whose entries are i.i.d. Gaussian random variables of zero mean and unit variance, and compute the l×nl\times n matrices R(0)R^{(0)}, R(1)R^{(1)}, …, R(i1)R^{(i-1)}, R(i)R^{(i)} defined via the formulae

Using a pivoted QRQR-decomposition algorithm, form a real n×((i+1)l)n\times((i+1)l) matrix QQ whose columns are orthonormal, such that there exists a real ((i+1)l)×((i+1)l)((i+1)l)\times((i+1)l) matrix SS for which

(See, for example, Chapter 5 in for details concerning the construction of such a matrix QQ.)

Compute the m×((i+1)l)m\times((i+1)l) product matrix

where UU is a real m×((i+1)l)m\times((i+1)l) matrix whose columns are orthonormal, WW is a real ((i+1)l)×((i+1)l)((i+1)l)\times((i+1)l) matrix whose columns are orthonormal, and Σ\Sigma is a real diagonal ((i+1)l)×((i+1)l)((i+1)l)\times((i+1)l) matrix whose entries are all nonnegative. (See, for example, Chapter 8 in for details concerning the construction of such an SVD.)

Compute the n×((i+1)l)n\times((i+1)l) product matrix

An analysis similar to the proof of Theorem 13 above shows that the matrices UU, Σ\Sigma, and VV produced by the algorithm of the present subsection satisfy the same upper bounds (33) and (46) as the matrices produced by the algorithm of Subsection 4.1. If desired, one may produce a similarly accurate rank-kk approximation by arranging UU, Σ\Sigma, and VV such that the diagonal entries of Σ\Sigma appear in nonincreasing order, and then discarding all but the leftmost kk columns of UU and all but the leftmost kk columns of VV, and retaining only the leftmost uppermost k×kk\times k block of Σ\Sigma. We will refer to the algorithm of the present subsection as “blanczos,” due to its similarity with the block Lanczos method (see, for example, Subsection 9.2.6 in for a description of the block Lanczos method).

Numerical results

In this section, we illustrate the performance of the algorithm of the present paper via several numerical examples.

We use the algorithm to construct a rank-kk approximation, with k=10k=10, to the m×(2m)m\times(2m) matrix AA defined via its singular value decomposition

where U(A)U^{(A)} is an m×mm\times m Hadamard matrix (a unitary matrix whose entries are all ±1/m\pm 1/\sqrt{m}), V(A)V^{(A)} is a (2m)×(2m)(2m)\times(2m) Hadamard matrix, and Σ(A)\Sigma^{(A)} is an m×(2m)m\times(2m) matrix whose entries are zero off the main diagonal, and whose diagonal entries are defined in terms of the (k+1)st(k+1)^{\rm st} singular value σk+1\sigma_{k+1} via the formulae

for j=1j=1, 22, …, 99, 1010, where j/2\lfloor j/2\rfloor is the greatest integer less than or equal to j/2j/2, and

for j=11j=11, 1212, …, m1m-1, mm. Thus, σ1=1\sigma_{1}=1 and σk=σk+1\sigma_{k}=\sigma_{k+1} (recall that k=10k=10). We always choose σk+1<1\sigma_{k+1}<1, so that σ1σ2σm1σm\sigma_{1}\geq\sigma_{2}\geq\dots\geq\sigma_{m-1}\geq\sigma_{m}.

Figure 1 plots the singular values σ1\sigma_{1}, σ2\sigma_{2}, …, σm1\sigma_{m-1}, σm\sigma_{m} of AA with m=512m=512 and σk+1=.001\sigma_{k+1}=.001; these parameters correspond to the first row of numbers in Table 1, the first row of numbers in Table 2, and the first row of numbers in Table 6.

Table 1 reports the results of applying the five-step algorithm of Subsection 4.1 to matrices of various sizes, with i=1i=1. Table 2 reports the results of applying the five-step algorithm of Subsection 4.1 to matrices of various sizes, with i=0i=0. The algorithms of , , and for low-rank approximation are essentially the same as the algorithm used for Table 2 (with i=0i=0).

Table 3 reports the results of applying the five-step algorithms of Subsections 4.1 and 4.3 with varying numbers of iterations ii. Rows in the table where ii is enclosed in parentheses correspond to the algorithm of Subsection 4.3; rows where ii is not enclosed in parentheses correspond to the algorithm of Subsection 4.1.

Table 4 reports the results of applying the five-step algorithm of Subsection 4.1 to matrices whose best rank-kk approximations have varying accuracies. Table 5 reports the results of applying the blanczos algorithm of Subsection 4.4 to matrices whose best rank-kk approximations have varying accuracies.

Table 6 reports the results of calculating pivoted QRQR-decompositions, via plane (Householder) reflections, of matrices of various sizes. We computed the pivoted QRQR-decomposition of the transpose of AA defined in (69), rather than of AA itself, for reasons of accuracy and efficiency. As pivoted QRQR-decomposition requires dense matrix arithmetic, our 1 GB of random-access memory (RAM) imposed the limit m4096m\leq 4096 for Table 6.

The headings of the tables have the following meanings:

mm is the number of rows in AA, the matrix being approximated.

nn is the number of columns in AA, the matrix being approximated.

ii is the integer parameter used in the algorithms of Subsections 4.1, 4.3, and 4.4. Rows in the tables where ii is enclosed in parentheses correspond to the algorithm of Subsection 4.3; rows where ii is not enclosed in parentheses correspond to either the algorithm of Subsection 4.1 or that of Subsection 4.4.

tt is the time in seconds required by the algorithm to create an approximation and compute its accuracy δ\delta.

σk+1\sigma_{k+1} is the (k+1)st(k+1)^{\rm st} greatest singular value of AA, the matrix being approximated; σk+1\sigma_{k+1} is also the accuracy of the best possible rank-kk approximation to AA.

δ\delta is the accuracy of the approximation UΣVTU\,\Sigma\,V^{\hbox{\scriptsize{\rm T}}} (or (QRP)T(QRP)^{\hbox{\scriptsize{\rm T}}}, for Table 6) constructed by the algorithm. For Tables 1–5,

where UU is an m×km\times k matrix whose columns are orthonormal, VV is an n×kn\times k matrix whose columns are orthonormal, and Σ\Sigma is a diagonal k×kk\times k matrix whose entries are all nonnegative; for Table 6,

where PP is an m×mm\times m permutation matrix, RR is a k×mk\times m upper-triangular (meaning upper-trapezoidal) matrix, and QQ is an n×kn\times k matrix whose columns are orthonormal.

The values for tt are the average values over 3 independent randomized trials of the algorithm. The values for δ\delta are the worst (maximum) values encountered in 3 independent randomized trials of the algorithm. The values for δ\delta in each trial are those produced by 20 iterations of the power method applied to AUΣVTA-U\,\Sigma\,V^{\hbox{\scriptsize{\rm T}}} (or A(QRP)TA-(QRP)^{\hbox{\scriptsize{\rm T}}}, for Table 6), started with a vector whose entries are i.i.d. centered Gaussian random variables. The theorems of and guarantee that this power method produces accurate results with overwhelmingly high probability.

We performed all computations using IEEE standard double-precision variables, whose mantissas have approximately one bit of precision less than 16 digits (so that the relative precision of the variables is approximately .2E–15). We ran all computations on one core of a 1.86 GHz Intel Centrino Core Duo microprocessor with 2 MB of L2 cache and 1 GB of RAM. We compiled the Fortran 77 code using the Lahey/Fujitsu Linux Express v6.2 compiler, with the optimization flag --o2 enabled. We implemented a fast Walsh-Hadamard transform to apply rapidly the Hadamard matrices U(A)U^{(A)} and V(A)V^{(A)} in (69). We used plane (Householder) reflections to compute all pivoted QRQR-decompositions. We used the LAPACK 3.1.1 divide-and-conquer SVD routine dgesdd to compute all full SVDs. For the parameter ll, we set l=12l=12 (=k+2)(=k+2) for all of the examples reported here.

The experiments reported here and our further tests point to the following:

The accuracies in Table 1 are superior to those in Table 2; the algorithm performs much better with i>0i>0. (The algorithms of , , and for low-rank approximation are essentially the same as the algorithm used for Tables 1 and 2 when i=0i=0.)

The accuracies in Table 1 are superior to the corresponding accuracies in Table 6; the algorithm of the present paper produces higher accuracy than the classical pivoted QRQR-decompositions for matrices whose spectra decay slowly (such as those matrices tested in the present section).

The accuracies in Tables 1–3 appear to be proportional to m1/(4i+2)σk+1m^{1/(4i+2)}\,\sigma_{k+1} for the algorithm of Subsection 4.1, and to be proportional to m1/(4i)σk+1m^{1/(4i)}\,\sigma_{k+1} for the algorithm of Subsection 4.3, in accordance with (24) and (53). The numerical results reported here, as well as our further experiments, indicate that the theoretical bound (33) on the accuracy should remain valid with a greatly reduced constant in the right-hand side, independent of the matrix AA being approximated. See item 6 below for a discussion of Tables 4 and 5.

The timings in Tables 1–5 are consistent with (52), as we could (and did) apply the Hadamard matrices U(A)U^{(A)} and V(A)V^{(A)} in (69) to vectors via fast Walsh-Hadamard transforms at a cost of O(mlog(m))\mathcal{O}(m\,\log(m)) floating-point operations per matrix-vector multiplication.

The quality of the pseudorandom number generator has almost no effect on the accuracy of the algorithm, nor does substituting uniform variates for the normal variates.

The accuracies in Table 5 are superior to those in Table 4, particularly when the kthk^{\rm th} greatest singular value σk\sigma_{k} of the matrix AA being approximated is very small. Understandably, the algorithm of Subsection 4.1 would seem to break down when (σk)2i+1(\sigma_{k})^{2i+1} is less than the machine precision, while σk\sigma_{k} itself is not, unlike the blanczos algorithm of Subsection 4.4. When (σk)2i+1(\sigma_{k})^{2i+1} is much less than the machine precision, while σk\sigma_{k} is not, the accuracy of blanczos in the presence of roundoff is similar to that of the algorithm of Subsection 4.1 run with a reduced ii. When (σk)2i+1(\sigma_{k})^{2i+1} is much greater than the machine precision, the accuracy of blanczos is similar to that of the algorithm of Subsection 4.1 run with ii being the same as in the blanczos algorithm. Since the blanczos algorithm of Subsection 4.4 is so tolerant of roundoff, we suspect that the blanczos algorithm is a better general-purpose black-box tool for the computation of principal component analyses, despite its somewhat higher cost as compared with the algorithms of Subsections 4.1 and 4.3.

A MATLAB{}^{\hbox{\ooalign{\hfil\raise 0.60275pt\hbox{{\tiny R}}\hfil\crcr\text{\mathchar 524\relax}}}} implementation of the blanczos algorithm of Subsection 4.4 is available on the file exchange at http://www.mathworks.com in the package entitled, “Principal Component Analysis.”

Appendix

In this appendix, we restate and prove Lemmas 9 and 10 from Section 3.

The following lemma, stated earlier as Lemma 9 in Section 3, shows that the product AQQTA\,Q\,Q^{\hbox{\scriptsize{\rm T}}} of matrices AA, QQ, and QTQ^{\hbox{\scriptsize{\rm T}}} is a good approximation to a matrix AA, provided that there exist matrices GG and SS such that

QSQ\,S is a good approximation to (G(AAT)iA)T(G\,(A\,A^{\hbox{\scriptsize{\rm T}}})^{i}\,A)^{\hbox{\scriptsize{\rm T}}}, and

there exists a matrix FF such that F\|F\| is not too large, and FG(AAT)iAF\,G\,(A\,A^{\hbox{\scriptsize{\rm T}}})^{i}\,A is a good approximation to AA.

Suppose that ii, kk, ll, mm, and nn are positive integers with klmnk\leq l\leq m\leq n. Suppose further that AA is a real m×nm\times n matrix, QQ is a real n×kn\times k matrix whose columns are orthonormal, SS is a real k×lk\times l matrix, FF is a real m×lm\times l matrix, and GG is a real l×ml\times m matrix.

The proof is straightforward, but tedious, as follows.

We obtain from the triangle inequality that

First, we provide a bound for AQQTFGBQQT\|A\,Q\,Q^{\hbox{\scriptsize{\rm T}}}-F\,G\,B\,Q\,Q^{\hbox{\scriptsize{\rm T}}}\|. Clearly,

It follows from the fact that the columns of QQ are orthonormal that

Next, we provide a bound for FGBQQTFGB\|F\,G\,B\,Q\,Q^{\hbox{\scriptsize{\rm T}}}-F\,G\,B\|. Clearly,

It follows from the triangle inequality that

Also, it follows from the fact that the columns of QQ are orthonormal that

Combining (76), (80), (88), and (75) yields (74).

The following lemma, stated earlier as Lemma 10 in Section 3, shows that, for any positive integer ii, matrix AA, and matrix GG whose entries are i.i.d. Gaussian random variables of zero mean and unit variance, with very high probability there exists a matrix FF with a reasonably small norm, such that FG(AAT)iAF\,G\,(A\,A^{\hbox{\scriptsize{\rm T}}})^{i}\,A is a good approximation to AA. This lemma is similar to Lemma 19 of .

Suppose that ii, jj, kk, ll, mm, and nn are positive integers with j<k<l<mnj<k<l<m\leq n. Suppose further that AA is a real m×nm\times n matrix, GG is a real l×ml\times m matrix whose entries are i.i.d. Gaussian random variables of zero mean and unit variance, and β\beta and γ\gamma are positive real numbers, such that the jthj^{\it th} greatest singular value σj\sigma_{j} of AA is positive, γ>1\gamma>1, and

Then, there exists a real m×lm\times l matrix FF such that

with probability not less than Φ\Phi defined in (89), where σj\sigma_{j} is the jthj^{\it th} greatest singular value of AA, σj+1\sigma_{j+1} is the (j+1)st(j+1)^{\it st} greatest singular value of AA, and σk+1\sigma_{k+1} is the (k+1)st(k+1)^{\it st} greatest singular value of AA.

We prove the existence of a matrix FF satisfying (90) and (91) by constructing one.

where UU is a real unitary m×mm\times m matrix, Σ\Sigma is a real diagonal m×mm\times m matrix, and VV is a real n×mn\times m matrix whose columns are orthonormal, such that

for p=1p=1, 22, …, m1m-1, mm, where Σp,p\Sigma_{p,p} is the entry in row pp and column pp of Σ\Sigma, and σp\sigma_{p} is the pthp^{\rm th} greatest singular value of AA.

Next, we define auxiliary matrices HH, RR, Γ\Gamma, SS, TT, Θ\Theta, and PP. We define HH to be the leftmost l×jl\times j block of the l×ml\times m matrix GUG\,U, RR to be the l×(kj)l\times(k-j) block of GUG\,U whose first column is the (k+1)st(k+1)^{\rm st} column of GUG\,U, and Γ\Gamma to be the rightmost l×(mk)l\times(m-k) block of GUG\,U, so that

Combining the fact that UU is real and unitary, and the fact that the entries of GG are i.i.d. Gaussian random variables of zero mean and unit variance, we see that the entries of HH are also i.i.d. Gaussian random variables of zero mean and unit variance, as are the entries of RR, and as are the entries of Γ\Gamma. We define H(1)H^{(-1)} to be the real j×lj\times l matrix given by the formula

(HTHH^{\hbox{\scriptsize{\rm T}}}\,H is invertible with high probability due to Lemma 7). We define SS to be the leftmost uppermost j×jj\times j block of Σ\Sigma, TT to be the (kj)×(kj)(k-j)\times(k-j) block of Σ\Sigma whose leftmost uppermost entry is the entry in the (j+1)st(j+1)^{\rm st} row and (j+1)st(j+1)^{\rm st} column of Σ\Sigma, and Θ\Theta to be the rightmost lowermost (mk)×(mk)(m-k)\times(m-k) block of Σ\Sigma, so that

We define PP to be the real m×lm\times l matrix whose uppermost j×lj\times l block is the product S2iH(1)S^{-2i}\,H^{(-1)}, whose entries are zero in the (kj)×l(k-j)\times l block whose first row is the (j+1)st(j+1)^{\rm st} row of PP, and whose entries in the lowermost (mk)×l(m-k)\times l block are zero, so that

Finally, we define FF to be the m×lm\times l matrix given by

Combining (95), (6), the fact that the entries of HH are i.i.d. Gaussian random variables of zero mean and unit variance, and Lemma 7 yields

Combining (98), (99), (96), (93), the fact that Σ\Sigma is zero off its main diagonal, and the fact that UU is unitary yields (91).

We now show that FF defined in (98) satisfies (90).

Combining (101)–(108) and the fact that the columns of UU are orthonormal, as are the columns of VV, yields

Combining Lemma 6 and the fact that the entries of RR are i.i.d. Gaussian random variables of zero mean and unit variance, as are the entries of Γ\Gamma, yields

Combining (109), (99), (110), and (111) yields

with probability not less than Φ\Phi defined in (89). Combining (113), the fact that σj+1σj\sigma_{j+1}\leq\sigma_{j}, and the fact that

for any nonnegative real numbers xx and yy yields (90).

Acknowledgements

We thank Ming Gu for suggesting the combination of the Lanczos method with randomized methods for the low-rank approximation of matrices. We are grateful for many helpful discussions with R. Raphael Coifman and Yoel Shkolnisky. We thank the anonymous referees for their useful suggestions.

References