Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments

Jiyan Yang, Xiangrui Meng, Michael W. Mahoney

Introduction

Matrix algorithms lie at the heart of many applications, both historically in areas such as signal processing and scientific computing as well as more recently in areas such as machine learning and data analysis. Essentially, the reason is that matrices provide a convenient mathematical structure with which to model data arising in a broad range of applications: an m×nm\times n real-valued matrix AA provides a natural structure for encoding information about mm objects, each of which is described by nn features. Alternatively, an n×nn\times n real-valued matrix AA can be used to describe the correlations between all pairs of nn data points, or the weighted edge-edge adjacency matrix structure of an nn-node graph. In astronomy, for example, very small angular regions of the sky imaged at a range of electromagnetic frequency bands can be represented as a matrix—in that case, an object is a region and the features are the elements of the frequency bands. Similarly, in genetics, DNA SNP (Single Nucleotide Polymorphism) or DNA microarray expression data can be represented in such a framework, with AijA_{ij} representing the expression level of the ithi^{th} gene or SNP in the jthj^{th} experimental condition or individual. Similarly, term-document matrices can be constructed in many Internet applications, with AijA_{ij} indicating the frequency of the jthj^{th} term in the ithi^{th} document.

Most traditional algorithms for matrix problems are designed to run on a single machine, focusing on minimizing the number of floating-point operations per second (FLOPS). On the other hand, motivated by the ability to generate very large quantities of data in relatively automated ways, analyzing data sets of billions or more of records has now become a regular task in many companies and institutions. In a distributed computational environment, which is typical in these applications, communication costs, e.g., between different machines, are often much more important than computational costs. What is more, if the data cannot fit into memory on a single machine, then one must scan the records from secondary storage, e.g., hard disk, which makes each pass through the data associated with enormous I/O costs. Given that, in many of these large-scale applications, regression, low-rank approximation, and related matrix problems are ubiquitous, the fast computation of their solutions on large-scale data platforms is of interest.

Several important conclusions will emerge from our presentation.

First, many of the basic ideas from RandNLA in RAM extend to RandNLA in parallel/distributed environments in a relatively straightforward manner, assuming that one is more concerned about communication than computation. This is important from an algorithm design perspective, as it highlights which aspects of these RandNLA algorithms are peculiar to the use of randomization and which aspects are peculiar to parallel/distributed environments.

Second, with appropriate engineering of random sampling and random projection algorithms, it is possible to compute good approximate solutions—to low precision (e.g., 11 or 22 digits of precision), medium precision (e.g., 33 or 44 digits of precision), or high precision (e.g., up to machine precision)—to several common matrix problems in only a few passes over the original matrix on up to terabyte-sized data. While low precision is certainly appropriate for many data analysis and machine learning applications involving noisy input data, the appropriate level of precision is a choice for user of an algorithm to make; and there are obvious advantages to having the developer of an algorithm provide control to the user on the quality of the answer returned by the algorithm.

Third, the design principles for developing high-quality RandNLA matrix algorithms depend strongly on whether one is interested in low, medium, or high precision. (An example of this is whether to solve the randomized subproblem with a traditional method or to use the randomized subproblem to create a preconditioned version of the original problem.) Understanding these principles, the connections between them, and how they relate to traditional principles of NLA algorithm design is important for providing high-quality implementations of recent theoretical developments in the RandNLA literature.

Although many of the ideas we will discuss can be extended to related matrix problems such as low-rank matrix approximation, there are two main reasons for restricting attention to strongly rectangular data. The first, most obvious, reason is that strongly rectangular data arises in many fields to which machine learning and data analysis methods are routinely applied. Consider, e.g., Table 1, which lists a few examples.

In genetics, single nucleotide polymorphisms (SNPs) are important in the study of human health. There are roughly 1010 million SNPs in the human genome. However, there are typically at most a few thousand subjects for a study of a certain type of disease, due to the high cost of determination of genotypes and limited number of target subjects.

In Internet applications, strongly rectangular datasets are common. For example, the image dataset called TinyImages which contains 8080 million images of size 32×3232\times 32 collected from Internet.

In spatial discretization of high-dimensional partial differential equations (PDEs), the number of degrees of freedom grows exponentially as dimension increases. For 3D problems, it is common that the number of degrees of freedom reaches 10910^{9}, for example, by having a 1000×1000×10001000\times 1000\times 1000 discretization of a cubic domain. However, for a time-dependent problem, time stays one-dimensional. Though depending on spatial discretization (e.g., the Courant-Friedrichs-Lewy condition for hyperbolic PDEs), the number of time steps is usually much less than the number of degrees of freedoms in spatial discretization.

In geophysical applications, especially in seismology, the number of sensors is much less than the number of data points each sensor collects. For example, Werner-Allen et al. deployed three wireless sensors to monitor volcanic eruptions. In 54 hours, each sensor sent back approximately 2020 million packets.

In natural language processing (NLP), the number of documents is much less than the number of nn-grams, which grows geometrically as nn increases. For example, the webspamhttp://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html dataset contains 350,000 documents and 254 unigrams, but 680,715 trigrams.

In high-frequency trading, the number of relevant stocks is much less than the number of ticks, changes to the best bid and ask. For example, in 2012 ISE Historical Options Tick Datahttp://www.ise.com/hotdata has daily files with average size greater than 100GB uncompressed.

It is worth emphasizing that the phrase “parallel and distributed” can mean quite different things to different research communities, in particular to what might be termed HPC (high performance computing) or scientific computing researchers versus data analytics or database or distributed data systems researchers. There are important technical and cultural differences here, but there are also some important similarities. For example, to achieve parallelism, one can use multi-threading on a shared-memory machine, or one can use message passing on a multi-node cluster. Alternatively, to process massive data on large commodity clusters, Google’s MapReduce describes a computational framework for distributed computation with fault tolerance. For computation not requiring any internode communication, one can achieve even better parallelism. We don’t want to dwell on many of these important details here: this is a complicated and evolving space; and no doubt the details of the implementation of many widely-used algorithms will evolve as the space evolves. To give the interested reader a quick sense of some of these issues, though, here we provide a very high-level representative description of parallel environments and how they scale. As one goes down this list, one tends to get larger and larger.

In addition, it is also worth emphasizing that there is a great deal of related work in parallel and distributed computing, both in numerical linear algebra as well as more generally in scientific computing. For example, Valiant has provided a widely-used model for parallel computation ; Aggarwal et al. have analyzed the communication complexity of PRAMs ; Lint and Agerwala have highlighted communication issues that arise in the design of parallel algorithms ; Heller has surveyed parallel algorithms in numerical linear algebra ; Toledo has provided a survey of out-of-core algorithms in numerical linear algebra ; Ballard et al. have focused on developing algorithms for minimizing communication in numerical linear algebra ; and Bertsekas and Tsitsiklis have surveyed parallel and distributed iterative algorithms . We expect that some of the most interesting developments in upcoming years will involve coupling the ideas for implementing RandNLA algorithms in parallel and distributed environments that we describe in this review with these more traditional ideas for performing parallel and distributed computation.

RandNLA in RAM

In this section, we will highlight several core ideas that have been central to prior work in RandNLA in (theory and/or practice in) RAM that we will see are also important as design principles for extending RandNLA methods to larger-scale parallel and distributed environments. We start in Section 2.1 by describing a prototypical example of a RandNLA algorithm for the very overdetermined LS problem; then we describe in Section 2.2 two problem-specific complexity measures that are important for low-precision and high-precision solutions to matrix problems, respectively, as well as two complementary ways in which randomization can be used by RandNLA algorithms; and we conclude in Section 2.3 with a brief discussion of running time considerations.

A prototypical example of the RandNLA approach is given by the following meta-algorithm for very overdetermined LS problems . In particular, the problem of interest is to solve:

The following meta-algorithm takes as input an m×nm\times n matrix AA, where mnm\gg n, a vector bb, and a probability distribution {πi}i=1m\{\pi_{i}\}_{i=1}^{m}, and it returns as output an approximate solution x^\hat{x}, which is an estimate of the exact answer xx^{*} of Problem (1).

Randomly sampling. Randomly sample r>nr>n constraints, i.e., rows of AA and the corresponding elements of bb, using {πi}i=1m\{\pi_{i}\}_{i=1}^{m} as an importance sampling distribution.

Subproblem construction. Rescale each sampled row/element by 1/(rπi)1/(r\pi_{i}) to form a weighted LS subproblem.

Solving the subproblem. Solve the weighted LS subproblem, formally given in (2) below, and then return the solution x^\hat{x}.

It is convenient to describe this meta-algorithm in terms of a random “sampling matrix” SS, in the following manner. If we draw rr samples (rows or constraints or data points) with replacement, then define an r×mr\times m sampling matrix, SS, where each of the rr rows of SS has one non-zero element indicating which row of AA (and element of bb) is chosen in a given random trial. In this case, the (i,k)th(i,k)^{th} element of SS equals 1/rπk1/\sqrt{r\pi_{k}} if the kthk^{th} data point is chosen in the ithi^{th} random trial (meaning, in particular, that every non-zero element of SS equals n/r\sqrt{n/r} for sampling uniformly at random). With this notation, this meta-algorithm constructs and solves the weighted LS estimator:

Since this meta-algorithm samples constraints and not variables, the dimensionality of the vector x^\hat{x} that solves the (still overconstrained, but smaller) weighted LS subproblem is the same as that of the vector xx^{*} that solves the original LS problem. The former may thus be taken as an approximation of the latter, where, of course, the quality of the approximation depends critically on the choice of {πi}i=1n\{\pi_{i}\}_{i=1}^{n}. Although uniform subsampling (with or without replacement) is very simple to implement, it is easy to construct examples where it will perform very poorly . On the other hand, it has been shown that, for a parameter γ(0,1]\gamma\in(0,1] that can be tuned, if

where the so-called statistical leverage scores hiih_{ii} are defined in (6) below, i.e., if one draws the sample according to an importance sampling distribution that is proportional to the leverage scores of AA, then with constant probability (that can be easily boosted to probability 1δ1-\delta, for any δ>0\delta>0) the following relative-error bounds hold:

where κ(A)\kappa(A) is the condition number of AA and where ξ=UUTb2/b2\xi=||UU^{T}b||_{2}/||b||_{2} is a parameter defining the amount of the mass of bb inside the column space of AA .

Due to the crucial role of the statistical leverage scores in (3), this canonical RandNLA procedure has been referred to as the algorithmic leveraging approach to approximating LS approximation . In addition, although this meta-algorithm has been described here only for very overdetermined LS problems, it generalizes to other linear regression problems and low-rank matrix approximation problems on less rectangular matricesLet AA be a matrix with dimension mm by nn where m>nm>n. A less rectangular matrix is a matrix that has smaller m/nm/n. .

2 Leveraging, conditioning, and using randomization

Leveraging and conditioning refer to two types of problem-specific complexity measures, i.e., quantities that can be computed for any problem instance that characterize how difficult that problem instance is for a particular class of algorithms. Understanding these, as well as different uses of randomization in algorithm design, is important for designing RandNLA algorithms, both in theory and/or practice in RAM as well as in larger parallel and distributed environments. For now, we describe these in the context of very overdetermined LS problems.

Statistical leverage. (Related to eigenvectors; important for obtaining low-precision solutions.) If we let H=A(ATA)1ATH=A(A^{T}A)^{-1}A^{T}, where the inverse can be replaced with the Moore-Penrose pseudoinverse if AA is rank deficient, be the projection matrix onto the column span of AA, then the ithi^{th} diagonal element of HH,

where A(i)A_{(i)} is the ithi^{th} row of AA, is the statistical leverage of ithi^{th} observation or sample. Since HH can alternatively be expressed as H=UUTH=UU^{T}, where UU is any orthogonal basis for the column space of XX, e.g., the QQ matrix from a QR decomposition or the matrix of left singular vectors from the thin SVD, the leverage of the ithi^{th} observation can also be expressed as

where U(i)U_{(i)} is the ithi^{th} row of UU. Leverage scores provide a notion of “coherence” or “outlierness,” in that they measure how well-correlated the singular vectors are with the canonical basis as well as which rows/constraints have largest “influence” on the LS fit . Computing the leverage scores {hii}i=1m\{h_{ii}\}_{i=1}^{m} exactly is generally as hard as solving the original LS problem (but 1±ϵ1\pm\epsilon approximations to them can be computed more quickly, for arbitrary input matrices ).

Leverage scores are important from an algorithm design perspective since they define the key nonuniformity structure needed to control the complexity of high-quality random sampling algorithms. In particular, naïve uniform random sampling algorithms perform poorly when the leverage scores are very nonuniform, while randomly sampling in a manner that depends on the leverage scores leads to high-quality solutions. Thus, in designing RandNLA algorithms, whether in RAM or in parallel-distributed environments, one must either quickly compute approximations to the leverage scores or quickly preprocess the input matrix so they are nearly uniformized—in which case uniform random sampling on the preprocessed matrix performs well.

Informally, the leverage scores characterize where in the high-dimensional Euclidean space the (singular value) information in AA is being sent, i.e., how the quadratic well (with aspect ratio κ(A)\kappa(A) that is implicitly defined by the matrix AA) “sits” with respect to the canonical axes of the high-dimensional Euclidean space. If one is interested in obtaining low-precision solutions, e.g., ϵ=101\epsilon=10^{-1}, that can be obtained by an algorithm that provides 1±ϵ1\pm\epsilon relative-error approximations for a fixed value of ϵ\epsilon but whose ϵ\epsilon dependence is polynomial in 1/ϵ1/\epsilon, then the key quantities that must be dealt with are statistical leverage scores of the input data.

Monte Carlo versus Las Vegas uses of randomization. Note that the guarantee provided by the meta-algorithm, as stated above, is of the following form: the algorithm runs in no more than a specified time TT, and with probability at least 1δ1-\delta it returns a solution that is an ϵ\epsilon-good approximation to the exact solution. Randomized algorithms that provide guarantees of this form, i.e., with running time that is is deterministic, but whose output may be incorrect with a certain small probability, are known as Monte Carlo algorithms . A related class of randomized algorithms, known as Las Vegas algorithms, provide a different type of guaranatee: they always produce the correct answer, but the amount of time they take varies randomly . In many applications of RandNLA algorithms, guarantees of this latter form are preferable.

3 Running Time Considerations in RAM

As presented, the meta-algorithm of the previous subsection has a running time that depends on both the time to construct the probability distribution, {πi}i=1n\{\pi_{i}\}_{i=1}^{n}, and the time to solve the subsampled problem. For uniform sampling, the former is trivial and the latter depends on the size of the subproblem. For estimators that depend on the exact or approximate (recall the flexibility in (3) provided by γ\gamma) leverage scores, the running time is dominated by the exact or approximate computation of those scores. A naïve algorithm involves using a QR decomposition or the thin SVD of AA to obtain the exact leverage scores. This naïve implementation of the meta-algorithm takes roughly O(mn2/ϵ)\operatorname{\mathcal{O}}(mn^{2}/\epsilon) time and is thus no faster (in the RAM model) than solving the original LS problem exactly . There are two other potential problems with practical implementations of the meta-algorithm: the running time dependence of roughly O(mn2/ϵ)\operatorname{\mathcal{O}}(mn^{2}/\epsilon) time scales polynomially with 1/ϵ1/\epsilon, which is prohibitive if one is interested in moderately small (e.g., 10410^{-4}) to very small (e.g., 101010^{-10}) values of ϵ\epsilon; and, since this is a randomized Monte Carlo algorithm, with some probability δ\delta the algorithm might completely fail.

Importantly, all three of these potential problems can be solved to yield improved variants of the meta-algorithm.

Making the algorithm fast: improving the dependence on mm and nn. We can make this meta-algorithm “fast” in worst-case theory in RAM . In particular, this meta-algorithm runs in O(mnlogn/ϵ)\operatorname{\mathcal{O}}(mn\log n/\epsilon) time in RAM if one does either of the following: if one performs a Hadamard-based random random projection and then performs uniform sampling in the randomly rotated basis (which, recall, is basically what random projection algorithms do when applied to vectors in a Euclidean space ); or if one quickly computes approximations to the statistical leverage scores (using the algorithm of , the running time bottleneck of which is applying a random projection to the input data) and then uses those approximate scores as an importance sampling distribution . In addition, by using carefully-constructed extremely-sparse random projections, both of these two approaches can be made to run in so-called “input sparsity time,” i.e., in time proportional to the number of nonzeros in the input data, plus lower-order terms that depend on the lower dimension of the input matrix .

Making the algorithm high-precision: improving the dependence on ϵ\epsilon. We can make this meta-algorithm “fast” in practice, e.g., in “high precision” numerical implementation in RAM . In particular, this meta-algorithm runs in O(mnlognlog(1/ϵ))\operatorname{\mathcal{O}}(mn\log n\log(1/\epsilon)) time in RAM if one uses the subsampled problem constructed by the random projection/sampling process to construct a preconditioner, using it as a preconditioner for a traditional iterative algorithm on the original full problem . This is important since, although the worst-case theory holds for any fixed ϵ\epsilon, it is quite coarse in the sense that the sampling complexity depends on ϵ\epsilon as 1/ϵ1/\epsilon and not log(1/ϵ)\log(1/\epsilon). In particular, this means that obtaining high-precision with (say) ϵ=1010\epsilon=10^{-10} is not practically possible. In this iterative use case, there are several tradeoffs: e.g., one could construct a very high-quality preconditioner (e.g., using a number of samples that would yield a 1+ϵ1+\epsilon error approximation if one solved the LS problem on the subproblem) and perform fewer iterations, or one could construct a lower quality preconditioner by drawing many fewer samples and perform a few extra iterations. Here too, the input sparsity time algorithm of could be used to improve the running time still further.

Dealing with the δ\delta failure probability. Although fixing a failure probability δ\delta is convenient for theoretical analysis, in certain applications having even a very small probability that the algorithm might return a completely meaningless answer is undesirable. In this case, one is interested in converting a Monte Carlo algorithm into a Las Vegas algorithm. Fortuitously, those application areas, e.g., scientific computing, are often more interested in moderate to high precision solutions than in low precision solutions. In these case, using the subsampled problem to create a preconditioner for iterative algorithms on the original problem has the side effect that one changes a “fixed running time but might fail” algorithm to an “expected running time but will never fail” algorithm.

From above, we can make the following conclusions. The “fast in worst-case theory” variants of our meta-algorithm () represent qualitative improvements to the O(mn2)\operatorname{\mathcal{O}}(mn^{2}) worst-case asymptotic running time of traditional algorithms for the LS problem going back to Gaussian elimination. The “fast in numerical implementation” variants of the meta-algorithm () have been shown to beat Lapack’s direct dense least-squares solver by a large margin on essentially any dense tall matrix, illustrating that the worst-case asymptotic theory holds for matrices as small as several thousand by several hundred .

We briefly list the notation conventions we follow in this work:

We use uppercase letters to denote matrices and constants, e.g., AA, RR, CC, etc.

We use lowercase letters to denote vectors and scalars, e.g., xx, bb, pp, mm, nn, etc.

We use uppercase calligraphic letters to denote point sets, e.g., A\mathcal{A} for the linear subspace spanned by AA’s columns, C\mathcal{C} for a convex set, and E\mathcal{E} for an ellipsoid, except that O\operatorname{\mathcal{O}} is used for big O-notation.

Consider, next, the special case p=2p=2. If, in the LS problem

we let r=rank(A)min(m,n)r=\operatorname{rank}(A)\leq\min(m,n), then recall that if r<nr<n (the LS problem is under-determined or rank-deficient), then (10) has an infinite number of minimizers. In that case, the set of all minimizers is convex and hence has a unique element having minimum length. On the other hand, if r=nr=n so the problem has full rank, there exists only one minimizer to (10) and hence it must have the minimum length. In either case, we denote this unique min-length solution to (10) by xx^{*}, and we are interested in computing xx^{*} in this work. This was defined in Problem (1) above. In this case, we will also be interested in bounding xx^2\|x^{*}-\hat{x}\|_{2}, for arbitrary or worst-case input, where x^\hat{x} was defined in Problem (2) above and is an approximation to xx^{*}.

For simplicity, we use κp\kappa_{p}, σpmin\sigma_{p}^{\min}, and σpmax\sigma_{p}^{\max} when the underlying matrix is clear.

That is, by Lemma 1, if mnm\gg n, then the notions of condition number provided by Definition 4 and Definition 5 are equivalent, up to low-dimensional factors. These low-dimensional factors typically do not matter in theoretical formulations of the problem, but they can matter in practical implementations.

Preconditioning refers to the application of a transformation, called the preconditioner, to a given problem instance such that the transformed instance is more-easily solved by a given class of algorithms. Most commonly, the preconditioned problem is solved with an iterative algorithm, the complexity of which depends on the condition number of the preconditioned problem.

To start, consider p=2p=2, and recall that for a square linear system Ax=bAx=b of full rank, this preconditioning usually takes one of the following forms:

Clearly, the preconditioned system is consistent with the original one, i.e., has the same xx^{*} as the unique solution, if the preconditioners MM and NN are nonsingular.

For the general LS Problem (1), more care should be taken so that the preconditioned system has the same min-length solution as the original one. In particular, if we apply left preconditioning to the LS problem minxAxb2\min_{x}\|Ax-b\|_{2}, then the preconditioned system becomes \min_{x}\|M^{\raisebox{0.62997pt}{\hbox{\tinyT}}}Ax-M^{\raisebox{0.62997pt}{\hbox{\tinyT}}}b\|_{2}, and its min-length solution is given by

Similarly, the min-length solution to the right preconditioned system is given by

The following lemma states the necessary and sufficient conditions for A=N(AN)A^{\dagger}=N(AN)^{\dagger} or A^{\dagger}=(M^{\raisebox{0.62997pt}{\hbox{\tinyT}}}A)^{\dagger}M^{\raisebox{0.62997pt}{\hbox{\tinyT}}} to hold. Note that these conditions holding certainly imply that xright=xx_{\text{right}}^{*}=x^{*} and xleft=xx_{\text{left}}^{*}=x^{*}, respectively.

A=N(AN)A^{\dagger}=N(AN)^{\dagger} if and only if \text{range}(NN^{\raisebox{0.62997pt}{\hbox{\tinyT}}}A^{\raisebox{0.62997pt}{\hbox{\tinyT}}})=\text{range}(A^{\raisebox{0.62997pt}{\hbox{\tinyT}}}),

A^{\dagger}=(M^{\raisebox{0.62997pt}{\hbox{\tinyT}}}A)^{\dagger}M^{\raisebox{0.62997pt}{\hbox{\tinyT}}} if and only if \text{range}(MM^{\raisebox{0.62997pt}{\hbox{\tinyT}}}A)=\text{range}(A).

Given this preconditioned problem, (13) (see below) bounds the number of itrations for certain iterative algorithms for the LS problem.

Clearly, if yy^{*} is an optimal solution to (11), then x=R1yx^{*}=R^{-1}y is an optimal solution to (9), and vice versa; however, (11) may be easier to solve than (9) because of better conditioning.

This is a direct consequence of John’s theorem on ellipsoidal rounding of centrally symmetric convex sets. For the (α,β,p)(\alpha,\beta,p)-condition number κˉp\bar{\kappa}_{p}, we have the following lemma.

Least squares is a classic problem in linear algebra. It has a long history, tracing back to Gauss, and it arises in numerous applications. A detailed survey of numerical algorithms for least squares is certainly beyond the scope of this work. In this section, we briefly describe some well-known direct methods and iterative methods that compute the min-length solution to a possibly rank-deficient least squares problem, and we refer readers to Björck for additional details.

A third way to solve a least squares problem is by computing the min-length solution to the normal equation A^{\raisebox{0.62997pt}{\hbox{\tinyT}}}Ax=A^{\raisebox{0.62997pt}{\hbox{\tinyT}}}b, namely

It is easy to verify the correctness of (12) by replacing AA by its compact SVD U\Sigma V^{\raisebox{0.62997pt}{\hbox{\tinyT}}}. If r=min(m,n)r=\min(m,n), a Cholesky factorization of either A^{\raisebox{0.62997pt}{\hbox{\tinyT}}}A (if mnm\geq n) or AA^{\raisebox{0.62997pt}{\hbox{\tinyT}}} (if mnm\leq n) solves (12). If r<min(m,n)r<\min(m,n), we need the eigensystem of A^{\raisebox{0.62997pt}{\hbox{\tinyT}}}A or AA^{\raisebox{0.62997pt}{\hbox{\tinyT}}} to compute xx^{*}. The normal equation approach is the least expensive among the three direct approaches we have mentioned, but it is also the least accurate one, especially on ill-conditioned problems. See Chapter 5 of Golub and Van Loan for a detailed analysis. A closely related direct solver is the semi-normal equation method. It is often useful when the RR-factor of the QR decomposition is known; see for more details.

For sparse least squares problems, by pivoting AA’s columns and rows, we may find a sparse factorization of AA, which is preferred to a dense factorization for more efficient storage. For sparse direct methods, we refer readers to Davis .

Iterative methods

Instead of direct methods, we can use iterative methods to solve (10). If all the iterates {x(k)}\{x^{(k)}\} are in \text{range}(A^{\raisebox{0.62997pt}{\hbox{\tinyT}}}) and if {x(k)}\{x^{(k)}\} converges to a minimizer, it must be the minimizer having minimum length, i.e., the solution to Problem (1). This is the case when we use a Krylov subspace method starting with a zero vector. For example, the conjugate gradient (CG) method on the normal equation leads to the min-length solution (see Paige and Saunders ). In practice, CGLS , LSQR are preferable because they are equivalent to applying CG to the normal equation in exact arithmetic but they are numerically more stable. Other Krylov subspace methods such as LSMR can also solve (10) as well. The Chebyshev semi-iterative method can also be modified to solve LS problems.

Importantly, however, it is in general hard to predict the number of iterations for CG-like methods. The convergence rate is affected by the condition number of A^{\raisebox{0.62997pt}{\hbox{\tinyT}}}A. A classical result [46, p.187] states that

where W(k)W^{(k)} is a diagonal matrix with positive diagonals wi(k)w^{(k)}_{i}, i=1,,mi=1,\ldots,m. Let W(0)W^{(0)} be an identity matrix and choose

In particular, in Sections 4.1 and 4.2, we discuss practical algorithms to find such RR matrices, and we describe the trade-offs between speed (e.g., FLOPS, number of passes, additional space/time, etc.) and conditioning quality. The algorithms fall into two general families: ellipsoidal rounding (Section 4.1) and subspace embedding (Section 4.2). We present them roughly in the order of speed (in the RAM model), from slower ones to faster ones. We will discuss practical tradeoffs in Section 5. For simplicity, here we assume mpoly(n)m\gg\operatorname{poly}(n), and hence mn2mn+poly(n)mn^{2}\gg mn+\operatorname{poly}(n); and if AA is sparse, we assume that mnnnz(A)mn\gg\operatorname{nnz}(A). Hereby, the degree of poly(n)\operatorname{poly}(n) depends on the underlying algorithm, which may range from O(n)\operatorname{\mathcal{O}}(n) to O(n7)\operatorname{\mathcal{O}}(n^{7}).

Before diving into the details, it is worth mentioning a few high-level considerations about subspace embedding methods. (Similar considerations apply to ellipsoidal rounding methods.) Subspace embedding algorithms involve mapping data points, e.g., the columns of an m×nm\times n matrix, where mnm\gg n to a lower-dimensional space such that some property of the data, e.g., geometric properties of the point set, is approximately preserved; see Definition 7 for definition for low-distortion subspace embedding matrix. As such, they are critical building blocks for developing improved random sampling and random projection algorithms for common linear algebra problems more generally, and they are one of the main technical tools for RandNLA algorithms. There are several properties of subspace embedding algorithms that are important in order to optimize their performance in theory and/or in practice. For example, given a subspace embedding algorithm, we may want to know:

whether it is data-oblivious (i.e., independent of the input subspace) or data-aware (i.e., dependent on some property of the input matrix or input space),

the time and storage it needs to construct an embedding,

the time and storage to apply the embedding to an input matrix,

the failure rate, if the construction of the embedding is randomized,

the dimension of the embedding, i.e., the number of dimensions being sampled by sampling algorithms or being projected onto by projection algorithms,

how to balance the trade-offs among those properties.

Some of these considerations may not be important for typical theoretical analysis but still affect the practical performance of implementations of these algorithms.

Since we will introduce several important and distinct but closely-related concepts in this long section, in Figure 1 we provide an overview of these relations as well as of the structure of this section.

Therefore, we have κp(AR1)κ\kappa_{p}(AR^{-1})\leq\kappa. So a κ\kappa-rounding of C\mathcal{C} leads to a κ\kappa-preconditioning of AA.

Recall the well-known result due to John that for a centrally symmetric convex set C\mathcal{C} there exists a n1/2n^{1/2}-rounding. It is known that this result is sharp and that such rounding is given by the Löwner-John (LJ) ellipsoid of C\mathcal{C}, i.e., the minimal-volume ellipsoid containing C\mathcal{C}. This leads to Lemma 3 above. Unfortunately, finding an n1/2n^{1/2}-rounding is a hard problem. No constant-factor approximation in polynomial time is known for general centrally symmetric convex sets, and hardness results have been shown .

By applying Lemma 5 to the convex set C={xAxp1}\mathcal{C}=\{x\,|\,\|Ax\|_{p}\leq 1\}, with the separation oracle described via a subgradient of Axp\|Ax\|_{p} and the initial rounding provided by the “R” matrix from the QR decomposition of AA, one immediately improves the running time of the algorithm used by Clarkson and by Dasgupta et al. from O(mn5logm)\operatorname{\mathcal{O}}(mn^{5}\log m) to O(mn3logm)\operatorname{\mathcal{O}}(mn^{3}\log m) while maintaining an O(n)\operatorname{\mathcal{O}}(n)-conditioning.

Unfortunately, even this improvement for computing a 2n2n-conditioning is not immediately applicable to very large matrices. The reason is that such matrices are usually distributively stored on secondary storage and each call to the oracle requires a pass through the data. We could group nn calls together within a single pass, but this would still need O(nlogm)\operatorname{\mathcal{O}}(n\log m) passes. Instead, Meng and Mahoney present a deterministic single-pass conditioning algorithm that balances the cost-performance trade-off to provide a 2n2/p1+12n^{|2/p-1|+1}-conditioning of AA . This algorithm essentially invoke the fast ellipsoidal rounding (Lemma 5) method on a smaller problem which is constructed via a single-pass on the original dataset. Their main algorithm is stated in Algorithm 1, and the main result for Algorithm 1 is the following.

Algorithm 1 is a 2n2/p1+12n^{|2/p-1|+1}-conditioning algorithm, and it runs in O((mn2+n4)logm)\operatorname{\mathcal{O}}((mn^{2}+n^{4})\log m) time. It needs to compute a 2n2n-rounding on a problem with size m/nm/n by nn which needs O(n2logm)\operatorname{\mathcal{O}}(n^{2}\log m) calls to the separation oracle on the smaller problem.

Solving the rounding problem of size m/n×nm/n\times n in Algorithm 1 requires O(m)\operatorname{\mathcal{O}}(m) RAM, which might be too much for very large-scale problems. In such cases, one can increase the block size from n2n^{2} to, e.g., n3n^{3}. A modification to the proof of Lemma 6 shows that this gives us a 2n3/p3/2+12n^{|3/p-3/2|+1}-conditioning algorithm that only needs O(m/n)\operatorname{\mathcal{O}}(m/n) RAM and O((mn+n4)logm)\operatorname{\mathcal{O}}((mn+n^{4})\log m) FLOPS for the rounding problem.

2 Low-distortion subspace embedding and subspace-preserving embedding

We call Φ\Phi a low-distortion subspace embedding of Ap\operatorname{\mathcal{A}}_{p} if the distortion of the embedding κΦ=O(poly(n))\kappa_{\Phi}=\operatorname{\mathcal{O}}(\operatorname{poly}(n)), independent of mm.

where the first inequality is due to low distortion and the second inequality is due to the equivalence of vector norms. By similar arguments, we can show that

Hence, by combining these results, we have

Furthermore, one can construct a well-conditioned basis by combining QR-like and ER-like methods. To see this, let RR be the matrix obtained by applying Corollary 1 to ΦA\Phi A. We have

where the second inequality is due to the ellipsoidal rounding result, and

and AR1AR^{-1} is well-conditioned. Following our previous conventions, we call this combined type of conditioning method a QR+ER-type method.

A special family of low-distortion subspace embedding that has very low distortion factor is called subspace-preserving embedding.

We say a mapping has J-L property if it satisfies the above condition with a constant probability.

The result of Lemma 8 applies to any J-L transform, i.e., to any transform (including those with better or worse asymptotic FLOPS behavior) that satisfies the J-L distortion property.

It is important to note, however, that for some J-L transforms, we are able to obtain more refined results. In particular, these can be obtained by bounding the spectral norm of (ΦU)T(ΦU)I(\Phi U)^{T}(\Phi U)-I, where UU is an orthonormal basis of A2\operatorname{\mathcal{A}}_{2}. If (ΦU)T(ΦU)Iϵ\|(\Phi U)^{T}(\Phi U)-I\|\leq\epsilon, for any xA2x\in\operatorname{\mathcal{A}}_{2}, we have

We show some results following this approach. First consider the a random normal matrix, which has the following concentration result on its extreme singular values.

Consider an s×ns\times n random matrix GG with s>ns>n, whose entries are independent random variables following the standard normal distribution. Let the singular values be σ1σn\sigma_{1}\geq\cdots\geq\sigma_{n}. Then for any t>0t>0,

Using this concentration result, we can easily present a better analysis of random normal projection than in Lemma 8.

(ΦU)T(ΦU)In2ϵ\|(\Phi U)^{T}(\Phi U)-I_{n}\|_{2}\leq\epsilon, where UU is an orthonormal basis of A2\operatorname{\mathcal{A}}_{2}.

An SRHT is an s×ms\times m matrix of the form

Below we present the main results for SRHT from since it has a better characterization of the subspace-preserving properties. We note that its proof is essentially a combination of the results in .

Note that besides Walsh-Hardamard transform, other FFT-based transform, e.g., discrete Hartley transform (DHT), discrete cosine transform (DCT) which have more practical advantages can be also be used; see for an details of other choices. Another important point to keep in mind (in particular, for parallel and distributed applications) is that, although called “fast,” a fast transform might be slower than a dense transform: when nnz(A)=O(m)\operatorname{nnz}(A)=\operatorname{\mathcal{O}}(m) (since machines are optimized for matrix-vector multiplies); when AA’s columns are distributively stored (since this slows down FFT-like algorithms, due to communication issues); or for other machine-related issues.

Cauchy random variables replace standard normal random variables;

a larger embedding dimension does not always lead to better distortion quality; and

the failure rate becomes harder to control.

To make the algorithm work with high probability, one has to set tt to be at the order of s6s^{6} and s=O(nlogn)s=\operatorname{\mathcal{O}}(n\log n). It follows that κ=O(n4log4n)\kappa=\operatorname{\mathcal{O}}(n^{4}\log^{4}n) in the above theorem. That is, while faster in terms of FLOPS than the CT, the FCT leads to worse embedding/preconditioning quality. Importantly, this result is different from how FJLT compares to dense Gaussian transform: FJLT is faster than the dense Gaussian transform, while both provide the same order of distortion; but FCT becomes faster than the dense Cauchy transform, at the cost of somewhat worse distortion quality.

where Φ\Phi is the sparse Cauchy transform described above.

where Φ\Phi is the sparse transform using reciprocal exponential variables described above.

and {pi}i=1m\{p_{i}\}_{i=1}^{m} satisfies (15). Then, with probability at least 0.70.7,

An obvious (but surmountable) challenge to applying this result is that computing the leverage scores exactly involves forming an orthonormal basis for AA first. Normally, this step will take O(mn2)\operatorname{\mathcal{O}}(mn^{2}) time which becomes undesirable when for large-scale applications.

Suppose, now, we use SRHT (Lemma 10) or CW (Lemma 11) method as the underlying FJLT, i.e., Π1\Pi_{1}, in the approximation of the leverage scores. Then, combining the theory suggested in and Theorem 1, we have the following lemma.

Then, with probability at least 1δ1-\delta,

We note that the speed-up comes at the cost of increased sampling complexity, which does not substantially affect most theoretical formulations, since the sampling complexity is still O(poly(n)log(1/ϵ)/ϵ2)\operatorname{\mathcal{O}}(\operatorname{poly}(n)\log(1/\epsilon)/\epsilon^{2}). In practice, however, it might be worth computing U=AR1U=AR^{-1} and its row norms explicitly to obtain a smaller sample size. One should be aware of this trade-off when implementing a subspace-preserving sampling algorithm.

The most straightforward use of these methods (and the one to which most of the theory has been developed) is to construct a subspace-preserving embedding matrix and then solve the resulting reduced-sized problem exactly, thereby obtaining an approximate solution to the original problem. In somewhat more detail, this algorithmic approach performs the following two steps.

Construct a subspace-preserving embedding matrix Π\Pi with distortion 1±ϵ41\pm\frac{\epsilon}{4}.

Using a black-box solver, solve the reduced-sized problem exactly, i.e., exactly solve

(We refer to this approach as low-precision since the running time complexity with respect to the error parameter ϵ\epsilon is \mboxpoly(1/ϵ)\mbox{poly}(1/\epsilon). Thus, while this approach can be analyzed for a fixed ϵ\epsilon, this dependence means that as a practical matter this approach cannot achieve high-precision solutions.)

To see why this approach gives us a (1+ϵ)(1+\epsilon)-approximate solution to the original problem, recall that a subspace-preserving embedding matrix Π\Pi with distortion factor (1±ϵ4)(1\pm\frac{\epsilon}{4}) satisfies

Therefore, the following simple reasoning shows that x^\hat{x} is indeed a (1+ϵ)(1+\epsilon)-approximation solution.

For completeness, we include the following lemma stating this result more precisely.

3.2 High-precision solvers

Construct a randomized preconditioner for AA, called NN.

(We refer to this approach as high-precision since the running time complexity with respect to the error parameter ϵ\epsilon is log(1/ϵ)\log(1/\epsilon). Among other things, this means that, given a moderately good solution—e.g., the one obtained from the embedding that could be used in a low-precision solver—one can very easily obtain a very high precision solution.)

As with the low-precision solvers, note that if we use the input-sparsity time algorithm of for embedding and then use an (SRHT + LSQR) approach above to solve the reduced-sized problem, then under the assumption that mpoly(n)m\geq\operatorname{poly}(n) and ϵ\epsilon is fixed, this particular combination would become the best approach proposed. However, there are various trade-offs among those approaches. For instance, there are trade-offs between running time and conditioning quality in preconditioning for computing the subspace-preserving sampling matrix, and there are trade-offs between embedding dimension/sample size and failure rate in embedding/sampling. Some of the practical trade-offs on different problem types and computing platforms will be discussed in Section 5.3 below.

Implementations and empirical results

Two important aspects of LSRN are the use of the Gaussian transform and the CS method, and they are coupled in a nontrivial way. In the remainder of this subsection, we discuss these issues.

To start, note that, among the available choices for the random projection matrix, the Gaussian transform has particularly-good conditioning properties. In particular, the distribution of the spectrum of the preconditioned system depends only on that of a certain Gaussian matrix, not the original linear system. In addition, one can show that

where κ(AN)\kappa(AN) is the condition number of the preconditioned system, rr is the rank of AA, and α\alpha is a parameter . For example, if we choose the oversampling factor γ\gamma in Algorithm 2 to be 22, then the condition number of the new linear system is less than 66 with high probability. In addition, a result on bounds on the singular values provided in enable CS to work more efficiently.

Moreover, while slower in terms of FLOPS than FFT-based fast transforms, the Gaussian transform comes with several other advantages for large-scale environments. First, it automatically speeds up with sparse input matrices and fast linear operators (in which case FFT-based fast transforms are no longer “fast”). Second, the preconditioning process is embarrassingly parallel and thus scales well in parallel environments. Relatedly, it is easy to implement using multi-threads or MPI. Third, it still works (with an extra “allreduce” operation) when AA is partitioned along its bigger dimension. Lastly, when implemented properly, Gaussian random variables can be generated very fast (which is nontrivial, given that the dominant cost in naïvely-implemented Gaussian-based projections can be generating the random variables). For example, it takes less than 2 seconds to generate 10910^{9} random Gaussian numbers using 12 CPU cores .

To understand why CS is preferable as a choice of iterative solver compared to other methods such as the conjugate gradient based LSRN, one has to take the convergence rate and computation/communication costs into account. In general, if (a bound for) the condition number of the linear system is large or not known precisely, then the CS method will fail ungracefully (while LSQR will just converge very slowly). However, with the very strong preconditioning guarantee of the Gaussian transform, we have very strong control on the condition number of the embedding, and thus the CS method can be expected to converge within a very few iterations. In addition, since CS doesn’t have vector inner products that require synchronization between nodes (while the conjugate gradient based LSQR does), CS has one less synchronization point per iteration, i.e., it has improved communication properties. See Figure 2 for the Python code snippets of LSQR and CS, respectively. On each iteration, both methods have to do two matrix-vector multiplications, while CS only needs one cluster-wide synchronization compared to two in LSQR. Thus, the more communication-efficient CS method is enabled by the very strong control on conditioning that is provided by the more expensive Gaussian projection. It is this advantage that makes CS favorable in the distributed environments, where communication costs are considered more expensive.

Next, we will discuss some of the implementation details of the above three steps in the MapReduce framework. The key thing to note is that, for the problems we are considering, the dominant cost is the cost of input/output, i.e., communicating the data, and hence we want to extract as much information as possible for each pass over the data.

After collecting all the vectors vkv_{k}, for k=1,,rk=1,\ldots,r, one only has to assemble these vectors and perform QR decomposition on the resulting matrix, which completes the preconditioning process.

Note here, in the second step of the reducer above, since the size of the subsampled matrix AkA_{k} typically only depends on the low dimension nn, the subproblem can be fit into the memory of a single machine and can be solved locally.

Initialization using all the solutions returned by sampling algorithms. To construct a search region S0\mathcal{S}_{0}, one can use the multiple solutions returned by calling the sampling algorithm, e.g., low-precision solutions, to obtain a much better initial condition. If we denote by x^1,x^N\hat{x}_{1},\ldots\hat{x}_{N} the NN approximation solution, then given each x^\hat{x}, let f^=Ax^1\hat{f}=\|A\hat{x}\|_{1} and g^=ATsign(Ax^)\hat{g}=A^{T}\text{sign}(A\hat{x}). Note that given x^1,,x^N\hat{x}_{1},\ldots,\hat{x}_{N}, computing f^i,g^i\hat{f}_{i},\hat{g}_{i} for i=1,,Ni=1,\ldots,N can be done in a single pass. Then we have

Hence, for each subsampled solution x^i\hat{x}_{i}, we have a hemisphere that contains the optimal solution. We use all these hemispheres to construct a better initial search region S0\mathcal{S}_{0}, which may potentially reduce the number of iterations needed for convergence.

Performing multiple queries per iteration. Instead of sending one query point at each iteration, one can exploit the fact that it is inexpensive to compute multiple query points per iteration, and one can send multiple query points at a time. Let us still use x^i\hat{x}_{i} to denote the multiple query points. Notice that by convexity,

This implies gTxgTx^g^{T}x^{*}\leq g^{T}\hat{x}. That is, given any query point x^\hat{x}, the subgradient serves as a separation oracle which returns a half-space that contains the desired ball. This means that, for each query point x^i\hat{x}_{i}, a half-space containing the ball of desired solutions will be returned.

Note that both of these differences take advantage of performing extra computation while minimizing the number of iterations (which is strongly correlated with communication for MapReduce computations).

In order to illustrate a range of uniformity and nonuniformity properties for both the leverage scores and the condition number, we considered the following four types of datasets.

UG (matrices with uniform leverage scores and good condition number);

UB (matrices with uniform leverage scores and bad condition number);

NG (matrices with nonuniform leverage scores and good condition number);

NB (matrices with nonuniform leverage scores and bad condition number).

These matrices are generated in the following manner. For matrices with uniform leverage scores, we generated the matrices by using the commands that are listed in Table 10. For matrices with nonuniform leverage scores, we considered matrices with the following structure:

To generate a large-scale matrix that is beyond the capacity of RAM, and to evaluate the quality of the solution for these larger inputs, we used two methods. First, we replicate the matrix (and the right hand side vector, when it is needed to solve regression problems) REPNUM times, and we “stack” them together vertically. We call this naïve way of stacking matrices as STACK1. Alternatively, for NB or NG matrices, we can stack them in the following manner:

We call this stacking method STACK2. The two different stacking methods lead to different properties for the linear system being solved—we summarize these in Table 11—and, while they yielded results that were usually similar, as we mention below, the results were different in certain extreme cases. With either method of stacking matrices, the optimal solution remains the same, so that we can evaluate the approximate solutions of the new large least-squares problems. We considered these and other possibilities, but in the results reported below, unless otherwise specified we choose the following: for large-scale UG and UB matrices, we use STACK1 to generate the data; and, for large-scale NG and NB matrices, we use STACK2 to generate the data.

PROJ CW — Random projection with the input-sparsity time CW method

PROJ GAUSSIAN — Random projection with Gaussian transform

PROJ RADEMACHER — Random projection with Rademacher transform

PROJ SRDHT — Random projection with Subsampled randomized discrete Hartley transform

SAMP APPR — Random sampling based on approximate leverage scores

SAMP UNIF — Random sampling with uniform distribution

Note that, instead of using a vanilla SRHT, we perform our evaluation with a SRDHT (i.e., a subsampled randomized discrete Hartley transform). (An SRDHT is a related FFT-based transform which has similar properties to a SRHT in terms of speed and accuracy but doesn’t have the restriction on the dimension to be a power of 22.) Also note that, instead of using a distributed FFT-based transform to implement SRDHT, we treat the transform as a dense matrix-matrix multiplication, hence we should not expect SRDHT to have computational advantage over other transforms.

Throughout this section, by embedding dimension, we mean the projection size for projection based methods and the sampling size for sampling based methods. Also, it is worth mentioning that for sampling algorithm with approximate leverage scores, we fix the underlying embedding method to be PROJ CW and the projection size cc to be d2/4d^{2}/4. In our experiments, we found that—when they were approximated sufficiently well—the precise quality of the approximate leverage scores do not have a strong influence on the quality of the solution obtained by the sampling algorithm. We will elaborate this more in Section 5.3.3.

The computations for Table 12, Figure 4, and Table 13 below (i.e., for the smaller-sized problems) were performed on a shared-memory machine with 12 Intel Xeon CPU cores at clock rate 2GHz with 128GB RAM. In these cases, the algorithms are implemented in MATLAB. All of the other computations (i.e., for the larger-sized problems) were performed on a cluster with 16 nodes (1 master and 15 slaves), each of which has 8 CPU cores at clock rate 2.5GHz with 25GB RAM. For all these cases, the algorithms are implemented in Spark via a Python API.

3.2 Overall performance of low-precision solvers

Here, we evaluate the performance of the 6 kinds of embedding methods described above (with different embedding dimension) on the 4 different types of dataset described above (with size 1e71e7 by 10001000). For dense transforms, e.g., PROJ GAUSSIAN, due to the memory capacity, the largest embedding dimension we can handle is 5e45e4. For each dataset and each kind of the embedding, we compute the following three quantities: relative error of the objective ff/f|f-f^{\ast}|/f^{\ast}; relative error of the solution certificate xx2/x2\|x-x^{\ast}\|_{2}/\|x^{\ast}\|_{2}; and the total running time to compute the approximate solution. The results are presented in Figure 3.

As we can see, when the matrices have uniform leverage scores, all the methods including SAMP UNIF behave similarly. As expected, SAMP UNIF runs fastest, followed by PROJ CW. On the other hand, when the leverages scores are nonuniform, SAMP UNIF breaks down even with large sampling size. Among the projection based methods, the dense transforms, i.e., PROJ GAUSSIAN, PROJ RADEMACHER and PROJ SRDHT, behave similarly. Although PROJ CW runs much faster, it yields very poor results until the embedding dimension is large enough, i.e., c=3e5c=3e5. Meanwhile, sampling algorithm with approximate leverage scores, i.e., SAMP APPR, tends to give very reliable solutions. (This breaks down if the embedding dimension in the approximate leverage score algorithm is chosen to be too small.) In particular, the relative error is much lower throughout all choices of the embedding dimension. This can be understood in terms of the theory; see and for details. In addition, its running time becomes more favorable when the embedding dimension is larger.

As a more minor point, theoretical results also indicate that the upper bound of the relative error of the solution vector depends on the condition number of the system as well as the amount of mass of bb lies in the range space of AA, denote by γ\gamma . Across the four datasets, γ\gamma is roughly the same. This is why we see the relative error of the certificate, i.e., the vector achieving the minimum solution, tends to be larger when the condition number of the matrix becomes higher.

3.3 Quality of the approximate leverage scores

Here, we evaluate the quality of the fast approximate leverage score algorithm of , and we investigate the quality of the approximate leverage scores with several underlying embeddings. (The algorithm of considered only Hadamard-based projections, but other projection methods could be used, leading to similar approximation quality but different running times.) We consider only an NB matrix since leverage scores with nonuniform distributions are harder to approximate. In addition, the size of the matrix we considered is only rather small, 1e61e6 by 500500, due to the need to compute the exact leverage scores for comparison. Our implementation follows closely the main algorithm of , except that we consider other random projection matrices. In particular, we used the following four ways to compute the underlying embedding: namely, PROJ CW, PROJ GAUSSIAN, PROJ RADEMACHER, and PROJ SRDHT. For each kind of embedding and embedding dimension, we compute a series of quantities which characterize the statistical properties of the approximate leverage scores. The results are summarized in Table 12.

As we can see, when the projection size is large enough, all the projection-based methods to compute approximations to the leverage scores produce highly accurate leverage scores. Among these projection methods, PROJ CW is typically faster but also requires a much larger projection size in order to yield reliable approximate leverage scores. The other three random projections perform similarly. In general, the algorithms approximate the large leverage scores (those that equal or are close to 11) better than the small leverage scores, since αL\alpha_{L} and βL\beta_{L} are closer to 11. This is crucial when calling SAMP APPR since the important rows shall not be missed, and it is a sufficient condition for the theory underlying the algorithm of to apply.

3.4 Performance of low-precision solvers when n𝑛n changes

Here, we explore the scalability of the low-precision solvers by evaluating the performance of all the embeddings on NB matrices with varying nn. We fix d=1000d=1000 and let nn take values from 2.5e52.5e5 to 1e81e8. These matrices are generated by stacking an NB matrix with size 2.5e52.5e5 by 10001000 REPNUM times, with REPNUM varying from 11 to 400400 using STACK1. For conciseness, we fix the embedding dimension of each method to be either 5e35e3 or 5e45e4. The relative error on certificate and objective and running time are evaluated. The results are presented in Figure 5.

Especially worthy mentioning is that when using STACK1, by increasing REPNUM, as we pointed out, the coherence of the matrix, i.e., the maximum leverage score, is decreasing, as the size is increased. We can clearly see that, when n=2.5e5n=2.5e5, i.e., the coherence is 11, PROJ CW fails. Once the coherence gets smaller, i.e., nn gets larger, the projection-based methods behave similarly and the relative error remains roughly the same as we increased nn. This is because STACK1 doesn’t alter the condition number and the amount of mass of the right hand side vector that lies in the range space of the design matrix and the lower dimension dd remains the same. However, SAMP APPR tends to yield larger error on approximating the certificate as we increase REPNUM, i.e., the coherence gets smaller. Moreover, it breaks down when the embedding dimension is very small.

3.5 Performance of low-precision solvers when d𝑑d changes

Here, we evaluate the performance of the low-precision solvers by evaluating the performance of all the embeddings on NB matrices with changing dd. We fix n=1e7n=1e7 and let dd take values from 1010 to 20002000. For each dd, the matrix is generated by stacking an NB matrix with size 2.5e52.5e5 by dd 40 times using STACK1, so that the coherence of the matrix is 1/401/40. For conciseness, we fix the embedding of each method to be 2e32e3 or 5e45e4. The relative error on certificate and objective and running time are evaluated. The results are shown in Figure 6.

As can be seen, overall, all the projection-based methods behave similarly. As expected, the relative error goes up as dd gets larger. Meanwhile, SAMP APPR yields lower error as dd increases. However, it seems to have a stronger dependence on the lower dimension of the matrix, as it breaks down when dd is 100100 for small sampling size, i.e., s=2e3s=2e3.

3.6 Performance of high precision solvers

Here, we evaluate the use of these methods as preconditioners for high-precision iterative solvers. Since the embedding can be used to compute a preconditioner for the original linear system, one can invoke iterative algorithms such as LSQR to solve the preconditioned least-squares problem. Here, we will use LSQR. We first evaluate the conditioning quality, i.e., κ(AR1)\kappa(AR^{-1}), on an NB matrix with size 1e61e6 by 500500 using several different ways for computing the embedding. The results are presented in Table 13. Then we test the performance of LSQR with these preconditioners on an NB matrix with size 1e81e8 by 10001000 and an NG matrix with size 1e71e7 by 10001000. For simplicity, for each method of computing the embedding, we try a small embedding dimension where some of the methods fail, and a large embedding dimension where most of the methods succeed. See Figure 7 and Figure 8 for details.

The convergence rate of the LSQR phase depends on the preconditioning quality, i.e., κ(AR1)\kappa(AR^{-1}) where RR is obtained by the QR decomposition of the embedding of AA, ΦA\Phi A. See Section 4.2 for more details. Table 13 implies that all the projection-based methods tend to yield preconditioners with similar condition numbers once the embedding dimension is large enough. Among them, PROJ CW needs a much larger embedding dimension to be reliable (clearly consistent with its use in low-precision solvers). In addition, overall, the conditioning quality of the sampling-based embedding method, i.e., SAMP APPR tends to be worse than that of projection-based methods.

As for the downstream performance, from Figure 7 we can clearly see that, when a small embedding dimension is used, i.e., s=5e3s=5e3, PROJ GAUSSIAN yields the best preconditioner, as its better preconditioning quality translates immediately into fewer iterations for LSQR to converge. This is followed by SAMP APPR. This relative order is also suggested by Table 13. As the embedding dimension is increased, i.e., using large embedding dimension, all the method yield significant improvements and produce much more accurate solutions compared to that of NOCO (LSQR without preconditioning), among which PROJ CW with embedding dimension 3e53e5 converges to a nearly machine-precision solution within only 55 iterations. As for the running time, since each iteration of LSQR only involves with two matrix-vector multiplications (costs less than 2 minutes in our experiments), the overall running time is dominated by the time for computing the preconditioner. As expected, PROJ CW runs the fastest and the running time of PROJ GAUSSIAN scales linearly in the embedding dimension. In SAMP APPR, the sampling process needs to make 11-22 passes over the dataset but the running time is relatively stable regardless of the sampling size, as reflected in Figure 7(c)&(f). Finally, note that the reason that the error does not drop monotonically in the solution vector is the following. With the preconditioners, we work on a transformed system, and the theory only guarantees monotonicity in the decreasing of the relative error of the certificate of the transformed system, not the original one.

Finally, a minor but potentially important point should be mentioned as a word of caution. As expected, when the condition number of the linear system is large, vanilla LSQR does not converge at all. On the other hand, when the condition number is very small, from Figure 8, there is no need to precondition. If, in this latter case, a randomized preconditioning method is used, then the embedding dimension must be chosen to be sufficiently large: unless the embedding dimension is large enough such that the conditioning quality is sufficiently good, then preconditioned LSQR yields larger errors than even vanilla LSQR.

Discussion and conclusion

Acknowledgments. We would like to thank Michael Saunders for advice and helpful discussions. We would also like to acknowledge the Army Research Office, the Defense Advanced Research Projects Agency, and the Department of Energy for providing partial support for this work.

References