Randomized Iterative Methods for Linear Systems

Robert M. Gower, Peter Richtárik

Introduction

The need to solve linear systems of equations is ubiquitous in essentially all quantitative areas of human endeavour, including industry and science. Linear systems are a central problem in numerical linear algebra, and play an important role in computer science, mathematical computing, optimization, signal processing, engineering, numerical analysis, computer vision, machine learning, and many other fields.

For instance, in the field of large scale optimization, there is a growing interest in inexact and approximate Newton-type methods for , which can benefit from fast subroutines for calculating approximate solutions of linear systems. In machine learning, applications arise for the problem of finding optimal configurations in Gaussian Markov Random Fields , in graph-based semi-supervised learning and other graph-Laplacian problems , least-squares SVMs, Gaussian processes and more.

In a large scale setting, direct methods are generally not competitive when compared to iterative approaches. While classical iterative methods are deterministic, recent breakthroughs suggest that randomization can play a powerful role in the design and analysis of efficient algorithms which are in many situations competitive or better than existing deterministic methods.

We shall assume throughout that the system is consistent: there exists $x^{*}$ for which $Ax^{*}=b$ .

We now comment on the main contribution of this work:

1. New method. We develop a novel, fundamental, and surprisingly simple randomized iterative method for solving (1).

2. Six equivalent formulations. Our method allows for several seemingly different but nevertheless equivalent formulations. First, it can be seen as a sketch-and-project method, in which the system (1) is replaced by its random sketch, and then the current iterate is projected onto the solution space of the sketched system. We can also view it as a constrain-and-approximate method, where we constrain the next iterate to live in a particular random affine space passing through the current iterate, and then pick the point from this subspace which best approximates the optimal solution. Third, the method can be seen as an iterative solution of a sequence of random (and simpler) linear equations. The method also allows for a simple geometrical interpretation: the new iterate is defined as the unique intersection of two random affine spaces which are orthogonal complements. The fifth viewpoint gives a closed form formula for the random update which needs to be applied to the current iterate in order to arrive at the new one. Finally, the method can be seen as a random fixed point iteration.

3. Special cases. These multiple viewpoints enrich our interpretation of the method, and enable us to draw previously unknown links between several existing algorithms. Our algorithm has two parameters, an $n\times n$ positive definite matrix $B$ defining geometry of the space, and a random matrix $S$ . Through combinations of these two parameters, in special cases our method recovers several well known algorithms. For instance, we recover the randomized Kaczmarz method of Strohmer and Vershyinin , randomized coordinate descent method of Leventhal and Lewis , random pursuit (with exact line search), and the stochastic Newton method recently proposed by Qu et al . However, our method is more general, and leads to i) various generalizations and improvements of the aforementioned methods (e.g., block setup, importance sampling), and ii) completely new methods. Randomness enters our framework in a very general form, which allows us to obtain a Gaussian Kaczmarz method, Gaussian descent, and more.

4. Complexity: general results. When $A$ has full column rank, our framework allows us to determine the complexity of these methods using a single analysis. Our main results are summarized in Table 1, where $\{x^{k}\}$ are the iterates of our method, $Z$ is a random matrix dependent on the data matrix $A$ , parameter matrix $B$ and random parameter matrix $S$ , defined as

As we shall see later, we will often consider setting $B=I$ , $B=A$ (if $A$ is positive definite) or $B=A^{T}A$ (if $A$ is of full column rank). In particular, we first show that the convergence rate $\rho$ is always bounded between zero and one. We also show that as soon as $\mathbf{E}\left[Z\right]$ is invertible (which can only happen if $A$ has full column rank, which then implies that $x^{*}$ is unique), we have $\rho<1$ , and the method converges. Besides establishing a bound involving the expected norm of the error (see the last line of Table 1), we also obtain bounds involving the norm of the expected error (second line of Table 1). Studying the expected sequence of iterates directly is very fruitful, as it allows us to establish an exact characterization of the evolution of the expected iterates (see the first line of Table 1) through a linear fixed point iteration.

Both of these theorems on the convergence of the error can be recast as iteration complexity bounds. For instance, using standard arguments, from Theorem 6 in Table 1 we observe that for a given $\epsilon>0$ we have that

5. Complexity: special cases. Besides these generic results, which hold without any major restriction on the sampling matrix $S$ (in particular, it can be either discrete or continuous), we give a specialized result applicable to discrete sampling matrices $S$ (see Theorem 10). In the special cases for which rates are known, our analysis recovers the existing rates.

6. Extensions. Our approach opens up many avenues for further development and research. For instance, it is possible to extend the results to the case when $A$ is not necessarily of full column rank. Furthermore, as our results hold for a wide range of distributions, new and efficient variants of the general method can be designed for problems of specific structure by fine-tuning the stochasticity to the structure. Similar ideas can be applied to design randomized iterative algorithms for finding the inverse of a very large matrix.

2 Background and Related Work

The literature on solving linear systems via iterative methods is vast and has long history . For instance, the Kaczmarz method, in which one cycles through the rows of the system and each iteration is formed by projecting the current point to the hyperplane formed by the active row, dates back to the 30’s . The Kaczmarz method is just one example of an array of row-action methods for linear systems (and also, more generally, feasibility and optimization problems) which were studied in the second half of the 20th century .

Research into the Kaczmarz method was in 2009 reignited by Strohmer and Vershynin , who gave a brief and elegant proof that a randomized thereof enjoys an exponential error decay (also know as “linear convergence”). This has triggered much research into developing and analyzing randomized linear solvers.

It should be mentioned at this point that the randomized Kaczmarz (RK) method arises as a special case (when one considers quadratic objective functions) of the stochastic gradient descent (SGD) method for convex optimization which can be traced back to the seminal work of Robbins and Monro’s on stochastic approximation . Subsequently, intensive research went into studying various extensions of the SGD method. However, to the best of our knowledge, no complexity results with exponential error decay were established prior to the aforementioned work of Strohmer and Vershynin . This is the reason behind our choice of as the starting point of our discussion.

Motivated by the results of Strohmer and Vershynin , Leventhal and Lewis utilize similar techniques to establish the first bounds for randomized coordinate descent methods for solving systems with positive definite matrices, and systems arising from least squares problems . These bounds are similar to those for the RK method. This development was later picked up by the optimization and machine learning communities, and much progress has been made in generalizing these early results in countless ways to various structured convex optimization problems. For a brief up to date account of the development in this area, we refer the reader to and the references therein.

The RK method and its analysis have been further extended to the least-squares problem and the block setting . In the authors extend the randomized coordinate descent and the RK methods to the problem of solving underdetermined systems. The authors of analyze side-by-side the randomized coordinate descent and RK method, for least-squares, using a convenient notation in order to point out their similarities. Our work takes the next step, by analyzing these, and many other methods, through a genuinely single analysis. Also in the spirit of unifying the analysis of different methods, in the authors provide a unified analysis of iterative Schwarz methods and Kaczmarz methods.

The use of random Gaussian directions as search directions in zero-order (derivative-free) minimization algorithm was recently suggested . More recently, Gaussian directions have been combined with exact and inexact line-search into a single random pursuit framework , and further utilized within a randomized variable metric method .

One Algorithm in Six Disguises

Our method has two parameters: i) an $n\times n$ positive definite matrix $B$ which is used to define the $B$ -inner product and the induced $B$ -norm by

$x^{k+1}$ is the nearest point to $x^{k}$ which solves a sketched version of the original linear system:

This viewpoint arises very naturally. Indeed, since the original system (1) is assumed to be complicated, we replace it by a simpler system—a random sketch of the original system (1)—whose solution set $\{x\;|\;S^{T}Ax=S^{T}b\}$ contains all solutions of the original system. However, this system will typically have many solutions, so in order to define a method, we need a way to select one of them. The idea is to try to preserve as much of the information learned so far as possible, as condensed in the current point $x^{k}$ . Hence, we pick the solution which is closest to $x^{k}$ .

$x^{k+1}$ is the best approximation of $x^{*}$ in a random space passing through $x^{k}$ :

$x^{k+1}$ is the (unique) intersection of two affine spaces:

Note that $x^{k+1}$ is the (unique) solution (in $x$ ) of a linear system (with variables $x$ and $y$ ):

This system is clearly equivalent to (7), and can alternatively be written as:

Hence, our method reduces the solution of the (complicated) linear system (1) into a sequence of (hopefully simpler) random systems of the form (9).

By plugging the second equation in (8) into the first, we eliminate $x$ and obtain the system $(S^{T}AB^{-1}A^{T}S)y=S^{T}(b-Ax^{k})$ . Note that for all solutions $y$ of this system we must have $x^{k+1}=x^{k}+B^{-1}A^{T}Sy$ . In particular, we can choose the solution $y=y^{k}$ of minimal Euclidean norm, which is given by $y^{k}=(S^{T}AB^{-1}A^{T}S)^{\dagger}S^{T}(b-Ax^{k})$ , where † denotes the Moore-Penrose pseudoinverse. This leads to an expression for $x^{k+1}$ with an explicit form of the random update which must be applied to $x^{k}$ in order to obtain $x^{k+1}$ :

In some sense, this form is the standard: it is customary for iterative techniques to be written in the form $x^{k+1}=x^{k}+d^{k}$ , which is precisely what (10) does.

Note that iteration (10) can be written as

where $Z$ is defined in (2) and where we used the fact that $Ax^{*}=b$ . Matrix $Z$ plays a central role in our analysis, and can be used to construct explicit projection matrices of the two projections depicted in Figure 1.

The equivalence between these six viewpoints is formally captured in the next statement.

The six viewpoints are equivalent: they all produce the same (unique) point $x^{k+1}$ .

The proof is simple, and follows directly from the above discussion. In particular, see the caption of Figure 1. ∎

2 Projection Matrices

In this section we state a few key properties of matrix $Z$ . This will shed light on the previous discussion and will also be useful later in the convergence proofs.

Recall that $S$ is a $m\times q$ random matrix (with $q$ possibly being random), and that $A$ is an $m\times n$ matrix. Let us define the random quantity

With respect to the geometry induced by the $B$ -inner product, we have that

$B^{-1}Z$ projects orthogonally onto the $d$ –dimensional subspace $\mathbf{Range}\left(B^{-1}A^{T}S\right)$

$(I-B^{-1}Z)$ projects orthogonally onto $(n-d)$ –dimensional subspace $\mathbf{Null}\left(S^{T}A\right).$

For any matrix $M$ , the pseudoinverse satisfies the identity $M^{\dagger}MM^{\dagger}=M^{\dagger}$ . Using this with $M=S^{T}AB^{-1}A^{T}S$ , we get

and thus both $B^{-1}Z$ and $I-B^{-1}Z$ are projection matrices. In order to establish that $B^{-1}Z$ is an orthogonal projection with respect to the $B$ -inner product (from which it follows that $I-B^{-1}Z$ is), we will show that

The second relation is trivially satisfied. In order to establish the first relation, it is enough to use two further properties of the pseudoinverse: $(M^{T}M)^{\dagger}M^{T}=M^{\dagger}$ and $MM^{\dagger}M=M$ , both with $M=B^{-1/2}A^{T}S$ . Indeed,

This lemma sheds additional light on Figure 1 as it gives explicit expressions for the associated projection matrices. The result also implies that $I-B^{-1}Z$ is a contraction with respect to the $B$ -norm, which means that the random fixed point iteration (11) has only very little room not to work. While $I-B^{-1}Z$ is not a strict contraction, under some reasonably weak assumptions on $S$ it will be a strict contraction in expectation, which ensures convergence. We shall state these assumptions and develop the associated convergence theory for our method in Section 4 and Section 5.

Special Cases: Examples

In this section we briefly mention how by selecting the parameters $S$ and $B$ of our method we recover several existing methods. The list is by no means comprehensive and merely serves the purpose of an illustration of the flexibility of our algorithm. All the associated complexity results we present in this section, can be recovered from Theorem 10, presented later in Section 5.

When $S$ is an $m\times m$ invertible matrix with probability one, then the system $S^{T}Ax=S^{T}b$ is equivalent to solving $Ax=b,$ thus the solution to (5) must be $x^{k+1}=x^{*}$ , independently of matrix $B.$ Our convergence theorems also predict this one step behaviour, since $\rho=0$ (see Table 1).

2 Random Vector Sketch

Next we describe several well known specializations of the random vector sketch and for brevity, we write the updates in the form of (15) and leave implicit that when the denominator is zero, no step is taken.

3 Randomized Kaczmarz

Using (10), these iterations can be calculated with

When $i$ is selected at random, this is the randomized Kaczmarz (RK) method . A specific non-uniform probability distribution for $S$ can yield simple and easily interpretable (but not necessarily optimal) complexity bound. In particular, by selecting $i$ with probability proportional to the magnitude of row $i$ of $A$ , that is $p_{i}=\|A_{i:}\|_{2}^{2}/\|A\|_{F}^{2}$ , it follows from Theorem 10 that RK enjoys the following complexity bound:

This result was first established by Strohmer and Vershynin . We also provide new convergence results in Theorem 6, based on the convergence of the norm of the expected error. Theorem 6 applied to the RK method gives

Now the convergence rate appears squared, which is a better rate, though, the expectation has moved inside the norm, which is a weaker form of convergence.

Analogous results for the convergence of the norm of the expected error holds for all the methods we present, though we only illustrate this with the RK method.

Using the Constrain-and-Approximate formulation (6), the randomized Kaczmarz method can also be written as

with probability $p_{i}$ . Writing the least squares function $f(x)=\tfrac{1}{2}\|Ax-b\|_{2}^{2}$ as

we see that the random vector $\nabla f_{i}(x)=\tfrac{1}{p_{i}}(A_{i:}x-b_{i})(A_{i:})^{T}$ is an unbiased estimator of the gradient of $f$ at $x$ . That is, $\mathbf{E}\left[\nabla f_{i}(x)\right]=\nabla f(x)$ . Notice that RK takes a step in the direction $-\nabla f_{i}(x)$ . This is true even when $A_{i:}x-b_{i}=0$ , in which case, the RK does not take any step. Hence, RK takes a step in the direction of the negative stochastic gradient. This means that it is equivalent to the Stochastic Gradient Descent (SGD) method. However, the stepsize choice is very special: RK chooses the stepsize which leads to the point which is closest to $x^{*}$ in the Euclidean norm.

4 Randomized Coordinate Descent: positive definite case

If $A$ is positive definite, then we can choose $B=A$ and $S=e^{i}$ in (5), which results in

where we used the symmetry of $A$ to get $(e^{i})^{T}A=A_{i:}=(A_{:i})^{T}.$ The solution to the above, given by (10), is

When $i$ is chosen randomly, this is the Randomized CD method (CD-pd). Applying Theorem 10, we see the probability distribution $p_{i}=A_{ii}/\mathbf{Tr}\left(A\right)$ results in a convergence with

This result was first established by Leventhal and Lewis .

Using the Constrain-and-Approximate formulation (6), this method can be interpreted as

with probability $p_{i}$ . It is easy to check that the function $f(x)=\tfrac{1}{2}x^{T}Ax-b^{T}x$ satisfies: $\|x-x^{*}\|_{A}^{2}=2f(x)+b^{T}x^{*}$ . Therefore, (23) is equivalent to

where $L_{i}=A_{ii}$ is the Lipschitz constant of the gradient of $f$ corresponding to coordinate $i$ and $\nabla_{i}f(x^{k})$ is the $i$ th partial derivative of $f$ at $x^{k}$ .

5 Randomized Block Kaczmarz

Our framework also extends to new block formulations of the randomized Kaczmarz method. Let $R$ be a random subset of $[m]$ and let $S=I_{:R}$ be a column concatenation of the columns of the $m\times m$ identity matrix $I$ indexed by $R$ . Further, let $B=I$ . Then (5) specializes to

In view of (10), this can be equivalently written as

From Theorem 8 we obtain the following new complexity result:

To obtain a more meaningful convergence rate, we would need to bound the smallest eigenvalue of $\mathbf{E}\left[(A_{R:})^{T}(A_{R:}(A_{R:})^{T})^{\dagger}A_{R:}\right].$ This has been done in when the image of $R$ defines a row paving of $A$ . Our framework paves the way for analysing the convergence of new block methods for a large set of possible random subsets $R,$ including, for example, overlapping partitions.

6 Randomized Newton: positive definite case

If $A$ is symmetric positive definite, then we can choose $B=A$ and $S=I_{:C}$ , a column concatenation of the columns of $I$ indexed by $C$ , which is a random subset of $[n]$ . In view of (5), this results in

In view of (10), we can equivalently write the method as

Clearly, iteration (27) is well defined as long as $C$ is nonempty with probability 1. Such $C$ is in referred to by the name “non-vacuous” sampling. From Theorem 8 we obtain the following convergence rate:

The convergence rate of this particular method was first established and studied in . Moreover, it was shown in that $\rho<1$ if one additionally assumes that the probability that $i\in C$ is positive for each column $i\in[n]$ , i.e., that $C$ is a “proper” sampling.

Using formulation (6), and in view of the equivalence between $f(x)$ and $\|x-x^{*}\|_{A}^{2}$ discussed in Section 3.4, the Randomized Newton method can be equivalently written as

The next iterate is determined by advancing from the previous iterate over a subset of coordinates such that $f$ is minimized. Hence, an exact line search is performed in a random $|C|$ dimensional subspace.

7 Randomized Coordinate Descent: least-squares version

By choosing $S=Ae^{i}=:A_{:i}$ as the $i$ th column of $A$ and $B=A^{T}A$ , the resulting iterates (6) are given by

When $i$ is selected at random, this is the Randomized Coordinate Descent method (CD-LS) applied to the least-squares problem: $\min_{x}\|Ax-b\|_{2}^{2}$ . Using (10), these iterations can be calculated with

Applying Theorem 10, we see that by selecting $i$ with probability proportional to magnitude of column $i$ of $A$ , that is $p_{i}=\|A_{:i}\|_{2}^{2}/\|A\|_{F}^{2}$ , results in a convergence with

This result was first established by Leventhal and Lewis .

Using the Constrain-and-Approximate formulation (6), the CD-LS method can be interpreted as

where $L_{i}\overset{\text{def}}{=}\|A_{:i}\|_{2}^{2}$ is the Lipschitz constant of the gradient corresponding to coordinate $i$ and $\nabla_{i}f(x^{k})$ is the $i$ th partial derivative of $f$ at $x^{k}$ .

Convergence: General Theory

We shall present two complexity theorems: we first study the convergence of $\|\mathbf{E}\left[x^{k}-x^{*}\right]\|$ , and then move on to analysing the convergence of $\mathbf{E}\left[\|x^{k}-x^{*}\|\right]$ .

The following lemma explains the relationship between the convergence of the norm of the expected error and the expected norm of the error.

Note that $\mathbf{E}\left[\|x-\mathbf{E}\left[x\right]\|^{2}\right]=\mathbf{E}\left[\|x\|^{2}\right]-\|\mathbf{E}\left[x\right]\|^{2}$ . Adding and subtracting $\|x^{*}\|^{2}-2\left<\mathbf{E}\left[x\right],x^{*}\right>$ from the right hand side and grouping the appropriate terms yields the desired result. ∎

To interpret this lemma, note that $\mathbf{E}\left[\left\|x-\mathbf{E}\left[x\right]\right\|^{2}\right]=\sum_{i=1}^{n}\mathbf{E}\left[(x_{i}-\mathbf{E}\left[x_{i}\right])^{2}\right]=\sum_{i=1}^{n}\mathbf{Var}(x_{i})$ , where $x_{i}$ denotes the $i$ th element of $x.$ This lemma shows that the convergence of $x$ to $x^{*}$ under the expected norm of the error is a stronger form of convergence than the convergence of the norm of the expected error, as the former also guarantees that the variance of $x_{i}$ converges to zero, for $i=1,\ldots,n.$

2 The Rate of Convergence

All of our convergence theorems (see Table 1) depend on the convergence rate

To show that the rate is meaningful, in Lemma 4 we prove that $0\leq\rho\leq 1$ . We also provide a meaningful lower bound for $\rho$ .

The quantity $\rho$ defined in (33) satisfies:

where $d=\mathbf{Rank}\left(S^{T}A\right)$ .

Since the mapping $A\mapsto\lambda_{\max}(A)$ is convex on the set of symmetric matrices, by Jensen’s inequality we get

Recalling from Lemma 2 that $B^{-1}Z$ is a projection, we get

whence the spectrum of $B^{-1/2}ZB^{-1/2}$ is contained in $\{0,1\}$ . Thus, $\lambda_{\max}(B^{-1/2}ZB^{-1/2})~{}\leq~{}1$ , and from (35) we conclude that $\lambda_{\max}(B^{-1}\mathbf{E}\left[Z\right])\leq 1$ . The inequality $\lambda_{\min}(B^{-1}\mathbf{E}\left[Z\right])~{}\geq~{}0$ can be shown analogously using convexity of the mapping $A\mapsto-\lambda_{\min}(A)$ . Thus

and consequentially $0\leq\rho\leq 1.$ As the trace of a matrix is equal to the sum of its eigenvalues, we have

As $B^{-1}Z$ projects onto a $d$ –dimensional subspace (Lemma 2) we have $\mathbf{Tr}\left(B^{-1}Z\right)=d.$ Thus rewriting (36) gives $1-\mathbf{E}\left[d\right]/n\leq\rho.$ ∎

The lower bound on $\rho$ in item 1 has a natural interpretation which makes intuitive sense. We shall present it from the perspective of the Constrain-and-Approximate formulation (6). As the dimension ( $d$ ) of the search space $B^{-1}A^{T}S$ increases (see (13)), the lower bound on $\rho$ decreases, and a faster convergence is possible. For instance, when $S$ is restricted to being a random column vector, as it is in the RK (17), CD-LS (30) and CD-pd (22) methods, the convergence rate is bounded with $1-1/n\leq\rho.$ Using (3), this translates into the simple iteration complexity bound of $k\geq n\log(1/\epsilon)$ . On the other extreme, when the search space is large, then the lower bound is close to zero, allowing room for the method to be faster.

We now characterize circumstances under which $\rho$ is strictly smaller than one.

If $\mathbf{E}\left[Z\right]$ is invertible, then $\rho<1$ , $A$ has full column rank and $x^{*}$ is unique.

3 Exact Characterization and Norm of Expectation

We now state a theorem which exactly characterizes the evolution of the expected iterates through a linear fixed point iteration. As a consequence, we obtain a convergence result for the norm of the expected error. While we do not highlight this in the text, this theorem can be applied to all the particular instances of our general method we detail throughout this paper.

Moreover, the spectral radius and the induced $B$ -norm of the iteration matrix $I-B^{-1}\mathbf{E}\left[Z\right]$ are both equal to $\rho$ :

Taking expectations conditioned on $x^{k}$ in (11), we get

Applying the norms to both sides we obtain the estimate

It remains to prove that $\rho=\|I-B^{-1}\mathbf{E}\left[Z\right]\|_{B}$ and then unroll the recurrence. According to the definition of operator norm (37), we have

Substituting $B^{1/2}x=y$ in the above gives

where in the third equality we used the symmetry of $(I-B^{-1}\mathbf{E}\left[Z\right]B^{-1})$ when passing from the operator norm to the spectral radius. Note that the symmetry of $\mathbf{E}\left[Z\right]$ derives from the symmetry of $Z$ . ∎

4 Expectation of Norm

We now turn to analysing the convergence of the expected norm of the error, for which we need the following technical lemma.

If $\mathbf{E}\left[Z\right]$ is positive definite, then

As $\mathbf{E}\left[Z\right]$ and $B$ are positive definite, we get

Therefore, $\mathbf{E}\left[Z\right]\succeq(1-\rho)B$ , and the result follows.∎

If $\mathbf{E}\left[Z\right]$ is positive definite, then

Let $r^{k}=x^{k}-x^{*}$ . Taking the expectation of (11) conditioned on $r^{k}$ we get

Taking expectation again and unrolling the recurrence gives the result. ∎

The convergence rate $\rho$ of the expected norm of the error is “worse” than the $\rho^{2}$ rate of convergence of the norm of the expected error in Theorem 6. This should not be misconstrued as Theorem 6 offering a “better” convergence rate than Theorem 8, because, as explained in Lemma 3, convergence of the expected norm of the error is a stronger type of convergence. More importantly, the exponent is not of any crucial importance; clearly, an exponent of $2$ manifests itself only in halving the number of iterations.

Methods Based on Discrete Sampling

When $S$ has a discrete distribution, we can establish under reasonable assumptions when $\mathbf{E}\left[Z\right]$ is positive definite (Proposition 9), we can optimize the convergence rate in terms of the chosen probability distribution, and finally, determine a probability distribution for which the convergence rate is expressed in terms of the scaled condition number (Theorem 10).

When $S$ is a complete discrete sampling, then $S^{T}A$ has full row rank and $(S^{T}AB^{-1}A^{T}S)^{\dagger}=(S^{T}AB^{-1}A^{T}S)^{-1}.$ Therefore we replace the pseudo-inverse in (10) and (11) by the inverse. Furthermore, using a complete discrete sampling guarantees convergence of the resulting method.

Let $S$ be a complete discrete sampling, then $\mathbf{E}\left[Z\right]$ is positive definite.

which is a block diagonal matrix, and is well defined and invertible as $S_{i}^{T}A$ has full row rank for $i=1,\ldots,r$ . Taking the expectation of $Z$ (2) gives

which is positive definite because $A^{T}\mathbf{S}$ has full row rank and $D$ is invertible. ∎

With $\mathbf{E}\left[Z\right]$ positive definite, we can apply the convergence Theorem 6 and 8, and the resulting method converges.

We can choose the discrete probability distribution that optimizes the convergence rate. For this, according to Theorems 8 and 6 we need to find $p=(p_{1},\dots,p_{r})$ that maximizes the minimal eigenvalue of $B^{-1/2}\mathbf{E}\left[Z\right]B^{-1/2}$ . Let $S$ be a complete discrete sampling and fix the sample matrices $S_{1},\dots,S_{r}$ . Let us denote $Z=Z(p)$ as a function of $p=(p_{1},\dots,p_{r})$ . Then we can also think of the spectral radius as a function of $p$ where

the problem of minimizing the spectral radius (i.e., optimizing the convergence rate) can be written as

This can be cast as a convex optimization problem, by first re-writing

To obtain $p$ that maximizes the smallest eigenvalue, we solve

Despite (46) being a convex semi-definite programWhen preparing a revision of this paper, we have learned about the existence of prior work where the authors have also characterized the probability distribution that optimizes the convergences rate of the RK method as the solution to an SDP., which is apparently a harder problem than solving the original linear system, investing the time into solving (46) using a solver for convex conic programming such as cvx can pay off, as we show in Section 7.4. Though for a practical method based on this, we would need to develop an approximate solution to (46) which can be efficiently calculated.

2 Convenient Probabilities

Next we develop a choice of probability distribution that yields a convergence rate that is easy to interpret. This result is new and covers a wide range of methods, including randomized Kaczmarz, randomized coordinate descent, as well as their block variants. However, it is more general, and covers many other possible particular algorithms, which arise by choosing a particular set of sample matrices $S_{i}$ , for $i=1,\ldots,r.$

Let $t_{i}=\mathbf{Tr}\left(S_{i}^{T}AB^{-1}A^{T}S_{i}\right)$ , and with (47) in (43) we have

The result (48) follows by applying Theorem 8. ∎

for $i=1,\ldots r.$ In this case, the bound (50) is an equality and $D^{2}$ is a scaled identity, so (51) and consequently (52) are equalities. For block methods, it is different story, and there is much more slack in the inequality (52). So much so, the convergence rate (49) does not indicate any advantage of using a block method (contrary to numerical experiments). To see the advantage of a block method, we need to use the exact expression for $\lambda_{\min}(D^{2})$ given in (50). Though this results in a somewhat harder to interpret convergence rate, a matrix paving could be used explore this block convergence rate, as was done for the block Kaczmarz method .

By appropriately choosing $B$ and $S$ , this theorem applied to RK method (16), the CD-LS method (29) and the CD-pd method (20), yields the convergence results (18), (31) and (22), respectively, for single column sampling or block methods alike.

This theorem also suggests a preconditioning strategy, in that, a faster convergence rate will be attained if $\mathbf{S}$ is an approximate inverse of $B^{-1/2}A^{T}.$ For instance, in the RK method where $B=I$ , this suggests that an accelerated convergence can be attained if $S$ is a random sampling of the rows of a preconditioner (approximate inverse) of $A.$

Methods Based on Gaussian Sampling

Due to the symmetry of the multivariate normal distribution, there is a zero probability that $\zeta\in\mathbf{Null}\left(A^{T}\right)$ for any nonzero matrix $A$ .

Unlike the discrete methods in Section 3, to calculate an iteration of (53) we need to compute the product of a matrix with a dense vector $\zeta$ . This significantly raises the cost of an iteration. Though in our numeric tests in Section 7, the faster convergence of the Gaussian method often pays off for their high iteration cost.

To analyze the complexity of the resulting method let $\xi\overset{\text{def}}{=}B^{-1/2}A^{T}S,$ which is also Gaussian, distributed as $\xi\sim N(0,\Omega)$ , where $\Omega\overset{\text{def}}{=}B^{-1/2}A^{T}\Sigma AB^{-1/2}.$ In this section we assume $A$ has full column rank, so that $\Omega$ is always positive definite. The complexity of the method can be established through

We can simplify the above by using the lower bound

which is proven in Lemma 11 in the Appendix. Thus

where we used the general lower bound in (34). Lemma 11 also shows that $\mathbf{E}\left[\xi\xi^{T}/\|\xi\|^{2}_{2}\right]$ is positive definite, thus Theorem 8 guarantees that the expected norm of the error of all Gaussian methods converges exponentially to zero. This bound is tight upto a constant factor. For illustration of this, in the setting with $A=I=\Sigma$ we have $\xi\sim N(0,I)$ and $\mathbf{E}\left[\xi\xi^{T}/\|\xi\|^{2}_{2}\right]=\tfrac{1}{n}I,$ which yields

When $n=2$ , then in Lemma 12 of the Appendix we prove that

which yields a very favourable convergence rate.

Let $B=I$ and choose $\Sigma=I$ so that $S=\eta\sim N(0,I)$ . Then (53) has the form

which we call the Gaussian Kaczmarz (GK) method, for it is the analogous method to the Randomized Karcmarz method in the discrete setting. Using the formulation (6), for instance, the GK method can be interpreted as

Thus at each iteration, a random normal Gaussian vector $\eta$ is drawn and a search direction is formed by $A^{T}\eta.$ Then, starting from the previous iterate $x^{k}$ , an exact line search is performed over this search direction so that the euclidean distance from the optimal is minimized.

2 Gaussian Least-Squares

Let $B=A^{T}A$ and choose $S\sim N(0,\Sigma)$ with $\Sigma=AA^{T}$ . It will be convenient to write $S=A\eta$ , where $\eta\sim N(0,I)$ . Then method (53) then has the form

which we call the Gauss-LS method. This method has a natural interpretation through formulation (6) as

That is, starting from $x^{k}$ , we take a step in a random (Gaussian) direction, then perform an exact line search over this direction that minimizes the least squares error. Thus the Gauss-LS method is the same as applying the Random Pursuit method with exact line search to the Least-squares function.

3 Gaussian Positive Definite

When $A$ is positive definite, we achieve an accelerated Gaussian method. Let $B=A$ and choose $S=\eta\sim N(0,I)$ . Method (53) then has the form

Using formulation (6), the method can be interpreted as

That is, starting from $x^{k}$ , we take a step in a random (Gaussian) direction, then perform an exact line search over this direction. Thus the Gauss-pd method is equivalent to applying the Random Pursuit method with exact line search to $f(x).$

Numerical Experiments

In implementing the discrete sampling methods we used the convenient probability distributions (47).

All tests were performed on a Desktop with 64bit quad-core Intel(R) Core(TM) i5-2400S CPU @2.50GHz with 6MB cache size with a Scientific Linux release 6.4 (Carbon) operating system.

Consistently across our experiments, the Gaussian methods almost always require more flops to reach a solution with the same precision as their discrete sampling counterparts. This is due to the expensive matrix-vector product required by the Gaussian methods. While the results are more mixed when measured in terms of wall clock time. This is because MATLAB performs automatic multi-threading when calculating matrix-vector products, which was the bottleneck cost in the Gaussian methods. As our machine has four cores, this explains some of the difference observed when measuring performance in terms of number of flops and wall clock time.

First we compare the methods Gauss-LS, CD-LS, Gauss-Kaczmarz and RK methods on synthetic linear systems generated with the matrix functions rand and sprandn, see Figure 2. The high iteration cost of the Gaussian methods resulted in poor performance on the dense problem generated using rand in Figure 2(a). In Figure 2(b) we compare the methods on a sparse linear system generated using the MATLAB sparse random matrix function sprandn( $m,n$ ,density,rc), where density is the percentage of nonzero entries and rc is the reciprocal of the condition number. On this sparse problem the Gaussian methods are more efficient, and converge at a similar rate to the discrete sampling methods.

In Figure 3 we test two overdetermined linear systems taken from the the Matrix Market collection . The collection also provides the right-hand side of the linear system. Both of these systems are very well conditioned, but do not have full column rank, thus Theorem 8 does not apply. The four methods have a similar performance on Figure 3(a), while the Gauss-LS and CD-LS method converge faster on 3(b) as compared to the Gauss-Kaczmarz and Kaczmarz methods.

Finally, we test two problems, the SUSY problem and the covtype.binary problem, from the library of support vector machine problems LIBSVM . These problems do not form consistent linear systems, thus only the Gauss-LS and CD-LS methods are applicable, see Figure 4. This is equivalent to applying the Gauss-pd and CD-pd to the least squares system $A^{T}Ax=A^{T}b,$ which is always consistent.

Despite the higher iteration cost of the Gaussian methods, their performance, in terms of the wall-clock time, is comparable to performance of the discrete methods when the system matrix is sparse.

2 Bound for Gaussian convergence

Now we compare the error over the number iterations of the Gauss-LS method to theoretical rate of convergence given by the bound (55). For the Gauss-LS method (55) becomes

In Figures 5(a) and 5(b) we compare the empirical and theoretical bound on a random Gaussian matrix and the liver-disorders problem . Furthermore, we ran the Gauss-LS method 100 times and plot as dashed lines the 95% and 5% quantiles. These tests indicate that the bound it tight for well conditioned problems, such as Figure 5(a) in which the system matrix has a condition number equal to $1.94$ . While in Figure 5(b) the system matrix has a condition number of $41.70$ and there is some much more slack between the empirical convergence and the theoretical bound.

3 Positive Definite

First we compare the two methods Gauss-pd (58) and CD-pd (21) on synthetic data in Figure 6.

Using the MATLAB function hilbert, we can generate positive definite matrices with very high condition number, see Figure 6(LEFT). Both methods converge slowly and, despite the dense system matrix, the Gauss-pd method has a similar performance to CD-pd. In Figure (6)(RIGHT) we compare the two methods on a system generated by the MATLAB function sprandsym ( $m$ , $n$ , density, rc, type), where density is the percentage of nonzero entries, rc is the reciprocal of the condition number and type=1 returns a positive definite matrix. The Gauss-pd and the CD-pd method have a similar performance in terms of wall clock time on this sparse problem.

To appraise the performance gain in using block variants, we perform tests using two block variants: the Randomized Newton method (26), which we will now refer to as the Block CD-pd method, and the Block Gauss-pd method (59). The size of blocks $q$ in both methods was set to $q=\sqrt{n}.$ To solve the $q\times q$ system required in the block methods, we use MATLAB’s built-in direct solver, sometimes referred to as “back-slash”.

Next we test the Newton system $\nabla^{2}f(w_{0})x=-\nabla f(w_{0})$ , arising from four ridge-regression problems of the form

using data from LIBSVM . In particular, we set $w_{0}=0$ and use $\lambda=1$ as the regularization parameter, whence $\nabla f(w_{0})=A^{T}b$ and $\nabla^{2}f(w_{0})=A^{T}A+I$ .

In terms of wall clock time, The Gauss-pd method converged faster on all problems accept the protein problem as compared to CD-pd. The two Block methods had a comparable performance on the aloi and the SUSY problem. The Block Gauss-pd method converged in one iteration on covtype.binary, and the Block CD-pd method converged fast on the Protein problem.

We now compare the methods on two positive definite matrices from the Matrix Market collection , see Figure 8. The right-hand side was generated using rand(n,1). The Block CD-pd method converged much faster on both problems. The lower condition number ( $\kappa_{2}=12$ ) of the gr_30_30-rsa problem resulted in fast convergence of all methods, see Figure 8(a). While the high condition number ( $\kappa_{2}=4.3\cdot 10^{4}$ ) of the bcsstk18 problem, resulted in a slow convergence for all methods, see Figure 8(b).

Despite the clear advantage of using a block variant, applying a block method that uses a direct solver can be infeasible on very ill-conditioned problems. As an example, applying the Block CD-pd to the Hilbert system, and using MATLAB back-slash solver to solve the inner $q\times q$ systems, resulted in large numerical inaccuracies, and ultimately, prevented the method from converging. This occurred because the submatrices of the Hilbert matrix are also very ill-conditioned.

4 Comparison between Optimized and Convenient probabilities

We compare the practical performance of using the convenient probabilities (47) against using the optimized probabilities by solving (46). We solved (46) using the disciplined convex programming solver cvx for MATLAB.

In Table 2 we compare the different convergence rates for the CD-pd method, where $\rho_{c}$ is the convenient convergence rate (49), $\rho^{*}$ the optimized convergence rate, $(1-1/n)$ is the lower bound, and in the final “optimized time(s)” column the time taken to compute $\rho^{*}$ . In Figure 9, we compare the empirical convergence of the CD-pd method when using the convenient probabilities (47) and CD-pd-opt, the CD-pd method with the optimized probabilities, on four ridge regression problems and a uniform random matrix. We ran each method for $60$ seconds.

In most cases using the optimized probabilities results in a much faster convergence, see Figures 9(a), 9(c), 9(d) and 9(e). In particular, the $7.401$ seconds spent calculating the optimal probabilities for aloi paid off with a convergence that was $55$ seconds faster. The mushrooms problem was insensitive to the choice of probabilities 9(d). Finally despite $\rho^{*}$ being much less than $\rho_{c}$ on covtype, see Table 2, using optimized probabilities resulted in an initially slower method, though CD-pd-opt eventually catches up as CD-pd stagnates, see Figure 9(b).

In Table 3 we compare the different convergence rates for the RK method. In Figure 10, we then compare the empirical convergence of the RK method when using the convenient probabilities (47) and RK-opt, the RK method with the optimized probabilities by solving (46). The rates $\rho^{*}$ and $\rho_{c}$ for the rand(500,100) problem are similar, and accordingly, both the convenient and optimized variant converge at a similar rate in practice, see Figure 10b. While the difference in the rates $\rho^{*}$ and $\rho_{c}$ for the liver-disorders is more pronounced, and in this case, the $0.83$ seconds invested in obtaining the optimized probability distribution paid off in practice, as the optimized method converged $1.25$ seconds before the RK method with the convenient probability distribution, see Figure 10a.

We conclude from these tests that the choice of the probability distribution can greatly affect the performance of the method. Hence, it is worthwhile to develop approximate solutions to (45).

Conclusion

We present a unifying framework for the randomized Kaczmarz method, randomized Newton method, randomized coordinate descent method and random Gaussian pursuit. Not only can we recover these methods by selecting appropriately the parameters $S$ and $B$ , but also, we can analyse them and their block variants through a single Theorem 8. Furthermore, we obtain a new lower bound for all these methods in Theorem 6, and in the discrete case, recover all known convergence rates expressed in terms of the scaled condition number in Theorem 10.

The Theorem 10 also suggests a preconditioning strategy. Developing preconditioning methods are important for reaching a higher precision solution on ill-conditioned problems. For as we have seen in the numerical experiments, the randomized methods struggle to bring the solution within $10^{-2}$ relative error when the matrix is ill-conditioned.

This is also a framework on which randomized methods for linear systems can be designed. As an example, we have designed a new block variant of RK, a new Gaussian Kaczmarz method and a new Gaussian block method for positive definite systems. Furthermore, the flexibility of our framework and the general convergence Theorems 8 and 6 allows one to tailor the probability distribution of $S$ to a particular problem class. For instance, other continuous distributions such uniform, or other discrete distributions such Poisson might be more suited to a particular class of problems.

Numeric tests reveal that the new Gaussian methods designed for overdetermined systems are competitive on sparse problems, as compared to the Kaczmarz and CD-LS methods. The Gauss-pd also proved competitive as compared to CD-pd on all tests. Though, when applicable, the combined efficiency of using a direct solver and an iterative procedure, such as in Block CD-pd method, proved the most efficient.

The work opens up many possible future venues of research. Including investigating accelerated convergence rates through preconditioning strategies based on Theorem 10 or by obtaining approximate optimized probability distributions (46).

The authors would like to thank Prof. Sandy Davie for useful discussions relating to Lemma 12, and Prof. Joel Tropp for invaluable suggestions regarding Lemma 11.

References

Appendix A A Bound on the Expected Gaussian Projection Matrix

Let us write $S(\xi)$ for the random vector $\xi/\|\xi\|_{2}$ (if $\xi=0$ , we set $S(\xi)=0$ ). Using this notation, we can write

where the last identity follows since $\mathbf{E}\left[S(\xi)\right]=0$ , which in turn holds as the Gaussian distribution is centrally symmetric. As $\xi=Uu$ , note that

Left multiplying both sides by $U$ we obtain $US(u)=S(\xi)$ , from which we obtain

To prove (62), note first that $M\overset{\text{def}}{=}\mathbf{E}\left[uu^{T}/u^{T}u\right]$ is a diagonal matrix. One can verify this by direct calculation (informally, this holds because the entries of $u$ are independent and centrally symmetric). The $i$ th diagonal entry is given by

As the map $(x,y)\rightarrow x^{2}/y$ is convex on the positive orthant, we can apply Jensen’s inequality, which gives

Appendix B Expected Gaussian Projection Matrix in 2D

Let $\Sigma=UDU^{T}$ and $u\sim N(0,D).$ Given (61) it suffices to show that

Let $\sigma_{x}^{2}$ and $\sigma_{y}^{2}$ be the two diagonal elements of $D.$ First, suppose that $\sigma_{x}=\sigma_{y}.$ Then $u=\sigma_{x}\eta$ where $\eta\sim N(0,I)$ and

Now suppose that $\sigma_{x}\neq\sigma_{y}.$ We calculate the diagonal terms of the covariance matrix by integrating

Using polar coordinates $x=R\cos(\theta)$ and $y=R\sin(\theta)$ we have

where $C(\theta)\overset{\text{def}}{=}\left(\cos(\theta)^{2}/\sigma_{x}^{2}+\sin(\theta)^{2}/\sigma_{y}^{2}\right).$ Note that

where $b=\sigma_{x}/\sigma_{y}.$ Multiplying the numerator and denominator of the integrand by $\sec^{4}(x)$ gives the integral

Substituting $v=\tan(\theta)$ so that $v^{2}+1=\sec^{2}(\theta)$ , $dv=\sec^{2}(\theta)d\theta$ and using the partial fractions

To apply the limits of integration, we must take care because of the singularity at $\theta=\pi/2$ . For this, consider the limits

Using this to evaluate (67) on the limits of the interval $[0,\,\pi/2]$ gives

Applying a similar argument for calculating the limits from $\pi/2^{+}$ to $\pi$ , we find

Repeating the same steps with $x$ swapped for $y$ we obtain the other diagonal element, which concludes the proof of (64). ∎