Convergence properties of the randomized extended Gauss-Seidel and Kaczmarz methods
Anna Ma, Deanna Needell, Aaditya Ramdas
Introduction
We consider solving a linear system of equations
for a (real or complex) matrix , in various problem settings. Recent interest in the topic was reignited when Strohmer and Vershynin proved the linearMathematicians often refer to linear convergence as exponential convergence. convergence rate of the Randomized Kaczmarz (RK) algorithm that works on the rows of (data points). Following that, Leventhal and Lewis proved the linear convergence of a Randomized Gauss-Seidel (RGS), i.e. Randomized Coordinate Descent, algorithm that works on the columns of (features).
When the system of equations is inconsistent (i.e. has no exact solution), as is typically the case when in real-world overconstrained systems, RK is known to not converge to the ordinary least squares solution
as studied by Needell . Zouzias and Freris extended the RK method with the modified Randomized Extended Kaczmarz (REK) algorithm, which linearly converges to . Interestingly, in this setting, we will argue in Section 3.3 that RGS does converge to without any special extensions.
The above introduction represents only half the story. When , there are fewer constraints than variables, and the system has infinitely many solutions. In this case, especially if we have no prior reason to believe any additional sparsity in the signal structure, we are often interested in finding the least Euclidean norm solution:
While RGS converges to in the overcomplete setting, we shall argue in Section 3.3 that in the undercomplete setting it does not converge to . We will also argue that RK does converge to without any extensions in this setting.
The main contribution of our paper is to provide a unified theory of these related iterative methods. We will also construct an extension to RGS that parallels REK, which unlike RGS does converge to (just as REK, unlike RK, converges to ). Some desired properties for this algorithm include that it should also converge linearly, not require much extra computation, and work well in simulations. We shall see that our Randomized Extended Gauss-Seidel (REGS) method does indeed possess these desired properties. A summary of this unified theory is provided in Table 1.
2 Paper Outline
In Section 2 we recap the three main existing algorithms mentioned in the introduction (RK, RGS, REK). We discuss the performance of these algorithms in the three natural settings described in Table 1 in Section 3. Section 4 introduces our proposed algorithm (REGS) and proves its linear convergence to the least norm solution, completing the theoretical framework. Lastly, we end with some simulation experiments in Section 5 to demonstrate the tightness and usefulness of our theory, and conclude in Section 6.
Existing Algorithms and Related Work
In this section, we will summarize the algorithms mentioned in the introduction, i.e. RK, RGS and REK. We will describe their iterative update rules and mention their convergence guarantees, leaving the details of convergence to the next section. Throughout the paper we will use the notation to represent the th row of (or th entry in the case of a vector) and to denote the th column of a matrix . We will write the estimation as a column vector. We write vectors and matrices in boldface, and constants in standard font.
The Kaczmarz method was first introduced in the notable work of Kaczmarz . It has gained recent interest in tomography research where it is known as the Algebraic Reconstruction Technique (ART) . Although in its original form the method selects rows in a deterministic fashion (often simply cyclically), it has been well observed that a random selection scheme reduces the possibility of a poor choice of row ordering . Earlier convergence analysis of the randomized variant were obtained (e.g. ), but yielded bounds with expressions that were difficult to evaluate. Strohmer and Vershynin showed that the RK method described above has an expected linear convergence rate to the solution of (1), and are the first to provide an explicit convergence rate in expectation which depends only on the geometric properties of the system. This work was extended by Needell to the inconsistent case, analyzed almost surely by Chen and Powell , accelerated in several ways , and extended to more general settings .
We describe here the randomized variant of the Kaczmarz method put forth by Strohmer and Vershynin . Taking as input and starting from an arbitrary initial estimate for (for example ), RK repeats the following in each iteration. First, a random row is selected with probability proportional to its Euclidean norm, i.e.
where denotes the Frobenius norm of . Then, project the current iterate onto that row, i.e.
where here and throughout denotes the (conjugate) transpose of .
Intuitively, this update can be seen as greedily satisfying the th equation in the linear system. Indeed, it is easy to see that after the update,
we can alternatively interpret this update as stochastic gradient descent (choosing a random data-point on which to update), where the step size is the inverse Lipschitz constant of the stochastic gradient
2 Randomized Extended Kaczmarz (REK)
For inconsistent systems, the RK method does not converge to the least-squares solution as one might desire. This fact is clear since the method at each iteration projects completely onto a selected solution space, being unable to break the so-called convergence horizon. One approach to overcome this is to use relaxation parameters, so that the estimates are not projected completely onto the subspace at each iteration . Recently, Zouzias and Freris proposed a variant of the RK method motivated by the work of Popa which instead includes a random projection to iteratively reduce the component of which is orthogonal to the range of . This method, named Randomized Extended Kaczmarz (REK) can be described by the following iteration updates, which can be initialized with and :
Here, a column is also selected at random with probability proportional to its Euclidean norm:
and again denotes the th column of . Here, approximates the component of which is orthogonal to the range of , allowing for the iterates to converge to the true least-squares solution of the system. Zouzias and Freris prove that REK converges linearly in expectation to this solution .
3 Randomized Gauss-Seidel (RGS)
Again taking as input and starting from an arbitrary , the Randomized Gauss-Seidel (RGS) method (or the Randomized Coordinate Descent method) repeats the following in each iteration. First, a random column is selected as in (7). We then minimize the objective with respect to this coordinate to get
where is the th coordinate basis column vector (all zeros with a in the th position). It can be seen as greedily minimizing the objective with respect to the th coordinate. Indeed, letting represent without its th column and without its th coordinate,
Setting this equal to zero for the coordinate-wise minimization, we get the aforementioned update (8) for . Alternatively, since , the above update can intuitively be seen as a univariate descent step where the step size is the inverse Lipschitz constant of the gradient along the th coordinate, since the entry of the Hessian is
Leventhal and Lewis showed that this algorithm has an expected linear convergence rate. We will detail the convergence properties of this algorithm and the others in the next section.
Problem Variations
We first examine the differences in behavior of the two algorithms RGS and RK in three distinct but related settings. This will highlight the opposite behaviors of these two similar algorithms.
When the system of equations (1) has a unique solution, we represent this by . This happens when , and the system is consistent. Assuming that has full column rank,
and then .
When (1) does not have any consistent solution, we refer to the least-squares solution of (2) as . This could happen in the overconstrained case, when . Again, assuming that has full column rank, we have
and we can write as the residual vector.
When (1) has infinitely many solutions, we call the minimum Euclidean norm solution given by (3), . This could happen in the underconstrained case, when . Assuming that has full row rank, we have
In the above notation, the stands for Least Squares and for Least Norm. We shall return to each of these three situations in that order in future sections.
One of our main contributions is to achieve a unified understanding of the behavior of RK and RGS in these different situations. The literature for RK deals mainly with the first two settings only (see , , ). In the third setting, one readily obtains convergence to an arbitrary solution (see e.g. (3) of ), but the convergence to the least norm solution is not often studied (likely for practical reasons). The literature for RGS typically focuses on more general setups than our specific quadratic least squares loss function (see Nesterov or Richtárik and Takáč ). However, for both the purposes of completeness, and for a more thorough understanding of the relationship between RK and RGS, it turns out to be crucial to analyze all three settings (for equations (1)-(3)).
When is a unique consistent solution, we present proofs of the linear convergence of both algorithms - the results are known from papers by and but are presented here in a novel manner so that their relationship becomes clearer and direct comparison is easily possible.
When is the (inconsistent) least squares solution, we show why RGS iterates converge linearly to , but RK iterates do not - making RGS preferable. These facts are not hard to see, but we make it more intuitively and mathematically clear why this should be the case.
When is the minimum norm consistent solution, we explain why RK converges linearly to it, but RGS iterates do not (both so far seemingly undocumented observations) - making RK preferable.
Together, the above three points complete the picture (with solid accompanying intuition) of the opposing behavior of RK and RGS. Later, we will present our variant of the RGS method, the Randomized Extended Gauss-Seidel (REGS), and compare with the corresponding variant of RK (REK). This new analysis will complete the unified framework for these methods.
Here we will assume that , has full column rank, and the system is consistent, so . First, let us write the updates used by both algorithms in a revealing fashion. If RK and RGS select row and column at step , and (resp. ) is the th coordinate basis row (resp. column) vector, then the updates can be rewritten as:
where is the residual vector at iteration . Then multiplying both equations by gives
We now come to an important difference, which is the key update equation for RK and RGS.
First, from the update (13) for RK, we have that is parallel to . Also, is orthogonal to (since ). Then by the Pythagorean theorem,
Note that from the update (16), we have that is parallel to . Also, is orthogonal to (since by the optimality condition ). Then again by the Pythagorean theorem,
The rest of the proof follows by simply substituting for the last term in the above two equations, and is presented in the following table for easy comparison. Note is the full-rank covariance matrix and we first take expectations with respect to the randomness at the st step, conditioning on all randomness up to the th step. We later iterate this expectation.
Here, i.e. is the smallest eigenvalue of (singular value of ). It follows that
2 Overconstrained System, Inconsistent
Here, we will assume that , is full column rank, and the system is inconsistent, so , where is such that . It is easy to see this condition, because as mentioned earlier,
implying that . Substituting gives that .
In this setting, RK is known to not converge to the least squares solution, as is easily verified experimentally and geometrically. The tightest convergence upper bounds known are by and who show that
where we write to denote the smallest (non-zero) singular value of and again its Frobenius norm. Attempting the previous proof, (17) no longer holds – the Pythagorean theorem fails because is no longer orthogonal to since . Intuitively, the reason RK does not converge is that every update of RK (say of row ) is a projection onto the “wrong” hyperplane that has constant (where the “right” hyperplane would involve projecting onto a parallel hyperplane with constant where was defined above). An alternative intuition is that all RK updates are in the span of the rows, but is not in the row span. These intuitive explanations are easily confirmed by experiments seen in [32, pp. 787–788],[20, pp. 402]. Zouzias and Freris alleviate this issue with the REK algorithm, whose convergence obeys
However, the fate of RK doesn’t hold for RGS. Almost magically, in the previous proof, the Pythagorean theorem still holds in equation (18) because
The first term is 0 by the optimality condition for i.e. . The second term is zero by the global optimality of , i.e. . Also, is full rank as before. Indeed, RGS works in the space of fitted values and not the iterates .
In summary, RK does not converge to the LS solution, but RGS does at the same linear rate. This is what motivated the development of Randomized Extended Kaczmarz (REK) by Zouzias and Freris which, as discussed earlier, is a modification of RK designed to converge to by randomly projecting out . An independent paper by Dumitrescu argues however that in this setting RGS is preferable to REK in terms of computational convergence.
3 Underconstrained System, Infinite Solutions
Here, we will assume that , is full row rank and the system is consistent with infinitely many solutions. As mentioned earlier, it is easy to show that
(which clearly satisfies ). Every other consistent solution can be expressed as
Clearly any such would also satisfy . Since , implying , showing that is indeed the minimum norm solution as claimed.
In this case, RK has good behavior, and starting from , it does converge linearly to . Intuitively, (for ) and hence is in the row span of . Starting from , RK only adds multiples of rows to its iterates, and hence will never have any component orthogonal to the row span of . There is exactly one solution with no component orthogonal to the row span of , and that is , and hence RK converges linearly to the required point, where the rate can be bounded in exactly the same way as (19). It is important not to start from an arbitrary since the RK updates can never eliminate any component of that is perpendicular to the row span of . Of course, the same properties are shared by REK for this case as well. It is noted in Zouzias and Freris [32, Sec. 2.1] that the REK converges at the same rate for underdetermined systems as it does overdetermined systems.
Mathematically, the previous earlier proof works because the Pythagorean theorem holds since it is a consistent system. Now, is not full rank but note that since both and are in the row span, has no component orthogonal to (unless it equals zero in which case the algorithm has already converged). Hence holds, where is now understood to be the smallest positive eigenvalue of . To summarize, the exact same bound (19) still holds in this case, with the appropriate understanding of and under the assumption that the initialized is in the row span of .
RGS unfortunately suffers the opposite fate. The iterates do not converge to , even though does converge to . Mathematically, the convergence proof still carries forward as before, but in the last step when cannot be inverted because it is not full rank. Hence we get convergence of the residual to zero, without getting convergence of the iterates to the least norm solution. Intuitively, the iterates of RGS add components to the estimates that are orthogonal to the row span of . These components are never eliminated because in minimizing the residual, they are ignored. Therefore, RGS is able to minimize the residual without finding the least norm solution.
Unfortunately, when each update is cheaper for RK than RGS (due to matrix size), RGS is preferred for reasons of convergence and when it is cheaper for RGS than RK, RK is preferred.
REGS
We next introduce an extension of RGS, analogous to the extension REK of RK. The purpose of extending RK was to allow for convergence to the least squares solution. Now, the purpose of extending RGS is to allow for convergence to the least norm solution. We view this method as a completion to the unified analysis of these approaches, and it may also possess advantages in its own right.
Consider the linear system (1) with . Let denote the least norm solution of the underdetermined system as described in (3). The REGS algorithm is described by the following pseudo-code. Analogous to the role plays in REK, iteratively approximates the component in orthogonal to the row-span of . By iteratively removing this component, we converge to the least norm solution. Note that outputting instead of in Algorithm 1 recovers the RGS algorithm. This may be preferable in the overdetermined setting.
2 Main result
Our main result for the REGS method shows linear convergence to the least norm solution.
The REGS algorithm outputs an estimate such that
where and .
Proof: We devote the remainder of this section to the proof of Theorem 1.
In Algorithm 1, is in the row span of for . Clearly . To see that note that consists of the RGS approximation from which we subtract . This removes the component of which is orthogonal to the row span of .
Now we first consider :
So far, we have only used substitution of variables as defined for the algorithm and that is an orthogonal decomposition. We first focus on the expected value of the second term.
The first line follows by definition, the second is by Lemma 3, and the third by Lemma 2.
Finally, we take the expected value of . From equation (23) and using Fact 1 we obtain:
We complete the proof using the following lemma from :
([32, Thm. 8]) Suppose that for some , the following bounds hold for all :
Here we note that the same proof works for overdetermined systems. In particular, this works because Lemma 2 holds for and also (see Thm. 3.6 in ). Also, Lemma 3 follows for both overdetermined consistent systems (see table in Section 3.1) as well as overdetermined inconsistent systems (from (21) and subsequent arguments).
3 Comparison
Theorem 1 shows that, like the RK and REK methods, REGS converges linearly to the least-norm solution in the underdetermined case. We believe it serves to complement existing analysis and completes the theory of these iterative methods in all three cases of interest. For that reason, we compare the three approaches for the underdetermined setting here. For ease of comparison, set as in Theorem 1, and write for the condition number of . From the convergence rate bounds for RK and REK given in Section 3, and after applying elementary bounds to (22) of Theorem 1, we have:
We find similar results in the overdetermined, inconsistent setting. Using the convergence rate bounds for RGS , REK , and REGS (Theorem 1), also given in section 3, we have:
Thus, up to constant terms (which are likely artifacts of the proofs), the bounds provide the same convergence rate , which is not surprising in light of the connections between the methods. In the next section, we compare these approaches experimentally.
Empirical Results
We also tested REGS on tomography problems using the Regularization toolbox by Hansen (http://www.imm.dtu.dk/pcha/Regutools/). For the 2D tomography problem with an matrix where and , we use and for our experiments. Here, consists of samples of absorption along a random line on an grid and is the oversampling factor. The results from this experiment are shown in Figure 2.
Conclusion
The Kaczmarz and Gauss-Seidel methods operate in two different spaces (i.e. row versus column space), but share many parallels. In this paper we drew connections between these two methods, highlighting the similarities and differences in convergence analysis. The approaches possess conflicting convergence properties; RK converges to the desired solution in the underdetermined case but not the inconsistent overdetermined setting, while RGS does the exact opposite. The extended method REK in the Kaczmarz framework fixes this issue, converging to the solution in both scenarios. Here, we present the REGS method, a natural extension of RGS, which completes the overall picture. We hope that our unified analysis of all four methods will assist researchers working with these approaches.
Acknowledgments
Needell was partially supported by NSF CAREER grant , and the Alfred P. Sloan Fellowship. Ma was supported in part by AFOSR MURI grant FA9550-10-1-0569. Ramdas was supported in part by ONR MURI grant N000140911052. The authors would also like to thank the Institute of Pure and Applied Mathematics (IPAM) at which this collaboration started, and the reviewers of this manuscript for their thoughtful suggestions. We also thank the author of who pointed out the need to clarify what we have now written as Remark 1.