Stochastic Dual Ascent for Solving Linear Systems

Robert Mansel Gower, Peter Richtarik

Introduction

Probabilistic ideas and tools have recently begun to permeate into several fields where they had traditionally not played a major role, including fields such as numerical linear algebra and optimization. One of the key ways in which these ideas influence these fields is via the development and analysis of randomized algorithms for solving standard and new problems of these fields. Such methods are typically easier to analyze, and often lead to faster and/or more scalable and versatile methods in practice.

In this paper we consider a key problem in linear algebra, that of finding a solution of a system of linear equations

where BB is an n×nn\times n symmetric positive definite matrix and xB=defxBx\|x\|_{B}\overset{\text{def}}{=}\sqrt{x^{\top}Bx}. By xx^{*} we denote the (necessarily) unique solution of (1.1). Of key importance in this paper is the dual problemTechnically, this is both the Lagrangian and Fenchel dual of (1.1). to (1.1), namely

Due to the consistency assumption, strong duality holds and we have P(x)=D(y)P(x^{*})=D(y^{*}), where yy^{*} is any dual optimal solution.

2 A new family of stochastic optimization algorithms

We propose to solve (1.1) via a new method operating in the dual (3), which we call stochastic dual ascent (SDA). The iterates of SDA are of the form

where SS is a random matrix with mm rows drawn in each iteration independently from a pre-specified distribution D{\cal D}, which should be seen as a parameter of the method. In fact, by varying D{\cal D}, SDA should be seen as a family of algorithms indexed by D\cal D, the choice of which leads to specific algorithms in this family. By performing steps of the form (4), we are moving in the range space of the random matrix SS. A key feature of SDA enabling us to prove strong convergence results despite the fact that the dual objective is in general not strongly concave is the way in which the “stepsize” parameter λk\lambda^{k} is chosen: we chose λk\lambda^{k} to be the least-norm vector for which D(yk+Sλ)D(y^{k}+S\lambda) is maximized in λ\lambda. Plugging this λk\lambda^{k} into (4), we obtain the SDA method:

The symbol \dagger denotes the Moore-Penrose pseudoinverseIt is known that the vector MdM^{\dagger}d is the least-norm solution of the least-squares problem minλMλd2\min_{\lambda}\|M\lambda-d\|^{2}. Hence, if the system Mλ=dM\lambda=d has a solution, then Md=argminλ{λ  :  Mλ=d}M^{\dagger}d=\arg\min_{\lambda}\{\|\lambda\|\;:\;M\lambda=d\}..

To the best of our knowledge, a randomized optimization algorithm with iterates of the general form (4) was not considered nor analyzed before. In the special case when SS is chosen to be a random unit coordinate vector, SDA specializes to the randomized coordinate descent method, first analyzed by Leventhal and Lewis . In the special case when SS is chosen as a random column submatrix of the m×mm\times m identity matrix, SDA specializes to the randomized Newton method of Qu, Fercoq, Richtárik and Takáč .

With the dual iterates {yk}\{y^{k}\} we associate a sequence of primal iterates {xk}\{x^{k}\} as follows:

In combination with (5), this yields the primal iterative process

Optimality conditions (see Section 2.1) imply that if yy^{*} is any dual optimal point, then c+B1Ayc+B^{-1}A^{\top}y^{*} is necessarily primal optimal and hence equal to xx^{*}, the optimal solution of (1.1). Moreover, we have the following useful and insightful correspondence between the quality of the primal and dual iterates (see Proposition 2.2):

Hence, dual convergence in function values is equivalent to primal convergence in iterates.

Our work belongs to a growing literature on randomized methods for various problems appearing in linear algebra, optimization and computer science. In particular, relevant methods include sketching algorithms, randomized Kaczmarz, stochastic gradient descent and their variants and randomized coordinate and subspace type methods and their variants .

3 The main results

We now describe two complexity theorems which form the core theoretical contribution of this work. The results hold for a wide family of distributions D{\cal D}, which we describe next.

In our analysis, we only impose a very weak assumption on D\cal D. In particular, we only assume that the m×mm\times m matrix

is well defined and nonsingularIt is known that the pseudoinverse of a symmetric positive semidefinite matrix is again symmetric and positive semidefinite. As a result, if the expectation defining HH is finite, HH is also symmetric and positive semidefinite. Hence, we could equivalently assume that HH be positive definite. . Hence, we do not assume that SS be picked from any particular random matrix ensamble: the options are, quite literally, limitless. This makes it possible for practitioners to choose the best distribution specific to a particular application.

We cast the first complexity result in terms of the primal iterates since solving (1.1) is our main focus in this work. Let Range(M),Rank(M)\mathbf{Range}\left(M\right),\mathbf{Rank}\left(M\right) and λmin+(M)\lambda_{\min}^{+}(M) denote the range space, rank and the smallest nonzero eigenvalue of MM, respectively.

where AB=defmax{AxB  :  xB1}\|A\|_{B}\overset{\text{def}}{=}\max\{\|Ax\|_{B}\;:\;\|x\|_{B}\leq 1\} and

Furthermore, the convergence rate is bounded by

If we let SS be a unit coordinate vector chosen at random, BB be the identity matrix and set c=0c=0, then (7) reduces to the randomized Kaczmarz (RK) method proposed and analyzed in a seminal work of Strohmer and Vershynin . Theorem 1.1 implies that RK converges with an exponential rate so long as the system matrix has no zero rows (see Section 3). To the best of our knowledge, such a result was not previously established: current convergence results for RK assume that the system matrix is full rank . Not only do we show that the RK method converges to the least-norm solution for any consistent system, but we do so through a single all encompassing theorem covering a wide family of algorithms. Likewise, convergence of block variants of RK has only been established for full column rank . Block versions of RK can be obtained from our generic method by choosing B=IB=I and c=0c=0, as before, but letting SS to be a random column submatrix of the identity matrix (see ). Again, our general complexity bound holds under no assumptions on AA, as long as one can find SS such that HH becomes nonsingular.

The lower bound (14) says that for a singular system matrix, the number of steps required by SDA to reach an expected accuracy is at best inversely proportional to the rank of AA. If AA has row rank equal to one, for instance, then RK converges in one step (this is no surprise, given that RK projects onto the solution space of a single row, which in this case, is the solution space of the whole system). Our lower bound in this case becomes , and hence is tight.

and let ρ\rho be as in Theorem 1.1. Then for all k0k\geq 0 we have the following complexity bounds:

Note that the dual objective function is not strongly concave in general, and yet we prove linear convergence (see (15)). It is known that for some structured optimization problems, linear convergence results can be obtained without the need to assume strong concavity (or strong convexity, for minimization problems). Typical approaches to such results would be via the employment of error bounds . In our analysis, no error bounds are necessary.

4 Outline

The paper is structured as follows. Section 2 describes the algorithm in detail, both in its dual and primal form, and establishes several useful identities. In Section 3 we characterize discrete distributions for which our main assumption on HH is satisfied. We then specialize our method to several simple discrete distributions to better illustrate the results. We then show in Section 4 how SDA can be applied to design new randomized gossip algorithms. We also show that our framework can recover some standard methods. Theorem 1.1 is proved in Section 5 and Theorem 1.2 is proved in Section 6. In Section LABEL:sec:experiments we perform a simple experiment illustrating the convergence of the randomized Kaczmarz method on rank deficient linear systems. We conclude in Section 8. To the appendix we relegate two elementary but useful technical results which are needed multiple times in the text.

Stochastic Dual Ascent

By stochastic dual ascent (SDA) we refer to a randomized optimization method for solving the dual problem (3) performing iterations of the form

where SS is a random matrix with mm rows drawn in each iteration independently from a prespecified distribution. We shall not fix the number of columns of SS; in fact, we even allow for the number of columns to be random. By performing steps of the form (18), we are moving in the range space of the random matrix SS, with λk\lambda^{k} describing the precise linear combination of the columns used in computing the step. In particular, we shall choose λk\lambda^{k} from the set

Since DD is bounded above (a consequence of weak duality), this set is nonempty. Since DD is a concave quadratic, QkQ^{k} consists of all those vectors λ\lambda for which the gradient of the mapping ϕk(λ):λD(yk+Sλ)\phi_{k}(\lambda):\lambda\mapsto D(y^{k}+S\lambda) vanishes. This leads to the observation that QkQ^{k} is the set of solutions of a random linear system:

If SS has a small number of columns, this is a small easy-to-solve system.

A key feature of our method enabling us to prove exponential error decay despite the lack of strong concavity is the way in which we choose λk\lambda^{k} from QkQ^{k}. In SDA, λk\lambda^{k} is chosen to be the least-norm element of QkQ^{k},

where λ=(iλi2)1/2\|\lambda\|=(\sum_{i}\lambda_{i}^{2})^{1/2} denotes standard Euclidean norm. The least-norm solution of a linear system can be written down in a compact way using the (Moore-Penrose) pseudoinverse. In our case, we obtain the formula

where \dagger denotes the pseudoinverse operator. Note that if SS has only a few columns, then (19) requires projecting the origin onto a small linear system. The SDA algorithm is obtained by combining (18) with (19).

The method has one parameter: the distribution D\cal D from which the random matrices SS are drawn. Sometimes, one is interested in finding any solution of the system Ax=bAx=b, rather than the particular solution described by the primal problem (1.1). In such situations, BB and cc could also be seen as parameters.

For any xx for which Ax=bAx=b and for any yy we have

As an immediate corollary of Proposition 2.1 we observe that for any dual optimal yy^{*}, the vector x(y)x(y^{*}) must be primal optimal. Since the primal problem has a unique optimal solution, xx^{*}, we must necessarily have

Another immediate corollary of Proposition 2.1 is the following characterization of dual optimality: yy is dual optimal if and only if

Hence, the set of dual optimal solutions is Y=(AB1A)(bAc)+Null(AB1A){\cal Y}^{*}=(AB^{-1}A^{\top})^{\dagger}(b-Ac)+\mathbf{Null}\left(AB^{-1}A^{\top}\right). Since, Null(AB1A)=Null(A)\mathbf{Null}\left(AB^{-1}A^{\top}\right)=\mathbf{Null}\left(A^{\top}\right) (see Lemma 10.1), we have

The particular dual optimal point y=(AB1A)(bAc)y^{*}=(AB^{-1}A^{\top})^{\dagger}(b-Ac) is the solution of the following optimization problem:

Hence, this particular formulation of the dual problem has the same form as the primal problem: projection onto a linear system.

If AAA^{\top}A is positive definite (which can only happen if AA is of full column rank, which means that Ax=bAx=b has a unique solution and hence the primal objective function does not matter), and we choose B=AAB=A^{\top}A, then the dual constraint (24) becomes

This constraint has a geometric interpretation: we are seeking vector yy whose orthogonal projection onto the column space of AA is equal to bAcb-Ac. Hence the reformulated dual problem (24) is asking us to find the vector yy with this property having the least norm.

2 Primal iterates associated with the dual iterates

With the sequence of dual iterates {yk}\{y^{k}\} produced by SDA we can associate a sequence of primal iterates {xk}\{x^{k}\} using the mapping (21):

This leads to the following primal version of the SDA method.

Self-duality. If AA is positive definite, c=0c=0, and if we choose B=AB=A, then in view of (25) we have xk=ykx^{k}=y^{k} for all kk, and hence Algorithms 1 and 2 coincide. In this case, Algorithm 2 can be described as self-dual.

Starting point. While we have defined the primal iterates of Algorithm 2 via a linear transformation of the dual iterates—see (25)—we can, in principle, choose x0x^{0} arbitrarily, thus breaking the primal-dual connection which helped us to define the method. In particular, we can choose x0x^{0} in such a way that there does not exist y0y^{0} for which x0=c+B1Ay0x^{0}=c+B^{-1}A^{\top}y^{0}. As is clear from Theorem 1.1, in this case the iterates {xk}\{x^{k}\} will not converge to xx^{*}, but to x+tx^{*}+t, where tt is the projection of x0cx^{0}-c onto the nullspace of AA.

It turns out that Algorithm 2 is equivalent to the sketch-and-project method (26) of Gower and Richtárik :

where SS is a random matrix drawn in an i.i.d. fashion from a fixed distribution, just as in this work. In this method, the “complicated” system Ax=bAx=b is first replaced by its sketched version SAx=SbS^{\top}Ax=S^{\top}b, the solution space of which contains all solutions of the original system. If SS has a few columns only, this system will be small and easy to solve. Then, progress is made by projecting the last iterate onto the sketched system.

We now briefly comment on the relationship between and our work.

Dual nature of sketch-and-project. It was shown in that Algorithm 2 is equivalent to the sketch-and-project method. In fact, the authors of provide five additional equivalent formulations of sketch-and-project, with Algorithm 2 being one of them. Here we show that their method can be seen as a primal process associated with SDA, which is a new method operating in the dual. By observing this, we uncover a hidden dual nature of the sketch-and-project method. For instance, this allows us to formulate and prove Theorem 1.2. No such results appear in .

No assumptions on the system matrix. In the authors only studied the convergence of the primal iterates {xk}\{x^{k}\}, establishing a (much) weaker variant of Theorem 1.1. Indeed, convergence was only established in the case when AA has full column rank. In this work, we lift this assumption completely and hence establish complexity results in the general case.

Convergence to a shifted point. As we show in Theorem 1.1, Algorithm 2 converges to x+tx^{*}+t, where tt is the projection of x0cx^{0}-c onto Null(A)\mathbf{Null}\left(A\right). Hence, in general, the method does not converge to the optimal solution xx^{*}. This is not an issue if AA is of full column rank—an assumption used in the analysis in —since then Null(A)\mathbf{Null}\left(A\right) is trivial and hence t=0t=0. As long as x0cx^{0}-c lies in Range(B1A)\mathbf{Range}\left(B^{-1}A^{\top}\right), however, we have xkxx^{k}\to x^{*}. This can be easily enforced (for instance, we can choose x0=cx^{0}=c).

3 Relating the quality of the dual and primal iterates

The following simple but insightful result (mentioned in the introduction) relates the “quality” of a dual vector yy with that of its primal counterpart, x(y)x(y). It says that the dual suboptimality of yy in terms of function values is equal to the primal suboptimality of x(y)x(y) in terms of distance.

Proof: Straightforward calculation shows that

Applying this result to sequence {(xk,yk)}\{(x^{k},y^{k})\} of dual iterates produced by SDA and their corresponding primal images, as defined in (25), we get the identity:

Therefore, dual convergence in function values D(yk)D(y^{k}) is equivalent to primal convergence in iterates xkx^{k}. Furthermore, a direct computation leads to the following formula for the duality gap:

Note that computing the gap is significantly more expensive than the cost of a single iteration (in the interesting regime when the number of columns of SS is small). Hence, evaluation of the duality gap should generally be avoided. If it is necessary to be certain about the quality of a solution however, the above formula will be useful. The gap should then be computed from time to time only, so that this extra work does not significantly slow down the iterative process.

Discrete Distributions

Both the SDA algorithm and its primal counterpart are generic in the sense that the distribution D\cal D is not specified beyond assuming that the matrix HH defined in (9) is well defined and nonsingular. In this section we shall first characterize finite discrete distributions for which HH is nonsingular. We then give a few examples of algorithms based on such distributions, and comment on our complexity results in more detail.

For simplicity, we shall focus on finite discrete distributions D\cal D. That is, we set S=SiS=S_{i} with probability pi>0p_{i}>0, where S1,,SrS_{1},\dots,S_{r} are fixed matrices (each with mm rows). The next theorem gives a necessary and sufficient condition for the matrix HH defined in (9) to be nonsingular.

Let D\cal{D} be a finite discrete distribution, as described above. Then HH is nonsingular if and only if

Proof: Let Ki=SiAB1/2K_{i}=S_{i}^{\top}AB^{-1/2}. In view of the identity (KiKi)=(Ki)Ki\left(K_{i}K_{i}^{\top}\right)^{\dagger}=(K_{i}^{\dagger})^{\top}K_{i}^{\dagger}, we can write

where Hi=piSi(Ki)KiSiH_{i}=p_{i}S_{i}(K_{i}^{\dagger})^{\top}K_{i}^{\dagger}S_{i}^{\top}. Since HiH_{i} are symmetric positive semidefinite, so is HH. Now, it is easy to check that yHiy=0y^{\top}H_{i}y=0 if and only if yNull(Hi)y\in\mathbf{Null}\left(H_{i}\right) (this holds for any symmetric positive semidefinite HiH_{i}). Hence, yHy=0y^{\top}Hy=0 if and only if yiNull(Hi)y\in\cap_{i}\mathbf{Null}\left(H_{i}\right) and thus HH is positive definite if and only if

In view of Lemma 10.1, Null(Hi)=Null(piKiSi)=Null(KiSi)\mathbf{Null}\left(H_{i}\right)=\mathbf{Null}\left(\sqrt{p_{i}}K_{i}^{\dagger}S_{i}^{\top}\right)=\mathbf{Null}\left(K_{i}^{\dagger}S_{i}^{\top}\right). Now, yNull(KiSi)y\in\mathbf{Null}\left(K_{i}^{\dagger}S_{i}^{\top}\right) if and only of SiyNull(Ki)=Null(Ki)=Null(ASi)S_{i}^{\top}y\in\mathbf{Null}\left(K_{i}^{\dagger}\right)=\mathbf{Null}\left(K_{i}^{\top}\right)=\mathbf{Null}\left(A^{\top}S_{i}\right). Hence, Null(Hi)=Null(ASiSi)\mathbf{Null}\left(H_{i}\right)=\mathbf{Null}\left(A^{\top}S_{i}S_{i}^{\top}\right), which means that (28) is equivalent to Null([S1S1A,,SrSrA])={0}\mathbf{Null}\left([S_{1}S_{1}^{\top}A,\cdots,S_{r}S_{r}^{\top}A]^{\top}\right)=\{0\}. ∎

We have the following corollary.We can also prove the corollary directly as follows: The first assumption implies that SiAB1ASiS_{i}^{\top}AB^{-1}A^{\top}S_{i} is invertible for all ii and that V=def\mboxDiag(pi1/2(SiAB1ASi)1/2)V\overset{\text{def}}{=}\mbox{Diag}\left(p_{i}^{1/2}(S_{i}^{\top}A{B^{-1}}A^{\top}S_{i})^{-1/2}\right) is nonsingular. It remains to note that H=\eqrefeq:HE[S(SAB1AS)1S]=ipiSi(SiAB1ASi)1Si=SˉV2Sˉ.H\overset{\eqref{eq:H}}{=}\mathbf{E}\left[S\left(S^{\top}AB^{-1}A^{\top}S\right)^{-1}S^{\top}\right]\\ =\sum_{i}p_{i}S_{i}\left(S_{i}^{\top}AB^{-1}A^{\top}S_{i}\right)^{-1}S_{i}^{\top}=\bar{S}V^{2}\bar{S}^{\top}.

Assume that SiAS_{i}^{\top}A has full row rank for all ii and that Sˉ=def[S1,,Sr]\bar{S}\overset{\text{def}}{=}[S_{1},\ldots,S_{r}] is of full row rank. Then HH is nonsingular.

Submatrices of the identity matrix. We can let SS be a random column submatrix of the m×mm\times m identity matrix II. There are 2m12^{m}-1 such potential submatrices, and we choose 1r2m11\leq r\leq 2^{m}-1. As long as we choose S1,,SrS_{1},\dots,S_{r} in such a way that each column of II is represented in some matrix SiS_{i}, the matrix Sˉ\bar{S} will have full row rank. Furthermore, if SiAS_{i}^{\top}A has full row rank for all ii, then by the above corollary, HH is nonsingular. Note that if the row rank of AA is rr, then the matrices SiS_{i} selected by the above process will necessarily have at most rr columns.

Count sketch and Count-min sketch. Many other “sketching” matrices SS can be employed within SDA, including the count sketch and the count-min sketch . In our context (recall that we sketch with the transpose of SS), SS is a count-sketch matrix (resp. count-min sketch) if it is assembled from random columns of [I,I][I,-I] (resp II), chosen uniformly with replacement, where II is the m×mm\times m identity matrix.

2 Randomized Kaczmarz as the primal process associated with randomized coordinate ascent

Let B=IB=I (the identity matrix). The primal problem then becomes

This is the randomized coordinate ascent method applied to the dual problem. In the form popularized by Nesterov , it takes the form

It can be easily verified that (30) holds with Li=Ai:2L_{i}=\|A_{i:}\|^{2} and that eiD(yk)=biAi:cAi:Ayke_{i}^{\top}\nabla D(y^{k})=b_{i}-A_{i:}c-A_{i:}A^{\top}y^{k}.

Primal iterates.

The associated primal iterative process (Algorithm 2) takes the form

This is the randomized Kaczmarz method of Strohmer and Vershynin .

The rate.

Let us now compute the rate ρ\rho as defined in (13). It will be convenient, but not optimal, to choose the probabilities via

where F\|\cdot\|_{F} denotes the Frobenius norm (we assume that AA does not contain any zero rows). Since

In general, the rate ρ\rho is a function of the probabilities pip_{i}. The inverse problem: “How to set the probabilities so that the rate is optimized?” is difficult. If AA is of full column rank, however, it leads to a semidefinite program .

Furthermore, if r=Rank(A)r=\mathbf{Rank}\left(A\right), then in view of (14), the rate is bounded as

Assume that AA is of rank r=1r=1 and let A=uvA=uv^{\top}. Then AA=(uu)vvA^{\top}A=(u^{\top}u)vv^{\top}, and hence this matrix is also of rank 1. Therefore, AAA^{\top}A has a single nonzero eigenvalue, which is equal its trace. Therefore, λmin+(AA)=Tr(AA)=AF2\lambda_{\min}^{+}(A^{\top}A)=\mathbf{Tr}\left(A^{\top}A\right)=\|A\|^{2}_{F} and hence ρ=0\rho=0. Note that the rate ρ\rho reaches its lower bound and the method converges in one step.

Remarks.

For randomized coordinate ascent applied to (non-strongly) concave quadratics, rate (33) has been established by Leventhal and Lewis . However, to the best of our knowledge, this is the first time this rate has also been established for the randomized Kaczmarz method. We do not only prove this, but show that this is because the iterates of the two methods are linked via a linear relationship. In the c=0,B=Ic=0,B=I case, and for row-normalized matrix AA, this linear relationship between the two methods was recently independently observed by Wright . While all linear complexity results for RK we are aware of require full rank assumptions, there exist nonstandard variants of RK which do not require such assumptions, one example being the asynchronous parallel version of RK studied by Liu, Wright and Sridhar . Finally, no results of the type (16) (primal suboptimality) and (17) (duality gap) previously existed for these methods in the literature.

3 Randomized block Kaczmarz is the primal process associated with randomized Newton

Let B=IB=I, so that we have the same pair of primal dual problems as in Section 3.2.

Let us now choose SS to be a random column submatrix of the m×mm\times m identity matrix II. That is, we choose a random subset C{1,2,,m}C\subset\{1,2,\dots,m\} and then let SS be the concatenation of columns jCj\in C of II. We shall write S=ICS=I_{C}. Let pCp_{C} be the probability that S=ICS=I_{C}. Assume that for each j{1,,m}j\in\{1,\dots,m\} there exists CC with jCj\in C such that pC>0p_{C}>0. Such a random set is called proper .

The SDA method (Algorithm 1) then takes the form

where λk\lambda^{k} is chosen so that the dual objective is maximized (see (19)). This is a variant of the randomized Newton method studied in . By examining (19), we see that this method works by “inverting” randomized submatrices of the “Hessian” AAAA^{\top}. Indeed, λk\lambda^{k} is in each iteration computed by solving a system with the matrix ICAAICI_{C}^{\top}AA^{\top}I_{C}. This is the random submatrix of AAAA^{\top} corresponding to rows and columns in CC.

Primal iterates.

In view of the equivalence between Algorithm 2 and the sketch-and-project method (26), the primal iterative process associated with the randomized Newton method has the form

This method is a variant of the randomized block Kaczmarz method of Needell . The method proceeds by projecting the last iterate xkx^{k} onto a subsystem of Ax=bAx=b formed by equations indexed by the set CC.

The rate.

Provided that HH is nonsingular, the shared rate of the randomized Newton and randomized block Kaczmarz methods is

Qu et al study the randomized Newton method for the problem of minimizing a smooth strongly convex function and prove linear convergence. In particular, they study the above rate in the case when AAAA^{\top} is positive definite. Here we show that linear converges also holds for weakly convex quadratics (as long as HH is nonsingular).

An interesting feature of the randomized Newton method, established in , is that when viewed as a family of methods indexed by the size τ=C\tau=|C|, it enjoys superlinear speedup in τ\tau. That is, as τ\tau increases by some factor, the iteration complexity drops by a factor that is at least as large. It is possible to conduct a similar study in our setting with a possibly singular matrix AAAA^{\top}, but such a study is not trivial and we therefore leave it for future research.

4 Self-duality for positive definite A𝐴A

If AA is positive definite, then we can choose B=AB=A. As mentioned before, in this setting SDA is self-dual: xk=ykx^{k}=y^{k} for all kk. The primal problem then becomes

Note that the primal objective function does not play any role in determining the solution; indeed, the feasible set contains a single point only: A1bA^{-1}b. However, it does affect the iterative process.

This is the randomized coordinate ascent method applied to the dual problem.

The rate.

If we choose pi=Aii/Tr(A)p_{i}=A_{ii}/\mathbf{Tr}\left(A\right), then

It is known that for this problem, randomized coordinate descent applied to the dual problem, with this choice of probabilities, converges with this rate .

Application: Randomized Gossip Algorithms

In this section we apply our method and results to the distributed consensus (averaging) problem.

The nodes may represent people in a social network, with edges representing friendship and private value representing certain private information, such as salary. The goal would be to compute the average salary via an iterative process where only friends are allowed to exchange information. The nodes may represent sensors in a wireless sensor network, with an edge between two sensors if they are close to each other so that they can communicate. Private values represent measurements of some quantity performed by the sensors, such as the temperature. The goal is for the network to compute the average temperature.

We now show how one can model the consensus (averaging) problem in the form (1.1). Consider the projection problem

and note that the optimal solution xx^{*} must necessarily satisfy

for all ii. There are many ways in which the constraint forcing all coordinates of xx to be equal can be represented in the form of a linear system Ax=bAx=b. Here are some examples:

Each node is equal to all its neighbours. Let the equations of the system Ax=bAx=b correspond to constraints

is the Laplacian matrix of the graph (V,E)(V,E):

Each node is the average of its neighbours. Let the equations of the system Ax=bAx=b correspond to constraints

for iVi\in V, where N(i)=def{jV:{i,j}E}N(i)\overset{\text{def}}{=}\left\{j\in V\,\,:\,\,\{i,j\}\in E\right\} is the set of neighbours of node ii and di=defN(i)d_{i}\overset{\text{def}}{=}|N(i)| is the degree of node ii. That is, we require that the values stored at each node are equal to the average of the values of its neighbours. This corresponds to the choice b=0b=0 and

Spanning subgraph. Let (V,E)(V,E^{\prime}) be any connected subgraph of (V,E)(V,E). For instance, we can choose a spanning tree. We can now apply any of the 2 models above to this new graph and either require xi=xjx_{i}=x_{j} for all {i,j}E\{i,j\}\in E^{\prime}, or require the value xix_{i} to be equal to the average of the values xjx_{j} for all neighbours jj of ii in (V,E)(V,E^{\prime}).

Clearly, the above list does not exhaust the ways in which the constraint x1==xnx_{1}=\dots=x_{n} can be modeled as a linear system. For instance, we could build the system from constraints such as x1=x2+x4x3x_{1}=x_{2}+x_{4}-x_{3}, x1=5x24x7x_{1}=5x_{2}-4x_{7} and so on.

Different representations of the constraint x1==xnx_{1}=\cdots=x_{n}, in combination with a choice of D\cal D, will lead to a wide range of specific algorithms for the consensus problem (36). Some (but not all) of these algorithms will have the property that communication only happens along the edges of the network, and these are the ones we are interested in. The number of combinations is very vast. We will therefore only highlight two options, with the understanding that based on this, the interested reader can assemble other specific methods as needed.

2 Model 1: Each node is equal to its neighbours

Let us take y0=0y^{0}=0 (which means that x0=cx^{0}=c), so that in Theorem 1.1 we have t=0t=0, and hence xkxx^{k}\to x^{*}. The particular choice of the starting point x0=cx^{0}=c in the primal process has a very tangible meaning: for all ii, node ii initially knows value cic_{i}. The primal iterative process will dictate how the local values are modified in an iterative fashion so that eventually all nodes contain the optimal value xi=cˉx^{*}_{i}=\bar{c}.

In view of (37), for each edge e=(i,j)Ee=(i,j)\in E, we have Ae:2=2\|A_{e:}\|^{2}=2 and Ae:xk=xikxjkA_{e:}x^{k}=x^{k}_{i}-x^{k}_{j}. Hence, if the edge ee is selected by the RK method, (31) takes the specific form

From (40) we see that only the ithi^{\text{th}} and jthj^{\text{th}} coordinates of xkx^{k} are updated, via

Note that in each iteration of RK, a random edge is selected, and the nodes on this edge replace their local values by their average. This is a basic variant of the randomized gossip algorithm .

Invariance.

Insights from the dual perspective.

We can now bring new insight into the randomized gossip algorithm by considering the dual iterative process. The dual method (29) maintains weights yky^{k} associated with the edges of EE via the process:

where ee is a randomly selected edge. Hence, only the weight of a single edge is updated in each iteration. At optimality, we have x=c+Ayx^{*}=c+A^{\top}y^{*}. That is, for each ii

where δi\delta_{i} is the correction term which needs to be added to cic_{i} in order for node ii to contain the value cˉ\bar{c}. From the above we observe that these correction terms are maintained by the dual method as an inner product of the ithi^{\text{th}} column of AA and yky^{k}, with the optimal correction being δi=A:iy\delta_{i}=A_{:i}^{\top}y^{*}.

Rate.

Both Theorem 1.1 and Theorem 1.2 hold, and hence we automatically get several types of convergence for the randomized gossip method. In particular, to the best of our knowledge, no primal-dual type of convergence exist in the literature. Equation (27) gives a stopping criterion certifying convergence via the duality gap, which is also new.

In view of (33) and (38), and since AF2=2m\|A\|_{F}^{2}=2m, the convergence rate appearing in all these complexity results is given by

where LL is the Laplacian of (V,E)(V,E). While it is know that the Laplacian is singular, the rate depends on the smallest nonzero eigenvalue. This means that the number of iterations needed to output an ϵ\epsilon-solution in expectation scales as O((2m/λmin+(L))log(1/ϵ))O(\left(2m/\lambda_{\min}^{+}(L)\right)\log(1/\epsilon)), i.e., linearly with the number of edges.

3 Model 2: Each node is equal to the average of its neighbours

Observe that Ai:2=1+1/di\|A_{i:}\|^{2}=1+1/d_{i}. The RK method (31) applied to this formulation of the problem takes the form

where ii is chosen at random. This means that only coordinates in iN(i)i\cup N(i) get updated in such an iteration, the others remain unchanged. For node ii (coordinate ii), this update is

That is, the updated value at node ii is the average of the values of its neighbours and the previous value at ii. From (42) we see that the values at nodes jN(i)j\in N(i) get updated as follows:

Invariance.

It follows that the method satisfies the invariance property (41).

Rate.

The method converges with the rate ρ\rho given by (33), where AA is given by (39). If (V,E)(V,E) is a complete graph (i.e., m=n(n1)2m=\tfrac{n(n-1)}{2}), then L=(n1)2nAAL=\tfrac{(n-1)^{2}}{n}A^{\top}A is the Laplacian. In that case, AF2=Tr(AA)=n(n1)2Tr(L)=n(n1)2idi=n2n1\|A\|_{F}^{2}=\mathbf{Tr}\left(A^{\top}A\right)=\tfrac{n}{(n-1)^{2}}\mathbf{Tr}\left(L\right)=\tfrac{n}{(n-1)^{2}}\sum_{i}d_{i}=\tfrac{n^{2}}{n-1} and hence

Proof of Theorem 1.1

In this section we prove Theorem 1.1. We proceed as follows: in Section 5.1 we characterize the space in which the iterates move, in Section 5.2 we establish a certain key technical inequality, in Section 5.3 we establish convergence of iterates, in Section 5.4 we derive a rate for the residual and finally, in Section 5.5 we establish the lower bound on the convergence rate.

The following result describes the space in which the iterates move. It is an extension of the observation in Remark 2.2 to the case when x0x^{0} is chosen arbitrarily.

Assuming the relationship holds for kk, we have

where wk+1=wkS(SAB1AS)S(Axkb)w^{k+1}=w^{k}-S(S^{\top}AB^{-1}A^{\top}S)^{\dagger}S^{\top}(Ax^{k}-b). ∎

2 A key inequality

The following inequality is of key importance in the proof of the main theorem.

It is easy to see that these eigenvectors span Range(WGW)\mathbf{Range}\left(W^{\top}GW\right). Hence, we can write Wy=i=1ταiqiW^{\top}y=\sum_{i=1}^{\tau}\alpha_{i}q_{i} and therefore

Furthermore this bound is tight, as can be seen by selecting yy so that Wy=q1W^{\top}y=q_{1}. ∎

3 Convergence of the iterates

Subtracting x+tx^{*}+t from both sides of the update step of Algorithm 2, and letting

where we used that tNull(A)t\in\mathbf{Null}\left(A\right). Let

and note that in view of (9), E[Z]=AHA\mathbf{E}\left[Z\right]=A^{\top}HA. Taking norms and expectations (in SS) on both sides of (46) gives

where we applied Lemma 5.2 with W=AB1/2W=AB^{-1/2} and G=HG=H, so that WGW=B1/2AHAB1/2.W^{\top}GW=B^{-1/2}A^{\top}HAB^{-1/2}. Substituting (50) into (48) gives E[ek+1B2ek]ρekB2\mathbf{E}\left[\lVert e^{k+1}\rVert_{B}^{2}\,|\,e^{k}\right]\leq\rho\cdot\lVert e^{k}\rVert_{B}^{2}. Using the tower property of expectations, we obtain the recurrence

To prove (11) it remains to unroll the recurrence.

4 Convergence of the residual

We now prove (12). Letting Vk=xkxtB2V_{k}=\|x^{k}-x^{*}-t\|_{B}^{2}, we have

where in the step preceding the last one we have used Jensen’s inequality.

5 Proof of the lower bound

Since E[Z]=AHA\mathbf{E}\left[Z\right]=A^{\top}HA, using Lemma 10.1 with G=HG=H and W=AB1/2W=AB^{-1/2} gives

Hence, Rank(A)\mathbf{Rank}\left(A\right) is equal to the number of nonzero eigenvalues of B1/2E[Z]B1/2B^{-1/2}\mathbf{E}\left[Z\right]B^{-1/2}, from which we immediately obtain the bound

In order to obtain (14), it only remains to combine the above inequality with

Proof of Theorem 1.2

In this section we prove Theorem 1.2. We dedicate a subsection to each of the three complexity bounds.

Since x0c+Range(B1A)x^{0}\in c+\mathbf{Range}\left(B^{-1}A^{\top}\right), we have t=0t=0 in Theorem 1.1, and hence (11) says that

It remains to apply Proposition 2.2, which says that Uk=D(y)D(yk)U_{k}=D(y^{*})-D(y^{k}).

2 Primal suboptimality

Letting Uk=12xkxB2U_{k}=\tfrac{1}{2}\|x^{k}-x^{*}\|_{B}^{2}, we can write

By taking expectations on both sides of (52), and using Jensen’s inequality, we therefore obtain

which establishes the bound on primal suboptimality (16).

3 Duality gap

Having established rates for primal and dual suboptimality, the rate for the duality gap follows easily:

Numerical Experiments: Randomized Kaczmarz Method with Rank-Deficient System

To illustrate some of the novel aspects of our theory, we perform numerical experiments with the Randomized Kaczmarz method (31) (or equivalently the randomized coordinate ascent method applied to the dual problem (3)) and compare the empirical convergence to the convergence predicted by our theory. We test several randomly generated rank-deficient systems and compare the evolution of the empirical primal error xkx22/x0x22\lVert x^{k}-x^{*}\rVert_{2}^{2}/\lVert x^{0}-x^{*}\rVert_{2}^{2} to the convergence dictated by the rate ρ=1λmin+(AA)/AF2\rho=1-\lambda_{\min}^{+}\left(A^{\top}A\right)/\lVert A\rVert_{F}^{2} given in (33) and the lower bound 11/Rank(A)ρ1-1/\mathbf{Rank}\left(A\right)\leq\rho. From Figure 1 we can see that the RK method converges despite the fact that the linear systems are rank deficient. While previous results do not guarantee that RK converges for rank-deficient matrices, our theory does as long as the system matrix has no zero rows. Furthermore, we observe that the lower the rank of the system matrix, the faster the convergence of the RK method, and moreover, the closer the empirical convergence is to the convergence dictated by the rate ρ\rho and lower bound on ρ\rho. In particular, on the low rank system in Figure 1(a), the empirical convergence is very close to both the convergence dictated by ρ\rho and the lower bound. While on the full rank system in Figure 1(d), the convergence dictated by ρ\rho and the lower bound on ρ\rho are no longer an accurate estimate of the empirical convergence.

Conclusion

We have developed a versatile and powerful algorithmic framework for solving linear systems: stochastic dual ascent (SDA). In particular, SDA finds the projection of a given point, in a fixed but arbitrary Euclidean norm, onto the solution space of the system. Our method is dual in nature, but can also be described in terms of primal iterates via a simple affine transformation of the dual variables. Viewed as a dual method, SDA belongs to a novel class of randomized optimization algorithms: it updates the current iterate by adding the product of a random matrix, drawn independently from a fixed distribution, and a vector. The update is chosen as the best point lying in the random subspace spanned by the columns of this random matrix.

While SDA is the first method of this type, particular choices for the distribution of the random matrix lead to several known algorithms: randomized coordinate descent and randomized Kaczmarz correspond to a discrete distribution over the columns of the identity matrix, randomized Newton method corresponds to a discrete distribution over column submatrices of the identity matrix, and Gaussian descent corresponds to the case when the random matrix is a Gaussian vector.

We equip the method with several complexity results with the same rate of exponential decay in expectation (aka linear convergence) and establish a tight lower bound on the rate. In particular, we prove convergence of primal iterates, dual function values, primal function values, duality gap and of the residual. The method converges under very weak conditions beyond consistency of the linear system. In particular, no rank assumptions on the system matrix are needed. For instance, randomized Kaczmarz method converges linearly as long as the system matrix contains no zero rows.

Further, we show that SDA can be applied to the distributed (average) consensus problem. We recover a standard randomized gossip algorithm as a special case, and show that its complexity is proportional to the number of edges in the graph and inversely proportional to the smallest nonzero eigenvalue of the graph Laplacian. Moreover, we illustrate how our framework can be used to obtain new randomized algorithms for the distributed consensus problem.

Our framework extends to several other problems in optimization and numerical linear algebra. For instance, one can apply it to develop new stochastic algorithms for computing the inverse of a matrix and obtain state-of-the art performance for inverting matrices of huge sizes.

Acknowledgements

The second author would like to thank Julien Hendrickx from Université catholique de Louvain for a discussion regarding randomized gossip algorithms.

References

Appendix: Elementary Results Often Used in the Paper

We first state a lemma comparing the null spaces and range spaces of certain related matrices. While the result is of an elementary nature, we use it several times in this paper, which justifies its elevation to the status of a lemma. The proof is brief and hence we include it for completeness.

For any matrix WW and symmetric positive definite matrix GG,

Proof: In order to establish (53), it suffices to show the inclusion Null(W)Null(WGW)\mathbf{Null}\left(W\right)\supseteq\mathbf{Null}\left(W^{\top}GW\right) since the reverse inclusion trivially holds. Letting sNull(WGW)s\in\mathbf{Null}\left(W^{\top}GW\right), we see that G1/2Ws2=0\|G^{1/2}Ws\|^{2}=0, which implies G1/2Ws=0G^{1/2}Ws=0. Therefore, sNull(W)s\in\mathbf{Null}\left(W\right). Finally, (54) follows from (53) by taking orthogonal complements. Indeed, Range(W)\mathbf{Range}\left(W^{\top}\right) is the orthogonal complement of Null(W)\mathbf{Null}\left(W\right) and Range(WGW)\mathbf{Range}\left(W^{\top}GW\right) is the orthogonal complement of Null(WGW)\mathbf{Null}\left(W^{\top}GW\right). ∎

The following technical lemma is a variant of a standard result of linear algebra (which is recovered in the B=IB=I case). While the results are folklore and easy to establish, in the proof of our main theorem we need certain details which are hard to find in textbooks on linear algebra, and hence hard to refer to. For the benefit of the reader, we include the detailed statement and proof.

Note that s=B1Ays=B^{-1}A^{\top}y, where yy is any solution of the optimization problem

The first order necessary and sufficient optimality conditions are Ax=AB1AyAx=AB^{-1}A^{\top}y. In particular, we may choose yy to be the least norm solution of this system, which gives y=(AB1A)Axy=(AB^{-1}A^{\top})^{\dagger}Ax, from which (55) follows. The variational formulation (56) can be established in a similar way, again via first order optimality conditions (note that the closed form formula (56) also directly follows from (55) and the fact that t=xst=x-s).

It only remains to establish (60). Since B1ZAB^{-1}Z_{A} is a projector onto Range(B1A)\mathbf{Range}\left(B^{-1}A^{\top}\right) and since the trace of each projector is equal to the dimension of the space they project onto, we have Tr(B1ZA)=dim(Range(B1A))=dim(Range(A))=Rank(A)\mathbf{Tr}\left(B^{-1}Z_{A}\right)=\dim(\mathbf{Range}\left(B^{-1}A^{\top}\right))=\dim(\mathbf{Range}\left(A^{\top}\right))=\mathbf{Rank}\left(A\right).∎