Efficient Accelerated Coordinate Descent Methods and Faster Algorithms for Solving Linear Systems
Yin Tat Lee, Aaron Sidford
Introduction
In recent years iterative methods for convex optimization that make progress in sublinear time using only partial information about the function and its gradient have become of increased importance to both the theory and practice of computer science. From a practical perspective, the increasing volume and distributed nature of data are forcing efficient practical algorithms to be amenable to asynchronous and parallel settings where only a subset of the data is available to a single processor at any point in time. From a theoretical perspective, rapidly converging algorithms with sublinear time update steps create hope for new provable asymptotic running times for old problems and stronger guarantees for efficient algorithms in distributed and asynchronous settings.
The idea of using simple sublinear-time iterative steps to solve convex optimization problems is an old one . It is an algorithmic design principle that has seen great practical success but has been notoriously difficult to analyze. In the past few years great strides have been taken towards developing a theoretical understanding of randomized variants of these approaches. Of particular relevance to this paper, in 2006 Strohmer and Vershynin showed that a particular sublinear update algorithm for solving overconstrained linear systems called randomized Kaczmarz converges exponentially, in 2010 Nesterov analyzed randomized analog of gradient descent that updates only a single coordinate in each iteration, called coordinate gradient descent method, and provided a computationally inefficient but theoretically interesting accelerated variant, called accelerated coordinate gradient descent method (ACDM), and in 2013 Kelner et al. presented a simple combinatorial iterative algorithm with sublinear-time update steps that can be used to solve symmetric diagonally dominant (SDD) linear systems, a broad class of linear systems with numerous applications.
In this paper we provide a framework that both strengthens and unifies these results. We present a more general version of Nesterov’s ACDM and show how to implement it so that each iteration has the same asymptotic runtime as its non-accelerated variants. We show that this method is numerically stable and optimal under certain assumptions. Then we show how to use this method to outperform conjugate gradient in solving a general class of symmetric positive definite systems of equations. Furthermore, we show how to cast both randomized Kaczmarz and the SDD solver of Kelner et al. in this framework and achieve faster running times through the use of ACDM.
Due to the success of the Kaczmarz method in practice and due to the wide array of theoretical problems for which the fastest running time is obtained through the use of a nearly linear time SDD solver , we hope that the ideas in this paper can be used to make advancements on both fronts.
Gradient descent is one of the oldest and most fundamental methods in convex optimization. Given a convex differentiable function the gradient descent method is a simple greedy iterative method that computes the gradient at the current point and uses that information to perform an update and make progress. This method is central to much of scientific computing and from a theoretical perspective the standard method is well understood . There are multiple more sophisticated variants of this method , but many of them have only estimates of local convergence rates which makes them difficult to be applied to theoretical problems and be compared in general.
In 1983, Nesterov proposed a way to accelerate the gradient descent method by iteratively developing an approximation to the function through what he calls an estimate sequence. This accelerated gradient descent method or fast gradient method has the same worst case running time as conjugate gradient method and it is applicable to general convex functions. Recently, this method has been used to improve the fastest known running time of some fundamental problems in computer science, such as compressive sensing , undirected maximum flow , linear programming .
The accelerated gradient descent method is known to achieve an optimal (up to constants) convergence rate among all first order methods, that is algorithm that only have access to the function’s value and gradient . Therefore, to further improve accelerated gradient descent one must either assume more information about the function or find a way to reduce the cost of each iteration. Using the idea of fast but crude iteration steps, Nesterov proposed a randomized coordinate descent method , which minimizes convex functions by updating one randomly chosen coordinate in each iteration.
Coordinate descent methods, which use gradient information about a single coordinate to update a single coordinate in each iteration, have been around for a long time . Various schemes have been considered for picking the coordinate to update, such as cyclic coordinate update and the best coordinate update, however these schemes are either hard to estimate or difficult to be implemented efficiently. Both the recent work of Strohmer and Vershynin and Nesterov overcame these obstacles by showing that by performing particular randomized updates one can produce methods with provable global convergence rate and small costs per iteration.
Applying the similar ideas of accelerated gradient descent, Nesterov also proposed an accelerated variant called the accelerated coordinate descent method (ACDM) that achieves a faster convergence rate while still only considering a single coordinate of the gradient at a time. However, in both Nesterov’s paper and later work , this method was considered inefficient as the computational complexity of the naive implementation of each iteration of ACDM requires time to update every coordinate of the input, at which point the accelerated gradient descent method would seem preferable.
2 Our Contributions
In this paper, we generalize Nesterov’s ACDM and present a simple technique to implement each coordinate update step efficiently. Our contributions towards understanding ACDM include the following:
Generalization: A generalization of ACDM to a broader class of sampling probabilities, overcoming technical challenges due to skewed sampling probabilities, so that any convergence rate achieved through Nesterov’s coordinate descent method can be improved by ACDM. This generalization was essential for applications considered later in the paper.
Probabilistic Estimate Sequences: A self contained proof of correctness for ACDM motivated by a generalization of estimate sequences we simply call probabilistic estimate sequences.
Efficiency: A proof that under mild assumptions about the oracle for querying function and gradient values, each iteration of ACDM can be implemented with the same asymptotic cost as an equivalent coordinate descent step.
Numerical Stability: A proof that ACDM is numerically stable and can be implemented with finite precision arithmetic and no overhead in the standard unit cost RAM model.
Lower Bound: A lower bound argument showing that ACDM achieves an optimal convergence rate among a certain class of coordinate descent algorithms.
In some sense, the principle difference between the asymptotic running time of ACDM and accelerated gradient descent (or conjugate gradient in the linear system case) is that as accelerated gradient descent depends on the maximum eigenvalue of the Hessian of the function being minimized, ACDM instead depends on the trace of the Hessian and has the possibility of each iteration costing a small fraction of the cost a single iteration of of accelerated gradient descent. As a result, any nontrivial bound on the trace of the Hessian and the computational complexity of performing a single coordinate update creates the opportunity for ACDM to yield improved running times. To emphasize this point, we focus on applications of our method to solving linear systems and we use three different cases to illustrate the flexibility and competitiveness of our method.
Symmetric Positive Definite Systems: We show that under mild assumptions ACDM solves positive definite systems with a faster asymptotic running time than conjugate gradient (and an even milder set of assumptions for Chebyshev method), and it is an asymptotically optimal algorithm for solving general systems in certain regimes.
Overdetermined Systems: For over-constrained systems of equations the randomized Kaczmarz method of Strohmer and Vershynin , which iteratively picks a random constraint and projects the current solution onto the plane corresponding to a random constraint, has been shown to have strong convergence guarantees and appealing practical performance. We show how to cast this method in the framework of coordinate descent and accelerate it using ACDM yielding improved asymptotic performance. Given the appeal of Kaczmarz methods for practical applications such as image reconstructions , there is hope that this could yield improved performance in practice.
We remark that while our applications analysis focus on solving linear systems there is hope that our ACDM algorithm will have a broader impact in both theory and practice of efficient algorithms. Just as the accelerated gradient descent method has improved the theoretical and empirical running time of various, both linear and nonlinear, gradient descent algorithms , we hope that ACDM will improve the running time of various algorithms for which coordinate descent based approaches have proven effective. Given the generality of our analysis and the previous difficulty in analyzing such methods, we hope that this is just the next towards a new class of provably efficient algorithms with good empirical performance.
3 Paper Organization
The rest of our paper is organized as follows. In Section 2, we introduce the problem and function properties that we will use for optimization. In Section 3, we briefly review the mathematics behind gradient descent, accelerated gradient descent, and coordinate descent. In Section 4, we present our general ACDM implementation, prove correctness, and show how to implement the update steps efficiently. In Section 5, we show how to apply ACDM to achieve faster runtimes for various linear system solving scenarios. In Section 6, we give lower bounds that show that ACDM is optimal in certain regimes, and in the appendix we provide missing details of the ACDM convergence and numerical stability proofs.
Preliminaries
In this paper, we consider the unconstrained minimization problem Many of the results in this paper can be generalized to constrained minimization problems , problems where is strongly convex with respect to different norms , and problems where each coordinate is a higher dimension variable (i.e. the block setting) . However, for simplicity we focus on the basic coordinate unconstrained problem in the Euclidian norm.
To minimize , we restrict our attention to first-order iterative methods, that is algorithms that generate a sequence of points such that , while only evaluating the objective function, , and its gradient , at points. In other words, other than the global function parameters related to that we define in this section, we assume that the algorithms under consideration have no further knowledge regarding the structure of . To compare such algorithms, we say that an iterative method has convergence rate if for this method.
Now, we say that has convexity parameter with respect to some norm if the following holds
Furthermore, we say has -Lipschitz gradient if
The definition is related to an upper bound on as follows:
We call the right hand side of (2) the upper envelope of at .
The convexity parameter and the Lipschitz constant of the gradient provide lower and upper bounds on . They serve as the essential characterization of for first-order methods and they are typically the only information about provided to both gradient and accelerated gradient descent methods (besides the oracle access to and ). For twice differentiable , these values can also be computed by properties of the Hessian of by the following well known lemma:
We give two examples for convex functions induced by linear systems and calculate their parameters. Note that even though one example can be deduced from the other, we provide both as the analysis allows us to introduce more notation.
Consequently, is the the smallest eigenvalue of and is the largest eigenvalue of . Furthermore, we see that and therefore satisfies
Let for any matrix . Then and . Hence, and satisfy
and we see that is the the smallest eigenvalue of and is the largest eigenvalue of . As in the previous example, we therefore have where is the -th column of and , the Frobenius norm of .
Review of Previous Iterative Methods
In this section, we briefly review several standard iterative first-order method for smooth convex minimization. This overview is by no means all-inclusive, our goal is simply to familiarize the reader with numerical techniques we will make heavy use of later and motivate our presentation of the accelerated coordinate descent method. For a more comprehensive review, there are multiple good references, e.g. , .
In Section 3.1, we briefly review the standard gradient descent method. In Section 3.2, we show how to improve gradient descent by motivating and reviewing Nesterov’s accelerated gradient descent method through a more recent presentation of his via estimate sequences . In Section 3.3, we review Nesterov’s coordinate gradient descent method . In the next section we combine these concepts to present a general and efficient accelerated coordinate descent scheme.
For , this method simply chooses the minimum point of the upper envelope of at :
Thus, we see that the gradient descent method is a greedy method that chooses the minimum point based on the worst case estimate of the function based on the value of and . It is well known that it provides the following guarantee [22, Cor 2.1.2, Thm 2.1.15]
2 Accelerated Gradient Descent
To speed up the greedy and memory-less gradient descent method, one could make better use of the history and pick the next step to be the smallest point in upper envelope of all points computed. Formally, one could try the following update rule
However, this problem is difficult to solve efficiently and requires storing all previous points. To overcome this problem, Nesterov suggested to use a quadratic function to estimate the function. Formally, we define an estimate sequence as follows: Note that our definition deviates slightly from Nesterov’s [22, Def 2.2.1] in that we include condition 5.
An estimate sequence of is an approximate lower bound of which is slightly above . This relaxed definition allows us to find a better computable approximation of instead of relying on the worst case upper envelope at each step.
A good estimate sequence gives an efficient algorithm [22, Lem 2.2.1] by the following
Since an estimate sequence is an approximate lower bound, a natural computable candidate is to use the convex combination of lower envelopes of at some points.
Since it can be shown that any convex combinations of lower envelopes at evaluation points satisfies (4) under some mild condition, additional points other than can be used to tune the algorithm. Nesterov’s accelerated gradient descent method can be obtained by tuning the the free parameters and to satisfy (5). Among all first order methods, this method is optimal up to constants in terms of number of queries made to and . The performance of the accelerated gradient descent method can be characterized as follows:
3 Coordinate Descent
and then performing a gradient descent step on that coordinate:
To analyze this algorithm’s convergence rate, we define the norm , its dual , and the inner product which induces this norm as follows:
and we let denote the convexity parameter of with respect to .
Using the definition of coordinate-wise Lipschitz constant, each step can be shown to have the following guarantee on expected improvement
and further analysis shows the following convergence guarantee coordinate descent
General Accelerated Coordinate Descent
In this section, we present our general and iteration-efficient accelerated coordinate descent method (ACDM). In particular, we show how to improve the asymptotic convergence rate of any coordinate descent based algorithm without paying asymptotic cost in the running time due to increased cost in querying and updating a coordinate. We remark that the bulk of the credit for conceiving of such a method belongs to Nesterov who provided a different proof of convergence for such a method for the case, however we note that changes to the algorithm were necessary to deal with the case used in all of our applications.
The rest of this section is structured as follows. In Section 4.1, we introduce and prove the correctness of general ACDM through what we call (probabilistic) estimation sequences, in Section 4.2, we present the numerical stability results we achieve for this method, and in Section 4.3, we show how to implement this method efficiently. In Appendix B, we include some additional details for the correctness proof and in Appendix C, we provide the details of the numerical stability proof.
Following the spirit of the estimate sequence proof of accelerated gradient descent , here we present a proof of ACDM convergence through what we call a (probabilistic) estimation sequence.
A probabilistic estimation sequence gives a randomized minimization method due to the following
Since a probabilistic estimation sequence can be constructed using random partial derivatives, rather than a full gradient computations, there is hope that in certain cases probabilistic estimation sequences require less information for fast convergence and therefore outperform their deterministic counterparts.
Similar to the accelerated gradient descent method, in the following lemma we first show how to combine a sequence of lower envelopes to satisfy condition (7) and prove that it preserves a particular structure on the current lower bound.
Let , be such that
Each and
Each is chosen randomly such that we have .
Then the pair of sequences defined by
and
satisfies condition (7). Furthermore, if , then this process produces a sequence of quadratic functions of the form where
The proof follows from direct calculations however for completeness and for later use we provide a proof of an even more general statement in Appendix A. ∎
yields a probabilistic estimate sequence. This accelerated coordinate descent method satisfies
By construction we know that condition (7) of probabilistic estimation sequences holds. It remains to show that satisfies condition (8) and analyze the convergence rate. To prove (8), we proceed by induction to prove the following statement:
where for notational convenience we drop the expectation in each of the variables. By convexity so applying this and the definitions of and we get
From this formula we see that was chosen specifically to cancel the second term so that
To compute , we use the fact that applying Lemma 1 to the formula yields
and therefore for as defined we have
2 Numerical Stability
and by better tuning we derive the following algorithm.
The following theorem states that error in vector updates does not grow too fast and Lemma 14 further states that errors in the coefficients , , and also does not grow too fast. Hence, bits of precision is sufficient to implement ACDM with the following convergence guarantees. Below we state this result formally and we refer the reader to Appendix C for the proof.
This theorem provides useful estimates for the error , the residual , and the norm of gradient . Note how the estimate depends on the initial error mildly as compared to Theorem 4. We use this fact in creating an efficient SDD solver in Section 5.3.
3 Efficient Iteration
By the specification of the ACDM algorithm, we have
we see that each update step of ACDM can be rewritten as
in time plus the time needed for one oracle call.
To complete the lemma, we simply need to show that this scheme is numerically stable. Note that
Hence, bits of precision suffice to implement this method. ∎
Faster Linear System Solvers
In this section, we show how ACDM can be used to achieve asymptotic runtimes that outperform various state-of-the-art methods for solving linear systems in a variety of settings. In Section 5.1, we show how to use coordinate descent to outperform conjugate gradient under minor assumptions regarding the eigenvalues of the matrix under consideration, in Section 5.2, we show how to improve the convergence rate of randomized Kaczmarz type methods, and in Section 5.3, we show how to use the ideas to achieve the current fastest known numerically stable solver for symmetric diagonally dominant (SDD) systems of equations.
Here we compare the performance of ACDM to conjugate gradient (CG) and show that under mild assumptions about the linear system being solved ACDM achieves a better asymptotic running time.
2 Accelerating Randomized Kaczmarz
There are many schemes that can be chosen to pick the hyperplane , many of which are difficult to analyze and compare, but in a breakthrough result, Strohmer and Vershynin in 2008 analyzed the randomized schemes which sample the hyperplane with probability proportional to . They proved the following
The Kaczmarz method samples row with probability proportionally to at each iteration and yields the following
Here we show show to cast this algorithm as an instance of coordinate descent and obtain an improved convergence rate by applying ACDM. We remark that accelerated Kaczmarz will sample rows with a slightly different probability distribution. As long as this does not increase the expected computational cost of an iteration, it will yield an algorithm with a faster asymptotic running time.
The ACDM method samples row with probability proportionally to and performs extra work at each iteration. It yields the following
Therefore, we attempt solve the following equivalent problem using accelerated coordinate descent.
From Example 2, we know that the -th direction component-wise Lipschitz constant is where is the -th row of and we know that . Therefore, each step of the coordinate descent method consists of the following We ignore the thresholding here for illustration purpose.
Recalling that we had performed the transformation we see that the corresponding step in is
Therefore, the randomized coordinate descent method applied this way yields precisely the randomized Kaczmarz method of Strohmer and Vershynin.
However, to apply the ACDM method and provide complete theoretical guarantees we need to address the problem that as we have constructed it is not strongly convex. This is clear by the fact that the null space of may be non trivial.
and . This gives the same convergence guarantee with replaced with , and . The desired convergence rate follows from the strong convexity of .
To justify the cost of each iteration, note that although the iterations work on the variable, we can apply on all iterations and just do the operations on variables which is more efficient. ∎
3 Faster SDD Solvers in the Unit-cost RAM Model
Letting be the solution of , we will prove the following:
By the method of Lagrange multipliers, the minimizer of the problem
Now, let be any vector such that . With this, the problem can be simplified to
Now, for every off-tree edge, i.e. , there is only one cycle that passes through it. So, if we let be the diagonal matrix for the resistances of the off tree edges we have that for any
By the same reasoning as above, the convexity parameter of with respect to the Euclidian norm is and we bound the -th direction component-wise Lipschitz constant as follows
Now, as in , we can using the following result of Abraham and Neiman to ensure
In time, we can compute a spanning tree such that
Now, applying the coordinate descent method to and letting denote the off-tree edge i.e., coordinate, picked in iteration , we get We ignore the thresholding here again for illustration purpose.
Noting that the with corresponding to is , we write the update equivalently as
which is precisely the algorithm of the simple solver in . In , they also prove that calls to and updates to can be implemented in . Therefore, by applying ACDM we can obtain a faster algorithm. Note that apply ACDM efficiently computations of need to be performed on the sum of two vector without explicit summing them. However, since in this case is linear, we can just call the oracle on the two vectors separately and also use the data structure for updating coordinates in each vector separately.
While the above insight suffices to compute electric flows, i.e. , in the desired runtime in order to actually solve the Laplacian system we need to be able to compute such that the following holds.
Towards the Optimality of Accelerated Coordinate Descent
This assumption forbids the iterative method from starting at any other point other than the initial point and forbids the algorithm from randomly jumping to completely new points. We do not know if this assumption is necessary. However, if the iterative method can only access a function value and a derivative in a random direction at each iteration, then at least iterations are needed to obtain full local second order information, i.e. . Therefore maybe this assumption is not needed.
We assume without loss of generality that by shifting the domain and let
and where and are some constants that depend on the first and the last terms. To simply the term, we set and get
Now, we want to show that the produced by the iterative method is likely far away from optimal since likely has many zeros when is small. We say if only the first coordinates of are nonzero. If by the strong convexity of we have
Note that and if then for . Furthermore, by the assumption (15), the current point in an iteration goes from to only if the oracle gives the -th partial derivative. Since by assumption this happens with probability by applying (16), we have
where the last line comes from the fact that if is a Binomial distribution with probability with trial.
Next by taking the Taylor series at and use the assumption , we have
Acknowledgements
We thank Yan Kit Chim, Andreea Gane, Jonathan Kelner, Lap Chi Lau, and Lorenzo Orecchia for helpful discussions. This work was partially supported by NSF awards 0843915 and 1111109 and a NSF Graduate Research Fellowship (grant no. 1122374).
References
Appendix A Proof of Probabilistic Estimate Sequence Form
Let be a positive semidefine matrix such that is strongly convex with respect to the norm with convexity parameter . Furthermore let , be such that
Each and ,
Each is chosen randomly with probability .
Then the pair of sequences defined by
and ,
satisfies condition (7). Furthermore, if , then this process produces a sequence of quadratic functions of the form where
Applying the inductive hypothesis, we get
By strong convexity and the definition of , we then have that
Next , we prove the form of again by induction on . Suppose that . Now, we prove the form of . By the update rule and the inductive hypothesis, we see that
Consequently, for as desired and and yet to be determined.
To compute , we note that and therefore by the update rule
Therefore, we have that must satisfy
So, applying to each side yields the desired formula for .
Finally, to compute , we note that, by the form of and the update rule for creating , looking at yields
Combining these two formulas and noting that
yields the desired form of and completes the proof. ∎
Appendix B Bounding ACDM Coefficients
In the following lemma we prove bounds on and required for Theorem 5 and Lemma 6.
Using these insights we can now estimate the growth of coefficients and . As in the growth can be estimated by a recursive relation between and . For , our definitions imply
To bound , the definitions imply . Therefore since we have and consequently
Using (20) and (19), it is easy to prove the growth rates (17) and (18) by induction.
Appendix C Proof of Numerically Stable ACDM
Applying the convexity parameter and Lipschitz condition, we can bound the error induced by :
Now since we defined coefficients so the following hold
we see that the terms for and cancel and we obtain
Dropping the expectation in each variable we have that in expectation up to iteration
Now, we claim the following inequalities:
To bound the norm of the gradient, we show that if is large for many steps then decreases substantially. For notational simplicity we omit the expectation symbol and using (22), we have
To bound , we consider the lower envelops at and obtain
Since , we have
Step 3a of ACDM can be computed by the following formulas
and can be implemented using bits precision to obtain any additive accuracy.