Coordinate Descent Algorithms

Stephen J. Wright

Introduction

Coordinate descent (CD) algorithms for optimization have a history that dates to the foundation of the discipline. They are iterative methods in which each iterate is obtained by fixing most components of the variable vector xx at their values from the current iteration, and approximately minimizing the objective with respect to the remaining components. Each such subproblem is a lower-dimensional (even scalar) minimization problem, and thus can typically be solved more easily than the full problem.

CD methods are the archetype of an almost universal approach to algorithmic optimization: solving an optimization problem by solving a sequence of simpler optimization problems. The obviousness of the CD approach and its acceptable performance in many situations probably account for its long-standing appeal among practitioners. Paradoxically, the apparent lack of sophistication may also account for its unpopularity as a subject for investigation by optimization researchers, who have usually been quick to suggest alternative approaches in any given situation. There are some very notable exceptions. The 1970 text of Ortega and Rheinboldt (OrtR70, , Section 14.6) included a comprehensive discussion of “univariate relaxation,” and such optimization specialists as Luo and Tseng LuoT92a ; LuoT93a , Tseng Tse01a , and Bertsekas and Tsitsiklis BerT89 made important contributions to understanding the convergence properties of these methods in the 1980s and 1990s.

The situation has changed in recent years. Various applications (including several in computational statistics and machine learning) have yielded problems for which CD approaches are competitive in performance with more reputable alternatives. The properties of these problems (for example, the low cost of calculating one component of the gradient, and the need for solutions of only modest accuracy) lend themselves well to efficient implementations of CD, and CD methods can be adapted well to handle such special features of these applications as nonsmooth regularization terms and a small number of equality constraints. At the same time, there have been improvements in the algorithms themselves and in our understanding of them. Besides their extension to handle the features just mentioned, new variants that make use of randomization and acceleration have been introduced. Parallel implementations that lend themselves well to modern computer architectures have been implemented and analyzed. Perhaps most surprisingly, these developments are relevant even to the most fundamental problem in numerical computation: solving the linear equations Aw=bAw=b.

In the remainder of this section, we state the problem types for which CD methods have been developed, and sketch the most fundamental versions of CD. Section 2 surveys applications both historical and modern. Section 3 sketches the types of algorithms that have been implemented and analyzed, and presents several representative convergence results. Section 4 focuses on parallel CD methods, describing the behavior of these methods under synchronous and asynchronous models of computation.

Our approach throughout is to describe the CD methods in their simplest forms, to illustrate the fundamentals of the applications, implementations, and analysis. We focus almost exclusively on methods that adjust just one coordinate on each iteration. Most applications use block coordinate descent methods, which adjust groups of blocks of indices at each iteration, thus searching along a coordinate hyperplane rather than a single coordinate direction. Most derivation and analysis of single-coordinate descent methods can be extended without great difficulty to the block-CD setting; the concepts do not change fundamentally. We mention too that much effort has been devoted to developing more general forms of CD algorithms and analysis, involving weighted norms and other features, that allow more flexible implementation and allow the proof of stronger and more general (though usually not qualitatively different) convergence results.

The problem considered in most of this paper is the following unconstrained minimization problem:

Motivated by recent popular applications, it is common to consider the following structured formulation:

where ff is smooth, Ω\Omega is a regularization function that may be nonsmooth and extended-valued, and λ>0\lambda>0 is a regularization parameter. Ω\Omega is often convex and usually assumed to be separable or block-separable. When separable, Ω\Omega has the form

Block-separable examples include group-sparse regularizers in which Ωi(zi):=zi2\Omega_{i}(z_{i}):=\|z_{i}\|_{2}. Formulations of the type (2), with separable or block-separable regularizers, arise in such applications as compressed sensing, statistical variable selection, and model selection.

The class of problems known as empirical risk minimization (ERM) gives rise to a formulation that is particularly amenable to coordinate descent; see ShaZ13e . These problems have the form

we can write the Fenchel dual (Roc70, , Section 31) of (5) as follows:

where CC is the d×nd\times n matrix whose columns are cic_{i}, i=1,2,,ni=1,2,\dotsc,n. The dual formulation (7) is has special appeal as a target for coordinate descent, because of separability of the summation term.

One interesting case is the system of linear equations

which we assume to be a feasible system. The least-norm solution is found by solving

(We recover the primal solution from (10) by setting w=ATxw=A^{T}x.) We can see that (10) is a special case of the Fenchel dual (7) obtained from (5) if we set

where I{bi}I_{\{b_{i}\}} denotes the indicator function for bib_{i}, which is zero at bib_{i} and infinite elsewhere. (Its conjugate is I{bi}(si)=bisiI^{*}_{\{b_{i}\}}(s_{i})=b_{i}s_{i}.) The primal problem (9) can be restated correspondingly as

where AiA_{i} denotes the iith row of the matrix AA in (8), which has the form (5).

2 Outline of Coordinate Descent Algorithms

The basic coordinate descent framework for continuously differentiable minimization is shown in Algorithm 1. Each step consists of evaluation of a single component iki_{k} of the gradient f\nabla f at the current point, followed by adjustment of the iki_{k} component of xx, in the opposite direction to this gradient component. (Here and throughout, we use [f(x)]i[\nabla f(x)]_{i} to denote the iith component of the gradient f(x)\nabla f(x).) There is much scope for variation within this framework. The components can be selected in a cyclic fashion, in which i0=1i_{0}=1 and

They can be required to satisfy an “essentially cyclic” condition, in which for some TnT\geq n, each component is modified at least once in every stretch of TT iterations, that is,

Alternatively, they can be selected randomly at each iteration (though not necessarily with equal probability). Turning to steplength αk\alpha_{k}: we may perform exact minimization along the iki_{k} component, or choose a value of αk\alpha_{k} that satisfies traditional line-search conditions (such as sufficient decrease), or make a predefined “short-step” choice of αk\alpha_{k} based on prior knowledge of the properties of ff.

The CD framework for the separable regularized problem (2), (3) is shown in Algorithm 2. At iteration kk, a scalar subproblem is formed by making a linear approximation to ff along the iki_{k} coordinate direction at the current iterate xkx^{k}, adding a quadratic damping term weighted by 1/αk1/\alpha_{k} (where αk\alpha_{k} plays the role of a steplength), and treating the relevant regularization term Ωi\Omega_{i} explicitly. Note that when the regularizer Ωi\Omega_{i} is not present, the step is identical to the one taken in Algorithm 1. For some interesting choices of Ωi\Omega_{i} (for example Ωi()=\Omega_{i}(\cdot)=|\cdot|), it is possible to write down a closed-form solution of the subproblem; no explicit search is needed. The operation of solving such subproblems is often referred to as a “shrink operation,” which we denote by SβS_{\beta} and define as follows:

By stating the subproblem in Algorithm 2 equivalently as

we can express the CD update as zikkSλαk(xikkαk[f(xk)]ik)z^{k}_{i_{k}}\leftarrow S_{\lambda\alpha_{k}}(x^{k}_{i_{k}}-\alpha_{k}[\nabla f(x^{k})]_{i_{k}}).

Algorithms 1 and 2 can be extended to block-CD algorithms in a straightforward way, by updating a block of coordinates (denoted by the column submatrix UikU_{i_{k}} of the identity matrix) rather than a single coordinate. In Algorithm 2, it is assumed that the choice of block is consistent with the block-separable structure of the regularization function Ω\Omega, that is, UikU_{i_{k}} is a concatenation of several of the submatrices UiU_{i} in (4).

3 Application to Linear Equations

For the formulation (10) that arises from the linear system Aw=bAw=b, let us assume that the rows of AA are normalized, that is,

Applying Algorithm 1 to (10) with αk1\alpha_{k}\equiv 1, each step has the form

If we maintain and update the estimate wkw^{k} of the solution to the primal problem (9) after each update of xkx^{k}, according to wk=ATxkw^{k}=A^{T}x^{k}, we obtain

which is the update formula for the Kaczmarz algorithm kaczmarz37 . Following this update, we have using (14) that

so that the iki_{k} equation in the system Aw=bAw=b is now satisfied. This method if sometimes known as the “method of successive projections” because it projects onto the feasible hyperplane for a single constraint at every iteration.

4 Relationship to Other Methods

Stochastic gradient (SG) methods, also undergoing a revival of interest because of their usefulness in data analysis and machine learning applications, minimize a smooth function ff by taking a (negative) step along an estimate gkg^{k} of the gradient f(xk)\nabla f(x^{k}) at iteration kk. It is often assumed that gkg^{k} is an unbiased estimate of f(xk)\nabla f(x^{k}), that is, f(xk)=E(gk)\nabla f(x^{k})=E(g^{k}), where the expectation is taken over whatever random variables were used in obtaining gkg^{k} from the current iterate xkx^{k}. Randomized CD algorithms can be viewed as a special case of SG methods, in which gk=n[f(xk)]ikeikg^{k}=n[\nabla f(x^{k})]_{i_{k}}e_{i_{k}}, where iki_{k} is chosen uniformly at random from {1,2,,n}\{1,2,\dotsc,n\}. Here, iki_{k} is the random variable, and we have

certifying unbiasedness. However, CD algorithms have the advantage over general SG methods that descent in ff can be guaranteed at every iteration. Moreover, the variance of the gradient estimate gkg^{k} shrinks to zero as the iterates converge to a solution xx^{*}, since every component of f(x)\nabla f(x^{*}) is zero. By contrast, in general SG methods, the gradient estimates gkg^{k} may be nonzero even when xkx^{k} is a solution.

The relationship between CD and SG methods can also be discerned from the Fenchel dual pair (5) and (7). SG methods are quite popular for solving formulation (5), where the estimate gkg^{k} is obtained by taking a single term iki_{k} from the summation and using ϕik(cikTw)cik\nabla\phi_{i_{k}}(c_{i_{k}}^{T}w)c_{i_{k}} as the estimate of the gradient of the full summation. This approach corresponds to applying CD to the dual (7), where the component iki_{k} of xx is selected for updating at iteration kk. This relationship is typified by the Kaczmarz algorithm for Aw=bAw=b, which can be derived either as CD applied to the dual formulation (10) or as SG applied to the sum-of-squares problem

CD is related in an obvious way to the Gauss-Seidel method for n×nn\times n systems of linear equations, which adjusts the iki_{k} variable to ensure satisfaction of the iki_{k} equation, at iteration kk. (Successive over-relaxation (SOR) modifies this approach by scaling each Gauss-Seidel step by a factor (1+ω)(1+\omega) for some constan ω[0,1)\omega\in[0,1), chosen so as to improve the convergence rate.) Standard Gauss-Seidel and SOR use the cyclic choice of coordinates (11), whereas a random choice of iki_{k} would correspond to “randomized” versions of these methods. To make the connections more explicit: The Gauss-Seidel method applied to the normal equations for (8) — that is, ATAw=ATbA^{T}Aw=A^{T}b — is equivalent to applying Algorithm 1 to the least-squares problem (17), when the steplength αk\alpha_{k} is chosen to minimize the objective exactly along the given coordinate direction. SOR also corresponds to Algorithm 1, with αk\alpha_{k} chosen to be a factor (1+ω)(1+\omega) times the exact minimum. These equivalences allow the results of Section 3 to be used to derive convergence rates for Gauss-Seidel applied to the normal equations, including linear convergence when ATAA^{T}A is nonsingular. Note that these results do not require feasibility of the original equations (8).

Applications

We mention here several applications of CD methods to practical problems, some dating back decades and others relatively new. Our list is necessarily incomplete, but it attests to the popularity of CD in a wide variety of application communities.

Bouman and Sauer BouS96 discuss an application to positron emission tomography (PET) in which the objective has the form (2) where ff is smooth and convex and Ω\Omega is a sum of terms of the form xjxlq|x_{j}-x_{l}|^{q} for some pairs of components (j,l)(j,l) of xx and some qq\in. Ye et al. Ye:99 apply a similar method to a different objective arising from optical diffusion tomography.

Canutescu and Dunbrack CanD03 describe a cyclic coordinate descent method for determining protein structure, adjusting the dihedral angles in a protein chain so that the atom at the end of the chain comes close to a specified position in space.

Florian and Chen FloC95 recover origin-destination matrices from observed traffic flows by alternately solving a bilevel optimization problem over two blocks of variables: the origin-destination demands and the proportion of each origin-destination flow assigned to each arc in the network.

Breheny and Huang BreH11 discuss coordinate descent for linear and logistic regression with nonconvex separable regularization terms, reporting results for genetic association and gene expression studies. The SparseNet algorithm MazFH11 applied to problems with these same nonconvex separable regularizers uses warm-started cyclic coordinate descent as an inner loop to solve a sequence of problems in which the regularization parameter λ\lambda in (2) and the parameters defining concavity of the regularization functions are varied.

Friedman, Hastie, and Tibshirani FriHT08a propose a block CD algorithm for estimating a sparse inverse covariance matrix, given a sample covariance matrix SS and taking the variable in their formulation to be a modification WW of SS, such that W1W^{-1} is sparse. The resulting “graphical lasso” algorithm cycles through the rows/columns of WW (in the style of block CD), solving a standard lasso problem to calculate each update. The same authors FriHT10a apply CD to generalized linear models such as linear least squares and logistic regression, with convex regularization terms. Their framework include such formulations as lasso, graphical lasso, elastic net, and the Dantzig selector, and is implemented in the package glmnet.

Chang, Hsieh, and Lin ChaHL08a use cyclic and stochastic CD to solve a squared-loss formulation of the support vector machine (SVM) problem in machine learning, that is,

Sardy, Bruce, and Tseng SarBT00 consider the basis-pursuit formulation of wavelet denoising:

This formulation is equivalent to the well known lasso of Tibshirani Tib96 and has become famous because of its applicability to sparse recovery and compressed sensing. Although this formulation fits the ERM framework (5) and could thus be dualized before applying CD, the approach of SarBT00 applies block CD directly to the primal formulation.

Applications of block CD approaches to transceiver design for cellular networks and to tensor factorization are discussed in Razaviyayn (RazHL13, , Section 8).

Second, we consider the “alternating-direction method of multipliers” (ADMM) EckB92 ; BoyPCPE11a , which has gained great currency in the past few years because of its usefulness in solving regularized problems in statistics and machine learning, and in designing parallel algorithms. Each major iteration of ADMM consists of an (approximate) minimization of the augmented Lagrangian function for a constrained optimization problem over each block of primal variables in turn, followed by an update to the Lagrange multiplier estimates. It might seem appealing to do multiple cycles of updating the primal variable blocks, in the manner of cyclic block CD, thus finding a better approximation to the solution of each subproblem over all primal variables and moving the method closer to the standard augmented Lagrangian approach. Eckstein and Yao EckY14 show, however, that this “approximate augmented Lagrangian” approach has a fundamentally different theoretical interpretation from ADMM, and a computational comparison between the two approaches (EckY14, , Section 5) appears to show an advantage for ADMM.

Coordinate Descent: Algorithms, Convergence, Implementations

We now describe the most important variants of coordinate descent and present their convergence properties, including the proofs of some fundamental results. We also discuss the implementation of accelerated CD methods for problems of the form (7) and for the Kaczmarz algorithm for Aw=bAw=b. As mentioned in the introduction, we deal with the most elementary framework possible, to expose the essential properties of the methods.

It has minimizers at the corners (1,1,1)T(1,1,1)^{T} and (1,1,1)T(-1,-1,-1)^{T} of the unit cube, but coordinate descent with exact minimization, started near (but just outside of) one of the other vertices of the cube cycles around the neighborhoods of six points that are close to the six non-optimal vertices. Powell shows that the cyclic nonconvergence behavior is rather special and is destroyed by small perturbations on this particular example, and we can note that a randomized coordinate descent method applied to this example would be expected to converge to the vicinity of a solution within a few steps. Still, this example and others in Pow73b make it clear that we cannot expect a general convergence result for nonconvex functions, of the type that are available for full-gradient descent. Results are available for the nonconvex case under certain additional assumptions that still admit interesting applications. Bertsekas (Ber99, , Proposition 2.7.1) describes convergence of a cyclic approach applied to nonconvex problems, under the assumption that the minimizer along any coordinate direction from any point xx is unique. More recent work Att10 ; BolST14 focuses on CD with two blocks of variables, applied to functions that satisfy the so-called Kurdyka-Łojasiewicz (KL) property, such as semi-algebraic functions. Convergence of subsequences or the full sequence {xk}\{x^{k}\} to stationary points can be proved in this setting.

2 Assumptions and Notation

For most of this section, we focus on the unconstrained problem (1), where the objective ff is convex and Lipschitz continuously differentiable. In some places, we assume strong convexity with respect to the Euclidean norm, that is, existence of a modulus of convexity σ>0\sigma>0 such that

We define the coordinate Lipschitz constant L\mboxmaxL_{\mbox{\rm\scriptsize max}} to be such that

The standard Lipschitz constant LL is such that

Clearly, L\mboxresLL_{\mbox{\rm\scriptsize res}}\leq L. The ratio

is important in our analysis of asynchronous parallel algorithms in Section 4. In the case of ff convex and twice continuously differentiable, we have by positive semidefiniteness of the 2f(x)\nabla^{2}f(x) at all xx that

However, we can derive stronger bounds on Λ\Lambda for functions ff in which the coupling between components of xx is weak. In the extreme case in which ff is separable, we have Λ=1\Lambda=1. The coordinate Lipschitz constant corresponds L\mboxmaxL_{\mbox{\rm\scriptsize max}} to the maximal absolute value of the diagonal elements of the Hessian 2f(x)\nabla^{2}f(x), while the restricted Lipschitz constant L\mboxresL_{\mbox{\rm\scriptsize res}} is related to the maximal column norm of the Hessian. Therefore, if the Hessian is positive semidefinite and diagonally dominant, the ratio Λ\Lambda is at most 22.

The following assumption is useful in the remainder of the paper.

The function ff in (1) is convex and uniformly Lipschitz continuously differentiable, and attains its minimum value ff^{*} on a set S{\cal S}. There is a finite R0R_{0} such that the level set for ff defined by x0x^{0} is bounded, that is,

3 Randomized Algorithms

In randomized CD algorithms, the update component iki_{k} is chosen randomly at each iteration. In Algorithm 3 we consider the simplest variant in which each iki_{k} is selected from {1,2,,n}\{1,2,\dotsc,n\} with equal probability, independently of the selections made at previous iterations. (We can think of this scheme as “sampling with replacement” from the set {1,2,,n}\{1,2,\dotsc,n\}.)

We denote expectation with respect to a single random index iki_{k} by Eik()E_{i_{k}}(\cdot), while E()E(\cdot) denotes expectation with respect to all random variables i0,i1,i2,i_{0},i_{1},i_{2},\dotsc.

Suppose that Assumption 1 holds. Suppose that αk1/L\mboxmax\alpha_{k}\equiv 1/L_{\mbox{\rm\scriptsize max}} in Algorithm 3. Then for all k>0k>0 we have

When σ>0\sigma>0 in (20), we have in addition that

By application of Taylor’s theorem, and using (21) and (22), we have

where we substituted the choice αk=1/L\mboxmax\alpha_{k}=1/L_{\mbox{\rm\scriptsize max}} in the last equality. Taking the expectation of both sides of this expression over the random index iki_{k}, we have

(We used here the facts that xkx^{k} does not depend on iki_{k}, and that iki_{k} was chosen from among {1,2,,n}\{1,2,\dotsc,n\} with equal probability.) We now subtract f(x)f(x^{*}) from both sides this expression, take expectation of both sides with respect to all random variables i0,i1,i_{0},i_{1},\dotsc, and use the notation

(We used Jensen’s Inequality in the second inequality.) By convexity of ff we have for any xSx^{*}\in{\cal S} that

where the final inequality is because f(xk)f(x0)f(x^{k})\leq f(x^{0}), so that xkx^{k} is in the level set in (26). By taking expectations of both sides, we obtain

When we substitute this bound into (32), and rearrange, we obtain

By applying this formula recursively, we obtain

In the case of ff strongly convex with modulus σ>0\sigma>0, we have by taking the minimum of both sides with respect to yy in (20), and setting x=xkx=x^{k}, that

By using this expression to bound f(xk)2\|\nabla f(x^{k})\|^{2} in (32), we obtain

Recursive application of this formula leads to (28).

Note that the same convergence expressions can be obtained for more refined choices of steplength αk\alpha_{k}, by making minor adjustments to the logic in (29). For example, the choice αk=1/Lik\alpha_{k}=1/L_{i_{k}} leads to the same bounds (27) and (28). The same bounds hold too when αk\alpha_{k} is the exact minimizer of ff along the coordinate search direction; we modify the logic in (29) for this case by taking the minimum of all expressions with respect to αk\alpha_{k}, and use the fact that αk=1/L\mboxmax\alpha_{k}=1/L_{\mbox{\rm\scriptsize max}} is in general a suboptimal choice.

We can compare (27) with the corresponding result for full-gradient descent with constant steplength αk=1/L\alpha_{k}=1/L (where LL is from (23)). The iteration

(see, for example, Nes04 ). Since, as we have noted, LL can be as large as nL\mboxmaxnL_{\mbox{\rm\scriptsize max}}, the bound in this expression may be equivalent to (27) in extreme cases. More typically, these two Lipschitz constants are comparable in size, and the appearance of the additional factor nn in (27) indicates that we pay a price in terms of slower convergence for using only one component of f(xk)\nabla f(x^{k}), rather than the full vector.

Expected linear convergence rates have been proved under assumptions weaker than strong convexity; see for example the “essential strong convexity” property of LiuW13a , the “optimal strong convexity” property of LiuW14c , the “generalized error bound” property of NecC13a , and (TseY06, , Assumption 2), which concerns linear growth in a measure of the gradient with distance from the solution set.

A variant on Algorithm 3 uses “sampling without replacement.” Here the computation proceeds in “epochs” of nn consecutive iterations. At the start of each epoch, the set {1,2,,n}\{1,2,\dotsc,n\} is shuffled. The iterations then proceed by setting iki_{k} to each entry in turn from the ordered set. This kind of randomization has been shown in several contexts to be superior to the sampling-with-replacement scheme analyzed above, but a theoretical understanding of this phenomenon remains elusive.

It is worth proving an expected linear convergence result for the Kaczmarz iteration (16) for linear equations Aw=bAw=b as a separate, more elementary analysis. In one sense, the result is a special case of Theorem 3.1 since, as we showed above, the iteration (16) is obtained by applying Algorithm 3 to the dual formulation (10). In another sense, the result is stronger, since we obtain a linear rate of convergence without requiring strong convexity of the objective (10), that is, the system Aw=bAw=b is allowed to have multiple solutions.

We denote by λmin,nz\lambda_{\rm min,nz} the minimum nonzero eigenvalue of AATAA^{T} and let P()P(\cdot) denote projection onto the solution set of Aw=bAw=b. We have

where we have used normalization of the rows (14) and the fact that AikP(xk)=bikA_{i_{k}}P(x^{k})=b_{i_{k}}. By taking expectations of both sides with respect to iki_{k}, we have

By taking expectations of both sides with respect to all random variables i0,i1,i_{0},i_{1},\dotsc, and proceeding recursively, we obtain

(This analysis is slightly generalized from Strohmer and Vershynin Strohmer09 to allow for nonunique solutions of Aw=bAw=b; see also LevL10a .)

4 Accelerated Randomized Algorithms

The accelerated randomized algorithm, specified here as Algorithm 4, was proposed by Nesterov Nes10a . It assumes that an estimate is available of modulus of strong convexity σ0\sigma\geq 0 from (20), as well as estimates of the component-wise Lipschitz constants LiL_{i} from (21). (The algorithm remains valid if we simply use L\mboxmaxL_{\mbox{\rm\scriptsize max}} in place of LikL_{i_{k}} for all kk.)

The approach is a close relative of the accelerated (full-)gradient methods that have become extremely popular in recent years. These methods have their origin in a 1983 paper of Nesterov Nes83 and owe much of their recent popularity to a recent incarnation known as FISTA BecT08a and an exposition in Nesterov’s 2004 monograph Nes04 , as well as ease of implementation and good practical experience. In their use of momentum in the choice of step — the search direction combines new gradient information with the previous search direction — these methods are also related to such other classical techniques as the heavy-ball method (see Pol87 ) and conjugate gradient methods.

Nesterov (Nes10a, , Theorem 6) proves the following convergence result for Algorithm 4.

Suppose that Assumption 1 holds, and define

In the strongly convex case σ>0\sigma>0, the term (1+σ/L\mboxmax/(2n))k+1(1+\sqrt{\sigma/L_{\mbox{\rm\scriptsize max}}}/(2n))^{k+1} eventually dominates the second term in brackets in (35), so that the linear convergence rate suggested by this expression is significantly faster than the corresponding rate (28) for Algorithm 3. Essentially, the measure σ/L\mboxmax\sigma/L_{\mbox{\rm\scriptsize max}} of conditioning in (28) is replaced by its square root in (35), suggesting a decrease by a factor of L\mboxmax/σ\sqrt{L_{\mbox{\rm\scriptsize max}}/\sigma} in the number of iterations required to meet a specified error tolerance. In the sublinear rate bound (36), which holds even for weakly convex ff, the 1/k1/k bound of (27) is replaced by a 1/k21/k^{2} factor, implying a reduction from O(1/ϵ)O(1/\epsilon) to O(1/ϵ)O(1/\sqrt{\epsilon}) in the number of iterations required to meet a specified error tolerance.

5 Efficient Implementation of the Accelerated Algorithm

One fact detracts from the appeal of accelerated CD methods over standard methods: the higher cost of each iteration of Algorithm 4. Both standard and accelerated variants require calculation of one element of the gradient, but Algorithm 3 requires an update of just a single component of xx, whereas Algorithm 4 also requires manipulation of the generally dense vectors yy and vv. Moreover, the gradient is evaluated at xkx^{k} in Algorithm 3, where the argument changes by only one component from the prior iteration, a fact that can be exploited in several contexts. In Algorithm 4, the argument yky^{k} for the gradient changes more extensively from one iteration to the next, making it less obvious whether such economies are available. However, by using a change of variables due to Lee and Sidford LeeS13 , it is possible to implement the accelerated randomized CD approach efficiently for problems with certain structure, including the linear system Aw=bAw=b and certain problems of the form (5).

Note that RkR_{k} is a 2×22\times 2 matrix while SkS_{k} is an n×2n\times 2 matrix with nonzeros only in those rows for which AikTA_{i_{k}}^{T} has a nonzero entry. We define a change of variables based on another 2×22\times 2 matrix BkB_{k}, as follows:

where we initialize with B0=IB_{0}=I. By substituting this representation into (38), we obtain

so we can maintain validity of the representation (39) at iteration k+1k+1 by setting

This efficient implementation can be extended to the dual empirical risk minimization problem (7) for certain choices of regularization function g()g(\cdot), for example, g(z)=z2/2g(z)=\|z\|^{2}/2; see LinLX14a . As pointed out in LeeS13 , the key requirement for the efficient scheme is that the gradient term [f(yk)]ik[\nabla f(y^{k})]_{i_{k}} can be evaluated efficiently after an update to the two vectors in the alternative representation of yky^{k}, and to the two coefficients in this representation. Another variant of this implementation technique appears in (FerR13b, , Section 5).

6 Cyclic Variants

We have the following result from BecT12a for the cyclic variant of Algorithm 1.

Suppose that Assumption 1 holds. Suppose that αk1/L\mboxmax\alpha_{k}\equiv 1/L_{\mbox{\rm\scriptsize max}} in Algorithm 1, with the index iki_{k} at iteration kk chosen according to the cyclic ordering (11) (with i0=1i_{0}=1). Then for k=n,2n,3n,k=n,2n,3n,\dotsc, we have

When σ>0\sigma>0 in the strong convexity condition (20), we have in addition for k=n,2n,3n,k=n,2n,3n,\dotsc that

The result (41) follows from Theorems 3.6 and 3.9 in BecT12a when we note that (i) each iteration of Algorithm BCGD in BecT12a corresponds to a “cycle” of nn iterations in Algorithm 1; (ii) we update coordinates rather than blocks, so that the parameter pp in BecT12a is equal to nn; (iii) we set Lˉmax\bar{L}_{\max} and Lˉmin\bar{L}_{\min} in BecT12a both to L\mboxmaxL_{\mbox{\rm\scriptsize max}}.

Comparing the complexity bounds for the cyclic variant with the corresponding bounds proved in Theorem 3.1 for the randomized variant, we see that since LL\mboxmaxL\geq L_{\mbox{\rm\scriptsize max}} in general, the numerator in (41) is O(n2)O(n^{2}), in contrast to O(n)O(n) term in (27). A similar factor of nn in seen in comparing (28) to (42), when we note that (1ϵ)1/n1ϵ/n(1-\epsilon)^{1/n}\approx 1-\epsilon/n for small values of ϵ\epsilon. The bounds in Theorem 3.3 are deterministic, however, rather than being bounds on expected nonoptimality, as in Theorem 3.1.

We noted in Subsection 3.2 that the ratio L/L\mboxmaxL/L_{\mbox{\rm\scriptsize max}} lies in the interval [1,n][1,n] when ff is a convex quadratic function and both parameters are set to their best values. Lower values of this ratio are attained on functions that are “more decoupled” and larger values attained when there is a greater dependence between the coordinates. Larger values lead to weaker bounds in Theorem 3.3, which accords with our intuition; we expect CD methods to require more iterations to resolve the coupling of the coordinates.

We are free to make other, larger choices of L\mboxmaxL_{\mbox{\rm\scriptsize max}}; they need only satisfy the conditions (21) and (22). Larger values of L\mboxmaxL_{\mbox{\rm\scriptsize max}} lead to shorter steps αk=1/L\mboxmax\alpha_{k}=1/L_{\mbox{\rm\scriptsize max}} and different complexity expressions. For L\mboxmax=LL_{\mbox{\rm\scriptsize max}}=L, for example, the bound in (41) becomes

which is worse by a factor of approximately 2n22n^{2} than the bound (33) for the full-step gradient descent approach. For L\mboxmax=nLL_{\mbox{\rm\scriptsize max}}=\sqrt{n}L, we obtain

which still trails (33) by a factor of 4n3/24n^{3/2}.

7 Extension to Separable Regularized Case

In this section we consider the separable regularized formulation (2), (3) where ff is smooth and strongly convex, and each Ωi\Omega_{i}, i=1,2,,ni=1,2,\dotsc,n is convex. We prove a result similar to the second part of Theorem 3.1 for a randomized version of Algorithm 2. The proof is a simplified version of the analysis from RicT11a . It makes use of the following assumption.

The function ff in (2) is uniformly Lipschitz continuously differentiable and strongly convex with modulus σ>0\sigma>0 (see (20)). The functions Ωi\Omega_{i}, i=1,2,,ni=1,2,\dotsc,n are convex. The function hh in (2) attains its minimum value hh^{*} at a unique point xx^{*}.

Our result uses the coordinate Lipschitz constant L\mboxmaxL_{\mbox{\rm\scriptsize max}} for ff, as defined in (22). Note that the modulus of convexity σ\sigma for ff is also the modulus of convexity for hh. By elementary results for convex functions, we have

Suppose that Assumption 2 holds. Suppose that the indices iki_{k} in Algorithm 2 are chosen independently for each kk with uniform probability from {1,2,,n}\{1,2,\dotsc,n\}, and that αk1/L\mboxmax\alpha_{k}\equiv 1/L_{\mbox{\rm\scriptsize max}}. Then for all k0k\geq 0, we have

and note that this function is separable in the components of zz, and attains its minimum over zz at the vector zkz^{k} whose iki_{k} component is defined in Algorithm 2. Note by strong convexity (20) that

We have by minimizing both sides over zz in this expression that

where we used (45) for the first inequality, (43) for the third inequality, and the particular value α=σ/L\mboxmax\alpha=\sigma/L_{\mbox{\rm\scriptsize max}} for the fourth inequality (for which value the coefficient of xkx2\|x^{k}-x^{*}\|^{2} vanishes). Taking the expected value of h(xk+1)h(x^{k+1}) over the index iki_{k}, we have

By subtracting hh^{*} from both sides of this expression, and using (46) to substitute for H(xk,zk)H(x^{k},z^{k}), we obtain

By taking expectations of both sides of this expression with respect to the random indices i0,i1,i2,,ik1i_{0},i_{1},i_{2},\dotsc,i_{k-1}, we obtain

The result follows from a recursive application of this formula.

A result similar to (27) can be proved for the case in which ff is convex but not strongly convex, but there are a few technical complications, and we refer the reader to RicT11a for details.

An extension of the fixed-step approach to separable composite objectives (2), (3) with nonconvex smooth part ff is discussed in PatN13a , where it is shown that accumulation points of the sequence of iterates are stationary and that a measure of optimality decreases to zero at a sublinear (1/k1/k) rate.

8 Computational Notes

A full computational comparison between variants of CD (and between CD and other methods) is beyond the scope of this paper. Nevertheless it is worth asking whether various aspects of the convergence analysis presented above — in particular, the distinction between CD variants — can be observed in practice. To this end, we used these methods to minimize a convex quadratic f(x)=(1/2)xTQxf(x)=(1/2)x^{T}Qx (with QQ symmetric and positive semidefinite) for which x=0x^{*}=0 and f=0f^{*}=0. We constructed QQ by choosing an integer rr from 1,2,,n1,2,\dotsc,n and parameters η\eta\in and ζ>0\zeta>0, and defining

By choosing η\eta and ζ\zeta appropriately, we can obtain a range of values for the quantities described in Subsection 3.2, which enter along with the smallest singular value into the convergence expression. For example, by setting ζ=0\zeta=0 and η=0\eta=0 we obtain a randomly oriented matrix, possibly singular, with a specified range of nonzero eigenvalues. Nonzero values of η\eta and ζ\zeta induce different types of orientation bias. In particular, we see that Λ\Lambda (25) increases toward its upper bound of n\sqrt{n} as ζ\zeta increases away from zero.

CYCLIC: Cyclic CD, described in Subsection 3.6.

IID: Randomized CD using sampling with replacement: Algorithm 3.

EPOCHS: The “sampling without replacement” variant of Algorithm 3, described following the proof of Theorem 3.1.

For each variant, we tried both a fixed steplength αk1/L\mboxmax\alpha_{k}\equiv 1/L_{\mbox{\rm\scriptsize max}} and the optimal steplength αk=1/Qik,ik\alpha_{k}=1/Q_{i_{k},i_{k}}. Thus, there were a total of six algorithmic variants tested.

The starting point x0x^{0} was chosen randomly, with all components from the unit normal distribution N(0,1)N(0,1). The algorithms were terminated when the objective was reduced by a factor of 10610^{-6} over its initial value f(x0)f(x^{0}).

The speed of convergence varied widely according to the problem construction parameters η\eta, λ\lambda, and \mboxcond(Σ)\mbox{\rm cond}(\Sigma), but we can make some general observations. First, on problems that are not well conditioned, the function values f(xk)f(x^{k}) decreased rapidly at first, then settled into a linear rate of decrease. This linear rate held even for problems in which QQ was singular — a significant improvement over the sublinear rates predicted by the theory. Second, the EPOCHS variant of randomized CD tended to converge faster than the IID version, though rarely more than twice as fast. Third, the use of the optimal step was usually better than the fixed step (with sometimes up to six times fewer iterations), but this was by no means always the case. Fourth, while there were extensive regimes of parameter values in which all six variants performed similarly, there were numerous “stressed” settings in which the CYCLIC variants are much slower than the randomized variants, by factors of 1010 or more.

Parallel CD Algorithms

CD methods lend themselves to different kinds of parallel implementation. Even basic algorithm frameworks such as Algorithm 1 may be amenable to application-specific parallelism, when the computations involved in evaluating a single element of the gradient vector are substantial enough to be spread out across cores of a multicore computer. We concern ourselves here with more generic forms of parallelism, which involve multiple instances of the basic CD algorithm, running in parallel on multiple processors.

We can distinguish different types of parallel CD algorithms. Synchronous algorithms are those that partition the computation into pieces that can be executed in parallel on multiple processors (or cores of a multicore machine), but that synchronize frequently across all processors, to ensure consistency of the information available to all processors at certain points in time. For example, each processor could update a subset of components of xx in parallel (with the subsets being disjoint), and the synchronization step could ensure that the results of all updates are shared across all processors before further computation occurs. The synchronization step often detracts from the performance of algorithms, not only because some processors may be forced to idle while others complete their work, but also because the overheads associated with (hardware and software) locking of memory accesses can be high. Thus, asynchronous methods, which weaken or eliminate the requirement of consistent information across processors, are preferred in practice. Analysis of such methods is more difficult, but results have been obtained that accord with practical experience of such methods. Indeed, it can be verified that in certain regimes, linear speedup can be expected across a modest number of processors.

We mention several synchronous parallel variants of CD that appear in the recent literature. We note that in the some of these papers, the computational results were obtained by implementing the methods in an asynchronous fashion, disregarding the synchronization step required by the analysis.

Bradley at al. Bra11a consider a bound-constrained problem that is a reformulation of the problem (2) with specific choices of ff and with Ω(x)=x1\Omega(x)=\|x\|_{1}. Their algorithm performs short-step updates of individual components of xx in parallel on PP processors, with synchronization after each round of parallel updating. This scheme essentially updates a randomly-chosen block of PP variables at each cycle. By modifying the analysis of ShaT11a , they show that the 1/k1/k sublinear convergence rate bound is not affected provided that PP is no larger than n/Ln/L, where LL is the Lipschitz constant from (23).

Jaggi et al. Jag14d perform a synchronized CD method on the dual ERM model (7) for the case of g(w)=g(w)=(1/2)w2g(w)=g^{*}(w)=(1/2)\|w\|^{2}, partitioning components of the dual variable xx between cores and sharing a copy of the vector AxAx across cores, updating this vector at each synchronization point. The approach can be thought of as a nonlinear block Gauss-Jacobi method (by contrast with the coordinate Gauss-Seidel approaches discussed in Section 3).

Richtarik and Takac RicT12c describe a method for the separably regularized formulation (2), (3) in which a subset of indices Sk{1,2,,n}S_{k}\subset\{1,2,\dotsc,n\} is updated according to the formula in Algorithm 2. The work of updating the components in SkS_{k} is divided between processors; essentially, a synchronization step takes place at each iteration. This scheme is enhanced with an acceleration step in FerQRT14 ; the extra computations associated with the acceleration step too are parallelized, using ideas from LeeS13 . In the scheme of Marecek, Richtarik, and Takac MarRT14a , the variable vector xx is partitioned into subvectors, and each processor is assigned the responsibility for updating one of these subvectors. On each processor, the updating scheme described in RicT12c is applied, providing a second level of parallelism. Synchronization takes place at each outer iteration. Details of the information-sharing between processors required for accurate computation of gradients in different applications are described in (MarRT14a, , Section 6).

2 Asynchronous Parallelism

In asynchronous variants of CD, the variable vector xx is assumed to be accessible to each processor, available for reading and updating. (For example, xx could be stored in the shared-memory space of a multicore computer, where each core is viewed as a processor.) Each processor runs its own CD process, shown here as Algorithm 6, without any attempt to coordinate or synchronize with other processors. Each iteration on each processor chooses an index ii, loads the components of xx that are needed to compute the gradient component [f(x)]i[\nabla f(x)]_{i}, then updates the iith component xix_{i}. Note that this evaluation may need only a small subset of the components of xx; this is the case when the Hessian 2f\nabla^{2}f is structurally sparse, for example. On some multicore architectures (for example, the Intel Xeon), the update of xix_{i} can be performed as a unitary operation; no software or hardware locking is required to block access of other cores to the location xix_{i}.

We can take a global view of the entire parallel process, consisting of multiple processors each executing Algorithm 6, by defining a global counter kk that is incremented whenever any processor updates an element of xx: see Algorithm 7. Note that the only difference with the basic framework of Algorithm 1 is in the argument of the gradient component: In Algorithm 1 this is the latest iterate xkx^{k} whereas in Algorithm 7 it is a vector x^k\hat{x}^{k} that is generally made up of components of vectors from previous iterations xjx^{j}, j=0,1,,kj=0,1,\dotsc,k. The reason for this discrepancy is that between the time at which a processor reads the vector xx from shared storage in order to calculate [f(x)]i[\nabla f(x)]_{i}, and the time at which it updates component ii, other processors have generally made changes to xx. In consequence, each update step is using slightly stale information about xx. To prove convergence results, we need to make assumptions on how much “staleness” can be tolerated, and to modify the convergence analysis quite substantially. Indeed, proofs of convergence even for the most basic asynchronous algorithms are quite technical.

Asynchronous CD algorithms are distinguished from each other mostly by the assumptions they make on the the choice of update components iki_{k} and on the “ages” of the components of x^k\hat{x}^{k}, that is, the iterations at which each component of this vector was last updated. In the terminology of Bertsekas and Tsitsiklis BerT89 , the algorithm is totally asynchronous if

each index i{1,2,,n}i\in\{1,2,\dotsc,n\} of xx is updated at infinitely many iterations; and

if νjk\nu^{k}_{j} denotes the iteration at which component jj of the vector x^k\hat{x}^{k} was last updated, then νjk\nu^{k}_{j}\to\infty as kk\to\infty for all j=1,2,,nj=1,2,\dotsc,n.

In other words, each component of xx is updated infinitely often, and all components used in successive evaluation vectors x^k\hat{x}^{k} are also updated infinitely often.

The following convergence result for totally asynchronous variants of Algorithm 7 is due to Bertsekas and Tsitsiklis; see in particular (BerT89, , Sections 6.1, 6.2, and 6.3.3).

Then if we set αkα\alpha_{k}\equiv\alpha in Algorithm 7, the sequence {xk}\{x^{k}\} converges to xx^{*}.

we have ff strictly convex with minimizer x=0x^{*}=0. However the mapping T(x)=(IαQ)xT(x)=(I-\alpha Q)x is not contractive for any α>0\alpha>0; we have for example that T(x)x\|T(x)\|_{\infty}\geq\|x\|_{\infty} when x=(1,1)Tx=(1,-1)^{T}.

We turn now to partly asynchronous variants of Algorithm 7, in which we make stronger assumptions on the ages of the components of x^k\hat{x}^{k}. Liu and Wright LiuW14c consider a version of Algorithm 7 that is the parallel analog of Algorithm 3, in that each update component iki_{k} is chosen independently and randomly with equal probability from {1,2,,n}\{1,2,\dotsc,n\}. They assume that no component of x^k\hat{x}^{k} is older than a nonnegative integer τ\tau — the “maximum delay” — for any kk. Specifically, they express the difference between xkx^{k} and x^k\hat{x}^{k} in terms of “missed updates” to xx, as follows:

where K(j)K(j) is a set of iteration numbers drawn from the set {jq:q=1,2,,τ}\{j-q\,:\,q=1,2,\dotsc,\tau\}. The value of τ\tau is related to the number of processors PP involved in the computation. If all processors are performing their updates at approximately the same rates, we could expect τ\tau to be a modest multiple of PP — perhaps τ=2P\tau=2P or τ=3P\tau=3P, to allow a safety margin for occasional delays. Hence the value of τ\tau is an indicator of potential parallelism in the algorithm.

In LiuW14c , the steplengths in Algorithm 7 are fixed as follows:

where γ\gamma is chosen to ensure that Algorithm 7 progresses steadily toward a solution, but not too rapidly. Too-rapid convergence would cause the information in x^k\hat{x}^{k} to become too stale too quickly, so the gradient component [f(x^k)]ik[\nabla f(\hat{x}^{k})]_{i_{k}} would lose its relevance as a suitable update for the variable component xikx_{i_{k}} at iteration kk. Steady convergence is enforced by choosing some ρ>1\rho>1 and requiring that

where xˉk\bar{x}^{k} is the vector that would hypothetically be obtained if we were to apply the the update to all components, that is,

and the expectations E()E(\cdot) are taken over all random variables i0,i2,i_{0},i_{2},\dotsc. Condition (51) ensures that the “expected squared update norms” decrease by at most a factor of 1/ρ1/\rho at each iteration.

The main results in LiuW14c apply to composite functions (2), (3), but for simplicity here we state the result in terms of the problem (1), where ff is convex and continuously differentiable, with nonempty solution set S{\cal S} and optimal objective value ff^{*}. We use PSP_{{\cal S}} to denote projection onto S{\cal S}, and recall the definition (25) of the ratio Λ\Lambda between different varieties of Lipschitz constants. The results also make use of an optimal strong convexity condition, which is that the following inequality holds for some σ>0\sigma>0:

The following result is a modification of (LiuW14c, , Corollary 2).

Suppose that Assumption 1 holds, and that

Then by setting γ=1/2\gamma=1/2 in (50) (that is, choosing steplengths αk1/(2L\mboxmax)\alpha_{k}\equiv 1/(2L_{\mbox{\rm\scriptsize max}})), we have that

Assuming in addition that (52) is satisfied for some σ>0\sigma>0, we obtain the following linear rate:

A comparison with Theorem 3.1, which shows convergence rates for serial randomized CD (Algorithm 3) shows a striking similarity in convergence bounds. The factor-of-22 difference in steplength between the serial and parallel variants accounts for most of the difference between the linear rates (28) and (55), while there is an extra term nn in the denominator of the sublinear rate (54). We conclude that we do not pay q high overhead (in terms of total workload) for parallel implementation, and hence that near-linear speedup can be expected. (Indeed, computational results in LiuW14c and LiuW13a observe near-linear speedup for multicore asynchronous implementations.)

These encouraging conclusions depend critically on the condition (53), which is an upper bound on the allowable delay τ\tau in terms of nn and the ratio Λ\Lambda from (25). For functions ff with weak coupling between the components of xx (for example, when off-diagonals in the Hessian 2f(x)\nabla^{2}f(x) are small relative to the diagonals), we have Λ\Lambda not much greater than 11, so the maximum delay can be of the order of n1/4n^{1/4} before there is any attenuation of linear speedup. When stronger coupling exists, the restriction on τ\tau may be quite tight, possibly not much greater than 11. A more general convergence result (LiuW14c, , Theorem 1) shows that in this case, we can choose smaller values of γ\gamma in (50), allowing graceful degradation of the convergence bounds while still obtaining fairly efficient parallel implementations.

We note that an earlier analysis in LiuW13a made a stronger assumption on x^k\hat{x}^{k} — that it is equal to some earlier iterate xjx^{j} of Algorithm 7, where kjkτk\geq j\geq k-\tau, that is, the earlier iterate is no more than τ\tau cycles old. (A similar assumption was used to analyze convergence of as asynchronous stochastic gradient algorithm in Hogwild-nips .) This stronger assumption yields stronger convergence results, in that the bound on τ\tau in (53) can be loosened. However, the assumption may not always hold, since some parts of xx in memory may be altered by some cores as they are being read by another core, a phenomenon referred to in LiuW14c as “inconsistent reading.”

Conclusion

We have surveyed the state of the art in convergence of coordinate descent methods, with a focus on the most elementary settings and the most fundamental algorithms. The recent literature contains many extensions, enhancements, and elaborations; we refer interested readers to the bibliography of this paper, and note that new works are appearing at a rapid pace.

Coordinate descent method have become an important tool in the optimization toolbox that is used to solve problems that arise in machine learning and data analysis, particularly in “big data” settings. We expect to see further developments and extensions, further customization of the approach to specific problem structures, further adaptation to various computer platforms, and novel combinations with other optimization tools to produce effective “solutions” for key application areas.

References