Accelerated, Parallel and Proximal Coordinate Descent

Olivier Fercoq, Peter Richtárik

Introduction

Developments in computing technology and ubiquity of digital devices resulted in an increased interest in solving optimization problems of extremely big sizes. Applications can be found in all areas of human endeavor where data is available, including the internet, machine learning, data science and scientific computing. The size of these problems is so large that it is necessary to decompose the problem into smaller, more manageable, pieces. Traditional approaches, where it is possible to rely on full-vector operations in the design of an iterative scheme, must be revisited.

Coordinate descent methods appear as a very popular class of algorithms for such problems as they can break down the problem into smaller pieces, and can take advantage of sparsity patterns in the data. With big data problems it is necessary to design algorithms able to utilize modern parallel computing architectures. This resulted in an interest in parallel and distributed coordinate descent methods.

In this work we focus on the solution of convex optimization problems with a huge number of variables of the form

Here x=(x(1),,x(n))RNx=(x^{(1)},\dots,x^{(n)})\in\mathbf{R}^{N} is a decision vector composed of nn blocks, with x(i)RNix^{(i)}\in\mathbf{R}^{N_{i}},

where fjf_{j} are smooth convex functions, and ψ\psi is a block separable regularizer (e.g., L1L1 norm).

In this work we make the following three main contributions:

We design and analyze the first stochastic coordinate descent method which is simultaneously accelerated, parallel and proximal. In fact, we are not aware of any published results on accelerated coordinate descent which would either be proximal or parallel.

Our method is accelerated in the sense that it achieves an O(1/k2)O(1/k^{2}) convergence rate, where kk is the iteration counter. The first gradient method with this convergence rate is due to Nesterov ; see also . Accelerated stochastic coordinate descent method, for convex minimization without constraints, was originally proposed in 2010 by Nesterov .

Various variants of proximal and parallel (but non-accelerated) stochastic coordinate descent methods were proposed . In Table 1 we provide a listThis list is necessarily incomplete, it was not our goal to be comprehensive. For a somewhat more substantial review of these and other works we refer the reader to . of some recent research papers proposing and analyzing stochastic coordinate descent methods. The table substantiates our observation that while the proximal setting is standard in the literature, parallel methods are much less studied, and finally, there is just a handful of papers dealing with accelerated variants.

We propose new stepsizes for parallel coordinate descent methods, based on a new expected separable overapproximation (ESO). These stepsizes can for some classes of problems (e.g., fjf_{j}=quadratics), be much larger than the stepsizes proposed for the (non-accelerated) parallel coordinate descent method (PCDM) in . Let ωj\omega_{j} be the number of of blocks function fjf_{j} depends on. The stepsizes, and hence the resulting complexity, of PCDM, depend on the quantity ω=maxjωj\omega=\max_{j}\omega_{j}. However, our stepsizes take all the values ωj\omega_{j} into consideration and the result of this is complexity that depends on a data-weighted average ωˉ\bar{\omega} of the values ωj\omega_{j}. Since ωˉ\bar{\omega} can be much smaller than ω\omega, our stepsizes result in dramatic acceleration for our method and other methods whose analysis is based on an ESO .

We identify a large subclass of problems of the form (1) for which the full-vector operations inherent in accelerated methods can be eliminated. This contrasts with Nesterov’s accelerated coordinate descent scheme , which is impractical due to this bottleneck. Having established his convergence result, Nesterov remarked that:

“However, for some applications […] the complexity of one iteration of the accelerated scheme is rather high since for computing yky_{k} it needs to operate with full-dimensional vectors.”

Subsequently, in part due to these issues, the work of the community focused on simple methods as opposed to accelerated variants. For instance, Richtárik & Takáč use Nesterov’s observation to justify their focus on non-accelerated methods in their work on coordinate descent methods in the proximal/composite setting.

Recently, Lee & Sidford were able to avoid full dimensional operations in the case of minimizing a convex quadratic without constraints, by a careful modification of Nesterov’s method. This was achieved by introducing an extra sequence of iterates and observing that for quadratic functions it is possible to compute partial derivative of ff evaluated at a linear combination of full dimensional vectors without ever forming the combination. We extend the ideas of Lee & Sidford to our general setting (1) in the case when fj(x)=ϕj(ajTx)f_{j}(x)=\phi_{j}(a_{j}^{T}x), where ϕj\phi_{j} are scalar convex functions with Lipschitz derivative and the vectors aja_{j} are block-sparse.

The rest of the paper is organized as follows. We start by describing new stepsizes for parallel coordinate descent methods, based on novel assumptions, and compare them with existing stepsizes (Section 2). We then describe our algorithm and state and comment on the main complexity result (Section 3). Subsequently, we give a proof of the result (Section 4). We then describe an efficient implementation of our method, one that does not require the computation of full-vector operations (Section 5), and finally comment on our numerical experiments (Section 6).

Notation.

It will be convenient to define natural operators acting between the spaces RN\mathbf{R}^{N} and RNi\mathbf{R}^{N_{i}}. In particular, we will often wish to lift a block x(i)x^{(i)} from RNi\mathbf{R}^{N_{i}} to RN\mathbf{R}^{N}, filling the coordinates corresponding to the remaining blocks with zeros. Likewise, we will project xRNx\in\mathbf{R}^{N} back into RNi\mathbf{R}^{N_{i}}. We will now formalize these operations.

Let UU be the N×NN\times N identity matrix, and let U=[U1,U2,,Un]U=[U_{1},U_{2},\dots,U_{n}] be its decomposition into column submatrices UiRN×NiU_{i}\in\mathbf{R}^{N\times N_{i}}. For xRNx\in\mathbf{R}^{N}, let x(i)x^{(i)} be the block of variables corresponding to the columns of UiU_{i}, that is, x(i)=UiTxRNix^{(i)}=U_{i}^{T}x\in\mathbf{R}^{N_{i}}, i=1,2,,ni=1,2,\dots,n. Any vector xRNx\in\mathbf{R}^{N} can be written, uniquely, as x=i=1nUix(i)x=\sum_{i=1}^{n}U_{i}x^{(i)}. For hRNh\in\mathbf{R}^{N} and S[n]=def{1,2,,n}\emptyset\neq S\subseteq[n]\stackrel{{\scriptstyle\text{def}}}{{=}}\{1,2,\dots,n\}, we write

In words, h[S]h_{[S]} is a vector in RN\mathbf{R}^{N} obtained from hRNh\in\mathbf{R}^{N} by zeroing out the blocks that do not belong to SS. For convenience, we will also write

for the vector of partial derivatives of ff corresponding to coordinates belonging to block ii.

With each block i[n]i\in[n] we associate a positive definite matrix BiRNi×NiB_{i}\in\mathbf{R}^{N_{i}\times N_{i}} and a scalar vi>0v_{i}>0, and equip RNi\mathbf{R}^{N_{i}} and RN\mathbf{R}^{N} with the norms

The corresponding conjugate norms, defined by s=max{s,x:x1}\|s\|^{*}=\max\{\langle s,x\rangle:\|x\|\leq 1\}, are given by

We also write v1=ivi\|v\|_{1}=\sum_{i}|v_{i}|.

Stepsizes for parallel coordinate descent methods

The framework for designing and analyzing (non-accelerated) parallel coordinate descent methods, developed by Richtárik & Takáč , is based on the notions of block sampling and expected separable overapproximation (ESO). We now briefly review this framework as our accelerated method is cast in it, too. Informally, a block sampling is the random law describing the selection of blocks at each iteration. An ESO is an inequality, involving ff and S^\hat{S}, which is used to compute updates to selected blocks. The complexity analysis in our paper is based on the following generic assumption.

S^\hat{S} is a uniform block sampling. That is, S^\hat{S} is a random subset of [n]={1,2,,n}[n]=\{1,2,\dots,n\} with the propertyIt is easy to see that if S^\hat{S} is a uniform sampling, then necessarily, P(iS^)=ES^n\mathbf{P}(i\in\hat{S})=\frac{\mathbf{E}|\hat{S}|}{n} for all i[n]i\in[n]. that P(iS^)=P(jS^)\mathbf{P}(i\in\hat{S})=\mathbf{P}(j\in\hat{S}) for all i,j[n]i,j\in[n]. Let τ=E[S^]\tau=\mathbf{E}[|\hat{S}|].

There are computable constants v=(v1,,vn)>0v=(v_{1},\dots,v_{n})>0 for which the pair (f,S^)(f,\hat{S}) admits the Expected Separable Overapproximation (ESO):

In the context of parallel coordinate descent methods, uniform block samplings and inequalities (7) involving such samplings were introduced and systematically studied by Richtárik & Takáč . An ESO inequality for a uniform distributed sampling was developed in and that nonuniform samplings and ESO, together with a parallel coordinate descent method based on such samplings, was proposed in .

Fercoq & Richtárik [3, Theorem 10] observed that inequality (7) is equivalent to requiring that the gradients of the functions

The change of norms is done so as to enforce that the weights in the norm sum to nn, which means that different ESOs can be compared using the constants Lf^L^{\hat{f}}. The above observations are useful in understanding what the ESO inequality encodes: By moving from xx to

one is taking a step in a random subspace of RN\mathbf{R}^{N} spanned by the blocks belonging to S^\hat{S}. If τn\tau\ll n, which is often the case in big data problemsIn fact, one may define a “big data” problem by requiring that the number of parallel processors τ\tau available for optimization is much smaller than the dimension nn of the problem., the step is confined to a low-dimensional subspace of RN\mathbf{R}^{N}. It turns out that for many classes of functions arising in applications, for instance for functions exhibiting certain sparsity or partial separability patterns, it is the case that the gradient of ff varies much more slowly in such subspaces, on average, than it does in RN\mathbf{R}^{N}. This in turn would imply that updates hh based on minimizing the right hand side of (7) would produce larger steps, and eventually lead to faster convergence.

where fjf_{j} depends on blocks iCji\in C_{j} only. Let ωj=Cj\omega_{j}=|C_{j}|, and ω=maxjωj\omega=\max_{j}\omega_{j}.

The functions {fj}\{f_{j}\} have block-Lipschitz gradient with constants Lji0L_{ji}\geq 0. That is, for all j=1,2,,mj=1,2,\dots,m and i=1,2,,ni=1,2,\dots,n,

Assumption 2 is stronger than the assumption considered in . Indeed, in the authors only assumed that the sum ff, as opposed to the individual functions fjf_{j}, has a block-Lipschitz gradient, with constants L1,,LnL_{1},\dots,L_{n}. That is,

It is easy to see that if the stronger condition is satisfied, then the weaker one is also satisfied with LiL_{i} no worse than Lij=1mLjiL_{i}\leq\sum_{j=1}^{m}L_{ji}.

2 New ESO

We now derive an ESO inequality for functions satisfying Assumption 2 and τ\tau-nice sampling S^\hat{S}. That is, S^\hat{S} is a random subset of [n][n] of cardinality τ\tau, chosen uniformly at random. One can derive similar bounds for all uniform samplings considered in using the same approach.

If S^\hat{S} is a τ\tau-nice sampling, then for all x,hRNx,h\in\mathbf{R}^{N},

Moreover, for all x,hRNx,h\in\mathbf{R}^{N} we have

Note that ωˉ\bar{\omega} is a data-weighted average of the values {ωj}\{\omega_{j}\} and that wi=n\sum w_{i}=n.

Statement (ii) is a special case of (i) for τ=n\tau=n (notice that ωˉLˉw=v\bar{\omega}\bar{L}w=v). We hence only need to prove (i). A well known consequence of (8) is

where Lj:=(Lj1,,Ljn)RnL_{j:}=(L_{j1},\dots,L_{jn})\in\mathbf{R}^{n}. That is, (fj,S^)ESO(βjLj:)(f_{j},\hat{S})\sim ESO(\beta_{j}L_{j:}). Equation (10) then follows by adding upAt this step we could have also simply applied Theorem 10 from , which give the formula for an ESO for a conic combination of functions given ESOs for the individual functions. The proof, however, also amounts to simply adding up the inequalities. the inequalities (15) for all jj. Let us now prove the claim.This claim is a special case of Theorem 14 in which gives an ESO bound for a sum of functions fjf_{j} ( here we only have a single function). We include the proof as in this special case it more straightforward. We fix xx and define

We now adopt the convention that expectation conditional on an event which happens with probability 0 is equal to 0. Let ηj=defCjS^\eta_{j}\stackrel{{\scriptstyle\text{def}}}{{=}}|C_{j}\cap\hat{S}|, and using this convention, we can write

For any k1k\geq 1 for which P(ηj=k)>0\mathbf{P}(\eta_{j}=k)>0, we now use use convexity of f^j\hat{f}_{j} to write

where the second equality follows from Equation (41) in . Finally,

where the last identity is Equation (40) in , and hence (17) is established. ∎

We now give a formula for the constants LjiL_{ji} in the case when fjf_{j} arises as a composition of a scalar function ϕj\phi_{j} whose derivative has a known Lipschitz constant (this is often easy to compute), and a linear functional. Let AA be an m×Nm\times N real matrix and for j{1,2,,m}j\in\{1,2,\dots,m\} and i[n]i\in[n] define

That is, AjiA_{ji} is a row vector composed of the elements of row jj of AA corresponding to block ii.

Let fj(x)=ϕj(ejTAx)f_{j}(x)=\phi_{j}(e_{j}^{T}Ax), where ϕj:RR\phi_{j}:\mathbf{R}\to\mathbf{R} is a function with LϕjL_{\phi_{j}}-Lipschitz derivative:

Then fjf_{j} has a block Lipshitz gradient with constants

In other words, fjf_{j} satisfies (8) with constants LjiL_{ji} given above.

For any xRNx\in\mathbf{R}^{N}, tRNit\in\mathbf{R}^{N_{i}} and ii we have

where the last step follows by applying the Cauchy-Schwartz inequality. ∎

Then fj(x)=ϕj(ejTAx)f_{j}(x)=\phi_{j}(e_{j}^{T}Ax), where ϕj(s)=12(sbj)2\phi_{j}(s)=\tfrac{1}{2}(s-b_{j})^{2} and Lϕj=1L_{\phi_{j}}=1.

Consider the block setup with Ni=1N_{i}=1 (all blocks are of size 1) and Bi=1B_{i}=1 for all i[n]i\in[n]. Then Lji=Aji2L_{ji}=A_{ji}^{2}. In Table 3 we list stepsizes for coordinate descent methods proposed in the literature. It can be seen that our stepsizes are better than those proposed by Richtárik & Takáč and those proposed by Necoara & Clipici . Indeed, virtvifrv^{\text{rt}}_{i}\geq v^{\text{fr}}_{i} for all ii. The difference grows as τ\tau grows; and there is equality for τ=1\tau=1. We also have vnc1vfr1\|v^{\text{nc}}\|_{1}\geq\|v^{\text{fr}}\|_{1}, but here the difference decreases with τ\tau; and there is equality for τ=n\tau=n.

Choose nontrivial block sizes and define data-driven block norms with Bi=AiTAiB_{i}=A_{i}^{T}A_{i}, where Ai=AUiA_{i}=AU_{i}, assuming that the matrices AiTAiA_{i}^{T}A_{i} are positive definite. Then

Table 2 lists constants LϕL_{\phi} for selected scalar loss functions ϕ\phi popular in machine learning.

Accelerated parallel coordinate descent

We are interested in solving the regularized optimization problem

where ψ:RNR{+}\psi:\mathbf{R}^{N}\to\mathbf{R}\cup\{+\infty\} is a (possibly nonsmooth) convex regularizer that is separable in the blocks x(i)x^{(i)}:

We now describe our method (Algorithm 1). It is presented here in a form that facilitates analysis and comparison with existing methods. In Section 5 we rewrite the method into a different (equivalent) form – one that is geared towards practical efficiency.

The method starts from x0RNx_{0}\in\mathbf{R}^{N} and generates three vector sequences, {xk,yk,zk}k0\{x_{k},y_{k},z_{k}\}_{k\geq 0}. In Step 3, yky_{k} is defined as a convex combination of xkx_{k} and zkz_{k}, which may in general be full dimensional vectors. This is not efficient; but we will ignore this issue for now. In Section 5 we show that it is possible to implement the method in such a way that it not necessary to ever form yky_{k}. In Step 4 we generate a random block sampling SkS_{k} and then perform steps 5–9 in parallel. The assignment zk+1zkz_{k+1}\leftarrow z_{k} is not necessary in practice; the vector zkz_{k} should be overwritten in place. Instead, Steps 5–8 should be seen as saying that we update blocks iSki\in S_{k} of zkz_{k}, by solving Sk|S_{k}| proximal problems in parallel, and call the resulting vector zk+1z_{k+1}. Note in Step 9, xk+1x_{k+1} should also be computed in parallel. Indeed, xk+1x_{k+1} is obtained from yky_{k} by changing the blocks of yky_{k} that belong to SkS_{k} - this is because zk+1z_{k+1} and zkz_{k} differ in those blocks only. Note that gradients are evaluated only at yky_{k}. We show in Section 5 how this can be done efficiently, for some problems, without the need to form yky_{k}.

We now formulate the main result of this paper.

In other words, for any 0<ϵC0<\epsilon\leq C, the number of iterations for obtaining an ϵ\epsilon-solution in expectation does not exceed

The proof of Theorem 3 can be found in Section 4. We now comment on the result:

Note that we do not assume that ff be of the form (1); all that is needed is Assumption 1.

If n=1n=1, we recover Tseng’s proximal gradient descent algorithm . If n>1n>1, τ=1\tau=1 and ψ0\psi\equiv 0, we obtain a new version of (serial) accelerated coordinate descent for minimizing smooth functions. Note that no existing accelerated coordinate descent methods are either proximal, or parallel. Our method is both proximal and parallel.

In the case when we update all blocks in one iteration (τ=n\tau=n), the bound (26) simplifies to

If we use stepsize vv proposed in Theorem 1, then in view of part (ii) of that theorem, bound (29) takes the form

as advertised in the abstract. Recall that ωˉ\bar{\omega} is a data-weighted average of the values {ωj}\{\omega_{j}\}.

In contrast, using the stepsizes proposed by Richtárik & Takáč (see Table 3), we get

Note that in the case when the functions fjf_{j} are convex quadratics (fj(x)=12(ajTxbj)2f_{j}(x)=\tfrac{1}{2}(a_{j}^{T}x-b_{j})^{2}), for instance, we have Li=jLjiL_{i}=\sum_{j}L_{ji}, and hence the new ESO leads to a vast improvement in the complexity in cases when ωˉω\bar{\omega}\ll\omega. On he other hand, in cases where LijLjiL_{i}\ll\sum_{j}L_{ji} (which can happen with logistic regression, for instance), the result based on the Richtárik-Takáč stepsizes may be better.

Consider the smooth case (ψ0\psi\equiv 0): F=fF=f and f(x)=0f^{\prime}(x_{*})=0. By part (ii) of Theorem 1, f\nabla f is Lipschitz with constant 11 wrt w\|\cdot\|_{w}. Choosing x=xx=x_{*} and h=x0xh=x_{0}-x_{*}, we get

Now, consider running Algorithm 1 with a τ\tau-nice sampling and stepsize parameter vv as in Theorem 1. Letting d=(d1,,dn)d=(d_{1},\dots,d_{n}), where did_{i} is defined by

iterations suffice to produce an ϵ\epsilon-solution in expectation. Hence, we get linear speedup in the number of parallel updates / processors. This is different from the situation in simple (non-accelerated) parallel coordinate descent methods where parallelization speedup depends on the degree of separability (speedup is better if ω\omega is small). In APPROX, the average degree of separability ωˉ\bar{\omega} is decoupled from τ\tau, and hence one benefits from separability even for large τ\tau. This means that accelerated methods are more suitable for parallelization.

We focused on the case of uniform samplings, but with a proper change in the definition of ESO, one can also handle non-uniform samplings .

Complexity analysis

We first establish four lemmas and then prove Theorem 3.

In the first lemma we summarize well-known properties of the sequence θk\theta_{k} used in Algorithm 1.

The sequence {θk}k0\{\theta_{k}\}_{k\geq 0} defined in Algorithm 1 is decreasing and satisfies 0<θk2k+2n/ττn10<\theta_{k}\leq\frac{2}{k+2n/\tau}\leq\tfrac{\tau}{n}\leq 1 and

We now give an explicit characterization of xkx_{k} as a convex combination of the vectors z0,,zkz_{0},\dots,z_{k}.

Let {xk,zk}k0\{x_{k},z_{k}\}_{k\geq 0} be the iterates of Algorithm 1. Then for all k0k\geq 0 we have

where the coefficients γk0,γk1,,γkk\gamma_{k}^{0},\gamma_{k}^{1},\dots,\gamma_{k}^{k} are non-negative and sum to 1. That is, xkx_{k} is a convex combination of the vectors z0,z1,,zkz_{0},z_{1},\dots,z_{k}. In particular, the constants are defined recursively in kk by setting γ00=1\gamma_{0}^{0}=1, γ10=0\gamma_{1}^{0}=0, γ11=1\gamma_{1}^{1}=1 and for k1k\geq 1,

Moreover, for all k0k\geq 0, the following identity holds

We proceed by induction. First, notice that x0=z0=γ00z0x_{0}=z_{0}=\gamma_{0}^{0}z_{0}. This implies that y0=z0y_{0}=z_{0}, which in turn together with θ0=τn\theta_{0}=\tfrac{\tau}{n} gives x1=y0+nτθ0(z1z0)=z1=γ10z0+γ11z1x_{1}=y_{0}+\tfrac{n}{\tau}\theta_{0}(z_{1}-z_{0})=z_{1}=\gamma_{1}^{0}z_{0}+\gamma_{1}^{1}z_{1}. Assuming now that (35) holds for some k1k\geq 1, we obtain

By applying Lemma 1, together with the inductive assumption that γkl0\gamma_{k}^{l}\geq 0 for all ll, we observe that γk+1l0\gamma_{k+1}^{l}\geq 0 for all ll. It remains to show that the constants sum to 1. This is true since xkx_{k} is a convex combination of z1,,zkz_{1},\dots,z_{k}, and by (4.1), xk+1x_{k+1} is an affine combination of xkx_{k}, zkz_{k} and zk+1z_{k+1}. ∎

From this and the definition of zk+1z_{k+1} we see that

The next lemma is an application to a specific function of a well-known result that can be found, for instance, in . The result was used by Tseng to construct a simplified complexity proof for a proximal gradient descent method. This lemma requires the norms (i)\|\cdot\|_{(i)} to be Euclidean – and this is the only place in our analysis where this is required.

Let ξ(u)=deff(yk)+f(yk),uyk+nθk2τuzkv2\xi(u)\stackrel{{\scriptstyle\text{def}}}{{=}}f(y_{k})+\langle\nabla f(y_{k}),u-y_{k}\rangle+\frac{n\theta_{k}}{2\tau}\|u-z_{k}\|_{v}^{2}. Then

For any xRNx\in\mathbf{R}^{N} and k0k\geq 0,

Let S^\hat{S} be any uniform sampling and a,hRNa,h\in\mathbf{R}^{N}. Theorem 4 in implies that

The remaining statement follows from the last identity in (43) used with a=zka=z_{k}. ∎

2 Proof of Theorem 3

Using Lemma 2 and convexity of ψ\psi, for all k0k\geq 0 we have

Note that from the definition of yky_{k} in the algorithm, we have

For all k0k\geq 0 we define an upper bound on F(xk)F(x_{k}),

and bound the expectation of F^k+1\hat{F}_{k+1} in SkS_{k} as follows:

After dividing both sides of (49) by θk2\theta_{k}^{2}, using (34), and rearranging the terms, we obtain

We now apply total expectation to the above inequality and unroll the recurrence for ll between 0 and kk, obtaining

where in the last step we have used the facts that F^0=F(x0)\hat{F}_{0}=F(x_{0}), x0=z0x_{0}=z_{0}, θ0=τn\theta_{0}=\frac{\tau}{n} and the estimate θk12k1+2n/τ\theta_{k-1}\leq\tfrac{2}{k-1+2n/\tau} from Lemma 1.

Implementation without full-dimensional vector operations

Algorithm 1, as presented, performs full-dimensional vector operations. Indeed, yky_{k} is defined as a convex combination of xkx_{k} and zkz_{k}. Also, xk+1x_{k+1} is obtained from yky_{k} by changing Sk|S_{k}| coordinates; however, if Sk|S_{k}| is small, the latter operation is not costly. In any case, vectors xkx_{k} and zkz_{k} will in general be dense, and hence computation of yky_{k} may cost O(N)O(N) arithmetic operations. However, simple (i.e., non-accelerated) coordinate descent methods are successful and popular precisely because they can avoid such operations.

Note that if instead of updating the constants θk\theta_{k} as in line 10 we keep them constant throughout, θk=τn\theta_{k}=\tfrac{\tau}{n}, then uk=0u_{k}=0 for all kk. The resulting method is precisely the PCDM algorithm (non-accelerated parallel block-coordinate descent method) proposed and analyzed in .

As it is not immediately obvious that the two methods(Algorithms 1 and 2) are equivalent, we include the following result. Its proof can be found in the appendix.

We now show that this can be done for functions ff of the form (2), where fjf_{j} is as in Theorem 2:

This will be small if Di|D_{i}| are small and τ\tau is small. For simplicity, assume all blocks are of equal size, Ni=b=N/nN_{i}=b=N/n. Then

The favorable situation described above is the consequence of the block sparsity of the data matrix AA and does not depend on ϕj\phi_{j} insofar as the evaluation of its derivative takes O(1){\cal O}(1) work. Hence, it applies to convex quadratics (ϕj(s)=s2\phi_{j}(s)=s^{2}), logistic regression (ϕj(r)=log(1+exp(s))\phi_{j}(r)=\log(1+\exp(s))) and also to the smooth approximation fμ(x)f_{\mu}(x) of f(x)=Axb1f(x)=\|Ax-b\|_{1}, defined by

with smoothing parameter μ>0\mu>0, as considered in . Vector ww^{*} is as defined in ; v\|\cdot\|_{v} is a weighted norm in Rm\mathbf{R}^{m}.

Numerical experiments

In all tests we used a shared-memory workstation with 32 Intel Xeon processors at 2.6 GHz and 128 GB RAM. In the experiments, we have departed from the theory in two ways: i) our implementation of APPROX is asynchronous in order to limit communication costs, and ii) we approximated the τ\tau-nice sampling by a τ\tau-independent sampling as in (the latter is very easy to generate in parallel; please note that our analysis can be very easily extended to cover the τ\tau-independent sampling). For simplicity, in all tests we assume all blocks are of size 1 (Ni=1N_{i}=1 for all ii). However, further speedups can be obtained by working with larger block sizes as then each processor is better utilized.

In this experiment, we compare the performance of the new stepsizes ( introduced in Section 2.2) with those proposed in (see Table 3). We generated random instances of the L1L_{1}-regularized least squares problem (LASSO),

with various distributions of the separability degrees ωj\omega_{j} (= number of nonzero elements on the jjth row of AA) and studied the weighted distance to the optimum xx0v\lVert x_{*}-x_{0}\rVert_{v} for the initial point x0=0x_{0}=0. This quantity appears in the complexity estimate (28) and depends on τ\tau (the number of processors). We chose a random matrix of small size: N=m=1000N=m=1000 as this is sufficient to make our point, and consider τ{10,100,1000}\tau\in\{10,100,1000\}.

In particular, we consider three different distributions of {ωj}\{\omega_{j}\}: uniform, intermediate and extreme. The results are summarized in Table 4. First, we generated a uniformly sparse matrix with ωj=30\omega_{j}=30 for all jj. In this case, vfr=vrtv^{\text{fr}}=v^{\text{rt}}, and hence the results are the same. We then generated an intermediate instance, with ωj=1+30j2/m2\omega_{j}=1+\lfloor 30j^{2}/m^{2}\rfloor. The matrix has many rows with a few nonzero elements and some rows with up to 30 nonzero elements. Looking at the table, clearly, the new stepsizes are better. The improvement is moderate when there are a few processors, but for τ=1000\tau=1000, the complexity is 25% better. Finally, we generated a rather extreme matrix with ω1=500\omega_{1}=500 and ωj=3\omega_{j}=3 for j>1j>1. We can see that the new stepsizes are much better, even with few processors, and can lead to 5×5\times speedup.

In the experiments above, we have first fixed a sparsity pattern and then generated a random matrix AA based on it. However, much larger differences can be seen for special matrices AA. We shall now comment on this.

Consider the case τ=n\tau=n. In view of (29), the complexity of APPROX is proportional to v1\|v\|_{1}. Fix ω\omega and ω1,,ωj\omega_{1},\dots,\omega_{j} and let us ask the question: for what data matrix AA will the ratio θ=vrt1/vfr1\theta=\|v^{\text{rt}}\|_{1}/\|v^{\text{fr}}\|_{1} be maximized? Since vrt1=ωjAj:2\|v^{\text{rt}}\|_{1}=\omega\sum_{j}\|A_{j:}\|^{2} and vfr1=jωjAj:2\|v^{\text{fr}}\|_{1}=\sum_{j}\omega_{j}\|A_{j:}\|^{2}, we the maximal ratio is given by

The extreme case is attained for some matrix with at least one dense row (ωj\omega_{j}) and one maximally sparse row (ωj=1\omega_{j}=1), leading to θ=n\theta=n. So, there are instances for which the new stesizes can lead to an up to n×n\times speedup for APPROX when compared to the stepsizes vrtv^{\text{rt}}. Needless to say, these extreme instances are artificially constructed.

2 L1-regularized L1 regression

with λ=1\lambda=1. Because the objective is nonsmooth and non-separable, we apply the smoothing technique presented in for the first part of the objective and use the smoothed parallel coordinate descent method proposed in (this methods needs special stepsizes which are studied in that paper). The level of smoothing depends on the expected accuracy: we chose ϵ=0.1\epsilon=0.1, which corresponds to 0.0125% of the initial value.

We compared 4 algorithms (see Figure 1), all run with 4 processors. As one can see, the coordinate descent method is very efficient on this problem. However, the accelerated coordinate descent is still able to outperform it. As the problem is of small size (which is sufficient for the sake of comparison), we could compute the optimal solution using an interior point method for linear programming and compare the value at each iteration to the optimal value (Table 5). Each line of the table gives the time needed by APPROX and PCDM to reach a given accuracy target. In the beginning (until F(xk)F(x)<6.4F(x_{k})-F(x^{*})<6.4), the algorithms are in a transitional phase. Then, when one runs the algorithm twice as long, F(xk)F(x)F(x_{k})-F(x^{*}) is divided by 2 for SPCDM and by 4 for APPROX. This highlights the difference in the convergence speeds: O(1/k)O(1/k) compared to O(1/k2)O(1/k^{2}). As a result, APPROX gives an ϵ\epsilon-solution in 156.5 seconds while SPCDM has not finished yet after 2000 seconds.

3 Lasso

We now consider L1L_{1} regularized least squares regression on the KDDB dataset . It consists of a medium size sparse feature matrix AA with m=29,890,095m=29,890,095, N=19,264,097N=19,264,097 and ω=75\omega=75, and a vector bRmb\in\mathbf{R}^{m}. We wish to find xRNx\in\mathbf{R}^{N} that minimizes

We compare APPROX (Algorithm 2) with the (non-accelerated) parallel coordinate descent method (PCDM ) in Figure 2, both run with τ=16\tau=16 processors.

Both algorithms converge quickly. PCDM is faster in the beginning because each iteration is half as expensive. However, APPROX is faster afterwards. For this problem, the optimal value is not known so it is difficult to compare the actual accuracy.

Let us remark that an important feature of the L1L_{1}-regularization is that it promotes sparsity in the optimization variable xx. As APPROX only involves proximal steps on the zz variable, only zkz_{k} is encouraged to be sparse but not xkx_{k}, yky_{k} or uku_{k}. A possible way to obtain a sparse solution with APPROX is to first compute xkx_{k} and then post-process with a few iterations of a sparsity-oriented method (such as iterative hard thresholding, full proximal gradient descent or cyclic/randomized coordinate descent).

4 Training linear support vector machines

Our last experiment is the dual of Support Vector Machine problem . For the dual SVM, the coordinates correspond to examples.

We wish to find xNx\in^{N} that minimizes

with λ=1/N\lambda=1/N. We compare APPROX (Algorithm 2) with Stochastic Dual Coordinate Ascent (SDCA ); the results are in Figure 3. We have used a single processor only (τ=1\tau=1).

For this problem, one can recover a primal solution and thus we can compare the decrease in the duality gap; summarized in Table 6. One can see that APPROX is about twice as fast as SDCA on this instance.

Conclusion

In summary, we proposed APPROX: a stochastic coordinate descent method combining the following four acceleration strategies:

Our method is accelerated, i.e., it achieves a O(1/k2)O(1/k^{2}) convergence rate. Hence, the method is better able to obtain a high-accuracy solution on non-strongly convex problem instances.

Our method is parallel. Hence, it is able to better utilize modern parallel computing architectures and effectively taming the problem dimension nn.

We have proposed new longer stepsizes for faster convergence on functions whose degree of separability ω\omega is larger than their degree of separability ωˉ\bar{\omega}.

We have shown that our method can be implemented without the need to perform full-dimensional vector operations.

References

Appendix A Proof of Proposition 1 (equivalence)

Now looking at the steps of Algorithm 2, we see that