An Accelerated Proximal Coordinate Gradient Method and its Application to Regularized Empirical Risk Minimization

Qihang Lin, Zhaosong Lu, Lin Xiao

Introduction

Coordinate descent methods have received extensive attention in recent years due to its potential for solving large-scale optimization problems arising from machine learning and other applications (e.g., ). In this paper, we develop an accelerated proximal coordinate gradient (APCG) method for solving problems of the following form:

where each xix_{i} denotes a sub-vector of xx with cardinality NiN_{i}, and the collection {xi:i=1,,n}\{x_{i}:i=1,\ldots,n\} form a partition of the components of xx. In addition to the capability of modeling nonsmooth terms such as Ψ(x)=λx1\Psi(x)=\lambda\|x\|_{1}, this model also includes optimization problems with block separable constraints. More specifically, each block constraints xiCix_{i}\in C_{i}, where CiC_{i} is a closed convex set, can be modeled by an indicator function defined as Ψi(xi)=0\Psi_{i}(x_{i})=0 if xiCix_{i}\in C_{i} and \infty otherwise.

At each iteration, coordinate descent methods choose one block of coordinates xix_{i} to sufficiently reduce the objective value while keeping other blocks fixed. In order to exploit the known structure of each Ψi\Psi_{i}, a proximal coordinate gradient step can be taken . To be more specific, given the current iterate x(k)x^{(k)}, we pick a block ik{1,,n}i_{k}\in\{1,\ldots,n\} and solve a block-wise proximal subproblem in the form of

Here if(x)\nabla_{i}f(x) denotes the partial gradient of ff with respect to xix_{i}, and LiL_{i} is the Lipschitz constant of the partial gradient (which will be defined precisely later).

One common approach for choosing such a block is the cyclic scheme. The global and local convergence properties of the cyclic coordinate descent method have been studied in, e.g., . Recently, randomized strategies for choosing the block to update became more popular . In addition to its theoretical benefits (randomized schemes are in general easier to analyze than the cyclic scheme), numerous experiments have demonstrated that randomized coordinate descent methods are very powerful for solving large-scale machine learning problems . Their efficiency can be further improved with parallel and distributed implementations . Randomized block coordinate descent methods have also been proposed and analyzed for solving problems with coupled linear constraints and a class of structured nonconvex optimization problems (e.g., ). Coordinate descent methods with more general schemes of choosing the block to update have also been studied; see, e.g., .

Inspired by the success of accelerated full gradient methods , several recent work extended Nesterov’s acceleration technique to speed up randomized coordinate descent methods. In particular, Nesterov developed an accelerated randomized coordinate gradient method for minimizing unconstrained smooth functions, which corresponds to the case of Ψ(x)0\Psi(x)\equiv 0 in (1). Lu and Xiao gave a sharper convergence analysis of Nesterov’s method using a randomized estimate sequence framework, and Lee and Sidford developed extensions using weighted random sampling schemes. Accelerated coordinate gradient methods have also been used to speed up the solution of linear systems . More recently, Fercoq and Richtárik proposed an APPROX (Accelerated, Parallel and PROXimal) coordinate descent method for solving the more general problem (1) and obtained accelerated sublinear convergence rate, but their method cannot exploit the strong convexity of the objective function to obtain accelerated linear rates.

In this paper, we propose a general APCG method that achieves accelerated linear convergence rates when the objective function is strongly convex. Without the strong convexity assumption, our method recovers a special case of the APPROX method . Moreover, we show how to apply the APCG method to solve the regularized empirical risk minimization (ERM) problem, and devise efficient implementations that avoid full-dimensional vector operations. For ill-conditioned ERM problems, our method obtains improved convergence rates than the state-of-the-art stochastic dual coordinate ascent (SDCA) method .

This paper is organized as follows. The rest of this section introduces some notations and state our main assumptions. In Section 2, we present the general APCG method and our main theorem on its convergence rate. We also give two simplified versions of APCG depending on whether or not the function ff is strongly convex, and explain how to exploit strong convexity in Ψ\Psi. Section 3 is devoted to the convergence analysis that proves our main theorem. In Section 4, we derive equivalent implementations of the APCG method that can avoid full-dimensional vector operations.

In Section 5, we apply the APCG method to solve the dual of the regularized ERM problem and give the corresponding complexity results. We also explain how to recover primal solutions to guarantee the same rate of convergence for the primal-dual gap. In addition, we present numerical experiments to demonstrate the performance of the APCG method.

2 Notations and assumptions

The gradient of function ff is block-wise Lipschitz continuous with constants LiL_{i}, i.e.,

An immediate consequence of Assumption 1 is (see, e.g., [25, Lemma 1.2.3])

The convexity parameter of ff with respect to the norm L\|\cdot\|_{L} is the largest μ\mu such that the above inequality holds. Every convex function satisfies Assumption 2 with μ=0\mu=0. If μ>0\mu>0, then the function ff is called strongly convex.

Together with (5) and the definition of L\|\cdot\|_{L} in (6), Assumption 2 implies μ1\mu\leq 1.

The APCG method

In this section we describe the general APCG method, and its two simplified versions under different assumptions (whether or not the objective function is strong convex). We also present our main theorem on the convergence rates of the APCG method.

The general APCG method is given as Algorithm 1. At each iteration kk, the APCG method picks a random coordinate ik{1,,n}i_{k}\in\{1,\ldots,n\} and generates y(k)y^{(k)}, x(k+1)x^{(k+1)} and z(k+1)z^{(k+1)}. One can observe that x(k+1)x^{(k+1)} and z(k+1)z^{(k+1)} depend on the realization of the random variable

while y(k)y^{(k)} is independent of iki_{k} and only depends on ξk1\xi_{k-1}.

To better understand this method, we make the following observations. For convenience, we define

which is a full-dimensional update version of Step 3. One can observe that z(k+1)z^{(k+1)} is updated as

Notice that from (7), (8), (9) and (10) we have

That is, in Step 4, we only need to update the block coordinates xik(k+1)x^{(k+1)}_{i_{k}} as in (13) and set the rest to be yi(k)y^{(k)}_{i}.

We now state an expected-value type of convergence rate for the APCG method.

Suppose Assumptions 1 and 2 hold. Let FF^{\star} be the optimal value of problem (1), and {x(k)}\{x^{(k)}\} be the sequence generated by the APCG method. Then, for any k0k\geq 0, there holds:

and XX^{\star} is the set of optimal solutions of problem (1).

For n=1n=1, our results in Theorem 1 match exactly the convergence rates of the accelerated full gradient method in [25, Section 2.2]. For n>1n>1, our results improve upon the convergence rates of the randomized proximal coordinate gradient method described in (3) and (4). More specifically, if the block index ik{1,,n}i_{k}\in\{1,\ldots,n\} is chosen uniformly at random, then the analysis in states that the convergence rate of (3) and (4) is on the order of

Thus we obtain both accelerated linear rate for strongly convex functions (μ>0\mu>0) and accelerated sublinear rate for non-strongly convex functions (μ=0\mu=0). To the best of our knowledge, this is the first time that such an accelerated linear convergence rate is obtained for solving the general class of problems (1) using coordinate descent type of methods.

The proof of Theorem 1 is given in Section 3. Next we give two simplified versions of the APCG method, for the special cases of μ>0\mu>0 and μ=0\mu=0, respectively.

For the strongly convex case with μ>0\mu>0, we can initialize Algorithm 1 with the parameter γ0=μ\gamma_{0}=\mu, which implies γk=μ\gamma_{k}=\mu and αk=βk=μ/n\alpha_{k}=\beta_{k}=\sqrt{\mu}/n for all k0k\geq 0. This results in Algorithm 2. As a direct corollary of Theorem 1, Algorithm 2 enjoys an accelerated linear convergence rate:

where xx^{\star} is the unique solution of (1) under the strong convexity assumption.

Algorithm 3 shows the simplified version for μ=0\mu=0, which can be applied to problems without strong convexity, or if the convexity parameter μ\mu is unknown. According to Theorem 1, Algorithm 3 has an accelerated sublinear convergence rate, that is

With the choice of α1=1/n21\alpha_{-1}=1/\sqrt{n^{2}-1}, which implies α0=1/n\alpha_{0}=1/n, Algorithm 3 reduces to the APPROX method with single block update at each iteration (i.e., τ=1\tau=1 in their Algorithm 1).

2 Exploiting strong convexity in ΨΨ\Psi

In this section we consider problem (1) with strongly convex Ψ\Psi. We assume that ff and Ψ\Psi have convexity parameters μf0\mu_{f}\geq 0 and μΨ>0\mu_{\Psi}>0, both with respect to the standard Euclidean norm, denoted 2\|\cdot\|_{2}.

As a result of the above definitions, we see that problem (1) is equivalent to

Convergence analysis

In this section, we prove Theorem 1. First we establish some useful properties of the sequences {αk}k=0\{\alpha_{k}\}^{\infty}_{k=0} and {γk}k=0\{\gamma_{k}\}^{\infty}_{k=0} generated in Algorithm 1. Then in Section 3.1, we construct a sequence {Ψ^k}k=1\{\hat{\Psi}_{k}\}_{k=1}^{\infty} to bound the values of Ψ(x(k))\Psi(x^{(k)}) and prove a useful property of the sequence. Finally we finish the proof of Theorem 1 in Section 3.2.

Suppose γ0>0\gamma_{0}>0 and γ0[μ,1]\gamma_{0}\in[\mu,1] and {αk}k=0\{\alpha_{k}\}^{\infty}_{k=0} and {γk}k=0\{\gamma_{k}\}^{\infty}_{k=0} are generated in Algorithm 1. Then there hold:

{αk}k=0\{\alpha_{k}\}^{\infty}_{k=0} and {γk}k=0\{\gamma_{k}\}^{\infty}_{k=0} are well-defined positive sequences.

μ/nαk1/n\sqrt{\mu}/n\leq\alpha_{k}\leq 1/n and μγk1\mu\leq\gamma_{k}\leq 1 for all k0k\geq 0.

{αk}k=0\{\alpha_{k}\}^{\infty}_{k=0} and {γk}k=0\{\gamma_{k}\}^{\infty}_{k=0} are non-increasing.

γk=n2αk12\gamma_{k}=n^{2}\alpha^{2}_{k-1} for all k1k\geq 1.

Due to (7) and (8), statement (iv) always holds provided that {αk}k=0\{\alpha_{k}\}^{\infty}_{k=0} and {γk}k=0\{\gamma_{k}\}^{\infty}_{k=0} are well-defined. We now prove statements (i) and (ii) by induction. For convenience, Let

Since μ1\mu\leq 1 and γ0(0,1]\gamma_{0}\in(0,1], one can observe that gγ0(0)=γ0<0g_{\gamma_{0}}(0)=-\gamma_{0}<0 and

These together with continuity of gγ0g_{\gamma_{0}} imply that there exists α0(0,1/n]\alpha_{0}\in(0,1/n] such that gγ0(α0)=0g_{\gamma_{0}}(\alpha_{0})=0, that is, α0\alpha_{0} satisfies (7) and is thus well-defined. In addition, by statement (iv) and γ0μ\gamma_{0}\geq\mu, one can see α0μ/n\alpha_{0}\geq\sqrt{\mu}/n. Therefore, statements (i) and (ii) hold for k=0k=0.

Suppose that the statements (i) and (ii) hold for some k0k\geq 0, that is, αk,γk>0\alpha_{k},\gamma_{k}>0, μ/nαk1/n\sqrt{\mu}/n\leq\alpha_{k}\leq 1/n and μγk1\mu\leq\gamma_{k}\leq 1. Using these relations and (8), one can see that γk+1\gamma_{k+1} is well-defined and moreover μγk+11\mu\leq\gamma_{k+1}\leq 1. In addition, we have γk+1>0\gamma_{k+1}>0 due to statement (iv) and αk>0\alpha_{k}>0. Using the fact μ1\mu\leq 1 (see the remark after Assumption 2), γ0(0,1]\gamma_{0}\in(0,1] and a similar argument as above, we obtain gγk(0)<0g_{\gamma_{k}}(0)<0 and gγk(1/n)0g_{\gamma_{k}}(1/n)\geq 0, which along with continuity of gγkg_{\gamma_{k}} imply that there exists αk+1(0,1/n]\alpha_{k+1}\in(0,1/n] such that gγk(αk+1)=0g_{\gamma_{k}}(\alpha_{k+1})=0, that is, αk+1\alpha_{k+1} satisfies (7) and is thus well-defined. By statement (iv) and γk+1μ\gamma_{k+1}\geq\mu, one can see that αk+1μ/n\alpha_{k+1}\geq\sqrt{\mu}/n. This completes the induction and hence statements (i) and (ii) hold.

Next, we show statement (iii) holds. Indeed, it follows from (8) that

which together with γkμ\gamma_{k}\geq\mu and αk>0\alpha_{k}>0 implies that γk+1γk\gamma_{k+1}\leq\gamma_{k} and hence {γk}k=0\{\gamma_{k}\}^{\infty}_{k=0} is non-increasing. Notice from statement (iv) and αk>0\alpha_{k}>0 that αk=γk+1/n\alpha_{k}=\sqrt{\gamma_{k+1}}/n. It follows that {αk}k=0\{\alpha_{k}\}^{\infty}_{k=0} is also non-increasing.

Statement (v) can be proved by using the same arguments in the proof of [25, Lemma 2.2.4], and the details can be found in [20, Section 4.2]. ∎

Motivated by , we give an explicit expression of x(k)x^{(k)} as a convex combination of the vectors z(0),,z(k)z^{(0)},\ldots,z^{(k)}, and use the coefficients to construct a sequence {Ψ^k}k=1\{\hat{\Psi}_{k}\}_{k=1}^{\infty} to bound Ψ(x(k))\Psi(x^{(k)}).

Let the sequences {αk}k=0\{\alpha_{k}\}^{\infty}_{k=0}, {γk}k=0\{\gamma_{k}\}^{\infty}_{k=0}, {x(k)}k=0\{x^{(k)}\}^{\infty}_{k=0} and {z(k)}k=0\{z^{(k)}\}^{\infty}_{k=0} be generated by Algorithm 1. Then each x(k)x^{(k)} is a convex combination of z(0),,z(k)z^{(0)},\ldots,z^{(k)}. More specifically, for all k0k\geq 0,

where the constants θ0(k),,θk(k)\theta^{(k)}_{0},\ldots,\theta^{(k)}_{k} are nonnegative and sum to 11. Moreover, these constants can be obtained recursively by setting θ0(0)=1\theta_{0}^{(0)}=1, θ0(1)=1nα0\theta^{(1)}_{0}=1-n\alpha_{0}, θ1(1)=nα0\theta^{(1)}_{1}=n\alpha_{0} and for k1k\geq 1,

We prove the statements by induction. First, notice that x(0)=z(0)=θ0(0)z(0)x^{(0)}=z^{(0)}=\theta^{(0)}_{0}z^{(0)}. Using this relation and (9), we see that y(0)=z(0)y^{(0)}=z^{(0)}. From (10) and y(0)=z(0)y^{(0)}=z^{(0)}, we obtain

Since α0(0,1/n]\alpha_{0}\in(0,1/n] (Lemma 1 (ii)), the vector x(1)x^{(1)} is a convex combination of z(0)z^{(0)} and z(1)z^{(1)} with the coefficients θ0(1)=1nα0\theta^{(1)}_{0}=1-n\alpha_{0}, θ1(1)=nα0\theta^{(1)}_{1}=n\alpha_{0}. For k=1k=1, substituting (9) into (10) yields

Substituting (20) into the above equality, and using (1α1)γ1=n2α12α1μ(1-\alpha_{1})\gamma_{1}=n^{2}\alpha^{2}_{1}-\alpha_{1}\mu from (7), we get

From the definition of θ1(2)\theta^{(2)}_{1} in the above equation, we observe that

From the above expression, and using the facts μ1\mu\leq 1, α0α1\alpha_{0}\geq\alpha_{1}, γk0\gamma_{k}\geq 0 and 0αk1/n0\leq\alpha_{k}\leq 1/n (Lemma 1), we conclude that θ1(2)0\theta^{(2)}_{1}\geq 0. Also considering the definitions of θ0(2)\theta^{(2)}_{0} and θ2(2)\theta^{(2)}_{2} in (21), we conclude that θl(2)0\theta^{(2)}_{l}\geq 0 for 0l20\leq l\leq 2. In addition, one can observe from (9), (10) and (20) that x(1)x^{(1)} is an affine combination of z(0)z^{(0)} and z(1)z^{(1)}, y(1)y^{(1)} is an affine combination of z(1)z^{(1)} and x(1)x^{(1)}, and x(2)x^{(2)} is an affine combination of y(1)y^{(1)}, z(1)z^{(1)} and z(2)z^{(2)}. It is known that substituting one affine combination into another yields a new affine combination. Hence, the combination given in (21) must be affine, which together with θl(2)0\theta^{(2)}_{l}\geq 0 for 0l20\leq l\leq 2 implies that it is also a convex combination.

Now suppose the recursion (19) holds for some k1k\geq 1. Substituting (9) into (10), we obtain that

Further, substituting x(k)=nαk1z(k)+l=0k1θl(k)z(l)x^{(k)}=n\alpha_{k-1}z^{(k)}+\sum_{l=0}^{k-1}\theta^{(k)}_{l}z^{(l)} (the induction hypothesis) into the above equation gives

This gives the form of (18) and (19). In addition, by the induction hypothesis, x(k)x^{(k)} is an affine combination of z(0),,z(k)z^{(0)},\ldots,z^{(k)}. Also, notice from (9) and (10) that y(k)y^{(k)} is an affine combination of z(k)z^{(k)} and x(k)x^{(k)}, and x(k+1)x^{(k+1)} is an affine combination of y(k)y^{(k)}, z(k)z^{(k)} and z(k+1)z^{(k+1)}. Using these facts and a similar argument as for x(2)x^{(2)}, it follows that the combination (22) must be affine.

Finally, we claim θl(k+1)0\theta^{(k+1)}_{l}\geq 0 for all ll. Indeed, we know from Lemma 1 that μ1\mu\leq 1, αk0\alpha_{k}\geq 0, γk0\gamma_{k}\geq 0. Also, θl(k)0\theta^{(k)}_{l}\geq 0 due to the induction hypothesis. It follows that θl(k+1)0\theta^{(k+1)}_{l}\geq 0 for all lkl\neq k. It remains to show that θk(k+1)0\theta^{(k+1)}_{k}\geq 0. To this end, we again use (7) to obtain (1αk)γk=n2αk2αkμ(1-\alpha_{k})\gamma_{k}=n^{2}\alpha^{2}_{k}-\alpha_{k}\mu, and use (19) and a similar argument as for θ1(2)\theta^{(2)}_{1} to rewrite θk(k+1)\theta^{(k+1)}_{k} as

Together with μ1\mu\leq 1, 0αk1/n0\leq\alpha_{k}\leq 1/n, γk0\gamma_{k}\geq 0 and αk1αk\alpha_{k-1}\geq\alpha_{k}, this implies that θk(k+1)0\theta^{(k+1)}_{k}\geq 0. Therefore, x(k+1)x^{(k+1)} is a convex combination of z(0),,z(k+1)z^{(0)},\ldots,z^{(k+1)} with the coefficients given in (19). ∎

In the following lemma, we construct the sequence {Ψ^k}k=0\{\hat{\Psi}_{k}\}_{k=0}^{\infty} and prove a recursive inequality.

Let Ψ^k\hat{\Psi}_{k} denotes the convex combination of Ψ(z(0)),,Ψ(z(k))\Psi(z^{(0)}),\ldots,\Psi(z^{(k)}) using the same coefficients given in Lemma 2, i.e.,

Then for all k0k\geq 0, we have Ψ(x(k))Ψ^k\Psi(x^{(k)})\leq\hat{\Psi}_{k} and

The first result Ψ(x(k))Ψ^k\Psi(x^{(k)})\leq\hat{\Psi}_{k} follows directly from convexity of Ψ\Psi. We now prove (23). First we deal with the case k=0k=0. Using (12), (19), and the facts y(0)=z(0)y^{(0)}=z^{(0)} and Ψ^0=Ψ(x(0))\hat{\Psi}_{0}=\Psi(x^{(0)}), we get

For k1k\geq 1, we use (12) and the definition of βk\beta_{k} in (8) to obtain that

It follows from the above equation and convexity of Ψ\Psi that

In addition, from the definition of Ψ^k\hat{\Psi}_{k} and θk(k)=nαk1\theta^{(k)}_{k}=n\alpha_{k-1}, we have

Next, using the definition of Ψ^k\hat{\Psi}_{k} and (19), we obtain

We next show that Γ=1αk\Gamma=1-\alpha_{k} and Δ=0\Delta=0. Indeed, notice that from (8) we have

Using this relation, γk+1=n2αk2\gamma_{k+1}=n^{2}\alpha^{2}_{k} (Lemma 1 (iv)), and the definition of Γ\Gamma in (3.1), we get

where the last equalities is due to (30). Finally, Δ=0\Delta=0 follows from Γ=1αk\Gamma=1-\alpha_{k} and αk+Δ+Γ=1\alpha_{k}+\Delta+\Gamma=1. These together with the inequality (3.1) yield the desired result. ∎

2 Proof of Theorem 1

We are now ready to present a proof for Theorem 1. We note that the proof in this subsection can also be recast into the framework of randomized estimate sequence developed in , but here we give a straightforward proof without using that machinery.

Dividing both sides of (7) by nαkn\alpha_{k} gives

which together with (31), (32) and γk+1=n2αk2\gamma_{k+1}=n^{2}\alpha^{2}_{k} (Lemma 1 (iv)) gives

Using this relation, (13) and Assumption 1, we have

where the second inequality follows from convexity of ff.

In addition, by (8), (32) and γk+1=n2αk2\gamma_{k+1}=n^{2}\alpha^{2}_{k} (Lemma 1 (iv)), we have

where the first equality used (32), the third one is due to (7) and (8), and γk+1=n2αk2\gamma_{k+1}=n^{2}\alpha^{2}_{k}. This equation together with (33) yields

Combining the above two inequalities, one can obtain that

Notice that VV has convexity parameter γk+1nαk=nαk\frac{\gamma_{k+1}}{n\alpha_{k}}=n\alpha_{k} with respect to L\|\cdot\|_{L}. By the optimality condition of (36), we have that for any xXx^{\star}\in X^{*},

Using the above inequality and the definition of VV, we obtain

Now using the assumption that ff has convexity parameter μ\mu with respect to L\|\cdot\|_{L}, we have

Combining this inequality with (35), one see that

In addition, it follows from (8) and convexity of L2\|\cdot\|^{2}_{L} that

Using this relation and (12), we observe that

where the inequality follows from (38). Summing up this inequality and (3.2) gives

Taking expectation on both sides with respect to ξk1\xi_{k-1} yields

which together with Ψ^0=Ψ(x(0))\hat{\Psi}_{0}=\Psi(x^{(0)}), z(0)=x(0)z^{(0)}=x^{(0)} and λk=Πi=0k1(1αi)\lambda_{k}=\Pi^{k-1}_{i=0}(1-\alpha_{i}) gives

The conclusion of Theorem 1 immediately follows from F(x(k))f(x(k))+Ψ^kF(x^{(k)})\leq f(x^{(k)})+\hat{\Psi}_{k}, Lemma 1 (v), the arbitrariness of xx^{\star} and the definition of R0R_{0}.

Efficient implementation

In order to avoid full-dimensional vector operations, Lee and Sidford proposed a change of variables scheme for accelerated coordinated gradient methods for unconstrained smooth minimization. Fercoq and Richtárik devised a similar scheme for efficient implementation in the non-strongly convex case (μ=0\mu=0) for composite minimization. Here we show that full vector operations can also be avoided in the strongly convex case for minimizing composite functions. For simplicity, we only present an efficient implementation of the simplified APCG method with μ>0\mu>0 (Algorithm 2), which is given as Algorithm 4.

The iterates of Algorithm 2 and Algorithm 4 satisfy the following relationships:

for all k0k\geq 0. That is, these two algorithms are equivalent.

We prove by induction. Notice that Algorithm 2 is initialized with z(0)=x(0)z^{(0)}=x^{(0)}, and its first step implies y(0)=x(0)+αz(0)1+α=x(0)y^{(0)}=\frac{x^{(0)}+\alpha z^{(0)}}{1+\alpha}=x^{(0)}; Algorithm 4 is initialized with u(0)=0u^{(0)}=0 and v(0)=x(0)v^{(0)}=x^{(0)}. Therefore we have

which means that (40) holds for k=0k=0. Now suppose that it holds for some k0k\geq 0, then

So hik(k)h^{(k)}_{i_{k}} in Algorithm 4 can be written as

Comparing with (11), and using βk=α\beta_{k}=\alpha, we obtain

In terms of the full dimensional vectors, using (12) and (41), we have

where the last step used (12). Now using the induction hypothesis y(k)=ρk+1u(k)+v(k)y^{(k)}=\rho^{k+1}u^{(k)}+v^{(k)}, we have

We just showed that (40) also holds for k+1k+1. This finishes the induction. ∎

We note that in Algorithm 4, only a single block coordinates of the vectors u(k)u^{(k)} and v(k)v^{(k)} are updated at each iteration, which cost O(Nik)O(N_{i_{k}}). However, computing the partial gradient ikf(ρk+1u(k)+v(k))\nabla_{i_{k}}f(\rho^{k+1}u^{(k)}+v^{(k)}) may still cost O(N)O(N) in general. In Section 5.2, we show how to further exploit problem structure in regularized empirical risk minimization to completely avoid full-dimensional vector operations.

Application to regularized empirical risk minimization (ERM)

In this section, we show how to apply the APCG method to solve the regularized ERM problems associated with linear predictors.

where λ>0\lambda>0 is a regularization parameter. For binary classification, given a label bi{±1}b_{i}\in\{\pm 1\} for each vector AiA_{i}, for i=1,,ni=1,\ldots,n, we obtain the linear SVM (support vector machine) problem by setting ϕi(z)=max{0,1biz}\phi_{i}(z)=\max\{0,1-b_{i}z\} and g(w)=(1/2)w22g(w)=(1/2)\|w\|_{2}^{2}. Regularized logistic regression is obtained by setting ϕi(z)=log(1+exp(biz))\phi_{i}(z)=\log(1+\exp(-b_{i}z)). This formulation also includes regression problems. For example, ridge regression is obtained by setting ϕi(z)=(1/2)(zbi)2\phi_{i}(z)=(1/2)(z-b_{i})^{2} and g(w)=(1/2)w22g(w)=(1/2)\|w\|_{2}^{2}, and we get the Lasso if g(w)=w1g(w)=\|w\|_{1}. Our method can also be extended to cases where each AiA_{i} is a matrix, thus covering multiclass classification problems as well (see, e.g., ).

For each i=1,,ni=1,\ldots,n, let ϕi\phi_{i}^{*} be the convex conjugate of ϕi\phi_{i}, that is,

The dual of the regularized ERM problem (42), which we call the primal, is to solve the problem (see, e.g., )

The structure of F(x)F(x) above matches our general formulation of minimizing composite convex functions in (1) and (2) with

Therefore, we can directly apply the APCG method to solve the problem (44), i.e., to solve the dual of the regularized ERM problem. Here we assume that the proximal mappings of the conjugate functions ϕi\phi_{i}^{*} can be computed efficiently, which is indeed the case for many regularized ERM problems (see, e.g., ).

In order to obtain accelerated linear convergence rates, we make the following assumption.

Each function ϕi\phi_{i} is 1/γ1/\gamma smooth, and the function gg has unit convexity parameter 1.

Here we slightly abuse the notation by overloading γ\gamma and λ\lambda, which appeared in Sections 2 and 3. In this section γ\gamma represents the (inverse) smoothness parameter of ϕi\phi_{i}, and λ\lambda denotes the regularization parameter on gg. Assumption 3 implies that each ϕi\phi_{i}^{*} has strong convexity parameter γ\gamma (with respect to the local Euclidean norm) and gg^{*} is differentiable and g\nabla g^{*} has Lipschitz constant 1.

In order to match the condition in Assumption 2, i.e., f(x)f(x) needs to be strongly convex, we can apply the technique in Section 2.2 to relocate the strong convexity from Ψ\Psi to ff. Without loss of generality, we can use the following splitting of the composite function F(x)=f(x)+Ψ(x)F(x)=f(x)+\Psi(x):

Under Assumption 3, the function ff is smooth and strongly convex and each Ψi\Psi_{i}, for i=1,,ni=1,\ldots,n, is still convex. As a result, we have the following complexity guarantee when applying the APCG method to minimize the function F(x)=D(x)F(x)=-D(x).

Suppose Assumption 3 holds and Ai2R\|A_{i}\|_{2}\leq R for all i=1,,ni=1,\ldots,n. In order to obtain an expected dual optimality gap E[DD(x(k))]ϵ\mathbf{E}[D^{\star}-D(x^{(k)})]\leq\epsilon using the APCG method, it suffices to have

where the second inequality used the assumption that gg has convexity parameter 11 and thus g\nabla g^{*} has Lipschitz constant 11. The coordinate-wise Lipschitz constants as defined in Assumption 1 are

The function ff has convexity parameter γn\frac{\gamma}{n} with respect to the Euclidean norm 2\|\cdot\|_{2}. Let μ\mu be its convexity parameter with respect to the norm L\|\cdot\|_{L} defined in (6). Then

According to Theorem 1, the APCG method converges geometrically:

where the constant CC is given in (48). Therefore, in order to obtain E[DD(x(k))]ϵ\mathbf{E}[D^{\star}-D(x^{(k)})]\leq\epsilon, it suffices to have the number of iterations kk to be larger than

Let us compare the result in Theorem 2 with the complexity of solving the dual problem (44) using the accelerated full gradient (AFG) method of Nesterov . Using the splitting in (45) and under Assumption 3, the gradient f(x)\nabla f(x) has Lipschitz constant A22λn2\frac{\|A\|_{2}^{2}}{\lambda n^{2}}, where A2\|A\|_{2} denotes the spectral norm of AA, and Ψ(x)\Psi(x) has convexity parameter γn\frac{\gamma}{n} with respect to 2\|\cdot\|_{2}. So the condition number of the problem is

Suppose each iteration of the AFG method costs as much as nn times of the APCG method (as we will see in Section 5.2), then the complexity of the AFG method [27, Theorem 6] measured in terms of number of coordinate gradient steps is

The inequality above is due to A22AF2nR2\|A\|_{2}^{2}\leq\|A\|_{F}^{2}\leq nR^{2}. Therefore in the ill-conditioned case (assuming nR2λγn\leq\frac{R^{2}}{\lambda\gamma}), the complexity of AFG can be a factor of n\sqrt{n} worse than that of APCG.

Several state-of-the-art algorithms for regularized ERM, including SDCA , SAG and SVRG , have the iteration complexity

Here the ratio R2λγ\frac{R^{2}}{\lambda\gamma} can be interpreted as the condition number of the regularized ERM problem (42) and its dual (43). We note that our result in (47) can be much better for ill-conditioned problems, i.e., when the condition number R2λγ\frac{R^{2}}{\lambda\gamma} is much larger than nn.

We note that the complexity bound for the aforementioned work are either for the primal optimality P(w(k))PP(w^{(k)})-P^{\star} (SAG and SVRG) or for the primal-dual gap P(w(k))D(x(k))P(w^{(k)})-D(x^{(k)}) (SDCA and accelerated SDCA). Our results in Theorem 2 are in terms of the dual optimality DD(x(k))D^{\star}-D(x^{(k)}). In Section 5.1, we show how to recover primal solutions with the same order of convergence rate. In Section 5.2, we show how to exploit problem structure of regularized ERM to compute the partial gradient if(x)\nabla_{i}f(x), which together with the efficient implementation proposed in Section 4, completely avoid full-dimensional vector operations. The experiments in Section 5.3 illustrate that our method has superior performance in reducing both the primal objective value and the primal-dual gap.

Under Assumption 3, the primal problem (42) and dual problem (43) each has a unique solution, say ww^{\star} and xx^{\star}, respectively. Moreover, we have P(w)=D(x)P(w^{\star})=D(x^{\star}). With the definition

we have w=ω(x)w^{\star}=\omega(x^{\star}). When applying the APCG method to solve the dual regularized ERM problem, which generate a dual sequence x(k)x^{(k)}, we can obtain a primal sequence w(k)=ω(x(k))w^{(k)}=\omega(x^{(k)}). Here we discuss the relationship between the primal-dual gap P(w(k))D(x(k))P(w^{(k)})-D(x^{(k)}) and the dual optimality DD(x(k))D^{\star}-D(x^{(k)}).

As a result, we obtain a subgradient of DD at x(k)x^{(k)}, denoted D(x(k))D^{\prime}(x^{(k)}), and

We note that D(x(k))22\|D^{\prime}(x^{(k)})\|_{2}^{2} is not only a measure of the dual optimality of x(k)x^{(k)}, but also a measure of the primal feasibility of (a(k),w(k))(a^{(k)},w^{(k)}). In fact, it can also bound the primal-dual gap, which is the result of the following lemma.

Given any dual solution x(k)x^{(k)}, let (a(k),w(k))(a^{(k)},w^{(k)}) be defined as in (51) and (52). Then

Because of (51), we have ϕi(ai(k))=xi(k)\nabla\phi_{i}(a^{(k)}_{i})=-x^{(k)}_{i}. The 1/γ1/\gamma-smoothness of ϕi(a)\phi_{i}(a) implies

which leads to the inequality in the conclusion. The equality in the conclusion is due to (53). ∎

The following theorem states that under a stronger assumption than Assumption 3, the primal-dual gap can be bounded directly by the dual optimality gap, hence they share the same order of convergence rate.

Suppose gg is 11-strongly convex and each ϕi\phi_{i} is 1/γ1/\gamma-smooth and also 1/η1/\eta-strongly convex (all with respect to the Euclidean norm 2\|\cdot\|_{2}). Given any dual point x(k)x^{(k)}, let the primal correspondence be w(k)=ω(x(k))w^{(k)}=\omega(x^{(k)}), i.e., generated from (52). Then we have

where A2\|A\|_{2} denotes the spectral norm of AA.

Since g(w)g(w) is 11-strongly convex, the function f(x)=λg(Axλn)f(x)=\lambda g^{*}\left(\frac{Ax}{\lambda n}\right) is differentiable and f(x)\nabla f(x) has Lipschitz constant A22λn2\frac{\|A\|_{2}^{2}}{\lambda n^{2}}. Similarly, since each ϕi\phi_{i} is 1/η1/\eta strongly convex, the function Ψ(x)=1ni=1nϕi(xi)\Psi(x)=\frac{1}{n}\sum_{i=1}^{n}\phi_{i}^{*}(-x_{i}) is differentiable and Ψ(x)\nabla\Psi(x) has Lipschitz constant ηn\frac{\eta}{n}. Therefore, the function D(x)=f(x)+Ψ(x)-D(x)=f(x)+\Psi(x) is smooth and its gradient has Lipschitz constant

It is known that (e.g., [25, Theorem 2.1.5]) if a function F(x)F(x) is convex and LL-smooth, then

Under our assumptions, the saddle-point problem (50) has a unique solution (x,a,w)(x^{\star},a^{\star},w^{\star}), where ww^{\star} and xx^{\star} are the solutions to the primal and dual problems (42) and (43), respectively. Moreover, they satisfy the optimality conditions

Since DD is differentiable in this case, we have D(x)=D(x)D^{\prime}(x)=\nabla D(x) and D(x)=0\nabla D(x^{\star})=0. Now we choose xx and yy in (55) to be xx^{\star} and x(k)x^{(k)} respectively. This leads to

Then the conclusion can be derived from Lemma 4. ∎

The assumption that each ϕi\phi_{i} is 1/γ1/\gamma-smooth and 1/η1/\eta-strongly convex implies that γη\gamma\leq\eta. Therefore the coefficient on the right-hand side of (54) satisfies ληn+A22λγn>1.\frac{\lambda\eta n+\|A\|_{2}^{2}}{\lambda\gamma n}>1. This is consistent with the fact that for any pair of primal and dual points w(k)w^{(k)} and x(k)x^{(k)}, we always have P(w(k))D(x(k))DD(x(k))P(w^{(k)})-D(x^{(k)})\geq D^{\star}-D(x^{(k)}).

Under the assumptions of Theorem 3, in order to obtain an expected primal-dual gap E[P(w(k))D(x(k))]ϵ\mathbf{E}\left[P(w^{(k)})-D(x^{(k)})\right]\leq\epsilon using the APCG method, it suffices to have

where the constant CC is defined in (48).

The above results require that each ϕi\phi_{i} be both smooth and strongly convex. One example that satisfies such assumptions is ridge regression, where ϕi(ai)=12(aibi)2\phi_{i}(a_{i})=\frac{1}{2}(a_{i}-b_{i})^{2} and g(w)=12w22g(w)=\frac{1}{2}\|w\|_{2}^{2}. For problems that only satisfy Assumption 3, we may add a small strongly convex term 12η(AiTw)2\frac{1}{2\eta}(A_{i}^{T}w)^{2} to each loss ϕi(AiTw)\phi_{i}(A_{i}^{T}w), and obtain that the primal-dual gap (of a slightly perturbed problem) share the same accelerated linear convergence rate as the dual optimality gap. Alternatively, we can obtain the same guarantee with the extra cost of a proximal full gradient step. This is summarized in the following theorem.

Suppose Assumption 3 holds. Given any dual point x(k)x^{(k)}, define

where ff and Ψ\Psi are defined in the simple splitting (45). Let

Notice that the Lipschitz constant of f(x)\nabla f(x) is Lf=A22λn2L_{f}=\frac{\|A\|_{2}^{2}}{\lambda n^{2}}, which is used in calculating T(x(k))T(x^{(k)}). The corresponding gradient mapping at x(k)x^{(k)} is

The conclusion can then be derived from Lemma 4. ∎

Here the coefficient in the right-hand side of (58), 4A22λγn\frac{4\|A\|_{2}^{2}}{\lambda\gamma n}, can be less than 11. This does not contradict with the fact that the primal-dual gap should be no less than the dual optimality gap, because the primal-dual gap on the left-hand side of (58) is measured at T(x(k))T(x^{(k)}) rather than x(k)x^{(k)}.

Suppose Assumption 3 holds. In order to obtain a primal-dual pair w(k)w^{(k)} and x(k)x^{(k)} such that E[P(w(k))D(T(x(k)))]ϵ\mathbf{E}\left[P(w^{(k)})-D(T(x^{(k)}))\right]\leq\epsilon, it suffices to run the APCG method for

steps and follow with a proximal full gradient step (56) and (57), where CC is defined in (48).

We note that the computational cost of the proximal full gradient step (56) is comparable with nn proximal coordinate gradient steps. Therefore the overall complexity of of this scheme is on the same order as necessary for the expected dual optimality gap to reach ϵ\epsilon. Actually the numerical experiments in Section 5.3 show that running the APCG method alone without the final full gradient step is sufficient to reduce the primal-dual gap at a very fast rate.

2 Implementation details

Here we show how to exploit the structure of the regularized ERM problem to efficiently compute the coordinate gradient ikf(y(k))\nabla_{i_{k}}f(y^{(k)}), and totally avoid full-dimensional updates in Algorithm 4.

We focus on the special case g(w)=12w22g(w)=\frac{1}{2}\|w\|_{2}^{2} and show how to compute ikf(y(k))\nabla_{i_{k}}f(y^{(k)}). In this case, g(v)=12v22g^{*}(v)=\frac{1}{2}\|v\|_{2}^{2} and g()\nabla g^{*}(\cdot) is the identity map. According to (46),

Notice that we do not form y(k)y^{(k)} in Algorithm 4. By Proposition 1, we have

So we can store and update the two vectors

Since the update of both u(k)u^{(k)} and v(k)v^{(k)} at each iteration only involves the single coordinate iki_{k}, we can update p(k)p^{(k)} and q(k)q^{(k)} by adding or subtracting a scaled column AikA_{i_{k}}, as given in (60). The resulting method is detailed in Algorithm 5.

In Algorithm 5, we use ik(k)\nabla^{(k)}_{i_{k}} to represent ikf(y(k))\nabla_{i_{k}}f(y^{(k)}) to reflect the fact that we never form y(k)y^{(k)} explicitly. The function Ψi\Psi_{i} in (59) is the one given in (46), i.e.,

Each iteration of Algorithm 5 only involves the two inner products AikTp(k)A_{i_{k}}^{T}p^{(k)} and AikTq(k)A_{i_{k}}^{T}q^{(k)} in computing ik(k)\nabla^{(k)}_{i_{k}}, and the two vector additions in (60). They all cost O(d)O(d) rather than O(n)O(n). When the AiA_{i}’s are sparse (the case of most large-scale problems), these operations can be carried out very efficiently. Basically, each iteration of Algorithm 5 only cost twice as much as that of SDCA .

In Step 3 of Algorithm 5, the division by ρk+1\rho^{k+1} in updating u(k)u^{(k)} and p(k)p^{(k)} may cause numerical problems because ρk+10\rho^{k+1}\to 0 as the number of iterations kk getting large. To fix this issue, we notice that u(k)u^{(k)} and p(k)p^{(k)} are always accessed in Algorithm 5 in the forms of ρk+1u(k)\rho^{k+1}u^{(k)} and ρk+1p(k)\rho^{k+1}p^{(k)}. So we can replace u(k)u^{(k)} and p(k)p^{(k)} by

which can be updated without numerical problem. To see this, we have

3 Numerical experiments

In our experiments, we solve the regularized ERM problem (42) with smoothed hinge loss for binary classification. That is, we pre-multiply each feature vector AiA_{i} by its label bi{±1}b_{i}\in\{\pm 1\} and let

The conjugate function of ϕi\phi_{i} is ϕi(b)=b+γ2b2\phi_{i}^{*}(b)=b+\frac{\gamma}{2}b^{2} if bb\in and \infty otherwise. Therefore we have

For the regularization term, we use g(w)=12w22g(w)=\frac{1}{2}\|w\|_{2}^{2}. We used three publicly available datasets obtained from . The characteristics of these datasets are summarized in Table 1.

In our experiments, we comparing the APCG method (Algorithm 5) with SDCA and the accelerated full gradient method (AFG) with and additional line search procedure to improve efficiency. When the regularization parameter λ\lambda is not too small (around 10410^{-4}), then APCG performs similarly as SDCA as predicted by our complexity results, and they both outperform AFG by a substantial margin.

Figure 1 shows the reduction of primal optimality P(w(k))PP(w^{(k)})-P^{\star} by the three methods in the ill-conditioned setting, with λ\lambda varying form 10510^{-5} to 10810^{-8}. For APCG, the primal points w(k)w^{(k)} are generated simply as w(k)=ω(x(k))w^{(k)}=\omega(x^{(k)}) defined in (49). Here we see that APCG has superior performance in reducing the primal objective value compared with SDCA and AFG, even without performing the final proximal full gradient step described in Theorem 4.

Figure 2 shows the reduction of primal-dual gap P(w(k))D(x(k))P(w^{(k)})-D(x^{(k)}) by the two methods APCG and SDCA. We can see that in the ill-conditioned setting, the APCG method is more effective in reducing the primal-dual gap as well.

References