Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization

Shai Shalev-Shwartz, Tong Zhang

Introduction

For example, in ridge regression the regularizer is g(w)=12w22g(w)=\frac{1}{2}\|w\|_{2}^{2}, the instances are column vectors, and for every ii the ii’th loss function is ϕi(a)=12(ayi)2\phi_{i}(a)=\frac{1}{2}(a-y_{i})^{2}, for some scalar yiy_{i}.

Let w=argminwP(w)w^{*}=\operatorname*{argmin}_{w}P(w) (we will later make assumptions that imply that ww^{*} is unique). We say that ww is ϵ\epsilon-accurate if P(w)P(w)ϵP(w)-P(w^{*})\leq\epsilon. Our main result is a new algorithm for solving (1). If gg is 11-strongly convex and each ϕi\phi_{i} is (1/γ)(1/\gamma)-smooth (meaning that its gradient is (1/γ)(1/\gamma)-Lipschitz), then our algorithm finds, with probability of at least 1δ1-\delta, an ϵ\epsilon-accurate solution to (1) in time

By applying a smoothing technique to ϕi\phi_{i}, we also derive a method that finds an ϵ\epsilon-accurate solution to (1) assuming that each ϕi\phi_{i} is O(1)O(1)-Lipschitz, and obtain the runtime

This applies, for example, to SVM with the hinge-loss. It significantly improves over the rate dλϵ\frac{d}{\lambda\epsilon} of SGD (e.g. ), when 1λϵn\frac{1}{\lambda\epsilon}\gg n.

We can also apply our results to non-strongly convex regularizers (such as the L1L_{1} norm regularizer), or to non-regularized problems, by adding a slight L2L_{2} regularization. For example, for L1L_{1} regularized problems, and assuming that each ϕi\phi_{i} is (1/γ)(1/\gamma)-smooth, we obtain the runtime of

This applies, for example, to the Lasso problem, in which the goal is to minimize the squared loss plus an L1L_{1} regularization term.

To put our results in context, in the table below we specify the runtime of various algorithms (while ignoring constants and logarithmic terms) for three key machine learning applications; SVM in which ϕi(a)=max{0,1a}\phi_{i}(a)=\max\{0,1-a\} and g(w)=12w22g(w)=\frac{1}{2}\|w\|_{2}^{2}, Lasso in which ϕi(a)=12(ayi)2\phi_{i}(a)=\frac{1}{2}(a-y_{i})^{2} and g(w)=σw1g(w)=\sigma\|w\|_{1}, and Ridge Regression in which ϕi(a)=12(ayi)2\phi_{i}(a)=\frac{1}{2}(a-y_{i})^{2} and g(w)=12w22g(w)=\frac{1}{2}\|w\|_{2}^{2}. Additional applications, and a more detailed runtime comparison to previous work, are given in Section 5. In the table below, SGD stands for Stochastic Gradient Descent, and AGD stands for Accelerated Gradient Descent.

Additional related work:

As mentioned before, our first contribution is a proximal version of the stochastic dual coordinate ascent method and extension of the analysis given in Shalev-Shwartz and Zhang . Stochastic dual coordinate ascent has also been studied in Collins et al. but in more restricted settings than the general problem considered in this paper. One can also apply the analysis of stochastic coordinate descent methods given in Richtárik and Takáč on the dual problem. However, here we are interested in understanding the primal sub-optimality, hence an analysis which only applies to the dual problem is not sufficient.

The generality of our approach allows us to apply it for multiclass prediction problems. We discuss this in detail later on in Section 5. Recently, derived a stochastic coordinate ascent for structural SVM based on the Frank-Wolfe algorithm. Although with different motivations, for the special case of multiclass problems with the hinge-loss, their algorithm ends up to be the same as our proximal dual ascent algorithm (with the same rate). Our approach allows to accelerate the method and obtain an even faster rate.

The proof of our acceleration method adapts Nesterov’s estimation sequence technique, studied in Devolder et al. , Schmidt et al. , to allow approximate and stochastic proximal mapping. See also . In particular, it relies on similar ideas as in Proposition 4 of . However, our specific requirement is different, and the proof presented here is different and significantly simpler than that of .

There have been several attempts to accelerate stochastic optimization algorithms. See for example and the references therein. However, the runtime of these methods have a polynomial dependence on 1/ϵ1/\epsilon even if ϕi\phi_{i} are smooth and gg is λ\lambda-strongly convex, as opposed to the logarithmic dependence on 1/ϵ1/\epsilon obtained here. As in , we avoid the polynomial dependence on 1/ϵ1/\epsilon by allowing more than a single pass over the data.

Preliminaries

Given a norm P\|\cdot\|_{P} we denote the dual norm by D\|\cdot\|_{D} where

We use \|\cdot\| or 2\|\cdot\|_{2} to denote the L2L_{2} norm, x=xx\|x\|=x^{\top}x. We also use x1=ixi\|x\|_{1}=\sum_{i}|x_{i}| and x=maxixi\|x\|_{\infty}=\max_{i}|x_{i}|. The operator norm of a matrix XX with respect to norms P,P\|\cdot\|_{P},\|\cdot\|_{P^{\prime}} is defined as

It is well known that ff is γ\gamma-strongly convex with respect to P\|\cdot\|_{P} if and only if ff^{*} is (1/γ)(1/\gamma)-smooth with respect to the dual norm, D\|\cdot\|_{D}.

We will assume that gg is strongly convex which implies that g()g^{*}(\cdot) is continuous differentiable. If we define

then it is known that w(α)=ww(\alpha^{*})=w^{*}, where α\alpha^{*} is an optimal solution of (2). It is also known that P(w)=D(α)P(w^{*})=D(\alpha^{*}) which immediately implies that for all ww and α\alpha, we have P(w)D(α)P(w)\geq D(\alpha), and hence the duality gap defined as

can be regarded as an upper bound on both the primal sub-optimality, P(w(α))P(w)P(w(\alpha))-P(w^{*}), and on the dual sub-optimality, D(α)D(α)D(\alpha^{*})-D(\alpha).

Main Results

In this section we describe our algorithms and their analysis. We start in Section 3.1 with a description of our proximal stochastic dual coordinate ascent procedure (Prox-SDCA). Then, in Section 3.2 we show how to accelerate the method by calling Prox-SDCA on a sequence of problems with a strong regularization. Throughout the first two sections we assume that the loss functions are smooth. Finally, we discuss the case of Lipschitz loss functions in Section 3.3.

The proofs of the main acceleration theorem (Theorem 3) is given in Section 4. The rest of the proofs are provided in the appendix.

We now describe our proximal stochastic dual coordinate ascent procedure for solving (1). Our results in this subsection holds for gg being a 11-strongly convex function with respect to some norm P\|\cdot\|_{P^{\prime}} and every ϕi\phi_{i} being a (1/γ)(1/\gamma)-smooth function with respect to some other norm P\|\cdot\|_{P}. The corresponding dual norms are denoted by D\|\cdot\|_{D^{\prime}} and D\|\cdot\|_{D} respectively.

The dual objective in (2) has a different dual vector associated with each example in the training set. At each iteration of dual coordinate ascent we only allow to change the ii’th column of α\alpha, while the rest of the dual vectors are kept intact. We focus on a randomized version of dual coordinate ascent, in which at each round we choose which dual vector to update uniformly at random.

At step tt, let v(t1)=(λn)1iXiαi(t1)v^{(t-1)}=(\lambda n)^{-1}\sum_{i}X_{i}\alpha_{i}^{(t-1)} and let w(t1)=g(v(t1))w^{(t-1)}=\nabla g^{*}(v^{(t-1)}). We will update the ii-th dual variable αi(t)=αi(t1)+Δαi\alpha_{i}^{(t)}=\alpha_{i}^{(t-1)}+\Delta\alpha_{i}, in a way that will lead to a sufficient increase of the dual objective. For the primal problem, this would lead to the update v(t)=v(t1)+(λn)1XiΔαiv^{(t)}=v^{(t-1)}+(\lambda n)^{-1}X_{i}\Delta\alpha_{i}, and therefore w(t)=g(v(t))w^{(t)}=\nabla g^{*}(v^{(t)}) can also be written as

Note that this particular update is rather similar to the update step of proximal-gradient dual-averaging method (see for example Xiao ). The difference is on how α(t)\alpha^{(t)} is updated.

The goal of dual ascent methods is to increase the dual objective as much as possible, and thus the optimal way to choose Δαi\Delta\alpha_{i} would be to maximize the dual objective, namely, we shall let

However, for a complex g()g^{*}(\cdot), this optimization problem may not be easy to solve. To simplify the optimization problem we can rely on the smoothness of gg^{*} (with respect to a norm D\|\cdot\|_{D^{\prime}}) and instead of directly maximizing the dual objective function, we try to maximize the following proximal objective which is a lower bound of the dual objective:

In general, this optimization problem is still not necessarily simple to solve because ϕ\phi^{*} may also be complex. We will thus also propose alternative update rules for Δαi\Delta\alpha_{i} of the form Δαi=s(ϕi(Xiw(t1))αi(t1))\Delta\alpha_{i}=s(-\nabla\phi_{i}(X_{i}^{\top}w^{(t-1)})-\alpha_{i}^{(t-1)}) for an appropriately chosen step size parameter s>0s>0. Our analysis shows that an appropriate choice of ss still leads to a sufficient increase in the dual objective.

It should be pointed out that we can always pick Δαi\Delta\alpha_{i} so that the dual objective is non-decreasing. In fact, if for a specific choice of Δαi\Delta\alpha_{i}, the dual objective decreases, we may simply set Δαi=0\Delta\alpha_{i}=0. Therefore throughout the proof we will assume that the dual objective is non-decreasing whenever needed.

The theorems below provide upper bounds on the number of iterations required by our prox-SDCA procedure.

Consider Procedure Prox-SDCA as given in Figure 1. Let α\alpha^{*} be an optimal dual solution and let ϵ>0\epsilon>0. For every TT such that

We next give bounds that hold with high probability.

Consider Procedure Prox-SDCA as given in Figure 1. Let α\alpha^{*} be an optimal dual solution, let ϵD,ϵP>0\epsilon_{D},\epsilon_{P}>0, and let δ(0,1)\delta\in(0,1).

we are guaranteed that with probability of at least 1δ1-\delta it holds that D(α)D(α(T))ϵDD(\alpha^{*})-D(\alpha^{(T)})\leq\epsilon_{D}.

we are guaranteed that with probability of at least 1δ1-\delta it holds that P(w(T))D(α(T))ϵPP(w^{(T)})-D(\alpha^{(T)})\leq\epsilon_{P}.

and let T0=TnR2λγT_{0}=T-n-\lceil\frac{R^{2}}{\lambda\gamma}\rceil. Suppose we choose log2(2/δ)\lceil\log_{2}(2/\delta)\rceil values of tt uniformly at random from T0+1,,TT_{0}+1,\ldots,T, and then choose the single value of tt from these log2(2/δ)\lceil\log_{2}(2/\delta)\rceil values for which P(w(t))D(α(t))P(w^{(t)})-D(\alpha^{(t)}) is minimal. Then, with probability of at least 1δ1-\delta we have that P(w(t))D(α(t))ϵPP(w^{(t)})-D(\alpha^{(t)})\leq\epsilon_{P}.

The above theorem tells us that the runtime required to find an ϵ\epsilon accurate solution, with probability of at least 1δ1-\delta, is

The expected runtime required to minimize PP up to accuracy ϵ\epsilon is

We have shown that with a runtime of O(d(n+R2λγ)log(2(D(α)D(α(0)))ϵ))O\left(d\,\left(n+\frac{R^{2}}{\lambda\gamma}\right)\cdot\log\left(\frac{2(D(\alpha^{*})-D(\alpha^{(0)}))}{\epsilon}\right)\right) we can find an ϵ\epsilon accurate solution with probability of at least 1/21/2. Therefore, we can run the procedure for this amount of time and check if the duality gap is smaller than ϵ\epsilon. If yes, we are done. Otherwise, we would restart the process. Since the probability of success is 1/21/2 we have that the average number of restarts we need is 22, which concludes the proof. ∎

2 Acceleration

for some β(0,1)\beta\in(0,1). That is, our regularization is centered around the previous solution plus a “momentum term” β(w(t1)w(t2))\beta(w^{(t-1)}-w^{(t-2)}).

A pseudo-code of the algorithm is given in Figure 2. Note that all the parameters of the algorithm are determined by our theory.

In the pseudo-code below, we specify the parameters based on our theoretical derivation. In our experiments, we found out that this choice of parameters also work very well in practice. However, we also found out that the algorithm is not very sensitive to the choice of parameters. For example, we found out that running 5n5n iterations of Prox-SDCA (that is, 55 epochs over the data), without checking the stopping condition, also works very well.

Consider the accelerated Prox-SDCA algorithm given in Figure 2.

Correctness: When the algorithm terminates we have that P(w(t))P(w)ϵP(w^{(t)})-P(w^{*})\leq\epsilon.

The number of outer iterations is at most

Each outer iteration involves a single call to Prox-SDCA, and the averaged runtime required by each such call is

By a straightforward amplification argument we obtain that for every δ(0,1)\delta\in(0,1) the total runtime required by accelerated Prox-SDCA to guarantee an ϵ\epsilon-accurate solution with probability of at least 1δ1-\delta is

3 Non-smooth, Lipschitz, loss functions

So far we have assumed that for every ii, ϕi\phi_{i} is a (1/γ)(1/\gamma)-smooth function. We now consider the case in which ϕi\phi_{i} might be non-smooth, and even non-differentiable, but it is LL-Lipschitz.

Following Nesterov , we apply a “smoothing” technique. We first observe that if ϕ\phi is LL-Lipschitz function then the domain of ϕ\phi^{*} is in the ball of radius LL.

Fix some α\alpha with αD>L\|\alpha\|_{D}>L. Let x0x_{0} be a vector such that x0P=1\|x_{0}\|_{P}=1 and αx0=αD\alpha^{\top}x_{0}=\|\alpha\|_{D} (this is a vector that achieves the maximal objective in the definition of the dual norm). By definition of the conjugate we have

This observation allows us to smooth LL-Lipschitz functions by adding regularization to their conjugate. In particular, the following lemma generalizes Lemma 2.5 in .

It is also possible to smooth using different regularization functions which are strongly convex with respect to other norms. See Nesterov for discussion.

Proof of Theorem 3

The first claim of the theorem is that when the procedure stops we have P(w(t))P(w)ϵP(w^{(t)})-P(w^{*})\leq\epsilon. We therefore need to show that each stopping condition guarantees that P(w(t))P(w)ϵP(w^{(t)})-P(w^{*})\leq\epsilon.

For the second stopping condition, recall that w(t)w^{(t)} is an ϵt\epsilon_{t}-accurate minimizer of P(w)+κ2wy(t1)2P(w)+\frac{\kappa}{2}\|w-y^{(t-1)}\|^{2}, and hence by Lemma 3 below (with z=wz=w^{*}, w+=w(t)w^{+}=w^{(t)}, and y=y(t1)y=y^{(t-1)}):

It is left to show that the first stopping condition is correct, namely, to show that after 1+2ηlog(ξ1/ϵ)1+\frac{2}{\eta}\log(\xi_{1}/\epsilon) iterations the algorithm must converge to an ϵ\epsilon-accurate solution. Observe that the definition of ξt\xi_{t} yields that ξt=(1η/2)t1ξ1eη(t1)/2ξ1\xi_{t}=(1-\eta/2)^{t-1}\,\xi_{1}\leq e^{-\eta(t-1)/2}\xi_{1}. Therefore, to prove that the first stopping condition is valid, it suffices to show that for every tt, P(w(t))P(w)ξtP(w^{(t)})-P(w^{*})\leq\xi_{t}.

Recall that at each outer iteration of the accelerated procedure, we approximately minimize an objective of the form

Of course, minimizing P(w;y)P(w;y) is not the same as minimizing P(w)P(w). Our first lemma shows that for every yy, if w+w^{+} is an ϵ\epsilon-accurate minimizer of P(w;y)P(w;y) then we can derive a lower bound on P(w)P(w) based on P(w+)P(w^{+}) and a convex quadratic function of ww.

Let μ=λ/2\mu=\lambda/2 and ρ=μ+κ\rho=\mu+\kappa. Let w+w^{+} be a vector such that P(w+;y)minwP(w,y)+ϵP(w^{+};y)\leq\min_{w}P(w,y)+\epsilon. Then, for every zz,

By the μ\mu-strong convexity of Ψ\Psi we have that for every zz,

In addition, by standard algebraic manipulations,

Finally, using the assumption P(w+;y)minwP(w;y)+ϵP(w^{+};y)\leq\min_{w}P(w;y)+\epsilon we conclude our proof. ∎

We saw that the quadratic function P(w+)+Qϵ(z;w+,y)P(w^{+})+Q_{\epsilon}(z;w^{+},y) lower bounds the function PP everywhere. Therefore, any convex combination of such functions would form a quadratic function which lower bounds PP. In particular, the algorithm (implicitly) maintains a sequence of quadratic functions, h1,h2,h_{1},h_{2},\ldots, defined as follows. Choose η(0,1)\eta\in(0,1) and a sequence y(1),y(2),y^{(1)},y^{(2)},\ldots that will be specified later. Define,

The following simple lemma shows that for every t1t\geq 1 and zz, ht(z)h_{t}(z) lower bounds P(z)P(z).

Let η(0,1)\eta\in(0,1) and let y(1),y(2),y^{(1)},y^{(2)},\ldots be any sequence of vectors. Assume that w(1)=0w^{(1)}=0 and for every t1t\geq 1, w(t+1)w^{(t+1)} satisfies P(w(t+1);y(t))minwP(w;y(t))+ϵt+1P(w^{(t+1)};y^{(t)})\leq\min_{w}P(w;y^{(t)})+\epsilon_{t+1}. Then, for every t1t\geq 1 and every vector zz we have

The proof is by induction. For t=1t=1, observe that P(0;0)=P(0)P(0;0)=P(0) and that for every ww we have P(w;0)P(w)D(0)P(w;0)\geq P(w)\geq D(0). This yields P(0;0)minwP(w;0)P(0)D(0)P(0;0)-\min_{w}P(w;0)\leq P(0)-D(0). The claim now follows directly from Lemma 3. Next, for the inductive step, assume the claim holds for some t11t-1\geq 1 and let us prove it for tt. By the recursive definition of hth_{t} and by using Lemma 3 we have

Using the inductive assumption we obtain that the right-hand side of the above is upper bounded by (1η)P(z)+ηP(z)=P(z)(1-\eta)P(z)+\eta P(z)=P(z), which concludes our proof. ∎

The more difficult part of the proof is to show that for every t1t\geq 1,

If this holds true, then we would immediately get that for every ww^{*},

This will conclude the proof of the first part of Theorem 3, since ξt=ξ1(1η/2)t1ξ1e(t1)η/2\xi_{t}=\xi_{1}(1-\eta/2)^{t-1}\leq\xi_{1}\,e^{-(t-1)\eta/2}, and therefore, 1+2ηlog(ξ1/ϵ)1+\frac{2}{\eta}\log(\xi_{1}/\epsilon) iterations suffice to guarantee that P(w(t))P(w)ϵP(w^{(t)})-P(w^{*})\leq\epsilon.

Let us construct an explicit formula for v(t)v^{(t)}. Clearly, v(1)=0v^{(1)}=0. Assume that we have calculated v(t)v^{(t)} and let us calculate v(t+1)v^{(t+1)}. Note that hth_{t} is a quadratic function which is minimized at v(t)v^{(t)}. Furthermore, it is easy to see that for every tt, hth_{t} is μ\mu-strongly convex quadratic function. Therefore,

By the definition of ht+1h_{t+1} we obtain that

Since the gradient of ht+1(z)h_{t+1}(z) at v(t+1)v^{(t+1)} should be zero, we obtain that v(t+1)v^{(t+1)} should satisfy

Getting back to our second phase of the proof, we need to show that for every tt we have P(w(t))ht(v(t))+ξtP(w^{(t)})\leq h_{t}(v^{(t)})+\xi_{t}. We do so by induction. For the case t=1t=1 we have

For the induction step, assume the claim holds for t1t\geq 1 and let us prove it for t+1t+1. We use the shorthands,

By the inductive assumption we have ht(v(t))P(w(t))ξth_{t}(v^{(t)})\geq P(w^{(t)})-\xi_{t} and by Lemma 3 we have P(w(t))ψt+1(w(t))P(w^{(t)})\geq\psi_{t+1}(w^{(t)}). Therefore,

So far we did not specify η\eta and y(t)y^{(t)} (except y(0)=0y^{(0)}=0). We next set

We also observe that ϵt+1ηξt2(1+η2)\epsilon_{t+1}\leq\frac{\eta\xi_{t}}{2(1+\eta^{-2})} which implies that (1+ρ/μ)ϵt+1+(1η)ξt(1η/2)ξt=ξt+1(1+\rho/\mu)\epsilon_{t+1}+(1-\eta)\xi_{t}\leq(1-\eta/2)\xi_{t}=\xi_{t+1}. Combining the above with (6) and (7), and rearranging terms, we obtain that

Next, observe that ρη2=μ\rho\eta^{2}=\mu and that by (5) we have

The right-hand side of the above is non-negative because of the convexity of the function f(z)=μ2zv(t+1)2f(z)=\frac{\mu}{2}\|z-v^{(t+1)}\|^{2}, which yields

We next show that each call to Prox-SDCA will terminate quickly. By the definition of κ\kappa we have that

Therefore, based on Corollary 1 we know that the averaged runtime at iteration tt is

The following lemma bounds the initial dual sub-optimality at iteration t4t\geq 4. Similar arguments will yield a similar result for t<4t<4.

Combining the above with (8), we obtain that

Next, we bound y(t1)y(t2)2\|y^{(t-1)}-y^{(t-2)}\|^{2}. We have

where we used the triangle inequality and β<1\beta<1. By strong convexity of PP we have, for every ii,

Getting back to the proof of the second claim of Theorem 3, we have obtained that

where in the last inequality we used η21=2κλ\eta^{-2}-1=\frac{2\kappa}{\lambda}, which implies that 2κλ(1+η2)η4\frac{2\kappa}{\lambda}(1+\eta^{-2})\leq\eta^{-4}. Using 1<η51<\eta^{-5}, 1η/20.51-\eta/2\geq 0.5, and taking log to both sides, we get that

which concludes the proof of the second claim of Theorem 3.

Applications

In this section we specify our algorithmic framework to several popular machine learning applications. In Section 5.1 we start by describing several loss functions and deriving their conjugate. In Section 5.2 we describe several regularization functions. Finally, in the rest of the subsections we specify our algorithm for Ridge regression, SVM, Lasso, logistic regression, and multiclass prediction.

Logistic loss:

ϕ(a)=log(1+ea)\phi(a)=\log(1+e^{a}). The derivative is ϕ(a)=1/(1+ea)\phi^{\prime}(a)=1/(1+e^{-a}) and the second derivative is ϕ(a)=1(1+ea)(1+ea)[0,1/4]\phi^{\prime\prime}(a)=\frac{1}{(1+e^{-a})(1+e^{a})}\in[0,1/4], from which it follows that ϕ\phi is (1/4)(1/4)-smooth. The conjugate function is

Hinge loss:

ϕ(a)=[1a]+:=max{0,1a}\phi(a)=[1-a]_{+}:=\max\{0,1-a\}. The conjugate function is

Smooth hinge loss:

This loss is obtained by smoothing the hinge-loss using the technique described in Lemma 2. This loss is parameterized by a scalar γ>0\gamma>0 and is defined as:

Max-of-hinge:

To calculate the conjugate of ϕ\phi, let

Each inner maximization over aja_{j} would be \infty unless βj=bj\beta_{j}=b_{j}. Therefore,

Smooth max-of-hinge

This loss obtained by smoothing the max-of-hinge loss using the technique described in Lemma 2. This loss is parameterized by a scalar γ>0\gamma>0. We start by adding regularization to the conjugate of the max-of-hinge given in (11) and obtain

Taking the conjugate of the conjugate we obtain

Soft-max-of-hinge loss function:

Another approach to smooth the max-of-hinge loss function is by using soft-max instead of max. The resulting soft-max-of-hinge loss function is defined as

The jj’th element of the gradient of ϕ\phi is

By the definition of the conjugate we have ϕγ(b)=maxaabϕγ(a)\phi_{\gamma}^{*}(b)=\max_{a}a^{\top}b-\phi_{\gamma}(a). The vector aa that maximizes the above must satisfy

This can be satisfied only if bj0b_{j}\geq 0 for all jj and jbj1\sum_{j}b_{j}\leq 1. That is, bSb\in S. Denote Z=i=1ke(ci+ai)/γZ=\sum_{i=1}^{k}e^{(c_{i}+a_{i})/\gamma} and note that

Finally, if bSb\notin S then the gradient of abϕγ(a)a^{\top}b-\phi_{\gamma}(a) does not vanish anywhere, which means that ϕγ(b)=\phi_{\gamma}^{*}(b)=\infty. All in all, we obtain

Since the entropic function, jbjlog(bj)\sum_{j}b_{j}\log(b_{j}) is 11-strongly convex over SS with respect to the L1L_{1} norm, we obtain that ϕγ\phi^{*}_{\gamma} is γ\gamma-strongly convex with respect to the L1L_{1} norm, from which it follows that ϕγ\phi_{\gamma} is (1/γ)(1/\gamma)-smooth with respect to the LL_{\infty} norm.

2 Regularizers

The simplest regularization is the squared L2L_{2} regularization

This is a 11-strongly convex regularization function whose conjugate is

For our acceleration procedure, we also use the L2L_{2} regularization plus a linear term, namely,

for some vector zz. The conjugate of this function is

Another popular regularization we consider is the L1L_{1} regularization,

This is not a strongly convex regularizer and therefore we will add a slight L2L_{2} regularization to it and define the L1L_{1}-L2L_{2} regularization as

where σ=σλ\sigma^{\prime}=\frac{\sigma}{\lambda} for some small λ\lambda. Note that

so if λ\lambda is small enough (as will be formalized later) we obtain that λg(w)σw1\lambda g(w)\approx\sigma\|w\|_{1}.

The maximizer is also g(v)\nabla g^{*}(v) and we now show how to calculate it. We have

where the right-hand side is the ii’th component of the objective value we will obtain by setting wi=0w_{i}=0. This leads to the conclusion that

Another regularization function we’ll use in the accelerated procedure is

3 Ridge Regression

Below we specify Prox-SDCA for ridge regression. We use Option I since it is possible to derive a closed form solution to the maximization of the dual with respect to Δαi\Delta\alpha_{i}. Indeed, since ϕi(b)=12b2+yib-\phi_{i}^{*}(-b)=-\frac{1}{2}b^{2}+y_{i}b we have that the maximization problem is

Applying the above update and using some additional tricks to improve the running time we obtain the following procedure.

The runtime of Prox-SDCA for ridge regression becomes

where R=maxixiR=\max_{i}\|x_{i}\|. This matches the recent results of . If R2/λnR^{2}/\lambda\gg n we can apply the accelerated procedure and obtain the improved runtime

4 Logistic Regression

and the dual constraints are αn\alpha\in^{n}.

Below we specify Prox-SDCA for logistic regression using Option III.

Prox-SDCA((xi)i=1n,ϵ,α(0),z(x_{i})_{i=1}^{n},\epsilon,\alpha^{(0)},z) for logistic regression Goal: Minimize P(w)=1ni=1nlog(1+exiw)+λ(12w2wz)P(w)=\frac{1}{n}\sum_{i=1}^{n}\log(1+e^{x_{i}^{\top}w})+\lambda\left(\frac{1}{2}\|w\|^{2}-w^{\top}z\right) Initialize v(0)=1λni=1nαi(0)xiv^{(0)}=\frac{1}{\lambda n}\sum_{i=1}^{n}\alpha_{i}^{(0)}x_{i}, and i,  pi=xiz\forall i,~{}~{}p_{i}=x_{i}^{\top}z Define: ϕ(b)=blog(b)+(1b)log(1b)\phi^{*}(b)=b\log(b)+(1-b)\log(1-b) Iterate: for t=1,2,t=1,2,\dots Randomly pick ii p=xiw(t1)p=x_{i}^{\top}w^{(t-1)} q=1/(1+ep)αi(t1)q=-1/(1+e^{-p})-\alpha_{i}^{(t-1)} s=min(1,log(1+ep)+ϕ(αi(t1))+pαi(t1)+2q2q2(4+1λnxi2))s=\min\left(1,\frac{\log(1+e^{p})+\phi^{*}(-\alpha_{i}^{(t-1)})+p\alpha^{(t-1)}_{i}+2q^{2}}{q^{2}(4+\frac{1}{\lambda n}\|x_{i}\|^{2})}\right) Δαi=sq\Delta\alpha_{i}=sq αi(t)=αi(t1)+Δαi\alpha^{(t)}_{i}=\alpha^{(t-1)}_{i}+\Delta\alpha_{i} and for jij\neq i, αj(t)=αj(t1)\alpha^{(t)}_{j}=\alpha^{(t-1)}_{j} v(t)=v(t1)+Δαiλnxiv^{(t)}=v^{(t-1)}+\frac{\Delta\alpha_{i}}{\lambda n}x_{i} Stopping condition: let w(t)=v(t)+zw^{(t)}=v^{(t)}+z Stop if 1ni=1n(log(1+exiw(t))+ϕ(αi(t1)))+λw(t)v(t)ϵ\frac{1}{n}\sum_{i=1}^{n}\left(\log(1+e^{x_{i}^{\top}w^{(t)}})+\phi^{*}(-\alpha_{i}^{(t-1)})\right)+\lambda w^{(t)\top}v^{(t)}\leq\epsilon

The runtime analysis is similar to the analysis for ridge regression.

5 Lasso

In the Lasso problem, the loss function is the squared loss but the regularization function is L1L_{1}. That is, we need to solve the problem:

Let yˉ=12ni=1nyi2\bar{y}=\frac{1}{2n}\sum_{i=1}^{n}y_{i}^{2}, and let wˉ\bar{w} be an optimal solution of (18). Then, the objective at wˉ\bar{w} is at most the objective at w=0w=0, which yields

for some λ>0\lambda>0. This problem fits into our framework, since now the regularizer is strongly convex. Furthermore, if ww^{*} is an (ϵ/2)(\epsilon/2)-accurate solution to the problem in (19), then P(w)P(wˉ)+ϵ/2P(w^{*})\leq P(\bar{w})+\epsilon/2 which yields

Since wˉ22(yˉ/σ)2\|\bar{w}\|_{2}^{2}\leq\left({\bar{y}}/{\sigma}\right)^{2}, we obtain that setting λ=ϵ(σ/yˉ)2\lambda=\epsilon(\sigma/\bar{y})^{2} guarantees that ww^{*} is an ϵ\epsilon accurate solution to the original problem given in (18).

In light of the above, from now on we focus on the problem given in (19). As in the case of ridge regression, we can apply Prox-SDCA with Option I. The resulting pseudo-code is given below. Applying the above update and using some additional tricks to improve the running time we obtain the following procedure.

Let us now discuss the runtime of the resulting method. Denote R=maxixiR=\max_{i}\|x_{i}\| and for simplicity, assume that yˉ=O(1)\bar{y}=O(1). Choosing λ=ϵ(σ/yˉ)2\lambda=\epsilon(\sigma/\bar{y})^{2}, the runtime of our method becomes

It is also convenient to write the bound in terms of B=wˉ2B=\|\bar{w}\|_{2}, where, as before, wˉ\bar{w} is the optimal solution of the L1L_{1} regularized problem. With this parameterization, we can set λ=ϵ/B2\lambda=\epsilon/B^{2} and the runtime becomes

The runtime of standard SGD is O(dR2B2/ϵ2)O(dR^{2}B^{2}/\epsilon^{2}) even in the case of smooth loss functions such as the squared loss. Several variants of SGD, that leads to sparser intermediate solutions, have been proposed (e.g. ). However, all of these variants share the runtime of O(dR2B2/ϵ2)O(dR^{2}B^{2}/\epsilon^{2}), which is much slower than our runtime when ϵ\epsilon is small.

Another relevant approach is the FISTA algorithm of . The shrinkage operator of FISTA is the same as the gradient of gg^{*} used in our approach. It is a batch algorithm using Nesterov’s accelerated gradient technique. For the squared loss function, the runtime of FISTA is

This bound is worst than our bound by a factor of at least n\sqrt{n}.

Another approach to solving (18) is stochastic coordinate descent over the primal problem. showed that the runtime of this approach is

under the assumption that xi1\|x_{i}\|_{\infty}\leq 1 for all ii. Similar results can also be found in .

For our method, the runtime depends on R2=maxixi22R^{2}=\max_{i}\|x_{i}\|_{2}^{2}. If R2=O(1)R^{2}=O(1) then the runtime of our method is much better than that of . In the general case, if maxixi1\max_{i}\|x_{i}\|_{\infty}\leq 1 then R2dR^{2}\leq d, which yields the runtime of

This is the same or better than whenever d=O(n)d=O(n).

6 Linear SVM

Support Vector Machines (SVM) is an algorithm for learning a linear classifier. Linear SVM (i.e., SVM with linear kernels) amounts to minimizing the objective

Our first step is to smooth the hinge-loss. Let γ=ϵ\gamma=\epsilon and consider the smooth hinge-loss as defined in (9). Recall that the smooth hinge-loss satisfies

For the smoothed hinge loss, the optimization problem given in Option I of Prox-SDCA has a closed form solution and we obtain the following procedure:

Denote R=maxixiR=\max_{i}\|x_{i}\|. Then, the runtime of the resulting method is

In particular, choosing γ=ϵ\gamma=\epsilon we obtain a solution to the original SVM problem in runtime of

As mentioned before, this is better than SGD when 1λϵn\frac{1}{\lambda\epsilon}\gg n.

7 Multiclass SVM

This can be written as ϕ((Wxi)(Wxi)yi)\phi((W^{\top}x_{i})-(W^{\top}x_{i})_{y_{i}}) where

with cic_{i} being the all ones vector except in the yiy_{i} coordinate.

Therefore, the optimization problem of multiclass SVM becomes:

As in the case of SVM, we will use the smooth version of the max-of-hinge loss function as described in (13). If we set the smoothness parameter γ\gamma to be ϵ\epsilon then an (ϵ/2)(\epsilon/2)-accurate solution to the problem with the smooth loss is also an ϵ\epsilon-accurate solution to the original problem with the non-smooth loss. Therefore, from now on we focus on the problem with the smooth max-of-hinge loss.

We specify Prox-SDCA for multiclass SVM using Option I. We will show that the optimization problem in Option I can be calculated efficiently by sorting a kk dimensional vector. Such ideas were explored in for the non-smooth max-of-hinge loss.

Let w^=w1λnXiαi(t1)\hat{w}=w-\frac{1}{\lambda n}X_{i}\alpha^{(t-1)}_{i}. Then, the optimization problem over αi\alpha_{i} can be written as

As shown before, if we organize w^\hat{w} as a d×kd\times k matrix, denoted W^\hat{W}, we have that Xiw^=W^xi(W^xi)yiX_{i}^{\top}\hat{w}=\hat{W}^{\top}x_{i}-(\hat{W}^{\top}x_{i})_{y_{i}}. We also have that

It follows that an optimal solution to (20) must set αi,yi=0\alpha_{i,y_{i}}=0 and we only need to optimize over the rest of the dual variables. This also yields,

This is equivalent to a problem of the form:

The equivalence is in the sense that if (a,β)(a,\beta) is a solution of (22) then we can set αi=a\alpha_{i}=-a.

Assume for simplicity that μ\mu is sorted in a non-increasing order and that all of its elements are non-negative (otherwise, it is easy to verify that we can zero the negative elements of μ\mu and sort the non-negative, without affecting the solution). Let μˉ\bar{\mu} be the cumulative sum of μ\mu, that is, for every jj, let μˉj=r=1jμr\bar{\mu}_{j}=\sum_{r=1}^{j}\mu_{r}. For every jj, let zj=μˉjjμjz_{j}=\bar{\mu}_{j}-j\mu_{j}. Since μ\mu is sorted we have that

The first order condition for minimality w.r.t. zz is

If this value of zz is in (zj,zj+1)(z_{j},z_{j+1}), then it is the optimal zz and we’re done. Otherwise, the optimum should be either z=0z=0 (which yields α=0\alpha=0 as well) or z=1z=1.

a=OptimizeDual(μ,C)a=\textrm{OptimizeDual}(\mu,C) Solve the optimization problem given in (22) Initialize: i, μ^i=max{0,μi}\forall i,~{}\hat{\mu}_{i}=\max\{0,\mu_{i}\}, and sort μ^\hat{\mu} s.t. μ^1μ^2μ^k\hat{\mu}_{1}\geq\hat{\mu}_{2}\geq\ldots\geq\hat{\mu}_{k} Let: μˉ\bar{\mu} be s.t. μˉj=i=1jμ^i\bar{\mu}_{j}=\sum_{i=1}^{j}\hat{\mu}_{i} Let: zz be s.t. zj=min{μˉjjμ^j,1}z_{j}=\min\{\bar{\mu}_{j}-j\hat{\mu}_{j},1\} and zk+1=1z_{k+1}=1 If: j\exists j s.t. μˉj1+jC[zj,zj+1]\frac{\bar{\mu}_{j}}{1+jC}\in[z_{j},z_{j+1}] return aa s.t. i, ai=max{0,μi(μˉj1+jC+μˉj)/j}\forall i,~{}a_{i}=\max\left\{0,\mu_{i}-\left(-\frac{\bar{\mu}_{j}}{1+jC}+\bar{\mu}_{j}\right)/j\right\} Else: Let jj be the minimal index s.t. zj=1z_{j}=1 set aa s.t. i,  ai=max{0,μi(zj+μˉj)/j}\forall i,~{}~{}a_{i}=\max\{0,\mu_{i}-(-z_{j}+\bar{\mu}_{j})/j\} If: aμ2+Cμ2\|a-\mu\|^{2}+C\leq\|\mu\|^{2} return aa Else: return (0,,0)(0,\ldots,0)

The resulting pseudo-codes for Prox-SDCA is given below. We specify the procedure while referring to WW as a matrix, because it is the more natural representation. For convenience of the code, we also maintain in αi,yi\alpha_{i,y_{i}} the value of jyiαi,j-\sum_{j\neq y_{i}}\alpha_{i,j} (instead of the optimal value of ).

Experiments

In this section we compare Prox-SDCA, its accelerated version Accelerated-Prox-SDCA, and the FISTA algorithm of , on L1L2L_{1}-L_{2} regularized loss minimization problems.

The experiments were performed on three large datasets with very different feature counts and sparsity, which were kindly provided by Thorsten Joachims (the datasets were also used in ). The astro-ph dataset classifies abstracts of papers from the physics ArXiv according to whether they belong in the astro-physics section; CCAT is a classification task taken from the Reuters RCV1 collection; and cov1 is class 1 of the covertype dataset of Blackard, Jock & Dean. The following table provides details of the dataset characteristics.

In the experiments, we set σ=105\sigma=10^{-5} and vary λ\lambda in the range {106,107,108,109}\{10^{-6},10^{-7},10^{-8},10^{-9}\}.

The convergence behaviors are plotted in Figure 6. In all the plots we depict the primal objective as a function of the number of passes over the data (often referred to as “epochs”). For FISTA, each iteration involves a single pass over the data. For Prox-SDCA, each nn iterations are equivalent to a single pass over the data. And, for Accelerated-Prox-SDCA, each nn inner iterations are equivalent to a single pass over the data. For Prox-SDCA and Accelerated-Prox-SDCA we implemented their corresponding stopping conditions and terminate the methods once an accuracy of 10310^{-3} was guaranteed.

It is clear from the graphs that Accelerated-Prox-SDCA yields the best results, and often significantly outperform the other methods. Prox-SDCA behaves similarly when λ\lambda is relatively large, but it converges much slower when λ\lambda is small. This is consistent with our theory. Finally, the relative performance of FISTA and Prox-SDCA depends on the ratio between λ\lambda and nn, but in all cases, Accelerated-Prox-SDCA is much faster than FISTA. This is again consistent with our theory.

We have described and analyzed a proximal stochastic dual coordinate ascent method and have shown how to accelerate the procedure. The overall runtime of the resulting method improves state-of-the-art results in many cases of interest.

There are two main open problems that we leave to future research.

Our Prox-SDCA procedure and its analysis works for regularizers which are strongly convex with respect to an arbitrary norm. However, our acceleration procedure is designed for regularizers which are strongly convex with respect to the Euclidean norm. Is is possible to extend the acceleration procedure to more general regularizers?

Acknowledgements

The authors would like to thank Fen Xia for careful proof-reading of the paper which helped us to correct numerous typos. Shai Shalev-Shwartz is supported by the following grants: Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI) and ISF 598-10. Tong Zhang is supported by the following grants: NSF IIS-1016061, NSF DMS-1007527, and NSF IIS-1250985.

Appendix A Proofs of Iteration Bounds for Prox-SDCA

The proof technique follows that of Shalev-Shwartz and Zhang , but with the required generality for handling general strongly convex regularizers and smoothness/Lipschitzness with respect to general norms.

We prove the theorems for running Prox-SDCA while choosing Δαi\Delta\alpha_{i} as in Option I. A careful examination of the proof easily reveals that the results hold for the other options as well. More specifically, Lemma 6 only requires choosing Δαi=s(ui(t1)αi(t1))\Delta\alpha_{i}=s(u_{i}^{(t-1)}-\alpha_{i}^{(t-1)}) as in (23), and Option III chooses ss to optimize the bound on the right hand side of (25), and hence ensures that the choice can do no worse than the result of Lemma 6 with any ss. The simplification in Option IV and V employs the specific simplification of the bound in Lemma 6 in the proof of the theorems.

and ui(t1)=ϕi(Xiw(t1))-u^{(t-1)}_{i}=\nabla\phi_{i}(X_{i}^{\top}w^{(t-1)}).

Since only the ii’th element of α\alpha is updated, the improvement in the dual objective can be written as

The smoothness of gg^{*} implies that g(v+Δv)h(v;Δv)g^{*}(v+\Delta v)\leq h(v;\Delta v), where h(v;Δv):=g(v)+g(v)Δv+12ΔvD2h(v;\Delta v):=g^{*}(v)+\nabla g^{*}(v)^{\top}\Delta v+\frac{1}{2}\|\Delta v\|_{D^{\prime}}^{2}. Therefore,

By the definition of the update we have for all ss\in that

Combining this with (23) and rearranging terms we obtain that

Since u=ϕ(Xw)-u=\nabla\phi(X^{\top}w) we have ϕ(u)+wXu=ϕ(Xw)\phi^{*}(-u)+w^{\top}Xu=-\phi(X^{\top}w), which yields

Next note that with w=g(v)w=\nabla g^{*}(v), we have g(w)+g(v)=wvg(w)+g^{*}(v)=w^{\top}v. Therefore:

Therefore, if we take expectation of (25) w.r.t. the choice of ii we obtain that

Multiplying both sides by s/ns/n concludes the proof of the lemma. ∎

Equipped with the above lemmas we are ready to prove Theorem 1 and Theorem 2.

The assumption that ϕi\phi_{i} is (1/γ)(1/\gamma)-smooth implies that ϕi\phi_{i}^{*} is γ\gamma-strongly-convex. We will apply Lemma 6 with

Recall that XiDDR\|X_{i}\|_{D\to D^{\prime}}\leq R. Therefore, the choice of ss implies that

and hence G(t)0G^{(t)}\leq 0 for all tt. This yields,

Taking expectation of both sides with respect to the randomness at previous rounds, and using the law of total expectation, we obtain that

But since ϵD(t1):=D(α)D(α(t1))P(w(t1))D(α(t1))\epsilon_{D}^{(t-1)}:=D(\alpha^{*})-D(\alpha^{(t-1)})\leq P(w^{(t-1)})-D(\alpha^{(t-1)}) and D(α(t))D(α(t1))=ϵD(t1)ϵD(t)D(\alpha^{(t)})-D(\alpha^{(t-1)})=\epsilon_{D}^{(t-1)}-\epsilon_{D}^{(t)}, we obtain that

Using again (28), we can also obtain that

which proves the first part of Theorem 1.

Next, we sum the first inequality of (29) over t=T0+1,,Tt=T_{0}+1,\ldots,T to obtain

Now, if we choose wˉ,αˉ\bar{w},\bar{\alpha} to be either the average vectors or a randomly chosen vector over t{T0+1,,T}t\in\{T_{0}+1,\ldots,T\}, then the above implies

In particular, the choice of TT0=nsT-T_{0}=\frac{n}{s} and T0=nslog(ϵD(0)/ϵP)T_{0}=\frac{n}{s}\log(\epsilon_{D}^{(0)}/\epsilon_{P}) satisfies the above requirement. ∎

Next, for the duality gap, using (27) we have that for every tt such that ϵD(t1)ϵD\epsilon_{D}^{(t-1)}\leq\epsilon_{D} we have

This proves the second claim of Theorem 2.