Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization

Shai Shalev-Shwartz, Tong Zhang

Introduction

For example, in ridge regression the regularizer is $g(w)=\frac{1}{2}\|w\|_{2}^{2}$ , the instances are column vectors, and for every $i$ the $i$ ’th loss function is $\phi_{i}(a)=\frac{1}{2}(a-y_{i})^{2}$ , for some scalar $y_{i}$ .

Let $w^{*}=\operatorname*{argmin}_{w}P(w)$ (we will later make assumptions that imply that $w^{*}$ is unique). We say that $w$ is $\epsilon$ -accurate if $P(w)-P(w^{*})\leq\epsilon$ . Our main result is a new algorithm for solving (1). If $g$ is $1$ -strongly convex and each $\phi_{i}$ is $(1/\gamma)$ -smooth (meaning that its gradient is $(1/\gamma)$ -Lipschitz), then our algorithm finds, with probability of at least $1-\delta$ , an $\epsilon$ -accurate solution to (1) in time

By applying a smoothing technique to $\phi_{i}$ , we also derive a method that finds an $\epsilon$ -accurate solution to (1) assuming that each $\phi_{i}$ is $O(1)$ -Lipschitz, and obtain the runtime

This applies, for example, to SVM with the hinge-loss. It significantly improves over the rate $\frac{d}{\lambda\epsilon}$ of SGD (e.g. ), when $\frac{1}{\lambda\epsilon}\gg n$ .

We can also apply our results to non-strongly convex regularizers (such as the $L_{1}$ norm regularizer), or to non-regularized problems, by adding a slight $L_{2}$ regularization. For example, for $L_{1}$ regularized problems, and assuming that each $\phi_{i}$ is $(1/\gamma)$ -smooth, we obtain the runtime of

This applies, for example, to the Lasso problem, in which the goal is to minimize the squared loss plus an $L_{1}$ regularization term.

To put our results in context, the table below lists the runtime of various algorithms (ignoring constants and logarithmic terms) for three key machine learning applications: SVM, in which $\phi_{i}(a)=\max\{0,1-a\}$ and $g(w)=\frac{1}{2}\|w\|_{2}^{2}$ ; Lasso, in which $\phi_{i}(a)=\frac{1}{2}(a-y_{i})^{2}$ and $g(w)=\sigma\|w\|_{1}$ ; and Ridge Regression, in which $\phi_{i}(a)=\frac{1}{2}(a-y_{i})^{2}$ and $g(w)=\frac{1}{2}\|w\|_{2}^{2}$ . Additional applications, and a more detailed runtime comparison to previous work, are given in Section 5. In the table, SGD stands for Stochastic Gradient Descent, and AGD stands for Accelerated Gradient Descent.

Additional related work:

As mentioned before, our first contribution is a proximal version of the stochastic dual coordinate ascent method and an extension of the analysis given in Shalev-Shwartz and Zhang . Stochastic dual coordinate ascent has also been studied in Collins et al. but in more restricted settings than the general problem considered in this paper. One can also apply the analysis of stochastic coordinate descent methods given in Richtárik and Takáč to the dual problem. However, here we are interested in understanding the primal sub-optimality, hence an analysis which only applies to the dual problem is not sufficient.

The generality of our approach allows us to apply it to multiclass prediction problems. We discuss this in detail later on in Section 5. Recently, derived a stochastic coordinate ascent for structural SVM based on the Frank-Wolfe algorithm. Although the motivations differ, for the special case of multiclass problems with the hinge-loss, their algorithm ends up being the same as our proximal dual ascent algorithm (with the same rate). Our approach allows us to accelerate the method and obtain an even faster rate.

The proof of our acceleration method adapts Nesterov’s estimation sequence technique, studied in Devolder et al. , Schmidt et al. , to allow approximate and stochastic proximal mapping. See also . In particular, it relies on similar ideas as in Proposition 4 of . However, our specific requirement is different, and the proof presented here is different and significantly simpler than that of .

There have been several attempts to accelerate stochastic optimization algorithms. See for example and the references therein. However, the runtime of these methods has a polynomial dependence on $1/\epsilon$ even if the $\phi_{i}$ are smooth and $g$ is $\lambda$ -strongly convex, as opposed to the logarithmic dependence on $1/\epsilon$ obtained here. As in , we avoid the polynomial dependence on $1/\epsilon$ by allowing more than a single pass over the data.

Preliminaries

Given a norm $\|\cdot\|_{P}$ we denote the dual norm by $\|\cdot\|_{D}$ where

We use $\|\cdot\|$ or $\|\cdot\|_{2}$ to denote the $L_{2}$ norm, $\|x\|^{2}=x^{\top}x$ . We also use $\|x\|_{1}=\sum_{i}|x_{i}|$ and $\|x\|_{\infty}=\max_{i}|x_{i}|$ . The operator norm of a matrix $X$ with respect to norms $\|\cdot\|_{P},\|\cdot\|_{P^{\prime}}$ is defined as

It is well known that $f$ is $\gamma$ -strongly convex with respect to $\|\cdot\|_{P}$ if and only if $f^{*}$ is $(1/\gamma)$ -smooth with respect to the dual norm, $\|\cdot\|_{D}$ .

We will assume that $g$ is strongly convex, which implies that $g^{*}(\cdot)$ is continuously differentiable. If we define

then it is known that $w(\alpha^{*})=w^{*}$ , where $\alpha^{*}$ is an optimal solution of (2). It is also known that $P(w^{*})=D(\alpha^{*})$ which immediately implies that for all $w$ and $\alpha$ , we have $P(w)\geq D(\alpha)$ , and hence the duality gap defined as

can be regarded as an upper bound on both the primal sub-optimality, $P(w(\alpha))-P(w^{*})$ , and on the dual sub-optimality, $D(\alpha^{*})-D(\alpha)$ .
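As a concrete illustration of weak duality, the sketch below evaluates the primal objective, the dual objective, and the duality gap for ridge regression. It assumes the standard forms $P(w)=\frac{1}{n}\sum_{i}\phi_{i}(x_{i}^{\top}w)+\lambda g(w)$ and $D(\alpha)=\frac{1}{n}\sum_{i}-\phi_{i}^{*}(-\alpha_{i})-\lambda g^{*}(v)$ with $v=\frac{1}{\lambda n}\sum_{i}x_{i}\alpha_{i}$ (the displayed definitions of (1) and (2) are not reproduced in this text, so these forms are an assumption), together with $g=g^{*}=\frac{1}{2}\|\cdot\|^{2}$ and $-\phi_{i}^{*}(-b)=-\frac{1}{2}b^{2}+y_{i}b$ . All function names are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def primal(w, X, y, lam):
    # P(w) = (1/n) sum_i 0.5*(x_i^T w - y_i)^2 + lam * 0.5*||w||^2
    n = len(X)
    loss = sum(0.5 * (dot(x, w) - yi) ** 2 for x, yi in zip(X, y)) / n
    return loss + lam * 0.5 * dot(w, w)

def w_of_alpha(alpha, X, lam):
    # w(alpha) = grad g*(v) = v = (1/(lam*n)) sum_i x_i alpha_i for g = 0.5*||.||^2
    n, d = len(X), len(X[0])
    return [sum(a * x[j] for a, x in zip(alpha, X)) / (lam * n) for j in range(d)]

def dual(alpha, X, y, lam):
    # D(alpha) = (1/n) sum_i (-0.5*alpha_i^2 + y_i*alpha_i) - lam * 0.5*||v||^2
    n = len(X)
    v = w_of_alpha(alpha, X, lam)
    conj = sum(-0.5 * a * a + yi * a for a, yi in zip(alpha, y)) / n
    return conj - lam * 0.5 * dot(v, v)

def duality_gap(alpha, X, y, lam):
    # P(w(alpha)) - D(alpha); non-negative by weak duality
    return primal(w_of_alpha(alpha, X, lam), X, y, lam) - dual(alpha, X, y, lam)

X = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0]]
y = [1.0, -1.0, 0.5]
gap = duality_gap([0.3, -0.2, 0.1], X, y, lam=0.1)
```

Because the gap can be computed from $\alpha$ alone, it serves as a stopping criterion that does not require knowing $w^{*}$ .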

Main Results

In this section we describe our algorithms and their analysis. We start in Section 3.1 with a description of our proximal stochastic dual coordinate ascent procedure (Prox-SDCA). Then, in Section 3.2 we show how to accelerate the method by calling Prox-SDCA on a sequence of problems with a strong regularization. Throughout the first two sections we assume that the loss functions are smooth. Finally, we discuss the case of Lipschitz loss functions in Section 3.3.

The proof of the main acceleration theorem (Theorem 3) is given in Section 4. The rest of the proofs are provided in the appendix.

We now describe our proximal stochastic dual coordinate ascent procedure for solving (1). Our results in this subsection hold for $g$ being a $1$ -strongly convex function with respect to some norm $\|\cdot\|_{P^{\prime}}$ and every $\phi_{i}$ being a $(1/\gamma)$ -smooth function with respect to some other norm $\|\cdot\|_{P}$ . The corresponding dual norms are denoted by $\|\cdot\|_{D^{\prime}}$ and $\|\cdot\|_{D}$ respectively.

The dual objective in (2) has a different dual vector associated with each example in the training set. At each iteration of dual coordinate ascent we allow only the $i$ ’th column of $\alpha$ to change, while the rest of the dual vectors are kept intact. We focus on a randomized version of dual coordinate ascent, in which at each round we choose which dual vector to update uniformly at random.

At step $t$ , let $v^{(t-1)}=(\lambda n)^{-1}\sum_{i}X_{i}\alpha_{i}^{(t-1)}$ and let $w^{(t-1)}=\nabla g^{*}(v^{(t-1)})$ . We will update the $i$ -th dual variable $\alpha_{i}^{(t)}=\alpha_{i}^{(t-1)}+\Delta\alpha_{i}$ , in a way that will lead to a sufficient increase of the dual objective. For the primal problem, this would lead to the update $v^{(t)}=v^{(t-1)}+(\lambda n)^{-1}X_{i}\Delta\alpha_{i}$ , and therefore $w^{(t)}=\nabla g^{*}(v^{(t)})$ can also be written as

Note that this particular update is rather similar to the update step of the proximal-gradient dual-averaging method (see for example Xiao ). The difference is in how $\alpha^{(t)}$ is updated.

The goal of dual ascent methods is to increase the dual objective as much as possible, and thus the optimal way to choose $\Delta\alpha_{i}$ would be to maximize the dual objective, namely, we shall let

However, for a complex $g^{*}(\cdot)$ , this optimization problem may not be easy to solve. To simplify the optimization problem we can rely on the smoothness of $g^{*}$ (with respect to a norm $\|\cdot\|_{D^{\prime}}$ ) and instead of directly maximizing the dual objective function, we try to maximize the following proximal objective which is a lower bound of the dual objective:

In general, this optimization problem is still not necessarily simple to solve because $\phi^{*}$ may also be complex. We will thus also propose alternative update rules for $\Delta\alpha_{i}$ of the form $\Delta\alpha_{i}=s(-\nabla\phi_{i}(X_{i}^{\top}w^{(t-1)})-\alpha_{i}^{(t-1)})$ for an appropriately chosen step size parameter $s>0$ . Our analysis shows that an appropriate choice of $s$ still leads to a sufficient increase in the dual objective.

It should be pointed out that we can always pick $\Delta\alpha_{i}$ so that the dual objective is non-decreasing. In fact, if for a specific choice of $\Delta\alpha_{i}$ , the dual objective decreases, we may simply set $\Delta\alpha_{i}=0$ . Therefore throughout the proof we will assume that the dual objective is non-decreasing whenever needed.

The theorems below provide upper bounds on the number of iterations required by our prox-SDCA procedure.

Consider Procedure Prox-SDCA as given in Figure 1. Let $\alpha^{*}$ be an optimal dual solution and let $\epsilon>0$ . For every $T$ such that

We next give bounds that hold with high probability.

Consider Procedure Prox-SDCA as given in Figure 1. Let $\alpha^{*}$ be an optimal dual solution, let $\epsilon_{D},\epsilon_{P}>0$ , and let $\delta\in(0,1)$ .

we are guaranteed that with probability of at least $1-\delta$ it holds that $D(\alpha^{*})-D(\alpha^{(T)})\leq\epsilon_{D}$ .

we are guaranteed that with probability of at least $1-\delta$ it holds that $P(w^{(T)})-D(\alpha^{(T)})\leq\epsilon_{P}$ .

and let $T_{0}=T-n-\lceil\frac{R^{2}}{\lambda\gamma}\rceil$ . Suppose we choose $\lceil\log_{2}(2/\delta)\rceil$ values of $t$ uniformly at random from $T_{0}+1,\ldots,T$ , and then choose the single value of $t$ from these $\lceil\log_{2}(2/\delta)\rceil$ values for which $P(w^{(t)})-D(\alpha^{(t)})$ is minimal. Then, with probability of at least $1-\delta$ we have that $P(w^{(t)})-D(\alpha^{(t)})\leq\epsilon_{P}$ .
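The selection rule just described can be sketched as follows. The callable `gap_at` (returning the duality gap of iterate $t$ ) and the function name are hypothetical, and sampling with replacement is an assumption, since the text only says the values are chosen uniformly at random.

```python
import math
import random

def pick_best_iterate(gap_at, T0, T, delta, rng):
    """Sample ceil(log2(2/delta)) indices from T0+1..T and keep the one
    whose duality gap, as reported by gap_at(t), is smallest."""
    k = math.ceil(math.log2(2.0 / delta))
    # randint is inclusive on both ends, matching the range T0+1, ..., T
    candidates = [rng.randint(T0 + 1, T) for _ in range(k)]
    return min(candidates, key=gap_at)

# Toy usage: pretend the gap of iterate t is 1/t.
best_t = pick_best_iterate(lambda t: 1.0 / t, T0=10, T=20, delta=0.1,
                           rng=random.Random(0))
```

Only $\lceil\log_{2}(2/\delta)\rceil$ duality gaps need to be evaluated, each costing one pass over the data.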

The above theorem tells us that the runtime required to find an $\epsilon$ accurate solution, with probability of at least $1-\delta$ , is

The expected runtime required to minimize $P$ up to accuracy $\epsilon$ is

We have shown that with a runtime of $O\left(d\,\left(n+\frac{R^{2}}{\lambda\gamma}\right)\cdot\log\left(\frac{2(D(\alpha^{*})-D(\alpha^{(0)}))}{\epsilon}\right)\right)$ we can find an $\epsilon$ accurate solution with probability of at least $1/2$ . Therefore, we can run the procedure for this amount of time and check if the duality gap is smaller than $\epsilon$ . If yes, we are done. Otherwise, we would restart the process. Since the probability of success is $1/2$ we have that the average number of restarts we need is $2$ , which concludes the proof. ∎
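The bound on the number of runs is the mean of a geometric distribution. Writing $N$ for the number of runs until the first success (a symbol introduced here only for illustration) and assuming independent runs that each succeed with probability at least $1/2$ , the first $k-1$ runs all fail with probability at most $(1/2)^{k-1}$ , so:

```latex
\mathbb{E}[N] \;=\; \sum_{k\ge 1}\Pr[N\ge k] \;\le\; \sum_{k\ge 1}\left(\tfrac{1}{2}\right)^{k-1} \;=\; 2 .
```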

3.2 Acceleration

for some $\beta\in(0,1)$ . That is, our regularization is centered around the previous solution plus a “momentum term” $\beta(w^{(t-1)}-w^{(t-2)})$ .
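A minimal sketch of how the center of the regularization could be formed from the two previous solutions. The function name is illustrative, and the value of $\beta$ is left as an input here because it is set by the parameters of the accelerated procedure.

```python
def momentum_center(w_prev, w_prev2, beta):
    """Return w^(t-1) + beta * (w^(t-1) - w^(t-2)), coordinate by coordinate."""
    return [a + beta * (a - b) for a, b in zip(w_prev, w_prev2)]

# Toy usage with beta = 0.5: the first coordinate moved by +1 in the last
# step, so the center is pushed a further 0.5 in the same direction.
center = momentum_center([1.0, 2.0], [0.0, 2.0], beta=0.5)
```

With `beta=0` the center reduces to the previous solution, i.e. no momentum is applied.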

A pseudo-code of the algorithm is given in Figure 2. Note that all the parameters of the algorithm are determined by our theory.

In the pseudo-code below, we specify the parameters based on our theoretical derivation. In our experiments, we found that this choice of parameters also works very well in practice. However, we also found that the algorithm is not very sensitive to the choice of parameters. For example, running $5n$ iterations of Prox-SDCA (that is, $5$ epochs over the data), without checking the stopping condition, also works very well.

Consider the accelerated Prox-SDCA algorithm given in Figure 2.

Correctness: When the algorithm terminates we have that $P(w^{(t)})-P(w^{*})\leq\epsilon$ .

The number of outer iterations is at most

Each outer iteration involves a single call to Prox-SDCA, and the averaged runtime required by each such call is

By a straightforward amplification argument we obtain that for every $\delta\in(0,1)$ the total runtime required by accelerated Prox-SDCA to guarantee an $\epsilon$ -accurate solution with probability of at least $1-\delta$ is

3.3 Non-smooth, Lipschitz, loss functions

So far we have assumed that for every $i$ , $\phi_{i}$ is a $(1/\gamma)$ -smooth function. We now consider the case in which $\phi_{i}$ might be non-smooth, and even non-differentiable, but it is $L$ -Lipschitz.

Following Nesterov , we apply a “smoothing” technique. We first observe that if $\phi$ is an $L$ -Lipschitz function then the domain of $\phi^{*}$ is in the ball of radius $L$ .

Fix some $\alpha$ with $\|\alpha\|_{D}>L$ . Let $x_{0}$ be a vector such that $\|x_{0}\|_{P}=1$ and $\alpha^{\top}x_{0}=\|\alpha\|_{D}$ (this is a vector that achieves the maximal objective in the definition of the dual norm). By definition of the conjugate we have

This observation allows us to smooth $L$ -Lipschitz functions by adding regularization to their conjugate. In particular, the following lemma generalizes Lemma 2.5 in .

It is also possible to smooth using different regularization functions which are strongly convex with respect to other norms. See Nesterov for discussion.

Proof of Theorem 3

The first claim of the theorem is that when the procedure stops we have $P(w^{(t)})-P(w^{*})\leq\epsilon$ . We therefore need to show that each stopping condition guarantees that $P(w^{(t)})-P(w^{*})\leq\epsilon$ .

For the second stopping condition, recall that $w^{(t)}$ is an $\epsilon_{t}$ -accurate minimizer of $P(w)+\frac{\kappa}{2}\|w-y^{(t-1)}\|^{2}$ , and hence by Lemma 3 below (with $z=w^{*}$ , $w^{+}=w^{(t)}$ , and $y=y^{(t-1)}$ ):

It is left to show that the first stopping condition is correct, namely, to show that after $1+\frac{2}{\eta}\log(\xi_{1}/\epsilon)$ iterations the algorithm must converge to an $\epsilon$ -accurate solution. Observe that the definition of $\xi_{t}$ yields that $\xi_{t}=(1-\eta/2)^{t-1}\,\xi_{1}\leq e^{-\eta(t-1)/2}\xi_{1}$ . Therefore, to prove that the first stopping condition is valid, it suffices to show that for every $t$ , $P(w^{(t)})-P(w^{*})\leq\xi_{t}$ .

Recall that at each outer iteration of the accelerated procedure, we approximately minimize an objective of the form

Of course, minimizing $P(w;y)$ is not the same as minimizing $P(w)$ . Our first lemma shows that for every $y$ , if $w^{+}$ is an $\epsilon$ -accurate minimizer of $P(w;y)$ then we can derive a lower bound on $P(w)$ based on $P(w^{+})$ and a convex quadratic function of $w$ .

Let $\mu=\lambda/2$ and $\rho=\mu+\kappa$ . Let $w^{+}$ be a vector such that $P(w^{+};y)\leq\min_{w}P(w;y)+\epsilon$ . Then, for every $z$ ,

By the $\mu$ -strong convexity of $\Psi$ we have that for every $z$ ,

In addition, by standard algebraic manipulations,

Finally, using the assumption $P(w^{+};y)\leq\min_{w}P(w;y)+\epsilon$ we conclude our proof. ∎

We saw that the quadratic function $P(w^{+})+Q_{\epsilon}(z;w^{+},y)$ lower bounds the function $P$ everywhere. Therefore, any convex combination of such functions would form a quadratic function which lower bounds $P$ . In particular, the algorithm (implicitly) maintains a sequence of quadratic functions, $h_{1},h_{2},\ldots$ , defined as follows. Choose $\eta\in(0,1)$ and a sequence $y^{(1)},y^{(2)},\ldots$ that will be specified later. Define,

The following simple lemma shows that for every $t\geq 1$ and $z$ , $h_{t}(z)$ lower bounds $P(z)$ .

Let $\eta\in(0,1)$ and let $y^{(1)},y^{(2)},\ldots$ be any sequence of vectors. Assume that $w^{(1)}=0$ and for every $t\geq 1$ , $w^{(t+1)}$ satisfies $P(w^{(t+1)};y^{(t)})\leq\min_{w}P(w;y^{(t)})+\epsilon_{t+1}$ . Then, for every $t\geq 1$ and every vector $z$ we have

The proof is by induction. For $t=1$ , observe that $P(0;0)=P(0)$ and that for every $w$ we have $P(w;0)\geq P(w)\geq D(0)$ . This yields $P(0;0)-\min_{w}P(w;0)\leq P(0)-D(0)$ . The claim now follows directly from Lemma 3. Next, for the inductive step, assume the claim holds for some $t-1\geq 1$ and let us prove it for $t$ . By the recursive definition of $h_{t}$ and by using Lemma 3 we have

Using the inductive assumption we obtain that the right-hand side of the above is upper bounded by $(1-\eta)P(z)+\eta P(z)=P(z)$ , which concludes our proof. ∎

The more difficult part of the proof is to show that for every $t\geq 1$ ,

If this holds true, then we would immediately get that for every $w^{*}$ ,

This will conclude the proof of the first part of Theorem 3, since $\xi_{t}=\xi_{1}(1-\eta/2)^{t-1}\leq\xi_{1}\,e^{-(t-1)\eta/2}$ , and therefore, $1+\frac{2}{\eta}\log(\xi_{1}/\epsilon)$ iterations suffice to guarantee that $P(w^{(t)})-P(w^{*})\leq\epsilon$ .
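The iteration count can be checked numerically. The sketch below computes the smallest integer $t\geq 1+\frac{2}{\eta}\log(\xi_{1}/\epsilon)$ and evaluates $\xi_{t}=(1-\eta/2)^{t-1}\xi_{1}$ at that $t$ ; the function names are illustrative.

```python
import math

def outer_iterations(eta, xi1, eps):
    """Smallest integer t with t >= 1 + (2/eta) * log(xi1/eps), at least 1."""
    return max(1, math.ceil(1.0 + (2.0 / eta) * math.log(xi1 / eps)))

def xi(t, eta, xi1):
    """xi_t = (1 - eta/2)^(t-1) * xi_1."""
    return (1.0 - eta / 2.0) ** (t - 1) * xi1

# Since 1 - x <= exp(-x), xi_t at the returned t is at most eps.
t_needed = outer_iterations(eta=0.1, xi1=10.0, eps=1e-6)
xi_final = xi(t_needed, eta=0.1, xi1=10.0)
```

The count grows like $1/\eta$ and only logarithmically in $1/\epsilon$ .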

Let us construct an explicit formula for $v^{(t)}$ . Clearly, $v^{(1)}=0$ . Assume that we have calculated $v^{(t)}$ and let us calculate $v^{(t+1)}$ . Note that $h_{t}$ is a quadratic function which is minimized at $v^{(t)}$ . Furthermore, it is easy to see that for every $t$ , $h_{t}$ is a $\mu$ -strongly convex quadratic function. Therefore,

By the definition of $h_{t+1}$ we obtain that

Since the gradient of $h_{t+1}(z)$ at $v^{(t+1)}$ should be zero, we obtain that $v^{(t+1)}$ should satisfy

Getting back to our second phase of the proof, we need to show that for every $t$ we have $P(w^{(t)})\leq h_{t}(v^{(t)})+\xi_{t}$ . We do so by induction. For the case $t=1$ we have

For the induction step, assume the claim holds for $t\geq 1$ and let us prove it for $t+1$ . We use the shorthands,

By the inductive assumption we have $h_{t}(v^{(t)})\geq P(w^{(t)})-\xi_{t}$ and by Lemma 3 we have $P(w^{(t)})\geq\psi_{t+1}(w^{(t)})$ . Therefore,

So far we did not specify $\eta$ and $y^{(t)}$ (except $y^{(0)}=0$ ). We next set

We also observe that $\epsilon_{t+1}\leq\frac{\eta\xi_{t}}{2(1+\eta^{-2})}$ which implies that $(1+\rho/\mu)\epsilon_{t+1}+(1-\eta)\xi_{t}\leq(1-\eta/2)\xi_{t}=\xi_{t+1}$ . Combining the above with (6) and (7), and rearranging terms, we obtain that

Next, observe that $\rho\eta^{2}=\mu$ and that by (5) we have

The right-hand side of the above is non-negative because of the convexity of the function $f(z)=\frac{\mu}{2}\|z-v^{(t+1)}\|^{2}$ , which yields

We next show that each call to Prox-SDCA will terminate quickly. By the definition of $\kappa$ we have that

Therefore, based on Corollary 1 we know that the averaged runtime at iteration $t$ is

The following lemma bounds the initial dual sub-optimality at iteration $t\geq 4$ . Similar arguments will yield a similar result for $t<4$ .

Combining the above with (8), we obtain that

Next, we bound $\|y^{(t-1)}-y^{(t-2)}\|^{2}$ . We have

where we used the triangle inequality and $\beta<1$ . By strong convexity of $P$ we have, for every $i$ ,

Getting back to the proof of the second claim of Theorem 3, we have obtained that

where in the last inequality we used $\eta^{-2}-1=\frac{2\kappa}{\lambda}$ , which implies that $\frac{2\kappa}{\lambda}(1+\eta^{-2})\leq\eta^{-4}$ . Using $1<\eta^{-5}$ , $1-\eta/2\geq 0.5$ , and taking the logarithm of both sides, we get that

which concludes the proof of the second claim of Theorem 3.
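For completeness, the two algebraic facts used in the last step follow directly from $\mu=\lambda/2$ , $\rho=\mu+\kappa$ and $\rho\eta^{2}=\mu$ :

```latex
\eta^{-2} \;=\; \frac{\rho}{\mu} \;=\; 1+\frac{\kappa}{\mu} \;=\; 1+\frac{2\kappa}{\lambda},
\qquad
\frac{2\kappa}{\lambda}\bigl(1+\eta^{-2}\bigr) \;=\; \bigl(\eta^{-2}-1\bigr)\bigl(\eta^{-2}+1\bigr) \;=\; \eta^{-4}-1 \;\le\; \eta^{-4}.
```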

Applications

In this section we specialize our algorithmic framework to several popular machine learning applications. In Section 5.1 we start by describing several loss functions and deriving their conjugates. In Section 5.2 we describe several regularization functions. Finally, in the rest of the subsections we specify our algorithm for Ridge regression, SVM, Lasso, logistic regression, and multiclass prediction.

Logistic loss:

$\phi(a)=\log(1+e^{a})$ . The derivative is $\phi^{\prime}(a)=1/(1+e^{-a})$ and the second derivative is $\phi^{\prime\prime}(a)=\frac{1}{(1+e^{-a})(1+e^{a})}\in[0,1/4]$ , from which it follows that $\phi$ is $(1/4)$ -smooth. The conjugate function is $\phi^{*}(b)=b\log(b)+(1-b)\log(1-b)$ for $b\in[0,1]$ , and $\infty$ otherwise.
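The properties of the logistic loss can be checked numerically. The sketch below uses the conjugate $\phi^{*}(b)=b\log(b)+(1-b)\log(1-b)$ on $[0,1]$ , as defined in the logistic-regression procedure of Section 5, and verifies the Fenchel-Young equality $\phi(a)+\phi^{*}(b)=ab$ at $b=\phi^{\prime}(a)$ . Function names are illustrative.

```python
import math

def phi(a):
    # log(1 + e^a), written so that exp never overflows
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def dphi(a):
    # 1 / (1 + e^{-a}), numerically stable on both sides of zero
    if a >= 0.0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def d2phi(a):
    # 1 / ((1 + e^{-a})(1 + e^{a})) = s * (1 - s) with s = dphi(a)
    s = dphi(a)
    return s * (1.0 - s)

def phi_conj(b):
    # b*log(b) + (1-b)*log(1-b) on [0, 1] (with 0*log(0) = 0), infinity outside
    if b < 0.0 or b > 1.0:
        return math.inf
    def xlogx(x):
        return 0.0 if x == 0.0 else x * math.log(x)
    return xlogx(b) + xlogx(1.0 - b)

# Fenchel-Young residual a*b - phi(a) - phi*(b) at b = phi'(a); should vanish.
residuals = [a * dphi(a) - phi(a) - phi_conj(dphi(a))
             for a in (-3.0, -1.0, 0.0, 0.5, 2.0)]
```

The second derivative attains its maximum $1/4$ at $a=0$ , which is where the smoothness constant comes from.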

Hinge loss:

$\phi(a)=[1-a]_{+}:=\max\{0,1-a\}$ . The conjugate function is $\phi^{*}(b)=b$ for $b\in[-1,0]$ , and $\infty$ otherwise.

Smooth hinge loss:

This loss is obtained by smoothing the hinge-loss using the technique described in Lemma 2. This loss is parameterized by a scalar $\gamma>0$ and is defined as:

Max-of-hinge:

To calculate the conjugate of $\phi$ , let

Each inner maximization over $a_{j}$ would be $\infty$ unless $\beta_{j}=b_{j}$ . Therefore,

Smooth max-of-hinge

This loss is obtained by smoothing the max-of-hinge loss using the technique described in Lemma 2. This loss is parameterized by a scalar $\gamma>0$ . We start by adding regularization to the conjugate of the max-of-hinge given in (11) and obtain

Taking the conjugate of the conjugate we obtain

Soft-max-of-hinge loss function:

Another approach to smooth the max-of-hinge loss function is by using soft-max instead of max. The resulting soft-max-of-hinge loss function is defined as

The $j$ ’th element of the gradient of $\phi$ is

By the definition of the conjugate we have $\phi_{\gamma}^{*}(b)=\max_{a}a^{\top}b-\phi_{\gamma}(a)$ . The vector $a$ that maximizes the above must satisfy

This can be satisfied only if $b_{j}\geq 0$ for all $j$ and $\sum_{j}b_{j}\leq 1$ . That is, $b\in S$ . Denote $Z=\sum_{i=1}^{k}e^{(c_{i}+a_{i})/\gamma}$ and note that

Finally, if $b\notin S$ then the gradient of $a^{\top}b-\phi_{\gamma}(a)$ does not vanish anywhere, which means that $\phi_{\gamma}^{*}(b)=\infty$ . All in all, we obtain

Since the entropic function, $\sum_{j}b_{j}\log(b_{j})$ is $1$ -strongly convex over $S$ with respect to the $L_{1}$ norm, we obtain that $\phi^{*}_{\gamma}$ is $\gamma$ -strongly convex with respect to the $L_{1}$ norm, from which it follows that $\phi_{\gamma}$ is $(1/\gamma)$ -smooth with respect to the $L_{\infty}$ norm.
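A minimal numerical sketch of the soft-max-of-hinge loss and its gradient. It assumes the form $\phi_{\gamma}(a)=\gamma\log\sum_{j}e^{(c_{j}+a_{j})/\gamma}$ , inferred from the normalizer $Z$ used above since the displayed definition is not reproduced in this text; the function names are illustrative. The gradient is a probability vector, which is consistent with the conjugate being finite only on $S$ .

```python
import math

def softmax_of_hinge(a, c, gamma):
    # gamma * log(sum_j exp((c_j + a_j)/gamma)), using the max-shift trick
    s = [(cj + aj) / gamma for aj, cj in zip(a, c)]
    m = max(s)
    return gamma * (m + math.log(sum(math.exp(x - m) for x in s)))

def softmax_of_hinge_grad(a, c, gamma):
    # j-th entry: exp((c_j + a_j)/gamma) / Z
    s = [(cj + aj) / gamma for aj, cj in zip(a, c)]
    m = max(s)
    e = [math.exp(x - m) for x in s]
    Z = sum(e)
    return [x / Z for x in e]

a_demo = [0.2, -1.0, 0.5]
c_demo = [0.0, 1.0, 1.0]
value = softmax_of_hinge(a_demo, c_demo, gamma=0.1)
grad = softmax_of_hinge_grad(a_demo, c_demo, gamma=0.1)
hard_max = max(cj + aj for aj, cj in zip(a_demo, c_demo))
```

Under this assumed form, the smoothed value lies between $\max_{j}(c_{j}+a_{j})$ and $\max_{j}(c_{j}+a_{j})+\gamma\log k$ , so a small $\gamma$ gives a close approximation of the max.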

5.2 Regularizers

The simplest regularization is the squared $L_{2}$ regularization

This is a $1$ -strongly convex regularization function whose conjugate is

For our acceleration procedure, we also use the $L_{2}$ regularization plus a linear term, namely,

for some vector $z$ . The conjugate of this function is

Another popular regularization we consider is the $L_{1}$ regularization,

This is not a strongly convex regularizer and therefore we will add a slight $L_{2}$ regularization to it and define the $L_{1}$ - $L_{2}$ regularization as

where $\sigma^{\prime}=\frac{\sigma}{\lambda}$ for some small $\lambda$ . Note that

so if $\lambda$ is small enough (as will be formalized later) we obtain that $\lambda g(w)\approx\sigma\|w\|_{1}$ .
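For this regularizer, $\nabla g^{*}(v)$ has a simple closed form. The sketch below assumes $g(w)=\frac{1}{2}\|w\|_{2}^{2}+\sigma^{\prime}\|w\|_{1}$ (the displayed definition is not reproduced here, so this form is an assumption consistent with $\lambda g(w)\approx\sigma\|w\|_{1}$ ). Maximizing $w^{\top}v-g(w)$ separates over coordinates and yields soft-thresholding. The function names are illustrative.

```python
def soft_threshold(vi, tau):
    # argmax over w of  w*vi - 0.5*w^2 - tau*|w|
    if vi > tau:
        return vi - tau
    if vi < -tau:
        return vi + tau
    return 0.0

def grad_g_conj(v, sigma_prime):
    """Coordinate-wise soft-thresholding of v at level sigma_prime."""
    return [soft_threshold(vi, sigma_prime) for vi in v]

w_demo = grad_g_conj([1.5, -0.2, -2.0], 0.5)
```

Coordinates with $|v_{i}|\leq\sigma^{\prime}$ are set exactly to zero, which is how sparse primal iterates arise.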

The maximizer is also $\nabla g^{*}(v)$ and we now show how to calculate it. We have

where the right-hand side is the $i$ ’th component of the objective value we will obtain by setting $w_{i}=0$ . This leads to the conclusion that

Another regularization function we’ll use in the accelerated procedure is

5.3 Ridge Regression

Below we specify Prox-SDCA for ridge regression. We use Option I since it is possible to derive a closed form solution to the maximization of the dual with respect to $\Delta\alpha_{i}$ . Indeed, since $-\phi_{i}^{*}(-b)=-\frac{1}{2}b^{2}+y_{i}b$ we have that the maximization problem is

Applying the above update and using some additional tricks to improve the running time we obtain the following procedure.
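A minimal sketch of the closed-form coordinate update for ridge regression, assuming the plain regularizer $g(w)=\frac{1}{2}\|w\|^{2}$ (so $w=v$ and no shift vector is used) and the dual form $D(\alpha)=\frac{1}{n}\sum_{i}(-\frac{1}{2}\alpha_{i}^{2}+y_{i}\alpha_{i})-\frac{\lambda}{2}\|v\|^{2}$ . Setting the derivative with respect to $\Delta\alpha_{i}$ to zero gives $\Delta\alpha_{i}=\frac{y_{i}-x_{i}^{\top}w-\alpha_{i}}{1+\|x_{i}\|^{2}/(\lambda n)}$ . The function name and the fixed iteration budget are illustrative choices, not the paper's procedure.

```python
import random

def sdca_ridge(X, y, lam, n_iters, seed=0):
    """Randomized dual coordinate ascent for ridge regression (sketch)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    alpha = [0.0] * n
    w = [0.0] * d                      # w = v = (1/(lam*n)) * sum_i x_i alpha_i
    sq_norms = [sum(t * t for t in x) for x in X]
    for _ in range(n_iters):
        i = rng.randrange(n)           # pick a dual coordinate uniformly
        x = X[i]
        pred = sum(a * b for a, b in zip(x, w))
        # exact maximizer of the dual over the i-th coordinate
        delta = (y[i] - pred - alpha[i]) / (1.0 + sq_norms[i] / (lam * n))
        alpha[i] += delta
        scale = delta / (lam * n)
        for j in range(d):             # keep w in sync with alpha
            w[j] += scale * x[j]
    return w, alpha

# One-dimensional check against the closed-form ridge solution
# w* = (sum_i x_i y_i / n) / (sum_i x_i^2 / n + lam) = 1.5 / 2.5 = 0.6.
w_hat, alpha_hat = sdca_ridge([[1.0], [2.0], [-1.0]], [1.0, 1.5, -0.5],
                              lam=0.5, n_iters=2000)
```

Maintaining `w` incrementally keeps each iteration at $O(d)$ cost instead of recomputing the sum over all examples.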

The runtime of Prox-SDCA for ridge regression becomes

where $R=\max_{i}\|x_{i}\|$ . This matches the recent results of . If $R^{2}/\lambda\gg n$ we can apply the accelerated procedure and obtain the improved runtime

5.4 Logistic Regression

and the dual constraints are $\alpha\in[-1,0]^{n}$ .

Below we specify Prox-SDCA for logistic regression using Option III.

Prox-SDCA( $(x_{i})_{i=1}^{n},\epsilon,\alpha^{(0)},z$ ) for logistic regression

Goal: Minimize $P(w)=\frac{1}{n}\sum_{i=1}^{n}\log(1+e^{x_{i}^{\top}w})+\lambda\left(\frac{1}{2}\|w\|^{2}-w^{\top}z\right)$

Initialize: $v^{(0)}=\frac{1}{\lambda n}\sum_{i=1}^{n}\alpha_{i}^{(0)}x_{i}$ , and $\forall i,\ p_{i}=x_{i}^{\top}z$

Define: $\phi^{*}(b)=b\log(b)+(1-b)\log(1-b)$

Iterate: for $t=1,2,\dots$
    Randomly pick $i$
    $p=x_{i}^{\top}w^{(t-1)}$
    $q=-1/(1+e^{-p})-\alpha_{i}^{(t-1)}$
    $s=\min\left(1,\frac{\log(1+e^{p})+\phi^{*}(-\alpha_{i}^{(t-1)})+p\alpha^{(t-1)}_{i}+2q^{2}}{q^{2}(4+\frac{1}{\lambda n}\|x_{i}\|^{2})}\right)$
    $\Delta\alpha_{i}=sq$
    $\alpha^{(t)}_{i}=\alpha^{(t-1)}_{i}+\Delta\alpha_{i}$ and for $j\neq i$ , $\alpha^{(t)}_{j}=\alpha^{(t-1)}_{j}$
    $v^{(t)}=v^{(t-1)}+\frac{\Delta\alpha_{i}}{\lambda n}x_{i}$

Stopping condition: let $w^{(t)}=v^{(t)}+z$
    Stop if $\frac{1}{n}\sum_{i=1}^{n}\left(\log(1+e^{x_{i}^{\top}w^{(t)}})+\phi^{*}(-\alpha_{i}^{(t-1)})\right)+\lambda w^{(t)\top}v^{(t)}\leq\epsilon$

The runtime analysis is similar to the analysis for ridge regression.
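The logistic-regression procedure above can be turned into a small runnable sketch. This is an illustrative translation under stated assumptions, not a tuned implementation: the duality gap is evaluated with the current $\alpha$ once every $n$ iterations (the pseudo-code does not fix how often to check), updates with $q=0$ are skipped to avoid dividing by zero, and `max_iters` is a hypothetical safety cap.

```python
import math
import random

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _phi(p):                       # log(1 + e^p), overflow-safe
    return max(p, 0.0) + math.log1p(math.exp(-abs(p)))

def _sigmoid(p):                   # 1 / (1 + e^{-p}), overflow-safe
    if p >= 0.0:
        return 1.0 / (1.0 + math.exp(-p))
    e = math.exp(p)
    return e / (1.0 + e)

def _phi_conj(b):                  # b log b + (1-b) log(1-b), with 0 log 0 = 0
    def xlogx(x):
        return 0.0 if x <= 0.0 else x * math.log(x)
    return xlogx(b) + xlogx(1.0 - b)

def prox_sdca_logistic(X, lam, eps, z=None, max_iters=100000, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    z = [0.0] * d if z is None else list(z)
    alpha, v = [0.0] * n, [0.0] * d
    sq = [_dot(x, x) for x in X]

    def gap():
        w = [vj + zj for vj, zj in zip(v, z)]
        s = sum(_phi(_dot(x, w)) + _phi_conj(-a) for x, a in zip(X, alpha))
        return s / n + lam * _dot(w, v)

    for t in range(1, max_iters + 1):
        i = rng.randrange(n)
        x = X[i]
        p = sum(xj * (vj + zj) for xj, vj, zj in zip(x, v, z))  # x_i^T w
        q = -_sigmoid(p) - alpha[i]
        den = q * q * (4.0 + sq[i] / (lam * n))
        if den > 0.0:
            num = _phi(p) + _phi_conj(-alpha[i]) + p * alpha[i] + 2.0 * q * q
            delta = min(1.0, num / den) * q
            alpha[i] += delta
            for j in range(d):
                v[j] += delta / (lam * n) * x[j]
        if t % n == 0 and gap() <= eps:
            break
    return [vj + zj for vj, zj in zip(v, z)], alpha, gap()

X_demo = [[1.0, 0.5], [-1.0, 1.0], [0.3, -2.0], [2.0, 1.0]]
w_demo, alpha_demo, gap_demo = prox_sdca_logistic(X_demo, lam=0.1, eps=1e-6)
```

Each update moves $\alpha_{i}$ towards $-\phi^{\prime}(x_{i}^{\top}w)\in[-1,0]$ with a step $s\in[0,1]$ , so the dual variables stay inside the constraint set without an explicit projection.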

5.5 Lasso

In the Lasso problem, the loss function is the squared loss but the regularization function is $L_{1}$ . That is, we need to solve the problem:

Let $\bar{y}=\frac{1}{2n}\sum_{i=1}^{n}y_{i}^{2}$ , and let $\bar{w}$ be an optimal solution of (18). Then, the objective at $\bar{w}$ is at most the objective at $w=0$ , which yields

for some $\lambda>0$ . This problem fits into our framework, since now the regularizer is strongly convex. Furthermore, if $w^{*}$ is an $(\epsilon/2)$ -accurate solution to the problem in (19), then $P(w^{*})\leq P(\bar{w})+\epsilon/2$ which yields

Since $\|\bar{w}\|_{2}^{2}\leq\left({\bar{y}}/{\sigma}\right)^{2}$ , we obtain that setting $\lambda=\epsilon(\sigma/\bar{y})^{2}$ guarantees that $w^{*}$ is an $\epsilon$ accurate solution to the original problem given in (18).
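The choice of $\lambda$ can be verified in one line. Assuming the $L_{2}$ term added in (19) is $\frac{\lambda}{2}\|w\|_{2}^{2}$ (consistent with the $L_{1}$ - $L_{2}$ regularizer of Section 5.2), the extra penalty paid at $\bar{w}$ is

```latex
\frac{\lambda}{2}\,\|\bar{w}\|_{2}^{2}
\;\le\; \frac{\lambda}{2}\left(\frac{\bar{y}}{\sigma}\right)^{2}
\;=\; \frac{\epsilon}{2}\left(\frac{\sigma}{\bar{y}}\right)^{2}\left(\frac{\bar{y}}{\sigma}\right)^{2}
\;=\; \frac{\epsilon}{2},
```

which, added to the $\epsilon/2$ optimization error on (19), gives a total error of at most $\epsilon$ on the original problem.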

In light of the above, from now on we focus on the problem given in (19). As in the case of ridge regression, we can apply Prox-SDCA with Option I. Applying the resulting update, together with some additional tricks to improve the running time, we obtain the following procedure.

Let us now discuss the runtime of the resulting method. Denote $R=\max_{i}\|x_{i}\|$ and for simplicity, assume that $\bar{y}=O(1)$ . Choosing $\lambda=\epsilon(\sigma/\bar{y})^{2}$ , the runtime of our method becomes

It is also convenient to write the bound in terms of $B=\|\bar{w}\|_{2}$ , where, as before, $\bar{w}$ is the optimal solution of the $L_{1}$ regularized problem. With this parameterization, we can set $\lambda=\epsilon/B^{2}$ and the runtime becomes

The runtime of standard SGD is $O(dR^{2}B^{2}/\epsilon^{2})$ even in the case of smooth loss functions such as the squared loss. Several variants of SGD that lead to sparser intermediate solutions have been proposed (e.g. ). However, all of these variants share the runtime of $O(dR^{2}B^{2}/\epsilon^{2})$ , which is much slower than our runtime when $\epsilon$ is small.

Another relevant approach is the FISTA algorithm of . The shrinkage operator of FISTA is the same as the gradient of $g^{*}$ used in our approach. It is a batch algorithm using Nesterov’s accelerated gradient technique. For the squared loss function, the runtime of FISTA is

This bound is worse than our bound by a factor of at least $\sqrt{n}$ .

Another approach to solving (18) is stochastic coordinate descent over the primal problem. showed that the runtime of this approach is

under the assumption that $\|x_{i}\|_{\infty}\leq 1$ for all $i$ . Similar results can also be found in .

For our method, the runtime depends on $R^{2}=\max_{i}\|x_{i}\|_{2}^{2}$ . If $R^{2}=O(1)$ then the runtime of our method is much better than that of . In the general case, if $\max_{i}\|x_{i}\|_{\infty}\leq 1$ then $R^{2}\leq d$ , which yields the runtime of

This is the same or better than whenever $d=O(n)$ .

5.6 Linear SVM

Support Vector Machines (SVM) is an algorithm for learning a linear classifier. Linear SVM (i.e., SVM with linear kernels) amounts to minimizing the objective

Our first step is to smooth the hinge-loss. Let $\gamma=\epsilon$ and consider the smooth hinge-loss as defined in (9). Recall that the smooth hinge-loss satisfies

For the smoothed hinge loss, the optimization problem given in Option I of Prox-SDCA has a closed form solution and we obtain the following procedure:

Denote $R=\max_{i}\|x_{i}\|$ . Then, the runtime of the resulting method is

In particular, choosing $\gamma=\epsilon$ we obtain a solution to the original SVM problem in runtime of

As mentioned before, this is better than SGD when $\frac{1}{\lambda\epsilon}\gg n$ .
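The smoothing step used in this subsection can be sketched numerically. The closed form below is an assumption: it is what one obtains by adding $\frac{\gamma}{2}b^{2}$ to the hinge conjugate on $[-1,0]$ and conjugating back, $\phi_{\gamma}(a)=\max_{b\in[-1,0]}\left(ab-b-\frac{\gamma}{2}b^{2}\right)$ , since the displayed definition (9) is not reproduced in this text. The function names are illustrative.

```python
def hinge(a):
    return max(0.0, 1.0 - a)

def smooth_hinge(a, gamma):
    # max over b in [-1, 0] of  a*b - b - (gamma/2)*b^2, solved in closed form
    if a >= 1.0:
        return 0.0                         # maximizer b = 0
    if a <= 1.0 - gamma:
        return 1.0 - a - gamma / 2.0       # maximizer b = -1
    return (1.0 - a) ** 2 / (2.0 * gamma)  # interior maximizer b = (a - 1)/gamma

gamma_demo = 0.1
grid = [-2.0 + 0.05 * k for k in range(81)]   # a in [-2, 2]
max_dev = max(hinge(a) - smooth_hinge(a, gamma_demo) for a in grid)
```

The smoothed loss never exceeds the hinge loss and falls below it by at most $\gamma/2$ , which is why taking $\gamma=\epsilon$ costs at most $\epsilon/2$ in accuracy.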

5.7 Multiclass SVM

This can be written as $\phi((W^{\top}x_{i})-(W^{\top}x_{i})_{y_{i}})$ where

with $c_{i}$ being the all ones vector except in the $y_{i}$ coordinate.

Therefore, the optimization problem of multiclass SVM becomes:

As in the case of SVM, we will use the smooth version of the max-of-hinge loss function as described in (13). If we set the smoothness parameter $\gamma$ to be $\epsilon$ then an $(\epsilon/2)$ -accurate solution to the problem with the smooth loss is also an $\epsilon$ -accurate solution to the original problem with the non-smooth loss. Therefore, from now on we focus on the problem with the smooth max-of-hinge loss.

We specify Prox-SDCA for multiclass SVM using Option I. We will show that the optimization problem in Option I can be calculated efficiently by sorting a $k$ dimensional vector. Such ideas were explored in for the non-smooth max-of-hinge loss.

Let $\hat{w}=w-\frac{1}{\lambda n}X_{i}\alpha^{(t-1)}_{i}$ . Then, the optimization problem over $\alpha_{i}$ can be written as

As shown before, if we organize $\hat{w}$ as a $d\times k$ matrix, denoted $\hat{W}$ , we have that $X_{i}^{\top}\hat{w}=\hat{W}^{\top}x_{i}-(\hat{W}^{\top}x_{i})_{y_{i}}$ . We also have that

It follows that an optimal solution to (20) must set $\alpha_{i,y_{i}}=0$ and we only need to optimize over the rest of the dual variables. This also yields,

This is equivalent to a problem of the form:

The equivalence is in the sense that if $(a,\beta)$ is a solution of (22) then we can set $\alpha_{i}=-a$ .

Assume for simplicity that $\mu$ is sorted in a non-increasing order and that all of its elements are non-negative (otherwise, it is easy to verify that we can zero the negative elements of $\mu$ and sort the non-negative, without affecting the solution). Let $\bar{\mu}$ be the cumulative sum of $\mu$ , that is, for every $j$ , let $\bar{\mu}_{j}=\sum_{r=1}^{j}\mu_{r}$ . For every $j$ , let $z_{j}=\bar{\mu}_{j}-j\mu_{j}$ . Since $\mu$ is sorted we have that

The first order condition for minimality w.r.t. $z$ is

If this value of $z$ is in $(z_{j},z_{j+1})$ , then it is the optimal $z$ and we’re done. Otherwise, the optimum should be either $z=0$ (which yields $\alpha=0$ as well) or $z=1$ .

$a=\textrm{OptimizeDual}(\mu,C)$

Solve the optimization problem given in (22)

Initialize: $\forall i,\ \hat{\mu}_{i}=\max\{0,\mu_{i}\}$ , and sort $\hat{\mu}$ s.t. $\hat{\mu}_{1}\geq\hat{\mu}_{2}\geq\ldots\geq\hat{\mu}_{k}$

Let: $\bar{\mu}$ be s.t. $\bar{\mu}_{j}=\sum_{i=1}^{j}\hat{\mu}_{i}$

Let: $z$ be s.t. $z_{j}=\min\{\bar{\mu}_{j}-j\hat{\mu}_{j},1\}$ and $z_{k+1}=1$

If: $\exists j$ s.t. $\frac{\bar{\mu}_{j}}{1+jC}\in[z_{j},z_{j+1}]$
    return $a$ s.t. $\forall i,\ a_{i}=\max\left\{0,\mu_{i}-\left(-\frac{\bar{\mu}_{j}}{1+jC}+\bar{\mu}_{j}\right)/j\right\}$

Else:
    Let $j$ be the minimal index s.t. $z_{j}=1$
    set $a$ s.t. $\forall i,\ a_{i}=\max\{0,\mu_{i}-(-z_{j}+\bar{\mu}_{j})/j\}$
    If: $\|a-\mu\|^{2}+C\leq\|\mu\|^{2}$ return $a$
    Else: return $(0,\ldots,0)$

The resulting pseudo-code for Prox-SDCA is given below. We specify the procedure while referring to $W$ as a matrix, because it is the more natural representation. For convenience of the code, we also maintain in $\alpha_{i,y_{i}}$ the value of $-\sum_{j\neq y_{i}}\alpha_{i,j}$ (instead of the optimal value of $0$ ).

Experiments

In this section we compare Prox-SDCA, its accelerated version Accelerated-Prox-SDCA, and the FISTA algorithm of , on $L_{1}-L_{2}$ regularized loss minimization problems.

The experiments were performed on three large datasets with very different feature counts and sparsity, which were kindly provided by Thorsten Joachims (the datasets were also used in ). The astro-ph dataset classifies abstracts of papers from the physics ArXiv according to whether they belong in the astro-physics section; CCAT is a classification task taken from the Reuters RCV1 collection; and cov1 is class 1 of the covertype dataset of Blackard, Jock & Dean. The following table provides details of the dataset characteristics.

In the experiments, we set $\sigma=10^{-5}$ and vary $\lambda$ in the range $\{10^{-6},10^{-7},10^{-8},10^{-9}\}$ .
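To make the roles of $\sigma$ and $\lambda$ concrete, here is a minimal sketch of evaluating an $L_{1}-L_{2}$ regularized primal objective. It assumes the form $P(w)=\frac{1}{n}\sum_{i}\phi_{i}(x_{i}^{\top}w)+\frac{\lambda}{2}\|w\|_{2}^{2}+\sigma\|w\|_{1}$; the function name, the `loss(prediction, label)` callable, and the list-based data layout are hypothetical and not taken from the experiments.

```python
def l1_l2_objective(w, X, y, loss, lam, sigma):
    """Average loss plus (lam/2)*||w||_2^2 plus sigma*||w||_1 (assumed form)."""
    n = len(X)
    avg_loss = sum(
        loss(sum(xj * wj for xj, wj in zip(x, w)), yi) for x, yi in zip(X, y)
    ) / n
    l2_term = 0.5 * lam * sum(wj * wj for wj in w)
    l1_term = sigma * sum(abs(wj) for wj in w)
    return avg_loss + l2_term + l1_term


def squared_loss(prediction, label):
    # Illustrative loss only; any phi_i could be plugged in here.
    return 0.5 * (prediction - label) ** 2
```

With such a helper, the grid above corresponds to calling `l1_l2_objective` with `sigma=1e-5` and `lam` in `[1e-6, 1e-7, 1e-8, 1e-9]`.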

The convergence behaviors are plotted in Figure 6. In all the plots we depict the primal objective as a function of the number of passes over the data (often referred to as “epochs”). For FISTA, each iteration involves a single pass over the data. For Prox-SDCA, each $n$ iterations are equivalent to a single pass over the data. And, for Accelerated-Prox-SDCA, each $n$ inner iterations are equivalent to a single pass over the data. For Prox-SDCA and Accelerated-Prox-SDCA we implemented their corresponding stopping conditions and terminated the methods once an accuracy of $10^{-3}$ was guaranteed.

It is clear from the graphs that Accelerated-Prox-SDCA yields the best results, and often significantly outperforms the other methods. Prox-SDCA behaves similarly when $\lambda$ is relatively large, but it converges much more slowly when $\lambda$ is small. This is consistent with our theory. Finally, the relative performance of FISTA and Prox-SDCA depends on the ratio between $\lambda$ and $n$ , but in all cases, Accelerated-Prox-SDCA is much faster than FISTA. This is again consistent with our theory.

We have described and analyzed a proximal stochastic dual coordinate ascent method and have shown how to accelerate the procedure. The overall runtime of the resulting method improves state-of-the-art results in many cases of interest.

There are two main open problems that we leave to future research.

Our Prox-SDCA procedure and its analysis work for regularizers which are strongly convex with respect to an arbitrary norm. However, our acceleration procedure is designed for regularizers which are strongly convex with respect to the Euclidean norm. Is it possible to extend the acceleration procedure to more general regularizers?

Acknowledgements

The authors would like to thank Fen Xia for careful proof-reading of the paper which helped us to correct numerous typos. Shai Shalev-Shwartz is supported by the following grants: Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI) and ISF 598-10. Tong Zhang is supported by the following grants: NSF IIS-1016061, NSF DMS-1007527, and NSF IIS-1250985.

Appendix A Proofs of Iteration Bounds for Prox-SDCA

The proof technique follows that of Shalev-Shwartz and Zhang, but with the required generality for handling general strongly convex regularizers and smoothness/Lipschitzness with respect to general norms.

We prove the theorems for running Prox-SDCA while choosing $\Delta\alpha_{i}$ as in Option I. A careful examination of the proof reveals that the results hold for the other options as well. More specifically, Lemma 6 only requires choosing $\Delta\alpha_{i}=s(u_{i}^{(t-1)}-\alpha_{i}^{(t-1)})$ as in (23), and Option III chooses $s$ to optimize the bound on the right hand side of (25), and hence ensures that the choice can do no worse than the result of Lemma 6 with any $s$ . The simplifications in Options IV and V employ the specific simplification of the bound in Lemma 6 that is used in the proof of the theorems.
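As an illustration of the update $\Delta\alpha_{i}=s(u_{i}-\alpha_{i})$, the following minimal sketch performs one such step under simplifying assumptions that are not part of the proof: the squared loss $\phi_{i}(a)=\frac{1}{2}(a-y_{i})^{2}$, the regularizer $g(w)=\frac{1}{2}\|w\|_{2}^{2}$ (so that $w=\nabla g^{*}(v)=v$), and the relation $v=\frac{1}{\lambda n}\sum_{i}X_{i}\alpha_{i}$. The function name and the list-based data layout are hypothetical, and $s\in[0,1]$ is left as a caller-supplied value.

```python
def sdca_step_squared_loss(X, y, alpha, v, i, lam, s):
    """One step Delta alpha_i = s * (u_i - alpha_i), updating alpha and v in place.

    Assumes phi_i(a) = 0.5 * (a - y_i)^2 and g(w) = 0.5 * ||w||^2, so w = v.
    X is a list of feature rows, alpha a list of scalars, v a list of floats.
    """
    n = len(X)
    w = v                                          # w = grad g*(v) = v for this g
    prediction = sum(xj * wj for xj, wj in zip(X[i], w))
    u_i = -(prediction - y[i])                     # -u_i = phi_i'(x_i^T w)
    delta = s * (u_i - alpha[i])
    alpha[i] += delta
    scale = delta / (lam * n)
    for j, xj in enumerate(X[i]):                  # keep v = (1/(lam n)) sum_i x_i alpha_i
        v[j] += scale * xj
    return delta
```

Calling this repeatedly with a uniformly random index `i` mimics the sampling used in the analysis; the step keeps `v` consistent with `alpha`, so the primal iterate can be read off at any time.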

and $-u^{(t-1)}_{i}=\nabla\phi_{i}(X_{i}^{\top}w^{(t-1)})$ .

Since only the $i$ ’th element of $\alpha$ is updated, the improvement in the dual objective can be written as

The smoothness of $g^{*}$ implies that $g^{*}(v+\Delta v)\leq h(v;\Delta v)$ , where $h(v;\Delta v):=g^{*}(v)+\nabla g^{*}(v)^{\top}\Delta v+\frac{1}{2}\|\Delta v\|_{D^{\prime}}^{2}$ . Therefore,

By the definition of the update we have for all $s\in[0,1]$ that

Combining this with (23) and rearranging terms we obtain that

Since $-u=\nabla\phi(X^{\top}w)$ we have $\phi^{*}(-u)+w^{\top}Xu=-\phi(X^{\top}w)$ , which yields

Next note that with $w=\nabla g^{*}(v)$ , we have $g(w)+g^{*}(v)=w^{\top}v$ . Therefore:

Therefore, if we take expectation of (25) w.r.t. the choice of $i$ we obtain that

Multiplying both sides by $s/n$ concludes the proof of the lemma. ∎

Equipped with the above lemmas we are ready to prove Theorem 1 and Theorem 2.

The assumption that $\phi_{i}$ is $(1/\gamma)$ -smooth implies that $\phi_{i}^{*}$ is $\gamma$ -strongly-convex. We will apply Lemma 6 with

Recall that $\|X_{i}\|_{D\to D^{\prime}}\leq R$ . Therefore, the choice of $s$ implies that

and hence $G^{(t)}\leq 0$ for all $t$ . This yields,

Taking expectation of both sides with respect to the randomness at previous rounds, and using the law of total expectation, we obtain that

But since $\epsilon_{D}^{(t-1)}:=D(\alpha^{*})-D(\alpha^{(t-1)})\leq P(w^{(t-1)})-D(\alpha^{(t-1)})$ and $D(\alpha^{(t)})-D(\alpha^{(t-1)})=\epsilon_{D}^{(t-1)}-\epsilon_{D}^{(t)}$ , we obtain that

Using again (28), we can also obtain that

which proves the first part of Theorem 1.

Next, we sum the first inequality of (29) over $t=T_{0}+1,\ldots,T$ to obtain

Now, if we choose $\bar{w},\bar{\alpha}$ to be either the average vectors or a randomly chosen vector over $t\in\{T_{0}+1,\ldots,T\}$ , then the above implies

In particular, the choice of $T-T_{0}=\frac{n}{s}$ and $T_{0}=\frac{n}{s}\log(\epsilon_{D}^{(0)}/\epsilon_{P})$ satisfies the above requirement. ∎
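The choice of $T_{0}$ and $T-T_{0}$ stated above can be turned into concrete iteration counts. The sketch below simply evaluates those two expressions and rounds up; the function name and argument names are hypothetical, and the value of $s$ must be supplied by the caller.

```python
import math


def prox_sdca_iteration_counts(n, s, eps_dual_init, eps_primal):
    """Return (T0, T - T0) using T0 = (n/s) * log(eps_D^(0) / eps_P), T - T0 = n/s."""
    burn_in = math.ceil((n / s) * math.log(eps_dual_init / eps_primal))
    averaging = math.ceil(n / s)
    # If the initial dual sub-optimality is already below eps_P, no burn-in is needed.
    return max(burn_in, 0), averaging
```

Dividing the total $T_{0}+(T-T_{0})$ by $n$ gives the corresponding number of passes over the data.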

Next, for the duality gap, using (27) we have that for every $t$ such that $\epsilon_{D}^{(t-1)}\leq\epsilon_{D}$ we have

This proves the second claim of Theorem 2.