Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization
Shai Shalev-Shwartz, Tong Zhang
Introduction
For example, in ridge regression the regularizer is , the instances are column vectors, and for every the ’th loss function is , for some scalar .
Let (we will later make assumptions that imply that is unique). We say that is -accurate if . Our main result is a new algorithm for solving (1). If is -strongly convex and each is -smooth (meaning that its gradient is -Lipschitz), then our algorithm finds, with probability of at least , an -accurate solution to (1) in time
By applying a smoothing technique to , we also derive a method that finds an -accurate solution to (1) assuming that each is -Lipschitz, and obtain the runtime
This applies, for example, to SVM with the hinge-loss. It significantly improves over the rate of SGD (e.g. ), when .
We can also apply our results to non-strongly convex regularizers (such as the norm regularizer), or to non-regularized problems, by adding a slight regularization. For example, for regularized problems, and assuming that each is -smooth, we obtain the runtime of
This applies, for example, to the Lasso problem, in which the goal is to minimize the squared loss plus an regularization term.
To put our results in context, in the table below we specify the runtime of various algorithms (while ignoring constants and logarithmic terms) for three key machine learning applications; SVM in which and , Lasso in which and , and Ridge Regression in which and . Additional applications, and a more detailed runtime comparison to previous work, are given in Section 5. In the table below, SGD stands for Stochastic Gradient Descent, and AGD stands for Accelerated Gradient Descent.
Additional related work:
As mentioned before, our first contribution is a proximal version of the stochastic dual coordinate ascent method and extension of the analysis given in Shalev-Shwartz and Zhang . Stochastic dual coordinate ascent has also been studied in Collins et al. but in more restricted settings than the general problem considered in this paper. One can also apply the analysis of stochastic coordinate descent methods given in Richtárik and Takáč on the dual problem. However, here we are interested in understanding the primal sub-optimality, hence an analysis which only applies to the dual problem is not sufficient.
The generality of our approach allows us to apply it for multiclass prediction problems. We discuss this in detail later on in Section 5. Recently, derived a stochastic coordinate ascent for structural SVM based on the Frank-Wolfe algorithm. Although with different motivations, for the special case of multiclass problems with the hinge-loss, their algorithm ends up to be the same as our proximal dual ascent algorithm (with the same rate). Our approach allows to accelerate the method and obtain an even faster rate.
The proof of our acceleration method adapts Nesterov’s estimation sequence technique, studied in Devolder et al. , Schmidt et al. , to allow approximate and stochastic proximal mapping. See also . In particular, it relies on similar ideas as in Proposition 4 of . However, our specific requirement is different, and the proof presented here is different and significantly simpler than that of .
There have been several attempts to accelerate stochastic optimization algorithms. See for example and the references therein. However, the runtime of these methods have a polynomial dependence on even if are smooth and is -strongly convex, as opposed to the logarithmic dependence on obtained here. As in , we avoid the polynomial dependence on by allowing more than a single pass over the data.
Preliminaries
Given a norm we denote the dual norm by where
We use or to denote the norm, . We also use and . The operator norm of a matrix with respect to norms is defined as
It is well known that is -strongly convex with respect to if and only if is -smooth with respect to the dual norm, .
We will assume that is strongly convex which implies that is continuous differentiable. If we define
then it is known that , where is an optimal solution of (2). It is also known that which immediately implies that for all and , we have , and hence the duality gap defined as
can be regarded as an upper bound on both the primal sub-optimality, , and on the dual sub-optimality, .
Main Results
In this section we describe our algorithms and their analysis. We start in Section 3.1 with a description of our proximal stochastic dual coordinate ascent procedure (Prox-SDCA). Then, in Section 3.2 we show how to accelerate the method by calling Prox-SDCA on a sequence of problems with a strong regularization. Throughout the first two sections we assume that the loss functions are smooth. Finally, we discuss the case of Lipschitz loss functions in Section 3.3.
The proofs of the main acceleration theorem (Theorem 3) is given in Section 4. The rest of the proofs are provided in the appendix.
We now describe our proximal stochastic dual coordinate ascent procedure for solving (1). Our results in this subsection holds for being a -strongly convex function with respect to some norm and every being a -smooth function with respect to some other norm . The corresponding dual norms are denoted by and respectively.
The dual objective in (2) has a different dual vector associated with each example in the training set. At each iteration of dual coordinate ascent we only allow to change the ’th column of , while the rest of the dual vectors are kept intact. We focus on a randomized version of dual coordinate ascent, in which at each round we choose which dual vector to update uniformly at random.
At step , let and let . We will update the -th dual variable , in a way that will lead to a sufficient increase of the dual objective. For the primal problem, this would lead to the update , and therefore can also be written as
Note that this particular update is rather similar to the update step of proximal-gradient dual-averaging method (see for example Xiao ). The difference is on how is updated.
The goal of dual ascent methods is to increase the dual objective as much as possible, and thus the optimal way to choose would be to maximize the dual objective, namely, we shall let
However, for a complex , this optimization problem may not be easy to solve. To simplify the optimization problem we can rely on the smoothness of (with respect to a norm ) and instead of directly maximizing the dual objective function, we try to maximize the following proximal objective which is a lower bound of the dual objective:
In general, this optimization problem is still not necessarily simple to solve because may also be complex. We will thus also propose alternative update rules for of the form for an appropriately chosen step size parameter . Our analysis shows that an appropriate choice of still leads to a sufficient increase in the dual objective.
It should be pointed out that we can always pick so that the dual objective is non-decreasing. In fact, if for a specific choice of , the dual objective decreases, we may simply set . Therefore throughout the proof we will assume that the dual objective is non-decreasing whenever needed.
The theorems below provide upper bounds on the number of iterations required by our prox-SDCA procedure.
Consider Procedure Prox-SDCA as given in Figure 1. Let be an optimal dual solution and let . For every such that
We next give bounds that hold with high probability.
Consider Procedure Prox-SDCA as given in Figure 1. Let be an optimal dual solution, let , and let .
we are guaranteed that with probability of at least it holds that .
we are guaranteed that with probability of at least it holds that .
and let . Suppose we choose values of uniformly at random from , and then choose the single value of from these values for which is minimal. Then, with probability of at least we have that .
The above theorem tells us that the runtime required to find an accurate solution, with probability of at least , is
The expected runtime required to minimize up to accuracy is
We have shown that with a runtime of we can find an accurate solution with probability of at least . Therefore, we can run the procedure for this amount of time and check if the duality gap is smaller than . If yes, we are done. Otherwise, we would restart the process. Since the probability of success is we have that the average number of restarts we need is , which concludes the proof. ∎
2 Acceleration
for some . That is, our regularization is centered around the previous solution plus a “momentum term” .
A pseudo-code of the algorithm is given in Figure 2. Note that all the parameters of the algorithm are determined by our theory.
In the pseudo-code below, we specify the parameters based on our theoretical derivation. In our experiments, we found out that this choice of parameters also work very well in practice. However, we also found out that the algorithm is not very sensitive to the choice of parameters. For example, we found out that running iterations of Prox-SDCA (that is, epochs over the data), without checking the stopping condition, also works very well.
Consider the accelerated Prox-SDCA algorithm given in Figure 2.
Correctness: When the algorithm terminates we have that .
The number of outer iterations is at most
Each outer iteration involves a single call to Prox-SDCA, and the averaged runtime required by each such call is
By a straightforward amplification argument we obtain that for every the total runtime required by accelerated Prox-SDCA to guarantee an -accurate solution with probability of at least is
3 Non-smooth, Lipschitz, loss functions
So far we have assumed that for every , is a -smooth function. We now consider the case in which might be non-smooth, and even non-differentiable, but it is -Lipschitz.
Following Nesterov , we apply a “smoothing” technique. We first observe that if is -Lipschitz function then the domain of is in the ball of radius .
Fix some with . Let be a vector such that and (this is a vector that achieves the maximal objective in the definition of the dual norm). By definition of the conjugate we have
This observation allows us to smooth -Lipschitz functions by adding regularization to their conjugate. In particular, the following lemma generalizes Lemma 2.5 in .
It is also possible to smooth using different regularization functions which are strongly convex with respect to other norms. See Nesterov for discussion.
Proof of Theorem 3
The first claim of the theorem is that when the procedure stops we have . We therefore need to show that each stopping condition guarantees that .
For the second stopping condition, recall that is an -accurate minimizer of , and hence by Lemma 3 below (with , , and ):
It is left to show that the first stopping condition is correct, namely, to show that after iterations the algorithm must converge to an -accurate solution. Observe that the definition of yields that . Therefore, to prove that the first stopping condition is valid, it suffices to show that for every , .
Recall that at each outer iteration of the accelerated procedure, we approximately minimize an objective of the form
Of course, minimizing is not the same as minimizing . Our first lemma shows that for every , if is an -accurate minimizer of then we can derive a lower bound on based on and a convex quadratic function of .
Let and . Let be a vector such that . Then, for every ,
By the -strong convexity of we have that for every ,
In addition, by standard algebraic manipulations,
Finally, using the assumption we conclude our proof. ∎
We saw that the quadratic function lower bounds the function everywhere. Therefore, any convex combination of such functions would form a quadratic function which lower bounds . In particular, the algorithm (implicitly) maintains a sequence of quadratic functions, , defined as follows. Choose and a sequence that will be specified later. Define,
The following simple lemma shows that for every and , lower bounds .
Let and let be any sequence of vectors. Assume that and for every , satisfies . Then, for every and every vector we have
The proof is by induction. For , observe that and that for every we have . This yields . The claim now follows directly from Lemma 3. Next, for the inductive step, assume the claim holds for some and let us prove it for . By the recursive definition of and by using Lemma 3 we have
Using the inductive assumption we obtain that the right-hand side of the above is upper bounded by , which concludes our proof. ∎
The more difficult part of the proof is to show that for every ,
If this holds true, then we would immediately get that for every ,
This will conclude the proof of the first part of Theorem 3, since , and therefore, iterations suffice to guarantee that .
Let us construct an explicit formula for . Clearly, . Assume that we have calculated and let us calculate . Note that is a quadratic function which is minimized at . Furthermore, it is easy to see that for every , is -strongly convex quadratic function. Therefore,
By the definition of we obtain that
Since the gradient of at should be zero, we obtain that should satisfy
Getting back to our second phase of the proof, we need to show that for every we have . We do so by induction. For the case we have
For the induction step, assume the claim holds for and let us prove it for . We use the shorthands,
By the inductive assumption we have and by Lemma 3 we have . Therefore,
So far we did not specify and (except ). We next set
We also observe that which implies that . Combining the above with (6) and (7), and rearranging terms, we obtain that
Next, observe that and that by (5) we have
The right-hand side of the above is non-negative because of the convexity of the function , which yields
We next show that each call to Prox-SDCA will terminate quickly. By the definition of we have that
Therefore, based on Corollary 1 we know that the averaged runtime at iteration is
The following lemma bounds the initial dual sub-optimality at iteration . Similar arguments will yield a similar result for .
Combining the above with (8), we obtain that
Next, we bound . We have
where we used the triangle inequality and . By strong convexity of we have, for every ,
Getting back to the proof of the second claim of Theorem 3, we have obtained that
where in the last inequality we used , which implies that . Using , , and taking log to both sides, we get that
which concludes the proof of the second claim of Theorem 3.
Applications
In this section we specify our algorithmic framework to several popular machine learning applications. In Section 5.1 we start by describing several loss functions and deriving their conjugate. In Section 5.2 we describe several regularization functions. Finally, in the rest of the subsections we specify our algorithm for Ridge regression, SVM, Lasso, logistic regression, and multiclass prediction.
Logistic loss:
. The derivative is and the second derivative is , from which it follows that is -smooth. The conjugate function is
Hinge loss:
. The conjugate function is
Smooth hinge loss:
This loss is obtained by smoothing the hinge-loss using the technique described in Lemma 2. This loss is parameterized by a scalar and is defined as:
Max-of-hinge:
To calculate the conjugate of , let
Each inner maximization over would be unless . Therefore,
Smooth max-of-hinge
This loss obtained by smoothing the max-of-hinge loss using the technique described in Lemma 2. This loss is parameterized by a scalar . We start by adding regularization to the conjugate of the max-of-hinge given in (11) and obtain
Taking the conjugate of the conjugate we obtain
Soft-max-of-hinge loss function:
Another approach to smooth the max-of-hinge loss function is by using soft-max instead of max. The resulting soft-max-of-hinge loss function is defined as
The ’th element of the gradient of is
By the definition of the conjugate we have . The vector that maximizes the above must satisfy
This can be satisfied only if for all and . That is, . Denote and note that
Finally, if then the gradient of does not vanish anywhere, which means that . All in all, we obtain
Since the entropic function, is -strongly convex over with respect to the norm, we obtain that is -strongly convex with respect to the norm, from which it follows that is -smooth with respect to the norm.
2 Regularizers
The simplest regularization is the squared regularization
This is a -strongly convex regularization function whose conjugate is
For our acceleration procedure, we also use the regularization plus a linear term, namely,
for some vector . The conjugate of this function is
Another popular regularization we consider is the regularization,
This is not a strongly convex regularizer and therefore we will add a slight regularization to it and define the - regularization as
where for some small . Note that
so if is small enough (as will be formalized later) we obtain that .
The maximizer is also and we now show how to calculate it. We have
where the right-hand side is the ’th component of the objective value we will obtain by setting . This leads to the conclusion that
Another regularization function we’ll use in the accelerated procedure is
3 Ridge Regression
Below we specify Prox-SDCA for ridge regression. We use Option I since it is possible to derive a closed form solution to the maximization of the dual with respect to . Indeed, since we have that the maximization problem is
Applying the above update and using some additional tricks to improve the running time we obtain the following procedure.
The runtime of Prox-SDCA for ridge regression becomes
where . This matches the recent results of . If we can apply the accelerated procedure and obtain the improved runtime
4 Logistic Regression
and the dual constraints are .
Below we specify Prox-SDCA for logistic regression using Option III.
Prox-SDCA() for logistic regression Goal: Minimize Initialize , and Define: Iterate: for Randomly pick and for , Stopping condition: let Stop if
The runtime analysis is similar to the analysis for ridge regression.
5 Lasso
In the Lasso problem, the loss function is the squared loss but the regularization function is . That is, we need to solve the problem:
Let , and let be an optimal solution of (18). Then, the objective at is at most the objective at , which yields
for some . This problem fits into our framework, since now the regularizer is strongly convex. Furthermore, if is an -accurate solution to the problem in (19), then which yields
Since , we obtain that setting guarantees that is an accurate solution to the original problem given in (18).
In light of the above, from now on we focus on the problem given in (19). As in the case of ridge regression, we can apply Prox-SDCA with Option I. The resulting pseudo-code is given below. Applying the above update and using some additional tricks to improve the running time we obtain the following procedure.
Let us now discuss the runtime of the resulting method. Denote and for simplicity, assume that . Choosing , the runtime of our method becomes
It is also convenient to write the bound in terms of , where, as before, is the optimal solution of the regularized problem. With this parameterization, we can set and the runtime becomes
The runtime of standard SGD is even in the case of smooth loss functions such as the squared loss. Several variants of SGD, that leads to sparser intermediate solutions, have been proposed (e.g. ). However, all of these variants share the runtime of , which is much slower than our runtime when is small.
Another relevant approach is the FISTA algorithm of . The shrinkage operator of FISTA is the same as the gradient of used in our approach. It is a batch algorithm using Nesterov’s accelerated gradient technique. For the squared loss function, the runtime of FISTA is
This bound is worst than our bound by a factor of at least .
Another approach to solving (18) is stochastic coordinate descent over the primal problem. showed that the runtime of this approach is
under the assumption that for all . Similar results can also be found in .
For our method, the runtime depends on . If then the runtime of our method is much better than that of . In the general case, if then , which yields the runtime of
This is the same or better than whenever .
6 Linear SVM
Support Vector Machines (SVM) is an algorithm for learning a linear classifier. Linear SVM (i.e., SVM with linear kernels) amounts to minimizing the objective
Our first step is to smooth the hinge-loss. Let and consider the smooth hinge-loss as defined in (9). Recall that the smooth hinge-loss satisfies
For the smoothed hinge loss, the optimization problem given in Option I of Prox-SDCA has a closed form solution and we obtain the following procedure:
Denote . Then, the runtime of the resulting method is
In particular, choosing we obtain a solution to the original SVM problem in runtime of
As mentioned before, this is better than SGD when .
7 Multiclass SVM
This can be written as where
with being the all ones vector except in the coordinate.
Therefore, the optimization problem of multiclass SVM becomes:
As in the case of SVM, we will use the smooth version of the max-of-hinge loss function as described in (13). If we set the smoothness parameter to be then an -accurate solution to the problem with the smooth loss is also an -accurate solution to the original problem with the non-smooth loss. Therefore, from now on we focus on the problem with the smooth max-of-hinge loss.
We specify Prox-SDCA for multiclass SVM using Option I. We will show that the optimization problem in Option I can be calculated efficiently by sorting a dimensional vector. Such ideas were explored in for the non-smooth max-of-hinge loss.
Let . Then, the optimization problem over can be written as
As shown before, if we organize as a matrix, denoted , we have that . We also have that
It follows that an optimal solution to (20) must set and we only need to optimize over the rest of the dual variables. This also yields,
This is equivalent to a problem of the form:
The equivalence is in the sense that if is a solution of (22) then we can set .
Assume for simplicity that is sorted in a non-increasing order and that all of its elements are non-negative (otherwise, it is easy to verify that we can zero the negative elements of and sort the non-negative, without affecting the solution). Let be the cumulative sum of , that is, for every , let . For every , let . Since is sorted we have that
The first order condition for minimality w.r.t. is
If this value of is in , then it is the optimal and we’re done. Otherwise, the optimum should be either (which yields as well) or .
Solve the optimization problem given in (22) Initialize: , and sort s.t. Let: be s.t. Let: be s.t. and If: s.t. return s.t. Else: Let be the minimal index s.t. set s.t. If: return Else: return
The resulting pseudo-codes for Prox-SDCA is given below. We specify the procedure while referring to as a matrix, because it is the more natural representation. For convenience of the code, we also maintain in the value of (instead of the optimal value of ).
Experiments
In this section we compare Prox-SDCA, its accelerated version Accelerated-Prox-SDCA, and the FISTA algorithm of , on regularized loss minimization problems.
The experiments were performed on three large datasets with very different feature counts and sparsity, which were kindly provided by Thorsten Joachims (the datasets were also used in ). The astro-ph dataset classifies abstracts of papers from the physics ArXiv according to whether they belong in the astro-physics section; CCAT is a classification task taken from the Reuters RCV1 collection; and cov1 is class 1 of the covertype dataset of Blackard, Jock & Dean. The following table provides details of the dataset characteristics.
In the experiments, we set and vary in the range .
The convergence behaviors are plotted in Figure 6. In all the plots we depict the primal objective as a function of the number of passes over the data (often referred to as “epochs”). For FISTA, each iteration involves a single pass over the data. For Prox-SDCA, each iterations are equivalent to a single pass over the data. And, for Accelerated-Prox-SDCA, each inner iterations are equivalent to a single pass over the data. For Prox-SDCA and Accelerated-Prox-SDCA we implemented their corresponding stopping conditions and terminate the methods once an accuracy of was guaranteed.
It is clear from the graphs that Accelerated-Prox-SDCA yields the best results, and often significantly outperform the other methods. Prox-SDCA behaves similarly when is relatively large, but it converges much slower when is small. This is consistent with our theory. Finally, the relative performance of FISTA and Prox-SDCA depends on the ratio between and , but in all cases, Accelerated-Prox-SDCA is much faster than FISTA. This is again consistent with our theory.
We have described and analyzed a proximal stochastic dual coordinate ascent method and have shown how to accelerate the procedure. The overall runtime of the resulting method improves state-of-the-art results in many cases of interest.
There are two main open problems that we leave to future research.
Our Prox-SDCA procedure and its analysis works for regularizers which are strongly convex with respect to an arbitrary norm. However, our acceleration procedure is designed for regularizers which are strongly convex with respect to the Euclidean norm. Is is possible to extend the acceleration procedure to more general regularizers?
Acknowledgements
The authors would like to thank Fen Xia for careful proof-reading of the paper which helped us to correct numerous typos. Shai Shalev-Shwartz is supported by the following grants: Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI) and ISF 598-10. Tong Zhang is supported by the following grants: NSF IIS-1016061, NSF DMS-1007527, and NSF IIS-1250985.
Appendix A Proofs of Iteration Bounds for Prox-SDCA
The proof technique follows that of Shalev-Shwartz and Zhang , but with the required generality for handling general strongly convex regularizers and smoothness/Lipschitzness with respect to general norms.
We prove the theorems for running Prox-SDCA while choosing as in Option I. A careful examination of the proof easily reveals that the results hold for the other options as well. More specifically, Lemma 6 only requires choosing as in (23), and Option III chooses to optimize the bound on the right hand side of (25), and hence ensures that the choice can do no worse than the result of Lemma 6 with any . The simplification in Option IV and V employs the specific simplification of the bound in Lemma 6 in the proof of the theorems.
and .
Since only the ’th element of is updated, the improvement in the dual objective can be written as
The smoothness of implies that , where . Therefore,
By the definition of the update we have for all that
Combining this with (23) and rearranging terms we obtain that
Since we have , which yields
Next note that with , we have . Therefore:
Therefore, if we take expectation of (25) w.r.t. the choice of we obtain that
Multiplying both sides by concludes the proof of the lemma. ∎
Equipped with the above lemmas we are ready to prove Theorem 1 and Theorem 2.
The assumption that is -smooth implies that is -strongly-convex. We will apply Lemma 6 with
Recall that . Therefore, the choice of implies that
and hence for all . This yields,
Taking expectation of both sides with respect to the randomness at previous rounds, and using the law of total expectation, we obtain that
But since and , we obtain that
Using again (28), we can also obtain that
which proves the first part of Theorem 1.
Next, we sum the first inequality of (29) over to obtain
Now, if we choose to be either the average vectors or a randomly chosen vector over , then the above implies
In particular, the choice of and satisfies the above requirement. ∎
Next, for the duality gap, using (27) we have that for every such that we have
This proves the second claim of Theorem 2.