SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization
Zheng Qu, Peter Richtárik, Martin Takáč, Olivier Fercoq
Introduction
Empirical risk minimization (ERM) is a fundamental paradigm in the theory and practice of statistical inference and machine learning Shalev-Shwartz & Ben-David (2014). In the “big data” era it is increasingly common in practice to solve ERM problems with a massive number of examples, which leads to new algorithmic challenges.
State-of-the-art optimization methods for ERM include i) stochastic (sub)gradient descent Shalev-Shwartz et al. (2011); Takáč et al. (2013), ii) methods based on stochastic estimates of the gradient with diminishing variance such as SAG Schmidt et al. (2013), SVRG Johnson & Zhang (2013), S2GD Konečný & Richtárik (2014), proxSVRG Xiao & Zhang (2014), MISO Mairal (2014), SAGA Defazio et al. (2014), minibatch S2GD Konečný et al. (2014a), S2CD Konečný et al. (2014b), and iii) variants of stochastic dual coordinate ascent Shalev-Shwartz & Zhang (2013d); Zhao & Zhang (2014); Takáč et al. (2013); Shalev-Shwartz & Zhang (2013b; a); Lin et al. (2014); Qu et al. (2014); Shalev-Shwartz & Zhang (2013c).
There have been several attempts at designing methods that combine randomization with the use of curvature (second-order) information. For example, methods based on running coordinate ascent in the dual such as those mentioned above and also Richtárik & Takáč (2014; 2012); Fercoq & Richtárik (2013b); Tappenden et al. (2014); Richtárik & Takáč (2013a; b); Fercoq & Richtárik (2013a); Fercoq et al. (2014); Qu et al. (2014); Qu & Richtárik (2014a) use curvature information contained in the diagonal of a bound on the Hessian matrix. Block coordinate descent methods, when equipped with suitable data-dependent norms for the blocks, use information contained in the block diagonal of the Hessian Tappenden et al. (2013).
A more direct route to incorporating curvature information was taken by Schraudolph et al. (2007) in their stochastic L-BFGS method, by Byrd et al. (2014) and Sohl-Dickstein et al. (2014) in their stochastic quasi-Newton methods and by Fountoulakis & Tappenden (2014) who proposed a stochastic block coordinate descent methods. While typically efficient in practice, none of the methods mentioned above are equipped with complexity bounds (bounds on the number of iterations). An exception in this regard is the work of Bordes et al. (2009), who give a complexity bound for a Quasi-Newton SGD method.
The main contribution of this paper is the design and analysis of a new algorithm—stochastic dual Newton ascent (SDNA)—for solving a regularized ERM problem with smooth loss functions and a strongly convex regularizer (primal problem). Our method is stochastic in nature and has the capacity to utilize all curvature information inherent in the data. While we do our analysis for an arbitrary strongly convex regularizer, for the purposes of the introduction we shall describe the method in the case of the L2 regularizer. In this case, the dual problem is a concave quadratic maximization problem with a strongly concave separable penalty.
SDNA in each iteration picks a random subset of the dual variables (which corresponds to picking a minibatch of examples in the primal problem), following an arbitrary probability law, and maximizes, exactly, the dual objective restricted to the random subspace spanned by the coordinates. Equivalently, this can be seen as the solution of a proximal subproblem involving a random principal submatrix of the Hessian of the quadratic function. Hence, SDNA utilizes all curvature information available in the random subspace in which it operates. Note that this is very different from the update strategy of parallel / minibatch coordinate descent methods. Indeed, while these methods also update a random subset of variables in each iteration, they instead only utilize curvature information present in the diagonal of the Hessian.
As we will explain in detail in the text, SDCA-like methods need more iterations (and hence more passes through data) to convergence as the minibatch size increases. However, SDNA enjoys the opposite behavior: with increasing minibatch size, SDNA needs fewer iterations (and hence fewer passes over data) to convergence. This observation can be deduced from the complexity results we prove for SDNA, and is also confirmed by our numerical experiments. In particular, we show that the expected duality gap decreases at a geometric rate which i) is better than that of SDCA-like methods such as SDCA Shalev-Shwartz & Zhang (2013d) and QUARTZ Qu et al. (2014), and ii) improves with increasing minibatch size. This improvement does not come for free: as we increase the minibatch size, the subproblems grow in size as they involve larger portions of the Hessian. We find through experiments that for some, especially dense problems, even relatively small minibatch sizes lead to dramatic speedups in actual runtime.
We show that in the case of quadratic loss, and when viewed as a primal method, SDNA can be interpreted as a variant of the recently introduced Iterative Hessian Sketch algorithm Pilanci & Wainwright (2014).
En route to developing SDNA which we describe in Section 4, we also develop several other new algorithms: two in Section 2 (where we focus on smooth problems), one in Section 3 (where we focus on composite problems). Besides SDNA, we also develop and analyze a novel minibatch variant of SDCA in Section 4, for the sake of finding suitable method to compare SDNA to. SDNA is equivalent to applying the new method developed in Section 3 to the dual of the ERM problem. However, as we are mainly interested in solving the ERM (primal) problem, we additionally prove that the expected duality gap decreases at a geometric rate. Our technique for doing this is a variant of the one use by Shalev-Shwartz & Zhang (2013d), but generalized to an arbitrary sampling.
2 Notation
Minimization of a Smooth Function
In this section we consider unconstrained minimization of a differentiable convex function:
In particular, we shall assume smoothness (Lipschitz continuity of the gradient) and strong convexity of :
We now describe three algorithmic strategies for solving problem (5), the first two of which are new. All these methods have the form
where is only allowed to be nonzero for , where are i.i.d. random subsets of (“samplings”). That is, all methods in each iteration update a random subset of the variables. The four methods will only differ in how the update elements for are computed. If we wish the methods to work, we necessarily need to require that every coordinate has a positive probability of being sampled. For certain technical reasons that will be apparent later, we will also assume that is nonempty with probability 1.
The random sets are i.i.d., proper (i.e., for all ) and nonvacuous (i.e., ).
Much of our discussion will depend on the distribution of rather than on . As are i.i.d., we will write for a sampling which shares their distribution. We will write where
By Assumption 3, we have for all . We now describe the methods.
Method 1. We compute and set
Note that the update only involves the inversion of a random principal submatrix of of size . Also, we only need to compute elements of the gradient . If is reasonably small, the update step is cheap.
Assuming is easily computable (this should be done before the methods starts), the update is clearly very easy to perform. Indeed, the update can be equivalently written as for and for . Method 3 is known as NSync Richtárik & Takáč (2013b). For a calculus allowing the computation of closed form formulas for as a function of the sampling we refer the reader to Qu & Richtárik (2014b).
Note that all three methods coincide if with probability 1.
2 Three linear convergence rates
We shall now show that, putting the issue of the cost of each iteration of the three methods aside, all enjoy a linear rate of convergence. In particular, we shall show that Method 1 has the fastest rate, followed by Method and finally, Method .
Let Assumptions 1, 2 and 3 be satisfied. Let be the sequence of random vectors produced by Method , for and let be the optimal solution of (5). Then
The above result means that the number of iterations sufficient for Method to obtain an -solution (in expectation) is .
If is proper and nonvacuous, then
Substituting and taking expectations, we obtain
We now establish an important relationship between the quantities and , which sheds light on the convergence rates of the three methods.
.
3 Example
Note that Assumption 1 holds, and Assumption 2 holds with . Let be the “-nice sampling” on . That is, we set for . A straightforward calculation reveals that:
It can be verified that (9) holds with ; see Richtárik & Takáč (2012) or Qu & Richtárik (2014b). Therefore, . Finally, we obtain:
Note that: theoretical rate, , of Method 1 is 10,000 times better than the rate, , of parallel coordinate descent (Method 3).
4 Proof of Theorem 1
By minimizing both sides of (7) in , we get:
Method 1: If we use (15) with , and apply (4), we get:
Taking expectations on both sides with respect to yields:
Minimization of a Composite Function
In this section we consider the following composite minimization problem:
We assume that satisfies Assumptions 6 and 7. The difference from the setup in the previous section is in the inclusion of the separable term .
We now propose Algorithm 1, which a variant of Method 1 applicable to problem (16). If for all , the methods coincide. The following result states that the method converges at a geometric rate, in expectation.
Let Assumptions 1, 2, 3 and 4 be satisfied. Then the output sequence of Algorithm 1 satisfies:
where is the solution of (16), and
So, the rate we can prove for the composite version of Method 1 () is weaker than the rate we get for Method 2 (), which by Theorem 2 is weaker than the rate of Method 1 (). We believe this is a byproduct of our analysis rather than the weakness of Algorithm 1.
2 PCDM
We will now compare our new Algorithm 1 with the Parallel Coordinate Descent Method (PCDM) of Richtárik & Takáč (2012), which can also be applied to problem (16).
where and
Sketch: The proof is a minor modification of the arguments in Richtárik & Takáč (2012). ∎
3 Comparison of the rates of Algorithms 1 and 2
We now show that the rate of linear (geometric) convergence of our method is better than that of PCDM.
.
Since for all , we have and hence from (9) we deduce that:
whence , and the claim follows. ∎
Empirical Risk Minimization
We now turn our attention to the empirical risk minimization problem:
Note that the dual problem has the form (16)
where , and . It is easy to see that satisfies Assumption 1 with , where . Moreover, is -strongly convex. We can therefore apply Algorithm 1 to solve the dual (19). This is what Algorithm 3 does.
If is the optimal solution of (18), then the optimal solution of (17) is given by:
With each proper sampling we associate the number:
Closed-form expressions for satisfying this inequality, as a function of the sampling chosen, can be found in Qu & Richtárik (2014b). A rather conservative choice which works for any , irrespective of its distribution, is , where and is a number for which with probability 1 (see Theorem 5.1 in the aforementioned reference). Better bounds (with smaller ) can be derived for special classes of samplings.
Now we can state the main result of this section:
where and
In the case of quadratic losses and quadratic regularizer, we can sharpen the complexity bound:
When both and are quadratic functions, the output sequence of Algorithm 3 satisfies:
2 Complexity analysis
We first establish that SDNA is able to solve the dual.
where is as in Theorem 4.
If is uniform, then the output of Algorithm 3 is equivalent to the output of Algorithm 1 applied to (19). Therefore, the result is obtained by applying Theorem 3 with , and for all . ∎
We now prove a sharper result in the case of quadratic loss and quadratic regularizer.
If and are quadratic, then the output sequence of Algorithm 3 satisfies:
If and are all quadratic functions, then the dual objective function is quadratic with Hessian matrix given by . It suffices to apply Theorem 1(10), with . ∎
We now prove a more general version of a classical result in dual coordinate ascent methods which bounds the duality gap from above by the expected dual increase.
The output sequence of Algorithm 3 satisfies:
The proof of the this lemma is provided in the supplementary material. Theorem 4 (resp. Theorem 5) now follows by combining Lemma 3 (resp. Lemma 4) and Lemma 5.
3 New Algorithm: SDCA with Arbitrary Sampling
When with probability 1, SDNA reduces to a proximal variant of stochastic dual coordinate ascent (SDCA) Shalev-Shwartz & Zhang (2013d). However, a minibatch version of standard SDCA in the ERM setup we consider here has not been previously studied in the literature. Takáč et al. (2013) developed such a method but in the special case of hinge-loss (which is not smooth and hence does not fit our setup). Shalev-Shwartz & Zhang (2013b) studied minibatching but in conjunction with acceleration and the QUARTZ method of Qu et al. (2014), which has been analyzed for an arbitrary sampling , uses a different primal update than SDNA. Hence, in order to compare SDNA with an SDCA-like method which is as close a match to SDNA as possible, we need to develop a new method. Algorithm 4 is an extension of SDCA to allow it handle an arbitrary uniform sampling .
The complexity of Minibatch SDCA (we henceforth just write SDCA) is given in Theorem 6.
If (22) holds, then the output sequence of Algorithm 4 satisfies:
4 SDNA vs SDCA
We now compare the rates of SDNA and SDCA. The next result says that the rate of SDNA is always superior to that of SDCA. We also see that the rate is better in the quadratic case covered by Theorem 5.
Since is a uniform sampling, we have for all . In view of (21), we have . Next,
SDNA as Iterative Hessian Sketch
We now apply SDNA to the least squares problem:
and show that the resulting primal update can be interpreted as an iterative Hessian sketch, alternative to the one proposed by Pilanci & Wainwright (2014). We first need to establish a simple duality result.
Let be the optimal solution of
then the optimal solution of (24) is
Problem (24) is a special case of (17) for and for all . Problem (25) is the dual of (24) and the result follows from (20). ∎
The interpretation of SDNA as a variant of the Iterative Hessian sketch method of Pilanci & Wainwright (2014) follows immediately from the following theorem.
The output sequence of Algorithm 3 applied on problem (24) satisfies:
where denotes the -by- submatrix of the identity matrix with columns in the random subset .
We know that is the optimal solution of
Let . By Lemma 6, the optimal solution of
is given by , which equals and thus equals . Hence,
which is equivalent to (26) since . ∎
Numerical Experiments
In our first experiment (Figure 1) we compare SDNA and our new minibatch version of SDCA on one real (mushrooms; , ) and one synthetic (, ) dataset. In both cases, we used as the regularization parameter and . As increases, SDNA requires less passes over data (epochs), while SDCA requires more passes over data. It can be shown that this behavior can be predicted from the complexity results for these two methods. The difference in performance depends on the choice of the dataset and can be quite dramatic.
In the second experiment (Figure 2), we investigate how much time it takes for the methods to process a single epoch, using the same datasets as before. As increases, SDNA does more work as the subproblems it needs to solve in each iteration involve a submatrix of the Hessian of the smooth part of the dual objective function. On the other hand, the work SDCA needs to do is much smaller, and does nearly does not increase with the minibatch size . This is because the subproblems are separable. As before, all experiments are done using a single core (however, both methods would benefit from a parallel implementation).
Finally, in Figure 3 we put the insights gained from the previous two experiments together: we look at the performance of SDNA for various choices of by comparing runtime and duality gap error. We should expect that increasing would lead to faster method in terms of passes over data, but that this would also lead to slower iterations. The question is, is does the gain outweight the loss? The answer is: yes, for small enough minibatch sizes. Indeed, looking at Figure 3, we see that the runtime of SDNA improved up to the point for both datasets, and then starts to deteriorate. In situations where it is costly to fetch data from memory to a (fast) processor, much larger minibatch sizes would be optimal.
References
APPENDIX: Proof of Theorem 3
It follows directly from Assumption 1 and the update rule in Algorithm 1 that:
Since is defined as the minimizer of the right hand side in the last inequality, we can further bound this term by replacing with for arbitrary :
Now we use the fact that is -strongly convex to obtain:
By taking expectations in on both sides of the last inequality, we see that for any , the following holds:
APPENDIX: Proof of Lemma 5
Recall that , where .
For simplicity in this proof we write . First, by the 1-strong convexity of the function we obtain the 1-smoothness of the function , from which we deduce:
Now we replace by and by to obtain:
From -strong convexity of the functions we deduce that:
where the equality follows from . Next, by the definition of in (21), we know that:
In the main text we have shown that , where is the rate of Method 2 and is the rate of Method 3: NSync Richtárik & Takáč (2013b). In this section we give a more detailed description of the relationship between these two quantities in the case when is the -nice sampling Richtárik & Takáč (2012). That is, picks subsets of of cardinality , uniformly at random. For this sampling,
Suppose that and be the -nice sampling. Then there exists such that one can choose and
As explained in Richtárik & Takáč (2012), (9) is always true if we take with but smaller values (leading to a faster algorithm) may be computable if the problem exhibits a property called “partial separability”.
Let us denote by the diagonal matrix whose entries are the diagonal entries of .
Let us denote and .
But we have , so
Note that if is small, then is of the order of . ∎