SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization

Zheng Qu, Peter Richtárik, Martin Takáč, Olivier Fercoq

Introduction

Empirical risk minimization (ERM) is a fundamental paradigm in the theory and practice of statistical inference and machine learning Shalev-Shwartz & Ben-David (2014). In the “big data” era it is increasingly common in practice to solve ERM problems with a massive number of examples, which leads to new algorithmic challenges.

State-of-the-art optimization methods for ERM include i) stochastic (sub)gradient descent Shalev-Shwartz et al. (2011); Takáč et al. (2013), ii) methods based on stochastic estimates of the gradient with diminishing variance such as SAG Schmidt et al. (2013), SVRG Johnson & Zhang (2013), S2GD Konečný & Richtárik (2014), proxSVRG Xiao & Zhang (2014), MISO Mairal (2014), SAGA Defazio et al. (2014), minibatch S2GD Konečný et al. (2014a), S2CD Konečný et al. (2014b), and iii) variants of stochastic dual coordinate ascent Shalev-Shwartz & Zhang (2013d); Zhao & Zhang (2014); Takáč et al. (2013); Shalev-Shwartz & Zhang (2013b; a); Lin et al. (2014); Qu et al. (2014); Shalev-Shwartz & Zhang (2013c).

There have been several attempts at designing methods that combine randomization with the use of curvature (second-order) information. For example, methods based on running coordinate ascent in the dual such as those mentioned above and also Richtárik & Takáč (2014; 2012); Fercoq & Richtárik (2013b); Tappenden et al. (2014); Richtárik & Takáč (2013a; b); Fercoq & Richtárik (2013a); Fercoq et al. (2014); Qu et al. (2014); Qu & Richtárik (2014a) use curvature information contained in the diagonal of a bound on the Hessian matrix. Block coordinate descent methods, when equipped with suitable data-dependent norms for the blocks, use information contained in the block diagonal of the Hessian Tappenden et al. (2013).

A more direct route to incorporating curvature information was taken by Schraudolph et al. (2007) in their stochastic L-BFGS method, by Byrd et al. (2014) and Sohl-Dickstein et al. (2014) in their stochastic quasi-Newton methods and by Fountoulakis & Tappenden (2014) who proposed a stochastic block coordinate descent methods. While typically efficient in practice, none of the methods mentioned above are equipped with complexity bounds (bounds on the number of iterations). An exception in this regard is the work of Bordes et al. (2009), who give a $O(1/\epsilon)$ complexity bound for a Quasi-Newton SGD method.

The main contribution of this paper is the design and analysis of a new algorithm—stochastic dual Newton ascent (SDNA)—for solving a regularized ERM problem with smooth loss functions and a strongly convex regularizer (primal problem). Our method is stochastic in nature and has the capacity to utilize all curvature information inherent in the data. While we do our analysis for an arbitrary strongly convex regularizer, for the purposes of the introduction we shall describe the method in the case of the L2 regularizer. In this case, the dual problem is a concave quadratic maximization problem with a strongly concave separable penalty.

SDNA in each iteration picks a random subset of the dual variables (which corresponds to picking a minibatch of examples in the primal problem), following an arbitrary probability law, and maximizes, exactly, the dual objective restricted to the random subspace spanned by the coordinates. Equivalently, this can be seen as the solution of a proximal subproblem involving a random principal submatrix of the Hessian of the quadratic function. Hence, SDNA utilizes all curvature information available in the random subspace in which it operates. Note that this is very different from the update strategy of parallel / minibatch coordinate descent methods. Indeed, while these methods also update a random subset of variables in each iteration, they instead only utilize curvature information present in the diagonal of the Hessian.

As we will explain in detail in the text, SDCA-like methods need more iterations (and hence more passes through data) to convergence as the minibatch size increases. However, SDNA enjoys the opposite behavior: with increasing minibatch size, SDNA needs fewer iterations (and hence fewer passes over data) to convergence. This observation can be deduced from the complexity results we prove for SDNA, and is also confirmed by our numerical experiments. In particular, we show that the expected duality gap decreases at a geometric rate which i) is better than that of SDCA-like methods such as SDCA Shalev-Shwartz & Zhang (2013d) and QUARTZ Qu et al. (2014), and ii) improves with increasing minibatch size. This improvement does not come for free: as we increase the minibatch size, the subproblems grow in size as they involve larger portions of the Hessian. We find through experiments that for some, especially dense problems, even relatively small minibatch sizes lead to dramatic speedups in actual runtime.

We show that in the case of quadratic loss, and when viewed as a primal method, SDNA can be interpreted as a variant of the recently introduced Iterative Hessian Sketch algorithm Pilanci & Wainwright (2014).

En route to developing SDNA which we describe in Section 4, we also develop several other new algorithms: two in Section 2 (where we focus on smooth problems), one in Section 3 (where we focus on composite problems). Besides SDNA, we also develop and analyze a novel minibatch variant of SDCA in Section 4, for the sake of finding suitable method to compare SDNA to. SDNA is equivalent to applying the new method developed in Section 3 to the dual of the ERM problem. However, as we are mainly interested in solving the ERM (primal) problem, we additionally prove that the expected duality gap decreases at a geometric rate. Our technique for doing this is a variant of the one use by Shalev-Shwartz & Zhang (2013d), but generalized to an arbitrary sampling.

2 Notation

Minimization of a Smooth Function

In this section we consider unconstrained minimization of a differentiable convex function:

In particular, we shall assume smoothness (Lipschitz continuity of the gradient) and strong convexity of $f$ :

We now describe three algorithmic strategies for solving problem (5), the first two of which are new. All these methods have the form

where $h_{i}^{k}$ is only allowed to be nonzero for $i\in S_{k}$ , where $\{S_{k}\}_{k\geq 0}$ are i.i.d. random subsets of $[n]:=\{1,2,\dots,n\}$ (“samplings”). That is, all methods in each iteration update a random subset of the variables. The four methods will only differ in how the update elements $h_{i}^{k}$ for $i\in S_{k}$ are computed. If we wish the methods to work, we necessarily need to require that every coordinate has a positive probability of being sampled. For certain technical reasons that will be apparent later, we will also assume that $S_{k}$ is nonempty with probability 1.

The random sets $\{S_{k}\}_{k\geq 0}$ are i.i.d., proper (i.e., $\mathbf{Prob}(i\in S_{k})>0$ for all $i\in[n]$ ) and nonvacuous (i.e., $\mathbf{Prob}(S_{k}=\emptyset)=0$ ).

Much of our discussion will depend on the distribution of $S_{k}$ rather than on $k$ . As $\{S_{k}\}_{k\geq 0}$ are i.i.d., we will write $\hat{S}$ for a sampling which shares their distribution. We will write $p=(p_{1},\dots,p_{n})$ where

By Assumption 3, we have $p_{i}>0$ for all $i$ . We now describe the methods.

Method 1. We compute $(\mathbf{M}_{S_{k}})^{-1}$ and set

Note that the update only involves the inversion of a random principal submatrix of $\mathbf{M}$ of size $|S_{k}|\times|S_{k}|$ . Also, we only need to compute elements $i\in S_{k}$ of the gradient $\nabla f(x_{k})$ . If $|S_{k}|$ is reasonably small, the update step is cheap.

Assuming $v$ is easily computable (this should be done before the methods starts), the update is clearly very easy to perform. Indeed, the update can be equivalently written as $h_{i}^{k}=-\tfrac{1}{v_{i}}\langle e_{i},\nabla f(x^{k})\rangle$ for $i\in S_{k}$ and $h_{i}^{k}=0$ for $i\notin S_{k}$ . Method 3 is known as NSync Richtárik & Takáč (2013b). For a calculus allowing the computation of closed form formulas for $v$ as a function of the sampling $\hat{S}$ we refer the reader to Qu & Richtárik (2014b).

Note that all three methods coincide if $|\hat{S}|=1$ with probability 1.

2 Three linear convergence rates

We shall now show that, putting the issue of the cost of each iteration of the three methods aside, all enjoy a linear rate of convergence. In particular, we shall show that Method 1 has the fastest rate, followed by Method $2$ and finally, Method $3$ .

Let Assumptions 1, 2 and 3 be satisfied. Let $\{x^{k}\}_{k\geq 0}$ be the sequence of random vectors produced by Method $m$ , for $m=1,2,3$ and let $x^{*}$ be the optimal solution of (5). Then

The above result means that the number of iterations sufficient for Method $m$ to obtain an $\epsilon$ -solution (in expectation) is $O(\tfrac{1}{\sigma_{m}}\log(1/\epsilon))$ .

If $\hat{S}$ is proper and nonvacuous, then

Substituting $S=\hat{S}$ and taking expectations, we obtain

We now establish an important relationship between the quantities $\sigma_{1},\sigma_{2}$ and $\sigma_{3}$ , which sheds light on the convergence rates of the three methods.

$0<\sigma_{3}\leq\sigma_{2}\leq\sigma_{1}\leq 1$ .

3 Example

Note that Assumption 1 holds, and Assumption 2 holds with $\mathbf{G}=\mathbf{M}$ . Let $\hat{S}$ be the “ $2$ -nice sampling” on $[n]=\{1,2,3\}$ . That is, we set $\mathbf{Prob}(\hat{S}=\{i,j\})=\frac{1}{3}.$ for $(i,j)=(1,2),(2,3),(3,1)$ . A straightforward calculation reveals that:

It can be verified that (9) holds with $v=(2,2,2)$ ; see Richtárik & Takáč (2012) or Qu & Richtárik (2014b). Therefore, $\mathbf{D}(p)\mathbf{D}(v^{-1})=\tfrac{1}{3}\mathbf{I}$ . Finally, we obtain:

Note that: theoretical rate, $\sigma_{1}$ , of Method 1 is 10,000 times better than the rate, $\sigma_{3}$ , of parallel coordinate descent (Method 3).

4 Proof of Theorem 1

By minimizing both sides of (7) in $h$ , we get:

Method 1: If we use (15) with $h\leftarrow h^{k}:=-(\mathbf{M}_{S_{k}})^{-1}\nabla f(x^{k})$ , and apply (4), we get:

Taking expectations on both sides with respect to $S_{k}$ yields:

Minimization of a Composite Function

In this section we consider the following composite minimization problem:

We assume that $f$ satisfies Assumptions 6 and 7. The difference from the setup in the previous section is in the inclusion of the separable term $\sum_{i}\psi_{i}$ .

We now propose Algorithm 1, which a variant of Method 1 applicable to problem (16). If $\psi_{i}\equiv 0$ for all $i$ , the methods coincide. The following result states that the method converges at a geometric rate, in expectation.

Let Assumptions 1, 2, 3 and 4 be satisfied. Then the output sequence $\{x^{k}\}_{k\geq 0}$ of Algorithm 1 satisfies:

where $x^{*}$ is the solution of (16), $\sigma^{prox}_{1}:=\frac{\tau\min(1,s_{1})}{n}$ and

So, the rate we can prove for the composite version of Method 1 ( $\sigma^{prox}_{1}$ ) is weaker than the rate we get for Method 2 ( $\sigma_{2}$ ), which by Theorem 2 is weaker than the rate of Method 1 ( $\sigma_{1}$ ). We believe this is a byproduct of our analysis rather than the weakness of Algorithm 1.

2 PCDM

We will now compare our new Algorithm 1 with the Parallel Coordinate Descent Method (PCDM) of Richtárik & Takáč (2012), which can also be applied to problem (16).

where $\sigma_{3}^{prox}:=\frac{\tau\min(1,s_{3})}{n}$ and

Sketch: The proof is a minor modification of the arguments in Richtárik & Takáč (2012). ∎

3 Comparison of the rates of Algorithms 1 and 2

We now show that the rate of linear (geometric) convergence of our method is better than that of PCDM.

$\sigma^{prox}_{1}\geq\sigma^{prox}_{3}$ .

Since $p_{i}=\tfrac{\tau}{n}$ for all $i$ , we have $\mathbf{D}(p)=\tfrac{\tau}{n}\mathbf{I}$ and hence from (9) we deduce that:

whence $s_{1}\geq s_{3}$ , and the claim follows. ∎

Empirical Risk Minimization

We now turn our attention to the empirical risk minimization problem:

Note that the dual problem has the form (16)

where $F(\alpha)=-D(\alpha)$ , $f(\alpha)=\lambda g^{*}(\tfrac{1}{\lambda n}\mathbf{A}\alpha)$ and $\psi(\alpha_{i})=\tfrac{1}{n}\phi_{i}^{*}(-\alpha_{i})$ . It is easy to see that $f$ satisfies Assumption 1 with $\mathbf{M}:=\tfrac{1}{n}\mathbf{X}$ , where $\mathbf{X}:=\frac{1}{\lambda n}\mathbf{A}^{\top}\mathbf{A}$ . Moreover, $\psi_{i}$ is $\tfrac{\gamma}{n}$ -strongly convex. We can therefore apply Algorithm 1 to solve the dual (19). This is what Algorithm 3 does.

If $\alpha^{*}$ is the optimal solution of (18), then the optimal solution of (17) is given by:

With each proper sampling $\hat{S}$ we associate the number:

Closed-form expressions for $v$ satisfying this inequality, as a function of the sampling $\hat{S}$ chosen, can be found in Qu & Richtárik (2014b). A rather conservative choice which works for any $\hat{S}$ , irrespective of its distribution, is $v_{i}=\min\{\tau,\lambda^{\prime}(\mathbf{A}^{\top}\mathbf{A})\}\|a_{i}\|^{2}$ , where $\lambda^{\prime}(\mathbf{Y}):=\max_{h}\{h^{\top}\mathbf{Y}h\;:\;h^{\top}\mathbf{D}(\mathbf{Y})h\leq 1\}$ and $\tau$ is a number for which $|\hat{S}|\leq\tau$ with probability 1 (see Theorem 5.1 in the aforementioned reference). Better bounds (with smaller $v$ ) can be derived for special classes of samplings.

Now we can state the main result of this section:

where $\sigma_{1}^{prox}:=\frac{\tau\min(1,s_{1})}{n}$ and

In the case of quadratic losses and quadratic regularizer, we can sharpen the complexity bound:

When both $\phi_{i}$ and $g$ are quadratic functions, the output sequence $\{w^{k},\alpha^{k}\}_{k\geq 0}$ of Algorithm 3 satisfies:

2 Complexity analysis

We first establish that SDNA is able to solve the dual.

where $\sigma_{1}^{prox}$ is as in Theorem 4.

If $\hat{S}$ is uniform, then the output of Algorithm 3 is equivalent to the output of Algorithm 1 applied to (19). Therefore, the result is obtained by applying Theorem 3 with $\mathbf{M}=\tfrac{1}{\lambda n^{2}}\mathbf{A}^{\top}\mathbf{A}$ , $\mathbf{G}=0$ and $\gamma_{i}=\tfrac{\gamma}{n}$ for all $i$ . ∎

We now prove a sharper result in the case of quadratic loss and quadratic regularizer.

If $\{\phi_{i}\}_{i}$ and $g$ are quadratic, then the output sequence $\{\alpha^{k}\}_{k\geq 0}$ of Algorithm 3 satisfies:

If $\{\phi_{i}\}_{i}$ and $g$ are all quadratic functions, then the dual objective function is quadratic with Hessian matrix given by $\nabla^{2}D(\alpha)\equiv\frac{1}{\lambda n^{2}}\mathbf{A}^{\top}\mathbf{A}+\frac{\gamma}{n}\mathbf{I}$ . It suffices to apply Theorem 1(10), with $\mathbf{M}=\mathbf{G}=\nabla^{2}D(\alpha)$ . ∎

We now prove a more general version of a classical result in dual coordinate ascent methods which bounds the duality gap from above by the expected dual increase.

The output sequence $\{w^{k},\alpha^{k}\}_{k\geq 0}$ of Algorithm 3 satisfies:

The proof of the this lemma is provided in the supplementary material. Theorem 4 (resp. Theorem 5) now follows by combining Lemma 3 (resp. Lemma 4) and Lemma 5.

3 New Algorithm: SDCA with Arbitrary Sampling

When $|\hat{S}|=1$ with probability 1, SDNA reduces to a proximal variant of stochastic dual coordinate ascent (SDCA) Shalev-Shwartz & Zhang (2013d). However, a minibatch version of standard SDCA in the ERM setup we consider here has not been previously studied in the literature. Takáč et al. (2013) developed such a method but in the special case of hinge-loss (which is not smooth and hence does not fit our setup). Shalev-Shwartz & Zhang (2013b) studied minibatching but in conjunction with acceleration and the QUARTZ method of Qu et al. (2014), which has been analyzed for an arbitrary sampling $\hat{S}$ , uses a different primal update than SDNA. Hence, in order to compare SDNA with an SDCA-like method which is as close a match to SDNA as possible, we need to develop a new method. Algorithm 4 is an extension of SDCA to allow it handle an arbitrary uniform sampling $\hat{S}$ .

The complexity of Minibatch SDCA (we henceforth just write SDCA) is given in Theorem 6.

If (22) holds, then the output sequence $\{w^{k},\alpha^{k}\}_{k\geq 0}$ of Algorithm 4 satisfies:

4 SDNA vs SDCA

We now compare the rates of SDNA and SDCA. The next result says that the rate of SDNA is always superior to that of SDCA. We also see that the rate is better in the quadratic case covered by Theorem 5.

Since $\hat{S}$ is a uniform sampling, we have $p_{i}=\frac{\tau}{n}$ for all $i\in[n]$ . In view of (21), we have $1\leq\tfrac{n}{\tau}\theta(\hat{S})$ . Next,

SDNA as Iterative Hessian Sketch

We now apply SDNA to the least squares problem:

and show that the resulting primal update can be interpreted as an iterative Hessian sketch, alternative to the one proposed by Pilanci & Wainwright (2014). We first need to establish a simple duality result.

Let $\alpha^{*}$ be the optimal solution of

then the optimal solution $w^{*}$ of (24) is $w^{*}=\frac{1}{\lambda n}\mathbf{A}\alpha^{*}.$

Problem (24) is a special case of (17) for $g(w)\equiv\frac{1}{2}\|w\|^{2}$ and $\phi_{i}(a)\equiv\frac{1}{2}(a-b_{i})^{2}$ for all $i\in[n]$ . Problem (25) is the dual of (24) and the result follows from (20). ∎

The interpretation of SDNA as a variant of the Iterative Hessian sketch method of Pilanci & Wainwright (2014) follows immediately from the following theorem.

The output sequence $\{w^{k},\alpha^{k}\}_{k\geq 0}$ of Algorithm 3 applied on problem (24) satisfies:

where $\mathbf{S}_{k}$ denotes the $n$ -by- $|S_{k}|$ submatrix of the identity matrix $\mathbf{I}_{n}$ with columns in the random subset $S_{k}$ .

We know that $\mathbf{S}_{k}^{\top}\Delta\alpha^{k}$ is the optimal solution of

Let $\tau=|S_{k}|$ . By Lemma 6, the optimal solution of

is given by $\frac{1}{\lambda n}\mathbf{A}\mathbf{S}_{k}\mathbf{S}_{k}^{\top}\nabla\alpha^{k}$ , which equals $\bar{\alpha}^{k+1}-\bar{\alpha}^{k}$ and thus equals $w^{k+1}-w^{k}$ . Hence,

which is equivalent to (26) since $(\mathbf{I}_{n})_{S_{k}}=\mathbf{S}_{k}\mathbf{S}_{k}^{\top}$ . ∎

Numerical Experiments

In our first experiment (Figure 1) we compare SDNA and our new minibatch version of SDCA on one real (mushrooms; $d=112$ , $n=8,124$ ) and one synthetic ( $d=1,024$ , $n=2,048$ ) dataset. In both cases, we used $\lambda=1/n$ as the regularization parameter and $g(w)=\tfrac{1}{2}\|w\|^{2}$ . As $\tau$ increases, SDNA requires less passes over data (epochs), while SDCA requires more passes over data. It can be shown that this behavior can be predicted from the complexity results for these two methods. The difference in performance depends on the choice of the dataset and can be quite dramatic.

In the second experiment (Figure 2), we investigate how much time it takes for the methods to process a single epoch, using the same datasets as before. As $\tau$ increases, SDNA does more work as the subproblems it needs to solve in each iteration involve a $\tau\times\tau$ submatrix of the Hessian of the smooth part of the dual objective function. On the other hand, the work SDCA needs to do is much smaller, and does nearly does not increase with the minibatch size $\tau$ . This is because the subproblems are separable. As before, all experiments are done using a single core (however, both methods would benefit from a parallel implementation).

Finally, in Figure 3 we put the insights gained from the previous two experiments together: we look at the performance of SDNA for various choices of $\tau$ by comparing runtime and duality gap error. We should expect that increasing $\tau$ would lead to faster method in terms of passes over data, but that this would also lead to slower iterations. The question is, is does the gain outweight the loss? The answer is: yes, for small enough minibatch sizes. Indeed, looking at Figure 3, we see that the runtime of SDNA improved up to the point $\tau=16$ for both datasets, and then starts to deteriorate. In situations where it is costly to fetch data from memory to a (fast) processor, much larger minibatch sizes would be optimal.

References

APPENDIX: Proof of Theorem 3

It follows directly from Assumption 1 and the update rule $x^{k+1}=x^{k}+(h^{k})_{S_{k}}$ in Algorithm 1 that:

Since $h^{k}$ is defined as the minimizer of the right hand side in the last inequality, we can further bound this term by replacing $h^{k}$ with $h=\lambda(x^{*}-x^{k})$ for arbitrary $\lambda\in$ :

Now we use the fact that $\psi_{i}$ is $\gamma_{i}$ -strongly convex to obtain:

By taking expectations in $S_{k}$ on both sides of the last inequality, we see that for any $\lambda\in$ , the following holds:

APPENDIX: Proof of Lemma 5

Recall that $\mathbf{M}=\tfrac{1}{n}\mathbf{X}$ , where $\mathbf{X}=\tfrac{1}{\lambda n}\mathbf{A}^{\top}\mathbf{A}$ .

For simplicity in this proof we write $\theta=\theta(\hat{S})$ . First, by the 1-strong convexity of the function $g$ we obtain the 1-smoothness of the function $g^{*}$ , from which we deduce:

Now we replace $\nabla g^{*}(\bar{\alpha}^{k})$ by $w^{k}$ and $\bar{\alpha}$ by $\frac{1}{\lambda n}\mathbf{A}\alpha$ to obtain:

From $\gamma$ -strong convexity of the functions $\phi_{i}^{*}$ we deduce that:

where the equality follows from $u_{i}^{k}=\nabla\phi_{i}(a_{i}^{\top}w^{k})$ . Next, by the definition of $\theta$ in (21), we know that:

In the main text we have shown that $\sigma_{2}\geq\sigma_{3}$ , where $\sigma_{2}$ is the rate of Method 2 and $\sigma_{3}$ is the rate of Method 3: NSync Richtárik & Takáč (2013b). In this section we give a more detailed description of the relationship between these two quantities in the case when $\hat{S}$ is the $\tau$ -nice sampling Richtárik & Takáč (2012). That is, $\hat{S}$ picks subsets of $[n]$ of cardinality $\tau$ , uniformly at random. For this sampling,

Suppose that $\mathbf{G}=\mathbf{M}$ and $\hat{S}$ be the $\tau$ -nice sampling. Then there exists $\beta\in[1,\tau]$ such that one can choose $v_{i}=\beta\mathbf{M}_{i,i}$ and

As explained in Richtárik & Takáč (2012), (9) is always true if we take $v_{i}=\beta\mathbf{M}_{i,i}$ with $\beta=\tau$ but smaller values (leading to a faster algorithm) may be computable if the problem exhibits a property called “partial separability”.

Let us denote by $\mathbf{D}$ the diagonal matrix whose entries are the diagonal entries of $\mathbf{M}$ .

Let us denote $\mathbf{A}=\mathbf{M}^{-1/2}\mathbf{D}\mathbf{M}^{-1/2}$ and $\alpha=\frac{\tau-1}{n-1}$ .

But we have $\lambda_{\max}((1-\alpha)\mathbf{A}+\alpha\mathbf{I})=(1-\alpha)\lambda_{\max}(\mathbf{A})+\alpha$ , so

Note that if $\sigma_{3}$ is small, then $\sigma_{2}$ is of the order of $\frac{\beta\sigma_{3}}{1-\frac{\tau-1}{n-1}}>\beta\sigma_{3}$ . ∎