Randomized Dual Coordinate Ascent with Arbitrary Sampling

Zheng Qu, Peter Richtárik, Tong Zhang

Introduction

In this paper we consider a primal-dual pair of structured convex optimization problems which has in several variants of varying degrees of generality attracted a lot of attention in the past few years in the machine learning and optimization communities .

In the machine learning context, matrices $\{A_{i}\}$ are interpreted as examples/samples, $w$ is a (linear) predictor, function $\phi_{i}$ is the loss incurred by the predictor on example $A_{i}$ , $g$ is a regularizer, $\lambda$ is a regularization parameter and (1) is the regularized empirical risk minimization problem. However, above problem has many other applications outside machine learning. In this paper we are especially interested in problems where $n$ is very big (millions, billions), and much larger than $d$ . This is often the case in big data applications.

Note that $f$ is convex and smooth and $\psi$ is strongly convex and block separable.

2 Contributions

We now briefly list the main contributions of this work.

Main result.

We prove that starting from an initial pair $(w^{0},\alpha^{0})$ , Quartz finds a pair $(w,\alpha)$ for which $P(w)-D(\alpha)\leq\epsilon$ (in expectation) in at most

iterations. The parameters $v_{1},\dots,v_{n}$ are assumed to satisfy the following ESO (expected separable overapproximation) inequality:

Moreover, the parameters are needed to run the method (they determine stepsizes), and hence it is critical that they can be cheaply computed before the method starts. As we will show, for many samplings of interest this can be done in time required to read the data $\{A_{i}\}$ . We wish to point out that (6) always holds for some parameters $\{v_{i}\}$ . Indeed, the left hand side is a quadratic function of $h$ and hence the inequality holds for large-enough $v_{i}$ . Having said that, the size of these parameters directly influences the complexity, and hence one would want to obtain as tight bounds as possible.

Arbitrary sampling.

Direct primal-dual analysis.

Virtually all methods for solving (1) by performing stochastic steps in the dual (2), such as SDCA , SDCA for SVM dual , ProxSDCA , ASDCA and APCG , are analyzed by first establishing dual convergence and then proving that the duality gap is bounded by the dual residual. The SPDC method of Zhang and Xiao , which is a stochastic coordinate update variant of the Chambolle-Pock method , is an exception. Our analysis is novel, and directly primal-dual in nature. As a result, our proof is more direct, and the logarithmic term in our bound has a simpler form.

Flexibility: many important variants.

Our method is very flexible: by specializing it to specific samplings, we obtain numerous variants, some similar (but not identical) to existing methods in the literature, and some very new and of significance to big data optimization.

Serial uniform sampling. If $\hat{S}$ always picks a single block, uniformly at random ( $p_{i}=1/n$ ), then the dual updates of Quartz are similar to those of SDCA and Prox-SDCA . The leading term in the complexity bound (5) becomes $n+\max_{i}\lambda_{\text{max}}(A_{i}^{\top}A_{i})/(\lambda\gamma)$ , which matches the bounds obtained in these papers. However, our logarithmic term is simpler.

Serial optimal sampling (importance sampling). If $\hat{S}$ always picks a single block, with $p_{i}$ chosen so as to minimize the complexity bound (5), we obtain the same importance sampling as that recently used in the IProx-SDCA method . Our bound becomes $n+(\tfrac{1}{n}\sum_{i}\lambda_{\text{max}}(A_{i}^{\top}A_{i}))/(\lambda\gamma)$ , which matches the bound in . Again, our logarithmic term is better.

$\tau$ -nice sampling. If we now let $\hat{S}$ be a random subset of $[n]$ of size $\tau$ chosen uniformly at random (this sampling is called $\tau$ -nice in ), we obtain a mini-batch (parallel) variant of Quartz. There are only a handful of primal-dual stochastic methods which use mini-batching. The first such method was a mini-batch version of SDCA specialized to training $L2$ -regularized linear SVMs with hinge loss . Besides this, two accelerated mini-batch methods have been recently proposed: ASDCA of Shalev-Shwartz and Zhang and SPDC of Zhang and Xiao . The complexity bound of Quartz specialized to the $\tau$ -nice sampling is different, and despite Quartz not being an accelerated method, and can be better in certain regimes (we will do a detailed comparison in Section 4).

Distributed sampling. To the best of our knowledge, no other samplings than those described above were used in stochastic primal-dual methods. However, there are many additional interesting samplings proposed for randomized coordinate descent, but never applied to the primal-dual framework. For instance, we can use the distributed sampling which led to the development of the Hydra algorithm (distributed coordinate descent) and its accelerated variant Hydra2 (Hydra squared) . Using this sampling, Quartz can be efficiently implemented in a distributed environment (partition the examples across the nodes of a cluster, and let each node in each iteration update a random subset of variables corresponding to the examples it owns).

Product sampling. We describe a novel sampling, which we call product sampling, that can be both non-serial and non-uniform. This is the first time such a sampling has been described and and a SDCA-like method using it analyzed. For suitable data (if the examples can be partitioned into several groups no two of which share a feature), this sampling can lead to linear or nearly linear speedup when compared to the serial uniform sampling.

Other samplings. While we develop the analysis of Quartz for an arbitrary sampling, we do not compute the ESO parameters $\{v_{i}\}$ for any other samplings in this paper. However, there are several other interesting choices. We refer the reader to and for further examples of uniform and non-uniform samplings, respectively. All that must be done for any new $\hat{S}$ is to find parameters $v_{i}$ for which (6) holds, and the complexity of the new variant of Quartz is given by (5).

Further data-driven speedup.

Existing mini-batch stochastic primal-dual methods achieve linear speedup up to a certain mini-batch size which depends on $n,\lambda$ and $\gamma$ . Quartz obtains this data-independent speedup, but also obtains further data-driven speedup. This is caused by the fact that Quartz uses more aggressive dual stepsizes, informed by the data through the ESO parameters $\{v_{i}\}$ . The smaller these constants, the better speedup. For instance, we will show that higher data sparsity leads to smaller $\{v_{i}\}$ and hence to better speedup. To illustrate this, consider the $\tau$ -nice sampling (hence, $p_{i}=\tau/n$ for all $i$ ) and the extreme case of perfectly sparse data (each feature $j\in[d]$ appearing in a single example $A_{i}$ ). Then (6) holds with $v_{i}=\lambda_{\text{max}}(A_{i}^{\top}A_{i})$ for all $i$ , and hence the leading term in (5) becomes $n/\tau+\max_{i}\lambda_{\text{max}}(A_{i}^{\top}A_{i})/(\gamma\lambda\tau)$ , predicting perfect speedup in the mini-batch size $\tau$ . We derive theoretical speedup factors and show that these are excellent predictors of actual behavior of the method in an implementation. This was previously observed for the PCDM method (which is not primal-dual).

Quartz vs purely primal and purely dual methods.

In the special case when $\hat{S}$ is the serial uniform sampling, the complexity of Quartz is similar to the bounds recently obtained by several purely primal stochastic and semi-stochastic gradient methods (all having reduced variance of the gradient estimate) such as SAG , SVRG , S2GD , SAGA , mS2GD and MISO . In the case of serial optimal sampling, relevant purely primal methods with similar guarantees are ProxSVRG and S2CD . A mini-batch primal method, mS2GD, was analyzed in , achieving a similar bound to Quartz specialized to the $\tau$ -nice sampling. Purely dual (stochastic coordinate descent) methods with similar bounds to Quartz for both the serial uniform and serial optimal sampling, for problems of varying similarity and generality when compared to (2), include SCD , RCDM , UCDC/RCDC , ICD and RCD . These methods were then generalized to the $\tau$ -nice sampling in SHOTGUN , further generalized to arbitrary uniform samplings in PCDM , SPCDM , APPROX (which is an accelerated method) and to arbitrary (even nonuniform) samplings in NSync . Another accelerated method, BOOM, was proposed in . Distributed randomized coordinate descent methods with purely dual analysis include Hydra and Hydra2 (accelerated variant of Hydra). Quartz specialized to the distributed sampling achieves the same rate as Hydra, but for both the primal and dual problems simultaneously.

General problem.

We consider the problem (1) (and consequently, the associated dual) in a rather general form; most existing primal-dual methods focus on the case when $g$ is a quadratic (e.g., ) or $m=1$ (e.g., ). Lower bounds for a variant of problem (1) were recently established by Agarwal and Bottou .

3 Outline

In Section 2 we describe the algorithm and show that it admits a natural interpretation in terms of Fenchel duality. We also outline the similarities and differences of the primal and dual update steps with SDCA-like methods. In Section 3 we show how parameters $\{v_{i}\}$ satisfying the ESO inequality (6) can be computed for several selected samplings. We then proceed to Section 4 where we state the main result, specialize it to some of the samplings discussed in Section 3. Sections 5 and 6 deal with Quartz specialized to the $\tau$ -nice and distributed sampling, respectively. We also give detailed comparison of our results with existing results for related primal-dual stochastic methods existing in the literature, and analyze theoretical speedup factors. We then provide the proof of the main complexity result in Section 7. In Section 8 we perform numerical experiments on the problem of training $L_{2}$ -regularized linear support vector machine with square and smoothed hinge loss with real datasets. Finally, in Section 9 we conclude.

The Quartz Algorithm

In this section we describe our method (Algorithm 1).

The most important parameter of Quartz is a random sampling $\hat{S}$ of the dual variables $[n]=\{1,2,\dots,n\}$ . That is, $\hat{S}$ is a random subset of $[n]$ , or more precisely, a random set-valued mapping with values being the subsets of $[n]$ . In order to guarantee that each block (dual variable) has a chance to get updated by the method, we necessarily need to make the following assumption.

However, we shall not make any other assumption on $\hat{S}$ . Prior to running the algorithm, we compute positive constants $v_{1},\dots,v_{n}$ satisfying (6)—such constants always exist—as these are used to define the stepsize parameter $\theta$ used throughout:

We shall show how this parameter can be computed for various samplings in Section 3. Let us now formalize the notions of $(1/\gamma)$ -smoothness and strong convexity.

For brevity, the last property is often called $(1/\gamma)$ -smoothness.

It follows that $\phi_{i}^{*}$ is $\gamma$ -strongly convex.

where $\nabla g(w^{\prime})$ is a subgradient of $g$ at $w^{\prime}$ .

2 Description of the method

Quartz starts with an initial pair of primal and dual vectors $(w^{0},\alpha^{0})$ . Given $w^{t-1}$ and $\alpha^{t-1}$ , the method maintains the vector

Initially this is computed from scratch, and subsequently it is maintained in an efficient manner at the end of each iteration.

Let us now describe how the vectors $w^{t}$ and $\alpha^{t}$ are computed. Quartz first updates the primal vector $w^{t}$ by setting it to a convex combination of the previous value $w^{t-1}$ and $\nabla g^{*}(\bar{\alpha}^{t-1})$ :

We then proceed to select, and subsequently update, a random subset $S_{t}\subseteq[n]$ of the dual variables, independently from the sets drawn in previous iterations, and following the distribution of $\hat{S}$ . Clearly, there are many ways in which the distribution of $\hat{S}$ can be chosen, leading the numerous variants of Quartz. We shall describe some of them in Section 3. We allow two options for the actual computation of the dual updates. Once the dual variables are updated, the vector $\bar{\alpha}^{t}$ is updated in an efficient manner so that (9) holds. The entire process is repeated.

By Fenchel-Young inequality, $GAP_{g}(w,\alpha)\geq 0$ and $GAP_{\phi_{i}}(w,\alpha_{i})\geq 0$ for all $i$ , which proves weak duality for the problems (1) and (2), i.e., $P(w)\geq D(\alpha)$ . The pair $(w,\alpha)$ is optimal when both $GAP_{g}$ and $GAP_{\phi_{i}}$ for all $i$ are zero. It is known that this happens precisely when the following optimality conditions hold:

We will now interpret the primal and dual steps of Quartz in terms of the above discussion. At iteration $t$ we first set the primal variable $w^{t}$ to a convex combination of its current value $w^{t-1}$ and a value that would set $GAP_{g}$ to zero: see (10). Hence, our primal update is not as aggressive as that of Prox-SDCA. This is followed by adjusting the dual variables corresponding to a randomly chosen set of examples $S_{t}$ . Under Option II, for each example $i\in S_{t}$ , the $i$ -th dual variable $\alpha_{i}^{t}$ is set to a convex combination of its current value $\alpha_{i}^{t-1}$ and the value that would set $GAP_{\phi_{i}}$ to zero:

Quartz vs Prox-SDCA.

In the special case when $\hat{S}$ is the serial uniform sampling (i.e., $p_{i}=1/n$ for all $i\in[n]$ ), Quartz can be compared to Proximal Stochastic Dual Coordinate Ascent (Prox-SDCA) . Indeed, if Option \@slowromancapi@ is always used in Quartz, then the dual update of $\alpha^{t}$ in Quartz is exactly the same as the dual update of Prox-SDCA (using Option \@slowromancapi@). In this case, the difference between our method and Prox-SDCA lies in the update of the primal variable $w^{t}$ : while Quartz performs the update (10), Prox-SDCA (see also ) performs the more aggressive update $w^{t}=\nabla g^{*}(\bar{\alpha}^{t-1})$ .

Expected Separable Overapproximation

Note that for any proper sampling $\hat{S}$ , there must exist vector $v>0$ satisfying Assumption 4. Hence, this is an assumption that such a vector $v$ is readily available. Indeed, the term on the left is a finite average of convex quadratic functions of $h$ , and hence is a convex quadratic. Moreover, we can write

since then $P\circ A^{\top}A\preceq tD_{p,w}=D_{p,v}.$

In practice, and especially in the big data setting when $n$ is very large, computing $v$ by solving an eigenvalue problem with an $N\times N$ matrix (recall that $N=nm$ ) will be either inefficient or impossible. It is therefore important that a “good” (i.e., small), albeit perhaps suboptimal $v$ can be identified cheaply. In all the cases we consider in this paper, the identification of $v$ can be done during the time the data is being read; or in time roughly equal to a single pass through the data matrix $A$ .

The most studied sampling in literature on stochastic optimization is the serial sampling, which corresponds to the selection of a single block $i\in[n]$ . That is, $|\hat{S}|=1$ with probability 1. The name “serial” is pointing to the fact that a method using such a sampling will typically be a serial (as opposed to being parallel) method; updating a single block (dual variable) at a time.

A serial sampling is uniquely characterized by the vector of probabilities $p=(p_{1},\dots,p_{n})$ , where $p_{i}$ is defined by (7). It turns out that we can find a vector $v>0$ for which (15) holds for any serial sampling, independently of its distribution given by $p$ .

If $\hat{S}$ is a serial sampling (i.e., if $|\hat{S}|=1$ with probability 1), then Assumption 4 is satisfied for

Note that $v_{i}$ is the largest eigenvalue of an $m$ -by- $m$ matrix. If $m$ is relatively small (and in many machine learning applications one has $m=1$ ; as examples are usually vectors and not matrices), then the cost of computing $v_{i}$ is small. If $m=1$ , then $v_{i}$ is simply the squared Euclidean norm of the vector $A_{i}$ , and hence one can compute all of these parameters in one pass through the data (e.g., during loading to memory).

2 Parallel (τ𝜏\tau-nice) sampling

We now consider $\hat{S}$ which selects subsets of $[n]$ of cardinality $\tau$ , uniformly at random. In the terminology established in , such $\hat{S}$ is called $\tau$ -nice. This sampling satisfies $p_{i}=p_{j}$ for all $i,j\in[n]$ ; and hence it is uniform.

This sampling is well suited for parallel computing. Indeed, Quartz could be implemented as follows. If we have $\tau$ processors available, then at the beginning of iteration $t$ we can assign each block (dual variable) in $S_{t}$ to a dedicated processor. The processor assigned to $i$ would then compute $\Delta\alpha_{i}^{t}$ and apply the update. If all processors have fast access to the memory where all the data is stored, as is the case in a shared-memory multicore workstation, then this way of assigning workload to the individual processors does not cause any major problems. Depending on the particular computer architecture and the size $m$ of the blocks (which will influence processing time), it may be more efficient to chose $\tau$ to be a multiple of the number of processors available, in which case in each iteration every processor updates more than one block.

The following lemma gives a closed-form formula for parameters $\{v_{i}\}$ for which the ESO inequality holds.

If $\hat{S}$ is a $\tau$ -nice sampling, then Assumption 4 is satisfied for

where for each $j\in[d]$ , $\omega_{j}$ is the number of nonzero blocks in the $j$ -th row of matrix $A$ , i.e.,

Proof In the $m=1$ case the result follows from Theorem 1 in . Extension to the $m>1$ case is straightforward.

Note that $v_{i}$ is the largest eigenvalue of an $m$ -by- $m$ matrix which is formed as the sum of $d$ rank-one matrices. The formation of all of these $n$ matrices takes time proportional to the number of nonzeros in $A$ (if the data is stored in a sparse format). Constants $\{\omega_{j}\}$ can be computed by scanning the data once (e.g., during loading-to-memory phase). Finally, one must compute $n$ eigenvalue problems for matrices of size $m\times m$ . In most applications, $m=1$ , so there is no more work to be done. If $m>1$ , the cost of computing these eigenvalues would be small.

While for $\tau=1$ it was easy to find parameters $\{v_{i}\}$ for any sampling (and hence, as we will see, it will be easy to find an optimal sampling), this is not the case in the $\tau>1$ case. The task is in general a difficult optimization problem. For some work in this direction we refer the reader to .

3 Product sampling

Consider the following example with $m=1,n=5$ and $d=4$ :

If we choose $\tau=2$ and $X_{1}=\{1,2\},X_{2}=\{3,4,5\}$ , then no row of $A$ has a nonzero in both a column belonging to $X_{1}$ and a column belonging to $X_{2}$ .

With each $i\in[n]$ we now associate $l_{i}\in[\tau]$ such that $i\in X_{l_{i}}$ and define:

The product sampling $\hat{S}$ is obtained by choosing $S\in{\cal S}$ , uniformly at random; that is, via:

Hence the sampling is nonuniform as long as not all of the sets $X_{l}$ have the same cardinality. We next show that the product sampling $\hat{S}$ defined as above allows the same stepsize parameter $v_{i}$ as the serial uniform sampling.

Under the group separability assumption, Assumption 4 is satisfied for the product sampling $\hat{S}$ and

Proof For each $j\in[d]$ , denote by $A_{j:}$ the $j$ -th row of the matrix $A$ and $\Omega_{j}$ the column index set of nonzero blocks in $A_{j:}$ : $\Omega_{j}\stackrel{{\scriptstyle\text{def}}}{{=}}\{i\in[n]:A_{ji}\neq 0\}.$ For each $l\in[\tau]$ , define:

In words, $J_{l}$ is the set of features associated with the examples in $X_{l}$ . By the group separability assumption, $J_{1},\dots,J_{\tau}$ forms a partition of $[d]$ , namely,

Since $X_{1},\dots,X_{\tau}$ forms a partition of $[n]$ , then any two indexes belonging to the same subset $X_{l}$ will never be selected simultaneously in $\hat{S}$ , i.e.,

4 Distributed sampling

We now describe a sampling which is particularly suitable for a distributed implementation of Quartz. This sampling was first proposed in and later used in , where the distributed coordinate descent algorithm Hydra and its accelerated variant Hydra2 were proposed and analyzed, respectively. Both methods were shown to be able to scale up to huge problem sizes (tests were performed on problem sizes of several TB; and up 50 billion dual variables in size).

Consider a distributed computing environment with $c$ nodes/computers. For simplicity, assume that $n$ is an integer multiple of $c$ and let the blocks $\{1,2,\dots,n\}$ be partitioned into $c$ sets of equal size: ${\cal P}_{1}$ , ${\cal P}_{2}$ , …, ${\cal P}_{c}$ . We assign partition ${\cal P}_{l}$ to node $l$ . The data $A_{1},\dots,A_{n}$ and the dual variables (blocks) $\alpha_{1},\dots,\alpha_{n}$ are partitioned accordingly and stored on the respective nodes.

At each iteration, all nodes $l\in\{1,\dots,c\}$ in parallel pick a subset $\hat{S}_{l}$ of $\tau$ dual variables from those they own, i.e., from $\mathcal{P}_{l}$ , uniformly at random. That is, each node locally performs a $\tau$ -nice sampling, independently from the other nodes. Node $l$ computes the updates to the dual variables $\alpha_{i}$ corresponding to $i\in S_{l}$ , and locally stores them. Hence, in a single distributed iteration, Quartz updates the dual variables belonging to the set $\hat{S}\stackrel{{\scriptstyle\text{def}}}{{=}}\cup_{l=1}^{c}\hat{S}_{l}$ . This defines a sampling, which we will call $(c,\tau)$ -distributed sampling.

Of course, there are other important considerations pertaining to the distributed implementation of Quartz, but we do not discuss them here as the focus of this section is on the sampling. However, it is possible to design a distributed communication protocol for the update of the primal variable.

The following result gives a formula for admissible parameters $\{v_{i}\}$ .

If $\hat{S}$ is a $(c,\tau)$ -distributed sampling, then Assumption 4 is satisfied for

where $\omega_{j}$ is the number of nonzero blocks in the $j$ -th row of the matrix $A$ , as defined previously in (18), and $\omega^{\prime}_{j}$ is the number of partitions ”active” at row $j$ of $A$ , more precisely,

Proof When $m=1$ , the result is equivalent to Theorem 4.1 in . The extension to blocks ( $m>1$ ) is straightforward.

Lemma 6 is a special case of Lemma 8 when only a single node ( $c=1$ ) is used, in which case $\omega_{j}^{\prime}=1$ for all $j\in[d]$ . Lemma 8 also improves the constants $\{v_{i}\}$ derived in , where instead of $\omega_{j}$ and $\omega_{j}^{\prime}$ in (26) one has $\max_{j}\omega_{j}$ and $\max_{j}\omega_{j}^{\prime}$ .

Lemma 8 is expressed in terms of certain sparsity parameters associated with the data ( $\{\omega_{j}\}$ ) and the partitioning ( $\{\omega_{j}^{\prime}\}$ ). However, it is possible to derive alternative ESO results for the $(c,\tau)$ -distributed sampling. For instance, one can instead express the parameters $\{v_{j}\}$ without any sparsity assumptions, using only spectral properties of the data only. We have not included these results here, but in the $m=1$ case such results have been derived in . It is possible to adopt them to the $m=1$ case as we have done it with Lemma 8.

Main Result

The complexity of our method is given by the following theorem.

Let Assumption 2 ( $\phi_{i}$ are $(1/\gamma)$ -smooth) and Assumption 3 ( $g$ is 1-strongly convex) be satisfied. Let $\hat{S}$ be a proper sampling (Assumption 1) and $v_{1},\dots,v_{n}$ be positive scalars satisfying Assumption 4. Then the sequence of primal and dual variables $\{w^{t},\alpha^{t}\}_{t\geq 0}$ of Quartz (Algorithm 1) satisfies:

In particular, if we fix $\epsilon\leq P(w^{0})-D(\alpha^{0})$ , then for

A result of a similar flavour but for a different problem and not in a primal-dual setting has been established in , where the authors analyze a parallel coordinate descent method, NSync, also with an arbitrary sampling, for minimizing a strongly convex function under an ESO assumption.

In the rest of this section we will specialize the above result to a few selected samplings. We then devote two separate sections to Quartz specialized to the $\tau$ -nice sampling (Section 5) and Quartz specialized to the $(c,\tau)$ -distributed sampling (Section 6 – as we do a more detailed analysis of the results in these two cases.

We first look at the special case when $\hat{S}$ is the uniform serial sampling, i.e., when $p_{i}=1/n$ for all $i\in[n]$ .

Assume that at each iteration of Quartz we update only one dual variable uniformly at random and use $v_{i}=\lambda_{\max}(A_{i}^{\top}A_{i})$ for all $i\in[n]$ . If we let $\epsilon\leq P(w^{0})-D(\alpha^{0})$ and

Proof The result follows by combining Lemma 5 and Theorem 9.

Corollary 10 should be compared with Theorem 5 in (covering the L2-regularized case) and Theorem 1 in (covering the case of general $g$ ). They obtain the rate

where $\alpha^{*}$ is the dual optimal solution. Notice that the dominant terms in the two rates exactly match, although our logarithmic term is better and simpler.

2 Quartz with optimal serial sampling (importance sampling)

According to Lemma 5, the parameter $v$ for a serial sampling $\hat{S}$ is determined by (16) and is independent of the distribution of $\hat{S}$ . We can then seek to maximize the quantity $\theta$ in (29) to obtain the best bound. A simple calculation reveals that the optimal probability is given by:

Using this sampling, we obtain the following iteration complexity bound, which is an improvement on the bound for uniform probabilities (31).

Assume that at each iteration of Quartz we update only one dual variable at random according to the probability $p^{*}$ defined in (32) and use $v_{i}=\lambda_{\max}(A_{i}^{\top}A_{i})$ for all $i\in[n]$ . If we let $\epsilon\leq P(w^{0})-D(\alpha^{0})$ and

Note that in contrast with the serial uniform sampling, we now have dependence on the average of the eigenvalues. The above result should be compared with the complexity result of Iprox-SDCA :

where $\alpha^{*}$ is the dual optimal solution. Again, the dominant terms in the two rates exactly match, although our logarithmic term is better and simpler.

3 Quartz with product sampling

In this section we apply Theorem 9 to the case when $\hat{S}$ is the product sampling (see the description in Section 3.3). All the notation we use here was established there.

Under the group separability assumption, let $\hat{S}$ be the product sampling and let $v_{i}=\lambda_{\max}(A_{i}^{\top}A_{i})$ for all $i\in[n]$ . If we fix $\epsilon\leq P(w^{0})-D(\alpha^{0})$ and

Proof The proof follows directly from Theorem 9, Lemma 7 and (20). Recall from Section 3.3 that the product sampling $\hat{S}$ has cardinality $\tau\geq 1$ and is non-uniform as long as all the sets $\{X_{1},\dots,X_{\tau}\}$ do not have the same cardinality. To the best of our knowledge, Corollary 12 is the first explicit complexity bound of stochastic algorithm using non-serial and non-uniform sampling for composite convex optimization problem (the paper only deals with smooth functions and the method is not primal-dual), albeit under the group separability assumption.

Let us compare the complexity bound with the serial uniform case (Corollary 10):

Quartz with τ𝜏\tau-nice Sampling (standard mini-batching)

We now specialize Theorem 9 to the case of the $\tau$ -nice sampling.

Assume $\hat{S}$ is the $\tau$ -nice sampling and $v$ is chosen as in (17). If we let $\epsilon\leq P(w^{0})-D(\alpha^{0})$ and

Proof The result follows by combining Lemma 6 and Theorem 9.

Let us now have a detailed look at the above result; especially in terms of how it compares with the serial uniform case (Corollary 10). We do this comparison in Table 1. For fully sparse data, we get perfect linear speedup: the bound in the second line of Table 1 is a $1/\tau$ fraction of the bound in the first line. For fully dense data, the condition number ( $\kappa\stackrel{{\scriptstyle\text{def}}}{{=}}\max_{i}\lambda_{\text{max}}(A_{i}^{\top}A_{i})/(\gamma\lambda)$ ) is unaffected by mini-batching/parallelization. Hence, linear speedup is obtained if $\kappa=O(n/\tau)$ . For general data, the behaviour of Quartz with $\tau$ -nice sampling interpolates these two extreme cases. That is, $\kappa$ gets multiplied by a quantity between $1/\tau$ (fully sparse case) and $1$ (fully dense case). It is convenient to write this factor in the form

For simplicity of exposition, let us now assume that $\lambda_{\text{max}}(A_{i}^{\top}A_{i})=1$ . We will now study the theoretical speedup factor, defined as:

2 Quartz vs existing primal-dual mini-batch methods

We now compare the above result with existing mini-batch stochastic dual coordinate ascent methods. A mini-batch variant of SDCA, to which Quartz with $\tau$ -nice sampling can be naturally compared, has been proposed and analyzed previously in , and . In , the authors proposed to use a so-called safe mini-batching, which is precisely equivalent to finding the stepsize parameter $v$ satisfying Assumption 4 (in the special case of $\tau$ -nice sampling). However, they only analyzed the case where the functions $\phi_{i}$ are non-smooth. In , the authors studied accelerated mini-batch SDCA (ASDCA), specialized to the case when the regularizer $g$ is the squared L2 norm. They showed that the complexity of ASDCA interpolates between that of SDCA and accelerated gradient descent (AGD) through varying the mini-batch size $\tau$ . In , the authors proposed a mini-batch extension of their stochastic primal-dual coordinate algorithm (SPDC). Both ASDCA and SPDC reach the same complexity as AGD when the mini-batch size equals to $n$ , thus should be considered as accelerated algorithms. The complexity bounds for all these algorithms are summarized in Table 2. To facilitate the comparison, we assume that $\max_{i}\lambda_{\max}(A_{i}^{\top}A_{i})=1$ (since the analysis of ASDCA assumes this). In Table 3 we compare the complexities of SDCA, ASDCA, SPDC and Quartz in several regimes. We have used Lemma 14 to simplify the bounds for Quartz.

Indeed, we have the following result, which can be interpreted as follows: if $\kappa\leq\tau n/4$ (that is, $\lambda\gamma\tau n\geq 4$ ), then there are sparse-enough problems for which Quartz is better than both ASDCA and SPDC.

Assume that (38) holds and that $\max_{i}\lambda_{\text{max}}(A_{i}^{\top}A_{i})=1$ . Then if the data is sufficiently sparse so that

Proof As long as $\lambda\gamma\tau n\geq 1$ , which holds under our assumption, the iteration complexity of ASDCA is:

which is already less than that of SPDC. Moreover,

Quartz with Distributed Sampling

In this section we apply Theorem 9 to the case when $\hat{S}$ is the $(c,\tau)$ -distributed sampling; see the description of this sampling in Section 3.4.

Assume that $\hat{S}$ is a $(c,\tau)$ -distributed sampling and $v$ is chosen as in (26). If we let $\epsilon\leq P(w^{0})-D(\alpha^{0})$ and

Proof If $\hat{S}$ is a $(c,\tau)$ -distributed sampling, then

It now only remains to combine Theorem 9 and Lemma 8.

The expression (41) involves $\omega_{j}^{\prime}$ , which depends on the partitioning $\{{\cal P}_{1}$ , ${\cal P}_{2}$ , …, ${\cal P}_{c}\}$ of the dual variable and the data. The following lemma says that the effect of the partition is negligible, and in fact vanishes as $\tau$ increases. It was proved in [5, Lemma 5.2].

If $n/c\geq 2$ and $\tau\geq 2$ , then for all $j\in[d]$ , we have

According to this result, when each node owns at least two dual examples ( $n/c\geq 2$ ) and picks and updates at least two examples in each iteration ( $\tau\geq 2$ ), then

where $\hat{\omega}\in[1,n]$ is an average sparsity measure similar to that one we introduced in the study of $\tau$ -nice sampling. This bound is similar to that we obtained for the $\tau$ -nice sampling; and can be interpreted in an analogous way. Note that as the first term ( $n$ ) receives perfect mini-batch scaling (it is divided by $c\tau$ ), while the condition number $\max_{i}\lambda_{\text{max}}(A_{i}^{\top}A_{i})/(\lambda\gamma)$ is divided by $c\tau$ but also multiplied by $\left(1+\frac{1}{\tau-1}\right)\left(1+\frac{(\tau-1)(\hat{\omega}-1)}{n/c-1}\right)$ . However, this term is bounded by $2\hat{\omega}$ , and hence if $\hat{\omega}$ is small, the condition number also receives a nearly perfect mini-batch scaling.

A distributed variant of SDCA, named DisDCA, has been proposed in and analyzed in . The authors of proposed a basic DisDCA variant (which was analyzed) and a practical DisDCA variant (which was not analyzed). The complexity of basic DisDCA was shown to be:

where $\alpha^{*}$ is an optimal dual solution. Note that this rate is much worse than our rate. Ignoring the logarithmic terms, while the first expression $n/(c\tau)$ is the same in both results, if we replace all $\omega_{j}$ by the upper bound $n$ and all $\omega_{j}^{\prime}$ by the upper bound $c$ in (41), then

Therefore, the dominant term in (40) is a strict lower bound of that in (43). Moreover, it is clear that the gap between (40) and (43) is large when the data is sparse. For instance, in the perfectly sparse case with $\hat{\omega}=1$ , the bound (42) for Quartz becomes

2 Theoretical speedup factor

In analogy with the discussion in Section 5.1, we shall now analyze the theoretical speedup factor $T(1,1)/T(c,\tau)$ measuring the multiplicative amount by which Quartz specialized to the $(c,\tau)$ -distributed sampling is better than Quartz specialized to the serial uniform sampling.

In Section 5, we have seen how the speedup factor increases with $\tau$ when a mini-batch of examples is used at each iteration following the $\tau$ -nice sampling. As we have discussed before, this sampling is not particularly suitable for a distributed implementation (unless $\tau=n$ ; which in the big data setting where $n$ is very large may be asking for many more cores/threads that are available). This is because the implementation of updates using this sampling would either result in frequently idle nodes, or in increased data transfer.

Often the data matrix $A$ is too large to be stored on a single node, or limited number of threads/cores are available per node. We then want to implement Quartz in a distributed way ( $c>1$ ). It is therefore necessary to understand how the speedup factor compares to the hypothetical situation in which we would have a large machine where all data could be stored (we ignore communication costs here) and hence a $c\tau$ -nice sampling could be implemented. That is, we are interested in comparing $T(c,\tau)$ (distributed implementation) and $T(1,c\tau)$ (hypothetical computer). If for simplicity of exposition we assume that $\lambda_{\text{max}}(A_{i}^{\top}A_{i})=1$ , it is possible to argue that if $c\tau\leq n$ , then

In Figure 2 we plot the contour lines of the theoretical speedup factor in a log-log plot with axes corresponding to $\tau$ and $c$ . The contours are nearly perfect straight lines, which means that the speedup factor is approximately constant for those pairs $(c,\tau)$ for which $c\tau$ is the same. In particular, this means that (44) holds. Note that better speedup is obtained for sparse data then for dense data. However, in all plots we have chosen $\gamma=1$ and $\lambda=1/\sqrt{n}$ ; and hence we expect data independent linear speedup up to $c\tau=\Theta(\sqrt{n})$ – a special line is depicted in all three plots which defines this contour.

Proof of the Main Result

In this section we prove our main result (Theorem 9). In order to make the analysis more transparent, we will first establish three auxiliary results.

In the next lemma we give an expected separable overapproximation of the convex function $-D$ .

where $f$ and $\psi$ are defined in (3) and (4). Now we apply Lemma 18 and (15) to bound the first term:

Moreover, since $\psi$ is block separable, we can write

Our last auxiliary result is a technical lemma for further bounding the right hand side in Lemma 19.

where $\bar{\alpha}=\frac{1}{\lambda n}A\alpha$ .

Proof Recall from (3) that $f(\alpha)=\lambda g^{*}(\bar{\alpha})$ and hence $\nabla f(\alpha)=\frac{1}{n}A^{\top}\nabla g^{*}(\bar{\alpha})$ . Thus,

Since the functions $\phi_{i}$ are $(1/\gamma)$ -smooth, the conjugate functions $\phi_{i}^{*}$ must be $\gamma$ -strongly convex. Therefore,

It remains to notice that for $\theta$ defined in (29), we have:

2 Proof of Theorem 9

and $\kappa^{t}=(\kappa_{1}^{t},\cdots,\kappa_{n}^{t})$ by:

If we use Option \@slowromancapi@ in Algorithm 1, then $\alpha^{t}=\alpha^{t-1}+\kappa^{t}_{[\hat{S}]}$ . If we use Option \@slowromancapii@ in Algorithm 1, then we have $\alpha^{t}=\alpha^{t-1}+h^{t}_{[\hat{S}]}$ . In both cases, by Lemma 19:

We now apply Lemma 20 to further bound the last term and obtain:

Note that $\theta\nabla g^{*}(\bar{\alpha}^{t-1})=w^{t}-(1-\theta)w^{t-1}$ and $\phi_{i}^{*}(\nabla\phi_{i}(A_{i}^{\top}w^{t}))=\langle\nabla\phi_{i}(A_{i}^{\top}w^{t}),A_{i}^{\top}w^{t}\rangle-\phi_{i}(A_{i}^{\top}w^{t})$ . Finally, we plug these two inequalities into (53) and obtain:

where the last inequality follows from the convexity of $\phi_{i}$ .

Experimental Results

In and , the reader can find an extensive list of popular machine learning problems to which Prox-SDCA can be applied. Sharing the same primal-dual formulation, our algorithm can also be specified and applied to those applications, including Ridge regression, SVM, Lasso, logistic regression and multiclass prediction.

We focus our numerical experiments on the L2-regularized linear SVM problem with smoothed hinge loss or squared hinge loss. These problems are described in detail in Section 8.1. The three main messages that we draw from the numerical experiments are:

Importance sampling does improve the convergence for certain datasets;

Quartz specialized to serial samplings is comparable to Prox-SDCA in practice;

The theoretical speedup factor is an almost exact predictor of the actual speedup (in terms of iteration complexity).

We performed the experiments on several real world large datasets, of various dimensions $n$ , $d$ and sparsity. The details of the dataset characteristics are provided in Table 4. In all our experiments we used Option \@slowromancapi@, which we found to be better in practice.

We specify Quartz to the linear Support Vector Machine (SVM) problem with smoothed hinge loss and $L_{2}$ regularizer:

We now specify Quartz to the linear Support Vector Machine (SVM) problem with squared hinge loss based and $L_{2}$ regularizer:

Note that $\phi_{i}$ defined by (56) is $1/\gamma$ -smooth and $g$ defined by (57) is 1-strongly convex. In this special case, Option \@slowromancapi@ in Algorithm 1 has a closed form solution:

2 Quartz and SDCA for uniform and importance sampling

In this section we compare four algorithms:

Quartz-U: Quartz specialized to uniform serial sampling;

Prox-SDCA : proximal stochastic dual coordinate ascent with uniform sampling;

Quartz-IP: Quartz specialzed to importance sampling;

Iprox-SDCA : proximal stochastic dual coordinate ascent with importance sampling;

on three datasets: cov1, w8a and ijcnn1. We consider the $L2$ -regularized linear SVM problem using squared hinge loss, as described in Section 8.1. The value of $\gamma$ is set to be 1 and the value of $\lambda$ varies between the datasets: $10^{-5}$ for w8a and ijcnn1 and $10^{-6}$ for cov1, whose number of training examples $n$ is 10 times larger than the other datasets. The results are shown in Figure 3.

If we compare Quartz-U with Quartz-IP, it is clear that importance sampling provides better convergence rate than uniform sampling on the datasets that we tested.

Similarity between Quartz-IP and Iprox-SDCA.

In all the experiments, Quartz-IP shows an almost identical convergence behaviour to that of Iprox-SDCA.

Conservative primal update in Quartz.

While Quartz-IP has the same practical convergence rate as Iprox-SDCA, Quartz-U appears to be somewhat slower than Prox-SDCA in practice. One possible explanation is that the primal update in Quartz,

is too conservative. Indeed, since the optimal solution satisfies $w^{*}=\nabla g^{*}(\bar{\alpha}^{*})$ , larger $\theta$ leads to faster convergence on the primal problem when the dual variable $\alpha^{t-1}$ is close to the optimal solution $\alpha^{*}$ . To confirm this, we tested two more aggressive primal update rules: Quartz-10 $\theta$ and Quartz-100 $\theta$ which change the primal update to:

respectively. The results are displayed in Figure 3(d), 3(e), 3(f), 3(g), 3(h) and 3(i). It is clear that with just a slightly more aggressive primal update rule than the one sanctioned by our theory, Quartz-U achieves similar practical convergence as Prox-SDCA. Recall that the primal update in Prox-SDCA is $w^{t}=\nabla g^{*}(\bar{\alpha}^{t-1})$ . Notice also that the parameter $\theta$ defined by (29) is less than $1/n$ , hence close to 0. Therefore, there is still a difference in the primal update rules between Quartz-100 $\theta$ and Prox-SDCA.

3 Mini-batch experiments

In this section we demonstrate that the theoretical speedup factor of Quartz specialized to sampling $\hat{S}$ is a very good predictor of the practical speedup factor, defined as:

We focus on the problem of training $L2$ -regularized linear SVMs with smoothed hinge loss $(\gamma=1)$ , described in Section 8.1. In the experiments we chose $\epsilon=10^{-11}$ .

In Figure 4 we plot the speedup factors for Quartz specialized to the $\tau$ -nice sampling on three different datasets: astro_ph, CCAT and cov1, and for several values of $\lambda$ . We observe that the practical speedup factor follows the theoretical prediction. Moreover, note that the largest $\lambda$ that we choose for each dataset is to have roughly

so that linear speedup is reached for all $\tau\leq\sqrt{n}$ , regardless of data sparsity.

In Figure 5 we present contour lines of the theoretical and practical speedup factors, for Quartz specialized to the $(c,\tau)$ -nice sampling on the webspam dataset. We believe it is remarkable that the theoretical predictions are so accurate. Moreover, recall from the discussion in Section 6.2 that $T(c,\tau)$ is almost constant along the contour lines of $c\tau$ ; this is why we see nearly straight lines in the log-log plot. This feature is observed here for the real dataset also.

Conclusion

In this paper we have developed and analyzed a novel stochastic primal-dual algorithm—Quartz—for solving problems (1) and (2). This is the second stochastic method which allows an arbitrary sampling (see ) and the first primal-dual stochastic method with arbitrary sampling. This flexibility allows for many interesting variants of Quartz, including serial, parallel and distributed versions. The distributed variant of Quartz is the first distributed SDCA-like method with strong theoretical convergence bounds.

In Table 5 we highlight selected characteristics of existing primal-dual stochastic methods.

Unlike some of the existing methods, our method is not accelerated. We leave the development of an accelerated Quartz method for future research.