Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization

Yuchen Zhang, Lin Xiao

Introduction

We are especially interested in developing efficient algorithms for solving problem (1) when the number of samples $n$ is very large. In this case, evaluating the full gradient or subgradient of the function $P(x)$ is very expensive, so incremental methods that operate on a single component function $\phi_{i}$ at each iteration can be very attractive. There has been extensive research on incremental (sub)gradient methods (e.g. ) as well as variants of the stochastic gradient method (e.g., ). While the computational cost per iteration of these methods is only a small fraction, say $1/n$, of that of the batch gradient methods, their iteration complexities are much higher (it takes many more iterations for them to reach the same precision). In order to better quantify the complexities of various algorithms and position our contributions, we need to make some concrete assumptions and introduce the notions of condition number and batch complexity.

Let $\gamma$ and $\lambda$ be two positive real parameters. We make the following assumption:

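Assumption A. Each $\phi_{i}$ is convex and differentiable, and its derivative is $(1/\gamma)$-Lipschitz continuous (equivalently, $\phi_{i}$ is $(1/\gamma)$-smooth), i.e.,

$$|\phi_{i}'(\alpha)-\phi_{i}'(\beta)|\leq(1/\gamma)|\alpha-\beta|,\quad\forall\,\alpha,\beta\in\mathbb{R},\quad i=1,\ldots,n.$$

In addition, the regularization function $g$ is $\lambda$-strongly convex, i.e.,

$$g(y)\geq g(x)+g'(x)^{T}(y-x)+\frac{\lambda}{2}\|y-x\|_{2}^{2},\quad\forall\,g'(x)\in\partial g(x),\quad x,y\in\mathbb{R}^{d}.$$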

Under Assumption A, the gradient of each component function, $\nabla\phi_{i}(a_{i}^{T}x)$, is also Lipschitz continuous, with Lipschitz constant $L_{i}=\|a_{i}\|_{2}^{2}/\gamma\leq R^{2}/\gamma$, where $R=\max_{i}\|a_{i}\|_{2}$. In other words, each $\phi_{i}(a_{i}^{T}x)$ is $(R^{2}/\gamma)$-smooth. We define a condition number

$$\kappa=R^{2}/(\lambda\gamma),\tag{2}$$

and focus on ill-conditioned problems where $\kappa\gg 1$ . In the statistical learning context, the regularization parameter $\lambda$ is usually on the order of $1/\sqrt{n}$ or $1/n$ (e.g., ), thus $\kappa$ is on the order of $\sqrt{n}$ or $n$ . It can be even larger if the strong convexity in $g$ is added purely for numerical regularization purposes (see Section 3). We note that the actual conditioning of problem (1) may be better than $\kappa$ , if the empirical loss function $(1/n)\sum_{i=1}^{n}\phi_{i}(a_{i}^{T}x)$ by itself is strongly convex. In those cases, our complexity estimates in terms of $\kappa$ can be loose (upper bounds), but they are still useful in comparing different algorithms for solving the same given problem.

To make fair comparisons with batch methods, we measure the complexity of stochastic or incremental gradient methods in terms of the number of equivalent passes over the dataset required to reach an expected precision $\epsilon$. We call this measure the batch complexity; it is usually obtained by dividing the iteration complexity by $n$. For example, the batch complexity of the stochastic gradient method is $\mathcal{O}(\kappa/(n\epsilon))$. The batch complexities of full gradient methods are the same as their iteration complexities.

By carefully exploiting the finite average structure in (1) and other similar problems, several recent works proposed new variants of the stochastic gradient or dual coordinate ascent methods and obtained the iteration complexity $\mathcal{O}((n+\kappa)\log(1/\epsilon))$. Since their computational cost per iteration is $\mathcal{O}(d)$, the equivalent batch complexity is $1/n$ of their iteration complexity, i.e., $\mathcal{O}((1+\kappa/n)\log(1/\epsilon))$. This complexity has much weaker dependence on $n$ than the full gradient methods, and also much weaker dependence on $\epsilon$ than the stochastic gradient methods.

In this paper, we propose a stochastic primal-dual coordinate (SPDC) method, which has the batch complexity

$$\mathcal{O}\bigl((1+\sqrt{\kappa/n})\log(1/\epsilon)\bigr).\tag{3}$$

When $\kappa>n$ , this is lower than the $\mathcal{O}((1+\kappa/n)\log(1/\epsilon))$ batch complexity mentioned above. Indeed, it is very close to a lower bound for minimizing finite sums recently established in .
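To make the comparison concrete, here is a rough worked example with hypothetical values $n=10^{6}$ and $\kappa=10^{8}$ (chosen only for illustration; constants and the common $\log(1/\epsilon)$ factor are ignored). It contrasts the $1+\kappa/n$ factor above with the $1+\sqrt{\kappa/n}$ factor of SPDC and the $1+\sqrt{\kappa}$ factor of accelerated full gradient methods:

```latex
\begin{align*}
n = 10^{6},\quad \kappa = 10^{8}
  \quad &\Longrightarrow \quad \kappa/n = 10^{2},\\
1 + \kappa/n &= 101,\\
1 + \sqrt{\kappa/n} &= 11,\\
1 + \sqrt{\kappa} &= 10^{4} + 1 .
\end{align*}
```

Under these assumed values, the SPDC factor is roughly nine times smaller than the $1+\kappa/n$ factor and roughly $\sqrt{n}=10^{3}$ times smaller than the full-gradient factor.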

1.2 Outline of the paper

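Our approach is based on reformulating problem (1) as a convex-concave saddle point problem, and then devising a primal-dual algorithm to approximate the saddle point. More specifically, we replace each component function $\phi_{i}(a_{i}^{T}x)$ through convex conjugation, i.e.,

$$\phi_{i}(a_{i}^{T}x)=\sup_{y_{i}\in\mathbb{R}}\bigl\{y_{i}\langle a_{i},x\rangle-\phi_{i}^{*}(y_{i})\bigr\},$$

where $\phi_{i}^{*}(y_{i})=\sup_{\alpha\in\mathbb{R}}\{\alpha y_{i}-\phi_{i}(\alpha)\}$ is the convex conjugate of $\phi_{i}$. This leads to the convex-concave saddle point problem

$$\min_{x\in\mathbb{R}^{d}}\ \max_{y\in\mathbb{R}^{n}}\ \Bigl\{f(x,y)=\frac{1}{n}\sum_{i=1}^{n}\bigl(y_{i}\langle a_{i},x\rangle-\phi_{i}^{*}(y_{i})\bigr)+g(x)\Bigr\}.\tag{4}$$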

Under Assumption A, each $\phi^{*}_{i}$ is $\gamma$ -strongly convex (since $\phi_{i}$ is $(1/\gamma)$ -smooth; see, e.g., [15, Theorem 4.2.2]) and $g$ is $\lambda$ -strongly convex. As a consequence, the saddle point problem (4) has a unique solution, which we denote by $(x^{\star},y^{\star})$ .

In Section 2, we present the SPDC method as well as its convergence analysis. It alternates between maximizing $f$ over a randomly chosen dual coordinate $y_{i}$ and minimizing $f$ over the primal variable $x$ . In order to accelerate the convergence, an extrapolation step is applied in updating the primal variable $x$ . We also give a mini-batch SPDC algorithm which is well suited for parallel computing.

In Section 3 and Section 4, we present two extensions of the SPDC method. We first explain how to solve problem (1) when Assumption A does not hold. The idea is to apply small regularizations to the saddle point function so that SPDC can still be applied, which results in accelerated sublinear rates. The second extension is an SPDC method with non-uniform sampling. The batch complexity of this algorithm has the same form as (3), but with $\kappa=\bar{R}^{2}/(\lambda\gamma)$, where $\bar{R}=\frac{1}{n}\sum_{i=1}^{n}\|a_{i}\|$, which can be much smaller than $R=\max_{i}\|a_{i}\|$ if there is considerable variation in the norms $\|a_{i}\|$.

In Section 5, we discuss related work. In particular, the SPDC method can be viewed as a coordinate-update extension of the batch primal-dual algorithm developed by Chambolle and Pock . We also discuss two very recent works that achieve the same batch complexity (3).

In Section 7, we present experimental results comparing SPDC with several state-of-the-art optimization methods, including both batch algorithms and randomized incremental and coordinate gradient methods. In all scenarios we tested, SPDC has comparable or better performance.

The SPDC method

We give the details of the SPDC method in Algorithm 1. The dual coordinate update and primal vector update are given in equations (7) and (8) respectively. Instead of maximizing $f$ over $y_{k}$ and minimizing $f$ over $x$ directly, we add two quadratic regularization terms to penalize deviations of $y^{(t+1)}_{k}$ and $x^{(t+1)}$ from $y^{(t)}_{k}$ and $x^{(t)}$. The parameters $\sigma$ and $\tau$ control the regularization strength, and we specify them in the convergence analysis (Theorem 1). Moreover, we introduce two auxiliary variables $u^{(t)}$ and $\overline{x}^{(t)}$. From the initialization $u^{(0)}=(1/n)\sum_{i=1}^{n}y^{(0)}_{i}a_{i}$ and the update rules (7) and (9), we have

$$u^{(t)}=\frac{1}{n}\sum_{i=1}^{n}y_{i}^{(t)}a_{i}.$$

Equation (10) obtains $\overline{x}^{(t+1)}$ based on extrapolation from $x^{(t)}$ and $x^{(t+1)}$. This step is similar to Nesterov’s acceleration technique [25, Section 2.2], and yields a faster convergence rate.
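The following is a minimal sketch of the basic SPDC iteration, specialized to ridge regression with $\phi_{i}(z)=\frac{1}{2}(z-b_{i})^{2}$ (so $\gamma=1$ and $\phi_{i}^{*}(\beta)=\frac{1}{2}\beta^{2}+b_{i}\beta$) and $g(x)=\frac{\lambda}{2}\|x\|_{2}^{2}$, for which both the dual and the primal steps have closed forms. The function names are hypothetical, and the step-size formulas for `tau`, `sigma` and `theta` are assumptions (chosen so that $\tau\sigma=1/(4R^{2})$), not values quoted from the text.

```python
import math
import random


def spdc_ridge(A, b, lam, num_iters, seed=0):
    """Basic SPDC sketch (one dual coordinate per iteration) for ridge regression.

    Minimizes (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2 + (lam / 2) * ||x||^2.
    """
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    gamma = 1.0  # phi_i is 1-smooth for the squared loss
    R = max(math.sqrt(sum(v * v for v in row)) for row in A)

    # Assumed parameter choices (not quoted from the text).
    tau = math.sqrt(gamma / (n * lam)) / (2.0 * R)
    sigma = math.sqrt(n * lam / gamma) / (2.0 * R)
    theta = 1.0 - 1.0 / (n + R * math.sqrt(n / (lam * gamma)))

    x = [0.0] * d
    x_bar = x[:]
    y = [0.0] * n
    u = [0.0] * d  # u = (1/n) * sum_i y_i * a_i, which is zero for y = 0

    for _ in range(num_iters):
        k = rng.randrange(n)
        a_k = A[k]
        # Dual step: closed-form maximizer of
        # beta * <a_k, x_bar> - phi_k^*(beta) - (beta - y_k)^2 / (2 * sigma).
        inner = sum(a * xb for a, xb in zip(a_k, x_bar))
        y_new = (sigma * (inner - b[k]) + y[k]) / (sigma + 1.0)
        dy = y_new - y[k]
        y[k] = y_new
        # Primal step: closed-form minimizer of
        # (lam/2)||x||^2 + <u + dy * a_k, x> + ||x - x_old||^2 / (2 * tau).
        x_old = x
        x = [(xo - tau * (uj + dy * aj)) / (1.0 + lam * tau)
             for xo, uj, aj in zip(x_old, u, a_k)]
        # Keep u equal to (1/n) * sum_i y_i * a_i, then extrapolate.
        u = [uj + dy * aj / n for uj, aj in zip(u, a_k)]
        x_bar = [xn + theta * (xn - xo) for xn, xo in zip(x, x_old)]
    return x


def ridge_gradient(A, b, lam, x):
    """Gradient of the ridge objective; it vanishes at the minimizer."""
    n, d = len(A), len(x)
    grad = [lam * xj for xj in x]
    for row, bi in zip(A, b):
        residual = sum(a * xj for a, xj in zip(row, x)) - bi
        for j in range(d):
            grad[j] += residual * row[j] / n
    return grad
```

A quick sanity check is to run `spdc_ridge` on a small random problem and confirm that `ridge_gradient` at the returned point is close to zero.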

The Mini-Batch SPDC method in Algorithm 2 is a natural extension of SPDC in Algorithm 1. The difference between these two algorithms is that the Mini-Batch SPDC method may simultaneously select more than one dual coordinate to update. Let $m$ be the mini-batch size. During each iteration, the Mini-Batch SPDC method randomly picks a subset of indices $K\subset\{1,\ldots,n\}$ of size $m$, such that the probability of each index being picked is equal to $m/n$. The following is a simple procedure to achieve this. First, partition the set of indices into $m$ disjoint subsets, so that the cardinality of each subset is equal to $n/m$ (assuming $m$ divides $n$). Then, during each iteration, randomly select a single index from each subset and add it to $K$. Other approaches for mini-batch selection are also possible; see the discussions in .
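The partition-based selection procedure can be sketched as follows. This is an illustrative version that assumes 0-based indices and a partition into contiguous blocks (any fixed partition into $m$ subsets of size $n/m$ would do); the function name is hypothetical.

```python
import random


def sample_minibatch(n, m, rng):
    """Pick one index uniformly at random from each of m blocks of size n // m.

    Each index in {0, ..., n-1} is selected with probability m / n.
    """
    if n % m != 0:
        raise ValueError("this sketch assumes that m divides n")
    block = n // m
    return [j * block + rng.randrange(block) for j in range(m)]
```

For example, `sample_minibatch(12, 3, random.Random(0))` returns three indices, one from each of the blocks `0..3`, `4..7` and `8..11`.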

In Algorithm 2, we also switched the order of updating $x^{(t+1)}$ and $u^{(t+1)}$ (compared with Algorithm 1), to better illustrate that $x^{(t+1)}$ is obtained based on an extrapolation from $u^{(t)}$ to $u^{(t+1)}$. However, this form is not recommended in implementation, because $u^{(t)}$ is usually a dense vector even if the feature vectors $a_{k}$ are sparse. Details on efficient implementation of SPDC are given in Section 6. In the following discussion, we do not make sparseness assumptions.

With a single processor, each iteration of Algorithm 2 takes $\mathcal{O}(md)$ time to accomplish. Since the updates of each coordinate $y_{k}$ are independent of each other, we can use parallel computing to accelerate the Mini-Batch SPDC method. Concretely, we can use $m$ processors to update the $m$ coordinates in the subset $K$ in parallel, then aggregate them to update $x^{(t+1)}$ . In terms of wall-clock time, each iteration takes $\mathcal{O}(d)$ time, which is the same as running one iteration of the basic SPDC algorithm. Not surprisingly, we will show that the Mini-Batch SPDC algorithm converges faster than SPDC in terms of the iteration complexity, because it processes multiple dual coordinates in a single iteration.

Since the basic SPDC algorithm is a special case of Mini-Batch SPDC with $m=1$ , we only present a convergence theorem for the mini-batch version. The expectations in the following results are taken with respect to the random variables $\{K^{(0)},\ldots,K^{(T-1)}\}$ , where $K^{(t)}$ denotes the random subset $K\subset\{1,\ldots,n\}$ picked at the $t$ -th iteration of the SPDC method.

Theorem 1. Assume that each $\phi_{i}$ is $(1/\gamma)$-smooth and $g$ is $\lambda$-strongly convex (Assumption A). Let $(x^{\star},y^{\star})$ be the unique saddle point of $f$ defined in (4), $R=\max\{\left\|{a_{1}}\right\|_{2},\ldots,\left\|{a_{n}}\right\|_{2}\}$, and define

If the parameters $\tau,\sigma$ and $\theta$ in Algorithm 2 are chosen such that

then for each $t\geq 1$ , the Mini-Batch SPDC algorithm achieves

The proof of Theorem 1 is given in Appendix A. The following corollary establishes the expected iteration complexity of Mini-Batch SPDC for obtaining an $\epsilon$ -accurate solution.

Corollary 1. Suppose Assumption A holds and the parameters $\tau$, $\sigma$ and $\theta$ are set as in (16). In order for Algorithm 2 to obtain

it suffices to have the number of iterations $T$ satisfy

Applying the inequality $-\log(1-x)\geq x$ to the denominator above completes the proof. ∎

Recall the definition of the condition number $\kappa=R^{2}/(\lambda\gamma)$ in (2). Corollary 1 establishes that the iteration complexity of the Mini-Batch SPDC method for achieving (17) is

$$\mathcal{O}\Bigl(\bigl(n/m+\sqrt{\kappa n/m}\bigr)\log(1/\epsilon)\Bigr).$$

So a larger batch size $m$ leads to fewer iterations. In the extreme case of $n=m$, we obtain a full batch algorithm, which has iteration or batch complexity $\mathcal{O}((1+\sqrt{\kappa})\log(1/\epsilon))$. This complexity is also shared by the AFG methods (see Section 1.1), as well as the batch primal-dual algorithm of Chambolle and Pock (see discussions on related work in Section 5).

Since an equivalent pass over the dataset corresponds to $n/m$ iterations, the batch complexity (the number of equivalent passes over the data) of Mini-Batch SPDC is

$$\mathcal{O}\Bigl(\bigl(1+\sqrt{\kappa m/n}\bigr)\log(1/\epsilon)\Bigr).$$

The above expression implies that a smaller batch size $m$ leads to fewer passes through the data. In this sense, the basic SPDC method with $m=1$ is the most efficient one. However, if we prefer the least amount of wall-clock time, then the best choice is to choose a mini-batch size $m$ that matches the number of parallel processors available.
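The trade-off between iteration count and passes over the data can be illustrated numerically. The sketch below evaluates the order-of-magnitude expressions $(n/m+\sqrt{\kappa n/m})\log(1/\epsilon)$ and $(1+\sqrt{\kappa m/n})\log(1/\epsilon)$ with all hidden constants set to one, which is an assumption made purely for illustration; the function names are hypothetical.

```python
import math


def spdc_iterations(n, m, kappa, eps):
    """Order-of-magnitude iteration count of Mini-Batch SPDC (constants dropped)."""
    return (n / m + math.sqrt(kappa * n / m)) * math.log(1.0 / eps)


def spdc_passes(n, m, kappa, eps):
    """Equivalent passes over the data: one pass corresponds to n / m iterations."""
    return spdc_iterations(n, m, kappa, eps) * m / n
```

With the hypothetical values `n = 10000` and `kappa = 1e6`, moving from `m = 1` to `m = n` reduces the iteration count but increases the number of passes, which matches the discussion above.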

2.2 Convergence rate of primal-dual gap

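In the previous subsection, we established the iteration complexity of the Mini-Batch SPDC method in terms of approximating the saddle point of the minimax problem (4), more specifically, to meet the requirement in (17). Next we show that it has the same order of complexity in reducing the primal-dual objective gap $P(x^{(t)})-D(y^{(t)})$, where $P(x)$ is defined in (1) and

$$D(y)=\min_{x\in\mathbb{R}^{d}}f(x,y)=\frac{1}{n}\sum_{i=1}^{n}-\phi_{i}^{*}(y_{i})-g^{*}\Bigl(-\frac{1}{n}\sum_{i=1}^{n}y_{i}a_{i}\Bigr),$$

where $g^{*}(u)=\sup_{x\in\mathbb{R}^{d}}\{x^{T}u-g(x)\}$ is the conjugate function of $g$.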

Under Assumption A, the function $f(x,y)$ defined in (4) has a unique saddle point $(x^{\star},y^{\star})$ , and

Thus the result in Theorem 1 does not translate directly into a convergence bound on the primal-dual gap. We need to bound $P(x)$ and $D(y)$ by $f(x,y^{\star})$ and $f(x^{\star},y)$ , respectively, in the opposite directions. For this purpose, we need the following lemma, which we extracted from . We provide the proof in Appendix B for completeness.

Suppose Assumption A holds and the parameters $\tau$ , $\sigma$ and $\theta$ are set as in (16). Let $\widetilde{\Delta}^{(0)}:=\Delta^{(0)}+\frac{\|{y^{(0)}-y^{\star}}\|_{2}^{2}}{4m\sigma}$ . Then for any $\epsilon\geq 0$ , the iterates of Algorithm 2 satisfy

The function $f(x,y^{\star})$ is strongly convex in $x$ with parameter $\lambda$ , and $x^{\star}$ is the minimizer. Similarly, $-f(x^{\star},y)$ is strongly convex in $y$ with parameter $\gamma/n$ , and is minimized by $y^{\star}$ . Therefore,

We bound the following weighted primal-dual gap

The first inequality above is due to Lemma 1, the second and fourth inequalities are due to the definition of $\Delta^{(t)}$ , and the third inequality is due to (19). Taking expectations on both sides of the above inequality, then applying Theorem 1, we obtain

Since $n\geq m$ and $D(y^{\star})-D(y^{(t)})\geq 0$, this implies the desired result. ∎

Extensions to non-smooth or non-strongly convex functions

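To be concise, we only consider the case where $\phi_{i}$ is not smooth and $g$ is not strongly convex. Formally, we assume that each $\phi_{i}$ and $g$ are convex and Lipschitz continuous, and $f(x,y)$ has a saddle point $(x^{\star},y^{\star})$. We choose a scalar $\delta>0$ and consider the modified saddle-point function:

$$f_{\delta}(x,y)=\frac{1}{n}\sum_{i=1}^{n}\Bigl(y_{i}\langle a_{i},x\rangle-\Bigl(\phi_{i}^{*}(y_{i})+\frac{\delta y_{i}^{2}}{2}\Bigr)\Bigr)+g(x)+\frac{\delta}{2}\|x\|_{2}^{2}.$$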

Denote by $(x^{\star}_{\delta},y^{\star}_{\delta})$ the saddle point of $f_{\delta}$. We employ the Mini-Batch SPDC method (Algorithm 2) to approximate $(x^{\star}_{\delta},y^{\star}_{\delta})$, treating $\phi^{*}_{i}+\frac{\delta}{2}(\cdot)^{2}$ as $\phi^{*}_{i}$ and $g+\frac{\delta}{2}\|{\cdot}\|_{2}^{2}$ as $g$, which are all $\delta$-strongly convex. We note that adding a strongly convex perturbation to $\phi_{i}^{*}$ is equivalent to smoothing $\phi_{i}$, which becomes $(1/\delta)$-smooth (see, e.g., ). Letting $\gamma=\lambda=\delta$, the parameters $\tau$, $\sigma$ and $\theta$ in (16) become

Although $(x^{\star}_{\delta},y^{\star}_{\delta})$ is not exactly the saddle point of $f$ , the following corollary shows that applying the SPDC method to the perturbed function $f_{\delta}$ effectively minimizes the original loss function $P$ . Similar results for the convergence of the primal-dual gap can also be established.
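The equivalence between perturbing the conjugate and smoothing the loss can be checked on a simple example. The sketch below uses the absolute-value loss $\phi(z)=|z|$, which is an illustrative choice not taken from the text: its conjugate is the indicator of $[-1,1]$, so the conjugate of $\phi^{*}+\frac{\delta}{2}(\cdot)^{2}$ is $\max_{|y|\leq 1}\{yz-\frac{\delta}{2}y^{2}\}$, which works out to the Huber function. Function names are hypothetical.

```python
def smoothed_abs(z, delta):
    """Smoothed |z|: max over y in [-1, 1] of y*z - (delta/2)*y^2.

    The maximizer is z / delta clipped to [-1, 1].
    """
    y = max(-1.0, min(1.0, z / delta))
    return y * z - 0.5 * delta * y * y


def smoothed_abs_derivative(z, delta):
    """Derivative of the smoothed loss, equal to the maximizing y."""
    return max(-1.0, min(1.0, z / delta))


def huber(z, delta):
    """Closed form that the smoothed loss should match."""
    if abs(z) <= delta:
        return z * z / (2.0 * delta)
    return abs(z) - 0.5 * delta
```

In this example the derivative is $(1/\delta)$-Lipschitz, and the smoothed loss differs from $|z|$ by at most $\delta/2$, which is the kind of approximation error the perturbed problem trades for smoothness.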

Assume that each $\phi_{i}$ is convex and $G_{\phi}$ -Lipschitz continuous, and $g$ is convex and $G_{g}$ -Lipschitz continuous. Define two constants:

Let $\widetilde{y}=\arg\max_{y}f(x^{\star}_{\delta},y)$ be a shorthand notation. We have

Here, equations (i) and (vii) use the definition of the function $f$ , inequalities (ii) and (v) use the definition of the function $f_{\delta}$ , inequalities (iii) and (iv) use the fact that $(x^{\star}_{\delta},y^{\star}_{\delta})$ is the saddle point of $f_{\delta}$ , and inequality (vi) is due to the fact that $(x^{\star},y^{\star})$ is the saddle point of $f$ .

Since $\phi_{i}$ is $G_{\phi}$ -Lipschitz continuous, the domain of $\phi^{*}_{i}$ is in the interval $[-G_{\phi},G_{\phi}]$ , which implies $\|{\widetilde{y}}\|_{2}^{2}\leq nG_{\phi}^{2}$ (see, e.g., [38, Lemma 1]). Thus, we have

On the other hand, since $P$ is $(G_{\phi}R+G_{g})$ -Lipschitz continuous, Theorem 1 implies

The corollary is established by finding the smallest $T$ that satisfies inequality (23). ∎

There are two other cases that can be considered: when $\phi_{i}$ is not smooth but $g$ is strongly convex, and when $\phi_{i}$ is smooth but $g$ is not strongly convex. They can be handled with the same technique described above, and we omit the details here. In Table 1, we list the complexities of the Mini-Batch SPDC method for finding an $\epsilon$ -optimal solution of problem (1) under various assumptions. Similar results are also obtained in .

SPDC with non-uniform sampling

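The basic idea is to use non-uniform sampling in picking the dual coordinate to update at each iteration. In Algorithm 3, we pick coordinate $k$ with the probability

$$p_{k}=(1-\alpha)\frac{1}{n}+\alpha\frac{\|a_{k}\|_{2}}{\sum_{i=1}^{n}\|a_{i}\|_{2}},\quad k=1,\ldots,n,$$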

where $\alpha\in(0,1)$ is a parameter. In other words, this distribution is a (strict) convex combination of the uniform distribution and the distribution that is proportional to the feature norms. Therefore, instances with large feature norms are sampled more frequently, controlled by $\alpha$ . Simultaneously, we adopt an adaptive regularization in step (26), imposing stronger regularization on such instances. In addition, we adjust the weight of $a_{k}$ in (27) for updating the primal variable. As a consequence, the convergence rate of Algorithm 3 depends on the average norm of feature vectors, as well as the parameter $\alpha$ . This is summarized in the following theorem.
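A minimal sketch of this sampling distribution, assuming it is the mixture $(1-\alpha)/n+\alpha\|a_{k}\|_{2}/\sum_{i}\|a_{i}\|_{2}$ implied by the description above. The function names are hypothetical, and the feature vectors are represented as plain lists of floats.

```python
import math
import random


def sampling_probabilities(A, alpha):
    """Mix the uniform distribution with one proportional to the row norms of A."""
    norms = [math.sqrt(sum(v * v for v in row)) for row in A]
    total = sum(norms)
    n = len(A)
    return [(1.0 - alpha) / n + alpha * r / total for r in norms]


def sample_index(probs, rng):
    """Draw one coordinate index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With `alpha` close to zero the distribution is nearly uniform, while with `alpha` close to one it is dominated by the feature norms.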

Suppose Assumption A holds. Let $\bar{R}=\frac{1}{n}\sum_{i=1}^{n}\|{a_{i}}\|_{2}$ . If the parameters $\tau,\sigma,\theta$ in Algorithm 3 are chosen such that

Since $\theta$ is a bound on the convergence factor, we would like to make it as small as possible. Its expression in (29) is minimized by choosing

$$\alpha^{\star}=\frac{1}{1+(n/\bar{\kappa})^{1/4}},$$

where $\bar{\kappa}=\bar{R}^{2}/(\lambda\gamma)$ is an average condition number. We have $\alpha^{\star}=1/2$ if $\bar{\kappa}=n$ . The value of $\alpha^{\star}$ decreases slowly to zero as the ratio $n/\bar{\kappa}$ grows, and increases to one as the ratio $n/\bar{\kappa}$ drops. Thus, we may choose a relatively uniform distribution for well conditioned problems, but a more aggressively weighted distribution for ill-conditioned problems.
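The behavior of $\alpha^{\star}$ described above can be reproduced numerically. The sketch below assumes that the quantity being minimized has the form $n/(1-\alpha)+\sqrt{n\bar{\kappa}}/\alpha$ (an assumption about the expression in (29), which is not reproduced here); setting its derivative to zero gives $\alpha^{\star}=(1+(n/\bar{\kappa})^{1/4})^{-1}$, consistent with $\alpha^{\star}=1/2$ at $\bar{\kappa}=n$. Function names are hypothetical.

```python
def optimal_alpha(n, kappa_bar):
    """Minimizer of rate_term over alpha in (0, 1), under the assumed form."""
    return 1.0 / (1.0 + (n / kappa_bar) ** 0.25)


def rate_term(alpha, n, kappa_bar):
    """Assumed quantity whose inverse determines the convergence factor."""
    return n / (1.0 - alpha) + (n * kappa_bar) ** 0.5 / alpha
```

Under this assumption, ill-conditioned problems (large `kappa_bar` relative to `n`) push `optimal_alpha` toward one, and well-conditioned problems push it toward zero.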

For simplicity of presentation, we described in Algorithm 3 a weighted sampling SPDC method with single dual coordinate update, i.e., the case of $m=1$ . It is not hard to see that the non-uniform sampling scheme can also be extended to Mini-Batch SPDC with $m>1$ . Here, we omit the technical details.

Related Work

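Chambolle and Pock considered a class of convex optimization problems with the following saddle-point structure:

$$\min_{x}\ \max_{y}\ \bigl\{\langle Kx,y\rangle+G(x)-F^{*}(y)\bigr\},\tag{30}$$

where $K$ is a matrix, and $G$ and $F^{*}$ are proper closed convex functions, with $F^{*}$ being the conjugate of a convex function $F$. They developed the following first-order primal-dual algorithm:

$$y^{(t+1)}=\arg\max_{y}\Bigl\{\langle K\overline{x}^{(t)},y\rangle-F^{*}(y)-\frac{1}{2\sigma}\|y-y^{(t)}\|_{2}^{2}\Bigr\},\tag{31}$$

$$x^{(t+1)}=\arg\min_{x}\Bigl\{\langle K^{T}y^{(t+1)},x\rangle+G(x)+\frac{1}{2\tau}\|x-x^{(t)}\|_{2}^{2}\Bigr\},\tag{32}$$

$$\overline{x}^{(t+1)}=x^{(t+1)}+\theta\bigl(x^{(t+1)}-x^{(t)}\bigr).\tag{33}$$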

When both $F^{*}$ and $G$ are strongly convex and the parameters $\tau$, $\sigma$ and $\theta$ are chosen appropriately, this algorithm obtains an accelerated linear convergence rate [9, Theorem 3].

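We can map the saddle-point problem (4) into the form of (30) by letting $A=[a_{1},\ldots,a_{n}]^{T}$ and

$$K=\frac{1}{n}A,\qquad G(x)=g(x),\qquad F^{*}(y)=\frac{1}{n}\sum_{i=1}^{n}\phi_{i}^{*}(y_{i}).\tag{34}$$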

The SPDC method developed in this paper can be viewed as an extension of the batch method (31)-(33), where the dual update step (31) is replaced by a single coordinate update (7) or a mini-batch update (13). However, in order to obtain accelerated convergence rate, more subtle changes are necessary in the primal update step. More specifically, we introduced the auxiliary variable $u^{(t)}=\frac{1}{n}\sum_{i=1}^{n}y_{i}^{(t)}a_{i}=K^{T}y^{(t)}$ , and replaced the primal update step (32) by (8) and (14). The primal extrapolation step (33) stays the same.

To compare the batch complexity of SPDC with that of (31)-(33), we use the following facts implied by Assumption A and the relations in (34):

Based on these conditions, we list in Table 2 the equivalent parameters used in [9, Algorithm 3] and the batch complexity obtained in [9, Theorem 3], and compare them with SPDC.

The batch complexity of the Chambolle-Pock algorithm is $\widetilde{\mathcal{O}}(1+\|A\|_{2}/(2\sqrt{n\lambda\gamma}))$, where the $\widetilde{\mathcal{O}}(\cdot)$ notation hides the $\log(1/\epsilon)$ factor. We can bound the spectral norm $\|A\|_{2}$ by the Frobenius norm $\|A\|_{F}$ and obtain

$$\|A\|_{2}\leq\|A\|_{F}\leq\sqrt{n}\max_{i}\|a_{i}\|_{2}=\sqrt{n}R.$$

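(Note that the second inequality above would be an equality if the columns of $A$ are normalized.) So in the worst case, the batch complexity of the Chambolle-Pock algorithm becomes

$$\widetilde{\mathcal{O}}\bigl(1+R/\sqrt{\lambda\gamma}\bigr)=\widetilde{\mathcal{O}}\bigl(1+\sqrt{\kappa}\bigr),$$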

which matches the worst-case complexity of the AFG methods (see Section 1.1 and also the discussions in [20, Section 5]). This is also of the same order as the complexity of SPDC with $m=n$ (see Section 2.1). When the condition number $\kappa\gg 1$, they can be a factor of $\sqrt{n}$ worse than the batch complexity of SPDC with $m=1$, which is $\widetilde{\mathcal{O}}(1+\sqrt{\kappa/n})$.

If either $G(x)$ or $F^{*}(y)$ in (30) is not strongly convex, Chambolle and Pock proposed variants of the primal-dual batch algorithm to achieve accelerated sublinear convergence rates [9, Section 5.1]. It is also possible to extend them to coordinate update methods for solving problem (1) when either $\phi_{i}^{*}$ or $g$ is not strongly convex. Their complexities would be similar to those in Table 1.

Our algorithms and theory can be readily generalized to solve the problem of

We can also solve the primal problem (1) via its dual:

Because of the problem structure, coordinate ascent methods (e.g., ) can be more efficient than full gradient methods. In the stochastic dual coordinate ascent (SDCA) method , a dual coordinate $y_{i}$ is picked at random during each iteration and updated to increase the dual objective value. Shalev-Shwartz and Zhang showed that the iteration complexity of SDCA is $O\left((n+\kappa)\log(1/\epsilon)\right)$ , which corresponds to the batch complexity $\widetilde{\mathcal{O}}(1+\kappa/n)$ .

For more general convex optimization problems, there is a vast literature on coordinate descent methods; see, e.g., the recent overview by Wright . In particular, Nesterov’s work on randomized coordinate descent sparked much recent activity on this topic. Richtárik and Takáč extended the algorithm and analysis to composite convex optimization. When applied to the dual problem (35), it becomes one variant of SDCA studied in . Mini-batch and distributed versions of SDCA have been proposed and analyzed in and respectively. Non-uniform sampling schemes have been studied for both stochastic gradient and SDCA methods (e.g., ).

Shalev-Shwartz and Zhang proposed an accelerated mini-batch SDCA method which incorporates more primal updates than SDCA, and bears some similarity to our Mini-Batch SPDC method. They showed that its complexity interpolates between that of SDCA and AFG by varying the mini-batch size $m$. In particular, for $m=n$, it matches that of the AFG methods (as SPDC does). But for $m=1$, the complexity of their method is the same as SDCA, which is worse than SPDC for ill-conditioned problems.

More recently, Lin et al. developed an accelerated proximal coordinate gradient (APCG) method for solving a more general class of composite convex optimization problems. When applied to the dual problem (35), APCG enjoys the same batch complexity $\widetilde{\mathcal{O}}\bigl(1+\sqrt{\kappa/n}\bigr)$ as SPDC. However, it needs an extra primal proximal-gradient step to have theoretical guarantees on the convergence of the primal-dual gap [20, Section 5.1]. The computational cost of this additional step is equivalent to one pass over the dataset, thus it does not affect the overall complexity.

5.2 Other related work

Another way to approach problem (1) is to reformulate it as a constrained optimization problem

Suzuki considered a problem similar to (1), but with a more complex regularization function $g$, meaning that $g$ does not have a simple proximal mapping. Thus primal updates such as step (8) or (14) in SPDC and similar steps in SDCA cannot be computed efficiently. He proposed an algorithm that combines SDCA and ADMM (e.g., ), and showed that it has a linear rate of convergence under conditions similar to Assumption A. It would be interesting to see if the SPDC method can be extended to this setting to obtain an accelerated linear convergence rate.

Efficient Implementation with Sparse Data

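Suppose that $g(x)=\frac{\lambda}{2}\|{x}\|_{2}^{2}$. For this case, the updates for each coordinate of $x$ are independent of each other. More specifically, $x^{(t+1)}$ can be computed coordinate-wise in closed form:

$$x_{j}^{(t+1)}=\frac{1}{\lambda+1/\tau}\Bigl(\frac{1}{\tau}x_{j}^{(t)}-u_{j}^{(t)}-\Delta u_{j}\Bigr),\quad j=1,\ldots,d,$$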

where $\Delta u$ denotes $(y_{k}^{(t+1)}-y_{k}^{(t)})a_{k}$ in Algorithm 1, or $\frac{1}{m}\sum_{k\in K}(y_{k}^{(t+1)}-y_{k}^{(t)})a_{k}$ in Algorithm 2, or $(y_{k}^{(t+1)}-y_{k}^{(t)})a_{k}/(p_{k}n)$ in Algorithm 3, and $\Delta u_{j}$ represents the $j$ -th coordinate of $\Delta u$ .

Although the dimension $d$ can be very large, we assume that each feature vector $a_{k}$ is sparse. We denote by $J^{(t)}$ the set of non-zero coordinates at iteration $t$ , that is, if for some index $k\in K$ picked at iteration $t$ we have $a_{kj}\neq 0$ , then $j\in J^{(t)}$ . If $j\notin J^{(t)}$ , then the SPDC algorithm (and its variants) updates $y^{(t+1)}$ without using the value of $x_{j}^{(t)}$ or $\overline{x}_{j}^{(t)}$ . This can be seen from the updates in (7), (13) and (26), where the value of the inner product $\langle a_{k},\overline{x}^{(t)}\rangle$ does not depend on the value of $\overline{x}^{(t)}_{j}$ . As a consequence, we can delay the updates on $x_{j}$ and $\overline{x}_{j}$ whenever $j\notin J^{(t)}$ without affecting the updates on $y^{(t)}$ , and process all the missing updates at the next time when $j\in J^{(t)}$ .

Such a delayed update can be carried out very efficiently. We assume that $t_{0}$ is the last time when $j\in J^{(t)}$, and $t_{1}$ is the current iteration where we want to update $x_{j}$ and $\overline{x}_{j}$. Since $j\notin J^{(t)}$ implies $\Delta u_{j}=0$, we have

$$x_{j}^{(t+1)}=\frac{1}{\lambda+1/\tau}\Bigl(\frac{1}{\tau}x_{j}^{(t)}-u_{j}^{(t)}\Bigr),\quad t=t_{0}+1,\ldots,t_{1}-1.\tag{38}$$

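Notice that $u_{j}^{(t)}$ is updated only at iterations where $j\in J^{(t)}$. The value of $u_{j}^{(t)}$ does not change during iterations $[t_{0}+1,t_{1}]$, so we have $u_{j}^{(t)}\equiv u_{j}^{(t_{0}+1)}$ for $t\in[t_{0}+1,t_{1}]$. Substituting this equation into the recursive formula (38), we obtain

$$x_{j}^{(t_{1})}=\frac{1}{(1+\lambda\tau)^{t_{1}-t_{0}-1}}\Bigl(x_{j}^{(t_{0}+1)}+\frac{u_{j}^{(t_{0}+1)}}{\lambda}\Bigr)-\frac{u_{j}^{(t_{0}+1)}}{\lambda}.\tag{39}$$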

The update (39) takes $\mathcal{O}(1)$ time to compute. Using the same formula, we can compute $x^{(t_{1}-1)}_{j}$ and subsequently compute $\overline{x}^{(t_{1})}_{j}=x^{(t_{1})}_{j}+\theta(x^{(t_{1})}_{j}-x^{(t_{1}-1)}_{j})$ . Thus, the computational complexity of a single iteration in SPDC is proportional to $|J^{(t)}|$ , independent of the dimension $d$ .

where $\Delta u_{j}$ follows the definition in Section 6.1. If $j\notin J^{(t)}$ , then $\Delta u_{j}=0$ and equation (40) can be simplified as

Similar to the approach of Section 6.1, we delay the update of $x_{j}$ until $j\in J^{(t)}$. We assume $t_{0}$ to be the last iteration when $j\in J^{(t)}$, and let $t_{1}$ be the current iteration when we want to update $x_{j}$. During iterations $[t_{0}+1,t_{1}]$, the value of $u^{(t)}_{j}$ does not change, so we have $u_{j}^{(t)}\equiv u_{j}^{(t_{0}+1)}$ for $t\in[t_{0}+1,t_{1}]$. Using equation (44) and the invariance of $u_{j}^{(t)}$ for $t\in[t_{0}+1,t_{1}]$, we have an $\mathcal{O}(1)$ time algorithm to calculate $x_{j}^{(t_{1})}$, which we detail in Appendix D. The value $\overline{x}^{(t_{1})}_{j}$ can be updated by the same algorithm since it is a linear combination of $x_{j}^{(t_{1})}$ and $x_{j}^{(t_{1}-1)}$. As a consequence, the computational complexity of each iteration in SPDC is proportional to $|J^{(t)}|$, independent of the dimension $d$.

Experiments

In this section, we compare the basic SPDC method (Algorithm 1) with several state-of-the-art optimization algorithms for solving problem (1). They include two batch-update algorithms: the accelerated full gradient (AFG) method [25, Section 2.2], and the limited-memory quasi-Newton method L-BFGS [29, Section 7.2]. For the AFG method, we adopt an adaptive line search scheme (e.g., ) to improve its efficiency. For the L-BFGS method, we use the memory size 30 as suggested by . We also compare SPDC with three stochastic algorithms: the stochastic average gradient (SAG) method , the stochastic dual coordinate ascent (SDCA) method and the accelerated stochastic dual coordinate ascent (ASDCA) method . We conduct experiments on a synthetic dataset and three real datasets.

We first compare SPDC with other algorithms on a simple quadratic problem using synthetic data. We generate $n=500$ i.i.d. training examples $\{a_{i},b_{i}\}_{i=1}^{n}$ according to the model

In the form of problem (1), we have $\phi_{i}(z)=z^{2}/2$ and $g(x)=(\lambda/2)\|{x}\|_{2}^{2}$. As a consequence, the derivative of $\phi_{i}$ is $1$-Lipschitz continuous and $g$ is $\lambda$-strongly convex.

We evaluate the algorithms by the logarithmic optimality gap $\log(P(x^{(t)})-P(x^{\star}))$ , where $x^{(t)}$ is the output of the algorithms after $t$ passes over the entire dataset, and $x^{\star}$ is the global minimum. When the regularization coefficient is relatively large, e.g., $\lambda=10^{-1}$ or $10^{-2}$ , the problem is well-conditioned and we observe fast convergence of the stochastic algorithms SAG, SDCA, ASDCA and SPDC, which are substantially faster than the two batch methods AFG and L-BFGS.

Figure 1 shows the convergence of the five different algorithms when we varied $\lambda$ from $10^{-3}$ to $10^{-6}$ . As the plot shows, when the condition number is greater than $n$ , the SPDC algorithm also converges substantially faster than the other two stochastic methods SAG and SDCA. It is also notably faster than L-BFGS. These results support our theory that SPDC enjoys a faster convergence rate on ill-conditioned problems. In terms of their batch complexities, SPDC is up to $\sqrt{n}$ times faster than AFG, and $(\lambda n)^{-1/2}$ times faster than SAG and SDCA.

Theoretically, ASDCA enjoys the same batch complexity as SPDC up to a multiplicative constant factor. Figure 1 shows that the empirical performance of SPDC is substantially better than that of ASDCA for small $\lambda$. This may be due to the fact that ASDCA follows an inner-outer iteration procedure, while SPDC is a single-loop algorithm, which would explain why it is empirically more efficient.

7.2 Binary classification with real data

Here, $\phi_{i}$ is the smoothed hinge loss (see, e.g., ). It is easy to verify that the conjugate function of $\phi_{i}$ is $\phi^{*}_{i}(\beta)=b_{i}\beta+\frac{1}{2}\beta^{2}$ for $b_{i}\beta\in[-1,0]$ and $\infty$ otherwise.
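The conjugate formula can be checked numerically. The sketch below assumes the common form of the smoothed hinge loss with labels $b_{i}\in\{-1,+1\}$: zero when $b_{i}z\geq 1$, $\frac{1}{2}-b_{i}z$ when $b_{i}z\leq 0$, and $\frac{1}{2}(1-b_{i}z)^{2}$ in between (the exact definition is not reproduced in the text, so this form is an assumption). It compares the closed-form conjugate on $b_{i}\beta\in[-1,0]$ with a brute-force supremum over a grid; function names are hypothetical.

```python
def smoothed_hinge(z, b):
    """Assumed smoothed hinge loss for a label b in {-1, +1}."""
    s = b * z
    if s >= 1.0:
        return 0.0
    if s <= 0.0:
        return 0.5 - s
    return 0.5 * (1.0 - s) ** 2


def smoothed_hinge_conjugate(beta, b):
    """Closed-form conjugate: b*beta + beta^2/2 on b*beta in [-1, 0], else infinity."""
    s = b * beta
    if -1.0 <= s <= 0.0:
        return s + 0.5 * beta * beta
    return float("inf")


def conjugate_by_grid(beta, b, lo=-5.0, hi=5.0, steps=20001):
    """Brute-force sup over z of beta*z - phi(z) on a uniform grid."""
    h = (hi - lo) / (steps - 1)
    return max(beta * (lo + i * h) - smoothed_hinge(lo + i * h, b)
               for i in range(steps))
```

The grid search only approximates the supremum, so the comparison holds up to a small discretization error.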

The performance of the five algorithms is plotted in Figure 2 and Figure 3. In Figure 2, we compare SPDC with the two batch methods: AFG and L-BFGS. The results show that SPDC is substantially faster than AFG and L-BFGS for relatively large $\lambda$, illustrating the advantage of stochastic methods over batch methods on well-conditioned problems. As $\lambda$ decreases to $10^{-8}$, the batch methods (especially L-BFGS) become comparable to SPDC.

Summarizing Figure 2 and Figure 3, the performance of SPDC is always comparable to or better than that of the other methods in comparison.

Appendix A Proof of Theorem 1

We focus on characterizing the values of $x$ and $y$ after the $t$ -th update in Algorithm 2. For any $i\in\{1,\ldots,n\}$ , let $\widetilde{y}_{i}$ be the value of $y_{i}^{(t+1)}$ if $i\in K$ , i.e.,

Since $\phi_{i}$ is $(1/\gamma)$ -smooth by assumption, its conjugate $\phi^{*}_{i}$ is $\gamma$ -strongly convex (e.g., [15, Theorem 4.2.2]). Thus the function being maximized above is $(1/\sigma+\gamma)$ -strongly concave. Therefore,

Multiplying both sides of the above inequality by $m/n$ and rearranging terms, we have

According to Algorithm 2, the set $K$ of indices to be updated is chosen randomly. For every specific index $i$, the event $i\in K$ happens with probability $m/n$. If $i\in K$, then $y_{i}^{(t+1)}$ is updated to the value $\widetilde{y}_{i}$, which satisfies inequality (45). Otherwise, $y_{i}^{(t+1)}$ retains its old value $y_{i}^{(t)}$. Let $\mathcal{F}_{t}$ be the sigma field generated by all random variables defined before round $t$. Taking expectation conditioned on $\mathcal{F}_{t}$, we have

As a result, we can represent $(\widetilde{y}_{i}-y^{\star}_{i})^{2}$ , $(\widetilde{y}_{i}-y_{i}^{(t)})^{2}$ , $\widetilde{y}_{i}$ and $\phi_{i}^{*}(\widetilde{y}_{i})$ in terms of the conditional expectations on $(y_{i}^{(t+1)}-y^{\star}_{i})^{2}$ , $(y_{i}^{(t+1)}-y_{i}^{(t)})^{2}$ , $y_{i}^{(t+1)}$ and $\phi_{i}^{*}(y_{i}^{(t+1)})$ , respectively. Plugging these representations into inequality (45) and re-arranging terms, we obtain

Then summing over all indices $i=1,2,\dots,n$ and dividing both sides of the resulting inequality by $m$ , we have

where we used the shorthand notation (which also appears in Algorithm 2)

Since only the dual coordinates with indices in $K$ are updated, we have

We also derive an inequality characterizing the relation between $x^{(t+1)}$ and $x^{(t)}$ . Since the function being minimized on the right-hand side of (14) has strong convexity parameter $1/\tau+\lambda$ and $x^{(t+1)}$ is the minimizer, we have

Rearranging terms and taking expectation conditioned on $\mathcal{F}_{t}$ , we have

In addition, we consider a particular combination of the saddle-point function values at different points. By the definition of $f(x,y)$ in (4) and the notations in (48), we have

Next we add both sides of the inequalities (47) and (50) together, and then subtract equality (51) after taking expectation conditioned on $\mathcal{F}_{t}$. This leads to the following inequality:

We need to lower bound the last term on the right-hand side of the above inequality. To this end, we have

Recall that $\|{a_{k}}\|_{2}\leq R$ and, according to (16), $1/\tau=4\sigma R^{2}$ . Therefore,

The above upper bounds on the absolute values imply

Combining the above two inequalities with (52) and (53), we obtain

Note that we have added the nonnegative term $\theta\bigl{(}f(x^{(t)},y^{\star})-f(x^{\star},y^{\star})\bigr{)}$ to the left-hand side in (54) to ensure that each term on one side of the inequality has a corresponding term on the other side.

If the parameters $\tau$ , $\sigma$ , and $\theta$ are chosen as in (16), that is,

then the ratios between the coefficients of the corresponding terms on both sides of the inequality (54) are either equal to $\theta$ or bounded by $\theta$. More specifically,

Therefore, if we define the following sequence,

Comparing with the definition of $\Delta^{(t)}$ in (15), we have

For $t=0$, by letting $x^{(-1)}=x^{(0)}$, the last two terms in (56) for $\widetilde{\Delta}^{(0)}$ disappear. Moreover, we can show that the sum of the last three terms in (56) is nonnegative, and therefore we can replace $\widetilde{\Delta}^{(t)}$ with $\Delta^{(t)}$ on the left-hand side of (55). To see this, we bound the absolute value of the last term:

where in the second inequality we used $\|A\|_{2}^{2}\leq\|A\|_{F}^{2}\leq nR^{2}$ , in the equality we used $\tau\sigma=1/(4R^{2})$ , and in the last inequality we used $m\leq n$ . The above upper bound on absolute value implies

Appendix B Proof of Lemma 1

Assumption A implies that $F(x)$ is smooth and $\nabla F(x)$ is Lipschitz continuous with constant $\|A\|_{2}^{2}/(n\gamma)$ . We can bound the spectral norm with the Frobenius norm, i.e., $\|A\|_{2}^{2}\leq\|A\|_{F}^{2}\leq nR^{2}$ , which results in $\|A\|_{2}^{2}/(n\gamma)\leq nR^{2}/(n\gamma)=R^{2}/\gamma$ . By definition of the saddle point, the gradient of $F$ at $x^{\star}$ is $\nabla F(x^{\star})=(1/n)A^{T}y^{\star}$ . Therefore, we have

Combining the above inequality with $P(x)=F(x)+g(x)$ , we have

Similarly, the second inequality can be shown by first writing $D(y)=-\frac{1}{n}\sum_{i=1}^{n}\phi_{i}^{*}(y_{i})-G^{*}(y)$ , where

In this case, $\nabla G^{*}(y)$ is Lipschitz continuous with constant $\|A\|_{2}^{2}/(n^{2}\lambda)\leq nR^{2}/(n^{2}\lambda)=R^{2}/(n\lambda)$ . Again by definition of the saddle-point, we have $\nabla G^{*}(y^{\star})=-(1/n)Ax^{\star}$ . Therefore,

Recalling that $D(y)=-\frac{1}{n}\sum_{i=1}^{n}\phi_{i}^{*}(y_{i})-G^{*}(y)$ , we conclude with

Appendix C Proof of Theorem 2

The proof of Theorem 2 follows steps similar to those in the proof of Theorem 1. We start by establishing relations between $y^{(t)}$ and $y^{(t+1)}$, and between $x^{(t)}$ and $x^{(t+1)}$. Suppose that the quantity $\widetilde{y}_{i}$ minimizes the function $\phi^{*}_{i}(\beta)-\beta\langle a_{i},\overline{x}^{(t)}\rangle+\frac{p_{i}n}{2\sigma}(\beta-y_{i}^{(t)})^{2}$. Also notice that $\phi_{i}^{*}(\beta)-\beta\langle a_{i},x^{\star}\rangle$ is a $\gamma$-strongly convex function minimized by $y_{i}^{\star}$, which implies

Then, following the same argument for establishing inequality (45) and plugging in inequality (57), we obtain

Note that $i=k$ with probability $p_{i}$ . Therefore, we have

where $\mathcal{F}_{t}$ represents the sigma field generated by all random variables defined before iteration $t$ . Substituting the above equations into inequality (58), and averaging over $i=1,2,\dots,n$ , we have

where $u^{\star}=\frac{1}{n}\sum_{i=1}^{n}y^{\star}_{i}a_{i}$ and $u^{(t)}=\frac{1}{n}\sum_{i=1}^{n}y_{i}^{(t)}a_{i}$ have the same definition as in the proof of Theorem 1.
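The definition $u^{(t)}=\frac{1}{n}\sum_{i=1}^{n}y_{i}^{(t)}a_{i}$ suggests that $u$ can be maintained incrementally when only a few dual coordinates change. The following is a minimal sketch of that bookkeeping, not the authors' implementation: the helper names `full_u` and `incremental_u`, the random data, and the stand-in dual values are all hypothetical, and the code only checks that touching the rows indexed by a sampled set $K$ reproduces the full recomputation.

```python
import random

def full_u(A, y):
    """Return u = (1/n) * sum_i y_i * a_i, where a_i is the i-th row of A."""
    n, d = len(A), len(A[0])
    return [sum(y[i] * A[i][j] for i in range(n)) / n for j in range(d)]

def incremental_u(u_old, A, y_old, y_new, K):
    """Refresh u by touching only the rows a_k with k in K."""
    n = len(A)
    u = list(u_old)
    for k in K:
        delta = (y_new[k] - y_old[k]) / n
        for j in range(len(u)):
            u[j] += delta * A[k][j]
    return u

random.seed(0)
n, d, m = 50, 8, 5                      # illustrative sizes only
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y_old = [random.gauss(0, 1) for _ in range(n)]
K = random.sample(range(n), m)          # mini-batch of m distinct indices
y_new = list(y_old)
for k in K:
    y_new[k] = random.gauss(0, 1)       # stand-in for an updated dual value

u_inc = incremental_u(full_u(A, y_old), A, y_old, y_new, K)
u_ref = full_u(A, y_new)
max_gap = max(abs(p - q) for p, q in zip(u_inc, u_ref))
print(max_gap)                          # difference is at rounding level
```

The incremental version costs on the order of $md$ operations per round instead of $nd$, which is why only the coordinates in $K$ need to be visited.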

For the relation between $x^{(t)}$ and $x^{(t+1)}$, we first notice that $\langle u^{\star},x\rangle+g(x)$ is a $\lambda$-strongly convex function minimized by $x^{\star}$, which implies

Following the same argument for establishing inequality (49) and plugging in inequality (60), we obtain

Taking expectation over both sides of the above inequality and adding it to inequality (59) yields

where $A$ is an $n$-by-$d$ matrix whose $i$-th row is equal to the vector $a_{i}^{T}$.
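Both proofs rely on the chain $\|A\|_{2}^{2}\leq\|A\|_{F}^{2}\leq nR^{2}$ for this row-stacked matrix, with $R=\max_{i}\|a_{i}\|_{2}$. As a sanity check, the sketch below evaluates the chain on random data. It is only an illustration under stated assumptions: the function names and the data are hypothetical, and the spectral norm is estimated by power iteration on $A^{T}A$, which approaches $\|A\|_{2}^{2}$ from below.

```python
import math
import random

def row_norm_sq(row):
    """Squared Euclidean norm of one row a_i."""
    return sum(v * v for v in row)

def spectral_norm_sq_estimate(A, iters=200, seed=0):
    """Power iteration on A^T A; returns ||A v||^2 for a unit vector v,
    which never exceeds ||A||_2^2 (up to rounding)."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    v = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iters):
        Av = [sum(a * b for a, b in zip(row, v)) for row in A]
        w = [sum(A[i][j] * Av[i] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = [sum(a * b for a, b in zip(row, v)) for row in A]
    return sum(x * x for x in Av)

random.seed(1)
n, d = 40, 6                                   # illustrative sizes only
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

spec_sq = spectral_norm_sq_estimate(A)         # approx. ||A||_2^2
frob_sq = sum(row_norm_sq(r) for r in A)       # ||A||_F^2 = sum_i ||a_i||^2
R_sq = max(row_norm_sq(r) for r in A)          # R^2 = max_i ||a_i||^2

chain_holds = spec_sq <= frob_sq <= n * R_sq
print(spec_sq, frob_sq, n * R_sq, chain_holds)
```

The second inequality is typically far from tight unless all rows have similar norms, which is consistent with the earlier remark that estimates in terms of $R$ can be loose upper bounds.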

Next, we lower bound the last term on the right-hand side of inequality (62). Indeed, it can be expanded as

Note that the probability $p_{k}$ given in (28) satisfies

Since the parameters $\tau$ and $\sigma$ satisfy $\sigma\tau\bar{R}^{2}=\alpha^{2}/4$, we have $p_{k}^{2}n^{2}/\tau\geq 4\sigma\|a_{k}\|_{2}^{2}$ and consequently

Combining the above two inequalities with lower bounds (62) and (63), we obtain

Recall that the parameters $\tau$ , $\sigma$ , and $\theta$ are chosen to be

Plugging in these assignments and using the fact that $p_{i}\geq\frac{1-\alpha}{n}$ , we find that

Therefore, if we define a sequence $\Delta^{(t)}$ such that

then inequality (64) implies the recursive relation $\Delta^{(t+1)}\leq\theta\cdot\Delta^{(t)}$ , which implies

To eliminate the last two terms on the left-hand side of inequality (65), we notice that

where in the equality we used $n^{2}/\tau=(4/\alpha^{2})\sigma n^{2}\bar{R}^{2}=(4/\alpha^{2})\sigma\left(\sum_{i=1}^{n}\|a_{i}\|_{2}\right)^{2}$ . This implies

Substituting the above inequality into inequality (65) completes the proof.
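The two facts about the sampling probabilities used in this proof, $p_{i}\geq\frac{1-\alpha}{n}$ and $p_{k}^{2}n^{2}/\tau\geq 4\sigma\|a_{k}\|_{2}^{2}$, can be checked numerically. The sketch below assumes that the probabilities in (28) are the mixture $p_{k}=\frac{1-\alpha}{n}+\alpha\|a_{k}\|_{2}/\sum_{i}\|a_{i}\|_{2}$; this form is an assumption made here for illustration and should be compared against (28). It also takes $\bar{R}=\frac{1}{n}\sum_{i}\|a_{i}\|_{2}$, picks an arbitrary $\sigma>0$, and sets $\tau$ from $\sigma\tau\bar{R}^{2}=\alpha^{2}/4$. The function name and data are hypothetical.

```python
import math
import random

def sampling_probs(norms, alpha):
    """Assumed mixture of uniform and norm-proportional sampling."""
    n, total = len(norms), sum(norms)
    return [(1 - alpha) / n + alpha * r / total for r in norms]

random.seed(2)
n, d = 30, 5                      # illustrative sizes only
alpha, sigma = 0.5, 0.7           # arbitrary choices with 0 < alpha < 1
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

norms = [math.sqrt(sum(v * v for v in row)) for row in A]
R_bar = sum(norms) / n                        # average row norm
tau = alpha ** 2 / (4 * sigma * R_bar ** 2)   # from sigma*tau*R_bar^2 = alpha^2/4

p = sampling_probs(norms, alpha)
sums_to_one = abs(sum(p) - 1.0) < 1e-12
floor_ok = all(pk >= (1 - alpha) / n for pk in p)
step_ok = all(pk ** 2 * n ** 2 / tau >= 4 * sigma * r ** 2
              for pk, r in zip(p, norms))
print(sums_to_one, floor_ok, step_ok)
```

Under the assumed form, $p_{k}n\geq\alpha\|a_{k}\|_{2}/\bar{R}$, and squaring and dividing by $\tau=\alpha^{2}/(4\sigma\bar{R}^{2})$ gives the second bound directly.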

Given $x_{j}^{(t_{0}+1)}$ at iteration $t_{0}$ , we present an efficient algorithm for calculating $x_{j}^{(t_{1})}$ . We begin by examining the sign of $x_{j}^{(t_{0}+1)}$ .

If $-u_{j}^{(t_{0}+1)}<-\lambda_{1}$ , then equation (69) implies $x_{j}^{(t)}<0$ for all $t>t_{0}+1$ . Therefore, we have the closed-form formula:

Finally, if $-u_{j}^{(t_{0}+1)}\in[-\lambda_{1},\lambda_{1}]$ , then equation (69) implies $x^{(t_{1})}_{j}=0$ .

We look for the largest $t^{+}$ such that the right-hand side of equation (72) is positive, which is equivalent to

Thus, $t^{+}$ is the largest integer in $[t_{0}+1,t_{1}]$ such that inequality (73) holds. If $t^{+}=t_{1}$, then $x_{j}^{(t_{1})}$ is obtained by (72). Otherwise, we can calculate $x_{j}^{(t^{+}+1)}$ by formula (69), then resort to Case I or Case III, treating $t^{+}$ as $t_{0}$.

where $t^{-}$ is the largest integer in $[t_{0}+1,t_{1}]$ such that the following inequality holds:

If $t^{-}=t_{1}$, then $x_{j}^{(t_{1})}$ is obtained by (74). Otherwise, we can calculate $x_{j}^{(t^{-}+1)}$ by formula (69), then resort to Case I or Case II, treating $t^{-}$ as $t_{0}$.

Finally, we note that formula (69) implies the monotonicity of $x_{j}^{(t)}$ for $t=t_{0}+1,t_{0}+2,\dots$. As a consequence, the procedure in each of Case I, Case II and Case III is executed at most once. Hence, the algorithm for calculating $x_{j}^{(t_{1})}$ has $\mathcal{O}(1)$ time complexity.