Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization

Yuchen Zhang, Lin Xiao

Introduction

We are especially interested in developing efficient algorithms for solving problem (1) when the number of samples nn is very large. In this case, evaluating the full gradient or subgradient of the function P(x)P(x) is very expensive, thus incremental methods that operate on a single component function ϕi\phi_{i} at each iteration can be very attractive. There have been extensive research on incremental (sub)gradient methods (e.g. ) as well as variants of the stochastic gradient method (e.g., ). While the computational cost per iteration of these methods is only a small fraction, say 1/n1/n, of that of the batch gradient methods, their iteration complexities are much higher (it takes many more iterations for them to reach the same precision). In order to better quantify the complexities of various algorithms and position our contributions, we need to make some concrete assumptions and introduce the notion of condition number and batch complexity.

Let γ\gamma and λ\lambda be two positive real parameters. We make the following assumption:

Each ϕi\phi_{i} is convex and differentiable, and its derivative is (1/γ)(1/\gamma)-Lipschitz continuous (same as ϕi\phi_{i} being (1/γ)(1/\gamma)-smooth), i.e.,

In addition, the regularization function gg is λ\lambda-strongly convex, i.e.,

Under Assumption A, the gradient of each component function, ϕi(aiTx)\nabla\phi_{i}(a_{i}^{T}x), is also Lipschitz continuous, with Lipschitz constant Li=ai22/γR2/γL_{i}=\|a_{i}\|_{2}^{2}/\gamma\leq R^{2}/\gamma, where R=maxiai2R=\max_{i}\|a_{i}\|_{2}. In other words, each ϕi(aiTx)\phi_{i}(a_{i}^{T}x) is (R2/γ)(R^{2}/\gamma)-smooth. We define a condition number

and focus on ill-conditioned problems where κ1\kappa\gg 1. In the statistical learning context, the regularization parameter λ\lambda is usually on the order of 1/n1/\sqrt{n} or 1/n1/n (e.g., ), thus κ\kappa is on the order of n\sqrt{n} or nn. It can be even larger if the strong convexity in gg is added purely for numerical regularization purposes (see Section 3). We note that the actual conditioning of problem (1) may be better than κ\kappa, if the empirical loss function (1/n)i=1nϕi(aiTx)(1/n)\sum_{i=1}^{n}\phi_{i}(a_{i}^{T}x) by itself is strongly convex. In those cases, our complexity estimates in terms of κ\kappa can be loose (upper bounds), but they are still useful in comparing different algorithms for solving the same given problem.

To make fair comparisons with batch methods, we measure the complexity of stochastic or incremental gradient methods in terms of the number of equivalent passes over the dataset required to reach an expected precision ϵ\epsilon. We call this measure the batch complexity, which are usually obtained by dividing their iteration complexities by nn. For example, the batch complexity of the stochastic gradient method is O(κ/(nϵ))\mathcal{O}(\kappa/(n\epsilon)). The batch complexities of full gradient methods are the same as their iteration complexities.

By carefully exploiting the finite average structure in (1) and other similar problems, several recent work proposed new variants of the stochastic gradient or dual coordinate ascent methods and obtained the iteration complexity O((n+κ)log(1/ϵ))\mathcal{O}((n+\kappa)\log(1/\epsilon)). Since their computational cost per iteration is O(d)\mathcal{O}(d), the equivalent batch complexity is 1/n1/n of their iteration complexity, i.e., O((1+κ/n)log(1/ϵ))\mathcal{O}((1+\kappa/n)\log(1/\epsilon)). This complexity has much weaker dependence on nn than the full gradient methods, and also much weaker dependence on ϵ\epsilon than the stochastic gradient methods.

In this paper, we propose a stochastic primal-dual coordinate (SPDC) method, which has the iteration complexity

When κ>n\kappa>n, this is lower than the O((1+κ/n)log(1/ϵ))\mathcal{O}((1+\kappa/n)\log(1/\epsilon)) batch complexity mentioned above. Indeed, it is very close to a lower bound for minimizing finite sums recently established in .

2 Outline of the paper

Our approach is based on reformulating problem (1) as a convex-concave saddle point problem, and then devising a primal-dual algorithm to approximate the saddle point. More specifically, we replace each component function ϕi(aiTx)\phi_{i}(a_{i}^{T}x) through convex conjugation, i.e.,

Under Assumption A, each ϕi\phi^{*}_{i} is γ\gamma-strongly convex (since ϕi\phi_{i} is (1/γ)(1/\gamma)-smooth; see, e.g., [15, Theorem 4.2.2]) and gg is λ\lambda-strongly convex. As a consequence, the saddle point problem (4) has a unique solution, which we denote by (x,y)(x^{\star},y^{\star}).

In Section 2, we present the SPDC method as well as its convergence analysis. It alternates between maximizing ff over a randomly chosen dual coordinate yiy_{i} and minimizing ff over the primal variable xx. In order to accelerate the convergence, an extrapolation step is applied in updating the primal variable xx. We also give a mini-batch SPDC algorithm which is well suited for parallel computing.

In Section 3 and Section 4, we present two extensions of the SPDC method. We first explain how to solve problem (1) when Assumption A does not hold. The idea is to apply small regularizations to the saddle point function so that SPDC can still be applied, which results in accelerated sublinear rates. The second extension is a SPDC method with non-uniform sampling. The batch complexity of this algorithm has the same form as (3), but with κ=Rˉ/(λγ)\kappa=\bar{R}/(\lambda\gamma), where Rˉ=1ni=1nai\bar{R}=\frac{1}{n}\sum_{i=1}^{n}\|a_{i}\|, which can be much smaller than R=maxiaiR=\max_{i}\|a_{i}\| if there is considerable variation in the norms ai\|a_{i}\|.

In Section 5, we discuss related work. In particular, the SPDC method can be viewed as a coordinate-update extension of the batch primal-dual algorithm developed by Chambolle and Pock . We also discuss two very recent work which achieve the same batch complexity (3).

In Section 7, we present experiment results comparing SPDC with several state-of-the-art optimization methods, including both batch algorithms and randomized incremental and coordinate gradient methods. On all scenarios we tested, SPDC has comparable or better performance.

The SPDC method

We give the details of the SPDC method in Algorithm 1. The dual coordinate update and primal vector update are given in equations (7) and (8) respectively. Instead of maximizing ff over yky_{k} and minimizing ff over xx directly, we add two quadratic regularization terms to penalize yk(t+1)y^{(t+1)}_{k} and x(t+1)x^{(t+1)} from deviating from yk(t)y^{(t)}_{k} and x(t)x^{(t)}. The parameters σ\sigma and τ\tau control their regularization strength, which we will specify in the convergence analysis (Theorem 1). Moreover, we introduce two auxiliary variables u(t)u^{(t)} and x(t)\overline{x}^{(t)}. From the initialization u(0)=(1/n)i=1nyi(0)aiu^{(0)}=(1/n)\sum_{i=1}^{n}y^{(0)}_{i}a_{i} and the update rules (7) and (9), we have

Equation (10) obtains x(t+1)\overline{x}^{(t+1)} based on extrapolation from x(t)x^{(t)} and x(t+1)x^{(t+1)}. This step is similar to Nesterov’s acceleration technique [25, Section 2.2], and yields faster convergence rate.

The Mini-Batch SPDC method in Algorithm 2 is a natural extension of SPDC in Algorithm 1. The difference between these two algorithms is that, the Mini-Batch SPDC method may simultaneously select more than one dual coordinates to update. Let mm be the mini-batch size. During each iteration, the Mini-Batch SPDC method randomly picks a subset of indices K{1,,n}K\subset\{1,\ldots,n\} of size mm, such that the probability of each index being picked is equal to m/nm/n. The following is a simple procedure to achieve this. First, partition the set of indices into mm disjoint subsets, so that the cardinality of each subset is equal to n/mn/m (assuming mm divides nn). Then, during each iteration, randomly select a single index from each subset and add it to KK. Other approaches for mini-batch selection are also possible; see the discussions in .

In Algorithm 2, we also switched the order of updating x(t+1)x^{(t+1)} and u(t+1)u^{(t+1)} (comparing with Algorithm 1), to better illustrate that x(t+1)x^{(t+1)} is obtained based on an extrapolation from u(t)u^{(t)} to u(t+1)u^{(t+1)}. However, this form is not recommended in implementation, because u(t)u^{(t)} is usually a dense vector even if the feature vectors aka_{k} are sparse. Details on efficient implementation of SPDC are given in Section 6. In the following discussion, we do not make sparseness assumptions.

With a single processor, each iteration of Algorithm 2 takes O(md)\mathcal{O}(md) time to accomplish. Since the updates of each coordinate yky_{k} are independent of each other, we can use parallel computing to accelerate the Mini-Batch SPDC method. Concretely, we can use mm processors to update the mm coordinates in the subset KK in parallel, then aggregate them to update x(t+1)x^{(t+1)}. In terms of wall-clock time, each iteration takes O(d)\mathcal{O}(d) time, which is the same as running one iteration of the basic SPDC algorithm. Not surprisingly, we will show that the Mini-Batch SPDC algorithm converges faster than SPDC in terms of the iteration complexity, because it processes multiple dual coordinates in a single iteration.

Since the basic SPDC algorithm is a special case of Mini-Batch SPDC with m=1m=1, we only present a convergence theorem for the mini-batch version. The expectations in the following results are taken with respect to the random variables {K(0),,K(T1)}\{K^{(0)},\ldots,K^{(T-1)}\}, where K(t)K^{(t)} denotes the random subset K{1,,n}K\subset\{1,\ldots,n\} picked at the tt-th iteration of the SPDC method.

Assume that each ϕi\phi_{i} is (1/γ)(1/\gamma)-smooth and gg is λ\lambda-strongly convex (Assumption A). Let (x,y)(x^{\star},y^{\star}) be the unique saddle point of ff defined in (4), R=max{a12,,an2}R=\max\{\left\|{a_{1}}\right\|_{2},\ldots,\left\|{a_{n}}\right\|_{2}\}, and define

If the parameters τ,σ\tau,\sigma and θ\theta in Algorithm 2 are chosen such that

then for each t1t\geq 1, the Mini-Batch SPDC algorithm achieves

The proof of Theorem 1 is given in Appendix A. The following corollary establishes the expected iteration complexity of Mini-Batch SPDC for obtaining an ϵ\epsilon-accurate solution.

Suppose Assumption A holds and the parameters τ\tau, σ\sigma and θ\theta are set as in (16). In order for Algorithm 2 to obtain

it suffices to have the number of iterations TT satisfy

Applying the inequality log(1x)x-\log(1-x)\geq x to the denominator above completes the proof. ∎

Recall the definition of the condition number κ=R2/(λγ)\kappa=R^{2}/(\lambda\gamma) in (2). Corollary 1 establishes that the iteration complexity of the Mini-Batch SPDC method for achieving (17) is

So a larger batch size mm leads to less number of iterations. In the extreme case of n=mn=m, we obtain a full batch algorithm, which has iteration or batch complexity O((1+κ)log(1/ϵ))\mathcal{O}((1+\sqrt{\kappa})\log(1/\epsilon)). This complexity is also shared by the AFG methods (see Section 1.1), as well as the batch primal-dual algorithm of Chambolle and Pock (see discussions on related work in Section 5).

Since an equivalent pass over the dataset corresponds to n/mn/m iterations, the batch complexity (the number of equivalent passes over the data) of Mini-Batch SPDC is

The above expression implies that a smaller batch size mm leads to less number of passes through the data. In this sense, the basic SPDC method with m=1m=1 is the most efficient one. However, if we prefer the least amount of wall-clock time, then the best choice is to choose a mini-batch size mm that matches the number of parallel processors available.

2 Convergence rate of primal-dual gap

In the previous subsection, we established iteration complexity of the Mini-Batch SPDC method in terms of approximating the saddle point of the minimax problem (4), more specifically, to meet the requirement in (17). Next we show that it has the same order of complexity in reducing the primal-dual objective gap P(x(t))D(y(t))P(x^{(t)})-D(y^{(t)}), where P(x)P(x) is defined in (1) and

Under Assumption A, the function f(x,y)f(x,y) defined in (4) has a unique saddle point (x,y)(x^{\star},y^{\star}), and

Thus the result in Theorem 1 does not translate directly into a convergence bound on the primal-dual gap. We need to bound P(x)P(x) and D(y)D(y) by f(x,y)f(x,y^{\star}) and f(x,y)f(x^{\star},y), respectively, in the opposite directions. For this purpose, we need the following lemma, which we extracted from . We provide the proof in Appendix B for completeness.

Suppose Assumption A holds and the parameters τ\tau, σ\sigma and θ\theta are set as in (16). Let Δ~(0):=Δ(0)+y(0)y224mσ\widetilde{\Delta}^{(0)}:=\Delta^{(0)}+\frac{\|{y^{(0)}-y^{\star}}\|_{2}^{2}}{4m\sigma}. Then for any ϵ0\epsilon\geq 0, the iterates of Algorithm 2 satisfy

The function f(x,y)f(x,y^{\star}) is strongly convex in xx with parameter λ\lambda, and xx^{\star} is the minimizer. Similarly, f(x,y)-f(x^{\star},y) is strongly convex in yy with parameter γ/n\gamma/n, and is minimized by yy^{\star}. Therefore,

We bound the following weighted primal-dual gap

The first inequality above is due to Lemma 1, the second and fourth inequalities are due to the definition of Δ(t)\Delta^{(t)}, and the third inequality is due to (19). Taking expectations on both sides of the above inequality, then applying Theorem 1, we obtain

Since nmn\geq m and D(y)D(y(t)))0D(y^{\star})-D(y^{(t)}))\geq 0, this implies the desired result. ∎

Extensions to non-smooth or non-strongly convex functions

To be concise, we only consider the case where neither ϕi\phi_{i} is smooth nor gg is strongly convex. Formally, we assume that each ϕi\phi_{i} and gg are convex and Lipschitz continuous, and f(x,y)f(x,y) has a saddle point (x,y)(x^{\star},y^{\star}). We choose a scalar δ>0\delta>0 and consider the modified saddle-point function:

Denote by (xδ,yδ)(x^{\star}_{\delta},y^{\star}_{\delta}) the saddle-point of fδf_{\delta}. We employ the Mini-Batch SPDC method (Algorithm 2) to approximate (xδ,yδ)(x^{\star}_{\delta},y^{\star}_{\delta}), treating ϕi+δ2()2\phi^{*}_{i}+\frac{\delta}{2}(\cdot)^{2} as ϕi\phi^{*}_{i} and g+δ222g+\frac{\delta}{2}\|{\cdot}\|_{2}^{2} as gg, which are all δ\delta-strongly convex. We note that adding strongly convex perturbation on ϕi\phi_{i}^{*} is equivalent to smoothing ϕi\phi_{i}, which becomes (1/δ)(1/\delta)-smooth (see, e.g., ). Letting γ=λ=δ\gamma=\lambda=\delta, the parameters τ\tau, σ\sigma and θ\theta in (16) become

Although (xδ,yδ)(x^{\star}_{\delta},y^{\star}_{\delta}) is not exactly the saddle point of ff, the following corollary shows that applying the SPDC method to the perturbed function fδf_{\delta} effectively minimizes the original loss function PP. Similar results for the convergence of the primal-dual gap can also be established.

Assume that each ϕi\phi_{i} is convex and GϕG_{\phi}-Lipschitz continuous, and gg is convex and GgG_{g}-Lipschitz continuous. Define two constants:

Let y~=argmaxyf(xδ,y)\widetilde{y}=\arg\max_{y}f(x^{\star}_{\delta},y) be a shorthand notation. We have

Here, equations (i) and (vii) use the definition of the function ff, inequalities (ii) and (v) use the definition of the function fδf_{\delta}, inequalities (iii) and (iv) use the fact that (xδ,yδ)(x^{\star}_{\delta},y^{\star}_{\delta}) is the saddle point of fδf_{\delta}, and inequality (vi) is due to the fact that (x,y)(x^{\star},y^{\star}) is the saddle point of ff.

Since ϕi\phi_{i} is GϕG_{\phi}-Lipschitz continuous, the domain of ϕi\phi^{*}_{i} is in the interval [Gϕ,Gϕ][-G_{\phi},G_{\phi}], which implies y~22nGϕ2\|{\widetilde{y}}\|_{2}^{2}\leq nG_{\phi}^{2} (see, e.g., [38, Lemma 1]). Thus, we have

On the other hand, since PP is (GϕR+Gg)(G_{\phi}R+G_{g})-Lipschitz continuous, Theorem 1 implies

The corollary is established by finding the smallest TT that satisfies inequality (23). ∎

There are two other cases that can be considered: when ϕi\phi_{i} is not smooth but gg is strongly convex, and when ϕi\phi_{i} is smooth but gg is not strongly convex. They can be handled with the same technique described above, and we omit the details here. In Table 1, we list the complexities of the Mini-Batch SPDC method for finding an ϵ\epsilon-optimal solution of problem (1) under various assumptions. Similar results are also obtained in .

SPDC with non-uniform sampling

The basic idea is to use non-uniform sampling in picking the dual coordinate to update at each iteration. In Algorithm 3, we pick coordinate kk with the probability

where α(0,1)\alpha\in(0,1) is a parameter. In other words, this distribution is a (strict) convex combination of the uniform distribution and the distribution that is proportional to the feature norms. Therefore, instances with large feature norms are sampled more frequently, controlled by α\alpha. Simultaneously, we adopt an adaptive regularization in step (26), imposing stronger regularization on such instances. In addition, we adjust the weight of aka_{k} in (27) for updating the primal variable. As a consequence, the convergence rate of Algorithm 3 depends on the average norm of feature vectors, as well as the parameter α\alpha. This is summarized in the following theorem.

Suppose Assumption A holds. Let Rˉ=1ni=1nai2\bar{R}=\frac{1}{n}\sum_{i=1}^{n}\|{a_{i}}\|_{2}. If the parameters τ,σ,θ\tau,\sigma,\theta in Algorithm 3 are chosen such that

Since θ\theta is a bound on the convergence factor, we would like to make it as small as possible. For its expression in (29), it can be minimized by choosing

where κˉ=Rˉ2/(λγ)\bar{\kappa}=\bar{R}^{2}/(\lambda\gamma) is an average condition number. We have α=1/2\alpha^{\star}=1/2 if κˉ=n\bar{\kappa}=n. The value of α\alpha^{\star} decreases slowly to zero as the ratio n/κˉn/\bar{\kappa} grows, and increases to one as the ratio n/κˉn/\bar{\kappa} drops. Thus, we may choose a relatively uniform distribution for well conditioned problems, but a more aggressively weighted distribution for ill-conditioned problems.

For simplicity of presentation, we described in Algorithm 3 a weighted sampling SPDC method with single dual coordinate update, i.e., the case of m=1m=1. It is not hard to see that the non-uniform sampling scheme can also be extended to Mini-Batch SPDC with m>1m>1. Here, we omit the technical details.

Related Work

Chambolle and Pock considered a class of convex optimization problems with the following saddle-point structure:

When both FF^{*} and GG are strongly convex and the parameters τ\tau, σ\sigma and θ\theta are chosen appropriately, this algorithm obtains accelerated linear convergence rate [9, Theorem 3].

We can map the saddle-point problem (4) into the form of (30) by letting A=[a1,,an]TA=[a_{1},\ldots,a_{n}]^{T} and

The SPDC method developed in this paper can be viewed as an extension of the batch method (31)-(33), where the dual update step (31) is replaced by a single coordinate update (7) or a mini-batch update (13). However, in order to obtain accelerated convergence rate, more subtle changes are necessary in the primal update step. More specifically, we introduced the auxiliary variable u(t)=1ni=1nyi(t)ai=KTy(t)u^{(t)}=\frac{1}{n}\sum_{i=1}^{n}y_{i}^{(t)}a_{i}=K^{T}y^{(t)}, and replaced the primal update step (32) by (8) and (14). The primal extrapolation step (33) stays the same.

To compare the batch complexity of SPDC with that of (31)-(33), we use the following facts implied by Assumption A and the relations in (34):

Based on these conditions, we list in Table 2 the equivalent parameters used in [9, Algorithm 3] and the batch complexity obtained in [9, Theorem 3], and compare them with SPDC.

The batch complexity of the Chambolle-Pock algorithm is O~(1+A2/(2nλγ))\widetilde{\mathcal{O}}(1+\|A\|_{2}/(2\sqrt{n\lambda\gamma})), where the O~()\widetilde{\mathcal{O}}(\cdot) notation hides the log(1/ϵ)\log(1/\epsilon) factor. We can bound the spectral norm A2\|A\|_{2} by the Frobenius norm AF\|A\|_{F} and obtain

(Note that the second inequality above would be an equality if the columns of AA are normalized.) So in the worst case, the batch complexity of the Chambolle-Pock algorithm becomes

which matches the worst-case complexity of the AFG methods (see Section 1.1 and also the discussions in [20, Section 5]). This is also of the same order as the complexity of SPDC with m=nm=n (see Section 2.1). When the condition number κ1\kappa\gg 1, they can be n\sqrt{n} worse than the batch complexity of SPDC with m=1m=1, which is O~(1+κ/n)\widetilde{\mathcal{O}}(1+\sqrt{\kappa/n}).

If either G(x)G(x) or F(y)F^{*}(y) in (30) is not strongly convex, Chambolle and Pock proposed variants of the primal-dual batch algorithm to achieve accelerated sublinear convergence rates [9, Section 5.1]. It is also possible to extend them to coordinate update methods for solving problem (1) when either ϕi\phi_{i}^{*} or gg is not strongly convex. Their complexities would be similar to those in Table 1.

Our algorithms and theory can be readily generalized to solve the problem of

We can also solve the primal problem (1) via its dual:

Because of the problem structure, coordinate ascent methods (e.g., ) can be more efficient than full gradient methods. In the stochastic dual coordinate ascent (SDCA) method , a dual coordinate yiy_{i} is picked at random during each iteration and updated to increase the dual objective value. Shalev-Shwartz and Zhang showed that the iteration complexity of SDCA is O((n+κ)log(1/ϵ))O\left((n+\kappa)\log(1/\epsilon)\right), which corresponds to the batch complexity O~(1+κ/n)\widetilde{\mathcal{O}}(1+\kappa/n).

For more general convex optimization problems, there is a vast literature on coordinate descent methods; see, e.g., the recent overview by Wright . In particular, Nesterov’s work on randomized coordinate descent sparked a lot of recent activities on this topic. Richtárik and Takáč extended the algorithm and analysis to composite convex optimization. When applied to the dual problem (35), it becomes one variant of SDCA studied in . Mini-batch and distributed versions of SDCA have been proposed and analyzed in and respectively. Non-uniform sampling schemes have been studied for both stochastic gradient and SDCA methods (e.g., ).

Shalev-Shwartz and Zhang proposed an accelerated mini-batch SDCA method which incorporates additional primal updates than SDCA, and bears some similarity to our Mini-Batch SPDC method. They showed that its complexity interpolates between that of SDCA and AFG by varying the mini-batch size mm. In particular, for m=nm=n, it matches that of the AFG methods (as SPDC does). But for m=1m=1, the complexity of their method is the same as SDCA, which is worse than SPDC for ill-conditioned problems.

More recently, Lin et al. developed an accelerated proximal coordinate gradient (APCG) method for solving a more general class of composite convex optimization problems. When applied to the dual problem (35), APCG enjoys the same batch complexity \widetilde{\mathcal{O}}\bigl{(}1+\sqrt{\kappa/n}\bigr{)} as of SPDC. However, it needs an extra primal proximal-gradient step to have theoretical guarantees on the convergence of primal-dual gap [20, Section 5.1]. The computational cost of this additional step is equivalent to one pass of the dataset, thus it does not affect the overall complexity.

2 Other related work

Another way to approach problem (1) is to reformulate it as a constrained optimization problem

Suzuki considered a problem similar to (1), but with more complex regularization function gg, meaning that gg does not have a simple proximal mapping. Thus primal updates such as step (8) or (14) in SPDC and similar steps in SDCA cannot be computed efficiently. He proposed an algorithm that combines SDCA and ADMM (e.g., ), and showed that it has linear rate of convergence under similar conditions as Assumption A. It would be interesting to see if the SPDC method can be extended to their setting to obtain accelerated linear convergence rate.

Efficient Implementation with Sparse Data

Suppose that g(x)=λ2x22g(x)=\frac{\lambda}{2}\|{x}\|_{2}^{2}. For this case, the updates for each coordinate of xx are independent of each other. More specifically, x(t+1)x^{(t+1)} can be computed coordinate-wise in closed form:

where Δu\Delta u denotes (yk(t+1)yk(t))ak(y_{k}^{(t+1)}-y_{k}^{(t)})a_{k} in Algorithm 1, or 1mkK(yk(t+1)yk(t))ak\frac{1}{m}\sum_{k\in K}(y_{k}^{(t+1)}-y_{k}^{(t)})a_{k} in Algorithm 2, or (yk(t+1)yk(t))ak/(pkn)(y_{k}^{(t+1)}-y_{k}^{(t)})a_{k}/(p_{k}n) in Algorithm 3, and Δuj\Delta u_{j} represents the jj-th coordinate of Δu\Delta u.

Although the dimension dd can be very large, we assume that each feature vector aka_{k} is sparse. We denote by J(t)J^{(t)} the set of non-zero coordinates at iteration tt, that is, if for some index kKk\in K picked at iteration tt we have akj0a_{kj}\neq 0, then jJ(t)j\in J^{(t)}. If jJ(t)j\notin J^{(t)}, then the SPDC algorithm (and its variants) updates y(t+1)y^{(t+1)} without using the value of xj(t)x_{j}^{(t)} or xj(t)\overline{x}_{j}^{(t)}. This can be seen from the updates in (7), (13) and (26), where the value of the inner product ak,x(t)\langle a_{k},\overline{x}^{(t)}\rangle does not depend on the value of xj(t)\overline{x}^{(t)}_{j}. As a consequence, we can delay the updates on xjx_{j} and xj\overline{x}_{j} whenever jJ(t)j\notin J^{(t)} without affecting the updates on y(t)y^{(t)}, and process all the missing updates at the next time when jJ(t)j\in J^{(t)}.

Such a delayed update can be carried out very efficiently. We assume that t0t_{0} is the last time when jJ(t)j\in J^{(t)}, and t1t_{1} is the current iteration where we want to update xjx_{j} and xj\overline{x}_{j}. Since jJ(t)j\notin J^{(t)} implies Δuj=0\Delta u_{j}=0, we have

Notice that uj(t)u_{j}^{(t)} is updated only at iterations where jJ(t)j\in J^{(t)}. The value of uj(t)u_{j}^{(t)} doesn’t change during iterations [t0+1,t1][t_{0}+1,t_{1}], so we have uj(t)uj(t0+1)u_{j}^{(t)}\equiv u_{j}^{(t_{0}+1)} for t[t0+1,t1]t\in[t_{0}+1,t_{1}]. Substituting this equation into the recursive formula (38), we obtain

The update (39) takes O(1)\mathcal{O}(1) time to compute. Using the same formula, we can compute xj(t11)x^{(t_{1}-1)}_{j} and subsequently compute xj(t1)=xj(t1)+θ(xj(t1)xj(t11))\overline{x}^{(t_{1})}_{j}=x^{(t_{1})}_{j}+\theta(x^{(t_{1})}_{j}-x^{(t_{1}-1)}_{j}). Thus, the computational complexity of a single iteration in SPDC is proportional to J(t)|J^{(t)}|, independent of the dimension dd.

where Δuj\Delta u_{j} follows the definition in Section 6.1. If jJ(t)j\notin J^{(t)}, then Δuj=0\Delta u_{j}=0 and equation (40) can be simplified as

Similar to the approach of Section 6.1, we delay the update of xjx_{j} until jJ(t)j\in J^{(t)}. We assume t0t_{0} to be the last iteration when jJ(t)j\in J^{(t)}, and let t1t_{1} be the current iteration when we want to update xjx_{j}. During iterations [t0+1,t1][t_{0}+1,t_{1}], the value of uj(t)u^{(t)}_{j} doesn’t change, so we have uj(t)uj(t0+1)u_{j}^{(t)}\equiv u_{j}^{(t_{0}+1)} for t[t0+1,t1]t\in[t_{0}+1,t_{1}]. Using equation (44) and the invariance of uj(t)u_{j}^{(t)} for t[t0+1,t1]t\in[t_{0}+1,t_{1}], we have an O(1)\mathcal{O}(1) time algorithm to calculate xj(t1)x_{j}^{(t_{1})}, which we detail in Appendix D. The vector xj(t1)\overline{x}^{(t_{1})}_{j} can be updated by the same algorithm since it is a linear combination of xj(t1)x_{j}^{(t_{1})} and xj(t11)x_{j}^{(t_{1}-1)}. As a consequence, the computational complexity of each iteration in SPDC is proportional to J(t)|J^{(t)}|, independent of the dimension dd.

Experiments

In this section, we compare the basic SPDC method (Algorithm 1) with several state-of-the-art optimization algorithms for solving problem (1). They include two batch-update algorithms: the accelerated full gradient (FAG) method [25, Section 2.2], and the limited-memory quasi-Newton method L-BFGS [29, Section 7.2]). For the AFG method, we adopt an adaptive line search scheme (e.g., ) to improve its efficiency. For the L-BFGS method, we use the memory size 30 as suggested by . We also compare SPDC with three stochastic algorithms: the stochastic average gradient (SAG) method , the stochastic dual coordinate descent (SDCA) method and the accelerated stochastic dual coordinate descent (ASDCA) method . We conduct experiments on a synthetic dataset and three real datasets.

We first compare SPDC with other algorithms on a simple quadratic problem using synthetic data. We generate n=500n=500 i.i.d. training examples {ai,bi}i=1n\{a_{i},b_{i}\}_{i=1}^{n} according to the model

In the form of problem (1), we have ϕi(z)=z2/2\phi_{i}(z)=z^{2}/2 and g(x)=(1/2)x22g(x)=(1/2)\|{x}\|_{2}^{2}. As a consequence, the derivative of ϕi\phi_{i} is 11-Lipschitz continuous and gg is λ\lambda-strongly convex.

We evaluate the algorithms by the logarithmic optimality gap log(P(x(t))P(x))\log(P(x^{(t)})-P(x^{\star})), where x(t)x^{(t)} is the output of the algorithms after tt passes over the entire dataset, and xx^{\star} is the global minimum. When the regularization coefficient is relatively large, e.g., λ=101\lambda=10^{-1} or 10210^{-2}, the problem is well-conditioned and we observe fast convergence of the stochastic algorithms SAG, SDCA, ASDCA and SPDC, which are substantially faster than the two batch methods AFG and L-BFGS.

Figure 1 shows the convergence of the five different algorithms when we varied λ\lambda from 10310^{-3} to 10610^{-6}. As the plot shows, when the condition number is greater than nn, the SPDC algorithm also converges substantially faster than the other two stochastic methods SAG and SDCA. It is also notably faster than L-BFGS. These results support our theory that SPDC enjoys a faster convergence rate on ill-conditioned problems. In terms of their batch complexities, SPDC is up to n\sqrt{n} times faster than AFG, and (λn)1/2(\lambda n)^{-1/2} times faster than SAG and SDCA.

Theoretically, ASDCA enjoys the same batch complexity as SPDC up to a multiplicative constant factor. Figure 1 shows that the empirical performance of SPDC is substantially faster that of ASDCA for small λ\lambda. This may due to the fact that ASDCA follows an inner-outer iteration procedure, while SPDC is a single-loop algorithm, explaining why it is empirically more efficient.

2 Binary classification with real data

Here, ϕi\phi_{i} is the smoothed hinge loss (see, e.g., ). It is easy to verify that the conjugate function of ϕi\phi_{i} is ϕi(β)=biβ+12β2\phi^{*}_{i}(\beta)=b_{i}\beta+\frac{1}{2}\beta^{2} for biβb_{i}\beta\in and \infty otherwise.

The performance of the five algorithms are plotted in Figure 2 and Figure 3. In Figure 2, we compare SPDC with the two batch methods: AFG and L-BFGS. The results show that SPDC is substantially faster than AFG and L-BFGS for relatively large λ\lambda, illustrating the advantage of stochastic methods over batch methods on well-conditioned problems. As λ\lambda decreases to 10810^{-8}, the batch methods (especially L-BFGS) become comparable to SPDC.

Summarizing Figure 2 and Figure 3, the performance of SPDC are always comparable or better than the other methods in comparison.

Appendix A Proof of Theorem 1

We focus on characterizing the values of xx and yy after the tt-th update in Algorithm 2. For any i{1,,n}i\in\{1,\ldots,n\}, let y~i\widetilde{y}_{i} be the value of yi(t+1)y_{i}^{(t+1)} if iKi\in K, i.e.,

Since ϕi\phi_{i} is (1/γ)(1/\gamma)-smooth by assumption, its conjugate ϕi\phi^{*}_{i} is γ\gamma-strongly convex (e.g., [15, Theorem 4.2.2]). Thus the function being maximized above is (1/σ+γ)(1/\sigma+\gamma)-strongly concave. Therefore,

Multiplying both sides of the above inequality by m/nm/n and re-arrange terms, we have

According to Algorithm 2, the set KK of indices to be updated are chosen randomly. For every specific index ii, the event iKi\in K happens with probability m/nm/n. If iKi\in K, then yi(t+1)y_{i}^{(t+1)} is updated to the value y~i\widetilde{y}_{i}, which satisfies inequality (45). Otherwise, yi(t+1)y_{i}^{(t+1)} is assigned by its old value yi(t)y_{i}^{(t)}. Let Ft\mathcal{F}_{t} be the sigma field generated by all random variables defined before round tt, and taking expectation conditioned on Ft\mathcal{F}_{t}, we have

As a result, we can represent (y~iyi)2(\widetilde{y}_{i}-y^{\star}_{i})^{2}, (y~iyi(t))2(\widetilde{y}_{i}-y_{i}^{(t)})^{2}, y~i\widetilde{y}_{i} and ϕi(y~i)\phi_{i}^{*}(\widetilde{y}_{i}) in terms of the conditional expectations on (yi(t+1)yi)2(y_{i}^{(t+1)}-y^{\star}_{i})^{2}, (yi(t+1)yi(t))2(y_{i}^{(t+1)}-y_{i}^{(t)})^{2}, yi(t+1)y_{i}^{(t+1)} and ϕi(yi(t+1))\phi_{i}^{*}(y_{i}^{(t+1)}), respectively. Plugging these representations into inequality (45) and re-arranging terms, we obtain

Then summing over all indices i=1,2,,ni=1,2,\dots,n and dividing both sides of the resulting inequality by mm, we have

where we used the shorthand notations (appeared in Algorithm 2)

Since only the dual coordinates with indices in KK are updated, we have

We also derive an inequality characterizing the relation between x(t+1)x^{(t+1)} and x(t)x^{(t)}. Since the function being minimized on the right-hand side of (14) has strong convexity parameter 1/τ+λ1/\tau+\lambda and x(t+1)x^{(t+1)} is the minimizer, we have

Rearranging terms and taking expectation conditioned on Ft\mathcal{F}_{t}, we have

In addition, we consider a particular combination of the saddle-point function values at different points. By the definition of f(x,y)f(x,y) in (4) and the notations in (48), we have

Next we add both sides of the inequalities (47) and (50) together, and then subtract equality (51) after taking expectation with respect to Ft\mathcal{F}_{t}. This leads to the following inequality:

We need to lower bound the last term on the right-hand-side of the above inequality. To this end, we have

Recall that ak2R\|{a_{k}}\|_{2}\leq R and, according to (16), 1/τ=4σR21/\tau=4\sigma R^{2}. Therefore,

The above upper bounds on the absolute values imply

Combining the above two inequalities with (52) and (53), we obtain

Note that we have added the nonnegative term \theta\bigl{(}f(x^{(t)},y^{\star})-f(x^{\star},y^{\star})\bigr{)} to the left-hand side in (54) to ensure that each term on one side of the inequality has a corresponding term on the other side.

If the parameters τ\tau, σ\sigma, and θ\theta are chosen as in (16), that is,

Then the ratios between the coefficients of the corresponding terms on both sides of the inequality (54) are either equal to θ\theta or bounded by θ\theta. More specifically,

Therefore, if we define the following sequence,

Comparing the definition of Δ(t)\Delta^{(t)} in (15), we have

For t=0t=0, by letting x(1)=x(0)x^{(-1)}=x^{(0)}, the last two terms in (56) for Δ~(0)\widetilde{\Delta}^{(0)} disappears. Moreover, we can show that the sum of the last three terms in (56) are nonnegative, and therefore we can replace Δ~(t)\widetilde{\Delta}^{(t)} with Δ(t)\Delta^{(t)} on the left-hand side of (55). To see this, we bound the absolute value of the last term:

where in the second inequality we used A22AF2nR2\|A\|_{2}^{2}\leq\|A\|_{F}^{2}\leq nR^{2}, in the equality we used τσ=1/(4R2)\tau\sigma=1/(4R^{2}), and in the last inequality we used mnm\leq n. The above upper bound on absolute value implies

Appendix B Proof of Lemma 1

Assumption A implies that F(x)F(x) is smooth and F(x)\nabla F(x) is Lipschitz continuous with constant A22/(nγ)\|A\|_{2}^{2}/(n\gamma). We can bound the spectral norm with the Frobenius norm, i.e., A22AF2nR2\|A\|_{2}^{2}\leq\|A\|_{F}^{2}\leq nR^{2}, which results in A22/(nγ)nR2/(nγ)=R2/γ\|A\|_{2}^{2}/(n\gamma)\leq nR^{2}/(n\gamma)=R^{2}/\gamma. By definition of the saddle point, the gradient of FF at xx^{\star} is F(x)=(1/n)ATy\nabla F(x^{\star})=(1/n)A^{T}y^{\star}. Therefore, we have

Combining the above inequality with P(x)=F(x)+g(x)P(x)=F(x)+g(x), we have

Similarly, the second inequality can be shown by first writing D(y)=1ni=1nϕi(yi)G(y)D(y)=-\frac{1}{n}\sum_{i=1}^{n}\phi_{i}^{*}(y_{i})-G^{*}(y), where

In this case, G(y)\nabla G^{*}(y) is Lipschitz continuous with constant A22/(n2λ)nR2/(n2λ)=R2/(nλ)\|A\|_{2}^{2}/(n^{2}\lambda)\leq nR^{2}/(n^{2}\lambda)=R^{2}/(n\lambda). Again by definition of the saddle-point, we have G(y)=(1/n)Ax\nabla G^{*}(y^{\star})=-(1/n)Ax^{\star}. Therefore,

Recalling that D(y)=1ni=1nϕi(yi)G(y)D(y)=-\frac{1}{n}\sum_{i=1}^{n}\phi_{i}^{*}(y_{i})-G^{*}(y), we conclude with

Appendix C Proof of Theorem 2

The proof of Theorem 2 follows similar steps for proving Theorem 1. We start by establishing relation between (y(t),y(t+1))(y^{(t)},y^{(t+1)}) and between (x(t),x(t+1))(x^{(t)},x^{(t+1)}). Suppose that the quantity y~i\widetilde{y}_{i} minimizes the function ϕi(β)βai,x(t)+pin2σ(βyi(t))2\phi^{*}_{i}(\beta)-\beta\langle a_{i},\overline{x}^{(t)}\rangle+\frac{p_{i}n}{2\sigma}(\beta-y_{i}^{(t)})^{2}. Also notice that ϕi(β)βai,x\phi_{i}^{*}(\beta)-\beta\langle a_{i},x^{*}\rangle is a γ\gamma-strongly convex function minimized by yiy_{i}^{*}, which implies

Then, following the same argument for establishing inequality (45) and plugging in inequality (57), we obtain

Note that i=ki=k with probability pip_{i}. Therefore, we have

where Ft\mathcal{F}_{t} represents the sigma field generated by all random variables defined before iteration tt. Substituting the above equations into inequality (58), and averaging over i=1,2,,ni=1,2,\dots,n, we have

where u=1ni=1nyiaiu^{\star}=\frac{1}{n}\sum_{i=1}^{n}y^{\star}_{i}a_{i} and u(t)=1ni=1nyi(t)aiu^{(t)}=\frac{1}{n}\sum_{i=1}^{n}y_{i}^{(t)}a_{i} have the same definition as in the proof of Theorem 1.

For the relation between x(t)x^{(t)} and x(t+1)x^{(t+1)}, we first notice that u,x+g(x)\langle u^{*},x\rangle+g(x) is a λ\lambda-strongly convex function minimized by xx^{*}, which implies

Following the same argument for establishing inequality (49) and plugging in inequality (60), we obtain

Taking expectation over both sides of inequality (C) and adding it to inequality (59) yields

where the matrix AA is a nn-by-dd matrix, whose ii-th row is equal to the vector aiTa_{i}^{T}.

Next, we lower bound the last term on the right-hand side of inequality (62). Indeed, it can be expanded as

Note that the probability pkp_{k} given in (28) satisfies

Since the parameters τ\tau and σ\sigma satisfies στRˉ2=α2/4\sigma\tau\bar{R}^{2}=\alpha^{2}/4, we have pk2n2/τ4σak22p_{k}^{2}n^{2}/\tau\geq 4\sigma\|a_{k}\|_{2}^{2} and consequently

Combining the above two inequalities with lower bounds (62) and (63), we obtain

Recall that the parameters τ\tau, σ\sigma, and θ\theta are chosen to be

Plugging in these assignments and using the fact that pi1αnp_{i}\geq\frac{1-\alpha}{n}, we find that

Therefore, if we define a sequence Δ(t)\Delta^{(t)} such that

then inequality (64) implies the recursive relation Δ(t+1)θΔ(t)\Delta^{(t+1)}\leq\theta\cdot\Delta^{(t)}, which implies

To eliminate the last two terms on the left-hand side of inequality (65), we notice that

where in the equality we used n2/τ=(4/α2)σn2Rˉ2=(4/α2)σ(i=1nai2)2n^{2}/\tau=(4/\alpha^{2})\sigma n^{2}\bar{R}^{2}=(4/\alpha^{2})\sigma\left(\sum_{i=1}^{n}\|a_{i}\|_{2}\right)^{2}. This implies

Substituting the above inequality into inequality (65) completes the proof.

Given xj(t0+1)x_{j}^{(t_{0}+1)} at iteration t0t_{0}, we present an efficient algorithm for calculating xj(t1)x_{j}^{(t_{1})}. We begin by examining the sign of xj(t0+1)x_{j}^{(t_{0}+1)}.

If uj(t0+1)<λ1-u_{j}^{(t_{0}+1)}<-\lambda_{1}, then equation (69) implies xj(t)<0x_{j}^{(t)}<0 for all t>t0+1t>t_{0}+1. Therefore, we have the closed-form formula:

Finally, if uj(t0+1)[λ1,λ1]-u_{j}^{(t_{0}+1)}\in[-\lambda_{1},\lambda_{1}], then equation (69) implies xj(t1)=0x^{(t_{1})}_{j}=0.

We look for the largest t+t^{+} such that the right-hand side of equation (72) is positive, which is equivalent of

Thus, t+t^{+} is the largest integer in [t0+1,t1][t_{0}+1,t_{1}] such that inequality (73) holds. If t+=t1t^{+}=t_{1}, then xj(t1)x_{j}^{(t_{1})} is obtained by (72). Otherwise, we can calculate xjt++1x_{j}^{t^{+}+1} by formula (69), then resort to Case I or Case III, treating t+t^{+} as t0t_{0}.

where tt^{-} is the largest integer in [t0+1,t1][t_{0}+1,t_{1}] such that the following inequality holds:

If t=t1t^{-}=t_{1}, then xj(t1)x_{j}^{(t_{1})} is obtained by (74). Otherwise, we can calculate xjt+1x_{j}^{t^{-}+1} by formula (69), then resort to Case I or Case II, treating tt^{-} as t0t_{0}.

Finally, we note that formula (69) implies the monotonicity of xj(t) (t=t0+1,t0+2,)x_{j}^{(t)}~{}(t=t_{0}+1,t_{0}+2,\dots). As a consequence, the procedure of either Case I, Case II or Case III is executed for at most once. Hence, the algorithm for calculating xj(t1)x_{j}^{(t_{1})} has O(1)\mathcal{O}(1) time complexity.

References