Adding vs. Averaging in Distributed Primal-Dual Optimization

Chenxin Ma, Virginia Smith, Martin Jaggi, Michael I. Jordan, Peter Richtárik, Martin Takáč

Introduction

With the wide availability of large datasets that exceed the storage capacity of single machines, distributed optimization methods for machine learning have become increasingly important. Existing methods require significant communication between workers, frequently equaling the amount of local computation (or reading of local data). As a result, distributed machine learning suffers significantly from a communication bottleneck on real world systems, where communication is typically several orders of magnitudes slower than reading data from main memory.

In this work we focus on optimization problems with empirical loss minimization structure, i.e., objectives that are a sum of the loss functions of each datapoint. This includes the most commonly used regularized variants of linear regression and classification methods. For this class of problems, the recently proposed CoCoA approach (Yang, 2013; Jaggi et al., 2014) develops a communication-efficient primal-dual scheme that targets the communication bottleneck, allowing more computation on data-local subproblems native to each machine before communication. By appropriately choosing the amount of local computation per round, this framework allows one to control the trade-off between communication and local computation based on the systems hardware at hand.

However, the performance of CoCoA (as well as related primal SGD-based methods) is significantly reduced by the need to average updates between all machines. As the number of machines KK grows, the updates get diluted and slowed by 1/K1/K, e.g., in the case where all machines except one would have already reached the solutions of their respective partial optimization tasks. On the other hand, if the updates are instead added, the algorithms can diverge, as we will observe in the practical experiments below.

To address both described issues, in this paper we develop a novel generalization of the local CoCoA subproblems assigned to each worker, making the framework more powerful in the following sense: Without extra computational cost, the set of locally computed updates from the modified subproblems (one from each machine) can be combined more efficiently between machines. The proposed CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} updates can be aggressively added (hence the ‘+’-suffix), which yields much faster convergence both in practice and in theory. This difference is particularly significant as the number of machines KK becomes large.

To our knowledge, our framework is the first to exhibit favorable strong scaling for the class of problems considered, as the number of machines KK increases and the data size is kept fixed. More precisely, while the convergence rate of CoCoA degrades as KK is increased, the stronger theoretical convergence rate here is – in the worst case – independent of KK. Our experiments in Section 7 confirm the improved speed of convergence. Since the number of communicated vectors is only one per round and worker, this favorable scaling might be surprising. Indeed, for existing methods, splitting data among more machines generally increases communication requirements (Shamir & Srebro, 2014), which can severely affect overall runtime.

While the existing analysis for CoCoA in (Jaggi et al., 2014) only covered smooth loss functions, here we extend the class of functions where the rates apply, additionally covering, e.g., Support Vector Machines and non-smooth regression variants. We provide a primal-dual convergence rate for both CoCoA as well as our new method CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} in the case of general convex (LL-Lipschitz) losses.

Furthermore, we additionally strengthen the rates by showing stronger primal-dual convergence for both algorithmic frameworks, which are almost tight to their objective-only counterparts. Primal-dual rates for CoCoA had not previously been analyzed in the general convex case. Our primal-dual rates allow efficient and practical certificates for the optimization quality, e.g., for stopping criteria. The new rates apply to both smooth and non-smooth losses, and for both CoCoA as well as the extended CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}}.

CoCoA as well as CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} allow the use of arbitrary local solvers on each machine.

We provide a thorough experimental comparison with competing algorithms using several real-world distributed datasets. Our practical results confirm the strong scaling of CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} as the number of machines KK grows, while competing methods, including the original CoCoA, slow down significantly with larger KK. We implement all algorithms in Spark, and our code is publicly available at: github.com/gingsmith/cocoa.

2 History and Related Work

While optimal algorithms for the serial (single machine) case are already well researched and understood, the literature in the distributed setting is relatively sparse. In particular, details on optimal trade-offs between computation and communication, as well as optimization or statistical accuracy, are still widely unclear. For an overview over this currently active research field, we refer the reader to (Balcan et al., 2012; Richtárik & Takáč, 2013; Duchi et al., 2013; Yang, 2013; Liu & Wright, 2014; Fercoq et al., 2014; Jaggi et al., 2014; Shamir & Srebro, 2014; Shamir et al., 2014; Zhang & Lin, 2015; Qu & Richtárik, 2014) and the references therein. We provide a detailed comparison of our proposed framework to the related work in Section 6.

Setup

We consider regularized empirical loss minimization problems of the following well-established form:

The above class includes many standard problems of wide interest in machine learning, statistics, and signal processing, including support vector machines, regularized linear and logistic regression, ordinal regression, and others.

The conjugate dual of (1) takes following form:

The duality gap function is then given by:

By weak duality, every value D(α)\mathcal{D}({\boldsymbol{\alpha}}) at a dual candidate α{\boldsymbol{\alpha}} provides a lower bound on every primal value P(w)\mathcal{P}({\bf w}). The duality gap is therefore a certificate on the approximation quality: The distance to the unknown true optimum P(w)\mathcal{P}({\bf w}^{*}) must always lie within the duality gap, i.e., G(α)=P(w)D(α)P(w)P(w)0G({\boldsymbol{\alpha}})=\mathcal{P}({\bf w})-\mathcal{D}({\boldsymbol{\alpha}})\geq\mathcal{P}({\bf w})-\mathcal{P}({\bf w}^{*})\geq 0.

In large-scale machine learning settings like those considered here, the availability of such a computable measure of approximation quality is a significant benefit during training time. Practitioners using classical primal-only methods such as SGD have no means by which to accurately detect if a model has been well trained, as P(w)P({\bf w}^{*}) is unknown.

The CoCoA++\!{}^{\bf\textbf{\footnotesize+}} Algorithm Framework

In this section we present our novel CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} framework. CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} inherits the many benefits of CoCoA as it remains a highly flexible and scalable, communication-efficient framework for distributed optimization. CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} differs algorithmically in that we modify the form of the local subproblems (9) to allow for more aggressive additive updates (as controlled by γ\gamma). We will see that these changes allow for stronger convergence guarantees as well as improved empirical performance. Proofs of all statements in this section are given in the supplementary material.

The above definition of the local objective functions Gkσ\mathcal{G}^{\sigma^{\prime}}_{k} are such that they closely approximate the global dual objective D\mathcal{D}, as we vary the ‘local’ variable Δα[k]\Delta{\boldsymbol{\alpha}}_{[k]}, in the following precise sense:

The role of the parameter σ\sigma^{\prime} is to measure the difficulty of the given data partition. For our purposes, we will see that it must be chosen not smaller than

The choice of σ:=γK\sigma^{\prime}:=\gamma K is valid for (11), i.e.,

We assume that there exists Θ[0,1)\Theta\in[0,1) such that k[K]\forall k\in[K], the local solver at any outer iteration tt produces a (possibly) randomized approximate solution Δα[k]\Delta{\boldsymbol{\alpha}}_{[k]}, which satisfies

We are now ready to describe the CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} framework, shown in Algorithm 1. The crucial difference compared to the existing CoCoA algorithm (Jaggi et al., 2014) is the more general local subproblem, as defined in (9), as well as the aggregation parameter γ\gamma. These modifications allow the option of directly adding updates to the global vector w{\bf w}.

Convergence Guarantees

Before being able to state our main convergence results, we introduce some useful quantities and the following main lemma characterizing the effect of iterations of Algorithm 1, for any chosen internal local solver.

The following Lemma provides a uniform bound on R(t)R^{(t)}:

If all data-points xi{\bf x}_{i} are normalized such that xi1\|{\bf x}_{i}\|\leq 1 i[n]\forall i\in[n], then σkPk=nk\sigma_{k}\leq|\mathcal{P}_{k}|=n_{k}. Furthermore, if we assume that the data partition is balanced, i.e., that nk=n/Kn_{k}=n/K for all kk, then σn2/K\sigma\leq n^{2}/K. This can be used to bound the constants R(t)R^{(t)}, above, as R(t)4L2n2K.R^{(t)}\leq\frac{4L^{2}n^{2}}{K}.

The following theorem shows the convergence for non-smooth loss functions, in terms of objective values as well as primal-dual gap. The analysis in (Jaggi et al., 2014) only covered the case of smooth loss functions.

we have that the expected duality gap satisfies

The following corollary of the above theorem clarifies our main result: The more aggressive adding of the partial updates, as compared averaging, offers a very significant improvement in terms of total iterations needed. While the convergence in the ‘adding’ case becomes independent of the number of machines KK, the ‘averaging’ regime shows the known degradation of the rate with growing KK, which is a major drawback of the original CoCoA algorithm. This important difference in the convergence speed is not a theoretical artifact but also confirmed in our practical experiments below for different KK, as shown e.g. in Figure 2.

We further demonstrate below that by choosing γ\gamma and σ\sigma^{\prime} accordingly, we can still recover the original CoCoA algorithm and its rate.

Assume that all datapoints xi{\bf x}_{i} are bounded as xi1\|{\bf x}_{i}\|\leq 1 and that the data partition is balanced, i.e. that nk=n/Kn_{k}=n/K for all kk. We consider two different possible choices of the aggregation parameter γ\gamma:

(CoCoA Averaging, γ:=1K\gamma:=\frac{1}{K}): In this case, σ:=1\sigma^{\prime}:=1 is a valid choice which satisfies (11). Then using σn2/K\sigma\leq n^{2}/K in light of Remark 7, we have that TT iterations are sufficient for primal-dual accuracy ϵG\epsilon_{G}, with

Hence the more machines KK, the more iterations are needed (in the worst case).

(CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} Adding, γ:=1\gamma:=1): In this case, the choice of σ:=K\sigma^{\prime}:=K satisfies (11). Then using σn2/K\sigma\leq n^{2}/K in light of Remark 7, we have that TT iterations are sufficient for primal-dual accuracy ϵG\epsilon_{G}, with

This is significantly better than the averaging case.

In practice, we usually have σn2/K\sigma\ll n^{2}/K, and hence the actual convergence rate can be much better than the proven worst-case bound. Table 1 shows that the actual value of σ\sigma is typically between one and two orders of magnitudes smaller compared to our used upper-bound n2/Kn^{2}/K.

2 Primal-Dual Convergence for Smooth Losses

The following theorem shows the convergence for smooth losses, in terms of the objective as well as primal-dual gap.

The following corollary is analogous to Corollary 9, but for the case of smooth loses. It again shows that while the CoCoA variant degrades with the increase of the number of machines KK, the CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} rate is independent of KK.

Assume that all datapoints xi{\bf x}_{i} are bounded as xi1\|{\bf x}_{i}\|\leq 1 and that the data partition is balanced, i.e., that nk=n/Kn_{k}=n/K for all kk. We again consider the same two different possible choices of the aggregation parameter γ\gamma:

(CoCoA Averaging, γ:=1K\gamma:=\frac{1}{K}): In this case, σ:=1\sigma^{\prime}:=1 is a valid choice which satisfies (11). Then using σmaxnk=n/K\sigma_{\max}\leq n_{k}=n/K in light of Remark 7, we have that TT iterations are sufficient for suboptimality ϵD\epsilon_{\mathcal{D}}, with

Hence the more machines KK, the more iterations are needed (in the worst case).

(CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} Adding, γ:=1\gamma:=1): In this case, the choice of σ:=K\sigma^{\prime}:=K satisfies (11). Then using σmaxnk=n/K\sigma_{\max}\leq n_{k}=n/K in light of Remark 7, we have that TT iterations are sufficient for suboptimality ϵD\epsilon_{\mathcal{D}}, with

This is significantly better than the averaging case. Both rates hold analogously for the duality gap.

3 Comparison with Original CoCoA

SDCA as an Example Local Solver

As an illustrative example for a local solver, Algorithm 2 below summarizes randomized coordinate ascent (SDCA) applied on the local subproblem (9). The following two Theorems (13, 14) characterize the local convergence for both smooth and non-smooth functions. In all the results we will use rmax:=maxi[n]xi2r_{\max}:=\max_{i\in[n]}\|{\bf x}_{i}\|^{2}.

Between the different regimes allowed in CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} (ranging between averaging and adding the updates) the computational cost for obtaining the required local approximation quality varies with the choice of σ\sigma^{\prime}. From the above worst-case upper bound, we note that the cost can increase with σ\sigma^{\prime}, as aggregation becomes more aggressive. However, as we will see in the practical experiments in Section 7 below, the additional cost is negligible compared to the gain in speed from the different aggregation, when measured on real datasets.

Discussion and Related Work

For the empirical loss minimization problems of interest here, stochastic subgradient descent (SGD) based methods are well-established. Several distributed variants of SGD have been proposed, many of which build on the idea of a parameter server (Niu et al., 2011; Liu et al., 2014; Duchi et al., 2013). The downside of this approach, even when carefully implemented, is that the amount of required communication is equal to the amount of data read locally (e.g., mini-batch SGD with a batch size of 1 per worker). These variants are in practice not competitive with the more communication-efficient methods considered here, which allow more local updates per round.

At the other extreme, there are distributed methods using only a single round of communication, such as (Zhang et al., 2013; Zinkevich et al., 2010; Mann et al., 2009; McWilliams et al., 2014). These require additional assumptions on the partitioning of the data, and furthermore can not guarantee convergence to the optimum solution for all regularizers, as shown in, e.g., (Shamir et al., 2014). (Balcan et al., 2012) shows additional relevant lower bounds on the minimum number of communication rounds necessary for a given approximation quality for similar machine learning problems.

Mini-batch methods are more flexible and lie within these two communication vs. computation extremes. However, mini-batch versions of both SGD and coordinate descent (CD) (Richtárik & Takáč, 2013; Shalev-Shwartz & Zhang, 2013b; Yang, 2013; Qu & Richtárik, 2014; Qu et al., 2014) suffer from their convergence rate degrading towards the rate of batch gradient descent as the size of the mini-batch is increased. This follows because mini-batch updates are made based on the outdated previous parameter vector w{\bf w}, in contrast to methods that allow immediate local updates like CoCoA. Furthermore, the aggregation parameter for mini-batch methods is harder to tune, as it can lie anywhere in the order of mini-batch size. In the CoCoA setting, the parameter lies in the smaller range given by KK. Our CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} extension avoids needing to tune this parameter entirely, by adding.

Developing methods that allow for local optimization requires carefully devising data-local subproblems to be solved after each communication round. (Shamir et al., 2014; Zhang & Lin, 2015) have proposed distributed Newton-type algorithms in this spirit. However, the subproblems must be solved to high accuracy for convergence to hold, which is often prohibitive as the size of the data on one machine is still relatively large. In contrast, the CoCoA framework (Jaggi et al., 2014) allows using any local solver of weak local approximation quality in each round. By making use of the primal-dual structure in the line of work of (Yu et al., 2012; Pechyony et al., 2011; Yang, 2013; Lee & Roth, 2015), the CoCoA and CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} frameworks also allow more control over the aggregation of updates between machines. The practical variant DisDCA-p proposed in (Yang, 2013) allows additive updates but is restricted to SDCA updates, and was proposed without convergence guarantees. DisDCA-p can be recovered as a special case of the CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} framework when using SDCA as a local solver, if nk=n/Kn_{k}=n/K and σ:=K\sigma^{\prime}:=K, see Appendix C. The theory presented here also therefore covers that method.

An alternative approach to distributed optimization is to use the alternating direction method of multipliers (ADMM), as used for distributed SVM training in, e.g., (Forero et al., 2010). This uses a penalty parameter balancing between the equality constraint w{\bf w} and the optimization objective (Boyd et al., 2011). However, the known convergence rates for ADMM are weaker than the more problem-tailored methods mentioned previously, and the choice of the penalty parameter is often unclear.

In spirit, for the special case of adding (γ=1\gamma=1), CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} resembles a batch proximal method, using the separable approximation (9) instead of the original dual (2). Known batch proximal methods require high accuracy subproblem solutions, and don’t allow arbitrary solvers of weak accuracy Θ\Theta such as we do here.

Numerical Experiments

We present experiments on several large real-world distributed datasets. We show that CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} converges faster in terms of total rounds as well as elapsed time as compared to CoCoA in all cases, despite varying: the dataset, values of regularization, batch size, and cluster size (Section 7.2). In Section 7.3 we demonstrate that this performance translates to orders of magnitude improvement in convergence when scaling up the number of machines KK, as compared to CoCoA as well as to several other state-of-the-art methods. Finally, in Section 7.4 we investigate the impact of the local subproblem parameter σ\sigma^{\prime} in the CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} framework.

We implement all algorithms in Apache Spark (Zaharia et al., 2012) and run them on m3.large Amazon EC2 instances, applying each method to the binary hinge-loss support vector machine. The analysis for this non-smooth loss was not covered in (Jaggi et al., 2014) but has been captured here, and thus is both theoretically and practically justified. The used datasets are summarized in Table 2.

For illustration and ease of comparison, we here use SDCA (Shalev-Shwartz & Zhang, 2013c) as the local solver for both CoCoA and CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}}. Note that in this special case, and if additionally σ:=K\sigma^{\prime}:=K, and if the partitioning nk=n/Kn_{k}=n/K is balanced, once can show that the CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} framework reduces to the practical variant of DisDCA (Yang, 2013) (which had no convergence guarantees so far). We include more details on the connection in Appendix C.

2 Comparison of CoCoA++\!{}^{\bf\textbf{\footnotesize+}} and CoCoA

We compare the CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} and CoCoA frameworks directly using two datasets (Covertype and RCV1) across various values of λ\lambda, the regularizer, in Figure 1. For each value of λ\lambda we consider both methods with different values of HH, the number of local iterations performed before communicating to the master. For all runs of CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} we use the safe upper bound of γK\gamma K for σ\sigma^{\prime}. In terms of both the total number of communications made and the elapsed time, CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} (shown in blue) converges to the optimal solution faster than CoCoA (red). The discrepancy is larger for greater values of λ\lambda, where the strongly convex regularizer has more of an impact and the problem difficulty is reduced. We also see a greater performance gap for smaller values of HH, where there is frequent communication between the machines and the master, and changes between the algorithms therefore play a larger role.

3 Scaling the Number of Machines K𝐾K

In Figure 2 we demonstrate the ability of CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} to scale with an increasing number of machines KK. The experiments confirm the ability of strong scaling of the new method, as predicted by our theory in Section 4, in contrast to the competing methods. Unlike CoCoA, which becomes linearly slower when increasing the number of machines, the performance of CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} improves with additional machines, only starting to degrade slightly once KK=16 for the RCV1 dataset.

Finally, in Figure 3, we consider the effect of the choice of the subproblem parameter σ\sigma^{\prime} on convergence. We plot both the number of communications and clock time on a log-log scale for the RCV1 dataset with KK=8 and HH=1e41e4. For γ=1\gamma=1 (the most aggressive variant of CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} in which updates are added) we consider several different values of σ\sigma^{\prime}, ranging from 11 to 88. The value σ\sigma^{\prime}=8 represents the safe upper bound of γK\gamma K. The optimal convergence occurs around σ\sigma^{\prime}=4, and diverges for σ2\sigma^{\prime}\leq 2. Notably, we see that the easy to calculate upper bound of σ:=γK\sigma^{\prime}:=\gamma K (as given by Lemma 4) has only slightly worse performance than best possible subproblem parameter in our setting.

Conclusion

In conclusion, we present a novel framework CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} that allows for fast and communication-efficient additive aggregation in distributed algorithms for primal-dual optimization. We analyze the theoretical performance of this method, giving strong primal-dual convergence rates with outer iterations scaling independently of the number of machines. We extended our theory to allow for non-smooth losses. Our experimental results show significant speedups over previous methods, including the original CoCoA framework as well as other state-of-the-art methods.

We thank Ching-pei Lee and an anonymous reviewer for several helpful insights and comments.

References

Appendix

Appendix B Proofs

Now, let us bound the terms AA and BB separately. We have

Where the last inequality is due to Jensen’s inequality. Now we will bound BB, using the safe separability measurement σ\sigma^{\prime} as defined in (11).

Plugging AA and BB into (25) will give us

B.2 Proof of Lemma 4

B.3 Proof of Lemma 5

For sake of notation, we will write α{\boldsymbol{\alpha}} instead of α(t){\boldsymbol{\alpha}}^{(t)}, w{\bf w} instead of w(α(t)){\bf w}({\boldsymbol{\alpha}}^{(t)}) and u{\bf u} instead of u(t){\bf u}^{(t)}.

Now, let us estimate the expected change of the dual objective. Using the definition of the dual update α(t+1):=α(t)+γkΔα[k]{\boldsymbol{\alpha}}^{(t+1)}:={\boldsymbol{\alpha}}^{(t)}+\gamma\,\sum_{k}\Delta{\boldsymbol{\alpha}}_{[k]} resulting in Algorithm 1, we have

Now, let us upper bound the CC term (we will denote by Δα=k=1KΔα[k]\Delta{\boldsymbol{\alpha}}^{*}=\sum_{k=1}^{K}\Delta{\boldsymbol{\alpha}}^{*}_{[k]}):

The convex conjugate maximal property implies that

Moreover, from the definition of the primal and dual optimization problems (1), (2), we can write the duality gap as

Now, the claimed improvement bound (15) follows by plugging (29) into (26).

B.4 Proof of Lemma 6

For general convex functions, the strong convexity parameter is μ=0\mu=0, and hence the definition of R(t)R^{(t)} becomes

B.5 Proof of Theorem 8

At first let us estimate expected change of dual feasibility. By using the main Lemma 5, we have

Choice of s=1s=1 and t=t0:=max{0,1γ(1Θ)log(2λn2(D(α)D(α(0)))/(4L2σσ))}t=t_{0}:=\max\{0,\lceil\frac{1}{\gamma(1-\Theta)}\log(2\lambda n^{2}(\mathcal{D}({\boldsymbol{\alpha}}^{*})-\mathcal{D}({\boldsymbol{\alpha}}^{(0)}))/(4L^{2}\sigma\sigma^{\prime}))\rceil\} will lead to

Clearly, (32) implies that (33) holds for t=t0t=t_{0}. Now imagine that it holds for any tt0t\geq t_{0} then we show that it also has to hold for t+1t+1. Indeed, using

where in the last inequality we have used the fact that geometric mean is less or equal to arithmetic mean.

If α\overline{\boldsymbol{\alpha}} is defined as (21) then we obtain that

Now, if T1γ(1Θ)+T0T\geq\lceil\frac{1}{\gamma(1-\Theta)}\rceil+T_{0} such that T0t0T_{0}\geq t_{0} we obtain

To have right hand side of (38) smaller then ϵG\epsilon_{G} it is sufficient to choose T0T_{0} and TT such that

B.6 Proof of Theorem 10

into (41) we obtain that t:R(t)0\forall t:R^{(t)}\leq 0. Putting the same ss into (15) will give us

Therefore if we denote by ϵD(t)=D(α)D(α(t))\epsilon_{\mathcal{D}}^{(t)}=\mathcal{D}({\boldsymbol{\alpha}}^{*})-\mathcal{D}({\boldsymbol{\alpha}}^{(t)}) we have that

The right hand side will be smaller than some ϵD\epsilon_{\mathcal{D}} if

Moreover, to bound the duality gap, we have

Therefore G(α(t))1γ(1Θ)λμn+σmaxσλμnϵD(t)G({\boldsymbol{\alpha}}^{(t)})\leq\frac{1}{\gamma(1-\Theta)}\frac{\lambda\mu n+\sigma_{\max}\sigma^{\prime}}{\lambda\mu n}\epsilon_{\mathcal{D}}^{(t)}. Hence if ϵDγ(1Θ)λμnλμn+σmaxσϵG\epsilon_{\mathcal{D}}\leq\gamma(1-\Theta)\frac{\lambda\mu n}{\lambda\mu n+\sigma_{\max}\sigma^{\prime}}\epsilon_{G} then G(α(t))ϵGG({\boldsymbol{\alpha}}^{(t)})\leq\epsilon_{G}. Therefore after

iterations we have obtained a duality gap less than ϵG\epsilon_{G}.

B.7 Proof of Theorem 13

The second part we will denote by f(ζ)=1Kλ2w(α)2+1niPkw(α)Txiζi+λ2σ1λ2n2iPkxiζi2f({\boldsymbol{\zeta}})=\frac{1}{K}\frac{\lambda}{2}\|{\bf w}({\boldsymbol{\alpha}})\|^{2}+\frac{1}{n}\sum_{i\in\mathcal{P}_{k}}{\bf w}({\boldsymbol{\alpha}})^{T}{\bf x}_{i}\zeta_{i}+\frac{\lambda}{2}\sigma^{\prime}\frac{1}{\lambda^{2}n^{2}}\|\sum_{i\in\mathcal{P}_{k}}{\bf x}_{i}\zeta_{i}\|^{2}. It is easy to show that the gradient of ff is coordinate-wise Lipschitz continuous with Lipschitz constant σλn2rmax\frac{\sigma^{\prime}}{\lambda n^{2}}r_{\max} with respect to the standard Euclidean norm.

Following the proof of Theorem 20 in (Richtárik & Takáč, 2015), we obtain that

Over all steps up to step hh, this gives

Therefore, choosing HH as in the assumption of our Theorem, given in Equation (22), we are guaranteed that (11nkλnμσrmax+λnμ)HΘ\left(1-\frac{1}{n_{k}}\frac{\lambda n\mu}{\sigma^{\prime}r_{\max}+\lambda n\mu}\right)^{H}\leq\Theta, as desired.

B.8 Proof of Theorem 14

Now, choice of h=Hh=H from (23) is sufficient to have the right hand side of (45) to be \leq\Theta\big{(}\mathcal{G}^{\sigma^{\prime}}_{k}(\Delta{\boldsymbol{\alpha}}^{*}_{[k]};{\bf w},{\boldsymbol{\alpha}}_{[k]})-\mathcal{G}^{\sigma^{\prime}}_{k}({\bf 0};{\bf w},{\boldsymbol{\alpha}}_{[k]})\big{)}.

Appendix C Relationship of DisDCA to CoCoA++\!{}^{\bf\textbf{\footnotesize+}}

We are indebted to Ching-pei Lee for showing the following relationship between the practical variant of DisDCA (Yang, 2013), and CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} when SDCA is chosen as the local solver:

Considering the practical variant of DisDCA (DisDCA-p, see Figure 2 in (Yang, 2013)) using the scaling parameter scl=Kscl=K, the following holds:

Assume that the dataset is partitioned equally between workers, i.e. k:nk=nK\forall k:n_{k}=\frac{n}{K}. If within the CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} framework, SDCA is used as a local solver, and the subproblems are formulated using our shown “safe” (but pessimistic) upper bound of σ=K\sigma^{\prime}=K, with aggregation parameter γ=1\gamma=1 (adding), then the CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} framework reduces exactly to the DisDCA-p algorithm.

(Due to Ching-pei Lee, with some reformulations). As defined in (9), the data-local subproblem solved by each machine in CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} is defined as

We rewrite the local problem by scaling with nn, and removing the constant regularizer term 1Kλ2w2\frac{1}{K}\frac{\lambda}{2}\|{\bf w}\|^{2}, i.e.

For the correspondence of interest, we now restrict to single coordinate updates in the local solver. In other words, the local solver optimizes exactly one coordinate iPki\in\mathcal{P}_{k} at a time. To relate the single coordinate update to the set of local variables, we will use the notation

so that Δα[k]prev\Delta{\boldsymbol{\alpha}}^{{\text{\tiny prev}}}_{[k]} are the previous local variables, and Δα[k]\Delta{\boldsymbol{\alpha}}_{[k]} will be the updated ones.

From now on, we will consider the special case of CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} when the quadratic upper bound parameter is chosen as the “safe” value σ=K\sigma^{\prime}=K, combined with adding as the aggregation, i.e. γ=1\gamma=1.

which – because it is only affecting one single coordinate, employing (47) – can be expressed as

From this formulation, it is apparent that single coordinate local solvers should maintain their locally updated version of the current primal parameters, which we here denote as

In the practical variant of DisDCA, the summarized local primal updates are Δulocal=1λnkAΔα[k]\Delta{\bf u}^{\text{\tiny local}}=\frac{1}{\lambda n_{k}}A\Delta{\boldsymbol{\alpha}}_{[k]}. For the balanced case nk=n/Kn_{k}=n/K for KK being the number of machines, this means the local ulocal{\bf u}^{\text{\tiny local}} update of DisDCA-p is

It is not hard to show that during one outer round, the evolution of the local dual variables Δα[k]\Delta{\boldsymbol{\alpha}}_{[k]} is the same in both methods, such that they will also have the same trajectory of ulocal{\bf u}^{\text{\tiny local}}. This requires some care if the same coordinate is sampled more than once in a round, which can happen in LocalSDCA within CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} and also in DisDCA-p. ∎

In the view of the above lemma, we will summarize the connection of the two methods as follows:

CoCoA/+ is Not an Algorithm. In contrast, it is a framework which allows to use any local solver to perform approximate steps on the local subproblem. This additional level of abstraction (from the definition of such local subproblems in (9)) is the first to allow reusability of any fast/tuned and problem specific single machine solvers, while decoupling this from the distributed algorithmic scheme, as presented in Algorithm 1.

Concerning the choice of local solver to be used within CoCoA/+, SDCA is not the fastest known single machine solver for most applications. Much recent research has shown improvements on SDCA (Shalev-Shwartz & Zhang, 2013c), such as accelerated variants (Shalev-Shwartz & Zhang, 2013b) and other approaches including variance reduction, methods incorporating second-order information, and importance sampling. In this light, we encourage the user of the CoCoA or CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} framework to plug in the best and most recent solver available for their particular local problem (within Algorithm 1), which is not necessarily SDCA. This choice should be made explicit especially when comparing algorithms. Our presented convergence theory from Section 4 will still cover these choices, since it only depends on the relative accuracy Θ\Theta of the chosen local solver.

CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} is Theoretically Safe, while still Adaptive to the Data. The general definition of the local subproblems, and therefore the treatment of the varying separable bound on the objective – quantified by σ\sigma^{\prime} – allows our framework to adapt to the difficulty of the data partition and still give convergence results. The data-dependent measure σ\sigma^{\prime} is fully decoupled from what the user of the framework prefers to employ as a local solver (see also the comment below that CoCoA is not a coordinate solver).

The safe upper bound σ=K\sigma^{\prime}=K is worst-case pessimistic, for the convergence theory to still hold in all cases, when the updates are added. Using additional knowledge from the input data, better bounds and therefore better step-sizes can be achieved in CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}}. An example when σ\sigma^{\prime} can be safely chosen much smaller is when the data-matrix satisfies strong row/column sparsity, see e.g. Lemma 1 in (Richtárik & Takáč, 2013).

Obtaining DisDCA-p as a Special Case. As shown in Lemma 18 above, we have that if in CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}}, if SDCA is used as the local solver and the pessimistic upper bound of σ=K\sigma^{\prime}=K is used and, moreover, the dataset is partitioned equally, i.e. k:nk=nK\forall k:n_{k}=\frac{n}{K}, then the CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} framework reduces exactly to the DisDCA-p algorithm by (Yang, 2013).

The correspondence breaks down if the subproblem parameter is chosen to a practically good value σK\sigma^{\prime}\neq K. Also, as noted above, SDCA is often not the best local solver currently available. In our above experiments, SDCA was used just for demonstration purposes and ease of comparison. Furthermore, the data partition might often be unbalanced in practical applications.

While both DisDCA-p and CoCoA are special cases of CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}}, we note that DisDCA-p can not be recovered as a special case of the original CoCoA framework (Jaggi et al., 2014).

CoCoA/+ are Not Coordinate Methods. Despite the original name being motivated from this special case, CoCoA and CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} are not coordinate methods. In fact, CoCoA ⁣+\!{}^{\bf\textbf{\footnotesize+}} as presented here for the adding case (γ=1\gamma=1) is much more closely related to a batch method applied to the dual, using a block-separable proximal term, as following from our new subproblem formulation (9), depending on σ\sigma^{\prime}. See also the remark in Section 6. The framework here (Algorithm 1) gives more generality, as the used local solver is not restricted to be a coordinate-wise one. In fact the framework allows to translate recent and future improvements of single machine solvers directly to the distributed setting, by employing them within Algorithm 1. DisDCA-p works very well for several applications, but is restricted to using local coordinate ascent (SDCA) steps.

Theoretical Convergence Results. While DisDCA-p (Yang, 2013) was proposed without theoretical justification (hence the nomenclature), the main contribution in the paper here – apart from the arbitrary local solvers – is the convergence analysis for the framework. The theory proposed in (Yang et al., 2013) is given only for the setting of orthogonal partitions, i.e., when σ=1\sigma^{\prime}=1 and the problems become trivial to distribute given the orthogonality of data between the workers.

The theoretical analysis here gives convergence rates applying for Algorithm 1 when using arbitrary local solvers, and inherits the performance of the local solver. As a special case, we obtain the first theoretical justification and convergence rates for original CoCoA in the case of general convex objective, as well as for the special case of DisDCA-p for both general convex and smooth convex objectives.