DSA: Decentralized Double Stochastic Averaging Gradient Algorithm
Aryan Mokhtari, Alejandro Ribeiro
Introduction
Our interest here is in solving (1) with a method that is decentralized – nodes operate on their local functions and communicate with neighbors only –, stochastic – nodes determine a descent direction by evaluating only one out of the functions at each iteration –, and has a linear convergence rate in expectation – the expected distance to the optimum is scaled by a subunit factor at each iteration.
Decentralized optimization is relatively mature and various methods are known with complementary advantages. These methods include decentralized gradient descent (DGD) (Nedić and Ozdaglar (2009); Jakovetic et al. (2014); Yuan et al. (2013)), network Newton (Mokhtari et al. (2015a, b)), decentralized dual averaging (Duchi et al. (2012); Tsianos et al. (2012b)), the exact first order algorithm (EXTRA) (Shi et al. (2015)), as well as the alternating direction method of multipliers (ADMM) (Boyd et al. (2011); Shi et al. (2014); Iutzeler et al. (2013)) and its linearized variants (Ling and Ribeiro (2014); Ling et al. (2014); Mokhtari et al. (2015c)). The ADMM, its variants, and EXTRA converge linearly to the optimal argument but DGD, network Newton, and decentralized dual averaging have sublinear convergence rates. Of particular importance to this paper, is the fact that DGD has (inexact) linear converge to a neighborhood of the optimal argument when it uses constant stepsizes. It can achieve exact convergence by using diminishing stepsizes, but the convergence rate degrades to sublinear. This lack of linear convergence is solved by EXTRA through the use of iterations that rely on information of two consecutive steps (Shi et al. (2015)).
All of the algorithms mentioned above require the computationally costly evaluation of the local gradients . This cost can be avoided by stochastic decentralized algorithms that reduce computational cost of iterations by substituting all local gradients with their stochastic approximations. This reduces the computational cost per iteration but results in sublinear convergence rates of order even if the corresponding deterministic algorithm exhibits linear convergence. This is a drawback that also exists in centralized stochastic optimization where linear convergence rates in expectation are established by decreasing the variance of the stochastic gradient approximation (Roux et al. (2012); Schmidt et al. (2013); Shalev-Shwartz and Zhang (2013); Johnson and Zhang (2013); Konečnỳ and Richtárik (2013); Defazio et al. (2014)). In this paper we build on the ideas of the stochastic averaging gradient (SAG) algorithm (Schmidt et al. (2013)) and its unbiased version SAGA (Defazio et al. (2014)). Both of these algorithms use the idea of stochastic incremental averaging gradients. At each iteration only one of the stochastic gradients is updated and the average of all of the most recent stochastic gradients is used for estimating gradient.
The advantages of DSA relative to a group of stochastic and deterministic alternatives in solving a logistic regression problem with a synthetic dataset are then studied in numerical experiments (Section 4). These results demonstrate that DSA is the only decentralized stochastic algorithm that reaches the optimal solution with a linear convergence rate. We further show that DSA outperforms deterministic algorithms when the metric is the number of times that elements of the training set are evaluated. The behavior of DSA for different network topologies is also evaluated. We close the paper with pertinent remarks (Section 5).
Decentralized Double stochastic averaging gradient
If we recall the definitions of the local functions and the instantaneous local functions available at node , the implementation of EXTRA requires that each node computes the full gradient of its local objective function at as
This is computationally expensive when the number of instantaneous functions is large. To resolve this issue, local stochastic gradients can be substituted for the local objective functions gradients in (3). These stochastic gradients approximate the gradient of node by randomly choosing one of the instantaneous functions gradients . If we let denote a function index that we choose at time at node uniformly at random and independently of the history of the process, then the stochastic gradient is defined as
With these definitions in hand we can define the stochastic averaging gradient at node as
Observe that to implement (7) the gradients are stored in the local gradient table shown in Figure 1.
The DSA algorithm is a variation of EXTRA that substitutes the local gradients in (3) for the local stochastic average gradients in (7),
The DSA initial update is given by applying the same substitution for the update of DGD in (2) as
DSA is summarized in Algorithm 1 for . The DSA update in (8) is implemented in Step 9. This step requires access to the local iterates of neighboring nodes which are collected in Step 2. Furthermore, implementation of the DSA update also requires access to the stochastic averaging gradients and . The latter is computed in Step 4 and the former is computed and stored at the same step in the previous iteration. The computation of the stochastic averaging gradients requires the selection of the index . This index is chosen uniformly at random in Step 3. Determination of stochastic averaging gradients also necessitates access and maintenance of the gradients table in Figure 1. The element of this table is updated in Step 5 by replacing with , while the other vectors remain unchanged. To implement the first DSA iteration at time we have to perform the update in (9) instead of the update in (8) as in Step 7. Furhter observe that the auxiliary variables are initialized to the initial iterate . This implies that the initial values of the stored gradients are – with a consequently relatively large initialization cost.
We also point that, as written in (7), computation of local stochastic averaging gradients is costly because it requires evaluation of the sum at each iteration. This cost can be avoided by updating the sum at each iteration with the recursive formula
Important properties and interpretations of EXTRA and DSA are presented in the following sections after a pertinent remark.
The local stochastic averaging gradients in (7) are unbiased estimates of the local gradients . Indeed, if we let measure the history of the system up until time we have that the sum in (7) is deterministic given this sigma-algebra. Thus, the conditional expectation of the stochastic averaging gradient is,
The expression in (12) means, by definition, that is an unbiased estimate of when the history is given.
According to the definition of extended weight matrix , the DGD iteration in (2) is equivalent to
The fundamental difference between DGD and EXTRA is that a fixed point of (15) does not necessarily satisfy (14), whereas the fixed points of (16) are guaranteed to do so. Indeed, taking limits in (15) we see that the fixed points of DGD must satisfy
which is incompatible with (14) except in peculiar circumstances – such as, e.g., when all local functions have the same minimum. The limit points of EXTRA, however, satisfy the relationship
Substituting the limit point in (19) and reordering terms, we see that must satisfy
2 Stochastic saddle point method interpretation of DSA
Consider as a primal variable and as a dual variable. Then, the updates in (22) and (23) are equivalent to the updates of a saddle point method with stepsize that solves for the critical points of the augmented Lagrangian
Comparing (16) and (26) we see that they differ in the latter using stochastic averaging gradients in lieu of the full gradients . Therefore, DSA is a stochastic saddle point method in which the primal variables are updated as
and the dual variables are updated as
Convergence analysis
The instantaneous local functions are differentiable and strongly convex with parameter .
The gradient of instantaneous local functions are Lipschitz continuous with parameter . I.e., for all and we can write
The condition imposed by Assumption 2 implies that the local functions and the global cost function are also strongly convex with parameter . Likewise, Lipschitz continuity of the local instantaneous gradients considered in Assumption 3 enforces Lipschitz continuity of gradients of the local functions and the aggregate function – see, e.g., (Lemma 1 of Mokhtari et al. (2015a)).
In this section we study some basic properties of the sequences of primal and dual variables generated by the DSA algorithm. In the following lemma, we study the relation of iterates and with the optimal primal and dual arguments.
One of the KKT conditions of problem (25) follows that the optimal variables and satisfy or equivalently . Adding this equality to both sides of (32) follows the claim in (30).
Consider the DSA algorithm in (6)-(9) and the definition of sequence in (33). If Assumptions 1-3 hold true, then the squared norm of the difference between stochastic averaging gradient and the optimal gradient in expectation is bounded above by
2 Convergence
Consider the DSA algorithm as defined in (6)-(9). Further recall the definitions of in (33) and , , and in (35). If Assumptions 1-3 hold true, then for any positive constants we can write
Lemma 4 shows an upper bound for the squared norm which is the first part of the Lyapunov function at step . Likewise, we provide an upper bound for the second term of the Lyapunov function at time which is in terms of and some parameters that capture optimality gap. This bound is studied in the following lemma.
Consider the DSA algorithm as defined in (6)-(9) and the definition of in (33). Further, define and as the smallest and largest values for the number of instantaneous functions at a node, respectively. If Assumptions 1-3 hold true, then for all the sequence satisfies
Lemma 5 provides an upper bound for in terms of its previous value and the optimality error . Combining the results in Lemmata 4 and 5 we can show that in expectation the Lyapunov function at step is strictly smaller than its previous value at step .
Consider the DSA algorithm as defined in (6)-(9). Further recall the definition of the sequence in (33). Define as an arbitrary positive constant chosen from the interval
If Assumptions 1-3 hold true and the stepsize is chosen from the interval , then for arbitrary chosen from the interval
there exits a positive constant such that
We use the result in (41) to establish linear convergence of the sequence of squared norm error in expectation.
Further, the almost sure convergence is at least of order .
Theorem 8 provides almost sure convergence of to the optimal solution which is stronger result than convergence in expectation as in Corollary 7, however, the rate of convergence for the almost sure convergence is sublinear which is slower relative to the linear convergence in expectation provided in (42).
3 Convergence constant
The constant that controls the speed of convergence can be simplified by selecting specific values for , , and . This uncovers connections to the properties of the local objective functions and the network topology. To make this clearer define the condition numbers of the objective function and the graph as
respectively. The condition number of the function is a measure of how difficult it is to minimize the local functions using gradient descent directions. The condition number of the graph is a measure of how slow the graph is in propagating a diffusion process. Both are known to control the speed of convergence of distributed optimization methods. The following corollary illustrates that these condition numbers also determine the convergence speed of DSA.
The linear convergence constant in (40) reduces to
Proof The given values for , , and satisfy the conditions in Theorem 6. Substitute then these values into the expression for in (D). Simplify terms and utilize the condition number definitions in (44). The second term in the minimization in (D) becomes redundant because it is dominated by the first.
The three terms in (47) establish separate regimes, problems where the graph condition number is large, problems where the number of functions at each node is large, and problems where the condition number of the local functions are large. In the first regime the first term in (47) dominates and establishes a dependence in terms of the square of the graph’s condition number. In the second regime the middle term dominates and results in an inverse dependence with the number of functions available at each node. In the third regime, the third term dominates. The dependence in this case is inversely proportional to .
Numerical analysis
where the regularization term is added to reduce overfitting to the training set. The optimization problem in (48) can be written in the form of (1) by defining the local objective functions as
Observe that the local functions in (49) can be written as the average of a set of instantaneous functions defined as
. Considering the definitions of instantaneous local functions in (50) and local functions in (49), problem (48) can be solved using the DSA algorithm.
In our experiments we use a synthetic dataset where components of the feature vectors with label are generated from a normal distribution with mean and standard deviation , while sample points with label are generated from a normal distribution with mean and standard deviation . We consider a network of size where the edges between nodes are generated randomly with probability . The weight matrix is generated using the Laplacian matrix of network as
We provide a comparison of DSA with respect to DGD, EXTRA, stochastic EXTRA, and decentralized SAGA. The stochastic EXTRA is defined by using stochastic gradient in (5) instead of using full gradient as in EXTRA or stochastic averaging gradient as in DSA. The decentralized SAGA is a stochastic version of DGD algorithm that uses stochastic averaging gradient instead of exact gradient which is the naive approach for developing decentralized version of SAGA algorithm.
We study performances of the DSA algorithm for different topologies. We keep the parameters in Fig. 2 except we change the size of network to which implies each node has sample points. The linear convergence of DSA algorithm for random networks with and , complete graph, cycle, line and star are shown in Fig. 3. As we expect for the topologies that the graph is more connected and the diameter is smaller linear convergence of DSA is faster. The best performance belongs to the complete graph which requires iterations to achieve the relative error . For random graphs with connectivity probabilities and DSA achieves the relative error after and iterations, respectively. For the cycle graph the number of required iterations for reaching the relative error is , while DSA does not reach this accuracy after iterations when the graph is a line or star.
Conclusions
We acknowledge the support of the National Science Foundation (NSF CAREER CCF-0952867) and the Office of Naval Research (ONR N00014-12-1-0997).
A Proof of Lemma 3
Substituting the upper bound in (A) and simplification in (A) into (55), and considering the expression in (A) lead to
Summing up both sides of (59) for all , dividing both sides of the implied inequality by lead to
To show that the sum in the right hand side of (A) is bounded above we use the Lipschitz continuity of the instantaneous functions gradients . Using the same argument from (59) to (61) we can write
Considering the definition of the local objective functions and the aggregate function , the right hand side of (63) can be simplified as
Replacing the sum in (A) by the upper bound in (64) implies
Considering the strong convexity of function with constant we can write
Therefore, we can substitute in (64) by the lower bound in (66) and the claim in (34) follows.
B Proof of Lemma 4
According to the Lipschitz continuity of the aggregate function gradients we can write . By adding and subtracting to the term and multiplying both sides of the inequality by we obtain
Expanding the difference as for the first inner product in the right hand side of (67) implies
We proceed to simplify the inner product in the right hand side of (B) by substituting with its equivalent as introduced in (30). Applying this substitution the inner product can be simplified as
According to the definition of vector and matrix in (35), the last two summands of (B) can be simplified as . Moreover, observe that the inner product can be simplified as . Applying this simplification into (B) implies
The next step is to bound above the inner product . Note that for any two vectors and , and any positive scalar the inequality holds true. Therefore, by setting and we obtain that
Now we substitute the terms in the right hand side of (B) by their simplifications or upper bounds. Replacing the inner product by the simplification in (B), substituting expression by the upper bound in (72), and substituting inner product by the sum imply
By applying inequality for the choice of vectors and , we obtain that is bounded above by . Replacing in (B) by its upper bound yields
Substituting the simplification in (77) into (76) yields
Considering the strong convexity of function with constant we can write . substituting the squared norm by this lower bound in (78) follows
C Proof of Lemma 5
Given the information until time , each auxiliary vector is a random variable that takes values and with associated probabilities and , respectively. This observation holds since with probability node may choose index to update at time and with probability choose other indices. Therefore, we can write
For the simplicity of equations let us define sequence as
We proceed to find and upper bound for the terms in the right hand side of (83). First note that according to the strong convexity of instantaneous functions and both terms in the right hand side of (83) are non-negative. Observing that the number of instantaneous functions at each node satisfies the condition , we obtain
Now observe that according to the definitions of sequences and in (33) and (82), respectively, is the sum of for all , i.e. . Therefore, we can rewrite (85) as
D Proof of Theorem 6
To prove the result of Theorem 6 first we prove the following Lemma to establish an upper bound for
Proof Consider the basic inequality for the case that , which can be written as
Note that according to the fact that both and lie in the column space of matrix we obtain . Substituting this lower bound for in (D) and multiplying both sides of the imposed inequality by yield
Using the result in Lemma 10 we show linear convergence of the sequence as follows.
Proof of Theorem 6: Proving the linear convergence claim in (40) is equivalent to showing that
Further, substitute the squared norm by the upper bound to obtain
Replacing in (D) by the upper bound (D) and regrouping the terms lead to
Notice that if the inequality in (D) holds true, then the relation in (D) is valid and as we mentioned before the claim in (93) holds. To verify the sum in the right hand side of (D) is always positive and the inequality is valid, we enforce each summands in the right hand side of (D) to be non-negative. Therefore, the following conditions should be satisfied
Recall that is the smallest eigenvalue of positive definite matrix . All the inequalities in (D) are satisfied, if is chosen as
where , and are selected from the intervals
Notice that considering the conditions for the variables , and in (100), the constant in (D) is strictly positive . Moreover, according to the definition in (D) the constant is smaller than which leads to the conclusion that . Therefore, we obtain that and the claim in (40) is valid.
E Proof of Theorem 8
The proof uses the relationship in the statement (40) of Theorem 6 to build a supermartingale sequence. To do this define the stochastic processes and as
Note that the stochastic processes and are alway non-negative. Let now be a sigma-algebra measuring , , and . Considering the definitions of and and the relation in (40) we can write
Since the sequences and are nonnegative it follows from (102) that they satisfy the conditions of the supermartingale convergence theorem – see e.g. theorem E Solo and Kong (1995) . Therefore, we obtain that: (i) The sequence converges almost surely. (ii) The sum is almost surely finite. The definition of in (101) implies that
Observing the fact that and are positive constants, we can conclude from (104) that the sequence is almost surely summable and the it converges with probability 1 to null at least in the order of . Almost sure convergence of sequence to null follows the claim in (43).