DSA: Decentralized Double Stochastic Averaging Gradient Algorithm

Aryan Mokhtari, Alejandro Ribeiro

Introduction

Our interest here is in solving (1) with a method that is decentralized – nodes operate on their local functions and communicate with neighbors only –, stochastic – nodes determine a descent direction by evaluating only one out of the $q_{n}$ functions $f_{n,i}$ at each iteration –, and has a linear convergence rate in expectation – the expected distance to the optimum is scaled by a subunit factor at each iteration.

Decentralized optimization is relatively mature and various methods are known with complementary advantages. These methods include decentralized gradient descent (DGD) (Nedić and Ozdaglar (2009); Jakovetic et al. (2014); Yuan et al. (2013)), network Newton (Mokhtari et al. (2015a, b)), decentralized dual averaging (Duchi et al. (2012); Tsianos et al. (2012b)), the exact first order algorithm (EXTRA) (Shi et al. (2015)), as well as the alternating direction method of multipliers (ADMM) (Boyd et al. (2011); Shi et al. (2014); Iutzeler et al. (2013)) and its linearized variants (Ling and Ribeiro (2014); Ling et al. (2014); Mokhtari et al. (2015c)). The ADMM, its variants, and EXTRA converge linearly to the optimal argument but DGD, network Newton, and decentralized dual averaging have sublinear convergence rates. Of particular importance to this paper, is the fact that DGD has (inexact) linear converge to a neighborhood of the optimal argument when it uses constant stepsizes. It can achieve exact convergence by using diminishing stepsizes, but the convergence rate degrades to sublinear. This lack of linear convergence is solved by EXTRA through the use of iterations that rely on information of two consecutive steps (Shi et al. (2015)).

All of the algorithms mentioned above require the computationally costly evaluation of the local gradients $\nabla f_{n}({\mathbf{x}})=(1/q_{n})\sum_{i=1}^{q_{n}}\nabla f_{n,i}({\mathbf{x}})$ . This cost can be avoided by stochastic decentralized algorithms that reduce computational cost of iterations by substituting all local gradients with their stochastic approximations. This reduces the computational cost per iteration but results in sublinear convergence rates of order $O(1/t)$ even if the corresponding deterministic algorithm exhibits linear convergence. This is a drawback that also exists in centralized stochastic optimization where linear convergence rates in expectation are established by decreasing the variance of the stochastic gradient approximation (Roux et al. (2012); Schmidt et al. (2013); Shalev-Shwartz and Zhang (2013); Johnson and Zhang (2013); Konečnỳ and Richtárik (2013); Defazio et al. (2014)). In this paper we build on the ideas of the stochastic averaging gradient (SAG) algorithm (Schmidt et al. (2013)) and its unbiased version SAGA (Defazio et al. (2014)). Both of these algorithms use the idea of stochastic incremental averaging gradients. At each iteration only one of the stochastic gradients is updated and the average of all of the most recent stochastic gradients is used for estimating gradient.

The advantages of DSA relative to a group of stochastic and deterministic alternatives in solving a logistic regression problem with a synthetic dataset are then studied in numerical experiments (Section 4). These results demonstrate that DSA is the only decentralized stochastic algorithm that reaches the optimal solution with a linear convergence rate. We further show that DSA outperforms deterministic algorithms when the metric is the number of times that elements of the training set are evaluated. The behavior of DSA for different network topologies is also evaluated. We close the paper with pertinent remarks (Section 5).

Decentralized Double stochastic averaging gradient

If we recall the definitions of the local functions $f_{n}({\mathbf{x}}_{n})$ and the instantaneous local functions $f_{n,i}({\mathbf{x}}_{n})$ available at node $n$ , the implementation of EXTRA requires that each node $n$ computes the full gradient of its local objective function $f_{n}$ at ${\mathbf{x}}_{n}^{t}$ as

This is computationally expensive when the number of instantaneous functions $q_{n}$ is large. To resolve this issue, local stochastic gradients can be substituted for the local objective functions gradients in (3). These stochastic gradients approximate the gradient $\nabla f_{n}({\mathbf{x}}_{n})$ of node $n$ by randomly choosing one of the instantaneous functions gradients $\nabla f_{n,i}({\mathbf{x}}_{n})$ . If we let $i_{n}^{t}\in\{1,\ldots q_{n}\}$ denote a function index that we choose at time $t$ at node $n$ uniformly at random and independently of the history of the process, then the stochastic gradient is defined as

With these definitions in hand we can define the stochastic averaging gradient at node $n$ as

Observe that to implement (7) the gradients $\nabla f_{n,i}({\mathbf{y}}_{n,i}^{t})$ are stored in the local gradient table shown in Figure 1.

The DSA algorithm is a variation of EXTRA that substitutes the local gradients $\nabla f_{n}({\mathbf{x}}_{n}^{t})$ in (3) for the local stochastic average gradients ${\hat{\mathbf{g}}}_{n}^{t}$ in (7),

The DSA initial update is given by applying the same substitution for the update of DGD in (2) as

DSA is summarized in Algorithm 1 for $t\geq 1$ . The DSA update in (8) is implemented in Step 9. This step requires access to the local iterates ${\mathbf{x}}_{m}^{t}$ of neighboring nodes $m\in{\mathcal{N}}_{n}$ which are collected in Step 2. Furthermore, implementation of the DSA update also requires access to the stochastic averaging gradients ${\hat{\mathbf{g}}}_{n}^{t-1}$ and ${\hat{\mathbf{g}}}_{n}^{t}$ . The latter is computed in Step 4 and the former is computed and stored at the same step in the previous iteration. The computation of the stochastic averaging gradients requires the selection of the index $i_{n}^{t}$ . This index is chosen uniformly at random in Step 3. Determination of stochastic averaging gradients also necessitates access and maintenance of the gradients table in Figure 1. The $i_{n}^{t}$ element of this table is updated in Step 5 by replacing $\nabla f_{n,i_{n}^{t}}({\mathbf{y}}_{n,i_{n}^{t}}^{t})$ with $\nabla f_{n,i_{n}^{t}}({\mathbf{x}}_{n}^{t})$ , while the other vectors remain unchanged. To implement the first DSA iteration at time $t=0$ we have to perform the update in (9) instead of the update in (8) as in Step 7. Furhter observe that the auxiliary variables ${\mathbf{y}}_{n,i}^{0}$ are initialized to the initial iterate ${\mathbf{x}}_{n}^{0}$ . This implies that the initial values of the stored gradients are $\nabla f_{n,i}({\mathbf{y}}_{n,i}^{0})=\nabla f_{n,i}({\mathbf{x}}_{n}^{0})$ – with a consequently relatively large initialization cost.

We also point that, as written in (7), computation of local stochastic averaging gradients ${\hat{\mathbf{g}}}_{n}^{t}$ is costly because it requires evaluation of the sum $\sum_{i=1}^{q_{n}}\nabla f_{n,i}({\mathbf{y}}_{n,i}^{t})$ at each iteration. This cost can be avoided by updating the sum at each iteration with the recursive formula

Important properties and interpretations of EXTRA and DSA are presented in the following sections after a pertinent remark.

The local stochastic averaging gradients in (7) are unbiased estimates of the local gradients $\nabla f_{n}({\mathbf{x}}_{n}^{t})$ . Indeed, if we let ${\mathcal{F}}_{t}$ measure the history of the system up until time $t$ we have that the sum in (7) is deterministic given this sigma-algebra. Thus, the conditional expectation of the stochastic averaging gradient is,

The expression in (12) means, by definition, that ${\hat{\mathbf{g}}}_{n}^{t}$ is an unbiased estimate of $\nabla f_{n}({\mathbf{x}}_{n}^{t})$ when the history ${\mathcal{F}}^{t}$ is given.

According to the definition of extended weight matrix ${\mathbf{Z}}$ , the DGD iteration in (2) is equivalent to

The fundamental difference between DGD and EXTRA is that a fixed point of (15) does not necessarily satisfy (14), whereas the fixed points of (16) are guaranteed to do so. Indeed, taking limits in (15) we see that the fixed points ${\mathbf{x}}^{\infty}$ of DGD must satisfy

which is incompatible with (14) except in peculiar circumstances – such as, e.g., when all local functions have the same minimum. The limit points of EXTRA, however, satisfy the relationship

Substituting the limit point in (19) and reordering terms, we see that ${\mathbf{x}}^{\infty}$ must satisfy

2 Stochastic saddle point method interpretation of DSA

Consider ${\mathbf{x}}$ as a primal variable and ${\mathbf{v}}$ as a dual variable. Then, the updates in (22) and (23) are equivalent to the updates of a saddle point method with stepsize $\alpha$ that solves for the critical points of the augmented Lagrangian

Comparing (16) and (26) we see that they differ in the latter using stochastic averaging gradients ${\hat{\mathbf{g}}}^{t}$ in lieu of the full gradients $\nabla f({\mathbf{x}}^{t})$ . Therefore, DSA is a stochastic saddle point method in which the primal variables are updated as

and the dual variables ${\mathbf{v}}^{t}$ are updated as

Convergence analysis

The instantaneous local functions $f_{n,i}({\mathbf{x}}_{n})$ are differentiable and strongly convex with parameter $\mu$ .

The gradient of instantaneous local functions $\nabla f_{n,i}$ are Lipschitz continuous with parameter $L$ . I.e., for all $n\in\{1,\dots,N\}$ and $i\in\{1,\dots,q_{n}\}$ we can write

The condition imposed by Assumption 2 implies that the local functions $f_{n}({\mathbf{x}}_{n})$ and the global cost function $f({\mathbf{x}})=\sum_{n=1}^{N}f_{n}({\mathbf{x}}_{n})$ are also strongly convex with parameter $\mu$ . Likewise, Lipschitz continuity of the local instantaneous gradients considered in Assumption 3 enforces Lipschitz continuity of gradients of the local functions $\nabla f_{n}({\mathbf{x}}_{n})$ and the aggregate function $\nabla f({\mathbf{x}})$ – see, e.g., (Lemma 1 of Mokhtari et al. (2015a)).

In this section we study some basic properties of the sequences of primal and dual variables generated by the DSA algorithm. In the following lemma, we study the relation of iterates ${\mathbf{x}}^{t}$ and ${\mathbf{v}}^{t}$ with the optimal primal ${\mathbf{x}}^{*}$ and dual ${\mathbf{v}}^{*}$ arguments.

One of the KKT conditions of problem (25) follows that the optimal variables ${\mathbf{x}}^{*}$ and ${\mathbf{v}}^{*}$ satisfy $\alpha\nabla f({\mathbf{x}}^{*})+{\mathbf{U}}{\mathbf{v}}^{*}={\mathbf{0}}$ or equivalently $-\alpha\nabla f({\mathbf{x}}^{*})={\mathbf{U}}{\mathbf{v}}^{*}$ . Adding this equality to both sides of (32) follows the claim in (30).

Consider the DSA algorithm in (6)-(9) and the definition of sequence $p^{t}$ in (33). If Assumptions 1-3 hold true, then the squared norm of the difference between stochastic averaging gradient ${\hat{\mathbf{g}}}^{t}$ and the optimal gradient $\nabla f({\mathbf{x}}^{*})$ in expectation is bounded above by

2 Convergence

Consider the DSA algorithm as defined in (6)-(9). Further recall the definitions of $p^{t}$ in (33) and ${\mathbf{u}}^{t}$ , ${\mathbf{u}}^{*}$ , and ${\mathbf{G}}$ in (35). If Assumptions 1-3 hold true, then for any positive constants $\eta>0$ we can write

Lemma 4 shows an upper bound for the squared norm $\|{\mathbf{u}}^{t+1}-{\mathbf{u}}^{*}\|^{2}_{{\mathbf{G}}}$ which is the first part of the Lyapunov function $\|{\mathbf{u}}^{t}-{\mathbf{u}}^{*}\|^{2}_{{\mathbf{G}}}+cp^{t}$ at step $t+1$ . Likewise, we provide an upper bound for the second term of the Lyapunov function at time $t+1$ which is $p^{t+1}$ in terms of $p^{t}$ and some parameters that capture optimality gap. This bound is studied in the following lemma.

Consider the DSA algorithm as defined in (6)-(9) and the definition of $p^{t}$ in (33). Further, define $q_{\min}$ and $q_{\max}$ as the smallest and largest values for the number of instantaneous functions at a node, respectively. If Assumptions 1-3 hold true, then for all $t>0$ the sequence $p^{t}$ satisfies

Lemma 5 provides an upper bound for $p^{t+1}$ in terms of its previous value $p^{t}$ and the optimality error $f({\mathbf{x}}^{t})-f({\mathbf{x}}^{*})-\nabla f({\mathbf{x}}^{*})^{T}({\mathbf{x}}^{t}-{\mathbf{x}}^{*})$ . Combining the results in Lemmata 4 and 5 we can show that in expectation the Lyapunov function $\|{\mathbf{u}}^{t+1}-{\mathbf{u}}^{*}\|_{\mathbf{G}}^{2}+c\ p^{t+1}$ at step $t+1$ is strictly smaller than its previous value $\|{\mathbf{u}}^{t}-{\mathbf{u}}^{*}\|_{\mathbf{G}}^{2}+c\ p^{t}$ at step $t$ .

Consider the DSA algorithm as defined in (6)-(9). Further recall the definition of the sequence $p^{t}$ in (33). Define $\eta$ as an arbitrary positive constant chosen from the interval

If Assumptions 1-3 hold true and the stepsize $\alpha$ is chosen from the interval $\alpha\in\left(0,{\gamma}/{2\eta}\right)$ , then for arbitrary $c$ chosen from the interval

there exits a positive constant $0<\delta<1$ such that

We use the result in (41) to establish linear convergence of the sequence of squared norm error $\|{\mathbf{x}}^{t}-{\mathbf{x}}^{*}\|^{2}$ in expectation.

Further, the almost sure convergence is at least of order ${\mathcal{O}}(1/t)$ .

Theorem 8 provides almost sure convergence of ${\mathbf{x}}^{t}$ to the optimal solution ${\mathbf{x}}^{*}$ which is stronger result than convergence in expectation as in Corollary 7, however, the rate of convergence for the almost sure convergence is sublinear ${\mathcal{O}}(1/t)$ which is slower relative to the linear convergence in expectation provided in (42).

3 Convergence constant

The constant $\delta$ that controls the speed of convergence can be simplified by selecting specific values for $\eta$ , $\alpha$ , and $c$ . This uncovers connections to the properties of the local objective functions and the network topology. To make this clearer define the condition numbers of the objective function and the graph as

respectively. The condition number of the function is a measure of how difficult it is to minimize the local functions using gradient descent directions. The condition number of the graph is a measure of how slow the graph is in propagating a diffusion process. Both are known to control the speed of convergence of distributed optimization methods. The following corollary illustrates that these condition numbers also determine the convergence speed of DSA.

The linear convergence constant $0<\delta<1$ in (40) reduces to

Proof The given values for $\eta$ , $\alpha$ , and $c$ satisfy the conditions in Theorem 6. Substitute then these values into the expression for $\delta$ in (D). Simplify terms and utilize the condition number definitions in (44). The second term in the minimization in (D) becomes redundant because it is dominated by the first.

The three terms in (47) establish separate regimes, problems where the graph condition number is large, problems where the number of functions at each node is large, and problems where the condition number of the local functions are large. In the first regime the first term in (47) dominates and establishes a dependence in terms of the square of the graph’s condition number. In the second regime the middle term dominates and results in an inverse dependence with the number of functions available at each node. In the third regime, the third term dominates. The dependence in this case is inversely proportional to $\kappa_{f}^{4}$ .

Numerical analysis

where the regularization term $(\lambda/2)\|{\mathbf{x}}\|^{2}$ is added to reduce overfitting to the training set. The optimization problem in (48) can be written in the form of (1) by defining the local objective functions $f_{n}$ as

Observe that the local functions $f_{n}$ in (49) can be written as the average of a set of instantaneous functions $f_{n,i}$ defined as

$\text{for all}\ i=1,\dots,q_{n}$ . Considering the definitions of instantaneous local functions $f_{n,i}$ in (50) and local functions $f_{n}$ in (49), problem (48) can be solved using the DSA algorithm.

In our experiments we use a synthetic dataset where components of the feature vectors ${\mathbf{s}}_{{ni}}$ with label $l_{ni}=1$ are generated from a normal distribution with mean $\mu$ and standard deviation $\sigma_{+}$ , while sample points with label $l_{ni}=-1$ are generated from a normal distribution with mean $-\mu$ and standard deviation $\sigma_{-}$ . We consider a network of size $N$ where the edges between nodes are generated randomly with probability $p_{c}$ . The weight matrix ${\mathbf{W}}$ is generated using the Laplacian matrix ${\mathbf{L}}$ of network as

We provide a comparison of DSA with respect to DGD, EXTRA, stochastic EXTRA, and decentralized SAGA. The stochastic EXTRA is defined by using stochastic gradient in (5) instead of using full gradient as in EXTRA or stochastic averaging gradient as in DSA. The decentralized SAGA is a stochastic version of DGD algorithm that uses stochastic averaging gradient instead of exact gradient which is the naive approach for developing decentralized version of SAGA algorithm.

We study performances of the DSA algorithm for different topologies. We keep the parameters in Fig. 2 except we change the size of network to $N=100$ which implies each node has $q_{i}=5$ sample points. The linear convergence of DSA algorithm for random networks with $p_{c}=0.2$ and $p_{c}=0.3$ , complete graph, cycle, line and star are shown in Fig. 3. As we expect for the topologies that the graph is more connected and the diameter is smaller linear convergence of DSA is faster. The best performance belongs to the complete graph which requires $160$ iterations to achieve the relative error $e^{t}=10^{-6}$ . For random graphs with connectivity probabilities $p_{c}=0.3$ and $p_{c}=0.2$ DSA achieves the relative error $e^{t}=10^{-6}$ after $t=210$ and $t=280$ iterations, respectively. For the cycle graph the number of required iterations for reaching the relative error $e^{t}=10^{-6}$ is $t=470$ , while DSA does not reach this accuracy after $t=1000$ iterations when the graph is a line or star.

Conclusions

We acknowledge the support of the National Science Foundation (NSF CAREER CCF-0952867) and the Office of Naval Research (ONR N00014-12-1-0997).

A Proof of Lemma 3

Substituting the upper bound in (A) and simplification in (A) into (55), and considering the expression in (A) lead to

Summing up both sides of (59) for all $i=1,\dots,q_{n}$ , dividing both sides of the implied inequality by $q_{n}$ lead to

To show that the sum in the right hand side of (A) is bounded above we use the Lipschitz continuity of the instantaneous functions gradients $\nabla f_{n,i}$ . Using the same argument from (59) to (61) we can write

Considering the definition of the local objective functions $f_{n}({\mathbf{x}}_{n})=(1/q_{n})\sum_{i=1}^{q_{n}}f_{n,i}({\mathbf{x}}_{n})$ and the aggregate function $f({\mathbf{x}}):=\sum_{n=1}^{N}f_{n}({\mathbf{x}}_{n})$ , the right hand side of (63) can be simplified as

Replacing the sum in (A) by the upper bound in (64) implies

Considering the strong convexity of function $f$ with constant $\mu$ we can write

Therefore, we can substitute $\|{\nabla}f({\mathbf{x}}^{t})-\nabla f({\mathbf{x}}^{*})\|^{2}$ in (64) by the lower bound in (66) and the claim in (34) follows.

B Proof of Lemma 4

According to the Lipschitz continuity of the aggregate function gradients $\nabla f({\mathbf{x}})$ we can write $(1/L)\|\nabla f({\mathbf{x}}^{t})-\nabla f({\mathbf{x}}^{*})\|^{2}\leq({\mathbf{x}}^{t}-{\mathbf{x}}^{*})^{T}(\nabla f({\mathbf{x}}^{t})-\nabla f({\mathbf{x}}^{*}))$ . By adding and subtracting ${\mathbf{x}}^{t+1}$ to the term ${\mathbf{x}}^{t}-{\mathbf{x}}^{*}$ and multiplying both sides of the inequality by $2\alpha$ we obtain

Expanding the difference $\nabla f({\mathbf{x}}^{t})-\nabla f({\mathbf{x}}^{*})$ as ${\hat{\mathbf{g}}}^{t}-\nabla f({\mathbf{x}}^{*})+\nabla f({\mathbf{x}}^{t})-{\hat{\mathbf{g}}}^{t}$ for the first inner product in the right hand side of (67) implies

We proceed to simplify the inner product $2\alpha({\mathbf{x}}^{t+1}-{\mathbf{x}}^{*})^{T}({\hat{\mathbf{g}}}^{t}-\nabla f({\mathbf{x}}^{*}))$ in the right hand side of (B) by substituting $\alpha({\hat{\mathbf{g}}}^{t}-\nabla f({\mathbf{x}}^{*}))$ with its equivalent as introduced in (30). Applying this substitution the inner product can be simplified as

According to the definition of vector ${\mathbf{u}}$ and matrix ${\mathbf{G}}$ in (35), the last two summands of (B) can be simplified as $2({\mathbf{u}}^{t+1}-{\mathbf{u}}^{t})^{T}{\mathbf{G}}({\mathbf{u}}^{*}-{\mathbf{u}}^{t+1})$ . Moreover, observe that the inner product $2({\mathbf{u}}^{t+1}-{\mathbf{u}}^{t})^{T}{\mathbf{G}}({\mathbf{u}}^{*}-{\mathbf{u}}^{t+1})$ can be simplified as $\|{\mathbf{u}}^{t}-{\mathbf{u}}^{*}\|_{\mathbf{G}}^{2}-\|{\mathbf{u}}^{t+1}-{\mathbf{u}}^{*}\|_{\mathbf{G}}^{2}-\|{\mathbf{u}}^{t+1}-{\mathbf{u}}^{t}\|_{\mathbf{G}}^{2}$ . Applying this simplification into (B) implies

The next step is to bound above the inner product $2\alpha({\mathbf{x}}^{t}-{\mathbf{x}}^{t+1})^{T}(\nabla f({\mathbf{x}}^{t})-\nabla f({\mathbf{x}}^{*}))$ . Note that for any two vectors ${\mathbf{a}}$ and ${\mathbf{b}}$ , and any positive scalar $\eta$ the inequality $2{\mathbf{a}}^{T}{\mathbf{b}}\leq\eta\|{\mathbf{a}}\|^{2}+\eta^{-1}\|{\mathbf{b}}\|^{2}$ holds true. Therefore, by setting ${\mathbf{a}}={\mathbf{x}}^{t}-{\mathbf{x}}^{t+1}$ and ${\mathbf{b}}=\nabla f({\mathbf{x}}^{t})-\nabla f({\mathbf{x}}^{*})$ we obtain that

Now we substitute the terms in the right hand side of (B) by their simplifications or upper bounds. Replacing the inner product $2\alpha({\mathbf{x}}^{t+1}-{\mathbf{x}}^{*})^{T}({\hat{\mathbf{g}}}^{t}-\nabla f({\mathbf{x}}^{*}))$ by the simplification in (B), substituting expression $2\alpha({\mathbf{x}}^{t}-{\mathbf{x}}^{t+1})^{T}(\nabla f({\mathbf{x}}^{t})-\nabla f({\mathbf{x}}^{*}))$ by the upper bound in (72), and substituting inner product $2\alpha({\mathbf{x}}^{t+1}-{\mathbf{x}}^{*})^{T}(\nabla f({\mathbf{x}}^{t})-{\hat{\mathbf{g}}}^{t})$ by the sum $2\alpha({\mathbf{x}}^{t}-{\mathbf{x}}^{*})^{T}(\nabla f({\mathbf{x}}^{t})-{\hat{\mathbf{g}}}^{t})+2\alpha({\mathbf{x}}^{t+1}-{\mathbf{x}}^{t})^{T}(\nabla f({\mathbf{x}}^{t})-{\hat{\mathbf{g}}}^{t})$ imply

By applying inequality $2{\mathbf{a}}^{T}{\mathbf{b}}\leq\eta\|{\mathbf{a}}\|^{2}+\eta^{-1}\|{\mathbf{b}}\|^{2}$ for the choice of vectors ${\mathbf{a}}={\mathbf{x}}^{t+1}-{\mathbf{x}}^{t}$ and ${\mathbf{b}}=\nabla f({\mathbf{x}}^{t})-{\hat{\mathbf{g}}}^{t}$ , we obtain that $2({\mathbf{x}}^{t+1}-{\mathbf{x}}^{t})^{T}(\nabla f({\mathbf{x}}^{t})-{\hat{\mathbf{g}}}^{t})$ is bounded above by $\eta\|{\mathbf{x}}^{t+1}-{\mathbf{x}}^{t}\|^{2}+(1/\eta)\|\nabla f({\mathbf{x}}^{t})-{\hat{\mathbf{g}}}^{t}\|^{2}$ . Replacing $2({\mathbf{x}}^{t+1}-{\mathbf{x}}^{t})^{T}(\nabla f({\mathbf{x}}^{t})-{\hat{\mathbf{g}}}^{t})$ in (B) by its upper bound $\eta\|{\mathbf{x}}^{t+1}-{\mathbf{x}}^{t}\|^{2}+(1/\eta)\|\nabla f({\mathbf{x}}^{t})-{\hat{\mathbf{g}}}^{t}\|^{2}$ yields

Substituting the simplification in (77) into (76) yields

Considering the strong convexity of function $f$ with constant $\mu$ we can write $\left\|{\nabla}f({\mathbf{x}}^{t})-\nabla f({\mathbf{x}}^{*})\right\|^{2}\geq 2\mu\left(f({\mathbf{x}}^{t})-f({\mathbf{x}}^{*})-\nabla f({\mathbf{x}}^{*})^{T}({\mathbf{x}}^{t}-{\mathbf{x}}^{*})\right)$ . substituting the squared norm $\left\|{\nabla}f({\mathbf{x}}^{t})-\nabla f({\mathbf{x}}^{*})\right\|^{2}$ by this lower bound in (78) follows

C Proof of Lemma 5

Given the information until time $t$ , each auxiliary vector ${\mathbf{y}}_{n,i}^{t+1}$ is a random variable that takes values ${\mathbf{y}}_{n,i}^{t}$ and ${\mathbf{x}}_{n}^{t}$ with associated probabilities $1-1/q_{n}$ and $1/q_{n}$ , respectively. This observation holds since with probability $1/q_{n}$ node $n$ may choose index $i$ to update at time $t+1$ and with probability $1-(1/q_{n})$ choose other indices. Therefore, we can write

For the simplicity of equations let us define sequence $p_{n}^{t}$ as

We proceed to find and upper bound for the terms in the right hand side of (83). First note that according to the strong convexity of instantaneous functions $f_{n,i}$ and $f_{n}$ both terms in the right hand side of (83) are non-negative. Observing that the number of instantaneous functions at each node $q_{n}$ satisfies the condition $q_{\min}\leq q_{n}\leq q_{\max}$ , we obtain

Now observe that according to the definitions of sequences $p^{t}$ and $p_{n}^{t}$ in (33) and (82), respectively, $p^{t}$ is the sum of $p_{n}^{t}$ for all $n$ , i.e. $p^{t}=\sum_{n=1}^{N}p_{n}^{t}$ . Therefore, we can rewrite (85) as

D Proof of Theorem 6

To prove the result of Theorem 6 first we prove the following Lemma to establish an upper bound for $\|{\mathbf{v}}^{t}-{\mathbf{v}}^{*}\|^{2}.$

Proof Consider the basic inequality $\|{\mathbf{a}}+{\mathbf{b}}\|^{2}\leq 2\|{\mathbf{a}}\|^{2}+2\|{\mathbf{b}}\|^{2}$ for the case that ${\mathbf{a}}={\mathbf{U}}({\mathbf{v}}^{t+1}-{\mathbf{v}}^{*})$ , ${\mathbf{b}}={\mathbf{U}}({\mathbf{v}}^{t}-{\mathbf{v}}^{t+1})$ which can be written as

Note that according to the fact that both ${\mathbf{v}}^{t}$ and ${\mathbf{v}}^{*}$ lie in the column space of matrix ${\mathbf{U}}$ we obtain $\|{\mathbf{U}}({\mathbf{v}}^{t}-{\mathbf{v}}^{*})\|^{2}\geq\gamma^{\prime}\|{\mathbf{v}}^{t}-{\mathbf{v}}^{*}\|^{2}$ . Substituting this lower bound for $\|{\mathbf{U}}({\mathbf{v}}^{t}-{\mathbf{v}}^{*})\|^{2}$ in (D) and multiplying both sides of the imposed inequality by $\gamma^{\prime}$ yield

Using the result in Lemma 10 we show linear convergence of the sequence $\|{\mathbf{u}}^{t}-{\mathbf{u}}^{*}\|_{\mathbf{G}}^{2}+c\ p^{t}$ as follows.

Proof of Theorem 6: Proving the linear convergence claim in (40) is equivalent to showing that

Further, substitute the squared norm $\|{\mathbf{x}}^{t}-{\mathbf{x}}^{*}\|^{2}$ by the upper bound $(2/\mu)(f({\mathbf{x}}^{t})-f({\mathbf{x}}^{*})-\nabla f({\mathbf{x}}^{*})^{T}({\mathbf{x}}^{t}-{\mathbf{x}}^{*}))$ to obtain

Replacing $\|{\mathbf{u}}^{t}-{\mathbf{u}}^{*}\|_{\mathbf{G}}^{2}$ in (D) by the upper bound (D) and regrouping the terms lead to

Notice that if the inequality in (D) holds true, then the relation in (D) is valid and as we mentioned before the claim in (93) holds. To verify the sum in the right hand side of (D) is always positive and the inequality is valid, we enforce each summands in the right hand side of (D) to be non-negative. Therefore, the following conditions should be satisfied

Recall that $\gamma$ is the smallest eigenvalue of positive definite matrix ${\mathbf{Z}}$ . All the inequalities in (D) are satisfied, if $\delta$ is chosen as

where $\eta$ , $c$ and $\alpha$ are selected from the intervals

Notice that considering the conditions for the variables $\eta$ , $\alpha$ and $c$ in (100), the constant $\delta$ in (D) is strictly positive $\delta>0$ . Moreover, according to the definition in (D) the constant $\delta$ is smaller than $\gamma^{\prime}/2\Gamma^{\prime}$ which leads to the conclusion that $\delta\leq 1/2<1$ . Therefore, we obtain that $0<\delta<1$ and the claim in (40) is valid.

E Proof of Theorem 8

The proof uses the relationship in the statement (40) of Theorem 6 to build a supermartingale sequence. To do this define the stochastic processes $\zeta^{t}$ and $\beta^{t}$ as

Note that the stochastic processes $\zeta^{t}$ and $\beta^{t}$ are alway non-negative. Let now ${\mathcal{F}}_{t}$ be a sigma-algebra measuring $\zeta^{t}$ , $\beta^{t}$ , and ${\mathbf{u}}^{t}$ . Considering the definitions of $\zeta^{t}$ and $\beta^{t}$ and the relation in (40) we can write

Since the sequences $\alpha^{t}$ and $\beta^{t}$ are nonnegative it follows from (102) that they satisfy the conditions of the supermartingale convergence theorem – see e.g. theorem E $7.4$ Solo and Kong (1995) . Therefore, we obtain that: (i) The sequence $\zeta^{t}$ converges almost surely. (ii) The sum $\sum_{t=0}^{\infty}\beta^{t}<\infty$ is almost surely finite. The definition of $\beta^{t}$ in (101) implies that

Observing the fact that $\delta$ and $\gamma$ are positive constants, we can conclude from (104) that the sequence $\|{\mathbf{x}}^{t}-{\mathbf{x}}^{*}\|^{2}$ is almost surely summable and the it converges with probability 1 to null at least in the order of ${\mathcal{O}}(1/t)$ . Almost sure convergence of sequence to null follows the claim in (43).