DSA: Decentralized Double Stochastic Averaging Gradient Algorithm

Aryan Mokhtari, Alejandro Ribeiro

Introduction

Our interest here is in solving (1) with a method that is decentralized – nodes operate on their local functions and communicate with neighbors only –, stochastic – nodes determine a descent direction by evaluating only one out of the qnq_{n} functions fn,if_{n,i} at each iteration –, and has a linear convergence rate in expectation – the expected distance to the optimum is scaled by a subunit factor at each iteration.

Decentralized optimization is relatively mature and various methods are known with complementary advantages. These methods include decentralized gradient descent (DGD) (Nedić and Ozdaglar (2009); Jakovetic et al. (2014); Yuan et al. (2013)), network Newton (Mokhtari et al. (2015a, b)), decentralized dual averaging (Duchi et al. (2012); Tsianos et al. (2012b)), the exact first order algorithm (EXTRA) (Shi et al. (2015)), as well as the alternating direction method of multipliers (ADMM) (Boyd et al. (2011); Shi et al. (2014); Iutzeler et al. (2013)) and its linearized variants (Ling and Ribeiro (2014); Ling et al. (2014); Mokhtari et al. (2015c)). The ADMM, its variants, and EXTRA converge linearly to the optimal argument but DGD, network Newton, and decentralized dual averaging have sublinear convergence rates. Of particular importance to this paper, is the fact that DGD has (inexact) linear converge to a neighborhood of the optimal argument when it uses constant stepsizes. It can achieve exact convergence by using diminishing stepsizes, but the convergence rate degrades to sublinear. This lack of linear convergence is solved by EXTRA through the use of iterations that rely on information of two consecutive steps (Shi et al. (2015)).

All of the algorithms mentioned above require the computationally costly evaluation of the local gradients fn(x)=(1/qn)i=1qnfn,i(x)\nabla f_{n}({\mathbf{x}})=(1/q_{n})\sum_{i=1}^{q_{n}}\nabla f_{n,i}({\mathbf{x}}). This cost can be avoided by stochastic decentralized algorithms that reduce computational cost of iterations by substituting all local gradients with their stochastic approximations. This reduces the computational cost per iteration but results in sublinear convergence rates of order O(1/t)O(1/t) even if the corresponding deterministic algorithm exhibits linear convergence. This is a drawback that also exists in centralized stochastic optimization where linear convergence rates in expectation are established by decreasing the variance of the stochastic gradient approximation (Roux et al. (2012); Schmidt et al. (2013); Shalev-Shwartz and Zhang (2013); Johnson and Zhang (2013); Konečnỳ and Richtárik (2013); Defazio et al. (2014)). In this paper we build on the ideas of the stochastic averaging gradient (SAG) algorithm (Schmidt et al. (2013)) and its unbiased version SAGA (Defazio et al. (2014)). Both of these algorithms use the idea of stochastic incremental averaging gradients. At each iteration only one of the stochastic gradients is updated and the average of all of the most recent stochastic gradients is used for estimating gradient.

The advantages of DSA relative to a group of stochastic and deterministic alternatives in solving a logistic regression problem with a synthetic dataset are then studied in numerical experiments (Section 4). These results demonstrate that DSA is the only decentralized stochastic algorithm that reaches the optimal solution with a linear convergence rate. We further show that DSA outperforms deterministic algorithms when the metric is the number of times that elements of the training set are evaluated. The behavior of DSA for different network topologies is also evaluated. We close the paper with pertinent remarks (Section 5).

Decentralized Double stochastic averaging gradient

If we recall the definitions of the local functions fn(xn)f_{n}({\mathbf{x}}_{n}) and the instantaneous local functions fn,i(xn)f_{n,i}({\mathbf{x}}_{n}) available at node nn, the implementation of EXTRA requires that each node nn computes the full gradient of its local objective function fnf_{n} at xnt{\mathbf{x}}_{n}^{t} as

This is computationally expensive when the number of instantaneous functions qnq_{n} is large. To resolve this issue, local stochastic gradients can be substituted for the local objective functions gradients in (3). These stochastic gradients approximate the gradient fn(xn)\nabla f_{n}({\mathbf{x}}_{n}) of node nn by randomly choosing one of the instantaneous functions gradients fn,i(xn)\nabla f_{n,i}({\mathbf{x}}_{n}). If we let int{1,qn}i_{n}^{t}\in\{1,\ldots q_{n}\} denote a function index that we choose at time tt at node nn uniformly at random and independently of the history of the process, then the stochastic gradient is defined as

With these definitions in hand we can define the stochastic averaging gradient at node nn as

Observe that to implement (7) the gradients fn,i(yn,it)\nabla f_{n,i}({\mathbf{y}}_{n,i}^{t}) are stored in the local gradient table shown in Figure 1.

The DSA algorithm is a variation of EXTRA that substitutes the local gradients fn(xnt)\nabla f_{n}({\mathbf{x}}_{n}^{t}) in (3) for the local stochastic average gradients g^nt{\hat{\mathbf{g}}}_{n}^{t} in (7),

The DSA initial update is given by applying the same substitution for the update of DGD in (2) as

DSA is summarized in Algorithm 1 for t1t\geq 1. The DSA update in (8) is implemented in Step 9. This step requires access to the local iterates xmt{\mathbf{x}}_{m}^{t} of neighboring nodes mNnm\in{\mathcal{N}}_{n} which are collected in Step 2. Furthermore, implementation of the DSA update also requires access to the stochastic averaging gradients g^nt1{\hat{\mathbf{g}}}_{n}^{t-1} and g^nt{\hat{\mathbf{g}}}_{n}^{t}. The latter is computed in Step 4 and the former is computed and stored at the same step in the previous iteration. The computation of the stochastic averaging gradients requires the selection of the index inti_{n}^{t}. This index is chosen uniformly at random in Step 3. Determination of stochastic averaging gradients also necessitates access and maintenance of the gradients table in Figure 1. The inti_{n}^{t} element of this table is updated in Step 5 by replacing fn,int(yn,intt)\nabla f_{n,i_{n}^{t}}({\mathbf{y}}_{n,i_{n}^{t}}^{t}) with fn,int(xnt)\nabla f_{n,i_{n}^{t}}({\mathbf{x}}_{n}^{t}), while the other vectors remain unchanged. To implement the first DSA iteration at time t=0t=0 we have to perform the update in (9) instead of the update in (8) as in Step 7. Furhter observe that the auxiliary variables yn,i0{\mathbf{y}}_{n,i}^{0} are initialized to the initial iterate xn0{\mathbf{x}}_{n}^{0}. This implies that the initial values of the stored gradients are fn,i(yn,i0)=fn,i(xn0)\nabla f_{n,i}({\mathbf{y}}_{n,i}^{0})=\nabla f_{n,i}({\mathbf{x}}_{n}^{0}) – with a consequently relatively large initialization cost.

We also point that, as written in (7), computation of local stochastic averaging gradients g^nt{\hat{\mathbf{g}}}_{n}^{t} is costly because it requires evaluation of the sum i=1qnfn,i(yn,it)\sum_{i=1}^{q_{n}}\nabla f_{n,i}({\mathbf{y}}_{n,i}^{t}) at each iteration. This cost can be avoided by updating the sum at each iteration with the recursive formula

Important properties and interpretations of EXTRA and DSA are presented in the following sections after a pertinent remark.

The local stochastic averaging gradients in (7) are unbiased estimates of the local gradients fn(xnt)\nabla f_{n}({\mathbf{x}}_{n}^{t}). Indeed, if we let Ft{\mathcal{F}}_{t} measure the history of the system up until time tt we have that the sum in (7) is deterministic given this sigma-algebra. Thus, the conditional expectation of the stochastic averaging gradient is,

The expression in (12) means, by definition, that g^nt{\hat{\mathbf{g}}}_{n}^{t} is an unbiased estimate of fn(xnt)\nabla f_{n}({\mathbf{x}}_{n}^{t}) when the history Ft{\mathcal{F}}^{t} is given.

According to the definition of extended weight matrix Z{\mathbf{Z}}, the DGD iteration in (2) is equivalent to

The fundamental difference between DGD and EXTRA is that a fixed point of (15) does not necessarily satisfy (14), whereas the fixed points of (16) are guaranteed to do so. Indeed, taking limits in (15) we see that the fixed points x{\mathbf{x}}^{\infty} of DGD must satisfy

which is incompatible with (14) except in peculiar circumstances – such as, e.g., when all local functions have the same minimum. The limit points of EXTRA, however, satisfy the relationship

Substituting the limit point in (19) and reordering terms, we see that x{\mathbf{x}}^{\infty} must satisfy

2 Stochastic saddle point method interpretation of DSA

Consider x{\mathbf{x}} as a primal variable and v{\mathbf{v}} as a dual variable. Then, the updates in (22) and (23) are equivalent to the updates of a saddle point method with stepsize α\alpha that solves for the critical points of the augmented Lagrangian

Comparing (16) and (26) we see that they differ in the latter using stochastic averaging gradients g^t{\hat{\mathbf{g}}}^{t} in lieu of the full gradients f(xt)\nabla f({\mathbf{x}}^{t}). Therefore, DSA is a stochastic saddle point method in which the primal variables are updated as

and the dual variables vt{\mathbf{v}}^{t} are updated as

Convergence analysis

The instantaneous local functions fn,i(xn)f_{n,i}({\mathbf{x}}_{n}) are differentiable and strongly convex with parameter μ\mu.

The gradient of instantaneous local functions fn,i\nabla f_{n,i} are Lipschitz continuous with parameter LL. I.e., for all n{1,,N}n\in\{1,\dots,N\} and i{1,,qn}i\in\{1,\dots,q_{n}\} we can write

The condition imposed by Assumption 2 implies that the local functions fn(xn)f_{n}({\mathbf{x}}_{n}) and the global cost function f(x)=n=1Nfn(xn)f({\mathbf{x}})=\sum_{n=1}^{N}f_{n}({\mathbf{x}}_{n}) are also strongly convex with parameter μ\mu. Likewise, Lipschitz continuity of the local instantaneous gradients considered in Assumption 3 enforces Lipschitz continuity of gradients of the local functions fn(xn)\nabla f_{n}({\mathbf{x}}_{n}) and the aggregate function f(x)\nabla f({\mathbf{x}}) – see, e.g., (Lemma 1 of Mokhtari et al. (2015a)).

In this section we study some basic properties of the sequences of primal and dual variables generated by the DSA algorithm. In the following lemma, we study the relation of iterates xt{\mathbf{x}}^{t} and vt{\mathbf{v}}^{t} with the optimal primal x{\mathbf{x}}^{*} and dual v{\mathbf{v}}^{*} arguments.

One of the KKT conditions of problem (25) follows that the optimal variables x{\mathbf{x}}^{*} and v{\mathbf{v}}^{*} satisfy αf(x)+Uv=0\alpha\nabla f({\mathbf{x}}^{*})+{\mathbf{U}}{\mathbf{v}}^{*}={\mathbf{0}} or equivalently αf(x)=Uv-\alpha\nabla f({\mathbf{x}}^{*})={\mathbf{U}}{\mathbf{v}}^{*}. Adding this equality to both sides of (32) follows the claim in (30).

Consider the DSA algorithm in (6)-(9) and the definition of sequence ptp^{t} in (33). If Assumptions 1-3 hold true, then the squared norm of the difference between stochastic averaging gradient g^t{\hat{\mathbf{g}}}^{t} and the optimal gradient f(x)\nabla f({\mathbf{x}}^{*}) in expectation is bounded above by

2 Convergence

Consider the DSA algorithm as defined in (6)-(9). Further recall the definitions of ptp^{t} in (33) and ut{\mathbf{u}}^{t}, u{\mathbf{u}}^{*}, and G{\mathbf{G}} in (35). If Assumptions 1-3 hold true, then for any positive constants η>0\eta>0 we can write

Lemma 4 shows an upper bound for the squared norm ut+1uG2\|{\mathbf{u}}^{t+1}-{\mathbf{u}}^{*}\|^{2}_{{\mathbf{G}}} which is the first part of the Lyapunov function utuG2+cpt\|{\mathbf{u}}^{t}-{\mathbf{u}}^{*}\|^{2}_{{\mathbf{G}}}+cp^{t} at step t+1t+1. Likewise, we provide an upper bound for the second term of the Lyapunov function at time t+1t+1 which is pt+1p^{t+1} in terms of ptp^{t} and some parameters that capture optimality gap. This bound is studied in the following lemma.

Consider the DSA algorithm as defined in (6)-(9) and the definition of ptp^{t} in (33). Further, define qminq_{\min} and qmaxq_{\max} as the smallest and largest values for the number of instantaneous functions at a node, respectively. If Assumptions 1-3 hold true, then for all t>0t>0 the sequence ptp^{t} satisfies

Lemma 5 provides an upper bound for pt+1p^{t+1} in terms of its previous value ptp^{t} and the optimality error f(xt)f(x)f(x)T(xtx)f({\mathbf{x}}^{t})-f({\mathbf{x}}^{*})-\nabla f({\mathbf{x}}^{*})^{T}({\mathbf{x}}^{t}-{\mathbf{x}}^{*}). Combining the results in Lemmata 4 and 5 we can show that in expectation the Lyapunov function ut+1uG2+c pt+1\|{\mathbf{u}}^{t+1}-{\mathbf{u}}^{*}\|_{\mathbf{G}}^{2}+c\ p^{t+1} at step t+1t+1 is strictly smaller than its previous value utuG2+c pt\|{\mathbf{u}}^{t}-{\mathbf{u}}^{*}\|_{\mathbf{G}}^{2}+c\ p^{t} at step tt.

Consider the DSA algorithm as defined in (6)-(9). Further recall the definition of the sequence ptp^{t} in (33). Define η\eta as an arbitrary positive constant chosen from the interval

If Assumptions 1-3 hold true and the stepsize α\alpha is chosen from the interval α(0,γ/2η)\alpha\in\left(0,{\gamma}/{2\eta}\right), then for arbitrary cc chosen from the interval

there exits a positive constant 0<δ<10<\delta<1 such that

We use the result in (41) to establish linear convergence of the sequence of squared norm error xtx2\|{\mathbf{x}}^{t}-{\mathbf{x}}^{*}\|^{2} in expectation.

Further, the almost sure convergence is at least of order O(1/t){\mathcal{O}}(1/t).

Theorem 8 provides almost sure convergence of xt{\mathbf{x}}^{t} to the optimal solution x{\mathbf{x}}^{*} which is stronger result than convergence in expectation as in Corollary 7, however, the rate of convergence for the almost sure convergence is sublinear O(1/t){\mathcal{O}}(1/t) which is slower relative to the linear convergence in expectation provided in (42).

3 Convergence constant

The constant δ\delta that controls the speed of convergence can be simplified by selecting specific values for η\eta, α\alpha, and cc. This uncovers connections to the properties of the local objective functions and the network topology. To make this clearer define the condition numbers of the objective function and the graph as

respectively. The condition number of the function is a measure of how difficult it is to minimize the local functions using gradient descent directions. The condition number of the graph is a measure of how slow the graph is in propagating a diffusion process. Both are known to control the speed of convergence of distributed optimization methods. The following corollary illustrates that these condition numbers also determine the convergence speed of DSA.

The linear convergence constant 0<δ<10<\delta<1 in (40) reduces to

Proof The given values for η\eta, α\alpha, and cc satisfy the conditions in Theorem 6. Substitute then these values into the expression for δ\delta in (D). Simplify terms and utilize the condition number definitions in (44). The second term in the minimization in (D) becomes redundant because it is dominated by the first.

The three terms in (47) establish separate regimes, problems where the graph condition number is large, problems where the number of functions at each node is large, and problems where the condition number of the local functions are large. In the first regime the first term in (47) dominates and establishes a dependence in terms of the square of the graph’s condition number. In the second regime the middle term dominates and results in an inverse dependence with the number of functions available at each node. In the third regime, the third term dominates. The dependence in this case is inversely proportional to κf4\kappa_{f}^{4}.

Numerical analysis

where the regularization term (λ/2)x2(\lambda/2)\|{\mathbf{x}}\|^{2} is added to reduce overfitting to the training set. The optimization problem in (48) can be written in the form of (1) by defining the local objective functions fnf_{n} as

Observe that the local functions fnf_{n} in (49) can be written as the average of a set of instantaneous functions fn,if_{n,i} defined as

for all i=1,,qn\text{for all}\ i=1,\dots,q_{n}. Considering the definitions of instantaneous local functions fn,if_{n,i} in (50) and local functions fnf_{n} in (49), problem (48) can be solved using the DSA algorithm.

In our experiments we use a synthetic dataset where components of the feature vectors sni{\mathbf{s}}_{{ni}} with label lni=1l_{ni}=1 are generated from a normal distribution with mean μ\mu and standard deviation σ+\sigma_{+}, while sample points with label lni=1l_{ni}=-1 are generated from a normal distribution with mean μ-\mu and standard deviation σ\sigma_{-}. We consider a network of size NN where the edges between nodes are generated randomly with probability pcp_{c}. The weight matrix W{\mathbf{W}} is generated using the Laplacian matrix L{\mathbf{L}} of network as

We provide a comparison of DSA with respect to DGD, EXTRA, stochastic EXTRA, and decentralized SAGA. The stochastic EXTRA is defined by using stochastic gradient in (5) instead of using full gradient as in EXTRA or stochastic averaging gradient as in DSA. The decentralized SAGA is a stochastic version of DGD algorithm that uses stochastic averaging gradient instead of exact gradient which is the naive approach for developing decentralized version of SAGA algorithm.

We study performances of the DSA algorithm for different topologies. We keep the parameters in Fig. 2 except we change the size of network to N=100N=100 which implies each node has qi=5q_{i}=5 sample points. The linear convergence of DSA algorithm for random networks with pc=0.2p_{c}=0.2 and pc=0.3p_{c}=0.3, complete graph, cycle, line and star are shown in Fig. 3. As we expect for the topologies that the graph is more connected and the diameter is smaller linear convergence of DSA is faster. The best performance belongs to the complete graph which requires 160160 iterations to achieve the relative error et=106e^{t}=10^{-6}. For random graphs with connectivity probabilities pc=0.3p_{c}=0.3 and pc=0.2p_{c}=0.2 DSA achieves the relative error et=106e^{t}=10^{-6} after t=210t=210 and t=280t=280 iterations, respectively. For the cycle graph the number of required iterations for reaching the relative error et=106e^{t}=10^{-6} is t=470t=470, while DSA does not reach this accuracy after t=1000t=1000 iterations when the graph is a line or star.

Conclusions

We acknowledge the support of the National Science Foundation (NSF CAREER CCF-0952867) and the Office of Naval Research (ONR N00014-12-1-0997).

A Proof of Lemma 3

Substituting the upper bound in (A) and simplification in (A) into (55), and considering the expression in (A) lead to

Summing up both sides of (59) for all i=1,,qni=1,\dots,q_{n}, dividing both sides of the implied inequality by qnq_{n} lead to

To show that the sum in the right hand side of (A) is bounded above we use the Lipschitz continuity of the instantaneous functions gradients fn,i\nabla f_{n,i}. Using the same argument from (59) to (61) we can write

Considering the definition of the local objective functions fn(xn)=(1/qn)i=1qnfn,i(xn)f_{n}({\mathbf{x}}_{n})=(1/q_{n})\sum_{i=1}^{q_{n}}f_{n,i}({\mathbf{x}}_{n}) and the aggregate function f(x):=n=1Nfn(xn)f({\mathbf{x}}):=\sum_{n=1}^{N}f_{n}({\mathbf{x}}_{n}), the right hand side of (63) can be simplified as

Replacing the sum in (A) by the upper bound in (64) implies

Considering the strong convexity of function ff with constant μ\mu we can write

Therefore, we can substitute f(xt)f(x)2\|{\nabla}f({\mathbf{x}}^{t})-\nabla f({\mathbf{x}}^{*})\|^{2} in (64) by the lower bound in (66) and the claim in (34) follows.

B Proof of Lemma 4

According to the Lipschitz continuity of the aggregate function gradients f(x)\nabla f({\mathbf{x}}) we can write (1/L)f(xt)f(x)2(xtx)T(f(xt)f(x))(1/L)\|\nabla f({\mathbf{x}}^{t})-\nabla f({\mathbf{x}}^{*})\|^{2}\leq({\mathbf{x}}^{t}-{\mathbf{x}}^{*})^{T}(\nabla f({\mathbf{x}}^{t})-\nabla f({\mathbf{x}}^{*})). By adding and subtracting xt+1{\mathbf{x}}^{t+1} to the term xtx{\mathbf{x}}^{t}-{\mathbf{x}}^{*} and multiplying both sides of the inequality by 2α2\alpha we obtain

Expanding the difference f(xt)f(x)\nabla f({\mathbf{x}}^{t})-\nabla f({\mathbf{x}}^{*}) as g^tf(x)+f(xt)g^t{\hat{\mathbf{g}}}^{t}-\nabla f({\mathbf{x}}^{*})+\nabla f({\mathbf{x}}^{t})-{\hat{\mathbf{g}}}^{t} for the first inner product in the right hand side of (67) implies

We proceed to simplify the inner product 2α(xt+1x)T(g^tf(x))2\alpha({\mathbf{x}}^{t+1}-{\mathbf{x}}^{*})^{T}({\hat{\mathbf{g}}}^{t}-\nabla f({\mathbf{x}}^{*})) in the right hand side of (B) by substituting α(g^tf(x))\alpha({\hat{\mathbf{g}}}^{t}-\nabla f({\mathbf{x}}^{*})) with its equivalent as introduced in (30). Applying this substitution the inner product can be simplified as

According to the definition of vector u{\mathbf{u}} and matrix G{\mathbf{G}} in (35), the last two summands of (B) can be simplified as 2(ut+1ut)TG(uut+1)2({\mathbf{u}}^{t+1}-{\mathbf{u}}^{t})^{T}{\mathbf{G}}({\mathbf{u}}^{*}-{\mathbf{u}}^{t+1}). Moreover, observe that the inner product 2(ut+1ut)TG(uut+1)2({\mathbf{u}}^{t+1}-{\mathbf{u}}^{t})^{T}{\mathbf{G}}({\mathbf{u}}^{*}-{\mathbf{u}}^{t+1}) can be simplified as utuG2ut+1uG2ut+1utG2\|{\mathbf{u}}^{t}-{\mathbf{u}}^{*}\|_{\mathbf{G}}^{2}-\|{\mathbf{u}}^{t+1}-{\mathbf{u}}^{*}\|_{\mathbf{G}}^{2}-\|{\mathbf{u}}^{t+1}-{\mathbf{u}}^{t}\|_{\mathbf{G}}^{2}. Applying this simplification into (B) implies

The next step is to bound above the inner product 2α(xtxt+1)T(f(xt)f(x))2\alpha({\mathbf{x}}^{t}-{\mathbf{x}}^{t+1})^{T}(\nabla f({\mathbf{x}}^{t})-\nabla f({\mathbf{x}}^{*})). Note that for any two vectors a{\mathbf{a}} and b{\mathbf{b}}, and any positive scalar η\eta the inequality 2aTbηa2+η1b22{\mathbf{a}}^{T}{\mathbf{b}}\leq\eta\|{\mathbf{a}}\|^{2}+\eta^{-1}\|{\mathbf{b}}\|^{2} holds true. Therefore, by setting a=xtxt+1{\mathbf{a}}={\mathbf{x}}^{t}-{\mathbf{x}}^{t+1} and b=f(xt)f(x){\mathbf{b}}=\nabla f({\mathbf{x}}^{t})-\nabla f({\mathbf{x}}^{*}) we obtain that

Now we substitute the terms in the right hand side of (B) by their simplifications or upper bounds. Replacing the inner product 2α(xt+1x)T(g^tf(x))2\alpha({\mathbf{x}}^{t+1}-{\mathbf{x}}^{*})^{T}({\hat{\mathbf{g}}}^{t}-\nabla f({\mathbf{x}}^{*})) by the simplification in (B), substituting expression 2α(xtxt+1)T(f(xt)f(x))2\alpha({\mathbf{x}}^{t}-{\mathbf{x}}^{t+1})^{T}(\nabla f({\mathbf{x}}^{t})-\nabla f({\mathbf{x}}^{*})) by the upper bound in (72), and substituting inner product 2α(xt+1x)T(f(xt)g^t)2\alpha({\mathbf{x}}^{t+1}-{\mathbf{x}}^{*})^{T}(\nabla f({\mathbf{x}}^{t})-{\hat{\mathbf{g}}}^{t}) by the sum 2α(xtx)T(f(xt)g^t)+2α(xt+1xt)T(f(xt)g^t)2\alpha({\mathbf{x}}^{t}-{\mathbf{x}}^{*})^{T}(\nabla f({\mathbf{x}}^{t})-{\hat{\mathbf{g}}}^{t})+2\alpha({\mathbf{x}}^{t+1}-{\mathbf{x}}^{t})^{T}(\nabla f({\mathbf{x}}^{t})-{\hat{\mathbf{g}}}^{t}) imply

By applying inequality 2aTbηa2+η1b22{\mathbf{a}}^{T}{\mathbf{b}}\leq\eta\|{\mathbf{a}}\|^{2}+\eta^{-1}\|{\mathbf{b}}\|^{2} for the choice of vectors a=xt+1xt{\mathbf{a}}={\mathbf{x}}^{t+1}-{\mathbf{x}}^{t} and b=f(xt)g^t{\mathbf{b}}=\nabla f({\mathbf{x}}^{t})-{\hat{\mathbf{g}}}^{t}, we obtain that 2(xt+1xt)T(f(xt)g^t)2({\mathbf{x}}^{t+1}-{\mathbf{x}}^{t})^{T}(\nabla f({\mathbf{x}}^{t})-{\hat{\mathbf{g}}}^{t}) is bounded above by ηxt+1xt2+(1/η)f(xt)g^t2\eta\|{\mathbf{x}}^{t+1}-{\mathbf{x}}^{t}\|^{2}+(1/\eta)\|\nabla f({\mathbf{x}}^{t})-{\hat{\mathbf{g}}}^{t}\|^{2}. Replacing 2(xt+1xt)T(f(xt)g^t)2({\mathbf{x}}^{t+1}-{\mathbf{x}}^{t})^{T}(\nabla f({\mathbf{x}}^{t})-{\hat{\mathbf{g}}}^{t}) in (B) by its upper bound ηxt+1xt2+(1/η)f(xt)g^t2\eta\|{\mathbf{x}}^{t+1}-{\mathbf{x}}^{t}\|^{2}+(1/\eta)\|\nabla f({\mathbf{x}}^{t})-{\hat{\mathbf{g}}}^{t}\|^{2} yields

Substituting the simplification in (77) into (76) yields

Considering the strong convexity of function ff with constant μ\mu we can write f(xt)f(x)22μ(f(xt)f(x)f(x)T(xtx))\left\|{\nabla}f({\mathbf{x}}^{t})-\nabla f({\mathbf{x}}^{*})\right\|^{2}\geq 2\mu\left(f({\mathbf{x}}^{t})-f({\mathbf{x}}^{*})-\nabla f({\mathbf{x}}^{*})^{T}({\mathbf{x}}^{t}-{\mathbf{x}}^{*})\right). substituting the squared norm f(xt)f(x)2\left\|{\nabla}f({\mathbf{x}}^{t})-\nabla f({\mathbf{x}}^{*})\right\|^{2} by this lower bound in (78) follows

C Proof of Lemma 5

Given the information until time tt, each auxiliary vector yn,it+1{\mathbf{y}}_{n,i}^{t+1} is a random variable that takes values yn,it{\mathbf{y}}_{n,i}^{t} and xnt{\mathbf{x}}_{n}^{t} with associated probabilities 11/qn1-1/q_{n} and 1/qn1/q_{n}, respectively. This observation holds since with probability 1/qn1/q_{n} node nn may choose index ii to update at time t+1t+1 and with probability 1(1/qn)1-(1/q_{n}) choose other indices. Therefore, we can write

For the simplicity of equations let us define sequence pntp_{n}^{t} as

We proceed to find and upper bound for the terms in the right hand side of (83). First note that according to the strong convexity of instantaneous functions fn,if_{n,i} and fnf_{n} both terms in the right hand side of (83) are non-negative. Observing that the number of instantaneous functions at each node qnq_{n} satisfies the condition qminqnqmaxq_{\min}\leq q_{n}\leq q_{\max}, we obtain

Now observe that according to the definitions of sequences ptp^{t} and pntp_{n}^{t} in (33) and (82), respectively, ptp^{t} is the sum of pntp_{n}^{t} for all nn, i.e. pt=n=1Npntp^{t}=\sum_{n=1}^{N}p_{n}^{t}. Therefore, we can rewrite (85) as

D Proof of Theorem 6

To prove the result of Theorem 6 first we prove the following Lemma to establish an upper bound for vtv2.\|{\mathbf{v}}^{t}-{\mathbf{v}}^{*}\|^{2}.

Proof Consider the basic inequality a+b22a2+2b2\|{\mathbf{a}}+{\mathbf{b}}\|^{2}\leq 2\|{\mathbf{a}}\|^{2}+2\|{\mathbf{b}}\|^{2} for the case that a=U(vt+1v){\mathbf{a}}={\mathbf{U}}({\mathbf{v}}^{t+1}-{\mathbf{v}}^{*}), b=U(vtvt+1){\mathbf{b}}={\mathbf{U}}({\mathbf{v}}^{t}-{\mathbf{v}}^{t+1}) which can be written as

Note that according to the fact that both vt{\mathbf{v}}^{t} and v{\mathbf{v}}^{*} lie in the column space of matrix U{\mathbf{U}} we obtain U(vtv)2γvtv2\|{\mathbf{U}}({\mathbf{v}}^{t}-{\mathbf{v}}^{*})\|^{2}\geq\gamma^{\prime}\|{\mathbf{v}}^{t}-{\mathbf{v}}^{*}\|^{2}. Substituting this lower bound for U(vtv)2\|{\mathbf{U}}({\mathbf{v}}^{t}-{\mathbf{v}}^{*})\|^{2} in (D) and multiplying both sides of the imposed inequality by γ\gamma^{\prime} yield

Using the result in Lemma 10 we show linear convergence of the sequence utuG2+c pt\|{\mathbf{u}}^{t}-{\mathbf{u}}^{*}\|_{\mathbf{G}}^{2}+c\ p^{t} as follows.

Proof of Theorem 6: Proving the linear convergence claim in (40) is equivalent to showing that

Further, substitute the squared norm xtx2\|{\mathbf{x}}^{t}-{\mathbf{x}}^{*}\|^{2} by the upper bound (2/μ)(f(xt)f(x)f(x)T(xtx))(2/\mu)(f({\mathbf{x}}^{t})-f({\mathbf{x}}^{*})-\nabla f({\mathbf{x}}^{*})^{T}({\mathbf{x}}^{t}-{\mathbf{x}}^{*})) to obtain

Replacing utuG2\|{\mathbf{u}}^{t}-{\mathbf{u}}^{*}\|_{\mathbf{G}}^{2} in (D) by the upper bound (D) and regrouping the terms lead to

Notice that if the inequality in (D) holds true, then the relation in (D) is valid and as we mentioned before the claim in (93) holds. To verify the sum in the right hand side of (D) is always positive and the inequality is valid, we enforce each summands in the right hand side of (D) to be non-negative. Therefore, the following conditions should be satisfied

Recall that γ\gamma is the smallest eigenvalue of positive definite matrix Z{\mathbf{Z}}. All the inequalities in (D) are satisfied, if δ\delta is chosen as

where η\eta, cc and α\alpha are selected from the intervals

Notice that considering the conditions for the variables η\eta, α\alpha and cc in (100), the constant δ\delta in (D) is strictly positive δ>0\delta>0. Moreover, according to the definition in (D) the constant δ\delta is smaller than γ/2Γ\gamma^{\prime}/2\Gamma^{\prime} which leads to the conclusion that δ1/2<1\delta\leq 1/2<1. Therefore, we obtain that 0<δ<10<\delta<1 and the claim in (40) is valid.

E Proof of Theorem 8

The proof uses the relationship in the statement (40) of Theorem 6 to build a supermartingale sequence. To do this define the stochastic processes ζt\zeta^{t} and βt\beta^{t} as

Note that the stochastic processes ζt\zeta^{t} and βt\beta^{t} are alway non-negative. Let now Ft{\mathcal{F}}_{t} be a sigma-algebra measuring ζt\zeta^{t}, βt\beta^{t}, and ut{\mathbf{u}}^{t}. Considering the definitions of ζt\zeta^{t} and βt\beta^{t} and the relation in (40) we can write

Since the sequences αt\alpha^{t} and βt\beta^{t} are nonnegative it follows from (102) that they satisfy the conditions of the supermartingale convergence theorem – see e.g. theorem E7.47.4 Solo and Kong (1995) . Therefore, we obtain that: (i) The sequence ζt\zeta^{t} converges almost surely. (ii) The sum t=0βt<\sum_{t=0}^{\infty}\beta^{t}<\infty is almost surely finite. The definition of βt\beta^{t} in (101) implies that

Observing the fact that δ\delta and γ\gamma are positive constants, we can conclude from (104) that the sequence xtx2\|{\mathbf{x}}^{t}-{\mathbf{x}}^{*}\|^{2} is almost surely summable and the it converges with probability 1 to null at least in the order of O(1/t){\mathcal{O}}(1/t). Almost sure convergence of sequence to null follows the claim in (43).

References