Communication-Efficient Algorithms for Decentralized and Stochastic Optimization

Guanghui Lan, Soomin Lee, Yi Zhou

Introduction

Decentralized optimization problems defined over complex multiagent networks are ubiquitous in signal processing, machine learning, control, and other areas in science and engineering (see e.g. rabbat ; con01 ; ram_info ; Durham-Bullo ). In this paper, we consider the following decentralized optimization problem which is cooperatively solved by the network of $m$ agents:

In this paper, we also consider the situation where one can only have access to noisy first-order information (function values and subgradients) of the functions $f_{i}$ , $i=1,\ldots,m$ (see NJLS09-1 ; Lan10-3 ). This happens, for example, when the function $f_{i}$ ’s are given in the form of expectation, i.e.,

where $l\geq 1$ is some large number. Stochastic optimization problem of this type has great potential of applications in data analysis, especially in machine learning. In particular, problem (1.3) corresponds to the minimization of generalized risk and is particularly useful for dealing with online (streaming) data distributed over a network, while problem (1.4) aims at the collaborative minimization of empirical risk. Currently the dominant approach is to collect all agents’ private data on a server (or cluster) and to apply centralized machine learning techniques. However, this centralization scheme would require agents to submit their private data to the service provider without much control on how the data will be used, in addition to incurring high setup cost related to the transmission of data to the service provider. Decentralized optimization provides a viable approach to deal with these data privacy related issues.

In these decentralized and stochastic optimization problems, each network agent $i$ is associated with the local objective function $f_{i}(x)$ and all agents intend to cooperatively minimize the system objective $f(x)$ as the sum of all local objective $f_{i}$ ’s in the absence of full knowledge about the global problem and network structure. A necessary feature in decentralized optimization is, therefore, that the agents must communicate with their neighboring agents to propagate the distributed information to every location in the network.

One of the most well-studied techniques in decentralized optimization are the subgradient based methods (see e.g., Nedic2009 ; Nedic11 ; tsianos-pushsum ; ANAO ; Duchi-DDA ; Lobel2011 ; WYin-Extra ), where at each step a local subgradient is taken at each node, followed by the communication with neighboring agents. Although the subgradient computation at each step can be inexpensive, these methods usually require lots of iterations until convergence. Considering that one iteration in decentralized optimization is equivalent to one communication round among agents, this can incur a significant latency. CPUs in these days can read and write the memory at over $10$ GB per second whereas communication over TCP/IP is about $10$ MB per second. Therefore, the gap between intra-node computation and inter-node communication is about 3 orders of magnitude. The communication start-up cost itself is also not negligible as it usually takes a few milliseconds.

Another well-known type of decentralized algorithm relies on dual methods (see e.g., Boyd-ADMM ; Wei-admm ; perturbedPD ), where at each step for a fixed dual variable, the primal variables are solved to minimize some local Lagrangian related function, then the dual variables associated with the consistency constraints are updated accordingly. Although these dual type methods usually require fewer numbers of iterations (hence, fewer communication rounds) than the subgradient methods until convergence, one crucial problem of these methods is that the local subproblem associated with each agent cannot be solved efficiently in many cases.

The main goal of this paper is, therefore, to develop dual based decentralized algorithms for solving (1.1) that is communication efficient and has local subproblems easily solved by each agent through the utilization of (noisy) first-order information of $f_{i}$ . More specifically, we will provide a theoretical understanding on how many numbers of inter-node communications and intra-node (stochastic) subgradient evaluations of $f_{i}$ are required in order to find a certain approximate solution of (1.1).

2 Problem Formulation

Consider a multiagent network system whose communication is governed by an undirected graph $\mathcal{G}=(\mathcal{N},\mathcal{E})$ , where $\mathcal{N}=[m]$ indexes the set of agents, and $\mathcal{E}\subseteq\mathcal{N}\times\mathcal{N}$ represents the pairs of communicating agents. If there exists an edge from agent $i$ to $j$ which we denote by $(i,j)$ , agent $i$ may send its information to agent $j$ and vice versa. Thus, each agent $i\in\mathcal{N}$ can directly receive (resp., send) information only from (resp., to) the agents in its neighborhood

We consider a reformulation of problem (1.1) which will be used in the development of our decentralized algorithms. We introduce an individual copy $x_{i}$ of the decision variable $x$ for each agent $i\in\mathcal{N}$ and impose the constraint $x_{i}=x_{j}$ for all pairs $(i,j)\in\mathcal{E}$ . The transformed problem can be written compactly by using the Laplacian matrix $L$ :

Under Assumption 1, problem (1.1) and (1.10) are equivalent. We let Assumption 1 be a blanket assumption for the rest of the paper.

We next consider a reformulation of the problem (1.10) as a saddle point problem. By the method of Lagrange multipliers, problem (1.10) is equivalent to the following saddle point problem:

3 Literature review

Decentralized optimization has been extensively studied in recent years due to the emergence of large-scale networks. The seminal work on distributed optimization Tsiphd ; Tsi1986 has been followed by distributed incremental (sub)gradient methods and proximal methods AN2001 ; Ram:2009 ; Bertsekas-IP ; Wang-Bertsekas , and more recently the incremental aggregated gradient methods and its proximal variants Parrilo-aggregated ; Bertsekas-aggregated ; GLYZ . All of these incremental methods are not fully decentralized in a sense that they require a special star network topology in which the existence of a central authority is necessary for operation.

To consider a more general network topology, a decentralized subgradient algorithm was first proposed in Nedic2009 , and further studied in many other literature (see e.g. Duchi-DDA ; Martinez-PD ; Nedic11 ; ANAO ; tsianos2012consensus ). These algorithms are intuitive and simple but very slow due to the fact that they need to use diminishing stepsize rules. All of these methods require ${\cal O}(1/\epsilon^{2})$ inter-node communications and intra-node gradient computations in order to obtain an $\epsilon$ -optimal solution. First-order algorithms by Shi et. al. WYin-Extra ; WYin-PGExtra use constant stepsize rules with backtracking and require ${\cal O}(1/\epsilon)$ communications when the objective function in (1.1) is a relatively simple convex function, but require both smoothness and strong convexity in order to achieve a linear convergence rate. Recently, it has been shown in NaLi-Harness ; Wilbur-TV that the linear rate of convergence can be obtained for minimizing “unconstrained” smooth and strongly convex problems. These methods do not apply to general nonsmooth and stochastic optimization problems to be studied in this work.

Another well-known type of decentralized algorithm is based on dual methods including the distributed dual decomposition dual-decompos and decentralized alternating direction method of multipliers (ADMM) Wilbur-ADMM ; ADMM-linear ; Wei-admm . The decentralized ADMM Wilbur-ADMM ; ADMM-linear has been shown to require ${\cal O}(\log 1/\epsilon)$ communications in order to obtain an $\epsilon$ -optimal solution under the no constraint, strong convexity and smoothness assumptions while Wei-admm has been shown to require ${\cal O}(1/\epsilon)$ communications for relatively simple convex functions $f_{i}$ (see also HeJudNem15-1 for the application of mirror-prox method for solving these problems). These dual-based methods have been further studied via proximal-gradient InexactConsensus ; Chang-Stochastic . However, the local Lagrangian minimization problem associated with each agent cannot be solved efficiently in many cases, especially when the problem is constrained. Second-order approximation methods Ribeiro-DQM ; Ribeiro-second have been studied in order to handle this issue, but due to the nature of these methods differentiability of the objective function is necessary in this case.

There exist some distributed methods that just assume smoothness on the objective functions, but actually require more communication rounds than gradient computations. For example, the distributed Nesterov’s accelerated gradient method Jakovetic-Fast employs multi-consensus in the inner-loop. Although their method requires ${\cal O}(1/\sqrt{\epsilon})$ intra-node gradient computations, inter-node communications must increase at a rate of ${\cal O}(\log(k))$ as the iteration $k$ increases. Similarly, the proximal gradient method with adapt-then-combine (ATC) multi-consensus strategy and Nesterov’s acceleration under the assumption of bounded and Lipschitz gradients Chen-fastProx is shown to have ${\cal O}(1/\sqrt{\epsilon})$ intra-node gradient computations, but inter-node communications must increase at a rate of ${\cal O}(k)$ . Due to the nature of decentralized networked systems, the time required for inter-node communications is higher by a few orders of magnitude than that for intra-node computations. Multi-consensus schemes in nested loop algorithms do not account for this feature of networked systems and hence are less desirable.

Decentralized stochastic optimization methods can be useful when the noisy gradient information of the function $f_{i}$ , $i=1,\ldots,m$ , in (1.1) is only available or easier to compute. Stochastic first-order methods for problem (1.1) are studied in Duchi-DDA ; Ram2010 ; Nedic11 , all of which require ${\cal O}(1/\epsilon^{2})$ inter-node communications and intra-node gradient computations to obtain an $\epsilon$ -optimal solution. Multiagent mirror descent method for decentralized stochastic optimization Rabbat-SMD showed a $\mathcal{O}(1/\epsilon)$ complexity bound when the objective functions are strongly convex. An alternative form of mirror descent in the multiagent setting was proposed by Khan-MD with an asymptotic convergence result. On a broader scale, decentralized stochastic optimization was also considered in the case of time-varying objective functions in the recent work DistStoch ; Rabbat-online . All these previous works in decentralized stochastic optimization suffered from high communication costs due to the coupled scheme for stochastic subgradient evaluation and communication, i.e., each evaluation of stochastic subgradient will incur one round of communication.

4 Contribution of the paper

The main interest of this paper is to develop communication efficient decentralized algorithms for solving problem (1.10) in which $f_{i}$ ’s are convex or strongly convex, but not necessarily smooth, and the local subproblem associated with each agent is nontrivial to solve. Our contributions in this paper are listed below.

Firstly, we propose a decentralized primal-dual framework which involves only two inter-node communications per iteration. The proposed method can find an $\epsilon$ -optimal solution both in terms of the primal optimality gap and feasibility residual in $\mathcal{O}(1/\epsilon)$ communication rounds when the objective functions are convex, and the local proximal projection subproblems can be solved exactly. This algorithm serves as a benchmark in terms of the communication cost for our subsequent development.

Secondly, we introduce a new decentralized primal-dual type method, called decentralized communication sliding (DCS), where the agents can skip communications while solving their local subproblems iteratively through successive linearizations of their local objective functions. We show that agents can still find an $\epsilon$ -optimal solution in $\mathcal{O}(1/\epsilon)$ (resp., $\mathcal{O}(1/\sqrt{\epsilon})$ ) communication rounds while maintaining the $\mathcal{O}(1/\epsilon^{2})$ (resp., $\mathcal{O}(1/\epsilon)$ ) bound on the total number of intra-node subgradient evaluations when the objective functions are general convex (resp., strongly convex). The bounds on the subgradient evaluations are actually comparable to those optimal complexity bounds required for centralized nonsmooth optimization under certain conditions on the target accuracy, and hence are not improvable in general.

Thirdly, we present a stochastic decentralized communication sliding method, denoted by SDCS, for solving stochastic optimization problems and show complexity bounds similar to those of DCS on the total number of required communication rounds and stochastic subgradient evaluations. In particular, only $\mathcal{O}(1/\epsilon)$ (resp., $\mathcal{O}(1/\sqrt{\epsilon})$ ) communication rounds are required while agents perform up to $\mathcal{O}(1/\epsilon^{2})$ (resp., $\mathcal{O}(1/\epsilon)$ ) stochastic subgradient evaluations for general convex (resp., strongly convex) functions. Only requiring the access to stochastic subgradient at each iteration, SDCS is particularly efficient for solving problems with $f_{i}$ given in the form of (1.3) and (1.4). In the former case, SDCS requires only one realization of the random variable at each iteration and provides a communication-efficient way to deal with streaming data and decentralized machine learning. In the latter case, each iteration of SDCS requires only one randomly selected component, leading up to a factor of ${\cal O}(l)$ savings on the total number of subgradient computations over DCS.

To the best of our knowledge, this is the first time that these communication sliding algorithms, and the aforementioned separate complexity bounds on communication rounds and (stochastic) subgradient evaluations are presented in the literature.

5 Organization of the paper

This paper is organized as follows. In Section 2, we provide some preliminaries on distance generating functions and prox-functions, as well as the definition of gap functions, which will be used as termination criteria of our primal-dual methods. In Section 3, we present a new decentralized primal-dual method for solving problem (1.11). In Section 4, we present the communication sliding algorithms when the exact subgradients of $f_{i}$ ’s are available and establish their convergence properties for the general and strongly convex case. In Section 5, we generalize the algorithms in Section 4 for stochastic problems. The proofs of the lemmas in Section 3-5 are provided in Section 6. Finally, we provide some concluding remarks in Section 7.

Preliminaries

In this section, we provide a brief review on the prox-function, and define appropriate gap functions which will be used for the convergence analysis and termination criteria of our primal-dual algorithms.

In this subsection, we define the concept of prox-function, which is also known as proximity control function or Bregman distance function BREGMAN1967 . Prox-function has played an important role in the recent development of first-order methods for convex programming as a substantial generalization of the Euclidean projection. Unlike the standard projection operator $\Pi_{U}[x]:={\rm argmin}_{u\in U}\|x-u\|^{2}$ , which is inevitably tied to the Euclidean geometry, prox-function can be flexibly tailored to the geometry of a constraint set $U$ .

The prox-function, or Bregman distance function, induced by $\omega$ is given by

It then follows from the strong convexity of $\omega$ that

We now assume that the individual constraint set $X_{i}$ for each agent in problem (1.1) are equipped with norm $\|\cdot\|_{X_{i}}$ , and their associated prox-functions are given by $V_{i}(\cdot,\cdot)$ . Moreover, we assume that each $V_{i}(\cdot,\cdot)$ shares the same strongly convex modulus $\nu=1$ , i.e.,

We define the norm associated with the primal feasible set $X^{m}=X_{1}\times\ldots\times X_{m}$ of (1.11) as follows: We can define the norm associated with $X^{m}$ in a more general way, e.g., $\|\mathbf{x}\|^{2}:=\textstyle{\sum}_{i=1}^{m}p_{i}\|x_{i}\|_{X_{i}}^{2},\ \forall\mathbf{x}=(x_{1},\ldots,x_{m})\in X^{m},$ for some $p_{i}>0$ , $i=1,\ldots,m$ . Accordingly, the prox-function $\mathbf{V}(\cdot,\cdot)$ can be defined as $\mathbf{V}(\mathbf{x},\mathbf{u}):=\textstyle{\sum}_{i=1}^{m}p_{i}V_{i}(x_{i},u_{i}),\ \forall\mathbf{x},\mathbf{u}\in X^{m}.$ This setting gives us flexibility to choose $p_{i}$ ’s based on the information of individual $X_{i}$ ’s, and the possibility to further refine the convergence results.

where $\mathbf{x}=(x_{1},\ldots,x_{m})\in X^{m}$ for any $x_{i}\in X_{i}$ . Therefore, the corresponding prox-function $\mathbf{V}(\cdot,\cdot)$ can be defined as

Note that by (2.3) and (2.4), it can be easily seen that

2 Gap Functions: Termination Criteria

Given a pair of feasible solutions $\mathbf{z}=(\mathbf{x},\mathbf{y})$ and $\bar{\mathbf{z}}=(\bar{\mathbf{x}},\bar{\mathbf{y}})$ of (1.11), we define the primal-dual gap function $Q(\mathbf{z};\bar{\mathbf{z}})$ by

measures the accuracy of the approximate solution $\mathbf{z}$ to the saddle point problem (1.11).

However, the saddle point formulation (1.11) of our problem of interest (1.1) may have an unbounded feasible set. We adopt the perturbation-based termination criterion by Monteiro and Svaiter Monteiro01 ; Monteiro02 ; Monteiro03 and propose a modified version of the gap function in (2.8). More specifically, we define

This perturbed gap function allows us to bound the objective function value and the feasibility separately. We first define the following terminology.

A point $\mathbf{x}\in X^{m}$ is called an $(\epsilon,\delta)$ -solution of (1.10) if

We say that $\mathbf{x}$ has primal residual $\epsilon$ and feasibility residual $\delta$ .

In the following proposition, we adopt a result from (GL-AADMM, , Proposition 2.1) to describe the relationship between the perturbed gap function (2.9) and the approximate solutions to problem (1.10). Although the proposition was originally developed for deterministic cases, the extension of this to stochastic cases is straightforward.

Decentralized Primal-Dual

In this section, we describe an algorithmic framework for solving the saddle point problem (1.11) in a decentralized fashion. The basic scheme of the decentralized primal-dual method in Algorithm 1 is similar to Chambolle and Pork’s primal-dual method in Chambolle-PD . The primal-dual method in Chambolle-PD is an efficient and simple method for solving saddle point problems, which can be viewed as a refined version of the primal-dual hybrid gradient method by Arrow et al. arrow1958studies . However, its design and analysis is more closely related to a few recent important works which established the ${\cal O}(1/k)$ rate of convergence for solving bilinear saddle point problems (e.g., Nest05-1 ; Nem05-1 ; MonSva10-1 ; he2012on ). Recently, Chen, Lan and Ouyang CheLanOu13-1 incorporated Bregman distance into the primal-dual method together with an acceleration step. Dang and Lan Dang-Lan , and Chambolle and Pork ChamPoc14-1 discussed improved algorithms for problems with strongly convex primal or dual functions. Randomized versions of the primal-dual method have been discussed by Zhang and Xiao Yuchen14 , and Dang and Lan Dang-Lan . Lan and Zhou GLYZ revealed some inherent relationship between Nesterov’s accelerated gradient method and the primal-dual method, and presented an optimal randomized incremental gradient method.

Our main goals here in this section are to: 1) adapt the primal-dual framework for a decentralized setting; and 2) provide complexity results (number of communication rounds and subgradient computations) separately in terms of primal functional optimality gap and constraint (or consistency) violation. It should be stressed that the main contributions of this paper exist in the development of decentralized communication sliding algorithms (see Section 4 and 5). However, introducing the basic decentralized primal-dual method here will help us better explain these methods and provide us with a certain benchmark in terms of the communication cost.

The primal-dual algorithm in Algorithm 1 can be decentralized due to the structure of the Laplacian $\mathbf{L}$ . Recalling that $\mathbf{x}=(x_{1},\ldots,x_{m})$ and $\mathbf{y}=(y_{1},\ldots,y_{m})$ , each agent $i$ ’s local update rule can be separately written as in Algorithm 2. Each agent $i$ maintains two local sequences, namely, the primal estimates $\{x_{i}^{k}\}$ and the dual variables $\{y_{i}^{k}\}$ . The element $x_{i}^{k}$ can be seen as agent $i$ ’s estimate of the decision variable $x$ at time $k$ , while $y_{i}^{k}$ is a subvector of all dual variables $\mathbf{y}^{k}$ associated with the agent $i$ ’s consistency constraints with its neighbors.

2 Convergence of the Decentralized Primal-dual Method

For the sake of simplicity, we focus only on the case when $f_{i}$ ’s are general convex functions in this section. We leave the discussion about the case when $f_{i}$ ’s are strongly convex later in Sections 4 and 5 for decentralized communication sliding algorithms.

In the following lemma, we present estimates on the gap function defined in (2.7) together with conditions on the parameters $\{\alpha_{k}\}$ , $\{\tau_{k}\}$ , $\{\eta_{k}\}$ and $\{\theta_{k}\}$ , which will be used to provide the rate of convergence for the decentralized primal-dual method. The proof of this lemma can be found in Section 6.

Let the iterates $\mathbf{z}^{k}=(\mathbf{x}^{k},\mathbf{y}^{k})$ , $k=1,\ldots,N$ be generated by Algorithm 1 and $\bar{\mathbf{z}}^{N}$ be defined as $\bar{\mathbf{z}}^{N}:=\left(\textstyle{\sum}_{k=1}^{N}\theta_{k}\right)^{-1}\textstyle{\sum}_{k=1}^{N}\theta_{k}\mathbf{z}^{k}$ . Assume that the parameters $\{\alpha_{k}\}$ , $\{\tau_{k}\}$ , $\{\eta_{k}\}$ and $\{\theta_{k}\}$ in Algorithm 1 satisfy

where $Q$ is defined in (2.7) and $\mathbf{s}$ is defined as

Furthermore, for any saddle point $(\mathbf{x}^{*},\mathbf{y}^{*})$ of (1.11), we have

In the following theorem, we provide a specific selection of $\{\alpha_{k}\}$ , $\{\tau_{k}\}$ , $\{\eta_{k}\}$ and $\{\theta_{k}\}$ satisfying (3.9)-(3.14). Using Lemma 1 and Proposition 1, we also establish the complexity of the decentralized primal-dual method for computing an $(\epsilon,\delta)$ -solution of problem (1.10).

Let $\mathbf{x}^{*}$ be a saddle point of (1.10), and suppose that $\{\alpha_{k}\}$ , $\{\tau_{k}\}$ , $\{\eta_{k}\}$ and $\{\theta_{k}\}$ are set to

where $\bar{\mathbf{x}}^{N}=\tfrac{1}{N}\textstyle{\sum}_{k=1}^{N}\mathbf{x}^{k}$ .

It is easy to check that (3.18) satisfies conditions (3.9)-(3.14). Therefore, by plugging these values in (3.15), we have

Letting $\mathbf{s}^{N}:=\tfrac{1}{N}\mathbf{s}$ , then from (3.16) and (3.17) we have

The results in (3.19) and (3.20) then immediately follow from Proposition 1 and the above two inequalities.

From (3.19)-(3.20), we can see that the complexity of decentralized primal-dual method for computing an $(\epsilon,\delta)$ -solution is ${\cal O}(1/\epsilon)$ for the primal functional optimality and ${\cal O}(1/\delta)$ for the constraint violation. Since each iteration involves a constant number of communication rounds, the number of inter-node communications required is also in the same order.

Decentralized Communication Sliding

In this section, we present a new decentralized primal-dual type method, namely, the decentralized communication sliding (DCS) method for the case when the primal subproblem (3.8) is not easy to solve. We show that one can still maintain the same number of inter-node communications even when the subproblem (3.8) is approximately solved through an iterative subgradient descent procedure, and that the total number of required subgradient evaluations is comparable to centralized mirror descent methods. Throughout this section, we consider the deterministic case where exact subgradients of $f_{i}$ ’s are available.

A few remarks about this algorithm are in order. Firstly, a critical difference of this routine compared to the exact version (Algorithm 2) is that one needs to compute a pair of approximate solutions $x_{i}^{k}$ and $\hat{x}_{i}^{k}$ . While both $x_{i}^{k}$ and $\hat{x}_{i}^{k}$ can be seen as agent $i$ ’s estimate of the decision variable $x$ at time $k$ , $x_{i}^{k}$ will be used to define the subproblem (4.7) for the next call to the CS procedure and $\hat{x}_{i}^{k}$ will be used to produce a weighted sum of all the inner loop iterates. Secondly, since the same $w_{i}^{k}$ has been used throughout the $T_{k}$ iterations of the CS procedure, no additional communications of the dual variables are required when performing the subgradient projection step (4.7) for $T_{k}$ times. This differs from the accelerated gradient methods in Chen-fastProx ; Jakovetic-Fast where the number of inter-node communications at each iteration $k$ increase linearly or sublinearly in the order of $k$ .

Note that the results of the CS procedure at iteration $k$ for agents $i\in\mathcal{N}$ collectively generate a pair of approximate solutions $\hat{\mathbf{x}}^{k}=(\hat{x}_{1}^{k},\ldots,\hat{x}_{m}^{k})$ and $\mathbf{x}^{k}=(x_{1}^{k},\ldots,x_{m}^{k})$ to the proximal projection subproblem (3.3). For later convenience, we refer to the subproblem at iteration $k$ as $\Phi^{k}(\mathbf{x})$ , i.e.,

2 Convergence of DCS on General Convex Functions

We now establish the main convergence properties of the DCS algorithm. More specifically, we provide in Lemma 2 an estimate on the gap function defined in (2.7) together with stepsize policies which work for the general nonsmooth convex case with $\mu=0$ (cf. (1.2)). The proof of this lemma can be found in Section 6.

Let the iterates $(\hat{\mathbf{x}}^{k},\mathbf{y}^{k})$ , $k=1,\ldots,N$ be generated by Algorithm 3 and $\hat{\mathbf{z}}^{N}$ be defined as $\hat{\mathbf{z}}^{N}:=\left(\textstyle{\sum}_{k=1}^{N}\theta_{k}\right)^{-1}\textstyle{\sum}_{k=1}^{N}\theta_{k}(\hat{\mathbf{x}}^{k},\mathbf{y}^{k})$ . Assume that the objective $f_{i}$ , $i=1,\ldots,m$ , are general nonsmooth convex functions, i.e., $\mu=0$ and $M>0$ . Let the parameters $\{\alpha_{k}\}$ , $\{\theta_{k}\}$ , $\{\eta_{k}\}$ , $\{\tau_{k}\}$ and $\{T_{k}\}$ in Algorithm 3 satisfy (3.10)-(3.14) and

Let the parameters $\{\lambda_{t}\}$ and $\{\beta_{t}\}$ in the CS procedure of Algorithm 3 be set to

where $\hat{\mathbf{s}}:=\theta_{N}\mathbf{L}(\hat{\mathbf{x}}^{N}-\mathbf{x}^{N-1})+\theta_{1}\tau_{1}(\mathbf{y}^{N}-\mathbf{y}^{0})$ and $Q$ is defined in (2.7). Furthermore, for any saddle point $(\mathbf{x}^{*},\mathbf{y}^{*})$ of (1.11), we have

In the following theorem, we provide a specific selection of $\{\alpha_{k}\}$ , $\{\theta_{k}\}$ , $\{\eta_{k}\}$ , $\{\tau_{k}\}$ and $\{T_{k}\}$ satisfying (3.10)-(3.14) and (4.10). Using Lemma 2 and Proposition 1, we also establish the complexity of the DCS method for computing an $(\epsilon,\delta)$ -solution of problem (1.10) when the objective functions are general convex.

Let $\mathbf{x}^{*}$ be an optimal solution of (1.10), the parameters $\{\lambda_{t}\}$ and $\{\beta_{t}\}$ in the CS procedure of Algorithm 3 be set to (4.11), and suppose that $\{\alpha_{k}\}$ , $\{\theta_{k}\}$ , $\{\eta_{k}\}$ , $\{\tau_{k}\}$ and $\{T_{k}\}$ are set to

where $\hat{\mathbf{x}}^{N}=\tfrac{1}{N}\textstyle{\sum}_{k=1}^{N}\hat{\mathbf{x}}^{k}$ .

It is easy to check that (4.14) satisfies conditions (3.10)-(3.14) and (4.10). Particularly,

Therefore, by plugging in these values to (4.12), we have

Letting $\hat{\mathbf{s}}^{N}=\tfrac{1}{N}\hat{\mathbf{s}}$ , then from (4.13), we have

Applying Proposition 1 to the above two inequalities, the results in (4.15) and (4.16) follow immediately.

respectively. In particular, if $\epsilon$ and $\delta$ satisfy

then the previous two complexity bounds in (4.19), respectively, reduce to

Thirdly, it is interesting to compare DCS with the centralized mirror descent method nemyud:83 applied to (1.1). In the worst case, the Lipschitz constant of $f$ in (1.1) can be bounded by $M_{f}\leq mM$ , and each iteration of the method will incur $m$ subgradient evaluations. Hence, the total number of subgradient evaluations performed by the mirror descent method for finding an $\epsilon$ -solution of (1.1), i.e., a point $\bar{x}\in X$ such that $f(\bar{x})-f^{*}\leq\epsilon$ , can be bounded by

where ${\cal D}_{X}^{2}$ characterizes the diameter of $X$ , i.e., ${\cal D}_{X}^{2}:=\max_{x_{1},x_{2}\in X}V(x_{1},x_{2})$ . Noting that ${\cal D}_{X}^{2}/{\cal D}_{X^{m}}^{2}={\cal O}(1/m)$ , and that the second bound in (4.21) states only the number of subgradient evaluations for each agent in the DCS method, we conclude that the total number of subgradient evaluations performed by DCS is comparable to the classic mirror descent method as long as (4.20) holds and hence not improvable in general.

In this subsection, we will provide a bound on the optimal dual multiplier $\mathbf{y}^{*}$ . By doing so, we show that the complexity of DCS algorithm (as well as the stochastic DCS algorithm in Section 5) only depends on the parameters for the primal problem along with the smallest singular value of $\mathbf{L}$ and the initial point $\mathbf{y}^{0}$ , even though these algorithms are intrinsically primal-dual type methods.

Let $\mathbf{x}^{*}$ be an optimal solution of (1.10). Then there exists an optimal dual multiplier $\mathbf{y}^{*}$ for (1.11) s.t.

Since we only relax the linear constraints in problem (1.10) to obtain the Lagrange dual problem (1.11), it follows from the strong Lagrange duality and the existence of $\mathbf{x}^{*}$ to (1.10) that an optimal dual multiplier $\mathbf{y}^{*}$ for problem (1.11) must exist. It is clear that

where $\mathbf{y}^{*}_{N}$ and $\mathbf{y}^{*}_{C}$ denote the projections of $\mathbf{y}^{*}$ over the null space and the column space of $\mathbf{L}^{T}$ , respectively.

Case 2) $\mathbf{y}^{*}_{C}\neq\mathbf{0}$ . Using the fact that $\mathbf{L}^{T}\mathbf{y}^{*}=\mathbf{L}^{T}\mathbf{y}^{*}_{C}$ and the definition of a saddle point of (1.11), we conclude that $\mathbf{y}^{*}_{C}$ is also an optimal dual multiplier of (1.11). Since $\mathbf{y}^{*}_{C}$ is in the column space of $\mathbf{L}$ , we have

Moreover, if we denote the saddle point problem defined in (1.11) as follows:

By the definition of a saddle point of (1.11), we have ${\cal L}(\mathbf{x}^{*},\mathbf{y}^{*}_{C})\leq{\cal L}(\mathbf{x},\mathbf{y}^{*}_{C})$ , i.e.,

Hence, from the definition of subgradients, we conclude that $-\mathbf{L}^{T}\mathbf{y}^{*}_{C}\in\partial F(\mathbf{x}^{*})$ , which together with the fact that $F(\cdot)$ is Lipschitz continuous implies that

Our result in (4.23) follows immediately from the above relation, (4.24) and the fact that $\mathbf{y}^{*}_{C}$ is also an optimal dual multiplier of (1.11).

Observe that our bound for the dual multiplier $\mathbf{y}^{*}$ in (4.23) contains only the primal information. Given an initial dual multiplier $\mathbf{y}^{0}$ , this result can be used to provide an upper bound on $\|\mathbf{y}^{0}-\mathbf{y}^{*}\|$ in Theorems 3.1-5.2 throughout this paper. Note also that we can assume $\mathbf{y}^{0}=0$ to simplify these complexity bounds.

4 Convergence of DCS on Strongly Convex Functions

In this subsection, we assume that the objective functions $f_{i}$ ’s are strongly convex (i.e., $\mu>0$ (1.2)). In order to take advantage of the strong convexity of the objective functions, we assume that the prox-functions $V_{i}(\cdot,\cdot)$ , $i=1,\dots,m$ , (cf. (2.2)) are growing quadratically with the quadratic growth constant $\mathcal{C}$ , i.e., there exists a constant $\mathcal{C}>0$ such that

By (2.3), we must have $\mathcal{C}\geq 1$ .

We next provide in Lemma 3 an estimate on the gap function defined in (2.7) together with stepsize policies which work for the strongly convex case. The proof of this lemma can be found in Section 6.

Let the iterates $(\hat{\mathbf{x}}^{k},\mathbf{y}^{k})$ , $k=1,\ldots,N$ be generated by Algorithm 3 and $\hat{\mathbf{z}}^{N}$ be defined as $\hat{\mathbf{z}}^{N}:=\left(\textstyle{\sum}_{k=1}^{N}\theta_{k}\right)^{-1}\textstyle{\sum}_{k=1}^{N}\theta_{k}(\hat{\mathbf{x}}^{k},\mathbf{y}^{k})$ . Assume the objective $f_{i}$ , $i=1,\ldots,m$ are strongly convex functions, i.e., $\mu,M>0$ . Let the parameters $\{\alpha_{k}\}$ , $\{\theta_{k}\}$ , $\{\eta_{k}\}$ and $\{\tau_{k}\}$ in Algorithm 3 satisfy (3.10)-(3.14) and

Let the parameters $\{\lambda_{t}\}$ and $\{\beta_{t}\}$ in the CS procedure of Algorithm 3 be set to

In the following theorem, we provide a specific selection of $\{\alpha_{k}\}$ , $\{\theta_{k}\}$ , $\{\eta_{k}\}$ , $\{\tau_{k}\}$ and $\{T_{k}\}$ satisfying (3.10)-(3.14) and (4.26). Also, by using Lemma 3 and Proposition 1, we establish the complexity of the DCS method for computing an $(\epsilon,\delta)$ -solution of problem (1.10) when the objective functions are strongly convex. The choice of variable stepsizes rather than using constant stepsizes will accelerate its convergence rate.

Let $\mathbf{x}^{*}$ be an optimal solution of (1.10), the parameters $\{\lambda_{t}\}$ and $\{\beta_{t}\}$ in the CS procedure of Algorithm 3 be set to (4.27) and suppose that $\{\alpha_{k}\}$ , $\{\theta_{k}\}$ , $\{\eta_{k}\}$ , $\{\tau_{k}\}$ and $\{T_{k}\}$ are set to

where $\hat{\mathbf{x}}^{N}=\tfrac{2}{N(N+3)}\textstyle{\sum}_{k=1}^{N}(k+1)\hat{\mathbf{x}}^{k}$ .

It is easy to check that (4.30) satisfies conditions (3.10)-(3.14) and (4.26). Moreover, we have

Therefore, by plugging in these values to (4.28), we have

Furthermore, from (4.29), we have for $N\geq 2$

Let $\mathbf{s}^{N}:=\tfrac{2}{N(N+3)}\hat{\mathbf{s}}$ , then by using (4.34), we have for $N\geq 2$

Applying Proposition 1 to the above two inequalities, the results in (4.31) and (4.32) follow immediately.

respectively. In particular, if $\epsilon$ and $\delta$ satisfy

then the complexity bounds in (4.35), respectively, reduce to

Thirdly, we compare DCS method with the centralized mirror descent method nemyud:83 applied to (1.1). In the worst case, the Lipschitz constant and strongly convex modulus of $f$ in (1.1) can be bounded by $M_{f}\leq mM$ , and $\mu_{f}\geq m\mu$ , respectively, and each iteration of the method will incur $m$ subgradient evaluations. Therefore, the total number of subgradient evaluations performed by the mirror descent method for finding an $\epsilon$ -solution of (1.1), i.e., a point $\bar{x}\in X$ such that $f(\bar{x})-f^{*}\leq\epsilon$ , can be bounded by

Observed that the second bound in (4.37) states only the number of subgradient evaluations for each agent in the DCS method, we conclude that the total number of subgradient evaluations performed by DCS is comparable to the classic mirror descent method as long as (4.36) holds and hence not improvable in general for the nonsmooth strongly convex case.

Stochastic Decentralized Communication Sliding

In this section, we consider the stochastic case where only the noisy subgradient information of the functions $f_{i}$ , $i=1,\ldots,m$ , is available or easier to compute. This situation happens when the function $f_{i}$ ’s are given either in the form of expectation or as the summation of lots of components. This setting has attracted considerable interest in recent decades for its applications in a broad spectrum of disciplines including machine learning, signal processing, and operations research. We present a stochastic communication sliding method, namely the stochastic decentralized communication sliding (SDCS) method, and show that the similar complexity bounds as in Section 4 can still be obtained in expectation or with high probability.

The first-order information of the function $f_{i}$ , $i=1,\ldots,m$ , can be accessed by a stochastic oracle (SO), which, given a point $u^{t}\in X$ , outputs a vector $G_{i}(u^{t},\xi_{i}^{t})$ such that

The SDCS method can be obtained by simply replacing the exact subgradients in the CS procedure of Algorithm 3 with the stochastic subgradients obtained from SO. This difference is described in Algorithm 4.

We add a few remarks about the SDCS algorithm. Firstly, as in DCS, no additional communications of the dual variables are required when the subgradient projection (5.4) is performed for $T_{k}$ times in the inner loop. This is because the same $w_{i}^{k}$ has been used throughout the $T_{k}$ iterations of the Stochastic CS procedure. Secondly, the problem will reduce to the deterministic case if there is no stochastic noise associated with the SO, i.e., when $\sigma=0$ in (5.2). Therefore, in Section 6, we investigate the convergence analysis for the stochastic case first and then simplify the analysis for the deterministic case by setting $\sigma=0$ .

2 Convergence of SDCS on General Convex Functions

We now establish the main convergence properties of the SDCS algorithm. More specifically, we provide in Lemma 4 an estimate on the gap function defined in (2.7) together with stepsize policies which work for the general convex case with $\mu=0$ (cf. (1.2)). The proof of this lemma can be found in Section 6.

In the following theorem, we provide a specific selection of $\{\alpha_{k}\}$ , $\{\theta_{k}\}$ , $\{\eta_{k}\}$ , $\{\tau_{k}\}$ and $\{T_{k}\}$ satisfying (3.10)-(3.14) and (4.10). Also, by using Lemma 4 and Proposition 1, we establish the complexity of the SDCS method for computing an $(\epsilon,\delta)$ -solution of problem (1.10) in expectation when the objective functions are general convex.

Let $\mathbf{x}^{*}$ be an optimal solution of (1.10), the parameters $\{\lambda_{t}\}$ and $\{\beta_{t}\}$ in the CS procedure of Algorithm 4 be set as (4.11), and suppose that $\{\alpha_{k}\}$ , $\{\theta_{k}\}$ , $\{\eta_{k}\}$ , $\{\tau_{k}\}$ and $\{T_{k}\}$ are set to

where $\hat{\mathbf{x}}^{N}=\tfrac{1}{N}\textstyle{\sum}_{k=1}^{N}\hat{\mathbf{x}}^{k}$ .

It is easy to check that (5.7) satisfies conditions (3.10)-(3.14) and (4.10). Moreover, by (2.9), we can obtain

where $\mathbf{s}^{N}=\left(\textstyle{\sum}_{k=1}^{N}\theta_{k}\right)^{-1}\hat{\mathbf{s}}$ . Particularly, from Assumption (5.1) and (5.2),

Therefore, by taking expectation over both sides of (5.10) and plugging in these values into (5.10), we have

Note that from (5.6) and Jensen’s inequality, we have

Applying Proposition 1 to the above inequality and (5.11), the results in (5.8) and (5.9) follow immediately.

respectively. In particular, if $\epsilon$ and $\delta$ satisfy (4.20), the above complexity bounds, respectively, reduce to

In particular, we can show that the total number stochastic subgradients that SDCS requires is comparable to the mirror-descent stochastic approximation in NJLS09-1 . This implies that the sample complexity for decentralized stochastic optimization are still optimal (as the centralized one), even after we skip many communication rounds.

3 Convergence of SDCS on Strongly Convex Functions

We now provide in Lemma 5 an estimate on the gap function defined in (2.7) together with stepsize policies which work for the strongly convex case with $\mu>0$ (cf. (1.2)). The proof of this lemma can be found in Section 6.

Note that throughout this subsection, we assume that the prox-functions $V_{i}(\cdot,\cdot)$ , $i=1,\dots,m$ , (cf. (2.2)) are growing quadratically with the quadratic growth constant $\mathcal{C}$ , i.e., (4.25) holds.

In the following theorem, we provide a specific selection of $\{\alpha_{k}\}$ , $\{\theta_{k}\}$ , $\{\eta_{k}\}$ , $\{\tau_{k}\}$ and $\{T_{k}\}$ satisfying (3.10)-(3.14) and (4.10). Also, by using Lemma 5 and Proposition 1, we establish the complexity of the SDCS method for computing an $(\epsilon,\delta)$ -solution of problem (1.10) in expectation when the objective functions are strongly convex. Similar to the deterministic case, we choose variable stepsizes rather than constant stepsizes.

Let $\mathbf{x}^{*}$ be an optimal solution of (1.10), the parameters $\{\lambda_{t}\}$ and $\{\beta_{t}\}$ in the CS procedure of Algorithm 4 be set as (4.27), and suppose that $\{\alpha_{k}\}$ , $\{\theta_{k}\}$ , $\{\eta_{k}\}$ , $\{\tau_{k}\}$ and $\{T_{k}\}$ are set to

where $\hat{\mathbf{x}}^{N}=\tfrac{2}{N(N+3)}\textstyle{\sum}_{k=1}^{N}(k+1)\hat{\mathbf{x}}^{k}$ .

It is easy to check that (5.16) satisfies conditions (3.10)-(3.14) and (4.26). Similarly, by (2.9), Assumption (5.1) and (5.2), we can obtain

where $\mathbf{s}^{N}=\left(\textstyle{\sum}_{k=1}^{N}\theta_{k}\right)^{-1}\hat{\mathbf{s}}$ . Particularly, from (5.16), we have

Therefore, by plugging in these values into (5.19), we have

Note that from (5.15), we have, for any $N\geq 2$ ,

Hence, in view of the above three relations and Jensen’s inequality, we obtain

Applying Proposition 1 to the above inequality and (5.20), the results in (5.17) and (5.18) follow immediately.

respectively. In particular, if $\epsilon$ and $\delta$ satisfy (4.36), the above complexity bounds, respectively, reduce to

We can see that the total number of stochastic subgradient computations is comparable to the optimal complexity bound obtained in GhaLan10-1 ; ghadimi2013optimal for stochastic strongly convex case in the centralized case.

4 High Probability Results

All of the results stated in Section 5.2-5.3 are established in terms of expectation. In order to provide high probability results for SDCS method, we additionally need the following “light-tail” assumption:

Note that (5.23) is stronger than (5.2), since it implies (5.2) by Jensen’s inequality. Moreover, we also assume that there exists $\bar{\mathbf{V}}(\mathbf{x}^{*})$ s.t.

The following theorem provides a large deviation result for the gap function $g(\hat{\mathbf{s}}^{N},\hat{\mathbf{z}}^{N})$ when our objective functions $f_{i}$ , $i=1,\ldots,m$ are general nonsmooth convex functions.

Assume the objective $f_{i}$ , $i=1,\ldots,m$ are general nonsmooth convex functions, i.e., $\mu=0$ and $M>0$ . Let Assumptions (5.1), (5.2) and (5.23) hold, the parameters $\{\alpha_{k}\}$ , $\{\theta_{k}\}$ , $\{\eta_{k}\}$ , $\{\tau_{k}\}$ and $\{T_{k}\}$ in Algorithm 4 satisfy (3.10)-(3.14), and (4.10), and the parameters $\{\lambda_{t}\}$ and $\{\beta_{t}\}$ in the CS procedure of Algorithm 4 be set as (4.11). In addition, if $X_{i}$ ’s are compact, then for any $\zeta>0$ and $N\geq 1$ , we have

In the next corollary, we establish the rate of convergence of SDCS in terms of both primal and feasibility (or consistency) residuals are of order ${\cal O}(1/N)$ with high probability when the objective functions are nonsmooth and convex.

Observe that by the definition of $\lambda_{t}$ in (4.11),

which together with (5.27) then imply that

which in view of Proposition 1 immediately implies (5.29).

Convergence Analysis

This section is devoted to prove the main lemmas in Section 3, 4 and 5, which establish the convergence results of the decentralized primal-dual method, the deterministic and stochastic decentralized communication sliding methods, respectively. After introducing some general results about these algorithms, we provide the proofs for Lemma 1-5 and Theorem 5.3.

The following lemma below characterizes the solution of the primal and dual projection steps (3.2), (3.3) (also (3.6), (3.8)) as well as the projection in inner loop (4.7). The proof of this result can be found in Lemma 2 of GhaLan10-1 .

We are now ready to provide a proof for Lemma 1 which establishes the convergence property for the decentralized primal-dual method. Note that this result also builds up the basic recursion for the outer loop of the DCS and SDCS methods.

Proof of Lemma 1: Note that applying Lemma 6 to (3.6) and (3.8), we have

which in view of the definition of $Q$ and $\mathbf{V}(\cdot,\cdot)$ in (2.7) and (2.5), respectively, we can obtain

Multiplying both sides of the above inequality by $\theta_{k}$ , and summing the resulted inequality from $k=1$ to $N$ , we obtain

where (a) follows from (3.10) and the fact that $\mathbf{x}^{-1}=\mathbf{x}^{0}$ , (b) follows from the simple relation that $b\langle u,v\rangle-a\|v\|^{2}/2\leq b^{2}\|u\|^{2}/(2a),\forall a>0$ , (3.10) and (2.6), (c) follows from (3.12), (d) follows from (3.13), $\|\mathbf{y}-\mathbf{y}^{0}\|^{2}-\|\mathbf{y}-\mathbf{y}^{N}\|^{2}=\|\mathbf{y}^{0}\|^{2}-\|\mathbf{y}^{N}\|^{2}-2\langle\mathbf{y},\mathbf{y}^{0}-\mathbf{y}^{N}\rangle$ and arranging the terms accordingly, (e) follows from (2.6) and the relation $b\langle u,v\rangle-a\|v\|^{2}/2\leq b^{2}\|u\|^{2}/(2a),\forall a>0$ . The desired result in (3.15) then follows from this relation, (3.14), (6.1) and the convexity of $Q$ .

Furthermore, from (6.3)(c), (2.6) and the fact that $\textstyle{\sum}_{k=1}^{N}\theta_{k}Q(\mathbf{x}^{k},\mathbf{y}^{k};\mathbf{z}^{*})\geq 0$ , if we fix $\mathbf{z}=\mathbf{z}^{*}=(\mathbf{x}^{*},\mathbf{y}^{*})$ in the above relation, we have

from which the desired result in (3.17) follows. $\square$

Before we provide proofs for the remaining lemmas, we first need to present a result which summarizes an important convergence property of the CS procedure. It needs to be mentioned that the following proposition states a general result holds for CS procedure performed by individual agent $i\in\mathcal{N}$ . For notation convenience, we use the notations defined in CS procedure (cf. Algorithm 3).

If $\{\beta_{t}\}$ and $\{\lambda_{t}\}$ in the CS procedure satisfy

and $\delta^{t}:=\phi^{\prime}(u^{t})-h^{t}$ .

Noticing that $\phi:=f_{i}$ in the CS procedure, we have by (1.2)

where $\phi^{\prime}(u^{t-1})\in\partial\phi(u^{t-1})$ and $\partial\phi(u^{t-1})$ denotes the subdifferential of $\phi$ at $u^{t-1}$ . By applying Lemma 6 to (4.7), we obtain

Combining the above two relations together with (4.25) Observed that we only need condition (4.25) when $\mu>0$ , in other words, the objective functions $f_{i}$ ’s are strongly convex., we conclude that

Moreover, by Cauchy-Schwarz inequality, (2.3), and the simple fact that $-at^{2}/2+bt\leq b^{2}/(2a)$ for any $a>0$ , we have

From the above relation and the definition of $\Phi(u)$ in (6.6), we can rewrite (6.7) as,

Multiplying both sides by $\lambda_{t}$ and summing up the resulting inequalities from $t=1$ to $T$ , we obtain

Hence, in view of (6.4), the convexity of $\Phi$ and the definition of $\hat{u}^{T}$ in (4.8), we have

As a matter of fact, the SDCS method covers the DCS method as a special case when $\delta^{t}=0,\ \forall t\geq 0$ . Therefore, we investigate the proofs for Lemma 4 and 5 first and then simplify the proofs for Lemma 2 and 3.

We now provide a proof for Lemma 4, which establishes the convergence property of SDCS method for solving general convex problems.

By plugging into the above relation the values of $\lambda_{t}$ and $\beta_{t}$ in (4.11), together with the definition of $\Phi^{k}$ in (4.9) and rearranging the terms, we have,

Moreover, applying Lemma 6 to (4.3), we have, for $k\geq 1$ ,

Our result in (5.5) immediately follows from the convexity of $Q$ .

Furthermore, in view of (6.3)(c) and (6.9), we can obtain the following similar result (with $\mathbf{x}^{N}$ in the first and the second terms of the RHS being replaced with $\hat{\mathbf{x}}^{N}$ ),

Therefore, in view of the fact that $\textstyle{\sum}_{k=1}^{N}\theta_{k}Q(\hat{\mathbf{x}}^{k},\mathbf{y}^{k};\mathbf{z}^{*})\geq 0$ for any saddle point $\mathbf{z}^{*}=(\mathbf{x}^{*},\mathbf{y}^{*})$ of (1.11), and (2.6), by fixing $\mathbf{z}=\mathbf{z}^{*}$ and rearranging terms, we obtain

where the second inequality follows from the relation $b\langle u,v\rangle-a\|v\|^{2}/2\leq b^{2}\|u\|^{2}/(2a),\forall a>0$ .

from which the desired result in (5.6) follows.

The following proof of Lemma 5 establishes the convergence of SDCS method for solving strongly convex problems.

Proof of Lemma 5 When $f_{i}$ , $i=1,\dots,m$ , are strongly convex functions, we have $\mu,\ M>0$ (cf. (1.2)). Therefore, in view of Proposition 2 with $\lambda_{t}$ and $\beta_{t}$ defined in (4.27) satisfying condition (6.4), the definition of $\Phi^{k}$ in (4.9), and the input and output settings in the CS procedure, we have for all $k\geq 1$

By plugging into the above relation the values of $\lambda_{t}$ and $\beta^{(k)}_{t}$ in (4.27), together with the definition of $\Phi^{k}$ in (4.9) and rearranging the terms, we have

where $\hat{\mathbf{s}}$ is defined in (6.12). Our result in (5.14) immediately follows from the convexity of $Q$ .

Following the same procedure as we obtain (6.13), for any saddle point $\mathbf{z}^{*}=(\mathbf{x}^{*},\mathbf{y}^{*})$ of (1.11), we have

from which the desired result in (5.15) follows.

We are ready to provide proofs for Lemma 2 and 3, which demonstrates the convergence properties of the deterministic communication sliding method.

where $\hat{\mathbf{s}}$ is defined in (6.12). Our result in (4.12) immediately follows from the convexity of $Q$ . Moreover, our result in (4.13) follows from setting $\delta_{i}^{t-1,k}=0$ in (6.13) and (6.14).

where $\hat{\mathbf{s}}$ is defined in (6.12). Our result in (4.28) immediately follows from the convexity of $Q$ . Also, the result in (4.29) follows by setting $\delta_{i}^{t-1,k}=0$ in (6.19).

Proof of Theorem 5.3 Observe that by Assumption (5.1), (5.2) and (5.23) on the SO and the definition of $u_{i}^{t,k}$ , the sequence $\{\langle\delta_{i}^{t-1,k},x_{i}^{*}-u_{i}^{t-1,k}\rangle\}_{1\leq i\leq m,1\leq t\leq T_{k},k\geq 1}$ is a martingale-difference sequence. Denoting

and using the large-deviation theorem for martingale-difference sequence (e.g. Lemma 2 of lns11 ) and the fact that

and $S:=\textstyle{\sum}_{k=1}^{N}\textstyle{\sum}_{t=1}^{T_{k}}\textstyle{\sum}_{i=1}^{m}S_{k,t}$ . By the convexity of exponential function, we have

where the last inequality follows from Assumption (5.23). Therefore, by Markov’s inequality, for all $\zeta>0$ ,

Combing (6.20), (6.21), (5.5) and (2.9), our result in (5.25) immediately follows.

Concluding Remarks

In this paper, we present a new class of decentralized primal-dual methods which can significantly reduce the number of inter-node communications required to solve the distributed optimization problem in (1.1). More specifically, we show that by using these algorithms, the total number of communication rounds can be significantly reduced to ${\cal O}(1/\epsilon)$ when the objective functions $f_{i}$ ’s are convex and not necessarily smooth. By properly designing the communication sliding algorithms, we demonstrate that the ${\cal O}(1/\epsilon)$ number of communications can still be maintained for general convex objective functions (and it can be further reduced to ${\cal O}(1/\sqrt{\epsilon})$ for strongly convex objective functions) even if the local subproblems are solved inexactly through iterative procedure (cf. CS procedure) by the network agents. In this case, the number of intra-node subgradient computations that we need will be bounded by ${\cal O}(1/\epsilon^{2})$ (resp., ${\cal O}(1/\epsilon)$ ) when the objective functions $f_{i}$ ’s are convex (resp., strongly convex), which is comparable to that required in centralized nonsmooth optimization and not improvable in general. We also establish similar complexity bounds for solving stochastic decentralized optimization counterpart by developing the stochastic communication sliding methods, which can provide communication-efficient ways to deal with streaming data and decentralized statistical inference. All these decentralized communication sliding algorithms have the potential to significantly increase the performance of multiagent systems, where the bottleneck exists in the communication.