Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling

John Duchi, Alekh Agarwal, Martin Wainwright

Introduction

The focus of this paper is the development and analysis of distributed algorithms for solving convex optimization problems that are defined over networks. Such network-structured optimization problems arise in a variety of application domains within the information sciences and engineering. For instance, problems such as multi-agent coordination, distributed tracking and localization, estimation problems in sensor networks and packet routing are all naturally cast as distributed convex minimization [BT89, LWHS02, LOT03, RN04, XBK07]. Common to these problems is the necessity for completely decentralized computation that is locally light—so as to avoid overburdening small sensors or flooding busy networks—and robust to periodic link or node failures. As a second example, data sets that are too large to be processed quickly by any single processor present related challenges. A canonical example that arises in statistical machine learning is the problem of minimizing a loss function averaged over a large dataset (e.g., optimization in support vector machines [CV95]). With terabytes of data, it is desirable to assign smaller subsets of the data to different processors, and the processors must communicate to find parameters that minimize the loss over the entire dataset. However, the communication should be efficient enough that network latencies do not offset computational gains.

Our paper makes two main contributions. The first contribution is to provide a new simple subgradient algorithm for distributed constrained optimization of a convex function; we refer to it as a dual averaging subgradient method, since it is based on maintaining and forming weighted averages of subgradients throughout the network. This approach is essentially different from previously developed methods [NO09, LO09, RNV10], and these differences facilitate our analysis of network scaling issues, meaning how convergence rates depend on network size and topology. Indeed, the second main contribution of this paper is a careful analysis that demonstrates a close link between convergence of the algorithm and the underlying spectral properties of the network. Our analysis splits the convergence rate of the algorithm into two terms: an optimization term and a network deviation term. We obtain the optimization penalty using techniques based on the optimization literature, specifically building on results due to Nesterov [Nes09]. This splitting approach can also be adapted to naturally handle issues such as constrained optimization, stochastic communication, and stochastic optimization due to elegant properties of the dual averaging algorithm. On the other hand, the network scaling terms are obtained using techniques from analysis of Markov chains coupled with the distributed communication protocol. We show that the network deviation terms we derive are sharp for our algorithm; in the special case of the consensus problem, these terms are known to be near-optimal [BGPS06].

By comparison to previous work, our convergence results and proofs are different, and our characterization of the network scaling terms are often much stronger. In particular, the convergence rates given by the papers [NO09, LO09] grow exponentially in the number of nodes nn in the network. Nedić et al. [NOOT09] and Ram et al. [RNV10] provide a much tighter analysis that yields convergence rates that scale polynomially in the network size, but are independent of the network topology (apart from requiring strong connectedness and degree independent of nn). Specifically, Corollary 5.5 in the paper [RNV10] guarantees that their projected subgradient algorithm—under the assumptions that the number of time steps is known a priori and the stepsize is chosen optimally—obtains an ϵ\epsilon-optimal solution to the optimization problem in O(n3/ϵ2)\mathcal{O}(n^{3}/\epsilon^{2}) time. Since this bound is essentially independent of network topology, it does not capture the intuition that distributed algorithms should converge much faster on “well-connected” networks—expander graphs being a prime example—than on poorly connected networks (e.g., chains, trees or single cycles). Johansson et al. [JRJ09] analyze a low communication peer-to-peer protocol that attains rates dependent on network structure. However, in their algorithm only one node has a current parameter value, while all nodes in our algorithm maintain good estimates of the optimum at all time. This is important in online, streaming, and control problems where nodes are expected to act or answer queries in real time. In additional comparison to previous work, our analysis gives network scaling terms that are often substantially sharper. Our development yields an algorithm with convergence rate that scales inversely in the spectral gap of the network. By exploiting known results on spectral gaps for graphs with nn nodes, we show that (disregarding logarithmic factors) our algorithm obtains an ϵ\epsilon-optimal solution in O(n2/ϵ2)\mathcal{O}(n^{2}/\epsilon^{2}) iterations for a single cycle or path, O(n/ϵ2)\mathcal{O}(n/\epsilon^{2}) iterations for a two-dimensional grid, and O(1/ϵ2)\mathcal{O}(1/\epsilon^{2}) iterations for a bounded degree expander graph. Moreover, simulation results show an excellent agreement with these theoretical predictions.

Our analysis covers several settings for distributed minimization. We begin by studying fixed communication protocols, which are of interest in a variety of areas such as cluster computing or sensor networks with a fixed hardware-dependent protocol. We also investigate randomized communication protocols as well as randomized network failures, which are often essential to handle gracefully in wireless sensor networks and large clusters with potential node failures. Randomized communication also provides an interesting tradeoff between communication savings and convergence rates. In this setting, we obtain much sharper results than previous work by studying the spectral properties of the expected transition matrix of a random walk on the underlying graph. We also present an analysis of our algorithm with stochastic gradient information, which is not difficult when combined with our initial theorems. We describe an extension for (structured) regularized objectives that often arise in machine learning problems in Appendix D.

The remainder of this paper is organized as follows. Section 2 is devoted to a formal statement of the problem and description of the dual averaging algorithm, whereas Section 3 states the main results and consequences of our paper. Having stated our main results, in Section 4 we give a more in-depth comparison of our work with other recent work. In Section 5, we state and prove basic convergence results on the dual averaging algorithm, which we then exploit in Section 6 to derive concrete results that depend on the spectral gap of the network. Sections 7 and 8 treat extensions with noise, in particular algorithms with noisy communication and stochastic gradient methods respectively. In Section 9, we present the results of simulations that confirm the sharpness of our analysis.

Notation: We collect some notation used throughout the paper. We use 1 ⁣ ⁣11\!\!1 to denote the all-ones vector. We also use standard asymptotic notation for sequences. If ana_{n} and bnb_{n} are positive sequences, then an=O(bn)a_{n}=\mathcal{O}(b_{n}) means that lim supnan/bn<\limsup_{n}a_{n}/b_{n}<\infty, whereas an=Ω(bn)a_{n}=\Omega(b_{n}) means that lim infnan/bn>0\liminf_{n}a_{n}/b_{n}>0. On the other hand, an=o(bn)a_{n}=o(b_{n}) means that limnan/bn=0\lim_{n}a_{n}/b_{n}=0 and an=ω(bn)a_{n}=\omega(b_{n}) means that limnan/bn=\lim_{n}a_{n}/b_{n}=\infty. Finally, we write an=Θ(bn)a_{n}=\Theta(b_{n}) if an=O(bn)a_{n}=\mathcal{O}(b_{n}) and an=Ω(bn)a_{n}=\Omega(b_{n}).

Problem set-up and algorithm

In this section, we provide a formal statement of the distributed minimization problem, and a description of the distributed dual averaging algorithm.

Problems of this nature arise in a variety of application domains, and as motivation for the analysis to follow, let us consider a few here. A first example is a sensor network, in which each agent represents a sensor mote, equipped with a radio transmitter for communication, some basic sensing devices, and some local memory and computational power. In environmental applications of sensor networks, each mote ii might take a measurement yiy_{i} of the temperature, and the global objective could be to compute the median of the measurements {y1,y2,,yn}\{y_{1},y_{2},\ldots,y_{n}\}. This median computation problem can be formulated as minimizing the scalar objective function 1ni=1nfi(x)\frac{1}{n}\sum_{i=1}^{n}f_{i}(x), where fi(x)=xyif_{i}(x)=|x-y_{i}|. Similar formulations apply to the problem of computing other statistics such as means, variances, quantiles and other MM-estimators.

A second motivating example is the machine learning problem first described in Section 1. In this case, the set X\mathcal{X} is the parameter space of the statistician or learner. Each function fif_{i} is the empirical loss over the subset of data assigned to the iith processor, and assuming that each subset is of equal size (or that the fif_{i} are normalized suitably), the average ff is the empirical loss over the entire dataset. Here we use cluster computing as our computational model, where each processor is a node in the cluster, and the graph GG contains edges between those processors that are directly connected with small network latencies. A special case of our optimization problem within this computational model is the distributed perceptron, recently considered by McDonald et al. [MHM10].

2 Standard dual averaging

Our algorithm is based on a projected dual averaging algorithm [Nes09], designed for minimization of a (potentially nonsmooth) convex function ff subject to the constraint xXx\in\mathcal{X}. We begin by describing the standard version of this algorithm, and then discuss the extensions for the distributed setting of interest in this paper.

In addition, we assume that ψ(x)0\psi(x)\geq 0 for all xXx\in\mathcal{X} and that ψ(0)=0\psi(0)=0; these are standard assumptions that can be made without loss of generality. Examples of such proximal function and norm pairs include:

We assume that each function fif_{i} is LL-Lipschitz with respect to the same norm \left\|\cdot\right\|—that is,

There are many cost functions fif_{i} that satisfy this type of Lipschitz condition. For instance, it holds for any convex function on a compact domain X\mathcal{X}, or for any polyhedral function on an arbitrary domain [HUL96a]. Note that the Lipschitz condition (2) implies that for any xXx\in\mathcal{X} and any subgradient gifi(x)g_{i}\in\partial f_{i}(x), we have

where \left\|\cdot\right\|_{*} denotes the dual norm to \left\|\cdot\right\|, defined by v  :=supu=1v,u\left\|v\right\|_{*}\;:\,=\sup_{\left\|u\right\|=1}\left\langle v,u\right\rangle.

where {α(t)}t=0\{\alpha(t)\}_{t=0}^{\infty} is a non-increasing sequence of positive stepsizes and

is a type of projection. The intuition underlying this algorithm is as follows: given the current iterate (x(t),z(t))(x(t),z(t)), the next iterate x(t+1)x(t+1) to chosen to minimize an averaged first-order approximation to the function ff, while the proximal function ψ\psi and stepsize α(t)>0\alpha(t)>0 enforce that the iterates {x(t)}t=0\{x(t)\}_{t=0}^{\infty} do not oscillate wildly. The algorithm is similar to the follow the perturbed leader and lazy projection algorithms developed in the context of online optimization [KV05], though in this form the algorithm seems to be originally due to Nesterov [Nes09]. In Section 5, we show that a simple analysis of the convergence of the above procedure allows us to relate it to the distributed algorithm we describe.

3 Distributed dual averaging

Using this notation, given the non-increasing sequence {α(t)}t=0\{\alpha(t)\}_{t=0}^{\infty} of positive stepsizes, each node iV={1,2,,n}i\in V=\{1,2,\ldots,n\} performs the updates

where the projection ΠXψ\Pi_{\mathcal{X}}^{\psi} was defined previously (4). In words, node ii computes the new dual parameter zi(t+1)z_{i}(t+1) from a weighted average of its own subgradient gi(t)g_{i}(t) and the parameters {zj(t),jN(i)}\{z_{j}(t),j\in N(i)\} in its own neighborhood N(i)N(i), and then computes the next local iterate xi(t+1)x_{i}(t+1) by a projection defined by the proximal function ψ\psi and stepsize α(t)>0\alpha(t)>0.

In the sequel, we show convergence of the local sequence {xi(t)}t=1\{x_{i}(t)\}_{t=1}^{\infty} to the optimum of (1) via the running local average

Main results and consequences

We will now state the main results of this paper and illustrate some of their consequences. We give the proofs and a deeper investigation of related corollaries at length in the sections that follow.

We start with a result on the convergence of the distributed dual averaging algorithm that provides a decomposition of the error into an optimization term and the cost associated with network communication. In order to state this theorem, we define the averaged dual variable zˉ(t)  :=1ni=1nzi(t)\bar{z}(t)\;:\,=\frac{1}{n}\sum_{i=1}^{n}z_{i}(t), and we recall the definition (6) of the local average x^i(T)\widehat{x}_{i}(T).

Let the sequences {xi(t)}t=0\{x_{i}(t)\}_{t=0}^{\infty} and {zi(t)}t=0\{z_{i}(t)\}_{t=0}^{\infty} be generated by the updates (5a) and (5b) with step size sequence {α(t)}t=0\{\alpha(t)\}_{t=0}^{\infty}. Then for any xXx^{*}\in\mathcal{X} and for each node iVi\in V, we have

Theorem 1 guarantees that after TT steps of the algorithm, every node iVi\in V has access to a locally defined quantity x^i(T)\widehat{x}_{i}(T) such that the difference f(x^i(T))f(x)f(\widehat{x}_{i}(T))-f(x^{*}) is upper bounded by a sum of four terms. The first two terms in the upper bound (1) are optimization error terms that are common to subgradient algorithms. The third and fourth terms are penalties incurred due to having different estimates at different nodes in the network, and they measure the deviation of each node’s estimate of the average gradient from the true average gradient.The fact that the term zˉ(t)zi(t)\left\|\bar{z}(t)-z_{i}(t)\right\|_{*} appears an extra time is no significant difficulty, as we will bound the difference zˉ(t)zi(t)\bar{z}(t)-z_{i}(t) uniformly for all ii when giving concrete convergence results. Thus, roughly, Theorem 1 ensures that as long the bound on the deviation zˉ(t)zi(t)\left\|\bar{z}(t)-z_{i}(t)\right\|_{*} is tight enough, for appropriately chosen α(t)\alpha(t) (say α(t)1/t\alpha(t)\approx 1/\sqrt{t}), the error of x^i(T)\widehat{x}_{i}(T) is small uniformly across all nodes iVi\in V, and asymptotically approaches 0. See Theorem 2 in the next section for a precise statement of rates.

It is worthwhile comparing the optimization error term from the bound (1) to known results. Subgradient descent on the average function f=1ni=1nfif=\frac{1}{n}\sum_{i=1}^{n}f_{i} has identical convergence rate, as does the randomized version of incremental subgradient descent [NB01]. However, the distributed nature of the algorithm gives a computational advantage over full gradient descent—the gradient computation requires O(1)\mathcal{O}(1) computation per computer rather than O(n)\mathcal{O}(n) on a single computer. To highlight the benefits compared to incremental subgradient descent, consider the common problem in machine learning and statistics of minimizing a loss on a large dataset. A randomized incremental gradient descent method must access random subsets of data at every iteration, leading to randomized disk seeks with high latency, which the distributed algorithm avoids. In addition, we expect (and empirically see that this is indeed the case) our method to produce more stable iterates, as we observe the gradients of all nn functions at every round, albeit with a network communication lag.

2 Convergence rates and network topology

We now turn to investigation of the effects of network topology on convergence rates. In this section, we assume that the network topology is static and that communication occurs via a fixed doubly stochastic weight matrix PP at every round.In later sections, we weaken these conditions. Since PP is doubly stochastic, it has largest singular value σ1(P)=1\sigma_{1}(P)=1. As summarized in the following result, the convergence rate of the distributed projection algorithm is controlled by the spectral gap γ(P)  :=1σ2(P)\gamma(P)\;:\,=1-\sigma_{2}(P) of the matrix PP.

Under the conditions and notation of Theorem 1, suppose moreover that ψ(x)R2\psi(x^{*})\leq R^{2}. With step size choice α(t)=R1σ2(P)4Lt\alpha(t)=\frac{R\sqrt{1-\sigma_{2}(P)}}{4L\sqrt{t}}, we have

To the best of our knowledge, this theorem is the first to establish a tight connection between the convergence rate of distributed subgradient methods to the spectral properties of the underlying network. In particular, the inverse dependence on the spectral gap 1σ2(P)1-\sigma_{2}(P) is quite natural, since it is well-known to determine the rates of mixing in random walks on graphs [LPW08], and the propagation of information in our algorithm is integrally tied to the random walk on the underlying graph with transition probabilities specified by PP.

Using Theorem 2, one can derive explicit convergence rates for several classes of interesting networks, and Figure 1 illustrates four different graph topologies that are of interest. As a first example, the kk-connected cycle in panel (a) is formed by placing nn nodes on a circle and connecting each node to its kk neighbors on the right and left. For small kk, the cycle graph is rather poorly connected, and our analysis will show that this leads to slower convergence rates than other graphs with better connectivity. The grid graph in two dimensions is obtained by connecting nodes to their kk nearest neighbors in axis-aligned directions. For instance, panel (b) shows an example of a degree 44 grid graph in two-dimensions. Both the cycle and grid topologies are possible models for clustered computing as well as sensor networks.

In panel (c), we show a random geometric graph, constructed by placing nodes uniformly at random in 2^{2} and connecting any two nodes separated by a distance less than some radius r>0r>0. These graphs are used to model the connectivity patterns of devices, such as wireless sensor motes, that can communicate with all nodes in some fixed radius ball, and have been studied extensively (e.g., [GK00, Pen03]). There are natural generalizations to dimensions d>2d>2 as well as to cases in which the spatial positions are drawn from a non-uniform distribution.

Finally, panel (d) shows an instance of a bounded degree expander, which belongs to a special class of sparse graphs that have very good mixing properties [Chu98]. Expanders are an attractive option for the network topology in distributed computation since they are known to have large spectral gaps. For many random graph models, a typical sample is an expander with high probability; for instance, a randomly chosen bipartite graph satisfies this property [Alo86], as do random degree regular graphs [FKS89]. In addition, there are several deterministic constructions of expanders that are degree regular (see Section 6.3 of Chung [Chu98] for further details). The deterministic constructions are of interest because they can be used to design a network, while the random constructions are of interest since they are often much simpler.

Note that the graph Laplacian L=L(G)\mathcal{L}=\mathcal{L}(G) is always symmetric, positive semidefinite, and satisfies LD1/21 ⁣ ⁣1=0\mathcal{L}D^{1/2}1\!\!1=0. Therefore, when the graph is degree-regular (δi=δ\delta_{i}=\delta for all iVi\in V), the standard random walk with self loops on GG given by the matrix P  :=Iδδ+1LP\;:\,=I-\frac{\delta}{\delta+1}\mathcal{L} is doubly stochastic and is valid for our theory. For non-degree regular graphs, we need to make a minor modification in order to obtain a doubly stochastic matrix.

Letting δmax=maxiVδi\delta_{\operatorname{max}}=\max_{i\in V}\delta_{i} denote the maximum degree, we define the modified matrix

This matrix is symmetric by construction, and moreover, j=1n(DijAij)=Diij=1nAij=0\sum_{j=1}^{n}(D_{ij}-A_{ij})=D_{ii}-\sum_{j=1}^{n}A_{ij}=0 for all iVi\in V, so it is also doubly stochastic. Note that if the graph is δ\delta-regular, then Pn(G)P_{n}(G) is the standard choice above. Modulo a small technical detail about the ratios of δmax\delta_{\operatorname{max}} to δi\delta_{i} and the eigenvalue order of PP (see Sec. 6.2), plugging Pn(G)P_{n}(G) from (8) above into Theorem 2 immediately relates the convergence of distributed dual averaging to the spectral properties of the graph Laplacian, in particular, we have:

The following result summarizes our conclusions for the choice of stochastic matrix in (8) via (9) in application to different network topologies.

Under the conditions of Theorem 2, we have the following convergence rates:

For kk-connected n×n\sqrt{n}\times\sqrt{n} grids,

For random geometric graphs with connectivity radius r=Ω(log1+ϵn/n)r=\Omega(\sqrt{\log^{1+\epsilon}n/n}) for any ϵ>0\epsilon>0,

For expanders with bounded ratio of minimum to maximum node degree,

Note that up to logarithmic factors, the optimization term in the convergence rate is always of the order RL/TRL/\sqrt{T}, while the remaining terms vary depending on the network topology. Instead of stating convergence rates, in order to understand scaling issues as a function of network size and topology, it can be useful to re-state these results in terms of the number of iterations TG(ϵ;n)T_{G}(\epsilon;n) required to achieve error ϵ\epsilon for network type GG with nn nodes. As some special cases, Corollary 1 implies the following scalings:

for the 11-connected single cycle graph, we have Tcycle(ϵ;n)=O(n2/ϵ2)T_{\operatorname{cycle}}(\epsilon;n)=\mathcal{O}(n^{2}/\epsilon^{2}).

for the two-dimensional grid, we have Tgrid(ϵ;n)=O(n/ϵ2)T_{\operatorname{grid}}(\epsilon;n)=\mathcal{O}(n/\epsilon^{2}), and

for a bounded degree expander, we have Texp(ϵ;n)=O(1/ϵ2)T_{\operatorname{exp}}(\epsilon;n)=\mathcal{O}(1/\epsilon^{2}).

In general, Theorem 2 implies that at most

iterations are required to achieve an ϵ\epsilon-accurate solution when using the matrix Pn(G)P_{n}(G) previously defined in (8).

It is interesting to ask whether the upper bound (10) from our analysis is actually a sharp result, meaning that it cannot be improved (up to constant factors). On one hand, it is known that (even for centralized optimization algorithms), any subgradient method requires at least Ω(1ϵ2)\Omega\left(\frac{1}{\epsilon^{2}}\right) iterations to achieve ϵ\epsilon-accuracy [NY83], so that the 1/ϵ21/\epsilon^{2} term is unavoidable. The next proposition addresses the complementary issue, namely whether the inverse spectral gap term is unavoidable for the dual averaging algorithm. For the quadratic proximal function ψ(x)=12x22\psi(x)=\frac{1}{2}\left\|x\right\|_{2}^{2}, the following result establishes a lower bound on the number of iterations in terms of graph topology and network structure:

Consider the dual averaging algorithm (5a) and (5b) with quadratic proximal function and communication matrix Pn(G)P_{n}(G). For any graph GG with nn nodes, the number of iterations TG(c;n)T_{G}(c;n) required to achieve a fixed accuracy c>0c>0 is lower bounded as

The proof of this result, given in Section 6.3, involves constructing a “hard” optimization problem and lower bounding the number of iterations required for our algorithm to solve it. In conjunction with Corollary 1, Proposition 1 implies that our predicted network scaling is sharp. Indeed, in Section 9, we show that the theoretical scalings from Corollary 1—namely, quadratic, linear, and constant in network size nn—are well-matched in simulations of our algorithm.

3 Extensions to stochastic communication links

Our results also extend to the case when the communication matrix PP is time-varying and random—that is, the matrix P(t)P(t) is potentially different for each tt and randomly chosen (but it P(t)P(t) still obeys the constraints imposed by GG). Such stochastic communication is of interest for a variety of reasons. If there is an underlying dense network topology, we might want to avoid communicating along every edge at each round to decrease communication and network congestion. For instance, the use of a gossip protocol [BGPS06], in which one edge in the network is randomly chosen to communicate at each iteration, allows for a more refined trade-off between communication cost and number of iterations. Communication in real networks also incurs errors due to congestion or hardware failures, and we can model such errors by a stochastic process.

The following theorem provides a convergence result for the case of time-varying random communication matrices. In particular, it applies to sequences {xi(t)}t=0\{x_{i}(t)\}_{t=0}^{\infty} and {zi(t)}t=0\{z_{i}(t)\}_{t=0}^{\infty} generated by the dual averaging algorithm with updates (5a) and (5b) with step size sequence {α(t)}t=0\{\alpha(t)\}_{t=0}^{\infty}, but in which pijp_{ij} is replaced with pij(t)p_{ij}(t).

We provide a proof of the theorem in Section 7. Note that the upper bound from the theorem is valid for any sequence of non-increasing positive stepsizes {α(t)}t=0\{\alpha(t)\}_{t=0}^{\infty}. The bound consists of three terms, with the first growing and the last two shrinking as the stepsize choice is reduced. If we assume that ψ(x)R2\psi(x^{*})\leq R^{2}, then we can optimize the tradeoff between these competing terms, and we find that the stepsize sequence α(t)R1λ2Lt\alpha(t)\propto\frac{R\sqrt{1-\lambda_{2}}}{L\sqrt{t}} approximately minimizes the bound bound in the theorem. This yields the scaling

for a universal constant cc. We can also boost the probability with which this result holds to 11/Tk1-1/T^{k} for any k>1k>1—without modifying the algorithm—at the cost of incurring a slightly higher constant penalty in the error bound.

The setting of stochastic communication for distributed optimization was previously considered by Lobel and Ozdaglar [LO09]. They established convergence by assuming lower bounds on the entries of PP whenever two nodes communicated. As a consequence, their bounds grew exponentially in the number of nodes nn in the network.More precisely, inspection of the constant CC in equation (37) of their paper shows that it is of order γ2(n1)\gamma^{-2(n-1)}, where γ\gamma is the lower bound on non-zero entries of PP, so it is at least 4n14^{n-1}. In contrast, the rates given here for stochastic communication are directly comparable to the convergence rates in the previous section for fixed transition matrices. More specifically, we have inverse dependence on the spectral gap of the expected network, and consequently polynomial scaling for any network, as well as faster rates dependent on network structure.

4 Results for stochastic gradient algorithms

Finally, none of our convergence results rely on the gradients being correct. Specifically, we can straightforwardly extend our results to the case of noisy gradients corrupted with zero-mean bounded-variance noise. This setting is especially relevant in situations such as distributed learning or wireless sensor networks, when data observed is noisy. Let Ft\mathcal{F}_{t} be the σ\sigma-field containing all information up to time tt, that is, gi(1),,gi(t)Ftg_{i}(1),\ldots,g_{i}(t)\in\mathcal{F}_{t} and xi(1),,xi(t+1)Ftx_{i}(1),\ldots,x_{i}(t+1)\in\mathcal{F}_{t} for all ii. We define a stochastic oracle that provides gradient estimates satisfying

As a special case, this model includes an additive noise oracle that takes an element of the subgradient fi(xi(t))\partial f_{i}(x_{i}(t)) and adds to it bounded variance zero-mean noise. Theorem 4 gives our result in the case of stochastic gradients. We give the proof and further discussion in Section 8, noting that because we adapt the dual averaging algorithm, the analysis follows quite cleanly from the earlier analysis for the previous three theorems.

Let the sequence {xi(t)}t=1\{x_{i}(t)\}_{t=1}^{\infty} be as in Theorem 1, except that at each round of the algorithm agent ii receives a vector g^i(t)\widehat{g}_{i}(t) from an oracle satisfying condition (12). For each iVi\in V, we have

If we assume in addition that X\mathcal{X} has finite radius R  :=supxXxxR\;:\,=\sup_{x\in\mathcal{X}}\left\|x-x^{*}\right\| and that g^i(t)L\left\|\widehat{g}_{i}(t)\right\|_{*}\leq L, then with probability at least 1δ1-\delta,

If we further assume that the gradient estimates g^i(t)\widehat{g}_{i}(t) are uncorrelated given Ft1\mathcal{F}_{t-1}, then with probability at least 1δ1-\delta,

Related Work

Having stated and discussed our main results in the previous section, we can now more explicitly compare the results in this paper to those in previous work. Our aim here is to give a clear understanding of how our algorithm and results relate to and, in many cases, improve upon prior results. Specifically, with the results of Theorem 2 and Corollary 1 in hand, we can more directly compare our results to other work.

As discussed in the introduction, other researchers have designed algorithms for solving the problem (1). Most previous work [LO09, NO09, NOOT09, RNV10] studies convergence of a (projected) gradient method in which each node ii in the network maintains xi(t)Xx_{i}(t)\in\mathcal{X}, and at time tt performs the update

for gi(t)fi(xi(t))g_{i}(t)\in\partial f_{i}(x_{i}(t)). With the update (13), Corollary 5.5 in the paper [RNV10] shows that

(we use our notation and assumptions from Theorem 2). The above bound is minimized by setting the stepsize αLn3/2RT\alpha\propto\frac{L}{n^{3/2}R\sqrt{T}}, giving convergence rate O(n3/2RL/T)\mathcal{O}(n^{3/2}RL/\sqrt{T}). It is clear that this convergence rate is substantially slower than all the rates in Corollary 1.

The distributed dual averaging algorithm (5a)–(5b) is quite different from the update (13). The use of the proximal function ψ\psi allows us to address problems with non-Euclidean geometry, which is useful, for example, for very high-dimensional problems or where the domain X\mathcal{X} is the simplex (e.g. [NY83, Chapter 3]). The differences between the algorithms become more pronounced in the analysis. Since we use dual averaging, we can avoid some technical difficulties introduced by the projection step in the update (13). Precisely because of this technical issue, earlier works [NO09, LO09] studied unconstrained optimization, and the averaging in zi(t)z_{i}(t) seems essential to the faster rates our approach achieves as well as the ease with which we can extend our results to stochastic settings.

In other related work, Johansson et al. [JRJ09] establish network-dependent rates for Markov incremental gradient descent (MIGD), which maintains a single vector x(t)x(t) at all times. A token i(t)i(t) determines an active node at time tt, and at time step t+1t+1 the token moves to one of its neighbors jN(i(t))j\in N(i(t)), each with probability Pji(t)P_{ji(t)}. Letting gi(t)(t)fi(t)(x(t))g_{i(t)}(t)\in\partial f_{i(t)}(x(t)), the update is

Johansson et al. show that with optimal setting of α\alpha and symmetric transition matrix PP, MIGD has convergence rate O(RLmaxinΓiiT)\mathcal{O}(RL\max_{i}\sqrt{\frac{n\Gamma_{ii}}{T}}), where Γ\Gamma is the return time matrix Γ=(IP+1 ⁣ ⁣11 ⁣ ⁣1/n)1\Gamma=(I-P+1\!\!11\!\!1^{\top}/n)^{-1}. In this case, let λi(P)\lambda_{i}(P)\in denote the iith eigenvalue of PP. The eigenvalues of Γ\Gamma are thus 11 and 1/(1λi(P))1/(1-\lambda_{i}(P)) for i>1i>1, and so we have

Consequently, the bound in Theorem 2 is never weaker, and for certain graphs, our results are substantially tighter, as shown in Corollary 1. For dd-dimensional grids (where d2d\geq 2) we have T(ϵ;n)=O(n2/d/ϵ2)T(\epsilon;n)=\mathcal{O}(n^{2/d}/\epsilon^{2}), whereas MIGD scales as T(ϵ;n)=O(n/ϵ2)T(\epsilon;n)=\mathcal{O}(n/\epsilon^{2}). For well-connected graphs, such as expanders and the complete graph, the MIGD algorithm scales as T(ϵ;n)=O(n/ϵ2)T(\epsilon;n)=\mathcal{O}(n/\epsilon^{2}), essentially a factor of nn worse than our results.

Basic convergence analysis for distributed dual averaging

In this section, we prove convergence of the distributed algorithm based on the updates (5a) and (5b). We begin in Section 5.1 by defining some auxiliary quantities and establishing lemmas useful in the proof, and we prove Theorem 1 in Section 5.2.

Using techniques related to those used in past work [NO09], we establish convergence via two auxiliary sequences, given by

We begin by showing that the average sum of gradients zˉ(t)\bar{z}(t) evolves in a very simple way. In particular, we have

Consider the right-hand side above, let Z(t)=[z1(t)  zn(t)]Z(t)=[z_{1}(t)~{}\cdots~{}z_{n}(t)] be the matrix of vectors ziz_{i}, and denote P=[p1  pn]P=[p_{1}~{}\cdots~{}p_{n}]. Since the matrix PP is doubly stochastic, we have

Consequently, the (negative of the) averaged dual sequence {zˉ(t)}t=0\{\bar{z}(t)\}_{t=0}^{\infty} evolves almost like standard subgradient descent on the function f(x)=i=1nfi(x)/nf(x)=\sum_{i=1}^{n}f_{i}(x)/n, the only difference being gi(t)g_{i}(t) is a subgradient at xi(t)x_{i}(t) (which need not be the same as the subgradient gj(t)g_{j}(t) at xj(t)x_{j}(t)). The simple evolution (16) of the averaged dual sequence allows us to avoid difficulties with the non-linearity of projection that have been challenging in earlier work.

For any non-increasing sequence {α(t)}t=0\{\alpha(t)\}_{t=0}^{\infty} of positive stepsizes, and for any xXx^{*}\in\mathcal{X}, we have

Next we state a lemma that allows us to restrict our analysis to the easier to analyze centralized sequence {y(t)}t=0\{y(t)\}_{t=0}^{\infty} from (15):

Consider the sequences {xi(t)}t=1\{x_{i}(t)\}_{t=1}^{\infty}, {zi(t)}t=0\{z_{i}(t)\}_{t=0}^{\infty}, and {y(t)}t=0\{y(t)\}_{t=0}^{\infty} defined according to equations (5a), (5b), and (15). Recall that each fif_{i} is LL-Lipschitz. For each iVi\in V, we have

Similarly, with the definitions y^(T)  :=1Tt=1Ty(t)\widehat{y}(T)\;:\,=\frac{1}{T}\sum_{t=1}^{T}y(t) and x^i(T)  :=1Tt=1Txi(t)\widehat{x}_{i}(T)\;:\,=\frac{1}{T}\sum_{t=1}^{T}x_{i}(t), we have

Equipped with these tools, we now turn the proof of Theorem 1.

2 Proof of Theorem 1

Our proof is based on analyzing the sequence {y(t)}t=0\{y(t)\}_{t=0}^{\infty}. Given an arbitrary xXx^{*}\in\mathcal{X}, we have

where the inequality follows by the LL-Lipschitz condition on fif_{i}.

Let gi(t)fi(xi(t))g_{i}(t)\in\partial f_{i}(x_{i}(t)) be a subgradient of fif_{i} at xi(t)x_{i}(t). Using convexity, we have the bound

Breaking up the right hand side of (19) into two pieces, we obtain

By definition of the updates for zˉ(t)\bar{z}(t) and y(t)y(t), we have

Thus, we see that the first term in the decomposition (20) can be written in the same way as the bound in Lemma 2, and as a consequence, we have the bound

It remains to control the final two terms in the bounds (18) and (20). Since gi(t)L\left\|g_{i}(t)\right\|_{*}\leq L by assumption, we have

By the α\alpha-Lipschitz continuity of the projection operator ΠXψ(,α)\Pi_{\mathcal{X}}^{\psi}(\cdot,\alpha) (see Appendix A.3), we have

Combining this bound with (18) and (21) yields the running sum bound

Applying Lemma 3 to (22) gives that t=1T[f(xi(t))f(x)]\sum_{t=1}^{T}[f(x_{i}(t))-f(x^{*})] is upper bounded by

Dividing both sides by TT and using convexity of ff yields the bound (1).

Convergence rates, spectral gap, and network topology

In this section, we will give concrete convergence rates for the distributed dual averaging algorithm based on the mixing time of a random walk according to the doubly stochastic matrix PP. The understanding of the dependence of our convergence rates in terms of the underlying network topology is crucial, because it can provide important cues to the system administrator in a clustered computing environment or for the locations and connectivities of sensors in a sensor network. We begin in Section 6.1 with the proof of Theorem 2. In Section 6.2, we prove the graph-specific convergence rates stated in Corollary 1, whereas Section 6.3 contains a proof of the lower bound stated in Proposition 1.

For a brief review of the relevant standard Perron-Frobenius and matrix theory, we refer the reader to Appendix B.

We focus on controlling the network error term in the bound (1), namely the quantity

Define the matrix Φ(t,s)=Pts+1\Phi(t,s)=P^{t-s+1} (in the sequel we allow the stochastic matrix PP to change as a function of time). Let [Φ(t,s)]ji[\Phi(t,s)]_{ji} be the jjth entry of the iith column of Φ(t,s)\Phi(t,s). Then via a bit of algebra, we can write

Clearly the above reduces to the standard update (5a) when s=ts=t. Since zˉ(t)\bar{z}(t) evolves simply as in (16), we assume that zi(0)=zˉ(0)z_{i}(0)=\bar{z}(0) to avoid notational clutter—we can simply start with zi(0)=0z_{i}(0)=0—and use (24) to see

We use the fact that gi(t)L\left\|g_{i}(t)\right\|_{*}\leq L for all ii and tt and (25) to see that

Now we break the sum in (26) into two terms separated by a cutoff point t^\widehat{t}. The first term consists of “throwaway” terms, that is, timesteps ss for which the Markov chain with transition matrix PP has not mixed, while the second consists of steps ss for which [Φ(t1,s)]i1 ⁣ ⁣1/n1\left\|{[\Phi(t-1,s)]_{i}-1\!\!1/n}\right\|_{1} is small. Note that the indexing on Φ(t1,s)=Pts+1\Phi(t-1,s)=P^{t-s+1} implies that for small ss, Φ(t1,s)\Phi(t-1,s) is close to uniform. From (23), [Φ(t,s)]j1 ⁣ ⁣1/n1nσ2(P)ts+1\left\|{[\Phi(t,s)]_{j}-1\!\!1/n}\right\|_{1}\leq\sqrt{n}\sigma_{2}(P)^{t-s+1}. Hence, if

Thus, by setting ϵ1=Tn\epsilon^{-1}=T\sqrt{n}, for ts+1log(Tn)logσ2(P)1t-s+1\geq\frac{\log(T\sqrt{n})}{\log\sigma_{2}(P)^{-1}}, we have

For larger ss, we simply have [Φ(t,s)]j1 ⁣ ⁣1/n12\left\|{[\Phi(t,s)]_{j}-1\!\!1/n}\right\|_{1}\leq 2. The above suggests that we split the sum at t^=logTnlogσ2(P)1\widehat{t}=\frac{\log T\sqrt{n}}{\log\sigma_{2}(P)^{-1}}. We break apart the sum in (26) and use (27) to see that since t1(tt^)=t^t-1-(t-\widehat{t})=\widehat{t} and there are at most TT steps in the summation,

The last inequality follows from the concavity of log()\log(\cdot), since logσ2(P)11σ2(P)\log\sigma_{2}(P)^{-1}\geq 1-\sigma_{2}(P).

Combining (28) with the running sum bound in (22) of the proof of the basic theorem, Theorem 1, we immediately see that for xXx^{*}\in\mathcal{X},

Appealing to Lemma 3 allows us to obtain the same result on the sequence xi(t)x_{i}(t) with slightly worse constants. Note that t=1Tt1/22T1\sum_{t=1}^{T}t^{-1/2}\leq 2\sqrt{T}-1. Thus, using the assumption that ψ(x)R2\psi(x^{*})\leq R^{2}, using convexity to bound f(y^(T))1Tt=1Tf(y(t))f(\widehat{y}(T))\leq\frac{1}{T}\sum_{t=1}^{T}f(y(t)) (and similarly for x^i(T)\widehat{x}_{i}(T)), and setting α(t)\alpha(t) as in the statement of the theorem completes the proof.

2 Proof of Corollary 1

The corollary is based on bounding the spectral gap of the matrix Pn(G)P_{n}(G) from equation (8).

The matrix Pn(G)P_{n}(G) satisfies the bound

Proof By a theorem of Ostrowski on congruent matrices (cf. Theorem 4.5.9, [HJ85]), we have

Since LD1/21 ⁣ ⁣1=0\mathcal{L}D^{1/2}1\!\!1=0, we have λn(L)=0\lambda_{n}(\mathcal{L})=0, and so it suffices to focus on λ1(D1/2LD1/2)\lambda_{1}(D^{1/2}\mathcal{L}D^{1/2}) and λn1(D1/2LD1/2)\lambda_{n-1}(D^{1/2}\mathcal{L}D^{1/2}). From the definition (8), the eigenvalues of PP are of the form 1(δmax+1)1λk(D1/2LD1/2)1-(\delta_{\operatorname{max}}+1)^{-1}\lambda_{k}(D^{1/2}\mathcal{L}D^{1/2}). The bound (30) coupled with the fact that all the eigenvalues of L\mathcal{L} are non-negative implies that \sigma_{2}(P)=\max_{k<n}\big{\{}\big{|}1-(\delta_{\operatorname{max}}+1)^{-1}\lambda_{k}(D^{1/2}\mathcal{L}D^{1/2})\big{|}\big{\}} is upper bounded by the larger of

Much of spectral graph theory is devoted to bounding λn1(L)\lambda_{n-1}(\mathcal{L}) sufficiently far away from zero, and Lemma 4 allows us to conveniently leverage such results for bounding the convergence rate of our algorithm.

Note that computing the upper bound in Lemma 4 requires controlling both λn1(L)\lambda_{n-1}(\mathcal{L}) and λ1(L)\lambda_{1}(\mathcal{L}). In order to circumvent this complication, we use the well-known idea of a “lazy” random walk [Chu98, LPW08], in which we replace PP by 12(I+P)\frac{1}{2}(I+P). The resulting symmetric matrix has the same eigenstructure as PP, and moreover, we have

Consequently, it is sufficient to bound only λn1(L)\lambda_{n-1}(\mathcal{L}), which is more convenient from a technical standpoint. The convergence rate implied by the lazy random walk through Theorem 2 is no worse than twice that of the original walk, which is insignificant for the analysis in this paper.

We are now equipped to address each of the graph classes covered by Corollary 1.

Recall the regular kk-connected cycle from Figure 1(a), constructed by placing the nn nodes on a circle and connecting every node to kk neighbors on the right and left. For this graph, the Laplacian L\mathcal{L} is a circulant matrix with diagonal entries 11 and off-diagonal non-zero entries 1/2k-1/2k. Known results on circulant matrices (see Chapter 3 of Gray [Gra06]) imply that it has mmth eigenvalue

For m=n1m=n-1 and k=o(n)k=o(n), the last equation can be massaged into [BGPS06, Section VI.A]

By performing a Taylor expansion of cos()\cos(\cdot), we see that λn1(L)=Θ(k2n2)\lambda_{n-1}(\mathcal{L})=\Theta\left(\frac{k^{2}}{n^{2}}\right) for k=o(n)k=o(n).

Now consider the regular kk-connected path, a path in which each node is connected to the kk neighbors on its right and left. By computing Cheeger constants (see Lemma 6 in Appendix C), we see that if knk\leq\sqrt{n}, then λn1(L)=Θ(k2/n2)\lambda_{n-1}(\mathcal{L})=\Theta(k^{2}/n^{2}). Note also that for the kk-connected path on nn nodes, miniδi=k\min_{i}\delta_{i}=k and δmax=2k\delta_{\operatorname{max}}=2k. Thus, we can combine the previous two paragraphs with Lemma 4 to see that for regular kk-connected paths or cycles with knk\leq\sqrt{n},

Substituting the bound (32) into Theorem 2 yields the claim of Corollary 1(a).

Now consider the case of a n\sqrt{n}-by-n\sqrt{n} grid, focusing specifically on regular kk-connected grids, in which any node is joined to every node that is fewer than kk horizontal or vertical edges away in an axis-aligned direction. In this case, we use results on Cartesian products of graphs [Chu98, Section 2.6] to analyze the eigen-structure of the Laplacian. In particular, the toroidal n\sqrt{n}-by-n\sqrt{n} kk-connected grid is simply the Cartesian product of two regular kk-connected cycles of n\sqrt{n} nodes. The second smallest eigenvalue of a Cartesian product of graphs is half the minimum of second-smallest eigenvalues of the original graphs [Chu98, Theorem 2.13]. Thus, based on the preceding discussion of kk-connected cycles, we conclude that if k=o(n)k=o(\sqrt{n}), then we have λn1(L)=Θ(k2/n)\lambda_{n-1}(\mathcal{L})=\Theta(k^{2}/n). For a non-toroidal n\sqrt{n}-by-n\sqrt{n} grid (in which the network is not “wrapped” on its boundaries, as in Figure 1(b)), we use the previous discussion of regular kk-connected paths, since the grid is the Cartesian product of two kk-connected paths of n\sqrt{n} nodes. We immediately see that λn1(L)=Θ(k2/n)\lambda_{n-1}(\mathcal{L})=\Theta(k^{2}/n). In both cases, for n\sqrt{n}-by-n\sqrt{n} kk-connected grids, we use Lemma 4 and (31) to see that for kn1/4k\leq n^{1/4},

The result in Corollary 1(b) immediately follows.

Using the proof of Lemma 10 from Boyd et al. [BGPS06], we see that for any ϵ>0\epsilon>0, if r=log1+ϵn/(nπ)r=\sqrt{\log^{1+\epsilon}n/(n\pi)}, then with probability at least 12/nc11-2/n^{c-1},

Thus, letting L\mathcal{L} be the graph Laplacian of a random geometric graph, if we can bound λn1(L)\lambda_{n-1}(\mathcal{L}), (34) coupled with Lemma 4 will control the convergence rate of our algorithm.

Recent work of von Luxburg et al. [vLRH10] gives concentration results on the second-smallest eigenvalue of a geometric graph. In particular, their Theorem 3 says that there are universal constants c1,,c5>0c_{1},\ldots,c_{5}>0 such that with probability at least 1c1nexp(c2nr2)c3exp(c4nr2)/r21-c_{1}n\exp(-c_{2}nr^{2})-c_{3}\exp(-c_{4}nr^{2})/r^{2}, λn1(L)c5r2\lambda_{n-1}(\mathcal{L})\geq c_{5}r^{2}. Parsing this a bit, we see that if r=ω(logn/n)r=\omega(\sqrt{\log n/n}), then with exceedingly high probability, λn1(L)=Ω(r)=ω(logn/n)\lambda_{n-1}(\mathcal{L})=\Omega(r)=\omega(\log n/n). Using (34), we see that for r=(log1+ϵn/n)1/2r=(\log^{1+\epsilon}n/n)^{1/2},

with high probability. Combining the above equation with Lemma 4 and (31), we have

Thus we have obtained the result of Corollary 1(c). Our bounds show that a grid and a random geometric graph exhibit the same convergence rate up to logarithmic factors.

The constant spectral gap in expanders [Chu98, Chapter 6] removes any penalty due to network communication (up to logarithmic factors), and hence yields Corollary 1(d).

3 Proof of Proposition 1

We now give a proof of Proposition 1, which shows that the dependence of our convergence rates on the spectral gap is tight. The proof is based on construction of a set of objective functions fif_{i} that force convergence to be slow by using the second eigenvector of the communication matrix PP.

for some constant c>0c>0 to be chosen. Note that each fif_{i} is c+1c+1-Lipschitz. By construction, we see immediately that x=1x^{*}=-1 is optimal for the global problem.

In order to establish a lower bound, it suffices to show that at least one node is far from the optimum after tt steps, and we focus on node 11. Since w1=1w_{1}=-1, the evolution (36) guarantees that

Recalling that ψ(x)=12x2\psi(x)=\frac{1}{2}x^{2} for this scalar setting, we have

Hence x1(t)x_{1}(t) is the projection of α(t)z1(t+1)-\alpha(t)z_{1}(t+1) onto $,andunless, and unlessz_{1}(t)>0$ we have

If tt is overly small, the relation (37) will guarantee that z1(t)0z_{1}(t)\leq 0, so that x1(t)x_{1}(t) is far from the optimum. If we choose c1/3c\leq 1/3, then a little calculation shows that we require t=Ω((1σ2(P))1)t=\Omega((1-\sigma_{2}(P))^{-1}) in order to drive z1(t)z_{1}(t) below zero.

Convergence rates for stochastic communication

In this section, we develop theory appropriate for stochastic and time-varying communication, which we model by a sequence {P(t)}t=0\{P(t)\}_{t=0}^{\infty} of random matrices. We begin in Section 7.1 with basic convergence results in this setting, and then prove Theorem 3. Section 7.2 contains a more detailed treatment of the case of gossip algorithms, and Section 7.3 contains the setting of edge failures.

Recall that Theorem 1 involves the sum 2Lnt=1Ti=1nα(t)zˉ(t)zi(t)\frac{2L}{n}\sum_{t=1}^{T}\sum_{i=1}^{n}\alpha(t)\left\|\bar{z}(t)-z_{i}(t)\right\|_{*}. In Section 6, we showed how to control this sum when communication between agents occurs on a static underlying network structure via a doubly-stochastic matrix PP. We now relax the assumption that PP is fixed and instead let P(t)P(t) vary over time.

We use P(t)=[p1(t)  pn(t)]P(t)=[p_{1}(t)~{}\cdots~{}p_{n}(t)] to denote the doubly stochastic symmetric matrix at iteration tt. The update employed by the algorithm, modulo changes in PP, is given by the usual updates (5a) and (5b)—namely,

In this case, our analysis makes use of the modified definition Φ(t,s)=P(s)P(s+1)P(t)\Phi(t,s)=P(s)P(s+1)\cdots P(t). However, we still have the evolution of zˉ(t+1)=zˉ(t)1ni=1ngi(t)\bar{z}(t+1)=\bar{z}(t)-\frac{1}{n}\sum_{i=1}^{n}g_{i}(t) from equation (16), and moreover, (25) holds essentially unchanged:

To show convergence for the random communication model, we must control the convergence of Φ(t,s)\Phi(t,s) to the uniform distribution. We first claim that

which we establish by recalling and modifying a few known results [BGPS06].

Let Δn\Delta_{n} denote the nn-dimensional probability simplex and the vector u(0)Δnu(0)\in\Delta_{n} be arbitrary. Consider the random sequence {u(t)}t=0\{u(t)\}_{t=0}^{\infty} generated by the recursion u(t+1)=P(t)u(t)u(t+1)=P(t)u(t). Let v(t)  :=u(t)1 ⁣ ⁣1/nv(t)\;:\,=u(t)-1\!\!1/n correspond to the portion of u(t)u(t) orthogonal to the all 11s vector. Calculating the second moment of v(t+1)v(t+1), we have

since v(t),1 ⁣ ⁣1=0\left\langle v(t),1\!\!1\right\rangle=0, v(t)v(t) is orthogonal to the first eigenvector of P(t)P(t), and P(t)P(t) is symmetric. Applying Chebyshev’s inequality yields

Replacing u(0)u(0) with eie_{i} and noting that ei1 ⁣ ⁣1/n21\|e_{i}-1\!\!1/n\|_{2}\leq 1 yields the claim (39).

1.2 Proof of Theorem 3

Using the probabilistic bound (39), note that

then we are guaranteed that if tst^1t-s\geq\widehat{t}-1, then

It remains to bound the sum SS. For any fixed pair s<ss^{\prime}<s, since the matrices P(t)P(t) are doubly stochastic, we have

where the final inequality uses the bound  ⁣ ⁣Φ(s1,s) ⁣ ⁣21\left|\!\left|\!\left|{\Phi(s-1,s^{\prime})}\right|\!\right|\!\right|_{2}\leq 1. From the bound (40), we have the bound \big{\|}{\Phi(t-1,t-\widehat{t}-1)e_{i}-1\!\!1/n}\big{\|}_{2}\leq\frac{1}{T^{2}n} with probability at least 11/(T2n)1-1/(T^{2}n). Since ss ranges between 11 and tt^t-\widehat{t} in the summation SS, we conclude that

with probability at least 11/(T2n)1-1/(T^{2}n). Applying the union bound over all iterations t=1,,Tt=1,\ldots,T and nodes i=1,,ni=1,\ldots,n, we obtain

Recalling the master bound from Theorem 1 completes the proof.

In the remainder of this section, we give some applications of the stochastic framework outlined above, showing a few sampling schemes and giving bounds on their convergence rates.

2 Gossip-like protocols

Gossip algorithms are procedures for achieving consensus in a network robustly and quickly by randomly selecting one edge (i,j)(i,j) in the network for communication at each iteration [BGPS06]. Once nodes ii and jj are selected, their values are averaged. Gossip algorithms drastically reduce communication in the network, yet they still enjoy fast convergence and are robust to changes in topology.

In a partially asynchronous iterative method, agents synchronize their iterations [BT89]. This is the model of standard gossip protocols, where computation proceeds in rounds, and in each round communication occurs on one random edge. In our framework, this corresponds to using the random transition matrix P(t)=I12(eiej)(eiej)TP(t)=I-\frac{1}{2}(e_{i}-e_{j})(e_{i}-e_{j})^{T}. It is clear that P(t)TP(t)=P(t)P(t)^{T}P(t)=P(t), since P(t)P(t) is a projection matrix.

Let AA be the adjacency matrix of the graph GG and DD be the diagonal matrix of its degrees as in Section 6.2. At round tt, edge (i,j)(i,j) (with Aij=1A_{ij}=1) is chosen with probability 1/1 ⁣ ⁣1,A1 ⁣ ⁣11/\left\langle 1\!\!1,A1\!\!1\right\rangle. Thus,

since (i,j):Aij=1(eiej)(eiej)T=2(DA)\sum_{(i,j):A_{ij}=1}(e_{i}-e_{j})(e_{i}-e_{j})^{T}=2(D-A). Using an identical argument as that for Lemma 4, we see that (42) implies that

Note that 1 ⁣ ⁣1,A1 ⁣ ⁣1=1 ⁣ ⁣1,D1 ⁣ ⁣1\left\langle 1\!\!1,A1\!\!1\right\rangle=\left\langle 1\!\!1,D1\!\!1\right\rangle, so that for approximately regular graphs, 1 ⁣ ⁣1,A1 ⁣ ⁣1nδmax\left\langle 1\!\!1,A1\!\!1\right\rangle\approx n\delta_{\operatorname{max}}, and miniδi/1 ⁣ ⁣1,A1 ⁣ ⁣11/n\min_{i}\delta_{i}/\left\langle 1\!\!1,A1\!\!1\right\rangle\approx 1/n. Thus, at the expense of a factor of roughly 1/n1/n in convergence rate, we can reduce the number of messages sent per round from the number of edges in the graph, Θ(nδmax)\Theta(n\delta_{\operatorname{max}}), to one. In a clustered computing environment with some centralized control, it is possible to select more than one edge per round so long as no two edges share vertices (for example, by selecting a random maximal matching) and still have P(t)TP(t)=P(t)P(t)^{T}P(t)=P(t). For a δ\delta-regular graph, choosing a random maximal matching achieves a spectral gap within constant factors of the spectral gap of the underlying graph but uses only Θ(1/δ)\Theta(1/\delta) as much communication.

2.2 Totally asynchronous gossip protocol

Now we relax the assumption that agents have synchronized clocks, so the iterations of the algorithm are no longer synchronized. Suppose that each agent has a random clock ticking at real-valued times, and at each clock tick, the agent randomly chooses one of its neighbors to communicate with. Further assume that each agent computes an iterative approximation to gifi(xi(t))g_{i}\in\partial f_{i}(x_{i}(t)), and that the approximation is always unbiased (an example of this is when fif_{i} is the sum of several functions, and agent ii simply computes the subgradient of each function sequentially). We assume that no two agents have clocks tick at the same time. This communication corresponds to a gossip protocol with stochastic subgradients, and its convergence can be described simply by combining (42) with Theorem 4. This type of algorithm is well-suited to completely decentralized environments, such as sensor networks.

3 Random edge inclusion and failure

Using (43), we see that the spectral gap decreases (and hence convergence rate may slow) by a factor proportional to the maximum degree in the graph. This is not surprising, since the amount of communication performed decreases by the same factor.

A related model we can analyze is that of a network in which at every time step of the algorithm, an edge fails with probability ρ\rho independently of the other edges. We assume we are using the model of communication in the prequel, so P(t)=I(δmax+1)1D(t)+(δmax+1)1A(t)P(t)=I-(\delta_{\operatorname{max}}+1)^{-1}D(t)+(\delta_{\operatorname{max}}+1)^{-1}A(t). Let AA, DD, and PP be as before and L\mathcal{L} be the Laplacian of the underlying graph; we easily have

Stochastic Gradient Optimization

We begin by using convexity and the Lipschitz continuity of the fif_{i} (see equations (18) and (19)), thereby obtaining that the running sum S(T)=t=1Tf(y(t))f(x)S(T)=\sum_{t=1}^{T}f(y(t))-f(x^{*}) is upper bounded as

We bound the first two terms of (44) using the same derivation as that for Theorem 1. In particular, i=1ng^i(t),xi(t)x=i=1ng^i(t),y(t)x+i=1ng^i(t),xi(t)y(t)\sum_{i=1}^{n}\left\langle\widehat{g}_{i}(t),x_{i}(t)-x^{*}\right\rangle=\sum_{i=1}^{n}\left\langle\widehat{g}_{i}(t),y(t)-x^{*}\right\rangle+\sum_{i=1}^{n}\left\langle\widehat{g}_{i}(t),x_{i}(t)-y(t)\right\rangle, and nothing in Lemma 2 assumes that g^i(t)\widehat{g}_{i}(t) is related to fi(xi(t))f_{i}(x_{i}(t)). So we upper bound the first term in (44) with

Further, xi(t)Ft1x_{i}(t)\in\mathcal{F}_{t-1} and y(t)Ft1y(t)\in\mathcal{F}_{t-1} by assumption for j[n]j\in[n] and st1s\leq t-1, so

Recalling that xi(t)y(t)α(t)zˉ(t)zi(t)\left\|x_{i}(t)-y(t)\right\|\leq\alpha(t)\left\|\bar{z}(t)-z_{i}(t)\right\|_{*}, we proceed by putting expectations around the norm terms in (26) and (28) to see that

Coupled with the above arguments, we can bound the expectation of (44) by

Taking the expectation for the final term in the bound (46), we recall that xi(t)Ft1x_{i}(t)\in\mathcal{F}_{t-1}, so

which completes the proof of the first statement of the theorem.

To show that the statement holds with high-probability when X\mathcal{X} is compact and g^i(t)L\left\|\widehat{g}_{i}(t)\right\|_{*}\leq L, it is sufficient to establish that the sequence gi(t)g^i(t),xi(t)x\langle g_{i}(t)-\widehat{g}_{i}(t),x_{i}(t)-x^{*}\rangle is a bounded martingale, and then apply Azuma’s inequality [Azu67]. (Here we are exploiting the fact that under compactness and bounded norm conditions, our previous bounds on terms in the decomposition (45) now hold for the analogous terms in the decomposition (46) without taking expectations.)

By assumption on the compactness of X\mathcal{X} and the Lipschitz assumptions on fif_{i}, we have

Recalling (47), we conclude that the last sum in the decomposition (46) is a bounded difference martingale, and Azuma’s inequality implies that

Dividing by TT and setting the probability above equal to δ\delta, we obtain that with probability at least 1δ1-\delta,

The second statement of the theorem is now obtained by appealing to Lemma 3. By convexity, we have f(x^i(T))1Tt=1Tf(xi(t))f(\widehat{x}_{i}(T))\leq\frac{1}{T}\sum_{t=1}^{T}f(x_{i}(t)), thereby completing the proof.

Proving the last statement of the theorem—the concentration result with uncorrelated noise at each node—requires a martingale extension of Bernstein’s inequality [Fre75]. Indeed, one form of Freedman’s inequality states that if X1,,XTX_{1},\ldots,X_{T} is a martingale difference sequence, XiB|X_{i}|\leq B uniformly, and Vt=1TVar(XtFt1)V\geq\sum_{t=1}^{T}\mathop{\rm Var}(X_{t}\mid\mathcal{F}_{t-1}), then for any v,ϵ>0v,\epsilon>0,

To extend the above bound to our setting, we recall that 1ni=1ngi(t)g^i(t),xi(t)x\frac{1}{n}\sum_{i=1}^{n}\left\langle g_{i}(t)-\widehat{g}_{i}(t),x_{i}(t)-x^{*}\right\rangle is Martingale difference sequence uniformly bounded by 2LR2LR. Further, since the expectation is zero, we have

The decorrelation equality in (48) follows by our assumption that g^i(t)\widehat{g}_{i}(t) and g^j(t)\widehat{g}_{j}(t) are uncorrelated given Ft1\mathcal{F}_{t-1}, and that xi(t)x_{i}(t), gi(t)g_{i}(t), and xFt1x^{*}\in\mathcal{F}_{t-1}. Substituting 4TL2R2/n4TL^{2}R^{2}/n as an upper bound for the variance in Freedman’s inequality, we have

To find a δ\delta so that exp()\exp(\cdot) term is less than or equal to δ\delta, we solve

Solving the above quadratic in ϵ\epsilon, we have equality with zero for

In particular, noting that a+ba+b\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}, it is sufficient that

for the inequality in (49) to be satisfied. Thus with probability at least 1δ1-\delta,

Dividing by TT completes the proof of the last statement of Theorem 4.

Simulations

In this section, we report experimental results on the network scaling behavior of the distributed dual averaging algorithm as a function of the graph structure and number of processors nn. These results illustrate the excellent agreement of the empirical behavior with our theoretical predictions.

Figure 2 provides plots of the function error maxi[f(x^i(T)f(x)]\max_{i}[f(\widehat{x}_{i}(T)-f(x^{*})] versus the number of iterations for grid graphs with a varying number of nodes n{225,400,625}n\in\{225,400,625\}. In addition to demonstrating convergence, these plots also show how the convergence time scales as a function of the graph size nn. In particular, for a given class of optimization problems, define TG(ϵ;n)T_{G}(\epsilon;n) to be the number of iterations required to obtain ϵ\epsilon-accurate solution for a graph GG with nn nodes. As shown in Figure 2, for any fixed ϵ>0\epsilon>0, the function TG(ϵ;n)T_{G}(\epsilon;n) shifts to the right as nn is increased, and the goal of network scaling analysis is to gain a precise understanding of this shifting.

As discussed following Corollary 1, for cycles, grids, and expanders, we have the following upper bounds on the quantity TG(ϵ;n)T_{G}(\epsilon;n):

In Figure 3, we compare these theoretical predictions with the actual behavior of dual subgradient averaging. Each panel shows the function TG(ϵ;n)T_{G}(\epsilon;n) versus the graph size nn for the fixed value ϵ=0.1\epsilon=0.1; the three different panels correspond to different graph types: cycles (a), grids (b) and expanders (c). In each panel, each point on the blue curve is the average of 2020 trials, and the bars show standard errors. For comparison, the dotted black line shows the theoretical prediction (51). Note that the agreement between the empirical behavior and theoretical predictions is excellent in all cases. In particular, panel (a) exhibits the quadratic scaling predicted for the cycle, panel (b) exhibits the the linear scaling expected for the grid, and panel (c) shows that expander graphs have the desirable property of having constant network scaling.

Though our focus in this paper is mostly a theoretical one, in our final set of experiments we compare the distributed dual averaging method (DDA) that we present to the Markov incremental gradient descent (MIGD) method [JRJ09] and the distributed projected gradient method [RNV10], which seem to have the sharpest convergence rates currently in the literature. In Figure 4, we plot the quantity TG(ϵ;n)T_{G}(\epsilon;n) versus graph size nn for DDA and MIGD on grid and expander graphs. We use the optimal stepsize α(t)\alpha(t) suggested by the analyses for each method. (We do not plot results for the distributed projected gradient method [RNV10] because the optimal choice of stepsize according to the analysis therein results in such slow convergence that it does not fit on the plots.) Fig. 4 makes it clear that—especially on graphs with good connectivity properties such as the expander in Fig. 4(b)—the dual averaging algorithm gives improved performance.

Conclusions and Discussion

In this paper, we proposed and analyzed a distributed dual averaging algorithm for minimizing the sum of local convex functions over a network. It is computationally efficient, and we provided a sharp analysis of its convergence behavior as a function of the properties of the optimization functions and the underlying network topology. Our analysis demonstrates a close connection between convergence rates and mixing times of random walks on the underlying graph; such a connection is natural given the local and graph-constrained nature of our updates. In addition to analysis of deterministic updates, our results also include the case of stochastic communication protocols, for instance when communication occurs only along a random subset of the edges at each round. Such extensions allow for the design of protocols that provide interesting tradeoffs between the amount of communication and convergence rates. We also demonstrate that our algorithm is robust to noise by providing an analysis for the case of stochastic optimization with noisy gradients. We confirmed the sharpness of our theoretical predictions by implementation and simulation of our algorithm.

There are several interesting open questions that remain to be explored. For instance, it would be interesting to analyze the convergence properties of other kinds of network-based optimization problems, by combining local information in different structures. It would also be of interest to study what other optimization procedures from the standard setting can be converted into efficient distributed algorithms to better exploit problem structure when possible.

AA was supported by a Microsoft Research Fellowship and JCD was supported by the National Defense Science and Engineering Graduate Fellowship (NDSEG) Program. In addition, MJW was partially supported by NSF-CAREER-0545862 and AFOSR-09NL184. We thank John Tsitsiklis for a careful reading of the paper and providing helpful comments and several anonymous reviewers for useful feedback.

Appendix A The Dual Averaging Algorithm

In this section, we give a simple convergence proof for the basic (non-distributed) dual averaging algorithm (3). In particular, we recall the updates

which is compact since X\mathcal{X} is closed and ψ\psi is strongly convex. Thus, since the supremum is uniquely attained and z,x\left\langle z,x\right\rangle is differentiable in zz, ψα(z)=ΠXψ(z,α)\nabla\psi^{*}_{\alpha}(-z)=\Pi_{\mathcal{X}}^{\psi}(z,\alpha) [HUL96a, Theorem 4.4.2].

This fact has two important consequences. First, since the projection is Lipschitz-continuous (see Lemma 5), we have the bound

Consequently, an integration argument (e.g., [Nes04, Lemma 1.2.3]) yields the upper bound

To bound the sequence of inner products, we note that for any xXx^{*}\in\mathcal{X}, we have

Finally, we combine the upper bound on g(t),x(t)\left\langle g(t),x(t)\right\rangle from equation (55) with the earlier bound (54), thereby obtaining that for any xXx^{*}\in\mathcal{X}, the sum S(T)=t=1Tg(t),x(t)xS(T)=\sum_{t=1}^{T}\left\langle g(t),x(t)-x^{*}\right\rangle is upper bounded as

The last line exploited the facts that z(1)=0z(1)=0 and ψα(0)=0\psi^{*}_{\alpha}(0)=0. This completes the proof of the claim.

A.2 Proof of Lemma 3

Via the LL-Lipschitz continuity of the fif_{i}, we can write

For the second bound, we again use the LL-Lipschitz continuity of the fif_{i} and the triangle inequality,

Lipschitz-continuity of the projection (Lemma 5) shows that xi(t)y(t)α(t)zˉ(t)zi(t)\left\|x_{i}(t)-y(t)\right\|\leq\alpha(t)\left\|\bar{z}(t)-z_{i}(t)\right\|_{*} which gives both the desired results.

A.3 Lipschitz continuity of projections

The following lemma on the Lipschitz-continuity of the projection operator is well-known, but we state and prove it for completeness.

Setting y=xy=x and y=wy=w in these two inequalities (respectively) yields

Adding the above two inequalities, we obtain the bound

On the other hand the strong convexity of ψ\psi implies that ψ(w)ψ(x)+ψ(x),wx+12xw2\psi(w)\geq\psi(x)+\left\langle\psi(x),w-x\right\rangle+\frac{1}{2}\left\|x-w\right\|^{2}, with an analogous bound with the roles of xx and ww exchanged. Some algebra then leads to

which, when combined with (56), gives the desired result. ∎

Appendix B Background on stochastic matrices

If we take x=eix=e_{i}, denoting a canonical basis vector for i=1,,ni=1,\dots,n, then we see that PtLσ2(P)t\left\|{P^{t}-L}\right\|_{\infty}\leq\sigma_{2}(P)^{t}. Taking xΔnx\in\Delta_{n}, the nn-dimensional simplex, gives

which establishes the bound (23). (The n\sqrt{n} factor in the bound is standard in the Markov chain literature, e.g., [DS91, Proposition 3].)

Appendix C Eigenvalues of paths

Let GG be a graph and SS be a subset of the nodes in the graph. Let E(S,Sc)E(S,S^{c}) denote the set of edges crossing between SS and ScS^{c}, and let the volume of SS be the sum of the degrees of the nodes in SS, that is, vol(S)=iSδi\mathop{\rm vol}(S)=\sum_{i\in S}\delta_{i}. The Cheeger constant of a graph GG is defined as

If L\mathcal{L} is the Laplacian of GG, then 2hGλn1(L)>12hG22h_{G}\geq\lambda_{n-1}(\mathcal{L})>\frac{1}{2}h_{G}^{2} (e.g., see Lemma 2.1 and Theorem 2.2 in Chung [Chu98]).

Let GG be a kk-connected path with nn nodes and knk\leq\sqrt{n}. Then its normalized graph Laplacian L\mathcal{L} satisfies λn1(L)=Θ(k2/n2)\lambda_{n-1}(\mathcal{L})=\Theta(k^{2}/n^{2}).

Proof We invoke Theorem 4.13 in Chung [Chu98] to conclude that λn1(L)=O(k2/n2)\lambda_{n-1}(\mathcal{L})=\mathcal{O}(k^{2}/n^{2}), since GG is a subgraph of the kk-connected cycle. It thus suffices to prove that the Cheeger constant is lower bounded as hG=Ω(k/n)h_{G}=\Omega(k/n).

Let SS be the set of nodes achieving the minimum in the definition (57). To make the rest of the proof easier, assume that the degree of each node is 2k2k. (We may do so without loss of generality, since it only has the effect of increasing vol(S)\mathop{\rm vol}(S) and vol(Sc)\mathop{\rm vol}(S^{c}) in the Cheeger constant calculation, and so any Cheeger constant calculated under this assumption lower bounds the true Cheeger constant.)

Appendix D Composite Objectives

In this section, we show how to generalize the dual averaging algorithm to incorporate composite objectives, specifically those of the form f+φf+\varphi for known φ\varphi. Though it is possible to perform similar derivations to those in Lemma 2, for brevity we refer to recent work of Xiao [Xia10]. Nonetheless, the algorithm is conceptually very similar to the dual averaging algorithm (updates (5a) and (5b)), and equally as simple to write. We assume that φ\varphi is closed convex and non-negative, and X\mathcal{X} is closed. We define the composite projection operator ΠXt\Pi_{\mathcal{X}}^{t} as

The mapping ΠXt\Pi_{\mathcal{X}}^{t} is α(t)\alpha(t)-Lipschitz with respect to \left\|\cdot\right\| and \left\|\cdot\right\|_{*}, that is,

As in Lemma 5, (59) is a consequence of the fact that the conjugate dual of a 1/α(t)1/\alpha(t)-strongly convex function has α(t)\alpha(t)-Lipschitz continuous gradient with respect to the associated dual norm, and the gradient of the conjugate of tφ(x)+1α(t)ψ(x)t\varphi(x)+\frac{1}{\alpha(t)}\psi(x) is simply ΠXt(z)\Pi_{\mathcal{X}}^{t}(z) [HUL96b, Theorem X.4.2.1].

The distributed algorithm based on the update (58) is essentially identical to the dual averaging algorithm discussed in the main body of the paper. Each agent ii maintains the gradient vector

As in (16), we have zˉ(t+1)=zˉ(t)1ni=1ngj(t)\bar{z}(t+1)=\bar{z}(t)-\frac{1}{n}\sum_{i=1}^{n}g_{j}(t). The following proposition, a simplification of [Xia10, Section B.2], allows us to give a convergence guarantee for the algorithm described by (60) and (61).

The above proposition, combined with the techniques used to derive Theorem 1, allow us to easily prove convergence of distributed composite-objective dual averaging. As earlier, let y(t)=ΠXt(zˉ(t))y(t)=\Pi_{\mathcal{X}}^{t}(-\bar{z}(t)), and assume that the fif_{i} are LL-Lipschitz with respect to \left\|\cdot\right\|. Then as in (18), (19), and (20), for any xXx^{*}\in\mathcal{X}, we immediately have

By definition of y(t)y(t), we see that Proposition 2 bounds the above by

Finally, we use the fact that the mapping ΠXt\Pi_{\mathcal{X}}^{t} is α(t)\alpha(t)-Lipschitz to see that for the distributed composite-objective projection algorithm of (60) and (61),

Any of the techniques in the prequel can be used to bound (62).

References