Distributed Delayed Stochastic Optimization

Alekh Agarwal, John C. Duchi

Introduction

We focus on stochastic convex optimization problems of the form

$$\min_{x\in\mathcal{X}}f(x)\quad\text{for}\quad f(x):=\mathbb{E}_{P}[F(x;\xi)]. \tag{1}$$

Our model of delayed gradient information is particularly relevant in distributed optimization scenarios, where a master maintains the parameters $x$ while workers compute stochastic gradients of the objective (1). The architectural assumption of a master with several worker nodes is natural for distributed computation, and other researchers have considered models similar to those in this paper [NBB01, LSZ09]. By allowing delayed and asynchronous updates, we can avoid synchronization issues that commonly handicap distributed systems.

Certainly distributed optimization has been studied for several decades, tracing back at least to seminal work of Tsitsiklis and colleagues ([Tsi84, BT89]) on minimization of smooth functions where the parameter vector is distributed. More recent work has studied problems in which each processor or node $i$ in a network has a local function $f_{i}$ , and the goal is to minimize the sum $f(x)=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x)$ [NO09, RNV10, JRJ09, DAW10]. Most prior work assumes as a constraint that data lies on several different nodes throughout a network. However, as Dekel et al. [DGBSX10a] first noted, in distributed stochastic settings independent realizations of a stochastic gradient can be computed concurrently, and it is thus possible to obtain an aggregated gradient estimate with lower variance. Using modern stochastic optimization algorithms (e.g. [JNT08, Lan10]), Dekel et al. give a series of reductions to show that in an $n$ -node network it is possible to achieve a speedup of $\mathcal{O}(n)$ over a single-processor so long as the objective $f$ is smooth.

Our work is closest to Nedić et al.’s asynchronous subgradient method [NBB01], which is an incremental gradient procedure in which gradient projection steps are taken using out-of-date gradients. See Figure 1 for an illustration. The asynchronous subgradient method performs non-smooth minimization and suffers an asymptotic penalty in convergence rate due to the delays: if the gradients are computed with a delay of $\tau$ , then the convergence rate of the procedure is $\mathcal{O}(\sqrt{\tau/T})$ . The setting of distributed optimization provides an elegant illustration of the role played by the delay in convergence rates. As in Fig. 1, the delay $\tau$ can essentially be of order $n$ in Nedić et al.’s setting, which gives a convergence rate of $\mathcal{O}(\sqrt{n/T})$ . A simple centralized stochastic gradient algorithm attains a rate of $\mathcal{O}(1/\sqrt{T})$ , which suggests something is amiss in the distributed algorithm. Langford et al. [LSZ09] rediscovered Nedić et al.’s results and attempted to remove the asymptotic penalty by considering smooth objective functions, though their approach has a technical error (see Appendix C), and even so they do not demonstrate any provable benefits of distributed computation. We analyze similar asynchronous algorithms, but we show that for smooth stochastic problems the delay is asymptotically negligible—the time $\tau$ does not matter—and in fact, with parallelization, delayed updates can give provable performance benefits.

We build on results of Dekel et al. [DGBSX10a], who show that when the objective $f$ has Lipschitz-continuous gradients and $n$ processors compute stochastic gradients in parallel using a common parameter $x$, it is possible to achieve convergence rate $\mathcal{O}(1/\sqrt{Tn})$ so long as the processors are synchronized (under appropriate synchrony conditions, this holds nearly independently of network topology). A variant of their approach is asymptotically robust to asynchrony so long as most processors remain synchronized for most of the time [DGBSX10b]. We show results similar to their initial discovery, but we analyze the effects of asynchronous gradient updates where all the nodes in the network can suffer delays. Application of our main results to the distributed setting provides convergence rates in terms of the number of nodes $n$ in the network and the stochastic process governing the delays. Concretely, we show that under different assumptions on the network and delay process, we achieve convergence rates ranging from $\mathcal{O}(n^{3}/T+1/\sqrt{Tn})$ to $\mathcal{O}(n/T+1/\sqrt{Tn})$, which is $\mathcal{O}(1/\sqrt{nT})$ asymptotically in $T$. For problems with large $n$, we demonstrate faster rates ranging from $\mathcal{O}((n/T)^{2/3}+1/\sqrt{Tn})$ to $\mathcal{O}(1/T^{2/3}+1/\sqrt{Tn})$. In either case, the time necessary to achieve an $\epsilon$-optimal solution to the problem (1) is asymptotically $\mathcal{O}(1/(n\epsilon^{2}))$, a factor of $n$ (the size of the network) better than a centralized procedure in spite of delay.

The remainder of the paper is organized as follows. We begin by reviewing known algorithms for solving the stochastic optimization problem (1) and stating our main assumptions. Then in Section 3 we give abstract descriptions of our algorithms and state our main theoretical results, which we make concrete in Section 4 by formally placing the analysis in the setting of distributed stochastic optimization. We complement the theory in Section 5 with experiments on a real-world dataset, and proofs follow in the remaining sections.

For the reader’s convenience, we collect our (mostly standard) notation here. We denote general norms by $\left\|\cdot\right\|$, and the dual norm $\left\|\cdot\right\|_{*}$ to the norm $\left\|\cdot\right\|$ is defined as $\left\|z\right\|_{*}:=\sup_{x:\left\|x\right\|\leq 1}\left\langle z,x\right\rangle$. The subdifferential set of a function $f$ is

$$\partial f(x):=\left\{g\in\mathbb{R}^{d}\mid f(y)\geq f(x)+\left\langle g,y-x\right\rangle\ \text{for all}\ y\in\mathop{\rm dom}f\right\}.$$

We use the shorthand $\left\|\partial f(x)\right\|_{*}:=\sup_{g\in\partial f(x)}\left\|g\right\|_{*}$. A function $f$ is $G$-Lipschitz with respect to the norm $\left\|\cdot\right\|$ on $\mathcal{X}$ if for all $x,y\in\mathcal{X}$, $|f(x)-f(y)|\leq G\left\|x-y\right\|$. For convex $f$, this is equivalent to $\left\|\partial f(x)\right\|_{*}\leq G$ for all $x\in\mathcal{X}$ (e.g. [HUL96a]). A function $f$ is $L$-smooth on $\mathcal{X}$ if $\nabla f$ is Lipschitz continuous with respect to the norm $\left\|\cdot\right\|$, defined as

$$\left\|\nabla f(x)-\nabla f(y)\right\|_{*}\leq L\left\|x-y\right\|\quad\text{for all}\ x,y\in\mathcal{X}.$$

For convex differentiable $h$, the Bregman divergence [Bre67] between $x$ and $y$ is defined as

$$D_{h}(x,y):=h(x)-h(y)-\left\langle\nabla h(y),x-y\right\rangle. \tag{2}$$

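A convex function $h$ is $c$-strongly convex with respect to a norm $\left\|\cdot\right\|$ over $\mathcal{X}$ if

$$h(y)\geq h(x)+\left\langle g,y-x\right\rangle+\frac{c}{2}\left\|x-y\right\|^{2}\quad\text{for all}\ x,y\in\mathcal{X}\ \text{and}\ g\in\partial h(x). \tag{3}$$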

We use $[n]$ to denote the set of integers $\{1,\ldots,n\}$ .

Setup and Algorithms

In this section we set up and recall the delay-free algorithms underlying our approach. We then give the appropriate delayed versions of these algorithms, which we analyze in the sequel.

To build intuition for the algorithms we analyze, we first describe two closely related first-order algorithms: the dual averaging algorithm of Nesterov [Nes09] and the mirror descent algorithm of Nemirovski and Yudin [NY83], which is analyzed further by Beck and Teboulle [BT03]. We begin by collecting notation and giving useful definitions. Both algorithms are based on a proximal function $\psi(x)$ , where it is no loss of generality to assume that $\psi(x)\geq 0$ for all $x\in\mathcal{X}$ . We assume $\psi$ is $1$ -strongly convex (by scaling, this is no loss of generality). By definitions (2) and (3), the divergence $D_{\psi}$ satisfies $D_{\psi}(x,y)\geq\frac{1}{2}\left\|x-y\right\|^{2}$ .
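As a small numerical illustration of the inequality $D_{\psi}(x,y)\geq\frac{1}{2}\left\|x-y\right\|^{2}$, the sketch below evaluates the Bregman divergence for two proximal functions. The choice of examples is ours, not the paper's: the Euclidean function $\frac{1}{2}\left\|x\right\|_{2}^{2}$ (for which the bound holds with equality) and the negative entropy on the probability simplex (1-strongly convex with respect to the $\ell_{1}$ norm, by Pinsker's inequality). All function names are hypothetical.

```python
import math
import random

def bregman(h, grad_h, x, y):
    """D_h(x, y) = h(x) - h(y) - <grad h(y), x - y> for vectors stored as lists."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_h(y), x, y))
    return h(x) - h(y) - inner

def sq_norm_half(x):
    return 0.5 * sum(v * v for v in x)

def sq_norm_half_grad(x):
    return list(x)

def neg_entropy(p):
    return sum(v * math.log(v) for v in p)

def neg_entropy_grad(p):
    return [math.log(v) + 1.0 for v in p]

def random_simplex_point(rng, d):
    # Strictly positive weights, normalized to sum to one.
    w = [rng.random() + 1e-3 for _ in range(d)]
    s = sum(w)
    return [v / s for v in w]

rng = random.Random(0)

# Euclidean case: the divergence equals half the squared distance.
x = [rng.gauss(0.0, 1.0) for _ in range(5)]
y = [rng.gauss(0.0, 1.0) for _ in range(5)]
d_euclid = bregman(sq_norm_half, sq_norm_half_grad, x, y)
half_sq_dist = 0.5 * sum((a - b) ** 2 for a, b in zip(x, y))

# Entropy case: the divergence is the KL divergence, bounded below by
# half the squared l1 distance.
p = random_simplex_point(rng, 5)
q = random_simplex_point(rng, 5)
d_kl = bregman(neg_entropy, neg_entropy_grad, p, q)
half_l1_sq = 0.5 * sum(abs(a - b) for a, b in zip(p, q)) ** 2
```

In the Euclidean case the two quantities agree up to rounding; in the entropy case the divergence dominates the squared-norm term, as strong convexity requires.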

In the oracle model of stochastic optimization that we assume, at time $t$ both algorithms query an oracle at the point $x(t)$, and the oracle then samples $\xi(t)$ i.i.d. from the distribution $P$ and returns $g(t)\in\partial F(x(t);\xi(t))$. The dual averaging algorithm [Nes09] updates a dual vector $z(t)$ and primal vector $x(t)\in\mathcal{X}$ via

$$z(t+1)=z(t)+g(t)\quad\text{and}\quad x(t+1)=\mathop{\rm argmin}_{x\in\mathcal{X}}\left\{\left\langle z(t+1),x\right\rangle+\frac{1}{\alpha(t+1)}\psi(x)\right\}, \tag{4}$$

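while mirror descent [NY83, BT03] performs the update

$$x(t+1)=\mathop{\rm argmin}_{x\in\mathcal{X}}\left\{\left\langle g(t),x\right\rangle+\frac{1}{\alpha(t)}D_{\psi}(x,x(t))\right\}. \tag{5}$$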

Both make a linear approximation to the function being minimized—a global approximation in the case of the dual averaging update (4) and a more local approximation for mirror descent (5)—while using the proximal function $\psi$ to regularize the points $x(t)$ .
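The two updates can be sketched in a few lines for one concrete setting. The code below is a minimal illustration under assumptions of ours: $\psi(x)=\frac{1}{2}\left\|x\right\|_{2}^{2}$, $\mathcal{X}$ a Euclidean ball, stepsize $\alpha(t)=1/\sqrt{t}$, and a toy objective $F(x;\xi)=\frac{1}{2}\left\|x-\xi\right\|_{2}^{2}$. With this $\psi$, the dual averaging step reduces to projecting $-\alpha z$ onto the ball and the mirror descent step to projected gradient descent. All names are hypothetical.

```python
import math
import random

def project_ball(v, radius):
    """Euclidean projection onto {x : ||x||_2 <= radius}."""
    norm = math.sqrt(sum(c * c for c in v))
    return list(v) if norm <= radius else [c * radius / norm for c in v]

def dual_averaging(oracle, dim, steps, radius, alpha):
    """Accumulate gradients in z, then minimize <z, x> + psi(x) / alpha over the ball."""
    x, z, avg = [0.0] * dim, [0.0] * dim, [0.0] * dim
    for t in range(1, steps + 1):
        g = oracle(x)
        z = [zi + gi for zi, gi in zip(z, g)]
        a = alpha(t + 1)
        x = project_ball([-a * zi for zi in z], radius)
        avg = [m + xi / steps for m, xi in zip(avg, x)]
    return avg

def mirror_descent(oracle, dim, steps, radius, alpha):
    """Linearize at the current point and regularize with D_psi(x, x(t))."""
    x, avg = [0.0] * dim, [0.0] * dim
    for t in range(1, steps + 1):
        g = oracle(x)
        a = alpha(t)
        x = project_ball([xi - a * gi for xi, gi in zip(x, g)], radius)
        avg = [m + xi / steps for m, xi in zip(avg, x)]
    return avg

def make_oracle(center, noise_std, seed):
    """Stochastic gradient of 0.5 * ||x - xi||^2 with xi = center + Gaussian noise."""
    rng = random.Random(seed)
    def oracle(x):
        return [xi - (c + rng.gauss(0.0, noise_std)) for xi, c in zip(x, center)]
    return oracle

center = [0.3, -0.2, 0.1]            # minimizer of the toy objective
alpha = lambda t: 1.0 / math.sqrt(t)  # assumed stepsize schedule
da_avg = dual_averaging(make_oracle(center, 0.1, 0), 3, 2000, 1.0, alpha)
md_avg = mirror_descent(make_oracle(center, 0.1, 1), 3, 2000, 1.0, alpha)
da_err = math.dist(da_avg, center)
md_err = math.dist(md_avg, center)
```

Both routines return the time-averaged iterate, and on this toy problem both averages land close to the minimizer.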

We now state the two essentially standard assumptions [JNT08, Lan10, Xia10] we most often make about the stochastic optimization problem (1), after which we recall the convergence rates of the algorithms (4) and (5).

In particular, Assumption A implies that $f$ is $G$ -Lipschitz continuous with respect to the norm $\left\|\cdot\right\|$ and that $f$ is convex. Our second assumption has been used to show rates of convergence based on the variance of a gradient estimator for stochastic optimization problems (e.g. [JNT08, Lan10]).

Several commonly used functions satisfy the above assumptions, for example:

The logistic loss: $F(x;\xi)=\log[1+\exp(\left\langle x,\xi\right\rangle)]$ , the objective for logistic regression in statistics (e.g. [HTF01]). The objective $F$ satisfies Assumptions A and B so long as $\left\|\xi\right\|$ is bounded.
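To make the logistic-loss example concrete, the following sketch evaluates $F(x;\xi)$ and its gradient $\nabla F(x;\xi)=\xi/(1+\exp(-\left\langle x,\xi\right\rangle))$ in a numerically stable way. The helper names and the test vectors are ours. Because the sigmoid factor lies in $(0,1)$, the gradient norm never exceeds $\left\|\xi\right\|$, which is where the boundedness requirement on $\xi$ enters.

```python
import math

def inner(x, xi):
    return sum(a * b for a, b in zip(x, xi))

def logistic_loss(x, xi):
    """log(1 + exp(<x, xi>)), written to avoid overflow for large |<x, xi>|."""
    s = inner(x, xi)
    return max(s, 0.0) + math.log1p(math.exp(-abs(s)))

def sigmoid(s):
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def logistic_grad(x, xi):
    """Gradient with respect to x: sigmoid(<x, xi>) * xi."""
    w = sigmoid(inner(x, xi))
    return [w * b for b in xi]

x = [0.5, -1.0, 2.0]
xi = [1.0, 0.0, -1.0]
grad = logistic_grad(x, xi)
grad_norm = math.sqrt(sum(g * g for g in grad))
xi_norm = math.sqrt(sum(b * b for b in xi))

# Central finite difference in the first coordinate, as a sanity check.
h = 1e-6
fd = (logistic_loss([x[0] + h] + x[1:], xi)
      - logistic_loss([x[0] - h] + x[1:], xi)) / (2 * h)
```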

We also make a standard compactness assumption on the optimization set $\mathcal{X}$ .

For $x^{*}\in\mathop{\rm argmin}_{x\in\mathcal{X}}f(x)$ and $x\in\mathcal{X}$ , the bounds $\psi(x^{*})\leq R^{2}/2$ and $D_{\psi}(x^{*},x)\leq R^{2}$ both hold.

Under Assumptions A or B in addition to Assumption C, the updates (4) and (5) have known convergence rates. Define the time averaged vector $\widehat{x}(T)$ as

Then under Assumption A, both algorithms satisfy

for the stepsize choice $\alpha(t)=R/(G\sqrt{t})$ (e.g. [Nes09, Xia10, NJLS09]). The result (7) is sharp to constant factors in general [NY83, ABRW10], but can be further improved under Assumption B. Building on work of Juditsky et al. [JNT08] and Lan [Lan10], Dekel et al. [DGBSX10a, Appendix A] show that under Assumptions B and C the stepsize choice $\alpha(t)^{-1}=L+\eta(t)$, where $\eta(t)$ is a damping factor set to $\eta(t)=\sigma\sqrt{t}/R$, yields for either of the updates (4) or (5) the convergence rate

Delayed Optimization Algorithms

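Recall that the problems we consider are stochastic optimization problems of the form (1). Under the assumptions above, we extend the mirror descent and dual averaging algorithms in the simplest way: we replace $g(t)$ with $g(t-\tau(t))$. For dual averaging (cf. the update (4)) this yields

$$z(t+1)=z(t)+g(t-\tau(t))\quad\text{and}\quad x(t+1)=\mathop{\rm argmin}_{x\in\mathcal{X}}\left\{\left\langle z(t+1),x\right\rangle+\frac{1}{\alpha(t+1)}\psi(x)\right\}, \tag{9}$$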

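while for mirror descent (cf. the update (5)) we have

$$x(t+1)=\mathop{\rm argmin}_{x\in\mathcal{X}}\left\{\left\langle g(t-\tau(t)),x\right\rangle+\frac{1}{\alpha(t)}D_{\psi}(x,x(t))\right\}. \tag{10}$$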

Convergence rates for delayed optimization of smooth functions

In this section, we state and discuss several results for asynchronous stochastic gradient methods. We give two sets of theorems. The first are for the asynchronous method when we make updates to the parameter vector $x$ using one stochastic subgradient, according to the update rules (9) or (10). The second method involves using several stochastic subgradients for every update, each with a potentially different delay, which gives sharper results that we present in Section 3.2.

Intuitively, the $\sqrt{B}$-penalty due to delays for non-smooth optimization arises from the fact that subgradients can change drastically when measured at slightly different locations, so a small delay can introduce significant inaccuracy. To overcome the delay penalty, we now turn to the smoothness assumption B as well as the Lipschitz condition A (we assume both of these conditions along with Assumption C hold for all the theorems). In the smooth case, delays mean that stale gradients are only slightly perturbed, since our stochastic algorithms constrain the variability of the points $x(t)$. As we show in the proofs of the remaining results, the error from delay essentially becomes a second-order term: the penalty is asymptotically negligible. We study both update rules (9) and (10), and we set $\alpha(t)=\frac{1}{L+\eta(t)}$. Here $\eta(t)$ will be chosen to control both the effects of delays and the errors from stochastic gradient information. We prove the following theorem in Sec. 6.1.

Let the sequence $x(t)$ be defined by the update (9). Define the stepsize $\eta(t)\propto\sqrt{t+\tau}$ or let $\eta(t)\equiv\eta$ for all $t$ . Then

The mirror descent update (10) exhibits similar convergence properties, and we prove the next theorem in Sec. 6.2.

Use the conditions of Theorem 1 but generate $x(t)$ by the update (10). Then

In each of the above theorems, we can set $\eta(t)=\sigma\sqrt{t+\tau}/R$ . As immediate corollaries, we recall the definition (6) of the averaged sequence of $x(t)$ and use convexity to see that

for either update rule. In addition, we can allow the delay $\tau(t)$ to be random:

We provide the proof of the corollary in Sec. 6.3. The take-home message from the above corollaries, as well as Theorems 1 and 2, is that the penalty in convergence rate due to the delay $\tau(t)$ is asymptotically negligible. As we discuss in greater depth in the next section, this has favorable implications for robust distributed stochastic optimization algorithms.
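A small simulation can illustrate the claim that a fixed delay is asymptotically harmless for smooth problems. The sketch below runs a delayed mirror descent update with $\psi(x)=\frac{1}{2}\left\|x\right\|_{2}^{2}$, stepsize $\alpha(t)=1/(L+\eta(t))$, and $\eta(t)=\sigma\sqrt{t+\tau}/R$ on the toy objective $F(x;\xi)=\frac{1}{2}\left\|x-\xi\right\|_{2}^{2}$ (so $L=1$). The problem instance, the constants, and all names are assumptions of ours chosen for illustration; they are not the paper's experiments.

```python
import math
import random
from collections import deque

def project_ball(v, radius):
    norm = math.sqrt(sum(c * c for c in v))
    return list(v) if norm <= radius else [c * radius / norm for c in v]

def delayed_mirror_descent(center, noise_std, tau, steps, L=1.0, R=1.0, seed=0):
    """Projected gradient steps using gradients evaluated at x(t - tau)."""
    rng = random.Random(seed)
    dim = len(center)
    sigma = noise_std * math.sqrt(dim)      # sqrt of the gradient noise variance
    x = [0.0] * dim
    # history[0] is x(t - tau) once t > tau; before that it is the initial point.
    history = deque([list(x)], maxlen=tau + 1)
    avg = [0.0] * dim
    for t in range(1, steps + 1):
        stale = history[0]
        sample = [c + rng.gauss(0.0, noise_std) for c in center]
        g = [s - v for s, v in zip(stale, sample)]   # gradient at the stale point
        eta = sigma * math.sqrt(t + tau) / R
        a = 1.0 / (L + eta)
        x = project_ball([xi - a * gi for xi, gi in zip(x, g)], R)
        history.append(list(x))
        avg = [m + xi / steps for m, xi in zip(avg, x)]
    return avg

center = [0.3, -0.2]
err_no_delay = math.dist(delayed_mirror_descent(center, 0.5, 0, 20000), center)
err_delay = math.dist(delayed_mirror_descent(center, 0.5, 2, 20000), center)
```

On this instance the averaged iterate with delay $\tau=2$ ends up about as close to the minimizer as the delay-free run, which is the qualitative behavior the corollaries describe.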

Combinations of delays

In some scenarios, including distributed settings similar to those we discuss in the next section, the procedure has access not only to a single delayed gradient but to several with different delays. To abstract away the essential parts of this situation, we assume that the procedure receives $n$ gradients $g_{1},\ldots,g_{n}$, where each has a potentially different delay $\tau(i)$. Now let $\lambda=(\lambda_{i})_{i=1}^{n}$ belong to the probability simplex, though we leave $\lambda$’s values unspecified for now. Then the procedure performs the following updates at time $t$: for dual averaging,

The next two theorems build on the proofs of Theorems 1 and 2, combining several techniques. We provide the proof of Theorem 3 in Sec. 7, omitting the proof of Theorem 4 as it follows in a similar way from Theorem 2.

Let the sequence $x(t)$ be defined by the update (12). Under assumptions A, B and C, let $\frac{1}{\alpha(t)}=L+\eta(t)$ and $\eta(t)\propto\sqrt{t+\tau}$ or $\eta(t)\equiv\eta$ for all $t$ . Then

Use the same conditions as Theorem 3, but assume that $x(t)$ is defined by the update (13) and $D_{\psi}(x^{*},x)\leq R^{2}$ for all $x\in\mathcal{X}$ . Then

The consequences of Theorems 3 and 4 are powerful, as we illustrate in the next section.

Distributed Optimization

We divide the $N$ samples among $n$ workers so that each worker has an $N/n$ -sized subset of data. In streaming applications, the distribution $P$ is the unknown distribution generating the data, and each worker receives a stream of independent data points $\xi\sim P$ . Worker $i$ uses its subset of the data, or its stream, to compute $g_{i}$ , an estimate of the gradient $\nabla f$ of the global $f$ . We make the simplifying assumption that $g_{i}$ is an unbiased estimate of $\nabla f(x)$ , which is satisfied, for example, when each worker receives an independent stream of samples or computes the gradient $g_{i}$ based on samples picked at random without replacement from its subset of the data.

The architectural assumptions we make are natural and based on master/worker topologies, but the convergence results in Section 3 allow us to give procedures robust to delay and asynchrony. We consider two protocols: in the first, workers compute and communicate asynchronously and independently with the master, and in the second, workers are at different distances from the master and communicate with time lags proportional to their distances. We show in the latter part of this section that the convergence rate of each protocol, when applied in an $n$-node network, is $\mathcal{O}(1/\sqrt{nT})$ (though lower order terms are different for each).

Before describing our architectures, we note that perhaps the simplest master-worker scheme is to have each worker simultaneously compute a stochastic gradient and send it to the master, which takes a gradient step on the averaged gradient. While the $n$ gradients are computed in parallel, accumulating and averaging $n$ gradients at the master takes $\Omega(n)$ time, offsetting the gains of parallelization. Thus we consider alternate architectures that are robust to delay.

This protocol is the delayed update algorithm mentioned in the introduction, and it parallelizes computation of (estimates of) the gradient $\nabla f(x)$. Formally, worker $i$ has parameter $x(t)$ and computes $g_{i}(t)=\nabla F(x(t);\xi_{i}(t))$, where $\xi_{i}(t)$ is a random variable sampled at worker $i$ from the distribution $P$. The master maintains a parameter vector $x\in\mathcal{X}$. The algorithm proceeds in rounds, cyclically pipelining updates. The algorithm begins by initiating gradient computations at different workers at slightly offset times. At time $t$, the master receives gradient information at a $\tau$-step delay from some worker, performs a parameter update, and passes the updated central parameter $x(t+1)$ back to the worker. Other workers do not see this update and continue their gradient computations on stale parameter vectors. In the simplest case, each node suffers a delay of $\tau=n$, though our earlier analysis applies to random delays throughout the network as well. Recall Fig. 1 for a graphic description of the process.

Locally Averaged Delayed Architecture

At a high level, the protocol we now describe combines the delayed updates of the cyclic delayed architecture with averaging techniques of previous work [NO09, DAW10]. We assume a network $\mathcal{G}=(\mathcal{V},\mathcal{E})$ , where $\mathcal{V}$ is a set of $n$ nodes (workers) and $\mathcal{E}$ are the edges between the nodes. We select one of the nodes as the master, which maintains the parameter vector $x(t)\in\mathcal{X}$ over time.

The algorithm works via a series of multicasting and aggregation steps on a spanning tree rooted at the master node. In the first phase, the algorithm broadcasts from the root towards the leaves. At step $t$ the master sends its current parameter vector $x(t)$ to its immediate neighbors. Simultaneously, every other node broadcasts its current parameter vector (which, for a depth $d$ node, is $x(t-d)$ ) to its children in the spanning tree. See Fig. 2(a). Every worker receives its new parameter and computes its local gradient at this parameter. The second part of the communication in a given iteration proceeds from leaves toward the root. The leaf nodes communicate their gradients to their parents. The parent takes the gradients of the leaf nodes from the previous round (received at iteration $t-1$ ) and averages them with its own gradient, passing this averaged gradient back up the tree. Again simultaneously, each node takes the averaged gradient vectors of its children from the previous rounds, averages them with its current gradient vector, and passes the result up the spanning tree. See Fig. 2(b) and Fig. 3 for a visual description.

Slightly more formally, associated with each node $i\in\mathcal{V}$ is a delay $\tau(i)$, which is (generally) twice its distance from the master. Fix an iteration $t$. Each node $i\in\mathcal{V}$ has an out-of-date parameter vector $x(t-\tau(i)/2)$, which it sends further down the tree to its children. So, for example, the master node sends the vector $x(t)$ to its children, which send the parameter vector $x(t-1)$ to their children, which in turn send $x(t-2)$ to their children, and so on. Each node computes

where $\xi_{i}(t)$ is a random variable sampled at node $i$ from the distribution $P$ . The communication back up the hierarchy proceeds as follows: the leaf nodes in the tree (say at depth $d$ ) send the gradient vectors $g_{i}(t-d)$ to their immediate parents in the tree. At the previous iteration $t-1$ , the parent nodes received $g_{i}(t-d-1)$ from their children, which they average with their own gradients $g_{i}(t-d+1)$ and pass to their parents, and so on. The master node at the root of the tree receives an average of delayed gradients from the entire tree, with each gradient having a potentially different delay, giving rise to updates of the form (12) or (13).
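The per-node delays in this protocol depend only on the network topology. As a rough sketch, assuming the spanning tree is a breadth-first tree rooted at the master and that $\tau(i)$ is exactly twice the depth of node $i$ (the text says this holds "generally"), the delays can be computed as follows. The function name and the example graph are hypothetical.

```python
from collections import deque

def node_delays(adjacency, master):
    """Return tau(i) = 2 * depth(i) in a breadth-first spanning tree rooted at master."""
    depth = {master: 0}
    queue = deque([master])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in depth:          # first visit fixes the tree depth of v
                depth[v] = depth[u] + 1
                queue.append(v)
    return {node: 2 * d for node, d in depth.items()}

# A 4-node path with the master at one end: delays grow linearly with distance.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
delays = node_delays(path, master=0)

# A star centered at the master: every worker has the same small delay.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
star_delays = node_delays(star, master=0)
```

The contrast between the path and the star shows why the height of the spanning tree, rather than the number of nodes, controls the delay in this architecture.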

Convergence rates for delayed distributed minimization

Having described our architectures, we can now give corollaries to the theoretical results from the previous sections that show it is possible to achieve asymptotically faster rates (over centralized procedures) using distributed algorithms even without imposing synchronization requirements. We allow workers to pipeline updates by computing asynchronously and in parallel, so each worker can compute a low-variance estimate of the gradient $\nabla f(x)$.

We begin with a simple corollary to the results in Sec. 3.1. We ignore the constants $L$ , $G$ , $R$ , and $\sigma$ , which are not dependent on the characteristics of the network. We also assume that each worker uses $m$ independent samples of $\xi\sim P$ to compute the stochastic gradient as

Using the cyclic protocol as in Fig. 1, Theorems 1 and 2 give the following result.

Let $\psi(x)=\frac{1}{2}\left\|{x}\right\|_{2}^{2}$, assume the conditions in Corollary 1, and assume that each worker uses $m$ samples $\xi\sim P$ to compute the gradient it communicates to the master. Then with the choice $\eta(t)=\sqrt{T}/\sqrt{m}$ either of the updates (9) or (10) satisfies

In the above corollary, so long as the bound on the delay $B$ satisfies, say, $B=o(T^{1/4})$ , then the last term in the bound is asymptotically negligible, and we achieve a convergence rate of $\mathcal{O}(1/\sqrt{Tm})$ .

The cyclic delayed architecture has the drawback that information from a worker can take $\mathcal{O}(n)$ time to reach the master. While the algorithm is robust to delay and does not need lock-step coordination of workers, the downside of the architecture is that the essentially $n^{2}m/T$ term in the bounds above can be quite large. Indeed, if each worker computes its gradient over $m$ samples with $m\approx n$ —say to avoid idling of workers—then the cyclic architecture has convergence rate $\mathcal{O}(n^{3}/T+1/\sqrt{nT})$ . For moderate $T$ or large $n$ , the delay penalty $n^{3}/T$ may dominate $1/\sqrt{nT}$ , offsetting the gains of parallelization.

To address the large $n$ drawback, we turn our attention to the locally averaged architecture described by Figs. 2 and 3, where delays can be smaller since they depend only on the height of a spanning tree in the network. The algorithm requires more synchronization than the cyclic architecture but still performs limited local communication. Each worker computes $g_{i}(t-\tau(i))=\nabla F(x(t-\tau(i));\xi_{i}(t))$ where $\tau(i)$ is the delay of worker $i$ from the master and $\xi_{i}\sim P$ . As a result of the communication procedure, the master receives a convex combination of the stochastic gradients evaluated at each worker $i$ , for which we gave results in Section 3.2.
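The benefit of receiving a convex combination of gradients is variance reduction. The sketch below is a toy check under assumptions of ours: scalar, zero-mean Gaussian gradient errors that are independent across workers, combined with uniform weights $\lambda_{i}=1/n$. In that setting the variance of the averaged error should be close to $\sigma^{2}/n$. The function name is hypothetical.

```python
import random

def averaged_error_variance(n_workers, sigma, trials, seed=0):
    """Monte Carlo estimate of the variance of the uniformly averaged gradient error."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        errors = [rng.gauss(0.0, sigma) for _ in range(n_workers)]
        mean_error = sum(errors) / n_workers   # lambda_i = 1 / n for every worker
        total += mean_error ** 2
    return total / trials

single = averaged_error_variance(1, 1.0, 20000)     # roughly sigma^2
averaged = averaged_error_variance(8, 1.0, 20000)   # roughly sigma^2 / 8
```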

In this architecture, the master receives gradients of the form $g_{\lambda}(t)=\sum_{i=1}^{n}\lambda_{i}g_{i}(t-\tau(i))$ for some $\lambda$ in the simplex, which puts us in the setting of Theorems 3 and 4. We now make the reasonable assumption that the gradient errors $\nabla f(x(t))-g_{i}(t)$ are uncorrelated across the nodes in the network. (Similar results continue to hold under weak correlation.) In statistical applications, for example, each worker may own independent data or receive streaming data from independent sources; more generally, each worker can simply receive independent samples $\xi_{i}\sim P$. We also set $\psi(x)=\frac{1}{2}\left\|{x}\right\|_{2}^{2}$, and observe

This gives the following corollary to Theorems 3 and 4.

Set $\lambda_{i}=\frac{1}{n}$ for all $i$, $\psi(x)=\frac{1}{2}\left\|{x}\right\|_{2}^{2}$, and $\eta(t)=\sigma\sqrt{t+\tau}/(R\sqrt{n})$. Let $\bar{\tau}$ and $\overline{\tau^{2}}$ denote the average of the delays $\tau(i)$ and $\tau(i)^{2}$, respectively. Under the conditions of Theorem 3 or 4,

The above corollaries are general and hold irrespective of the relative costs of communication and computation. However, with knowledge of the costs, we can adapt the stepsizes slightly to give better rates of convergence when $n$ is large or communication to the master node is expensive. For now, we focus on the cyclic architecture (the setting of Corollary 2), though the same principles apply to the local averaging scheme. Let $C$ denote the cost of communicating between the master and workers in terms of the time to compute a single gradient sample, and assume that we set $m=Cn$ , so that no worker node has idle time. For simplicity, we let the delay be non-random, so $B=\tau=n$ . Consider the choice $\eta(t)=\eta\sqrt{T/(Cn)}$ for the damping stepsizes, where $\eta\geq 1$ . This setting in Theorem 1 gives

where the last equality follows since $\eta\geq 1$ . Optimizing for $\eta$ on the right yields

The convergence rates thus follow two regimes. When $T\leq n^{7}/C^{3}$, we have convergence rate $\mathcal{O}(n^{2/3}/T^{2/3})$, while once $T>n^{7}/C^{3}$, we attain $\mathcal{O}(1/\sqrt{TCn})$ convergence. Roughly, in time proportional to $TC$, we achieve optimization error $1/\sqrt{TCn}$, which is order-optimal given that we can compute a total of $TCn$ stochastic gradients [ABRW10]. The scaling of this bound is better than before: the dependence on network size is at worst $n^{2/3}$, which we obtain by increasing the damping factor $\eta(t)$ (and hence decreasing the stepsize $\alpha(t)=1/(L+\eta(t))$) relative to the setting of Corollary 2. We remark that applying the same technique to Corollary 3 gives convergence rate scaling as the smaller of $\mathcal{O}((D/T)^{2/3}+1/\sqrt{TCn})$ and $\mathcal{O}(nCD/T+1/\sqrt{TCn})$. Since the diameter $D\leq n$, this is no slower than the cyclic architecture’s bound (15).

Running-time comparisons

Now we state our assumptions on the relative times used by each algorithm. Let $T$ be the number of units of time allocated to each algorithm, and let the centralized, cyclic delayed and locally averaged delayed algorithms complete $T_{\rm cent}$ , $T_{\rm cycle}$ and $T_{\rm dist}$ iterations, respectively, in time $T$ . It is clear that $T_{\rm cent}=T$ . We assume that the distributed methods use $m_{\rm cycle}$ and $m_{\rm dist}$ samples of $\xi\sim P$ to compute stochastic gradients and that the delay $\tau$ of the cyclic algorithm is $n$ . For concreteness, we assume that communication is of the same order as computing the gradient of one sample $\nabla F(x;\xi)$ so that $C=1$ . In the cyclic setup of Sec. 3.1, it is reasonable to assume that $m_{\rm cycle}=\Omega(n)$ to avoid idling of workers (Theorems 1 and 2, as well as the bound (15), show it is asymptotically beneficial to have $m_{\rm cycle}$ larger, since $\sigma_{\rm cycle}^{2}=1/m_{\rm cycle}$ ). For $m_{\rm cycle}=\Omega(n)$ , the master requires $\frac{m_{\rm cycle}}{n}$ units of time to receive one gradient update, so $\frac{m_{\rm cycle}}{n}T_{\rm cycle}=T$ . In the locally delayed framework, if each node uses $m_{\rm dist}$ samples to compute a gradient, the master receives a gradient every $m_{\rm dist}$ units of time, and hence $m_{\rm dist}T_{\rm dist}=T$ . Further, $\sigma_{\rm dist}^{2}=1/m_{\rm dist}$ . We summarize our assumptions by saying that in $T$ units of time, each algorithm performs the following number of iterations:

Plugging the above iteration counts into the earlier bound (8) and Corollaries 2 and 3 via the sharper result (15), we can provide upper bounds (to constant factors) on the expected optimization accuracy after $T$ units of time for each of the distributed architectures as in Table 1. Asymptotically in the number of units of time $T$, both the cyclic and locally communicating stochastic optimization schemes have the same convergence rate. However, topological considerations show that the locally communicating method (Figs. 2 and 3) can perform better than the cyclic architecture, though it requires more worker coordination. Since the lower order terms matter only for large $n$ or small $T$, we compare the terms $n^{2/3}/T^{2/3}$ and $D^{2/3}/T^{2/3}$ for the cyclic and locally averaged algorithms, respectively. Since $D\leq n$ for any network, the guarantee for the locally averaged algorithm is always at least as good as that for the cyclic algorithm. For specific graph topologies, we can quantify the time improvements:

$n$-node cycle or path: $D=n$, so that both methods have the same convergence rate.

$\sqrt{n}$-by-$\sqrt{n}$ grid: $D=\sqrt{n}$, so the distributed method has a factor of $n^{2/3}/n^{1/3}=n^{1/3}$ improvement over the cyclic architecture.

Balanced trees and expander graphs: $D=\mathcal{O}(\log n)$, so the distributed method has a factor (ignoring logarithmic terms) of $n^{2/3}$ improvement over cyclic.
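The three cases above all follow from the ratio of the lower-order terms, $(n/T)^{2/3}\big/(D/T)^{2/3}=(n/D)^{2/3}$. The sketch below evaluates that ratio for a hypothetical network of $n=4096$ nodes. Using $D=\log_{2}n$ for the tree case is an illustrative stand-in for $D=\mathcal{O}(\log n)$, not a claim about any particular graph.

```python
import math

def improvement_factor(n, diameter):
    """Ratio (n / D)^(2/3) of the cyclic to the locally averaged lower-order term."""
    return (n / diameter) ** (2.0 / 3.0)

n = 4096
factors = {
    "path": improvement_factor(n, n),                       # D = n: no gain
    "grid": improvement_factor(n, math.isqrt(n)),           # D = sqrt(n): n^(1/3)
    "balanced tree": improvement_factor(n, math.log2(n)),   # D ~ log n (assumed)
}
```

For the grid, the factor works out to $4096^{1/3}=16$, matching the $n^{1/3}$ figure in the list.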

Naturally, it is possible to modify our assumptions. In a network in which communication is cheap, or conversely, in a problem for which the computation of $\nabla F(x;\xi)$ is more expensive than communication, the number of samples $\xi\sim P$ for which each worker computes gradients is small. Such problems are frequent in statistical machine learning, such as when learning conditional random field models, which are useful in natural language processing, computational biology, and other application areas [LMP01]. In this case, it is reasonable to have $m_{\rm cycle}=\mathcal{O}(1)$, in which case $T_{\rm cycle}=Tn$ and the cyclic delayed architecture has stronger convergence guarantees of $\mathcal{O}(\min\{n^{2}/T,1/T^{2/3}\}+1/\sqrt{Tn})$. In any case, both non-centralized protocols enjoy significantly faster asymptotic convergence rates for stochastic optimization problems in spite of asynchronous delays.

Numerical Results

Though this paper focuses mostly on the theoretical analysis of the methods we have presented, it is important to understand the practical aspects of the above methods in solving real-world tasks and problems with real data. To that end, we use the cyclic delayed method (9) to solve a common statistical machine learning problem. Specifically, we focus on solving the logistic regression problem

We use the Reuters RCV1 dataset [LYRL04], which consists of $N\approx 800000$ news articles, each labeled with some combination of the four labels economics, government, commerce, and medicine. In the above example, the vectors $a_{i}\in\{0,1\}^{d}$ , $d\approx 10^{5}$ , are feature vectors representing the words in each article, and the labels $b_{i}$ are $1$ if the article is about government, $-1$ otherwise.

We simulate the cyclic delayed optimization algorithm (9) for the problem (17) for several choices of the number of workers $n$ and the number of samples $m$ computed at each worker. We summarize the results of our experiments in Figure 4. To generate the figure, we fix an $\epsilon$ (in this case, $\epsilon=0.05$), then measure the time it takes the stochastic algorithm (9) to output an $\widehat{x}$ such that $f(\widehat{x})\leq\inf_{x\in\mathcal{X}}f(x)+\epsilon$. We perform each experiment ten times.

After computing the number of iterations required to achieve $\epsilon$ -accuracy, we convert the results to running time by assuming it takes one unit of time to compute the gradient of one term in the sum defining the objective (17). We also assume that it takes $1$ unit of time, i.e. $C=1$ , to communicate from one of the workers to the master, for the master to perform an update, and communicate back to one of the workers. In an $n$ node system where each worker computes $m$ samples of the gradient, the master receives an update every $\max\{\frac{m}{n},1\}$ time units. A centralized algorithm computing $m$ samples of its gradient performs an update every $m$ time units. By multiplying the number of iterations to $\epsilon$ -optimality by $\max\{\frac{m}{n},1\}$ for the distributed method and by $m$ for the centralized, we can estimate the amount of time it takes each algorithm to achieve an $\epsilon$ -accurate solution.
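The conversion from iteration counts to running time described above amounts to two multiplications. The sketch below encodes it directly. The function names and the iteration counts in the example are hypothetical and are not results from the experiments.

```python
def distributed_time(iterations, m, n):
    """Master receives one update every max(m / n, 1) time units."""
    return iterations * max(m / n, 1.0)

def centralized_time(iterations, m):
    """A single machine needs m time units per update."""
    return iterations * m

def speedup(cent_iterations, dist_iterations, m, n):
    """Centralized time divided by distributed time; values above 1 favor the n-node system."""
    return centralized_time(cent_iterations, m) / distributed_time(dist_iterations, m, n)

# Hypothetical counts: both methods need 1000 iterations with m = 10 samples each.
t_dist = distributed_time(1000, 10, 5)    # 1000 * max(10 / 5, 1) time units
t_cent = centralized_time(1000, 10)       # 1000 * 10 time units
ratio = speedup(1000, 1000, 10, 5)
```

When $m<n$ the master is the bottleneck and the per-update cost stays at one time unit, which is why the `max` appears in the distributed case.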

We now turn to discussing Figure 4. The delayed update (9) enjoys speedup (the ratio of time to $\epsilon$-accuracy for an $n$-node system versus the centralized procedure) nearly linear in the number $n$ of worker machines until $n\geq 15$ or so. Since we use the stepsize choice $\eta(t)\propto\sqrt{t/n}$, which yields the predicted convergence rate given by Corollary 2, the $n^{2}m/T\approx n^{3}/T$ term in the convergence rate presumably becomes non-negligible for larger $n$. This expands on earlier experimental work with a similar method [LSZ09], which experimentally demonstrated linear speedup for small values of $n$, but did not investigate larger network sizes. Roughly, as predicted by our theory, for non-asymptotic regimes the cost of communication and delays due to using $n$ nodes offset some of the benefits of parallelization. Nevertheless, as our analysis shows, allowing delayed and asynchronous updates still gives significant performance improvements.

Delayed Updates for Smooth Optimization

Let assumptions A and B on the function $f$ and the compactness assumption C hold. Then for any sequence $x(t)$

Proof The proof follows by using a few Bregman divergence identities to rewrite the left-hand side of the above equations, then recognizing that the result is close to a telescoping sum. Recalling the definition of a Bregman divergence (2), we note the following well-known four term equality, a consequence of straightforward algebra: for any $a,b,c,d$ ,

To make (19) useful, we note that the Lipschitz continuity of $\nabla f$ implies

so that recalling the definition of $D_{f}$ (2) we have

In particular, using the non-negativity of $D_{f}(x,y)$ , we can replace (19) with the bound

To bound the first Bregman divergence term, we recall that by Assumption C and the strong convexity of $\psi$ , $\left\|x^{*}-x(t)\right\|^{2}\leq 2D_{\psi}(x^{*},x(t))\leq 2R^{2}$ , and hence the optimality of $x^{*}$ implies

This gives the first bound of the lemma. For the second bound, using convexity, we see that

1 Proof of Theorem 1

The essential idea in this proof is to use convexity and smoothness to bound $f(x(t))-f(x^{*})$ , then use the sequence $\{\eta(t)\}$ , which decreases the stepsize $\alpha(t)$ , to cancel variance terms. To begin, we define the error $e(t)$

where $g(t-\tau)=\nabla F(x(t-\tau);\xi(t))$ for some $\xi(t)\sim P$ . Note that $e(t)$ does not have zero expectation, as there is a time delay.
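To see where the bias comes from, one can take a conditional expectation of the error $e(t)=\nabla f(x(t))-g(t-\tau)$ . The following sketch assumes that the stochastic gradient is unbiased, $\mathbb{E}_{\xi}[\nabla F(x;\xi)]=\nabla f(x)$ , and that $\xi(t)$ is independent of $x(t)$ and $x(t-\tau)$ ; it then applies the $L$ -Lipschitz continuity of $\nabla f$ .

```latex
\begin{align*}
\mathbb{E}\left[e(t)\mid x(t),x(t-\tau)\right]
  &= \nabla f(x(t))-\mathbb{E}_{\xi(t)}\left[\nabla F(x(t-\tau);\xi(t))\right] \\
  &= \nabla f(x(t))-\nabla f(x(t-\tau)), \\
\left\|\mathbb{E}\left[e(t)\mid x(t),x(t-\tau)\right]\right\|_{*}
  &\leq L\left\|x(t)-x(t-\tau)\right\|.
\end{align*}
```

Under these assumptions the bias vanishes when $\tau=0$ and is otherwise controlled by how far the iterates move over $\tau$ steps.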

By using the convexity of $f$ and then the $L$ -Lipschitz continuity of $\nabla f$ , for any $x^{*}\in\mathcal{X}$ , we have

Now, by applying Lemma 5 in Appendix A and the definition of the update (9), we see that

To get the bound (21), we substituted $\alpha(t)^{-1}=L+\eta(t)$ and then used the fact that $\psi$ is strongly convex, so $D_{\psi}(x(t+1),x(t))\geq\frac{1}{2}\left\|x(t)-x(t+1)\right\|^{2}$ . By summing the bound (21), we have the following non-probabilistic inequality:

since $\psi(x)\geq 0$ and $x(T+1)$ minimizes $\left\langle z(T+1),x\right\rangle+\frac{1}{\alpha(T+1)}\psi(x)$ . What remains is to control the summed $e(t)$ terms in the bound (22). We can do this simply using the second part of Lemma 4. Indeed, we have

What remains, then, is to bound the stochastic (second) term in (23). This is straightforward, though:

Since $D_{\psi}(x(t+1),x(t))\geq\frac{1}{2}\left\|x(t)-x(t+1)\right\|^{2}$ , combining (24) with (22) and noting that $\frac{1}{\alpha(t-1)}-\frac{1}{\alpha(t)}\leq 0$ gives

2 Proof of Theorem 2

The proof of Theorem 2 is similar to that of Theorem 1, so we will be somewhat terse. We define the error $e(t)=\nabla f(x(t))-g(t-\tau)$ , identically as in the earlier proof, and begin as we did in the proof of Theorem 1. Recall that

Applying the first-order optimality condition to the definition of $x(t+1)$ (5), we get

for all $x\in\mathcal{X}$ . In particular, we have

Applying the above to the inequality (25), we see that

where for the last inequality, we use the fact that $D_{\psi}(x(t+1),x(t))\geq\frac{1}{2}\left\|x(t)-x(t+1)\right\|^{2}$ , by the strong convexity of $\psi$ , and that $\alpha(t)^{-1}=L+\eta(t)$ . By summing the inequality (26), we have

Comparing the bound (27) with the earlier bound for the dual averaging algorithms (22), we see that the only essential difference is the $\alpha(t)^{-1}-\alpha(t-1)^{-1}$ terms. The compactness assumption guarantees that $D_{\psi}(x^{*},x(t))\leq R^{2}$ , however, so

The remainder of the proof uses Lemmas 7 and 4 exactly as in the proof of Theorem 1.

3 Proof of Corollary 1

We prove this result only for the mirror descent algorithm (10), as the proof for the dual-averaging-based algorithm (9) is similar. We define the error at time $t$ to be $e(t)=\nabla f(x(t))-g(t-\tau(t))$ , and observe that we only need to control the second term involving $e(t)$ in the bound (26) differently. Expanding the error terms above and using Fenchel’s inequality as in the proofs of Theorems 1 and 2, we have

Now we note that conditioned on the delay $\tau(t)$ , we have

Consequently we apply Lemma 4 (specifically, following the bounds (19) and (20)) and find

The sum of $D_{f}$ terms telescopes, leaving only terms not received by the gradient procedure within $T$ iterations, and we can use $\alpha(t)\leq\frac{1}{\eta\sqrt{T}}$ for all $t$ to derive the further bound

To control the quantity (28), all we need is to bound the expected cardinality of the set $\{t\in[T]:t+\tau(t)>T\}$ . Using Chebyshev’s inequality and standard expectation bounds, we have

We can control the remaining terms as in the proofs of Theorems 1 and 2.
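For intuition about the set $\{t\in[T]:t+\tau(t)>T\}$ , the following sketch counts its elements for a given delay sequence and averages that count over random draws. The helper names and the delay sampler `sample_delay` are hypothetical; no particular delay distribution is implied by the analysis.

```python
import random


def count_unreceived(delays):
    """Count indices t in 1..T with t + tau(t) > T, where delays[t - 1] = tau(t)."""
    T = len(delays)
    return sum(1 for t, tau in enumerate(delays, start=1) if t + tau > T)


def average_unreceived(T, sample_delay, trials=200, seed=0):
    """Monte Carlo estimate of the expected cardinality of the set above.

    sample_delay(rng) is a user-supplied (hypothetical) sampler that returns
    one nonnegative integer delay.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += count_unreceived([sample_delay(rng) for _ in range(T)])
    return total / trials


if __name__ == "__main__":
    # Constant delay of 3 over T = 10 steps: gradients from t = 8, 9, 10 arrive late.
    print(count_unreceived([3] * 10))
    # Delays drawn uniformly from {0, ..., 5}, chosen only as an example.
    print(average_unreceived(1000, lambda rng: rng.randint(0, 5)))
```

With a constant delay the count equals the delay itself, independent of $T$ , which matches the intuition that only the last few gradients go unreceived.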

4 Proof of Theorem 3

The proof of Theorem 3 is not too difficult given our previous work—all we need to do is redefine the error $e(t)$ and use $\eta(t)$ to control the variance terms that arise. To that end, we define the gradient error terms that we must control. In this proof, we set

where $g_{i}(t)=\nabla F(x(t);\xi_{i}(t))$ is the gradient of node $i$ computed at the parameter $x(t)$ and $\tau(i)$ is the delay associated with node $i$ .

Using Assumption B as in the proofs of previous theorems, then applying Lemma 5, we have

We telescope as in the proofs of Theorems 1 and 2, canceling $\frac{L}{2}\left\|x(t)-x(t+1)\right\|^{2}$ with the $LD_{\psi}$ divergence terms to see that

This is exactly as in the non-probabilistic bound (22) from the proof of Theorem 1, but the definition (29) of the error $e(t)$ here is different.

What remains is to control the error term in (30). Writing the terms out, we have

Bounding the first term above is simple via Lemma 4: as in the proof of Theorem 1 earlier, we have

We use the same technique as the proof of Theorem 1 to bound the second term from (31). Indeed, the Fenchel-Young inequality gives

By assumption, given the information at worker $i$ at time $t-\tau(i)$ , $g_{i}(t-\tau(i))$ is independent of $x(t)$ , so the first term has zero expectation. More formally, this happens because $x(t)$ is a function of gradients $g_{i}(1),\ldots,g_{i}(t-\tau(i)-1)$ from each of the nodes $i$ and hence the expectation of the first term conditioned on $\{g_{i}(1),\ldots,g_{i}(t-\tau(i)-1)\}_{i=1}^{n}$ is 0. The last term is canceled by the Bregman divergence terms in (30), so combining the bound (31) with the above two paragraphs yields

Conclusion and Discussion

In this paper, we have studied dual averaging and mirror descent algorithms for smooth and non-smooth stochastic optimization in delayed settings, showing applications of our results to distributed optimization. We showed that for smooth problems, we can preserve the performance benefits of parallelization over centralized stochastic optimization even when we relax synchronization requirements. Specifically, we presented methods that take advantage of distributed computational resources and are robust to node failures, communication latency, and node slowdowns. By distributing computation for stochastic optimization problems, we were also able to exploit asynchronous processing without incurring any asymptotic penalty due to the delays. Finally, though we omit these results for brevity, it is possible to extend all of our expected convergence results to high-probability guarantees.

Acknowledgments

In performing this research, AA was supported by a Microsoft Research Fellowship, and JCD was supported by the National Defense Science and Engineering Graduate Fellowship (NDSEG) Program. We are very grateful to Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao for illuminating conversations on distributed stochastic optimization and communication of their proof of the bound (8). We would also like to thank Yoram Singer for reading a draft of this manuscript and giving useful feedback.

Appendix A Technical Results about Proximal Functions

In this section, we collect several useful results about proximal functions and continuity properties of the solutions of proximal operators. We give proofs of all uncited results in Appendix B. We begin with results useful for the dual-averaging updates (4) and (9).

Since $\nabla\psi^{*}_{\alpha}(z)=\mathop{\rm argmax}_{x\in\mathcal{X}}\{\left\langle-z,x\right\rangle-\alpha^{-1}\psi(x)\}$ , it is clear that $x(t)=\nabla\psi^{*}_{\alpha(t)}(z(t))$ . Further, by strong convexity of $\psi$ , we have that $\nabla\psi^{*}_{\alpha}(z)$ is $\alpha$ -Lipschitz continuous [Nes09, HUL96b, Chapter X], that is, for the norm $\left\|\cdot\right\|$ with respect to which $\psi$ is strongly convex and its associated dual norm $\left\|\cdot\right\|_{*}$ , we have $\left\|\nabla\psi^{*}_{\alpha}(y)-\nabla\psi^{*}_{\alpha}(z)\right\|\leq\alpha\left\|y-z\right\|_{*}$ for all $y,z$ .

We will find one more result about solutions to the dual averaging update useful. This result has essentially been proven in many contexts [Nes09, Tse08, DGBSX10a].

Let $x^{+}$ minimize $\left\langle z,x\right\rangle+A\psi(x)$ over $x\in\mathcal{X}$ . Then for any $x\in\mathcal{X}$ ,

Now we turn to describing properties of the mirror-descent step (5), which we will also use frequently. The lemma allows us to bound differences between $x(t)$ and $x(t+1)$ for the mirror-descent family of algorithms.

Let $x^{+}$ minimize $\left\langle g,x\right\rangle+\frac{1}{\alpha}D_{\psi}(x,y)$ over $x\in\mathcal{X}$ . Then $\left\|x^{+}-y\right\|\leq\alpha\left\|g\right\|_{*}$ .
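The bound $\left\|x^{+}-y\right\|\leq\alpha\left\|g\right\|_{*}$ is easy to check numerically in the Euclidean special case. The sketch below assumes $\psi(x)=\frac{1}{2}\left\|x\right\|_{2}^{2}$ , so that $D_{\psi}(x,y)=\frac{1}{2}\left\|x-y\right\|_{2}^{2}$ , and takes $\mathcal{X}$ to be an $\ell_{2}$ ball; in that case the minimizer is the projection of $y-\alpha g$ onto the ball. The function names, the choice of domain, and the random test data are all illustrative assumptions.

```python
import math
import random


def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))


def project_l2_ball(v, radius):
    """Euclidean projection of v onto the ball of the given radius."""
    nv = norm(v)
    if nv <= radius:
        return list(v)
    return [radius * vi / nv for vi in v]


def mirror_step(y, g, alpha, radius):
    # With psi(x) = ||x||^2 / 2, the minimizer of <g, x> + D_psi(x, y) / alpha
    # over the ball is the projection of y - alpha * g.
    return project_l2_ball([yi - alpha * gi for yi, gi in zip(y, g)], radius)


def lemma_holds(trials=1000, dim=5, radius=1.0, seed=0):
    """Check ||x_plus - y|| <= alpha * ||g|| on random instances with y in the ball."""
    rng = random.Random(seed)
    for _ in range(trials):
        y = project_l2_ball([rng.gauss(0, 1) for _ in range(dim)], radius)
        g = [rng.gauss(0, 1) for _ in range(dim)]
        alpha = rng.uniform(0.01, 2.0)
        x_plus = mirror_step(y, g, alpha, radius)
        step = norm([a - b for a, b in zip(x_plus, y)])
        if step > alpha * norm(g) + 1e-9:   # small tolerance for rounding
            return False
    return True


if __name__ == "__main__":
    print(lemma_holds())
```

In this special case the inequality is just the non-expansiveness of Euclidean projection; the lemma extends the same conclusion to general strongly convex $\psi$ .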

The last technical lemma we give explicitly bounds the differences between $x(t)$ and $x(t+\tau)$ , for some $\tau\geq 1$ , by using the above continuity lemmas.

Let Assumption A hold. Define $x(t)$ via the dual-averaging updates (4), (9), or (12) or the mirror-descent updates (5), (10), or (13). Let $\alpha(t)^{-1}=L+\eta(t+t_{0})^{c}$ for some $c\in[0,1]$ , $\eta>0$ , $t_{0}\geq 0$ , and $L\geq 0$ . Then for any fixed $\tau$ ,

Appendix B Proofs of Proximal Operator Properties

Proof of Lemma 6 The inequality is clear when $x^{+}=y$ , so assume that $x^{+}\neq y$ . Since $x^{+}$ minimizes $\left\langle g,x\right\rangle+\frac{1}{\alpha}D_{\psi}(x,y)$ , the first order conditions for optimality imply

for any $x\in\mathcal{X}$ . Thus we can choose $x=y$ and see that

where the last inequality follows from the strong convexity of $\psi$ . Using Hölder’s inequality gives that $\alpha\left\|g\right\|_{*}\left\|x^{+}-y\right\|\geq\left\|x^{+}-y\right\|^{2}$ , and dividing by $\left\|x^{+}-y\right\|$ completes the proof. ∎

Proof of Lemma 7 We first show the lemma for the dual-averaging updates. Recall that $x(t)=\nabla\psi^{*}_{\alpha(t)}(z(t))$ and $\nabla\psi^{*}_{\alpha}$ is $\alpha$ -Lipschitz continuous. Using the triangle inequality,

where we use the Cauchy-Schwarz inequality in the first step. Since $c\leq 1$ , the last term is clearly bounded by $4G^{2}\tau^{2}/(\eta^{2}t^{2c})$ .

The proof for the mirror-descent family of updates is similar. We focus on the non-delayed update (5), as the other updates simply modify the indexing of $g(t+s)$ below. We know from Lemma 6 and the triangle inequality that

Squaring the above bound, taking expectations, and recalling that $\alpha(t)$ is non-increasing, we see

by Hölder’s inequality. Substituting the appropriate value for $\alpha(t)$ completes the proof. ∎
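For reference, the mirror-descent argument can be written out as a single chain. This is a sketch under assumptions not restated in this appendix: that Assumption A supplies a second-moment bound $\mathbb{E}\left\|g(s)\right\|_{*}^{2}\leq G^{2}$ , that the step from $x(s)$ to $x(s+1)$ uses the stepsize $\alpha(s)$ , and that $c\geq 0$ so that $\alpha(t)\leq 1/(\eta t^{c})$ . The constants may differ from those in the statement of the lemma.

```latex
\begin{align*}
\left\|x(t)-x(t+\tau)\right\|
  &\leq \sum_{s=0}^{\tau-1}\left\|x(t+s)-x(t+s+1)\right\|
   \leq \sum_{s=0}^{\tau-1}\alpha(t+s)\left\|g(t+s)\right\|_{*}, \\
\mathbb{E}\left\|x(t)-x(t+\tau)\right\|^{2}
  &\leq \tau\sum_{s=0}^{\tau-1}\alpha(t+s)^{2}\,\mathbb{E}\left\|g(t+s)\right\|_{*}^{2}
   \leq \tau^{2}\alpha(t)^{2}G^{2}
   \leq \frac{G^{2}\tau^{2}}{\eta^{2}t^{2c}}.
\end{align*}
```

The first line uses the triangle inequality and Lemma 6; the second uses $(\sum_{s=0}^{\tau-1}a_{s})^{2}\leq\tau\sum_{s=0}^{\tau-1}a_{s}^{2}$ together with the fact that $\alpha(t)$ is non-increasing.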