Perturbed Iterate Analysis for Asynchronous Stochastic Optimization

Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, Michael I. Jordan

Introduction

Asynchronous parallel stochastic optimization algorithms have recently gained significant traction in algorithmic machine learning. A large body of recent work has demonstrated that near-linear speedups are achievable, in theory and practice, on many common machine learning tasks . Moreover, when these lock-free algorithms are applied to non-convex optimization, significant speedups are still achieved with no loss of statistical accuracy. This behavior has been demonstrated in practice in state-of-the-art deep learning systems such as Google’s Downpour SGD and Microsoft’s Project Adam .

Although asynchronous stochastic algorithms are simple to implement and enjoy excellent performance in practice, they are challenging to analyze theoretically. The current analyses require lengthy derivations and several assumptions that may not reflect realistic system behavior. Moreover, due to the difficult nature of the proofs, the algorithms analyzed are often simplified versions of those actually run in practice.

In this paper, we propose a general framework for deriving convergence rates for parallel, lock-free, asynchronous first-order stochastic algorithms. We interpret the algorithmic effects of asynchrony as perturbing the stochastic iterates with bounded noise. This interpretation allows us to show how a variety of asynchronous first-order algorithms can be analyzed as their serial counterparts operating on noisy inputs. The advantage of our framework is that it yields elementary convergence proofs, can remove or relax simplifying assumptions adopted in prior art, and can yield improved bounds when compared to earlier work.

We demonstrate the general applicability of our framework by providing new convergence analyses for Hogwild!, i.e., the asynchronous stochastic gradient method (SGM), for asynchronous stochastic coordinate descent (ASCD), and KroMagnon: a novel asynchronous sparse version of the stochastic variance-reduced gradient (SVRG) method . In particular, we provide a modified version of SVRG that allows for sparse updates, we show that this method can be parallelized in the asynchronous model, and we provide convergence guarantees using our framework. Experimentally, the asynchronous, parallel sparse SVRG achieves nearly-linear speedups on a machine with 16 cores and is sometimes four orders of magnitude faster than the standard (dense) SVRG method.

The algorithmic tapestry of parallel stochastic optimization is rich and diverse extending back at least to the late 60s . Much of the contemporary work in this space is built upon the foundational work of Bertsekas, Tsitsiklis et al. ; the shared memory access model that we are using in this work, is very similar to the partially asynchronous model introduced in the aforementioned manuscripts. Recent advances in parallel and distributed computing technologies have generated renewed interest in the theoretical understanding and practical implementation of parallel stochastic algorithms .

The power of lock-free, asynchronous stochastic optimization on shared-memory multicore systems was first demonstrated in the work of . The authors introduce Hogwild!, a completely lock-free and asynchronous parallel stochastic gradient method (SGM) that exhibits nearly linear speedups for a variety of machine learning tasks. Inspired by Hogwild!, several authors developed lock-free and asynchronous algorithms that move beyond SGM, such as the work of Liu et al. on parallel stochastic coordinate descent . Additional work in first order optimization and beyond , extending to parallel iterative linear solvers , has further shown that linear speedups are possible in the asynchronous shared memory model.

Perturbed Stochastic Gradients

We focus our analysis on functions ff that are LL-smooth and mm-strongly convex. A function ff is LL-smooth if it is differentiable and has Lipschitz gradients

where \|\cdot\| denotes the Euclidean norm. Strong convexity with parameter m>0m>0 imposes a curvature condition on ff:

Strong convexity implies that ff has a unique minimum x\mathbf{x}^{*} and satisfies

In the following, we use ii, jj, and kk to denote iteration counters, while reserving vv and uu to denote coordinate indices. We use O(1)\mathcal{O}(1) to denote absolute constants.

A popular way to minimize convex functions is via first-order stochastic algorithms. These algorithms can be described using the following general iterative expression:

A major advantage of the iterative formula in (2.1) is that—in combination with strong convexity, and smoothness inequalities—one can easily track algorithmic progress and establish convergence rates to the optimal solution. Unfortunately, the progress of asynchronous parallel algorithms cannot be precisely described or analyzed using the above iterative framework. Processors do not read from memory actual iterates xj\mathbf{x}_{j}, as there is no global clock that synchronizes reads or writes while different cores write/read “stale” variables.

In the subsequent sections, we show that the following simple perturbed variant of Eq. (2.1) can capture the algorithmic progress of asynchronous stochastic algorithms. Consider the following iteration

where nj\mathbf{n}_{j} is a stochastic error term. For simplicity let x^j=xj+nj\hat{\mathbf{x}}_{j}=\mathbf{x}_{j}+\mathbf{n}_{j}. Then,

where in the last equation we added and subtracted the term 2γx^j,g(x^j,ξj)2\gamma\langle\hat{\mathbf{x}}_{j},\mathbf{g}(\hat{\mathbf{x}}_{j},\xi_{j})\rangle.

We assume that x^j\hat{\mathbf{x}}_{j} and ξj\xi_{j} are independent. However, in contrast to recursion (2.1), we no longer require xj\mathbf{x}_{j} to be independent of ξj\xi_{j}. The importance of the above independence assumption will become clear in the next section.

The recursive equation (2.5) is key to our analysis. We show that for given R0jR_{0}^{j}, R1jR_{1}^{j}, and R2jR_{2}^{j}, we can obtain convergence rates through elementary algebraic manipulations. Observe that there are three “error” terms in (2.5): R0jR_{0}^{j} captures the stochastic gradient decay with each iteration, R1jR_{1}^{j} captures the mismatch between the true iterate and its noisy estimate, and R2jR_{2}^{j} measures the size of the projection of that mismatch on the gradient at each step. The key contribution of our work is to show that 1) this iteration can capture the algorithmic progress of asynchronous algorithms, and 2) the error terms can be bounded to obtain a O(log(1/ϵ)/ϵ)\mathcal{O}(\log(1/\epsilon)/\epsilon) rate for Hogwild!, and linear rates of convergence for asynchronous SCD and asynchronous sparse SVRG.

Analyzing Hogwild!

In this section, we provide a simple analysis of Hogwild!, the asynchronous implementation of SGM. We focus on functions ff that are decomposable into nn terms:

We refer to the sets eie_{i} as hyperedges and denote the set of hyperedges by E\mathcal{E}. We sometimes refer to feif_{e_{i}}s as the terms of ff. As shown in Fig. 1, the hyperedges induce a bipartite graph between the nn terms and the dd variables in x\mathbf{x}, and a conflict graph between the nn terms. Let ΔC\overline{\Delta}_{\text{C}} be the average degree in the conflict graph; that is, the average number of terms that are in conflict with a single term. We assume that ΔC1\overline{\Delta}_{\text{C}}\geq 1, otherwise we could decompose the problem into smaller independent sub-problems. As we will see, under our perturbed iterate analysis framework the convergence rate of asynchronous algorithms depends on ΔC\overline{\Delta}_{\text{C}}.

Hogwild! (Alg. 1) is a method to parallelize SGM in the asynchronous setting . It is deployed on multiple cores that have access to shared memory, where the optimization variable x\mathbf{x} and the data points that define the ff terms are stored. During its execution each core samples uniformly at random a hyperedge ss from E\mathcal{E}. It reads the coordinates vsv\in s of the shared vector x\mathbf{x}, evaluates fs\nabla f_{s} at the point read, and finally adds γfs-\gamma\nabla f_{s} to the shared variable.

During the execution of Hogwild! cores do not synchronize or follow an order between reads or writes. Moreover, they access (i.e., read or write) a set of coordinates in x\mathbf{x} without the use of any locking mechanisms that would ensure a conflict-free execution. This implies that the reads/writes of distinct cores can intertwine in arbitrary ways, e.g., while a core updates a subset of variables, before completing its task, other cores can read/write the same subset of variables.

In , the authors analyzed a variant of Hogwild! in which several simplifying assumptions were made. Specifically, in 1) only a single coordinate per sampled hyperedge is updated (i.e., the for loop in Hogwild! is replaced with a single coordinate update); 2) the authors assumed consistent reads, i.e., it was assumed that while a core is reading the shared variable, no writes from other cores occur; 3) the authors make an implicit assumption on the uniformity of the processing times of cores (explained in the following), that does not generically hold in practice. These simplifications alleviate some of the challenges in analyzing Hogwild! and allowed the authors to provide a convergence result. As we show in the current paper, however, these simplifications are not necessary to obtain a convergence analysis. Our perturbed iterates framework can be used in an elementary way to analyze the original version of Hogwild!, yielding improved bounds compared to earlier analyses.

A subtle but important point in the analysis of Hogwild! is the need to define an order for the sampled hyperedges. A key point of difference of our work is that we order the samples based on the order in which they were sampled, not the order in which cores complete the processing of the samples.

We denote by sis_{i} the ii-th sampled hyperedge in a run of Alg. 1.

That is, sis_{i} denotes the sample obtained when line 22 in Alg. 1 is executed for the ii-th time. This is different from the original work of , in which the samples were ordered according to the completion time of each thread. The issue with such an ordering is that the distribution of the samples, conditioned on the ordering, is not always uniform; for example, hyperedges of small cardinality are more likely to be “early” samples. A uniform distribution is needed for the theoretical analysis of stochastic gradient methods, a point that is disregarded in . Our ordering according to sampling time resolves this issue by guaranteeing uniformity among samples in a trivial way.

2 Defining read iterates and clarifying independence assumptions

Since the shared memory variable can change inconsistently during reads and writes, we also have to be careful about the notion of iterates in Hogwild!.

At this point we would like to briefly discuss an independence assumption held by all prior work. In the following paragraph, we explain why this assumption is not always true in practice. In Appendix A, we show how to lift such independence assumption, but for ease of exposition we do adopt it in our main text.

The vector x^i\hat{\mathbf{x}}_{i} is independent of the sampled hyperedge sis_{i}.

One way to rigorously enforce the independence of x^i\hat{\mathbf{x}}_{i} and sis_{i} is to require the processors to read the entire shared variable x\mathbf{x} before sampling a new hyperedge. However, this might not be reasonable in practice, as the dimension of x\mathbf{x} tends to be considerably larger than the sparsity of the hyperedges. As we mentioned earlier, in Appendix A, we show how to overcome the issue of dependence and thereby remove Assumption 1; however, this results in a slightly more cumbersome analysis. To ease readability, in our main text we do adopt Assumption 1.

3 The perturbed iterates view of asynchrony

In this work, we assume that all writes are atomic, in the sense that they will be successfully recorded in the shared memory at some point. Atomicity is a reasonable assumption in practice, as it can be strictly enforced through compare-and-swap operations .

Every write in line 6 of Alg. 1 will complete successfully.

This assumption implies that all writes will appear in the shared memory by the end of the execution, in the form of coordinate-wise updates. Due to commutativity the order in which these updates are recorded in the shared memory is irrelevant. Hence, after processing a total of TT hyperedges the shared memory contains:

where x0\mathbf{x}_{0} is the initial guess and xi{\bf x}_{i} is defined as the vector that contains all gradient updates up to sample si1s_{i-1}.

Observe that although a core is only reading the subset of variables that are indexed by its sampled hyperedge, in (3.2) we use the entire vector x^\hat{\bf x} as the input to the sampled gradient. We can do this since g(x^k,sk)\mathbf{g}(\hat{\mathbf{x}}_{k},s_{k}) is independent of the coordinates of x^k\hat{\mathbf{x}}_{k} outside the support of hyperedge sks_{k}.

Using the above definitions, we define the perturbed iterates of Hogwild! as

for i=0,1,...,T1i=0,1,...,T-1, where sis_{i} is the ii-th uniformly sampled hyperedge. Observe that all but the first and last of these iterates are “fake”: there might not be an actual time when they exist in the shared memory during the execution. However, x0\mathbf{x}_{0} is what is stored in memory before the execution starts, and xT\mathbf{x}_{T} is exactly what is stored in shared memory at the end of the execution.

We observe that the iterates in (3.3) place Hogwild! in the perturbed gradient framework introduced in §2:

We are only left to bound the three error terms R0jR_{0}^{j}, R1jR_{1}^{j}, and R2jR_{2}^{j}. Before we proceed, we note that for the technical soundness of our theorems, we have to also define a random variable that captures the system randomness. In particular, let ξ\xi denote a random variable that encodes the randomness of the system (i.e., random delays between reads and writes, gradient computation time, etc). Although we do not explicitly use ξ\xi, its distribution is required implicitly to compute the expectations for the convergence analysis. This is because the random samples s0,s1,,sT1s_{0},s_{1},\ldots,s_{T-1} do not fully determine the output of Alg. 1. However, s0,,sT1s_{0},\ldots,s_{T-1} along with ξ\xi completely determine the time of all reads and writes. We continue with our final assumption needed by our analysis, that is also needed by prior art.

Two hyperedges sis_{i} and sjs_{j} overlap in time if they are processed concurrently at some point during the execution of Hogwild!. The time during which a hyperedge sis_{i} is being processed begins when the sampling function is called and ends after the last coordinate of g(x^i,si)\mathbf{g}(\hat{\mathbf{x}}_{i},s_{i}) is written to the shared memory. We assume that there exists a number τ0\tau\geq 0, such that the maximum number of sampled hyperedges that can overlap in time with a particular sampled hyperedge cannot be more than τ\tau.

The usefulness of the above assumption is that it essentially abstracts away all system details relative to delays, processing overlaps, and number of cores into a single parameter. Intuitively, τ\tau can be perceived as a proxy for the number of cores, i.e., we would expect that no more than roughly O(#cores)O(\#\text{cores}) sampled hyperedges overlap in time with a single hyperedge, assuming that the processing times across the samples are approximately similar. Observe that if τ\tau is small, then we expect the distance between xj\mathbf{x}_{j} and the noisy iterate x^j\hat{\mathbf{x}}_{j} to be small. In our perturbed iterate framework, if we set τ=0\tau=0, then we obtain the classical iterative formula of serial SGM.

To quantify the distance between x^j\hat{\mathbf{x}}_{j} (i.e., the iterate read by the core that sampled sjs_{j}) and xj\mathbf{x}_{j} (i.e., the “fake” iterate used to establish convergence rates), we observe that any difference between them is caused solely by hyperedges that overlap with sjs_{j} in time. To see this, let sis_{i} be an “earlier” sample, i.e., i<ji<j, that does not overlap with sjs_{j} in time. This implies that the processing of sis_{i} finishes before sjs_{j} starts being processed. Hence, the full contribution of γg(x^i,si)\gamma\mathbf{g}(\hat{\mathbf{x}}_{i},s_{i}) will be recorded in both x^j\hat{\mathbf{x}}_{j} and xj\mathbf{x}_{j} (for the latter this holds by definition). Similarly, if i>ji>j and sis_{i} does not overlap with sjs_{j} in time, then neither x^j\hat{\mathbf{x}}_{j} nor xj\mathbf{x}_{j} (for the latter, again by definition) contain any of the coordinate updates involved in the gradient update γg(x^i,si)\gamma\mathbf{g}(\hat{\mathbf{x}}_{i},s_{i}). Assumption 3 ensures that if i<jτi<j-\tau or i>j+τi>j+\tau, the sample sis_{i} does not overlap in time with sjs_{j}.

By the above discussion, and due to Assumption 3, there exist diagonal matrices Sij{\bf S}_{i}^{j} with diagonal entries in {1,0,1}\{-1,0,1\} such that

These diagonal matrices account for any possible pattern of (potentially) partial updates that can occur while hyperedge sjs_{j} is being processed. We would like to note that the above notation bears resemblance to the coordinate-update mismatch formulation of asynchronous coordinate-based algorithms, as in .

We now turn to the convergence proof, emphasizing its elementary nature within the perturbed iterate analysis framework. We begin by bounding the error terms R1jR_{1}^{j} and R2jR_{2}^{j} (R0jR_{0}^{j} is already assumed to be at most M2M^{2}).

Hogwild! satisfies the recursion in (2.5) with

where ΔC\overline{\Delta}_{\text{C}} is the average degree of the conflict graph between the hyperedges.

The norm of the mismatch can be bounded in the following way:

since Sij{\bf S}_{i}^{j} are diagonal sign matrices and since the steps g(x^i,si)\mathbf{g}(\hat{\mathbf{x}}_{i},s_{i}) are supported on the samples sis_{i}. We use the upper bound g(x^i,si)M\|\mathbf{g}(\hat{\mathbf{x}}_{i},s_{i})\|\leq M to obtain

The last step follows because two sampled hyperedges (sampled with replacement) intersect with probability at most 2ΔCn2\frac{\overline{\Delta}_{\text{C}}}{n}. We can bound R2jR_{2}^{j} in a similar way:

Plugging the bounds of Lemma 3 in our recursive formula, we see that Hogwild! satisfies the recursion

On the other hand, serial SGM satisfies the recursion aj+1(1γm)aj+γ2M2.a_{j+1}\leq\left(1-\gamma m\right)a_{j}+\gamma^{2}M^{2}. If the step size is set to γ=ϵm2M2\gamma=\frac{\epsilon m}{2M^{2}}, it attains target accuracy ϵ\epsilon in T2M2/(ϵm2)log(2a0ϵ)T\geq 2M^{2}/(\epsilon m^{2})\log\left(\frac{2a_{0}}{\epsilon}\right) iterations. Hence, when the term δ\delta of (3.5) is order-wise constant, Hogwild! satisfies the same recursion (up to constants) as serial SGM. This directly implies the main result of this section.

If the number of samples that overlap in time with a single sample during the execution of Hogwild! is bounded as

4 Comparison with the original Hogwild! analysis of [1]

Let us summarize the key points of improvement compared to the original Hogwild! analysis:

Our analysis is elementary and compact, and follows simply by bounding the R0j,R1jR_{0}^{j},R_{1}^{j}, and R2jR_{2}^{j} terms, after introducing the perturbed gradient framework of § 2.

We do not assume consistent reads: while a core is reading from the shared memory other cores are allowed to read, or write.

In the authors analyze a simplified version of Hogwild! where for each sampled hyperedge only a randomly selected coordinate is updated. Here we analyze the “full-update” version of Hogwild!.

We order the samples by the order in which they were sampled, not by completion time. This allows to rigorously prove our convergence bounds, without assuming anything on the distribution of the processing time of each hyperedge. This is unlike , where there is an implicit assumption of uniformity with respect to processing times.

The previous work of establishes a nearly-linear speedup for Hogwild! if τ\tau is bounded as τ=O(\nicefracnΔRΔL2)\tau=\mathcal{O}\left(\sqrt{\nicefrac{{n}}{{\Delta_{\text{R}}\Delta_{\text{L}}^{2}}}}\right), where ΔR\Delta_{\text{R}} is the maximum right degree of the term-variables bipartite graph, shown in Fig 1, and ΔL\Delta_{\text{L}} is the maximum left degree of the same graph. Observe that ΔRΔL2ΔLΔC\Delta_{\text{R}}\cdot\Delta_{\text{L}}^{2}\geq\Delta_{\text{L}}\cdot\Delta_{\text{C}}, where ΔC\Delta_{\text{C}} is the maximum degree of the conflict graph. Here, we obtain a linear speedup for up to τ=O(min{\nicefracnΔC,\nicefracM2ϵm2})\tau=\mathcal{O}\left(\min\left\{\nicefrac{{n}}{{\overline{\Delta}_{\text{C}}}},\nicefrac{{M^{2}}}{{\epsilon m^{2}}}\right\}\right), where ΔC\overline{\Delta}_{\text{C}} is only the average degree of the conflict graph in Fig 1. Our bound on the delays can be orders of magnitude better than that of .

Asynchronous Stochastic Coordinate Descent

In this section, we use the perturbed gradient framework to analyze the convergence of asynchronous parallel stochastic coordinate descent (ASCD). This algorithm has been previously analyzed in . We show that the algorithm admits an elementary treatment in our perturbed iterate framework, under the same assumptions made for Hogwild!.

ASCD, shown in Alg. 2, is a linearly convergent algorithm for minimizing strongly convex functions ff. At each iteration a core samples one of the coordinates, computes a full gradient update for that coordinate, and proceeds with updating a single element of the shared memory variable x\mathbf{x}. The challenge in analyzing ASCD, compared to Hogwild!, is that, in order to show linear convergence, we need to show that the error due to the asynchrony between cores decays fast when the iterates arrive close to the optimal solution. The perturbed iterate framework can handle this type of noise analysis in a straightforward manner, using simple recursive bounds.

We define x^i\hat{\mathbf{x}}_{i} as in the previous section, but now the samples sis_{i} are coordinates sampled uniformly at random from {1,2,,d}\{1,2,\ldots,d\}. After TT samples have been processed completely, the following vector is contained in shared memory:

where x0\mathbf{x}_{0} is the initial guess, esj{\bf e}_{s_{j}} is the standard basis vector with a one at position sjs_{j}, [f(x)]sj[\nabla f(\mathbf{x})]_{s_{j}} denotes the sjs_{j}-th coordinate of the gradient of ff computed at x\mathbf{x}. Similar to Hogwild! in the previous section, ASCD satisfies the following iterative formula

Let the maximum number of coordinates that can be concurrently processed while a core is processing a single coordinate be at most

Using the recursive inequality (2.5), serial SCD with the same step-size as in the above theorem, can be shown to achieve the same accuracy as ASCD in the same number of steps. Hence, as long as the proxy for the number of cores is bounded as τ=O(min{κdlog(\nicefraca^0ϵ)1,d})\tau=\mathcal{O}(\min\{\kappa\sqrt{d}\log(\nicefrac{{\hat{a}_{0}}}{{\epsilon}})^{-1},\sqrt{d}\}), our theorem implies a linear speedup with respect to this simple convergence bound. We would like to note, however, that the coordinate descent literature sometimes uses more refined properties of the function to be optimized that can lead to potentially better convergence bounds, especially in terms of function value accuracy, i.e., f(xk)f(x)f(\mathbf{x}_{k})-f(\mathbf{x}^{*}) (see e.g., ).

We would further like to remark that between the two bounds on τ\tau, the second one, i.e., O(d)\mathcal{O}(\sqrt{d}), is the more restrictive, as the first one is proportional—up to log factors—to the square root of the number of iterations, which is usually Ω(d)\Omega(d). We explain in our subsequent derivation how this loose bound can be improved, but leave the tightness of the bound as an open question for future work.

The analysis here is slightly more involved compared to Hogwild!.The main technical bottleneck is to relate the decay of R0jR_{0}^{j} with that of R1jR_{1}^{j}, and then to exploit the sparsity of the updates for bounding R2jR_{2}^{j}.

We start with a simple upper bound on the norm of the gradient updates. From the LL-Lipschitz assumption on f(x)\nabla f(\mathbf{x}), we have

where the last inequality is due to Jensen’s inequality. This yields the following result.

Let TT be the total number of ASCD iterations, and let us define the set

which has cardinality at most 2rτ+12r\tau+1 and contains all indices around jj within rτr\tau steps, as sketched in Fig. 2.

Due to Assumption 3, and similar to , there exist variables σi,kj{1,0,1}\sigma_{i,k}^{j}\in\{-1,0,1\} such that, for any index kk in the set Srj\mathcal{S}_{r}^{j}, we have

The above equation implies that the difference between a “fake” iterate at time jj and the value that was read at time kk can be expressed as a linear combination of any coordinate updates that occurred during the time interval defined by Sr+1j\mathcal{S}^{j}_{r+1}.

From Eq. (4.1) we see that x^kxj\|\hat{\mathbf{x}}_{k}-\mathbf{x}_{j}\|, for any kSrjk\in\mathcal{S}_{r}^{j}, can be upper bounded in terms of the magnitude of the coordinate updates that occur in Sr+1j\mathcal{S}_{r+1}^{j}. Since these updates are coordinates of the true gradient, we can use their norm to bound the size of x^kxj\hat{\mathbf{x}}_{k}-\mathbf{x}_{j}. This will be useful towards bounding R1jR_{1}^{j}. Moreover, Lemma 6 shows that the magnitude of the gradient steps can be upper bounded in terms of the size of the mismatches. This will in turn be useful in bounding R0jR_{0}^{j}. The above observations are fundamental to our approach. The following lemma makes the above ideas explicit.

The first inequality is a consequence of Lemma 6. For the second, as mentioned previously, we have x^kxj=iSr+1jσi,kγg(x^i,si)\hat{\mathbf{x}}_{k}-\mathbf{x}_{j}=\sum_{i\in\mathcal{S}_{r+1}^{j}}\sigma_{i,k}\gamma\mathbf{g}(\hat{\mathbf{x}}_{i},s_{i}) when kSrjk\in\mathcal{S}_{r}^{j}. Hence,

where the first inequality follows due to Jensen’s inequality, and the last inequality uses the bound Sr+1j2(r+1)τ+13τ(r+1)|S_{r+1}^{j}|\leq 2(r+1)\tau+1\leq 3\tau(r+1). ∎

The Cauchy-Schwartz inequality implies the bound R2jR0jR1jR_{2}^{j}\leq\sqrt{R_{0}^{j}R_{1}^{j}}. Unfortunately this approach yields a result that can only guarantee convergence up to a factor of d\sqrt{d} slower than serial SCD. This happens because upper bounding the inner product x^jxj,g(x^j,sj)\langle\hat{\mathbf{x}}_{j}-\mathbf{x}_{j},\mathbf{g}(\hat{\mathbf{x}}_{j},s_{j})\rangle by x^jxjg(x^j,sj)\|\hat{\mathbf{x}}_{j}-\mathbf{x}_{j}\|\|\mathbf{g}(\hat{\mathbf{x}}_{j},s_{j})\| disregards the extreme sparsity of g(x^j,sj)\mathbf{g}(\hat{\mathbf{x}}_{j},s_{j}). The next lemma uses a slightly more involved argument to bound R2jR_{2}^{j} exploiting the sparsity of the gradient update. The proof can be found in Appendix B.1.

We can now plug in the upper bounds on R0jR_{0}^{j}, R1jR_{1}^{j}, and R2jR_{2}^{j} in our perturbed iterate recursive formula

To guarantee that ASCD follows the same recursion, i.e., it has the same convergence rate as the one implied by Eq. (4.4), we require that γmr(γ)C(γmγ2dL2),\gamma m-r(\gamma)\geq C(\gamma m-\gamma^{2}dL^{2}), where C<1C<1 is a constant. Solving for γ\gamma we get

where C>1C^{\prime}>1 is some absolute constant. For γ=O(1)θdκL\gamma=\mathcal{O}(1)\frac{\theta}{d\kappa L}, the δ(γ)\delta(\gamma) term in the recursive bound becomes

where we used the inequality M2L2a^0M^{2}\leq L^{2}\hat{a}_{0}. Hence, ASCD satisfies

Sparse and Asynchronous SVRG

The SVRG algorithm, presented in , is a variance-reduction approach to stochastic gradient descent with strong theoretical guarantees and empirical performance. In this section, we present a parallel, asynchronous and sparse variant of SVRG. We also present a convergence analysis, showing that the analysis proceeds in a nearly identical way to that of ASCD.

The original SVRG algorithm of runs for a number of epochs; the per epoch iteration is given as follows:

where y\mathbf{y} is the last iterate of the previous epoch, and as such is updated at the end of every epoch. Here ff is of the same form as in (3.1):

and g(x,sj)=fsj(x)\mathbf{g}(\mathbf{x},s_{j})=\nabla f_{s_{j}}(\mathbf{x}), with hyperedges sjEs_{j}\in\mathcal{E} sampled uniformly at random. As is common in the SVRG literature, we further assume that the individual feif_{e_{i}} terms are LL-smooth. The theoretical innovation in SVRG is having an SGM flavored algorithm, with small amortized cost per iteration, where the variance of the gradient estimate is smaller than that of standard SGM. For a certain selection of learning rate, epoch size, and number of iterations, establishes that SVRG attains a linear rate.

Observe that when optimizing a decomposable ff with sparse terms, in contrast to SGM, the SVRG iterates will be dense due to the term f(y)\nabla f(\mathbf{y}). From a practical perspective, when the SGM iterates are sparse—the case in several applications —the cost of writing a sparse update in shared memory is significantly smaller than applying the dense gradient update term f(y)\nabla f(\mathbf{y}). Furthermore, these dense updates will cause significantly more memory conflicts in an asynchronous execution, amplifying the error terms in (2.5), and introducing time delays due to memory contention.

A sparse version of SVRG can be obtained by letting the support of the update be determined by that of g(y,sj)\mathbf{g}(\mathbf{y},s_{j}):

The variance of the serial sparse SVRG procedure in (5.2) satisfies

By definition vj=g(xj,sj)g(y,sj)+Dsjf(y)\mathbf{v}_{j}=\mathbf{g}(\mathbf{x}_{j},s_{j})-\mathbf{g}(\mathbf{y},s_{j})+\mathbf{D}_{s_{j}}\nabla f(\mathbf{y}). Therefore

Since g(x,sj)\mathbf{g}(\mathbf{x},s_{j}) is supported on sjs_{j} for all x\mathbf{x}, we have

Observe that the last term in the variance bound is a non-negative quadratic form, hence we can drop it and obtain the same variance bound as the one obtained in for dense SVRG. This directly leads to the following corollary.

Sparse SVRG admits the same convergence rate upper bound as that of the SVRG of .

We note that usually the convergence rates for SVRG are obtained for function value differences. However, since our perturbed iterate framework of § 2 is based on iterate differences, we re-derive a convergence bound for iterates.

We bound the distance to the optimum after one epoch of length 8κ28\kappa^{2}:

We can rewrite the inequality as aj+1(12γm+2γ2L2)aj+2γ2L2a0a_{j+1}\leq(1-2\gamma m+2\gamma^{2}L^{2})a_{j}+2\gamma^{2}L^{2}a_{0}, because by construction y=x0\mathbf{y}=\mathbf{x}_{0}. Let γ=14Lκ\gamma=\frac{1}{4L\kappa}. Then, 12γm+2γ2L2114κ21-2\gamma m+2\gamma^{2}L^{2}\leq 1-\frac{1}{4\kappa^{2}} and

since 14κ214\frac{1}{4\kappa^{2}}\leq\frac{1}{4}. Therefore

Setting the length of an epoch to be j=2(4κ2)j=2\cdot(4\kappa^{2}) gives us aj+1(1/2+1/4)a0=0.75a0a_{j+1}\leq(1/2+1/4)\cdot a_{0}=0.75\cdot a_{0}, and the conclusion follows. ∎

We thus obtain the following convergence rate result:

2 KroMagnon: Asynchronous Parallel Sparse SVRG

We now present an asynchronous implementation of sparse SVRG. This implementation, which we refer to as KroMagnon, is given in Algorithm 3.

Let v(x^j,sj)=g(x^j,sj)g(y,sj)+Dsjf(y)\mathbf{v}(\hat{\mathbf{x}}_{j},s_{j})=\mathbf{g}(\hat{\mathbf{x}}_{j},s_{j})-\mathbf{g}(\mathbf{y},s_{j})+\mathbf{D}_{s_{j}}\nabla f(\mathbf{y}) be the noisy gradient update vector. Then, after processing a total of TT hyperedges, the shared memory contains:

To prove the convergence of KroMagnon we follow the line of reasoning presented in the previous section. Most of the arguments used here come from a straightforward generalization of the analysis of ASCD. The main result of this section is given below.

Let the maximum number of samples that can overlap in time with a single sample be bounded as

We would like to note that the total number of iterations in the above bound is—up to a universal constant—the same as that of serial sparse SVRG as presented in Theorem 13. Again, as with Hogwild! and ASCD, this implies a linear speedup.

3 Proof of Theorem 14

It is easy to see that due to Lemma 10 we get the following bound on the norm of the gradient estimate.

The set Srj\mathcal{S}_{r}^{j} is defined as in the previous section: Srj={max{jrτ,0},,j1,j,j+1,,min{j+rτ,T}}\mathcal{S}_{r}^{j}=\{\max\{j-r\tau,0\},\ldots,j-1,j,j+1,\ldots,\min\{j+r\tau,T\}\}, and has cardinality at most 2rτ+12r\tau+1. By Assumption 3, there exist diagonal sign matrices Sij{\bf S}_{i}^{j} with diagonal entries in {1,0,1}\{-1,0,1\} such that

As explained in the remark after Lemma 7, it should be possible to improve τ2\tau^{2} to τ\tau in the upper bound on Δr\Delta_{r}. Doing so would improve the condition τ=O(\nicefracnΔC)\tau=\mathcal{O}(\sqrt{\nicefrac{{n}}{{\overline{\Delta}_{\text{C}}}}}) of Theorem 14 to τ=O(\nicefracnΔC)\tau=\mathcal{O}(\sqrt{\nicefrac{{n}}{{\overline{\Delta}_{\text{C}}}}}). One possible approach to this problem can be found in § B.2.2 of the Appendix.

We can now obtain bounds on the errors due to asynchrony. The proofs for the following two lemmas can be found in Appendix B.2.

Similarly to the ASCD derivations, we obtain the following bound for R2jR_{2}^{j}.

After plugging in the upper bounds on R0jR_{0}^{j}, R1jR_{1}^{j}, and R2jR_{2}^{j} in the main recursion satisfied by KroMagnon, we find that:

If we set γ=O(1)θLκ\gamma=\mathcal{O}(1)\frac{\theta}{L\kappa}, i.e., the same step size as serial sparse SVRG (Theorem 13), then the above becomes

This implies that O(1)log(a0/ϵ)\mathcal{O}(1)\log(a_{0}/\epsilon) epochs are sufficient to reach ϵ\epsilon accuracy, where a0a_{0} is x0x2\|\mathbf{x}_{0}-\mathbf{x}^{*}\|^{2} the initial distance squared to the optimum.

Empirical Evaluation of KroMagnon

In this section we evaluate KroMagnon empirically. Our two goals are to demonstrate that (1) KroMagnon is faster than dense SVRG, and (2) KroMagnon has speedups comparable to those of Hogwild!. We implemented Hogwild!, asynchronous dense SVRG, and KroMagnon in Scala, and tested them on the problems and datasets listed in Table 1. Each algorithm was run for 50 epochs, using up to 16 threads. For the SVRG algorithms, we recompute y\mathbf{y} and the full gradient f(y)\nabla f(\mathbf{y}) every two epochs. We normalize the objective values such that the objective at the initial starting point has a value of one, and the minimum attained across all algorithms and epochs has a value of zero. Experiments were conducted on a Linux machine with 2 Intel Xeon Processor E5-2670 (2.60GHz, eight cores each) with 250Gb memory.

We were unable to run dense SVRG on the url and eswiki-2013 datasets due to the large number of features. Figures 3(a), 3(b), and 5(a) show that KroMagnon is one-two orders of magnitude faster than dense SVRG. In fact, running dense SVRG on 16 threads is slower than KroMagnon on a single thread. Moreover, as seen in Fig. 4(a), KroMagnon on 16 threads can be up to four orders of magnitude faster than serial dense SVRG. Both dense SVRG and KroMagnon attain similar optima.

We measured the time each algorithm takes to achieve 99.9% and 99.99% of the minimum achieved by that algorithm. Speedups are computed relative to the runtime of the algorithm on a single thread. Although the speedup of KroMagnon varies across datasets, we find that KroMagnon has comparable speedups with Hogwild! on all datasets, as shown in Figure 3(c), 3(d), 4(c), 4(d), 5(c), 5(d). We further observe that dense SVRG has better speedup scaling. This happens because the per iteration complexity of Hogwild! and KroMagnon is significantly cheaper to the extent that the additional overhead associated with having extra threads leads to some speedup loss; this is not the case for dense SVRG as the per iteration cost is higher.

Conclusions and Open Problems

We have introduced a novel framework for analyzing parallel asynchronous stochastic gradient optimization algorithms. The main advantage of our framework is that it is straightforward to apply to a range of first-order stochastic algorithms, while it involves elementary derivations. Moreover, in our analysis we lift, or relax, many of the assumptions made in prior art, e.g., we do not assume consistent reads, and we analyze full stochastic gradient updates. We use our framework to analyze Hogwild! and ASCD, and further introduce and analyze KroMagnon, a new asynchronous sparse SVRG algorithm.

It would be interesting to obtain tighter bounds for the convergence of function values of the algorithms presented. How do the “errors” due to asynchrony influence the convergence rate of function values? In this case the number of iterations required to reach a target accuracy should scale with the condition number of the objective, not its square. Moreover, the literature on stochastic coordinate descent establishes convergence results in terms of coordinate-wise Lipschitz constants—a more refined smoothness quantity than the full-function smoothness. It would be worthwhile to know if our framework can be adapted to take these parameters into account.

Our perturbed iterates framework relies fundamentally on the strong convexity assumption. However, asynchronous algorithms are known to perform well on non-strongly convex (and even nonconvex) objectives. Can we generalize our framework to simply convex, or smooth functions? Under what assumptions, or simple families of functions, can we show convergence for nonconvex problems?

As previously explained, we believe that the upper bounds on τ\tau—the proxy for the number of cores—in our ASCD and KroMagnon analyses are amenable to improvements. It is an open problem to explore the extent of such improvements.

Our analysis offers sensible upper bounds only in the presence of sparsity. It seems, however, that to obtain speedup results for Hogwild!, it is only necessary to have small correlation between randomly sampled gradients. In what practical setups do randomly selected gradients have sufficiently small correlation? Does that immediately imply linear speedups in the same way that update sparsity does?

In this work we analyzed three similar stochastic first-order methods. It is an open problem to apply our framework and provide an elementary analysis for a greater variety of stochastic gradient type optimization algorithms, such as AdaGrad-type schemes (similar to ), or stochastic dual coordinate methods (similar to ).

Capturing the effects of asynchrony as noise on the algorithmic input seems to be applicable to settings beyond stochastic optimization. As shown recently for a combinatorial graph problem, a similar viewpoint enables the analysis of an asynchronous graph clustering algorithm . It is an interesting endeavor to explore the extent to which a perturbed iterate viewpoint is suitable for analyzing general asynchronous iterative algorithms.

This work was supported in part by the Mathematical Data Science program of the Office of Naval Research. BR is generously supported by ONR awards , N00014-15-1-2620, N00014-13-1-0129, and N00014-14-1-0024 and NSF awards CCF-1148243 and CCF-1217058. This research is supported in part by the NSF CISE Expeditions Award 7076018 and gifts from Amazon Web Services, Google, IBM, SAP, The Thomas and Stacey Siebel Foundation, Adatao, Adobe, Apple Inc., Blue Goji, Bosch, Cisco, Cray, Cloudera, Ericsson, Facebook, Fujitsu, Guavus, HP, Huawei, Intel, Microsoft, Pivotal, Samsung, Schlumberger, Splunk, State Farm, Virdata and VMware.

References

Appendix A Removing the Independence Assumption

The reader can verify that although R1jR_{1}^{j} and R2jR_{2}^{j} are defined now in terms of xj\overline{\mathbf{x}}_{j}, the upper bounds derived in § 3 still hold. We bound R3jR_{3}^{j} as follows

where the last inequality follows by the smoothness of the gradient steps and the arithmetic-geometric mean inequality. Therefore

Since we can upper bound R4jR_{4}^{j} by the same bound derived for R3jR_{3}^{j}, we obtain the following convergence result for Hogwild!:

The only difference between this result and the one in our main section is that we can guarantee speedup for a smaller range of τ\tau. Similar ideas could be applied to the analysis of ASCD and KroMagnon.

Appendix B Omitted Proofs

We can now bound R1jR_{1}^{j}. By definition R1j=Δ0R_{1}^{j}=\Delta_{0}, and from Lemma 7 we have Δ032CG1\Delta_{0}\leq 3^{2}\cdot C\cdot G_{1}. We can bound G1G_{1} similarly to G0G_{0} as

From (4.1) we can upper bound R2jR_{2}^{j} as follows.

The random variable \mathds1(si=sj)\mathds{1}(s_{i}=s_{j}) encodes the sparsity of the gradient steps. To take advantage of this sparsity we use smoothness to replace the iterates x^i\hat{\mathbf{x}}_{i} and x^j\hat{\mathbf{x}}_{j}, by x^j3τ\hat{\mathbf{x}}_{j-3\tau}. The latter iterate is independent of both sis_{i} and sjs_{j} by our assumption that no more than τ\tau coordinates can be updated while a core is processing a single coordinate. This independence will allows us to “untangle” the expectation of \mathds1(si=sj)\mathds{1}(s_{i}=s_{j}) from the inner products in the above sum, which will result in a significantly improved bound on R2jR_{2}^{j} compared to applying Cauchy-Schwartz directly on it.

For clarity, we note that when j<3τj<3\tau, we have x^j3τ=x0\hat{\mathbf{x}}_{j-3\tau}=\mathbf{x}_{0}. From the LL-Lipschitz assumption on the gradient f(x)\nabla f(\mathbf{x}), we get the following bounds

Then, the expectation of a term g(x^i,si)g(x^j,sj)\mathds1(si=sj)\|\mathbf{g}(\hat{\mathbf{x}}_{i},s_{i})\|\cdot\|\mathbf{g}(\hat{\mathbf{x}}_{j},s_{j})\|\cdot\mathds{1}{(s_{i}=s_{j})} in the sum (B.2) is upper bounded by

We first bound the second term using Cauchy-Schwartz and the property of iterated expectation, to exploit the expectation of the \mathds1(si=sj)\mathds{1}{(s_{i}=s_{j})} term

where the first equality follows due to x^j\hat{\mathbf{x}}_{j} being independent of sjs_{j}; hence the expectation with respect to sjs_{j} can be applied to the indicator function. The last inequality follows from our arguments in the proof of Lemma 7 because both mismatches x^j3τx^i\hat{\mathbf{x}}_{j-3\tau}-\hat{\mathbf{x}}_{i} and x^j3τx^j\hat{\mathbf{x}}_{j-3\tau}-\hat{\mathbf{x}}_{j} can be written as linear combinations of gradient steps indexed by S4j\mathcal{S}_{4}^{j} as in (4.1). Similarly the third term satisfies the inequality

Putting all pieces together, and using the prescribed value of γ=θ6dLκ\gamma=\frac{\theta}{6dL\kappa}, we have that

The first inequality follows because we are summing over 2τ2\tau terms in (B.2). To see why the second inequality is true, note that dLγθ6κ1dL\gamma\leq\frac{\theta}{6\kappa}\leq 1 (it is always true that the condition number κ1\kappa\geq 1 ). Therefore 1+dLγτ+(dLγτ)21+τ+τ23τ2.1+dL\gamma\tau+(dL\gamma\tau)^{2}\leq 1+\tau+\tau^{2}\leq 3\tau^{2}. As in the proof of Lemma 8, we can bound G4G_{4} by

The result follows assuming τ=O(d)\tau=\mathcal{O}(\sqrt{d}) and γ=θ6dLκ\gamma=\frac{\theta}{6dL\kappa}. ∎

We believe that if we use the same bounding technique that we applied for R2jR_{2}^{j} on R0jR_{0}^{j} and R1jR_{1}^{j}, then we can significantly improve the restrictive bound on τ\tau.

B.2 KroMagnon

Let A=4L2(aj+a0)A=4L^{2}(a_{j}+a_{0}), B=4L2B=4L^{2}, and C=(γτ)2C=(\gamma\tau)^{2}. Then, the inequalities derived above can be rewritten as

If we expand the formulas, we get for R0jR_{0}^{j} the following upper bound

After an analogous derivation one can see that

From (5.5) we can upper bound R2jR_{2}^{j} as follows.

The random variable \mathds1(sisj)\mathds{1}{(s_{i}\cap s_{j}\neq\emptyset)} encodes the sparsity of the gradient steps. As in the proof of Lemma 9, we replace x^i\hat{\mathbf{x}}_{i} and x^j\hat{\mathbf{x}}_{j} in the above sum by x^j3τ\hat{\mathbf{x}}_{j-3\tau}. When j<3τj<3\tau we define x^j3τ=x0\hat{\mathbf{x}}_{j-3\tau}=\mathbf{x}_{0}. Since feif_{e_{i}} are LL-smooth, we have

Then, the expectation of a term v(x^i,si)v(x^j,sj)\mathds1(sisj)\|\mathbf{v}(\hat{\mathbf{x}}_{i},s_{i})\|\cdot\|\mathbf{v}(\hat{\mathbf{x}}_{j},s_{j})\|\cdot\mathds{1}{(s_{i}\cap s_{j})} in the sum (B.2) is upper bounded by

as in the proof of Lemma 9. The conclusion follows because τ=O(nΔC)\tau=\mathcal{O}\left(\sqrt{\frac{n}{\overline{\Delta}_{\text{C}}}}\right) and γ=θ12Lκ=mθ12L2\gamma=\frac{\theta}{12L\kappa}=\frac{m\theta}{12L^{2}}. ∎

Similar to ASCD, by using the same bounding technique of R2jR_{2}^{j} on R0jR_{0}^{j} and R1jR_{1}^{j}, we should significantly improve the restrictive bound on τ\tau in the convergence result of KroMagnon.