Graph Oracle Models, Lower Bounds, and Gaps for Parallel Stochastic Optimization

Blake Woodworth, Jialei Wang, Adam Smith, Brendan McMahan, Nathan Srebro

Introduction

Recently, there has been great interest in stochastic optimization and learning algorithms that leverage parallelism, including e.g. delayed updates arising from pipelining and asynchronous concurrent processing, synchronous single-instruction-multiple-data parallelism, and parallelism across distant devices. With the abundance of parallelization settings and associated algorithms, it is important to precisely formulate the problem, which allows us to ask questions such as “is there a better method for this problem than what we have?” and “what is the best we could possibly expect?”

Oracle models have long been a useful framework for formalizing stochastic optimization and learning problems. In an oracle model, we place limits on the algorithm’s access to the optimization objective, but not what it may do with the information it receives. This allows us to obtain sharp lower bounds, which can be used to argue that an algorithm is optimal and to identify gaps between current algorithms and what might be possible. Finding such gaps can be very useful—for example, the gap between the first order optimization lower bound of Nemirovski et al. and the best known algorithms at the time inspired Nesterov’s accelerated gradient descent algorithm .

We propose an oracle framework for formalizing different parallel optimization problems. We specify the structure of parallel computation using an “oracle graph” which indicates how an algorithm accesses the oracle. Each node in the graph corresponds to a single stochastic oracle query, and that query (e.g. the point at which a gradient is calculated) must be computed using only oracle accesses in ancestors of the node. We generally think of each stochastic oracle access as being based on a single data sample, thus involving one or maybe a small number of vector operations.

In Section 3 we devise generic lower bounds for parallel optimization problems in terms of simple properties of the associated oracle graph, namely the length of the longest dependency chain and the total number of nodes. In Section 4 we study specific parallel optimization settings in which many algorithms have been proposed, formulate them as graph-based oracle parallel optimization problems, instantiate our lower bounds, and compare them with the performance guarantees of specific algorithms. We highlight gaps between the lower bound and the best known upper bound and also situations where we can devise an optimal algorithm that matches the lower bound, but where this is not the “natural” and typical algorithm used in this settings. The latter indicates either a gap in our understanding of the “natural” algorithm or a need to depart from it.

Previous work studied communication lower bounds for parallel convex optimization where there are MM machines each containing a local function (e.g. a collection of samples from a distribution). Each machine can perform computation on its own function, and then periodically every machine is allowed to transmit information to the others. In order to prove meaningful lower bounds based on the number of rounds of communication, it is necessary to prevent the machines from simply transmitting their local function to a central machine, or else any objective could be optimized in one round. There are two established ways of doing this. First, one can allow arbitrary computation on the local machines, but restrict the number of bits that can be transmitted in each round. There is work focusing on specific statistical estimation problems that establishes communication lower bounds via information-theoretic arguments . Alternatively, one can allow the machines to communicate real-valued vectors, but restrict the types of computation they are allowed to perform. For instance, Arjevani and Shamir present communication complexity lower bounds for algorithms which can only compute vectors that lie in a certain subspace, which includes e.g. linear combinations of gradients of their local function. Lee et al. assume a similar restriction, but allow the data defining the local functions to be allocated to the different machines in a strategic manner. Our framework applies to general stochastic optimization problems and does not impose any restrictions on what computation the algorithm may perform, and is thus a more direct generalization of the oracle model of optimization.

Recently, Duchi et al. considered first-order optimization in a special case of our proposed model (the “simple parallelism” graph of Section 4.2), but their bounds apply in a more limited parameter regime, see Section 3 for discussion.

The graph-based oracle model

We consider the following stochastic optimization problem

The prox oracle is quite powerful and provides global rather than local information about ff. In particular, querying the prox oracle with β=0\beta=0 fully optimizes f(;z)f(\cdot;z).

As stated, zz is an argument to the oracle, however there are two distinct cases. In the “fully stochastic” oracle setting, the algorithm receives an oracle answer corresponding to a random zPz\sim\mathcal{P}. We also consider a setting in which the algorithm is allowed to “actively query” the oracle. In this case, the algorithm may either sample zPz\sim\mathcal{P} or choose a desired zz and receive an oracle answer for that zz. Our lower bounds hold for either type of oracle. Most optimization algorithms only use the fully stochastic oracle, but some require more powerful active queries.

We capture the structure of a parallel optimization algorithm with a directed, acyclic oracle graph G\mathcal{G}. Its depth, DD, is the length of the longest directed path, and the size, NN, is the number of nodes. Each node in the graph represents a single stochastic oracle access, and the edges in the graph indicate where the results of that oracle access may be used: only the oracle accesses from ancestors of each node are available when issuing a new query. These limitations might arise e.g. due to parallel computation delays or the expense of communicating between disparate machines.

Let Q\mathcal{Q} be the set of possible oracle queries, with the exact form of queries (e.g., q=xq=x vs. q=(x,β,z)q=(x,\beta,z)) depending on the context. Formally, a randomized optimization algorithm that accesses the stochastic oracle O\mathcal{O} as prescribed by the graph G\mathcal{G} is specified by associating with each node vtv_{t} a query rule Rt:(Q,O(Q))×ΞQR_{t}:(\mathcal{Q},\mathcal{O}(\mathcal{Q}))^{*}\times\Xi\rightarrow\mathcal{Q}, plus a single output rule X^:(Q,O(Q))×ΞX\hat{X}:(\mathcal{Q},\mathcal{O}(\mathcal{Q}))^{*}\times\Xi\rightarrow\mathcal{X}. We grant all of the nodes access to a source of shared randomness ξΞ\xi\in\Xi (e.g. an infinite stream of random bits). The mapping RtR_{t} selects a query qtq_{t} to make at node vtv_{t} using the set of queries and oracle responses in ancestors of vtv_{t}, namely

Similarly, the output rule X^\hat{X} maps from all of the queries and oracle responses to the algorithm’s output as x^=X^((qi,O(qi):i[N]),ξ)\hat{x}=\hat{X}\left((q_{i},\mathcal{O}(q_{i}):i\in[N]),\xi\right). The essential question is: for a class of optimization problems (G,O,F)(\mathcal{G},\mathcal{O},\mathcal{F}) specified by a dependency graph G\mathcal{G}, a stochastic oracle O\mathcal{O}, and a function class F\mathcal{F}, what is the best possible guarantee on the expected suboptimality of an algorithm’s output, i.e.

Lower bounds

We now provide lower bounds for optimization problems (G,Ograd,FL,H,B)(\mathcal{G},\mathcal{O}_{\textrm{grad}},\mathcal{F}_{L,H,B}) and (G,Oprox,FL,H,B)(\mathcal{G},\mathcal{O}_{\textrm{prox}},\mathcal{F}_{L,H,B}) in terms of LL, HH, BB, and the depth and size of G\mathcal{G}.

These are the tightest possible lower bounds in terms of just the depth and size of G\mathcal{G} in the sense that for all D,ND,N there are graphs G\mathcal{G} and associated algorithms which match the lower bound. Of course, for specific, mostly degenerate graphs they might not be tight. For instance, our lower bound for the graph consisting of a short sequential chain plus a very large number of disconnected nodes might be quite loose due to the artificial inflation of NN. Nevertheless, for many interesting graphs they are tight, as we shall see in Section 4.

Each lower bound has two components: an “optimization” term and a “statistical” term. The statistical term Ω(LB/N)\Omega(LB/\sqrt{N}) is well known, although we include a brief proof of this portion of the bound in Appendix D for completeness. The optimization term depends on the depth DD, and indicates, intuitively, the best suboptimality guarantee that can be achieved by an algorithm using unlimited parallelism but only DD rounds of communication. Arjevani and Shamir also obtain lower bounds in terms of rounds of communication, which are similar to how our lower bounds depend on depth. However they restricted the type of computations that are allowed to the algorithm to a specific class of operations, while we only limit the number of oracle queries and the dependency structure between them, but allow forming the queries in any arbitrary way.

Similar to Arjevani and Shamir , to establish the optimization term in the lower bounds, we construct functions that require multiple rounds of sequential oracle accesses to optimize. In the gradient oracle case, we use a single, deterministic function which resembles a standard construction for first order optimization lower bounds. For the prox case, we construct two functions inspired by previous lower bounds for round-based and finite sum optimization . In order to account for randomized algorithms that might leave the span of gradients or proxs returned by the oracle, we use a technique that was proposed by Woodworth and Srebro and refined by Carmon et al. . For our specific setting, we must slightly modify existing analysis, which is detailed in Appendix A.

A useful feature of our lower bounds is that they apply when both the Lipschitz constant and smoothness are bounded concurrently. Consequently, “non-smooth” in the subsequent discussion can be read as simply identifying the case where the LL term achieves the minimum as opposed to the HH term (even if H<H<\infty). This is particularly important when studying stochastic parallel optimization, since obtaining non-trivial guarantees in a purely stochastic setting requires some sort of control on the magnitude of the gradients (smoothness by itself is not sufficient), while obtaining parallelization speedups often requires smoothness, and so we would like to ask what is the best that can be done when both Lipschitz and smoothness are controlled. Interestingly, the dependence on both LL and HH in our bounds is tight, even when the other is constrained, which shows that the optimization term cannot be substantially reduced by using both conditions together.

This defines a distribution over ff’s based on the randomness in the draw of v1,,vD+1v_{1},\dots,v_{D+1}, and we apply Yao’s minimax principle. In Appendix B, we prove Theorem 1 using this construction.

In the case of the prox oracle, we “straighten out” the smooth construction of Woodworth and Srebro . For fixed constants c,γc,\gamma, we define the following Lipschitz and smooth scalar function ϕc\phi_{c}:

For P=Uniform{1,2}\mathcal{P}=\textrm{Uniform}\left\{1,2\right\} and orthonormal v1,,v2Dv_{1},\dots,v_{2D} drawn uniformly at random, we define

Again, this defines a distribution over ff’s based on the randomness in the draw of v1,,v2Dv_{1},\dots,v_{2D} and we apply Yao’s minimax principle. In Appendix C, we prove Theorem 2 using this construction.

As mentioned above, Duchi et al. recently showed a lower bound for first- and zero-order stochastic optimization in the “simple parallelism” graph consisting of DD layers, each with MM nodes. Their bound [10, Thm 2] applies only when the dimension mm is constant, and D=O(mloglogM)D=O(m\log\log M). Our lower bound requires non-constant dimension, but applies in any range of MM. Furthermore, their proof techniques do not obviously extend to prox oracles.

Specific dependency graphs

2 Simple parallelism: the layer graph

3 Delayed updates

4 Intermittent communication

Active querying and SVRG

All methods discussed so far used fully stochastic oracles, requesting a gradient (or prox computation) with respect to an independently and randomly drawn zPz\sim\mathcal{P}. We now turn to methods that also make active queries, i.e. draw samples from P\mathcal{P} and then repeatedly query the oracle, at different points xx, but on the same samples zz. Recall that all of our lower bounds are valid also in this setting.

With an active query gradient oracle, we can implement SVRG on an intermittent communication graph. More specifically, for an appropriate choice of nn and λ\lambda, we apply SVRG to the regularized empirical objective F^λ(x)=1ni=1nf(x;zi)+λ2x2\hat{F}_{\lambda}(x)=\frac{1}{n}\sum_{i=1}^{n}f(x;z_{i})+\frac{\lambda}{2}\left\|x\right\|^{2}

These guarantees improve over sequential SGD (17) as soon as M>log2(TKM/L)M>\log^{2}(TKM/L) and K>H2/L2K>H^{2}/L^{2}, i.e. L/TK<L2/HL/\sqrt{TK}<L^{2}/H. This is a very wide regime: we require only a moderate number of machines, and the second condition will typically hold for a smooth loss. Intuitively, SVRG does roughly the same number (up to a factor of two) of sequential updates as in the sequential SGD approach but it uses better, variance reduced updates. The price we pay is in the smaller total sample size since we keep calling the oracle on the same samples. Nevertheless, since SVRG only needs to calculate the “batch” gradient a logarithmic number of times, this incurs only an additional logarithmic factor.

Comparing (18) and (21), we see that SVRG also improves over A-MB-SGD as soon as K>Tlog(TKM/L)K>T\log(TKM/L), that is if the number of points we are processing on each machine each round is slightly more then the total number of rounds, which is also a realistic scenario.

To summarize, the best known upper bound for optimizing with intermittent communication using a pure stochastic oracle is (20), which combines two different algorithms. However, with active oracle accesses, SVRG is also possible and the upper bound becomes:

Summary

Our main contributions in this paper are: (1) presenting a precise formal oracle framework for studying parallel stochastic optimization; (2) establishing tight oracle lower bounds in this framework that can then be easily applied to particular instances of parallel optimization; and (3) using the framework to study specific settings, obtaining optimality guarantees, understanding where additional assumptions would be needed to break barriers, and, perhaps most importantly, identifying gaps in our understanding that highlight possibilities for algorithmic improvement. Specifically,

For non-smooth objectives and a stochastic prox oracle, smoothing and acceleration can improve performance in the layer graph setting. It is not clear if there is a more direct algorithm with the same optimal performance, e.g. averaging the answers from the prox oracle.

In the delay graph setting, delayed update SGD’s guarantee is not optimal. We suggest an alternative optimal algorithm, but it would be interesting and beneficial to understand the true behavior of delayed update SGD and to improve it as necessary to attain optimality.

With intermittent communication, we show how different methods are better in different regimes, but even combining these methods does not match our lower bound. This raises the question of whether our lower bound is achievable. Are current methods optimal? Is the true optimal complexity somewhere in between? Even finding a single method that matches the current best performance in all regimes would be a significant advance here.

With intermittent communication, active queries allow us to obtain better performance in a certain regime. Can we match this performance using pure stochastic queries or is there a real gap between active and pure stochastic queries?

We would like to thank Ohad Shamir for helpful discussions. This work was partially funded by NSF-BSF award 1718970 (“Convex and Non-Convex Distributed Learning”) and a Google Research Award. BW is supported by the NSF Graduate Research Fellowship under award 1754881. AS was supported by NSF awards IIS-1447700 and AF-1763786, as well as a Sloan Foundation research award.

References

Appendix A Main lower bound lemma

This analysis closely follows that of previous work, specifically the proof of Theorem 1 in and the proof of Lemma 4 in . There are slight differences in the problem setup between this work and that of previous papers, thus we include the following analysis for completeness and to ensure that all of our results can be verified. We do not claim any significant technical novelty within this section.

Let St=XtVtS_{t}=X_{\leq t}\cup V_{\leq t}, let PtP_{t} be the projection operator onto the span of StS_{t} and let PtP^{\perp}_{t} be the projection onto the orthogonal complement of the span of StS_{t}. As in , define

Finally, suppose that for each tt, XtX_{t} is of the form:

First, we connect the events GtG_{t} to a more immediately useful condition

[cf. Lemma 9 , Lemma 1 ] For any cc, kk, NN, VV, and {Xk}t=1k\left\{X_{k}\right\}_{t=1}^{k}, let α=min{14N, c2(1+2N)}\alpha=\min\left\{\frac{1}{4N},\ \frac{c}{2\left(1+\sqrt{2N}\right)}\right\} then for each tkt\leq k

The proof of Lemma 1 involves straightforward linear algebra, and we defer it to Appendix A.1. By Lemma 1, G<tG<tG_{<t}\subseteq G^{\prime}_{<t}, therefore the property (24) is implied by

Now, we state the main result which allows us to prove our lower bounds:

[cf. Lemma 4 , Lemma 4 ] For any k1k\geq 1, N1N\geq 1, c(0,1)c\in(0,1), and dimension

if the sets X1,,XkX_{1},\dots,X_{k} satisfy the property (25) then

The proof of Lemma 2 relies upon the following, whose proof we defer to Appendix A.1.

[cf. Lemma 11 , Lemma 3 ] Let RR be any rotation operator, RR=IR^{\top}R=I, that preserves St1S_{t-1}, that is Rw=Rw=wRw=R^{\top}w=w for any wSpan(St1)w\in\textrm{Span}\left(S_{t-1}\right). Then the following conditional densities are equal

This closely follows the proof of Lemma 4 and Lemma 4 , with small modifications to account for the different setting.

Set α=min{14N,c2(1+2N)}\alpha=\min\left\{\frac{1}{4N},\,\frac{c}{2\left(1+\sqrt{2N}\right)}\right\}. Then by Lemma 1, since X1,,XkX_{1},\dots,X_{k} satisfy the property (25)

Conditioned on G<tG_{<t} and V<tV_{<t}, the set Xt=Xt(V<t)X_{t}=X_{t}(V_{<t}) is fixed, as is the set St1S_{t-1} and therefore Pt1P_{t-1}^{\perp}, so the first term in the inner product is a fixed unit vector. By Lemma 3, the conditional density of vj  G<t,V<tv_{j}\ |\ G_{<t},V_{<t} is spherically symmetric within the span onto which Pt1P_{t-1}^{\perp} projects. Therefore, Pt1vjPt1vj\frac{P_{t-1}^{\perp}v_{j}}{\left\|P_{t-1}^{\perp}v_{j}\right\|} is distributed uniformly on the unit sphere in Span(St1)\textrm{Span}\left(S_{t-1}\right)^{\perp}, which has dimension at least m:=m(t1)r=1t1Xrmk+1Nm^{\prime}:=m-(t-1)-\sum_{r=1}^{t-1}\left|X_{r}\right|\geq m-k+1-N.

where we used that 1xexp(x)1-x\leq\exp(-x). Finally, this holds for each tt, xXtx\in X_{t}, and jtj\geq t, so

Where we used that mk+N+max{32N2, 8(1+2N)2c2}log(2k2N)m\geq k+N+\max\left\{32N^{2},\ \frac{8(1+\sqrt{2N})^{2}}{c^{2}}\right\}\log\left(2k^{2}N\right) for (36). For (37), recall that we chose α=min{14N, c2(1+2N)}\alpha=\min\left\{\frac{1}{4N},\ \frac{c}{2(1+\sqrt{2N})}\right\} so max{32N2, 8(1+2N)2c2}=2α2\max\left\{32N^{2},\ \frac{8(1+\sqrt{2N})^{2}}{c^{2}}\right\}=\frac{2}{\alpha^{2}}. ∎

This closely follows the proof of Lemma 9 , with slight modification to account for the different problem setup.

For tkt\leq k assume GtG_{\leq t}. For any xXtx\in X_{t} and jtj\geq t

First, we decomposed vjv_{j} into its St1S_{t-1} and St1S^{\perp}_{t-1} components and applied the triangle inequality. Next we used that x1\left\|x\right\|\leq 1 and that the orthogonal projection operator Pt1P_{t-1}^{\perp} is self-adjoint. Finally, we used that the projection operator is non-expansive and the definition of GtG_{t}.

Next, we prove by induction on tt that for all tkt\leq k and jtj\geq t, the event GtG_{\leq t} implies that Pt1vj22α2r=1t1Xr\left\|P_{t-1}v_{j}\right\|^{2}\leq 2\alpha^{2}\sum_{r=1}^{t-1}\left|X_{r}\right|. As a base case (t=1t=1), observe that, trivially, Pt1vj2=0vj2=0\left\|P_{t-1}v_{j}\right\|^{2}=\left\|0v_{j}\right\|^{2}=0. For the inductive step, fix any tkt\leq k and jtj\geq t and suppose that Gt    Pt1vj22α2r=1t1XrG_{\leq t^{\prime}}\implies\left\|P_{t^{\prime}-1}v_{j}^{\prime}\right\|^{2}\leq 2\alpha^{2}\sum_{r=1}^{t^{\prime}-1}\left|X_{r}\right| for all t<tt^{\prime}<t and jtj^{\prime}\geq t^{\prime}. Let P^t\hat{P}_{t} project onto Span(StXt+1)\textrm{Span}\left(S_{t}\cup X_{t+1}\right) (this includes Xt+1X_{t+1} in contrast with PtP_{t}) and let P^t\hat{P}_{t}^{\perp} project onto the orthogonal subspace. Since Span(X1X2Xt1Vt1)=St1\textrm{Span}\left(X_{1}\cup X_{2}\cup\dots\cup X_{t-1}\cup V_{\leq t-1}\right)=S_{t-1},

is a (potentially over-complete) basis for St1S_{t-1}. Using the triangle inequality and G<tG_{<t}, we can therefore expand

We must now bound the second term of (44). Focusing on the inner product in the numerator for one particular r<tr<t:

First, we used that P^r1=IP^r1\hat{P}_{r-1}^{\perp}=I-\hat{P}_{r-1}, then that vrvjv_{r}\perp v_{j}. Next, we applied the definition of P^r1\hat{P}_{r-1} and the triangle inequality. To get (48) we use the Cauchy-Schwarz inequality on the first term, and the definition of GrG_{r} for the second. Finally, we use the inductive hypothesis and that α14N\alpha\leq\frac{1}{4N}.

We have now upper bounded the inner products in the second term of (44), and it remains to lower bound the norm in the denominator. We can rewrite

Here we again used P^r1=IP^r1\hat{P}_{r-1}^{\perp}=I-\hat{P}_{r-1} followed by an (over)expansion of P^r1\hat{P}_{r-1}. The remaining steps follow from the inductive hypothesis and fact that α14N\alpha\leq\frac{1}{4N}. Combining (56) with (50) and returning to (44), we have that

Therefore, for each tkt\leq k and jtj\geq t an upper bound Pt1vj22α2r=1t1Xr\left\|P_{t-1}v_{j}\right\|^{2}\leq 2\alpha^{2}\sum_{r=1}^{t-1}\left|X_{r}\right|. Returning now to (41), we have that for any tkt\leq k, xXtx\in X_{t}, and jtj\geq t the event GtG_{\leq t} implies

where we used that αc2(1+2N)\alpha\leq\frac{c}{2\left(1+\sqrt{2N}\right)}

This closely follows the proof of Lemma 11 .

First, we apply Bayes’ rule to each density and use the fact that RV<t=V<tRV_{<t}=V_{<t}:

Appendix B Proof of Theorem 1

Assume for now that B=1B=1, the lower bound can be established for other values of BB by scaling inputs to our construction. Let

The random draw of VV defines a distribution over functions ff. We will lower bound the expected suboptimality of any deterministic optimization algorithm’s output and apply Yao’s minimax principle at the end of the proof.

This function has the following properties:

Furthermore, optimizing ff is equivalent to “finding” the vectors v1,,vD+1v_{1},\dots,v_{D+1}. In particular, until a point that has a substantial inner product with all of v1,,vD+1v_{1},\dots,v_{D+1} is found, the algorithm will remain far from the minimum:

The function also has the property that if xx has a small inner product with vt,,vD+1v_{t},\dots,v_{D+1}, then the gradient oracle will reveal little information about ff when queried at xx:

In Appendix A, we studied the situation where orthonormal v1,,vD+1v_{1},\dots,v_{D+1} are chosen uniformly at random and a sequence of sets of vectors X1,,XD+1X_{1},\dots,X_{D+1} are generated as

Therefore, by Lemma 2, when the dimension

Therefore, by Yao’s minimax principle for any randomized algorithm A\mathcal{A}

The statistical term L8N\frac{L}{8\sqrt{N}} follows from Lemma 10. ∎

For any r>tr>t (85) is less than (86), thus no r>tr>t can be in the arg max\operatorname*{arg\,max} in (84). Therefore, using only v1,,vtv_{1},\dots,v_{t} we can calculate

Appendix C Proof of Theorem 2

Without loss of generality, assume B=1B=1, the lower bound can be proven for other values of BB by scaling inputs to our construction by 1/B1/B. Let

It is straightforward to confirm that ϕc\phi_{c} is convex, 2γ2\gamma-Lipschitz continuous, and 44-smooth. Let P\mathcal{P} be the uniform distribution over {1,2}\left\{1,2\right\}. Let v1,v2,,v2Dv_{1},v_{2},\dots,v_{2D} be a set of orthonormal vectors drawn uniformly at random and define

The random choice of VV determines a distribution over functions f(;1)f(\cdot;1) and f(;2)f(\cdot;2). We will lower bound the expectation (over VV) of the suboptimality of any deterministic algorithm’s output, and then apply Yao’s minimax principle.

First, we show that the functions f(;1)f(\cdot;1) and :=f(;2):=f(\cdot;2) are convex, LL-Lipschitz, and HH-smooth:

For any H,L0H,L\geq 0, D1D\geq 1, and orthonormal v1,...,v2Dv_{1},...,v_{2D}, and with η\eta, γ\gamma, aa, and cc chosen as in (92), f(;1)f(\cdot;1) and f(;2)f(\cdot;2) are convex, LL-Lipschitz, and HH-smooth.

Next, we show that optimizing FF is equivalent to “finding” a large number of the vectors v1,,v2Dv_{1},\dots,v_{2D}:

For any H,L0H,L\geq 0, D1D\geq 1, and orthonormal v1,...,v2Dv_{1},...,v_{2D}, and with η\eta, γ\gamma, aa, and cc chosen as in (92), for any xx such that vrxc2\left|v_{r}^{\top}x\right|\leq\frac{c}{2} for all r>Dr>D

Next, we show that at any point xx such that vrxc2\left|v_{r}^{\top}x\right|\leq\frac{c}{2} for all rtr\geq t, the function value, gradient, and prox of f(;1)f(\cdot;1) and f(;2)f(\cdot;2) at xx are calculable using v1,,vtv_{1},\dots,v_{t} only:

For any xx such that vrxc2\left|v_{r}^{\top}x\right|\leq\frac{c}{2} for all rtr\geq t, and any β0\beta\geq 0 the function values, gradients, and proxs f(x;1)f(x;1), f(x;2)f(x;2), f(x;1)\nabla f(x;1), f(x;2)\nabla f(x;2), proxf(,1)(x,β)\textrm{prox}_{f(\cdot,1)}(x,\beta), and proxf(,2)(x,β)\textrm{prox}_{f(\cdot,2)}(x,\beta) are calculable using β,x,v1,,vt\beta,x,v_{1},\dots,v_{t} only.

In Appendix A, we studied the situation where orthonormal v1,,v2Dv_{1},\dots,v_{2D} are chosen uniformly at random and a sequence of sets of vectors X1,,X2DX_{1},\dots,X_{2D} are generated as

Consider the dependency graph, and let X1X_{1} be the set of queries made in vertices at depth 1 in the graph (i.e. they have no parents). Let X2X_{2} be the set of queries made in vertices at depth 2 in the graph (i.e. their parents correspond to the queries in X1X_{1}). Continue in this way for each tDt\leq D, and then let XD+1={x^}X_{D+1}=\left\{\hat{x}\right\} correpond to the output of the optimization algorithm, which for now is deterministic.

Suppose G<tG^{\prime}_{<t}. Then for all of the queries xX1Xt1x\in X_{1}\cup\dots\cup X_{t-1} and for all rt1r\geq t-1 we have x,vrc2\left|\left\langle x,\,v_{r}\right\rangle\right|\leq\frac{c}{2}. Therefore, by Lemma 9 the function values, gradients, and proxs of f(;1)f(\cdot;1) and f(;2)f(\cdot;2) are calculable based only on the query points and v1,,vt1v_{1},\dots,v_{t-1}. Therefore, all of the queries in XtX_{t} are a deterministic function of V<tV_{<t} only so XtX_{t} satisfies the required decomposition property (98). Finally, the queries are required to be in the domain of ff, thus they will have norm bounded by BB.

with probability 1/21/2 for every xX1XD+1x\in X_{1}\cup\dots\cup X_{D+1} which includes x^\hat{x}, x,vrc2\left|\left\langle x,\,v_{r}\right\rangle\right|\leq\frac{c}{2} for r>Dr>D, so by Lemma 8

so by Yao’s minimax principle, for any randomized algorithm A\mathcal{A}

The statistical term LB8N\frac{LB}{8\sqrt{N}} follows from Lemma 10. ∎

The functions f(;1)f(\cdot;1) and f(;2)f(\cdot;2) are a sum of linear functions and ϕc\phi_{c}, which is convex; therefore both are convex. Also, the scalar function ϕc\phi_{c} is 2γ2\gamma-Lipschitz, so

where we used that a=18D3<γ=4Lη2Da=\frac{1}{\sqrt{8D^{3}}}<\gamma=\frac{4L}{\eta\sqrt{2D}}. Similarly,

Therefore, f(;1)f(\cdot;1) and f(;2)f(\cdot;2) are LL-Lipschitz. Furthermore, since ϕc\phi_{c} is 44-smooth,

therefore, the maximum eigenvalue of 2f(;1)\nabla^{2}f(\cdot;1) and 2f(;2)\nabla^{2}f(\cdot;2) is at most ηH\eta\leq H. ∎

First, we upper bound minx:x1F(x)\min_{x:\left\|x\right\|\leq 1}F(x). Recalling that a=18D3a=\frac{1}{\sqrt{8D^{3}}}, define

For this xx^{*}, vr1xvrx=v2Dx=av_{r-1}^{\top}x^{*}-v_{r}^{\top}x^{*}=v_{2D}^{\top}x^{*}=a and with our choice of parameters (92), 2c=a<γ2c=a<\gamma, so that ϕc(a)=2a\phi_{c}^{\prime}(a)=2a, thus

Since x1\left\|x^{*}\right\|\leq 1 and F(x)=0\nabla F(x^{*})=0, we conclude

Let XD={x:x1, vrxc2 r>D}X_{D}=\left\{x:\left\|x\right\|\leq 1,\ \left|v_{r}^{\top}x\right|\leq\frac{c}{2}\ \forall r>D\right\}. We will now lower bound

Introducing dual variables λD+1,...,λ2D0\lambda_{D+1},...,\lambda_{2D}\geq 0, solving (115) amounts to finding an xXDx\in X_{D} and a set of non-negative λ\lambdas such that F(x)=r=D+12Dλrsign(vrx)vr\nabla F(x)=-\sum_{r=D+1}^{2D}\lambda_{r}\,\textrm{sign}\left(v_{r}^{\top}x\right)v_{r} and such that λr(vrxc2)=0\lambda_{r}\left(v_{r}^{\top}x-\frac{c}{2}\right)=0 for each rr. Let

Since a(D+1r)+c2<a(2D+1r)a\left(D+1-r\right)+\frac{c}{2}<a\left(2D+1-r\right) for rD+1r\leq D+1 and x1\left\|x^{*}\right\|\leq 1 it follows that xD1\left\|x_{D}\right\|\leq 1. Furthermore, since vr1xDvrxD=av_{r-1}^{\top}x_{D}-v_{r}^{\top}x_{D}=a for 2rD+12\leq r\leq D+1 and 2c=a<γ2c=a<\gamma, the gradient

Suppose that xx is a point such that vrxc2\left|v_{r}^{\top}x\right|\leq\frac{c}{2} for all rtr\geq t, and β0\beta\geq 0. Therefore, ϕc(vr1xvrx)=0\phi_{c}\left(v_{r-1}^{\top}x-v_{r}^{\top}x\right)=0 for r>tr>t so

Thus both f(x;1)f(x;1) and f(x;2)f(x;2) can be calculated from x,v1,,vtx,v_{1},\dots,v_{t} only. Similarly, ϕc(vr1xvrx)=0\phi^{\prime}_{c}\left(v_{r-1}^{\top}x-v_{r}^{\top}x\right)=0 for r>tr>t so

Thus f(x;1)\nabla f(x;1) and f(x;2)\nabla f(x;2) can also be calculated from x,v1,,vtx,v_{1},\dots,v_{t} only.

Now, we consider the proxs at such a point xx. Let t=tt^{\prime}=t if tt is odd, and t=t1t^{\prime}=t-1 if tt is even. Let PP be the projection operator onto S=Span(v1,,vt)S=\textrm{Span}\left(v_{1},\dots,v_{t^{\prime}}\right) and let PP^{\perp} be the projection onto the orthogonal subspace, SS^{\perp}. Then, since f(x;1)=f(Px;1)+f(Px;1)f(x;1)=f(Px;1)+f(P^{\perp}x;1), we can decompose the prox:

Where we used that vrPx=vrxc2\left|v_{r}^{\top}P^{\perp}x\right|=\left|v_{r}^{\top}x\right|\leq\frac{c}{2} for all r>tr>t^{\prime}, so setting y2=Pxy_{2}=P^{\perp}x achieves the minimum since every term in the expression is zero and function is non-negative. The vector PxP^{\perp}x is calculable from x,v1,,vtx,v1,,vtx,v_{1},\dots,v_{t^{\prime}}\subseteq x,v_{1},\dots,v_{t}, and similarly the second term is a minimization depends only on β,x,v1,,vtβ,x,v1,,vt\beta,x,v_{1},\dots,v_{t^{\prime}}\subseteq\beta,x,v_{1},\dots,v_{t}. A nearly identical argument shows that proxf(;2)(x,β)\textrm{prox}_{f(\cdot;2)}(x,\beta) has the same property. ∎

Appendix D Statistical term

For any L,B>0L,B>0, there exists a distribution P\mathcal{P}, and an LL-Lipschitz, -smooth function ff defined on [B,B][-B,B] such that the output x^\hat{x} of any potentially randomized optimization algorithm which accesses a stochastic gradient or prox oracle at most NN times satisfies

Let ϵ>0\epsilon>0 and pUniform{p1,p1}p\sim\textrm{Uniform}\left\{p_{1},p_{-1}\right\} where p1=1+ϵ2p_{1}=\frac{1+\epsilon}{2} and p1=1ϵ2p_{-1}=\frac{1-\epsilon}{2}. Define Pp\mathcal{P}_{p} as

Now consider any deterministic optimization algorithm which accesses the gradient or prox oracle NN times. Each gradient or prox oracle response can be simulated using a single zPpz\sim\mathcal{P}_{p}, so the algorithm’s output is x^=x^(z1,,zN)[B,B]\hat{x}=\hat{x}(z_{1},\dots,z_{N})\in[-B,B]. Consider

Furthermore, the Bayes optimal estimate x^\hat{x} of sign(2p1)\textrm{sign}(2p-1) is

This simply requires lower bounding the tail of the Binomial(N,1ϵ2)\left(N,\frac{1-\epsilon}{2}\right) distribution. Using Theorem 2.1 in ,

where Φ\Phi is the distribution function of the standard normal. Let ϵ=12N\epsilon=\frac{1}{2\sqrt{N}}, then ϵN1ϵ2<35\frac{\epsilon\sqrt{N}}{\sqrt{1-\epsilon^{2}}}<\frac{3}{5} and

Therefore, by Yao’s minimax principle, for any randomized algorithm A\mathcal{A}

Appendix E Supplement to Section 4

Smoothed accelerated mini-batch SGD is the composition of two ingredients. First, we approximate the non-smooth ff with a smooth surrogate, and then perform accelerated mini-batch SGD on the surrogate . In particular, we use the β\beta-Moreau envelope f(β)f^{(\beta)} of ff:

Since ff is LL-Lipschitz, f(β)f^{(\beta)} has the following properties (Proposition 12.29 ):

f(β)(x;z)=β(xproxf(;z)(x,β))\nabla f^{(\beta)}(x;z)=\beta(x-\textrm{prox}_{f(\cdot;z)}(x,\beta))

f(β)(x;z)f(x;z)f(β)(x;z)+L22βf^{(\beta)}(x;z)\leq f(x;z)\leq f^{(\beta)}(x;z)+\frac{L^{2}}{2\beta} for all xx

We use the prox oracle to execute A-MB-SGD on the LL-Lipschitz and β\beta-smooth f(β)f^{(\beta)}, with updates

The A-MB-SGD algorithm will converge on f(β)f^{(\beta)} at a rate (see )

Choosing β=min{LT,H}\beta=\min\left\{LT,H\right\} the conclude

which matches the lower bound in Theorem 2.

E.2 Wait-and-collect accelerated mini-batch SGD

E.3 Analysis of technical results in Section 4.4

To apply SVRG method to solve stochastic convex optimization problems under intermittent synchronization graph. We adopt the approach by , first we sample nn instances {z1,...,zn}\{z_{1},...,z_{n}\} and solve a regularized empirical risk minimization problem based on {z1,...,zn}\{z_{1},...,z_{n}\}:

By standard estimation-optimization error decomposition (e.g. Section 4 in ), we have

given slog(n/(LB))s\asymp\log(n/(LB)). Thus to implement SVRG successfully, we need to choose nn such that the following two conditions are satisfied:

Thus we know by choosing nn below will satisfy above condition:

substitute the scale of nn to (154) we get