Tight Complexity Bounds for Optimizing Composite Objectives

Blake Woodworth, Nathan Srebro

Introduction

We consider minimizing the average of m2m\geq 2 convex functions:

A natural question is how to leverage the prox oracle, and how much benefit it provides over gradient access alone. The prox oracle is potentially much more powerful, as it provides global, rather then local, information about the function. For example, for a single function (m=1m=1), one prox oracle call (with β=0\beta=0) is sufficient for exact optimization. Several methods have recently been suggested for optimizing a sum or average of several functions using prox accesses to each component, both in the distributed setting where each components might be handled on a different machine (e.g. ADMM , DANE , DISCO ) or for functions that can be decomposed into several “easy” parts (e.g. PRISMA ). But as far as we are aware, no meaningful lower bound was previously known on the number of prox oracle accesses required even for the average of two functions (m=2m=2).

The optimization of composite objectives of the form (1) has also been extensively studied in the context of minimizing empirical risk over mm samples. Recently, stochastic methods such as SDCA , SAG , SVRG , and other variants, have been presented which leverage the finite nature of the problem to reduce the variance in stochastic gradient estimates and obtain guarantees that dominate both batch and stochastic gradient descent. As methods with improved complexity, such as accelerated SDCA , accelerated SVRG, and Katyusha , have been presented, researchers have also tried to obtain lower bounds on the best possible complexity in this settings—but as we survey below, these have not been satisfactory so far.

In this paper, after briefly surveying methods for smooth, composite optimization, we present methods for optimizing non-smooth composite objectives, which show that prox oracle access can indeed be leveraged to improve over methods using merely subgradient access (see Section 3). We then turn to studying lower bounds. We consider algorithms that access the objective FF only through the oracle hFh_{F} and provide lower bounds on the number of such oracle accesses (and thus the runtime) required to find ϵ\epsilon-suboptimal solutions. We consider optimizing both Lipschitz (non-smooth) functions and smooth functions, and guarantees that do and do not depend on strong convexity, distinguishing between deterministic optimization algorithms and randomized algorithms. Our upper and lower bounds are summarized in Table 1.

As shown in the table, we provide matching upper and lower bounds (up to a log factor) for all function and algorithm classes. In particular, our bounds establish the optimality (up to log factors) of accelerated SDCA, SVRG, and SAG for randomized finite-sum optimization, and also the optimality of our deterministic smoothing algorithms for non-smooth composite optimization.

For non-smooth functions, we show that having access to prox oracles for the components can reduce the polynomial dependence on ϵ\epsilon from 1/ϵ21/\epsilon^{2} to 1/ϵ1/\epsilon, or from 1/(λϵ)1/(\lambda\epsilon) to 1/λϵ1/\sqrt{\lambda\epsilon} for λ\lambda-strongly convex functions. However, all of the optimal complexities for smooth functions can be attained with only component gradient access using accelerated gradient descent (AGD) or accelerated SVRG. Thus the worst-case complexity cannot be improved (at least not significantly) by using the more powerful prox oracle.

On the power of randomization

We establish a significant gap between deterministic and randomized algorithms for finite-sum problems. Namely, the dependence on the number of components must be linear in mm for any deterministic algorithm, but can be reduced to m\sqrt{m} (in the typically significant term) using randomization. We emphasize that the randomization here is only in the algorithm—not in the oracle. We always assume the oracle returns an exact answer (for the requested component) and is not a stochastic oracle. The distinction is that the algorithm is allowed to flip coins in deciding what operations and queries to perform but the oracle must return an exact answer to that query (of course, the algorithm could simulate a stochastic oracle).

Prior Lower Bounds

Several authors recently presented lower bounds for optimizing (1) in the smooth and strongly convex setting using component gradients. Agarwal and Bottou presented a lower bound of Ω(m+mγλlog1ϵ)\Omega\left(m+\sqrt{\tfrac{m\gamma}{\lambda}}\log\tfrac{1}{\epsilon}\right). However, their bound is valid only for deterministic algorithms (thus not including SDCA, SVRG, SAG, etc.)—we not only consider randomized algorithms, but also show a much higher lower bound for deterministic algorithms (i.e. the bound of Agarwal and Bottou is loose). Improving upon this, Lan shows a similar lower bound for a restricted class of randomized algorithms: the algorithm must select which component to query for a gradient by drawing an index from a fixed distribution, but the algorithm must otherwise be deterministic in how it uses the gradients, and its iterates must lie in the span of the gradients it has received. This restricted class includes SAG, but not SVRG nor perhaps other realistic attempts at improving over these. Furthermore, both bounds allow only gradient accesses, not prox computations. Thus SDCA, which requires prox accesses, and potential variants are not covered by such lower bounds. We prove as similar lower bound to Lan’s, but our analysis is much more general and applies to any randomized algorithm, making any sequence of queries to a gradient and prox oracle, and without assuming that iterates lie in the span of previous responses. In addition to smooth functions, we also provide lower bounds for non-smooth problems which were not considered by these previous attempts. Another recent observation was that with access only to random component subgradients without knowing the component’s identity, an algorithm must make Ω(m2)\Omega(m^{2}) queries to optimize well. This shows how relatively subtle changes in the oracle can have a dramatic effect on the complexity of the problem. Since the oracle we consider is quite powerful, our lower bounds cover a very broad family of algorithms, including SAG, SVRG, and SDCA.

Our deterministic lower bounds are inspired by a lower bound on the number of rounds of communication required for optimization when each fif_{i} is held by a different machine and when iterates lie in the span of certain permitted calculations . Our construction for m=2m=2 is similar to theirs (though in a different setting), but their analysis considers neither scaling with mm (which has a different role in their setting) nor randomization.

Notation and Definitions

Optimizing Smooth Sums

We briefly review the best known methods for optimizing (1) when the components are γ\gamma-smooth, yielding the upper bounds on the right half of Table 1. These upper bounds can be obtained using only component gradient access, without need for the prox oracle.

We can obtain exact gradients of F(x)F(x) by computing all mm component gradients fi(x)\nabla f_{i}(x). Running accelerated gradient descent (AGD) on F(x)F(x) using these exact gradients achieves the upper complexity bounds for deterministic algorithms and smooth problems (see Table 1).

When FF is not strongly convex, adding a regularizer to the objective and instead optimizing Fλ(x)=F(x)+λ2x2F_{\lambda}(x)=F(x)+\frac{\lambda}{2}\left\|x\right\|^{2} with λ=ϵ/B2\lambda=\epsilon/B^{2} results in an oracle complexity of O((m+mγB2ϵ)logϵ0ϵ)\mathcal{O}\left(\left(m+\sqrt{\tfrac{m\gamma B^{2}}{\epsilon}}\right)\log\tfrac{\epsilon_{0}}{\epsilon}\right). The log-factor in the second term can be removed using the more delicate reduction of Allen-Zhu and Hazan , which involves optimizing Fλ(x)F_{\lambda}(x) for progressively smaller values of λ\lambda, yielding the upper bound in the table.

Katyusha and Catalyst-accelerated SAG or SVRG use only gradients of the components. Accelerated SDCA achieves a similar complexity using gradient and prox oracle access.

Leveraging Prox Oracles for Lipschitz Sums

In this section, we present algorithms for leveraging the prox oracle to minimize (1) when each component is LL-Lipschitz. This will be done by using the prox oracle to “smooth” each component, and optimizing the new, smooth sum which approximates the original problem. This idea was used in order to apply Katyusha and accelerated SDCA to non-smooth objectives. We are not aware of a previous explicit presentation of the AGD-based deterministic algorithm, which achieves the deterministic upper complexity indicated in Table 1.

The key is using a prox oracle to obtain gradients of the β\beta-Moreau envelope of a non-smooth function, ff, defined as:

Let ff be convex and LL-Lipschitz continuous. For any β>0\beta>0,

(f(β))(x)=β(xproxf(x,β))\nabla(f^{(\beta)})(x)=\beta(x-\text{prox}_{f}(x,\beta))

f(β)(x)f(x)f(β)(x)+L22βf^{(\beta)}(x)\leq f(x)\leq f^{(\beta)}(x)+\frac{L^{2}}{2\beta}

Consequently, we can consider the smoothed problem

Finally, if m>L2B2/ϵ2m>L^{2}B^{2}/\epsilon^{2} or m>L2/(λϵ)m>L^{2}/(\lambda\epsilon), stochastic gradient descent is a better randomized alternative, yielding complexities of O(L2B2/ϵ2)\mathcal{O}(L^{2}B^{2}/\epsilon^{2}) or O(L2/(λϵ))\mathcal{O}(L^{2}/(\lambda\epsilon)).

Lower Bounds for Deterministic Algorithms

We now turn to establishing lower bounds on the oracle complexity of optimizing (1). We first consider only deterministic optimization algorithms. What we would like to show is that for any deterministic optimization algorithm we can construct a “hard” function for which the algorithm cannot find an ϵ\epsilon-suboptimal solution until it has made many oracle accesses. Since the algorithm is deterministic, we can construct such a function by simulating the (deterministic) behavior of the algorithm. This can be viewed as a game, where an adversary controls the oracle being used by the algorithm. At each iteration the algorithm queries the oracle with some triplet (x,i,β)(x,i,\beta) and the adversary responds with an answer. This answer must be consistent with all previous answers, but the adversary ensures it is also consistent with a composite function FF that the algorithm is far from optimizing. The “hard” function is then gradually defined in terms of the behavior of the optimization algorithm.

To help us formulate our constructions, we define a “round” of queries as a series of queries in which m2\lceil\frac{m}{2}\rceil distinct functions fif_{i} are queried. The first round begins with the first query and continues until exactly m2\lceil\frac{m}{2}\rceil unique functions have been queried. The second round begins with the next query, and continues until exactly m2\lceil\frac{m}{2}\rceil more distinct components have been queried in the second round, and so on until the algorithm terminates. This definition is useful for analysis but requires no assumptions about the algorithm’s querying strategy.

We begin by presenting a lower bound for deterministic optimization of (1) when each component fif_{i} is convex and LL-Lipschitz continuous, but is not necessarily strongly convex, on the domain X={x:xB}\mathcal{X}=\left\{x:\left\|x\right\|\leq B\right\}. Without loss of generality, we can consider L=B=1L=B=1. We will construct functions of the following form:

During each round tt, the adversary answers queries according to fitf_{i}^{t}, which depends only on vr,δi,rv_{r},\delta_{i,r} for r<tr<t, i.e. from previous rounds. When the round is completed, δi,t\delta_{i,t} is determined and vtv_{t} is chosen to be orthogonal to the vectors {v0,...,vt1}\left\{v_{0},...,v_{t-1}\right\} as well as every point queried by the algorithm so far, thus defining fit+1f_{i}^{t+1} for the next round. In Appendix B.1 we prove that these responses based on fitf_{i}^{t} are consistent with fif_{i}.

The algorithm can only learn vrv_{r} after it completes round rr—until then every iterate is orthogonal to it by construction. The average of these functions reaches its minimum of F(x)=0F(x^{*})=0 at x=br=0kvrx^{*}=b\sum_{r=0}^{k}v_{r}, so we can view optimizing these functions as the task of discovering the vectors vrv_{r}—even if only vkv_{k} is missing, a suboptimality better than b/(6k)>ϵb/(6\sqrt{k})>\epsilon cannot be achieved. Therefore, the deterministic algorithm must complete at least kk rounds of optimization, each comprising at least m2\left\lceil\frac{m}{2}\right\rceil queries to hFh_{F} in order to optimize FF. The key to this construction is that even though each term x,vr1x,vr\left|\left\langle x,\,v_{r-1}\right\rangle-\left\langle x,\,v_{r}\right\rangle\right| appears in m/2m/2 components, and hence has a strong effect on the average F(x)F(x), we can force a deterministic algorithm to make Ω(m)\Omega(m) queries during each round before it finds the next relevant term. We obtain (for complete proof see Appendix B.1):

Furthermore, we can always reduce optimizing a function over xB\left\|x\right\|\leq B to optimizing a strongly convex function by adding the regularizer ϵx2/(2B2)\epsilon\left\|x\right\|^{2}/(2B^{2}) to each component, implying (see complete proof in Appendix B.2):

2 Smooth Components

When the components fif_{i} are required to be smooth, the lower bound construction is similar to (6), except it is based on squared differences instead of absolute differences. We consider the functions:

where δi,r\delta_{i,r} and vrv_{r} are as before. Again, we can answer queries at round tt based only on δi,r,vr\delta_{i,r},v_{r} for r<tr<t. This construction yields the following lower bounds (full details in Appendix B.3):

In the strongly convex case, we use a very similar construction, adding the term λx2/2\lambda\left\|x\right\|^{2}/2, which gives the following bound (see Appendix B.4):

Lower Bounds for Randomized Algorithms

We now turn to randomized algorithms for (1). In the deterministic constructions, we relied on being able to set vrv_{r} and δi,r\delta_{i,r} based on the predictable behavior of the algorithm. This is impossible for randomized algorithms, we must choose the “hard” function before we know the random choices the algorithm will make—so the function must be “hard” more generally than before.

Previously, we chose vectors vrv_{r} orthogonal to all previous queries made by the algorithm. For randomized algorithms this cannot be ensured. However, if we choose orthonormal vectors vrv_{r} randomly in a high dimensional space, they will be nearly orthogonal to queries with high probability. Slightly modifying the absolute or squared difference from before makes near orthogonality sufficient. This issue increases the required dimension but does not otherwise affect the lower bounds.

More problematic is our inability to anticipate the order in which the algorithm will query the components, precluding the use of δi,r\delta_{i,r}. In the deterministic setting, if a term revealing a new vrv_{r} appeared in half of the components, we could ensure that the algorithm must make m/2m/2 queries to find it. However, a randomized algorithm could find it in two queries in expectation, which would eliminate the linear dependence on mm in the lower bound! Alternatively, if only one component included the term, a randomized algorithm would indeed need Ω(m)\Omega(m) queries to find it, but that term’s effect on suboptimality of FF would be scaled down by mm, again eliminating the dependence on mm.

First consider the non-smooth, non-strongly-convex setting and assume for simplicity mm is even (otherwise we simply let the last function be zero). We define the helper function ψc\psi_{c}, which replaces the absolute value operation and makes our construction resistant to small inner products between iterates and not-yet-discovered components:

Next, we define m/2m/2 pairs of functions, indexed by i=1..m/2i=1..m/2:

where {vi,r}r=0..k,i=1..m/2\left\{v_{i,r}\right\}_{r=0..k,i=1..m/2} are random orthonormal vectors and k=Θ(1ϵm)k=\Theta(\tfrac{1}{\epsilon\sqrt{m}}). With cc sufficiently small and the dimensionality sufficiently high, with high probability the algorithm only learns the identity of new vectors vi,rv_{i,r} by alternately querying fi,1f_{i,1} and fi,2f_{i,2}; so revealing all k+1k+1 vectors requires at least k+1k+1 total queries. Until vi,kv_{i,k} is revealed, an iterate is Ω(ϵ)\Omega(\epsilon)-suboptimal on (fi,1+fi,2)/2(f_{i,1}+f_{i,2})/2. From here, we show that an ϵ\epsilon-suboptimal solution to F(x)F(x) can be found only after at least k+1k+1 queries are made to at least m/4m/4 pairs, for a total of Ω(mk)\Omega(mk) queries. This time, since the optimum xx^{*} will need to have inner product bb with Θ(mk)\Theta(mk) vectors vi,rv_{i,r}, we need to have b=Θ(1mk)=Θ(ϵ/m)b=\Theta(\tfrac{1}{\sqrt{mk}})=\Theta(\sqrt{\epsilon/\sqrt{m}}), and the total number of queries is Ω(mk)=Ω(mϵ)\Omega(mk)=\Omega(\tfrac{\sqrt{m}}{\epsilon}). The Ω(m)\Omega(m) term of the lower bound follows trivially since we require ϵ=O(1/m)\epsilon=\mathcal{O}(1/\sqrt{m}), (proofs in Appendix C.1):

An added regularizer gives the result for strongly convex functions (see Appendix C.2):

The large dimension required by these lower bounds is the cost of omitting the assumption that the algorithm’s queries lie in the span of previous oracle responses. If we do assume that the queries lie in that span, the necessary dimension is only on the order of the number of oracle queries needed.

When ϵ=Ω(LB/m)\epsilon=\Omega\left(LB/\sqrt{m}\right) in the non-strongly convex case or ϵ=Ω(L2/(λm))\epsilon=\Omega\left(L^{2}/(\lambda m)\right) in the strongly convex case, the lower bounds for randomized algorithms presented above do not apply. Instead, we can obtain a lower bound based on an information theoretic argument. We first uniformly randomly choose a parameter pp, which is either (1/22ϵ)(1/2-2\epsilon) or (1/2+2ϵ)(1/2+2\epsilon). Then for i=1,...,mi=1,...,m, in the non-strongly convex case we make fi(x)=xf_{i}(x)=x with probability pp and fi(x)=xf_{i}(x)=-x with probability 1p1-p. Optimizing F(x)F(x) to within ϵ\epsilon accuracy then implies recovering the bias of the Bernoulli random variable, which requires Ω(1/ϵ2)\Omega(1/\epsilon^{2}) queries based on a standard information theoretic result . Setting fi(x)=±x+λ2x2f_{i}(x)=\pm x+\frac{\lambda}{2}\left\|x\right\|^{2} gives a Ω(1/(λϵ))\Omega(1/(\lambda\epsilon)) lower bound in the λ\lambda-strongly convex setting. This is formalized in Appendix C.5.

2 Smooth Components

When the functions fif_{i} are smooth and not strongly convex, we define another helper function ϕc\phi_{c}:

and the following pairs of functions for i=1,...,m/2i=1,...,m/2:

with vi,rv_{i,r} as before. The same arguments apply, after replacing the absolute difference with squared difference. A separate argument is required in this case for the Ω(m)\Omega(m) term in the bound, which we show using a construction involving mm simple linear functions (see Appendix C.3).

In the strongly convex case, we add the term λx2/2\lambda\left\|x\right\|^{2}/2 to fi,1f_{i,1} and fi,2f_{i,2} (see Appendix C.4) to obtain:

We consider (1) as a constrained optimization problem, thus the minimizer of FF could be achieved on the boundary of X\mathcal{X}, meaning that the gradient need not vanish. If we make the additional assumption that the minimizer of FF lies on the interior of X\mathcal{X} (and is thus the unconstrained global minimum), Theorems 1-8 all still apply, with a slight modification to Theorems 3 and 7. Since the gradient now needs to vanish on X\mathcal{X}, is always O(γB2)\mathcal{O}(\gamma B^{2})-suboptimal, and only values of ϵ\epsilon in the range 0<ϵ<γB21280<\epsilon<\frac{\gamma B^{2}}{128} and 0<ϵ<9γB21280<\epsilon<\frac{9\gamma B^{2}}{128} result in a non-trivial lower bound (see Remarks at the end of Appendices B.3 and C.3).

Conclusion

We provide a tight (up to a log factor) understanding of optimizing finite sum problems of the form (1) using a component prox oracle.

Randomized optimization of (1) has been the subject of much research in the past several years, starting with the presentation of SDCA and SAG, and continuing with accelerated variants. Obtaining lower bounds can be very useful for better understanding the problem, for knowing where it might or might not be possible to improve or where different assumptions would be needed to improve, and for establishing optimality of optimization methods. Indeed, several attempts have been made at lower bounds for the finite sum setting . But as we explain in the introduction, these were unsatisfactory and covered only limited classes of methods. Here we show that in a fairly general sense, accelerated SDCA, SVRG, SAG, and Katyusha are optimal up to a log factor. Improving on their runtime would require additional assumptions, or perhaps a stronger oracle. However, even if given “full” access to the component functions, all algorithms that we can think of utilize this information to calculate a prox vector. Thus, it is unclear what realistic oracle would be more powerful than the prox oracle we consider.

Our results highlight the power of randomization, showing that no deterministic algorithm can beat the linear dependence on mm and reduce it to the m\sqrt{m} dependence of the randomized algorithms.

In studying finite sum problems, we were forced to explicitly study lower bounds for randomized optimization as opposed to stochastic optimization (where the source of randomness is the oracle, not the algorithm). Even for the classic problem of minimizing a smooth function using a first order oracle, we could not locate a published proof that applies to randomized algorithms. We provide a simple construction using ϵ\epsilon-insensitive differences that allows us to easily obtain such lower bounds without reverting to assuming the iterates are spanned by previous responses (as was done, e.g., in ), and could potentially be useful for establishing randomized lower bounds also in other settings.

We thank Ohad Shamir for his helpful discussions and for pointing out .

References

Appendix A Upper bounds for non-smooth sums

Consider the case where the components are not strongly convex. As shown in lemma 1, we can use a single call to a prox oracle to obtain the gradient of

which is a β\beta-smooth approximation to ff. We then consider the new optimization problem:

When the component functions are λ\lambda-strongly convex, a more sophisticated strategy is required to avoid an extra log\log factor. The solution is the AdaptSmooth algorithm . This involves solving O(log1ϵ)\mathcal{O}(\log\frac{1}{\epsilon}) smooth and strongly convex subproblems, where the ttht^{\text{th}} subproblem is reducing the suboptimality of the βt\beta_{t}-smooth and λ\lambda-strongly convex function F(βt)(x)F^{(\beta_{t})}(x) by a factor of four, where βt=L2ϵ02t\beta_{t}=\frac{L^{2}}{\epsilon_{0}}2^{t} and where ϵ0L2λ\epsilon_{0}\leq\frac{L^{2}}{\lambda} upper bounds the initial suboptimality. Using this method results in an ϵ\epsilon-suboptimal solution for FF after t=0logϵ0ϵTime(βt,λ)\sum_{t=0}^{\log\frac{\epsilon_{0}}{\epsilon}}\text{Time}(\beta_{t},\lambda) queries to hFh_{F}.

In the case of AGD, Time(γ,λ)=O(mγλ)(\gamma,\lambda)=\mathcal{O}\left(m\sqrt{\frac{\gamma}{\lambda}}\right) and

Appendix B Lower bounds for deterministic algorithms

Without loss of generality, we can assume L=B=1L=B=1. For particular values bb and kk to be decided upon later, we use the functions (6):

For non-smooth functions, the subgradient oracle is not uniquely defined—-many different subgradients might be a valid response. However, in order to say that an algorithm successfully optimizes a function, it must be able to do so no matter which subgradient is receives. Conversely, to show a lower bound, it is sufficient to show that for some valid subgradient the algorithm fails. And so, in constructing a “hard” instance to optimize we are actually constructing both a function and a subgradient oracle for it, with specific subgradient responses. Therefore, answering the algorithm’s queries during round tt according to fitf_{i}^{t} is valid so long as the subgradient we return is a valid subgradient for fif_{i} (the converse need not be true) and the prox returned is exactly the prox of fif_{i}. For now, assume that this query-answering strategy is consistent (we will prove this last).

Then if d=mϵ+k+1d=\lceil\frac{m}{\epsilon}\rceil+k+1 and if xx is an iterate generated both before AA completes round kk and before it makes mϵ\lceil\frac{m}{\epsilon}\rceil queries to hFh_{F} (so that the dimension is large enough to orthogonalize each vtv_{t} as described above), then x,vk=0\left\langle x,\,v_{k}\right\rangle=0 by construction. This allows us to bound the suboptimality of F(x)F(x) (since m2\lceil\frac{m}{2}\rceil functions are queried during each round, i=1mδi,r=m2\sum_{i=1}^{m}\delta_{i,r}=\lfloor\frac{m}{2}\rfloor):

FF is non-negative and F(xb)=0F(x_{b})=0 where xb=br=0kvrx_{b}=b\sum_{r=0}^{k}v_{r}. Choosing b=1k+1b=\frac{1}{\sqrt{k+1}} makes xb=1\left\|x_{b}\right\|=1 so that xbXx_{b}\in\mathcal{X}. Therefore, FF achieves its minimum on X\mathcal{X} and

Where the final inequality holds when k1k\geq 1. Setting k=112ϵk=\lfloor\frac{1}{12\epsilon}\rfloor implies F(x)F(x)ϵF(x)-F(x^{*})\geq\epsilon. Therefore, AA must either query hFh_{F} more than mϵ\lceil\frac{m}{\epsilon}\rceil times or complete kk rounds to reach an ϵ\epsilon-suboptimal solution. Completing each round requires at least m2\lceil\frac{m}{2}\rceil queries to hFh_{F}, so when ϵ112\epsilon\leq\frac{1}{12}, this implies a lower bound of

For any tkt\leq k and any xx such that xvspan{v0,v1,...,vt1}x^{v}\in\text{span}\left\{v_{0},v_{1},...,v_{t-1}\right\}, if function ii is queried during round tt, then fit(x)fi(x)\partial f_{i}^{t}(x)\subseteq\partial f_{i}(x).

All subgradients of fif_{i} have the form

where we define sign(0)=0\text{sign}(0)=0. Since function ii is queried during round tt, δi,t=0\delta_{i,t}=0, and since x,vr1=0=x,vr\left\langle x,\,v_{r-1}\right\rangle=0=\left\langle x,\,v_{r}\right\rangle for all r>tr>t, fi(x)\partial f_{i}(x) contains all subgradients of the form

which is exactly fit(x)\partial f_{i}^{t}(x). ∎

For any tkt\leq k and any xx such that xvspan{v0,v1,...,vt1}x^{v}\in\text{span}\left\{v_{0},v_{1},...,v_{t-1}\right\}, if function ii is queried during round tt then β>0\forall\beta>0, proxfit(x,β)=proxfi(x,β)\text{prox}_{f_{i}^{t}}(x,\beta)=\text{prox}_{f_{i}}(x,\beta).

Consider the definition of the prox oracle from equation 3

Next, we further decompose xv=x+x+x^{v}=x^{-}+x^{+} where

Note that x+=0x^{+}=0 and since function ii is queried during round tt, δi,t=0\delta_{i,t}=0. Therefore,

The last equality follows from the fact that that the minimization is completely separable between uu^{-} and u+u^{+}, allowing us to minimized over each variable separately. The terms containing u+u^{+} are non-negative and can be simultaneously equal to when u+=0u^{+}=0. Therefore, proxfit(x,β)=proxfi(x,β)\text{prox}_{f_{i}^{t}}(x,\beta)=\text{prox}_{f_{i}}(x,\beta). ∎

These lemmas show that the subgradients and proxs of fitf_{i}^{t} at vectors which are queried during round tt are consistent with the subgradients and proxs of fif_{i}. This confirms that our construction is sound, and completes the proof. ∎

B.2 Non-smooth and strongly convex components

By assumption, AA can find an x^\hat{x} such that F(x^)F(x)<ϵ2F(\hat{x})-F(x^{*})<\frac{\epsilon}{2} using o(m(L+λB)λϵ)=o(mLBϵ)o\left(\frac{m(L+\lambda B)}{\sqrt{\lambda\epsilon}}\right)=o\left(\frac{mLB}{\epsilon}\right) queries to hFh_{F}, and

B.3 Smooth and not strongly convex components

This proof will be very similar to the proof of theorem 1. Without loss of generality, we can assume that γ=B=1\gamma=B=1. For a values aa and kk to be fixed later, we define:

We will assume for now that this query-answering strategy is self-consistent, and prove it later. This allows us to bound the suboptimality of F(x)F(x). Note that since exactly m2\lceil\frac{m}{2}\rceil functions are queried each round, i=1mδi,r=m2\sum_{i=1}^{m}\delta_{i,r}=\lfloor\frac{m}{2}\rfloor, so let

Then if d=mϵ+k+1d=\lceil\frac{m}{\sqrt{\epsilon}}\rceil+k+1, and if xx is an iterate generated both before AA completes round q:=k2q:=\lfloor\frac{k}{2}\rfloor and before it makes mϵ\lceil\frac{m}{\sqrt{\epsilon}}\rceil queries to hFh_{F}, then x,vr=0\left\langle x,\,v_{r}\right\rangle=0 for all rqr\geq q by construction. Then, for this xx, Fq(x)=Fk+1(x)=F(x)F^{q}(x)=F^{k+1}(x)=F(x). By first order optimality conditions for FtF^{t}, its optimum xtx_{t}^{*} must satisfy that:

It is straightforward to confirm that the solution to this system of equations is

Thus, we set a=3k+1a=\sqrt{\frac{3}{k+1}}, ensuring xk+1=1\left\|x_{k+1}^{*}\right\|=1 so that xk+1=xXx_{k+1}^{*}=x^{*}\in\mathcal{X}. Furthermore, for the iterate xx made before qq rounds of queries,

where the last inequality holds as long as k2k\geq 2. So, when ϵ<1128\epsilon<\frac{1}{128} and we let k=132ϵk=\lfloor\frac{1}{\sqrt{32\epsilon}}\rfloor, this ensures that

and therefore, AA must complete at least qq rounds or make more than mϵ\lceil\frac{m}{\sqrt{\epsilon}}\rceil queries to hFh_{F} in order to reach an ϵ\epsilon-suboptimal point. This implies a lower bound of

For any tkt\leq k and any xx such that xvspan{v0,v1,...,vt1}x^{v}\in\text{span}\left\{v_{0},v_{1},...,v_{t-1}\right\}, if function ii is queried during round tt, then fit(x)=fi(x)\nabla f_{i}^{t}(x)=\nabla f_{i}(x).

Since function ii is queried during round tt, δi,t=0\delta_{i,t}=0 so

For any tkt\leq k and any xx such that xvspan{v0,v1,...,vt1}x^{v}\in\text{span}\left\{v_{0},v_{1},...,v_{t-1}\right\}, if function ii is queried during round tt then β>0\forall\beta>0, proxfit(x,β)=proxfi(x,β)\text{prox}_{f_{i}^{t}}(x,\beta)=\text{prox}_{f_{i}}(x,\beta).

Up until the last step, this proof is identical to the proof of lemma 3, thus we pick up at (14):

The final step comes from the fact that the arg min\operatorname*{arg\,min} is separable over uu^{-} and u+u^{+}, meaning we can minimize the two terms individually. The terms which contain u+u^{+} are non-negative and equal to zero when u+=0u^{+}=0. ∎

These lemmas show that the gradient and prox of fitf_{i}^{t} at vectors which are queried during round tt are consistent with the gradient and prox of fif_{i}. This confirms that our construction is sound.

This proves the lower bound for ϵ<γB2128\epsilon<\frac{\gamma B^{2}}{128}, we can extend the same lower bound to ϵγB2128\epsilon\geq\frac{\gamma B^{2}}{128} using the following, very simple construction. Let

where vv is a unit vector that is orthogonal to all of the first m1m-1 queries. This function is trivially 11-smooth. By construction, the algorithm must make at least mm queries to learn the identity of vv. Until it has done so, any iterate will have objective value zero, while the optimum F(x)=F(v)=2ϵF(x^{*})=F(v)=-2\epsilon. Therefore, the algorithm must make at least mm queries to reach an ϵ\epsilon-suboptimal solution. For ϵγB2128\epsilon\geq\frac{\gamma B^{2}}{128}

Therefore a lower bound of Ω(mγB2ϵ)\Omega\left(m\sqrt{\frac{\gamma B^{2}}{\epsilon}}\right) applies for any ϵ>0\epsilon>0. ∎

If make the additional assumption that FF is minimized on the interior of X\mathcal{X}, since is O(γB2)\mathcal{O}(\gamma B^{2})-suboptimal, only 0<ϵ<γB21280<\epsilon<\frac{\gamma B^{2}}{128} gives a non-trivial lower bound. This lower bound is shown by the first construction presented in the previous proof.

B.4 Smooth and strongly convex components

We will prove the theorem for 11-smooth and λ\lambda-strongly convex components for any λ<173\lambda<\frac{1}{73}. This can be extended to arbitrary constants γ\gamma and λ\lambda^{\prime} by taking λ=λγ\lambda=\frac{\lambda^{\prime}}{\gamma}.

For any kk and for ζ\zeta and CC to be defined later, let

where the vectors vrv_{r} and indicators δi,r\delta_{i,r} are defined in the same way as in the previous proof. This function is just a multiple of the construction in the proof of theorem 3 plus the λx2/2\lambda\left\|x\right\|^{2}/2 term. It is clear that the norm term is uninformative for learning the identity of vectors vrv_{r}, as the component of the gradients and proxs which is due to that term is simply a scaling of the query point. Thus, for any iterate xx generated by AA before completing tt rounds of optimization, x,vr=0\left\langle x,\,v_{r}\right\rangle=0 for all rtr\geq t so long as the dimension is greater than the total number of queries made to hFh_{F} so far plus k+1k+1; a fact which follows directly from the previous proof.

Since exactly m2\lceil\frac{m}{2}\rceil functions are queried per round, i=1mδi,r=m2\sum_{i=1}^{m}\delta_{i,r}=\lfloor\frac{m}{2}\rfloor and thus

By the first order optimality conditions for F(x)F(x), its optimum xx^{*} must satisfy that:

Defining q:=Q1Q+1<1q:=\frac{\sqrt{Q}-1}{\sqrt{Q}+1}<1 and setting ζ=1q\zeta=1-q, it is straightforward to confirm that

Thus, F(0)F(x)=λC28(Q1)2F(0)-F(x^{*})=\frac{\lambda C^{2}}{8}\left(\sqrt{Q}-1\right)^{2}, so by choosing CC appropriately, we can make the initial suboptimality of our construction take any value ϵ0\epsilon_{0}. Since FF is λ\lambda-strongly convex, for any xtx_{t}, F(xt)F(x)λ2xtx2F(x_{t})-F(x^{*})\geq\frac{\lambda}{2}\left\|x_{t}-x^{*}\right\|^{2}. Let xtx_{t} be an iterate which is generated before tt rounds of optimization have been completed, implying that xt,vr=0\left\langle x_{t},\,v_{r}\right\rangle=0 for all rtr\geq t. So

If we set k+1=t12logqk+1=\left\lceil t-\frac{1}{2\log q}\right\rceil then

and when t=Q14logϵ02Qϵt=\left\lfloor\frac{\sqrt{Q}-1}{4}\log\frac{\epsilon_{0}}{2\sqrt{Q}\epsilon}\right\rfloor

Therefore, the algorithm must complete at least tt rounds of queries before it can reach an ϵ\epsilon-suboptimal point. If ϵ0>3ϵλ\epsilon_{0}>\frac{3\epsilon}{\lambda} and λ<173\lambda<\frac{1}{73}, since each round includes at least m2\frac{m}{2} oracle queries, this implies a lower bound of

queries to hFh_{F} in order to reach an ϵ\epsilon-suboptimal point. ∎

Appendix C Lower bounds for randomized algorithms

We thank Yair Carmon, John Duchi, Oliver Hinder, and Aaron Sidford for pointing out an issue with the original proofs in this section. We have since modified this section in a manner resembling .

To prove the deterministic lower bounds, we constructed vectors vrv_{r} adversarially, orthogonalizing them to queries made by the algorithm. In the randomized setting, this is impossible, as we cannot anticipate query points. Our solution was to instead draw a set of “important directions” vi,rv_{i,r} randomly in high dimensions. The intuition is that a given vector, in this case the query made by the algorithm, will have a very small inner product with a random unit vector with high probability if the dimension is large enough.

Using this fact, we construct helper functions ψc\psi_{c} and ϕc\phi_{c} to replace the absolute and squared difference functions used in the deterministic lower bounds. These functions are both flat at 0 on the interval [c,c][-c,c], meaning that the algorithm’s query needs to have a significant inner product with vi,rv_{i,r} before the oracle needs to give that vector away as a gradient or prox. We will show that each of our constructions satisfy the following property:

For all ii, all tkt\leq k and xx such that rt\forall r\geq t x,vi,rc2\left|\left\langle x,\,v_{i,r}\right\rangle\right|\leq\frac{c}{2}, g1fi,1(x)\exists g_{1}\in\partial f_{i,1}(x) and g2fi,2(x)\exists g_{2}\in\partial f_{i,2}(x) such that

and such that g1g_{1}, g2g_{2}, proxfi,1(x,β)\text{prox}_{f_{i,1}}(x,\beta), and proxfi,2(x,β)\text{prox}_{f_{i,2}}(x,\beta) are all a deterministic function of x(t)x^{(t)} and {vi,r:r<t}\left\{v_{i,r}:r<t\right\}.

In other words, when xx has a small inner product with vi,rv_{i,r} for all rtr\geq t, then querying either fi,1f_{i,1} or fi,2f_{i,2} at xx will reveal at most vi,tv_{i,t}. Our bounds on the complexity of optimizing our functions are based on the principle that the algorithm can only learn one vi,rv_{i,r} per query, so we need to control the probability that the premises of this property hold for every query made by the algorithm. In this section, we will bound how large the dimensionality of the problem needs to be to ensure that with high probability, only one vector is revealed to the algorithm by each oracle response.

For i{1,2,,m/2}i\in\left\{1,2,\dots,m/2\right\}, we will define pairs of component functions fi,1f_{i,1} and fi,2f_{i,2} that satisfy Property 1. If mm is odd, we set fm0f_{m}\equiv 0 and lose only a factor of (m1)/m(m-1)/m in our lower bound, therefore, we proceed assuming mm is even.

The proofs will proceed by showing that optimizing F(x)=1mi=1m/2fi,1(x)+fi,2(x)F(x)=\frac{1}{m}\sum_{i=1}^{m/2}f_{i,1}(x)+f_{i,2}(x) is equivalent to finding an xx with a significant inner product with a large number of the vectors vi,rv_{i,r}. This property depends on the specific constructions and will be discussed in detail later.

First, we will show generally that when the dimension is sufficiently large, Property 1 ensures each oracle access will “reveal” at most a single vector vi,rv_{i,r} with high probability over the randomness in the draw of the vectors vi,rv_{i,r}. As a consequence of this fact, Ω(mk)\Omega(mk) accesses will be needed to optimize FF.

We will prove the lower bound on the number of oracle accesses needed for an arbitrary deterministic optimization algorithm to optimize our random finite sum instance, and then apply Yao’s minimax principle in order to prove the lower bound for randomized optimization algorithms.

We denote the nthn^{th} query made by the algorithm q(n)=(i(n),j(n),x(n),β(n))q^{(n)}=\left(i^{(n)},j^{(n)},x^{(n)},\beta^{(n)}\right), which is a query to function fi(n),j(n)f_{i^{(n)},j^{(n)}} at the point x(n)x^{(n)} with the prox parameter β(n)\beta^{(n)} (if applicable). For now, we require that B1\exists B\geq 1 s.t. x(n)B\left\|x^{(n)}\right\|\leq B for all nn; this assumption will be removed in the individual lower bound proofs. Recalling that we are considering for now deterministic optimization algorithms, the nthn^{th} query is a function of the previous n1n-1 queries and the oracle’s responses to those queries.

Let #(i,n)={t<n:i(t)=i}\#(i,n)=\left|\left\{t<n:i^{(t)}=i\right\}\right| denote the number of times that the functions fi,1f_{i,1} or fi,2f_{i,2} were queried before time nn. Define Kn={vi,r:i[m/2],r#(i,n)}K_{n}=\left\{v_{i,r}:i\in[m/2],r\leq\#(i,n)\right\}. These are the set of vectors vi,rv_{i,r} that are supposed to be “known” to the optimization algorithm at time nn. Also, let Un={vi,r:i[m/2],r>#(i,n)}U_{n}=\left\{v_{i,r}:i\in[m/2],r>\#(i,n)\right\}. These are the set of vectors that are supposed to be “unknown” to the algorithm.

Let Xn={x(1),,x(n)}X_{n}=\left\{x^{(1)},\dots,x^{(n)}\right\} be the set of query points up to time nn.

Let Sn=span{Xn1Kn}S_{n}=\text{span}\left\{X_{n-1}\cup K_{n}\right\}, and let SnS_{n}^{\perp} be its orthogonal complement. Let PnwP_{n}w be the orthogonal projection of a vector ww onto the subspace SnS_{n}, and let PnwP_{n}^{\perp}w be its orthogonal projection onto SnS_{n}^{\perp}.

We would like to show that the following event occurs with high probability over the draw of the vectors vi,rv_{i,r}:

Here, c1/kc\leq 1/\sqrt{k} is a constant that will be determined later. This event indicate that the premises of Property 1 are satisfied, which will be used to prove the lower bounds in the next sections.

where α=c2B(1+2N)\alpha=\frac{c}{2B(1+\sqrt{2N})} for some c1Nc\leq\frac{1}{\sqrt{N}}. The events GnG_{n} are useful because:

This proof closely follows [8, Lemma 9] and [20, Lemma 1].

Let GnG_{\leq n} denote t=1nGt\bigcap_{t=1}^{n}G_{t}. We will establish that for each nNn\leq N, Gn    vUnG_{\leq n}\implies\forall v\in U_{n} x(t),vc2\left|\left\langle x^{(t)},\,v\right\rangle\right|\leq\frac{c}{2}. To begin, we rewrite

First, we decomposed v=Pnv+Pnvv=P_{n}v+P_{n}^{\perp}v and used x(t)B\left\|x^{(t)}\right\|\leq B. Next, we used the Cauchy-Schwarz inequality and the self-adjointness of PnP_{n}^{\perp}. Finally, we used the non-expansivity of PnP_{n}^{\perp} and applied the definition of GnG_{n}.

In order to bound the first term in (20), we will prove by induction that for nNn\leq N and vUnv\in U_{n}, G<n    Pnv22(n1)α2G_{<n}\implies\left\|P_{n}v\right\|^{2}\leq 2(n-1)\alpha^{2}. The base case n=1n=1 is trivial as P1vP_{1}v is the projection of vv onto S1=S_{1}=\varnothing.

For the inductive step, fix nNn\leq N and vUnv\in U_{n}, and assume the statement holds for t<nt<n. Let P^n\hat{P}_{n} project onto span{XnKn}\text{span}\left\{X_{n}\cup K_{n}\right\} (this includes x(n)x^{(n)} in contrast with PnP_{n}), and let P^n\hat{P}_{n}^{\perp} project onto the orthogonal subspace.

Denote v(t)=vi(t),#(i(t),t)+1UtKt+1v^{(t)}=v_{i^{(t)},\#(i^{(t)},t)+1}\in U_{t}\cap K_{t+1}, which is the vector that “becomes known” after time tt. By definition, the set

spans SnS_{n}. Therefore, the Gram-Schmidt vectors

form an orthonormal basis for SnS_{n} (after ignoring any zero vectors that may arise from the projection). We now write Pnv2\left\|P_{n}v\right\|^{2} in terms of this orthonormal basis:

We now bound the second term of (24). Note that v(t)∉Unv^{(t)}\not\in U_{n} is orthogonal to vUnv\in U_{n}. Consider

Here, we used the triangle and Cauchy-Schwarz inequalities. Since v(t),vUnv^{(t)},v\in U_{n}, by the inductive hypothesis

Furthermore, looking at the denominator of (24)

The inequality uses the inductive hypothesis and Gt    G<nG_{t}\impliedby G_{<n}. Since c1/Nc\leq 1/\sqrt{N} and B1B\geq 1, α=c2B(1+2N)14N2\alpha=\frac{c}{2B\left(1+\sqrt{2N}\right)}\leq\frac{1}{\sqrt{4N-2}}, so this quantity is at least 1/21/2.

Combining this and (31) with (24), we conclude

We conclude that for all nNn\leq N and vUnv\in U_{n}, G<n    Pnv22(n1)α2G_{<n}\implies\left\|P_{n}v\right\|^{2}\leq 2(n-1)\alpha^{2}.

Since this holds for any nNn\leq N and vUnv\in U_{n}, we conclude GN    EG_{\leq N}\implies E. ∎

With Lemma 6 having been proven, we now observe the following corollary:

For any nNn\leq N, let o(n)o^{(n)} be the output of the gradient or prox oracle at time nn. Then Gn    o(n)span{{x(n)}Kn+1}G_{\leq n}\implies o^{(n)}\in\text{span}\left\{\left\{x^{(n)}\right\}\cup K_{n+1}\right\}.

By Lemma 6, GnG_{\leq n} implies r>#(i(n),n)\forall r>\#(i^{(n)},n) x(n),vi(n),rc2\left|\left\langle x^{(n)},\,v_{i^{(n)},r}\right\rangle\right|\leq\frac{c}{2}. Therefore, Property 1 immediately completes the proof. ∎

By Lemma 6 and the chain rule of probability,

We address a single term in the product using the following lemma:

This proof closely follows [8, Lemma 11] and [20, Lemma 3]

Fix nNn\leq N. We will start by proving the density pUn(UnG<n,Kn)p_{U_{n}}(U_{n}|G_{<n},K_{n}) is invariant to a rotations that preserve SnS_{n}. To that end, let RR be any rotation such that for all wSnw\in S_{n} Rw=wRw=w. Then

Finally, for any uUnu\in U_{n}, kKnk\in K_{n}

Therefore, pUn,Kn(RUn,Kn)=pUn,Kn(Un,Kn)p_{U_{n},K_{n}}(RU_{n},K_{n})=p_{U_{n},K_{n}}(U_{n},K_{n}).

As the base case, consider t=2t=2. Since G<2=G1G_{<2}=G_{1} depends only on the vectors U2K2U_{2}\cup K_{2} and the first iterate x(1)x^{(1)}, which is determined before making any oracle accesses and is thus independent of the vectors U2K2U_{2}\cup K_{2}. Consequently,

Furthermore, by Corollary 1, if G<2=G1G_{<2}=G_{1} holds, the first oracle response is in the span of the first query x(1)=x(1)X2Xnx^{(1)}=x^{\prime(1)}\in X_{2}\subseteq X_{n} and vi(1),0K2Knv_{i^{(1)},0}\in K_{2}\subseteq K_{n}, neither of which is affected by the rotation RR. Therefore, the second queries, which are a deterministic function of the identical first queries and oracle responses, are equal x(2)=x(2)x^{(2)}=x^{\prime(2)}.

Again by the inductive hypothesis, if G<n1G_{<n-1} holds then x(n1)=x(n1)x^{(n-1)}=x^{\prime(n-1)}. Thus, the projections Pn1P_{n-1} and Pn1P^{\prime}_{n-1} (the projection that arises when {vi,r}=RUnKn\left\{v_{i,r}\right\}=RU_{n}\cup K_{n}) are equal because all of the first n1n-1 queries are identical, and the rotation RR does not affect Kn1K_{n-1}. Consequently,

This establishes that pUn(UnG<n,Kn)=pUn(RUnG<n,Kn)p_{U_{n}}(U_{n}|G_{<n},K_{n})=p_{U_{n}}(RU_{n}|G_{<n},K_{n}) and thus the distribution of UnU_{n} is invariant to rotations RR that preserve SnS_{n}.

As a consequence, for any vUnv\in U_{n}, PnvP_{n}^{\perp}v and PnRvP_{n}^{\perp}Rv have the same distribution. Since RR preserves SnS_{n} and length, we conclude PnvP_{n}^{\perp}v and RPnvRP_{n}^{\perp}v also have the same distribution, and thus PnvPnv\frac{P_{n}^{\perp}v}{\left\|P_{n}^{\perp}v\right\|} is distributed uniformly on the unit sphere in SnS_{n}^{\perp} conditional on G<nG_{<n} and KnK_{n}.

For each inner product (61), the first term is a fixed quantity conditioned on G<nG_{<n} and KnK_{n}. The conditional distribution of the second vector, PnvPnv\frac{P_{n}^{\perp}v}{\left\|P_{n}^{\perp}v\right\|}, as argued above, is uniform on SnS_{n}^{\perp}. Thus, each term of the sum is the inner product of a fixed unit vector and a uniformly random unit vector in d2(n1)d-2(n-1) dimensions.

This probability that the inner product is at least α\alpha is proportional to the surface area of the “end caps” of a unit sphere lying above and below circles of radius r=1α2r=\sqrt{1-\alpha^{2}}. This surface area, in turn, is less than the surface area of a sphere of radius rr. Therefore, for a given vUnv\in U_{n}

For any d2N+2α2log(2mkN)d\geq 2N+\frac{2}{\alpha^{2}}\log\left(2mkN\right)

Together, Lemma 8 and Property 1 allow us to ensure that any algorithm can only learn one important vector per query with high probability as long as the dimension is large enough. What is left is to show that Property 1 holds for each of our constructions and to bound the suboptimality of any iterate that has small inner product with the vectors in UnU_{n}.

We first consider the Lipschitz and non-strongly convex setting and prove theorem 5: See 5 Without loss of generality, we let L=B=1L=B=1. As shown in Equations (9) and (10), we define

and for values bb, cc, and kk to be fixed later we define m/2m/2 pairs of functions, indexed by i=1..m/2i=1..m/2:

Assume for now that mm is even. If mm is odd, then we simply set one of the functions to and the oracle complexity is reduced by a factor proportional to m1m\frac{m-1}{m}.

At the end of this proof, we will show that the functions fi,f_{i,\cdot} satisfy Property 1. Since the domain, and therefore the queries made to the oracle are bounded by B=1B=1, Property 1 and Lemma 8 ensure that when the dimension is at least d2N+8(1+2N)2c2log(2mkN)d\geq 2N+\frac{8\left(1+\sqrt{2N}\right)^{2}}{c^{2}}\log\left(2mkN\right), for iterate xx generated after NN oracle queries, x,vi,rc2\left\langle x,\,v_{i,r}\right\rangle\geq\frac{c}{2} for no more than NN vectors vi,rv_{i,r} with probability 34\frac{3}{4}.

We now bound the suboptimality of (fi,1+fi,2)/2(f_{i,1}+f_{i,2})/2 for any xx where x,vi,k<c2\left\langle x,\,v_{i,k}\right\rangle<\frac{c}{2}.

It is straightforward to confirm that this function is minimized when x,vi,r=b\left\langle x,\,v_{i,r}\right\rangle=b for all rr. Since this is also true for every ii, FF is minimized at xb=bi=1m2r=0kvi,rx_{b}=b\sum_{i=1}^{\frac{m}{2}}\sum_{r=0}^{k}v_{i,r}. In order that xb=1\left\|x_{b}\right\|=1 so that xbXx_{b}\in\mathcal{X}, we set b=2m(k+1)b=\sqrt{\frac{2}{m(k+1)}}. Thus,

Therefore, we set c=min{1N,ϵk}c=\min\left\{\frac{1}{\sqrt{N}},\frac{\epsilon}{\sqrt{k}}\right\} and k=110ϵmk=\lfloor\frac{1}{10\epsilon\sqrt{m}}\rfloor so that

This ensures that if x,vi,k<c2\left\langle x,\,v_{i,k}\right\rangle<\frac{c}{2} for at least m/4m/4 ii’s, then xx cannot be ϵ\epsilon-suboptimal for FF. Therefore, after N=m(k+1)4N=\frac{m(k+1)}{4} queries in dimension

then F(x)F(x)ϵF(x)-F(x^{*})\geq\epsilon with probability 34\frac{3}{4}. When ϵ<110m\epsilon<\frac{1}{10\sqrt{m}}, AA must make at least

queries with probability 34\frac{3}{4}. Finally, we prove that Property 1 holds for our construction:

First we prove the properties about the gradients:

Consider the case when tt is odd. From (9), it is clear that dψcdz(z)=0\frac{d\psi_{c}}{dz}(z)=0 when z<c\left|z\right|<c. Furthermore, for r>tr>t, x,vi,r1x,vi,r<c\left|\left\langle x,\,v_{i,r-1}\right\rangle-\left\langle x,\,v_{i,r}\right\rangle\right|<c. Therefore, any subgradient g1fi,1(x)g_{1}\in\partial f_{i,1}(x) and g2fi,2(x)g_{2}\in\partial f_{i,2}(x) can be expressed as

where sign(0)\text{sign}(0) can take any value in the range $andwhereand where\psi_{c}^{\prime}isasubderivativeofis a subderivative of\psi_{c}.Itisclearfromtheseexpressionsthat. It is clear from these expressions that\partial f_{i,1}(x)\subseteq\text{span}\left\{v_{i,0},...,v_{i,t-1}\right\}andand\partial f_{i,2}(x)\subseteq\text{span}\left\{v_{i,1},...,v_{i,t}\right\}.Theproofforthecasewhen. The proof for the case whent$ is even follows the same line of reasoning.

We now prove the properties about the proxs:

Since each pair of functions fi,f_{i,\cdot} operates on a separate (k+1)(k+1)-dimensional subspace, it will be useful to decompose vectors into x=xiv+xix=x_{i}^{v}+x_{i}^{\perp} where xiv=r=0kx,vi,rvi,rx_{i}^{v}=\sum_{r=0}^{k}\left\langle x,\,v_{i,r}\right\rangle v_{i,r} and xi=xxivx_{i}^{\perp}=x-x_{i}^{v}. First, note that

(and similarly for fi,2f_{i,2}). From there, the proof is similar to the proof of lemma 3. First, consider the function fi,2f_{i,2} and let ttt^{\prime}\geq t be the smallest even number which is not smaller than tt. It will be convenient to further decompose vectors into xiv=x+x+x_{i}^{v}=x^{-}+x^{+} where x=r=0t1xiv,vi,rvi,rx^{-}=\sum_{r=0}^{t^{\prime}-1}\left\langle x_{i}^{v},\,v_{i,r}\right\rangle v_{i,r} and x+=r=tkxiv,vi,rvi,rx^{+}=\sum_{r=t^{\prime}}^{k}\left\langle x_{i}^{v},\,v_{i,r}\right\rangle v_{i,r}. So

Since x+,vi,r<c2\left|\left\langle x^{+},\,v_{i,r}\right\rangle\right|<\frac{c}{2} for all rtr\geq t, x+,vi,r1x+,vi,r<c\left|\left\langle x^{+},\,v_{i,r-1}\right\rangle-\left\langle x^{+},\,v_{i,r}\right\rangle\right|<c for r>tr>t, which implies that fi,2(x+)=0f_{i,2}(x^{+})=0. Therefore, the objective of the second arg min\operatorname*{arg\,min} is non-negative and is equal to zero when u+=x+u^{+}=x^{+} so

Therefore, when tt is even, t=tt^{\prime}=t and proxfi,2(x,β)span{x,vi,0,...,vi,t1}\text{prox}_{f_{i,2}}(x,\beta)\in\text{span}\left\{x,v_{i,0},...,v_{i,t-1}\right\}, and when tt is odd, t=t+1t^{\prime}=t+1 and proxfi,2(x,β)span{x,vi,0,...,vi,t}\text{prox}_{f_{i,2}}(x,\beta)\in\text{span}\left\{x,v_{i,0},...,v_{i,t}\right\}. A very similar line of reasoning can be used to show the statement for fi,1f_{i,1}. ∎

As was mentioned before, Lemma 8 applies when the norm of every query point is bounded by BB. Since all points in the domain of the optimization problem have norm bounded by BB, this is not problematic. However, we can slightly modify our construction to make optimizing FF hard even for algorithms that are allowed to query outside of the domain.

We could redefine our functions as follows:

fi,jf^{\prime}_{i,j} is still continuous, and LL-Lipschitz, and it also has the property that it behaves exactly like fi,jf_{i,j} on BB-ball. However, querying the oracle of fi,jf^{\prime}_{i,j} outside of the BB-ball gives no more information about the function than querying at BxxB\frac{x}{\left\|x\right\|}. In fact, an algorithm that was only allowed to query within the BB-ball would be able to simulate the oracle of FF^{\prime}. Therefore, since the algorithm that is not allowed to query at large vectors cannot optimize FF^{\prime} quickly, and it could simulate queries with unbounded norm, it follows that querying with unbounded norm cannot improve the rate of convergence. This fact is needed in the proof of Theorem 6 below.

C.2 Non-smooth and strongly convex components

We now prove Theorem 6 using a reduction from the Lipschitz and non-strongly convex setting: See 6

By assumption, AA can find an x^\hat{x} such that F(x^)F(x)<ϵ2F(\hat{x})-F(x^{*})<\frac{\epsilon}{2} using o(m+m(L+λB)λϵ)=o(m+mLBϵ)o\left(m+\frac{\sqrt{m}(L+\lambda B)}{\sqrt{\lambda\epsilon}}\right)=o\left(m+\frac{\sqrt{m}LB}{\epsilon}\right) queries to hFh_{F}, and

C.3 Smooth and not strongly convex components

See 7 Without loss of generality, we can assume that γ=B=1\gamma=B=1. We will first consider the case where ϵ=O(1m)\epsilon=O\left(\frac{1}{m}\right) and prove that AA must make Ω(mϵ)\Omega\left(\sqrt{\frac{m}{\epsilon}}\right) queries to hFh_{F}. Afterwards, we will show a lower bound of Ω(m)\Omega(m) in the large-ϵ\epsilon regime where that term dominates.

The function construction in this case is very similar to the non-smooth randomized construction. As in Equation (11)

The key properties of this function for this proof are that it is convex, everywhere differentiable and 44-smooth, and when zc\left|z\right|\leq c, the function is constant at 0. It is also useful to note that

As in Equation (12), for values aa and kk to be fixed later, we define the pairs of functions for i=1,...,m/2i=1,...,m/2:

At the end of this proof, we will show that the functions fi,f_{i,\cdot} satisfy Property 1. Since the domain, and therefore the queries made to the oracle are bounded by BB, Property 1 and Lemma 8 ensure that when the dimension is at least d=2N+8(1+2N)2c2log(2mkN)d=2N+\frac{8(1+\sqrt{2N})^{2}}{c^{2}}\log(2mkN) then after NN oracle queries, x,vi,rc2\left\langle x,\,v_{i,r}\right\rangle\geq\frac{c}{2} for no more than NN vectors vi,rv_{i,r} with probability 34\frac{3}{4}.

Now, we will bound the suboptimality of Fi(x):=(fi,1(x)+fi,2(x))/2F_{i}(x):=(f_{i,1}(x)+f_{i,2}(x))/2 at an iterate xx such that x,vi,r<c2\left|\left\langle x,\,v_{i,r}\right\rangle\right|<\frac{c}{2} for all rtr\geq t. From the definition of ϕc\phi_{c}:

and note that in the proof of Theorem 3 we already showed that that the optimum of FitF^{t}_{i} is achieved at

Therefore, setting a=6m(k+1)a=\sqrt{\frac{6}{m(k+1)}} ensures that i=1m2xi,k+11\left\|\sum_{i=1}^{\frac{m}{2}}x_{i,k+1}^{*}\right\|\leq 1. It is not necessarily true that x=i=1m2xi,k+1x^{*}=\sum_{i=1}^{\frac{m}{2}}x_{i,k+1}^{*}, but it serves as an upper bound on the optimum.

Let q:=k2q:=\lfloor\frac{k}{2}\rfloor and consider an iterate xx generated by AA before it makes q1q-1 queries to the functions fi,1f_{i,1} and fi,2f_{i,2}. When x,vi,r<c2\left\langle x,\,v_{i,r}\right\rangle<\frac{c}{2} for all rqr\geq q,

where the last inequality holds as long as k2k\geq 2. When ϵ<1320m\epsilon<\frac{1}{320m}, setting c=min{1N,16ϵk}c=\min\left\{\frac{1}{\sqrt{N}},\sqrt{\frac{16\epsilon}{k}}\right\} and k=180ϵm2k=\lfloor\frac{1}{\sqrt{80\epsilon m}}\rfloor\geq 2, ensures that

Therefore, if x,vi,r<c2\left\langle x,\,v_{i,r}\right\rangle<\frac{c}{2} for all rqr\geq q is true for at least m4\frac{m}{4} of the ii’s, then xx cannot be ϵ\epsilon-suboptimal for FF. So, for N=mq4N=\frac{mq}{4} in dimension

with probability 34\frac{3}{4}, the algorithm must make at least mq4\frac{mq}{4} queries in order to reach an ϵ\epsilon-suboptimal point. This gives a lower bound of

which holds with probability 34\frac{3}{4}. To complete the first half of the proof, we prove that Property 1 holds for this construction:

First we prove the properties about gradients:

Consider the case when tt is odd. From equation 79, we can see that dϕcdz(z)=0\frac{d\phi_{c}}{dz}(z)=0 when z<c\left|z\right|<c. Furthermore, for r>tr>t, x,vi,r1x,vi,r<c\left|\left\langle x,\,v_{i,r-1}\right\rangle-\left\langle x,\,v_{i,r}\right\rangle\right|<c. We can therefore express the gradients:

It is clear from these expressions that fi,1(x)span{vi,0,...,vi,t1}\nabla f_{i,1}(x)\in\text{span}\left\{v_{i,0},...,v_{i,t-1}\right\} and fi,2(x)span{vi,0,...,vi,t}\nabla f_{i,2}(x)\in\text{span}\left\{v_{i,0},...,v_{i,t}\right\}. The proof for the case when tt is even follows the same line of reasoning.

Now, we prove the properties about proxs:

We follow the same line of reasoning as in the Lipschitz and non-strongly convex case. The only necessary addition is to show, that when ttt^{\prime}\geq t is the smallest even number which is not smaller than tt and u=r=0t1uiv,vi,rvi,ru^{-}=\sum_{r=0}^{t^{\prime}-1}\left\langle u_{i}^{v},\,v_{i,r}\right\rangle v_{i,r} and u+=r=tkuiv,vi,rvi,ru^{+}=\sum_{r=t^{\prime}}^{k}\left\langle u_{i}^{v},\,v_{i,r}\right\rangle v_{i,r}, then fi,2(u+u+)=fi,2(u)+fi,2(u+)f_{i,2}(u^{-}+u^{+})=f_{i,2}(u^{-})+f_{i,2}(u^{+}):

This same reasoning applies for fi,1f_{i,1} or odd tt^{\prime}. ∎

So far, we have shown a lower bound of Ω(mϵ)\Omega\left(\sqrt{\frac{m}{\epsilon}}\right) when ϵ=O(1m)\epsilon=O\left(\frac{1}{m}\right). We now show a lower bound of Ω(m)\Omega(m) for all ϵ>0\epsilon>0, which accounts for the first term in the lower bound Ω(m+mϵ)\Omega\left(m+\sqrt{\frac{m}{\epsilon}}\right) which dominates when ϵ=Ω(1m)\epsilon=\Omega\left(\frac{1}{m}\right). Consider the -smooth functions

for any constant C>0C>0, and where the orthonormal vectors viv_{i} are randomly chosen as before. FF reaches its minimum on the unit ball at

Therefore, by simply choosing C=ϵm0.08C=\frac{\epsilon\sqrt{m}}{0.08}, we ensure that such a point xx is at least ϵ\epsilon-suboptimal, completing the proof for all ϵ>0\epsilon>0.

As noted above, the queries made to the oracle must be bounded for Lemma 8. Since the domain of FF is the BB-ball, this is easy to satisfy. If we want to ensure that our construction is still hard to optimize, even if the algorithm is allowed to query arbitrarily large vectors, then we can modify our construction by defining a new function fi,jf^{\prime}_{i,j} in terms of its gradient

This function is continuous and smooth, and also has the property that querying the oracle at a point xx outside of the BB-ball is cannot be more informative than querying at BxxB\frac{x}{\left\|x\right\|}. That is, an algorithm that is not allowed to query outside the BB-ball can simulate such queries using its restricted oracle. Since this restricted algorithm cannot optimize quickly, but can still calculate the oracle outputs that it would have recieved by querying large vectors, it follows that an unrestricted algorithm could not optimize this function quickly either.

Another variant of (1) that one might consider is an unconstrained optimization problem, where we assume that the minimizer of FF lies on the interior of that ball. In other words, we could consider a version of (1) where the gradient of FF must vanish on the interior of X\mathcal{X}.

In this case, there is little reason to consider any ϵ\epsilon larger than γB22\frac{\gamma B^{2}}{2}, since F(0)F(x)γB22F(0)-F(x^{*})\leq\frac{\gamma B^{2}}{2} always (by smoothness F(0)F(x)F(x),x0x+γ2x2γB22F(0)-F(x^{*})\leq\left\langle\nabla F(x^{*}),\,x_{0}-x^{*}\right\rangle+\frac{\gamma}{2}\left\|x^{*}\right\|^{2}\leq\frac{\gamma B^{2}}{2}). Consequently, when ϵγB22\epsilon\geq\frac{\gamma B^{2}}{2} there is a trivial upper bound of zero oracle queries, as just returning the zero vector guarantees ϵ\epsilon-suboptimality. We can construct functions so that Theorem 7 still applies for 0<ϵ<9γB21280<\epsilon<\frac{9\gamma B^{2}}{128}. In the previous proof, the first construction is still valid in the unconstrained case since the minimizer lies within the unit ball. For the Ω(m)\Omega(m) term, consider the 11-smooth functions (assume w.l.o.g. that γ=B=1\gamma=B=1)

Therefore, if fewer than m2\frac{m}{2} functions have been queried, then with probability at least 910\frac{9}{10}:

This proves a lower bound of Ω(m)\Omega(m) for 0<ϵ<9γB21280<\epsilon<\frac{9\gamma B^{2}}{128}.

C.4 Smooth and strongly convex components

In the smooth and strongly convex case, we cannot use the same simple reduction that was used to prove Theorem 6. Using that construction, we would be able to show a lower bound of mm, but would not be able to show any dependence on ϵ\epsilon, so the lower bound would be loose. Instead, we will use an explicit construction similar to the one used in Theorem 7. See 8

We will prove the theorem for a 11-smooth, λ\lambda-strongly convex problem, for λ<173m\lambda<\frac{1}{73m}, which can be generalized by scaling.

As in the proof for the non-strongly convex case, we introduce the 4-smooth helper function

These functions also have Property 1, but we will omit the proof, as it follows directly from the proof in Appendix C.3. Intuitively, the squared norm reveals no new information about the vectors vi,rv_{i,r} besides what is already included in the query point xx.

When all of the queries are bounded by BB, Property 1 along with Lemma 8 ensures that when d=2N+8B2(1+2N)2c2log(2mkN)d=2N+\frac{8B^{2}\left(1+\sqrt{2N}\right)^{2}}{c^{2}}\log\left(2mkN\right), after the algorithm make NN queries x,vi,rc2\left\langle x,\,v_{i,r}\right\rangle\geq\frac{c}{2} for at most NN of the vectors vi,rv_{i,r} with probability 34\frac{3}{4}. For this to apply, we need that all of the queries made by the algorithm are within a BB-ball around the origin. We know that F(0)F(x)=ϵ0F(0)-F(x^{*})=\epsilon_{0}, and by strong-convexity F(0)F(x)+λ2x2F(0)\geq F(x^{*})+\frac{\lambda}{2}\left\|x^{*}\right\|^{2}, therefore, x2ϵ0λ=:B\left\|x^{*}\right\|\leq\sqrt{\frac{2\epsilon_{0}}{\lambda}}=:B. Since the optimum point must lie in the BB-ball around the origin, we will restrict the algorithm to query only at points within the BB-ball. At the end of the proof, we will show that with a small modification to the functions outside of the BB-ball, querying at vectors of large norm cannot help the algorithm.

Now it remains to lower bound the suboptimality of the pair fi,1f_{i,1} and fi,2f_{i,2} at an iterate which is nearly orthogonal to all vectors vi,rv_{i,r} for r>tr>t:

In order to bound the suboptimality of a pair of functions ii, it will be convenient to bundle up all of the terms which affect the value of x,vi,r\left\langle x^{*},\,v_{i,r}\right\rangle from all mm of the component functions. Most of those terms are contained in fi,1f_{i,1} and fi,2f_{i,2}, however, x2\left\|x\right\|^{2} terms in each of the other components also affect the value of x,vi,r\left\langle x^{*},\,v_{i,r}\right\rangle. For each ii, consider the projection operator PiP_{i} which projects a vector xx onto the subspace spanned by {vi,r}r=0k\left\{v_{i,r}\right\}_{r=0}^{k}, and PP_{\perp} projecting onto the space orthogonal to vi,rv_{i,r} for all i,ri,r. Now decompose

then when x,vi,r<c2\left|\left\langle x,\,v_{i,r}\right\rangle\right|<\frac{c}{2}

We have already showed in Appendix B.4 that if x^:=arg minxFi(x)\hat{x}:=\operatorname*{arg\,min}_{x}F_{i}(x), and if

Therefore, if at least m/4m/4 of the pairs ii it holds that x,vi,r<c2\left|\left\langle x,\,v_{i,r}\right\rangle\right|<\frac{c}{2} for r>tr>t, then

As a consequence of this, when the dimension is sufficiently large, with probability 34\frac{3}{4} the optimization algorithm must make at least tt queries to each of at least m4\frac{m}{4} pairs of functions in order to reach an ϵ\epsilon-suboptimal solution in expectation. So, when

The same argument as was used in the discussion after theorem 7 to show the Ω(m)\Omega(m) term of the lower bound can be used here, as the function in that construction was both smooth and strongly convex.

As mentioned above, Lemma 8 requires that the norm of all query points be bounded by BB. We argued above that the optimum of FF must lie within the BB-ball around the origin. Even so, we can slightly modify our construction to show that even if the algorithm were allowed to query arbitrarily large points, it still would not be able to optimize FF quickly. Define fi,jf^{\prime}_{i,j} through its gradient as:

This new function is continuous, γ\gamma-smooth, and λ\lambda-strongly convex, and it also has the property that querying the function at a point xx outside the BB-ball, it is no more informative than querying at BxxB\frac{x}{\left\|x\right\|}. That is, an algorithm that was not allowed to query outside the BB-ball could simulate the result of such queries. Since that restricted algorithm can’t optimize FF^{\prime} well, as proven above, another algorithm which could query at arbitrary points, therefore could not either. ∎

C.5 Non-smooth components when ϵitalic-ϵ\epsilon is large

Therefore, until the algorithm has made enough queries so that

the expected suboptimality is greater than ϵ\epsilon. By a standard information theoretic result , achieving that probability of success at predicting the sign of YY implies a comparable level of accuracy at distinguishing between p=0.5+2ϵp=0.5+2\epsilon and p=0.52ϵp=0.5-2\epsilon, and that requires at least 1128ϵ2\frac{1}{128\epsilon^{2}} queries to hFh_{F}. ∎

It is straightforward to show a lower bound of Ω(L2λϵ)\Omega\left(\frac{L^{2}}{\lambda\epsilon}\right) for strongly convex functions using the same reduction by regularization as in the proofs of theorems 2 and 6. We also note that this lower bound implies a lower bound of Ω(m)\Omega(m) for smooth functions, whether strongly convex or not. Each function fif_{i} in this construction is linear, and therefore is trivially -smooth. We make the gradient of each function arbitrarily large by multiplying each fif_{i} by a large number. As the multiplier grows, the algorithm need be more and more certain of the sign of YY in order to achieve a expected suboptimality of less than ϵ\epsilon. Thus for a sufficiently large multiplier, the algorithm must query Ω(m)\Omega(m) functions. We cannot force it to query more than that, of course, since it only needs to query mm functions to know the sign of YY with probability 1.