Tight Complexity Bounds for Optimizing Composite Objectives

Blake Woodworth, Nathan Srebro

Introduction

We consider minimizing the average of $m\geq 2$ convex functions:

A natural question is how to leverage the prox oracle, and how much benefit it provides over gradient access alone. The prox oracle is potentially much more powerful, as it provides global, rather then local, information about the function. For example, for a single function ( $m=1$ ), one prox oracle call (with $\beta=0$ ) is sufficient for exact optimization. Several methods have recently been suggested for optimizing a sum or average of several functions using prox accesses to each component, both in the distributed setting where each components might be handled on a different machine (e.g. ADMM , DANE , DISCO ) or for functions that can be decomposed into several “easy” parts (e.g. PRISMA ). But as far as we are aware, no meaningful lower bound was previously known on the number of prox oracle accesses required even for the average of two functions ( $m=2$ ).

The optimization of composite objectives of the form (1) has also been extensively studied in the context of minimizing empirical risk over $m$ samples. Recently, stochastic methods such as SDCA , SAG , SVRG , and other variants, have been presented which leverage the finite nature of the problem to reduce the variance in stochastic gradient estimates and obtain guarantees that dominate both batch and stochastic gradient descent. As methods with improved complexity, such as accelerated SDCA , accelerated SVRG, and Katyusha , have been presented, researchers have also tried to obtain lower bounds on the best possible complexity in this settings—but as we survey below, these have not been satisfactory so far.

In this paper, after briefly surveying methods for smooth, composite optimization, we present methods for optimizing non-smooth composite objectives, which show that prox oracle access can indeed be leveraged to improve over methods using merely subgradient access (see Section 3). We then turn to studying lower bounds. We consider algorithms that access the objective $F$ only through the oracle $h_{F}$ and provide lower bounds on the number of such oracle accesses (and thus the runtime) required to find $\epsilon$ -suboptimal solutions. We consider optimizing both Lipschitz (non-smooth) functions and smooth functions, and guarantees that do and do not depend on strong convexity, distinguishing between deterministic optimization algorithms and randomized algorithms. Our upper and lower bounds are summarized in Table 1.

As shown in the table, we provide matching upper and lower bounds (up to a log factor) for all function and algorithm classes. In particular, our bounds establish the optimality (up to log factors) of accelerated SDCA, SVRG, and SAG for randomized finite-sum optimization, and also the optimality of our deterministic smoothing algorithms for non-smooth composite optimization.

For non-smooth functions, we show that having access to prox oracles for the components can reduce the polynomial dependence on $\epsilon$ from $1/\epsilon^{2}$ to $1/\epsilon$ , or from $1/(\lambda\epsilon)$ to $1/\sqrt{\lambda\epsilon}$ for $\lambda$ -strongly convex functions. However, all of the optimal complexities for smooth functions can be attained with only component gradient access using accelerated gradient descent (AGD) or accelerated SVRG. Thus the worst-case complexity cannot be improved (at least not significantly) by using the more powerful prox oracle.

On the power of randomization

We establish a significant gap between deterministic and randomized algorithms for finite-sum problems. Namely, the dependence on the number of components must be linear in $m$ for any deterministic algorithm, but can be reduced to $\sqrt{m}$ (in the typically significant term) using randomization. We emphasize that the randomization here is only in the algorithm—not in the oracle. We always assume the oracle returns an exact answer (for the requested component) and is not a stochastic oracle. The distinction is that the algorithm is allowed to flip coins in deciding what operations and queries to perform but the oracle must return an exact answer to that query (of course, the algorithm could simulate a stochastic oracle).

Prior Lower Bounds

Several authors recently presented lower bounds for optimizing (1) in the smooth and strongly convex setting using component gradients. Agarwal and Bottou presented a lower bound of $\Omega\left(m+\sqrt{\tfrac{m\gamma}{\lambda}}\log\tfrac{1}{\epsilon}\right)$ . However, their bound is valid only for deterministic algorithms (thus not including SDCA, SVRG, SAG, etc.)—we not only consider randomized algorithms, but also show a much higher lower bound for deterministic algorithms (i.e. the bound of Agarwal and Bottou is loose). Improving upon this, Lan shows a similar lower bound for a restricted class of randomized algorithms: the algorithm must select which component to query for a gradient by drawing an index from a fixed distribution, but the algorithm must otherwise be deterministic in how it uses the gradients, and its iterates must lie in the span of the gradients it has received. This restricted class includes SAG, but not SVRG nor perhaps other realistic attempts at improving over these. Furthermore, both bounds allow only gradient accesses, not prox computations. Thus SDCA, which requires prox accesses, and potential variants are not covered by such lower bounds. We prove as similar lower bound to Lan’s, but our analysis is much more general and applies to any randomized algorithm, making any sequence of queries to a gradient and prox oracle, and without assuming that iterates lie in the span of previous responses. In addition to smooth functions, we also provide lower bounds for non-smooth problems which were not considered by these previous attempts. Another recent observation was that with access only to random component subgradients without knowing the component’s identity, an algorithm must make $\Omega(m^{2})$ queries to optimize well. This shows how relatively subtle changes in the oracle can have a dramatic effect on the complexity of the problem. Since the oracle we consider is quite powerful, our lower bounds cover a very broad family of algorithms, including SAG, SVRG, and SDCA.

Our deterministic lower bounds are inspired by a lower bound on the number of rounds of communication required for optimization when each $f_{i}$ is held by a different machine and when iterates lie in the span of certain permitted calculations . Our construction for $m=2$ is similar to theirs (though in a different setting), but their analysis considers neither scaling with $m$ (which has a different role in their setting) nor randomization.

Notation and Definitions

Optimizing Smooth Sums

We briefly review the best known methods for optimizing (1) when the components are $\gamma$ -smooth, yielding the upper bounds on the right half of Table 1. These upper bounds can be obtained using only component gradient access, without need for the prox oracle.

We can obtain exact gradients of $F(x)$ by computing all $m$ component gradients $\nabla f_{i}(x)$ . Running accelerated gradient descent (AGD) on $F(x)$ using these exact gradients achieves the upper complexity bounds for deterministic algorithms and smooth problems (see Table 1).

When $F$ is not strongly convex, adding a regularizer to the objective and instead optimizing $F_{\lambda}(x)=F(x)+\frac{\lambda}{2}\left\|x\right\|^{2}$ with $\lambda=\epsilon/B^{2}$ results in an oracle complexity of $\mathcal{O}\left(\left(m+\sqrt{\tfrac{m\gamma B^{2}}{\epsilon}}\right)\log\tfrac{\epsilon_{0}}{\epsilon}\right)$ . The log-factor in the second term can be removed using the more delicate reduction of Allen-Zhu and Hazan , which involves optimizing $F_{\lambda}(x)$ for progressively smaller values of $\lambda$ , yielding the upper bound in the table.

Katyusha and Catalyst-accelerated SAG or SVRG use only gradients of the components. Accelerated SDCA achieves a similar complexity using gradient and prox oracle access.

Leveraging Prox Oracles for Lipschitz Sums

In this section, we present algorithms for leveraging the prox oracle to minimize (1) when each component is $L$ -Lipschitz. This will be done by using the prox oracle to “smooth” each component, and optimizing the new, smooth sum which approximates the original problem. This idea was used in order to apply Katyusha and accelerated SDCA to non-smooth objectives. We are not aware of a previous explicit presentation of the AGD-based deterministic algorithm, which achieves the deterministic upper complexity indicated in Table 1.

The key is using a prox oracle to obtain gradients of the $\beta$ -Moreau envelope of a non-smooth function, $f$ , defined as:

Let $f$ be convex and $L$ -Lipschitz continuous. For any $\beta>0$ ,

$\nabla(f^{(\beta)})(x)=\beta(x-\text{prox}_{f}(x,\beta))$

$f^{(\beta)}(x)\leq f(x)\leq f^{(\beta)}(x)+\frac{L^{2}}{2\beta}$

Consequently, we can consider the smoothed problem

Finally, if $m>L^{2}B^{2}/\epsilon^{2}$ or $m>L^{2}/(\lambda\epsilon)$ , stochastic gradient descent is a better randomized alternative, yielding complexities of $\mathcal{O}(L^{2}B^{2}/\epsilon^{2})$ or $\mathcal{O}(L^{2}/(\lambda\epsilon))$ .

Lower Bounds for Deterministic Algorithms

We now turn to establishing lower bounds on the oracle complexity of optimizing (1). We first consider only deterministic optimization algorithms. What we would like to show is that for any deterministic optimization algorithm we can construct a “hard” function for which the algorithm cannot find an $\epsilon$ -suboptimal solution until it has made many oracle accesses. Since the algorithm is deterministic, we can construct such a function by simulating the (deterministic) behavior of the algorithm. This can be viewed as a game, where an adversary controls the oracle being used by the algorithm. At each iteration the algorithm queries the oracle with some triplet $(x,i,\beta)$ and the adversary responds with an answer. This answer must be consistent with all previous answers, but the adversary ensures it is also consistent with a composite function $F$ that the algorithm is far from optimizing. The “hard” function is then gradually defined in terms of the behavior of the optimization algorithm.

To help us formulate our constructions, we define a “round” of queries as a series of queries in which $\lceil\frac{m}{2}\rceil$ distinct functions $f_{i}$ are queried. The first round begins with the first query and continues until exactly $\lceil\frac{m}{2}\rceil$ unique functions have been queried. The second round begins with the next query, and continues until exactly $\lceil\frac{m}{2}\rceil$ more distinct components have been queried in the second round, and so on until the algorithm terminates. This definition is useful for analysis but requires no assumptions about the algorithm’s querying strategy.

We begin by presenting a lower bound for deterministic optimization of (1) when each component $f_{i}$ is convex and $L$ -Lipschitz continuous, but is not necessarily strongly convex, on the domain $\mathcal{X}=\left\{x:\left\|x\right\|\leq B\right\}$ . Without loss of generality, we can consider $L=B=1$ . We will construct functions of the following form:

During each round $t$ , the adversary answers queries according to $f_{i}^{t}$ , which depends only on $v_{r},\delta_{i,r}$ for $r<t$ , i.e. from previous rounds. When the round is completed, $\delta_{i,t}$ is determined and $v_{t}$ is chosen to be orthogonal to the vectors $\left\{v_{0},...,v_{t-1}\right\}$ as well as every point queried by the algorithm so far, thus defining $f_{i}^{t+1}$ for the next round. In Appendix B.1 we prove that these responses based on $f_{i}^{t}$ are consistent with $f_{i}$ .

The algorithm can only learn $v_{r}$ after it completes round $r$ —until then every iterate is orthogonal to it by construction. The average of these functions reaches its minimum of $F(x^{*})=0$ at $x^{*}=b\sum_{r=0}^{k}v_{r}$ , so we can view optimizing these functions as the task of discovering the vectors $v_{r}$ —even if only $v_{k}$ is missing, a suboptimality better than $b/(6\sqrt{k})>\epsilon$ cannot be achieved. Therefore, the deterministic algorithm must complete at least $k$ rounds of optimization, each comprising at least $\left\lceil\frac{m}{2}\right\rceil$ queries to $h_{F}$ in order to optimize $F$ . The key to this construction is that even though each term $\left|\left\langle x,\,v_{r-1}\right\rangle-\left\langle x,\,v_{r}\right\rangle\right|$ appears in $m/2$ components, and hence has a strong effect on the average $F(x)$ , we can force a deterministic algorithm to make $\Omega(m)$ queries during each round before it finds the next relevant term. We obtain (for complete proof see Appendix B.1):

Furthermore, we can always reduce optimizing a function over $\left\|x\right\|\leq B$ to optimizing a strongly convex function by adding the regularizer $\epsilon\left\|x\right\|^{2}/(2B^{2})$ to each component, implying (see complete proof in Appendix B.2):

2 Smooth Components

When the components $f_{i}$ are required to be smooth, the lower bound construction is similar to (6), except it is based on squared differences instead of absolute differences. We consider the functions:

where $\delta_{i,r}$ and $v_{r}$ are as before. Again, we can answer queries at round $t$ based only on $\delta_{i,r},v_{r}$ for $r<t$ . This construction yields the following lower bounds (full details in Appendix B.3):

In the strongly convex case, we use a very similar construction, adding the term $\lambda\left\|x\right\|^{2}/2$ , which gives the following bound (see Appendix B.4):

Lower Bounds for Randomized Algorithms

We now turn to randomized algorithms for (1). In the deterministic constructions, we relied on being able to set $v_{r}$ and $\delta_{i,r}$ based on the predictable behavior of the algorithm. This is impossible for randomized algorithms, we must choose the “hard” function before we know the random choices the algorithm will make—so the function must be “hard” more generally than before.

Previously, we chose vectors $v_{r}$ orthogonal to all previous queries made by the algorithm. For randomized algorithms this cannot be ensured. However, if we choose orthonormal vectors $v_{r}$ randomly in a high dimensional space, they will be nearly orthogonal to queries with high probability. Slightly modifying the absolute or squared difference from before makes near orthogonality sufficient. This issue increases the required dimension but does not otherwise affect the lower bounds.

More problematic is our inability to anticipate the order in which the algorithm will query the components, precluding the use of $\delta_{i,r}$ . In the deterministic setting, if a term revealing a new $v_{r}$ appeared in half of the components, we could ensure that the algorithm must make $m/2$ queries to find it. However, a randomized algorithm could find it in two queries in expectation, which would eliminate the linear dependence on $m$ in the lower bound! Alternatively, if only one component included the term, a randomized algorithm would indeed need $\Omega(m)$ queries to find it, but that term’s effect on suboptimality of $F$ would be scaled down by $m$ , again eliminating the dependence on $m$ .

First consider the non-smooth, non-strongly-convex setting and assume for simplicity $m$ is even (otherwise we simply let the last function be zero). We define the helper function $\psi_{c}$ , which replaces the absolute value operation and makes our construction resistant to small inner products between iterates and not-yet-discovered components:

Next, we define $m/2$ pairs of functions, indexed by $i=1..m/2$ :

where $\left\{v_{i,r}\right\}_{r=0..k,i=1..m/2}$ are random orthonormal vectors and $k=\Theta(\tfrac{1}{\epsilon\sqrt{m}})$ . With $c$ sufficiently small and the dimensionality sufficiently high, with high probability the algorithm only learns the identity of new vectors $v_{i,r}$ by alternately querying $f_{i,1}$ and $f_{i,2}$ ; so revealing all $k+1$ vectors requires at least $k+1$ total queries. Until $v_{i,k}$ is revealed, an iterate is $\Omega(\epsilon)$ -suboptimal on $(f_{i,1}+f_{i,2})/2$ . From here, we show that an $\epsilon$ -suboptimal solution to $F(x)$ can be found only after at least $k+1$ queries are made to at least $m/4$ pairs, for a total of $\Omega(mk)$ queries. This time, since the optimum $x^{*}$ will need to have inner product $b$ with $\Theta(mk)$ vectors $v_{i,r}$ , we need to have $b=\Theta(\tfrac{1}{\sqrt{mk}})=\Theta(\sqrt{\epsilon/\sqrt{m}})$ , and the total number of queries is $\Omega(mk)=\Omega(\tfrac{\sqrt{m}}{\epsilon})$ . The $\Omega(m)$ term of the lower bound follows trivially since we require $\epsilon=\mathcal{O}(1/\sqrt{m})$ , (proofs in Appendix C.1):

An added regularizer gives the result for strongly convex functions (see Appendix C.2):

The large dimension required by these lower bounds is the cost of omitting the assumption that the algorithm’s queries lie in the span of previous oracle responses. If we do assume that the queries lie in that span, the necessary dimension is only on the order of the number of oracle queries needed.

When $\epsilon=\Omega\left(LB/\sqrt{m}\right)$ in the non-strongly convex case or $\epsilon=\Omega\left(L^{2}/(\lambda m)\right)$ in the strongly convex case, the lower bounds for randomized algorithms presented above do not apply. Instead, we can obtain a lower bound based on an information theoretic argument. We first uniformly randomly choose a parameter $p$ , which is either $(1/2-2\epsilon)$ or $(1/2+2\epsilon)$ . Then for $i=1,...,m$ , in the non-strongly convex case we make $f_{i}(x)=x$ with probability $p$ and $f_{i}(x)=-x$ with probability $1-p$ . Optimizing $F(x)$ to within $\epsilon$ accuracy then implies recovering the bias of the Bernoulli random variable, which requires $\Omega(1/\epsilon^{2})$ queries based on a standard information theoretic result . Setting $f_{i}(x)=\pm x+\frac{\lambda}{2}\left\|x\right\|^{2}$ gives a $\Omega(1/(\lambda\epsilon))$ lower bound in the $\lambda$ -strongly convex setting. This is formalized in Appendix C.5.

2 Smooth Components

When the functions $f_{i}$ are smooth and not strongly convex, we define another helper function $\phi_{c}$ :

and the following pairs of functions for $i=1,...,m/2$ :

with $v_{i,r}$ as before. The same arguments apply, after replacing the absolute difference with squared difference. A separate argument is required in this case for the $\Omega(m)$ term in the bound, which we show using a construction involving $m$ simple linear functions (see Appendix C.3).

In the strongly convex case, we add the term $\lambda\left\|x\right\|^{2}/2$ to $f_{i,1}$ and $f_{i,2}$ (see Appendix C.4) to obtain:

We consider (1) as a constrained optimization problem, thus the minimizer of $F$ could be achieved on the boundary of $\mathcal{X}$ , meaning that the gradient need not vanish. If we make the additional assumption that the minimizer of $F$ lies on the interior of $\mathcal{X}$ (and is thus the unconstrained global minimum), Theorems 1-8 all still apply, with a slight modification to Theorems 3 and 7. Since the gradient now needs to vanish on $\mathcal{X}$ , is always $\mathcal{O}(\gamma B^{2})$ -suboptimal, and only values of $\epsilon$ in the range $0<\epsilon<\frac{\gamma B^{2}}{128}$ and $0<\epsilon<\frac{9\gamma B^{2}}{128}$ result in a non-trivial lower bound (see Remarks at the end of Appendices B.3 and C.3).

Conclusion

We provide a tight (up to a log factor) understanding of optimizing finite sum problems of the form (1) using a component prox oracle.

Randomized optimization of (1) has been the subject of much research in the past several years, starting with the presentation of SDCA and SAG, and continuing with accelerated variants. Obtaining lower bounds can be very useful for better understanding the problem, for knowing where it might or might not be possible to improve or where different assumptions would be needed to improve, and for establishing optimality of optimization methods. Indeed, several attempts have been made at lower bounds for the finite sum setting . But as we explain in the introduction, these were unsatisfactory and covered only limited classes of methods. Here we show that in a fairly general sense, accelerated SDCA, SVRG, SAG, and Katyusha are optimal up to a log factor. Improving on their runtime would require additional assumptions, or perhaps a stronger oracle. However, even if given “full” access to the component functions, all algorithms that we can think of utilize this information to calculate a prox vector. Thus, it is unclear what realistic oracle would be more powerful than the prox oracle we consider.

Our results highlight the power of randomization, showing that no deterministic algorithm can beat the linear dependence on $m$ and reduce it to the $\sqrt{m}$ dependence of the randomized algorithms.

In studying finite sum problems, we were forced to explicitly study lower bounds for randomized optimization as opposed to stochastic optimization (where the source of randomness is the oracle, not the algorithm). Even for the classic problem of minimizing a smooth function using a first order oracle, we could not locate a published proof that applies to randomized algorithms. We provide a simple construction using $\epsilon$ -insensitive differences that allows us to easily obtain such lower bounds without reverting to assuming the iterates are spanned by previous responses (as was done, e.g., in ), and could potentially be useful for establishing randomized lower bounds also in other settings.

We thank Ohad Shamir for his helpful discussions and for pointing out .

References

Appendix A Upper bounds for non-smooth sums

Consider the case where the components are not strongly convex. As shown in lemma 1, we can use a single call to a prox oracle to obtain the gradient of

which is a $\beta$ -smooth approximation to $f$ . We then consider the new optimization problem:

When the component functions are $\lambda$ -strongly convex, a more sophisticated strategy is required to avoid an extra $\log$ factor. The solution is the AdaptSmooth algorithm . This involves solving $\mathcal{O}(\log\frac{1}{\epsilon})$ smooth and strongly convex subproblems, where the $t^{\text{th}}$ subproblem is reducing the suboptimality of the $\beta_{t}$ -smooth and $\lambda$ -strongly convex function $F^{(\beta_{t})}(x)$ by a factor of four, where $\beta_{t}=\frac{L^{2}}{\epsilon_{0}}2^{t}$ and where $\epsilon_{0}\leq\frac{L^{2}}{\lambda}$ upper bounds the initial suboptimality. Using this method results in an $\epsilon$ -suboptimal solution for $F$ after $\sum_{t=0}^{\log\frac{\epsilon_{0}}{\epsilon}}\text{Time}(\beta_{t},\lambda)$ queries to $h_{F}$ .

In the case of AGD, Time $(\gamma,\lambda)=\mathcal{O}\left(m\sqrt{\frac{\gamma}{\lambda}}\right)$ and

Appendix B Lower bounds for deterministic algorithms

Without loss of generality, we can assume $L=B=1$ . For particular values $b$ and $k$ to be decided upon later, we use the functions (6):

For non-smooth functions, the subgradient oracle is not uniquely defined—-many different subgradients might be a valid response. However, in order to say that an algorithm successfully optimizes a function, it must be able to do so no matter which subgradient is receives. Conversely, to show a lower bound, it is sufficient to show that for some valid subgradient the algorithm fails. And so, in constructing a “hard” instance to optimize we are actually constructing both a function and a subgradient oracle for it, with specific subgradient responses. Therefore, answering the algorithm’s queries during round $t$ according to $f_{i}^{t}$ is valid so long as the subgradient we return is a valid subgradient for $f_{i}$ (the converse need not be true) and the prox returned is exactly the prox of $f_{i}$ . For now, assume that this query-answering strategy is consistent (we will prove this last).

Then if $d=\lceil\frac{m}{\epsilon}\rceil+k+1$ and if $x$ is an iterate generated both before $A$ completes round $k$ and before it makes $\lceil\frac{m}{\epsilon}\rceil$ queries to $h_{F}$ (so that the dimension is large enough to orthogonalize each $v_{t}$ as described above), then $\left\langle x,\,v_{k}\right\rangle=0$ by construction. This allows us to bound the suboptimality of $F(x)$ (since $\lceil\frac{m}{2}\rceil$ functions are queried during each round, $\sum_{i=1}^{m}\delta_{i,r}=\lfloor\frac{m}{2}\rfloor$ ):

$F$ is non-negative and $F(x_{b})=0$ where $x_{b}=b\sum_{r=0}^{k}v_{r}$ . Choosing $b=\frac{1}{\sqrt{k+1}}$ makes $\left\|x_{b}\right\|=1$ so that $x_{b}\in\mathcal{X}$ . Therefore, $F$ achieves its minimum on $\mathcal{X}$ and

Where the final inequality holds when $k\geq 1$ . Setting $k=\lfloor\frac{1}{12\epsilon}\rfloor$ implies $F(x)-F(x^{*})\geq\epsilon$ . Therefore, $A$ must either query $h_{F}$ more than $\lceil\frac{m}{\epsilon}\rceil$ times or complete $k$ rounds to reach an $\epsilon$ -suboptimal solution. Completing each round requires at least $\lceil\frac{m}{2}\rceil$ queries to $h_{F}$ , so when $\epsilon\leq\frac{1}{12}$ , this implies a lower bound of

For any $t\leq k$ and any $x$ such that $x^{v}\in\text{span}\left\{v_{0},v_{1},...,v_{t-1}\right\}$ , if function $i$ is queried during round $t$ , then $\partial f_{i}^{t}(x)\subseteq\partial f_{i}(x)$ .

All subgradients of $f_{i}$ have the form

where we define $\text{sign}(0)=0$ . Since function $i$ is queried during round $t$ , $\delta_{i,t}=0$ , and since $\left\langle x,\,v_{r-1}\right\rangle=0=\left\langle x,\,v_{r}\right\rangle$ for all $r>t$ , $\partial f_{i}(x)$ contains all subgradients of the form

which is exactly $\partial f_{i}^{t}(x)$ . ∎

For any $t\leq k$ and any $x$ such that $x^{v}\in\text{span}\left\{v_{0},v_{1},...,v_{t-1}\right\}$ , if function $i$ is queried during round $t$ then $\forall\beta>0$ , $\text{prox}_{f_{i}^{t}}(x,\beta)=\text{prox}_{f_{i}}(x,\beta)$ .

Consider the definition of the prox oracle from equation 3

Next, we further decompose $x^{v}=x^{-}+x^{+}$ where

Note that $x^{+}=0$ and since function $i$ is queried during round $t$ , $\delta_{i,t}=0$ . Therefore,

The last equality follows from the fact that that the minimization is completely separable between $u^{-}$ and $u^{+}$ , allowing us to minimized over each variable separately. The terms containing $u^{+}$ are non-negative and can be simultaneously equal to when $u^{+}=0$ . Therefore, $\text{prox}_{f_{i}^{t}}(x,\beta)=\text{prox}_{f_{i}}(x,\beta)$ . ∎

These lemmas show that the subgradients and proxs of $f_{i}^{t}$ at vectors which are queried during round $t$ are consistent with the subgradients and proxs of $f_{i}$ . This confirms that our construction is sound, and completes the proof. ∎

B.2 Non-smooth and strongly convex components

By assumption, $A$ can find an $\hat{x}$ such that $F(\hat{x})-F(x^{*})<\frac{\epsilon}{2}$ using $o\left(\frac{m(L+\lambda B)}{\sqrt{\lambda\epsilon}}\right)=o\left(\frac{mLB}{\epsilon}\right)$ queries to $h_{F}$ , and

B.3 Smooth and not strongly convex components

This proof will be very similar to the proof of theorem 1. Without loss of generality, we can assume that $\gamma=B=1$ . For a values $a$ and $k$ to be fixed later, we define:

We will assume for now that this query-answering strategy is self-consistent, and prove it later. This allows us to bound the suboptimality of $F(x)$ . Note that since exactly $\lceil\frac{m}{2}\rceil$ functions are queried each round, $\sum_{i=1}^{m}\delta_{i,r}=\lfloor\frac{m}{2}\rfloor$ , so let

Then if $d=\lceil\frac{m}{\sqrt{\epsilon}}\rceil+k+1$ , and if $x$ is an iterate generated both before $A$ completes round $q:=\lfloor\frac{k}{2}\rfloor$ and before it makes $\lceil\frac{m}{\sqrt{\epsilon}}\rceil$ queries to $h_{F}$ , then $\left\langle x,\,v_{r}\right\rangle=0$ for all $r\geq q$ by construction. Then, for this $x$ , $F^{q}(x)=F^{k+1}(x)=F(x)$ . By first order optimality conditions for $F^{t}$ , its optimum $x_{t}^{*}$ must satisfy that:

It is straightforward to confirm that the solution to this system of equations is

Thus, we set $a=\sqrt{\frac{3}{k+1}}$ , ensuring $\left\|x_{k+1}^{*}\right\|=1$ so that $x_{k+1}^{*}=x^{*}\in\mathcal{X}$ . Furthermore, for the iterate $x$ made before $q$ rounds of queries,

where the last inequality holds as long as $k\geq 2$ . So, when $\epsilon<\frac{1}{128}$ and we let $k=\lfloor\frac{1}{\sqrt{32\epsilon}}\rfloor$ , this ensures that

and therefore, $A$ must complete at least $q$ rounds or make more than $\lceil\frac{m}{\sqrt{\epsilon}}\rceil$ queries to $h_{F}$ in order to reach an $\epsilon$ -suboptimal point. This implies a lower bound of

For any $t\leq k$ and any $x$ such that $x^{v}\in\text{span}\left\{v_{0},v_{1},...,v_{t-1}\right\}$ , if function $i$ is queried during round $t$ , then $\nabla f_{i}^{t}(x)=\nabla f_{i}(x)$ .

Since function $i$ is queried during round $t$ , $\delta_{i,t}=0$ so

Up until the last step, this proof is identical to the proof of lemma 3, thus we pick up at (14):

The final step comes from the fact that the $\operatorname*{arg\,min}$ is separable over $u^{-}$ and $u^{+}$ , meaning we can minimize the two terms individually. The terms which contain $u^{+}$ are non-negative and equal to zero when $u^{+}=0$ . ∎

These lemmas show that the gradient and prox of $f_{i}^{t}$ at vectors which are queried during round $t$ are consistent with the gradient and prox of $f_{i}$ . This confirms that our construction is sound.

This proves the lower bound for $\epsilon<\frac{\gamma B^{2}}{128}$ , we can extend the same lower bound to $\epsilon\geq\frac{\gamma B^{2}}{128}$ using the following, very simple construction. Let

where $v$ is a unit vector that is orthogonal to all of the first $m-1$ queries. This function is trivially $1$ -smooth. By construction, the algorithm must make at least $m$ queries to learn the identity of $v$ . Until it has done so, any iterate will have objective value zero, while the optimum $F(x^{*})=F(v)=-2\epsilon$ . Therefore, the algorithm must make at least $m$ queries to reach an $\epsilon$ -suboptimal solution. For $\epsilon\geq\frac{\gamma B^{2}}{128}$

Therefore a lower bound of $\Omega\left(m\sqrt{\frac{\gamma B^{2}}{\epsilon}}\right)$ applies for any $\epsilon>0$ . ∎

If make the additional assumption that $F$ is minimized on the interior of $\mathcal{X}$ , since is $\mathcal{O}(\gamma B^{2})$ -suboptimal, only $0<\epsilon<\frac{\gamma B^{2}}{128}$ gives a non-trivial lower bound. This lower bound is shown by the first construction presented in the previous proof.

B.4 Smooth and strongly convex components

We will prove the theorem for $1$ -smooth and $\lambda$ -strongly convex components for any $\lambda<\frac{1}{73}$ . This can be extended to arbitrary constants $\gamma$ and $\lambda^{\prime}$ by taking $\lambda=\frac{\lambda^{\prime}}{\gamma}$ .

For any $k$ and for $\zeta$ and $C$ to be defined later, let

where the vectors $v_{r}$ and indicators $\delta_{i,r}$ are defined in the same way as in the previous proof. This function is just a multiple of the construction in the proof of theorem 3 plus the $\lambda\left\|x\right\|^{2}/2$ term. It is clear that the norm term is uninformative for learning the identity of vectors $v_{r}$ , as the component of the gradients and proxs which is due to that term is simply a scaling of the query point. Thus, for any iterate $x$ generated by $A$ before completing $t$ rounds of optimization, $\left\langle x,\,v_{r}\right\rangle=0$ for all $r\geq t$ so long as the dimension is greater than the total number of queries made to $h_{F}$ so far plus $k+1$ ; a fact which follows directly from the previous proof.

Since exactly $\lceil\frac{m}{2}\rceil$ functions are queried per round, $\sum_{i=1}^{m}\delta_{i,r}=\lfloor\frac{m}{2}\rfloor$ and thus

By the first order optimality conditions for $F(x)$ , its optimum $x^{*}$ must satisfy that:

Defining $q:=\frac{\sqrt{Q}-1}{\sqrt{Q}+1}<1$ and setting $\zeta=1-q$ , it is straightforward to confirm that

Thus, $F(0)-F(x^{*})=\frac{\lambda C^{2}}{8}\left(\sqrt{Q}-1\right)^{2}$ , so by choosing $C$ appropriately, we can make the initial suboptimality of our construction take any value $\epsilon_{0}$ . Since $F$ is $\lambda$ -strongly convex, for any $x_{t}$ , $F(x_{t})-F(x^{*})\geq\frac{\lambda}{2}\left\|x_{t}-x^{*}\right\|^{2}$ . Let $x_{t}$ be an iterate which is generated before $t$ rounds of optimization have been completed, implying that $\left\langle x_{t},\,v_{r}\right\rangle=0$ for all $r\geq t$ . So

If we set $k+1=\left\lceil t-\frac{1}{2\log q}\right\rceil$ then

and when $t=\left\lfloor\frac{\sqrt{Q}-1}{4}\log\frac{\epsilon_{0}}{2\sqrt{Q}\epsilon}\right\rfloor$

Therefore, the algorithm must complete at least $t$ rounds of queries before it can reach an $\epsilon$ -suboptimal point. If $\epsilon_{0}>\frac{3\epsilon}{\lambda}$ and $\lambda<\frac{1}{73}$ , since each round includes at least $\frac{m}{2}$ oracle queries, this implies a lower bound of

queries to $h_{F}$ in order to reach an $\epsilon$ -suboptimal point. ∎

Appendix C Lower bounds for randomized algorithms

We thank Yair Carmon, John Duchi, Oliver Hinder, and Aaron Sidford for pointing out an issue with the original proofs in this section. We have since modified this section in a manner resembling .

To prove the deterministic lower bounds, we constructed vectors $v_{r}$ adversarially, orthogonalizing them to queries made by the algorithm. In the randomized setting, this is impossible, as we cannot anticipate query points. Our solution was to instead draw a set of “important directions” $v_{i,r}$ randomly in high dimensions. The intuition is that a given vector, in this case the query made by the algorithm, will have a very small inner product with a random unit vector with high probability if the dimension is large enough.

Using this fact, we construct helper functions $\psi_{c}$ and $\phi_{c}$ to replace the absolute and squared difference functions used in the deterministic lower bounds. These functions are both flat at 0 on the interval $[-c,c]$ , meaning that the algorithm’s query needs to have a significant inner product with $v_{i,r}$ before the oracle needs to give that vector away as a gradient or prox. We will show that each of our constructions satisfy the following property:

For all $i$ , all $t\leq k$ and $x$ such that $\forall r\geq t$ $\left|\left\langle x,\,v_{i,r}\right\rangle\right|\leq\frac{c}{2}$ , $\exists g_{1}\in\partial f_{i,1}(x)$ and $\exists g_{2}\in\partial f_{i,2}(x)$ such that

and such that $g_{1}$ , $g_{2}$ , $\text{prox}_{f_{i,1}}(x,\beta)$ , and $\text{prox}_{f_{i,2}}(x,\beta)$ are all a deterministic function of $x^{(t)}$ and $\left\{v_{i,r}:r<t\right\}$ .

In other words, when $x$ has a small inner product with $v_{i,r}$ for all $r\geq t$ , then querying either $f_{i,1}$ or $f_{i,2}$ at $x$ will reveal at most $v_{i,t}$ . Our bounds on the complexity of optimizing our functions are based on the principle that the algorithm can only learn one $v_{i,r}$ per query, so we need to control the probability that the premises of this property hold for every query made by the algorithm. In this section, we will bound how large the dimensionality of the problem needs to be to ensure that with high probability, only one vector is revealed to the algorithm by each oracle response.

For $i\in\left\{1,2,\dots,m/2\right\}$ , we will define pairs of component functions $f_{i,1}$ and $f_{i,2}$ that satisfy Property 1. If $m$ is odd, we set $f_{m}\equiv 0$ and lose only a factor of $(m-1)/m$ in our lower bound, therefore, we proceed assuming $m$ is even.

The proofs will proceed by showing that optimizing $F(x)=\frac{1}{m}\sum_{i=1}^{m/2}f_{i,1}(x)+f_{i,2}(x)$ is equivalent to finding an $x$ with a significant inner product with a large number of the vectors $v_{i,r}$ . This property depends on the specific constructions and will be discussed in detail later.

First, we will show generally that when the dimension is sufficiently large, Property 1 ensures each oracle access will “reveal” at most a single vector $v_{i,r}$ with high probability over the randomness in the draw of the vectors $v_{i,r}$ . As a consequence of this fact, $\Omega(mk)$ accesses will be needed to optimize $F$ .

We will prove the lower bound on the number of oracle accesses needed for an arbitrary deterministic optimization algorithm to optimize our random finite sum instance, and then apply Yao’s minimax principle in order to prove the lower bound for randomized optimization algorithms.

We denote the $n^{th}$ query made by the algorithm $q^{(n)}=\left(i^{(n)},j^{(n)},x^{(n)},\beta^{(n)}\right)$ , which is a query to function $f_{i^{(n)},j^{(n)}}$ at the point $x^{(n)}$ with the prox parameter $\beta^{(n)}$ (if applicable). For now, we require that $\exists B\geq 1$ s.t. $\left\|x^{(n)}\right\|\leq B$ for all $n$ ; this assumption will be removed in the individual lower bound proofs. Recalling that we are considering for now deterministic optimization algorithms, the $n^{th}$ query is a function of the previous $n-1$ queries and the oracle’s responses to those queries.

Let $\#(i,n)=\left|\left\{t<n:i^{(t)}=i\right\}\right|$ denote the number of times that the functions $f_{i,1}$ or $f_{i,2}$ were queried before time $n$ . Define $K_{n}=\left\{v_{i,r}:i\in[m/2],r\leq\#(i,n)\right\}$ . These are the set of vectors $v_{i,r}$ that are supposed to be “known” to the optimization algorithm at time $n$ . Also, let $U_{n}=\left\{v_{i,r}:i\in[m/2],r>\#(i,n)\right\}$ . These are the set of vectors that are supposed to be “unknown” to the algorithm.

Let $X_{n}=\left\{x^{(1)},\dots,x^{(n)}\right\}$ be the set of query points up to time $n$ .

Let $S_{n}=\text{span}\left\{X_{n-1}\cup K_{n}\right\}$ , and let $S_{n}^{\perp}$ be its orthogonal complement. Let $P_{n}w$ be the orthogonal projection of a vector $w$ onto the subspace $S_{n}$ , and let $P_{n}^{\perp}w$ be its orthogonal projection onto $S_{n}^{\perp}$ .

We would like to show that the following event occurs with high probability over the draw of the vectors $v_{i,r}$ :

Here, $c\leq 1/\sqrt{k}$ is a constant that will be determined later. This event indicate that the premises of Property 1 are satisfied, which will be used to prove the lower bounds in the next sections.

where $\alpha=\frac{c}{2B(1+\sqrt{2N})}$ for some $c\leq\frac{1}{\sqrt{N}}$ . The events $G_{n}$ are useful because:

This proof closely follows [8, Lemma 9] and [20, Lemma 1].

Let $G_{\leq n}$ denote $\bigcap_{t=1}^{n}G_{t}$ . We will establish that for each $n\leq N$ , $G_{\leq n}\implies\forall v\in U_{n}$ $\left|\left\langle x^{(t)},\,v\right\rangle\right|\leq\frac{c}{2}$ . To begin, we rewrite

First, we decomposed $v=P_{n}v+P_{n}^{\perp}v$ and used $\left\|x^{(t)}\right\|\leq B$ . Next, we used the Cauchy-Schwarz inequality and the self-adjointness of $P_{n}^{\perp}$ . Finally, we used the non-expansivity of $P_{n}^{\perp}$ and applied the definition of $G_{n}$ .

In order to bound the first term in (20), we will prove by induction that for $n\leq N$ and $v\in U_{n}$ , $G_{<n}\implies\left\|P_{n}v\right\|^{2}\leq 2(n-1)\alpha^{2}$ . The base case $n=1$ is trivial as $P_{1}v$ is the projection of $v$ onto $S_{1}=\varnothing$ .

For the inductive step, fix $n\leq N$ and $v\in U_{n}$ , and assume the statement holds for $t<n$ . Let $\hat{P}_{n}$ project onto $\text{span}\left\{X_{n}\cup K_{n}\right\}$ (this includes $x^{(n)}$ in contrast with $P_{n}$ ), and let $\hat{P}_{n}^{\perp}$ project onto the orthogonal subspace.

Denote $v^{(t)}=v_{i^{(t)},\#(i^{(t)},t)+1}\in U_{t}\cap K_{t+1}$ , which is the vector that “becomes known” after time $t$ . By definition, the set

spans $S_{n}$ . Therefore, the Gram-Schmidt vectors

form an orthonormal basis for $S_{n}$ (after ignoring any zero vectors that may arise from the projection). We now write $\left\|P_{n}v\right\|^{2}$ in terms of this orthonormal basis:

We now bound the second term of (24). Note that $v^{(t)}\not\in U_{n}$ is orthogonal to $v\in U_{n}$ . Consider

Here, we used the triangle and Cauchy-Schwarz inequalities. Since $v^{(t)},v\in U_{n}$ , by the inductive hypothesis

Furthermore, looking at the denominator of (24)

The inequality uses the inductive hypothesis and $G_{t}\impliedby G_{<n}$ . Since $c\leq 1/\sqrt{N}$ and $B\geq 1$ , $\alpha=\frac{c}{2B\left(1+\sqrt{2N}\right)}\leq\frac{1}{\sqrt{4N-2}}$ , so this quantity is at least $1/2$ .

Combining this and (31) with (24), we conclude

We conclude that for all $n\leq N$ and $v\in U_{n}$ , $G_{<n}\implies\left\|P_{n}v\right\|^{2}\leq 2(n-1)\alpha^{2}$ .

Since this holds for any $n\leq N$ and $v\in U_{n}$ , we conclude $G_{\leq N}\implies E$ . ∎

With Lemma 6 having been proven, we now observe the following corollary:

For any $n\leq N$ , let $o^{(n)}$ be the output of the gradient or prox oracle at time $n$ . Then $G_{\leq n}\implies o^{(n)}\in\text{span}\left\{\left\{x^{(n)}\right\}\cup K_{n+1}\right\}$ .

By Lemma 6, $G_{\leq n}$ implies $\forall r>\#(i^{(n)},n)$ $\left|\left\langle x^{(n)},\,v_{i^{(n)},r}\right\rangle\right|\leq\frac{c}{2}$ . Therefore, Property 1 immediately completes the proof. ∎

By Lemma 6 and the chain rule of probability,

We address a single term in the product using the following lemma:

This proof closely follows [8, Lemma 11] and [20, Lemma 3]

Fix $n\leq N$ . We will start by proving the density $p_{U_{n}}(U_{n}|G_{<n},K_{n})$ is invariant to a rotations that preserve $S_{n}$ . To that end, let $R$ be any rotation such that for all $w\in S_{n}$ $Rw=w$ . Then

Finally, for any $u\in U_{n}$ , $k\in K_{n}$

Therefore, $p_{U_{n},K_{n}}(RU_{n},K_{n})=p_{U_{n},K_{n}}(U_{n},K_{n})$ .

As the base case, consider $t=2$ . Since $G_{<2}=G_{1}$ depends only on the vectors $U_{2}\cup K_{2}$ and the first iterate $x^{(1)}$ , which is determined before making any oracle accesses and is thus independent of the vectors $U_{2}\cup K_{2}$ . Consequently,

Furthermore, by Corollary 1, if $G_{<2}=G_{1}$ holds, the first oracle response is in the span of the first query $x^{(1)}=x^{\prime(1)}\in X_{2}\subseteq X_{n}$ and $v_{i^{(1)},0}\in K_{2}\subseteq K_{n}$ , neither of which is affected by the rotation $R$ . Therefore, the second queries, which are a deterministic function of the identical first queries and oracle responses, are equal $x^{(2)}=x^{\prime(2)}$ .

Again by the inductive hypothesis, if $G_{<n-1}$ holds then $x^{(n-1)}=x^{\prime(n-1)}$ . Thus, the projections $P_{n-1}$ and $P^{\prime}_{n-1}$ (the projection that arises when $\left\{v_{i,r}\right\}=RU_{n}\cup K_{n}$ ) are equal because all of the first $n-1$ queries are identical, and the rotation $R$ does not affect $K_{n-1}$ . Consequently,

This establishes that $p_{U_{n}}(U_{n}|G_{<n},K_{n})=p_{U_{n}}(RU_{n}|G_{<n},K_{n})$ and thus the distribution of $U_{n}$ is invariant to rotations $R$ that preserve $S_{n}$ .

As a consequence, for any $v\in U_{n}$ , $P_{n}^{\perp}v$ and $P_{n}^{\perp}Rv$ have the same distribution. Since $R$ preserves $S_{n}$ and length, we conclude $P_{n}^{\perp}v$ and $RP_{n}^{\perp}v$ also have the same distribution, and thus $\frac{P_{n}^{\perp}v}{\left\|P_{n}^{\perp}v\right\|}$ is distributed uniformly on the unit sphere in $S_{n}^{\perp}$ conditional on $G_{<n}$ and $K_{n}$ .

For each inner product (61), the first term is a fixed quantity conditioned on $G_{<n}$ and $K_{n}$ . The conditional distribution of the second vector, $\frac{P_{n}^{\perp}v}{\left\|P_{n}^{\perp}v\right\|}$ , as argued above, is uniform on $S_{n}^{\perp}$ . Thus, each term of the sum is the inner product of a fixed unit vector and a uniformly random unit vector in $d-2(n-1)$ dimensions.

This probability that the inner product is at least $\alpha$ is proportional to the surface area of the “end caps” of a unit sphere lying above and below circles of radius $r=\sqrt{1-\alpha^{2}}$ . This surface area, in turn, is less than the surface area of a sphere of radius $r$ . Therefore, for a given $v\in U_{n}$

For any $d\geq 2N+\frac{2}{\alpha^{2}}\log\left(2mkN\right)$

Together, Lemma 8 and Property 1 allow us to ensure that any algorithm can only learn one important vector per query with high probability as long as the dimension is large enough. What is left is to show that Property 1 holds for each of our constructions and to bound the suboptimality of any iterate that has small inner product with the vectors in $U_{n}$ .

We first consider the Lipschitz and non-strongly convex setting and prove theorem 5: See 5 Without loss of generality, we let $L=B=1$ . As shown in Equations (9) and (10), we define

and for values $b$ , $c$ , and $k$ to be fixed later we define $m/2$ pairs of functions, indexed by $i=1..m/2$ :

Assume for now that $m$ is even. If $m$ is odd, then we simply set one of the functions to and the oracle complexity is reduced by a factor proportional to $\frac{m-1}{m}$ .

At the end of this proof, we will show that the functions $f_{i,\cdot}$ satisfy Property 1. Since the domain, and therefore the queries made to the oracle are bounded by $B=1$ , Property 1 and Lemma 8 ensure that when the dimension is at least $d\geq 2N+\frac{8\left(1+\sqrt{2N}\right)^{2}}{c^{2}}\log\left(2mkN\right)$ , for iterate $x$ generated after $N$ oracle queries, $\left\langle x,\,v_{i,r}\right\rangle\geq\frac{c}{2}$ for no more than $N$ vectors $v_{i,r}$ with probability $\frac{3}{4}$ .

We now bound the suboptimality of $(f_{i,1}+f_{i,2})/2$ for any $x$ where $\left\langle x,\,v_{i,k}\right\rangle<\frac{c}{2}$ .

It is straightforward to confirm that this function is minimized when $\left\langle x,\,v_{i,r}\right\rangle=b$ for all $r$ . Since this is also true for every $i$ , $F$ is minimized at $x_{b}=b\sum_{i=1}^{\frac{m}{2}}\sum_{r=0}^{k}v_{i,r}$ . In order that $\left\|x_{b}\right\|=1$ so that $x_{b}\in\mathcal{X}$ , we set $b=\sqrt{\frac{2}{m(k+1)}}$ . Thus,

Therefore, we set $c=\min\left\{\frac{1}{\sqrt{N}},\frac{\epsilon}{\sqrt{k}}\right\}$ and $k=\lfloor\frac{1}{10\epsilon\sqrt{m}}\rfloor$ so that

This ensures that if $\left\langle x,\,v_{i,k}\right\rangle<\frac{c}{2}$ for at least $m/4$ $i$ ’s, then $x$ cannot be $\epsilon$ -suboptimal for $F$ . Therefore, after $N=\frac{m(k+1)}{4}$ queries in dimension

then $F(x)-F(x^{*})\geq\epsilon$ with probability $\frac{3}{4}$ . When $\epsilon<\frac{1}{10\sqrt{m}}$ , $A$ must make at least

queries with probability $\frac{3}{4}$ . Finally, we prove that Property 1 holds for our construction:

First we prove the properties about the gradients:

Consider the case when $t$ is odd. From (9), it is clear that $\frac{d\psi_{c}}{dz}(z)=0$ when $\left|z\right|<c$ . Furthermore, for $r>t$ , $\left|\left\langle x,\,v_{i,r-1}\right\rangle-\left\langle x,\,v_{i,r}\right\rangle\right|<c$ . Therefore, any subgradient $g_{1}\in\partial f_{i,1}(x)$ and $g_{2}\in\partial f_{i,2}(x)$ can be expressed as

where $\text{sign}(0)$ can take any value in the range $ $and where$ \psi_{c}^{\prime} $is a subderivative of$ \psi_{c} $. It is clear from these expressions that$ \partial f_{i,1}(x)\subseteq\text{span}\left\{v_{i,0},...,v_{i,t-1}\right\} $and$ \partial f_{i,2}(x)\subseteq\text{span}\left\{v_{i,1},...,v_{i,t}\right\} $. The proof for the case when$ t$ is even follows the same line of reasoning.

We now prove the properties about the proxs:

Since each pair of functions $f_{i,\cdot}$ operates on a separate $(k+1)$ -dimensional subspace, it will be useful to decompose vectors into $x=x_{i}^{v}+x_{i}^{\perp}$ where $x_{i}^{v}=\sum_{r=0}^{k}\left\langle x,\,v_{i,r}\right\rangle v_{i,r}$ and $x_{i}^{\perp}=x-x_{i}^{v}$ . First, note that

(and similarly for $f_{i,2}$ ). From there, the proof is similar to the proof of lemma 3. First, consider the function $f_{i,2}$ and let $t^{\prime}\geq t$ be the smallest even number which is not smaller than $t$ . It will be convenient to further decompose vectors into $x_{i}^{v}=x^{-}+x^{+}$ where $x^{-}=\sum_{r=0}^{t^{\prime}-1}\left\langle x_{i}^{v},\,v_{i,r}\right\rangle v_{i,r}$ and $x^{+}=\sum_{r=t^{\prime}}^{k}\left\langle x_{i}^{v},\,v_{i,r}\right\rangle v_{i,r}$ . So

Since $\left|\left\langle x^{+},\,v_{i,r}\right\rangle\right|<\frac{c}{2}$ for all $r\geq t$ , $\left|\left\langle x^{+},\,v_{i,r-1}\right\rangle-\left\langle x^{+},\,v_{i,r}\right\rangle\right|<c$ for $r>t$ , which implies that $f_{i,2}(x^{+})=0$ . Therefore, the objective of the second $\operatorname*{arg\,min}$ is non-negative and is equal to zero when $u^{+}=x^{+}$ so

Therefore, when $t$ is even, $t^{\prime}=t$ and $\text{prox}_{f_{i,2}}(x,\beta)\in\text{span}\left\{x,v_{i,0},...,v_{i,t-1}\right\}$ , and when $t$ is odd, $t^{\prime}=t+1$ and $\text{prox}_{f_{i,2}}(x,\beta)\in\text{span}\left\{x,v_{i,0},...,v_{i,t}\right\}$ . A very similar line of reasoning can be used to show the statement for $f_{i,1}$ . ∎

As was mentioned before, Lemma 8 applies when the norm of every query point is bounded by $B$ . Since all points in the domain of the optimization problem have norm bounded by $B$ , this is not problematic. However, we can slightly modify our construction to make optimizing $F$ hard even for algorithms that are allowed to query outside of the domain.

We could redefine our functions as follows:

$f^{\prime}_{i,j}$ is still continuous, and $L$ -Lipschitz, and it also has the property that it behaves exactly like $f_{i,j}$ on $B$ -ball. However, querying the oracle of $f^{\prime}_{i,j}$ outside of the $B$ -ball gives no more information about the function than querying at $B\frac{x}{\left\|x\right\|}$ . In fact, an algorithm that was only allowed to query within the $B$ -ball would be able to simulate the oracle of $F^{\prime}$ . Therefore, since the algorithm that is not allowed to query at large vectors cannot optimize $F^{\prime}$ quickly, and it could simulate queries with unbounded norm, it follows that querying with unbounded norm cannot improve the rate of convergence. This fact is needed in the proof of Theorem 6 below.

C.2 Non-smooth and strongly convex components

We now prove Theorem 6 using a reduction from the Lipschitz and non-strongly convex setting: See 6

By assumption, $A$ can find an $\hat{x}$ such that $F(\hat{x})-F(x^{*})<\frac{\epsilon}{2}$ using $o\left(m+\frac{\sqrt{m}(L+\lambda B)}{\sqrt{\lambda\epsilon}}\right)=o\left(m+\frac{\sqrt{m}LB}{\epsilon}\right)$ queries to $h_{F}$ , and

C.3 Smooth and not strongly convex components

See 7 Without loss of generality, we can assume that $\gamma=B=1$ . We will first consider the case where $\epsilon=O\left(\frac{1}{m}\right)$ and prove that $A$ must make $\Omega\left(\sqrt{\frac{m}{\epsilon}}\right)$ queries to $h_{F}$ . Afterwards, we will show a lower bound of $\Omega(m)$ in the large- $\epsilon$ regime where that term dominates.

The function construction in this case is very similar to the non-smooth randomized construction. As in Equation (11)

The key properties of this function for this proof are that it is convex, everywhere differentiable and $4$ -smooth, and when $\left|z\right|\leq c$ , the function is constant at 0. It is also useful to note that

As in Equation (12), for values $a$ and $k$ to be fixed later, we define the pairs of functions for $i=1,...,m/2$ :

At the end of this proof, we will show that the functions $f_{i,\cdot}$ satisfy Property 1. Since the domain, and therefore the queries made to the oracle are bounded by $B$ , Property 1 and Lemma 8 ensure that when the dimension is at least $d=2N+\frac{8(1+\sqrt{2N})^{2}}{c^{2}}\log(2mkN)$ then after $N$ oracle queries, $\left\langle x,\,v_{i,r}\right\rangle\geq\frac{c}{2}$ for no more than $N$ vectors $v_{i,r}$ with probability $\frac{3}{4}$ .

Now, we will bound the suboptimality of $F_{i}(x):=(f_{i,1}(x)+f_{i,2}(x))/2$ at an iterate $x$ such that $\left|\left\langle x,\,v_{i,r}\right\rangle\right|<\frac{c}{2}$ for all $r\geq t$ . From the definition of $\phi_{c}$ :

and note that in the proof of Theorem 3 we already showed that that the optimum of $F^{t}_{i}$ is achieved at

Therefore, setting $a=\sqrt{\frac{6}{m(k+1)}}$ ensures that $\left\|\sum_{i=1}^{\frac{m}{2}}x_{i,k+1}^{*}\right\|\leq 1$ . It is not necessarily true that $x^{*}=\sum_{i=1}^{\frac{m}{2}}x_{i,k+1}^{*}$ , but it serves as an upper bound on the optimum.

Let $q:=\lfloor\frac{k}{2}\rfloor$ and consider an iterate $x$ generated by $A$ before it makes $q-1$ queries to the functions $f_{i,1}$ and $f_{i,2}$ . When $\left\langle x,\,v_{i,r}\right\rangle<\frac{c}{2}$ for all $r\geq q$ ,

where the last inequality holds as long as $k\geq 2$ . When $\epsilon<\frac{1}{320m}$ , setting $c=\min\left\{\frac{1}{\sqrt{N}},\sqrt{\frac{16\epsilon}{k}}\right\}$ and $k=\lfloor\frac{1}{\sqrt{80\epsilon m}}\rfloor\geq 2$ , ensures that

Therefore, if $\left\langle x,\,v_{i,r}\right\rangle<\frac{c}{2}$ for all $r\geq q$ is true for at least $\frac{m}{4}$ of the $i$ ’s, then $x$ cannot be $\epsilon$ -suboptimal for $F$ . So, for $N=\frac{mq}{4}$ in dimension

with probability $\frac{3}{4}$ , the algorithm must make at least $\frac{mq}{4}$ queries in order to reach an $\epsilon$ -suboptimal point. This gives a lower bound of

which holds with probability $\frac{3}{4}$ . To complete the first half of the proof, we prove that Property 1 holds for this construction:

First we prove the properties about gradients:

Consider the case when $t$ is odd. From equation 79, we can see that $\frac{d\phi_{c}}{dz}(z)=0$ when $\left|z\right|<c$ . Furthermore, for $r>t$ , $\left|\left\langle x,\,v_{i,r-1}\right\rangle-\left\langle x,\,v_{i,r}\right\rangle\right|<c$ . We can therefore express the gradients:

It is clear from these expressions that $\nabla f_{i,1}(x)\in\text{span}\left\{v_{i,0},...,v_{i,t-1}\right\}$ and $\nabla f_{i,2}(x)\in\text{span}\left\{v_{i,0},...,v_{i,t}\right\}$ . The proof for the case when $t$ is even follows the same line of reasoning.

Now, we prove the properties about proxs:

We follow the same line of reasoning as in the Lipschitz and non-strongly convex case. The only necessary addition is to show, that when $t^{\prime}\geq t$ is the smallest even number which is not smaller than $t$ and $u^{-}=\sum_{r=0}^{t^{\prime}-1}\left\langle u_{i}^{v},\,v_{i,r}\right\rangle v_{i,r}$ and $u^{+}=\sum_{r=t^{\prime}}^{k}\left\langle u_{i}^{v},\,v_{i,r}\right\rangle v_{i,r}$ , then $f_{i,2}(u^{-}+u^{+})=f_{i,2}(u^{-})+f_{i,2}(u^{+})$ :

This same reasoning applies for $f_{i,1}$ or odd $t^{\prime}$ . ∎

So far, we have shown a lower bound of $\Omega\left(\sqrt{\frac{m}{\epsilon}}\right)$ when $\epsilon=O\left(\frac{1}{m}\right)$ . We now show a lower bound of $\Omega(m)$ for all $\epsilon>0$ , which accounts for the first term in the lower bound $\Omega\left(m+\sqrt{\frac{m}{\epsilon}}\right)$ which dominates when $\epsilon=\Omega\left(\frac{1}{m}\right)$ . Consider the -smooth functions

for any constant $C>0$ , and where the orthonormal vectors $v_{i}$ are randomly chosen as before. $F$ reaches its minimum on the unit ball at

Therefore, by simply choosing $C=\frac{\epsilon\sqrt{m}}{0.08}$ , we ensure that such a point $x$ is at least $\epsilon$ -suboptimal, completing the proof for all $\epsilon>0$ .

As noted above, the queries made to the oracle must be bounded for Lemma 8. Since the domain of $F$ is the $B$ -ball, this is easy to satisfy. If we want to ensure that our construction is still hard to optimize, even if the algorithm is allowed to query arbitrarily large vectors, then we can modify our construction by defining a new function $f^{\prime}_{i,j}$ in terms of its gradient

This function is continuous and smooth, and also has the property that querying the oracle at a point $x$ outside of the $B$ -ball is cannot be more informative than querying at $B\frac{x}{\left\|x\right\|}$ . That is, an algorithm that is not allowed to query outside the $B$ -ball can simulate such queries using its restricted oracle. Since this restricted algorithm cannot optimize quickly, but can still calculate the oracle outputs that it would have recieved by querying large vectors, it follows that an unrestricted algorithm could not optimize this function quickly either.

Another variant of (1) that one might consider is an unconstrained optimization problem, where we assume that the minimizer of $F$ lies on the interior of that ball. In other words, we could consider a version of (1) where the gradient of $F$ must vanish on the interior of $\mathcal{X}$ .

In this case, there is little reason to consider any $\epsilon$ larger than $\frac{\gamma B^{2}}{2}$ , since $F(0)-F(x^{*})\leq\frac{\gamma B^{2}}{2}$ always (by smoothness $F(0)-F(x^{*})\leq\left\langle\nabla F(x^{*}),\,x_{0}-x^{*}\right\rangle+\frac{\gamma}{2}\left\|x^{*}\right\|^{2}\leq\frac{\gamma B^{2}}{2}$ ). Consequently, when $\epsilon\geq\frac{\gamma B^{2}}{2}$ there is a trivial upper bound of zero oracle queries, as just returning the zero vector guarantees $\epsilon$ -suboptimality. We can construct functions so that Theorem 7 still applies for $0<\epsilon<\frac{9\gamma B^{2}}{128}$ . In the previous proof, the first construction is still valid in the unconstrained case since the minimizer lies within the unit ball. For the $\Omega(m)$ term, consider the $1$ -smooth functions (assume w.l.o.g. that $\gamma=B=1$ )

Therefore, if fewer than $\frac{m}{2}$ functions have been queried, then with probability at least $\frac{9}{10}$ :

This proves a lower bound of $\Omega(m)$ for $0<\epsilon<\frac{9\gamma B^{2}}{128}$ .

C.4 Smooth and strongly convex components

In the smooth and strongly convex case, we cannot use the same simple reduction that was used to prove Theorem 6. Using that construction, we would be able to show a lower bound of $m$ , but would not be able to show any dependence on $\epsilon$ , so the lower bound would be loose. Instead, we will use an explicit construction similar to the one used in Theorem 7. See 8

We will prove the theorem for a $1$ -smooth, $\lambda$ -strongly convex problem, for $\lambda<\frac{1}{73m}$ , which can be generalized by scaling.

As in the proof for the non-strongly convex case, we introduce the 4-smooth helper function

These functions also have Property 1, but we will omit the proof, as it follows directly from the proof in Appendix C.3. Intuitively, the squared norm reveals no new information about the vectors $v_{i,r}$ besides what is already included in the query point $x$ .

When all of the queries are bounded by $B$ , Property 1 along with Lemma 8 ensures that when $d=2N+\frac{8B^{2}\left(1+\sqrt{2N}\right)^{2}}{c^{2}}\log\left(2mkN\right)$ , after the algorithm make $N$ queries $\left\langle x,\,v_{i,r}\right\rangle\geq\frac{c}{2}$ for at most $N$ of the vectors $v_{i,r}$ with probability $\frac{3}{4}$ . For this to apply, we need that all of the queries made by the algorithm are within a $B$ -ball around the origin. We know that $F(0)-F(x^{*})=\epsilon_{0}$ , and by strong-convexity $F(0)\geq F(x^{*})+\frac{\lambda}{2}\left\|x^{*}\right\|^{2}$ , therefore, $\left\|x^{*}\right\|\leq\sqrt{\frac{2\epsilon_{0}}{\lambda}}=:B$ . Since the optimum point must lie in the $B$ -ball around the origin, we will restrict the algorithm to query only at points within the $B$ -ball. At the end of the proof, we will show that with a small modification to the functions outside of the $B$ -ball, querying at vectors of large norm cannot help the algorithm.

Now it remains to lower bound the suboptimality of the pair $f_{i,1}$ and $f_{i,2}$ at an iterate which is nearly orthogonal to all vectors $v_{i,r}$ for $r>t$ :

In order to bound the suboptimality of a pair of functions $i$ , it will be convenient to bundle up all of the terms which affect the value of $\left\langle x^{*},\,v_{i,r}\right\rangle$ from all $m$ of the component functions. Most of those terms are contained in $f_{i,1}$ and $f_{i,2}$ , however, $\left\|x\right\|^{2}$ terms in each of the other components also affect the value of $\left\langle x^{*},\,v_{i,r}\right\rangle$ . For each $i$ , consider the projection operator $P_{i}$ which projects a vector $x$ onto the subspace spanned by $\left\{v_{i,r}\right\}_{r=0}^{k}$ , and $P_{\perp}$ projecting onto the space orthogonal to $v_{i,r}$ for all $i,r$ . Now decompose

then when $\left|\left\langle x,\,v_{i,r}\right\rangle\right|<\frac{c}{2}$

We have already showed in Appendix B.4 that if $\hat{x}:=\operatorname*{arg\,min}_{x}F_{i}(x)$ , and if

Therefore, if at least $m/4$ of the pairs $i$ it holds that $\left|\left\langle x,\,v_{i,r}\right\rangle\right|<\frac{c}{2}$ for $r>t$ , then

As a consequence of this, when the dimension is sufficiently large, with probability $\frac{3}{4}$ the optimization algorithm must make at least $t$ queries to each of at least $\frac{m}{4}$ pairs of functions in order to reach an $\epsilon$ -suboptimal solution in expectation. So, when

The same argument as was used in the discussion after theorem 7 to show the $\Omega(m)$ term of the lower bound can be used here, as the function in that construction was both smooth and strongly convex.

As mentioned above, Lemma 8 requires that the norm of all query points be bounded by $B$ . We argued above that the optimum of $F$ must lie within the $B$ -ball around the origin. Even so, we can slightly modify our construction to show that even if the algorithm were allowed to query arbitrarily large points, it still would not be able to optimize $F$ quickly. Define $f^{\prime}_{i,j}$ through its gradient as:

This new function is continuous, $\gamma$ -smooth, and $\lambda$ -strongly convex, and it also has the property that querying the function at a point $x$ outside the $B$ -ball, it is no more informative than querying at $B\frac{x}{\left\|x\right\|}$ . That is, an algorithm that was not allowed to query outside the $B$ -ball could simulate the result of such queries. Since that restricted algorithm can’t optimize $F^{\prime}$ well, as proven above, another algorithm which could query at arbitrary points, therefore could not either. ∎

C.5 Non-smooth components when ϵitalic-ϵ\epsilon is large

Therefore, until the algorithm has made enough queries so that

the expected suboptimality is greater than $\epsilon$ . By a standard information theoretic result , achieving that probability of success at predicting the sign of $Y$ implies a comparable level of accuracy at distinguishing between $p=0.5+2\epsilon$ and $p=0.5-2\epsilon$ , and that requires at least $\frac{1}{128\epsilon^{2}}$ queries to $h_{F}$ . ∎

It is straightforward to show a lower bound of $\Omega\left(\frac{L^{2}}{\lambda\epsilon}\right)$ for strongly convex functions using the same reduction by regularization as in the proofs of theorems 2 and 6. We also note that this lower bound implies a lower bound of $\Omega(m)$ for smooth functions, whether strongly convex or not. Each function $f_{i}$ in this construction is linear, and therefore is trivially -smooth. We make the gradient of each function arbitrarily large by multiplying each $f_{i}$ by a large number. As the multiplier grows, the algorithm need be more and more certain of the sign of $Y$ in order to achieve a expected suboptimality of less than $\epsilon$ . Thus for a sufficiently large multiplier, the algorithm must query $\Omega(m)$ functions. We cannot force it to query more than that, of course, since it only needs to query $m$ functions to know the sign of $Y$ with probability 1.