Tight Complexity Bounds for Optimizing Composite Objectives
Blake Woodworth, Nathan Srebro
Introduction
We consider minimizing the average of convex functions:
A natural question is how to leverage the prox oracle, and how much benefit it provides over gradient access alone. The prox oracle is potentially much more powerful, as it provides global, rather then local, information about the function. For example, for a single function (), one prox oracle call (with ) is sufficient for exact optimization. Several methods have recently been suggested for optimizing a sum or average of several functions using prox accesses to each component, both in the distributed setting where each components might be handled on a different machine (e.g. ADMM , DANE , DISCO ) or for functions that can be decomposed into several “easy” parts (e.g. PRISMA ). But as far as we are aware, no meaningful lower bound was previously known on the number of prox oracle accesses required even for the average of two functions ().
The optimization of composite objectives of the form (1) has also been extensively studied in the context of minimizing empirical risk over samples. Recently, stochastic methods such as SDCA , SAG , SVRG , and other variants, have been presented which leverage the finite nature of the problem to reduce the variance in stochastic gradient estimates and obtain guarantees that dominate both batch and stochastic gradient descent. As methods with improved complexity, such as accelerated SDCA , accelerated SVRG, and Katyusha , have been presented, researchers have also tried to obtain lower bounds on the best possible complexity in this settings—but as we survey below, these have not been satisfactory so far.
In this paper, after briefly surveying methods for smooth, composite optimization, we present methods for optimizing non-smooth composite objectives, which show that prox oracle access can indeed be leveraged to improve over methods using merely subgradient access (see Section 3). We then turn to studying lower bounds. We consider algorithms that access the objective only through the oracle and provide lower bounds on the number of such oracle accesses (and thus the runtime) required to find -suboptimal solutions. We consider optimizing both Lipschitz (non-smooth) functions and smooth functions, and guarantees that do and do not depend on strong convexity, distinguishing between deterministic optimization algorithms and randomized algorithms. Our upper and lower bounds are summarized in Table 1.
As shown in the table, we provide matching upper and lower bounds (up to a log factor) for all function and algorithm classes. In particular, our bounds establish the optimality (up to log factors) of accelerated SDCA, SVRG, and SAG for randomized finite-sum optimization, and also the optimality of our deterministic smoothing algorithms for non-smooth composite optimization.
For non-smooth functions, we show that having access to prox oracles for the components can reduce the polynomial dependence on from to , or from to for -strongly convex functions. However, all of the optimal complexities for smooth functions can be attained with only component gradient access using accelerated gradient descent (AGD) or accelerated SVRG. Thus the worst-case complexity cannot be improved (at least not significantly) by using the more powerful prox oracle.
On the power of randomization
We establish a significant gap between deterministic and randomized algorithms for finite-sum problems. Namely, the dependence on the number of components must be linear in for any deterministic algorithm, but can be reduced to (in the typically significant term) using randomization. We emphasize that the randomization here is only in the algorithm—not in the oracle. We always assume the oracle returns an exact answer (for the requested component) and is not a stochastic oracle. The distinction is that the algorithm is allowed to flip coins in deciding what operations and queries to perform but the oracle must return an exact answer to that query (of course, the algorithm could simulate a stochastic oracle).
Prior Lower Bounds
Several authors recently presented lower bounds for optimizing (1) in the smooth and strongly convex setting using component gradients. Agarwal and Bottou presented a lower bound of . However, their bound is valid only for deterministic algorithms (thus not including SDCA, SVRG, SAG, etc.)—we not only consider randomized algorithms, but also show a much higher lower bound for deterministic algorithms (i.e. the bound of Agarwal and Bottou is loose). Improving upon this, Lan shows a similar lower bound for a restricted class of randomized algorithms: the algorithm must select which component to query for a gradient by drawing an index from a fixed distribution, but the algorithm must otherwise be deterministic in how it uses the gradients, and its iterates must lie in the span of the gradients it has received. This restricted class includes SAG, but not SVRG nor perhaps other realistic attempts at improving over these. Furthermore, both bounds allow only gradient accesses, not prox computations. Thus SDCA, which requires prox accesses, and potential variants are not covered by such lower bounds. We prove as similar lower bound to Lan’s, but our analysis is much more general and applies to any randomized algorithm, making any sequence of queries to a gradient and prox oracle, and without assuming that iterates lie in the span of previous responses. In addition to smooth functions, we also provide lower bounds for non-smooth problems which were not considered by these previous attempts. Another recent observation was that with access only to random component subgradients without knowing the component’s identity, an algorithm must make queries to optimize well. This shows how relatively subtle changes in the oracle can have a dramatic effect on the complexity of the problem. Since the oracle we consider is quite powerful, our lower bounds cover a very broad family of algorithms, including SAG, SVRG, and SDCA.
Our deterministic lower bounds are inspired by a lower bound on the number of rounds of communication required for optimization when each is held by a different machine and when iterates lie in the span of certain permitted calculations . Our construction for is similar to theirs (though in a different setting), but their analysis considers neither scaling with (which has a different role in their setting) nor randomization.
Notation and Definitions
Optimizing Smooth Sums
We briefly review the best known methods for optimizing (1) when the components are -smooth, yielding the upper bounds on the right half of Table 1. These upper bounds can be obtained using only component gradient access, without need for the prox oracle.
We can obtain exact gradients of by computing all component gradients . Running accelerated gradient descent (AGD) on using these exact gradients achieves the upper complexity bounds for deterministic algorithms and smooth problems (see Table 1).
When is not strongly convex, adding a regularizer to the objective and instead optimizing with results in an oracle complexity of . The log-factor in the second term can be removed using the more delicate reduction of Allen-Zhu and Hazan , which involves optimizing for progressively smaller values of , yielding the upper bound in the table.
Katyusha and Catalyst-accelerated SAG or SVRG use only gradients of the components. Accelerated SDCA achieves a similar complexity using gradient and prox oracle access.
Leveraging Prox Oracles for Lipschitz Sums
In this section, we present algorithms for leveraging the prox oracle to minimize (1) when each component is -Lipschitz. This will be done by using the prox oracle to “smooth” each component, and optimizing the new, smooth sum which approximates the original problem. This idea was used in order to apply Katyusha and accelerated SDCA to non-smooth objectives. We are not aware of a previous explicit presentation of the AGD-based deterministic algorithm, which achieves the deterministic upper complexity indicated in Table 1.
The key is using a prox oracle to obtain gradients of the -Moreau envelope of a non-smooth function, , defined as:
Let be convex and -Lipschitz continuous. For any ,
Consequently, we can consider the smoothed problem
Finally, if or , stochastic gradient descent is a better randomized alternative, yielding complexities of or .
Lower Bounds for Deterministic Algorithms
We now turn to establishing lower bounds on the oracle complexity of optimizing (1). We first consider only deterministic optimization algorithms. What we would like to show is that for any deterministic optimization algorithm we can construct a “hard” function for which the algorithm cannot find an -suboptimal solution until it has made many oracle accesses. Since the algorithm is deterministic, we can construct such a function by simulating the (deterministic) behavior of the algorithm. This can be viewed as a game, where an adversary controls the oracle being used by the algorithm. At each iteration the algorithm queries the oracle with some triplet and the adversary responds with an answer. This answer must be consistent with all previous answers, but the adversary ensures it is also consistent with a composite function that the algorithm is far from optimizing. The “hard” function is then gradually defined in terms of the behavior of the optimization algorithm.
To help us formulate our constructions, we define a “round” of queries as a series of queries in which distinct functions are queried. The first round begins with the first query and continues until exactly unique functions have been queried. The second round begins with the next query, and continues until exactly more distinct components have been queried in the second round, and so on until the algorithm terminates. This definition is useful for analysis but requires no assumptions about the algorithm’s querying strategy.
We begin by presenting a lower bound for deterministic optimization of (1) when each component is convex and -Lipschitz continuous, but is not necessarily strongly convex, on the domain . Without loss of generality, we can consider . We will construct functions of the following form:
During each round , the adversary answers queries according to , which depends only on for , i.e. from previous rounds. When the round is completed, is determined and is chosen to be orthogonal to the vectors as well as every point queried by the algorithm so far, thus defining for the next round. In Appendix B.1 we prove that these responses based on are consistent with .
The algorithm can only learn after it completes round —until then every iterate is orthogonal to it by construction. The average of these functions reaches its minimum of at , so we can view optimizing these functions as the task of discovering the vectors —even if only is missing, a suboptimality better than cannot be achieved. Therefore, the deterministic algorithm must complete at least rounds of optimization, each comprising at least queries to in order to optimize . The key to this construction is that even though each term appears in components, and hence has a strong effect on the average , we can force a deterministic algorithm to make queries during each round before it finds the next relevant term. We obtain (for complete proof see Appendix B.1):
Furthermore, we can always reduce optimizing a function over to optimizing a strongly convex function by adding the regularizer to each component, implying (see complete proof in Appendix B.2):
2 Smooth Components
When the components are required to be smooth, the lower bound construction is similar to (6), except it is based on squared differences instead of absolute differences. We consider the functions:
where and are as before. Again, we can answer queries at round based only on for . This construction yields the following lower bounds (full details in Appendix B.3):
In the strongly convex case, we use a very similar construction, adding the term , which gives the following bound (see Appendix B.4):
Lower Bounds for Randomized Algorithms
We now turn to randomized algorithms for (1). In the deterministic constructions, we relied on being able to set and based on the predictable behavior of the algorithm. This is impossible for randomized algorithms, we must choose the “hard” function before we know the random choices the algorithm will make—so the function must be “hard” more generally than before.
Previously, we chose vectors orthogonal to all previous queries made by the algorithm. For randomized algorithms this cannot be ensured. However, if we choose orthonormal vectors randomly in a high dimensional space, they will be nearly orthogonal to queries with high probability. Slightly modifying the absolute or squared difference from before makes near orthogonality sufficient. This issue increases the required dimension but does not otherwise affect the lower bounds.
More problematic is our inability to anticipate the order in which the algorithm will query the components, precluding the use of . In the deterministic setting, if a term revealing a new appeared in half of the components, we could ensure that the algorithm must make queries to find it. However, a randomized algorithm could find it in two queries in expectation, which would eliminate the linear dependence on in the lower bound! Alternatively, if only one component included the term, a randomized algorithm would indeed need queries to find it, but that term’s effect on suboptimality of would be scaled down by , again eliminating the dependence on .
First consider the non-smooth, non-strongly-convex setting and assume for simplicity is even (otherwise we simply let the last function be zero). We define the helper function , which replaces the absolute value operation and makes our construction resistant to small inner products between iterates and not-yet-discovered components:
Next, we define pairs of functions, indexed by :
where are random orthonormal vectors and . With sufficiently small and the dimensionality sufficiently high, with high probability the algorithm only learns the identity of new vectors by alternately querying and ; so revealing all vectors requires at least total queries. Until is revealed, an iterate is -suboptimal on . From here, we show that an -suboptimal solution to can be found only after at least queries are made to at least pairs, for a total of queries. This time, since the optimum will need to have inner product with vectors , we need to have , and the total number of queries is . The term of the lower bound follows trivially since we require , (proofs in Appendix C.1):
An added regularizer gives the result for strongly convex functions (see Appendix C.2):
The large dimension required by these lower bounds is the cost of omitting the assumption that the algorithm’s queries lie in the span of previous oracle responses. If we do assume that the queries lie in that span, the necessary dimension is only on the order of the number of oracle queries needed.
When in the non-strongly convex case or in the strongly convex case, the lower bounds for randomized algorithms presented above do not apply. Instead, we can obtain a lower bound based on an information theoretic argument. We first uniformly randomly choose a parameter , which is either or . Then for , in the non-strongly convex case we make with probability and with probability . Optimizing to within accuracy then implies recovering the bias of the Bernoulli random variable, which requires queries based on a standard information theoretic result . Setting gives a lower bound in the -strongly convex setting. This is formalized in Appendix C.5.
2 Smooth Components
When the functions are smooth and not strongly convex, we define another helper function :
and the following pairs of functions for :
with as before. The same arguments apply, after replacing the absolute difference with squared difference. A separate argument is required in this case for the term in the bound, which we show using a construction involving simple linear functions (see Appendix C.3).
In the strongly convex case, we add the term to and (see Appendix C.4) to obtain:
We consider (1) as a constrained optimization problem, thus the minimizer of could be achieved on the boundary of , meaning that the gradient need not vanish. If we make the additional assumption that the minimizer of lies on the interior of (and is thus the unconstrained global minimum), Theorems 1-8 all still apply, with a slight modification to Theorems 3 and 7. Since the gradient now needs to vanish on , is always -suboptimal, and only values of in the range and result in a non-trivial lower bound (see Remarks at the end of Appendices B.3 and C.3).
Conclusion
We provide a tight (up to a log factor) understanding of optimizing finite sum problems of the form (1) using a component prox oracle.
Randomized optimization of (1) has been the subject of much research in the past several years, starting with the presentation of SDCA and SAG, and continuing with accelerated variants. Obtaining lower bounds can be very useful for better understanding the problem, for knowing where it might or might not be possible to improve or where different assumptions would be needed to improve, and for establishing optimality of optimization methods. Indeed, several attempts have been made at lower bounds for the finite sum setting . But as we explain in the introduction, these were unsatisfactory and covered only limited classes of methods. Here we show that in a fairly general sense, accelerated SDCA, SVRG, SAG, and Katyusha are optimal up to a log factor. Improving on their runtime would require additional assumptions, or perhaps a stronger oracle. However, even if given “full” access to the component functions, all algorithms that we can think of utilize this information to calculate a prox vector. Thus, it is unclear what realistic oracle would be more powerful than the prox oracle we consider.
Our results highlight the power of randomization, showing that no deterministic algorithm can beat the linear dependence on and reduce it to the dependence of the randomized algorithms.
In studying finite sum problems, we were forced to explicitly study lower bounds for randomized optimization as opposed to stochastic optimization (where the source of randomness is the oracle, not the algorithm). Even for the classic problem of minimizing a smooth function using a first order oracle, we could not locate a published proof that applies to randomized algorithms. We provide a simple construction using -insensitive differences that allows us to easily obtain such lower bounds without reverting to assuming the iterates are spanned by previous responses (as was done, e.g., in ), and could potentially be useful for establishing randomized lower bounds also in other settings.
We thank Ohad Shamir for his helpful discussions and for pointing out .
References
Appendix A Upper bounds for non-smooth sums
Consider the case where the components are not strongly convex. As shown in lemma 1, we can use a single call to a prox oracle to obtain the gradient of
which is a -smooth approximation to . We then consider the new optimization problem:
When the component functions are -strongly convex, a more sophisticated strategy is required to avoid an extra factor. The solution is the AdaptSmooth algorithm . This involves solving smooth and strongly convex subproblems, where the subproblem is reducing the suboptimality of the -smooth and -strongly convex function by a factor of four, where and where upper bounds the initial suboptimality. Using this method results in an -suboptimal solution for after queries to .
In the case of AGD, Time and
Appendix B Lower bounds for deterministic algorithms
Without loss of generality, we can assume . For particular values and to be decided upon later, we use the functions (6):
For non-smooth functions, the subgradient oracle is not uniquely defined—-many different subgradients might be a valid response. However, in order to say that an algorithm successfully optimizes a function, it must be able to do so no matter which subgradient is receives. Conversely, to show a lower bound, it is sufficient to show that for some valid subgradient the algorithm fails. And so, in constructing a “hard” instance to optimize we are actually constructing both a function and a subgradient oracle for it, with specific subgradient responses. Therefore, answering the algorithm’s queries during round according to is valid so long as the subgradient we return is a valid subgradient for (the converse need not be true) and the prox returned is exactly the prox of . For now, assume that this query-answering strategy is consistent (we will prove this last).
Then if and if is an iterate generated both before completes round and before it makes queries to (so that the dimension is large enough to orthogonalize each as described above), then by construction. This allows us to bound the suboptimality of (since functions are queried during each round, ):
is non-negative and where . Choosing makes so that . Therefore, achieves its minimum on and
Where the final inequality holds when . Setting implies . Therefore, must either query more than times or complete rounds to reach an -suboptimal solution. Completing each round requires at least queries to , so when , this implies a lower bound of
For any and any such that , if function is queried during round , then .
All subgradients of have the form
where we define . Since function is queried during round , , and since for all , contains all subgradients of the form
which is exactly . ∎
For any and any such that , if function is queried during round then , .
Consider the definition of the prox oracle from equation 3
Next, we further decompose where
Note that and since function is queried during round , . Therefore,
The last equality follows from the fact that that the minimization is completely separable between and , allowing us to minimized over each variable separately. The terms containing are non-negative and can be simultaneously equal to when . Therefore, . ∎
These lemmas show that the subgradients and proxs of at vectors which are queried during round are consistent with the subgradients and proxs of . This confirms that our construction is sound, and completes the proof. ∎
B.2 Non-smooth and strongly convex components
By assumption, can find an such that using queries to , and
B.3 Smooth and not strongly convex components
This proof will be very similar to the proof of theorem 1. Without loss of generality, we can assume that . For a values and to be fixed later, we define:
We will assume for now that this query-answering strategy is self-consistent, and prove it later. This allows us to bound the suboptimality of . Note that since exactly functions are queried each round, , so let
Then if , and if is an iterate generated both before completes round and before it makes queries to , then for all by construction. Then, for this , . By first order optimality conditions for , its optimum must satisfy that:
It is straightforward to confirm that the solution to this system of equations is
Thus, we set , ensuring so that . Furthermore, for the iterate made before rounds of queries,
where the last inequality holds as long as . So, when and we let , this ensures that
and therefore, must complete at least rounds or make more than queries to in order to reach an -suboptimal point. This implies a lower bound of
For any and any such that , if function is queried during round , then .
Since function is queried during round , so
For any and any such that , if function is queried during round then , .
Up until the last step, this proof is identical to the proof of lemma 3, thus we pick up at (14):
The final step comes from the fact that the is separable over and , meaning we can minimize the two terms individually. The terms which contain are non-negative and equal to zero when . ∎
These lemmas show that the gradient and prox of at vectors which are queried during round are consistent with the gradient and prox of . This confirms that our construction is sound.
This proves the lower bound for , we can extend the same lower bound to using the following, very simple construction. Let
where is a unit vector that is orthogonal to all of the first queries. This function is trivially -smooth. By construction, the algorithm must make at least queries to learn the identity of . Until it has done so, any iterate will have objective value zero, while the optimum . Therefore, the algorithm must make at least queries to reach an -suboptimal solution. For
Therefore a lower bound of applies for any . ∎
If make the additional assumption that is minimized on the interior of , since is -suboptimal, only gives a non-trivial lower bound. This lower bound is shown by the first construction presented in the previous proof.
B.4 Smooth and strongly convex components
We will prove the theorem for -smooth and -strongly convex components for any . This can be extended to arbitrary constants and by taking .
For any and for and to be defined later, let
where the vectors and indicators are defined in the same way as in the previous proof. This function is just a multiple of the construction in the proof of theorem 3 plus the term. It is clear that the norm term is uninformative for learning the identity of vectors , as the component of the gradients and proxs which is due to that term is simply a scaling of the query point. Thus, for any iterate generated by before completing rounds of optimization, for all so long as the dimension is greater than the total number of queries made to so far plus ; a fact which follows directly from the previous proof.
Since exactly functions are queried per round, and thus
By the first order optimality conditions for , its optimum must satisfy that:
Defining and setting , it is straightforward to confirm that
Thus, , so by choosing appropriately, we can make the initial suboptimality of our construction take any value . Since is -strongly convex, for any , . Let be an iterate which is generated before rounds of optimization have been completed, implying that for all . So
If we set then
and when
Therefore, the algorithm must complete at least rounds of queries before it can reach an -suboptimal point. If and , since each round includes at least oracle queries, this implies a lower bound of
queries to in order to reach an -suboptimal point. ∎
Appendix C Lower bounds for randomized algorithms
We thank Yair Carmon, John Duchi, Oliver Hinder, and Aaron Sidford for pointing out an issue with the original proofs in this section. We have since modified this section in a manner resembling .
To prove the deterministic lower bounds, we constructed vectors adversarially, orthogonalizing them to queries made by the algorithm. In the randomized setting, this is impossible, as we cannot anticipate query points. Our solution was to instead draw a set of “important directions” randomly in high dimensions. The intuition is that a given vector, in this case the query made by the algorithm, will have a very small inner product with a random unit vector with high probability if the dimension is large enough.
Using this fact, we construct helper functions and to replace the absolute and squared difference functions used in the deterministic lower bounds. These functions are both flat at 0 on the interval , meaning that the algorithm’s query needs to have a significant inner product with before the oracle needs to give that vector away as a gradient or prox. We will show that each of our constructions satisfy the following property:
For all , all and such that , and such that
and such that , , , and are all a deterministic function of and .
In other words, when has a small inner product with for all , then querying either or at will reveal at most . Our bounds on the complexity of optimizing our functions are based on the principle that the algorithm can only learn one per query, so we need to control the probability that the premises of this property hold for every query made by the algorithm. In this section, we will bound how large the dimensionality of the problem needs to be to ensure that with high probability, only one vector is revealed to the algorithm by each oracle response.
For , we will define pairs of component functions and that satisfy Property 1. If is odd, we set and lose only a factor of in our lower bound, therefore, we proceed assuming is even.
The proofs will proceed by showing that optimizing is equivalent to finding an with a significant inner product with a large number of the vectors . This property depends on the specific constructions and will be discussed in detail later.
First, we will show generally that when the dimension is sufficiently large, Property 1 ensures each oracle access will “reveal” at most a single vector with high probability over the randomness in the draw of the vectors . As a consequence of this fact, accesses will be needed to optimize .
We will prove the lower bound on the number of oracle accesses needed for an arbitrary deterministic optimization algorithm to optimize our random finite sum instance, and then apply Yao’s minimax principle in order to prove the lower bound for randomized optimization algorithms.
We denote the query made by the algorithm , which is a query to function at the point with the prox parameter (if applicable). For now, we require that s.t. for all ; this assumption will be removed in the individual lower bound proofs. Recalling that we are considering for now deterministic optimization algorithms, the query is a function of the previous queries and the oracle’s responses to those queries.
Let denote the number of times that the functions or were queried before time . Define . These are the set of vectors that are supposed to be “known” to the optimization algorithm at time . Also, let . These are the set of vectors that are supposed to be “unknown” to the algorithm.
Let be the set of query points up to time .
Let , and let be its orthogonal complement. Let be the orthogonal projection of a vector onto the subspace , and let be its orthogonal projection onto .
We would like to show that the following event occurs with high probability over the draw of the vectors :
Here, is a constant that will be determined later. This event indicate that the premises of Property 1 are satisfied, which will be used to prove the lower bounds in the next sections.
where for some . The events are useful because:
This proof closely follows [8, Lemma 9] and [20, Lemma 1].
Let denote . We will establish that for each , . To begin, we rewrite
First, we decomposed and used . Next, we used the Cauchy-Schwarz inequality and the self-adjointness of . Finally, we used the non-expansivity of and applied the definition of .
In order to bound the first term in (20), we will prove by induction that for and , . The base case is trivial as is the projection of onto .
For the inductive step, fix and , and assume the statement holds for . Let project onto (this includes in contrast with ), and let project onto the orthogonal subspace.
Denote , which is the vector that “becomes known” after time . By definition, the set
spans . Therefore, the Gram-Schmidt vectors
form an orthonormal basis for (after ignoring any zero vectors that may arise from the projection). We now write in terms of this orthonormal basis:
We now bound the second term of (24). Note that is orthogonal to . Consider
Here, we used the triangle and Cauchy-Schwarz inequalities. Since , by the inductive hypothesis
Furthermore, looking at the denominator of (24)
The inequality uses the inductive hypothesis and . Since and , , so this quantity is at least .
Combining this and (31) with (24), we conclude
We conclude that for all and , .
Since this holds for any and , we conclude . ∎
With Lemma 6 having been proven, we now observe the following corollary:
For any , let be the output of the gradient or prox oracle at time . Then .
By Lemma 6, implies . Therefore, Property 1 immediately completes the proof. ∎
By Lemma 6 and the chain rule of probability,
We address a single term in the product using the following lemma:
This proof closely follows [8, Lemma 11] and [20, Lemma 3]
Fix . We will start by proving the density is invariant to a rotations that preserve . To that end, let be any rotation such that for all . Then
Finally, for any ,
Therefore, .
As the base case, consider . Since depends only on the vectors and the first iterate , which is determined before making any oracle accesses and is thus independent of the vectors . Consequently,
Furthermore, by Corollary 1, if holds, the first oracle response is in the span of the first query and , neither of which is affected by the rotation . Therefore, the second queries, which are a deterministic function of the identical first queries and oracle responses, are equal .
Again by the inductive hypothesis, if holds then . Thus, the projections and (the projection that arises when ) are equal because all of the first queries are identical, and the rotation does not affect . Consequently,
This establishes that and thus the distribution of is invariant to rotations that preserve .
As a consequence, for any , and have the same distribution. Since preserves and length, we conclude and also have the same distribution, and thus is distributed uniformly on the unit sphere in conditional on and .
For each inner product (61), the first term is a fixed quantity conditioned on and . The conditional distribution of the second vector, , as argued above, is uniform on . Thus, each term of the sum is the inner product of a fixed unit vector and a uniformly random unit vector in dimensions.
This probability that the inner product is at least is proportional to the surface area of the “end caps” of a unit sphere lying above and below circles of radius . This surface area, in turn, is less than the surface area of a sphere of radius . Therefore, for a given
For any
Together, Lemma 8 and Property 1 allow us to ensure that any algorithm can only learn one important vector per query with high probability as long as the dimension is large enough. What is left is to show that Property 1 holds for each of our constructions and to bound the suboptimality of any iterate that has small inner product with the vectors in .
We first consider the Lipschitz and non-strongly convex setting and prove theorem 5: See 5 Without loss of generality, we let . As shown in Equations (9) and (10), we define
and for values , , and to be fixed later we define pairs of functions, indexed by :
Assume for now that is even. If is odd, then we simply set one of the functions to and the oracle complexity is reduced by a factor proportional to .
At the end of this proof, we will show that the functions satisfy Property 1. Since the domain, and therefore the queries made to the oracle are bounded by , Property 1 and Lemma 8 ensure that when the dimension is at least , for iterate generated after oracle queries, for no more than vectors with probability .
We now bound the suboptimality of for any where .
It is straightforward to confirm that this function is minimized when for all . Since this is also true for every , is minimized at . In order that so that , we set . Thus,
Therefore, we set and so that
This ensures that if for at least ’s, then cannot be -suboptimal for . Therefore, after queries in dimension
then with probability . When , must make at least
queries with probability . Finally, we prove that Property 1 holds for our construction:
First we prove the properties about the gradients:
Consider the case when is odd. From (9), it is clear that when . Furthermore, for , . Therefore, any subgradient and can be expressed as
where can take any value in the range $\psi_{c}^{\prime}\psi_{c}\partial f_{i,1}(x)\subseteq\text{span}\left\{v_{i,0},...,v_{i,t-1}\right\}\partial f_{i,2}(x)\subseteq\text{span}\left\{v_{i,1},...,v_{i,t}\right\}t$ is even follows the same line of reasoning.
We now prove the properties about the proxs:
Since each pair of functions operates on a separate -dimensional subspace, it will be useful to decompose vectors into where and . First, note that
(and similarly for ). From there, the proof is similar to the proof of lemma 3. First, consider the function and let be the smallest even number which is not smaller than . It will be convenient to further decompose vectors into where and . So
Since for all , for , which implies that . Therefore, the objective of the second is non-negative and is equal to zero when so
Therefore, when is even, and , and when is odd, and . A very similar line of reasoning can be used to show the statement for . ∎
As was mentioned before, Lemma 8 applies when the norm of every query point is bounded by . Since all points in the domain of the optimization problem have norm bounded by , this is not problematic. However, we can slightly modify our construction to make optimizing hard even for algorithms that are allowed to query outside of the domain.
We could redefine our functions as follows:
is still continuous, and -Lipschitz, and it also has the property that it behaves exactly like on -ball. However, querying the oracle of outside of the -ball gives no more information about the function than querying at . In fact, an algorithm that was only allowed to query within the -ball would be able to simulate the oracle of . Therefore, since the algorithm that is not allowed to query at large vectors cannot optimize quickly, and it could simulate queries with unbounded norm, it follows that querying with unbounded norm cannot improve the rate of convergence. This fact is needed in the proof of Theorem 6 below.
C.2 Non-smooth and strongly convex components
We now prove Theorem 6 using a reduction from the Lipschitz and non-strongly convex setting: See 6
By assumption, can find an such that using queries to , and
C.3 Smooth and not strongly convex components
See 7 Without loss of generality, we can assume that . We will first consider the case where and prove that must make queries to . Afterwards, we will show a lower bound of in the large- regime where that term dominates.
The function construction in this case is very similar to the non-smooth randomized construction. As in Equation (11)
The key properties of this function for this proof are that it is convex, everywhere differentiable and -smooth, and when , the function is constant at 0. It is also useful to note that
As in Equation (12), for values and to be fixed later, we define the pairs of functions for :
At the end of this proof, we will show that the functions satisfy Property 1. Since the domain, and therefore the queries made to the oracle are bounded by , Property 1 and Lemma 8 ensure that when the dimension is at least then after oracle queries, for no more than vectors with probability .
Now, we will bound the suboptimality of at an iterate such that for all . From the definition of :
and note that in the proof of Theorem 3 we already showed that that the optimum of is achieved at
Therefore, setting ensures that . It is not necessarily true that , but it serves as an upper bound on the optimum.
Let and consider an iterate generated by before it makes queries to the functions and . When for all ,
where the last inequality holds as long as . When , setting and , ensures that
Therefore, if for all is true for at least of the ’s, then cannot be -suboptimal for . So, for in dimension
with probability , the algorithm must make at least queries in order to reach an -suboptimal point. This gives a lower bound of
which holds with probability . To complete the first half of the proof, we prove that Property 1 holds for this construction:
First we prove the properties about gradients:
Consider the case when is odd. From equation 79, we can see that when . Furthermore, for , . We can therefore express the gradients:
It is clear from these expressions that and . The proof for the case when is even follows the same line of reasoning.
Now, we prove the properties about proxs:
We follow the same line of reasoning as in the Lipschitz and non-strongly convex case. The only necessary addition is to show, that when is the smallest even number which is not smaller than and and , then :
This same reasoning applies for or odd . ∎
So far, we have shown a lower bound of when . We now show a lower bound of for all , which accounts for the first term in the lower bound which dominates when . Consider the -smooth functions
for any constant , and where the orthonormal vectors are randomly chosen as before. reaches its minimum on the unit ball at
Therefore, by simply choosing , we ensure that such a point is at least -suboptimal, completing the proof for all .
As noted above, the queries made to the oracle must be bounded for Lemma 8. Since the domain of is the -ball, this is easy to satisfy. If we want to ensure that our construction is still hard to optimize, even if the algorithm is allowed to query arbitrarily large vectors, then we can modify our construction by defining a new function in terms of its gradient
This function is continuous and smooth, and also has the property that querying the oracle at a point outside of the -ball is cannot be more informative than querying at . That is, an algorithm that is not allowed to query outside the -ball can simulate such queries using its restricted oracle. Since this restricted algorithm cannot optimize quickly, but can still calculate the oracle outputs that it would have recieved by querying large vectors, it follows that an unrestricted algorithm could not optimize this function quickly either.
Another variant of (1) that one might consider is an unconstrained optimization problem, where we assume that the minimizer of lies on the interior of that ball. In other words, we could consider a version of (1) where the gradient of must vanish on the interior of .
In this case, there is little reason to consider any larger than , since always (by smoothness ). Consequently, when there is a trivial upper bound of zero oracle queries, as just returning the zero vector guarantees -suboptimality. We can construct functions so that Theorem 7 still applies for . In the previous proof, the first construction is still valid in the unconstrained case since the minimizer lies within the unit ball. For the term, consider the -smooth functions (assume w.l.o.g. that )
Therefore, if fewer than functions have been queried, then with probability at least :
This proves a lower bound of for .
C.4 Smooth and strongly convex components
In the smooth and strongly convex case, we cannot use the same simple reduction that was used to prove Theorem 6. Using that construction, we would be able to show a lower bound of , but would not be able to show any dependence on , so the lower bound would be loose. Instead, we will use an explicit construction similar to the one used in Theorem 7. See 8
We will prove the theorem for a -smooth, -strongly convex problem, for , which can be generalized by scaling.
As in the proof for the non-strongly convex case, we introduce the 4-smooth helper function
These functions also have Property 1, but we will omit the proof, as it follows directly from the proof in Appendix C.3. Intuitively, the squared norm reveals no new information about the vectors besides what is already included in the query point .
When all of the queries are bounded by , Property 1 along with Lemma 8 ensures that when , after the algorithm make queries for at most of the vectors with probability . For this to apply, we need that all of the queries made by the algorithm are within a -ball around the origin. We know that , and by strong-convexity , therefore, . Since the optimum point must lie in the -ball around the origin, we will restrict the algorithm to query only at points within the -ball. At the end of the proof, we will show that with a small modification to the functions outside of the -ball, querying at vectors of large norm cannot help the algorithm.
Now it remains to lower bound the suboptimality of the pair and at an iterate which is nearly orthogonal to all vectors for :
In order to bound the suboptimality of a pair of functions , it will be convenient to bundle up all of the terms which affect the value of from all of the component functions. Most of those terms are contained in and , however, terms in each of the other components also affect the value of . For each , consider the projection operator which projects a vector onto the subspace spanned by , and projecting onto the space orthogonal to for all . Now decompose
then when
We have already showed in Appendix B.4 that if , and if
Therefore, if at least of the pairs it holds that for , then
As a consequence of this, when the dimension is sufficiently large, with probability the optimization algorithm must make at least queries to each of at least pairs of functions in order to reach an -suboptimal solution in expectation. So, when
The same argument as was used in the discussion after theorem 7 to show the term of the lower bound can be used here, as the function in that construction was both smooth and strongly convex.
As mentioned above, Lemma 8 requires that the norm of all query points be bounded by . We argued above that the optimum of must lie within the -ball around the origin. Even so, we can slightly modify our construction to show that even if the algorithm were allowed to query arbitrarily large points, it still would not be able to optimize quickly. Define through its gradient as:
This new function is continuous, -smooth, and -strongly convex, and it also has the property that querying the function at a point outside the -ball, it is no more informative than querying at . That is, an algorithm that was not allowed to query outside the -ball could simulate the result of such queries. Since that restricted algorithm can’t optimize well, as proven above, another algorithm which could query at arbitrary points, therefore could not either. ∎
C.5 Non-smooth components when ϵitalic-ϵ\epsilon is large
Therefore, until the algorithm has made enough queries so that
the expected suboptimality is greater than . By a standard information theoretic result , achieving that probability of success at predicting the sign of implies a comparable level of accuracy at distinguishing between and , and that requires at least queries to . ∎
It is straightforward to show a lower bound of for strongly convex functions using the same reduction by regularization as in the proofs of theorems 2 and 6. We also note that this lower bound implies a lower bound of for smooth functions, whether strongly convex or not. Each function in this construction is linear, and therefore is trivially -smooth. We make the gradient of each function arbitrarily large by multiplying each by a large number. As the multiplier grows, the algorithm need be more and more certain of the sign of in order to achieve a expected suboptimality of less than . Thus for a sufficiently large multiplier, the algorithm must query functions. We cannot force it to query more than that, of course, since it only needs to query functions to know the sign of with probability 1.