Dimension-Free Iteration Complexity of Finite Sum Optimization Problems
Yossi Arjevani, Ohad Shamir
Introduction
Many machine learning tasks reduce to Finite Sum Minimization (FSM) problems of the form
where are -smooth and -strongly convex. In recent years, a major breakthrough was made when a linear convergence rate was established for this setting (SAG and SDCA ), and since then, many methods have been developed to achieve better convergence rate. However, whereas a large body of literature is devoted for upper bounds, the optimal convergence rate with respect to the problem parameters is not quite settled.
Let us discuss existing lower bounds for this setting, along with their shortcomings, in detail. One approach to obtain lower bounds for this setting is to consider the average of carefully handcrafted functions defined on disjoint sets of variables. This approach was taken by Agarwal and Bottou who derived a lower bound for FSM under the first-order oracle model (see Nemirovsky and Yudin ). In this model, optimization algorithms are assumed to access a given function by issuing queries to an external first-order oracle procedure. Upon receiving a query point in the problem domain, the oracle reports the corresponding function value and gradient. The construction used by Agarwal and Bottou consisted of different quadratic functions which are adversarially determined based on the first-order queries being issued during the optimization process. The resulting bound in this case does not apply to stochastic algorithms, rendering it invalid for current state-of-the-art methods. Another instantiation of this approach was made by Lan who considered disjoint copies of a quadratic function proposed by Nesterov in [13, Section 2.1.2]. This technique is based on the assumption that any iterate generated by the optimization algorithm lies in the span of previously acquired gradients. This assumption is rather permissive and is satisfied by many first-order algorithms, e.g., SAG and SAGA . However, the lower bound stated in the paper faces limitations in a few aspects. First, the validity of the derived bound is restricted to iterations. In many datasets, even if are very large, is quite small. Accordingly, the admissible regime of the lower bound is often not very interesting. Secondly, it is not clear how the proposed construction can be expressed as a Regularized Loss Minimization (RLM) problem with linear predictors (see Section 4). This suggests that methods specialized in dual RLM problems, such as SDCA and accelerated proximal SDCA , can not be addressed by this bound. Thirdly, at least the formal theorem requires assumptions (such as querying in the span of previous gradients, or sampling from a fixed distribution over the individual functions), which are not met by some state-of-the-art methods, such as coordinate descent methods, SVRG and without-replacements sampling algorithms .
In this work, building upon the framework of oblivious algorithms, we take a somewhat more abstract point of view which allows us to easily incorporate coordinate-descent methods, as well as stochastic algorithms. Our framework subsumes the vast majority of optimization methods for machine learning problems, in particular, it applies to SDCA, accelerated proximal SDCA, SDCA without duality , SAG, SAGA, SVRG and acceleration schemes ), as well as for a large number of methods for smooth convex optimization (i.e., FSM with ), e.g., (stochastic) Gradient descent (GD), Accelerated Gradient Descent (AGD, ), the Heavy-Ball method (HB, ) and stochastic coordinate descent.
Under this structural assumption, we derive lower bounds for FSM (1), according to which the iteration complexity, i.e., the number of iterations required to obtain an -optimal solution in terms of function value, is at leastFollowing standard conventions, here tilde notation hides logarithmic factors in the parameters of a given class of optimization problems, e.g., smoothness parameter and number of components.
where denotes the condition number of (that is, the smoothness parameter over the strong convexity parameter). To the best of our knowledge, this is the first tight lower bound to address all the algorithms mentioned above. Moreover, our bound is dimension-free and thus apply to settings in machine learning which are not covered in the current literature (e.g., when is ). We also derive a dimension-free nearly-optimal lower bound for smooth convex optimization of
which holds for any oblivious stochastic first-order algorithm. It should be noted that our lower bounds remain valid under any source of randomness which may be introduced into the optimization process (by the oracle or by the optimization algorithm). In particular, our bounds hold in cases where the variance of the iterates produced by the algorithm converges to zero, a highly desirable property of optimization algorithms in this setting.
Two implications can be readily derived from this lower bound. First, obliviousness forms a real barrier for optimization algorithms, and whereas non-oblivious algorithms may achieve a super-linear convergence rate at latter stages of the optimization process (e.g., quasi-newton), or practically zero error after iterations (e.g. Center of Gravity method, MCG), oblivious algorithms are bound to linear convergence indefinitely, as demonstrated by Figure 1. We believe that this indicates that a major progress can be made in solving machine learning problems by employing non-oblivious methods for settings where . It should be further noted that another major advantage of non-oblivious algorithms is their ability to obtain optimal convergence rates without an explicit specification of the problem parameters (e.g., [5, Section 4.1]).
Secondly, many practitioners have noticed that oftentimes sampling the individual functions without replacement at each iteration performs better than sampling with replacement (e.g., , see also ). The fact that our lower bound holds regardless of how the individual functions are sampled and is attained using with-replacement sampling (e.g., accelerated proximal SDCA), implies that, in terms of iteration complexity, one should expect to gain no more than log factors in the problem parameters when using one method over the other (it is noteworthy that when comparing with and without replacement samplings, apart from iteration complexity, other computational resources, such as limited communication in distributed settings , may significantly affect the overall runtime).
Framework
Due to difficulties which arise when studying the complexity of general optimization problems under discrete computational models, it is common to analyze the computational hardness of optimization algorithms by modeling the way a given algorithm interacts with the problem instances (without limiting its computational resources). In the seminal work of Nemirovsky and Yudin , it is shown that algorithms which access the function at hand exclusively by querying a first-order oracle require at least
oracle calls to obtain an -optimal solution (note that, here and throughout this section we refer to FSM problems with ). This lower bound is tight and its dimension-free part is attained by Nesterov’s well-known accelerated gradient descent, and by MCG otherwise. The fact that this approach is based on information considerations alone is very appealing and renders it valid for any first-order algorithm. However, discarding the resources needed for executing a given algorithm, in particular the per-iteration cost (in time and space), the complexity boundaries drawn by this approach are too crude from a computational point of view. Indeed, the per-iteration cost of MCG, the only method known with oracle complexity of , is excessively high, rendering it prohibitive for high-dimensional problems.
2 Definitions
When considering lower bounds one must be very precise as to the scope of optimization algorithms to which they apply. Below, we give formal definitions for oblivious stochastic CLI optimization algorithms and iteration complexity (which serves as a crude proxy for their computational complexity).
A class of optimization problems is an ordered triple , where is a family of functions defined over some linear space designated by , is the side-information given prior to the optimization process and is a suitable oracle parametrized by some parameters set , i.e., an external procedure which upon receiving and , returns some .
For example, in FSM, contains functions as defined in (1), the side-information contains the smooth parameter , the strong convexity parameter and the number of components (although it carries a crucial effect on the iteration complexity, e.g., , in this work, we shall ignore the side-information and assume that all the parameters of the class are given). We shall assume that both first-order and coordinate-descent oracles (see 10,11 below) are allowed to be used during the optimization process. Formally, this is done by introducing an additional parameter which indicates which oracle is being addressed. This added degree of freedom does not violate our lower bounds.
We now turn to rigorously define CLI optimization algorithms. Note that, compared with the definition of first-order -CLIs provided in , here, in order to handle coordinate-descent and first-order oracles in a unified manner, we base our formulation on general oracle procedures.
An optimization algorithm is called a Canonical Linear Iterative (CLI) optimization algorithm over a class of optimization problems , if given an instance and initialization points , where is some index set, it operates by iteratively generating points such that for any ,
Note that assigning different weights to different terms in (4) can be done through (e.g., oracle 10 below). This allows a succinct definition for obliviosity. Lastly, we define iteration complexity.
The iteration complexity of a given CLI w.r.t. a given problem class is defined to be the minimal number of iterations such that
where the expectation is taken over all the randomness introduced into the optimization process (choosing merely serves as a convention and is not necessary for our bounds to hold).
3 Proof Technique - Deriving Lower Bounds via Approximation Theory
Consider the following parametrized class of -smooth and -strongly convex optimization problems,
Clearly, the minimizer of are , with norm bounded by . For simplicity, we will consider a special case, namely, vanilla gradient descent (GD) with step size , which produces new iterates as follows
Setting the initialization point to be , we derive an explicit expression for :
It turns our that each forms a univariate polynomial whose degree is at most . Furthermore, since are -smooth -strongly convex for any , standard convergence analysis for GD (e.g., , Theorem 2.1.14) guarantees that , where denotes the condition number. Substituting Equation (6) for yields
Thus, we see that the faster the convergence rate of a given optimization algorithm is, the better the induced sequence of polynomials approximate w.r.t. the maximum norm over . In Fig. 2, we compare the first 4 polynomials induced by GD and AGD. Not surprisingly, AGD polynomials approximates better than those of GD.
Now, one may ask, assuming that iterates of a given optimization algorithm for (5) can be expressed as polynomials whose degree does not exceed the iteration number, just how fast can these iterates converge to the minimizer? Since the convergence rate is bounded from below by , we may address the following question instead:
where denotes the set of univariate polynomials whose degree does not exceed . Problem (7) and other related settings are main topics of study in approximation theory. Accordingly, our technique for proving lower bounds makes an extensive use of tools borrowed from this area. Specifically, in a paper from 1899 Chebyshev showed that
by which we derive the following theorem (see Appendix A.1 for a detailed proof).
In the following sections, we apply oblivious CLI on various parameterized optimization problems so that the resulting iterates are polynomials in the problem parameters. We then apply arguments similar to the above
A similar reduction, from optimization problems to approximation problems, was used before in a few contexts to analyze the iteration complexity of deterministic CLIs (e.g., [5, Section 3], see also Conjugate Gradient convergence analysis ). But, what if we allow random algorithms? should we expect the same iteration complexity? To answer this, we use Yao’s minimax principle according to which the performance of a given stochastic optimization algorithm w.r.t. to its worst input are bounded from below by the performance of the best deterministic algorithm w.r.t. distributions over the input space. Thus, following a similar reduction one can show that the convergence rate of stochastic algorithms is bounded from below by
That is, a lower bound for the stochastic case can be attained by considering an approximation problem w.r.t. weighted with the uniform distribution over . Other approximation problems considered in this work involve -norm and different distributions. We provide a schematic description of our proof technique in Scheme 2.1.
Lower Bound for Finite Sums Minimization Methods
Having described our analytic approach, we now turn to present some concrete applications, starting with iteration complexity lower bounds in the context of FSM problems (1). In what follows, we derive a lower bound on the iteration complexity of oblivious (possibly stochastic) CLI algorithms equipped with first-order and coordinate-descent oracles for FSM. Strictly speaking, we focus on optimization algorithms equipped with both generalized first order oracle,
where denotes the ’th unit vector. We remark that coordinate-descent steps w.r.t. partial gradients can be implemented using (10) by setting to be some principal minor of the unit matrix.It should be further noted that our results below hold for scenarios where the optimization algorithm is free to call a different oracle at different iterations.
It is easy to verify that the minimizers of (12) are
We would like to show that the coordinates of the iterates of deterministic oblivious CLIs, which minimize using first-order and coordinate-descent oracles, form multivariate polynomials in of total degrees (the maximal sum of powers over all the terms) which does not exceed the iteration number. Indeed, if the coordinates of are multivariate polynomial in of total degree at most , then the coordinates of the vectors returned by both oracles
are multivariate polynomials of total degree of at most , as all the parameters ( and ) do not depend on (due to obliviosity) and the rest of the terms ( and ) are either linear in or constants. Now, since the next iterates are generated simply by summing up all the oracle answers, they also form multivariate polynomials of total degree of at most . Thus, denoting the first coordinate of by and using Inequality (8), we get the following bound
where designates a constant which does not depend on (but may depend on the problem parameters). Lastly, this implies that for any deterministic oblivious CLI and any iteration number, there exists some such that the convergence rate of the algorithm, when applied on , is bounded from below by Inequality (27). We note that, as opposed to other related lower bounds, e.g., , our proof is non-constructive. As discussed in subsection 2.3, this type of analysis can be extended to stochastic algorithms by considering (26) w.r.t. other norms such as weighted -norm. We now arrive at the following theorem whose proof, including the corresponding logarithmic factors and constants, can be found in Appendix A.2.
The iteration complexity of oblivious (possibly stochastic) CLIs for FSM (1) equipped with first-order (10) and coordinate-descent oracles (11), is bounded from below by
The lower bound stated in Theorem 2 is tight and is attained by, e.g., SAG combined with an acceleration scheme (e.g., ). Moreover, as mentioned earlier, our lower bound does not depend on the problem dimension (or equivalently, holds for any number of iterations, regardless of and ), and covers coordinate descent methods with stochastic or deterministic coordinate schedule (in the special case where , this gives a lower bound for minimizing smooth and strongly convex functions by performing steepest coordinate descent steps). Also, our bound implies that using mini-batches for tackling FSM does not reduce the overall iteration complexity. Lastly, it is noteworthy that the term in the lower bound above holds for any algorithm accompanied with an incremental oracle, which grants access to at most one individual function each time.
We also derive a nearly-optimal lower bound for smooth non-strongly convex functions for the more restricted setting of and first-order oracle. The parameterized subset of functions we use (see Scheme 2.1) is . The corresponding minimizer (as a function of ) is , and in this case we seek to approximate it w.r.t. -norm using -degree univariate polynomials whose constant term vanishes. The resulting bound is dimension-free and improves upon other bounds for this setting (e.g. ) in that it applies to deterministic algorithms, as well as to stochastic algorithms (see A.3 for proof).
The iteration complexity of any oblivious (possibly stochastic) CLI for -smooth convex functions equipped with a first-order oracle, is bounded from below by
Lower Bound for Dual Regularized Loss Minimization with Linear Predictors
The form of functions (12) discussed in the previous section does not readily adapt to general RLM problems with linear predictors, i.e.,
such as SDCA and accelerated proximal SDCA, are not covered by Theorem 2. Accordingly, in this section, we address the iteration complexity of oblivious (possibly stochastic) CLI algorithms equipped with dual RLM oracles:
We state below the corresponding lower bound, whose proof, including logarithmic factors and constants, can be found in Appendix A.4.
The iteration complexity of oblivious (possibly stochastic) CLIs for RLM (28) equipped with dual RLM oracles (30) is bounded from below by
References
Appendix A Proofs
Proof According to the way generates iterates, we have
for some polynomial of degree at most . By Lemma 6, we have
Now, since is -strongly convex, we have,
Hence, by Lemma 12, the minimal number of iterations required to get an -optimal solution is at least
A.2 Proof of Theorem 2 - Finite Sums
where we put and . In words, is the set of all multivariate polynomials over indeterminates whose total degree (the maximal sum of the degrees over all terms) is less than or equal to . Lastly, given we define
This notation will come in handy in the main proof.
The lemma below describes the functional form assumed by iterates produced by oblivious CLIs.
When applied on (12) with suitable first-order and coordinate-descent oracles (as defined in 25), the coordinates of iterates produced by oblivious stochastic CLIs form multivariate polynomials in with random real coefficients whose total degree does not exceed the iteration number.
Proof Let be a oblivious stochastic CLI, and suppose we apply on the class of problems (12) parameterized by , using both first-order and coordinate-descent oracles as defined in 25. We use mathematical induction to show that for any , the coordinate of the ’th iterate produced by such process can be expressed as a distribution over multivariate polynomials in of degree at most .
For the inductive step, assume that any coordinate of can be expressed as a distribution over . It is easy to see that for any , the answers of both oracles,
form a distribution over , as all the random quantities involved in the expressions ( and ) do not depend on (due to obliviosity) and the rest of the terms ( and ) are either linear in or constants. Lastly, are computed by simply summing up all the oracle answers, and as such, form again distributions over .
Proof [Theorem 2] Let be a oblivious stochastic CLI. By Lemma 1 the first coordinate of (the point returned by the algorithm at the ’th iteration) when applied on the class of problems (12) distributes according to some distribution over . Thus, by Yao principle, since each polynomial in represents a single deterministic algorithm, we have
where denotes a distribution over which corresponds to first drawing at random, and then setting the coordinates of as follows
Furthermore, it is easy to verify that the corresponding minimizers of (12) are
where the first inequality follows by focusing on the first coordinate of . Now, set and note that
Thus, by Lemma 8 (using the same value for and noting that ) yields
where denotes the degree of . Plugging in this into Inequality (A.2) we get
Since is a decreasing and convex function for any , we have
where the last inequality is due to the fact that which implies that . Finally, we have,
where the first inequality follows by the -strong convexity of and the second inequality follows by Jensen inequality. Using Lemma 12, we get that the iteration complexity of is at least
This, together with Theorem 5 below, concludes the proof.
We bound from below the number of iterations required to obtain a non-trivial accuracy.
Let , let be a parameters vector whose all coordinates are and let be a parameters vector whose all coordinates are , except for the ’th coordinate which we set to be . If , then
where the last inequality follows from .
The iteration complexity of any stochastic optimization algorithm (not necessarily CLI) which gathers information on (with ) only by means of incremental oracles, i.e., oracles which upon receiving query return an answer which depends on not more than one individual function, is at least .
Proof Let be a stochastic optimization algorithm. According to Yao’s principle, we can bound from below the -optimality of after iterations by estimating the -optimality of any deterministic algorithm w.r.t. to distribution over defined by: draw and set to be or as defined in Lemma 2 w.p. . Then,
where the last inequality follows from Lemma 2. Thus, for sufficiently small , one must perform at least iterations in order to obtain an -optimal solution.
A.3 Proof of Theorem 3 - Smooth Functions
with a first-order oracle (as defined in 10 with ), the coordinates of iterates produced by oblivious stochastic CLIs whose is initialization iterate is form polynomials in with random real coefficients which vanishes at and whose degree does not exceed the iteration number.
Proof Let be a oblivious stochastic CLI, and suppose we apply on the class of problems (37) parameterized by , using a first-order. We use mathematical induction to show that for any , the coordinate of the ’th iterate produced by such process can be expressed as a distribution over .
As the first iterate is assumed to be zero, the base case is trivial. For the inductive step, assume that any coordinate of can be expressed as a distribution over . It is easy to see that for any , the answers of the first-order oracle,
form a distribution over , as the random quantities involved in the expressions ( and ) do no depend on (due to obliviosity) and the rest of the terms ( and ) are homogenous in . Lastly, are computed by simply summing up all the oracle answers, and as such, form again distributions over .
Proof [Theorem 3] Let be a oblivious stochastic CLI and let . Our derivation of lower bounds for stochastic CLIs is established via Yao principle. Fix some . By Lemma 3, distributes according to some distribution over . Thus, by Yao principle, since each polynomial in represents a single deterministic algorithm, we have
where (abbr. ) denotes a distribution over with a probability density function
where the first inequality follows by the fact that is -strongly convex and the second inequality follows by focusing on the first coordinate of . Invoking Lemma 9 yields
Thus, in this case the iteration complexity is bound from below by
A.4 Proof of Theorem 4 - Regularized Empirical Loss Minimization
In which case, the corresponding dual is:
Note that, all the eigenvalues of are bigger than 1. Therefore, is 1-strongly convex. We assume that the oracles at the algorithms’ disposal are the dual RLM oracles defined in (30),
Lastly, we will need the following definitions
to ease notation in subsequent proofs (where denotes the total degree of and is defined in (31)). Thus, contains -dimensional vectors whose entries are -multivariate polynomials expressions in , such that the sum of the degrees of the -polynomials does not exceed . In addition, given we define
As usual, we start by stating the functional form assumed by iterates produced by this sort of optimization algorithms.
When applied on (38) with a dual RLM oracle (as defined in 30), the coordinates of iterates produced by oblivious stochastic CLIs form multivariate polynomials expressions in with random coefficients, such that the sum of the degrees of these polynomials does not exceed the iteration number.
Proof Let be a oblivious stochastic CLI, and suppose we apply on the class of problems (38) parameterized by , using dual RLM oracles as defined in 30. We use mathematical induction to show that for any , the coordinate of the ’th iterate produced by such process can be expressed as a distribution over polynomial expressions in whose sum of degrees is less than or equal .
For the inductive step, assume that can be expressed as a distribution over . It is easy to see that for any , the answer of the dual RLM oracle
Let be a oblivious stochastic CLI. By Lemma 4 the coordinates of (the point returned by the algorithm at the ’th iteration) when applied on the class of problems (38) distributes according to some distribution over . Furthermore, it is easy to verify that the corresponding minimizers of (38) are
distributes according to some distribution over . Thus, by Yao principle, since each polynomial in represents a single deterministic algorithm, we have
where denotes a distribution over which corresponds to of first drawing at random, and then drawing according to distribution defined by the p.d.f. over (for we set ). We now have,
where the first inequality follows by focusing on the ’th coordinate of in each summand. Now, set and note that
Thus, by Lemma 8, using the same value for and noting that ) yields
where denotes the degree of . Plugging in this into Inequality (A.4) we get
Since is a decreasing and convex function for any , we have
where the last inequality is due to the fact that which implies that . Finally, we have,
where the first inequality follows by the -strong convexity of and the third inequality follows by Jensen inequality. Using Lemma 12, we get that the iteration complexity of is at least
Lastly, we bound from below the number of iterations required to obtain a non-trivial accuracy.
Let , let be a parameters vector whose all coordinates are and let be a parameters vector whose all coordinates are , except for the ’th coordinate which we set to be . Then
When applied on (38) ,the iteration complexity of oblivious stochastic CLI algorithms equipped with a dual RLM oracle is at least .
Proof Let be a stochastic optimization algorithm. By Lemma 4 the coordinates of (the point returned by the algorithm at the ’th iteration) when applied on the class of problems (38) distributes according to some distribution over . By Yao principle, since each polynomial in represents a single deterministic algorithm, we have
where denotes a distribution over which corresponds to the process of first drawing at random, and then set to be or as defined in Lemma 5 with equal probability. Now, for , there exists some such that does not depend on . This yields,
where the last inequality follows from Lemma 5. Thus, for sufficiently small , one must perform at least iterations in order to obtain an -optimal solution.
In the following section we analyze the best polynomial approximation of some functions w.r.t. , and norm, based on standard results from the approximation theory (see generally, Allan Pinkus. On L1-approximation, 1989; Theodore J Rivlin. An introduction to the approximation of functions, 2003; Ronald A DeVore and George G Lorentz. Constructive approximation, 1993; Naum Il’ich Akhiezer and Charles J Hyman. Theory of approximation. Translated by Charles J. Hyman. New York, 1956; Isidor Pavlovich Natanson. Constructive function theory, 1964).
where we used the fact that is invariant under pre-composition and post-composition with linear function in the second, fourth and fifth equalities. Now, since
combining Inequality 8 with Lemma 10, yields
Let denote the ’th second order Chebyshev polynomial, i.e.,
Proof We integrate by substituting ,
where the last equality is due to the fact that the integrand is an even function in . Lastly, since for any we have
and since for , it follows that
This, together with the case where ,
implies that all the terms in (A.5.2) vanish, thus concluding the proof.
Given (note that, here and are allowed to take negative values), we define
By substituting for , we get the following corollary.
We now use Corollary 1 to bound from below the best polynomial -approximation error w.r.t. over the interval .
Let . Then, for any we have
Proof First, note that the following two inequalities
Substituting for , yields
Now, plugging in the definition of (see (52)) and applying Lemma 11, we get
for any . Using this inequality with (A.5.2) where , yields
shows that this problem can be seen as a best -approximation for in the -dimensional space spanned by (accordingly, ). By [2, Equation (3), p. 16], we have
where is Gram matrix (whose entries are the inner products of its arguments). First, note that
It follows that both matrices can be expressed as a Cauchy matrices, that is
where and . The determinant of Cauchy matrix defined by sequences one has
To estimate the middle term, we apply arguments similar to the integral test for infinite series. First, note that,
Now, since for any , it holds that is a monotone decreasing function (over ), it holds that
Now, by the following standard inequality
A.6 Technical Lemmas
and . Therefore, by using identity (see Section F.31. in ), we get
where the last equality is due to Lemma 10.
Let , and . Then
and . Thus, we obtained the following inequality