A Lower Bound for the Optimization of Finite Sums

Alekh Agarwal, Leon Bottou

Introduction

Many machine learning setups lead to the minimization a convex function of the form

where X\mathcal{X} is a convex, compact set. When the functions gig_{i} are also convex, then the overall optimization problem is convex, and can in principle be solved using any off-the-shelf convex minimization procedure. In the machine learning literature, two primary techniques have typically been used to address such convex optimization problems. The first approach (called the batch approach) uses the ability to evaluate the function ff along with its gradients, Hessian etc. and applies first- and second-order methods to minimize the objective. The second approach (called the stochastic approach) interprets the average in Equation (1) as an expectation and uses stochastic gradient methods, randomly sampling a gig_{i} and using its gradient and Hessian information as unbiased estimates for those of the function ff.There is a body of literature that recognizes the ability of stochastic optimization to minimize testing error rather than training error in machine learning contexts (see e.g. Bottou and Bousquet, 2008), but we will focus on training error for this paper. Both these classes of algorithms have extensive literature on upper bounds for the complexities of specific methods. More fundamentally, there are also lower bound results on the minimum black-box complexity of the best-possible algorithm to solve convex minimization problems. In several broad problem classes, these lower bounds further coincide with the known upper bounds for specific methods, yielding a rather comprehensive general theory.

However, a recent line of work in the machine learning literature, recognizes that the specific problem (1) of interest has additional structure beyond a general convex minimization problem. For instance, the average in defining the function ff is over a fixed number nn of functions, whereas typical complexity results on stochastic optimization allow for the expectation to be with respect to a continuous random variable. Recent works (Le Roux et al., 2012; Shalev-Shwartz and Zhang, 2013; Johnson and Zhang, 2013) make further assumptions that the functions gig_{i} involved in this sum are smooth, and the function ff is of course strongly convex by construction. Under these conditions, the algorithms studied in these works have the following properties: (i) the cost of each iteration is identical to stochastic optimization methods, and (ii) the convergence rate of the method is linear.An optimization algorithm is linearly convergent if it reduces the sub-optimality by a constant factor at each iteration. The results are surprising since the existing lower bounds on stochastic optimization dictate that the error can decrease no faster than Ω(1/k)\Omega(1/k) after kk iterations under such assumptions (Nemirovsky and Yudin, 1983), leaving an exponential gap compared to these new results. It is of course not a contradiction due to the finite sum structure of the problem (1) (following the terminology of Bertsekas (2012), we will call the setup of optimizing a finite sum incremental optimization hereafter).

Given this recent and highly interesting line of work, it is natural to ask just how much better can one do in this model of minimizing finite sums. Put another way, can we specialize the existing lower bounds for stochastic or batch optimization, to yield results for this new family of functions. The aim of such a result would be to understand the fundamental limits on any possible algorithm for this family of problems, and whether better algorithms are possible at all than the existing ones. Answering such questions is the goal of this work. To this end, we define the Incremental First-order Oracle (IFO) complexity model, where an algorithm picks an index i{1,2,,n}i\in\{1,2,\ldots,n\} and a point xXx\in\mathcal{X} and the oracle returns gi(x)g_{i}^{\prime}(x). We consider the setting where each function gig_{i} is L-smooth (that is, it has L-Lipschitz continuous gradients). In this setting, we demonstrate that no method can achieve xKxfϵxf\|x_{K}-x^{*}_{f}\|\leq\epsilon\|x^{*}_{f}\| for all functions ff of the form (1), without performing K=\Omega\big{(}n+\sqrt{n{\left({{L}/{\mu}-1}\right)}}\,\log(1/\epsilon)\big{)} calls to the IFO. As we will discuss following this main result, this lower bound is not too far from upper bounds for IFO methods such as SAG, SVRG and SAGA (Schmidt et al., 2013; Johnson and Zhang, 2013; Defazio et al., 2014) whose iteration complexity is O((n+L/μ)log(1/ϵ))\mathcal{O}((n+L/\mu)\log(1/\epsilon)). Some dual coordinate methods such as ASDCA and SPDC (Shalev-Shwartz and Zhang, 2014; Zhang and Xiao, 2014) get even closer to the lower bound, but are not IFO algorithms. Overall, there is no method with a precisely matching upper bound on its complexity, meaning that there is further room for improving either the upper or the lower bounds for this class of problems.

Following the statement of our main result, we will also discuss the implications of these lower bounds for the typical machine learning problems that have inspired this line of work. In particular, we will demonstrate that caution is needed in comparing the results between the standard first-order and IFO complexity models, and worst-case guarantees in the IFO model might not adequately capture the performance of the resulting methods in typical machine learning settings. We will also demonstrate regimes in which different IFO methods as well as standard first-order methods have their strengths and weaknesses.

Recent work of Arjevani (2014) also studies the problem of lower bounds on smooth and strongly convex optimization methods, although their development focuses on certain restricted subclasses of first-order methods (which includes SDCA but not the accelerated variants, for instance). Discussion on the technical distinctions in the two works is presented following our main result.

As a prerequisite for our result, we need the result on black-box first-order complexity of minimizing smooth and strongly convex functions. We provide a self-contained proof of this result in our paper in Appendix A which might be of independent interest. In fact, we establish a slight variation on the original result, in order to help prove our main result. Our main result will invoke this construction multiple times to design each of the components gig_{i} in the optimization problem (1).

The remainder of this paper is organized as follows. The next section formally describes the complexity model and the structural assumptions. We then state the main result, followed by a discussion of consequences for typical machine learning problems. The proofs are deferred to the subsequent section, with the more technical details in the appendix.

Setup and main result

Let us begin by formally describing the class of functions we will study in this paper. Recall that a function gg is called LL-smooth, if it has LL-Lipschitz continuous gradients, that is

where \|\cdot\|_{*} is the norm dual to \|\cdot\|. In this paper, we will only concern ourselves with scenarios where X\mathcal{X} is a convex subset of a separable Hilbert space, with \|\cdot\| being the (self-dual) norm associated with the inner product. A function gg is called μ\mu-strongly convex if

Given these definitions, we now define the family of functions being studied in this paper.

Let Fnμ,L(Ω){{\cal F}^{\mu,L}_{n}}(\Omega) denote the class of all convex functions ff with the form (1), where each gig_{i} is (Lμ)(L-\mu)-smooth and convex.

Note that ff is μ\mu-strongly convex and LL-smooth by construction, and hence Fnμ,L(Ω)Sμ,L(Ω){{\cal F}^{\mu,L}_{n}}(\Omega)\subseteq{{\cal S}^{\mu,L}}(\Omega) where Sμ,L(Ω){{\cal S}^{\mu,L}}(\Omega) is the set of all μ\mu-strongly convex and LL-smooth functions. However, as we will see in the sequel, it can often be a much smaller subset, particularly when the smoothness of the global function is much better than that of the local functions. We now define a natural oracle for optimization of functions with this structure, along with admissible algorithms.

For a function fFnμ,L(Ω)f\in{{\cal F}^{\mu,L}_{n}}(\Omega), the Incremental First-order Oracle (IFO) takes as input a point xXx\in\mathcal{X} and index i{1,2,,n}i\in\{1,2,\ldots,n\} and returns the pair (gi(x),gi(x))(g_{i}(x),\,g^{\prime}_{i}(x)).

An optimization algorithm is an IFO algorithm if its specification does not depend on the cost function ff other than through calls to an IFO.

For instance, a standard gradient algorithm would take the current iterate xkx_{k}, and invoke the IFO with (xk,i)(x_{k},i) in turn with i={1,2,,n}i=\{1,2,\ldots,n\}, in order to assemble the gradient of ff. A stochastic gradient algorithm would take the current iterate xkx_{k} along with a randomly chosen index ii as inputs to IFO. Most interesting to our work, the recent SAG, SVRG and SAGA algorithms (Le Roux et al., 2012; Johnson and Zhang, 2013; Defazio et al., 2014) are IFO algorithms. On the other hand, dual coordinate ascent algorithms require access to the gradients of the conjugate of fif_{i}, and therefore are not IFO algorithms.

We now consider IFO algorithms that invoke the oracle KK times (at x0,,xK1x_{0},\ldots,x_{K-1}) and output an estimate xKx_{K} of the minimizer xfx^{*}_{f}. Our goal is to bound the smallest number of queries KK needed for any method to ensure an error xKxfϵxf\|x_{K}-x^{*}_{f}\|\leq\epsilon\|x^{*}_{f}\|, uniformly for all fFnμ,L(Ω)f\in{{\cal F}^{\mu,L}_{n}}(\Omega). This complexity result will depend on the ratio κ=L/μ\kappa=L/\mu which is analogous to the condition number that usually appears in complexity bounds for the optimization of smooth and strongly convex functions. Note that κ\kappa is strictly an upper bound on the condition number of ff, but also the best one in general given the structural information about fFnμ,L(Ω)f\in{{\cal F}^{\mu,L}_{n}}(\Omega).

In order to better interpret the result of the theorem, we state the following direct corollary which lower bounds the number of steps need to attain an accuracy of ϵxf\epsilon\|x^{*}_{f}\|.

The first term in the lower bound simply asserts that any optimization method needs to make at least one query per gig_{i}, in order to even see each component of ff which is clearly necessary. The second term, which is more important since it depends on the desired accuracy ϵ\epsilon, asserts that the problem becomes harder as the number of elements nn in the sum increases or as the problem conditioning worsens. Again, both these behaviors are qualitatively expected. Indeed as nn\rightarrow\infty, the finite sum approaches an integral, and the IFO becomes equivalent to a generic stochastic-first order oracle for ff, under the constraint that the stochastic gradients are also Lipschitz continuous. Due to Ω(1/ϵ)\Omega(1/\epsilon) complexity of stochastic strongly-convex optimization (with no dependence on nn), we do not expect the linear convergence of Corollary 1 to be valid as nn\rightarrow\infty. Also, we certainly expect the problem to get harder as the ratio L/μL/\mu degrades. Indeed if all the functions gig_{i} were identical, whence the IFO becomes equivalent to a standard first-order oracle, the optimization complexity similarly depends on Ω(κ1log(1/ϵ))\Omega(\sqrt{\kappa-1}\log(1/\epsilon)).

Comparison with the best known algorithms:

At least three algorithms recently developed for problem setting (1) offer complexity guarantees that are close to our lower bound. SAG, SVRG and SAGA (Le Roux et al., 2012; Johnson and Zhang, 2013; Defazio et al., 2014) all reach an optimization error ϵ\epsilon after less than O((n+κ)log(1/ϵ))\mathcal{O}((n+\kappa)\log(1/\epsilon)) calls to the oracle. There are two differences from our lower bound. The first term of nn multiplies the log(1/ϵ)\log(1/\epsilon) term in the upper bounds, and the condition number dependence is O(κ)O(\kappa) as opposed to O(nκ)O(\sqrt{n\kappa}). This suggests that there is room to either improve the lower bound, or for algorithms with a better complexity. As observed earlier, the ASDCA and SPDC methods (Shalev-Shwartz and Zhang, 2014; Zhang and Xiao, 2014) reach a closer upper bound of O((n+n(κ1))log(1/ϵ))\mathcal{O}((n+\sqrt{n(\kappa-1)})\log(1/\epsilon)), but these methods are not IFO algorithms.

Room for better lower bounds?

One natural question to ask is whether there is a natural way to improve the lower bound. As will become clear from the proof, a better lower bound is not possible for the hard problem instance which we construct. Indeed for the quadratic problem we construct, conjugate gradient descent can be used to solve the problem with a nearly matching upper bound. Hence there is no hope to improve the lower bounds without modifying the construction.

Consequences for optimization in machine learning

With all the relevant results in place now, we will compare the efficiency of the different available methods in the context of solving typical machine learning problems. Recall the definitions of the constants LL and μ\mu from before. In general, the full objective ff (1) has its own smoothness and strong convexity constants, which need not be the same as LL and μ\mu. To that end, we define LfL_{f} to be the smoothness constant of ff, and μf\mu_{f} to the strong convexity of ff. It is immediately seen that LL provides an upper bound on LfL_{f}, while μ\mu provides a lower bound on μf\mu_{f}.

In order to provide a meaningful comparison for incremental as well as batch methods, we follow Zhang and Xiao (2014) and compare the methods in terms of their batch complexity, that is, how many times one needs to perform nn calls to the IFO in order to ensure that the optimization error for the function ff is smaller than ϵ\epsilon. When defining batch complexity, Zhang and Xiao (2014) observed that the incremental and batch methods have dependence on LL versus LfL_{f}, but did not consider the different strong convexities that play a part for different algorithms. In this section, we also include the dual coordinate methods in our comparison since they are computationally interesting for the problem (1) even though they are not admissible in the IFO model. Doing so, the batch complexities can be summarized as in Table 1.

Based on the table, we see two main points of difference. First, the incremental methods rely on the smoothness of the individual components. That this is unavoidable is clear, since even the worst case lower bound of Theorem 1 depends on LL and not LfL_{f}. As Zhang and Xiao (2014) observe, LfL_{f} can in general be much smaller than LL. They attempt to address the problem to some extent by using non-uniform sampling, thereby making sure that the best of the gig_{i} and the worst of the gig_{i} have a similar smoothness constant under the reweighing. This does not fully bridge the gap between LL and LfL_{f} as we will show next. However, more striking is the difference in the lower curvature across methods. To the best of our knowledge, all the existing analyses of coordinate ascent require a clear isolation of strong convexity, as in the function definition (1). These methods then rely on using μ\mu as an estimate of the curvature of ff, and cannot adapt to any additional curvature when μf\mu_{f} is much larger than μ\mu. Our next example shows this can be a serious concern for many machine learning problems.

In order to simplify the following discussion we restrict ourselves to perhaps the most basic machine learning optimization problem, the regularized least-squares regression:

In order to follow the development of Table 1 for SAG and AGM, we need to evaluate the constants μf\mu_{f} and LfL_{f}. Note that in this special case, the constants LfL_{f} and μf\mu_{f} are given by the upper and lower eigenvalues respectively of the matrix μI+Σ^\mu I+{\hat{\Sigma}}, where Σ^=i=1naiai/n{\hat{\Sigma}}=\sum_{i=1}^{n}a_{i}a_{i}^{\top}/n represents the empirical covariance matrix. In order to understand the scaling of this empirical covariance matrix, we shall invoke standard results on matrix concentration.

Equation (5.26) in (Vershynin, 2012) then implies that there are universal constants cc and CC such that the following inequality holds with probability 1δ1-\delta :

Let us weaken the above inequality slightly to use μ+Σ\mu+\|\Sigma\| instead of Σ\|\Sigma\| in the bound, which is minor since we typically expect μλmax\mu\ll{\lambda_{\max}} for statistical consistency. Then assuming we have enough samples to ensure that

we obtain the following bounds on the eigenvalues of the sample covariance matrix :

Using these estimates in the bounds of Table 1 gives

Table 2 compares the three methods under assumption (4) depending on the growth of κ\kappa.

Problems with large κ𝜅\kappa:

In this setting, the coordinate ascent methods seem to be at a disadvantage, because the average loss term provides additional strong convexity, which is exploited by both SAG and AGM, but not by ASDCA or SPDC methods. Indeed, we find that the complexity term ΓASDCA\Gamma_{\rm ASDCA} can be made arbitrarily large as κi\kappa_{i} grows large. However, the contraction factors for both SAG and AGM do not grow with nn in this setting, leading to a large gap between the complexities. Between SAG and AGM, we conclude that SAG has a better bound when the population problem is poorly conditioned.

High-dimensional settings (𝒏/𝒅≪𝟏much-less-than𝒏𝒅1n/d\ll 1) :

In this setting, the global strong convexity can not really be larger than μ\mu for the function (2), since the Hessian of the averaged loss has a non-trivial null space. It would appear then, that SAG is forced to use the same problem dependent constants as ASDCA/SPDC, while AGM gets no added benefit in strong convexity either. However, in such high-dimensional problems, one is often enforcing a low-dimensional structure in machine learning settings for generalization. In such structures, the global Hessian matrix can still satisfy restricted versions of strong convexity and smoothness conditions, which are often sufficient for batch optimization methods to succeed (Agarwal et al., 2012). In such situations, the comparison might once again resemble that of Table 2, and we leave such development to the reader.

In a nutshell, the superiority of incremental algorithms for the optimization of training error in machine learning is far more subtle than suggested by their worst case bounds. Among the incremental algorithms, SAG has favorable complexity results in all regimes despite the fact that both ASDCA and SPDC offer better worst case bounds. This is largely due to the adaptivity of SAG to the curvature of the problem. This might also explain in some part the empirical observation of Schmidt et al. (2013), who find that on some datasets SDCA (without acceleration) performed significantly poorly compared with other methods (see Figure 2 in their paper for details). Finally, we also observe that SAG does indeed improve upon the complexity of AGM after taking the different problem dependent constants into account, when the population problem is ill-conditioned and the data are appropriately bounded.

It is worth observing that all our comparisons are ignoring constants, and in some cases logarithmic factors, which of course play a role in the running time of the algorithms in practice. Note also that the worst case bounds for the incremental methods account for the worst possible choice of the nn functions in the sum. Better results might be possible when they are based on i.i.d. random data. Such results would be of great interest for machine learning.

Proof of main result

where sis_{i} is assumed to be zero when ii is greater than the size of the family.

Using the above notation, we first establish some simple identities for the operators QiQ_{i} defined above.

Simple calculus yields the following identities:

We now define the family of separable functions that will be used to establish our lower bound.

2 Decoupling the optimization across components

We would like to assert that the separable structure of ff allows us to reason about optimizing its components separately. Since the hih_{i} are not strongly convex by themselves, we first rewrite ff as a sum of separated strongly convex functions. Using Lemma 1,

By construction, the functions fif_{i} belong to Snμ,Lμ+nμ{\cal S}^{n\mu,L-\mu+n\mu} and are applied to disjoint subsets of the xx coordinates. Therefore, when the function is known to have form (7), problem (1) can be written as

Any algorithm that solves optimization problem (1) therefore implicitly solves all the problems listed in (8).

We are almost done, but for one minor detail. Note that we want to obtain a lower bound where the IFO is invoked for a pair (i,x)(i,x) and responds with hi(Qix)h_{i}(Q_{i}^{\top}x) and hi(Qix)/x\partial h_{i}(Q_{i}^{\top}x)/\partial x. In order to claim that this suffices to optimize each fif_{i} separately, we need to argue that a first-order oracle for fif_{i} can be obtained from this information, knowing solely the structure of ff and not the functions hih_{i}. Since the strong convexity constant μ\mu is assumed to be known to the algorithm, the additional (nμ/2)x2(n\mu/2)\|x\|^{2} in defining fif_{i} is also known to the algorithm. As a result, given an IFO for ff, we can construct a first-order oracle for any of the fif_{i} by simply returning hi(Qix)+(nμ/2)Qix2h_{i}(Q_{i}^{\top}x)+(n\mu/2)\|Q_{i}^{\top}x\|^{2} and hi(Qix)/x+nμQiQix)\partial h_{i}(Q_{i}^{\top}x)/\partial x+n\mu Q_{i}Q_{i}^{\top}x). Furthermore, an IFO invoked with the index ii reveals no information about fjf_{j} for any other jj based on the separable nature of our problem. Hence, the IFO for ff offers no additional information beyond having a standard first-order oracle for each fif_{i}.

3 Proof of Theorem 1

Based on the discussion above, we can pick any i{1n}i\in\{1\dots n\} and view our algorithm as a complicated setup whose sole purpose is to optimize function fiSnμ,Lμ+nμf_{i}\in{\cal S}^{n\mu,L-\mu+n\mu}. Indeed, given the output xKx_{K} of an algorithm using an IFO for the function ff, we can declare xKi=QixKx^{i}_{K}=Q_{i}^{\top}x_{K} as our estimate for xix^{*}_{i}. Lemma 1 then yields

In order to establish the theorem, we now invoke the classical result on the black-box optimization of functions using a first-order oracle. The specific form of the result stated here is proved in Appendix A.

At a high-level, our oracle will make an independent choice of one of the functions that witness the lower bound in Theorem 2 for each fif_{i}. At a high-level, each function fif_{i} will be chosen to be a quadratic with an appropriate covariance structure such that KiK_{i} queries to the function fif_{i} result in the estimation of at most Ki+1K_{i}+1 coordinates of xix_{i}^{*}. By ensuring that the remaining entries still have a substantial norm, a lower bound for such functions is immediate. The precise details on the construction of these functions can be found in Appendix A.The main difference with the original result of Nemirovsky and Yudin is the dependence on γ\gamma instead of x0xf\|x_{0}-x^{*}_{f}\|. This is quite convenient in our setting, since it eliminates any possible interaction amongst the starting values of different coordinates for the different functions fif_{i}.

Suppose the IFO is invoked KiK_{i} times on each index ii, with K=K1+K2++KnK=K_{1}+K_{2}+\ldots+K_{n}. We first establish the theorem for the case K<nK<n in which the algorithm cannot query each functions fif_{i} at least once. After receiving the response xKx_{K}, we are still free to arbitrarily choose fif_{i} for any index ii that was never queried. No non-trivial accuracy is possible in this case.

Proof Let us execute the algorithm assuming that all the fif_{i} are equal to the function ff of Theorem 2 that attains the lower bound with γ=0\gamma=0. Since K<nK<n, there is at least one function fjf_{j} for which Kj=0K_{j}=0. Since the IFO has not revealed anything about this function, we can construct function ff by redefining function fjf_{j} to ensure that xKjxjxj=γj\|x^{j}_{K}-x^{*}_{j}\|\geq\|x^{*}_{j}\|=\gamma_{j}. Since xjx^{*}_{j} is the only part of xx^{*} which is non-zero, we also get γj=γ\gamma_{j}=\gamma.

We can now assume without loss of generality that Ki>0K_{i}>0 for each ii. Appealing to Theorem 2 for each fif_{i} in turn,

where the last inequality results from Jensen’s inequality applied to the convex function q4αq^{4\alpha} for α1\alpha\geq 1. Finally, since the oracle has no way to discriminate amongst the γi\gamma_{i} values when Ki>0K_{i}>0, it will end up setting γi=γ/n\gamma_{i}=\gamma/\sqrt{n}. With this setting, we now obtain the lower bound

for K>nK>n, along with xKxf2γ2\|x_{K}-x^{*}_{f}\|^{2}\geq\gamma^{2} for K<nK<n.

This completes the proof of the Theorem. In order to further establish Corollary 1, we need an additional technical lemma.

x>1 ,  log(x1x+1)>2x1 .\displaystyle\forall x>1~{},~{}~{}\log{\left({\frac{\sqrt{x}-1}{\sqrt{x}+1}}\right)}\>>\>\frac{-2}{\sqrt{x-1}}~{}.

Proof The function ϕ(x)=log(x1x+1)+2x1\phi(x)=\log{\left({\frac{\sqrt{x}-1}{\sqrt{x}+1}}\right)}+\frac{2}{\sqrt{x-1}} is continuous and decreasing on (1,+)(1,+\infty) because

The result follows because limxϕ(x)=0\lim_{x\rightarrow\infty}\phi(x)=0.

Now we observe that we have at least nn queries due to the precondition ϵ<1\epsilon<1 and Proposition 2, which yields the first term in the lower bound. Based on Theorem 1 and this lemma, the corollary is now immediate.

Discussion

The results in this paper were motivated by recent results and optimism on exploiting the structure of minimizing finite sums, a problem which routinely arises in machine learning. Our main result provides a lower bound on the limits of gains that might be possible in this setting, allowing us to do a more careful comparison of this setting with regular first-order black box complexity results. As discussed in Section 3, the results seem mixed when the sum consists of nn functions based on random data drawn i.i.d. from a distribution. In this statistical setting, we find that the worst-case near-optimal methods like ASDCA can often be much worse than other methods like SAG and SVRG. However, IFO methods like SAG certainly improve upon optimal first-order methods agnostic of the finite sum structure, in ill-conditioned problems. In general, we observe that the problem dependent constants that appear in different methods can be quite different, even though this is not always recognized. We believe that accounting for these opportunities might open door to more interesting algorithms and analysis.

Of course, there is another and a possibly more important aspect of optimization in machine learning which we do not study in this paper. In typical machine learning problems, the goal of optimization is not just to minimize the objective ff—usually called the training error—to a numerical precision. In most problems, we eventually want to reason about test error, that is the accuracy of the predictions we make on unseen data. There are existing results (Bottou and Bousquet, 2008) which highlight the optimality of single-pass stochastic gradient optimization methods, when test error and not training error is taken into consideration. So far, we do not have any clear results comparing the efficacy of methods designed for the problem (1) in minimizing test error directly. We believe this is an important question for future research, and one that will perhaps be most crucial for the adoption of these methods in machine learning.

We believe that there are some important open questions for future works in this area, which we will conclude with:

Is there a fundamental gap between the best IFO methods and the dual coordinate methods in the achievable upper bounds? Or is there room to improve the upper bounds on the existing IFO methods. We certainly found it tricky to do the latter in our own attempts.

Is it possible to obtain better complexity upper bounds when the nn functions involved in the sum (1) are based on random data, rather than being nn arbitrary functions? Can the incremental methods exploit global rather than local smoothness properties in this setting?

What are the test error properties of incremental methods for machine learning problems? Specifically, can one do better than just adding up the optimization and generalization errors, and follow a more direct approach as the stochastic optimization literature?

Acknowledgements

We would like to thank Lin Xiao, Sham Kakade and Rong Ge for helpful discussions regarding the complexities of various methods. We also thank the anonymous reviewer who pointed out that the dual coordinate are not valid IFO algorithms.

References

A Optimization of a strongly convex smooth functions

The most accessible derivation of this classic lower bound Nesterov (2004) relies on the simplifying assumption that the successive points xkx_{k} lie in the span of the gradients previously returned by the oracle. This section provides a derivation of the lower bound that does not rely on this assumption and is critical for Theorem 1 where no such assumptions are made.

This section considers algorithms that produces an approximate solution of the optimization problem

using, as sole knowledge of function ff, an oracle that returns the value f(x)f(x) and the gradient f(x)f^{\prime}(x) on points successively determined by the algorithm. Note that this writing of ff is without loss of generality, since any μ\mu-strongly convex function can be written in the form (9) where gg is convex.

We could equivalently consider an oracle that reveals g(xk)g(x_{k}) and g(xk)g^{\prime}(x_{k}) instead of f(xk)f(x_{k}) and f(xk)f^{\prime}(x_{k}) because these quantities can be computed from each other (since μ\mu is known.)

Consider an algorithm that calls the oracle on K>1K>1 successive points x0,,xK1x_{0},\dots,x_{K-1}. The first part of the proof describes how to pick the best possible xKx_{K} on the basis of the oracle answers and the algorithm’s queries.

Stated differently, we know that any point xx can be decomposed into ΠP(x)\Pi_{P}(x) and ΠP(x)\Pi_{P^{\perp}}(x) such that x=ΠP(x)+ΠP(x)x=\Pi_{P}(x)+\Pi_{P^{\perp}}(x). Then the above definition yields MP(x)=ΠP(x)ΠP(x)M_{P}(x)=\Pi_{P}(x)-\Pi_{P^{\perp}}(x), which is the natural reflection of xx with respect to the subspace PP.

The set HγH_{\gamma} is symmetric with respect to PP.

Proof Consider an arbitrary point xhHγx^{*}_{h}\in H_{\gamma} which minimizes a function hGγh\in{\cal G}_{\gamma}. Since function xh(MP(x))x\mapsto h(M_{P}(x)) also belongs to Gγ{\cal G}_{\gamma}, its minimum Mp(xh)M_{p}(x^{*}_{h}) also belongs to HγH_{\gamma}.

The center of the smallest ball enclosing HγH_{\gamma} belongs to PP.

We are now in a position to present the main ingredient of our proof that allows us to state a more general result than Nesterov. In particular, we demonstrate that the assumption made by Nesterov about the iterates lying in the span of previous gradients can be made almost without loss of generality. The key distinction is that we can only make it on the step KK, where the algorithm is constrained to produce a good answer, while Nesterov assumes it on all iterates, somewhat restricting the class of admissible algorithms.

Proof Consider an algorithm B{\sf B} that first runs algorithm A{\sf A} and then returns the center of the smallest ball enclosing HγfH_{\gamma}^{f} as xKB(f)x^{\sf B}_{K}(f). Corollary 2 ensures that xKB(f)x^{\sf B}_{K}(f) belongs to the posited span. This choice of xKB(f)x^{\sf B}_{K}(f) also ensures that supxˉHγfxKB(f)xˉsupxˉHγfxKA(f)xˉ\sup_{\bar{x}\in H_{\gamma}^{f}}\|x^{\sf B}_{K}(f)-\bar{x}\|\leq\sup_{\bar{x}\in H_{\gamma}^{f}}\|x^{\sf A}_{K}(f)-\bar{x}\|. Equivalently, supgGγfxKB(g)xgsupgGγfxKA(g)xg\sup_{g\in G_{\gamma}^{f}}\|x^{\sf B}_{K}(g)-x^{*}_{g}\|\leq\sup_{g\in G_{\gamma}^{f}}\|x^{\sf A}_{K}(g)-x^{*}_{g}\|, where we use the fact that xKB(g)=xKB(f)x^{\sf B}_{K}(g)=x^{\sf B}_{K}(f) and xKA(g)=xKA(f)x^{\sf A}_{K}(g)=x^{\sf A}_{K}(f) because function gGγfg\in G_{\gamma}^{f} coincides with ff on x0xK1x_{0}\dots x_{K-1}. Therefore,

A.2 Construction of a resisting oracle

We start by defining the basic structure of the function which will be used by our oracle to construct hard problem instances. This structure is identical to that used by Nesterov.

Fix ρ>0\rho>0 and let Nμ,LN_{\mu,L} denote the function

Proof The assertions μINμ,LLI\mu I\preceq N^{\prime\prime}_{\mu,L}\preceq LI and Nμ,L(xN)=0N^{\prime}_{\mu,L}(x^{*}_{N})=0 follow from direct calculation, as shown in Nesterov (2004, p. 67).

We can arbitrarily choose the value of xN\|x^{*}_{N}\| by appropriately selecting ρ\rho.

We also need some other properties of the function, which are also present in Nesterov’s analysis.

Proof Through a direct calculation, it is easy to verify that

The statement directly follows from this.

We now recall our earlier definition of the matrix notation for orthonormal families in Definition 5. The resisting oracle we construct will apply the function Nμ,LN_{\mu,L} to appropriately rotated versions of the point xx, that is, it constructs functions of the form Nμ,L(Sx)N_{\mu,L}(S^{\top}x), where the orthonormal operators SS will be constructed appropriately to ensure that the optimal solution is sufficiently far away from the span of algorithm’s queries and the oracle’s responses. Before we define the oracle, we need to define the relevant orthogonalization operations.

Let S1S_{-1} be an empty family of vectors. Each call k=0K ⁣ ⁣1k=0\dots K\!-\!1 of the resisting oracle performs the following computations and returns yk=fk(xk)y_{k}=f_{k}(x_{k}) and gk=fk(xk)g_{k}=f^{\prime}_{k}(x_{k}).

The resisting oracle satisfies the following properties:

 i<kyi=fk(xi)gi=fk(xi)\forall~{}i<k\quad y_{i}=f_{k}(x_{i})\quad g_{i}=f^{\prime}_{k}(x_{i}) .

A.3 Proof of Theorem 2

The minimum xx^{*} of function fK1f_{K-1} satisfies

Meanwhile, equation (13) and Proposition 4 imply that (SˉK1)x=(ρqi)i=1(\bar{S}_{K-1})^{\top}x^{*}=(\rho\,q^{i})_{i=1}^{\infty}. Therefore

\mboxxx2=(SˉK1)x(SˉK1)x2  i=2K+1(ρqi0)2=q4Ki=1(ρqi)2=q4Kx2 .\displaystyle\mbox{\quad}\|x^{*}-x\|^{2}=\|(\bar{S}_{K-1})^{\top}x^{*}-(\bar{S}_{K-1})^{\top}x\|^{2}~{}\geq~{}\sum_{i=2K+1}^{\infty}(\rho\,q^{i}-0)^{2}=q^{4K}\sum_{i=1}^{\infty}(\rho q^{i})^{2}=q^{4K}\|x^{*}\|^{2}~{}.